Parallelism (2) & Heterogeneous Computing & Future Perspectives
Hung-Wei Tseng
Modern processor pipelines
[Figure 2-1 from Intel's documentation of Intel 64 and IA-32 processor architectures: CPU Core Pipeline Functionality of the Skylake Microarchitecture. The diagram shows the 32K L1 instruction cache and branch prediction unit (BPU) feeding the allocate/rename/retire/move-elimination/zero-idiom logic and the scheduler, which issues to ports 0-7 (integer ALUs, vector FMA/MUL/Add/ALU/shuffle/shift units, LEA, divide, branch, load/store-address, and store-data units), backed by the 32K L1 data cache and 256K L2 cache.]
2
Super-scalar processors with OoO
Superscalar: Modern processors provide multiple pipelines/functional units so that more instructions can be in flight at the same time, exposing instruction-level parallelism
  Fetch multiple instructions at the same time
  Execute them at the same time
Out-of-order execution: Modern processors dynamically extract the data flow among instructions to achieve better instruction-level parallelism
  Instructions without data dependencies can be executed in parallel (see the sketch below)
  Buffer the results and commit them only after all previous instructions have finished
If you're interested, we will discuss more about this in CSC456 (Spring 2018, planned)
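As a concrete illustration (a minimal C sketch, not from the slides; the variable names are made up), consider the data dependencies in the fragment below: an out-of-order superscalar core can overlap the two independent additions, but the multiply must wait for both of them.

/* a, b, c, d are assumed to be previously initialized int variables */
int x = a + b;    /* independent of the next statement              */
int y = c + d;    /* no dependency on x: can execute in parallel    */
int z = x * y;    /* depends on both x and y: must wait (data flow) */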
3
Simplified SuperScalar+OOO pipeline
[Block diagram:] Instruction Fetch → Instruction Decode → Register renaming logic → Instruction Schedule → Execution Units → Data Memory → Reorder Buffer/Commit
  Fetch a bunch of instructions at a time
  Schedule instructions whenever their inputs and target functional units are ready
4
Pentium 4 vs. Athlon 64
Application: 80% ALU, 20% branch, 90% branch prediction accuracy, no data dependencies, perfect cache. Consider the two machines:
Pentium 4 with 20 pipeline stages, branch resolved in stage 19, running at 3 GHz
  Stages 1-2: TC Nxt IP, 3-4: TC Fetch, 5: Drive, 6: Alloc, 7-8: Rename, 9: Que, 10-12: Sch, 13-14: Disp, 15-16: RF, 17: Ex, 18: Flgs, 19: Br Ck, 20: Drive
[Pipeline diagram of an SMT processor: per-thread instruction fetch (Fetch: T0-T3) and per-thread reorder buffers (ROB: T0-T3), sharing the instruction decode, register renaming logic, instruction schedule, execution units, and data cache.]
6
Simultaneous Multi-Threading (SMT)
Fetch instructions from different threads/processes to fill the pipeline
Exploit thread-level parallelism (TLP) to solve the problem of insufficient ILP in a single thread
  Each thread is a separate scheduling entity in the OS
Known as Hyper-Threading in Intel processors
Keep separate architectural state for each thread
  PC
  Register files
  Reorder buffer
Creates an illusion of multiple processors for the OS
The rest of the superscalar processor hardware is shared
Invented by Dean Tullsen, Hung-Wei's PhD advisor
7
SMT
How many of the following about SMT are correct?
SMT makes processors with deep pipelines more tolerant of mis-predicted branches
SMT can improve the latency of a single-threaded application
  (It can actually hurt, because the thread shares resources with other threads.)
SMT processors can better utilize hardware during cache misses compared with superscalar processors of the same issue width
SMT processors can have higher cache miss rates compared with superscalar processors with the same cache sizes when executing the same set of applications
A. 0
B. 1
C. 2
D. 3
E. 4
8
Announcement
Check your grades as soon as possible
Homework #4 due 4/26 (Wednesday)
Final review this Wednesday
You may be getting some surprise!
Project #3 is up. Due 5/1 (Monday)
Project #2: you may revise and resubmit before 5/1 with a 10% penalty
Final exam: 5/10, 8am
ClassEval
9
Outline
Thread-level parallelism & multithreaded processors
Brief introduction to parallel computing
Heterogeneous computing and future perspectives
10
Chip-multiprocessor
11
A wide-issue processor or
multiple narrower-issue processors?
What can you do within a 21 mm * 21 mm die area?
[Floorplans compared side by side. Figure 2: floorplan for the six-issue dynamic superscalar microprocessor; Figure 3: floorplan for the four-way single-chip multiprocessor. Each floorplan shows the I-caches, D-caches, on-chip L2 cache, external interface, clocking & pads, and the fetch/decode/rename/out-of-order logic.]
A 6-issue superscalar processor: 3 integer ALUs, 3 floating-point ALUs, 3 load/store units
Four 2-issue superscalar processors: 4x1 integer ALUs, 4x1 floating-point ALUs, 4x1 load/store units
You will have more ALUs if you choose the multiprocessor!
[Excerpt from the accompanying paper: each 2-issue core is less than one-fourth the size of the 6-way SS processor, as shown in Table 3. The number of execution units actually increases in the MP, because the 6-way processor had three units of each type while the 4-way MP must have four, one for each CPU. The 6-way execution engine also requires a much higher instruction fetch bandwidth than the 2-way processors used in the MP architecture to keep instruction fetch from becoming a bottleneck.]
12
Die photo of a CMP processor
13
CMP advantages
How many of the following are advantages of CMP over a traditional superscalar processor?
CMP can exploit thread-level parallelism
CMP can deliver better instruction throughput within the same die area (chip size)
CMP can achieve better ILP for each running thread
  (Not really; this depends on the single-core microarchitecture.)
CMP can improve the performance of a single-threaded application without modifying code
  (Not really; a single-threaded program cannot use functional units from other processor cores.)
A. 0
B. 1
C. 2
D. 3
E. 4
14
Speeding up a single application on multithreaded processors
15
Parallel programming
To exploit CMP/SMT parallelism, you need to break your computation into multiple processes or multiple threads
Processes (in OS/software systems)
  Separate programs actually running (not sitting idle) on your computer at the same time
  Each process has its own virtual memory space, and you need to explicitly exchange data using inter-process communication (IPC) APIs
Threads (in OS/software systems)
  Independent portions of your program that can run in parallel
  All threads share the same virtual memory space
We will refer to these collectively as threads
A typical user system might have 1-8 actively running threads. Servers can have more if needed (the sysadmins will hopefully configure it that way)
16
Create threads/processes
The only way we can improve a single application's performance on CMP/SMT
You can use fork() to create a child process (a minimal sketch follows below)
Or you can use pthreads or OpenMP to compose multi-threaded programs
  Threads share the same memory space with each other

/* Do matrix multiplication */
for(i = 0 ; i < NUM_OF_THREADS ; i++)
{
    tids[i] = i;
    /* Spawn a thread */
    pthread_create(&thread[i], NULL, threaded_blockmm, &tids[i]);
}
for(i = 0 ; i < NUM_OF_THREADS ; i++)
    /* Synchronize: wait for a thread to terminate */
    pthread_join(thread[i], NULL);
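For the fork() route mentioned above, here is a minimal sketch (not from the slides; the child's "work" is a hypothetical placeholder) of creating a child process and waiting for it:

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();                /* create a child process */
    if (pid == 0) {
        /* child: runs in its own virtual memory space */
        printf("child %d doing part of the work\n", (int)getpid());
        exit(0);
    }
    /* parent: wait for the child to terminate */
    waitpid(pid, NULL, 0);
    printf("parent %d done\n", (int)getpid());
    return 0;
}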
17
Supporting shared memory model
Provide a single memory space that all processors
can share
All threads within the same program share the same address space
Threads communicate with each other using shared variables in memory
Provide the same memory abstraction as single-threaded programming
18
Simple idea...
Connect all processors and the shared memory to a bus
Processor speed will be slow because all devices on a bus must run at the same speed
[Diagram: processors and a shared cache ("Shared $") connected to a common bus]
19
Memory hierarchy on CMP
Each processor has its own local cache
[Diagram: Core 0 through Core 3, each with a local cache ("Local $"), connected over a bus to a shared cache ("Shared $")]
20
Cache on Multiprocessor
Coherency
  Guarantees that all processors see the same value for a variable/memory address when they need the value at the same time
  Defines what value should be seen
Consistency
  All threads see changes to data in the same order (see the sketch below)
  Defines when a memory operation should be visible
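To make the distinction concrete, here is a minimal sketch (an illustration, not from the slides; producer/consumer are hypothetical names and the code deliberately has no synchronization) of the classic flag/data pattern whose outcome depends on the consistency model:

#include <stdio.h>

int data = 0;   /* shared */
int flag = 0;   /* shared */

void producer(void)          /* thread 1 */
{
    data = 42;               /* store the data first           */
    flag = 1;                /* then publish it via the flag   */
}

void consumer(void)          /* thread 2 */
{
    while (flag == 0)
        ;                    /* wait until the flag is set     */
    /* Coherency decides which value of data is seen; consistency decides
       whether the store to data is guaranteed to become visible before the
       store to flag. If the two stores are reordered, this may print 0.   */
    printf("%d\n", data);
}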
21
What will each thread see?
Assuming that we are running the following code on
a CMP with some cache coherency protocol, which
output is NOT possible? (a is initialized to 0)
thread 1:  while(1) printf("%d ", a);
thread 2:  while(1) a++;
A. 0 1 2 3 4 5 6 7 8 9
B. 1 2 5 9 3 6 8 10 12 13
C. 1 1 1 1 1 1 1 1 1 1
D. 1 1 1 1 1 1 1 1 1 100
22
It's show time!
Demo!
thread 1:  while(1) printf("%d ", a);
thread 2:  while(1) a++;
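A self-contained version of the demo (a minimal sketch, assuming the two infinite loops from the slide are wrapped in pthreads; the function names reader/writer are made up):

#include <pthread.h>
#include <stdio.h>

int a = 0;                      /* shared counter, initialized to 0 */

void *reader(void *arg)         /* thread 1: keeps printing a       */
{
    while (1)
        printf("%d ", a);
    return NULL;
}

void *writer(void *arg)         /* thread 2: keeps incrementing a   */
{
    while (1)
        a++;
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, reader, NULL);
    pthread_create(&t2, NULL, writer, NULL);
    pthread_join(t1, NULL);     /* never returns; both loops run forever */
    pthread_join(t2, NULL);
    return 0;
}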
23
Cache coherency
Assuming that we are running the following code on
a CMP with some cache coherency protocol, which
output is NOT possible? (a is initialized to 0)
thread 1:  while(1) printf("%d ", a);
thread 2:  while(1) a++;
A. 0 1 2 3 4 5 6 7 8 9
B. 1 2 5 9 3 6 8 10 12 13
C. 1 1 1 1 1 1 1 1 1 1
D. 1 1 1 1 1 1 1 1 1 100
24
Hard to debug
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

thread 1:

int loop;
void* modifyloop(void *x);      /* defined below (thread 2) */

int main()
{
    pthread_t thread;
    loop = 1;
    pthread_create(&thread, NULL, modifyloop, NULL);
    while(loop == 1)
    {
        continue;
    }
    pthread_join(thread, NULL);
    fprintf(stderr, "User input: %d\n", loop);
    return 0;
}

thread 2:

void* modifyloop(void *x)
{
    sleep(1);
    printf("Please input a number:\n");
    scanf("%d", &loop);
    return NULL;
}
25
Hard to debug
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

thread 1:

volatile int loop;              /* volatile makes sure the compiled code
                                   re-reads loop from memory every time  */
void* modifyloop(void *x);      /* defined below (thread 2) */

int main()
{
    pthread_t thread;
    loop = 1;
    pthread_create(&thread, NULL, modifyloop, NULL);
    while(loop == 1)
    {
        continue;
    }
    pthread_join(thread, NULL);
    fprintf(stderr, "User input: %d\n", loop);
    return 0;
}

thread 2:

void* modifyloop(void *x)
{
    sleep(1);
    printf("Please input a number:\n");
    scanf("%d", &loop);
    return NULL;
}
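Why the non-volatile version can hang: a sketch (an illustration, assuming an optimizing build such as -O2; the temporary variable is hypothetical) of the transformation the compiler is allowed to make when loop is not volatile:

/* The compiler may load loop once and keep it in a register,
   so the spin loop effectively becomes:                       */
int cached = loop;          /* single load from memory         */
while (cached == 1)
{
    continue;               /* never re-reads loop, so the update
                               made by thread 2 is never observed */
}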
26
void* update_array(void *x)
{
27
How to make it work?
You need these instructions to make it work:

volatile int lock = 0;

inline void atomic_increment(volatile int *pw) {
    asm(
        "lock\n\t"
        "incl %0\n\t"
        :
        "=m"(*pw):      // output (%0)
        "m"(*pw):       // input (%1)
        "cc"            // clobbers
    );
}

inline int atomic_compare_and_exchange(int requiredOldValue, volatile int* _ptr, int newValue, int sizeOfValue) {
    int old;
    __asm volatile
    (
        "mov %3, %%eax;\n\t"
        "lock\n\t"
        "cmpxchg %4, %0\n\t"
        "mov %%eax, %1\n\t"
        :
        "=m" ( *_ptr ), "=r" ( old ):                              // outputs (%0, %1)
        "m" ( *_ptr ), "r" ( requiredOldValue ), "r" ( newValue ): // inputs (%2, %3, %4)
        "memory", "eax", "cc"                                      // clobbers
    );
    return old;
}
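One way these primitives can be used (a minimal sketch, not from the slides; the spin_lock/spin_unlock helpers are hypothetical): the compare-and-exchange guards a critical section through the lock variable declared above.

/* Acquire: atomically change *lock from 0 (free) to 1 (held); spin otherwise. */
void spin_lock(volatile int *lock)
{
    while (atomic_compare_and_exchange(0, lock, 1, sizeof(int)) != 0)
        ;                           /* someone else holds the lock: retry */
}

/* Release: a plain store is enough once the critical section is done. */
void spin_unlock(volatile int *lock)
{
    *lock = 0;
}

/* Usage inside a thread:
       spin_lock(&lock);
       shared_counter++;            // critical section
       spin_unlock(&lock);
   or simply atomic_increment(&shared_counter); for a lone counter update. */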
30
The changing role of
GPUs
31
GPU (Graphics Processing Unit)
Originally for displaying images
  HD video: 1920*1080 pixels * 60 frames per second (roughly 124 million pixel updates every second)
Graphics processing pipeline:
  Input Assembler → Vertex Shader → Geometry Shader → Setup & Rasterizer → Pixel Shader → Raster Operations / Output Merger
Shader code fragment shown on the slide:
  vec3 L = normalize(o_toLight);
  vec3 V = normalize(o_toCamera);
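To see why pixels are such a natural source of parallelism, here is a rough C sketch (an illustration, not from the slides; the vec3 type, the dot helper, and the simple diffuse model are assumed) of the same shading operation applied independently to every pixel:

typedef struct { float x, y, z; } vec3;

/* assumed helper: dot product of two vectors */
static float dot(vec3 a, vec3 b) { return a.x*b.x + a.y*b.y + a.z*b.z; }

/* Every pixel runs the same few arithmetic operations on its own L and N;
   there are no dependencies between pixels, so a GPU can evaluate
   thousands of them at once. */
void shade_frame(const vec3 *L, const vec3 *N, float *out, int num_pixels)
{
    for (int i = 0; i < num_pixels; i++) {   /* 1920*1080 pixels per frame */
        float d = dot(N[i], L[i]);           /* simple diffuse term        */
        out[i] = d > 0.0f ? d : 0.0f;
    }
}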
33
What do you want from a GPU?
Given the basic idea of shading algorithms, how many of
the following statements would fit the agenda of designing
a GPU?
Many ALUs to process multiple pixels simultaneously
  (Each frame contains 1920*1080 pixels!)
High-bandwidth memory bus to supply pixels, vectors and textures
  (Each pixel requires different L, N, R, V)
High-performance branch predictors
  (Not really; the behavior is uniform across all pixels)
Powerful ALUs to process many different kinds of operators
  (Not really; we only need vector add, multiply, and divide. Low frequency is OK since we have many threads)
A. 0
B. 1
C. 2
D. 3
E. 4
34
Nvidia GPU architecture
[Block diagram of an Nvidia Kepler-class GPU; the highlighted blocks are the SMX (Streaming Multiprocessor) units.]
From the Kepler whitepaper: GK210 expands upon GK110's on-chip resources, doubling the available register file and shared memory capacities per SMX, with hardware support throughout the design to enable new programming-model capabilities.
35
Streaming Multiprocessor (SMX) Architecture
[Diagram of one SMX: a large array of cores. Each of these cores performs the same operation, but each of them also executes a separate thread.]
36
[Slide excerpting the AMD GCN whitepaper: another crucial innovation in GCN is coherent caching. Historically, GPUs have relied on specialized caches (such as read-only texture caches) that do not maintain a coherent view of memory; to communicate between cores within a GPU, the programmer or compiler must insert explicit synchronization instructions to flush shared data back to memory. While this approach simplifies the design, it increases overhead for applications which share data.]
38
CPU vs. GPU
Comparing the CPU and GPU architectures, how many of the following are correct?
GPU architectures are more tolerant of memory latencies
  (Think about SMT)
CPU is a better fit if the application requires low-latency operations
  (GPU memory is optimized for bandwidth, sacrificing latency)
GPU is efficient for workloads like matrix multiplications
GPU is inefficient if the workload needs only a small set of input data points but contains lots of decision making and backward references
  (The utilization of the GPU will be low; besides, GPU ALUs are slower than CPU ALUs)
A. 0
B. 1
C. 2
D. 3
E. 4
39
Programming GPGPU
A GPGPU application contains the host program and
GPU kernels
Host program: a C/C++-based CPU program that invokes the GPU APIs
GPU kernels: C/C++-like programs running on GPUs
Programming models
CUDA (Compute Unified Device Architecture)
  Proposed by NVIDIA
  Supported only by NVIDIA GPUs
OpenCL
Maintained by Khronos Group (non-profit)
Supported by Altera, AMD, Apple, ARM Holdings, Creative
Technology, IBM, Imagination Technologies, Intel, Nvidia,
Qualcomm, Samsung, Vivante, Xilinx, and ZiiLABS
40
What does a host program look like?

int main (int argc, const char * argv[]) {
    /* Initialize the GPU runtime */
    dispatch_queue_t queue = gcl_create_dispatch_queue(CL_DEVICE_TYPE_GPU, NULL);
    /* ... */
}

/* Per-element matrix multiplication for one (row, col) position of the output: */
float value = 0;
for (int k = 0; k < size; ++k)
{
    value += input_a[row * size + k] * input_b[k * size + col];
}
output[row * size + col] = value;
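For comparison, a minimal sketch of what a corresponding GPU kernel could look like in OpenCL C (an illustration only, not the slides' actual kernel; the kernel name, the argument list, and the mapping of get_global_id to row/col are assumptions):

/* Each work-item computes one element of the output matrix. */
__kernel void matrixmul(__global const float *input_a,
                        __global const float *input_b,
                        __global float *output,
                        int size)
{
    int row = get_global_id(1);
    int col = get_global_id(0);
    float value = 0.0f;
    for (int k = 0; k < size; ++k)
        value += input_a[row * size + k] * input_b[k * size + col];
    output[row * size + col] = value;
}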
42
Demo
Matrix multiplication
blocked algorithm vs. naive OpenCL code
43
How things are connected
[Diagram: the CPU connects to DRAM and, through a PCIe switch, to the GPU]
44
New overhead/bottleneck emerges
[Same diagram: the PCIe switch between the CPU/DRAM and the GPU is highlighted as the new overhead/bottleneck]
45
Closing remarks and
future perspectives
49
Challenges
Moore's law is slowing down
Leakage power constrains the processing power of a single chip
  Discontinuation of Dennard scaling
  Also called the dark silicon problem
50
CPU performance scaling slows down
[Figure 1: CPU SPECRate and supercomputer FLOPS over time. Annotations on the chart: CPU SPECRate grew 1.5x in 56 months; supercomputer FLOPS grew 2.7x in 56 months; no faster supercomputers since Dec 2012; 4x; legend entries for DRAM, CPU, and GPU.]
52
Conclusion
In the past, software engineers and hardware engineers worked on different sides of the ISA abstraction
  Software engineers had no idea what happens inside processors/hardware
  Hardware engineers had no sense of what the demands of applications are
  This worked fine as long as we could keep accelerating CPUs, but that is no longer true
We need new execution & programming models to better utilize these hardware components
We need innovative computer system design to address the
challenges from process technologies and the application
demands
Artificial intelligence
3D VR/AR
Hope to see you again in CSC456 Spring 2018 (planned)
53
We will talk more about high-performance/heterogeneous computer system design, as well as corresponding techniques to accelerate your programs, in CSC456 next Spring!
54