Chapter 8
2024/2025 Sem A
1
Sequential Execution
Fetch Decode Execute Memory Writeback
[Figure: datapath for sequential execution — the PC addresses main memory, the fetched instruction is held in the IR and decoded, operands are selected from the register file or immediate fields, the ALU operates on them, and results/data are written back to the register file or memory]
2
Processing Models
• single instruction and single data (SISD)
• single instruction and multiple data (SIMD)
• multiple instruction and single data (MISD)
• multiple instruction and multiple data (MIMD)
3
Multicore Processors
• A single chip with two or more separate processing units (cores)
• Each of which reads and executes program instructions
• Increasing overall speed for programs that support multithreading
or other parallel computing techniques
4
Trend of Modern Processors
Moore’s Law: integrated circuit resources double every 18–24 months
5
Amdahl’s Law
• In a program with parallel processing, a few instructions that have to be performed in sequence will be a limiting factor on program speedup, such that adding more processors may not make the program run faster.
6
Amdahl’s Law
Speedup = ExeTime_old / ExeTime_new

ExeTime_new = ExeTime_old × [(1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]

Speedup = ExeTime_old / ExeTime_new = 1 / [(1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]
7
Amdahl’s Law
• When executing software on multi-core processors, the ideal speedup achieved is equal to the number of processor cores (cores for short)
• In reality, only part of the software can be parallelized
• Example: suppose 80% of a software program can be
parallelized, what is the best speedup if it is executed on 100
processor cores?
Speedup = ExeTime_old / ExeTime_new = 1 / [(1 − 0.8) + 0.8/100] = 1 / 0.208 = 4.808
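A minimal host-side C++ sketch of this calculation (the function name amdahl_speedup is ours, not from the slides):

  #include <cstdio>

  // Amdahl's Law: Speedup = 1 / ((1 - f) + f / s)
  // f: fraction of execution time that is enhanced (parallelized)
  // s: speedup of the enhanced fraction (here, the number of cores)
  double amdahl_speedup(double f, double s) {
      return 1.0 / ((1.0 - f) + f / s);
  }

  int main() {
      printf("%.3f\n", amdahl_speedup(0.8, 100.0));  // prints 4.808
      return 0;
  }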
8
GPU (Graphic Processing Unit)
9
Graphics Applications
• The graphics pipeline has a lot of redundant (repeated) instructions
– Draw dots
– Draw triangles
– Draw lines
– Rasterize an image (convert from vector to pixel bitmap)
– Draw surfaces
– Determine which object is shown on each pixel
– Compute lighting on each pixel
– Repeat
– All of these involve lots of adding and multiplying on tons of independent data, in parallel!
10
GPU in Modern Systems
[NVIDIA GPU-accelerated libraries by domain]
• Domain-specific: cuDNN, nvGRAPH, TensorRT, NCCL
• Visual processing: NVIDIA NPP, DeepStream SDK, NVIDIA CODEC SDK, IndeX framework
• Linear algebra: cuSPARSE, cuBLAS, cuRAND, cuFFT
• Math algorithms: CUDA Math Library, AmgX, cuSOLVER, Thrust library
11
GPU: SIMD
• Idea: Use one decoder on multiple data
• Basically, all of these data elements undergo the same operation
• Saves the logic required to fetch/decode tons of instructions
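As an illustration (not from the slides), a minimal CUDA kernel in which one instruction stream is applied to many data elements in parallel:

  __global__ void vecAdd(const float *a, const float *b, float *c, int n) {
      // Every thread executes the same instructions on a different element
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) c[i] = a[i] + b[i];
  }

  // Launch example: blocks of 256 threads covering n elements
  // vecAdd<<<(n + 255) / 256, 256>>>(a, b, c, n);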
12
GPU Architecture
[Figure: a Streaming Multiprocessor (SM) contains many Scalar Processors (SPs, ‘CUDA cores’) plus a shared memory, and is connected to the host]
14
GPU Programming Models
• CUDA (Compute Unified Device Architecture): parallel GPU
programming API created by NVIDIA
• Hardware and software architecture for issuing and managing
computations on GPU
• API libraries for the C/C++/Fortran languages
15
CUDA Structure
• Kernel: a function executed on the GPU
• Block: a group of threads (block size is usually a multiple of 32)
• Grid: a group of one or more blocks
– A grid is created for each CUDA kernel
[Figure: the host launches a kernel on the device; Grid 1 contains blocks such as Block (1, 1), and each block contains a 2D arrangement of threads, e.g., Thread (0, 0) … Thread (4, 1)]
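A minimal launch-configuration sketch (the kernel name myKernel is illustrative):

  dim3 block(8, 4);             // 32 threads per block (a multiple of 32)
  dim3 grid(16, 16);            // 256 blocks form the grid
  myKernel<<<grid, block>>>();  // one grid is created for this kernel launch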
16
CUDA Structure
• One block executes on one SM (Streaming Multiprocessor)
17
CUDA Structure
• Each block has a unique ID, e.g., Block (1, 1)
18
CUDA Structure
• Built-in variables: gridDim, blockDim, blockIdx, threadIdx
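A small sketch of how these built-in variables are typically combined into a global index (the kernel name is illustrative):

  __global__ void indexDemo(int *out) {
      // blockIdx, blockDim and threadIdx are provided by CUDA
      int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
      out[i] = i;
  }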
19
CUDA: Programming
• Declarations on functions: __host__, __global__, __device__
– __host__ indicates host functions, which execute on the CPU
– __global__ declares a kernel function, which executes on the GPU
– __device__ indicates device functions, which can only be called from other
device or global functions. It cannot be called from host code
20
CUDA: Programming
mycode.cu:

  int main_data;
  __shared__ int sdata;

  main() { }

  __host__ hfunc() {
    int hdata;
    gfunc<<<g, b, m>>>();
  }

  __global__ gfunc() {
    int gdata;
    dfunc();
  }

  __device__ dfunc() {
    int ddata;
  }

Host-only parts (int main_data; main(); __host__ hfunc) are compiled by the CPU compiler; device parts (__shared__ sdata; __global__ gfunc; __device__ dfunc) are compiled by the GPU compiler; the kernel launch gfunc<<<g, b, m>>>() is the interface between the two.
21
Difference between CPU and GPU Programs
22
Example 1
• Task to do
Scan elements of an array of numbers (any of 0 to 9)
Count how many times “6” appears
The array has 16 elements (SIZE=16)
Each block contains 4 threads (BLOCKSIZE=4)
So each thread checks 4 elements (SIZE/BLOCKSIZE=4)
1 block in the grid
3 6 7 5 3 5 6 2 9 1 2 7 0 9 3 6
Kernel Function:
Thread scans subset of array elements
Call device function to compare with 6
Compute local result
Device Function:
Compare current element and 6
Return 1 if same, else 0
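A possible CUDA sketch of this kernel and device function (function names and the way local results are stored are our assumptions; the slides only describe the idea):

  #define SIZE 16
  #define BLOCKSIZE 4

  // Device function: compare the current element with 6; return 1 if equal, else 0
  __device__ int isSix(int v) {
      return (v == 6) ? 1 : 0;
  }

  // Kernel: each thread scans SIZE/BLOCKSIZE consecutive elements
  __global__ void countSix(const int *data, int *result) {
      int perThread = SIZE / BLOCKSIZE;   // 4 elements per thread
      int start = threadIdx.x * perThread;
      int local = 0;
      for (int k = 0; k < perThread; k++)
          local += isSix(data[start + k]);
      result[threadIdx.x] = local;        // local result per thread
  }

  // Launch with 1 block of BLOCKSIZE threads:
  // countSix<<<1, BLOCKSIZE>>>(d_data, d_result);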
[Figure: 2D thread indexing — element (i, j) of an N×N (here 16×16) array is addressed by i = blockIdx.x*blockDim.x + threadIdx.x and j = blockIdx.y*blockDim.y + threadIdx.y]
25
GPU Tradeoffs
• Very high throughput
26
Interconnection
• For transferring information between
nodes or for broadcasting information to
all nodes
• The suitability of a network is judged by cost, bandwidth, and effective throughput
– Bandwidth is the capacity in bits per second
– Effective throughput is the actual data rate, which is lower than the bandwidth because control information must also be transferred
Bus and Ring
• A bus is a set of lines providing a shared path
– Requires arbitration for one access at a time
– A simple bus is held until the response is ready, whereas a split-transaction bus can overlap multiple requests
29
Shared Memory Multiprocessors
• All processors have direct (hardware) access to all the main
memory, i.e., share the same address space
– Multicore processors, GPU, …
30
UMA vs NUMA
Uniform Memory Access (UMA): memory access time is balanced (equal for all processors); usually slower than NUMA; limited bandwidth.
Non-Uniform Memory Access (NUMA): memory access time is not equal; usually faster than UMA; higher bandwidth than UMA.
31
Shared Data
• Proc 0 writes to an address, followed by Proc 1 reading
Proc 0 Proc 1
Mem[A] = 1 …
Print Mem[A]
32
Cache Coherence
• Basic question: If multiple processors cache the same block,
how do they ensure they all see a consistent state?
[Figure: P1 and P2, each with a private cache, connect through an interconnection network to main memory; memory location x holds 1000]
33
Cache Coherence
• Basic question: If multiple processors cache the same block,
how do they ensure they all see a consistent state?
[Figure: P2 executes lw r2, x and caches the value 1000; memory location x still holds 1000]
34
Cache Coherence
• Basic question: If multiple processors cache the same block,
how do they ensure they all see a consistent state?
[Figure: P1 also executes lw r2, x; both P1 and P2 now cache 1000 for x]
35
Cache Coherence
• Basic question: If multiple processors cache the same block,
how do they ensure they all see a consistent state?
[Figure: P1 executes add r1, r2, r4 and sw r1, x; P1’s cache now holds 2000 for x while P2’s cache and main memory still hold 1000]
36
Cache Coherence
• Basic question: If multiple processors cache the same block,
how do they ensure they all see a consistent state?
[Figure: P2 executes lw r5, x; it should NOT load the stale value 1000 still sitting in its cache]
37
Cache Coherence
• Basic idea:
– A processor/cache broadcasts its write/update to a memory location to all other processors
– Another cache that has the location either updates or invalidates its local copy
[Figure: after P1’s write, P1’s cache holds 2000 for x; P2’s stale copy (1000) is updated or invalidated]
38
Snoopy Coherence Scheme
• All caches “snoop” on all other caches’ read/write requests and keep the cache line coherent, invalidating the cache line if another cache changes the value
40
Directory-based Cache Coherence Scheme
• Requestor: the processor requesting a read/write of a memory block.
• Owner: the node that owns the most recent state of the cache block; note that the directory might not always be up to date with the latest data.
• Sharer: one or more nodes that share a copy of the cache block.
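As an illustration (field names and states are our assumptions, not a specific protocol from the slides), a directory entry for one memory block might track:

  struct DirectoryEntry {
      enum State { UNCACHED, SHARED, MODIFIED } state;  // state of the block
      unsigned sharers;  // bit vector: which nodes hold a copy (sharers)
      int owner;         // node holding the most recent copy of the block
  };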
41
Directory-based Cache Coherence Scheme
• Directory-based Cache Coherence for NUMA
42
Directory-based Cache Coherence Scheme
• Directory-based Cache Coherence for NUMA
– read miss to clean line
43
Directory-based Cache Coherence Scheme
• Directory-based Cache Coherence for NUMA
– read miss to dirty line
46
Snoopy vs Directory-Based
Snoopy: full broadcast communication; simple control logic; low latency; limited scalability.
Directory-based: limited broadcast or point-to-point communication; more complex control logic (needs to maintain directory states); higher latency (needs to access the directory); good scalability.
50
Message-Passing Multiprocessors
• Processors have their own physical address space
– memory accessed only by local processor
• Communicate via explicit message passing
– one processor needs to know when a message is sent, and the
receiving processor needs to know when a message arrives
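An illustrative sketch of the explicit send/receive pattern (send and receive here are hypothetical functions standing in for a real message-passing library such as MPI):

  // On processor 0: explicitly send local data to processor 1
  int value = 42;
  send(/*dest=*/1, &value, sizeof(value));          // hypothetical API

  // On processor 1: explicitly receive; returns when the message has arrived
  int received;
  receive(/*src=*/0, &received, sizeof(received));  // hypothetical API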
51
Message-Passing Multiprocessors
• Each node is effectively a complete computer
• To build large-scale computers, usually use high-performance
message-passing interconnection networks
– Special interconnection networks offer better communication performance than using, e.g., a LAN, but are much more expensive
• Much easier for hardware designers to build
– But requires more programming effort
52
Shared Memory vs Message Passing
Shared Memory:
– The shared memory region is used for communication.
– Communicating processes share a common address space.
– Provides maximum speed of computation: communication goes through shared memory, so system calls are needed only to establish the shared region.
– Faster communication strategy.
– Can be used for exchanging larger amounts of data.
– Software on different processors needs to ensure data consistency.
Message Passing:
– A message-passing facility is used for communication.
– Communicating processes have different address spaces.
– Time-consuming: message passing is implemented across software layers (e.g., system calls provided by the operating system).
– Relatively slower communication strategy.
– Can be used for exchanging small amounts of data.
– Explicit communication actions; no consistency problem.
53
Network-on-Chips (NoC)
• Bus has been the most popular for multiprocessor systems, but
– When expanding to a many-core system, contention decreases throughput;
– When scaling sizes and frequency, wire delays remain larger than the clock cycle
• Need for an interconnect with deterministic delays and scalability
• Network-on-Chip (NoC): Each on-chip component connected by an
intelligent switch (router) to particular communication wire(s)
54
Network-on-Chips (NoC)
[Figure: mesh NoC topology]
System-on-Chip (SoC)
• Integrates most or all components of a computer on one chip
– Rather than many chips on a board
– Popular in domains sensitive to size and energy efficiency
• Mobile devices
• Embedded devices
56
Exercise 1
• 80% of a program can be perfectly parallelized; 10% can be parallelized on at most 4 cores; the remainder has to execute sequentially
• What is the speedup if executed on 8 cores?
57
Exercise 2
• 40% of a program can be perfectly parallelized; 40% can be parallelized on at most 8 cores; 10% can be parallelized on at most 4 cores; the remainder has to execute sequentially
58