Lecture 2: GPU Architecture - 2025
Agenda
Example Program
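The slides that follow step this program through a simple processor one instruction at a time. As a rough C rendering (the function name, the loop bound TERMS, and the exact arithmetic are assumptions; the slides show only the instruction stream, which reads x[i] and writes result[i]), the per-element loop might look like:

#define TERMS 8                               /* assumed iteration count */

/* A minimal sketch of the example program: one independent
 * load-compute-store sequence per element of x[]. */
void example_program(const float* x, float* result, int N) {
    for (int i = 0; i < N; i++) {
        float acc = 0.0f;
        float xi  = x[i];                     /* ld  r0, x[r1]                    */
        for (int k = 0; k < TERMS; k++) {     /* and r2, r2, 0 ... add r2, r2, 1  */
            acc += xi * (float)k;             /* mul r4, r0, r2 ; add r5, r5, r4  */
        }
        result[i] = acc;                      /* st  result[r10], r5              */
    }
}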
Execute Program on a simple processor
▪ The core fetches/decodes one instruction per cycle, executes it on the ALU, and updates the execution context (registers); the input is x[i] and the output is result[i]
▪ Instruction stream (the PC advances by one instruction per cycle):
...
and r2, r2, 0
ld r0, x[r1]
mul r4, r0, r2
add r5, r5, r4
add r2, r2, 1
...
st result[r10], r5
...
▪ At cycle n the PC points to and r2, r2, 0; at cycle n+1 to ld r0, x[r1]; at cycle n+2 to mul r4, r0, r2
Execute Program on a Superscalar processor
▪ Exploit ILP: decode and execute multiple independent instructions in parallel
(Figure: at cycle n, multiple instructions from the same stream are fetched, decoded, and executed at once; input x[i], output result[i])
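A minimal C sketch of instruction-level parallelism (the function and variable names are assumptions, not from the slides): the two multiplies below have no dependence on each other, so a 2-wide superscalar core can decode and execute them in the same cycle, while the final add depends on both results and must issue later.

/* Two independent operations followed by a dependent one. */
float ilp_example(float x, float y) {
    float a = x * 2.0f;   /* independent of b        */
    float b = y * 3.0f;   /* independent of a        */
    return a + b;         /* depends on both a and b */
}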
Pre multi-core era
▪ Majority of transistors are used to perform operations that help a single
instruction stream run fast
(Figure: a single core whose transistors are devoted to Fetch/Decode, Execute (ALU), a large cache, and a memory prefetcher, all serving one instruction stream)
Two cores: compute two elements in parallel
(Figure: two cores, each with its own Fetch/Decode, Execute (ALU), and Execution Context, producing result[0] and result[1])
• Each core can be slower than a high-performance core (e.g., 0.75 times as fast)
• But the overall performance of two cores will be higher (e.g., 0.75 × 2 = 1.5)
Four cores: compute 4 elements in parallel
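A minimal pthreads sketch (not from the slides; N, NUM_THREADS, worker, and the per-element computation are assumptions) of how software creates the threads that feed four cores: each thread processes one quarter of the elements.

#include <pthread.h>
#include <stddef.h>

#define N 1024
#define NUM_THREADS 4

static float x[N], result[N];

typedef struct { int start; int end; } range_t;

/* Each thread runs the same per-element loop over its own range. */
static void* worker(void* arg) {
    range_t* r = (range_t*)arg;
    for (int i = r->start; i < r->end; i++)
        result[i] = x[i] * x[i];          /* stand-in for the per-element work */
    return NULL;
}

int main(void) {
    pthread_t threads[NUM_THREADS];
    range_t ranges[NUM_THREADS];
    int chunk = N / NUM_THREADS;
    for (int t = 0; t < NUM_THREADS; t++) {
        ranges[t].start = t * chunk;
        ranges[t].end   = (t + 1) * chunk;
        pthread_create(&threads[t], NULL, worker, &ranges[t]);
    }
    for (int t = 0; t < NUM_THREADS; t++)
        pthread_join(threads[t], NULL);   /* wait for all four workers to finish */
    return 0;
}

Compile with -pthread; the four threads are independent, so the OS can schedule one on each core.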
But how do you feed all these cores? ➔ Data-level Parallelism
Interesting property of Example Program
▪ Parallelism is across iterations of the loop
▪ All the iterations of the loop do the same thing
Add ALUs to increase compute capability
▪ SIMD (Single Instruction Multiple Data) Processing
o Share cost of fetch / decode across many ALUs
o Add ALUs and execute the same instruction on them with different
data
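A minimal AVX sketch of the idea (an illustration using x86 intrinsics, which are an assumption; the slides do not name an instruction set): a single 8-wide multiply instruction does the work of eight scalar multiplies while sharing one fetch/decode.

#include <immintrin.h>

/* Multiply each element of x by s, eight elements per instruction. */
void scale8(const float* x, float* result, int n, float s) {
    __m256 vs = _mm256_set1_ps(s);                           /* broadcast the scalar */
    int i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 vx = _mm256_loadu_ps(&x[i]);                  /* load 8 floats        */
        _mm256_storeu_ps(&result[i], _mm256_mul_ps(vx, vs)); /* 8 multiplies at once */
    }
    for (; i < n; i++)                                       /* scalar remainder     */
        result[i] = x[i] * s;
}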
What about conditional execution in SIMD?
▪ Mask (discard) the output of ALUs in lanes where the branch is not taken
▪ After the branch: continue at full performance
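A minimal sketch of how the masking works (again using AVX intrinsics as an assumed illustration): both sides of the branch are computed for every lane, and a per-lane mask selects which result is kept; the masked-off ALU outputs are simply discarded, which is why divergent branches cost throughput until they reconverge.

#include <immintrin.h>

/* Per-lane "if (x > 0) keep x; else keep -x", computed branch-free.
   n is assumed to be a multiple of 8 (remainder loop omitted for brevity). */
void vec_abs(const float* x, float* result, int n) {
    __m256 zero = _mm256_setzero_ps();
    for (int i = 0; i + 8 <= n; i += 8) {
        __m256 vx   = _mm256_loadu_ps(&x[i]);
        __m256 mask = _mm256_cmp_ps(vx, zero, _CMP_GT_OQ);   /* per-lane: x > 0 ? */
        __m256 neg  = _mm256_sub_ps(zero, vx);               /* "else" side: -x   */
        /* keep vx where the mask is set, neg where it is clear */
        _mm256_storeu_ps(&result[i], _mm256_blendv_ps(neg, vx, mask));
    }
}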
Examples:
▪ NVIDIA GTX480
o 15 cores
o 32 SIMD ALUs per core
o 1.3 TFLOPS
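For reference, the quoted throughput is consistent with the ALU count: 15 cores × 32 SIMD ALUs = 480 ALUs; at a shader clock of roughly 1.4 GHz with one fused multiply-add (2 floating-point operations) per ALU per cycle, 480 × 2 × ~1.4 GHz ≈ 1.3 TFLOPS (the clock value here is approximate).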
Summary
▪ Several forms of parallel execution in modern processors
o Multi-core: use multiple processing cores
• Provides thread-level parallelism: simultaneously execute a completely
different instruction stream on each core
• Software decides when to create threads (e.g., via pthread API)
o SIMD: use multiple ALUs controlled by the same instruction stream (within a core)
• Efficient design for data-parallel workloads by exploiting DLP (Data-Level Parallelism)
• Vectorization can be done by the compiler or at runtime by hardware
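For example, compilers such as GCC and Clang will typically auto-vectorize a simple per-element loop (like the scalar version of the earlier scale8 sketch) at -O3, emitting the same kind of SIMD instructions; when they cannot prove it safe, the programmer can fall back to intrinsics as sketched above.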
Agenda
Terminology
▪ Memory Latency
o The amount of time for a memory request (e.g., a load or store) to be serviced by the memory system
o Example: 100 cycles, 100 ns
▪ Memory Bandwidth
o The rate at which the memory system can provide data to a processor
o Example: 20 GB/s
▪ Stall
o A processor “stalls” when it cannot run the next instruction in an instruction stream because of a dependency on a previous instruction
o Accessing memory is a major source of stalls
o Memory latency: more than 100 cycles
Slide credit: CMU 15-418/15-618
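A back-of-the-envelope link between the two example numbers above (not from the slides): during one 100 ns memory latency, a 20 GB/s memory system could transfer 20 GB/s × 100 ns = 2,000 bytes, so a processor that simply waits for each request to complete leaves almost all of that bandwidth unused. This is the motivation for the latency-hiding technique on the next slide.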
Hiding stalls with multi-threading
▪ Idea: interleave processing of multiple threads on the same core to hide stalls
Slide credit: CMU 15-418/15-618
Multi-threading summary
▪ Benefits: use a core’s ALU resources more efficiently by hiding memory
latency
▪ Costs
o Requires additional storage for thread contexts
o Relies heavily on memory bandwidth
• More threads → larger working set → less cache space per thread
• May go to memory more often, but can hide the latency
Slide credit : CMU 15-418/15-618
Agenda
GPU Architectures
CPU and GPU are designed very differently
▪ CPU is designed to minimize the execution latency of a single thread
▪ GPU is designed to maximize the computation throughput
▪ GPU uses larger fraction of silicon for computation than CPU
▪ GPU consumes an order of magnitude less energy per operation than CPU
o ~2 nJ/operation on a CPU vs. ~200 pJ/operation on a GPU
CPU: Latency-Oriented Cores vs. GPU: Throughput-Oriented Cores
(Figure: a CPU chip made of a few large cores, each with a local cache, registers, control logic, and a SIMD unit, next to a GPU chip made of many small compute units, each with cache/local memory, threading hardware, registers, and SIMD units)
What a modern GPU looks like
▪ NVIDIA Pascal
Inside a GPU
▪ Hierarchical approach
o The chip contains many Texture Processor Clusters (TPCs)
o Each TPC contains Streaming Multiprocessors (SMs) and a texture unit (TEX)
o Each SM has instruction and data L1 caches, an instruction fetch/dispatch unit, shared memory, multiple Streaming Processors (SPs), and Special Function Units (SFUs)
Inside a GPU
▪ GPUs have many Streaming Multiprocessors (SMs)
o Each SM has multiple processors (SPs) but only one instruction unit
• All SPs within an SM share a program counter
o Groups of processors must run the exact same set of instructions at any given time within a single SM
▪ 16 SMs
▪ Each with 32 cores
o 512 total cores
▪ Each SM hosts up to
o 48 warps (= 1,536 threads)
• Warp: 32 threads
▪ In flight, up to
o 24,576 threads
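These numbers are consistent: 48 warps × 32 threads/warp = 1,536 threads per SM, and 16 SMs × 1,536 threads = 24,576 threads in flight across the chip.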
Example: NVIDIA Fermi Architecture
▪ Streaming Multiprocessor (SM)
o 32 Streaming Processors (SPs) = CUDA cores
o 16 load/store units
o 4 Special Function Units (SFUs)
o 64 KB of high-speed on-chip memory (L1 + shared memory)
o Interface to the L2 cache
o 32K 32-bit registers
o Two warp schedulers, two dispatch units
▪ SP (CUDA core)
o Execution unit for integer and floating-point numbers
o 32-bit precision for all instructions
Thread Scheduling/Execution
▪ Threads run concurrently
o SM assigns/maintains thread IDs
o SM manages/schedules thread execution
▪ Each thread block (1 to 1,024 threads) is divided into 32-thread warps
o 1 warp = 32 threads
▪ A scenario
o 3 blocks assigned to an SM
o each block has 256 threads
▪ Example: SM multithreaded warp scheduler
(Figure: over time, the scheduler interleaves ready warps, e.g., warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, ..., warp 8 instruction 12, warp 3 instruction 96)
o Assumptions
• 1 clock cycle is needed to dispatch the same instruction for all threads in a warp
• one global memory access is needed for every 4 instructions
o A minimum of 26 warps is needed to fully tolerate 100-cycle memory latency
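The 26-warp figure follows directly from the assumptions: each warp issues 4 instructions (4 cycles at one instruction per cycle) before it must wait on memory, so covering a 100-cycle latency requires 100 / 4 = 25 other warps that are ready to run, plus the stalled warp itself, for 26 warps in total.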
Slide credit : CMU 15-418/15-618
Next…
▪ Fundamentals of CUDA