
GPU Architecture

Prof. Seokin Hong


Agenda
▪ Parallel Execution in Modern Processors
▪ Multi-threading for Hiding Memory Latency
▪ GPU Architectures

2
Agenda

Parallel Execution in Modern Processors

3
Example Program

void mul(int N, float *x, float *result)
{
  for(int i=0; i<N; i++)
  {
    float value=0;
    for(int j=0; j<N; j++)
      value += x[i] * j;
    result[i]= value;
  }
}

Compiled into a scalar instruction stream:

  .....
  and r2, r2, 0
  ld  r0, x[r1]
  mul r4, r0, r2
  add r5, r5, r4
  add r2, r2, 1
  ...
  ...
  st  result[r10], r5
  ...
4
Execute Program on a simple processor
At cycle n

[Figure: a simple in-order core with a Fetch/Decode unit, one Execute (ALU) unit, and an execution context (registers). It reads X[i], runs the scalar instruction stream from the previous slide, and writes result[i]. At cycle n the PC points to "and r2, r2, 0".]
5
Execute Program on a simple processor
At cycle n+1

[Figure: the same simple core; at cycle n+1 the PC has advanced to "ld r0, x[r1]".]
6
Execute Program on a simple processor
At cycle n+2

[Figure: the same simple core; at cycle n+2 the PC has advanced to "mul r4, r0, r2".]
7
Execute Program on a Superscalar processor
▪ Exploit ILP: decode and execute multiple instructions in parallel

At cycle n
[Figure: a superscalar core with two Fetch/Decode units and two Execute (ALU) units sharing one execution context. Two adjacent instructions in the stream have no dependency between them, so they are decoded and executed simultaneously.]
8
Pre multi-core era
▪ The majority of transistors are used to perform operations that help a single
  instruction stream run fast

[Figure: a single core whose Fetch/Decode unit, Execute (ALU), and execution context (registers) sit alongside a large cache, out-of-order control logic, a fancy branch predictor, and a memory prefetcher.]

• More transistors → larger cache, smarter out-of-order logic, smarter branch predictor, etc.
• This approach has fundamental limitations: the power wall and diminishing gains from ILP

9
Multi-core
▪ Use the increasing transistor count to add more cores to the processor, rather than
  using those transistors to accelerate a single instruction stream

[Figure: two simpler cores (Core 0 and Core 1), each with its own Fetch/Decode unit, Execute (ALU), and execution context. Core 0 computes result[0] from X[0] while Core 1 computes result[1] from X[1].]

• Each core can be slower than a high-performance core (e.g., 0.75 times as fast)
• But the overall throughput of two such cores is higher (e.g., 0.75 x 2 = 1.5)
10
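One way to map the running mul() example onto multiple cores is to let software create one thread per chunk of the outer loop. The following is a minimal sketch, not from the slides, assuming a POSIX pthread environment; the thread count and the block partitioning are illustrative choices.

#include <pthread.h>

typedef struct { int N, begin, end; float *x, *result; } mul_args;

/* Each thread runs the same outer-loop body over its own range [begin, end). */
static void *mul_worker(void *p)
{
    mul_args *a = (mul_args *)p;
    for (int i = a->begin; i < a->end; i++) {
        float value = 0;
        for (int j = 0; j < a->N; j++)
            value += a->x[i] * j;
        a->result[i] = value;
    }
    return NULL;
}

/* Illustrative: split the N iterations across 4 threads (e.g., one per core). */
void mul_parallel(int N, float *x, float *result)
{
    enum { T = 4 };
    pthread_t tid[T];
    mul_args  args[T];
    for (int t = 0; t < T; t++) {
        args[t] = (mul_args){ N, t * N / T, (t + 1) * N / T, x, result };
        pthread_create(&tid[t], NULL, mul_worker, &args[t]);
    }
    for (int t = 0; t < T; t++)
        pthread_join(tid[t], NULL);
}
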
Four cores: compute 4 elements in parallel

Four cores run four simultaneous instruction streams


11
Sixteen cores: compute 16 elements in parallel

Sixteen cores run sixteen simultaneous instruction streams

12


128 cores?

128 cores → 128 simultaneous instruction streams

13
But how do you feed all these cores? ➔ Data-level Parallelism
Interesting property of Example Program
▪ Parallelism is across iterations of the loop
▪ All the iterations of the loop do the same thing

void mul(int N, float *x, float *result)


{
for(int i=0; i<N; i++)
{
float value=0;
for(int j=0; j<N; j++)
value += x[i] * j;
result[i]= value;
}
}

14
Add ALUs to increase compute capability
▪ SIMD (Single Instruction Multiple Data) Processing
o Share cost of fetch / decode across many ALUs
o Add ALUs and execute the same instruction on them with different
data

[Figure: one core with a single Fetch/Decode unit driving eight ALUs (ALU0-ALU7). The same instruction operates on eight different inputs X[0]-X[7] at once, producing result[0]-result[7]; all ALUs share one execution context.]
15
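To make the idea concrete, the running example can be written with 8-wide SIMD instructions so that one instruction updates eight accumulators at once. This is a minimal sketch, not from the slides, assuming an x86 CPU with AVX; each of the eight vector lanes computes one result[i], mirroring the 8-ALU figure above, and N is assumed to be a multiple of 8 to keep the sketch short.

#include <immintrin.h>   /* AVX intrinsics (assumes x86 with AVX support) */

/* Illustrative SIMD version of the running example: one lane per result[i]. */
void mul_simd(int N, const float *x, float *result)
{
    for (int i = 0; i < N; i += 8) {
        __m256 xi    = _mm256_loadu_ps(&x[i]);   /* x[i..i+7], one element per lane */
        __m256 value = _mm256_setzero_ps();      /* eight accumulators              */
        for (int j = 0; j < N; j++) {
            __m256 jv = _mm256_set1_ps((float)j);                 /* broadcast j        */
            value = _mm256_add_ps(value, _mm256_mul_ps(xi, jv));  /* value += x[i] * j  */
        }
        _mm256_storeu_ps(&result[i], value);     /* result[i..i+7]                  */
    }
}
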
16 SIMD cores: 128 elements in parallel

16
What about conditional execution in SIMD?
▪ All lanes execute the same instruction, so both sides of the branch are executed
▪ Mask (discard) the output of the ALUs whose lanes did not take that path
▪ After the branch: all lanes continue together at full performance

17-20
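As a hedged sketch of what this masking looks like in practice (not from the slides, assuming the same AVX environment as above): a per-lane comparison produces a mask, both sides of the branch are computed for all lanes, and a blend keeps each lane's result from the path it actually took.

#include <immintrin.h>

/* Illustrative only: per-lane version of
 *   out[i] = (x[i] > 0.0f) ? x[i] * 2.0f : -x[i];
 * Every lane executes both the "then" and "else" work; the comparison mask
 * selects which result each lane keeps. */
void conditional_simd(const float *x, float *out)
{
    __m256 v     = _mm256_loadu_ps(x);
    __m256 mask  = _mm256_cmp_ps(v, _mm256_setzero_ps(), _CMP_GT_OQ); /* lane-wise x > 0          */
    __m256 then_ = _mm256_mul_ps(v, _mm256_set1_ps(2.0f));            /* x * 2  (taken lanes)     */
    __m256 else_ = _mm256_sub_ps(_mm256_setzero_ps(), v);             /* -x     (not-taken lanes) */
    _mm256_storeu_ps(out, _mm256_blendv_ps(else_, then_, mask));      /* keep each lane's winner  */
}
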
Examples:

▪ Intel Core i9 (Coffee Lake)


o 8 cores
o 8 SIMD ALUs per core

▪ NVIDIA GTX480
o 15 cores
o 32 SIMD ALUs per core
o 1.3 TFLOPS

21
Summary
▪ Several forms of parallel execution in modern processors
o Multi-core: use multiple processing cores
• Provides thread-level parallelism: simultaneously execute a completely
different instruction stream on each core
• Software decides when to create threads (e.g., via pthread API)

o SIMD: use multiple ALUs controlled by the same instruction stream (within a core)
  • Efficient design for data-parallel workloads by exploiting DLP (Data-Level Parallelism)
  • Vectorization can be done by the compiler or at runtime by hardware

o Superscalar: exploit ILP (Instruction-Level Parallelism) within an instruction stream
  • Process different instructions from the same instruction stream in parallel (within a core)
  • Parallelism is discovered dynamically by the hardware during execution

22
Agenda

Multi-threading for hiding memory latency

23
Terminology
▪ Memory Latency
o The amount of time for a memory request (e.g., a load or store) to be serviced by
  the memory system
o Example: 100 cycles, 100 ns

▪ Memory Bandwidth
o The rate at which the memory system can provide data to a processor
o Example: 20 GB/s

▪ Stall
o A processor "stalls" when it cannot run the next instruction in an instruction
  stream because of a dependency on a previous instruction
o Accessing memory is a major source of stalls
o Memory latency: more than 100 cycles

24                                                    Slide credit: CMU 15-418/15-618
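As a back-of-the-envelope illustration (not from the slides) of how these two numbers interact: roughly bandwidth x latency bytes must be outstanding at any time to keep the memory system busy, which is what the multi-threading in the following slides provides.

#include <stdio.h>

/* Rough check using the example numbers above (illustrative, not from the lecture):
 * data that must be "in flight" to keep the memory system busy = bandwidth x latency. */
int main(void)
{
    double bandwidth = 20e9;    /* 20 GB/s */
    double latency   = 100e-9;  /* 100 ns  */
    printf("outstanding bytes needed: %.0f\n", bandwidth * latency);  /* prints 2000 */
    return 0;
}
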
Hiding stalls with multi-threading
▪ Idea: interleave processing of multiple threads on the same core to hide
stalls

25
Slide credit : CMU 15-418/15-618
Multi-threading summary
▪ Benefits: use a core’s ALU resources more efficiently by hiding memory
latency

▪ Costs
o Require additional storage for thread contexts
o Relies heavily on memory bandwidth
• More threads → Larger working set → less cache space per thread
• May go to memory more often, but can hide the latency

29
Slide credit : CMU 15-418/15-618
Agenda

GPU Architectures

30
CPU and GPU are designed very differently
▪ CPU is designed to minimize the execution latency of a single thread
▪ GPU is designed to maximize the computation throughput
▪ GPU uses a larger fraction of its silicon for computation than CPU
▪ GPU consumes an order of magnitude less energy per operation than CPU
o 2 nJ/operation on a CPU vs. 200 pJ/operation on a GPU

[Figure: CPU with latency-oriented cores vs. GPU with throughput-oriented cores. The CPU chip devotes a large area per core to control logic, a local cache, and registers alongside a relatively small SIMD unit; the GPU chip packs many compute units, each with a small cache/local memory, threading hardware, registers, and wide SIMD units.]
Modern GPU looks like
▪ NVIDIA Pascal

32
Inside a GPU

33
Inside a GPU
▪ Hierarchical Approach

[Figure: hierarchical organization.
 SPA (Streaming Processor Array, also called a GPC: Graphics Processing Cluster) contains multiple TPCs.
 TPC (Texture Processor Cluster) contains several SMs plus a TEX (texture processor) unit.
 SM (Streaming Multiprocessor) contains an instruction L1 cache, a data L1 cache, an instruction fetch/dispatch unit, shared memory, eight SPs (Streaming Processors), and two SFUs.]

34
Inside a GPU

▪ SPA : Streaming Processor Array


▪ TPC : Texture Processor Cluster
o Multiple SMs + TEX
o TEX : texture processor for graphics purpose
▪ SM : Streaming Multiprocessor
o Multiple Processors (SPs)
o Multi-threaded processor core
o Fundamental processing unit for thread block
▪ SP (or CUDA core) : Streaming Processor
o ALU for a single thread
▪ SFU: special function unit
o For complex math functions: sin, cos, square root, ...

35
Inside a GPU
▪ GPUs have many Streaming Multiprocessors (SMs)
o Each SM has multiple processors (SPs) but only one instruction unit
  • All SPs within an SM share a program counter
o Groups of processors must therefore run the exact same set of instructions at any
  given time within a single SM

[Figure: a GPU diagram in which each highlighted row is an SM and each small box within it is an SP.]

▪ When a kernel (GPU code) is called, the task is divided up into threads
o Each thread handles a small portion of the given task
o All threads execute the same kernel code (a minimal kernel sketch follows below)
▪ The threads are divided into blocks
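To connect this back to the running example, here is a minimal CUDA sketch (not taken from the slides) in which every thread executes the same kernel code and computes one element of result. The kernel name, block size, and device pointer names are illustrative.

__global__ void mul_kernel(int N, const float *x, float *result)
{
    // Each thread handles one i of the outer loop.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {                        // guard threads past the end of the array
        float value = 0.0f;
        for (int j = 0; j < N; j++)
            value += x[i] * j;
        result[i] = value;
    }
}

// Host-side launch (illustrative): N threads grouped into 256-thread blocks,
// assuming d_x and d_result already point to device memory.
// mul_kernel<<<(N + 255) / 256, 256>>>(N, d_x, d_result);
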
Inside a GPU
▪ Each block is assigned to an SM
▪ Inside the SM, the block is divided into warps of threads
o A warp consists of 32 threads
o All 32 threads MUST run the exact same set of instructions at the same time ➔ SIMT
  • Due to the fact that there is only one instruction unit
o Warps are run concurrently in an SM

[Figure: the same SM/SP diagram as before.]


Inside a GPU
▪ Microarchitecture of Generic GPGPU core
Example: NVIDIA Fermi Architecture

▪ 16 SMs
▪ Each with 32 cores
o 512 total cores
▪ Each SM hosts up to
o 48 warps (=1,536 threads)
• Warp : 32 threads
▪ In flight, up to
o 24,576 threads

39
Example: NVIDIA Fermi Architecture
▪ Streaming Multiprocessor (SM)
o 32 Streaming Processors (SP) = CUDA Core
o 16 Load/store units
o 4 Special Function Units (SFU)
o 64KB high speed on-chip memory (L1+shared
memory)
o Interface to the L2 cache
o 32K of 32-bit registers
o Two warp schedulers, two dispatch units

▪ SP (CUDA core)
o execution unit for integer and floating-point numbers
o 32-bit precision for all instructions

40
Thread Scheduling/Execution
▪ Threads run concurrently
o SM assigns/maintains thread IDs
o SM manages/schedules thread execution
▪ Each thread block is divided into 32-thread warps

[Figure: a thread block (1 to 1024 threads) is divided into warps of 32 threads each; with 32 cores per SM, one warp executes in parallel across the SM's cores.]

41                                                    Slide credit: Prof. Baek
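A small illustrative CUDA snippet (not from the slides) showing how a thread can work out which warp, and which lane within that warp, it belongs to, using the 32-thread warp size described above; warpSize is the built-in CUDA constant.

__global__ void warp_info_kernel(void)
{
    int tid  = threadIdx.x;          // thread index within its block
    int warp = tid / warpSize;       // which 32-thread warp of the block this thread is in
    int lane = tid % warpSize;       // position (0..31) within that warp
    // All threads with the same warp value are issued together by the SM.
    (void)warp; (void)lane;          // silence unused-variable warnings in this sketch
}
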


Thread Scheduling/Execution
▪ Warps are the scheduling units in an SM

▪ A scenario
o 3 blocks are assigned to an SM
o each block has 256 threads

▪ How many warps?
o each block has 256 / 32 = 8 warps
o the SM has 3 x 8 = 24 warps
o At any point in time, only one of the 24 warps will be selected for
  instruction fetch and execution

[Figure: blocks #1, #2, and #3 each contribute 8 warps to the SM's warp pool.]

42                                                    Slide credit: Prof. Baek
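The same arithmetic written out as a hedged helper (illustrative, not from the slides); the rounding up matters when the block size is not a multiple of 32.

// Illustrative: warps needed per block, rounding up for partial warps.
int warps_per_block(int threads_per_block)
{
    return (threads_per_block + 31) / 32;   // 256 threads -> 8 warps
}

// For the scenario above: 3 blocks x warps_per_block(256) = 24 warps resident on the SM.
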


SM Warp Scheduling
▪ All threads in a warp execute the same instruction when the warp is selected
o there is only one control logic per SM
▪ A memory access stalls the warp for a long time ➔ latency problem ➔ scheduling is required!

[Figure: the SM issues an instruction that reads memory; the result comes back only after 100 to 200 clock cycles.]

43                                                    Slide credit: Prof. Baek


SM Warp Scheduling to Hide Memory Latency
▪ SM hardware implements zero-overhead warp scheduling
o Warps whose next instruction has its operands ready for consumption are
  eligible for execution
o Eligible warps are selected for execution according to a prioritized scheduling policy

▪ Example
o Assumptions
  • 1 clock cycle is needed to dispatch the same instruction for all threads in a warp
  • one global memory access is needed every 4 instructions
o A minimum of 26 warps is needed to fully tolerate a 100-cycle memory latency:
  each warp does 4 cycles of work before stalling, so 100 / 4 = 25 other warps must
  be ready to run while it waits, plus the stalled warp itself

[Figure: the SM's multithreaded warp scheduler interleaving warps over time, e.g.,
 warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, ...,
 warp 8 instruction 12, warp 3 instruction 96.]

44                                                    Slide credit: Prof. Baek
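The same calculation as a tiny hedged helper (illustrative, not from the slides):

// Illustrative check of the 26-warp figure above.
int warps_to_hide_latency(int latency_cycles, int instructions_between_accesses)
{
    // Each warp keeps the core busy for (instructions_between_accesses x 1 cycle)
    // before it stalls on memory.
    int other_warps = latency_cycles / instructions_between_accesses;  // 100 / 4 = 25
    return other_warps + 1;                                            // plus the stalled warp = 26
}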


CPU vs GPU memory hierarchies

45
Slide credit : CMU 15-418/15-618
Next..
▪ Fundamentals of CUDA

46
