Lecture 2: GPU Architecture - 2025
Agenda
Example Program
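The slides that follow step this program through a simple processor one instruction at a time. As a rough C rendering (the function name, the loop bound TERMS, and the exact arithmetic are assumptions; the slides show only the instruction stream, which reads x[i] and writes result[i]), the per-element loop might look like:

#define TERMS 8                               /* assumed iteration count */

/* A minimal sketch of the example program: one independent
 * load-compute-store sequence per element of x[]. */
void example_program(const float* x, float* result, int N) {
    for (int i = 0; i < N; i++) {
        float acc = 0.0f;
        float xi  = x[i];                     /* ld  r0, x[r1]                    */
        for (int k = 0; k < TERMS; k++) {     /* and r2, r2, 0 ... add r2, r2, 1  */
            acc += xi * (float)k;             /* mul r4, r0, r2 ; add r5, r5, r4  */
        }
        result[i] = acc;                      /* st  result[r10], r5              */
    }
}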
Execute Program on a simple processor
▪ The core fetches/decodes one instruction per cycle, executes it on the ALU, and updates the execution context (registers); the input is x[i] and the output is result[i]
▪ Instruction stream (the PC advances by one instruction per cycle):
...
and r2, r2, 0
ld r0, x[r1]
mul r4, r0, r2
add r5, r5, r4
add r2, r2, 1
...
st result[r10], r5
...
▪ At cycle n the PC points to and r2, r2, 0; at cycle n+1 to ld r0, x[r1]; at cycle n+2 to mul r4, r0, r2
Execute Program on a Superscalar processor
▪ Exploit ILP: decode and execute multiple independent instructions in parallel
(Figure: at cycle n, multiple instructions from the same stream are fetched, decoded, and executed at once; input x[i], output result[i])
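A minimal C sketch of instruction-level parallelism (the function and variable names are assumptions, not from the slides): the two multiplies below have no dependence on each other, so a 2-wide superscalar core can decode and execute them in the same cycle, while the final add depends on both results and must issue later.

/* Two independent operations followed by a dependent one. */
float ilp_example(float x, float y) {
    float a = x * 2.0f;   /* independent of b        */
    float b = y * 3.0f;   /* independent of a        */
    return a + b;         /* depends on both a and b */
}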
Pre multi-core era
▪ Majority of transistors are used to perform operations that help a single
instruction stream run fast
(Figure: a single core whose transistors are devoted to Fetch/Decode, Execute (ALU), a large cache, and a memory prefetcher, all serving one instruction stream)
Two cores: compute two elements in parallel
(Figure: two cores, each with its own Fetch/Decode, Execute (ALU), and Execution Context, producing result[0] and result[1])
• Each core can be slower than a high-performance core (e.g., 0.75 times as fast)
• But the overall performance of two cores will be higher (e.g., 0.75 × 2 = 1.5)
Four cores: compute 4 elements in parallel
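A minimal pthreads sketch (not from the slides; N, NUM_THREADS, worker, and the per-element computation are assumptions) of how software creates the threads that feed four cores: each thread processes one quarter of the elements.

#include <pthread.h>
#include <stddef.h>

#define N 1024
#define NUM_THREADS 4

static float x[N], result[N];

typedef struct { int start; int end; } range_t;

/* Each thread runs the same per-element loop over its own range. */
static void* worker(void* arg) {
    range_t* r = (range_t*)arg;
    for (int i = r->start; i < r->end; i++)
        result[i] = x[i] * x[i];          /* stand-in for the per-element work */
    return NULL;
}

int main(void) {
    pthread_t threads[NUM_THREADS];
    range_t ranges[NUM_THREADS];
    int chunk = N / NUM_THREADS;
    for (int t = 0; t < NUM_THREADS; t++) {
        ranges[t].start = t * chunk;
        ranges[t].end   = (t + 1) * chunk;
        pthread_create(&threads[t], NULL, worker, &ranges[t]);
    }
    for (int t = 0; t < NUM_THREADS; t++)
        pthread_join(threads[t], NULL);   /* wait for all four workers to finish */
    return 0;
}

Compile with -pthread; the four threads are independent, so the OS can schedule one on each core.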
But how do you feed all these cores? ➔ Data-level Parallelism
Interesting property of Example Program
▪ Parallelism is across iterations of the loop
▪ All the iterations of the loop do the same thing
Add ALUs to increase compute capability
▪ SIMD (Single Instruction Multiple Data) Processing
o Share cost of fetch / decode across many ALUs
o Add ALUs and execute the same instruction on them with different
data
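A minimal AVX sketch of the idea (an illustration using x86 intrinsics, which are an assumption; the slides do not name an instruction set): a single 8-wide multiply instruction does the work of eight scalar multiplies while sharing one fetch/decode.

#include <immintrin.h>

/* Multiply each element of x by s, eight elements per instruction. */
void scale8(const float* x, float* result, int n, float s) {
    __m256 vs = _mm256_set1_ps(s);                           /* broadcast the scalar */
    int i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 vx = _mm256_loadu_ps(&x[i]);                  /* load 8 floats        */
        _mm256_storeu_ps(&result[i], _mm256_mul_ps(vx, vs)); /* 8 multiplies at once */
    }
    for (; i < n; i++)                                       /* scalar remainder     */
        result[i] = x[i] * s;
}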
What about conditional execution in SIMD?
▪ Mask (discard) the output of ALUs in lanes where the branch is not taken
▪ After the branch: continue at full performance
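A minimal sketch of how the masking works (again using AVX intrinsics as an assumed illustration): both sides of the branch are computed for every lane, and a per-lane mask selects which result is kept; the masked-off ALU outputs are simply discarded, which is why divergent branches cost throughput until they reconverge.

#include <immintrin.h>

/* Per-lane "if (x > 0) keep x; else keep -x", computed branch-free.
   n is assumed to be a multiple of 8 (remainder loop omitted for brevity). */
void vec_abs(const float* x, float* result, int n) {
    __m256 zero = _mm256_setzero_ps();
    for (int i = 0; i + 8 <= n; i += 8) {
        __m256 vx   = _mm256_loadu_ps(&x[i]);
        __m256 mask = _mm256_cmp_ps(vx, zero, _CMP_GT_OQ);   /* per-lane: x > 0 ? */
        __m256 neg  = _mm256_sub_ps(zero, vx);               /* "else" side: -x   */
        /* keep vx where the mask is set, neg where it is clear */
        _mm256_storeu_ps(&result[i], _mm256_blendv_ps(neg, vx, mask));
    }
}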
Examples:
▪ NVIDIA GTX480
o 15 cores
o 32 SIMD ALUs per core
o 1.3 TFLOPS
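For reference, the quoted throughput is consistent with the ALU count: 15 cores × 32 SIMD ALUs = 480 ALUs; at a shader clock of roughly 1.4 GHz with one fused multiply-add (2 floating-point operations) per ALU per cycle, 480 × 2 × ~1.4 GHz ≈ 1.3 TFLOPS (the clock value here is approximate).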
Summary
▪ Several forms of parallel execution in modern processors
o Multi-core: use multiple processing cores
• Provides thread-level parallelism: simultaneously execute a completely
different instruction stream on each core
• Software decides when to create threads (e.g., via pthread API)
o SIMD: use multiple ALUs controlled by the same instruction stream (within a core)
• Efficient design for data-parallel workloads by exploiting DLP (Data-Level Parallelism)
• Vectorization can be done by the compiler or at runtime by hardware
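For example, compilers such as GCC and Clang will typically auto-vectorize a simple per-element loop (like the scalar version of the earlier scale8 sketch) at -O3, emitting the same kind of SIMD instructions; when they cannot prove it safe, the programmer can fall back to intrinsics as sketched above.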
Agenda
Terminology
▪ Memory Latency
o The amount of time for a memory request (e.g., a load or store) to be serviced by the memory system
o Example: 100 cycles, 100 ns
▪ Memory Bandwidth
o The rate at which the memory system can provide data to a processor
o Example: 20 GB/s
▪ Stall
o A processor “stalls” when it cannot run the next instruction in an instruction stream because of a dependency on a previous instruction
o Accessing memory is a major source of stalls
o Memory latency: more than 100 cycles
Slide credit: CMU 15-418/15-618
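A back-of-the-envelope link between the two example numbers above (not from the slides): during one 100 ns memory latency, a 20 GB/s memory system could transfer 20 GB/s × 100 ns = 2,000 bytes, so a processor that simply waits for each request to complete leaves almost all of that bandwidth unused. This is the motivation for the latency-hiding technique on the next slide.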
Hiding stalls with multi-threading
▪ Idea: interleave processing of multiple threads on the same core to hide stalls
Slide credit: CMU 15-418/15-618
Multi-threading summary
▪ Benefits: use a core’s ALU resources more efficiently by hiding memory
latency
▪ Costs
o Requires additional storage for thread contexts
o Relies heavily on memory bandwidth
• More threads → larger working set → less cache space per thread
• May go to memory more often, but can hide the latency
Slide credit : CMU 15-418/15-618
Agenda
GPU Architectures
CPU and GPU are designed very differently
▪ CPU is designed to minimize the execution latency of a single thread
▪ GPU is designed to maximize the computation throughput
▪ GPU uses larger fraction of silicon for computation than CPU
▪ GPU consumes an order of magnitude less energy per operation than CPU
o ~2 nJ/operation on a CPU vs. ~200 pJ/operation on a GPU
CPU: Latency-Oriented Cores vs. GPU: Throughput-Oriented Cores
(Figure: a CPU chip made of a few large cores, each with a local cache, registers, control logic, and a SIMD unit, next to a GPU chip made of many small compute units, each with cache/local memory, threading hardware, registers, and SIMD units)
What a modern GPU looks like
▪ NVIDIA Pascal
Inside a GPU
▪ Hierarchical approach
o The chip contains many Texture Processor Clusters (TPCs)
o Each TPC contains Streaming Multiprocessors (SMs) and a texture unit (TEX)
o Each SM has instruction and data L1 caches, an instruction fetch/dispatch unit, shared memory, multiple Streaming Processors (SPs), and Special Function Units (SFUs)
Inside a GPU
▪ GPUs have many Streaming Multiprocessors (SMs)
o Each SM has multiple processors (SPs) but only one instruction unit
• All SPs within an SM share a program counter
o Groups of processors must run the exact same set of instructions at any given time within a single SM
▪ 16 SMs
▪ Each with 32 cores
o 512 total cores
▪ Each SM hosts up to
o 48 warps (= 1,536 threads)
• Warp: 32 threads
▪ In flight, up to
o 24,576 threads
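These numbers are consistent: 48 warps × 32 threads/warp = 1,536 threads per SM, and 16 SMs × 1,536 threads = 24,576 threads in flight across the chip.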
Example: NVIDIA Fermi Architecture
▪ Streaming Multiprocessor (SM)
o 32 Streaming Processors (SPs) = CUDA cores
o 16 load/store units
o 4 Special Function Units (SFUs)
o 64 KB of high-speed on-chip memory (L1 + shared memory)
o Interface to the L2 cache
o 32K 32-bit registers
o Two warp schedulers, two dispatch units
▪ SP (CUDA core)
o Execution unit for integer and floating-point numbers
o 32-bit precision for all instructions
Thread Scheduling/Execution
▪ Threads run concurrently
o SM assigns/maintains thread IDs
o SM manages/schedules thread execution
▪ Each thread block (1 to 1,024 threads) is divided into 32-thread warps
o 1 warp = 32 threads
▪ A scenario
o 3 blocks assigned to an SM
o each block has 256 threads
▪ Example: SM multithreaded warp scheduler
(Figure: over time, the scheduler interleaves ready warps, e.g., warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, ..., warp 8 instruction 12, warp 3 instruction 96)
o Assumptions
• 1 clock cycle is needed to dispatch the same instruction for all threads in a warp
• one global memory access is needed for every 4 instructions
o A minimum of 26 warps is needed to fully tolerate 100-cycle memory latency
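The 26-warp figure follows directly from the assumptions: each warp issues 4 instructions (4 cycles at one instruction per cycle) before it must wait on memory, so covering a 100-cycle latency requires 100 / 4 = 25 other warps that are ready to run, plus the stalled warp itself, for 26 warps in total.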
Slide credit : CMU 15-418/15-618
Next…
▪ Fundamentals of CUDA