This document explains how CUDA programming works and why it is designed the way it is, given the limits imposed by physics. CUDA aims to maximize GPU performance by using all available resources: SMs, threads, memory bandwidth, and caches. Performance is ultimately bounded by the physics of dynamic random access memory, where activating rows and switching pages costs latency, so CUDA programming focuses on coalescing memory accesses to maximize bandwidth and on hiding memory latency.


HOW CUDA PROGRAMMING WORKS

STEPHEN JONES, GTC 2022


SO WHY IS CUDA THE WAY IT IS?

Physics

The reason you’re using a GPU is for performance
You use CUDA to write performant code on a GPU
Performance is limited by the laws of physics
So CUDA is the way that it is because of physics

How GPU Computing Works [GTC 2021 - S31151]


The reason you’re using a GPU is for performance

This means using all the GPU resources that you can
THE NVIDIA AMPERE GPU ARCHITECTURE
These are the resources that are available

SMs: 108
Total threads: 221,184
Peak FP32 TFLOP/s: 19.5
Peak FP64 TFLOP/s (non-tensor): 9.7
Peak FP64 TFLOP/s (tensor): 19.5
Tensor Core precisions: FP64, TF32, BF16, FP16, I8, I4, B1
Shared memory per SM: 160 kB
L2 cache size: 40,960 kB
Memory bandwidth: 1555 GB/sec
GPU boost clock: 1410 MHz
NVLink interconnect: 600 GB/sec

ASCI White supercomputer


Lawrence Livermore National Laboratory
Top500 #1 in 2001, 7.9 teraflop/s
BUT FLOPS AREN’T THE ISSUE – BANDWIDTH IS
Ampere A100 GPU: 108 SMs in the GPU, running @ 1410 MHz (boost clock)
Each SM has 256 kB of registers and a 192 kB L1 cache, and can load 64 bytes per clock cycle

Peak memory request rate = 64 B x 108 SMs x 1410 MHz = 9750 gigabytes/sec

Ratio of bandwidth requested to bandwidth provided = 9750 / 1555 = 6.3x

HBM2 memory bandwidth = 1555 GB/sec, feeding 80 GB of HBM memory through the 40 MB L2 cache
FP64 FLOP/s based on memory speed alone = 1555 GB/sec / 8 bytes = 194 GFLOP/s
(compare with the 9.7 TFLOP/s peak non-tensor FP64 rate of the SMs)

Hitachi SR2201 supercomputer


University of Tokyo
Top500 #1 in 1996, 220 gigaflop/s
A CLOSER LOOK AT RANDOM ACCESS MEMORY
Read address: 001100010010011110100001101101110011

1. Activate the row and pull its data into the sense amplifiers
   This destroys the data in the row as the capacitors drain

2. Read from the page held in the amplifiers at the column index
   This does not destroy the data in the amplifiers

3. Repeated reads may be made from the same page at different column indexes
   “Burst” reads load multiple columns at a time

4. Before a new page can be fetched, the old row must be written back, because its data was destroyed
THIS IS WHERE THE PHYSICS COMES IN

Example HBM values:
Time to read a new column: CL = 16 cycles
Time to load a new page: TRCD = 16 cycles
Time to write back data: TRP = 16 cycles
Page (row) size = 1 kB

Each read has a cost (CL = 16 cycles)
Switching page has a 3x larger cost (TRP + TRCD + CL = 48 cycles)

This is because switching page requires charging/discharging capacitors with a physical RC time constant:

V_C = V_S (1 - e^(-t/RC))
SO WHAT DOES THIS ALL MEAN?

We’d expect a significant performance difference for coalesced vs. scattered reads

[Chart: HBM memory throughput as addresses diverge (8-byte reads, A100). Achieved bandwidth in GB/sec vs. stride interval in bytes between successive reads, from 8 B to 8192 B. Burst size = 64 bytes; HBM page size = 1 kB.]

On A100, achieved bandwidth falls from 1418 GB/sec for closely-spaced reads to 111 GB/sec for widely-spaced reads: 111 / 1418 = 8% of peak bandwidth. That’s 1/13th of peak bandwidth!
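A minimal sketch of how one could reproduce a curve like this on whatever GPU is at hand (the kernel, buffer sizes, and launch configuration below are illustrative assumptions, not the benchmark behind the chart). Each thread performs one 8-byte read, and the spacing between successive threads’ addresses is swept from 8 bytes up to 8 kB:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void strided_read(const double *src, double *dst, size_t stride)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    // Successive threads read addresses "stride" elements (8 * stride bytes) apart;
    // the coalesced write to dst just forces the load to actually happen.
    dst[i] = src[i * stride];
}

int main()
{
    const size_t bytes = 1ull << 30;                  // 1 GB read buffer (assumed to fit on the device)
    const size_t elems = bytes / sizeof(double);
    double *src, *dst;
    cudaMalloc((void **)&src, bytes);
    cudaMalloc((void **)&dst, elems * sizeof(double)); // big enough for the stride = 1 case
    cudaMemset(src, 0, bytes);

    for (size_t stride = 1; stride <= 1024; stride *= 2) {   // 8 B ... 8 kB between reads
        size_t threads = elems / stride;                     // stay inside the buffer
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        strided_read<<<(unsigned)(threads / 256), 256>>>(src, dst, stride);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        double useful_gb = threads * sizeof(double) / 1e9;
        printf("stride %5zu bytes: %8.1f GB/s of useful data\n",
               stride * sizeof(double), useful_gb / (ms / 1e3));

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
    }
    cudaFree(src);
    cudaFree(dst);
    return 0;
}

Normalizing by only the useful bytes delivered, the reported number should fall off much like the chart above: once the stride exceeds the 64-byte burst most of each burst is wasted, and once it exceeds the 1 kB page every warp touches many pages.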

CM-5 supercomputer
Los Alamos National Laboratory
First ever #1 in 1993, 59.7 gigaflop/s
DATA ACCESS PATTERNS REALLY MATTER

Row-major array traversal – each access pays only the column read latency CL:

for(y=0; y<M; y++) {
    for(x=0; x<N; x++) {
        load(array[y][x]);
    }
}

Column-major array traversal – each access pays the full row read latency TRAS = TRP + TRCD + CL:

for(x=0; x<N; x++) {
    for(y=0; y<M; y++) {
        load(array[y][x]);
    }
}

With a row-major array layout, the column-major traversal is 13x slower than the row-major traversal.
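The same contrast shows up in CUDA through the choice of which array dimension varies with the thread index. The two kernels below are a hypothetical illustration (the kernel names and layout assumption are mine, not from the talk) for a row-major array with M rows and N columns:

__global__ void read_along_rows(const float *array, float *out, int M, int N)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // consecutive threads -> consecutive columns
    int y = blockIdx.y;                              // one row per blockIdx.y
    if (x < N && y < M)
        out[y * N + x] = array[y * N + x];           // neighbouring threads touch neighbouring addresses
}

__global__ void read_down_columns(const float *array, float *out, int M, int N)
{
    int y = blockIdx.x * blockDim.x + threadIdx.x;   // consecutive threads -> consecutive *rows*
    int x = blockIdx.y;                              // one column per blockIdx.y
    if (x < N && y < M)
        out[y * N + x] = array[y * N + x];           // neighbouring threads touch addresses N floats apart
}

Launched, for example, as read_along_rows<<<dim3((N + 255) / 256, M), 256>>>(array, out, M, N), the first kernel keeps a warp’s 32 loads inside the same bursts and pages; in the second, each load lands N * 4 bytes from its neighbour’s, which for large N behaves like the right-hand end of the stride chart.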


The reason you’re using a GPU is for performance

This means using all the GPU resources that you can,
which means managing memory access patterns
But what’s this got to do with CUDA?
CUDA’S GPU EXECUTION HIERARCHY

Grid
of work

Divide into
many blocks

Many threads
in each block
THE CUDA THREAD BLOCK

A block has a fixed number of threads, which are guaranteed to be running simultaneously on the same SM
EVERY THREAD RUNS EXACTLY THE SAME PROGRAM
This is the “SIMT” model

__global__ void euclidian_distance(float2 *p1, float2 *p2, float *distance, int count) {
    // Calculate the index of the point my thread is working on.
    // It's all about this one line of code: each thread computes its own index.
    int index = threadIdx.x + (blockIdx.x * blockDim.x);

    // Check if the thread is in range before reading data.
    // This line is what SIMT is all about: every thread evaluates the branch for itself.
    if (index < count) {
        // Compute the Euclidean distance between two points
        float2 dp = p2[index] - p1[index];
        float dist = sqrtf(dp.x * dp.x + dp.y * dp.y);

        // Write out the computed distance
        distance[index] = dist;
    }
}
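Host-side launch of the kernel above; the block size, problem size, and memory management here are illustrative choices, not from the talk. The grid size is rounded up so that the if (index < count) guard handles the final, partially-full block:

#include <cuda_runtime.h>
// euclidian_distance() is the __global__ kernel shown on the slide above.

int main()
{
    int count = 1 << 20;                                   // illustrative problem size
    float2 *p1, *p2;
    float *distance;
    cudaMalloc((void **)&p1, count * sizeof(float2));
    cudaMalloc((void **)&p2, count * sizeof(float2));
    cudaMalloc((void **)&distance, count * sizeof(float));
    // ... copy the input points into p1 and p2 with cudaMemcpy ...

    int threadsPerBlock = 256;                             // a common, but arbitrary, choice
    int blocks = (count + threadsPerBlock - 1) / threadsPerBlock;  // round up to cover every point
    euclidian_distance<<<blocks, threadsPerBlock>>>(p1, p2, distance, count);
    cudaDeviceSynchronize();

    cudaFree(p1); cudaFree(p2); cudaFree(distance);
    return 0;
}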
Thread block

The block of threads is broken up into “warps” of 32 threads
A “warp” is the vector element of the GPU

Warp 0: threads 0...31   Warp 1: threads 32...63   Warp 2: threads 64...95   Warp 3: threads 96...127   Warp 4: threads 128...159   Warp 5: threads 160...191

int index = threadIdx.x + (blockIdx.x * blockDim.x);
...
float2 dp = p2[index] - p1[index];

Each thread in a warp loads p1[f(thread ID)] – an address that is a function of its thread ID

A float2 holds a float x and a float y, so sizeof(float2) is 2 x 4 bytes = 8 bytes
One warp loads 32 x 8 bytes of data = 256 bytes

WARP EXECUTION ON THE GPU

[A100 SM block diagram: the SM executes 4 warps concurrently]

float2 dp = ... p1[f(thread ID)];

Memory request issuing from one SM: 4 warps x 256 bytes per warp = 1024 bytes
Memory page size = 1024 bytes

Four warps (threads 0...127) all execute:

float2 dp = p2[index] - p1[index];

For a single thread, this would look like random-address memory reads
But because this is executed in parallel by 128 threads, it’s actually adjacent reads of whole pages of memory – the coalesced end of the HBM throughput curve (8-byte reads, A100)
USING ALL THE GPU RESOURCES YOU CAN GET

[Ampere A100 GPU resource table and SM diagram, as above: 108 SMs, each with 256 kB of registers and a 192 kB L1 cache, sharing a 40 MB L2 cache and 80 GB of HBM at 1555 GB/sec]
The reason you’re using a GPU is for performance

This means using all the GPU resources that you can,
which means managing memory access patterns
and also something called occupancy
CUDA’S GPU EXECUTION HIERARCHY

Grid
of work

Divide into
many blocks

Many threads
in each block
START WITH SOME WORK TO PROCESS
DIVIDE INTO A SET OF EQUAL-SIZED BLOCKS: THIS IS THE “GRID” OF WORK
EACH BLOCK WILL NOW BE PROCESSED INDEPENDENTLY
CUDA does not guarantee the order of execution and you cannot exchange data between blocks
EVERY BLOCK GETS PLACED ONTO AN SM
CUDA does not guarantee the order of execution and you cannot exchange data between blocks

[Diagram: blocks from the grid being placed onto SM 0 through SM 11]
BLOCKS CONTINUE TO GET PLACED UNTIL EACH SM IS “FULL”
When a block completes its work and exits, a new block is placed in its spot until the whole grid is done
WHAT DOES IT MEAN FOR AN SM TO BE “FULL”?

LOOKING INSIDE A STREAMING MULTIPROCESSOR

A100 SM Resources
2048       Max threads per SM
32         Max blocks per SM
65,536     Total registers per SM
160 kB     Total shared memory in SM
32         Threads per warp
4          Concurrent warps active
64         FP32 cores per SM
32         FP64 cores per SM
192 kB     Max L1 cache size
90 GB/sec  Load bandwidth per SM
1410 MHz   GPU Boost Clock
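These numbers are for A100. As a sketch, the same per-SM limits can be queried at runtime for whatever GPU the code is actually running on, using fields of cudaDeviceProp (maxBlocksPerMultiProcessor needs CUDA 11 or newer):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                     // query device 0

    printf("SMs:                  %d\n",     prop.multiProcessorCount);
    printf("Max threads per SM:   %d\n",     prop.maxThreadsPerMultiProcessor);
    printf("Max blocks per SM:    %d\n",     prop.maxBlocksPerMultiProcessor);
    printf("Registers per SM:     %d\n",     prop.regsPerMultiprocessor);
    printf("Shared memory per SM: %zu kB\n", prop.sharedMemPerMultiprocessor / 1024);
    printf("Threads per warp:     %d\n",     prop.warpSize);
    printf("GPU clock:            %d MHz\n", prop.clockRate / 1000);
    return 0;
}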
ANATOMY OF A THREAD BLOCK

All blocks in a grid run the same program using the same number of threads, leading to 3 resource requirements:

1. Block size – the number of threads which must be concurrent
2. Shared memory – common to all threads in a block
3. Registers – depends on program complexity. Registers are a per-thread resource, so the total budget is (threads-per-block x registers-per-thread)

__shared__ float mean = 0.0f;

__device__ float mean_euclidian_distance(float2 *p1, float2 *p2) {
    // Compute the Euclidean distance between two points
    float2 dp = p2[threadIdx.x] - p1[threadIdx.x];
    float dist = sqrtf(dp.x * dp.x + dp.y * dp.y);

    // Accumulate the mean distance atomically and return the distance
    atomicAdd(&mean, dist / blockDim.x);
    return dist;
}

Every thread runs exactly the same program (this is the “SIMT” model)
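Registers per thread are chosen by the compiler, but they can be observed and bounded. As a sketch (this kernel and its numbers are illustrative, not from the talk): compiling with nvcc -Xptxas -v prints the registers used per thread, and __launch_bounds__ lets you trade registers for resident blocks:

// __launch_bounds__(maxThreadsPerBlock, minBlocksPerMultiprocessor):
// promising at most 256 threads per block and asking for at least 8 resident
// blocks per SM nudges the compiler to fit within
// 65,536 registers / (256 threads x 8 blocks) = 32 registers per thread.
__global__ void __launch_bounds__(256, 8)
distance_kernel(const float2 *p1, const float2 *p2, float *distance, int count)
{
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < count) {
        float dx = p2[index].x - p1[index].x;
        float dy = p2[index].y - p1[index].y;
        distance[index] = sqrtf(dx * dx + dy * dy);
    }
}

If the cap is too aggressive the compiler spills values to local memory instead, so it is worth re-checking the -Xptxas -v report after adding the bound.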
HOW THE GPU PLACES BLOCKS ON AN SM

A100 SM key resources: 2048 threads, 65,536 registers, 160 kB shared memory
A block has a fixed number of threads, always running on a single SM

Example block resource requirements:
256 Threads per block
64 Registers per thread
(256 x 64) = 16,384 Registers per block
48 kB Shared memory per block

Blocks are placed onto the SM one by one until some resource runs out. With 48 kB of shared memory per block, shared memory fills up after 3 blocks (144 kB of 160 kB), even though threads (768 of 2048) and registers (49,152 of 65,536) still have room to spare.

If the same blocks instead need only 32 kB of shared memory each, the register file becomes the limiting resource: 4 blocks fit (4 x 16,384 = 65,536 registers), while shared memory (128 kB of 160 kB) and threads (1024 of 2048) still have headroom.
OCCUPANCY

Shared memory limited case: occupancy of 3 blocks/SM
Register limited case: occupancy of 4 blocks/SM

OCCUPANCY IS THE MOST POWERFUL TOOL FOR TUNING A PROGRAM
Moving from the shared-memory-limited 3 blocks/SM to the register-limited 4 blocks/SM makes the program 33% faster
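The runtime can report this occupancy directly. Below is a sketch using cudaOccupancyMaxActiveBlocksPerMultiprocessor; the kernel is a stand-in, and 48 kB is the slide’s example shared-memory figure, passed here as dynamic shared memory:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void distance_kernel(const float2 *p1, const float2 *p2, float *d, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) {
        float dx = p2[i].x - p1[i].x;
        float dy = p2[i].y - p1[i].y;
        d[i] = sqrtf(dx * dx + dy * dy);
    }
}

int main()
{
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM,
        distance_kernel,
        256,              // threads per block
        48 * 1024);       // dynamic shared memory per block, in bytes

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("Occupancy: %d blocks/SM = %d of %d threads per SM\n",
           blocksPerSM, blocksPerSM * 256, prop.maxThreadsPerMultiProcessor);
    return 0;
}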


FILLING IN THE GAPS

Resource requirements (blue grid):
256 Threads per block
64 Registers per thread
(256 x 64) = 16,384 Registers per block
48 kB Shared memory per block

In this shared-memory-limited case, the blue grid’s blocks exhaust shared memory while leaving threads and registers unused.

Resource requirements (green grid):
512 Threads per block
32 Registers per thread
(512 x 32) = 16,384 Registers per block
0 kB Shared memory per block

Because the green grid needs no shared memory, one of its blocks can be placed into the threads and registers the blue grid leaves free, filling in the gaps on each SM.
CONCURRENCY: DOING MULTIPLE THINGS AT ONCE

Copying memory and processing data (the “flower”) are separate pieces of work

CONCURRENCY: DEPENDENCIES

Copy to GPU -> Process Flower -> Copy from GPU
Each step depends on the previous one, so dependent work for two images runs back to back:
Copy to GPU -> Process Flower -> Copy from GPU -> Copy to GPU -> Process Flower -> Copy from GPU

CONCURRENCY: INDEPENDENCIES

Stream 1: Copy to GPU -> Process Flower -> Copy from GPU
Stream 2: Copy to GPU -> Process Flower -> Copy from GPU
Work in different streams is independent, so the two sequences can overlap

CONCURRENCY: IT’S REALLY ALL ABOUT OVERSUBSCRIPTION

Blocks from the “process flower” kernels and the copies in both streams are all placed onto the same SM resources, so independent work from one stream fills in whatever the other stream leaves idle
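A sketch of the two-stream pattern in the diagram, with process_flower, the buffer size, and the pinned host buffers as illustrative stand-ins: each stream copies its input in, runs its kernel, and copies its result out, and because the streams are independent, one stream’s copies can overlap the other’s kernel:

#include <cuda_runtime.h>

__global__ void process_flower(const float *in, float *out, int n)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) out[i] = in[i] * 2.0f;         // placeholder "processing"
}

int main()
{
    const int N = 1 << 22;
    const size_t bytes = N * sizeof(float);

    float *h_in[2], *h_out[2], *d_in[2], *d_out[2];
    cudaStream_t stream[2];
    for (int s = 0; s < 2; ++s) {
        cudaMallocHost((void **)&h_in[s], bytes);   // pinned host memory so copies can be asynchronous
        cudaMallocHost((void **)&h_out[s], bytes);
        cudaMalloc((void **)&d_in[s], bytes);
        cudaMalloc((void **)&d_out[s], bytes);
        cudaStreamCreate(&stream[s]);
    }

    for (int s = 0; s < 2; ++s) {
        // Work within one stream runs in order; work in different streams may overlap.
        cudaMemcpyAsync(d_in[s], h_in[s], bytes, cudaMemcpyHostToDevice, stream[s]);
        process_flower<<<(N + 255) / 256, 256, 0, stream[s]>>>(d_in[s], d_out[s], N);
        cudaMemcpyAsync(h_out[s], d_out[s], bytes, cudaMemcpyDeviceToHost, stream[s]);
    }
    cudaDeviceSynchronize();                  // wait for both streams to finish

    for (int s = 0; s < 2; ++s) {
        cudaStreamDestroy(stream[s]);
        cudaFreeHost(h_in[s]); cudaFreeHost(h_out[s]);
        cudaFree(d_in[s]);     cudaFree(d_out[s]);
    }
    return 0;
}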


SO WHY IS CUDA THE WAY IT IS?
Physics

HOW CUDA PROGRAMMING WORKS
STEPHEN JONES, GTC 2022
