How CUDA Programming Works
Physics
The reason you’re using a GPU is for performance
You use CUDA to write performant code on a GPU
Performance is limited by the laws of physics
So CUDA is the way that it is because of
Physics
This means using all the GPU resources that you can
THE NVIDIA AMPERE GPU ARCHITECTURE
These are the resources that are available:
SMs: 108
Total threads: 221,184
Peak FP32 TFLOP/s: 19.5
Peak FP64 TFLOP/s (non-tensor): 9.7
Peak FP64 TFLOP/s (tensor): 19.5
Tensor Core Precision: FP64, TF32, BF16, FP16, I8, I4, B1
Shared Memory per SM: 160 kB
L2 Cache Size: 40,960 kB
Memory Bandwidth: 1555 GB/sec
GPU Boost Clock: 1410 MHz
NVLink Interconnect: 600 GB/sec
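If you want to see the equivalent numbers for whatever GPU you are actually running on, the CUDA runtime reports most of them. A minimal query sketch (the clock, bandwidth, and TFLOP/s figures above come from the spec sheet rather than from this structure):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0

    printf("Device:               %s\n", prop.name);
    printf("SMs:                  %d\n", prop.multiProcessorCount);
    printf("Max threads per SM:   %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Total threads:        %d\n", prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor);
    printf("Registers per SM:     %d\n", prop.regsPerMultiprocessor);
    printf("Shared memory per SM: %zu kB\n", prop.sharedMemPerMultiprocessor / 1024);
    printf("L2 cache size:        %d kB\n", prop.l2CacheSize / 1024);
    printf("Warp size:            %d\n", prop.warpSize);
    return 0;
}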
BUT FLOPS AREN’T THE ISSUE – BANDWIDTH IS
[Diagram: 108 SMs, each with 256 kB of registers and a 192 kB L1 cache, all feeding one 40 MB L2 cache]
108 SMs in the GPU, running @ 1410 MHz (boost clock)
Each SM can load 64 bytes per clock cycle
Peak memory request rate = 64 B × 108 SMs × 1410 MHz ≈ 9,750 GB/sec
Ratio of bandwidth requested to bandwidth provided = 9,750 / 1,555 ≈ 6.3x
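Spelling out the arithmetic behind that ratio as a tiny host-only sketch (all constants are the A100 figures from this slide):

#include <cstdio>

int main() {
    // Numbers from the slide (A100)
    double bytes_per_sm_per_clock = 64.0;     // each SM can load 64 bytes per cycle
    double num_sms                = 108.0;
    double clock_hz               = 1410.0e6; // 1410 MHz boost clock
    double dram_bandwidth_gbps    = 1555.0;   // delivered HBM2 bandwidth

    double requested_gbps = bytes_per_sm_per_clock * num_sms * clock_hz / 1.0e9;
    printf("Peak request rate: %.0f GB/sec\n", requested_gbps);                  // ~9746 (the slide rounds to 9750)
    printf("Oversubscription:  %.1fx\n", requested_gbps / dram_bandwidth_gbps);  // ~6.3x
    return 0;
}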
FP64 FLOP/s BASED ON MEMORY SPEED = 1555 GB/sec ÷ 8 bytes = 194 GFLOP/s
(compare with the 9.7 TFLOP/s peak non-tensor FP64 rate from the table above)
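To make that 194 GFLOP/s bound concrete: any FP64 kernel that needs a fresh 8-byte operand per floating-point operation is limited by the 1555 GB/sec of memory bandwidth, not by the 9.7 TFLOP/s of FP64 compute. A minimal sketch of such a bandwidth-bound kernel (the kernel itself is illustrative, not from the talk):

// Each element costs one FP64 add but 16 bytes of DRAM traffic (8 read + 8 written),
// so at 1555 GB/sec this tops out near 1555/16, about 97 GFLOP/s, even further below
// the 9.7 TFLOP/s FP64 peak than the slide's one-operand-per-FLOP estimate of 194 GFLOP/s.
__global__ void add_scalar(const double *in, double *out, double s, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n)
        out[i] = in[i] + s;
}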
V_C = V_S (1 − e^(−t/RC))
SO WHAT DOES THIS ALL MEAN?
[Chart: achieved read bandwidth (GB/sec, axis 64–256) vs. stride interval in bytes between successive reads (8 to 8192), with a marked point at 111 GB/sec]
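The chart comes from a strided-read experiment. A rough reconstruction of the kind of kernel that produces it (my sketch, not the original benchmark code): consecutive threads read elements spaced strideBytes apart, so as the stride grows, less of each memory transaction carries useful data.

// Hypothetical reconstruction: each thread reads one float, with reads separated by
// strideBytes. Large strides touch a new DRAM sector per element, so most of each
// sector fetched is wasted and the effective read bandwidth drops.
__global__ void strided_read(const float *data, float *out, size_t n, int strideBytes)
{
    size_t stride = strideBytes / sizeof(float);
    size_t tid    = threadIdx.x + (size_t)blockIdx.x * blockDim.x;
    size_t i      = tid * stride;
    if (i < n)
        out[tid] = data[i];   // the output write is contiguous; the strided reads dominate
}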
[Photo: CM-5 supercomputer, Los Alamos National Laboratory. First ever #1 on the Top500 in 1993, at 59.7 gigaflop/s]
DATA ACCESS PATTERNS REALLY MATTER
This means using all the GPU resources that you can,
which means managing memory access patterns
But what’s this got to do with CUDA?
CUDA’S GPU EXECUTION HIERARCHY
Grid of work → divide into many blocks → many threads in each block
THE CUDA THREAD BLOCK

__global__ void euclidian_distance(float2 *p1, float2 *p2, float *distance, int count) {
    // Calculate the index of the point my thread is working on
    int index = threadIdx.x + (blockIdx.x * blockDim.x);

It’s all about this one line of code.

[Diagram: one thread block of 192 threads, drawn as six warps: threads 0...31, 32...63, 64...95, 96...127, 128...159, 160...191]
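For reference, here is a completed version of the kernel on the slide, plus a launch, so the index arithmetic has something concrete to drive. Only the signature, the comment, and the index line come from the slide; the body and the 256-thread launch configuration are an illustrative sketch.

#include <cuda_runtime.h>

__global__ void euclidian_distance(float2 *p1, float2 *p2, float *distance, int count) {
    // Calculate the index of the point my thread is working on
    int index = threadIdx.x + (blockIdx.x * blockDim.x);
    if (index < count) {
        float dx = p2[index].x - p1[index].x;
        float dy = p2[index].y - p1[index].y;
        distance[index] = sqrtf(dx * dx + dy * dy);
    }
}

// One thread per point: 256-thread blocks, enough blocks to cover all `count` points.
void launch_euclidian_distance(float2 *d_p1, float2 *d_p2, float *d_distance, int count) {
    int blockSize = 256;
    int gridSize = (count + blockSize - 1) / blockSize;
    euclidian_distance<<<gridSize, blockSize>>>(d_p1, d_p2, d_distance, count);
}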
This means using all the GPU resources that you can,
which means managing memory access patterns
and also something called occupancy
CUDA’S GPU EXECUTION HIERARCHY
Grid of work → divide into many blocks → many threads in each block
START WITH SOME WORK TO PROCESS
DIVIDE INTO A SET OF EQUAL-SIZED BLOCKS: THIS IS THE “GRID” OF WORK
EACH BLOCK WILL NOW BE PROCESSED INDEPENDENTLY
CUDA does not guarantee the order of execution and you cannot exchange data between blocks
EVERY BLOCK GETS PLACED ONTO AN SM
CUDA does not guarantee the order of execution and you cannot exchange data between blocks
[Diagram: blocks from the grid landing on SM 0 through SM 11]
BLOCKS CONTINUE TO GET PLACED UNTIL EACH SM IS “FULL”
When a block completes its work and exits, a new block is placed in its spot until the whole grid is done
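Because blocks may run in any order and cannot exchange data, a kernel only has to be correct when each block handles its own slice independently. One standard way to write that, sketched here, is the grid-stride loop, which stays correct no matter how many blocks the GPU keeps resident at once:

// Each thread starts at its global index and strides by the total number of threads
// in the grid, so correctness never depends on how many blocks are resident at once
// or on the order in which the GPU chooses to run them.
__global__ void scale(float *data, float a, int n) {
    for (int i = threadIdx.x + blockIdx.x * blockDim.x; i < n; i += blockDim.x * gridDim.x)
        data[i] *= a;
}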
WHAT DOES IT MEAN FOR AN SM TO BE “FULL”?
LOOKING INSIDE A STREAMING MULTIPROCESSOR
A100 SM Resources
2048 Max threads per SM
32 Max blocks per SM
65,536 Total registers per SM
160 kB Total shared memory in SM
32 Threads per warp
4 Concurrent warps active
64 FP32 cores per SM
32 FP64 cores per SM
192 kB Max L1 cache size
90 GB/sec Load bandwidth per SM
1410 MHz GPU Boost Clock
ANATOMY OF A THREAD BLOCK
A block has a fixed number of threads.
All blocks in a grid run the same program using the same number of threads, leading to 3 resource requirements:
1. Block size – the number of threads which must be concurrent
2. Registers per thread – and therefore registers per block
3. Shared memory per block
[Diagram: thread blocks from the grid of work packing into an SM’s capacity of 2048 threads]
HOW THE GPU PLACES BLOCKS ON AN SM
Example block resource requirements:
256 Threads per block
64 Registers per thread
(256 × 64) = 16,384 Registers per block
32 kB Shared memory per block
[Diagram: blocks 0 through 4 from the grid being placed onto an SM as its resources allow]
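This per-block resource arithmetic is exactly what the CUDA occupancy API computes for you. A small sketch (the kernel is a trivial placeholder, so it will report the thread-count limit; a real kernel’s register and shared-memory usage usually lowers the answer):

#include <cstdio>
#include <cuda_runtime.h>

// Trivial placeholder kernel; a real kernel's register and shared-memory
// usage is what actually drives the occupancy result.
__global__ void my_kernel(float *data, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n)
        data[i] *= 2.0f;
}

int main() {
    int blockSize = 256;
    int maxBlocksPerSM = 0;
    // How many 256-thread blocks of my_kernel fit on one SM (0 bytes of dynamic shared memory).
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxBlocksPerSM, my_kernel, blockSize, 0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    double occupancy = 100.0 * maxBlocksPerSM * blockSize / prop.maxThreadsPerMultiProcessor;
    printf("%d blocks per SM -> %.0f%% occupancy\n", maxBlocksPerSM, occupancy);
    return 0;
}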
OCCUPANCY
[Diagram: the SM’s register file and shared memory being divided among resident blocks; whichever resource runs out first determines how many blocks fit, and the rest sits idle]
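Register and shared-memory footprints are partly under your control. A hedged sketch of the standard __launch_bounds__ qualifier (the 256/4 values are illustrative): it tells the compiler the largest block you will launch and how many resident blocks per SM you are aiming for, so it can cap register usage per thread accordingly.

// Tell the compiler: blocks will have at most 256 threads, and aim to keep
// at least 4 blocks resident per SM, limiting registers per thread if necessary.
__global__ void __launch_bounds__(256, 4) process(float *data, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n)
        data[i] = data[i] * 2.0f + 1.0f;
}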
FILLING IN THE GAPS
Resource requirements (green grid):
512 Threads per block
32 Registers per thread
(512 × 32) = 16,384 Registers per block
0 kB Shared memory per block
[Diagram: blocks from the green grid slotting into the register space left over on the SM, since they need no shared memory]
CONCURRENCY: DOING MULTIPLE THINGS AT ONCE
[Diagram: copying memory and processing a flower image at the same time]
CONCURRENCY: DEPENDENCIES
[Timeline: copy to GPU → process flower → copy from GPU, then the same three steps again for the next flower, all back-to-back]
CONCURRENCY: INDEPENDENCIES
[Timeline: two streams, each running copy → process → copy for its own flower block; because the streams are independent, their work overlaps]
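That picture is exactly what CUDA streams express: work in one stream runs in order, work in different streams is free to overlap. A minimal sketch of the two-flower pipeline (buffer names, sizes, and the placeholder kernel are illustrative; the stream and async-copy calls are the standard runtime API):

#include <cuda_runtime.h>

// Placeholder "processing": invert each pixel of the image block.
__global__ void process_flower(unsigned char *img, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n)
        img[i] = 255 - img[i];
}

void pipeline(unsigned char *h_img[2], unsigned char *d_img[2], int n) {
    cudaStream_t stream[2];
    for (int s = 0; s < 2; ++s)
        cudaStreamCreate(&stream[s]);

    for (int s = 0; s < 2; ++s) {
        // Copy in, process, copy out: ordered within stream s,
        // but free to overlap with the other stream's copies and kernel.
        cudaMemcpyAsync(d_img[s], h_img[s], n, cudaMemcpyHostToDevice, stream[s]);
        process_flower<<<(n + 255) / 256, 256, 0, stream[s]>>>(d_img[s], n);
        cudaMemcpyAsync(h_img[s], d_img[s], n, cudaMemcpyDeviceToHost, stream[s]);
    }

    for (int s = 0; s < 2; ++s) {
        cudaStreamSynchronize(stream[s]);
        cudaStreamDestroy(stream[s]);
    }
}

For the copies to really overlap with kernel execution, the host buffers need to be pinned (allocated with cudaMallocHost rather than malloc).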