A41101 - How CUDA Programming Works
GETTING THE MOST OUT OF A GPU
[Bar chart: average GPU utilization on a 0%–100% scale. The first step, "Use it at all", leaves the average at roughly 15%.]
KEEPING BUSY
[Bar chart: average GPU utilization on a 0%–100% scale; merely using the GPU at all sits near 15%, and keeping it busy raises the average.]
STILL NOT ALL THAT BUSY
[Flow diagram: a loop of "fetch image" → "process image on GPU" → "insert image into database" until done, plotted against a GPU-utilization axis from 0% to 100%. The GPU only works during the processing step, so utilization stays low.]
SYNCHRONOUS EXECUTION
[Flow diagram: fetch a BATCH of images, loop over the batch processing each image on the GPU, then bulk-insert the results into the database.]

data = load_images(N);
for (i = 0; i < N; i++) {
    process_image(i);
}
insert_images(data);

[Timeline: after each process_image() call the CPU waits for the GPU ("Wait for GPU" / "Synchronize"), so the GPU still goes idle between launches.]
ASYNCHRONOUS EXECUTION
[Timeline: the CPU fetches a BATCH of images and queues all of the GPU processing without waiting on each image; a single synchronize at the end ensures everything has finished before the bulk insert and the program completes.]
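A rough sketch of what this looks like in CUDA terms (the kernel, buffer layout, and sizes below are placeholders, not code from the talk): kernel launches are asynchronous, so the CPU can queue all of the work and wait just once at the end. The synchronous version from the previous slides would instead synchronize inside the loop.

#include <cuda_runtime.h>

// Placeholder kernel standing in for "process image on GPU".
__global__ void process_image_kernel(float *img, int pixels) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < pixels) img[i] *= 0.5f;            // dummy per-pixel work
}

int main() {
    const int N = 8, PIXELS = 1 << 20;         // 8 images of 1M pixels each
    float *d_images;
    cudaMalloc(&d_images, N * PIXELS * sizeof(float));

    // Launches are asynchronous: the CPU queues all N kernels back to back,
    // and the GPU works through them without going idle in between.
    for (int i = 0; i < N; i++) {
        process_image_kernel<<<(PIXELS + 255) / 256, 256>>>(
            d_images + i * PIXELS, PIXELS);
        // The synchronous version would call cudaDeviceSynchronize() here,
        // leaving the GPU idle while the CPU catches up every iteration.
    }
    cudaDeviceSynchronize();                   // wait once, at the very end

    cudaFree(d_images);
    return 0;
}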
EFFICIENT USE OF RESOURCES
Use all of it
[Bar chart: average GPU utilization on a 0%–100% scale; using all of the GPU lifts the average further, into the 75–85% range.]
[Diagram: the GPU's SMs (SM 0 … SM 107) sharing the 40 MB L2 cache.]
START WITH SOME WORK TO PROCESS
[Diagram: a grid of work, divided into many blocks, with many threads in each block.]
DIVIDE INTO A SET OF EQUAL-SIZED BLOCKS: THIS IS THE “GRID” OF WORK
EACH BLOCK WILL NOW BE PROCESSED INDEPENDENTLY
CUDA does not guarantee the order of execution and you cannot exchange data between blocks
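A minimal sketch of how a grid launch expresses this division (the kernel name and data here are illustrative, not from the talk): the launch configuration picks the block size and the number of equal-sized blocks, and each block then runs independently, in no guaranteed order.

#include <cuda_runtime.h>

// Illustrative kernel: each block independently handles its own slice of the data.
__global__ void scale(float *data, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;   // global index of this thread
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    int threadsPerBlock = 256;                                        // equal-sized blocks
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;  // round up to cover all n

    // The grid of work: blocksPerGrid independent blocks of threadsPerBlock threads.
    scale<<<blocksPerGrid, threadsPerBlock>>>(d_data, n);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}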
EVERY BLOCK GETS PLACED ONTO AN SM
CUDA does not guarantee the order of execution and you cannot exchange data between blocks
[Diagram: blocks from the grid being distributed across SM 0 – SM 11.]
BLOCKS CONTINUE TO GET PLACED UNTIL EACH SM IS "FULL"
When a block completes its work and exits, a new block is placed in its spot until the whole grid is done
[Diagram: successive frames of blocks filling SM 0 – SM 11.]
WHAT DOES IT MEAN FOR AN SM TO BE “FULL”?
LOOKING INSIDE A STREAMING MULTIPROCESSOR
A100 SM Resources
2048 Max threads per SM
32 Max blocks per SM
65,536 Total registers per SM
160 kB Total shared memory in SM
32 Threads per warp
4 Concurrent warps active
64 FP32 cores per SM
32 FP64 cores per SM
192 kB Max L1 cache size
90 GB/sec Load bandwidth per SM
1410 MHz GPU Boost Clock
THE CUDA PROGRAMMING MODEL
[Diagram: a grid of work made up of thread blocks; an SM can hold up to 2048 threads.]
A block has a fixed number of threads.
All blocks in a grid run the same program using the same number of threads, leading to 3 resource requirements:
1. Block size – the number of threads which must be concurrent
2. Registers – the number of registers each thread requires
3. Shared memory – the amount of shared memory each block requires
HOW THE GPU PLACES BLOCKS ON AN SM
[Diagram: blocks 0–4 being placed one by one onto a single SM.]
Example block resource requirements:
256 Threads per block
64 Registers per thread
(256 * 64) = 16384 Registers per block
32 kB Shared memory per block
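Working these example numbers against the A100 SM limits listed above (this arithmetic is an illustration, not a slide from the talk):
- Registers: 65,536 per SM ÷ 16,384 per block = 4 blocks
- Shared memory: 160 kB per SM ÷ 32 kB per block = 5 blocks
- Threads: 2048 per SM ÷ 256 per block = 8 blocks
- Block slots: 32 per SM
The SM is "full" as soon as the first of these limits is reached – here 4 blocks, with registers as the tightest constraint.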
OCCUPANCY
[Diagram: the SM's register file and shared memory, with blocks 0–3 each occupying a slice of both; the capacity that is left over shows up as a gap.]
FILLING IN THE GAPS
[Diagram: blocks from a second (green) grid slot into the SM's leftover register capacity, alongside the first grid's blocks.]
Resource requirements (green grid):
512 Threads per block
32 Registers per thread
(512 * 32) = 16384 Registers per block
0 kB Shared memory per block
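CUDA can report this packing directly. A minimal sketch using the runtime occupancy query (the kernel here is just a stand-in; its register and shared-memory usage is whatever the compiler produces, not the numbers from the slides):

#include <cstdio>
#include <cuda_runtime.h>

// Stand-in kernel whose resource usage the occupancy query inspects.
__global__ void stand_in_kernel(float *data, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) data[i] += 1.0f;
}

int main() {
    int blockSize = 256;
    int blocksPerSM = 0;

    // How many blocks of this kernel fit on one SM, given its actual
    // register and shared-memory requirements?
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, stand_in_kernel, blockSize, /*dynamicSMem=*/0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    float occupancy = 100.0f * blocksPerSM * blockSize
                      / prop.maxThreadsPerMultiProcessor;
    printf("%d blocks per SM -> %.0f%% occupancy\n", blocksPerSM, occupancy);
    return 0;
}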
ASYNCHRONOUS EXECUTION
[Timeline: several "process image" launches queued back to back, with a single synchronize once all of them have completed.]
EVEN-MORE-ASYNCHRONOUS EXECUTION
[Timeline: the work is split across two streams (Stream 1 and Stream 2). Each stream copies its data to the GPU, processes it piece by piece (Flower Block 0, 1, 2), copies the results back from the GPU, and synchronizes, so the copies and kernels of one stream overlap with work in the other.]
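A sketch of the two-stream pattern, assuming the usual async-copy structure (the kernel, buffer names, and sizes are made up): each stream copies its chunk in, processes it, and copies it back, so transfers and kernels in different streams can overlap. Host memory must be pinned (cudaMallocHost) for cudaMemcpyAsync to overlap with compute.

#include <cuda_runtime.h>

// Stand-in kernel for "process flower block".
__global__ void process_block(float *chunk, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) chunk[i] *= 2.0f;
}

int main() {
    const int NSTREAMS = 2, CHUNK = 1 << 20;
    float *h_buf, *d_buf;
    cudaMallocHost(&h_buf, NSTREAMS * CHUNK * sizeof(float));  // pinned host memory
    cudaMalloc(&d_buf, NSTREAMS * CHUNK * sizeof(float));

    cudaStream_t streams[NSTREAMS];
    for (int s = 0; s < NSTREAMS; s++) cudaStreamCreate(&streams[s]);

    // Each stream owns its own chunk: copy in, process, copy out.
    // Work queued in different streams is free to overlap on the GPU.
    for (int s = 0; s < NSTREAMS; s++) {
        float *h = h_buf + s * CHUNK, *d = d_buf + s * CHUNK;
        cudaMemcpyAsync(d, h, CHUNK * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        process_block<<<(CHUNK + 255) / 256, 256, 0, streams[s]>>>(d, CHUNK);
        cudaMemcpyAsync(h, d, CHUNK * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    for (int s = 0; s < NSTREAMS; s++) cudaStreamSynchronize(streams[s]);

    for (int s = 0; s < NSTREAMS; s++) cudaStreamDestroy(streams[s]);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}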
KEEPING THE GPU ACTIVE
Use it efficiently

THE NVIDIA AMPERE GPU ARCHITECTURE
These are the resources that are available:
SMs 108
Total threads 221,184
Peak FP32 TFLOP/s 19.5
Peak FP64 TFLOP/s (non-tensor) 9.7
Peak FP64 TFLOP/s (tensor) 19.5
Tensor Core Precision FP64, TF32, BF16, FP16, I8, I4, B1
Shared Memory per SM 160 kB
L2 Cache Size 40960 kB
Memory Bandwidth 1555 GB/sec
GPU Boost Clock 1410 MHz
NVLink Interconnect 600 GB/sec

221,184 threads @ 1410 MHz = 311,869,440,000,000 operations/sec
HOW MUCH DATA CAN THE A100 GPU PULL IN?
[Diagram: SM 0 … SM 107, each with 256 kB of registers and a 192 kB L1 cache, all sharing the 40 MB L2 cache.]
108 SMs in the GPU, running @ 1410 MHz (boost clock)
Each SM can load 64 bytes per clock cycle
Peak memory request rate = 64 B x 108 SMs x 1410 MHz = 9750 gigabytes/sec
Ratio of bandwidth requested to bandwidth provided = 9750 / 1555 = 6.3x
FP64 FLOP/S BASED ON MEMORY SPEED
1555 GB/sec / 8 bytes per FP64 value = 194 GFLOP/s
[Comparison on the slide: 1x iPhone 11 Pro (A13 CPU).]
A CLOSER LOOK AT RANDOM ACCESS MEMORY
Read address: 001100010010011110100001101101110011
V_C = V_S * (1 - e^(-t/RC))
SO WHAT DOES THIS ALL MEAN?
[Plot: achieved memory bandwidth (GB/sec) against the stride interval in bytes between successive reads, from 8 to 8192 bytes. Bandwidth collapses as the stride grows; one marked point sits at 111 GB/sec.]
CM-5 supercomputer, Los Alamos National Laboratory – first ever #1 on the Top500 list, in 1993, at 59.7 gigaflop/s
1x iPhone 6s (A9 CPU)
DATA ACCESS PATTERNS REALLY MATTER
This means using all the GPU resources that you can,
which means managing memory access patterns
But what’s this got to do with CUDA?
CUDA'S GPU EXECUTION HIERARCHY
[Diagram: a grid of work, divided into many blocks, with many threads in each block.]
THE CUDA THREAD BLOCK
[Diagram: a single thread block.]
It's all about this one line of code:

__global__ void euclidian_distance(float2 *p1, float2 *p2, float *distance, int count) {
    // Calculate the index of the point my thread is working on
    int index = threadIdx.x + (blockIdx.x * blockDim.x);
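One plausible completion of this kernel (the body beyond the index line is not shown on the slide, so the distance computation and the launch below are assumptions):

#include <math.h>

__global__ void euclidian_distance(float2 *p1, float2 *p2,
                                   float *distance, int count) {
    // Calculate the index of the point my thread is working on
    int index = threadIdx.x + (blockIdx.x * blockDim.x);

    // Assumed body: bounds check, then the per-point Euclidean distance.
    if (index < count) {
        float dx = p1[index].x - p2[index].x;
        float dy = p1[index].y - p2[index].y;
        distance[index] = sqrtf(dx * dx + dy * dy);
    }
}

// Hypothetical launch: 256-thread blocks, enough blocks to cover every point.
// euclidian_distance<<<(count + 255) / 256, 256>>>(d_p1, d_p2, d_distance, count);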
[Diagram: the 256 threads of the block grouped into warps of 32 – threads 0...31, 32...63, 64...95, 96...127, 128...159, 160...191, and so on.]
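This is why the index line matters: the 32 threads of a warp issue their loads together, so consecutive indices become one wide, coalesced memory transaction, while strided indices scatter into many transactions. A sketch of the contrast (both kernels are illustrative, not from the talk):

// Coalesced: threads 0..31 of a warp read 32 consecutive floats,
// which the hardware can service as a single wide load.
__global__ void read_coalesced(const float *in, float *out, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) out[i] = in[i];
}

// Strided: neighbouring threads read addresses far apart, so the same
// warp needs many separate memory transactions for the same 32 values.
__global__ void read_strided(const float *in, float *out, int n, int stride) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    long j = (long)i * stride;        // widen to avoid overflow for large strides
    if (j < n) out[i] = in[j];
}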