How CUDA Programming Works
Physics
The reason you’re using a GPU is for performance
You use CUDA to write performant code on a GPU
Performance is limited by the laws of physics
So CUDA is the way that it is because of
Physics
This means using all the GPU resources that you can
THE NVIDIA AMPERE GPU ARCHITECTURE
These are the resources that are available:
SMs: 108
Total threads: 221,184
Peak FP32 TFLOP/s: 19.5
Peak FP64 TFLOP/s (non-tensor): 9.7
Peak FP64 TFLOP/s (tensor): 19.5
Tensor Core Precision: FP64, TF32, BF16, FP16, I8, I4, B1
Shared Memory per SM: 160 kB
L2 Cache Size: 40,960 kB
Memory Bandwidth: 1555 GB/sec
GPU Boost Clock: 1410 MHz
NVLink Interconnect: 600 GB/sec
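If you want to see the equivalent numbers for whatever GPU you are actually running on, the CUDA runtime reports most of them. A minimal query sketch (the clock, bandwidth, and TFLOP/s figures above come from the spec sheet rather than from this structure):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0

    printf("Device:               %s\n", prop.name);
    printf("SMs:                  %d\n", prop.multiProcessorCount);
    printf("Max threads per SM:   %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Total threads:        %d\n", prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor);
    printf("Registers per SM:     %d\n", prop.regsPerMultiprocessor);
    printf("Shared memory per SM: %zu kB\n", prop.sharedMemPerMultiprocessor / 1024);
    printf("L2 cache size:        %d kB\n", prop.l2CacheSize / 1024);
    printf("Warp size:            %d\n", prop.warpSize);
    return 0;
}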
BUT FLOPS AREN’T THE ISSUE – BANDWIDTH IS
[Diagram: 108 SMs, each with 256 kB of registers and a 192 kB L1 cache, all feeding one 40 MB L2 cache]
108 SMs in the GPU, running @ 1410 MHz (boost clock)
Each SM can load 64 bytes per clock cycle
Peak memory request rate = 64 B × 108 SMs × 1410 MHz ≈ 9,750 GB/sec
Ratio of bandwidth requested to bandwidth provided = 9,750 / 1,555 ≈ 6.3x
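Spelling out the arithmetic behind that ratio as a tiny host-only sketch (all constants are the A100 figures from this slide):

#include <cstdio>

int main() {
    // Numbers from the slide (A100)
    double bytes_per_sm_per_clock = 64.0;     // each SM can load 64 bytes per cycle
    double num_sms                = 108.0;
    double clock_hz               = 1410.0e6; // 1410 MHz boost clock
    double dram_bandwidth_gbps    = 1555.0;   // delivered HBM2 bandwidth

    double requested_gbps = bytes_per_sm_per_clock * num_sms * clock_hz / 1.0e9;
    printf("Peak request rate: %.0f GB/sec\n", requested_gbps);                  // ~9746 (the slide rounds to 9750)
    printf("Oversubscription:  %.1fx\n", requested_gbps / dram_bandwidth_gbps);  // ~6.3x
    return 0;
}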
FP64 FLOP/s BASED ON MEMORY SPEED = 1555 GB/sec ÷ 8 bytes = 194 GFLOP/s
(compare with the 9.7 TFLOP/s peak non-tensor FP64 rate from the table above)
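To make that 194 GFLOP/s bound concrete: any FP64 kernel that needs a fresh 8-byte operand per floating-point operation is limited by the 1555 GB/sec of memory bandwidth, not by the 9.7 TFLOP/s of FP64 compute. A minimal sketch of such a bandwidth-bound kernel (the kernel itself is illustrative, not from the talk):

// Each element costs one FP64 add but 16 bytes of DRAM traffic (8 read + 8 written),
// so at 1555 GB/sec this tops out near 1555/16, about 97 GFLOP/s, even further below
// the 9.7 TFLOP/s FP64 peak than the slide's one-operand-per-FLOP estimate of 194 GFLOP/s.
__global__ void add_scalar(const double *in, double *out, double s, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n)
        out[i] = in[i] + s;
}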
V_C = V_S (1 − e^(−t/RC))
SO WHAT DOES THIS ALL MEAN?
[Chart: achieved read bandwidth (GB/sec, axis 64–256) vs. stride interval in bytes between successive reads (8 to 8192), with a marked point at 111 GB/sec]
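The chart comes from a strided-read experiment. A rough reconstruction of the kind of kernel that produces it (my sketch, not the original benchmark code): consecutive threads read elements spaced strideBytes apart, so as the stride grows, less of each memory transaction carries useful data.

// Hypothetical reconstruction: each thread reads one float, with reads separated by
// strideBytes. Large strides touch a new DRAM sector per element, so most of each
// sector fetched is wasted and the effective read bandwidth drops.
__global__ void strided_read(const float *data, float *out, size_t n, int strideBytes)
{
    size_t stride = strideBytes / sizeof(float);
    size_t tid    = threadIdx.x + (size_t)blockIdx.x * blockDim.x;
    size_t i      = tid * stride;
    if (i < n)
        out[tid] = data[i];   // the output write is contiguous; the strided reads dominate
}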
[Photo: CM-5 supercomputer, Los Alamos National Laboratory. First ever #1 on the Top500 in 1993, at 59.7 gigaflop/s]
DATA ACCESS PATTERNS REALLY MATTER
This means using all the GPU resources that you can,
which means managing memory access patterns
But what’s this got to do with CUDA?
CUDA’S GPU EXECUTION HIERARCHY
Grid of work → divide into many blocks → many threads in each block
THE CUDA THREAD BLOCK

__global__ void euclidian_distance(float2 *p1, float2 *p2, float *distance, int count) {
    // Calculate the index of the point my thread is working on
    int index = threadIdx.x + (blockIdx.x * blockDim.x);

It’s all about this one line of code.

[Diagram: one thread block of 192 threads, drawn as six warps: threads 0...31, 32...63, 64...95, 96...127, 128...159, 160...191]
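For reference, here is a completed version of the kernel on the slide, plus a launch, so the index arithmetic has something concrete to drive. Only the signature, the comment, and the index line come from the slide; the body and the 256-thread launch configuration are an illustrative sketch.

#include <cuda_runtime.h>

__global__ void euclidian_distance(float2 *p1, float2 *p2, float *distance, int count) {
    // Calculate the index of the point my thread is working on
    int index = threadIdx.x + (blockIdx.x * blockDim.x);
    if (index < count) {
        float dx = p2[index].x - p1[index].x;
        float dy = p2[index].y - p1[index].y;
        distance[index] = sqrtf(dx * dx + dy * dy);
    }
}

// One thread per point: 256-thread blocks, enough blocks to cover all `count` points.
void launch_euclidian_distance(float2 *d_p1, float2 *d_p2, float *d_distance, int count) {
    int blockSize = 256;
    int gridSize = (count + blockSize - 1) / blockSize;
    euclidian_distance<<<gridSize, blockSize>>>(d_p1, d_p2, d_distance, count);
}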
This means using all the GPU resources that you can,
which means managing memory access patterns
and also something called occupancy
CUDA’S GPU EXECUTION HIERARCHY
Grid of work → divide into many blocks → many threads in each block
START WITH SOME WORK TO PROCESS
DIVIDE INTO A SET OF EQUAL-SIZED BLOCKS: THIS IS THE “GRID” OF WORK
EACH BLOCK WILL NOW BE PROCESSED INDEPENDENTLY
CUDA does not guarantee the order of execution and you cannot exchange data between blocks
EVERY BLOCK GETS PLACED ONTO AN SM
CUDA does not guarantee the order of execution and you cannot exchange data between blocks
[Diagram: blocks from the grid landing on SM 0 through SM 11]
BLOCKS CONTINUE TO GET PLACED UNTIL EACH SM IS “FULL”
When a block completes its work and exits, a new block is placed in its spot until the whole grid is done
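Because blocks may run in any order and cannot exchange data, a kernel only has to be correct when each block handles its own slice independently. One standard way to write that, sketched here, is the grid-stride loop, which stays correct no matter how many blocks the GPU keeps resident at once:

// Each thread starts at its global index and strides by the total number of threads
// in the grid, so correctness never depends on how many blocks are resident at once
// or on the order in which the GPU chooses to run them.
__global__ void scale(float *data, float a, int n) {
    for (int i = threadIdx.x + blockIdx.x * blockDim.x; i < n; i += blockDim.x * gridDim.x)
        data[i] *= a;
}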
WHAT DOES IT MEAN FOR AN SM TO BE “FULL”?
LOOKING INSIDE A STREAMING MULTIPROCESSOR
A100 SM Resources
2048 Max threads per SM
32 Max blocks per SM
65,536 Total registers per SM
160 kB Total shared memory in SM
32 Threads per warp
4 Concurrent warps active
64 FP32 cores per SM
32 FP64 cores per SM
192 kB Max L1 cache size
90 GB/sec Load bandwidth per SM
1410 MHz GPU Boost Clock
ANATOMY OF A THREAD BLOCK
A block has a fixed number of threads.
All blocks in a grid run the same program using the same number of threads, leading to 3 resource requirements:
1. Block size – the number of threads which must be concurrent
2. Registers per thread – and therefore registers per block
3. Shared memory per block
[Diagram: thread blocks from the grid of work packing into an SM’s capacity of 2048 threads]
HOW THE GPU PLACES BLOCKS ON AN SM
Example block resource requirements:
256 Threads per block
64 Registers per thread
(256 × 64) = 16,384 Registers per block
32 kB Shared memory per block
[Diagram: blocks 0 through 4 from the grid being placed onto an SM as its resources allow]
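This per-block resource arithmetic is exactly what the CUDA occupancy API computes for you. A small sketch (the kernel is a trivial placeholder, so it will report the thread-count limit; a real kernel’s register and shared-memory usage usually lowers the answer):

#include <cstdio>
#include <cuda_runtime.h>

// Trivial placeholder kernel; a real kernel's register and shared-memory
// usage is what actually drives the occupancy result.
__global__ void my_kernel(float *data, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n)
        data[i] *= 2.0f;
}

int main() {
    int blockSize = 256;
    int maxBlocksPerSM = 0;
    // How many 256-thread blocks of my_kernel fit on one SM (0 bytes of dynamic shared memory).
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxBlocksPerSM, my_kernel, blockSize, 0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    double occupancy = 100.0 * maxBlocksPerSM * blockSize / prop.maxThreadsPerMultiProcessor;
    printf("%d blocks per SM -> %.0f%% occupancy\n", maxBlocksPerSM, occupancy);
    return 0;
}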
OCCUPANCY
[Diagram: the SM’s register file and shared memory being divided among resident blocks; whichever resource runs out first determines how many blocks fit, and the rest sits idle]
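Register and shared-memory footprints are partly under your control. A hedged sketch of the standard __launch_bounds__ qualifier (the 256/4 values are illustrative): it tells the compiler the largest block you will launch and how many resident blocks per SM you are aiming for, so it can cap register usage per thread accordingly.

// Tell the compiler: blocks will have at most 256 threads, and aim to keep
// at least 4 blocks resident per SM, limiting registers per thread if necessary.
__global__ void __launch_bounds__(256, 4) process(float *data, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n)
        data[i] = data[i] * 2.0f + 1.0f;
}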
FILLING IN THE GAPS
Resource requirements (green grid):
512 Threads per block
32 Registers per thread
(512 × 32) = 16,384 Registers per block
0 kB Shared memory per block
[Diagram: blocks from the green grid slotting into the register space left over on the SM, since they need no shared memory]
CONCURRENCY: DOING MULTIPLE THINGS AT ONCE
[Diagram: copying memory and processing a flower image at the same time]
CONCURRENCY: DEPENDENCIES
[Timeline: copy to GPU → process flower → copy from GPU, then the same three steps again for the next flower, all back-to-back]
CONCURRENCY: INDEPENDENCIES
[Timeline: two streams, each running copy → process → copy for its own flower block; because the streams are independent, their work overlaps]
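That picture is exactly what CUDA streams express: work in one stream runs in order, work in different streams is free to overlap. A minimal sketch of the two-flower pipeline (buffer names, sizes, and the placeholder kernel are illustrative; the stream and async-copy calls are the standard runtime API):

#include <cuda_runtime.h>

// Placeholder "processing": invert each pixel of the image block.
__global__ void process_flower(unsigned char *img, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n)
        img[i] = 255 - img[i];
}

void pipeline(unsigned char *h_img[2], unsigned char *d_img[2], int n) {
    cudaStream_t stream[2];
    for (int s = 0; s < 2; ++s)
        cudaStreamCreate(&stream[s]);

    for (int s = 0; s < 2; ++s) {
        // Copy in, process, copy out: ordered within stream s,
        // but free to overlap with the other stream's copies and kernel.
        cudaMemcpyAsync(d_img[s], h_img[s], n, cudaMemcpyHostToDevice, stream[s]);
        process_flower<<<(n + 255) / 256, 256, 0, stream[s]>>>(d_img[s], n);
        cudaMemcpyAsync(h_img[s], d_img[s], n, cudaMemcpyDeviceToHost, stream[s]);
    }

    for (int s = 0; s < 2; ++s) {
        cudaStreamSynchronize(stream[s]);
        cudaStreamDestroy(stream[s]);
    }
}

For the copies to really overlap with kernel execution, the host buffers need to be pinned (allocated with cudaMallocHost rather than malloc).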