A41101 - How CUDA Programming Works
GETTING THE MOST OUT OF A GPU
[Bar chart: average GPU utilization on a 0%–100% scale. The first step, "Use it at all", leaves the average at roughly 15%.]
KEEPING BUSY
[Bar chart: average GPU utilization on a 0%–100% scale; merely using the GPU at all sits near 15%, and keeping it busy raises the average.]
STILL NOT ALL THAT BUSY
[Flow diagram: a loop of "fetch image" → "process image on GPU" → "insert image into database" until done, plotted against a GPU-utilization axis from 0% to 100%. The GPU only works during the processing step, so utilization stays low.]
SYNCHRONOUS EXECUTION
[Flow diagram: fetch a BATCH of images, loop over the batch processing each image on the GPU, then bulk-insert the results into the database.]

data = load_images(N);
for (i = 0; i < N; i++) {
    process_image(i);
}
insert_images(data);

[Timeline: after each process_image() call the CPU waits for the GPU ("Wait for GPU" / "Synchronize"), so the GPU still goes idle between launches.]
ASYNCHRONOUS EXECUTION
[Timeline: the CPU fetches a BATCH of images and queues all of the GPU processing without waiting on each image; a single synchronize at the end ensures everything has finished before the bulk insert and the program completes.]
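A rough sketch of what this looks like in CUDA terms (the kernel, buffer layout, and sizes below are placeholders, not code from the talk): kernel launches are asynchronous, so the CPU can queue all of the work and wait just once at the end. The synchronous version from the previous slides would instead synchronize inside the loop.

#include <cuda_runtime.h>

// Placeholder kernel standing in for "process image on GPU".
__global__ void process_image_kernel(float *img, int pixels) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < pixels) img[i] *= 0.5f;            // dummy per-pixel work
}

int main() {
    const int N = 8, PIXELS = 1 << 20;         // 8 images of 1M pixels each
    float *d_images;
    cudaMalloc(&d_images, N * PIXELS * sizeof(float));

    // Launches are asynchronous: the CPU queues all N kernels back to back,
    // and the GPU works through them without going idle in between.
    for (int i = 0; i < N; i++) {
        process_image_kernel<<<(PIXELS + 255) / 256, 256>>>(
            d_images + i * PIXELS, PIXELS);
        // The synchronous version would call cudaDeviceSynchronize() here,
        // leaving the GPU idle while the CPU catches up every iteration.
    }
    cudaDeviceSynchronize();                   // wait once, at the very end

    cudaFree(d_images);
    return 0;
}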
EFFICIENT USE OF RESOURCES
Use all of it
[Bar chart: average GPU utilization on a 0%–100% scale; using all of the GPU lifts the average further, into the 75–85% range.]
[Diagram: the GPU's SMs (SM 0 … SM 107) sharing the 40 MB L2 cache.]
START WITH SOME WORK TO PROCESS
[Diagram: a grid of work, divided into many blocks, with many threads in each block.]
DIVIDE INTO A SET OF EQUAL-SIZED BLOCKS: THIS IS THE “GRID” OF WORK
EACH BLOCK WILL NOW BE PROCESSED INDEPENDENTLY
CUDA does not guarantee the order of execution and you cannot exchange data between blocks
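A minimal sketch of how a grid launch expresses this division (the kernel name and data here are illustrative, not from the talk): the launch configuration picks the block size and the number of equal-sized blocks, and each block then runs independently, in no guaranteed order.

#include <cuda_runtime.h>

// Illustrative kernel: each block independently handles its own slice of the data.
__global__ void scale(float *data, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;   // global index of this thread
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    int threadsPerBlock = 256;                                        // equal-sized blocks
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;  // round up to cover all n

    // The grid of work: blocksPerGrid independent blocks of threadsPerBlock threads.
    scale<<<blocksPerGrid, threadsPerBlock>>>(d_data, n);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}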
EVERY BLOCK GETS PLACED ONTO AN SM
CUDA does not guarantee the order of execution and you cannot exchange data between blocks
[Diagram: blocks from the grid being distributed across SM 0 – SM 11.]
BLOCKS CONTINUE TO GET PLACED UNTIL EACH SM IS "FULL"
When a block completes its work and exits, a new block is placed in its spot until the whole grid is done
[Diagram: successive frames of blocks filling SM 0 – SM 11.]
WHAT DOES IT MEAN FOR AN SM TO BE “FULL”?
LOOKING INSIDE A STREAMING MULTIPROCESSOR
A100 SM Resources
2048 Max threads per SM
32 Max blocks per SM
65,536 Total registers per SM
160 kB Total shared memory in SM
32 Threads per warp
4 Concurrent warps active
64 FP32 cores per SM
32 FP64 cores per SM
192 kB Max L1 cache size
90 GB/sec Load bandwidth per SM
1410 MHz GPU Boost Clock
THE CUDA PROGRAMMING MODEL
[Diagram: a grid of work made up of thread blocks; an SM can hold up to 2048 threads.]
A block has a fixed number of threads.
All blocks in a grid run the same program using the same number of threads, leading to 3 resource requirements:
1. Block size – the number of threads which must be concurrent
2. Registers – the number of registers each thread requires
3. Shared memory – the amount of shared memory each block requires
HOW THE GPU PLACES BLOCKS ON AN SM
[Diagram: blocks 0–4 being placed one by one onto a single SM.]
Example block resource requirements:
256 Threads per block
64 Registers per thread
(256 * 64) = 16384 Registers per block
32 kB Shared memory per block
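Working these example numbers against the A100 SM limits listed above (this arithmetic is an illustration, not a slide from the talk):
- Registers: 65,536 per SM ÷ 16,384 per block = 4 blocks
- Shared memory: 160 kB per SM ÷ 32 kB per block = 5 blocks
- Threads: 2048 per SM ÷ 256 per block = 8 blocks
- Block slots: 32 per SM
The SM is "full" as soon as the first of these limits is reached – here 4 blocks, with registers as the tightest constraint.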
OCCUPANCY
[Diagram: the SM's register file and shared memory, with blocks 0–3 each occupying a slice of both; the capacity that is left over shows up as a gap.]
FILLING IN THE GAPS
[Diagram: blocks from a second (green) grid slot into the SM's leftover register capacity, alongside the first grid's blocks.]
Resource requirements (green grid):
512 Threads per block
32 Registers per thread
(512 * 32) = 16384 Registers per block
0 kB Shared memory per block
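CUDA can report this packing directly. A minimal sketch using the runtime occupancy query (the kernel here is just a stand-in; its register and shared-memory usage is whatever the compiler produces, not the numbers from the slides):

#include <cstdio>
#include <cuda_runtime.h>

// Stand-in kernel whose resource usage the occupancy query inspects.
__global__ void stand_in_kernel(float *data, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) data[i] += 1.0f;
}

int main() {
    int blockSize = 256;
    int blocksPerSM = 0;

    // How many blocks of this kernel fit on one SM, given its actual
    // register and shared-memory requirements?
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, stand_in_kernel, blockSize, /*dynamicSMem=*/0);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    float occupancy = 100.0f * blocksPerSM * blockSize
                      / prop.maxThreadsPerMultiProcessor;
    printf("%d blocks per SM -> %.0f%% occupancy\n", blocksPerSM, occupancy);
    return 0;
}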
ASYNCHRONOUS EXECUTION
[Timeline: several "process image" launches queued back to back, with a single synchronize once all of them have completed.]
EVEN-MORE-ASYNCHRONOUS EXECUTION
[Timeline: the work is split across two streams (Stream 1 and Stream 2). Each stream copies its data to the GPU, processes it piece by piece (Flower Block 0, 1, 2), copies the results back from the GPU, and synchronizes, so the copies and kernels of one stream overlap with work in the other.]
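A sketch of the two-stream pattern, assuming the usual async-copy structure (the kernel, buffer names, and sizes are made up): each stream copies its chunk in, processes it, and copies it back, so transfers and kernels in different streams can overlap. Host memory must be pinned (cudaMallocHost) for cudaMemcpyAsync to overlap with compute.

#include <cuda_runtime.h>

// Stand-in kernel for "process flower block".
__global__ void process_block(float *chunk, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) chunk[i] *= 2.0f;
}

int main() {
    const int NSTREAMS = 2, CHUNK = 1 << 20;
    float *h_buf, *d_buf;
    cudaMallocHost(&h_buf, NSTREAMS * CHUNK * sizeof(float));  // pinned host memory
    cudaMalloc(&d_buf, NSTREAMS * CHUNK * sizeof(float));

    cudaStream_t streams[NSTREAMS];
    for (int s = 0; s < NSTREAMS; s++) cudaStreamCreate(&streams[s]);

    // Each stream owns its own chunk: copy in, process, copy out.
    // Work queued in different streams is free to overlap on the GPU.
    for (int s = 0; s < NSTREAMS; s++) {
        float *h = h_buf + s * CHUNK, *d = d_buf + s * CHUNK;
        cudaMemcpyAsync(d, h, CHUNK * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        process_block<<<(CHUNK + 255) / 256, 256, 0, streams[s]>>>(d, CHUNK);
        cudaMemcpyAsync(h, d, CHUNK * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    for (int s = 0; s < NSTREAMS; s++) cudaStreamSynchronize(streams[s]);

    for (int s = 0; s < NSTREAMS; s++) cudaStreamDestroy(streams[s]);
    cudaFree(d_buf);
    cudaFreeHost(h_buf);
    return 0;
}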
KEEPING THE GPU ACTIVE
Use it efficiently

THE NVIDIA AMPERE GPU ARCHITECTURE
These are the resources that are available:
SMs 108
Total threads 221,184
Peak FP32 TFLOP/s 19.5
Peak FP64 TFLOP/s (non-tensor) 9.7
Peak FP64 TFLOP/s (tensor) 19.5
Tensor Core Precision FP64, TF32, BF16, FP16, I8, I4, B1
Shared Memory per SM 160 kB
L2 Cache Size 40960 kB
Memory Bandwidth 1555 GB/sec
GPU Boost Clock 1410 MHz
NVLink Interconnect 600 GB/sec

221,184 threads @ 1410 MHz = 311,869,440,000,000 operations/sec
HOW MUCH DATA CAN THE A100 GPU PULL IN?
[Diagram: SM 0 … SM 107, each with 256 kB of registers and a 192 kB L1 cache, all sharing the 40 MB L2 cache.]
108 SMs in the GPU, running @ 1410 MHz (boost clock)
Each SM can load 64 bytes per clock cycle
Peak memory request rate = 64 B x 108 SMs x 1410 MHz = 9750 gigabytes/sec
Ratio of bandwidth requested to bandwidth provided = 9750 / 1555 = 6.3x
FP64 FLOP/S BASED ON MEMORY SPEED
1555 GB/sec / 8 bytes per FP64 value = 194 GFLOP/s
[Comparison on the slide: 1x iPhone 11 Pro (A13 CPU).]
A CLOSER LOOK AT RANDOM ACCESS MEMORY
Read address: 001100010010011110100001101101110011
V_C = V_S * (1 - e^(-t/RC))
SO WHAT DOES THIS ALL MEAN?
[Plot: achieved memory bandwidth (GB/sec) against the stride interval in bytes between successive reads, from 8 to 8192 bytes. Bandwidth collapses as the stride grows; one marked point sits at 111 GB/sec.]
CM-5 supercomputer, Los Alamos National Laboratory – first ever #1 on the Top500 list, in 1993, at 59.7 gigaflop/s
1x iPhone 6s (A9 CPU)
DATA ACCESS PATTERNS REALLY MATTER
This means using all the GPU resources that you can,
which means managing memory access patterns
But what’s this got to do with CUDA?
CUDA'S GPU EXECUTION HIERARCHY
[Diagram: a grid of work, divided into many blocks, with many threads in each block.]
THE CUDA THREAD BLOCK
[Diagram: a single thread block.]
It's all about this one line of code:

__global__ void euclidian_distance(float2 *p1, float2 *p2, float *distance, int count) {
    // Calculate the index of the point my thread is working on
    int index = threadIdx.x + (blockIdx.x * blockDim.x);
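One plausible completion of this kernel (the body beyond the index line is not shown on the slide, so the distance computation and the launch below are assumptions):

#include <math.h>

__global__ void euclidian_distance(float2 *p1, float2 *p2,
                                   float *distance, int count) {
    // Calculate the index of the point my thread is working on
    int index = threadIdx.x + (blockIdx.x * blockDim.x);

    // Assumed body: bounds check, then the per-point Euclidean distance.
    if (index < count) {
        float dx = p1[index].x - p2[index].x;
        float dy = p1[index].y - p2[index].y;
        distance[index] = sqrtf(dx * dx + dy * dy);
    }
}

// Hypothetical launch: 256-thread blocks, enough blocks to cover every point.
// euclidian_distance<<<(count + 255) / 256, 256>>>(d_p1, d_p2, d_distance, count);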
[Diagram: the 256 threads of the block grouped into warps of 32 – threads 0...31, 32...63, 64...95, 96...127, 128...159, 160...191, and so on.]
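This is why the index line matters: the 32 threads of a warp issue their loads together, so consecutive indices become one wide, coalesced memory transaction, while strided indices scatter into many transactions. A sketch of the contrast (both kernels are illustrative, not from the talk):

// Coalesced: threads 0..31 of a warp read 32 consecutive floats,
// which the hardware can service as a single wide load.
__global__ void read_coalesced(const float *in, float *out, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) out[i] = in[i];
}

// Strided: neighbouring threads read addresses far apart, so the same
// warp needs many separate memory transactions for the same 32 values.
__global__ void read_strided(const float *in, float *out, int n, int stride) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    long j = (long)i * stride;        // widen to avoid overflow for large strides
    if (j < n) out[i] = in[j];
}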