
PERFORMANCE OPTIMIZATION WITH MODERN CUDA

PROGRAMMING TECHNIQUES
GUILLAUME THOMAS-COLLIGNON, VISHAL MEHTA
DEVTECH COMPUTE
• Memory Hierarchy
▪ Memory Access Patterns
▪ Memory Model
▪ L2 Cache
▪ Shared Memory

• Thread Divergence

• GPU Occupancy

• CUDA Streams & Graphs


MEMORY HIERARCHY
Understanding Memory and Caches

Figure: NVIDIA A100 80GB memory hierarchy. 108 SMs (each with compute units, registers and 192 KB of combined L1 / shared memory), a 40 MB L2 cache, 80 GB of HBM2e at 2.0 TB/s, connected via PCIe and NVLink.
MEMORY HIERARCHY
Nsight Compute View
MEMORY HIERARCHY
Instructions, Requests, Sectors

One warp = 32 threads. A load instruction issued by a warp is one request to the memory system, and requests are served at a granularity of 32-byte sectors.

The key question: how many different 32-byte sectors are touched by the warp?
MEMORY ACCESS PATTERNS
Figure: coalesced accesses.
• 1 byte per thread, coalesced: the warp touches 1 sector.
• 4 bytes per thread, coalesced: the warp touches 4 sectors.
• 8 bytes per thread, coalesced: the warp touches 8 sectors.

Coalesced memory accesses touch only the ideal number of sectors.

MEMORY ACCESS PATTERNS
Figure: unaligned accesses.
• 1 byte per thread, unaligned: the warp spills into one extra sector.
• 4 bytes per thread, unaligned: 5 sectors instead of 4.
• 8 bytes per thread, unaligned: 9 sectors instead of 8.

Coalesced but unaligned memory accesses increase the number of sectors per request.

MEMORY ACCESS PATTERNS
Figure: strided accesses.
• 1 byte per thread, stride = 2: the warp touches 2 sectors instead of 1.
• 4 bytes per thread, stride = 2: 8 sectors instead of 4.

Strided memory accesses also increase the number of sectors per request.
MEMORY ACCESS PATTERNS

Figure: random accesses; the 32 threads of the warp scatter across many different sectors.

Random memory accesses request too many sectors, wasting bandwidth!
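To make the patterns concrete, here is a minimal sketch (not from the slides) contrasting a coalesced copy with a strided one; the kernel names and the stride of 2 are assumptions.

    // Coalesced: consecutive threads read consecutive floats, so a warp of 32 threads
    // touches the minimum 4 sectors (32 x 4 bytes = 128 bytes).
    __global__ void copy_coalesced(float* dst, const float* src, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) dst[i] = src[i];
    }

    // Strided (stride = 2): the same warp now spans 8 sectors, so half of every
    // sector fetched is wasted bandwidth.
    __global__ void copy_strided(float* dst, const float* src, int n) {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * 2;
        if (i < n) dst[i] = src[i];
    }

Nsight Compute would show the difference in sectors per request between the two versions.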
MEMORY HIERARCHY
Nsight Compute
Instructions generate requests to the L1 cache. 32-byte sectors are transferred between the L1 and L2 caches, and eventually to/from memory.

Make sure these numbers match your expectations.
MEMORY MODEL
MEMORY MODEL
CUDA Generic Address Space

The 64-bit generic address space spans local memory (visible to a single thread), shared memory, and global memory.

• Single generic address space
• Fully conforming to the C++ object model
MEMORY MODEL
CUDA Generic Address Space

    __device__ float square(float* data) {
        float x = data[0];  // Memory operations can use a generic address,
                            // agnostic of shared, local or global memory.
                            // The compiler can optimize if it can determine the address space.
        return x * x;
    }
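For illustration, a small sketch (not from the slides) of calling square() on pointers into different address spaces; the kernel and array names are assumptions.

    __global__ void demo(float* gdata) {
        __shared__ float sdata[32];
        sdata[threadIdx.x] = gdata[threadIdx.x];
        __syncthreads();

        // The same function accepts generic pointers into global or shared memory:
        float a = square(gdata + threadIdx.x);   // generic pointer to global memory
        float b = square(sdata + threadIdx.x);   // generic pointer to shared memory
        gdata[threadIdx.x] = a + b;
    }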
MEMORY MODEL
CUDA Generic Address Space

Within the same 64-bit generic address space, shared memory is visible to a thread block, and global memory is visible to all GPU threads.
MEMORY MODEL
Reads

Reading from local or global memory can hit in the L1 or L2 caches.

Figure: A100 memory path. SM (compute units, registers, L1 / shared memory, plus the new LDGSTS asynchronous global-to-shared copy), 40 MB L2, 80 GB HBM2, PCIe / NVLink.
MEMORY MODEL
Writes
L1 is write-through, L2 is write-back.
Writes always reach at least L2; a read after write can hit in L1 (e.g. register spills).

Figure: the same A100 memory path as above (SM with registers, L1 / shared memory and LDGSTS; 40 MB L2; 80 GB HBM2; PCIe / NVLink).
MEMORY MODEL
Scope levels

Figure: scope levels.
• System scope spans host CPU memory and every device.
• Device scope covers one GPU: its global memory, L2 and SMs.
• Inside an SM: block scope and thread scope.
MEMORY MODEL
Synchronizing memory between multiple Thread Blocks

A cooperative kernel: the reader (Thread 0, Block 0, SM 0) polls a flag and then reads the data; the writer (Thread 0, Block 1, SM 1) writes the data and then sets the flag. Initially L2 holds flag = 0 and Data = ?.

    __host__ __device__
    int poll_then_read(int& flag, int& data) {
        while (flag != 1) ;
        return data;
    }

    __host__ __device__
    void write_then_signal(int& flag, int& data, int value) {
        data = value;
        flag = 1;
    }

Walking through the caches: the reader caches flag = 0 in its L1 while polling. The writer's stores are write-through, so Data = value and then flag = 1 reach L2. The reader, however, can keep spinning on the stale flag = 0 held in its own L1, and nothing orders the data write against the flag write from the reader's point of view, so this plain version needs explicit synchronization.
MEMORY MODEL
Synchronizing memory between multiple Thread Blocks

One fix: make the flag volatile and add fences, so the flag update becomes visible to the reader and the data access is ordered against it.

    __host__ __device__
    int poll_then_read(volatile int& flag, int& data) {
        while (flag != 1) ;      // data-race
        __threadfence();
        return data;
    }

    __host__ __device__
    void write_then_signal(volatile int& flag, int& data, int value) {
        data = value;
        __threadfence();
        flag = 1;                // data-race
    }

But: volatile does not mention the scope of the participating threads, and __threadfence() is device-only.
MEMORY MODEL
Synchronizing memory between multiple Thread Blocks

A cleaner fix: make the flag an atomic. The atomic load and store provide the required visibility and ordering, and the same cache walk-through now ends with the reader observing flag = true and Data = value.

    __host__ __device__
    int poll_then_read(atomic<bool>& flag, int& data) {
        while (!flag.load()) ;
        return data;
    }

    __host__ __device__
    void write_then_signal(atomic<bool>& flag, int& data, int value) {
        data = value;
        flag = true;
    }

Reader: Thread 0, Block 0, SM 0. Writer: Thread 0, Block 1, SM 1.
MEMORY MODEL
Synchronizing memory between different Thread Blocks

The atomic version above still has three performance issues:

1. It uses a system-scope atomic: atomic<Type> is atomic<Type, cuda::thread_scope_system>.
2. Full sequential consistency is not needed.
3. The spinning while loop has no backoff.
MEMORY MODEL
Synchronizing memory between different Thread Blocks

Issue 1, the system-scope atomic: all participating threads run on one GPU, so GPU device scope is sufficient.

    __host__ __device__
    int poll_then_read(
        cuda::atomic<bool, cuda::thread_scope_device>& flag, int& data) {
        while (!flag.load()) ;
        return data;
    }

    __host__ __device__
    void write_then_signal(
        cuda::atomic<bool, cuda::thread_scope_device>& flag, int& data, int value) {
        data = value;
        flag = true;
    }

MEMORY MODEL
Synchronizing memory between different Thread Blocks

Issue 2, full sequential consistency is not needed: C++ acquire/release semantics are sufficient.

    __host__ __device__
    int poll_then_read(
        cuda::atomic<bool, cuda::thread_scope_device>& flag, int& data) {
        while (!flag.load(memory_order_acquire)) ;
        return data;
    }

    __host__ __device__
    void write_then_signal(
        cuda::atomic<bool, cuda::thread_scope_device>& flag, int& data, int value) {
        data = value;
        flag.store(true, memory_order_release);
    }
MEMORY MODEL
Synchronizing memory between different Thread Blocks

Issue 3, the spinning loop with no backoff: use the wait() / notify_all() API, which has built-in exponential backoff.

    __host__ __device__
    int poll_then_read(
        cuda::atomic<bool, cuda::thread_scope_device>& flag, int& data) {
        flag.wait(false, memory_order_acquire);
        return data;
    }

    __host__ __device__
    void write_then_signal(
        cuda::atomic<bool, cuda::thread_scope_device>& flag, int& data, int value) {
        data = value;
        flag.store(true, memory_order_release);
        flag.notify_all();
    }

Reader: Thread 0, Block 0, SM 0. Writer: Thread 0, Block 1, SM 1.
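To see the final pattern in context, here is a minimal, self-contained sketch (not from the slides) that wires the reader and the writer into one kernel. The kernel name, launch shape and the managed-memory initialization are assumptions; the slides run the pair inside a cooperative kernel, which guarantees that both blocks are resident at the same time.

    #include <cstdio>
    #include <new>
    #include <cuda/atomic>

    using device_flag = cuda::atomic<bool, cuda::thread_scope_device>;

    __global__ void handoff(device_flag* flag, int* data) {
        if (blockIdx.x == 1 && threadIdx.x == 0) {        // writer: Block 1
            *data = 42;
            flag->store(true, cuda::std::memory_order_release);
            flag->notify_all();
        }
        if (blockIdx.x == 0 && threadIdx.x == 0) {        // reader: Block 0
            flag->wait(false, cuda::std::memory_order_acquire);
            printf("reader got %d\n", *data);
        }
    }

    int main() {
        device_flag* flag;
        int* data;
        cudaMallocManaged(&flag, sizeof(device_flag));
        cudaMallocManaged(&data, sizeof(int));
        new (flag) device_flag(false);                    // construct the atomic before launch
        *data = 0;

        handoff<<<2, 32>>>(flag, data);                   // two blocks: one reader, one writer
        cudaDeviceSynchronize();

        cudaFree(data);
        cudaFree(flag);
        return 0;
    }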


MEMORY MODEL
Synchronizing memory at system scope

Figure: four GPUs (GPU 0 to GPU 3); the reader runs on GPU 0 and the writer on GPU 1, so the atomic now needs system scope.

    __host__ __device__
    int poll_then_read(
        cuda::atomic<bool, cuda::thread_scope_system>& flag, int& data) {
        flag.wait(false, memory_order_acquire);
        return data;
    }

    __host__ __device__
    void write_then_signal(
        cuda::atomic<bool, cuda::thread_scope_system>& flag, int& data, int value) {
        data = value;
        flag.store(true, memory_order_release);
        flag.notify_all();
    }

Reader: Thread 0, Block 0, SM 0, GPU 0. Writer: Thread 0, Block 1, SM 1, GPU 1.
L2 CACHE MANAGEMENT
ANNOTATED POINTER
A pointer annotated with an access property

    template <typename T, typename AccessProperty>
    class cuda::annotated_ptr;

Behaves like a raw pointer plus a hint: the access property may be applied or ignored.

    int* g;                                                    // a pointer to global memory
    cuda::annotated_ptr<int, cuda::access_property::global> p{g};
    p[threadIdx.x] = 42;

It also propagates properties through ABI boundaries for independently compiled device code:

    __device__ void independently_compiled(
        cuda::annotated_ptr<int, cuda::access_property>);
ANNOTATED POINTER
Access Properties

cuda::annotated_ptr<T, cuda::access_property>

Shared memory:
• cuda::access_property::shared - memory access to shared memory.

Global memory, static hints:
• cuda::access_property::global - memory access to global memory (does not indicate access frequency).
• cuda::access_property::normal - global-memory access as frequent as others.
• cuda::access_property::persisting - global-memory access more frequent than others.
• cuda::access_property::streaming - global-memory access less frequent than others.

Global memory, dynamic hint:
• cuda::access_property - global-memory access with a dynamic hint.
  • Interleaved: request properties for probabilities of memory addresses.
  • Range: request properties for elements of an address range.

Sizes:
• sizeof(cuda::annotated_ptr<T, StaticAccessProperty>) == sizeof(T*)
• sizeof(cuda::annotated_ptr<T, cuda::access_property>) == 2 * sizeof(T*)
ANNOTATED POINTER
Interleaved dynamic property

    int* g_ptr; size_t sz;

    cuda::access_property interleaved{
        cuda::access_property::persisting{},
        0.3,
        cuda::access_property::streaming{}
    };

    cuda::annotated_ptr<int, cuda::access_property> p{
        g_ptr, interleaved
    };
    p[threadIdx.x] = 42;
    int v = p[threadIdx.x];

30% of the memory addresses are accessed with the persisting property, 70% with the streaming property.
ANNOTATED POINTER
Range dynamic property
    int* g_ptr; size_t leading_size, total_size;

    cuda::access_property range{
        g_ptr, leading_size, total_size,
        cuda::access_property::persisting{},
        cuda::access_property::streaming{}
    };

    cuda::annotated_ptr<int, cuda::access_property> p{
        g_ptr, range
    };

Roughly: accesses to the first leading_size elements starting at g_ptr get the persisting property, and the remainder of the total_size range gets the streaming property. The figure also shows the special case leading_size == total_size, where the whole range is persisting.
ANNOTATED POINTER
Prefetching memory

• Prefetches memory at L2 cache line granularity and sets access frequency

• Can use it as a larger shared memory (backed by global memory) that can be shared across thread-blocks

    __global__ void pin(int* a) {
        if (threadIdx.x == 0)  // Thread 0 prefetches one L2 cache line, i.e. 32 integer elements
            cuda::apply_access_property(a, 32 * sizeof(int), cuda::access_property::persisting{});
    }

Figure: the prefetch pulls the data from the 80 GB HBM2e into the 40 MB L2.
ANNOTATED POINTER
Discarding or Resetting memory

• Once done using it, either set its access frequency back to normal
• Or discard if the application does not need the lines to be written back to main memory
• Otherwise, the memory might be kept in the L2 for a very long time.

Discarding avoids the write-back and saves bandwidth.

    // Set the access frequency back to normal:
    cuda::apply_access_property(ptr, size, cuda::access_property::normal{});

    // Or discard: the lines are not written back to main memory.
    cuda::discard_memory(ptr, size);
ANNOTATED POINTER
Pass 1: Multi sweep on weather and climate stencils

    // Pass 1: sweep up through the Z layers
    for (int k = 0; k < nz; ++k) {
        pos += y_stride;
        float tmp_reg = dstm * src_p[pos];
        dst[pos] = tmp_reg + 1;
        tmp[pos] = tmp_reg - 1;
        dstm = tmp_reg;
    }
ANNOTATED POINTER
Pass 2: Multi sweep on weather and climate stencils

    // Pass 2: sweep back down
    for (int k = nz - 1; k >= 0; --k) {
        pos -= z_stride;
        dstm += (tmp[pos] - dst[pos] + 2.f);
        dst[pos] = dstm;
    }
ANNOTATED POINTER
Multi sweep on weather and climate stencils

    cuda::annotated_ptr<float, cuda::access_property::persisting> dst_p{dst};
    cuda::annotated_ptr<float, cuda::access_property::persisting> tmp_p{tmp};
    cuda::annotated_ptr<float, cuda::access_property::streaming>  src_p{src};
ANNOTATED POINTER
Pass 2: Multi sweep on weather and climate stencils

    cuda::annotated_ptr<float, cuda::access_property::persisting> dst_p{dst};
    cuda::annotated_ptr<float, cuda::access_property::persisting> tmp_p{tmp};
    cuda::annotated_ptr<float, cuda::access_property::streaming>  src_p{src};

    for (int k = 0; k < nz; ++k) {
        pos += y_stride;
        float tmp_reg = dstm * src_p[pos];
        dst_p[pos] = tmp_reg + 1;
        tmp_p[pos] = tmp_reg - 1;
        dstm = tmp_reg;
    }
    for (int k = nz - 1; k >= 0; --k) {
        pos -= z_stride;
        dstm += (tmp_p[pos] - dst_p[pos] + 2.f);
        dst_p[pos] = dstm;
    }
ANNOTATED POINTER
Pass 2: Multi sweep on weather and climate stencils

    for (int k = 0; k < nz; ++k) {
        pos += y_stride;
        float tmp_reg = dstm * src_p[pos];
        dst_p[pos] = tmp_reg + 1;
        tmp_p[pos] = tmp_reg - 1;
        dstm = tmp_reg;
    }
    for (int k = nz - 1; k >= 0; --k) {
        pos -= z_stride;
        dstm += (tmp_p[pos] - dst_p[pos] + 2.f);
        dst_p[pos] = dstm;
        if (threadIdx.x == 0)
            cuda::discard_memory(&tmp_p[pos], 32 * sizeof(float));
    }
ANNOTATED POINTER
Performance Multi sweep on weather and climate stencils

auto l2_persistent_size = prop.persistingL2CacheMaxSize; // 30MB on A100. 75% of total


cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, l2_persistent_size);

Performance of the 2-sweep kernel, normalized to Normal = 1.00 (persistent cache size 30 MB):

    Grid          Normal   Persisting   Persisting_Discard
    330x330x30    1.00     1.15         1.56
    330x330x60    1.00     0.92         1.04


ANNOTATED POINTER
Performance Multi sweep on weather and climate stencils

auto l2_persistent_size = 20*1000*1000; // 20MB. 50% of total on A100


cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, l2_persistent_size);

Performance of the 2-sweep kernel, normalized to Normal = 1.00:

    Persistent cache size 30 MB:
    Grid          Normal   Persisting   Persisting_Discard
    330x330x30    1.00     1.15         1.56
    330x330x60    1.00     0.92         1.04

    Persistent cache size 20 MB:
    Grid          Normal   Persisting   Persisting_Discard
    330x330x30    1.00     1.15         1.60
    330x330x60    1.00     1.01         1.08


SHARED MEMORY
SHARED MEMORY
Fast, Threadblock local scratchpad

• Inside the SM, private to a thread block, user-controlled cache
• Default max = 48 KB per thread block, up to 164 KB (explicit opt-in) on A100
• Max bandwidth on A100 = 108 SMs x 1.41 GHz x 128 Bytes/clock = 19.5 TB/s
• ~10x more bandwidth than global memory, very low latency
• Only 4-Byte access granularity
SHARED MEMORY
Fast, SM-local scratchpad

Shared memory organization: 32 banks x 4 Bytes = 128 Bytes per row

Figure: 32 banks (Bank0 to Bank31), each 4 bytes wide; bytes 0, 128, 256, ... start successive rows.
SHARED MEMORY
Fast, SM-local scratchpad

Shared memory organization: 32 banks x 4 Bytes = 128 Bytes per row


Bank conflicts can happen, inside a warp:
2+ threads access different 4-Byte words from the same bank
Worst case = 32-way conflicts, 31 replays
Replays increase latency, decrease bandwidth

SHARED MEMORY
Example: 32x32 shared memory transpose

Two layouts side by side (the second adds one column of padding):

    __shared__ float sm[32][32];   // no conflicts accessing rows; 32-way conflicts accessing columns
    __shared__ float sm[32][33];   // padded: no conflicts accessing rows or columns
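A minimal sketch (not from the slides) of how the padded layout is typically used in a tiled transpose; the kernel name, the 32x32 tile and the 32x32 thread block are assumptions.

    #define TILE 32

    // Launch with dim3 block(TILE, TILE) and a grid covering the width x height matrix.
    __global__ void transpose(float* out, const float* in, int width, int height) {
        __shared__ float tile[TILE][TILE + 1];            // +1 column of padding avoids bank conflicts

        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        if (x < width && y < height)
            tile[threadIdx.y][threadIdx.x] = in[y * width + x];    // coalesced read from global memory

        __syncthreads();

        x = blockIdx.y * TILE + threadIdx.x;              // transposed block offset
        y = blockIdx.x * TILE + threadIdx.y;
        if (x < height && y < width)
            out[y * height + x] = tile[threadIdx.x][threadIdx.y];  // column read from SHM, coalesced write
    }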
SHARED MEMORY
Bank conflicts guidelines

    Element size    Banks per element   Conflict resolution level   Worst case
    1, 2, 4 bytes   1                   whole warp                  up to 32-way conflicts
    8 bytes         2                   half warp                   up to 16-way conflicts
    16 bytes        4                   quarter warp                up to 8-way conflicts

Accessing same word, or bytes inside same word = no conflict (multicast)

Easy to avoid conflicts by adding padding or changing access patterns


SHARED MEMORY
Async load to shared memory

Typical way of loading data into shared memory:

    __shared__ int smem[1024];
    smem[threadIdx.x] = input[index];

which compiles into a global load into a register, a stall, and a store to shared memory:

    LDG.E.SYS R0, [R2] ;
    * STALL *
    STS [R5], R0 ;

• Wasting registers
• Stalling while the data is loaded
• Wasting L1/SHM bandwidth
SHARED MEMORY
Async load to shared memory
The cooperative_groups API:

    cooperative_groups::memcpy_async(Group, dst*, src*, Shape);

The CUDA APIs with synchronization primitives:

    cuda::memcpy_async(Group, dst*, src*, Shape, cuda::barrier);
    cuda::memcpy_async(Group, dst*, src*, Shape, cuda::pipeline);

The data is copied to shared memory asynchronously, without staging through registers.
SHARED MEMORY
Simple replacement for filling in shared memory

Before:

    extern __shared__ float shmem[];

    for (int i = threadIdx.x; i < size; i += blockDim.x)
    {
        shmem[i] = gmem[i];
    }
    __syncthreads();

After:

    extern __shared__ float shmem[];
    namespace cg = cooperative_groups;
    auto block = cg::this_thread_block();

    cg::memcpy_async(block, shmem, gmem, size * sizeof(float));
    cg::wait(block);
MEMCPY_ASYNC()
Comparing variants

Code flows top to bottom; the pipeline manages a circular queue of stages.

    Cooperative Groups                cuda::barrier                   cuda::pipeline                   What happens
    -                                 -                               Pipe.producer_acquire()          Wait for the consumer to release the oldest pipeline stage.
    cg::memcpy_async(g, ...)          cuda::memcpy_async(g, ..., bar) cuda::memcpy_async(g, ..., pipe) Issue async copies on the barrier (or pipe->barrier[i]).
    -                                 bar.arrive()                    Pipe.producer_commit()           Commit the issued memcpy_async to the barrier; thread arrival on the barrier.
    cg::wait() / cg::wait_prior<N>()  bar.wait()                      Pipe.consumer_wait()             Wait on the barrier.
    -                                 -                               Pipe.consumer_release()          Release the current pipeline stage (pipeline only).
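As one concrete use of the cuda::barrier variant, here is a minimal sketch (not from the slides); the kernel, the one-tile-per-block structure and the trivial computation are assumptions, and n is assumed to be a multiple of blockDim.x.

    #include <cooperative_groups.h>
    #include <cuda/barrier>

    __global__ void scale(float* out, const float* in, size_t n) {
        namespace cg = cooperative_groups;
        auto block = cg::this_thread_block();

        extern __shared__ float smem[];                   // blockDim.x floats of dynamic shared memory
        __shared__ cuda::barrier<cuda::thread_scope_block> bar;
        if (block.thread_rank() == 0)
            init(&bar, block.size());                     // one arrival per thread in the block
        block.sync();

        size_t base = (size_t)blockIdx.x * blockDim.x;
        if (base >= n) return;

        // Issue the async copy on the barrier, then wait for the copy and all threads.
        cuda::memcpy_async(block, smem, in + base, sizeof(float) * blockDim.x, bar);
        bar.arrive_and_wait();

        out[base + threadIdx.x] = 2.0f * smem[threadIdx.x];   // compute on the shared-memory tile
    }

It would be launched with blockDim.x * sizeof(float) bytes of dynamic shared memory, e.g. scale<<<n / 256, 256, 256 * sizeof(float)>>>(out, in, n).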
THREAD DIVERGENCE
DIVERGENCE
Intra-warp vs extra-warp divergence

CUDA deals with divergence at the warp level

    if (condition)
        A();    // executed by Warp 0
    else
        B();    // executed by Warp 1

Divergence between warps: all threads inside each warp take the same branch. Divergence between warps is OK.


DIVERGENCE
Intra-warp vs extra-warp divergence

CUDA deals with divergence at the warp level

    if (condition)
        A();    // Warp 0 and Warp 1, inactive threads masked out
    else
        B();    // Warp 0 and Warp 1, inactive threads masked out

Divergence inside a warp: both branches are executed, masking out the inactive threads. More instructions! Avoid intra-warp divergence if you can.
DIVERGENCE
CUDA Threads are Threads

• Since Volta, NVIDIA GPUs guarantee forward progress for threads.
• The GPU must know which threads participate in warp-synchronous instructions; synchronization is guaranteed only inside these instructions.
• The old warp-synchronous instructions without a mask are deprecated. Update your code!

EXAMPLE
Update from shfl to shfl_sync

Simple case where all the threads are known to be active:

    y = __shfl(x, 0);                      // old, deprecated
    y = __shfl_sync(0xffffffff, x, 0);     // new

Or use cooperative groups:

    namespace cg = cooperative_groups;
    auto block = cg::this_thread_block();
    auto tile32 = cg::tiled_partition<32>(block);
    y = tile32.shfl(x, 0);
EXAMPLE
Update from shfl to shfl_sync

More complex case where not all threads may participate:

    // Old, deprecated:
    for (int i = tid; i < N; i += 256)
    {
        <…>
        y = __shfl(x, 0);
        <…>
    }

    // New:
    for (int i = 0; i < N; i += 256)
    {
        mask = __ballot_sync(0xffffffff, tid + i < N);
        if (tid + i < N)
        {
            <…>
            y = __shfl_sync(mask, x, 0);
            <…>
        }
    }
EXAMPLE
Update to Cooperative Groups

You can also use cooperative groups:

    namespace cg = cooperative_groups;
    auto block = cg::this_thread_block();
    auto tile32 = cg::tiled_partition<32>(block);

    for (int i = 0; i < N; i += 256)
    {
        auto subtile = cg::binary_partition(tile32, tid + i < N);
        if (tid + i < N)
        {
            <…>
            y = subtile.shfl(x, 0);
            <…>
        }
    }
DIVERGENCE
No convergence assumptions

Expect ≠ Assume

• Expecting convergence is reasonable (performance)

• Assuming convergence is illegal!


Keeping up with GPUs getting wider and bigger
OCCUPANCY
OCCUPANCY
SM Resources on A100

Occupancy = (achieved number of threads per SM) / (maximum number of threads per SM)

A100 SM resources:
• 65536 32-bit registers
• 164 KB shared memory
• 2048 threads / 32 thread blocks (plus the CUDA cores)
OCCUPANCY
SM Resources on A100

Example kernel: 512 threads per block, 8 registers / thread (4096 / block) , 8KB shared memory per block

Figure: four thread blocks of 512 threads fill one A100 SM: 4 x 4096 = 16384 of the 65536 registers, four blocks' worth of shared memory out of 164 KB, and all 2048 threads within the 32-block limit. 100% occupancy.
OCCUPANCY
Higher occupancy

In general, higher occupancy is better:


• More warps per SM
• More parallelism to hide the latencies
• Can also do well with lower occupancy, but more work per thread.

Example kernel (occupancy = 100%):
• 512 threads per block
• 8 registers per thread
• 8 KB shared memory
OCCUPANCY
Higher occupancy

But what if this kernel launches only 80 thread blocks?

    80 blocks x 512 threads / (56 SMs x 2048 threads)  = 36% of P100
    80 blocks x 512 threads / (108 SMs x 2048 threads) = 18% of A100

Even at 100% theoretical occupancy per SM, a small grid cannot fill a larger GPU.
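To check these numbers programmatically, here is a minimal sketch (not from the slides) using the occupancy API; my_kernel, its block size and its shared-memory usage are assumptions chosen to mirror the example above.

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void my_kernel(float* data) { /* placeholder */ }

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        int blockSize = 512;
        size_t dynamicSmem = 8 * 1024;                    // 8 KB of dynamic shared memory per block

        int maxBlocksPerSM = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxBlocksPerSM, my_kernel,
                                                      blockSize, dynamicSmem);

        float occupancy = (float)(maxBlocksPerSM * blockSize) / prop.maxThreadsPerMultiProcessor;
        printf("Blocks per SM: %d, theoretical occupancy: %.0f%%\n",
               maxBlocksPerSM, occupancy * 100.f);

        // The grid also has to be large enough to fill the whole GPU:
        printf("Need at least %d blocks to fill %d SMs\n",
               prop.multiProcessorCount * maxBlocksPerSM, prop.multiProcessorCount);
        return 0;
    }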
CUDA STREAMS AND GRAPHS
CUDA STREAMS
Overlapping compute, copies and allocations

cudaMalloc() and cudaFree() halt the GPU: in the left timeline, the H2D copies, Kernel 1 and D2H copies in Stream 0 and Stream 1 have to wait around cudaMalloc(buffer) and cudaFree(buffer), leaving the GPU idle.

The new stream-ordered asynchronous allocations, cudaMallocAsync(buffer) and cudaFreeAsync(buffer), come from a memory pool with very low latency, so in the right timeline each stream's copies and kernels proceed without a GPU-wide stall.
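A minimal sketch (not from the slides) of the stream-ordered pattern on the right; the kernel, the sizes and the stream handling are assumptions.

    #include <cuda_runtime.h>

    __global__ void kernel1(float* buf, int n) { /* placeholder */ }

    void run(const float* h_in, float* h_out, int n, cudaStream_t stream) {
        float* d_buf = nullptr;

        // Allocation, copies, kernel and free are all ordered on the same stream:
        cudaMallocAsync((void**)&d_buf, n * sizeof(float), stream);
        cudaMemcpyAsync(d_buf, h_in, n * sizeof(float), cudaMemcpyHostToDevice, stream);
        kernel1<<<(n + 255) / 256, 256, 0, stream>>>(d_buf, n);
        cudaMemcpyAsync(h_out, d_buf, n * sizeof(float), cudaMemcpyDeviceToHost, stream);
        cudaFreeAsync(d_buf, stream);   // returns the memory to the pool, no GPU-wide synchronization
    }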
CUDA GRAPHS
Free up CPU resources

Figure: launching kernels A to E one at a time keeps the CPU busy issuing Launch A through Launch E into the stream, and the CPU only goes idle once everything is submitted. Building a graph once and then launching the whole graph frees the CPU: a single Launch Graph call submits A, B, C, D, E.
CUDA GRAPHS
Reduction in launch overhead

Figure: with individual launches, every kernel pays the CPU launch latency (Launch A through Launch E on the CPU timeline, A to E on the GPU timeline). With a graph, the overhead is paid once (Build Graph + Launch Graph) and time is saved between kernels.

When kernel runtime is short, execution time is dominated by the CPU launch cost.
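A minimal sketch (not from the slides) of building such a graph by stream capture and replaying it; the kernels A to E, their launch shapes and the iteration count are assumptions.

    #include <cuda_runtime.h>

    // Placeholder kernels standing in for the real work A to E.
    __global__ void A(float* d) { }  __global__ void B(float* d) { }
    __global__ void C(float* d) { }  __global__ void D(float* d) { }
    __global__ void E(float* d) { }

    void run_graph(float* d_buf, cudaStream_t stream, int iterations) {
        cudaGraph_t graph;
        cudaGraphExec_t graphExec;

        // Capture the launch sequence once...
        cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
        A<<<256, 256, 0, stream>>>(d_buf);
        B<<<256, 256, 0, stream>>>(d_buf);
        C<<<256, 256, 0, stream>>>(d_buf);
        D<<<256, 256, 0, stream>>>(d_buf);
        E<<<256, 256, 0, stream>>>(d_buf);
        cudaStreamEndCapture(stream, &graph);
        cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);

        // ...then replay it with a single launch per iteration.
        for (int i = 0; i < iterations; ++i)
            cudaGraphLaunch(graphExec, stream);
        cudaStreamSynchronize(stream);

        cudaGraphExecDestroy(graphExec);
        cudaGraphDestroy(graph);
    }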
CUDA GRAPHS
Embedding memory allocation in graphs

CUDA Graphs offers two new node types: allocation and free, with semantics identical to stream-ordered cudaMallocAsync():
▪ The pointer is returned at node creation time
▪ The pointer may be passed as an argument to later nodes
▪ Dereferencing the pointer is only permitted downstream of the allocation node and upstream of the free node

Figure: graph A -> alloc X, alloc Y -> B(X), C(X), D(Y) -> free X, free Y -> E.
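For illustration, a minimal sketch (not from the slides): cudaMallocAsync() and cudaFreeAsync() calls captured during stream capture become allocation and free nodes in the resulting graph. The kernels B, C, D and the buffer sizes are assumptions.

    #include <cuda_runtime.h>

    __global__ void B(int* x) { }  __global__ void C(int* x) { }  __global__ void D(int* y) { }

    void build_graph_with_alloc(cudaStream_t stream, cudaGraphExec_t* graphExec) {
        cudaGraph_t graph;
        int *X = nullptr, *Y = nullptr;

        cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
        cudaMallocAsync((void**)&X, 1 << 20, stream);     // becomes an allocation node; pointer known now
        cudaMallocAsync((void**)&Y, 1 << 20, stream);
        B<<<128, 128, 0, stream>>>(X);                    // X is only dereferenced downstream of its alloc node
        C<<<128, 128, 0, stream>>>(X);
        D<<<128, 128, 0, stream>>>(Y);
        cudaFreeAsync(X, stream);                         // becomes a free node
        cudaFreeAsync(Y, stream);
        cudaStreamEndCapture(stream, &graph);

        cudaGraphInstantiate(graphExec, graph, nullptr, nullptr, 0);
        cudaGraphDestroy(graph);
    }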
TAKEAWAYS
How to keep up with larger GPUs

• Be aware of your grid sizes and achieved occupancy

• Re-visit code, express more parallelism

• Launch more independent kernels in parallel: CUDA streams, graphs

• Share the GPU with multiple processes: MPS

• Split the GPU into smaller instances: MIG
