
PERFORMANCE OPTIMIZATION WITH MODERN CUDA

PROGRAMMING TECHNIQUES
GUILLAUME THOMAS-COLLIGNON, VISHAL MEHTA
DEVTECH COMPUTE
• Memory Hierarchy
▪ Memory Access Patterns
▪ Memory Model
▪ L2 Cache
▪ Shared Memory

• Thread Divergence

• GPU Occupancy

• CUDA Streams & Graphs


MEMORY HIERARCHY
Understanding Memory and Caches

Figure: NVIDIA A100 80GB memory hierarchy. 108 SMs (each with compute units, registers and 192 KB of combined L1 / shared memory), a 40 MB L2 cache, 80 GB of HBM2e at 2.0 TB/s, connected via PCIe and NVLink.
MEMORY HIERARCHY
Nsight Compute View
MEMORY HIERARCHY
Instructions, Requests, Sectors

One warp = 32 threads. A load instruction issued by a warp is one request to the memory system, and requests are served at a granularity of 32-byte sectors.

The key question: how many different 32-byte sectors are touched by the warp?
MEMORY ACCESS PATTERNS
Figure: coalesced accesses.
• 1 byte per thread, coalesced: the warp touches 1 sector.
• 4 bytes per thread, coalesced: the warp touches 4 sectors.
• 8 bytes per thread, coalesced: the warp touches 8 sectors.

Coalesced memory accesses touch only the ideal number of sectors.

MEMORY ACCESS PATTERNS
Figure: unaligned accesses.
• 1 byte per thread, unaligned: the warp spills into one extra sector.
• 4 bytes per thread, unaligned: 5 sectors instead of 4.
• 8 bytes per thread, unaligned: 9 sectors instead of 8.

Coalesced but unaligned memory accesses increase the number of sectors per request.

MEMORY ACCESS PATTERNS
Figure: strided accesses.
• 1 byte per thread, stride = 2: the warp touches 2 sectors instead of 1.
• 4 bytes per thread, stride = 2: 8 sectors instead of 4.

Strided memory accesses also increase the number of sectors per request.
MEMORY ACCESS PATTERNS

Figure: random accesses; the 32 threads of the warp scatter across many different sectors.

Random memory accesses request too many sectors, wasting bandwidth!
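To make the patterns concrete, here is a minimal sketch (not from the slides) contrasting a coalesced copy with a strided one; the kernel names and the stride of 2 are assumptions.

    // Coalesced: consecutive threads read consecutive floats, so a warp of 32 threads
    // touches the minimum 4 sectors (32 x 4 bytes = 128 bytes).
    __global__ void copy_coalesced(float* dst, const float* src, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) dst[i] = src[i];
    }

    // Strided (stride = 2): the same warp now spans 8 sectors, so half of every
    // sector fetched is wasted bandwidth.
    __global__ void copy_strided(float* dst, const float* src, int n) {
        int i = (blockIdx.x * blockDim.x + threadIdx.x) * 2;
        if (i < n) dst[i] = src[i];
    }

Nsight Compute would show the difference in sectors per request between the two versions.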
MEMORY HIERARCHY
Nsight Compute
Instructions generate requests to the L1 cache. 32-byte sectors are transferred between the L1 and L2 caches, and eventually to/from memory.

Make sure these numbers match your expectations.
MEMORY MODEL
MEMORY MODEL
CUDA Generic Address Space

The 64-bit generic address space spans local memory (visible to a single thread), shared memory, and global memory.

• Single generic address space
• Fully conforming to the C++ object model
MEMORY MODEL
CUDA Generic Address Space

    __device__ float square(float* data) {
        float x = data[0];  // Memory operations can use a generic address,
                            // agnostic of shared, local or global memory.
                            // The compiler can optimize if it can determine the address space.
        return x * x;
    }
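For illustration, a small sketch (not from the slides) of calling square() on pointers into different address spaces; the kernel and array names are assumptions.

    __global__ void demo(float* gdata) {
        __shared__ float sdata[32];
        sdata[threadIdx.x] = gdata[threadIdx.x];
        __syncthreads();

        // The same function accepts generic pointers into global or shared memory:
        float a = square(gdata + threadIdx.x);   // generic pointer to global memory
        float b = square(sdata + threadIdx.x);   // generic pointer to shared memory
        gdata[threadIdx.x] = a + b;
    }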
MEMORY MODEL
CUDA Generic Address Space

Within the same 64-bit generic address space, shared memory is visible to a thread block, and global memory is visible to all GPU threads.
MEMORY MODEL
Reads

Reading from local or global memory can hit in the L1 or L2 caches.

Figure: A100 memory path. SM (compute units, registers, L1 / shared memory, plus the new LDGSTS asynchronous global-to-shared copy), 40 MB L2, 80 GB HBM2, PCIe / NVLink.
MEMORY MODEL
Writes
L1 is write-through, L2 is write-back.
Writes always reach at least L2; a read after write can hit in L1 (e.g. register spills).

Figure: the same A100 memory path as above (SM with registers, L1 / shared memory and LDGSTS; 40 MB L2; 80 GB HBM2; PCIe / NVLink).
MEMORY MODEL
Scope levels

Figure: scope levels.
• System scope spans host CPU memory and every device.
• Device scope covers one GPU: its global memory, L2 and SMs.
• Inside an SM: block scope and thread scope.
MEMORY MODEL
Synchronizing memory between multiple Thread Blocks

A cooperative kernel: the reader (Thread 0, Block 0, SM 0) polls a flag and then reads the data; the writer (Thread 0, Block 1, SM 1) writes the data and then sets the flag. Initially L2 holds flag = 0 and Data = ?.

    __host__ __device__
    int poll_then_read(int& flag, int& data) {
        while (flag != 1) ;
        return data;
    }

    __host__ __device__
    void write_then_signal(int& flag, int& data, int value) {
        data = value;
        flag = 1;
    }

Walking through the caches: the reader caches flag = 0 in its L1 while polling. The writer's stores are write-through, so Data = value and then flag = 1 reach L2. The reader, however, can keep spinning on the stale flag = 0 held in its own L1, and nothing orders the data write against the flag write from the reader's point of view, so this plain version needs explicit synchronization.
MEMORY MODEL
Synchronizing memory between multiple Thread Blocks

One fix: make the flag volatile and add fences, so the flag update becomes visible to the reader and the data access is ordered against it.

    __host__ __device__
    int poll_then_read(volatile int& flag, int& data) {
        while (flag != 1) ;      // data-race
        __threadfence();
        return data;
    }

    __host__ __device__
    void write_then_signal(volatile int& flag, int& data, int value) {
        data = value;
        __threadfence();
        flag = 1;                // data-race
    }

But: volatile does not mention the scope of the participating threads, and __threadfence() is device-only.
MEMORY MODEL
Synchronizing memory between multiple Thread Blocks

A cleaner fix: make the flag an atomic. The atomic load and store provide the required visibility and ordering, and the same cache walk-through now ends with the reader observing flag = true and Data = value.

    __host__ __device__
    int poll_then_read(atomic<bool>& flag, int& data) {
        while (!flag.load()) ;
        return data;
    }

    __host__ __device__
    void write_then_signal(atomic<bool>& flag, int& data, int value) {
        data = value;
        flag = true;
    }

Reader: Thread 0, Block 0, SM 0. Writer: Thread 0, Block 1, SM 1.
MEMORY MODEL
Synchronizing memory between different Thread Blocks

The atomic version above still has three performance issues:

1. It uses a system-scope atomic: atomic<Type> is atomic<Type, cuda::thread_scope_system>.
2. Full sequential consistency is not needed.
3. The spinning while loop has no backoff.
MEMORY MODEL
Synchronizing memory between different Thread Blocks

Issue 1, the system-scope atomic: all participating threads run on one GPU, so GPU device scope is sufficient.

    __host__ __device__
    int poll_then_read(
        cuda::atomic<bool, cuda::thread_scope_device>& flag, int& data) {
        while (!flag.load()) ;
        return data;
    }

    __host__ __device__
    void write_then_signal(
        cuda::atomic<bool, cuda::thread_scope_device>& flag, int& data, int value) {
        data = value;
        flag = true;
    }

MEMORY MODEL
Synchronizing memory between different Thread Blocks

Issue 2, full sequential consistency is not needed: C++ acquire/release semantics are sufficient.

    __host__ __device__
    int poll_then_read(
        cuda::atomic<bool, cuda::thread_scope_device>& flag, int& data) {
        while (!flag.load(memory_order_acquire)) ;
        return data;
    }

    __host__ __device__
    void write_then_signal(
        cuda::atomic<bool, cuda::thread_scope_device>& flag, int& data, int value) {
        data = value;
        flag.store(true, memory_order_release);
    }
MEMORY MODEL
Synchronizing memory between different Thread Blocks

Issue 3, the spinning loop with no backoff: use the wait() / notify_all() API, which has built-in exponential backoff.

    __host__ __device__
    int poll_then_read(
        cuda::atomic<bool, cuda::thread_scope_device>& flag, int& data) {
        flag.wait(false, memory_order_acquire);
        return data;
    }

    __host__ __device__
    void write_then_signal(
        cuda::atomic<bool, cuda::thread_scope_device>& flag, int& data, int value) {
        data = value;
        flag.store(true, memory_order_release);
        flag.notify_all();
    }

Reader: Thread 0, Block 0, SM 0. Writer: Thread 0, Block 1, SM 1.
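To see the final pattern in context, here is a minimal, self-contained sketch (not from the slides) that wires the reader and the writer into one kernel. The kernel name, launch shape and the managed-memory initialization are assumptions; the slides run the pair inside a cooperative kernel, which guarantees that both blocks are resident at the same time.

    #include <cstdio>
    #include <new>
    #include <cuda/atomic>

    using device_flag = cuda::atomic<bool, cuda::thread_scope_device>;

    __global__ void handoff(device_flag* flag, int* data) {
        if (blockIdx.x == 1 && threadIdx.x == 0) {        // writer: Block 1
            *data = 42;
            flag->store(true, cuda::std::memory_order_release);
            flag->notify_all();
        }
        if (blockIdx.x == 0 && threadIdx.x == 0) {        // reader: Block 0
            flag->wait(false, cuda::std::memory_order_acquire);
            printf("reader got %d\n", *data);
        }
    }

    int main() {
        device_flag* flag;
        int* data;
        cudaMallocManaged(&flag, sizeof(device_flag));
        cudaMallocManaged(&data, sizeof(int));
        new (flag) device_flag(false);                    // construct the atomic before launch
        *data = 0;

        handoff<<<2, 32>>>(flag, data);                   // two blocks: one reader, one writer
        cudaDeviceSynchronize();

        cudaFree(data);
        cudaFree(flag);
        return 0;
    }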


MEMORY MODEL
Synchronizing memory at system scope

Figure: four GPUs (GPU 0 to GPU 3); the reader runs on GPU 0 and the writer on GPU 1, so the atomic now needs system scope.

    __host__ __device__
    int poll_then_read(
        cuda::atomic<bool, cuda::thread_scope_system>& flag, int& data) {
        flag.wait(false, memory_order_acquire);
        return data;
    }

    __host__ __device__
    void write_then_signal(
        cuda::atomic<bool, cuda::thread_scope_system>& flag, int& data, int value) {
        data = value;
        flag.store(true, memory_order_release);
        flag.notify_all();
    }

Reader: Thread 0, Block 0, SM 0, GPU 0. Writer: Thread 0, Block 1, SM 1, GPU 1.
L2 CACHE MANAGEMENT
ANNOTATED POINTER
A pointer annotated with an access property

    template <typename T, typename AccessProperty>
    class cuda::annotated_ptr;

Behaves like a raw pointer plus a hint: the access property may be applied or ignored.

    int* g;                                                    // a pointer to global memory
    cuda::annotated_ptr<int, cuda::access_property::global> p{g};
    p[threadIdx.x] = 42;

It also propagates properties through ABI boundaries for independently compiled device code:

    __device__ void independently_compiled(
        cuda::annotated_ptr<int, cuda::access_property>);
ANNOTATED POINTER
Access Properties

cuda::annotated_ptr<T, cuda::access_property>

Shared memory:
• cuda::access_property::shared - memory access to shared memory.

Global memory, static hints:
• cuda::access_property::global - memory access to global memory (does not indicate access frequency).
• cuda::access_property::normal - global-memory access as frequent as others.
• cuda::access_property::persisting - global-memory access more frequent than others.
• cuda::access_property::streaming - global-memory access less frequent than others.

Global memory, dynamic hint:
• cuda::access_property - global-memory access with a dynamic hint.
  • Interleaved: request properties for probabilities of memory addresses.
  • Range: request properties for elements of an address range.

Sizes:
• sizeof(cuda::annotated_ptr<T, StaticAccessProperty>) == sizeof(T*)
• sizeof(cuda::annotated_ptr<T, cuda::access_property>) == 2 * sizeof(T*)
ANNOTATED POINTER
Interleaved dynamic property

    int* g_ptr; size_t sz;

    cuda::access_property interleaved{
        cuda::access_property::persisting{},
        0.3,
        cuda::access_property::streaming{}
    };

    cuda::annotated_ptr<int, cuda::access_property> p{
        g_ptr, interleaved
    };
    p[threadIdx.x] = 42;
    int v = p[threadIdx.x];

30% of the memory addresses are accessed with the persisting property, 70% with the streaming property.
ANNOTATED POINTER
Range dynamic property
    int* g_ptr; size_t leading_size, total_size;

    cuda::access_property range{
        g_ptr, leading_size, total_size,
        cuda::access_property::persisting{},
        cuda::access_property::streaming{}
    };

    cuda::annotated_ptr<int, cuda::access_property> p{
        g_ptr, range
    };

Roughly: accesses to the first leading_size elements starting at g_ptr get the persisting property, and the remainder of the total_size range gets the streaming property. The figure also shows the special case leading_size == total_size, where the whole range is persisting.
ANNOTATED POINTER
Prefetching memory

• Prefetches memory at L2 cache line granularity and sets access frequency

• Can use it as a larger shared memory (backed by global memory) that can be shared across thread-blocks

    __global__ void pin(int* a) {
        if (threadIdx.x == 0)  // Thread 0 prefetches one L2 cache line, i.e. 32 integer elements
            cuda::apply_access_property(a, 32 * sizeof(int), cuda::access_property::persisting{});
    }

Figure: the prefetch pulls the data from the 80 GB HBM2e into the 40 MB L2.
ANNOTATED POINTER
Discarding or Resetting memory

• Once done using it, either set its access frequency back to normal
• Or discard if the application does not need the lines to be written back to main memory
• Otherwise, the memory might be kept in the L2 for a very long time.

Discarding avoids the write-back and saves bandwidth.

    // Set the access frequency back to normal:
    cuda::apply_access_property(ptr, size, cuda::access_property::normal{});

    // Or discard: the lines are not written back to main memory.
    cuda::discard_memory(ptr, size);
ANNOTATED POINTER
Pass 1: Multi sweep on weather and climate stencils

    // Pass 1: sweep up through the Z layers
    for (int k = 0; k < nz; ++k) {
        pos += y_stride;
        float tmp_reg = dstm * src_p[pos];
        dst[pos] = tmp_reg + 1;
        tmp[pos] = tmp_reg - 1;
        dstm = tmp_reg;
    }
ANNOTATED POINTER
Pass 2: Multi sweep on weather and climate stencils

    // Pass 2: sweep back down
    for (int k = nz - 1; k >= 0; --k) {
        pos -= z_stride;
        dstm += (tmp[pos] - dst[pos] + 2.f);
        dst[pos] = dstm;
    }
ANNOTATED POINTER
Multi sweep on weather and climate stencils

    cuda::annotated_ptr<float, cuda::access_property::persisting> dst_p{dst};
    cuda::annotated_ptr<float, cuda::access_property::persisting> tmp_p{tmp};
    cuda::annotated_ptr<float, cuda::access_property::streaming>  src_p{src};
ANNOTATED POINTER
Pass 2: Multi sweep on weather and climate stencils

    cuda::annotated_ptr<float, cuda::access_property::persisting> dst_p{dst};
    cuda::annotated_ptr<float, cuda::access_property::persisting> tmp_p{tmp};
    cuda::annotated_ptr<float, cuda::access_property::streaming>  src_p{src};

    for (int k = 0; k < nz; ++k) {
        pos += y_stride;
        float tmp_reg = dstm * src_p[pos];
        dst_p[pos] = tmp_reg + 1;
        tmp_p[pos] = tmp_reg - 1;
        dstm = tmp_reg;
    }
    for (int k = nz - 1; k >= 0; --k) {
        pos -= z_stride;
        dstm += (tmp_p[pos] - dst_p[pos] + 2.f);
        dst_p[pos] = dstm;
    }
ANNOTATED POINTER
Pass 2: Multi sweep on weather and climate stencils

    for (int k = 0; k < nz; ++k) {
        pos += y_stride;
        float tmp_reg = dstm * src_p[pos];
        dst_p[pos] = tmp_reg + 1;
        tmp_p[pos] = tmp_reg - 1;
        dstm = tmp_reg;
    }
    for (int k = nz - 1; k >= 0; --k) {
        pos -= z_stride;
        dstm += (tmp_p[pos] - dst_p[pos] + 2.f);
        dst_p[pos] = dstm;
        if (threadIdx.x == 0)
            cuda::discard_memory(&tmp_p[pos], 32 * sizeof(float));
    }
ANNOTATED POINTER
Performance Multi sweep on weather and climate stencils

auto l2_persistent_size = prop.persistingL2CacheMaxSize; // 30MB on A100. 75% of total


cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, l2_persistent_size);

Performance of the 2-sweep kernel, normalized to Normal = 1.00 (persistent cache size 30 MB):

    Grid          Normal   Persisting   Persisting_Discard
    330x330x30    1.00     1.15         1.56
    330x330x60    1.00     0.92         1.04


ANNOTATED POINTER
Performance Multi sweep on weather and climate stencils

auto l2_persistent_size = 20*1000*1000; // 20MB. 50% of total on A100


cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, l2_persistent_size);

Performance of the 2-sweep kernel, normalized to Normal = 1.00:

    Persistent cache size 30 MB:
    Grid          Normal   Persisting   Persisting_Discard
    330x330x30    1.00     1.15         1.56
    330x330x60    1.00     0.92         1.04

    Persistent cache size 20 MB:
    Grid          Normal   Persisting   Persisting_Discard
    330x330x30    1.00     1.15         1.60
    330x330x60    1.00     1.01         1.08


SHARED MEMORY
SHARED MEMORY
Fast, Threadblock local scratchpad

• Inside the SM, private to a thread block, user-controlled cache
• Default max = 48 KB per thread block, up to 164 KB (explicit opt-in) on A100
• Max bandwidth on A100 = 108 SMs x 1.41 GHz x 128 Bytes/clock = 19.5 TB/s
• ~10x more bandwidth than global memory, very low latency
• Only 4-Byte access granularity
SHARED MEMORY
Fast, SM-local scratchpad

Shared memory organization: 32 banks x 4 Bytes = 128 Bytes per row

Figure: 32 banks (Bank0 to Bank31), each 4 bytes wide; bytes 0, 128, 256, ... start successive rows.
SHARED MEMORY
Fast, SM-local scratchpad

Shared memory organization: 32 banks x 4 Bytes = 128 Bytes per row


Bank conflicts can happen, inside a warp:
2+ threads access different 4-Byte words from the same bank
Worst case = 32-way conflicts, 31 replays
Replays increase latency, decrease bandwidth

SHARED MEMORY
Example: 32x32 shared memory transpose

Two layouts side by side (the second adds one column of padding):

    __shared__ float sm[32][32];   // no conflicts accessing rows; 32-way conflicts accessing columns
    __shared__ float sm[32][33];   // padded: no conflicts accessing rows or columns
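A minimal sketch (not from the slides) of how the padded layout is typically used in a tiled transpose; the kernel name, the 32x32 tile and the 32x32 thread block are assumptions.

    #define TILE 32

    // Launch with dim3 block(TILE, TILE) and a grid covering the width x height matrix.
    __global__ void transpose(float* out, const float* in, int width, int height) {
        __shared__ float tile[TILE][TILE + 1];            // +1 column of padding avoids bank conflicts

        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        if (x < width && y < height)
            tile[threadIdx.y][threadIdx.x] = in[y * width + x];    // coalesced read from global memory

        __syncthreads();

        x = blockIdx.y * TILE + threadIdx.x;              // transposed block offset
        y = blockIdx.x * TILE + threadIdx.y;
        if (x < height && y < width)
            out[y * height + x] = tile[threadIdx.x][threadIdx.y];  // column read from SHM, coalesced write
    }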
SHARED MEMORY
Bank conflicts guidelines

    Element size    Banks per element   Conflict resolution level   Worst case
    1, 2, 4 bytes   1                   whole warp                  up to 32-way conflicts
    8 bytes         2                   half warp                   up to 16-way conflicts
    16 bytes        4                   quarter warp                up to 8-way conflicts

Accessing same word, or bytes inside same word = no conflict (multicast)

Easy to avoid conflicts by adding padding or changing access patterns


SHARED MEMORY
Async load to shared memory

Typical way of loading data into shared memory:

    __shared__ int smem[1024];
    smem[threadIdx.x] = input[index];

which compiles into a global load into a register, a stall, and a store to shared memory:

    LDG.E.SYS R0, [R2] ;
    * STALL *
    STS [R5], R0 ;

• Wasting registers
• Stalling while the data is loaded
• Wasting L1/SHM bandwidth
SHARED MEMORY
Async load to shared memory
The cooperative_groups API:

    cooperative_groups::memcpy_async(Group, dst*, src*, Shape);

The CUDA APIs with synchronization primitives:

    cuda::memcpy_async(Group, dst*, src*, Shape, cuda::barrier);
    cuda::memcpy_async(Group, dst*, src*, Shape, cuda::pipeline);

The data is copied to shared memory asynchronously, without staging through registers.
SHARED MEMORY
Simple replacement for filling in shared memory

Before:

    extern __shared__ float shmem[];

    for (int i = threadIdx.x; i < size; i += blockDim.x)
    {
        shmem[i] = gmem[i];
    }
    __syncthreads();

After:

    extern __shared__ float shmem[];
    namespace cg = cooperative_groups;
    auto block = cg::this_thread_block();

    cg::memcpy_async(block, shmem, gmem, size * sizeof(float));
    cg::wait(block);
MEMCPY_ASYNC()
Comparing variants

Code flows top to bottom; the pipeline manages a circular queue of stages.

    Cooperative Groups                cuda::barrier                   cuda::pipeline                   What happens
    -                                 -                               Pipe.producer_acquire()          Wait for the consumer to release the oldest pipeline stage.
    cg::memcpy_async(g, ...)          cuda::memcpy_async(g, ..., bar) cuda::memcpy_async(g, ..., pipe) Issue async copies on the barrier (or pipe->barrier[i]).
    -                                 bar.arrive()                    Pipe.producer_commit()           Commit the issued memcpy_async to the barrier; thread arrival on the barrier.
    cg::wait() / cg::wait_prior<N>()  bar.wait()                      Pipe.consumer_wait()             Wait on the barrier.
    -                                 -                               Pipe.consumer_release()          Release the current pipeline stage (pipeline only).
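As one concrete use of the cuda::barrier variant, here is a minimal sketch (not from the slides); the kernel, the one-tile-per-block structure and the trivial computation are assumptions, and n is assumed to be a multiple of blockDim.x.

    #include <cooperative_groups.h>
    #include <cuda/barrier>

    __global__ void scale(float* out, const float* in, size_t n) {
        namespace cg = cooperative_groups;
        auto block = cg::this_thread_block();

        extern __shared__ float smem[];                   // blockDim.x floats of dynamic shared memory
        __shared__ cuda::barrier<cuda::thread_scope_block> bar;
        if (block.thread_rank() == 0)
            init(&bar, block.size());                     // one arrival per thread in the block
        block.sync();

        size_t base = (size_t)blockIdx.x * blockDim.x;
        if (base >= n) return;

        // Issue the async copy on the barrier, then wait for the copy and all threads.
        cuda::memcpy_async(block, smem, in + base, sizeof(float) * blockDim.x, bar);
        bar.arrive_and_wait();

        out[base + threadIdx.x] = 2.0f * smem[threadIdx.x];   // compute on the shared-memory tile
    }

It would be launched with blockDim.x * sizeof(float) bytes of dynamic shared memory, e.g. scale<<<n / 256, 256, 256 * sizeof(float)>>>(out, in, n).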
THREAD DIVERGENCE
DIVERGENCE
Intra-warp vs extra-warp divergence

CUDA deals with divergence at the warp level

    if (condition)
        A();    // executed by Warp 0
    else
        B();    // executed by Warp 1

Divergence between warps: all threads inside each warp take the same branch. Divergence between warps is OK.


DIVERGENCE
Intra-warp vs extra-warp divergence

CUDA deals with divergence at the warp level

    if (condition)
        A();    // Warp 0 and Warp 1, inactive threads masked out
    else
        B();    // Warp 0 and Warp 1, inactive threads masked out

Divergence inside a warp: both branches are executed, masking out the inactive threads. More instructions! Avoid intra-warp divergence if you can.
DIVERGENCE
CUDA Threads are Threads

• Since Volta, NVIDIA GPUs guarantee forward progress for threads.
• The GPU must know which threads participate in warp-synchronous instructions; synchronization is guaranteed only inside these instructions.
• The old warp-synchronous instructions without a mask are deprecated. Update your code!

EXAMPLE
Update from shfl to shfl_sync

Simple case where all the threads are known to be active:

    y = __shfl(x, 0);                      // old, deprecated
    y = __shfl_sync(0xffffffff, x, 0);     // new

Or use cooperative groups:

    namespace cg = cooperative_groups;
    auto block = cg::this_thread_block();
    auto tile32 = cg::tiled_partition<32>(block);
    y = tile32.shfl(x, 0);
EXAMPLE
Update from shfl to shfl_sync

More complex case where not all threads may participate:

    // Old, deprecated:
    for (int i = tid; i < N; i += 256)
    {
        <…>
        y = __shfl(x, 0);
        <…>
    }

    // New:
    for (int i = 0; i < N; i += 256)
    {
        mask = __ballot_sync(0xffffffff, tid + i < N);
        if (tid + i < N)
        {
            <…>
            y = __shfl_sync(mask, x, 0);
            <…>
        }
    }
EXAMPLE
Update to Cooperative Groups

You can also use cooperative groups:

    namespace cg = cooperative_groups;
    auto block = cg::this_thread_block();
    auto tile32 = cg::tiled_partition<32>(block);

    for (int i = 0; i < N; i += 256)
    {
        auto subtile = cg::binary_partition(tile32, tid + i < N);
        if (tid + i < N)
        {
            <…>
            y = subtile.shfl(x, 0);
            <…>
        }
    }
DIVERGENCE
No convergence assumptions

Expect ≠ Assume

• Expecting convergence is reasonable (performance)

• Assuming convergence is illegal!


Keeping up with GPUs getting wider and bigger
OCCUPANCY
OCCUPANCY
SM Resources on A100

Occupancy = (achieved number of threads per SM) / (maximum number of threads per SM)

A100 SM resources:
• 65536 32-bit registers
• 164 KB shared memory
• 2048 threads / 32 thread blocks (plus the CUDA cores)
OCCUPANCY
SM Resources on A100

Example kernel: 512 threads per block, 8 registers / thread (4096 / block) , 8KB shared memory per block

Figure: four thread blocks of 512 threads fill one A100 SM: 4 x 4096 = 16384 of the 65536 registers, four blocks' worth of shared memory out of 164 KB, and all 2048 threads within the 32-block limit. 100% occupancy.
OCCUPANCY
Higher occupancy

In general, higher occupancy is better:


• More warps per SM
• More parallelism to hide the latencies
• Can also do well with lower occupancy, but more work per thread.

Example kernel (occupancy = 100%):
• 512 threads per block
• 8 registers per thread
• 8 KB shared memory
OCCUPANCY
Higher occupancy

But what if this kernel launches only 80 thread blocks?

    80 blocks x 512 threads / (56 SMs x 2048 threads)  = 36% of P100
    80 blocks x 512 threads / (108 SMs x 2048 threads) = 18% of A100

Even at 100% theoretical occupancy per SM, a small grid cannot fill a larger GPU.
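To check these numbers programmatically, here is a minimal sketch (not from the slides) using the occupancy API; my_kernel, its block size and its shared-memory usage are assumptions chosen to mirror the example above.

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void my_kernel(float* data) { /* placeholder */ }

    int main() {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        int blockSize = 512;
        size_t dynamicSmem = 8 * 1024;                    // 8 KB of dynamic shared memory per block

        int maxBlocksPerSM = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(&maxBlocksPerSM, my_kernel,
                                                      blockSize, dynamicSmem);

        float occupancy = (float)(maxBlocksPerSM * blockSize) / prop.maxThreadsPerMultiProcessor;
        printf("Blocks per SM: %d, theoretical occupancy: %.0f%%\n",
               maxBlocksPerSM, occupancy * 100.f);

        // The grid also has to be large enough to fill the whole GPU:
        printf("Need at least %d blocks to fill %d SMs\n",
               prop.multiProcessorCount * maxBlocksPerSM, prop.multiProcessorCount);
        return 0;
    }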
CUDA STREAMS AND GRAPHS
CUDA STREAMS
Overlapping compute, copies and allocations

cudaMalloc() and cudaFree() halt the GPU: in the left timeline, the H2D copies, Kernel 1 and D2H copies in Stream 0 and Stream 1 have to wait around cudaMalloc(buffer) and cudaFree(buffer), leaving the GPU idle.

The new stream-ordered asynchronous allocations, cudaMallocAsync(buffer) and cudaFreeAsync(buffer), come from a memory pool with very low latency, so in the right timeline each stream's copies and kernels proceed without a GPU-wide stall.
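A minimal sketch (not from the slides) of the stream-ordered pattern on the right; the kernel, the sizes and the stream handling are assumptions.

    #include <cuda_runtime.h>

    __global__ void kernel1(float* buf, int n) { /* placeholder */ }

    void run(const float* h_in, float* h_out, int n, cudaStream_t stream) {
        float* d_buf = nullptr;

        // Allocation, copies, kernel and free are all ordered on the same stream:
        cudaMallocAsync((void**)&d_buf, n * sizeof(float), stream);
        cudaMemcpyAsync(d_buf, h_in, n * sizeof(float), cudaMemcpyHostToDevice, stream);
        kernel1<<<(n + 255) / 256, 256, 0, stream>>>(d_buf, n);
        cudaMemcpyAsync(h_out, d_buf, n * sizeof(float), cudaMemcpyDeviceToHost, stream);
        cudaFreeAsync(d_buf, stream);   // returns the memory to the pool, no GPU-wide synchronization
    }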
CUDA GRAPHS
Free up CPU resources

Figure: launching kernels A to E one at a time keeps the CPU busy issuing Launch A through Launch E into the stream, and the CPU only goes idle once everything is submitted. Building a graph once and then launching the whole graph frees the CPU: a single Launch Graph call submits A, B, C, D, E.
CUDA GRAPHS
Reduction in launch overhead

Figure: with individual launches, every kernel pays the CPU launch latency (Launch A through Launch E on the CPU timeline, A to E on the GPU timeline). With a graph, the overhead is paid once (Build Graph + Launch Graph) and time is saved between kernels.

When kernel runtime is short, execution time is dominated by the CPU launch cost.
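A minimal sketch (not from the slides) of building such a graph by stream capture and replaying it; the kernels A to E, their launch shapes and the iteration count are assumptions.

    #include <cuda_runtime.h>

    // Placeholder kernels standing in for the real work A to E.
    __global__ void A(float* d) { }  __global__ void B(float* d) { }
    __global__ void C(float* d) { }  __global__ void D(float* d) { }
    __global__ void E(float* d) { }

    void run_graph(float* d_buf, cudaStream_t stream, int iterations) {
        cudaGraph_t graph;
        cudaGraphExec_t graphExec;

        // Capture the launch sequence once...
        cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
        A<<<256, 256, 0, stream>>>(d_buf);
        B<<<256, 256, 0, stream>>>(d_buf);
        C<<<256, 256, 0, stream>>>(d_buf);
        D<<<256, 256, 0, stream>>>(d_buf);
        E<<<256, 256, 0, stream>>>(d_buf);
        cudaStreamEndCapture(stream, &graph);
        cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);

        // ...then replay it with a single launch per iteration.
        for (int i = 0; i < iterations; ++i)
            cudaGraphLaunch(graphExec, stream);
        cudaStreamSynchronize(stream);

        cudaGraphExecDestroy(graphExec);
        cudaGraphDestroy(graph);
    }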
CUDA GRAPHS
Embedding memory allocation in graphs

CUDA Graphs offers two new node types: allocation and free, with semantics identical to stream-ordered cudaMallocAsync():
▪ The pointer is returned at node creation time
▪ The pointer may be passed as an argument to later nodes
▪ Dereferencing the pointer is only permitted downstream of the allocation node and upstream of the free node

Figure: graph A -> alloc X, alloc Y -> B(X), C(X), D(Y) -> free X, free Y -> E.
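For illustration, a minimal sketch (not from the slides): cudaMallocAsync() and cudaFreeAsync() calls captured during stream capture become allocation and free nodes in the resulting graph. The kernels B, C, D and the buffer sizes are assumptions.

    #include <cuda_runtime.h>

    __global__ void B(int* x) { }  __global__ void C(int* x) { }  __global__ void D(int* y) { }

    void build_graph_with_alloc(cudaStream_t stream, cudaGraphExec_t* graphExec) {
        cudaGraph_t graph;
        int *X = nullptr, *Y = nullptr;

        cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
        cudaMallocAsync((void**)&X, 1 << 20, stream);     // becomes an allocation node; pointer known now
        cudaMallocAsync((void**)&Y, 1 << 20, stream);
        B<<<128, 128, 0, stream>>>(X);                    // X is only dereferenced downstream of its alloc node
        C<<<128, 128, 0, stream>>>(X);
        D<<<128, 128, 0, stream>>>(Y);
        cudaFreeAsync(X, stream);                         // becomes a free node
        cudaFreeAsync(Y, stream);
        cudaStreamEndCapture(stream, &graph);

        cudaGraphInstantiate(graphExec, graph, nullptr, nullptr, 0);
        cudaGraphDestroy(graph);
    }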
TAKEAWAYS
How to keep up with larger GPUs

• Be aware of your grid sizes and achieved occupancy

• Re-visit code, express more parallelism

• Launch more independent kernels in parallel: CUDA streams, graphs

• Share the GPU with multiple processes: MPS

• Split the GPU into smaller instances: MIG
