PERFORMANCE OPTIMIZATION WITH MODERN CUDA PROGRAMMING TECHNIQUES
GUILLAUME THOMAS-COLLIGNON, VISHAL MEHTA
DEVTECH COMPUTE
• Memory Hierarchy
▪ Memory Access Patterns
▪ Memory Model
▪ L2 Cache
▪ Shared Memory
• Thread Divergence
• GPU Occupancy
[Diagram: A100 memory hierarchy: per-SM compute units, registers, and 192 KB combined L1/shared memory; 40 MB L2 cache; 80 GB HBM2e at 2.0 TB/s]
MEMORY ACCESS PATTERNS
[Diagram: a warp issues a load instruction, producing one request covering all 32 threads]
How many different 32-byte sectors are touched by the warp?
[Diagrams: a 32-thread warp performing 1-byte and 4-byte per-thread coalesced, unaligned, and strided accesses, mapped onto 32-byte sectors (Sector 0 to Sector 3)]
Coalesced but unaligned memory accesses increase the number of sectors per request.
Strided memory accesses also increase the number of sectors per request.
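A minimal sketch (hypothetical kernels, not from the slides) contrasting the two patterns for 4-byte loads:

__global__ void coalesced_copy(float* out, const float* in) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  // Consecutive threads read consecutive 4-byte words:
  // one warp touches 128 contiguous bytes = 4 sectors per request.
  out[i] = in[i];
}

__global__ void strided_copy(float* out, const float* in, int stride) {
  int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
  // With a stride of 8 floats (32 bytes) or more, every thread lands
  // in a different sector: 32 sectors per request, wasting bandwidth.
  out[i] = in[i];
}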
MEMORY ACCESS PATTERNS
[Diagram: a strided access pattern scattering a warp's loads across many 32-byte sectors]
MEMORY HIERARCHY
Nsight Compute
Instructions generate requests to the L1 cache; 32B sectors are transferred between the L1 and L2 caches, and eventually to/from memory.
Make sure these numbers match your expectations.
MEMORY MODEL
CUDA Generic Address Space
[Diagram: one address space spans registers, shared memory, and L1 inside each SM, the 40 MB L2, 80 GB HBM2, and remote memory over PCIe/NVLink; new on A100: the LDGSTS instruction loads from global memory directly into shared memory]
MEMORY MODEL
Writes
L1 is write-through, L2 is write-back
Writes will always reach at least L2; Read-After-Write can hit in L1 (e.g., register spills).
[Diagram: the same hierarchy, highlighting the write path: stores go through L1 into the 40 MB L2 and on to HBM2]
MEMORY MODEL
Scope levels
[Diagram: thread scope, block scope, and system scope; system scope spans the host CPUs and memories and the L2 caches of all GPUs]
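For orientation (a sketch, not from the slides; variable names hypothetical), these scope levels map to the cuda::atomic scope parameter in libcu++:

#include <cuda/atomic>

// The scope parameter states which observers the atomic must be
// coherent with; wider scopes can cost more.
__device__ cuda::atomic<int, cuda::thread_scope_block>   a_block;  // one thread block
__device__ cuda::atomic<int, cuda::thread_scope_device>  a_device; // the whole GPU
__managed__ cuda::atomic<int, cuda::thread_scope_system> a_system; // host CPUs + all GPUs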
MEMORY MODEL
Synchronizing memory between multiple Thread Blocks
[Diagram: cooperative kernels running across GPU 0 to GPU 3; each SM has its own L1 cache, and a system-scope flag (initially 0) coordinates writer and reader]

Writer: Thread 0, Block 1, SM 1, GPU 1
__host__ __device__
void write_then_signal(cuda::atomic<bool, cuda::thread_scope_system>& flag,
                       int& data, int value) {
  data = value;
  flag.store(true, cuda::std::memory_order_release);
  flag.notify_all();
}

Reader: Thread 0, Block 0, SM 0, GPU 0
__host__ __device__
int poll_then_read(cuda::atomic<bool, cuda::thread_scope_system>& flag,
                   int& data) {
  flag.wait(false, cuda::std::memory_order_acquire);
  return data;
}
L2 CACHE MANAGEMENT
ANNOTATED POINTER
A pointer annotated with an access property

int* g;  // a pointer to global memory
cuda::annotated_ptr<int, cuda::access_property::global> p{g};  // like a raw pointer, with a hint: the property might be applied or ignored!
p[threadIdx.x] = 42;
cuda::annotated_ptr<T, cuda::access_property> accepts static or dynamic properties:
▪ cuda::access_property::shared: memory access to shared memory
▪ cuda::access_property::global: memory access to global memory (does not indicate frequency of access)

// Dynamic interleaved property: roughly 30% of accesses treated as
// persisting, the rest as streaming
cuda::access_property interleaved{
  cuda::access_property::persisting{},
  0.3,
  cuda::access_property::streaming{}
};

// Dynamic range property: the first leading_size bytes starting at g_ptr
// are persisting, the rest of the total_size range is streaming
// (leading_size == total_size makes the whole range persisting)
cuda::access_property range{
  g_ptr, leading_size, total_size,
  cuda::access_property::persisting{},
  cuda::access_property::streaming{}
};
cuda::annotated_ptr<int, cuda::access_property> p{g_ptr, range};
ANNOTATED POINTER
Prefetching memory
• Can use it as a larger shared memory (backed by global memory) that can be shared across thread blocks (see the sketch below)
[Diagram: prefetching data from 80 GB HBM2e into the 40 MB L2]
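One way to express the prefetch with libcu++ (a sketch; buffer name and size are hypothetical):

#include <cuda/annotated_ptr>

__global__ void prefetch_to_L2(float* big_buffer, size_t bytes) {
  // Prefetch the range into L2 and mark it persisting, so later
  // thread blocks find it resident: L2 acts as a large,
  // global-memory-backed scratchpad shared across blocks.
  cuda::apply_access_property(big_buffer, bytes,
                              cuda::access_property::persisting{});
}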
ANNOTATED POINTER
Discarding or Resetting memory
• Once done using it, either set its access frequency back to normal,
• Or discard it, if the application does not need the lines written back to main memory (see the sketch below).
• Otherwise, the memory might be kept in the L2 for a very long time.
[Diagram: two L2 states, before and after discard; discarded lines are dropped from the 40 MB L2 instead of being written back to 80 GB HBM2e]
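Both options map onto libcu++ helpers; a sketch (buffer name hypothetical):

#include <cuda/annotated_ptr>

__global__ void release_scratch(float* big_buffer, size_t bytes) {
  // Either: reset the lines to normal eviction priority...
  cuda::apply_access_property(big_buffer, bytes,
                              cuda::access_property::normal{});
  // ...or: declare the contents dead, so dirty L2 lines are dropped
  // instead of being written back to main memory.
  cuda::discard_memory(big_buffer, bytes);
}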
ANNOTATED POINTER
Pass 1: Multi sweep on weather and climate stencils

// Forward sweep along Z: read src, write dst and tmp
for (int k = 0; k < nz; ++k) {
  pos += z_stride;
  float tmp_reg = dstm * src[pos];
  dst[pos] = tmp_reg + 1;
  tmp[pos] = tmp_reg - 1;
  dstm = tmp_reg;
}

ANNOTATED POINTER
Pass 2: Multi sweep on weather and climate stencils

// Backward sweep along Z: read tmp and dst, rewrite dst
for (int k = nz - 1; k >= 0; --k) {
  pos -= z_stride;
  dstm += (tmp[pos] - dst[pos] + 2.f);
  dst[pos] = dstm;
}
ANNOTATED POINTER
Multi sweep on weather and climate stencils

// dst and tmp are reused by the second sweep: mark them persisting.
// src is read only once: mark it streaming.
cuda::annotated_ptr<float, cuda::access_property::persisting> dst_p{dst};
cuda::annotated_ptr<float, cuda::access_property::persisting> tmp_p{tmp};
cuda::annotated_ptr<float, cuda::access_property::streaming>  src_p{src};
ANNOTATED POINTER
Pass 2: Multi sweep on weather and climate stencils

cuda::annotated_ptr<float, cuda::access_property::persisting> dst_p{dst};
cuda::annotated_ptr<float, cuda::access_property::persisting> tmp_p{tmp};
cuda::annotated_ptr<float, cuda::access_property::streaming>  src_p{src};

// Pass 1: forward sweep through the annotated pointers
for (int k = 0; k < nz; ++k) {
  pos += z_stride;
  float tmp_reg = dstm * src_p[pos];
  dst_p[pos] = tmp_reg + 1;
  tmp_p[pos] = tmp_reg - 1;
  dstm = tmp_reg;
}
// Pass 2: backward sweep; tmp and dst are likely still resident in L2
for (int k = nz - 1; k >= 0; --k) {
  pos -= z_stride;
  dstm += (tmp_p[pos] - dst_p[pos] + 2.f);
  dst_p[pos] = dstm;
}
ANNOTATED POINTER
Pass 2: Multi sweep on weather and climate stencils
[Bar charts: performance relative to the 1.00 baseline; annotated pointers reach up to 1.56x on one configuration and up to 1.15x on another, while one variant regresses to 0.92x]
SHARED MEMORY
Fast, SM-local scratchpad
• Max bandwidth on A100 = 108 SMs x 1.41 GHz x 128 bytes/clock = 19.5 TB/s
• ~10x more bandwidth than global memory, very low latency
[Diagram: each of the 108 SMs has its own compute units, registers, and 192 KB combined L1/shared memory in front of the L2]
[Diagram: shared memory is organized as 32 banks of 4 bytes; bytes 0, 128, and 256 map to the same bank]
SHARED MEMORY
Example: 32x32 shared memory transpose
[Diagram: bank numbers for a 32x32 float tile; without padding, a column access hits the same bank for every thread in the warp]
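A minimal sketch (not verbatim from the slides; assumes a square matrix whose side is a multiple of 32 and 32x32 thread blocks) of the padded-tile fix:

__global__ void transpose32(float* out, const float* in, int width) {
  __shared__ float tile[32][33];   // +1 column shifts each row by one bank
  int x = blockIdx.x * 32 + threadIdx.x;
  int y = blockIdx.y * 32 + threadIdx.y;
  tile[threadIdx.y][threadIdx.x] = in[y * width + x];  // coalesced read
  __syncthreads();
  x = blockIdx.y * 32 + threadIdx.x;                   // swap block indices
  y = blockIdx.x * 32 + threadIdx.y;
  // The column read below would hit one bank 32 times without the
  // padding; with 33 columns it touches 32 different banks.
  out[y * width + x] = tile[threadIdx.x][threadIdx.y]; // coalesced write
}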
Filling shared memory through registers:
[Diagram: data travels from L2 through L1 into registers, then into shared memory]
• Wasting registers
• Stalling while the data is loaded
SHARED MEMORY
Simple replacement for filling in shared memory
[Diagram: with memcpy_async, data moves from L2 into shared memory directly, without staging through registers]
MEMCPY_ASYNC()
Comparing variants

Thread arrival on the barrier:  Pipe.producer_commit()  |  bar.arrive()
Wait on the barrier:  Pipe.consumer_wait()  |  bar.wait()  |  cg::wait() / cg::wait_prior<N>()
Release current pipeline stage (pipeline only):  Pipe.consumer_release()
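A minimal sketch of the cooperative-groups variant (kernel name and sizes are illustrative; launch with 256 threads per block):

#include <cooperative_groups.h>
#include <cooperative_groups/memcpy_async.h>
namespace cg = cooperative_groups;

__global__ void stage_and_compute(const float* gmem, float* out) {
  __shared__ float smem[256];
  auto block = cg::this_thread_block();

  // The whole block issues the copy; on A100 this compiles to LDGSTS,
  // moving data into shared memory without staging through registers.
  cg::memcpy_async(block, smem, gmem + blockIdx.x * 256,
                   sizeof(float) * 256);
  cg::wait(block);  // all threads wait for the copy to land

  int t = block.thread_rank();
  out[blockIdx.x * 256 + t] = smem[t] * 2.f;
}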
THREAD DIVERGENCE
DIVERGENCE
Intra-warp vs inter-warp divergence

if (condition)
  A();
else
  B();
// Divergence between warps: all threads inside each warp take the
// same branch, so warp 0 runs A() while warp 1 runs B().

if (condition)
  A();
else
  B();
// Divergence inside a warp: the warp executes both branches,
// masking out inactive threads. More instructions!

Avoid intra-warp divergence if you can.
DIVERGENCE
CUDA Threads are Threads
Or, use cooperative groups:

namespace cg = cooperative_groups;
auto block  = cg::this_thread_block();
auto tile32 = cg::tiled_partition<32>(block);
y = tile32.shfl(x, 0);
EXAMPLE
Update from shfl to shfl_sync

// Before: deprecated __shfl, implicit full-warp assumption
for (int i = tid; i < N; i += 256)
{
  <…>
  y = __shfl(x, 0);
  <…>
}

// After: restructured loop with an explicit mask of active threads
for (int i = 0; i < N; i += 256)
{
  mask = __ballot_sync(0xffffffff, tid + i < N);
  if (tid + i < N)
  {
    <…>
    y = __shfl_sync(mask, x, 0);
    <…>
  }
}
EXAMPLE
Update to Cooperative Groups

namespace cg = cooperative_groups;
auto block = cg::this_thread_block();
auto tile32 = cg::tiled_partition<32>(block);

// Before: explicit mask with __ballot_sync / __shfl_sync
for (int i = 0; i < N; i += 256)
{
  mask = __ballot_sync(0xffffffff, tid + i < N);
  if (tid + i < N)
  {
    <…>
    y = __shfl_sync(mask, x, 0);
    <…>
  }
}

// After: let cooperative groups partition the tile
for (int i = 0; i < N; i += 256)
{
  auto subtile = cg::binary_partition(tile32, tid + i < N);
  if (tid + i < N)
  {
    <…>
    y = subtile.shfl(x, 0);
    <…>
  }
}
DIVERGENCE
No convergence assumptions
Expect ≠ Assume: you may expect threads to reconverge after a branch, but you must not assume it.
OCCUPANCY
SM Resources on A100
Example kernel: 512 threads per block, 8 registers/thread (4096/block), 8 KB shared memory per block
[Diagram: an A100 SM supports up to 2048 threads / 32 thread blocks; four 512-thread blocks (Thread Block 0 to 3) fill it for 100% occupancy]
OCCUPANCY
Higher occupancy
Example kernel:
• 512 threads per block
• 8 registers per thread
• 8 KB shared memory per block
→ occupancy = 100%
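A sketch of checking this at runtime with the CUDA occupancy API (kernel name hypothetical; the 8 KB is passed as dynamic shared memory here to approximate the example's static allocation):

#include <cstdio>

__global__ void my_kernel(float* data) { /* ... */ }

int main() {
  int max_blocks = 0, device = 0, threads = 512;
  size_t smem = 8 * 1024;
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, device);
  // Max resident blocks of my_kernel per SM for this configuration
  cudaOccupancyMaxActiveBlocksPerMultiprocessor(&max_blocks, my_kernel,
                                                threads, smem);
  float occ = (float)(max_blocks * threads) / prop.maxThreadsPerMultiProcessor;
  printf("Occupancy: %.0f%%\n", occ * 100.f);  // 100% for the example kernel
}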
CUDA GRAPHS
Reduction in launch overhead
When kernel runtime is short, execution time is dominated by CPU launch cost.
[Timeline: kernels A-E launched individually; each launch pays CPU launch latency before its GPU time]
[Timeline: build the graph once, then launch the whole graph; the per-kernel launch gaps disappear and time is saved, with the CPU idle while A-E run back to back]
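A minimal stream-capture sketch (the stand-in kernel and iteration count are hypothetical): record the work once, then replay it with a single launch per iteration.

#include <cuda_runtime.h>

__global__ void step() { /* stand-in for kernels A..E */ }

int main() {
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  cudaGraph_t graph;
  cudaGraphExec_t graphExec;

  // Record the short kernels once; nothing executes during capture.
  cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
  for (int k = 0; k < 5; ++k)          // stands in for A, B, C, D, E
    step<<<1, 256, 0, stream>>>();
  cudaStreamEndCapture(stream, &graph);

  cudaGraphInstantiate(&graphExec, graph, nullptr, nullptr, 0);  // CUDA 11-style signature

  // One cudaGraphLaunch replaces five kernel launches per iteration.
  for (int i = 0; i < 1000; ++i)
    cudaGraphLaunch(graphExec, stream);
  cudaStreamSynchronize(stream);
}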
CUDA GRAPHS
Embedding memory allocation in graphs
CUDA Graphs offers two new node types: allocation & free
Identical semantics to stream-ordered cudaMallocAsync():
▪ Pointer is returned at node creation time
▪ Pointer may be passed as argument to later nodes
▪ Dereferencing the pointer is only permitted downstream of the allocation node & upstream of the free node
[Graph diagram: A → alloc X, alloc Y → B(X), C(X), D(Y) → free X, free Y → E]
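A sketch of obtaining alloc/free nodes via stream capture (kernel and size hypothetical): cudaMallocAsync/cudaFreeAsync calls recorded during capture become allocation and free nodes in the graph.

#include <cuda_runtime.h>

__global__ void B(float* x) { /* uses x downstream of the alloc node */ }

int main() {
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  cudaGraph_t graph;
  cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);

  float* x = nullptr;
  cudaMallocAsync(&x, 1 << 20, stream);  // becomes an allocation node
  B<<<256, 256, 0, stream>>>(x);         // may dereference x: it is downstream
  cudaFreeAsync(x, stream);              // becomes a free node

  cudaStreamEndCapture(stream, &graph);
}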
TAKEAWAYS
How to keep up with larger GPUs