NCSA02 Fundamental CUDA Optimization
NVIDIA Corporation
Outline
Most concepts in this presentation apply to any language or API on NVIDIA GPUs.
– Fermi/Kepler Architecture
– Kernel optimizations
  – Launch configuration
  – Global memory throughput
  – Shared memory access
  – Instruction throughput / control flow
Warps
[Diagram: grid of thread blocks on the host/device — GPU multiprocessors, each with registers and shared memory, backed by device DRAM]
Launch Configuration
Key to understanding:
– Instructions are issued in order
– A thread stalls when one of its operands isn’t ready:
  – A memory read by itself doesn’t stall execution
– Latency is hidden by switching threads
  – GMEM latency: 400-800 cycles
  – Arithmetic latency: 18-22 cycles
How many threads/threadblocks to launch?
Conclusion: need enough threads to hide latency (see the launch sketch below)
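As a rough illustration of "enough threads", the sketch below (hypothetical kernel and sizes, not from the slides) uses 256-thread blocks and derives the grid size from the problem size, so that each SM has many resident warps to switch between while loads are in flight:

// Minimal launch-configuration sketch (hypothetical kernel and sizes).
// The goal is simply to launch enough threads/blocks to hide latency.
__global__ void scale(float *out, const float *in, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = a * in[i];
}

void launch_scale(float *d_out, const float *d_in, float a, int n)
{
    const int threadsPerBlock = 256;   // a typical starting point
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scale<<<blocks, threadsPerBlock>>>(d_out, d_in, a, n);
}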
[Diagram: memory access path through the L2 cache to global memory]
GMEM Operations
[Diagrams: caching vs. non-caching load examples — warp accesses mapped onto memory addresses 0-448]
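To make the coalescing behaviour behind these diagrams concrete, here is a hedged sketch (hypothetical kernels) contrasting an access pattern in which consecutive threads read consecutive addresses with a strided pattern that scatters a warp's 32 accesses over many more memory segments:

// Coalesced: consecutive threads in a warp read consecutive floats,
// so each warp's request maps to a small number of memory segments.
__global__ void copy_coalesced(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Strided: each thread jumps by `stride` elements, so a warp's 32
// accesses can land in many different segments and waste most of the
// bytes fetched per transaction.
__global__ void copy_strided(float *out, const float *in, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n)
        out[i] = in[i];
}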
Impact of Address Alignment
Warps should access aligned regions for maximum memory throughput
– L1 can help for misaligned loads if several warps are accessing a contiguous region
– ECC further reduces misaligned store throughput significantly
Experiment (see the offset-copy sketch below):
– Copy 16MB of floats
– 256 threads/block
Greatest throughput drop:
– CA (caching) loads: 15%
– CG (non-caching) loads: 32%
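A minimal sketch of the kind of offset-copy kernel such an experiment uses (hypothetical names, not the slides' benchmark code): with offset 0 every warp touches aligned 128-byte segments, while a nonzero offset misaligns the accesses and lowers effective throughput.

__global__ void offset_copy(float *out, const float *in, int n, int offset)
{
    // Shift every access by `offset` floats to break 128-byte alignment.
    int i = blockIdx.x * blockDim.x + threadIdx.x + offset;
    if (i < n)
        out[i] = in[i];
}

Timing this kernel for offsets 0 through 32 with 256-thread blocks reproduces the kind of throughput drop quoted above.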
GMEM Optimization Guidelines
[Diagram: threads 0-31 of a warp accessing shared memory banks Bank 0 through Bank 31]
Shared Memory: Avoiding Bank Conflicts
[Diagram: warp threads 0-31 mapped onto shared memory banks Bank 0 through Bank 31]
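One standard way to avoid the conflicts the diagrams illustrate, sketched here assuming 32 banks of 4-byte words, a 32x32 thread block, and a square matrix whose side is a multiple of 32 (the kernel and tile size are illustrative, not from the slides): pad the shared-memory tile by one column so that a warp walking down a column touches 32 different banks.

#define TILE_DIM 32

// Without the "+ 1" padding, reading tile[threadIdx.x][threadIdx.y]
// down a column would hit the same bank for every thread in the warp
// and serialize the accesses; the extra column shifts each row by one
// bank, making the column walk conflict-free.
__global__ void transpose_tile(float *out, const float *in, int width)
{
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];

    __syncthreads();

    // Write the transposed tile: reads walk down a column of `tile`.
    int tx = blockIdx.y * TILE_DIM + threadIdx.x;
    int ty = blockIdx.x * TILE_DIM + threadIdx.y;
    out[ty * width + tx] = tile[threadIdx.x][threadIdx.y];
}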
Instruction Throughput & Control Flow
Runtime Math Library and Intrinsics
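The heading above refers to the trade-off between the accurate runtime math functions and the faster, less accurate hardware intrinsics. A small illustrative sketch (hypothetical kernels): __sinf and __expf are the single-precision intrinsics, and compiling with nvcc -use_fast_math substitutes them for the standard calls automatically.

// Runtime math library: full single-precision accuracy, more instructions.
__global__ void slow_math(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = sinf(in[i]) * expf(in[i]);
}

// Hardware intrinsics: faster, fewer instructions, reduced accuracy.
// nvcc -use_fast_math performs this substitution globally.
__global__ void fast_math(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __sinf(in[i]) * __expf(in[i]);
}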
Default API:
– Kernel launches are asynchronous with CPU
– Memcopies (D2H, H2D) block CPU thread
– CUDA calls are serialized by the driver
Streams and async functions provide:
– Memcopies (D2H, H2D) asynchronous with CPU
– Ability to concurrently execute a kernel and a memcopy
Stream = sequence of operations that execute in issue-order on GPU
– Operations from different streams may be interleaved
– A kernel and memcopy from different streams can be overlapped
Overlap kernel and memory copy
Requirements:
– D2H or H2D memcopy from pinned memory
– Kernel and memcopy in different, non-0 streams
Code:
cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
cudaMemcpyAsync(dst, src, size, dir, stream1);   // async copy in stream1
kernel<<<grid, block, 0, stream2>>>(…);          // kernel in stream2 can overlap the copy
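A more complete, hedged sketch of the same pattern (hypothetical kernel and buffer names; the host buffers must be pinned, e.g. allocated with cudaMallocHost): the work is split into chunks, and each chunk's H2D copy, kernel, and D2H copy are issued into its own stream so transfers for one chunk can overlap computation on another.

// Placeholder kernel so the sketch is self-contained.
__global__ void process(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = 2.0f * in[i];
}

void pipelined(float *h_out, const float *h_in,
               float *d_out, float *d_in, int n)
{
    const int nStreams = 2;
    const int chunk = n / nStreams;          // assume n divisible by nStreams
    cudaStream_t streams[nStreams];
    for (int s = 0; s < nStreams; ++s)
        cudaStreamCreate(&streams[s]);

    for (int s = 0; s < nStreams; ++s) {
        int off = s * chunk;
        cudaMemcpyAsync(d_in + off, h_in + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        process<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d_out + off, d_in + off, chunk);
        cudaMemcpyAsync(h_out + off, d_out + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();

    for (int s = 0; s < nStreams; ++s)
        cudaStreamDestroy(streams[s]);
}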
[Diagram: execution timelines for different issue orders — K1,K2,M1,M2; K1,M1,M2; K1,M2,M1; K1,M2,M2 (K: kernel, M: memcopy, integer: stream ID)]
More on Dual Copy
[Diagram: dual-socket system — CPU-0 and CPU-1 connected by QPI (6.4 GT/s, 25.6 GB/s); GPU-0 attached over PCIe x16 (16 GB/s)]
Duplex Copy: Experimental Results
[Diagram: duplex-copy experiment on the same topology — CPU-0/CPU-1 over QPI (25.6 GB/s), GPU-0 over PCIe x16 (16 GB/s)]
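Duplex copy drives both directions of the PCIe link at once, which requires a GPU with two copy engines (e.g. Tesla-class Fermi parts). A minimal hedged sketch, with hypothetical buffer names and pinned host memory assumed:

void duplex_copy(float *d_a, const float *h_a,
                 float *h_b, const float *d_b, size_t bytes)
{
    // Issue an H2D and a D2H transfer in different streams; with two
    // copy engines they can run simultaneously in opposite directions.
    cudaStream_t up, down;
    cudaStreamCreate(&up);
    cudaStreamCreate(&down);

    cudaMemcpyAsync(d_a, h_a, bytes, cudaMemcpyHostToDevice, up);
    cudaMemcpyAsync(h_b, d_b, bytes, cudaMemcpyDeviceToHost, down);

    cudaDeviceSynchronize();
    cudaStreamDestroy(up);
    cudaStreamDestroy(down);
}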
Unified Virtual Addressing
Easier to Program with Single Address Space
[Diagram: host and GPU memories combined into a single virtual address space across PCI-e]
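With UVA (64-bit platforms, Fermi-class GPUs, CUDA 4.0 and later) the runtime can tell from a pointer alone which memory it refers to, so copies can pass cudaMemcpyDefault instead of an explicit direction. A minimal sketch with hypothetical buffer names:

void uva_copy_example()
{
    const size_t bytes = 1 << 20;
    float *h_buf, *d_buf;
    cudaMallocHost(&h_buf, bytes);   // pinned host memory (part of the UVA space)
    cudaMalloc(&d_buf, bytes);

    // Direction is inferred from the pointers: no H2D/D2H constant needed.
    cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyDefault);
    cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDefault);

    cudaFree(d_buf);
    cudaFreeHost(h_buf);
}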
Summary