CUDA Optimization Fundamentals
Cliff Woolley
Developer Technology Engineer
Note: Fundamentals will apply broadly
© NVIDIA 2013
Main Requirements for GPU Performance
APOD: A Systematic Path to Performance
[Cycle: Assess → Parallelize → Optimize → Deploy]
Assess
Identify the HOTSPOTS.
[Figure: ways to accelerate applications: Libraries, Compiler Directives, Programming Languages.]
Optimize
Profile-driven optimization
Tools:
nsight: Nsight Visual Studio Edition or Nsight Eclipse Edition
nvvp: NVIDIA Visual Profiler
nvprof: command-line profiling
Deploy
Productize early gains.
Subsequent changes are evolutionary.
ASSESS
Assess
Let’s investigate…
Strong scaling and Amdahl’s Law
Weak scaling and Gustafson’s Law
Expected perf limiters: Bandwidth? Computation? Latency?
Assess: Understanding Scaling
Strong Scaling
A measure of how, for fixed overall problem size, the time to
solution decreases as more processors are added to a system
Linear strong scaling: speedup achieved is equal to number of
processors used
Amdahl’s Law:
$S = \dfrac{1}{(1-P) + \frac{P}{N}} \approx \dfrac{1}{1-P}$
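For example (illustrative numbers): if P = 0.93 of the run time can be parallelized, then even with unlimited processors the speedup is bounded by $S \approx 1/(1-0.93) \approx 14\times$.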
Assess: Understanding Scaling
Weak Scaling
A measure of how time to solution changes as more processors
are added with fixed problem size per processor
Linear weak scaling: overall problem size increases as num. of
processors increases, but execution time remains constant
Gustafson’s Law:
$S = N + (1-P)(1-N)$
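For example (same illustrative P = 0.93): with N = 16 processors, $S = 16 + (1-0.93)(1-16) \approx 14.95$ for the correspondingly larger problem.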
Assess: Applying Strong and Weak Scaling
Assess: Applying Strong Scaling
[Figure: profiling shows ~93% of run time in the part being parallelized, i.e. P ≈ 0.93.]
Assess: Speed of Light
Not sure?
Get a rough estimate by counting bytes per instruction and compare it to the "balanced" peak ratio
$\dfrac{\text{GBytes/sec}}{\text{Ginsns/sec}}$
Profiler will help you determine this
Assess: Limiting Factor
For our example SpMV kernel, our first discovery was that we are latency-limited, not bandwidth-limited, since utilization was so low.
PARALLELIZE
Computation
Parallelize
[Figure: ways to parallelize applications: Libraries, Compiler Directives, Programming Languages.]
Parallelize: e.g., with Thrust
thrust.github.com or developer.nvidia.com/thrust
Parallelize: e.g., with OpenACC
Directives-based approach applied to your original Fortran or C code.
www.nvidia.com/gpudirectives
Parallelize: e.g., with CUDA C
Standard C Code:

void saxpy_serial(int n, float a, float *x, float *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}

CUDA C Code:

__global__
void saxpy_parallel(int n, float a, float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];
}

developer.nvidia.com/cuda-toolkit
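A minimal host-side launch for the CUDA C version (the 256-thread block size and the device pointers d_x, d_y are illustrative choices, not from the slide):

int threads = 256;                                  // illustrative block size
int blocks  = (n + threads - 1) / threads;          // enough blocks to cover all n elements
saxpy_parallel<<<blocks, threads>>>(n, a, d_x, d_y);
cudaDeviceSynchronize();                            // wait for the kernel to finish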
Parallelism Needed
An Initial CUDA Version
__global__ void transpose(float in[], float out[], int N)
{
    // Still a serial loop nest: no parallelism exposed yet
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            out[i*N+j] = in[j*N+i];
}

Parallelized across matrix elements, each thread instead handles a single element
(tid = threadIdx.x, bid = blockIdx.x):

    out[tid*N+bid] = in[bid*N+tid];
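A minimal sketch of that parallelization, assuming tid = threadIdx.x, bid = blockIdx.x, and a one-row-per-block launch (the exact mapping used on the original slide is an assumption here):

__global__ void transpose_parallel(float in[], float out[], int N)
{
    int bid = blockIdx.x;    // one block per row of 'in'
    int tid = threadIdx.x;   // one thread per element of that row
    if (bid < N && tid < N)
        out[tid*N + bid] = in[bid*N + tid];
}

// Launch sketch (assumes N <= 1024, the per-block thread limit):
// transpose_parallel<<<N, N>>>(d_in, d_out, N);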
PARALLELIZE
Data Transfer
Asynchronicity = Overlap = Parallelism
[Figure: DMA copy engines allow data transfers to overlap with CPU and GPU work.]
Asynchronicity
Parallelize: Achieve Asynchronicity
OPTIMIZE
Main Requirements for GPU Performance
GPU Optimization Fundamentals
http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/
GPU Optimization Fundamentals
Kernel optimizations
Launch configuration
Global memory throughput
Shared memory access
Instruction throughput / control flow
Kernel Launch Configuration
mykernel<<<blocks_per_grid,threads_per_block>>>(…);
[Figure: Kepler GK110.]
Kepler Streaming Multiprocessor (SMX)
Per SMX:
192 SP CUDA Cores
64 DP CUDA Cores
4 warp schedulers
Up to 2048 concurrent threads
One or two instructions issued
per scheduler per clock from a
single warp
Register file (256KB)
Shared memory (48KB)
CUDA Execution Model
[Figure: a grid of thread blocks executing on the device.]
Launch Configuration: General Guidelines
Warps
A warp of 32 threads is executed physically in parallel on a multiprocessor.
Threads of a warp issue instructions in lock-step (as with SIMD).
[Figure: a thread block is divided into warps of 32 threads each, which run on a multiprocessor.]
Hardware Levels of Parallelism
SIMD (Single Instruction, Multiple Data): in-core parallelism.
Simultaneous Multithreading, OpenMP, pthreads: cross-core, cross-socket parallelism within a single computer.
Multiple "computers": tightly-coupled supercomputing apps.
SIMT (Single Instruction, Multiple Threads): in-processor parallelism, many threads on many cores.

CPU: optimized for low-latency access to cached data sets; control logic for out-of-order and speculative execution.
GPU: optimized for data-parallel, throughput computation; architecture tolerant of memory latency; more transistors dedicated to computation.
Occupancy
Low Latency or High Throughput?
Latency Hiding
    FFMA R0, R43, R0, R4;
    FFMA R1, R43, R4, R5;
    FMUL R7, R9, R0;   // depends on R0 from the first FFMA
    FMUL R8, R9, R1;   // depends on R1 from the second FFMA
    ST.E [R2], R7;     // depends on R7
Occupancy
Occupancy and Performance
Thread Block Size and Occupancy
Thread Block Sizing
Too many threads per block
CUDA Occupancy Calculator
Analyze the effect of resource consumption on occupancy.
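For reference, the CUDA runtime (6.5 and newer) also provides occupancy helpers; a minimal sketch, using mykernel from the launch-configuration slide:

int minGridSize = 0, blockSize = 0;
// Suggests a block size that maximizes theoretical occupancy for mykernel
cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, mykernel, 0, 0);

int blocksPerSM = 0;
// Reports how many blocks of that size can be resident per multiprocessor
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, mykernel, blockSize, 0);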
Occupancy Analysis in NVIDIA Visual Profiler
OPTIMIZE
Kernel Optimizations: Global Memory Throughput
CUDA Memory Architecture
[Figure: memory architecture: host DRAM, device DRAM, and GPU multiprocessors, each with registers and shared memory.]
Optimizing Memory Throughput
Little’s Law (access latency L):
    # bytes in flight = latency * bandwidth
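For illustration (assumed numbers, not from the slides): at ~250 GB/s of memory bandwidth and ~400 ns of effective access latency, roughly 250 GB/s × 400 ns ≈ 100 KB must be in flight to keep the memory system busy.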
Illustration: Little’s Law for Escalators
Memory-Level Parallelism: Requests in flight
Requests per Thread and Performance
To achieve the same throughput at lower occupancy or with smaller words, you need more independent requests per warp.
Optimizing Access Concurrency
OPTIMIZE
Kernel Optimizations: Global Memory Access Coalescing
Mechanics of a Memory Access
Operation:
Threads in a warp provide memory addresses
Hardware determines which lines/segments are needed, fetches them
Memory Access Efficiency Analysis
Access Patterns vs. Memory Throughput
Scenario:
Warp requests 32 aligned, consecutive 4-byte words
Addresses fall within 4 segments
Warp needs 128 bytes
128 bytes move across the bus
Bus utilization: 100%
Access Patterns vs. Memory Throughput
Scenario:
Warp requests 32 aligned, permuted 4-byte words
Addresses fall within 4 segments
Warp needs 128 bytes
128 bytes move across the bus
Bus utilization: 100%
Access Patterns vs. Memory Throughput
Scenario:
Warp requests 32 misaligned, consecutive 4-byte words
Addresses fall within at most 5 segments
Warp needs 128 bytes
At most 160 bytes move across the bus
Bus utilization: at least 80%
Some misaligned patterns will fall within 4 segments, so 100% utilization
Access Patterns vs. Memory Throughput
Scenario:
All threads in a warp request the same 4-byte word
Addresses fall within a single segment
Warp needs 4 bytes
32 bytes move across the bus
Bus utilization: 12.5%
Access Patterns vs. Memory Throughput
Scenario:
Warp requests 32 scattered 4-byte words
Addresses fall within N segments
Warp needs 128 bytes
N*32 bytes move across the bus
Bus utilization: 128 / (N*32)
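Worked example: in the fully scattered case each word lands in its own segment (N = 32), so 32 × 32 = 1024 bytes move for the 128 bytes needed, i.e. 128/1024 = 12.5% bus utilization.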
Parallelizing SAXPY
[Figure: threads 0–31 of a warp mapped onto consecutive elements of x.]
A Better Way to Parallelize SAXPY
[Figure: each thread handles one element per loop iteration (loopcount = 0, 1, …, k), so a fixed number of threads covers the whole array.]
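One common realization of this is a grid-stride loop (a sketch; that the slide's figure shows exactly this pattern is an assumption): each thread strides through x and y in steps of the total thread count, so a fixed-size grid covers any n with coalesced accesses.

__global__ void saxpy_gridstride(int n, float a, float *x, float *y)
{
    int stride = blockDim.x * gridDim.x;               // total threads in the grid
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        y[i] = a*x[i] + y[i];                          // consecutive threads touch consecutive words
}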
Structures of Non-Native Size
struct Position
{
    float x, y, z;   // 12 bytes per element: not a native load size
};
...
__global__ void kernel( Position *data, ... )
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    Position temp = data[idx];   // compiled as three separate 4-byte loads
    ...
}
Structure of Non-Native Size
The 12-byte struct copy is performed as three separate load instructions.
[Figures: the first, second, and third load instructions each fetch one 4-byte word per thread, 12 bytes apart across the warp.]
Performance and Solutions
Global Memory Access Patterns
SoA vs AoS:
Good: point.x[i]
Not so good: point[i].x
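A minimal sketch of the SoA alternative (the struct name, kernel, and n are illustrative, not from the slides):

struct PositionSoA
{
    float *x, *y, *z;    // three separate arrays instead of an array of structs
};

__global__ void kernel_soa(PositionSoA p, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
    {
        float px = p.x[idx];   // consecutive threads read consecutive 4-byte words: coalesced
        float py = p.y[idx];
        float pz = p.z[idx];
        // ... use px, py, pz ...
    }
}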
Summary: GMEM Optimization
A note about caches
L1 and L2 caches: ignore them in software design; with thousands of concurrent threads, cache blocking is difficult at best.
Read-only Data Cache
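The slide details are missing here; a minimal sketch of how the read-only data cache is typically used on Kepler (compute capability 3.5 and up), with an illustrative kernel:

__global__ void scale(const float* __restrict__ in, float* __restrict__ out, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = a * __ldg(&in[i]);   // __ldg() loads through the read-only data cache
}

Marking the input pointer const __restrict__ also lets the compiler route its loads through the read-only cache automatically; __ldg() simply makes the intent explicit.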
Texture and Constant Memory
Read-only
Data resides in global memory
Read via special-purpose caches
Texture
Separate cache
Dedicated texture cache hardware provides:
Out-of-bounds index handling
clamp or wrap-around
Optional interpolation
Think: using fp indices for arrays
Linear, bilinear, trilinear
– Interpolation weights are 9-bit
Optional format conversion
{char, short, int} -> float
All of these are “free”
Examples of Texture Object Indexing
[Figure: 2D texture with sample points at (2.5, 0.5) and (1.0, 1.0).]
Integer indices fall between elements.
Optional interpolation: weights are determined by coordinate distance.
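A minimal sketch of creating and reading a texture object bound to a linear float array (illustrative names; not taken from the slides):

cudaTextureObject_t makeTexture(float *d_data, size_t n)
{
    cudaResourceDesc resDesc = {};
    resDesc.resType = cudaResourceTypeLinear;
    resDesc.res.linear.devPtr = d_data;
    resDesc.res.linear.desc = cudaCreateChannelDesc<float>();
    resDesc.res.linear.sizeInBytes = n * sizeof(float);

    cudaTextureDesc texDesc = {};
    texDesc.readMode = cudaReadModeElementType;   // no format conversion

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);
    return tex;
}

__global__ void readThroughTexture(cudaTextureObject_t tex, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = tex1Dfetch<float>(tex, i);       // read via the texture cache
}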
Shared Memory
Variety of uses:
Software managed cache (e.g., tiled DGEMM)
Global memory coalescing (e.g., transpose)
Communication within a thread block (e.g., FFT, reductions)
Limited Resource
Use of shared memory affects occupancy
Shared Memory Organization
Bank Addressing Examples
Motivating Example: Matrix Transpose
Transposing with Shared Memory
Shared Memory: Avoiding Bank Conflicts
[Figure: a 32×32 element array in shared memory; the 32 elements of each row spread across Bank 0 … Bank 31.]
Shared Memory: Avoiding Bank Conflicts
Accesses along a column produce 32 bank conflicts (replays).
[Figure: all elements of a column map to the same bank.]
Shared Memory: Avoiding Bank Conflicts
[Figure: padding each row by one element shifts the column entries into different banks, eliminating the conflicts.]
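A minimal sketch of the shared-memory transpose with a padded tile (a standard technique consistent with these slides; TILE_DIM and the 2D launch are illustrative choices):

#define TILE_DIM 32

__global__ void transposeShared(const float *in, float *out, int N)
{
    // +1 column of padding spreads each tile column across different banks
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    if (x < N && y < N)
        tile[threadIdx.y][threadIdx.x] = in[y * N + x];    // coalesced read

    __syncthreads();

    // Swap the block indices so the write is also coalesced
    x = blockIdx.y * TILE_DIM + threadIdx.x;
    y = blockIdx.x * TILE_DIM + threadIdx.y;
    if (x < N && y < N)
        out[y * N + x] = tile[threadIdx.x][threadIdx.y];   // conflict-free thanks to padding
}

// Launch sketch (assumes N is a multiple of TILE_DIM):
// dim3 block(TILE_DIM, TILE_DIM);
// dim3 grid(N / TILE_DIM, N / TILE_DIM);
// transposeShared<<<grid, block>>>(d_in, d_out, N);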
Final Notes on Shared Memory
OPTIMIZE
Kernel Optimizations: Instruction Throughput / Control Flow
Exposing Sufficient Parallelism
Independent Instructions: ILP vs. TLP
Control Flow
Divergent branches:
Threads within a single warp take different paths
if-else, ...
Different execution paths within a warp are serialized
Control Flow
if ( ... )
{
    // then-clause
}
else
{
    // else-clause
}
Execution within warps is coherent
[Figure: two warps ("vectors" of threads 0–31 and 32–63) advancing through the same instruction stream over time.]
Execution diverges within a warp
[Figure: threads of a warp take different branch paths; the paths execute serially over time.]
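A small illustrative contrast (not from the slides): branching on a per-thread condition diverges within a warp, while branching at warp granularity keeps each warp on a single path.

__global__ void branchExample(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Divergent: odd and even lanes of the same warp take different paths,
    // so the two paths are serialized.
    if (threadIdx.x % 2 == 0)
        data[i] *= 2.0f;
    else
        data[i] += 1.0f;

    // Coherent: all 32 threads of a warp share the same value of
    // threadIdx.x / warpSize, so each warp takes a single path.
    if ((threadIdx.x / warpSize) % 2 == 0)
        data[i] *= 2.0f;
    else
        data[i] += 1.0f;
}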
OPTIMIZE
Optimizing CPU-GPU Interaction: Maximizing PCIe Throughput
Maximizing PCIe Throughput
Pinned (non-pageable) memory
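A minimal sketch of allocating pinned host memory for transfers (size and names are illustrative):

#include <cuda_runtime.h>

void stagePinned(size_t bytes)
{
    float *h_buf = NULL;
    // Pinned (page-locked) allocation: higher PCIe throughput, required for truly async copies
    cudaMallocHost((void**)&h_buf, bytes);

    // ... fill h_buf, then cudaMemcpy / cudaMemcpyAsync to and from the device ...

    cudaFreeHost(h_buf);
}

An existing allocation can also be pinned in place with cudaHostRegister().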
Asynchronicity in CUDA
Default:
Kernel launches are asynchronous with CPU
Memcopies (D2H, H2D) block CPU thread
CUDA calls are serialized by the driver
Streams and async functions provide additional asynchronicity:
Memcopies (D2H, H2D) asynchronous with CPU
Ability to concurrently execute kernels and memcopies
Overlap kernel and memory copy
Requirements:
D2H or H2D memcopy from pinned memory
Kernel and memcopy in different, non-0 streams
Code:
cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
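One way the example might continue (illustrative pointers and sizes; h_x must be pinned memory): issue the copy in stream1 and an independent kernel in stream2 so the two can overlap.

cudaMemcpyAsync(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice, stream1);
saxpy_parallel<<<(n + 255) / 256, 256, 0, stream2>>>(n, a, d_x2, d_y2);   // works on other data
cudaStreamSynchronize(stream1);
cudaStreamSynchronize(stream2);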
Call Sequencing for Optimal Overlap
Hyper-Q Enables Efficient Scheduling
Stream Examples without Hyper-Q
Legend: K = kernel, M = memcopy, integer = stream ID.
[Figure: execution timelines for the issue orders K1,M1,K2,M2 · K1,K2,M1,M2 · K1,M1,M2 · K1,M2,M1 · K1,M2,M2.]
Stream Examples with Hyper-Q
Legend: K = kernel, M = memcopy, integer = stream ID.
[Figure: execution timelines for the same issue orders as on the previous slide, now with Hyper-Q.]
Grid Management and Stream Queue Management

void foo(void)
{
    kernel_A<<<g, b, s, stream_1>>>();
    kernel_B<<<g, b, s, stream_1>>>();
    kernel_C<<<g, b, s, stream_1>>>();
}

void bar(void)
{
    kernel_P<<<g, b, s, stream_2>>>();
    kernel_Q<<<g, b, s, stream_2>>>();
    kernel_R<<<g, b, s, stream_2>>>();
}

[Figure: stream_1 queues kernel_A → kernel_B → kernel_C and stream_2 queues kernel_P → kernel_Q → kernel_R; the grid management unit holds the queued work (A/B/C, P/Q/R, X/Y/Z).]
Stream Dependencies without Hyper-Q
[Figure: without Hyper-Q, the work queued in stream_1 (kernel_A → kernel_B → kernel_C) and stream_2 (kernel_P → kernel_Q → kernel_R) is funneled into a single hardware queue: R—Q—P C—B—A.]
Stream Dependencies with Hyper-Q
[Figure: with Hyper-Q, stream_1 (C—B—A) and stream_2 (R—Q—P) occupy separate hardware queues and proceed independently.]
Hyper-Q Example: Building a Pipeline
[Figure: pipelined execution using both DMA engines, with device buffers dA2, dB2, dC2 cycling through GPU memory.]
Just a Higher Level of Parallelism
Problem is decomposed into parallel "workers". At any given time, one worker is using the compute resources and one worker is using the copy engines.
Importantly: the PCI-E link is kept saturated with useful work; for DGEMM, compute is also saturated.
[Figure: result matrix with tiles computed alternately by stream 1 and stream 2.]

Code fragment:
    // Rotate streams
    r_streams.rotate(); r_streamids.rotate();
Pipeline Without Hyper-Q
Pipeline With Hyper-Q
Hyper-Q also enables CUDA MPS
But Hyper-Q != CUDA MPS
Deploy
GPU Optimization Fundamentals
Recap:
Develop systematically with APOD
Expose sufficient parallelism
Utilize parallel processing resources efficiently
[Cycle: Assess → Parallelize → Optimize → Deploy]
Online Resources
www.udacity.com
devtalk.nvidia.com