CUDA Optimization Fundamentals

The document outlines GPU optimization fundamentals, emphasizing the importance of exposing sufficient parallelism and efficiently utilizing execution resources. It discusses assessing hotspots, parallelizing applications, and optimizing kernel configurations to improve performance. Key concepts include understanding strong and weak scaling, memory access coalescing, and minimizing redundant data transfers between host and device.

GPU Optimization Fundamentals
Cliff Woolley
Developer Technology Engineer
Note: Fundamentals will apply broadly

Example performance numbers are presented for Tesla K20X,


which is based on the Kepler GK110 GPU
Same general optimization concepts apply to other GPUs, though
some parameters may be different, e.g.:
Number of SMs per GPU
Number of functional units per SM
Maximum number of concurrent warps per SM
Shared memory size per SM
Register file size per SM
Developer tools from NVIDIA help you analyze the concepts
without having to memorize parameters of each architecture
© NVIDIA 2013
GPU OPTIMIZATION FUNDAMENTALS

© NVIDIA 2013
Main Requirements for GPU Performance

Expose sufficient parallelism

Utilize parallel execution resources efficiently


Use memory system efficiently
Coalesce global memory accesses
Use shared memory where possible
Have coherent execution within warps of threads

© NVIDIA 2013
APOD: A Systematic Path to Performance

Assess -> Parallelize -> Optimize -> Deploy (and repeat)
© NVIDIA 2013
Assess

HOTSPOTS

Identify hotspots (total time, number of calls)


Understand scaling (strong and weak)
© NVIDIA 2013
Parallelize

Applications

Libraries
Compiler Directives
Programming Languages

© NVIDIA 2013
Optimize

Profile-driven optimization

Tools:
Nsight Visual Studio Edition or Eclipse Edition
nvvp: NVIDIA Visual Profiler
nvprof: command-line profiling

© NVIDIA 2013
Deploy

Productize

Check API return values
Run cuda-memcheck tools

Library distribution
Cluster management

Early gains
Subsequent changes are evolutionary
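A minimal usage sketch of cuda-memcheck (the application name ./myapp is illustrative):

cuda-memcheck ./myapp                     // reports out-of-bounds and misaligned accesses
cuda-memcheck --tool racecheck ./myapp    // reports shared-memory data races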
© NVIDIA 2013
ASSESS

© NVIDIA 2013
Assess

Profile the code, find the hotspot(s)


Focus your attention where it will give the most benefit
© NVIDIA 2013
Assess

We’ve found a hotspot to work on!


What percent of our total time does this represent?
How much can we improve it? What is the “speed of light”?
How much will this improve our overall performance?

© NVIDIA 2013
Assess

Let’s investigate…
Strong scaling and Amdahl’s Law
Weak scaling and Gustafson’s Law
Expected perf limiters: Bandwidth? Computation? Latency?

© NVIDIA 2013
Assess: Understanding Scaling

Strong Scaling
A measure of how, for fixed overall problem size, the time to
solution decreases as more processors are added to a system
Linear strong scaling: speedup achieved is equal to number of
processors used

Amdahl’s Law:

S = 1 / ((1 − P) + P/N)  ≈  1 / (1 − P)   (for large N)

© NVIDIA 2013
Assess: Understanding Scaling

Weak Scaling
A measure of how time to solution changes as more processors
are added with fixed problem size per processor
Linear weak scaling: overall problem size increases as num. of
processors increases, but execution time remains constant

Gustafson’s Law:

S = N + (1 − P)(1 − N)

© NVIDIA 2013
Assess: Applying Strong and Weak Scaling

Understanding which type of scaling is most applicable is an


important part of estimating speedup:
Sometimes problem size will remain constant
Other times problem size will grow to fill the available processors

Apply either Amdahl's or Gustafson's Law to determine an upper


bound for the speedup

© NVIDIA 2013
Assess: Applying Strong Scaling

Recall that in this case we want to optimize an
existing kernel with a pre-determined workload

That’s strong scaling, so Amdahl’s Law will determine


the maximum speedup

~93%

© NVIDIA 2013
Assess: Applying Strong Scaling

Say, for example, our kernel is ~93% of total time:


Speedup S = 1 / ((1 − P) + P/SP)   (SP = speedup in the parallel part)

In the limit when SP is huge, S will approach 1 / (1 − 0.93) ≈ 14.3
In practice, it will be less than that, depending on the SP achieved
Getting SP to be high is the goal of optimizing, of course

~93%

© NVIDIA 2013
Assess: Speed of Light

What’s the limiting factor?


Memory bandwidth?
Compute throughput?
Latency?

Not sure?
Get a rough estimate by counting bytes per instruction, and
compare it to the “balanced” peak ratio GBytes/sec : Ginsns/sec
Profiler will help you determine this

© NVIDIA 2013
Assess: Limiting Factor

Comparing bytes per instr. will give you a guess as to whether


you’re likely to be bandwidth-bound or instruction-bound

Comparing actual achieved GB/s vs. theory and achieved


Ginstr/s vs. theory will give you an idea of how well you’re doing
If both are low, then you’re probably latency-bound and need to expose
more (concurrent) parallelism

© NVIDIA 2013
Assess: Limiting Factor

© NVIDIA 2013
Assess: Speed of Light

What’s the limiting factor?


Memory bandwidth? Compute throughput? Latency?

Consider SpMV: intuitively expect it to be bandwidth-limited


Say we discover we’re getting only ~38% of peak bandwidth
If we aim to get this up to ~65% of peak, that’s a 1.7x speedup for this kernel
1.7x for this kernel translates into 1.6x overall due to Amdahl:

S = 1 / ((1 − 0.93) + 0.93/1.7) ≈ 1.6
~93%

© NVIDIA 2013
Assess: Limiting Factor

For our example SpMV kernel, our first discovery was that we’re
latency-limited, not bandwidth-limited, since utilization was so low

This tells us our first “optimization” step actually needs to be
related to how we expose (memory-level) parallelism

~93%

© NVIDIA 2013
PARALLELIZE

© NVIDIA 2013
PARALLELIZE
Computation

© NVIDIA 2013
Parallelize

Applications

Libraries
Compiler Directives
Programming Languages

Pick the best tool for the job


© NVIDIA 2013
Parallelize: e.g., with GPU Accelerated Libraries

[Logo grid: NVIDIA cuBLAS, NVIDIA cuSPARSE, NVIDIA NPP, NVIDIA cuFFT, NVIDIA cuRAND,
IMSL Library, CenterSpace NMath; captions include matrix algebra on GPU and multicore,
GPU-accelerated linear algebra, vector signal image processing, building-block algorithms,
and C++ templated parallel algorithms]

© NVIDIA 2013
Parallelize: e.g., with Thrust

Similar to C++ STL
High-level interface
  Enhances developer productivity
  Enables performance portability between GPUs and multicore CPUs
Flexible
  Backends for CUDA, OpenMP, TBB
  Extensible and customizable
  Integrates with existing software
Open source

// generate 32M random numbers on host
thrust::host_vector<int> h_vec(32 << 20);
thrust::generate(h_vec.begin(), h_vec.end(), rand);

// transfer data to device (GPU)
thrust::device_vector<int> d_vec = h_vec;

// sort data on device
thrust::sort(d_vec.begin(), d_vec.end());

// transfer data back to host
thrust::copy(d_vec.begin(), d_vec.end(), h_vec.begin());

thrust.github.com or developer.nvidia.com/thrust
© NVIDIA 2013
Parallelize: e.g., with OpenACC
Directives-based approach: the compiler parallelizes your original Fortran or C code
Works on many-core GPUs & multicore CPUs

Program myscience
   ... serial code ...
!$acc kernels
   do k = 1,n1
      do i = 1,n2
         ... parallel code ...
      enddo
   enddo
!$acc end kernels
   ...
End Program myscience

www.nvidia.com/gpudirectives                                    © NVIDIA 2013
Parallelize: e.g., with CUDA C
Standard C Code:

void saxpy_serial(int n, float a, float *x, float *y)
{
    for (int i = 0; i < n; ++i)
        y[i] = a*x[i] + y[i];
}

// Perform SAXPY on 1M elements
saxpy_serial(4096*256, 2.0, x, y);

CUDA C Code:

__global__
void saxpy_parallel(int n, float a, float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a*x[i] + y[i];
}

// Perform SAXPY on 1M elements
saxpy_parallel<<<4096,256>>>(n, 2.0, x, y);

developer.nvidia.com/cuda-toolkit
© NVIDIA 2013
Parallelism Needed

GPU is a parallel machine


Lots of arithmetic pipelines
Multiple memory banks

To get good performance, your code must expose sufficient


parallelism for 2 reasons:
To actually give work to all the pipelines
To hide latency of the pipelines

Rough rule of thumb for Tesla K20X:


You want to have 14K or more threads running concurrently
© NVIDIA 2013
Case Study: Matrix Transpose
void transpose(float in[], float out[], int N)
{
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            out[j*N + i] = in[i*N + j];
}

© NVIDIA 2013
An Initial CUDA Version
__global__ void transpose(float in[], float out[], int N)
{
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            out[i*N + j] = in[j*N + i];
}

float in[N*N], out[N*N];
...
transpose<<<1,1>>>(in, out, N);

+ Quickly implemented
- Performance weak
Need to expose parallelism!
© NVIDIA 2013
Parallelize across matrix elements
Process elements independently

__global__ void transpose(float in[], float out[], int N)
{
    int tid = threadIdx.x;
    int bid = blockIdx.x;

    out[tid*N + bid] = in[bid*N + tid];
}

float in[N*N], out[N*N];
...
transpose<<<N,N>>>(in, out, N);

[Figure: each block bid reads one row of in and writes one column of out; threads tid within the block handle the elements]
© NVIDIA 2013
PARALLELIZE
Data Transfer

© NVIDIA 2013
Asynchronicity = Overlap = Parallelism

Heterogeneous system: overlap work and data movement

© NVIDIA 2013
Asynchronicity

This is the kind of case we would be concerned about


Found the top kernel, but the GPU is mostly idle – that is our bottleneck
Need to overlap CPU/GPU computation and PCIe transfers

© NVIDIA 2013
Parallelize: Achieve Asynchronicity

What we want to see is maximum overlap of all engines

© NVIDIA 2013
OPTIMIZE

© NVIDIA 2013
Main Requirements for GPU Performance

Expose sufficient parallelism

Utilize parallel execution resources efficiently


Use memory system efficiently
Coalesce global memory accesses
Use shared memory where possible
Have coherent execution within warps of threads

© NVIDIA 2013
GPU Optimization Fundamentals

Find ways to parallelize sequential code


Adjust kernel launch configuration to maximize device utilization
Ensure global memory accesses are coalesced
Minimize redundant accesses to global memory
Avoid different execution paths within the same warp
Minimize data transfers between the host and the device

http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/

© NVIDIA 2013
GPU Optimization Fundamentals

Find ways to parallelize sequential code

Kernel optimizations
Launch configuration
Global memory throughput
Shared memory access
Instruction throughput / control flow

Optimization of CPU-GPU interaction


Maximizing PCIe throughput
Overlapping kernel execution with memory copies
© NVIDIA 2013
OPTIMIZE
Kernel Optimizations: Kernel Launch Configuration

© NVIDIA 2013
Kernel Launch Configuration

A kernel is a function that runs on the GPU


A kernel is launched as a grid of blocks of threads
Launch configuration is the number of blocks and number of
threads per block, expressed in CUDA with the <<< >>> notation:

mykernel<<<blocks_per_grid,threads_per_block>>>(…);

What values should we pick for these?


Need enough total threads to process entire input
Need enough threads to keep the GPU busy
Selection of block size is an optimization step involving warp occupancy
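A minimal sketch of picking these values (n and d_data are illustrative; d_data is assumed to be device memory): compute the block count from the problem size, rounding up so every element is covered.

int n = 1 << 20;                          // total elements to process
int threadsPerBlock = 256;                // a multiple of the warp size (32)
int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;   // round up
mykernel<<<blocksPerGrid, threadsPerBlock>>>(d_data, n);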
© NVIDIA 2013
High-level view of GPU Architecture

Several Streaming Multiprocessors


E.g., Kepler GK110 has up to 15 SMs
L2 Cache shared among SMs
Multiple channels to DRAM

Kepler GK110
© NVIDIA 2013
Kepler Streaming Multiprocessor (SMX)

Per SMX:
192 SP CUDA Cores
64 DP CUDA Cores
4 warp schedulers
Up to 2048 concurrent threads
One or two instructions issued
per scheduler per clock from a
single warp
Register file (256KB)
Shared memory (48KB)

© NVIDIA 2013
CUDA Execution Model

Thread: Sequential execution unit


All threads execute same sequential program
Threads execute in parallel

Threads Block: a group of threads


Executes on a single Streaming Multiprocessor (SM)
Threads within a block can cooperate
Light-weight synchronization
Data exchange

Grid: a collection of thread blocks


Thread blocks of a grid execute across multiple SMs
Thread blocks do not synchronize with each other
Communication between blocks is expensive
© NVIDIA 2013
Execution Model
Software               Hardware

Thread                 CUDA Core
  Threads are executed by scalar CUDA Cores

Thread Block           Multiprocessor
  Thread blocks are executed on multiprocessors
  Thread blocks do not migrate
  Several concurrent thread blocks can reside on one multiprocessor,
  limited by multiprocessor resources (shared memory and register file)

Grid                   Device
  A kernel is launched as a grid of thread blocks
© NVIDIA 2013
Launch Configuration: General Guidelines

How many blocks should we use?


1,000 or more thread blocks is best
Rule of thumb: enough blocks to fill the GPU at least 10s of times over
Makes your code ready for several generations of future GPUs

© NVIDIA 2013
Launch Configuration: General Guidelines

How many threads per block should we choose?


The really short answer: 128, 256, or 512 are often good choices

The slightly longer answer:


Pick a size that suits the problem well
Multiples of 32 threads are best
Pick a number of threads per block (and a number of blocks) that is
sufficient to keep the SM busy

© NVIDIA 2013
Warps

A thread block consists of warps of 32 threads

A warp is executed physically in parallel on some multiprocessor

Threads of a warp issue instructions in lock-step (as with SIMD)

[Figure: a thread block is split into 32-thread warps, which execute on a multiprocessor]
© NVIDIA 2013
Hardware Levels of Parallelism

SIMD: Single Instruction, Multiple Data
  In-core parallelism, single computer

SMT: Simultaneous Multithreading
  Cross-core, cross-socket, single computer; OpenMP, pthreads

MPI: Multiple “computers”, tightly-coupled
  Supercomputing apps

SIMT: Single Instruction, Multiple Threads
  In-processor parallelism; many threads on many cores

These form a continuum. Best performance is achieved with a mix.
© NVIDIA 2013
Low Latency or High Throughput?

CPU: optimized for low-latency access to cached data sets;
control logic for out-of-order and speculative execution

GPU: optimized for data-parallel, throughput computation;
architecture tolerant of memory latency;
more transistors dedicated to computation
© NVIDIA 2013
Occupancy

Need enough concurrent warps


per SM to hide latencies:
Instruction latencies
Memory access latencies

Hardware resources determine


number of warps that fit per SM

Occupancy = Nactual / Nmax

© NVIDIA 2013
Low Latency or High Throughput?

CPU architecture must minimize latency within each thread


GPU architecture hides latency with computation from other (warps of) threads

[Figure: a GPU Streaming Multiprocessor (high-throughput processor) keeps computing by switching among warps W1..W4 as each waits for data; a CPU core (low-latency processor) runs threads T1..T4 and must context switch to hide a stall]

© NVIDIA 2013
Latency Hiding

FFMA R0, R43, R0, R4;
FFMA R1, R43, R4, R5;
FMUL R7, R9, R0;
FMUL R8, R9, R1;
ST.E [R2], R7;
(ILP=2: pairs of independent instructions sit between dependent ones)

Instruction latencies:
  Roughly 10-20 cycles for arithmetic operations
  DRAM accesses have higher latencies (400-800 cycles)

Instruction Level Parallelism (ILP)
  Independent instructions between two dependent ones
  ILP depends on the code, done by the compiler

Switching to a different warp
  If a warp must stall for N cycles due to dependencies, having N other
  warps with eligible instructions keeps the SM going
  Switching among concurrently resident warps has no overhead
  State (registers, shared memory) is partitioned, not stored/restored

© NVIDIA 2013
Occupancy

Occupancy: number of concurrent warps per SM, expressed as:


Absolute number of warps of threads that fit concurrently (e.g., 1..64), or
Ratio of warps that fit concurrently to architectural maximum (0..100%)

Number of warps that fit is determined by resource availability:
  Threads per thread block
  Registers per thread
  Shared memory per thread block

Kepler SM resources:
  64K 32-bit registers
  Up to 48 KB of shared memory
  Up to 2048 concurrent threads
  Up to 16 concurrent thread blocks
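A worked example under the Kepler limits above (the register count is chosen for illustration): a 256-thread block (8 warps) whose kernel uses 40 registers per thread needs 256 x 40 = 10,240 registers per block; 65,536 / 10,240 allows 6 such blocks per SM, i.e. 48 warps out of the 64-warp maximum, or 75% occupancy, assuming shared memory and the 16-block limit are not the tighter constraints.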

© NVIDIA 2013
Occupancy and Performance

Note that 100% occupancy isn’t needed to reach maximum


performance
Once the “needed” occupancy (enough warps to switch among to cover
latencies) is reached, further increases won’t improve performance

Level of occupancy needed depends on the code


More independent work per thread -> less occupancy is needed
Memory-bound codes tend to need more occupancy
Higher latency than for arithmetic, need more work to hide it

© NVIDIA 2013
Thread Block Size and Occupancy

Thread block size is a multiple of warp size (32)


Even if you request fewer threads, hardware rounds up
Thread blocks can be too small
Kepler SM can run up to 16 thread blocks concurrently
SM can reach the block count limit before reaching good occupancy
E.g.: 1-warp blocks = 16 warps/SM on Kepler (25% occ – probably not enough)
Thread blocks can be too big
Enough SM resources for more threads, but not enough for a whole block
A thread block isn’t started until resources are available for all of its threads

© NVIDIA 2013
Thread Block Sizing

[Figure: the number of warps allowed by SM resources (registers, shared memory) as a function of threads per block; both too few and too many threads per block limit occupancy]

© NVIDIA 2013
CUDA Occupancy Calculator

Analyze effect of
resource consumption
on occupancy

© NVIDIA 2013
Occupancy Analysis in NVIDIA Visual Profiler

Occupancy here is limited


by grid size and number of
threads per block

© NVIDIA 2013
OPTIMIZE
Kernel Optimizations: Global Memory Throughput

© NVIDIA 2013
CUDA Memory Architecture

[Figure: the host side has a CPU, chipset, and DRAM; the device side has the GPU with its own DRAM (local, global, constant, and texture memory) and multiprocessors containing registers, shared memory, L1/L2 cache, and constant and texture caches]
© NVIDIA 2013
Optimizing Memory Throughput

Goal: utilize all available memory bandwidth

Little’s Law:
# bytes in flight = access latency L * bandwidth

=> Increase parallelism (bytes in flight)
   (or)
=> Reduce latency (time between requests)

© NVIDIA 2013
Illustration: Little’s Law for Escalators

Say the parameters of our escalator are:


1 person fits on each step
Step arrives every 2 secs (bandwidth=0.5 persons/s)
20 steps tall (latency=40 seconds)
1 person in flight: 0.025 persons/s achieved
To saturate bandwidth:
Need 1 person arriving every 2 s
Means we’ll need 20 persons in flight
The idea: Bandwidth × Latency
It takes latency time units for the first person to arrive
We need bandwidth persons to get on the escalator every time unit
© NVIDIA 2013
Memory-Level Parallelism = Bandwidth

In order to saturate memory bandwidth, SM must have


enough independent memory requests in flight concurrently

© NVIDIA 2013
Memory-Level Parallelism: Requests in flight

Achieved Kepler memory throughput


Shown as a function of number of concurrent requests
per SM with 128-byte lines

© NVIDIA 2013
Requests per Thread and Performance

Experiment: vary the size of accesses by threads of a warp, check performance
Memcopy kernel: each warp has 2 concurrent requests (one write and the read following it)

Accesses by a warp:
  4B words: 1 line
  8B words: 2 lines
  16B words: 4 lines

To achieve the same throughput at lower occupancy or with smaller
words, need more independent requests per warp

© NVIDIA 2013
Optimizing Access Concurrency

Ways to increase concurrent accesses:


Increase occupancy (run more warps concurrently)
Adjust block dimensions to maximize occupancy
If occupancy is limited by registers per thread, try to reduce register count
(-maxrregcount option or __launch_bounds__)

Modify code to process several elements per thread


Doubling elements per thread doubles independent accesses per thread
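A minimal sketch of that idea (kernel and names are illustrative, not from the deck): each thread issues two independent, still-coalesced loads, doubling the bytes in flight per warp.

__global__ void copy2(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;    // total threads in the grid
    if (i + stride < n) {
        float a = in[i];                    // two independent loads can be
        float b = in[i + stride];           // in flight at the same time
        out[i] = a;
        out[i + stride] = b;
    }
    // tail elements (i < n but i + stride >= n) omitted for brevity
}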

© NVIDIA 2013
OPTIMIZE
Kernel Optimizations: Global Memory Access Coalescing

© NVIDIA 2013
Mechanics of a Memory Access

Memory operations are issued per warp


Just like all other instructions

Operation:
Threads in a warp provide memory addresses
Hardware determines which lines/segments are needed, fetches them

© NVIDIA 2013
Memory Access Efficiency Analysis

Two perspectives on the throughput:


Application’s point of view: count only bytes requested by application
HW point of view: count all bytes moved by hardware

The two views can be different:


Memory is accessed at 32 byte granularity
With a scattered or offset pattern, the application doesn’t use all the bytes the
hardware actually transferred
Broadcast: the same small transaction serves many threads in a warp

© NVIDIA 2013
Access Patterns vs. Memory Throughput

Scenario:
Warp requests 32 aligned, consecutive 4-byte words
Addresses fall within 4 segments
Warp needs 128 bytes
128 bytes move across the bus
Bus utilization: 100%

[Figure: addresses from a warp mapped onto 32-byte memory segments]

© NVIDIA 2013
Access Patterns vs. Memory Throughput

Scenario:
Warp requests 32 aligned, permuted 4-byte words
Addresses fall within 4 segments
Warp needs 128 bytes
128 bytes move across the bus
Bus utilization: 100%

[Figure: addresses from a warp mapped onto 32-byte memory segments]

© NVIDIA 2013
Access Patterns vs. Memory Throughput

Scenario:
Warp requests 32 misaligned, consecutive 4-byte words
Addresses fall within at most 5 segments
Warp needs 128 bytes
At most 160 bytes move across the bus
Bus utilization: at least 80%
Some misaligned patterns will fall within 4 segments, so 100% utilization

[Figure: addresses from a warp mapped onto 32-byte memory segments]

© NVIDIA 2013
Access Patterns vs. Memory Throughput

Scenario:
All threads in a warp request the same 4-byte word
Addresses fall within a single segment
Warp needs 4 bytes
32 bytes move across the bus
Bus utilization: 12.5%

[Figure: addresses from a warp mapped onto 32-byte memory segments]

© NVIDIA 2013
Access Patterns vs. Memory Throughput

Scenario:
Warp requests 32 scattered 4-byte words
Addresses fall within N segments
Warp needs 128 bytes
N*32 bytes move across the bus
Bus utilization: 128 / (N*32)

[Figure: addresses from a warp mapped onto 32-byte memory segments]

© NVIDIA 2013
Parallelizing SAXPY

void saxpy(int n, float a, float *x, float *y)
{
    for (int i = 0; i < n; i++)
    {
        y[base + i] += a * x[base + i];
    }
}

Divide the work equally among T threads
Each thread is responsible for computing one contiguous ‘region’ of the arrays
This is good for pthreads

© NVIDIA 2013
Parallelizing SAXPY

__global__ void saxpy1(int n, float a, float *x, float *y)
{
    int workPerThread = 1 + n/blockDim.x;
    int base = threadIdx.x * workPerThread;

    for (int i = 0; i < workPerThread; i++)
    {
        if (base + i < n)
        {
            y[base + i] += a * x[base + i];
        }
    }
}

Divide the work equally among T threads
Each thread is responsible for computing one contiguous ‘region’ of the arrays
This is good for pthreads

x: [thread 0 | thread 1 | thread 2 | thread 3 | ... | thread 31]
© NVIDIA 2013
Parallelizing SAXPY

__global__ void saxpy1(int n, float a, float *x, float *y)
{
    int workPerThread = 1 + n/blockDim.x;
    int base = threadIdx.x * workPerThread;

    for (int i = 0; i < workPerThread; i++)
    {
        if (base + i < n)
        {
            y[base + i] += a * x[base + i];
        }
    }
}

In SIMT, 32 threads of a warp issue the x[base+i] instruction simultaneously
Each thread has a different value of base
If workPerThread > 1, this becomes a strided load

x: [thread 0 | thread 1 | thread 2 | thread 3 | ... | thread 31]
© NVIDIA 2013
A Better Way to Parallelize SAXPY

__global__ void saxpy2(int n, float a, float *x, float *y)
{
    int id = threadIdx.x;
    int loopCount = 0;
    while (id < n)
    {
        y[id] += a * x[id];
        loopCount++;
        id = loopCount*blockDim.x + threadIdx.x;
    }
}

Divide work up so that on each pass through the loop, the thread block
computes one ‘contiguous region’ of the array.
Achieves memory coalescing

x: [loopcount = 0 | loopcount = 1 | ... | loopcount = k]
© NVIDIA 2013
A Better Way to Parallelize SAXPY

__global__ void saxpy2(int n, float a, float *x, float *y)
{
    int id = threadIdx.x;
    int loopCount = 0;
    while (id < n)
    {
        y[id] += a * x[id];
        loopCount++;
        id = loopCount*blockDim.x + threadIdx.x;
    }
}

The area of x addressed by each warp is contiguous in global memory.
The number of global memory transactions is minimized.
This effect translates to loads and stores of y also.

x: [loopcount = 0 | loopcount = 1 | ... | loopcount = k]
© NVIDIA 2013
Structures of Non-Native Size

Say we are reading a 12-byte structure per thread

struct Position
{
float x, y, z;
};
...
__global__ void kernel( Position *data, ... )
{
int idx = blockIdx.x * blockDim.x + threadIdx.x;
Position temp = data[idx];
...
}
© NVIDIA 2013
Structure of Non-Native Size

Compiler converts temp = data[idx] into 3 loads:


Each loads 4 bytes
Can’t do an 8 and a 4 byte load: 12 bytes per element means that every
other element wouldn’t align the 8-byte load on 8-byte boundary
Addresses per warp for each of the loads:
Successive threads read 4 bytes at 12-byte stride

© NVIDIA 2013
First Load Instruction

[Figure: addresses accessed by the warp for this load: 4-byte words at a 12-byte stride]

© NVIDIA 2013
Second Load Instruction

[Figure: addresses accessed by the warp for this load: 4-byte words at a 12-byte stride]

© NVIDIA 2013
Third Load Instruction

[Figure: addresses accessed by the warp for this load: 4-byte words at a 12-byte stride]

© NVIDIA 2013
Performance and Solutions

Because of the address pattern, we end up moving 3x more bytes


than application requests
We waste a lot of bandwidth, leaving performance on the table
Potential solutions:
Change data layout from array of structures to structure of arrays
In this case: 3 separate arrays of floats
The most reliable approach (also ideal for both CPUs and GPUs); see the sketch below
Use loads via read-only cache
As long as lines survive in the cache, performance will be nearly optimal
Stage loads via shared memory
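A minimal sketch of the structure-of-arrays layout (names are illustrative, not from the deck): the same per-thread read becomes three fully coalesced 4-byte loads per warp.

struct Positions {                // structure of arrays: three separate float arrays
    float *x, *y, *z;
};

__global__ void kernel_soa(Positions data, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) {
        float px = data.x[idx];   // consecutive threads read consecutive floats
        float py = data.y[idx];   // -> each of the three loads is coalesced
        float pz = data.z[idx];
        // ... use px, py, pz ...
    }
}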

© NVIDIA 2013
Global Memory Access Patterns

SoA vs AoS:
  Good: point.x[i]
  Not so good: point[i].x

Strided array access:
  ~OK: x[i] = a[i+1] - a[i]
  Slower: x[i] = a[64*i] - a[i]

Random array access:
  Slower: a[rand(i)]

© NVIDIA 2013
Summary: GMEM Optimization

Strive for perfect address coalescing per warp


Align starting address (may require padding)
A warp will ideally access within a contiguous region
Avoid scattered address patterns or patterns with large strides between
threads
Analyze and optimize address patterns:
Use profiling tools (included with CUDA toolkit download)
Compare the transactions per request to the ideal ratio
Choose appropriate data layout (prefer SoA)
If needed, try read-only loads, staging accesses via SMEM

© NVIDIA 2013
A note about caches

L1 and L2 caches
Ignore in software design
Thousands of concurrent
threads – cache blocking
difficult at best

Read-only Data Cache


Shared with texture pipeline
Useful for uncoalesced reads
Handled by compiler when
const __restrict__ is used, or
use __ldg() primitive
© NVIDIA 2013
Blocking for GPU Memory Caches

Short answer: DON’T


GPU caches are not intended for the same use as CPU caches
Smaller size (especially per thread), so not aimed at temporal reuse
Intended to smooth out some access patterns, help with spilled registers,
etc.
Usually not worth trying to cache-block like you would on CPU
100s to 1,000s of run-time scheduled threads competing for the cache
If it is possible to block for L1 then it’s possible block for SMEM
Same size
Same or higher bandwidth
Guaranteed locality: hw will not evict behind your back

© NVIDIA 2013
Read-only Data Cache

Go through the read-only cache


Not coherent with writes
Thus, addresses must not be written by the same kernel
Two ways to enable:
Decorating pointer arguments as hints to compiler:
Pointer of interest: const __restrict__
All other pointer arguments: __restrict__
– Conveys to compiler that no aliasing will occur
Using __ldg() intrinsic
Requires no pointer decoration

© NVIDIA 2013
Read-only Data Cache

Go through the read-only cache


Not coherent with writes
Thus, addresses must not be written by the same kernel
Two ways to enable:
  Decorating pointer arguments as hints to compiler:
    Pointer of interest: const __restrict__
    All other pointer arguments: __restrict__
      Conveys to compiler that no aliasing will occur
  Using __ldg() intrinsic
    Requires no pointer decoration

__global__ void kernel( int* __restrict__ output,
                        const int* __restrict__ input )
{
    ...
    output[idx] = input[idx];
}

© NVIDIA 2013
Read-only Data Cache

Go through the read-only cache


Not coherent with writes
Thus, addresses must not be written by the same kernel
Two ways to enable:
  Decorating pointer arguments as hints to compiler:
    Pointer of interest: const __restrict__
    All other pointer arguments: __restrict__
      Conveys to compiler that no aliasing will occur
  Using __ldg() intrinsic
    Requires no pointer decoration

__global__ void kernel( int *output, int *input )
{
    ...
    output[idx] = __ldg( &input[idx] );
}

© NVIDIA 2013
Texture and Constant Memory

Read-only
Data resides in global memory
Read via special-purpose caches

© NVIDIA 2013
Texture

Separate cache
Dedicated texture cache hardware provides:
Out-of-bounds index handling
clamp or wrap-around
Optional interpolation
Think: using fp indices for arrays
Linear, bilinear, trilinear
– Interpolation weights are 9-bit
Optional format conversion
{char, short, int} -> float
All of these are “free”

© NVIDIA 2013
Examples of Texture Object Indexing

Integer indices fall between elements
Optional interpolation: weights are determined by coordinate distance

[Figure: a texel grid with sample points at (1.0, 1.0) and (2.5, 0.5); Index Wrap and Index Clamp behavior shown for an out-of-bounds fetch at (5.5, 1.5)]
© NVIDIA 2013
OPTIMIZE
Kernel Optimizations: Shared Memory Accesses

© NVIDIA 2013
Shared Memory

Fast, on-chip memory (per SM, alongside registers and L1)
Accessible by all threads within a thread block
Common allocation for entire thread block

Variety of uses:
  Software managed cache (e.g., tiled DGEMM)
  Global memory coalescing (e.g., transpose)
  Communication within a thread block (e.g., FFT, reductions)

Limited resource
  Use of shared memory affects occupancy

© NVIDIA 2013
Shared Memory Organization

Organized in 32 independent banks

Optimal access: no two words from the same bank
  Any 1:1 or multicast pattern
  Separate banks per thread
  Banks can multicast

Multiple words from same bank serialize

© NVIDIA 2013
Bank Addressing Examples

No Bank Conflicts: each thread accesses a different bank
  (e.g., Thread 0 -> Bank 0, Thread 1 -> Bank 1, ..., Thread 31 -> Bank 31)
No Bank Conflicts: any permutation that still maps each thread to a distinct bank

© NVIDIA 2013
Bank Addressing Examples

2-way Bank Conflicts: pairs of threads access different words in the same bank
8-way Bank Conflicts: groups of eight threads (x8) access different words in the same bank

© NVIDIA 2013
Motivating Example: Matrix Transpose

__global__ void gpuTranspose_kernel(int rows, int cols,
                                    float *in, float *out)
{
    int i, j;
    i = blockIdx.x * blockDim.x + threadIdx.x;
    j = blockIdx.y * blockDim.y + threadIdx.y;
    out[i * rows + j] = in[j * cols + i];
}

Either the write or the read is strided in gmem and uncoalesced
Solution: tile in shared memory

© NVIDIA 2013
Transposing with Shared Memory

1. Read block_ij into shared memory
   • Reads are coalesced
2. Transpose shared memory indices
3. Write transposed block to global memory
   • Writes are coalesced
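A minimal sketch of these three steps (tile size, block shape, and names are assumptions: 32x32 tiles, blockDim = (32, 32), N a multiple of 32):

#define TILE 32
__global__ void transposeTiled(const float *in, float *out, int N)
{
    __shared__ float tile[TILE][TILE + 1];           // +1 column of padding avoids
                                                     // bank conflicts (see later slides)
    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * N + x];  // 1. coalesced read into SMEM

    __syncthreads();

    x = blockIdx.y * TILE + threadIdx.x;             // 2. swap block indices and
    y = blockIdx.x * TILE + threadIdx.y;             //    shared memory indices
    out[y * N + x] = tile[threadIdx.x][threadIdx.y]; // 3. coalesced write
}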
© NVIDIA 2013
Shared Memory Organization

Organized in 32 independent banks
  Note: same as warp size. Not a coincidence.
  Every 32-bit word is in the next bank, modulo 32.

Optimal access: no two words from the same bank
  Any 1:1 or multicast pattern
  Separate banks per thread
  Banks can multicast

Multiple words from same bank serialize
  Called a bank conflict, causes instruction replay

© NVIDIA 2013
Shared Memory: Avoiding Bank Conflicts

Example: 32x32 SMEM array


Warp accesses a column:
32-way bank conflicts (threads in a warp access the same bank)

[Figure: a 32x32 shared memory array; each row spans banks 0..31, so a column lies entirely in one bank]
© NVIDIA 2013
Shared Memory: Avoiding Bank Conflicts

Example: 32x32 SMEM array


Warp accesses a column:
32-way bank conflicts (threads in a warp access the same bank)

Accesses along a row produce 0 bank conflicts
Accesses along a column produce 32 bank conflicts (replays)

[Figure: 32x32 shared memory array; a column access hits the same bank 32 times]
© NVIDIA 2013
Shared Memory: Avoiding Bank Conflicts

Add a column for padding: 32x33 SMEM array
Warp accesses a column: 32 different banks, no bank conflicts

Accesses along a row produce no bank conflicts
Accesses along a column produce no bank conflicts

[Figure: 32x33 shared memory array; the padding column shifts each row by one bank, so a column access touches 32 different banks]
© NVIDIA 2013
Shared Memory/L1 Sizing

Shared memory and L1 use the same 64KB physical memory


Program-configurable split:
Fermi: 48:16, 16:48
Kepler: 48:16, 16:48, 32:32
CUDA API: cudaDeviceSetCacheConfig(), cudaFuncSetCacheConfig()
Large L1 can improve performance when:
Spilling registers (more lines in the cache -> fewer evictions)
Large SMEM can improve performance when:
Occupancy is limited by SMEM
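A minimal sketch of the API calls named above (the kernel name myKernel is illustrative):

cudaDeviceSetCacheConfig(cudaFuncCachePreferShared);         // 48KB shared : 16KB L1, device-wide
cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);     // 16KB shared : 48KB L1, for this kernel
cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferEqual);  // 32KB : 32KB (Kepler only)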

© NVIDIA 2013
Final Notes on Shared Memory

Fast: high bandwidth, low latency


Useful as user managed cache for coalescing, caching, and
communication within a thread block
Shared memory size / L1 cache size is API-configurable
16k L1 / 48k Shared (default on both Fermi and Kepler)
48k L1 / 16k Shared
32k L1 / 32k Shared (Kepler only).
Be careful of:
Overuse: Excessive allocation can hurt occupancy
Access pattern: Lots of bank conflicts can hurt performance

© NVIDIA 2013
OPTIMIZE
Kernel Optimizations: Instruction Throughput / Control Flow

© NVIDIA 2013
Exposing Sufficient Parallelism

What SMX ultimately needs:


Sufficient number of independent instructions
Kepler GK110 is “wider” than Fermi or GK104; needs more parallelism

Two ways to increase parallelism:


More independent instructions (ILP) within a thread (warp)
More concurrent threads (warps)

© NVIDIA 2013
Independent Instructions: ILP vs. TLP

SMX can leverage available Instruction-Level Parallelism more or


less interchangeably with Thread-Level Parallelism

Sometimes easier to increase ILP than to increase TLP


E.g., # of threads may be limited by algorithm or by HW resource limits
But if each thread has some degree of independent operations to do,
Kepler SMX can leverage that. (E.g., a small loop that is unrolled.)

In fact, some degree of ILP is actually required to approach


theoretical max Instructions Per Clock (IPC)

© NVIDIA 2013
Control Flow

Instructions are issued per 32 threads (warp)

Divergent branches:
Threads within a single warp take different paths
if-else, ...
Different execution paths within a warp are serialized

Different warps can execute different code with no impact on


performance

© NVIDIA 2013
Control Flow

Avoid diverging within a warp


Note: some divergence is not necessarily a problem, but large
amounts impacts execution efficiency

Example with divergence:


if (threadIdx.x > 2) {...} else {...}
Branch granularity < warp size

Example without divergence:


if (threadIdx.x / warpSize > 2) {...} else {...}
Branch granularity is a whole multiple of warp size
© NVIDIA 2013
Control Flow

if ( ... )
{
    // then-clause instructions
}
else
{
    // else-clause instructions
}

© NVIDIA 2013
Execution within warps is coherent

[Figure: two warps (threads 0..31 and 32..63) each execute their instructions in lock-step over time]

© NVIDIA 2013
Execution diverges within a warp

[Figure: within each warp, subsets of threads take different paths, and the divergent instruction streams are serialized]

© NVIDIA 2013
Execution diverges within a warp

[Figure: within each warp, subsets of threads take different paths, and the divergent instruction streams are serialized]

Solution: Group threads with similar control flow
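One hedged way to do that grouping (not from the deck; names are illustrative): partition work items by the branch predicate before launching, e.g. with thrust::partition, so each warp processes items that take the same path.

#include <thrust/device_vector.h>
#include <thrust/sequence.h>
#include <thrust/partition.h>

struct TakesThenBranch {
    const float *d;                 // illustrative device pointer to per-item values
    __host__ __device__ bool operator()(int i) const { return d[i] > 0.5f; }
};

int partitionByBranch(const float *d_data, int n, thrust::device_vector<int> &d_idx)
{
    d_idx.resize(n);
    thrust::sequence(d_idx.begin(), d_idx.end());              // indices 0, 1, ..., n-1
    return thrust::partition(d_idx.begin(), d_idx.end(),
                             TakesThenBranch{d_data}) - d_idx.begin();
}
// launch over index ranges [0, nThen) and [nThen, n) separately, or pass d_idx to the
// kernel so each warp reads items that take the same branch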


© NVIDIA 2013
Runtime Math Library and Intrinsics

Two types of runtime math library functions


__func(): many map directly to hardware ISA
Fast but lower accuracy (see CUDA Programming Guide for full details)
Examples: __sinf(x), __expf(x), __powf(x, y)
func(): compile to multiple instructions
Slower but higher accuracy (5 ulp or less)
Examples: sin(x), exp(x), pow(x, y)

A number of additional intrinsics:


__sincosf(), __frcp_rz(), ...
Explicit IEEE rounding modes (rz,rn,ru,rd)
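A minimal sketch of the difference inside device code (x is an illustrative float):

float accurate = sinf(x) * expf(x);      // library versions: multiple instructions, high accuracy
float fast     = __sinf(x) * __expf(x);  // intrinsics: map to hardware, lower accuracy
// compiling with nvcc -use_fast_math rewrites func() calls into their __func() equivalents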

© NVIDIA 2013
OPTIMIZE
Optimizing CPU-GPU Interaction: Maximizing PCIe Throughput

© NVIDIA 2013
Maximizing PCIe Throughput

Use transfers that are of reasonable size (a few MB, at least)


Use pinned system memory
Overlap memcopies with useful computation

© NVIDIA 2013
Pinned (non-pageable) memory

Pinned memory enables:


faster PCIe copies
memcopies asynchronous with CPU
memcopies asynchronous with GPU
Usage
cudaHostAlloc / cudaFreeHost
instead of malloc / free
cudaHostRegister / cudaHostUnregister
pin regular memory after allocation
Implication:
pinned memory is essentially removed from host virtual memory
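A minimal sketch of the two usage patterns above (sizes and names are illustrative):

float *h_buf;
cudaHostAlloc((void**)&h_buf, bytes, cudaHostAllocDefault);      // allocate pinned host memory
// ... cudaMemcpyAsync to/from h_buf can now overlap with CPU and GPU work ...
cudaFreeHost(h_buf);

float *h_existing = (float*)malloc(bytes);
cudaHostRegister(h_existing, bytes, cudaHostRegisterDefault);    // pin an existing allocation
// ...
cudaHostUnregister(h_existing);
free(h_existing);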

© NVIDIA 2013
Asynchronicity in CUDA

Default:
Kernel launches are asynchronous with CPU
Memcopies (D2H, H2D) block CPU thread
CUDA calls are serialized by the driver
Streams and async functions provide additional asynchronicity:
Memcopies (D2H, H2D) asynchronous with CPU
Ability to concurrently execute kernels and memcopies

Stream: sequence of ops that execute in issue-order on GPU


Operations from different streams may be interleaved
Kernels and memcopies from different streams can be overlapped
© NVIDIA 2013
OPTIMIZE
Optimizing CPU-GPU Interaction: Overlapping Kernel
Execution with Memory Copies

© NVIDIA 2013
Overlap kernel and memory copy

Requirements:
D2H or H2D memcopy from pinned memory
Kernel and memcopy in different, non-0 streams
Code:
cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);

cudaMemcpyAsync( dst, src, size, dir, stream1 );    // potentially
kernel<<<grid, block, 0, stream2>>>(…);             // overlapped

© NVIDIA 2013
Call Sequencing for Optimal Overlap

CUDA calls are dispatched in the sequence they were issued


Kepler can concurrently execute:
Up to 32 kernels
Up to 2 memcopies, as long as they are in different directions (D2H, H2D)
A call is dispatched if both are true:
Resources are available
Preceding calls in the same stream have completed
Scheduling:
Kernels are executed in the order in which they were issued
Thread blocks for a given kernel are scheduled if all thread blocks for
preceding kernels have been scheduled and SM resources still available

© NVIDIA 2013
Hyper-Q Enables Efficient Scheduling

Grid Management Unit selects most appropriate task from up to


32 hardware queues (CUDA streams)

Improves scheduling of concurrently executed grids

Particularly interesting for MPI applications when combined with


CUDA MPS (though not limited to MPI applications)

© NVIDIA 2013
Stream Examples without Hyper-Q

K1,M1,K2,M2: K1 K2
M1 M2

K1,K2,M1,M2: K1 K2
M1 M2 K: Kernel
M: Memcopy
K1,M1,M2: K1 Integer: Stream ID
M2 M1

K1,M2,M1: K1
M1 M2

K1,M2,M2: K1
M2 M2

Time
© NVIDIA 2013
Stream Examples with Hyper-Q

K1,M1,K2,M2: K1 K2
M1 M2

K1,K2,M1,M2: K1 K2
M1 M2 K: Kernel
M: Memcopy
K1,M1,M2: K1 Integer: Stream ID
M2 M1

K1,M2,M1: K1
M2 M1

K1,M2,M2: K1
M2 M2

Time
© NVIDIA 2013
Grid Management

[Figure: on Fermi, Stream Queue Mgmt (queues A-B-C, P-Q-R, X-Y-Z) feeds a Work Distributor tracking 16 active grids across the SMs. On Kepler GK110, CUDA-generated work feeds Stream Queue Mgmt and then a Grid Management Unit that holds 1000s of pending and suspended grids, feeding a Work Distributor tracking 32 active grids across the SMXs.]

© NVIDIA 2013
Stream Dependencies Example

void foo(void)
{
    kernel_A<<<g,b,s, stream_1>>>();
    kernel_B<<<g,b,s, stream_1>>>();
    kernel_C<<<g,b,s, stream_1>>>();
}

void bar(void)
{
    kernel_P<<<g,b,s, stream_2>>>();
    kernel_Q<<<g,b,s, stream_2>>>();
    kernel_R<<<g,b,s, stream_2>>>();
}

stream_1: kernel_A -> kernel_B -> kernel_C
stream_2: kernel_P -> kernel_Q -> kernel_R

© NVIDIA 2013
Stream Dependencies without Hyper-Q

stream_1: kernel_A -> kernel_B -> kernel_C
stream_2: kernel_P -> kernel_Q -> kernel_R

Single Hardware Work Queue: R—Q—P C—B—A
(stream_2’s kernels are queued behind stream_1’s, creating false dependencies between streams)

© NVIDIA 2013
Stream Dependencies with Hyper-Q

stream_1: kernel_A -> kernel_B -> kernel_C
stream_2: kernel_P -> kernel_Q -> kernel_R

Multiple Hardware Work Queues: C—B—A and R—Q—P in separate queues

Hyper-Q allows 32-way concurrency
Avoids inter-stream dependencies

© NVIDIA 2013
Hyper-Q Example: Building a Pipeline

Heterogeneous system: overlap work and data movement


Kepler + CUDA 5: Hyper-Q and CPU Callbacks
© NVIDIA 2013
Tick-Tock Matrix Multiply
cudaMemcpyAsync(devA1, A[tile0], N, H2D, stream1);
cudaMemcpyAsync(devB1, B[tile0], N, H2D, stream1);
DGEMM<<<g,b,s, stream1>>>(devA1, devB1, devC1);

cudaMemcpyAsync(devA2, A[tile1], N, H2D, stream2);
cudaMemcpyAsync(devB2, B[tile1], N, H2D, stream2);
DGEMM<<<g,b,s, stream2>>>(devA2, devB2, devC2);

cudaMemcpyAsync(C[tile0], devC1, N, D2H, stream1);
cudaMemcpyAsync(devA1, A[tile2], N, H2D, stream1);
cudaMemcpyAsync(devB1, B[tile2], N, H2D, stream1);
DGEMM<<<g,b,s, stream1>>>(devA1, devB1, devC1);

cudaMemcpyAsync(C[tile1], devC2, N, D2H, stream2);
cudaMemcpyAsync(devA2, A[tile3], N, H2D, stream2);
cudaMemcpyAsync(devB2, B[tile3], N, H2D, stream2);
DGEMM<<<g,b,s, stream2>>>(devA2, devB2, devC2);
© NVIDIA 2013
Tick-Tock Matrix Multiply
[Figure: timeline of the tick-tock pattern. Copies of tiles 0..5 and DGEMM compute of tiles 0..4 alternate between stream 1 (dA1, dB1, dC1) and stream 2 (dA2, dB2, dC2): while one stream runs dC = dA x dB for its tile, the other stream is copying its next tiles between CPU and GPU memory.]

© NVIDIA 2013
Just a Higher Level of Parallelism
Problem is decomposed into parallel “workers”
At any given time:
  1 worker is using compute resources
  1 worker is using copy transfers
Importantly:
  The PCI-E link is kept saturated with useful work
  For DGEMM, compute is also saturated

Arch-specific balancing
  Depends on CPU and GPU characteristics

[Figure: result matrix with tiles alternately computed by stream 1 and stream 2]
© NVIDIA 2013
Pipeline Code
for (unsigned int i = 0 ; i < nIterations ; ++i)
{
// Copy data from host to device
cudaMemcpyAsync(d_data, h_data, cpybytes, cudaMemcpyHostToDevice,
*r_streams.active());

// Launch device kernel A


kernel_A<<<gdim, bdim, 0, *r_streams.active()>>>();

// Copy data from device to host


cudaMemcpyAsync(h_data, d_data, cpybytes, cudaMemcpyDeviceToHost,
*r_streams.active());

// Launch host post-process


cudaStreamAddCallback(*r_streams.active(), cpu_callback,
r_streamids.active(), 0);

// Rotate streams
r_streams.rotate(); r_streamids.rotate();
}

© NVIDIA 2013
Pipeline Without Hyper-Q

False dependencies prevent overlap


Breadth-first launch gives overlap, requires more complex code
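A hedged sketch of what “breadth-first launch” means here (stream array, buffers, and kernel are illustrative): issue all H2D copies, then all kernels, then all D2H copies, so a single hardware queue never serializes one stream’s D2H behind another stream’s H2D.

for (int i = 0; i < nStreams; ++i)
    cudaMemcpyAsync(d_in[i], h_in[i], bytes, cudaMemcpyHostToDevice, stream[i]);
for (int i = 0; i < nStreams; ++i)
    kernel<<<grid, block, 0, stream[i]>>>(d_in[i], d_out[i]);
for (int i = 0; i < nStreams; ++i)
    cudaMemcpyAsync(h_out[i], d_out[i], bytes, cudaMemcpyDeviceToHost, stream[i]);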

© NVIDIA 2013
Pipeline With Hyper-Q

Full overlap of all engines


Simple to program

© NVIDIA 2013
Hyper-Q also enables CUDA MPS

No application modifications necessary


Start the MPS daemon using nvidia-cuda-mps-control -d
CUDA driver detects daemon and routes GPU accesses through it

Combines requests from several processes into one GPU context


(shared virtual memory space, concurrent kernels possible, etc.)

Allows for overlap of kernels with memcopies without explicit


use of streams

© NVIDIA 2013
But Hyper-Q != CUDA MPS

One process: No MPS required!


Automatically utilized
One or many host threads no problem
Just need multiple CUDA streams
Removes false dependencies among CUDA streams that
reduce effective concurrency on earlier GPUs

Multi-process: Use CUDA MPS


Leverages task-level parallelism across processes (e.g., MPI ranks)
MPI is not required for MPS – it’s just the common case for HPC

© NVIDIA 2013
Deploy

We’ve removed (or reduced) some bottleneck


Our app is now faster while remaining fully functional*
Let’s take advantage of that!

*Don’t forget to check correctness at every step

© NVIDIA 2013
GPU Optimization Fundamentals

Recap:
Develop systematically with APOD
Expose sufficient parallelism
Utilize parallel processing resources efficiently

Assess -> Parallelize -> Optimize -> Deploy (and repeat)

© NVIDIA 2013
Online Resources

www.udacity.com
devtalk.nvidia.com
developer.nvidia.com
docs.nvidia.com
www.stackoverflow.com


© NVIDIA 2013
