04 CUDA Fundamental Optimization
PART 2
NVIDIA Corporation
OUTLINE
GLOBAL MEMORY THROUGHPUT
MEMORY HIERARCHY REVIEW
Local storage
Shared memory / L1
L2
All accesses to global memory go through L2, including copies to/from CPU host
Global memory
MEMORY ARCHITECTURE
[Figure: memory architecture diagram. Labels: Host (CPU, DRAM); Device (GPU, DRAM, Local); several Multiprocessors, each with Registers and Shared Memory]
MEMORY HIERARCHY REVIEW
L2
Global Memory
GMEM OPERATIONS
Loads:
Caching
Default mode
Stores:
GMEM OPERATIONS
Loads:
Non-caching
We won’t spend much time with non-caching loads in this training session
LOAD OPERATION
Operation:
CACHING LOAD
Warp requests 32 aligned, consecutive 4-byte words
int c = a[idx];
[Figure: addresses from a warp mapped onto memory addresses, axis marked every 32 bytes from 0 to 448]
CACHING LOAD
Warp requests 32 aligned, permuted 4-byte words
int c = a[rand()%warpSize];
[Figure: addresses from a warp mapped onto memory addresses, axis marked every 32 bytes from 0 to 448]
CACHING LOAD
Warp requests 32 misaligned, consecutive 4-byte words
int c = a[idx-2];
[Figure: addresses from a warp mapped onto memory addresses, axis marked every 32 bytes from 0 to 448]
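The aligned, permuted, and misaligned cases above can be compared with a small host-side C++ model. This is only a sketch: it assumes a 128-byte cache line and an array `a` that starts on a line boundary, and the helper names (`consecutive_words`, `reversed_lanes`, `lines_touched`) are hypothetical, not part of CUDA.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

constexpr std::size_t kWarpSize = 32;      // threads per warp
constexpr std::uint64_t kLineBytes = 128;  // assumed cache-line size in bytes
constexpr std::uint64_t kWordBytes = 4;    // each thread loads one 4-byte int

// Byte addresses read by one warp executing `a[first_index + lane]`,
// with `a` assumed to start at byte address 0 (a line-aligned base).
std::vector<std::uint64_t> consecutive_words(std::uint64_t first_index) {
    std::vector<std::uint64_t> addrs;
    for (std::size_t lane = 0; lane < kWarpSize; ++lane) {
        addrs.push_back((first_index + lane) * kWordBytes);
    }
    return addrs;
}

// The same addresses with the lane order reversed: one simple permutation.
std::vector<std::uint64_t> reversed_lanes(std::vector<std::uint64_t> addrs) {
    std::reverse(addrs.begin(), addrs.end());
    return addrs;
}

// Number of distinct cache lines the warp's request touches.
std::size_t lines_touched(const std::vector<std::uint64_t>& addrs) {
    assert(addrs.size() == kWarpSize);
    std::set<std::uint64_t> lines;
    for (std::uint64_t addr : addrs) {
        lines.insert(addr / kLineBytes);
    }
    return lines.size();
}
```

Under these assumptions, `consecutive_words(32)` (the `a[idx]` case for the second warp) touches one line, reversing the lanes still touches one line, and `consecutive_words(30)` (the `a[idx-2]` case) spills into two lines, so twice as many bytes move for the same 128 useful bytes.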
CACHING LOAD
All threads in a warp request the same 4-byte word
int c = a[40];
[Figure: addresses from a warp mapped onto memory addresses, axis marked every 32 bytes from 0 to 448]
CACHING LOAD
Warp requests 32 scattered 4-byte words
int c = a[rand()];
[Figure: addresses from a warp mapped onto memory addresses, axis marked every 32 bytes from 0 to 448]
NON-CACHING LOAD
Warp requests 32 scattered 4-byte words
[Figure: addresses from a warp mapped onto memory addresses, axis marked every 32 bytes from 0 to 448]
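To see why scattered accesses cost less with non-caching loads, the sketch below compares bytes needed with bytes moved for the two load modes. It is a host-side C++ model under stated assumptions: caching loads fetch whole 128-byte lines, non-caching loads fetch 32-byte segments, and every access misses. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

constexpr std::uint64_t kWordBytes = 4;      // each thread loads a 4-byte word
constexpr std::uint64_t kLineBytes = 128;    // assumed granularity of a caching load
constexpr std::uint64_t kSegmentBytes = 32;  // assumed granularity of a non-caching load

// Count the distinct `granule`-byte blocks that contain the given addresses.
std::size_t blocks_touched(const std::vector<std::uint64_t>& addrs,
                           std::uint64_t granule) {
    assert(granule > 0);
    std::set<std::uint64_t> blocks;
    for (std::uint64_t addr : addrs) {
        blocks.insert(addr / granule);
    }
    return blocks.size();
}

// Useful bytes divided by bytes moved, assuming every touched block is
// fetched in full (that is, every access misses).
double bus_utilization(const std::vector<std::uint64_t>& addrs,
                       std::uint64_t granule) {
    const double needed =
        static_cast<double>(blocks_touched(addrs, kWordBytes) * kWordBytes);
    const double moved =
        static_cast<double>(blocks_touched(addrs, granule) * granule);
    return needed / moved;
}

// Worst-case scatter: lane i reads the word at byte address i * spacing.
std::vector<std::uint64_t> scattered_words(std::uint64_t spacing) {
    std::vector<std::uint64_t> addrs;
    for (std::uint64_t lane = 0; lane < 32; ++lane) {
        addrs.push_back(lane * spacing);
    }
    return addrs;
}
```

With 32 words spaced 4096 bytes apart, this model gives a utilization of 128/4096 (3.125%) at line granularity and 128/1024 (12.5%) at segment granularity, a factor of four in favor of the smaller transfer size.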
GMEM OPTIMIZATION GUIDELINES
Strive for perfect coalescing
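As a rough illustration of what coalescing buys, the host-side C++ sketch below counts how many 128-byte lines one warp touches when consecutive threads read elements a fixed stride apart. The 128-byte line size, the line-aligned array, and the function name `lines_for_stride` are assumptions made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>

// Number of 128-byte lines one warp touches when lane i reads the 4-byte
// element at index i * stride of a line-aligned array.
std::size_t lines_for_stride(std::uint64_t stride) {
    assert(stride >= 1);
    std::set<std::uint64_t> lines;
    for (std::uint64_t lane = 0; lane < 32; ++lane) {
        const std::uint64_t byte_addr = lane * stride * 4;
        lines.insert(byte_addr / 128);
    }
    return lines.size();
}
```

A stride of 1 (consecutive threads reading consecutive words) touches a single line. A stride of 32 or more, which is what happens when consecutive threads walk down a column of a row-major matrix at least 32 elements wide, touches 32 lines for the same 128 useful bytes.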
SHARED MEMORY
Uses:
Organization:
32 banks, 4-byte wide banks
SHARED MEMORY
Performance:
Serialization: if N of the 32 threads in a warp access different 4-byte words in the same bank, the N accesses are executed serially
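The serialization rule can be checked with a short host-side C++ model that maps each address to a bank and counts how many different words land in the busiest bank. It uses the 32 banks of 4 bytes stated above; the function names are hypothetical, and the model ignores everything except the bank mapping.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

constexpr std::size_t kBanks = 32;        // 32 banks
constexpr std::uint64_t kBankWidth = 4;   // each bank is 4 bytes wide

// Degree of serialization for one warp-wide shared-memory access: the
// largest number of *different* 4-byte words that fall in a single bank.
// Threads reading the same word do not add to the count.
std::size_t conflict_degree(const std::vector<std::uint64_t>& byte_addrs) {
    assert(byte_addrs.size() <= 32);
    std::array<std::set<std::uint64_t>, kBanks> words_per_bank;
    for (std::uint64_t addr : byte_addrs) {
        const std::uint64_t word = addr / kBankWidth;
        words_per_bank[word % kBanks].insert(word);
    }
    std::size_t degree = 0;
    for (const auto& words : words_per_bank) {
        degree = std::max(degree, words.size());
    }
    return degree;
}

// Byte addresses for lane i reading the 4-byte word at index i * stride.
std::vector<std::uint64_t> strided_words(std::uint64_t stride) {
    std::vector<std::uint64_t> addrs;
    for (std::uint64_t lane = 0; lane < 32; ++lane) {
        addrs.push_back(lane * stride * kBankWidth);
    }
    return addrs;
}
```

In this model a stride of 1 or any odd stride gives degree 1 (no conflict), strides of 2, 16, and 32 give 2-way, 16-way, and 32-way conflicts, and a stride of 0 (every thread reading the same word) also gives degree 1.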
BANK ADDRESSING EXAMPLES
[Figure: two bank-addressing panels, both titled "No Bank Conflicts"]
BANK ADDRESSING EXAMPLES
[Figure: two bank-addressing panels, titled "2-way Bank Conflicts" and "16-way Bank Conflicts"]
SHARED MEMORY: AVOIDING BANK CONFLICTS
32x32 SMEM array
[Figure: 32x32 shared-memory array; labels: warps 0, 1, 2, ..., 31 and Bank 0, Bank 1, ..., Bank 31]
SHARED MEMORY: AVOIDING BANK CONFLICTS
Add a column for padding:
[Figure: the same array with one padding column added; labels: 0, 1, 2, ..., 31 and Bank 0, Bank 1, ..., Bank 31]
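The effect of the padding column can be worked out on the host. The C++ sketch below assumes a row-major array of 4-byte elements and 32 banks of 4 bytes, and counts how many distinct banks a warp hits when its 32 threads read one column. The names `bank_of` and `banks_hit_by_column` are hypothetical; in a kernel the padded array would typically be declared along the lines of `__shared__ float tile[32][33];`.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

constexpr std::size_t kBanks = 32;  // 32 banks, each 4 bytes wide

// Bank holding element (row, col) of a row-major array of 4-byte elements
// whose rows are `row_pitch` elements long.
std::size_t bank_of(std::size_t row, std::size_t col, std::size_t row_pitch) {
    assert(col < row_pitch);
    return (row * row_pitch + col) % kBanks;
}

// Distinct banks hit when the 32 lanes of a warp read one column,
// with lane i taking row i.
std::size_t banks_hit_by_column(std::size_t col, std::size_t row_pitch) {
    std::set<std::size_t> banks;
    for (std::size_t lane = 0; lane < 32; ++lane) {
        banks.insert(bank_of(lane, col, row_pitch));
    }
    return banks.size();
}
```

With a row pitch of 32, every element of a column maps to the same bank, so a column read is a 32-way conflict. With a pitch of 33, element (row, col) maps to bank (row + col) mod 32, so the 32 rows of any column fall in 32 different banks.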
SUMMARY
Global memory:
Cooperative Groups
FURTHER STUDY
Optimization in-depth:
http://on-demand.gputechconf.com/gtc/2013/presentations/S3466-Programming-Guidelines-GPU-Architecture.pdf
Analysis-Driven Optimization:
http://on-demand.gputechconf.com/gtc/2012/presentations/S0514-GTC2012-GPU-Performance-Analysis.pdf
https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html
https://docs.nvidia.com/cuda/index.html#programming-guides
(Kepler/Maxwell/Pascal/Volta)
HOMEWORK
https://github.com/olcf/cuda-training-series/blob/master/exercises/hw4/readme.md
Prerequisites: basic Linux skills (e.g. ls, cd), knowledge of a text editor such as vi or emacs, and some knowledge of C/C++ programming
QUESTIONS?