
04 CUDA Fundamental Optimization

This document discusses optimizations for global memory throughput and shared memory access on NVIDIA GPUs. It reviews the GPU memory hierarchy including registers, shared memory, L1/L2 cache, and global memory. It describes how to optimize global memory loads for coalescing to improve throughput. Non-caching loads are also discussed. Guidelines are provided to fully utilize the memory bandwidth and caches. Shared memory is described as a way to reduce redundant global loads and improve access patterns. Bank conflicts in shared memory can serialize accesses and hurt performance.


CUDA OPTIMIZATION, PART 2
NVIDIA Corporation
OUTLINE

Most concepts in this presentation apply to any language or API on NVIDIA GPUs

Architecture: Kepler/Maxwell/Pascal/Volta

Kernel optimizations

Launch configuration

Part 2 (this session):

Global memory throughput

Shared memory access

2
GLOBAL MEMORY THROUGHPUT
MEMORY HIERARCHY REVIEW

Local storage

Each thread has its own local storage

Typically registers (managed by the compiler)

Shared memory / L1

Program configurable: typically up to 48KB shared (or 64KB, or 96KB…)

Shared memory is accessible by threads in the same threadblock

Very low latency

Very high throughput: >1 TB/s aggregate


4
MEMORY HIERARCHY REVIEW

L2

All accesses to global memory go through L2, including copies to/from CPU host

Global memory

Accessible by all threads as well as host (CPU)

High latency (hundreds of cycles)

Throughput: up to ~900 GB/s (Volta V100)

5
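To make this concrete, here is a minimal sketch (kernel name, sizes, and launch configuration are assumptions, not from the slides) showing global memory allocated on the device, written by every thread, and copied back to the host:

#include <cstdlib>

__global__ void fill(int *data, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        data[idx] = idx;                              // every thread can access global memory
}

int main()
{
    const int n = 1 << 20;
    int *d_data = 0;
    cudaMalloc((void **)&d_data, n * sizeof(int));    // allocation lives in device DRAM
    fill<<<(n + 255) / 256, 256>>>(d_data, n);        // device threads write it directly
    int *h_data = (int *)malloc(n * sizeof(int));
    cudaMemcpy(h_data, d_data, n * sizeof(int),
               cudaMemcpyDeviceToHost);               // host copies also pass through L2 (previous slide)
    cudaFree(d_data);
    free(h_data);
    return 0;
}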
MEMORY ARCHITECTURE
[Block diagram: the host (CPU, DRAM, chipset) connects to the device. Device DRAM holds local, global, constant, and texture memory; each GPU multiprocessor has its own registers and shared memory and reaches DRAM through the L1/L2 cache and the constant and texture caches.]

6
MEMORY HIERARCHY REVIEW

[Diagram: each SM (SM-0 through SM-N) has its own registers, L1, and shared memory (SMEM); all SMs share the L2 cache, which sits in front of global memory.]
7
GMEM OPERATIONS

Loads:

Caching

Default mode

Attempts to hit in L1, then L2, then GMEM

Load granularity is 128-byte line

Stores:

Invalidate L1, write-back for L2

8
GMEM OPERATIONS

Loads:

Non-caching

Compile with the -Xptxas -dlcm=cg option to nvcc

Attempts to hit in L2, then GMEM

Does not hit in L1; invalidates the line if it is already in L1


Load granularity is 32 bytes

We won’t spend much time with non-caching loads in this training session

9
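As a usage sketch (the source file name and -arch value are placeholders; only the -Xptxas -dlcm=cg flag comes from the slide), non-caching loads would be enabled at compile time like this:

nvcc -arch=sm_70 -Xptxas -dlcm=cg -o app kernel.cu

The default caching behavior described on the previous slide corresponds to -dlcm=ca (cache at all levels).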
LOAD OPERATION

Memory operations are issued per warp (32 threads)

Just like all other instructions

Operation:

Threads in a warp provide memory addresses

Determine which lines/segments are needed

Request the needed lines/segments

10
CACHING LOAD
Warp requests 32 aligned, consecutive 4-byte words

Addresses fall within 1 cache-line


Warp needs 128 bytes

128 bytes move across the bus on a miss

Bus utilization: 100%

int c = a[idx];
[Diagram: the warp's addresses, plotted on the memory address axis, all fall within one aligned 128-byte line]
11
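For context, a minimal kernel sketch (kernel and variable names are assumed) in which the load above is perfectly coalesced:

// Consecutive threads read consecutive 4-byte words, so each warp's
// 32 addresses fall in one aligned 128-byte line.
__global__ void read_coalesced(const int *a, int *out, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;   // thread's global index
    if (idx < n) {
        int c = a[idx];                                 // the access pattern from the slide
        out[idx] = c;
    }
}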
CACHING LOAD
Warp requests 32 aligned, permuted 4-byte words

Addresses fall within 1 cache-line


Warp needs 128 bytes

128 bytes move across the bus on a miss

Bus utilization: 100%

int c = a[rand()%warpSize];
[Diagram: the warp's addresses are permuted but still fall within one aligned 128-byte line]
12
CACHING LOAD
Warp requests 32 misaligned, consecutive 4-byte words

Addresses fall within 2 cache-lines


Warp needs 128 bytes

256 bytes move across the bus on misses

Bus utilization: 50%

int c = a[idx-2];
[Diagram: the warp's addresses are shifted off the 128-byte alignment boundary and straddle two cache lines]
13
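A corresponding sketch (names assumed) of the misaligned pattern above; aligning the starting address, for example by padding the allocation, restores one line per warp:

__global__ void read_offset(const int *a, int *out, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= 2 && idx < n)
        out[idx] = a[idx - 2];   // each warp's 128 bytes straddle two lines: ~50% bus utilization
    // Aligning the starting address (e.g. padding the allocation so the first
    // accessed element sits on a 128-byte boundary) avoids the second line.
}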
CACHING LOAD
All threads in a warp request the same 4-byte word

Addresses fall within a single cache-line


Warp needs 4 bytes

128 bytes move across the bus on a miss

Bus utilization: 3.125%

int c = a[40];
[Diagram: all 32 addresses point at the same 4-byte word within one cache line]
14
CACHING LOAD
Warp requests 32 scattered 4-byte words

Addresses fall within N cache-lines


Warp needs 128 bytes

N*128 bytes move across the bus on a miss

Bus utilization: 128 / (N*128) (3.125% worst case N=32)

int c = a[rand()];
[Diagram: the warp's addresses are scattered across N different cache lines]
15
NON-CACHING LOAD
Warp requests 32 scattered 4-byte words

Addresses fall within N segments


Warp needs 128 bytes

N*32 bytes move across the bus on a miss

Bus utilization: 128 / (N*32) (12.5% worst case N = 32)

int c = a[rand()];   (compiled with -Xptxas -dlcm=cg)


[Diagram: the warp's addresses are scattered across N different 32-byte segments]
16
GMEM OPTIMIZATION GUIDELINES
Strive for perfect coalescing

(Align starting address - may require padding)

A warp should access within a contiguous region

Have enough concurrent accesses to saturate the bus

Process several elements per thread

Multiple loads get pipelined

Indexing calculations can often be reused

Launch enough threads to maximize throughput

Latency is hidden by switching threads (warps)

Use all the caches!


17
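A minimal sketch that follows these guidelines (kernel name and parameters are assumed): a grid-stride loop keeps each warp's accesses contiguous, processes several elements per thread, and reuses the index calculation:

__global__ void scale(float *x, float alpha, int n)
{
    // Coalesced: at every iteration, consecutive threads touch consecutive elements.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;   // index computed once...
         i < n;
         i += gridDim.x * blockDim.x)                     // ...then reused each iteration
        x[i] = alpha * x[i];                              // several elements per thread
}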
SHARED MEMORY
SHARED MEMORY

Uses:

Inter-thread communication within a block

Cache data to reduce redundant global memory accesses

Use it to improve global memory access patterns

Organization:
32 banks, each 4 bytes wide

Successive 4-byte words belong to different banks

19
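A minimal sketch of the caching use case (kernel name and block size are assumptions; launch with 256-thread blocks): each element is loaded from global memory once and then reused by a neighboring thread out of shared memory:

#define BLOCK 256

__global__ void adjacent_diff(const float *in, float *out, int n)
{
    __shared__ float tile[BLOCK];
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < n)
        tile[threadIdx.x] = in[gid];      // one coalesced global load per element
    __syncthreads();                      // make the tile visible to the whole block

    if (gid < n && threadIdx.x > 0)
        out[gid] = tile[threadIdx.x] - tile[threadIdx.x - 1];   // neighbor comes from SMEM
    // (block-boundary elements are skipped to keep the sketch short)
}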
SHARED MEMORY

Performance:

Typically: 4 bytes per bank per 1 or 2 clocks per multiprocessor

Shared memory accesses are issued per warp (32 threads)

Serialization: if N of the 32 threads access different 4-byte words in the same bank, the N accesses are executed serially

Multicast: N threads reading the same word are serviced with one fetch

Could be different bytes within the same word

20
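To make the bank mapping concrete, here is a minimal sketch (names are assumed; launch with 64-thread blocks). For 4-byte elements, word i of a shared array lives in bank i % 32:

__global__ void bank_patterns(float *out)
{
    __shared__ float s[64];
    s[threadIdx.x] = (float)threadIdx.x;      // thread i -> bank i % 32: conflict-free
    __syncthreads();

    float a = s[threadIdx.x];                 // 32 threads of a warp hit 32 different banks
    float b = s[(threadIdx.x * 2) % 64];      // stride 2: two threads per bank -> 2-way conflict
    float c = s[0];                           // same word for every thread -> one fetch (multicast)
    out[threadIdx.x] = a + b + c;
}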
BANK ADDRESSING EXAMPLES
No Bank Conflicts (two examples)

[Diagram: left: linear addressing, thread i accesses bank i; right: a permutation in which each of the 32 threads still accesses a different bank. Both patterns are conflict-free.]

21
BANK ADDRESSING EXAMPLES
2-way Bank Conflicts / 16-way Bank Conflicts

[Diagram: left: two threads map to each bank, giving 2-way conflicts; right: groups of 16 threads map to the same bank, giving 16-way conflicts.]

22
SHARED MEMORY: AVOIDING BANK CONFLICTS
32x32 SMEM array

Warp accesses a column:


32-way bank conflicts (threads in a warp access the same bank)

[Diagram: in a 32x32 array, every element of a column lies in the same bank, so a warp reading a column hits one bank 32 times]
23
SHARED MEMORY: AVOIDING BANK CONFLICTS
Add a column for padding:

32x33 SMEM array

Warp accesses a column:

32 different banks, no bank conflicts


[Diagram: with one padding column (32x33), successive rows of a column shift to successive banks, so a warp reading a column touches all 32 banks]
24
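A minimal sketch of the padding trick (kernel and variable names are assumed; launch with 32-thread blocks, with col in the range 0..31):

__global__ void column_read(float *out, int col)
{
    // Without padding (32x32), element [r][c] sits in bank c % 32, so a whole
    // column lives in one bank and the warp serializes 32 ways.
    // With one padding column (32x33), element [r][c] sits in bank (r + c) % 32,
    // so the 32 rows of a column land in 32 different banks.
    __shared__ float tile[32][33];

    tile[threadIdx.x][col] = (float)threadIdx.x;   // column write: conflict-free thanks to padding
    __syncthreads();

    out[threadIdx.x] = tile[threadIdx.x][col];     // column read: no bank conflicts
}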
SUMMARY

Kernel Launch Configuration:

Launch enough threads per SM to hide latency

Launch enough threadblocks to load the GPU

Global memory:

Maximize throughput (GPU has lots of bandwidth, use it effectively)

Use shared memory when applicable (over 1 TB/s bandwidth)

Use analysis/profiling when optimizing:

“Analysis-driven Optimization” (future session)


25
FUTURE SESSIONS

Atomics, Reductions, Warp Shuffle

Using Managed Memory

Concurrency (streams, copy/compute overlap, multi-GPU)

Analysis Driven Optimization

Cooperative Groups

26
FURTHER STUDY
Optimization in-depth:

http://on-demand.gputechconf.com/gtc/2013/presentations/S3466-Programming-Guidelines-GPU-Architecture.pdf

Analysis-Driven Optimization:

http://on-demand.gputechconf.com/gtc/2012/presentations/S0514-GTC2012-GPU-Performance-Analysis.pdf

CUDA Best Practices Guide:

https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html

CUDA Tuning Guides:

https://docs.nvidia.com/cuda/index.html#programming-guides

(Kepler/Maxwell/Pascal/Volta)
27
HOMEWORK

Log into Summit (ssh [email protected] -> ssh summit)

Clone GitHub repository:

git clone git@github.com:olcf/cuda-training-series.git

Follow the instructions in the readme.md file:

https://github.com/olcf/cuda-training-series/blob/master/exercises/hw4/readme.md

Prerequisites: basic Linux skills (e.g. ls, cd), familiarity with a text editor such as vi/emacs, and some knowledge of C/C++ programming

28
QUESTIONS?
