
04 CUDA Fundamental Optimization

This document discusses optimizations for global memory throughput and shared memory access on NVIDIA GPUs. It reviews the GPU memory hierarchy including registers, shared memory, L1/L2 cache, and global memory. It describes how to optimize global memory loads for coalescing to improve throughput. Non-caching loads are also discussed. Guidelines are provided to fully utilize the memory bandwidth and caches. Shared memory is described as a way to reduce redundant global loads and improve access patterns. Bank conflicts in shared memory can serialize accesses and hurt performance.


CUDA OPTIMIZATION, PART 2
NVIDIA Corporation
OUTLINE

Most concepts in this presentation apply to any language or API on NVIDIA GPUs

Architecture: Kepler/Maxwell/Pascal/Volta

Kernel optimizations

Launch configuration

Part 2 (this session):

Global memory throughput

Shared memory access

2
GLOBAL MEMORY THROUGHPUT
MEMORY HIERARCHY REVIEW

Local storage

Each thread has its own local storage

Typically registers (managed by the compiler)

Shared memory / L1

Program configurable: typically up to 48KB shared (or 64KB, or 96KB…)

Shared memory is accessible by threads in the same threadblock

Very low latency

Very high throughput: >1 TB/s aggregate


4
MEMORY HIERARCHY REVIEW

L2

All accesses to global memory go through L2, including copies to/from CPU host

Global memory

Accessible by all threads as well as host (CPU)

High latency (hundreds of cycles)

Throughput: up to ~900 GB/s (Volta V100)

5
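To make this concrete, here is a minimal sketch (kernel name, sizes, and launch configuration are assumptions, not from the slides) showing global memory allocated on the device, written by every thread, and copied back to the host:

#include <cstdlib>

__global__ void fill(int *data, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        data[idx] = idx;                              // every thread can access global memory
}

int main()
{
    const int n = 1 << 20;
    int *d_data = 0;
    cudaMalloc((void **)&d_data, n * sizeof(int));    // allocation lives in device DRAM
    fill<<<(n + 255) / 256, 256>>>(d_data, n);        // device threads write it directly
    int *h_data = (int *)malloc(n * sizeof(int));
    cudaMemcpy(h_data, d_data, n * sizeof(int),
               cudaMemcpyDeviceToHost);               // host copies also pass through L2 (previous slide)
    cudaFree(d_data);
    free(h_data);
    return 0;
}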
MEMORY ARCHITECTURE
[Block diagram: the host (CPU, DRAM, chipset) connects to the device. Device DRAM holds local, global, constant, and texture memory; each GPU multiprocessor has its own registers and shared memory and reaches DRAM through the L1/L2 cache and the constant and texture caches.]

6
MEMORY HIERARCHY REVIEW

[Diagram: each SM (SM-0 through SM-N) has its own registers, L1, and shared memory (SMEM); all SMs share the L2 cache, which sits in front of global memory.]
7
GMEM OPERATIONS

Loads:

Caching

Default mode

Attempts to hit in L1, then L2, then GMEM

Load granularity is 128-byte line

Stores:

Invalidate L1, write-back for L2

8
GMEM OPERATIONS

Loads:

Non-caching

Compile with the -Xptxas -dlcm=cg option to nvcc

Attempts to hit in L2, then GMEM

Does not hit in L1; invalidates the line if it is already in L1


Load granularity is 32 bytes

We won’t spend much time with non-caching loads in this training session

9
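As a usage sketch (the source file name and -arch value are placeholders; only the -Xptxas -dlcm=cg flag comes from the slide), non-caching loads would be enabled at compile time like this:

nvcc -arch=sm_70 -Xptxas -dlcm=cg -o app kernel.cu

The default caching behavior described on the previous slide corresponds to -dlcm=ca (cache at all levels).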
LOAD OPERATION

Memory operations are issued per warp (32 threads)

Just like all other instructions

Operation:

Threads in a warp provide memory addresses

Determine which lines/segments are needed

Request the needed lines/segments

10
CACHING LOAD
Warp requests 32 aligned, consecutive 4-byte words

Addresses fall within 1 cache-line


Warp needs 128 bytes

128 bytes move across the bus on a miss

Bus utilization: 100%

int c = a[idx];
[Diagram: the warp's addresses, plotted on the memory address axis, all fall within one aligned 128-byte line]
11
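For context, a minimal kernel sketch (kernel and variable names are assumed) in which the load above is perfectly coalesced:

// Consecutive threads read consecutive 4-byte words, so each warp's
// 32 addresses fall in one aligned 128-byte line.
__global__ void read_coalesced(const int *a, int *out, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;   // thread's global index
    if (idx < n) {
        int c = a[idx];                                 // the access pattern from the slide
        out[idx] = c;
    }
}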
CACHING LOAD
Warp requests 32 aligned, permuted 4-byte words

Addresses fall within 1 cache-line


Warp needs 128 bytes

128 bytes move across the bus on a miss

Bus utilization: 100%

int c = a[rand()%warpSize];
[Diagram: the warp's addresses are permuted but still fall within one aligned 128-byte line]
12
CACHING LOAD
Warp requests 32 misaligned, consecutive 4-byte words

Addresses fall within 2 cache-lines


Warp needs 128 bytes

256 bytes move across the bus on misses

Bus utilization: 50%

int c = a[idx-2];
[Diagram: the warp's addresses are shifted off the 128-byte alignment boundary and straddle two cache lines]
13
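A corresponding sketch (names assumed) of the misaligned pattern above; aligning the starting address, for example by padding the allocation, restores one line per warp:

__global__ void read_offset(const int *a, int *out, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= 2 && idx < n)
        out[idx] = a[idx - 2];   // each warp's 128 bytes straddle two lines: ~50% bus utilization
    // Aligning the starting address (e.g. padding the allocation so the first
    // accessed element sits on a 128-byte boundary) avoids the second line.
}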
CACHING LOAD
All threads in a warp request the same 4-byte word

Addresses fall within a single cache-line


Warp needs 4 bytes

128 bytes move across the bus on a miss

Bus utilization: 3.125%

int c = a[40];
[Diagram: all 32 addresses point at the same 4-byte word within one cache line]
14
CACHING LOAD
Warp requests 32 scattered 4-byte words

Addresses fall within N cache-lines


Warp needs 128 bytes

N*128 bytes move across the bus on a miss

Bus utilization: 128 / (N*128) (3.125% worst case N=32)

int c = a[rand()];
[Diagram: the warp's addresses are scattered across N different cache lines]
15
NON-CACHING LOAD
Warp requests 32 scattered 4-byte words

Addresses fall within N segments


Warp needs 128 bytes

N*32 bytes move across the bus on a miss

Bus utilization: 128 / (N*32) (12.5% worst case N = 32)

int c = a[rand()];   (compiled with -Xptxas -dlcm=cg)


[Diagram: the warp's addresses are scattered across N different 32-byte segments]
16
GMEM OPTIMIZATION GUIDELINES
Strive for perfect coalescing

(Align starting address - may require padding)

A warp should access within a contiguous region

Have enough concurrent accesses to saturate the bus

Process several elements per thread

Multiple loads get pipelined

Indexing calculations can often be reused

Launch enough threads to maximize throughput

Latency is hidden by switching threads (warps)

Use all the caches!


17
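A minimal sketch that follows these guidelines (kernel name and parameters are assumed): a grid-stride loop keeps each warp's accesses contiguous, processes several elements per thread, and reuses the index calculation:

__global__ void scale(float *x, float alpha, int n)
{
    // Coalesced: at every iteration, consecutive threads touch consecutive elements.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;   // index computed once...
         i < n;
         i += gridDim.x * blockDim.x)                     // ...then reused each iteration
        x[i] = alpha * x[i];                              // several elements per thread
}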
SHARED MEMORY
SHARED MEMORY

Uses:

Inter-thread communication within a block

Cache data to reduce redundant global memory accesses

Use it to improve global memory access patterns

Organization:
32 banks, each 4 bytes wide

Successive 4-byte words belong to different banks

19
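A minimal sketch of the caching use case (kernel name and block size are assumptions; launch with 256-thread blocks): each element is loaded from global memory once and then reused by a neighboring thread out of shared memory:

#define BLOCK 256

__global__ void adjacent_diff(const float *in, float *out, int n)
{
    __shared__ float tile[BLOCK];
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid < n)
        tile[threadIdx.x] = in[gid];      // one coalesced global load per element
    __syncthreads();                      // make the tile visible to the whole block

    if (gid < n && threadIdx.x > 0)
        out[gid] = tile[threadIdx.x] - tile[threadIdx.x - 1];   // neighbor comes from SMEM
    // (block-boundary elements are skipped to keep the sketch short)
}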
SHARED MEMORY

Performance:

Typically: 4 bytes per bank per 1 or 2 clocks per multiprocessor

Shared memory accesses are issued per warp (32 threads)

Serialization: if N of the 32 threads access different 4-byte words in the same bank, the N accesses are executed serially

Multicast: N threads reading the same word are serviced with one fetch

Could be different bytes within the same word

20
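To make the bank mapping concrete, here is a minimal sketch (names are assumed; launch with 64-thread blocks). For 4-byte elements, word i of a shared array lives in bank i % 32:

__global__ void bank_patterns(float *out)
{
    __shared__ float s[64];
    s[threadIdx.x] = (float)threadIdx.x;      // thread i -> bank i % 32: conflict-free
    __syncthreads();

    float a = s[threadIdx.x];                 // 32 threads of a warp hit 32 different banks
    float b = s[(threadIdx.x * 2) % 64];      // stride 2: two threads per bank -> 2-way conflict
    float c = s[0];                           // same word for every thread -> one fetch (multicast)
    out[threadIdx.x] = a + b + c;
}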
BANK ADDRESSING EXAMPLES
No Bank Conflicts (two examples)

[Diagram: left: linear addressing, thread i accesses bank i; right: a permutation in which each of the 32 threads still accesses a different bank. Both patterns are conflict-free.]

21
BANK ADDRESSING EXAMPLES
2-way Bank Conflicts / 16-way Bank Conflicts

[Diagram: left: two threads map to each bank, giving 2-way conflicts; right: groups of 16 threads map to the same bank, giving 16-way conflicts.]

22
SHARED MEMORY: AVOIDING BANK CONFLICTS
32x32 SMEM array

Warp accesses a column:


32-way bank conflicts (threads in a warp access the same bank)

[Diagram: in a 32x32 array, every element of a column lies in the same bank, so a warp reading a column hits one bank 32 times]
23
SHARED MEMORY: AVOIDING BANK CONFLICTS
Add a column for padding:

32x33 SMEM array

Warp accesses a column:

32 different banks, no bank conflicts


[Diagram: with one padding column (32x33), successive rows of a column shift to successive banks, so a warp reading a column touches all 32 banks]
24
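A minimal sketch of the padding trick (kernel and variable names are assumed; launch with 32-thread blocks, with col in the range 0..31):

__global__ void column_read(float *out, int col)
{
    // Without padding (32x32), element [r][c] sits in bank c % 32, so a whole
    // column lives in one bank and the warp serializes 32 ways.
    // With one padding column (32x33), element [r][c] sits in bank (r + c) % 32,
    // so the 32 rows of a column land in 32 different banks.
    __shared__ float tile[32][33];

    tile[threadIdx.x][col] = (float)threadIdx.x;   // column write: conflict-free thanks to padding
    __syncthreads();

    out[threadIdx.x] = tile[threadIdx.x][col];     // column read: no bank conflicts
}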
SUMMARY

Kernel Launch Configuration:

Launch enough threads per SM to hide latency

Launch enough threadblocks to load the GPU

Global memory:

Maximize throughput (GPU has lots of bandwidth, use it effectively)

Use shared memory when applicable (over 1 TB/s bandwidth)

Use analysis/profiling when optimizing:

“Analysis-driven Optimization” (future session)


25
FUTURE SESSIONS

Atomics, Reductions, Warp Shuffle

Using Managed Memory

Concurrency (streams, copy/compute overlap, multi-GPU)

Analysis Driven Optimization

Cooperative Groups

26
FURTHER STUDY
Optimization in-depth:

http://on-demand.gputechconf.com/gtc/2013/presentations/S3466-Programming-Guidelines-GPU-Architecture.pdf

Analysis-Driven Optimization:

http://on-demand.gputechconf.com/gtc/2012/presentations/S0514-GTC2012-GPU-Performance-Analysis.pdf

CUDA Best Practices Guide:

https://docs.nvidia.com/cuda/cuda-c-best-practices-guide/index.html

CUDA Tuning Guides:

https://docs.nvidia.com/cuda/index.html#programming-guides

(Kepler/Maxwell/Pascal/Volta)
27
HOMEWORK

Log into Summit (ssh [email protected] -> ssh summit)

Clone GitHub repository:

git clone git@github.com:olcf/cuda-training-series.git

Follow the instructions in the readme.md file:

https://github.com/olcf/cuda-training-series/blob/master/exercises/hw4/readme.md

Prerequisites: basic Linux skills (e.g. ls, cd), familiarity with a text editor such as vi/emacs, and some knowledge of C/C++ programming

28
QUESTIONS?
