NCSA02 Fundamental CUDA Optimization
NVIDIA Corporation
Outline
Most concepts in this presentation apply to any language or API on NVIDIA GPUs.
– Fermi/Kepler Architecture
– Kernel optimizations
  – Launch configuration
  – Global memory throughput
  – Shared memory access
  – Instruction throughput / control flow
Warps
[Diagram: grid of thread blocks on the host/device — GPU multiprocessors, each with registers and shared memory, backed by device DRAM]
Launch Configuration
Key to understanding:
– Instructions are issued in order
– A thread stalls when one of its operands isn’t ready:
  – A memory read by itself doesn’t stall execution
– Latency is hidden by switching threads
  – GMEM latency: 400-800 cycles
  – Arithmetic latency: 18-22 cycles
How many threads/threadblocks to launch?
Conclusion: need enough threads to hide latency (see the launch sketch below)
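As a rough illustration of "enough threads", the sketch below (hypothetical kernel and sizes, not from the slides) uses 256-thread blocks and derives the grid size from the problem size, so that each SM has many resident warps to switch between while loads are in flight:

// Minimal launch-configuration sketch (hypothetical kernel and sizes).
// The goal is simply to launch enough threads/blocks to hide latency.
__global__ void scale(float *out, const float *in, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = a * in[i];
}

void launch_scale(float *d_out, const float *d_in, float a, int n)
{
    const int threadsPerBlock = 256;   // a typical starting point
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scale<<<blocks, threadsPerBlock>>>(d_out, d_in, a, n);
}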
[Diagram: memory access path through the L2 cache to global memory]
GMEM Operations
[Diagrams: caching vs. non-caching load examples — warp accesses mapped onto memory addresses 0-448]
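To make the coalescing behaviour behind these diagrams concrete, here is a hedged sketch (hypothetical kernels) contrasting an access pattern in which consecutive threads read consecutive addresses with a strided pattern that scatters a warp's 32 accesses over many more memory segments:

// Coalesced: consecutive threads in a warp read consecutive floats,
// so each warp's request maps to a small number of memory segments.
__global__ void copy_coalesced(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Strided: each thread jumps by `stride` elements, so a warp's 32
// accesses can land in many different segments and waste most of the
// bytes fetched per transaction.
__global__ void copy_strided(float *out, const float *in, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n)
        out[i] = in[i];
}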
Impact of Address Alignment
Warps should access aligned regions for maximum memory throughput
– L1 can help for misaligned loads if several warps are accessing a contiguous region
– ECC further reduces misaligned store throughput significantly
Experiment (see the offset-copy sketch below):
– Copy 16MB of floats
– 256 threads/block
Greatest throughput drop:
– CA (caching) loads: 15%
– CG (non-caching) loads: 32%
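A minimal sketch of the kind of offset-copy kernel such an experiment uses (hypothetical names, not the slides' benchmark code): with offset 0 every warp touches aligned 128-byte segments, while a nonzero offset misaligns the accesses and lowers effective throughput.

__global__ void offset_copy(float *out, const float *in, int n, int offset)
{
    // Shift every access by `offset` floats to break 128-byte alignment.
    int i = blockIdx.x * blockDim.x + threadIdx.x + offset;
    if (i < n)
        out[i] = in[i];
}

Timing this kernel for offsets 0 through 32 with 256-thread blocks reproduces the kind of throughput drop quoted above.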
GMEM Optimization Guidelines
[Diagram: threads 0-31 of a warp accessing shared memory banks Bank 0 through Bank 31]
Shared Memory: Avoiding Bank Conflicts
[Diagram: warp threads 0-31 mapped onto shared memory banks Bank 0 through Bank 31]
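One standard way to avoid the conflicts the diagrams illustrate, sketched here assuming 32 banks of 4-byte words, a 32x32 thread block, and a square matrix whose side is a multiple of 32 (the kernel and tile size are illustrative, not from the slides): pad the shared-memory tile by one column so that a warp walking down a column touches 32 different banks.

#define TILE_DIM 32

// Without the "+ 1" padding, reading tile[threadIdx.x][threadIdx.y]
// down a column would hit the same bank for every thread in the warp
// and serialize the accesses; the extra column shifts each row by one
// bank, making the column walk conflict-free.
__global__ void transpose_tile(float *out, const float *in, int width)
{
    __shared__ float tile[TILE_DIM][TILE_DIM + 1];

    int x = blockIdx.x * TILE_DIM + threadIdx.x;
    int y = blockIdx.y * TILE_DIM + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];

    __syncthreads();

    // Write the transposed tile: reads walk down a column of `tile`.
    int tx = blockIdx.y * TILE_DIM + threadIdx.x;
    int ty = blockIdx.x * TILE_DIM + threadIdx.y;
    out[ty * width + tx] = tile[threadIdx.x][threadIdx.y];
}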
Instruction Throughput & Control Flow
Runtime Math Library and Intrinsics
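The heading above refers to the trade-off between the accurate runtime math functions and the faster, less accurate hardware intrinsics. A small illustrative sketch (hypothetical kernels): __sinf and __expf are the single-precision intrinsics, and compiling with nvcc -use_fast_math substitutes them for the standard calls automatically.

// Runtime math library: full single-precision accuracy, more instructions.
__global__ void slow_math(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = sinf(in[i]) * expf(in[i]);
}

// Hardware intrinsics: faster, fewer instructions, reduced accuracy.
// nvcc -use_fast_math performs this substitution globally.
__global__ void fast_math(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = __sinf(in[i]) * __expf(in[i]);
}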
Default API:
– Kernel launches are asynchronous with CPU
– Memcopies (D2H, H2D) block CPU thread
– CUDA calls are serialized by the driver
Streams and async functions provide:
– Memcopies (D2H, H2D) asynchronous with CPU
– Ability to concurrently execute a kernel and a memcopy
Stream = sequence of operations that execute in issue-order on GPU
– Operations from different streams may be interleaved
– A kernel and memcopy from different streams can be overlapped
Overlap kernel and memory copy
Requirements:
– D2H or H2D memcopy from pinned memory
– Kernel and memcopy in different, non-0 streams
Code:
cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
cudaMemcpyAsync(dst, src, size, dir, stream1);   // async copy in stream1
kernel<<<grid, block, 0, stream2>>>(…);          // kernel in stream2 can overlap the copy
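A more complete, hedged sketch of the same pattern (hypothetical kernel and buffer names; the host buffers must be pinned, e.g. allocated with cudaMallocHost): the work is split into chunks, and each chunk's H2D copy, kernel, and D2H copy are issued into its own stream so transfers for one chunk can overlap computation on another.

// Placeholder kernel so the sketch is self-contained.
__global__ void process(float *out, const float *in, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = 2.0f * in[i];
}

void pipelined(float *h_out, const float *h_in,
               float *d_out, float *d_in, int n)
{
    const int nStreams = 2;
    const int chunk = n / nStreams;          // assume n divisible by nStreams
    cudaStream_t streams[nStreams];
    for (int s = 0; s < nStreams; ++s)
        cudaStreamCreate(&streams[s]);

    for (int s = 0; s < nStreams; ++s) {
        int off = s * chunk;
        cudaMemcpyAsync(d_in + off, h_in + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, streams[s]);
        process<<<(chunk + 255) / 256, 256, 0, streams[s]>>>(d_out + off, d_in + off, chunk);
        cudaMemcpyAsync(h_out + off, d_out + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();

    for (int s = 0; s < nStreams; ++s)
        cudaStreamDestroy(streams[s]);
}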
[Diagram: execution timelines for different issue orders — K1,K2,M1,M2; K1,M1,M2; K1,M2,M1; K1,M2,M2 (K: kernel, M: memcopy, integer: stream ID)]
More on Dual Copy
[Diagram: dual-socket system — CPU-0 and CPU-1 connected by QPI (6.4 GT/s, 25.6 GB/s); GPU-0 attached over PCIe x16 (16 GB/s)]
Duplex Copy: Experimental Results
[Diagram: duplex-copy experiment on the same topology — CPU-0/CPU-1 over QPI (25.6 GB/s), GPU-0 over PCIe x16 (16 GB/s)]
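Duplex copy drives both directions of the PCIe link at once, which requires a GPU with two copy engines (e.g. Tesla-class Fermi parts). A minimal hedged sketch, with hypothetical buffer names and pinned host memory assumed:

void duplex_copy(float *d_a, const float *h_a,
                 float *h_b, const float *d_b, size_t bytes)
{
    // Issue an H2D and a D2H transfer in different streams; with two
    // copy engines they can run simultaneously in opposite directions.
    cudaStream_t up, down;
    cudaStreamCreate(&up);
    cudaStreamCreate(&down);

    cudaMemcpyAsync(d_a, h_a, bytes, cudaMemcpyHostToDevice, up);
    cudaMemcpyAsync(h_b, d_b, bytes, cudaMemcpyDeviceToHost, down);

    cudaDeviceSynchronize();
    cudaStreamDestroy(up);
    cudaStreamDestroy(down);
}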
Unified Virtual Addressing
Easier to Program with Single Address Space
[Diagram: host and GPU memories combined into a single virtual address space across PCI-e]
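With UVA (64-bit platforms, Fermi-class GPUs, CUDA 4.0 and later) the runtime can tell from a pointer alone which memory it refers to, so copies can pass cudaMemcpyDefault instead of an explicit direction. A minimal sketch with hypothetical buffer names:

void uva_copy_example()
{
    const size_t bytes = 1 << 20;
    float *h_buf, *d_buf;
    cudaMallocHost(&h_buf, bytes);   // pinned host memory (part of the UVA space)
    cudaMalloc(&d_buf, bytes);

    // Direction is inferred from the pointers: no H2D/D2H constant needed.
    cudaMemcpy(d_buf, h_buf, bytes, cudaMemcpyDefault);
    cudaMemcpy(h_buf, d_buf, bytes, cudaMemcpyDefault);

    cudaFree(d_buf);
    cudaFreeHost(h_buf);
}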
Summary