CUDA Memory
In every iteration of the inner loop, one global memory access is performed for one floating-point addition.
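For reference, the inner loop in question presumably looks like this (a sketch assuming the canonical matrix multiplication kernel; Row, Col, and Width are the usual names):

    float Pvalue = 0;
    for (int k = 0; k < Width; ++k) {
        // Each iteration reads M and N from global memory to feed one
        // multiply and one add: roughly one global access per
        // floating-point operation.
        Pvalue += M[Row * Width + k] * N[k * Width + Col];
    }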
Compute-to-Global Memory Access Ratio
GPU Performance
CUDA Memories
[Figure: CUDA device memory model: a grid of thread blocks with per-thread registers, per-block shared memory, and device-wide global and constant memory]
Registers
• Lifetime of a thread
[Figure: registers are private to each thread in the grid; global memory sits below the grid, accessible from the host]
Local Variables
Image Blurring Kernel
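A sketch of the kind of blur kernel this slide refers to (assuming a grayscale unsigned char image and a box filter of radius BLUR_SIZE; all names are illustrative):

#define BLUR_SIZE 1  // radius 1 gives a 3x3 averaging patch

__global__ void blurKernel(unsigned char *in, unsigned char *out, int w, int h) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < w && row < h) {
        int pixVal = 0;
        int pixels = 0;
        // Average the valid neighbors; the bounds checks handle
        // pixels near the image edges.
        for (int r = -BLUR_SIZE; r <= BLUR_SIZE; ++r) {
            for (int c = -BLUR_SIZE; c <= BLUR_SIZE; ++c) {
                int curRow = row + r, curCol = col + c;
                if (curRow >= 0 && curRow < h && curCol >= 0 && curCol < w) {
                    pixVal += in[curRow * w + curCol];
                    ++pixels;
                }
            }
        }
        out[row * w + col] = (unsigned char)(pixVal / pixels);
    }
}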
Shared Memory
• Restricted to a block
[Figure: Thread (0,0) and Thread (1,0) in each of two blocks accessing their own block's shared memory within the grid]
Shared Memory in CUDA
Shared Memory in Fermi
Shared Variables
Shared Variable Example
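A minimal sketch of a shared variable in use (TILE_WIDTH and the staging pattern are illustrative; assumes width is a multiple of TILE_WIDTH):

#define TILE_WIDTH 16

__global__ void stageTile(float *in, float *out, int width) {
    // One copy per block; visible to all threads of the block,
    // lifetime of the kernel.
    __shared__ float tile[TILE_WIDTH][TILE_WIDTH];

    int col = blockIdx.x * TILE_WIDTH + threadIdx.x;
    int row = blockIdx.y * TILE_WIDTH + threadIdx.y;

    tile[threadIdx.y][threadIdx.x] = in[row * width + col];  // cooperative load
    __syncthreads();  // every load must finish before any thread reads the tile

    // Trivial use: copy back out (a real kernel would reuse the tile).
    out[row * width + col] = tile[threadIdx.y][threadIdx.x];
}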
Global Memory
• Potential for traffic congestion
[Figure: all blocks in the grid contending for the same global and constant memory]
Global Variables
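A sketch of the two common forms (names are illustrative): a statically declared __device__ variable, set from the host with cudaMemcpyToSymbol, and dynamically allocated global memory via cudaMalloc:

#include <cuda_runtime.h>

__device__ float devScale;  // global scope, lifetime of the application

void setup(float scale, float **d_A, int n) {
    // Host writes the statically declared device global.
    cudaMemcpyToSymbol(devScale, &scale, sizeof(float));
    // Dynamic allocation in global memory: the common case,
    // released with cudaFree when no longer needed.
    cudaMalloc((void **)d_A, n * sizeof(float));
}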
Vector Addition Host Code
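This is likely close to the host code the slide shows: allocate device global memory, copy the inputs over, launch, copy the result back, and free:

#include <cuda_runtime.h>

__global__ void vecAddKernel(float *A, float *B, float *C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) C[i] = A[i] + B[i];
}

void vecAdd(float *h_A, float *h_B, float *h_C, int n) {
    int size = n * sizeof(float);
    float *d_A, *d_B, *d_C;

    cudaMalloc((void **)&d_A, size);
    cudaMalloc((void **)&d_B, size);
    cudaMalloc((void **)&d_C, size);

    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // Enough 256-thread blocks to cover all n elements.
    vecAddKernel<<<(n + 255) / 256, 256>>>(d_A, d_B, d_C, n);

    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
}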
Hardware View of CUDA Memories
[Figure: a processor (SM) containing a processing unit (register file and ALU), a control unit (PC and IR), and on-chip shared memory]
Tiling for Reduced Memory Traffic
Matrix Multiplication
A Basic Matrix Multiplication Kernel
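A sketch of the basic kernel: one thread computes one element of P, and every operand is read straight from global memory (square Width x Width matrices assumed):

__global__ void MatrixMulKernel(float *M, float *N, float *P, int Width) {
    int Row = blockIdx.y * blockDim.y + threadIdx.y;
    int Col = blockIdx.x * blockDim.x + threadIdx.x;

    if (Row < Width && Col < Width) {
        float Pvalue = 0;
        for (int k = 0; k < Width; ++k)
            Pvalue += M[Row * Width + k] * N[k * Width + Col];
        P[Row * Width + Col] = Pvalue;
    }
}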
4x4 P: Thread to Data Mapping
[Figure: a 4x4 P matrix divided among four 2x2 thread blocks, e.g. Block(1,0) and Block(1,1)]
Global Memory Accesses Performed by Threads in Block(0,0)
Global Memory Access Pattern
[Figure: Thread 1, Thread 2, … each reading its operands directly from global memory]
Tiling
[Figure: Thread 1, Thread 2, … reading from on-chip memory after data is staged there from global memory]
Basic Concept of Tiling
Carpools Need Synchronization
Tiling Requires Synchronization Among Threads
[Figure: two timelines of Thread 1 and Thread 2, contrasting synchronized and unsynchronized progress through the phases]
Tiling
Outline of Tiling
A Basic Matrix Multiplication Kernel
4x4 P: Thread to Data Mapping
Calculation of P0,0 and P0,1
[Figure: the first two columns of N (N0,0 through N3,1), read by the threads computing P0,0 and P0,1]
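Written out for the 4x4 example, the two results expand to:

P0,0 = M0,0·N0,0 + M0,1·N1,0 + M0,2·N2,0 + M0,3·N3,0
P0,1 = M0,0·N0,1 + M0,1·N1,1 + M0,2·N2,1 + M0,3·N3,1

Both threads read the same row M0,0…M0,3 from global memory, so without tiling that row is fetched twice. This shared access pattern is the redundancy tiling removes.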
Tiled Matrix Multiplication
• Data accesses performed by the thread block in each phase are focused on one tile of M and one tile of N.
• The tile is BLOCK_WIDTH elements in each dimension.
[Figure: WIDTH x WIDTH matrices M, N, and P; the thread at (Row, Col) steps through BLOCK_WIDTH-sized tiles of a row of M and a column of N]
2x2 Tiles
Phase 0 Load for Block (0,0)
Phase 0 Use for Block (0,0) (iteration 0)
Phase 0 Use for Block (0,0) (iteration 1)
Phase 1 Load for Block (0,0)
Phase 1 Use for Block (0,0) (iteration 0)
Phase 1 Use for Block (0,0) (iteration 1)
Execution Phases
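For the 4x4 matrices and 2x2 tiles used in this example, Block(0,0) executes in two phases (a reconstruction of the walkthrough above):

• Phase 0: the four threads cooperatively load M0,0, M0,1, M1,0, M1,1 and N0,0, N0,1, N1,0, N1,1 into shared memory, then each accumulates the first two terms of its dot product from the tiles.
• Phase 1: the same threads load M0,2, M0,3, M1,2, M1,3 and N2,0, N2,1, N3,0, N3,1, then accumulate the remaining two terms.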
Data in Shared Memory
Barrier Synchronization
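__syncthreads() is a block-wide barrier: no thread in a block moves past it until every thread in the block has reached it. A minimal self-contained illustration, separate from the matrix kernel (assumes the array length is a multiple of 256; names are illustrative):

__global__ void blockReverse(float *d) {
    __shared__ float s[256];
    int t = threadIdx.x;
    int base = blockIdx.x * 256;
    s[t] = d[base + t];        // write phase: each thread fills one slot
    __syncthreads();           // barrier: beyond this point every slot is filled
    d[base + t] = s[255 - t];  // read phase: safe to read any other thread's slot
}

In the tiled kernel, the same pattern separates the cooperative tile load from the tile use, and the tile use from the next phase's reload.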
Use 1D Indexing
• M[Row][p*TILE_WIDTH+tx]  →  M[Row*Width + p*TILE_WIDTH + tx]
• N[p*TILE_WIDTH+ty][Col]  →  N[(p*TILE_WIDTH+ty)*Width + Col]
Tiled Matrix Multiplication Kernel
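A sketch of the tiled kernel assembled from the pieces above (the canonical form; assumes Width is a multiple of TILE_WIDTH, a restriction lifted under "Handling Matrix of Arbitrary Size" below):

#define TILE_WIDTH 16

__global__ void TiledMatrixMulKernel(float *M, float *N, float *P, int Width) {
    __shared__ float ds_M[TILE_WIDTH][TILE_WIDTH];
    __shared__ float ds_N[TILE_WIDTH][TILE_WIDTH];

    int tx = threadIdx.x, ty = threadIdx.y;
    int Row = blockIdx.y * TILE_WIDTH + ty;
    int Col = blockIdx.x * TILE_WIDTH + tx;

    float Pvalue = 0;
    // Each phase stages one tile of M and one tile of N in shared memory.
    for (int p = 0; p < Width / TILE_WIDTH; ++p) {
        ds_M[ty][tx] = M[Row * Width + p * TILE_WIDTH + tx];
        ds_N[ty][tx] = N[(p * TILE_WIDTH + ty) * Width + Col];
        __syncthreads();  // wait until both tiles are fully loaded

        for (int i = 0; i < TILE_WIDTH; ++i)
            Pvalue += ds_M[ty][i] * ds_N[i][tx];
        __syncthreads();  // wait until the tiles are no longer needed
    }
    P[Row * Width + Col] = Pvalue;
}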
Before-After
Tile (Thread Block) Size Considerations
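A worked example of the arithmetic behind the choice (standard for this kernel): with TILE_WIDTH = 16, a block has 16 x 16 = 256 threads. Each phase loads 2 x 256 = 512 floats from global memory and performs 256 x (2 x 16) = 8,192 floating-point operations, so every loaded value is reused 16 times. In general, TILE_WIDTH x TILE_WIDTH tiling cuts global memory traffic to 1/TILE_WIDTH of the untiled kernel.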
Shared Memory
Handling Matrix of Arbitrary Size
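A sketch of the phase loop of the kernel above with boundary checks added, so Width need not be a multiple of TILE_WIDTH; out-of-range tile slots are filled with 0.0f, which contributes nothing to the dot product:

    for (int p = 0; p < (Width + TILE_WIDTH - 1) / TILE_WIDTH; ++p) {
        ds_M[ty][tx] = (Row < Width && p * TILE_WIDTH + tx < Width)
                     ? M[Row * Width + p * TILE_WIDTH + tx] : 0.0f;
        ds_N[ty][tx] = (p * TILE_WIDTH + ty < Width && Col < Width)
                     ? N[(p * TILE_WIDTH + ty) * Width + Col] : 0.0f;
        __syncthreads();

        for (int i = 0; i < TILE_WIDTH; ++i)
            Pvalue += ds_M[ty][i] * ds_N[i][tx];
        __syncthreads();
    }
    if (Row < Width && Col < Width)
        P[Row * Width + Col] = Pvalue;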
NVIDIA Nsight Compute