Module 4.1 - Memory and Data Locality: GPU Teaching Kit
Accelerated Computing
Review: Image Blur Kernel.
How about performance on a GPU?
– All threads access global memory for their input matrix elements
– One memory access (4 bytes) per floating-point addition
– 4 bytes of memory bandwidth needed per FLOP
– Assume a GPU with
– Peak floating-point rate of 1,500 GFLOPS with 200 GB/s DRAM bandwidth
– 4*1,500 = 6,000 GB/s required to achieve the peak FLOPS rating
– The 200 GB/s memory bandwidth limits execution to 50 GFLOPS
Example – Matrix Multiplication
[Figure: square WIDTH × WIDTH matrices M, N, and P; the thread at (Row, Col) computes one P element from a row of M and a column of N, with P covered by BLOCK_WIDTH × BLOCK_WIDTH thread blocks.]
A Basic Matrix Multiplication
__global__ void MatrixMulKernel(float* M, float* N, float* P, int Width) {
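  // Completion sketch (the rest of the slide's code was lost in extraction):
  // one thread computes one P element, assuming square Width x Width matrices
  // stored in row-major order.
  int Row = blockIdx.y * blockDim.y + threadIdx.y;
  int Col = blockIdx.x * blockDim.x + threadIdx.x;
  if ((Row < Width) && (Col < Width)) {
    float Pvalue = 0;
    // each thread accumulates the inner product of one row of M and one column of N
    for (int k = 0; k < Width; ++k) {
      Pvalue += M[Row*Width + k] * N[k*Width + Col];
    }
    P[Row*Width + Col] = Pvalue;
  }
}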
A Toy Example: Thread to P Data Mapping
[Figure: a 4×4 P matrix (P0,0 .. P3,3) covered by four 2×2 thread blocks with BLOCK_WIDTH = 2; each Thread(ty,tx) of Block(by,bx) computes one P element.]
Calculation of P0,0 and P0,1
[Figure: the first two columns of N (N0,0 .. N3,1), which are combined with a row of M to produce P0,0 and P0,1.]
Memory and Registers in the Von-Neumann Model
[Figure: the von Neumann model - Memory and I/O connected to a Processing Unit (ALU and Register File) and a Control Unit (PC and IR).]
Programmer View of CUDA Memories
[Figure: the CUDA memory model as seen by the programmer - per-thread registers, per-block shared memory, and grid-wide global and constant memory; the host transfers data to and from global and constant memory.]
Declaring CUDA Variables
Variable declaration                        Memory     Scope    Lifetime
int LocalVar;                               register   thread   thread
__device__ __shared__ int SharedVar;        shared     block    block
__device__ int GlobalVar;                   global     grid     application
__device__ __constant__ int ConstantVar;    constant   grid     application
Example: Shared Memory Variable Declaration
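The slide's code did not survive extraction; below is a minimal sketch of a shared memory declaration inside a kernel (the kernel and variable names are illustrative, and TILE_WIDTH is assumed to be a compile-time constant):

#define TILE_WIDTH 16

__global__ void SharedMemExample(float* in, float* out, int Width)
{
    // one ds_in array is allocated per thread block and shared by all of its threads
    __shared__ float ds_in[TILE_WIDTH][TILE_WIDTH];

    int Row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    int Col = blockIdx.x * TILE_WIDTH + threadIdx.x;

    if (Row < Width && Col < Width)
        ds_in[threadIdx.y][threadIdx.x] = in[Row * Width + Col];  // stage data in shared memory
    __syncthreads();  // wait until the whole tile is loaded

    if (Row < Width && Col < Width)
        out[Row * Width + Col] = ds_in[threadIdx.y][threadIdx.x];
}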
Where to Declare Variables?
– Can the host access it? Yes: global or constant, declared outside of any function
– No: register or shared, declared in the kernel
Shared Memory in CUDA
– A special type of memory whose contents are explicitly defined and
used in the kernel source code
– One in each SM
– Accessed at much higher speed (in both latency and throughput) than global
memory
– Scope of access and sharing - thread blocks
– Lifetime – thread block; contents disappear after the corresponding thread
block terminates execution
– Accessed by memory load/store instructions
– A form of scratchpad memory in computer architecture
Hardware View of CUDA Memories
[Figure: hardware view - each streaming multiprocessor (SM) contains a Processing Unit with ALU, Register File, and Shared Memory, plus a Control Unit with PC and IR.]
Global Memory Access Pattern
of the Basic Matrix Multiplication Kernel
[Figure: in the basic kernel, Thread 1 and Thread 2 each fetch every input element directly from global memory.]
Tiling/Blocking - Basic Idea
[Figure: with tiling, data is first staged from global memory into on-chip memory, and Thread 1 and Thread 2 then read it from the on-chip copy.]
Basic Concept of Tiling
– In a congested traffic system, significantly reducing the number of vehicles
can greatly improve the delay seen by all vehicles
– Carpooling for commuters
– Tiling for global memory accesses
– drivers = threads accessing their memory data operands
– cars = memory access requests
Some Computations are More Challenging to Tile
– Some carpools may be easier than others
– Carpool participants need to have similar work schedules
– Some vehicles may be more suitable for carpooling
– Similar challenges exist in tiling
Carpools need synchronization.
– Good: when people have similar schedules
Carpools need synchronization.
– Bad: when people have very different schedules
Same with Tiling
– Good: when threads have similar access timing
[Figure: access timelines for Thread 1 and Thread 2 - accesses that arrive at similar times can be served from the same tile.]
Outline of Tiling Technique
– Identify a tile of global memory contents that are accessed by
multiple threads
– Load the tile from global memory into on-chip memory
– Use barrier synchronization to make sure that all threads are ready
to start the phase
– Have the multiple threads access their data from the on-chip
memory
– Use barrier synchronization to make sure that all threads have
completed the current phase
– Move on to the next tile
Matrix Multiplication
– Data access pattern
– Each thread – a row of M and a column of N
– Each thread block – a strip of M and a strip of N
[Figure: WIDTH × WIDTH matrices M, N, and P; the thread computing P(Row, Col) reads row Row of M and column Col of N.]
Tiled Matrix Multiplication
– Break up the execution of each thread into phases
– so that the data accesses by the thread block in each phase are
focused on one tile of M and one tile of N
– The tile is BLOCK_SIZE elements in each dimension
[Figure: M, N, and P (WIDTH × WIDTH) partitioned into BLOCK_WIDTH × BLOCK_WIDTH tiles; the block computing one tile of P steps through the corresponding tiles of M and N.]
Loading a Tile
– All threads in a block participate
– Each thread loads one M element and one N element in tiled code
Phase 0 Load for Block (0,0)
Phase 0 Use for Block (0,0) (iteration 0)
Phase 0 Use for Block (0,0) (iteration 1)
Phase 1 Load for Block (0,0)
Phase 1 Use for Block (0,0) (iteration 0)
Phase 1 Use for Block (0,0) (iteration 1)
Execution Phases of Toy Example
Execution Phases of Toy Example (cont.)
– All threads in the same block must reach the __syncthreads() before
any of them can move on
Loading Input Tile 0 of M (Phase 0)
– Have each thread load an M element and an N element at the same
relative position as its P element.

int Row = by * blockDim.y + ty;
int Col = bx * blockDim.x + tx;

2D indexing for accessing Tile 0:
M[Row][tx]
N[ty][Col]

[Figure: M, N, and P (WIDTH × WIDTH) with TILE_WIDTH × TILE_WIDTH tiles; Row and Col locate the thread's P element.]
Loading Input Tile 0 of N (Phase 0)
– Same loading rule and indexing as on the previous slide; the highlighted access here is Tile 0 of N: N[ty][Col].
[Figure: M, N, and P with Tile 0 of N highlighted.]
Loading Input Tile 1 of M (Phase 1)
M[Row][1*TILE_WIDTH + tx]
N[1*TILE_WIDTH + ty][Col]

[Figure: M, N, and P with the Phase 1 (second) tile of M highlighted.]
Loading Input Tile 1 of N (Phase 1)
M[Row][1*TILE_WIDTH + tx]
N[1*TILE_WIDTH + ty][Col]

[Figure: M, N, and P with the Phase 1 (second) tile of N highlighted.]
M and N are dynamically allocated - use 1D indexing
M[Row][p*TILE_WIDTH+tx] → M[Row*Width + p*TILE_WIDTH + tx]
N[p*TILE_WIDTH+ty][Col] → N[(p*TILE_WIDTH+ty)*Width + Col]
Tiled Matrix Multiplication Kernel
__global__ void MatrixMulKernel(float* M, float* N, float* P, int Width)
{
__shared__ float ds_M[TILE_WIDTH][TILE_WIDTH];
__shared__ float ds_N[TILE_WIDTH][TILE_WIDTH];
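  // Completion sketch (the rest of the slide's code was lost in extraction):
  // the standard tiled structure described above, assuming Width is a
  // multiple of TILE_WIDTH.
  int bx = blockIdx.x;  int by = blockIdx.y;
  int tx = threadIdx.x; int ty = threadIdx.y;

  int Row = by * TILE_WIDTH + ty;
  int Col = bx * TILE_WIDTH + tx;
  float Pvalue = 0;

  // Loop over the M and N tiles required to compute the P element
  for (int p = 0; p < Width/TILE_WIDTH; ++p) {
    // Collaborative loading of one M tile and one N tile into shared memory
    ds_M[ty][tx] = M[Row*Width + p*TILE_WIDTH + tx];
    ds_N[ty][tx] = N[(p*TILE_WIDTH + ty)*Width + Col];
    __syncthreads();

    for (int i = 0; i < TILE_WIDTH; ++i)
      Pvalue += ds_M[ty][i] * ds_N[i][tx];
    __syncthreads();
  }
  P[Row*Width + Col] = Pvalue;
}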
Tile (Thread Block) Size Considerations
– Each thread block should have many threads
– TILE_WIDTH of 16 gives 16*16 = 256 threads
– TILE_WIDTH of 32 gives 32*32 = 1024 threads
– For 16, in each phase, each block performs 2*256 = 512 float
loads from global memory for 256 * (2*16) = 8,192 mul/add
operations. (16 floating-point operations for each memory load)
– For 32, in each phase, each block performs 2*1024 = 2,048 float
loads from global memory for 1024 * (2*32) = 65,536 mul/add
operations. (32 floating-point operations for each memory load)
Shared Memory and Threading
– For an SM with 16KB shared memory
– Shared memory size is implementation dependent! (see the run-time query sketch after this list)
– For TILE_WIDTH = 16, each thread block uses 2*256*4B = 2KB of shared
memory.
– For 16KB shared memory, one can potentially have up to 8 thread blocks
executing
– This allows up to 8*512 = 4,096 pending loads. (2 per thread, 256 threads per block)
– The next TILE_WIDTH 32 would lead to 2*32*32*4 Byte= 8K Byte shared
memory usage per thread block, allowing 2 thread blocks active at the same time
– However, in a GPU where the thread count is limited to 1536 threads per SM,
the number of blocks per SM is reduced to one!
– Each __syncthreads() can reduce the number of active threads for a
block
– More thread blocks can be advantageous
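Because the shared memory size is implementation dependent, a portable program can query it at run time. A minimal sketch using the standard CUDA runtime API (variable names are illustrative; TILE_WIDTH is the tile size constant used by the kernel above):

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);                        // properties of device 0
size_t smemPerBlock = prop.sharedMemPerBlock;             // shared memory available to one block
size_t smemPerTileBlock = 2 * TILE_WIDTH * TILE_WIDTH * sizeof(float);  // ds_M + ds_N
int blocksBySmem = (int)(smemPerBlock / smemPerTileBlock);  // shared-memory-limited block count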
Warps as Scheduling Units
[Figure: each thread block is partitioned into warps of 32 threads (t0 .. t31); the warps of Blocks 1, 2, and 3 are the units scheduled on the SM.]
Warps in Multi-dimensional Thread Blocks
– The thread blocks are first linearized into 1D in row major order
– In x-dimension first, y-dimension next, and z-dimension last
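For example (not on the slide), the linearized index of a thread in a 3D block, and the warp it falls into, can be computed as:

__device__ int linearThreadIdx()
{
    // Row-major linearization: x varies fastest, then y, then z.
    return threadIdx.x
         + threadIdx.y * blockDim.x
         + threadIdx.z * blockDim.x * blockDim.y;
}
// The warp index within the block is then linearThreadIdx() / 32.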
Blocks are partitioned after linearization
– Linearized thread blocks are partitioned
– Thread indices within a warp are consecutive and increasing
– Warp 0 starts with Thread 0
SMs are SIMD Processors
– Control unit for instruction fetch, decode, and control is shared
among multiple processing units
– Control overhead is minimized (Module 1)
[Figure: an SM as a SIMD processor - one Control Unit (PC, IR) drives multiple Processing Units, each with an ALU and Register File, alongside Shared Memory; Memory and I/O sit outside the processor.]
SIMD Execution Among Threads in a Warp
– All threads in a warp must execute the same instruction
at any point in time
Control Divergence
– Control divergence occurs when threads in a warp take
different control flow paths by making different control
decisions
– Some take the then-path and others take the else-path of an if-
statement
– Some threads take a different number of loop iterations than others
Control Divergence Examples
– Divergence can arise when branch or loop
condition is a function of thread indices
– Example kernel statement with divergence:
– if (threadIdx.x > 2) { }
– This creates two different control paths for threads in a block
– Decision granularity < warp size; threads 0, 1, and 2 follow a
different path than the rest of the threads in the first warp
– Example without divergence:
– if (blockIdx.x > 2) { }
– Decision granularity is a multiple of block size; all threads in
any given warp follow the same path
Example: Vector Addition Kernel
Device Code
// Compute vector sum C = A + B
// Each thread performs one pair-wise addition
__global__
void vecAddKernel(float* A, float* B, float* C,
int n)
{
int i = threadIdx.x + blockDim.x * blockIdx.x;
if(i<n) C[i] = A[i] + B[i];
}
Analysis for vector size of 1,000 elements
– Assume that block size is 256 threads
– 8 warps in each block
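The remainder of this slide's analysis did not survive extraction; working it out from the numbers above and the i < n check in the kernel:
– ceil(1,000/256) = 4 blocks are launched, i.e., 4*8 = 32 warps in total
– The first 31 warps cover i = 0 .. 991, all of which satisfy i < 1,000, so they have no divergence
– Only the last warp (i = 992 .. 1,023) mixes active and inactive threads, so 1 of the 32 warps has control divergence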
Performance Impact of Control Divergence
– Boundary condition checks are vital for complete functionality and
robustness of parallel code
– The tiled matrix multiplication kernel has many boundary condition checks
– The concern is that these checks may cause significant performance degradation
– For example, see the tile loading code below:
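The tile loading code referred to above did not survive extraction; a sketch of the usual boundary-checked load of one M tile element, in the style of the tiled kernel earlier (p is the phase index), is:

    // threads mapped outside the valid range load 0 so the inner product is unaffected
    if (Row < Width && p*TILE_WIDTH + tx < Width)
        ds_M[ty][tx] = M[Row*Width + p*TILE_WIDTH + tx];
    else
        ds_M[ty][tx] = 0.0f;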
Two types of blocks in loading M Tiles
– 1. Blocks whose tiles are all within the valid range until the last phase
– 2. Blocks whose tiles are partially outside the valid range all the way through
[Figure: the M matrix divided into TILE_WIDTH-tall strips - Type 1 blocks load interior strips, Type 2 blocks load the bottom boundary strip.]
Analysis of Control Divergence Impact
– Assume 16x16 tiles and thread blocks
– Each thread block has 8 warps (256/32)
– Assume square matrices of 100x100
– Each thread will go through 7 phases (ceiling of 100/16)
Control Divergence in Loading M Tiles
– Assume 16x16 tiles and thread blocks
– Each thread block has 8 warps (256/32)
– Assume square matrices of 100x100
– Each warp will go through 7 phases (ceiling of 100/16)
– There are 42 (6*7) Type 1 blocks, with a total of 336 (8*42) warps
– They all have 7 phases, so there are 2,352 (336*7) warp-phases
– The warps have control divergence only in their last phase
– 336 warp-phases have control divergence
Control Divergence in Loading M Tiles (Type 2)
– Type 2: the 7 blocks assigned to load the bottom tiles, with a total of
56 (8*7) warps
– They all have 7 phases, so there are 392 (56*7) warp-phases
– The first 2 warps in each Type 2 block will stay within the valid range
until the last phase
– The 6 remaining warps stay outside the valid range
– So, only 14 (2 warps * 7 blocks) warp-phases have control divergence
Overall Impact of Control Divergence
– Type 1 Blocks: 336 out of 2,352 warp-phases have control
divergence
– Type 2 Blocks: 14 out of 392 warp-phases have control divergence
– The performance impact is therefore expected to be modest: only 350 of the
2,744 warp-phases have control divergence ((336+14)/(2,352+392), about 13%)
[Figure: the M matrix with Type 1 (interior) and Type 2 (bottom TILE_WIDTH strip) blocks marked.]
Additional Comments
– The calculation of impact of control divergence in loading N tiles is
somewhat different and is left as an exercise
– The fact that a kernel is full of control flow constructs does not mean
that there will be heavy occurrence of control divergence
Global Memory (DRAM) Bandwidth
– Ideal
– Reality
DRAM Core Array Organization
– Each DRAM core array has about 16M bits
[Figure: a DRAM core array feeding Sense Amps and Column Latches; a wide internal path goes through an address-selected Column Mux to a narrow off-chip pin interface.]
A very small (8x2-bit) DRAM Core Array
[Figure: an 8x2-bit core array - a row decoder selects one row, the sense amps read it out, and a mux selects the addressed bits for output.]
DRAM Core Arrays are Slow
– Reading from a cell in the core array is a very slow process
– DDR: Core speed = ½ interface speed
– DDR2/GDDR3: Core speed = ¼ interface speed
– DDR3/GDDR4: Core speed = ⅛ interface speed
– … likely to be worse in the future
DRAM Bursting
– For DDR{2,3} SDRAM cores clocked at 1/N speed of the interface:
– Load (N × interface width) of DRAM bits from the same row at once to an internal
buffer, then transfer in N steps at interface speed
– DDR3/GDDR4: buffer width = 8X interface width
DRAM Bursting Timing Example
[Figure: timing comparison - non-burst timing repeats the full core-array access delay (address to decoder, core array access) for every data item, while burst timing pays the delay once and then streams the buffered bits over the interface.]
Multiple DRAM Banks
[Figure: two DRAM banks (Bank 0 and Bank 1), each with its own decoder, core array, and mux, sharing the same interface.]
DRAM Bursting with Banking
GPU off-chip memory subsystem
– NVIDIA GTX280 GPU:
– Peak global memory bandwidth = 141.7GB/s
DRAM Burst – A System View
Memory Coalescing
[Figure: threads of a warp accessing consecutive locations 0-15; the hardware coalesces these accesses into a single DRAM burst.]
Un-coalesced Accesses
[Figure: the same locations 0-15 accessed in a pattern that does not map consecutive threads to consecutive locations, so the accesses cannot be coalesced into a single burst.]
How to judge if an access is coalesced?
– Accesses in a warp are to consecutive locations if the index in an
array access is in the form of
– A[(expression with terms independent of threadIdx.x) + threadIdx.x];
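As an illustration (not from the slides; the kernel and parameter names are made up), the two loads below differ only in which index varies with threadIdx.x:

__global__ void accessPatterns(const float* A, float* out, int Width, int base)
{
    // Coalesced: consecutive threads in a warp read consecutive addresses.
    float coalesced = A[base * Width + threadIdx.x];

    // Not coalesced: consecutive threads read addresses a full row (Width elements) apart.
    float strided = A[threadIdx.x * Width + base];

    out[blockIdx.x * blockDim.x + threadIdx.x] = coalesced + strided;
}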
A 2D C Array in Linear Memory Space
Two Access Patterns of Basic Matrix Multiplication
[Figure: matrices A (m × n) and B (n × k); Thread 1 and Thread 2 traverse a row of A and a column of B.]
A[Row*n + i]    B[i*k + Col]
i is the loop counter in the inner-product loop of the kernel code
A is m × n, B is n × k
Col = blockIdx.x*blockDim.x + threadIdx.x
B accesses are coalesced
B0,0 B0,1 B0,2 B0,3 B1,0 B1,1 B1,2 B1,3 B2,0 B2,1 B2,2 B2,3 B3,0 B3,1 B3,2 B3,3
[Figure: in each iteration of the inner-product loop, consecutive threads read consecutive elements of one row of B in this linear layout, so the accesses coalesce.]
A Accesses are Not Coalesced
A0,0 A0,1 A0,2 A0,3 A1,0 A1,1 A1,2 A1,3 A2,0 A2,1 A2,2 A2,3 A3,0 A3,1 A3,2 A3,3
[Figure: in load iteration 0, threads T0-T3 read A0,0, A1,0, A2,0, A3,0, and in iteration 1 they read A0,1, A1,1, A2,1, A3,1 - elements a full row apart in this linear layout, so the accesses do not coalesce.]
Loading an Input Tile
[Figure: loading an input tile of A for an output tile of C; m, n, k, and WIDTH mark the tile's position within the matrices.]
Corner Turning
[Figure: corner turning - in the original access pattern, d_M and d_N are read directly from global memory; in the tiled access pattern, the data is first copied into shared memory and the multiplication is then performed with the shared memory values.]
CPU-GPU Data Transfer using DMA
– DMA (Direct Memory Access) hardware is used by cudaMemcpy() for
better efficiency
– Frees CPU for other tasks
– Hardware unit specialized to transfer a number of bytes requested by OS
– Between physical memory address space regions (some can be mapped I/O memory
locations)
– Uses system interconnect, typically PCIe in today’s systems
[Figure: the CPU and its memory connected over PCIe to the GPU card (or other I/O cards), where a DMA engine moves data into and out of global memory.]
Virtual Memory Management
Data Transfer and Virtual Memory
– DMA uses physical addresses
– When cudaMemcpy() copies an array, it is implemented as one or more DMA
transfers
– Address is translated and page presence checked for the entire source and
destination regions at the beginning of each DMA transfer
– No address translation for the rest of the same DMA transfer so that high efficiency
can be achieved
– The OS could accidentally page-out the data that is being read or written
by a DMA and page-in another virtual page into the same physical location
Pinned Memory and DMA Data Transfer
– Pinned memory consists of virtual memory pages that are specially marked so
that they cannot be paged out
– Allocated with a special system API function call
– a.k.a. Page Locked Memory, Locked Pages, etc.
– CPU memory that serves as the source or destination of a DMA transfer must
be allocated as pinned memory
CUDA data transfer uses pinned memory.
– The DMA used by cudaMemcpy() requires that any source or destination in
the host memory is allocated as pinned memory
Allocate/Free Pinned Memory
– cudaHostAlloc(), three parameters
– Address of pointer to the allocated memory
– Size of the allocated memory in bytes
– Option – use cudaHostAllocDefault for now
Using Pinned Memory in CUDA
– Use the allocated pinned memory and its pointer the same way as those
returned by malloc();
– The only difference is that the allocated memory cannot be paged by the OS
Putting It Together - Vector Addition Host Code Example
int main()
{
float *h_A, *h_B, *h_C;
…
cudaHostAlloc((void **) &h_A, N* sizeof(float),
cudaHostAllocDefault);
cudaHostAlloc((void **) &h_B, N* sizeof(float),
cudaHostAllocDefault);
cudaHostAlloc((void **) &h_C, N* sizeof(float),
cudaHostAllocDefault);
…
// cudaMemcpy() runs 2X faster
}
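The "Allocate/Free Pinned Memory" slide lists only the allocation call; for completeness, the matching cleanup before the end of main() would use the standard CUDA runtime call cudaFreeHost():

cudaFreeHost(h_A);   // release pinned memory allocated with cudaHostAlloc()
cudaFreeHost(h_B);
cudaFreeHost(h_C);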
Serialized Data Transfer and Computation
[Figure: timeline in which the host-to-device transfers, the kernel computation, and the device-to-host transfer execute one after another with no overlap.]
Device Overlap
int dev_count;
cudaDeviceProp prop;
cudaGetDeviceCount( &dev_count);
for (int i = 0; i < dev_count; i++) {
cudaGetDeviceProperties(&prop, i);
if (prop.deviceOverlap) …
Ideal, Pipelined Timing
[Figure: ideal pipelined timeline - the transfers of segment 3 (Trans A.3, Trans B.3) proceed while earlier segments are being computed and copied back, so transfers and computation overlap.]
CUDA Streams
Streams
– Requests made from the host code are put into First-In-First-Out
queues
– Queues are read and processed asynchronously by the driver and device
– Driver ensures that commands in a queue are processed in sequence. E.g.,
Memory copies end before kernel launch, etc.
[Figure: the host thread pushes cudaMemcpy(), kernel launch, and device sync requests into a FIFO queue that the device driver reads and processes.]
Streams cont.
[Figure: the host thread feeds two queues, Stream 0 and Stream 1, which the device driver processes; events provide ordering points between them.]
Conceptual View of Streams
[Figure: conceptual view - Stream 0 and Stream 1 each contain their own ordered sequence of PCIe upload, kernel, and PCIe download operations.]
Simple Multi-Stream Host Code
cudaStream_t stream0, stream1;
cudaStreamCreate(&stream0);
cudaStreamCreate(&stream1);
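The loop on the next slide uses per-stream device buffers that are not shown here; a sketch of the missing allocations (buffer names taken from that loop, SegSize as defined there):

float *d_A0, *d_B0, *d_C0;   // device buffers used by stream0
float *d_A1, *d_B1, *d_C1;   // device buffers used by stream1
cudaMalloc((void **) &d_A0, SegSize * sizeof(float));
cudaMalloc((void **) &d_B0, SegSize * sizeof(float));
cudaMalloc((void **) &d_C0, SegSize * sizeof(float));
cudaMalloc((void **) &d_A1, SegSize * sizeof(float));
cudaMalloc((void **) &d_B1, SegSize * sizeof(float));
cudaMalloc((void **) &d_C1, SegSize * sizeof(float));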
Simple Multi-Stream Host Code (Cont.)
for (int i=0; i<n; i+=SegSize*2) {
cudaMemcpyAsync(d_A0, h_A+i, SegSize*sizeof(float),…, stream0);
cudaMemcpyAsync(d_B0, h_B+i, SegSize*sizeof(float),…, stream0);
vecAdd<<<SegSize/256, 256, 0, stream0>>>(d_A0, d_B0,…);
cudaMemcpyAsync(h_C+i, d_C0, SegSize*sizeof(float),…, stream0);
cudaMemcpyAsync(d_A1, h_A+i+SegSize, SegSize*sizeof(float),…, stream1);
cudaMemcpyAsync(d_B1, h_B+i+SegSize, SegSize*sizeof(float),…, stream1);
vecAdd<<<SegSize/256, 256, 0, stream1>>>(d_A1, d_B1, …);
cudaMemcpyAsync(h_C+i+SegSize, d_C1, SegSize*sizeof(float),…, stream1);
}
A View Closer to Reality in Previous GPUs
[Figure: hardware view with one Copy Engine queue and one Kernel Engine queue - the copy engine holds MemCpy A.0, B.0, C.0, A.1, B.1, C.1 in issue order, while the kernel engine holds Kernel 0 and Kernel 1; operations from Stream 0 and Stream 1 share these two queues.]
Not quite the overlap we want in some GPUs
Better Multi-Stream Host Code
for (int i=0; i<n; i+=SegSize*2) {
cudaMemcpyAsync(d_A0, h_A+i, SegSize*sizeof(float),…, stream0);
cudaMemcpyAsync(d_B0, h_B+i, SegSize*sizeof(float),…, stream0);
cudaMemcpyAsync(d_A1, h_A+i+SegSize, SegSize*sizeof(float),…, stream1);
cudaMemcpyAsync(d_B1, h_B+i+SegSize, SegSize*sizeof(float),…, stream1);
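  // Completion sketch (the rest of the loop was lost in extraction): launch both
  // kernels, then issue both device-to-host copies, so C.0 no longer sits between
  // the input copies of stream1. Kernel arguments follow the earlier slides.
  vecAdd<<<SegSize/256, 256, 0, stream0>>>(d_A0, d_B0, d_C0, SegSize);
  vecAdd<<<SegSize/256, 256, 0, stream1>>>(d_A1, d_B1, d_C1, SegSize);
  cudaMemcpyAsync(h_C+i, d_C0, SegSize*sizeof(float), cudaMemcpyDeviceToHost, stream0);
  cudaMemcpyAsync(h_C+i+SegSize, d_C1, SegSize*sizeof(float), cudaMemcpyDeviceToHost, stream1);
}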
C.0 no longer blocks A.1 and B.1
[Figure: with the reordered code, the copy engine queue holds MemCpy A.0, B.0, A.1, B.1, C.0, C.1 and the kernel engine holds Kernel 0 and Kernel 1, so C.0 no longer separates the stream1 input copies from their kernel.]
Better, not quite the best overlap
– C.1 blocks next iteration A.2 and B.2 in the copy engine queue
[Figure: timeline in which MemCpy C.1 delays the transfer of A.2 for the next loop iteration.]
Ideal, Pipelined Timing
[Figure: the ideal pipelined timeline again, for comparison - transfers of segment 3 (Trans A.3, Trans B.3) overlap with the computation and copy-back of earlier segments.]
Hyper Queues
[Figure: with multiple hardware work queues (Hyper-Q), Stream 0 (A--B--C), Stream 1 (P--Q--R), and Stream 2 (X--Y--Z) each feed their own queue instead of sharing one.]
Wait until all tasks have completed
– cudaStreamSynchronize(stream_id)
– Used in host code
– Takes one parameter – stream identifier
– Wait until all tasks in a stream have completed
– E.g., cudaStreamSynchronize(stream0) in host code ensures that all tasks
in the queues of stream0 have completed
GPU Teaching Kit
Accelerated Computing
The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under
the Creative Commons Attribution-NonCommercial 4.0 International License.