Module 4.1 - Memory and Data Locality: GPU Teaching Kit

This document discusses CUDA memory types and how to effectively use them in parallel programs. It covers registers, shared memory, and global memory, explaining their scope, lifetime, and importance for memory access efficiency. The document then demonstrates how tiling can improve memory performance by reducing global memory accesses. Tiling divides data into blocks that fit into faster on-chip memory. Threads work cooperatively on a tile, synchronizing to load, compute, and store each tile. This allows achieving a higher percentage of the GPU's peak floating-point performance.


GPU Teaching Kit

Accelerated Computing

Module 4.1 – Memory and Data Locality


CUDA Memories
Objective
– To learn to effectively use the CUDA memory types in a parallel
program
– Importance of memory access efficiency
– Registers, shared memory, global memory
– Scope and lifetime

2
Review: Image Blur Kernel.

// Get the average of the surrounding 2xBLUR_SIZE x 2xBLUR_SIZE box
for(int blurRow = -BLUR_SIZE; blurRow < BLUR_SIZE+1; ++blurRow) {
    for(int blurCol = -BLUR_SIZE; blurCol < BLUR_SIZE+1; ++blurCol) {
        int curRow = Row + blurRow;
        int curCol = Col + blurCol;
        // Verify we have a valid image pixel
        if(curRow > -1 && curRow < h && curCol > -1 && curCol < w) {
            pixVal += in[curRow * w + curCol];
            pixels++; // Keep track of number of pixels in the accumulated total
        }
    }
}

// Write our new pixel value out
out[Row * w + Col] = (unsigned char)(pixVal / pixels);

3
How about performance on a GPU?
– All threads access global memory for their input matrix elements
– One memory access (4 bytes) per floating-point addition
– 4 bytes of memory bandwidth needed per FLOP
– Assume a GPU with
– Peak floating-point rate of 1,500 GFLOPS and 200 GB/s DRAM bandwidth
– 4*1,500 = 6,000 GB/s required to achieve the peak FLOPS rating
– The 200 GB/s memory bandwidth limits execution to 50 GFLOPS

– This limits the execution rate to 3.3% (50/1,500) of the peak
floating-point execution rate of the device!

– Need to drastically cut down memory accesses to get close to
the 1,500 GFLOPS

4
Example – Matrix Multiplication
[Figure: matrix multiplication P = M × N. All three matrices are WIDTH × WIDTH; the thread at (Row, Col) computes one P element; thread blocks are BLOCK_WIDTH × BLOCK_WIDTH.]
5
A Basic Matrix Multiplication
__global__ void MatrixMulKernel(float* M, float* N, float* P, int Width) {

  // Calculate the row index of the P element and M
  int Row = blockIdx.y*blockDim.y+threadIdx.y;
  // Calculate the column index of P and N
  int Col = blockIdx.x*blockDim.x+threadIdx.x;

  if ((Row < Width) && (Col < Width)) {
    float Pvalue = 0;
    // each thread computes one element of the block sub-matrix
    for (int k = 0; k < Width; ++k) {
      Pvalue += M[Row*Width+k]*N[k*Width+Col];
    }
    P[Row*Width+Col] = Pvalue;
  }
}
6
A Toy Example: Thread to P Data Mapping

[Figure: 4×4 P matrix covered by four 2×2 thread blocks (BLOCK_WIDTH = 2). Block(0,0) computes P0,0–P1,1, Block(0,1) computes P0,2–P1,3, Block(1,0) computes P2,0–P3,1, Block(1,1) computes P2,2–P3,3; within each block, Thread(y,x) computes the corresponding P element.]

8
Calculation of P0,0 and P0,1
[Figure: P0,0 and P0,1 are computed from row 0 of M (M0,0–M0,3) combined with column 0 of N (N0,0–N3,0) and column 1 of N (N0,1–N3,1), respectively.]
9
Memory and Registers in the Von-Neumann Model

[Figure: the von Neumann model — Memory and I/O connected to a Processing Unit (ALU and Register File) and a Control Unit (PC and IR).]
10
Programmer View of CUDA Memories

[Figure: programmer's view of CUDA memories — within a grid, each thread has its own registers, each block has its own shared memory, and all blocks (as well as the host) can access global memory and constant memory.]
11
Declaring CUDA Variables
Variable declaration                          Memory     Scope    Lifetime
int LocalVar;                                 register   thread   thread
__device__ __shared__   int SharedVar;        shared     block    block
__device__              int GlobalVar;        global     grid     application
__device__ __constant__ int ConstantVar;      constant   grid     application

– __device__ is optional when used with __shared__ or __constant__
– Automatic variables reside in a register
– Except per-thread arrays that reside in global memory
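
As a quick illustration of the table above, a minimal sketch (the kernel and variable names here are made up for this example, not from the slides) placing one variable in each memory type:

__constant__ float coeff[16];       // constant memory: grid scope, application lifetime
__device__   int   globalCounter;   // global memory:   grid scope, application lifetime

__global__ void exampleKernel(float *in, float *out) {
    __shared__ float tile[256];     // shared memory: block scope, block lifetime
    int idx = threadIdx.x;          // automatic variable: lives in a register, per thread
    tile[idx] = in[blockIdx.x * blockDim.x + idx];
    __syncthreads();
    out[blockIdx.x * blockDim.x + idx] = tile[idx] * coeff[idx % 16];
    atomicAdd(&globalCounter, 1);   // every thread updates the global counter
}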

12
Example:
Shared Memory Variable Declaration

__global__ void blurKernel(unsigned char * in, unsigned char * out, int w, int h)
{
    __shared__ float ds_in[TILE_WIDTH][TILE_WIDTH];
    // ... kernel body ...
}

13
Where to Declare Variables?

Can the host access it?
– Yes (global or constant): declare outside of any function
– No (register or shared): declare in the kernel

14
Shared Memory in CUDA
– A special type of memory whose contents are explicitly defined and
used in the kernel source code
– One in each SM
– Accessed at much higher speed (in both latency and throughput) than global
memory
– Scope of access and sharing – thread block
– Lifetime – thread block; contents disappear after the corresponding thread block
finishes execution
– Accessed by memory load/store instructions
– A form of scratchpad memory in computer architecture

15
Hardware View of CUDA Memories

[Figure: hardware view of an SM — a Processing Unit (ALU, Register File, and Shared Memory) and a Control Unit (PC, IR), connected to Global Memory and I/O.]

16
GPU Teaching Kit
Accelerated Computing

Module 4.2 – Memory and Data Locality


Tiled Parallel Algorithms
Objective
– To understand the motivation and ideas for tiled parallel algorithms
– Reducing the limiting effect of memory bandwidth on parallel kernel performance
– Tiled algorithms and barrier synchronization

2
Global Memory Access Pattern
of the Basic Matrix Multiplication Kernel

[Figure: in the basic kernel, Thread 1 and Thread 2 each fetch their operands directly from global memory.]

3
Tiling/Blocking - Basic Idea
[Figure: threads first stage data from global memory into on-chip memory, then compute from the on-chip copy.]

– Divide the global memory content into tiles
– Focus the computation of threads on one or a small number
of tiles at each point in time

4
Tiling/Blocking - Basic Idea
[Figure: Thread 1 and Thread 2 now read the tile from on-chip memory instead of global memory.]

5
Basic Concept of Tiling
– In a congested traffic system, a significant reduction in the number of vehicles
can greatly reduce the delay seen by all vehicles
– Carpooling for commuters
– Tiling for global memory accesses
– drivers = threads accessing their memory data operands
– cars = memory access requests

6
Some Computations are More Challenging to Tile
– Some carpools may be easier than others
– Carpool participants need to have similar work schedules
– Some vehicles may be more suitable for carpooling
– Similar challenges exist in tiling

7
Carpools need synchronization.
– Good: when people have similar schedule

Worker A: sleep | work | dinner
Worker B: sleep | work | dinner
(time →)

8
Carpools need synchronization.
– Bad: when people have very different schedule

Worker A: party | sleep | work
Worker B: sleep | work | dinner
(time →)

9
Same with Tiling
– Good: when threads have similar access timing

[Figure: Thread 1 and Thread 2 timelines — in the good case their accesses to the same data line up in time; in the bad case they do not.]

– Bad: when threads have very different timing


10
Barrier Synchronization for Tiling

11
Outline of Tiling Technique
– Identify a tile of global memory content that is accessed by
multiple threads
– Load the tile from global memory into on-chip memory
– Use barrier synchronization to make sure that all threads are ready
to start the phase
– Have the multiple threads access their data from the on-chip
memory
– Use barrier synchronization to make sure that all threads have
completed the current phase
– Move on to the next tile (see the sketch below)
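
A minimal sketch of this load / sync / compute / sync pattern (the kernel, TILE, and ds_tile below are illustrative, not from the slides; it assumes a TILE × TILE block and width a multiple of TILE):

#define TILE 16

__global__ void tiledRowSumKernel(const float *g_in, float *g_out, int width) {
    __shared__ float ds_tile[TILE][TILE];
    float acc = 0.0f;
    int row = blockIdx.y * TILE + threadIdx.y;
    for (int p = 0; p < width / TILE; ++p) {                  // one iteration per tile (phase)
        // 1. Each thread loads one element of the current tile
        ds_tile[threadIdx.y][threadIdx.x] = g_in[row * width + p * TILE + threadIdx.x];
        __syncthreads();                                      // 2. Wait until the whole tile is loaded
        for (int i = 0; i < TILE; ++i)                        // 3. Compute from on-chip memory
            acc += ds_tile[threadIdx.y][i];
        __syncthreads();                                      // 4. Wait before the tile is overwritten
    }
    if (threadIdx.x == 0) g_out[row] = acc;                   // illustrative output: one row sum per row
}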

12
GPU Teaching Kit
Accelerated Computing

Module 4.3 - Memory Model and Locality


Tiled Matrix Multiplication
Objective
– To understand the design of a tiled parallel algorithm for matrix
multiplication
– Loading a tile
– Phased execution
– Barrier Synchronization

2
Matrix Multiplication
– Data access pattern
– Each thread – a row of M and a column of N
– Each thread block – a strip of M and a strip of N

[Figure: matrix multiplication P = M × N (all WIDTH × WIDTH); the thread at (Row, Col) reads row Row of M and column Col of N; blocks are BLOCK_WIDTH × BLOCK_WIDTH.]

3
Tiled Matrix Multiplication
– Break up the execution of each
thread into phases
– so that the data accesses by the
thread block in each phase are
focused on one tile of M and one
tile of N
– The tile is BLOCK_WIDTH elements
in each dimension

[Figure: M, N, and P divided into BLOCK_WIDTH × BLOCK_WIDTH tiles; in each phase the block works on one tile of M and one tile of N.]
4
Loading a Tile
– All threads in a block participate
– Each thread loads one M element and one N element in tiled code

5
Phase 0 Load for Block (0,0)

[Figure: in phase 0, Block (0,0) copies the 2×2 tiles M0,0–M1,1 and N0,0–N1,1 from global memory into shared memory.]

6
Phase 0 Use for Block (0,0) (iteration 0)

[Figure: iteration 0 of phase 0 — each thread of Block (0,0) uses column 0 of the shared M tile and row 0 of the shared N tile and accumulates into its P value.]

7
Phase 0 Use for Block (0,0) (iteration 1)

[Figure: iteration 1 of phase 0 — the same threads use column 1 of the shared M tile and row 1 of the shared N tile.]

8
Phase 1 Load for Block (0,0)

[Figure: in phase 1, Block (0,0) copies the next tiles — M0,2–M1,3 and N2,0–N3,1 — from global memory into shared memory, overwriting the phase-0 tiles.]

9
Phase 1 Use for Block (0,0) (iteration 0)

[Figure: iteration 0 of phase 1 — threads use column 0 of the new M tile and row 0 of the new N tile from shared memory.]

10
Phase 1 Use for Block (0,0) (iteration 1)

[Figure: iteration 1 of phase 1 — threads use column 1 of the new M tile and row 1 of the new N tile, completing P0,0, P0,1, P1,0, and P1,1.]

11
Execution Phases of Toy Example

12
Execution Phases of Toy Example (cont.)

Shared memory allows each value to be accessed by multiple threads
13
Barrier Synchronization
– Synchronize all threads in a block
– __syncthreads()

– All threads in the same block must reach the __syncthreads() before
any of them can move on

– Best used to coordinate the phased execution of tiled algorithms
– To ensure that all elements of a tile are loaded at the beginning of a phase
– To ensure that all elements of a tile are consumed at the end of a phase

14
GPU Teaching Kit
Accelerated Computing

Module 4.4 - Memory and Data Locality


Tiled Matrix Multiplication Kernel
Objective
– To learn to write a tiled matrix-multiplication kernel
– Loading and using tiles for matrix multiplication
– Barrier synchronization, shared memory
– Resource Considerations
– Assume that Width is a multiple of tile size for simplicity

2
Loading Input Tile 0 of M (Phase 0)
– Have each thread load an M
element and an N element at the
same relative position as its P
element.

int Row = by * blockDim.y + ty;
int Col = bx * blockDim.x + tx;

2D indexing for accessing Tile 0:
M[Row][tx]
N[ty][Col]

[Figure: M, N, and P (all WIDTH × WIDTH) with the phase-0 tile of M highlighted; tiles are TILE_WIDTH × TILE_WIDTH.]
3
Loading Input Tile 0 of N (Phase 0)
– Same as the previous slide, with the phase-0 tile of N highlighted: each thread loads N[ty][Col].

[Figure: same diagram with the N tile highlighted.]
4
Loading Input Tile 1 of M (Phase 1)
2D indexing for accessing Tile 1:
M[Row][1*TILE_WIDTH + tx]
N[1*TILE_WIDTH + ty][Col]

[Figure: M, N, and P with the phase-1 tile of M highlighted.]
5
Loading Input Tile 1 of N (Phase 1)
– Same indexing as the previous slide, with the phase-1 tile of N highlighted: N[1*TILE_WIDTH + ty][Col].
6
M and N are dynamically allocated - use 1D indexing

M[Row][p*TILE_WIDTH+tx]
M[Row*Width + p*TILE_WIDTH + tx]

N[p*TILE_WIDTH+ty][Col]
N[(p*TILE_WIDTH+ty)*Width + Col]

where p is the sequence number of the current phase
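
The same flattening written as a helper macro (IDX2D is illustrative, not from the slides):

// Row-major flattening: element (row, col) of a matrix with Width columns stored in a 1D array
#define IDX2D(row, col, Width) ((row) * (Width) + (col))

// M[Row][p*TILE_WIDTH + tx]  becomes  M[IDX2D(Row, p*TILE_WIDTH + tx, Width)]
// N[p*TILE_WIDTH + ty][Col]  becomes  N[IDX2D(p*TILE_WIDTH + ty, Col, Width)]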

7
Tiled Matrix Multiplication Kernel
__global__ void MatrixMulKernel(float* M, float* N, float* P, int Width)
{
  __shared__ float ds_M[TILE_WIDTH][TILE_WIDTH];
  __shared__ float ds_N[TILE_WIDTH][TILE_WIDTH];

  int bx = blockIdx.x;  int by = blockIdx.y;
  int tx = threadIdx.x; int ty = threadIdx.y;

  int Row = by * blockDim.y + ty;
  int Col = bx * blockDim.x + tx;
  float Pvalue = 0;

  // Loop over the M and N tiles required to compute the P element
  for (int p = 0; p < Width/TILE_WIDTH; ++p) {
    // Collaborative loading of M and N tiles into shared memory
    ds_M[ty][tx] = M[Row*Width + p*TILE_WIDTH + tx];
    ds_N[ty][tx] = N[(p*TILE_WIDTH + ty)*Width + Col];
    __syncthreads();

    for (int i = 0; i < TILE_WIDTH; ++i)
      Pvalue += ds_M[ty][i] * ds_N[i][tx];
    __syncthreads();
  }
  P[Row*Width+Col] = Pvalue;
}
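
A possible host-side launch configuration for this kernel (a sketch — it assumes Width is a multiple of TILE_WIDTH and that d_M, d_N, and d_P have already been allocated on the device and initialized):

#define TILE_WIDTH 16

dim3 dimBlock(TILE_WIDTH, TILE_WIDTH, 1);
dim3 dimGrid(Width/TILE_WIDTH, Width/TILE_WIDTH, 1);   // Width assumed a multiple of TILE_WIDTH
MatrixMulKernel<<<dimGrid, dimBlock>>>(d_M, d_N, d_P, Width);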

8
Tile (Thread Block) Size Considerations
– Each thread block should have many threads
– TILE_WIDTH of 16 gives 16*16 = 256 threads
– TILE_WIDTH of 32 gives 32*32 = 1024 threads

– For 16, in each phase, each block performs 2*256 = 512 float
loads from global memory for 256 * (2*16) = 8,192 mul/add
operations. (16 floating-point operations for each memory load)

– For 32, in each phase, each block performs 2*1024 = 2,048 float
loads from global memory for 1024 * (2*32) = 65,536 mul/add
operations. (32 floating-point operations for each memory load)

11
Shared Memory and Threading
– For an SM with 16KB shared memory
– Shared memory size is implementation dependent!
– For TILE_WIDTH = 16, each thread block uses 2*256*4B = 2KB of shared
memory
– With 16KB of shared memory, one can potentially have up to 8 thread blocks
executing
– This allows up to 8*512 = 4,096 pending loads (2 per thread, 256 threads per block)
– The next TILE_WIDTH, 32, would lead to 2*32*32*4B = 8KB of shared
memory usage per thread block, allowing only 2 thread blocks to be active at the same time
– However, in a GPU where the thread count is limited to 1,536 threads per SM,
the number of blocks per SM is reduced to one!
– Each __syncthreads() can reduce the number of active threads for a
block
– More thread blocks can be advantageous
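
Since the shared memory size is implementation dependent, the host can query it at run time; a minimal sketch using the CUDA runtime API (device 0 is assumed here):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0
    printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("Shared memory per SM:    %zu bytes\n", prop.sharedMemPerMultiprocessor);
    printf("Max threads per SM:      %d\n",  prop.maxThreadsPerMultiProcessor);
    return 0;
}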

12
GPU Teaching Kit
Accelerated Computing

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under
the Creative Commons Attribution-NonCommercial 4.0 International License.
GPU Teaching Kit
Accelerated Computing

Module 5.1 – Thread Execution Efficiency


Warps and SIMD Hardware
Objective
– To understand how CUDA threads execute on SIMD Hardware
– Warp partitioning
– SIMD Hardware
– Control divergence

2
Warps as Scheduling Units
[Figure: each thread block (Block 1, Block 2, Block 3) is divided into warps of 32 threads (t0 … t31).]

– Each block is divided into 32-thread warps


– An implementation technique, not part of the CUDA programming
model
– Warps are scheduling units in SM
– Threads in a warp execute in Single Instruction Multiple Data
(SIMD) manner
– The number of threads in a warp may vary in future generations

3
Warps in Multi-dimensional Thread Blocks
– The thread blocks are first linearized into 1D in row major order
– In x-dimension first, y-dimension next, and z-dimension last

Figure 6.1: Placing 2D threads into linear order
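
As a concrete illustration of this linearization (a device-code fragment; 32 is today's warp size and, as these slides note, may change in future generations):

// Row-major linearization of a thread's index within its block (x varies fastest, then y, then z)
int linearTid = threadIdx.z * (blockDim.y * blockDim.x)
              + threadIdx.y * blockDim.x
              + threadIdx.x;
int warpId = linearTid / 32;   // which warp this thread belongs to
int laneId = linearTid % 32;   // position within the warp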

4
Blocks are partitioned after linearization
– Linearized thread blocks are partitioned
– Thread indices within a warp are consecutive and increasing
– Warp 0 starts with Thread 0

– Partitioning scheme is consistent across devices
– Thus you can use this knowledge in control flow
– However, the exact size of warps may change from
generation to generation

– DO NOT rely on any ordering within or between warps
– If there are any dependencies between threads, you must use
__syncthreads() to get correct results (more later).

5
SMs are SIMD Processors
– Control unit for instruction fetch, decode, and control is shared
among multiple processing units
– Control overhead is minimized (Module 1)

[Figure: an SM as a SIMD processor — one Control Unit (PC, IR) drives multiple processing units (ALUs) that share a Register File and Shared Memory, connected to Memory and I/O.]

6
SIMD Execution Among Threads in a Warp
– All threads in a warp must execute the same instruction
at any point in time

– This works efficiently if all threads follow the same


control flow path
– All if-then-else statements make the same decision
– All loops iterate the same number of times

7
Control Divergence
– Control divergence occurs when threads in a warp take
different control flow paths by making different control
decisions
– Some take the then-path and others take the else-path of an if-
statement
– Some threads take a different number of loop iterations than others

– The execution of threads taking different paths is
serialized in current GPUs
– The control paths taken by the threads in a warp are traversed one
at a time until there are no more
– During the execution of each path, all threads taking that path will
be executed in parallel
– The number of different paths can be large when considering
nested control flow statements

8
Control Divergence Examples
– Divergence can arise when a branch or loop
condition is a function of thread indices
– Example kernel statement with divergence:
– if (threadIdx.x > 2) { }
– This creates two different control paths for threads in a block
– Decision granularity < warp size; threads 0, 1, and 2 follow a
different path than the rest of the threads in the first warp
– Example without divergence:
– if (blockIdx.x > 2) { }
– Decision granularity is a multiple of block size; all threads in
any given warp follow the same path

9
Example: Vector Addition Kernel
Device Code
// Compute vector sum C = A + B
// Each thread performs one pair-wise addition

__global__
void vecAddKernel(float* A, float* B, float* C,
int n)
{
int i = threadIdx.x + blockDim.x * blockIdx.x;
if(i<n) C[i] = A[i] + B[i];
}

10
Analysis for vector size of 1,000 elements
– Assume that block size is 256 threads
– 8 warps in each block

– All threads in Blocks 0, 1, and 2 are within valid range


– i values from 0 to 767
– There are 24 warps in these three blocks, none will have control divergence

– Most warps in Block 3 will not have control divergence
– Threads in warps 0–6 are all within the valid range, thus no control divergence

– One warp in Block 3 will have control divergence


– Threads with i values 992-999 will all be within valid range
– Threads with i values of 1000-1023 will be outside valid range

– Effect of serialization on control divergence will be small


– 1 out of 32 warps has control divergence
– The impact on performance will likely be less than 3%
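
Summarizing the arithmetic behind these numbers:

ceil(1000 / 256) = 4 blocks  →  4 × 8 = 32 warps
warps 0–30: i = 0 … 991 (all < 1000, no divergence)
warp 31:    i = 992 … 1023 (straddles i = 1000, diverges)
1 / 32 ≈ 3%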

11
GPU Teaching Kit
Accelerated Computing

Module 5.2 – Thread Execution Efficiency


Performance Impact of Control Divergence
Objective
– To learn to analyze the performance impact of control divergence
– Boundary condition checking
– Control divergence is data-dependent

2
Performance Impact of Control Divergence
– Boundary condition checks are vital for complete functionality and
robustness of parallel code
– The tiled matrix multiplication kernel has many boundary condition checks
– The concern is that these checks may cause significant performance degradation
– For example, see the tile loading code below:

if (Row < Width && p * TILE_WIDTH + tx < Width) {
  ds_M[ty][tx] = M[Row * Width + p * TILE_WIDTH + tx];
} else {
  ds_M[ty][tx] = 0.0;
}

if (p * TILE_WIDTH + ty < Width && Col < Width) {
  ds_N[ty][tx] = N[(p * TILE_WIDTH + ty) * Width + Col];
} else {
  ds_N[ty][tx] = 0.0;
}

3
Two types of blocks in loading M Tiles
– 1. Blocks whose tiles are all within the valid range until the last phase
– 2. Blocks whose tiles are partially outside the valid range all the way

[Figure: Type 1 blocks cover full TILE_WIDTH rows of M; Type 2 blocks cover the bottom rows that extend past the matrix boundary.]
4
Analysis of Control Divergence Impact
– Assume 16x16 tiles and thread blocks
– Each thread block has 8 warps (256/32)
– Assume square matrices of 100x100
– Each thread will go through 7 phases (ceiling of 100/16)

– There are 49 thread blocks (7 in each dimension)

5
Control Divergence in Loading M Tiles
– Assume 16x16 tiles and thread blocks
– Each thread block has 8 warps (256/32)
– Assume square matrices of 100x100
– Each warp will go through 7 phases (ceiling of 100/16)

– There are 42 (6*7) Type 1 blocks, with a total of 336 (8*42) warps
– They all have 7 phases, so there are 2,352 (336*7) warp-phases
– The warps have control divergence only in their last phase
– 336 warp-phases have control divergence

6
Control Divergence in Loading M Tiles (Type 2)
– Type 2: the 7 blocks assigned to load the bottom tiles, with a total of
56 (8*7) warps
– They all have 7 phases, so there are 392 (56*7) warp-phases
– The first 2 warps in each Type 2 block will stay within the valid range
until the last phase
– The 6 remaining warps stay outside the valid range
– So, only 14 (2*7) warp-phases have control divergence

7
Overall Impact of Control Divergence
– Type 1 blocks: 336 out of 2,352 warp-phases have control
divergence
– Type 2 blocks: 14 out of 392 warp-phases have control divergence
– The overall performance impact is expected to be modest, roughly 13%
((336+14)/(2,352+392) = 350/2,744)

[Figure: M divided into Type 1 blocks (fully in range) and Type 2 blocks (the bottom TILE_WIDTH rows, partially out of range).]

8
Additional Comments
– The calculation of impact of control divergence in loading N tiles is
somewhat different and is left as an exercise

– The estimated performance impact is data dependent.
– For larger matrices, the impact will be significantly smaller

– In general, the impact of control divergence from boundary-condition
checking on large input data sets should be insignificant
– One should not hesitate to use boundary checks to ensure full functionality

– The fact that a kernel is full of control flow constructs does not mean
that there will be heavy occurrence of control divergence

– We will cover some algorithm patterns that naturally incur control
divergence (such as parallel reduction) in the Parallel Algorithm
Patterns modules

9
GPU Teaching Kit
Accelerated Computing

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under
the Creative Commons Attribution-NonCommercial 4.0 International License.
GPU Teaching Kit
Accelerated Computing

Module 6.1 – Memory Access Performance


DRAM Bandwidth
Objective
– To learn that memory bandwidth is a first-order performance factor
in a massively parallel processor
– DRAM bursts, banks, and channels
– All concepts are also applicable to modern multicore processors

2
Global Memory (DRAM) Bandwidth

– Ideal

– Reality

3
DRAM Core Array Organization
– Each DRAM core array has about 16M bits

– Each bit is stored in a tiny capacitor accessed through one transistor

[Figure: DRAM core array organization — a row address decoder selects a row of the memory cell core array; sense amps and column latches capture the wide row, and a column mux narrows it to the off-chip pin interface.]

4
A very small (8x2-bit) DRAM Core Array

[Figure: a very small 8×2-bit DRAM core array with a row decoder, sense amps, and an output mux.]

5
DRAM Core Arrays are Slow
– Reading from a cell in the core array is a very slow process
– DDR: Core speed = ½ interface speed
– DDR2/GDDR3: Core speed = ¼ interface speed
– DDR3/GDDR4: Core speed = ⅛ interface speed
– … likely to be worse in the future

[Figure: about 1,000 cells are connected to each vertical bit line; each cell is a very small capacitance that stores one data bit and must drive the sense amps, which is why core-array access is slow.]

6
DRAM Bursting
– For DDR{2,3} SDRAM cores clocked at 1/N speed of the interface:
– Load (N × interface width) of DRAM bits from the same row at once to an internal
buffer, then transfer in N steps at interface speed
– DDR3/GDDR4: buffer width = 8X interface width

7
DRAM Bursting Timing Example

[Figure: timing comparison — non-burst access pays the core-array access delay for every data item, while burst access pays it once and then streams several items back-to-back on the interface.]

Modern DRAM systems are designed to always be accessed
in burst mode. Burst bytes are transferred to the processor
but discarded when accesses are not to sequential locations.

8
Multiple DRAM Banks

[Figure: two DRAM banks (Bank 0 and Bank 1), each with its own decoder, sense amps, and mux, sharing the data interface.]

9
DRAM Bursting with Banking

[Figure: with a single bank there is dead time on the interface between bursts; with multiple banks the bursts from different banks are interleaved, reducing the dead time.]

10
GPU off-chip memory subsystem
– NVIDIA GTX280 GPU:
– Peak global memory bandwidth = 141.7 GB/s

– Global memory (GDDR3) interface @ 1.1 GHz
– (Core speed @ 276 MHz)
– For a typical 64-bit interface, we can sustain only about 17.6 GB/s per channel (recall DDR – 2 transfers
per clock)
– We need a lot more bandwidth (141.7 GB/s) – thus 8 memory channels
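
A quick check of these numbers (assuming an 8-byte, i.e. 64-bit, interface per channel and DDR's two transfers per clock):

1.1 GHz × 2 transfers/clock × 8 B ≈ 17.6 GB/s per channel
141.7 GB/s ÷ 17.6 GB/s per channel ≈ 8 channels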

11
GPU Teaching Kit
Accelerated Computing

Lecture 6.2 – Performance Considerations


Memory Coalescing in CUDA
Objective
– To learn that memory coalescing is important for effectively utilizing
memory bandwidth in CUDA
– Its origin in DRAM burst
– Checking if a CUDA memory access is coalesced
– Techniques for improving memory coalescing in CUDA code

2
DRAM Burst – A System View

Burst section | Burst section | Burst section | Burst section
 0  1  2  3   |  4  5  6  7   |  8  9 10 11   | 12 13 14 15

– Each address space is partitioned into burst sections


– Whenever a location is accessed, all other locations in the same
section are also delivered to the processor
– Basic example: a 16-byte address space, 4-byte burst sections
– In practice, we have at least 4GB address space, burst section
sizes of 128-bytes or more

3
Memory Coalescing

[Figure: in each load, threads T0–T3 of a warp access locations that fall within a single burst section, so each load is coalesced into one DRAM request.]

– When all threads of a warp execute a load instruction, if all accessed


locations fall into the same burst section, only one DRAM request
will be made and the access is fully coalesced.

4
Un-coalesced Accesses

[Figure: in each load, the locations accessed by threads T0–T3 straddle burst-section boundaries, so the load is not coalesced.]

– When the accessed locations spread across burst section


boundaries:
– Coalescing fails
– Multiple DRAM requests are made
– The access is not fully coalesced.
– Some of the bytes accessed and transferred are not used by the
threads

5
How to judge if an access is coalesced?
– Accesses in a warp are to consecutive locations if the index in an
array access is in the form of
– A[(expression with terms independent of threadIdx.x) + threadIdx.x];
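
For instance (a hypothetical device-code fragment; A is a float array in global memory, and base and stride are integers that do not depend on threadIdx.x):

int i = blockIdx.x * blockDim.x + threadIdx.x;

float a = A[base + threadIdx.x];    // coalesced: consecutive threads read consecutive addresses
float b = A[i];                     // coalesced: the usual 1D global-index pattern
float c = A[threadIdx.x * stride];  // generally not coalesced when stride > 1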

6
A 2D C Array in Linear Memory Space

M0,0 M0,1 M0,2 M0,3


M1,0 M1,1 M1,2 M1,3
M2,0 M2,1 M2,2 M2,3
M3,0 M3,1 M3,2 M3,3
M
M0,0 M0,1 M0,2 M0,3 M1,0 M1,1 M1,2 M1,3 M2,0 M2,1 M2,2 M2,3 M3,0 M3,1 M3,2 M3,3

linearized order in increasing address

7
Two Access Patterns of Basic Matrix Multiplication

[Figure: A (m × n, accessed along a row by each thread) and B (n × k, accessed down a column by each thread); Thread 1 and Thread 2 compute adjacent output elements.]
A[Row*n+i] B[i*k+Col]
i is the loop counter in the inner product loop of the kernel code

A is m × n, B is n × k
Col = blockIdx.x*blockDim.x + threadIdx.x

8
B accesses are coalesced

[Figure: in each load iteration, threads T0–T3 access four consecutive elements of one row of B (B0,0–B0,3 in iteration 0, B1,0–B1,3 in iteration 1), which sit next to each other in the linearized array — the accesses coalesce.]

9
A Accesses are Not Coalesced

[Figure: in each load iteration, threads T0–T3 access one element from each of four different rows of A (A0,0, A1,0, A2,0, A3,0 in iteration 0), which are far apart in the linearized array — the accesses do not coalesce.]

10
Loading an Input Tile

Have each thread load an A element
and a B element at the same relative
position as its C element.

int tx = threadIdx.x;
int ty = threadIdx.y;

Accessing tile 0 with 2D indexing:
A[Row][tx]
B[ty][Col]

[Figure: A is m × n, B is n × k, C is m × k; the tile element of A at (Row, tx) and of B at (ty, Col) are loaded for the C element at (Row, Col).]

11
Corner Turning

[Figure: corner turning — the original access pattern reads d_M and d_N directly from global memory; the tiled version first copies tiles of d_M and d_N into shared memory using coalesced global accesses, then performs the multiplication out of the shared-memory values.]

12
GPU Teaching Kit

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under
the Creative Commons Attribution-NonCommercial 4.0 International License.
GPU Teaching Kit
Accelerated Computing

Module 14 – Efficient Host-Device Data Transfer


Lecture 14.1 - Pinned Host Memory
Objective
– To learn the important concepts involved in copying (transferring) data
between host and device
– Direct Memory Access
– Pinned memory

2
CPU-GPU Data Transfer using DMA
– DMA (Direct Memory Access) hardware is used by cudaMemcpy() for
better efficiency
– Frees CPU for other tasks
– Hardware unit specialized to transfer a number of bytes requested by OS
– Between physical memory address space regions (some can be mapped I/O memory
locations)
– Uses system interconnect, typically PCIe in today’s systems

[Figure: the CPU and main memory (DRAM) connect over PCIe to the GPU card, where the DMA engine moves data to and from the GPU's global memory; the same applies to other I/O cards.]

3
Virtual Memory Management

– Modern computers use virtual memory management


– Many virtual memory spaces mapped into a single physical memory
– Virtual addresses (pointer values) are translated into physical addresses
– Not all variables and data structures are always in the physical
memory
– Each virtual address space is divided into pages that are mapped into and out of
the physical memory
– Virtual memory pages can be mapped out of the physical memory (page-out) to
make room
– Whether or not a variable is in the physical memory is checked at address
translation time

4
Data Transfer and Virtual Memory
– DMA uses physical addresses
– When cudaMemcpy() copies an array, it is implemented as one or more DMA
transfers
– Address is translated and page presence checked for the entire source and
destination regions at the beginning of each DMA transfer
– No address translation for the rest of the same DMA transfer so that high efficiency
can be achieved

– The OS could accidentally page-out the data that is being read or written
by a DMA and page-in another virtual page into the same physical location

5
Pinned Memory and DMA Data Transfer
– Pinned memory consists of virtual memory pages that are specially marked so that
they cannot be paged out
– Allocated with a special system API function call
– a.k.a. Page-Locked Memory, Locked Pages, etc.
– CPU memory that serves as the source or destination of a DMA transfer must
be allocated as pinned memory

6
CUDA data transfer uses pinned memory.
– The DMA used by cudaMemcpy() requires that any source or destination in
the host memory is allocated as pinned memory

– If a source or destination of a cudaMemcpy() in the host memory is not
allocated in pinned memory, it needs to be first copied to a pinned memory buffer –
extra overhead

– cudaMemcpy() is faster if the host memory source or destination is
allocated in pinned memory since no extra copy is needed

7
Allocate/Free Pinned Memory
– cudaHostAlloc(), three parameters
– Address of pointer to the allocated memory
– Size of the allocated memory in bytes
– Option – use cudaHostAllocDefault for now

– cudaFreeHost(), one parameter


– Pointer to the memory to be freed

8
Using Pinned Memory in CUDA
– Use the allocated pinned memory and its pointer the same way as those
returned by malloc();

– The only difference is that the allocated memory cannot be paged by the OS

– The cudaMemcpy() function should be about 2X faster with pinned memory

– Pinned memory is a limited resource


– over-subscription can have serious consequences

9
Putting It Together - Vector Addition Host Code Example
int main()
{
    float *h_A, *h_B, *h_C;

    cudaHostAlloc((void **) &h_A, N * sizeof(float), cudaHostAllocDefault);
    cudaHostAlloc((void **) &h_B, N * sizeof(float), cudaHostAllocDefault);
    cudaHostAlloc((void **) &h_C, N * sizeof(float), cudaHostAllocDefault);

    // device allocation, cudaMemcpy() (now about 2X faster), kernel launch,
    // and copy-back of the result go here

    cudaFreeHost(h_A);
    cudaFreeHost(h_B);
    cudaFreeHost(h_C);
}

10
GPU Teaching Kit
Accelerated Computing

Module 14 – Efficient Host-Device Data Transfer


Lecture 14.2 - Task Parallelism in CUDA
Objective

– To learn task parallelism in CUDA


– CUDA Streams

2
Serialized Data Transfer and Computation

– So far, the way we use cudaMemcpy() serializes data transfer and
GPU computation for vecAddKernel()

Trans. A → Trans. B → Comp → Trans. C      (time →)

[Figure: during each transfer only one PCIe direction is in use and the GPU is idle; during the computation the PCIe link is idle.]

3
Device Overlap

– Some CUDA devices support device overlap


– Simultaneously execute a kernel while copying data between device and host
memory

int dev_count;
cudaDeviceProp prop;

cudaGetDeviceCount(&dev_count);
for (int i = 0; i < dev_count; i++) {
    cudaGetDeviceProperties(&prop, i);
    if (prop.deviceOverlap) {
        // this device can overlap kernel execution with data transfers
    }
}

4
Ideal, Pipelined Timing

– Divide large vectors into segments


– Overlap transfer and compute of adjacent segments

Segment 0: Trans A.0 → Trans B.0 → Comp C.0 = A.0 + B.0 → Trans C.0
Segment 1:             Trans A.1 → Trans B.1 → Comp C.1 = A.1 + B.1 → Trans C.1
Segment 2:                         Trans A.2 → Trans B.2 → Comp C.2 = A.2 + B.2 → …
Segment 3:                                     Trans A.3 → Trans B.3 → …

(time →; each segment's transfers overlap with the previous segment's computation)

5
CUDA Streams

– CUDA supports parallel execution of kernels and


cudaMemcpy() with “Streams”
– Each stream is a queue of operations (kernel launches and
cudaMemcpy()calls)
– Operations (tasks) in different streams can go in parallel
– “Task parallelism”

6
Streams

– Requests made from the host code are put into First-In-First-Out
queues
– Queues are read and processed asynchronously by the driver and device
– Driver ensures that commands in a queue are processed in sequence. E.g.,
Memory copies end before kernel launch, etc.

[Figure: the host thread enqueues operations (cudaMemcpy(), kernel launch, device sync, cudaMemcpy()) into a FIFO queue that the device driver reads and processes asynchronously.]

7
Streams cont.

– To allow concurrent copying and kernel execution, use multiple


queues, called “streams”
– CUDA “events” allow the host thread to query and synchronize with individual
queues (i.e. streams).

[Figure: the host thread feeds two queues (Stream 0 and Stream 1); CUDA events let the host query or synchronize with each stream individually through the device driver.]

8
Conceptual View of Streams

[Figure: one copy engine (PCIe up/down) and one kernel engine. Stream 0 queues MemCpy A.0, MemCpy B.0, Kernel 0, MemCpy C.0; Stream 1 queues MemCpy A.1, MemCpy B.1, Kernel 1, MemCpy C.1. Operations are kernel launches and cudaMemcpy() calls.]

9
GPU Teaching Kit
Accelerated Computing

Module 14 – Efficient Host-Device Data Transfer


Lecture 14.3 - Overlapping Data Transfer with Computation
Objective

– To learn how to overlap data transfer with computation


– Asynchronous data transfer in CUDA
– Practical limitations of CUDA streams

2
Simple Multi-Stream Host Code
cudaStream_t stream0, stream1;
cudaStreamCreate(&stream0);
cudaStreamCreate(&stream1);

float *d_A0, *d_B0, *d_C0; // device memory for stream 0
float *d_A1, *d_B1, *d_C1; // device memory for stream 1

// cudaMalloc() calls for d_A0, d_B0, d_C0, d_A1, d_B1, d_C1 go here

3
Simple Multi-Stream Host Code (Cont.)
for (int i=0; i<n; i+=SegSize*2) {
  cudaMemcpyAsync(d_A0, h_A+i, SegSize*sizeof(float),…, stream0);
  cudaMemcpyAsync(d_B0, h_B+i, SegSize*sizeof(float),…, stream0);
  vecAdd<<<SegSize/256, 256, 0, stream0>>>(d_A0, d_B0,…);
  cudaMemcpyAsync(h_C+i, d_C0, SegSize*sizeof(float),…, stream0);
  cudaMemcpyAsync(d_A1, h_A+i+SegSize, SegSize*sizeof(float),…, stream1);
  cudaMemcpyAsync(d_B1, h_B+i+SegSize, SegSize*sizeof(float),…, stream1);
  vecAdd<<<SegSize/256, 256, 0, stream1>>>(d_A1, d_B1, …);
  cudaMemcpyAsync(h_C+i+SegSize, d_C1, SegSize*sizeof(float),…, stream1);
}

4
A View Closer to Reality in Previous GPUs

[Figure: a single copy-engine queue holds MemCpy A.0, B.0, C.0, A.1, B.1, C.1 in issue order, while the kernel-engine queue holds Kernel 0 and Kernel 1; the operations of Stream 0 and Stream 1 are interleaved into these two hardware queues.]

5
Not quite the overlap we want in some GPUs

– C.0 blocks A.1 and B.1 in the copy engine queue

Stream 0: Trans A.0 → Trans B.0 → Comp C.0 = A.0 + B.0 → Trans C.0
Stream 1: (A.1 and B.1 wait behind C.0 in the copy-engine queue) → Trans A.1 → Trans B.1 → Comp C.1 = A.1 + B.1 → Trans C.1

6
Better Multi-Stream Host Code
for (int i=0; i<n; i+=SegSize*2) {
cudaMemcpyAsync(d_A0, h_A+i, SegSize*sizeof(float),…, stream0);
cudaMemcpyAsync(d_B0, h_B+i, SegSize*sizeof(float),…, stream0);
cudaMemcpyAsync(d_A1, h_A+i+SegSize, SegSize*sizeof(float),…, stream1);
cudaMemcpyAsync(d_B1, h_B+i+SegSize, SegSize*sizeof(float),…, stream1);

vecAdd<<<SegSize/256, 256, 0, stream0>>>(d_A0, d_B0, …);


vecAdd<<<SegSize/256, 256, 0, stream1>>>(d_A1, d_B1, …);

cudaMemcpyAsync(h_C+i, d_C0, SegSize*sizeof(float),…, stream0);


cudaMemcpyAsync(h_C+i+SegSize, d_C1, SegSize*sizeof(float),…, stream1);
}

7
C.0 no longer blocks A.1 and B.1

[Figure: the copy-engine queue now holds MemCpy A.0, B.0, A.1, B.1, C.0, C.1 and the kernel-engine queue holds Kernel 0 and Kernel 1, so stream 1's input transfers are no longer stuck behind stream 0's output transfer.]

8
Better, not quite the best overlap

– C.1 blocks next iteration A.2 and B.2 in the copy engine queue

Iteration n:   Stream 0: Trans A.0 → Trans B.0 → Comp C.0 = A.0 + B.0 → Trans C.0
               Stream 1:             Trans A.1 → Trans B.1 → Comp C.1 = A.1 + B.1 → Trans C.1
Iteration n+1: Stream 0: (A.2 and B.2 wait behind C.1 in the copy-engine queue) → Trans A.2 → Trans B.2 → Comp C.2 = A.2 + B.2 → …

9
Ideal, Pipelined Timing

– Will need at least three buffers for each of the original A, B, and C;
the code is more complicated

Segment 0: Trans A.0 → Trans B.0 → Comp C.0 = A.0 + B.0 → Trans C.0
Segment 1:             Trans A.1 → Trans B.1 → Comp C.1 = A.1 + B.1 → Trans C.1
Segment 2:                         Trans A.2 → Trans B.2 → Comp C.2 = A.2 + B.2 → …
Segment 3:                                     Trans A.3 → Trans B.3 → …

10
Hyper Queues

– Provide multiple queues for each engine


– Allow more concurrency by allowing some streams to make
progress for an engine while others are blocked

[Figure: with multiple hardware work queues, Stream 0 (A–B–C), Stream 1 (P–Q–R), and Stream 2 (X–Y–Z) each get their own queue instead of being serialized into a single queue.]

11
Wait until all tasks have completed

– cudaStreamSynchronize(stream_id)
– Used in host code
– Takes one parameter – stream identifier
– Wait until all tasks in a stream have completed
– E.g., cudaStreamSynchronize(stream0) in host code ensures that all tasks
in the queues of stream0 have completed

– This is different from cudaDeviceSynchronize()


– Also used in host code
– No parameter
– cudaDeviceSynchronize() waits until all tasks in all streams have completed
for the current device
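
A minimal usage sketch contrasting the two calls (stream0 as created in the earlier multi-stream host code):

cudaStreamSynchronize(stream0);   // host blocks until every task queued in stream0 has finished
cudaDeviceSynchronize();          // host blocks until every task in every stream on the device has finished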

12
GPU Teaching Kit
Accelerated Computing

The GPU Teaching Kit is licensed by NVIDIA and the University of Illinois under
the Creative Commons Attribution-NonCommercial 4.0 International License.
