
CENG443

Heterogeneous Parallel Programming

CUDA Shared Memory

Işıl ÖZ, IZTECH, Fall 2024


31 December 2024
Image Blurring Kernel

// Get the average of the surrounding 2xBLUR_SIZE x 2xBLUR_SIZE box
for (int blurRow = -BLUR_SIZE; blurRow < BLUR_SIZE + 1; ++blurRow) {
    for (int blurCol = -BLUR_SIZE; blurCol < BLUR_SIZE + 1; ++blurCol) {
        int curRow = Row + blurRow;
        int curCol = Col + blurCol;
        // Verify we have a valid image pixel
        if (curRow > -1 && curRow < h && curCol > -1 && curCol < w) {
            pixVal += in[curRow * w + curCol];
            pixels++; // Keep track of number of pixels in the accumulated total
        }
    }
}

In every iteration of the inner loop, one global memory access is performed for one floating-point addition.
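For context, a minimal sketch of a kernel wrapping this loop; the Row/Col computation, the output write, and the BLUR_SIZE value are assumptions for illustration, not part of this slide:

__global__ void blurKernel(const float *in, float *out, int w, int h) {
    // Assumes, e.g., #define BLUR_SIZE 1 and a single-channel float image of size w x h
    int Col = blockIdx.x * blockDim.x + threadIdx.x;
    int Row = blockIdx.y * blockDim.y + threadIdx.y;
    if (Col < w && Row < h) {
        float pixVal = 0;
        int pixels = 0;
        for (int blurRow = -BLUR_SIZE; blurRow < BLUR_SIZE + 1; ++blurRow) {
            for (int blurCol = -BLUR_SIZE; blurCol < BLUR_SIZE + 1; ++blurCol) {
                int curRow = Row + blurRow;
                int curCol = Col + blurCol;
                if (curRow > -1 && curRow < h && curCol > -1 && curCol < w) {
                    pixVal += in[curRow * w + curCol];
                    pixels++;
                }
            }
        }
        out[Row * w + Col] = pixVal / pixels; // average of the valid neighborhood
    }
}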

2
Compute-to-Global Memory Access Ratio

The number of floating-point calculations performed for each access to the global memory within a region of a program
It is 1 for the floating-point add operation in the image blur kernel
Memory-bound programs: execution speed is limited by memory access throughput

3
GPU Performance

Consider a GPU device with a global memory bandwidth of 1,000 GB/s (1 TB/s)
Each single-precision floating-point value is 4 bytes
1000/4 = 250 giga single-precision operands can be loaded per second
With a compute-to-memory ratio of 1, this allows no more than 250 giga floating-point operations per second (GFLOPS)
Against a peak single-precision performance of 12 TFLOPS, that is a tiny fraction: 250/12000 ≈ 2%
Need to find ways of reducing global memory accesses!

4
CUDA Memories

[Diagram: CUDA memory hierarchy — per-thread registers and per-block shared memory inside each block of the grid; global memory and constant memory shared by all blocks and accessible from the host]

5
Registers

• Fastest
• Only accessible by a thread
• Lifetime of a thread
• Limited capacity of registers

6
Local Variables

All scalar variables declared in kernel and device functions are placed into registers
Automatic array variables are not stored in registers (the compiler may decide to store an array in registers if all accesses are done with constant index values)
Similar to scalar variables, the scope of these arrays is limited to individual threads
Once a thread terminates its execution, the contents of its local variables also cease to exist
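A minimal illustrative sketch of such local variables (the kernel and variable names are assumptions, not from the slides):

__global__ void localVarsExample(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // scalar: typically kept in a register
    float scale = 2.0f;                             // scalar: typically kept in a register
    float window[4];                                // automatic array: usually placed in
                                                    // per-thread local memory, not registers
    if (i < n) {
        window[0] = in[i] * scale;
        out[i] = window[0];
    }
    // All of these variables are private to the thread and disappear when it terminates.
}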

7
Image Blurring Kernel

// Get the average of the surrounding 2xBLUR_SIZE x 2xBLUR_SIZE box
for (int blurRow = -BLUR_SIZE; blurRow < BLUR_SIZE + 1; ++blurRow) {
    for (int blurCol = -BLUR_SIZE; blurCol < BLUR_SIZE + 1; ++blurCol) {
        int curRow = Row + blurRow;
        int curCol = Col + blurCol;
        // Verify we have a valid image pixel
        if (curRow > -1 && curRow < h && curCol > -1 && curCol < w) {
            pixVal += in[curRow * w + curCol];
            pixels++; // Keep track of number of pixels in the accumulated total
        }
    }
}

8
Shared Memory

• Extremely fast (~4 cycles)
• Highly parallel
• Restricted to a block
• Small (typically 48 KB per SM)

9
Shared Memory in CUDA

A special type of memory whose contents are explicitly defined and used in the kernel source code
One in each SM
Accessed at much higher speed (in both latency and throughput) than global memory
Scope of access and sharing: thread block
Lifetime: thread block; contents will disappear after the corresponding thread block terminates execution
Accessed by memory load/store instructions
A form of scratchpad memory in computer architecture

10
Shared Memory in Fermi

64 KB of configurable shared memory and L1 cache:
48 KB shared memory and 16 KB L1 cache, OR
16 KB shared memory and 48 KB L1 cache
With shared memory, you get full control as to what gets stored where, while with the cache everything is done automatically
Even though the compiler and the GPU can still be very clever in optimizing memory accesses, you can sometimes still find a better way
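On such devices the preferred split can be requested per kernel from the host; a minimal sketch (the kernel name is an illustrative assumption):

#include <cuda_runtime.h>

__global__ void myKernel(float *data) { /* ... */ }

int main() {
    // Ask the runtime to prefer the 48 KB shared memory / 16 KB L1 configuration
    // for this kernel; cudaFuncCachePreferL1 requests the opposite split.
    cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);
    myKernel<<<1, 1>>>(nullptr);
    cudaDeviceSynchronize();
    return 0;
}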

11
Shared Variables

Shared variables reside in the shared memory
The scope of a shared variable is a thread block: all threads in a block see the same version of a shared variable, subject to race conditions
A private version of the shared variable is created for and used by each thread block during kernel execution
When a kernel terminates its execution, the contents of its shared variables cease to exist
An efficient means for threads within a block to collaborate with one another and to hold the portion of global memory data that is heavily used in a kernel execution phase

12
Shared Variable Example

Statically (size known at compile time):

__global__ void Kernel(int count)
{
    __shared__ int a[1024];
    ...
}

Dynamically (size not known until runtime):

__global__ void Kernel(int count)
{
    extern __shared__ int a[];
    ...
}
Kernel<<< gridDim, blockDim, numBytesSharedMem >>>(count);

13
Global Memory

• Typically implemented in DRAM
• High access latency: 400-800 cycles
• Finite access bandwidth
• Potential of traffic congestion

14
Global Variables

Visible to all threads of all kernels; can be used as a means for threads to collaborate across blocks
The only easy way to synchronize between threads from different thread blocks, or to ensure data consistency across threads when accessing global memory, is by terminating the current kernel execution
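As an illustration only (the variable and kernel names are assumptions), a statically declared global-scope device variable that all blocks of all kernels can see:

__device__ int blockCounter = 0;   // lives in global memory, visible to every kernel

__global__ void countBlocks() {
    // Cross-block collaboration within one kernel launch must use atomics;
    // ordinary loads/stores from different blocks are not ordered with each other.
    if (threadIdx.x == 0)
        atomicAdd(&blockCounter, 1);
}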

15
Vector Addition Host Code

Global memory: allocate with cudaMalloc(void** devPtr, size_t size), free with cudaFree(void* devPtr)

void vecAdd(float *h_A, float *h_B, float *h_C, int n)
{
    int size = n * sizeof(float); float *d_A, *d_B, *d_C;

    cudaMalloc((void **) &d_A, size);
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMalloc((void **) &d_B, size);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
    cudaMalloc((void **) &d_C, size);

    // ... kernel launch (not shown on this slide) ...

    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
}
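For completeness, a minimal sketch of the device kernel and launch that would fill the gap above (the kernel name and the block size of 256 are assumptions):

__global__ void vecAddKernel(const float *A, const float *B, float *C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) C[i] = A[i] + B[i];       // one element per thread
}

// Placed in vecAdd between the host-to-device copies and the copy back:
//   vecAddKernel<<<(n + 255) / 256, 256>>>(d_A, d_B, d_C, n);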

16
Hardware View of CUDA Memories

[Diagram: hardware view of an SM — the register file and shared memory sit on-chip next to the processing units (ALUs) and the control unit (PC, IR); global memory and I/O are off-chip]

17
Tiling for Reduced Memory Traffic

The global memory is large but slow, whereas the shared memory is small but fast
Partition the data into subsets called tiles so that each tile fits into the shared memory
A large wall (i.e., the global memory data) can be covered by tiles (i.e., subsets that each fit into the shared memory)

18
Matrix Multiplication

19
A Basic Matrix Multiplication Kernel

__global__ void MatrixMulKernel(float* M, float* N, float* P, int Width) {
    // Calculate the row index of the P element and M
    int Row = blockIdx.y*blockDim.y + threadIdx.y;
    // Calculate the column index of P and N
    int Col = blockIdx.x*blockDim.x + threadIdx.x;

    if ((Row < Width) && (Col < Width)) {
        float Pvalue = 0;
        // each thread computes one element
        for (int k = 0; k < Width; ++k) {
            Pvalue += M[Row*Width+k] * N[k*Width+Col];
        }
        P[Row*Width+Col] = Pvalue;
    }
}

20
4x4 P: Thread to Data Mapping

P matrix divided into 4 parts
Each block (2x2 threads) calculates 1 part
[Diagram: the 4x4 P matrix (P0,0 .. P3,3) split into four 2x2 sub-blocks owned by Block(0,0), Block(0,1), Block(1,0), Block(1,1); Thread(0,0)..Thread(1,1) within a block map to the four elements of its sub-block]

21
Global Memory Accesses Performed by Threads in Block(0,0)

8 global memory accesses per thread (8*4 for the block)
If threads collaborate so that each element is loaded from the global memory only once, the total number of accesses to the global memory can be reduced by half

22
Global Memory Access Pattern

[Diagram: Thread 1 and Thread 2 each fetch the same elements directly from global memory]

23
Tiling

Divide the global memory content into tiles

[Diagram: a tile of the global memory content is copied into on-chip memory; Thread 1 and Thread 2 then read it from the on-chip memory]

24
Tiling

[Diagram: another tile of the global memory content staged into on-chip memory and served to Thread 1 and Thread 2 from there]

25
Basic Concept of Tiling

In a congested traffic system, a significant reduction of vehicles can greatly improve the delay seen by all vehicles
Carpooling for commuters (only cars with more than two or three people are allowed to use the carpool lanes)
Tiling for global memory accesses:
drivers = threads accessing their memory data operands
cars = memory access requests

26
Carpools Need Synchronization

Good when people have similar schedules
Bad when people have very different schedules
[Diagram: Worker A and Worker B timelines — aligned sleep/work/dinner schedules make carpooling easy; shifted schedules (one partying while the other sleeps) make it hard]

27
Tiling Requires Synchronization Among Threads

Good when threads have similar access timing
Bad when threads have very different timing
[Diagram: Thread 1 and Thread 2 access timelines — overlapping access windows allow data to be shared in a tile; widely separated accesses do not]

28
Tiling

Localizes the memory locations accessed among threads and the timing of their accesses
Divides the long access sequences of each thread into phases and uses barrier synchronization to keep the timing of accesses to each section at close intervals
Controls the amount of on-chip memory required by localizing the accesses both in time and in space

29
Outline of Tiling

Identify a tile of global memory contents that are accessed by multiple threads
Load the tile from global memory into on-chip memory
Use barrier synchronization to make sure that all threads are ready to start the phase
Have the multiple threads access their data from the on-chip memory
Use barrier synchronization to make sure that all threads have completed the current phase
Move on to the next tile (a minimal code skeleton of this structure is sketched below)
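A minimal, generic skeleton of this phase structure (the tile size, array names, and indexing are placeholders for illustration, not the course's matrix multiplication kernel, which appears later):

#define TILE 16

__global__ void tiledSkeleton(const float *in, float *out, int n) {
    __shared__ float tile[TILE];                 // on-chip storage for one tile

    float result = 0.0f;
    // Assumes blockDim.x == TILE and n is a multiple of TILE
    for (int phase = 0; phase < n / TILE; ++phase) {
        // Steps 1-2: each thread loads one element of the current tile
        tile[threadIdx.x] = in[phase * TILE + threadIdx.x];
        __syncthreads();                         // Step 3: all loads complete before use

        // Step 4: all threads consume the tile from on-chip memory
        for (int i = 0; i < TILE; ++i)
            result += tile[i];
        __syncthreads();                         // Step 5: all uses complete before next load
    }                                            // Step 6: loop moves on to the next tile
    if (threadIdx.x == 0) out[blockIdx.x] = result;
}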

30
A Basic Matrix Multiplication Kernel

__global__ void MatrixMulKernel(float* M, float* N, float* P, int Width) {
    // Calculate the row index of the P element and M
    int Row = blockIdx.y*blockDim.y + threadIdx.y;
    // Calculate the column index of P and N
    int Col = blockIdx.x*blockDim.x + threadIdx.x;

    if ((Row < Width) && (Col < Width)) {
        float Pvalue = 0;
        // each thread computes one element of the block sub-matrix
        for (int k = 0; k < Width; ++k) {
            Pvalue += M[Row*Width+k] * N[k*Width+Col];
        }
        P[Row*Width+Col] = Pvalue;
    }
}

31
4x4 P: Thread to Data Mapping

P matrix divided into 4 parts
Each block (2x2 threads) calculates 1 part, with BLOCK_WIDTH = 2
[Diagram: the same 4x4 P thread-to-data mapping as before, with the 2x2 sub-blocks labeled BLOCK_WIDTH = 2]

33
Calculation of P0,0 and P0,1

[Diagram: rows 0-1 of M and columns 0-1 of N — the elements combined to produce P0,0, P0,1, P1,0, and P1,1]
34
Tiled Matrix Multiplication

Break up the execution of each thread into phases (to utilize shared memory)
So that the data accesses by the thread block in each phase are focused on one tile of M and one tile of N
The tile is of BLOCK_SIZE elements in each dimension
[Diagram: WIDTH x WIDTH matrices M, N, and P; the block computes the BLOCK_WIDTH x BLOCK_WIDTH tile of P at (Row, Col) from the corresponding strip of M rows and strip of N columns]

35
2x2 Tiles

36
Phase 0 Load for Block (0,0)

Each thread loads one M element and one N element
[Diagram: the 2x2 tiles M0,0..M1,1 and N0,0..N1,1 copied from the 4x4 M and N matrices into shared memory]

37
Phase 0 Use for Block (0,0) (iteration 0)

[Diagram: iteration 0 of phase 0 — each thread reads the first element of its M-tile row and the first element of its N-tile column from shared memory]

38
Phase 0 Use for Block (0,0) (iteration 1)

[Diagram: iteration 1 of phase 0 — each thread reads the second element of its M-tile row and the second element of its N-tile column from shared memory]

39
Phase 1 Load for Block (0,0)

Each thread loads one M element and one N element (the remaining elements)
[Diagram: the next 2x2 tiles, M0,2..M1,3 and N2,0..N3,1, copied into shared memory]

40
Phase 1 Use for Block (0,0) (iteration 0)

[Diagram: iteration 0 of phase 1 — each thread reads the first element of its new M-tile row and N-tile column from shared memory]

41
Phase 1 Use for Block (0,0) (iteration 1)

[Diagram: iteration 1 of phase 1 — each thread reads the second element of its new M-tile row and N-tile column from shared memory]

42
Execution Phases

43
Execution Phases

Shared memory allows each value to be accessed by multiple threads

44
Data in Shared Memory

Mds and Nds (ds_M and ds_N in the kernel code): shared memory arrays for M and N elements
They are reused to hold input values, allowing a much smaller shared memory to serve most of the accesses to global memory
Each phase focuses on a small subset of the input matrix elements: locality

45
Barrier Synchronization

Synchronize all threads in a block: __syncthreads()
All threads in the same block must reach the __syncthreads() before any of them can move on
Best used to coordinate the phased execution of tiled algorithms
To ensure that all elements of a tile are loaded at the beginning of a phase
To ensure that all elements of a tile are consumed at the end of a phase

46
Use 1D Indexing

M[Row][p*TILE_WIDTH+tx]  becomes  M[Row*Width + p*TILE_WIDTH + tx]

N[p*TILE_WIDTH+ty][Col]  becomes  N[(p*TILE_WIDTH+ty)*Width + Col]

where p is the sequence number of the current phase

47
Tiled Matrix Multiplication Kernel

__global__ void MatrixMulKernel(float* M, float* N, float* P, int Width)
{
    __shared__ float ds_M[TILE_WIDTH][TILE_WIDTH];
    __shared__ float ds_N[TILE_WIDTH][TILE_WIDTH];

    int bx = blockIdx.x; int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;

    int Row = by * blockDim.y + ty;
    int Col = bx * blockDim.x + tx;
    float Pvalue = 0;

    // Loop over the M and N tiles required to compute the P element
    for (int p = 0; p < Width/TILE_WIDTH; ++p) {
        // Collaborative loading of M and N tiles into shared memory
        ds_M[ty][tx] = M[Row*Width + p*TILE_WIDTH + tx];
        ds_N[ty][tx] = N[(p*TILE_WIDTH + ty)*Width + Col];
        __syncthreads();

        for (int i = 0; i < TILE_WIDTH; ++i)
            Pvalue += ds_M[ty][i] * ds_N[i][tx];
        __syncthreads();
    }
    P[Row*Width+Col] = Pvalue;
}

48
Before-After

Before:

for (int k = 0; k < Width; ++k) {
    Pvalue += M[Row*Width+k]*N[k*Width+Col];
}

After:

for (int p = 0; p < Width/TILE_WIDTH; ++p) {
    ds_M[ty][tx] = M[Row*Width + p*TILE_WIDTH+tx];
    ds_N[ty][tx] = N[(p*TILE_WIDTH+ty)*Width + Col];
    __syncthreads();
    for (int i = 0; i < TILE_WIDTH; ++i)
        Pvalue += ds_M[ty][i] * ds_N[i][tx];
    __syncthreads();
}

51
Tile (Thread Block) Size Considerations

Each thread block should have many threads:
TILE_WIDTH of 16 gives 16*16 = 256 threads
TILE_WIDTH of 32 gives 32*32 = 1024 threads
(tiling reduces global memory accesses by a factor of TILE_WIDTH)
For 16, in each phase, each block performs 2*256 = 512 float loads from global memory for 256*(2*16) = 8,192 mul/add operations (16 floating-point operations for each memory load)
For 32, in each phase, each block performs 2*1024 = 2,048 float loads from global memory for 1024*(2*32) = 65,536 mul/add operations (32 floating-point operations for each memory load)

52
Shared Memory

For an SM with 16 KB of shared memory (the shared memory size is implementation dependent!):
For TILE_WIDTH = 16 (256 threads), each thread block uses 2*256*4B = 2 KB of shared memory, allowing up to 8 thread blocks per SM
This allows up to 8*512 = 4,096 pending loads (2 per thread, 256 threads per block)
For TILE_WIDTH = 32 (1024 threads), each block uses 2*1024*4B = 8 KB of shared memory, allowing 2 thread blocks to be active at the same time
However, on a GPU where the thread count is limited to 1536 threads per SM, the number of blocks per SM is reduced to one!
Each __syncthreads() can reduce the number of active threads for a block
More thread blocks can be advantageous
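Because the shared memory capacity is implementation dependent, it can be queried at runtime; a minimal sketch:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0
    printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("Shared memory per SM:    %zu bytes\n", prop.sharedMemPerMultiprocessor);
    printf("Max threads per SM:      %d\n", prop.maxThreadsPerMultiProcessor);
    return 0;
}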

53
Handling Matrix of Arbitrary Size

The tiled kernel so far handles only square matrices whose dimensions (Width) are multiples of the tile width (TILE_WIDTH)
Real applications need to handle arbitrarily sized matrices
One could pad the rows and columns (add elements) up to multiples of the tile size, but this would have significant space and data transfer time overhead (a boundary-checked alternative is sketched below)
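The usual alternative adds boundary checks to the tile loads and the final store; a hedged sketch of just the changed parts of the loop (not the slides' code):

// The phase loop now runs ceil(Width/TILE_WIDTH) times:
//   for (int p = 0; p < (Width + TILE_WIDTH - 1)/TILE_WIDTH; ++p) { ... }

// Inside the loop: load 0 for out-of-range elements so the inner product is unaffected
ds_M[ty][tx] = (Row < Width && p*TILE_WIDTH + tx < Width)
                   ? M[Row*Width + p*TILE_WIDTH + tx] : 0.0f;
ds_N[ty][tx] = (p*TILE_WIDTH + ty < Width && Col < Width)
                   ? N[(p*TILE_WIDTH + ty)*Width + Col] : 0.0f;
__syncthreads();

for (int i = 0; i < TILE_WIDTH; ++i)
    Pvalue += ds_M[ty][i] * ds_N[i][tx];
__syncthreads();

// After the loop: guard the final store
if (Row < Width && Col < Width)
    P[Row*Width + Col] = Pvalue;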

54
NVIDIA Nsight Compute

55
References

Chapter 4.2, 4.4, 4.5, 4.6 (Programming Massively Parallel Processors: A Hands-on Approach, David B. Kirk, Wen-Mei W. Hwu, Morgan Kaufmann Publishers, 3rd edition)
https://developer.nvidia.com/nsight-compute
https://events.prace-ri.eu/

56
