
CENG443

Heterogeneous Parallel Programming

CUDA Shared Memory

Işıl ÖZ, IZTECH, Fall 2024


31 December 2024
Image Blurring Kernel

// Get the average of the surrounding 2xBLUR_SIZE x 2xBLUR_SIZE box
for (int blurRow = -BLUR_SIZE; blurRow < BLUR_SIZE + 1; ++blurRow) {
    for (int blurCol = -BLUR_SIZE; blurCol < BLUR_SIZE + 1; ++blurCol) {
        int curRow = Row + blurRow;
        int curCol = Col + blurCol;
        // Verify we have a valid image pixel
        if (curRow > -1 && curRow < h && curCol > -1 && curCol < w) {
            pixVal += in[curRow * w + curCol];
            pixels++; // Keep track of number of pixels in the accumulated total
        }
    }
}

In every iteration of the inner loop, one global memory access is performed for one floating-point addition.
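For context, a minimal sketch of a kernel wrapping this loop; the Row/Col computation, the output write, and the BLUR_SIZE value are assumptions for illustration, not part of this slide:

__global__ void blurKernel(const float *in, float *out, int w, int h) {
    // Assumes, e.g., #define BLUR_SIZE 1 and a single-channel float image of size w x h
    int Col = blockIdx.x * blockDim.x + threadIdx.x;
    int Row = blockIdx.y * blockDim.y + threadIdx.y;
    if (Col < w && Row < h) {
        float pixVal = 0;
        int pixels = 0;
        for (int blurRow = -BLUR_SIZE; blurRow < BLUR_SIZE + 1; ++blurRow) {
            for (int blurCol = -BLUR_SIZE; blurCol < BLUR_SIZE + 1; ++blurCol) {
                int curRow = Row + blurRow;
                int curCol = Col + blurCol;
                if (curRow > -1 && curRow < h && curCol > -1 && curCol < w) {
                    pixVal += in[curRow * w + curCol];
                    pixels++;
                }
            }
        }
        out[Row * w + Col] = pixVal / pixels; // average of the valid neighborhood
    }
}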

2
Compute-to-Global Memory Access Ratio

The number of floating-point calculations performed for each access to the global memory within a region of a program
It is 1 for the floating-point add operation in the image blur kernel
Memory-bound programs: execution speed is limited by memory access throughput

3
GPU Performance

Consider a GPU device with a global memory bandwidth of 1,000 GB/s (1 TB/s)
Each single-precision floating-point value is 4 bytes
1000/4 = 250 giga single-precision operands can be loaded per second
With a compute-to-memory ratio of 1, this allows no more than 250 giga floating-point operations per second (GFLOPS)
Against a peak single-precision performance of 12 TFLOPS, that is a tiny fraction: 250/12000 ≈ 2%
Need to find ways of reducing global memory accesses!

4
CUDA Memories

[Diagram: CUDA memory hierarchy — per-thread registers and per-block shared memory inside each block of the grid; global memory and constant memory shared by all blocks and accessible from the host]

5
Registers

• Fastest
• Only accessible by a thread
• Lifetime of a thread
• Limited capacity of registers

6
Local Variables

All scalar variables declared in kernel and device functions are placed into registers
Automatic array variables are not stored in registers (the compiler may decide to store an array in registers if all accesses are done with constant index values)
Similar to scalar variables, the scope of these arrays is limited to individual threads
Once a thread terminates its execution, the contents of its local variables also cease to exist
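A minimal illustrative sketch of such local variables (the kernel and variable names are assumptions, not from the slides):

__global__ void localVarsExample(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // scalar: typically kept in a register
    float scale = 2.0f;                             // scalar: typically kept in a register
    float window[4];                                // automatic array: usually placed in
                                                    // per-thread local memory, not registers
    if (i < n) {
        window[0] = in[i] * scale;
        out[i] = window[0];
    }
    // All of these variables are private to the thread and disappear when it terminates.
}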

7
Image Blurring Kernel

// Get the average of the surrounding 2xBLUR_SIZE x 2xBLUR_SIZE box
for (int blurRow = -BLUR_SIZE; blurRow < BLUR_SIZE + 1; ++blurRow) {
    for (int blurCol = -BLUR_SIZE; blurCol < BLUR_SIZE + 1; ++blurCol) {
        int curRow = Row + blurRow;
        int curCol = Col + blurCol;
        // Verify we have a valid image pixel
        if (curRow > -1 && curRow < h && curCol > -1 && curCol < w) {
            pixVal += in[curRow * w + curCol];
            pixels++; // Keep track of number of pixels in the accumulated total
        }
    }
}

8
Shared Memory

• Extremely fast (~4 cycles)
• Highly parallel
• Restricted to a block
• Small (typically 48 KB per SM)

9
Shared Memory in CUDA

A special type of memory whose contents are explicitly defined and used in the kernel source code
One in each SM
Accessed at much higher speed (in both latency and throughput) than global memory
Scope of access and sharing: thread block
Lifetime: thread block; contents will disappear after the corresponding thread block terminates execution
Accessed by memory load/store instructions
A form of scratchpad memory in computer architecture

10
Shared Memory in Fermi

64 KB of configurable shared memory and L1 cache:
48 KB shared memory and 16 KB L1 cache, OR
16 KB shared memory and 48 KB L1 cache
With shared memory, you get full control as to what gets stored where, while with the cache everything is done automatically
Even though the compiler and the GPU can still be very clever in optimizing memory accesses, you can sometimes still find a better way
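On such devices the preferred split can be requested per kernel from the host; a minimal sketch (the kernel name is an illustrative assumption):

#include <cuda_runtime.h>

__global__ void myKernel(float *data) { /* ... */ }

int main() {
    // Ask the runtime to prefer the 48 KB shared memory / 16 KB L1 configuration
    // for this kernel; cudaFuncCachePreferL1 requests the opposite split.
    cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);
    myKernel<<<1, 1>>>(nullptr);
    cudaDeviceSynchronize();
    return 0;
}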

11
Shared Variables

Shared variables reside in the shared memory
The scope of a shared variable is a thread block: all threads in a block see the same version of a shared variable, subject to race conditions
A private version of the shared variable is created for and used by each thread block during kernel execution
When a kernel terminates its execution, the contents of its shared variables cease to exist
An efficient means for threads within a block to collaborate with one another and to hold the portion of global memory data that is heavily used in a kernel execution phase

12
Shared Variable Example

Statically (size known at compile time):

__global__ void Kernel(int count)
{
    __shared__ int a[1024];
    ...
}

Dynamically (size not known until runtime):

__global__ void Kernel(int count)
{
    extern __shared__ int a[];
    ...
}
Kernel<<< gridDim, blockDim, numBytesSharedMem >>>(count);

13
Global Memory

• Typically implemented in DRAM
• High access latency: 400-800 cycles
• Finite access bandwidth
• Potential of traffic congestion

14
Global Variables

Visible to all threads of all kernels; can be used as a means for threads to collaborate across blocks
The only easy way to synchronize between threads from different thread blocks, or to ensure data consistency across threads when accessing global memory, is by terminating the current kernel execution
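As an illustration only (the variable and kernel names are assumptions), a statically declared global-scope device variable that all blocks of all kernels can see:

__device__ int blockCounter = 0;   // lives in global memory, visible to every kernel

__global__ void countBlocks() {
    // Cross-block collaboration within one kernel launch must use atomics;
    // ordinary loads/stores from different blocks are not ordered with each other.
    if (threadIdx.x == 0)
        atomicAdd(&blockCounter, 1);
}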

15
Vector Addition Host Code

Global memory: allocate with cudaMalloc(void** devPtr, size_t size), free with cudaFree(void* devPtr)

void vecAdd(float *h_A, float *h_B, float *h_C, int n)
{
    int size = n * sizeof(float); float *d_A, *d_B, *d_C;

    cudaMalloc((void **) &d_A, size);
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMalloc((void **) &d_B, size);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
    cudaMalloc((void **) &d_C, size);

    // ... kernel launch (not shown on this slide) ...

    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
}
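For completeness, a minimal sketch of the device kernel and launch that would fill the gap above (the kernel name and the block size of 256 are assumptions):

__global__ void vecAddKernel(const float *A, const float *B, float *C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) C[i] = A[i] + B[i];       // one element per thread
}

// Placed in vecAdd between the host-to-device copies and the copy back:
//   vecAddKernel<<<(n + 255) / 256, 256>>>(d_A, d_B, d_C, n);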

16
Hardware View of CUDA Memories

[Diagram: hardware view of an SM — the register file and shared memory sit on-chip next to the processing units (ALUs) and the control unit (PC, IR); global memory and I/O are off-chip]

17
Tiling for Reduced Memory Traffic

The global memory is large but slow, whereas the shared memory is small but fast
Partition the data into subsets called tiles so that each tile fits into the shared memory
A large wall (i.e., the global memory data) can be covered by tiles (i.e., subsets that each fit into the shared memory)

18
Matrix Multiplication

19
A Basic Matrix Multiplication Kernel

__global__ void MatrixMulKernel(float* M, float* N, float* P, int Width) {
    // Calculate the row index of the P element and M
    int Row = blockIdx.y*blockDim.y + threadIdx.y;
    // Calculate the column index of P and N
    int Col = blockIdx.x*blockDim.x + threadIdx.x;

    if ((Row < Width) && (Col < Width)) {
        float Pvalue = 0;
        // each thread computes one element
        for (int k = 0; k < Width; ++k) {
            Pvalue += M[Row*Width+k] * N[k*Width+Col];
        }
        P[Row*Width+Col] = Pvalue;
    }
}

20
4x4 P: Thread to Data Mapping

P matrix divided into 4 parts
Each block (2x2 threads) calculates 1 part
[Diagram: the 4x4 P matrix (P0,0 .. P3,3) split into four 2x2 sub-blocks owned by Block(0,0), Block(0,1), Block(1,0), Block(1,1); Thread(0,0)..Thread(1,1) within a block map to the four elements of its sub-block]

21
Global Memory Accesses Performed by Threads in Block(0,0)

8 global memory accesses per thread (8*4 for the block)
If threads collaborate so that each element is loaded from the global memory only once, the total number of accesses to the global memory can be reduced by half

22
Global Memory Access Pattern

[Diagram: Thread 1 and Thread 2 each fetch the same elements directly from global memory]

23
Tiling

Divide the global memory content into tiles

[Diagram: a tile of the global memory content is copied into on-chip memory; Thread 1 and Thread 2 then read it from the on-chip memory]

24
Tiling

[Diagram: another tile of the global memory content staged into on-chip memory and served to Thread 1 and Thread 2 from there]

25
Basic Concept of Tiling

In a congested traffic system, a significant reduction of vehicles can greatly improve the delay seen by all vehicles
Carpooling for commuters (only cars with more than two or three people are allowed to use the carpool lanes)
Tiling for global memory accesses:
drivers = threads accessing their memory data operands
cars = memory access requests

26
Carpools Need Synchronization

Good when people have similar schedules
Bad when people have very different schedules
[Diagram: Worker A and Worker B timelines — aligned sleep/work/dinner schedules make carpooling easy; shifted schedules (one partying while the other sleeps) make it hard]

27
Tiling Requires Synchronization Among Threads

Good when threads have similar access timing
Bad when threads have very different timing
[Diagram: Thread 1 and Thread 2 access timelines — overlapping access windows allow data to be shared in a tile; widely separated accesses do not]

28
Tiling

Localizes the memory locations accessed among threads and the timing of their accesses
Divides the long access sequences of each thread into phases and uses barrier synchronization to keep the timing of accesses to each section at close intervals
Controls the amount of on-chip memory required by localizing the accesses both in time and in space

29
Outline of Tiling

Identify a tile of global memory contents that are accessed by multiple threads
Load the tile from global memory into on-chip memory
Use barrier synchronization to make sure that all threads are ready to start the phase
Have the multiple threads access their data from the on-chip memory
Use barrier synchronization to make sure that all threads have completed the current phase
Move on to the next tile (a minimal code skeleton of this structure is sketched below)
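A minimal, generic skeleton of this phase structure (the tile size, array names, and indexing are placeholders for illustration, not the course's matrix multiplication kernel, which appears later):

#define TILE 16

__global__ void tiledSkeleton(const float *in, float *out, int n) {
    __shared__ float tile[TILE];                 // on-chip storage for one tile

    float result = 0.0f;
    // Assumes blockDim.x == TILE and n is a multiple of TILE
    for (int phase = 0; phase < n / TILE; ++phase) {
        // Steps 1-2: each thread loads one element of the current tile
        tile[threadIdx.x] = in[phase * TILE + threadIdx.x];
        __syncthreads();                         // Step 3: all loads complete before use

        // Step 4: all threads consume the tile from on-chip memory
        for (int i = 0; i < TILE; ++i)
            result += tile[i];
        __syncthreads();                         // Step 5: all uses complete before next load
    }                                            // Step 6: loop moves on to the next tile
    if (threadIdx.x == 0) out[blockIdx.x] = result;
}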

30
A Basic Matrix Multiplication Kernel

__global__ void MatrixMulKernel(float* M, float* N, float* P, int Width) {
    // Calculate the row index of the P element and M
    int Row = blockIdx.y*blockDim.y + threadIdx.y;
    // Calculate the column index of P and N
    int Col = blockIdx.x*blockDim.x + threadIdx.x;

    if ((Row < Width) && (Col < Width)) {
        float Pvalue = 0;
        // each thread computes one element of the block sub-matrix
        for (int k = 0; k < Width; ++k) {
            Pvalue += M[Row*Width+k] * N[k*Width+Col];
        }
        P[Row*Width+Col] = Pvalue;
    }
}

31
4x4 P: Thread to Data Mapping

P matrix divided into 4 parts
Each block (2x2 threads) calculates 1 part, with BLOCK_WIDTH = 2
[Diagram: the same 4x4 P thread-to-data mapping as before, with the 2x2 sub-blocks labeled BLOCK_WIDTH = 2]

33
Calculation of P0,0 and P0,1

[Diagram: rows 0-1 of M and columns 0-1 of N — the elements combined to produce P0,0, P0,1, P1,0, and P1,1]
34
Tiled Matrix Multiplication

Break up the execution of each thread into phases (to utilize shared memory)
So that the data accesses by the thread block in each phase are focused on one tile of M and one tile of N
The tile is of BLOCK_SIZE elements in each dimension
[Diagram: WIDTH x WIDTH matrices M, N, and P; the block computes the BLOCK_WIDTH x BLOCK_WIDTH tile of P at (Row, Col) from the corresponding strip of M rows and strip of N columns]

35
2x2 Tiles

36
Phase 0 Load for Block (0,0)

Each thread loads one M element and one N element
[Diagram: the 2x2 tiles M0,0..M1,1 and N0,0..N1,1 copied from the 4x4 M and N matrices into shared memory]

37
Phase 0 Use for Block (0,0) (iteration 0)

[Diagram: iteration 0 of phase 0 — each thread reads the first element of its M-tile row and the first element of its N-tile column from shared memory]

38
Phase 0 Use for Block (0,0) (iteration 1)

[Diagram: iteration 1 of phase 0 — each thread reads the second element of its M-tile row and the second element of its N-tile column from shared memory]

39
Phase 1 Load for Block (0,0)

Each thread loads one M element and one N element (the remaining elements)
[Diagram: the next 2x2 tiles, M0,2..M1,3 and N2,0..N3,1, copied into shared memory]

40
Phase 1 Use for Block (0,0) (iteration 0)

[Diagram: iteration 0 of phase 1 — each thread reads the first element of its new M-tile row and N-tile column from shared memory]

41
Phase 1 Use for Block (0,0) (iteration 1)

[Diagram: iteration 1 of phase 1 — each thread reads the second element of its new M-tile row and N-tile column from shared memory]

42
Execution Phases

43
Execution Phases

Shared memory allows each value to be accessed by multiple threads

44
Data in Shared Memory

Mds and Nds (ds_M and ds_N in the kernel code): shared memory arrays for M and N elements
They are reused to hold input values, allowing a much smaller shared memory to serve most of the accesses to global memory
Each phase focuses on a small subset of the input matrix elements: locality

45
Barrier Synchronization

Synchronize all threads in a block: __syncthreads()
All threads in the same block must reach the __syncthreads() before any of them can move on
Best used to coordinate the phased execution of tiled algorithms
To ensure that all elements of a tile are loaded at the beginning of a phase
To ensure that all elements of a tile are consumed at the end of a phase

46
Use 1D Indexing

M[Row][p*TILE_WIDTH+tx]  becomes  M[Row*Width + p*TILE_WIDTH + tx]

N[p*TILE_WIDTH+ty][Col]  becomes  N[(p*TILE_WIDTH+ty)*Width + Col]

where p is the sequence number of the current phase

47
Tiled Matrix Multiplication Kernel

__global__ void MatrixMulKernel(float* M, float* N, float* P, int Width)
{
    __shared__ float ds_M[TILE_WIDTH][TILE_WIDTH];
    __shared__ float ds_N[TILE_WIDTH][TILE_WIDTH];

    int bx = blockIdx.x; int by = blockIdx.y;
    int tx = threadIdx.x; int ty = threadIdx.y;

    int Row = by * blockDim.y + ty;
    int Col = bx * blockDim.x + tx;
    float Pvalue = 0;

    // Loop over the M and N tiles required to compute the P element
    for (int p = 0; p < Width/TILE_WIDTH; ++p) {
        // Collaborative loading of M and N tiles into shared memory
        ds_M[ty][tx] = M[Row*Width + p*TILE_WIDTH + tx];
        ds_N[ty][tx] = N[(p*TILE_WIDTH + ty)*Width + Col];
        __syncthreads();

        for (int i = 0; i < TILE_WIDTH; ++i)
            Pvalue += ds_M[ty][i] * ds_N[i][tx];
        __syncthreads();
    }
    P[Row*Width+Col] = Pvalue;
}

48
Before-After

Before:

for (int k = 0; k < Width; ++k) {
    Pvalue += M[Row*Width+k]*N[k*Width+Col];
}

After:

for (int p = 0; p < Width/TILE_WIDTH; ++p) {
    ds_M[ty][tx] = M[Row*Width + p*TILE_WIDTH+tx];
    ds_N[ty][tx] = N[(p*TILE_WIDTH+ty)*Width + Col];
    __syncthreads();
    for (int i = 0; i < TILE_WIDTH; ++i)
        Pvalue += ds_M[ty][i] * ds_N[i][tx];
    __syncthreads();
}

51
Tile (Thread Block) Size Considerations

Each thread block should have many threads:
TILE_WIDTH of 16 gives 16*16 = 256 threads
TILE_WIDTH of 32 gives 32*32 = 1024 threads
(tiling reduces global memory accesses by a factor of TILE_WIDTH)
For 16, in each phase, each block performs 2*256 = 512 float loads from global memory for 256*(2*16) = 8,192 mul/add operations (16 floating-point operations for each memory load)
For 32, in each phase, each block performs 2*1024 = 2,048 float loads from global memory for 1024*(2*32) = 65,536 mul/add operations (32 floating-point operations for each memory load)

52
Shared Memory

For an SM with 16 KB of shared memory (the shared memory size is implementation dependent!):
For TILE_WIDTH = 16 (256 threads), each thread block uses 2*256*4B = 2 KB of shared memory, allowing up to 8 thread blocks per SM
This allows up to 8*512 = 4,096 pending loads (2 per thread, 256 threads per block)
For TILE_WIDTH = 32 (1024 threads), each block uses 2*1024*4B = 8 KB of shared memory, allowing 2 thread blocks to be active at the same time
However, on a GPU where the thread count is limited to 1536 threads per SM, the number of blocks per SM is reduced to one!
Each __syncthreads() can reduce the number of active threads for a block
More thread blocks can be advantageous
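Because the shared memory capacity is implementation dependent, it can be queried at runtime; a minimal sketch:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0
    printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("Shared memory per SM:    %zu bytes\n", prop.sharedMemPerMultiprocessor);
    printf("Max threads per SM:      %d\n", prop.maxThreadsPerMultiProcessor);
    return 0;
}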

53
Handling Matrix of Arbitrary Size

The tiled kernel so far handles only square matrices whose dimensions (Width) are multiples of the tile width (TILE_WIDTH)
Real applications need to handle arbitrarily sized matrices
One could pad the rows and columns (add elements) up to multiples of the tile size, but this would have significant space and data transfer time overhead (a boundary-checked alternative is sketched below)
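The usual alternative adds boundary checks to the tile loads and the final store; a hedged sketch of just the changed parts of the loop (not the slides' code):

// The phase loop now runs ceil(Width/TILE_WIDTH) times:
//   for (int p = 0; p < (Width + TILE_WIDTH - 1)/TILE_WIDTH; ++p) { ... }

// Inside the loop: load 0 for out-of-range elements so the inner product is unaffected
ds_M[ty][tx] = (Row < Width && p*TILE_WIDTH + tx < Width)
                   ? M[Row*Width + p*TILE_WIDTH + tx] : 0.0f;
ds_N[ty][tx] = (p*TILE_WIDTH + ty < Width && Col < Width)
                   ? N[(p*TILE_WIDTH + ty)*Width + Col] : 0.0f;
__syncthreads();

for (int i = 0; i < TILE_WIDTH; ++i)
    Pvalue += ds_M[ty][i] * ds_N[i][tx];
__syncthreads();

// After the loop: guard the final store
if (Row < Width && Col < Width)
    P[Row*Width + Col] = Pvalue;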

54
NVIDIA Nsight Compute

55
References

Chapter 4.2, 4.4, 4.5, 4.6 (Programming Massively Parallel Processors: A Hands-on Approach, David B. Kirk, Wen-Mei W. Hwu, Morgan Kaufmann Publishers, 3rd edition)
https://developer.nvidia.com/nsight-compute
https://events.prace-ri.eu/

56
