2023 CSC14120 Lecture 05: CUDA Memories
Memory architecture in
CUDA
2
This lecture:
• So far we have only used global memory
• Global memory is off-chip DRAM, which has long access latency and finite access bandwidth
• GPUs provide a number of additional on-chip memory
resources for accessing data that can remove the
majority of traffic to and from the global memory
• We will study the use of different memory types to
boost the execution performance of CUDA kernels.
• How one can organize and position data for efficient access
by a massive number of threads.
3
(CPU) Memory hierarchy
5
(CPU) Memory hierarchy
6
CUDA Memory Model
• CUDA programming model exposes more of the memory
hierarchy and gives you more explicit control
• Global memory ⟷ grid (read/write)
• Constant memory ⟶ grid (read only)
• Shared memory ⟷ block (read/write)
• Register memory, local memory ⟷ thread (read/write)
7
CUDA Memory Model
Each thread can:
• Read/write per-thread
registers (~1 cycle)
• Read/write per-block shared
memory (~5 cycles)
• Read/write per-grid global
memory (~500 cycles)
• Read only per-grid constant
memory (~5 cycles with
caching)
8
Hardware View of CUDA Memories
9
Global memory (GMEM)
• When the host calls cudaMalloc to allocate a memory region on the device, this region lies in the device's global memory (GMEM)
• GMEM is where the host communicates with the device (ships input data to it and gets output data back from it)
• GMEM lies in DRAM and is the largest memory on the device
• Query: totalGlobalMem in struct cudaDeviceProp
• E.g., Colab GPU has ? GB GMEM
• But on the device, GMEM is the slowest memory to access
→ we should reduce # GMEM accesses from threads
(this is the goal of using other types of memories)
10
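A minimal sketch of how the GMEM size could be queried from the host (the printout format is our own, not from the slides):

#include <cstdio>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0
    // totalGlobalMem is reported in bytes
    printf("GMEM size: %.2f GB\n", prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0));
    return 0;
}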
[Figure: hardware view — each SM contains SPs, SMEM, an L1 cache, and a Const Cache; all SMs share an L2 cache; GMEM lies in off-chip DRAM]
11
Global memory (GMEM)
• We can allocate a memory region in GMEM by cudaMalloc
• Host can read/write this region by cudaMemcpy
• The pointer pointing to this region is passed as an argument to a kernel by
host
• In the kernel, all threads can access this region through the passed pointer
• This region will be freed when host calls cudaFree
• Or: we can declare a static variable in GMEM with keyword __device__
• E.g., __device__ float a[10];
• This declaration must be put outside all functions
• Host can read/write this variable by cudaMemcpyFrom/ToSymbol
• In a kernel, all threads can access this variable (we don’t need to pass it as
an argument to the kernel)
• This variable will be freed automatically when the program finishes
13
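A minimal sketch contrasting the two ways of placing data in GMEM described above (the array size and names are illustrative):

__device__ float a[10];   // static GMEM variable, declared outside all functions

__global__ void useGmem(float *d_arr) {
    int i = threadIdx.x;
    a[i] = 2 * d_arr[i];   // all threads can access both the passed pointer and a
}

int main() {
    float h[10] = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10};
    float *d_arr;
    cudaMalloc(&d_arr, 10 * sizeof(float));                  // dynamic GMEM region
    cudaMemcpy(d_arr, h, 10 * sizeof(float), cudaMemcpyHostToDevice);
    useGmem<<<1, 10>>>(d_arr);                               // pointer passed as an argument
    cudaMemcpyFromSymbol(h, a, 10 * sizeof(float));          // read the __device__ variable
    cudaFree(d_arr);                                         // a is freed automatically at program exit
    return 0;
}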
Constant memory (CMEM)
• In addition to GMEM, host can also communicate with
device through constant memory (CMEM)
• When should we use CMEM?
• When host wants to send device data which is constant during
kernel execution
• This data also needs to be small because CMEM is only 64 KB
• Query: totalConstMem in struct cudaDeviceProp
• Threads in the same warp read the same data
• CMEM lies in DRAM (similar to GMEM), but has Const Caches in SMs (8 KB / SM with most CCs; CC = compute capability)
• Const Cache has low latency but low bandwidth (4 B / clock cycle / SM)
• → if the threads in a warp read the same memory address, the warp reads it once and the data is broadcast to all threads; otherwise the warp needs multiple reads
14
[Figure: CMEM also lies in off-chip DRAM, alongside GMEM; CMEM accesses are cached by the per-SM Const Caches]
15
Constant memory (CMEM)
• In device, kernel arguments are stored in CMEM
• Declare a variable in CMEM: similar to declaring a static
variable in GMEM, but replace keyword __device__ by
__constant__
• E.g., __constant__ float a[10];
• This declaration must be put outside all functions
• Host can read/write this variable by cudaMemcpyFrom/ToSymbol
• In a kernel, all threads can read (not write) this variable (we don’t
need to pass it as an argument to the kernel)
• This variable will be freed automatically when the program finishes
17
Example: 1D convolution
[Figure: 1D convolution — each output element is the weighted sum of NF consecutive input elements with the filter flt[0], flt[1], flt[2], …]
18
#define NF 100
#define NI (1<<24)
#define NO (NI - NF + 1)
__constant__ float d_flt[NF];
…
int main(int argc, char *argv[]) {
// Allocate memories for input, filter, output; set up data for input, filter
float *in, *flt, *out;
…
// Allocate device memories
float *d_in, *d_out;
cudaMalloc(&d_in, NI * sizeof(float));
cudaMalloc(&d_out, NO * sizeof(float));
// Copy data from host memories to device memories
cudaMemcpy(d_in, in, NI * sizeof(float), cudaMemcpyHostToDevice);
// d_flt is a __constant__ symbol, so copy with cudaMemcpyToSymbol (not cudaMemcpy)
cudaMemcpyToSymbol(d_flt, flt, NF * sizeof(float));
// Launch the kernel
…
// Copy results from device memory to host memory
cudaMemcpy(out, d_out, NO * sizeof(float), cudaMemcpyDeviceToHost);
// Free device memories
cudaFree(d_in);
cudaFree(d_out);
…
}
19
#define NF 100
#define NI (1<<24)
#define NO (NI - NF + 1)
__constant__ float d_flt[NF];
…
__global__ void convOnDevice(float *d_in, float *d_out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < NO)
    {
        float s = 0;
        for (int j = 0; j < NF; j++)
        {
            s += d_flt[j] * d_in[ ? ];
        }
        d_out[i] = s;
    }
}

int main(int argc, char *argv[])
{
    …
    // Launch the kernel
    dim3 blockSize(512);
    dim3 gridSize((NO - 1) / blockSize.x + 1);
    convOnDevice<<<gridSize, blockSize>>>(d_in, d_out);
    …
}
20
#define NF 100
#define NI (1<<24)
#define NO (NI - NF + 1)
__constant__ float d_flt[NF];
…
__global__ void convOnDevice(float *d_in, float *d_out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < NO)
    {
        float s = 0;
        for (int j = 0; j < NF; j++)
        {
            s += d_flt[j] * d_in[i + j];
        }
        d_out[i] = s;
    }
}

int main(int argc, char *argv[])
{
    …
    // Launch the kernel
    dim3 blockSize(512);
    dim3 gridSize((NO - 1) / blockSize.x + 1);
    convOnDevice<<<gridSize, blockSize>>>(d_in, d_out);
    …
}
21
Register memory (RMEM)
In addition to CMEM with cache, we can also reduce DRAM accesses
by register memory (RMEM)

__global__ void convOnDevice(float *d_in, float *d_out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < NO)
    {
        float s = 0;
        for (int j = 0; j < NF; j++)
        {
            s += d_flt[j] * d_in[i + j];
        }
        d_out[i] = s;
    }
}

Running time: 17.513
23
Register memory (RMEM)
In addition to CMEM with cache, we can also reduce DRAM accesses
by register memory (RMEM)

What if we accumulate directly in GMEM instead of in the register variable s?

__global__ void convOnDevice(float *d_in, float *d_out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < NO)
    {
        d_out[i] = 0;                            // was: float s = 0;
        for (int j = 0; j < NF; j++)
        {
            d_out[i] += d_flt[j] * d_in[i + j];  // was: s += d_flt[j] * d_in[i + j];
        }
                                                 // was: d_out[i] = s;
    }
}

Running time: 17.513 → 47.107
24
[Figure: same hardware view — registers (RMEM) are on-chip inside the SMs, next to the SPs; GMEM and CMEM lie in DRAM]
26
Register memory (RMEM)
In addition to CMEM with cache, we can also reduce DRAM accesses
by register memory (RMEM)

__global__ void convOnDevice(float *d_in, float *d_out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < NO)
    {
        float s = 0;
        for (int j = 0; j < NF; j++)
        {
            s += d_flt[j] * d_in[i + j];   // write partial results many times to RMEM
        }
        d_out[i] = s;                      // write the final result one time from RMEM to GMEM
    }
}

• Each thread has its own copy of the variable s, stored in its own RMEM
• RMEM is the fastest memory on the device
• A thread's RMEM is freed when the thread finishes executing the kernel
• The host cannot "see" or read/write RMEM

Running time: 17.513 (register version) vs. 47.107 (accumulating directly in GMEM)
27
Local memory (LMEM)
• Although fastest, RMEM size is limited
• In most CCs: 64K 32-bit registers / SM, at most 255 32-
bit registers / thread
• What if each thread has data size > RMEM size?
• Data “spills” out of RMEM onto local memory (LMEM)
• LMEM lies in DRAM, but has cache
• Like RMEM, LMEM is private to each thread and is freed when the thread finishes executing the kernel
28
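A minimal sketch of a kernel whose per-thread array is likely to spill into LMEM (sizes and names are illustrative; whether spilling actually happens depends on the compiler):

__global__ void spillExample(const float *d_in, float *d_out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float buf[256];            // per-thread array: too large for registers, so it
                               // typically ends up in LMEM (DRAM-backed, but cached)
    for (int j = 0; j < 256; j++)
        buf[j] = d_in[(i + j) % n];
    float s = 0;               // the scalar s normally stays in a register
    for (int j = 0; j < 256; j++)
        s += buf[j];
    d_out[i] = s;
}

Compiling with nvcc -Xptxas -v reports, per kernel, the number of registers used and the bytes of local memory (spills).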
[Figure: LMEM also lies in DRAM, alongside GMEM and CMEM; LMEM accesses go through the caches]
29
Shared memory (SMEM)
• In addition to CMEM and RMEM, we can reduce DRAM
accesses by shared memory (SMEM)
• Each block has its own SMEM, which is freed when the block finishes
• SMEM resides in the SM, at the same level as the L1 cache and Const Cache → it can be accessed much faster than DRAM (although not as fast as RMEM)
• In most CCs, each SM has 48-96 KB of physical SMEM, and this 48-96 KB is divided among the blocks residing in the SM
• SMEM is the "cache memory" that the programmer can control
• The host cannot read/write SMEM
31
[Figure: SMEM is on-chip in each SM, at the same level as the L1 and Const Caches; GMEM, CMEM, and LMEM lie in DRAM]
32
Shared memory - Work flow
(Figure source: https://wiki.illinois.edu/wiki/display/ECE408)
35
Shared memory - Work flow
• Global memory read/write is slow
• To avoid Global Memory bottleneck, tile the input
data to take advantage of Shared Memory:
• Partition data into subsets that fit into the (smaller but
faster) shared memory
• Handle each data subset with one block by:
• Loading the subset from global memory to shared memory, using
multiple threads to exploit memory-level parallelism
• Performing the computation on shared memory. Each thread can
efficiently access any data element
• Copying results from shared memory to global memory
36
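A minimal sketch of this workflow applied to the 1D convolution from earlier, using a statically allocated SMEM tile (BLOCK_SIZE is our own assumption and must match the launch configuration; this is not the exact code from the slides):

#define BLOCK_SIZE 512   // assumption: must equal blockDim.x used at launch

__global__ void convOnDeviceSmem(float *d_in, float *d_out)
{
    // Each block needs BLOCK_SIZE + NF - 1 consecutive input elements
    __shared__ float s_in[BLOCK_SIZE + NF - 1];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // 1) Load the block's input tile from GMEM to SMEM, using all threads
    for (int j = threadIdx.x; j < BLOCK_SIZE + NF - 1; j += BLOCK_SIZE)
    {
        int inIdx = blockIdx.x * BLOCK_SIZE + j;
        s_in[j] = (inIdx < NI) ? d_in[inIdx] : 0.0f;
    }
    __syncthreads();   // wait until the whole tile is in SMEM

    // 2) Compute on SMEM; 3) write the result back to GMEM
    if (i < NO)
    {
        float s = 0;
        for (int j = 0; j < NF; j++)
            s += d_flt[j] * s_in[threadIdx.x + j];
        d_out[i] = s;
    }
}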
Shared memory – Static Allocate
37
Shared memory – Static Allocate
Your homework
40
Tiled Matrix Multiplication
41
Tiled Matrix Multiplication
• The data of A and B is in GMEM, so every time an element of A/B is needed, threads must read it from GMEM.
• The same element of A/B is read by several different threads in the same block.
• We can reduce the number of GMEM accesses by:
• First, each block will read the A/B data that the block
needs from GMEM and store it in the block's SMEM.
Each element is read once
• Then, when A/B data is needed, threads in the block can
read the data in SMEM at high speed.
42
Tiled Matrix Multiplication
• Problem: A and B are too large to store entirely in SMEM.
• Solution:
  • Partition the data that the block needs into subsets, e.g., A1, A2, … and B1, B2, …
  • In each phase:
    • The block loads Ai and Bi from GMEM to SMEM.
    • It computes a partial matrix product and adds it to the previous partial result.
  • Repeat for all subsets of A and B (see the skeleton sketch below).
43
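A skeleton of this phase loop for square n×n matrices (TILE_WIDTH and the bounds checks are our own choices; completing and tuning the kernel is the homework on the next slide):

#define TILE_WIDTH 16   // assumption: block is TILE_WIDTH x TILE_WIDTH threads

__global__ void tiledMatMul(const float *A, const float *B, float *C, int n)
{
    __shared__ float sA[TILE_WIDTH][TILE_WIDTH];   // tile Ai
    __shared__ float sB[TILE_WIDTH][TILE_WIDTH];   // tile Bi

    int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    int col = blockIdx.x * TILE_WIDTH + threadIdx.x;
    float s = 0;

    // One phase per pair of tiles (Ai, Bi)
    for (int ph = 0; ph < (n + TILE_WIDTH - 1) / TILE_WIDTH; ph++)
    {
        // Load Ai and Bi from GMEM to SMEM (each element is read once per block)
        int aCol = ph * TILE_WIDTH + threadIdx.x;
        int bRow = ph * TILE_WIDTH + threadIdx.y;
        sA[threadIdx.y][threadIdx.x] = (row < n && aCol < n) ? A[row * n + aCol] : 0.0f;
        sB[threadIdx.y][threadIdx.x] = (bRow < n && col < n) ? B[bRow * n + col] : 0.0f;
        __syncthreads();

        // Partial product from this pair of tiles, added to the running result
        for (int j = 0; j < TILE_WIDTH; j++)
            s += sA[threadIdx.y][j] * sB[j][threadIdx.x];
        __syncthreads();
    }

    if (row < n && col < n)
        C[row * n + col] = s;
}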
Tiled Matrix Multiplication
• Your homework.
44
Summary
Utilize high speed memories to
reduce DRAM accesses
Image source:
http://www.realworldtech.com/includes/images/articles/
45
Declaring CUDA Variables
Variable declaration                       | Memory   | Scope  | Lifetime
int LocalVar;                              | register | thread | thread
__device__ __shared__ int SharedVar;       | shared   | block  | block
__device__ int GlobalVar;                  | global   | grid   | application
__device__ __constant__ int ConstantVar;   | constant | grid   | application
• __device__
• Optional with __shared__, or __constant__
• Not allowed by itself within functions
• Automatic variables with no qualifiers
  • Reside in a register for primitive types and structures
  • Per-thread arrays reside in local memory (which lies in DRAM)
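A small sketch putting the four kinds of declarations from the table side by side (illustrative only):

__device__   int GlobalVar;     // GMEM: grid scope, lives for the whole application
__constant__ int ConstantVar;   // CMEM: grid scope, read-only inside kernels

__global__ void scopes()
{
    int LocalVar = 0;           // register: one copy per thread
    __shared__ int SharedVar;   // SMEM: one copy per block
    // ...
}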
[Figure: full device memory hierarchy — registers and SMEM inside each SM, per-SM L1 and Const Caches, a shared L2 cache, and GMEM, CMEM, LMEM in DRAM]
47
Reference
• [1] Wen-mei W. Hwu, David B. Kirk, and Izzat El Hajj. Programming Massively Parallel Processors: A Hands-on Approach. Morgan Kaufmann, 2022
• [2] John Cheng, Max Grossman, and Ty McKercher. Professional CUDA C Programming. John Wiley & Sons, 2014
• [3] Illinois GPU course
https://wiki.illinois.edu/wiki/display/ECE408/ECE408+Home
48
THE END
49