Lec6 Cuda Memory

The document discusses CUDA memory management in GPU programming, covering registers, shared memory, and global memory. It explains the GPU memory hierarchy, provides CUDA code examples for memory allocation and reduction, and highlights the importance of optimizing memory usage for performance. It also addresses synchronization, memory alignment, and coalescing to improve program efficiency.


Lecture 6 – CUDA Memory

THANH TUAN DAO


VNU-UET 12/03/2025

Today's Agenda
Register
Shared Memory
Global Memory

GPU Memory System
GPU memory hierarchy
Registers: ~ a few to 20 cycles
L1 and L2 caches: ~ 20-40 cycles
Shared memory: ~ 20-40 cycles
Global memory: up to a few hundred cycles

[Figure: the Fermi architecture]
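To make these numbers concrete, the capacities behind this hierarchy can be queried at runtime with cudaGetDeviceProperties. The sketch below is illustrative only (device index 0 is assumed) and is not part of the original slides:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query device 0
    printf("Registers per block:     %d\n", prop.regsPerBlock);
    printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("L2 cache size:           %d bytes\n", prop.l2CacheSize);
    printf("Global memory:           %zu bytes\n", prop.totalGlobalMem);
    return 0;
}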

CUDA Example: Find the GPU registers
#include <iostream>
#include <cstdlib>
#include <cuda_runtime.h>

#define BLOCK_SIZE 256

// Naive Reduction (No Shared Memory)
// Note: there is no grid-wide synchronization inside a kernel, so the
// cross-block additions below are only safe when a single block is launched.
__global__ void reductionNaive(float *input, float *output, int N) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    for (int stride = 1; stride < N; stride *= 2) {
        if (tid % (2 * stride) == 0 && tid + stride < N) {
            input[tid] += input[tid + stride];
        }
        __syncthreads();
    }
    if (tid == 0) {
        *output = input[0];
    }
}

// Optimized Reduction Using Shared Memory
__global__ void reductionShared(float *input, float *output, int N) {
    __shared__ float temp[BLOCK_SIZE];
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int local_tid = threadIdx.x;

    temp[local_tid] = (tid < N) ? input[tid] : 0.0f;
    __syncthreads();

    // Reduction within block
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (local_tid < stride) {
            temp[local_tid] += temp[local_tid + stride];
        }
        __syncthreads();
    }
    if (local_tid == 0) {
        atomicAdd(output, temp[0]);
    }
}

int main() {
    int N = 1 << 20; // 1 million elements
    size_t size = N * sizeof(float);

    // Allocate host memory
    float *h_input = (float *)malloc(size);
    float h_output = 0.0f;

    // Initialize input array
    for (int i = 0; i < N; i++) {
        h_input[i] = 1.0f;
    }

    // Allocate device memory
    float *d_input, *d_output;
    cudaMalloc((void **)&d_input, size);
    cudaMalloc((void **)&d_output, sizeof(float));

    // Copy data to device
    cudaMemcpy(d_input, h_input, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_output, &h_output, sizeof(float), cudaMemcpyHostToDevice);

    // Launch naive kernel
    int blocksPerGrid = (N + BLOCK_SIZE - 1) / BLOCK_SIZE;
    reductionNaive<<<blocksPerGrid, BLOCK_SIZE>>>(d_input, d_output, N);
    cudaMemcpy(&h_output, d_output, sizeof(float), cudaMemcpyDeviceToHost);
    std::cout << "Naive Reduction result: " << h_output << std::endl;

    // Restore the input (the naive kernel reduced it in place),
    // reset the output, and launch the shared memory optimized kernel
    cudaMemcpy(d_input, h_input, size, cudaMemcpyHostToDevice);
    h_output = 0.0f;
    cudaMemcpy(d_output, &h_output, sizeof(float), cudaMemcpyHostToDevice);
    reductionShared<<<blocksPerGrid, BLOCK_SIZE>>>(d_input, d_output, N);
    cudaMemcpy(&h_output, d_output, sizeof(float), cudaMemcpyDeviceToHost);
    std::cout << "Shared Memory Reduction result: " << h_output << std::endl;

    // Free memory
    cudaFree(d_input);
    cudaFree(d_output);
    free(h_input);
    return 0;
}

GPU Registers
Each SM has a fixed number of registers
Allocated (partitioned) to active threads
Register allocation is handled by the compiler
If threads need more registers than are available, the compiler spills the excess to local memory
Local memory physically resides in global memory and may be cached in L1 and L2
Register usage can be capped with nvcc --maxrregcount=<amount>

Using register smartly can greatly improve your programs!
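As a hedged illustration (the file name lec6.cu and kernel myKernel are placeholders, and exact register counts depend on the GPU and CUDA version), register usage can be inspected with verbose ptxas output and capped either per kernel with __launch_bounds__ or globally with --maxrregcount:

// Hint to the compiler: at most 256 threads per block and at least
// 2 resident blocks per SM, which bounds the registers per thread.
__global__ void __launch_bounds__(256, 2) myKernel(float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= 2.0f;   // assumes the grid exactly covers the array
}

// Show registers (and spills) per kernel at compile time:
//   nvcc -Xptxas -v lec6.cu
// Cap register usage for the whole file (may force spills to local memory):
//   nvcc --maxrregcount=32 lec6.cu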

GPU Shared Memory
Each SM has a fixed amount of shared memory
Allocated (partitioned) to active blocks
Threads in a block can read and write the shared memory allocated to that block
Threads in a block use block-level synchronization to guarantee memory consistency
Shared memory can be declared using either static or dynamic method
Static: __shared__ float temp[BLOCK_SIZE];
Dynamic: extern __shared__ float temp[]; ... Kernel <<<grid, block, shared_mem_size>>>(...);

Using shared memory smartly can greatly improve your programs!
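A minimal sketch of the two declaration styles (the kernel names and the block-reversal operation are made up for illustration; d_data is assumed to hold BLOCK_SIZE floats):

#define BLOCK_SIZE 256

// Static: the size is fixed at compile time
__global__ void staticSharedKernel(float *data) {
    __shared__ float temp[BLOCK_SIZE];
    temp[threadIdx.x] = data[threadIdx.x];
    __syncthreads();
    data[threadIdx.x] = temp[BLOCK_SIZE - 1 - threadIdx.x];  // reverse within the block
}

// Dynamic: the size is chosen at launch time (third <<<>>> parameter)
__global__ void dynamicSharedKernel(float *data, int n) {
    extern __shared__ float temp[];
    temp[threadIdx.x] = data[threadIdx.x];
    __syncthreads();
    data[threadIdx.x] = temp[n - 1 - threadIdx.x];
}

// Launches:
//   staticSharedKernel<<<1, BLOCK_SIZE>>>(d_data);
//   dynamicSharedKernel<<<1, BLOCK_SIZE, BLOCK_SIZE * sizeof(float)>>>(d_data, BLOCK_SIZE);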

GPU Shared Memory: Bank Conflict (1)
Shared memory is divided into 32 equally sized banks (matching the 32 threads in a warp)
Accesses to different banks can be serviced simultaneously
A memory transaction can access one location per bank
Three situations can occur when a warp issues a shared memory request
Parallel access: the addresses fall in multiple different banks (serviced in one transaction)
Serial access: multiple addresses fall within the same bank (the accesses are serialized)
Broadcast access: all threads read a single address in a single bank (one fetch, broadcast to the threads)
Bank conflicts arise when multiple threads in a warp access different addresses within the same bank

GPU Shared Memory: Bank Conflict (2)

[Figure: shared memory access patterns: parallel accesses, serial accesses, and serial accesses (not necessarily broadcasting)]
Mapping Address to Bank Index
Bank width is the number of consecutive bytes assigned to one bank (4 bytes in 32-bit mode, 8 bytes in 64-bit mode)
Two modes: 32-bit mode and 64-bit mode (Kepler and later architectures support both)
32-bit mode: bank index = (byte address / 4 bytes per bank) % 32 banks
64-bit mode: bank index = (byte address / 8 bytes per bank) % 32 banks

A bank conflict does not occur when two threads from the same warp access the same address
Read accesses are broadcast to the requesting threads
Write accesses are performed by only one (unspecified) of the threads
[Figure: bank layout in 32-bit mode and in 64-bit mode]
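As a tiny worked example of the 32-bit formula (the helper name is made up), the bank index of element i of a __shared__ float array is simply i % 32, because each 4-byte float occupies exactly one bank slot:

// Bank index of element i of a __shared__ float array in 32-bit mode
__host__ __device__ inline int bankIndexOfFloat(int i) {
    int byteAddress = i * 4;           // byte offset of element i
    return (byteAddress / 4) % 32;     // equals i % 32
}
// Example: elements 0 and 32 both map to bank 0, so two threads of the
// same warp reading them would cause a 2-way bank conflict.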

Bank Conflicts Examples
64-bit mode examples (figure):
Conflict-free: each thread accesses a different bank
Conflict-free: two threads access the same 8-byte word (broadcast, no conflict)
2-way conflict: two threads access different words in the same bank
3-way conflict: three threads access different words in the same bank
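A standard way to remove such conflicts is to pad a 2D shared memory tile by one element per row. The matrix-transpose sketch below is illustrative (it assumes width is a multiple of 32 and a 32x32 thread block), not code from the slides:

#define TILE 32

__global__ void transposeTile(const float *in, float *out, int width) {
    // The +1 padding shifts each row by one bank, so reading a column
    // (tile[0][x], tile[1][x], ...) no longer hits a single bank.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;
    int y = blockIdx.y * TILE + threadIdx.y;
    tile[threadIdx.y][threadIdx.x] = in[y * width + x];      // coalesced load
    __syncthreads();

    int tx = blockIdx.y * TILE + threadIdx.x;
    int ty = blockIdx.x * TILE + threadIdx.y;
    out[ty * width + tx] = tile[threadIdx.x][threadIdx.y];   // conflict-free read
}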

Synchronization
Block-level synchronization
__syncthreads(): waits until all threads in the block have reached this point

Affects all threads in the same block


Atomic operations
E.g., float atomicAdd(float *addr, float amount)
Perform read-modify-write operations atomically on global or shared memory
Program-level synchronization
E.g., cudaDeviceSynchronize()
Blocks the host until all previously issued device operations (kernel launches, memory copies) have completed
Memory Fences
Ensure any memory write before the fence is visible to other threads after the fence
void __threadfence(); void __threadfence_block(); void __threadfence_system();
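To see a fence in action, the sketch below follows the well-known "last block done" pattern (names such as fenceSum and blocksDone are placeholders): each block publishes a partial sum, __threadfence() makes it visible device-wide, and the last block to finish combines the partials:

__device__ unsigned int blocksDone = 0;

__global__ void fenceSum(const float *in, float *partial, float *total, int N) {
    __shared__ bool isLastBlockDone;

    if (threadIdx.x == 0) {
        // Thread 0 serially sums this block's slice (kept trivial here;
        // a real kernel would reduce the slice in parallel first).
        float p = 0.0f;
        int begin = blockIdx.x * blockDim.x;
        int end = min(begin + (int)blockDim.x, N);
        for (int i = begin; i < end; i++) p += in[i];
        partial[blockIdx.x] = p;

        __threadfence();                                // publish partial[blockIdx.x] device-wide
        unsigned int done = atomicInc(&blocksDone, gridDim.x);
        isLastBlockDone = (done == gridDim.x - 1);      // true only in the last finishing block
    }
    __syncthreads();                                    // broadcast the flag within the block

    if (isLastBlockDone && threadIdx.x == 0) {
        float sum = 0.0f;
        for (unsigned int b = 0; b < gridDim.x; b++) sum += partial[b];
        *total = sum;
        blocksDone = 0;                                 // reset for the next launch
    }
}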

GPU Global Memory
Allocated from the host with cudaMalloc
Resides in device memory
Accessible via 32-byte, 64-byte or 128-byte memory transactions
The number of transactions required by a request depends on
Distribution of memory addresses across threads in a warp
Alignment of requested memory addresses
Optimizing the number of memory transactions is crucial for performance
Global memory access can be cached by
L1, L2, read-only constant, read-only texture caches

Static Global Memory
Use cudaMemcpyToSymbol to transfer data from the host to a statically declared __device__ variable

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

__device__ float devData;

__global__ void checkGlobalVariable() {
    // display the original value
    printf("Device: the value of the global variable is %f\n", devData);
    // alter the value
    devData += 2.0f;
}

int main(void) {
    // initialize the global variable
    float value = 3.14f;
    cudaMemcpyToSymbol(devData, &value, sizeof(float));
    printf("Host: copied %f to the global variable\n", value);
    // invoke the kernel
    checkGlobalVariable<<<1, 1>>>();
    // copy the global variable back to the host
    cudaMemcpyFromSymbol(&value, devData, sizeof(float));
    printf("Host: the value changed by the kernel to %f\n", value);
    cudaDeviceReset();
    return EXIT_SUCCESS;
}

Memory Allocation and Transfer
Use cudaMemcpy to transfer data between the host and the device

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    // set up device
    int dev = 0;
    cudaSetDevice(dev);
    // memory size
    unsigned int isize = 1 << 22;
    unsigned int nbytes = isize * sizeof(float);
    // get device information
    cudaDeviceProp deviceProp;
    cudaGetDeviceProperties(&deviceProp, dev);
    printf("%s starting at ", argv[0]);
    printf("device %d: %s memory size %d nbyte %5.2fMB\n",
           dev, deviceProp.name, isize, nbytes / (1024.0f * 1024.0f));
    // allocate the host memory
    float *h_a = (float *)malloc(nbytes);
    // allocate the device memory
    float *d_a;
    cudaMalloc((float **)&d_a, nbytes);
    // initialize the host memory
    for (unsigned int i = 0; i < isize; i++) h_a[i] = 0.5f;
    // transfer data from the host to the device
    cudaMemcpy(d_a, h_a, nbytes, cudaMemcpyHostToDevice);
    // transfer data from the device to the host
    cudaMemcpy(h_a, d_a, nbytes, cudaMemcpyDeviceToHost);
    // free memory
    cudaFree(d_a);
    free(h_a);
    // reset device
    cudaDeviceReset();
    return EXIT_SUCCESS;
}

Pinned memory
Default host memory is pageable
It can be paged out by the OS at any time and paged back in on a page fault
When copying from pageable host memory, the CUDA driver first stages the data in a pinned (page-locked) buffer before transferring it to GPU memory
Use cudaMallocHost to allocate pinned host memory directly
Allocating pinned memory is more expensive, but host-device transfers from it are faster
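A minimal sketch of allocating and using pinned memory (the 16 MB size is arbitrary); cudaMallocHost and cudaFreeHost are the relevant API calls:

size_t nbytes = 1 << 24;                       // 16 MB
float *h_pinned = NULL;
cudaMallocHost((void **)&h_pinned, nbytes);    // pinned (page-locked) host memory

float *d_a;
cudaMalloc((void **)&d_a, nbytes);

// No staging copy is needed, so the transfer is typically faster
// than from pageable memory
cudaMemcpy(d_a, h_pinned, nbytes, cudaMemcpyHostToDevice);

cudaFree(d_a);
cudaFreeHost(h_pinned);                        // pinned memory must be freed with cudaFreeHost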

Zero-copy memory
Zero-copy memory is pinned memory that is mapped into the device address space
No explicit copy is required
Transfers happen implicitly, driven by the kernel's accesses to the mapped memory
Slow when the same data is accessed repeatedly from the kernel, since every access crosses the PCIe bus
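A hedged sketch of setting up zero-copy memory (someKernel, grid, block, and nbytes are placeholders); cudaHostAlloc with cudaHostAllocMapped and cudaHostGetDevicePointer are the relevant API calls:

cudaSetDeviceFlags(cudaDeviceMapHost);                        // enable mapping before other CUDA work

float *h_a = NULL;
cudaHostAlloc((void **)&h_a, nbytes, cudaHostAllocMapped);    // mapped, pinned host memory

float *d_a = NULL;
cudaHostGetDevicePointer((void **)&d_a, (void *)h_a, 0);      // device-side pointer to the same memory

// The kernel dereferences d_a directly; every access crosses PCIe,
// so this pays off mainly for data touched only a few times
someKernel<<<grid, block>>>(d_a);
cudaDeviceSynchronize();

cudaFreeHost(h_a);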
HW: rewrite the VecAdd example with pinned memory and with zero-copy memory
Compare the performance
Write a report to the TA

Memory Alignment and Coalescing
Aligned memory accesses occur when the first address of a device memory transaction is an even multiple of the cache granularity used to service the transaction (32 bytes for L2, 128 bytes for L1)
Coalesced memory accesses occur when all 32 threads in a warp access a contiguous chunk of memory (a single memory segment)

[Figures: an aligned and coalesced access vs. a misaligned and uncoalesced access]
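As an illustrative sketch (the kernel names are made up), the first kernel below generates coalesced accesses while the second generates strided, uncoalesced ones and therefore needs many more memory transactions per warp:

// Coalesced: consecutive threads read consecutive floats, so a warp's
// 32 loads fall into one or two 128-byte segments
__global__ void copyCoalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: consecutive threads are stride elements apart, so each
// load may touch a different memory segment
__global__ void copyStrided(const float *in, float *out, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}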

Thank you!
