
GPU Memory

• Local registers, private to each thread.
• A parallel data cache, or shared memory, shared by all the threads in a block.
• A read-only constant cache, shared by all the threads.
• A read-only texture cache, shared by all the processors.
• A per-thread local memory for what does not fit in registers.

Memory access: it takes roughly 100 times longer to access local/global memory
than registers, so maximize use of shared/register memory, as in the sketch
below.
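As a concrete illustration (not from the slides; the kernel name and tile size
are made up), a block can stage data it will reuse in fast shared memory so
that each element is read from slow global memory only once:

#define TILE 256  // assumed block size for this sketch

// Each block copies its slice of `in` into shared memory, then every
// thread reads its neighbors from the shared tile instead of global memory.
__global__ void smooth(const float *in, float *out, int n)
{
    __shared__ float tile[TILE];                   // shared by the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n)
        tile[threadIdx.x] = in[i];                 // one global read per element
    __syncthreads();                               // wait until the tile is full

    if (i < n && threadIdx.x > 0 && threadIdx.x < blockDim.x - 1)
        out[i] = (tile[threadIdx.x - 1] + tile[threadIdx.x]
                + tile[threadIdx.x + 1]) / 3.0f;   // all reads hit shared memory
}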
CUDA Memory Types
Global memory
Slow and uncached; accessible by all threads

Texture memory (read-only)
Cache optimized for 2D access; accessible by all threads

Constant memory (read-only)
Slow, but cached; accessible by all threads

Shared memory
Fast, but watch for bank conflicts; limited size; shared by threads in a block

Registers
Fast; private to one thread

Local memory
For what doesn't fit in registers; slow but cached; private to one thread
Variables: Memory Access Penalty

Registers:
The fastest form of memory on the multiprocessor. Only accessible by its thread.
Has the lifetime of the thread.

Shared Memory:
Can be as fast as a register when there are no bank conflicts or when reading from
the same address. Accessible by any thread of the block from which it was created.
Has the lifetime of the block.

Constant Memory:
Accessible by all threads. Lifetime of application. Fully cached, but limited.

Global memory:
Potentially 150x slower than register or shared memory -- watch out for uncoalesced
reads and writes. Accessible from either the host or device. Has the lifetime of the
application -- that is, it persists between kernel launches.

Local memory:
A potential performance gotcha, it resides in global memory and can be 150x slower
than register or shared memory. Is only accessible by the thread. Has the lifetime of
the thread.
Where to declare variables?

Can host access?   yes                       no
Where declared     outside of any function   in kernel
Memory space       global, constant          register, local, shared
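A short sketch tying the table to actual declarations (all variable and
function names here are illustrative only):

__device__   int   d_arr[256];  // global memory: host can access via cudaMemcpy
__constant__ float c_factor;    // constant memory: host writes with cudaMemcpyToSymbol

__global__ void kernel(float *a, int n)
{
    __shared__ float s_buf[128];                      // shared: per block, no host access
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // idx lives in a register
    float scratch[64];                                // large per-thread arrays may
                                                      // spill to local memory
    if (idx < n) {
        scratch[idx % 64] = a[idx] * c_factor;
        s_buf[threadIdx.x % 128] = scratch[idx % 64];
        a[idx] = s_buf[threadIdx.x % 128] + d_arr[idx % 256];
    }
}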
Global Memory: DRAM on card

Found by deviceQueryDrv; I have 1024 MB

Declared outside of any function

__device__ int globalArray[256];

Assigned by cudaMemcpy

cudaMemcpy is a blocking transfer; the host thread waits until the transfer is complete

or

int *myDeviceMemory = 0;
cudaMalloc(&myDeviceMemory, 256 * sizeof(int));
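Putting the pieces together, a minimal sketch of the usual round trip (the
array size and names are illustrative):

int h_a[256];                              // host array
for (int i = 0; i < 256; i++) h_a[i] = i;

int *d_a = 0;
cudaMalloc(&d_a, 256 * sizeof(int));       // allocate device global memory
cudaMemcpy(d_a, h_a, 256 * sizeof(int), cudaMemcpyHostToDevice);  // blocking

// ... launch kernels that read and write d_a ...

cudaMemcpy(h_a, d_a, 256 * sizeof(int), cudaMemcpyDeviceToHost);  // copy back
cudaFree(d_a);                             // release device memory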
Global Memory
A CUDA stream is a sequence of operations that execute on the device in the
order they are issued by the host. Operations from different streams can be
interleaved.

When no stream specified, the default (null) stream is assumed.

cudaMemcpy(d_a, a, numBytes, cudaMemcpyHostToDevice);
increment<<<1,N>>>(d_a);
cudaMemcpy(a, d_a, numBytes, cudaMemcpyDeviceToHost);

All of these are on the default stream. The kernel call waits until the first
cudaMemcpy is complete, and the increment kernel has to complete before the
cudaMemcpy back to the host is started.

Alternatively:

cudaMemcpy(d_a, a, numBytes, cudaMemcpyHostToDevice);
increment<<<1,N>>>(d_a);
myCpuFunction(b); // kernel launches are asynchronous, so this host function
                  // runs while the kernel executes on the device
cudaMemcpy(a, d_a, numBytes, cudaMemcpyDeviceToHost);
Global Memory
Asynchronous transfer

Use pinned host memory allocated with cudaHostAlloc; then cudaMemcpyAsync is a
non-blocking version of cudaMemcpy. It requires a stream ID.

cudaMemcpyAsync(a_d, a_h, size, cudaMemcpyHostToDevice, 0);
kernel_function<<<grid, block>>>(a_d);
cpu_function();

The copy is issued on the default stream 0, and the kernel uses the default
stream as well, so no explicit synchronization is required between them: they
execute in issue order, the same as before.

cpu_function waits for neither the memory transfer nor kernel_function to
finish. A complete sketch of this pattern follows.
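A minimal sketch of this pattern as a complete program (the kernel body,
sizes, and launch configuration are assumed for illustration):

__global__ void kernel_function(float *a) { a[threadIdx.x] += 1.0f; }
void cpu_function(void) { /* independent host work */ }

int main(void)
{
    const int n = 256;
    const size_t size = n * sizeof(float);
    float *a_h, *a_d;

    cudaHostAlloc(&a_h, size, cudaHostAllocDefault); // pinned host memory
    cudaMalloc(&a_d, size);
    for (int i = 0; i < n; i++) a_h[i] = (float)i;

    cudaMemcpyAsync(a_d, a_h, size, cudaMemcpyHostToDevice, 0); // non-blocking
    kernel_function<<<1, n>>>(a_d);  // stream 0: starts after the copy finishes
    cpu_function();                  // meanwhile the CPU keeps working

    cudaDeviceSynchronize();         // wait before using device results
    cudaFreeHost(a_h);
    cudaFree(a_d);
    return 0;
}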
Concurrent copy and kernel execution
Availability: DeviceQueryDrv:
“Concurrent copy and kernel execution: Yes with 1 copy engine(s)”

Requires compute capability ≥ 1.1

cudaStream_t stream1, stream2;


cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
cudaMemcpyAsync(a_d, a_h, size, cudaMemcpyHostToDevice, stream1);
kernel<<<grid, block, 0, stream2>>>(otherData_d);

See http://devblogs.nvidia.com/parallelforall/how-overlap-data-transfers-cuda-cc/
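Following the pattern in that post, a sketch of chunked overlap. It assumes
a_h is pinned host memory, a_d is device memory, `kernel` is a __global__
function defined elsewhere, and n is divisible by nStreams * 256:

__global__ void kernel(float *a);  // assumed to exist elsewhere

void overlapped_run(float *a_h, float *a_d, int n)
{
    const int nStreams = 2;
    cudaStream_t stream[nStreams];
    for (int i = 0; i < nStreams; i++) cudaStreamCreate(&stream[i]);

    const int chunk = n / nStreams;
    for (int i = 0; i < nStreams; i++) {
        int off = i * chunk;
        // While one stream copies, the other can run its kernel.
        cudaMemcpyAsync(&a_d[off], &a_h[off], chunk * sizeof(float),
                        cudaMemcpyHostToDevice, stream[i]);
        kernel<<<chunk / 256, 256, 0, stream[i]>>>(&a_d[off]);
        cudaMemcpyAsync(&a_h[off], &a_d[off], chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, stream[i]);
    }
    for (int i = 0; i < nStreams; i++) {
        cudaStreamSynchronize(stream[i]);
        cudaStreamDestroy(stream[i]);
    }
}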
Coalesced Global Memory Access
Threads in a block are executed a warp at a time (32 threads).

Global data is read or written in as few transactions as possible by combining
memory access requests into a single transaction. This is referred to as the
device coalescing memory reads and stores.

Each aligned, consecutive 128-byte segment (32 single-precision words) can be
accessed by a warp in one transaction.

All 32 reads are done in one step, so making blockDim.x a multiple of 32 is better.

Non-aligned Memory
Data not contained within a single aligned 128-byte segment takes two
transactions, about twice as long to read.
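A sketch contrasting the two cases (the kernel names are made up):

// Coalesced: thread i touches word i, so a warp reads 32 consecutive words --
// one aligned 128-byte transaction.
__global__ void coalesced(float *a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    a[i] += 1.0f;
}

// Strided: consecutive threads touch words far apart, so the warp's reads
// span many 128-byte segments and need many transactions.
__global__ void strided(float *a, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    a[i] += 1.0f;
}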
Constant Memory
For data that does not change over the course of the computation. It is read-only.

64 KB is available on my machine (see deviceQueryDrv), and it is cached, so a
read can take as little as one clock cycle, as opposed to 100+ for global memory.

Using constant memory: declare it outside of main()

__constant__ float cdata; // available in all scopes

To get numbers into the constant memory variable cdata, use

cudaMemcpyToSymbol(const void *symbol, const void *src, size_t count,
                   size_t offset = 0, enum cudaMemcpyKind kind = cudaMemcpyHostToDevice);

Best if all threads in a warp read the same constant data; divergent reads are
serialized and are slower.
Constant Memory
deviceQueryDrv: I have 65536 bytes (64 KB)

Declare outside of any function:

__constant__ float var; // available in all functions

Note: constant memory is available to all threads, like global memory; however,
the data is cached.

As fast as a register if all threads read the same address.

// copy data from host to constant memory in main():
cudaMemcpyToSymbol(var, &host_var, data_size);

// var is available to kernels
__global__ void kernel(float *a, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
    {
        a[idx] = a[idx] * var;
    }
}
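For completeness, a host-side sketch that would drive the kernel above (N, the
block size, and the host data are assumed for illustration):

int main(void)
{
    const int N = 1024;
    float host_var = 2.0f, *a_d;
    float a_h[N];
    for (int i = 0; i < N; i++) a_h[i] = (float)i;

    cudaMemcpyToSymbol(var, &host_var, sizeof(float)); // fill constant memory
    cudaMalloc(&a_d, N * sizeof(float));
    cudaMemcpy(a_d, a_h, N * sizeof(float), cudaMemcpyHostToDevice);

    kernel<<<(N + 255) / 256, 256>>>(a_d, N);   // every thread reads the same var

    cudaMemcpy(a_h, a_d, N * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(a_d);
    return 0;
}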
Class Problem

Write a GPU program to calculate the potential energy of N particles. The
particles have random vertical positions between 0 and 100 m, and the mass of
each particle is the same (m = 0.2 kg).

N = 2048
g = 9.81 m/s^2, put into constant memory

Note: #include <stdlib.h> provides rand() on the host.

(float)(rand() % 100) should give you a good random elevation. (The modulus
must be applied to the integer returned by rand() before converting to float;
the % operator does not work on floats.)

NOTE: rand() by itself is repeatable; calling srand(time(NULL)) first will
randomize rand(). This also requires #include <time.h>.

PotentialEnergy.cu
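One possible solution sketch, assuming the structure described above (the
actual PotentialEnergy.cu handed out in class may differ):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 2048
__constant__ float g;                 // 9.81 m/s^2, in constant memory

__global__ void potential(const float *h, float *e, float m, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) e[i] = m * g * h[i];   // E = m*g*h per particle
}

int main(void)
{
    float h_h[N], e_h[N], *h_d, *e_d;
    float host_g = 9.81f, m = 0.2f, total = 0.0f;

    srand(time(NULL));
    for (int i = 0; i < N; i++) h_h[i] = (float)(rand() % 100); // 0..99 m

    cudaMemcpyToSymbol(g, &host_g, sizeof(float));
    cudaMalloc(&h_d, N * sizeof(float));
    cudaMalloc(&e_d, N * sizeof(float));
    cudaMemcpy(h_d, h_h, N * sizeof(float), cudaMemcpyHostToDevice);

    potential<<<N / 256, 256>>>(h_d, e_d, m, N);

    cudaMemcpy(e_h, e_d, N * sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < N; i++) total += e_h[i];  // sum on the host
    printf("Total potential energy: %f J\n", total);

    cudaFree(h_d); cudaFree(e_d);
    return 0;
}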
Unified Memory: CUDA 6.0
Existing model:

Allocate memory on the host
Allocate memory on the device
Copy data from host to device
Operate on the data on the GPU
Copy data back to the host

Unified memory model:

Allocate memory (looks the same to the CPU as malloc, and the same to the GPU
as cudaMalloc)
Operate on the data on the GPU

NOTE: Linux or Windows only, right now

New memory model: CUDA 6+
int N = 2048;
float *data;

cudaMallocManaged(&data, N * sizeof(float)); // size in bytes, not elements

// ... generate data ...

kernel<<<...>>>(data, N);

cudaDeviceSynchronize();
// Synchronize so the GPU finishes before we do anything with the data on the CPU

cudaFree(data);

See PotentialEnergyUnifiedMem.cu as compared to PotentialEnergy.cu.

NOTE: no d_data; the same data pointer is used on both devices.

PotentialEnergyUnifiedMem.cu
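A runnable version of the fragment above, with an assumed kernel and launch
configuration (PotentialEnergyUnifiedMem.cu itself may differ):

#include <stdio.h>

__global__ void scale(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main(void)
{
    int N = 2048;
    float *data;
    cudaMallocManaged(&data, N * sizeof(float)); // visible to both CPU and GPU

    for (int i = 0; i < N; i++) data[i] = (float)i;  // generate data on the CPU

    scale<<<N / 256, 256>>>(data, N);
    cudaDeviceSynchronize();   // GPU must finish before the CPU touches data

    printf("data[0] = %f\n", data[0]);
    cudaFree(data);
    return 0;
}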
