CSE Lec4 Cuda

The document provides an introduction to CUDA and GPU architecture, detailing the structure and functioning of GPUs, including their cores, memory hierarchy, and the CUDA programming model. It explains the concept of CUDA kernels, thread organization, and memory management, emphasizing the importance of parallelism for efficient GPU computing. Additionally, it outlines the execution flow of CUDA programs and provides examples of vector addition on the device.

Introduction to CUDA

CSE 461
Spring ‘24

Dr. Arpan Jain


SVNIT Surat
E-mail: [email protected]
http://www.cse.ohio-state.edu/~jain.575
Introduction to the GPU
• Comprises many cores (a number that has roughly doubled each year)
– The Tesla V100 has 5,120 CUDA Cores and 640 Tensor Cores
• Each core
– Runs at a clock speed significantly slower than a CPU’s clock
– Heavily multithreaded, in-order, single-instruction issue processor
• SIMD − single instruction, multiple-data
– Shares its control and instruction cache with multiple other cores
• Focus on execution throughput of massively-parallel programs.
• Much better Floating-Point Operations per Second (FLOPS) than CPUs
• Do not have virtual memory, interrupts, or means of addressing devices such as
the keyboard and the mouse

SVNIT, Surat CSE 461 2


Introduction to the GPU (Cont.)
• Terribly inefficient when we do not have SPMD
– Such programs are best handled by CPUs
– e.g., Serial operations
• Designed for data-intensive applications
– Supported by significantly higher memory bandwidths on GPUs
• The Tesla V100 has 900 GB/s of HBM2 bandwidth

• Originally designed for 3D rendering


– Requires holding large amount of texture and polygon data
– Caches cannot hold such large amount of data
– The only design that would have increased rendering performance was to increase the bus width and the memory
clock.
• Intel i7 has a memory bus of width 192b and a memory clock up to 800MHz.
• GTX 285 had a bus width of 512b, and a memory clock of 1242 MHz.

• Recently, CPUs are also adopting the concept of high bandwidth memory and 3D stacked memory to
improve memory access performance.

SVNIT, Surat CSE 461 3


GPU Processor

• Basic unit is called a Streaming Multiprocessor (SM)
• Each SM has some set of Streaming Processors (SP)
– 16 SMs with 8 SPs each = 128 SPs
• Each SP has
– A MAD unit (Multiply-and-Add unit)
– An additional MU (Multiply Unit)
• Each SP is massively threaded and can run thousands of threads
– The G80 card supports 96 threads per SP
– Each SM has 8 SPs, so each SM supports a maximum of 768 threads.
– Total threads that can run: 16 * 8 * 96 = 12,288 -> ‘massively parallel’
[Figure: SMs and SPs in the GPU]
SVNIT, Surat CSE 461 4
The CUDA Kernel
• A CUDA kernel is a function that gets executed on the GPU
• The parallel portion of your application is executed K times in parallel by K different CUDA threads, as opposed to only one time like regular C/C++ functions
• A group of threads is called a CUDA block; CUDA blocks are grouped into a grid
• A kernel is executed as a grid of blocks of threads
• Each CUDA block is executed by one SM
– Cannot be migrated to other SMs in the GPU
• Except during preemption, debugging, or CUDA dynamic parallelism
• One SM can run several concurrent CUDA blocks depending on the resources needed by the CUDA blocks
• Each kernel is executed on one device, and CUDA supports running multiple kernels on a device at one time
[Figure: the kernel is a function executed on the GPU; CUDA kernels are subdivided into blocks and executed on the SMs of the GPU]
SVNIT, Surat CSE 461 5
GPU Thread Hierarchy
• Half-Warp/Warp
– Group of consecutive threads and generally executed together in parallel
– A warp is what executes on each SM at any given timestep
– Are aligned
• E.g., Threads 0->15 will be in the same half-warp/warp, 16->31, etc.
• Block
– A block is made up of warps
– Shared memory is shared among all threads in a block
• Threads within the same block can synchronize with each other, and
quickly communicate with each other
– Synchronization occurs at the block level
– So, the block is the ‘scope’ within which sets of threads can communicate
• Grid
– A collection of blocks
– Blocks cannot synchronize with each other
• Threads within one block cannot synchronize with threads in another

Kernel execution on GPU


SVNIT, Surat CSE 461 6
GPU Memory
• GPUs have less memory, but substantially higher
memory bandwidths
– The Tesla V100 has 16 GB HBM2 @ 900 GB/s
• Fully coherent L2 Cache
– Smaller than a CPU's L2 cache, but with much higher bandwidth
– The Tesla V100 has 6 MB of L2 and 10 MB of L1 cache
• Memory Coalescing
– Due to the high degree of parallelism, many threads may want to write to memory at the same time
– The GPU coalesces memory operations to minimize the number of transactions
[Figure: memory structure in an NVIDIA Fermi GPU]

SVNIT, Surat CSE 461 7


GPU Memory (Cont.)
• Registers
– These are private to each thread => registers assigned to a thread are not visible to other threads
– The compiler makes decisions about register utilization.
• L1 Cache/Shared Memory
– Each SM has a small amount of Shared memory
– Threads within the same block can quickly and easily communicate with each other by writing and reading to the
shared memory
– Generally used as a very quick working space for threads within a block
• At least 100 times faster than global memory, very advantageous if used correctly

• L2 cache
– Shared across all SMs => every thread in every CUDA block can access it
• Global Memory
– Global memory can be thought of as the physical memory on your graphics card.
– All threads can read and write to Global memory.
– You can even read and write to Global memory from a thread on the CPU.

SVNIT, Surat CSE 461 8


Processor and Memory in NVIDIA A100 GPU

SVNIT, Surat CSE 461 9


What is CUDA? An Introduction
• CUDA stands for Compute Unified Device Architecture
• It is an extension of the C programming language created by NVIDIA.
• CUDA lets the programmer take advantage of the hundreds of ALUs inside a
graphics processor
– Much more powerful than the handful of ALUs available in any CPU.
– This does put a limit on the types of applications that are well suited to CUDA.
• CUDA is only well suited for highly parallel algorithms
– In order to run efficiently on a GPU, you need to have many hundreds of threads.
– Generally, the more threads you have, the better.
– If you have an algorithm that is mostly serial, then CUDA does not make sense

SVNIT, Surat CSE 461 10


What is CUDA? (Cont.)

• A CUDA program has code intended for both the GPU and the CPU
• The CPU is referred to as the host, and the GPU is referred to as the device
• Device code needs a special compiler to understand the CUDA-specific APIs
• NVCC (NVIDIA C Compiler) separates the host code from the device code (kernels)
• Kernels are executed on the GPU

SVNIT, Surat CSE 461 11


The Execution Flow

CPU Serial Code

GPU Parallel Kernel

CPU Serial Code

GPU Parallel Kernel

SVNIT, Surat CSE 461 12


CUDA Thread Organization
• The programmer has explicit control on the number of threads to launch
– This is a carefully decided-upon number
• These threads collectively form a three-dimensional grid
– Remember: threads are packed into blocks, and blocks are packed into grids
• Each thread is given a unique identifier
• Used to identify which data it is to act upon
• For all threads in a block, the block index is the same
– Can be accessed using the blockIdx variable inside a kernel.
– Each thread also has an associated index, and it can be accessed by using threadIdx variable
inside the kernel.
– Note that blockIdx and threadIdx are built-in CUDA variables that are only accessible from
inside the kernel.

SVNIT, Surat CSE 461 13


CUDA Thread Organization (Cont.)
• In a similar fashion, CUDA also has gridDim and blockDim variables that are also
built-in.
• They return the dimensions of the grid and block along a particular axis
respectively.
• As an example, blockDim.x can be used to find how many threads a particular
block has along the x axis.

SVNIT, Surat CSE 461 14


CUDA Thread Organization (Cont.)
• Consider an image that needs to be processed which is 76 pixels along the x
axis, and 62 pixels along the y axis – total of 4712 pixels
• Assuming at least one thread per pixel, we need a minimum of 4712 threads
• For performance reasons (we will come to this later), let us take number of
threads in each direction to be a multiple of 4.
• One possible layout is 80 threads on x-axis and 64 threads on the y-axis for a
total of 5120 threads. We’ll ensure extra threads do not do any work.
• Our grid has 5120 threads. How can we split it into blocks?
• One possible layout is to have 16x16x1 blocks (X*Y*Z).

SVNIT, Surat CSE 461 15


CUDA Thread Organization (Cont.)

• Each block is 16x16x1 threads (X*Y*Z)
• 5 blocks on the x-axis
• 4 blocks on the y-axis
• We use these dimensions when launching the kernel, as in the sketch below
[Figure: a 16x16 BLOCK of threads and the 5x4 GRID of blocks]
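A minimal sketch (not on the original slide) of how this layout could be expressed; the kernel name processImage, the device pointer d_image, and the per-pixel operation are illustrative assumptions:

__global__ void processImage(unsigned char *img, int width, int height)   /* hypothetical kernel */
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;       /* 0..79 */
    int y = blockIdx.y * blockDim.y + threadIdx.y;       /* 0..63 */
    if (x < width && y < height)                         /* extra threads beyond 76x62 do no work */
        img[y * width + x] = 255 - img[y * width + x];   /* e.g., invert the pixel */
}

/* Launch: */
dim3 block(16, 16, 1);   /* 16x16x1 = 256 threads per block */
dim3 grid(5, 4, 1);      /* 5x4 blocks -> 80x64 = 5120 threads in total */
processImage<<<grid, block>>>(d_image, 76, 62);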

SVNIT, Surat CSE 461 16


CUDA Thread Organization (Cont.)
• A kernel is launched as a grid of blocks of threads
– blockIdx and threadIdx are 3D
– So far we have used only one dimension (x)
• Built-in variables:
– threadIdx.x, threadIdx.y, threadIdx.z
• Return the thread ID in the x-axis, y-axis, and z-axis of the thread that is
being executed by this stream processor in this block.
– blockIdx.x, blockIdx.y, blockIdx.z
• Returns the block ID in the x-axis, y-axis, and z-axis of the block that is
executing the given block of code.
– blockDim.x, blockDim.y, blockDim.z
• Return the “block dimension” (i.e., the number of threads in a block in
the x-axis, y-axis, and z-axis).
– gridDim.x, gridDim.y, gridDim.z
• Return the grid dimensions (i.e., the number of blocks in the grid in the x-axis, y-axis, and z-axis).

SVNIT, Surat CSE 461 17


Simple Processing Flow

[Figure: CPU and GPU connected over the PCI Bus]

• Copy input data from CPU memory to GPU memory

SVNIT, Surat CSE 461 18


Simple Processing Flow (Cont.)

[Figure: CPU and GPU connected over the PCI Bus]

• Copy input data from CPU memory to GPU memory
• Load GPU code and execute it, caching data on chip for performance

SVNIT, Surat CSE 461 19


Simple Processing Flow (Cont.)

[Figure: CPU and GPU connected over the PCI Bus]

• Copy input data from CPU memory to GPU memory
• Load GPU code and execute it, caching data on chip for performance
• Copy results from GPU memory to CPU memory
SVNIT, Surat CSE 461 20
Hello World on Host with NVCC

• Standard C that runs on the host
• The NVIDIA compiler (nvcc) can be used to compile programs with no device code, as in the sketch below
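Since the slide's listing is a screenshot, here is a minimal host-only program of the kind described (the exact text printed is an assumption):

#include <stdio.h>

int main(void)
{
    printf("Hello World!\n");   /* plain C: no device code involved */
    return 0;
}

/* Compile and run with the NVIDIA compiler:
     $ nvcc hello.cu -o hello
     $ ./hello
     Hello World!
*/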

SVNIT, Surat CSE 461 21


Hello World! with Device Code
• Two new syntactic elements…
• __global__ void mykernel(void)
– The CUDA C/C++ __global__ keyword indicates a function that:
• Runs on the device
• Is called from host code
• nvcc separates source code into host and device components
– Device functions (e.g. mykernel()) are processed by the NVIDIA compiler
– Host functions (e.g. main()) are processed by the standard host compiler (gcc, cl.exe)
• mykernel<<<1,1>>>();
– Triple angle brackets mark a call from host code to device code
– Also called a “kernel launch”
– We’ll return to the parameters (1,1) in a moment
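Putting the two elements together, a sketch of the complete program this slide describes (the slide's listing is a screenshot, so the surrounding main() and the printed text are assumptions):

#include <stdio.h>
#include <cuda_runtime.h>

__global__ void mykernel(void)
{
    /* Runs on the device; empty for now */
}

int main(void)
{
    mykernel<<<1,1>>>();        /* kernel launch from host code: 1 block of 1 thread */
    cudaDeviceSynchronize();    /* wait for the device to finish */
    printf("Hello World!\n");
    return 0;
}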

SVNIT, Surat CSE 461 22


Vector Addition on Device
• Let us write a simple kernel to add two integers
__global__ void add(int *a, int *b, int *c)
{
*c = *a + *b;
}
• As before __global__ is a CUDA C/C++ keyword
meaning
– add() will execute on the device
– add() will be called from the host
• add() runs on the device, so a, b and c must point to device memory
– We need to allocate memory on the GPU
[Figure: a, b and c in device memory]

SVNIT, Surat CSE 461 23


Memory Management
• Host and device memory are separate entities
• Device pointers point to GPU memory
– May be passed to/from host code
– May NOT be dereferenced in host code
• Host pointers point to CPU memory
– May be passed to/from device code
– May NOT be dereferenced in device code
• Simple CUDA API for handling device memory
• cudaMalloc(), cudaFree(), cudaMemcpy()
• Similar to the C equivalents malloc(), free(), memcpy()

SVNIT, Surat CSE 461 24


Vector Addition on Device (Cont.)

• Use cudaMalloc and cudaFree to allocate and deallocate memory on the device
• Use cudaMemcpy to copy data to and from the device
– An argument specifies the direction of the copy
• cudaMemcpyHostToDevice
• cudaMemcpyDeviceToHost
[Code, compilation, and output shown on the slide; a sketch of the host code follows below]
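A sketch of what that host code could look like for the single-integer add() kernel from the previous slide (variable names and values are assumptions):

int main(void)
{
    int a = 2, b = 7, c;          /* host copies of a, b, c */
    int *d_a, *d_b, *d_c;         /* device copies of a, b, c */
    int size = sizeof(int);

    /* Allocate space for device copies */
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    /* Copy inputs to device */
    cudaMemcpy(d_a, &a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, &b, size, cudaMemcpyHostToDevice);

    /* Launch add() kernel on the GPU */
    add<<<1,1>>>(d_a, d_b, d_c);

    /* Copy result back to the host */
    cudaMemcpy(&c, d_c, size, cudaMemcpyDeviceToHost);

    /* Cleanup */
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}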

SVNIT, Surat CSE 461 25


Parallel Vector Addition

• So how do we run code in parallel on the device?
add<<< 1, 1 >>>();
….
add<<< N, 1 >>>();
• Instead of executing add() once, execute it N times in parallel
• With add() running in parallel we can do vector addition
• Terminology: each parallel invocation of add() is referred to as a block
– The set of blocks is referred to as a grid
– Each invocation can refer to its block index using blockIdx.x
__global__ void add(int *a, int *b, int *c) {
    c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}
– By using blockIdx.x to index into the array, each block handles a different element of the array
[Figure: how the GPU threads look at the kernel code. Block 0: c[0] = a[0] + b[0]; Block 1: c[1] = a[1] + b[1]; Block 2: c[2] = a[2] + b[2]; Block 3: c[3] = a[3] + b[3]]

SVNIT, Surat CSE 461 26


Parallel Vector Addition (Cont.)
• How about the main function that runs on the host?

/* Host copies of a, b, and c */
int *a = NULL, *b = NULL, *c = NULL;
…
/* Read number of elements from command line */
N = atoi(argv[1]);
/* Compute the size */
size = N*sizeof(int);
/* Allocate space for host copies of a, b, and c */
a = (int*) malloc(size);
b = (int*) malloc(size);
c = (int*) malloc(size);
…
/* Dummy input values */
for (i = 0; i < N; ++i) {
    a[i] = b[i] = i;
    c[i] = 0;
}
...
/* Copy input to device */
cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);
/* Launch kernel for addition */
add<<<N,1>>>(d_a, d_b, d_c);
/* Copy result back to the host */
cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

SVNIT, Surat CSE 461 27


Parallel Vector Addition (Cont.)
[Compilation and output; scaling results shown on the slide]

SVNIT, Surat CSE 461 28


CUDA Threads
• Terminology: a block can be split into parallel threads
• Let’s change add() to use parallel threads instead of parallel blocks
__global__ void add(int *a, int *b, int *c)
{
c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];
}
• We use threadIdx.x instead of blockIdx.x
• Need to make one change in main()…
// Launch add() kernel on GPU with N threads
add<<<1,N>>>(d_a, d_b, d_c);

SVNIT, Surat CSE 461 29


CUDA Threads (Cont.)
[Compilation and output; scaling results shown on the slide]

SVNIT, Surat CSE 461 30


Combining Threads and Blocks
• We’ve seen parallel vector addition using:
– Several blocks with one thread each
– One block with several threads
• Let’s adapt vector addition to use both blocks and threads
• Why? We’ll come to that
• First let’s discuss data indexing

SVNIT, Surat CSE 461 31


Indexing Arrays with Blocks and Threads

• No longer as simple as using blockIdx.x and threadIdx.x


• Consider indexing an array with one element per thread (8 threads/block)
• With M threads per block, a unique index for each thread is given by:
• int index = threadIdx.x + blockIdx.x * M;

threadIdx.x threadIdx.x threadIdx.x threadIdx.x

0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7

blockIdx.x = 0 blockIdx.x = 1 blockIdx.x = 2 blockIdx.x = 3

SVNIT, Surat CSE 461 32


Indexing Arrays with Blocks and Threads: Example

• Which thread will operate on the red element?

[Figure: array elements 0 to 31; four blocks of 8 threads each, with threadIdx.x = 0..7 in every block and blockIdx.x = 0..3; the red element is element 24]

int index = threadIdx.x + blockIdx.x * M
          = 0 + 3 * 8
          = 24;
SVNIT, Surat CSE 461 33
Vector Addition with Blocks and Threads
• Use the built-in variable blockDim.x for threads per block
– int index = threadIdx.x + blockIdx.x * blockDim.x;
• Combined version of add() to use parallel threads and parallel blocks
__global__ void add(int *a, int *b, int *c)
{
int index = threadIdx.x + blockIdx.x * blockDim.x;
c[index] = a[index] + b[index];
}
• What changes need to be made in main()?
– add<<<N/THREADS_PER_BLOCK,THREADS_PER_BLOCK>>>(d_a, d_b, d_c);

SVNIT, Surat CSE 461 34


Vector Addition with Blocks and Threads (Cont.)
• Typical problems are not friendly multiples of blockDim.x
• Avoid accessing beyond the end of the arrays:
__global__ void add(int *a, int *b, int *c, int n)
{
int index = threadIdx.x + blockIdx.x * blockDim.x;
if (index < n)
c[index] = a[index] + b[index];
}
• Update the kernel launch, where M is the number of threads per block (a complete sketch follows below):
add<<<(N + M-1) / M, M>>>(d_a, d_b, d_c, N);
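A compact, self-contained sketch of the whole program using this bounds-checked kernel and launch (THREADS_PER_BLOCK = 512 and the default N are assumptions):

#include <stdio.h>
#include <stdlib.h>

#define THREADS_PER_BLOCK 512

__global__ void add(int *a, int *b, int *c, int n)
{
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < n)                  /* avoid accessing beyond the end of the arrays */
        c[index] = a[index] + b[index];
}

int main(int argc, char **argv)
{
    int n = (argc > 1) ? atoi(argv[1]) : 1000000;
    size_t size = n * sizeof(int);

    /* Host copies */
    int *a = (int *)malloc(size), *b = (int *)malloc(size), *c = (int *)malloc(size);
    for (int i = 0; i < n; ++i) { a[i] = b[i] = i; c[i] = 0; }

    /* Device copies */
    int *d_a, *d_b, *d_c;
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    /* Round up so the grid covers all n elements */
    int nblocks = (n + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
    add<<<nblocks, THREADS_PER_BLOCK>>>(d_a, d_b, d_c, n);

    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);
    printf("c[%d] = %d\n", n - 1, c[n - 1]);

    free(a); free(b); free(c);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}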

SVNIT, Surat CSE 461 35


Why Blocks and Threads?
• Threads seem unnecessary
– They add a level of complexity
– What do we gain?

SVNIT, Surat CSE 461 36


Why Do You Need Threads?
• Key to understanding:
– Instructions are issued in order
– A thread stalls when one of the operands isn’t ready:
• Memory read by itself doesn’t stall execution
– Latency is hidden by switching threads
• GMEM latency: >100 cycles (varies by architecture/design)
• Arithmetic latency: <100 cycles (varies by architecture/design)

SVNIT, Surat CSE 461 37


Understanding GPU Latency Hiding
• In CUDA C source code:
– int idx = threadIdx.x + blockDim.x * blockIdx.x;
  c[idx] = a[idx] * b[idx];

• In machine code:
– I0: LD R0, a[idx];
– I1: LD R1, b[idx];
– I2: MPY R2,R0,R1

SVNIT, Surat CSE 461 38


GPU Latency Hiding in the SM
I0: LD R0, a[idx];
I1: LD R1, b[idx];
I2: MPY R2, R0, R1

[Figure (built up over slides 39-48): a timeline of clock cycles C0, C1, C2, … against warps W0-W9 resident on the SM. In successive cycles the scheduler issues I0 and then I1 for W0, then for W1, W2, and so on through W8. Because I2 depends on the loads I0 and I1, W0 cannot issue I2 until its memory reads return many cycles later; by then the independent instructions of the other warps have kept the SM busy, hiding the memory latency.]

SVNIT, Surat CSE 461 39-48
Why Do You Need Threads?
• Key to understanding:
– Instructions are issued in order
– A thread stalls when one of the operands isn’t ready:
• Memory read by itself doesn’t stall execution
– Latency is hidden by switching threads
• GMEM latency: >100 cycles (varies by architecture/design)
• Arithmetic latency: <100 cycles (varies by architecture/design)

• What can you do to make this even better?


– Have more WARPS.
– Most SMs limit the number of WARPS to 64
• How many threads/threadblocks to launch?
• Conclusion:
– Need enough threads to hide latency

SVNIT, Surat CSE 461 49


Why Blocks and Threads?
• Threads seem unnecessary
– They add a level of complexity
– What do we gain?

• Unlike parallel blocks, threads have mechanisms to efficiently:


– Communicate
– Synchronize

SVNIT, Surat CSE 461 50


How Threads Communicate
• Generally, threads can safely communicate with each other only if they exist within the same thread block.
– There are technically ways where two threads from different
blocks can communicate with each other, but this is much
more difficult, and much more prone to bugs within your
program.
– This is out of scope of this class.
• Threads within the same block have two main ways to
communicate data with each other.
– Shared memory
– Global Memory
[Figure: the GPU memory hierarchy]

SVNIT, Surat CSE 461 51


Shared Memory
• When a block of threads starts executing, it runs on an SM, a multiprocessor unit inside
the GPU.
• Each SM has a small amount of shared memory associated with it, usually 16KB of
memory.
• To make matters more difficult, often, multiple thread blocks can run simultaneously on
the same SM.
• For example, if each SM has 16KB of shared memory and there are 4 thread blocks
running simultaneously on an SM, then the maximum amount of shared memory
available to each thread block would be 16KB/4, or 4KB.
• So, as you can see, if you only need the threads to share a small amount of data at any
given time, using shared memory is by far the fastest and most convenient way to do it.

SVNIT, Surat CSE 461 52


Shared Memory (Example)
• There are multiple ways to declare shared memory inside a kernel
– Depends on whether the amount of memory needed is known at compile/run time
• Static Shared Memory
– If the shared memory array size is known at compile time
– __shared__ int s[64];
• Dynamic Shared Memory
– Used when the amount of shared memory needed is not known at compile time
– The shared memory allocation size per thread block must be specified (in bytes) using an optional third
execution configuration parameter
– dynamicReverse<<<1, n, n*sizeof(int)>>>(d_d, n);
– In the kernel, we use an “unsized” extern variable to access the shared memory segment
• extern __shared__ int s[]; (note the empty brackets and use of the extern specifier)
• The size is implicitly determined from the third execution configuration parameter when the kernel is launched; see the sketch below
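For reference, a sketch of what the dynamicReverse kernel used in that launch might look like; the body (reversing an n-element array through dynamically allocated shared memory) is an assumption, not taken from the slide:

__global__ void dynamicReverse(int *d, int n)
{
    extern __shared__ int s[];     /* size set by the 3rd launch parameter: n*sizeof(int) */
    int t  = threadIdx.x;
    int tr = n - t - 1;
    s[t] = d[t];                   /* stage the array in shared memory */
    __syncthreads();               /* make sure every element has been written */
    d[t] = s[tr];                  /* write it back reversed */
}

/* Launch (as on the slide): dynamicReverse<<<1, n, n*sizeof(int)>>>(d_d, n); */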

SVNIT, Surat CSE 461 53


Global Memory
• However, if your program is using too much shared memory to store data, or
your threads simply need to share too much data at once, then it is possible that
the shared memory is not big enough to accommodate all the data that needs
to be shared among the threads.
• In such a situation, threads always have the option of writing to and reading
from global memory.
• Accessing global memory is much slower than accessing shared memory; however, global memory is much larger.

SVNIT, Surat CSE 461 54


Example: 1D Stencil

• Consider applying a 1D stencil to a 1D array of elements
– Each output element is the sum of the input elements within a radius
• If the radius is 3, then each output element is the sum of 7 input elements

[Figure: “in” and “out” arrays for the 1D stencil]

SVNIT, Surat CSE 461 55


Implementing Within a Block

• Each thread processes one output element
– blockDim.x elements per block
• Input elements are read several times
– With radius 3, each input element is read seven times

[Figure: “in” and “out” arrays with a left and right halo of radius elements; Threads 0-8 each read a window of the input]

SVNIT, Surat CSE 461 56


Sharing Data Between Threads
• Terminology: within a block, threads share data via shared memory

• Extremely fast on-chip memory


– In contrast to device memory, which is referred to as global memory
– Like a user-managed cache

• Declare using __shared__, allocated per block


– Places a variable into shared memory for each respective thread block
• Data is not visible to threads in other blocks

SVNIT, Surat CSE 461 57


Implementing With Shared Memory

• Cache data in shared memory


– Read (blockDim.x + 2 * radius) input elements from global memory to shared memory
– Compute blockDim.x output elements
– Write blockDim.x output elements to global memory
• Each block needs a halo of radius elements at each boundary

[Figure: “in” and “out” arrays; each block’s input tile includes a halo of radius elements on each side]

SVNIT, Surat CSE 461 58


Stencil Kernel
• Kernel Code
__global__ void stencil_1d(int *in, int *out)
{
    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int lindex = threadIdx.x + RADIUS;

    // Read input elements into shared memory
    temp[lindex] = in[gindex];
    if (threadIdx.x < RADIUS) {
        temp[lindex - RADIUS] = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }

    // Apply the stencil
    int result = 0;
    for (int offset = -RADIUS ; offset <= RADIUS ; offset++)
        result += temp[lindex + offset];

    // Store the result
    out[gindex] = result;
}
SVNIT, Surat CSE 461 59
Data Race
▪ The stencil example will not work…
▪ Suppose thread 15 reads the halo before thread 0 has fetched it…

...
temp[lindex] = in[gindex];                             // thread 15: store at temp[18]
if (threadIdx.x < RADIUS) {
    temp[lindex - RADIUS] = in[gindex - RADIUS];       // skipped by thread 15, since its threadIdx.x is not < RADIUS
    temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
}
int result = 0;
for (int offset = -RADIUS ; offset <= RADIUS ; offset++)
    result += temp[lindex + offset];                   // thread 15: load from temp[19], which thread 0 may not have written yet
...

SVNIT, Surat CSE 461 60


Synchronize Threads
• void __syncthreads();
• Synchronizes all threads within a block
– A thread’s execution can only proceed past a __syncthreads() after all threads in its block
have executed the __syncthreads().
– All threads within a thread block must call __syncthreads() at the same point
• Otherwise, it can lead to deadlock
– Used to prevent RAW / WAR / WAW hazards
• All threads must reach the barrier
– In conditional code, the condition must be uniform across the block

SVNIT, Surat CSE 461 61


Thread Divergence
• Recall that threads from a block are bundled into fixed-size warps for execution
on a CUDA core, and threads within a warp must follow the same execution
trajectory.
• All threads must execute the same instruction at the same time. In other words,
threads cannot diverge.
• The most common code construct that can cause thread divergence is
branching for conditionals in an if-then-else statement.
• If some threads in a single warp evaluate to 'true' and others to 'false', then the
'true' and 'false' threads will branch to different instructions.
• Some threads will want to proceed to the 'then' instruction, while others to the 'else'.

SVNIT, Surat CSE 461 62


Thread Divergence (Cont.)
• Intuitively, we would think statements in then and else should be executed in parallel.
• However, because of the requirement that threads in a warp cannot diverge, this cannot
happen.
• The CUDA platform has a workaround that fixes the problem, but has negative
performance consequences.
• When executing the if-then-else statement, the CUDA platform will instruct the warp to
execute the then part first, and then proceed to the else part.
• While executing the then part, all threads that evaluated the condition to false (i.e., the 'else' threads) are effectively deactivated.
• When execution proceeds to the else condition, the situation is reversed. As you can see,
the then and else parts are not executed in parallel, but in serial.
• This serialization can result in a significant performance loss.
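A minimal illustration (not from the slides) of a conditional that makes threads within one warp diverge:

__global__ void divergentKernel(int *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0)
        data[i] = data[i] * 2;    /* 'then' path: odd lanes of the warp are deactivated here */
    else
        data[i] = data[i] + 1;    /* 'else' path: even lanes are deactivated here */
}

Because the two paths are serialized, the warp takes roughly the combined time of both branches.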

SVNIT, Surat CSE 461 63


Deadlock with Thread Divergence
• Thread divergence can also cause a program to deadlock. Consider the following example:
// myFunc_then and myFunc_else are some device functions
if (threadIdx.x < 16) {
    myFunc_then();
    __syncthreads();
} else if (threadIdx.x >= 16) {
    myFunc_else();
    __syncthreads();
}
• The first half of the warp will execute the then part, then wait for the second half of the
warp to reach __syncthreads().
• However, the second half of the warp did not enter the then part; therefore, the first half
of the warp will be waiting for them forever.

SVNIT, Surat CSE 461 64


Fixed Stencil Kernel
▪ Synchronize threads
▪ Basically – put a barrier in between to make them wait

...
temp[lindex] = in[gindex]; Store at temp[18]
if (threadIdx.x < RADIUS) {

temp[lindex – RADIUS] = in[gindex – RADIUS];


Skipped since threadId.x > RADIUS
temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
}
// Synchronize (ensure all the data is available)
__syncthreads();
int result = 0;
for (int offset = -RADIUS ; offset <= RADIUS ; offset++)
result += temp[lindex + offset]; Load from temp[19]

SVNIT, Surat CSE 461 65


Dynamic Shared Memory

• Dynamic shared memory can only be declared as a 1D array


• What if you need multiple dynamically sized arrays in a single kernel?
• You must declare a single extern unsized array as before and use pointers into it to
divide it into multiple arrays, as in the following excerpt.
extern __shared__ int s[];
int *integerData = s; // nI ints
float *floatData = (float*)&integerData[nI]; // nF floats
char *charData = (char*)&floatData[nF]; // nC chars
• In the kernel launch, specify the total shared memory needed, as in the following.
myKernel<<<gridSize, blockSize, nI*sizeof(int)+nF*sizeof(float)+nC*sizeof(char)>>>(...);

SVNIT, Surat CSE 461 66


Defining Grid/Block Structure
• Need to provide each kernel call with values for two key structures:
– Number of blocks in each dimension
– Threads per block in each dimension
• myKernel<<< B, T >>>(arg1, … );
• B – a structure that defines the number of blocks in grid in each dimension (1D
or 2D).
• T – a structure that defines the number of threads in a block in each dimension
(1D, 2D, or 3D).

SVNIT, Surat CSE 461 67


1D Grids and/or 1D Blocks
• If you want a 1D structure, you can use an integer for B and T in:
• myKernel<<< B, T >>>(arg1, … );
• B – An integer would define a 1D grid of that size
• T – An integer would define a 1D block of that size
• Example: myKernel<<< 1, 100 >>>(arg1, ... );

SVNIT, Surat CSE 461 68


Higher Dimensional Grids/Blocks
• 1D grids/blocks are suitable for 1D data, but higher dimensional grids/blocks are
necessary for:
• Higher dimensional data.
• Data set larger than the hardware dimensional limitations of blocks.
• CUDA has built-in variables and structures to define the number of blocks in a
grid in each dimension and the number of threads in a block in each dimension.

SVNIT, Surat CSE 461 69


CUDA Built-In Vector Types and Structures
• uint3 and dim3 are CUDA-defined structures of unsigned integers with fields x, y, and z:
• struct uint3 { unsigned int x, y, z; };
• struct dim3 { unsigned int x, y, z; };
• For dim3, any unspecified components are automatically initialized to 1.
• These vector types are mostly used to define grid of blocks and threads

SVNIT, Surat CSE 461 70


CUDA Built-In Variables for Grid/Block Sizes

• dim3 gridDim -- Grid dimensions, x, y, and z.


• Number of blocks in grid = gridDim.x * gridDim.y * gridDim.z
• dim3 blockDim -- Size of block dimensions x, y, and z.
• Number of threads in a block = blockDim.x * blockDim.y * blockDim.z
• Example
– dim3 grid(16,16); // grid = 16 x 16 blocks
– dim3 block(32,32); // block = 32 x 32 threads
– myKernel<<<grid, block>>>(...);
– which sets:
– grid.x = 16; grid.y = 16; grid.z = 1
– block.x = 32; block.y = 32; block.z = 1;

SVNIT, Surat CSE 461 71


2D Grids and 2D Blocks

• Full global thread ID in x and y dimensions


can be computed by:
– x = blockIdx.x * blockDim.x + threadIdx.x;
– y = blockIdx.y * blockDim.y + threadIdx.y;

SVNIT, Surat CSE 461 72


Flatten Matrices into Linear Memory

• Generally, memory is allocated dynamically on the device (GPU), and we cannot use two-dimensional indices (e.g., A[row][column]) to access matrices.
• We will need to know how the matrix is laid out in memory and then compute
the distance from the beginning of the matrix.
• C uses row-major order --- rows are stored one after the other in memory, i.e.,
row 0 then row 1 etc.

M0,0 M1,0 M2,0 M3,0
M0,1 M1,1 M2,1 M3,1
M0,2 M1,2 M2,2 M3,2
M0,3 M1,3 M2,3 M3,3

Row-major linear layout:
M0,0 M1,0 M2,0 M3,0 M0,1 M1,1 M2,1 M3,1 M0,2 M1,2 M2,2 M3,2 M0,3 M1,3 M2,3 M3,3

SVNIT, Surat CSE 461 73
Accessing Matrices in Linear Memory

• Logically
– a[row][column] == a[offset]
– offset = column + row * N

• In CUDA:
– int col = blockIdx.x*blockDim.x+threadIdx.x;
– int row = blockIdx.y*blockDim.y+threadIdx.y;
– int index = col + row * N;
– A[index] = …

SVNIT, Surat CSE 461 74


Matrix Addition: Add two 2D matrices
• Corresponding elements of two input matrices (a, b) are added together to form the elements of a third matrix (c)
• Two 2D matrices are added to form a sum 2D matrix
• We use dim3 variables to set the grid and block dimensions
• We calculate a global thread ID to index the column and row of the matrix
• We calculate the linear index of the matrix, as in the sketch below
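A sketch of the matrix addition kernel and launch along these lines (the kernel name, element type, and 16x16 block size are assumptions):

__global__ void matrixAdd(const float *a, const float *b, float *c, int N)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;   /* global thread ID -> column */
    int row = blockIdx.y * blockDim.y + threadIdx.y;   /* global thread ID -> row */
    if (row < N && col < N) {
        int index = col + row * N;                     /* linear index, row-major layout */
        c[index] = a[index] + b[index];
    }
}

/* Launch for an N x N matrix: */
dim3 block(16, 16);
dim3 grid((N + block.x - 1) / block.x, (N + block.y - 1) / block.y);
matrixAdd<<<grid, block>>>(d_a, d_b, d_c, N);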

SVNIT, Surat CSE 461 75


General Rules and Best Practices
• Choosing the number of threads per block is very complicated
• To simplify things, use a constant number of threads per block
• There is a limit to the number of threads per block
– Since all threads of a block are expected to reside on the same processor core and must
share the limited memory resources of that core.
• On current GPUs, a thread block may contain up to 1024 threads.
• Your thread block size should always be a multiple of 32
– Kernels issue instructions in warps (32 threads).
– For example, if you have a block size of 50 threads, the GPU will still issue commands to 64
threads, and you'd just be wasting them.

SVNIT, Surat CSE 461 76


General Rules and Best Practices (Cont.)
• Try to size your blocks based on the maximum numbers of threads and blocks
that correspond to the compute capability of your card.
– For example, a CC 3.0 card each SM can have 16 active blocks and 2048 active threads.
– If you have 128 threads per block, you could fit 16 blocks in your SM before hitting the 2048
thread limit.
– If you use 256 threads, you can only fit 8, but you're still using all the available threads and
will still have full occupancy.
– However, using 64 threads per block will only use 1024 threads when the 16-block limit is
hit, so only 50% occupancy.
– If shared memory and register usage is not a bottleneck, this should be your main concern
(other than your data dimensions).

SVNIT, Surat CSE 461 77


General Rules and Best Practices (Cont.)
• The blocks in your grid are spread out over the SMs to start
• Then the remaining blocks are placed into a pipeline.
• Blocks are moved into the SMs for processing as soon as there are enough
resources in that SM to take the block.
• In other words, as blocks complete in an SM, new ones are moved in.
• You could make the argument that having smaller blocks (128 instead of 256 in
the previous example) may complete faster since a particularly slow block will
hog fewer resources, but this is very much dependent on the code.

SVNIT, Surat CSE 461 78


CUDA Occupancy Calculator
• The CUDA Occupancy Calculator allows you to compute the multiprocessor occupancy of
a GPU by a given CUDA kernel.
• The multiprocessor occupancy is the ratio of active warps to the maximum number of
warps supported on a multiprocessor of the GPU.
• Each multiprocessor on the device has a set of N registers available for use by CUDA
program threads.
• These registers are a shared resource that are allocated among the thread blocks
executing on a multiprocessor.
• The CUDA compiler attempts to minimize register usage to maximize the number of
thread blocks that can be active in the machine simultaneously.
• If a program tries to launch a kernel for which the registers used per thread times the
thread block size is greater than N, the launch will fail
• https://docs.nvidia.com/cuda/cuda-occupancy-calculator/index.html
SVNIT, Surat CSE 461 79
Determining Values at Runtime
• Query GPU properties using cudaGetDeviceProperties
– __host__​cudaError_t cudaGetDeviceProperties ( cudaDeviceProp* prop, int device)
– Returns information about the compute-device.
• Some useful properties
– maxThreadsDim[3] - Maximum size of each dimension of a block
– maxGridSize[3] - Maximum size of each dimension of a grid
– maxThreadsPerBlock - Maximum number of threads per block
– maxThreadsPerMultiProcessor - Maximum resident threads per multiprocessor
– maxBlocksPerMultiProcessor - Maximum number of resident blocks per multiprocessor
– reservedSharedMemPerBlock - Shared memory reserved by CUDA driver per block in bytes
– sharedMemPerBlock - Shared memory available per block in bytes

SVNIT, Surat CSE 461 80


Determining Values at Runtime (Cont.)
• // CUDA device properties variable
cudaDeviceProp prop;
// Query GPU properties
cudaGetDeviceProperties(&prop, device_id);
printf("maxThreadsDim x,y,z = %d,%d,%d\n", prop.maxThreadsDim[0],
prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
printf("maxGridSize x,y,z = %d,%d,%d\n", prop.maxGridSize[0],
prop.maxGridSize[1], prop.maxGridSize[2]);
printf("maxThreadsPerBlock = %d, maxThreadsPerMultiProcessor = %d, maxBlocksPerMultiProcessor = %d\n",
prop.maxThreadsPerBlock, prop.maxThreadsPerMultiProcessor, prop.maxBlocksPerMultiProcessor);
printf("reservedSharedMemPerBlock = %d, sharedMemPerBlock = %d\n",
prop.reservedSharedMemPerBlock, prop.sharedMemPerBlock);
• https://docs.nvidia.com/cuda/cuda-runtime-api/structcudaDeviceProp.html

SVNIT, Surat CSE 461 81


Matrix Multiplication Review

• To calculate the product of two matrices A


and B, we multiply the rows of A by the
columns of B and add them up.
• Then place the sum in the appropriate
position in the matrix C.

SVNIT, Surat CSE 461 82


Matrix Multiplication Review (Cont.)

SVNIT, Surat CSE 461 83


Matrix Multiplication Review (Cont.)

SVNIT, Surat CSE 461 84


Matrix Multiplication Review (Cont.)

SVNIT, Surat CSE 461 85


Matrix Multiplication Review (Cont.)

SVNIT, Surat CSE 461 86


Parallelizing Matrix Multiplication

• To compute a single value of C(i,j), only a single thread is necessary to traverse the ith row of A and the jth column of B.
• Therefore, the number of threads needed to compute a square matrix multiply is O(N^2), as in the sketch below.
[Figure: matrices A (row i), B (column j), and C, with k the index along the row and column being traversed]
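A sketch of a straightforward kernel along these lines, with one thread computing one element of C (names and types are assumptions; no shared-memory tiling):

__global__ void matrixMul(const float *A, const float *B, float *C, int N)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;   /* j */
    int row = blockIdx.y * blockDim.y + threadIdx.y;   /* i */
    if (row < N && col < N) {
        float sum = 0.0f;
        for (int k = 0; k < N; ++k)                    /* traverse row i of A and column j of B */
            sum += A[row * N + k] * B[k * N + col];
        C[row * N + col] = sum;
    }
}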
SVNIT, Surat CSE 461 87
Quick Reference Guide
• Function attributes
– __global__ function called by the host, executes on the device
– __device__ function called by the device, executes on the device
– __host__ function called by the host, executes on the host
– __host__ __device__ generates both host and device code for the function
• Variables attributes
– __device__ variable on device (Global Memory)
– __shared__ variable in Shared Memory
– __restrict__ restricted pointers, assert to the compiler that pointers are not aliased
– No qualifier automatic variable, resides in Register or in Local Memory

SVNIT, Surat CSE 461 88


Quick Reference Guide (Cont.)
• Built-in Variables
– dim3 gridDim size of the grid in number of blocks along the x, y, z axes
– dim3 blockDim size of the block in number of threads along the x, y, z axes
– dim3 blockIdx position (x,y,z) of the block in the grid
– dim3 threadIdx position (x,y,z) of the thread in the block
• Shared memory
– __shared__ int x[10]; statically allocated array in shared memory
– extern __shared__ int x[]; dynamically allocated array in shared memory
• kernel<<<blocks, threadsperblock, dyn shared mem in bytes>>>

SVNIT, Surat CSE 461 89


Quick Reference Guide (Cont.)
• Memory Management
– cudaMalloc(&dptr, size) allocates size memory on the device
– cudaFree(dptr) frees size memory from the device
– cudaMallocHost(&hptr, size) allocates size pinned memory on the host
– cudaFreeHost(hptr) frees size memory from the host
– cudaMemcpy(trgptr, srcptr, size, direction) copies size memory from the
source pointer to the target pointer using the direction specified
• e.g. cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost,
cudaMemcpyDeviceToDevice

SVNIT, Surat CSE 461 90


Thank You!

SVNIT, Surat CSE 461 91
