
UNIT 3: PARALLEL PROGRAMMING WITH CUDA

• Introduction to CUDA: Data Parallelism
• CUDA Program Structure
• A Vector Addition Kernel
• Device Global Memory and Data Transfer
• Kernel Functions
• Threading
What is CUDA?

 CUDA stands for Compute Unified Device Architecture.
 A general-purpose parallel computing platform and programming model.
 CUDA comes with a software environment that allows developers to use C++ as a high-level programming language.
 The CUDA parallel programming model is designed to overcome the challenge of writing scalable parallel software while maintaining a low learning curve for programmers familiar with standard programming languages such as C.
 At its core are three key abstractions: a hierarchy of thread groups, shared memories, and barrier synchronization. They are exposed to the programmer as a minimal set of language extensions.
 These abstractions guide the programmer to partition the problem into coarse sub-problems that can be solved independently in parallel by blocks of threads, and each sub-problem into finer pieces that can be solved cooperatively in parallel by all threads within the block.
CUDA program structure

 NVIDIA GPUs are widely used as accelerator devices.
 The CPU acts as the host.
 The CPU runs a main program that dispatches parallel tasks to GPU devices.
 Multiple GPU devices may be attached to the CPU.
 The CPU can send parallel jobs to these GPUs simultaneously.
 The GPUs execute these tasks and return results to the CPU (host).
 Any C program is valid host code.
 In general, CUDA programs (host + device code) cannot be compiled by standard C compilers.
 They are compiled with the NVIDIA C compiler (NVCC).
The Compilation Flow

 Create a C program with CUDA extensions.
 Use NVIDIA's NVCC compiler.
 NVCC splits the code into host (CPU) and device (GPU) segments.
 Host code is compiled by the system's C compiler and linker.
 Device code is JIT-compiled for execution on the GPU.
 Host code runs on the CPU; device code runs on the GPU.
The Execution Flow

1. Heterogeneous Programming: CUDA enables a heterogeneous programming model where code executes across CPU (host) and GPU (device) components.

2. Execution Flow:
   i. The host code (CPU) runs serially, launching GPU parallel kernels as needed.
   ii. GPU kernels (device code) execute in parallel, then return results to the host.
   iii. This back-and-forth execution enables both the CPU and the GPU to contribute to complex computations.

3. Separate Execution and Memory: CUDA assumes separate execution environments and memory spaces:
   i. The host (CPU) runs the main program.
   ii. The device (GPU) executes CUDA kernels as a coprocessor.
   iii. Each has its own memory: host memory and device memory, both in DRAM.

4. Memory Management: Programs must manage device memory allocation, deallocation, and data transfers between host and device using CUDA runtime calls.
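As a minimal sketch of this host-side pattern (the function name and the omitted kernel launch are placeholders; the full, error-checked version appears on the following slides):

#include <cuda_runtime.h>

void copyRoundTrip(float* h_a, int n) {
    size_t size = n * sizeof(float);
    float *d_a = NULL;
    cudaMalloc((void**)&d_a, size);                      // allocate device memory
    cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);  // transfer host -> device
    // ... launch kernel(s) that read/write d_a ...
    cudaMemcpy(h_a, d_a, size, cudaMemcpyDeviceToHost);  // transfer device -> host
    cudaFree(d_a);                                       // release device memory
}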
Vector Addition Program

#include <stdlib.h>

void vectorAdd(float* h_a, float* h_b, float* h_c, int n)
{
    for (int i = 0; i < n; i++)
        h_c[i] = h_a[i] + h_b[i];
}
int main()
{
    int n = 1000;  // Size of the vectors

    // Allocate memory for host vectors
    float *h_a = (float*)malloc(n * sizeof(float));
    float *h_b = (float*)malloc(n * sizeof(float));
    float *h_c = (float*)malloc(n * sizeof(float));

    // Initialize vectors h_a and h_b with some values
    for (int i = 0; i < n; i++)
    { h_a[i] = i * 1.0f; h_b[i] = i * 2.0f; }

    // Perform vector addition
    vectorAdd(h_a, h_b, h_c, n);

    // Free allocated memory
    free(h_a); free(h_b); free(h_c);
    return 0;
}
Vector Addition CPU-GPU

#include <cuda_runtime.h>
#include <iostream>
#include <cstdio>
#include <cstdlib>
using namespace std;

#define N 1000   // Size of the vectors (same as the CPU example)

// Kernel function to perform vector addition
__global__ void vectorAdd(float *a, float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

void vecAdd(float* h_a, float* h_b, float* h_c, int n) {
    int size = n * sizeof(float);
    float *d_a = NULL, *d_b = NULL, *d_c = NULL;
    cudaError_t err = cudaSuccess;
Device Memory Allocation

    err = cudaMalloc((void**)&d_a, size);
    if (err != cudaSuccess) {
        fprintf(stderr, "Failed to allocate device vector A (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
    err = cudaMalloc((void**)&d_b, size);
    if (err != cudaSuccess) {
        fprintf(stderr, "Failed to allocate device vector B (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
    err = cudaMalloc((void**)&d_c, size);
    if (err != cudaSuccess) {
        fprintf(stderr, "Failed to allocate device vector C (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
Host to Device Transfer

    // Copy input vectors from host memory to device memory
    cout << "Copy input data from the host memory to the CUDA device\n";
    err = cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
    if (err != cudaSuccess) {
        fprintf(stderr, "Failed to copy vector A from host to device (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
    err = cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);
    if (err != cudaSuccess) {
        fprintf(stderr, "Failed to copy vector B from host to device (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
Kernel Launch

    // Launch the kernel with n/256 blocks (rounded up) and 256 threads per block
    int threadsPerBlock = 256;
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c, n);

    // Check for any errors during kernel launch
    err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "Failed to launch vectorAdd kernel (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
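For example, with n = 1000 and threadsPerBlock = 256, blocksPerGrid = (1000 + 256 - 1) / 256 = 1255 / 256 = 4 (integer division), so 4 × 256 = 1024 threads are launched; the if (i < n) check in the kernel leaves the extra 24 threads idle.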
Device to Host Memory Transfer

    // Copy result from device to host
    err = cudaMemcpy(h_c, d_c, size, cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        fprintf(stderr, "Failed to copy vector C from device to host (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Free device memory
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
}
int main() {
    // Host vectors
    float h_a[N], h_b[N], h_c[N];

    // Initialize host vectors
    for (int i = 0; i < N; i++) {
        h_a[i] = i * 1.0f; h_b[i] = i * 2.0f;
    }

    // Perform vector addition on the device
    vecAdd(h_a, h_b, h_c, N);

    // Display the result (first 10 elements for brevity)
    cout << "Result of vector addition:\n";
    for (int i = 0; i < 10; i++) {
        cout << h_a[i] << " + " << h_b[i] << " = " << h_c[i] << endl;
    }
    return 0;
}
Compile and Run
nvcc kernel.cu host.cu -o output
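The resulting executable is then run from the host. Optionally, a target GPU architecture can be passed to nvcc (the file names are the ones used above; sm_70 is just an example compute capability):

nvcc -arch=sm_70 kernel.cu host.cu -o output
./output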


Observations

 cuda_runtime.h (included at the top of the program) provides the CUDA runtime API functions and built-in CUDA variables at compile time.
 h_a, h_b, h_c are host arrays residing in main (CPU) memory.
Observations

 cudaMalloc((void**)&d_b, size);
  1. This function allocates memory on the GPU (device) for the variable d_b.
  2. Here, size specifies the amount of memory to allocate, usually calculated from the number of elements and their data type (e.g., n * sizeof(float) for n floats).
  3. This memory on the device is used to store data that the GPU will work with, separate from the host memory.
 cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
  1. This function copies data from the host (CPU) memory to the device (GPU) memory.
  2. h_a is the pointer to the data on the host, and d_a is the pointer to the allocated memory on the device.
  3. cudaMemcpyHostToDevice specifies the direction of transfer, moving data from host to device.
  4. This allows the GPU to access the input data (h_a) during kernel execution.
Observations

 cudaMemcpy(h_c, d_c, size, cudaMemcpyDeviceToHost);
  1. This function copies data from the device (GPU) memory back to the host (CPU) memory.
  2. d_c is the pointer to the data on the device, and h_c is the pointer to the allocated memory on the host.
  3. cudaMemcpyDeviceToHost specifies the direction of transfer, moving data from device to host.
  4. This is used to retrieve the results of GPU computations (e.g., the output of a kernel) so that they can be accessed by the CPU for further processing or display.
CUDA Kernel

 A CUDA kernel, when invoked, launches multiple threads arranged in a two-level hierarchy.
 Consider the kernel call:

   vectorAdd<<<ceil(n/256.0), 256>>>(d_A, d_B, d_C, n);

 The call specifies a grid of threads to be launched.
 The grid is arranged in a hierarchical manner:
 (Number of blocks, Number of threads per block)
 All blocks contain the same number of threads (with a maximum of 1024 per block).
 Blocks can be indexed as (x, y, z) triplets, with further details on this structure to be covered later.
CUDA Kernel Invocation:
 When a CUDA kernel is called, it launches multiple threads organized in a two-level hierarchy.
 Each thread is a basic computing unit, executed by a scalar processor on the GPU.
 Scalar processors execute threads in parallel.

Kernel Parameters:
 The kernel launch includes two parameters: blocksPerGrid and threadsPerBlock.
 These parameters define the grid (collection of threads) to be launched.

Significance of Parameters:
 To perform vector addition, many compute threads are launched to add vector components in parallel.
 Threads are arranged in a two-level hierarchy:
 Blocks (high level): n/256 blocks, rounded up.
 Threads per Block (low level): 256 threads per block.

Grid and Hierarchy:
• The grid of threads is arranged hierarchically.
• Each block contains 256 threads.
• The grid is defined by the number of blocks (n/256, rounded up) and the number of threads per block (256).
Thread Indexing:
 Threads are indexed hierarchically: each block has an ID, and each thread within a block has an ID.
 In the code, the global thread ID is computed as:

   i = threadIdx.x + blockDim.x * blockIdx.x;

 This uses CUDA-defined variables:
 threadIdx: ID of the thread within its block.
 blockDim: number of threads per block.
 blockIdx: ID of the block within the grid.

Example of Thread Indexing:
 For block 0, threads are numbered 0 to 255.
 In block 0, blockIdx.x is 0, blockDim.x is 256, and threadIdx.x varies from 0 to 255.
 The global thread ID (i) is calculated for each thread based on its block and thread indices.
Accessing Global Memory:
 The global thread ID (i) determines which elements of the data arrays (e.g., A, B, and C) each thread will operate on.
 For example:
 Thread 1 in Block 0 calculates i = 1, thus computing C[1] = A[1] + B[1].
 Thread 1 in Block 1 calculates i = 1 + (1 * 256) = 257, computing C[257] = A[257] + B[257].

Parallel Processing Across Data Elements:
 Using this global thread ID, each thread can access different elements of the data arrays independently.
 This approach allows for efficient parallel addition of vectors by having each thread handle a specific element in the arrays A and B, storing its result in C.

Global Memory Usage:
 The data arrays (A, B, and C) reside in global memory, which is accessible by all threads in all blocks.
 Threads use the global thread ID to determine which data elements to access, allowing for coordinated parallel processing across the entire dataset.
Kernel-specific system variables

 gridDim: number of blocks in the grid.
 gridDim.x: number of blocks in dimension x of a multi-dimensional grid.
 blockDim: number of threads per block.
 blockDim.x: number of threads per block in dimension x of a multi-dimensional block.
 For a one-dimensional definition of the block composition in the grid, blockDim = blockDim.x.
 blockIdx.x: block number of a thread.
 threadIdx.x: thread number within a block.
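A small sketch that prints these variables from inside a kernel (the kernel name and launch configuration are illustrative; device-side printf is available on compute capability 2.0 and later):

#include <cstdio>

__global__ void showIds() {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread ID
    printf("block %d/%d, thread %d/%d -> i = %d\n",
           blockIdx.x, gridDim.x, threadIdx.x, blockDim.x, i);
}

// Example launch: 2 blocks of 4 threads each prints i = 0 .. 7
// showIds<<<2, 4>>>();
// cudaDeviceSynchronize();   // wait so that the device-side output is flushed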


__global__ void vectorAdd(float *a, float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

 The code is executed by all threads in the grid.
 Each thread has a unique combination of (blockIdx.x, threadIdx.x), which maps to a unique value of i.
 i is a value private to each thread.
Function declaration keywords

__global__ void vectorAdd(float *A, float *B, float *C, int n)

 __device__: Functions qualified with __device__ run on the GPU and can only be called from other device or global functions. They allow custom logic for GPU-side code.
 __global__: Functions qualified with __global__ are called "kernels" and are the entry points for GPU execution. These functions are launched from the host using the special CUDA kernel launch syntax (<<<...>>>). They must have a void return type and cannot return values directly.
 __host__: Functions qualified with __host__ run on the CPU and can only be called from other host functions.
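A minimal sketch illustrating the three qualifiers together (the function names are made up for illustration):

__device__ float addOne(float x) {                 // runs on the GPU, callable only from device/global code
    return x + 1.0f;
}

__global__ void incrementKernel(float* a, int n) { // kernel: GPU entry point, must return void
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        a[i] = addOne(a[i]);                       // device function called from the kernel
}

__host__ void launchIncrement(float* d_a, int n) { // runs on the CPU, launches the kernel
    incrementKernel<<<(n + 255) / 256, 256>>>(d_a, n);
}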
CUDA Function Types and Their Qualifiers

In CUDA, functions can be qualified with specific keywords (__host__, __device__, __global__) that define where the function executes and from where it can be called.

1. Default Host Functions
 Without any CUDA keywords, all functions are considered host functions by default.
 These functions:
 Execute on the CPU (host).
 Can only be called from other host functions.
 Do not interact directly with the GPU.

2. Declaring Host and Device Functions Together
 Functions can be declared as both __host__ and __device__ functions:
 __host__ __device__ float HostDeviceFunc();
 By qualifying a function as both __host__ and __device__, it can be called from both host code (CPU) and device code (GPU).
 When a function is declared this way, the CUDA compiler generates two versions of the function:
 One version is compiled for the CPU, allowing calls from other host functions.
 Another version is compiled for the GPU, allowing calls from device functions.
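A short sketch of a function compiled for both sides (the names are illustrative):

__host__ __device__ float square(float x) {   // one source, two compiled versions (CPU and GPU)
    return x * x;
}

__global__ void squareKernel(float* a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        a[i] = square(a[i]);                  // the GPU version is used here
}

// On the host, the same function can be called directly:
// float y = square(3.0f);                    // the CPU version is used here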
CUDA Function Types and Their Qualifiers

3. Global Functions and Dynamic Parallelism
 __global__ functions, also known as kernel functions:
 Run on the GPU (device).
 Can be launched from the CPU (host) using CUDA's kernel launch syntax:
 KernelFunc<<<gridDim, blockDim>>>(...);
 Global functions can also be called from within other kernels using the same kernel launch syntax if CUDA Dynamic Parallelism is enabled.
 This model requires:
 CUDA 5.0 or higher.
 GPU compute capability of 3.5 or higher.
 With Dynamic Parallelism, a kernel running on the GPU can launch additional kernels, enabling complex, nested parallel computations directly on the GPU without requiring intervention from the CPU (see the sketch below).
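A hedged sketch of dynamic parallelism (kernel and variable names are illustrative; it assumes a device of compute capability 3.5 or higher and compilation with relocatable device code, e.g. nvcc -arch=sm_35 -rdc=true):

__global__ void childKernel(int* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2;
}

__global__ void parentKernel(int* data, int n) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        // A kernel launched from within another kernel (Dynamic Parallelism)
        childKernel<<<(n + 255) / 256, 256>>>(data, n);
    }
}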
CUDA Functions: Additional Observations
 Return Types:
 __device__ functions can have a non-void return type. This allows them to return data directly to the calling kernel or function.
 __global__ functions (kernels) must always return void because they launch multiple threads across the GPU in parallel and do not return data directly.
 Execution Contexts:
 Global functions (__global__) can be called within other kernels on the GPU, creating new threads as part of the CUDA Dynamic Parallelism model.
 Device functions (__device__) run on the same thread as the calling kernel. They do not create new threads but rather allow the calling thread to execute custom logic.
Thread Synchronization
 Threads execute in parallel.
 A thread can read a result before another thread has written to that address.
 This is a race condition.
Thread Synchronization via explicit barrier

 Threads need to synchronize with one another to avoid race conditions.
 A barrier is a point in the kernel where all threads stop and wait for the others.
 When all threads have reached the barrier, they can proceed.
 Barriers can be implemented with: __syncthreads();
__syncthreads() Example

 Shift the contents of an array to the left by one element:

__global__ void kernel(int* a)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < 3)
        a[i] = a[i + 1];   // race: a[i+1] may already have been overwritten by another thread
}
__syncthreads() Example

 Shift the contents of an array to the left by one element, this time using a barrier to separate all reads from all writes:

__global__ void kernel(int* a)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int temp = 0;
    if (i < 3)
        temp = a[i + 1];   // all participating threads read first
    __syncthreads();        // barrier: every thread in the block waits here
    if (i < 3)
        a[i] = temp;        // writes happen only after all reads are done
    __syncthreads();
}

 Note that __syncthreads() must be reached by all threads of a block, which is why it is placed outside the conditional.
Implicit barriers between kernels

 Besides explicit barriers within a kernel, CUDA has implicit barriers between kernel launches.
 When multiple kernels are launched sequentially into the same (default) stream, the GPU automatically ensures that each kernel completes before the next one begins. This synchronization guarantees that the second kernel's grid does not start until the first kernel has finished execution.
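For instance, with two hypothetical kernels issued into the default stream:

kernelA<<<blocksPerGrid, threadsPerBlock>>>(d_data);  // hypothetical kernel A
kernelB<<<blocksPerGrid, threadsPerBlock>>>(d_data);  // will not start until kernelA's grid has finished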
Host and Device Synchronization

 Asynchronous Host-Device Interaction
 By default, the host (CPU) does not wait for the GPU (device) to finish a kernel before moving to the next command.
 This asynchronous behaviour allows the CPU to perform other tasks while the GPU processes a kernel.
 Explicit Host Synchronization
 To make the host wait for the GPU to finish executing a kernel, we use cudaDeviceSynchronize().
 When the host code reaches cudaDeviceSynchronize(), it pauses and waits until the GPU has completed all previously issued kernel executions and memory operations.
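Applied to the vector addition example above (a sketch; error checking omitted):

vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c, n);
// The host continues immediately: the launch is asynchronous.
cudaDeviceSynchronize();   // block the host until all preceding GPU work has completed
// It is now safe to use results produced by the kernel (e.g., after copying them back).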
Implicit Synchronization: Between consecutive kernel launches, there is an implicit barrier that ensures each kernel finishes before the next starts.
