CUDA PPT Anurita Unit3
WITH CUDA
• Device Global Memory and Data Transfer
• Kernel Functions
• Threading
What is CUDA?
The CPU runs a main program that dispatches parallel tasks to GPU devices.
The GPUs execute these tasks and return results to the CPU (host).
2. Execution Flow:
i. The host code (CPU) runs serially, launching GPU parallel kernels as needed.
ii. GPU kernels (device code) execute in parallel, then return results to the host.
iii. This back-and-forth execution enables both the CPU and the GPU to contribute to complex computations.
iv. Each has its own memory: host memory and device memory, both residing in DRAM.
#include <cuda_runtime.h>
#include <cstdio>    // fprintf
#include <cstdlib>   // exit, EXIT_FAILURE
#include <iostream>
using namespace std;

#define N 1024       // number of vector elements (illustrative size)

// Kernel function to perform vector addition
__global__ void vectorAdd(float *a, float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)
        c[i] = a[i] + b[i];
}
// Host function: allocates device memory, copies inputs, launches the kernel, and copies the result back
void vecAdd(float* h_a, float* h_b, float* h_c, int n) {
    int size = n * sizeof(float);
    float *d_a = NULL, *d_b = NULL, *d_c = NULL;
    cudaError_t err = cudaSuccess;
Device Memory Allocation
    err = cudaMalloc((void**)&d_a, size);
    if (err != cudaSuccess) {
        fprintf(stderr, "Failed to allocate device vector A (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
    err = cudaMalloc((void**)&d_b, size);
    if (err != cudaSuccess) {
        fprintf(stderr, "Failed to allocate device vector B (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
    err = cudaMalloc((void**)&d_c, size);
    if (err != cudaSuccess) {
        fprintf(stderr, "Failed to allocate device vector C (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
Host to Device Transfer
    // Copy input vectors from host memory to device memory
    cout << "Copy input data from the host memory to the CUDA device\n";
    err = cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
    if (err != cudaSuccess) {
        fprintf(stderr, "Failed to copy vector A from host to device (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
    err = cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);
    if (err != cudaSuccess) {
        fprintf(stderr, "Failed to copy vector B from host to device (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
Kernel Launch
    // Launch the kernel with enough blocks of 256 threads to cover all n elements
    int threadsPerBlock = 256;
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c, n);
    // Check for any errors during kernel launch
    err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "Failed to launch vectorAdd kernel (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
Device to Host Memory Transfer
    // Copy the result from device memory back to host memory
    err = cudaMemcpy(h_c, d_c, size, cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        fprintf(stderr, "Failed to copy vector C from device to host (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
    // Free device memory
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
}
int main() {
    // Host vectors
    float h_a[N], h_b[N], h_c[N];
    // Initialize host vectors
    for (int i = 0; i < N; i++) {
        h_a[i] = i * 1.0f;
        h_b[i] = i * 2.0f;
    }
    // Perform vector addition on the device
    vecAdd(h_a, h_b, h_c, N);
    // Display the first 10 elements of the result for brevity
    cout << "Result of vector addition:\n";
    for (int i = 0; i < 10; i++) {
        cout << h_a[i] << " + " << h_b[i] << " = " << h_c[i] << endl;
    }
    return 0;
}
Compile and Run
cuda_runtime.h -> included during compilation; it provides the CUDA API functions and CUDA system variables
h_a, h_b, h_c -> arrays mapped to host (main) memory locations
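A typical compile-and-run sequence with the NVIDIA compiler driver nvcc (the source file name vector_add.cu is illustrative):
nvcc vector_add.cu -o vector_add
./vector_add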
Observations
cudaMalloc((void**)&d_b, size);
1. This function allocates memory on the GPU (device) for the variable d_b.
2. Here, size specifies the amount of memory to allocate, usually calculated from the number of elements and their data type (e.g., n * sizeof(float) for n floats).
3. This memory on the device is used to store data that the GPU will work with, separate from the host memory.
cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
1. This function copies data from the host (CPU) memory to the device (GPU) memory.
2. h_a is the pointer to the data on the host, and d_a is the pointer to the allocated memory on the device.
3. cudaMemcpyHostToDevice specifies the direction of transfer, moving data from host to device.
4. This allows the GPU to access the input data (h_a) during kernel execution.
Observations
cudaMemcpy(h_c, d_c, size, cudaMemcpyDeviceToHost);
1. This function copies data from the device (GPU) memory back to the host (CPU) memory.
2. d_c is the pointer to the data on the device, and h_c is the pointer to the allocated memory on the host.
3. cudaMemcpyDeviceToHost specifies the direction of transfer, moving data from device to host.
4. This is used to retrieve the results of GPU computations (e.g., the output of a kernel) so that they can be accessed by the CPU for further processing or display.
CUDA Kernel
All blocks contain the same number of threads (with a maximum of 1024).
Kernel Parameters:
The kernel launch includes two parameters: blocksPerGrid and threadsPerBlock.
Significance of Parameters:
To perform vector addition, many compute threads are launched to add vector components in parallel.
Threads are arranged in a two-level hierarchy: a grid of thread blocks, with each block containing multiple threads.
#include <cuda_runtime.h>
// Kernel function to perform vector addition
__global__ void vectorAdd(float *a, float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)
        c[i] = a[i] + b[i];
}
Thread Indexing:
Threads are indexed hierarchically: each block has an ID (blockIdx), and each thread within a block has an ID (threadIdx).
In the code, the global thread ID (i) is computed for each thread from its block and thread indices:
i = blockIdx.x * blockDim.x + threadIdx.x
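For example, with threadsPerBlock = 256 (so blockDim.x = 256), the thread with blockIdx.x = 2 and threadIdx.x = 5 computes the global ID i = 2 * 256 + 5 = 517 and therefore adds a[517] and b[517] into c[517].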
Accessing Global Memory:
The global thread ID (i) determines which elements of the data arrays (e.g., A, B, and C) each thread will operate on.
For example, the thread with global ID i reads A[i] and B[i] and writes C[i], so each thread touches exactly one element of each array.
__device__: Functions qualified with __device__ run on the GPU and can only be called from other device or global functions. They allow custom logic for GPU-side code.
__global__: Functions qualified with __global__ are called "kernels" and are the entry points for GPU execution. These functions are launched from the host using the special CUDA kernel launch syntax (<<<...>>>). They must have a void return type and cannot return values directly.
__host__: Functions qualified with __host__ run on the CPU and can only be called from other host functions.
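A minimal sketch putting the three qualifiers side by side (the names scale, scaleAll, and launchScale are illustrative):

__device__ float scale(float x) {                  // runs on the GPU, callable only from device/global code
    return 2.0f * x;
}
__global__ void scaleAll(float *data, int n) {     // kernel: GPU entry point launched from the host
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = scale(data[i]);
}
__host__ void launchScale(float *d_data, int n) {  // ordinary host code: launches the kernel
    scaleAll<<<(n + 255) / 256, 256>>>(d_data, n);
}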
CUDA Function Types and Their Qualifiers
In CUDA, functions can be qualified with specific keywords (__host__, __device__, __global__) that define where the function executes and from where it can be called.
1. Default Host Functions
Without any CUDA keywords, all functions are considered host functions by default.
These functions:
Execute on the CPU (host).
Can only be called from other host functions.
Do not interact directly with the GPU.
2. Declaring Host and Device Functions Together
Functions can be declared as both __host__ and __device__ functions:
__host__ __device__ float HostDeviceFunc();
By qualifying a function as both __host__ and __device__, it can be called from both host code (CPU) and device code (GPU).
When a function is declared this way, the CUDA compiler generates two versions of the function:
One version is compiled for the CPU, allowing calls from other host functions.
Another version is compiled for the GPU, allowing calls from device functions.
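A small sketch of a function compiled for both sides (the names squareVal and squareKernel are illustrative):

__host__ __device__ float squareVal(float x) {
    return x * x;                       // same source, compiled once for the CPU and once for the GPU
}
__global__ void squareKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = squareVal(data[i]);   // device-side call
}
// Host code can also call squareVal(3.0f) directly as an ordinary CPU function.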
CUDA Function Types and Their Qualifiers
3. Global Functions and Dynamic Parallelism
__global__ functions, also known as kernel functions:
Run on the GPU (device).
Can be launched from the CPU (host) using CUDA's kernel launch syntax:
KernelFunc<<<gridDim, blockDim>>>(...);
Global functions can also be called from within other kernels using the same kernel launch syntax if CUDA Dynamic Parallelism is enabled. This model requires:
CUDA 5.0 or higher.
GPU compute capability of 3.5 or higher.
With Dynamic Parallelism, a kernel running on the GPU can launch additional kernels, enabling complex, nested parallel computations directly on the GPU without requiring intervention from the CPU, as in the sketch below.
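A minimal sketch of a parent kernel launching a child kernel (the names are illustrative; this assumes compute capability 3.5+ and compilation with relocatable device code, e.g. nvcc -rdc=true):

__global__ void childKernel(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2;
}
__global__ void parentKernel(int *data, int n) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        // Launched from the GPU itself, with no CPU involvement
        childKernel<<<(n + 255) / 256, 256>>>(data, n);
    }
}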
CUDA Functions: Additional Observations
Return Types:
__device__ functions can have a non-void return type. This allows them to return data directly to the calling kernel or function.
__global__ functions (kernels) must always return void because they launch multiple threads across the GPU in parallel and do not return data directly.
Execution Contexts:
Global functions (__global__) can be called within other kernels on the GPU, creating new threads as part of the CUDA Dynamic Parallelism model.
Device functions (__device__) run on the same thread as the calling kernel. They do not create new threads but rather allow the calling thread to execute custom logic.
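A hedged sketch of the contrast (the names average and averageKernel are illustrative): the __device__ helper returns a value to its calling thread, while the kernel itself returns void and delivers results by writing to device memory.

__device__ float average(float x, float y) {
    return 0.5f * (x + y);              // value returned to the calling thread; no new threads created
}
__global__ void averageKernel(const float *a, const float *b, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = average(a[i], b[i]);   // the kernel "returns" its result via device memory
}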
Thread Synchronization
Threads execute in parallel.
One thread can read a result before another thread has written to that address, producing a race condition, as in the fragment below:
// (kernel wrapper and thread index reconstructed around the original fragment)
__global__ void shiftLeft(float *a) {
    int i = threadIdx.x;
    if (i < 3)
        a[i] = a[i + 1];   // race: thread i may read a[i+1] before or after thread i+1 overwrites it
}
__syncthreads() Example
__global__ void shiftLeft(float *a) {   // kernel wrapper reconstructed for completeness
    int i = threadIdx.x;
    float temp = 0.0f;
    if (i < 3)
        temp = a[i + 1];   // all participating threads read first
    __syncthreads();       // barrier: every read completes before any write begins
    if (i < 3)
        a[i] = temp;       // no thread can now overwrite a value another thread has yet to read
    __syncthreads();
}
Implicit barriers between kernels
Besides explicit barriers within a kernel, CUDA has implicit barriers between kernel launches. When launching multiple kernels sequentially into the same stream, the GPU automatically ensures that each kernel completes before the next one begins. This synchronization guarantees that the second kernel's grid does not start until the first kernel has finished execution.
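A minimal sketch of this ordering (the kernel names stepOne and stepTwo are illustrative; both launches go to the default stream):

__global__ void stepOne(float *d) { d[threadIdx.x] += 1.0f; }
__global__ void stepTwo(float *d) { d[threadIdx.x] *= 2.0f; }

// In host code, with d_data pointing to device memory:
stepOne<<<1, 256>>>(d_data);
stepTwo<<<1, 256>>>(d_data);   // implicit barrier: does not begin until stepOne's grid has finished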
Host and Device Synchronization
Asynchronous Host-Device Interaction
By default, the host (CPU) does not wait for the GPU (device) to finish a kernel before moving to the next command.
This asynchronous behaviour allows the CPU to perform other tasks while the GPU processes a kernel.
Explicit Host Synchronization
To make the host wait for the GPU to finish executing a kernel, we use cudaDeviceSynchronize().
When the host code reaches cudaDeviceSynchronize(), it pauses until the GPU has completed all previously issued kernel executions and memory operations.
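A short sketch based on the vectorAdd launch from earlier in this unit (the commented CPU-side work is a placeholder):

vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c, n);  // launch returns to the host immediately
// ... the CPU is free to do unrelated work here while the GPU runs the kernel ...
cudaDeviceSynchronize();   // the host blocks here until all previously issued GPU work has finished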
Implicit Synchronization: Between consecutive kernel launches, there is an implicit barrier that ensures each kernel finishes before the next starts.