
UNIT 3: PARALLEL PROGRAMMING WITH CUDA

• Introduction to CUDA: Data Parallelism
• CUDA Program Structure
• A Vector Addition Kernel
• Device Global Memory and Data Transfer
• Kernel Functions
• Threading
What is CUDA?

 CUDA stands for Compute Unified Device Architecture.
 A general-purpose parallel computing platform and programming model.
 CUDA comes with a software environment that allows developers to use C++ as a high-level programming language.
 The CUDA parallel programming model is designed to overcome the challenge of writing scalable parallel software while maintaining a low learning curve for programmers familiar with standard programming languages such as C.
 At its core are three key abstractions: a hierarchy of thread groups, shared memories, and barrier synchronization. They are exposed to the programmer as a minimal set of language extensions.
 These abstractions guide the programmer to partition the problem into coarse sub-problems that can be solved independently in parallel by blocks of threads, and each sub-problem into finer pieces that can be solved cooperatively in parallel by all threads within the block.
CUDA program structure

 NVIDIA GPUs are widely used as accelerator devices.
 The CPU acts as the host.
 The CPU runs a main program that dispatches parallel tasks to GPU devices.
 Multiple GPU devices may be attached to the CPU.
 The CPU can send parallel jobs to these GPUs simultaneously.
 The GPUs execute these tasks and return results to the CPU (host).
 Any C program is valid host code.
 In general, CUDA programs (host + device code) cannot be compiled by standard C compilers.
 They are compiled with the NVIDIA C compiler (NVCC).
The Compilation Flow

 Create a C program with CUDA extensions.
 Use NVIDIA's NVCC compiler.
 NVCC splits the code into host (CPU) and device (GPU) segments.
 Host code is compiled by the system's C compiler and linker.
 Device code is JIT-compiled for execution on the GPU.
 Host code runs on the CPU; device code runs on the GPU.
The Execution Flow

1. Heterogeneous Programming: CUDA enables a heterogeneous programming model where code executes across CPU (host) and GPU (device) components.

2. Execution Flow:
   i. The host code (CPU) runs serially, launching GPU parallel kernels as needed.
   ii. GPU kernels (device code) execute in parallel, then return results to the host.
   iii. This back-and-forth execution enables both the CPU and the GPU to contribute to complex computations.

3. Separate Execution and Memory: CUDA assumes separate execution environments and memory spaces:
   i. The host (CPU) runs the main program.
   ii. The device (GPU) executes CUDA kernels as a coprocessor.
   iii. Each has its own memory: host memory and device memory, both in DRAM.

4. Memory Management: Programs must manage device memory allocation, deallocation, and data transfers between host and device using CUDA runtime calls.
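As a minimal sketch of this host-side pattern (the function name and the omitted kernel launch are placeholders; the full, error-checked version appears on the following slides):

#include <cuda_runtime.h>

void copyRoundTrip(float* h_a, int n) {
    size_t size = n * sizeof(float);
    float *d_a = NULL;
    cudaMalloc((void**)&d_a, size);                      // allocate device memory
    cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);  // transfer host -> device
    // ... launch kernel(s) that read/write d_a ...
    cudaMemcpy(h_a, d_a, size, cudaMemcpyDeviceToHost);  // transfer device -> host
    cudaFree(d_a);                                       // release device memory
}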
Vector Addition Program

#include <stdlib.h>

void vectorAdd(float* h_a, float* h_b, float* h_c, int n)
{
    for (int i = 0; i < n; i++)
        h_c[i] = h_a[i] + h_b[i];
}
int main()
{
    int n = 1000;  // Size of the vectors

    // Allocate memory for host vectors
    float *h_a = (float*)malloc(n * sizeof(float));
    float *h_b = (float*)malloc(n * sizeof(float));
    float *h_c = (float*)malloc(n * sizeof(float));

    // Initialize vectors h_a and h_b with some values
    for (int i = 0; i < n; i++)
    { h_a[i] = i * 1.0f; h_b[i] = i * 2.0f; }

    // Perform vector addition
    vectorAdd(h_a, h_b, h_c, n);

    // Free allocated memory
    free(h_a); free(h_b); free(h_c);
    return 0;
}
Vector Addition CPU-GPU

#include <cuda_runtime.h>
#include <iostream>
#include <cstdio>
#include <cstdlib>
using namespace std;

#define N 1000   // Size of the vectors (same as the CPU example)

// Kernel function to perform vector addition
__global__ void vectorAdd(float *a, float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

void vecAdd(float* h_a, float* h_b, float* h_c, int n) {
    int size = n * sizeof(float);
    float *d_a = NULL, *d_b = NULL, *d_c = NULL;
    cudaError_t err = cudaSuccess;
Device Memory Allocation

    err = cudaMalloc((void**)&d_a, size);
    if (err != cudaSuccess) {
        fprintf(stderr, "Failed to allocate device vector A (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
    err = cudaMalloc((void**)&d_b, size);
    if (err != cudaSuccess) {
        fprintf(stderr, "Failed to allocate device vector B (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
    err = cudaMalloc((void**)&d_c, size);
    if (err != cudaSuccess) {
        fprintf(stderr, "Failed to allocate device vector C (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
Host to Device Transfer

    // Copy input vectors from host memory to device memory
    cout << "Copy input data from the host memory to the CUDA device\n";
    err = cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
    if (err != cudaSuccess) {
        fprintf(stderr, "Failed to copy vector A from host to device (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
    err = cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);
    if (err != cudaSuccess) {
        fprintf(stderr, "Failed to copy vector B from host to device (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
Kernel Launch

    // Launch the kernel with n/256 blocks (rounded up) and 256 threads per block
    int threadsPerBlock = 256;
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c, n);

    // Check for any errors during kernel launch
    err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "Failed to launch vectorAdd kernel (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }
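For example, with n = 1000 and threadsPerBlock = 256, blocksPerGrid = (1000 + 256 - 1) / 256 = 1255 / 256 = 4 (integer division), so 4 × 256 = 1024 threads are launched; the if (i < n) check in the kernel leaves the extra 24 threads idle.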
Device to Host Memory Transfer

    // Copy result from device to host
    err = cudaMemcpy(h_c, d_c, size, cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        fprintf(stderr, "Failed to copy vector C from device to host (error code %s)!\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    // Free device memory
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
}
int main() {
    // Host vectors
    float h_a[N], h_b[N], h_c[N];

    // Initialize host vectors
    for (int i = 0; i < N; i++) {
        h_a[i] = i * 1.0f; h_b[i] = i * 2.0f;
    }

    // Perform vector addition on the device
    vecAdd(h_a, h_b, h_c, N);

    // Display the result (first 10 elements for brevity)
    cout << "Result of vector addition:\n";
    for (int i = 0; i < 10; i++) {
        cout << h_a[i] << " + " << h_b[i] << " = " << h_c[i] << endl;
    }
    return 0;
}
Compile and Run
nvcc kernel.cu host.cu -o output
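The resulting executable is then run from the host. Optionally, a target GPU architecture can be passed to nvcc (the file names are the ones used above; sm_70 is just an example compute capability):

nvcc -arch=sm_70 kernel.cu host.cu -o output
./output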


Observations

 cuda_runtime.h (included at the top of the program) provides the CUDA runtime API functions and built-in CUDA variables at compile time.
 h_a, h_b, h_c are host arrays residing in main (CPU) memory.
Observations

 cudaMalloc((void**)&d_b, size);
  1. This function allocates memory on the GPU (device) for the variable d_b.
  2. Here, size specifies the amount of memory to allocate, usually calculated from the number of elements and their data type (e.g., n * sizeof(float) for n floats).
  3. This memory on the device is used to store data that the GPU will work with, separate from the host memory.
 cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
  1. This function copies data from the host (CPU) memory to the device (GPU) memory.
  2. h_a is the pointer to the data on the host, and d_a is the pointer to the allocated memory on the device.
  3. cudaMemcpyHostToDevice specifies the direction of transfer, moving data from host to device.
  4. This allows the GPU to access the input data (h_a) during kernel execution.
Observations

 cudaMemcpy(h_c, d_c, size, cudaMemcpyDeviceToHost);
  1. This function copies data from the device (GPU) memory back to the host (CPU) memory.
  2. d_c is the pointer to the data on the device, and h_c is the pointer to the allocated memory on the host.
  3. cudaMemcpyDeviceToHost specifies the direction of transfer, moving data from device to host.
  4. This is used to retrieve the results of GPU computations (e.g., the output of a kernel) so that they can be accessed by the CPU for further processing or display.
CUDA Kernel

 A CUDA kernel, when invoked, launches multiple threads arranged in a two-level hierarchy.
 Consider the kernel call:

   vectorAdd<<<ceil(n/256.0), 256>>>(d_A, d_B, d_C, n);

 The call specifies a grid of threads to be launched.
 The grid is arranged in a hierarchical manner:
 (Number of blocks, Number of threads per block)
 All blocks contain the same number of threads (with a maximum of 1024 per block).
 Blocks can be indexed as (x, y, z) triplets, with further details on this structure to be covered later.
CUDA Kernel Invocation:
 When a CUDA kernel is called, it launches multiple threads organized in a two-level hierarchy.
 Each thread is a basic computing unit, executed by a scalar processor on the GPU.
 Scalar processors execute threads in parallel.

Kernel Parameters:
 The kernel launch includes two parameters: blocksPerGrid and threadsPerBlock.
 These parameters define the grid (collection of threads) to be launched.

Significance of Parameters:
 To perform vector addition, many compute threads are launched to add vector components in parallel.
 Threads are arranged in a two-level hierarchy:
 Blocks (high level): n/256 blocks, rounded up.
 Threads per Block (low level): 256 threads per block.

Grid and Hierarchy:
• The grid of threads is arranged hierarchically.
• Each block contains 256 threads.
• The grid is defined by the number of blocks (n/256, rounded up) and the number of threads per block (256).
Thread Indexing:
 Threads are indexed hierarchically: each block has an ID, and each thread within a block has an ID.
 In the code, the global thread ID is computed as:

   i = threadIdx.x + blockDim.x * blockIdx.x;

 This uses CUDA-defined variables:
 threadIdx: ID of the thread within its block.
 blockDim: number of threads per block.
 blockIdx: ID of the block within the grid.

Example of Thread Indexing:
 For block 0, threads are numbered 0 to 255.
 In block 0, blockIdx.x is 0, blockDim.x is 256, and threadIdx.x varies from 0 to 255.
 The global thread ID (i) is calculated for each thread based on its block and thread indices.
Accessing Global Memory:
 The global thread ID (i) determines which elements of the data arrays (e.g., A, B, and C) each thread will operate on.
 For example:
 Thread 1 in Block 0 calculates i = 1, thus computing C[1] = A[1] + B[1].
 Thread 1 in Block 1 calculates i = 1 + (1 * 256) = 257, computing C[257] = A[257] + B[257].

Parallel Processing Across Data Elements:
 Using this global thread ID, each thread can access different elements of the data arrays independently.
 This approach allows for efficient parallel addition of vectors by having each thread handle a specific element in the arrays A and B, storing its result in C.

Global Memory Usage:
 The data arrays (A, B, and C) reside in global memory, which is accessible by all threads in all blocks.
 Threads use the global thread ID to determine which data elements to access, allowing for coordinated parallel processing across the entire dataset.
Kernel-specific system variables

 gridDim: number of blocks in the grid.
 gridDim.x: number of blocks in dimension x of a multi-dimensional grid.
 blockDim: number of threads per block.
 blockDim.x: number of threads per block in dimension x of a multi-dimensional block.
 For a one-dimensional definition of the block composition in the grid, blockDim = blockDim.x.
 blockIdx.x: block number of a thread.
 threadIdx.x: thread number within a block.
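A small sketch that prints these variables from inside a kernel (the kernel name and launch configuration are illustrative; device-side printf is available on compute capability 2.0 and later):

#include <cstdio>

__global__ void showIds() {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread ID
    printf("block %d/%d, thread %d/%d -> i = %d\n",
           blockIdx.x, gridDim.x, threadIdx.x, blockDim.x, i);
}

// Example launch: 2 blocks of 4 threads each prints i = 0 .. 7
// showIds<<<2, 4>>>();
// cudaDeviceSynchronize();   // wait so that the device-side output is flushed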


__global__ void vectorAdd(float *a, float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

 The code is executed by all threads in the grid.
 Each thread has a unique combination of (blockIdx.x, threadIdx.x), which maps to a unique value of i.
 i is a value private to each thread.
Function declaration keywords

__global__ void vectorAdd(float *A, float *B, float *C, int n)

 __device__: Functions qualified with __device__ run on the GPU and can only be called from other device or global functions. They allow custom logic for GPU-side code.
 __global__: Functions qualified with __global__ are called "kernels" and are the entry points for GPU execution. These functions are launched from the host using the special CUDA kernel launch syntax (<<<...>>>). They must have a void return type and cannot return values directly.
 __host__: Functions qualified with __host__ run on the CPU and can only be called from other host functions.
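A minimal sketch illustrating the three qualifiers together (the function names are made up for illustration):

__device__ float addOne(float x) {                 // runs on the GPU, callable only from device/global code
    return x + 1.0f;
}

__global__ void incrementKernel(float* a, int n) { // kernel: GPU entry point, must return void
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        a[i] = addOne(a[i]);                       // device function called from the kernel
}

__host__ void launchIncrement(float* d_a, int n) { // runs on the CPU, launches the kernel
    incrementKernel<<<(n + 255) / 256, 256>>>(d_a, n);
}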
CUDA Function Types and Their Qualifiers

In CUDA, functions can be qualified with specific keywords (__host__, __device__, __global__) that define where the function executes and from where it can be called.

1. Default Host Functions
 Without any CUDA keywords, all functions are considered host functions by default.
 These functions:
 Execute on the CPU (host).
 Can only be called from other host functions.
 Do not interact directly with the GPU.

2. Declaring Host and Device Functions Together
 Functions can be declared as both __host__ and __device__ functions:
 __host__ __device__ float HostDeviceFunc();
 By qualifying a function as both __host__ and __device__, it can be called from both host code (CPU) and device code (GPU).
 When a function is declared this way, the CUDA compiler generates two versions of the function:
 One version is compiled for the CPU, allowing calls from other host functions.
 Another version is compiled for the GPU, allowing calls from device functions.
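A short sketch of a function compiled for both sides (the names are illustrative):

__host__ __device__ float square(float x) {   // one source, two compiled versions (CPU and GPU)
    return x * x;
}

__global__ void squareKernel(float* a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        a[i] = square(a[i]);                  // the GPU version is used here
}

// On the host, the same function can be called directly:
// float y = square(3.0f);                    // the CPU version is used here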
CUDA Function Types and Their Qualifiers

3. Global Functions and Dynamic Parallelism
 __global__ functions, also known as kernel functions:
 Run on the GPU (device).
 Can be launched from the CPU (host) using CUDA's kernel launch syntax:
 KernelFunc<<<gridDim, blockDim>>>(...);
 Global functions can also be called from within other kernels using the same kernel launch syntax if CUDA Dynamic Parallelism is enabled.
 This model requires:
 CUDA 5.0 or higher.
 GPU compute capability of 3.5 or higher.
 With Dynamic Parallelism, a kernel running on the GPU can launch additional kernels, enabling complex, nested parallel computations directly on the GPU without requiring intervention from the CPU (see the sketch below).
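A hedged sketch of dynamic parallelism (kernel and variable names are illustrative; it assumes a device of compute capability 3.5 or higher and compilation with relocatable device code, e.g. nvcc -arch=sm_35 -rdc=true):

__global__ void childKernel(int* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2;
}

__global__ void parentKernel(int* data, int n) {
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        // A kernel launched from within another kernel (Dynamic Parallelism)
        childKernel<<<(n + 255) / 256, 256>>>(data, n);
    }
}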
CUDA Functions: Additional Observations
 Return Types:
 __device__ functions can have a non-void return type. This allows them to return data directly to the calling kernel or function.
 __global__ functions (kernels) must always return void because they launch multiple threads across the GPU in parallel and do not return data directly.
 Execution Contexts:
 Global functions (__global__) can be called within other kernels on the GPU, creating new threads as part of the CUDA Dynamic Parallelism model.
 Device functions (__device__) run on the same thread as the calling kernel. They do not create new threads but rather allow the calling thread to execute custom logic.
Thread Synchronization
 Threads execute in parallel.
 A thread can read a result before another thread has written to that address.
 This is a race condition.
Thread Synchronization via explicit barrier

 Threads need to synchronize with one another to avoid race conditions.
 A barrier is a point in the kernel where all threads stop and wait for the others.
 When all threads have reached the barrier, they can proceed.
 Barriers can be implemented with: __syncthreads();
__syncthreads() Example

 Shift the contents of an array to the left by one element:

__global__ void kernel(int* a)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < 3)
        a[i] = a[i + 1];   // race: a[i+1] may already have been overwritten by another thread
}
__syncthreads() Example

 Shift the contents of an array to the left by one element, this time using a barrier to separate all reads from all writes:

__global__ void kernel(int* a)
{
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int temp = 0;
    if (i < 3)
        temp = a[i + 1];   // all participating threads read first
    __syncthreads();        // barrier: every thread in the block waits here
    if (i < 3)
        a[i] = temp;        // writes happen only after all reads are done
    __syncthreads();
}

 Note that __syncthreads() must be reached by all threads of a block, which is why it is placed outside the conditional.
Implicit barriers between kernels

 Besides explicit barriers within a kernel, CUDA has implicit barriers between kernel launches.
 When multiple kernels are launched sequentially into the same (default) stream, the GPU automatically ensures that each kernel completes before the next one begins. This synchronization guarantees that the second kernel's grid does not start until the first kernel has finished execution.
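For instance, with two hypothetical kernels issued into the default stream:

kernelA<<<blocksPerGrid, threadsPerBlock>>>(d_data);  // hypothetical kernel A
kernelB<<<blocksPerGrid, threadsPerBlock>>>(d_data);  // will not start until kernelA's grid has finished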
Host and Device Synchronization

 Asynchronous Host-Device Interaction
 By default, the host (CPU) does not wait for the GPU (device) to finish a kernel before moving to the next command.
 This asynchronous behaviour allows the CPU to perform other tasks while the GPU processes a kernel.
 Explicit Host Synchronization
 To make the host wait for the GPU to finish executing a kernel, we use cudaDeviceSynchronize().
 When the host code reaches cudaDeviceSynchronize(), it pauses and waits until the GPU has completed all previously issued kernel executions and memory operations.
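Applied to the vector addition example above (a sketch; error checking omitted):

vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_a, d_b, d_c, n);
// The host continues immediately: the launch is asynchronous.
cudaDeviceSynchronize();   // block the host until all preceding GPU work has completed
// It is now safe to use results produced by the kernel (e.g., after copying them back).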
Implicit Synchronization: Between consecutive kernel launches, there is an implicit barrier that ensures each kernel finishes before the next starts.
