
GPU (Graphics Processing Unit)

 A Graphics Processing Unit (GPU) is a specialized electronic circuit designed to accelerate graphics rendering and processing tasks.

 Graphics rendering refers to the process of generating images from a set of data or instructions, often with the goal of displaying those images on a screen, providing users with a visual representation of a virtual environment, object, or scene. It is used in a wide range of applications, including video games, 3D modeling, computer-aided design, virtual reality, simulations, and more.

 Originally, GPUs were developed specifically for rendering images and videos in computer
games and other visual applications. However, their highly parallel architecture and ability
to handle massive amounts of data quickly made them useful for a wide range of general-
purpose computing tasks beyond graphics.

GPU vs. CPU


1. Purpose
   CPU: CPUs are general-purpose processors designed for sequential and single-threaded tasks.
   GPU: GPUs are specialized processors designed for parallel processing and rendering graphics.

2. Architecture
   CPU: CPUs have a few powerful cores optimized for single-threaded performance. They have deep pipelines and extensive instruction sets.
   GPU: GPUs have thousands of smaller, simpler cores designed for parallel processing. They have shallow pipelines and execute multiple threads simultaneously.

3. Parallelism
   CPU: CPUs are inherently sequential, with limited parallelism. They are suitable for tasks that require step-by-step processing.
   GPU: GPUs are highly parallel processors, capable of executing thousands of threads simultaneously. They are ideal for data-parallel tasks.

4. Memory Hierarchy
   CPU: CPUs have a complex memory hierarchy, including caches, main memory (RAM), and storage.
   GPU: GPUs have a simpler memory hierarchy, with global memory, shared memory, and local memory.

5. Flexibility
   CPU: CPUs are flexible and can handle a wide range of tasks. They can run complex software and adapt to changing workloads.
   GPU: GPUs are less flexible and are optimized for specific tasks. They excel in repetitive, data-parallel operations.

6. Clock Speed
   CPU: CPUs typically have higher clock speeds, which contribute to their single-threaded performance.
   GPU: GPUs have lower clock speeds per core but compensate with a large number of cores, allowing them to handle many threads concurrently.

7. Power Efficiency
   CPU: CPUs are designed for power efficiency and are commonly found in laptops, servers, and mobile devices.
   GPU: GPUs consume more power and are typically found in workstations, gaming PCs, and specialized high-performance computing clusters.

8. Use Cases
   CPU: CPUs are used for general computing tasks, including running operating systems, web browsing, office applications, and running single-threaded or multi-threaded software.
   GPU: GPUs are used for graphics rendering, scientific simulations, deep learning, video encoding, and any task that benefits from parallel processing.

9. Hardware Cost
   CPU: CPUs are more expensive per core, as they are designed for general-purpose computing.
   GPU: GPUs are cost-effective for parallel workloads because they offer many cores at a lower price per core.

10. Programming Model
    CPU: CPUs use a sequential programming model. Code is executed one instruction at a time, and the emphasis is on control flow and branching.
    GPU: GPUs use a data-parallel programming model. Code is executed in parallel by many threads, and the emphasis is on SIMD (Single Instruction, Multiple Data) operations.
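To make row 10 concrete, here is a minimal sketch (an illustrative example, not taken from these notes) of the same element-wise addition written in both models; a complete CUDA program along these lines appears later under "Summing Vectors":

// CPU (sequential): one thread walks the whole array, one iteration at a time.
void addOnCPU(const int* a, const int* b, int* c, int n) {
    for (int i = 0; i < n; ++i) {
        c[i] = a[i] + b[i];
    }
}

// GPU (data-parallel): each of many threads handles one element, SIMD-style.
__global__ void addOnGPU(const int* a, const int* b, int* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique element index for this thread
    if (i < n) {
        c[i] = a[i] + b[i];
    }
}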

History of GPU:

1. Early Graphics Cards (1970s-1980s): The concept of a graphics card began with the advent
of personal computers in the 1970s and 1980s. Early GPUs were simple and primarily
focused on rendering basic graphics and text. They facilitated the transition from simple
text-based interfaces to more visually-oriented interfaces, enabling users to interact with
computers through graphical elements and basic shapes.

[Note - A graphics card is a hardware component in a computer that typically consists of several key parts, including the GPU (Graphics Processing Unit), video memory (VRAM), and various connectors for attaching monitors.]

2. VGA Era (1987): The Video Graphics Array (VGA) standard was introduced in 1987,
providing better color and resolution support. This marked a significant improvement in PC
graphics.

3. 3D Acceleration (1990s): In the 1990s, the need for better 3D graphics in video games led to
the development of 3D graphics accelerators, which were essentially early forms of GPUs.
Companies like 3dfx and ATI played a key role in this era.
[Note - Both 3dfx and ATI (now part of AMD) were major players in the graphics card
industry in the 1990s, and while they were not game developers themselves, they played a
significant role in the development of 3D graphics technology and supported numerous
games through their graphics cards.]

4. NVIDIA GeForce 256 (1999): The NVIDIA GeForce 256, released in 1999, is often considered
the first modern GPU. It was the first consumer graphics card with hardware transform and
lighting (T&L). Programmable shaders, which allow developers to customize graphics rendering,
followed shortly afterwards (first appearing with DirectX 8-class hardware such as the GeForce 3 in 2001).

[Note - "Hardware T&L" means that the GeForce 256 had dedicated hardware components
for handling these tasks. Transformation involves converting 3D coordinates of objects into
2D coordinates for rendering on a 2D screen, while lighting involves calculating how light
interacts with the objects in the 3D scene. The introduction of programmable shaders
allowed developers to write their own shader code, enabling more realistic and visually
appealing graphics.]

5. ATI Radeon 9700 (2002): ATI's Radeon 9700, released in 2002, was a significant milestone,
offering high-performance graphics processing and advanced features like DirectX 9
support.

[Note - DirectX is commonly used in the Windows operating system for game development
and multimedia applications. When a piece of software or hardware is said to have "DirectX
9 support," it means that it is capable of working with DirectX 9 and taking advantage of its
features and capabilities.]

6. The Rise of CUDA (2007): NVIDIA's introduction of CUDA (Compute Unified Device
Architecture) in 2007 revolutionized GPU usage. CUDA allowed GPUs to be used for general-
purpose computing, not just graphics. This marked the beginning of GPUs as powerful
parallel processors for scientific, engineering, and artificial intelligence applications.

7. AMD and the GPU Industry (2000s-Present): AMD (formerly ATI) has been a key player in
the GPU industry alongside NVIDIA, offering a range of GPUs for gaming and professional
applications. The competition between these two companies has driven innovation.

CUDA (Compute Unified Device Architecture)

 CUDA (Compute Unified Device Architecture) is a parallel computing platform and API
(Application Programming Interface) created by NVIDIA.

 It enables developers to harness the computational power of NVIDIA GPUs (Graphics Processing Units) for a wide range of applications, including scientific simulations, deep learning, image processing, and more.

 CUDA has gained popularity due to its ability to accelerate computationally intensive tasks
by offloading them to the GPU, which consists of thousands of cores that can perform many
operations simultaneously.

CUDA supported GPU Architecture

CUDA is built on NVIDIA's GPU architecture, which includes several key components:

a) Streaming Multiprocessors (SMs): These are the fundamental processing units on a GPU.
Each SM contains a number of CUDA cores (ALUs), specialized function units, and local
memory.
Different GPUs have different numbers of SMs, and this number can change from one GPU
generation to the next. High-end GPUs designed for scientific and compute workloads tend
to have more SMs than consumer-oriented GPUs built for gaming and graphics.

b) CUDA Cores: CUDA cores are small processing units within an SM that can execute
instructions in parallel. Modern GPUs have hundreds or even thousands of CUDA cores.

c) Global Memory: This is the GPU's main memory, which can be accessed by all SMs and
CUDA cores. It is used to store data and instructions.

d) Shared Memory: Each SM has a small but high-speed shared memory, which can be accessed by all threads of a block running on that SM.

[Figure: CPU-GPU heterogeneous computing architecture - https://www.researchgate.net/figure/CPU-GPU-Heterogeneous-Computing-Architecture_fig1_276750219]

e) Registers: Each SM has a set of registers for storing data that is currently being processed.

f) Texture and Constant Memory: Special memory spaces optimized for certain data access
patterns.

g) Warp Schedulers: These units manage the execution of threads, in groups of 32 called warps, and instruction scheduling within an SM.
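A minimal kernel sketch (an illustrative example with assumed names, not from the original notes) showing how the global memory, shared memory, and registers listed above appear in CUDA code:

// Sketch: scale an array, staging values in shared memory along the way.
// Assumes the kernel is launched with at most 256 threads per block.
__global__ void scaleWithSharedMemory(const float* in, float* out, float factor, int n) {
    __shared__ float tile[256];                   // shared memory: visible to all threads of the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (i < n) ? in[i] : 0.0f;             // 'v' and 'factor' are held in registers
    tile[threadIdx.x] = v;                        // stage the value in shared memory
    __syncthreads();                              // every thread in the block reaches the barrier
    if (i < n) {
        out[i] = tile[threadIdx.x] * factor;      // write the result back to global memory
    }
}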

CUDA Programming Model

The CUDA programming model allows developers to write code that runs on both the CPU and
GPU. Key components include:

a) Host and Device: In CUDA, there are two main components: the host (CPU) and the device
(GPU). The host manages the overall execution, handles input/output operations, and
orchestrates data transfers between the CPU and GPU. The device performs the actual
parallel processing tasks. It executes code written specifically for the GPU.

b) Host Code: The CPU code that manages data and controls the execution of GPU kernels.
c) Device Code (Kernels): In CUDA, parallel code is referred to as a "kernel." Kernels are
functions that execute on the GPU. Kernels are invoked from the host and run in parallel on
the GPU. Each kernel thread can perform a specific task on a piece of data, making it
suitable for data-parallel applications.

d) Grids and Blocks: Threads in CUDA are organized into blocks, and blocks are organized into
grids. Each thread executes the same kernel code but can have its own unique thread ID to
identify its task. The hierarchy allows developers to organize and control parallelism
effectively.

e) Threads and Warps: Individual threads execute kernels in parallel, and they are grouped into warps (groups of 32 threads that are scheduled and executed together and share some resources).

f) Data Transfer: Efficient data transfer between the host and device is crucial. CUDA provides
functions to manage data movement between CPU and GPU memory.

CUDA Process Flow

a) Copy data from main memory to GPU memory
b) CPU initiates the GPU compute kernel
c) GPU's CUDA cores execute the kernel in parallel
d) Copy the resulting data from GPU memory to main memory

Benefits of CUDA

There are several advantages that give CUDA an edge over traditional general-purpose GPU (GPGPU) computing performed through graphics APIs:

a) Massive Parallelism: CUDA allows developers to leverage the parallel processing power of modern GPUs, which can have thousands of CUDA cores. This makes it possible to perform massively parallel computations that would be impractical or extremely time-consuming on a CPU.

b) High Performance: Due to the large number of CUDA cores and optimized memory
hierarchies, CUDA can significantly accelerate many computationally intensive tasks.
This makes it suitable for scientific simulations, machine learning, image processing,
and more.

c) General-Purpose Computing: CUDA isn't limited to graphics-related tasks. It enables general-purpose computing on GPUs, allowing you to accelerate a wide range of applications, from scientific simulations to deep learning and data analysis.

d) CUDA Libraries: NVIDIA provides a rich ecosystem of CUDA libraries, including cuBLAS (for linear algebra), cuFFT (for Fast Fourier Transforms), cuDNN (for deep learning), and more. These libraries can simplify the development of GPU-accelerated applications.

e) Heterogeneous Computing: CUDA supports hybrid CPU-GPU computing. You can offload specific parts of your code to the GPU, allowing you to take advantage of both CPU and GPU resources in a single application.

f) Developer Tools: NVIDIA provides a suite of developer tools, including the CUDA
Toolkit, which includes a compiler, debugger, profiler, and performance optimization
tools. These tools make it easier to develop, debug, and optimize CUDA code.
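The libraries mentioned in item (d) are called from ordinary host code. A minimal sketch using cuBLAS (an illustrative example; the program must be linked against the cuBLAS library, e.g. with -lcublas) that computes y = alpha*x + y (SAXPY) on the GPU:

#include <iostream>
#include <cuda_runtime.h>
#include <cublas_v2.h>   // cuBLAS: NVIDIA's GPU linear-algebra library

int main() {
    const int n = 4;
    float x[n] = {1, 2, 3, 4};
    float y[n] = {10, 20, 30, 40};
    const float alpha = 2.0f;

    float *dev_x, *dev_y;
    cudaMalloc((void**)&dev_x, n * sizeof(float));
    cudaMalloc((void**)&dev_y, n * sizeof(float));
    cudaMemcpy(dev_x, x, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_y, y, n * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);                               // initialize the cuBLAS context
    cublasSaxpy(handle, n, &alpha, dev_x, 1, dev_y, 1);  // y = alpha * x + y, computed on the GPU
    cublasDestroy(handle);

    cudaMemcpy(y, dev_y, n * sizeof(float), cudaMemcpyDeviceToHost);
    std::cout << "y[0] = " << y[0] << ", y[3] = " << y[3] << std::endl;  // expected: 12 and 48

    cudaFree(dev_x);
    cudaFree(dev_y);
    return 0;
}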
Limitations of CUDA

a) Vendor Lock-In: CUDA is proprietary and developed by NVIDIA. This means that it is
primarily supported on NVIDIA GPUs. If you develop your application using CUDA, it
may not be easily portable to GPUs from other vendors, such as AMD.

b) Hardware Dependence: CUDA code is often optimized for specific NVIDIA GPU
architectures. As GPU architectures evolve, older CUDA code may not fully utilize the
capabilities of newer GPUs, requiring code updates for optimal performance.

c) Complexity: Developing CUDA applications can be more complex than writing CPU-
based code. It requires understanding GPU architecture, managing data transfer
between the CPU and GPU, and optimizing for memory access patterns.

d) Memory Management: CUDA programmers must explicitly manage memory, which can be error-prone. Failing to correctly allocate and deallocate memory can lead to resource leaks or crashes.

e) Overhead: There is overhead associated with transferring data between the CPU
and GPU. If data transfers are frequent, this overhead can limit the performance
benefits of GPU acceleration.
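The transfer overhead described in item (e) can be measured directly with CUDA events; a minimal sketch (assumed buffer size, for illustration only):

#include <iostream>
#include <cuda_runtime.h>

int main() {
    const int N = 1 << 20;                       // about one million ints (assumed size)
    int* host = new int[N];
    int* dev;
    cudaMalloc((void**)&dev, N * sizeof(int));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);                      // timestamp before the transfer
    cudaMemcpy(dev, host, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaEventRecord(stop);                       // timestamp after the transfer
    cudaEventSynchronize(stop);                  // wait until 'stop' has actually been recorded

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);      // elapsed time in milliseconds
    std::cout << "Host-to-device copy took " << ms << " ms" << std::endl;

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dev);
    delete[] host;
    return 0;
}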

Applications of CUDA:

a) Scientific Simulations: CUDA is widely used in scientific simulations, such as fluid dynamics,
molecular dynamics, and weather modeling. These simulations involve complex
mathematical calculations that can be parallelized to significantly reduce the time required
for computation.
b) Deep Learning and AI: CUDA is a fundamental technology in the field of artificial
intelligence and deep learning. Frameworks like TensorFlow and PyTorch leverage CUDA to
accelerate the training of deep neural networks. The parallel processing capabilities of GPUs
allow for faster model training, enabling breakthroughs in AI applications like image
recognition and natural language processing.
c) Medical Imaging: Medical image processing tasks, including MRI and CT image
reconstruction, benefit from CUDA's parallelism. CUDA accelerates the reconstruction of 3D
images and the processing of medical data, enabling real-time visualization and diagnosis.
d) Financial Modeling: CUDA is used in financial institutions for pricing complex derivatives,
risk assessment, and portfolio optimization. These calculations involve extensive numerical
computations and Monte Carlo simulations.
e) Video and Image Processing: Video and image processing applications like video
transcoding, image filtering, and computer vision tasks benefit from CUDA. CUDA
accelerates these operations, reducing processing time and improving the quality of image
and video output.
f) Genomic Analysis: Genomic research, including DNA sequencing and bioinformatics,
involves extensive data processing. CUDA can significantly speed up tasks like sequence
alignment, genome assembly, and variant calling, enabling faster breakthroughs in
genomics.
g) Oil and Gas Exploration: CUDA plays a critical role in seismic data processing and reservoir
simulations for oil and gas exploration. CUDA enables the rapid processing of seismic data,
helping identify potential drilling sites and optimize oil reservoir extraction.
h) Simulated Reality and Gaming: The gaming industry benefits from CUDA for rendering
realistic 3D graphics and simulating complex game physics. It's not just limited to
entertainment; CUDA is also used in virtual reality and augmented reality applications.
i) Drug Discovery and Molecular Modeling: Pharmaceutical companies use CUDA for drug
discovery and molecular modeling. CUDA accelerates the computation of drug-protein
interactions, docking studies, and drug candidate screening.

CUDA Enabled Graphics Processors

NVIDIA offers a range of GPU product lines, each designed for specific applications and user
requirements. We briefly discuss NVIDIA's GeForce, Quadro, and Tesla GPUs below:

1. GeForce GPUs:

Audience: Gamers and Consumer Market

Purpose: GeForce GPUs are designed for gaming and consumer applications. They excel in
rendering high-quality graphics and providing a great gaming experience. They are also capable
of handling general computing tasks, making them suitable for CUDA programming.

Features: GeForce GPUs are a popular choice for gaming enthusiasts and can be used for CUDA
development, though they may not be optimized for professional or scientific computing.

2. Quadro GPUs:

Audience: Professional Workstations

Purpose: Quadro GPUs are designed for professional applications such as computer-aided
design (CAD), 3D modeling, scientific simulations, and content creation. They provide reliability,
precision, and stability for demanding professional workloads.
Features: Quadro GPUs offer certified drivers for professional software applications and come
with a focus on precision and accuracy. They are optimized for error-free calculations and are
well-suited for CUDA development in professional and scientific domains.

3. Tesla GPUs:

Audience: Data Centers and High-Performance Computing (HPC)

Purpose: Tesla GPUs are designed for data centers, supercomputing, and high-performance
computing clusters. They are built to deliver massive parallel processing power, making them
ideal for intensive scientific simulations and artificial intelligence workloads.

Features: Tesla GPUs are equipped with a large number of CUDA cores and optimized for
double-precision floating-point operations, making them suitable for scientific computing, deep
learning, and other HPC tasks. They often lack video output and focus purely on computation.

NVIDIA Device Drivers

NVIDIA device drivers play a crucial role in facilitating the communication between the
operating system, software applications, and NVIDIA GPUs (Graphics Processing Units). These
drivers serve as a bridge that allows the hardware (the GPU) to work seamlessly with the
software (the operating system and applications). The detailed roles of NVIDIA device drivers
are:

a) Hardware Interface: NVIDIA device drivers act as an interface between the GPU and the
operating system. They enable the operating system to recognize the GPU as a hardware
component and communicate with it effectively.

b) GPU Initialization: When the computer is powered on or the GPU is installed, the device
driver initializes the GPU, ensuring it is ready to execute commands from the operating
system.

c) Instruction Translation: Software applications, including games, multimedia software, and CUDA-based applications, communicate with the GPU using a high-level programming language. The device driver is responsible for translating these high-level instructions into low-level instructions that the GPU can understand and execute.
d) Resource Management: Device drivers manage GPU resources, such as memory allocation
and deallocation, to ensure efficient use of the GPU's capabilities. They help coordinate
access to GPU resources for different applications.

e) Error Handling: Device drivers are responsible for detecting and handling errors that may
occur during GPU operations. They provide error reporting and recovery mechanisms to
prevent system crashes due to GPU-related issues.

f) Compatibility and Optimization: NVIDIA frequently updates its device drivers to ensure
compatibility with the latest operating systems and software updates. These updates also
often include optimizations that improve the performance of the GPU in various
applications.

How NVIDIA Device Drivers Work:

a) Installation: When you install an NVIDIA GPU in your computer, you typically install the
corresponding device driver. The driver package includes both the driver software and
associated control panels for configuring GPU settings.

b) Operating System Interaction: The operating system interacts with the device driver through a software layer such as the Windows Display Driver Model (WDDM) on Windows or the Direct Rendering Manager (DRM) on Linux.

c) API Translation: When an application wants to use the GPU, it sends commands through
CUDA. The device driver translates these high-level commands into low-level GPU-specific
commands.

d) Execution on the GPU: The translated commands are sent to the GPU for execution. The
GPU performs the specified calculations or rendering tasks as directed by the device driver.

e) Data Transfer: The device driver manages data transfer between the CPU and GPU memory,
ensuring that the right data is available for GPU processing and that results are returned to
the CPU as needed.

f) Error Handling and Reporting: If errors occur during GPU operations, the device driver detects these errors, handles them, and, if possible, recovers to prevent system crashes.

g) Resource Management: The device driver manages memory allocation and deallocation on
the GPU. It ensures that different applications and processes do not interfere with each
other's use of the GPU's resources.

CUDA Development Toolkit

The CUDA Development Toolkit, or CUDA Toolkit, is a comprehensive software package provided by NVIDIA to support GPU development using the CUDA programming model. The CUDA Toolkit includes a set of tools, libraries, compilers, and resources to help developers create GPU-accelerated software.

a) CUDA Compiler (nvcc): The CUDA Toolkit includes the CUDA C/C++ compiler (nvcc – NVIDIA
CUDA compiler). It allows developers to write and compile code that can be executed on
NVIDIA GPUs. The compiler extends the C/C++ language to include GPU-specific features
and allows for the definition of CUDA kernels, which are functions that run on the GPU.

b) CUDA Runtime Libraries: The toolkit provides a set of runtime libraries that offer
functionality for GPU programming, including memory management, thread
synchronization, and mathematical operations. These libraries simplify GPU application
development and are accessible from both CPU and GPU code.

c) CUDA Profiler and Debugger: CUDA developers can use NVIDIA's profiler and debugger
tools to identify and optimize performance bottlenecks in their GPU-accelerated
applications. These tools help developers understand how their code performs on the GPU,
identify issues, and make necessary optimizations. The profiler helps locate the slowest parts of the code so they can be optimized.

d) CUDA Samples and SDK: The CUDA Toolkit includes a set of code samples and a software
development kit (SDK) with example programs to help developers learn and practice GPU
programming. These samples cover a wide range of GPU-accelerated applications, from
simple vector addition to more complex scientific simulations.

e) Documentation and Tutorials: The CUDA Toolkit provides extensive documentation, including programming guides, reference manuals, and online resources. It also offers tutorials and educational materials to help developers get started with GPU programming and understand CUDA concepts.
f) Integration with Popular IDEs: CUDA development is well-supported in popular integrated
development environments (IDEs) like Visual Studio. NVIDIA offers plugins and tools that
make it easy to develop, debug, and profile CUDA code in your preferred IDE.

g) CUDA Math Libraries: NVIDIA provides mathematical libraries optimized for GPU execution, with functions for single- and double-precision operations that are useful for scientific and engineering applications. cuBLAS provides basic linear algebra subprograms, cuFFT is used for Fast Fourier Transforms, and cuSPARSE offers functionality for sparse matrix operations.

h) Multi-GPU Support: The CUDA Toolkit includes features and libraries that facilitate multi-
GPU programming. Developers can scale their applications across multiple GPUs for even
greater parallel processing power.

i) Platform Compatibility: The CUDA Toolkit is designed to work with a variety of operating
systems, including Windows, Linux, and macOS, allowing developers to choose the
environment that best suits their needs.
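As a rough illustration of the multi-GPU support mentioned in item (h), a host program can enumerate the available GPUs and launch work on each of them (a sketch with a hypothetical kernel myKernel, not from the original notes):

#include <cuda_runtime.h>

// Hypothetical per-device kernel; the body is a placeholder.
__global__ void myKernel(int device) { }

int main() {
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);            // how many CUDA-capable GPUs are present

    for (int d = 0; d < deviceCount; ++d) {
        cudaSetDevice(d);                        // subsequent CUDA calls target GPU 'd'
        myKernel<<<1, 32>>>(d);                  // launch a (trivial) kernel on this GPU
    }

    for (int d = 0; d < deviceCount; ++d) {
        cudaSetDevice(d);
        cudaDeviceSynchronize();                 // wait for each GPU to finish its work
    }
    return 0;
}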

Writing CUDA program

 Writing a program in CUDA involves creating both host code (to manage the program and interface with the GPU) and device code (the actual CUDA kernels that run on the GPU). The structure of a complete CUDA program is as follows:
1. Include CUDA Headers:

#include <iostream> // standard C++ I/O
#include <cuda_runtime.h> // CUDA runtime library header file

2. Define a CUDA Kernel:

__global__ void MyKernel(parameters) {
    // CUDA kernel code
}

3. Host Code (in the main function):

int main() {
// Allocate device memory
// Copy data from host to device
// Launch the CUDA kernel
// Copy results from device to host
// Free device memory
return 0;
}

4. Allocate Device Memory:

cudaMalloc((void**)&dev_ptr, size);

/* cudaMalloc is used for dynamically allocating device memory.
   dev_ptr is a pointer that will point to the allocated device memory.
   cudaMalloc expects a pointer to a pointer that will store the address of the allocated memory.
   size specifies the size, in bytes, of the memory needed for the particular data structure or array. */

5. Copy Data from Host to Device:

cudaMemcpy(dev_ptr, host_ptr, size, cudaMemcpyHostToDevice);

/* cudaMemcpy is a function for copying data between different memory spaces, such as from the host (CPU) to the device (GPU) or vice versa.
   dev_ptr and host_ptr: pointers to the destination memory on the GPU (device memory) and the source memory on the CPU (host memory).
   size indicates how many bytes of data should be transferred between host and device memory.
   cudaMemcpyHostToDevice is an enumeration value that specifies the direction of the memory transfer; in this case, data is copied from host memory to device memory. */

6. Launch the CUDA Kernel:

MyKernel<<<blocksPerGrid, threadsPerBlock>>>(kernel_parameters);

/* MyKernel is the name of the CUDA kernel that you want to launch.
   <<<blocksPerGrid, threadsPerBlock>>> is called the execution configuration and specifies how many threads and blocks are used to execute the kernel.
   blocksPerGrid is the number of thread blocks in the grid.
   threadsPerBlock is the number of threads in each thread block.
   kernel_parameters are the arguments you pass to the kernel function. */

7. Copy Results from Device to Host:

cudaMemcpy(host_ptr, dev_ptr, size, cudaMemcpyDeviceToHost);

8. Free Device Memory:

cudaFree(dev_ptr);

9. Synchronize with cudaDeviceSynchronize(): (to ensure all GPU tasks are completed)
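The call itself, matching the pattern of the earlier steps:

cudaDeviceSynchronize(); // blocks the host until all previously launched GPU work has finished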

Setting number of threads

 A thread is the smallest unit of execution in CUDA. Threads execute the same
instructions independently and in parallel.

 The blocksPerGrid and threadsPerBlock parameters are used to specify the configuration for launching a CUDA kernel.

 They determine how many parallel threads will execute the kernel and how those
threads are organized into blocks.

 In CUDA, a "grid" is a collection of "blocks," and each "block" is a collection of "threads."

 A grid is the highest-level grouping of threads in a CUDA program. It represents the entire set of threads that are launched to perform a specific task.

 A block is a middle-level grouping of threads within a grid. Threads within the same
block can cooperate and communicate through shared memory.

 The mapping of blocks to the GPU's Streaming Multiprocessors (SMs) and threads to
CUDA cores is handled by the CUDA runtime and the GPU's hardware. It's important to
note that the exact details of this mapping can vary based on the GPU architecture and
the specific configuration used when launching the kernel.

 blocksPerGrid: It is the number of thread blocks you want to create in the grid for the
kernel execution. It defines the level of parallelism in terms of thread blocks. More
thread blocks mean more parallelism.
 Having multiple blocks in a grid is particularly useful for problems where the data or
tasks can be divided into independent blocks that do not need to communicate with
each other during computation.

 threadsPerBlock: It is the number of threads you want to include in each thread block. It
defines the level of parallelism within each block. More threads in a block mean more
fine-grained parallelism. The threads within a block can cooperate and share data using
shared memory, which can be more efficient for certain types of computations. It's
often a multiple of 32, as GPUs are designed to work well with thread counts that are
multiples of 32.

 Having multiple threads in a block is useful for problems that require data sharing and
coordination between threads.

EXAMPLE

Suppose you have an array of 1000 elements that you want to process using a CUDA kernel.
You want to launch the kernel in a way that leverages the GPU's parallelism effectively. You can
decide to use the following configuration:

const int N = 1000; // Size of the array
const int threadsPerBlock = 256; // Number of threads per block

// Calculate the number of blocks needed to cover all elements
const int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;

// Launch the CUDA kernel
MyKernel<<<blocksPerGrid, threadsPerBlock>>>(kernel_parameters);
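With these values, blocksPerGrid = (1000 + 255) / 256 = 4, so the launch creates 4 * 256 = 1024 threads. The 24 surplus threads are why kernels typically include a bounds check such as if (i < N), as the vector-addition example later in these notes does.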

Hello World Program:

#include <iostream>
#include <cstdio>           // declares printf, which is also usable inside kernels
#include <cuda_runtime.h>

// CUDA kernel to print "Hello, CUDA!" from each thread
__global__ void helloCUDA() {
    int threadId = threadIdx.x + blockIdx.x * blockDim.x;
    printf("Hello, CUDA! Thread %d\n", threadId);
}

int main() {
    const int N = 16; // Number of threads

    // Define the grid and block dimensions for kernel execution
    int threadsPerBlock = 8;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;

    // Launch the CUDA kernel
    helloCUDA<<<blocksPerGrid, threadsPerBlock>>>();

    // Synchronize to ensure all threads finish
    cudaDeviceSynchronize();

    return 0;
}

Calculation of Thread Id in CUDA

 The line "int threadId = threadIdx.x + blockIdx.x * blockDim.x;" is often used in CUDA
kernel code to calculate a unique identifier (thread ID) for each thread within the grid.
threadIdx.x identifies a thread's position within its block.
blockIdx.x identifies a block's position within the grid.
blockDim.x tells you how many threads are in a block.

Example

 Define the grid and block configuration:
o Number of blocks (blocksPerGrid) = 2
o Number of threads per block (threadsPerBlock) = 3

 Calculate the total number of threads in the grid:
o Total threads in the grid = blocksPerGrid * threadsPerBlock = 2 blocks * 3 threads per block = 6 threads in total.

 Now, let's calculate the thread IDs for each of the 6 threads:
o Thread 0 in Block 0: Thread ID = threadIdx.x (0) + blockIdx.x (0) * blockDim.x (3) = 0
o Thread 1 in Block 0: Thread ID = threadIdx.x (1) + blockIdx.x (0) * blockDim.x (3) = 1
o Thread 2 in Block 0: Thread ID = threadIdx.x (2) + blockIdx.x (0) * blockDim.x (3) = 2
o Thread 0 in Block 1: Thread ID = threadIdx.x (0) + blockIdx.x (1) * blockDim.x (3) = 3
o Thread 1 in Block 1: Thread ID = threadIdx.x (1) + blockIdx.x (1) * blockDim.x (3) = 4
o Thread 2 in Block 1: Thread ID = threadIdx.x (2) + blockIdx.x (1) * blockDim.x (3) = 5

Summing Vectors

#include <iostream>
#include <cuda_runtime.h>

// CUDA kernel to add two arrays on the GPU
__global__ void addArrays(int* a, int* b, int* result, int size) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < size) {
        result[i] = a[i] + b[i];
    }
}

int main() {
    const int N = 256; // Size of the arrays
    int a[N], b[N], result[N]; // Host (CPU) arrays
    int* dev_a, * dev_b, * dev_result; // Device (GPU) arrays

    // Initialize input arrays a and b
    for (int i = 0; i < N; i++) {
        a[i] = i;
        b[i] = i * 2;
    }

    // Allocate memory on the GPU
    cudaMalloc((void**)&dev_a, N * sizeof(int));
    cudaMalloc((void**)&dev_b, N * sizeof(int));
    cudaMalloc((void**)&dev_result, N * sizeof(int));

    // Copy input arrays from CPU to GPU
    cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);

    // Define the grid and block dimensions for kernel execution
    int threadsPerBlock = 64;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;

    // Launch the CUDA kernel to add arrays
    addArrays<<<blocksPerGrid, threadsPerBlock>>>(dev_a, dev_b, dev_result, N);

    // Copy the result from GPU to CPU
    cudaMemcpy(result, dev_result, N * sizeof(int), cudaMemcpyDeviceToHost);

    // Free GPU memory
    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_result);

    // Print the first and last elements of the result
    std::cout << "Result: " << result[0] << " ... " << result[N - 1] << std::endl;

    return 0;
}
NOTE - The line std::cout << "Result: " << result[0] << " ... " << result[N - 1] << std::endl; prints the message "Result: " followed by the first value in the result array, an ellipsis (" ... ") to indicate there are more values, and then the last value in the result array. The std::endl at the end inserts a newline character to start a new line in the console output.

Passing Parameters

In the above program, we have following lines of code:

A. __global__ void addArrays(int* a, int* b, int* result, int size) { }
B. cudaMalloc((void**)&dev_a, N * sizeof(int));
C. cudaMalloc((void**)&dev_b, N * sizeof(int));
D. cudaMalloc((void**)&dev_result, N * sizeof(int));
E. cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);
F. cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);
G. addArrays<<<blocksPerGrid, threadsPerBlock>>>(dev_a, dev_b, dev_result, N);
H. cudaMemcpy(result, dev_result, N * sizeof(int), cudaMemcpyDeviceToHost);

 Lines B, C and D create memory spaces in the GPU of size N * sizeof(int) bytes and
associate them with the dev_a, dev_b and dev_result pointers. These are device
pointers that are created on the host (CPU) and will hold the addresses of memory on
the GPU where data is stored.
 In Lines E and F, the data stored in the arrays a and b (located in the CPU's host memory) are copied to the GPU's device memory pointed to by dev_a and dev_b.

 The line G is explicitly passing the device pointers dev_a, dev_b, and dev_result as
arguments to the kernel. Inside the kernel, the data pointed to by the formal
parameters a, b and result are expected to be in the GPU's memory space, and these
pointers enable the kernel to access and operate on that data on the GPU.

Querying Devices

To query information about available GPUs using CUDA, you can use CUDA Runtime API
functions provided by the CUDA Toolkit. You can query information such as the number of
available GPUs, device properties (e.g., number of cores, memory size), and more. Here are the
general steps to query devices in CUDA:

 Initialize CUDA:

The CUDA runtime is initialized implicitly on the first runtime API call; before querying devices you can optionally select which GPU subsequent calls target.

cudaSetDevice(0); // Select the first GPU (index 0) or a specific GPU.

The above code sets the current CUDA device to the first available GPU. You can change the device index to select a different GPU if you have multiple GPUs.

 Query Device Count:

To get the number of available CUDA-capable devices, use cudaGetDeviceCount():

int deviceCount;
cudaGetDeviceCount(&deviceCount); // deviceCount will now contain the number of available GPUs.

 Query Device Properties:

To get detailed information about a specific device, you can use cudaGetDeviceProperties():

cudaDeviceProp deviceProp;

cudaGetDeviceProperties(&deviceProp, deviceIndex);
deviceIndex should be set to the index of the GPU you want to query (e.g., 0 for the first GPU).
The deviceProp structure will hold information about the selected GPU, including its compute capability, memory size, number of multiprocessors (SMs), and more.

#include <iostream>
#include <cuda_runtime.h>

int main() {
    int deviceCount;
    cudaGetDeviceCount(&deviceCount);

    for (int i = 0; i < deviceCount; ++i) {
        cudaDeviceProp deviceProp;
        cudaGetDeviceProperties(&deviceProp, i);

        std::cout << "Device " << i << ": " << deviceProp.name << std::endl;
        std::cout << " Number of SMs (multiprocessors): " << deviceProp.multiProcessorCount << std::endl;
        std::cout << " Total Global Memory: " << deviceProp.totalGlobalMem << " bytes" << std::endl;
        // Add more properties as needed.

        std::cout << std::endl;
    }
    return 0;
}

Splitting Parallel Blocks

If you have two different tasks, Task 1 and Task 2, and you want to execute them in parallel,
you can still use blocks to manage this, but the structure and organization may vary based on
the nature of the tasks. Here is a general approach for executing Task 1 and Task 2 in parallel
using CUDA:

a) Create Two Different Kernels: Create two separate kernel functions, one for Task 1 and
one for Task 2. Each kernel should be responsible for one specific task.

b) Determine the Number of Blocks and Threads for Each Task: Decide how many blocks
and threads per block are needed for each task. The choice of these values should be
based on the nature of the tasks and the resources available on your GPU.
c) Launch the Kernels: Use the kernel launch configuration to execute both tasks in
parallel, specifying the number of blocks and threads for each task:

// Launch Task 1
task1Kernel <<<blocksPerGrid1, threadsPerBlock1>>>(...);

// Launch Task 2
task2Kernel <<<blocksPerGrid2, threadsPerBlock2>>>(...);

d) Handle Task-Specific Computation: Inside each kernel, you should perform the
computations specific to each task. The block and thread organization should be
managed according to the requirements of the individual tasks.

e) Synchronize if Needed: If the tasks require synchronization or coordination, you can use
__syncthreads() for synchronization within a block. However, note that CUDA kernels
typically don't synchronize across different blocks, so inter-block synchronization might
require more advanced techniques, like atomic operations or global memory.

f) Handle Data Transfer: If data needs to be transferred between the host and the device
for each task, use the appropriate CUDA memory management functions (e.g.,
cudaMalloc, cudaMemcpy, cudaFree) for each task.
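One caveat worth adding: kernels launched into the same (default) stream execute one after another, so to let Task 1 and Task 2 actually overlap on the GPU they can be placed in separate CUDA streams. A minimal sketch with two hypothetical kernels (assumed names and launch configurations):

#include <cuda_runtime.h>

// Hypothetical kernels for Task 1 and Task 2; the bodies are placeholders.
__global__ void task1Kernel() { /* computation for Task 1 */ }
__global__ void task2Kernel() { /* computation for Task 2 */ }

int main() {
    cudaStream_t stream1, stream2;
    cudaStreamCreate(&stream1);
    cudaStreamCreate(&stream2);

    // Launch each task into its own stream; the GPU may run them concurrently
    // when enough SMs and other resources are available.
    task1Kernel<<<4, 128, 0, stream1>>>();
    task2Kernel<<<8, 256, 0, stream2>>>();

    // Wait for both streams to finish before using their results.
    cudaStreamSynchronize(stream1);
    cudaStreamSynchronize(stream2);

    cudaStreamDestroy(stream1);
    cudaStreamDestroy(stream2);
    return 0;
}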
Reductions:

- Reduction is a common CUDA operation that combines multiple values into a single result, like summing an array.
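A minimal sketch of such a reduction (an illustrative example, not from the original notes): each block sums its chunk of the input in shared memory and writes one partial sum, which can then be combined on the host or by a second kernel launch.

// Block-level sum reduction (sketch). Assumes blockDim.x is a power of two no larger
// than 256; launch with blocksPerGrid = (n + blockDim.x - 1) / blockDim.x, and give
// 'partial' one slot per block.
__global__ void sumReduce(const int* input, int* partial, int n) {
    __shared__ int cache[256];                       // shared memory for this block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    cache[threadIdx.x] = (i < n) ? input[i] : 0;     // load one element (or 0 past the end)
    __syncthreads();

    // Tree-style reduction: halve the number of active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride) {
            cache[threadIdx.x] += cache[threadIdx.x + stride];
        }
        __syncthreads();
    }

    if (threadIdx.x == 0) {
        partial[blockIdx.x] = cache[0];              // one partial sum per block
    }
}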

Thread Cooperation:

- Threads within a block can cooperate through shared memory and synchronization to perform complex tasks.

Shared Memory and Synchronization:

- Shared memory is a type of memory accessible by threads within a block, enabling efficient data sharing.

- Synchronization is vital to ensure threads work together effectively.

Dot Product:

- The dot product is a mathematical operation used in linear algebra, and it can be efficiently computed on GPUs.
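Combining the last three ideas, a dot-product kernel can multiply elements pairwise, reduce the products in shared memory, and merge the per-block results with an atomic add. A sketch under those assumptions (the result variable must be zero-initialized on the device before launch; a per-block partial-sum array plus a host-side sum would work equally well):

// Dot product sketch. Assumes blockDim.x is a power of two no larger than 256.
__global__ void dotProduct(const float* a, const float* b, float* result, int n) {
    __shared__ float cache[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    cache[threadIdx.x] = (i < n) ? a[i] * b[i] : 0.0f;   // pairwise product, staged in shared memory
    __syncthreads();

    // Shared-memory reduction of the products within this block.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride) {
            cache[threadIdx.x] += cache[threadIdx.x + stride];
        }
        __syncthreads();
    }

    if (threadIdx.x == 0) {
        atomicAdd(result, cache[0]);                 // combine block sums into the final result
    }
}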
