GPU (Graphics Processing Unit)
Graphics rendering is the process of generating images from a set of data or instructions, usually with the goal of displaying those images on a screen and providing users with a visual representation of a virtual environment, object, or scene. It is used in a wide range of applications, including video games, 3D modeling, computer-aided design, virtual reality, simulations, and more.
Originally, GPUs were developed specifically for rendering images and videos in computer
games and other visual applications. However, their highly parallel architecture and ability
to handle massive amounts of data quickly made them useful for a wide range of general-
purpose computing tasks beyond graphics.
History of GPU:
1. Early Graphics Cards (1970s-1980s): The concept of a graphics card emerged with the advent of personal computers in the 1970s and 1980s. Early graphics hardware was simple and focused primarily on rendering basic graphics and text. It facilitated the transition from text-based interfaces to more visually oriented ones, enabling users to interact with computers through graphical elements and basic shapes.
2. VGA Era (1987): The Video Graphics Array (VGA) standard was introduced in 1987,
providing better color and resolution support. This marked a significant improvement in PC
graphics.
3. 3D Acceleration (1990s): In the 1990s, the need for better 3D graphics in video games led to
the development of 3D graphics accelerators, which were essentially early forms of GPUs.
Companies like 3dfx and ATI played a key role in this era.
[Note - Both 3dfx and ATI (now part of AMD) were major players in the graphics card
industry in the 1990s, and while they were not game developers themselves, they played a
significant role in the development of 3D graphics technology and supported numerous
games through their graphics cards.]
4. NVIDIA GeForce 256 (1999): The NVIDIA GeForce 256, released in 1999, is often considered the first modern GPU. It was the first consumer graphics card with hardware transformation and lighting (T&L), which offloaded those stages of the graphics pipeline from the CPU.
[Note - "Hardware T&L" means that the GeForce 256 had dedicated hardware components for handling these tasks. Transformation involves converting the 3D coordinates of objects into 2D coordinates for rendering on a 2D screen, while lighting involves calculating how light interacts with the objects in the 3D scene. Programmable shaders, which let developers write their own shader code and customize rendering for more realistic and visually appealing graphics, arrived shortly afterwards with DirectX 8-class hardware such as the GeForce 3 (2001).]
5. ATI Radeon 9700 (2002): ATI's Radeon 9700, released in 2002, was a significant milestone,
offering high-performance graphics processing and advanced features like DirectX 9
support.
[Note - DirectX is commonly used in the Windows operating system for game development
and multimedia applications. When a piece of software or hardware is said to have "DirectX
9 support," it means that it is capable of working with DirectX 9 and taking advantage of its
features and capabilities.]
6. The Rise of CUDA (2007): NVIDIA's introduction of CUDA (Compute Unified Device
Architecture) in 2007 revolutionized GPU usage. CUDA allowed GPUs to be used for general-
purpose computing, not just graphics. This marked the beginning of GPUs as powerful
parallel processors for scientific, engineering, and artificial intelligence applications.
7. AMD and the GPU Industry (2000s-Present): AMD (formerly ATI) has been a key player in
the GPU industry alongside NVIDIA, offering a range of GPUs for gaming and professional
applications. The competition between these two companies has driven innovation.
CUDA (Compute Unified Device Architecture)
CUDA (Compute Unified Device Architecture) is a parallel computing platform and API
(Application Programming Interface) created by NVIDIA.
CUDA has gained popularity due to its ability to accelerate computationally intensive tasks
by offloading them to the GPU, which consists of thousands of cores that can perform many
operations simultaneously.
CUDA is built on NVIDIA's GPU architecture, which includes several key components:
a) Streaming Multiprocessors (SMs): These are the fundamental processing units on a GPU.
Each SM contains a number of CUDA cores (ALUs), specialized function units, and local
memory.
Different GPUs have different numbers of SMs, and this number can change from one GPU
generation to the next. High-end GPUs designed for scientific and compute workloads tend
to have more SMs than consumer-oriented GPUs built for gaming and graphics.
b) CUDA Cores: CUDA cores are small processing units within an SM that can execute
instructions in parallel. Modern GPUs have hundreds or even thousands of CUDA cores.
c) Global Memory: This is the GPU's main memory, which can be accessed by all SMs and
CUDA cores. It is used to store data and instructions.
d) Shared Memory: Each SM has a small but high-speed shared memory, which can be
accessed by all threads within an SM.
[Figure: CPU-GPU heterogeneous computing architecture - https://www.researchgate.net/figure/CPU-GPU-Heterogeneous-Computing-Architecture_fig1_276750219]
e) Registers: Each SM has a set of registers for storing data that is currently being processed.
f) Texture and Constant Memory: Special memory spaces optimized for certain data access
patterns.
g) Warp Schedulers: These units (sometimes described as thread execution units) manage the execution of threads and instruction scheduling within an SM.
The CUDA programming model allows developers to write code that runs on both the CPU and
GPU. Key components include:
a) Host and Device: In CUDA, there are two main components: the host (CPU) and the device
(GPU). The host manages the overall execution, handles input/output operations, and
orchestrates data transfers between the CPU and GPU. The device performs the actual
parallel processing tasks. It executes code written specifically for the GPU.
b) Host Code: The CPU code that manages data and controls the execution of GPU kernels.
c) Device Code (Kernels): In CUDA, parallel code is referred to as a "kernel." Kernels are
functions that execute on the GPU. Kernels are invoked from the host and run in parallel on
the GPU. Each kernel thread can perform a specific task on a piece of data, making it
suitable for data-parallel applications.
d) Grids and Blocks: Threads in CUDA are organized into blocks, and blocks are organized into
grids. Each thread executes the same kernel code but can have its own unique thread ID to
identify its task. The hierarchy allows developers to organize and control parallelism
effectively.
e) Threads and Warps: Individual threads execute kernels in parallel, and they are grouped into warps of 32 threads that are scheduled together and share some SM resources.
f) Data Transfer: Efficient data transfer between the host and device is crucial. CUDA provides
functions to manage data movement between CPU and GPU memory.
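To make these components concrete, here is a minimal sketch (the kernel name whoAmI and the 2-block, 4-thread launch configuration are illustrative choices) showing host code launching a device kernel organized as a grid of blocks of threads:

#include <cstdio>
#include <cuda_runtime.h>

// Device code (kernel): each thread prints its position in the grid.
__global__ void whoAmI() {
    printf("block %u, thread %u\n", blockIdx.x, threadIdx.x);
}

int main() {
    // Host code: launch a grid of 2 blocks, each with 4 threads (8 threads in total).
    whoAmI<<<2, 4>>>();
    // Wait for the device to finish before the program exits.
    cudaDeviceSynchronize();
    return 0;
}

Each of the eight threads runs the same kernel code but sees different blockIdx.x and threadIdx.x values, which is how data-parallel work is divided among threads in practice.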
Benefits of CUDA
There are several advantages that give CUDA an edge over traditional general-purpose GPU (GPGPU) computing done through graphics APIs:
a) High Performance: Due to the large number of CUDA cores and optimized memory hierarchies, CUDA can significantly accelerate many computationally intensive tasks. This makes it suitable for scientific simulations, machine learning, image processing, and more.
b) Developer Tools: NVIDIA provides a suite of developer tools, including the CUDA Toolkit, which includes a compiler, debugger, profiler, and performance optimization tools. These tools make it easier to develop, debug, and optimize CUDA code.
Limitations of CUDA
a) Vendor Lock-In: CUDA is proprietary and developed by NVIDIA. This means that it is
primarily supported on NVIDIA GPUs. If you develop your application using CUDA, it
may not be easily portable to GPUs from other vendors, such as AMD.
b) Hardware Dependence: CUDA code is often optimized for specific NVIDIA GPU
architectures. As GPU architectures evolve, older CUDA code may not fully utilize the
capabilities of newer GPUs, requiring code updates for optimal performance.
c) Complexity: Developing CUDA applications can be more complex than writing CPU-
based code. It requires understanding GPU architecture, managing data transfer
between the CPU and GPU, and optimizing for memory access patterns.
d) Overhead: There is overhead associated with transferring data between the CPU and GPU. If data transfers are frequent, this overhead can limit the performance benefits of GPU acceleration (a small timing sketch follows this list).
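As a rough illustration of this transfer overhead, the sketch below times a host-to-device copy with CUDA events (the array size and variable names are arbitrary choices, not from the original text):

#include <iostream>
#include <cuda_runtime.h>

int main() {
    const size_t N = 1 << 24;                  // ~16M floats (~64 MB), an arbitrary test size
    float* host = new float[N]();              // zero-initialized host buffer
    float* dev = nullptr;
    cudaMalloc((void**)&dev, N * sizeof(float));

    // CUDA events give GPU-side timestamps around the copy.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(dev, host, N * sizeof(float), cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    std::cout << "Host-to-device copy took " << ms << " ms" << std::endl;

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dev);
    delete[] host;
    return 0;
}

If a kernel's own runtime is small compared with numbers like this, moving the work to the GPU may not pay off.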
Applications of CUDA:
a) Scientific Simulations: CUDA is widely used in scientific simulations, such as fluid dynamics,
molecular dynamics, and weather modeling. These simulations involve complex
mathematical calculations that can be parallelized to significantly reduce the time required
for computation.
b) Deep Learning and AI: CUDA is a fundamental technology in the field of artificial
intelligence and deep learning. Frameworks like TensorFlow and PyTorch leverage CUDA to
accelerate the training of deep neural networks. The parallel processing capabilities of GPUs
allow for faster model training, enabling breakthroughs in AI applications like image
recognition and natural language processing.
c) Medical Imaging: Medical image processing tasks, including MRI and CT image
reconstruction, benefit from CUDA's parallelism. CUDA accelerates the reconstruction of 3D
images and the processing of medical data, enabling real-time visualization and diagnosis.
d) Financial Modeling: CUDA is used in financial institutions for pricing complex derivatives,
risk assessment, and portfolio optimization. These calculations involve extensive numerical
computations and Monte Carlo simulations.
e) Video and Image Processing: Video and image processing applications like video
transcoding, image filtering, and computer vision tasks benefit from CUDA. CUDA
accelerates these operations, reducing processing time and improving the quality of image
and video output.
f) Genomic Analysis: Genomic research, including DNA sequencing and bioinformatics,
involves extensive data processing. CUDA can significantly speed up tasks like sequence
alignment, genome assembly, and variant calling, enabling faster breakthroughs in
genomics.
g) Oil and Gas Exploration: CUDA plays a critical role in seismic data processing and reservoir
simulations for oil and gas exploration. CUDA enables the rapid processing of seismic data,
helping identify potential drilling sites and optimize oil reservoir extraction.
h) Simulated Reality and Gaming: The gaming industry benefits from CUDA for rendering
realistic 3D graphics and simulating complex game physics. It's not just limited to
entertainment; CUDA is also used in virtual reality and augmented reality applications.
i) Drug Discovery and Molecular Modeling: Pharmaceutical companies use CUDA for drug
discovery and molecular modeling. CUDA accelerates the computation of drug-protein
interactions, docking studies, and drug candidate screening.
NVIDIA offers a range of GPU product lines, each designed for specific applications and user requirements. We briefly discuss NVIDIA's GeForce, Quadro, and Tesla GPUs:
1. GeForce GPUs:
Purpose: GeForce GPUs are designed for gaming and consumer applications. They excel in
rendering high-quality graphics and providing a great gaming experience. They are also capable
of handling general computing tasks, making them suitable for CUDA programming.
Features: GeForce GPUs are a popular choice for gaming enthusiasts and can be used for CUDA
development, though they may not be optimized for professional or scientific computing.
2. Quadro GPUs:
Purpose: Quadro GPUs are designed for professional applications such as computer-aided
design (CAD), 3D modeling, scientific simulations, and content creation. They provide reliability,
precision, and stability for demanding professional workloads.
Features: Quadro GPUs offer certified drivers for professional software applications and come
with a focus on precision and accuracy. They are optimized for error-free calculations and are
well-suited for CUDA development in professional and scientific domains.
3. Tesla GPUs:
Purpose: Tesla GPUs are designed for data centers, supercomputing, and high-performance
computing clusters. They are built to deliver massive parallel processing power, making them
ideal for intensive scientific simulations and artificial intelligence workloads.
Features: Tesla GPUs are equipped with a large number of CUDA cores and optimized for
double-precision floating-point operations, making them suitable for scientific computing, deep
learning, and other HPC tasks. They often lack video output and focus purely on computation.
NVIDIA device drivers play a crucial role in facilitating the communication between the
operating system, software applications, and NVIDIA GPUs (Graphics Processing Units). These
drivers serve as a bridge that allows the hardware (the GPU) to work seamlessly with the
software (the operating system and applications). The detailed roles of NVIDIA device drivers
are:
a) Hardware Interface: NVIDIA device drivers act as an interface between the GPU and the
operating system. They enable the operating system to recognize the GPU as a hardware
component and communicate with it effectively.
b) GPU Initialization: When the computer is powered on or the GPU is installed, the device
driver initializes the GPU, ensuring it is ready to execute commands from the operating
system.
c) Error Handling: Device drivers are responsible for detecting and handling errors that may occur during GPU operations. They provide error reporting and recovery mechanisms to prevent system crashes due to GPU-related issues.
d) Compatibility and Optimization: NVIDIA frequently updates its device drivers to ensure compatibility with the latest operating systems and software updates. These updates also often include optimizations that improve the performance of the GPU in various applications.
In practice, the driver's workflow from installation to execution looks like this:
a) Installation: When you install an NVIDIA GPU in your computer, you typically install the corresponding device driver. The driver package includes both the driver software and associated control panels for configuring GPU settings.
b) Operating System Interaction: The operating system interacts with the device driver through a software layer: the Windows Display Driver Model (WDDM) on Windows or the Direct Rendering Manager (DRM) on Linux.
c) API Translation: When an application wants to use the GPU, it sends commands through an API such as CUDA (or a graphics API such as DirectX or OpenGL). The device driver translates these high-level commands into low-level, GPU-specific commands.
d) Execution on the GPU: The translated commands are sent to the GPU for execution. The
GPU performs the specified calculations or rendering tasks as directed by the device driver.
e) Data Transfer: The device driver manages data transfer between the CPU and GPU memory,
ensuring that the right data is available for GPU processing and that results are returned to
the CPU as needed.
f) Error Handling and Reporting: If errors occur during GPU operations the device driver
detects these errors, handles them, and, if possible, recovers to prevent system crashes.
g) Resource Management: The device driver manages memory allocation and deallocation on
the GPU. It ensures that different applications and processes do not interfere with each
other's use of the GPU's resources.
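A simple way to see the driver and the CUDA runtime interacting from code is to query their versions (a minimal sketch; error checking omitted):

#include <iostream>
#include <cuda_runtime.h>

int main() {
    int driverVersion = 0, runtimeVersion = 0;
    cudaDriverGetVersion(&driverVersion);    // highest CUDA version supported by the installed driver
    cudaRuntimeGetVersion(&runtimeVersion);  // CUDA runtime version the program was built against
    std::cout << "Driver supports CUDA " << driverVersion / 1000 << "."
              << (driverVersion % 1000) / 10 << std::endl;
    std::cout << "Runtime version " << runtimeVersion / 1000 << "."
              << (runtimeVersion % 1000) / 10 << std::endl;
    return 0;
}

If the runtime is newer than what the driver supports, CUDA calls will fail, which is one reason driver updates matter.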
CUDA Toolkit
The CUDA Toolkit is NVIDIA's development environment for building GPU-accelerated applications. Its main components are:
a) CUDA Compiler (nvcc): The CUDA Toolkit includes the CUDA C/C++ compiler (nvcc, the NVIDIA CUDA compiler). It allows developers to write and compile code that can be executed on NVIDIA GPUs. The compiler extends the C/C++ language to include GPU-specific features and allows for the definition of CUDA kernels, which are functions that run on the GPU.
b) CUDA Runtime Libraries: The toolkit provides a set of runtime libraries that offer
functionality for GPU programming, including memory management, thread
synchronization, and mathematical operations. These libraries simplify GPU application
development and are accessible from both CPU and GPU code.
c) CUDA Profiler and Debugger: CUDA developers can use NVIDIA's profiler and debugger
tools to identify and optimize performance bottlenecks in their GPU-accelerated
applications. These tools help developers understand how their code performs on the GPU,
identify issues, and make necessary optimizations. Profiler helps in finding the slowest parts
of the code so that it can be made faster.
d) CUDA Samples and SDK: The CUDA Toolkit includes a set of code samples and a software
development kit (SDK) with example programs to help developers learn and practice GPU
programming. These samples cover a wide range of GPU-accelerated applications, from
simple vector addition to more complex scientific simulations.
e) CUDA Math Libraries: The toolkit ships with GPU-optimized math libraries. The CUDA math API provides single- and double-precision mathematical functions, cuBLAS provides basic linear algebra subprograms, cuFFT is used for Fast Fourier Transforms, and cuSPARSE offers functionality for sparse matrix operations. These libraries are widely used in scientific and engineering applications (a small cuBLAS sketch follows this list).
f) Multi-GPU Support: The CUDA Toolkit includes features and libraries that facilitate multi-GPU programming. Developers can scale their applications across multiple GPUs for even greater parallel processing power.
g) Platform Compatibility: The CUDA Toolkit is designed to work with a variety of operating systems, including Windows and Linux (and, for older releases, macOS), allowing developers to choose the environment that best suits their needs.
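As an example of using one of these libraries, the sketch below calls cuBLAS to compute y = alpha*x + y (SAXPY) on the GPU; the vector size and values are arbitrary, and error checking is omitted for brevity:

#include <iostream>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    const int n = 1024;
    const float alpha = 2.0f;
    float x[n], y[n];
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 3.0f; }

    float *d_x, *d_y;
    cudaMalloc((void**)&d_x, n * sizeof(float));
    cudaMalloc((void**)&d_y, n * sizeof(float));
    cudaMemcpy(d_x, x, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y, n * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    cublasSaxpy(handle, n, &alpha, d_x, 1, d_y, 1);   // y = alpha * x + y
    cublasDestroy(handle);

    cudaMemcpy(y, d_y, n * sizeof(float), cudaMemcpyDeviceToHost);
    std::cout << "y[0] = " << y[0] << std::endl;      // expect 5

    cudaFree(d_x);
    cudaFree(d_y);
    return 0;
}

Compiling a program like this with nvcc typically also requires linking against the cuBLAS library (for example with -lcublas).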
Writing a program in CUDA involves creating both host code (to manage the program and interface with the GPU) and device code (the actual CUDA kernels that run on the GPU). The general structure of a complete CUDA program is as follows:
1. Include CUDA Headers: include <cuda_runtime.h> (along with any standard C/C++ headers) at the top of the source file.
The host code then follows this general skeleton:
int main() {
    // Allocate device memory
    // Copy data from host to device
    // Launch the CUDA kernel
    // Copy results from device to host
    // Free device memory
    return 0;
}
In the memory-copy calls, the size argument indicates how many bytes of data should be transferred between host and device memory. In the kernel launch statement, MyKernel is the name of the CUDA kernel that you want to launch.
9. Synchronize with cudaDeviceSynchronize() to ensure all GPU tasks are completed before the host uses the results.
A thread is the smallest unit of execution in CUDA. Threads execute the same instructions independently and in parallel.
The launch parameters blocksPerGrid and threadsPerBlock determine how many parallel threads will execute the kernel and how those threads are organized into blocks.
A block is a middle-level grouping of threads within a grid. Threads within the same
block can cooperate and communicate through shared memory.
The mapping of blocks to the GPU's Streaming Multiprocessors (SMs) and threads to
CUDA cores is handled by the CUDA runtime and the GPU's hardware. It's important to
note that the exact details of this mapping can vary based on the GPU architecture and
the specific configuration used when launching the kernel.
blocksPerGrid: It is the number of thread blocks you want to create in the grid for the
kernel execution. It defines the level of parallelism in terms of thread blocks. More
thread blocks mean more parallelism.
Having multiple blocks in a grid is particularly useful for problems where the data or
tasks can be divided into independent blocks that do not need to communicate with
each other during computation.
threadsPerBlock: It is the number of threads you want to include in each thread block. It
defines the level of parallelism within each block. More threads in a block mean more
fine-grained parallelism. The threads within a block can cooperate and share data using
shared memory, which can be more efficient for certain types of computations. It's
often a multiple of 32, as GPUs are designed to work well with thread counts that are
multiples of 32.
Having multiple threads in a block is useful for problems that require data sharing and
coordination between threads.
EXAMPLE
Suppose you have an array of 1000 elements that you want to process using a CUDA kernel. You want to launch the kernel in a way that leverages the GPU's parallelism effectively. You can decide, for example, to use 256 threads per block and 4 blocks per grid (4 x 256 = 1024 >= 1000), set up as follows:
#include <iostream>
#include <cuda_runtime.h>
int main() {
    const int N = 1000;
    const int threadsPerBlock = 256;
    const int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock; // = 4 blocks for N = 1000
    // the kernel launch would use <<<blocksPerGrid, threadsPerBlock>>> here
    return 0;
}
The line "int threadId = threadIdx.x + blockIdx.x * blockDim.x;" is often used in CUDA
kernel code to calculate a unique identifier (thread ID) for each thread within the grid.
threadIdx.x identifies a thread's position within its block.
blockIdx.x identifies a block's position within the grid.
blockDim.x tells you how many threads are in a block.
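Continuing the 1000-element example, here is a minimal sketch of how that thread ID is typically used inside a kernel (the kernel name processKernel and the doubling operation are illustrative assumptions). Note the bounds check, which is needed because 4 x 256 = 1024 threads are launched for only 1000 elements:

#include <cuda_runtime.h>

__global__ void processKernel(int* data, int n) {
    // Unique index of this thread across the whole grid.
    int threadId = threadIdx.x + blockIdx.x * blockDim.x;
    if (threadId < n) {              // guard against the extra 24 threads
        data[threadId] *= 2;         // illustrative per-element work
    }
}

int main() {
    const int N = 1000;
    int* d_data;
    cudaMalloc((void**)&d_data, N * sizeof(int));
    cudaMemset(d_data, 0, N * sizeof(int));
    processKernel<<<(N + 255) / 256, 256>>>(d_data, N);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}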
Example
Summing Vectors
#include <iostream>
#include <cuda_runtime.h>
int main() {
const int N = 256; // Size of the arrays
int a[N], b[N], result[N]; // Host (CPU) arrays
int* dev_a, * dev_b, * dev_result; // Device (GPU) arrays
return 0;
}
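The body of main() is omitted above; a sketch of the complete program is given below. The line labels B-G match the ones referenced in the "Passing Parameters" notes that follow (the kernel name add and the initialization values are illustrative assumptions):

#include <iostream>
#include <cuda_runtime.h>

// Kernel: each thread adds one pair of elements.
__global__ void add(int* a, int* b, int* result) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    result[i] = a[i] + b[i];
}

int main() {
    const int N = 256;                                         // Size of the arrays
    int a[N], b[N], result[N];                                 // Host (CPU) arrays
    int *dev_a, *dev_b, *dev_result;                           // Device (GPU) arrays

    for (int i = 0; i < N; ++i) { a[i] = i; b[i] = i * 2; }    // fill host arrays (illustrative values)

    cudaMalloc((void**)&dev_a, N * sizeof(int));               // B: allocate device memory for a
    cudaMalloc((void**)&dev_b, N * sizeof(int));               // C: allocate device memory for b
    cudaMalloc((void**)&dev_result, N * sizeof(int));          // D: allocate device memory for result

    cudaMemcpy(dev_a, a, N * sizeof(int), cudaMemcpyHostToDevice);  // E: copy a to the device
    cudaMemcpy(dev_b, b, N * sizeof(int), cudaMemcpyHostToDevice);  // F: copy b to the device

    add<<<1, N>>>(dev_a, dev_b, dev_result);                   // G: launch one block of N threads

    cudaMemcpy(result, dev_result, N * sizeof(int), cudaMemcpyDeviceToHost);

    std::cout << "Result: " << result[0] << " ... " << result[N - 1] << std::endl;

    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_result);
    return 0;
}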
NOTE - The line std::cout << "Result: " << result[0] << " ... " << result[N - 1] << std::endl; prints the message "Result: " followed by the first value in the result array, an ellipsis (" ... ") to indicate that there are more values, and then the last value in the result array. The std::endl at the end inserts a newline character to start a new line in the console output.
Passing Parameters
Lines B, C and D create memory spaces in the GPU of size N * sizeof(int) bytes and
associate them with the dev_a, dev_b and dev_result pointers. These are device
pointers that are created on the host (CPU) and will hold the addresses of memory on
the GPU where data is stored.
In Lines E and F, the data stored in the arrays a and b (located in the CPU's host
memory) are copied to the GPU's device memory pointed by dev_a and dev_b.
Line G explicitly passes the device pointers dev_a, dev_b, and dev_result as arguments to the kernel. Inside the kernel, the data pointed to by the formal parameters a, b, and result is expected to be in the GPU's memory space, and these pointers enable the kernel to access and operate on that data on the GPU.
Querying Devices
To query information about available GPUs using CUDA, you can use CUDA Runtime API
functions provided by the CUDA Toolkit. You can query information such as the number of
available GPUs, device properties (e.g., number of cores, memory size), and more. Here are the
general steps to query devices in CUDA:
Initialize CUDA and select a device:
cudaSetDevice(0);
This sets the current CUDA device to the first available GPU. You can change the device index to select a different GPU if you have multiple GPUs.
Get the number of CUDA-capable devices:
int deviceCount;
cudaGetDeviceCount(&deviceCount);
To get detailed information about a specific device, you can use cudaGetDeviceProperties():
cudaDeviceProp deviceProp;
cudaGetDeviceProperties(&deviceProp, deviceIndex);
deviceIndex should be set to the index of the GPU you want to query (e.g., 0 for the first GPU).
The deviceProp structure will hold information about the selected GPU, including its
capabilities, memory size, number of cores, and more.
A complete example that lists the available devices:
#include <iostream>
#include <cuda_runtime.h>
int main() {
    int deviceCount;
    cudaGetDeviceCount(&deviceCount);
    for (int i = 0; i < deviceCount; ++i) {
        cudaDeviceProp deviceProp;
        cudaGetDeviceProperties(&deviceProp, i);
        std::cout << "Device " << i << ": " << deviceProp.name << std::endl;
        std::cout << "  Number of SMs: " << deviceProp.multiProcessorCount << std::endl;
        std::cout << "  Total Global Memory: " << deviceProp.totalGlobalMem << " bytes" << std::endl;
        // Add more properties as needed.
    }
    return 0;
}
If you have two different tasks, Task 1 and Task 2, and you want to execute them in parallel,
you can still use blocks to manage this, but the structure and organization may vary based on
the nature of the tasks. Here is a general approach for executing Task 1 and Task 2 in parallel
using CUDA:
a) Create Two Different Kernels: Create two separate kernel functions, one for Task 1 and
one for Task 2. Each kernel should be responsible for one specific task.
b) Determine the Number of Blocks and Threads for Each Task: Decide how many blocks
and threads per block are needed for each task. The choice of these values should be
based on the nature of the tasks and the resources available on your GPU.
c) Launch the Kernels: Use the kernel launch configuration to execute both tasks, specifying the number of blocks and threads for each task:
// Launch Task 1
task1Kernel <<<blocksPerGrid1, threadsPerBlock1>>>(...);
// Launch Task 2
task2Kernel <<<blocksPerGrid2, threadsPerBlock2>>>(...);
Note that two kernels launched into the default stream execute one after the other; for them to actually run concurrently on the GPU, they generally need to be launched into separate CUDA streams (see the sketch after this list).
d) Handle Task-Specific Computation: Inside each kernel, you should perform the
computations specific to each task. The block and thread organization should be
managed according to the requirements of the individual tasks.
e) Synchronize if Needed: If the tasks require synchronization or coordination, you can use
__syncthreads() for synchronization within a block. However, note that CUDA kernels
typically don't synchronize across different blocks, so inter-block synchronization might
require more advanced techniques, like atomic operations or global memory.
f) Handle Data Transfer: If data needs to be transferred between the host and the device
for each task, use the appropriate CUDA memory management functions (e.g.,
cudaMalloc, cudaMemcpy, cudaFree) for each task.
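A minimal sketch of launching the two kernels into separate streams is shown below (the kernel bodies, data sizes, and launch configurations are placeholder assumptions):

#include <cuda_runtime.h>

// Hypothetical kernels standing in for Task 1 and Task 2.
__global__ void task1Kernel(float* data, int n) { /* Task 1 work */ }
__global__ void task2Kernel(float* data, int n) { /* Task 2 work */ }

int main() {
    const int n = 1 << 20;
    float *d1, *d2;
    cudaMalloc((void**)&d1, n * sizeof(float));
    cudaMalloc((void**)&d2, n * sizeof(float));

    // One stream per task so the kernels can overlap on the GPU.
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // The 4th launch parameter selects the stream (the 3rd is dynamic shared memory, 0 here).
    task1Kernel<<<256, 256, 0, s1>>>(d1, n);
    task2Kernel<<<256, 256, 0, s2>>>(d2, n);

    // Wait for both tasks to finish.
    cudaStreamSynchronize(s1);
    cudaStreamSynchronize(s2);

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(d1);
    cudaFree(d2);
    return 0;
}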
Reductions:
- Reduction is a common CUDA operation that combines multiple values into a single result, like
summing an array.
- Threads within a block can cooperate through shared memory and synchronization to perform
complex tasks.
- Shared memory is a type of memory accessible by threads within a block, enabling efficient
data sharing.
- The dot product is a mathematical operation used in linear algebra, and it can be efficiently
computed on GPUs.
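To tie these points together, here is a sketch of a block-level sum reduction using shared memory and __syncthreads() (the kernel name, block size, and the final atomicAdd across blocks are illustrative choices; the same pattern extends to a dot product by loading a[i] * b[i] instead of a single array element):

#include <iostream>
#include <cuda_runtime.h>

__global__ void sumKernel(const int* input, int* total, int n) {
    __shared__ int cache[256];                      // shared memory: visible to all threads in this block
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    cache[threadIdx.x] = (i < n) ? input[i] : 0;    // each thread loads one element
    __syncthreads();                                // wait until the whole block has loaded

    // Tree reduction within the block: halve the number of active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            cache[threadIdx.x] += cache[threadIdx.x + stride];
        __syncthreads();
    }

    // Thread 0 of each block adds the block's partial sum to the global total.
    if (threadIdx.x == 0)
        atomicAdd(total, cache[0]);
}

int main() {
    const int N = 1000;
    int h_input[N];
    for (int i = 0; i < N; ++i) h_input[i] = 1;     // the sum should come out to 1000

    int *d_input, *d_total;
    cudaMalloc((void**)&d_input, N * sizeof(int));
    cudaMalloc((void**)&d_total, sizeof(int));
    cudaMemcpy(d_input, h_input, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_total, 0, sizeof(int));

    sumKernel<<<(N + 255) / 256, 256>>>(d_input, d_total, N);

    int h_total = 0;
    cudaMemcpy(&h_total, d_total, sizeof(int), cudaMemcpyDeviceToHost);
    std::cout << "Sum = " << h_total << std::endl;

    cudaFree(d_input);
    cudaFree(d_total);
    return 0;
}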