PARALLEL PROGRAMMING
MODULE – 4
Introduction to Data Parallelism with CUDA
6th SEM
B.Tech
DSE
Many Core Systems: Heterogeneous
Parallel Computing
Heterogeneous Multicore Processors
• Heterogeneous cores are not identical.
• They can differ in capabilities and speed, may lack certain features, or may otherwise perform a task differently.
• Heterogeneous computing typically refers to a system that uses
multiple types of computing cores, like CPUs, GPUs, ASICs, FPGAs, and
NPUs.
• By assigning different workloads to processors that are designed for specific purposes or specialized processing, performance and energy efficiency are improved.
• The term “heterogeneous compute” may also refer to the use of
processors based on different computer architectures, a common
approach when a particular architecture is better suited for a specific
task due to power efficiency, compatibility, or the number of cores
available.
• An early and still relatively common form of heterogeneous computing
is the combination of CPU cores and a GPU (Graphics Processing
Unit), used for gaming and other graphics-rich applications.
• Heterogeneous computing enables a single system to have multiple
computing sub-systems.
Advantages:
• These processors, which may execute core instructions differently,
work in parallel to
• Accelerate compute speed and
• Minimize the time required to complete a task.
Applications:
• This is particularly useful in the development of Artificial Intelligence
(AI) and Machine Learning (ML) workloads, where vast amounts of
data must be processed and converted for a seamless user
experience.
Heterogeneous Parallel Computing – Example
Uses of Heterogeneous Parallel Computing
Introduction to Data Parallelism
• Data parallelism is a parallel computing paradigm in which a large task is
divided into smaller, independent, simultaneously processed subtasks.
• Via this approach, different processors or computing units perform the
same operation on multiple pieces of data at the same time.
• Data Parallelism means concurrent execution of the same task on multiple computing cores.
• The same task is performed on different subsets of the same data.
• The primary goals of data parallelism are to:
• Improve computational efficiency
• Increase speed
Example 1:
Example 2: Data parallelism in matrix
multiplication
How Does Data Parallelism Work?
Data parallelism works by:
1. Dividing data into chunks
• The first step in data parallelism is breaking down a large data set into
smaller, manageable chunks.
• This division can be based on various criteria, such as dividing rows of a
matrix or segments of an array.
2. Distributed processing
• Once the data is divided into chunks, each chunk is assigned to a
separate processor or thread.
• This distribution allows for parallel processing, with each processor
independently working on its allocated portion of the data.
3. Simultaneous processing
• Multiple processors or threads work on their respective chunks
simultaneously.
• This simultaneous processing enables a significant reduction in the overall
computation time, as different portions of the data are processed
concurrently.
4. Operation replication
• The same operation or set of operations is applied to each chunk
independently.
• This ensures that the results are consistent across all processed chunks.
• Common operations include mathematical computations, transformations, or
other tasks that can be parallelized.
5. Aggregation
• After processing their chunks, the results are aggregated or
combined to obtain the final output.
• The aggregation step might involve summing, averaging, or
otherwise combining the individual results from each processed
chunk.
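For illustration (not part of the original slides), here is a minimal CUDA C sketch of these five steps: the input array is copied to the GPU, each thread applies the same operation (squaring) to its own element, and the host aggregates the partial results. The kernel name squareChunk and the launch configuration are arbitrary choices; CUDA itself is introduced later in this module.

#include <stdio.h>

// Step 4 (operation replication): every thread runs the same operation on its own element.
__global__ void squareChunk(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // which element this thread owns
    if (i < n) out[i] = in[i] * in[i];
}

int main()
{
    const int n = 1024;
    float h_in[n], h_out[n];
    for (int i = 0; i < n; ++i) h_in[i] = 1.0f;

    // Steps 1-2 (divide and distribute): copy the data to the GPU and
    // split it across 4 blocks of 256 threads.
    float *d_in, *d_out;
    cudaMalloc((void**)&d_in,  n * sizeof(float));
    cudaMalloc((void**)&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    squareChunk<<<4, 256>>>(d_in, d_out, n);         // step 3: simultaneous processing

    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    // Step 5 (aggregation): combine the per-element results on the host.
    float sum = 0.0f;
    for (int i = 0; i < n; ++i) sum += h_out[i];
    printf("sum = %f\n", sum);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}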
Benefits of Data Parallelism
• Improved Performance
• Data parallelism leads to a significant performance improvement by allowing
multiple processors or threads to work on different chunks of data
simultaneously. This parallel processing approach results in faster execution
of computations compared to sequential processing.
• Scalability
• One of the major advantages of data parallelism is its scalability. As the size of
the data set or the complexity of computations increases, data parallelism can
scale easily by adding more processors or threads. This makes it well-suited
for handling growing workloads without a proportional decrease in
performance.
• Efficient Resource Usage
• By distributing the workload across multiple processors or threads, data
parallelism enables efficient use of available resources. This ensures that
computing resources, such as CPU cores or GPUs, are fully engaged, leading
to better overall system efficiency.
• Handling Large Data Sets
• Data parallelism is particularly effective in addressing the challenges posed by
large data sets. By dividing the data set into smaller chunks, each processor
can independently process its portion, enabling the system to handle massive
amounts of data in a more manageable and efficient manner.
• Improved Throughput
• Data parallelism enhances system throughput by parallelizing the execution of
identical operations on different data chunks. This results in a higher
throughput as multiple tasks are processed simultaneously, reducing the
overall time required to complete the computations.
• Fault Tolerance
• In distributed computing environments, data parallelism can contribute to
fault tolerance. If one processor or thread encounters an error or failure, the
impact is limited to the specific chunk of data it was processing, and other
processors can continue their work independently.
• Versatility across Domains
• Data parallelism is versatile and applicable across various domains, including
scientific research, data analysis, artificial intelligence, and simulation. Its
adaptability makes it a valuable approach for a wide range of applications.
Data Parallelism Real-world Applications
• Machine Learning
• Image and Video Processing
• Genomic Data Analysis
• Financial Analytics
• Climate Modeling
• Computer Graphics
CPU and GPU Design Philosophy
Latency vs Throughput oriented architecture
Latency-Oriented Architecture (e.g., CPU):
• Focuses on minimizing the time it takes to complete individual tasks or instructions.
• Achieves this through reduced latency of cache accesses, minimized branch mispredictions, and prioritized execution of critical-path instructions.
• Well-suited for tasks that require low response times and are not easily parallelizable.
• E.g., general-purpose computing tasks, gaming, and real-time processing applications.

Throughput-Oriented Architecture (e.g., GPU):
• The priority is maximizing the total number of operations completed within a given timeframe.
• While individual operations may have higher latency compared to CPUs, GPUs excel at processing large volumes of data in parallel, resulting in high overall throughput.
• Well-suited for highly parallelizable tasks such as graphics rendering, scientific simulations, and deep learning training.
• E.g., GPUs that achieve high throughput by exploiting thread-level and data-level parallelism across thousands of processing cores.
CPU (Central Processing Unit)
• A CPU, or central processing unit, is a hardware component that is the
core computational unit in a server.
• It handles all types of computing tasks required for the operating
system and applications to run correctly.
• It is constructed from billions of transistors, can have multiple processing cores, and is commonly referred to as the “brain” of the computer.
• It is essential to all modern computing systems, as it executes the
commands and processes needed for your computer and operating
system.
• A latency-oriented processor architecture is a microarchitecture designed to serve a serial computing thread with low latency.
• The CPU comprises the arithmetic logic unit (ALU), which is used to quickly store information and perform calculations, and
• the control unit (CU), which performs instruction sequencing as well as branching.
• The CPU interacts with other computer components, such as memory and input/output, to execute instructions.
GPU (Graphics Processing Unit)
• A graphics processing unit (GPU) is a hardware component similar to the CPU, but more specialized.
• A GPU supports the CPU to perform concurrent calculations.
• The main difference between CPU and GPU architecture is that a CPU is designed to handle a wide range of tasks quickly (as measured by CPU clock speed), but is limited in how many tasks it can run concurrently.
• A GPU can complete simple and repetitive tasks much faster because it can
break the task down into smaller components and finish them in parallel.
• A GPU is designed to quickly render high-resolution images and video
concurrently.
• It can more efficiently handle complex mathematical operations that run in
parallel than a general CPU.
• While GPUs were initially created to handle graphics and hyper-realistic gaming visuals, they have evolved to become more general-purpose parallel processors as well, handling a growing range of applications, including AI.
• The GPU is a processor that is made up of many smaller and more specialized
cores.
• By working together, the cores deliver massive performance when a processing
task can be divided up across many cores at the same time.
• It contains more ALUs than a CPU.
• A GPU outpaces a CPU on such parallel workloads, and its design emphasizes high throughput.
• Throughput-oriented architectures usually have a multitude of processors with much smaller caches and simpler control logic. This helps to efficiently utilize the memory bandwidth and increase the total number of execution units on the same chip area.
Similarity between CPU and GPU
When to use GPUs over CPUs
Applications of GPU
• Deep learning
• High-performance computing
• Autonomous vehicles
• CPUs prove inefficient when operating on large chunks of data
• GPUs focus on execution throughput of massively-parallel programs.
CUDA – Data Parallelism in GPU
(Introduction)
• CUDA (Compute Unified Device Architecture) is a parallel computing platform
and application programming interface (API), or programming model developed
by NVIDIA for general computing on graphical processing units (GPUs).
• With CUDA, developers are able to dramatically speed up computing applications
by utilizing the power of GPUs.
• CUDA is a software layer that gives direct access to the GPU's virtual instruction
set and parallel computational elements for the execution of compute kernels.
• This accessibility makes it easier for specialists in parallel programming to use
GPU resources, in contrast to prior APIs like Direct3D and OpenGL, which
required advanced skills in graphics programming.
• CUDA allows us to use parallel computing for so-called general-purpose computing on graphics processing units (GPGPU).
• CUDA technology enables parallel processing by breaking down a task
into thousands of smaller "threads" executed independently.
• CUDA is an extension of the C programming language.
• CUDA is designed to work with programming languages such as C, C++, and Fortran.
• CUDA-powered GPUs also support programming frameworks such as OpenMP, OpenACC, OpenCL, and HIP by compiling such code to CUDA.
CUDA Program Structure
• A typical CUDA program has code intended both for the GPU and the
CPU.
• Typically, we run serial workload on CPU and offload parallel
computation to GPUs.
• By default, a traditional C program is a CUDA program with only the host
code.
• The CPU is referred to as the host, and the GPU is referred to as the
device.
• The host code can be compiled by a traditional C compiler such as GCC; the device code needs a special compiler that understands the API functions used.
• For NVIDIA GPUs, that compiler is NVCC (the NVIDIA CUDA Compiler).
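For example, a source file containing both host and device code (conventionally given a .cu extension; the file name below is only illustrative) is compiled in a single step:

nvcc vector_add.cu -o vector_add
./vector_add

NVCC compiles the device code itself and forwards the host code to the host C/C++ compiler.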
• The device code runs on the GPU, and the host code runs on the CPU.
• The NVCC processes a CUDA program, and separates the host code
from the device code.
• To accomplish this, it looks for special CUDA keywords.
• The code intended to run on the GPU (device code) is marked with
special CUDA keywords for labelling data-parallel functions, called
‘Kernels’.
• The device code is further compiled by the NVCC and executed on the
GPU.
• CPU and GPUs are separate entities.
• Both have their own memory space.
• CPU cannot directly access GPU memory, and vice versa.
• In CUDA terminology, CPU memory is called host memory and GPU
memory is called device memory.
• Pointers to CPU and GPU memory are called host pointer and device
pointer, respectively.
• For data to be accessible by the GPU, it must be present in the device memory.
• CUDA provides APIs for allocating device memory and data transfer
between host and device memory.
A quick comparison between CUDA and C
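The side-by-side code itself is not reproduced here; the following is a minimal sketch of the CUDA version being discussed (the plain C version is the same program without the __global__ qualifier and the <<<...>>> launch):

#include <stdio.h>

// __global__ marks cuda_hello as a kernel that runs on the device (GPU).
__global__ void cuda_hello()
{
    printf("Hello World from GPU!\n");
}

int main()
{
    cuda_hello<<<1,1>>>();        // kernel launch: 1 block of 1 thread
    cudaDeviceSynchronize();      // wait for the device printf to complete
    return 0;
}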
Explanation
• The major difference between the C and CUDA implementations is the __global__ specifier (kernel) and the <<<...>>> syntax (kernel launch).
• The __global__ specifier indicates a function that runs on the device (GPU).
• Such a function can be called from host code, e.g. from the main() function in the example, and is known as a “kernel”.
• Kernel: a C function which is flagged to be run on a GPU (or a device).
• When a kernel is called, its execution configuration is provided
through <<<...>>> syntax, e.g. cuda_hello<<<1,1>>>().
• Triple angle brackets mark a call from host code to device code
• Also called a “kernel launch” in CUDA terminology.
• We will discuss the parameter (1,1) next.
Parameter (1,1)
• CUDA uses the kernel execution configuration <<<...>>> to tell the CUDA runtime how many threads to launch on the GPU.
• CUDA organizes threads into a group called "thread block".
• Kernel can launch multiple thread blocks, organized into a "grid"
structure.
• The syntax of kernel execution configuration is as follows
• <<< M , T >>>
• which indicates that the kernel launches with a grid of M thread blocks.
• Each thread block has T parallel threads.
Example: <<< 2 , 5 >>>
#include <stdio.h>
__global__ void hello() { printf("Hello World from GPU\n"); }   // kernel (reconstructed)

int main()
{
    hello<<<2,5>>>();                  // 2 blocks x 5 threads each
    printf("Hello World from CPU\n");
    cudaDeviceSynchronize();           // wait for the GPU printf output
    return 0;
}
Basic Steps of CUDA Programming
Following is the common workflow of CUDA programs.
• The name global here refers to scope, as it can be accessed and
modified from both the host and the device.
• Global memory can be declared in global (variable) scope using
• the __device__ declaration specifier (first line of the sketch below),
or
• dynamically allocated using cudaMalloc() and assigned to a regular C pointer variable (as in the sketch below).
• Global memory allocations can persist for the lifetime of the
application.
• Depending on the compute capability of the device, global memory
may or may not be cached on the chip.
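The code snippet from the original slide is not reproduced; the following is a reconstructed sketch showing both forms (variable names are placeholders):

__device__ float devData;                       // first form: declared in global (variable) scope

__global__ void useGlobal(float *ptr)
{
    ptr[0] = devData;                           // both objects live in device global memory
}

int main()
{
    float *d_ptr;                               // a regular C pointer
    cudaMalloc((void**)&d_ptr, 256 * sizeof(float));   // second form: dynamically allocated
    useGlobal<<<1, 1>>>(d_ptr);
    cudaFree(d_ptr);
    return 0;
}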
Device Functions
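The slide content is not reproduced here. As a brief illustrative sketch: a device function is declared with the __device__ qualifier and can be called only from device code (kernels or other device functions), not directly from the host. The names below are placeholders.

// Device function: callable only from code running on the GPU.
__device__ float square(float x)
{
    return x * x;
}

// Kernel (__global__): launched from the host, calls the device function.
__global__ void squareAll(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = square(data[i]);
}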
• The cudaMalloc function can be called from the host code to allocate
a piece of device global memory for an object.
Syntax of cudaMalloc
• cudaMalloc((void**)&Md, size);
• Here Md is a pointer to the allocated device storage and size is the number of bytes to allocate
Syntax of cudaFree
• cudaFree(Md);
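A typical usage pattern (a sketch; Md and the element count are placeholders):

int main()
{
    int n = 64;
    float *Md;                                   // regular C pointer that will hold a device address
    int size = n * sizeof(float);                // number of bytes to allocate
    cudaMalloc((void**)&Md, size);               // allocate device global memory for n floats
    // ... launch kernels that use Md, or copy data in and out with cudaMemcpy ...
    cudaFree(Md);                                // release the device memory
    return 0;
}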
CUDA GLOBAL MEMORY
PROCESSING FLOW
• First load data into CPU memory
• Perform the following steps:
1. Allocate GPU memory – e.g., cudaMalloc()
2. Populate GPU memory with inputs from the host – e.g., cudaMemcpy(…, cudaMemcpyHostToDevice)
3. Execute a GPU kernel on those inputs – e.g., kernel<<<…>>>(gpuVar)
4. Transfer outputs from the GPU back to the host – e.g., cudaMemcpy(…, cudaMemcpyDeviceToHost)
5. Free GPU memory
• Use the results on CPU
• The host can access the device memory and transfer data to and from it, but not the
other way round.
• CUDA provides API functions to accomplish all these steps.
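The worked example itself is on slides that are not reproduced here; the following is a minimal reconstruction consistent with the explanation below (kernel add(), device pointer device_c):

#include <stdio.h>

// Runs on the device; writes the sum into device memory.
__global__ void add(int a, int b, int *c)
{
    *c = a + b;
}

int main()
{
    int c;               // host copy of the result
    int *device_c;       // device pointer

    cudaMalloc((void**)&device_c, sizeof(int));                      // allocate device memory
    add<<<1,1>>>(2, 7, device_c);                                    // launch the kernel
    cudaMemcpy(&c, device_c, sizeof(int), cudaMemcpyDeviceToHost);   // copy the result back
    printf("2 + 7 = %d\n", c);
    cudaFree(device_c);                                              // free device memory
    return 0;
}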
Explanation
• add() runs on the device, so device_c must point to device memory
• This is why we call cudaMalloc() to allocate memory on the device
• We can access memory on a device through calls to cudaMemcpy()
from host code.
• After copying result to host from device, we can free the device
memory using cudaFree()
Kernel Functions and Threading
• How does a CUDA program work?
• In CUDA, the kernel is executed with the aid of threads.
• A kernel is a function that is compiled to run on a device (GPU).
• The thread is an abstract entity that represents the execution of the
kernel.
• Multi-threaded applications use many such threads that are running
at the same time, to organize parallel computation.
Kernel Functions and Threading
• While writing a CUDA program, the programmer has explicit control
on the number of threads that he wants to launch.
• These threads collectively form a three-dimensional grid (threads are
packed into blocks, and blocks are packed into grids).
• A group of threads is called a CUDA block.
• CUDA blocks are grouped into a grid.
• A kernel is executed as a grid of blocks of threads
• Each thread is given a unique identifier, which can be used to determine which data it is to act upon.
Three-Dimensional Grid - View
CUDA kernel and thread hierarchy
• A CUDA kernel is a function that gets executed on the GPU.
• Every CUDA kernel starts with a __global__ declaration specifier.
• Programmers provide a unique global ID to each thread by using built-in
variables.
• The parallel portion of your applications is executed K times in parallel
by K different CUDA threads.
• Every CUDA thread executes the same kernel logic (SIMT)
• NVIDIA calls it Single-Instruction, Multiple-Thread (SIMT)
• A kernel is executed once for every thread in every thread block
configured when the kernel is launched.
NVIDIA GPU Architecture
• NVIDIA GPU consists of multiple Streaming Multiprocessors (SM).
• Each SM contains multiple CUDA cores or Streaming Processors (SP)
responsible for executing instructions.
• Additionally, an SM includes Special Function Units (SFU), shared memory,
registers, and other components necessary for executing and managing
threads.
• The number of SMs in a GPU determines its computational power and
parallel processing capabilities.
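For illustration (not part of the original slides), the SM count and a few related limits can be queried at run time through the CUDA runtime API:

#include <stdio.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                       // properties of device 0
    printf("SMs: %d\n", prop.multiProcessorCount);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    return 0;
}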
• Each CUDA block is executed by one Streaming Multiprocessor (SM)
and cannot be migrated to other SMs in GPU (except during
preemption, debugging, or CUDA dynamic parallelism).
• One SM can run several concurrent CUDA blocks depending on the
resources needed by CUDA blocks.
• The figure on the next slide shows kernel execution and how it maps onto the hardware resources available in the GPU.
Kernel Execution on GPU
Data Parallel Execution Model
CUDA Thread Organization
• Refers to how threads are organized and managed on a CUDA-
enabled GPU (Graphics Processing Unit) during parallel computation.
• All CUDA threads are organized in blocks.
• We can organize our threads in 1-, 2-, or 3-dimensional blocks.
• Blocks are organized in grids.
• We can organize our blocks in 1-, 2-, or 3-dimensional grids.
• Main components of CUDA thread organization are:
• Thread: The smallest unit of execution in CUDA. Each thread is responsible for
performing a specific task or computation.
• Block : A group of threads that execute the same kernel (function) and share data
through shared memory. Threads within the same block can synchronize and
communicate with each other. The number of threads per block is limited by the
hardware specifications of the GPU.
• Grid: A collection of blocks that execute the same kernel. Each block within a grid can
execute independently and concurrently with other blocks. The number of blocks in a
grid is determined by the application's requirements and the hardware limitations of
the GPU.
• Threads can be configured in one-, two-, or three-dimensional layouts, as shown in the sketch below.
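For example, a launch configuration might be chosen as follows (a sketch; the kernels and problem size are placeholders):

__global__ void myKernel1D(int n) { /* ... */ }
__global__ void myKernel2D()      { /* ... */ }

int main()
{
    int N = 100000;
    int threadsPerBlock = 256;                                        // 1D block of 256 threads
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;  // enough blocks to cover N
    myKernel1D<<<blocksPerGrid, threadsPerBlock>>>(N);

    dim3 block(16, 16);       // 2D block: 16 x 16 = 256 threads
    dim3 grid(8, 8);          // 2D grid:  8 x 8  = 64 blocks
    myKernel2D<<<grid, block>>>();

    cudaDeviceSynchronize();
    return 0;
}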
• The number of blocks and threads per block is exposed through intrinsic thread coordinate variables: gridDim, blockIdx, blockDim, and threadIdx.
• Dimension and Indexing:
• Grid Dimension: The number of blocks in each dimension of the grid. For
example, a grid with dimensions (X, Y, Z) consists of X * Y * Z blocks.
• Block Dimension: The number of threads in each dimension of a block.
For example, a block with dimensions (X, Y, Z) consists of X * Y * Z threads.
• Thread Indexing: CUDA provides built-in variables such as threadIdx,
blockIdx, and blockDim to help threads identify their unique index within
a block and grid. These variables are often used to compute memory
access patterns and perform thread synchronization.
• The CUDA architecture limits the number of threads per block (a 1024-threads-per-block limit).
How to get the current thread number and the total number of threads?
• There are 4 variables each thread can access which contain information about the organization of threads and the current thread. They are:
• threadIdx : Id of the current thread.
• blockIdx : Id of the current block.
• blockDim : Number of threads in each dimension of the current block.
• gridDim : Number of blocks in each dimension of the current grid.
• All of these are dim3 structures. We can use dot notation to access the members x, y, and z, which contain the information for the corresponding dimension. Example: threadIdx.x
• To calculate a globally unique ID for a thread inside a one-
dimensional grid and one-dimensional block:
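In code, the calculation is a single line (the same formula is given again in the next section; the totalThreads line is an added illustration for the 1D case):

int tid = blockIdx.x * blockDim.x + threadIdx.x;    // globally unique thread ID
int totalThreads = gridDim.x * blockDim.x;          // total number of threads launched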
Thread Block Indexing
CUDA thread indexing for 1D grid of 1D
blocks
• In a 1D grid of 1D blocks in CUDA, the thread indexing follows a
simple linear structure. Each thread is identified by a single
index.
• Let's say we have:
• blockDim.x as the number of threads per block.
• gridDim.x as the number of blocks in the grid.
• The thread indexing formula is:
• threadIdx.x + blockDim.x * blockIdx.x
• Here:
• threadIdx.x represents the index of the thread within its
block.
• blockIdx.x represents the index of the block within the
grid.
• blockDim.x represents the number of threads per block.
• For example, if we have a grid of 3 blocks (gridDim.x = 3) and
each block contains 128 threads (blockDim.x = 128), then the
thread indexing will range from 0 to 383 (3 blocks * 128
threads/block - 1).
Multi-dimensional CUDA Thread
blockIdx and threadIdx: Detailed Explanation
• Let's say the block dimensions are block(4, 3), for
instance. This means there are 4 threads in the x
dimension and 3 threads in the y dimension, resulting in
a total of 4 * 3 = 12 threads in the block.
• thread(0,0), thread(0,1), thread(0,2), etc., typically refers
to the indices of threads within a block.
• These indices represent the position of each thread
within a two-dimensional block in CUDA.
• For example:
• thread(0,0) refers to the thread at index (0, 0) within
the block.
• thread(0,1) refers to the thread at index (0, 1) within
the block.
• thread(0,2) refers to the thread at index (0, 2) within
the block.
• and so on...
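As an illustrative sketch (not from the original slides), a kernel launched with block(4, 3) can read these indices directly:

#include <stdio.h>

__global__ void printIndices()
{
    // Each of the 4 x 3 = 12 threads prints its own (x, y) position within the block.
    printf("thread(%d,%d)\n", threadIdx.x, threadIdx.y);
}

int main()
{
    dim3 block(4, 3);                 // 4 threads in x, 3 threads in y
    printIndices<<<1, block>>>();     // one block of 12 threads
    cudaDeviceSynchronize();          // wait for the device printf output
    return 0;
}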
Example Code
• This program calculates the sum of two arrays on the GPU and then transfers the result back
to the host for printing.
#include <stdio.h>

// Kernel function to add two arrays element-wise
__global__ void addArrays(int *a, int *b, int *c, int size) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < size) {
        c[idx] = a[idx] + b[idx];
    }
}

int main() {
    const int size = 5;
    int a[size] = {1, 2, 3, 4, 5};
    int b[size] = {10, 20, 30, 40, 50};
    int c[size]; // Result array