HPC - 1

Module 1 - Basics of Parallelization

CPU and its Functions:


●​ The Central Processing Unit (CPU) is the brain of a computer, responsible for executing
instructions and performing calculations.
●​ It processes data using a cycle of operations known as the Fetch-Decode-Execute-Store cycle.
●​ The CPU Cycle:
○​ Fetch: The CPU retrieves an instruction from RAM or cache memory.
○​ Decode: The instruction is interpreted and prepared for execution.
○​ Execute: The CPU processes the instruction using the Arithmetic Logic Unit (ALU) or
other components.
○​ Store: The result is saved in a register, cache, or main memory if needed.
●​ Key Terminologies:
○​ Clock Speed: Measured in GHz (gigahertz), it determines how many clock cycles the CPU completes per second. A higher clock speed generally means faster performance.
○​ Cores: Modern CPUs have multiple cores, which are independent processing units
that can handle separate tasks simultaneously. More cores improve multitasking and
performance in parallel processing.
○​ Threads: A thread is the smallest sequence of programmed instructions that a
processor can manage and execute. Each core can support multiple threads, which
are virtual divisions that enable better multitasking by allowing multiple processes to
run concurrently.
○​ Cache: The CPU has small, high-speed memory known as cache (L1, L2, and L3) to
store frequently accessed data, reducing the need to fetch information from slower
RAM.

Ways to increase performance of single core CPU:


●​ Clock Frequency
○​ Clock frequency (or clock speed) is the rate at which a CPU executes instructions,
measured in Hertz (Hz) — typically in gigahertz (GHz) for modern processors.
○​ A CPU runs like a ticking clock, where each "tick" allows it to process instructions.
○​ A higher clock speed (measured in GHz) means the CPU ticks faster and completes
tasks quicker.
○​ But increasing speed too much creates heat and power issues.
●​ Instructions per Cycle (IPC)
○​ The number of instructions a CPU can execute per clock cycle.
○​ Optimize CPU architecture to process multiple instructions per cycle.
○​ Instead of just increasing speed, CPUs can be designed to do more work per cycle by
optimizing how they handle instructions.
○​ A CPU with high IPC can perform better even at a lower clock speed.
●​ Pipeline Efficiency
○​ Use deeper and optimized pipelines for simultaneous instruction execution.
○​ Pipelining is a technique used in modern CPUs to improve performance by executing
multiple instructions simultaneously at different stages.
○​ Instead of executing one instruction at a time (sequential execution), pipelining breaks
down instruction execution into smaller steps, allowing a new instruction to enter the
pipeline before the previous one finishes.
○​ Stages of pipelining:
■​ Fetch: The CPU retrieves an instruction from RAM or cache memory.
■​ Decode: The instruction is interpreted and prepared for execution.
■​ Execute: The CPU processes the instruction using the Arithmetic Logic Unit
(ALU) or other components.
■​ Store: The result is saved in a register, cache, or main memory if needed.
●​ Number of Transistors
○​ Transistors are tiny switches inside a CPU that process information.
○​ More transistors mean more power and smarter features like better caching and faster
calculations.
○​ Dennard Scaling
■​ States that as transistors get smaller, their power density (power per unit area) stays roughly constant, so each smaller transistor needs proportionally less power.
■​ This means we can keep increasing the number of transistors without
increasing power consumption too much.
■​ This was true for many years and helped make CPUs faster and more
powerful.
○​ Why Dennard Scaling fails today:
■​ Earlier, power consumption per transistor reduced as they got smaller.
■​ But at very small sizes, transistors don’t turn off completely, allowing small
unwanted currents to leak, leading to more power usage and heating.
■​ This is why CPU clock speeds have been stuck around 4 GHz since 2006.
●​ Word Length (Bit Width)
○​ Word length refers to the number of bits a CPU can process, transfer, or store as a
single unit in one clock cycle.
○​ A larger bit width (e.g., 64-bit vs. 32-bit) enables faster data processing, since more bits are read per clock cycle.
●​ Data Bus & Address Bus Width
○​ These are like highways that carry data between the CPU and memory.
○​ Data Bus Width: Determines how much data can be transferred at once (e.g., 32-bit
vs. 64-bit).
○​ Address Bus Width: Determines the amount of memory the CPU can access (e.g.,
32-bit can access 4GB, 64-bit can access much more).
○​ Wider buses improve data transfer speed and memory accessibility.
●​ RAM Performance
○​ RAM is like a short-term memory for the CPU.
○​ Faster RAM helps the CPU get the data it needs without waiting too long.
○​ RAM reduces wait times for fetching data.
●​ Memory Latency
○​ Latency is the time it takes for data to travel from RAM to the CPU.
○​ Using faster memory, better caches, and optimized data access methods reduces this
delay.
○​ Lower latency ensures quicker data access from RAM to CPU.
●​ Memory Bandwidth
○​ Just like a wider road allows more cars to move at once, higher bandwidth means
more data can travel between RAM and the CPU.
○​ This helps in high-performance tasks like gaming or video editing.
○​ Higher bandwidth speeds up communication between RAM and CPU.
●​ Efficient Cooling & Power Management
○​ Cooling prevents thermal throttling, where high temperatures slow down the CPU.
○​ Power management techniques (like dynamic voltage scaling) balance performance
and efficiency.

Parallel Computing:
●​ Parallel computing is a method of simultaneously executing multiple computations to solve a
problem faster by dividing tasks across multiple processors or cores.
●​ Instead of processing instructions one after another (sequential computing), parallel
computing splits the workload and processes multiple parts at the same time.
●​ Core Components:
○​ Multiple Processors or Cores – Tasks are divided among several processing units.
○​ Concurrency – Different tasks execute at the same time.
○​ Synchronization – Processors coordinate to combine results correctly.
○​ Communication – Data is exchanged between processors when necessary.
●​ Types:
○​ Bit-Level Parallelism – Processes multiple bits in a single operation (e.g., 64-bit
processors vs. 32-bit).
○​ Instruction-Level Parallelism (ILP) – Executes multiple instructions at the same time
using pipelining and superscalar architecture.
○​ Data Parallelism – Distributes data across multiple processors performing the same
operation (e.g., GPUs processing images).
○​ Task Parallelism – Different tasks run on separate processors (e.g., web server
handling multiple requests).

Speedup:
●​ Speedup measures how much faster a parallel algorithm runs compared to a sequential one.
●​ S = T1/TP
●​ Where, T1 is the time taken by a single processor and TP is the time taken by P processors.
Efficiency:
●​ Efficiency measures how effectively multiple processors are utilized.
●​ E = S/P
●​ Where S is the speedup and P is the total number of processors.
●​ E=1 → Perfect efficiency
●​ E<1 → Some inefficiency (due to communication, idle time, etc.)
●​ E>1 (rare) → Superlinear speedup (usually due to caching benefits)
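●​ Quick worked example (illustrative numbers, not from the source): with P = 4 processors, a sequential time T_1 = 100 s and a parallel time T_P = 30 s,

S = \frac{T_1}{T_P} = \frac{100}{30} \approx 3.33, \qquad E = \frac{S}{P} = \frac{3.33}{4} \approx 0.83

so the four processors are used at roughly 83% efficiency.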

Amdahl’s Law:
●​ Amdahl’s Law is a principle in parallel computing that predicts the maximum possible speedup
of a program when a portion of it is parallelized.
●​ It shows how much a program's performance can improve by adding more processors, but
also highlights the limits of parallelization due to the sequential portion of the program.

●​ If f is the fraction of the program that can be parallelized and P is the number of processors, the speedup is S(P) = 1 / ((1 - f) + f/P).
●​ As P increases, the improvement in speedup diminishes because the sequential part becomes the limiting factor.
●​ Key Observations:
○​ If a program is completely sequential (f=0), then adding processors will not increase
speedup.
○​ If a program is fully parallel (f=1), then theoretically, speedup is proportional to P (ideal
case).
○​ If only a fraction is parallelizable, adding more processors helps but only up to a
certain point.
○​ Beyond a limit, adding processors has negligible effect because the sequential portion
dominates execution time.

●​ Amdahl’s Law helps determine the optimal number of processors needed before adding more
becomes wasteful.
●​ The speedup can never exceed the total number of processors available. If it does (very rare)
then it is called superlinear speedup.
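●​ Worked example (illustrative numbers): if f = 0.9 of a program is parallelizable and P = 10,

S = \frac{1}{(1 - 0.9) + \frac{0.9}{10}} = \frac{1}{0.19} \approx 5.26, \qquad \lim_{P \to \infty} S = \frac{1}{1 - f} = 10

so no matter how many processors are added, the speedup for this program can never exceed 10x.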
Data Parallelism:
●​ Example: adding a million numbers on a GPU. The "data" (the numbers) changes from thread to thread, but the "task" (addition) remains the same; each thread computes a partial sum and the partial sums are then combined (a minimal CUDA sketch follows this section).
●​ Advantages:
○​ Faster Processing – Multiple processors handle chunks of data simultaneously.
○​ Scales Well – Works effectively on large datasets, especially in deep learning &
simulations.
○​ Efficient Use of GPUs – Utilizes multiple GPUs efficiently for parallel computation.
●​ Disadvantages:
○​ Not Suitable for All Problems – Only works when the same operation is applied to
different data.
○​ High Communication Overhead – Data transfer between processors can slow things
down.
○​ Synchronization Issues – Combining results from multiple processors can introduce
complexity.
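A minimal CUDA sketch of the million-number addition mentioned above (names such as sumKernel and d_data are illustrative assumptions; it assumes the input array is already in device memory). Every thread runs the same task (addition) on a different slice of the data, and the partial sums are then combined:

__global__ void sumKernel(const float* data, float* total, int n) {
    // Grid-stride loop: each thread sums a different subset of the numbers
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    float local = 0.0f;
    for (int i = idx; i < n; i += stride)
        local += data[i];
    // Combine the per-thread partial sums into one result
    atomicAdd(total, local);
}

// Possible launch for one million numbers, 256 threads per block:
// sumKernel<<<4096, 256>>>(d_data, d_total, 1 << 20);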

Task Parallelism:
●​ Example: an AI app deployed on a multi-core server: Core 1 handles User A’s login request, Core 2 serves an image to User B, and Core 3 fetches data from the database for User C. All tasks run simultaneously, improving efficiency.
●​ Advantages:
○​ Handles Diverse Tasks – Different tasks can run independently (e.g., web servers,
OS processes).
○​ Better Resource Utilization – Can use different hardware for different tasks (CPU for
logic, GPU for rendering).
○​ Efficient for Task-Based Workloads – Useful for distributed systems & cloud
applications.
●​ Disadvantages:
○​ Limited Scalability – Some tasks cannot be divided further, limiting performance
gains.
○​ Load Balancing Issues – Some tasks may take longer than others, causing
inefficiencies.
○​ More Complex to Implement – Requires proper scheduling and task management.
Flynn's Taxonomy classifies computer architectures based on how instructions and data are handled
during processing. It defines four categories based on the number of instruction streams and data
streams a system can process simultaneously.​

●​ SISD (Single Instruction Single Data) System: Sequential Processing


a.​ An SISD computing system is a uniprocessor machine which is capable of executing a
single instruction, operating on a single data stream. ​
b.​ In SISD, machine instructions are processed in a sequential manner and computers
adopting this model are popularly called sequential computers. ​
c.​ All the instructions and data to be processed have to be stored in primary memory.​
d.​ The speed of the processing element in the SISD model is limited by the rate at which the computer can transfer information internally.
e.​ Dominant representative SISD systems are IBM PCs and workstations.

●​ SIMD (Single Instruction, Multiple Data) System: Data Parallelism


a.​ An SIMD system is a multiprocessor machine capable of executing the same
instruction on all the CPUs but operating on different data streams.
b.​ Machines based on an SIMD model are well suited to scientific computing since they
involve lots of vector and matrix operations. ​
c.​ The data elements of vectors are divided into multiple sets (N sets for an N-PE system) so that the information can be distributed to all the processing elements (PEs), with each PE processing one data set.

●​ MISD (Multiple Instruction, Single Data) System: Rarely Used


a.​ An MISD computing system is a multiprocessor machine capable of executing different
instructions on different PEs but all of them operating on the same dataset.​
b.​ Example: Z = sin(x) + cos(x) + tan(x), where the system performs different operations on the same data set.
c.​ Machines built using the MISD model are not useful in most applications; a few machines have been built, but none are available commercially.

●​ MIMD (Multiple Instruction, Multiple Data) System : Full Parallelism


a.​ An MIMD system is a multiprocessor machine which is capable of executing multiple
instructions on multiple data sets. ​
b.​ Each PE in the MIMD model has separate instruction and data streams; therefore
machines built using this model are capable of any kind of application. ​
c.​ Unlike SIMD and MISD machines, PEs in MIMD machines work asynchronously.​
●​ SPMD (Single Program Multiple Data) System:
a.​ A parallel computing model where multiple processors execute the same program but
operate on different data sets.
b.​ Each processor runs the same code but may take different execution paths based on
data conditions.
c.​ Common in distributed computing and GPU computing, such as OpenMP and
MPI-based applications.
d.​ Matrix multiplication across multiple GPUs: Each GPU runs the same matrix
multiplication algorithm but on different parts of the matrix.
●​ MPMD (Multiple Program Multiple Data) System:
a.​ A parallel computing model where multiple processors execute different programs on
different data sets.
b.​ Different processors can run different tasks simultaneously, improving flexibility.
c.​ Common in heterogeneous computing, such as CPU-GPU hybrid systems.
d.​ Example: Web Server Architecture

Parallel Processing Memory Architectures


●​ UMA (Uniform Memory Access)
○​ All processors share a single, unified memory with equal access time for all
processors.
○​ Used in SMP (Symmetric Multiprocessing) systems like multi-core CPUs.
○​ Easier to program but can suffer from memory bottlenecks as all processors compete
for the same memory.
○​ Example: Traditional multi-core processors in desktops and servers.
●​ NUMA (Non-Uniform Memory Access)
○​ Memory is distributed, and each processor has its own local memory while still being
able to access other processors' memory.
○​ Faster access to local memory but slower access to non-local memory.
○​ Reduces memory bottlenecks, making it ideal for large-scale multiprocessor systems.
○​ Example: High-performance supercomputers and data centers.


Scalability Types in HPC:


1.​ Strong Scaling (Amdahl’s Law)
●​ Performance improvement when solving a fixed-size problem by increasing the
number of processors.
●​ Adding more processors only helps if the parallelizable portion of the workload is large.
●​ If a significant portion of the program is sequential, speedup is limited, no matter how
many processors are added.
●​ Running a fixed dataset (e.g., weather simulation for a single city).

2.​ Weak Scaling (Gustafson’s Law)


●​ Performance improvement when the problem size increases proportionally to the
number of processors.
●​ Unlike Amdahl’s Law, this approach assumes that as we increase processors, we
increase workload proportionally, allowing almost linear speedup.
●​ Avoids the bottleneck of a fixed sequential portion, leading to better performance
scaling.
●​ Increasing both dataset size and processors (e.g., weather simulation for multiple
cities as processors increase).
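For reference, Gustafson's Law can be written as follows, where f is the fraction of the (scaled) workload that is parallel and P is the number of processors:

S_{scaled}(P) = (1 - f) + f \cdot P

For example, with f = 0.9 and P = 10 the scaled speedup is 0.1 + 9 = 9.1, i.e. nearly linear, because the workload grows with the processor count.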

Factors that limit parallel execution:


●​ Amdahl’s Law (limited speedup due to sequential code), as discussed above.
●​ Communication Overhead
○​ Processors need to exchange data during execution, which introduces delays.
○​ Types of overhead:
■​ Inter-Processor Communication: Transmitting data between CPUs
consumes time.
■​ Memory Access Overhead: Shared-memory systems may have delays due to
cache coherence protocols.
■​ Synchronization Overhead: Waiting for data updates (e.g., locks, barriers)
can reduce efficiency.
○​ If each processor spends 10% of its time communicating, the effective parallel
execution time increases.
●​ Load Imbalance
○​ If some processors finish their tasks early while others are still working, the idle
processors reduce efficiency.
○​ Causes:
■​ Unequal distribution of work
■​ Dynamic workload changes
■​ Some computations inherently take longer than others
○​ In a matrix multiplication problem, if one processor gets a larger chunk of data to
process, others will have to wait for it to finish.
●​ Memory Bottlenecks (Bandwidth & Latency)
○​ Types of memory architectures:
■​ UMA (Uniform Memory Access): Equal memory access time but may have
bus contention (multiple processors fighting for access).
■​ NUMA (Non-Uniform Memory Access): Each processor has local memory,
but accessing remote memory is slower.
●​ If multiple processors access the same memory, contention occurs, slowing down
performance.
●​ Synchronization Overhead
○​ Parallel threads/processes must synchronize to ensure correctness (e.g., locks,
barriers, critical sections).
○​ Costly operations:
■​ Locks: When multiple threads try to update a shared resource, contention
occurs.
■​ Barriers: All threads must wait for the slowest one before moving to the next
step.

Refined Performance Model (Skip):


●​ Amdahl’s Law and Gustafson’s Law fail because they oversimplify parallel execution and
ignore real-world constraints like communication overhead, memory bottlenecks, and load
balancing issues.
●​ To address these limitations, a refined performance model considers additional factors
affecting scalability.
Load Balancing:
●​ Load balancing is a crucial concept in parallel computing, ensuring that workloads are
distributed evenly across available computing resources (processors, cores, or nodes).
●​ The goal is to minimize execution time, avoid bottlenecks, and maximize resource utilization.
●​ Without proper load balancing, some processors may become overloaded while others remain
idle, leading to poor performance and inefficient parallel execution.

●​ (Refers to the load-distribution figure:) the right-hand case is better because only one core is idle, whereas on the left multiple cores are idle, i.e. severe underutilization.
●​ Types of Load Imbalance:
○​ Iterative Algorithms with Variable Convergence Rates
■​ Problem: Some computations converge faster than others, leading to idle
processors.
■​ Example: Solving a system of equations, training machine learning models.
■​ Solution: Use dynamic load balancing to reassign work from idle processors to
busy ones.
○​ Limited Parallelism Due to Coarse Granularity
■​ Problem: Tasks are too large and cannot be evenly distributed across processors.
■​ Example: 5 tasks assigned to 8 processors may cause underutilization.
■​ Solution: Break tasks into smaller sub-tasks (fine-grained parallelism) to
balance workload.
○​ Load Imbalance Due to Waiting for Resources
■​ Problem: Workers spend time waiting for disk I/O or network communication instead of computing.
■​ Example: A simulation where a worker waits for data transfer.
■​ Solution: Overlap computation and communication using asynchronous
execution and prefetching.
●​ Types of Load Balancing
○​ (A) Static Load Balancing
■​ The workload is assigned before execution begins.
■​ The system does not adjust the load dynamically during runtime.
■​ Suitable for problems where the workload is predictable.
■​ Round-Robin Scheduling: Tasks are assigned cyclically to processors in a
predetermined order.
○​ (B) Dynamic Load Balancing
■​ The workload is assigned and adjusted at runtime based on the current load of
processors.
■​ Ensures that no processor remains idle while others are overloaded.
■​ Master-Slave Model: A master processor assigns tasks dynamically to worker
processors.
●​ Strategies:
○​ (A) Task Scheduling-Based Load Balancing
■​ Tasks are divided and scheduled optimally among processors.
■​ Used in multiprocessor scheduling and GPU task assignment.
■​ Example: Heterogeneous computing (CPU + GPU), where heavier tasks go to
the GPU while lightweight tasks run on the CPU.
○​ (B) Data Partitioning-Based Load Balancing
■​ The dataset is divided into smaller chunks, ensuring even workload distribution.
■​ Used in Big Data processing (e.g., MapReduce, Spark) where data partitions
are dynamically assigned.
○​ (C) Communication-Based Load Balancing
■​ Balances computational workload while minimizing communication costs between processors.
■​ Used in distributed memory systems like MPI-based applications.
Module 2 - Introduction to High-Performance Computing

High Performance Computing:


●​ High-Performance Computing (HPC) refers to the use of supercomputers and parallel
processing techniques to solve complex computational problems that require immense
processing power and are not feasible with conventional computing techniques.
●​ It enables scientists, engineers, and businesses to perform large-scale simulations, data
analysis, and modeling tasks much faster than traditional computing.
●​ HPC systems are designed with high-speed processing power, advanced networking, and
large-memory capacity to enable massive parallel processing.​
●​ Key Aspects:
○​ Parallel Processing: HPC systems use multiple processors working together to solve a problem faster than a single processor. These processors can be distributed across multiple nodes (computers) in a cluster.
○​ Supercomputers: The most advanced form of HPC, capable of performing petaflops (quadrillions of calculations per second). Example: Summit (developed by IBM for Oak Ridge National Laboratory).
○​ Clusters & Grid Computing: Clusters are multiple interconnected computers working as a single system; grid computing is distributed computing where machines in various locations work together.
○​ Scalability: HPC systems are scalable, meaning they can handle increasing workloads by adding more processors or nodes.

Working of HPC:
●​ Cluster Configuration:
○​ HPC system consists of multiple interconnected computers, or nodes, forming clusters.
○​ Each node has processors, memory, and storage, connected via a high-speed network
for fast communication.
●​ Task Parallelization:
○​ Large computational problems are divided into smaller tasks, which run simultaneously
across multiple nodes.
○​ This process, called parallel computing, allows tasks to be executed faster.
●​ Data Distribution:
○​ The input data is split among the nodes, ensuring each node processes a specific
portion of the data.
○​ This prevents bottlenecks and enables efficient workload distribution.
●​ Computation:
○​ Each node executes its assigned computation in parallel with others.
○​ Intermediate results are shared, combined, and refined until the computation is
complete.
●​ Monitoring and Control:
○​ HPC systems use software tools to monitor node performance and dynamically
allocate resources to maximize efficiency.
○​ Load balancing ensures that all nodes are optimally utilized.
●​ Output Generation:
○​ The final result is an integration of all processed data.
○​ Outputs are stored in large parallel file systems and can be visualized for analysis and
interpretation.
Advantages:
●​ Faster Computation – Executes complex tasks in hours or minutes instead of weeks.
●​ Efficient Parallel Processing – Distributes workloads across multiple processors for better
performance.
●​ Scalability – Easily expands by adding more computing nodes to handle larger workloads.
●​ High Data Processing Capability – Handles vast amounts of data, ideal for big data
analytics and AI training.
●​ Improved Accuracy – Enables precise simulations and models in fields like weather
forecasting and medical research.
●​ Cost-Effective for Large Tasks – Reduces overall computation time, lowering operational
costs for research and businesses.
●​ Enhanced Resource Utilization – Uses CPU, GPU, and memory efficiently to maximize
throughput.
●​ Real-Time Processing – Supports time-sensitive applications like financial modeling and
scientific experiments.
●​ Optimized for Complex Problems – Solves high-end engineering, physics, and genomics
problems that traditional computers can't handle.
●​ Supports Multi-User Collaboration – Multiple teams can work on different computations
simultaneously.

HPC Use Cases:


●​ Scientific Research – Simulating climate change, astrophysics, and molecular modeling.
●​ Healthcare & Bioinformatics – DNA sequencing, drug discovery, and medical imaging
analysis.
●​ Financial Services – Risk modeling, fraud detection, and high-frequency trading.
●​ Artificial Intelligence & Machine Learning – Training deep learning models and AI-powered
analytics.
●​ Engineering & Manufacturing – Computer-aided design (CAD), fluid dynamics, and
structural simulations.
●​ Weather Forecasting & Climate Modeling – Predicting hurricanes, tsunamis, and long-term
climate changes.
●​ Energy Sector – Oil & gas exploration, seismic data analysis, and renewable energy
optimization.
●​ Aerospace & Automotive – Crash simulations, aerodynamics testing, and material analysis.
●​ Entertainment & Media – 3D rendering, special effects, and game physics simulation.
●​ Cybersecurity – Threat detection, cryptography, and real-time network monitoring.
●​ Government & Defense – Intelligence analysis, cryptographic computing, and battlefield
simulations.

CPU:
●​ The Central Processing Unit (CPU) is the primary component of a computer that executes
instructions from programs.
●​ It processes data, performs arithmetic and logical operations, and manages control signals for
other hardware components.
●​ Characteristics:
○​ Clock Speed (GHz): Determines how many clock cycles the CPU completes per second, measured in gigahertz (e.g., 3.5 GHz = 3.5 billion cycles per second). A higher clock speed generally means faster processing but also higher power consumption.
○​ Core Count: Modern CPUs have multiple cores (dual-core, quad-core, octa-core,
etc.). More cores allow for better multitasking and parallel processing. Used in
multi-threaded applications for improved performance.
○​ Cache Memory: A small but fast memory unit inside the CPU that stores frequently
accessed data. L1 Cache (fastest, smallest), L2 Cache (larger, slightly slower), L3
Cache (largest, slowest). Reduces the need to access slower RAM.
○​ Single Instruction Execution at a Time: Traditional CPUs follow the Von Neumann
architecture, processing one instruction after another in a sequential manner.
○​ Bus Speed and Memory Bandwidth: Defines how quickly data moves between the CPU, RAM, and other components. Measured in GT/s (gigatransfers per second) or GB/s.

GPU:
●​ A Graphics Processing Unit (GPU) is a specialized processor designed to accelerate parallel
computations.
●​ Originally built for rendering graphics, GPUs are now widely used in high-performance
computing (HPC), artificial intelligence (AI), deep learning, and scientific simulations due to
their ability to process massive amounts of data simultaneously.
●​ Massively Parallel Architecture
○​ Unlike a CPU, which has a few powerful cores optimized for sequential tasks, a GPU
contains thousands of smaller cores designed to handle multiple tasks in parallel.
○​ This makes GPUs highly efficient for vectorized and matrix-based computations,
essential for AI, scientific modeling, and gaming.
●​ High Throughput over Low Latency
○​ Throughput refers to the amount of data processed at once, and GPUs are designed
for high throughput rather than low-latency execution.
○​ This means GPUs excel in workloads where many operations can be performed
simultaneously, rather than tasks requiring fast decision-making.
●​ High Memory Bandwidth
○​ GPUs feature fast memory interfaces (e.g., GDDR6, HBM) to handle large volumes of
data efficiently.
○​ Deep learning models, simulations, and gaming engines benefit significantly from this
increased memory bandwidth.
●​ Optimized for Parallel Workloads
○​ GPUs execute SIMD (Single Instruction, Multiple Data) operations, processing many
data points in parallel.
○​ This is particularly useful in neural networks, computer vision, financial modeling, and
video rendering.

FPGA:
●​ FPGA (Field-Programmable Gate Array) is a reconfigurable hardware device that allows users
to implement custom digital circuits.
●​ The exact architecture varies by manufacturer, but a typical FPGA consists of the following
key components:
●​ Configurable Logic Blocks (CLBs):
○​ Contain logic elements such as Look-Up Tables (LUTs), Flip-Flops, and multiplexers.
○​ Perform logic and arithmetic operations.
●​ Programmable I/O Blocks:
○​ Connect FPGA logic to external devices via interfacing pins.
○​ Support various communication standards.
●​ Programmable Interconnect Resources:
○​ Pre-laid vertical and horizontal wiring.
○​ Used to route signals between logic blocks.
○​ Density depends on the number of routing paths and wire segments.

Parallel Computing Techniques:


1.​ Bit Level:
●​ Involves processing multiple bits of data simultaneously within a single operation.
●​ Using a 64-bit processor to perform operations on 64 bits of data at once instead of 32
bits.
●​ Increases data throughput and efficiency, and reduces the number of instructions needed for operations.
2.​ Instruction Level
●​ Allows multiple instructions to be executed simultaneously by overlapping their
execution phases.
●​ Multiple execution units to execute more than one instruction per clock cycle.
●​ Instructions are executed as resources become available rather than strictly in the
order they appear.
●​ Improves CPU utilization and throughput.
3.​ Task Level
●​ Involves dividing a program into independent tasks that can be executed
simultaneously on different processors or cores.
●​ Running multiple threads in parallel to perform different tasks.
●​ Utilizing multiple machines to execute tasks concurrently.
●​ Scales well with the number of processors available.
●​ Widely used in web servers, data processing frameworks (like MapReduce), and
parallel algorithms.
4.​ Superword Level
●​ SLP is a type of parallel computing that focuses on vectorizing operations on data
stored in short vector registers, enabling efficient execution of multiple data operations
simultaneously.
●​ It is a form of data parallelism that specifically operates on arrays or vectors of data,
allowing for the simultaneous processing of multiple data elements.
●​ Utilizes Single Instruction, Multiple Data (SIMD) operations, where a single instruction
is executed on multiple pieces of data concurrently.
●​ Highly effective in fields such as image processing, signal processing, and machine
learning, where operations on large arrays or matrices are common.

Challenges in Parallel Computing:


●​ Amdahl's Law
●​ Communication Overhead
●​ Load Imbalance
●​ Memory Architecture (UMA vs NUMA)
●​ Synchronisation Overhead
●​ Heating of hardware
●​ Scalability

Memory Architectures: UMA vs NUMA (mod1)

MPI (Message Passing Interface) - Distributed


●​ MPI is a standardized and portable message-passing system designed to facilitate
communication between processes in a parallel computing environment.
●​ It is widely used in high-performance computing (HPC) for distributed-memory systems.
●​ MPI allows processes to communicate by sending and receiving messages. It supports both
point-to-point communication (between two processes) and collective communication
(involving multiple processes).
●​ MPI is designed for scalability, making it suitable for large-scale applications that run on
clusters, supercomputers, and grid computing environments.
●​ MPI_Send and MPI_Recv functions are used for point-to-point communication.
●​ Collective communication functions (like MPI_Bcast and MPI_Reduce) allow data to be
shared among multiple processes.
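A minimal point-to-point MPI sketch in C (illustrative only; compile with mpicc and launch with mpirun -np 2):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);                // start the MPI runtime
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  // unique id of this process

    if (rank == 0) {
        int value = 42;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   // send to process 1
    } else if (rank == 1) {
        int value;
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Process 1 received %d from process 0\n", value);
    }

    MPI_Finalize();                        // shut down the MPI runtime
    return 0;
}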
OpenMP (Open Multi-Processing) - Multithreaded
●​ OpenMP is an API that provides a set of compiler directives, library routines, and environment
variables for parallel programming in shared-memory systems.
●​ It allows developers to create multi-threaded applications easily.
●​ OpenMP is designed for shared-memory architectures, where multiple threads can access the
same memory space, making it efficient for multi-core processors.
●​ Supports dynamic creation and termination of threads, enabling efficient resource
management based on workload.
●​ Each thread can access shared variables, but developers must manage synchronization to
avoid conflicts.
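A minimal OpenMP sketch in C (illustrative; compile with the -fopenmp flag). A single directive splits the loop iterations across threads that all see the same shared arrays:

#include <omp.h>
#include <stdio.h>

#define N 1000000
static double a[N], b[N], c[N];   // shared arrays, visible to all threads

int main(void) {
    // Each thread handles a different range of iterations
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("Done with up to %d threads\n", omp_get_max_threads());
    return 0;
}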
Module 3 - Introduction to GPUs and CUDA Programming

A GPU (Graphics Processing Unit) is a specialized processor designed to accelerate rendering of


images and video and perform massively parallel computations. Originally developed for graphics
rendering, GPUs are now widely used for scientific computing, AI/ML model training, simulations, and
large-scale data processing due to their ability to perform thousands of operations simultaneously.
Characteristics:
●​ Accelerator/Co-Processor
○​ GPUs work alongside CPUs to accelerate computations, particularly in tasks involving
massive parallelism.
○​ The CPU handles complex sequential tasks, while the GPU processes large-scale
parallel computations.
●​ Heterogeneous Computing Architecture
○​ Combines CPU + GPU to leverage the strengths of both
○​ CPU (General-Purpose Processor): Good at sequential tasks, decision-making, and
logic-heavy operations.
○​ GPU (Accelerator): Specialized in parallel data processing, making it ideal for AI, deep
learning, and graphics rendering.
●​ Not an Intelligent Device
○​ A GPU cannot function independently; it relies on the CPU for task delegation.
○​ It follows instructions provided by the host CPU but executes them with high
parallelism.
●​ Massive Parallelism
○​ Thousands of CUDA cores enable handling millions of threads simultaneously.
○​ Suitable for applications like scientific computing, simulations, cryptography, deep
learning, and real-time rendering.
●​ Throughput Oriented
○​ Designed for maximum task completion over time, not for quickly finishing a single task. Great for bulk processing (e.g., images, video, neural networks).
●​ Higher Memory Bandwidth:
○​ Significantly higher than CPUs, enabling fast transfer of large datasets for operations like deep learning; more data can be transferred in less time.
Architecture:
●​ CUDA Cores (Compute Units)
○​ These are the fundamental processing units responsible for executing instructions.
○​ Organised in a SIMD fashion, i.e. the same instruction is executed on multiple data elements.
○​ Each core performs basic arithmetic and logical operations in parallel.
○​ Thousands of CUDA cores come together to enable massive parallelism.
●​ Streaming Multiprocessors (SMs)
○​ GPUs are divided into multiple Streaming Multiprocessors (SMs), each containing multiple CUDA cores and warp schedulers that manage threads in groups of 32.
○​ Each SM handles multiple threads in parallel using Single Instruction, Multiple Threads
(SIMT) execution.
●​ Special Function Units (SFUs): These units handle complex mathematical operations such as trigonometric functions, logarithms, and exponentiation, offloading these scientific and complex tasks from the standard ALUs for better accuracy and efficiency.
●​ Memory Hierarchy
○​ Registers: Fastest memory, private to each thread (most expensive)
○​ Shared Memory / L1 Cache: Fast memory shared among CUDA cores within an SM,
improving efficiency.
○​ L2 Cache: A larger cache shared across multiple SMs, reducing global memory
access latency.
○​ Global Memory: Large but slower memory accessible by all cores; used for storing
large datasets.
●​ Load/Store Units (LD/ST): Responsible for transferring and managing data movements between the GPU's shared memory and registers.
●​ Interconnect Network: Provides high-speed communication between SMs, caches, memory controllers, and other units.

GPU Program Execution Model i.e. Working : (Host - CPU and Device - GPU)
●​ Declare CPU Variables
●​ Allocate memory to CPU Variables
●​ Declare GPU Variables
●​ Allocate memory to GPU Variables (use functions like cudaMalloc)
●​ Initialize data in CPU
●​ Copy data from CPU memory to GPU memory (use functions like cudaMemcpy)
●​ The CPU instructs the GPU to start parallel execution (the kernel is launched from the CPU, i.e. the CPU signals that the GPU is needed, so the GPU kernel is launched)
●​ Synchronise the host and the device (use cudaDeviceSynchronize)
●​ Copy results back from GPU memory to CPU memory (use cudaMemcpy)
●​ Free the device memory (using cudaFree)
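A minimal host-side sketch of the workflow above (illustrative names; error checking omitted; the kernel myKernel simply squares each element):

#include <stdlib.h>

__global__ void myKernel(float* d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * d[i];
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float* h_data = (float*)malloc(bytes);                       // host (CPU) memory
    for (int i = 0; i < n; i++) h_data[i] = (float)i;            // initialize on the CPU

    float* d_data;
    cudaMalloc(&d_data, bytes);                                  // device (GPU) memory
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);   // CPU -> GPU

    myKernel<<<(n + 255) / 256, 256>>>(d_data, n);               // launch the kernel
    cudaDeviceSynchronize();                                     // wait for the GPU to finish

    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);   // GPU -> CPU
    cudaFree(d_data);                                            // free device memory
    free(h_data);
    return 0;
}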

Kernel Execution Hierarchy :


●​ Thread:
○​ A thread is the smallest unit of execution in a GPU.
○​ Each thread executes the same kernel (function) but on different data.
○​ A GPU can launch thousands of threads at once.
●​ Block:
○​ A block is a group of threads that work together.
○​ Threads inside a block can communicate and share data using shared memory.
○​ Each block has a limit on the number of threads (e.g., 1024 threads per block on
modern NVIDIA GPUs).
○​ Threads within a block synchronize with each other using functions like __syncthreads()
○​ Each thread in a block has unique identifiers: threadIdx.x, threadIdx.y, threadIdx.z
●​ Grid:
○​ A grid is a collection of blocks.
○​ A GPU launches a grid of blocks, and each block contains multiple threads.
○​ Blocks cannot communicate with each other directly.
○​ Identifiers for blocks: blockIdx.x, blockIdx.y, blockIdx.z.
●​ Identifiers:
○​ Thread ID within a block: threadIdx.x, threadIdx.y, threadIdx.z
○​ Block ID within the grid: blockIdx.x, blockIdx.y, blockIdx.z
○​ Grid and Block Dimensions: gridDim.x, gridDim.y, blockDim.x, blockDim.y
●​ Warps:
○​ A warp is a group of 32 threads that execute SIMD-style (Single Instruction Multiple
Data). All threads in a warp execute the same instruction simultaneously.
Memory Hierarchy / Management:
●​ Registers (Fastest, per Streaming Multiprocessor - SM)
○​ Each Streaming Multiprocessor (SM) has registers that store temporary data for
threads.
○​ Extremely fast, but limited in size (256 KB per SM in NVIDIA A100). Each thread gets
a private set of registers.
●​ L1/Shared Memory (Fast, per SM, 192 KB in A100)
○​ Each SM has L1 cache and Shared Memory (SMEM). Threads within a block can
share data via shared memory.
○​ Very fast (low latency), but small size compared to global memory.
●​ Read-Only Cache (Per SM, Optimized for Constants)
○​ Stores constant data that doesn’t change.
○​ Helps reduce redundant memory accesses from global memory.
●​ L2 Cache (Shared Across SMs, 40 MB in A100)
○​ Stores frequently accessed data shared across SMs.
○​ Reduces global memory accesses, improving performance.
●​ Global Memory (DRAM, Largest but Slowest, 40 GB in A100)
○​ Accessible by all threads but has high latency.
○​ Used for large datasets.
○​ Requires efficient access patterns (coalesced memory access) for good performance.
CUDA
●​ CUDA (Compute Unified Device Architecture) is a parallel computing platform and API
developed by NVIDIA that allows developers to use GPUs for general-purpose computing
(GPGPU).
●​ It enables software developers to write highly parallel programs using C, C++, and Python
while taking full advantage of the massively parallel architecture of GPUs.
●​ Thread Hierarchy (Grid ->Block -> Thread)
●​ Memory Hierarchy (Global, L2, Read Only, L1, Register)
●​ CUDA Kernels:
○​ Kernels are functions executed on GPU cores.
○​ Every thread runs the same kernel but processes different data elements.
○​ CUDA allows launching millions of threads, leading to massive parallelism.
●​ GPU-CPU Interaction:
○​ CPU (Host) launches a kernel on the GPU (Device).
○​ Data is copied from CPU memory → GPU memory.
○​ GPU performs computations in parallel.
○​ Results are copied back to CPU memory.

Advantages of CUDA Programming:


●​ Massive Parallelism: Utilizes thousands of GPU threads for parallel execution.
●​ High Performance: Significant speedup for data-parallel tasks like matrix operations, deep
learning, etc.
●​ Mature Ecosystem: Offers rich libraries (cuBLAS, cuDNN, cuRAND, etc.) and robust tooling
(NSight, Visual Profiler).
●​ Fine-Grained Control: Developers can optimize memory access and execution configuration
for performance.
●​ Seamless CPU-GPU Interop: Allows coordination between CPU and GPU for hybrid
computation.

Disadvantages of CUDA Programming:


●​ NVIDIA Dependency: Only works with NVIDIA GPUs; not portable to AMD or Intel GPUs.
●​ Learning Curve: Requires knowledge of parallel programming concepts and memory
hierarchy.
●​ Debugging Difficulty: Debugging GPU code is more complex than CPU code.
●​ Limited Use Cases: Not suitable for tasks with low parallelism or high control flow
divergence.
●​ Memory Transfer Overhead: Data transfer between CPU and GPU can be a bottleneck if not
optimized.

CUDA Directives: CUDA Directives refer to the special syntax, keywords, or pragmas used in
CUDA (Compute Unified Device Architecture) to guide the GPU on how to execute code in parallel.

Directive / Keyword Purpose

__global__ Declares a function (kernel) that runs on the GPU and is called from the CPU.

__device__ Declares a function or variable that runs and is accessible only on the GPU.
__host__ Optional; declares a function to run on the CPU (default behavior).

__shared__ Declares memory that is shared among threads in the same thread block.

__constant__ Declares memory that stays constant during kernel execution, optimized for broadcast.

__managed__ Declares unified memory accessible by both CPU and GPU.

<<<gridDim, blockDim>>> Launch configuration syntax for kernels. Example: kernel<<<4, 256>>>();
This means launching 4 blocks, each with 256 threads.

threadIdx.x / .y / .z Gives the index of the current thread within a block.

blockIdx.x / .y / .z Gives the index of the current block within the grid.

__syncthreads() Barrier for synchronizing threads within a block.

Just Skim (Helpful keywords to yap any CUDA answer):

Function One-liner Description

cudaMalloc() Allocates memory on the GPU device.

cudaFree() Frees memory previously allocated on the device.

cudaMemcpy() Copies data between host and device (both directions).

cudaMemset() Initializes device memory to a constant byte value.

cudaDeviceSynchronize() Waits for the GPU to finish all preceding tasks.

cudaGetLastError() Checks and returns the most recent CUDA error.

Tiling:
●​ Tiling is an optimization technique used in parallel computing (especially in GPUs) to
efficiently manage memory access and improve data locality.
●​ It helps reduce memory latency by reusing data stored in fast shared memory (SMEM) instead
of repeatedly fetching it from slower global memory (DRAM).
●​ It enhances memory locality and performance by dividing large problems into smaller, manageable subproblems.
●​ Instead of processing the entire dataset at once, tiling splits the data into smaller blocks
("tiles") that fit efficiently into cache or shared memory, reducing memory access latency.
●​ Why use tiling:
○​ Reduce Global Memory Access – Global memory is slow compared to registers and
shared memory. Tiling ensures data is loaded into fast shared memory, reducing
access time
○​ Improve Cache Efficiency – Data is reused multiple times, improving performance.
○​ Enable Parallel Computation – Each thread block processes a tile independently,
allowing efficient parallelization.
○​ Minimize Bank Conflicts – Optimized memory access patterns reduce conflicts in
shared memory.
●​ Working:
○​ Identify a tile of global memory contents that are accessed by multiple threads​
○​ Load the tile from global memory into on-chip memory​
○​ Use barrier synchronization to make sure that all threads are ready to start the phase​
○​ Have the multiple threads access their data from the on-chip memory
○​ Use barrier synchronization to make sure that all threads have completed the current
phase​
○​ Move on to the next tile

Tiled Matrix Multiplication:


●​ Matrix multiplication is memory-bound in GPUs because accessing global memory is slow.
●​ Since GPUs have fast shared memory but limited registers and cache, tiling is used to load
matrix sub-blocks into shared memory for reuse, minimizing slow global memory accesses.
●​ Let us say we have two matrices A and B of 4x4 dimension, then using normal matrix chain
multiplication we will have to store all the variables from A and B into the global memory and
then compute results which is very costly.
●​ Instead of computing all at once, we divide the matrices into tiles of size 2×2 (sub-matrices).

●​ Benefits of Tiled Matrix Multiplication
○​ Minimizes global memory accesses → Stores data in fast shared memory
○​ Efficient parallel execution → Threads compute independent tiles
○​ Memory reuse → Avoids redundant loads

#define TILE_WIDTH 16

// Assumes N is a multiple of TILE_WIDTH and the grid fully covers the output matrix.
__global__ void MatMulTiled(float* A, float* B, float* C, int N) {
    __shared__ float tileA[TILE_WIDTH][TILE_WIDTH];
    __shared__ float tileB[TILE_WIDTH][TILE_WIDTH];

    int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    int col = blockIdx.x * TILE_WIDTH + threadIdx.x;

    float temp = 0.0f;

    for (int t = 0; t < N / TILE_WIDTH; ++t) {
        // Load one tile of A and one tile of B into shared memory
        tileA[threadIdx.y][threadIdx.x] = A[row * N + (t * TILE_WIDTH + threadIdx.x)];
        tileB[threadIdx.y][threadIdx.x] = B[(t * TILE_WIDTH + threadIdx.y) * N + col];
        __syncthreads();

        // Multiply the tiles
        for (int k = 0; k < TILE_WIDTH; ++k)
            temp += tileA[threadIdx.y][k] * tileB[k][threadIdx.x];

        __syncthreads();
    }

    // Write result
    C[row * N + col] = temp;
}
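A possible launch configuration for the kernel above (assuming, as the kernel does, that N is a multiple of TILE_WIDTH and that d_A, d_B, d_C are device pointers):

dim3 block(TILE_WIDTH, TILE_WIDTH);            // 16 x 16 = 256 threads per block
dim3 grid(N / TILE_WIDTH, N / TILE_WIDTH);     // one thread block per output tile
MatMulTiled<<<grid, block>>>(d_A, d_B, d_C, N);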
Mod 4
On Chip Memory: Memory located inside the GPU chip. Extremely fast due to physical proximity to the compute cores. These have very low latency and small size, and are used for frequently accessed data.
●​ Registers.
●​ L1 Cache / Shared Memory
●​ Read Only Memory
Off Chip Memory: Memory located outside the GPU chip, typically on the GPU card or system
board. These are much larger, high latency, and need coalesced access.
●​ Global Memory
●​ Local Memory / L2 Cache
●​ Texture Memory - Special memory for graphic operations.

Memory Hierarchy:
Registers (256 KB per SM) -> L1 Cache / Shared Memory (192 KB per SM) -> Read-Only (64 KB) -> L2 Cache (40 MB) -> Global Memory (40 GB); figures are for the NVIDIA A100.

Shared Memory:
●​ Shared memory is a small, fast on-chip memory that is shared among all threads within the
same thread block.
●​ It is significantly faster than global memory because it resides closer to the cores.
●​ Accessible only by threads within the same block. Not visible to other blocks.
●​ Exists for the duration of the block execution.
●​ Uses:
○​ Data reuse across threads.
○​ Reducing global memory access.
○​ Improving memory coalescing and performance.
●​ Declaration: __shared__ float tile[32][32];
●​ Capacity is limited (typically 48KB per Streaming Multiprocessor, varies by architecture).
●​ Access time is similar to register access speed, but with slightly higher latency.
●​ Threads must use __syncthreads() to avoid race conditions when reading/writing shared
memory.

Thread Synchronisation:
●​ Thread synchronization ensures that multiple threads coordinate their execution properly,
especially when sharing data or accessing shared memory to avoid race conditions and
ensure correct results.
●​ Why is it needed:
○​ Threads within a block often share data via __shared__ memory.
○​ Without synchronization, one thread might read/write data before others finish
updating it.
○​ This can lead to data inconsistency or incorrect output.
●​ Mechanism (Using CUDA Barriers):
○​ In CUDA, a barrier forces all threads in a thread block to wait until every other thread
in that block reaches the same point.
○​ All threads in a block must reach this point before any can proceed.
○​ Ensures memory consistency across threads.
○​ Only works within a single block, not across multiple blocks.
○​ Uses __syncthreads();
○​ Use it after writing to __shared__ memory and before reading from it.
●​ Limitations:
○​ Cannot synchronize across thread blocks.
○​ If not used correctly (e.g., inside conditional branches), it can lead to deadlocks.
●​ Example:
__global__ void withSync(int *arr) {
    __shared__ int temp[256];
    int tid = threadIdx.x;
    temp[tid] = tid;
    __syncthreads();                   // Ensure all threads finish writing before any read
    arr[tid] = temp[(tid + 1) % 256];  // Safe read: every element has been written and the index stays in bounds
}

Types of Memory Accessing:


●​ Coalesced Access:
○​ Threads access memory in order and adjacent locations.
○​ These transactions are typically fast and efficient.
○​ Ex: thread 0 -> mem[0] , thread 1 -> mem[1]
●​ Non-Coalesced Access:
○​ Threads access memory in a random or scattered pattern.
○​ These transactions are slow and costly and cannot happen in one go.
○​ Ex: thread 0 -> mem[0], thread 1 -> mem[12], thread 2 -> mem[33]

Memory coalescing
●​ When multiple threads in a warp access global memory in a continuous and aligned way, the
GPU combines those accesses into one memory transaction.
●​ In CUDA, global memory is slow, so efficient access is critical. When a warp (32 threads)
accesses memory, coalescing occurs if all threads access consecutive memory addresses.
●​ Coalesced access = faster memory performance
●​ Allows the GPU to use DRAM burst: fetches multiple adjacent memory locations in one go
●​ Threads in a warp must follow aligned, sequential access for coalescing to happen

●​ Techniques to Ensure Coalesced Access:


○​ Align Threads with Memory Layout (thread i should access memory i)
○​ Use shared memory (Load global data into shared memory first)
○​ Optimize Grid & Block Dimensions (for 2D/3D align threadID, blockID for better output)
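A small illustrative kernel pair contrasting the two access patterns (assumes a float array whose length n is a multiple of 32):

// Coalesced: thread i reads element i, so a warp touches 32 consecutive floats
// that can be served by a single memory transaction.
__global__ void coalescedCopy(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Non-coalesced: a stride of 32 scatters the warp's reads across memory,
// forcing several separate transactions and wasting bandwidth.
__global__ void stridedCopy(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[(i * 32) % n];
}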

Data Transfer using DMA:


●​ DMA (Direct Memory Access) is a hardware feature that allows peripherals (like GPUs) to
directly access system memory (RAM) without continuous involvement of the CPU.
●​ In the context of CUDA and GPUs, it enables data transfer between host (CPU) and device
(GPU) memory efficiently without the constant involvement of CPU.
●​ Importance:
○​ CPU-GPU communication is often a performance bottleneck in heterogeneous
computing. Most of the time is spent in synchronising the CPU-GPU.
○​ DMA allows asynchronous data transfer, meaning the CPU can perform other
operations while data is being copied — increasing parallelism.
●​ Working:
○​ The CPU simply initiates the transfer request (e.g., via cudaMemcpyAsync() in
CUDA), after which the DMA controller takes full control of the data movement.
○​ The DMA controller then reads the memory directly (bypassing the CPU) and uses bus arbitration to gain access to the system memory bus (typically PCIe today).
○​ Then performs the memory copy operation between the specified locations without
CPU intervention.
○​ While the DMA controller handles the transfer:
■​ The CPU is free to execute other instructions, improving overall efficiency.
■​ Similarly, if async copies are used, GPU can execute kernels while data
transfer is in progress.
●​ Benefits:
○​ Frees up CPU resources for other tasks
○​ Speeds up data transfer between host and device
○​ Supports concurrent execution and communication
○​ Reduces latency and improves throughput
●​ Disadvantages:
○​ Requires complex setup and careful memory management.
○​ Needs pinned (page-locked) memory, reducing available system RAM.
○​ Limited by the bus arbitration capacity.

Pageable Memory vs Pinned (Page-Locked) Memory:

●​ Definition: Pageable is the default memory that the OS can page in and out between RAM and secondary storage; pinned memory is locked in RAM and cannot be paged out to secondary storage.
●​ Memory Allocation: Pageable is allocated using standard malloc() or new; pinned is allocated using cudaHostAlloc() or cudaMallocHost().
●​ Pageability: Pageable memory can be paged out to disk by the OS; pinned memory stays resident in physical RAM.
●​ Transfer Speed (CPU↔GPU): Pageable is slower due to an extra copy through a staging buffer; pinned is faster because the data can be accessed directly by DMA.
●​ DMA Support: Pageable is not supported directly and requires staging into pinned memory; pinned is accessed directly by the DMA controller.
●​ Use Case: Pageable is good for general host-side computation; pinned is ideal for high-performance data transfers.
●​ Impact on System Memory: Pageable has minimal impact because the OS can manage memory flexibly; pinned reduces the RAM available to the OS and other applications.
●​ Async Transfer Support: Pageable is limited and typically blocks the host thread; pinned fully supports cudaMemcpyAsync() with streams.
●​ Performance on Small Transfers: Pageable is often better for very small data sizes; pinned has slightly higher overhead and is better for large data.
●​ Synchronization Needs: Pageable is simpler but gives less overlap between compute and data transfer; pinned enables better overlap via streams and events.

Drawbacks of Over Allocating Pinned memory:


●​ Pinned memory reduces available system RAM since it cannot be paged out.
●​ Over-allocation can cause system slowdowns due to memory pressure.
●​ Excessive pinned memory may lead to allocation failures or program crashes.
●​ Too much pinned memory can degrade GPU data transfer performance.
●​ Reduces system’s ability to multitask due to locked memory.

Ensure for optimal performance:


●​ Allocate only what is needed and only the critical data to be kept in pinned memory
●​ Constantly monitor the memory usage and adjust pinned allocations based on workload.
●​ Release pinned memory promptly after use to free resources.
●​ Monitor system RAM usage to prevent memory pressure.
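A minimal sketch of allocating pinned memory for a fast asynchronous transfer (illustrative names; error checking omitted):

void pinnedTransferDemo() {
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    float *h_pinned, *d_buf;
    size_t bytes = (1 << 20) * sizeof(float);

    cudaMallocHost(&h_pinned, bytes);   // pinned (page-locked) host allocation
    cudaMalloc(&d_buf, bytes);

    // Because h_pinned cannot be paged out, the DMA engine copies it directly
    // and the transfer can overlap with other work queued in the same stream.
    cudaMemcpyAsync(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice, stream);
    cudaStreamSynchronize(stream);

    cudaFreeHost(h_pinned);             // release pinned memory promptly
    cudaFree(d_buf);
    cudaStreamDestroy(stream);
}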

CUDA Streams:
●​ Normally, when we transfer data between the CPU and GPU using cudaMemcpy and then
launch a kernel, these operations happen sequentially — first the data is copied to the GPU,
then the kernel runs, and finally, the results are copied back.
●​ During each step, the GPU may sit idle, especially while waiting for data transfers, leading to
wasted time.
●​ CUDA Streams solve this by enabling overlap of operations.
●​ A stream is like a separate command queue where we can schedule tasks — such as
memory copies and kernel launches — independently of other streams.
●​ By using multiple streams, we can run memory transfers and kernel executions at the same
time. For example: While data is being copied from the CPU to the GPU in one stream,
another stream can be running a kernel on previously transferred data.
●​ This overlapping of tasks leads to better GPU utilization, allowing it to do useful work instead
of waiting — which in turn means faster execution and higher performance.
●​ Types of Streams:
○​ Default Stream: Operations are serialized and synchronized with all other streams.
○​ Non-Default Stream: Operations are only ordered within the stream and can run concurrently with operations in other streams.
●​ Behavior:
○​ Intra-stream ordering: Operations in the same stream execute in order.
○​ Inter-stream independence: Operations in different streams may execute
concurrently, depending on the GPU's ability.
●​ Synchronization APIs:
○​ cudaStreamSynchronize(stream): Waits until all tasks in the stream are finished.
○​ cudaDeviceSynchronize(): Waits for all GPU work to complete.
○​ cudaStreamWaitEvent(): Makes a stream wait for an event.
○​ cudaStreamQuery(): Checks if the stream is done (non-blocking).
●​ Best Practices:
○​ Use asynchronous APIs (cudaMemcpyAsync, etc.) for overlapping.
○​ Limit the number of concurrent streams based on your GPU's capabilities.
○​ Group independent work into separate streams.
○​ Avoid frequent cudaDeviceSynchronize(); use cudaStreamSynchronize() for
more fine-grained control.

Example (assumes the streams have been created with cudaStreamCreate() and the host buffers are pinned so the copies are truly asynchronous):
for (int i = 0; i < N; ++i) {
    // Each chunk gets its own stream: copy in, compute, copy out
    cudaMemcpyAsync(d_input[i], h_input[i], size, cudaMemcpyHostToDevice, streams[i]);
    kernel<<<grid, block, 0, streams[i]>>>(d_input[i], d_output[i]);
    cudaMemcpyAsync(h_output[i], d_output[i], size, cudaMemcpyDeviceToHost, streams[i]);
}

Because each iteration's copy-kernel-copy sequence is queued in its own stream, the host-to-device copy of one chunk can overlap with the kernel execution of another chunk and the device-to-host copy of a third, so the GPU and the copy engines stay busy and overall performance improves.

Data Prefetching:
●​ Data prefetching in CUDA refers to the technique of moving data to the GPU memory in
advance of when it is needed by a kernel.
●​ This helps hide memory transfer latency and ensures that the GPU is not idle while waiting for
data.
●​ In a typical CUDA program:
○​ Data is transferred from the CPU (host) to the GPU (device) using cudaMemcpy.
○​ Then, a kernel is launched to process the data.
○​ Finally, results are copied back to the host.
●​ If these steps are performed sequentially, the GPU often sits idle during data transfer, which
wastes time and reduces performance.
●​ Why is it needed:
○​ Hide memory latency by overlapping computation and data transfer.
○​ Improve memory throughput and reduce stalls.
○​ Take advantage of shared memory or L1 cache for faster access.
●​ Techniques:
○​ Manual Prefetching into Shared Memory
■​ Threads copy data from global memory to shared memory before using it.
■​ Done explicitly inside kernels using __shared__ memory.
○​ Using Asynchronous Memory Copy with Streams
■​ With cudaMemcpyAsync(), you can overlap data transfer and kernel execution.
■​ Ideal for large datasets and multi-buffered processing.
○​ Compute-Transfer Overlap
■​ While one chunk of data is being processed, the next chunk is prefetched in
parallel using streams.
○​ Prefetch APIs (For Unified Memory)
■​ cudaMemPrefetchAsync(ptr, size, device, stream) helps the runtime move data closer to the target device before kernel execution (see the sketch after this list).
■​ Useful for systems with unified memory (UM) support.
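A minimal sketch of the unified-memory prefetch technique (the kernel, its launch configuration, and the helper name are our own assumptions):

#include <cuda_runtime.h>

__global__ void kernel(float* data, size_t n);       // assumed to be defined elsewhere

void launch_with_prefetch(size_t n, cudaStream_t stream) {
    float* data = nullptr;
    size_t bytes = n * sizeof(float);
    cudaMallocManaged(&data, bytes);                 // unified memory, visible to CPU and GPU
    for (size_t i = 0; i < n; ++i) data[i] = 1.0f;   // initialize on the host

    int device = 0;
    cudaGetDevice(&device);
    cudaMemPrefetchAsync(data, bytes, device, stream);   // move pages to the GPU ahead of the kernel
    kernel<<<(unsigned)((n + 255) / 256), 256, 0, stream>>>(data, n);

    cudaStreamSynchronize(stream);
    cudaFree(data);
}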

Summary (Techniques to Optimise Data Transfer between CPU - GPU / Improve memory
performance of a GPU):
●​ Pinned Memory
●​ Thread Synchronisations
●​ Shared Memory
●​ Memory Coalescing
●​ CUDA Streams
●​ Asynchronous Data Transfers
●​ Data Prefetching
●​ Limit data transfers (transfer only when really needed)
●​ Unified Memory
●​ DMA usage instead of making the CPU do the data-transfer work
Module 5
Module 6

cuBLAS:
●​ cuBLAS is NVIDIA's GPU-accelerated implementation of the Basic Linear Algebra
Subprograms (BLAS) library.
●​ It provides optimized routines for performing fundamental linear algebra operations, such as matrix and vector multiplications, on NVIDIA GPUs (a minimal GEMM sketch follows at the end of this section).
●​ Key Features:
○​ Optimized Routines: cuBLAS offers highly tuned implementations of BLAS routines,
enabling efficient execution of linear algebra operations on GPUs.
○​ Multiple API Versions: The library exposes several APIs, including the cuBLAS API, cuBLASXt API, cuBLASLt API, and cuBLASDx API, each catering to different use cases and providing varying levels of flexibility and performance.
○​ Asynchronous Execution: cuBLAS supports asynchronous execution of operations,
allowing for overlapping computation and communication, thereby improving overall
performance.
○​ Multi-GPU Support: The cuBLASXt API facilitates distributed computations across multiple GPUs, enabling scalability for large-scale linear algebra tasks.
○​ Data Layout Flexibility: The cuBLASLt API introduces a lightweight library dedicated
to General Matrix-to-Matrix Multiply (GEMM) operations, offering flexibility in matrix
data layouts and algorithmic implementations.
●​ Applications:
○​ Deep learning backends (e.g., TensorFlow, PyTorch)
○​ Scientific computing
○​ High-performance computing (HPC)
○​ Financial modeling, simulations
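A minimal sketch of calling cuBLAS for a single-precision matrix multiply C = A × B (the helper name and the assumption that square, column-major matrices already reside on the device are our own):

#include <cublas_v2.h>

void gemm_example(const float* d_A, const float* d_B, float* d_C, int n) {
    cublasHandle_t handle;
    cublasCreate(&handle);                        // create a cuBLAS context

    const float alpha = 1.0f, beta = 0.0f;
    // cuBLAS uses column-major storage; leading dimensions are n for square matrices.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n,
                &alpha, d_A, n, d_B, n,
                &beta, d_C, n);

    cublasDestroy(handle);                        // release the context
}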

cuDNN:
●​ cuDNN is a GPU-accelerated library developed by NVIDIA, designed to optimize deep learning operations on NVIDIA GPUs. It provides highly tuned implementations of standard deep learning primitives such as convolutions, pooling, normalization, and activation functions (a minimal usage sketch follows at the end of this section).
●​ Key Features:
○​ Optimized Deep Learning Primitives: cuDNN offers efficient implementations of operations like convolution, pooling, softmax, and activation functions, accelerating the training and inference of deep neural networks.
○​ Multi-Operation Fusion: The library supports multi-operation fusion patterns, allowing
for further optimization by combining multiple operations into a single kernel, reducing
memory overhead and improving performance.
○​ Flexible Data Layout: Supports custom data layouts and flexible dimensions for the 4D tensor inputs and outputs to its routines.
○​ Multithreading: Its context-based API allows multithreading and interoperability with CUDA streams, enabling efficient execution.
○​ Support for DL Algorithms: Supports LSTMs, RNNs, CNNs, etc.
●​ Applications:
○​ Deep learning training and inference
○​ CNNs, RNNs, Transformer models
○​ Real-time image/audio processing
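A minimal sketch of using cuDNN to apply a ReLU activation to a 4D NCHW tensor already resident on the device (the function name and tensor shape arguments are our own assumptions):

#include <cudnn.h>

void relu_forward(const float* d_x, float* d_y, int n, int c, int h, int w) {
    cudnnHandle_t handle;
    cudnnCreate(&handle);

    // Describe the input/output tensor layout (4D, NCHW, float).
    cudnnTensorDescriptor_t desc;
    cudnnCreateTensorDescriptor(&desc);
    cudnnSetTensor4dDescriptor(desc, CUDNN_TENSOR_NCHW, CUDNN_DATA_FLOAT, n, c, h, w);

    // Describe the activation operation (ReLU, no NaN propagation).
    cudnnActivationDescriptor_t act;
    cudnnCreateActivationDescriptor(&act);
    cudnnSetActivationDescriptor(act, CUDNN_ACTIVATION_RELU, CUDNN_NOT_PROPAGATE_NAN, 0.0);

    const float alpha = 1.0f, beta = 0.0f;
    cudnnActivationForward(handle, act, &alpha, desc, d_x, &beta, desc, d_y);

    cudnnDestroyActivationDescriptor(act);
    cudnnDestroyTensorDescriptor(desc);
    cudnnDestroy(handle);
}
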
cuGraph:
●​ cuGraph is an accelerated graph analytics library that is part of the RAPIDS suite.
●​ It provides a collection of graph algorithms and services, enabling efficient analysis of graph
data on NVIDIA GPUs.
●​ Key Features:
○​ GPU-Accelerated Graph Algorithms: cuGraph offers implementations of common
graph algorithms, such as PageRank, breadth-first search (BFS), and degree
centrality, leveraging the parallel processing power of GPUs.
○​ Integration with RAPIDS Ecosystem: The library seamlessly integrates with other
RAPIDS libraries, such as cuDF (GPU DataFrame) and cuML (GPU-accelerated
machine learning), facilitating end-to-end data science workflows.
○​ Support for Multiple Data Formats: cuGraph supports various graph data formats,
including NetworkX Graphs, cuDF DataFrames, and sparse matrices from CuPy and
SciPy, ensuring compatibility with existing tools and libraries.
○​ Multi-GPU Scalability: With integration into the RAPIDS ecosystem, cuGraph can
scale across multiple GPUs, enabling the analysis of large-scale graphs with billions of
edges.
●​ Applications:
○​ Social network analysis
○​ Fraud detection
○​ Recommendation systems
○​ Bioinformatics (e.g., protein interaction graphs)

GPU Clusters:
●​ GPU clusters are a collection of interconnected GPUs that work together to execute
computationally intensive tasks in parallel.
●​ Each GPU consists of hundreds or thousands of smaller cores, allowing them to handle a
multitude of tasks simultaneously.
●​ By leveraging the parallel processing capabilities of GPUs, clusters can process vast amounts
of data at lightning-fast speeds, making them ideal for HPC applications.
●​ Architecture:
○​ Head / Primary Node (Master Node): acts as the coordinator of the cluster, distributing work and collecting results.
○​ Compute / Secondary Nodes:
■​ CPU
■​ Heterogeneous vs Homogeneous GPU nodes.
■​ RAM
■​ Local Storage / Shared Storage
○​ High-speed interconnects (e.g., Ethernet) to connect all the nodes in the network
○​ Cluster management system (e.g., Kubernetes, Slurm) to manage how the nodes interact, including which node takes over a task if another goes down
○​ Job schedulers to manage workloads
●​ Working:
○​ A single GPU might take hours or even days to complete the task. However, by
utilizing a GPU cluster, we can distribute the workload across multiple GPUs,
significantly reducing the processing time.
○​ This is achieved by dividing the data into smaller chunks and assigning each chunk to
a different GPU within the cluster.
○​ Each GPU then processes its assigned data independently, and the results are
combined to obtain the final output.
○​ For example, when adding millions of numbers across GPU nodes, each node produces an intermediate sum of the numbers it was assigned, and the master node combines these partial results into the final sum (see the sketch after this list).
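A minimal sketch of this idea, using MPI for node-level coordination and Thrust for the per-GPU partial sum (all names and the synthetic data are our own assumptions; error handling is omitted):

#include <cstdio>
#include <mpi.h>
#include <thrust/device_vector.h>
#include <thrust/reduce.h>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, nprocs = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    // Each node/GPU gets its own chunk of the numbers (synthetic data here).
    const long long chunk = 1000000;
    thrust::device_vector<long long> numbers(chunk, 1);

    // Each GPU computes an intermediate sum of its assigned chunk.
    long long local_sum = thrust::reduce(numbers.begin(), numbers.end(), 0LL);

    // The master node (rank 0) combines the intermediate sums into the final result.
    long long total = 0;
    MPI_Reduce(&local_sum, &total, 1, MPI_LONG_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) std::printf("Total sum across %d nodes: %lld\n", nprocs, total);
    MPI_Finalize();
    return 0;
}
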
●​ Features:
○​ Parallelism: Enables training of deep learning models across multiple GPUs/nodes.
○​ Scalability: Easily expand by adding more nodes/GPUs.
○​ Shared resources: Data and compute can be shared across the cluster.
○​ Handling Large Amounts of data
○​ Distributed Workload
●​ Disadvantages:
○​ Costly: Expensive to set up and maintain.
○​ Complex setup: Requires deep knowledge of networking, drivers, schedulers, etc.
○​ Job failures: Debugging across nodes can be difficult.
○​ Networking bottlenecks: Data transfer speed can become a limiting factor.
○​ Data and Model Synchronisation: Ensuring that all the nodes are synchronised can
be a difficult task.
●​ Methods:
○​ Data Parallelism: Split data across multiple GPUs; each GPU runs the same model on a different subset of data. Example: training a DNN, where the gradients from all the GPUs are accumulated to produce the final update.
○​ Model Parallelism: Split the model across multiple GPUs. Each GPU handles a
different part of the model. Ex: Large models like transformer, where we can load
different layers on different GPUs.
○​ Hybrid Parallelism: Combines data, model, and pipeline parallelism. Used in very large-scale training setups such as GPT and PaLM.
●​ Use Cases:
○​ Financial Modelling:
■​ Provides the compute power to run complex financial algorithms such as Monte Carlo simulations, risk analysis, and portfolio prediction.
■​ With faster calculations, firms can make quicker data-driven decisions in response to market movements.
○​ Training Larger Models like GPT:
■​ Large transformer-based models with complex architectures, billions of parameters, and massive training datasets need a distributed environment to train quickly and efficiently.
■​ Techniques such as the hybrid parallelism approach described above can be used to achieve this.
○​ Big Data Analytics:
■​ Real-time data pipelines where data is aggregated from hundreds of sources and systems must make quick decisions and update in real time.
■​ Used for running complex queries on massive datasets and for real-time business analytics.
