HPC - 1
Parallel Computing:
● Parallel computing is a method of simultaneously executing multiple computations to solve a
problem faster by dividing tasks across multiple processors or cores.
● Instead of processing instructions one after another (sequential computing), parallel
computing splits the workload and processes multiple parts at the same time.
● Core Components:
○ Multiple Processors or Cores – Tasks are divided among several processing units.
○ Concurrency – Different tasks execute at the same time.
○ Synchronization – Processors coordinate to combine results correctly.
○ Communication – Data is exchanged between processors when necessary.
● Types:
○ Bit-Level Parallelism – Processes multiple bits in a single operation (e.g., 64-bit
processors vs. 32-bit).
○ Instruction-Level Parallelism (ILP) – Executes multiple instructions at the same time
using pipelining and superscalar architecture.
○ Data Parallelism – Distributes data across multiple processors performing the same
operation (e.g., GPUs processing images).
○ Task Parallelism – Different tasks run on separate processors (e.g., web server
handling multiple requests).
Speedup:
● Speedup measures how much faster a parallel algorithm runs compared to a sequential one.
● S = T1/TP
● Where, T1 is the time taken by a single processor and TP is the time taken by P processors.
Efficiency:
● Efficiency measures how effectively multiple processors are utilized.
● E = S/P
● Where S is the speedup and P is the total number of processors.
● E=1 → Perfect efficiency
● E<1 → Some inefficiency (due to communication, idle time, etc.)
● E>1 (rare) → Superlinear speedup (usually due to caching benefits)
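● Worked example (illustrative numbers, not from the notes): if a job takes T1 = 100 s on one processor and T8 = 20 s on 8 processors, then S = 100/20 = 5 and E = 5/8 ≈ 0.63.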
Amdahl’s Law:
● Amdahl’s Law is a principle in parallel computing that predicts the maximum possible speedup
of a program when a portion of it is parallelized.
● It shows how much a program's performance can improve by adding more processors, but
also highlights the limits of parallelization due to the sequential portion of the program.
● Amdahl’s Law helps determine the optimal number of processors needed before adding more
becomes wasteful.
● The speedup can never exceed the total number of processors available. If it does (very rare)
then it is called superlinear speedup.
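● For reference, the law is usually written as S(P) = 1 / ((1 - f) + f/P), where f is the fraction of the program that can be parallelized and P is the number of processors.
● Worked example (illustrative numbers): if f = 0.9, the speedup can never exceed 1 / (1 - 0.9) = 10x no matter how many processors are added; with P = 8, S = 1 / (0.1 + 0.9/8) ≈ 4.7x.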
Data Parallelism:
● Example: adding a million numbers across GPU cores – the “data” (the numbers) changes from thread to thread, but the “task” (addition) stays the same. A minimal CUDA sketch of this pattern is given at the end of this section.
● Advantages:
○ Faster Processing – Multiple processors handle chunks of data simultaneously.
○ Scales Well – Works effectively on large datasets, especially in deep learning &
simulations.
○ Efficient Use of GPUs – Utilizes multiple GPUs efficiently for parallel computation.
● Disadvantages:
○ Not Suitable for All Problems – Only works when the same operation is applied to
different data.
○ High Communication Overhead – Data transfer between processors can slow things
down.
○ Synchronization Issues – Combining results from multiple processors can introduce
complexity.
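A minimal CUDA sketch of the data-parallel pattern from the example above (the kernel name and sizes are illustrative, not from the notes): every thread performs the same operation (addition) on a different element.

// Same task (addition) applied to different data elements in parallel.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global index of this thread
    if (i < n) c[i] = a[i] + b[i];                   // each thread adds one pair of numbers
}

// Launch with enough threads to cover ~1 million numbers:
// vecAdd<<<(1000000 + 255) / 256, 256>>>(d_a, d_b, d_c, 1000000);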
Task Parallelism:
● Example: an AI app deployed on a multi-core server – Core 1 handles User A’s login request, Core 2 serves an image to User B, and Core 3 fetches data from the database for User C. All tasks run simultaneously, improving efficiency. (A host-side sketch follows at the end of this section.)
● Advantages:
○ Handles Diverse Tasks – Different tasks can run independently (e.g., web servers,
OS processes).
○ Better Resource Utilization – Can use different hardware for different tasks (CPU for
logic, GPU for rendering).
○ Efficient for Task-Based Workloads – Useful for distributed systems & cloud
applications.
● Disadvantages:
○ Limited Scalability – Some tasks cannot be divided further, limiting performance
gains.
○ Load Balancing Issues – Some tasks may take longer than others, causing
inefficiencies.
○ More Complex to Implement – Requires proper scheduling and task management.
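A minimal host-side C++ sketch of task parallelism (task names are illustrative, matching the example above): three different tasks run on separate threads, which the OS can schedule on separate cores.

#include <thread>
#include <iostream>

// Three unrelated tasks; each can run on its own core.
void handleLogin()  { std::cout << "handling login\n"; }
void serveImage()   { std::cout << "serving image\n"; }
void fetchFromDb()  { std::cout << "fetching from database\n"; }

int main() {
    std::thread t1(handleLogin);        // task 1
    std::thread t2(serveImage);         // task 2
    std::thread t3(fetchFromDb);        // task 3
    t1.join(); t2.join(); t3.join();    // wait for all tasks to finish
    return 0;
}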
Flynn's Taxonomy classifies computer architectures based on how instructions and data are handled during processing. It defines four categories based on the number of instruction streams and data streams a system can process simultaneously: SISD (Single Instruction, Single Data), SIMD (Single Instruction, Multiple Data), MISD (Multiple Instruction, Single Data), and MIMD (Multiple Instruction, Multiple Data).
Load Balancing:
● [Figure in the original notes compared two ways of distributing work across cores.] The right-hand distribution is better because only 1 core is idle, whereas in the left one multiple cores are idle, i.e., severe underutilization.
● Types of Load Imbalance:
○ Iterative Algorithms with Variable Convergence Rates
■ Problem: Some computations converge faster than others, leading to idle
processors.
■ Example: Solving a system of equations, training machine learning models.
■ Solution: Use dynamic load balancing to reassign work from idle processors to
busy ones.
○ Limited Parallelism Due to Coarse Granularity
■ Problem: Tasks are too large and cannot be evenly distributed across processors.
■ Example: 5 tasks assigned to 8 processors may cause underutilization.
■ Solution: Break tasks into smaller sub-tasks (fine-grained parallelism) to balance the workload.
○ Load Imbalance Due to Waiting for Resources
■ Problem: Workers spend time waiting for disk I/O or network communication instead of computing.
■ Example: A simulation where a worker waits for data transfer.
■ Solution: Overlap computation and communication using asynchronous execution and prefetching.
● Types of Load Balancing
○ (A) Static Load Balancing
■ The workload is assigned before execution begins.
■ The system does not adjust the load dynamically during runtime.
■ Suitable for problems where the workload is predictable.
■ Round-Robin Scheduling: Tasks are assigned cyclically to processors in a
predetermined order.
○ (B) Dynamic Load Balancing
■ The workload is assigned and adjusted at runtime based on the current load of
processors.
■ Ensures that no processor remains idle while others are overloaded.
■ Master-Slave Model: A master processor assigns tasks dynamically to worker processors (a minimal sketch of dynamic assignment is given after the strategies below).
● Strategies:
○ (A) Task Scheduling-Based Load Balancing
■ Tasks are divided and scheduled optimally among processors.
■ Used in multiprocessor scheduling and GPU task assignment.
■ Example: Heterogeneous computing (CPU + GPU), where heavier tasks go to
the GPU while lightweight tasks run on the CPU.
○ (B) Data Partitioning-Based Load Balancing
■ The dataset is divided into smaller chunks, ensuring even workload distribution.
■ Used in Big Data processing (e.g., MapReduce, Spark) where data partitions
are dynamically assigned.
○ (C) Communication-Based Load Balancing
■ Balances computational workload while minimizing communication costs between processors.
■ Used in distributed memory systems like MPI-based applications.
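A minimal CUDA sketch of dynamic load balancing on a GPU (illustrative names, not from the notes): threads pull the next task index from a shared counter using atomicAdd, so threads that finish early simply grab more work instead of sitting idle.

__device__ int nextTask = 0;                 // shared work counter (reset before each launch)

__global__ void dynamicWorker(const float *in, float *out, int numTasks) {
    while (true) {
        int task = atomicAdd(&nextTask, 1);  // grab the next unprocessed task
        if (task >= numTasks) break;         // no work left for this thread
        out[task] = in[task] * in[task];     // stand-in for a task of variable cost
    }
}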
Module 2 - Introduction to High-Performance Computing
Working of HPC:
● Cluster Configuration:
○ HPC system consists of multiple interconnected computers, or nodes, forming clusters.
○ Each node has processors, memory, and storage, connected via a high-speed network
for fast communication.
● Task Parallelization:
○ Large computational problems are divided into smaller tasks, which run simultaneously
across multiple nodes.
○ This process, called parallel computing, allows tasks to be executed faster.
● Data Distribution:
○ The input data is split among the nodes, ensuring each node processes a specific
portion of the data.
○ This prevents bottlenecks and enables efficient workload distribution.
● Computation:
○ Each node executes its assigned computation in parallel with others.
○ Intermediate results are shared, combined, and refined until the computation is
complete.
● Monitoring and Control:
○ HPC systems use software tools to monitor node performance and dynamically
allocate resources to maximize efficiency.
○ Load balancing ensures that all nodes are optimally utilized.
● Output Generation:
○ The final result is an integration of all processed data.
○ Outputs are stored in large parallel file systems and can be visualized for analysis and
interpretation.
Advantages:
● Faster Computation – Executes complex tasks in hours or minutes instead of weeks.
● Efficient Parallel Processing – Distributes workloads across multiple processors for better
performance.
● Scalability – Easily expands by adding more computing nodes to handle larger workloads.
● High Data Processing Capability – Handles vast amounts of data, ideal for big data
analytics and AI training.
● Improved Accuracy – Enables precise simulations and models in fields like weather
forecasting and medical research.
● Cost-Effective for Large Tasks – Reduces overall computation time, lowering operational
costs for research and businesses.
● Enhanced Resource Utilization – Uses CPU, GPU, and memory efficiently to maximize
throughput.
● Real-Time Processing – Supports time-sensitive applications like financial modeling and
scientific experiments.
● Optimized for Complex Problems – Solves high-end engineering, physics, and genomics
problems that traditional computers can't handle.
● Supports Multi-User Collaboration – Multiple teams can work on different computations
simultaneously.
CPU:
● The Central Processing Unit (CPU) is the primary component of a computer that executes
instructions from programs.
● It processes data, performs arithmetic and logical operations, and manages control signals for
other hardware components.
● Characteristics:
○ Clock Speed (GHz): Determines how many clock cycles the CPU executes per second, measured in gigahertz (e.g., 3.5 GHz = 3.5 billion cycles per second). A higher clock speed generally means faster processing but also higher power consumption.
○ Core Count: Modern CPUs have multiple cores (dual-core, quad-core, octa-core,
etc.). More cores allow for better multitasking and parallel processing. Used in
multi-threaded applications for improved performance.
○ Cache Memory: A small but fast memory unit inside the CPU that stores frequently
accessed data. L1 Cache (fastest, smallest), L2 Cache (larger, slightly slower), L3
Cache (largest, slowest). Reduces the need to access slower RAM.
○ Single Instruction Execution at a Time: Traditional CPUs follow the Von Neumann
architecture, processing one instruction after another in a sequential manner.
○ Bus Speed and Memory Bandwidth: Defines how quickly data moves between the CPU, RAM, and other components. Measured in GT/s (gigatransfers per second) or GB/s.
GPU:
● A Graphics Processing Unit (GPU) is a specialized processor designed to accelerate parallel
computations.
● Originally built for rendering graphics, GPUs are now widely used in high-performance
computing (HPC), artificial intelligence (AI), deep learning, and scientific simulations due to
their ability to process massive amounts of data simultaneously.
● Massively Parallel Architecture
○ Unlike a CPU, which has a few powerful cores optimized for sequential tasks, a GPU
contains thousands of smaller cores designed to handle multiple tasks in parallel.
○ This makes GPUs highly efficient for vectorized and matrix-based computations,
essential for AI, scientific modeling, and gaming.
● High Throughput over Low Latency
○ Throughput refers to the amount of data processed at once, and GPUs are designed
for high throughput rather than low-latency execution.
○ This means GPUs excel in workloads where many operations can be performed
simultaneously, rather than tasks requiring fast decision-making.
● High Memory Bandwidth
○ GPUs feature fast memory interfaces (e.g., GDDR6, HBM) to handle large volumes of
data efficiently.
○ Deep learning models, simulations, and gaming engines benefit significantly from this
increased memory bandwidth.
● Optimized for Parallel Workloads
○ GPUs execute SIMD (Single Instruction, Multiple Data) operations, processing many
data points in parallel.
○ This is particularly useful in neural networks, computer vision, financial modeling, and
video rendering.
FPGA:
● FPGA (Field-Programmable Gate Array) is a reconfigurable hardware device that allows users
to implement custom digital circuits.
● The exact architecture varies by manufacturer, but a typical FPGA consists of the following
key components:
● Configurable Logic Blocks (CLBs):
○ Contain logic elements such as Look-Up Tables (LUTs), Flip-Flops, and multiplexers.
○ Perform logic and arithmetic operations.
● Programmable I/O Blocks:
○ Connect FPGA logic to external devices via interfacing pins.
○ Support various communication standards.
● Programmable Interconnect Resources:
○ Pre-laid vertical and horizontal wiring.
○ Used to route signals between logic blocks.
○ Density depends on the number of routing paths and wire segments.
GPU Program Execution Model, i.e., Working (Host = CPU, Device = GPU):
● Declare CPU Variables
● Allocate memory to CPU Variables
● Declare GPU Variables
● Allocate memory for GPU variables (use functions like cudaMalloc)
● Initialize data on the CPU
● Copy data from CPU memory to GPU memory (use functions like cudaMemcpy)
● CPU instructs the GPU to execute in parallel (launch the kernel from the CPU, i.e., the CPU tells the GPU to run the kernel)
● Synchronise the host and the device (use cudaDeviceSynchronize)
● Copy results back from GPU memory to CPU memory (use cudaMemcpy)
● Free the device memory (using cudaFree)
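A minimal end-to-end sketch of this workflow (the kernel square and the sizes are illustrative, not from the notes), with each step above labelled in the comments:

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void square(float *d, int n) {               // simple kernel for illustration
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * d[i];
}

int main() {
    const int n = 1024;
    size_t bytes = n * sizeof(float);
    float *h = (float*)malloc(bytes);                    // declare + allocate CPU memory
    for (int i = 0; i < n; ++i) h[i] = (float)i;         // initialize data on the CPU

    float *d;                                            // declare GPU variable
    cudaMalloc((void**)&d, bytes);                       // allocate GPU memory
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);     // copy CPU -> GPU

    square<<<(n + 255) / 256, 256>>>(d, n);              // launch the kernel
    cudaDeviceSynchronize();                             // synchronise host and device

    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);     // copy results GPU -> CPU
    printf("h[3] = %f\n", h[3]);                         // expect 9.0

    cudaFree(d);                                         // free device memory
    free(h);
    return 0;
}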
CUDA Directives: CUDA Directives refer to the special syntax, keywords, or pragmas used in
CUDA (Compute Unified Device Architecture) to guide the GPU on how to execute code in parallel.
__global__ Declares a function (kernel) that runs on the GPU and is called from the CPU.
__device__ Declares a function or variable that runs and is accessible only on the GPU.
__host__ Optional; declares a function to run on the CPU (default behavior).
__shared__ Declares memory that is shared among threads in the same thread block.
__constant__ Declares memory that stays constant during kernel execution, optimized for broadcast.
<<<gridDim, blockDim>>> Launch configuration syntax for kernels. Example: kernel<<<4, 256>>>();
This means launching 4 blocks, each with 256 threads.
blockIdx.x / .y / .z Gives the index of the current block within the grid.
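A small sketch showing a few of these qualifiers together (the names scaleArray, doubleIt, and scale are illustrative, not from the notes); threadIdx gives the index of the current thread within its block:

__constant__ float scale = 2.0f;                 // constant memory, read-only during kernel execution

__device__ float doubleIt(float x) {             // device-only helper function
    return scale * x;
}

__global__ void scaleArray(float *data, int n) { // kernel: runs on the GPU, called from the CPU
    __shared__ float tile[256];                  // shared among threads of one block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tile[threadIdx.x] = data[i];      // stage data in shared memory
    __syncthreads();                             // every thread in the block reaches this point
    if (i < n) data[i] = doubleIt(tile[threadIdx.x]);
}
// Launch: scaleArray<<<4, 256>>>(d_data, n);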
Tiling:
● Tiling is an optimization technique used in parallel computing (especially on GPUs) to manage memory access efficiently and improve data locality by dividing a large problem into smaller, manageable subproblems.
● Instead of processing the entire dataset at once, tiling splits the data into smaller blocks (“tiles”) that fit into fast on-chip memory (cache or shared memory, SMEM), so the data is reused from there instead of being repeatedly fetched from slower global memory (DRAM), reducing memory access latency.
● Why use tiling:
○ Reduce Global Memory Access – Global memory is slow compared to registers and
shared memory. Tiling ensures data is loaded into fast shared memory, reducing
access time
○ Improve Cache Efficiency – Data is reused multiple times, improving performance.
○ Enable Parallel Computation – Each thread block processes a tile independently,
allowing efficient parallelization.
○ Minimize Bank Conflicts – Optimized memory access patterns reduce conflicts in
shared memory.
● Working:
○ Identify a tile of global memory contents that are accessed by multiple threads
○ Load the tile from global memory into on-chip memory
○ Use barrier synchronization to make sure that all threads are ready to start the phase
○ Have the multiple threads access their data from the on-chip memory
○ Use barrier synchronization to make sure that all threads have completed the current
phase
○ Move on to the next tile
● Benefits of Tiled Matrix Multiplication
○ Minimizes global memory accesses → Stores data in fast shared memory
○ Efficient parallel execution → Threads compute independent tiles
○ Memory reuse → Avoids redundant loads
The notes keep only a fragment of the tiled matrix-multiplication kernel (a complete sketch follows below):
#define TILE_WIDTH 16
        __syncthreads();
    }
    // Write result
    C[row * N + col] = temp;
}
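A complete sketch of the tiled kernel, reconstructed around the surviving fragment (assumes row-major storage and that N is a multiple of TILE_WIDTH; boundary checks are omitted for brevity):

#define TILE_WIDTH 16

__global__ void tiledMatMul(const float *A, const float *B, float *C, int N) {
    __shared__ float As[TILE_WIDTH][TILE_WIDTH];     // tiles cached in shared memory
    __shared__ float Bs[TILE_WIDTH][TILE_WIDTH];
    int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    int col = blockIdx.x * TILE_WIDTH + threadIdx.x;
    float temp = 0.0f;
    for (int t = 0; t < N / TILE_WIDTH; ++t) {       // loop over tiles of A and B
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE_WIDTH + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE_WIDTH + threadIdx.y) * N + col];
        __syncthreads();                             // wait until the whole tile is loaded
        for (int k = 0; k < TILE_WIDTH; ++k)
            temp += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                             // wait before overwriting the tile
    }
    // Write result
    C[row * N + col] = temp;
}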
Mod 4
On-Chip Memory: Memory located inside the GPU chip. Extremely fast due to its physical proximity to the compute cores; very low latency, small capacity, used for frequently accessed data.
● Registers.
● L1 Cache / Shared Memory
● Read Only Memory
Off Chip Memory: Memory located outside the GPU chip, typically on the GPU card or system
board. These are much larger, high latency, and need coalesced access.
● Global Memory
● Local Memory / L2 Cache
● Texture Memory - Special memory for graphic operations.
Memory Hierarchy:
Register (256KB) -> L1 Cache (192KB) -> Read Only (64KB) -> L2 Cache (40MB) -> Global (40GB)
Shared Memory:
● Shared memory is a small, fast on-chip memory that is shared among all threads within the
same thread block.
● It is significantly faster than global memory because it resides closer to the cores.
● Accessible only by threads within the same block. Not visible to other blocks.
● Exists for the duration of the block execution.
● Uses:
○ Data reuse across threads.
○ Reducing global memory access.
○ Improving memory coalescing and performance.
● Declaration: __shared__ float tile[32][32];
● Capacity is limited (typically 48KB per Streaming Multiprocessor, varies by architecture).
● Access time is similar to register access speed, but with slightly higher latency.
● Threads must use __syncthreads() to avoid race conditions when reading/writing shared
memory.
Thread Synchronisation:
● Thread synchronization ensures that multiple threads coordinate their execution properly,
especially when sharing data or accessing shared memory to avoid race conditions and
ensure correct results.
● Why is it needed:
○ Threads within a block often share data via __shared__ memory.
○ Without synchronization, one thread might read/write data before others finish
updating it.
○ This can lead to data inconsistency or incorrect output.
● Mechanism (Using CUDA Barriers):
○ In CUDA, a barrier forces all threads in a thread block to wait until every other thread
in that block reaches the same point.
○ All threads in a block must reach this point before any can proceed.
○ Ensures memory consistency across threads.
○ Only works within a single block, not across multiple blocks.
○ Uses __syncthreads();
○ Use it after writing to __shared__ memory and before reading from it.
● Limitations:
○ Cannot synchronize across thread blocks.
○ If not used correctly (e.g., inside conditional branches), it can lead to deadlocks.
● Example:
__global__ void withSync(int *arr) {
    __shared__ int temp[256];
    int tid = threadIdx.x;
    temp[tid] = tid;                      // every thread writes its own slot
    __syncthreads();                      // ensure all threads finish writing before any read
    arr[tid] = temp[(tid + 1) % 256];     // safe read of a neighbour's value (wraps at the end)
}
Memory coalescing
● When multiple threads in a warp access global memory in a continuous and aligned way, the
GPU combines those accesses into one memory transaction.
● In CUDA, global memory is slow, so efficient access is critical. When a warp (32 threads)
accesses memory, coalescing occurs if all threads access consecutive memory addresses.
● Coalesced access = faster memory performance
● Allows the GPU to use DRAM burst: fetches multiple adjacent memory locations in one go
● Threads in a warp must follow aligned, sequential access for coalescing to happen
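A small illustrative sketch (kernel names assumed, not from the notes) contrasting a coalesced and a strided access pattern:

// Coalesced: thread i reads element i, so the 32 threads of a warp touch
// 32 consecutive addresses and the accesses combine into one transaction.
__global__ void coalescedCopy(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: thread i reads element i * 32, so a warp's accesses are scattered
// across memory and cannot be combined, wasting DRAM bursts.
__global__ void stridedCopy(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * 32 < n) out[i] = in[i * 32];
}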
Data Transfer: Pageable vs Pinned (Page-Locked) Host Memory
● Definition: Pageable – default memory that can be paged in and out between RAM and secondary storage by the OS. Pinned – memory that is locked in RAM and cannot be paged out to secondary storage.
● Memory Allocation: Pageable – allocated using standard malloc() or new. Pinned – allocated using cudaHostAlloc() or cudaMallocHost().
● Pageability: Pageable – can be paged out to disk by the OS. Pinned – cannot be paged out; stays resident in physical RAM.
● Transfer Speed (CPU↔GPU): Pageable – slower due to an extra copy via a staging buffer. Pinned – faster, as data can be accessed directly by DMA.
● DMA Support: Pageable – not supported directly; requires staging into pinned memory. Pinned – supported directly; the DMA controller accesses the memory directly.
● Use Case: Pageable – good for general host-side computation. Pinned – ideal for high-performance data transfers.
● Impact on System Memory: Pageable – minimal; the OS can manage memory flexibly. Pinned – reduces the RAM available to the OS and other apps.
● Async Transfer Support: Pageable – limited; typically blocks the host thread. Pinned – fully supports cudaMemcpyAsync() with streams.
● Performance on Small Transfers: Pageable – often better for very small data sizes. Pinned – slightly higher overhead; better for large data.
● Synchronization: Pageable – simpler, but less overlap between compute and transfer. Pinned – enables better overlap via streams and events.
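A short sketch (illustrative sizes) of allocating pinned host memory and using it for an asynchronous, DMA-driven copy:

float *h_pinned, *d_buf;
size_t bytes = 1 << 20;
cudaMallocHost((void**)&h_pinned, bytes);          // pinned (page-locked) host allocation
cudaMalloc((void**)&d_buf, bytes);

cudaStream_t s;
cudaStreamCreate(&s);
cudaMemcpyAsync(d_buf, h_pinned, bytes, cudaMemcpyHostToDevice, s);  // can overlap with other work
cudaStreamSynchronize(s);                          // wait for the copy to finish

cudaStreamDestroy(s);
cudaFreeHost(h_pinned);                            // pinned memory is freed with cudaFreeHost
cudaFree(d_buf);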
CUDA Streams:
● Normally, when we transfer data between the CPU and GPU using cudaMemcpy and then
launch a kernel, these operations happen sequentially — first the data is copied to the GPU,
then the kernel runs, and finally, the results are copied back.
● During each step, the GPU may sit idle, especially while waiting for data transfers, leading to
wasted time.
● CUDA Streams solve this by enabling overlap of operations.
● A stream is like a separate command queue where we can schedule tasks — such as
memory copies and kernel launches — independently of other streams.
● By using multiple streams, we can run memory transfers and kernel executions at the same
time. For example: While data is being copied from the CPU to the GPU in one stream,
another stream can be running a kernel on previously transferred data.
● This overlapping of tasks leads to better GPU utilization, allowing it to do useful work instead
of waiting — which in turn means faster execution and higher performance.
● Types of Streams:
○ Default Stream: Operations are serialized and synchronized with all other streams.
○ Non-Default Stream: Operations are only ordered within the stream and can run concurrently with operations in other streams.
● Behavior:
○ Intra-stream ordering: Operations in the same stream execute in order.
○ Inter-stream independence: Operations in different streams may execute
concurrently, depending on the GPU's ability.
● Synchronization APIs:
○ cudaStreamSynchronize(stream): Waits until all tasks in the stream are finished.
○ cudaDeviceSynchronize(): Waits for all GPU work to complete.
○ cudaStreamWaitEvent(): Makes a stream wait for an event.
○ cudaStreamQuery(): Checks if the stream is done (non-blocking).
● Best Practices:
○ Use asynchronous APIs (cudaMemcpyAsync, etc.) for overlapping.
○ Limit the number of concurrent streams based on your GPU's capabilities.
○ Group independent work into separate streams.
○ Avoid frequent cudaDeviceSynchronize(); use cudaStreamSynchronize() for
more fine-grained control.
Example:
for (int i = 0; i < N; ++i) {
    // Each iteration issues its work into its own stream (assumes streams[i] were created
    // with cudaStreamCreate and the h_ buffers are pinned host memory).
    cudaMemcpyAsync(d_input[i], h_input[i], size, cudaMemcpyHostToDevice, streams[i]);
    kernel<<<grid, block, 0, streams[i]>>>(d_input[i], d_output[i]);
    cudaMemcpyAsync(h_output[i], d_output[i], size, cudaMemcpyDeviceToHost, streams[i]);
}
Here each iteration's copy-in, kernel launch, and copy-out go into their own stream. Operations within one stream stay in order, but while one stream's kernel is running, another stream's memory copy can be in flight, so transfers and computation overlap and overall performance improves.
Data Prefetching:
● Data prefetching in CUDA refers to the technique of moving data to the GPU memory in
advance of when it is needed by a kernel.
● This helps hide memory transfer latency and ensures that the GPU is not idle while waiting for
data.
● In a typical CUDA program:
○ Data is transferred from the CPU (host) to the GPU (device) using cudaMemcpy.
○ Then, a kernel is launched to process the data.
○ Finally, results are copied back to the host.
● If these steps are performed sequentially, the GPU often sits idle during data transfer, which
wastes time and reduces performance.
● Why is it needed:
○ Hide memory latency by overlapping computation and data transfer.
○ Improve memory throughput and reduce stalls.
○ Take advantage of shared memory or L1 cache for faster access.
● Techniques:
○ Manual Prefetching into Shared Memory
■ Threads copy data from global memory to shared memory before using it.
■ Done explicitly inside kernels using __shared__ memory.
○ Using Asynchronous Memory Copy with Streams
■ With cudaMemcpyAsync(), you can overlap data transfer and kernel execution.
■ Ideal for large datasets and multi-buffered processing.
○ Compute-Transfer Overlap
■ While one chunk of data is being processed, the next chunk is prefetched in
parallel using streams.
○ Prefetch APIs (For Unified Memory)
■ cudaMemPrefetchAsync(ptr, size, device, stream) helps the runtime move data
closer to the target device before kernel execution.
■ Useful for systems with unified memory (UM) support.
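A short unified-memory sketch (illustrative sizes; GPU device 0 assumed) using the prefetch API mentioned above:

int device = 0;
size_t bytes = 1 << 20;
float *data;
cudaMallocManaged((void**)&data, bytes);            // unified memory, visible to CPU and GPU
for (size_t i = 0; i < bytes / sizeof(float); ++i) data[i] = 1.0f;   // touch the data on the host

cudaStream_t s;
cudaStreamCreate(&s);
cudaMemPrefetchAsync(data, bytes, device, s);       // move the pages to the GPU ahead of time
// kernel<<<grid, block, 0, s>>>(data, ...);        // the kernel finds its data already resident
cudaStreamSynchronize(s);

cudaFree(data);
cudaStreamDestroy(s);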
Summary (Techniques to Optimise Data Transfer between CPU - GPU / Improve memory
performance of a GPU):
● Pinned Memory
● Thread Synchronisations
● Shared Memory
● Memory Coalescing
● CUDA Streams
● Asynchronous Data Transfers
● Data Prefetching
● Limit transfers – only move data between host and device when really needed
● Unified Memory
● DMA usage instead of making the CPU work for data transfer.
Mod 5
Mod 6
cuBLAS:
● cuBLAS is NVIDIA's GPU-accelerated implementation of the Basic Linear Algebra
Subprograms (BLAS) library.
● It provides optimized routines for performing fundamental linear algebra operations, such as
matrix and vector multiplications, on NVIDIA GPUs.
● Key Features:
○ Optimized Routines: cuBLAS offers highly tuned implementations of BLAS routines,
enabling efficient execution of linear algebra operations on GPUs.
○ Multiple API Versions: The library exposes several APIs, including the cuBLAS API, cuBLASXt API, cuBLASLt API, and cuBLASDx API, each catering to different use cases and providing varying levels of flexibility and performance.
○ Asynchronous Execution: cuBLAS supports asynchronous execution of operations,
allowing for overlapping computation and communication, thereby improving overall
performance.
○ Multi-GPU Support: The cuBLASXt API facilitates distributed computations across multiple GPUs, enabling scalability for large-scale linear algebra tasks.
○ Data Layout Flexibility: The cuBLASLt API introduces a lightweight library dedicated
to General Matrix-to-Matrix Multiply (GEMM) operations, offering flexibility in matrix
data layouts and algorithmic implementations.
● Applications:
○ Deep learning backends (e.g., TensorFlow, PyTorch)
○ Scientific computing
○ High-performance computing (HPC)
○ Financial modeling, simulations
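A minimal sketch of calling cuBLAS for a matrix multiply C = A × B (square n × n matrices; the function and pointer names are illustrative; cuBLAS assumes column-major storage):

#include <cublas_v2.h>
#include <cuda_runtime.h>

void gemmExample(const float *d_A, const float *d_B, float *d_C, int n) {
    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;
    // C = alpha * A * B + beta * C; all matrices are n x n, column-major, leading dimension n.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n, &alpha, d_A, n, d_B, n, &beta, d_C, n);
    cublasDestroy(handle);
}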
cuDNN:
● cuDNN is a GPU-accelerated library developed by NVIDIA, designed to optimize deep
learning operations on NVIDIA GPUs. It provides highly tuned implementations of standard
deep learning primitives such as convolutions, pooling, normalization, and activation functions.
● Key Features:
○ Optimized Deep Learning Primitives: cuDNN offers efficient implementations of
operations like convolution, pooling, softmax, and activation functions, accelerating the
training and inference of deep neural networks.
○ Multi-Operation Fusion: The library supports multi-operation fusion patterns, allowing
for further optimization by combining multiple operations into a single kernel, reducing
memory overhead and improving performance.
○ Flexible Data Layout: Custom data layouts, flexible dimensions for 4D tensor inputs
and outputs to its routines.
○ Multithreading: With the help of Context API it allows for multithreading and
interoperability with CUDA streams, enabling efficient execution.
○ Support for DL Algos: Support for LSTM, RNN, CNN etc.
● Applications:
○ Deep learning training and inference
○ CNNs, RNNs, Transformer models
○ Real-time image/audio processing
cuGraph:
● cuGraph is an accelerated graph analytics library that is part of the RAPIDS suite.
● It provides a collection of graph algorithms and services, enabling efficient analysis of graph
data on NVIDIA GPUs.
● Key Features:
○ GPU-Accelerated Graph Algorithms: cuGraph offers implementations of common
graph algorithms, such as PageRank, breadth-first search (BFS), and degree
centrality, leveraging the parallel processing power of GPUs.
○ Integration with RAPIDS Ecosystem: The library seamlessly integrates with other
RAPIDS libraries, such as cuDF (GPU DataFrame) and cuML (GPU-accelerated
machine learning), facilitating end-to-end data science workflows.
○ Support for Multiple Data Formats: cuGraph supports various graph data formats,
including NetworkX Graphs, cuDF DataFrames, and sparse matrices from CuPy and
SciPy, ensuring compatibility with existing tools and libraries.
○ Multi-GPU Scalability: With integration into the RAPIDS ecosystem, cuGraph can
scale across multiple GPUs, enabling the analysis of large-scale graphs with billions of
edges.
● Applications:
○ Social network analysis
○ Fraud detection
○ Recommendation systems
○ Bioinformatics (e.g., protein interaction graphs)
GPU Clusters:
● GPU clusters are a collection of interconnected GPUs that work together to execute
computationally intensive tasks in parallel.
● Each GPU consists of hundreds or thousands of smaller cores, allowing them to handle a
multitude of tasks simultaneously.
● By leveraging the parallel processing capabilities of GPUs, clusters can process vast amounts
of data at lightning-fast speeds, making them ideal for HPC applications.
● Architecture:
○ Head / Primary Node (Master Node) - Acts as the coordinator and main component.
○ Compute / Secondary Nodes:
■ CPU
■ Heterogeneous vs Homogeneous GPU nodes.
■ RAM
■ Local Storage / Shared Storage
○ High-speed interconnects (e.g., Ethernet) to connect all the nodes in the network
○ Cluster manager system (e.g., Kubernetes, Slurm) to manage how the nodes interact, for example reassigning a failed node's work to another node
○ Job schedulers to manage workloads
● Working:
○ A single GPU might take hours or even days to complete the task. However, by
utilizing a GPU cluster, we can distribute the workload across multiple GPUs,
significantly reducing the processing time.
○ This is achieved by dividing the data into smaller chunks and assigning each chunk to
a different GPU within the cluster.
○ Each GPU then processes its assigned data independently, and the results are
combined to obtain the final output.
○ Example: to add millions of numbers across GPU nodes, each node produces an intermediate sum of the numbers it was assigned, and the master node collects these partial sums and adds them to produce the final result. (A minimal sketch of this pattern is given at the end of this section.)
● Features:
○ Parallelism: Enables training of deep learning models across multiple GPUs/nodes.
○ Scalability: Easily expand by adding more nodes/GPUs.
○ Shared resources: Data and compute can be shared across the cluster.
○ Handling Large Amounts of data
○ Distributed Workload
● Disadvantages:
○ Costly: Expensive to set up and maintain.
○ Complex setup: Requires deep knowledge of networking, drivers, schedulers, etc.
○ Job failures: Debugging across nodes can be difficult.
○ Networking bottlenecks: Data transfer speed can become a limiting factor.
○ Data and Model Synchronisation: Ensuring that all the nodes are synchronised can
be a difficult task.
● Methods:
○ Data Parallelism: Split the data across multiple GPUs; each GPU runs the same model on a different subset of the data. Ex: training a DNN, where the gradients from all the GPUs are accumulated to produce the final result.
○ Model Parallelism: Split the model across multiple GPUs; each GPU handles a different part of the model. Ex: large models like transformers, where different layers can be loaded on different GPUs.
○ Hybrid Parallelism: Combines data + model + pipeline parallelism. Used in very large-scale training setups like GPT and PaLM.
● Use Cases:
○ Financial Modelling:
■ Enables complex financial algorithms such as Monte Carlo simulations, risk analysis, and portfolio predictions to be run easily with the cluster's compute power.
■ With faster calculations, firms can make quicker data-driven decisions in response to market movements.
○ Training Larger Models like GPT:
■ Large transformer-based models with complex architectures and billions of parameters, trained on data from many sources, need a distributed environment to achieve faster and more efficient results.
■ Techniques like the hybrid approach can be used to achieve this.
○ Big Data Analytics:
■ Real time data pipelines where data is aggregated from 100s of sources and
we need to make quicker decisions and update systems in real time.
■ Used for running complex queries on massive datasets, real time analytics with
the business.
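A minimal sketch of the distributed-sum pattern described in the Working subsection above (hypothetical names; assumes one MPI rank per GPU node and an MPI + CUDA build environment): each rank sums its chunk on its GPU, and the master rank combines the partial sums.

#include <mpi.h>
#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Simple (unoptimized) reduction: each thread atomically adds its element.
__global__ void partialSum(const float *data, int n, float *result) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(result, data[i]);
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank; MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int n = 1 << 20;                                  // this node's chunk of the numbers
    float *h = (float*)malloc(n * sizeof(float));
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    float *d, *d_sum, localSum = 0.0f, globalSum = 0.0f;
    cudaMalloc((void**)&d, n * sizeof(float));
    cudaMalloc((void**)&d_sum, sizeof(float));
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_sum, &localSum, sizeof(float), cudaMemcpyHostToDevice);

    partialSum<<<(n + 255) / 256, 256>>>(d, n, d_sum);      // intermediate sum on this node's GPU
    cudaMemcpy(&localSum, d_sum, sizeof(float), cudaMemcpyDeviceToHost);

    // The master node (rank 0) combines the intermediate sums from all nodes.
    MPI_Reduce(&localSum, &globalSum, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("total = %f\n", globalSum);

    cudaFree(d); cudaFree(d_sum); free(h);
    MPI_Finalize();
    return 0;
}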