HPC Bankai
One-to-All Broadcast:

In the one-to-all broadcast communication pattern, a single node (the source) sends the same message to all other nodes in the system. One common technique for implementing one-to-all broadcast on a ring network is the recursive doubling algorithm.

Recursive Doubling Algorithm:

The recursive doubling algorithm performs a series of communication steps, doubling the communication distance (and the number of nodes that hold the message) at each step until all nodes have received the broadcast message. Here's how it works on an eight-node ring:

1. Initially, node 0 (the source) has the message to broadcast.

2. At each step, every node that already holds the message sends it to the node offset from it by the current power of 2. For example:
- Step 1: node 0 sends to node 1.
- Step 2: nodes 0 and 1 send to nodes 2 and 3, respectively.
- Step 3: nodes 0, 1, 2, and 3 send to nodes 4, 5, 6, and 7, respectively.

3. This process continues, doubling the offset at each step, so all eight nodes have received the broadcast message after log2(8) = 3 steps.

All-to-One Reduction:

In the all-to-one reduction communication pattern, all nodes send their local data to a single node (the destination), which aggregates the data to produce a final result. For example, in a summation reduction, each node sends its local value, and the destination node computes the sum of all values.
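For illustration, these two patterns map directly onto the standard MPI collectives MPI_Bcast (one-to-all broadcast) and MPI_Reduce (all-to-one reduction); the sketch below uses arbitrary example values.

```cuda
// Minimal MPI sketch: one-to-all broadcast followed by an all-to-one sum
// reduction. Compile with an MPI compiler wrapper (e.g. mpicc/mpicxx).
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int value = 0;
    if (rank == 0) value = 42;              // only the source holds the data

    // One-to-all broadcast: after this call, every rank holds value == 42.
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);

    // All-to-one reduction: rank 0 receives the sum of all local values.
    int local = rank + 1, sum = 0;
    MPI_Reduce(&local, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) printf("broadcast value = %d, reduced sum = %d\n", value, sum);
    MPI_Finalize();
    return 0;
}
```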
b) Blocking and Non-Blocking Communication Using MPI:

Blocking Communication:

In blocking communication, the sender blocks until the message has been successfully sent, and the receiver blocks until the message has been successfully received. This means that the sender and receiver synchronize their execution, and the communication operation completes before the sender or receiver can proceed with further computation.

Non-Blocking Communication:

In non-blocking communication, the sender initiates the communication operation and continues with its computation without waiting for the operation to complete. Similarly, the receiver starts the receive operation and proceeds with its computation. Non-blocking communication allows for overlapping communication and computation, which can lead to improved performance in certain scenarios.
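A minimal sketch of the difference, assuming two ranks exchanging a single integer: the blocking pair uses MPI_Send/MPI_Recv, while the non-blocking pair uses MPI_Isend/MPI_Irecv and can overlap independent work before MPI_Wait.

```cuda
// Sketch: blocking vs. non-blocking point-to-point communication in MPI.
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, data = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    // --- Blocking version: each call returns only when it is safe to proceed.
    if (rank == 0) {
        data = 7;
        MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    // --- Non-blocking version: initiate, do independent work, then wait.
    MPI_Request req;
    if (rank == 0) {
        data = 8;
        MPI_Isend(&data, 1, MPI_INT, 1, 1, MPI_COMM_WORLD, &req);
    } else if (rank == 1) {
        MPI_Irecv(&data, 1, MPI_INT, 0, 1, MPI_COMM_WORLD, &req);
    }
    /* ... overlap: computation that does not touch 'data' goes here ... */
    if (rank <= 1) MPI_Wait(&req, MPI_STATUS_IGNORE);  // complete the operation

    MPI_Finalize();
    return 0;
}
```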
c) Prefix-Sum Operation:

Prefix-sum, also known as scan, is a parallel computation technique used to compute the cumulative sum of a sequence of values. In a prefix-sum operation, each element of the input sequence is combined with the preceding elements to produce a sequence of partial sums. For example, the (inclusive) prefix sum of [3, 1, 4, 1] is [3, 4, 8, 9].

The prefix-sum operation can be implemented efficiently in parallel using techniques such as parallel prefix algorithms, which divide the input sequence into smaller segments and compute partial sums for each segment in parallel.
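MPI exposes this operation directly as MPI_Scan; the short sketch below computes an inclusive prefix sum of the ranks, so rank i ends up with 0 + 1 + ... + i.

```cuda
// Sketch: inclusive prefix sum (scan) across MPI ranks.
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, prefix = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int local = rank;  // each rank contributes its own value
    // After the call, 'prefix' on rank i holds local_0 + local_1 + ... + local_i.
    MPI_Scan(&local, &prefix, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    printf("rank %d: prefix sum = %d\n", rank, prefix);
    MPI_Finalize();
    return 0;
}
```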
d) All-to-All Broadcast Communication Operation:

In the all-to-all broadcast communication operation, each node broadcasts its local data to all other nodes in the system. This communication pattern ensures that every node receives data from every other node.

Step-by-Step Description of All-to-All Broadcast on an Eight-Node Ring:

1. Initial State:
- Each node has its own local data block.
- All nodes initiate the broadcast operation simultaneously.

2. First Communication Step:
- Node 0 sends its data to node 1.
- Node 1 sends its data to node 2.
- Node 2 sends its data to node 3.
- Node 3 sends its data to node 4.
- Node 4 sends its data to node 5.
- Node 5 sends its data to node 6.
- Node 6 sends its data to node 7.
- Node 7 sends its data to node 0.

3. Subsequent Steps and Last Communication Step:
- In each following step, every node forwards the block it received in the previous step to its neighbor.
- After seven such steps, all nodes have received data from every other node, and the broadcast operation is complete.
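In MPI, this pattern corresponds to MPI_Allgather: every rank contributes one block and receives the blocks of all other ranks. A minimal sketch, assuming at most 64 ranks:

```cuda
// Sketch: all-to-all broadcast via MPI_Allgather.
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int mine = 100 + rank;   // this rank's local data block (one int here)
    int all[64];             // receives one block from every rank (size <= 64 assumed)

    // After the call, all[i] on every rank holds the value contributed by rank i.
    MPI_Allgather(&mine, 1, MPI_INT, all, 1, MPI_INT, MPI_COMM_WORLD);

    if (rank == 0)
        for (int i = 0; i < size; ++i) printf("block from rank %d: %d\n", i, all[i]);

    MPI_Finalize();
    return 0;
}
```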
Gather Operation:
- In gather communication, each process sends its local data to a designated root process, which gathers all the data into a single array.
- The root process receives data from all other processes and combines them into a single array.

f) Circular Shift Operation:

A circular shift operation involves shifting the elements of an array cyclically by a certain number of positions. In a circular shift, elements that are shifted out at one end reappear at the other end of the array.

For example, consider an array [1, 2, 3, 4, 5] and perform a circular shift to the right by 2 positions:
- After the shift, the array becomes [4, 5, 1, 2, 3].
- The elements 4 and 5 are shifted out from the right end and reappear at the left end of the array.

Circular shift operations are often used in parallel and distributed computing algorithms to rearrange data or perform data redistribution efficiently.
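Across processes, a circular shift is commonly expressed with MPI_Sendrecv, each rank sending to its right neighbor and receiving from its left neighbor; a minimal sketch with made-up values:

```cuda
// Sketch: circular shift of one value per rank around a ring of MPI processes.
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int right = (rank + 1) % size;          // destination of the shift
    int left  = (rank - 1 + size) % size;   // source of the shift
    int mine = 10 * rank, received = -1;

    // Combined send/receive avoids deadlock: everyone sends right, receives from left.
    MPI_Sendrecv(&mine, 1, MPI_INT, right, 0,
                 &received, 1, MPI_INT, left, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("rank %d now holds the value from rank %d: %d\n", rank, left, received);
    MPI_Finalize();
    return 0;
}
```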
a) Cannon's Algorithm for Matrix-Matrix Multiplication:

1. **Data Distribution**: Divide each input matrix into smaller submatrices and distribute them across the processors in a 2D grid.

2. **Initial Alignment**: Align the submatrices so that each processor starts with a pair of submatrices it can multiply: the block in row i of A is shifted cyclically left by i positions, and the block in column j of B is shifted cyclically up by j positions.

3. **Computation Phase**: Perform local matrix multiplication on each processor using its local submatrices.

4. **Communication Phase**: Shift the submatrices cyclically, A blocks by one position along the rows and B blocks by one position along the columns, to perform data exchanges between processors. This ensures that each processor has the required data to compute the next partial result.

5. **Accumulation**: Repeat the computation and communication phases (q times in total on a q x q processor grid), accumulating the partial results obtained from the local matrix multiplications to compute the final result.

Example:

Let's consider two input matrices A and B, each of size 4x4, and we want to compute the product C = A * B using Cannon's algorithm with a 2x2 processor grid.
Matrix A:
```
|  1  2  3  4 |
|  5  6  7  8 |
| 9 10 11 12 |
| 13 14 15 16 |
```
Matrix B:
```
| 1 0 0 0 |
| 0 1 0 0 |
| 0 0 1 0 |
| 0 0 0 1 |
```
1. Distribute A and B across the processors in a 2x2 grid, one 2x2 block of each matrix per processor.
2. Perform initial alignment.
3. Each processor computes its local submatrix multiplication.
4. Perform data exchanges between processors (cyclic shifts of the A and B blocks).
5. Accumulate the partial results to obtain the final result.

Since B is the identity matrix here, the accumulated result C must equal A, which makes it easy to verify each processor's final block.
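To make the data movement concrete, here is a small serial sketch that simulates Cannon's algorithm with one matrix element per "processor" on a 4x4 grid (1x1 blocks instead of the 2x2 blocks above); it uses the example matrices A and B and only illustrates the shift pattern, not a real distributed implementation.

```cuda
// Serial sketch of Cannon's algorithm with one element per "processor"
// (an N x N grid, where processor (i, j) holds a[i][j], b[i][j], c[i][j]).
#include <cstdio>

const int N = 4;
double a[N][N] = {{1, 2, 3, 4}, {5, 6, 7, 8}, {9, 10, 11, 12}, {13, 14, 15, 16}};
double b[N][N] = {{1, 0, 0, 0}, {0, 1, 0, 0}, {0, 0, 1, 0}, {0, 0, 0, 1}};
double c[N][N];  // accumulates the partial products; starts at zero

// Cyclically shift row i of m to the left by k positions.
void rotateRowLeft(double m[N][N], int i, int k) {
    double t[N];
    for (int j = 0; j < N; ++j) t[j] = m[i][(j + k) % N];
    for (int j = 0; j < N; ++j) m[i][j] = t[j];
}

// Cyclically shift column j of m upward by k positions.
void rotateColUp(double m[N][N], int j, int k) {
    double t[N];
    for (int i = 0; i < N; ++i) t[i] = m[(i + k) % N][j];
    for (int i = 0; i < N; ++i) m[i][j] = t[i];
}

int main() {
    // Initial alignment: row i of A shifts left by i, column j of B shifts up by j.
    for (int i = 0; i < N; ++i) rotateRowLeft(a, i, i);
    for (int j = 0; j < N; ++j) rotateColUp(b, j, j);

    // N compute-and-shift steps: multiply co-resident elements, then shift by one.
    for (int step = 0; step < N; ++step) {
        for (int i = 0; i < N; ++i)
            for (int j = 0; j < N; ++j)
                c[i][j] += a[i][j] * b[i][j];   // local "block" multiplication
        for (int i = 0; i < N; ++i) rotateRowLeft(a, i, 1);
        for (int j = 0; j < N; ++j) rotateColUp(b, j, 1);
    }

    // Because B is the identity, C should equal the original A.
    for (int i = 0; i < N; ++i) {
        for (int j = 0; j < N; ++j) printf("%6.1f", c[i][j]);
        printf("\n");
    }
    return 0;
}
```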
b) Different Performance Metrics for Parallel Systems:

1. **Throughput**: Throughput measures the rate at which a parallel system can process a certain amount of work, typically measured in operations per unit of time.

2. **Scalability**: Scalability assesses how well a parallel system can handle an increasing workload by adding more resources (e.g., processors, nodes). It measures the ability of the system to maintain or improve performance as the workload increases.

3. **Speedup**: Speedup compares the performance of a parallel system with that of a sequential system executing the same task: S = T_serial / T_parallel. It quantifies how much faster the parallel system is compared to the sequential system.

4. **Efficiency**: Efficiency measures the ratio of the speedup achieved by a parallel system to the number of processors used, E = S / p. It indicates how effectively the resources are utilized in achieving parallelism. For example, if a task takes 100 s on one processor and 25 s on 8 processors, the speedup is S = 100 / 25 = 4 and the efficiency is E = 4 / 8 = 0.5.

5. **Latency**: Latency measures the time delay between initiating a request and receiving a response. In parallel systems, latency can affect overall performance, especially in communication-intensive applications.

6. **Overhead**: Overhead refers to the additional time or resources consumed by parallelization-related tasks, such as synchronization, communication, and load balancing. High overhead can degrade the performance of parallel systems.

c) Minimum Execution Time and Minimum Cost Optimal Execution Time:

- **Minimum Execution Time (MET)**: MET refers to the shortest possible time required to execute a task on a parallel system using a given number of processors. It represents the theoretical lower bound on execution time for a given task and parallel system configuration.

- **Minimum Cost Optimal Execution Time (MCOT)**: MCOT refers to the execution time achieved by balancing the trade-off between performance (speedup) and cost (number of processors or resources used); it is the smallest execution time attainable while the parallel formulation remains cost-optimal. It represents the optimal execution time considering both performance and cost constraints.

d) Granularity and Its Effects on Performance of Parallel Systems:

- **Granularity**: Granularity refers to the size of the tasks or units of work in a parallel application. It can range from fine-grained (small tasks) to coarse-grained (large tasks).

Effects of Granularity on Performance:

1. **Communication Overhead**: Fine-grained tasks may lead to increased communication overhead due to frequent data exchanges between processors. Coarse-grained tasks may reduce communication overhead by reducing the frequency of data exchanges.

2. **Load Balance**: Fine-grained tasks make it easier to distribute work evenly, since idle processors can pick up small remaining pieces of work. Coarse-grained tasks may result in load imbalance if some processors finish their large tasks much earlier than others, leading to idle time.

3. **Parallelization Overhead**: Fine-grained tasks may incur higher parallelization overhead, such as thread creation and synchronization, which can reduce overall performance. Coarse-grained tasks may reduce parallelization overhead by minimizing the number of parallelization-related tasks.

4. **Scalability**: The choice of granularity can affect the scalability of a parallel application. Fine-grained tasks may scale well on a large number of processors but may suffer from increased overhead. Coarse-grained tasks may be less affected by overhead but may have limited scalability due to load imbalance.

e) Various Sources of Overhead in Parallel Systems:

1. **Communication Overhead**: Overhead associated with transferring data between processors, including serialization/deserialization, network latency, and synchronization.

2. **Synchronization Overhead**: Overhead incurred by synchronization mechanisms used to coordinate the execution of parallel tasks, such as locks, barriers, and message passing.

3. **Parallelization Overhead**: Overhead related to parallelization, including thread/process creation, context switching, and memory allocation.

4. **Load Balancing Overhead**: Overhead caused by efforts to distribute workload evenly among processors to avoid load imbalance, including dynamic load balancing algorithms and monitoring overhead.

5. **Resource Management Overhead**: Overhead associated with managing system resources, such as memory allocation, task scheduling, and resource contention.

f) Scaling Down (Downsizing) a Parallel System:
Scaling down, also known as downsizing, involves reducing the size or capacity of a parallel system, typically by decreasing the number of processors or resources used. Downsizing may be necessary to optimize resource utilization, reduce costs, or adapt to changes in workload or system requirements.

Example:

Suppose an organization initially deploys a parallel computing cluster with 100 nodes to handle a specific workload efficiently. However, over time, the workload decreases, and it becomes more cost-effective to operate a smaller cluster. In this case, the organization may decide to scale down the parallel system by reducing the number of nodes from 100 to 50. This downsizing strategy allows the organization to save on operational costs while still meeting the reduced workload demands.

a) CUDA (Compute Unified Device Architecture) is a parallel computing platform and application programming interface (API) model created by NVIDIA. It allows developers to utilize the computational power of NVIDIA GPUs for general-purpose processing tasks, rather than just graphics rendering. CUDA supports several programming languages, including:

1. **CUDA C/C++**: This is the primary language for CUDA programming, extending the C/C++ language with special syntax and constructs for GPU programming.

2. **CUDA Fortran**: NVIDIA provides extensions to the Fortran language for GPU programming, allowing Fortran developers to leverage CUDA for parallel computing.

3. **CUDA Python (PyCUDA)**: PyCUDA is a Python library that provides bindings for CUDA, allowing developers to write CUDA code directly within Python scripts.
CUDA is used across a wide range of application domains, for example:

1. **Deep Learning**: CUDA is widely used in deep learning frameworks such as TensorFlow and PyTorch to accelerate training and inference tasks on GPUs, significantly speeding up computation compared to traditional CPU-based approaches.
3. **Computer Vision**: CUDA is utilized in computer vision applications for tasks such as image
processing, object detection, and recognition, enabling real-time performance and enhanced
accuracy by leveraging GPU parallelism.
Execution of a CUDA program typically involves the following stages:

1. **Host Code Execution**: This part of the program runs on the CPU and is responsible for managing data transfers between the CPU and GPU, as well as launching kernels on the GPU.
2. **Kernel Execution**: Kernels are small, parallel functions that execute on the GPU. Each
kernel launch spawns multiple threads, which execute the kernel code in parallel.
3. **Device Memory Access**: Kernels operate on data stored in the GPU's memory (device
memory). During kernel execution, threads access and manipulate data stored in device
memory.
[CPU] --> [Transfer Data to GPU] --> [Launch Kernel] --> [Execute Kernel on GPU] --> [Transfer Results to CPU]
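A minimal host-side sketch of this flow, using a made-up kernel that doubles every element of an array (error checking omitted):

```cuda
#include <cstdio>

// Trivial kernel: each thread doubles one element of the array.
__global__ void doubleAll(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1024;
    float h_data[n];                              // host (CPU) buffer
    for (int i = 0; i < n; ++i) h_data[i] = 1.0f;

    float *d_data;                                // device (GPU) buffer
    cudaMalloc((void **)&d_data, n * sizeof(float));

    // 1. Transfer input data from host to device.
    cudaMemcpy(d_data, h_data, n * sizeof(float), cudaMemcpyHostToDevice);

    // 2. Launch the kernel: 4 blocks of 256 threads cover all 1024 elements.
    doubleAll<<<4, 256>>>(d_data, n);

    // 3. Transfer the results back to the host (this copy also synchronizes).
    cudaMemcpy(h_data, d_data, n * sizeof(float), cudaMemcpyDeviceToHost);

    printf("h_data[0] = %f\n", h_data[0]);        // expect 2.0
    cudaFree(d_data);
    return 0;
}
```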
c) In CUDA, the following terms are commonly used:

- **Device**: Refers to the GPU device (e.g., NVIDIA GPU) used for parallel computation.

- **Host**: Refers to the CPU and its associated memory, where the main program runs.

- **Device Code**: Code written to be executed on the GPU (device) is referred to as device code. It includes CUDA kernels and other GPU-specific functions.

- **Kernel**: A kernel is a function that runs in parallel on the GPU. It is invoked by the host and executed by multiple threads on the device.

d) CUDA Memory Model:

CUDA has several types of memory, including:

- **Global Memory**: Shared by all threads in all blocks and persists for the duration of the application.

- **Shared Memory**: Shared by threads within the same block and is much faster than global memory.

- **Local Memory**: Local to each thread and is stored in global memory.

Thread Hierarchy:

CUDA organizes threads into a hierarchy of grids, blocks, and threads:

- **Grid**: A grid is a collection of blocks that execute the same kernel function. It forms the highest level of the hierarchy.

- **Block**: A block is a collection of threads that can communicate and synchronize with each other via shared memory. Blocks are organized into a grid.

- **Thread**: The smallest unit of execution in CUDA. Threads within the same block can cooperate and synchronize using shared memory.
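To tie the memory types and the thread hierarchy together, here is a small illustrative kernel in which each block reduces a chunk of the input in fast shared memory and writes one partial sum per block to global memory; it assumes a launch with 256 threads per block (a power of two).

```cuda
// Each block sums 256 consecutive elements of 'in' using shared memory,
// then thread 0 of the block writes the block's partial sum to global memory.
__global__ void blockSum(const float *in, float *blockSums, int n) {
    __shared__ float sdata[256];                 // shared memory: one copy per block
    int tid = threadIdx.x;                       // thread index within the block
    int i = blockIdx.x * blockDim.x + tid;       // global index across the grid

    sdata[tid] = (i < n) ? in[i] : 0.0f;         // load from global memory
    __syncthreads();                             // wait for the whole block

    // Tree reduction inside the block (blockDim.x must be a power of two).
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0) blockSums[blockIdx.x] = sdata[0];  // one result per block
}
```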
e) Block Dimension and Grid Dimension:

- **Block Dimension**: Refers to the number of threads per block. It is specified in the form of (x, y, z) dimensions.

- **Grid Dimension**: Refers to the number of blocks in the grid. It is specified in the form of (x, y, z) dimensions.

Example CUDA kernel for addition of two vectors:
```cuda
__global__ void vectorAdd(float *a, float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}
```
In this kernel, each thread calculates the sum of corresponding elements from vectors `a` and `b` and stores the result in vector `c`. The thread index `i` is calculated based on the block and thread dimensions.

f) In CUDA, a kernel is a function that runs in parallel on the GPU. Kernel launch refers to the process of invoking a kernel function from the host CPU to execute on the GPU.

Arguments that can be specified in a kernel launch include:

- **Grid Dimension**: Specifies the number of blocks in the grid.

- **Block Dimension**: Specifies the number of threads per block.

- **Kernel Arguments**: Arguments passed to the kernel function, such as input data arrays and output data arrays.

- **Stream**: Specifies the CUDA stream in which the kernel should execute, allowing for asynchronous execution and overlapping of kernel execution with data transfers and other computations.

- **Dynamic Shared Memory Size**: Optionally specifies the amount of dynamically allocated shared memory (in bytes) to allocate per block for the kernel.

- **Kernel Execution Configuration**: Together, the grid dimension, block dimension, dynamic shared memory size, and stream form the kernel's execution configuration, subject to device-specific limits such as the maximum number of threads per block.
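As an illustration of these arguments, the hypothetical helper below launches the vectorAdd kernel shown above with a full execution configuration; d_a, d_b, and d_c are assumed to be device pointers allocated and initialized elsewhere.

```cuda
// Launch vectorAdd with an explicit execution configuration:
// <<<grid dimension, block dimension, dynamic shared memory bytes, stream>>>.
void launchVectorAdd(float *d_a, float *d_b, float *d_c, int n, cudaStream_t stream) {
    int threadsPerBlock = 256;                                   // block dimension
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;    // grid dimension

    // Kernel arguments (d_a, d_b, d_c, n) are passed in parentheses;
    // this kernel uses no dynamic shared memory, hence the 0.
    vectorAdd<<<blocks, threadsPerBlock, 0, stream>>>(d_a, d_b, d_c, n);
}
```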
a) Odd-even transposition sort is a parallel formulation of bubble sort based on the odd-even sorting network. Elements are compared and swapped in pairs, where the pairs are determined by the elements' positions and by whether the iteration is odd or even: in an odd phase, the element at each odd position (1-indexed) is compare-exchanged with its right neighbor, and in an even phase, the element at each even position is compare-exchanged with its right neighbor. Here's a stepwise example using odd-even transposition:

Consider an array of numbers: [5, 2, 8, 1, 3, 7, 4, 6]

1. **Odd Phase (Iteration 1)**: compare-exchange the pairs at positions (1,2), (3,4), (5,6), (7,8).
After iteration 1: [2, 5, 1, 8, 3, 7, 4, 6]

2. **Even Phase (Iteration 2)**: compare-exchange the pairs at positions (2,3), (4,5), (6,7).
After iteration 2: [2, 1, 5, 3, 8, 4, 7, 6]

3. **Odd Phase (Iteration 3)**: compare-exchange the pairs at positions (1,2), (3,4), (5,6), (7,8).
After iteration 3: [1, 2, 3, 5, 4, 8, 6, 7]

4. **Even Phase (Iteration 4)**: compare-exchange the pairs at positions (2,3), (4,5), (6,7).
After iteration 4: [1, 2, 3, 4, 5, 6, 8, 7]

5. **Odd Phase (Iteration 5)**: compare-exchange the pairs at positions (1,2), (3,4), (5,6), (7,8); only the last pair is out of order.
After iteration 5: [1, 2, 3, 4, 5, 6, 7, 8]

The array is now sorted. In general, n phases are sufficient to sort n elements, and all compare-exchanges within a phase are independent and can be performed in parallel.
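The phases above can be written as two alternating compare-exchange sweeps; in the serial sketch below, the inner loop of each phase is the part a parallel implementation would distribute across threads or processes.

```cuda
#include <cstdio>
#include <utility>   // std::swap

// Odd-even transposition sort: n phases, and the compare-exchanges within a
// phase are independent of one another, so they could run in parallel.
void oddEvenSort(int a[], int n) {
    for (int phase = 0; phase < n; ++phase) {
        // Phases alternate between pairs (0,1),(2,3),... and (1,2),(3,4),...
        int start = (phase % 2 == 0) ? 0 : 1;
        for (int i = start; i + 1 < n; i += 2)      // independent pairs
            if (a[i] > a[i + 1]) std::swap(a[i], a[i + 1]);
    }
}

int main() {
    int a[] = {5, 2, 8, 1, 3, 7, 4, 6};
    oddEvenSort(a, 8);
    for (int x : a) printf("%d ", x);   // prints 1 2 3 4 5 6 7 8
    printf("\n");
    return 0;
}
```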
b) The parallel Depth First Search (DFS) algorithm is used to traverse or search a graph or tree data structure. It operates in a depthward motion, starting at an initial node and exploring as far as possible along each branch before backtracking. Parallel DFS can be implemented using techniques such as parallel processing and task decomposition. Here's an overview:

1. **Initialization**: Each node in the graph is assigned a status (e.g., unvisited, visited, processed). The initial node is marked as unvisited.

2. **Parallel Processing**: Multiple threads or processes are used to explore different branches of the graph simultaneously. Each thread/process starts from a different node and explores its neighbors in parallel.

3. **Task Decomposition**: The graph is decomposed into smaller subgraphs, and each thread/process is assigned a subgraph to explore. Communication between threads/processes may be necessary to synchronize exploration and update the status of nodes.

4. **Traversal**: Each thread/process performs a depth-first traversal of its assigned subgraph, marking nodes as visited and processing them as necessary. Backtracking may occur when all neighbors of a node have been explored.

5. **Termination**: The traversal terminates when all nodes have been visited or processed by the parallel threads/processes.

c) Kubernetes (K8s) is an open-source container orchestration platform that provides a framework for automating the deployment, scaling, and management of containerized applications across clusters of hosts.

Features of Kubernetes include:

- **Container Orchestration**: Kubernetes automates the deployment, scaling, and management of containerized applications, ensuring that they run reliably and efficiently across clusters of hosts.

- **Service Discovery and Load Balancing**: Kubernetes provides built-in mechanisms for service discovery and load balancing, enabling applications to communicate with each other seamlessly and distribute incoming traffic across multiple instances.

- **Automatic Scaling**: Kubernetes can automatically scale applications based on resource usage metrics or custom policies, ensuring that applications have the necessary resources to handle varying workloads.

- **Self-healing**: Kubernetes monitors the health of applications and automatically restarts or reschedules containers that fail or become unresponsive, ensuring high availability and reliability.

Applications of Kubernetes include:

- **Microservices Architecture**: Kubernetes is commonly used to deploy and manage microservices-based applications, allowing developers to break down monolithic applications into smaller, independently deployable services.

- **Continuous Integration/Continuous Deployment (CI/CD)**: Kubernetes integrates with CI/CD pipelines to automate the deployment of containerized applications, enabling rapid and reliable delivery of new features and updates.
- **Hybrid and Multi-cloud Deployments**: Kubernetes provides a consistent platform for deploying and managing applications across on-premises data centers, public cloud environments, and hybrid cloud configurations.

- **Big Data and Machine Learning**: Kubernetes can be used to deploy and manage big data and machine learning workloads, providing scalability, resource isolation, and efficient resource utilization.

d) Short notes:

(i) **Parallel Merge Sort**: Parallel merge sort is a parallelized version of the traditional merge sort algorithm, which divides the input array into smaller subarrays and recursively sorts them in parallel. The sorted subarrays are then merged in parallel to produce the final sorted array. Parallel merge sort can offer significant performance improvements on multi-core processors and parallel computing architectures.

(ii) **GPU Applications**: GPUs (Graphics Processing Units) are increasingly being used for general-purpose parallel computing tasks beyond graphics rendering. GPU applications span various domains, including:

- **Deep Learning and Artificial Intelligence**: GPUs are widely used to accelerate training and inference tasks in deep learning frameworks such as TensorFlow and PyTorch, due to their highly parallel architecture and computational power.

- **Scientific Computing**: GPUs are utilized for accelerating simulations and computations in scientific fields such as physics, chemistry, biology, and climate modeling, enabling researchers to tackle complex problems with greater speed and efficiency.

- **Computer Vision and Image Processing**: GPUs are employed in computer vision applications for tasks such as object detection, image segmentation, and image processing, leveraging their parallel processing capabilities to achieve real-time performance and enhanced accuracy.

- **Data Analytics and Visualization**: GPUs are used for accelerating data analytics and visualization tasks, enabling faster processing of large datasets and interactive visualization of complex data.

e) Issues in sorting on parallel computers include:

- **Load Balancing**: Distributing the workload evenly among processing units can be challenging, especially for irregular or unbalanced data distributions. Uneven workload distribution can lead to inefficient resource utilization and longer execution times.

- **Data Dependency**: Sorting algorithms often involve data dependencies, where the result of one operation depends on the result of another. Managing data dependencies in parallel sorting algorithms can be complex and may require synchronization mechanisms, which can introduce overhead and impact performance.

- **Communication Overhead**: Parallel sorting algorithms may require communication between processing units to exchange data or synchronize computation. High communication overhead can degrade performance, especially in distributed memory architectures or when the data set is large.

- **Scalability**: The scalability of parallel sorting algorithms refers to their ability to efficiently utilize increasing numbers of processing units as the problem size grows. Some parallel sorting algorithms may not scale well with large problem sizes or may exhibit diminishing returns beyond a certain number of processing units.

An example of an issue in sorting on parallel computers is the load balancing problem, where unevenly distributed data leads to some processing units finishing their tasks much earlier than others, resulting in idle resources and an overall slowdown in performance.

f) The parallel Breadth-First Search (BFS) algorithm is used to traverse or search a graph or tree data structure in a breadthward motion, exploring all neighboring nodes at the present depth before moving on to nodes at the next depth level. Parallel BFS can be implemented using techniques such as parallel processing and task decomposition.

In parallel BFS:

- **Initialization**: Each node in the graph is assigned a status (e.g., unvisited, visited, processed). The initial node is marked as visited.

- **Parallel Processing**: Multiple threads or processes are used to explore nodes at the current depth level simultaneously. Each thread/process explores all neighboring nodes of its assigned node.

- **Task Decomposition**: The graph is decomposed into smaller subgraphs, and each thread/process is assigned a subgraph or a subset of nodes to explore. Communication between threads/processes may be necessary to synchronize exploration and update the status of nodes.

- **Traversal**: Each thread/process performs a breadth-first traversal of its assigned subset of nodes, marking neighboring nodes as visited and adding them to a queue for further exploration. The traversal continues until all nodes have been visited.

- **Termination**: The traversal terminates when all nodes have been visited, and the entire graph has been explored by the parallel threads/processes.
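To make the level-by-level structure concrete, here is a small serial sketch of frontier-based (level-synchronous) BFS on an adjacency-list graph; the loop over the current frontier is what a parallel implementation would distribute across threads, synchronizing at the end of each level. The example graph is made up.

```cuda
#include <cstdio>
#include <vector>

// Level-synchronous BFS: process the whole frontier of the current depth,
// then move to the next depth. The loop over 'frontier' is the parallelizable part.
void bfs(const std::vector<std::vector<int>> &adj, int source) {
    std::vector<int> level(adj.size(), -1);   // -1 means "unvisited"
    std::vector<int> frontier = {source};
    level[source] = 0;

    int depth = 0;
    while (!frontier.empty()) {
        std::vector<int> next;
        for (int u : frontier) {              // in parallel BFS, each u goes to a different thread
            for (int v : adj[u]) {
                if (level[v] == -1) {         // not yet visited
                    level[v] = depth + 1;
                    next.push_back(v);        // becomes part of the next frontier
                }
            }
        }
        frontier = next;                      // barrier: finish a level before starting the next
        ++depth;
    }

    for (size_t v = 0; v < adj.size(); ++v)
        printf("node %zu is at depth %d\n", v, level[v]);
}

int main() {
    // Small example graph: edges 0-1, 0-2, 1-3, 2-3, 3-4 (undirected).
    std::vector<std::vector<int>> adj = {
        {1, 2}, {0, 3}, {0, 3}, {1, 2, 4}, {3}
    };
    bfs(adj, 0);
    return 0;
}
```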