HPC Day 11 PPT
High Performance Computing (HPC)
DAY 11 - Topics
• MPI Basics (Continued)
• Blocking vs. Non-blocking
• Setting up an MPI Environment
• Basic Routines
• Send and Receive
• Writing and Running a Simple MPI Program
• Introduction to GPU and GPGPU Programming
• Why GPU?
• GPU vs. CPU
• GPGPU
• Applications of GPGPU Computing
MPI (Continued) and GPGPU Programming
Blocking vs. Non-blocking in MPI
Blocking Communication
Definition:
Blocking communication in MPI refers to the situation where a process (or MPI
rank) waits until a communication operation completes before proceeding to the next
instruction.
Types of Blocking Operations:
Blocking Send (MPI_Send):
• When a process calls MPI_Send, it places data into a send buffer and does not return until that
buffer can safely be reused, i.e. the message has either been received by the destination process
(MPI_Recv at the destination) or buffered internally by the MPI library.
• Execution of the sending process halts until the send operation completes.
• Example:
MPI_Send(send_buffer, count, MPI_DATATYPE, dest_rank, tag, MPI_COMM_WORLD);
Blocking Receive (MPI_Recv):
• When a process calls MPI_Recv, it waits until a matching message has been sent by another
process and is received into its receive buffer.
• Execution of the receiving process halts until the message is available and successfully copied
into its receive buffer.
• Example:
MPI_Recv(recv_buffer, count, MPI_DATATYPE, source_rank, tag, MPI_COMM_WORLD, &status);
Characteristics:
Synchronous from the caller's point of view: the call returns only when the operation is complete,
so communication does not overlap with computation.
Advantages:
Simplicity: program order matches communication order, and no request handles or completion
calls (MPI_Wait/MPI_Test) are needed.
Disadvantages:
Potential Deadlock: If not carefully managed, blocking operations can lead to deadlock
situations where processes wait indefinitely for each other.
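As a sketch of this deadlock risk (not from the slides): if two ranks both send before either posts a receive, and the messages exceed the MPI library's internal buffering, both MPI_Send calls block forever. The message size below is an illustrative assumption.

#include <mpi.h>

#define COUNT (1 << 22)    /* assumed large enough to exceed internal buffering */

int main(int argc, char *argv[]) {
    static double sendbuf[COUNT], recvbuf[COUNT];
    int rank, peer;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    peer = 1 - rank;                              /* run with exactly 2 ranks */

    /* Anti-pattern: both ranks send first, so each MPI_Send can block waiting
       for a receive that is never posted. Reordering the calls on one rank,
       or using MPI_Sendrecv or non-blocking calls, avoids the deadlock. */
    MPI_Send(sendbuf, COUNT, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD);
    MPI_Recv(recvbuf, COUNT, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Finalize();
    return 0;
}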
Non-blocking Communication
Definition:
Non-blocking communication in MPI allows a process to initiate a
communication operation and then continue execution without waiting for the
operation to complete.
Example:
MPI_Irecv(recv_buffer, count, MPI_DATATYPE, source_rank, tag, MPI_COMM_WORLD, &request);
Characteristics:
Asynchronous:
The sender and receiver do not wait for each other; they continue executing other
instructions.
Non-blocking:
Processes can overlap communication with computation, potentially improving
performance by reducing idle time.
Complexity:
Requires careful management of buffers and synchronization to ensure data integrity.
Advantages:
Flexibility:
Can be used to avoid deadlock situations in complex communication patterns.
Disadvantages:
Increased Complexity:
Requires careful handling of communication buffers and completion status (via
MPI_Test or MPI_Wait) to ensure correct synchronization.
Potential for Resource Contention:
Overlapping too many operations can lead to resource contention and decreased
performance if not managed properly.
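A minimal sketch (not from the slides) of overlapping communication with computation using non-blocking calls; the buffer size and the do_local_work() placeholder are illustrative assumptions.

#include <mpi.h>

#define N 1000000

/* Placeholder computation that does not depend on the incoming message. */
static void do_local_work(double *data, int n) {
    for (int i = 0; i < n; i++)
        data[i] *= 2.0;
}

int main(int argc, char *argv[]) {
    static double sendbuf[N], recvbuf[N], local[N];
    int rank, peer;
    MPI_Request reqs[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    peer = 1 - rank;                              /* assumes exactly 2 ranks */

    /* Post both operations, then keep computing while they progress. */
    MPI_Irecv(recvbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[1]);

    do_local_work(local, N);                      /* communication and computation overlap here */

    /* recvbuf may be read and sendbuf reused only after completion. */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    MPI_Finalize();
    return 0;
}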
Choosing Between Blocking and Non-blocking
Communication Pattern: For simple, well-ordered exchanges, blocking calls are usually sufficient;
irregular or many-to-many patterns that risk deadlock favor non-blocking calls.
Performance Requirements: Choose non-blocking communication when overlapping communication
with computation can hide message latency.
Programmer Comfort: Blocking code is easier to write, read, and debug; non-blocking code demands
careful request and buffer management.
Best Practices: Start with blocking calls for correctness, move to non-blocking calls where profiling
shows communication is a bottleneck, and always complete every non-blocking operation with
MPI_Wait/MPI_Waitall or MPI_Test.
Setting up an MPI Environment
Step 1: Choose an MPI Implementation
There are several MPI implementations available, each with its own features and
compatibility:
Open MPI: Widely used, open-source MPI implementation that supports many platforms.
Intel MPI: Optimized for Intel architectures, offering enhanced performance on Intel processors.
MPICH: A high-performance, widely portable open-source MPI implementation.
Step 2: Install MPI
Using a Package Manager (Linux):
For example, on Ubuntu or Debian-based systems, you can install Open MPI with:
sudo apt-get install openmpi-bin libopenmpi-dev
Adjust the package name based on the MPI implementation you choose (mpich, openmpi, etc.).
Manual Installation:
Download the MPI source tarball from the official website of the MPI implementation
(e.g., Open MPI). Extract the tarball and follow the installation instructions provided in the
README or INSTALL file.
On Windows:
Install an MPI distribution that supports Windows, such as MS-MPI (Microsoft MPI) or MPICH.
Follow the installation instructions provided by the MPI distribution for Windows.
Step 3: Set Environment Variables (Linux)
PATH Variable:
Add the MPI binaries to your PATH so that you can execute MPI commands from any directory,
for example (adjust the directory to your installation):
export PATH=/usr/lib/openmpi/bin:$PATH
LD_LIBRARY_PATH (if necessary):
If MPI libraries are not found during execution, add their path to LD_LIBRARY_PATH:
export LD_LIBRARY_PATH=/usr/lib/openmpi/lib:$LD_LIBRARY_PATH
Step 4: Configure SSH (for Cluster Setup)
SSH Setup:
Ensure passwordless SSH access between nodes for MPI processes.
Generate SSH keys (ssh-keygen) and copy them to each node (ssh-copy-id user@hostname).
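For example (usernames and hostnames are placeholders):
ssh-keygen -t rsa              # generate a key pair; an empty passphrase allows non-interactive logins
ssh-copy-id user@node1
ssh-copy-id user@node2
ssh user@node1 hostname        # should succeed without a password prompt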
Step 5: Compile and Run MPI Programs
Use mpicc (for C programs) or mpic++ (for C++ programs) to compile MPI programs.
Example:
mpicc -o my_mpi_program my_mpi_program.c
Run the compiled program with mpiexec (or mpirun), specifying the number of processes:
mpiexec -n 4 ./my_mpi_program
MPI can be configured with additional options depending on your specific requirements:
Hostfile: Specify hosts and their number of slots for MPI processes.
MPI Environment Variables: Adjust parameters such as process binding, error handling, and
debugging options.
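For example (Open MPI syntax; hostnames and slot counts are placeholders), a hostfile named
myhostfile and the corresponding launch command:
node1 slots=4
node2 slots=4
mpiexec --hostfile myhostfile -n 8 ./my_mpi_program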
Debugging and Troubleshooting
Basic Routines
MPI (Message Passing Interface) provides a set of basic routines that enable processes (or MPI ranks)
to communicate and synchronize with each other in parallel computing applications. These routines are
fundamental for developing distributed memory parallel programs. Here’s a detailed study of some of
the basic MPI routines:
1. MPI_Init and MPI_Finalize
MPI_Init:
Purpose: Initializes the MPI execution environment.
Syntax: int MPI_Init(int *argc, char ***argv)
Usage: This routine must be called once at the beginning of every MPI program to initialize MPI.
Arguments: argc is a pointer to the number of command line arguments, and argv is a pointer to the
array of command line arguments (char **argv).
MPI_Finalize:
Purpose: Terminates the MPI execution environment.
Syntax: int MPI_Finalize()
Usage: This routine must be called once at the end of every MPI program to cleanly exit MPI and
release resources.
MPI_Recv:
Usage: Receives count elements of type datatype into buf from process source in
communicator comm.
Arguments: buf is the receive buffer, count is the number of elements, datatype is
the type of elements, source is the rank of the source process, tag is the message tag
(to match with MPI_Send), comm is the communicator, and status is a pointer to an
MPI_Status structure providing status information.
4. MPI_Barrier
MPI_Barrier:
Purpose: Synchronizes all processes in a communicator.
Syntax: int MPI_Barrier(MPI_Comm comm)
Usage: Blocks each process until all processes in the communicator comm have
called MPI_Barrier.
Arguments: comm is the communicator.
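A small illustration (a sketch; the timed work is a placeholder) of a common MPI_Barrier use:
lining up all ranks before timing a phase with MPI_Wtime.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);          /* all ranks start the timed region together */
    double t0 = MPI_Wtime();

    /* ... some parallel work would go here ... */

    MPI_Barrier(MPI_COMM_WORLD);          /* wait until every rank has finished */
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("Elapsed time: %f seconds\n", t1 - t0);

    MPI_Finalize();
    return 0;
}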
Key Considerations:
Communicator (MPI_Comm):
A group of processes that can communicate with each other.
Data Type (MPI_Datatype):
Defines the type of data being sent or received.
Tag:
An integer used to distinguish different types or classes of messages.
MPI_Status:
Provides information about the status of a communication operation.
Example Usage:
Here's a simple example of using MPI to send a message from one process to another:
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (rank == 0) {
        int message = 42;
        MPI_Send(&message, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int message;
        MPI_Recv(&message, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Rank 1 received message: %d\n", message);
    }
    MPI_Finalize();
    return 0;
}
Process 0 sends an integer message (42) to process 1 using MPI_Send.
Process 1 receives the message using MPI_Recv and prints it.
Conclusion
Understanding and effectively utilizing these basic MPI routines is essential for
developing parallel programs that can efficiently communicate and synchronize across
distributed memory systems.
Proper usage ensures correct and efficient parallel execution, while also leveraging
the full potential of MPI's capabilities for high-performance computing applications.
Send and Receive in MPI
In MPI (Message Passing Interface), sending and receiving messages between processes
(or MPI ranks) is fundamental for communication in parallel computing applications.
MPI provides several functions for sending and receiving data, each with specific
characteristics and usage patterns. Here’s a detailed study of the MPI_Send and
MPI_Recv functions, which are the basic mechanisms for point-to-point communication
in MPI:
1. MPI_Send
Purpose: Sends a message from one process to another.
Syntax:
int MPI_Send(const void *buf, int count, MPI_Datatype datatype, int dest, int tag,
MPI_Comm comm)
buf: Pointer to the send buffer containing the data to be sent.
count: Number of data elements to send.
datatype: MPI datatype of the elements in buf.
dest: Rank of the destination process.
tag: Message tag used to match the message with the corresponding MPI_Recv.
comm: Communicator specifying the group of processes involved in the communication.
2. MPI_Recv
Purpose: Receives a message sent by another process.
Syntax:
int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag,
MPI_Comm comm, MPI_Status *status)
buf: Pointer to the receive buffer where the received data will be stored.
count: Maximum number of data elements to receive.
datatype: MPI datatype of the elements to receive.
source: Rank of the source process sending the message.
tag: Message tag to match with the tag used in MPI_Send.
comm: Communicator specifying the group of processes involved in the
communication.
status: Pointer to an MPI_Status structure providing information about the received
message.
Behavior:
The calling process (MPI_Recv caller) blocks until a matching message from the
specified source (source) with the specified tag (tag) is received.
Copies the received message from the MPI system buffer into the receive buffer
(buf).
Notes:
MPI_Recv may block indefinitely until a matching message arrives, depending on
the communication parameters.
Upon completion, status provides information about the received message, such as
source rank, tag, and number of elements received.
Writing and Running a Simple MPI Program
Example Usage:
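A minimal sketch of simple_mpi.c, reconstructed to match the explanation below (the exact message
wording is illustrative):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[]) {
    int rank, size;
    MPI_Init(&argc, &argv);                    /* initialize the MPI environment */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);      /* rank of this process */
    MPI_Comm_size(MPI_COMM_WORLD, &size);      /* total number of processes */

    printf("Hello from process %d of %d\n", rank, size);

    MPI_Finalize();                            /* clean up the MPI environment */
    return 0;
}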
Explanation:
Include MPI Header File (mpi.h):
Provides MPI function prototypes and constants.
Initialize MPI:
MPI_Init(&argc, &argv): Initializes the MPI environment. argc and argv are
command-line arguments passed to the program.
Get Process Rank and Size:
MPI_Comm_rank(MPI_COMM_WORLD, &rank): Retrieves the rank of the
current process (rank) within the communicator MPI_COMM_WORLD.
MPI_Comm_size(MPI_COMM_WORLD, &size): Retrieves the total number of
processes (size) in the communicator MPI_COMM_WORLD.
Print Rank and Size:
Each process prints its rank and the total number of processes (size). This shows
how MPI manages multiple processes concurrently.
Finalize MPI:
MPI_Finalize(): Terminates the MPI environment cleanly. Should be called once
at the end of every MPI program.
Compiling and Running the MPI Program
To compile and run the MPI program (simple_mpi.c in this case):
Compilation:
Assuming you have MPI installed and configured correctly on your system:
mpicc -o simple_mpi simple_mpi.c
Running:
Use mpiexec (or mpirun) to launch the program with the desired number of processes:
mpiexec -n 4 ./simple_mpi
Introduction to GPU and GPGPU Programming
This section introduces GPU (Graphics Processing Unit) and GPGPU (General-Purpose
computing on Graphics Processing Units) programming.
What is a GPU?
A GPU is a specialized processor originally designed for rendering graphics in
computer games and multimedia applications.
It excels in parallel processing tasks due to its architecture, which includes
thousands of smaller cores optimized for performing calculations simultaneously.
Modern GPUs are highly parallel and capable of handling many computations
concurrently, making them suitable for more than just graphics rendering.
Evolution into GPGPU
GPGPU, or General-Purpose computing on Graphics Processing Units, refers to
using GPUs for non-graphics tasks such as scientific simulations, data processing,
machine learning, and more. This shift became possible with the introduction of
programmable shaders and APIs (such as CUDA and OpenCL) that allow developers
to write general-purpose programs (kernels) executed on GPUs.
Key Concepts in GPU and GPGPU Programming
Memory Hierarchy:
Global Memory: Large but slower memory accessible to all threads.
Shared Memory: Fast memory shared among threads within a block (CUDA) or work-
group (OpenCL).
Registers: Fastest memory, private to each thread.
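To make these memory levels concrete, here is a small CUDA sketch (kernel name, sizes, and the
omitted error checking are illustrative assumptions): each block sums 256 elements in fast shared
memory, per-thread values live in registers, and the input/output arrays live in global memory.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define THREADS 256

/* Each block sums THREADS input elements and writes one partial sum. */
__global__ void block_sum(const float *in, float *partial, int n) {
    __shared__ float tile[THREADS];                    /* shared memory: visible to the block */
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (i < n) ? in[i] : 0.0f;                  /* held in a register */
    tile[threadIdx.x] = v;
    __syncthreads();
    for (int stride = THREADS / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        partial[blockIdx.x] = tile[0];                 /* one partial sum per block, in global memory */
}

int main(void) {
    const int n = 1 << 20;
    const int blocks = (n + THREADS - 1) / THREADS;

    float *h_in = (float *)malloc(n * sizeof(float));
    float *h_partial = (float *)malloc(blocks * sizeof(float));
    for (int i = 0; i < n; i++) h_in[i] = 1.0f;

    float *d_in, *d_partial;                           /* buffers in GPU global memory */
    cudaMalloc((void **)&d_in, n * sizeof(float));
    cudaMalloc((void **)&d_partial, blocks * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    block_sum<<<blocks, THREADS>>>(d_in, d_partial, n);
    cudaMemcpy(h_partial, d_partial, blocks * sizeof(float), cudaMemcpyDeviceToHost);

    double total = 0.0;
    for (int b = 0; b < blocks; b++) total += h_partial[b];   /* final reduction on the CPU */
    printf("Sum = %.0f (expected %d)\n", total, n);

    cudaFree(d_in); cudaFree(d_partial);
    free(h_in); free(h_partial);
    return 0;
}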
Why GPU?
Using GPUs (Graphics Processing Units) for computing tasks has become
increasingly popular across various domains due to several key advantages that GPUs offer
over traditional CPUs (Central Processing Units). Here’s a detailed study on why GPUs are
advantageous and when they are beneficial:
1. Parallel Processing Power
Massively Parallel Architecture:
GPUs are designed with hundreds to thousands of smaller processing cores compared
to a CPU's fewer, more powerful cores. This architecture allows GPUs to perform many
computations simultaneously, making them highly efficient for tasks that can be
parallelized.
SIMD (Single Instruction, Multiple Data):
GPUs excel at SIMD operations where the same instruction is applied to multiple
data elements simultaneously. This capability is essential for tasks such as matrix
operations, image processing, and simulations.
2. Performance
High Throughput:
GPUs can process large amounts of data quickly due to their parallel architecture and
high memory bandwidth. This makes them suitable for applications requiring intensive
numerical computations, data processing, and complex algorithms.
Acceleration of Specific Workloads:
Certain workloads, such as scientific simulations, deep learning training, and video
processing, can see significant speedups when executed on GPUs compared to CPUs. GPUs
are particularly effective for tasks involving matrix multiplications, convolutions, and other
linear algebra operations.
3. Energy Efficiency
Performance per Watt:
GPUs typically offer higher performance per watt compared to CPUs for parallelizable
tasks. This efficiency is crucial for applications that require large-scale computing
capabilities while minimizing power consumption and operational costs.
4. General-Purpose Computing (GPGPU):
Beyond graphics, GPUs can execute general-purpose workloads through programming models such as
CUDA and OpenCL; this is covered in detail in the GPGPU section below.
• Conclusion
• The decision to use GPUs depends on the specific requirements of the application,
particularly its ability to parallelize tasks effectively.
For tasks that can benefit from parallelism and require high computational throughput,
GPUs offer significant advantages over CPUs in terms of performance, energy efficiency,
scalability, and versatility. As GPU technology continues to advance, its role in
accelerating diverse computational tasks across various industries will continue to
expand.
GPU vs. CPU
GPU (Graphics Processing Unit)
• Strengths:
• Parallel Processing:
• GPUs excel in tasks that can be divided into many smaller parallel tasks. This includes
graphics rendering, scientific simulations, deep learning training, and other
computations requiring matrix operations and data parallelism.
• Graphics Rendering:
• Originally designed for rendering images and animations in real-time applications,
GPUs are optimized for tasks like rasterization, shading, and texture mapping.
• Energy Efficiency:
• GPUs can achieve higher performance per watt compared to CPUs for parallelizable
tasks, making them efficient for large-scale computations while conserving energy.
GPU vs. CPU
• Programming Model:
• CUDA (Compute Unified Device Architecture):
• NVIDIA's proprietary programming model for GPUs, providing a C-like
environment for developing parallel applications.
• OpenCL (Open Computing Language):
• A cross-platform framework supported by various vendors (NVIDIA, AMD, Intel),
enabling developers to write code that runs on different GPU architectures and
other processors.
CPU (Central Processing Unit)
• Strengths:
• General-Purpose Computing:
• CPUs are versatile and excel at handling tasks that require complex logic,
sequential execution, and task switching.
• System Control:
• CPUs manage system operations, including running operating systems, handling
I/O operations, and executing single-threaded applications efficiently.
• Low Latency Tasks:
• Applications that require low latency and responsiveness, such as real-time
processing, database transactions, and gaming physics calculations, benefit from CPU
processing power.
CPU (Central Processing Unit)
Programming Model:
• Multi-threading:
• CPUs support multi-threading through technologies like Intel Hyper-Threading and AMD SMT
(Simultaneous Multi-Threading), enabling multiple threads to run concurrently on each core.
• APIs and Libraries:
• CPUs are supported by a wide range of programming languages, libraries (e.g., Intel Math Kernel
Library, OpenMP), and APIs for developing efficient serial and multi-threaded applications.
• Comparison and Use Cases:
• Data Parallelism:
• GPUs are ideal for tasks with data parallelism, such as large-scale numerical simulations, image
processing, and machine learning training (e.g., deep neural networks).
• Serial Processing:
• CPUs are better suited for tasks that require single-threaded performance, complex algorithmic logic,
and handling of system-level operations.
• Combined Use:
• Many applications benefit from a hybrid approach, where CPUs manage overall system operations and
delegate compute-intensive tasks to GPUs via APIs like CUDA or OpenCL.
• Example Scenario:
• Video Rendering:
• GPU accelerates rendering of complex graphics and effects in real-time
video games and simulations, while CPU manages game logic, physics
calculations, and AI routines.
• Conclusion:
• Understanding the differences between GPUs and CPUs helps in choosing
the right hardware for specific computing tasks.
• GPUs excel in parallel processing tasks requiring high computational
throughput, while CPUs are versatile and efficient for handling diverse
workloads, managing system operations, and executing single-threaded
applications.
• The choice between GPU and CPU depends on the nature of the
application, its computational requirements, and the level of parallelism it
can exploit. As technology advances, both GPUs and CPUs continue to
evolve, offering enhanced performance and efficiency for a wide range of
computing applications.
GPGPU
GPGPU (General-Purpose computing on Graphics Processing Units) refers to the use of
GPUs (Graphics Processing Units) for performing computations traditionally handled by
CPUs (Central Processing Units). This approach leverages the highly parallel
architecture of GPUs to accelerate a wide range of general-purpose applications beyond
graphics rendering. Here's a detailed study of GPGPU, covering its architecture,
programming models, advantages, and applications:
Architecture of GPUs for GPGPU:
Parallel Processing Units:
CUDA Cores (NVIDIA) / Stream Processors (AMD): GPUs are designed with
hundreds to thousands of smaller processing units called CUDA cores (NVIDIA) or
stream processors (AMD). These cores work in parallel to execute computations
simultaneously, making GPUs highly efficient for tasks with data parallelism.
• Memory Hierarchy:
• Global Memory:
• Large and relatively slow memory accessible to all threads (processors) on the
GPU.
• Shared Memory:
• Fast and low-latency memory shared among threads within a thread block
(CUDA) or work-group (OpenCL). Used for data sharing and synchronization.
• Registers:
• Fastest and smallest memory, private to each thread, used for storing local
variables and intermediate results.
• SIMD (Single Instruction, Multiple Data):
GPUs excel at SIMD operations where a single instruction is applied to multiple data
elements simultaneously. This capability is crucial for tasks such as matrix operations,
image processing, and simulations.
Programming Models for GPGPU:
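To make the CUDA programming model concrete, here is a minimal vector-addition sketch
(names, sizes, and the omitted error checking are illustrative assumptions, not from the slides);
OpenCL expresses the same idea with explicit platform/device setup and kernel enqueueing.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

/* Kernel: each GPU thread adds one pair of elements. */
__global__ void vector_add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        c[i] = a[i] + b[i];
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    float *d_a, *d_b, *d_c;                                    /* buffers in GPU global memory */
    cudaMalloc((void **)&d_a, bytes);
    cudaMalloc((void **)&d_b, bytes);
    cudaMalloc((void **)&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    int threads = 256;                                         /* threads per block */
    int blocks = (n + threads - 1) / threads;                  /* enough blocks to cover n */
    vector_add<<<blocks, threads>>>(d_a, d_b, d_c, n);

    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", h_c[0]);                             /* expected: 3.000000 */

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}

Such a file (e.g. vector_add.cu) would typically be compiled with nvcc: nvcc -o vector_add vector_add.cu.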
• Advantages of GPGPU:
• Parallel Computing Power:
• GPUs are designed for parallelism, allowing them to execute thousands of threads concurrently. This
capability significantly accelerates tasks that can be parallelized, such as scientific simulations, data analytics,
and deep learning.
• Performance Enhancement:
• GPGPU can provide substantial performance improvements over CPU-only computations for tasks involving
large datasets and intensive numerical calculations.
• GPUs offer higher throughput and computational efficiency due to their architecture optimized for parallel
processing.
• Energy Efficiency:
• GPUs often deliver higher performance per watt compared to CPUs for parallelizable tasks. This efficiency is
beneficial for applications requiring large-scale computational power while minimizing energy consumption.
• Versatility:
• GPGPU enables GPUs to be used beyond traditional graphics applications, expanding their role in scientific
research, machine learning, computer vision, financial modeling, and more.
• The flexibility of GPGPU programming models (CUDA, OpenCL) allows developers to harness GPU
capabilities for diverse applications and domains.
Applications of GPGPU:
• Scientific Computing:
• Simulation of physical phenomena (e.g., fluid dynamics, molecular dynamics), computational
chemistry, climate modeling, and numerical simulations benefit from GPU acceleration.
• Conclusion:
• GPGPU technology leverages the parallel processing capabilities of GPUs to accelerate
a wide range of computational tasks beyond traditional graphics rendering.
Understanding the architecture, programming models (CUDA, OpenCL), and
advantages of GPGPU enables developers and researchers to harness GPU power for
high-performance computing applications in scientific research, machine learning, data
analytics, and more. As GPU technology continues to advance, its role in accelerating
complex computations and handling massive datasets across various industries will
continue to expand.
Applications of GPGPU Computing
• Weather Forecasting:
• GPGPU is used in weather prediction models to simulate and predict weather
patterns more accurately and efficiently. This includes simulations of
atmospheric dynamics, ocean currents, and climate change scenarios.