Intro MPI
Shaohao Chen
Research Computing Services
Information Services and Technology
Boston University
Outline
Amdahl's law: the maximum speedup on p processors is S(p) = 1 / (α + (1 − α)/p), where
p: number of processors/cores,
α: fraction of the program that is serial.
For example, with α = 0.1 the speedup can never exceed 10, no matter how many cores are used.
• Figure from: https://en.wikipedia.org/wiki/Parallel_computing
Distributed and shared memory systems
Figures from the book Using OpenMP: Portable Shared Memory Parallel Programming
MPI Overview
Message Passing Interface (MPI) is a standard for parallel computing on a computer cluster.
MPI is a library. It provides library routines in C, C++, and Fortran.
MPI implementations:
• OpenMPI
• MPICH, MVAPICH, Intel MPI
The first MPI program in C: Hello world!
• Hello world in C
#include <mpi.h>
#include <stdio.h>
int main(int argc, char** argv){
  int my_rank, my_size;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
  MPI_Comm_size(MPI_COMM_WORLD, &my_size);
  printf("Hello from %d of %d.\n", my_rank, my_size);
  MPI_Finalize();
  return 0;
}
The first MPI program in Fortran: Hello world!
• Hello world in Fortran
program hello
include 'mpif.h'
integer my_rank, my_size, errcode
call MPI_INIT(errcode)
call MPI_COMM_RANK(MPI_COMM_WORLD, my_rank, errcode)
call MPI_COMM_SIZE(MPI_COMM_WORLD, my_size, errcode)
print *, 'Hello from ', my_rank, 'of', my_size, '.'
call MPI_FINALIZE(errcode)
end program hello
Basic Syntax
Include the header file: mpi.h for C or mpif.h for Fortran
MPI_INIT: This routine must be the first MPI routine you call (it does not have to be the first statement).
MPI_FINALIZE: This is the companion to MPI_INIT. It must be the last MPI call.
MPI_INIT and MPI_FINALIZE must appear in every MPI code.
MPI_COMM_RANK: Returns the rank of the process. This is the only thing that sets each process apart
from its companions.
MPI_COMM_SIZE: Returns the total number of processes.
MPI_COMM_WORLD: This is a communicator. Use MPI_COMM_WORLD unless you want to enable
communication in complicated patterns.
In Fortran the error code is returned in the last argument, while in C it is returned as the function value.
Compile MPI codes on BU SCC
Note: There is no need to provide a hostfile explicitly. The job scheduler automatically distributes
MPI processes to the requested resources.
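For reference, typical wrapper-compiler and launch commands look like the following (a generic sketch; the exact modules and scheduler options on the SCC may differ):
$ mpicc hello.c -o hello       # compile the C version
$ mpif90 hello.f90 -o hello    # compile the Fortran version
$ mpirun -np 4 ./hello         # launch 4 MPI processes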
Exercise 1: hello world
1) Write an MPI hello-world code in either C or Fortran. Print the MPI ranks
and size on all processes.
2) Compile the hello-world code.
3) Run the MPI hello-world program either in an interactive session or by
submitting a batch job.
Analysis of the output
$ mpirun -np 4 hello
Hello from 1 of 4.
Hello from 2 of 4.
Hello from 0 of 4.
Hello from 3 of 4.
The four lines may appear in any order: the MPI processes run concurrently and independently, so the order in which their output arrives is not deterministic.
Point-to-point communication (1): Send
One process sends a message to another process.
Syntax:
int MPI_Send(void* data, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
data: Initial address of send data.
count: Number of elements to send (nonnegative integer).
datatype: Datatype of the send data.
dest: Rank of the destination process (integer).
tag: Message tag (integer).
comm: Communicator.
Point-to-point communication (2): Receive
One process receives a matching message from another process.
Syntax:
int MPI_Recv (void* data, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm,
MPI_Status* status)
data: Initial address of receive data.
count: Maximum number of elements to receive (integer).
datatype: Datatype of receive data.
source: Rank of source (integer).
tag: Message tag (integer).
comm: Communicator (handle).
status: Status object (status).
A C example: send and receive a number between two processes
int my_rank, numbertoreceive, numbertosend;
MPI_Init(&argc, &argv);
MPI_Status status;
MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
if (my_rank==0){   // process 0 sends the number
numbertosend=36;
MPI_Send( &numbertosend, 1, MPI_INT, 1, 10, MPI_COMM_WORLD);
}
else if (my_rank==1){   // process 1 receives the number
MPI_Recv( &numbertoreceive, 1, MPI_INT, 0, 10, MPI_COMM_WORLD, &status);
printf("Number received is: %d\n", numbertoreceive);
}
MPI_Finalize();
A Fortran example: send and receive a number between two processes
Blocking Receives and Sends
[Figure: blocking send and receive between process 0's memory and process 1's memory. Data A is staged through the send and receive queues, which hold the data only temporarily, before it is stored in the destination process's memory.]
Send and receive in one call: MPI_Sendrecv
Executes a blocking send and a blocking receive in a single call, which helps avoid the deadlocks that can arise from mismatched send/receive ordering.
Syntax:
int MPI_Sendrecv (const void* senddata, int sendcount, MPI_Datatype sendtype, int dest, int sendtag, void*
recvdata, int recvcount, MPI_Datatype recvtype, int source, int recvtag, MPI_Comm comm, MPI_Status* status)
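As a minimal sketch (not from the original slides; to be placed between MPI_Init and MPI_Finalize, with illustrative variable names), each process can pass its rank to its right neighbor with a single MPI_Sendrecv call:
int rank, size, left, right, received;
MPI_Status status;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &size);
right = (rank + 1) % size;          // neighbor to send to
left  = (rank - 1 + size) % size;   // neighbor to receive from
MPI_Sendrecv(&rank, 1, MPI_INT, right, 0,
             &received, 1, MPI_INT, left, 0,
             MPI_COMM_WORLD, &status);
printf("Rank %d received %d from rank %d.\n", rank, received, left);
Because the send and the receive are combined in one call, no manual ordering of MPI_Send and MPI_Recv is needed to avoid deadlock.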
Exercise: circular shift and ring
Write two MPI codes (in C or Fortran) to do the following tasks respectively:
1) Circular shift program: Every process sends its rank to its right neighbor. The
process with the largest rank sends its rank to process 0.
2) Ring program: Pass a value around all processes in a ring-like fashion. The
passing sequence is 0 → 1 → … → N → 0, where N is the largest rank (i.e. the number of
processes minus one).
Hints: Use MPI_Send and MPI_Recv (or MPI_Sendrecv). Make sure every
MPI_Send has a matching MPI_Recv. Be careful to avoid deadlocks.
Synchronization: Barrier
Blocks until all processes in the communicator have reached this routine.
Syntax:
int MPI_Barrier (MPI_Comm comm)
comm: Communicator.
Print in order
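The code for this slide is not reproduced above; a minimal sketch of printing in rank order using MPI_Barrier (assuming rank and size have already been obtained) could be:
int i;
for (i = 0; i < size; i++) {
    if (rank == i) printf("Hello from %d of %d.\n", rank, size);
    MPI_Barrier(MPI_COMM_WORLD);   // let process i finish before the next rank prints
}
Note that even with barriers the terminal output may not be perfectly ordered on every system, since stdout from different processes is forwarded asynchronously.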
Collective communication: Broadcast
The root process sends the same data to all processes in the communicator.
Syntax:
int MPI_Bcast (void * data, int count, MPI_Datatype datatype, int root, MPI_Comm comm)
data: Initial address of the broadcast data.
count: Number of elements of the data (nonnegative integer).
datatype: Datatype of the data.
root: Rank of the root process (integer).
comm: Communicator (handle).
Broadcast can be also enabled by using MPI_Send and MPI_Recv. But MPI_Bcast is more efficient,
because advanced algorithms (such as a binary-tree algorithm) are implemented in it.
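A minimal usage sketch (illustrative; assuming rank has already been obtained):
int value = 0;
if (rank == 0) value = 42;                         // only the root holds the value initially
MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);  // afterwards every process has value == 42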
Collective communication: Reduce
Reduces the values of a variable on all processes to a single value and stores the result on the “root” process.
Syntax:
int MPI_Reduce (const void* send_data, void* recv_data, int count, MPI_Datatype datatype, MPI_Op op,
int root, MPI_Comm comm)
send_data: Initial address of the send data.
recv_data: Initial address of the receive data.
count: Number of elements of the data (nonnegative integer).
datatype: Datatype of the data.
op: Reduction operation
root: Rank of the root process (integer).
comm: Communicator.
Reduction Operations
Common predefined operations include MPI_SUM, MPI_PROD, MPI_MAX, MPI_MIN, MPI_MAXLOC, MPI_MINLOC, MPI_LAND, MPI_LOR, MPI_BAND, and MPI_BOR.
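A minimal usage sketch (illustrative; assuming rank has already been obtained), summing the ranks onto process 0:
int sum = 0;
MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
if (rank == 0) printf("Sum of all ranks: %d\n", sum);   // only the root holds the result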
Non-blocking point-to-point communication
MPI_Isend and MPI_Irecv start a send or receive and return immediately; MPI_Wait blocks until the corresponding operation completes.
Syntax:
int MPI_Isend(void* data, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm,
MPI_Request *request)
int MPI_Irecv (void* data, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm
comm, MPI_Request *request)
int MPI_Wait (MPI_Request* request, MPI_Status* status)
Figure from: Practical MPI Programming, IBM Redbook
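A minimal sketch (not from the original slides) of a deadlock-free exchange between processes 0 and 1 using a non-blocking receive; run with 2 processes, variable names are illustrative:
int other = (rank == 0) ? 1 : 0, sendval = rank, recvval;
MPI_Request request;
MPI_Status status;
MPI_Irecv(&recvval, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &request);   // post the receive first
MPI_Send(&sendval, 1, MPI_INT, other, 0, MPI_COMM_WORLD);              // blocking send can then proceed
MPI_Wait(&request, &status);                                           // wait for the receive to complete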
Collective communication: MPI_Scatter
The root process sends chunks of an array to all processes. Each non-root process receives a chunk of
the array and stores it in its receive buffer. The root process also copies a chunk of the array to its own
receive buffer.
Syntax:
int MPI_Scatter (const void* send_data, int send_count, MPI_Datatype send_datatype, void*
recv_data, int recv_count, MPI_Datatype recv_datatype, int root, MPI_Comm comm)
send_data: The send array that originally resides on the root process.
send_count: Number of elements to be sent to each process (i.e. the chunk size). It is often
(approximately) equal to the size of the array divided by the number of processes.
send_datatype: Datatype of the send data.
recv_data: The receive buffer on all processes.
recv_count: Number of elements that the receive buffer can hold (i.e. the chunk size). It should be
equal to send_count if send_datatype and recv_datatype are the same.
recv_datatype: Datatype of the receive data.
root: The rank of the root process.
Collective communication: MPI_Gather
MPI_Gather is the inverse of MPI_Scatter.
Each non-root process sends a chunk of data to the root process. The root process receives chunks of
data and stores them (including its own chunk) in the receive buffer in the order of MPI ranks.
Syntax:
int MPI_Gather (const void* send_data, int send_count, MPI_Datatype send_datatype, void* recv_data,
int recv_count, MPI_Datatype recv_datatype, int root, MPI_Comm comm)
send_data: The send data on each process.
send_count: Number of elements of the send data (i.e. the chunk size).
send_datatype: Datatype of the send data.
recv_data: The receive buffer on the root process.
recv_count: Number of elements of the receive data (i.e. the chunk size, not the size of the receive
buffer). It should be equal to send_count if send_datatype and recv_datatype are the same.
recv_datatype: Datatype of the receive data.
root: The rank of the root process.
An example for MPI_Scatter and MPI_Gather
Compute the average of all elements in an array.
int rank, nproc, i, m, n=100;
double sub_avg=0., global_avg=0.;
double * array = NULL, * sub_avgs = NULL;
MPI_Init(&argc, &argv);
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &nproc);
m = (int) n/nproc; // chunk size
if(rank==0){ array = (double *) malloc(n*sizeof(double));
for(i=0; i<n; i++) array[i]=(double) i; }
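The rest of the example is not reproduced above; a possible continuation (a sketch, assuming nproc divides n evenly; sub_array is an illustrative name) scatters the chunks, averages each chunk locally, and gathers the partial averages on the root:
double * sub_array = (double *) malloc(m*sizeof(double));
MPI_Scatter(array, m, MPI_DOUBLE, sub_array, m, MPI_DOUBLE, 0, MPI_COMM_WORLD);
for(i=0; i<m; i++) sub_avg += sub_array[i];
sub_avg /= m;                                   // average of the local chunk
if(rank==0) sub_avgs = (double *) malloc(nproc*sizeof(double));
MPI_Gather(&sub_avg, 1, MPI_DOUBLE, sub_avgs, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
if(rank==0){
  for(i=0; i<nproc; i++) global_avg += sub_avgs[i];
  global_avg /= nproc;                          // average of the equal-sized partial averages
  printf("Average = %f\n", global_avg);
}
MPI_Finalize();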
Collective communication: MPI_Allreduce and MPI_Allgather
MPI_Allreduce is the equivalent of doing MPI_Reduce followed by an MPI_Bcast. The root process
obtains the reduced value and broadcasts it to all other processes.
MPI_Allgather is the equivalent of doing MPI_Gather followed by an MPI_Bcast. The root process
gathers the values and broadcasts them to all other processes.
Syntax:
int MPI_Allreduce (const void* send_data, void* recv_data, int count, MPI_Datatype datatype,
MPI_Op op, MPI_Comm comm)
int MPI_Allgather (const void* send_data, int send_count, MPI_Datatype send_datatype, void*
recv_data, int recv_count, MPI_Datatype recv_datatype, MPI_Comm comm)
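A minimal usage sketch (illustrative; assuming rank has already been obtained): with MPI_Allreduce every process obtains the sum of all ranks, without a separate broadcast:
int allsum;
MPI_Allreduce(&rank, &allsum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
// every process now holds allsum == 0 + 1 + ... + (size-1)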
Quiz
What is the result of the following code on 4 processes?
Hints: Break down the code using MPI_Send and MPI_Recv, then analyze how the program
steps forward.
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (rank % 2 == 0) { // Even
MPI_Allreduce(&rank, &evensum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
if (rank == 0) printf("evensum = %d\n", evensum);
} else { // Odd
MPI_Allreduce(&rank, &oddsum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
if (rank == 1) printf("oddsum = %d\n", oddsum);
}
Contiguous datatype
Allows replication of an old datatype into contiguous locations.
Syntax:
int MPI_Type_contiguous (int count, MPI_Datatype oldtype, MPI_Datatype * newtype)
count: replication count (nonnegative integer)
oldtype: old data type
newtype: new data type
Vector datatype
Allows replication of an old datatype into locations that consist of equally spaced blocks.
Each block is a concatenation of the old datatype.
The block length and the stride are fixed.
Syntax:
int MPI_Type_vector ( int count, int blocklength, int stride, MPI_Datatype oldtype,
MPI_Datatype * newtype)
count: number of blocks (nonnegative integer)
blocklength: number of elements in each block (nonnegative integer)
stride: number of elements between start of each block (integer)
oldtype: old data type
newtype: new data type
Indexed datatype
Allows replication of an old datatype into a sequence of blocks.
The block lengths and the strides may be different.
Syntax:
int MPI_Type_indexed ( int count, const int * blocklength, const int * displacements,
MPI_Datatype oldtype, MPI_Datatype * newtype)
count: number of blocks (nonnegative integer)
blocklength: number of elements in each block (array of nonnegative integer)
displacements: displacement of each block in multiples of oldtype (array of integer)
oldtype: old data type
newtype: new data type
A C example for contiguous, vector and indexed datatypes
int i, rank, n=18;
int buffer[18], buffer1[18], buffer2[18], buffer3[18];   // buffers assumed large enough to hold the derived types
int blocklen[3] = { 2, 5, 3 }, disp[3] = { 0, 5, 15 };
MPI_Status status;
MPI_Datatype type1, type2, type3;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Type_contiguous(n, MPI_INT, &type1); MPI_Type_commit(&type1);              // 18 contiguous ints
MPI_Type_vector(3, 4, 7, MPI_INT, &type2); MPI_Type_commit(&type2);            // 3 blocks of 4 ints, stride 7
MPI_Type_indexed(3, blocklen, disp, MPI_INT, &type3); MPI_Type_commit(&type3); // blocks of 2, 5, 3 ints at offsets 0, 5, 15
if (rank == 0){
for (i=0; i<n; i++) buffer[i] = i+1;
MPI_Send(buffer, 1, type1, 1, 101, MPI_COMM_WORLD);
MPI_Send(buffer, 1, type2, 1, 102, MPI_COMM_WORLD);
MPI_Send(buffer, 1, type3, 1, 103, MPI_COMM_WORLD);
} else if (rank == 1) {
MPI_Recv(buffer1, 1, type1, 0, 101, MPI_COMM_WORLD, &status);
MPI_Recv(buffer2, 1, type2, 0, 102, MPI_COMM_WORLD, &status);
MPI_Recv(buffer3, 1, type3, 0, 103, MPI_COMM_WORLD, &status);
}
Struct datatype
Allows each block to consist of replications of different datatypes.
The block lengths, the strides and the old datatypes may be different.
Gives users full control over how data is packed.
Syntax:
int MPI_Type_create_struct ( int count, const int * blocklengths, const MPI_Aint * displacements,
const MPI_Datatype * oldtypes, MPI_Datatype * newtype)
count: number of blocks (nonnegative integer)
blocklengths: number of elements in each block (array of nonnegative integers)
displacements: displacement of each block in bytes (array of integers)
oldtypes: datatype of each block (array of handles)
newtype: new data type
Note: MPI_Type_create_struct replaces the older, deprecated MPI_Type_struct.
Pack size
Returns the upper bound on the amount of space needed to pack a message.
Syntax:
int MPI_Pack_size ( int incount, MPI_Datatype datatype, MPI_Comm comm, int *size)
incount: Count argument to packing call (integer)
datatype: Datatype argument to packing call
comm: Communicator
size: Upper bound on size of packed message, in unit of bytes (integer)
A C example for struct datatype
int psize;
int blocklens[3] = { 2, 5, 3 };
MPI_Aint disp[3] = { 0, 5*sizeof(int), 5*sizeof(int)+10*sizeof(double) };
MPI_Datatype oldtypes[3], newtype;
oldtypes[0] = MPI_INT;      // block 0: 2 integers
oldtypes[1] = MPI_DOUBLE;   // block 1: 5 doubles
oldtypes[2] = MPI_CHAR;     // block 2: 3 characters
MPI_Type_create_struct(3, blocklens, disp, oldtypes, &newtype);
MPI_Type_commit(&newtype);
MPI_Pack_size(1, newtype, MPI_COMM_WORLD, &psize);   // upper bound, in bytes, needed to pack one element of newtype
Analysis:
1. Decompose the grids into sub-grids. Divide both rows and columns. Each process owns
one sub-grid.
2. Define the necessary derived datatypes (e.g. MPI_Type_contiguous and MPI_Type_vector).
3. Pass the necessary data between processes (e.g. using MPI_Send and MPI_Recv). Be careful to
avoid deadlocks.
4. Pass “shared” data between the root process and all other processes (e.g. use MPI_Bcast
and MPI_Reduce).
What is not covered ……