Message Passing Interface (MPI)
An Interface Specification:
M P I = Message Passing Interface
MPI is a specification for the developers and users of message passing libraries. By itself, it is NOT a library - but rather the
specification of what such a library should be.
MPI primarily addresses the message-passing parallel programming model: data is moved from the address space of one process to
that of another process through cooperative operations on each process.
Simply stated, the goal of the Message Passing Interface is to provide a widely used standard for writing message passing programs.
The interface attempts to be:
practical
portable
efficient
flexible
The MPI standard has gone through a number of revisions, with the most recent version being MPI-3.
Interface specifications have been defined for C and Fortran90 language bindings:
C++ bindings from MPI-1 are removed in MPI-3
MPI-3 also provides support for Fortran 2003 and 2008 features
Actual MPI library implementations differ in which version and features of the MPI standard they support. Developers/users will
need to be aware of this.
Programming Model:
Originally, MPI was designed for distributed memory architectures, which were becoming increasingly popular at that time (1980s - early 1990s).
As architecture trends changed, shared memory SMPs were combined over networks creating hybrid distributed memory / shared
memory systems.
MPI implementors adapted their libraries to handle both types of underlying memory architectures seamlessly. They also
adapted/developed ways of handling different interconnects and protocols.
A summary of LC's MPI environment is provided here, along with links to additional detailed information.
MVAPICH
General Info:
MVAPICH MPI from Ohio State University is the default MPI library on all of LC's Linux clusters.
As of June 2014, LC's default version is MVAPICH 1.2
MPI-1 implementation that includes support for MPI-I/O, but not for MPI one-sided communication.
Based on MPICH-1.2.7 MPI library from Argonne National Laboratory
Not thread-safe. All MPI calls should be made by the master thread in a multi-threaded MPI program.
See /usr/local/docs/mpi.mvapich.basics for LC usage details.
MVAPICH2 is also available on LC Linux clusters
MPI-2 implementation based on MPICH2 MPI library from Argonne National Laboratory
Not currently the default - requires the "use" command to load the selected dotkit - see https://computing.llnl.gov/?set=jobs&page=dotkit for details.
Thread-safe
See /usr/local/docs/mpi.mvapich2.basics for LC usage details.
MVAPICH2 versions 1.9 and later implement MPI-3 according to the developer's documentation.
A code compiled with MVAPICH on one LC Linux cluster should run on any LC Linux cluster.
Clusters with an interconnect - message passing is done in shared memory on-node and over the switch inter-node
Clusters without an interconnect - message passing is done in shared memory
More information:
/usr/local/docs on LC's clusters:
mpi.basics
mpi.mvapich.basics
mpi.mvapich2.basics
MPI build scripts (compiler wrappers) and the underlying compilers they invoke:

C:       mpicc (gcc), mpigcc (gcc), mpiicc (icc), mpipgcc (pgcc)
C++:     mpiCC (g++), mpig++ (g++), mpiicpc (icpc), mpipgCC (pgCC)
Fortran: mpif77 (g77), mpigfortran (gfortran), mpiifort (ifort), mpipgf77 (pgf77), mpipgf90 (pgf90)
Header File:
Required for all programs that make MPI library calls.
C include file:
#include "mpi.h"
Fortran include file:
include 'mpif.h'
With MPI-3 Fortran, the USE mpi_f08 module is preferred over the Fortran include file shown above.
Format of MPI Calls:
C names are case sensitive; Fortran names are not.
Programs must not declare variables or functions with names beginning with the prefix MPI_ or PMPI_ (profiling interface).
C Binding
Format:
rc = MPI_Xxxxx(parameter, ... )
Example:
rc = MPI_Bsend(&buf,count,type,dest,tag,comm)
Fortran Binding
Format:
CALL MPI_XXXXX(parameter,..., ierr)
Example:
CALL MPI_BSEND(buf,count,type,dest,tag,comm,ierr)
Rank:
Within a communicator, every process has its own unique, integer identifier assigned by the system when the process initializes. A
rank is sometimes also called a "task ID". Ranks are contiguous and begin at zero.
Used by the programmer to specify the source and destination of messages. Often used conditionally by the application to control
program execution (if rank=0 do this / if rank=1 do that), as in the short example below.
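A minimal sketch (illustrative, not from the tutorial itself) of rank-based branching in C:

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[]) {
    int rank;
    MPI_Init(&argc, &argv);                  /* initialize the MPI environment         */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* obtain this task's rank (0..ntasks-1)  */

    if (rank == 0)
        printf("Rank 0: doing the coordinator work\n");
    else
        printf("Rank %d: doing worker work\n", rank);

    MPI_Finalize();                          /* shut down the MPI environment          */
    return 0;
}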
Error Handling:
Most MPI routines include a return/error code parameter, as described in the "Format of MPI Calls" section above.
However, according to the MPI standard, the default behavior of an MPI call is to abort if there is an error. This means you will
probably not be able to capture a return/error code other than MPI_SUCCESS (zero).
The standard does provide a means to override this default error handler so that return codes can actually be checked (see the sketch below). You can
also consult the error handling section of the relevant MPI Standard documentation located at http://www.mpi-forum.org/docs/.
The types of errors displayed to the user are implementation dependent.
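A minimal sketch of capturing return codes, assuming the MPI-2 routine MPI_Comm_set_errhandler is available (older codes use MPI_Errhandler_set for the same purpose):

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[]) {
    int rc, value = 0, len;
    char msg[MPI_MAX_ERROR_STRING];

    MPI_Init(&argc, &argv);

    /* Replace the default MPI_ERRORS_ARE_FATAL handler so errors are returned as codes */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    rc = MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);
    if (rc != MPI_SUCCESS) {
        MPI_Error_string(rc, msg, &len);     /* translate the error code into text */
        fprintf(stderr, "MPI_Bcast failed: %s\n", msg);
    }

    MPI_Finalize();
    return 0;
}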
MPI_Comm_size
Returns the total number of MPI processes in the specified communicator, such as MPI_COMM_WORLD. If the communicator is
MPI_COMM_WORLD, then it represents the number of MPI tasks available to your application.
MPI_Comm_size (comm,&size)
MPI_COMM_SIZE (comm,size,ierr)
MPI_Comm_rank
Returns the rank of the calling MPI process within the specified communicator. Initially, each process will be assigned a unique
integer rank between 0 and number of tasks - 1 within the communicator MPI_COMM_WORLD. This rank is often referred to as a
task ID. If a process becomes associated with other communicators, it will have a unique rank within each of these as well.
MPI_Comm_rank (comm,&rank)
MPI_COMM_RANK (comm,rank,ierr)
MPI_Abort
Terminates all MPI processes associated with the communicator. In most MPI implementations it terminates ALL processes
regardless of the communicator specified.
MPI_Abort (comm,errorcode)
MPI_ABORT (comm,errorcode,ierr)
MPI_Get_processor_name
Returns the processor name. Also returns the length of the name. The buffer for "name" must be at least
MPI_MAX_PROCESSOR_NAME characters in size. What is returned into "name" is implementation dependent - may not be the
same as the output of the "hostname" or "host" shell commands.
MPI_Get_processor_name (&name,&resultlength)
MPI_GET_PROCESSOR_NAME (name,resultlength,ierr)
MPI_Get_version
Returns the version and subversion of the MPI standard that's implemented by the library.
MPI_Get_version (&version,&subversion)
MPI_GET_VERSION (version,subversion,ierr)
MPI_Initialized
Indicates whether MPI_Init has been called - returns flag as either logical true (1) or false (0). MPI requires that MPI_Init be called
once and only once by each process. This may pose a problem for modules that want to use MPI and are prepared to call MPI_Init if
necessary. MPI_Initialized solves this problem.
MPI_Initialized (&flag)
MPI_INITIALIZED (flag,ierr)
MPI_Wtime
Returns an elapsed wall clock time in seconds (double precision) on the calling processor.
MPI_Wtime ()
MPI_WTIME ()
MPI_Wtick
Returns the resolution in seconds (double precision) of MPI_Wtime.
MPI_Wtick ()
MPI_WTICK ()
MPI_Finalize
Terminates the MPI execution environment. This function should be the last MPI routine called in every MPI program - no other
MPI routines may be called after it.
MPI_Finalize ()
MPI_FINALIZE (ierr)
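Pulling the environment management routines together, a small self-contained sketch in C (illustrative, not LC-specific code):

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[]) {
    int  ntasks, rank, len, version, subversion;
    char hostname[MPI_MAX_PROCESSOR_NAME];
    double t_start, t_end;

    MPI_Init(&argc, &argv);                              /* must be called exactly once    */
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);              /* how many tasks in the job      */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);                /* this task's rank               */
    MPI_Get_processor_name(hostname, &len);              /* implementation-dependent name  */
    MPI_Get_version(&version, &subversion);              /* MPI standard level             */

    t_start = MPI_Wtime();
    /* ... application work would go here ... */
    t_end = MPI_Wtime();

    printf("Task %d of %d on %s (MPI %d.%d), elapsed %.6f s, tick %.2e s\n",
           rank, ntasks, hostname, version, subversion,
           t_end - t_start, MPI_Wtick());

    MPI_Finalize();                                      /* last MPI call in the program   */
    return 0;
}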
Point-to-point send routines come in several flavors, including blocking and non-blocking sends and receives, synchronous and
buffered sends, combined send/receive, and "ready" send.
Any type of send routine can be paired with any type of receive routine.
MPI also provides several routines associated with send - receive operations, such as those used to wait for a message's arrival or
probe to find out if a message has arrived.
Buffering:
In a perfect world, every send operation would be perfectly synchronized with its matching receive. This is rarely the case.
Somehow or other, the MPI implementation must be able to deal with storing data when the two tasks are out of sync.
Consider the following two cases:
A send operation occurs 5 seconds before the receive is ready - where is the message while the receive is pending?
Multiple sends arrive at the same receiving task which can only accept one send at a time - what happens to the messages that
are "backing up"?
The MPI implementation (not the MPI standard) decides what happens to data in these types of cases. Typically, a system buffer
area is reserved to hold data in transit.
Blocking vs. Non-blocking:
Blocking send and receive routines only "return" after it is safe to modify or use the application buffer.
Non-blocking send and receive routines return almost immediately. They do not wait for any communication events to complete,
such as message copying from user memory to system buffer space or the actual arrival of the message.
Non-blocking operations simply "request" the MPI library to perform the operation when it is able. The user cannot predict
when that will happen.
It is unsafe to modify the application buffer (your variable space) until you know for a fact the requested non-blocking
operation was actually performed by the library. There are "wait" routines used to do this.
Non-blocking communications are primarily used to overlap computation with communication and exploit possible
performance gains, as in the sketch below.
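A minimal non-blocking exchange sketch in C (illustrative): post MPI_Irecv and MPI_Isend for the ring neighbors, do independent work, then call MPI_Waitall before touching the buffers.

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[]) {
    int rank, ntasks, prev, next;
    int sendbuf, recvbuf;
    MPI_Request reqs[2];
    MPI_Status  stats[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    prev = (rank - 1 + ntasks) % ntasks;          /* left neighbor in a ring   */
    next = (rank + 1) % ntasks;                   /* right neighbor in a ring  */
    sendbuf = rank;

    MPI_Irecv(&recvbuf, 1, MPI_INT, prev, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(&sendbuf, 1, MPI_INT, next, 0, MPI_COMM_WORLD, &reqs[1]);

    /* ... computation that does not touch sendbuf/recvbuf can overlap here ... */

    MPI_Waitall(2, reqs, stats);                  /* buffers are safe to use after this */
    printf("Task %d received %d from task %d\n", rank, recvbuf, prev);

    MPI_Finalize();
    return 0;
}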
Order and Fairness:
Order:
MPI guarantees that messages will not overtake each other.
If a sender sends two messages (Message 1 and Message 2) in succession to the same destination, and both match the same
receive, the receive operation will receive Message 1 before Message 2.
If a receiver posts two receives (Receive 1 and Receive 2), in succession, and both are looking for the same message, Receive
1 will receive the message before Receive 2.
Order rules do not apply if there are multiple threads participating in the communication operations.
Fairness:
MPI does not guarantee fairness - it's up to the programmer to prevent "operation starvation".
Example: task 0 sends a message to task 2. However, task 1 sends a competing message that matches task 2's receive. Only
one of the sends will complete.
MPI point-to-point communication routines generally take an argument list in one of the following formats:
Blocking send
MPI_Send(buffer,count,type,dest,tag,comm)
Non-blocking send
MPI_Isend(buffer,count,type,dest,tag,comm,request)
Blocking receive
MPI_Recv(buffer,count,type,source,tag,comm,status)
Non-blocking receive
MPI_Irecv(buffer,count,type,source,tag,comm,request)
Buffer
Program (application) address space that references the data that is to be sent or received. In most cases, this is simply the variable
name that is to be sent/received. For C programs, this argument is passed by reference and usually must be prepended with an
ampersand: &var1
Data Count
Indicates the number of data elements of a particular type to be sent.
Data Type
For reasons of portability, MPI predefines its elementary data types. The tables below list those required by the standard.
C Data Types

MPI_CHAR                     signed char
MPI_WCHAR                    wchar_t
MPI_SHORT                    signed short int
MPI_INT                      signed int
MPI_LONG                     signed long int
MPI_LONG_LONG_INT
MPI_LONG_LONG                signed long long int
MPI_SIGNED_CHAR              signed char
MPI_UNSIGNED_CHAR            unsigned char
MPI_UNSIGNED_SHORT           unsigned short int
MPI_UNSIGNED                 unsigned int
MPI_UNSIGNED_LONG            unsigned long int
MPI_UNSIGNED_LONG_LONG       unsigned long long int
MPI_FLOAT                    float
MPI_DOUBLE                   double
MPI_LONG_DOUBLE              long double
MPI_C_COMPLEX
MPI_C_FLOAT_COMPLEX          float _Complex
MPI_C_DOUBLE_COMPLEX         double _Complex
MPI_C_LONG_DOUBLE_COMPLEX    long double _Complex
MPI_C_BOOL                   _Bool
MPI_INT8_T                   int8_t
MPI_INT16_T                  int16_t
MPI_INT32_T                  int32_t
MPI_INT64_T                  int64_t
MPI_UINT8_T                  uint8_t
MPI_UINT16_T                 uint16_t
MPI_UINT32_T                 uint32_t
MPI_UINT64_T                 uint64_t
MPI_BYTE                     8 binary digits
MPI_PACKED                   data packed or unpacked with MPI_Pack()/MPI_Unpack()

Fortran Data Types

MPI_CHARACTER                character(1)
MPI_INTEGER                  integer
MPI_INTEGER1                 integer*1
MPI_INTEGER2                 integer*2
MPI_INTEGER4                 integer*4
MPI_REAL                     real
MPI_REAL2                    real*2
MPI_REAL4                    real*4
MPI_REAL8                    real*8
MPI_DOUBLE_PRECISION         double precision
MPI_COMPLEX                  complex
MPI_DOUBLE_COMPLEX           double complex
MPI_LOGICAL                  logical
MPI_BYTE                     8 binary digits
MPI_PACKED                   data packed or unpacked with MPI_Pack()/MPI_Unpack()
Notes:
Programmers may also create their own data types (see Derived Data Types).
MPI_Recv
Receive a message and block until the requested data is available in the application buffer in the receiving task.
MPI_Recv (&buf,count,datatype,source,tag,comm,&status)
MPI_RECV (buf,count,datatype,source,tag,comm,status,ierr)
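A minimal blocking send/receive sketch (illustrative; assumes the job is run with at least two tasks):

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[]) {
    int rank, msg;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        msg = 42;
        MPI_Send(&msg, 1, MPI_INT, 1, 99, MPI_COMM_WORLD);           /* dest=1, tag=99             */
    } else if (rank == 1) {
        MPI_Recv(&msg, 1, MPI_INT, 0, 99, MPI_COMM_WORLD, &status);  /* blocks until data arrives  */
        printf("Task 1 received %d from task %d with tag %d\n",
               msg, status.MPI_SOURCE, status.MPI_TAG);
    }

    MPI_Finalize();
    return 0;
}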
MPI_Ssend
Synchronous blocking send: Send a message and block until the application buffer in the sending task is free for reuse and the
destination process has started to receive the message.
MPI_Ssend (&buf,count,datatype,dest,tag,comm)
MPI_SSEND (buf,count,datatype,dest,tag,comm,ierr)
MPI_Bsend
Buffered blocking send: permits the programmer to allocate the required amount of buffer space into which data can be copied until
it is delivered. Insulates against the problems associated with insufficient system buffer space. Routine returns after the data has
been copied from application buffer space to the allocated send buffer. Must be used with the MPI_Buffer_attach routine.
MPI_Bsend (&buf,count,datatype,dest,tag,comm)
MPI_BSEND (buf,count,datatype,dest,tag,comm,ierr)
MPI_Buffer_attach
MPI_Buffer_detach
Used by programmer to allocate/deallocate message buffer space to be used by the MPI_Bsend routine. The size argument is
specified in actual data bytes - not a count of data elements. Only one buffer can be attached to a process at a time.
MPI_Buffer_attach (&buffer,size)
MPI_Buffer_detach (&buffer,size)
MPI_BUFFER_ATTACH (buffer,size,ierr)
MPI_BUFFER_DETACH (buffer,size,ierr)
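A buffered send sketch (illustrative; assumes at least two tasks). Note that MPI_BSEND_OVERHEAD bytes must be added to the payload size when attaching the buffer:

#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"

int main(int argc, char *argv[]) {
    int rank, data = 7, bufsize;
    void *buf;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    bufsize = sizeof(int) + MPI_BSEND_OVERHEAD;        /* payload bytes + required overhead  */
    buf = malloc(bufsize);
    MPI_Buffer_attach(buf, bufsize);                   /* one attached buffer per process    */

    if (rank == 0)
        MPI_Bsend(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* returns once data is copied */
    else if (rank == 1)
        MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);

    MPI_Buffer_detach(&buf, &bufsize);                 /* blocks until buffered data is sent */
    free(buf);
    MPI_Finalize();
    return 0;
}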
MPI_Rsend
Blocking ready send. Should only be used if the programmer is certain that the matching receive has already been posted.
MPI_Rsend (&buf,count,datatype,dest,tag,comm)
MPI_RSEND (buf,count,datatype,dest,tag,comm,ierr)
MPI_Sendrecv
Send a message and post a receive before blocking. Will block until the sending application buffer is free for reuse and until the
receiving application buffer contains the received message.
MPI_Sendrecv (&sendbuf,sendcount,sendtype,dest,sendtag,
...... &recvbuf,recvcount,recvtype,source,recvtag,
...... comm,&status)
MPI_SENDRECV (sendbuf,sendcount,sendtype,dest,sendtag,
...... recvbuf,recvcount,recvtype,source,recvtag,
...... comm,status,ierr)
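A ring-shift sketch using MPI_Sendrecv (illustrative); combining the send and receive avoids the deadlock that can occur when every task issues a blocking send first:

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[]) {
    int rank, ntasks, left, right, sendval, recvval;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    right = (rank + 1) % ntasks;
    left  = (rank - 1 + ntasks) % ntasks;
    sendval = rank;

    /* Send to the right neighbor while receiving from the left neighbor */
    MPI_Sendrecv(&sendval, 1, MPI_INT, right, 0,
                 &recvval, 1, MPI_INT, left,  0,
                 MPI_COMM_WORLD, &status);

    printf("Task %d received %d from task %d\n", rank, recvval, left);
    MPI_Finalize();
    return 0;
}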
MPI_Wait
MPI_Waitany
MPI_Waitall
MPI_Waitsome
MPI_Wait blocks until a specified non-blocking send or receive operation has completed. For multiple non-blocking operations, the
programmer can specify any, all or some completions.
MPI_Wait (&request,&status)
MPI_Waitany (count,&array_of_requests,&index,&status)
MPI_Waitall (count,&array_of_requests,&array_of_statuses)
MPI_Waitsome (incount,&array_of_requests,&outcount,
...... &array_of_offsets, &array_of_statuses)
MPI_WAIT (request,status,ierr)
MPI_WAITANY (count,array_of_requests,index,status,ierr)
MPI_WAITALL (count,array_of_requests,array_of_statuses,
...... ierr)
MPI_WAITSOME (incount,array_of_requests,outcount,
...... array_of_offsets, array_of_statuses,ierr)
MPI_Probe
Performs a blocking test for a message. The "wildcards" MPI_ANY_SOURCE and MPI_ANY_TAG may be used to test for a
message from any source or with any tag. For the C routine, the actual source and tag will be returned in the status structure as
status.MPI_SOURCE and status.MPI_TAG. For the Fortran routine, they will be returned in the integer array status(MPI_SOURCE)
and status(MPI_TAG).
MPI_Probe (source,tag,comm,&status)
MPI_PROBE (source,tag,comm,status,ierr)
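A common pattern (sketched here, assuming at least two tasks) is to probe for a message of unknown size, query its length with MPI_Get_count, and then post a matching receive:

#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"

int main(int argc, char *argv[]) {
    int rank, count;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        int data[5] = {1, 2, 3, 4, 5};
        MPI_Send(data, 5, MPI_INT, 1, 7, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int *recvbuf;
        MPI_Probe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);  /* block until a message is pending   */
        MPI_Get_count(&status, MPI_INT, &count);                          /* number of MPI_INTs in that message */
        recvbuf = malloc(count * sizeof(int));
        MPI_Recv(recvbuf, count, MPI_INT, status.MPI_SOURCE, status.MPI_TAG,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Task 1 received %d ints from task %d\n", count, status.MPI_SOURCE);
        free(recvbuf);
    }

    MPI_Finalize();
    return 0;
}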
MPI_Irecv
Identifies an area in memory to serve as a receive buffer. Processing continues immediately without actually waiting for the message
to be received and copied into the application buffer. A communication request handle is returned for handling the pending
message status. The program must use calls to MPI_Wait or MPI_Test to determine when the non-blocking receive operation
completes and the requested message is available in the application buffer.
MPI_Irecv (&buf,count,datatype,source,tag,comm,&request)
MPI_IRECV (buf,count,datatype,source,tag,comm,request,ierr)
MPI_Issend
Non-blocking synchronous send. Similar to MPI_Isend(), except MPI_Wait() or MPI_Test() indicates when the destination process
has received the message.
MPI_Issend (&buf,count,datatype,dest,tag,comm,&request)
MPI_ISSEND (buf,count,datatype,dest,tag,comm,request,ierr)
MPI_Ibsend
Non-blocking buffered send. Similar to MPI_Bsend() except MPI_Wait() or MPI_Test() indicates when the destination process has
received the message. Must be used with the MPI_Buffer_attach routine.
MPI_Ibsend (&buf,count,datatype,dest,tag,comm,&request)
MPI_IBSEND (buf,count,datatype,dest,tag,comm,request,ierr)
MPI_Irsend
Non-blocking ready send. Similar to MPI_Rsend() except MPI_Wait() or MPI_Test() indicates when the destination process has
received the message. Should only be used if the programmer is certain that the matching receive has already been posted.
MPI_Irsend (&buf,count,datatype,dest,tag,comm,&request)
MPI_IRSEND (buf,count,datatype,dest,tag,comm,request,ierr)
MPI_Test
MPI_Testany
MPI_Testall
MPI_Testsome
MPI_Test checks the status of a specified non-blocking send or receive operation. The "flag" parameter is returned logical true (1) if
the operation has completed, and logical false (0) if not. For multiple non-blocking operations, the programmer can specify any, all
or some completions.
MPI_Test (&request,&flag,&status)
MPI_Testany (count,&array_of_requests,&index,&flag,&status)
MPI_Testall (count,&array_of_requests,&flag,&array_of_statuses)
MPI_Testsome (incount,&array_of_requests,&outcount,
...... &array_of_offsets,&array_of_statuses)
MPI_TEST (request,flag,status,ierr)
MPI_TESTANY (count,array_of_requests,index,flag,status,ierr)
MPI_TESTALL (count,array_of_requests,flag,array_of_statuses,ierr)
MPI_TESTSOME (incount,array_of_requests,outcount,
...... array_of_offsets,array_of_statuses,ierr)
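A polling sketch with MPI_Test (illustrative; assumes at least two tasks): post a non-blocking receive, then test for completion between pieces of other work:

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[]) {
    int rank, value = 0, flag = 0;
    MPI_Request request;
    MPI_Status  status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 123;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Irecv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &request);
        while (!flag) {
            /* ... useful work could be done here between tests ... */
            MPI_Test(&request, &flag, &status);   /* flag becomes true when the receive completes */
        }
        printf("Task 1 eventually received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}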
It is the programmer's responsibility to ensure that all processes within a communicator participate in any collective operations.
Unexpected behavior, including program failure, can occur if even one task in the communicator doesn't participate.
Types of Collective Operations:
Synchronization - processes wait until all members of the
group have reached the synchronization point.
Data Movement - broadcast, scatter/gather, all to all.
Collective Computation (reductions) - one member of the
group collects data from the other members and performs an
operation (min, max, add, multiply, etc.) on that data.
Programming Considerations and Restrictions:
With MPI-3, collective operations can be blocking or nonblocking. Only blocking operations are covered in this tutorial.
Collective communication routines do not take message tag
arguments.
Collective operations within subsets of processes are accomplished by first partitioning the subsets into new groups and then
attaching the new groups to new communicators (discussed in the Group and Communicator Management Routines section).
Can only be used with MPI predefined datatypes - not with MPI Derived Data Types.
MPI-2 extended most collective operations to allow data movement between intercommunicators (not covered here).
MPI_Bcast
Data movement operation. Broadcasts (sends) a message from the process with rank "root" to all other processes in the group.
Diagram Here
MPI_Bcast (&buffer,count,datatype,root,comm)
MPI_BCAST (buffer,count,datatype,root,comm,ierr)
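A broadcast sketch (illustrative): only the root knows the value initially, and every task - including the root - calls MPI_Bcast:

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[]) {
    int rank, nsteps = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        nsteps = 1000;                                   /* only root has the value initially */

    MPI_Bcast(&nsteps, 1, MPI_INT, 0, MPI_COMM_WORLD);   /* all tasks make the same call      */
    printf("Task %d: nsteps = %d\n", rank, nsteps);

    MPI_Finalize();
    return 0;
}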
MPI_Scatter
Data movement operation. Distributes distinct messages from a single source task to each task in the group.
Diagram Here
MPI_Scatter (&sendbuf,sendcnt,sendtype,&recvbuf,
...... recvcnt,recvtype,root,comm)
MPI_SCATTER (sendbuf,sendcnt,sendtype,recvbuf,
...... recvcnt,recvtype,root,comm,ierr)
MPI_Gather
Data movement operation. Gathers distinct messages from each task in the group to a single destination task. This routine is the
reverse operation of MPI_Scatter.
Diagram Here
MPI_Gather (&sendbuf,sendcnt,sendtype,&recvbuf,
...... recvcount,recvtype,root,comm)
MPI_GATHER (sendbuf,sendcnt,sendtype,recvbuf,
...... recvcount,recvtype,root,comm,ierr)
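A scatter/gather sketch (illustrative; NPER is a made-up per-task chunk size): the root scatters one chunk to each task, each task modifies its chunk, and the root gathers the results back in rank order:

#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"

#define NPER 4   /* illustrative number of elements handled by each task */

int main(int argc, char *argv[]) {
    int rank, ntasks, i;
    int chunk[NPER];
    int *full = NULL;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                                    /* root builds the full array */
        full = malloc(ntasks * NPER * sizeof(int));
        for (i = 0; i < ntasks * NPER; i++)
            full[i] = i;
    }

    /* Each task receives its NPER-element slice of the root's array */
    MPI_Scatter(full, NPER, MPI_INT, chunk, NPER, MPI_INT, 0, MPI_COMM_WORLD);

    for (i = 0; i < NPER; i++)                          /* local work on the slice */
        chunk[i] *= 2;

    /* Root collects the modified slices back, ordered by rank */
    MPI_Gather(chunk, NPER, MPI_INT, full, NPER, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        printf("full[0..3] after gather: %d %d %d %d\n", full[0], full[1], full[2], full[3]);
        free(full);
    }

    MPI_Finalize();
    return 0;
}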
MPI_Allgather
Data movement operation. Concatenation of data to all tasks in a group. Each task in the group, in effect, performs a one-to-all
broadcasting operation within the group.
Diagram Here
MPI_Allgather (&sendbuf,sendcount,sendtype,&recvbuf,
...... recvcount,recvtype,comm)
MPI_ALLGATHER (sendbuf,sendcount,sendtype,recvbuf,
...... recvcount,recvtype,comm,ierr)
MPI_Reduce
Collective computation operation. Applies a reduction operation on all tasks in the group and places the result in one task.
Diagram Here
MPI_Reduce (&sendbuf,&recvbuf,count,datatype,op,root,comm)
MPI_REDUCE (sendbuf,recvbuf,count,datatype,op,root,comm,ierr)
The predefined MPI reduction operations appear below. Users can also define their own reduction functions by using the
MPI_Op_create routine.
MPI Reduction Operation      Operation                    C Data Types         Fortran Data Types

MPI_MAX                      maximum                      integer, float       integer, real
MPI_MIN                      minimum                      integer, float       integer, real
MPI_SUM                      sum                          integer, float       integer, real
MPI_PROD                     product                      integer, float       integer, real
MPI_LAND                     logical AND                  integer              logical
MPI_BAND                     bit-wise AND                 integer, MPI_BYTE    integer, MPI_BYTE
MPI_LOR                      logical OR                   integer              logical
MPI_BOR                      bit-wise OR                  integer, MPI_BYTE    integer, MPI_BYTE
MPI_LXOR                     logical XOR                  integer              logical
MPI_BXOR                     bit-wise XOR                 integer, MPI_BYTE    integer, MPI_BYTE
MPI_MAXLOC                   maximum value and location
MPI_MINLOC                   minimum value and location
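A reduction sketch (illustrative): every task contributes its rank and the root receives the sum:

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[]) {
    int rank, ntasks, local, total;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &ntasks);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    local = rank;                                     /* each task's contribution       */
    MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)                                    /* only the root holds the result */
        printf("Sum of ranks 0..%d = %d\n", ntasks - 1, total);

    MPI_Finalize();
    return 0;
}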
MPI_Allreduce
Collective computation operation + data movement. Applies a reduction operation and places the result in all tasks in the group. This
is equivalent to an MPI_Reduce followed by an MPI_Bcast.
Diagram Here
MPI_Allreduce (&sendbuf,&recvbuf,count,datatype,op,comm)
MPI_ALLREDUCE (sendbuf,recvbuf,count,datatype,op,comm,ierr)
MPI_Reduce_scatter
Collective computation operation + data movement. First does an element-wise reduction on a vector across all tasks in the group.
Next, the result vector is split into disjoint segments and distributed across the tasks. This is equivalent to an MPI_Reduce followed
by an MPI_Scatter operation.
Diagram Here
MPI_Reduce_scatter (&sendbuf,&recvbuf,recvcount,datatype,
...... op,comm)
MPI_REDUCE_SCATTER (sendbuf,recvbuf,recvcount,datatype,
...... op,comm,ierr)
MPI_Alltoall
Data movement operation. Each task in a group performs a scatter operation, sending a distinct message to all the tasks in the group
in order by index.
Diagram Here
MPI_Alltoall (&sendbuf,sendcount,sendtype,&recvbuf,
...... recvcnt,recvtype,comm)
MPI_ALLTOALL (sendbuf,sendcount,sendtype,recvbuf,
...... recvcnt,recvtype,comm,ierr)
MPI_Scan
Performs a scan operation with respect to a reduction operation across a task group.
Diagram Here
MPI_Scan (&sendbuf,&recvbuf,count,datatype,op,comm)
MPI_SCAN (sendbuf,recvbuf,count,datatype,op,comm,ierr)
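A scan sketch (illustrative): an inclusive prefix sum, where task i ends up with the sum of the contributions from tasks 0 through i:

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[]) {
    int rank, value, prefix;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    value = rank + 1;                                        /* task i contributes i+1 */
    MPI_Scan(&value, &prefix, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    printf("Task %d: inclusive prefix sum = %d\n", rank, prefix);  /* 1, 3, 6, 10, ... */

    MPI_Finalize();
    return 0;
}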