0% found this document useful (0 votes)

47 views53 pages

Message Passing Interface (MPI) : Steve Lantz Center For Advanced Computing Cornell University

Message Passing Interface (MPI) is a standard for message passing parallel programming. It allows processes to communicate by sending and receiving messages. The key points are: 1) MPI is a standard API that allows processes to communicate through message passing. It handles low-level details like networking while providing high-level communication functions. 2) An MPI program typically initializes the MPI environment, queries the number of processes, identifies the current process, allows processes to communicate through sending and receiving messages, and finalizes the MPI environment. 3) On Stampede, MPI programs are compiled with mpicc/mpicxx/mpif90 wrappers and run with mpirun or ibrun. Batch scripts allow running MPI

Uploaded by

Elly Decanes

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

47 views53 pages

Message Passing Interface (MPI) : Steve Lantz Center For Advanced Computing Cornell University

Uploaded by

Elly Decanes

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 53

Message Passing Interface (MPI)

Steve Lantz
Center for Advanced Computing
Cornell University

Workshop: Parallel Computing on Stampede, June 11, 2013

Based on materials developed by CAC and TACC
Overview Outline
• Overview
• Basics
– Hello World in MPI
– Compiling and running MPI programs (LAB)
• MPI messages
• Point-to-point communication
– Deadlock and how to avoid it (LAB)
• Collective communication
– Reduction operations (LAB)
• Releases
• MPI references and documentation

6/11/2013 www.cac.cornell.edu 2
Overview Introduction
• What is message passing?
– Sending and receiving messages between tasks or processes
– Includes performing operations on data in transit and synchronizing
tasks
• Why send messages?
– Clusters have distributed memory, i.e. each process has its own
address space and no way to get at another’s
• How do you send messages?
– Programmer makes use of an Application Programming Interface (API)
that specifies the functionality of high-level communication routines
– Functions give access to a low-level implementation that takes care of
sockets, buffering, data copying, message routing, etc.

6/11/2013 www.cac.cornell.edu 3
Overview API for Distributed Memory Parallelism
• Assumption: processes do not see each other’s memory
• Communication speed is determined by some kind of network
– Typical network = switch + cables + adapters + software stack…
• Key: the implementation of MPI (or any message passing API) can
be optimized for any given network
– Program gets the benefit
– No code changes required
– Works in shared memory, too network

Image of Dell PowerEdge C8220X: http://www.theregister.co.uk/2012/09/19/dell_zeus_c8000_hyperscale_server/

6/11/2013 www.cac.cornell.edu 4
Overview Why Use MPI?
• MPI is a de facto standard
– Public domain versions are easy to install
– Vendor-optimized version are available on most hardware
• MPI is “tried and true”
– MPI-1 was released in 1994, MPI-2 in 1996
• MPI applications can be fairly portable
• MPI is a good way to learn parallel programming
• MPI is expressive: it can be used for many different models of
computation, therefore can be used with many different applications
• MPI code is efficient (though some think of it as the “assembly
language of parallel processing”)
• MPI has freely available implementations (e.g., MPICH)

6/11/2013 www.cac.cornell.edu 5
Basics Simple MPI
Here is the basic outline of a simple MPI program :

• Include the implementation-specific header file --

#include <mpi.h> inserts basic definitions and types
• Initialize communications –
MPI_Init initializes the MPI environment
MPI_Comm_size returns the number of processes
MPI_Comm_rank returns this process’s number (rank)
• Communicate to share data between processes –
MPI_Send sends a message
MPI_Recv receives a message
• Exit from the message-passing system --
MPI_Finalize

6/11/2013 www.cac.cornell.edu 6
Basics Minimal Code Example: hello_mpi.c
• #include <stdio.h>
• #include "mpi.h"
• main(int argc, char **argv)
• {
• char message[20];
• int i, rank, size, tag = 99;
• MPI_Status status;
• MPI_Init(&argc, &argv);
• MPI_Comm_size(MPI_COMM_WORLD, &size);
• MPI_Comm_rank(MPI_COMM_WORLD, &rank);
• if (rank == 0) {
• strcpy(message, "Hello, world!");
• for (i = 1; i < size; i++)
• MPI_Send(message, 13, MPI_CHAR, i, tag, MPI_COMM_WORLD);
• } else
• MPI_Recv(message, 20, MPI_CHAR, 0, tag, MPI_COMM_WORLD, &status);
• printf("Message from process %d : %.13s\n", rank, message);
• MPI_Finalize();
• }

6/11/2013 www.cac.cornell.edu 7
Basics Initialize and Close Environment
• #include <stdio.h>
• #include "mpi.h" Initialize MPI environment
• main(int argc, char **argv) An implementation may also
• { use this call as a mechanism
• char message[20]; for making the usual argc and
• int i, rank, size, tag = 99; argv command-line arguments
• MPI_Status status;
from “main” available to all
• MPI_Init(&argc, &argv);
• MPI_Comm_size(MPI_COMM_WORLD, &size); tasks (C language only).
• MPI_Comm_rank(MPI_COMM_WORLD, &rank);
• if (rank == 0) {
• strcpy(message, "Hello, world!"); Close MPI environment
• for (i = 1; i < size; i++)
• MPI_Send(message, 13, MPI_CHAR, i, tag, MPI_COMM_WORLD);
• } else
• MPI_Recv(message, 20, MPI_CHAR, 0, tag, MPI_COMM_WORLD, &status);
• printf("Message from process %d : %.13s\n", rank, message);
• MPI_Finalize();
• }

6/11/2013 www.cac.cornell.edu 8
Basics Query Environment
• #include <stdio.h> Returns number of processes
• #include "mpi.h"
This, like nearly all other MPI
• main(int argc, char **argv)
• {
functions, must be called after
• char message[20]; MPI_Init and before MPI_Finalize.
• int i, rank, size, tag = 99; Input is the name of a communicator
• MPI_Status status; (MPI_COMM_WORLD is the global
• MPI_Init(&argc, &argv); communicator) and output is the size
• MPI_Comm_size(MPI_COMM_WORLD, &size); of that communicator.
• MPI_Comm_rank(MPI_COMM_WORLD, &rank);
• if (rank == 0) {
• strcpy(message, "Hello, world!");
• for (i = 1; i < size; i++)
• MPI_Send(message, 13, MPI_CHAR, i, tag, MPI_COMM_WORLD);
• } else
Returns this process’ number, or rank
• MPI_Recv(message, 20, MPI_CHAR, 0, tag, MPI_COMM_WORLD, &status);
• Inputrank,
printf("Message from process %d : %.13s\n", is again the name of a
message);
• MPI_Finalize(); communicator and the output is the rank
• } of this process in that communicator.

6/11/2013 www.cac.cornell.edu 9
Basics Pass Messages
• #include <stdio.h>
• #include "mpi.h"
• main(int argc, char **argv)
Send a message
• { Blocking send of data in the buffer.
• char message[20];
• int i, rank, size, tag = 99; Receive a message
• MPI_Status status;
Blocking receive of data into the buffer.
• MPI_Init(&argc, &argv);
• MPI_Comm_size(MPI_COMM_WORLD, &size);
• MPI_Comm_rank(MPI_COMM_WORLD, &rank);
• if (rank == 0) {
• strcpy(message, "Hello, world!");
• for (i = 1; i < size; i++)
• MPI_Send(message, 13, MPI_CHAR, i, tag, MPI_COMM_WORLD);
• } else
• MPI_Recv(message, 20, MPI_CHAR, 0, tag, MPI_COMM_WORLD, &status);
• printf("Message from process %d : %.13s\n", rank, message);
• MPI_Finalize();
• }

6/11/2013 www.cac.cornell.edu 10
Basics Compiling MPI Programs
• Generally, one uses a special compiler or wrapper script
– Not defined by the standard
– Consult your implementation
– Correctly handles include path, library path, and libraries
• On Stampede, use MPICH-style wrappers (the most common)
mpicc -o foo foo.c
mpicxx -o foo foo.cc
mpif90 -o foo foo.f (also mpif77)
– Choose compiler+MPI with “module load” (default, Intel13+MVAPICH2)
• Some MPI-specific compiler options
-mpilog -- Generate log files of MPI calls
-mpitrace -- Trace execution of MPI calls
-mpianim -- Real-time animation of MPI (not available on all systems)

6/11/2013 www.cac.cornell.edu 11
Basics Running MPI Programs
• To run a simple MPI program, use MPICH-style commands
mpirun -n 4 ./foo (usually mpirun is just a soft link to…)
mpiexec -n 4 ./foo
• Some options for running
-n -- states the number of MPI processes to launch
-wdir <dirname> -- starts in the given working directory
--help -- shows all options for mpirun
• To run over Stampede’s InfiniBand (as part of a batch script)
ibrun ./foo
– The scheduler handles the rest
• Note: mpirun, mpiexec, and compiler wrappers are not part of MPI,
but they can be found in nearly all implementations
– There are exceptions: e.g., on older IBM systems, one uses poe to run,
mpcc_r and mpxlf_r to compile

6/11/2013 www.cac.cornell.edu 12
Basics Creating an MPI Batch Script
• To submit a job to the compute nodes on Stampede, you must first
create a SLURM batch script with the commands you want to run.
#!/bin/bash
#SBATCH -J myMPI # job name
#SBATCH -o myMPI.o%j # output/error file (%j = jobID)
#SBATCH -N 1 # number of nodes requested
#SBATCH -n 16 # number of MPI tasks requested
#SBATCH -p development # queue (partition)
#SBATCH -t 00:01:00 # run time (hh:mm:ss)
#SBATCH -A TG-TRA120006 # account number

echo 2000 > input

ibrun ./myprog < input # run MPI executable "myprog"

6/11/2013 www.cac.cornell.edu 13
Basics LAB: Submitting MPI Programs
• Obtain the hello_mpi.c source code via copy-and-paste, or by
tar xvf ~tg459572/LABS/IntroMPI_lab.tar
cd IntroMPI_lab/hello

• Compile the code using mpicc to output the executable hello_mpi

• Modify the myMPI.sh batch script to run hello_mpi
• Submit the batch script to SLURM, the batch scheduler
– Check on progress until the job completes
– Examine the output file
sbatch myMPI.sh
squeue -u <my_username>
less myMPI.o*

6/11/2013 www.cac.cornell.edu 14
Messages Three Parameters Describe the Data

MPI_Send( message, 13, MPI_CHAR, i, type, MPI_COMM_WORLD );

MPI_Recv( message, 20, MPI_CHAR, 0, type, MPI_COMM_WORLD, &status);

Type of data, should be same

for send and receive
MPI_Datatype type

Number of elements (items, not bytes)

Recv number should be greater than or
equal to amount sent
int count
Address where the data start
void* data

6/11/2013 www.cac.cornell.edu 15
Messages Three Parameters Specify Routing
MPI_Send( message, 13, MPI_CHAR, i, type, MPI_COMM_WORLD );

MPI_Recv( message, 20, MPI_CHAR, 0, type, MPI_COMM_WORLD, &status);

Identify process you’re

communicating with by rank number
int dest/src

Arbitrary tag number, must match up

(receiver can specify MPI_ANY_TAG to
indicate that any tag is acceptable)
int tag

Communicator specified for send and Returns information

receive must match, no wildcards on received message
MPI_Comm comm MPI_Status* status

6/11/2013 www.cac.cornell.edu 16
Messages Fortran Notes

mpi_send (data, count, type, dest, tag, comm, ierr)

mpi_recv (data, count, type, src, tag, comm, status, ierr)

• A few Fortran particulars

– All Fortran arguments are passed by reference
– INTEGER ierr: variable to store the error code (in C/C++ this is the
return value of the function call)
• Wildcards are allowed
– src can be the wildcard MPI_ANY_SOURCE
– tag can be the wildcard MPI_ANY_TAG
– status returns information on the source and tag, useful in conjunction
with the above wildcards (receiving only)

6/11/2013 www.cac.cornell.edu 17
Point to Point Topics
• MPI_Send and MPI_Recv
• Synchronous vs. buffered (asynchronous) communication
• Blocking send and receive
• Non-blocking send and receive
• Combined send/receive
• Deadlock, and how to avoid it

6/11/2013 www.cac.cornell.edu 18
Point to Point Send and Recv: Simple

CPU 1 CPU 2
Process 0 Process 1
send receive
data data

• Sending data from one point (process/task)

to another point (process/task)
• One task sends while another receives
• But process 0 may need to wait until Process 1 is ready?…
• MPI provides different communication modes in order to help
6/11/2013 www.cac.cornell.edu 19
Point to Point Buffered Send, MPI_Bsend

CPU 1 CPU 2
Process 0 Process 1
send receive
data data

system buffer
data

• Message contents are sent to a system-controlled block of memory

• Process 0 continues executing other tasks; when process 1 is ready
to receive, the system simply copies the message from the system
buffer into the appropriate memory location controlled by process
• Must be preceded with a call to MPI_Buffer_attach

6/11/2013 www.cac.cornell.edu 20
Point to Point Send and Recv: So Many Choices
The communication mode indicates how the message should be sent.

Communication Blocking Routines Non-Blocking Routines

Mode
Synchronous MPI_Ssend MPI_Issend
Ready MPI_Rsend MPI_Irsend
Buffered MPI_Bsend MPI_Ibsend
Standard MPI_Send MPI_Isend
MPI_Recv MPI_Irecv
MPI_Sendrecv
MPI_Sendrecv_replace

Note: the receive routine does not specify the communication mode -- it
is simply blocking or non-blocking.
6/11/2013 www.cac.cornell.edu 21
Point to Point Overhead
• System overhead
Cost of transferring data from the sender’s message buffer onto the
network, then from the network into the receiver’s message buffer.
– Buffered send has more system overhead due to the extra buffer copy.
• Synchronization overhead
Time spent waiting for an event to occur on another task.
– Synchronous send has no extra copying but requires more waiting; a
receive must be executed and a handshake must arrive before sending.
• MPI_Send
Standard mode tries to trade off between the types of overhead.
– Large messages use the “rendezvous protocol” to avoid extra copying:
a handshake procedure establishes direct communication.
– Small messages use the “eager protocol” to avoid synchronization cost:
the message is quickly copied to a small system buffer on the receiver.

6/11/2013 www.cac.cornell.edu 22
Point to Point Standard Send, Eager Protocol

CPU 1 CPU 2
Process 0 Process 1
send eager receive
protocol
data data

system area
data

• Message goes a system-controlled area of memory on the receiver

6/11/2013 www.cac.cornell.edu 23
Point to Point Blocking vs. Non-Blocking

MPI_Send, MPI_Recv

A blocking send or receive call suspends execution of the process

until the message buffer being sent/received is safe to use.

MPI_Isend, MPI_Irecv

A non-blocking call initiates the communication process; the status of

data transfer and the success of the communication must be verified
independently by the programmer (MPI_Wait or MPI_Test).

6/11/2013 www.cac.cornell.edu 24
Point to Point One-Way Blocking/Non-Blocking
• Blocking send, non-blocking recv
IF (rank==0) THEN
! Do my work, then send to rank 1
CALL MPI_SEND (sendbuf,count,MPI_REAL,1,tag,MPI_COMM_WORLD,ie)
ELSEIF (rank==1) THEN
CALL MPI_IRECV (recvbuf,count,MPI_REAL,0,tag,MPI_COMM_WORLD,req,ie)
! Do stuff that doesn't yet need recvbuf from rank 0
CALL MPI_WAIT (req,status,ie)
! Do stuff with recvbuf
ENDIF

• Non-blocking send, non-blocking recv

IF (rank==0) THEN
! Get sendbuf ready as soon as possible
CALL MPI_ISEND (sendbuf,count,MPI_REAL,1,tag,MPI_COMM_WORLD,req,ie)
! Do other stuff that doesn’t involve sendbuf
ELSEIF (rank==1) THEN
CALL MPI_IRECV (recvbuf,count,MPI_REAL,0,tag,MPI_COMM_WORLD,req,ie)
ENDIF
CALL MPI_WAIT (req,status,ie)

6/11/2013 www.cac.cornell.edu 25
Point to Point MPI_Sendrecv

MPI_Sendrecv(sendbuf,sendcount,sendtype,dest,sendtag,
recvbuf,recvcount,recvtype,source,recvtag,
comm,status)

• Useful for communication patterns where each of a pair of nodes

both sends and receives a message (two-way communication)
• Destination and source need not be the same (ring, e.g.)
• Executes a blocking send and a blocking receive operation
• Both operations use the same communicator, but have distinct tag
arguments

6/11/2013 www.cac.cornell.edu 26
Point to Point Two-Way Communication: Deadlock!
• Deadlock 1
IF (rank==0) THEN
CALL MPI_RECV (recvbuf,count,MPI_REAL,1,tag,MPI_COMM_WORLD,status,ie)
CALL MPI_SEND (sendbuf,count,MPI_REAL,1,tag,MPI_COMM_WORLD,ie)
ELSEIF (rank==1) THEN
CALL MPI_RECV (recvbuf,count,MPI_REAL,0,tag,MPI_COMM_WORLD,status,ie)
CALL MPI_SEND (sendbuf,count,MPI_REAL,0,tag,MPI_COMM_WORLD,ie)
ENDIF

• Deadlock 2
IF (rank==0) THEN
CALL MPI_SSEND (sendbuf,count,MPI_REAL,1,tag,MPI_COMM_WORLD,ie)
CALL MPI_RECV (recvbuf,count,MPI_REAL,1,tag,MPI_COMM_WORLD,status,ie)
ELSEIF (rank==1) THEN
CALL MPI_SSEND (sendbuf,count,MPI_REAL,0,tag,MPI_COMM_WORLD,ie)
CALL MPI_RECV (recvbuf,count,MPI_REAL,0,tag,MPI_COMM_WORLD,status,ie)
ENDIF

– MPI_Send has same problem for count*MPI_REAL > 12K

(the MVAPICH2 “eager threshold”; it’s 256K for Intel MPI)

6/11/2013 www.cac.cornell.edu 27
Point to Point Deadlock Solutions
• Solution 1
IF (rank==0) THEN
CALL MPI_SEND (sendbuf,count,MPI_REAL,1,tag,MPI_COMM_WORLD,ie)
CALL MPI_RECV (recvbuf,count,MPI_REAL,1,tag,MPI_COMM_WORLD,status,ie)
ELSEIF (rank==1) THEN
CALL MPI_RECV (recvbuf,count,MPI_REAL,0,tag,MPI_COMM_WORLD,status,ie)
CALL MPI_SEND (sendbuf,count,MPI_REAL,0,tag,MPI_COMM_WORLD,ie)
ENDIF

• Solution 2
IF (rank==0) THEN
CALL MPI_SENDRECV (sendbuf,count,MPI_REAL,1,tag, &
recvbuf,count,MPI_REAL,1,tag,MPI_COMM_WORLD,status,ie)
ELSEIF (rank==1) THEN
CALL MPI_SENDRECV (sendbuf,count,MPI_REAL,0,tag, &
recvbuf,count,MPI_REAL,0,tag,MPI_COMM_WORLD,status,ie)
ENDIF

6/11/2013 www.cac.cornell.edu 28
Point to Point More Deadlock Solutions
• Solution 3
IF (rank==0) THEN
CALL MPI_IRECV (recvbuf,count,MPI_REAL,1,tag,MPI_COMM_WORLD,req,ie)
CALL MPI_SEND (sendbuf,count,MPI_REAL,1,tag,MPI_COMM_WORLD,ie)
ELSEIF (rank==1) THEN
CALL MPI_IRECV (recvbuf,count,MPI_REAL,0,tag,MPI_COMM_WORLD,req,ie)
CALL MPI_SEND (sendbuf,count,MPI_REAL,0,tag,MPI_COMM_WORLD,ie)
ENDIF
CALL MPI_WAIT (req,status)

• Solution 4
IF (rank==0) THEN
CALL MPI_BSEND (sendbuf,count,MPI_REAL,1,tag,MPI_COMM_WORLD,ie)
CALL MPI_RECV (recvbuf,count,MPI_REAL,1,tag,MPI_COMM_WORLD,status,ie)
ELSEIF (rank==1) THEN
CALL MPI_BSEND (sendbuf,count,MPI_REAL,0,tag,MPI_COMM_WORLD,ie)
CALL MPI_RECV(recvbuf,count,MPI_REAL,0,tag,MPI_COMM_WORLD,status,ie)
ENDIF

6/11/2013 www.cac.cornell.edu 29
Point to Point Two-way Communications: Summary

CPU 0 CPU 1

Deadlock 1 Recv/Send Recv/Send

Deadlock 2 Send/Recv Send/Recv

Solution 1 Send/Recv Recv/Send

Solution 2 Sendrecv Sendrecv

Solution 3 Irecv/Send, Wait Irecv/Send, Wait

Solution 4 Bsend/Recv Bsend/Recv

6/11/2013 www.cac.cornell.edu 30
Basics LAB: Deadlock
• Compile the C or Fortran code to output the executable deadlock
• Create a batch script including no #SBATCH parameters
cat > sr.sh
#!/bin/sh
ibrun ./deadlock [ctrl-D to exit cat]

• Submit the job, specifying parameters on the command line

sbatch -N 1 -n 8 -p development -t 00:01:00 –A TG-TRA120006 sr.sh

– Why use less than 16 tasks on a 16 core node? Memory or threads.

– How would serial differ? -N 1 -n 1 -p serial (& no ibrun)

• Check job progress with squeue; check output with less.

• The program will not end normally. Edit the source code to eliminate
deadlock (e.g., use sendrecv) and resubmit until the output is good.
6/11/2013 www.cac.cornell.edu 31
Collective Motivation
• What if one task wants to send to everyone?
if (mytid == 0) {
for (tid=1; tid<ntids; tid++) {
MPI_Send( (void*)a, /* target= */ tid, … );
}
} else {
MPI_Recv( (void*)a, 0, … );
}

• Implements a very naive, serial broadcast

• Too primitive
– Leaves no room for the OS / switch to optimize
– Leaves no room for more efficient algorithms
• Too slow

6/11/2013 www.cac.cornell.edu 32
Collective Topics
• Overview
• Barrier and Broadcast
• Data Movement Operations
• Reduction Operations

6/11/2013 www.cac.cornell.edu 33
Collective Overview
• Collective calls involve ALL processes within a communicator
• There are 3 basic types of collective communications:
– Synchronization (MPI_Barrier)
– Data movement (MPI_Bcast/Scatter/Gather/Allgather/Alltoall)
– Collective computation (MPI_Reduce/Allreduce/Scan)
• Programming considerations & restrictions
– Blocking operation
– No use of message tag argument
– Collective operation within subsets of processes require separate
grouping and new communicator
– Can only be used with MPI predefined datatypes

6/11/2013 www.cac.cornell.edu 34
Collective Barrier Synchronization, Broadcast

• Barrier blocks until all processes in comm have called it

– Useful when measuring communication/computation time
– mpi_barrier(comm, ierr)
– MPI_Barrier(comm)

• Broadcast sends data from root to all processes in comm

– Again, blocks until all tasks have called it
– mpi_bcast(data, count, type, root, comm, ierr)
– MPI_Bcast(data, count, type, root, comm)

6/11/2013 www.cac.cornell.edu 35
Collective Data Movement

• Broadcast

• Scatter/Gather

• Allgather

• Alltoall

6/11/2013 www.cac.cornell.edu 36
Collective Reduction Operations

• Reduce

• Scan

6/11/2013 www.cac.cornell.edu 37
Collective Reduction Operations

Name Meaning

MPI_MAX Maximum
MPI_MIN Minimum
MPI_SUM Sum
MPI_PROD Product
MPI_LAND Logical and
MPI_BAND Bit-wise and
MPI_LOR Logical or
MPI_BOR Bit-wise or
MPI_LXOR Logical xor
MPI_BXOR Logical xor
MPI_MAXLOC Max value and location
MPI_MINLOC Min value and location

6/11/2013 www.cac.cornell.edu 38
Basics LAB: Allreduce
• In the call to MPI_Allreduce, the reduction operation is wrong!
– Modify the C or Fortran source to use the correct operation
• Compile the C or Fortran code to output the executable allreduce
• Submit the myall.sh batch script to SLURM, the batch scheduler
– Check on progress until the job completes
– Examine the output file
sbatch myall.sh
squeue -u <my_username>
less myall.o*

• Verify that you got the expected answer

6/11/2013 www.cac.cornell.edu 39
MPI-1
• MPI-1 - Message Passing Interface (v. 1.2)
– Library standard defined by committee of vendors, implementers, and
parallel programmers
– Used to create parallel SPMD codes based on explicit message passing
• Available on almost all parallel machines with C/C++ and Fortran
bindings (and occasionally with other bindings)
• About 125 routines, total
– 6 basic routines
– The rest include routines of increasing generality and specificity
• This presentation has covered just MPI-1 routines

6/11/2013 www.cac.cornell.edu 40
MPI-2
• Includes features left out of MPI-1
– One-sided communications
– Dynamic process control
– More complicated collectives
– Parallel I/O (MPI-IO)
• Implementations came along only gradually
– Not quickly undertaken after the reference document was released (in
1997)
– Now OpenMPI, MPICH2 (and its descendants), and the vendor
implementations are nearly complete or fully complete
• Most applications still rely on MPI-1, plus maybe MPI-IO

6/11/2013 www.cac.cornell.edu 41
References
• MPI-1 and MPI-2 standards
– http://www.mpi-forum.org/docs/mpi-11-html/mpi-report.html
– http://www.mpi-forum.org/docs/mpi-20-html/mpi2-report.htm
– http://www.mcs.anl.gov/mpi/ (other mirror sites)
• Freely available implementations
– MPICH, http://www.mcs.anl.gov/mpi/mpich
– LAM-MPI, http://www.lam-mpi.org/
• Books
– Using MPI, by Gropp, Lusk, and Skjellum
– MPI Annotated Reference Manual, by Marc Snir, et al
– Parallel Programming with MPI, by Peter Pacheco
– Using MPI-2, by Gropp, Lusk and Thakur
• Newsgroup: comp.parallel.mpi

6/11/2013 www.cac.cornell.edu 42
Extra Slides

6/11/2013 www.cac.cornell.edu 43
MPI_COMM MPI Communicators
• Communicators
– Collections of processes that can communicate with each other
– Most MPI routines require a communicator as an argument
– Predefined communicator MPI_COMM_WORLD encompasses all tasks
– New communicators can be defined; any number can co-exist
• Each communicator must be able to answer two questions
– How many processes exist in this communicator?
– MPI_Comm_size returns the answer, say, Np
– Of these processes, which process (numerical rank) am I?
– MPI_Comm_rank returns the rank of the current process within the
communicator, an integer between 0 and Np-1 inclusive
– Typically these functions are called just after MPI_Init

6/11/2013 www.cac.cornell.edu 44
MPI_COMM C Example: param.c

#include <mpi.h>
main(int argc, char **argv){
int np, mype, ierr;

ierr = MPI_Init(&argc, &argv);

ierr = MPI_Comm_size(MPI_COMM_WORLD, &np);
ierr = MPI_Comm_rank(MPI_COMM_WORLD, &mype);
:
MPI_Finalize();
}

6/11/2013 www.cac.cornell.edu 45
MPI_COMM C++ Example: param.cc

#include "mpif.h"
[other includes]
int main(int argc, char *argv[]){
int np, mype, ierr;
[other declarations]
:
MPI::Init(argc, argv);
np = MPI::COMM_WORLD.Get_size();
mype = MPI::COMM_WORLD.Get_rank();
:
[actual work goes here]
:
MPI::Finalize();
}

6/11/2013 www.cac.cornell.edu 46
MPI_COMM Fortran Example: param.f90

program param
include 'mpif.h'
integer ierr, np, mype

call mpi_init(ierr)
call mpi_comm_size(MPI_COMM_WORLD, np , ierr)
call mpi_comm_rank(MPI_COMM_WORLD, mype, ierr)
:
call mpi_finalize(ierr)
end program

6/11/2013 www.cac.cornell.edu 47
Point to Point Communication Modes
Mode Pros Cons
Synchronous – sending - Safest, therefore most portable Synchronization
and receiving tasks must - No need for extra buffer space overhead
‘handshake’. - SEND/RECV order not critical
Ready- assumes that a - Lowest total overhead RECV must prec
‘ready to receive’ - No need for extra buffer space ede SEND
message has already - Handshake not required
been received.
Buffered – move data to - Decouples SEND from RECV Buffer copy
a buffer so process does - No sync overhead on SEND overhead
not wait. - Programmer controls buffer size
Standard – defined by - Good for many cases Your program
the implementer; meant - Small messages go right away may not be
to take advantage of the - Large messages must sync suitable
local system. - Compromise position

6/11/2013 www.cac.cornell.edu 48
Point to Point C Example: oneway.c

#include "mpi.h"
main(int argc, char **argv){
int ierr, mype, myworld; double a[2];
MPI_Status status;
MPI_Comm icomm = MPI_COMM_WORLD;
ierr = MPI_Init(&argc, &argv);
ierr = MPI_Comm_rank(icomm, &mype);
ierr = MPI_Comm_size(icomm, &myworld);
if(mype == 0){
a[0] = mype; a[1] = mype+1;
ierr = MPI_Ssend(a,2,MPI_DOUBLE,1,9,icomm);
}
else if (mype == 1){
ierr = MPI_Recv(a,2,MPI_DOUBLE,0,9,icomm,&status);
printf("PE %d, A array= %f %f\n",mype,a[0],a[1]);
}
MPI_Finalize();
}

6/11/2013 www.cac.cornell.edu 49
Point to Point Fortran Example: oneway.f90

program oneway
include "mpif.h"
real*8, dimension(2) :: A
integer, dimension(MPI_STATUS_SIZE) :: istat
icomm = MPI_COMM_WORLD
call mpi_init(ierr)
call mpi_comm_rank(icomm,mype,ierr)
call mpi_comm_size(icomm,np ,ierr);

if (mype.eq.0) then
a(1) = dble(mype); a(2) = dble(mype+1)
call mpi_send(A,2,MPI_REAL8,1,9,icomm,ierr)
else if (mype.eq.1) then
call mpi_recv(A,2,MPI_REAL8,0,9,icomm,istat,ierr)
print '("PE",i2," received A array =",2f8.4)',mype,A
endif
call mpi_finalize(ierr)
end program

6/11/2013 www.cac.cornell.edu 50
Collective C Example: allreduce.c

#include <mpi.h>
#define WCOMM MPI_COMM_WORLD
main(int argc, char **argv){
int npes, mype, ierr;
double sum, val; int calc, knt=1;
ierr = MPI_Init(&argc, &argv);
ierr = MPI_Comm_size(WCOMM, &npes);
ierr = MPI_Comm_rank(WCOMM, &mype);

val = (double)mype;
ierr = MPI_Allreduce(
&val, &sum, knt, MPI_DOUBLE, MPI_SUM, WCOMM);

calc = (npes-1 +npes%2)*(npes/2);

printf(" PE: %d sum=%5.0f calc=%d\n",mype,sum,calc);
ierr = MPI_Finalize();
}

6/11/2013 www.cac.cornell.edu 51
Collective Fortran Example: allreduce.f90

program allreduce
include 'mpif.h'
double precision :: val, sum
icomm = MPI_COMM_WORLD
knt = 1
call mpi_init(ierr)
call mpi_comm_rank(icomm,mype,ierr)
call mpi_comm_size(icomm,npes,ierr)

val = dble(mype)
call mpi_allreduce(val,sum,knt,MPI_REAL8,MPI_SUM,icomm,ierr)

ncalc = (npes-1 + mod(npes,2))*(npes/2)

print '(" pe#",i5," sum =",f5.0, " calc. sum =",i5)', &
mype, sum, ncalc
call mpi_finalize(ierr)
end program

6/11/2013 www.cac.cornell.edu 52
Collective The Collective Collection!

6/11/2013 www.cac.cornell.edu 53

NIELSEN, Jakob; Mack., Robert L. Usability Inspection Methods
No ratings yet
NIELSEN, Jakob; Mack., Robert L. Usability Inspection Methods
439 pages
Synchronous Soap To JDBC - End To End Walkthrough - Sap Pi Tutorials
No ratings yet
Synchronous Soap To JDBC - End To End Walkthrough - Sap Pi Tutorials
10 pages
Qlik Connector For SAP - Installation Guide v7.0.0
No ratings yet
Qlik Connector For SAP - Installation Guide v7.0.0
44 pages
Message Passing Interface (MPI)
No ratings yet
Message Passing Interface (MPI)
14 pages
03-MPIProgramStructure[1]
No ratings yet
03-MPIProgramStructure[1]
42 pages
SERC IntroMPI 2019-09-14 v0
No ratings yet
SERC IntroMPI 2019-09-14 v0
43 pages
Week 10
No ratings yet
Week 10
52 pages
The Message Passing Interface (MPI)
No ratings yet
The Message Passing Interface (MPI)
18 pages
Message Passing Interface (MPI)
No ratings yet
Message Passing Interface (MPI)
22 pages
‎⁨تقرير⁩
No ratings yet
‎⁨تقرير⁩
16 pages
PA
No ratings yet
PA
87 pages
Clase 4 - Tutorial de MPI
No ratings yet
Clase 4 - Tutorial de MPI
35 pages
Cs-3006 6 Mpi Basics 2
No ratings yet
Cs-3006 6 Mpi Basics 2
52 pages
Intro_MPI
No ratings yet
Intro_MPI
60 pages
CS-3006_5_MPI Basics
No ratings yet
CS-3006_5_MPI Basics
53 pages
in3200-chap09
No ratings yet
in3200-chap09
56 pages
Introduction to C MPI PM
No ratings yet
Introduction to C MPI PM
50 pages
NGK Mpi
No ratings yet
NGK Mpi
74 pages
Lab Mpi
No ratings yet
Lab Mpi
32 pages
An Introduction To MPI: Parallel Programming With The Message Passing Interface
No ratings yet
An Introduction To MPI: Parallel Programming With The Message Passing Interface
48 pages
Lab Mpi
No ratings yet
Lab Mpi
29 pages
Introduction to the Message Passing Interface (MPI
No ratings yet
Introduction to the Message Passing Interface (MPI
16 pages
Introduction MPI - Chap2 - Slide 3
No ratings yet
Introduction MPI - Chap2 - Slide 3
16 pages
Intro To MPI: Hpc-Support@duke - Edu
No ratings yet
Intro To MPI: Hpc-Support@duke - Edu
56 pages
Mpi Unit 5 Part 2 1
No ratings yet
Mpi Unit 5 Part 2 1
65 pages
Introduction to MPI Basics
No ratings yet
Introduction to MPI Basics
8 pages
mpi1
No ratings yet
mpi1
20 pages
Parallel & Distributed Computing: MPI - Message Passing Interface
No ratings yet
Parallel & Distributed Computing: MPI - Message Passing Interface
49 pages
Mpi Basic Operations
No ratings yet
Mpi Basic Operations
6 pages
Mpi
No ratings yet
Mpi
30 pages
Introduction To MPI Ranger Lonestar
No ratings yet
Introduction To MPI Ranger Lonestar
67 pages
Basic MPI: Tom Murphy, Dave Joiner, Paul Gray, Henry Neeman, Charlie Peck, Alex Lemann, Kristina Wanous, Kevin Hunter
No ratings yet
Basic MPI: Tom Murphy, Dave Joiner, Paul Gray, Henry Neeman, Charlie Peck, Alex Lemann, Kristina Wanous, Kevin Hunter
22 pages
Chapter 4 - Message-Passing Programming, MPI
No ratings yet
Chapter 4 - Message-Passing Programming, MPI
79 pages
العملي
No ratings yet
العملي
55 pages
MPI_tutorial_Fall_Break_2022
No ratings yet
MPI_tutorial_Fall_Break_2022
60 pages
Lecture 11 Distributed Memory Programming
No ratings yet
Lecture 11 Distributed Memory Programming
28 pages
Message Passing Interface: Parallel Processing Course University of Tehran
No ratings yet
Message Passing Interface: Parallel Processing Course University of Tehran
49 pages
02 Mpi 0
No ratings yet
02 Mpi 0
19 pages
Mpi Lecture
No ratings yet
Mpi Lecture
129 pages
Nscet E-Learning Presentation: Listen Learn Lead
No ratings yet
Nscet E-Learning Presentation: Listen Learn Lead
54 pages
Message Passing Interface (MPI) : EC3500: Introduction To Parallel Computing
100% (1)
Message Passing Interface (MPI) : EC3500: Introduction To Parallel Computing
40 pages
5 MPIprogramming
No ratings yet
5 MPIprogramming
43 pages
Message Passing Interface (MPI) Programming
No ratings yet
Message Passing Interface (MPI) Programming
11 pages
CP4253 Map Unit Iv
No ratings yet
CP4253 Map Unit Iv
22 pages
MiniTool Partition Wizard Crack 12 Key Download Free 2025
No ratings yet
MiniTool Partition Wizard Crack 12 Key Download Free 2025
29 pages
Mpi
No ratings yet
Mpi
67 pages
HPC Lecture40
No ratings yet
HPC Lecture40
25 pages
MPI (2)
No ratings yet
MPI (2)
25 pages
Distributed Memory Programming With MPI: Peter Pacheco
No ratings yet
Distributed Memory Programming With MPI: Peter Pacheco
121 pages
Parallel and Distributed Computing Lab Digital Assignment - 5
No ratings yet
Parallel and Distributed Computing Lab Digital Assignment - 5
7 pages
Ms. V. Uma Maheswari, Assistant Lecturer, Department of Information Technology, National Institute of Technology, Surathkal
No ratings yet
Ms. V. Uma Maheswari, Assistant Lecturer, Department of Information Technology, National Institute of Technology, Surathkal
91 pages
Message Passing and MPI: John Mellor-Crummey
No ratings yet
Message Passing and MPI: John Mellor-Crummey
78 pages
Introduction To The Message Passing Interface (MPI) : Parallel and High Performance Computing
No ratings yet
Introduction To The Message Passing Interface (MPI) : Parallel and High Performance Computing
41 pages
Parallel Programming Using Basic MPI Presented by Timothy H. Kaiser, Ph.D. San Diego Supercomputer Center
No ratings yet
Parallel Programming Using Basic MPI Presented by Timothy H. Kaiser, Ph.D. San Diego Supercomputer Center
19 pages
02 Message Passing Interface Tutorial
No ratings yet
02 Message Passing Interface Tutorial
34 pages
Lecture05 MPI
No ratings yet
Lecture05 MPI
26 pages
07_2_Introduction_MPI
No ratings yet
07_2_Introduction_MPI
27 pages
Lecture 15 MPI Summarization
No ratings yet
Lecture 15 MPI Summarization
26 pages
govind_4
No ratings yet
govind_4
3 pages
Mastering Go A Practical Guide to Developers: A Practical Guide to Developers
From Everand
Mastering Go A Practical Guide to Developers: A Practical Guide to Developers
Miguel Miranda de Mattos
No ratings yet
Dive Into Sea of C
From Everand
Dive Into Sea of C
M Ashok
No ratings yet
PHP Package Mastery: 100 Essential Tools in One Hour - 2024 Edition
From Everand
PHP Package Mastery: 100 Essential Tools in One Hour - 2024 Edition
Kanto
No ratings yet
Computer Practices Using C++
From Everand
Computer Practices Using C++
Ramkrishna Ghosh
No ratings yet
Petrophysics: Tigress
No ratings yet
Petrophysics: Tigress
6 pages
Combined Calendar Application and Scientific Calculator
No ratings yet
Combined Calendar Application and Scientific Calculator
36 pages
Artificial Lift: Solving Challenges
No ratings yet
Artificial Lift: Solving Challenges
4 pages
E-Comm Assignments - AKW
No ratings yet
E-Comm Assignments - AKW
57 pages
PPT MINI PROJECT
No ratings yet
PPT MINI PROJECT
10 pages
Wonderware Application Server User's Guide: Invensys Systems, Inc
No ratings yet
Wonderware Application Server User's Guide: Invensys Systems, Inc
262 pages
Information Security
No ratings yet
Information Security
4 pages
GSM and LTE Spectrum Concurrency (SRAN15.1)
No ratings yet
GSM and LTE Spectrum Concurrency (SRAN15.1)
88 pages
Cute Point & Figure Indicator
No ratings yet
Cute Point & Figure Indicator
5 pages
Brief Data Sheet: Hi3520D V400 H.265 Codec Processor
No ratings yet
Brief Data Sheet: Hi3520D V400 H.265 Codec Processor
8 pages
Tutorial - Motor Control Centers (MCCS)
No ratings yet
Tutorial - Motor Control Centers (MCCS)
5 pages
Win-Cnc: Windows - CNC Communication / Editing Software For Cincom Machines
No ratings yet
Win-Cnc: Windows - CNC Communication / Editing Software For Cincom Machines
37 pages
New Perspectives on the Internet Comprehensive 9th Edition Schneider Test Bank download
100% (5)
New Perspectives on the Internet Comprehensive 9th Edition Schneider Test Bank download
43 pages
13 - m1 - Linux Basic Commands - Edureka VM PDF
No ratings yet
13 - m1 - Linux Basic Commands - Edureka VM PDF
3 pages
Arya Uts
No ratings yet
Arya Uts
17 pages
Topic 2 - Basic Switch and End Device Configuration
No ratings yet
Topic 2 - Basic Switch and End Device Configuration
60 pages
SSL Hijacking
No ratings yet
SSL Hijacking
7 pages
Dokumentasi - Laporan Pembelian & Laporan Penjualan
No ratings yet
Dokumentasi - Laporan Pembelian & Laporan Penjualan
13 pages
8.1 Cloud Infrastructure Requirements and Questions
No ratings yet
8.1 Cloud Infrastructure Requirements and Questions
45 pages
6th Class Vacation Work - August 2020
No ratings yet
6th Class Vacation Work - August 2020
18 pages
What's New For Smart Start 8.70
No ratings yet
What's New For Smart Start 8.70
24 pages
Sit771-7 1P
No ratings yet
Sit771-7 1P
4 pages
Ailunce HD1 Software Manual
No ratings yet
Ailunce HD1 Software Manual
33 pages
3RD YEAR FINAL PROJECT
No ratings yet
3RD YEAR FINAL PROJECT
5 pages
Securing Secrets at Scale Ebook FINAL
No ratings yet
Securing Secrets at Scale Ebook FINAL
17 pages
Lab 04
No ratings yet
Lab 04
5 pages
UNIT 1 cyber security
No ratings yet
UNIT 1 cyber security
25 pages

Message Passing Interface (MPI) : Steve Lantz Center For Advanced Computing Cornell University

Uploaded by

Message Passing Interface (MPI) : Steve Lantz Center For Advanced Computing Cornell University

Uploaded by

Message Passing Interface (MPI)

Workshop: Parallel Computing on Stampede, June 11, 2013

Image of Dell PowerEdge C8220X: http://www.theregister.co.uk/2012/09/19/dell_zeus_c8000_hyperscale_server/

• Include the implementation-specific header file --

echo 2000 > input

• Compile the code using mpicc to output the executable hello_mpi

MPI_Send( message, 13, MPI_CHAR, i, type, MPI_COMM_WORLD );

MPI_Recv( message, 20, MPI_CHAR, 0, type, MPI_COMM_WORLD, &status);

Type of data, should be same

Number of elements (items, not bytes)

MPI_Recv( message, 20, MPI_CHAR, 0, type, MPI_COMM_WORLD, &status);

Identify process you’re

Arbitrary tag number, must match up

Communicator specified for send and Returns information

mpi_send (data, count, type, dest, tag, comm, ierr)

• A few Fortran particulars

• Sending data from one point (process/task)

• Message contents are sent to a system-controlled block of memory

Communication Blocking Routines Non-Blocking Routines

• Message goes a system-controlled area of memory on the receiver

A blocking send or receive call suspends execution of the process

A non-blocking call initiates the communication process; the status of

• Non-blocking send, non-blocking recv

• Useful for communication patterns where each of a pair of nodes

– MPI_Send has same problem for count*MPI_REAL > 12K

Deadlock 1 Recv/Send Recv/Send

Deadlock 2 Send/Recv Send/Recv

Solution 1 Send/Recv Recv/Send

Solution 2 Sendrecv Sendrecv

Solution 3 Irecv/Send, Wait Irecv/Send, Wait

Solution 4 Bsend/Recv Bsend/Recv

• Submit the job, specifying parameters on the command line

– Why use less than 16 tasks on a 16 core node? Memory or threads.

• Check job progress with squeue; check output with less.

• Implements a very naive, serial broadcast

• Barrier blocks until all processes in comm have called it

• Broadcast sends data from root to all processes in comm

• Verify that you got the expected answer

ierr = MPI_Init(&argc, &argv);

calc = (npes-1 +npes%2)*(npes/2);

ncalc = (npes-1 + mod(npes,2))*(npes/2)

You might also like