Gropp MPI Tutorial
Message-Passing Interface
William Gropp
Argonne National Laboratory
The University of Chicago
2
Background
Parallel Computing
Communicating with other processes
Cooperative operations
One-sided operations
The MPI process
3
Parallel Computing
4
Types of parallel computing
5
Communicating with other processes
6
Cooperative operations
SEND( data )
RECV( data )
7
One-sided operations
One-sided operations between parallel
processes include remote memory reads and
writes.
An advantage is that data can be accessed
without waiting for another process
Process 0 Process 1
PUT( data )
(Memory)
Process 0 Process 1
(Memory)
GET( data )
8
Class Example
9
Hardware models
10
What is MPI?
11
Motivation for a New Design
12
Motivation (cont.)
13
The MPI Process
Began at Williamsburg Workshop in April, 1992
Organized at Supercomputing '92 (November)
Followed HPF format and process
Met every six weeks for two days
Extensive, open email discussions
Drafts, readings, votes
Pre-final draft distributed at Supercomputing '93
Two-month public comment period
Final version of draft in May, 1994
Widely available now on the Web, ftp sites, netlib
(http://www.mcs.anl.gov/mpi/index.html)
Public implementations available
Vendor implementations coming soon
14
Who Designed MPI?
Broad participation
Vendors
- IBM, Intel, TMC, Meiko, Cray, Convex, Ncube
Library writers
- PVM, p4, Zipcode, TCGMSG, Chameleon, Express, Linda
Application specialists and consultants
Companies: ARCO, Convex, Cray Res, IBM, Intel, KAI, Meiko, NAG, nCUBE, ParaSoft, Shell, TMC
Laboratories: ANL, GMD, LANL, LLNL, NOAA, NSF, ORNL, PNL, Sandia, SDSC, SRC
Universities: UC Santa Barbara, Syracuse U, Michigan State U, Oregon Grad Inst, U of New Mexico, Miss. State U., U of Southampton, U of Colorado, Yale U, U of Tennessee, U of Maryland, Western Mich U, U of Edinburgh, Cornell U., Rice U., U of San Francisco
15
Features of MPI
General
- Communicators combine context and group for message security
- Thread safety
Point-to-point communication
- Structured buffers and derived datatypes, heterogeneity
- Modes: normal (blocking and non-blocking), synchronous, ready (to allow access to fast protocols), buffered
Collective
- Both built-in and user-defined collective operations
- Large number of data movement routines
- Subgroups defined directly or by topology
16
Features of MPI (cont.)
17
Features not in MPI
18
Is MPI Large or Small?
19
Where to use MPI?
20
Why learn MPI?
Portable
Expressive
Good way to learn about subtle issues in
parallel computing
21
Getting started
Writing MPI programs
Compiling and linking
Running MPI programs
More information
- Using MPI by William Gropp, Ewing Lusk, and Anthony Skjellum
- The LAM companion to "Using MPI..." by Zdzislaw Meglicki
- Designing and Building Parallel Programs by Ian Foster
- A Tutorial/User's Guide for MPI by Peter Pacheco (ftp://math.usfca.edu/pub/MPI/mpi.guide.ps)
- The MPI standard and other information is available at http://www.mcs.anl.gov/mpi. Also the source for several implementations.
22
Writing MPI programs
#include "mpi.h"
#include <stdio.h>
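The rest of the program on this slide is not reproduced above. A minimal sketch of the kind of first MPI program being described (every MPI program must call MPI_Init before any other MPI routine and MPI_Finalize at the end) might look like this:

#include "mpi.h"
#include <stdio.h>

int main( int argc, char *argv[] )
{
    MPI_Init( &argc, &argv );      /* must be called before any other MPI routine */
    printf( "Hello world\n" );
    MPI_Finalize();                /* no MPI calls are allowed after this */
    return 0;
}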
23
Commentary
24
Compiling and linking
25
Special compilation commands
The commands
mpicc -o first first.c
mpif77 -o firstf firstf.f
26
Using Makefiles
27
Sample Makefile.in
ARCH = @ARCH@
COMM = @COMM@
INSTALL_DIR = @INSTALL_DIR@
CC = @CC@
F77 = @F77@
CLINKER = @CLINKER@
FLINKER = @FLINKER@
OPTFLAGS = @OPTFLAGS@
#
LIB_PATH = -L$(INSTALL_DIR)/lib/$(ARCH)/$(COMM)
FLIB_PATH =
@FLIB_PATH_LEADER@$(INSTALL_DIR)/lib/$(ARCH)/$(COMM)
LIB_LIST = @LIB_LIST@
#
INCLUDE_DIR = @INCLUDE_PATH@ -I$(INSTALL_DIR)/include
28
Sample Makefile.in (cont.)
default: hello
all: $(EXECS)
clean:
/bin/rm -f *.o *~ PI* $(EXECS)
.c.o:
$(CC) $(CFLAGS) -c $*.c
.f.o:
$(F77) $(FFLAGS) -c $*.f
29
Running MPI programs
30
Finding out about the environment
31
A simple program
#include "mpi.h"
#include <stdio.h>
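The rest of this program is not reproduced above. A minimal sketch of a simple program that uses MPI_Comm_rank and MPI_Comm_size to find out about the environment:

#include "mpi.h"
#include <stdio.h>

int main( int argc, char *argv[] )
{
    int rank, size;
    MPI_Init( &argc, &argv );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );   /* which process am I? */
    MPI_Comm_size( MPI_COMM_WORLD, &size );   /* how many processes are there? */
    printf( "Hello from process %d of %d\n", rank, size );
    MPI_Finalize();
    return 0;
}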
32
Caveats
33
Exercise - Getting Started
34
Sending and Receiving messages
Process 0 Process 1
A:
Send Recv
B:
Questions:
To whom is data sent?
What is sent?
How does the receiver identify it?
35
Current Message-Passing
A typical blocking send looks like
send( dest, type, address, length )
where
- dest is an integer identifier representing the process to receive the message.
- type is a nonnegative integer that the destination can use to selectively screen messages.
- (address, length) describes a contiguous area in memory containing the message to be sent.
and a typical global operation looks like:
broadcast( type, address, length )
36
The Buffer
37
Generalizing the Buffer Description
38
Generalizing the Type
39
Sample Program using Library Calls
40
Correct Execution of Library Calls
recv(any) send(1)
Sub1
recv(any) send(0)
recv(1) send(0)
Sub2
recv(2) send(1)
send(2) recv(0)
41
Incorrect Execution of Library Calls
recv(any) send(1)
Sub1
recv(any) send(0)
recv(1) send(0)
send(2) recv(0)
42
Correct Execution of Library Calls with Pending
Communication
recv(any) send(1)
Sub1a
send(0)
recv(2) send(0)
send(1) recv(0)
recv(any)
Sub1b
43
Incorrect Execution of Library Calls with Pending
Communication
recv(any) send(1)
Sub1a
send(0)
recv(2) send(0)
send(1) recv(0)
recv(any)
Sub1b
44
Solution to the type problem
45
Delimiting Scope of Communication
46
Generalizing the Process Identifier
47
MPI Basic Send/Receive
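The argument lists for MPI_Send and MPI_Recv are not reproduced on this slide. A sketch of the basic blocking send and receive (buffer size, tag, and ranks here are illustrative):

double A[10];
MPI_Status status;

if (rank == 0)        /* rank assumed to come from MPI_Comm_rank */
    MPI_Send( A, 10, MPI_DOUBLE, 1, 99, MPI_COMM_WORLD );
else if (rank == 1)
    MPI_Recv( A, 10, MPI_DOUBLE, 0, 99, MPI_COMM_WORLD, &status );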
48
Getting information about a message
MPI_Status status;
MPI_Recv( ..., &status );
... status.MPI_TAG;
... status.MPI_SOURCE;
MPI_Get_count( &status, datatype, &count );
49
Simple Fortran example
program main
include 'mpif.h'
50
Simple Fortran example (cont.)
51
Six Function MPI
52
A taste of things to come
The following examples show a C and
Fortran version of the same program.
This program computes PI (with a very
simple method) but does not use MPI_Send
and MPI_Recv. Instead, it uses collective
operations to send data to and from all of
the running processes. This gives a different
six-function MPI set:
MPI_Init
MPI_Finalize
MPI_Comm_size
MPI_Comm_rank
MPI_Bcast
MPI_Reduce
53
Broadcast and Reduction
54
Fortran example: PI
program main
include "mpif.h"
call MPI_BCAST(n,1,MPI_INTEGER,0,MPI_COMM_WORLD,ierr)
55
Fortran example (cont.)
c check for quit signal
if ( n .le. 0 ) goto 30
sum = 0.0d0
do 20 i = myid+1, n, numprocs
x = h * (dble(i) - 0.5d0)
sum = sum + f(x)
20 continue
mypi = h * sum
goto 10
30 call MPI_FINALIZE(rc)
stop
end
56
C example: PI
#include "mpi.h"
#include <math.h>
int main(argc,argv)
int argc;
char *argv[];
{
int done = 0, n, myid, numprocs, i, rc;
double PI25DT = 3.141592653589793238462643;
double mypi, pi, h, sum, x, a;
MPI_Init(&argc,&argv);
MPI_Comm_size(MPI_COMM_WORLD,&numprocs);
MPI_Comm_rank(MPI_COMM_WORLD,&myid);
57
C example (cont.)
while (!done)
{
if (myid == 0) {
printf("Enter the number of intervals: (0 quits) ");
scanf("%d",&n);
}
MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
if (n == 0) break;
h = 1.0 / (double) n;
sum = 0.0;
for (i = myid + 1; i <= n; i += numprocs) {
x = h * ((double)i - 0.5);
sum += 4.0 / (1.0 + x*x);
}
mypi = h * sum;
MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0,
MPI_COMM_WORLD);
if (myid == 0)
printf("pi is approximately %.16f, Error is %.16f\n",
pi, fabs(pi - PI25DT));
}
MPI_Finalize();
}
58
Exercise - PI
59
Exercise - Ring
60
Topologies
61
Cartesian Topologies
62
Defining a Cartesian Topology
63
Finding neighbors
64
Who am I?
65
Partitioning
66
Other Topology Routines
67
Why are these routines in MPI?
In many parallel computer interconnects, some processors
are closer to some others than they are to the rest. These
routines allow the MPI implementation to provide an
ordering of processes in a topology that makes logical
neighbors close in the physical interconnect.
Some parallel programmers may remember hypercubes and
the effort that went into assigning nodes in a mesh to
processors in a hypercube through the use of Gray codes.
Many new systems have different interconnects; ones with
multiple paths may have notions of near neighbors that
change with time. These routines free the programmer from
many of these considerations. The reorder argument is
used to request the best ordering.
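As a sketch of the topology routines discussed around this slide (the 4 x 3 grid, the periodicity choices, and the variable names are illustrative assumptions):

int dims[2]    = { 4, 3 };     /* a 4 x 3 process grid */
int periods[2] = { 1, 0 };     /* periodic in the first dimension only */
int reorder    = 1;            /* let MPI reorder ranks to fit the hardware */
int myrank, coords[2], left, right, down, up;
MPI_Comm cart;

MPI_Cart_create( MPI_COMM_WORLD, 2, dims, periods, reorder, &cart );
MPI_Comm_rank( cart, &myrank );
MPI_Cart_coords( cart, myrank, 2, coords );     /* where am I in the grid? */
MPI_Cart_shift( cart, 0, 1, &left, &right );    /* neighbors in dimension 0 */
MPI_Cart_shift( cart, 1, 1, &down, &up );       /* neighbors in dimension 1 */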
68
The periods argument
69
Periodic Grids
70
Nonperiodic Grids
71
Collective Communications in MPI
72
Synchronization
MPI_Barrier(comm)
Function blocks until all processes in
comm call it.
73
Available Collective Patterns
Broadcast: P0 starts with A; afterward P0, P1, P2, and P3 all have A.
Scatter: P0 starts with A B C D; afterward P0 has A, P1 has B, P2 has C, and P3 has D. Gather is the inverse operation.
Allgather: each process contributes one item (A, B, C, D); afterward every process has A B C D.
Alltoall: P0 starts with A0 A1 A2 A3, P1 with B0 B1 B2 B3, and so on; afterward P0 has A0 B0 C0 D0, P1 has A1 B1 C1 D1, etc. (the data is transposed across processes).
74
Available Collective Computation Patterns
Reduce: P0, P1, P2, and P3 contribute A, B, C, and D; the root (P0) receives the combined result ABCD.
Scan: P0, P1, P2, and P3 contribute A, B, C, and D; afterward P0 has A, P1 has AB, P2 has ABC, and P3 has ABCD (a prefix combination).
75
MPI Collective Routines
Many routines:
Allgather Allgatherv Allreduce
Alltoall Alltoallv Bcast
Gather Gatherv Reduce
ReduceScatter Scan Scatter
Scatterv
"All" versions deliver results to all participating processes.
"V" versions allow the chunks to have different sizes.
Allreduce, Reduce, ReduceScatter, and Scan take both built-in and user-defined combination functions.
76
Built-in Collective Computation Operations
77
Defining Your Own Collective Operations
78
Sample user function
To use, just
integer myop
call MPI_Op_create( myfunc, .true., myop, ierr )
call MPI_Reduce( a, b, 1, MPI_DOUBLE_PRECISION, myop, ... )
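The user function itself is not shown above. For reference, a C sketch of a user-defined combination function and its registration (the function simply adds doubles; the names myfunc and myop, and the buffers a and b, are illustrative):

void myfunc( void *invec, void *inoutvec, int *len, MPI_Datatype *datatype )
{
    /* combine element-wise: inoutvec[i] = invec[i] + inoutvec[i] */
    double *in = (double *) invec, *inout = (double *) inoutvec;
    int i;
    for (i = 0; i < *len; i++)
        inout[i] += in[i];
}

/* registration and use (e.g., inside main): */
MPI_Op myop;
MPI_Op_create( (MPI_User_function *) myfunc, 1, &myop );   /* 1 = commutative */
MPI_Reduce( a, b, 1, MPI_DOUBLE, myop, 0, MPI_COMM_WORLD );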
79
Defining groups
80
Subdividing a communicator
To create a separate communicator for each row of a logical process grid, use
MPI_Comm_split( oldcomm, row, 0, &newcomm );
81
Subdividing (cont.)
To create a separate communicator for each column, use
MPI_Comm_split( oldcomm, column, 0, &newcomm2 );
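A sketch of how the row and column values used above might be computed from each process's rank (the grid width ncols is an illustrative assumption):

int rank, row, column, ncols = 4;       /* assume a logical grid ncols wide */
MPI_Comm rowcomm, colcomm;

MPI_Comm_rank( oldcomm, &rank );
row    = rank / ncols;     /* processes with the same color share the new communicator */
column = rank % ncols;
MPI_Comm_split( oldcomm, row,    0, &rowcomm );   /* one communicator per row */
MPI_Comm_split( oldcomm, column, 0, &colcomm );   /* one communicator per column */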
82
Manipulating Groups
83
Creating Groups
84
Buffering issues
(figure: the message is copied from the sender's local buffer, across the network, into a local buffer on the receiver, and then into B)
85
Better buffering
(figure: the data is delivered directly into B, avoiding the intermediate buffers)
86
Blocking and Non-Blocking communication
87
Some Solutions to the "Unsafe" Problem
Process 0 Process 1
Sendrecv(1) Sendrecv(0)
Use non-blocking operations (a C sketch follows below):
Process 0 Process 1
Isend(1) Isend(0)
Irecv(1) Irecv(0)
Waitall Waitall
Use MPI_Bsend
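A C sketch of the non-blocking alternative above, for two processes exchanging buffers (sendbuf, recvbuf, n, and rank are assumed to be set up elsewhere):

MPI_Request reqs[2];
MPI_Status  stats[2];
int other = 1 - rank;            /* the partner process in a two-process exchange */

MPI_Isend( sendbuf, n, MPI_DOUBLE, other, 0, MPI_COMM_WORLD, &reqs[0] );
MPI_Irecv( recvbuf, n, MPI_DOUBLE, other, 0, MPI_COMM_WORLD, &reqs[1] );
/* both operations are posted before either completes, so neither process
   blocks waiting for the other to receive first */
MPI_Waitall( 2, reqs, stats );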
88
MPI's Non-Blocking Operations
MPI_Test(request, flag, status)
89
Multiple completions
90
Fairness
What happens with this program:
#include "mpi.h"
#include <stdio.h>
int main(argc, argv)
int argc;
char **argv;
{
int rank, size, i, buf[1];
MPI_Status status;
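The rest of the program is not shown above. In the usual form of this example (an assumption, not the verbatim slide), every process except 0 sends a stream of messages to process 0, which receives them with MPI_ANY_SOURCE; the question is whether the sources are served fairly:

MPI_Init( &argc, &argv );
MPI_Comm_rank( MPI_COMM_WORLD, &rank );
MPI_Comm_size( MPI_COMM_WORLD, &size );
if (rank == 0) {
    /* receive 100 messages from each of the other processes, in any order */
    for (i = 0; i < 100 * (size - 1); i++) {
        MPI_Recv( buf, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                  MPI_COMM_WORLD, &status );
        printf( "Msg from %d with tag %d\n",
                status.MPI_SOURCE, status.MPI_TAG );
    }
}
else {
    for (i = 0; i < 100; i++)
        MPI_Send( buf, 1, MPI_INT, 0, i, MPI_COMM_WORLD );
}
MPI_Finalize();
return 0;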
91
Fairness in message-passing
92
Providing Fairness
One alternative is
#define large 128
MPI_Request requests[large];
MPI_Status statuses[large];
int indices[large];
int buf[large];
for (i=1; i<size; i++)
MPI_Irecv( buf+i, 1, MPI_INT, i,
MPI_ANY_TAG, MPI_COMM_WORLD, &requests[i-1] );
while (!done) {
MPI_Waitsome( size-1, requests, &ndone, indices, statuses );
for (i=0; i<ndone; i++) {
j = indices[i];
printf( "Msg from %d with tag %d\n",
statuses[i].MPI_SOURCE,
statuses[i].MPI_TAG );
MPI_Irecv( buf+j+1, 1, MPI_INT, j+1,   /* request j corresponds to source j+1 */
MPI_ANY_TAG, MPI_COMM_WORLD, &requests[j] );
}
}
93
Providing Fairness (Fortran)
One alternative is
parameter( large = 128 )
integer requests(large)
integer statuses(MPI_STATUS_SIZE,large)
integer indices(large)
integer buf(large)
logical done
do 10 i = 1,size-1
10 call MPI_Irecv( buf(i), 1, MPI_INTEGER, i,
* MPI_ANY_TAG, MPI_COMM_WORLD, requests(i), ierr )
20 if (.not. done) then
call MPI_Waitsome( size-1, requests, ndone,
*                  indices, statuses, ierr )
do 30 i=1, ndone
j = indices(i)
print *, 'Msg from ', statuses(MPI_SOURCE,i), ' with tag',
* statuses(MPI_TAG,i)
call MPI_Irecv( buf(j), 1, MPI_INTEGER, j,
*               MPI_ANY_TAG, MPI_COMM_WORLD, requests(j), ierr )
done = ...
30 continue
goto 20
endif
94
Exercise - Fairness
95
More on nonblocking communication
96
Communication Modes
97
Buffered Send
MPI provides a send routine that may be used when
MPI_Isend is awkward to use (e.g., lots of small
messages).
MPI_Bsend makes use of a user-provided buffer to save
any messages that can not be immediately sent.
int bufsize;
char *buf = malloc(bufsize);
MPI_Buffer_attach( buf, bufsize );
...
MPI_Bsend( ... same as MPI_Send ... );
...
MPI_Buffer_detach( &buf, &bufsize );
98
Reusing the same buffer
Consider a loop
MPI_Buffer_attach( buf, bufsize );
while (!done) {
...
MPI_Bsend( ... );
}
99
Other Point-to-Point Features
MPI_SENDRECV, MPI_SENDRECV_REPLACE
MPI_CANCEL
Persistent communication requests
100
Datatypes and Heterogeneity
101
Datatypes in MPI
102
Basic Datatypes (Fortran)
103
Basic Datatypes (C)
104
Vectors
29 30 31 32 33 34 35
22 23 24 25 26 27 28
15 16 17 18 19 20 21
8 9 10 11 12 13 14
1 2 3 4 5 6 7
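The explanatory text for this slide is not reproduced above. As a sketch (the 5 x 7 shape matches the grid in the figure; the destination and tag are illustrative), a strided set of elements, such as a column of a C matrix, can be described with MPI_Type_vector:

double a[5][7];
MPI_Datatype coltype;

/* 5 blocks of 1 double each, successive blocks 7 doubles apart */
MPI_Type_vector( 5, 1, 7, MPI_DOUBLE, &coltype );
MPI_Type_commit( &coltype );
/* send column 2 of the matrix as a single message */
MPI_Send( &a[0][2], 1, coltype, 1, 0, MPI_COMM_WORLD );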
105
Structures
106
Example: Structures
struct {
char display[50]; /* Name of display */
int maxiter; /* max # of iterations */
double xmin, ymin; /* lower left corner of rectangle */
double xmax, ymax; /* upper right corner */
int width; /* of display in pixels */
int height; /* of display in pixels */
} cmdline;
/* set up 4 blocks */
int blockcounts[4] = {50,1,4,2};
MPI_Datatype types[4];
MPI_Aint displs[4];
MPI_Datatype cmdtype;
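The rest of this example is not shown above. A sketch of how the four blocks declared here might be described and committed, using the MPI-1 routines MPI_Address and MPI_Type_struct (the displacement setup is an assumption based on the struct members):

/* the four blocks: 50 chars, 1 int, 4 doubles, 2 ints */
types[0] = MPI_CHAR;   types[1] = MPI_INT;
types[2] = MPI_DOUBLE; types[3] = MPI_INT;

MPI_Address( &cmdline.display, &displs[0] );
MPI_Address( &cmdline.maxiter, &displs[1] );
MPI_Address( &cmdline.xmin,    &displs[2] );
MPI_Address( &cmdline.width,   &displs[3] );
displs[3] -= displs[0];     /* make displacements relative to the start */
displs[2] -= displs[0];
displs[1] -= displs[0];
displs[0] = 0;

MPI_Type_struct( 4, blockcounts, displs, types, &cmdtype );
MPI_Type_commit( &cmdtype );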
107
Strides
(figure: the extent of a datatype runs from its lower bound LB to its upper bound UB)
108
Vectors revisited
109
Structures revisited
110
Interleaving data
111
An interleaved datatype
112
Scattering a Matrix
113
Exercises - datatypes
Objective: Learn about datatypes
1. Write a program to send rows of a matrix (stored
in column-major form) to the other processors.
Let processor 0 have the entire matrix, which has
as many rows as processors.
Processor 0 sends row i to processor i.
Processor i reads that row into a local array that
holds only that row. That is, processor 0 has a
matrix A(N,M) while the other processors have a
row B(M).
(a) Write the program to handle the case where
the matrix is square.
(b) Write the program to handle a number of
columns read from the terminal.
C programmers may send columns of a matrix
stored in row-major form if they prefer.
If you have time, try one of the following. If you
don't have time, think about how you would
program these.
2. Write a program to transpose a matrix, where
each processor has a part of the matrix. Use
topologies to define a 2-Dimensional partitioning
114
of the matrix across the processors, and assume
that all processors have the same size submatrix.
(a) Use MPI_Send and MPI_Recv to send the block,
then transpose the block.
(b) Use MPI_Sendrecv instead.
(c) Create a datatype that allows you to receive
the block already transposed.
3. Write a program to send the "ghostpoints" of a
2-Dimensional mesh to the neighboring
processors. Assume that each processor has the
same size subblock.
(a) Use topologies to find the neighbors
(b) Define a datatype for the "rows"
(c) Use MPI_Sendrecv or MPI_Irecv and MPI_Send
with MPI_Waitall.
(d) Use MPI_Isend and MPI_Irecv to start the
communication, do some computation on the
interior, and then use MPI_Waitany to process
the boundaries as they arrive
The same approach works for general
datastructures, such as unstructured meshes.
4. Do 3, but for 3-Dimensional meshes. You will
need MPI_Type_hvector.
Tools for writing libraries
115
Private communicators
116
Attributes
117
What is an attribute?
118
Examples of using attributes
119
Sequential Sections
#include "mpi.h"
#include <stdlib.h>
/*@
MPE_Seq_begin - Begins a sequential section of code.
Input Parameters:
. comm - Communicator to sequentialize.
. ng - Number in group. This many processes are allowed
to execute
at the same time. Usually one.
@*/
void MPE_Seq_begin( comm, ng )
MPI_Comm comm;
int ng;
{
int lidx, np;
int flag;
MPI_Comm local_comm;
MPI_Status status;
120
Sequential Sections II
121
Sequential Sections III
/*@
MPE_Seq_end - Ends a sequential section of code.
Input Parameters:
. comm - Communicator to sequentialize.
. ng - Number in group.
@*/
void MPE_Seq_end( comm, ng )
MPI_Comm comm;
int ng;
{
int lidx, np, flag;
MPI_Status status;
MPI_Comm local_comm;
122
Comments on sequential sections
123
Example: Managing tags
124
Caching tags on communicator
#include "mpi.h"
/*
Private routine to delete internal storage when a
communicator is freed.
*/
int MPE_DelTag( comm, keyval, attr_val, extra_state )
MPI_Comm *comm;
int *keyval;
void *attr_val, *extra_state;
{
free( attr_val );
return MPI_SUCCESS;
}
125
Caching tags on communicator II
/*@
MPE_GetTags - Returns tags that can be used in
communication with a
communicator
Input Parameters:
. comm_in - Input communicator
. ntags - Number of tags
Output Parameters:
. comm_out - Output communicator. May be 'comm_in'.
. first_tag - First tag available
@*/
int MPE_GetTags( comm_in, ntags, comm_out, first_tag )
MPI_Comm comm_in, *comm_out;
int ntags, *first_tag;
{
int mpe_errno = MPI_SUCCESS;
int tagval, *tagvalp, *maxval, flag;
if (MPE_Tag_keyval == MPI_KEYVAL_INVALID) {
MPI_Keyval_create( MPI_NULL_COPY_FN, MPE_DelTag,
&MPE_Tag_keyval, (void *)0 );
}
126
Caching tags on communicator III
if (!flag) {
/* This communicator is not yet known to this system,
so we
dup it and setup the first value */
MPI_Comm_dup( comm_in, comm_out );
comm_in = *comm_out;
MPI_Attr_get( MPI_COMM_WORLD, MPI_TAG_UB, &maxval,
&flag );
tagvalp = (int *)malloc( 2 * sizeof(int) );
printf( "Mallocing address %x\n", tagvalp );
if (!tagvalp) return MPI_ERR_EXHAUSTED;
*tagvalp = *maxval;
MPI_Attr_put( comm_in, MPE_Tag_keyval, tagvalp );
return MPI_SUCCESS;
}
127
Caching tags on communicator IV
*comm_out = comm_in;
if (*tagvalp < ntags) {
/* Error, out of tags. Another solution would be to do
an MPI_Comm_dup. */
return MPI_ERR_INTERN;
}
*first_tag = *tagvalp - ntags;
*tagvalp = *first_tag;
return MPI_SUCCESS;
}
128
Caching tags on communicator V
/*@
MPE_ReturnTags - Returns tags allocated with MPE_GetTags.
Input Parameters:
. comm - Communicator to return tags to
. first_tag - First of the tags to return
. ntags - Number of tags to return.
@*/
int MPE_ReturnTags( comm, first_tag, ntags )
MPI_Comm comm;
int first_tag, ntags;
{
int *tagvalp, flag, mpe_errno;
if (!flag) {
/* Error, attribute does not exist in this communicator
*/
return MPI_ERR_OTHER;
}
if (*tagvalp == first_tag)
*tagvalp = first_tag + ntags;
return MPI_SUCCESS;
}
129
Caching tags on communicator VI
/*@
MPE_TagsEnd - Returns the private keyval.
@*/
int MPE_TagsEnd()
{
MPI_Keyval_free( &MPE_Tag_keyval );
MPE_Tag_keyval = MPI_KEYVAL_INVALID;
}
130
Commentary
131
Exercise - Writing libraries
Objective: Use private communicators and attributes
Write a routine to circulate data to the next process,
using a nonblocking send and receive operation.
void Init_pipe( comm )
void ISend_pipe( comm, bufin, len, datatype, bufout )
void Wait_pipe( comm )
A typical use is
Init_pipe( MPI_COMM_WORLD )
for (i=0; i<n; i++) {
ISend_pipe( comm, bufin, len, datatype, bufout );
Do_Work( bufin, len );
Wait_pipe( comm );
t = bufin; bufin = bufout; bufout = t;
}
132
MPI Objects
133
The MPI Objects
134
When should objects be freed?
135
Reference counting
136
Why reference counts
137
Tools for evaluating programs
138
The MPI Timer
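The body of this slide is not shown above. MPI provides MPI_Wtime, which returns wall-clock time in seconds, and MPI_Wtick, which returns the timer's resolution; a minimal sketch of timing a section of code:

double t1, t2;
t1 = MPI_Wtime();
/* ... work to be timed ... */
t2 = MPI_Wtime();
printf( "That took %f seconds (timer resolution %g)\n", t2 - t1, MPI_Wtick() );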
139
Profiling
(figure: the user's call to MPI_Bcast is intercepted by a profiling wrapper, which then calls PMPI_Bcast in the MPI library)
140
Writing profiling routines
The MPICH implementation contains a program for
writing wrappers.
This description will write out each MPI routine that
is called:
#ifdef MPI_BUILD_PROFILING
#undef MPI_BUILD_PROFILING
#endif
#include <stdio.h>
#include "mpi.h"
{{fnall fn_name}}
{{vardecl int llrank}}
PMPI_Comm_rank( MPI_COMM_WORLD, &llrank );
printf( "[%d] Starting {{fn_name}}...\n",
llrank ); fflush( stdout );
{{callfn}}
printf( "[%d] Ending {{fn_name}}\n", llrank );
fflush( stdout );
{{endfnall}}
The command
wrappergen -w trace.w -o trace.c
converts this to a C program. Then compile the file
`trace.c' and insert the resulting object file into your
link line:
cc -o a.out a.o ... trace.o -lpmpi -lmpi
141
Another profiling example
This version counts all calls and the number of bytes sent with
MPI_Send, MPI_Bsend, or MPI_Isend.
#include "mpi.h"
{{callfn}}
{{this_fn_name}}_ncalls_{{fileno}}++;
{{endfnall}}
{{callfn}}
{{endfn}}
142
Another profiling example (cont.)
143
Generating and viewing log les
144
Generating a log le
145
Connecting several programs together
146
Sending messages between different programs
(figure: two groups of processes, Comm1 and Comm2, within MPI_COMM_WORLD, connected by an intercommunicator)
147
Exchanging data between programs
Form intercommunicator
(MPI_INTERCOMM_CREATE)
Send data
MPI_Send( ..., 0, intercomm )
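A sketch of these two steps (local_comm, remote_leader, buf, count, and the tag values are illustrative assumptions; MPI_Intercomm_create is the routine named above):

MPI_Comm intercomm;

/* the local leader is rank 0 of local_comm; the remote leader is identified
   by its rank in the peer communicator (here MPI_COMM_WORLD) */
MPI_Intercomm_create( local_comm, 0, MPI_COMM_WORLD, remote_leader,
                      1, &intercomm );
/* in an intercommunicator, the destination rank refers to the remote group */
MPI_Send( buf, count, MPI_INT, 0, 0, intercomm );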
148
Collective operations
149
Final Comments
150
Sharable MPI Resources
The Standard itself:
- As a Technical report: U. of Tennessee report
- As postscript for ftp: at info.mcs.anl.gov in pub/mpi/mpi-report.ps.
151
Sharable MPI Resources, continued
Newsgroup:
- comp.parallel.mpi
Mailing lists:
- [email protected]: the MPI Forum discussion list.
- [email protected]: the implementors' discussion list.
Implementations available by ftp:
- MPICH is available by anonymous ftp from info.mcs.anl.gov in the directory pub/mpi/mpich, file mpich.tar.Z.
152
MPI-2
The MPI Forum (with old and new participants)
has begun a follow-on series of meetings.
Goals
- clarify existing draft
- provide features users have requested
- make extensions, not changes
Major Topics being considered
- dynamic process management
- client/server
- real-time extensions
- "one-sided" communication (put/get, active messages)
- portable access to MPI system state (for debuggers)
- language bindings for C++ and Fortran-90
Schedule
- Dynamic processes, client/server by SC '95
- MPI-2 complete by SC '96
153
Summary
154