Module5 MPI
Uploaded by saif.nalband

MPI

Saif Nalband

Sample Parallel Programming Models
▪ Shared Memory Programming
– Processes share memory address space (threads model)
– Application ensures no data corruption (Lock/Unlock)
▪ Transparent Parallelization
– Compiler works magic on sequential programs
▪ Directive-based Parallelization
– Compiler needs help (e.g., OpenMP)
▪ Message Passing
– Explicit communication between processes (like sending and receiving emails)

The Message-Passing Model
▪ A process is (traditionally) a program counter and address space.
▪ Processes may have multiple threads (program counters and associated stacks)
sharing a single address space. MPI is for communication among processes, which
have separate address spaces.
▪ Inter-process communication consists of
– synchronization
– movement of data from one process’s address space to another’s.

    Process  <--- MPI --->  Process
spcl.inf.ethz.ch
@spcl_eth

The Message-Passing Model (an example)


▪ Each process has to send/receive data to/from other processes
▪ Example: Sorting Integers

Process1 (serial sort: O(N log N)):
8 23 19 67 45 35 1 24 13 30 3 5

Process1: 8 19 23 35 45 67        Process2: 1 3 5 13 24 30
(each process sorts its half: O(N/2 log N/2))

Process1 (merge: O(N)):
1 3 5 8 13 19 23 24 30 35 45 67


What is MPI?
▪ MPI: Message Passing Interface
– The MPI Forum organized in 1992 with broad participation by:
• Vendors: IBM, Intel, TMC, SGI, Convex, Meiko
• Portability library writers: PVM, p4
• Users: application scientists and library writers
• MPI-1 finished in 18 months
– Incorporates the best ideas in a “standard” way
• Each function takes fixed arguments
• Each function has fixed semantics
– Standardizes what the MPI implementation provides and what the application can and cannot expect
– Each system can implement it differently as long as the semantics match
▪ MPI is not…
– a language or compiler specification
– a specific implementation or product
MPI
• MPI is actually just an Application Programming Interface (API). As such, MPI
• specifies what a call to each routine should look like, and how each routine should behave, but
• does not specify how each routine should be implemented, and is sometimes intentionally vague about certain aspects of a routine's behavior;
• implementations are often platform/vendor specific, and
• has multiple open-source and proprietary implementations.


Reasons for Using MPI

▪ Standardization - MPI is the only message passing library which can be considered a standard. It is supported on virtually all HPC platforms. Practically, it has replaced all previous message passing libraries.
▪ Portability - There is no need to modify your source code when you port your application to a different platform that supports (and is compliant with) the MPI standard.
▪ Performance Opportunities - Vendor implementations should be able to exploit native hardware features to optimize performance.
▪ Functionality - Rich set of features.
▪ Availability - A variety of implementations are available, both vendor and public domain.
– MPICH/Open MPI are popular open-source and free implementations of MPI
– Vendors and other collaborators take MPICH and add support for their systems
• Intel MPI, IBM Blue Gene MPI, Cray MPI, Microsoft MPI, MVAPICH, MPICH-MX

Important Note: All parallelism is explicit. The programmer is responsible for correctly identifying parallelism and implementing parallel algorithms using MPI constructs.


MPI Basic Send/Receive


▪ Simple communication model
Process 0 Process 1
Send(data)
Receive(data)

▪ Application needs to specify to the MPI implementation:


1. How do you compile and run an MPI application?
2. How will processes be identified?
3. How will “data” be described?


Process Identification
▪ MPI processes can be collected into groups
▪ Each group can have multiple colors (sometimes called context)
▪ Group + color == communicator (it is like a name for the group)
▪ When an MPI application starts, the group of all processes is initially given a predefined name called
MPI_COMM_WORLD
▪ The same group can have many names, but simple programs do not have to worry about multiple
names
▪ A process is identified by a unique number within each communicator, called rank
▪ For two different communicators, the same process can have two different ranks: so the
meaning of a “rank” is only defined when you specify the communicator


Communicators
mpiexec -np 16 ./test

▪ When you start an MPI program, there is one predefined communicator, MPI_COMM_WORLD, which contains all 16 processes (ranks 0-15)
▪ Communicators do not need to contain all processes in the system
▪ Every process in a communicator has an ID called its "rank"
▪ You can make copies of this communicator (same group of processes, but different "aliases")
▪ The same process might have different ranks in different communicators
▪ Communicators can be created "by hand" or using tools provided by MPI (not discussed in this tutorial)
▪ Simple programs typically only use the predefined communicator MPI_COMM_WORLD

Simple MPI Program Identifying Processes

/* Basic requirements for an MPI program */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char ** argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("I am %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}

Data Communication
▪ Data communication in MPI is like email exchange
– One process sends a copy of the data to another process (or a group of processes), and the other process
receives it
▪ Communication requires the following information:
– Sender has to know:
• Whom to send the data to (receiver’s process rank)
• What kind of data to send (100 integers or 200 characters, etc)
• A user-defined “tag” for the message (think of it as an email subject; allows the receiver to understand what
type of data is being received)
– Receiver “might” have to know:
• Who is sending the data (OK if the receiver does not know; in this case sender rank will be MPI_ANY_SOURCE,
meaning anyone can send)
• What kind of data is being received (partial information is OK: I might receive up to 1000 integers)
• What the user-defined “tag” of the message is (OK if the receiver does not know; in this case tag will be
MPI_ANY_TAG)

More Details on Using Ranks for Communication


▪ When sending data, the sender has to specify the destination
process’ rank
– Tells where the message should go
▪ The receiver has to specify the source process’ rank
– Tells where the message will come from
▪ MPI_ANY_SOURCE is a special “wild-card” source that can be
used by the receiver to match any source


More Details on Describing Data for Communication


▪ MPI Datatype is very similar to a C or Fortran datatype
– int → MPI_INT
– double → MPI_DOUBLE
– char → MPI_CHAR
▪ More complex datatypes are also possible:
– E.g., you can create a structure datatype that comprises other datatypes → a char, an int and a double.
– Or, a vector datatype for the columns of a matrix
▪ The “count” in MPI_SEND and MPI_RECV refers to how many datatype elements should
be communicated


More Details on User “Tags” for Communication


▪ Messages are sent with an accompanying user-defined integer tag, to assist the
receiving process in identifying the message
▪ For example, if an application is expecting two types of messages from a peer, tags
can help distinguish these two types
▪ Messages can be screened at the receiving end by specifying a specific tag
▪ MPI_ANY_TAG is a special “wild-card” tag that can be used by the receiver to match any
tag


MPI Basic (Blocking) Send


MPI_SEND(buf, count, datatype, dest, tag, comm)

▪ The message buffer is described by (buf, count, datatype).


▪ The target process is specified by dest and comm.
– dest is the rank of the target process in the communicator specified by comm.

▪ tag is a user-defined “type” for the message

▪ When this function returns, the data has been delivered to the system and the buffer can
be reused.
– The message may not have been received by the target process.


MPI Send Modes


MPI Basic (Blocking) Receive


MPI_RECV(buf, count, datatype, source, tag, comm, status)

▪ Waits until a matching (on source, tag, comm) message is received from the system, and
the buffer can be used.
▪ source is rank in communicator comm, or MPI_ANY_SOURCE.
▪ Receiving fewer than count occurrences of datatype is OK, but receiving more is an
error.
▪ status contains further information:
– Who sent the message (can be used if you used MPI_ANY_SOURCE)
– How much data was actually received
– What tag was used with the message (can be used if you used MPI_ANY_TAG)
– MPI_STATUS_IGNORE can be used if we don’t need any additional information

Simple Communication in MPI


#include <mpi.h>
#include <stdio.h>

int main(int argc, char ** argv)


{
int rank, data[100];

MPI_Init(&argc, &argv);

MPI_Comm_rank(MPI_COMM_WORLD, &rank);

if (rank == 0)
MPI_Send(data, 100, MPI_INT, 1, 0, MPI_COMM_WORLD);
else if (rank == 1)
MPI_Recv(data, 100, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

MPI_Finalize();
return 0;
}

Parallel Sort using MPI Send/Recv


Rank 0 (holds all N elements):
8 23 19 67 45 35 1 24 13 30 3 5

Rank 0: 8 19 23 35 45 67          Rank 1: 1 3 5 13 24 30
(second half sent to Rank 1; each half sorted in O(N/2 log N/2))

Rank 0 (after receiving the sorted half back):
8 19 23 35 45 67 1 3 5 13 24 30

Rank 0 (after the O(N) merge):
1 3 5 8 13 19 23 24 30 35 45 67


Parallel Sort using MPI Send/Recv (contd.)



#include <mpi.h>
#include <stdio.h>

void sort(int *a, int n);   /* assumed user-provided sorting routine */

int main(int argc, char ** argv)
{
    int rank;
    int a[1000], b[500];
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        MPI_Send(&a[500], 500, MPI_INT, 1, 0, MPI_COMM_WORLD);
        sort(a, 500);   /* sort the first half locally */
        MPI_Recv(b, 500, MPI_INT, 1, 0, MPI_COMM_WORLD, &status);

        /* Serial: merge array b and the sorted first half of array a */
    }
    else if (rank == 1) {
        MPI_Recv(b, 500, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        sort(b, 500);
        MPI_Send(b, 500, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}

Status Object
▪ The status object is used after completion of a receive to find the actual length, source, and tag of a message
▪ Status object is an MPI-defined type and provides information about:
– The source process for the message (status.MPI_SOURCE)
– The message tag (status.MPI_TAG)
– Error status (status.MPI_ERROR)

▪ The number of elements received is given by:

MPI_Get_count(MPI_Status *status, MPI_Datatype datatype, int *count)

  status     return status of receive operation (status)
  datatype   datatype of each receive buffer element (handle)
  count      number of received elements (integer) (OUT)


Using the “status” field


▪ Each “worker process” computes some task (maximum 100 elements) and sends it to the
“master” process together with its group number: the “tag” field can be used to represent
the task
– Data count is not fixed (maximum 100 elements)
– Order in which workers send output to master is not fixed (different workers = different src ranks, and
different tasks = different tags)


Using the “status” field (contd.)


#include <mpi.h>
#include <stdio.h>

int main(int argc, char ** argv)


{
[...snip...]

    if (rank != 0)
        MPI_Send(data, rand() % 100, MPI_INT, 0, group_id, MPI_COMM_WORLD);
    else {
        for (i = 0; i < size - 1; i++) {
            MPI_Recv(data, 100, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
            MPI_Get_count(&status, MPI_INT, &count);
            printf("worker ID: %d; task ID: %d; count: %d\n", status.MPI_SOURCE, status.MPI_TAG, count);
        }
    }

[...snip...]
}


MPI is Simple
▪ Many parallel programs can be written using just these six functions, only two of which are non-trivial:
– MPI_INIT – initialize the MPI library (must be the first routine called)
– MPI_COMM_SIZE – get the size of a communicator
– MPI_COMM_RANK – get the rank of the calling process in the communicator
– MPI_SEND – send a message to another process
– MPI_RECV – receive a message from another process
– MPI_FINALIZE – clean up all MPI state (must be the last MPI function called by a process)

▪ For performance, however, you need to use other MPI features

Some of the simplest and most common communication routines are:
• MPI_Send() sends a message from the current process to another
process (the destination).
• MPI_Recv() receives a message on the current process from another
process (the source).
• MPI_Bcast() broadcasts a message from one process to all of the
others.
• MPI_Reduce() performs a reduction (e.g. a global sum, maximum,
etc.) of a variable in all processes, with the result ending up in a
single process.
• MPI_Allreduce() performs a reduction of a variable in all processes, with the result ending up in all processes.


Compiling MPI programs with MPICH


▪ Compilation Wrappers
– For C programs: mpicc test.c -o test
– For C++ programs: mpicxx test.cpp -o test
– For Fortran 77 programs: mpif77 test.f -o test
– For Fortran 90 programs: mpif90 test.f90 -o test

▪ You can link other libraries as required too
– To link to a math library: mpicc test.c -o test -lm

▪ You can just assume that "mpicc" and friends have replaced your regular compilers (gcc, gfortran, etc.)


Running MPI programs with MPICH


▪ Launch 16 processes on the local node:
– mpiexec -np 16 ./test
▪ Launch 16 processes on 4 nodes (each has 4 cores)
– mpiexec -hosts h1:4,h2:4,h3:4,h4:4 -np 16 ./test
• Runs the first four processes on h1, the next four on h2, etc.
– mpiexec -hosts h1,h2,h3,h4 -np 16 ./test
• Runs the first process on h1, the second on h2, etc., and wraps around
• So, h1 will have the 1st, 5th, 9th and 13th processes
▪ If there are many nodes, it might be easier to create a host file
– cat hf
  h1:4
  h2:2
– mpiexec -hostfile hf -np 16 ./test

Hello World!!!

#include <stdio.h>
#include <mpi.h>
int main ( int argc, char *argv[] )
{
    int rank;
    int number_of_processes;
    MPI_Init( &argc, &argv );
    MPI_Comm_size( MPI_COMM_WORLD, &number_of_processes );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    printf( "hello from process %d of %d\n",
            rank, number_of_processes );
    MPI_Finalize();
    return 0;
}
• All MPI processes (normally) run the same executable
• Each MPI process knows which rank it is
• Each MPI process knows how many processes are part of the
same job
• The processes run in a non-deterministic order
Recall the MPI initialization sequence:
• MPI_Init( &argc, &argv );
• MPI_Comm_size( MPI_COMM_WORLD, &number_of_processes );
• MPI_Comm_rank( MPI_COMM_WORLD, &rank );
• MPI uses communicators to organize how processes communicate
with each other.
• A single communicator, MPI_COMM_WORLD, is created by MPI_Init()
and all the processes running the program have access to it.
• Note that process ranks are relative to a communicator. A program may
have multiple communicators; if so, a process may have multiple
ranks, one for each communicator it is associated with.

MPI is (usually) SPMD
• Usually MPI is run in SPMD (Single Program, Multiple Data) mode.
(It is possible to run multiple programs, i.e. MPMD).
• The program can use its rank to determine its role:

• as shown here, often the rank 0 process plays the role of server or
process coordinator.
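The code figure referenced above is missing from this copy; the following is a minimal sketch (assuming the usual rank-0-as-coordinator pattern, with illustrative printfs standing in for real work):

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* rank 0 plays the role of server / coordinator */
        printf("server: coordinating the job\n");
    } else {
        /* all other ranks act as clients / workers */
        printf("client %d: doing work\n", rank);
    }

    MPI_Finalize();
    return 0;
}
```

Every process runs the same executable; the branch on rank is what differentiates their roles.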
A second MPI program: greeting.c
• The next several slides show the source code for an MPI program
that works on a client-server model.
• When the program starts, it initializes the MPI system then
determines if it is the server process (rank 0) or a client
process.
• Each client process will construct a string message and send it to
the server.
• The server will receive and display messages from the clients one-by-one

#include <stdio.h>
#include <mpi.h>

const int SERVER_RANK = 0;
const int MESSAGE_TAG = 0;

void do_server_work( int number_of_processes );
void do_client_work( int rank );

int main ( int argc, char *argv[] )
{
    int rank, number_of_processes;
    MPI_Init( &argc, &argv );
    MPI_Comm_size( MPI_COMM_WORLD, &number_of_processes );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    if ( rank == SERVER_RANK )
        do_server_work( number_of_processes );
    else
        do_client_work( rank );
    MPI_Finalize();
    return 0;
}

void do_server_work( int number_of_processes )
{
    const int max_message_length = 256;
    char message[max_message_length];
    int src;
    MPI_Status status;
    for ( src = 0; src < number_of_processes; src++ )
    {
        if ( src != SERVER_RANK )
        {
            MPI_Recv( message, max_message_length, MPI_CHAR,
                      src, MESSAGE_TAG, MPI_COMM_WORLD, &status );
            printf( "Received: %s\n", message );
        }
    }
}
void do_client_work( int rank )
{
    const int max_message_length = 256;
    char message[max_message_length];
    int message_length;
    message_length = sprintf( message, "Greetings from process %d", rank );
    message_length++;   /* add one for the null char */
    MPI_Send( message, message_length, MPI_CHAR,
              SERVER_RANK, MESSAGE_TAG, MPI_COMM_WORLD );
}

• Note: the server process (rank 0) does not send a message, but does
display the contents of messages received from the other processes.
• mpirun can be used rather than mpiexec.
• the arguments to mpiexec vary between MPI implementations.
• mpiexec (or mpirun) may not be available.
Deterministic operation?
• You may have noticed that in the four-process case the greeting messages were printed out in-order. Does this mean that the order the messages were sent is deterministic? Look again at the loop that carries out the server's work:

• The server loops over values of src from 0 to number_of_processes - 1, skipping the server's own rank. The fourth argument to MPI_Recv() is the rank of the process the message is received from. Here the messages are received in increasing rank order, regardless of the sending order.
Non-deterministic receive order
• By making one small change, we can allow the messages to be received in any
order. The constant MPI_ANY_SOURCE can be used in the MPI_Recv() function
to indicate that the next available message with the correct tag should be read,
regardless of its source.

• Note: it is possible to use the data returned in status to determine the message's source.
ERROR HANDLING IN MPI
• Nearly all MPI functions return an integer status code:
  • MPI_SUCCESS if the function completed without error,
  • otherwise an error code is returned.
• Most examples you find on the web and in textbooks do not check the MPI function return status value...
• ...but this should be done in production code.
• It can certainly help to avoid errors during development.

Sample MPI error handler
• Here is a sample MPI error handler. If called with a non-successful status value, it displays a string corresponding to the error and then exits; all errors are treated as fatal.
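The handler itself did not survive this copy; a sketch consistent with the description (the function name and layout are assumptions, the MPI calls are standard):

```c
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

/* Abort the job if an MPI call did not return MPI_SUCCESS.
   All errors are treated as fatal. */
void check_mpi_status( int status )
{
    if ( status != MPI_SUCCESS )
    {
        char error_string[MPI_MAX_ERROR_STRING];
        int length;
        MPI_Error_string( status, error_string, &length );
        fprintf( stderr, "MPI error: %s\n", error_string );
        MPI_Abort( MPI_COMM_WORLD, status );
        exit( EXIT_FAILURE );
    }
}
```

Typical usage wraps each call: `check_mpi_status( MPI_Send( ... ) );`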

MPI point-to-point communication routines
• MPI provides two main routines for point-to-point communication
• MPI_Send(): Send to a message to another process
• MPI_Recv(): Receive a message from another process
• Both of these have several variants that we'll mention here and see some of later.

The calling sequence for MPI_Send() is
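The figure with the calling sequence is missing from this copy; the standard prototype is:

```c
int MPI_Send( const void *buf,       /* initial address of send buffer */
              int count,             /* number of elements to send     */
              MPI_Datatype datatype, /* datatype of each element       */
              int dest,              /* rank of destination process    */
              int tag,               /* message tag                    */
              MPI_Comm comm );       /* communicator                   */
```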

• The MPI_Send() function initiates a blocking send. Here blocking does not indicate that the sender waits for the message to be received, but rather that the sender waits for the message to be accepted by the MPI system.
• It does mean that once this function returns, the send buffer may be changed without impacting the send operation.
The calling sequence for MPI_Recv() is
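The figure with the calling sequence is missing from this copy; the standard prototype is:

```c
int MPI_Recv( void *buf,             /* initial address of receive buffer     */
              int count,             /* maximum number of elements to receive */
              MPI_Datatype datatype, /* datatype of each element              */
              int source,            /* rank of source (or MPI_ANY_SOURCE)    */
              int tag,               /* message tag (or MPI_ANY_TAG)          */
              MPI_Comm comm,         /* communicator                          */
              MPI_Status *status );  /* status object (or MPI_STATUS_IGNORE)  */
```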

• The MPI_Recv() function initiates a blocking receive. It will not return to its caller until a message with the specified tag is received from the specified source.
• MPI_ANY_SOURCE may be used to indicate the message should be accepted from any source.
• MPI_ANY_TAG may be used to indicate the message should be accepted regardless of its tag.
Point-to-point communication modes
• Standard : Locally blocking, meaning that the routine does not
return until the memory holding the message is available to reuse
(in the case of MPI_Send()) or use (in the case of MPI_Recv()).
• Buffered: In this mode the user supplies buffer space sufficient to
hold an outgoing or incoming message. The routine MPI_Bsend()
returns as soon as the message is copied into the buffer.
• Synchronous: Similar to the standard mode, except MPI_Ssend() will not return until the matching receive has been posted. Essentially this is explicit blocking.
• Ready: Similar to the standard mode, except that it is an error to call
MPI_Rsend() before the matching receive has been posted.

Blocking vs. Non-blocking Communication


▪ MPI_SEND/MPI_RECV are blocking communication calls
– Return of the routine implies completion
– When these calls return the memory locations used in the message transfer can be
safely accessed for reuse
– For “send” completion implies variable sent can be reused/modified
– Modifications will not affect data intended for the receiver
– For "receive" variable received can be read
▪ MPI_ISEND/MPI_IRECV are non-blocking variants
– Routine returns immediately – completion has to be separately tested for
– These are primarily used to overlap computation and communication to improve
performance

Blocking Communication
▪ In blocking communication:
– MPI_SEND does not return until buffer is empty (available for reuse)
– MPI_RECV does not return until buffer is full (available for use)
▪ A process sending data will be blocked until data in the send buffer is emptied
▪ A process receiving data will be blocked until the receive buffer is filled
▪ Exact completion semantics of communication generally depends on the message size
and the system buffer size
▪ Blocking communication is simple to use but can be prone to deadlocks

  If (rank == 0) Then
      Call mpi_send(..)
      Call mpi_recv(..)
  Else
      Call mpi_send(..)
      Call mpi_recv(..)
  Endif

  This usually deadlocks, UNLESS you reverse the send/recv order in one of the branches.
Blocking Send-Receive Diagram

(figure: timeline of a blocking send and its matching receive)
Non-Blocking Communication
▪ Non-blocking (asynchronous) operations return (immediately) "request handles" that can be waited on and queried
– MPI_ISEND(start, count, datatype, dest, tag, comm, request)
– MPI_IRECV(start, count, datatype, src, tag, comm, request)
– MPI_WAIT(request, status)

▪ Non-blocking operations allow overlapping computation and communication


▪ One can also test without waiting using MPI_TEST
– MPI_TEST(request, flag, status)
▪ Anywhere you use MPI_SEND or MPI_RECV, you can use the pair of
MPI_ISEND/MPI_WAIT or MPI_IRECV/MPI_WAIT
▪ Combinations of blocking and non-blocking sends/receives can be used to synchronize execution instead of
barriers


Multiple Completions
▪ It is sometimes desirable to wait on multiple requests:
• MPI_Waitall(count, array_of_requests, array_of_statuses)
• MPI_Waitany(count, array_of_requests, &index, &status)
• MPI_Waitsome(incount, array_of_requests, &outcount, array_of_indices, array_of_statuses)
▪ There are corresponding versions of test for each of these


Non-Blocking Send-Receive Diagram

(figure: timeline of a non-blocking send/receive overlapped with computation)

Message Completion and Buffering


▪ For a communication to succeed:
– Sender must specify a valid destination rank
– Receiver must specify a valid source rank (including MPI_ANY_SOURCE)
– The communicator must be the same
– Tags must match
– Receiver’s buffer must be large enough
▪ A send has completed when the user-supplied buffer can be reused

  /* blocking send */                 /* non-blocking send */
  *buf = 3;                           *buf = 3;
  MPI_Send(buf, 1, MPI_INT, ...);     MPI_Isend(buf, 1, MPI_INT, ...);
  *buf = 4; /* OK, receiver will      *buf = 4; /* Not certain if receiver
       always receive 3 */                 gets 3 or 4 or anything else */
                                      MPI_Wait(...);

▪ Just because the send completes does not mean that the receive has completed
– Message may be buffered by the system
– Message may still be in transit


A Non-Blocking communication example


int main(int argc, char ** argv)
{
    [...snip...]
    if (rank == 0) {
        for (i = 0; i < 100; i++) {
            /* Compute each data element and send it out */
            data[i] = compute(i);
            MPI_Isend(&data[i], 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &request[i]);
        }
        MPI_Waitall(100, request, MPI_STATUSES_IGNORE);
    }
    else {
        for (i = 0; i < 100; i++)
            MPI_Recv(&data[i], 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    [...snip...]
}


Introduction to Collective Operations in MPI


▪ Collective operations are called by all processes in a communicator.
▪ MPI_BCAST distributes data from one process (the root) to all others in a communicator.

▪ MPI_REDUCE combines data from all processes in the communicator and returns it to one
process.
▪ In many numerical algorithms, SEND/RECV can be replaced by BCAST/REDUCE, improving
both simplicity and efficiency.
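As a sketch of that replacement (not from the original slides; the parameter value and partial-result computation are illustrative), a minimal program using both BCAST and REDUCE:

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, n = 0, local, sum = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        n = 100;                 /* root chooses a parameter */

    /* every process receives the value of n from rank 0 */
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

    local = rank * n;            /* each process computes a partial result */

    /* combine all partial results onto rank 0 */
    MPI_Reduce(&local, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum = %d\n", sum);

    MPI_Finalize();
    return 0;
}
```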


MPI Collective Communication


▪ Communication and computation are coordinated among a group of processes in a communicator
▪ Tags are not used; different communicators deliver similar functionality
▪ Non-blocking collective operations in MPI-3
– Covered in the advanced tutorial (but conceptually simple)
▪ Three classes of operations: synchronization, data movement, collective computation

List of MPI collective communication routines


Synchronization
▪ MPI_BARRIER(comm)
– Blocks until all processes in the group of the communicator comm call it
– A process cannot get out of the barrier until all other processes have reached barrier

Scattering data

MPI Scatter() example
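The example code did not survive this copy; the following is a minimal sketch of the usual MPI_Scatter pattern (buffer sizes and fill values are illustrative, not from the original):

```c
#include <stdio.h>
#include <mpi.h>

#define CHUNK 4   /* elements delivered to each process (illustrative) */

int main(int argc, char **argv)
{
    int rank, size, i;
    int sendbuf[64];      /* significant only at the root */
    int recvbuf[CHUNK];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0)        /* root fills the whole array */
        for (i = 0; i < size * CHUNK; i++)
            sendbuf[i] = i;

    /* each rank receives its own CHUNK-element slice of sendbuf */
    MPI_Scatter(sendbuf, CHUNK, MPI_INT,
                recvbuf, CHUNK, MPI_INT, 0, MPI_COMM_WORLD);

    printf("rank %d received %d..%d\n", rank, recvbuf[0], recvbuf[CHUNK - 1]);

    MPI_Finalize();
    return 0;
}
```

Note that the send arguments (sendbuf, sendcount) are only significant at the root; every other rank only needs valid receive arguments.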

MPI Scatter()

Another data distribution function

MPI Alltoall() example
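This example's code is also missing; a sketch of the usual MPI_Alltoall pattern (the encoding of values is illustrative):

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, i;
    int sendbuf[64], recvbuf[64];   /* room for up to 64 processes */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* element i of sendbuf is destined for process i */
    for (i = 0; i < size; i++)
        sendbuf[i] = 100 * rank + i;

    /* every process sends one element to every process;
       recvbuf[i] ends up holding the element process i sent to us */
    MPI_Alltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, MPI_COMM_WORLD);

    for (i = 0; i < size; i++)
        printf("rank %d got %d from rank %d\n", rank, recvbuf[i], i);

    MPI_Finalize();
    return 0;
}
```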

MPI Alltoall()

MPI Gather()

MPI Gather() notes

MPI Gather() example
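The original example code is lost; a sketch of the usual MPI_Gather pattern (the contributed values are illustrative):

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, i;
    int value;
    int gathered[64];     /* significant only at the root */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    value = rank * rank;  /* each process contributes one value */

    /* root collects one element from every rank, stored in rank order */
    MPI_Gather(&value, 1, MPI_INT, gathered, 1, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0)
        for (i = 0; i < size; i++)
            printf("from rank %d: %d\n", i, gathered[i]);

    MPI_Finalize();
    return 0;
}
```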

MPI Allgather()

MPI Allgather() example
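The original example is missing; a sketch of the MPI_Allgather pattern (values are illustrative):

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, i;
    int value, gathered[64];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    value = rank + 1;

    /* like MPI_Gather, but every process (not just a root)
       ends up with the full gathered array */
    MPI_Allgather(&value, 1, MPI_INT, gathered, 1, MPI_INT, MPI_COMM_WORLD);

    for (i = 0; i < size; i++)
        printf("rank %d sees gathered[%d] = %d\n", rank, i, gathered[i]);

    MPI_Finalize();
    return 0;
}
```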

Four more collective communication routines

MPI Scatterv()
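The Scatterv slide's code is missing; a sketch of the vector-scatter pattern, where each rank receives a different count (the distribution "rank i gets i+1 elements" is an illustrative assumption, and the buffers assume a small process count):

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, i;
    int sendbuf[64];                 /* significant only at the root */
    int counts[64], displs[64];
    int recvbuf[64];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* uneven distribution: rank i receives i+1 elements */
    for (i = 0; i < size; i++) {
        counts[i] = i + 1;
        displs[i] = (i * (i + 1)) / 2;   /* running offset into sendbuf */
    }
    if (rank == 0)
        for (i = 0; i < displs[size - 1] + counts[size - 1]; i++)
            sendbuf[i] = i;

    MPI_Scatterv(sendbuf, counts, displs, MPI_INT,
                 recvbuf, counts[rank], MPI_INT, 0, MPI_COMM_WORLD);

    printf("rank %d received %d elements\n", rank, counts[rank]);

    MPI_Finalize();
    return 0;
}
```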

Other collective routines

spcl.inf.ethz.ch
@spcl_eth

Collective Data Movement


int MPI_Bcast( void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm )

  Broadcast:
  P0: A          P0: A
  P1:     -->    P1: A
  P2:            P2: A
  P3:            P3: A

int MPI_Scatter(const void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm)
int MPI_Gather(const void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm)

  Scatter (left to right) / Gather (right to left):
  P0: A B C D        P0: A
  P1:          -->   P1: B
  P2:          <--   P2: C
  P3:                P3: D

More Collective Data Movement


int MPI_Allgather(const void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, MPI_Comm comm)

  Allgather:
  P0: A          P0: A B C D
  P1: B    -->   P1: A B C D
  P2: C          P2: A B C D
  P3: D          P3: A B C D

int MPI_Alltoall(const void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, MPI_Comm comm)

  Alltoall:
  P0: A0 A1 A2 A3          P0: A0 B0 C0 D0
  P1: B0 B1 B2 B3    -->   P1: A1 B1 C1 D1
  P2: C0 C1 C2 C3          P2: A2 B2 C2 D2
  P3: D0 D1 D2 D3          P3: A3 B3 C3 D3

Collective Computation
int MPI_Reduce(const void *sendbuf, void *recvbuf, int count, MPI_Datatype
datatype, MPI_Op op, int root, MPI_Comm comm)

  Reduce:
  P0: A          P0: A+B+C+D
  P1: B    -->   P1:
  P2: C          P2:
  P3: D          P3:

int MPI_Allreduce(const void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)

  Allreduce:
  P0: A          P0: A+B+C+D
  P1: B    -->   P1: A+B+C+D
  P2: C          P2: A+B+C+D
  P3: D          P3: A+B+C+D


int MPI_Scan(const void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)

  Scan (inclusive prefix reduction):
  P0: A          P0: A
  P1: B    -->   P1: A op B
  P2: C          P2: A op B op C
  P3: D          P3: A op B op C op D

▪ MPI_ALLREDUCE, MPI_REDUCE, MPI_REDUCE_SCATTER, and MPI_SCAN take both built-in and user-defined combiner functions


MPI Built-in Collective Computation Operations


▪ MPI_MAX      Maximum
▪ MPI_MIN      Minimum
▪ MPI_PROD     Product
▪ MPI_SUM      Sum
▪ MPI_LAND     Logical and
▪ MPI_LOR      Logical or
▪ MPI_LXOR     Logical exclusive or
▪ MPI_BAND     Bitwise and
▪ MPI_BOR      Bitwise or
▪ MPI_BXOR     Bitwise exclusive or
▪ MPI_MAXLOC   Maximum and location
▪ MPI_MINLOC   Minimum and location

Examples

(The worked code for these slides did not survive this copy: a Scatter/Gather example, an Alltoall example, and a Vector Scatter example with its output.)
ODD-EVEN Sort

Sorting n = 8 elements, using the odd-even transposition
sort algorithm. During each phase, n = 8 elements are
compared.

• After n phases of odd-even exchanges, the sequence is sorted.
• Each phase of the algorithm (either odd or even) requires Θ(n) comparisons.
• Serial complexity is Θ(n^2).
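The serial algorithm described above can be sketched directly (a minimal implementation, not the slides' original code):

```c
/* One phase compares neighbouring pairs starting at `start`
   (0 = even phase, 1 = odd phase) and swaps out-of-order pairs. */
static void odd_even_phase(int a[], int n, int start)
{
    for (int i = start; i + 1 < n; i += 2) {
        if (a[i] > a[i + 1]) {
            int t = a[i];
            a[i] = a[i + 1];
            a[i + 1] = t;
        }
    }
}

/* After n alternating even/odd phases the n-element array is sorted;
   each phase costs Θ(n) comparisons, giving Θ(n^2) total. */
void odd_even_sort(int a[], int n)
{
    for (int phase = 0; phase < n; phase++)
        odd_even_phase(a, n, phase % 2);
}
```

For example, sorting {8, 23, 19, 67, 45, 35, 1, 24} yields {1, 8, 19, 23, 24, 35, 45, 67}.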

Parallel Odd-Even Transposition
• Consider the one item per processor case.
• There are n iterations, in each iteration, each processor does one
compare-exchange.
• The parallel run time of this formulation is Θ(n).
• This is cost optimal with respect to the base serial algorithm but
not the optimal one.

Parallel Odd-Even Transposition
• Consider a block of n/p elements per processor.
• The first step is a local sort.
• In each subsequent step, the compare exchange operation is
replaced by the compare split operation.
• The parallel run time of this formulation is
  Tp = Θ((n/p) log(n/p)) (local sort) + Θ(n) (comparisons) + Θ(n) (communication)
Parallel Odd-Even Transposition
• The parallel formulation is cost-optimal for p = O(log n).
• The isoefficiency function of this parallel formulation is Θ(p 2^p).


Defining your own Collective Operations


▪ Create your own collective computations with:
MPI_OP_CREATE(user_fn, commutes, &op);
MPI_OP_FREE(&op);

user_fn(invec, inoutvec, len, datatype);

▪ The user function should perform:


inoutvec[i] = invec[i] op inoutvec[i];
for i from 0 to len-1

▪ The user function can be non-commutative, but must be associative
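Putting the pieces above together, a minimal sketch of a user-defined operation (elementwise integer sum, chosen for illustration; the function name is an assumption):

```c
#include <stdio.h>
#include <mpi.h>

/* user combiner following the required contract:
   inoutvec[i] = invec[i] op inoutvec[i], for i from 0 to len-1 */
void my_sum(void *invec, void *inoutvec, int *len, MPI_Datatype *datatype)
{
    int *in = (int *)invec, *inout = (int *)inoutvec;
    for (int i = 0; i < *len; i++)
        inout[i] = in[i] + inout[i];
}

int main(int argc, char **argv)
{
    int rank, result;
    MPI_Op op;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Op_create(my_sum, 1 /* commutes */, &op);
    MPI_Reduce(&rank, &result, 1, MPI_INT, op, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("sum of ranks = %d\n", result);
    MPI_Op_free(&op);

    MPI_Finalize();
    return 0;
}
```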

References
• Introduction to Parallel Computing, Chapter 6: 6.1, 6.2, 6.3, 6.4, 6.5, 6.6
