Module5 MPI
Uploaded by saif.nalband

MPI

Saif Nalband

Sample Parallel Programming Models
▪ Shared Memory Programming
– Processes share memory address space (threads model)
– Application ensures no data corruption (Lock/Unlock)
▪ Transparent Parallelization
– Compiler works magic on sequential programs
▪ Directive-based Parallelization
– Compiler needs help (e.g., OpenMP)
▪ Message Passing
– Explicit communication between processes (like sending and receiving emails)

The Message-Passing Model
▪ A process is (traditionally) a program counter and address space.
▪ Processes may have multiple threads (program counters and associated stacks)
sharing a single address space. MPI is for communication among processes, which
have separate address spaces.
▪ Inter-process communication consists of
– synchronization
– movement of data from one process’s address space to another’s.

    Process  <--- MPI --->  Process
spcl.inf.ethz.ch
@spcl_eth

The Message-Passing Model (an example)


▪ Each process has to send/receive data to/from other processes
▪ Example: Sorting Integers

Process1 (serial sort: O(N log N)):
8 23 19 67 45 35 1 24 13 30 3 5

Process1: 8 19 23 35 45 67        Process2: 1 3 5 13 24 30
(each process sorts its half: O(N/2 log N/2))

Process1 (merge: O(N)):
1 3 5 8 13 19 23 24 30 35 45 67


What is MPI?
▪ MPI: Message Passing Interface
– The MPI Forum organized in 1992 with broad participation by:
• Vendors: IBM, Intel, TMC, SGI, Convex, Meiko
• Portability library writers: PVM, p4
• Users: application scientists and library writers
• MPI-1 finished in 18 months
– Incorporates the best ideas in a “standard” way
• Each function takes fixed arguments
• Each function has fixed semantics
– Standardizes what the MPI implementation provides and what the application can and cannot expect
– Each system can implement it differently as long as the semantics match
▪ MPI is not…
– a language or compiler specification
– a specific implementation or product
MPI
• MPI is actually just an Application Programming Interface (API). As such, MPI
• specifies what a call to each routine should look like, and how each routine should behave, but
• does not specify how each routine should be implemented, and is sometimes intentionally vague about certain aspects of a routine's behavior;
• implementations are often platform/vendor specific, and
• has multiple open-source and proprietary implementations.


Reasons for Using MPI

▪ Standardization - MPI is the only message passing library which can be considered a standard. It is supported on virtually all HPC platforms. Practically, it has replaced all previous message passing libraries.
▪ Portability - There is no need to modify your source code when you port your application to a different platform that supports (and is compliant with) the MPI standard.
▪ Performance Opportunities - Vendor implementations should be able to exploit native hardware features to optimize performance.
▪ Functionality - Rich set of features.
▪ Availability - A variety of implementations are available, both vendor and public domain.
– MPICH/Open MPI are popular open-source and free implementations of MPI
– Vendors and other collaborators take MPICH and add support for their systems
• Intel MPI, IBM Blue Gene MPI, Cray MPI, Microsoft MPI, MVAPICH, MPICH-MX

Important Note: All parallelism is explicit. The programmer is responsible for correctly identifying parallelism and implementing parallel algorithms using MPI constructs.


MPI Basic Send/Receive


▪ Simple communication model
Process 0 Process 1
Send(data)
Receive(data)

▪ Application needs to specify to the MPI implementation:


1. How do you compile and run an MPI application?
2. How will processes be identified?
3. How will “data” be described?


Process Identification
▪ MPI processes can be collected into groups
▪ Each group can have multiple colors (sometimes called context)
▪ Group + color == communicator (it is like a name for the group)
▪ When an MPI application starts, the group of all processes is initially given a predefined name called
MPI_COMM_WORLD
▪ The same group can have many names, but simple programs do not have to worry about multiple
names
▪ A process is identified by a unique number within each communicator, called rank
▪ For two different communicators, the same process can have two different ranks: so the
meaning of a “rank” is only defined when you specify the communicator


Communicators
mpiexec -np 16 ./test

▪ When you start an MPI program, there is one predefined communicator, MPI_COMM_WORLD, which contains all 16 processes (ranks 0-15)
▪ Communicators do not need to contain all processes in the system
▪ Every process in a communicator has an ID called its "rank"
▪ You can make copies of this communicator (same group of processes, but different "aliases")
▪ The same process might have different ranks in different communicators
▪ Communicators can be created "by hand" or using tools provided by MPI (not discussed in this tutorial)
▪ Simple programs typically only use the predefined communicator MPI_COMM_WORLD

Simple MPI Program Identifying Processes

/* Basic requirements for an MPI program */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char ** argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("I am %d of %d\n", rank, size);

    MPI_Finalize();
    return 0;
}

Data Communication
▪ Data communication in MPI is like email exchange
– One process sends a copy of the data to another process (or a group of processes), and the other process
receives it
▪ Communication requires the following information:
– Sender has to know:
• Whom to send the data to (receiver’s process rank)
• What kind of data to send (100 integers or 200 characters, etc)
• A user-defined “tag” for the message (think of it as an email subject; allows the receiver to understand what
type of data is being received)
– Receiver “might” have to know:
• Who is sending the data (OK if the receiver does not know; in this case sender rank will be MPI_ANY_SOURCE,
meaning anyone can send)
• What kind of data is being received (partial information is OK: I might receive up to 1000 integers)
• What the user-defined “tag” of the message is (OK if the receiver does not know; in this case tag will be
MPI_ANY_TAG)

More Details on Using Ranks for Communication


▪ When sending data, the sender has to specify the destination
process’ rank
– Tells where the message should go
▪ The receiver has to specify the source process’ rank
– Tells where the message will come from
▪ MPI_ANY_SOURCE is a special “wild-card” source that can be
used by the receiver to match any source


More Details on Describing Data for Communication


▪ MPI Datatype is very similar to a C or Fortran datatype
– int → MPI_INT
– double → MPI_DOUBLE
– char → MPI_CHAR
▪ More complex datatypes are also possible:
– E.g., you can create a structure datatype that comprises other datatypes → a char, an int and a double.
– Or, a vector datatype for the columns of a matrix
▪ The “count” in MPI_SEND and MPI_RECV refers to how many datatype elements should
be communicated


More Details on User “Tags” for Communication


▪ Messages are sent with an accompanying user-defined integer tag, to assist the
receiving process in identifying the message
▪ For example, if an application is expecting two types of messages from a peer, tags
can help distinguish these two types
▪ Messages can be screened at the receiving end by specifying a specific tag
▪ MPI_ANY_TAG is a special “wild-card” tag that can be used by the receiver to match any
tag


MPI Basic (Blocking) Send


MPI_SEND(buf, count, datatype, dest, tag, comm)

▪ The message buffer is described by (buf, count, datatype).


▪ The target process is specified by dest and comm.
– dest is the rank of the target process in the communicator specified by comm.

▪ tag is a user-defined “type” for the message

▪ When this function returns, the data has been delivered to the system and the buffer can
be reused.
– The message may not have been received by the target process.


MPI Send Modes


MPI Basic (Blocking) Receive


MPI_RECV(buf, count, datatype, source, tag, comm, status)

▪ Waits until a matching (on source, tag, comm) message is received from the system, and
the buffer can be used.
▪ source is rank in communicator comm, or MPI_ANY_SOURCE.
▪ Receiving fewer than count occurrences of datatype is OK, but receiving more is an
error.
▪ status contains further information:
– Who sent the message (can be used if you used MPI_ANY_SOURCE)
– How much data was actually received
– What tag was used with the message (can be used if you used MPI_ANY_TAG)
– MPI_STATUS_IGNORE can be used if we don’t need any additional information

Simple Communication in MPI


#include <mpi.h>
#include <stdio.h>

int main(int argc, char ** argv)


{
int rank, data[100];

MPI_Init(&argc, &argv);

MPI_Comm_rank(MPI_COMM_WORLD, &rank);

if (rank == 0)
MPI_Send(data, 100, MPI_INT, 1, 0, MPI_COMM_WORLD);
else if (rank == 1)
MPI_Recv(data, 100, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

MPI_Finalize();
return 0;
}

Parallel Sort using MPI Send/Recv


Rank 0 (holds all N elements):
8 23 19 67 45 35 1 24 13 30 3 5

Rank 0: 8 19 23 35 45 67          Rank 1: 1 3 5 13 24 30
(second half sent to Rank 1; each half sorted in O(N/2 log N/2))

Rank 0 (after receiving the sorted half back):
8 19 23 35 45 67 1 3 5 13 24 30

Rank 0 (after the O(N) merge):
1 3 5 8 13 19 23 24 30 35 45 67


Parallel Sort using MPI Send/Recv (contd.)



#include <mpi.h>
#include <stdio.h>

void sort(int *a, int n);   /* assumed user-provided sorting routine */

int main(int argc, char ** argv)
{
    int rank;
    int a[1000], b[500];
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        MPI_Send(&a[500], 500, MPI_INT, 1, 0, MPI_COMM_WORLD);
        sort(a, 500);   /* sort the first half locally */
        MPI_Recv(b, 500, MPI_INT, 1, 0, MPI_COMM_WORLD, &status);

        /* Serial: merge array b and the sorted first half of array a */
    }
    else if (rank == 1) {
        MPI_Recv(b, 500, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        sort(b, 500);
        MPI_Send(b, 500, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}

Status Object
▪ The status object is used after completion of a receive to find the actual length, source, and tag of a message
▪ Status object is an MPI-defined type and provides information about:
– The source process for the message (status.MPI_SOURCE)
– The message tag (status.MPI_TAG)
– Error status (status.MPI_ERROR)

▪ The number of elements received is given by:

MPI_Get_count(MPI_Status *status, MPI_Datatype datatype, int *count)

  status     return status of receive operation (status)
  datatype   datatype of each receive buffer element (handle)
  count      number of received elements (integer) (OUT)


Using the “status” field


▪ Each “worker process” computes some task (maximum 100 elements) and sends it to the
“master” process together with its group number: the “tag” field can be used to represent
the task
– Data count is not fixed (maximum 100 elements)
– Order in which workers send output to master is not fixed (different workers = different src ranks, and
different tasks = different tags)


Using the “status” field (contd.)


#include <mpi.h>
#include <stdio.h>

int main(int argc, char ** argv)


{
[...snip...]

    if (rank != 0)
        MPI_Send(data, rand() % 100, MPI_INT, 0, group_id, MPI_COMM_WORLD);
    else {
        for (i = 0; i < size - 1; i++) {
            MPI_Recv(data, 100, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
            MPI_Get_count(&status, MPI_INT, &count);
            printf("worker ID: %d; task ID: %d; count: %d\n", status.MPI_SOURCE, status.MPI_TAG, count);
        }
    }

[...snip...]
}


MPI is Simple
▪ Many parallel programs can be written using just these six functions, only two of which are non-trivial:
– MPI_INIT – initialize the MPI library (must be the first routine called)
– MPI_COMM_SIZE – get the size of a communicator
– MPI_COMM_RANK – get the rank of the calling process in the communicator
– MPI_SEND – send a message to another process
– MPI_RECV – receive a message from another process
– MPI_FINALIZE – clean up all MPI state (must be the last MPI function called by a process)

▪ For performance, however, you need to use other MPI features

Some of the simplest and most common communication routines are:
• MPI_Send() sends a message from the current process to another
process (the destination).
• MPI_Recv() receives a message on the current process from another
process (the source).
• MPI_Bcast() broadcasts a message from one process to all of the
others.
• MPI_Reduce() performs a reduction (e.g. a global sum, maximum,
etc.) of a variable in all processes, with the result ending up in a
single process.
• MPI_Allreduce() performs a reduction of a variable in all processes, with the result ending up in all processes.


Compiling MPI programs with MPICH


▪ Compilation Wrappers
– For C programs: mpicc test.c -o test
– For C++ programs: mpicxx test.cpp -o test
– For Fortran 77 programs: mpif77 test.f -o test
– For Fortran 90 programs: mpif90 test.f90 -o test

▪ You can link other libraries as required too
– To link to a math library: mpicc test.c -o test -lm

▪ You can just assume that "mpicc" and friends have replaced your regular compilers (gcc, gfortran, etc.)


Running MPI programs with MPICH


▪ Launch 16 processes on the local node:
– mpiexec -np 16 ./test
▪ Launch 16 processes on 4 nodes (each has 4 cores)
– mpiexec -hosts h1:4,h2:4,h3:4,h4:4 -np 16 ./test
• Runs the first four processes on h1, the next four on h2, etc.
– mpiexec -hosts h1,h2,h3,h4 -np 16 ./test
• Runs the first process on h1, the second on h2, etc., and wraps around
• So, h1 will have the 1st, 5th, 9th and 13th processes
▪ If there are many nodes, it might be easier to create a host file
– cat hf
  h1:4
  h2:2
– mpiexec -hostfile hf -np 16 ./test

Hello World!!!

#include <stdio.h>
#include <mpi.h>
int main ( int argc, char *argv[] )
{
    int rank;
    int number_of_processes;
    MPI_Init( &argc, &argv );
    MPI_Comm_size( MPI_COMM_WORLD, &number_of_processes );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    printf( "hello from process %d of %d\n",
            rank, number_of_processes );
    MPI_Finalize();
    return 0;
}
• All MPI processes (normally) run the same executable
• Each MPI process knows which rank it is
• Each MPI process knows how many processes are part of the
same job
• The processes run in a non-deterministic order
Recall the MPI initialization sequence:
• MPI_Init( &argc, &argv );
• MPI_Comm_size( MPI_COMM_WORLD, &number_of_processes );
• MPI_Comm_rank( MPI_COMM_WORLD, &rank );
• MPI uses communicators to organize how processes communicate
with each other.
• A single communicator, MPI_COMM_WORLD, is created by MPI_Init()
and all the processes running the program have access to it.
• Note that process ranks are relative to a communicator. A program may
have multiple communicators; if so, a process may have multiple
ranks, one for each communicator it is associated with.

MPI is (usually) SPMD
• Usually MPI is run in SPMD (Single Program, Multiple Data) mode.
(It is possible to run multiple programs, i.e. MPMD).
• The program can use its rank to determine its role:

• as shown here, often the rank 0 process plays the role of server or
process coordinator.
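The code figure referenced above is missing from this copy; the following is a minimal sketch (assuming the usual rank-0-as-coordinator pattern, with illustrative printfs standing in for real work):

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* rank 0 plays the role of server / coordinator */
        printf("server: coordinating the job\n");
    } else {
        /* all other ranks act as clients / workers */
        printf("client %d: doing work\n", rank);
    }

    MPI_Finalize();
    return 0;
}
```

Every process runs the same executable; the branch on rank is what differentiates their roles.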
A second MPI program: greeting.c
• The next several slides show the source code for an MPI program
that works on a client-server model.
• When the program starts, it initializes the MPI system then
determines if it is the server process (rank 0) or a client
process.
• Each client process will construct a string message and send it to
the server.
• The server will receive and display messages from the clients one-by-one

#include <stdio.h>
#include <mpi.h>

const int SERVER_RANK = 0;
const int MESSAGE_TAG = 0;

void do_server_work( int number_of_processes );
void do_client_work( int rank );

int main ( int argc, char *argv[] )
{
    int rank, number_of_processes;
    MPI_Init( &argc, &argv );
    MPI_Comm_size( MPI_COMM_WORLD, &number_of_processes );
    MPI_Comm_rank( MPI_COMM_WORLD, &rank );
    if ( rank == SERVER_RANK )
        do_server_work( number_of_processes );
    else
        do_client_work( rank );
    MPI_Finalize();
    return 0;
}

void do_server_work( int number_of_processes )
{
    const int max_message_length = 256;
    char message[max_message_length];
    int src;
    MPI_Status status;
    for ( src = 0; src < number_of_processes; src++ )
    {
        if ( src != SERVER_RANK )
        {
            MPI_Recv( message, max_message_length, MPI_CHAR,
                      src, MESSAGE_TAG, MPI_COMM_WORLD, &status );
            printf( "Received: %s\n", message );
        }
    }
}
void do_client_work( int rank )
{
    const int max_message_length = 256;
    char message[max_message_length];
    int message_length;
    message_length = sprintf( message, "Greetings from process %d", rank );
    message_length++;   /* add one for the null char */
    MPI_Send( message, message_length, MPI_CHAR,
              SERVER_RANK, MESSAGE_TAG, MPI_COMM_WORLD );
}

• Note: the server process (rank 0) does not send a message, but does
display the contents of messages received from the other processes.
• mpirun can be used rather than mpiexec.
• the arguments to mpiexec vary between MPI implementations.
• mpiexec (or mpirun) may not be available.
Deterministic operation?
• You may have noticed that in the four-process case the greeting messages were printed out in-order. Does this mean that the order the messages were sent is deterministic? Look again at the loop that carries out the server's work:

• The server loops over values of src from 0 to number_of_processes - 1, skipping the server's own rank. The fourth argument to MPI_Recv() is the rank of the process the message is received from. Here the messages are received in increasing rank order, regardless of the sending order.
Non-deterministic receive order
• By making one small change, we can allow the messages to be received in any
order. The constant MPI_ANY_SOURCE can be used in the MPI_Recv() function
to indicate that the next available message with the correct tag should be read,
regardless of its source.

• Note: it is possible to use the data returned in status to determine the message's source.
ERROR HANDLING IN MPI
• Nearly all MPI functions return an integer status code:
  • MPI_SUCCESS if the function completed without error,
  • otherwise an error code is returned.
• Most examples you find on the web and in textbooks do not check the MPI function return status value...
• ...but this should be done in production code.
• It can certainly help to avoid errors during development.

Sample MPI error handler
• Here is a sample MPI error handler. If called with a non-successful status value, it displays a string corresponding to the error and then exits; all errors are treated as fatal.
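The handler itself did not survive this copy; a sketch consistent with the description (the function name and layout are assumptions, the MPI calls are standard):

```c
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

/* Abort the job if an MPI call did not return MPI_SUCCESS.
   All errors are treated as fatal. */
void check_mpi_status( int status )
{
    if ( status != MPI_SUCCESS )
    {
        char error_string[MPI_MAX_ERROR_STRING];
        int length;
        MPI_Error_string( status, error_string, &length );
        fprintf( stderr, "MPI error: %s\n", error_string );
        MPI_Abort( MPI_COMM_WORLD, status );
        exit( EXIT_FAILURE );
    }
}
```

Typical usage wraps each call: `check_mpi_status( MPI_Send( ... ) );`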

MPI point-to-point communication routines
• MPI provides two main routines for point-to-point communication
• MPI_Send(): Send to a message to another process
• MPI_Recv(): Receive a message from another process
• Both of these have several variants that we'll mention here and see some of later.

The calling sequence for MPI_Send() is
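The figure with the calling sequence is missing from this copy; the standard prototype is:

```c
int MPI_Send( const void *buf,       /* initial address of send buffer */
              int count,             /* number of elements to send     */
              MPI_Datatype datatype, /* datatype of each element       */
              int dest,              /* rank of destination process    */
              int tag,               /* message tag                    */
              MPI_Comm comm );       /* communicator                   */
```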

• The MPI_Send() function initiates a blocking send. Here blocking does not indicate that the sender waits for the message to be received, but rather that the sender waits for the message to be accepted by the MPI system.
• It does mean that once this function returns, the send buffer may be changed without impacting the send operation.
The calling sequence for MPI_Recv() is
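The figure with the calling sequence is missing from this copy; the standard prototype is:

```c
int MPI_Recv( void *buf,             /* initial address of receive buffer     */
              int count,             /* maximum number of elements to receive */
              MPI_Datatype datatype, /* datatype of each element              */
              int source,            /* rank of source (or MPI_ANY_SOURCE)    */
              int tag,               /* message tag (or MPI_ANY_TAG)          */
              MPI_Comm comm,         /* communicator                          */
              MPI_Status *status );  /* status object (or MPI_STATUS_IGNORE)  */
```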

• The MPI_Recv() function initiates a blocking receive. It will not return to its caller until a message with the specified tag is received from the specified source.
• MPI_ANY_SOURCE may be used to indicate the message should be accepted from any source.
• MPI_ANY_TAG may be used to indicate the message should be accepted regardless of its tag.
Point-to-point communication modes
• Standard : Locally blocking, meaning that the routine does not
return until the memory holding the message is available to reuse
(in the case of MPI_Send()) or use (in the case of MPI_Recv()).
• Buffered: In this mode the user supplies buffer space sufficient to
hold an outgoing or incoming message. The routine MPI_Bsend()
returns as soon as the message is copied into the buffer.
• Synchronous: Similar to the standard mode, except MPI_Ssend() will not return until the matching receive has been posted. Essentially this is explicit blocking.
• Ready: Similar to the standard mode, except that it is an error to call
MPI_Rsend() before the matching receive has been posted.

Blocking vs. Non-blocking Communication


▪ MPI_SEND/MPI_RECV are blocking communication calls
– Return of the routine implies completion
– When these calls return the memory locations used in the message transfer can be
safely accessed for reuse
– For “send” completion implies variable sent can be reused/modified
– Modifications will not affect data intended for the receiver
– For "receive" variable received can be read
▪ MPI_ISEND/MPI_IRECV are non-blocking variants
– Routine returns immediately – completion has to be separately tested for
– These are primarily used to overlap computation and communication to improve
performance

Blocking Communication
▪ In blocking communication:
– MPI_SEND does not return until buffer is empty (available for reuse)
– MPI_RECV does not return until buffer is full (available for use)
▪ A process sending data will be blocked until data in the send buffer is emptied
▪ A process receiving data will be blocked until the receive buffer is filled
▪ Exact completion semantics of communication generally depends on the message size
and the system buffer size
▪ Blocking communication is simple to use but can be prone to deadlocks

  If (rank == 0) Then
      Call mpi_send(..)
      Call mpi_recv(..)
  Else
      Call mpi_send(..)
      Call mpi_recv(..)
  Endif

  This usually deadlocks, UNLESS you reverse the send/recv order in one of the branches.
Blocking Send-Receive Diagram

(figure: timeline of a blocking send and its matching receive)
Non-Blocking Communication
▪ Non-blocking (asynchronous) operations return (immediately) "request handles" that can be waited on and queried
– MPI_ISEND(start, count, datatype, dest, tag, comm, request)
– MPI_IRECV(start, count, datatype, src, tag, comm, request)
– MPI_WAIT(request, status)

▪ Non-blocking operations allow overlapping computation and communication


▪ One can also test without waiting using MPI_TEST
– MPI_TEST(request, flag, status)
▪ Anywhere you use MPI_SEND or MPI_RECV, you can use the pair of
MPI_ISEND/MPI_WAIT or MPI_IRECV/MPI_WAIT
▪ Combinations of blocking and non-blocking sends/receives can be used to synchronize execution instead of
barriers


Multiple Completions
▪ It is sometimes desirable to wait on multiple requests:
• MPI_Waitall(count, array_of_requests, array_of_statuses)
• MPI_Waitany(count, array_of_requests, &index, &status)
• MPI_Waitsome(incount, array_of_requests, &outcount, array_of_indices, array_of_statuses)
▪ There are corresponding versions of test for each of these


Non-Blocking Send-Receive Diagram

(figure: timeline of a non-blocking send/receive overlapped with computation)

Message Completion and Buffering


▪ For a communication to succeed:
– Sender must specify a valid destination rank
– Receiver must specify a valid source rank (including MPI_ANY_SOURCE)
– The communicator must be the same
– Tags must match
– Receiver’s buffer must be large enough
▪ A send has completed when the user-supplied buffer can be reused

  /* blocking send */                 /* non-blocking send */
  *buf = 3;                           *buf = 3;
  MPI_Send(buf, 1, MPI_INT, ...);     MPI_Isend(buf, 1, MPI_INT, ...);
  *buf = 4; /* OK, receiver will      *buf = 4; /* Not certain if receiver
       always receive 3 */                 gets 3 or 4 or anything else */
                                      MPI_Wait(...);

▪ Just because the send completes does not mean that the receive has completed
– Message may be buffered by the system
– Message may still be in transit


A Non-Blocking communication example


int main(int argc, char ** argv)
{
    [...snip...]
    if (rank == 0) {
        for (i = 0; i < 100; i++) {
            /* Compute each data element and send it out */
            data[i] = compute(i);
            MPI_Isend(&data[i], 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &request[i]);
        }
        MPI_Waitall(100, request, MPI_STATUSES_IGNORE);
    }
    else {
        for (i = 0; i < 100; i++)
            MPI_Recv(&data[i], 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    [...snip...]
}


Introduction to Collective Operations in MPI


▪ Collective operations are called by all processes in a communicator.
▪ MPI_BCAST distributes data from one process (the root) to all others in a communicator.

▪ MPI_REDUCE combines data from all processes in the communicator and returns it to one
process.
▪ In many numerical algorithms, SEND/RECV can be replaced by BCAST/REDUCE, improving
both simplicity and efficiency.
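As a sketch of that replacement (not from the original slides; the parameter value and partial-result computation are illustrative), a minimal program using both BCAST and REDUCE:

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, n = 0, local, sum = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        n = 100;                 /* root chooses a parameter */

    /* every process receives the value of n from rank 0 */
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

    local = rank * n;            /* each process computes a partial result */

    /* combine all partial results onto rank 0 */
    MPI_Reduce(&local, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum = %d\n", sum);

    MPI_Finalize();
    return 0;
}
```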


MPI Collective Communication


▪ Communication and computation are coordinated among a group of processes in a communicator
▪ Tags are not used; different communicators deliver similar functionality
▪ Non-blocking collective operations in MPI-3
– Covered in the advanced tutorial (but conceptually simple)
▪ Three classes of operations: synchronization, data movement, collective computation

List of MPI collective communication routines


Synchronization
▪ MPI_BARRIER(comm)
– Blocks until all processes in the group of the communicator comm call it
– A process cannot get out of the barrier until all other processes have reached barrier

Scattering data

MPI Scatter() example
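The example code did not survive this copy; the following is a minimal sketch of the usual MPI_Scatter pattern (buffer sizes and fill values are illustrative, not from the original):

```c
#include <stdio.h>
#include <mpi.h>

#define CHUNK 4   /* elements delivered to each process (illustrative) */

int main(int argc, char **argv)
{
    int rank, size, i;
    int sendbuf[64];      /* significant only at the root */
    int recvbuf[CHUNK];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0)        /* root fills the whole array */
        for (i = 0; i < size * CHUNK; i++)
            sendbuf[i] = i;

    /* each rank receives its own CHUNK-element slice of sendbuf */
    MPI_Scatter(sendbuf, CHUNK, MPI_INT,
                recvbuf, CHUNK, MPI_INT, 0, MPI_COMM_WORLD);

    printf("rank %d received %d..%d\n", rank, recvbuf[0], recvbuf[CHUNK - 1]);

    MPI_Finalize();
    return 0;
}
```

Note that the send arguments (sendbuf, sendcount) are only significant at the root; every other rank only needs valid receive arguments.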

MPI Scatter()

Another data distribution function

MPI Alltoall() example
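This example's code is also missing; a sketch of the usual MPI_Alltoall pattern (the encoding of values is illustrative):

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, i;
    int sendbuf[64], recvbuf[64];   /* room for up to 64 processes */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* element i of sendbuf is destined for process i */
    for (i = 0; i < size; i++)
        sendbuf[i] = 100 * rank + i;

    /* every process sends one element to every process;
       recvbuf[i] ends up holding the element process i sent to us */
    MPI_Alltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, MPI_COMM_WORLD);

    for (i = 0; i < size; i++)
        printf("rank %d got %d from rank %d\n", rank, recvbuf[i], i);

    MPI_Finalize();
    return 0;
}
```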

MPI Alltoall()

MPI Gather()

MPI Gather() notes

MPI Gather() example
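The original example code is lost; a sketch of the usual MPI_Gather pattern (the contributed values are illustrative):

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, i;
    int value;
    int gathered[64];     /* significant only at the root */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    value = rank * rank;  /* each process contributes one value */

    /* root collects one element from every rank, stored in rank order */
    MPI_Gather(&value, 1, MPI_INT, gathered, 1, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0)
        for (i = 0; i < size; i++)
            printf("from rank %d: %d\n", i, gathered[i]);

    MPI_Finalize();
    return 0;
}
```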

MPI Allgather()

MPI Allgather() example
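The original example is missing; a sketch of the MPI_Allgather pattern (values are illustrative):

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, i;
    int value, gathered[64];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    value = rank + 1;

    /* like MPI_Gather, but every process (not just a root)
       ends up with the full gathered array */
    MPI_Allgather(&value, 1, MPI_INT, gathered, 1, MPI_INT, MPI_COMM_WORLD);

    for (i = 0; i < size; i++)
        printf("rank %d sees gathered[%d] = %d\n", rank, i, gathered[i]);

    MPI_Finalize();
    return 0;
}
```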

Four more collective communication routines

MPI Scatterv()
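The Scatterv slide's code is missing; a sketch of the vector-scatter pattern, where each rank receives a different count (the distribution "rank i gets i+1 elements" is an illustrative assumption, and the buffers assume a small process count):

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, i;
    int sendbuf[64];                 /* significant only at the root */
    int counts[64], displs[64];
    int recvbuf[64];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* uneven distribution: rank i receives i+1 elements */
    for (i = 0; i < size; i++) {
        counts[i] = i + 1;
        displs[i] = (i * (i + 1)) / 2;   /* running offset into sendbuf */
    }
    if (rank == 0)
        for (i = 0; i < displs[size - 1] + counts[size - 1]; i++)
            sendbuf[i] = i;

    MPI_Scatterv(sendbuf, counts, displs, MPI_INT,
                 recvbuf, counts[rank], MPI_INT, 0, MPI_COMM_WORLD);

    printf("rank %d received %d elements\n", rank, counts[rank]);

    MPI_Finalize();
    return 0;
}
```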

Other collective routines

spcl.inf.ethz.ch
@spcl_eth

Collective Data Movement


int MPI_Bcast( void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm )

  Broadcast:
  P0: A          P0: A
  P1:     -->    P1: A
  P2:            P2: A
  P3:            P3: A

int MPI_Scatter(const void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm)
int MPI_Gather(const void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm)

  Scatter (left to right) / Gather (right to left):
  P0: A B C D        P0: A
  P1:          -->   P1: B
  P2:          <--   P2: C
  P3:                P3: D

More Collective Data Movement


int MPI_Allgather(const void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, MPI_Comm comm)

  Allgather:
  P0: A          P0: A B C D
  P1: B    -->   P1: A B C D
  P2: C          P2: A B C D
  P3: D          P3: A B C D

int MPI_Alltoall(const void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, MPI_Comm comm)

  Alltoall:
  P0: A0 A1 A2 A3          P0: A0 B0 C0 D0
  P1: B0 B1 B2 B3    -->   P1: A1 B1 C1 D1
  P2: C0 C1 C2 C3          P2: A2 B2 C2 D2
  P3: D0 D1 D2 D3          P3: A3 B3 C3 D3

Collective Computation
int MPI_Reduce(const void *sendbuf, void *recvbuf, int count, MPI_Datatype
datatype, MPI_Op op, int root, MPI_Comm comm)

  Reduce:
  P0: A          P0: A+B+C+D
  P1: B    -->   P1:
  P2: C          P2:
  P3: D          P3:

int MPI_Allreduce(const void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)

  Allreduce:
  P0: A          P0: A+B+C+D
  P1: B    -->   P1: A+B+C+D
  P2: C          P2: A+B+C+D
  P3: D          P3: A+B+C+D


int MPI_Scan(const void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)

  Scan (inclusive prefix reduction):
  P0: A          P0: A
  P1: B    -->   P1: A op B
  P2: C          P2: A op B op C
  P3: D          P3: A op B op C op D

▪ MPI_ALLREDUCE, MPI_REDUCE, MPI_REDUCE_SCATTER, and MPI_SCAN take both built-in and user-defined combiner functions


MPI Built-in Collective Computation Operations


▪ MPI_MAX      Maximum
▪ MPI_MIN      Minimum
▪ MPI_PROD     Product
▪ MPI_SUM      Sum
▪ MPI_LAND     Logical and
▪ MPI_LOR      Logical or
▪ MPI_LXOR     Logical exclusive or
▪ MPI_BAND     Bitwise and
▪ MPI_BOR      Bitwise or
▪ MPI_BXOR     Bitwise exclusive or
▪ MPI_MAXLOC   Maximum and location
▪ MPI_MINLOC   Minimum and location

Examples

(The worked code for these slides did not survive this copy: a Scatter/Gather example, an Alltoall example, and a Vector Scatter example with its output.)
ODD-EVEN Sort

Sorting n = 8 elements, using the odd-even transposition
sort algorithm. During each phase, n = 8 elements are
compared.

• After n phases of odd-even exchanges, the sequence is sorted.
• Each phase of the algorithm (either odd or even) requires Θ(n) comparisons.
• Serial complexity is Θ(n^2).
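The serial algorithm described above can be sketched directly (a minimal implementation, not the slides' original code):

```c
/* One phase compares neighbouring pairs starting at `start`
   (0 = even phase, 1 = odd phase) and swaps out-of-order pairs. */
static void odd_even_phase(int a[], int n, int start)
{
    for (int i = start; i + 1 < n; i += 2) {
        if (a[i] > a[i + 1]) {
            int t = a[i];
            a[i] = a[i + 1];
            a[i + 1] = t;
        }
    }
}

/* After n alternating even/odd phases the n-element array is sorted;
   each phase costs Θ(n) comparisons, giving Θ(n^2) total. */
void odd_even_sort(int a[], int n)
{
    for (int phase = 0; phase < n; phase++)
        odd_even_phase(a, n, phase % 2);
}
```

For example, sorting {8, 23, 19, 67, 45, 35, 1, 24} yields {1, 8, 19, 23, 24, 35, 45, 67}.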

Parallel Odd-Even Transposition
• Consider the one item per processor case.
• There are n iterations, in each iteration, each processor does one
compare-exchange.
• The parallel run time of this formulation is Θ(n).
• This is cost optimal with respect to the base serial algorithm but
not the optimal one.

Parallel Odd-Even Transposition
• Consider a block of n/p elements per processor.
• The first step is a local sort.
• In each subsequent step, the compare exchange operation is
replaced by the compare split operation.
• The parallel run time of this formulation is
  Tp = Θ((n/p) log(n/p)) (local sort) + Θ(n) (comparisons) + Θ(n) (communication)
Parallel Odd-Even Transposition
• The parallel formulation is cost-optimal for p = O(log n).
• The isoefficiency function of this parallel formulation is Θ(p 2^p).


Defining your own Collective Operations


▪ Create your own collective computations with:
MPI_OP_CREATE(user_fn, commutes, &op);
MPI_OP_FREE(&op);

user_fn(invec, inoutvec, len, datatype);

▪ The user function should perform:


inoutvec[i] = invec[i] op inoutvec[i];
for i from 0 to len-1

▪ The user function can be non-commutative, but must be associative
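Putting the pieces above together, a minimal sketch of a user-defined operation (elementwise integer sum, chosen for illustration; the function name is an assumption):

```c
#include <stdio.h>
#include <mpi.h>

/* user combiner following the required contract:
   inoutvec[i] = invec[i] op inoutvec[i], for i from 0 to len-1 */
void my_sum(void *invec, void *inoutvec, int *len, MPI_Datatype *datatype)
{
    int *in = (int *)invec, *inout = (int *)inoutvec;
    for (int i = 0; i < *len; i++)
        inout[i] = in[i] + inout[i];
}

int main(int argc, char **argv)
{
    int rank, result;
    MPI_Op op;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Op_create(my_sum, 1 /* commutes */, &op);
    MPI_Reduce(&rank, &result, 1, MPI_INT, op, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("sum of ranks = %d\n", result);
    MPI_Op_free(&op);

    MPI_Finalize();
    return 0;
}
```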

References
• Introduction to Parallel Computing, Chapter 6: 6.1, 6.2, 6.3, 6.4, 6.5, 6.6
