Unit3 All
• UNIT-III 8 Hours
Programming using the Message-Passing Paradigm: Principles of Message-Passing Programming, The Building
Blocks: Send and Receive Operations, MPI: The Message Passing Interface, Topologies and Embedding,
Overlapping Communication with Computation, Collective Communication and Computation Operations. Self-
Study: Groups and Communicators.
Reference Book
Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar, Introduction to Parallel Computing, Second Edition, Pearson Education, 2003.
Chapter 6
2
MPI
Message Passing Interface
3
Outline
• Background
• Message Passing
• Principles of Message-Passing Programming
• MPI
• Group and Context
• Communication Modes
• Blocking/Non-blocking
• Features
• Programming / issues
• The Building Blocks: Send and Receive Operations
• MPI: the Message Passing Interface
• Topologies and Embedding
• Overlapping Communication with Computation
• Collective Communication and Computation Operations
• Groups and Communicators
4
Distributed Computing Paradigms
• Communication Models
• Computation Models
5
Distributed Memory Parallelism
• Each processing element cannot access all data natively
• The scale can go up considerably
• The penalty for coordinating with other processing elements is now significantly higher
• Approaches change accordingly
6
Performance Metrics: Latency and Bandwidth
Latency
• Performance is affected since the processor may have to wait
• Harder to overlap communication and computation
• The overhead to communicate is a problem in many machines
Latency hiding
• Increases the programming system burden
• Examples: communication/computation overlap, prefetch
9
Advantages of Distributed Memory
Architectures
10
Types of Parallel Computing Models
• Data parallel
  • Simultaneous execution on multiple data items
  • Example: Single Instruction, Multiple Data (SIMD)
• Task parallel
  • Different instructions on different data (MIMD)
• SPMD (Single Program, Multiple Data)
  • Combination of data parallel and task parallel
  • Not synchronized at the individual operation level
• Message passing is for MIMD/SPMD parallelism
  • Can be used for data parallel programming
11
Message Passing
• A process is a program counter and address space.
• Message passing is used for communication among processes.
• Inter-process communication:
  • Type: Synchronous / Asynchronous
  • Movement of data from one process's address space to another's
12
The Message-Passing Model
• A process is a program counter and address space
• Processes can have multiple threads (program counters and associated stacks) sharing a single address space
[Figure: four processes P1-P4, each with one or more threads sharing the process's address space (memory)]
14
Synchronous vs. Asynchronous (cont.)
15
What is message passing?
16
SPMD
[Figure: one shared program operating on multiple data partitions]
• Shared program, multiple data
• "Owner compute" rule: the process that "owns" the data (local data) performs the computations on that data
• Message passing infrastructure attempts to support the forms of communication most often used or desired
18
Principles of
Message-Passing Programming
19
Principles of Message-Passing Programming
• Message-passing programs are often written using the asynchronous or loosely synchronous paradigms.
• In the asynchronous paradigm, all concurrent tasks execute asynchronously.
20
Communication Types
• Two ideas for communication:
  • Cooperative operations
  • One-sided operations
21
Cooperative Operations for Communication
• Data is cooperatively exchanged in message-passing
• Explicitly sent by one process and received by another
• Advantage of local control of memory
• Any change in the receiving process’s memory is made with the
receiver’s explicit participation
• Communication and synchronization are combined
[Figure: Process 0 executes Send(data); Process 1 executes Receive(data); time flows downward]
22
One-Sided Operations for Communication
• One-sided operations between processes include remote memory reads and writes
[Figure: Process 0 performs Put(data) into and Get(data) from Process 1's memory; time flows downward]
23
Pairwise vs. Collective
Communication
• Communication between process pairs
• Send/Receive or Put/Get
• Synchronous or asynchronous (we’ll talk about this later)
• Collective communication between multiple processes
• Process group (collective)
• Several processes logically grouped together
• Communication within group
• Collective operations
• Communication patterns
• broadcast, multicast, subset, scatter/gather, …
• Reduction operations
24
The Building Blocks:
Send and Receive Operations
The prototypes of these operations are as follows:
send(void *sendbuf, int nelems, int dest)
receive(void *recvbuf, int nelems, int source)
Consider the following code segments:
P0:                      P1:
a = 100;                 receive(&a, 1, 0);
send(&a, 1, 1);          printf("%d\n", a);
a = 0;
The semantics of the send operation require that the value received by process P1 must be 100 as opposed to 0.
25
Blocking vs. Non-Blocking
26
Non-Buffered Blocking Message Passing Operations: Send/Receive
27
Deadlock in Blocking Non-Buffered Operations
28
Buffered Blocking Message Passing Operations: Send/Receive
• A simple solution to the idling and deadlocking problem outlined above is to rely on buffers at the sending and receiving ends.
• The sender simply copies the data into the designated buffer and returns after the copy operation has been completed.
• The data must be buffered at the receiving end as well.
• Buffering trades off idling overhead for buffer copying overhead.
29
Buffered Blocking Message Passing Operations
• Blocking buffered transfer protocols:
  • (a) in the presence of communication hardware with buffers at the send and receive ends;
  • (b) in the absence of communication hardware, the sender interrupts the receiver and deposits the data in a buffer at the receiver end.
31
Buffered Blocking Message Passing Operations
Example 1: producer/consumer with buffered sends.
P0:
for (i = 0; i < 1000; i++) {
    produce_data(&a);
    send(&a, 1, 1);
}
P1:
for (i = 0; i < 1000; i++) {
    receive(&a, 1, 0);
    consume_data(&a);
}
Example 2: this exchange deadlocks even with buffering, because a blocking receive returns only after the data arrives, so both processes wait in their receives.
P0:
receive(&a, 1, 1);
send(&b, 1, 1);
P1:
receive(&a, 1, 0);
send(&b, 1, 0);
33
Non-Blocking
Message Passing Operations
34
Non-Blocking Send and Receive
Non-Blocking Message Passing Operations
• Non-blocking non-buffered send and receive operations:
  • (a) in the absence of communication hardware;
  • (b) in the presence of communication hardware.
36
Send and Receive Protocols
37
MPI
38
• A message-passing library specification:
• Extended message-passing model
• Not a language or compiler specification
• Not a specific implementation or product
• For parallel computers, clusters, and heterogeneous networks.
• Communication modes: standard, synchronous, buffered, and ready.
• Designed to permit the development of parallel software libraries.
• Designed to provide access to advanced parallel hardware for
• End users
• Library writers
• Tool developers
39
MPI
• MPI defines a standard library for message-passing that can be used
to develop portable message-passing programs using either C or
Fortran.
• The MPI standard defines both the syntax as well as the semantics of
a core set of library routines.
• Vendor implementations of MPI are available on almost all
commercial parallel computers.
• It is possible to write fully functional message-passing programs using only six basic routines.
41
Why Use MPI?
• Message passing is a mature parallel programming model
• Well understood
• Efficient match to hardware (interconnection networks)
• Many applications
• MPI provides a powerful, efficient, and portable way to express parallel programs
• MPI was explicitly designed to enable libraries …
• … which may eliminate the need for many users to learn (much of) MPI
• Need standard, rich, and robust implementation
• Four major versions: MPI-1, MPI-2, MPI-3, MPI-4
• Robust implementations including free MPICH (ANL)
42
Features
General:
• Point-to-point communication
• Structured buffers and derived datatypes, heterogeneity
Collective:
• Both built-in and user-defined collective operations
• Large number of data movement routines
• Subgroups defined directly or by topology
43
Features that are NOT part of MPI
45
Is MPI Large or Small?
MPI is large
• MPI-1 has 128 functions; MPI-2 has 152 functions
• Extensive functionality requires many functions
• Not necessarily a measure of complexity
MPI is small (6 functions)
• Many parallel programs use just 6 basic functions
46
To use or not to use MPI? That is the question.
47
MPI: the Message Passing Interface
The minimal set of MPI routines:
MPI_Init         Initializes MPI.
MPI_Finalize     Terminates MPI.
MPI_Comm_size    Determines the number of processes.
MPI_Comm_rank    Determines the label (rank) of the calling process.
MPI_Send         Sends a message.
MPI_Recv         Receives a message.
48
Starting and Terminating the MPI Library
MPI_Init is called prior to any calls to other MPI routines. Its purpose is to initialize the MPI
environment.
MPI_Finalize is called at the end of the computation, and it performs various clean-up tasks to
terminate the MPI environment.
All MPI routines, data-types, and constants are prefixed by “MPI_”. The return code for successful
completion is MPI_SUCCESS.
49
Initialization and Finalization
MPI_Init
•gather information about the parallel job
•set up internal library state
•prepare for communication
MPI_Finalize
•cleanup
50
Communicators
51
Group and Context
52
Communicators
• A communicator defines a communication domain - a set of processes
that are allowed to communicate with each other.
• Information about communication domains is stored in variables of
type MPI_Comm.
• Communicators are used as arguments to all message transfer MPI
routines.
• A process can belong to many different (possibly overlapping)
communication domains.
• MPI defines a default communicator called MPI_COMM_WORLD
which includes all the processes.
53
MPI_COMM_WORLD
54
MPI_COMM_WORLD
55
Communication Scope
Communicator (communication handle)
• Defines the scope
• Specifies the communication context
Process
• Belongs to a group
• Identified by a rank within a group
Identification
• MPI_Comm_size - total number of processes in the communicator
• MPI_Comm_rank - rank in the communicator
56
Querying Information
1. The MPI_Comm_size and MPI_Comm_rank functions are used to determine the number of processes and the label of the calling process, respectively.
2. The calling sequences of these routines are as follows:
   • int MPI_Comm_size(MPI_Comm comm, int *size)
   • int MPI_Comm_rank(MPI_Comm comm, int *rank)
3. The rank of a process is an integer that ranges from zero up to the size of the communicator minus one.
57
#include <mpi.h>
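Only the #include line of this example survived the export. Below is a minimal sketch of the kind of program this slide most likely showed, using only the routines introduced above (MPI_Init, MPI_Comm_size, MPI_Comm_rank, MPI_Finalize); the message text and the use of mpicc/mpirun to build and launch it are assumptions, not taken from the slide:

#include <mpi.h>
#include <stdio.h>

/* Query the communicator size and the calling process's rank,
   then print a greeting from every process. */
int main(int argc, char *argv[])
{
    int npes, myrank;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &npes);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    printf("Hello from process %d of %d\n", myrank, npes);
    MPI_Finalize();
    return 0;
}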
58
Sending and Receiving Messages
• The basic functions for sending and receiving messages in MPI are the MPI_Send
and MPI_Recv, respectively.
• The calling sequences of these routines are as follows:
int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)
• MPI provides equivalent datatypes for all C datatypes. This is done for portability
reasons.
• The datatype MPI_BYTE corresponds to a byte (8 bits) and MPI_PACKED corresponds to a collection of data items that has been created by packing non-contiguous data.
• The message tag can take values ranging from zero up to the MPI-defined constant MPI_TAG_UB.
59
MPI Datatypes
MPI Datatype           C Datatype
MPI_CHAR               signed char
MPI_SHORT              signed short int
MPI_INT                signed int
MPI_LONG               signed long int
MPI_UNSIGNED_CHAR      unsigned char
MPI_UNSIGNED_SHORT     unsigned short int
MPI_UNSIGNED           unsigned int
MPI_UNSIGNED_LONG      unsigned long int
MPI_FLOAT              float
MPI_DOUBLE             double
MPI_LONG_DOUBLE        long double
MPI_BYTE
MPI_PACKED
60
MPI allows specification of wildcard arguments
for both source and tag.
61
Sending and Receiving Messages
• On the receiving end, the status variable can be used to get information about the MPI_Recv operation.
• The corresponding data structure contains:
typedef struct MPI_Status {
    int MPI_SOURCE;
    int MPI_TAG;
    int MPI_ERROR;
};
• The MPI_Get_count function returns the precise count of data items received:
int MPI_Get_count(MPI_Status *status, MPI_Datatype datatype, int *count)
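As a brief illustration (not from the slide), the sketch below shows how status and MPI_Get_count are typically used together with the wildcard arguments discussed on the following slides; the buffer size of 100 and the use of MPI_INT are illustrative choices, and the fragment is assumed to run between MPI_Init and MPI_Finalize:

int buf[100], source, tag, count;
MPI_Status status;

/* Receive from any source with any tag, then inspect the envelope. */
MPI_Recv(buf, 100, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
         MPI_COMM_WORLD, &status);
source = status.MPI_SOURCE;              /* who actually sent the message */
tag    = status.MPI_TAG;                 /* which tag was used            */
MPI_Get_count(&status, MPI_INT, &count); /* how many items arrived        */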
62
Avoiding Deadlocks
Consider:
if (myrank == 0) {
    MPI_Send(a, 10, MPI_INT, 1, 1, MPI_COMM_WORLD);
    MPI_Send(b, 10, MPI_INT, 1, 2, MPI_COMM_WORLD);
}
else if (myrank == 1) {
    MPI_Recv(b, 10, MPI_INT, 0, 2, MPI_COMM_WORLD, &status);
    MPI_Recv(a, 10, MPI_INT, 0, 1, MPI_COMM_WORLD, &status);
}
If MPI_Send blocks until the matching receive is posted, this code deadlocks: each process is waiting for a matching operation on the other side (P0 blocks in the tag-1 send while P1 blocks in the tag-2 receive).
63
Avoiding Deadlocks
• Deadlock may also occur when a process sends a message to itself.
• It is legal, but the behavior is implementation dependent and must be avoided.
64
Avoiding Deadlocks
• Improper use of MPI_Send and MPI_Recv can also lead to deadlocks in situations where each process needs to send and receive a message in a circular fashion.
...
MPI_Comm_size(MPI_COMM_WORLD, &npes);
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
MPI_Send(a, 10, MPI_INT, (myrank+1)%npes, 1, MPI_COMM_WORLD);
MPI_Recv(b, 10, MPI_INT, (myrank-1+npes)%npes, 1, MPI_COMM_WORLD, &status);
...
65
Consider the preceding piece of code, in which process i sends a message to process i + 1 (modulo the number of processes) and receives a message from process i - 1 (modulo the number of processes).
66
We can break the circular wait to avoid deadlocks as follows:
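The code for this fix is not reproduced on the slide; the sketch below follows the approach used in the reference text, in which odd-ranked processes send first and even-ranked processes receive first (the fragment is assumed to run between MPI_Init and MPI_Finalize):

int a[10], b[10], npes, myrank;
MPI_Status status;
MPI_Comm_size(MPI_COMM_WORLD, &npes);
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if (myrank % 2 == 1) {   /* odd ranks: send first, then receive */
    MPI_Send(a, 10, MPI_INT, (myrank+1)%npes, 1, MPI_COMM_WORLD);
    MPI_Recv(b, 10, MPI_INT, (myrank-1+npes)%npes, 1, MPI_COMM_WORLD, &status);
}
else {                   /* even ranks: receive first, then send */
    MPI_Recv(b, 10, MPI_INT, (myrank-1+npes)%npes, 1, MPI_COMM_WORLD, &status);
    MPI_Send(a, 10, MPI_INT, (myrank+1)%npes, 1, MPI_COMM_WORLD);
}

Breaking the symmetry this way ensures that every blocking send has a matching receive already posted on the other side.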
67
Sending and Receiving
Messages Simultaneously
MPI_Sendrecv_replace:
• uses a single buffer for both sending and receiving
• performs a blocking send and receive
• the send and receive must transfer data of the same datatype
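For reference, the calling sequences of the combined send/receive routines (standard MPI; they are not shown on the slide) are:

int MPI_Sendrecv(void *sendbuf, int sendcount, MPI_Datatype senddatatype,
                 int dest, int sendtag,
                 void *recvbuf, int recvcount, MPI_Datatype recvdatatype,
                 int source, int recvtag,
                 MPI_Comm comm, MPI_Status *status)

int MPI_Sendrecv_replace(void *buf, int count, MPI_Datatype datatype,
                         int dest, int sendtag, int source, int recvtag,
                         MPI_Comm comm, MPI_Status *status)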
69
Topologies and Embedding
70
MPI allows a programmer to organize processors
into logical k-d meshes.
71
Topologies and Embeddings
• Different ways to map a set of processes to a two-dimensional grid:
  • (a) and (b) show a row- and column-wise mapping of these processes,
  • (c) shows a mapping that follows a space-filling curve (dotted line), and
  • (d) shows a mapping in which neighboring processes are directly connected in a hypercube.
72
Creating and Using Cartesian Topologies
• MPI_Cart_create takes the processes in the old communicator and creates a new communicator with dims dimensions.
• Each processor can now be identified in this new Cartesian topology by a vector of dimension dims.
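The calling sequence referred to above is not shown in the export; it is the standard MPI routine:

int MPI_Cart_create(MPI_Comm comm_old, int ndims, int *dims, int *periods,
                    int reorder, MPI_Comm *comm_cart)

A small illustrative use (the 4 x 4 grid size and the periodicity are assumptions, not from the slide):

int dims[2]    = {4, 4};   /* 4 x 4 process grid             */
int periods[2] = {1, 1};   /* wrap around in both dimensions */
MPI_Comm comm_2d;
MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &comm_2d);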
73
Since sending and receiving messages still require (one-dimensional)
ranks, MPI provides routines to convert ranks to cartesian coordinates
and vice-versa.
• int MPI_Cart_coord(MPI_Comm comm_cart, int rank, int maxdims, int
*coords)
• int MPI_Cart_rank(MPI_Comm comm_cart, int *coords, int *rank)
The most common operation on cartesian topologies is a shift. To determine the rank
of source and destination of such shifts, MPI provides the following function:
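The prototype of that routine (standard MPI, omitted from the slide) is:

int MPI_Cart_shift(MPI_Comm comm_cart, int dir, int s_step,
                   int *rank_source, int *rank_dest)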
74
Overlapping Communication with
Computation
75
Overlapping Communication
with Computation
• int MPI_Isend(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request)
• int MPI_Irecv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *request)
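A non-blocking operation must later be completed with MPI_Test or MPI_Wait:

int MPI_Test(MPI_Request *request, int *flag, MPI_Status *status)
int MPI_Wait(MPI_Request *request, MPI_Status *status)

A minimal sketch of the overlap pattern (do_local_work() is a placeholder for local computation, not a routine from the slides):

MPI_Request request;
MPI_Status status;
int b[10];

MPI_Irecv(b, 10, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &request);
do_local_work();             /* computation that does not touch b          */
MPI_Wait(&request, &status); /* b may be used safely only after this call  */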
77
Avoiding Deadlocks: Non-Blocking Counterpart
Using non-blocking operations removes most deadlocks. Consider the blocking version:
int a[10], b[10], myrank;
MPI_Status status;
...
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if (myrank == 0) {
    MPI_Send(a, 10, MPI_INT, 1, 1, MPI_COMM_WORLD);
    MPI_Send(b, 10, MPI_INT, 1, 2, MPI_COMM_WORLD);
}
else if (myrank == 1) {
    MPI_Recv(b, 10, MPI_INT, 0, 2, MPI_COMM_WORLD, &status);
    MPI_Recv(a, 10, MPI_INT, 0, 1, MPI_COMM_WORLD, &status);
}
...
Replacing either the send or the receive operations with non-blocking counterparts fixes this deadlock:
int a[10], b[10], myrank;
MPI_Status status;
MPI_Request requests[2];
...
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if (myrank == 0) {
    MPI_Send(a, 10, MPI_INT, 1, 1, MPI_COMM_WORLD);
    MPI_Send(b, 10, MPI_INT, 1, 2, MPI_COMM_WORLD);
}
else if (myrank == 1) {
    MPI_Irecv(b, 10, MPI_INT, 0, 2, MPI_COMM_WORLD, &requests[0]);
    MPI_Irecv(a, 10, MPI_INT, 0, 1, MPI_COMM_WORLD, &requests[1]);
}
78
Collective Communication and Computation Operations
• MPI provides an extensive set of functions for performing common collective communication operations.
• Each of these operations is defined over a group corresponding to the communicator.
79
int MPI_Bcast(void *buf, int count, MPI_Datatype datatype, int source, MPI_Comm comm)
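A short usage sketch (the variable name n, the value 100, and the assumption that myrank was obtained with MPI_Comm_rank are illustrative, not from the slide):

int n;
if (myrank == 0)
    n = 100;                          /* value known only at the source */
MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
/* after the call, every process in MPI_COMM_WORLD has n == 100 */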
82
Reduction
• int MPI_Reduce(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, int target, MPI_Comm comm)
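A short sketch computing a global sum at process 0 (each process's contribution, here its rank, is purely illustrative; myrank is assumed to come from MPI_Comm_rank):

int local_value = myrank, global_sum;
MPI_Reduce(&local_value, &global_sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
/* global_sum is defined only at the target process (rank 0) */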
84
Predefined Reduction Operations
Operation      Meaning                       Datatypes
MPI_MAX        Maximum                       C integers and floating point
MPI_MIN        Minimum                       C integers and floating point
MPI_SUM        Sum                           C integers and floating point
MPI_PROD       Product                       C integers and floating point
MPI_LAND       Logical AND                   C integers
MPI_BAND       Bit-wise AND                  C integers and byte
MPI_LOR        Logical OR                    C integers
MPI_BOR        Bit-wise OR                   C integers and byte
MPI_LXOR       Logical XOR                   C integers
MPI_BXOR       Bit-wise XOR                  C integers and byte
MPI_MAXLOC     Maximum value and location    Data-pairs
MPI_MINLOC     Minimum value and location    Data-pairs
85
An example use of the MPI_MINLOC and MPI_MAXLOC operators.
• The operation MPI_MAXLOC combines pairs of values (vi, li) and returns the pair (v, l) such that v is the maximum among all the vi and l is the corresponding li (if there is more than one, it is the smallest among all these li).
• MPI_MINLOC does the same, except for the minimum value of vi.
86
MPI datatypes for data-pairs used with the MPI_MAXLOC and
MPI_MINLOC reduction operations.
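The table itself did not survive the export; to the best of my knowledge, the predefined MPI pair datatypes are:

MPI_2INT              pair of int
MPI_SHORT_INT         short and int
MPI_LONG_INT          long and int
MPI_LONG_DOUBLE_INT   long double and int
MPI_FLOAT_INT         float and int
MPI_DOUBLE_INT        double and int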
87
Collective Communication Operations
• If the result of the reduction operation is needed by all processes, MPI provides:
int MPI_Allreduce(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)
• To compute prefix-sums, MPI provides:
int MPI_Scan(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)
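A brief sketch of an inclusive prefix sum with MPI_Scan (the per-process value is illustrative; myrank is assumed to come from MPI_Comm_rank):

int value = myrank + 1, prefix;
MPI_Scan(&value, &prefix, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
/* process i ends up with prefix == 1 + 2 + ... + (i + 1) */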
88
Collective Communication Operations
• The barrier synchronization operation is performed in MPI using:
int MPI_Barrier(MPI_Comm comm)
89
• The all-to-all reduction operation is:
int MPI_Allreduce(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)
90
Scatter and Gather
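The prototypes for this slide are missing from the export; the calling sequences, with parameter names as used in the reference text (the MPI standard calls the source/target parameter root), are:

int MPI_Scatter(void *sendbuf, int sendcount, MPI_Datatype senddatatype,
                void *recvbuf, int recvcount, MPI_Datatype recvdatatype,
                int source, MPI_Comm comm)

int MPI_Gather(void *sendbuf, int sendcount, MPI_Datatype senddatatype,
               void *recvbuf, int recvcount, MPI_Datatype recvdatatype,
               int target, MPI_Comm comm)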
91
Vector variants of gather and allgather
92
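The prototypes are not in the export; the vector variants, which allow a different number of elements to be gathered from each process, are (parameter names as in the reference text):

int MPI_Gatherv(void *sendbuf, int sendcount, MPI_Datatype senddatatype,
                void *recvbuf, int *recvcounts, int *displs,
                MPI_Datatype recvdatatype, int target, MPI_Comm comm)

int MPI_Allgatherv(void *sendbuf, int sendcount, MPI_Datatype senddatatype,
                   void *recvbuf, int *recvcounts, int *displs,
                   MPI_Datatype recvdatatype, MPI_Comm comm)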
Vector variants of scatter
93
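Similarly, the vector variant of scatter (prototype not in the export; parameter names as in the reference text):

int MPI_Scatterv(void *sendbuf, int *sendcounts, int *displs,
                 MPI_Datatype senddatatype, void *recvbuf, int recvcount,
                 MPI_Datatype recvdatatype, int source, MPI_Comm comm)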
Groups and Communicators
Communicators
int MPI_Cart_sub(MPI_Comm comm_cart, int *keep_dims, MPI_Comm *comm_subcart)
• If keep_dims[i] is true (a non-zero value in C), then the ith dimension is retained in the new sub-topology.
• The coordinate of a process in a sub-topology created by MPI_Cart_sub can be obtained from its coordinate in the original topology by disregarding the coordinates that correspond to the dimensions that were not retained.
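The reference text's Groups and Communicators material also covers general (non-Cartesian) splitting of a communicator with MPI_Comm_split; a minimal sketch, in which the color and key choices (even/odd rank) are purely illustrative and myrank is assumed to come from MPI_Comm_rank:

MPI_Comm newcomm;
/* processes that pass the same color end up in the same new communicator;
   key determines the ordering of ranks within it */
MPI_Comm_split(MPI_COMM_WORLD, myrank % 2, myrank, &newcomm);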
Groups and Communicators
#include <mpi.h>
int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    /*
      Use MPI function calls here, depending on your data partitioning and
      the parallelization architecture.
    */
    MPI_Finalize();
    return 0;
}
Initializing MPI
• The initialization routine MPI_Init is the first MPI routine called.
• MPI_Init is called only once.
int MPI_Init(int *argc, char ***argv);
A minimal MPI program (C)
#include "mpi.h"
#include <stdio.h>
int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
    printf("Hello, world!\n");
    MPI_Finalize();
    return 0;
}
A minimal MPI program (C), cont.
• #include "mpi.h" provides basic MPI definitions and types.
• Note that all non-MPI routines are local; thus printf runs on each process.
Data Types
MPI datatype           C datatype
MPI_UNSIGNED_SHORT     unsigned short
MPI_INT                signed int
MPI_UNSIGNED           unsigned int
MPI_LONG               signed long
MPI_UNSIGNED_LONG      unsigned long
MPI_FLOAT              float
MPI_DOUBLE             double
MPI_LONG_DOUBLE        long double
Why define the data types during the send of a message?
Because communication takes place between heterogeneous machines, which may have different data representations and lengths in memory.
int MPI_Send(void *start, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
MPI Blocking Receive
• The receiver can specify a wildcard value for the source (MPI_ANY_SOURCE) and/or a wildcard value for the tag (MPI_ANY_TAG), indicating that any source and/or tag are acceptable.
• Status is used for extra information about the received message when a wildcard receive mode is used.
• If the count of the message received is less than or equal to that specified in the MPI receive call, then the message is successfully received; otherwise it is considered a buffer overflow error.
Status is a data structure. In C, it is the MPI_Status structure shown earlier, with the fields MPI_SOURCE, MPI_TAG, and MPI_ERROR.