
UNIT-3

• UNIT-III 8 Hours
Programming using the Message-Passing Paradigm: Principles of Message-Passing Programming, The Building
Blocks: Send and Receive Operations, MPI: The Message Passing Interface, Topologies and Embedding,
Overlapping Communication with Computation, Collective Communication and Computation Operations. Self-
Study: Groups and Communicators.

Reference Book
Ananth Grama, Anshul Gupta, Vipin Kumar, George Karypis, Introduction to Parallel Computing, Second Edition, 2003, Pearson Education.

Chapter 6

2
MPI
Message Passing Interface
3
Outline
• Background
• Message Passing
• Principles of Message-Passing Programming
• MPI
• Group and Context
• Communication Modes
• Blocking/Non-blocking
• Features
• Programming / issues
• The Building Blocks: Send and Receive Operations
• MPI: the Message Passing Interface
• Topologies and Embedding
• Overlapping Communication with Computation
• Collective Communication and Computation Operations
• Groups and Communicators
4
Distributed Computing Paradigms

Communication Models:
• Message Passing
• Shared Memory

Computation Models:
• Data Parallel
• Functional Parallel
5
Distributed Memory Parallelism
• Each processing element cannot access all data natively
• The scale can go up considerably
• The penalty for coordinating with other processing elements is now significantly higher
• Approaches change accordingly
6
7

Distributed Memory Multiprocessors

• Each processor has a local memory
  • Physically separated memory address space
• Processors must communicate to access non-local data
  • Message communication (message passing)
  • Message passing architecture
  • Processor interconnection network
• Parallel applications must be partitioned across
  • Processors: execution units
  • Memory: data partitioning
• Scalable architecture
  • Small incremental cost to add hardware (cost of a node)
Distributed Memory (MP) Architecture
• Nodes are complete computer systems
  • Including I/O
• Nodes communicate via interconnection network
  • Standard networks
  • Specialized networks
• Network interfaces
• Communication integration
• Easier to build
[Figure: nodes, each with processor (P), cache ($), memory (M) and a network interface, connected by a network]
8
Performance Metrics: Latency and Bandwidth

Bandwidth
• Need high bandwidth in communication
• Match limits in network, memory, and processor
• Network interface speed vs. network bisection bandwidth

Latency
• Performance is affected since the processor may have to wait
• Harder to overlap communication and computation
• Overhead to communicate is a problem in many machines

Latency hiding
• Increases programming system burden
• Examples: communication/computation overlap, prefetch
Advantages of Distributed Memory Architectures

• The hardware can be simpler (especially versus NUMA) and is more scalable
• Communication is explicit and simpler to understand
• Explicit communication focuses attention on the costly aspects of parallel computation
• Synchronization is naturally associated with sending messages, reducing the possibility for errors introduced by incorrect synchronization
• Easier to use sender-initiated communication, which may have some performance advantages

10
Types of Parallel Computing Models
• Data parallel
  • Simultaneous execution on multiple data items
  • Example: Single Instruction, Multiple Data (SIMD)
• Task parallel
  • Different instructions on different data (MIMD)
• SPMD (Single Program, Multiple Data)
  • Combination of data parallel and task parallel
  • Not synchronized at the individual operation level
• Message passing is for MIMD/SPMD parallelism
  • Can also be used for data parallel programming

11
Message Passing

• A process is a program counter and address space.
• Message passing is used for communication among processes.
• Inter-process communication:
  • Type: synchronous / asynchronous
  • Movement of data from one process's address space to another's

12
The Message-Passing Model
• A process is a program counter and address space
• Processes can have multiple threads (program counters and associated stacks) sharing a single address space
[Figure: processes P1..P4, each an address space (memory) containing one or more threads]
• MPI is for communication among processes
  • Not threads
• Interprocess communication consists of
  • Synchronization
  • Data movement
13
Synchronous vs. Asynchronous

• A synchronous communication is not complete until the message has been received.
• An asynchronous communication completes as soon as the message is on the way.

14
Synchronous vs. Asynchronous (cont.)

15
What is message passing?

• Data transfer.
• Requires cooperation of sender and receiver.
• Cooperation is not always apparent in code.

16
SPMD
[Figure: one shared program replicated across processes, each process holding its own block of data]
• "Owner compute" rule: the process that "owns" the data (local data) performs computations on that data
• Data distributed across processes
  • Not shared
17
Message Passing Programming

Defined by communication requirements
• Data communication (necessary for the algorithm)
• Control communication (necessary for dependencies)

Program behavior is determined by communication patterns

Message passing infrastructure attempts to support the forms of communication most often used or desired
• Basic forms provide functional access
  • Can be used most often
• Complex forms provide higher-level abstractions
  • Serve as basis for extension
  • Examples: graph libraries, meshing libraries, …
  • Extensions for greater programming power

18
Principles of
Message-Passing Programming

• The logical view of a machine supporting the message-passing paradigm consists of p processes, each with its own exclusive address space.
• Each data element must belong to one of the partitions of the space; hence,
data must be explicitly partitioned and placed.
• All interactions (read-only or read/write) require cooperation of two processes
- the process that has the data and the process that wants to access the data.
• These two constraints, while onerous, make underlying costs very explicit to
the programmer.

19
Principles of Message-Passing Programming

• Message-passing programs are often written using the asynchronous or loosely synchronous paradigms.
• In the asynchronous paradigm, all concurrent tasks execute asynchronously.
• In the loosely synchronous model, tasks or subsets of tasks synchronize to perform interactions. Between these interactions, tasks execute completely asynchronously.
• Most message-passing programs are written using the single program multiple data (SPMD) model.

20
Communication Types
• Two ideas for communication
  • Cooperative operations
  • One-sided operations

21
Cooperative Operations for Communication
• Data is cooperatively exchanged in message-passing
• Explicitly sent by one process and received by another
• Advantage of local control of memory
• Any change in the receiving process’s memory is made with the
receiver’s explicit participation
• Communication and synchronization are combined

[Figure: Process 0 calls Send(data), Process 1 calls Receive(data); time advances downward]
22
One-Sided Operations for Communication

• One-sided operations between processes include remote memory reads and writes
• Only one process needs to explicitly participate
• There is still agreement implicit in the SPMD program
• Advantages?
  • Communication and synchronization are decoupled
[Figure: Process 0 issues Put(data) into and Get(data) from Process 1's memory; time advances downward]
23
Pairwise vs. Collective
Communication
• Communication between process pairs
• Send/Receive or Put/Get
• Synchronous or asynchronous (we’ll talk about this later)
• Collective communication between multiple processes
• Process group (collective)
• Several processes logically grouped together
• Communication within group
• Collective operations
• Communication patterns
• broadcast, multicast, subset, scatter/gather, …
• Reduction operations

24
The Building Blocks:
Send and Receive Operations
The prototypes of these operations are as follows:

• send(void *sendbuf, int nelems, int dest)
• receive(void *recvbuf, int nelems, int source)

Consider the following code segments:

P0:                        P1:
a = 100;                   receive(&a, 1, 0);
send(&a, 1, 1);            printf("%d\n", a);
a = 0;

The semantics of the send operation require that the value received by process P1
must be 100, as opposed to 0.

This motivates the design of the send and receive protocols.

25
Blocking vs. Non-Blocking

• Blocking means the program will not continue until the communication is completed.
• Non-blocking means the program will continue without waiting for the communication to be completed.
26
Non-Buffered Blocking
Message Passing Operations Send/Receive

• Handshake for a blocking non-buffered send/receive operation.
• It is easy to see that in cases where the sender and receiver do not reach the communication point at similar times, there can be considerable idling overheads.

27
Deadlock in Blocking Non-Buffered Operations

28
Buffered Blocking
Message Passing Operations :Send/Receive

• A simple solution to the idling and deadlocking problem outlined above is to rely on buffers at the sending and receiving ends.
• The sender simply copies the data into the designated buffer and returns after the copy operation has been completed.
• The data must be buffered at the receiving end as well.
• Buffering trades off idling overhead for buffer copying overhead.

29
30
Buffered Blocking
Message Passing Operations
• Blocking buffered transfer
protocols:
• (a) in the presence of
communication hardware
with buffers at send and
receive ends;
• (b) in the absence of
communication hardware,
sender interrupts receiver
and deposits data in buffer at
receiver end.

(a) (b) 31
Buffered Blocking
Message Passing Operations

Bounded buffer sizes can have a significant impact on performance.

P0:                                P1:
for (i = 0; i < 1000; i++) {       for (i = 0; i < 1000; i++) {
    produce_data(&a);                  receive(&a, 1, 0);
    send(&a, 1, 1);                    consume_data(&a);
}                                  }

What if the consumer is much slower than the producer?
• Bounded buffer requirements
32
Buffered Blocking
Message Passing Operations
Deadlocks are still possible with buffering since receive operations block.

P0:                        P1:
receive(&a, 1, 1);         receive(&a, 1, 0);
send(&b, 1, 1);            send(&b, 1, 0);

33
Non-Blocking
Message Passing Operations

• This class of non-blocking protocols returns from the send or receive operation before it is semantically safe to do so.
• The programmer must ensure the semantics of the send and receive.
• Non-blocking operations are generally accompanied by a check-status operation.
• When used correctly, these primitives are capable of overlapping communication overheads with useful computations.
• Message passing libraries typically provide both blocking and non-blocking primitives.

34
Non-Blocking Send and Receive
Non-Blocking
Message Passing Operations

• Non-blocking non-buffered
send and receive operations
• (a) in absence of
communication hardware;
• (b) in presence of
communication hardware.

36
Send and Receive Protocols

• Space of possible protocols


for send and receive
operations.

37
MPI

38
• A message-passing library specification:
• Extended message-passing model
• Not a language or compiler specification
• Not a specific implementation or product
• For parallel computers, clusters, and heterogeneous networks.
• Communication modes: standard, synchronous, buffered, and ready.
• Designed to permit the development of parallel software libraries.
• Designed to provide access to advanced parallel hardware for
• End users
• Library writers
• Tool developers
39
MPI
• MPI defines a standard library for message-passing that can be used
to develop portable message-passing programs using either C or
Fortran.
• The MPI standard defines both the syntax as well as the semantics of
a core set of library routines.
• Vendor implementations of MPI are available on almost all
commercial parallel computers.
• It is possible to write fully-functional message-passing programs by
using only the six routines.

41
Why Use MPI?
• Message passing is a mature parallel programming model
• Well understood
• Efficient match to hardware (interconnection networks)
• Many applications
• MPI provides a powerful, efficient, and portable way to express parallel programs
• MPI was explicitly designed to enable libraries …
• … which may eliminate the need for many users to learn (much of) MPI
• Need standard, rich, and robust implementation
• Versions: MPI-1, MPI-2, MPI-3, MPI-4
• Robust implementations including free MPICH (ANL)

42
Features of MPI

General
• Communicators combine context and group for security
• Thread safety (implementation dependent)

Point-to-point communication
• Structured buffers and derived datatypes, heterogeneity
• Modes: normal, synchronous, ready, buffered

Collective
• Both built-in and user-defined collective operations
• Large number of data movement routines
• Subgroups defined directly or by topology

43
Features that are NOT part of MPI

• Process management
• Remote memory transfers
• Threads
• Virtual shared memory

45
Is MPI Large or Small?
MPI is large
• MPI-1 has 128 functions, MPI-2 has 152 functions
• Extensive functionality requires many functions
• Not necessarily a measure of complexity

MPI is small (6 functions)
• Many parallel programs use just 6 basic functions

"MPI is just right," said Baby Bear
• One can access flexibility when it is required
• One need not master all parts of MPI to use it

46
To use or not use MPI? That is the question?

USE MPI when:
• You need a portable parallel program
• You are writing a parallel library
• You have irregular or dynamic data relationships that do not fit a data parallel model
• You care about performance

DO NOT USE MPI when:
• You don't need parallelism at all
• You can use libraries (which may be written in MPI)
• You can use multi-threading in a concurrent environment
47
MPI: the Message Passing Interface

The minimal set of MPI routines:
MPI_Init         Initializes MPI.
MPI_Finalize     Terminates MPI.
MPI_Comm_size    Determines the number of processes.
MPI_Comm_rank    Determines the label (rank) of the calling process.
MPI_Send         Sends a message.
MPI_Recv         Receives a message.
48
Starting and Terminating the MPI Library
MPI_Init is called prior to any calls to other MPI routines. Its purpose is to initialize the MPI
environment.

MPI_Finalize is called at the end of the computation, and it performs various clean-up tasks to
terminate the MPI environment.

The prototypes of these two functions are:
int MPI_Init(int *argc, char ***argv)
int MPI_Finalize()

MPI_Init also strips off any MPI related command-line arguments.

All MPI routines, data-types, and constants are prefixed by “MPI_”. The return code for successful
completion is MPI_SUCCESS.

49
Initialization and Finalization
MPI_Init
• gathers information about the parallel job
• sets up internal library state
• prepares for communication

MPI_Finalize
• cleanup

50
Communicators

51
Group and Context

are two important and indivisible concepts of MPI.

• Group: the set of processes that communicate with one another.
• Context: somewhat like a frequency in radio communications - it keeps communication in different contexts from interfering.
• Communicator: the central object for communication in MPI. Each communicator is associated with a group and a context.

52
Communicators
• A communicator defines a communication domain - a set of processes
that are allowed to communicate with each other.
• Information about communication domains is stored in variables of
type MPI_Comm.
• Communicators are used as arguments to all message transfer MPI
routines.
• A process can belong to many different (possibly overlapping)
communication domains.
• MPI defines a default communicator called MPI_COMM_WORLD
which includes all the processes.
53
MPI_COMM_WORLD

54
MPI_COMM_WORLD

55
Communication Scope
Communicator (communication handle)
• Defines the scope
• Specifies the communication context

Process
• Belongs to a group
• Identified by a rank within a group

Identification
• MPI_Comm_size - total number of processes in the communicator
• MPI_Comm_rank - rank in the communicator

56
Querying Information

1. The MPI_Comm_size and MPI_Comm_rank functions are used to determine the number of processes and the label of the calling process, respectively.
2. The calling sequences of these routines are as follows:
   • int MPI_Comm_size(MPI_Comm comm, int *size)
   • int MPI_Comm_rank(MPI_Comm comm, int *rank)
3. The rank of a process is an integer that ranges from zero up to the size of the communicator minus one.

57
Our First MPI Program

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int npes, myrank;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &npes);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    printf("From process %d out of %d, Hello World!\n",
           myrank, npes);
    MPI_Finalize();
    return 0;
}

58
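The exact build and launch commands depend on the MPI installation; with typical MPICH or Open MPI setups the program is compiled with the wrapper compiler and started with the launcher, for example (hello.c is an assumed file name):

mpicc hello.c -o hello
mpiexec -n 4 ./hello

Each of the 4 launched processes prints its own line with its rank.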
Sending and Receiving Messages
• The basic functions for sending and receiving messages in MPI are the MPI_Send
and MPI_Recv, respectively.
• The calling sequences of these routines are as follows:
int MPI_Send(void *buf, int count, MPI_Datatype
datatype, int dest, int tag, MPI_Comm comm)
int MPI_Recv(void *buf, int count, MPI_Datatype
datatype, int source, int tag, MPI_Comm comm,
MPI_Status *status)
• MPI provides equivalent datatypes for all C datatypes. This is done for portability
reasons.
• The datatype MPI_BYTE corresponds to a byte (8 bits) and MPI_PACKED corresponds to a collection of data items created by packing non-contiguous data.
• The message tag can take values ranging from zero up to the MPI-defined constant MPI_TAG_UB.
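As a minimal illustration (not taken from the textbook; the value 100, tag 0, and the two-process assumption are illustrative), the following complete program sends one integer from process 0 to process 1:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int npes, myrank, value;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &npes);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    if (myrank == 0) {
        value = 100;
        /* send one int to process 1 with tag 0 */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (myrank == 1) {
        /* receive one int from process 0 with tag 0 */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        printf("Process 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}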
MPI Datatype C Datatype
MPI_CHAR signed char
MPI_SHORT signed short int
MPI_INT signed int
MPI_LONG signed long int
MPI_UNSIGNED_CHAR unsigned char
MPI MPI_UNSIGNED_SHORT unsigned short int
Datatypes MPI_UNSIGNED unsigned int
MPI_UNSIGNED_LONG unsigned long int
MPI_FLOAT float
MPI_DOUBLE double
MPI_LONG_DOUBLE long double
MPI_BYTE
MPI_PACKED

60
Sending and Receiving Messages

• MPI allows specification of wildcard arguments for both source and tag.
• If source is set to MPI_ANY_SOURCE, then any process of the communication domain can be the source of the message.
• If tag is set to MPI_ANY_TAG, then messages with any tag are accepted.
• On the receive side, the message must be of length equal to or less than the length field specified.

61
Sending and Receiving Messages

• On the receiving end, the status variable can be used to get information about the MPI_Recv operation.
• The corresponding data structure contains:
    typedef struct MPI_Status {
        int MPI_SOURCE;
        int MPI_TAG;
        int MPI_ERROR;
    };
• The MPI_Get_count function returns the precise count of data items received.
    int MPI_Get_count(MPI_Status *status, MPI_Datatype datatype, int *count)

62
Avoiding Deadlocks

Consider the following, where each process waits for the matching send with respect to the tag:

int a[10], b[10], myrank;
MPI_Status status;
...
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if (myrank == 0) {
    MPI_Send(a, 10, MPI_INT, 1, 1, MPI_COMM_WORLD);
    MPI_Send(b, 10, MPI_INT, 1, 2, MPI_COMM_WORLD);
}
else if (myrank == 1) {
    MPI_Recv(b, 10, MPI_INT, 0, 2, MPI_COMM_WORLD, &status);
    MPI_Recv(a, 10, MPI_INT, 0, 1, MPI_COMM_WORLD, &status);
}
...

If MPI_Send is blocking, there is a deadlock: process 1 first waits for the tag-2 message, which process 0 sends only after its blocking tag-1 send completes.

63
Avoiding Deadlocks
• Deadlock may also occur when a process sends a message to itself.
• It is legal in MPI.
• The behavior is implementation dependent and must be avoided.

64
Avoiding Deadlocks

• Improper use of MPI_Send and MPI_Recv can also lead to deadlocks in situations when each processor needs to send and receive a message in a circular fashion.

int a[10], b[10], npes, myrank;
MPI_Status status;
...
MPI_Comm_size(MPI_COMM_WORLD, &npes);
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
MPI_Send(a, 10, MPI_INT, (myrank+1)%npes, 1, MPI_COMM_WORLD);
MPI_Recv(b, 10, MPI_INT, (myrank-1+npes)%npes, 1, MPI_COMM_WORLD, &status);
...

65
Avoiding Deadlocks

Consider the following piece of code, in which process i sends a message to process i + 1 (modulo the number of processes) and receives a message from process i - 1 (modulo the number of processes).

int a[10], b[10], npes, myrank;
MPI_Status status;
...
MPI_Comm_size(MPI_COMM_WORLD, &npes);
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
MPI_Send(a, 10, MPI_INT, (myrank+1)%npes, 1, MPI_COMM_WORLD);
MPI_Recv(b, 10, MPI_INT, (myrank-1+npes)%npes, 1, MPI_COMM_WORLD, &status);
...

Once again, we have a deadlock if MPI_Send is blocking.

66
Avoiding Deadlocks

We can break the circular wait to avoid deadlocks as follows:

int a[10], b[10], npes, myrank;
MPI_Status status;
...
MPI_Comm_size(MPI_COMM_WORLD, &npes);
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if (myrank%2 == 1) {
    MPI_Send(a, 10, MPI_INT, (myrank+1)%npes, 1, MPI_COMM_WORLD);
    MPI_Recv(b, 10, MPI_INT, (myrank-1+npes)%npes, 1, MPI_COMM_WORLD, &status);
}
else {
    MPI_Recv(a, 10, MPI_INT, (myrank-1+npes)%npes, 1, MPI_COMM_WORLD, &status);
    MPI_Send(b, 10, MPI_INT, (myrank+1)%npes, 1, MPI_COMM_WORLD);
}
...

67
Sending and Receiving Messages Simultaneously

To exchange messages, MPI provides the following function:

int MPI_Sendrecv(void *sendbuf, int sendcount, MPI_Datatype senddatatype, int dest, int sendtag,
                 void *recvbuf, int recvcount, MPI_Datatype recvdatatype, int source, int recvtag,
                 MPI_Comm comm, MPI_Status *status)

The safe version of the earlier example using MPI_Sendrecv is:

int a[10], b[10], npes, myrank;
MPI_Status status;
...
MPI_Comm_size(MPI_COMM_WORLD, &npes);
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
MPI_Sendrecv(a, 10, MPI_INT, (myrank+1)%npes, 1,
             b, 10, MPI_INT, (myrank-1+npes)%npes, 1,
             MPI_COMM_WORLD, &status);
...
int MPI_Sendrecv_replace(void *buf, int count, MPI_Datatype datatype, int dest, int sendtag,
int source, int recvtag, MPI_Comm comm, MPI_Status *status)

MPI_Sendrecv_replace uses:
• a single buffer for both sending and receiving
• performs a blocking send and receive
• the send and receive must transfer data of the same datatype

69
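A short sketch (not from the slides; the ring shift and tag 1 are illustrative) of MPI_Sendrecv_replace, where the single buffer a is overwritten in place by the value received from the left neighbour:

int a, npes, myrank;
MPI_Status status;

/* assumes MPI_Init has already been called */
MPI_Comm_size(MPI_COMM_WORLD, &npes);
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
a = myrank;                        /* value to pass around the ring */
/* send a to the right neighbour and replace it with the left neighbour's value */
MPI_Sendrecv_replace(&a, 1, MPI_INT, (myrank+1)%npes, 1,
                     (myrank-1+npes)%npes, 1, MPI_COMM_WORLD, &status);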
Topologies and Embedding

70
Topologies and Embeddings

• MPI allows a programmer to organize processors into logical k-d meshes.
• The processor ids in MPI_COMM_WORLD can be mapped to other communicators (corresponding to higher-dimensional meshes) in many ways.
• The goodness of any such mapping is determined by the interaction pattern of the underlying program and the topology of the machine.
• MPI does not provide the programmer any control over these mappings.

71
Topologies and Embeddings
• Different ways to map a set of processes to a two-dimensional grid:
  (a) and (b) show a row- and a column-wise mapping of these processes,
  (c) shows a mapping that follows a space-filling curve (dotted line), and
  (d) shows a mapping in which neighboring processes are directly connected in a hypercube.

72
Creating and Using Cartesian Topologies

• Virtual process topology - of arbitrary connection - specified in terms of a graph
  • Each node corresponds to a process
  • A link indicates that the two processes communicate with each other
  • Graphs can be used to specify any desired topology
• Commonly used topologies in message-passing programs are one-, two-, or higher-dimensional grids, also referred to as Cartesian topologies
• MPI's function for describing Cartesian topologies is called MPI_Cart_create:

int MPI_Cart_create(MPI_Comm comm_old, int ndims, int *dims, int *periods,
                    int reorder, MPI_Comm *comm_cart)

• This function takes the processes in the old communicator and creates a new communicator with ndims dimensions (of sizes given by dims).
• Each process can now be identified in this new Cartesian topology by a vector of ndims coordinates.

73
Since sending and receiving messages still require (one-dimensional)
ranks, MPI provides routines to convert ranks to cartesian coordinates
and vice-versa.
• int MPI_Cart_coord(MPI_Comm comm_cart, int rank, int maxdims, int
*coords)
• int MPI_Cart_rank(MPI_Comm comm_cart, int *coords, int *rank)

The most common operation on cartesian topologies is a shift. To determine the rank
of source and destination of such shifts, MPI provides the following function:

• int MPI_Cart_shift(MPI_Comm comm_cart, int dir, int s_step, int *rank_source, int *rank_dest)

74
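A sketch of these routines used together (not from the textbook; it assumes the job runs with exactly 16 processes and uses a periodic 4 x 4 grid and tag 1): create the grid, then shift a value one step along the second dimension.

int dims[2] = {4, 4}, periods[2] = {1, 1};
int my2drank, left, right, a, b;
MPI_Comm comm_2d;
MPI_Status status;

/* reorder = 1 lets MPI renumber ranks to better fit the machine */
MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &comm_2d);
MPI_Comm_rank(comm_2d, &my2drank);

/* ranks of the neighbours one step away along dimension 1 */
MPI_Cart_shift(comm_2d, 1, 1, &left, &right);

a = my2drank;
/* send to the right neighbour, receive from the left neighbour */
MPI_Sendrecv(&a, 1, MPI_INT, right, 1, &b, 1, MPI_INT, left, 1,
             comm_2d, &status);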
Overlapping Communication with
Computation

75
76
Overlapping Communication
with Computation
• int MPI_Isend(void *buf, int count, MPI_Datatype datatype, int dest,
int tag, MPI_Comm comm, MPI_Request *request)
• int MPI_Irecv(void *buf, int count, MPI_Datatype datatype, int
source, int tag, MPI_Comm comm, MPI_Request *request)

• These operations return before the operations have been completed.
• The function MPI_Test tests whether or not the non-blocking send or receive operation identified by its request has finished.
    int MPI_Test(MPI_Request *request, int *flag, MPI_Status *status)
• MPI_Wait waits for the operation to complete.
    int MPI_Wait(MPI_Request *request, MPI_Status *status)

77
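A sketch of the intended usage pattern (the ring exchange, buffer size 10, and tag 1 are illustrative; MPI_Waitall is a standard MPI routine not listed on the slide, and two MPI_Wait calls would work equally well): post the operations, compute on other data, then wait.

int a[10], b[10], npes, myrank;
MPI_Request reqs[2];
MPI_Status stats[2];

MPI_Comm_size(MPI_COMM_WORLD, &npes);
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

/* post the communication first ... */
MPI_Isend(a, 10, MPI_INT, (myrank+1)%npes, 1, MPI_COMM_WORLD, &reqs[0]);
MPI_Irecv(b, 10, MPI_INT, (myrank-1+npes)%npes, 1, MPI_COMM_WORLD, &reqs[1]);

/* ... perform computation that does not touch a or b ... */

/* ... then wait for both operations before reusing the buffers */
MPI_Waitall(2, reqs, stats);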
Avoiding Deadlocks: Non-Blocking Counterpart

Using non-blocking operations removes most deadlocks. Consider:

int a[10], b[10], myrank;
MPI_Status status;
...
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if (myrank == 0) {
    MPI_Send(a, 10, MPI_INT, 1, 1, MPI_COMM_WORLD);
    MPI_Send(b, 10, MPI_INT, 1, 2, MPI_COMM_WORLD);
}
else if (myrank == 1) {
    MPI_Recv(b, 10, MPI_INT, 0, 2, MPI_COMM_WORLD, &status);
    MPI_Recv(a, 10, MPI_INT, 0, 1, MPI_COMM_WORLD, &status);
}
...

Replacing either the send or the receive operations with non-blocking counterparts fixes this deadlock:

int a[10], b[10], myrank;
MPI_Status status;
MPI_Request requests[2];
...
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if (myrank == 0) {
    MPI_Send(a, 10, MPI_INT, 1, 1, MPI_COMM_WORLD);
    MPI_Send(b, 10, MPI_INT, 1, 2, MPI_COMM_WORLD);
}
else if (myrank == 1) {
    MPI_Irecv(b, 10, MPI_INT, 0, 2, MPI_COMM_WORLD, &requests[0]);
    MPI_Irecv(a, 10, MPI_INT, 0, 1, MPI_COMM_WORLD, &requests[1]);
}
Collective Communication and Computation Operations

• MPI provides an extensive set of functions for performing common collective communication operations.
• Each of these operations is defined over a group corresponding to the communicator.
• All processors in a communicator must call these operations.

79
80
81
int MPI_Bcast(void *buf, int count, MPI_Datatype datatype,
int source, MPI_Comm comm)
82
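A brief sketch (the value 42 and root 0 are illustrative): before the call only the root holds the value, after it every process in the communicator does.

int value = 0, myrank;

MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if (myrank == 0)
    value = 42;                    /* only the root has the value initially */
/* broadcast from rank 0; afterwards value == 42 on every process */
MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);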
83
Reduction
• int MPI_Reduce(void *sendbuf, void *recvbuf,
int count, MPI_Datatype datatype, MPI_Op op,
int target, MPI_Comm comm)

84
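A brief sketch (contributed values and the root rank are illustrative) that sums one integer per process onto process 0 with MPI_SUM:

int myrank, local, global;

MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
local = myrank + 1;                /* each process contributes one value */
/* on process 0, global receives the sum of all local values */
MPI_Reduce(&local, &global, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
if (myrank == 0)
    printf("sum = %d\n", global);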
Predefined Reduction Operations

Operation     Meaning                          Datatypes
MPI_MAX       Maximum                          C integers and floating point
MPI_MIN       Minimum                          C integers and floating point
MPI_SUM       Sum                              C integers and floating point
MPI_PROD      Product                          C integers and floating point
MPI_LAND      Logical AND                      C integers
MPI_BAND      Bit-wise AND                     C integers and byte
MPI_LOR       Logical OR                       C integers
MPI_BOR       Bit-wise OR                      C integers and byte
MPI_LXOR      Logical XOR                      C integers
MPI_BXOR      Bit-wise XOR                     C integers and byte
MPI_MAXLOC    Maximum value and its location   Data-pairs
MPI_MINLOC    Minimum value and its location   Data-pairs

85
An example use of the MPI_MINLOC and MPI_MAXLOC operators.

• The operation MPI_MAXLOC combines pairs of values (vi, li) and returns
the pair (v, l) such that v is the maximum among all vi 's and l is the
corresponding li (if there are more than one, it is the smallest among all
these li 's).
• MPI_MINLOC does the same, except for minimum value of vi.

86
Collective Communication Operations

MPI datatypes for data-pairs used with the MPI_MAXLOC and MPI_MINLOC reduction operations:

MPI Datatype            C Datatype
MPI_2INT                pair of ints
MPI_SHORT_INT           short and int
MPI_LONG_INT            long and int
MPI_LONG_DOUBLE_INT     long double and int
MPI_FLOAT_INT           float and int
MPI_DOUBLE_INT          double and int

87
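A sketch (values are illustrative and local_result merely stands in for some locally computed quantity) that finds the global maximum and the rank that owns it, using MPI_DOUBLE_INT with MPI_MAXLOC:

struct { double val; int rank; } in, out;
int myrank;
double local_result = 0.0;         /* stands in for a locally computed value */

MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
in.val  = local_result;
in.rank = myrank;                  /* the "location" carried along with the value */
/* on process 0, out.val is the global maximum and out.rank the rank holding it */
MPI_Reduce(&in, &out, 1, MPI_DOUBLE_INT, MPI_MAXLOC, 0, MPI_COMM_WORLD);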
Collective Communication Operations

• If the result of the reduction operation is needed by all processes, MPI provides:
    int MPI_Allreduce(void *sendbuf, void *recvbuf, int count,
                      MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)
• To compute prefix-sums, MPI provides:
    int MPI_Scan(void *sendbuf, void *recvbuf, int count,
                 MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)

88
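A sketch of MPI_Scan (values illustrative): each process ends up with the inclusive prefix sum of the contributions of ranks 0 through its own rank.

int myrank, local, prefix;

MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
local = myrank + 1;                /* contribution of this process */
/* prefix on rank r = sum of local values of ranks 0..r (inclusive scan) */
MPI_Scan(&local, &prefix, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);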
Collective Communication Operations

• The barrier synchronization operation is performed in MPI using:
    int MPI_Barrier(MPI_Comm comm)
• The one-to-all broadcast operation is:
    int MPI_Bcast(void *buf, int count, MPI_Datatype datatype, int source, MPI_Comm comm)
• The all-to-one reduction operation is:
    int MPI_Reduce(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype,
                   MPI_Op op, int target, MPI_Comm comm)

89
• The all-to-all reduction operation is:
    int MPI_Allreduce(void *sendbuf, void *recvbuf, int count,
                      MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)
• The prefix reduction operation is:
    int MPI_Scan(void *sendbuf, void *recvbuf, int count,
                 MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)

90
Scatter and Gather

• int MPI_Scatter(void *sendbuf, int sendcount, MPI_Datatype senddatatype,
                  void *recvbuf, int recvcount, MPI_Datatype recvdatatype,
                  int source, MPI_Comm comm)

• int MPI_Gather(void *sendbuf, int sendcount, MPI_Datatype senddatatype,
                 void *recvbuf, int recvcount, MPI_Datatype recvdatatype,
                 int target, MPI_Comm comm)

91
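A sketch (block size 4, root 0, and the assumption of at most 16 processes are illustrative) where the root scatters a block of 4 integers to every process and gathers the processed blocks back in rank order:

int npes, myrank, i;
int full[64], part[4];             /* 64 = 16 processes x 4 ints; adjust as needed */

MPI_Comm_size(MPI_COMM_WORLD, &npes);
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
if (myrank == 0)                   /* only the root's send buffer matters */
    for (i = 0; i < npes * 4; i++)
        full[i] = i;

/* each process receives 4 consecutive elements of full into part */
MPI_Scatter(full, 4, MPI_INT, part, 4, MPI_INT, 0, MPI_COMM_WORLD);
for (i = 0; i < 4; i++)
    part[i] *= 2;                  /* local work on the received block */
/* the root collects the processed blocks back, in rank order */
MPI_Gather(part, 4, MPI_INT, full, 4, MPI_INT, 0, MPI_COMM_WORLD);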
Vector variants of gather & allgather

Gatherv:
    int MPI_Gatherv(void *sendbuf, int sendcount, MPI_Datatype senddatatype,
                    void *recvbuf, int *recvcounts, int *displs,
                    MPI_Datatype recvdatatype, int target, MPI_Comm comm)

Allgatherv:
    int MPI_Allgatherv(void *sendbuf, int sendcount, MPI_Datatype senddatatype,
                       void *recvbuf, int *recvcounts, int *displs,
                       MPI_Datatype recvdatatype, MPI_Comm comm)

92
Vector variant of scatter

Scatterv:
    int MPI_Scatterv(void *sendbuf, int *sendcounts, int *displs,
                     MPI_Datatype senddatatype, void *recvbuf, int recvcount,
                     MPI_Datatype recvdatatype, int source, MPI_Comm comm)

93
94
95
96
97
Groups and Communicators

• In many parallel algorithms, communication operations need to be restricted to certain subsets of processes.
• MPI provides mechanisms for partitioning the group of processes that belong to a
communicator into subgroups each corresponding to a different communicator.
• The simplest such mechanism is:
int MPI_Comm_split(MPI_Comm comm, int color, int
key, MPI_Comm *newcomm)
• This operation groups processors by color and sorts resulting groups on the key.
• Using MPI_Comm_split to split a group of processes in a communicator
into subgroups.
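A sketch (the choice of 4 processes per subgroup is an assumption): processes with the same value of myrank/4 land in the same sub-communicator, ordered by their original rank.

int myrank, newrank;
MPI_Comm rowcomm;

MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
/* color = myrank/4 puts every 4 consecutive ranks in one subgroup;
   key = myrank orders the processes inside each subgroup by their old rank */
MPI_Comm_split(MPI_COMM_WORLD, myrank / 4, myrank, &rowcomm);
MPI_Comm_rank(rowcomm, &newrank);  /* rank within the new sub-communicator */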
Groups and Communicators
• In many parallel algorithms, processes are arranged in a virtual grid, and in different steps of the algorithm, communication needs to be restricted to a different subset of the grid.
• MPI provides a convenient way to partition a Cartesian topology to form lower-dimensional grids:
    int MPI_Cart_sub(MPI_Comm comm_cart, int *keep_dims, MPI_Comm *comm_subcart)
• If keep_dims[i] is true (non-zero value in C), then the ith dimension is retained in the new sub-topology.
• The coordinate of a process in a sub-topology created by MPI_Cart_sub can be obtained from its coordinate in the original topology by disregarding the coordinates that correspond to the dimensions that were not retained.
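A sketch of case (b) from the next slide (comm_3d is assumed to be a 2 x 4 x 7 Cartesian communicator created earlier with MPI_Cart_create): keeping only the last dimension yields eight sub-communicators of size 1 x 1 x 7.

int keep_dims[3];
MPI_Comm comm_3d, comm_sub;        /* comm_3d: assumed 2 x 4 x 7 Cartesian communicator */

keep_dims[0] = 0;                  /* drop dimension 0 */
keep_dims[1] = 0;                  /* drop dimension 1 */
keep_dims[2] = 1;                  /* keep dimension 2 -> eight 1 x 1 x 7 sub-topologies */
MPI_Cart_sub(comm_3d, keep_dims, &comm_sub);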
Groups and Communicators

• Splitting a Cartesian topology of size 2 x 4 x 7 into
  (a) four subgroups of size 2 x 1 x 7, and
  (b) eight subgroups of size 1 x 1 x 7.
Basic Commands
Standard with blocking
Skeleton MPI Program

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* main part of the program */

    /*
      Use MPI function calls depending on your data partitioning and the
      parallelization architecture
    */

    MPI_Finalize();
    return 0;
}
Initializing MPI

• The initialization routine MPI_Init is the first MPI routine called.
• MPI_Init is called once.
    int MPI_Init(int *argc, char ***argv);
A minimal MPI program (C)

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
    printf("Hello, world!\n");
    MPI_Finalize();
    return 0;
}
A minimal MPI program (C) (cont.)

• #include "mpi.h" provides basic MPI definitions and types.
• MPI_Init starts MPI.
• MPI_Finalize exits MPI.
• Note that all non-MPI routines are local; thus printf runs on each process.
• Note: MPI functions return error codes or MPI_SUCCESS.
Error handling

• By default, an error causes all processes to abort.
• The user can have his/her own error handling routines.
• Some custom error handlers are available for downloading from the net.
Improved Hello (C)

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("I am %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
Some concepts

• The default communicator is MPI_COMM_WORLD.
• A process is identified by its rank in the group associated with a communicator.
Data Types

• The data message which is sent or received is described by a triple (address, count, datatype).
• The following data types are supported by MPI:
  • Predefined data types that correspond to data types from the programming language.
  • Arrays.
  • Sub-blocks of a matrix.
  • User-defined data structures.
  • A set of predefined data types.
Basic MPI types

MPI datatype            C datatype
MPI_CHAR                signed char
MPI_SIGNED_CHAR         signed char
MPI_UNSIGNED_CHAR       unsigned char
MPI_SHORT               signed short
MPI_UNSIGNED_SHORT      unsigned short
MPI_INT                 signed int
MPI_UNSIGNED            unsigned int
MPI_LONG                signed long
MPI_UNSIGNED_LONG       unsigned long
MPI_FLOAT               float
MPI_DOUBLE              double
MPI_LONG_DOUBLE         long double
Why define the data types during the send of a message?

Because communications take place between heterogeneous machines, which may have different data representations and lengths in memory.
MPI blocking send

MPI_SEND(void *start, int count, MPI_DATATYPE datatype,
         int dest, int tag, MPI_COMM comm)

• The message buffer is described by (start, count, datatype).
• dest is the rank of the target process in the defined communicator.
• tag is the message identification number.
MPI blocking receive

MPI_RECV(void *start, int count, MPI_DATATYPE datatype,
         int source, int tag, MPI_COMM comm, MPI_STATUS *status)

• source is the rank of the sender in the communicator.
• The receiver can specify a wildcard value for source (MPI_ANY_SOURCE) and/or a wildcard value for tag (MPI_ANY_TAG), indicating that any source and/or tag are acceptable.
• status is used for extra information about the received message if a wildcard receive mode is used.
• If the count of the message received is less than or equal to that described by the MPI receive command, the message is successfully received; otherwise it is considered a buffer overflow error.
MPI_STATUS

Status is a data structure. In C:

int recvd_tag, recvd_from, recvd_count;
MPI_Status status;
MPI_Recv(..., MPI_ANY_SOURCE, MPI_ANY_TAG, ..., &status);
recvd_tag  = status.MPI_TAG;
recvd_from = status.MPI_SOURCE;
MPI_Get_count(&status, datatype, &recvd_count);
More info

• A receive operation may accept messages from an arbitrary sender, but a send operation must specify a unique receiver.
• Source equal to destination is allowed; that is, a process can send a message to itself.
