2 MPI
int MPI_Comm_rank(MPI_Comm comm, int *rank); Determines the rank of the calling process in the communicator.
The first argument to the call is a communicator and the rank of the process is returned in the second argument.
Essentially, a communicator is a collection of processes that can send messages to each other. The only communicator
needed for basic programs is MPI_COMM_WORLD; it is predefined in MPI and consists of the processes running when
program execution begins.
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    /* Process rank, number of processes, and processor-name length. */
    int rank, size, namelen;
    char name[MPI_MAX_PROCESSOR_NAME];

    /* Initialize MPI and query the runtime. */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &namelen);
    printf("Hello World. Rank %d out of %d running on %s!\n", rank, size, name);
    MPI_Finalize();
    return 0;
}
• MPI_Send(Message, BUFFER_SIZE, MPI_CHAR, Destination, Destination_tag, MPI_COMM_WORLD);: Send the
string Message of BUFFER_SIZE elements of type MPI_CHAR to the process with rank Destination in MPI_COMM_WORLD.
• Example: MPI_Recv(Message, BUFFER_SIZE, MPI_CHAR, Source, Source_tag, MPI_COMM_WORLD, &status);: Receive
the string Message of BUFFER_SIZE elements of type MPI_CHAR from Source belonging to MPI_COMM_WORLD. The execution
status of the function is stored in status.
Your task is to create a parallel MPI version of the π program. For this version, use the MPI_Bcast and MPI_Reduce
functions.
• Broadcast the total steps value (num_steps) to all the processes (process 0 could broadcast). The syntax and usage
of MPI_Bcast follow:
– MPI_Bcast(void *message, int count, MPI_Datatype datatype, int root, MPI_Comm comm);: Broadcast
a message from the process with rank “root” to all other processes of the group.
– Example: MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);: process 0 broadcasts n, of type MPI_INT, to
all the processes in MPI_COMM_WORLD.
– MPI_Reduce(void *operand, void *result, int count, MPI_Datatype datatype, MPI_Op op, int
root, MPI_Comm comm);: Reduce values on all processes to a single value. count refers to the number of
operand and result values.
– Example: MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);: Perform MPI_SUM
on mypi from each process into a single pi value and store it in process 0.
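Putting these calls together, a minimal sketch of the parallel π program follows (the midpoint-rule integration of 4/(1+x*x), the loop partitioning by rank, and the variable names num_steps and mypi are assumptions made for illustration, not the required solution):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size, i;
    long num_steps = 1000000;       /* total number of steps (set by rank 0) */
    double step, x, sum = 0.0, mypi, pi;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Rank 0 broadcasts the number of steps to every process. */
    MPI_Bcast(&num_steps, 1, MPI_LONG, 0, MPI_COMM_WORLD);

    /* Each process accumulates its share of the midpoint-rule sum. */
    step = 1.0 / (double)num_steps;
    for (i = rank; i < num_steps; i += size) {
        x = (i + 0.5) * step;
        sum += 4.0 / (1.0 + x * x);
    }
    mypi = step * sum;

    /* Reduce the partial sums into pi on rank 0. */
    MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("pi is approximately %.16f\n", pi);

    MPI_Finalize();
    return 0;
}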
half = numprocs;
do
    synch();   /* wait for partial sum completion */
...

MPI_Status *status;
MPI_Request request[200];
...
• Each process (including the root) finds the square root of each element it receives.
• After the computation, all the processes return the modified array to the root process. This step is called
Gather.
The sequence of operations is illustrated in Figure 3.
MPI provides the MPI_Scatter, MPI_Scatterv, MPI_Gather, and MPI_Gatherv primitives for use in situations where
scatter-gather sequences appear.
Note: The sendbuf, sendcounts, displs parameters are meaningful only in the root process.
• MPI_Gatherv: Gathers varying amounts of data from all processes to the root process.
int MPI_Gatherv(void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int *recvcounts,
int *displs, MPI_Datatype recvtype, int root, MPI_Comm comm)
Each of the parameters in the MPI_Gatherv function has a similar meaning as in the MPI_Scatterv function.
– Each sending process populates its sendbuf with sendcount elements of type MPI_INT.
– The root process receives the recvbuf array, which is the union of all the sendbuf arrays. The elements are filled
in recvbuf in the order of the ranks of the subprocesses.
Note: The recvbuf, recvcounts, displs parameters are meaningful only in the root process.
Example usage:
Each subprocess has its copy of My_gatherv, which contains Send_Count elements. The root process receives
Total_gatherv. Each subprocess’s segment starts from the index recorded in the Displacement array.
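The usage just described could look like the following sketch; the amount each process contributes (rank + 1 elements), the Recv_Count array, and the bound MAX_PROCS are assumptions made for illustration, while My_gatherv, Send_Count, Total_gatherv, and Displacement are the names used above:

#include <stdio.h>
#include <mpi.h>

#define MAX_PROCS 16                              /* assumed upper bound on processes */
#define MAX_TOTAL (MAX_PROCS * (MAX_PROCS + 1) / 2)

int main(int argc, char *argv[])
{
    int rank, size, i;
    int My_gatherv[MAX_PROCS];                    /* this process's segment        */
    int Total_gatherv[MAX_TOTAL];                 /* root: union of all segments   */
    int Recv_Count[MAX_PROCS], Displacement[MAX_PROCS];
    int Send_Count;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    Send_Count = rank + 1;                        /* a varying amount per process  */
    for (i = 0; i < Send_Count; i++)
        My_gatherv[i] = rank;                     /* fill the local segment        */

    /* Per-process counts and starting offsets; meaningful only at the root. */
    for (i = 0; i < size; i++) {
        Recv_Count[i] = i + 1;
        Displacement[i] = i * (i + 1) / 2;        /* sum of the previous counts    */
    }

    MPI_Gatherv(My_gatherv, Send_Count, MPI_INT,
                Total_gatherv, Recv_Count, Displacement, MPI_INT,
                0, MPI_COMM_WORLD);

    if (rank == 0) {
        for (i = 0; i < Displacement[size-1] + Recv_Count[size-1]; i++)
            printf("%d ", Total_gatherv[i]);
        printf("\n");
    }

    MPI_Finalize();
    return 0;
}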
• MPI_Scatter, MPI_Gather: These primitives are used to scatter (gather) arrays uniformly (all segments of the array
have an equal number of elements) among subprocesses.
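A minimal sketch of the square-root exercise using this uniform pair follows (the array size N, the element values, and the assumption that N divides evenly by the number of processes are all made up for illustration; link with -lm for sqrt):

#include <stdio.h>
#include <math.h>
#include <mpi.h>

#define N 16    /* total number of elements; assumed divisible by the number of processes */

int main(int argc, char *argv[])
{
    int rank, size, i, chunk;
    double data[N], local[N], result[N];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    chunk = N / size;

    if (rank == 0)                       /* root fills the input array */
        for (i = 0; i < N; i++)
            data[i] = (double)(i + 1);

    /* Scatter equal-sized segments to every process (root included). */
    MPI_Scatter(data, chunk, MPI_DOUBLE, local, chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* Each process takes the square root of its segment. */
    for (i = 0; i < chunk; i++)
        local[i] = sqrt(local[i]);

    /* Gather the modified segments back into the root process. */
    MPI_Gather(local, chunk, MPI_DOUBLE, result, chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0)
        for (i = 0; i < N; i++)
            printf("%f\n", result[i]);

    MPI_Finalize();
    return 0;
}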
struct dd{
char c;
int i[2];
float f[4];
};
Figure 4: The sequence of steps in the MPI program for the derived datatype question (Q7.): (a) the root process creates a new MPI datatype mirroring the struct; (b) after MPI_Type_commit, all processes register the derived datatype; (c) the root sends the data in the structure to all the subprocesses, which now identify the data as an item of the derived datatype.
– count indicates the number of blocks – 3 for the structure above (char, int, float).
– array_of_block_lengths: Number of elements in each block; [1, 2, 4] for the structure above. This indicates that
the first block contains one element, the second block has two elements, and the third block has four elements. The
code snippet:
array_of_block_lengths[0] = 1;
array_of_block_lengths[1] = 2;
array_of_block_lengths[2] = 4;
– array_of_displacements: Byte displacement of each block (an array of integers). The first value is always
0. The second displacement value is the difference between the addresses of the second block and the first.
The third displacement value is the difference between the addresses of the third block and the first. The
array_of_displacements array is of type MPI_Aint. Use the MPI_Get_address function to retrieve the addresses as
type MPI_Aint. The following snippet shows the second displacement calculation.
MPI_Get_address(&s.c, &block1_address);
MPI_Get_address(s.i, &block2_address);
array_of_displacements[1] = block2_address-block1_address;
block1_address and block2_address are of type MPI_Aint. array_of_displacements is an array of size equal to the total
number of blocks and is of type MPI_Aint. s is of type struct dd (shown above).
Prototype of MPI_Get_address: int MPI_Get_address(void *location, MPI_Aint *address);
– array_of_types: Type of the elements in each block. For the structure above, the first block contains characters,
the second contains integers, and the third contains floating point numbers. The array_of_types array is of type
MPI_Datatype. The values will be MPI_CHAR, MPI_INT, MPI_FLOAT.
– MPI_Type_create_struct returns a handle to the derived type in newtype.
• Commit the newtype using the MPI_Type_commit call. Prototype:
int MPI_Type_commit(MPI_Datatype *datatype)
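Putting the steps above together, a minimal sketch for struct dd follows (the variable names s and newtype and the values placed in the structure are assumptions for illustration; the broadcast at the end previews the collective-communication task described next):

#include <stdio.h>
#include <mpi.h>

struct dd {
    char  c;
    int   i[2];
    float f[4];
};

int main(int argc, char *argv[])
{
    int rank;
    struct dd s;
    MPI_Datatype newtype;
    int array_of_block_lengths[3] = {1, 2, 4};
    MPI_Datatype array_of_types[3] = {MPI_CHAR, MPI_INT, MPI_FLOAT};
    MPI_Aint array_of_displacements[3];
    MPI_Aint block1_address, block2_address, block3_address;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Displacements are measured from the start of the first block. */
    MPI_Get_address(&s.c, &block1_address);
    MPI_Get_address(s.i,  &block2_address);
    MPI_Get_address(s.f,  &block3_address);
    array_of_displacements[0] = 0;
    array_of_displacements[1] = block2_address - block1_address;
    array_of_displacements[2] = block3_address - block1_address;

    /* Build and commit the derived datatype. */
    MPI_Type_create_struct(3, array_of_block_lengths, array_of_displacements,
                           array_of_types, &newtype);
    MPI_Type_commit(&newtype);

    if (rank == 0) {            /* root fills the structure */
        s.c = 'x';
        s.i[0] = 1;  s.i[1] = 2;
        s.f[0] = 1.0f; s.f[1] = 2.0f; s.f[2] = 3.0f; s.f[3] = 4.0f;
    }

    /* Collective variant: broadcast one item of the derived type. */
    MPI_Bcast(&s, 1, newtype, 0, MPI_COMM_WORLD);
    printf("Rank %d: c=%c i=[%d %d] f[0]=%f\n", rank, s.c, s.i[0], s.i[1], s.f[0]);

    MPI_Type_free(&newtype);
    MPI_Finalize();
    return 0;
}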
• Collective communication: The root broadcasts the filled structure to all the processes in its communicator. Print
out the structure in all the processes.
• Point-to-point communication: The root sends the filled structure to each process in its communicator. The
receiving process prints out the structure.
Q8. Pack and Unpack
The previous question illustrated the creation of a derived datatype to help communicate compound data structures. The
same objective can be achieved by packing elements of different types at the sender process and unpacking the elements at
the receiver process. For this question, achieve the same effect as the previous question using the MPI_Pack and MPI_Unpack
routines. Prototypes follow:
int MPI_Pack(void *inbuf, int incount, MPI_Datatype datatype,
void *outbuf, int outsize, int *position, MPI_Comm comm)
incount items from inbuf, each of type datatype, are stored contiguously in outbuf. The input/output
parameter position gives the offset at which to start writing (0 for the first pack); on return, the next write can begin
from position. outsize is the size of outbuf. The following code shows packing a char, an int array, and a float array
in the sender process.
MPI_Pack(&c, 1, MPI_CHAR, buffer,100,&position,MPI_COMM_WORLD);
MPI_Pack(iA, 2, MPI_INT, buffer,100,&position,MPI_COMM_WORLD);
MPI_Pack(fA, 4, MPI_FLOAT,buffer,100,&position,MPI_COMM_WORLD);
To send (receive) the packed buffer, use the datatype MPI_PACKED in MPI_Send (MPI_Recv). The prototype of
MPI_Unpack, to be used by the receiver process, follows.
int MPI_Unpack(void *inbuf, int insize, int *position,
void *outbuf, int outcount, MPI_Datatype datatype,
MPI_Comm comm)
inbuf is the packed buffer of size insize. Unpacking starts from position (the semantics of position are the same as for
MPI_Pack). outcount elements of type datatype are read from inbuf and placed in outbuf.
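Pairing the two directions, a self-contained sketch follows (the choice of ranks 0 and 1, the tag, and the packed values are assumptions; the buffer size of 100 and the variables c, iA, fA mirror the packing snippet above). Note that the items must be unpacked in the same order in which they were packed:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, position;
    char buffer[100];
    char c;
    int iA[2];
    float fA[4];
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        c = 'x';  iA[0] = 1; iA[1] = 2;
        fA[0] = 1.0f; fA[1] = 2.0f; fA[2] = 3.0f; fA[3] = 4.0f;
        position = 0;                          /* 0 for the first pack */
        MPI_Pack(&c, 1, MPI_CHAR,  buffer, 100, &position, MPI_COMM_WORLD);
        MPI_Pack(iA, 2, MPI_INT,   buffer, 100, &position, MPI_COMM_WORLD);
        MPI_Pack(fA, 4, MPI_FLOAT, buffer, 100, &position, MPI_COMM_WORLD);
        /* Send the packed bytes using the MPI_PACKED datatype. */
        MPI_Send(buffer, position, MPI_PACKED, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(buffer, 100, MPI_PACKED, 0, 0, MPI_COMM_WORLD, &status);
        position = 0;                          /* unpack in the same order */
        MPI_Unpack(buffer, 100, &position, &c, 1, MPI_CHAR,  MPI_COMM_WORLD);
        MPI_Unpack(buffer, 100, &position, iA, 2, MPI_INT,   MPI_COMM_WORLD);
        MPI_Unpack(buffer, 100, &position, fA, 4, MPI_FLOAT, MPI_COMM_WORLD);
        printf("Rank 1 unpacked: c=%c iA=[%d %d] fA[3]=%f\n", c, iA[0], iA[1], fA[3]);
    }

    MPI_Finalize();
    return 0;
}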
Q10. Matrix Multiplication on a Cartesian Grid (2D Mesh) using Cannon’s Algorithm
For this question, we will assume that the multiprocessor is interconnected in the form of a grid (a 2-dimensional mesh;
the dimensions are the X-axis and the Y-axis). An example 4×4 mesh is shown in Figure 5. For this question, the mesh contains
an equal number of processors in both dimensions, as shown. The 16 processors shown in the figure can be thought of as being arranged
in a Cartesian grid; the coordinates of such a grid are also shown in the figure.
Assume that one process is executing on one processor. For clarity, the rank of each process equals the
identity of the processor it is executing on. The objectives of this question are:
Read the details of Cannon’s multiplication algorithm here: Parallel Matrix Multiplication page. An example is
provided in the Cannon’s Matrix Multiplication section (below).
Details of the program implementation follow. The program can be divided into two stages.
Stage 1 Creation of a Grid of processes.
• Create a new communicator to which a Cartesian 4×4 grid topology is attached. The API for this in MPI is
MPI_Cart_create. Prototype:
int MPI_Cart_create(MPI_Comm comm_old, int ndims, int *dims, int *periods, int reorder, MPI_Comm *comm_cart)
– comm_old is the communicator whose processes are to be arranged in a grid. If all the processes in the MPI
program are to be put in the grid, comm_old is equal to MPI_COMM_WORLD (the global communicator). The grid
is then attached to the new communicator pointed to by comm_cart.
– ndims: The grid contains ndims dimensions. For this program ndims = 2 (the X and Y dimensions).
– dims is an array; each element records the number of processes per dimension. dims[0] is the number of
processes in the X dimension and dims[1] is the number of processes in the Y dimension. Both values are 4 for
this question.
– periods is the logical array of size ndims specifying whether the grid is periodic (true) or not (false) in each
dimension. Both values are true for this question.
– reorder is a flag that indicates whether the process ranks may be reordered in the new communicator. The newly
created grid should keep the same ordering, so this value is false for this question.
• Example:
typedef struct {
    int N;              /* The number of processors in a row (column).  */
    int size;           /* Number of processors (size = N*N).           */
    int row;            /* This processor's row number.                 */
    int col;            /* This processor's column number.              */
    int MyRank;         /* This processor's unique identifier.          */
    MPI_Comm comm;      /* Communicator for all processors in the grid. */
    MPI_Comm row_comm;  /* All processors in this processor's row.      */
    MPI_Comm col_comm;  /* All processors in this processor's column.   */
} grid_info;
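One possible way to build the grid and fill such a structure is sketched below; the helper name Setup_grid, the use of MPI_Cart_coords, and the convention that coords[0] is treated as the row are assumptions made for illustration:

#include <mpi.h>

/* grid_info as defined above (field comments abridged). */
typedef struct {
    int N, size, row, col, MyRank;
    MPI_Comm comm, row_comm, col_comm;
} grid_info;

/* Hypothetical helper: attach a periodic N x N Cartesian grid to a new
 * communicator and record this process's position in it. Assumes the
 * program was launched with exactly N*N processes.                     */
void Setup_grid(grid_info *grid, int N)
{
    int dims[2], periods[2], coords[2];
    int reorder = 0;                     /* keep the original rank ordering */

    grid->N = N;
    grid->size = N * N;
    dims[0] = dims[1] = N;               /* N processes in each dimension   */
    periods[0] = periods[1] = 1;         /* wrap around in both dimensions  */

    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, reorder, &grid->comm);
    MPI_Comm_rank(grid->comm, &grid->MyRank);

    /* Translate the rank into coordinates; here coords[0] is taken as the
     * row and coords[1] as the column (an assumed convention).            */
    MPI_Cart_coords(grid->comm, grid->MyRank, 2, coords);
    grid->row = coords[0];
    grid->col = coords[1];
}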
• The coordinates of the current process can be obtained using the following API call. These are needed during
multiplication.
Coordinates is the array that records the position of the process in the grid. Coordinates[0] stores the position on the X axis
and Coordinates[1] stores the position on the Y axis.
Note: You can verify that the grid has been created by printing out the coordinates of each process.
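The call referred to in the bullet above is most likely MPI_Cart_coords (the one used in the Setup_grid sketch earlier). Its prototype is:
int MPI_Cart_coords(MPI_Comm comm, int rank, int maxdims, int *coords)
where coords is the Coordinates array described above.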
• Create communicators for individual rows and columns. This will ease the implementation of the algorithm (in the
skewing, and rotating steps).
• In A, left rotate (left shift with wrap-around) the elements in row i by i positions. In B, north rotate (up shift with
wrap-around) the elements in column i by i positions. E.g., A[2][1] will be shifted to A[2][2], and B[2][1] will be shifted to
B[1][1]. This step is called skewing. The row-wise and column-wise skewing steps are demonstrated in Figure 6. The
skewed matrices are:
    1 2 3        1 5 9        0 0 0
A = 5 6 4    B = 4 8 3    C = 0 0 0
    9 7 8        7 2 6        0 0 0
Figure 6: (a) Row-wise skewing; (b) Column-wise skewing. Each rectangle represents a block of the input matrix.
• Multiply elements of A and B in position (elementwise) and add elementwise to C. This operation: C[i][j] +=
A[i][j]*B[i][j]. The new C matrix is shown.
    1 2 3        1 5 9         1 10 27
A = 5 6 4    B = 4 8 3    C = 20 48 12
    9 7 8        7 2 6        63 14 48
• Left rotate every element in A. North rotate every element in B. Repeat the multiplication step. The status of A
and B after the rotate step is shown first; the status of C after the multiplication is shown next.
    2 3 1        4 8 3         1 10 27
A = 6 4 5    B = 7 2 6    C = 20 48 12
    7 8 9        1 5 9        63 14 48

    2 3 1        4 8 3         9 34  30
A = 6 4 5    B = 7 2 6    C = 62 56  42
    7 8 9        1 5 9        70 54 129
• Repeat the previous step till all the multiplications are done. In this example there are 3 multiplications to complete,
so the next one is the last step. After it, C is the product matrix.
    3 1 2        7 2 6         30  36  42
A = 4 5 6    B = 1 5 9    C =  66  81  96
    8 9 7        4 8 3        102 126 150
The core of the algorithm is the element-wise multiplication and addition step made possible by the skewing and rotating
steps. Suppose each element of A and B is stored on a single processor. The skewing and rotating steps then
correspond to communication of elements between the processors, while the multiplication and accumulation step is completed
inside each processor. In other words, process (i,j) in the grid calculates the element C[i][j]. This idea can be scaled
to larger arrays: divide the large arrays into submatrices (blocks) such that each processor contains a submatrix.
Every processor then computes a submatrix of the product instead of a single element.
The details of the implementation of this stage are presented below. The matrices are populated and the multiplication
algorithm is completed in this stage.
• In the root process, populate the multiplicand and multiplier arrays (A and B) with random numbers. Assume
4×4 arrays for now.
• Divide the arrays into equal-sized blocks and scatter them to the processes in the grid. For this example we have
convenient 4×4 arrays and a 4×4 grid, so each process gets one element each of A and B corresponding to
its position in the grid (process (2,1) in the grid gets A[2][1] and B[2][1], and so on).
• Create communicators containing the processes from each row and each column. Use the MPI_Cart_sub call for this
purpose. The call partitions a communicator into subgroups; a subgroup of a grid is a lower-dimensional Cartesian
subgrid. In the case of a 2-dimensional grid, the subgroups are groups of processes from the same row or the same
column. Prototype and example:
int MPI_Cart_sub(MPI_Comm comm, int *remain_dims, MPI_Comm *comm_new)
The ith entry of remain_dims specifies whether the ith dimension is kept in the subgrid (true) or is dropped (false).
comm_new is the handle of the new subgrid communicator that contains the calling process. The example below shows the call that
returns the communicator handle (row_comm) to which all the processes in the same row as the calling process are
attached. This is stored in the struct (grid) defined earlier.
remain_dims[0] = 1;
remain_dims[1] = 0;
MPI_Cart_sub(grid->comm, remain_dims, &(grid->row_comm));
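By analogy, keeping the other dimension yields the column communicator; the following continuation of the snippet above is a sketch (col_comm is the struct field defined earlier):

remain_dims[0] = 0;   /* drop dimension 0 */
remain_dims[1] = 1;   /* keep dimension 1 */
MPI_Cart_sub(grid->comm, remain_dims, &(grid->col_comm));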
• Skew the input arrays A and B. Skewing involves left rotating each element in the ith row by i positions in matrix
A, and north rotating each element in the ith column by i positions in matrix B. One way to implement this would
be to use MPI_Send and MPI_Recv calls; if you choose to do this, extra care should be taken to avoid deadlocks.
The easier and better way of doing this would be to use the MPI_Sendrecv_replace call. The call is used to send
data in a buffer to a destination process and receive data into the same buffer from another process. Prototype and
example:
int MPI_Sendrecv_replace(void *buf, int count, MPI_Datatype datatype, int dest, int sendtag, int source, int recvtag, MPI_Comm comm, MPI_Status *status)
The array buf contains count items of type datatype. The contents of this buffer are sent to the process dest,
tagged with sendtag. The same buffer, buf, is then filled with at most count elements from the process source,
tagged with recvtag. An example:
The processes in a single row of the grid are attached to the row_comm communicator. The current process sends 1
item of type MPI_FLOAT to its destination process and receives 1 item of type MPI_FLOAT from its source process.
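A sketch of such a call for one left-rotation step follows; the buffer my_A (this process's element of A) and the neighbour computation are assumptions, and the position within the row is taken from the process's rank inside row_comm:

float my_A;                                   /* this process's element of A       */
int   my_pos, left, right, N = grid->N;
MPI_Status status;

MPI_Comm_rank(grid->row_comm, &my_pos);       /* position of this process in row   */
left  = (my_pos - 1 + N) % N;                 /* left neighbour, with wrap-around  */
right = (my_pos + 1) % N;                     /* right neighbour, with wrap-around */

/* Send my_A to the left neighbour and receive, into the same buffer,
 * the element arriving from the right neighbour.                      */
MPI_Sendrecv_replace(&my_A, 1, MPI_FLOAT, left, 0, right, 0,
                     grid->row_comm, &status);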
• Perform Cannon’s multiplication algorithm. The rotation steps can be implemented in the same manner as
the skewing step.
• At the end of the multiplication, gather the product matrix in the root process. Root prints out the result matrix.
Epilogue
This document borrows heavily from the excellent hyPACK 2013 workshop by CDAC, Pune. The MPI Quick Reference
Guide[3] will be useful in implementing the questions in this assignment. The manual pages for MPI are an excellent
source for the syntax and semantics of the API calls. The recommended MPI Tutorial website[5], an excellent collection of
online books and tutorials[6], the MPI Wikibooks page[7], and the Beginning MPI page[8] are also good sources. A list of recommended
books on MPI is maintained by the MPITutorial website[9]. Other references are also listed[10].
References
[1] A. Dashin, “MPI Installation Guide,” https://jetcracker.wordpress.com/2012/03/01/how-to-install-mpi-in-ubuntu/, Jetcracker.
[2] D. G. Martnez and S. R. Lumley, “Installation of MPI - Parallel and Distributed Programming,” http://lsi.ugr.es/~jmantas/pdp/ayuda/datos/instalaciones/Install_OpenMPI_en.pdf.
[3] M. Snir, S. Otto, S. Huss-Lederman, D. Walker, and J. Dongarra, “MPI Quick Reference Guide,” http://www.netlib.org/utk/people/JackDongarra/WEB-PAGES/SPRING-2006/mpi-quick-ref.pdf, netlib.org.