2 MPI
int MPI_Comm_rank(MPI_Comm comm, int *rank); Determines the rank of the calling process in the communicator.
The first argument to the call is a communicator and the rank of the process is returned in the second argument.
Essentially, a communicator is a collection of processes that can send messages to each other. The only communicator
needed for basic programs is MPI_COMM_WORLD; it is predefined in MPI and consists of the processes running when
program execution begins.
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    /* Process rank, number of processes, and processor-name length. */
    int rank, size, namelen;
    char name[MPI_MAX_PROCESSOR_NAME];

    /* Initialize MPI and query the runtime. */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &namelen);
    printf("Hello World. Rank %d out of %d running on %s!\n", rank, size, name);
    MPI_Finalize();
    return 0;
}
• MPI_Send(Message, BUFFER_SIZE, MPI_CHAR, Destination, Destination_tag, MPI_COMM_WORLD);: Send the
string Message of BUFFER_SIZE elements of type MPI_CHAR to the process with rank Destination in MPI_COMM_WORLD.
• Example: MPI_Recv(Message, BUFFER_SIZE, MPI_CHAR, Source, Source_tag, MPI_COMM_WORLD, &status);: Receive
the string Message of BUFFER_SIZE elements of type MPI_CHAR from Source belonging to MPI_COMM_WORLD. The execution
status of the function is stored in status.
Your task is to create a parallel MPI version of the π program. For this version, use the MPI_Bcast and MPI_Reduce
functions.
• Broadcast the total steps value (num_steps) to all the processes (process 0 could broadcast). The syntax and usage
of MPI_Bcast follow:
– MPI_Bcast(void *message, int count, MPI_Datatype datatype, int root, MPI_Comm comm);: Broadcast
a message from the process with rank “root” to all other processes of the group.
– Example: MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);: process 0 broadcasts n, of type MPI_INT, to
all the processes in MPI_COMM_WORLD.
– MPI_Reduce(void *operand, void *result, int count, MPI_Datatype datatype, MPI_Op op, int
root, MPI_Comm comm);: Reduce values on all processes to a single value. count refers to the number of
operand and result values.
– Example: MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);: Perform MPI_SUM
on mypi from each process into a single pi value and store it in process 0.
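Putting these calls together, a minimal sketch of the parallel π program follows (the midpoint-rule integration of 4/(1+x*x), the loop partitioning by rank, and the variable names num_steps and mypi are assumptions made for illustration, not the required solution):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size, i;
    long num_steps = 1000000;       /* total number of steps (set by rank 0) */
    double step, x, sum = 0.0, mypi, pi;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Rank 0 broadcasts the number of steps to every process. */
    MPI_Bcast(&num_steps, 1, MPI_LONG, 0, MPI_COMM_WORLD);

    /* Each process accumulates its share of the midpoint-rule sum. */
    step = 1.0 / (double)num_steps;
    for (i = rank; i < num_steps; i += size) {
        x = (i + 0.5) * step;
        sum += 4.0 / (1.0 + x * x);
    }
    mypi = step * sum;

    /* Reduce the partial sums into pi on rank 0. */
    MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("pi is approximately %.16f\n", pi);

    MPI_Finalize();
    return 0;
}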
half = numprocs;
do
    synch();   /* wait for partial sum completion */
...

MPI_Status *status;
MPI_Request request[200];
...
• Each process (including the root) finds the square root of each element it receives.
• After the computation, all the processes return the modified array to the root process. This step is called
Gather.
The sequence of operations is illustrated in Figure 3.
MPI provides the MPI_Scatter, MPI_Scatterv, MPI_Gather, and MPI_Gatherv primitives for use in situations where
scatter-gather sequences appear.
Note: The sendbuf, sendcounts, displs parameters are meaningful only in the root process.
• MPI_Gatherv: Gathers varying amounts of data from all processes to the root process.
int MPI_Gatherv(void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int *recvcounts,
int *displs, MPI_Datatype recvtype, int root, MPI_Comm comm)
Each of the parameters in the MPI_Gatherv function has a similar meaning as in the MPI_Scatterv function.
– Each sending process populates its sendbuf with sendcount elements of type MPI_INT.
– The root process receives the recvbuf array, which is the union of all the sendbuf arrays. The elements are filled
in recvbuf in the order of the ranks of the subprocesses.
Note: The recvbuf, recvcounts, displs parameters are meaningful only in the root process.
Example usage:
Each subprocess has its copy of My_gatherv, which contains Send_Count elements. The root process receives
Total_gatherv. Each subprocess’s segment starts from the index recorded in the Displacement array.
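The usage just described could look like the following sketch; the amount each process contributes (rank + 1 elements), the Recv_Count array, and the bound MAX_PROCS are assumptions made for illustration, while My_gatherv, Send_Count, Total_gatherv, and Displacement are the names used above:

#include <stdio.h>
#include <mpi.h>

#define MAX_PROCS 16                              /* assumed upper bound on processes */
#define MAX_TOTAL (MAX_PROCS * (MAX_PROCS + 1) / 2)

int main(int argc, char *argv[])
{
    int rank, size, i;
    int My_gatherv[MAX_PROCS];                    /* this process's segment        */
    int Total_gatherv[MAX_TOTAL];                 /* root: union of all segments   */
    int Recv_Count[MAX_PROCS], Displacement[MAX_PROCS];
    int Send_Count;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    Send_Count = rank + 1;                        /* a varying amount per process  */
    for (i = 0; i < Send_Count; i++)
        My_gatherv[i] = rank;                     /* fill the local segment        */

    /* Per-process counts and starting offsets; meaningful only at the root. */
    for (i = 0; i < size; i++) {
        Recv_Count[i] = i + 1;
        Displacement[i] = i * (i + 1) / 2;        /* sum of the previous counts    */
    }

    MPI_Gatherv(My_gatherv, Send_Count, MPI_INT,
                Total_gatherv, Recv_Count, Displacement, MPI_INT,
                0, MPI_COMM_WORLD);

    if (rank == 0) {
        for (i = 0; i < Displacement[size-1] + Recv_Count[size-1]; i++)
            printf("%d ", Total_gatherv[i]);
        printf("\n");
    }

    MPI_Finalize();
    return 0;
}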
• MPI_Scatter, MPI_Gather: These primitives are used to scatter (gather) arrays uniformly (all segments of the array
have an equal number of elements) among subprocesses.
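A minimal sketch of the square-root exercise using this uniform pair follows (the array size N, the element values, and the assumption that N divides evenly by the number of processes are all made up for illustration; link with -lm for sqrt):

#include <stdio.h>
#include <math.h>
#include <mpi.h>

#define N 16    /* total number of elements; assumed divisible by the number of processes */

int main(int argc, char *argv[])
{
    int rank, size, i, chunk;
    double data[N], local[N], result[N];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    chunk = N / size;

    if (rank == 0)                       /* root fills the input array */
        for (i = 0; i < N; i++)
            data[i] = (double)(i + 1);

    /* Scatter equal-sized segments to every process (root included). */
    MPI_Scatter(data, chunk, MPI_DOUBLE, local, chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* Each process takes the square root of its segment. */
    for (i = 0; i < chunk; i++)
        local[i] = sqrt(local[i]);

    /* Gather the modified segments back into the root process. */
    MPI_Gather(local, chunk, MPI_DOUBLE, result, chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0)
        for (i = 0; i < N; i++)
            printf("%f\n", result[i]);

    MPI_Finalize();
    return 0;
}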
struct dd{
char c;
int i[2];
float f[4];
};
Figure 4: The sequence of steps in the MPI program for the derived datatype question (Q7.): (a) the root process creates a new MPI datatype mirroring the struct; (b) after MPI_Type_commit, all processes register the derived datatype; (c) the root sends the data in the structure to all the subprocesses, which now identify the data as an item of the derived datatype.
– count indicates the number of blocks – 3 for the structure above (char, int, float).
– array_of_block_lengths: Number of elements in each block; [1, 2, 4] for the structure above. This indicates that
the first block contains one element, the second block has two elements, and the third block has four elements. The
code snippet:
array_of_block_lengths[0] = 1;
array_of_block_lengths[1] = 2;
array_of_block_lengths[2] = 4;
– array_of_displacements: Byte displacement of each block (an array of integers). The first value is always
0. The second displacement value is the difference between the addresses of the second block and the first.
The third displacement value is the difference between the addresses of the third block and the first. The
array_of_displacements array is of type MPI_Aint. Use the MPI_Get_address function to retrieve the addresses as
type MPI_Aint. The following snippet shows the second displacement calculation.
MPI_Get_address(&s.c, &block1_address);
MPI_Get_address(s.i, &block2_address);
array_of_displacements[1] = block2_address-block1_address;
block1_address and block2_address are of type MPI_Aint. array_of_displacements is an array of size equal to the total
number of blocks and is of type MPI_Aint. s is of type struct dd (shown above).
Prototype of MPI_Get_address: int MPI_Get_address(void *location, MPI_Aint *address);
– array_of_types: Type of the elements in each block. For the structure above, the first block contains characters,
the second contains integers, and the third contains floating point numbers. The array_of_types array is of type
MPI_Datatype. The values will be MPI_CHAR, MPI_INT, MPI_FLOAT.
– MPI_Type_create_struct returns a handle to the derived type in newtype.
• Commit the newtype using the MPI_Type_commit call. Prototype:
int MPI_Type_commit(MPI_Datatype *datatype)
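Putting the steps above together, a minimal sketch for struct dd follows (the variable names s and newtype and the values placed in the structure are assumptions for illustration; the broadcast at the end previews the collective-communication task described next):

#include <stdio.h>
#include <mpi.h>

struct dd {
    char  c;
    int   i[2];
    float f[4];
};

int main(int argc, char *argv[])
{
    int rank;
    struct dd s;
    MPI_Datatype newtype;
    int array_of_block_lengths[3] = {1, 2, 4};
    MPI_Datatype array_of_types[3] = {MPI_CHAR, MPI_INT, MPI_FLOAT};
    MPI_Aint array_of_displacements[3];
    MPI_Aint block1_address, block2_address, block3_address;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Displacements are measured from the start of the first block. */
    MPI_Get_address(&s.c, &block1_address);
    MPI_Get_address(s.i,  &block2_address);
    MPI_Get_address(s.f,  &block3_address);
    array_of_displacements[0] = 0;
    array_of_displacements[1] = block2_address - block1_address;
    array_of_displacements[2] = block3_address - block1_address;

    /* Build and commit the derived datatype. */
    MPI_Type_create_struct(3, array_of_block_lengths, array_of_displacements,
                           array_of_types, &newtype);
    MPI_Type_commit(&newtype);

    if (rank == 0) {            /* root fills the structure */
        s.c = 'x';
        s.i[0] = 1;  s.i[1] = 2;
        s.f[0] = 1.0f; s.f[1] = 2.0f; s.f[2] = 3.0f; s.f[3] = 4.0f;
    }

    /* Collective variant: broadcast one item of the derived type. */
    MPI_Bcast(&s, 1, newtype, 0, MPI_COMM_WORLD);
    printf("Rank %d: c=%c i=[%d %d] f[0]=%f\n", rank, s.c, s.i[0], s.i[1], s.f[0]);

    MPI_Type_free(&newtype);
    MPI_Finalize();
    return 0;
}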
• Collective communication: The root broadcasts the filled structure to all the processes in its communicator. Print
out the structure in all the processes.
• Point-to-point communication: The root sends the filled structure to each process in its communicator. The
receiving process prints out the structure.
Q8. Pack and Unpack
The previous question illustrated the creation of a derived datatype to help communicate compound data structures. The
same objective can be achieved by packing elements of different types at the sender process and unpacking the elements at
the receiver process. For this question, achieve the same effect as the previous question using the MPI_Pack and MPI_Unpack
routines. Prototypes follow:
int MPI_Pack(void *inbuf, int incount, MPI_Datatype datatype,
void *outbuf, int outsize, int *position, MPI_Comm comm)
incount items from inbuf, each of type datatype, are stored contiguously in outbuf. The input/output
parameter position gives the offset at which to start writing (0 for the first pack); on return, the next write can begin
from position. outsize is the size of outbuf. The following code shows packing a char, an int array, and a float array
in the sender process.
MPI_Pack(&c, 1, MPI_CHAR, buffer,100,&position,MPI_COMM_WORLD);
MPI_Pack(iA, 2, MPI_INT, buffer,100,&position,MPI_COMM_WORLD);
MPI_Pack(fA, 4, MPI_FLOAT,buffer,100,&position,MPI_COMM_WORLD);
To send (receive) the packed buffer, use the datatype MPI_PACKED in MPI_Send (MPI_Recv). The prototype of
MPI_Unpack, to be used by the receiver process, follows.
int MPI_Unpack(void *inbuf, int insize, int *position,
void *outbuf, int outcount, MPI_Datatype datatype,
MPI_Comm comm)
inbuf is the packed buffer of size insize. Unpacking starts from position (the semantics of position are the same as for
MPI_Pack). outcount elements of type datatype are read from inbuf and placed in outbuf.
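Pairing the two directions, a self-contained sketch follows (the choice of ranks 0 and 1, the tag, and the packed values are assumptions; the buffer size of 100 and the variables c, iA, fA mirror the packing snippet above). Note that the items must be unpacked in the same order in which they were packed:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, position;
    char buffer[100];
    char c;
    int iA[2];
    float fA[4];
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        c = 'x';  iA[0] = 1; iA[1] = 2;
        fA[0] = 1.0f; fA[1] = 2.0f; fA[2] = 3.0f; fA[3] = 4.0f;
        position = 0;                          /* 0 for the first pack */
        MPI_Pack(&c, 1, MPI_CHAR,  buffer, 100, &position, MPI_COMM_WORLD);
        MPI_Pack(iA, 2, MPI_INT,   buffer, 100, &position, MPI_COMM_WORLD);
        MPI_Pack(fA, 4, MPI_FLOAT, buffer, 100, &position, MPI_COMM_WORLD);
        /* Send the packed bytes using the MPI_PACKED datatype. */
        MPI_Send(buffer, position, MPI_PACKED, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(buffer, 100, MPI_PACKED, 0, 0, MPI_COMM_WORLD, &status);
        position = 0;                          /* unpack in the same order */
        MPI_Unpack(buffer, 100, &position, &c, 1, MPI_CHAR,  MPI_COMM_WORLD);
        MPI_Unpack(buffer, 100, &position, iA, 2, MPI_INT,   MPI_COMM_WORLD);
        MPI_Unpack(buffer, 100, &position, fA, 4, MPI_FLOAT, MPI_COMM_WORLD);
        printf("Rank 1 unpacked: c=%c iA=[%d %d] fA[3]=%f\n", c, iA[0], iA[1], fA[3]);
    }

    MPI_Finalize();
    return 0;
}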
Q10. Matrix Multiplication on a Cartesian Grid (2D Mesh) using Cannon’s Algorithm
For this question, we will assume that the multiprocessor is interconnected in the form of a grid (a 2-dimensional mesh;
the dimensions are the X-axis and the Y-axis). An example 4×4 mesh is shown in Figure 5. For this question, the mesh contains
an equal number of processors in both dimensions, as shown. The 16 processors shown in the figure can be thought of as being arranged
in a Cartesian grid; the coordinates of such a grid are also shown in the figure.
Assume that one process is executing on one processor. For clarity, the rank of each process equals the
identity of the processor it is executing on. The objectives of this question are:
Read the details of Cannon’s multiplication algorithm here: Parallel Matrix Multiplication page. An example is
provided in the Cannon’s Matrix Multiplication section (below).
Details of the program implementation follow. The program can be divided into two stages.
Stage 1 Creation of a Grid of processes.
• Create a new communicator to which a Cartesian 4×4 grid topology is attached. The API for this in MPI is
MPI_Cart_create. Prototype:
int MPI_Cart_create(MPI_Comm comm_old, int ndims, int *dims, int *periods, int reorder, MPI_Comm *comm_cart)
– comm_old is the communicator whose processes are to be arranged in a grid. If all the processes in the MPI
program are to be put in the grid, comm_old is equal to MPI_COMM_WORLD (the global communicator). The grid
is then attached to the new communicator pointed to by comm_cart.
– ndims: The grid contains ndims dimensions. For this program ndims = 2 (the X and Y dimensions).
– dims is an array; each element records the number of processes per dimension. dims[0] is the number of
processes in the X dimension and dims[1] is the number of processes in the Y dimension. Both values are 4 for
this question.
– periods is the logical array of size ndims specifying whether the grid is periodic (true) or not (false) in each
dimension. Both values are true for this question.
– reorder is a flag that indicates whether the process ranks may be reordered in the new communicator. The newly
created grid should keep the same ordering, so this value is false for this question.
• Example:
typedef struct {
    int N;              /* The number of processors in a row (column).  */
    int size;           /* Number of processors (size = N*N).           */
    int row;            /* This processor's row number.                 */
    int col;            /* This processor's column number.              */
    int MyRank;         /* This processor's unique identifier.          */
    MPI_Comm comm;      /* Communicator for all processors in the grid. */
    MPI_Comm row_comm;  /* All processors in this processor's row.      */
    MPI_Comm col_comm;  /* All processors in this processor's column.   */
} grid_info;
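One possible way to build the grid and fill such a structure is sketched below; the helper name Setup_grid, the use of MPI_Cart_coords, and the convention that coords[0] is treated as the row are assumptions made for illustration:

#include <mpi.h>

/* grid_info as defined above (field comments abridged). */
typedef struct {
    int N, size, row, col, MyRank;
    MPI_Comm comm, row_comm, col_comm;
} grid_info;

/* Hypothetical helper: attach a periodic N x N Cartesian grid to a new
 * communicator and record this process's position in it. Assumes the
 * program was launched with exactly N*N processes.                     */
void Setup_grid(grid_info *grid, int N)
{
    int dims[2], periods[2], coords[2];
    int reorder = 0;                     /* keep the original rank ordering */

    grid->N = N;
    grid->size = N * N;
    dims[0] = dims[1] = N;               /* N processes in each dimension   */
    periods[0] = periods[1] = 1;         /* wrap around in both dimensions  */

    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, reorder, &grid->comm);
    MPI_Comm_rank(grid->comm, &grid->MyRank);

    /* Translate the rank into coordinates; here coords[0] is taken as the
     * row and coords[1] as the column (an assumed convention).            */
    MPI_Cart_coords(grid->comm, grid->MyRank, 2, coords);
    grid->row = coords[0];
    grid->col = coords[1];
}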
• The coordinates of the current process can be obtained using the following API call. These are needed during
multiplication.
Coordinates is the array that records the position of the process in the grid. Coordinates[0] stores the position on the X axis
and Coordinates[1] stores the position on the Y axis.
Note: You can verify that the grid has been created by printing out the coordinates of each process.
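The call referred to in the bullet above is most likely MPI_Cart_coords (the one used in the Setup_grid sketch earlier). Its prototype is:
int MPI_Cart_coords(MPI_Comm comm, int rank, int maxdims, int *coords)
where coords is the Coordinates array described above.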
• Create communicators for individual rows and columns. This will ease the implementation of the algorithm (in the
skewing, and rotating steps).
• In A, left rotate (left shift with wrap-around) the elements in row i by i positions. In B, north rotate (up shift with
wrap-around) the elements in column i by i positions. E.g., A[2][1] will be shifted to A[2][2], and B[2][1] will be shifted to
B[1][1]. This step is called skewing. The row-wise and column-wise skewing steps are demonstrated in Figure 6. The
skewed matrices are:
    1 2 3        1 5 9        0 0 0
A = 5 6 4    B = 4 8 3    C = 0 0 0
    9 7 8        7 2 6        0 0 0
Figure 6: (a) Row-wise skewing; (b) Column-wise skewing. Each rectangle represents a block of the input matrix.
• Multiply elements of A and B in position (elementwise) and add elementwise to C. This operation: C[i][j] +=
A[i][j]*B[i][j]. The new C matrix is shown.
    1 2 3        1 5 9         1 10 27
A = 5 6 4    B = 4 8 3    C = 20 48 12
    9 7 8        7 2 6        63 14 48
• Left rotate every element in A. North rotate every element in B. Repeat the multiplication step. The status of A
and B after the rotate step is shown first; the status of C after the multiplication is shown next.
    2 3 1        4 8 3         1 10 27
A = 6 4 5    B = 7 2 6    C = 20 48 12
    7 8 9        1 5 9        63 14 48

    2 3 1        4 8 3         9 34  30
A = 6 4 5    B = 7 2 6    C = 62 56  42
    7 8 9        1 5 9        70 54 129
• Repeat the previous step till all the multiplications are done. In this example there are 3 multiplications to complete,
so the next one is the last step. After it, C is the product matrix.
    3 1 2        7 2 6         30  36  42
A = 4 5 6    B = 1 5 9    C =  66  81  96
    8 9 7        4 8 3        102 126 150
The core of the algorithm is the element-wise multiplication and addition step made possible by the skewing and rotating
steps. Suppose each element of A and B is stored on a single processor. The skewing and rotating steps then
correspond to communication of elements between the processors, while the multiplication and accumulation step is completed
inside each processor. In other words, process (i,j) in the grid calculates the element C[i][j]. This idea can be scaled
to larger arrays: divide the large arrays into submatrices (blocks) such that each processor contains a submatrix.
Every processor then computes a submatrix of the product instead of a single element.
The details of the implementation of this stage are presented below. The matrices are populated and the multiplication
algorithm is completed in this stage.
• In the root process, populate the multiplicand and multiplier arrays (A and B) with random numbers. Assume
4×4 arrays for now.
• Divide the arrays into equal-sized blocks and scatter them to the processes in the grid. For this example we have
convenient 4×4 arrays and a 4×4 grid, so each process gets one element each of A and B corresponding to
its position in the grid (process (2,1) in the grid gets A[2][1] and B[2][1], and so on).
• Create communicators containing the processes from each row and each column. Use the MPI_Cart_sub call for this
purpose. The call partitions a communicator into subgroups; a subgroup of a grid is a lower-dimensional Cartesian
subgrid. In the case of a 2-dimensional grid, the subgroups are groups of processes from the same row or the same
column. Prototype and example:
int MPI_Cart_sub(MPI_Comm comm, int *remain_dims, MPI_Comm *comm_new)
The ith entry of remain_dims specifies whether the ith dimension is kept in the subgrid (true) or is dropped (false).
comm_new is the handle of the new subgrid communicator that contains the calling process. The example below shows the call that
returns the communicator handle (row_comm) to which all the processes in the same row as the calling process are
attached. This is stored in the struct (grid) defined earlier.
remain_dims[0] = 1;
remain_dims[1] = 0;
MPI_Cart_sub(grid->comm, remain_dims, &(grid->row_comm));
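By analogy, keeping the other dimension yields the column communicator; the following continuation of the snippet above is a sketch (col_comm is the struct field defined earlier):

remain_dims[0] = 0;   /* drop dimension 0 */
remain_dims[1] = 1;   /* keep dimension 1 */
MPI_Cart_sub(grid->comm, remain_dims, &(grid->col_comm));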
• Skew the input arrays A and B. Skewing involves left rotating each element in the ith row by i positions in matrix
A, and north rotating each element in the ith column by i positions in matrix B. One way to implement this would
be to use MPI_Send and MPI_Recv calls; if you choose to do this, extra care should be taken to avoid deadlocks.
The easier and better way of doing this would be to use the MPI_Sendrecv_replace call. The call is used to send
data in a buffer to a destination process and receive data into the same buffer from another process. Prototype and
example:
int MPI_Sendrecv_replace(void *buf, int count, MPI_Datatype datatype, int dest, int sendtag, int source, int recvtag, MPI_Comm comm, MPI_Status *status)
The array buf contains count items of type datatype. The contents of this buffer are sent to the process dest,
tagged with sendtag. The same buffer, buf, is then filled with at most count elements from the process source,
tagged with recvtag. An example:
The processes in a single row of the grid are attached to the row_comm communicator. The current process sends 1
item of type MPI_FLOAT to its destination process and receives 1 item of type MPI_FLOAT from its source process.
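A sketch of such a call for one left-rotation step follows; the buffer my_A (this process's element of A) and the neighbour computation are assumptions, and the position within the row is taken from the process's rank inside row_comm:

float my_A;                                   /* this process's element of A       */
int   my_pos, left, right, N = grid->N;
MPI_Status status;

MPI_Comm_rank(grid->row_comm, &my_pos);       /* position of this process in row   */
left  = (my_pos - 1 + N) % N;                 /* left neighbour, with wrap-around  */
right = (my_pos + 1) % N;                     /* right neighbour, with wrap-around */

/* Send my_A to the left neighbour and receive, into the same buffer,
 * the element arriving from the right neighbour.                      */
MPI_Sendrecv_replace(&my_A, 1, MPI_FLOAT, left, 0, right, 0,
                     grid->row_comm, &status);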
• Perform Cannon’s multiplication algorithm. The rotation steps can be implemented in the same manner as
the skewing step.
• At the end of the multiplication, gather the product matrix in the root process. Root prints out the result matrix.
Epilogue
This document borrows heavily from the excellent hyPACK 2013 workshop by CDAC, Pune. The MPI Quick Reference
Guide[3] will be useful in implementing the questions in this assignment. The manual pages for MPI are an excellent
source for the syntax and semantics of the API calls. The recommended MPI Tutorial website[5], an excellent collection of
online books and tutorials[6], the MPI Wikibooks page[7], and the Beginning MPI page[8] are also good sources. A list of recommended
books on MPI is maintained by the MPITutorial website[9]. Other references are also listed[10].
References
[1] A. Dashin, “MPI Installation Guide,” https://jetcracker.wordpress.com/2012/03/01/how-to-install-mpi-in-ubuntu/, Jetcracker.
[2] D. G. Martnez and S. R. Lumley, “Installation of MPI - Parallel and Distributed Programming,” http://lsi.ugr.es/~jmantas/pdp/ayuda/datos/instalaciones/Install_OpenMPI_en.pdf.
[3] M. Snir, S. Otto, S. Huss-Lederman, D. Walker, and J. Dongarra, “MPI Quick Reference Guide,” http://www.netlib.org/utk/people/JackDongarra/WEB-PAGES/SPRING-2006/mpi-quick-ref.pdf, netlib.org.