Parallel Algorithms Underlying MPI Implementations
• Assume that each processor has formed the partial sum of the
components of the vector that it has.
• Step 1: Processor 2 sends its partial sum to processor 1 and
processor 1 adds this partial sum to its own. Meanwhile, processor 4
sends its partial sum to processor 3 and processor 3 performs a
similar summation.
• Step 2: Processor 3 sends its partial sum, which is now the sum of
the components on processors 3 and 4, to processor 1 and
processor 1 adds it to its partial sum to get the final sum across all
the components of the vector.
• At each stage of the process, the number of processes doing work is cut in half. The algorithm is depicted in Figure 13.1 below, where the solid arrow denotes a send operation and the dotted arrow denotes a receive operation followed by a summation.
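The halving steps above can be sketched without MPI by letting an array stand in for the processors. This is only an illustrative simulation: the function name `tree_reduce_sum` is hypothetical, ranks are 0-based (the text numbers processors from 1), and the processor count is assumed to be a power of two.

```c
#include <assert.h>
#include <stddef.h>

/* Simulated recursive-halving sum over n "processors".
 * partial[r] holds the partial sum owned by rank r.
 * Returns the number of communication steps; the total ends up in partial[0]. */
int tree_reduce_sum(double *partial, size_t n)
{
    assert(n > 0 && (n & (n - 1)) == 0);   /* power of two only */
    int steps = 0;
    for (size_t stride = 1; stride < n; stride *= 2) {
        /* In a real MPI code, all pairs handled by this inner loop
         * would communicate at the same time. */
        for (size_t r = 0; r + stride < n; r += 2 * stride) {
            /* rank r+stride "sends"; rank r receives and adds */
            partial[r] += partial[r + stride];
        }
        steps++;
    }
    return steps;
}
```

With four partial sums the function takes two steps, matching Steps 1 and 2 in the text: ranks 1 and 3 send first, then rank 2 sends its combined sum to rank 0.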
Recursive Halving and Doubling
• Step 3: Processor 1 then must broadcast this sum to all other processors. This broadcast operation can be done using the same communication structure as the summation, but in reverse. You will see pseudocode for this at the end of this section. Note that if the total number of processors is N, then only 2 log(N) (log base 2) steps are needed to complete the operation.
• There is an even more efficient way to finish the job in only log(N) steps. By way of example, look at the next figure containing 8 processors. At each step, processor i and processor i+k send and receive data in a pairwise fashion and then perform the summation. k is iterated from 1 through N/2 in powers of 2. If the total number of processors is N, then log(N) steps are needed. As an exercise, you should write out the necessary pseudocode for this example.
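One possible answer to the exercise, again as an MPI-free simulation. The pairing of processor i with processor i+k is realized here by flipping one bit of the rank (`r ^ k`), a common way to form such pairs; that choice, the name `butterfly_allreduce_sum`, and the `MAX_PROCS` limit are assumptions of this sketch, and the processor count must be a power of two.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PROCS 64   /* hypothetical upper bound for this simulation */

/* Simulated recursive-doubling sum: after log2(n) steps every rank
 * holds the full sum. Returns the number of steps taken. */
int butterfly_allreduce_sum(double *value, size_t n)
{
    assert(n > 0 && n <= MAX_PROCS && (n & (n - 1)) == 0);
    double incoming[MAX_PROCS];
    int steps = 0;
    for (size_t k = 1; k < n; k *= 2) {       /* k = 1, 2, ..., n/2 */
        /* Each rank receives its partner's current value ... */
        for (size_t r = 0; r < n; r++)
            incoming[r] = value[r ^ k];
        /* ... and then every rank performs the summation. */
        for (size_t r = 0; r < n; r++)
            value[r] += incoming[r];
        steps++;
    }
    return steps;
}
```

For 8 simulated processors the loop runs 3 times, and no separate broadcast is needed because every rank already holds the result.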
• Example 1:
– Matrix-vector multiplication using collective communication.
• Example 2:
– Matrix-matrix multiplication using collective communication.
• Example 3:
– Solving Poisson's equation through the use of ghost cells.
• Example 4:
– Matrix-vector multiplication using a client-server approach.
Example 1: Matrix-vector Multiplication
[Figure: partial results on processors P0, P1, P2, and P3 are combined by a reduction (SUM).]
• The columns of matrix B and elements of column vector C must be distributed to the various processors using MPI commands called scatter operations.
• Note that MPI provides two types of scatter operations depending on whether the problem can be divided evenly among the number of processors or not.
• Each processor now has a column of B, called Bpart, and an element of C, called Cpart. Each processor can now perform an independent vector-scalar multiplication.
• Once this has been accomplished, every processor will have a part of the final column vector A, called Apart.
• The column vectors on each processor can be added together with an MPI reduction command that computes the final sum on the root processor.
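When the data does not divide evenly, MPI_Scatterv takes an array of per-process counts and an array of displacements in place of the single count used by MPI_Scatter. The helper below is a minimal sketch of one way to fill those two arrays; the name `block_counts` and the choice to give the leftover items to the lowest ranks are assumptions, not something the text prescribes.

```c
#include <assert.h>

/* Split n items over nprocs ranks as evenly as possible.
 * counts[r] = number of items for rank r,
 * displs[r] = index of rank r's first item in the send buffer.
 * The first (n % nprocs) ranks receive one extra item. */
void block_counts(int n, int nprocs, int *counts, int *displs)
{
    assert(n >= 0 && nprocs > 0);
    int base = n / nprocs;
    int extra = n % nprocs;
    int offset = 0;
    for (int r = 0; r < nprocs; r++) {
        counts[r] = base + (r < extra ? 1 : 0);
        displs[r] = offset;
        offset += counts[r];
    }
}
```

For 10 items on 4 processes this yields counts 3, 3, 2, 2 and displacements 0, 3, 6, 8. When n is a multiple of nprocs all counts are equal and the plain MPI_Scatter is sufficient.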
#include <stdio.h>
#include <mpi.h>

#define NCOLS 4

int main(int argc, char **argv) {
    int i, j, k, l;
    int ierr, rank, size, root;
    float A[NCOLS];
    float Apart[NCOLS];
    float Bpart[NCOLS];
    float C[NCOLS];
    float A_exact[NCOLS];
    float B[NCOLS][NCOLS];
    float Cpart[1];

    root = 0;

    /* Initiate MPI. */
    ierr = MPI_Init(&argc, &argv);
    ierr = MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    ierr = MPI_Comm_size(MPI_COMM_WORLD, &size);
    /* Initialize B and C. */
    if (rank == root) {
        B[0][0] = 1; B[0][1] =  2; B[0][2] = 3; B[0][3] = 4;
        B[1][0] = 4; B[1][1] = -5; B[1][2] = 6; B[1][3] = 4;
        B[2][0] = 7; B[2][1] =  8; B[2][2] = 9; B[2][3] = 2;
        B[3][0] = 3; B[3][1] = -1; B[3][2] = 5; B[3][3] = 0;
        C[0] = 1;
        C[1] = -4;
        C[2] = 7;
        C[3] = 3;
    }
    /* Put up a barrier until I/O is complete. */
    ierr = MPI_Barrier(MPI_COMM_WORLD);

    /* Scatter matrix B by rows: each process receives one row in Bpart.
       This assumes the program is run with exactly NCOLS processes. */
    ierr = MPI_Scatter(B, NCOLS, MPI_FLOAT, Bpart, NCOLS, MPI_FLOAT,
                       root, MPI_COMM_WORLD);

    /* Scatter vector C: each process receives one element in Cpart. */
    ierr = MPI_Scatter(C, 1, MPI_FLOAT, Cpart, 1, MPI_FLOAT,
                       root, MPI_COMM_WORLD);

    /* Do the vector-scalar multiplication. */
    for (j = 0; j < NCOLS; j++)
        Apart[j] = Cpart[0] * Bpart[j];

    /* Reduce (sum) the partial vectors into vector A on the root. */
    ierr = MPI_Reduce(Apart, A, NCOLS, MPI_FLOAT, MPI_SUM,
                      root, MPI_COMM_WORLD);
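    if (rank == root) {
        printf("\nThis is the result of the parallel computation:\n\n");
        printf("A[0]=%g\n", A[0]);
        printf("A[1]=%g\n", A[1]);
        printf("A[2]=%g\n", A[2]);
        printf("A[3]=%g\n", A[3]);

        /* Recompute the product serially on the root as a check. */
        for (k = 0; k < NCOLS; k++) {
            A_exact[k] = 0.0;
            for (l = 0; l < NCOLS; l++) {
                A_exact[k] += C[l] * B[l][k];
            }
        }
        printf("\nThis is the result of the serial computation:\n\n");
        for (k = 0; k < NCOLS; k++)
            printf("A_exact[%d]=%g\n", k, A_exact[k]);
    }

    /* Every process, not only the root, must call MPI_Finalize. */
    MPI_Finalize();
    return 0;
}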
• It is important to realize that this algorithm would change if the program were written in Fortran. This is because C stores arrays in memory by rows (row-major order), while Fortran stores them by columns (column-major order).
• If you translated the above program directly into a Fortran program, the collective MPI calls would fail because the data going to each of the different processors is not contiguous.
• This problem can be solved with derived datatypes, which are discussed in Chapter 6 - Derived Datatypes.
• A simpler approach would be to decompose the vector-matrix multiplication into independent scalar-row computations and then proceed as above. This approach is shown schematically in Figure 13.6.
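The storage-order point can be made concrete with a small sketch. The helper names below are hypothetical, and the column-major formula uses 0-based indices for comparison only (Fortran itself normally indexes from 1).

```c
#include <assert.h>
#include <stddef.h>

#define NCOLS 4

/* Element offset of B[i][j] in a C array declared float B[NCOLS][NCOLS]:
 * the elements of one row are adjacent in memory. */
size_t row_major_offset(size_t i, size_t j) { return i * NCOLS + j; }

/* Offset of the same (i, j) element under column-major (Fortran) storage:
 * here the elements of one column are adjacent instead. */
size_t col_major_offset(size_t i, size_t j) { return j * NCOLS + i; }

/* Check the row-major formula against the addresses the C compiler
 * actually assigns. Returns 1 if every element matches. */
int c_layout_is_row_major(void)
{
    float B[NCOLS][NCOLS];
    for (size_t i = 0; i < NCOLS; i++)
        for (size_t j = 0; j < NCOLS; j++) {
            size_t bytes = (size_t)((const char *)&B[i][j] - (const char *)B);
            if (bytes / sizeof(float) != row_major_offset(i, j))
                return 0;
        }
    return 1;
}
```

In C, row 1 occupies offsets 4 through 7, so a scatter of NCOLS consecutive elements delivers whole rows. Under column-major storage the same four consecutive elements form a column, which is why a direct translation distributes different data.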
[Figure: B*C=A, where B is (n x 4m), C is (4m x M), and A is (n x M).]
Example 2: Matrix-matrix Multiplication
$$\rho(x,y) = \frac{a}{\pi}\left(e^{-a\left[(x-L/4)^2+y^2\right]} - e^{-a\left[(x-3L/4)^2+y^2\right]}\right)$$
Figure 13.10. Poisson Equation on a 2D grid with periodic boundary conditions.
Here phi(x,y) is our unknown potential function and rho(x,y) is the known source charge density. The domain of the problem is the box defined by the x-axis, y-axis, and the lines x=L and y=L.
Example 3: The Use of Ghost Cells to Solve a Poisson Equation
• Serial Code:
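• To solve this equation, an iterative scheme is employed using finite differences. The update equation for the field phi at the (n+1)th iteration is written in terms of the values at the nth iteration via

$$\phi_{i,j}^{(n+1)} = \pi\,\Delta x^{2}\,\rho_{i,j} + \frac{1}{4}\left(\phi_{i+1,j}^{(n)} + \phi_{i-1,j}^{(n)} + \phi_{i,j+1}^{(n)} + \phi_{i,j-1}^{(n)}\right)$$

iterating until the condition

$$\frac{\sum_{i,j}\left|\phi_{i,j}^{new} - \phi_{i,j}^{old}\right|}{\sum_{i,j}\left|\phi_{i,j}\right|} < \epsilon$$

is satisfied, where new and old denote the (n+1)th and nth iterates.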
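A serial sweep of such a scheme might look like the sketch below. It rests on several assumptions that the text does not confirm: the update has the form phi_new = pi*dx^2*rho + (average of the four neighbors), which corresponds to writing the Poisson equation as laplacian(phi) = -4*pi*rho; the relative change is normalized by the new field; the boundaries are periodic; and the grid size `N` and the name `jacobi_sweep` are purely illustrative.

```c
#include <assert.h>

#define N 8   /* hypothetical grid size, chosen small for illustration */

static double abs_d(double x) { return x < 0.0 ? -x : x; }

/* One sweep of the 5-point update on an N x N periodic grid.
 * Reads phi_old, writes phi_new, and returns
 * sum|phi_new - phi_old| / sum|phi_new| (0 if the new field is all zero). */
double jacobi_sweep(double phi_old[N][N], double phi_new[N][N],
                    double rho[N][N], double dx)
{
    const double pi = 3.14159265358979323846;
    double diff = 0.0, norm = 0.0;
    for (int i = 0; i < N; i++) {
        int ip = (i + 1) % N, im = (i + N - 1) % N;      /* periodic in i */
        for (int j = 0; j < N; j++) {
            int jp = (j + 1) % N, jm = (j + N - 1) % N;  /* periodic in j */
            phi_new[i][j] = pi * dx * dx * rho[i][j]
                          + 0.25 * (phi_old[ip][j] + phi_old[im][j]
                                  + phi_old[i][jp] + phi_old[i][jm]);
            diff += abs_d(phi_new[i][j] - phi_old[i][j]);
            norm += abs_d(phi_new[i][j]);
        }
    }
    return norm > 0.0 ? diff / norm : 0.0;
}
```

A driver would call this repeatedly, swapping the roles of the two arrays, until the returned value drops below the chosen tolerance.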
• Parallel Code:
• In this example, the domain is chopped into rectangles, in
what is often called block-block decomposition. In Figure
13.11 below,
Figure 13.12. Array indexing in a parallel Poisson solver on a 3x5 processor grid.
• Note that P(1,2) (i.e., P(7)) is responsible for indices i=23-43 and j=27-39 in the serial code double do-loop.
• A parallel speedup is obtained because each processor is working on essentially 1/15 of the total data.
• However, there is a problem. What does P(1,2) do when its 5-point stencil hits the boundaries of its domain (i.e., when i=23 or i=43, or j=27 or j=39)? The 5-point stencil now reaches into another processor's domain, which means that boundary data exists in memory on another separate processor.
• Because the update formula for phi at grid point (i,j) involves neighboring grid indices {i-1,i,i+1;j-1,j,j+1}, P(1,2) must communicate with its North, South, East, and West (N, S, E, W) neighbors to get one column of boundary data from its E, W neighbors and one row of boundary data from its N, S neighbors.
• This is illustrated in Figure 13.13 below.
Figure 13.13. Boundary data movement in the parallel Poisson solver following each iteration of the stencil.
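The index ranges quoted for P(1,2) can be reproduced with a simple block-partition rule. The sketch below assumes a 64 x 64 global grid with 1-based indices, with any leftover points going to the lowest-numbered blocks. That grid size is not stated in the text; it is inferred here only because it yields i=23-43 and j=27-39 for P(1,2) on the 3x5 processor grid. The function names are hypothetical.

```c
#include <assert.h>

/* 1-based inclusive index range [*lo, *hi] owned by block b (0-based)
 * when n grid points are split over nblocks blocks. The first
 * (n % nblocks) blocks receive one extra point. */
void block_range(int n, int nblocks, int b, int *lo, int *hi)
{
    assert(n > 0 && nblocks > 0 && b >= 0 && b < nblocks);
    int base = n / nblocks;
    int extra = n % nblocks;
    int before = b * base + (b < extra ? b : extra);  /* points in earlier blocks */
    int size = base + (b < extra ? 1 : 0);
    *lo = before + 1;
    *hi = before + size;
}

/* Single-number label of processor P(row, col), counting row by row. */
int grid_rank(int row, int col, int ncols)
{
    return row * ncols + col;
}
```

Under these assumptions, `block_range(64, 3, 1, ...)` gives 23 to 43, `block_range(64, 5, 2, ...)` gives 27 to 39, and `grid_rank(1, 2, 5)` gives 7, in agreement with the P(1,2), i.e. P(7), example.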
• In order to accommodate this transference of boundary data between processors, each processor must dimension its local array phi to have two extra rows and two extra columns.
• This is illustrated in Figure 13.14 where the shaded areas indicate the extra rows and columns needed for the boundary data from other processors.
Figure 13.14. Ghost cells: Local indices.
• Note that even though this example speaks of global indices, the whole point of parallelism is that no single processor ever holds the global phi matrix.
• Each processor has only its local version of phi with its
own sub-collection of i and j indices.
• Locally these indices are labeled beginning at either 0 or
1, as in Figure 13.14, rather than beginning at their
corresponding global values, as in Figure 13.12.
• Keeping track of the on-processor local indices and the
global (in-your-head) indices is the bookkeeping that you
have to manage when using message passing
parallelism.
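The bookkeeping described above can be captured in a small data structure. This is a sketch under stated assumptions: the `LocalGrid` type and its functions are hypothetical, interior points use local indices 1..n_local and 1..m_local, and the ghost cells sit at local indices 0 and n_local+1 (or m_local+1).

```c
#include <assert.h>
#include <stdlib.h>

typedef struct {
    int n_local, m_local;  /* interior rows and columns owned by this processor */
    int i_start, j_start;  /* global indices of local point (1, 1) */
    double *data;          /* (n_local + 2) x (m_local + 2) values, ghosts included */
} LocalGrid;

/* Allocate a zero-filled local array with one ghost layer on every side.
 * Returns 1 on success, 0 if the allocation failed. */
int local_grid_init(LocalGrid *g, int n_local, int m_local,
                    int i_start, int j_start)
{
    g->n_local = n_local;
    g->m_local = m_local;
    g->i_start = i_start;
    g->j_start = j_start;
    g->data = calloc((size_t)(n_local + 2) * (size_t)(m_local + 2),
                     sizeof(double));
    return g->data != NULL;
}

/* Pointer to local element (i, j); indices 0 and n_local+1 / m_local+1
 * address the ghost cells. */
double *local_at(LocalGrid *g, int i, int j)
{
    assert(i >= 0 && i <= g->n_local + 1 && j >= 0 && j <= g->m_local + 1);
    return &g->data[(size_t)i * (size_t)(g->m_local + 2) + (size_t)j];
}

/* Translate local interior indices to the global ("in-your-head") indices. */
int global_i(const LocalGrid *g, int i) { return g->i_start + i - 1; }
int global_j(const LocalGrid *g, int j) { return g->j_start + j - 1; }

void local_grid_free(LocalGrid *g) { free(g->data); g->data = NULL; }
```

For P(1,2), which owns i=23-43 and j=27-39, the call would be `local_grid_init(&g, 21, 13, 23, 27)`; local point (1,1) then corresponds to global point (23,27).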
• Other parallel paradigms, such as High Performance Fortran (HPF) or OpenMP, are directive-based, i.e., compiler directives are inserted into the code to tell the supercomputer to distribute data across processors or to perform other operations. The difference between the two paradigms is akin to the difference between an automatic and stick-shift transmission car.
• In the directive-based paradigm (automatic), the compiler (car) does the data layout and parallel communications (gear shifting) implicitly.
• In the message passing paradigm (stick-shift), the user (driver) performs the data layout and parallel communications explicitly. In this example, this communication can be performed in a regular prescribed pattern for all processors.
• For example, all processors could first communicate with their N-most partners, then S, then E, then W. What is happening when all processors communicate with their E neighbors is illustrated in Figure 13.15.
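Before any such exchange, each processor needs to know who its four neighbors are. The sketch below computes them on a processor grid with periodic wrap-around, in keeping with the periodic boundary conditions of the problem. The orientation (north is row-1, east is col+1), the row-by-row rank numbering, and the names `Neighbors` and `grid_neighbors` are assumptions made for illustration. MPI's Cartesian topology routines (MPI_Cart_create, MPI_Cart_shift) can provide the same information in an actual MPI code.

```c
#include <assert.h>

typedef struct { int north, south, east, west; } Neighbors;

/* Ranks of the four neighbors of processor P(row, col) on an
 * nrows x ncols processor grid with periodic wrap-around.
 * Ranks are numbered row by row: rank = row * ncols + col. */
Neighbors grid_neighbors(int row, int col, int nrows, int ncols)
{
    assert(row >= 0 && row < nrows && col >= 0 && col < ncols);
    Neighbors nb;
    nb.north = ((row + nrows - 1) % nrows) * ncols + col;
    nb.south = ((row + 1) % nrows) * ncols + col;
    nb.east  = row * ncols + (col + 1) % ncols;
    nb.west  = row * ncols + (col + ncols - 1) % ncols;
    return nb;
}
```

On the 3x5 grid, P(1,2) (rank 7) gets neighbors 2 (N), 12 (S), 8 (E), and 6 (W) under this convention, while a processor on the edge such as P(0,4) wraps around to P(0,0) on its east side.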
(6) for(i=1;i<=N_local;i++){
      for(j=1;j<=M_local;j++){
        update phi[i][j]
      }
    }
    End Loop over stencil iterations
(7) Output data
• Note that initializing the data should be performed in parallel. That is, each processor P(i,j) should only initialize the portion of phi for which it is responsible. (Recall NO processor contains the full global phi).
• In relation to this point, step (7), Output data, is not such a simple-minded task when performing parallel calculations. Should you reduce all the data from phi_local on each processor to one giant phi_global on P(0,0) and then print out the data? This is certainly one way to do it, but it seems to defeat the purpose of not having all the data reside on one processor.
• For example, what if phi_global is too large to fit in memory on a single processor? A second alternative is for each processor to write out its own phi_local to a file "phi.ij", where ij indicates the processor's 2-digit designation (e.g. P(1,2) writes out to file "phi.12").
• The data then has to be manipulated off processor by another code to put it into a form that may be rendered by a visualization package. This code itself may have to be a parallel code.
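The per-processor file naming could be done along these lines. The function name `phi_filename` is hypothetical, and the sketch assumes both grid coordinates are single digits, as the 2-digit "phi.ij" designation implies.

```c
#include <assert.h>
#include <stdio.h>

/* Build the output file name "phi.ij" for processor P(row, col).
 * Returns 1 if the name fit in buf, 0 otherwise. */
int phi_filename(char *buf, size_t buflen, int row, int col)
{
    assert(row >= 0 && row <= 9 && col >= 0 && col <= 9);
    int n = snprintf(buf, buflen, "phi.%d%d", row, col);
    return n > 0 && (size_t)n < buflen;
}
```

Each processor would pass its own grid coordinates and then open the resulting name for writing, so P(1,2) writes to "phi.12" and no two processors share a file.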