0% found this document useful (0 votes)

41 views55 pages

Parallel Algorithms Underlying MPI Implementations

The document discusses parallel algorithms underlying MPI implementations. It begins by describing recursive halving and doubling, the algorithm used for broadcasts and reductions. It then provides examples of parallel algorithms for MPI, including matrix-vector multiplication using collective communication with columns of the matrix and elements of the vector distributed among processors. Processors perform independent multiplications and then reduce results. The document also discusses matrix-matrix multiplication and solving Poisson's equation with ghost cells in a similar parallel manner.

Uploaded by

Aliaa Karam

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PPT, PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

41 views55 pages

Parallel Algorithms Underlying MPI Implementations

Uploaded by

Aliaa Karam

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PPT, PDF, TXT or read online on Scribd

You are on page 1/ 55

Parallel Algorithms

Underlying MPI
Implementations
Parallel Algorithms Underlying MPI
Implementations

• This chapter looks at a few of the parallel algorithms

underlying the implementations of some simple MPI calls.
• The purpose of this is not to teach you how to "roll your
own" versions of these routines, but rather to help you
start thinking about algorithms in a parallel fashion.
• First, the method of recursive halving and doubling,
which is the algorithm underlying operations such as
broadcasts and reduction operations, is discussed.
• Then, specific examples of parallel algorithms that
implement message passing are given.
Recursive Halving and
Doubling
Recursive Halving and Doubling

• To illustrate recursive halving and doubling,

suppose you have a vector distributed among p
processors, and you need the sum of all
components of the vector in each processor, i.e.,
a sum reduction.
• One method is to use a tree-based algorithm to
compute the sum to a single processor and then
broadcast the sum to every processor.
Recursive Halving and Doubling

• Assume that each processor has formed the partial sum of the
components of the vector that it has.
• Step 1: Processor 2 sends its partial sum to processor 1 and
processor 1 adds this partial sum to its own. Meanwhile, processor 4
sends its partial sum to processor 3 and processor 3 performs a
similar summation.
• Step 2: Processor 3 sends its partial sum, which is now the sum of
the components on processors 3 and 4, to processor 1 and
processor 1 adds it to its partial sum to get the final sum across all
the components of the vector.
• At each stage of the process, the number of processes doing work is
cut in half. The algorithm is depicted in the Figure 13.1 below, where
the solid arrow denotes a send operation and the dotted line arrow
denotes a receive operation followed by a summation.
Recursive Halving and Doubling

Figure 13.1. Summation in log(N) steps.

Recursive Halving and Doubling

• Step 3: Processor 1 then must broadcast this sum to all other proce
ssors. This broadcast operation can be done using the same commu
nication structure as the summation, but in reverse. You will see pse
udocode for this at the end of this section. Note that if the total numb
er of processors is N, then only 2 log(N) (log base 2) steps are need
ed to complete the operation.
• There is an even more efficient way to finish the job in only log(N) st
eps. By way of example, look at the next figure containing 8 process
ors. At each step, processor i and processor i+k send and receive da
ta in a pairwise fashion and then perform the summation. k is iterate
d from 1 through N/2 in powers of 2. If the total number of processor
s is N, then log(N) steps are needed. As an exercise, you should writ
e out the necessary pseudocode for this example.
Recursive Halving and Doubling

Figure 13.2. Summation to all processors in log(N) steps.

Recursive Halving and Doubling

• What about adding vectors? That is, how do you

add several vectors component-wise to get a new
vector? The answer is, you employ the method
discussed earlier in a component-wise fashion.
This fascinating way to reduce the
communications and to avoid abundant
summations is described next. This method
utilizes the recursive halving and doubling
technique and is illustrated in Figure 13.3.
Recursive Halving and Doubling

• Suppose there are 4 processors and the length of each vector is

also 4.
• Step 1: Processor p0 sends the first two components of the vector
to processor p1, and p1 sends the last two components of the vector
to p0. Then p0 gets the partial sums for the last two components,
and p1 gets the partial sums for the first two components. So do p2
and p3.
• Step 2: Processor p0 sends the partial sum of the third component
to processor p3. Processor p3 then adds to get the total sum of the
third component. Similarly, processor 0,1 and 2 find the total sums of
the 4th, 2nd, and 1st components, respectively. Now the sum of the
vectors are found and the components are stored in different
processors.
• Step 3: Broadcast the result using the reverse of the above
communication process.
Recursive Halving and Doubling

Figure 13.3. Adding vectors

Recursive Halving and Doubling

• Pseudocode for Broadc

ast Operation:
• The following algorithm co
mpletes a broadcast oper
ation in logarithmic time. F
igure 13.4 illustrates the id
ea.

Figure 13.4. Broadcast via recursive doubling.

Recursive Halving and Doubling

• The first processor first sends If (myRank==0) {

the data to only two other send to processors 1 and 2;
processors. Then each of }
these processors send the data
to two other processors, and so else
on. At each stage, the number {
of processors sending and receive from processors int((m
receiving data doubles. The yRank-1)/2);
code is simple and looks
similar to torank1=2*myRank+1;
torank2=2*myRank+2;
if (torank1 exists)
send to torank1;
if (torank2 exists)
send to torank2;
}
Parallel Algorithm
Examples
Specific Examples

• In this section, specific examples of parallel algorithms

that implement message passing are given.
• The first two examples consider simple collective
communication calls to parallelize matrix-vector and
matrix-matrix multiplication.
• These calls are meant to be illustrative, because parallel
numerical libraries usually provide the most efficient
algorithms for these operations. (See Chapter 10 -
Parallel Mathematical Libraries.)
• The third example shows how you can use ghost cells to
construct a parallel data approach to solve Poisson's
equation.
• The fourth example revisits matrix-vector multiplication,
but from a client server approach.
Specific Examples

• Example 1:
– Matrix-vector multiplication using collective communication.
• Example 2:
– Matrix-matrix multiplication using collective communication.
• Example 3:
– Solving Poisson's equation through the use of ghost cells.
• Example 4:
– Matrix-vector multiplication using a client-server approach.
Example 1: Matrix-vector
Multiplication

• The figure below demonstrates schematically how a

matrix-vector multiplication, A=B*C, can be
decomposed into four independent computations
involving a scalar multiplying a column vector.
• This approach is different from that which is usually
taught in a linear algebra course because this
decomposition lends itself better to parallelization.
• These computations are independent and do not
require communication, something that usually
reduces performance of parallel code.
Example 1: Matrix-vector Multiplicatio
n (Columnwise)

Figure 13.5. Schematic of parallel decomposition for vector-matrix

multiplication, A=B*C. The vector A is depicted in yellow. The matrix B
and vector C are depicted in multiple colors representing the portions,
columns, and elements assigned to each processor, respectively.
Example 1: Matrix-vector Multiplicatio
n (Columnwise)
a0  b0, 0 c0 + b0,1c1 + b0, 2 c2 + b0,3c3 
 a  b c b1,1c1 + b1, 2 c2 + b1,3c3 
 1    1, 0 0 +
a2  b2, 0 c0 + b2,1c1 + b2, 2 c2 + b2,3c3 
   
A=B*C  a3  b3, 0 c0 + b3,1c1 + b3, 2 c2 + b3,3c3 
P0 P1 P2 P3

P0 P1 P2 P3
Reduction (SUM)
Example 1: Matrix-vector
Multiplication
• The columns of matrix B and elements of column vector
C must be distributed to the various processors using MP
I commands called scatter operations.
• Note that MPI provides two types of scatter operations de
pending on whether the problem can be divided evenly a
mong the number of processors or not.
• Each processor now has a column of B, called Bpart, an
d an element of C, called Cpart. Each processor can now
perform an independent vector-scalar multiplication.
• Once this has been accomplished, every processor will h
ave a part of the final column vector A, called Apart.
• The column vectors on each processor can be added tog
ether with an MPI reduction command that computes the
final sum on the root processor.
Example 1: Matrix-vector
Multiplication
#include <stdio.h>
#include <mpi.h>
#define NCOLS 4
int main(int argc, char **argv) {
int i,j,k,l;
int ierr, rank, size, root;
float A[NCOLS];
float Apart[NCOLS];
float Bpart[NCOLS];
float C[NCOLS];
float A_exact[NCOLS];
float B[NCOLS][NCOLS];
float Cpart[1];
root = 0;
/* Initiate MPI. */
ierr=MPI_Init(&argc, &argv);
ierr=MPI_Comm_rank(MPI_COMM_WORLD, &rank);
ierr=MPI_Comm_size(MPI_COMM_WORLD, &size);
Example 1: Matrix-vector
Multiplication
/* Initialize B and C. */
if (rank == root) {

B[0][0] = 1;
B[0][1] = 2;
B[0][2] = 3;
B[0][3] = 4;
B[1][0] = 4;
B[1][1] = -5;
B[1][2] = 6;
B[1][3] = 4;
B[2][0] = 7;
B[2][1] = 8;
B[2][2] = 9;
B[2][3] = 2;
B[3][0] = 3;
B[3][1] = -1;
B[3][2] = 5;
B[3][3] = 0;

C[0] = 1;
C[1] = -4;
C[2] = 7;
C[3] = 3;
}
Example 1: Matrix-vector
Multiplication
/* Put up a barrier until I/O is complete */
ierr=MPI_Barrier(MPI_COMM_WORLD);
/* Scatter matrix B by rows. */
ierr=MPI_Scatter(B,NCOLS,MPI_FLOAT,Bpart,NCOLS,MPI_FLOAT,
root,MPI_COMM_WORLD);
/* Scatter matrix C by columns */
ierr=MPI_Scatter(C,1,MPI_FLOAT,Cpart,1,MPI_FLOAT, root,MPI_C
OMM_WORLD);
/* Do the vector-scalar multiplication. */
for(j=0;j<NCOLS;j++)
Apart[j] = Cpart[0]*Bpart[j];
/* Reduce to matrix A. */
ierr=MPI_Reduce(Apart,A,NCOLS,MPI_FLOAT,MPI_SUM, root,MPI
_COMM_WORLD);
Example 1: Matrix-vector
Multiplication
if (rank == 0) {
printf("\nThis is the result of the parallel computation:\n\n");
printf("A[0]=%g\n",A[0]);
printf("A[1]=%g\n",A[1]);
printf("A[2]=%g\n",A[2]);
printf("A[3]=%g\n",A[3]);

for(k=0;k<NCOLS;k++) {
A_exact[k] = 0.0;
for(l=0;l<NCOLS;l++) {
A_exact[k] += C[l]*B[l][k];
}
}

printf("\nThis is the result of the serial computation:\n\n");

printf("A_exact[0]=%g\n",A_exact[0]);
printf("A_exact[1]=%g\n",A_exact[1]);
printf("A_exact[2]=%g\n",A_exact[2]);
printf("A_exact[3]=%g\n",A_exact[3]);

MPI_Finalize();
}
Example 1: Matrix-vector
Multiplication
• It is important to realize that this algorithm would change i
f the program were written in Fortran. This is because C d
ecomposes arrays in memory by rows while Fortran deco
mposes arrays into columns.
• If you translated the above program directly into a Fortran
program, the collective MPI calls would fail because the d
ata going to each of the different processors is not contig
uous.
• This problem can be solved with derived datatypes, which
are discussed in Chapter 6 - Derived Datatypes.
• A simpler approach would be to decompose the vector-m
atrix multiplication into independent scalar-row computati
ons and then proceed as above. This approach is shown
schematically in Figure 13.6.
Example 1: Matrix-vector
Multiplication

Figure 13.6. Schematic of parallel decomposition for vector-

matrix multiplication, A=B*C, in the C programming language.
Example 1: Matrix-vector Multiplicatio
n (Rowwise)

• Based on different data decomposition, we can d

esign another parallel program for MV-multiplicati
on. That is the Rowwise block striped MV-multipli
cation.

B*C=A

• Please think about the algorithm for parallel roww

ise MV-multiplication.
Example 1: Matrix-vector
Multiplication

• Another way this problem can be decomposed is

to broadcast the column vector C to all the
processors using the MPI broadcast command.
(See Section 6.2 - Broadcast.)
• Then, scatter the rows of B to every processor so
that they can form the elements of the result
matrix A by the usual vector-vector "dot product".
This will produce a scalar on each processor,
Apart, which can then be gathered with an MPI
gather command (see Section 6.4 - Gather) back
onto the root processor in the column vector A.
Example 1: Matrix-vector
Multiplication

Figure 13.7. Schematic of a different parallel decom

position for vector-matrix multiplication
Example 1: Matrix-vector
Multiplication

• Something for you to think about as you read

the next section on matrix-matrix
multiplication: How would you generalize this
algorithm to the multiplication of a n X 4m matrix
by a 4m by M matrix on 4 processors?

(4mxM)
(nx4m) (nxM)
Example 2: Matrix-matrix
Multiplication

• A similar, albeit naive, type of decomposition can

be achieved for matrix-matrix multiplication,
A=B*C.
• The figure below shows schematically how
matrix-matrix multiplication of two 4x4 matrices
can be decomposed into four independent
vector-matrix multiplications, which can be
performed on four different processors.
Example 2: Matrix-matrix
Multiplication

Figure 13.8. Schematic of a decomposition for matrix-matrix multiplication, A=

B*C, in Fortran 90. The matrices A and C are depicted as multicolored columns
with each color denoting a different processor. The matrix B, in yellow, is broad
cast to all processors.
Example 2: Matrix-matrix
Multiplication

• The basic steps are

1. Distribute the columns of C among the processors
using a scatter operation.
2. Broadcast the matrix B to every processor.
3. Form the product of B with the columns of C on each
processor. These are the corresponding columns of
A.
4. Bring the columns of A back to one processor using
a gather operation.
Example 2: Matrix-matrix
Multiplication

• Again, in C, the problem could be decomposed in

rows. This is shown schematically below.
• The code is left as your homework!!!
Example 2: Matrix-matrix
Multiplication

Figure 13.9. Schematic of a decomposition for matrix-matrix multiplication,

A=B*C, in the C programming language. The matrices A and B are depict
ed as multicolored rows with each color denoting a different processor. Th
e matrix C, in yellow, is broadcast to all processors.
Example 3: The Use of Ghost Cells to
solve a Poisson Equation

• The objective in data parallelism is for all

processors to work on a single task
simultaneously. The computational domain (e.g.,
a 2D or 3D grid) is divided among the processors
such that the computational work load is
balanced. Before each processor can compute
on its local data, it must perform communications
with other processors so that all of the necessary
information is brought on each processor in order
for it to accomplish its local task.
Example 3: The Use of Ghost Cells to
solve a Poisson Equation
• As an instructive example of data parallelism, an arbitrary
number of processors is used to solve the 2D Poisson Eq
uation in electrostatics (i.e., Laplace Equation with a sour
ce). The equation to solve is
 2 x, y   4 x, y 

 x, y   e


a  a x  L / 4 2  y 2
e  a  x 3 L / 4 2  y 2 

Figure 13.10. Poisson Equation on a 2D grid with periodic boundary conditions.
where phi(x,y) is our unknown potential function and rho
(x,y) is the known source charge density. The domain of t
he problem is the box defined by the x-axis, y-axis, and th
e lines x=L and y=L.
Example 3: The Use of Ghost Cells to
solve a Poisson Equation
• Serial Code:
• To solve this equation, an iterative scheme is employed using finite
differences. The update equation for the field phi at the (n+1)th
iteration is written in terms of the values at nth iteration via
1
i , j  x  i , j  i 1, j  i 1, j  i , j 1  i , j 1 
2

4
iterating until the condition

 i, j i, j
 new

i, j
  old



i, j
i, j

has been satisfied.

Example 3: The Use of Ghost Cells to
solve a Poisson Equation

• Parallel Code:
• In this example, the domain is chopped into rectangles, in
what is often called block-block decomposition. In Figure
13.11 below,

Figure 13.11. Parallel Poisson solver via domain decomposition o

n a 3x5 processor grid.
Example 3: The Use of Ghost Cells to
solve a Poisson Equation

• An example N=64 x M=64 computational grid is shown

that will be divided amongst NP=15 processors.
• The number of processors, NP, is purposely chosen such
that it does not divide evenly into either N or M.
• Because the computational domain has been divided into
rectangles, the 15 processors {P(0),P(1),...,P(14)} (which
are laid out in row-major order on the processor grid) can
be given a 2-digit designation that represents their
processor grid row number and processor grid column
number. MPI has commands that allow you to do this.
Example 3: The Use of Ghost Cells to
solve a Poisson Equation

Figure 13.12. Array indexing in a parallel Poisson solver on a 3x5 processor grid.
Example 3: The Use of Ghost Cells to
solve a Poisson Equation
• Note that P(1,2) (i.e., P(7)) is responsible for indices i=23-43 and j=2
7-39 in the serial code double do-loop.
• A parallel speedup is obtained because each processor is working o
n essentially 1/15 of the total data.
• However, there is a problem. What does P(1,2) do when its 5-point s
tencil hits the boundaries of its domain (i.e., when i=23 or i=43, or j=
27 or j=39)? The 5-point stencil now reaches into another processo
r's domain, which means that boundary data exists in memory on an
other separate processor.
• Because the update formula for phi at grid point (i,j) involves neighbo
ring grid indices {i-1,i,i+1;j-1,j,j+1}, P(1,2) must communicate with its
North, South, East, and West (N, S, E, W) neighbors to get one colu
mn of boundary data from its E, W neighbors and one row of bounda
ry data from its N,S neighbors.
• This is illustrated in Figure 13.13 below.
Example 3: The Use of Ghost Cells to
solve a Poisson Equation

Figure 13.13. Boundary data movement in the parallel Poisson solver followi
ng each iteration of the stencil.
Example 3: The Use of Ghost Cells to
solve a Poisson Equation
• In order to accommodate this
transference of boundary data
between processors, each
processor must dimension its
local array phi to have two
extra rows and 2 extra
columns.
• This is illustrated in Figure
13.14 where the shaded areas
indicate the extra rows and
columns needed for the
boundary data from other Figure 13.14. Ghost cells: Local indices.
processors.
Example 3: The Use of Ghost Cells to
solve a Poisson Equation
• Note that even though this example speaks of global
indices, the whole point about parallelism is that no one
processor ever has the global phi matrix on processor.
• Each processor has only its local version of phi with its
own sub-collection of i and j indices.
• Locally these indices are labeled beginning at either 0 or
1, as in Figure 13.14, rather than beginning at their
corresponding global values, as in Figure 13.12.
• Keeping track of the on-processor local indices and the
global (in-your-head) indices is the bookkeeping that you
have to manage when using message passing
parallelism.
Example 3: The Use of Ghost Cells to
solve a Poisson Equation
• Other parallel paradigms, such as High Performance Fortran (HPF)
or OpenMP, are directive-based, i.e., compiler directives are inserte
d into the code to tell the supercomputer to distribute data across pr
ocessors or to perform other operations. The difference between the
two paradigms is akin to the difference between an automatic and sti
ck-shift transmission car.
• In the directive based paradigm (automatic), the compiler (car) does
the data layout and parallel communications (gear shifting) implicitly.
• In the message passing paradigm (stick-shift), the user (driver) perfo
rms the data layout and parallel communications explicitly. In this ex
ample, this communication can be performed in a regular prescribed
pattern for all processors.
• For example, all processors could first communicate with their N-mo
st partners, then S, then E, then W. What is happening when all proc
essors communicate with their E neighbors is illustrated in Figure 13.
15.
Example 3: The Use of Ghost Cells to
solve a Poisson Equation

Figure 13.15. Data movement, shift right (East).

Example 3: The Use of Ghost Cells to
solve a Poisson Equation
• Note that in this shift right communication, P(i,j) places its right-most column of boundary data
into the left-most ghost column of P(i,j+1). In addition, P(i,j) receives the right-most column of b
oundary data from P(i,j-1) into its own left-most ghost column.
• For each iteration, the psuedo-code for the parallel algorithm is thus
t=0
(0) Initialize psi
(1) Loop over stencil iterations

(2) Perform parallel N shift communications of boundary data

(3) Perform parallel S shift communications of boundary data
(4) Perform parallel E shift communications of boundary data
(5) Perform parallel W shift communications of boundary data

(6) for{i=1;i<=N_local;i++){
for(j=1;j<=M_local;j++){
update phi[i][j]
}
}
End Loop over stencil iterations
(7) Output data
Example 3: The Use of Ghost Cells to
solve a Poisson Equation
• Note that initializing the data should be performed in parallel. That is,
each processor P(i,j) should only initialize the portion of phi for whic
h it is responsible. (Recall NO processor contains the full global phi).
• In relation to this point, step (7), Output data, is not such a simple-mi
nded task when performing parallel calculations. Should you reduce
all the data from phi_local on each processor to one giant phi_glob
al on P(0,0) and then print out the data? This is certainly one way to
do it, but it seems to defeat the purpose of not having all the data res
ide on one processor.
• For example, what if phi_global is too large to fit in memory on a sin
gle processor? A second alternative is for each processor to write ou
t its own phi_local to a file "phi.ij", where ij indicates the processor's
2-digit designation (e.g. P(1,2) writes out to file "phi.12").
• The data then has to be manipulated off processor by another code t
o put it into a form that may be rendered by a visualization package.
This code itself may have to be a parallel code.
Example 3: The Use of Ghost Cells to
solve a Poisson Equation

• As you can see, the issue of parallel I/O is not a

trivial one (see Section 9 - Parallel I/O) and is in
fact a topic of current research among parallel
language developers and researchers.
Matrix-vector Multiplication using a
Client-Server Approach
• In Section 13.2.1, a simple data decomposition for multiplying a matrix and
a vector was described. This decomposition is also used here to demonstrat
e a "client-server" approach. The code for this example is in the C program,
server_client_c.c.
• In server_client_c.c, all input/output is handled by the "server" (preset to be
processor 0). This includes parsing the command-line arguments, reading t
he file containing the matrix A and vector x, and writing the result to standar
d output. The file containing the matrix A and the vector x has the form
m n
x1 x2 ...
a11 a12 ...
a21 a22 ...
.
.
.
where A is m (rows) by n (columns), and x is a column vector with n eleme
nts.
Matrix-vector Multiplication using a
Client-Server Approach
• After the server reads in the size of A, it broadcasts this information
to all of the clients.
• It then checks to make sure that there are fewer processors than
columns. (If there are more processors than columns, then using a
parallel program is not efficient and the program exits.)
• The server and all of the clients then allocate memory locations for A
and x. The server also allocates memory for the result.
• Because there are more columns than client processors, the first
"round" consists of the server sending one column to each of the
client processors.
• All of the clients receive a column to process. Upon finishing, the
clients send results back to the server. As the server receives a
"result" buffer from a client, it sends the next unprocessed column to
that client.
Matrix-vector Multiplication using a
Client-Server Approach
• The source code is divided into two sections: the "server" code and
the "client" code. The pseudo-code for each of these sections is
• Server:
– Broadcast (vector) x to all client processors.
– Send a column of A to each processor.
– While there are more columns to process OR there are expected
results, receive results and send next unprocessed column.
– Print result.
• Client:
– Receive (vector) x.
– Receive a column of A with tag = column number.
– Multiply respective element of (vector) x (which is the same as tag) to
produce the (vector) result.
– Send result back to server.
• Note that the numbers used in the pseudo-code (for both the server
and client) have been added to the source code.
Matrix-vector Multiplication using a
Client-Server Approach

• Source code similar to server_client_c.c., server_client_r.

c is also provided as an example.
• The main difference between theses codes is the way the
data is stored.
• Because only contiguous memory locations can be sent u
sing MPI_SEND, server_client_c.c stores the matrix A "c
olumn-wise" in memory, while server_client_r.c stores the
matrix A "row-wise" in memory.
• The pseudo-code for server_client_c.c and server_client_
r.c is stated in the "block" documentation at the beginning
of the source code.
END

Project Report On Human Resource Management
81% (101)
Project Report On Human Resource Management
86 pages
Parallel and Distributed Computing
33% (3)
Parallel and Distributed Computing
10 pages
Brief History of The Kindergarten
No ratings yet
Brief History of The Kindergarten
74 pages
MECN430 Homework 4
No ratings yet
MECN430 Homework 4
6 pages
Thirumana Porutham Book in Tamil PDF 22
50% (2)
Thirumana Porutham Book in Tamil PDF 22
2 pages
Parallel Algorithms Underlying MPI Implementations
No ratings yet
Parallel Algorithms Underlying MPI Implementations
55 pages
MPI Matrix Multiplication 1 PDF
No ratings yet
MPI Matrix Multiplication 1 PDF
23 pages
Report - Viber String
No ratings yet
Report - Viber String
26 pages
As 3
No ratings yet
As 3
2 pages
Exercise 9
No ratings yet
Exercise 9
5 pages
New Microsoft Word Document
No ratings yet
New Microsoft Word Document
29 pages
Parallel Prefix Sum
No ratings yet
Parallel Prefix Sum
32 pages
Content PDF
No ratings yet
Content PDF
14 pages
Floyd's Algorithm: Input N: Number of Vertices A (0..n-1) (0..n-1) - Adjacency Matrix
No ratings yet
Floyd's Algorithm: Input N: Number of Vertices A (0..n-1) (0..n-1) - Adjacency Matrix
7 pages
PDC Experiments
No ratings yet
PDC Experiments
11 pages
A Practical Performance Comparison of Parallel Matrix Multiplication Algorithms On Networks of Workstations
No ratings yet
A Practical Performance Comparison of Parallel Matrix Multiplication Algorithms On Networks of Workstations
2 pages
matrix_mul
No ratings yet
matrix_mul
33 pages
Unit II Matrix Multiplication
No ratings yet
Unit II Matrix Multiplication
23 pages
Lecture 03-Parallel Prefix
No ratings yet
Lecture 03-Parallel Prefix
6 pages
Matrix Multi
No ratings yet
Matrix Multi
38 pages
Sliding Window Sum Algorithms For Deep Neural Networks
No ratings yet
Sliding Window Sum Algorithms For Deep Neural Networks
8 pages
CO-2 (2)
No ratings yet
CO-2 (2)
22 pages
Class18 - Linalg II Handout PDF
No ratings yet
Class18 - Linalg II Handout PDF
48 pages
High Performance Computing Matrix Mul.
No ratings yet
High Performance Computing Matrix Mul.
15 pages
EXERCISE- 4[1] (1)
No ratings yet
EXERCISE- 4[1] (1)
8 pages
60004210188_RajSingh_HPC_Exp1-7
No ratings yet
60004210188_RajSingh_HPC_Exp1-7
23 pages
Mpi
No ratings yet
Mpi
67 pages
MPI Plamen Krastev
No ratings yet
MPI Plamen Krastev
49 pages
PDCLabMan Updated
No ratings yet
PDCLabMan Updated
46 pages
Parallel Processing
No ratings yet
Parallel Processing
35 pages
CP4292-MCAP(1)
No ratings yet
CP4292-MCAP(1)
15 pages
Problem Statement
No ratings yet
Problem Statement
5 pages
Chained Matrix Multiplication
No ratings yet
Chained Matrix Multiplication
32 pages
Lecture 4: Principles of Parallel Algorithm Design (Part 4)
No ratings yet
Lecture 4: Principles of Parallel Algorithm Design (Part 4)
27 pages
ECE 1747H: Parallel Programming: Message Passing (MPI)
No ratings yet
ECE 1747H: Parallel Programming: Message Passing (MPI)
67 pages
Lab6 - Linear Algebra in C On A Microcontroller
No ratings yet
Lab6 - Linear Algebra in C On A Microcontroller
8 pages
Unit - 2 HPC
No ratings yet
Unit - 2 HPC
96 pages
Signals and Systems Lab 1
No ratings yet
Signals and Systems Lab 1
14 pages
Parallel Computing: MPI - Collective Communication
No ratings yet
Parallel Computing: MPI - Collective Communication
52 pages
Pseudo Code of Mpi Programs
No ratings yet
Pseudo Code of Mpi Programs
22 pages
parallel and distributed algorithms
No ratings yet
parallel and distributed algorithms
21 pages
MAP laB mannual
No ratings yet
MAP laB mannual
24 pages
Multicore Architecture and Programming Lab Manual
No ratings yet
Multicore Architecture and Programming Lab Manual
29 pages
08_1_MPI_Comm_Data_Distributions
No ratings yet
08_1_MPI_Comm_Data_Distributions
60 pages
Parallel Processing Previous Year Question
No ratings yet
Parallel Processing Previous Year Question
11 pages
Task 1 Types of Parallel Processing
No ratings yet
Task 1 Types of Parallel Processing
3 pages
DAA-1
No ratings yet
DAA-1
40 pages
MPI Lab 3
No ratings yet
MPI Lab 3
18 pages
CP4252 Multicore Architecture and Programming Lab Manual
No ratings yet
CP4252 Multicore Architecture and Programming Lab Manual
26 pages
EXERCISE- 4
No ratings yet
EXERCISE- 4
8 pages
Introduction
No ratings yet
Introduction
46 pages
The Sieve of Eratosthenes
No ratings yet
The Sieve of Eratosthenes
68 pages
Ee8218 Lab2
No ratings yet
Ee8218 Lab2
7 pages
Assignment 3 - COMP2129
No ratings yet
Assignment 3 - COMP2129
4 pages
hpc_scaling
No ratings yet
hpc_scaling
56 pages
Solutions To C Moler MATLAB Experiments Chapter 2
0% (1)
Solutions To C Moler MATLAB Experiments Chapter 2
13 pages
Lect11 12 Parallel
No ratings yet
Lect11 12 Parallel
57 pages
Chapter 14: Parallel Algorithms
No ratings yet
Chapter 14: Parallel Algorithms
23 pages
12.revision Parallelization
No ratings yet
12.revision Parallelization
30 pages
Parallel Random Access Machine (PRAM) : Control
No ratings yet
Parallel Random Access Machine (PRAM) : Control
9 pages
Lec 38
No ratings yet
Lec 38
19 pages
CP4292-MCAP
No ratings yet
CP4292-MCAP
24 pages
Week12 - L01 and L02
No ratings yet
Week12 - L01 and L02
22 pages
A Friendly Introduction to MATLAB Programming
From Everand
A Friendly Introduction to MATLAB Programming
Orhan Gazi
No ratings yet
Performance Tasks
No ratings yet
Performance Tasks
29 pages
1 s2.0 S1877705813016007 Main
No ratings yet
1 s2.0 S1877705813016007 Main
10 pages
COGNEX DM360 Quick Reference
No ratings yet
COGNEX DM360 Quick Reference
22 pages
The AHAM Seven Basic Truths
100% (1)
The AHAM Seven Basic Truths
6 pages
CHS-1 Finals-Reviewer Eapp
No ratings yet
CHS-1 Finals-Reviewer Eapp
3 pages
Blachford-Zinc-Stearate-201-201V-325-201ND
No ratings yet
Blachford-Zinc-Stearate-201-201V-325-201ND
2 pages
DWDM_Complete_Record
No ratings yet
DWDM_Complete_Record
81 pages
Rpncalcseries For Palm Os Section Properties - User Guide
No ratings yet
Rpncalcseries For Palm Os Section Properties - User Guide
10 pages
BJS 2106 Restorative Justice NOTES
No ratings yet
BJS 2106 Restorative Justice NOTES
51 pages
Unit-4 (Service Oriented Architecture)
No ratings yet
Unit-4 (Service Oriented Architecture)
9 pages
Programa Analítico de Inglés Secundaria
No ratings yet
Programa Analítico de Inglés Secundaria
13 pages
Charging Station Installation Guide
No ratings yet
Charging Station Installation Guide
71 pages
Clean Architecture
No ratings yet
Clean Architecture
24 pages
CLJ 1 Syllabus
100% (1)
CLJ 1 Syllabus
12 pages
Taklimat Hari Ko-Q (Pindaan 6.0) - Fakultiptj
No ratings yet
Taklimat Hari Ko-Q (Pindaan 6.0) - Fakultiptj
44 pages
Sharae Kalian Transformational Change Specialist
No ratings yet
Sharae Kalian Transformational Change Specialist
2 pages
Eduflow Homework Harmful or Helpful
100% (1)
Eduflow Homework Harmful or Helpful
7 pages
Question Bank DFTS: BE 4/4 ECE II Semester Elective II Unit I
No ratings yet
Question Bank DFTS: BE 4/4 ECE II Semester Elective II Unit I
4 pages
Resume Cezar Emanuel C.
No ratings yet
Resume Cezar Emanuel C.
4 pages
Organic Chemistry
No ratings yet
Organic Chemistry
11 pages
Keur 00821-029 201906
No ratings yet
Keur 00821-029 201906
5 pages
Provincial Newsletter Ed 036 - 15 08 13
No ratings yet
Provincial Newsletter Ed 036 - 15 08 13
5 pages
Unit 2 Studylib
No ratings yet
Unit 2 Studylib
23 pages
Phases of The Moon
No ratings yet
Phases of The Moon
4 pages
Sample Structural Drawing
100% (1)
Sample Structural Drawing
1 page
Lisondra Es 2022 Activity Design
No ratings yet
Lisondra Es 2022 Activity Design
3 pages

Parallel Algorithms Underlying MPI Implementations

Uploaded by

Parallel Algorithms Underlying MPI Implementations

Uploaded by

Parallel Algorithms

• This chapter looks at a few of the parallel algorithms

• To illustrate recursive halving and doubling,

Figure 13.1. Summation in log(N) steps.

Figure 13.2. Summation to all processors in log(N) steps.

• What about adding vectors? That is, how do you

• Suppose there are 4 processors and the length of each vector is

Figure 13.3. Adding vectors

• Pseudocode for Broadc

Figure 13.4. Broadcast via recursive doubling.

• The first processor first sends If (myRank==0) {

• In this section, specific examples of parallel algorithms

• The figure below demonstrates schematically how a

Figure 13.5. Schematic of parallel decomposition for vector-matrix

printf("\nThis is the result of the serial computation:\n\n");

Figure 13.6. Schematic of parallel decomposition for vector-

• Based on different data decomposition, we can d

• Please think about the algorithm for parallel roww

• Another way this problem can be decomposed is

Figure 13.7. Schematic of a different parallel decom

• Something for you to think about as you read

• A similar, albeit naive, type of decomposition can

Figure 13.8. Schematic of a decomposition for matrix-matrix multiplication, A=

• The basic steps are

• Again, in C, the problem could be decomposed in

Figure 13.9. Schematic of a decomposition for matrix-matrix multiplication,

• The objective in data parallelism is for all

has been satisfied.

Figure 13.11. Parallel Poisson solver via domain decomposition o

• An example N=64 x M=64 computational grid is shown

Figure 13.15. Data movement, shift right (East).

(2) Perform parallel N shift communications of boundary data

• As you can see, the issue of parallel I/O is not a

• Source code similar to server_client_c.c., server_client_r.

You might also like