Lec8 MPIalgorithmDesign

The document summarizes different algorithms for broadcasting messages across parallel processors. It contrasts traditional PRAM algorithms, which ignore communication costs, with architecture-independent algorithms that account for latency and bandwidth. A generalized broadcasting algorithm is presented in which each processor holding the message sends it to $k-1$ other processors per round, reducing the number of rounds from $\log p$ to $\log_k p$ and giving a total time of $\log_k p \cdot (t_s + (k-1) t_w)$.

Uploaded by

Anirudh Seth

Lecture 8: Architecture Independent (MPI) Algorithm Design

Parallel Computing
Fall 2007

Traditional PRAM Algorithm vs. Architecture Independent Parallel Algorithm Design
• Under the PRAM model, synchronization is treated as free because PRAM processors work synchronously. Communication is also ignored: in the PRAM, accessing shared memory costs no more than accessing a local register.
• In practice, the exchange of data can significantly reduce the efficiency of parallel programs by introducing interaction delays during their execution.
• A simple exchange of an $m$-word message between two processes running on different nodes of an interconnection network with cut-through routing takes roughly $t_s + m t_w$ time.
  - $t_s$: latency, or the startup time for the data transfer
  - $t_w$: per-word transfer time, which is inversely proportional to the available bandwidth between the nodes

Basic Communication Operations: One-to-all Broadcast and All-to-one Reduction
• Assume that $p$ processes participate in the operation and the data to be broadcast or reduced contains $m$ words.
• A one-to-all broadcast or all-to-one reduction involves $\log p$ point-to-point simple message transfers (logarithms are base 2), each at a time cost of $t_s + m t_w$. The total time taken by the procedure is therefore
  $T = (t_s + m t_w) \log p$
• This holds for all interconnection networks.
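The cost formula above can be evaluated with a few lines of C. This is only a sketch: the function names are hypothetical, and rounding the number of rounds up when $p$ is not a power of two is an assumption that the slide does not state.

```c
#include <assert.h>

/* Time for one point-to-point transfer of m words: t_s + m * t_w. */
double transfer_time(double ts, double tw, double m) {
    return ts + m * tw;
}

/* Number of rounds, ceil(log2 p), computed with integers to avoid rounding.
   Assumption: for p that is not a power of two we round up. */
int log2_rounds(int p) {
    assert(p >= 1);
    int rounds = 0;
    for (long long reach = 1; reach < p; reach *= 2)
        rounds++;
    return rounds;
}

/* T = (t_s + m * t_w) * log p for a one-to-all broadcast or all-to-one reduction. */
double one_to_all_time(double ts, double tw, double m, int p) {
    return transfer_time(ts, tw, m) * log2_rounds(p);
}
```

For instance, with the made-up values $t_s = 10$, $t_w = 1$, $m = 4$ and $p = 8$, `one_to_all_time(10.0, 1.0, 4.0, 8)` evaluates $(10 + 4) \cdot 3$.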

All-to-all Broadcast and Reduction
• Linear Array and Ring:
  - $p$ different messages circulate in the $p$-node ensemble.
  - If communication is performed circularly in a single direction, each node receives all $p-1$ pieces of information from the other nodes in $p-1$ steps.
  - So the total time is $T = (t_s + m t_w)(p - 1)$.
• 2-D Mesh:
  - Based on the linear array algorithm, treating each row and each column of the $\sqrt{p} \times \sqrt{p}$ mesh as a linear array.
  - Two phases:
    - Phase one: each row of the mesh performs an all-to-all broadcast using the procedure for the linear array. In this phase, all nodes collect the $\sqrt{p}$ messages corresponding to the $\sqrt{p}$ nodes of their respective rows. Each node consolidates this information into a single message of size $m\sqrt{p}$. The time for this phase is $T_1 = (t_s + m t_w)(\sqrt{p} - 1)$.
    - Phase two: columnwise all-to-all broadcast of the consolidated messages. By the end of this phase, each node obtains all $p$ pieces of $m$-word data that originally resided on different nodes. The time for this phase is $T_2 = (t_s + m\sqrt{p}\, t_w)(\sqrt{p} - 1)$.
  - The time for the entire all-to-all broadcast on a $p$-node two-dimensional square mesh is the sum of the times spent in the individual phases: $T = 2 t_s(\sqrt{p} - 1) + m t_w (p - 1)$.
• Hypercube:
  $T = \sum_{i=1}^{\log p} (t_s + 2^{i-1} t_w m) = t_s \log p + t_w m (p - 1)$
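As a sanity check on the three all-to-all broadcast costs, the sketch below evaluates each one step by step rather than through its closed form. The function names are hypothetical. To stay in integer arithmetic, the mesh is described by its side length `q` (so $p = q^2$) and the hypercube by its dimension `d` (so $p = 2^d$); both parameterizations are assumptions made for this example.

```c
#include <assert.h>

/* Ring / linear array: p - 1 steps, each moving one m-word message. */
double ring_all_to_all(double ts, double tw, double m, int p) {
    assert(p >= 1);
    return (ts + m * tw) * (p - 1);
}

/* 2-D square mesh with side q (so p = q * q): row phase, then column phase. */
double mesh_all_to_all(double ts, double tw, double m, int q) {
    assert(q >= 1);
    double t1 = (ts + m * tw) * (q - 1);      /* rows: messages of m words */
    double t2 = (ts + m * q * tw) * (q - 1);  /* columns: messages of m * q words */
    return t1 + t2;
}

/* Hypercube with p = 2^d nodes: the message size doubles in every step. */
double hypercube_all_to_all(double ts, double tw, double m, int d) {
    assert(d >= 0);
    double total = 0.0;
    double words = m;                         /* 2^(i-1) * m words in step i */
    for (int i = 1; i <= d; i++) {
        total += ts + words * tw;
        words *= 2.0;
    }
    return total;
}
```

Comparing the results against $2 t_s(\sqrt{p} - 1) + m t_w(p - 1)$ and $t_s \log p + t_w m (p - 1)$ for a few values of $p$ is a quick way to confirm that the phase-by-phase sums and the closed forms agree.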

Traditional PRAM Algorithm vs. Architecture Independent Parallel Algorithm Design
• As an example of how traditional PRAM algorithm design differs from architecture independent parallel algorithm design, an algorithm for broadcasting in a parallel machine is introduced.
• Problem: In a parallel machine with $p$ processors numbered $0, \ldots, p-1$, one of them, say processor 0, holds a one-word message. The problem of broadcasting is to disseminate this message to the local memory of the remaining $p-1$ processors.
• The performance of a well-known exclusive PRAM algorithm for broadcasting is analyzed below in two ways, under the assumption that no concurrent operations are allowed. One follows the traditional (PRAM) analysis that minimizes parallel running time. The other takes into consideration the issues of communication and synchronization. This leads to a modification of the PRAM-based algorithm that yields an architecture independent algorithm for broadcasting whose performance is consistent with observations of broadcasting operations on real parallel machines.

Broadcasting: PRAM Algorithm 1
• Algorithm. Without loss of generality, assume that $p$ is a power of two. The message is broadcast in $\lg p$ rounds of communication by binary replication. In round $i = 1, \ldots, \lg p$, each processor $j$ with index $j < 2^{i-1}$ sends the message it currently holds to processor $j + 2^{i-1}$ (on a shared memory system, this may mean copying information into a cell read by this processor). The number of processors with the message at the end of round $i$ is thus $2^i$.
• Analysis of Algorithm. Under the PRAM model the algorithm requires $\lg p$ communication rounds and as many parallel steps to complete. This cost, however, ignores synchronization, which is free because PRAM processors work synchronously. It also ignores communication, since in the PRAM accessing shared memory costs no more than accessing a local register.
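The binary replication pattern can be traced with a small sequential simulation. This is a sketch rather than a parallel program: the array `has_msg`, the bound `MAX_P` and the function name are all hypothetical, and the handling of $p$ that is not a power of two (targets beyond $p-1$ are skipped) is an assumption.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_P 1024  /* hypothetical upper bound on the number of processors */

/* Simulate binary replication; returns the number of rounds used,
   or -1 if some processor never received the message. */
int binary_broadcast_rounds(int p) {
    assert(p >= 1 && p <= MAX_P);
    bool has_msg[MAX_P] = { false };
    has_msg[0] = true;                         /* processor 0 starts with the message */
    int rounds = 0;
    for (int step = 1; step < p; step *= 2) {  /* step = 2^(i-1) in round i */
        for (int j = 0; j < step; j++) {
            int target = j + step;
            if (target < p) {
                assert(has_msg[j]);            /* a sender must already hold the message */
                has_msg[target] = true;
            }
        }
        rounds++;
    }
    for (int j = 0; j < p; j++)
        if (!has_msg[j]) return -1;
    return rounds;
}
```

The inner assertion checks the invariant behind the algorithm: at the start of round $i$, exactly the processors with index below $2^{i-1}$ hold the message, so every sender has something to send.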

Broadcasting: PRAM Algorithm 1
• Under the MPI cost model each communication round is assigned a cost of $\max\{t_s, t_w \cdot 1\}$, as each processor in each round sends or receives at most one message containing the one-word message. The corresponding cost of the algorithm is $\lg p \cdot \max\{t_s, t_w \cdot 1\}$, as there are $\lg p$ rounds of communication.
• As the information communicated by any processor is small in size, latency is likely to dominate the transmission time (i.e., the bandwidth-based cost $t_w \cdot 1$ is insignificant compared to the latency/synchronization term $t_s$).
• In high-latency machines the dominant term would be $t_s \lg p$ rather than $t_w \lg p$. Even though each communication round lasts for at least $t_s$ time units, only a small fraction $t_w$ of it is used for actual communication. The remainder is wasted.
• It then makes sense to increase communication round utilization so that each processor sends the one-word message to as many processors as it can accommodate within a round.
• Under the additive point-to-point cost $t_s + m t_w$ with $m = 1$, the total time is $\lg p \cdot (t_s + t_w)$.

Broadcasting: PRAM Algorithm 2
• Input: $p$ processors numbered $0, \ldots, p-1$. Processor 0 holds a message of length equal to one word.
• Output: The message is disseminated to the remaining $p-1$ processors.
• Algorithm 2. In one superstep, processor 0 sends the message to be broadcast to processors $1, \ldots, p-1$ in turn (a "sequential"-looking algorithm).
• Analysis of Algorithm 2.
  - The communication time of Algorithm 2 is $1 \cdot \max\{t_s, (p-1) \cdot t_w\}$ (in a single superstep, the message is replicated $p-1$ times by processor 0).
  - The total time is $t_s + (p-1) t_w$.

Broadcasting: PRAM Algorithm 3
• Algorithm 3
  - Both Algorithm 1 and Algorithm 2 can be viewed as extreme cases of an Algorithm 3.
  - The main observation is that up to $t_s/t_w$ words ($L/g$ in BSP notation) can be sent in a superstep at a cost of $t_s$. It therefore makes sense for each processor to send up to $t_s/t_w$ messages to other processors. Let $k-1$ be the number of messages a processor sends to other processors in a broadcasting step. The number of processors with the message at the end of a broadcasting superstep is then $k$ times larger than at the start. We call $k$ the degree of replication of the broadcast operation.
• Architecture independent Algorithm 3
  - In each round, every processor that holds the message sends it to $k-1$ other processors. In round $i = 0, 1, \ldots$, each processor $j$ with index $j < k^i$ sends the message to $k-1$ distinct processors numbered $j + k^i \cdot l$, where $l = 1, \ldots, k-1$. At the end of round $i$ (the $(i+1)$-st overall round), the message has been broadcast to $k^i (k-1) + k^i = k^{i+1}$ processors. The number of rounds required is the minimum integer $r$ such that $k^r \ge p$. The number of rounds necessary for full dissemination is thus decreased to $\lceil \log_k p \rceil$, and the total cost becomes $\lceil \log_k p \rceil \max\{t_s, (k-1) t_w\}$.
  - At the end of each superstep the number of processors possessing the message is $k$ times that of the previous superstep. During each superstep each processor holding the message sends it to exactly $k-1$ other processors.
  - Algorithm 3 uses a number of rounds between 1 (with $k = p$ it becomes Algorithm 2) and $\lg p$ (with $k = 2$ it becomes Algorithm 1).
  - The total time is $\lceil \log_k p \rceil (t_s + (k-1) t_w)$.
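To see how the degree of replication $k$ trades rounds against per-round work, the following sketch evaluates the total time $\lceil \log_k p \rceil (t_s + (k-1) t_w)$ and scans all $k$ from 2 to $p$ for the cheapest one. The function names are hypothetical, and the brute-force scan is just one simple way to pick $k$; the slides do not prescribe a selection method.

```c
#include <assert.h>

/* Minimum r with k^r >= p, i.e. the number of rounds of Algorithm 3. */
int broadcast_rounds(int p, int k) {
    assert(p >= 1 && k >= 2);
    int r = 0;
    for (long long reach = 1; reach < p; reach *= k)
        r++;
    return r;
}

/* Total time rounds * (t_s + (k - 1) * t_w) for a one-word message. */
double broadcast_time(double ts, double tw, int p, int k) {
    return broadcast_rounds(p, k) * (ts + (k - 1) * tw);
}

/* Try every k in 2..p and return the smallest k with the lowest total time. */
int best_degree(double ts, double tw, int p) {
    assert(p >= 2);
    int best_k = 2;
    for (int k = 3; k <= p; k++)
        if (broadcast_time(ts, tw, p, k) < broadcast_time(ts, tw, p, best_k))
            best_k = k;
    return best_k;
}
```

With the made-up values $t_s = 10$, $t_w = 1$ and $p = 64$, the two extremes cost 66 ($k = 2$, Algorithm 1) and 73 ($k = 64$, Algorithm 2), while $k = 8$ costs 34, which illustrates why an intermediate degree of replication can beat both.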

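Broadcasting: PRAM Algorithm 3
Broadcast (0, p, k)
#include <mpi.h>

/* k-ary broadcast of M from processor 0, written with MPI_Send/MPI_Recv in
   place of mpi_put/mpi_get. Assumes k >= 2, that p is the size of
   MPI_COMM_WORLD, and that the one-word message is an int. */
void broadcast(int *M, int p, int k) {
    int my_pid;
    MPI_Comm_rank(MPI_COMM_WORLD, &my_pid);
    for (int mask_pid = 1; mask_pid < p; mask_pid *= k) {
        if (my_pid < mask_pid) {
            /* Holders send to my_pid + l * mask_pid for l = 1, ..., k - 1. */
            for (int i = 1, j = mask_pid; i < k; i++, j += mask_pid) {
                int target_pid = my_pid + j;
                if (target_pid < p)
                    MPI_Send(M, 1, MPI_INT, target_pid, 0, MPI_COMM_WORLD);
            }
        } else if (my_pid < k * mask_pid) {
            /* Receivers of this round: the sender is my_pid mod mask_pid. */
            MPI_Recv(M, 1, MPI_INT, my_pid % mask_pid, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }
    }
}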

Broadcasting n > p words: Algorithm 4
• Now suppose that the message to be broadcast is not a single word but is of size $n > p$. In the previous algorithms, one of the processors sends or receives substantially more than $n$ words of information, so Algorithm 4 may be a better choice when $n t_w \gg t_s$.
• There is a broadcasting algorithm, call it Algorithm 4, that requires only two communication rounds and is optimal (for the communication model abstracted by $t_s$ and $t_w$) in terms of the amount of information (up to a constant) each processor sends or receives.
• Algorithm 4. Two-phase broadcasting
  - The idea is to split the message into $p$ pieces. In the first round, processor 0 sends piece $i$ to processor $i$. In the second round, processor $i$ replicates the $i$-th piece $p-1$ times, sending one copy to each of the remaining $p-1$ processors (see attached figure).
  - The total time is that of $p-1$ one-to-one sends plus one all-to-all broadcast:
    $(t_s + \frac{n}{p} t_w)(p-1) + (t_s + \frac{n}{p} t_w)(p-1) = 2 (t_s + \frac{n}{p} t_w)(p-1)$
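The two phases can be traced with a sequential simulation in which each processor's local memory is a row of an array. This is a sketch under stated assumptions: the sizes `P` and `N`, the array `mem` and the function names are hypothetical, `N` is assumed to be a multiple of `P`, and plain `memcpy` calls stand in for the actual sends.

```c
#include <assert.h>
#include <string.h>

#define P 4  /* number of processors (hypothetical) */
#define N 8  /* message length in words; assumed to be a multiple of P */

/* Simulated local memories: mem[i] is processor i's buffer for the message. */
static int mem[P][N];

/* Two-phase broadcast of msg from processor 0.
   Returns the number of words sent by processor 0. */
int two_phase_broadcast(const int msg[N]) {
    const int piece = N / P;
    int sent_by_root = 0;
    memset(mem, 0, sizeof mem);
    memcpy(mem[0], msg, sizeof mem[0]);

    /* Phase 1: processor 0 sends piece i to processor i. */
    for (int i = 1; i < P; i++) {
        memcpy(&mem[i][i * piece], &mem[0][i * piece], piece * sizeof(int));
        sent_by_root += piece;
    }
    /* Phase 2: processor i sends its piece to every other processor. */
    for (int i = 0; i < P; i++)
        for (int j = 0; j < P; j++)
            if (j != i) {
                memcpy(&mem[j][i * piece], &mem[i][i * piece], piece * sizeof(int));
                if (i == 0) sent_by_root += piece;
            }
    return sent_by_root;
}

/* Cost from the slide: 2 * (t_s + (n/p) * t_w) * (p - 1). */
double two_phase_time(double ts, double tw, double n, int p) {
    assert(p >= 1);
    return 2.0 * (ts + n / p * tw) * (p - 1);
}
```

In this simulation processor 0 sends $2 (p-1) n/p$ words in total, compared with $(p-1) n$ words if it sent the whole message to every other processor directly, which is the load-balancing effect the two-phase scheme is after.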