LEC6 parallelAlg-Broadcasting

This document summarizes broadcasting algorithms for parallel computing. It discusses: 1) a broadcasting algorithm for the EREW PRAM that takes O(lg n) time by doubling the number of memory cells holding the message in each step; 2) broadcasting algorithms for real architectures that take communication costs such as latency into account, including an optimized algorithm in which each processor sends the message to several other processors per round; and 3) an analysis of broadcasting time as lg p · (ts + tw), accounting for both the startup latency ts and the per-word transfer time tw across p processors, which is more realistic than parallel time alone under the PRAM model.


Lecture 6 Broadcasting

Parallel Computing
Fall 2022

PRAM Algorithm: Broadcasting
 A message (say, a word) is stored in cell 0 of the shared memory. We
would like this message to be read by all n processors of a PRAM.
 On a CREW PRAM this requires one parallel step (each processor i
concurrently reads cell 0).
 On an EREW PRAM broadcasting can be performed in O(lg n) steps. The
structure of the algorithm is the reverse of parallel sum. In step i, each
processor with index j < 2^i reads the contents of cell j and copies it into
cell j + 2^i, doubling the number of cells holding the message. After lg n
steps, each processor i reads the message from cell i.
 A CR?W (concurrent-read) PRAM algorithm that solves the broadcasting
problem has performance P = O(n), T = O(1), and W = O(n).
 The EREW PRAM algorithm that solves the broadcasting problem has
performance P = O(n), T = O(lg n), W = O(n lg n), and W2 = O(n).

Broadcasting
begin Broadcast(M)
    i = 0; j = pid(); C[0] = M;
    while (2^i < P)
        if (j < 2^i)
            C[j + 2^i] = C[j];
        i = i + 1;
    end
    Processor j reads M from C[j].
end Broadcast
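The doubling pattern above can be simulated sequentially. The following C sketch is a minimal illustration (n = 16 processors and the message value 42 are assumed for demonstration); it models the shared memory as an array and runs each step's exclusive copies in a loop:

/* Sketch: sequential simulation of the EREW PRAM doubling broadcast.
   C[] models the shared memory; step i copies cells 0..2^i-1 into
   cells 2^i..2^(i+1)-1, doubling the number of copies of M. */
#include <stdio.h>

#define N 16            /* number of simulated processors (power of two) */

int main(void) {
    int C[N];
    C[0] = 42;          /* the one-word message M in cell 0 (assumed value) */

    for (int stride = 1; stride < N; stride *= 2)   /* stride = 2^i */
        for (int j = 0; j < stride; j++)            /* processor j, j < 2^i */
            C[j + stride] = C[j];                   /* exclusive write */

    for (int j = 0; j < N; j++)                     /* processor j reads C[j] */
        printf("processor %d reads %d\n", j, C[j]);
    return 0;
}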

Traditional PRAM Algorithm vs. Architecture
Independent Parallel Algorithm Design
 Under the PRAM model, synchronization is ignored and thus treated as
free, since PRAM processors work synchronously. Communication is also
ignored, since in the PRAM the cost of accessing the shared memory is as
small as the cost of accessing the local registers of the PRAM.
 In practice, however, the exchange of data can significantly impact the
efficiency of parallel programs by introducing interaction delays
during their execution.
 It takes roughly ts + m·tw time for a simple exchange of an m-word
message between two processes running on different nodes of an
interconnection network with cut-through routing.
 ts: latency, or the startup time for the data transfer
 tw: per-word transfer time, which is inversely proportional to the
available bandwidth between the nodes.
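 For illustration, with assumed (not measured) values ts = 20 µs and tw = 0.02 µs/word: a 1000-word exchange costs roughly 20 + 1000·0.02 = 40 µs, with latency and transfer contributing equally, while a 10-word exchange costs roughly 20.2 µs, dominated almost entirely by ts.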

4
Basic Communication Operations – One-to-all Broadcast and All-to-one Reduction
 Assume that p processes participate in the operation
and the data to be broadcast or reduced contains m
words.
 A one-to-all broadcast or all-to-one reduction procedure involves
log p point-to-point simple message transfers, each at a time cost of
ts + m·tw. Therefore, the total time taken by the procedure is
T = (ts + m·tw) log p
 This is true for all the interconnection networks considered here.
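 In MPI this operation is provided directly by MPI_Bcast. The sketch below (message length m = 1024 words and value 42 are illustrative assumptions) broadcasts from rank 0 to all other ranks; the log p point-to-point rounds are hidden inside the collective:

/* Sketch: one-to-all broadcast of an m-word message with MPI. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, msg[1024] = {0};           /* m = 1024 words (assumed) */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) msg[0] = 42;          /* root holds the data */
    MPI_Bcast(msg, 1024, MPI_INT, 0, MPI_COMM_WORLD);
    printf("rank %d now holds msg[0] = %d\n", rank, msg[0]);
    MPI_Finalize();
    return 0;
}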

All-to-all Broadcast and Reduction
 Linear Array and Ring:
 p different messages circulate in the p-node ensemble.
 If communication is performed circularly in a single direction, then each node receives all (p −
1) pieces of information from the other nodes in (p − 1) steps (see the MPI sketch below).
 So the total time is: T = (ts + m·tw)(p − 1)
 2-D Mesh:
 Based on the linear array algorithm, treating each row and each column of the mesh as a linear array of √p nodes.
 Two phases:
 Phase one: each row of the mesh performs an all-to-all broadcast using the procedure for the linear
array. In this phase, each node collects the √p messages corresponding to the √p nodes of its
row and consolidates them into a single message of size m√p. The time for this phase is:
T1 = (ts + m·tw)(√p − 1)
 Phase two: columnwise all-to-all broadcast of the consolidated messages. By the end of this phase,
each node obtains all p pieces of m-word data that originally resided on the different nodes. The time for this
phase is:
T2 = (ts + m√p·tw)(√p − 1)
 The time for the entire all-to-all broadcast on a p-node two-dimensional square mesh is the sum of the
times spent in the individual phases:
T = 2ts(√p − 1) + m·tw(p − 1)
 Hypercube: in step i = 1, . . . , log p, pairs of nodes exchange messages of size 2^{i−1}·m, so
T = Σ_{i=1}^{log p} (ts + 2^{i−1}·m·tw) = ts log p + m·tw(p − 1)

Traditional PRAM Algorithm vs. Architecture
Independent Parallel Algorithm Design
 As an example of how traditional PRAM algorithm design differs
from architecture independent parallel algorithm design, an example
algorithm for broadcasting in a parallel machine is introduced.
 Problem: In a parallel machine with p processors numbered
0, . . . , p − 1, one of them, say processor 0, holds a one-word
message. The problem of broadcasting involves the dissemination of
this message to the local memory of the remaining p − 1
processors.
 The performance of a well-known exclusive PRAM algorithm for
broadcasting is analyzed below in two ways under the assumption
that no concurrent operations are allowed. One follows the
traditional (PRAM) analysis that minimizes parallel running time. The
other takes into consideration the issues of communication and
synchronization. This leads to a modification of the PRAM-based
algorithm to derive an architecture independent algorithm for
broadcasting whose performance is consistent with observations of
broadcasting operations on real parallel machines.

Broadcasting: MPI Algorithm 1
 Algorithm. Without loss of generality let us assume that p is a
power of two. The message is broadcast in lg p rounds of
communication by binary replication. In round i = 1, . . . , lg p,
each processor j with index j < 2^{i−1} sends the message it
currently holds to processor j + 2^{i−1} (on a shared-memory
system, this may mean copying the message into a cell read by
that processor). The number of processors holding the message at
the end of round i is thus 2^i.
 Analysis of Algorithm. Under the PRAM model the algorithm
requires lg p communication rounds, and as many parallel steps,
to complete. This cost, however, ignores synchronization, which
is free, as PRAM processors work synchronously. It also
ignores communication, as in the PRAM the cost of accessing the
shared memory is as small as the cost of accessing the local
registers of the PRAM.
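A minimal MPI sketch of Algorithm 1 (tag 0 and the message value 42 are illustrative; a guard keeps out-of-range targets safe when p is not a power of two):

/* Sketch: Algorithm 1 (binary replication) with explicit MPI messages.
   Round i: every rank j < 2^(i-1) sends to rank j + 2^(i-1). */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int p, j, M = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    MPI_Comm_rank(MPI_COMM_WORLD, &j);
    if (j == 0) M = 42;                        /* root's one-word message */

    for (int half = 1; half < p; half *= 2) {  /* half = 2^(i-1) */
        if (j < half && j + half < p)
            MPI_Send(&M, 1, MPI_INT, j + half, 0, MPI_COMM_WORLD);
        else if (j >= half && j < 2 * half)
            MPI_Recv(&M, 1, MPI_INT, j - half, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
    }
    printf("rank %d holds %d\n", j, M);
    MPI_Finalize();
    return 0;
}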

Broadcasting: MPI Algorithm 1
 Under the MPI cost model each communication round is assigned a cost of
max{ts, tw · 1}, as each processor in each round sends or receives at most
one message containing the one-word message. The cost of the
algorithm is lg p · max{ts, tw · 1}, as there are lg p rounds of
communication.
 As the information communicated by any processor is small in size, it is
likely that latency issues dominate the transmission time (i.e., the
bandwidth-based cost tw · 1 is insignificant compared to the
latency/synchronization term ts).
 In high-latency machines the dominant term would be ts lg p rather than
tw lg p. Even though each communication round lasts for at least ts
time units, only a small fraction tw of it is used for actual communication.
The remainder is wasted.
 It then makes sense to increase communication round utilization so that
each processor sends the one-word message to as many processors as it
can accommodate within a round.
 The total time is: lg p · (ts + tw)

Broadcasting: MPI Algorithm 2
 Input: p processors numbered 0, . . . , p − 1. Processor 0
holds a message of length equal to one word.
 Output: the message stored in the local memory of the
remaining p − 1 processors.
 Algorithm 2. In one superstep, processor 0 sends the
message to be broadcast to processors 1, . . . , p − 1 in
turn (a "sequential"-looking algorithm).
 Analysis of Algorithm 2.
 The communication time of Algorithm 2 is 1 · max{ts, (p
− 1) · tw} (in a single superstep, the message is
replicated p − 1 times by processor 0).

The total time is ts + (p − 1)·tw
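A minimal MPI sketch of Algorithm 2, with the same illustrative one-word message; processor 0 simply loops over the p − 1 destinations:

/* Sketch: Algorithm 2 -- processor 0 sends the message to ranks
   1..p-1 in turn; one "superstep", cost ts + (p-1)*tw in the model. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int p, me, M = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    MPI_Comm_rank(MPI_COMM_WORLD, &me);

    if (me == 0) {
        M = 42;                                /* assumed message value */
        for (int dest = 1; dest < p; dest++)   /* p-1 replications */
            MPI_Send(&M, 1, MPI_INT, dest, 0, MPI_COMM_WORLD);
    } else {
        MPI_Recv(&M, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    printf("rank %d holds %d\n", me, M);
    MPI_Finalize();
    return 0;
}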

Broadcasting: MPI Algorithm 3
 Algorithm 3
 Both Algorithm 1 and Algorithm 2 can be viewed as extreme cases of an Algorithm 3.
 The main observation is that up to L/g words can be sent in a superstep at a cost of ts (here L and g
are the BSP synchronization-latency and per-word-gap parameters, playing the roles of ts and tw). It then
makes sense for each processor to send up to L/g messages to other processors in a superstep. Let k −
1 be the number of messages a processor sends to other processors in a broadcasting superstep.
The number of processors with the message at the end of a broadcasting superstep is then
k times larger than at the start. We call k the degree of replication of the broadcast
operation.
 Architecture independent Algorithm 3
 In each round, every processor holding the message sends it to k − 1 other processors. In round i = 0,
1, . . ., each processor j with index j < k^i sends the message to the k − 1 distinct processors
numbered j + k^i·l, where l = 1, . . . , k − 1. At the end of round i (the (i+1)-st overall round),
the message has been broadcast to k^i·(k − 1) + k^i = k^{i+1} processors. The number of rounds required is
the minimum integer r such that k^r ≥ p. The number of rounds necessary for full
dissemination is thus decreased to log_k p, and the total cost becomes log_k p · max{ts, (k − 1)tw}.
 At the end of each superstep the number of processors possessing the message is k
times that of the previous superstep. During each superstep each processor holding the
message sends it to exactly k − 1 other processors.
 Algorithm 3 runs for a number of rounds between 1 (k = p, when it becomes Algorithm 2)
and lg p (k = 2, when it becomes Algorithm 1).
 The total time is: log_k p · (ts + (k − 1)tw)

Broadcasting: MPI Algorithm 3
Broadcast(0, p, k)
    my_pid = pid(); mask_pid = 1;
    while (mask_pid < p) {
        if (my_pid < mask_pid) {
            for (i = 1, j = mask_pid; i < k; i++, j += mask_pid) {
                target_pid = my_pid + j;
                if (target_pid < p)
                    mpi_put(target_pid, &M, &M, 0, sizeof(M));  /* or mpi_send */
            }
        }
        else if ((my_pid >= mask_pid) and (my_pid < k * mask_pid))
            mpi_get(...);                                       /* or mpi_recv */
        mask_pid = mask_pid * k;
    }
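A minimal two-sided MPI sketch of Algorithm 3 (the slide's mpi_put is one-sided; MPI_Send/MPI_Recv are used here instead, and k = 4 is an arbitrary illustrative degree of replication):

/* Sketch: Algorithm 3 with two-sided MPI calls. k = 2 reduces it to
   Algorithm 1, k = p to Algorithm 2. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[]) {
    int p, me, M = 0, k = 4;                  /* k: assumed tuning knob */
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    MPI_Comm_rank(MPI_COMM_WORLD, &me);
    if (me == 0) M = 42;                      /* assumed message value */

    for (int mask = 1; mask < p; mask *= k) {     /* mask = k^i */
        if (me < mask) {                          /* senders this round */
            for (int l = 1; l < k; l++) {         /* k-1 messages */
                int target = me + l * mask;
                if (target < p)
                    MPI_Send(&M, 1, MPI_INT, target, 0, MPI_COMM_WORLD);
            }
        } else if (me < k * mask) {               /* receivers this round */
            MPI_Recv(&M, 1, MPI_INT, me % mask, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);          /* sender is me mod k^i */
        }
    }
    printf("rank %d holds %d\n", me, M);
    MPI_Finalize();
    return 0;
}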

Broadcasting n > p words: Algorithm 4
 Now suppose that the message to be broadcast consists not of a
single word but is of size n > p. Algorithm 4 may be a better
choice than the previous algorithms, in which some processor
sends or receives substantially more than n words of information
(n·tw >> ts).
 There is a broadcasting algorithm, call it Algorithm 4, that
requires only two communication rounds and is optimal (for the
communication model abstracted by ts and tw) in terms of the
amount of information (up to a constant) each processor sends or
receives.
 Algorithm 4. Two-phase broadcasting
 The idea is to split the message into p pieces, have processor 0 send
piece i to processor i in the first round, and in the second round have
processor i replicate the i-th piece p − 1 times by sending one copy
to each of the remaining p − 1 processors.
 The total time is: p − 1 one-to-one sends plus one all-to-all broadcast:
(ts + (n/p)·tw)(p − 1) + (ts + (n/p)·tw)(p − 1) = 2(ts + (n/p)·tw)(p − 1)
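A minimal sketch of the two-phase broadcast using MPI's collectives, assuming n = 1024 words with n divisible by p: MPI_Scatter realizes phase one and MPI_Allgather phase two:

/* Sketch: Algorithm 4 (two-phase broadcast) of an n-word message. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N 1024                       /* message length in words (assumed) */

int main(int argc, char *argv[]) {
    int p, me;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &p);
    MPI_Comm_rank(MPI_COMM_WORLD, &me);

    int *msg   = malloc(N * sizeof(int));        /* full message, all ranks */
    int *piece = malloc((N / p) * sizeof(int));  /* this rank's n/p piece */
    if (me == 0)
        for (int i = 0; i < N; i++) msg[i] = i;  /* root fills the message */

    /* Phase 1: root sends piece i to processor i. */
    MPI_Scatter(msg, N / p, MPI_INT, piece, N / p, MPI_INT, 0, MPI_COMM_WORLD);
    /* Phase 2: every processor replicates its piece to all others. */
    MPI_Allgather(piece, N / p, MPI_INT, msg, N / p, MPI_INT, MPI_COMM_WORLD);

    printf("rank %d: msg[N-1] = %d\n", me, msg[N - 1]);
    free(msg); free(piece);
    MPI_Finalize();
    return 0;
}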

End

Thank you!

