Parallel Algorithms
Harald Räcke
http://www14.in.tum.de/lehre/2014WS/pa/
PA
© Harald Räcke 1
Part I
Organizational Matters
ñ Module: IN2011
ñ Name: “Parallel Algorithms”
“Parallele Algorithmen”
ñ ECTS: 8 Credit points
ñ Lectures:
ñ 4 SWS
Mon 14:00–16:00 (Room 00.08.038)
Fri 8:30–10:00 (Room 00.08.038)
ñ Webpage: http://www14.in.tum.de/lehre/2014WS/pa/
ñ Required knowledge:
ñ IN0001, IN0003
“Introduction to Informatics 1/2”
“Einführung in die Informatik 1/2”
ñ IN0007
“Fundamentals of Algorithms and Data Structures”
“Grundlagen: Algorithmen und Datenstrukturen” (GAD)
ñ IN0011
“Basic Theoretic Informatics”
“Einführung in die Theoretische Informatik” (THEO)
ñ IN0015
“Discrete Structures”
“Diskrete Strukturen” (DS)
ñ IN0018
“Discrete Probability Theory”
“Diskrete Wahrscheinlichkeitstheorie” (DWT)
ñ IN2003
“Efficient Algorithms and Data Structures”
“Effiziente Algorithmen und Datenstrukturen”
The Lecturer
ñ Harald Räcke
ñ Email: [email protected]
ñ Room: 03.09.044
ñ Office hours: (per appointment)
Tutorials
ñ Tutors:
ñ Chris Pinkau
ñ [email protected]
ñ Room: 03.09.037
ñ Office hours: Tue 13:00–14:00
ñ Room: 03.11.018
ñ Time: Tue 14:00–16:00
Assignment sheets
Assessment
ñ Assignment Sheets:
ñ An assignment sheet is usually made available on Monday
on the module webpage.
ñ Solutions have to be handed in in the following week before
the lecture on Monday.
ñ You can hand in your solutions by putting them in the right
folder in front of room 03.09.019A.
ñ Solutions will be discussed in the subsequent tutorial on
Tuesday.
1 Contents
ñ PRAM algorithms
ñ Parallel Models
ñ PRAM Model
ñ Basic PRAM Algorithms
ñ Sorting
ñ Lower Bounds
ñ Networks of Workstations
ñ Offline Permutation Routing on the Mesh
ñ Oblivious Routing in the Butterfly
ñ Greedy Routing
ñ Sorting on the Mesh
ñ ASCEND/DESCEND Programs
ñ Embeddings between Networks
2 Literature
Tom Leighton:
Introduction to Parallel Algorithms and Architectures:
Arrays, Trees, Hypercubes,
Morgan Kaufmann: San Mateo, CA, 1992
Joseph JaJa:
An Introduction to Parallel Algorithms,
Addison-Wesley: Reading, MA, 1997
Jeffrey D. Ullman:
Computational Aspects of VLSI,
Computer Science Press: Rockville, USA, 1984
Selim G. Akl.:
The Design and Analysis of Parallel Algorithms,
Prentice Hall: Englewood Cliffs, NJ, 1989
Part II
Foundations
3 Introduction
Parallel Computing
A parallel computer is a collection of processors usually of the
same type, interconnected to allow coordination and exchange
of data.
Distributed Systems
A set of processors, possibly of many different types, distributed over a larger geographic area.
Cost measures
How do we evaluate parallel algorithms?
ñ time efficiency
ñ space utilization
ñ energy consumption
ñ programmability
ñ communication requirement
ñ ...
Problems
ñ performance (e.g. runtime) depends on problem size n and
on number of processors p
ñ statements usually only hold for restricted types of parallel machines, as parallel computers may have vastly different characteristics (in particular w.r.t. communication)
Speedup
Suppose a problem P has sequential complexity T ∗ (n), i.e.,
there is no algorithm that solves P in time o(T ∗ (n)).
Definition 1
The speedup Sp(n) of a parallel algorithm A that requires time Tp(n) for solving P with p processors is defined as

    Sp(n) = T*(n) / Tp(n) .
Efficiency
Definition 2
The efficiency of a parallel algorithm A that requires time Tp(n) when using p processors on a problem of size n is

    Ep(n) = T1(n) / (p · Tp(n)) .

Note that Ep(n) ≤ T1(n) / (p · T∞(n)). Hence, the efficiency goes down rapidly if p ≥ T1(n)/T∞(n).
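As a quick numeric illustration of these definitions, a minimal sketch with hypothetical values for a reduction-style algorithm with T1(n) = n, T∞(n) = log n and Tp(n) = n/p + log n (the concrete numbers are assumptions, not from the slides):

```python
import math

def speedup(t_star, t_p):
    # S_p(n) = T*(n) / T_p(n)
    return t_star / t_p

def efficiency(t1, t_p, p):
    # E_p(n) = T_1(n) / (p * T_p(n))
    return t1 / (p * t_p)

n, p = 1024, 32
t1 = n                         # sequential work
t_inf = math.log2(n)           # critical path length
t_p = n / p + t_inf            # running time with p processors
print(speedup(t1, t_p))        # close to p as long as p << T1/T_inf
print(efficiency(t1, t_p, p))  # drops once p approaches T1/T_inf
```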
Parallel Models — Requirements
Simplicity
A model should make it easy to analyze various performance measures (speed, communication, memory utilization etc.).
Implementability
Parallel algorithms developed in a model should be easily
implementable on a parallel machine.
DAG model — computation graph
Example: Addition
[Figure: two computation DAGs for summing A1 , . . . , A8 : a balanced binary tree of depth 3, and a linear chain of depth 7.]
DAG model — computation graph
Definition 3
A scheduling of a DAG G = (V , E) on p processors is an
assignment of pairs (tv , pv ) to every internal node v ∈ V , s.t.,
ñ pv ∈ {1, . . . , p}; tv ∈ {1, . . . , T }
ñ tu = tv ⇒ pu ≠ pv
ñ (u, v) ∈ E ⇒ tv ≥ tu + 1
where a non-internal node x (an input node) has tx = 0.
T is the length of the schedule.
DAG model — computation graph
The parallel complexity T∞(n) of a DAG is the minimum schedule length over all schedules with an unbounded number of processors (i.e., the length of the longest path).
Clearly,
Tp (n) ≥ T∞ (n)
Tp (n) ≥ T1 (n)/p
Lemma 4
A schedule with length O(T1 (n)/p + T∞ (n)) can be found easily.
Lemma 5
Finding an optimal schedule is in general NP-complete.
Note that the DAG model as defined is a non-uniform model of
computation.
An algorithm (e.g. for a RAM) must work for every input size and
must be of finite description length.
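The easy schedule from Lemma 4 can be sketched by processing the DAG level by level: a level with Wi independent nodes takes ⌈Wi/p⌉ steps, for at most W(n)/p + T∞(n) steps overall. A minimal sketch (the representation via successor lists and in-degrees is my own choice, not from the slides):

```python
def greedy_schedule_length(succ, indeg, p):
    # succ: node -> list of successors; indeg: node -> number of predecessors
    # processes the DAG level by level; a level of W_i independent nodes
    # needs ceil(W_i / p) parallel steps on p processors
    indeg = dict(indeg)
    level = [v for v in indeg if indeg[v] == 0]
    steps = 0
    while level:
        steps += -(-len(level) // p)        # ceil(len(level) / p)
        nxt = []
        for v in level:
            for w in succ.get(v, ()):
                indeg[w] -= 1
                if indeg[w] == 0:
                    nxt.append(w)
        level = nxt
    return steps
```

For a chain the bound is tight at T∞(n); for an antichain it is W(n)/p.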
PRAM Model
[Figure: processors P1 , . . . , P8 connected to a global shared memory.]
PRAM Model
Algorithm 1 copy
1: if id = 1 then round ← 1
2: while round ≤ p and id = round do
3: x[id + 1] ← x[id]
4: round ← round + 1
PRAM Model
ñ processors can effectively execute different code because of
branching according to id
ñ however, not arbitrarily; still uniform model of computation
Algorithm 2 sum
1: // computes sum of x[1] . . . x[p]
2: // red part is executed only by processor 1
3: r ← 1
4: while 2^r ≤ p do
5: for id mod 2^r = 1 pardo
6: // only executed by processors whose id matches
7: x[id] ← x[id] + x[id + 2^(r−1)]
8: r ← r + 1
9: return x[1]
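A sequential simulation of this sum algorithm; the inner loop emulates all processors that are active in round r (a sketch, assuming p is a power of two so every active processor has a partner):

```python
def pram_sum(values):
    # ids are 1-based as in the pseudocode; len(values) is a power of two
    p = len(values)
    x = [None] + list(values)            # x[1..p]
    r = 1
    while 2 ** r <= p:
        for pid in range(1, p + 1):      # "pardo": all matching ids act at once
            if pid % (2 ** r) == 1:
                x[pid] += x[pid + 2 ** (r - 1)]
        r += 1
    return x[1]
```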
Different Types of PRAMs
Simultaneous Access to Shared Memory:
ñ EREW PRAM:
simultaneous access is not allowed
ñ CREW PRAM:
concurrent read accesses to the same location are allowed;
write accesses have to be exclusive
ñ CRCW PRAM:
concurrent read and write accesses allowed
ñ common CRCW PRAM
all processors writing to x[i] must write same value
ñ arbitrary CRCW PRAM
values may be different; an arbitrary processor succeeds
ñ priority CRCW PRAM
values may be different; processor with smallest id succeeds
Algorithm 3 sum
1: // computes sum of x[1] . . . x[p]
2: r ← 1
3: while 2^r ≤ p do
4: for id mod 2^r = 1 pardo
5: x[id] ← x[id] + x[id + 2^(r−1)]
6: r ← r + 1
7: return x[1]
PRAM Model — remarks
ñ similar to a RAM we either need to restrict the size of values
that can be stored in registers, or we need to have a
non-uniform cost model for doing a register manipulation
(cost for manipulating x[i] is proportional to the bit-length
of the largest number that is ever being stored in x[i])
ñ in this lecture: uniform cost model but we are not
exploiting the model
ñ global shared memory is very unrealistic in practice as uniform access to all memory locations does not exist
ñ global synchronization is very unrealistic; in real parallel machines a global synchronization is very costly
ñ model is good for understanding basic parallel
mechanisms/techniques but not for algorithm development
ñ model is good for lower bounds
Network of Workstations — NOWs
Typical Topologies
[Figure: a linear array on nodes 1, . . . , 8; a ring; a 3-dimensional hypercube with nodes labelled 000, . . . , 111; a 4 × 4 mesh/grid with nodes (i, j).]
Network of Workstations — NOWs
Computing the sum on a d-dimensional hypercube. Note that x[0] . . . x[2^d − 1] are stored at the individual nodes.
Algorithm 4 sum
1: // computes sum of x[0] . . . x[2^d − 1]
2: r ← 1
3: while 2^r ≤ 2^d do // p = 2^d
4: if id mod 2^r = 0 then
5: temp ← receive(id + 2^(r−1))
6: x[id] ← x[id] + temp
7: if id mod 2^r = 2^(r−1) then
8: send(x[id], id − 2^(r−1))
9: r ← r + 1
10: if id = 0 then return x[id]
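Algorithm 4 can be simulated sequentially; each send/receive pair of round r collapses into one assignment at the receiving node (a sketch, assuming len(x) = 2^d):

```python
def hypercube_sum(x):
    # x[0 .. 2^d - 1]: one value per hypercube node; the result ends at node 0
    n = len(x)                            # n = 2^d
    x = list(x)
    r = 1
    while 2 ** r <= n:
        half = 2 ** (r - 1)
        for node in range(0, n, 2 ** r):  # receivers: id mod 2^r == 0
            x[node] += x[node + half]     # value sent from node + 2^(r-1)
        r += 1
    return x[0]
```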
Network of Workstations — NOWs
Remarks
ñ One has to ensure that at any point in time there is at most
one active communication along a link
ñ There also exist synchronized versions of the model, where
in every round each link can be used once for
communication
ñ In particular the asynchronous model is quite realistic
ñ Difficult to develop and analyze algorithms as a lot of low
level communication has to be dealt with
ñ Results only hold for one specific topology and cannot be
generalized easily
Performance of PRAM algorithms
Suppose we have a PRAM algorithm that takes time T (n) and
work W (n), where work is the total number of operations.
Idea:
ñ Wi (n) denotes operations in parallel step i, 1 ≤ i ≤ T (n)
ñ simulate each step in dWi (n)/pe parallel steps
ñ then we have
Σ_i ⌈Wi(n)/p⌉ ≤ Σ_i ( ⌊Wi(n)/p⌋ + 1 ) ≤ ⌊W(n)/p⌋ + T(n)
Optimal PRAM algorithms
Sp(n) = Ω( T*(n) / (T*(n)/p + T(n)) ) = Ω( p · T*(n) / (T*(n) + p · T(n)) ) = Ω(p)
This means that by improving the time T(n) (while using the same work) we improve the range of p for which we obtain optimal speedup.
Example
One can show that any CREW PRAM requires Ω(log n) time to
compute the sum.
Communication Cost
Algorithm 5 MatrixMult(A, B, n)
1: Input: n × n matrices A and B; n = 2^k
2: Output: C = AB
3: for 1 ≤ i, j, ℓ ≤ n pardo
4: X[i, j, ℓ] ← A[i, ℓ] · B[ℓ, j]
5: for r ← 1 to log n
6: for 1 ≤ i, j ≤ n; ℓ mod 2^r = 1 pardo
7: X[i, j, ℓ] ← X[i, j, ℓ] + X[i, j, ℓ + 2^(r−1)]
8: for 1 ≤ i, j ≤ n pardo
9: C[i, j] ← X[i, j, 1]
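A sequential simulation of this matrix multiplication (a sketch; indices are shifted to 0-based, so the active ℓ satisfy ℓ mod 2^r = 0, and n must be a power of two):

```python
def matrix_mult(A, B):
    # PRAM-style multiplication: first all n^3 products, then log n rounds
    # of pairwise summation along the l-dimension
    n = len(A)                            # n = 2^k
    X = [[[A[i][l] * B[l][j] for l in range(n)]
          for j in range(n)] for i in range(n)]
    r = 1
    while 2 ** r <= n:
        for i in range(n):
            for j in range(n):
                for l in range(0, n, 2 ** r):
                    X[i][j][l] += X[i][j][l + 2 ** (r - 1)]
        r += 1
    return [[X[i][j][0] for j in range(n)] for i in range(n)]
```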
What happens if we have n processors?
Phase 1
p_i computes X[i, j, ℓ] = A[i, ℓ] · B[ℓ, j] for all 1 ≤ j, ℓ ≤ n
n² time; n² communication for every processor
Phase 2 (round r)
p_i updates X[i, j, ℓ] for all 1 ≤ j ≤ n; ℓ mod 2^r = 1
O(n · n/2^r) time; no communication
Phase 3
p_i writes the i-th row into the C[i, j]'s.
n time; n communication
Alternative Algorithm
Split the matrices into blocks of size n^(2/3) × n^(2/3).
[Figure: A, B, C each partitioned into 4 × 4 blocks Ai,j , Bi,j , Ci,j with A · B = C.]
Part III
PRAM Algorithms
Prefix Sum
input: x[1] . . . x[n]
output: s[1] . . . s[n] with s[i] = x[1] ∗ x[2] ∗ · · · ∗ x[i] (w.r.t. operator ∗)
[Figure: recursive prefix-sum circuit on 16 inputs; the a-, â-, z-, ẑ-values illustrate the recursion; vertical axis: time steps.]
Prefix Sum
The algorithm uses work O(n) and time O(log n) for solving
Prefix Sum on an EREW-PRAM with n processors.
It is clearly work-optimal.
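The recursive scheme behind the work-optimal algorithm can be sketched as follows (assuming n is a power of two; each list comprehension corresponds to one parallel step, and the recursion depth is log n):

```python
def prefix_sums(x):
    # combine adjacent pairs, recurse on the half-length sequence, then
    # expand: odd output positions come directly from the recursion, even
    # positions add one original element to the preceding pair-sum
    n = len(x)
    if n == 1:
        return list(x)
    paired = [x[2 * i] + x[2 * i + 1] for i in range(n // 2)]
    sub = prefix_sums(paired)            # sub[i] = sum of x[0 .. 2i+1]
    out = [0] * n
    for i in range(n):
        if i % 2 == 1:
            out[i] = sub[i // 2]
        elif i == 0:
            out[i] = x[0]
        else:
            out[i] = sub[i // 2 - 1] + x[i]
    return out
```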
Theorem 6
On a CREW PRAM a Prefix Sum requires running time Ω(log n)
regardless of the number of processors.
Algorithm 7 ParallelPrefix
1: for 1 ≤ i ≤ n pardo
2: P[i] ← S[i]
3: while S[i] ≠ S[S[i]] do
4: x[i] ← x[i] ∗ x[S[i]]
5: S[i] ← S[S[i]]
6: if P[i] ≠ i then x[i] ← x[i] ∗ x[S[i]]
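Pointer jumping as a sequential simulation (a sketch, not the slide's exact pseudocode: I use −1 as the nil successor of the list tail, and each round reads the old arrays and writes new ones, mimicking the synchronous parallel steps; shown for the + operator):

```python
def pointer_jump(x, S):
    # x[i]: value at node i; S[i]: successor of node i (-1 at the tail)
    # after ceil(log2 n) rounds, x[i] holds the sum from i to the tail
    n = len(x)
    x, S = list(x), list(S)
    for _ in range(max(1, (n - 1).bit_length())):   # ceil(log2 n) rounds
        nx = [x[i] + x[S[i]] if S[i] != -1 else x[i] for i in range(n)]
        nS = [S[S[i]] if S[i] != -1 else -1 for i in range(n)]
        x, S = nx, nS
    return x
```

With x[i] = 1 for all nodes except 0 at the tail, this computes exactly the list ranks (hops to the end of the list).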
Definition 7
Let X = (x1 , . . . , xt ) be a sequence. The rank rank(y : X) of y in
X is
rank(y : X) = |{x ∈ X | x ≤ y}|
Observation:
We can assume wlog. that elements in A and B are different.
Lemma 8
On a CREW PRAM, Merging can be done in O(log n) time and
O(n log n) work.
not optimal
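The O(n log n)-work merge behind Lemma 8 computes, independently for every element, its rank in the other sequence by binary search; the output position is its own index plus that rank. A sketch (ties are broken by treating equal elements of A as smaller, which keeps all output positions distinct):

```python
import bisect

def rank_merge(A, B):
    # every element's output slot = its own index + its rank in the other
    # sequence; all slots are distinct, so both loops parallelize trivially
    out = [None] * (len(A) + len(B))
    for i, a in enumerate(A):
        out[i + bisect.bisect_left(B, a)] = a
    for j, b in enumerate(B):
        out[j + bisect.bisect_right(A, b)] = b
    return out
```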
Algorithm 8 GenerateSubproblems
1: j_0 ← 0
2: j_k ← n
3: for 1 ≤ i ≤ k − 1 pardo
4: j_i ← rank(b_{i·log n} : A)
5: for 0 ≤ i ≤ k − 1 pardo
6: B_i ← (b_{i·log n+1} , . . . , b_{(i+1)·log n} )
7: A_i ← (a_{j_i +1} , . . . , a_{j_{i+1}} )
Parallelizing the last step gives total work O(n) and time
O(log n).
Lemma 9
On a CRCW PRAM the maximum of n numbers can be computed
in time O(1) with n2 processors.
proof on board...
Lemma 10
On a CRCW PRAM the maximum of n numbers can be computed
in time O(log log n) with n processors and work O(n log log n).
proof on board...
Lemma 11
On a CRCW PRAM the maximum of n numbers can be computed
in time O(log log n) with n processors and work O(n).
proof on board...
[Figure: a search tree over a0 , . . . , a6 , ∞; the elements x3 , x5 , x9 are inserted level by level.]
ñ Step 3, works in phases; one phase for every level of the tree
ñ Step 4, works in rounds; in each round a different set of
elements is inserted
Observation
We can start with phase i of round r as soon as phase i of round r − 1 and (of course) phase i − 1 of round r have finished.
Algorithm 9 BasicColoring
1: for 1 ≤ i ≤ n pardo
2: col(i) ← i
3: k_i ← smallest bit position where col(i) and col(S(i)) differ
4: col′(i) ← 2k_i + col(i)_{k_i}
[Figure: a ring on 15 vertices. For each vertex v: current color col(v), position k of the lowest bit in which col(v) and col(S(v)) differ, and new color col′(v) = 2k + col(v)_k .]

v  col  k col′
1  0001 1 2
3  0011 2 4
7  0111 0 1
14 1110 2 5
2  0010 0 0
15 1111 0 1
4  0100 0 0
5  0101 0 1
6  0110 1 3
8  1000 1 2
10 1010 0 0
11 1011 0 1
12 1100 0 0
9  1001 2 4
13 1101 2 5
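One round of BasicColoring (deterministic coin tossing) as executable code; col(i)_k denotes the k-th bit of col(i):

```python
def basic_coloring_step(col, S):
    # col[i]: current color of node i; S[i]: successor of i in the ring
    # k = lowest bit position in which col[i] and col[S[i]] differ
    # new color = 2k + (k-th bit of col[i])
    new = []
    for i in range(len(col)):
        diff = col[i] ^ col[S[i]]
        k = (diff & -diff).bit_length() - 1   # index of the lowest set bit
        new.append(2 * k + ((col[i] >> k) & 1))
    return new
```

Adjacent colors stay distinct, and b-bit colors shrink to the range {0, . . . , 2b − 1}, which is what drives the O(log* n) rounds.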
4.6 Symmetry Breaking
Algorithm 10 ReColor
1: for ` ← 5 downto 3 do
2: for 1 ≤ i ≤ n pardo
3: if col(i) = ` then
4: col(i) ← min{{0, 1, 2} \ {col(P [i]), col(S[i])}}
Lemma 12
We can color vertices in a ring with three colors in O(log∗ n)
time and with O(n log∗ n) work.
Lemma 13
Given n integers in the range 0, . . . , O(log n), there is an
algorithm that sorts these numbers in O(log n) time using a
linear number of operations.
Proof: Exercise!
Algorithm 11 OptColor
1: for 1 ≤ i ≤ n pardo
2: col(i) ← i
3: apply BasicColoring once
4: sort vertices by colors
5: for ` ← 2⌈log n⌉ downto 3 do
6: for all vertices i of color ` pardo
7: col(i) ← min{{0, 1, 2} \ {col(P [i]), col(S[i])}}
We can perform Lines 6 and 7 in time O(n_ℓ) (where n_ℓ is the number of vertices of color ℓ) only because we sorted before. In general a statement like "for constraint pardo" should only contain a constraint on the ids of the processors but not something complicated (like the color) which has to be checked and, hence, induces work. Because of the sorting we can transform this complicated constraint into a constraint on just the processor ids.
Input:
A list given by successor pointers;
4 5 7 3 1 2 6 8 9
Output:
For every node number of hops to end of the list;
4 5 7 3 1 2 6 8 9
8 7 6 5 4 3 2 1 0
Observation:
Special case of parallel prefix
List Ranking
4 5 7 3 1 2 6 8 9
1 3 1 1 1 2 2 1 0
4 5 7 3 1 2 6 8 9
4 3 1 2 1 4 2 1 0
4 5 7 3 1 2 6 8 9
4 3 1 2 1 4 2 1 0
3 4 2 1 5 6
4 1 2 4 1 0
4 5 7 3 1 2 6 8 9
4 3 1 2 1 4 2 1 0
3 4 2 1 5 6
12 8 7 5 1 0
4 5 7 3 1 2 6 8 9
12 3 8 7 1 5 2 1 0
3 4 2 1 5 6
12 8 7 5 1 0
6. Map the values back into the larger list. Time: O(1);
Work: O(n)
4 5 7 3 1 2 6 8 9
12 11 8 7 6 5 3 1 0
3 4 2 1 5 6
12 8 7 5 1 0
We need O(log log n) shrinking iterations until the size of the
remaining list reaches O(n/ log n).
The work for all shrinking operations is just O(n), as the size of
the list goes down by a constant factor in each round.
List Ranking can be solved in time O(log n log log n) and work
O(n) on an EREW-PRAM.
Optimal List Ranking
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
p1 p1 p2 p2 p3 p4 p4 p5 p5 p3
B1 B2 B3 B4 B5 B6
Each iteration requires constant time and work O(n/ log n),
because we just work on one node in every block.
Observations/Remarks:
ñ If the p-pointer of a block cannot be advanced without
leaving the block, the processor responsible for this block
simply stops working; all other blocks continue.
ñ The p-node of a block (the node pi is pointing to) at the
beginning of a round is either a ruler with a living subject or
the node will become active during the round.
ñ The subject nodes always lie to the left of the p-node of the
respective block (if it exists).
Measure of Progress:
ñ a ruler will delete a subject
ñ an active node either
ñ becomes a ruler (with a subject)
ñ becomes a subject
ñ is isolated and therefore gets deleted
Analysis
Definition 15
The weight of the i-th node in a block is (1 − q)^i with q = 1/log log n, where the node-numbering starts from 0.
Hence, a block has nodes {0, . . . , log n − 1}.
Definition of Rulers
Properties:
ñ A ruler should have at most log log n subjects.
ñ The weight of a ruler should be at most the weight of any of
its subjects.
ñ Each ruler must have at least one subject.
ñ We must be able to remove the next subject in constant
time.
ñ We need to make the ruler/subject decision in constant
time.
Given a sublist of active nodes.
Color the sublist with O(log log n) colors. Take the local minima
w.r.t. this coloring.
Now we change the ruler definition.
We make all local minima w.r.t. the weight function into a ruler;
ties are broken according to block-id (so that comparing weights
gives a strict inequality).
A ruler gets as subjects the nodes left of it until the next local
maximum (or the start of the chain) (including the local
maximum) and the nodes right of it until the next local
maximum (or the end of the chain) (excluding the local
maximum).
In case the first node is a ruler the above definition could leave it
without a subject. We use constant time to fix this in some
arbitrary manner
Set q = 1/log log n.
The total weight of a block is at most 1/q and the total weight of all items is at most n/(q · log n).
to show:
After O(log n) iterations the weight is at most (n/log n) · (1 − q)^{log n}
In every iteration the weight drops by a factor of
(1 − q/4) .
We consider subject nodes to just have half their weight.
The weight is reduced because
ñ An isolated node is removed.
ñ A node is labelled as ruler, and the corresponding subjects
reduce their weight by a factor of 1/2.
ñ A node is a ruler and deletes one of its subjects.
Each p-node is responsible for some other nodes; it has to
generate a weight reduction large enough so that the weight of
all nodes it is responsible for decreases by the desired factor.
A ruler is responsible for all nodes that come after it in its block
and for all its subjects.
Case 1: Isolated Node
Suppose we delete an isolated node v that is the i-th node in its block. The weight it is responsible for is
Σ_{i ≤ j < log n} (1 − q)^j .
After the deletion the remaining weight is
Σ_{i < j < log n} (1 − q)^j ≤ (1 − q) · Σ_{i ≤ j < log n} (1 − q)^j .
Case 2: Creating Subjects
Suppose we generate a ruler with at least one subject.
Initial weight:
Q = Σ_{j=1}^{k} Σ_{i_j ≤ ℓ < log n} (1 − q)^ℓ ≤ (1/q) Σ_{j=1}^{k} (1 − q)^{i_j} ≤ (2/q) Σ_{j=2}^{k} (1 − q)^{i_j}
New weight:
Q′ = Q − (1/2) Σ_{j=2}^{k} (1 − q)^{i_j} ≤ (1 − q/4) Q
Case 3: Removing Subjects
weight of ruler: (1 − q)^{i_1}; weight of subjects: (1 − q)^{i_j}, 2 ≤ j ≤ k
Initial weight:
Q = Σ_{i_1 ≤ ℓ < log n} (1 − q)^ℓ + (1/2) Σ_{j=2}^{k} (1 − q)^{i_j}
  ≤ (1/q) (1 − q)^{i_1} + (k/2) (1 − q)^{i_2}
  ≤ (1/q) (1 − q)^{i_2} + (1/(2q)) (1 − q)^{i_2}
New weight:
Q′ = Q − (1/2) (1 − q)^{i_2} ≤ (1 − q/3) Q
After s iterations the weight is at most
(n/(q log n)) · (1 − q/4)^s ≤ (n/log n) · (1 − q)^{log n} .
Tree Algorithms
[Figure: a tree on vertices 1, . . . , 9 together with its adjacency-list representation; every undirected edge {u, v} appears as the two directed edges (u, v) and (v, u).]
Euler Circuits
Euler Circuits
Lemma 16
An Euler circuit can be computed in time O(1) with O(n) operations.
Euler Circuits — Applications
Rooting a tree
ñ split the Euler tour at node r
ñ this gives a list on the set of directed edges (Euler path)
ñ assign x[e] = 1 for every edge;
ñ perform parallel prefix; let s[·] be the result array
ñ if s[(u, v)] < s[(v, u)] then u is parent of v;
Euler Circuits — Applications
Postorder Numbering
ñ split the Euler tour at node r
ñ this gives a list on the set of directed edges (Euler path)
ñ assign x[e] = 1 for every edge (v, parent(v))
ñ assign x[e] = 0 for every edge (parent(v), v)
ñ perform parallel prefix
ñ post(v) = s[(v, parent(v))]; post(r ) = n
Euler Circuits — Applications
Level of nodes
ñ split the Euler tour at node r
ñ this gives a list on the set of directed edges (Euler path)
ñ assign x[e] = −1 for every edge (v, parent(v))
ñ assign x[e] = 1 for every edge (parent(v), v)
ñ perform parallel prefix
ñ level(v) = s[(parent(v), v)]; level(r ) = 0
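The level computation can be simulated directly on the Euler path; the parallel prefix is taken sequentially here (a sketch — an edge is downward exactly when it enters a vertex for the first time):

```python
def levels_from_euler(tour, root):
    # tour: Euler path of directed edges (u, v), starting and ending at root
    # x[e] = +1 for (parent(v), v) and -1 for (v, parent(v)); the prefix sum
    # at the first entry into v is level(v), and level(root) = 0
    level = {root: 0}
    s = 0
    for u, v in tour:
        if v not in level:      # downward edge (parent(v), v)
            s += 1
            level[v] = s
        else:                   # upward edge (v', parent(v'))
            s -= 1
    return level
```

On the Euler tour example used below (tree rooted at 1) this reproduces the "levels" row of the table.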
Euler Circuits — Applications
Number of descendants
ñ split the Euler tour at node r
ñ this gives a list on the set of directed edges (Euler path)
ñ assign x[e] = 0 for every edge (parent(v), v)
ñ assign x[e] = 1 for every edge (v, parent(v)), v ≠ r
ñ perform parallel prefix
ñ size(v) = s[(v, parent(v))] − s[(parent(v), v)]
Rake Operation
[Figure: rake operation: a leaf and its parent are removed from the tree; the leaf's sibling subtree is attached directly to the grandparent.]
We want to apply rake operations to a binary tree T until T just
consists of the root with two children.
Possible Problems:
1. we could concurrently apply the rake-operation to two
siblings
2. we could concurrently apply the rake-operation to two
leaves u and v such that p(u) and p(v) are connected
By choosing leaves carefully we ensure that none of the above
cases occurs
Algorithm:
ñ label leaves consecutively from left to right (excluding
left-most and right-most leaf), and store them in an array A
ñ for dlog(n + 1)e iterations
ñ apply rake to all odd leaves that are left children
ñ apply rake operation to remaining odd leaves (odd at start
of round!!!)
ñ A=even leaves
Observations
ñ the rake operation does not change the order of leaves
ñ two leaves that are siblings do not perform a rake operation
in the same round because one is even and one odd at the
start of the round
ñ two leaves that have adjacent parents either have different
parity (even/odd) or they differ in the type of child
(left/right)
Cases when the edge between p(u) and p(v) is a left-child edge.
[Figure: the possible configurations of leaves u and v whose parents are adjacent.]
Example
[Figure: binary tree whose leaves are numbered 1, . . . , 17, illustrating the order in which rake operations are applied.]
ñ one iteration can be performed in constant time with O(|A|)
processors, where A is the array of leaves;
ñ hence, all iterations can be performed in O(log n) time and
O(n) work;
ñ the initial parallel prefix also requires time O(log n) and work O(n)
Evaluating Expressions
Suppose that we want to evaluate an expression tree, containing
additions and multiplications.
[Figure: two expression trees over A1 , . . . , A8 with + and ∗ gates: a balanced version and a degenerate (deep) version.]
We can use the rake-operation to do this quickly.
Invariant:
Let u be internal node with children v and w. Then
Rake Operation
[Figure: rake operation applied to an expression tree with root r , internal nodes u, v, w and leaves x1 , . . . , x5 .]
Lemma 17
We can evaluate an arithmetic expression tree in time O(log n)
and work O(n) regardless of the height or depth of the tree.
Lemma 18
We can compute tree functions for arbitrary trees in time O(log n) and a linear number of operations.
proof on board...
In the LCA (least common ancestor) problem we are given a tree
and the goal is to design a data-structure that answers
LCA-queries in constant time.
Least Common Ancestor
LCAs on complete binary trees (inorder numbering):
[Figure: complete binary tree; every node is labelled with its inorder number in binary, e.g. root 8 = 1000, children 4 = 0100 and 12 = 1100, leaves 1, 3, 5, . . . , 15.]
The label of the LCA of two nodes has the form z1 z2 . . . zi 1 0 . . . 0, where z1 . . . zi is the longest common prefix of the two labels.
Least Common Ancestor
[Figure: tree rooted at 1 with children 2, 8, 9; node 2 has children 3 and 4; node 4 has children 5, 6, 7. The Euler tour gives the sequences below.]
nodes 1 2 3 2 4 5 4 6 4 7 4 2 1 8 1 9 1
levels 0 1 2 1 2 3 2 3 2 3 2 1 0 1 0 1 0
`(v) is index of first appearance of v in node-sequence.
Least Common Ancestor
Lemma 19
Range Minima Problem
Observation
Given an algorithm for solving the range minima problem in time
T (n) and work W (n) we can obtain an algorithm that solves the
LCA-problem in time O(T (n) + log n) and work O(n + W (n)).
Remark
In the sequential setting the LCA-problem and the range minima
problem are equivalent. This is not necessarily true in the
parallel setting.
Prefix and Suffix Minima
Tree with prefix-minima and suffix-minima:
[Figure: complete binary tree over the array 6 4 2 3 4 5 1 6 0 5 1 6 3 4 5 3; every node stores the prefix minima and the suffix minima of its subsequence.]
ñ Suppose we have an array A of length n = 2k
ñ We compute a complete binary tree T with n leaves.
ñ Each internal node corresponds to a subsequence of A. It
contains an array with the prefix and suffix minima of this
subsequence.
Lemma 20
We can solve the range minima problem in time O(log n) and
work O(n log n).
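The structure can be sketched as follows: one level per block size, each storing the prefix and suffix minima of all blocks (O(n log n) values overall); a query combines a suffix minimum and a prefix minimum from the level just below the first level at which ℓ and r share a block (assuming the array length is a power of two):

```python
def build_rmq(A):
    # levels[h] = (block size 2^h, prefix minima, suffix minima)
    levels, size = [], 1
    while size <= len(A):
        pref, suf = [], []
        for b in range(0, len(A), size):
            block = A[b:b + size]
            m = block[0]
            for v in block:                 # prefix minima of the block
                m = min(m, v)
                pref.append(m)
            m, tail = block[-1], []
            for v in reversed(block):       # suffix minima of the block
                m = min(m, v)
                tail.append(m)
            suf.extend(reversed(tail))
        levels.append((size, pref, suf))
        size *= 2
    return levels

def range_min(levels, l, r):
    # minimum of A[l..r] (inclusive), l <= r
    if l == r:
        return levels[0][1][l]
    for h in range(1, len(levels)):
        if l // levels[h][0] == r // levels[h][0]:
            _, pref, suf = levels[h - 1]    # l and r in sibling blocks here
            return min(suf[l], pref[r])
```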
Reducing the Work
For each block Bi compute the minimum xi and its prefix and
suffix minima.
Answering a query (ℓ, r):
ñ if ℓ and r are from the same block the data-structure for this block gives us the result in constant time
ñ if ℓ and r are from different blocks the result is a minimum of three elements:
Searching
log_{p+1}(n) = log n / log(p + 1)
Definition 21
Let X = (x1 , . . . , xt ) be a sequence. The rank rank(y : X) of y in
X is
rank(y : X) = |{x ∈ X | x ≤ y}|
j(i) := rank(b_{i·√m} : A)
3. Let B_i = (b_{i·√m+1} , . . . , b_{(i+1)·√m−1} ); and A_i = (a_{j(i)+1} , . . . , a_{j(i+1)} ).
proof on board...
Lemma 22
A straightforward parallelization of Mergesort can be
implemented in time O(log n log log n) and with work O(n log n).
Algorithm 11 ColeSort()
1: initialize L0[v] = A_v for leaf nodes; L0[v] = ∅ otw.
2: for s ← 1 to 3 · height(T) do
3: for all active nodes v do
4: // u and w children of v
5: L′s[u] ← sample(Ls−1[u])
6: L′s[w] ← sample(Ls−1[w])
7: Ls[v] ← merge(L′s[u], L′s[w])

sample(Ls[v]) = sample4(Ls[v]) if s ≤ 3·height(v); sample2(Ls[v]) if s = 3·height(v) + 1; sample1(Ls[v]) if s = 3·height(v) + 2
[Figure: example run of ColeSort on the input 4 9 3 9 2 9 0 9 6 8 2 6 5 6 2 5 4 7 0 9 6 8 1 6 1 5 0 4 3 3 6 6; the node lists Ls[v] are shown for successive rounds s.]
Lemma 23
After round s = 3 height(v), the list Ls [v] is complete.
Proof:
ñ clearly true for leaf nodes
ñ suppose it is true for all nodes up to height h;
ñ fix a node v on level h + 1 with children u and w
ñ L_{3h}[u] and L_{3h}[w] are complete by induction hypothesis
ñ further sample(L_{3h+2}[u]) = L[u] and sample(L_{3h+2}[w]) = L[w]
ñ hence in round 3h + 3 node v will merge the complete lists of its children; after the round L[v] will be complete
Lemma 24
The number of elements in lists Ls [v] for active nodes v is at
most O(n).
proof on board...
Lemma 26
L′s[v] is a 4-cover of L′s+1[v].
Lemma 27
If [a, b] with a, b ∈ L′s[v] ∪ {−∞, ∞} intersects (−∞, L′s[v], ∞) in k ≥ 2 items, then [a, b] intersects (−∞, L′s+1[v], ∞) in at most 2k items.
|[a, b] ∩ L′s[v]| = k
|[a, b] ∩ Ls−1[v]| ≤ 4k − 3
p + q ≤ 4k − 1
|[a, b] ∩ Ls[v]| ≤ 2p + 2q ≤ 8k − 2
Note that the last step holds as long as L′s+1[v] = sample4(Ls[v]). Otw. Ls−1[v] has already been full, and hence L′s[v], L′s+1[v], L′s+2[v] are 4-covers of the complete list L[v], and also 4-covers of each other.
Merging with a Cover
Lemma 28
Given two sorted sequences A and B. Let X be a c-cover of A and
B for constant c, and let rank(X : A) and rank(X : B) be known.
Lemma 29
Given two sorted sequences A and B. Let X be a c-cover of B for
constant c, and let rank(A : X) and rank(X : B) be known.
Lemma 30
Given two sorted sequences A and B. Let X be a c-cover of B for
constant c, and let rank(A : X) and rank(X : B) be known.
We can compute
ñ rank(L′s+1[v] : L′s+2[v])
ñ rank(L′s+1[u] : L′s+1[w])
ñ rank(L′s+1[w] : L′s+1[u])
in constant time and O(|Ls+1[v]|) operations, where v is the parent of u and w.
Bitonic Merger
If we feed a bitonic 0-1 sequence S into the network on the right we obtain two bitonic sequences ST and SB s.t.
1. SB ≤ ST (element-wise)
2. SB and ST are bitonic
Proof:
ñ assume wlog. that S has more 1's than 0's.
Bitonic Merger Bd
The bitonic merger Bd of dimension d is constructed by combining two bitonic mergers of dimension d − 1.
[Figure: a column of comparators feeding two copies of Bd−1 , which output S0_{d−1} and S1_{d−1} .]
Bitonic Merger: (n = 2^d)
ñ comparators: C(n) = 2C(n/2) + n/2 ⇒ C(n) = O(n log n).
ñ depth: D(n) = D(n/2) + 1 ⇒ D(n) = O(log n).
Bitonic Sorter: (n = 2^d)
ñ comparators: C(n) = 2C(n/2) + O(n log n) ⇒ C(n) = O(n log² n).
ñ depth: D(n) = D(n/2) + log n ⇒ D(n) = Θ(log² n).
Odd-Even Merge
How to merge two sorted sequences?
A = (a1 , a2 , . . . , an ), B = (b1 , b2 , . . . , bn ), n even.
Let
Then
Odd-Even Merge
[Figure: odd-even merger Md : two copies of Md−1 (on the odd- and even-indexed inputs) followed by a row of comparators.]
Theorem 34
There exists a sorting network with depth O(log n) and
O(n log n) comparators.
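Batcher's odd-even merge, the building block of the O(log² n)-depth merging network described above, can be sketched recursively (a sketch assuming both inputs are sorted and of equal power-of-two length):

```python
def odd_even_merge(A, B):
    # recursively merge the even- and odd-indexed subsequences, then fix up
    # with one final layer of comparators
    if len(A) == 1:
        return [min(A[0], B[0]), max(A[0], B[0])]
    C = odd_even_merge(A[0::2], B[0::2])   # merged even subsequences
    D = odd_even_merge(A[1::2], B[1::2])   # merged odd subsequences
    out = [C[0]]
    for i in range(len(D) - 1):
        out += [min(D[i], C[i + 1]), max(D[i], C[i + 1])]
    out.append(D[-1])
    return out
```

Every min/max pair is one comparator; unrolling the recursion yields the network with O(n log n) comparators and O(log n) depth per merge.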
Parallel Comparison Tree Model
Comparison PRAM
A Lower Bound for Searching
Theorem 35
Given a sorted table X of n elements and an element y. Searching for y in X requires Ω(log n / log(p + 1)) steps in the parallel comparison tree with parallelism p < n.
A Lower Bound for Maximum
Theorem 36
A graph G with m edges and n vertices has an independent set of size at least n²/(2m + n).
base case (n = 1)
ñ The only graph with one vertex has m = 0, and an
independent set of size 1.
induction step (1, . . . , n → n + 1)
ñ Let G be a graph with n + 1 vertices, and v a node with minimum degree d.
ñ Let G' be the graph after deleting v and its neighbors from G.
ñ n' = n − (d + 1)
ñ m' ≤ m − (d/2)·(d + 1), as we remove d + 1 vertices, each with degree at least d
ñ In G' there is an independent set of size (n')²/(2m' + n').
ñ By adding v we obtain an independent set of size
1 + (n')²/(2m' + n') ≥ n²/(2m + n)
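The proof is constructive; a greedy sketch matching the induction (the example graph below is hypothetical):

```python
# Greedy sketch of the proof: repeatedly move a minimum-degree vertex into
# the independent set and delete it together with its neighbors.
def greedy_independent_set(n, edges):
    adj = {v: set() for v in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    alive, indep = set(range(n)), []
    while alive:
        v = min(alive, key=lambda u: len(adj[u] & alive))  # min degree
        indep.append(v)
        alive -= adj[v] | {v}        # remove v and its neighbors
    return indep

# hypothetical example: a 5-cycle plus one chord (n = 5, m = 6)
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0), (0, 2)]
I = greedy_independent_set(5, edges)
assert len(I) >= 5 * 5 / (2 * 6 + 5)   # n^2/(2m+n) = 25/17
```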
A Lower Bound for Maximum
Theorem 37
Computing the maximum of n elements in the comparison tree
requires Ω(log log n) steps whenever the degree of parallelism is
p ≤ n.
Theorem 38
Computing the maximum of n elements requires Ω(log log n)
steps on the comparison PRAM with n processors.
An adversary can specify the input such that at the end of the
(i + 1)-st step the maximum lies in a set Ci+1 of size si+1 such
that
ñ no two elements of Ci+1 have been compared
ñ s_{i+1} ≥ s_i² / (2p + c_i)
Theorem 39
The selection problem requires Ω(log n/ log log n) steps on a
comparison PRAM.
A Lower Bound for Merging
A Lower Bound for Merging
Lemma 40
Suppose we are given a parallel comparison tree with
parallelism p to solve the (k, s) merging problem. After the first
step an adversary can specify the input such that an arbitrary
(k0 , s 0 ) merging problem has to be solved, where
k' = (3/4)·√(pk)
s' = (s/4)·√(k/p)
A Lower Bound for Merging
Choose for every i the diagonal of M^i that has the most zeros.
We can choose values s.t. the elements of the j-th pair along the diagonal are all smaller than those of the (j + 1)-th pair.
How many pairs do we have?
ñ there are k blocks in total
ñ there are k · ℓ² matrix entries in total
ñ there are at least k · ℓ² − p zeros
ñ choosing a random diagonal (the same for every matrix M^i) hits at least
(k·ℓ² − p)/(2ℓ − 1) ≥ kℓ/2 − p/(2ℓ)
zeros.
ñ Choosing ℓ = ⌈2·√(p/k)⌉ gives
k' ≥ (3/4)·√(pk)  and  s' = ⌊s/ℓ⌋ ≥ s/(4·√(p/k)) = (s/4)·√(k/p)
Lemma 41
Let T(k, s, p) be the number of parallel steps required on a comparison tree to solve the (k, s) merging problem. Then
T(k, s, p) ≥ (1/4) · log(p/k) / log(p/(ks))
Induction Step:
Assume that
T(k', s', p) ≥ (1/4) · log(p/k') / log(p/(k's'))
≥ (1/4) · log((4/3)·√(p/k)) / log((16/3)·p/(ks))
≥ (1/4) · ((1/2)·log(p/k)) / ((7/2)·log(p/(ks)))
≥ (1/4) · log(p/k) / log(p/(ks)) − 1
Theorem 42
Merging requires at least Ω(log log n) time on a CRCW PRAM
with n processors.
Simulations between PRAMs
Theorem 43
We can simulate a p-processor priority CRCW PRAM on a
p-processor EREW PRAM with slowdown O(log p).
Theorem 44
We can simulate a p-processor priority CRCW PRAM on a
p log p-processor common CRCW PRAM with slowdown O(1).
Theorem 45
We can simulate a p-processor priority CRCW PRAM on a p-processor common CRCW PRAM with slowdown O(log p / log log p).
Theorem 46
We can simulate a p-processor priority CRCW PRAM on a
p-processor arbitrary CRCW PRAM with slowdown O(log log p).
Ideal PRAM:
ñ every processor has unbounded local memory
ñ in each step a processor reads a global variable
ñ then it does some (unbounded) computation on its local
memory
ñ then it writes a global variable
Definition 47
An input index i affects a memory location M at time t on some
input I if the content of M at time t differs between inputs I and
I(i) (i-th bit flipped).
Definition 48
An input index i affects a processor P at time t on some input I
if the state of P at time t differs between inputs I and I(i) (i-th
bit flipped).
Lemma 49
If i ∈ K(P , t, I) with t > 1 then either
ñ i ∈ K(P , t − 1, I), or
ñ P reads a global memory location M on input I at time t,
and i ∈ L(M, t − 1, I).
Lemma 50
If i ∈ L(M, t, I) with t > 1 then either
ñ A processor writes into M at time t on input I and
i ∈ K(P , t, I), or
ñ No processor writes into M at time t on input I and
ñ either i ∈ L(M, t − 1, I)
ñ or a processor P writes into M at time t on input I(i).
Lemma 51
|K(P , t, I)| ≤ kt and |L(M, t, I)| ≤ `t for any t ≥ 0
Hence,
Case 1:
A processor P writes into location M at time t + 1 on input I.
Then,
Fact:
For all pairs u_s, u_t with P_{w_s} ≠ P_{w_t}, either u_s ∈ K(P_{w_t}, t + 1, I(u_t)) or u_t ∈ K(P_{w_s}, t + 1, I(u_s)).
Otherwise, P_{w_t} and P_{w_s} would both write into M at the same time on input I(u_s)(u_t).
Hence, there must be at least (1/2)·r·(r − k_{t+1}) pairs u_i, u_j with P_{w_i} ≠ P_{w_j}.
Hence,
|E| ≥ (1/2)·r·(r − k_{t+1})
Eigenvalues:
λ1 = (5 + √21)/2 and λ2 = (5 − √21)/2
Eigenvectors:
v1 = (1, −(1 − λ1))ᵀ and v2 = (1, −(1 − λ2))ᵀ
i.e.,
v1 = (1, 3/2 + √21/2)ᵀ and v2 = (1, 3/2 − √21/2)ᵀ
Initial values:
(k0, ℓ0)ᵀ = (0, 1)ᵀ = (1/√21)·(v1 − v2)
Hence,
(kt, ℓt)ᵀ = (1/√21)·(λ1^t·v1 − λ2^t·v2)
Solving the recurrence gives
kt = λ1^t/√21 − λ2^t/√21
ℓt = ((3 + √21)/(2√21))·λ1^t + ((−3 + √21)/(2√21))·λ2^t
with λ1 = (5 + √21)/2 and λ2 = (5 − √21)/2.
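The eigen-data above is consistent with the linear recurrence k_{t+1} = k_t + ℓ_t, ℓ_{t+1} = 3·k_t + 4·ℓ_t (trace 5, determinant 1); taking that recurrence as an assumption (the matrix itself is stated on an earlier slide), the closed form can be checked numerically:

```python
import math

# Check the closed form against the (assumed) recurrence
# k_{t+1} = k_t + l_t,  l_{t+1} = 3 k_t + 4 l_t,  k_0 = 0, l_0 = 1.
s21 = math.sqrt(21)
l1, l2 = (5 + s21) / 2, (5 - s21) / 2

def closed_form(t):
    k = (l1 ** t - l2 ** t) / s21
    l = ((3 + s21) * l1 ** t + (-3 + s21) * l2 ** t) / (2 * s21)
    return k, l

k, l = 0, 1
for t in range(15):
    ck, cl = closed_form(t)
    assert abs(ck - k) < 1e-6 * max(1, k) and abs(cl - l) < 1e-6 * max(1, l)
    k, l = k + l, 3 * k + 4 * l     # advance the recurrence
```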
C(T) ≥ n, C(0) = 0
Claim:
C(t) ≤ 6·C(t − 1) + 3|P|
This gives C(T) ≤ ((6^T − 1)/5)·3|P| and hence T = Ω(log n − log |P|).
This means that the number of new indices in any set L(M, t) (over all M) is at most 2·Σ_P |K(P, t)|:
Σ_M max{0, |L(M, t)| − 1} ≤ Σ_M max{0, |L(M, t − 1)| − 1} + 2·Σ_P |K(P, t)|
Σ_P |K(P, t)| ≤ Σ_P |K(P, t − 1)| + Σ_{M∈J_t} |L(M, t − 1)|
Recall
Σ_M max{0, |L(M, t)| − 1} ≤ Σ_M max{0, |L(M, t − 1)| − 1} + 2·Σ_P |K(P, t)|
Hence,
C(t) ≤ 6·C(t − 1) + 3|P|
Theorem 56
Let f : {0, 1}^n → {0, 1} be an arbitrary Boolean function. f can be computed in O(1) time on a common CRCW PRAM with ≤ n·2^n processors.
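A sequential sketch of the standard argument behind the theorem (my own illustration): dedicate n processors to each of the 2^n possible inputs, let the group whose pattern matches the actual input write f(pattern):

```python
from itertools import product

# Sequential sketch of the O(1)-time common-CRCW idea: one group of n
# processors per possible input pattern; the matching group writes f(pattern).
def crcw_eval(f, x):
    n = len(x)
    out = None
    for pattern in product([0, 1], repeat=n):     # 2^n processor groups
        # the n processors of this group each compare one bit in parallel
        if all(pattern[i] == x[i] for i in range(n)):
            out = f(pattern)                      # common concurrent write
        # (simplified: detecting the AND of n bits in O(1) itself uses the
        # common-write trick on the PRAM model)
    return out

parity = lambda bits: sum(bits) % 2
assert crcw_eval(parity, (1, 0, 1, 1)) == 1
```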
Definition 58
A family {C_n} of circuits is logspace uniform if there exists a deterministic Turing machine M s.t.
ñ M runs in logarithmic space.
ñ For all n, M outputs C_n on input 1^n.
[Figure: Beneš network, levels −3, . . . , 3, node columns 0000, . . . , 1111]
[Figure: n-ary Beneš network for n = 3, levels −3, . . . , 3, node columns 000, . . . , 222]
PA 11 Some Networks
© Harald Räcke 210
Permutation Routing
Lemma 59
On the linear array M(n, 1) any permutation can be routed online in 2n steps with buffer size 3.
Permutation Routing
Lemma 60
On the Beneš network any permutation can be routed offline in 2d steps between the source level (+d) and the target level (−d).
Recursive Beneš Network
[Figure: B(d) built from two copies of B(d − 1)]
Permutation Routing
base case d = 0
trivial
induction step d → d + 1
ñ The packets that start at (ā, d) and (ā(d), d) have to be
sent into different sub-networks.
ñ The packets that end at (ā, −d) and (ā(d), −d) have to
come out of different sub-networks.
We can color such a graph with n colors such that no two nodes
in a hyperedge share a color.
We can simulate the algorithm for the n-ary Beneš Network.
We simulate the behaviour of the Beneš network on the
n-dimensional mesh.
The nodes are of the form (ℓ, x̄), x̄ ∈ [n]^d, ℓ ∈ {−d, . . . , d}.
We route in 3 phases:
1. Permute packets along the rows such that afterwards no
column contains packets that have the same target row.
O(d) steps.
2. We can use pipelining to permute every column, so that afterwards every packet is in its target row. O(2d + 2d) steps.
3. Every packet is in its target row. Permute packets to their
right destinations. O(d) steps.
Lemma 63
We can do offline permutation routing of (partial) permutations in 2d steps on the hypercube.
Lemma 64
We can sort on the hypercube M(2, d) in O(d²) steps.
Lemma 65
We can do online permutation routing of permutations in O(d²) steps on the hypercube.
Bitonic Sorter Sd
[Figure: the bitonic sorter Sd built from two copies of Sd−1]
ASCEND/DESCEND Programs
Algorithm 11 oper(a, a', dim, T_a, T_{a'})
1: if a_dim . . . a_0 = 0^{dim+1} then
2:   T_a ← min{T_a, T_{a'}}
We can perform an ASCEND/DESCEND run on a linear array M(2^d, 1) in O(2^d) steps.
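An ASCEND run with the oper from Algorithm 11, simulated sequentially (a sketch; after the run, T[0] holds the global minimum):

```python
# ASCEND on the hypercube M(2, d), simulated sequentially: in dimension dim
# every node a combines with its neighbor a' = a XOR 2^dim, and Algorithm 11's
# oper sets T_a <- min(T_a, T_a') whenever the bits a_dim ... a_0 are all 0.
def ascend_min(T):
    T = list(T)
    d = len(T).bit_length() - 1                    # len(T) = 2^d
    for dim in range(d):
        for a in range(len(T)):
            if a % (1 << (dim + 1)) == 0:          # a_dim ... a_0 = 0^(dim+1)
                T[a] = min(T[a], T[a ^ (1 << dim)])
    return T

T = ascend_min([7, 3, 9, 1, 8, 2, 6, 4])
assert T[0] == 1          # node 0 ends up with the global minimum
```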
The CCC network is obtained from a hypercube by replacing every node by a cycle of length d.
This gives a constant-degree network.
Lemma 66
Let d = 2^k. An ASCEND run of the hypercube M(2, d + k) can be simulated on CCC(d) in O(d) steps.
The shuffle exchange network SE(d) is defined as follows
ñ nodes: V = [2]^d
ñ edges:
E = { {xᾱ, ᾱx} | x ∈ [2], ᾱ ∈ [2]^(d−1) } ∪ { {ᾱ0, ᾱ1} | ᾱ ∈ [2]^(d−1) }
Edges of the first type are called shuffle edges; edges of the second type are called exchange edges. The network has constant degree.
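The definition can be made concrete with a small sketch that builds SE(d) and checks the constant degree:

```python
# Build SE(d): nodes are d-bit strings; shuffle edges connect x*alpha with
# alpha*x (a cyclic shift), exchange edges connect alpha*0 with alpha*1.
def shuffle_exchange(d):
    edges = set()
    for v in range(2 ** d):
        shuffled = ((v << 1) | (v >> (d - 1))) & (2 ** d - 1)  # cyclic shift
        if v != shuffled:
            edges.add(frozenset((v, shuffled)))    # shuffle edge
        edges.add(frozenset((v, v ^ 1)))           # exchange edge
    return edges

E = shuffle_exchange(4)
degree = {v: sum(v in e for e in E) for v in range(16)}
assert max(degree.values()) <= 3       # constant degree
```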
Shuffle Exchange Networks
[Figure: the shuffle exchange network SE(3) with nodes 000, . . . , 111]
Lemma 67
We can perform an ASCEND run of M(2, d) on SE(d) in O(d)
steps.
Simulations between Networks
Simulations between Networks
Definition 68
A configuration Ci of processor Pi is the complete description of
the state of Pi including local memory, program counter,
read-register, write-register, etc.
Simulations between Networks
Definition 69
Let C = (C_0, . . . , C_{p−1}) be a configuration of M. A machine M' with q ≥ p processors weakly simulates t steps of M with slowdown k if
ñ in the beginning there are p non-empty processor sets A_0, . . . , A_{p−1} ⊆ M' so that all processors in A_i know C_i;
ñ after at most k · t steps of M' there is a processor Q(i) that knows the t-th successor configuration of C for processor P_i.
Simulations between Networks
Definition 70
M' simulates M with slowdown k if
ñ M' weakly simulates machine M with slowdown k
ñ and every processor in A_i knows the t-th successor configuration of C for processor P_i.
We have seen how to simulate an ASCEND/DESCEND run of the hypercube M(2, d + k) on CCC(d) with d = 2^k in O(d) steps.
Lemma 71
Suppose a network S with n processors can route any
permutation in time O(t(n)). Then S can simulate any constant
degree network M with at most n vertices with slowdown
O(t(n)).
Map the vertices of M to vertices of S in an arbitrary way.
Lemma 72
Suppose a network S with n processors can sort n numbers in
time O(t(n)). Then S can simulate any network M with at most
n vertices with slowdown O(t(n)).
Lemma 73
There is a constant degree network on O(n^(1+ε)) nodes that can simulate any constant degree network with slowdown O(1).
Suppose we allow concurrent reads; this means that in every step all neighbours of a processor P_i can read P_i’s read register.
Lemma 74
A constant degree network M that can simulate any n-node
network has slowdown Ω(log n) (independent of the size of M).
We show the lemma for the following type of simulation.
ñ There are representative sets A_i^t for every step t that specify which processors of M simulate processor P_i in step t (know the configuration of P_i after the t-th step).
ñ The representative sets for different processors are disjoint.
ñ for all i ∈ {1, . . . , n} and steps t, A_i^t ≠ ∅.
Suppose processor Pi reads from processor Pji in step t.
k = (1/ℓ) · Σ_{t=1}^{ℓ} k_t
We show
ñ The simulation of a step takes at least time γ log n, or
ñ the size of the representative sets shrinks by a lot:
Σ_i |A_i^{t+1}| ≤ (1/n) · Σ_i |A_i^t|
Suppose there is no pair (i, j) such that i reading from j requires time γ log n.
ñ For every j the set Γ^{2k}(A_i) contains a node from A_j.
ñ Hence, there must exist a j_i such that Γ^{2k}(A_i) contains at most
|C_{j_i}| := |A_i| · c^{2k} / (n − 1) ≤ |A_i| · c^{3k} / n
processors from A_{j_i}.
If we choose that i reads from j_i we get
|A_i'| ≤ |C_{j_i}| · c^k ≤ c^k · |A_i| · c^{3k} / n = (1/n) · |A_i| · c^{4k}
Let ℓ be the total number of steps and s be the number of short steps, i.e., steps with k_t < γ log n.
n ≤ h_ℓ ≤ h_0 · (1/n)^s · ∏_{t∈long} c^{k_t+1} ≤ (n/n^s) · c^{ℓ + Σ_t k_t}

n ≤ n^{1−s+ℓ/2}
Deterministic Online Routing
Lemma 75
A permutation on an n × n-mesh can be routed online in O(n)
steps.
Deterministic Online Routing
Definition 80 (dilation)
For a given path system the dilation is the maximum length of a
path.
Lemma 81
Any oblivious routing protocol requires at least max{C_f, D_f} steps, where C_f and D_f are the congestion and dilation, respectively, of the path system used (node congestion or edge congestion depending on the communication model).
Lemma 82
Any reasonable oblivious routing protocol requires at most
O(Df · Cf ) steps (unbounded buffers).
Theorem 83 (Borodin, Hopcroft)
For any path system W there exists a permutation π : V → V and an edge e ∈ E such that at least Ω(√n/∆) of the paths go through e.
Let Wv = {Pv,u | u ∈ V }.
For any node v there are many edges that are quite popular for v.
|V| × |E|-matrix A(z):
A_{v,e}(z) = 1 if e is z-popular for v, and 0 otherwise
Define
ñ A_v(z) = Σ_e A_{v,e}(z)
ñ A_e(z) = Σ_v A_{v,e}(z)
Lemma 84
Let z ≤ (n − 1)/∆. For every node v ∈ V there exist at least n/(2∆z) edges that are z-popular for v. This means
A_v(z) ≥ n/(2∆z)
Lemma 85
There exists an edge e_0 that is z-popular for at least z nodes, with z = Ω(√n/∆).
Σ_e A_e(z) = Σ_v A_v(z) ≥ n²/(2∆z)
A_{e_0}(z) ≥ ⌈n²/(|E| · 2∆z)⌉ ≥ n/(2∆²z)
We choose z such that z = n/(2∆²z), i.e., z = √n/(√2 · ∆).
Deterministic oblivious routing may perform very poorly.
Suppose every source on level 0 has p packets that are routed to random destinations.
Hence,
Pr[packet goes over v] ≤ 2^(d−i)/2^d = 1/2^i
Expected number of packets:
E[packets over v] = p · 2^i · (1/2^i) = p
What is the probability that at least r packets go through v?
Pr[at least r paths through v] ≤ (p·2^i choose r) · (1/2^i)^r
≤ ((p·2^i·e)/r)^r · (1/2^i)^r
= (pe/r)^r
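The chain above can be checked numerically; the parameters p = 4, i = 3, r = 16 below are hypothetical:

```python
from math import comb, e

# Numerical check: the binomial tail Pr[Bin(p*2^i, 2^-i) >= r] is bounded
# by the union bound C(m, r) q^r, which in turn is at most (p e / r)^r.
p, i, r = 4, 3, 16                       # hypothetical parameters
m, q = p * 2 ** i, 2.0 ** -i             # m = 32 trials, success prob 1/8
tail = sum(comb(m, j) * q ** j * (1 - q) ** (m - j) for j in range(r, m + 1))
assert tail <= comb(m, r) * q ** r <= (p * e / r) ** r
```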
Pr[there exists a node v such that at least r paths go through v] ≤ d·2^d · (pe/r)^r
Choosing r appropriately gives
Pr[exists node v with more than r paths over v] ≤ 1/N^ℓ
Scheduling Packets
Definition 86 (Delay Sequence of length s)
ñ delay path W
ñ lengths ℓ_0, ℓ_1, . . . , ℓ_s with ℓ_0 ≥ 1 and ℓ_1, . . . , ℓ_s ≥ 0: the lengths of the delay-free sub-paths
ñ collision nodes v_0, v_1, . . . , v_s, v_{s+1}
ñ collision packets P_0, . . . , P_s
Properties
ñ rank(P_0) ≥ rank(P_1) ≥ · · · ≥ rank(P_s)
ñ Σ_{i=0}^{s} ℓ_i = d
Definition 87 (Formal Delay Sequence)
We say a formal delay sequence is active if rank(Pi ) = ki holds
for all i.
Lemma 88
N_s ≤ (2eC(s + k)/(s + 1))^(s+1)
Hence the probability that the routing takes more than d + s steps is at most
N^3 · (2e·C·(s + k)/((s + 1)·k))^(s+1)
ñ With probability 1 − 1/N^ℓ1 the random routing problem has congestion at most O(p + ℓ1·d).
ñ With probability 1 − 1/N^ℓ2 the packet scheduling finishes in at most O(C + ℓ2·d) steps.
Valiant's Trick
We only used
ñ all routing paths are of the same length d
ñ there are a polynomial number of delay paths
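A small simulation of the trick on the hypercube (illustrative, with hypothetical parameters): bit-fixing routes for the bit-reversal permutation are highly congested, while routing via a random intermediate destination spreads the load:

```python
import random
from collections import Counter

# Bit-fixing paths on the d-dimensional hypercube: fix differing bits from
# dimension 0 upward; return the sequence of directed edges traversed.
def bitfix_edges(u, v, d):
    edges = []
    for i in range(d):
        if (u ^ v) >> i & 1:
            w = u ^ (1 << i)
            edges.append((u, w))
            u = w
    return edges

def congestion(pairs, d):
    load = Counter(e for u, v in pairs for e in bitfix_edges(u, v, d))
    return max(load.values())

d, N = 10, 1024
rev = lambda u: int(format(u, f"0{d}b")[::-1], 2)      # bit reversal
direct = congestion([(u, rev(u)) for u in range(N)], d)

random.seed(0)
mid = [random.randrange(N) for _ in range(N)]          # random intermediates
two_phase = max(congestion(list(zip(range(N), mid)), d),
                congestion(list(zip(mid, [rev(u) for u in range(N)])), d))
assert direct >= 16 and two_phase < direct
```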
For a network G = (V, E, c) we define the characteristic flow problem via
ñ demands d_{u,v} = c(u)·c(v)/c(V)
Definition 89
A (randomized) oblivious routing scheme is given by a path
system P and a weight function w such that
Σ_{p∈P_{s,t}} w(p) = 1 for every source–target pair (s, t).
Construct an oblivious routing scheme from S as follows:
ñ let f_{x,y} be the flow between x and y in S
ñ f_{x,y} ≥ d_{x,y}/C(S) ≥ d_{x,y}/F = (1/F) · c(x)·c(y)/c(V)
ñ for p ∈ P_{x,y} set w(p) = f_p/f_{x,y}
Example: hypercube.
Oblivious Routing for the Mesh
For a multicommodity flow problem P let C_opt(P) be the optimum congestion and D_opt(P) the optimum dilation (possibly attained by different flow solutions).
Lemma 90
There is an oblivious routing scheme for the mesh that obtains a
flow solution S with C(S) = O(Copt (P ) log n) and
D(S) = O(Dopt (P )).
Lemma 91
For any oblivious routing scheme on the mesh there is a demand
P such that routing P will give congestion Ω(log n · Copt ).