
WS 2014/15

Parallel Algorithms

Harald Räcke

Fakultät für Informatik


TU München

http://www14.in.tum.de/lehre/2014WS/pa/

Winter Term 2014/15

Part I

Organizational Matters

Part I

Organizational Matters

ñ Modul: IN2011
ñ Name: “Parallel Algorithms”
“Parallele Algorithmen”
ñ ECTS: 8 Credit points
ñ Lectures:
ñ 4 SWS
Mon 14:00–16:00 (Room 00.08.038)
Fri 8:30–10:00 (Room 00.08.038)
ñ Webpage: http://www14.in.tum.de/lehre/2014WS/pa/
ñ Required knowledge:
ñ IN0001, IN0003
“Introduction to Informatics 1/2”
“Einführung in die Informatik 1/2”
ñ IN0007
“Fundamentals of Algorithms and Data Structures”
“Grundlagen: Algorithmen und Datenstrukturen” (GAD)
ñ IN0011
“Basic Theoretic Informatics”
“Einführung in die Theoretische Informatik” (THEO)
ñ IN0015
“Discrete Structures”
“Diskrete Strukturen” (DS)
ñ IN0018
“Discrete Probability Theory”
“Diskrete Wahrscheinlichkeitstheorie” (DWT)
ñ IN2003
“Efficient Algorithms and Data Structures”
“Effiziente Algorithmen und Datenstrukturen”
The Lecturer

ñ Harald Räcke
ñ Email: [email protected]
ñ Room: 03.09.044
ñ Office hours: (per appointment)

Tutorials

ñ Tutors:
ñ Chris Pinkau
ñ [email protected]
ñ Room: 03.09.037
ñ Office hours: Tue 13:00–14:00
ñ Room: 03.11.018
ñ Time: Tue 14:00–16:00

Assignment sheets

ñ In order to pass the module you need to pass a 3-hour exam
Assessment

ñ Assignment Sheets:
ñ An assignment sheet is usually made available on Monday
on the module webpage.
ñ Solutions have to be handed in in the following week before
the lecture on Monday.
ñ You can hand in your solutions by putting them in the right
folder in front of room 03.09.019A.
ñ Solutions will be discussed in the subsequent tutorial on
Tuesday.

1 Contents

ñ PRAM algorithms
ñ Parallel Models
ñ PRAM Model
ñ Basic PRAM Algorithms
ñ Sorting
ñ Lower Bounds
ñ Networks of Workstations
ñ Offline Permutation Routing on the Mesh
ñ Oblivious Routing in the Butterfly
ñ Greedy Routing
ñ Sorting on the Mesh
ñ ASCEND/DESCEND Programs
ñ Embeddings between Networks

PA 1 Contents
© Harald Räcke 9
2 Literatur

Tom Leighton:
Introduction to Parallel Algorithms and Architecture:
Arrays, Trees, Hypercubes,
Morgan Kaufmann: San Mateo, CA, 1992
Joseph JaJa:
An Introduction to Parallel Algorithms,
Addison-Wesley: Reading, MA, 1997
Jeffrey D. Ullman:
Computational Aspects of VLSI,
Computer Science Press: Rockville, USA, 1984
Selim G. Akl.:
The Design and Analysis of Parallel Algorithms,
Prentice Hall: Englewood Cliffs, NJ, 1989

PA 2 Literatur
© Harald Räcke 10
Part II

Foundations

3 Introduction
Parallel Computing
A parallel computer is a collection of processors usually of the
same type, interconnected to allow coordination and exchange
of data.

The processors are primarily used to jointly solve a given


problem.

Distributed Systems
A set of processors, possibly of many different types, is distributed over a larger geographic area.

Processors do not work on a single problem.

Some processors may act in a malicious way.

PA 3 Introduction
© Harald Räcke 12
Cost measures

How do we evaluate sequential algorithms?

ñ time efficiency
ñ space utilization
ñ energy consumption
ñ programmability
ñ ...

Asymptotic bounds (e.g., for running time) often give a good indication of the algorithm's performance on a wide variety of machines.

PA 3 Introduction
© Harald Räcke 13
Cost measures
How do we evaluate parallel algorithms?

ñ time efficiency
ñ space utilization
ñ energy consumption
ñ programmability
ñ communication requirement
ñ ...

Problems
ñ performance (e.g. runtime) depends on problem size n and
on number of processors p
ñ statements usually only hold for restricted types of parallel
machine as parallel computers may have vastly different
characteristics (in particular w.r.t. communication)
Speedup
Suppose a problem P has sequential complexity T ∗ (n), i.e.,
there is no algorithm that solves P in time o(T ∗ (n)).

Definition 1
The speedup Sp (n) of a parallel algorithm A that requires time
Tp (n) for solving P with p processors is defined as

Sp(n) = T*(n) / Tp(n)

Clearly, Sp (n) ≤ p. Goal: obtain Sp (n) ≈ p.

It is common to replace T ∗ (n) by the time bound of the best


known sequential algorithm for P !

PA 3 Introduction
© Harald Räcke 15
Efficiency
Definition 2
The efficiency of a parallel algorithm A that requires time Tp (n)
when using p processors on a problem of size n is

Ep(n) = T1(n) / (p · Tp(n))

Ep (n) ≈ 1 indicates that the algorithm is running roughly p


times faster with p processors than with one processor.

Note that Ep(n) ≤ T1(n) / (p · T∞(n)). Hence, the efficiency goes down
rapidly if p ≥ T1(n)/T∞(n).

Disadvantage: cost-measure does not relate to the optimum


sequential algorithm.

PA 3 Introduction
© Harald Räcke 16
Parallel Models — Requirements

Simplicity
A model should allow to easily analyze various performance
measures (speed, communication, memory utilization etc.).

Results should be as hardware-independent as possible.

Implementability
Parallel algorithms developed in a model should be easily
implementable on a parallel machine.

Theoretical analysis should carry over and give meaningful


performance estimates.

A real satisfactory model does not exist!

PA 3 Introduction
© Harald Räcke 17
DAG model — computation graph

ñ nodes represent operations (single instructions or larger


blocks)
ñ edges represent dependencies (precedence constraints)
ñ closely related to circuits; however there exist many
different variants
ñ branching instructions cannot be modelled
ñ completely hardware independent
ñ scheduling is not defined

Often used for automatically parallelizing numerical


computations.

PA 3 Introduction
© Harald Räcke 18
Example: Addition

[Figure: two DAGs for adding A1, . . . , A8 — a balanced binary tree of "+" nodes of logarithmic depth, and a degenerate chain of "+" nodes of linear depth.]

Here, vertices without incoming edges correspond to input data.

The graph can be viewed as a data flow graph.

PA 3 Introduction
© Harald Räcke 19
DAG model — computation graph

The DAG itself is not a complete algorithm. A scheduling


implements the algorithm on a parallel machine, by assigning a
time-step tv and a processor pv to every node.

Definition 3
A scheduling of a DAG G = (V , E) on p processors is an
assignment of pairs (tv , pv ) to every internal node v ∈ V , s.t.,
ñ pv ∈ {1, . . . , p}; tv ∈ {1, . . . , T }
ñ tu = tv ⇒ pu ≠ pv
ñ (u, v) ∈ E ⇒ tv ≥ tu + 1
where a non-internal node x (an input node) has tx = 0.
T is the length of the schedule.

PA 3 Introduction
© Harald Räcke 20
DAG model — computation graph
The parallel complexity of a DAG is defined as

Tp(n) = min over all schedules S of T(S)

T1(n): number of internal nodes in the DAG

T∞(n): diameter of the DAG

Clearly,
Tp (n) ≥ T∞ (n)
Tp (n) ≥ T1 (n)/p

Lemma 4
A schedule with length O(T1 (n)/p + T∞ (n)) can be found easily.
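To make Lemma 4 concrete, here is a small sequential sketch (not part of the lecture) of the greedy list-scheduling idea behind it: in every parallel step we run up to p currently ready DAG nodes. The function name and the adjacency-list input format are my own assumptions.

from collections import deque

def greedy_schedule(succ, indeg, p):
    """Greedy schedule of a DAG on p processors (sketch for Lemma 4).

    succ[v]  : list of successors of node v
    indeg[v] : number of predecessors of v
    Returns a list of rounds, each round a list of at most p nodes.
    """
    ready = deque(v for v in indeg if indeg[v] == 0)
    remaining = dict(indeg)
    rounds = []
    while ready:
        # execute up to p ready nodes in this parallel step
        step = [ready.popleft() for _ in range(min(p, len(ready)))]
        rounds.append(step)
        for v in step:
            for w in succ[v]:
                remaining[w] -= 1
                if remaining[w] == 0:
                    ready.append(w)   # becomes ready for a later round
    return rounds

Every round either runs p nodes (there are at most T1(n)/p such rounds) or exhausts all current sources (which shortens the longest remaining path), so the schedule length is O(T1(n)/p + T∞(n)).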

Lemma 5
Finding an optimal schedule is in general NP-complete.
Note that the DAG model as defined is a non-uniform model of
computation.

In principle, there could be a different DAG for every input size


n.

An algorithm (e.g. for a RAM) must work for every input size and
must be of finite description length.

Hence, specifying a different DAG for every n has more


expressive power.

Also, this is not really a complete model, as the operations


allowed in a DAG node are not clearly defined.

PA 3 Introduction
© Harald Räcke 22
PRAM Model

P1 P2 P3 P4 P5 P6 P7 P8

Global Shared Memory

All processors are synchronized.

In every round a processor can:


ñ read a register from global memory into local memory
ñ do a local computation à la RAM
ñ write a local register into global memory

PA 3 Introduction
© Harald Räcke 23
PRAM Model

Every processor executes the same program.

However, the program has access to two special variables:


ñ p: total number of processors
ñ id ∈ {1, . . . , p}: the id of the current processor

The following (stupid) program copies the content of the global


register x[1] to registers x[2] . . . x[p].

Algorithm 1 copy
1: if id = 1 then round ← 1
2: while round ≤ p and id = round do
3: x[id + 1] ← x[id]
4: round ← round + 1

PA 3 Introduction
© Harald Räcke 24
PRAM Model
ñ processors can effectively execute different code because of
branching according to id
ñ however, not arbitrarily; still uniform model of computation

Often it is easier to explicitly define which parts of a program


are executed in parallel:

Algorithm 2 sum
1: // computes sum of x[1] . . . x[p]
2: // red part is executed only by processor 1
3: r ← 1
4: while 2^r ≤ p do
5: for id mod 2^r = 1 pardo
6: // only executed by processors whose id matches
7: x[id] ← x[id] + x[id + 2^(r−1)]
8: r ← r + 1
9: return x[1]
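For readers who want to experiment, the following is a plain sequential simulation of Algorithm 2 (not from the lecture); it replays the pardo steps in a loop and assumes p is a power of two.

def pram_sum(x):
    """Sequential simulation of the EREW PRAM summation (Algorithm 2)."""
    a = [None] + list(x)          # a[1..p], 1-indexed as on the slides
    p = len(x)
    r = 1
    while 2 ** r <= p:
        # all processors with id mod 2^r == 1 act "in parallel"
        for pid in range(1, p + 1):
            if pid % (2 ** r) == 1:
                a[pid] = a[pid] + a[pid + 2 ** (r - 1)]
        r += 1
    return a[1]

# pram_sum([3, 1, 4, 1, 5, 9, 2, 6]) == 31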
Different Types of PRAMs
Simultaneous Access to Shared Memory:
ñ EREW PRAM:
simultaneous access is not allowed
ñ CREW PRAM:
concurrent read accesses to the same location are allowed;
write accesses have to be exclusive
ñ CRCW PRAM:
concurrent read and write accesses allowed
ñ common CRCW PRAM
all processors writing to x[i] must write same value
ñ arbitrary CRCW PRAM
values may be different; an arbitrary processor succeeds
ñ priority CRCW PRAM
values may be different; processor with smallest id succeeds

PA 3 Introduction
© Harald Räcke 26
Algorithm 3 sum
1: // computes sum of x[1] . . . x[p]
2: r ← 1
3: while 2^r ≤ p do
4: for id mod 2^r = 1 pardo
5: x[id] ← x[id] + x[id + 2^(r−1)]
6: r ← r + 1
7: return x[1]

The above is an EREW PRAM algorithm.

On a CREW PRAM we could replace Line 4 by


for 1 ≤ id ≤ p pardo

PA 3 Introduction
© Harald Räcke 27
PRAM Model — remarks
ñ similar to a RAM we either need to restrict the size of values
that can be stored in registers, or we need to have a
non-uniform cost model for doing a register manipulation
(cost for manipulating x[i] is proportional to the bit-length
of the largest number that is ever being stored in x[i])
ñ in this lecture: uniform cost model but we are not
exploiting the model
ñ global shared memory is very unrealistic in practice as
uniform access to all memory locations does not exist
ñ global synchronization is very unrealistic; in real parallel
machines a global synchronization is very costly
ñ model is good for understanding basic parallel
mechanisms/techniques but not for algorithm development
ñ model is good for lower bounds

PA 3 Introduction
© Harald Räcke 28
Network of Workstations — NOWs

ñ interconnection network represented by a graph G = (V , E)


ñ each v ∈ V represents a processor
ñ an edge {u, v} ∈ E represents a two-way communication
link between processors u and v
ñ network is asynchronous
ñ all coordination/communication has to be done by explicit
message passing

PA 3 Introduction
© Harald Räcke 29
Typical Topologies

[Figure: a 3-dimensional hypercube with nodes labelled 000–111, a cycle/ring on 8 nodes, a linear array on nodes 1–8, and a 4×4 mesh/grid with nodes (1,1)–(4,4).]

PA 3 Introduction
© Harald Räcke 30
Network of Workstations — NOWs
Computing the sum on a d-dimensional hypercube. Note that
x[0] . . . x[2d − 1] are stored at the individual nodes.

Processors are numbered consecutively starting from 0

Algorithm 4 sum
1: // computes sum of x[0] . . . x[2d − 1]
2: r ← 1
3: while 2r ≤ 2d do // p = 2d
4: if id mod 2r = 0 then
5: temp ← receive(id + 2r −1 )
6: x[id] = x[id] + temp
7: if id mod 2r = 2r −1 then
8: send(x[id], id − 2r −1 )
9: r ←r +1
10: if id = 0 then return x[id]
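A hedged sequential sketch of Algorithm 4 (my own helper; the send/receive pairs are simulated by directly reading the partner's cell, and n is assumed to be a power of two):

def hypercube_sum(x):
    """Sequential simulation of Algorithm 4 on a d-dimensional hypercube."""
    vals = list(x)
    n = len(vals)                     # n = 2^d
    r = 1
    while 2 ** r <= n:
        half = 2 ** (r - 1)
        for node in range(n):
            if node % (2 ** r) == 0:
                # "receive" from the partner that differs in bit r-1
                vals[node] += vals[node + half]
        r += 1
    return vals[0]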
Network of Workstations — NOWs

Remarks
ñ One has to ensure that at any point in time there is at most
one active communication along a link
ñ There also exist synchronized versions of the model, where
in every round each link can be used once for
communication
ñ In particular the asynchronous model is quite realistic
ñ Difficult to develop and analyze algorithms as a lot of low
level communication has to be dealt with
ñ Results only hold for one specific topology and cannot be
generalized easily

PA 3 Introduction
© Harald Räcke 32
Performance of PRAM algorithms

Suppose that we can solve an instance of a problem with size n


with P (n) processors and time T (n).

We call C(n) = T (n) · P (n) the time-processor product or the


cost of the algorithm.

The following statements are equivalent


ñ P (n) processors and time O(T (n))
ñ O(C(n)) cost and time O(T (n))
ñ O(C(n)/p) time for any number p ≤ P (n) processors
ñ O(C(n)/p + T (n)) for any number p of processors

PA 3 Introduction
© Harald Räcke 33
Performance of PRAM algorithms
Suppose we have a PRAM algorithm that takes time T (n) and
work W (n), where work is the total number of operations.

We can nearly always obtain a PRAM algorithm that uses at most

⌊W(n)/p⌋ + T(n)

parallel steps on p processors.

Idea:
ñ Wi(n) denotes the number of operations in parallel step i, 1 ≤ i ≤ T(n)
ñ simulate each step in ⌈Wi(n)/p⌉ parallel steps
ñ then we have

Σ_i ⌈Wi(n)/p⌉ ≤ Σ_i (⌊Wi(n)/p⌋ + 1) ≤ ⌊W(n)/p⌋ + T(n)
Performance of PRAM algorithms

Why nearly always?

We need to assign processors to operations.


ñ every processor pi needs to know whether it should be
active
ñ in case it is active it needs to know which operations to
perform

design algorithms for an arbitrary number of processors;


keep total time and work low

PA 3 Introduction
© Harald Räcke 35
Optimal PRAM algorithms

Suppose the optimal sequential running time for a problem is


T ∗ (n).

We call a PRAM algorithm for the same problem work optimal if


its work W (n) fulfills

W (n) = Θ(T ∗ (n))

If such an algorithm has running time T(n) it has speedup

Sp(n) = Ω( T*(n) / (T*(n)/p + T(n)) ) = Ω( p·T*(n) / (T*(n) + p·T(n)) ) = Ω(p)

for p = O(T*(n)/T(n)).

PA 3 Introduction
© Harald Räcke 36
This means by improving the time T (n), (while using same
work) we improve the range of p, for which we obtain optimal
speedup.

We call an algorithm worktime (WT) optimal if T (n) cannot be


asymptotically improved by any work optimal algorithm.

PA 3 Introduction
© Harald Räcke 37
Example

The algorithm for computing the sum has work W(n) = O(n), which is optimal.

It has T(n) = O(log n). Hence, we achieve an optimal speedup for p = O(n/ log n).

One can show that any CREW PRAM requires Ω(log n) time to
compute the sum.

PA 3 Introduction
© Harald Räcke 38
Communication Cost

When we differentiate between local and global memory we can


analyze communication cost.

We define the communication cost of a PRAM algorithm as the


worst-case traffic between the local memory of a processor and
the global shared memory.

Important criterion as communication is usually a major


bottleneck.

PA 3 Introduction
© Harald Räcke 39
Communication Cost

Algorithm 5 MatrixMult(A, B, n)
1: Input: n × n matrices A and B; n = 2^k
2: Output: C = AB
3: for 1 ≤ i, j, ℓ ≤ n pardo
4: X[i, j, ℓ] ← A[i, ℓ] · B[ℓ, j]
5: for r ← 1 to log n
6: for 1 ≤ i, j ≤ n; ℓ mod 2^r = 1 pardo
7: X[i, j, ℓ] ← X[i, j, ℓ] + X[i, j, ℓ + 2^(r−1)]
8: for 1 ≤ i, j ≤ n pardo
9: C[i, j] ← X[i, j, 1]

On n^3 processors this algorithm runs in time O(log n).

It uses n^3 multiplications and O(n^3) additions.
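The following Python sketch mirrors Algorithm 5 sequentially (my own helper, 0-indexed, n a power of two); it only illustrates the pairwise summation over the ℓ-dimension, not an actual PRAM implementation.

import math

def pram_matrix_mult(A, B):
    """Sequential simulation of Algorithm 5 (n assumed a power of two)."""
    n = len(A)
    # X[i][j][l] holds A[i][l] * B[l][j]
    X = [[[A[i][l] * B[l][j] for l in range(n)] for j in range(n)]
         for i in range(n)]
    for r in range(1, int(math.log2(n)) + 1):
        step = 2 ** r
        for i in range(n):
            for j in range(n):
                for l in range(0, n, step):     # l mod 2^r == 0 (0-indexed)
                    X[i][j][l] += X[i][j][l + step // 2]
    return [[X[i][j][0] for j in range(n)] for i in range(n)]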

PA 3 Introduction
© Harald Räcke 40
What happens if we have n processors?

Phase 1
pi computes X[i, j, ℓ] = A[i, ℓ] · B[ℓ, j] for all 1 ≤ j, ℓ ≤ n
n^2 time; n^2 communication for every processor

Phase 2 (round r)
pi updates X[i, j, ℓ] for all 1 ≤ j ≤ n; ℓ mod 2^r = 1
O(n · n/2^r) time; no communication

Phase 3
pi writes the i-th row into the C[i, j]'s.
n time; n communication

PA 3 Introduction
© Harald Räcke 41
Alternative Algorithm
Split the matrices into blocks of size n^(2/3) × n^(2/3).

[Figure: A, B and C each partitioned into a 4×4 grid of blocks A_{i,j}, B_{i,j}, C_{i,j} with A · B = C.]

Note that C_{i,j} = Σ_ℓ A_{i,ℓ} B_{ℓ,j}.

Now we have the same problem as before but with n' = n^(1/3), and a
single multiplication costs time O((n^(2/3))^3) = O(n^2). An addition
costs n^(4/3).

work for multiplications: O(n^2 · (n')^3) = O(n^3)
work for additions: O(n^(4/3) · (n')^3) = O(n^3)
time: O(n^2) + log n' · O(n^(4/3)) = O(n^2)
Alternative Algorithm

The communication cost is only O(n^(4/3) log n') as a processor in
the original problem touches at most log n entries of the matrix.

Each entry has size O(n^(4/3)).

The algorithm exhibits less parallelism but still has optimum
work/runtime for just n processors.

much, much better in practice

PA 3 Introduction
© Harald Räcke 43
Part III

PRAM Algorithms

Prefix Sum
input: x[1] . . . x[n]
output: s[1] . . . s[n] with s[i] = x[1] ∗ x[2] ∗ · · · ∗ x[i] (w.r.t. operator ∗)

Algorithm 6 PrefixSum(n, x[1] . . . x[n])
1: // compute prefix sums; n = 2^k
2: if n = 1 then s[1] ← x[1]; return
3: for 1 ≤ i ≤ n/2 pardo
4: a[i] ← x[2i − 1] ∗ x[2i]
5: z[1], . . . , z[n/2] ← PrefixSum(n/2, a[1] . . . a[n/2])
6: for 1 ≤ i ≤ n pardo
7: i even: s[i] ← z[i/2]
8: i = 1: s[1] ← x[1]
9: i odd: s[i] ← z[(i − 1)/2] ∗ x[i]
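A sequential sketch of Algorithm 6 (my own helper, 0-indexed, operator defaulting to +, n assumed a power of two):

def prefix_sum(x, op=lambda a, b: a + b):
    """Sequential simulation of the recursive prefix-sum algorithm."""
    n = len(x)
    if n == 1:
        return [x[0]]
    a = [op(x[2 * i], x[2 * i + 1]) for i in range(n // 2)]   # pair up
    z = prefix_sum(a, op)                                     # recurse
    s = [None] * n
    for i in range(n):
        if i == 0:
            s[0] = x[0]
        elif i % 2 == 1:              # even positions in 1-indexed terms
            s[i] = z[i // 2]
        else:                         # odd positions > 1 in 1-indexed terms
            s[i] = op(z[i // 2 - 1], x[i])
    return s

# prefix_sum([1, 2, 3, 4]) == [1, 3, 6, 10]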

PA 4.1 Prefix Sum


© Harald Räcke 45
Prefix Sum

[Figure: the recursion tree of the prefix-sum computation on 16 inputs — the x-values are pairwise combined into a-values and recursively reduced to a single value, and the s-values (prefix sums) are filled in on the way back down; the vertical axis shows the time steps.]
Prefix Sum

The algorithm uses work O(n) and time O(log n) for solving
Prefix Sum on an EREW-PRAM with n processors.

It is clearly work-optimal.

Theorem 6
On a CREW PRAM a Prefix Sum requires running time Ω(log n)
regardless of the number of processors.

PA 4.1 Prefix Sum


© Harald Räcke 47
Parallel Prefix

Input: a linked list given by successor pointers; a value x[i] for


every list element; an operator ∗;

Output: for every list position ` the sum (w.r.t. ∗) of elements


after ` in the list (including `)

Example: the list 4 → 3 → 7 → 8 → 2 → 1 → 6 → 5 with values x[4], x[3], x[7], x[8], x[2], x[1], x[6], x[5]
and successor pointers S[4]=3, S[3]=7, S[7]=8, S[8]=2, S[2]=1, S[1]=6, S[6]=5, S[5]=5.

PA 4.2 Parallel Prefix


© Harald Räcke 48
Parallel Prefix

Algorithm 7 ParallelPrefix
1: for 1 ≤ i ≤ n pardo
2: P [i] ← S[i]
3: while S[i] ≠ S[S[i]] do
4: x[i] ← x[i] ∗ x[S[i]]
5: S[i] ← S[S[i]]
6: if P [i] ≠ i then x[i] ← x[i] ∗ x[S(i)]

The algorithm runs in time O(log n).

It has work requirement O(n log n). non-optimal

This technique is also known as pointer jumping
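A sequential simulation of pointer jumping (my own helper; the synchronous PRAM rounds are emulated by copying the arrays before updating):

def pointer_jump_prefix(S, x, op=lambda a, b: a + b):
    """Parallel-prefix on a linked list via pointer jumping (sketch).

    S[i] is the successor of i (the last element points to itself),
    x[i] its value.  Returns the suffix sums along the list.
    """
    S, x = list(S), list(x)
    n = len(x)
    while any(S[i] != S[S[i]] for i in range(n)):
        new_x, new_S = x[:], S[:]
        for i in range(n):                    # one synchronous round
            if S[i] != S[S[i]]:
                new_x[i] = op(x[i], x[S[i]])
                new_S[i] = S[S[i]]
        x, S = new_x, new_S
    # finally add the value of the last list element for everyone else
    return [op(x[i], x[S[i]]) if S[i] != i else x[i] for i in range(n)]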

PA 4.2 Parallel Prefix


© Harald Räcke 49
4.3 Divide & Conquer — Merging

Given two sorted sequences A = (a1 , . . . , an ) and
B = (b1 , . . . , bn ), compute the sorted sequence C = (c1 , . . . , cn ).

Definition 7
Let X = (x1 , . . . , xt ) be a sequence. The rank rank(y : X) of y in
X is
rank(y : X) = |{x ∈ X | x ≤ y}|

For a sequence Y = (y1 , . . . , ys ) we define


rank(Y : X) := (r1 , . . . , rs ) with ri = rank(yi : X).

PA 4.3 Divide & Conquer — Merging


© Harald Räcke 50
4.3 Divide & Conquer — Merging

Given two sorted sequences A = (a1 . . . an ) and B = (b1 . . . bn ),
compute the sorted sequence C = (c1 . . . cn ).

Observation:
We can assume wlog. that elements in A and B are different.

Then for ci ∈ C we have i = rank(ci : A ∪ B).

This means we just need to determine rank(x : A ∪ B) for all


elements!

Observe, that rank(x : A ∪ B) = rank(x : A) + rank(x : B).

Clearly, for x ∈ A we already know rank(x : A), and for x ∈ B we


know rank(x : B).

PA 4.3 Divide & Conquer — Merging


© Harald Räcke 51
4.3 Divide & Conquer — Merging

Compute rank(x : A) for all x ∈ B and rank(x : B) for all x ∈ A.


can be done in O(log n) time with 2n processors by binary
search

Lemma 8
On a CREW PRAM, Merging can be done in O(log n) time and
O(n log n) work.

not optimal
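As an illustration of this rank-based merging (Lemma 8), here is a small sequential sketch (my own helper, elements assumed distinct); on a CREW PRAM the two loops run fully in parallel, one binary search per element.

from bisect import bisect_right

def merge_by_ranking(A, B):
    """Merge two sorted sequences by cross ranking (sketch of Lemma 8)."""
    n, m = len(A), len(B)
    C = [None] * (n + m)
    for i, a in enumerate(A):                  # rank(a:A) = i + 1
        C[i + bisect_right(B, a)] = a          # + rank(a:B) via binary search
    for j, b in enumerate(B):
        C[j + bisect_right(A, b)] = b
    return C

# merge_by_ranking([1, 4, 7], [2, 3, 9]) == [1, 2, 3, 4, 7, 9]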

PA 4.3 Divide & Conquer — Merging


© Harald Räcke 52
4.3 Divide & Conquer — Merging
A = (a1 , . . . , an ); B = (b1 , . . . , bn );
log n integral; k := n/ log n integral;

Algorithm 8 GenerateSubproblems
1: j0 ← 0
2: jk ← n
3: for 1 ≤ i ≤ k − 1 pardo
4: ji ← rank(b_{i·log n} : A)
5: for 0 ≤ i ≤ k − 1 pardo
6: Bi ← (b_{i·log n + 1} , . . . , b_{(i+1)·log n} )
7: Ai ← (a_{ji + 1} , . . . , a_{j_{i+1}} )

If Ci is the merging of Ai and Bi then the sequence C0 . . . Ck−1 is
a sorted sequence.

PA 4.3 Divide & Conquer — Merging


© Harald Räcke 53
4.3 Divide & Conquer — Merging
We can generate the subproblems in time O(log n) and work
O(n).

Note that in a sub-problem Bi has length log n.

If we run the algorithm again for every subproblem, (where Ai


takes the role of B) we can in time O(log log n) and work O(n)
generate subproblems where Aj and Bj have both length at
most log n.

Such a subproblem can be solved by a single processor in time


O(log n) and work O(|Ai | + |Bi |).

Parallelizing the last step gives total work O(n) and time
O(log n).

the resulting algorithm is work optimal

PA 4.3 Divide & Conquer — Merging


© Harald Räcke 54
4.4 Maximum Computation

Lemma 9
On a CRCW PRAM the maximum of n numbers can be computed
in time O(1) with n2 processors.

proof on board...

PA 4.4 Maximum Computation


© Harald Räcke 55
4.4 Maximum Computation

Lemma 10
On a CRCW PRAM the maximum of n numbers can be computed
in time O(log log n) with n processors and work O(n log log n).

proof on board...

PA 4.4 Maximum Computation


© Harald Räcke 56
4.4 Maximum Computation

Lemma 11
On a CRCW PRAM the maximum of n numbers can be computed
in time O(log log n) with n processors and work O(n).

proof on board...

PA 4.4 Maximum Computation


© Harald Räcke 57
4.5 Inserting into a (2, 3)-tree

Given a (2, 3)-tree with n elements, and a sequence
x0 < x1 < x2 < · · · < xk of elements. We want to insert
elements x1 , . . . , xk into the tree (k ≪ n).
time: O(log n); work: O(k log n)

[Figure: a (2, 3)-tree with keys a1, a4 in the root, keys a0 | a2 a3 | a5 a6 on the next level, and leaves a0 a1 a2 a3 a4 a5 a6 ∞.]

PA 4.5 Inserting into a (2, 3)-tree


© Harald Räcke 58
4.5 Inserting into a (2, 3)-tree
1. determine for every xi the leaf element before which it has
to be inserted
time: O(log n); work: O(k log n); CREW PRAM
all xi ’s that have to be inserted before the same element
form a chain
2. determine the largest/smallest/middle element of every
chain
time: O(log k); work: O(k);
3. insert the middle element of every chain
compute new chains
time: O(log n); work: O(ki log n + k); ki = #inserted
elements
(computing new chains is constant time)
4. repeat Step 3 for logarithmically many rounds
time: O(log n log k); work: O(k log n);
PA 4.5 Inserting into a (2, 3)-tree
© Harald Räcke 59
Step 3

[Figure: the (2, 3)-tree after inserting the middle elements x3, x5, x9 of the chains — the affected internal nodes are split and the new elements are placed before their successor leaves.]

ñ each internal node is split into at most two parts


ñ each split operation promotes at most one element
ñ hence, on every level we want to insert at most one element
per successor pointer
ñ we can use the same routine for every level

PA 4.5 Inserting into a (2, 3)-tree


© Harald Räcke 60
4.5 Inserting into a (2, 3)-tree

ñ Step 3, works in phases; one phase for every level of the tree
ñ Step 4, works in rounds; in each round a different set of
elements is inserted

Observation
We can start with phase i of round r as long as phase i of round
r − 1 and (of course), phase i − 1 of round r has finished.

This is called Pipelining. Using this technique we can perform all


rounds in Step 4 in just O(log k + log n) many parallel steps.

PA 4.5 Inserting into a (2, 3)-tree


© Harald Räcke 61
4.6 Symmetry Breaking

The following algorithm colors an n-node cycle with dlog ne


colors.

Algorithm 9 BasicColoring
1: for 1 ≤ i ≤ n pardo
2: col(i) ← i
3: ki ← smallest bit position where col(i) and col(S(i)) differ
4: col′(i) ← 2·ki + (ki-th bit of col(i))

(bit positions are numbered starting with 0)
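A sequential sketch of one round of Algorithm 9 (my own helper; colors are plain integers and k is extracted with bit operations):

def basic_coloring_step(col, S):
    """One round of deterministic coin tossing on a ring (sketch).

    col[i] is the current color of node i, S[i] its successor; adjacent
    nodes are assumed to have different colors.
    """
    new_col = []
    for i in range(len(col)):
        diff = col[i] ^ col[S[i]]
        k = (diff & -diff).bit_length() - 1   # lowest bit where they differ
        new_col.append(2 * k + ((col[i] >> k) & 1))
    return new_col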

PA 4.6 Symmetry Breaking


© Harald Räcke 62
4.6 Symmetry Breaking

[Figure: a ring on the 15 nodes 1–15; the table lists the result of one application of BasicColoring.]

 v    col    k   col′
 1    0001   1   2
 3    0011   2   4
 7    0111   0   1
14    1110   2   5
 2    0010   0   0
15    1111   0   1
 4    0100   0   0
 5    0101   0   1
 6    0110   1   3
 8    1000   1   2
10    1010   0   0
11    1011   0   1
12    1100   0   0
 9    1001   2   4
13    1101   2   5
4.6 Symmetry Breaking

Applying the algorithm to a coloring with bit-length t generates
a coloring with largest color at most

2(t − 1) + 1

and bit-length at most

⌈log2(2(t − 1) + 1)⌉ ≤ ⌈log2(2t)⌉ = ⌈log2(t)⌉ + 1

Applying the algorithm repeatedly generates a constant number
of colors after O(log* n) operations.

Note that the first inequality holds because 2(t − 1) + 1 is odd.

PA 4.6 Symmetry Breaking


© Harald Räcke 64
4.6 Symmetry Breaking
As long as the bit-length t ≥ 4 the bit-length decreases.

Applying the algorithm with bit-length 3 gives a coloring with


colors in the range 0, . . . , 5 = 2t − 1.

We can improve to a 3-coloring by successively re-coloring nodes


from a color-class:

Algorithm 10 ReColor
1: for ` ← 5 to 3
2: for 1 ≤ i ≤ n pardo
3: if col(i) = ` then
4: col(i) ← min{{0, 1, 2} \ {col(P [i]), col(S[i])}}

This requires time O(1) and work O(n).

PA 4.6 Symmetry Breaking


© Harald Räcke 65
4.6 Symmetry Breaking

Lemma 12
We can color vertices in a ring with three colors in O(log∗ n)
time and with O(n log∗ n) work.

not work optimal

PA 4.6 Symmetry Breaking


© Harald Räcke 66
4.6 Symmetry Breaking

Lemma 13
Given n integers in the range 0, . . . , O(log n), there is an
algorithm that sorts these numbers in O(log n) time using a
linear number of operations.

Proof: Exercise!

PA 4.6 Symmetry Breaking


© Harald Räcke 67
4.6 Symmetry Breaking

Algorithm 11 OptColor
1: for 1 ≤ i ≤ n pardo
2: col(i) ← i
3: apply BasicColoring once
4: sort vertices by colors
5: for ` = 2⌈log n⌉ to 3 do
6: for all vertices i of color ` pardo
7: col(i) ← min{{0, 1, 2} \ {col(P[i]), col(S[i])}}

We can perform Lines 6 and 7 in time O(n_`) only because we sorted before. In general a
statement like "for constraint pardo" should only contain a constraint on the ids of the
processors but not something complicated (like the color) which has to be checked and, hence,
induces work. Because of the sorting we can transform this complicated constraint into a
constraint on just the processor ids.

PA 4.6 Symmetry Breaking


© Harald Räcke 68
Lemma 14
A ring can be colored with 3 colors in time O(log n) and with
work O(n).

work optimal but not too fast

PA 4.6 Symmetry Breaking


© Harald Räcke 69
List Ranking

Input:
A list given by successor pointers;

4 5 7 3 1 2 6 8 9

Output:
For every node number of hops to end of the list;
4 5 7 3 1 2 6 8 9
8 7 6 5 4 3 2 1 0

Observation:
Special case of parallel prefix

PA 5 List Ranking
© Harald Räcke 70
List Ranking

4 5 7 3 1 2 6 8 9
1 3 1 1 1 2 2 1 0

1. Given a list with values; perhaps from previous


iterations.
The list is given via predecessor pointers P (i) and
successor pointers S(i).
S(4) = 5, S(2) = 6, P (3) = 7, etc.

PA 5 List Ranking
© Harald Räcke 71
List Ranking

4 5 7 3 1 2 6 8 9
1 3 1 1 1 2 2 1 0

2. Find an independent set; time: O(log n); work: O(n).

The independent set should contain a constant fraction


of the vertices.

Color vertices; take local minima

PA 5 List Ranking
© Harald Räcke 71
List Ranking

4 5 7 3 1 2 6 8 9
4 3 1 2 1 4 2 1 0

3. Splice the independent set out of the list;

At the independent set vertices the array still contains


old values for P (i) and S(i);

PA 5 List Ranking
© Harald Räcke 71
List Ranking

4 5 7 3 1 2 6 8 9
4 3 1 2 1 4 2 1 0

3 4 2 1 5 6
4 1 2 4 1 0

4. Compress remaining n0 nodes into a new array of n0


entries.
The index positions can be computed by a prefix sum
in time O(log n) and work O(n)
Pointers can then be adjusted in time O(1).

PA 5 List Ranking
© Harald Räcke 71
List Ranking

4 5 7 3 1 2 6 8 9
4 3 1 2 1 4 2 1 0

3 4 2 1 5 6
12 8 7 5 1 0

5. Solve the problem on the remaining list.


If current size is less than n/ log n do pointer jumping:
time O(log n); work O(n).
Otherwise continue shrinking the list by finding an
independent set

PA 5 List Ranking
© Harald Räcke 71
List Ranking

4 5 7 3 1 2 6 8 9
12 3 8 7 1 5 2 1 0

3 4 2 1 5 6
12 8 7 5 1 0

6. Map the values back into the larger list. Time: O(1);
Work: O(n)

PA 5 List Ranking
© Harald Räcke 71
List Ranking

4 5 7 3 1 2 6 8 9
12 11 8 7 6 5 3 1 0

3 4 2 1 5 6
12 8 7 5 1 0

7. Compute values for independent set nodes. Time:


O(1); Work: O(1).
8. Splice nodes back into list. Time: O(1); Work: O(1).

PA 5 List Ranking
© Harald Räcke 71
We need O(log log n) shrinking iterations until the size of the
remaining list reaches O(n/ log n).

Each shrinking iteration takes time O(log n).

The work for all shrinking operations is just O(n), as the size of
the list goes down by a constant factor in each round.

List Ranking can be solved in time O(log n log log n) and work
O(n) on an EREW-PRAM.

PA 5 List Ranking
© Harald Räcke 72
Optimal List Ranking

In order to reduce the work we have to improve the shrinking of


the list to O(n/ log n) nodes.

After this we apply pointer jumping

PA 5 List Ranking
© Harald Räcke 73
[Figure: a list of 24 nodes divided into consecutive blocks B1, . . . , B6 of length log n; processors p1, . . . , p5 point to the nodes of their blocks that they currently work on.]

ñ some nodes are active;
ñ active nodes without neighbouring active nodes are
isolated;
ñ the others form sublists;

1 delete isolated nodes from the list;
2 color each sublist with O(log log n) colors; time: O(1);
work: O(n);
Optimal List Ranking

Each iteration requires constant time and work O(n/ log n),
because we just work on one node in every block.

We need to prove that we just require O(log n) iterations to


reduce the size of the list to O(n/ log n).

PA 5 List Ranking
© Harald Räcke 75
Observations/Remarks:
ñ If the p-pointer of a block cannot be advanced without
leaving the block, the processor responsible for this block
simply stops working; all other blocks continue.
ñ The p-node of a block (the node pi is pointing to) at the
beginning of a round is either a ruler with a living subject or
the node will become active during the round.
ñ The subject nodes always lie to the left of the p-node of the
respective block (if it exists).
Measure of Progress:
ñ a ruler will delete a subject
ñ an active node either
ñ becomes a ruler (with a subject)
ñ becomes a subject
ñ is isolated and therefore gets deleted

PA 5 List Ranking
© Harald Räcke 76
Analysis

For the analysis we assign a weight to every node in every block


as follows.

Definition 15
The weight of the i-th node in a block is

(1 − q)i

1
with q = log log n , where the node-numbering starts from 0.
Hence, a block has nodes {0, . . . , log n − 1}.

PA 5 List Ranking
© Harald Räcke 77
Definition of Rulers

Properties:
ñ A ruler should have at most log log n subjects.
ñ The weight of a ruler should be at most the weight of any of
its subjects.
ñ Each ruler must have at least one subject.
ñ We must be able to remove the next subject in constant
time.
ñ We need to make the ruler/subject decision in constant
time.

PA 5 List Ranking
© Harald Räcke 78
Given a sublist of active nodes.

Color the sublist with O(log log n) colors. Take the local minima
w.r.t. this coloring.

If the first node is not a ruler


ñ if the second node is a ruler switch ruler status between
first and second
ñ otw. just make the first node into a ruler

This partitions the sub-list into chains of length at most


log log n each starting with a ruler

PA 5 List Ranking
© Harald Räcke 79
Now we change the ruler definition.

Consider some chain.

We make all local minima w.r.t. the weight function into a ruler;
ties are broken according to block-id (so that comparing weights
gives a strict inequality).

A ruler gets as subjects the nodes left of it until the next local
maximum (or the start of the chain) (including the local
maximum) and the nodes right of it until the next local
maximum (or the end of the chain) (excluding the local
maximum).

In case the first node is a ruler the above definition could leave it
without a subject. We use constant time to fix this in some
arbitrary manner

PA 5 List Ranking
© Harald Räcke 80
Set q = 1/log log n.

The i-th node in a block is assigned a weight of (1 − q)^i, 0 ≤ i < log n.

The total weight of a block is at most 1/q and the total weight of
all items is at most n/(q log n).

to show:
After O(log n) iterations the weight is at most
(n/ log n) · (1 − q)^(log n)

This means at most n/ log n nodes remain because the smallest
weight a node can have is (1 − q)^(log n − 1).

PA 5 List Ranking
© Harald Räcke 81
In every iteration the weight drops by a factor of

(1 − q/4) .

PA 5 List Ranking
© Harald Räcke 82
We consider subject nodes to just have half their weight.

We can view the step of becoming a subject as a precursor to


deletion.

Hence, a node loses half its weight when becoming a subject
and the remaining half when deleted.

Note that subject nodes will be deleted after just an additional


O(log log n) iterations.

PA 5 List Ranking
© Harald Räcke 83
The weight is reduced because
ñ An isolated node is removed.
ñ A node is labelled as ruler, and the corresponding subjects
reduce their weight by a factor of 1/2.
ñ A node is a ruler and deletes one of its subjects.

Hence, the weight reduction comes from p-nodes (ruler/active).

PA 5 List Ranking
© Harald Räcke 84
Each p-node is responsible for some other nodes; it has to
generate a weight reduction large enough so that the weight of
all nodes it is responsible for decreases by the desired factor.

An active node is responsible for all nodes that come after it in


its block.

A ruler is responsible for all nodes that come after it in its block
and for all its subjects.

Note that by this definition every node remaining in the list is


covered.

PA 5 List Ranking
© Harald Räcke 85
Case 1: Isolated Node
Suppose we delete an isolated node v that is the i-th node in its
block.

The weight of all nodes that v is responsible for is

Σ_{i ≤ j < log n} (1 − q)^j

This weight reduces to

Σ_{i < j < log n} (1 − q)^j ≤ (1 − q) · Σ_{i ≤ j < log n} (1 − q)^j

Hence, the weight reduces by a factor (1 − q) ≤ (1 − q/4).

PA 5 List Ranking
© Harald Räcke 86
Case 2: Creating Subjects
Suppose we generate a ruler with at least one subject.

Weight of ruler: (1 − q)^{i_1}.
Weight of subjects: (1 − q)^{i_j}, 2 ≤ j ≤ k.

Initial weight:

Q = Σ_{j=1}^{k} Σ_{i_j ≤ ℓ < log n} (1 − q)^ℓ ≤ (1/q) · Σ_{j=1}^{k} (1 − q)^{i_j} ≤ (2/q) · Σ_{j=2}^{k} (1 − q)^{i_j}

New weight:

Q′ = Q − (1/2) · Σ_{j=2}^{k} (1 − q)^{i_j} ≤ (1 − q/4) · Q
Case 3: Removing Subjects
weight of ruler: (1 − q)^{i_1}; weight of subjects: (1 − q)^{i_j}, 2 ≤ j ≤ k

Assume the ruler removes the subject with largest weight, say i2 (why?).

Initial weight:

Q = Σ_{i_1 ≤ ℓ < log n} (1 − q)^ℓ + (1/2) · Σ_{j=2}^{k} (1 − q)^{i_j}
  ≤ (1/q) · (1 − q)^{i_1} + (k/2) · (1 − q)^{i_2}
  ≤ (1/q) · (1 − q)^{i_2} + (1/(2q)) · (1 − q)^{i_2}

New weight:

Q′ = Q − (1/2) · (1 − q)^{i_2} ≤ (1 − q/3) · Q
After s iterations the weight is at most

(n/(q log n)) · (1 − q/4)^s ≤ (n/ log n) · (1 − q)^(log n)

Choosing s = 5 log n the inequality holds for sufficiently large n.

PA 5 List Ranking
© Harald Räcke 89
Tree Algorithms

[Figure: a tree on the nodes 1–9 together with its adjacency-list representation; each node stores an ordered list of its neighbours.]
Euler Circuits

Every node v fixes an arbitrary ordering among its adjacent


nodes:
u0 , u1 , . . . , ud−1

We obtain an Euler tour by setting

succ((ui , v)) = (v, u(i+1) mod d )
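A small sketch of this construction (assuming the tree is given by ordered adjacency lists; the function name is mine):

def euler_tour_succ(adj):
    """Euler-tour successor function of Lemma 16 (sketch).

    adj[v] is a fixed ordering u_0, ..., u_{d-1} of v's neighbours; for the
    incoming edge (u_i, v) the tour continues with (v, u_{(i+1) mod d}).
    """
    succ = {}
    for v, neigh in adj.items():
        d = len(neigh)
        for i, u in enumerate(neigh):
            succ[(u, v)] = (v, neigh[(i + 1) % d])
    return succ

# e.g. euler_tour_succ({1: [2, 3], 2: [1], 3: [1]}) yields the circuit
# (1,2) -> (2,1) -> (1,3) -> (3,1) -> (1,2)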

PA 6 Tree Algorithms
© Harald Räcke 91
Euler Circuits

Lemma 16
An Euler circuit can be computed in constant time O(1) with
O(n) operations.

PA 6 Tree Algorithms
© Harald Räcke 92
Euler Circuits — Applications

Rooting a tree
ñ split the Euler tour at node r
ñ this gives a list on the set of directed edges (Euler path)
ñ assign x[e] = 1 for every edge;
ñ perform parallel prefix; let s[·] be the result array
ñ if s[(u, v)] < s[(v, u)] then u is parent of v;

PA 6 Tree Algorithms
© Harald Räcke 93
Euler Circuits — Applications

Postorder Numbering
ñ split the Euler tour at node r
ñ this gives a list on the set of directed edges (Euler path)
ñ assign x[e] = 1 for every edge (v, parent(v))
ñ assign x[e] = 0 for every edge (parent(v), v)
ñ perform parallel prefix
ñ post(v) = s[(v, parent(v))]; post(r ) = n

PA 6 Tree Algorithms
© Harald Räcke 94
Euler Circuits — Applications

Level of nodes
ñ split the Euler tour at node r
ñ this gives a list on the set of directed edges (Euler path)
ñ assign x[e] = −1 for every edge (v, parent(v))
ñ assign x[e] = 1 for every edge (parent(v), v)
ñ perform parallel prefix
ñ level(v) = s[(parent(v), v)]; level(r ) = 0

PA 6 Tree Algorithms
© Harald Räcke 95
Euler Circuits — Applications

Number of descendants
ñ split the Euler tour at node r
ñ this gives a list on the set of directed edges (Euler path)
ñ assign x[e] = 0 for every edge (parent(v), v)
ñ assign x[e] = 1 for every edge (v, parent(v)), v ≠ r
ñ perform parallel prefix
ñ size(v) = s[(v, parent(v))] − s[(parent(v), v)]

PA 6 Tree Algorithms
© Harald Räcke 96
Rake Operation

Given a binary tree T .

Given a leaf u ∈ T with p(u) ≠ r the rake-operation does the


following
ñ remove u and p(u)
ñ attach sibling of u to p(p(u))

[Figure: a small binary tree before and after the rake — the leaf u and its parent p(u) are removed, and u's sibling is attached to p(p(u)).]

PA 6 Tree Algorithms
© Harald Räcke 97
We want to apply rake operations to a binary tree T until T just
consists of the root with two children.

Possible Problems:
1. we could concurrently apply the rake-operation to two
siblings
2. we could concurrently apply the rake-operation to two
leaves u and v such that p(u) and p(v) are connected
By choosing leaves carefully we ensure that none of the above
cases occurs

PA 6 Tree Algorithms
© Harald Räcke 98
Algorithm:
ñ label leaves consecutively from left to right (excluding
left-most and right-most leaf), and store them in an array A
ñ for dlog(n + 1)e iterations
ñ apply rake to all odd leaves that are left children
ñ apply rake operation to remaining odd leaves (odd at start
of round!!!)
ñ A=even leaves

PA 6 Tree Algorithms
© Harald Räcke 99
Observations
ñ the rake operation does not change the order of leaves
ñ two leaves that are siblings do not perform a rake operation
in the same round because one is even and one odd at the
start of the round
ñ two leaves that have adjacent parents either have different
parity (even/odd) or they differ in the type of child
(left/right)

PA 6 Tree Algorithms
© Harald Räcke 100
Cases when the edge between p(u) and p(v) is a left-child edge:

[Figure: the possible configurations of two leaves u and v whose parents are adjacent; in each case u and v differ in parity or in being a left/right child, so they never rake in the same round.]

PA 6 Tree Algorithms
© Harald Räcke 101
Example

[Figure: an example binary tree with root 17 and further nodes 1–16; the leaves are numbered from left to right and raked over several rounds.]

PA 6 Tree Algorithms
© Harald Räcke 102
ñ one iteration can be performed in constant time with O(|A|)
processors, where A is the array of leaves;
ñ hence, all iterations can be performed in O(log n) time and
O(n) work;
ñ the intial parallel prefix also requires time O(log n) and
work O(n)

PA 6 Tree Algorithms
© Harald Räcke 103
Evaluating Expressions
Suppose that we want to evaluate an expression tree, containing
additions and multiplications.

[Figure: two expression trees over the inputs A1, . . . , A8 — a balanced tree and a degenerate, path-like tree.]

If the tree is not balanced this may be time-consuming.

PA 6 Tree Algorithms
© Harald Räcke 104
We can use the rake-operation to do this quickly.

Applying the rake-operation changes the tree.

In order to maintain the value we introduce parameters av and


bv for every node that still allows to compute the value of a
node based on the value of its children.

Invariant:
Let u be internal node with children v and w. Then

val(u) = (av · val(v) + bv ) ⊗ (aw · val(w) + bw )

where ⊗ ∈ {∗, +} is the operation at node u.

Initially, we can choose av = 1 and bv = 0 for every node.

PA 6 Tree Algorithms
© Harald Räcke 105
Rake Operation

[Figure: node u with children v (a leaf with value x1) and w; u's parent is r. Raking v removes u and v and attaches w to r.]

Currently the value at u is

val(u) = (av · val(v) + bv) + (aw · val(w) + bw) = x1 + (aw · val(w) + bw)

In the expression for r this goes in as

au · [x1 + (aw · val(w) + bw)] + bu = (au · aw) · val(w) + (au · x1 + au · bw + bu)

so the new parameters for w are a′w = au · aw and b′w = au · x1 + au · bw + bu.
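A tiny sketch of this parameter update (my own helper; the '+' branch is exactly the calculation above, the '*' branch is the analogous computation for a multiplication node):

def rake_update(au, bu, aw, bw, x1, op):
    """New (a, b) parameters for the surviving child w after a rake (sketch)."""
    if op == '+':
        # val(u) = x1 + (aw*val(w) + bw), plugged into a_u*val(u) + b_u
        return au * aw, au * (x1 + bw) + bu
    else:  # op == '*'
        # val(u) = x1 * (aw*val(w) + bw)
        return au * x1 * aw, au * x1 * bw + bu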
PA 6 Tree Algorithms
© Harald Räcke 106
If we change the a and b-values during a rake-operation
according to the previous slide we can calculate the value of the
root in the end.

Lemma 17
We can evaluate an arithmetic expression tree in time O(log n)
and work O(n) regardless of the height or depth of the tree.

By performing the rake-operation in the reverse order we can


also compute the value at each node in the tree.

PA 6 Tree Algorithms
© Harald Räcke 107
Lemma 18
We compute tree functions for arbitrary trees in time O(log n)
and a linear number of operations.
proof on board...

PA 6 Tree Algorithms
© Harald Räcke 108
In the LCA (least common ancestor) problem we are given a tree
and the goal is to design a data-structure that answers
LCA-queries in constant time.

PA 6 Tree Algorithms
© Harald Räcke 109
Least Common Ancestor
LCAs on complete binary trees (inorder numbering):

[Figure: a complete binary tree on 15 nodes labelled 1, . . . , 15 in inorder, with binary representations 0001, . . . , 1111; the root is 8 = 1000, its children are 4 = 0100 and 12 = 1100.]

The least common ancestor of u and v is

z1 z2 . . . zi 1 0 . . . 0

where zi+1 is the first bit-position in which u and v differ.
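A sketch of this bit rule (my own helper, not from the slides; the case where one node is an ancestor of the other is not spelled out above and is handled here by an explicit subtree-range check):

def lca_inorder(u, v):
    """LCA of two nodes of a complete binary tree labelled in inorder."""
    if u == v:
        return u
    # ancestor case: x's subtree covers the open interval (x - low, x + low)
    for x, y in ((u, v), (v, u)):
        low = x & -x
        if x - low < y < x + low:
            return x
    i = (u ^ v).bit_length() - 1                   # highest differing bit
    return ((u >> (i + 1)) << (i + 1)) | (1 << i)  # common prefix, 1, zeros

# lca_inorder(5, 7) == 6, lca_inorder(3, 13) == 8, lca_inorder(4, 5) == 4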

PA 6 Tree Algorithms
© Harald Räcke 110
Least Common Ancestor

[Figure: a rooted tree with root 1 and children 2, 8, 9; node 2 has children 3 and 4; node 4 has children 5, 6, 7.]

nodes:  1 2 3 2 4 5 4 6 4 7 4 2 1 8 1 9 1
levels: 0 1 2 1 2 3 2 3 2 3 2 1 0 1 0 1 0

PA 6 Tree Algorithms
© Harald Räcke 111
`(v) is the index of the first appearance of v in the node-sequence.

r (v) is the index of the last appearance of v in the node-sequence.

`(v) and r (v) can be computed in constant time, given the


node- and level-sequence.

PA 6 Tree Algorithms
© Harald Räcke 112
Least Common Ancestor

Lemma 19

1. u is ancestor of v iff `(u) < `(v) < r (u)


2. u and v are not related iff either r (u) < `(v) or
r (v) < `(u)
3. suppose r (u) < `(v) then LCA(u, v) is vertex with
minimum level over interval [r (u), `(v)].

PA 6 Tree Algorithms
© Harald Räcke 113
Range Minima Problem

Given an array A[1 . . . n], a range minimum query (`, r ) consists


of a left index ` ∈ {1, . . . , n} and a right index r ∈ {1, . . . , n}.

The answer has to return the index of the minimum element in


the subsequence A[` . . . r ].

The goal in the range minima problem is to preprocess the array


such that range minima queries can be answered quickly
(constant time).

PA 6 Tree Algorithms
© Harald Räcke 114
Observation
Given an algorithm for solving the range minima problem in time
T (n) and work W (n) we can obtain an algorithm that solves the
LCA-problem in time O(T (n) + log n) and work O(n + W (n)).

Remark
In the sequential setting the LCA-problem and the range minima
problem are equivalent. This is not necessarily true in the
parallel setting.

For solving the LCA-problem it is sufficient to solve the restricted


range minima problem where two successive elements in the
array just differ by +1 or −1.

PA 6 Tree Algorithms
© Harald Räcke 115
Prefix and Suffix Minima
Tree with prefix-minima and suffix-minima:

[Figure: a complete binary tree over the 16-element array 6 4 2 3 4 5 1 6 0 5 1 6 3 4 5 3; every internal node stores the prefix minima and the suffix minima of the subsequence below it.]
PA 6 Tree Algorithms
© Harald Räcke 116
ñ Suppose we have an array A of length n = 2k
ñ We compute a complete binary tree T with n leaves.
ñ Each internal node corresponds to a subsequence of A. It
contains an array with the prefix and suffix minima of this
subsequence.

Given the tree T we can answer a range minimum query (`, r ) in


constant time.
ñ we can determine the LCA x of ` and r in constant time
since T is a complete binary tree
ñ Then we consider the suffix minimum of ` in the left child
of x and the prefix minimum of r in the right child of x.
ñ The minimum of these two values is the result.

PA 6 Tree Algorithms
© Harald Räcke 117
Lemma 20
We can solve the range minima problem in time O(log n) and
work O(n log n).

PA 6 Tree Algorithms
© Harald Räcke 118
Reducing the Work

Partition A into blocks Bi of length log n

Preprocess each Bi block separately by a sequential algorithm so


that range-minima queries within the block can be answered in
constant time. (how?)

For each block Bi compute the minimum xi and its prefix and
suffix minima.

Use the previous algorithm on the array (x1 , . . . , xn/ log n ).

PA 6 Tree Algorithms
© Harald Räcke 119
Answering a query (`, r):
ñ if ` and r are from the same block the data-structure for
this block gives us the result in constant time
ñ if ` and r are from different blocks the result is a minimum
of three elements:

• the suffix minmum of entry ` in `’s block

• the minimum among x`+1 , . . . , xr −1

• the prefix minimum of entry r in r ’s block

PA 6 Tree Algorithms
© Harald Räcke 120
Searching

An extension of binary search with p processors gives that one
can find the rank of an element in

log_{p+1}(n) = log n / log(p + 1)

many parallel steps with p processors. (not work-optimal)

This requires a CREW PRAM model. For the EREW model,
searching cannot be done faster than O(log n − log p) with p
processors even if there are p copies of the search key.
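A sequential simulation of this (p+1)-ary search (my own helper; each round's p comparisons are replayed in a loop):

def parallel_rank(X, y, p):
    """Compute rank(y : X) for sorted X by (p+1)-ary search (sketch).

    In every round the p processors compare y against p evenly spaced
    pivots of the remaining interval, shrinking it by a factor of roughly
    p+1, giving the log n / log(p+1) round bound.
    """
    lo, hi = 0, len(X)                       # rank(y:X) lies in {lo, ..., hi}
    while hi > lo:
        pivots = sorted({lo + (hi - lo) * (j + 1) // (p + 1) for j in range(p)})
        for q in pivots:                     # one comparison per processor
            if X[q] <= y:
                lo = max(lo, q + 1)
            else:
                hi = min(hi, q)
    return lo

# parallel_rank([1, 3, 5, 7, 9, 11, 13, 15], 8, 3) == 4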

PA 7 Searching and Sorting


© Harald Räcke 121
Merging

Given two sorted sequences A = (a1 , . . . , an ) and
B = (b1 , . . . , bn ), compute the sorted sequence C = (c1 , . . . , cn ).

Definition 21
Let X = (x1 , . . . , xt ) be a sequence. The rank rank(y : X) of y in
X is
rank(y : X) = |{x ∈ X | x ≤ y}|

For a sequence Y = (y1 , . . . , ys ) we define


rank(Y : X) := (r1 , . . . , rs ) with ri = rank(yi : X).

PA 7 Searching and Sorting


© Harald Räcke 122
Merging

We have already seen a merging-algorithm that runs in time


O(log n) and work O(n).

Using the fast search algorithm we can improve this to a running


time of O(log log n) and work O(n log log n).

PA 7 Searching and Sorting


© Harald Räcke 123
Merging
Input: A = a1 , . . . , an ; B = b1 , . . . , bm ; m ≤ n
1. if m < 4 then rank the elements of B, using the parallel search
algorithm with p processors. Time: O(1). Work: O(n).
2. Concurrently rank the elements b_{√m}, b_{2√m}, . . . , b_m in A using
the parallel search algorithm with p = √n. Time: O(1).
Work: O(√m · √n) = O(n)

j(i) := rank(b_{i√m} : A)
3. Let Bi = (b_{i√m + 1}, . . . , b_{(i+1)√m − 1}); and
Ai = (a_{j(i)+1}, . . . , a_{j(i+1)}).

Recursively compute rank(Bi : Ai).

4. Let k be an index that is not a multiple of √m, and let i = ⌈k/√m⌉. Then
rank(bk : A) = j(i) + rank(bk : Ai).

PA 7 Searching and Sorting


© Harald Räcke 124
The algorithm can be made work-optimal by standard
techniques.

proof on board...

PA 7 Searching and Sorting


© Harald Räcke 125
Mergesort

Lemma 22
A straightforward parallelization of Mergesort can be
implemented in time O(log n log log n) and with work O(n log n).

This assumes the CREW-PRAM model.

PA 7 Searching and Sorting


© Harald Räcke 126
Mergesort

Let L[v] denote the (sorted) sublist of elements stored at the


leaf nodes rooted at v.

We can view Mergesort as computing L[v] for a complete binary


tree where the leaf nodes correspond to nodes in the given array.

Since the merge-operations on one level of the complete binary


tree can be performed in parallel we obtain time O(h log log n)
and work O(hn), where h = O(log n) is the height of the tree.

PA 7 Searching and Sorting


© Harald Räcke 127
Pipelined Mergesort

We again compute L[v] for every node in the complete binary


tree.

After round s, Ls [v] is an approximation of L[v] that will be


improved in future rounds.

For s ≥ 3 height(v), Ls [v] = L[v].

PA 7 Searching and Sorting


© Harald Räcke 128
Pipelined Mergesort

In every round, a node v sends sample(Ls [v]) (an


approximation of its current list) upwards, and receives
approximations of the lists of its children.

It then computes a new approximation of its list.

A node is called active in round s if s ≤ 3 height(v) (this means


its list is not yet complete at the start of the round, i.e.,
Ls−1 [v] ≠ L[v]).

PA 7 Searching and Sorting


© Harald Räcke 129
Pipelined Mergesort

Algorithm 11 ColeSort()
1: initialize L0 [v] = Av for leaf nodes; L0 [v] = ∅ otherwise
2: for s ← 1 to 3 · height(T ) do
3: for all active nodes v do
4: // u and w children of v
5: L′s [u] ← sample(Ls−1 [u])
6: L′s [w] ← sample(Ls−1 [w])
7: Ls [v] ← merge(L′s [u], L′s [w])

sample(Ls [v]) = sample4 (Ls [v])  if s ≤ 3 height(v)
                 sample2 (Ls [v])  if s = 3 height(v) + 1
                 sample1 (Ls [v])  if s = 3 height(v) + 2

PA 7 Searching and Sorting


© Harald Räcke 130
Colesort

[Figure: an example run of the pipelined merge sort — for each round s the lists Ls[v] at the different levels of the tree are shown, growing from the leaves towards the root until the root list is completely sorted.]
PA 7 Searching and Sorting


© Harald Räcke 131
Pipelined Mergesort

Lemma 23
After round s = 3 height(v), the list Ls [v] is complete.

Proof:
ñ clearly true for leaf nodes
ñ suppose it is true for all nodes up to height h;
ñ fix a node v on level h + 1 with children u and w
ñ L3h [u] and L3h [w] are complete by induction hypothesis
ñ further sample(L3h+2 [u]) = L[u] and
sample(L3h+2 [w]) = L[w]
ñ hence in round 3h + 3 node v will merge the complete list
of its children; after the round L[v] will be complete

PA 7 Searching and Sorting


© Harald Räcke 132
Pipelined Mergesort

Lemma 24
The number of elements in lists Ls [v] for active nodes v is at
most O(n).

proof on board...

PA 7 Searching and Sorting


© Harald Räcke 133
Definition 25
A sequence X is a c-cover of a sequence Y if for any two
consecutive elements α, β from (−∞, X, ∞) we have
|{yi | α ≤ yi ≤ β}| ≤ c.

PA 7 Searching and Sorting


© Harald Räcke 134
Pipelined Mergesort

Lemma 26
L0s [v] is a 4-cover of L0s+1 [v].

If [a, b] fulfills |[a, b] ∩ (A ∪ {−∞, ∞})| = k we say [a, b]


intersects (−∞, A, +∞) in k items.

Lemma 27
If [a, b] with a, b ∈ L0s [v] ∪ {−∞, ∞} intersects (−∞, L0s [v], ∞) in
k ≥ 2 items, then [a, b] intersects (−∞, L0s+1 , ∞) in at most 2k
items.

PA 7 Searching and Sorting


© Harald Räcke 135
[Derivation: consider an interval that intersects (−∞, L′s [v], ∞) in k items. Then it intersects
Ls−1 [v] in at most 4k − 3 items;
L′s−1 [u] in p and L′s−1 [w] in q items with p + q ≤ 4k − 1;
L′s [u] in at most 2p and L′s [w] in at most 2q items;
Ls [v] in at most 2p + 2q ≤ 8k − 2 items;
L′s+1 [v] in at most 2k + 1/4, and hence at most 2k items.]

Note that the last step holds as long as L′s+1 [v] = sample4 (Ls [v]). Otherwise Ls−1 [v] has
already been full, and hence L′s [v], L′s+1 [v], L′s+2 [v] are 4-covers of the complete list L[v],
and also 4-covers of each other.
Merging with a Cover

Lemma 28
Given two sorted sequences A and B. Let X be a c-cover of A and
B for constant c, and let rank(X : A) and rank(X : B) be known.

We can merge A and B in time O(1) using O(|X|) operations.

PA 7 Searching and Sorting


© Harald Räcke 137
Merging with a Cover

Lemma 29
Given two sorted sequences A and B. Let X be a c-cover of B for
constant c, and let rank(A : X) and rank(X : B) be known.

We can compute rank(A : B) using O(|X| + |A|) operations.

PA 7 Searching and Sorting


© Harald Räcke 138
Merging with a Cover

Lemma 30
Given two sorted sequences A and B. Let X be a c-cover of B for
constant c, and let rank(A : X) and rank(X : B) be known.

We can compute rank(B : A) using O(|X| + |A|) operations.

Easy to do with concurrent read. Can also be done with exclusive


read but non-trivial.

PA 7 Searching and Sorting


© Harald Räcke 139
In order to do the merge in iteration s + 1 in constant time we
need to know

rank(Ls [v] : L0s+1 [u]) and rank(Ls [v] : L0s+1 [w])

and we need to know that Ls [v] is a 4-cover of L0s+1 [u] and


L0s+1 [w].

PA 7 Searching and Sorting


© Harald Räcke 140
Lemma 31
Ls [v] is a 4-cover of L0s+1 [u] and L0s+1 [w].

ñ Ls [v] ⊇ L0s [u], L0s [w]


ñ L0s [u] is 4-cover of L0s+1 [u]
ñ Hence, Ls [v] is 4-cover of L0s+1 [u] as adding more elements
cannot destroy the cover-property.

PA 7 Searching and Sorting


© Harald Räcke 141
Analysis
Lemma 32
Suppose we know for every internal node v with children u and
w
ñ rank(L0s [v] : L0s+1 [v])
ñ rank(L0s [u] : L0s [w])
ñ rank(L0s [w] : L0s [u])

We can compute
ñ rank(L0s+1 [v] : L0s+2 [v])
ñ rank(L0s+1 [u] : L0s+1 [w])
ñ rank(L0s+1 [w] : L0s+1 [u])
in constant time and O(|Ls+1 [v]|) operations, where v is the
parent of u and w.

PA 7 Searching and Sorting


© Harald Räcke 142
Given
ñ rank(L0s [u] : L0s+1 [u]) (4-cover)
ñ rank(L0s [w] : L0s [u])
ñ rank(L0s [u] : L0s [w])
ñ rank(L0s [w] : L0s+1 [w]) (4-cover)
Compute
ñ rank(L0s+1 [w] : L0s [u])
ñ rank(L0s+1 [u] : L0s [w])
Compute
ñ rank(L0s+1 [w] : L0s+1 [u])
ñ rank(L0s+1 [u] : L0s+1 [w])

ranks between siblings can be computed easily

PA 7 Searching and Sorting


© Harald Räcke 143
Given
ñ rank(L0s [u] : L0s+1 [u]) (4-cover → rank(L0s+1 [u] : L0s [u]))
ñ rank(L0s [w] : L0s+1 [u])
ñ rank(L0s [u] : L0s+1 [w])
ñ rank(L0s [w] : L0s+1 [w]) (4-cover → rank(L0s+1 [w] : L0s [w]))
Compute (recall that Ls [v] = merge(L0s [u], L0s [w]))
ñ rank(Ls [v] : L0s+1 [u])
ñ rank(Ls [v] : L0s+1 [w])
Compute
ñ rank(Ls [v] : Ls+1 [v]) (by adding)
ñ rank(L0s+1 [v] : L0s+2 [v]) (by sampling)

PA 7 Searching and Sorting


© Harald Räcke 144
Definition 33
A 0-1 sequence S is bitonic if it can be written as the
concatenation of subsequences S1 and S2 such that either
ñ S1 is monotonically increasing and S2 monotonically
decreasing, or
ñ S1 is monotonically decreasing and S2 monotonically
increasing.

Note, that this just defines bitonic 0-1 sequences. Bitonic


sequences are defined differently.

PA 8 Sorting Networks
© Harald Räcke 145
Bitonic Merger
If we feed a bitonic 0-1 sequence S into the network on the right
(one column of comparators, each comparing a position in the first
half with the corresponding position in the second half) we obtain
two sequences ST and SB s.t.
1. SB ≤ ST (element-wise)
2. SB and ST are bitonic

Proof:
ñ assume wlog. that S has more 1's than 0's.
ñ assume for contradiction that two 0s meet at the same comparator, at positions i and j
ñ everything 0 btw. i and j means we have more than 50% zeros (contradiction).
ñ all 1s btw. i and j means we have less than 50% ones (contradiction).
ñ a 1 btw. i and j together with a 1 elsewhere means S is not bitonic (contradiction).
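A minimal sketch of the comparator column discussed above (my own helper; it assumes the minima are routed to the bottom half SB and the maxima to the top half ST):

def half_cleaner(seq):
    """One comparator level of the bitonic merger (sketch).

    For a bitonic 0-1 input this yields SB <= ST elementwise, with both
    halves again bitonic.
    """
    n = len(seq)
    out = list(seq)
    for i in range(n // 2):
        a, b = out[i], out[i + n // 2]
        out[i], out[i + n // 2] = min(a, b), max(a, b)
    return out[: n // 2], out[n // 2:]       # (SB, ST)

# half_cleaner([0, 0, 1, 1, 1, 1, 0, 0]) == ([0, 0, 0, 0], [1, 1, 1, 1])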
Bitonic Merger

[Figure: the bitonic merger Bd — a column of comparators followed by two copies of Bd−1, one for the top half and one for the bottom half of the outputs.]

The bitonic merger Bd of dimension d is constructed by combining
two bitonic mergers of dimension d − 1.

If we feed a bitonic 0-1 sequence into this, the sequence will be sorted.

(actually, any bitonic sequence will be sorted, but we do not prove this)
Bitonic Sorter Sd

[Figure: the bitonic sorter Sd — two copies of Sd−1 sorting the two halves in opposite directions (producing a bitonic sequence), followed by a bitonic merger.]
Bitonic Merger: (n = 2^d)
ñ comparators: C(n) = 2C(n/2) + n/2 ⇒ C(n) = O(n log n).
ñ depth: D(n) = D(n/2) + 1 ⇒ D(n) = O(log n).

Bitonic Sorter: (n = 2^d)
ñ comparators: C(n) = 2C(n/2) + O(n log n) ⇒
C(n) = O(n log² n).
ñ depth: D(n) = D(n/2) + log n ⇒ D(n) = Θ(log² n).

PA 8 Sorting Networks
© Harald Räcke 149
Odd-Even Merge
How to merge two sorted sequences?
A = (a1 , a2 , . . . , an ), B = (b1 , b2 , . . . , bn ), n even.

Split into odd and even sequences:


Aodd = (a1 , a3 , a5 , . . . , an−1 ), Aeven = (a2 , a4 , a6 , . . . an )
Bodd = (b1 , b3 , b5 , . . . , bn−1 ), Beven = (b2 , b4 , b6 , . . . , bn )

Let

X = merge(Aodd , Bodd ) and Y = merge(Aeven , Beven )

Then

S = (x1 , min{x2 , y1 }, max{x2 , y1 }, min{x3 , y2 }, . . . , yn )

PA 8 Sorting Networks
© Harald Räcke 150
Odd-Even Merge

[Figure: the odd-even merger Md — two copies of Md−1 (one merging the odd-indexed, one the even-indexed subsequences) followed by one final column of comparators.]
Theorem 34
There exists a sorting network with depth O(log n) and
O(n log n) comparators.

PA 8 Sorting Networks
© Harald Räcke 152
Parallel Comparison Tree Model

A parallel comparison tree (with parallelism p) is a 3^p-ary tree.

ñ each internal node represents a set of p comparisons between p pairs (not necessarily distinct)
ñ a leaf v corresponds to a unique permutation that is valid
for all the comparisons on the path from the root to v
ñ the number of parallel steps is the height of the tree

PA 9 Lower Bounds
© Harald Räcke 153
Comparison PRAM

A comparison PRAM is a PRAM where we can only compare the


input elements;
ñ we cannot view them as strings
ñ we cannot do calculations on them

A lower bound for the comparison tree with parallelism p


directly carries over to the comparison PRAM with p processors.

PA 9 Lower Bounds
© Harald Räcke 154
A Lower Bound for Searching

Theorem 35
Given a sorted table X of n elements and an element y. Searching for y in X requires Ω(log n / log(p + 1)) steps in the parallel comparison tree with parallelism p < n.

PA 9 Lower Bounds
© Harald Räcke 155
A Lower Bound for Maximum

Theorem 36
A graph G with m edges and n vertices has an independent set of at least n²/(2m + n) vertices.

base case (n = 1)
ñ The only graph with one vertex has m = 0, and an
independent set of size 1.

PA 9 Lower Bounds
© Harald Räcke 156
induction step (1, . . . , n → n + 1)
ñ Let G be a graph with n + 1 vertices, and v a node with minimum degree (d).
ñ Let G' be the graph after deleting v and its adjacent vertices from G.
ñ n' = n − (d + 1)
ñ m' ≤ m − (d/2)·(d + 1), as we remove d + 1 vertices, each with degree at least d
ñ In G' there is an independent set of size (n')²/(2m' + n').
ñ By adding v we obtain an independent set of size

      1 + (n')²/(2m' + n') ≥ n²/(2m + n)
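The induction is constructive: repeatedly pick a minimum-degree vertex, add it to the independent set, and delete it together with its neighbours. A small sketch of that greedy procedure (my own illustration):

    def greedy_independent_set(adj):
        # adj: dict mapping each vertex to the set of its neighbours
        adj = {v: set(nb) for v, nb in adj.items()}
        result = []
        while adj:
            v = min(adj, key=lambda u: len(adj[u]))   # minimum-degree vertex
            result.append(v)
            removed = adj[v] | {v}                    # delete v and its neighbours
            for u in removed:
                adj.pop(u, None)
            for u in adj:
                adj[u] -= removed
        return result

    # n = 4 vertices, m = 5 edges: the bound guarantees at least 16/14, i.e. 2 vertices
    print(greedy_independent_set({0: {1, 3}, 1: {0, 2, 3}, 2: {1, 3}, 3: {0, 1, 2}}))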
A Lower Bound for Maximum

Theorem 37
Computing the maximum of n elements in the comparison tree
requires Ω(log log n) steps whenever the degree of parallelism is
p ≤ n.

Theorem 38
Computing the maximum of n elements requires Ω(log log n)
steps on the comparison PRAM with n processors.

PA 9 Lower Bounds
© Harald Räcke 158
An adversary can specify the input such that at the end of the (i + 1)-st step the maximum lies in a set C_{i+1} of size s_{i+1} such that
ñ no two elements of C_{i+1} have been compared
ñ s_{i+1} ≥ s_i²/(2p + s_i)  (apply Theorem 36 to the graph of comparisons within C_i)

PA 9 Lower Bounds
© Harald Räcke 159
Theorem 39
The selection problem requires Ω(log n/ log log n) steps on a
comparison PRAM.

not proven yet

PA 9 Lower Bounds
© Harald Räcke 160
A Lower Bound for Merging

The (k, s)-merging problem asks to merge k pairs of subsequences A^1, . . . , A^k and B^1, . . . , B^k, where we know that all elements in A^i ∪ B^i are smaller than the elements in A^j ∪ B^j for i < j. Further, |A^i|, |B^i| ≥ s.

PA 9 Lower Bounds
© Harald Räcke 161
A Lower Bound for Merging

Lemma 40
Suppose we are given a parallel comparison tree with parallelism p to solve the (k, s) merging problem. After the first step an adversary can specify the input such that an arbitrary (k', s') merging problem has to be solved, where

    k' = (3/4)·√(pk)
    s' = (s/4)·√(k/p)

PA 9 Lower Bounds
© Harald Räcke 162
A Lower Bound for Merging

Partition the A^i's and B^i's into blocks of length roughly s/ℓ; hence ℓ blocks each.

Define an ℓ × ℓ binary matrix M^i, where M^i_{xy} is 0 iff the parallel step did not compare an element from A^i_x with an element from B^i_y.

The matrix has 2ℓ − 1 diagonals.

PA 9 Lower Bounds
© Harald Räcke 163
Choose for every i the diagonal of M^i that has the most zeros.

Pair A^i_{j+d_i} with B^i_j (where d_i ∈ {−(ℓ − 1), . . . , ℓ − 1} specifies the chosen diagonal) for all j for which the entry in M^i is zero.

We can choose the values s.t. the elements for the j-th pair along the diagonal are all smaller than those for the (j + 1)-th pair.

Hence, we get a (k', s') problem.

PA 9 Lower Bounds
© Harald Räcke 164
How many pairs do we have?
ñ there are kℓ blocks in total
ñ there are k · ℓ² matrix entries in total
ñ there are at least k · ℓ² − p zeros.
ñ choosing a random diagonal (the same for every matrix M^i) hits at least

      (kℓ² − p)/(2ℓ − 1) ≥ kℓ/2 − p/(2ℓ)

  zeros.
ñ Choosing ℓ = ⌈2·√(p/k)⌉ gives

      k' ≥ (3/4)·√(pk)   and   s' = ⌊s/ℓ⌋ ≥ s/(4·√(p/k)) = (s/4)·√(k/p),

  where we assume s ≥ 6·√(p/k).

PA 9 Lower Bounds
© Harald Räcke 165
Lemma 41
Let T(k, s, p) be the number of parallel steps required on a comparison tree to solve the (k, s) merging problem. Then

    T(k, s, p) ≥ (1/4) · log( log √k / log(p/(ks)) ),

provided that p ≥ 2ks and p ≤ ks²/36.

PA 9 Lower Bounds
© Harald Räcke 166
Induction Step:

Assume that

    T(k', s', p) ≥ (1/4) · log( log √(k') / log( p/(k's') ) )
               ≥ (1/4) · log( log √( (3/4)·√(pk) ) / log( (16/3) · p/(ks) ) )
               ≥ (1/4) · log( ( (1/2) · log √k ) / ( 7 · log(p/(ks)) ) )
               ≥ (1/4) · log( log √k / log(p/(ks)) ) − 1

This gives the induction step.

PA 9 Lower Bounds
© Harald Räcke 167
Theorem 42
Merging requires at least Ω(log log n) time on a CRCW PRAM
with n processors.

PA 9 Lower Bounds
© Harald Räcke 168
Simulations between PRAMs

Theorem 43
We can simulate a p-processor priority CRCW PRAM on a
p-processor EREW PRAM with slowdown O(log p).

PA 10 Simulations between PRAMs


© Harald Räcke 169
Simulations between PRAMs

Theorem 44
We can simulate a p-processor priority CRCW PRAM on a
p log p-processor common CRCW PRAM with slowdown O(1).

PA 10 Simulations between PRAMs


© Harald Räcke 170
Simulations between PRAMs

Theorem 45
We can simulate a p-processor priority CRCW PRAM on a
log p
p-processor common CRCW PRAM with slowdown O( log log p ).

PA 10 Simulations between PRAMs


© Harald Räcke 171
Simulations between PRAMs

Theorem 46
We can simulate a p-processor priority CRCW PRAM on a
p-processor arbitrary CRCW PRAM with slowdown O(log log p).

PA 10 Simulations between PRAMs


© Harald Räcke 172
Lower Bounds for the CREW PRAM

Ideal PRAM:
ñ every processor has unbounded local memory
ñ in each step a processor reads a global variable
ñ then it does some (unbounded) computation on its local
memory
ñ then it writes a global variable

PA 10 Simulations between PRAMs


© Harald Räcke 173
Lower Bounds for the CREW PRAM

Definition 47
An input index i affects a memory location M at time t on some
input I if the content of M at time t differs between inputs I and
I(i) (i-th bit flipped).

L(M, t, I) = {i | i affects M at time t on input I}

PA 10 Simulations between PRAMs


© Harald Räcke 174
Lower Bounds for the CREW PRAM

Definition 48
An input index i affects a processor P at time t on some input I
if the state of P at time t differs between inputs I and I(i) (i-th
bit flipped).

K(P , t, I) = {i | i affects P at time t on input I}

PA 10 Simulations between PRAMs


© Harald Räcke 175
Lower Bounds for the CREW PRAM

Lemma 49
If i ∈ K(P , t, I) with t > 1 then either
ñ i ∈ K(P , t − 1, I), or
ñ P reads a global memory location M on input I at time t,
and i ∈ L(M, t − 1, I).

PA 10 Simulations between PRAMs


© Harald Räcke 176
Lower Bounds for the CREW PRAM

Lemma 50
If i ∈ L(M, t, I) with t > 1 then either
ñ A processor writes into M at time t on input I and
i ∈ K(P , t, I), or
ñ No processor writes into M at time t on input I and
ñ either i ∈ L(M, t − 1, I)
ñ or a processor P writes into M at time t on input I(i).

PA 10 Simulations between PRAMs


© Harald Räcke 177
Let k0 = 0, `0 = 1 and define

kt+1 = kt + `t and `t+1 = 3kt + 4`t

Lemma 51
|K(P , t, I)| ≤ kt and |L(M, t, I)| ≤ `t for any t ≥ 0

PA 10 Simulations between PRAMs


© Harald Räcke 178
base case (t = 0):
ñ No index can influence the local memory/state of a
processor before the first step (hence |K(P , 0, I)| = k0 = 0).
ñ Initially every index in the input affects exactly one memory
location. Hence |L(M, 0, I)| = 1 = `0 .

PA 10 Simulations between PRAMs


© Harald Räcke 179
induction step (t → t + 1):

K(P , t + 1, I) ⊆ K(P , t, I) ∪ L(M, t, I), where M is the location


read by P in step t + 1.

Hence,

|K(P , t + 1, I)| ≤ |K(P , t, I)| + |L(M, t, I)|


≤ kt + `t

PA 10 Simulations between PRAMs


© Harald Räcke 180
induction step (t → t + 1):

For the bound on |L(M, t + 1, I)| we have two cases.

Case 1:
A processor P writes into location M at time t + 1 on input I.

Then,

|L(M, t + 1, I)| ≤ |K(P , t + 1, I)|


≤ kt + `t
≤ 3kt + 4`t = `t+1

PA 10 Simulations between PRAMs


© Harald Räcke 181
Case 2:
No processor P writes into location M at time t + 1 on input I.

An index i affects M at time t + 1 iff i affects M at time t or


some processor P writes into M at t + 1 on I(i).

L(M, t + 1, I) ⊆ L(M, t, I) ∪ Y (M, t + 1, I)

Y(M, t + 1, I) is the set of indices u_j that cause some processor P_{w_j} to write into M at time t + 1 on input I(u_j).

PA 10 Simulations between PRAMs


© Harald Räcke 182
Y(M, t + 1, I) is the set of indices u_j that cause some processor P_{w_j} to write into M at time t + 1 on input I(u_j).

Fact:
For all pairs us , ut with Pws ≠ Pwt either
us ∈ K(Pwt , t + 1, I(ut )) or ut ∈ K(Pws , t + 1, I(us )).

Otherwise, Pwt and Pws would both write into M at the same
time on input I(us )(ut ).

PA 10 Simulations between PRAMs


© Harald Räcke 183
Let U = {u1 , . . . , ur } denote all indices that cause some
processor to write into M.

Let V = {(I(u1 ), Pw1 ), . . . }.

We set up a bipartite graph between U and V , such that


(ui , (I(uj ), Pwj )) ∈ E if ui affects Pwj at time t + 1 on input
I(uj ).

Each vertex (I(uj ), Pwj ) has degree at most kt+1 as this is an


upper bound on indices that can influence a processor Pwj .

Hence, |E| ≤ r · kt+1 .

PA 10 Simulations between PRAMs


© Harald Räcke 184
For an index u_j there can be at most k_{t+1} indices u_i with P_{w_i} = P_{w_j}.

Hence, there must be at least (1/2)·r·(r − k_{t+1}) pairs u_i, u_j with P_{w_i} ≠ P_{w_j}.

Each pair introduces at least one edge.

Hence,
    |E| ≥ (1/2)·r·(r − k_{t+1})

This gives r ≤ 3k_{t+1} ≤ 3k_t + 3ℓ_t.

PA 10 Simulations between PRAMs


© Harald Räcke 185
Recall that L(M, t + 1, I) ⊆ L(M, t, I) ∪ Y(M, t + 1, I). Hence,

    |L(M, t + 1, I)| ≤ ℓ_t + r ≤ 3k_t + 4ℓ_t

PA 10 Simulations between PRAMs


© Harald Räcke 186
The recurrence can be written in matrix form:

    (k_{t+1}, ℓ_{t+1})^T = A · (k_t, ℓ_t)^T   with   A = [[1, 1], [3, 4]]   and   (k_0, ℓ_0)^T = (0, 1)^T

Eigenvalues:

    λ_1 = (5 + √21)/2   and   λ_2 = (5 − √21)/2

Eigenvectors:

    v_1 = (1, λ_1 − 1)^T = (1, 3/2 + (1/2)√21)^T   and   v_2 = (1, λ_2 − 1)^T = (1, 3/2 − (1/2)√21)^T

Writing the start vector in this basis:

    (k_0, ℓ_0)^T = (0, 1)^T = (1/√21) · (v_1 − v_2)

and hence

    (k_t, ℓ_t)^T = (1/√21) · (λ_1^t · v_1 − λ_2^t · v_2)

Solving the recurrence gives

    k_t = (λ_1^t − λ_2^t)/√21
    ℓ_t = ((3 + √21)/(2√21)) · λ_1^t + ((√21 − 3)/(2√21)) · λ_2^t

with λ_1 = (5 + √21)/2 and λ_2 = (5 − √21)/2.

In particular k_t, ℓ_t ≤ λ_1^t; so if an output location must be affected by all n input bits after T steps, we need λ_1^T ≥ n, i.e., T ≥ log n / log λ_1 = Ω(log n).
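A quick numerical sanity check of this closed form (my own verification code, not part of the lecture):

    from math import sqrt

    l1, l2 = (5 + sqrt(21)) / 2, (5 - sqrt(21)) / 2
    k, l = 0, 1
    for t in range(12):
        k_closed = (l1 ** t - l2 ** t) / sqrt(21)
        l_closed = (3 + sqrt(21)) / (2 * sqrt(21)) * l1 ** t \
                 + (sqrt(21) - 3) / (2 * sqrt(21)) * l2 ** t
        assert abs(k - k_closed) < 1e-6 and abs(l - l_closed) < 1e-6
        k, l = k + l, 3 * k + 4 * l                  # the recurrence from above
    print("closed form matches for t = 0, ..., 11")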

PA 10 Simulations between PRAMs


© Harald Räcke 189
Theorem 52
The following problems require logarithmic time on a CREW
PRAM.
ñ Sorting a sequence of x1 , . . . , xn with xi ∈ {0, 1}
ñ Computing the maximum of n inputs
ñ Computing the sum x1 + · · · + xn with xi ∈ {0, 1}

PA 10 Simulations between PRAMs


© Harald Räcke 190
A Lower Bound for the EREW PRAM

Definition 53 (Zero Counting Problem)


Given a monotone binary sequence x1 , x2 , . . . , xn determine the
index i such that xi = 0 and xi+1 = 1.

We show that this problem requires Ω(log n − log p) steps on a


p-processor EREW PRAM.

PA 10 Simulations between PRAMs


© Harald Räcke 191
Let I_i be the input with i zeros followed by n − i ones.

Index i affects processor P at time t if the state of P after step t differs between inputs I_{i−1} and I_i.

Index i affects location M at time t if the content of M after step


t differs between inputs Ii−1 and Ii .

PA 10 Simulations between PRAMs


© Harald Räcke 192
Lemma 54
If i ∈ K(P , t) then either
ñ i ∈ K(P , t − 1), or
ñ P reads some location M on input Ii (and, hence, also on
Ii−1 ) at step t and i ∈ L(M, t − 1)

PA 10 Simulations between PRAMs


© Harald Räcke 193
Lemma 55
If i ∈ L(M, t) then either
ñ i ∈ L(M, t − 1), or
ñ some processor P writes M at step t on input I_i and i ∈ K(P, t), or
ñ some processor P writes M at step t on input I_{i−1} and i ∈ K(P, t).

PA 10 Simulations between PRAMs


© Harald Räcke 194
Define
    C(t) = Σ_P |K(P, t)| + Σ_M max{0, |L(M, t)| − 1}

C(T) ≥ n,   C(0) = 0

Claim:
    C(t) ≤ 6·C(t − 1) + 3|P|

This gives C(T) ≤ ((6^T − 1)/5) · 3|P| and hence T = Ω(log n − log |P|).

PA 10 Simulations between PRAMs


© Harald Räcke 195
For an index i to newly appear in L(M, t) some processor must
write into M on either input Ii or Ii−1 .

Hence, any index in K(P , t) can at most generate two new


indices in L(M, t).

This means that the number of new indices in any set L(M, t) (summed over all M) is at most 2·Σ_P |K(P, t)|.

PA 10 Simulations between PRAMs


© Harald Räcke 196
Hence,
    Σ_M |L(M, t)| ≤ Σ_M |L(M, t − 1)| + 2·Σ_P |K(P, t)|

We can assume wlog. that L(M, t − 1) ⊆ L(M, t). Then

    Σ_M max{0, |L(M, t)| − 1} ≤ Σ_M max{0, |L(M, t − 1)| − 1} + 2·Σ_P |K(P, t)|

PA 10 Simulations between PRAMs


© Harald Räcke 197
For an index i to newly appear in K(P, t), P must read a memory location M with i ∈ L(M, t − 1) on input I_i (and also on input I_{i−1}).

Since we are in the EREW model at most one processor can do so


in every step.

Let J(i, t) be the set of memory locations read in step t on input I_i, and let J_t = ∪_i J(i, t).

    Σ_P |K(P, t)| ≤ Σ_P |K(P, t − 1)| + Σ_{M ∈ J_t} |L(M, t − 1)|

Over all inputs Ii a processor can read at most |K(P , t − 1)| + 1


different memory locations (why?).

PA 10 Simulations between PRAMs


© Harald Räcke 198
Hence,
    Σ_P |K(P, t)| ≤ Σ_P |K(P, t − 1)| + Σ_{M ∈ J_t} |L(M, t − 1)|
                 ≤ Σ_P |K(P, t − 1)| + Σ_{M ∈ J_t} (|L(M, t − 1)| − 1) + |J_t|
                 ≤ 2·Σ_P |K(P, t − 1)| + Σ_{M ∈ J_t} (|L(M, t − 1)| − 1) + |P|
                 ≤ 2·Σ_P |K(P, t − 1)| + Σ_M max{0, |L(M, t − 1)| − 1} + |P|

Recall
    Σ_M max{0, |L(M, t)| − 1} ≤ Σ_M max{0, |L(M, t − 1)| − 1} + 2·Σ_P |K(P, t)|

PA 10 Simulations between PRAMs


© Harald Räcke 199
This gives
    Σ_P |K(P, t)| + Σ_M max{0, |L(M, t)| − 1}
        ≤ 4·Σ_M max{0, |L(M, t − 1)| − 1} + 6·Σ_P |K(P, t − 1)| + 3|P|

Hence,
    C(t) ≤ 6·C(t − 1) + 3|P|

PA 10 Simulations between PRAMs


© Harald Räcke 200
Lower Bounds for CRCW PRAMS

Theorem 56
Let f : {0, 1}^n → {0, 1} be an arbitrary Boolean function. f can be computed in O(1) time on a common CRCW PRAM with ≤ n·2^n processors.

Can we obtain non-constant lower bounds if we restrict the


number of processors to be polynomial?

PA 10 Simulations between PRAMs


© Harald Räcke 201
Boolean Circuits

ñ nodes are either AND, OR, or NOT gates or are special


INPUT/OUTPUT nodes
ñ AND and OR gates have unbounded fan-in (indegree) and unbounded fan-out (outdegree)
ñ NOT gates have unbounded fan-out
ñ INPUT nodes have indegree zero; OUTPUT nodes have
outdegree zero
ñ size is the number of edges
ñ depth is the longest path from an input to an output

PA 10 Simulations between PRAMs


© Harald Räcke 202
Theorem 57
Let f : {0, 1}n → {0, 1}m be a function with n inputs and m ≤ n
outputs, and circuit C computes f with depth D(n) and size
S(n). Then f can be computed by a common CRCW PRAM in
O(D(n)) time using S(n) processors.
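The idea is to assign one processor per gate and evaluate the circuit level by level; after round t all gates of depth at most t know their value. A sequential sketch of this schedule (my own illustration with a toy gate encoding, not the lecture's construction):

    # gates: name -> (kind, list of input names); kinds: 'AND', 'OR', 'NOT'
    def evaluate_by_levels(gates, depth_of, inputs, D):
        val = dict(inputs)                       # values of the INPUT nodes
        for t in range(1, D + 1):                # round t fires all depth-t gates in parallel
            for g, (kind, ins) in gates.items():
                if depth_of[g] != t:
                    continue
                if kind == 'AND':
                    val[g] = all(val[i] for i in ins)
                elif kind == 'OR':
                    val[g] = any(val[i] for i in ins)
                else:                            # 'NOT'
                    val[g] = not val[ins[0]]
        return val

    gates = {'g1': ('AND', ['x0', 'x1']), 'g2': ('NOT', ['x2']), 'g3': ('OR', ['g1', 'g2'])}
    depth_of = {'g1': 1, 'g2': 1, 'g3': 2}
    print(evaluate_by_levels(gates, depth_of, {'x0': True, 'x1': False, 'x2': False}, 2)['g3'])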

PA 10 Simulations between PRAMs


© Harald Räcke 203
Given a family {Cn } of circuits we may not be able to compute
the corresponding family of functions on a CRCW PRAM.

Definition 58
A family {Cn } of circuits is logspace uniform if there exists a
deterministic Turing machine M s.t
ñ M runs in logarithmic space.
ñ For all n, M outputs Cn on input 1n .

PA 10 Simulations between PRAMs


© Harald Räcke 204
Butterfly Network BF(d)

[Figure: BF(4); the columns are labelled 0000, 0001, . . . , 1111]

ñ node set V = {(`, x̄) | x̄ ∈ [2]d , ` ∈ [d + 1]}, where


x̄ = x0 x1 . . . xd−1 is a bit-string of length d
ñ edge set
E = {{(`, x̄), (` + 1, x̄ 0 )} | ` ∈ [d], x̄ ∈ [2]d , xi0 = xi for i ≠ `}

Sometimes the first and last level are identified.
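A small sketch that builds the node and edge sets of BF(d) directly from this definition (my own code):

    from itertools import product

    def butterfly(d):
        nodes = [(l, x) for l in range(d + 1) for x in product((0, 1), repeat=d)]
        edges = set()
        for l in range(d):
            for x in product((0, 1), repeat=d):
                y = list(x); y[l] ^= 1                   # flip bit l
                edges.add(((l, x), (l + 1, x)))          # straight edge
                edges.add(((l, x), (l + 1, tuple(y))))   # cross edge
        return nodes, edges

    nodes, edges = butterfly(3)
    print(len(nodes), len(edges))    # 32 nodes = (d+1)*2^d, 48 edges = 2*d*2^d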


Beneš Network

[Figure: the Beneš network for d = 3; rows labelled 000, . . . , 111, levels −3, . . . , 3]

ñ node set V = {(ℓ, x̄) | x̄ ∈ [2]^d, ℓ ∈ {−d, . . . , d}}
ñ edge set
  E = {{(ℓ, x̄), (ℓ + 1, x̄')} | ℓ ∈ [d], x̄ ∈ [2]^d, x'_i = x_i for i ≠ ℓ}
    ∪ {{(−ℓ, x̄), (−ℓ − 1, x̄')} | ℓ ∈ [d], x̄ ∈ [2]^d, x'_i = x_i for i ≠ ℓ}
n-ary Butterfly Network BF(n, d)

[Figure: BF(3, 3); the columns are labelled 000, 001, . . . , 222]

ñ node set V = {(ℓ, x̄) | x̄ ∈ [n]^d, ℓ ∈ [d + 1]}, where x̄ = x_0 x_1 . . . x_{d−1} is a string of length d over [n]
ñ edge set
  E = {{(ℓ, x̄), (ℓ + 1, x̄')} | ℓ ∈ [d], x̄ ∈ [n]^d, x'_i = x_i for i ≠ ℓ}
Permutation Network PN(n, d)

[Figure: the permutation network for d = 3; rows labelled 000, . . . , 111, levels −3, . . . , 3]

ñ There is an n-ary version of the Beneš network (two n-ary butterflies glued together at level 0).
ñ Identifying levels 0 and 1 (or 0 and −1) gives PN(n, d).
The d-dimensional mesh M(n, d)

ñ node set V = [n]d


ñ edge set E = {{(x0 , . . . , xi , . . . , xd−1 ), (x0 , . . . , xi + 1, . . . , xd−1 )} |
xs ∈ [n] for s ∈ [d] \ {i}, xi ∈ [n − 1]}
Remarks

M(2, d) is also called d-dimensional hypercube.

M(n, 1) is also called linear array of length n.

PA 11 Some Networks
© Harald Räcke 210
Permutation Routing

Lemma 59
On the linear array M(n, 1) any permutation can be routed
online in 2n steps with buffersize 3.
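For illustration, a simplified greedy simulation (my own code, not the exact 2n-step, buffer-3 protocol behind the lemma): every node forwards at most one packet per direction per round, preferring the packet with the farthest destination, and we measure the number of rounds and the largest buffer that occurs.

    import random

    def greedy_route(dest):                    # dest[i] = target of the packet starting at i
        n = len(dest)
        queues = [[d] for d in dest]
        rounds, max_buf = 0, 1
        while any(d != i for i, q in enumerate(queues) for d in q):
            arriving = [[] for _ in range(n)]
            for i, q in enumerate(queues):
                right = sorted((d for d in q if d > i), reverse=True)
                left = sorted(d for d in q if d < i)
                rest = [d for d in q if d == i]
                if right:
                    arriving[i + 1].append(right.pop(0))   # farthest-first to the right
                if left:
                    arriving[i - 1].append(left.pop(0))    # farthest-first to the left
                queues[i] = rest + right + left
            for i in range(n):
                queues[i] += arriving[i]
            rounds += 1
            max_buf = max(max_buf, max(len(q) for q in queues))
        return rounds, max_buf

    perm = list(range(16)); random.shuffle(perm)
    print(greedy_route(perm))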

PA 11 Some Networks
© Harald Räcke 211
Permutation Routing

Lemma 60
On the Beneš network any permutation can be routed offline in
2d steps between the sources level (+d) and target level (−d).

PA 11 Some Networks
© Harald Räcke 212
Recursive Beneš Network

[Figure: B(d) consists of an input and an output column of switches with two copies of B(d − 1) in between]
Permutation Routing
base case d = 0
trivial

induction step d → d + 1
ñ The packets that start at (ā, d) and (ā(d), d) have to be
sent into different sub-networks.
ñ The packets that end at (ā, −d) and (ā(d), −d) have to
come out of different sub-networks.

We can generate a graph on the set of packets.


ñ Every packet has an incident source edge (connecting it to
the conflicting start packet)
ñ Every packet has an incident target edge (connecting it to
the conflicting packet at its target)
ñ This clearly gives a bipartite graph; Coloring this graph tells
us which packet to send into which sub-network.
Permutation Routing on the n-ary Beneš Network
Instead of two we have n sub-networks B(n, d − 1).

All packets starting at positions


{(x0 , . . . , xd−2 , xd−1 , d) | xd−1 ∈ [n]} have to be send to
different sub-networks.

All packets ending at positions


{(x0 , . . . , xd−2 , xd−1 , d) | xd−1 ∈ [n]} have to come from
different sub-networks.

The conflict graph is an n-uniform 2-regular hypergraph.

We can color such a graph with n colors such that no two nodes
in a hyperedge share a color.

This gives the routing.


Lemma 61
On a d-dimensional mesh with sidelength n we can route any
permutation (offline) in 4dn steps.

PA 11 Some Networks
© Harald Räcke 216
We can simulate the algorithm for the n-ary Beneš Network.

Each step can be simulated by routing on disjoint linear arrays.


This takes at most 2n steps.

PA 11 Some Networks
© Harald Räcke 217
We simulate the behaviour of the n-ary Beneš network on the d-dimensional mesh.

In round r ∈ {−d, . . . , −1, 0, 1, . . . , d − 1} we simulate the step of


sending from level r of the Beneš network to level r + 1.

Each node x̄ ∈ [n]d of the mesh simulates the node (r , x̄).

Hence, if in the Beneš network we send from (r , x̄) to (r + 1, x̄ 0 )


we have to send from x̄ to x̄ 0 in the mesh.

All communication is performed along linear arrays. In round r < 0 the linear arrays along dimension −r − 1 (recall that dimensions are numbered from 0 to d − 1) are used, i.e., the arrays consisting of the nodes

    x_{d−1} . . . x_{−r} α x_{−r−2} . . . x_0,   α ∈ [n]

In rounds r ≥ 0 linear arrays along dimension r are used.

Hence, we can perform a round in O(n) steps.


Lemma 62
We can route any permutation on the Beneš network in O(d)
steps with constant buffer size.

The same is true for the butterfly network.

PA 11 Some Networks
© Harald Räcke 219
The nodes are of the form (ℓ, x̄), x̄ ∈ [n]^d, ℓ ∈ {−d, . . . , d}.

We can view nodes with the same first coordinate as forming columns and nodes with the same second coordinate as forming rows. This gives rows of length 2d + 1 and columns of length n^d.

We route in 3 phases:
1. Permute packets along the rows such that afterwards no column contains two packets with the same target row. O(d) steps.
2. We can use pipelining to permute every column, so that afterwards every packet is in its target row. O(2d + 2d) steps.
3. Every packet is in its target row. Permute packets to their final destinations. O(d) steps.

PA 11 Some Networks
© Harald Räcke 220
Lemma 63
We can do offline permutation routing of (partial) permutations
in 2d steps on the hypercube.

Lemma 64
We can sort on the hypercube M(2, d) in O(d2 ) steps.

Lemma 65
We can do online permutation routing of permutations in O(d2 )
steps on the hypercube.

PA 11 Some Networks
© Harald Räcke 221
Bitonic Sorter S_d

[Figure: the bitonic sorter S_d again, built from two copies of S_{d−1} followed by a bitonic merger]
ASCEND/DESCEND Programs

Algorithm 11 ASCEND(procedure oper)


1: for dim = 0 to d − 1
2: for all ā ∈ [2]d pardo
3: oper(ā, ā(dim), dim)

Algorithm 11 DESCEND(procedure oper)


1: for dim = d − 1 to 0
2: for all ā ∈ [2]d pardo
3: oper(ā, ā(dim), dim)

oper should only depend on the dimension and on values stored


in the respective processor pair (ā, ā(dim), V [ā], V [ā(dim)]).

oper should take constant time.

PA 11 Some Networks
© Harald Räcke 223
Algorithm 11 oper(ā, ā', dim, T_ā, T_ā')
1: if ā_dim . . . ā_0 = 0^{dim+1} then
2:     T_ā = min{T_ā, T_ā'}

Performing an ASCEND run with this operation computes the


minimum in processor 0.
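A sequential sketch of such an ASCEND run on M(2, d) (my own illustration; the array V plays the role of the values T_ā):

    def ascend(V, d, oper):
        for dim in range(d):                          # dimensions 0, ..., d-1
            for a in range(2 ** d):                   # conceptually all pairs in parallel
                oper(V, a, a ^ (1 << dim), dim)

    def min_oper(V, a, a_partner, dim):
        if a & ((1 << (dim + 1)) - 1) == 0:           # bits dim, ..., 0 of a are all zero
            V[a] = min(V[a], V[a_partner])

    V = [5, 3, 8, 1, 9, 2, 7, 4]                      # d = 3
    ascend(V, 3, min_oper)
    print(V[0])                                       # 1, the minimum, ends up in processor 0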

We can sort on M(2, d) by using d DESCEND runs.

We can do offline permutation routing by using a DESCEND run


followed by an ASCEND run.

PA 11 Some Networks
© Harald Räcke 224
We can perform an ASCEND/DESCEND run on a linear array M(2^d, 1) in O(2^d) steps.

PA 11 Some Networks
© Harald Räcke 225
The CCC network CCC(d) is obtained from the hypercube M(2, d) by replacing every node with a cycle of length d.

ñ nodes {(ℓ, x̄) | x̄ ∈ [2]^d, ℓ ∈ [d]}
ñ edges {{(ℓ, x̄), (ℓ, x̄(ℓ))} | x̄ ∈ [2]^d, ℓ ∈ [d]}
       ∪ {{(ℓ, x̄), ((ℓ + 1) mod d, x̄)} | x̄ ∈ [2]^d, ℓ ∈ [d]}  (cycle edges)

constant degree

PA 11 Some Networks
© Harald Räcke 226
Lemma 66
Let d = 2k . An ASCEND run of a hypercube M(2, d + k) can be
simulated on CCC(d) in O(d) steps.

PA 11 Some Networks
© Harald Räcke 227
The shuffle exchange network SE(d) is defined as follows
ñ nodes: V = [2]^d
ñ edges:
  E = {{xᾱ, ᾱx} | x ∈ [2], ᾱ ∈ [2]^{d−1}} ∪ {{ᾱ0, ᾱ1} | ᾱ ∈ [2]^{d−1}}

constant degree

Edges of the first type are called shuffle edges. Edges of the second type are called exchange edges.

PA 11 Some Networks
© Harald Räcke 228
Shuffle Exchange Networks

[Figure: the shuffle exchange networks SE(3) and SE(4)]
PA 11 Some Networks
© Harald Räcke 229
Lemma 67
We can perform an ASCEND run of M(2, d) on SE(d) in O(d)
steps.

PA 11 Some Networks
© Harald Räcke 230
Simulations between Networks

For the following observations we need to make the definition of


parallel computer networks more precise.

Each node of a given network corresponds to a processor/RAM.

In addition each processor has a read register and a write


register.

In one (synchronous) step each neighbour of a processor Pi can


write into Pi ’s write register or can read from Pi ’s read register.

Usually we assume that proper care has to be taken to avoid


concurrent reads and concurrent writes from/to the same
register.

PA 11 Some Networks
© Harald Räcke 231
Simulations between Networks

Definition 68
A configuration Ci of processor Pi is the complete description of
the state of Pi including local memory, program counter,
read-register, write-register, etc.

Suppose a machine M is in configuration C = (C_0, . . . , C_{p−1}), performs t synchronous steps, and is then in configuration C' = (C'_0, . . . , C'_{p−1}).

C'_i is called the t-th successor configuration of C for processor i.

PA 11 Some Networks
© Harald Räcke 232
Simulations between Networks

Definition 69
Let C = (C0 , . . . , Cp−1 ) a configuration of M. A machine M 0 with
q ≥ p processors weakly simulates t steps of M with slowdown k
if
ñ in the beginning there are p non-empty processors sets
A0 , . . . , Ap−1 ⊆ M 0 so that all processors in Ai know Ci ;
ñ after at most k · t steps of M 0 there is a processor Q(i) that
knows the t-th successors configuration of C for processor
Pi .

PA 11 Some Networks
© Harald Räcke 233
Simulations between Networks

Definition 70
M 0 simulates M with slowdown k if
ñ M 0 weakly simulates machine M with slowdown k
ñ and every processor in Ai knows the t-th successor
configuration of C for processor Pi .

PA 11 Some Networks
© Harald Räcke 234
We have seen how to simulate an ASCEND/DESCEND run of the
hypercube M(2, d + k) on CCC(d) with d = 2k in O(d) steps.

Hence, we can simulate d + k steps (one ASCEND run) of the


hypercube in O(d) steps. This means slowdown O(1).

PA 11 Some Networks
© Harald Räcke 235
Lemma 71
Suppose a network S with n processors can route any
permutation in time O(t(n)). Then S can simulate any constant
degree network M with at most n vertices with slowdown
O(t(n)).

PA 11 Some Networks
© Harald Räcke 236
Map the vertices of M to vertices of S in an arbitrary way.

Color the edges of M with ∆ + 1 colors, where ∆ = O(1) denotes


the maximum degree.

Each color gives rise to a permutation.

We can route this permutation in S in t(n) steps.

Hence, we can perform the required communication for one step


of M by routing ∆ + 1 permutations in S. This takes time t(n).

A processor of M is simulated by the same processor of S


throughout the simulation.

PA 11 Some Networks
© Harald Räcke 237
Lemma 72
Suppose a network S with n processors can sort n numbers in
time O(t(n)). Then S can simulate any network M with at most
n vertices with slowdown O(t(n)).

PA 11 Some Networks
© Harald Räcke 238
Lemma 73
There is a constant degree network on O(n^{1+ε}) nodes that can simulate any constant degree network with slowdown O(1).

PA 11 Some Networks
© Harald Räcke 239
Suppose we allow concurrent reads, this means in every step all
neighbours of a processor Pi can read Pi ’s read register.

Lemma 74
A constant degree network M that can simulate any n-node
network has slowdown Ω(log n) (independent of the size of M).

PA 11 Some Networks
© Harald Räcke 240
We show the lemma for the following type of simulation.
ñ There are representative sets A^t_i for every step t that specify which processors of M simulate processor P_i in step t (know the configuration of P_i after the t-th step).
ñ The representative sets for different processors are disjoint.
ñ For all i ∈ {1, . . . , n} and steps t, A^t_i ≠ ∅.

This is a step-by-step simulation.

PA 11 Some Networks
© Harald Räcke 241
Suppose processor P_i reads from processor P_{j_i} in step t.

Every processor Q of M with Q ∈ A^{t+1}_i must have a path to a processor Q' ∈ A^t_i and to a processor Q'' ∈ A^t_{j_i}.

Let k_t be the largest such distance (maximized over all i, j_i).

Then the simulation of step t takes time at least k_t.

The slowdown is at least

    k = (1/ℓ) · Σ_{t=1}^{ℓ} k_t

PA 11 Some Networks
© Harald Räcke 242
We show
ñ the simulation of a step takes at least time γ log n, or
ñ the size of the representative sets shrinks by a lot:

    Σ_i |A^{t+1}_i| ≤ (1/n) · Σ_i |A^t_i|

PA 11 Some Networks
© Harald Räcke 243
Suppose there is no pair (i, j) such that i reading from j requires time γ log n.
ñ For every i and every j, the set Γ_{2k}(A_i) contains a node from A_j.
ñ Hence, there must exist a j_i such that Γ_{2k}(A_i) contains at most

      |C_{j_i}| := |A_i| · c^{2k} / (n − 1) ≤ |A_i| · c^{3k} / n

  processors from A_{j_i}.

PA 11 Some Networks
© Harald Räcke 244
If we choose that i reads from j_i we get

    |A'_i| ≤ |C_{j_i}| · c^k
           ≤ c^k · |A_i| · c^{3k} / n
           = |A_i| · c^{4k} / n

Choosing k = Θ(log n) gives that this is at most |A_i|/n.

PA 11 Some Networks
© Harald Räcke 245
Let ℓ be the total number of steps and s be the number of short steps, i.e., steps with k_t < γ log n.

In a step of time k_t a representative set can increase by at most a factor of c^{k_t+1}.

Let h_ℓ denote the total number of representatives after step ℓ.

PA 11 Some Networks
© Harald Räcke 246
    n ≤ h_ℓ ≤ h_0 · (1/n)^s · Π_{t ∈ long} c^{k_t+1} ≤ (n/n^s) · c^{ℓ + Σ_t k_t}

If Σ_t k_t ≥ ℓ·((1/2)·log_c n − 1), we are done. Otherwise

    n ≤ n^{1 − s + ℓ/2},

which gives s ≤ ℓ/2.

Hence, at most 50% of the steps are short.

PA 11 Some Networks
© Harald Räcke 247
Deterministic Online Routing

Lemma 75
A permutation on an n × n-mesh can be routed online in O(n)
steps.

PA 11 Some Networks
© Harald Räcke 248
Deterministic Online Routing

Definition 76 (Oblivious Routing)


Specify a path-system W with a path Pu,v between u and v for
every pair {u, v} ∈ V × V .

A packet with source u and destination v moves along path Pu,v .

PA 11 Some Networks
© Harald Räcke 249
Deterministic Online Routing

Definition 77 (Oblivious Routing)


Specify a path-system W with a path Pu,v between u and v for
every pair {u, v} ∈ V × V .

Definition 78 (node congestion)

For a given path-system the node congestion is the maximum number of paths that go through any node v ∈ V.

Definition 79 (edge congestion)

For a given path-system the edge congestion is the maximum number of paths that go through any edge e ∈ E.

PA 11 Some Networks
© Harald Räcke 250
Deterministic Online Routing

Definition 80 (dilation)
For a given path system the dilation is the maximum length of a
path.

PA 11 Some Networks
© Harald Räcke 251
Lemma 81
Any oblivious routing protocol requires at least max{Cf , Df }
steps, where Cf and Df , are the congestion and dilation,
respectively, of the path-system used. (node congestion or edge
congestion depending on the communication model)

Lemma 82
Any reasonable oblivious routing protocol requires at most
O(Df · Cf ) steps (unbounded buffers).

PA 11 Some Networks
© Harald Räcke 252
Theorem 83 (Borodin, Hopcroft)
For any path system W there exists a permutation π : V → V and an edge e ∈ E such that at least Ω(√n/∆) of the paths go through e.

PA 11 Some Networks
© Harald Räcke 253
Let Wv = {Pv,u | u ∈ V }.

We say that an edge e is z-popular for v if at least z paths from


Wv contain e.

PA 11 Some Networks
© Harald Räcke 254
For any node v there are many edges that are quite popular for v.

|V| × |E| matrix A(z):

    A_{v,e}(z) = 1 if e is z-popular for v, and 0 otherwise

Define
ñ A_v(z) = Σ_e A_{v,e}(z)
ñ A_e(z) = Σ_v A_{v,e}(z)

PA 11 Some Networks
© Harald Räcke 255
Lemma 84
Let z ≤ (n − 1)/∆. For every node v ∈ V there exist at least n/(2∆z) edges that are z-popular for v. This means

    A_v(z) ≥ n/(2∆z)

PA 11 Some Networks
© Harald Räcke 256
Lemma 85
There exists an edge e_0 that is z-popular for at least z nodes, with z = Ω(√n/∆).

    Σ_e A_e(z) = Σ_v A_v(z) ≥ n²/(2∆z)

There must exist an edge e_0 with

    A_{e_0}(z) ≥ ⌈n²/(|E| · 2∆z)⌉ ≥ ⌈n/(2∆²z)⌉,

where the last step follows from |E| ≤ ∆n.

PA 11 Some Networks
© Harald Räcke 257
We choose z such that z = n/(2∆²z), i.e., z = √n/(√2·∆).

This means e_0 is ⌈z⌉-popular for ⌈z⌉ nodes.

We can construct a permutation such that z paths go through e_0.

PA 11 Some Networks
© Harald Räcke 258
Deterministic oblivious routing may perform very poorly.

What happens if we have a random routing problem in a


butterfly?

PA 11 Some Networks
© Harald Räcke 259
Suppose every source on level 0 has p packets, that are routed to random destinations.

How many packets go over a node v on level i?

From v we can reach 2^d/2^i = 2^{d−i} different targets.

Hence,
    Pr[packet goes over v] ≤ 2^{d−i}/2^d = 1/2^i

PA 11 Some Networks
© Harald Räcke 260
Expected number of packets:

    E[packets over v] = p · 2^i · (1/2^i) = p,

since only p · 2^i packets can reach v.

But this is trivial.

PA 11 Some Networks
© Harald Räcke 261
What is the probability that at least r packets go through v?

    Pr[at least r paths through v] ≤ (p·2^i choose r) · (1/2^i)^r
                                   ≤ (p·2^i·e/r)^r · (1/2^i)^r
                                   = (pe/r)^r

    Pr[there exists a node v such that at least r paths go through v] ≤ d·2^d · (pe/r)^r

PA 11 Some Networks
© Harald Räcke 262
    Pr[there exists a node v such that at least r paths go through v] ≤ d·2^d · (pe/r)^r

Choose r as 2ep + (ℓ + 1)d + log d = O(p + log N), where N is the number of sources in BF(d). Then

    Pr[there exists a node v with more than r paths over v] ≤ 1/N^ℓ

PA 11 Some Networks
© Harald Räcke 263
Scheduling Packets

Assume that in every round a node may forward at most one


packet but may receive up to two.

We select a random rank Rp ∈ [k]. Whenever, we forward a


packet we choose the packet with smaller rank. Ties are broken
according to packet id.

Random Rank Protocol
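A sketch of the forwarding rule (my own illustration): every packet draws its rank once when it is injected, and a node forwards the queued packet minimizing (rank, id).

    import random

    k = 64                                        # rank range [k]

    def inject(pid):
        return {"id": pid, "rank": random.randrange(k)}

    def forward_one(queue):
        if not queue:
            return None
        best = min(queue, key=lambda p: (p["rank"], p["id"]))
        queue.remove(best)
        return best

    q = [inject(i) for i in range(5)]
    print(forward_one(q)["id"], [p["id"] for p in q])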

PA 11 Some Networks
© Harald Räcke 264
Definition 86 (Delay Sequence of length s)

ñ delay path W
ñ lengths `0 , `1 , . . . , `s , with `0 ≥ 1, `1 , . . . , `s ≥ 0 lengths of
delay-free sub-paths
ñ collision nodes v0 , v1 , . . . , vs , vs+1
ñ collision packets P0 , . . . , Ps

PA 11 Some Networks
© Harald Räcke 265
Properties
ñ rank(P_0) ≥ rank(P_1) ≥ · · · ≥ rank(P_s)
ñ Σ_{i=0}^{s} ℓ_i = d
ñ if the routing takes d + s steps then the delay sequence has length s

PA 11 Some Networks
© Harald Räcke 266
Definition 87 (Formal Delay Sequence)

ñ a path W of length d from a source to a target
ñ integers ℓ_0 ≥ 1, ℓ_1, . . . , ℓ_s ≥ 0 with Σ_{i=0}^{s} ℓ_i = d
ñ nodes v_0, . . . , v_s, v_{s+1} on W with v_i being on level d − ℓ_0 − · · · − ℓ_{i−1}
ñ s + 1 packets P_0, . . . , P_s, where P_i is a packet whose path goes through v_i and v_{i−1}
ñ numbers k_s ≤ k_{s−1} ≤ · · · ≤ k_0 < k

PA 11 Some Networks
© Harald Räcke 267
We say a formal delay sequence is active if rank(P_i) = k_i holds for all i.

Let N_s be the number of formal delay sequences of length at most s. Then

    Pr[routing needs at least d + s steps] ≤ N_s / k^{s+1}

PA 11 Some Networks
© Harald Räcke 268
Lemma 88

    N_s ≤ N³ · ( 2eC(s + k)/(s + 1) )^{s+1}

ñ there are at most N² ways to choose W
ñ there are (s+d−1 choose s) ways to choose the ℓ_i's with Σ_{i=0}^{s} ℓ_i = d
ñ the collision nodes are then fixed
ñ there are at most C^{s+1} ways to choose the collision packets, where C is the node congestion
ñ there are at most (s+k choose s+1) ways to choose 0 ≤ k_s ≤ · · · ≤ k_0 < k

PA 11 Some Networks
© Harald Räcke 269
Hence the probability that the routing takes more than d + s steps is at most

    N³ · ( 2e · C · (s + k) / ((s + 1)·k) )^{s+1}

We choose s = 8eC − 1 + (ℓ + 3)d and k = s + 1. This gives that the probability is at most 1/N^ℓ.

PA 11 Some Networks
© Harald Räcke 270
ñ With probability 1 − 1/N^{ℓ_1} the random routing problem has congestion at most O(p + ℓ_1·d).
ñ With probability 1 − 1/N^{ℓ_2} the packet scheduling finishes in at most O(C + ℓ_2·d) steps.

Hence, with high probability, routing random problems with p packets per source in a butterfly requires only O(p + d) steps.

What do we do for arbitrary routing problems?

PA 11 Some Networks
© Harald Räcke 271
Valiant's Trick

Where did the scheduling analysis use the butterfly?

We only used that
ñ all routing paths have the same length d
ñ there are only polynomially many delay paths

Choose paths as follows:
ñ route from the source to a random destination on the target level
ñ route to the real target column (albeit on the source level)
ñ route to the target

All phases run in time O(p + d) with high probability.
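For concreteness, a sketch of the path selection behind Valiant's trick on the hypercube M(2, d), where the canonical paths are bit-fixing paths (my own illustration; the butterfly version described in the bullets above works analogously):

    import random

    def bit_fixing_path(u, v, d):
        path = [u]
        for i in range(d):                    # fix differing bits in order of dimension
            if (u ^ v) & (1 << i):
                u ^= 1 << i
                path.append(u)
        return path

    def valiant_path(s, t, d):
        w = random.randrange(2 ** d)          # phase 1: random intermediate node
        return bit_fixing_path(s, w, d) + bit_fixing_path(w, t, d)[1:]

    print(valiant_path(0b0000, 0b1011, 4))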

PA 11 Some Networks
© Harald Räcke 272
Valiant's Trick

Multicommodity Flow Problem
ñ undirected (weighted) graph G = (V, E, c)
ñ commodities (s_i, t_i), i ∈ {1, . . . , k}
ñ a multicommodity flow is a flow f : E × {1, . . . , k} → R⁺
ñ for all edges e ∈ E:  Σ_i f_i(e) ≤ c(e)
ñ for all nodes v ∈ V \ {s_i, t_i}:  Σ_{u:(u,v)∈E} f_i((u, v)) = Σ_{w:(v,w)∈E} f_i((v, w))

Goal A (Maximum Multicommodity Flow)
maximize Σ_i Σ_{e=(s_i,x)∈E} f_i(e)

Goal B (Maximum Concurrent Multicommodity Flow)
maximize min_i Σ_{e=(s_i,x)∈E} f_i(e)/d_i (the throughput fraction), where d_i is the demand for commodity i
PA 11 Some Networks
© Harald Räcke 273
Valiant's Trick

A Balanced Multicommodity Flow Problem is a concurrent multicommodity flow problem in which, for every node v, the incoming and outgoing demand equals

    c(v) = Σ_{e=(v,x)∈E} c(e)

PA 11 Some Networks
© Harald Räcke 274
Valiant's Trick

For a multicommodity flow S we assume that we have a decomposition of the flow(s) into flow-paths.

We use C(S) to denote the congestion of the flow solution (the inverse of the throughput fraction), and D(S) to denote the length of the longest routing path.

PA 11 Some Networks
© Harald Räcke 275
For a network G = (V, E, c) we define the characteristic flow problem via
ñ demands d_{u,v} = c(u)·c(v)/c(V)

Suppose the characteristic flow problem has a solution S with C(S) ≤ F and D(S) ≤ F.
PA 11 Some Networks
© Harald Räcke 276
Definition 89
A (randomized) oblivious routing scheme is given by a path system P and a weight function w such that

    Σ_{p ∈ P_{s,t}} w(p) = 1   for every pair (s, t).
PA 11 Some Networks
© Harald Räcke 277
Construct an oblivious routing scheme from S as follows:
ñ let f_{x,y} be the flow between x and y in S
ñ   f_{x,y} ≥ d_{x,y}/C(S) ≥ d_{x,y}/F = (1/F) · c(x)·c(y)/c(V)
ñ for p ∈ P_{x,y} set w(p) = f_p / f_{x,y}

This gives an oblivious routing scheme.
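A sketch of how such a scheme would be represented and used (my own illustration): store, for every pair (x, y), the flow-paths with weights w(p) = f_p / f_{x,y}, and sample a path with these probabilities.

    import random

    def make_scheme(path_flows):
        # path_flows[(x, y)] = list of (path, flow value f_p)
        scheme = {}
        for pair, paths in path_flows.items():
            total = sum(f for _, f in paths)                 # f_{x,y}
            scheme[pair] = [(p, f / total) for p, f in paths]
        return scheme

    def sample_path(scheme, x, y):
        paths, weights = zip(*scheme[(x, y)])
        return random.choices(paths, weights=weights)[0]

    scheme = make_scheme({("a", "b"): [(["a", "u", "b"], 2.0), (["a", "v", "b"], 1.0)]})
    print(sample_path(scheme, "a", "b"))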

PA 11 Some Networks
© Harald Räcke 278
Valiant's Trick

We apply this routing scheme twice:
ñ first choose a path from P_{s,v}, where v is chosen randomly with probability c(v)/c(V)
ñ then choose a path according to P_{v,t}

If the input flow problem/packet routing problem is balanced, this randomization results in the flow solution S (twice).

Hence, we have an oblivious scheme with congestion and dilation at most 2F for balanced inputs.
PA 11 Some Networks
© Harald Räcke 279
Example: hypercube.

PA 11 Some Networks
© Harald Räcke 280
Oblivious Routing for the Mesh

We can route any permutation on an n × n mesh in O(n) steps by x-y routing. Actually O(d) steps, where d is the largest distance between a source-target pair.

What happens if we do not have a permutation?

x-y routing may generate large congestion if some pairs have a lot of packets.

Valiant's trick may create a large dilation.

PA 11 Some Networks
© Harald Räcke 281
For a multicommodity flow problem P, let C_opt(P) be the optimum congestion and D_opt(P) the optimum dilation (possibly attained by different flow solutions).

Lemma 90
There is an oblivious routing scheme for the mesh that obtains a flow solution S with C(S) = O(C_opt(P) · log n) and D(S) = O(D_opt(P)).

PA 11 Some Networks
© Harald Räcke 282
Lemma 91
For any oblivious routing scheme on the mesh there is a demand
P such that routing P will give congestion Ω(log n · Copt ).

PA 11 Some Networks
© Harald Räcke 283
