
WS 2014/15

Parallel Algorithms

Harald Räcke

Fakultät für Informatik


TU München

http://www14.in.tum.de/lehre/2014WS/pa/

Winter Term 2014/15

Part I

Organizational Matters

Part I

Organizational Matters

ñ Modul: IN2011
ñ Name: “Parallel Algorithms”
“Parallele Algorithmen”
ñ ECTS: 8 Credit points
ñ Lectures:
ñ 4 SWS
Mon 14:00–16:00 (Room 00.08.038)
Fri 8:30–10:00 (Room 00.08.038)
ñ Webpage: http://www14.in.tum.de/lehre/2014WS/pa/
ñ Required knowledge:
ñ IN0001, IN0003
“Introduction to Informatics 1/2”
“Einführung in die Informatik 1/2”
ñ IN0007
“Fundamentals of Algorithms and Data Structures”
“Grundlagen: Algorithmen und Datenstrukturen” (GAD)
ñ IN0011
“Basic Theoretic Informatics”
“Einführung in die Theoretische Informatik” (THEO)
ñ IN0015
“Discrete Structures”
“Diskrete Strukturen” (DS)
ñ IN0018
“Discrete Probability Theory”
“Diskrete Wahrscheinlichkeitstheorie” (DWT)
ñ IN2003
“Efficient Algorithms and Data Structures”
“Effiziente Algorithmen und Datenstrukturen”
The Lecturer

ñ Harald Räcke
ñ Email: [email protected]
ñ Room: 03.09.044
ñ Office hours: (per appointment)

Tutorials

ñ Tutors:
ñ Chris Pinkau
ñ [email protected]
ñ Room: 03.09.037
ñ Office hours: Tue 13:00–14:00
ñ Room: 03.11.018
ñ Time: Tue 14:00–16:00

Assignment sheets

ñ In order to pass the module you need to pass a 3-hour exam
Assessment

ñ Assignment Sheets:
ñ An assignment sheet is usually made available on Monday
on the module webpage.
ñ Solutions have to be handed in in the following week before
the lecture on Monday.
ñ You can hand in your solutions by putting them in the right
folder in front of room 03.09.019A.
ñ Solutions will be discussed in the subsequent tutorial on
Tuesday.

1 Contents

ñ PRAM algorithms
ñ Parallel Models
ñ PRAM Model
ñ Basic PRAM Algorithms
ñ Sorting
ñ Lower Bounds
ñ Networks of Workstations
ñ Offline Permutation Routing on the Mesh
ñ Oblivious Routing in the Butterfly
ñ Greedy Routing
ñ Sorting on the Mesh
ñ ASCEND/DESCEND Programs
ñ Embeddings between Networks

PA 1 Contents
© Harald Räcke 9
2 Literatur

Tom Leighton:
Introduction to Parallel Algorithms and Architecture:
Arrays, Trees, Hypercubes,
Morgan Kaufmann: San Mateo, CA, 1992
Joseph JaJa:
An Introduction to Parallel Algorithms,
Addison-Wesley: Reading, MA, 1997
Jeffrey D. Ullman:
Computational Aspects of VLSI,
Computer Science Press: Rockville, USA, 1984
Selim G. Akl.:
The Design and Analysis of Parallel Algorithms,
Prentice Hall: Englewood Cliffs, NJ, 1989

PA 2 Literatur
© Harald Räcke 10
Part II

Foundations

3 Introduction
Parallel Computing
A parallel computer is a collection of processors usually of the
same type, interconnected to allow coordination and exchange
of data.

The processors are primarily used to jointly solve a given


problem.

Distributed Systems
A set of processors, possibly of many different types, is distributed over a larger geographic area.

Processors do not work on a single problem.

Some processors may act in a malicious way.

PA 3 Introduction
© Harald Räcke 12
Cost measures

How do we evaluate sequential algorithms?

ñ time efficiency
ñ space utilization
ñ energy consumption
ñ programmability
ñ ...

Asymptotic bounds (e.g., for running time) often give a good indication of the algorithm's performance on a wide variety of machines.

PA 3 Introduction
© Harald Räcke 13
Cost measures
How do we evaluate parallel algorithms?

ñ time efficiency
ñ space utilization
ñ energy consumption
ñ programmability
ñ communication requirement
ñ ...

Problems
ñ performance (e.g. runtime) depends on problem size n and
on number of processors p
ñ statements usually only hold for restricted types of parallel
machine as parallel computers may have vastly different
characteristics (in particular w.r.t. communication)
Speedup
Suppose a problem P has sequential complexity T ∗ (n), i.e.,
there is no algorithm that solves P in time o(T ∗ (n)).

Definition 1
The speedup Sp (n) of a parallel algorithm A that requires time
Tp (n) for solving P with p processors is defined as

Sp(n) = T*(n) / Tp(n)

Clearly, Sp (n) ≤ p. Goal: obtain Sp (n) ≈ p.

It is common to replace T ∗ (n) by the time bound of the best


known sequential algorithm for P !

PA 3 Introduction
© Harald Räcke 15
Efficiency
Definition 2
The efficiency of a parallel algorithm A that requires time Tp (n)
when using p processors on a problem of size n is

Ep(n) = T1(n) / (p · Tp(n))

Ep (n) ≈ 1 indicates that the algorithm is running roughly p


times faster with p processors than with one processor.

Note that Ep(n) ≤ T1(n) / (p · T∞(n)). Hence, the efficiency goes down
rapidly if p ≥ T1(n)/T∞(n).

Disadvantage: cost-measure does not relate to the optimum


sequential algorithm.

PA 3 Introduction
© Harald Räcke 16
Parallel Models — Requirements

Simplicity
A model should allow to easily analyze various performance
measures (speed, communication, memory utilization etc.).

Results should be as hardware-independent as possible.

Implementability
Parallel algorithms developed in a model should be easily
implementable on a parallel machine.

Theoretical analysis should carry over and give meaningful


performance estimates.

A real satisfactory model does not exist!

PA 3 Introduction
© Harald Räcke 17
DAG model — computation graph

ñ nodes represent operations (single instructions or larger


blocks)
ñ edges represent dependencies (precedence constraints)
ñ closely related to circuits; however there exist many
different variants
ñ branching instructions cannot be modelled
ñ completely hardware independent
ñ scheduling is not defined

Often used for automatically parallelizing numerical


computations.

PA 3 Introduction
© Harald Räcke 18
Example: Addition

[Figure: two DAGs for adding A1, . . . , A8 — a balanced binary tree of "+" nodes of logarithmic depth, and a degenerate chain of "+" nodes of linear depth.]

Here, vertices without incoming edges correspond to input data.

The graph can be viewed as a data flow graph.

PA 3 Introduction
© Harald Räcke 19
DAG model — computation graph

The DAG itself is not a complete algorithm. A scheduling


implements the algorithm on a parallel machine, by assigning a
time-step tv and a processor pv to every node.

Definition 3
A scheduling of a DAG G = (V , E) on p processors is an
assignment of pairs (tv , pv ) to every internal node v ∈ V , s.t.,
ñ pv ∈ {1, . . . , p}; tv ∈ {1, . . . , T }
ñ tu = tv ⇒ pu ≠ pv
ñ (u, v) ∈ E ⇒ tv ≥ tu + 1
where a non-internal node x (an input node) has tx = 0.
T is the length of the schedule.

PA 3 Introduction
© Harald Räcke 20
DAG model — computation graph
The parallel complexity of a DAG is defined as

Tp(n) = min over all schedules S of T(S)

T1(n): number of internal nodes in the DAG

T∞(n): diameter of the DAG

Clearly,
Tp (n) ≥ T∞ (n)
Tp (n) ≥ T1 (n)/p

Lemma 4
A schedule with length O(T1 (n)/p + T∞ (n)) can be found easily.
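To make Lemma 4 concrete, here is a small sequential sketch (not part of the lecture) of the greedy list-scheduling idea behind it: in every parallel step we run up to p currently ready DAG nodes. The function name and the adjacency-list input format are my own assumptions.

from collections import deque

def greedy_schedule(succ, indeg, p):
    """Greedy schedule of a DAG on p processors (sketch for Lemma 4).

    succ[v]  : list of successors of node v
    indeg[v] : number of predecessors of v
    Returns a list of rounds, each round a list of at most p nodes.
    """
    ready = deque(v for v in indeg if indeg[v] == 0)
    remaining = dict(indeg)
    rounds = []
    while ready:
        # execute up to p ready nodes in this parallel step
        step = [ready.popleft() for _ in range(min(p, len(ready)))]
        rounds.append(step)
        for v in step:
            for w in succ[v]:
                remaining[w] -= 1
                if remaining[w] == 0:
                    ready.append(w)   # becomes ready for a later round
    return rounds

Every round either runs p nodes (there are at most T1(n)/p such rounds) or exhausts all current sources (which shortens the longest remaining path), so the schedule length is O(T1(n)/p + T∞(n)).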

Lemma 5
Finding an optimal schedule is in general NP-complete.
Note that the DAG model as defined is a non-uniform model of
computation.

In principle, there could be a different DAG for every input size


n.

An algorithm (e.g. for a RAM) must work for every input size and
must be of finite description length.

Hence, specifying a different DAG for every n has more


expressive power.

Also, this is not really a complete model, as the operations


allowed in a DAG node are not clearly defined.

PA 3 Introduction
© Harald Räcke 22
PRAM Model

P1 P2 P3 P4 P5 P6 P7 P8

Global Shared Memory

All processors are synchronized.

In every round a processor can:


ñ read a register from global memory into local memory
ñ do a local computation à la RAM
ñ write a local register into global memory

PA 3 Introduction
© Harald Räcke 23
PRAM Model

Every processor executes the same program.

However, the program has access to two special variables:


ñ p: total number of processors
ñ id ∈ {1, . . . , p}: the id of the current processor

The following (stupid) program copies the content of the global


register x[1] to registers x[2] . . . x[p].

Algorithm 1 copy
1: if id = 1 then round ← 1
2: while round ≤ p and id = round do
3: x[id + 1] ← x[id]
4: round ← round + 1

PA 3 Introduction
© Harald Räcke 24
PRAM Model
ñ processors can effectively execute different code because of
branching according to id
ñ however, not arbitrarily; still uniform model of computation

Often it is easier to explicitly define which parts of a program


are executed in parallel:

Algorithm 2 sum
1: // computes sum of x[1] . . . x[p]
2: // red part is executed only by processor 1
3: r ← 1
4: while 2^r ≤ p do
5: for id mod 2^r = 1 pardo
6: // only executed by processors whose id matches
7: x[id] ← x[id] + x[id + 2^(r−1)]
8: r ← r + 1
9: return x[1]
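For readers who want to experiment, the following is a plain sequential simulation of Algorithm 2 (not from the lecture); it replays the pardo steps in a loop and assumes p is a power of two.

def pram_sum(x):
    """Sequential simulation of the EREW PRAM summation (Algorithm 2)."""
    a = [None] + list(x)          # a[1..p], 1-indexed as on the slides
    p = len(x)
    r = 1
    while 2 ** r <= p:
        # all processors with id mod 2^r == 1 act "in parallel"
        for pid in range(1, p + 1):
            if pid % (2 ** r) == 1:
                a[pid] = a[pid] + a[pid + 2 ** (r - 1)]
        r += 1
    return a[1]

# pram_sum([3, 1, 4, 1, 5, 9, 2, 6]) == 31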
Different Types of PRAMs
Simultaneous Access to Shared Memory:
ñ EREW PRAM:
simultaneous access is not allowed
ñ CREW PRAM:
concurrent read accesses to the same location are allowed;
write accesses have to be exclusive
ñ CRCW PRAM:
concurrent read and write accesses allowed
ñ common CRCW PRAM
all processors writing to x[i] must write same value
ñ arbitrary CRCW PRAM
values may be different; an arbitrary processor succeeds
ñ priority CRCW PRAM
values may be different; processor with smallest id succeeds

PA 3 Introduction
© Harald Räcke 26
Algorithm 3 sum
1: // computes sum of x[1] . . . x[p]
2: r ← 1
3: while 2^r ≤ p do
4: for id mod 2^r = 1 pardo
5: x[id] ← x[id] + x[id + 2^(r−1)]
6: r ← r + 1
7: return x[1]

The above is an EREW PRAM algorithm.

On a CREW PRAM we could replace Line 4 by


for 1 ≤ id ≤ p pardo

PA 3 Introduction
© Harald Räcke 27
PRAM Model — remarks
ñ similar to a RAM we either need to restrict the size of values
that can be stored in registers, or we need to have a
non-uniform cost model for doing a register manipulation
(cost for manipulating x[i] is proportional to the bit-length
of the largest number that is ever being stored in x[i])
ñ in this lecture: uniform cost model but we are not
exploiting the model
ñ global shared memory is very unrealistic in practice as
uniform access to all memory locations does not exist
ñ global synchronization is very unrealistic; in real parallel
machines a global synchronization is very costly
ñ model is good for understanding basic parallel
mechanisms/techniques but not for algorithm development
ñ model is good for lower bounds

PA 3 Introduction
© Harald Räcke 28
Network of Workstations — NOWs

ñ interconnection network represented by a graph G = (V , E)


ñ each v ∈ V represents a processor
ñ an edge {u, v} ∈ E represents a two-way communication
link between processors u and v
ñ network is asynchronous
ñ all coordination/communication has to be done by explicit
message passing

PA 3 Introduction
© Harald Räcke 29
Typical Topologies

[Figure: a 3-dimensional hypercube with nodes labelled 000–111, a cycle/ring on 8 nodes, a linear array on nodes 1–8, and a 4×4 mesh/grid with nodes (1,1)–(4,4).]

PA 3 Introduction
© Harald Räcke 30
Network of Workstations — NOWs
Computing the sum on a d-dimensional hypercube. Note that
x[0] . . . x[2d − 1] are stored at the individual nodes.

Processors are numbered consecutively starting from 0

Algorithm 4 sum
1: // computes sum of x[0] . . . x[2d − 1]
2: r ← 1
3: while 2r ≤ 2d do // p = 2d
4: if id mod 2r = 0 then
5: temp ← receive(id + 2r −1 )
6: x[id] = x[id] + temp
7: if id mod 2r = 2r −1 then
8: send(x[id], id − 2r −1 )
9: r ←r +1
10: if id = 0 then return x[id]
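A hedged sequential sketch of Algorithm 4 (my own helper; the send/receive pairs are simulated by directly reading the partner's cell, and n is assumed to be a power of two):

def hypercube_sum(x):
    """Sequential simulation of Algorithm 4 on a d-dimensional hypercube."""
    vals = list(x)
    n = len(vals)                     # n = 2^d
    r = 1
    while 2 ** r <= n:
        half = 2 ** (r - 1)
        for node in range(n):
            if node % (2 ** r) == 0:
                # "receive" from the partner that differs in bit r-1
                vals[node] += vals[node + half]
        r += 1
    return vals[0]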
Network of Workstations — NOWs

Remarks
ñ One has to ensure that at any point in time there is at most
one active communication along a link
ñ There also exist synchronized versions of the model, where
in every round each link can be used once for
communication
ñ In particular the asynchronous model is quite realistic
ñ Difficult to develop and analyze algorithms as a lot of low
level communication has to be dealt with
ñ Results only hold for one specific topology and cannot be
generalized easily

PA 3 Introduction
© Harald Räcke 32
Performance of PRAM algorithms

Suppose that we can solve an instance of a problem with size n


with P (n) processors and time T (n).

We call C(n) = T (n) · P (n) the time-processor product or the


cost of the algorithm.

The following statements are equivalent


ñ P (n) processors and time O(T (n))
ñ O(C(n)) cost and time O(T (n))
ñ O(C(n)/p) time for any number p ≤ P (n) processors
ñ O(C(n)/p + T (n)) for any number p of processors

PA 3 Introduction
© Harald Räcke 33
Performance of PRAM algorithms
Suppose we have a PRAM algorithm that takes time T (n) and
work W (n), where work is the total number of operations.

We can nearly always obtain a PRAM algorithm that uses at most

⌊W(n)/p⌋ + T(n)

parallel steps on p processors.

Idea:
ñ Wi(n) denotes the number of operations in parallel step i, 1 ≤ i ≤ T(n)
ñ simulate each step in ⌈Wi(n)/p⌉ parallel steps
ñ then we have

Σ_i ⌈Wi(n)/p⌉ ≤ Σ_i (⌊Wi(n)/p⌋ + 1) ≤ ⌊W(n)/p⌋ + T(n)
Performance of PRAM algorithms

Why nearly always?

We need to assign processors to operations.


ñ every processor pi needs to know whether it should be
active
ñ in case it is active it needs to know which operations to
perform

design algorithms for an arbitrary number of processors;


keep total time and work low

PA 3 Introduction
© Harald Räcke 35
Optimal PRAM algorithms

Suppose the optimal sequential running time for a problem is


T ∗ (n).

We call a PRAM algorithm for the same problem work optimal if


its work W (n) fulfills

W (n) = Θ(T ∗ (n))

If such an algorithm has running time T(n) it has speedup

Sp(n) = Ω( T*(n) / (T*(n)/p + T(n)) ) = Ω( p·T*(n) / (T*(n) + p·T(n)) ) = Ω(p)

for p = O(T*(n)/T(n)).

PA 3 Introduction
© Harald Räcke 36
This means by improving the time T (n), (while using same
work) we improve the range of p, for which we obtain optimal
speedup.

We call an algorithm worktime (WT) optimal if T (n) cannot be


asymptotically improved by any work optimal algorithm.

PA 3 Introduction
© Harald Räcke 37
Example

The algorithm for computing the sum has work W(n) = O(n), which is optimal.

It has T(n) = O(log n). Hence, we achieve an optimal speedup for p = O(n/ log n).

One can show that any CREW PRAM requires Ω(log n) time to
compute the sum.

PA 3 Introduction
© Harald Räcke 38
Communication Cost

When we differentiate between local and global memory we can


analyze communication cost.

We define the communication cost of a PRAM algorithm as the


worst-case traffic between the local memory of a processor and
the global shared memory.

Important criterion as communication is usually a major


bottleneck.

PA 3 Introduction
© Harald Räcke 39
Communication Cost

Algorithm 5 MatrixMult(A, B, n)
1: Input: n × n matrices A and B; n = 2^k
2: Output: C = AB
3: for 1 ≤ i, j, ℓ ≤ n pardo
4: X[i, j, ℓ] ← A[i, ℓ] · B[ℓ, j]
5: for r ← 1 to log n
6: for 1 ≤ i, j ≤ n; ℓ mod 2^r = 1 pardo
7: X[i, j, ℓ] ← X[i, j, ℓ] + X[i, j, ℓ + 2^(r−1)]
8: for 1 ≤ i, j ≤ n pardo
9: C[i, j] ← X[i, j, 1]

On n^3 processors this algorithm runs in time O(log n).

It uses n^3 multiplications and O(n^3) additions.
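The following Python sketch mirrors Algorithm 5 sequentially (my own helper, 0-indexed, n a power of two); it only illustrates the pairwise summation over the ℓ-dimension, not an actual PRAM implementation.

import math

def pram_matrix_mult(A, B):
    """Sequential simulation of Algorithm 5 (n assumed a power of two)."""
    n = len(A)
    # X[i][j][l] holds A[i][l] * B[l][j]
    X = [[[A[i][l] * B[l][j] for l in range(n)] for j in range(n)]
         for i in range(n)]
    for r in range(1, int(math.log2(n)) + 1):
        step = 2 ** r
        for i in range(n):
            for j in range(n):
                for l in range(0, n, step):     # l mod 2^r == 0 (0-indexed)
                    X[i][j][l] += X[i][j][l + step // 2]
    return [[X[i][j][0] for j in range(n)] for i in range(n)]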

PA 3 Introduction
© Harald Räcke 40
What happens if we have n processors?

Phase 1
pi computes X[i, j, ℓ] = A[i, ℓ] · B[ℓ, j] for all 1 ≤ j, ℓ ≤ n
n^2 time; n^2 communication for every processor

Phase 2 (round r)
pi updates X[i, j, ℓ] for all 1 ≤ j ≤ n; ℓ mod 2^r = 1
O(n · n/2^r) time; no communication

Phase 3
pi writes the i-th row into the C[i, j]'s.
n time; n communication

PA 3 Introduction
© Harald Räcke 41
Alternative Algorithm
Split the matrices into blocks of size n^(2/3) × n^(2/3).

[Figure: A, B and C each partitioned into a 4×4 grid of blocks A_{i,j}, B_{i,j}, C_{i,j} with A · B = C.]

Note that C_{i,j} = Σ_ℓ A_{i,ℓ} B_{ℓ,j}.

Now we have the same problem as before but with n' = n^(1/3), and a
single multiplication costs time O((n^(2/3))^3) = O(n^2). An addition
costs n^(4/3).

work for multiplications: O(n^2 · (n')^3) = O(n^3)
work for additions: O(n^(4/3) · (n')^3) = O(n^3)
time: O(n^2) + log n' · O(n^(4/3)) = O(n^2)
Alternative Algorithm

The communication cost is only O(n^(4/3) log n') as a processor in
the original problem touches at most log n entries of the matrix.

Each entry has size O(n^(4/3)).

The algorithm exhibits less parallelism but still has optimum
work/runtime for just n processors.

much, much better in practice

PA 3 Introduction
© Harald Räcke 43
Part III

PRAM Algorithms

Prefix Sum
input: x[1] . . . x[n]
output: s[1] . . . s[n] with s[i] = x[1] ∗ x[2] ∗ · · · ∗ x[i] (w.r.t. operator ∗)

Algorithm 6 PrefixSum(n, x[1] . . . x[n])
1: // compute prefix sums; n = 2^k
2: if n = 1 then s[1] ← x[1]; return
3: for 1 ≤ i ≤ n/2 pardo
4: a[i] ← x[2i − 1] ∗ x[2i]
5: z[1], . . . , z[n/2] ← PrefixSum(n/2, a[1] . . . a[n/2])
6: for 1 ≤ i ≤ n pardo
7: i even: s[i] ← z[i/2]
8: i = 1: s[1] ← x[1]
9: i odd: s[i] ← z[(i − 1)/2] ∗ x[i]
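A sequential sketch of Algorithm 6 (my own helper, 0-indexed, operator defaulting to +, n assumed a power of two):

def prefix_sum(x, op=lambda a, b: a + b):
    """Sequential simulation of the recursive prefix-sum algorithm."""
    n = len(x)
    if n == 1:
        return [x[0]]
    a = [op(x[2 * i], x[2 * i + 1]) for i in range(n // 2)]   # pair up
    z = prefix_sum(a, op)                                     # recurse
    s = [None] * n
    for i in range(n):
        if i == 0:
            s[0] = x[0]
        elif i % 2 == 1:              # even positions in 1-indexed terms
            s[i] = z[i // 2]
        else:                         # odd positions > 1 in 1-indexed terms
            s[i] = op(z[i // 2 - 1], x[i])
    return s

# prefix_sum([1, 2, 3, 4]) == [1, 3, 6, 10]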

PA 4.1 Prefix Sum


© Harald Räcke 45
Prefix Sum

[Figure: the recursion tree of the prefix-sum computation on 16 inputs — the x-values are pairwise combined into a-values and recursively reduced to a single value, and the s-values (prefix sums) are filled in on the way back down; the vertical axis shows the time steps.]
Prefix Sum

The algorithm uses work O(n) and time O(log n) for solving
Prefix Sum on an EREW-PRAM with n processors.

It is clearly work-optimal.

Theorem 6
On a CREW PRAM a Prefix Sum requires running time Ω(log n)
regardless of the number of processors.

PA 4.1 Prefix Sum


© Harald Räcke 47
Parallel Prefix

Input: a linked list given by successor pointers; a value x[i] for


every list element; an operator ∗;

Output: for every list position ` the sum (w.r.t. ∗) of elements


after ` in the list (including `)

Example: the list 4 → 3 → 7 → 8 → 2 → 1 → 6 → 5 with values x[4], x[3], x[7], x[8], x[2], x[1], x[6], x[5]
and successor pointers S[4]=3, S[3]=7, S[7]=8, S[8]=2, S[2]=1, S[1]=6, S[6]=5, S[5]=5.

PA 4.2 Parallel Prefix


© Harald Räcke 48
Parallel Prefix

Algorithm 7 ParallelPrefix
1: for 1 ≤ i ≤ n pardo
2: P [i] ← S[i]
3: while S[i] ≠ S[S[i]] do
4: x[i] ← x[i] ∗ x[S[i]]
5: S[i] ← S[S[i]]
6: if P [i] ≠ i then x[i] ← x[i] ∗ x[S(i)]

The algorithm runs in time O(log n).

It has work requirement O(n log n). non-optimal

This technique is also known as pointer jumping
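A sequential simulation of pointer jumping (my own helper; the synchronous PRAM rounds are emulated by copying the arrays before updating):

def pointer_jump_prefix(S, x, op=lambda a, b: a + b):
    """Parallel-prefix on a linked list via pointer jumping (sketch).

    S[i] is the successor of i (the last element points to itself),
    x[i] its value.  Returns the suffix sums along the list.
    """
    S, x = list(S), list(x)
    n = len(x)
    while any(S[i] != S[S[i]] for i in range(n)):
        new_x, new_S = x[:], S[:]
        for i in range(n):                    # one synchronous round
            if S[i] != S[S[i]]:
                new_x[i] = op(x[i], x[S[i]])
                new_S[i] = S[S[i]]
        x, S = new_x, new_S
    # finally add the value of the last list element for everyone else
    return [op(x[i], x[S[i]]) if S[i] != i else x[i] for i in range(n)]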

PA 4.2 Parallel Prefix


© Harald Räcke 49
4.3 Divide & Conquer — Merging

Given two sorted sequences A = (a1 , . . . , an ) and
B = (b1 , . . . , bn ), compute the sorted sequence C = (c1 , . . . , cn ).

Definition 7
Let X = (x1 , . . . , xt ) be a sequence. The rank rank(y : X) of y in
X is
rank(y : X) = |{x ∈ X | x ≤ y}|

For a sequence Y = (y1 , . . . , ys ) we define


rank(Y : X) := (r1 , . . . , rs ) with ri = rank(yi : X).

PA 4.3 Divide & Conquer — Merging


© Harald Räcke 50
4.3 Divide & Conquer — Merging

Given two sorted sequences A = (a1 . . . an ) and B = (b1 . . . bn ),
compute the sorted sequence C = (c1 . . . cn ).

Observation:
We can assume wlog. that elements in A and B are different.

Then for ci ∈ C we have i = rank(ci : A ∪ B).

This means we just need to determine rank(x : A ∪ B) for all


elements!

Observe, that rank(x : A ∪ B) = rank(x : A) + rank(x : B).

Clearly, for x ∈ A we already know rank(x : A), and for x ∈ B we


know rank(x : B).

PA 4.3 Divide & Conquer — Merging


© Harald Räcke 51
4.3 Divide & Conquer — Merging

Compute rank(x : A) for all x ∈ B and rank(x : B) for all x ∈ A.


can be done in O(log n) time with 2n processors by binary
search

Lemma 8
On a CREW PRAM, Merging can be done in O(log n) time and
O(n log n) work.

not optimal
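As an illustration of this rank-based merging (Lemma 8), here is a small sequential sketch (my own helper, elements assumed distinct); on a CREW PRAM the two loops run fully in parallel, one binary search per element.

from bisect import bisect_right

def merge_by_ranking(A, B):
    """Merge two sorted sequences by cross ranking (sketch of Lemma 8)."""
    n, m = len(A), len(B)
    C = [None] * (n + m)
    for i, a in enumerate(A):                  # rank(a:A) = i + 1
        C[i + bisect_right(B, a)] = a          # + rank(a:B) via binary search
    for j, b in enumerate(B):
        C[j + bisect_right(A, b)] = b
    return C

# merge_by_ranking([1, 4, 7], [2, 3, 9]) == [1, 2, 3, 4, 7, 9]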

PA 4.3 Divide & Conquer — Merging


© Harald Räcke 52
4.3 Divide & Conquer — Merging
A = (a1 , . . . , an ); B = (b1 , . . . , bn );
log n integral; k := n/ log n integral;

Algorithm 8 GenerateSubproblems
1: j0 ← 0
2: jk ← n
3: for 1 ≤ i ≤ k − 1 pardo
4: ji ← rank(b_{i·log n} : A)
5: for 0 ≤ i ≤ k − 1 pardo
6: Bi ← (b_{i·log n + 1} , . . . , b_{(i+1)·log n} )
7: Ai ← (a_{ji + 1} , . . . , a_{j_{i+1}} )

If Ci is the merging of Ai and Bi then the sequence C0 . . . Ck−1 is
a sorted sequence.

PA 4.3 Divide & Conquer — Merging


© Harald Räcke 53
4.3 Divide & Conquer — Merging
We can generate the subproblems in time O(log n) and work
O(n).

Note that in a sub-problem Bi has length log n.

If we run the algorithm again for every subproblem, (where Ai


takes the role of B) we can in time O(log log n) and work O(n)
generate subproblems where Aj and Bj have both length at
most log n.

Such a subproblem can be solved by a single processor in time


O(log n) and work O(|Ai | + |Bi |).

Parallelizing the last step gives total work O(n) and time
O(log n).

the resulting algorithm is work optimal

PA 4.3 Divide & Conquer — Merging


© Harald Räcke 54
4.4 Maximum Computation

Lemma 9
On a CRCW PRAM the maximum of n numbers can be computed
in time O(1) with n2 processors.

proof on board...

PA 4.4 Maximum Computation


© Harald Räcke 55
4.4 Maximum Computation

Lemma 10
On a CRCW PRAM the maximum of n numbers can be computed
in time O(log log n) with n processors and work O(n log log n).

proof on board...

PA 4.4 Maximum Computation


© Harald Räcke 56
4.4 Maximum Computation

Lemma 11
On a CRCW PRAM the maximum of n numbers can be computed
in time O(log log n) with n processors and work O(n).

proof on board...

PA 4.4 Maximum Computation


© Harald Räcke 57
4.5 Inserting into a (2, 3)-tree

Given a (2, 3)-tree with n elements, and a sequence
x0 < x1 < x2 < · · · < xk of elements. We want to insert
elements x1 , . . . , xk into the tree (k ≪ n).
time: O(log n); work: O(k log n)

[Figure: a (2, 3)-tree with keys a1, a4 in the root, keys a0 | a2 a3 | a5 a6 on the next level, and leaves a0 a1 a2 a3 a4 a5 a6 ∞.]

PA 4.5 Inserting into a (2, 3)-tree


© Harald Räcke 58
4.5 Inserting into a (2, 3)-tree
1. determine for every xi the leaf element before which it has
to be inserted
time: O(log n); work: O(k log n); CREW PRAM
all xi ’s that have to be inserted before the same element
form a chain
2. determine the largest/smallest/middle element of every
chain
time: O(log k); work: O(k);
3. insert the middle element of every chain
compute new chains
time: O(log n); work: O(ki log n + k); ki = #inserted
elements
(computing new chains is constant time)
4. repeat Step 3 for logarithmically many rounds
time: O(log n log k); work: O(k log n);
PA 4.5 Inserting into a (2, 3)-tree
© Harald Räcke 59
Step 3

[Figure: the (2, 3)-tree after inserting the middle elements x3, x5, x9 of the chains — the affected internal nodes are split and the new elements are placed before their successor leaves.]

ñ each internal node is split into at most two parts


ñ each split operation promotes at most one element
ñ hence, on every level we want to insert at most one element
per successor pointer
ñ we can use the same routine for every level

PA 4.5 Inserting into a (2, 3)-tree


© Harald Räcke 60
4.5 Inserting into a (2, 3)-tree

ñ Step 3, works in phases; one phase for every level of the tree
ñ Step 4, works in rounds; in each round a different set of
elements is inserted

Observation
We can start with phase i of round r as long as phase i of round
r − 1 and (of course), phase i − 1 of round r has finished.

This is called Pipelining. Using this technique we can perform all


rounds in Step 4 in just O(log k + log n) many parallel steps.

PA 4.5 Inserting into a (2, 3)-tree


© Harald Räcke 61
4.6 Symmetry Breaking

The following algorithm colors an n-node cycle with dlog ne


colors.

Algorithm 9 BasicColoring
1: for 1 ≤ i ≤ n pardo
2: col(i) ← i
3: ki ← smallest bit position where col(i) and col(S(i)) differ
4: col′(i) ← 2·ki + (ki-th bit of col(i))

(bit positions are numbered starting with 0)
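A sequential sketch of one round of Algorithm 9 (my own helper; colors are plain integers and k is extracted with bit operations):

def basic_coloring_step(col, S):
    """One round of deterministic coin tossing on a ring (sketch).

    col[i] is the current color of node i, S[i] its successor; adjacent
    nodes are assumed to have different colors.
    """
    new_col = []
    for i in range(len(col)):
        diff = col[i] ^ col[S[i]]
        k = (diff & -diff).bit_length() - 1   # lowest bit where they differ
        new_col.append(2 * k + ((col[i] >> k) & 1))
    return new_col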

PA 4.6 Symmetry Breaking


© Harald Räcke 62
4.6 Symmetry Breaking

[Figure: a ring on the 15 nodes 1–15; the table lists the result of one application of BasicColoring.]

 v    col    k   col′
 1    0001   1   2
 3    0011   2   4
 7    0111   0   1
14    1110   2   5
 2    0010   0   0
15    1111   0   1
 4    0100   0   0
 5    0101   0   1
 6    0110   1   3
 8    1000   1   2
10    1010   0   0
11    1011   0   1
12    1100   0   0
 9    1001   2   4
13    1101   2   5
4.6 Symmetry Breaking

Applying the algorithm to a coloring with bit-length t generates
a coloring with largest color at most

2(t − 1) + 1

and bit-length at most

⌈log2(2(t − 1) + 1)⌉ ≤ ⌈log2(2t)⌉ = ⌈log2(t)⌉ + 1

Applying the algorithm repeatedly generates a constant number
of colors after O(log* n) operations.

Note that the first inequality holds because 2(t − 1) + 1 is odd.

PA 4.6 Symmetry Breaking


© Harald Räcke 64
4.6 Symmetry Breaking
As long as the bit-length t ≥ 4 the bit-length decreases.

Applying the algorithm with bit-length 3 gives a coloring with


colors in the range 0, . . . , 5 = 2t − 1.

We can improve to a 3-coloring by successively re-coloring nodes


from a color-class:

Algorithm 10 ReColor
1: for ` ← 5 to 3
2: for 1 ≤ i ≤ n pardo
3: if col(i) = ` then
4: col(i) ← min{{0, 1, 2} \ {col(P [i]), col(S[i])}}

This requires time O(1) and work O(n).

PA 4.6 Symmetry Breaking


© Harald Räcke 65
4.6 Symmetry Breaking

Lemma 12
We can color vertices in a ring with three colors in O(log∗ n)
time and with O(n log∗ n) work.

not work optimal

PA 4.6 Symmetry Breaking


© Harald Räcke 66
4.6 Symmetry Breaking

Lemma 13
Given n integers in the range 0, . . . , O(log n), there is an
algorithm that sorts these numbers in O(log n) time using a
linear number of operations.

Proof: Exercise!

PA 4.6 Symmetry Breaking


© Harald Räcke 67
4.6 Symmetry Breaking

Algorithm 11 OptColor
1: for 1 ≤ i ≤ n pardo
2: col(i) ← i
3: apply BasicColoring once
4: sort vertices by colors
5: for ` = 2⌈log n⌉ to 3 do
6: for all vertices i of color ` pardo
7: col(i) ← min{{0, 1, 2} \ {col(P[i]), col(S[i])}}

We can perform Lines 6 and 7 in time O(n_`) only because we sorted before. In general a
statement like "for constraint pardo" should only contain a constraint on the ids of the
processors but not something complicated (like the color) which has to be checked and, hence,
induces work. Because of the sorting we can transform this complicated constraint into a
constraint on just the processor ids.

PA 4.6 Symmetry Breaking


© Harald Räcke 68
Lemma 14
A ring can be colored with 3 colors in time O(log n) and with
work O(n).

work optimal but not too fast

PA 4.6 Symmetry Breaking


© Harald Räcke 69
List Ranking

Input:
A list given by successor pointers;

4 5 7 3 1 2 6 8 9

Output:
For every node number of hops to end of the list;
4 5 7 3 1 2 6 8 9
8 7 6 5 4 3 2 1 0

Observation:
Special case of parallel prefix

PA 5 List Ranking
© Harald Räcke 70
List Ranking

4 5 7 3 1 2 6 8 9
1 3 1 1 1 2 2 1 0

1. Given a list with values; perhaps from previous


iterations.
The list is given via predecessor pointers P (i) and
successor pointers S(i).
S(4) = 5, S(2) = 6, P (3) = 7, etc.

PA 5 List Ranking
© Harald Räcke 71
List Ranking

4 5 7 3 1 2 6 8 9
1 3 1 1 1 2 2 1 0

2. Find an independent set; time: O(log n); work: O(n).

The independent set should contain a constant fraction


of the vertices.

Color vertices; take local minima

PA 5 List Ranking
© Harald Räcke 71
List Ranking

4 5 7 3 1 2 6 8 9
4 3 1 2 1 4 2 1 0

3. Splice the independent set out of the list;

At the independent set vertices the array still contains


old values for P (i) and S(i);

PA 5 List Ranking
© Harald Räcke 71
List Ranking

4 5 7 3 1 2 6 8 9
4 3 1 2 1 4 2 1 0

3 4 2 1 5 6
4 1 2 4 1 0

4. Compress remaining n0 nodes into a new array of n0


entries.
The index positions can be computed by a prefix sum
in time O(log n) and work O(n)
Pointers can then be adjusted in time O(1).

PA 5 List Ranking
© Harald Räcke 71
List Ranking

4 5 7 3 1 2 6 8 9
4 3 1 2 1 4 2 1 0

3 4 2 1 5 6
12 8 7 5 1 0

5. Solve the problem on the remaining list.


If current size is less than n/ log n do pointer jumping:
time O(log n); work O(n).
Otherwise continue shrinking the list by finding an
independent set

PA 5 List Ranking
© Harald Räcke 71
List Ranking

4 5 7 3 1 2 6 8 9
12 3 8 7 1 5 2 1 0

3 4 2 1 5 6
12 8 7 5 1 0

6. Map the values back into the larger list. Time: O(1);
Work: O(n)

PA 5 List Ranking
© Harald Räcke 71
List Ranking

4 5 7 3 1 2 6 8 9
12 11 8 7 6 5 3 1 0

3 4 2 1 5 6
12 8 7 5 1 0

7. Compute values for independent set nodes. Time:


O(1); Work: O(1).
8. Splice nodes back into list. Time: O(1); Work: O(1).

PA 5 List Ranking
© Harald Räcke 71
We need O(log log n) shrinking iterations until the size of the
remaining list reaches O(n/ log n).

Each shrinking iteration takes time O(log n).

The work for all shrinking operations is just O(n), as the size of
the list goes down by a constant factor in each round.

List Ranking can be solved in time O(log n log log n) and work
O(n) on an EREW-PRAM.

PA 5 List Ranking
© Harald Räcke 72
Optimal List Ranking

In order to reduce the work we have to improve the shrinking of


the list to O(n/ log n) nodes.

After this we apply pointer jumping

PA 5 List Ranking
© Harald Räcke 73
[Figure: a list of 24 nodes divided into consecutive blocks B1, . . . , B6 of length log n; processors p1, . . . , p5 point to the nodes of their blocks that they currently work on.]

ñ some nodes are active;
ñ active nodes without neighbouring active nodes are
isolated;
ñ the others form sublists;

1 delete isolated nodes from the list;
2 color each sublist with O(log log n) colors; time: O(1);
work: O(n);
Optimal List Ranking

Each iteration requires constant time and work O(n/ log n),
because we just work on one node in every block.

We need to prove that we just require O(log n) iterations to


reduce the size of the list to O(n/ log n).

PA 5 List Ranking
© Harald Räcke 75
Observations/Remarks:
ñ If the p-pointer of a block cannot be advanced without
leaving the block, the processor responsible for this block
simply stops working; all other blocks continue.
ñ The p-node of a block (the node pi is pointing to) at the
beginning of a round is either a ruler with a living subject or
the node will become active during the round.
ñ The subject nodes always lie to the left of the p-node of the
respective block (if it exists).
Measure of Progress:
ñ a ruler will delete a subject
ñ an active node either
ñ becomes a ruler (with a subject)
ñ becomes a subject
ñ is isolated and therefore gets deleted

PA 5 List Ranking
© Harald Räcke 76
Analysis

For the analysis we assign a weight to every node in every block


as follows.

Definition 15
The weight of the i-th node in a block is

(1 − q)i

1
with q = log log n , where the node-numbering starts from 0.
Hence, a block has nodes {0, . . . , log n − 1}.

PA 5 List Ranking
© Harald Räcke 77
Definition of Rulers

Properties:
ñ A ruler should have at most log log n subjects.
ñ The weight of a ruler should be at most the weight of any of
its subjects.
ñ Each ruler must have at least one subject.
ñ We must be able to remove the next subject in constant
time.
ñ We need to make the ruler/subject decision in constant
time.

PA 5 List Ranking
© Harald Räcke 78
Given a sublist of active nodes.

Color the sublist with O(log log n) colors. Take the local minima
w.r.t. this coloring.

If the first node is not a ruler


ñ if the second node is a ruler switch ruler status between
first and second
ñ otw. just make the first node into a ruler

This partitions the sub-list into chains of length at most


log log n each starting with a ruler

PA 5 List Ranking
© Harald Räcke 79
Now we change the ruler definition.

Consider some chain.

We make all local minima w.r.t. the weight function into a ruler;
ties are broken according to block-id (so that comparing weights
gives a strict inequality).

A ruler gets as subjects the nodes left of it until the next local
maximum (or the start of the chain) (including the local
maximum) and the nodes right of it until the next local
maximum (or the end of the chain) (excluding the local
maximum).

In case the first node is a ruler the above definition could leave it
without a subject. We use constant time to fix this in some
arbitrary manner

PA 5 List Ranking
© Harald Räcke 80
Set q = 1/log log n.

The i-th node in a block is assigned a weight of (1 − q)^i, 0 ≤ i < log n.

The total weight of a block is at most 1/q and the total weight of
all items is at most n/(q log n).

to show:
After O(log n) iterations the weight is at most
(n/ log n) · (1 − q)^(log n)

This means at most n/ log n nodes remain because the smallest
weight a node can have is (1 − q)^(log n − 1).

PA 5 List Ranking
© Harald Räcke 81
In every iteration the weight drops by a factor of

(1 − q/4) .

PA 5 List Ranking
© Harald Räcke 82
We consider subject nodes to just have half their weight.

We can view the step of becoming a subject as a precursor to


deletion.

Hence, a node loses half its weight when becoming a subject
and the remaining half when deleted.

Note that subject nodes will be deleted after just an additional


O(log log n) iterations.

PA 5 List Ranking
© Harald Räcke 83
The weight is reduced because
ñ An isolated node is removed.
ñ A node is labelled as ruler, and the corresponding subjects
reduce their weight by a factor of 1/2.
ñ A node is a ruler and deletes one of its subjects.

Hence, the weight reduction comes from p-nodes (ruler/active).

PA 5 List Ranking
© Harald Räcke 84
Each p-node is responsible for some other nodes; it has to
generate a weight reduction large enough so that the weight of
all nodes it is responsible for decreases by the desired factor.

An active node is responsible for all nodes that come after it in


its block.

A ruler is responsible for all nodes that come after it in its block
and for all its subjects.

Note that by this definition every node remaining in the list is


covered.

PA 5 List Ranking
© Harald Räcke 85
Case 1: Isolated Node
Suppose we delete an isolated node v that is the i-th node in its
block.

The weight of all nodes that v is responsible for is

Σ_{i ≤ j < log n} (1 − q)^j

This weight reduces to

Σ_{i < j < log n} (1 − q)^j ≤ (1 − q) · Σ_{i ≤ j < log n} (1 − q)^j

Hence, the weight reduces by a factor (1 − q) ≤ (1 − q/4).

PA 5 List Ranking
© Harald Räcke 86
Case 2: Creating Subjects
Suppose we generate a ruler with at least one subject.

Weight of ruler: (1 − q)^{i_1}.
Weight of subjects: (1 − q)^{i_j}, 2 ≤ j ≤ k.

Initial weight:

Q = Σ_{j=1}^{k} Σ_{i_j ≤ ℓ < log n} (1 − q)^ℓ ≤ (1/q) · Σ_{j=1}^{k} (1 − q)^{i_j} ≤ (2/q) · Σ_{j=2}^{k} (1 − q)^{i_j}

New weight:

Q′ = Q − (1/2) · Σ_{j=2}^{k} (1 − q)^{i_j} ≤ (1 − q/4) · Q
Case 3: Removing Subjects
weight of ruler: (1 − q)^{i_1}; weight of subjects: (1 − q)^{i_j}, 2 ≤ j ≤ k

Assume the ruler removes the subject with largest weight, say i2 (why?).

Initial weight:

Q = Σ_{i_1 ≤ ℓ < log n} (1 − q)^ℓ + (1/2) · Σ_{j=2}^{k} (1 − q)^{i_j}
  ≤ (1/q) · (1 − q)^{i_1} + (k/2) · (1 − q)^{i_2}
  ≤ (1/q) · (1 − q)^{i_2} + (1/(2q)) · (1 − q)^{i_2}

New weight:

Q′ = Q − (1/2) · (1 − q)^{i_2} ≤ (1 − q/3) · Q
After s iterations the weight is at most

(n/(q log n)) · (1 − q/4)^s ≤ (n/ log n) · (1 − q)^(log n)

Choosing s = 5 log n the inequality holds for sufficiently large n.

PA 5 List Ranking
© Harald Räcke 89
Tree Algorithms

[Figure: a tree on the nodes 1–9 together with its adjacency-list representation; each node stores an ordered list of its neighbours.]
Euler Circuits

Every node v fixes an arbitrary ordering among its adjacent


nodes:
u0 , u1 , . . . , ud−1

We obtain an Euler tour by setting

succ((ui , v)) = (v, u(i+1) mod d )
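A small sketch of this construction (assuming the tree is given by ordered adjacency lists; the function name is mine):

def euler_tour_succ(adj):
    """Euler-tour successor function of Lemma 16 (sketch).

    adj[v] is a fixed ordering u_0, ..., u_{d-1} of v's neighbours; for the
    incoming edge (u_i, v) the tour continues with (v, u_{(i+1) mod d}).
    """
    succ = {}
    for v, neigh in adj.items():
        d = len(neigh)
        for i, u in enumerate(neigh):
            succ[(u, v)] = (v, neigh[(i + 1) % d])
    return succ

# e.g. euler_tour_succ({1: [2, 3], 2: [1], 3: [1]}) yields the circuit
# (1,2) -> (2,1) -> (1,3) -> (3,1) -> (1,2)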

PA 6 Tree Algorithms
© Harald Räcke 91
Euler Circuits

Lemma 16
An Euler circuit can be computed in constant time O(1) with
O(n) operations.

PA 6 Tree Algorithms
© Harald Räcke 92
Euler Circuits — Applications

Rooting a tree
ñ split the Euler tour at node r
ñ this gives a list on the set of directed edges (Euler path)
ñ assign x[e] = 1 for every edge;
ñ perform parallel prefix; let s[·] be the result array
ñ if s[(u, v)] < s[(v, u)] then u is parent of v;

PA 6 Tree Algorithms
© Harald Räcke 93
Euler Circuits — Applications

Postorder Numbering
ñ split the Euler tour at node r
ñ this gives a list on the set of directed edges (Euler path)
ñ assign x[e] = 1 for every edge (v, parent(v))
ñ assign x[e] = 0 for every edge (parent(v), v)
ñ perform parallel prefix
ñ post(v) = s[(v, parent(v))]; post(r ) = n

PA 6 Tree Algorithms
© Harald Räcke 94
Euler Circuits — Applications

Level of nodes
ñ split the Euler tour at node r
ñ this gives a list on the set of directed edges (Euler path)
ñ assign x[e] = −1 for every edge (v, parent(v))
ñ assign x[e] = 1 for every edge (parent(v), v)
ñ perform parallel prefix
ñ level(v) = s[(parent(v), v)]; level(r ) = 0

PA 6 Tree Algorithms
© Harald Räcke 95
Euler Circuits — Applications

Number of descendants
ñ split the Euler tour at node r
ñ this gives a list on the set of directed edges (Euler path)
ñ assign x[e] = 0 for every edge (parent(v), v)
ñ assign x[e] = 1 for every edge (v, parent(v)), v ≠ r
ñ perform parallel prefix
ñ size(v) = s[(v, parent(v))] − s[(parent(v), v)]

PA 6 Tree Algorithms
© Harald Räcke 96
Rake Operation

Given a binary tree T .

Given a leaf u ∈ T with p(u) ≠ r the rake-operation does the


following
ñ remove u and p(u)
ñ attach sibling of u to p(p(u))

[Figure: a small binary tree before and after the rake — the leaf u and its parent p(u) are removed, and u's sibling is attached to p(p(u)).]

PA 6 Tree Algorithms
© Harald Räcke 97
We want to apply rake operations to a binary tree T until T just
consists of the root with two children.

Possible Problems:
1. we could concurrently apply the rake-operation to two
siblings
2. we could concurrently apply the rake-operation to two
leaves u and v such that p(u) and p(v) are connected
By choosing leaves carefully we ensure that none of the above
cases occurs

PA 6 Tree Algorithms
© Harald Räcke 98
Algorithm:
ñ label leaves consecutively from left to right (excluding
left-most and right-most leaf), and store them in an array A
ñ for dlog(n + 1)e iterations
ñ apply rake to all odd leaves that are left children
ñ apply rake operation to remaining odd leaves (odd at start
of round!!!)
ñ A=even leaves

PA 6 Tree Algorithms
© Harald Räcke 99
Observations
ñ the rake operation does not change the order of leaves
ñ two leaves that are siblings do not perform a rake operation
in the same round because one is even and one odd at the
start of the round
ñ two leaves that have adjacent parents either have different
parity (even/odd) or they differ in the type of child
(left/right)

PA 6 Tree Algorithms
© Harald Räcke 100
Cases when the edge between p(u) and p(v) is a left-child edge:

[Figure: the possible configurations of two leaves u and v whose parents are adjacent; in each case u and v differ in parity or in being a left/right child, so they never rake in the same round.]

PA 6 Tree Algorithms
© Harald Räcke 101
Example

[Figure: an example binary tree with root 17 and further nodes 1–16; the leaves are numbered from left to right and raked over several rounds.]

PA 6 Tree Algorithms
© Harald Räcke 102
ñ one iteration can be performed in constant time with O(|A|)
processors, where A is the array of leaves;
ñ hence, all iterations can be performed in O(log n) time and
O(n) work;
ñ the intial parallel prefix also requires time O(log n) and
work O(n)

PA 6 Tree Algorithms
© Harald Räcke 103
Evaluating Expressions
Suppose that we want to evaluate an expression tree, containing
additions and multiplications.

[Figure: two expression trees over the inputs A1, . . . , A8 — a balanced tree and a degenerate, path-like tree.]

If the tree is not balanced this may be time-consuming.

PA 6 Tree Algorithms
© Harald Räcke 104
We can use the rake-operation to do this quickly.

Applying the rake-operation changes the tree.

In order to maintain the value we introduce parameters av and


bv for every node that still allows to compute the value of a
node based on the value of its children.

Invariant:
Let u be internal node with children v and w. Then

val(u) = (av · val(v) + bv ) ⊗ (aw · val(w) + bw )

where ⊗ ∈ {∗, +} is the operation at node u.

Initially, we can choose av = 1 and bv = 0 for every node.

PA 6 Tree Algorithms
© Harald Räcke 105
Rake Operation

[Figure: node u with children v (a leaf with value x1) and w; u's parent is r. Raking v removes u and v and attaches w to r.]

Currently the value at u is

val(u) = (av · val(v) + bv) + (aw · val(w) + bw) = x1 + (aw · val(w) + bw)

In the expression for r this goes in as

au · [x1 + (aw · val(w) + bw)] + bu = (au · aw) · val(w) + (au · x1 + au · bw + bu)

so the new parameters for w are a′w = au · aw and b′w = au · x1 + au · bw + bu.
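A tiny sketch of this parameter update (my own helper; the '+' branch is exactly the calculation above, the '*' branch is the analogous computation for a multiplication node):

def rake_update(au, bu, aw, bw, x1, op):
    """New (a, b) parameters for the surviving child w after a rake (sketch)."""
    if op == '+':
        # val(u) = x1 + (aw*val(w) + bw), plugged into a_u*val(u) + b_u
        return au * aw, au * (x1 + bw) + bu
    else:  # op == '*'
        # val(u) = x1 * (aw*val(w) + bw)
        return au * x1 * aw, au * x1 * bw + bu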
PA 6 Tree Algorithms
© Harald Räcke 106
If we change the a and b-values during a rake-operation
according to the previous slide we can calculate the value of the
root in the end.

Lemma 17
We can evaluate an arithmetic expression tree in time O(log n)
and work O(n) regardless of the height or depth of the tree.

By performing the rake-operation in the reverse order we can


also compute the value at each node in the tree.

PA 6 Tree Algorithms
© Harald Räcke 107
Lemma 18
We compute tree functions for arbitrary trees in time O(log n)
and a linear number of operations.
proof on board...

PA 6 Tree Algorithms
© Harald Räcke 108
In the LCA (least common ancestor) problem we are given a tree
and the goal is to design a data-structure that answers
LCA-queries in constant time.

PA 6 Tree Algorithms
© Harald Räcke 109
Least Common Ancestor
LCAs on complete binary trees (inorder numbering):

[Figure: a complete binary tree on 15 nodes labelled 1, . . . , 15 in inorder, with binary representations 0001, . . . , 1111; the root is 8 = 1000, its children are 4 = 0100 and 12 = 1100.]

The least common ancestor of u and v is

z1 z2 . . . zi 1 0 . . . 0

where zi+1 is the first bit-position in which u and v differ.
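A sketch of this bit rule (my own helper, not from the slides; the case where one node is an ancestor of the other is not spelled out above and is handled here by an explicit subtree-range check):

def lca_inorder(u, v):
    """LCA of two nodes of a complete binary tree labelled in inorder."""
    if u == v:
        return u
    # ancestor case: x's subtree covers the open interval (x - low, x + low)
    for x, y in ((u, v), (v, u)):
        low = x & -x
        if x - low < y < x + low:
            return x
    i = (u ^ v).bit_length() - 1                   # highest differing bit
    return ((u >> (i + 1)) << (i + 1)) | (1 << i)  # common prefix, 1, zeros

# lca_inorder(5, 7) == 6, lca_inorder(3, 13) == 8, lca_inorder(4, 5) == 4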

PA 6 Tree Algorithms
© Harald Räcke 110
Least Common Ancestor

[Figure: a rooted tree with root 1 and children 2, 8, 9; node 2 has children 3 and 4; node 4 has children 5, 6, 7.]

nodes:  1 2 3 2 4 5 4 6 4 7 4 2 1 8 1 9 1
levels: 0 1 2 1 2 3 2 3 2 3 2 1 0 1 0 1 0

PA 6 Tree Algorithms
© Harald Räcke 111
`(v) is the index of the first appearance of v in the node-sequence.

r (v) is the index of the last appearance of v in the node-sequence.

`(v) and r (v) can be computed in constant time, given the


node- and level-sequence.

PA 6 Tree Algorithms
© Harald Räcke 112
Least Common Ancestor

Lemma 19

1. u is ancestor of v iff `(u) < `(v) < r (u)


2. u and v are not related iff either r (u) < `(v) or
r (v) < `(u)
3. suppose r (u) < `(v) then LCA(u, v) is vertex with
minimum level over interval [r (u), `(v)].

PA 6 Tree Algorithms
© Harald Räcke 113
Range Minima Problem

Given an array A[1 . . . n], a range minimum query (`, r ) consists


of a left index ` ∈ {1, . . . , n} and a right index r ∈ {1, . . . , n}.

The answer has to return the index of the minimum element in


the subsequence A[` . . . r ].

The goal in the range minima problem is to preprocess the array


such that range minima queries can be answered quickly
(constant time).

PA 6 Tree Algorithms
© Harald Räcke 114
Observation
Given an algorithm for solving the range minima problem in time
T (n) and work W (n) we can obtain an algorithm that solves the
LCA-problem in time O(T (n) + log n) and work O(n + W (n)).

Remark
In the sequential setting the LCA-problem and the range minima
problem are equivalent. This is not necessarily true in the
parallel setting.

For solving the LCA-problem it is sufficient to solve the restricted


range minima problem where two successive elements in the
array just differ by +1 or −1.

PA 6 Tree Algorithms
© Harald Räcke 115
Prefix and Suffix Minima
Tree with prefix-minima and suffix-minima:

[Figure: a complete binary tree over the 16-element array 6 4 2 3 4 5 1 6 0 5 1 6 3 4 5 3; every internal node stores the prefix minima and the suffix minima of the subsequence below it.]
PA 6 Tree Algorithms
© Harald Räcke 116
ñ Suppose we have an array A of length n = 2k
ñ We compute a complete binary tree T with n leaves.
ñ Each internal node corresponds to a subsequence of A. It
contains an array with the prefix and suffix minima of this
subsequence.

Given the tree T we can answer a range minimum query (`, r ) in


constant time.
ñ we can determine the LCA x of ` and r in constant time
since T is a complete binary tree
ñ Then we consider the suffix minimum of ` in the left child
of x and the prefix minimum of r in the right child of x.
ñ The minimum of these two values is the result.

PA 6 Tree Algorithms
© Harald Räcke 117
Lemma 20
We can solve the range minima problem in time O(log n) and
work O(n log n).

PA 6 Tree Algorithms
© Harald Räcke 118
Reducing the Work

Partition A into blocks Bi of length log n

Preprocess each Bi block separately by a sequential algorithm so


that range-minima queries within the block can be answered in
constant time. (how?)

For each block Bi compute the minimum xi and its prefix and
suffix minima.

Use the previous algorithm on the array (x1 , . . . , xn/ log n ).

PA 6 Tree Algorithms
© Harald Räcke 119
Answering a query (`, r):
ñ if ` and r are from the same block the data-structure for
this block gives us the result in constant time
ñ if ` and r are from different blocks the result is a minimum
of three elements:

• the suffix minmum of entry ` in `’s block

• the minimum among x`+1 , . . . , xr −1

• the prefix minimum of entry r in r ’s block

PA 6 Tree Algorithms
© Harald Räcke 120
Searching

An extension of binary search with p processors gives that one
can find the rank of an element in

log_{p+1}(n) = log n / log(p + 1)

many parallel steps with p processors. (not work-optimal)

This requires a CREW PRAM model. For the EREW model,
searching cannot be done faster than O(log n − log p) with p
processors even if there are p copies of the search key.
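A sequential simulation of this (p+1)-ary search (my own helper; each round's p comparisons are replayed in a loop):

def parallel_rank(X, y, p):
    """Compute rank(y : X) for sorted X by (p+1)-ary search (sketch).

    In every round the p processors compare y against p evenly spaced
    pivots of the remaining interval, shrinking it by a factor of roughly
    p+1, giving the log n / log(p+1) round bound.
    """
    lo, hi = 0, len(X)                       # rank(y:X) lies in {lo, ..., hi}
    while hi > lo:
        pivots = sorted({lo + (hi - lo) * (j + 1) // (p + 1) for j in range(p)})
        for q in pivots:                     # one comparison per processor
            if X[q] <= y:
                lo = max(lo, q + 1)
            else:
                hi = min(hi, q)
    return lo

# parallel_rank([1, 3, 5, 7, 9, 11, 13, 15], 8, 3) == 4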

PA 7 Searching and Sorting


© Harald Räcke 121
Merging

Given two sorted sequences A = (a1 , . . . , an ) and
B = (b1 , . . . , bn ), compute the sorted sequence C = (c1 , . . . , cn ).

Definition 21
Let X = (x1 , . . . , xt ) be a sequence. The rank rank(y : X) of y in
X is
rank(y : X) = |{x ∈ X | x ≤ y}|

For a sequence Y = (y1 , . . . , ys ) we define


rank(Y : X) := (r1 , . . . , rs ) with ri = rank(yi : X).

PA 7 Searching and Sorting


© Harald Räcke 122
Merging

We have already seen a merging-algorithm that runs in time


O(log n) and work O(n).

Using the fast search algorithm we can improve this to a running


time of O(log log n) and work O(n log log n).

PA 7 Searching and Sorting


© Harald Räcke 123
Merging
Input: A = a1 , . . . , an ; B = b1 , . . . , bm ; m ≤ n
1. if m < 4 then rank the elements of B, using the parallel search
algorithm with p processors. Time: O(1). Work: O(n).
2. Concurrently rank the elements b_{√m}, b_{2√m}, . . . , b_m in A using
the parallel search algorithm with p = √n. Time: O(1).
Work: O(√m · √n) = O(n)

j(i) := rank(b_{i√m} : A)
3. Let Bi = (b_{i√m + 1}, . . . , b_{(i+1)√m − 1}); and
Ai = (a_{j(i)+1}, . . . , a_{j(i+1)}).

Recursively compute rank(Bi : Ai).

4. Let k be an index that is not a multiple of √m, and let i = ⌈k/√m⌉. Then
rank(bk : A) = j(i) + rank(bk : Ai).

PA 7 Searching and Sorting


© Harald Räcke 124
The algorithm can be made work-optimal by standard
techniques.

proof on board...

PA 7 Searching and Sorting


© Harald Räcke 125
Mergesort

Lemma 22
A straightforward parallelization of Mergesort can be
implemented in time O(log n log log n) and with work O(n log n).

This assumes the CREW-PRAM model.

PA 7 Searching and Sorting


© Harald Räcke 126
Mergesort

Let L[v] denote the (sorted) sublist of elements stored at the


leaf nodes rooted at v.

We can view Mergesort as computing L[v] for a complete binary


tree where the leaf nodes correspond to nodes in the given array.

Since the merge-operations on one level of the complete binary


tree can be performed in parallel we obtain time O(h log log n)
and work O(hn), where h = O(log n) is the height of the tree.

PA 7 Searching and Sorting


© Harald Räcke 127
Pipelined Mergesort

We again compute L[v] for every node in the complete binary


tree.

After round s, Ls [v] is an approximation of L[v] that will be


improved in future rounds.

For s ≥ 3 height(v), Ls [v] = L[v].

PA 7 Searching and Sorting


© Harald Räcke 128
Pipelined Mergesort

In every round, a node v sends sample(Ls [v]) (an


approximation of its current list) upwards, and receives
approximations of the lists of its children.

It then computes a new approximation of its list.

A node is called active in round s if s ≤ 3 height(v) (this means


its list is not yet complete at the start of the round, i.e.,
Ls−1 [v] ≠ L[v]).

PA 7 Searching and Sorting


© Harald Räcke 129
Pipelined Mergesort

Algorithm 11 ColeSort()
1: initialize L0 [v] = Av for leaf nodes; L0 [v] = ∅ otherwise
2: for s ← 1 to 3 · height(T ) do
3: for all active nodes v do
4: // u and w children of v
5: L′s [u] ← sample(Ls−1 [u])
6: L′s [w] ← sample(Ls−1 [w])
7: Ls [v] ← merge(L′s [u], L′s [w])

sample(Ls [v]) = sample4 (Ls [v])  if s ≤ 3 height(v)
                 sample2 (Ls [v])  if s = 3 height(v) + 1
                 sample1 (Ls [v])  if s = 3 height(v) + 2

PA 7 Searching and Sorting


© Harald Räcke 130
Colesort

[Figure: an example run of the pipelined merge sort — for each round s the lists Ls[v] at the different levels of the tree are shown, growing from the leaves towards the root until the root list is completely sorted.]
PA 7 Searching and Sorting


© Harald Räcke 131
Pipelined Mergesort

Lemma 23
After round s = 3 height(v), the list Ls [v] is complete.

Proof:
ñ clearly true for leaf nodes
ñ suppose it is true for all nodes up to height h;
ñ fix a node v on level h + 1 with children u and w
ñ L3h [u] and L3h [w] are complete by induction hypothesis
ñ further sample(L3h+2 [u]) = L[u] and
sample(L3h+2 [w]) = L[w]
ñ hence in round 3h + 3 node v will merge the complete list
of its children; after the round L[v] will be complete

PA 7 Searching and Sorting


© Harald Räcke 132
Pipelined Mergesort

Lemma 24
The number of elements in lists Ls [v] for active nodes v is at
most O(n).

proof on board...

PA 7 Searching and Sorting


© Harald Räcke 133
Definition 25
A sequence X is a c-cover of a sequence Y if for any two
consecutive elements α, β from (−∞, X, ∞) we have
|{yi | α ≤ yi ≤ β}| ≤ c.

PA 7 Searching and Sorting


© Harald Räcke 134
Pipelined Mergesort

Lemma 26
L0s [v] is a 4-cover of L0s+1 [v].

If [a, b] fulfills |[a, b] ∩ (A ∪ {−∞, ∞})| = k we say [a, b]


intersects (−∞, A, +∞) in k items.

Lemma 27
If [a, b] with a, b ∈ L0s [v] ∪ {−∞, ∞} intersects (−∞, L0s [v], ∞) in
k ≥ 2 items, then [a, b] intersects (−∞, L0s+1 , ∞) in at most 2k
items.

PA 7 Searching and Sorting


© Harald Räcke 135
[Derivation: consider an interval that intersects (−∞, L′s [v], ∞) in k items. Then it intersects
Ls−1 [v] in at most 4k − 3 items;
L′s−1 [u] in p and L′s−1 [w] in q items with p + q ≤ 4k − 1;
L′s [u] in at most 2p and L′s [w] in at most 2q items;
Ls [v] in at most 2p + 2q ≤ 8k − 2 items;
L′s+1 [v] in at most 2k + 1/4, and hence at most 2k items.]

Note that the last step holds as long as L′s+1 [v] = sample4 (Ls [v]). Otherwise Ls−1 [v] has
already been full, and hence L′s [v], L′s+1 [v], L′s+2 [v] are 4-covers of the complete list L[v],
and also 4-covers of each other.
Merging with a Cover

Lemma 28
Given two sorted sequences A and B. Let X be a c-cover of A and
B for constant c, and let rank(X : A) and rank(X : B) be known.

We can merge A and B in time O(1) using O(|X|) operations.

PA 7 Searching and Sorting


© Harald Räcke 137
Merging with a Cover

Lemma 29
Given two sorted sequences A and B. Let X be a c-cover of B for
constant c, and let rank(A : X) and rank(X : B) be known.

We can compute rank(A : B) using O(|X| + |A|) operations.

PA 7 Searching and Sorting


© Harald Räcke 138
Merging with a Cover

Lemma 30
Given two sorted sequences A and B. Let X be a c-cover of B for
constant c, and let rank(A : X) and rank(X : B) be known.

We can compute rank(B : A) using O(|X| + |A|) operations.

Easy to do with concurrent read. Can also be done with exclusive


read but non-trivial.

PA 7 Searching and Sorting


© Harald Räcke 139
In order to do the merge in iteration s + 1 in constant time we
need to know

rank(Ls [v] : L0s+1 [u]) and rank(Ls [v] : L0s+1 [w])

and we need to know that Ls [v] is a 4-cover of L0s+1 [u] and


L0s+1 [w].

PA 7 Searching and Sorting


© Harald Räcke 140
Lemma 31
Ls [v] is a 4-cover of L0s+1 [u] and L0s+1 [w].

ñ Ls [v] ⊇ L0s [u], L0s [w]


ñ L0s [u] is 4-cover of L0s+1 [u]
ñ Hence, Ls [v] is 4-cover of L0s+1 [u] as adding more elements
cannot destroy the cover-property.

PA 7 Searching and Sorting


© Harald Räcke 141
Analysis
Lemma 32
Suppose we know for every internal node v with children u and
w
ñ rank(L0s [v] : L0s+1 [v])
ñ rank(L0s [u] : L0s [w])
ñ rank(L0s [w] : L0s [u])

We can compute
ñ rank(L0s+1 [v] : L0s+2 [v])
ñ rank(L0s+1 [u] : L0s+1 [w])
ñ rank(L0s+1 [w] : L0s+1 [u])
in constant time and O(|Ls+1 [v]|) operations, where v is the
parent of u and w.

PA 7 Searching and Sorting


© Harald Räcke 142
Given
ñ rank(L0s [u] : L0s+1 [u]) (4-cover)
ñ rank(L0s [w] : L0s [u])
ñ rank(L0s [u] : L0s [w])
ñ rank(L0s [w] : L0s+1 [w]) (4-cover)
Compute
ñ rank(L0s+1 [w] : L0s [u])
ñ rank(L0s+1 [u] : L0s [w])
Compute
ñ rank(L0s+1 [w] : L0s+1 [u])
ñ rank(L0s+1 [u] : L0s+1 [w])

ranks between siblings can be computed easily

PA 7 Searching and Sorting


© Harald Räcke 143
Given
ñ rank(L0s [u] : L0s+1 [u]) (4-cover → rank(L0s+1 [u] : L0s [u]))
ñ rank(L0s [w] : L0s+1 [u])
ñ rank(L0s [u] : L0s+1 [w])
ñ rank(L0s [w] : L0s+1 [w]) (4-cover → rank(L0s+1 [w] : L0s [w]))
Compute (recall that Ls [v] = merge(L0s [u], L0s [w]))
ñ rank(Ls [v] : L0s+1 [u])
ñ rank(Ls [v] : L0s+1 [w])
Compute
ñ rank(Ls [v] : Ls+1 [v]) (by adding)
ñ rank(L0s+1 [v] : L0s+2 [v]) (by sampling)

PA 7 Searching and Sorting


© Harald Räcke 144
Definition 33
A 0-1 sequence S is bitonic if it can be written as the
concatenation of subsequences S1 and S2 such that either
ñ S1 is monotonically increasing and S2 monotonically
decreasing, or
ñ S1 is monotonically decreasing and S2 monotonically
increasing.

Note, that this just defines bitonic 0-1 sequences. Bitonic


sequences are defined differently.

PA 8 Sorting Networks
© Harald Räcke 145
Bitonic Merger
If we feed a bitonic 0-1 sequence S into the network on the right
(one column of comparators, each comparing a position in the first
half with the corresponding position in the second half) we obtain
two sequences ST and SB s.t.
1. SB ≤ ST (element-wise)
2. SB and ST are bitonic

Proof:
ñ assume wlog. that S has more 1's than 0's.
ñ assume for contradiction that two 0s meet at the same comparator, at positions i and j
ñ everything 0 btw. i and j means we have more than 50% zeros (contradiction).
ñ all 1s btw. i and j means we have less than 50% ones (contradiction).
ñ a 1 btw. i and j together with a 1 elsewhere means S is not bitonic (contradiction).
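A minimal sketch of the comparator column discussed above (my own helper; it assumes the minima are routed to the bottom half SB and the maxima to the top half ST):

def half_cleaner(seq):
    """One comparator level of the bitonic merger (sketch).

    For a bitonic 0-1 input this yields SB <= ST elementwise, with both
    halves again bitonic.
    """
    n = len(seq)
    out = list(seq)
    for i in range(n // 2):
        a, b = out[i], out[i + n // 2]
        out[i], out[i + n // 2] = min(a, b), max(a, b)
    return out[: n // 2], out[n // 2:]       # (SB, ST)

# half_cleaner([0, 0, 1, 1, 1, 1, 0, 0]) == ([0, 0, 0, 0], [1, 1, 1, 1])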
Bitonic Merger

[Figure: the bitonic merger Bd — a column of comparators followed by two copies of Bd−1, one for the top half and one for the bottom half of the outputs.]

The bitonic merger Bd of dimension d is constructed by combining
two bitonic mergers of dimension d − 1.

If we feed a bitonic 0-1 sequence into this, the sequence will be sorted.

(actually, any bitonic sequence will be sorted, but we do not prove this)
Bitonic Sorter Sd

[Figure: the bitonic sorter Sd — two copies of Sd−1 sorting the two halves in opposite directions (producing a bitonic sequence), followed by a bitonic merger.]
Bitonic Merger: (n = 2^d)
ñ comparators: C(n) = 2C(n/2) + n/2 ⇒ C(n) = O(n log n).
ñ depth: D(n) = D(n/2) + 1 ⇒ D(n) = O(log n).

Bitonic Sorter: (n = 2^d)
ñ comparators: C(n) = 2C(n/2) + O(n log n) ⇒
C(n) = O(n log² n).
ñ depth: D(n) = D(n/2) + log n ⇒ D(n) = Θ(log² n).

PA 8 Sorting Networks
© Harald Räcke 149
Odd-Even Merge
How to merge two sorted sequences?
A = (a1 , a2 , . . . , an ), B = (b1 , b2 , . . . , bn ), n even.

Split into odd and even sequences:


Aodd = (a1 , a3 , a5 , . . . , an−1 ), Aeven = (a2 , a4 , a6 , . . . an )
Bodd = (b1 , b3 , b5 , . . . , bn−1 ), Beven = (b2 , b4 , b6 , . . . , bn )

Let

X = merge(Aodd , Bodd ) and Y = merge(Aeven , Beven )

Then

S = (x1 , min{x2 , y1 }, max{x2 , y1 }, min{x3 , y2 }, . . . , yn )

PA 8 Sorting Networks
© Harald Räcke 150
Odd-Even Merge

[Figure: the odd-even merger Md — two copies of Md−1 (one merging the odd-indexed, one the even-indexed subsequences) followed by one final column of comparators.]
Theorem 34
There exists a sorting network with depth O(log n) and
O(n log n) comparators.

PA 8 Sorting Networks
© Harald Räcke 152
Parallel Comparison Tree Model

A parallel comparison tree (with parallelism p) is a 3^p-ary tree.

ñ each internal node represents a set of p comparisons between p pairs (not necessarily distinct)
ñ a leaf v corresponds to a unique permutation that is valid
for all the comparisons on the path from the root to v
ñ the number of parallel steps is the height of the tree

PA 9 Lower Bounds
© Harald Räcke 153
Comparison PRAM

A comparison PRAM is a PRAM where we can only compare the


input elements;
ñ we cannot view them as strings
ñ we cannot do calculations on them

A lower bound for the comparison tree with parallelism p


directly carries over to the comparison PRAM with p processors.

PA 9 Lower Bounds
© Harald Räcke 154
A Lower Bound for Searching

Theorem 35
Given a sorted table X of n elements and an element y. Searching for y in X requires Ω(log n / log(p + 1)) steps in the parallel comparison tree with parallelism p < n.

PA 9 Lower Bounds
© Harald Räcke 155
A Lower Bound for Maximum

Theorem 36
A graph G with m edges and n vertices has an independent set of at least n²/(2m + n) vertices.

base case (n = 1)
ñ The only graph with one vertex has m = 0, and an
independent set of size 1.

PA 9 Lower Bounds
© Harald Räcke 156
induction step (1, . . . , n → n + 1)
ñ Let G be a graph with n + 1 vertices, and v a node with minimum degree (d).
ñ Let G' be the graph after deleting v and its adjacent vertices from G.
ñ n' = n − (d + 1)
ñ m' ≤ m − (d/2)·(d + 1), as we remove d + 1 vertices, each with degree at least d
ñ In G' there is an independent set of size (n')²/(2m' + n').
ñ By adding v we obtain an independent set of size

      1 + (n')²/(2m' + n') ≥ n²/(2m + n)
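The induction is constructive: repeatedly pick a minimum-degree vertex, add it to the independent set, and delete it together with its neighbours. A small sketch of that greedy procedure (my own illustration):

    def greedy_independent_set(adj):
        # adj: dict mapping each vertex to the set of its neighbours
        adj = {v: set(nb) for v, nb in adj.items()}
        result = []
        while adj:
            v = min(adj, key=lambda u: len(adj[u]))   # minimum-degree vertex
            result.append(v)
            removed = adj[v] | {v}                    # delete v and its neighbours
            for u in removed:
                adj.pop(u, None)
            for u in adj:
                adj[u] -= removed
        return result

    # n = 4 vertices, m = 5 edges: the bound guarantees at least 16/14, i.e. 2 vertices
    print(greedy_independent_set({0: {1, 3}, 1: {0, 2, 3}, 2: {1, 3}, 3: {0, 1, 2}}))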
A Lower Bound for Maximum

Theorem 37
Computing the maximum of n elements in the comparison tree
requires Ω(log log n) steps whenever the degree of parallelism is
p ≤ n.

Theorem 38
Computing the maximum of n elements requires Ω(log log n)
steps on the comparison PRAM with n processors.

PA 9 Lower Bounds
© Harald Räcke 158
An adversary can specify the input such that at the end of the (i + 1)-st step the maximum lies in a set C_{i+1} of size s_{i+1} such that
ñ no two elements of C_{i+1} have been compared
ñ s_{i+1} ≥ s_i²/(2p + s_i)  (apply Theorem 36 to the graph of comparisons within C_i)

PA 9 Lower Bounds
© Harald Räcke 159
Theorem 39
The selection problem requires Ω(log n/ log log n) steps on a
comparison PRAM.

not proven yet

PA 9 Lower Bounds
© Harald Räcke 160
A Lower Bound for Merging

The (k, s)-merging problem asks to merge k pairs of subsequences A^1, . . . , A^k and B^1, . . . , B^k, where we know that all elements in A^i ∪ B^i are smaller than the elements in A^j ∪ B^j for i < j. Further, |A^i|, |B^i| ≥ s.

PA 9 Lower Bounds
© Harald Räcke 161
A Lower Bound for Merging

Lemma 40
Suppose we are given a parallel comparison tree with parallelism p to solve the (k, s) merging problem. After the first step an adversary can specify the input such that an arbitrary (k', s') merging problem has to be solved, where

    k' = (3/4)·√(pk)
    s' = (s/4)·√(k/p)

PA 9 Lower Bounds
© Harald Räcke 162
A Lower Bound for Merging

Partition the A^i's and B^i's into blocks of length roughly s/ℓ; hence ℓ blocks each.

Define an ℓ × ℓ binary matrix M^i, where M^i_{xy} is 0 iff the parallel step did not compare an element from A^i_x with an element from B^i_y.

The matrix has 2ℓ − 1 diagonals.

PA 9 Lower Bounds
© Harald Räcke 163
Choose for every i the diagonal of M^i that has the most zeros.

Pair A^i_{j+d_i} with B^i_j (where d_i ∈ {−(ℓ − 1), . . . , ℓ − 1} specifies the chosen diagonal) for all j for which the entry in M^i is zero.

We can choose the values s.t. the elements for the j-th pair along the diagonal are all smaller than those for the (j + 1)-th pair.

Hence, we get a (k', s') problem.

PA 9 Lower Bounds
© Harald Räcke 164
How many pairs do we have?
ñ there are kℓ blocks in total
ñ there are k · ℓ² matrix entries in total
ñ there are at least k · ℓ² − p zeros.
ñ choosing a random diagonal (the same for every matrix M^i) hits at least

      (kℓ² − p)/(2ℓ − 1) ≥ kℓ/2 − p/(2ℓ)

  zeros.
ñ Choosing ℓ = ⌈2·√(p/k)⌉ gives

      k' ≥ (3/4)·√(pk)   and   s' = ⌊s/ℓ⌋ ≥ s/(4·√(p/k)) = (s/4)·√(k/p),

  where we assume s ≥ 6·√(p/k).

PA 9 Lower Bounds
© Harald Räcke 165
Lemma 41
Let T(k, s, p) be the number of parallel steps required on a comparison tree to solve the (k, s) merging problem. Then

    T(k, s, p) ≥ (1/4) · log( log √k / log(p/(ks)) ),

provided that p ≥ 2ks and p ≤ ks²/36.

PA 9 Lower Bounds
© Harald Räcke 166
Induction Step:

Assume that

    T(k', s', p) ≥ (1/4) · log( log √(k') / log( p/(k's') ) )
               ≥ (1/4) · log( log √( (3/4)·√(pk) ) / log( (16/3) · p/(ks) ) )
               ≥ (1/4) · log( ( (1/2) · log √k ) / ( 7 · log(p/(ks)) ) )
               ≥ (1/4) · log( log √k / log(p/(ks)) ) − 1

This gives the induction step.

PA 9 Lower Bounds
© Harald Räcke 167
Theorem 42
Merging requires at least Ω(log log n) time on a CRCW PRAM
with n processors.

PA 9 Lower Bounds
© Harald Räcke 168
Simulations between PRAMs

Theorem 43
We can simulate a p-processor priority CRCW PRAM on a
p-processor EREW PRAM with slowdown O(log p).

PA 10 Simulations between PRAMs


© Harald Räcke 169
Simulations between PRAMs

Theorem 44
We can simulate a p-processor priority CRCW PRAM on a
p log p-processor common CRCW PRAM with slowdown O(1).

PA 10 Simulations between PRAMs


© Harald Räcke 170
Simulations between PRAMs

Theorem 45
We can simulate a p-processor priority CRCW PRAM on a
log p
p-processor common CRCW PRAM with slowdown O( log log p ).

PA 10 Simulations between PRAMs


© Harald Räcke 171
Simulations between PRAMs

Theorem 46
We can simulate a p-processor priority CRCW PRAM on a
p-processor arbitrary CRCW PRAM with slowdown O(log log p).

PA 10 Simulations between PRAMs


© Harald Räcke 172
Lower Bounds for the CREW PRAM

Ideal PRAM:
ñ every processor has unbounded local memory
ñ in each step a processor reads a global variable
ñ then it does some (unbounded) computation on its local
memory
ñ then it writes a global variable

PA 10 Simulations between PRAMs


© Harald Räcke 173
Lower Bounds for the CREW PRAM

Definition 47
An input index i affects a memory location M at time t on some
input I if the content of M at time t differs between inputs I and
I(i) (i-th bit flipped).

L(M, t, I) = {i | i affects M at time t on input I}

PA 10 Simulations between PRAMs


© Harald Räcke 174
Lower Bounds for the CREW PRAM

Definition 48
An input index i affects a processor P at time t on some input I
if the state of P at time t differs between inputs I and I(i) (i-th
bit flipped).

K(P , t, I) = {i | i affects P at time t on input I}

PA 10 Simulations between PRAMs


© Harald Räcke 175
Lower Bounds for the CREW PRAM

Lemma 49
If i ∈ K(P , t, I) with t > 1 then either
ñ i ∈ K(P , t − 1, I), or
ñ P reads a global memory location M on input I at time t,
and i ∈ L(M, t − 1, I).

PA 10 Simulations between PRAMs


© Harald Räcke 176
Lower Bounds for the CREW PRAM

Lemma 50
If i ∈ L(M, t, I) with t > 1 then either
ñ A processor writes into M at time t on input I and
i ∈ K(P , t, I), or
ñ No processor writes into M at time t on input I and
ñ either i ∈ L(M, t − 1, I)
ñ or a processor P writes into M at time t on input I(i).

PA 10 Simulations between PRAMs


© Harald Räcke 177
Let k0 = 0, `0 = 1 and define

kt+1 = kt + `t and `t+1 = 3kt + 4`t

Lemma 51
|K(P , t, I)| ≤ kt and |L(M, t, I)| ≤ `t for any t ≥ 0

PA 10 Simulations between PRAMs


© Harald Räcke 178
base case (t = 0):
ñ No index can influence the local memory/state of a
processor before the first step (hence |K(P , 0, I)| = k0 = 0).
ñ Initially every index in the input affects exactly one memory
location. Hence |L(M, 0, I)| = 1 = `0 .

PA 10 Simulations between PRAMs


© Harald Räcke 179
induction step (t → t + 1):

K(P , t + 1, I) ⊆ K(P , t, I) ∪ L(M, t, I), where M is the location


read by P in step t + 1.

Hence,

|K(P , t + 1, I)| ≤ |K(P , t, I)| + |L(M, t, I)|


≤ kt + `t

PA 10 Simulations between PRAMs


© Harald Räcke 180
induction step (t → t + 1):

For the bound on |L(M, t + 1, I)| we have two cases.

Case 1:
A processor P writes into location M at time t + 1 on input I.

Then,

|L(M, t + 1, I)| ≤ |K(P , t + 1, I)|


≤ kt + `t
≤ 3kt + 4`t = `t+1

PA 10 Simulations between PRAMs


© Harald Räcke 181
Case 2:
No processor P writes into location M at time t + 1 on input I.

An index i affects M at time t + 1 iff i affects M at time t or


some processor P writes into M at t + 1 on I(i).

L(M, t + 1, I) ⊆ L(M, t, I) ∪ Y (M, t + 1, I)

Y(M, t + 1, I) is the set of indices u_j that cause some processor P_{w_j} to write into M at time t + 1 on input I(u_j).

PA 10 Simulations between PRAMs


© Harald Räcke 182
Y(M, t + 1, I) is the set of indices u_j that cause some processor P_{w_j} to write into M at time t + 1 on input I(u_j).

Fact:
For all pairs us , ut with Pws ≠ Pwt either
us ∈ K(Pwt , t + 1, I(ut )) or ut ∈ K(Pws , t + 1, I(us )).

Otherwise, Pwt and Pws would both write into M at the same
time on input I(us )(ut ).

PA 10 Simulations between PRAMs


© Harald Räcke 183
Let U = {u1 , . . . , ur } denote all indices that cause some
processor to write into M.

Let V = {(I(u1 ), Pw1 ), . . . }.

We set up a bipartite graph between U and V , such that


(ui , (I(uj ), Pwj )) ∈ E if ui affects Pwj at time t + 1 on input
I(uj ).

Each vertex (I(uj ), Pwj ) has degree at most kt+1 as this is an


upper bound on indices that can influence a processor Pwj .

Hence, |E| ≤ r · kt+1 .

PA 10 Simulations between PRAMs


© Harald Räcke 184
For an index u_j there can be at most k_{t+1} indices u_i with P_{w_i} = P_{w_j}.

Hence, there must be at least (1/2)·r·(r − k_{t+1}) pairs u_i, u_j with P_{w_i} ≠ P_{w_j}.

Each pair introduces at least one edge.

Hence,
    |E| ≥ (1/2)·r·(r − k_{t+1})

This gives r ≤ 3k_{t+1} ≤ 3k_t + 3ℓ_t.

PA 10 Simulations between PRAMs


© Harald Räcke 185
Recall that L(M, t + 1, I) ⊆ L(M, t, I) ∪ Y(M, t + 1, I). Hence,

    |L(M, t + 1, I)| ≤ ℓ_t + r ≤ 3k_t + 4ℓ_t

PA 10 Simulations between PRAMs


© Harald Räcke 186
The recurrence can be written in matrix form:

    (k_{t+1}, ℓ_{t+1})^T = A · (k_t, ℓ_t)^T   with   A = [[1, 1], [3, 4]]   and   (k_0, ℓ_0)^T = (0, 1)^T

Eigenvalues:

    λ_1 = (5 + √21)/2   and   λ_2 = (5 − √21)/2

Eigenvectors:

    v_1 = (1, λ_1 − 1)^T = (1, 3/2 + (1/2)√21)^T   and   v_2 = (1, λ_2 − 1)^T = (1, 3/2 − (1/2)√21)^T

Writing the start vector in this basis:

    (k_0, ℓ_0)^T = (0, 1)^T = (1/√21) · (v_1 − v_2)

and hence

    (k_t, ℓ_t)^T = (1/√21) · (λ_1^t · v_1 − λ_2^t · v_2)

Solving the recurrence gives

    k_t = (λ_1^t − λ_2^t)/√21
    ℓ_t = ((3 + √21)/(2√21)) · λ_1^t + ((√21 − 3)/(2√21)) · λ_2^t

with λ_1 = (5 + √21)/2 and λ_2 = (5 − √21)/2.

In particular k_t, ℓ_t ≤ λ_1^t; so if an output location must be affected by all n input bits after T steps, we need λ_1^T ≥ n, i.e., T ≥ log n / log λ_1 = Ω(log n).
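A quick numerical sanity check of this closed form (my own verification code, not part of the lecture):

    from math import sqrt

    l1, l2 = (5 + sqrt(21)) / 2, (5 - sqrt(21)) / 2
    k, l = 0, 1
    for t in range(12):
        k_closed = (l1 ** t - l2 ** t) / sqrt(21)
        l_closed = (3 + sqrt(21)) / (2 * sqrt(21)) * l1 ** t \
                 + (sqrt(21) - 3) / (2 * sqrt(21)) * l2 ** t
        assert abs(k - k_closed) < 1e-6 and abs(l - l_closed) < 1e-6
        k, l = k + l, 3 * k + 4 * l                  # the recurrence from above
    print("closed form matches for t = 0, ..., 11")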

PA 10 Simulations between PRAMs


© Harald Räcke 189
Theorem 52
The following problems require logarithmic time on a CREW
PRAM.
ñ Sorting a sequence of x1 , . . . , xn with xi ∈ {0, 1}
ñ Computing the maximum of n inputs
ñ Computing the sum x1 + · · · + xn with xi ∈ {0, 1}

PA 10 Simulations between PRAMs


© Harald Räcke 190
A Lower Bound for the EREW PRAM

Definition 53 (Zero Counting Problem)


Given a monotone binary sequence x1 , x2 , . . . , xn determine the
index i such that xi = 0 and xi+1 = 1.

We show that this problem requires Ω(log n − log p) steps on a


p-processor EREW PRAM.

PA 10 Simulations between PRAMs


© Harald Räcke 191
Let I_i be the input with i zeros followed by n − i ones.

Index i affects processor P at time t if the state of P after step t differs between inputs I_{i−1} and I_i.

Index i affects location M at time t if the content of M after step


t differs between inputs Ii−1 and Ii .

PA 10 Simulations between PRAMs


© Harald Räcke 192
Lemma 54
If i ∈ K(P , t) then either
ñ i ∈ K(P , t − 1), or
ñ P reads some location M on input Ii (and, hence, also on
Ii−1 ) at step t and i ∈ L(M, t − 1)

PA 10 Simulations between PRAMs


© Harald Räcke 193
Lemma 55
If i ∈ L(M, t) then either
ñ i ∈ L(M, t − 1), or
ñ some processor P writes M at step t on input I_i and i ∈ K(P, t), or
ñ some processor P writes M at step t on input I_{i−1} and i ∈ K(P, t).

PA 10 Simulations between PRAMs


© Harald Räcke 194
Define
    C(t) = Σ_P |K(P, t)| + Σ_M max{0, |L(M, t)| − 1}

C(T) ≥ n,   C(0) = 0

Claim:
    C(t) ≤ 6·C(t − 1) + 3|P|

This gives C(T) ≤ ((6^T − 1)/5) · 3|P| and hence T = Ω(log n − log |P|).

PA 10 Simulations between PRAMs


© Harald Räcke 195
For an index i to newly appear in L(M, t) some processor must
write into M on either input Ii or Ii−1 .

Hence, any index in K(P , t) can at most generate two new


indices in L(M, t).

This means that the number of new indices in any set L(M, t) (summed over all M) is at most 2·Σ_P |K(P, t)|.

PA 10 Simulations between PRAMs


© Harald Räcke 196
Hence,
    Σ_M |L(M, t)| ≤ Σ_M |L(M, t − 1)| + 2·Σ_P |K(P, t)|

We can assume wlog. that L(M, t − 1) ⊆ L(M, t). Then

    Σ_M max{0, |L(M, t)| − 1} ≤ Σ_M max{0, |L(M, t − 1)| − 1} + 2·Σ_P |K(P, t)|

PA 10 Simulations between PRAMs


© Harald Räcke 197
For an index i to newly appear in K(P, t), P must read a memory location M with i ∈ L(M, t − 1) on input I_i (and also on input I_{i−1}).

Since we are in the EREW model at most one processor can do so


in every step.

Let J(i, t) be the set of memory locations read in step t on input I_i, and let J_t = ∪_i J(i, t).

    Σ_P |K(P, t)| ≤ Σ_P |K(P, t − 1)| + Σ_{M ∈ J_t} |L(M, t − 1)|

Over all inputs Ii a processor can read at most |K(P , t − 1)| + 1


different memory locations (why?).

PA 10 Simulations between PRAMs


© Harald Räcke 198
Hence,
    Σ_P |K(P, t)| ≤ Σ_P |K(P, t − 1)| + Σ_{M ∈ J_t} |L(M, t − 1)|
                 ≤ Σ_P |K(P, t − 1)| + Σ_{M ∈ J_t} (|L(M, t − 1)| − 1) + |J_t|
                 ≤ 2·Σ_P |K(P, t − 1)| + Σ_{M ∈ J_t} (|L(M, t − 1)| − 1) + |P|
                 ≤ 2·Σ_P |K(P, t − 1)| + Σ_M max{0, |L(M, t − 1)| − 1} + |P|

Recall
    Σ_M max{0, |L(M, t)| − 1} ≤ Σ_M max{0, |L(M, t − 1)| − 1} + 2·Σ_P |K(P, t)|

PA 10 Simulations between PRAMs


© Harald Räcke 199
This gives
    Σ_P |K(P, t)| + Σ_M max{0, |L(M, t)| − 1}
        ≤ 4·Σ_M max{0, |L(M, t − 1)| − 1} + 6·Σ_P |K(P, t − 1)| + 3|P|

Hence,
    C(t) ≤ 6·C(t − 1) + 3|P|

PA 10 Simulations between PRAMs


© Harald Räcke 200
Lower Bounds for CRCW PRAMS

Theorem 56
Let f : {0, 1}^n → {0, 1} be an arbitrary Boolean function. f can be computed in O(1) time on a common CRCW PRAM with ≤ n·2^n processors.

Can we obtain non-constant lower bounds if we restrict the


number of processors to be polynomial?

PA 10 Simulations between PRAMs


© Harald Räcke 201
Boolean Circuits

ñ nodes are either AND, OR, or NOT gates or are special


INPUT/OUTPUT nodes
ñ AND and OR gates have unbounded fan-in (indegree) and unbounded fan-out (outdegree)
ñ NOT gates have unbounded fan-out
ñ INPUT nodes have indegree zero; OUTPUT nodes have
outdegree zero
ñ size is the number of edges
ñ depth is the longest path from an input to an output

PA 10 Simulations between PRAMs


© Harald Räcke 202
Theorem 57
Let f : {0, 1}n → {0, 1}m be a function with n inputs and m ≤ n
outputs, and circuit C computes f with depth D(n) and size
S(n). Then f can be computed by a common CRCW PRAM in
O(D(n)) time using S(n) processors.
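The idea is to assign one processor per gate and evaluate the circuit level by level; after round t all gates of depth at most t know their value. A sequential sketch of this schedule (my own illustration with a toy gate encoding, not the lecture's construction):

    # gates: name -> (kind, list of input names); kinds: 'AND', 'OR', 'NOT'
    def evaluate_by_levels(gates, depth_of, inputs, D):
        val = dict(inputs)                       # values of the INPUT nodes
        for t in range(1, D + 1):                # round t fires all depth-t gates in parallel
            for g, (kind, ins) in gates.items():
                if depth_of[g] != t:
                    continue
                if kind == 'AND':
                    val[g] = all(val[i] for i in ins)
                elif kind == 'OR':
                    val[g] = any(val[i] for i in ins)
                else:                            # 'NOT'
                    val[g] = not val[ins[0]]
        return val

    gates = {'g1': ('AND', ['x0', 'x1']), 'g2': ('NOT', ['x2']), 'g3': ('OR', ['g1', 'g2'])}
    depth_of = {'g1': 1, 'g2': 1, 'g3': 2}
    print(evaluate_by_levels(gates, depth_of, {'x0': True, 'x1': False, 'x2': False}, 2)['g3'])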

PA 10 Simulations between PRAMs


© Harald Räcke 203
Given a family {Cn } of circuits we may not be able to compute
the corresponding family of functions on a CRCW PRAM.

Definition 58
A family {Cn } of circuits is logspace uniform if there exists a
deterministic Turing machine M s.t
ñ M runs in logarithmic space.
ñ For all n, M outputs Cn on input 1n .

PA 10 Simulations between PRAMs


© Harald Räcke 204
Butterfly Network BF(d)

[Figure: BF(4); the columns are labelled 0000, 0001, . . . , 1111]

ñ node set V = {(`, x̄) | x̄ ∈ [2]d , ` ∈ [d + 1]}, where


x̄ = x0 x1 . . . xd−1 is a bit-string of length d
ñ edge set
E = {{(`, x̄), (` + 1, x̄ 0 )} | ` ∈ [d], x̄ ∈ [2]d , xi0 = xi for i ≠ `}

Sometimes the first and last level are identified.
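A small sketch that builds the node and edge sets of BF(d) directly from this definition (my own code):

    from itertools import product

    def butterfly(d):
        nodes = [(l, x) for l in range(d + 1) for x in product((0, 1), repeat=d)]
        edges = set()
        for l in range(d):
            for x in product((0, 1), repeat=d):
                y = list(x); y[l] ^= 1                   # flip bit l
                edges.add(((l, x), (l + 1, x)))          # straight edge
                edges.add(((l, x), (l + 1, tuple(y))))   # cross edge
        return nodes, edges

    nodes, edges = butterfly(3)
    print(len(nodes), len(edges))    # 32 nodes = (d+1)*2^d, 48 edges = 2*d*2^d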


Beneš Network

[Figure: the Beneš network for d = 3; rows labelled 000, . . . , 111, levels −3, . . . , 3]

ñ node set V = {(ℓ, x̄) | x̄ ∈ [2]^d, ℓ ∈ {−d, . . . , d}}
ñ edge set
  E = {{(ℓ, x̄), (ℓ + 1, x̄')} | ℓ ∈ [d], x̄ ∈ [2]^d, x'_i = x_i for i ≠ ℓ}
    ∪ {{(−ℓ, x̄), (−ℓ − 1, x̄')} | ℓ ∈ [d], x̄ ∈ [2]^d, x'_i = x_i for i ≠ ℓ}
n-ary Butterfly Network BF(n, d)

[Figure: BF(3, 3); the columns are labelled 000, 001, . . . , 222]

ñ node set V = {(ℓ, x̄) | x̄ ∈ [n]^d, ℓ ∈ [d + 1]}, where x̄ = x_0 x_1 . . . x_{d−1} is a string of length d over [n]
ñ edge set
  E = {{(ℓ, x̄), (ℓ + 1, x̄')} | ℓ ∈ [d], x̄ ∈ [n]^d, x'_i = x_i for i ≠ ℓ}
Permutation Network PN(n, d)

[Figure: the permutation network for d = 3; rows labelled 000, . . . , 111, levels −3, . . . , 3]

ñ There is an n-ary version of the Beneš network (two n-ary butterflies glued together at level 0).
ñ Identifying levels 0 and 1 (or 0 and −1) gives PN(n, d).
The d-dimensional mesh M(n, d)

ñ node set V = [n]d


ñ edge set E = {{(x0 , . . . , xi , . . . , xd−1 ), (x0 , . . . , xi + 1, . . . , xd−1 )} |
xs ∈ [n] for s ∈ [d] \ {i}, xi ∈ [n − 1]}
Remarks

M(2, d) is also called d-dimensional hypercube.

M(n, 1) is also called linear array of length n.

PA 11 Some Networks
© Harald Räcke 210
Permutation Routing

Lemma 59
On the linear array M(n, 1) any permutation can be routed
online in 2n steps with buffersize 3.
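For illustration, a simplified greedy simulation (my own code, not the exact 2n-step, buffer-3 protocol behind the lemma): every node forwards at most one packet per direction per round, preferring the packet with the farthest destination, and we measure the number of rounds and the largest buffer that occurs.

    import random

    def greedy_route(dest):                    # dest[i] = target of the packet starting at i
        n = len(dest)
        queues = [[d] for d in dest]
        rounds, max_buf = 0, 1
        while any(d != i for i, q in enumerate(queues) for d in q):
            arriving = [[] for _ in range(n)]
            for i, q in enumerate(queues):
                right = sorted((d for d in q if d > i), reverse=True)
                left = sorted(d for d in q if d < i)
                rest = [d for d in q if d == i]
                if right:
                    arriving[i + 1].append(right.pop(0))   # farthest-first to the right
                if left:
                    arriving[i - 1].append(left.pop(0))    # farthest-first to the left
                queues[i] = rest + right + left
            for i in range(n):
                queues[i] += arriving[i]
            rounds += 1
            max_buf = max(max_buf, max(len(q) for q in queues))
        return rounds, max_buf

    perm = list(range(16)); random.shuffle(perm)
    print(greedy_route(perm))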

PA 11 Some Networks
© Harald Räcke 211
Permutation Routing

Lemma 60
On the Beneš network any permutation can be routed offline in
2d steps between the sources level (+d) and target level (−d).

PA 11 Some Networks
© Harald Räcke 212
Recursive Beneš Network

[Figure: B(d) consists of an input and an output column of switches with two copies of B(d − 1) in between]
Permutation Routing
base case d = 0
trivial

induction step d → d + 1
ñ The packets that start at (ā, d) and (ā(d), d) have to be
sent into different sub-networks.
ñ The packets that end at (ā, −d) and (ā(d), −d) have to
come out of different sub-networks.

We can generate a graph on the set of packets.


ñ Every packet has an incident source edge (connecting it to
the conflicting start packet)
ñ Every packet has an incident target edge (connecting it to
the conflicting packet at its target)
ñ This clearly gives a bipartite graph; Coloring this graph tells
us which packet to send into which sub-network.
Permutation Routing on the n-ary Beneš Network
Instead of two we have n sub-networks B(n, d − 1).

All packets starting at positions


{(x0 , . . . , xd−2 , xd−1 , d) | xd−1 ∈ [n]} have to be send to
different sub-networks.

All packets ending at positions


{(x0 , . . . , xd−2 , xd−1 , d) | xd−1 ∈ [n]} have to come from
different sub-networks.

The conflict graph is an n-uniform 2-regular hypergraph.

We can color such a graph with n colors such that no two nodes
in a hyperedge share a color.

This gives the routing.


Lemma 61
On a d-dimensional mesh with sidelength n we can route any
permutation (offline) in 4dn steps.

PA 11 Some Networks
© Harald Räcke 216
We can simulate the algorithm for the n-ary Beneš Network.

Each step can be simulated by routing on disjoint linear arrays.


This takes at most 2n steps.

PA 11 Some Networks
© Harald Räcke 217
We simulate the behaviour of the n-ary Beneš network on the d-dimensional mesh.

In round r ∈ {−d, . . . , −1, 0, 1, . . . , d − 1} we simulate the step of


sending from level r of the Beneš network to level r + 1.

Each node x̄ ∈ [n]d of the mesh simulates the node (r , x̄).

Hence, if in the Beneš network we send from (r , x̄) to (r + 1, x̄ 0 )


we have to send from x̄ to x̄ 0 in the mesh.

All communication is performed along linear arrays. In round r < 0 the linear arrays along dimension −r − 1 (recall that dimensions are numbered from 0 to d − 1) are used, i.e., the arrays consisting of the nodes

    x_{d−1} . . . x_{−r} α x_{−r−2} . . . x_0,   α ∈ [n]

In rounds r ≥ 0 linear arrays along dimension r are used.

Hence, we can perform a round in O(n) steps.


Lemma 62
We can route any permutation on the Beneš network in O(d)
steps with constant buffer size.

The same is true for the butterfly network.

PA 11 Some Networks
© Harald Räcke 219
The nodes are of the form (ℓ, x̄), x̄ ∈ [n]^d, ℓ ∈ {−d, . . . , d}.

We can view nodes with the same first coordinate as forming columns and nodes with the same second coordinate as forming rows. This gives rows of length 2d + 1 and columns of length n^d.

We route in 3 phases:
1. Permute packets along the rows such that afterwards no column contains two packets with the same target row. O(d) steps.
2. We can use pipelining to permute every column, so that afterwards every packet is in its target row. O(2d + 2d) steps.
3. Every packet is in its target row. Permute packets to their final destinations. O(d) steps.

PA 11 Some Networks
© Harald Räcke 220
Lemma 63
We can do offline permutation routing of (partial) permutations
in 2d steps on the hypercube.

Lemma 64
We can sort on the hypercube M(2, d) in O(d2 ) steps.

Lemma 65
We can do online permutation routing of permutations in O(d2 )
steps on the hypercube.

PA 11 Some Networks
© Harald Räcke 221
Bitonic Sorter S_d

[Figure: the bitonic sorter S_d again, built from two copies of S_{d−1} followed by a bitonic merger]
ASCEND/DESCEND Programs

Algorithm 11 ASCEND(procedure oper)


1: for dim = 0 to d − 1
2: for all ā ∈ [2]d pardo
3: oper(ā, ā(dim), dim)

Algorithm 11 DESCEND(procedure oper)


1: for dim = d − 1 to 0
2: for all ā ∈ [2]d pardo
3: oper(ā, ā(dim), dim)

oper should only depend on the dimension and on values stored


in the respective processor pair (ā, ā(dim), V [ā], V [ā(dim)]).

oper should take constant time.

PA 11 Some Networks
© Harald Räcke 223
Algorithm 11 oper(ā, ā', dim, T_ā, T_ā')
1: if ā_dim . . . ā_0 = 0^{dim+1} then
2:     T_ā = min{T_ā, T_ā'}

Performing an ASCEND run with this operation computes the


minimum in processor 0.
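A sequential sketch of such an ASCEND run on M(2, d) (my own illustration; the array V plays the role of the values T_ā):

    def ascend(V, d, oper):
        for dim in range(d):                          # dimensions 0, ..., d-1
            for a in range(2 ** d):                   # conceptually all pairs in parallel
                oper(V, a, a ^ (1 << dim), dim)

    def min_oper(V, a, a_partner, dim):
        if a & ((1 << (dim + 1)) - 1) == 0:           # bits dim, ..., 0 of a are all zero
            V[a] = min(V[a], V[a_partner])

    V = [5, 3, 8, 1, 9, 2, 7, 4]                      # d = 3
    ascend(V, 3, min_oper)
    print(V[0])                                       # 1, the minimum, ends up in processor 0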

We can sort on M(2, d) by using d DESCEND runs.

We can do offline permutation routing by using a DESCEND run


followed by an ASCEND run.

PA 11 Some Networks
© Harald Räcke 224
We can perform an ASCEND/DESCEND run on a linear array M(2^d, 1) in O(2^d) steps.

PA 11 Some Networks
© Harald Räcke 225
The CCC network CCC(d) is obtained from the hypercube M(2, d) by replacing every node with a cycle of length d.

ñ nodes {(ℓ, x̄) | x̄ ∈ [2]^d, ℓ ∈ [d]}
ñ edges {{(ℓ, x̄), (ℓ, x̄(ℓ))} | x̄ ∈ [2]^d, ℓ ∈ [d]}
       ∪ {{(ℓ, x̄), ((ℓ + 1) mod d, x̄)} | x̄ ∈ [2]^d, ℓ ∈ [d]}  (cycle edges)

constant degree

PA 11 Some Networks
© Harald Räcke 226
Lemma 66
Let d = 2k . An ASCEND run of a hypercube M(2, d + k) can be
simulated on CCC(d) in O(d) steps.

PA 11 Some Networks
© Harald Räcke 227
The shuffle exchange network SE(d) is defined as follows
ñ nodes: V = [2]^d
ñ edges:
  E = {{xᾱ, ᾱx} | x ∈ [2], ᾱ ∈ [2]^{d−1}} ∪ {{ᾱ0, ᾱ1} | ᾱ ∈ [2]^{d−1}}

constant degree

Edges of the first type are called shuffle edges. Edges of the second type are called exchange edges.

PA 11 Some Networks
© Harald Räcke 228
Shuffle Exchange Networks

[Figure: the shuffle exchange networks SE(3) and SE(4)]
PA 11 Some Networks
© Harald Räcke 229
Lemma 67
We can perform an ASCEND run of M(2, d) on SE(d) in O(d)
steps.

PA 11 Some Networks
© Harald Räcke 230
Simulations between Networks

For the following observations we need to make the definition of


parallel computer networks more precise.

Each node of a given network corresponds to a processor/RAM.

In addition each processor has a read register and a write


register.

In one (synchronous) step each neighbour of a processor Pi can


write into Pi ’s write register or can read from Pi ’s read register.

Usually we assume that proper care has to be taken to avoid


concurrent reads and concurrent writes from/to the same
register.

PA 11 Some Networks
© Harald Räcke 231
Simulations between Networks

Definition 68
A configuration Ci of processor Pi is the complete description of
the state of Pi including local memory, program counter,
read-register, write-register, etc.

Suppose a machine M is in configuration C = (C_0, . . . , C_{p−1}), performs t synchronous steps, and is then in configuration C' = (C'_0, . . . , C'_{p−1}).

C'_i is called the t-th successor configuration of C for processor i.

PA 11 Some Networks
© Harald Räcke 232
Simulations between Networks

Definition 69
Let C = (C0 , . . . , Cp−1 ) a configuration of M. A machine M 0 with
q ≥ p processors weakly simulates t steps of M with slowdown k
if
ñ in the beginning there are p non-empty processors sets
A0 , . . . , Ap−1 ⊆ M 0 so that all processors in Ai know Ci ;
ñ after at most k · t steps of M 0 there is a processor Q(i) that
knows the t-th successors configuration of C for processor
Pi .

PA 11 Some Networks
© Harald Räcke 233
Simulations between Networks

Definition 70
M 0 simulates M with slowdown k if
ñ M 0 weakly simulates machine M with slowdown k
ñ and every processor in Ai knows the t-th successor
configuration of C for processor Pi .

PA 11 Some Networks
© Harald Räcke 234
We have seen how to simulate an ASCEND/DESCEND run of the
hypercube M(2, d + k) on CCC(d) with d = 2k in O(d) steps.

Hence, we can simulate d + k steps (one ASCEND run) of the


hypercube in O(d) steps. This means slowdown O(1).

PA 11 Some Networks
© Harald Räcke 235
Lemma 71
Suppose a network S with n processors can route any
permutation in time O(t(n)). Then S can simulate any constant
degree network M with at most n vertices with slowdown
O(t(n)).

PA 11 Some Networks
© Harald Räcke 236
Map the vertices of M to vertices of S in an arbitrary way.

Color the edges of M with ∆ + 1 colors, where ∆ = O(1) denotes


the maximum degree.

Each color gives rise to a permutation.

We can route this permutation in S in t(n) steps.

Hence, we can perform the required communication for one step


of M by routing ∆ + 1 permutations in S. This takes time t(n).

A processor of M is simulated by the same processor of S


throughout the simulation.

PA 11 Some Networks
© Harald Räcke 237
Lemma 72
Suppose a network S with n processors can sort n numbers in
time O(t(n)). Then S can simulate any network M with at most
n vertices with slowdown O(t(n)).

PA 11 Some Networks
© Harald Räcke 238
Lemma 73
There is a constant degree network on O(n^{1+ε}) nodes that can simulate any constant degree network with slowdown O(1).

PA 11 Some Networks
© Harald Räcke 239
Suppose we allow concurrent reads, this means in every step all
neighbours of a processor Pi can read Pi ’s read register.

Lemma 74
A constant degree network M that can simulate any n-node
network has slowdown Ω(log n) (independent of the size of M).

PA 11 Some Networks
© Harald Räcke 240
We show the lemma for the following type of simulation.
ñ There are representative sets A^t_i for every step t that specify which processors of M simulate processor P_i in step t (know the configuration of P_i after the t-th step).
ñ The representative sets for different processors are disjoint.
ñ For all i ∈ {1, . . . , n} and steps t, A^t_i ≠ ∅.

This is a step-by-step simulation.

PA 11 Some Networks
© Harald Räcke 241
Suppose processor P_i reads from processor P_{j_i} in step t.

Every processor Q of M with Q ∈ A^{t+1}_i must have a path to a processor Q' ∈ A^t_i and to a processor Q'' ∈ A^t_{j_i}.

Let k_t be the largest such distance (maximized over all i, j_i).

Then the simulation of step t takes time at least k_t.

The slowdown is at least

    k = (1/ℓ) · Σ_{t=1}^{ℓ} k_t

PA 11 Some Networks
© Harald Räcke 242
We show
ñ the simulation of a step takes at least time γ log n, or
ñ the size of the representative sets shrinks by a lot:

    Σ_i |A^{t+1}_i| ≤ (1/n) · Σ_i |A^t_i|

PA 11 Some Networks
© Harald Räcke 243
Suppose there is no pair (i, j) such that i reading from j requires time γ log n.
ñ For every i and every j, the set Γ_{2k}(A_i) contains a node from A_j.
ñ Hence, there must exist a j_i such that Γ_{2k}(A_i) contains at most

      |C_{j_i}| := |A_i| · c^{2k} / (n − 1) ≤ |A_i| · c^{3k} / n

  processors from A_{j_i}.

PA 11 Some Networks
© Harald Räcke 244
If we choose that i reads from j_i we get

    |A'_i| ≤ |C_{j_i}| · c^k
           ≤ c^k · |A_i| · c^{3k} / n
           = |A_i| · c^{4k} / n

Choosing k = Θ(log n) gives that this is at most |A_i|/n.

PA 11 Some Networks
© Harald Räcke 245
Let ℓ be the total number of steps and s be the number of short steps, i.e., steps with k_t < γ log n.

In a step of time k_t a representative set can increase by at most a factor of c^{k_t+1}.

Let h_ℓ denote the total number of representatives after step ℓ.

PA 11 Some Networks
© Harald Räcke 246
    n ≤ h_ℓ ≤ h_0 · (1/n)^s · Π_{t ∈ long} c^{k_t+1} ≤ (n/n^s) · c^{ℓ + Σ_t k_t}

If Σ_t k_t ≥ ℓ·((1/2)·log_c n − 1), we are done. Otherwise

    n ≤ n^{1 − s + ℓ/2},

which gives s ≤ ℓ/2.

Hence, at most 50% of the steps are short.

PA 11 Some Networks
© Harald Räcke 247
Deterministic Online Routing

Lemma 75
A permutation on an n × n-mesh can be routed online in O(n)
steps.

PA 11 Some Networks
© Harald Räcke 248
Deterministic Online Routing

Definition 76 (Oblivious Routing)


Specify a path-system W with a path Pu,v between u and v for
every pair {u, v} ∈ V × V .

A packet with source u and destination v moves along path Pu,v .

PA 11 Some Networks
© Harald Räcke 249
Deterministic Online Routing

Definition 77 (Oblivious Routing)


Specify a path-system W with a path Pu,v between u and v for
every pair {u, v} ∈ V × V .

Definition 78 (node congestion)

For a given path-system the node congestion is the maximum number of paths that go through any node v ∈ V.

Definition 79 (edge congestion)

For a given path-system the edge congestion is the maximum number of paths that go through any edge e ∈ E.

PA 11 Some Networks
© Harald Räcke 250
Deterministic Online Routing

Definition 80 (dilation)
For a given path system the dilation is the maximum length of a
path.

PA 11 Some Networks
© Harald Räcke 251
Lemma 81
Any oblivious routing protocol requires at least max{Cf , Df }
steps, where Cf and Df , are the congestion and dilation,
respectively, of the path-system used. (node congestion or edge
congestion depending on the communication model)

Lemma 82
Any reasonable oblivious routing protocol requires at most
O(Df · Cf ) steps (unbounded buffers).

PA 11 Some Networks
© Harald Räcke 252
Theorem 83 (Borodin, Hopcroft)
For any path system W there exists a permutation π : V → V and an edge e ∈ E such that at least Ω(√n/∆) of the paths go through e.

PA 11 Some Networks
© Harald Räcke 253
Let Wv = {Pv,u | u ∈ V }.

We say that an edge e is z-popular for v if at least z paths from


Wv contain e.

PA 11 Some Networks
© Harald Räcke 254
For any node v there are many edges that are quite popular for v.

|V| × |E| matrix A(z):

    A_{v,e}(z) = 1 if e is z-popular for v, and 0 otherwise

Define
ñ A_v(z) = Σ_e A_{v,e}(z)
ñ A_e(z) = Σ_v A_{v,e}(z)

PA 11 Some Networks
© Harald Räcke 255
Lemma 84
Let z ≤ (n − 1)/∆. For every node v ∈ V there exist at least n/(2∆z) edges that are z-popular for v. This means

    A_v(z) ≥ n/(2∆z)

PA 11 Some Networks
© Harald Räcke 256
Lemma 85
There exists an edge e_0 that is z-popular for at least z nodes, with z = Ω(√n/∆).

    Σ_e A_e(z) = Σ_v A_v(z) ≥ n²/(2∆z)

There must exist an edge e_0 with

    A_{e_0}(z) ≥ ⌈n²/(|E| · 2∆z)⌉ ≥ ⌈n/(2∆²z)⌉,

where the last step follows from |E| ≤ ∆n.

PA 11 Some Networks
© Harald Räcke 257
We choose z such that z = n/(2∆²z), i.e., z = √n/(√2·∆).

This means e_0 is ⌈z⌉-popular for ⌈z⌉ nodes.

We can construct a permutation such that z paths go through e_0.

PA 11 Some Networks
© Harald Räcke 258
Deterministic oblivious routing may perform very poorly.

What happens if we have a random routing problem in a


butterfly?

PA 11 Some Networks
© Harald Räcke 259
Suppose every source on level 0 has p packets, that are routed to random destinations.

How many packets go over a node v on level i?

From v we can reach 2^d/2^i = 2^{d−i} different targets.

Hence,
    Pr[packet goes over v] ≤ 2^{d−i}/2^d = 1/2^i

PA 11 Some Networks
© Harald Räcke 260
Expected number of packets:

    E[packets over v] = p · 2^i · (1/2^i) = p,

since only p · 2^i packets can reach v.

But this is trivial.

PA 11 Some Networks
© Harald Räcke 261
What is the probability that at least r packets go through v?

    Pr[at least r paths through v] ≤ (p·2^i choose r) · (1/2^i)^r
                                   ≤ (p·2^i·e/r)^r · (1/2^i)^r
                                   = (pe/r)^r

    Pr[there exists a node v such that at least r paths go through v] ≤ d·2^d · (pe/r)^r

PA 11 Some Networks
© Harald Räcke 262
    Pr[there exists a node v such that at least r paths go through v] ≤ d·2^d · (pe/r)^r

Choose r as 2ep + (ℓ + 1)d + log d = O(p + log N), where N is the number of sources in BF(d). Then

    Pr[there exists a node v with more than r paths over v] ≤ 1/N^ℓ

PA 11 Some Networks
© Harald Räcke 263
Scheduling Packets

Assume that in every round a node may forward at most one


packet but may receive up to two.

We select a random rank Rp ∈ [k]. Whenever, we forward a


packet we choose the packet with smaller rank. Ties are broken
according to packet id.

Random Rank Protocol
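A sketch of the forwarding rule (my own illustration): every packet draws its rank once when it is injected, and a node forwards the queued packet minimizing (rank, id).

    import random

    k = 64                                        # rank range [k]

    def inject(pid):
        return {"id": pid, "rank": random.randrange(k)}

    def forward_one(queue):
        if not queue:
            return None
        best = min(queue, key=lambda p: (p["rank"], p["id"]))
        queue.remove(best)
        return best

    q = [inject(i) for i in range(5)]
    print(forward_one(q)["id"], [p["id"] for p in q])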

PA 11 Some Networks
© Harald Räcke 264
Definition 86 (Delay Sequence of length s)

ñ delay path W
ñ lengths `0 , `1 , . . . , `s , with `0 ≥ 1, `1 , . . . , `s ≥ 0 lengths of
delay-free sub-paths
ñ collision nodes v0 , v1 , . . . , vs , vs+1
ñ collision packets P0 , . . . , Ps

PA 11 Some Networks
© Harald Räcke 265
Properties
ñ rank(P_0) ≥ rank(P_1) ≥ · · · ≥ rank(P_s)
ñ Σ_{i=0}^{s} ℓ_i = d
ñ if the routing takes d + s steps then the delay sequence has length s

PA 11 Some Networks
© Harald Räcke 266
Definition 87 (Formal Delay Sequence)

ñ a path W of length d from a source to a target
ñ integers ℓ_0 ≥ 1, ℓ_1, . . . , ℓ_s ≥ 0 with Σ_{i=0}^{s} ℓ_i = d
ñ nodes v_0, . . . , v_s, v_{s+1} on W with v_i being on level d − ℓ_0 − · · · − ℓ_{i−1}
ñ s + 1 packets P_0, . . . , P_s, where P_i is a packet whose path goes through v_i and v_{i−1}
ñ numbers k_s ≤ k_{s−1} ≤ · · · ≤ k_0 < k

PA 11 Some Networks
© Harald Räcke 267
We say a formal delay sequence is active if rank(P_i) = k_i holds for all i.

Let N_s be the number of formal delay sequences of length at most s. Then

    Pr[routing needs at least d + s steps] ≤ N_s / k^{s+1}

PA 11 Some Networks
© Harald Räcke 268
Lemma 88

    N_s ≤ N³ · ( 2eC(s + k)/(s + 1) )^{s+1}

ñ there are at most N² ways to choose W
ñ there are (s+d−1 choose s) ways to choose the ℓ_i's with Σ_{i=0}^{s} ℓ_i = d
ñ the collision nodes are then fixed
ñ there are at most C^{s+1} ways to choose the collision packets, where C is the node congestion
ñ there are at most (s+k choose s+1) ways to choose 0 ≤ k_s ≤ · · · ≤ k_0 < k

PA 11 Some Networks
© Harald Räcke 269
Hence the probability that the routing takes more than d + s steps is at most

    N³ · ( 2e · C · (s + k) / ((s + 1)·k) )^{s+1}

We choose s = 8eC − 1 + (ℓ + 3)d and k = s + 1. This gives that the probability is at most 1/N^ℓ.

PA 11 Some Networks
© Harald Räcke 270
ñ With probability 1 − 1/N^{ℓ_1} the random routing problem has congestion at most O(p + ℓ_1·d).
ñ With probability 1 − 1/N^{ℓ_2} the packet scheduling finishes in at most O(C + ℓ_2·d) steps.

Hence, with high probability, routing random problems with p packets per source in a butterfly requires only O(p + d) steps.

What do we do for arbitrary routing problems?

PA 11 Some Networks
© Harald Räcke 271
Valiant's Trick

Where did the scheduling analysis use the butterfly?

We only used that
ñ all routing paths have the same length d
ñ there are only polynomially many delay paths

Choose paths as follows:
ñ route from the source to a random destination on the target level
ñ route to the real target column (albeit on the source level)
ñ route to the target

All phases run in time O(p + d) with high probability.
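For concreteness, a sketch of the path selection behind Valiant's trick on the hypercube M(2, d), where the canonical paths are bit-fixing paths (my own illustration; the butterfly version described in the bullets above works analogously):

    import random

    def bit_fixing_path(u, v, d):
        path = [u]
        for i in range(d):                    # fix differing bits in order of dimension
            if (u ^ v) & (1 << i):
                u ^= 1 << i
                path.append(u)
        return path

    def valiant_path(s, t, d):
        w = random.randrange(2 ** d)          # phase 1: random intermediate node
        return bit_fixing_path(s, w, d) + bit_fixing_path(w, t, d)[1:]

    print(valiant_path(0b0000, 0b1011, 4))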

PA 11 Some Networks
© Harald Räcke 272
Valiant's Trick

Multicommodity Flow Problem
ñ undirected (weighted) graph G = (V, E, c)
ñ commodities (s_i, t_i), i ∈ {1, . . . , k}
ñ a multicommodity flow is a flow f : E × {1, . . . , k} → R⁺
ñ for all edges e ∈ E:  Σ_i f_i(e) ≤ c(e)
ñ for all nodes v ∈ V \ {s_i, t_i}:  Σ_{u:(u,v)∈E} f_i((u, v)) = Σ_{w:(v,w)∈E} f_i((v, w))

Goal A (Maximum Multicommodity Flow)
maximize Σ_i Σ_{e=(s_i,x)∈E} f_i(e)

Goal B (Maximum Concurrent Multicommodity Flow)
maximize min_i Σ_{e=(s_i,x)∈E} f_i(e)/d_i (the throughput fraction), where d_i is the demand for commodity i
PA 11 Some Networks
© Harald Räcke 273
Valiant's Trick

A Balanced Multicommodity Flow Problem is a concurrent multicommodity flow problem in which, for every node v, the incoming and outgoing demand equals

    c(v) = Σ_{e=(v,x)∈E} c(e)

PA 11 Some Networks
© Harald Räcke 274
Valiant's Trick

For a multicommodity flow S we assume that we have a decomposition of the flow(s) into flow-paths.

We use C(S) to denote the congestion of the flow solution (the inverse of the throughput fraction), and D(S) to denote the length of the longest routing path.

PA 11 Some Networks
© Harald Räcke 275
For a network G = (V, E, c) we define the characteristic flow problem via
ñ demands d_{u,v} = c(u)·c(v)/c(V)

Suppose the characteristic flow problem has a solution S with C(S) ≤ F and D(S) ≤ F.
PA 11 Some Networks
© Harald Räcke 276
Definition 89
A (randomized) oblivious routing scheme is given by a path system P and a weight function w such that

    Σ_{p ∈ P_{s,t}} w(p) = 1   for every pair (s, t).
PA 11 Some Networks
© Harald Räcke 277
Construct an oblivious routing scheme from S as follows:
ñ let f_{x,y} be the flow between x and y in S
ñ   f_{x,y} ≥ d_{x,y}/C(S) ≥ d_{x,y}/F = (1/F) · c(x)·c(y)/c(V)
ñ for p ∈ P_{x,y} set w(p) = f_p / f_{x,y}

This gives an oblivious routing scheme.
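A sketch of how such a scheme would be represented and used (my own illustration): store, for every pair (x, y), the flow-paths with weights w(p) = f_p / f_{x,y}, and sample a path with these probabilities.

    import random

    def make_scheme(path_flows):
        # path_flows[(x, y)] = list of (path, flow value f_p)
        scheme = {}
        for pair, paths in path_flows.items():
            total = sum(f for _, f in paths)                 # f_{x,y}
            scheme[pair] = [(p, f / total) for p, f in paths]
        return scheme

    def sample_path(scheme, x, y):
        paths, weights = zip(*scheme[(x, y)])
        return random.choices(paths, weights=weights)[0]

    scheme = make_scheme({("a", "b"): [(["a", "u", "b"], 2.0), (["a", "v", "b"], 1.0)]})
    print(sample_path(scheme, "a", "b"))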

PA 11 Some Networks
© Harald Räcke 278
Valiant's Trick

We apply this routing scheme twice:
ñ first choose a path from P_{s,v}, where v is chosen randomly with probability c(v)/c(V)
ñ then choose a path according to P_{v,t}

If the input flow problem/packet routing problem is balanced, this randomization results in the flow solution S (twice).

Hence, we have an oblivious scheme with congestion and dilation at most 2F for balanced inputs.
PA 11 Some Networks
© Harald Räcke 279
Example: hypercube.

PA 11 Some Networks
© Harald Räcke 280
Oblivious Routing for the Mesh

We can route any permutation on an n × n mesh in O(n) steps by x-y routing. Actually O(d) steps, where d is the largest distance between a source-target pair.

What happens if we do not have a permutation?

x-y routing may generate large congestion if some pairs have a lot of packets.

Valiant's trick may create a large dilation.

PA 11 Some Networks
© Harald Räcke 281
For a multicommodity flow problem P, let C_opt(P) be the optimum congestion and D_opt(P) the optimum dilation (possibly attained by different flow solutions).

Lemma 90
There is an oblivious routing scheme for the mesh that obtains a flow solution S with C(S) = O(C_opt(P) · log n) and D(S) = O(D_opt(P)).

PA 11 Some Networks
© Harald Räcke 282
Lemma 91
For any oblivious routing scheme on the mesh there is a demand
P such that routing P will give congestion Ω(log n · Copt ).

PA 11 Some Networks
© Harald Räcke 283
