
Distributed Algorithms

December 17, 2008


Gerard Tel
Introduction to Distributed Algorithms (2nd edition)
Cambridge University Press, 2000
Set-Up of the Course
13 lectures: Wan Fokkink (room U342, email: [email protected])
11 exercise classes: David van Moolenbroek
- one exercise on the board
- bonus exercise sheet: 0.5 bonus for the exam
exam: written; you may use (clean) copies of the slides
homepage: http://www.cs.vu.nl/tcs/da/ with
- copies of the slides
- exercise sheets for the labs
- the structure of the course
- the schedule of lectures and exercise classes
- old exams
Distributed Systems
A distributed system is an interconnected collection of autonomous processes.
Motivation:
- information exchange (WAN)
- resource sharing (LAN)
- replication to increase reliability
- parallelization to increase performance
- modularity to improve design
Distributed Versus Uniprocessor
Distributed systems differ from uniprocessor systems in three aspects.
- Lack of knowledge of the global state: A process usually has no up-to-date knowledge of the local states of other processes. For example, deadlock detection becomes an issue.
- Lack of a global time frame: There is no total order on events by their temporal occurrence. For example, mutual exclusion becomes an issue.
- Nondeterminism: The execution of processes is usually nondeterministic, for example due to differences in execution speed.
Communication Paradigms
The two main paradigms to capture communication in a
distributed system are message passing and variable sharing.
We will mainly consider message passing.
Asynchronous communication means that sending and receiving of
a message are independent events.
In case of synchronous communication, sending and receiving of a
message are coordinated to form a single event; a message is only
allowed to be sent if its destination is ready to receive it.
Communication Protocols
In a computer network, messages are transported through a medium, which may lose, duplicate, reorder or garble these messages.
A communication protocol detects and corrects such flaws during message passing.
Examples:
- Alternating bit protocol
- Bounded retransmission protocol
- Sliding window protocols

[Figure: a sliding window protocol between a sender S and a receiver R, with frame sequence numbers 0-7.]
Formal Framework
We first present a formal framework for distributed algorithms.
In this course, correctness proofs and complexity estimations of distributed algorithms will be presented in an informal fashion.
Transition Systems
The (global) state of a distributed algorithm is called its configuration.
The configuration evolves in discrete steps, called transitions.
A transition system consists of:
- a set C of configurations;
- a binary transition relation → on C; and
- a set I ⊆ C of initial configurations.
γ ∈ C is terminal if γ → δ for no δ ∈ C.
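To make these definitions concrete, here is a minimal Python sketch (a toy example of our own, not from the course) of a transition system and of following transitions until a terminal configuration is reached:

    # Toy transition system: configurations are natural numbers, and an
    # even configuration gamma > 0 has the single transition gamma -> gamma // 2.
    initial = {12, 20}                      # the set I of initial configurations

    def transitions(gamma):
        """The set of configurations delta with gamma -> delta."""
        return {gamma // 2} if gamma > 0 and gamma % 2 == 0 else set()

    def is_terminal(gamma):
        """gamma is terminal if gamma -> delta for no delta."""
        return not transitions(gamma)

    # Follow transitions from an initial configuration until a terminal one.
    gamma, trace = 12, [12]
    while not is_terminal(gamma):
        gamma = next(iter(transitions(gamma)))
        trace.append(gamma)
    print(trace)                            # [12, 6, 3]: 3 is terminal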
Executions
An execution is a sequence γ0 γ1 γ2 ... of configurations that is either infinite or ends in a terminal configuration, such that:
- γ0 ∈ I; and
- γi → γi+1 for all i ≥ 0.
δ is reachable from γ if there is a sequence γ = γ0 → γ1 → ... → γk = δ with γi → γi+1 for 0 ≤ i < k.
γ is reachable if it is reachable from some γ0 ∈ I.
States and Events
The configuration of a distributed algorithm is composed of the states at its processes.
A transition is associated with an event (or, in case of synchronous communication, two events) at one (or two) of its processes.
Local Algorithm at a Process
For simplicity we assume that different channels carry different messages.
The local algorithm at a process consists of:
- a set Z of states;
- a set I ⊆ Z of initial states;
- a relation ⊢i of internal events (c, d);
- a relation ⊢s of send events (c, m, d); and
- a relation ⊢r of receive events (c, m, d).
A process is an initiator if its first event is an internal or send event.
Asynchronous Communication
Let p = (Z_p, I_p, ⊢i_p, ⊢s_p, ⊢r_p) for processes p.
Consider an asynchronous distributed algorithm (p1, ..., pN).
C = Z_p1 × ... × Z_pN × M(𝓜)   (𝓜 is the set of messages, and M(𝓜) the multisets over 𝓜)
I = I_p1 × ... × I_pN
(c1, ..., cj, ..., cN, M) → (c1, ..., dj, ..., cN, M) if (cj, dj) ∈ ⊢i_pj
(c1, ..., cj, ..., cN, M) → (c1, ..., dj, ..., cN, M ∪ {m}) if (cj, m, dj) ∈ ⊢s_pj
(c1, ..., cj, ..., cN, M) → (c1, ..., dj, ..., cN, M \ {m}) if (cj, m, dj) ∈ ⊢r_pj and m ∈ M
Synchronous Communication
Consider a synchronous distributed algorithm (p1, ..., pN).
C = Z_p1 × ... × Z_pN
I = I_p1 × ... × I_pN
(c1, ..., cj, ..., cN) → (c1, ..., dj, ..., cN) if (cj, dj) ∈ ⊢i_pj
(c1, ..., cj, ..., ck, ..., cN) → (c1, ..., dj, ..., dk, ..., cN) if (cj, m, dj) ∈ ⊢s_pj and (ck, m, dk) ∈ ⊢r_pk for some m ∈ 𝓜
(c1, ..., cj, ..., ck, ..., cN) → (c1, ..., dj, ..., dk, ..., cN) if (cj, m, dj) ∈ ⊢r_pj and (ck, m, dk) ∈ ⊢s_pk for some m ∈ 𝓜
Assertions
An assertion is a predicate on the set of configurations of an algorithm.
An assertion is a safety property if it is true in each configuration of each execution of the algorithm.
An assertion is a liveness property if it is true in some configuration of each execution of the algorithm.
Invariants
Assertion P is an invariant if:
- P(γ) for all γ ∈ I; and
- if γ → δ and P(γ), then P(δ).
Each invariant is a safety property.
Let P be an assertion such that:
- there is a well-founded partial order > on C such that for each transition γ → δ, either γ > δ or P(δ); and
- P is true in all terminal configurations.
Then P is a liveness property.
Question
Give a transition system S and an assertion P such that P is a
safety property but not an invariant of S.
Fairness
Intuitively, fairness means that if a state is visited infinitely often, then each event at this state is taken infinitely often.
An execution is fair if each event that is applicable in infinitely many configurations occurs infinitely often in the execution.
Some assertions of the distributed algorithms that we will study are only liveness properties if we restrict to the fair executions.
Causal Order
In each configuration of an asynchronous system, applicable events at different processes are independent.
The causal order ≺ on occurrences of events in an execution is the smallest transitive relation such that:
- if a and b are events at the same process and a occurs before b, then a ≺ b; and
- if a is a send event and b the corresponding receive event, then a ≺ b.
If neither a ⪯ b nor b ⪯ a, then a and b are called concurrent.
Concurrency
An important challenge in the design of distributed algorithms is to cope with concurrent events (i.e., to avoid race conditions).
Typical examples are:
- Snapshots: Compute a configuration of the system during an execution.
- Termination detection: Find out whether all processes have terminated.
- Mutual exclusion: Guarantee that at any time, no more than one process is accessing the critical section.
Computations
A permutation of the events in an execution that respects the causal order does not affect the result of the execution.
These permutations together form a computation.
All executions of a computation start in the same configuration, and if they are finite, they all end in the same configuration.
Clocks
A clock Θ maps occurrences of events in a computation to a partially ordered set such that a ≺ b ⇒ Θ(a) < Θ(b).
- Order in sequence: Order events based on a particular execution of the computation.
- Lamport's logical clock Θ_L: Assign to each event a the length k of the longest causality chain b1 ≺ ... ≺ bk = a.
A distributed algorithm to compute Θ_L is:
- if a is an internal or send event, and k the clock value of the previous event at the same process (k = 0 if there is no such previous event), then Θ_L(a) = k + 1;
- if a is a receive event, k the clock value of the previous event at the same process (k = 0 if there is no such previous event), and b the send event corresponding to a, then Θ_L(a) = max{k, Θ_L(b)} + 1.
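The two rules translate directly into code. A minimal Python sketch (class and method names are our own):

    class LamportClock:
        """Logical clock of one process; self.k is the clock value of the
        previous event at this process (0 if there is no such event)."""
        def __init__(self):
            self.k = 0

        def internal_or_send(self):
            # Theta_L(a) = k + 1
            self.k += 1
            return self.k

        def receive(self, sender_clock):
            # Theta_L(a) = max{k, Theta_L(b)} + 1, with b the send event
            self.k = max(self.k, sender_clock) + 1
            return self.k

    p, q = LamportClock(), LamportClock()
    t = p.internal_or_send()     # send event at p: clock value 1
    q.internal_or_send()         # internal event at q: clock value 1
    print(q.receive(t))          # receive at q: max{1, 1} + 1 = 2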
Complexity Measures
Resource consumption of distributed algorithms can be measured in several ways.
Message complexity: Total number of messages exchanged by the algorithm.
Bit complexity: Total number of bits exchanged by the algorithm. (Only interesting when messages are very long.)
Time complexity: Amount of time consumed by the algorithm. (We assume: (1) event processing takes no time, and (2) a message is received at most one time unit after it is sent.)
Space complexity: Amount of space needed for the processes in the algorithm.
Different computations may give rise to different consumption of resources. We consider worst-case and average-case complexity (the latter with respect to a probability distribution over all computations).
Assumptions
Unless stated otherwise, we assume:
- a strongly connected network;
- each node knows only its neighbors;
- processes have unique identities;
- message passing communication;
- asynchronous communication;
- channels are non-FIFO;
- channels do not lose, duplicate or garble messages;
- messages are received in finite time;
- execution times of events are abstracted away; and
- in-order execution of the underlying processors.
Channels can be directed or undirected.
Question
What is more general, an algorithm for a directed or for an
undirected network?
Snapshots
We distinguish basic messages of the underlying distributed algorithm and control messages of the snapshot algorithm.
A snapshot of a basic computation consists of local snapshots of the state of each process, and of the messages in transit in each channel.
A snapshot is meaningful if it is a configuration of an execution of the basic computation.
A snapshot may fail to be meaningful if some process p takes a local snapshot and then sends a message m to a process q, where
- either q takes a local snapshot after the receipt of m;
- or m is included in the local snapshot of the channel pq.
Snapshots - Applications
Challenge: To take a snapshot without freezing the basic
computation.
Snapshots can be used to determine stable properties, which
remain true as soon as they have become true.
Examples: deadlock, garbage.
Snapshots can also be used for restarting after a failure, or for
debugging.
Chandy-Lamport Algorithm
Consider a directed network with FIFO channels.
Initially, the initiators take a local snapshot (of their state), and send a control message ⟨mkr⟩ to their neighbors.
When a non-initiator receives ⟨mkr⟩ for the first time, it takes a local snapshot (of its state), and sends ⟨mkr⟩ to its neighbors.
The channel state of pq consists of the messages via pq received by q after taking its local snapshot and before receiving ⟨mkr⟩ from p.
If channels are FIFO, then the Chandy-Lamport algorithm computes a meaningful snapshot.
Message complexity: Θ(|E|)
Time complexity: O(D)
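A single-threaded Python simulation may help to see the marker rule at work. This is a sketch under our own assumptions (a small directed ring, one basic message in transit, sequential delivery); it is not the original pseudocode:

    from collections import deque

    class Node:
        def __init__(self, pid, out, inn):
            self.pid, self.out, self.inn = pid, out, inn
            self.state = "state of " + pid
            self.snapshot = None            # local snapshot of the process state
            self.open = set()               # incoming channels still being recorded
            self.recorded = {}              # recorded channel state per in-neighbor

        def take_snapshot(self, channels):
            self.snapshot = self.state
            self.open = set(self.inn)
            self.recorded = {p: [] for p in self.inn}
            for q in self.out:              # send control message <mkr> to neighbors
                channels[(self.pid, q)].append("mkr")

        def receive(self, sender, msg, channels):
            if msg == "mkr":
                if self.snapshot is None:   # first <mkr>: take a local snapshot
                    self.take_snapshot(channels)
                self.open.discard(sender)   # channel state of (sender, pid) complete
            elif self.snapshot is not None and sender in self.open:
                self.recorded[sender].append(msg)   # basic message was in transit
            # other basic messages belong to the basic algorithm (not modeled)

    # directed ring p -> q -> r -> p; one basic message already in channel (r, p)
    nodes = {"p": Node("p", ["q"], ["r"]),
             "q": Node("q", ["r"], ["p"]),
             "r": Node("r", ["p"], ["q"])}
    channels = {(x, y): deque() for x in nodes for y in nodes[x].out}
    channels[("r", "p")].append("basic")
    nodes["p"].take_snapshot(channels)      # p initiates the snapshot
    while any(channels.values()):           # deliver messages, FIFO per channel
        for (x, y), ch in list(channels.items()):
            if ch:
                nodes[y].receive(x, ch.popleft(), channels)
    print(nodes["p"].recorded)              # {'r': ['basic']}: recorded in transit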
Question
Any ideas for a snapshot algorithm that works in case of non-FIFO
channels?
Lai-Yang Algorithm
The Lai-Yang algorithm uses piggybacking.
Initially, each initiator takes a local snapshot.
When a process has taken its local snapshot, it appends the tag true to each outgoing basic message.
When a non-initiator receives a message with the tag true, or a control message (see below), for the first time, it takes a local snapshot of its state before reception of this message.
All processes eventually take a local snapshot.
The channel state of pq consists of the basic messages via pq without the tag true that are received by q after its local snapshot.
p sends a control message to q, informing q how many basic messages without the tag true p sent into pq.
Question
Due to the control message from p, q knows when to take a local snapshot of the channel pq.
Which information exactly does q have to store on incoming basic messages from p without the tag true?
Wave Algorithms
Decide events are special internal events.
A distributed algorithm is a wave algorithm if for each computation (also called wave) C:
- Termination: C is finite;
- Decision: C contains a decide event; and
- Dependence: for each decide event e in C and each process p, f ⪯ e for an event f at p.
Examples: ring algorithm, tree algorithm, echo algorithm.
Traversal Algorithms
A traversal algorithm is a centralized wave algorithm; i.e., there is one initiator, which sends around a token (representing a combined send and receive event).
In each computation, the token first visits all processes. Finally, a decide event happens at the initiator, which at that time holds the token.
In traversal algorithms, the father of a non-initiator is the neighbor from which it received the token first.
Tarry's Algorithm
G = (V, E) is an undirected graph.
Tarry's traversal algorithm (from 1895):
R1 A node never forwards the token through the same channel twice.
R2 A node only forwards the token to its father when there is no other option.
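A sequential Python sketch of the traversal (the graph representation and the tie-breaking among options are our own choices; a real implementation would pass a token message):

    def tarry(graph, initiator):
        """graph: node -> list of neighbors (undirected, connected).
        Returns the walk of the token: 2|E| hops, ending at the initiator."""
        father = {initiator: None}
        used = set()                   # directed channels already used by the token
        u, walk = initiator, [initiator]
        while True:
            # R1: never forward through the same channel twice;
            # R2: forward to the father only when there is no other option.
            options = [v for v in graph[u]
                       if (u, v) not in used and v != father[u]]
            if not options:
                if father[u] is None:  # token back at the initiator: done
                    return walk
                options = [father[u]]  # fall back to the father (R2)
            v = options[0]
            used.add((u, v))
            father.setdefault(v, u)    # first receipt: u becomes v's father
            u = v
            walk.append(u)

    g = {"u": ["v", "w"], "v": ["u", "w"], "w": ["u", "v"]}
    print(tarry(g, "u"))               # e.g. ['u','v','w','u','w','v','u']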
Tarry's Algorithm - Example
The graph below is undirected and unweighted; u is the initiator.
[Figure: graph on nodes u, v, w, x, y; the numbers 1-6 mark the consecutive steps of the token.]
Arrows mark the path of the token (so the father-child relation is reversed).
Edges of the spanning tree are solid.
Frond edges, which are not part of the spanning tree, are dashed.
Depth-First Search
A spanning tree is a depth-first search tree if each frond edge connects an ancestor and a descendant of the spanning tree.
Depth-first search is obtained by adding to Tarry's algorithm:
R3 When a node receives the token, it immediately sends it back through the same channel, if this is allowed by R1 and R2.
Message complexity: 2|E| messages
Time complexity: 2|E| time units
The spanning tree of a depth-first search is a depth-first search tree.
Example:
[Figure: depth-first search on nodes u, v, w, x, y; the numbers 1-6 mark the consecutive steps of the token.]
Question
How can (the delay of) messages through frond edges be avoided?
Neighbor Knowledge
Prevents transmission of the token through a frond edge.
The visited nodes are included in the token. The token is not forwarded to nodes in this list (except when a node sends the token to its father).
Message complexity: 2|V|-2 messages
Tree edges carry 2 forwarded tokens each.
Bit complexity: Up to k|V| bits per message, where k bits are needed to represent one node.
Time complexity: 2|V|-2 time units
Awerbuch's Algorithm
- A node holding the token for the first time informs all neighbors except its father (and the node to which it will forward the token).
- The token is only forwarded when these neighbors have all acknowledged reception.
- The token is only forwarded to nodes that were not yet visited by the token (except when a node sends the token to its father).
Awerbuch's Algorithm - Complexity
Message complexity: < 4|E| messages
Frond edges carry 2 information and 2 acknowledgement messages.
Tree edges carry 2 forwarded tokens, and possibly 1 information/acknowledgement pair.
Time complexity: ≤ 4|V|-2 time units
Tree edges carry 2 forwarded tokens. Nodes wait at most 2 time units before forwarding the token.
Question
Are the acknowledgements in Awerbuch's algorithm really needed?
Cidon's Algorithm
Abolishes the acknowledgements from Awerbuch's algorithm.
- The token is forwarded without delay. Each node u records to which node mrs_u it forwarded the token last.
- Suppose node u receives the token from a node v ≠ mrs_u. Then u marks the edge uv as used and purges the token.
- Suppose node v receives an information message from mrs_v. Then it continues forwarding the token.
Cidon's Algorithm - Complexity
Message complexity: < 4|E| messages
Each edge carries at most 2 information messages and 2 forwarded tokens.
Time complexity: ≤ 2|V|-2 time units
At least once per time unit, a token is forwarded through a tree edge. Each tree edge carries 2 forwarded tokens.
Cidon's Algorithm - Example
[Figure: graph on nodes u, v, w, x, y; the numbers 1-5 mark the consecutive steps of the token.]
Tree Algorithm
The tree algorithm is a decentralized wave algorithm for undirected, acyclic graphs.
The local algorithm at a node u (a sketch in Python follows below):
- u waits until it has received messages from all neighbors except one, denoted Nb_u;
- then u sends a message to Nb_u;
- if u receives a message from Nb_u, it decides; in that case u sends the decision to all neighbors except Nb_u;
- if u receives a decision from Nb_u, it passes it on to all other neighbors.
Remark: Always two nodes decide.
Message complexity: 2|V|-2 messages
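Below is a round-based Python simulation of the wave (our own sequential rendering, for graphs with at least two nodes; the decision flooding to the remaining nodes is omitted). It illustrates that exactly two neighboring nodes decide:

    def tree_algorithm(tree):
        """tree: dict node -> set of neighbors (undirected, acyclic).
        Returns the two nodes that decide."""
        received = {u: set() for u in tree}   # who has sent a message to u
        nb = {}                               # u -> Nb_u, fixed when u sends
        while len(nb) < len(tree):
            for u in sorted(tree):
                waiting = tree[u] - received[u]
                if u not in nb and len(waiting) <= 1:
                    # Nb_u is the one remaining neighbor; if u already heard
                    # from everyone, Nb_u is a neighbor that sent to u
                    v = next(iter(waiting or received[u]))
                    nb[u] = v                 # u sends its one message to Nb_u
                    received[v].add(u)
        # u decides when it receives a message from Nb_u
        return sorted(u for u, v in nb.items() if nb[v] == u)

    g = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
    print(tree_algorithm(g))                  # ['c', 'd']: two neighbors decide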
Tree Algorithm - Example
[Figure: a tree in which the two endpoints of the central edge decide.]
Question
What happens if the tree algorithm is applied to a graph
containing a cycle?
Echo Algorithm
The echo algorithm is a centralized wave algorithm for undirected graphs.
- The initiator sends a message to all neighbors.
- The father of a non-initiator is the neighbor from which it receives the first message.
- A non-initiator sends a message to all neighbors except its father.
- When a non-initiator has received a message from all neighbors, it sends a message to its father.
- When the initiator has received a message from all neighbors, it decides.
Message complexity: 2|E| messages
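A queue-based Python simulation (our own rendering) that also returns the spanning tree built by the father pointers:

    from collections import deque

    def echo(graph, initiator):
        """graph: dict node -> list of neighbors (undirected, connected).
        Returns the father relation once the initiator has decided."""
        father = {initiator: None}
        pending = {u: len(graph[u]) for u in graph}   # messages still expected
        queue = deque((initiator, v) for v in graph[initiator])
        while queue:
            u, v = queue.popleft()                    # deliver message u -> v
            if v not in father:                       # first message: u is father
                father[v] = u
                for w in graph[v]:
                    if w != u:
                        queue.append((v, w))          # flood to other neighbors
            pending[v] -= 1
            if pending[v] == 0:                       # heard from all neighbors
                if father[v] is None:
                    return father                     # the initiator decides
                queue.append((v, father[v]))          # echo back to the father
        return father

    g = {"u": ["v", "w"], "v": ["u", "w"], "w": ["u", "v"]}
    print(echo(g, "u"))                               # {'u': None, 'v': 'u', 'w': 'u'}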
Echo Algorithm - Example
[Figure: an echo wave floods the graph and echoes back to the initiator, which decides.]
Question
Let each process initially carry a random integer value.
Adapt the echo algorithm to compute the sum of these integer
values.
Termination Detection
In a distributed setting, detection of termination can be non-trivial.
The basic algorithm is terminated if each process is passive and no basic messages are in transit.
[State diagram: an active process can perform send, internal and receive events, and can become passive; a receive event can make a passive process active again.]
The control algorithm consists of termination detection and announcement. Announcement is simple; we focus on detection.
Termination detection should not (1) use further information on local states or internal events, or (2) influence the basic computation.
Dijkstra-Scholten Algorithm
Requires (1) a centralized basic algorithm, and (2) an undirected graph.
A tree T is maintained, in which the initiator is the root, and all active processes are nodes of T. Initially, T consists of the initiator.
sc_p estimates (from above) the number of children of process p in T.
- When p sends a basic message, sc_p := sc_p + 1.
- Suppose this message is received by q.
  - If q was not yet in T, q becomes a node in T with father p and sc_q := 0.
  - If q is already in T, it sends a message to p that it is not a new son of p. Upon receipt of this message, sc_p := sc_p - 1.
- When a non-initiator p is passive and sc_p = 0, it informs its father that it is no longer a son.
- When the initiator p0 is passive and sc_p0 = 0, it calls Announce.
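A toy sequential Python sketch of the bookkeeping (our own class; sends and refusals are delivered instantly, which abstracts from messages in transit):

    class DijkstraScholten:
        """sc[p] over-estimates p's number of sons in T; a process appears
        in sc exactly when it is a node of the tree T."""
        def __init__(self, initiator):
            self.sc = {initiator: 0}
            self.father = {initiator: None}
            self.active = {initiator}

        def send_basic(self, p, q):          # p must be active (not enforced here)
            self.sc[p] += 1
            self.active.add(q)
            if q not in self.father:         # q joins T with father p
                self.father[q] = p
                self.sc[q] = 0
            else:                            # q refuses: "not a new son of p"
                self.sc[p] -= 1

        def become_passive(self, p):
            self.active.discard(p)
            while p not in self.active and self.sc[p] == 0:
                f = self.father[p]
                if f is None:                # passive initiator without sons
                    print("Announce")
                    return
                del self.sc[p]               # p informs its father it leaves T
                del self.father[p]
                self.sc[f] -= 1
                p = f                        # the father may now leave as well

    ds = DijkstraScholten("p0")
    ds.send_basic("p0", "q")
    ds.send_basic("q", "r")
    ds.become_passive("p0")                  # sc_p0 = 1: p0 must wait
    ds.become_passive("r")                   # r leaves T; q is still active
    ds.become_passive("q")                   # q leaves, then p0: "Announce"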
Question
Any suggestions to make the Dijkstra-Scholten algorithm
decentralized?
Shavit-Francez Algorithm
Allows a decentralized basic algorithm; requires an undirected graph.
A forest F of trees is maintained, rooted in the initiators. Each process is in at most one tree of F. Initially, each initiator constitutes a tree in F.
- When a process p sends a basic message, sc_p := sc_p + 1.
- Suppose this message is received by q.
  - If q was not yet in some tree in F, q becomes a node in F with father p and sc_q := 0.
  - If q is already in some tree in F, it sends a message to p that it is not a new son of p. Upon receipt, sc_p := sc_p - 1.
- When a non-initiator p is passive and sc_p = 0, it informs its father that it is no longer a son.
A wave algorithm is run on the side, in which only nodes that are not in a tree participate, and each decide event calls Announce.
Rana's Algorithm
A decentralized algorithm, for undirected graphs.
Let a logical clock provide the (basic and control) events with a time stamp. The time stamp of a process is the time stamp of its last event (initially it is 0).
Each basic message is acknowledged, and each process counts how many of the basic messages it sent have not yet been acknowledged.
If at some time t a process becomes quiet, meaning that (1) it is passive and (2) all basic messages it sent have been acknowledged, it starts a wave (of control messages) tagged with t.
Only quiet processes that have been quiet from a time ≤ t onward take part in the wave.
If a wave completes, the initiator of the wave calls Announce.
Rana's Algorithm - Correctness
Suppose a wave, tagged with some t, does not complete.
Then some process did not take part in the wave, meaning that it was not quiet from a time ≤ t onward.
So this process will start a new wave at a later time.
Rana's Algorithm - Correctness
Suppose a quiet process p takes part in a wave, and is later on made active by a basic message from a process q that was not yet visited by this wave.
Then this wave will not complete.
Namely, let the wave be tagged with t. When p takes part in the wave, its logical clock becomes > t.
By the resulting acknowledgement from p to q, the logical clock of q becomes > t.
So q will not take part in the wave (because it is tagged with t).
Mattern's Weight-Throwing Termination Detection
Requires a centralized basic algorithm; allows a directed graph.
The initiator has weight 1; all non-initiators have weight 0.
When a process sends a basic message, it attaches part of its weight to this message (and subtracts this from its own weight).
When a process receives a basic message, it adds the weight attached to this message to its own weight.
When a non-initiator becomes passive, it returns its weight to the initiator, by means of a control message.
When the initiator becomes passive, and has regained weight 1, it calls Announce.
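A minimal Python sketch of our own (Fraction keeps the arithmetic exact, and weight transfer is instantaneous rather than carried by messages in transit):

    from fractions import Fraction

    class WeightThrowing:
        def __init__(self, initiator):
            self.initiator = initiator
            self.weight = {initiator: Fraction(1)}   # non-initiators start at 0
            self.active = {initiator}

        def send_basic(self, p, q):
            w = self.weight[p] / 2                   # attach half of p's weight
            self.weight[p] -= w
            self.weight[q] = self.weight.get(q, Fraction(0)) + w
            self.active.add(q)

        def become_passive(self, p):
            self.active.discard(p)
            if p != self.initiator:                  # return weight (control message)
                self.weight[self.initiator] += self.weight[p]
                self.weight[p] = Fraction(0)
            if (self.initiator not in self.active
                    and self.weight[self.initiator] == 1):
                print("Announce")

    wt = WeightThrowing("p0")
    wt.send_basic("p0", "q")
    wt.become_passive("p0")      # initiator passive, weight 1/2: no announce yet
    wt.become_passive("q")       # weight returned: "Announce"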
Weight-Throwing Termination Detection - Drawback
Underflow: The weight of a process can become too small to be divided further.
Solution 1: This process can ask the initiator for extra weight.
Solution 2: This process can itself initiate a weight-throwing termination detection sub-call, and only return its weight to the initiator when it has become passive and this sub-call has terminated.
Token-Based Termination Detection
A centralized algorithm for directed graphs.
A process p0 is the initiator of a traversal algorithm to check whether all processes are passive.
Complication 1: Reception of basic messages cannot be acknowledged.
Solution: Synchronous communication.
Complication 2: A traversal of only passive processes still does not guarantee termination.
Complication 2 - Example
[Figure: a ring of processes p0, q, r, s.]
The token is at p0; only s is active.
The token travels to r.
s sends a basic message to q, making q active.
s becomes passive.
The token travels on to p0, which falsely calls Announce.
Dijkstra-Feijen-van Gasteren Algorithm
Solution: Processes are colored white or black. Initially they are white, and a process that sends a basic message becomes black.
- When p0 is passive, it sends a white token.
- Only passive processes forward the token.
- If a black process forwards the token, the token becomes black and the process white.
- Eventually, the token returns to p0, and p0 waits until it is passive.
  - If both the token and p0 are white, then p0 calls Announce.
  - Otherwise, p0 sends a white token again.
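A compact Python sketch of our own rendering of the coloring rule (the simulation assumes every process is passive by the time the token reaches it; p0's own color is folded in as the last step of the round trip):

    class DFG:
        def __init__(self, ring):
            self.ring = ring                        # token order, p0 last
            self.color = dict.fromkeys(ring, "white")

        def basic_send(self, p):
            self.color[p] = "black"                 # sending makes p black

        def probe(self):
            """p0 sends a white token around the ring; returns True
            exactly when p0 may call Announce afterwards."""
            token = "white"
            for p in self.ring:                     # only passive processes forward
                if self.color[p] == "black":
                    token = "black"                 # token blackened,
                    self.color[p] = "white"         # process whitened
            return token == "white"

    d = DFG(["q", "r", "s", "p0"])
    d.basic_send("s")            # s sends a basic message and becomes black
    print(d.probe())             # False: a second round trip is needed
    print(d.probe())             # True: p0 may call Announce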
Dijkstra-Feijen-van Gasteren Algorithm - Example
[Figure: a ring of processes p0, q, r, s.]
The token is at p0; only s is active.
The token travels to r.
s sends a basic message to q, making s black and q active.
s becomes passive.
The token travels on to p0, and is made black at s.
q becomes passive.
The token travels around the network again, and p0 calls Announce.
Question
Give an example to show that the Dijkstra-Feijen-van Gasteren
algorithm does not work in case of asynchronous communication.
Safra's Algorithm
Allows a directed graph and asynchronous communication.
Each process maintains a counter of type ℤ; initially it is 0. At each outgoing basic message, the counter is increased; at each incoming basic message, the counter is decreased.
At each round trip, the token carries the sum of the counters of the processes it has traversed.
At any time, the sum of all counters in the network is ≥ 0, and it is 0 if and only if no basic message is in transit.
Still the token may compute a negative sum for a round trip, when a visited passive process receives a basic message, becomes active, and sends basic messages that are received by an unvisited process.
Safra's Algorithm
Processes are colored white or black. Initially they are white, and a process that receives a basic message becomes black.
- When p0 is passive, it sends a white token.
- Only passive processes forward the token.
- When a black process forwards the token, the token becomes black and the process white.
- Eventually the token returns to p0, and p0 waits until it is passive.
  - If the token and p0 are white and the sum of all counters is zero, p0 calls Announce.
  - Otherwise, p0 sends a white token again.
Safra's Algorithm - Example
[Figure: a ring of processes p0, q, r, s.]
The token is at p0; only s is active; no messages are in transit; all processes are white with counter 0.
s sends a basic message m to q, setting the counter of s to 1.
s becomes passive.
The token travels around the network, white with sum 1.
The token travels on to r, white with sum 0.
m travels to q and back to s, making them active and black with counter 0.
s becomes passive.
The token travels from r to p0, black with sum 0.
q becomes passive.
After two more round trips of the token, p0 calls Announce.
Election Algorithms
- Each computation terminates in a configuration where one process is the leader.
- All processes have the same local algorithm.
- Identities of processes are totally ordered.
- The initiators are any non-empty subset of the processes.
Chang-Roberts Algorithm
Let G = (V, E) be a directed ring.
- Each initiator u sends a token around the ring, containing its id.
- When u receives v with v < u, u becomes passive, and passes on the id v.
- When u receives v with v > u, it purges the message.
- When u receives its own id u, u becomes the leader.
Passive processes (including all non-initiators) pass on incoming messages.
Worst-case message complexity: O(|V|^2)
Average-case message complexity: O(|V| log |V|)
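A Python simulation of a full run where every process is an initiator (our own rendering; all messages in flight advance one hop per iteration):

    import random

    def chang_roberts(ids):
        """ids: the identities clockwise along a directed ring.
        Returns (leader id, number of messages sent)."""
        n = len(ids)
        passive = [False] * n
        # a token is a pair (id v, index i of the process about to receive it)
        tokens = [(ids[i], (i + 1) % n) for i in range(n)]
        msgs = n                              # every initiator sent its id once
        while True:
            forwarded = []
            for v, i in tokens:
                u = ids[i]
                if passive[i]:                # passive processes pass everything on
                    forwarded.append((v, (i + 1) % n))
                    msgs += 1
                elif v == u:
                    return u, msgs            # u receives its own id: leader
                elif v < u:                   # u becomes passive and passes v on
                    passive[i] = True
                    forwarded.append((v, (i + 1) % n))
                    msgs += 1
                # v > u at an active process: the message is purged
            tokens = forwarded

    random.seed(4)
    ring = random.sample(range(100), 8)
    print(chang_roberts(ring))                # the smallest id becomes the leader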
Chang-Roberts Algorithm - Example
[Figure: a directed ring with ids 0, 1, ..., N-1.]
With the ids placed clockwise in increasing order: N(N+1)/2 messages.
With the ids placed anti-clockwise in increasing order: 2N-1 messages.
Franklin's Algorithm
Let G be an undirected ring.
In Franklin's algorithm, each active node compares its id with the ids of its nearest active neighbors. If its id is not the smallest, it becomes passive.
- Initially, initiators are active, and non-initiators are passive. Each active node sends its id to its neighbors on either side.
- Let active node u receive v and w:
  - if min{v, w} < u, then u becomes passive;
  - if min{v, w} > u, then u sends its id again;
  - if min{v, w} = u, then u becomes the leader.
Passive nodes pass on incoming messages.
Franklin's Algorithm - Complexity
Worst-case message complexity: O(|V| log |V|)
In each update round, at least half of the active nodes become passive. Each update round takes 2|V| messages.
Question
Any suggestions how to adapt Franklin's algorithm to a directed ring?
Dolev-Klawe-Rodeh Algorithm
Let G be a (clockwise) directed ring.
The difficulty is to obtain the id of the clockwise next active node.
In the DKR algorithm, the comparison of the ids of an active node u and its nearest active neighbors v and w is performed at w.
[Figure: ring fragment with consecutive active nodes v, u, w.]
- If u has a smaller id than v and w, then w assumes the id of u.
- Otherwise, w becomes passive.
When u and w carry the same id, w concludes that the node with this id must become the leader; then w broadcasts this id to all nodes.
Worst-case message complexity: O(|V| log |V|)
Dolev-Klawe-Rodeh Algorithm - Example
[Figure: a clockwise oriented ring with ids 0-5; after successive rounds, the node holding the smallest id 0 concludes that 0 must become the leader.]
Question
How could the tree algorithm be used to get an election algorithm
for undirected, acyclic graphs?
Tree Election Algorithm
Let G be an undirected, acyclic graph.
The tree election algorithm starts with a wake-up phase, driven by the initiators.
The local algorithm at an awake node u:
- u waits until it has received ids from all neighbors except one, denoted Nb_u;
- u computes the smallest id min_u among the received ids and its own id;
- u sends min_u to Nb_u;
- when u receives id v from Nb_u, it computes min'_u, being the minimum of min_u and v;
- if min'_u = u, then u becomes the leader;
- u sends min'_u to all neighbors except Nb_u.
Message complexity: 2|V|-2 messages
Question
Why does u compute the minimum of min_u and v?
Tree Election Algorithm - Example
[Figure: a tree with ids 1-6; the minima computed at the nodes converge on the smallest id 1, and the node with id 1 becomes the leader.]
Question
How could the echo algorithm be used to get an election algorithm
for any undirected graph?
Echo Algorithm with Extinction
Election for undirected graphs, based on the echo algorithm:
- Each initiator starts a wave, tagged with its id.
- At any time, each node takes part in at most one wave.
- Suppose a node u in wave v is hit by a wave w:
  - if v > w, then u changes to wave w (it abandons all earlier messages);
  - if v < w, then u continues with wave v (it purges the incoming message);
  - if v = w, then the incoming message is treated according to the echo algorithm of wave w.
- If wave u executes a decide event (at u), u becomes the leader.
Non-initiators join the first wave that hits them.
Worst-case message complexity: O(|V| |E|)
Minimal Spanning Trees
Let G be an undirected, weighted graph, in which different edges have different weights.
(Weights can always be totally ordered by taking into account the identities of the endpoints of an edge, and using a lexicographical order.)
In a minimal spanning tree, the sum of the weights of the edges in the spanning tree is minimal.
Lemma: Let F be a fragment (i.e., a subtree of a minimal spanning tree M in G), and e the lowest-weight outgoing edge of F (i.e., e has exactly one endpoint in F). Then e is in M.
Proof: Suppose not. Then M ∪ {e} has a cycle, containing e and another outgoing edge f of F. Replacing f by e in M gives a spanning tree with a smaller sum of edge weights.
Prim's Algorithm
Centralized. Initially, F is a single node.
As long as F is not a spanning tree, add the lowest-weight outgoing edge of F to F.
Kruskal's Algorithm
Decentralized. Initially, each node in G forms a separate fragment.
In each step, two distinct fragments F and F' are joined by adding an outgoing edge of both F and F' which is lowest-weight for F.
Example:
[Figure: Kruskal's algorithm on a graph with edge weights 1, 2, 3, 8, 9, 10.]
Prim's and Kruskal's algorithms also work when edges may have the same weight. But then the minimal spanning tree need not be unique.
Complications in a distributed setting: Is an edge outgoing? Is it lowest-weight?
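For contrast with the distributed version that follows, here is a standard sequential Kruskal in Python, where a union-find structure plays the role of the fragments (edge list and node names are our own example):

    def kruskal(edges):
        """edges: list of (weight, u, v) with pairwise different weights.
        Returns the minimal spanning tree as a list of such triples."""
        nodes = {x for _, u, v in edges for x in (u, v)}
        parent = {x: x for x in nodes}        # union-find: x's fragment

        def find(x):                          # root of x's fragment
            while parent[x] != x:
                parent[x] = parent[parent[x]] # path halving
                x = parent[x]
            return x

        tree = []
        for w, u, v in sorted(edges):         # lowest weight first
            ru, rv = find(u), find(v)
            if ru != rv:                      # edge is outgoing: join the fragments
                parent[ru] = rv
                tree.append((w, u, v))
        return tree

    example = [(8, "u", "v"), (3, "v", "w"), (9, "w", "x"),
               (1, "u", "w"), (2, "v", "x"), (10, "u", "x")]
    print(kruskal(example))   # [(1,'u','w'), (2,'v','x'), (3,'v','w')]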
Gallager-Humblet-Spira Algorithm
G is an undirected, weighted graph, in which different edges have different weights.
Distributed computation of a minimal spanning tree in G:
- initially, each node is a fragment;
- the nodes in a fragment F together search for the lowest-weight outgoing edge e_F;
- when e_F is found, the fragment at the other end of e_F is asked to collaborate in a merge.
Level, name and core edge
Fragments carry a level L ∈ ℕ and a name FN.
Fragments F = (L, FN) and F' = (L', FN') are joined in the following cases:
- L < L', and e_F leads from F to F': F ∪ F' = (L', FN')
- L > L', and e_F' leads from F' to F: F ∪ F' = (L, FN)
- L = L', and e_F = e_F': F ∪ F' = (L+1, weight of e_F)
The core edge of a fragment is the last edge that connected two sub-fragments at the same level; its endpoints are the core nodes.
Parameters of a node
Each node keeps track of:
- its state: sleep (for non-initiators), find or found;
- the status of each of its edges: basic, branch or reject;
- the level and name (i.e., the weight of the core edge) of its fragment;
- its father, toward the core edge.
Initialization
Each initiator u sets level_u to 0, its lowest-weight outgoing edge uv to branch, and its other edges to basic.
u sends ⟨connect, 0⟩ to v.
Each non-initiator is woken up when it receives a connect or test message.
Joining two fragments
Let fragments F = (L, FN) and F' = (L', FN') be joined via edge uv (after the exchange of 1 or 2 connect messages through this edge).
- If L < L', then ⟨initiate, L', FN', find/found⟩ is sent by v to u, and forwarded through F; F ∪ F' inherits the core edge of F'.
- If L > L', then vice versa.
- If L = L', then ⟨initiate, L+1, weight of uv, find⟩ is sent both ways; F ∪ F' has core edge uv.
Computing the lowest-weight outgoing edge
At reception of ⟨initiate, L, FN, find/found⟩, a node u stores L and FN, and adopts the sender as its father. It passes on the message through its branch edges.
In case of find, u checks in increasing order of weight whether one of its basic edges uv is outgoing, by sending ⟨test, L, FN⟩ to v.
1. If L > level_v, v postpones processing the incoming test message;
2. else, if v is in fragment FN, v replies reject (except when v was awaiting a reply to a test message of its own to u), after which u and v set uv to reject;
3. else, v replies accept.
When a basic edge accepts, or all basic edges have failed, u stops the search.
Question
In case 1, if L > level_v, why does v postpone processing the incoming test message from u?
Answer: u and v might be in the same fragment, in which case ⟨initiate, L, FN, find/found⟩ is on its way to v.
Question
Why does this postponement not lead to a deadlock?
Answer: Because there is always a fragment of minimal level.
Reporting to the core nodes
- u waits for all its branch edges, except its father, to report.
- u sets its state to found.
- u computes the minimum λ of (1) these reports, and (2) the weight of its lowest-weight outgoing edge (or ∞, if no such edge was found).
- If λ < ∞, u stores the branch edge that sent λ, or its basic edge of weight λ.
- u sends ⟨report, λ⟩ to its father.
Termination or changeroot at the core nodes
A core node receives reports from all neighbors, including its father (the other core node).
- If the minimum reported value is ∞, the core nodes terminate.
- If λ < ∞, the core node that received λ first sends ⟨changeroot⟩ toward the lowest-weight outgoing edge.
When a node u that reported its lowest-weight outgoing edge receives ⟨changeroot⟩, it sets this edge to branch, and sends ⟨connect, level_u⟩ into it.
Starting the join of two fragments
When a node v receives ⟨connect, level_u⟩ from u, then level_v ≥ level_u. Namely, either level_u = 0, or v earlier sent accept to u.
1. If level_v > level_u, then v sets vu to branch and sends ⟨initiate, level_v, name_v, find/found⟩ to u. (If v was awaiting a reply to a test message to u, it stops doing so.)
2. As long as level_v = level_u and vu is not a branch of v, v postpones processing the connect message.
3. If level_v = level_u and vu is a branch of v (i.e., v sent ⟨connect, level_v⟩ to u), v sends ⟨initiate, level_v + 1, weight of vu, find⟩ to u. Now v and u become the core nodes.
Question
In case 2, if level_v = level_u, why does v postpone processing the incoming connect message from u?
Answer: The fragment of v might be in the process of joining a fragment at level ≥ level_v, in which case the fragment of u should subsume the name and level of that joined fragment, instead of joining the fragment of v at an equal level.
Question
Why does this postponement not give rise to a deadlock?
Answer: Since different edges have different weights, there cannot be a cycle of fragments that are waiting for a reply to a postponed connect message.
Question
In case 1, which problem could occur if at this time v was reporting uv as lowest-weight outgoing edge to its core nodes? Why can we be sure that this is never the case?
Answer: Then v could later receive a ⟨changeroot⟩ telling it to set uv to branch, while this has already been done.
Since in case 1, level_v > level_u, we can be sure that u did not send an accept to v.
Gallager-Humblet-Spira Algorithm - Complexity
Worst-case message complexity: O(|E| + |V| log |V|)
- At most one test-reject or test-test pair per edge.
- Between two subsequent joins, each node in a fragment:
  - receives one initiate;
  - sends at most one test that triggers an accept;
  - sends one report; and
  - sends at most one changeroot or connect.
- Each node experiences at most log |V| joins (because a fragment at level L contains at least 2^L nodes).
Question
How can the Gallager-Humblet-Spira algorithm be turned into an
election algorithm for any undirected graph?
Back to Election
By two extra messages at the very end, the core node with the smallest id becomes the leader.
So Gallager-Humblet-Spira is an election algorithm for general graphs.
Lower bounds for the average-case message complexity of election algorithms, based on comparison of identities:
Rings: Ω(|V| log |V|)
General graphs: Ω(|E| + |V| log |V|)
Question
Consider a ring of size 3, in which all edges have weight 1.
Show that in this case, the Gallager-Humblet-Spira algorithm could
get into a deadlock.
As said before, this deadlock can be avoided by imposing a total
order on edges; take into account the identities of endpoints of an
edge, and use a lexicographical order.
Example
[Figure: a weighted graph on nodes u, v, w, x, y with edge weights 3, 5, 7, 9, 11 and 15, together with the resulting exchange of connect, initiate, test, accept, reject and report messages of the Gallager-Humblet-Spira algorithm.]
Anonymous Networks
Processes may be anonymous (e.g., Lego MindStorm chips), or
transmitting identities may be too expensive (e.g., FireWire bus).
When a leader is known, all processes can be named (using for
instance a traversal algorithm).
Assumptions: Processes have no identities and carry the same local
algorithm.
Impossibility of Election in Anonymous Networks
Theorem: There is no terminating algorithm for electing a leader in an asynchronous anonymous graph.
Proof: Take a (directed) ring of size N.
In a symmetric configuration, all nodes are in the same state and all channels carry the same messages.
- The initial configuration is symmetric.
- If γ0 is symmetric and γ0 → γ1, then there is an execution γ1 → γ2 → ... → γN where γN is symmetric.
So there is an infinite fair execution.
Probabilistic Algorithms
In a probabilistic algorithm, each process p holds two local algorithms, L0_p and L1_p.
Let ρ_p : ℕ → {0, 1} for each p. In a ρ-computation, the k-th event at p is performed according to the local algorithm with index ρ_p(k).
For a probabilistic algorithm where all computations terminate in a correct configuration, fixing ρ_p ≡ 0 for all p yields a correct non-probabilistic algorithm.
Monte Carlo and Las Vegas Algorithms
A probabilistic algorithm is Monte Carlo if:
- it always terminates; and
- the probability that a terminal configuration is correct is greater than zero.
It is Las Vegas if:
- the probability that it terminates is greater than zero; and
- all terminal configurations are correct.
Question
Even if the probability that a Las Vegas algorithm terminates is 1,
this does not imply termination. Why is that?
Question
Assume a Monte Carlo algorithm, and a (deterministic) algorithm
to check whether the Monte Carlo algorithm terminated correctly.
Give a Las Vegas algorithm that terminates with probability 1.
Itai-Rodeh Election Algorithm
Let G be an anonymous, directed ring, in which all nodes know the
ring size N.
The Itai-Rodeh election algorithm is based on the Chang-Roberts
algorithm, where each process sends out its id, and the smallest id
is the only one making a round trip.
Each initiator selects a random id from {1, . . . , N}.
Complication: Different processes may select the same id.
Solution: Each message is supplied with a hop count. A message
arrives at its source if and only if its hop count is N.
If several processes selected the same smallest id, they start a fresh
election round, at a higher level.
The Itai-Rodeh election algorithm is a Las Vegas algorithm; it
terminates with probability 1.
Itai-Rodeh Election Algorithm
Initially, initiators are active at level 0, and non-initiators are passive.
In each election round, if p is active at level ℓ:
• At the start of the round, p selects a random id_p, and sends ⟨ℓ, id_p, 1, true⟩. The 3rd parameter is the hop count. The 4th parameter signals whether another process with the same id was encountered during the round trip.
• p gets ⟨ℓ′, u, h, b⟩ with ℓ < ℓ′, or ℓ = ℓ′ and id_p > u: it becomes passive and sends ⟨ℓ′, u, h+1, b⟩.
• p gets ⟨ℓ′, u, h, b⟩ with ℓ > ℓ′, or ℓ = ℓ′ and id_p < u: it purges the message.
• p gets ⟨ℓ, id_p, h, b⟩ with h < N: it sends ⟨ℓ, id_p, h+1, false⟩.
• p gets ⟨ℓ, id_p, N, false⟩: it proceeds to an election at level ℓ+1.
• p gets ⟨ℓ, id_p, N, true⟩: it becomes the leader.
Passive processes pass on messages, increasing their hop count by one.
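As an illustration, a Python sketch of how a single process might implement these rules; this is an assumed reading of the algorithm, with Process and handle as invented names, and with delivery of messages around the ring left to a surrounding simulation.

import random

N = 5  # ring size, known to all processes

class Process:
    def __init__(self):
        self.level = 0                  # election round (the level)
        self.id = random.randint(1, N)  # randomly chosen id
        self.active = True
        self.leader = False

    def handle(self, msg):
        # msg = (level, id, hops, unique); returns the message to
        # forward to the next process on the ring, or None
        lv, u, h, unique = msg
        if not self.active:
            return (lv, u, h + 1, unique)        # passive: relay, count hop
        if (lv, u) == (self.level, self.id):
            if h < N:                            # another process chose this id
                return (lv, u, h + 1, False)
            if unique:                           # own token, unique round trip
                self.leader = True
                return None
            self.level += 1                      # tie: start a fresh round
            self.id = random.randint(1, N)
            return (self.level, self.id, 1, True)
        if lv > self.level or (lv == self.level and u < self.id):
            self.active = False                  # stronger token: become passive
            return (lv, u, h + 1, unique)
        return None                              # weaker token: purge it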
Itai-Rodeh Election Algorithm - Correctness + Complexity
Correctness: Eventually one leader is elected, with probability 1.
Average-case message complexity: In the order of N log N messages.
Without levels, the algorithm would break down if channels are non-FIFO.
Example: (Figure: a ring with ids u < v and v < w, x, where an old message ⟨v, 1, true⟩ from an earlier round could wrongly make a process leader if messages were not tagged with levels.)
Question
Any suggestions how to adapt the echo algorithm with extinction,
to get an election algorithm for arbitrary anonymous undirected
graphs?
Election in Arbitrary Anonymous Networks
We use the echo algorithm with extinction, with random selection of identities, for election in anonymous undirected graphs in which all nodes know the network size.
Initially, initiators are active at level 0, and non-initiators are passive.
Each active process selects a random id, and starts a wave, tagged with its id and level 0.
Suppose process p in wave v at level ℓ is hit by wave w at level ℓ′:
• if ℓ < ℓ′, or ℓ = ℓ′ and v > w, then p changes to wave w at level ℓ′, and treats the message according to the echo algorithm;
• if ℓ > ℓ′, or ℓ = ℓ′ and v < w, then p purges the message;
• if ℓ = ℓ′ and v = w, then p treats the message according to the echo algorithm.
Election in Arbitrary Networks
Each message sent upwards in the constructed tree reports the size
of its subtree. All other messages report 0.
When a process p decides, it computes the size of the constructed
tree.
If the constructed tree covers the network, p becomes the leader.
Otherwise, it selects a new id, and initiates a new wave,
at a higher level.
Election in Arbitrary Networks - Example
Let u < v < w < x < y; only waves that complete are shown.
(Figure: a network of six processes. First a level-0 wave with id u completes, but its tree does not cover the network. Then a level-1 wave with id x completes; its messages ⟨1, x, n⟩ report subtree sizes n upwards, such as ⟨1, x, 5⟩ toward the root.)
The process at the left computes size 6, and becomes the leader.
Computing the Size of a Network
Theorem: There is no Las Vegas algorithm to compute the size of
an anonymous ring.
When a leader is known, the network size can be computed by the
echo algorithm.
Namely, each message sent upwards in the spanning tree reports
the size of its subtree.
Hence there is no Las Vegas algorithm for election in an anonymous ring if the nodes do not know the ring size.
Impossibility of Computing an Anonymous Network Size
Theorem: There is no Las Vegas algorithm to compute the size of an anonymous ring.
Proof: Consider a directed ring p_1, . . . , p_N.
Assume a probabilistic algorithm with a ρ-computation C that terminates with the correct outcome N. Let each process execute at most L events in C.
Consider the ring p_1, . . . , p_2N. For i = 1, . . . , N, let the first L bits of ρ′(p_i) and ρ′(p_{i+N}) coincide with ρ(p_i). (The probability of such an assignment ρ′ is (1/2)^{NL}.)
Let each event at a p_i in C be executed concurrently at p_i and p_{i+N}. This ρ′-computation terminates with the incorrect outcome N.
Itai-Rodeh Ring Size Algorithm
Each process p maintains an estimate est_p of the ring size. Always est_p ≤ N; initially, est_p = 2.
p initiates an estimate round (1) at the start of the algorithm, and (2) at each update of est_p.
At each round, p selects a random id_p in {1, . . . , R}, sends (est_p, id_p, 1), and waits for a message (est, id, h). Always h ≤ est.
• If est < est_p, p purges the message.
• Let est > est_p.
  - If h < est, then p sends (est, id, h+1), and est_p := est.
  - If h = est, then est_p := est+1.
• Let est = est_p.
  - If h < est, then p sends (est, id, h+1).
  - If h = est and id ≠ id_p, then est_p := est+1.
  - If h = est and id = id_p, p purges the message (possibly its own token returned).
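The update rules can be condensed into a Python sketch; this is an assumed illustration, with on_receive and new_round as invented helpers and R as the id range.

import random

R = 10  # id range; a larger R lowers the probability of a wrong outcome

def new_round(p):
    # each update of est starts a fresh round with a new random id
    p['id'] = random.randint(1, R)
    return (p['est'], p['id'], 1)

def on_receive(p, est, vid, h):
    # p: dict with fields 'est' and 'id'; returns the tokens p sends out
    out = []
    if est < p['est']:
        return out                         # stale token: purge it
    if est > p['est']:
        if h < est:
            out.append((est, vid, h + 1))  # forward the larger token
            p['est'] = est
        else:
            p['est'] = est + 1             # token survived est hops: ring is bigger
        out.append(new_round(p))           # every update of est starts a round
    elif h < est:
        out.append((est, vid, h + 1))
    elif vid != p['id']:
        p['est'] = est + 1                 # same estimate, different id
        out.append(new_round(p))
    # else: h = est and vid = p['id']: possibly p's own token returned; purge
    return out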
Question
Upon message-termination, is est_p always the same at all p?
Question
Why will est_p never become greater than N?
Itai-Rodeh Ring Size Algorithm - Correctness
Message-termination can be detected using a decentralized termination detection algorithm.
The Itai-Rodeh ring size algorithm is a Monte Carlo algorithm. Possibly, est_p is in the end smaller than the ring size.
Example: (Figure: a ring of size 4 on which the ids alternate u, v, u, v. Each token (u, 2) or (v, 2) makes two hops and arrives at a process with the same id and estimate, so every process keeps the incorrect estimate 2.)
The probability of computing an incorrect ring size tends to zero when R tends to infinity.
Itai-Rodeh Ring Size Algorithm - Complexity
Worst-case message complexity: O(N^3)
Each process starts at most N−1 estimate rounds; each round it sends out one message, which takes at most N steps.
Itai-Rodeh Ring Size Algorithm - Example
(Figure: an example run in which the estimates increase from 2 via 3 to the correct ring size 4; the tokens (u, 2), (v, 2), (w, 2), (x, 3), (y, 3), (z, 4), . . . are shown at each stage.)
FireWire Election Algorithm
IEEE Standard 1394, called FireWire, is a serial multimedia bus. It
connects digital devices, which can be added and removed dynamically.
It uses the tree election algorithm for undirected, acyclic graphs,
adapted to anonymous networks. (Cyclic graphs give a time-out.)
The network size is unknown to the nodes!
When a node has one possible father, it sends a parent request to this
neighbor. If the request is accepted, an acknowledgement is sent back.
The last two fatherless nodes can send parent requests to each
other simultaneously. This is called root contention.
Each of the two nodes in root contention randomly decides to
either immediately send a parent request again, or to wait some
time for a parent request from the other node.
This election algorithm is a Las Vegas algorithm for acyclic graphs;
it terminates with probability 1.
Question
In case of root contention, is it optimal to give an equal chance of
0.5 to both sending immediately and waiting for some time?
Question
Give a terminating algorithm for computing the network size of
anonymous, acyclic graphs.
Routing Algorithms
See also Computer Networks (Chapter 5.2).
Routing means guiding a packet in a network to its destination.
A routing table at node u stores for each node v ≠ u a neighbor w of u: each packet with destination v that arrives at u is then passed on to w.
Criteria for good routing algorithms:
• use of optimal paths;
• robust with respect to topology changes in the network;
• computing routing tables is cheap;
• table adaptation to avoid busy channels;
• all nodes are served in the same degree.
All-Pairs Shortest-Path Problem
Let G = (V, E) be a directed, weighted graph, with weights ω_uv > 0.
We want to compute for each pair of nodes u, v a shortest path from u to v in G.
For S ⊆ V, d^S(u, v) denotes the length of a shortest path in G with all intermediate nodes in S.
d^S(u, u) = 0
d^∅(u, v) = ω_uv if uv ∈ E and u ≠ v
d^∅(u, v) = ∞ if uv ∉ E and u ≠ v
d^{S∪{w}}(u, v) = min{ d^S(u, v), d^S(u, w) + d^S(w, v) }
Note that d^V is the standard distance function.
Floyd-Warshall Algorithm
Exploits the last equality to compute d^S where S grows from ∅ to V.
S := ∅
forall u, v ∈ V do
  if u = v then D_u[v] := 0; Nb_u[v] := ⊥
  else if uv ∈ E then D_u[v] := ω_uv; Nb_u[v] := v
  else D_u[v] := ∞; Nb_u[v] := ⊥
while S ≠ V do
  pick w from V\S   (w-pivot round)
  forall u, v ∈ V do
    if D_u[w] + D_w[v] < D_u[v] then
      D_u[v] := D_u[w] + D_w[v]; Nb_u[v] := Nb_u[w]
  S := S ∪ {w}
Time complexity: Θ(|V|^3)
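The pseudocode translates almost line by line into Python. Below is an assumed sketch, with INF playing the role of ∞ and None the role of ⊥; weights are positive, as above.

INF = float('inf')

def floyd_warshall(V, wt):
    # V: list of nodes; wt: dict mapping a pair (u, v) to the weight of
    # the directed edge uv (all weights > 0)
    D  = {(u, v): 0 if u == v else wt.get((u, v), INF) for u in V for v in V}
    Nb = {(u, v): (v if (u, v) in wt else None) for u in V for v in V if u != v}
    for w in V:                               # the w-pivot round
        for u in V:
            for v in V:
                if D[u, w] + D[w, v] < D[u, v]:
                    D[u, v] = D[u, w] + D[w, v]
                    Nb[u, v] = Nb[u, w]       # route to v via the route to w
    return D, Nb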
Floyd-Warshall Algorithm - Example
(Figure: a directed graph on the nodes u, v, w, x with edge weights 1, 1, 1 and 4.)
pivot u: D_x[v] := 5, D_v[x] := 5, Nb_x[v] := u, Nb_v[x] := u
pivot v: D_u[w] := 5, D_w[u] := 5, Nb_u[w] := v, Nb_w[u] := v
pivot w: D_x[v] := 2, D_v[x] := 2, Nb_x[v] := w, Nb_v[x] := w
pivot x: D_u[w] := 2, D_w[u] := 2, Nb_u[w] := x, Nb_w[u] := x
         D_u[v] := 3, D_v[u] := 3, Nb_u[v] := x, Nb_v[u] := w
Question
How can the Floyd-Warshall algorithm be turned into a distributed
algorithm?
Toueg's Algorithm
A distributed version of Floyd-Warshall computes the routing tables at their nodes.
Given an undirected, weighted graph.
Assumption: Each node knows from the start the identities of all nodes in V. (Because pivots must be picked uniformly at all nodes.)
At the w-pivot round, w broadcasts its values D_w[v], for all v ∈ V.
If Nb_u[w] = ⊥ with u ≠ w at the w-pivot round, then D_u[w] = ∞, so D_u[w] + D_w[v] ≥ D_u[v] for all v ∈ V. Hence the sink tree of w can be used to broadcast D_w.
Nodes u with Nb_u[w] ≠ ⊥ must tell Nb_u[w] to pass on D_w:
• u sends ⟨ys, w⟩ to Nb_u[w] if it is not ⊥;
• u sends ⟨nys, w⟩ to its other neighbors.
Toueg's Algorithm - Local Algorithm at Node u
Initialization:
S := ∅
forall v ∈ V do
  if u = v then D_u[v] := 0; Nb_u[v] := ⊥
  else if v ∈ Neigh_u then D_u[v] := ω_uv; Nb_u[v] := v
  else D_u[v] := ∞; Nb_u[v] := ⊥
Toueg's Algorithm - Local Algorithm at Node u
while S ≠ V do
  pick w from V\S
  forall x ∈ Neigh_u do
    if Nb_u[w] = x then send ⟨ys, w⟩ to x
    else send ⟨nys, w⟩ to x
  num_rec_u := 0
  while num_rec_u < |Neigh_u| do
    receive a ⟨ys, w⟩ or ⟨nys, w⟩ message;
    num_rec_u := num_rec_u + 1
  if D_u[w] < ∞ then
    if u ≠ w then receive D_w from Nb_u[w]
    forall x ∈ Neigh_u that sent ⟨ys, w⟩ do send D_w to x
    forall v ∈ V do
      if D_u[w] + D_w[v] < D_u[v] then
        D_u[v] := D_u[w] + D_w[v]; Nb_u[v] := Nb_u[w]
  S := S ∪ {w}
Toueg's Algorithm - Complexity and Drawbacks
Message complexity: Θ(|V||E|)
Drawbacks:
• uniform selection of pivot nodes requires that all nodes know V in advance;
• global broadcast of D_w at the w-pivot round;
• not robust with respect to topology changes.
Toueg's Algorithm - Optimization
Let Nb_x[w] = u with u ≠ w at the start of the w-pivot round. If D_u[v] is not changed in this round, then neither is D_x[v].
Upon reception of D_w, u can therefore first update D_u and Nb_u, and only forward values D_w[v] for which D_u[v] has changed.
Additional advantage: Cycle-free sink trees not only between but also during pivot rounds.
Example: (Figure: a graph on the nodes u, v, w, x with edge weights 4, 1, 1 and 1.)
Subsequent pivots: u, x, w, v.
Without the optimization, at the w-pivot round the sink tree toward v may temporarily contain a cycle: Nb_x[v] = u and Nb_u[v] = x.
Chandy-Misra Algorithm
A centralized algorithm to compute all shortest paths to initiator v0.
Again, an undirected, weighted graph is assumed.
Each node uses only D_w[v0] values from neighbors w.
Initially, D_v0[v0] = 0, D_u[v0] = ∞ if u ≠ v0, and Nb_u[v0] = ⊥.
v0 sends the message ⟨mydist, 0⟩ to its neighbors.
When a node u receives ⟨mydist, d⟩ from a neighbor w, and if d + ω_uw < D_u[v0], then:
• D_u[v0] := d + ω_uw and Nb_u[v0] := w;
• u sends ⟨mydist, D_u[v0]⟩ to its neighbors (except w).
Termination detection by the Dijkstra-Scholten algorithm.
Worst-case message complexity: Exponential
Worst-case message complexity for minimum-hop: O(|V|^2 |E|)
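A centralized simulation may help to see the relaxation step. This is an assumed Python sketch (chandy_misra is an invented name): a FIFO queue of (target, distance, sender) triples stands in for the channels, and the queue running empty stands in for Dijkstra-Scholten termination detection.

from collections import deque

INF = float('inf')

def chandy_misra(adj, v0):
    # adj[u]: list of (w, weight) pairs, the undirected edges of u
    D, Nb = {u: INF for u in adj}, {u: None for u in adj}
    D[v0] = 0
    msgs = deque((w, 0, v0) for w, _ in adj[v0])   # <mydist, 0> to neighbors
    while msgs:
        u, d, w = msgs.popleft()                   # u gets <mydist, d> from w
        if d + dict(adj[u])[w] < D[u]:
            D[u], Nb[u] = d + dict(adj[u])[w], w
            for x, _ in adj[u]:
                if x != w:
                    msgs.append((x, D[u], u))      # <mydist, D_u> onwards
    return D, Nb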
Chandy-Misra Algorithm - Example
(Figure: an undirected graph on the nodes v0, u, w, x with edge weights 1, 1, 1, 1, 4 and 6.)
A possible run, with initiator v0:
D_v0 := 0, Nb_v0 := ⊥
D_w := 6, Nb_w := v0
D_u := 7, Nb_u := w
D_x := 8, Nb_x := u
D_x := 7, Nb_x := w
D_u := 4, Nb_u := v0
D_w := 5, Nb_w := u
D_x := 6, Nb_x := w
D_x := 5, Nb_x := u
D_x := 1, Nb_x := v0
D_w := 2, Nb_w := x
D_u := 3, Nb_u := w
D_u := 2, Nb_u := x
Merlin-Segall Algorithm
A centralized algorithm to compute all shortest paths to initiator v0.
Again, an undirected, weighted graph is assumed.
Initially, D_v0[v0] = 0, D_u[v0] = ∞ if u ≠ v0, and the Nb_u[v0] form a sink tree with root v0.
At each update round, for u ≠ v0:
1. v0 sends ⟨mydist, 0⟩ to its neighbors.
2. Let u get ⟨mydist, d⟩ from neighbor w.
   If d + ω_uw < D_u[v0], then D_u[v0] := d + ω_uw (and u stores w as future value for Nb_u[v0]).
   If w = Nb_u[v0], u sends ⟨mydist, D_u[v0]⟩ to its neighbors except Nb_u[v0].
3. When u has received a mydist message from all its neighbors, it sends ⟨mydist, D_u[v0]⟩ to Nb_u[v0], and next updates Nb_u[v0].
v0 starts a new update round after receiving mydist from all neighbors.
Merlin-Segall Algorithm - Termination and Complexity
After i update rounds, all shortest paths of ≤ i hops have been computed. The algorithm terminates after |V|−1 update rounds.
Message complexity: Θ(|V|^2 |E|)
Example: (Figure: three update rounds on a five-node graph with edge weights 1, 1, 1, 2, 4, 4 and 5; the distance values at the nodes improve each round until all shortest paths to v0 have been computed.)
Merlin-Segall Algorithm - Topology Changes
A number is attached to mydist messages.
When a channel fails or becomes operational, adjacent nodes send the number of the update round to v0 via the sink tree. (If the message meets a failed tree link, it is discarded.)
When v0 receives such a message, it starts a new set of |V|−1 update rounds, with a higher number.
If a failed channel is part of the sink tree, the remaining tree is extended to a sink tree, similar to the initialization phase.
Example: (Figure: a graph on the nodes v0, u, v, w, x, y, z in which the channel between y and z fails.)
y signals to z that the failed channel was part of the sink tree.
Breadth-First Search
Consider an undirected graph.
A spanning tree is a breadth-first search tree if each tree path to the root is minimum-hop.
The Chandy-Misra algorithm for minimum-hop paths computes a breadth-first search tree using O(|V||E|) messages (for each root).
Breadth-First Search - A Simple Algorithm
Initially (after round 0), the initiator is at level 0, and all other nodes are at level ∞.
After round f ≥ 0, each node at f hops from the initiator (1) is at level f, and (2) knows which neighbors are at level f−1.
We explain what happens in round f+1:
(Figure: forward messages travel down the tree to the nodes at level f; from there, explore messages probe one level deeper, and reverse replies travel back up to the initiator at level 0.)
Breadth-First Search - A Simple Algorithm
• Messages ⟨forward, f⟩ travel down the tree, from the initiator to the nodes at level f.
• A node at level f sends ⟨explore, f+1⟩ to all neighbors that are not at level f−1. It stores incoming messages ⟨explore, f+1⟩ and ⟨reverse, b⟩ as answers.
• If a node at level ∞ gets ⟨explore, f+1⟩, its level becomes f+1, and the sender becomes its father. It sends back ⟨reverse, true⟩.
• If a node at level f+1 gets ⟨explore, f+1⟩, it stores that the sender is at level f. It sends back ⟨reverse, false⟩.
Breadth-First Search - A Simple Algorithm
• A non-initiator at level f (or < f) waits until all messages ⟨explore, f+1⟩ (resp. ⟨forward, f⟩) have been answered. Then it sends ⟨reverse, b⟩ to its father, where b = true if and only if new nodes were added to its subtree.
• The initiator waits until all ⟨forward, f⟩ (or, in round 1, ⟨explore, 1⟩) messages are answered. It continues with round f+2 if new nodes were added in round f+1. Otherwise it terminates.
Breadth-First Search - Complexity
Worst-case message complexity: O(|V|^2)
There are at most |V| rounds. Each round, a tree edge carries at most one forward and one replying reverse. In total, an edge carries one explore and one replying reverse or explore.
Worst-case time complexity: O(|V|^2)
Round f is completed in at most 2f time units.
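The distributed rounds compute, level by level, what the following centralized sketch computes (an assumed illustration, not the distributed algorithm itself): iteration f+1 of the loop adds exactly the nodes that round f+1 reaches with its explore messages.

def bfs_levels(adj, root):
    # adj[u]: list of neighbors of u; absent nodes are at level infinity
    level = {root: 0}
    frontier, f = [root], 0
    while frontier:                      # one iteration per round
        nxt = []
        for u in frontier:               # nodes at level f
            for v in adj[u]:             # explore one level deeper
                if v not in level:
                    level[v] = f + 1     # the sender would become the father
                    nxt.append(v)
        frontier, f = nxt, f + 1
    return level

print(bfs_levels({'a': ['b', 'c'], 'b': ['a', 'c'], 'c': ['a', 'b']}, 'a'))
# {'a': 0, 'b': 1, 'c': 1}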
Frederickson's Algorithm
Computes ℓ ≥ 1 levels per round.
Initially, the initiator is at level 0, and all other nodes are at level ∞.
After round k, each node at f ≤ ℓk hops from the initiator (1) is at level f, and (2) knows which neighbors are at level f−1.
At round k+1:
• ⟨forward, ℓk⟩ travels down the tree, from the initiator to the nodes at level ℓk. forward messages from non-fathers are replied with ⟨no-child⟩.
Frederickson's Algorithm
• A node at level ℓk sends ⟨explore, ℓk+1, ℓ⟩ to neighbors that are not at level ℓk−1.
• If a node u receives ⟨explore, f, m⟩ with f < level_u, then level_u := f, and the sender becomes u's new father. If m > 1, u sends ⟨explore, f+1, m−1⟩ to those neighbors that are not already known to be at level f−1, f or f+1. If m = 1 (so f = ℓ(k+1)), u sends back ⟨reverse, true⟩.
• If a node u receives ⟨explore, f, m⟩ with f ≥ level_u, and u did not send ⟨explore, level_u, ·⟩ into this channel, it sends back ⟨reverse, false⟩.
Frederickson's Algorithm
• A non-initiator at level ℓk ≤ f < ℓ(k+1) (or f < ℓk) waits until all ⟨explore, f+1⟩ (resp. ⟨forward, ℓk⟩) messages have been answered. Then it sends ⟨reverse, b⟩ to its father, where b = true if and only if new nodes were added.
• The initiator waits until all ⟨forward, ℓk⟩ (or, in round 1, ⟨explore, 1, ℓ⟩) messages are answered. It continues with round k+2 if new nodes were added in round k+1. Otherwise it terminates.
Frederickson's Algorithm - Complexity
Worst-case message complexity: O(|V|^2/ℓ + ℓ|E|)
There are at most ⌈|V|/ℓ⌉ rounds. Each round, a tree edge carries at most one forward and one replying reverse. In total, an edge carries at most 2ℓ explores and replying reverses. In total, a frond edge carries at most one spurious forward and one replying no-child.
Worst-case time complexity: O(|V|^2/ℓ)
Levels ℓk+1 up to ℓ(k+1) are computed in 2ℓ(k+1) time units.
Let ℓ = ⌈|V|/√|E|⌉. Then both message and time complexity are O(|V|√|E|).
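A quick numeric check of this choice of ℓ, with the graph size assumed purely for illustration:

import math

V, E = 1024, 4096                  # assumed sizes; sqrt(|E|) = 64
l = math.ceil(V / math.sqrt(E))    # l = 16
print(V * V / l, l * E)            # 65536.0 65536: the two terms balance
print(V * math.sqrt(E))            # 65536.0 = |V| * sqrt(|E|)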
Deadlock-Free Packet Switching
Let G = (V, E) be a directed graph, supplied with routing tables.
Each node has buffers to store data packets on their way to their destination.
For simplicity we assume synchronous communication.
Possible events:
• Generation: A new packet is placed in an empty buffer.
• Forwarding: A packet is forwarded to an empty buffer of the next node on its route.
• Consumption: A packet at its destination node is removed from the buffer.
At a node with all buffers empty, generation of a packet must always be allowed.
Store-and-Forward Deadlocks
A store-and-forward deadlock occurs when a group of packets are all waiting for the use of a buffer occupied by a packet in the group.
A controller avoids such deadlocks. It prescribes whether a packet can be generated or forwarded, and in which buffer it is placed next.
Destination Scheme
V = {v_1, . . . , v_N}, and T_i denotes the sink tree (with respect to the routing tables) with root v_i, for i = 1, . . . , N.
In the destination scheme, each node carries N buffers.
• When a packet with destination v_i is generated at u, it is placed in the i-th buffer of u.
• If uv is an edge in T_i, then the i-th buffer of u is linked to the i-th buffer of v.
Hops-So-Far Scheme
k is the length of a longest path in any T_i.
In the hops-so-far scheme, each node carries k+1 buffers.
• When a packet is generated at u, it is placed in the first buffer of u.
• If uv is an edge in some T_i, then each j-th buffer of u is linked to the (j+1)-th buffer of v.
Forward-Count Controller
Suppose that for a packet p at a node u, the number s_u(p) of hops that p still has to make to its destination is always known.
f_u is the number of free buffers at u.
k_u is the length of a longest path, starting in u, in any sink tree in G.
In the forward-count controller, each node u contains k_u + 1 buffers. A packet p is accepted at node u if and only if s_u(p) < f_u.
If the buffers of a node u are all empty, u can accept any packet.
Unlike the previous controllers, an accepted packet can be placed in any buffer.
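The acceptance test itself is a single comparison; a small assumed sketch:

def accepts(free_buffers, hops_to_go):
    # forward-count controller: u accepts packet p iff s_u(p) < f_u
    return hops_to_go < free_buffers

# a node u with all k_u + 1 buffers empty accepts any packet,
# since s_u(p) <= k_u < k_u + 1 = f_u
k_u = 3
print(accepts(k_u + 1, k_u))   # True: empty node, longest possible path
print(accepts(1, 2))           # False: the packet is refused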
Forward-Count Controller - Correctness
Theorem: Forward-count controllers are deadlock-free.
Proof: Consider a reachable configuration γ where no forwarding or consumption is possible. Suppose, toward a contradiction, that in γ some buffer is occupied.
Select a packet p, at some node u, with s_u(p) minimal. p must be forwarded to a node w, but is blocked:
s_w(p) ≥ f_w
Then some buffer in w is occupied. Let q be the packet at w that arrived last. Let f_w^old be the number of free buffers before q's arrival.
s_w(q) < f_w^old ≤ f_w + 1
Hence, we get a contradiction:
s_u(p) = s_w(p) + 1 ≥ f_w + 1 > s_w(q)
So in γ, all buffers are empty.
Acyclic Orientation Cover Controller
Let G be undirected. An acyclic orientation of G is a directed, acyclic graph obtained by directing all edges of G.
Acyclic orientations G_1, . . . , G_n of G are an acyclic orientation cover of a set T of paths in G if each path in T is the concatenation of paths P_1, . . . , P_n in G_1, . . . , G_n.
Given an acyclic orientation cover G_1, . . . , G_n of the sink trees. In the acyclic orientation cover controller, each node has n buffers.
• A packet generated at v is placed in the first buffer of v.
• If vw is an edge in G_i, then the i-th buffer of v is linked to the i-th buffer of w, and if i < n, the i-th buffer of w is linked to the (i+1)-th buffer of v.
Example
For each undirected ring there exists a deadlock-free controller that uses three buffers per node and allows packets to travel via minimum-hop paths.
For instance, in case of a ring of size six: (Figure: three acyclic orientations G_1, G_2, G_3 of the ring.)
Acyclic Orientation Cover Controller - Correctness
Theorem: Acyclic orientation cover controllers are deadlock-free.
Proof: Consider a reachable configuration γ. Make forwarding and consumption transitions to a configuration δ where no forwarding or consumption is possible.
Since G_n is acyclic, packets in n-th buffers can travel to their destinations. So in δ, n-th buffers are empty.
Suppose all (i+1)-th buffers are empty in δ, for some i < n. Then all i-th buffers must also be empty in δ. For else, since G_i is acyclic, some packet in an i-th buffer could be forwarded or consumed.
Concluding, in δ all buffers are empty.
Fault Tolerance
A process may (1) crash, i.e., execute no further events, or even (2) become Byzantine, meaning that it can perform arbitrary events.
Assumptions: The graph is complete, i.e., there is an undirected channel between each pair of different processes. Thus, failing processes never make the network disconnected.
Crashing of processes cannot be observed.
Consensus: Correct processes must uniformly decide 0 or 1.
Assumption: Initial configurations are bivalent, meaning that both decisions occur in some terminal configurations (that are reachable by correct transitions).
Given a configuration, a set S of processes is b-potent if by only executing events at processes in S, some process in S can decide b.
Impossibility of 1-Crash Consensus
Theorem: There is no terminating algorithm for 1-crash consensus (i.e., only one process may crash).
Proof: Suppose, toward a contradiction, there is such an algorithm.
Let γ be a reachable bivalent configuration. Then γ → γ0 and γ → γ1, where γ0 can lead to decision 0 and γ1 to decision 1.
• Suppose these transitions correspond to events at different processes. Then γ0 → δ and γ1 → δ for some δ. So γ0 or γ1 is bivalent.
• Suppose these two transitions correspond to events at the same process p. In γ, p can crash, so the other processes are b-potent for some b. Then in γ0 and γ1, the processes except p are b-potent. So γ0 or γ1 is bivalent.
Concluding, each reachable bivalent configuration can make a transition to a bivalent configuration. Since initial configurations are bivalent, there is an execution visiting only bivalent configurations.
Note: There even exists a fair infinite execution.
Impossibility of ⌈N/2⌉-Crash Consensus
Theorem: Let t ≥ N/2. There is no Las Vegas algorithm for t-crash consensus.
Proof: Suppose, toward a contradiction, there is such an algorithm. Partition the set of processes into S and T, with |S| = ⌈N/2⌉ and |T| = ⌊N/2⌋.
In reachable configurations, S and T are either both 0-potent or both 1-potent. For else, since t ≥ N/2, S and T could independently decide for different values.
Since the initial configuration is bivalent, there is a reachable configuration γ and a transition γ → δ with S and T both only b-potent in γ and only (1−b)-potent in δ. Such a transition cannot exist.
Question
Give a Monte Carlo algorithm for t-crash consensus for any t.
Bracha-Toueg Crash Consensus Algorithm
Let t < N/2. Initially, each correct process randomly chooses a value 0 or 1, with weight 1. In round k, at each correct, undecided p:
• p sends ⟨k, value_p, weight_p⟩ to all processes (including itself).
• p waits till N−t messages ⟨k, b, w⟩ arrived. (p purges/stores messages from earlier/future rounds.)
  If w > N/2 for a ⟨k, b, w⟩, then value_p := b. (This b is unique.)
  Else, value_p := 0 if most messages voted 0, and value_p := 1 otherwise.
  weight_p is changed into the number of incoming votes for value_p in round k.
• If w > N/2 for more than t incoming messages ⟨k, b, w⟩, then p decides b. (Note that t < N−t.)
When p decides b, it broadcasts ⟨k+1, b, N−t⟩ and ⟨k+2, b, N−t⟩, and terminates.
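The per-round update can be condensed into a Python sketch; this is an assumed illustration (crash_round is an invented name), with the collection of the N−t votes left to the surrounding system.

def crash_round(votes, N, t):
    # votes: the N - t messages (value, weight) received in this round;
    # returns (new value, new weight, decided)
    heavy = [b for b, w in votes if w > N / 2]
    if heavy:
        value = heavy[0]                      # such a b is unique
    else:
        zeros = sum(1 for b, _ in votes if b == 0)
        value = 0 if zeros > len(votes) - zeros else 1
    weight = sum(1 for b, _ in votes if b == value)
    return value, weight, len(heavy) > t      # decide on > t heavy b-votes

# e.g. crash_round([(0, 2), (0, 2)], N=3, t=1) returns (0, 2, True)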
Bracha-Toueg Crash Consensus Algorithm - Example
N = 3 and t = 1. Each round a correct process requires two incoming messages, and two b-votes with weight 2 to decide b. (Messages of a process to itself are not depicted.)
(Figure: a run in which one process crashes; votes such as ⟨1, 0, 1⟩, ⟨1, 1, 1⟩, ⟨2, 0, 2⟩, ⟨3, 0, 1⟩ and ⟨4, 0, 2⟩ are exchanged, until the remaining correct processes have value 0 with weight 2 and decide 0.)
Bracha-Toueg Crash Consensus Algorithm - Correctness
Theorem: Let t < N/2. The Bracha-Toueg t-crash consensus algorithm is a Las Vegas algorithm that terminates with probability 1.
Proof (part I): Suppose a process decides b in round k. Then in this round, value_q = b and weight_q > N/2 for more than t processes q.
So in round k, each correct process receives a message ⟨k, b, w⟩ with w > N/2.
Hence, in round k+1, all correct processes vote b.
Then, after round k+2, all correct processes have decided b.
Concluding, all correct processes decide for the same value.
Bracha-Toueg Crash Consensus Algorithm - Correctness
Proof (part II): Assumption: Scheduling of messages is fair.
Let S be a set of N−t processes that do not crash.
Due to fair scheduling, there is a chance ρ > 0 that in a round each process in S receives its first N−t messages from processes in S.
So with chance ρ^3 this happens in the consecutive rounds k, k+1, k+2.
After round k, all processes in S have the same value b.
After round k+1, all processes in S have weight N−t > N/2.
Since N−t > t, after round k+2, all processes in S have decided b.
Concluding, the algorithm terminates with probability 1.
Impossibility of ⌈N/3⌉-Byzantine Consensus
Theorem: Let t ≥ N/3. There is no t-Byzantine consensus algorithm.
Proof: Suppose, toward a contradiction, that there is such an algorithm. Since t ≥ N/3, we can choose sets S and T of processes with |S| = |T| = N−t and |S ∩ T| ≤ t.
Let configuration γ be reachable by a sequence of correct transitions (so that still any process can become Byzantine). In γ, S and T are either both 0-potent or both 1-potent. For else, since the processes in S ∩ T may become Byzantine, S and T could independently decide for different values.
Since the initial configuration is bivalent, there is a configuration γ, reachable by correct transitions, and a correct transition γ → δ, with S and T both only b-potent in γ and only (1−b)-potent in δ. Such a transition cannot exist.
Bracha-Toueg Byzantine Consensus Algorithm
Let t < N/3. Bracha and Toueg gave a t-Byzantine consensus algorithm.
Again, in every round, correct processes broadcast their value, and wait for N−t incoming messages. (No weights are needed.)
A correct process decides b upon receiving more than (N−t)/2 + t = (N+t)/2 b-votes in one round. (Note that (N+t)/2 < N−t.)
Echo Mechanism
Complication: A Byzantine process may send different votes to different processes.
Example: Let N = 4 and t = 1. Each round, a correct process waits for 3 votes, and needs 3 b-votes to decide b.
(Figure: a Byzantine process sends vote 1 to one correct process and vote 0 to another, so that one correct process decides 1 while the others decide 0.)
Solution: Each incoming vote is verified using an echo mechanism. A vote is accepted after more than (N+t)/2 confirming echoes.
Bracha-Toueg Byzantine Consensus Algorithm
Initially, each correct process randomly chooses 0 or 1.
In round k, at each correct, undecided p:
• p broadcasts ⟨in, k, value_p⟩.
• If p receives ⟨in, ℓ, b⟩ with ℓ ≥ k from q, it broadcasts ⟨ec, q, ℓ, b⟩.
• p counts incoming ⟨ec, q, k, b⟩ messages for each q, b. When more than (N+t)/2 such messages arrived, p accepts q's b-vote.
• p purges ⟨ec, q, ℓ, b⟩ with ℓ < k, and stores ⟨in, ℓ, b⟩ and ⟨ec, q, ℓ, b⟩ with ℓ > k.
• The round is completed when p has accepted N−t votes. If most votes are for 0, then value_p := 0. Else, value_p := 1.
Bracha-Toueg Byzantine Consensus Algorithm
p keeps track whether it received multiple messages ⟨in, ℓ, ·⟩ or ⟨ec, q, ℓ, ·⟩ via the same channel. (The sender must be Byzantine.) p only takes into account the first of these messages.
If more than (N+t)/2 of the accepted votes were for b, then p decides b.
When p decides b, it broadcasts ⟨decide, b⟩ and terminates.
The other processes interpret ⟨decide, b⟩ as a b-vote by p, and a b-echo by p for each q, for all rounds to come.
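The two thresholds can be captured in a tiny assumed sketch (the function names are invented):

def accepts_vote(echo_count, N, t):
    # q's b-vote is accepted after more than (N + t)/2 confirming echoes
    return echo_count > (N + t) / 2

def decides(accepted_b_votes, N, t):
    # decide b after more than (N + t)/2 accepted b-votes in one round
    return accepted_b_votes > (N + t) / 2

# N = 4, t = 1: both thresholds are (4 + 1)/2 = 2.5, so 3 suffices
print(accepts_vote(3, 4, 1), decides(3, 4, 1))   # True True
print(accepts_vote(2, 4, 1))                     # False: possibly forged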
Question
If an undecided process receives decide, b), why can it in general
not immediately decide b?
Example
We study the previous example again, now with verification of votes.
Let N = 4 and t = 1. Each round, a correct process needs 3 confirmations to accept a vote, 3 accepted votes, and 3 b-votes to decide b.
(Figure: the same run as before; only the relevant in messages are depicted, without their round numbers.)

Example
In the first round, the left bottom node does not accept vote 1 by the Byzantine process, since none of the other two correct processes confirm this vote. So it waits for (and accepts) vote 0 by the right bottom node, and thus does not decide 1 in the first round.
(Figure: in the next round the decided processes broadcast ⟨decide, 0⟩, and the remaining correct process decides 0 as well.)
Bracha-Toueg Byzantine Consensus Alg. - Correctness
Theorem: Let t < N/3. The Bracha-Toueg t-Byzantine consensus algorithm is a Las Vegas algorithm that terminates with probability 1.
Proof: Each round, the correct processes eventually accept N−t votes, since there are N−t correct processes (and N−t > (N+t)/2).
Suppose in round k, correct processes p and q accept votes for b and b′, respectively, from a process r. Then p and q received more than (N+t)/2 messages ⟨ec, r, k, b⟩ and ⟨ec, r, k, b′⟩, respectively.
More than t processes, so at least one correct process, sent such messages to both p and q. Then b = b′.
Bracha-Toueg Byzantine Consensus Alg. - Correctness
Suppose a correct process decides b in round k. In this round it accepts more than (N+t)/2 b-votes. So in round k, each correct process accepts more than (N+t)/2 − t = (N−t)/2 b-votes. Hence, in round k+1, value_q = b for each correct q. This implies that the correct processes vote b in all rounds > k. (Namely, in rounds > k, each correct process accepts at least N−2t > (N−t)/2 b-votes.)
Let S be a set of N−t processes that do not become Byzantine. Due to fair scheduling, there is a chance ρ > 0 that in a round each process in S accepts N−t votes from processes in S. So there is a chance ρ^2 that this happens in consecutive rounds k, k+1.
After round k, all processes in S have the same value b. After round k+1, all processes in S have decided b.
Failure Detectors and Synchronous Systems
A failure detector at a process keeps track which processes have
(or may have) crashed.
Given a (known or unknown) upper bound on network delay, and
heartbeat messages by each process, one can implement a failure
detector.
In a synchronous system, processes execute in lock step.
Given local clocks that have a known bounded inaccuracy, and a
known upper bound on network delay, one can transform a system
based on asynchronous communication into a synchronous system.
With a failure detector, and for a synchronous system, the proof
for impossibility of 1-crash consensus no longer applies. Consensus
algorithms have been developed for these settings.
Failure Detection
Aim: To detect crashed processes.
T is the time domain. F(τ) is the set of crashed processes at time τ.
A process cannot observe τ and F(τ).
τ1 ≤ τ2 ⇒ F(τ1) ⊆ F(τ2) (i.e., no restart).
Crash(F) = ∪_{τ∈T} F(τ), and H(p, τ) is the set of processes suspected to be crashed by process p at time τ.
Each execution is decorated with a failure pattern F and a failure detector history H.
We require that a failure detector is complete: From some time onward, every crashed process is suspected by every correct process.
p ∈ Crash(F) ∧ q ∉ Crash(F) ⇒ ∃σ ∀τ ≥ σ: p ∈ H(q, τ)
Strongly Accurate Failure Detection
A failure detector is strongly accurate if only crashed processes are suspected:
∀τ ∀p, q ∉ F(τ): p ∉ H(q, τ)
Assumptions:
• Each correct process broadcasts alive every ν time units.
• δ is an upper bound on communication delay.
Each process from which no message is received for ν + δ time units has crashed.
This failure detector is complete and strongly accurate.
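A heartbeat-based sketch of such a detector; an assumed illustration, with the class and method names invented, and nu and delta the parameters from the assumptions above.

import time

class FailureDetector:
    # suspect q iff no alive message from q arrived for nu + delta time units
    def __init__(self, processes, nu, delta):
        self.timeout = nu + delta
        self.last_heard = {q: time.time() for q in processes}

    def on_alive(self, q):
        self.last_heard[q] = time.time()     # heartbeat from q

    def suspected(self):
        now = time.time()
        return {q for q, last in self.last_heard.items()
                if now - last > self.timeout}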
Weakly Accurate Failure Detection
With a failure detector, the proof for impossibility of 1-crash consensus no longer applies, if for instance from some time on some process is never suspected by any process.
A failure detector is weakly accurate if some process is never suspected:
∃p ∀τ ∀q ∉ F(τ): p ∉ H(q, τ)
Assume a complete and weakly accurate failure detector. We give a rotating coordinator algorithm for (N−1)-crash consensus.
Consensus with Weakly Accurate Failure Detection
Processes are numbered: p_1, . . . , p_N. Initially, each process has value 0 or 1. In round k:
• p_k (if not crashed) broadcasts its value.
• Each process waits:
  - either for an incoming message from p_k, in which case it adopts the value of p_k;
  - or until it suspects that p_k crashed.
After round N, each correct process decides for its value at that time.
Correctness: Let p_j never be suspected. After round j, all correct processes have the same value b. Hence, after round N, all correct processes decide b.
Eventually Strongly Accurate Failure Detection
A failure detector is eventually strongly accurate if from some time onward, only crashed processes are suspected:
∃σ ∀τ ≥ σ ∀p, q ∉ Crash(F): p ∉ H(q, τ)
Assumptions:
• Each correct process broadcasts alive every ν time units.
• There is an (unknown) upper bound on communication delay.
Each process p initially takes δ_p = 1.
If p receives no message from q for ν + δ_p time units, then p suspects that q has crashed.
When p receives a message from a suspected process q, then q is no longer suspected and δ_p := δ_p + 1.
This failure detector is complete and eventually strongly accurate.
Consensus with Failure Detection
Theorem: Let t ≥ N/2. There is no t-crash consensus algorithm based on an eventually strongly accurate failure detector.
Proof: Suppose, toward a contradiction, there is such an algorithm. Partition the set of processes into S and T, with |S| = ⌈N/2⌉ and |T| = ⌊N/2⌋.
In reachable configurations, S and T are either both 0-potent or both 1-potent. For else, the processes in S could suspect for a sufficiently long period that the processes in T have crashed, and vice versa. Then, since t ≥ N/2, S and T could independently decide for different values.
Since the initial configuration is bivalent, there is a reachable configuration γ and a transition γ → δ with S and T both only b-potent in γ and only (1−b)-potent in δ. Such a transition cannot exist.
Chandra-Toueg Algorithm

A failure detector is eventually weakly accurate if from some time onward some process is never suspected:

  ∃p ∀q ∉ Crash(F) ∃τ₀ ∀τ ≥ τ₀ : p ∉ H(q, τ)

Let t < N/2. Assume a complete and eventually weakly accurate failure detector. Chandra and Toueg gave a rotating coordinator algorithm for t-crash consensus.

Each process q records the last round ts_q in which it updated value_q. Initially, value_q ∈ {0, 1} and ts_q = 0.

Processes are numbered: p_1, ..., p_N. Round k is coordinated by p_c with c = (k mod N) + 1.

Note: Tel presents a simplified version of the Chandra-Toueg algorithm (without acknowledgements and time stamps), which only works for t < N/3.
Chandra-Toueg Algorithm

- Every correct q sends ⟨k, value_q, ts_q⟩ to p_c.
- p_c (if not crashed) waits until N−t such messages arrived, and selects one, say ⟨k, b, ts⟩, with ts as large as possible. p_c broadcasts ⟨k, b⟩.
- Each correct q waits:
  - until ⟨k, b⟩ arrives: then value_q := b, ts_q := k, and q sends ⟨ack, k⟩ to p_c;
  - or until it suspects p_c crashed: then q sends ⟨nack, k⟩ to p_c.
- p_c (if not crashed) waits until N−t acknowledgements arrived. If more than t of them are ack, then p_c decides b, and broadcasts ⟨decide, b⟩. (Note that N−t > t.)
- When a process did not yet decide and receives ⟨decide, b⟩, it decides b.
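A sketch of the coordinator's two choices in round k, in Python; 'msgs' and 'acks' are illustrative names for the received ⟨k, value, ts⟩ messages and the ack/nack replies.

def coordinator_pick(msgs):
    # msgs: at least N-t pairs (value_q, ts_q); pick a value with maximal ts
    value, ts = max(msgs, key=lambda m: m[1])
    return value

def coordinator_decide(acks, t):
    # acks: N-t booleans, True for ack and False for nack
    return sum(acks) > t               # decide iff more than t acks arrived

print(coordinator_pick([(0, 2), (1, 1), (0, 2)]))  # -> 0, the value with ts = 2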
Chandra-Toueg Algorithm - Example

N = 3 and t = 1.

(Figure: an example execution with three processes, initially with values 1, 0, 0 and ts = 0. The coordinator of round 1 gathers ⟨1, value, ts⟩ messages, broadcasts ⟨1, 0⟩ and collects ⟨ack, 1⟩; the coordinator of round 2 broadcasts ⟨2, 0⟩ and collects ⟨ack, 2⟩; processes crash along the way, and the surviving processes decide 0 with ts = 2 after receiving ⟨decide, 0⟩.)
Chandra-Toueg Algorithm - Correctness

Theorem: Let t < N/2. The Chandra-Toueg algorithm is a terminating algorithm for t-crash consensus.

Proof: If the coordinator in some round k receives > t acks, then (for some b ∈ {0, 1}):

(1) there are > t processes q with ts_q ≥ k; and
(2) ts_q ≥ k implies value_q = b.

These properties are preserved in rounds ℓ > k. This follows by induction on ℓ. By (1), in round ℓ the coordinator receives at least one message with time stamp ≥ k. Hence, by (2), the coordinator of round ℓ broadcasts ⟨ℓ, b⟩.

So from round k onward, processes can only decide b.

Since the failure detector is eventually weakly accurate, from some round onward some process p will never be suspected. So when p becomes the coordinator, it receives N−t acks. Since N−t > t, it decides.

All correct processes eventually receive the decide message of p, and also decide.
Question

Why is it difficult to implement a failure detector for Byzantine processes?
Local Clocks with Bounded Drift

Suppose we have a dense time domain.

Let each process p have a local clock C_p(τ), which returns a time value at real time τ.

We assume that each local clock has ρ-bounded drift, compared to real time: for some ρ > 0 and all τ₁ ≤ τ₂,

  (1/(1+ρ))(τ₂ − τ₁) ≤ C_p(τ₂) − C_p(τ₁) ≤ (1+ρ)(τ₂ − τ₁)
Clock Synchronization

At certain time intervals, the processes synchronize clocks: they read each other's clock values, and adjust their local clocks.

The aim is to achieve, for some δ > 0,

  |C_p(τ) − C_q(τ)| ≤ δ   for all τ.

Due to drift, this precision may degrade over time, necessitating repeated synchronizations.
Clock Synchronization

Suppose that after each synchronization, at say real time τ, for all processes p, q:

  |C_p(τ) − C_q(τ)| ≤ δ₀   for some δ₀ < δ.

Due to ρ-bounded drift, at real time τ + R,

  |C_p(τ+R) − C_q(τ+R)| ≤ δ₀ + ((1+ρ) − 1/(1+ρ))R < δ₀ + 2ρR

So there should be a synchronization every (δ − δ₀)/(2ρ) time units.

We assume a bound δ_max on network delay. For simplicity, let δ_max be much smaller than δ (so that this delay can be ignored).
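As an illustration (numbers chosen here, not from the slides): with drift ρ = 10⁻⁵, target precision δ = 4 ms and post-synchronization precision δ₀ = 1 ms, the bound δ₀ + 2ρR stays below δ as long as R < (δ − δ₀)/(2ρ) = 0.003/(2 · 10⁻⁵) s = 150 s, so resynchronizing every two minutes would suffice.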
Mahaney-Schneider Synchronizer

Consider a complete network of N processes, where at most t processes can become Byzantine.

In the Mahaney-Schneider synchronizer, each correct process at a synchronization:

1. Collects the clock values of all processes (waiting for 2δ_max).
2. Discards those reported values ν for which less than N−t processes report a value in the interval [ν−δ, ν+δ] (they are from Byzantine processes).
3. Replaces all discarded and non-received values with some value ν such that there are accepted reported values ν₁ and ν₂ with ν₁ ≤ ν ≤ ν₂.
4. Takes the average of these N values as its new clock value.
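A sketch of one such synchronization round in Python, for the clock values one correct process collected; 'reports' holds one value per process (None if nothing arrived in time), and t < N/3 guarantees that at least one value passes the filter.

def mahaney_schneider_round(reports, n, t, delta):
    values = [v for v in reports if v is not None]
    # step 2: keep a value only if at least n-t reports lie within delta of it
    accepted = [v for v in values
                if sum(1 for w in values if abs(w - v) <= delta) >= n - t]
    # step 3: an accepted value itself lies between two accepted values
    fill = accepted[0]
    completed = [v if v is not None and v in accepted else fill
                 for v in reports]
    return sum(completed) / n            # step 4: average of N values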
Mahaney-Schneider Synchronizer - Correctness

Lemma: Let t < N/3. If values a_p and a_q pass the filters of correct processes p and q, respectively, in some synchronization round, then

  |a_p − a_q| ≤ 2δ

Proof: At least N−t processes reported a value in [a_p−δ, a_p+δ] to p. And at least N−t processes reported a value in [a_q−δ, a_q+δ] to q.

Since N−2t > t, at least one correct process r reported a value in [a_p−δ, a_p+δ] to p, and in [a_q−δ, a_q+δ] to q.

Since r reports the same value to p and q, it follows that

  |a_p − a_q| ≤ 2δ
Mahaney-Schneider Synchronizer - Correctness

Theorem: Let t < N/3. The Mahaney-Schneider synchronizer is t-Byzantine robust.

Proof: Let a_pr (resp. a_qr) be the value that correct process p (resp. q) accepted or computed for process r, in some synchronization round.

By the lemma, |a_pr − a_qr| ≤ 2δ for all r. Moreover, a_pr = a_qr for all correct r.

Hence, for all correct p and q,

  |(1/N)(Σ_r a_pr) − (1/N)(Σ_r a_qr)| ≤ (1/N)·t·2δ < (2/3)δ

So we can take δ₀ = (2/3)δ, and there should be a synchronization every δ/(6ρ) time units.
Impossibility of ⌈N/3⌉-Byzantine Synchronizers

Theorem: Let t ≥ N/3. There is no t-Byzantine robust synchronizer.

Proof: Let N = 3, t = 1. Processes are p, q, r; r is Byzantine. (The construction below easily extends to general N and t ≥ N/3.)

Let the local clock of p run faster than the local clock of q. Suppose a synchronization takes place at real time τ.

r sends C_p(τ) + δ to p, and C_q(τ) − δ to q.

p and q cannot recognize that r is Byzantine. So they have to stay within range δ of the value reported by r. Hence p cannot decrease its clock value, and q cannot increase its clock value.

By repeating this scenario at each synchronization round, the clock values of p and q get further and further apart.
Synchronous Networks
A synchronous network proceeds in pulses. In one pulse, each process:
1. sends messages;
2. receives messages; and
3. performs internal events.
A message is sent and received in the same pulse.
Such synchrony is called lockstep.
From Synchronizer to Synchronous Network

Assume ρ-bounded local clocks, and a synchronizer with precision δ.

For simplicity, let the maximum network delay δ_max, and the time to perform internal events in a pulse, be much smaller than δ.

When a process reads clock value (i−1)(1+ρ)²δ, it starts pulse i.

Key question: Does each process receive all messages for pulse i before the start of pulse i+1?
From Synchronizer to Synchronous Network

When a process reads clock value (i−1)(1+ρ)²δ, it starts pulse i.

Since the synchronizer has precision δ, and the clock of q is ρ-bounded (from below), for all τ:

  C_q⁻¹(τ) ≤ C_p⁻¹(τ) + (1+ρ)δ

And since the clock of p is ρ-bounded (from above), for all τ and ∆ ≥ 0:

  C_p⁻¹(τ) + ∆ ≤ C_p⁻¹(τ + (1+ρ)∆)

Hence C_q⁻¹((i−1)(1+ρ)²δ) ≤ C_p⁻¹(i(1+ρ)²δ), so p receives the message from q for pulse i before the start of pulse i+1.
Byzantine Broadcast

Consider a synchronous network of N processes, where at most t processes can become Byzantine.

One process g, called the general, is given an input x_g ∈ V. The other processes are called lieutenants.

Requirements for t-Byzantine broadcast:
- Termination: Every correct process decides a value in V.
- Dependence: If the general is correct, it decides x_g.
- Agreement: All correct processes decide the same value.
Impossibility of ⌈N/3⌉-Byzantine Broadcast

Theorem: Let t ≥ N/3. There is no t-Byzantine broadcast algorithm for synchronous networks (unless authentication is used).

Proof: Divide the processes into three sets S, T and U, each with at most t elements. Let g ∈ S.

(Figures: three scenarios.)

Scenario 0: the processes in T are Byzantine, and g broadcasts 1. The processes in S and U decide 1.

Scenario 1: the processes in U are Byzantine, and g broadcasts 0. The processes in S and T decide 0.

Scenario 2: the processes in S (including g) are Byzantine; toward T they behave as in scenario 1, and toward U as in scenario 0. Then the processes in T decide 0 and in U decide 1, violating agreement.
Lamport-Shostak-Pease Byzantine Broadcast

Let t < N/3. Broadcast_g(N, t) is a t-Byzantine broadcast algorithm for synchronous networks.

Pulse 1: General g: decide and broadcast x_g.
Lieutenant p: if v is received from g then x_p := v else x_p := ⊥;
- if t = 0: decide x_p
- if t > 0: perform Broadcast_p(N−1, t−1) in pulse 2 (g is excluded)

Pulse t+1 (t > 0):
Lieutenant p: for each lieutenant q, p has taken a decision in Broadcast_q(N−1, t−1); store this decision in M_p[q];
x_p := major(M_p); decide x_p.

(major maps each multiset m over V to a value in V, such that if more than half of the elements in m are v, then major(m) = v.)
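A possible 'major' function in Python, on multisets represented as lists; the value returned in the absence of a strict majority can be any fixed element of V.

from collections import Counter

def major(m, default=0):
    value, count = Counter(m).most_common(1)[0]
    return value if 2 * count > len(m) else default

print(major([0, 0, 0, 1, 1]))   # -> 0, since more than half of the elements are 0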
Example

N = 4 and t = 1; general correct.

(Figures: initially, the general g holds 1 and one lieutenant is Byzantine; after pulse 1, both correct lieutenants hold 1 and decide 1.)

Consider the sub-network without g. After pulse 1, all correct processes carry the value 1. So, since N−1 > 2f (i.e., 3 > 2), the correct lieutenants will all decide 1 (even though one third of the lieutenants is Byzantine).
Example

N = 7 and t = 2; general Byzantine. (Channels are omitted.)

(Figure: the Byzantine general g sends 0 to three correct lieutenants and 1 to two correct lieutenants; one further lieutenant is Byzantine.)

After pulse 1: all correct lieutenants p build, in the recursive call Broadcast_p(6, 1), the same multiset m = {0, 0, 0, 1, 1, v}, for some v ∈ V. So in Broadcast_g(7, 2), they all decide major(m).
Lamport-Shostak-Pease Byzantine Broadcast - Correctness

Lemma: If general g is correct, and N > 2f + t (with f the number of Byzantine processes; f > t is allowed here), then in Broadcast_g(N, t) all correct processes decide x_g.

Proof: By induction on t. Case t = 0 is trivial. Let t > 0.

Since g is correct, in pulse 1, at all correct lieutenants q, x_q := x_g.

Since N−1 > 2f + (t−1), by induction, for all correct lieutenants q, in Broadcast_q(N−1, t−1) the decision x_q = x_g is taken.

Since a majority of the lieutenants is correct (N−1 > 2f), in pulse t+1, at each correct lieutenant p, x_p := major(M_p) = x_g.
Lamport-Shostak-Pease Byzantine Broadcast - Correctness

Theorem: Let t < N/3. Broadcast_g(N, t) is a t-Byzantine broadcast algorithm for synchronous networks.

Proof: By induction on t.

If g is correct, then consensus follows from the lemma (with f ≤ t < N/3, so N > 2f + t).

Let g be Byzantine (so t > 0). Then at most t−1 lieutenants are Byzantine. Since t−1 < (N−1)/3, by induction, for every lieutenant q, all correct lieutenants take in Broadcast_q(N−1, t−1) the same decision v_q.

Hence, all correct lieutenants p compute the same multiset M_p. So in pulse t+1, all correct lieutenants p decide the same value major(M_p).
Incorrectness for t ≥ N/3 - Example

N = 3 and t = 1.

(Figure: general g with input 1; lieutenant p is Byzantine; lieutenant q is correct.)

g decides 1. On the other hand, calling Broadcast_q(2, 0), q builds the multiset {0, 1} (assuming that p communicates 0 to q).

As a result, in Broadcast_g(3, 1), q decides 0 (assuming that major({0, 1}) = 0).
Partial Synchrony

A synchronous system can be obtained if local clocks have known bounded drift, and there is a known upper bound on network delay.

Dwork, Lynch and Stockmeyer showed that a t-Byzantine broadcast algorithm, for t < N/3, exists for partially synchronous systems, in which either
- the bounds on the inaccuracy of local clocks and network delay are unknown; or
- these bounds are known, but only valid from some unknown point in time.
Public-Key Cryptosystems

A public-key cryptosystem consists of a finite message domain M and, for each process q, functions S_q, P_q : M → M with

  S_q(P_q(m)) = P_q(S_q(m)) = m   for m ∈ M.

S_q is kept secret, P_q is made public.

Underlying assumption: Computing S_q from P_q is expensive.

p sends a secret message m to q: P_q(m)

p sends a signed message m to q: ⟨m, S_p(m)⟩

Example: RSA cryptosystem.
Lamport-Shostak-Pease Authenticating Algorithm

Pulse 1: The general broadcasts ⟨x_g, (S_g(x_g), g)⟩, and decides x_g.

Pulse i: If a lieutenant q receives a message ⟨v, (σ₁, p₁) : ⋯ : (σ_i, p_i)⟩ that is valid, i.e.:
- p₁ = g,
- p₁, ..., p_i, q are distinct, and
- P_{p_k}(σ_k) = v for k = 1, ..., i,
then q includes v in the set W_q.

If i ≤ t and |W_q| ≤ 2, then in pulse i+1, q broadcasts

  ⟨v, (σ₁, p₁) : ⋯ : (σ_i, p_i) : (S_q(v), q)⟩

After pulse t+1, each correct lieutenant p decides
- v if W_p is a singleton {v}, or
- ⊥ otherwise (the general is Byzantine)
Lamport-Shostak-Pease Authenticating Alg. - Correctness

Theorem: The Lamport-Shostak-Pease authenticating algorithm is a t-Byzantine broadcast algorithm, for any t.

Proof: If the general is correct, then owing to authentication, correct lieutenants only add x_g to W_q. So they all decide x_g.

Suppose a lieutenant receives a valid message ⟨v, ℓ⟩ in pulse t+1. Since ℓ has length t+1, it contains a correct q. Then q received a valid message ⟨v, ℓ′⟩ in a pulse ≤ t.

When a correct lieutenant q receives a valid message ⟨v, ℓ⟩ in a pulse ≤ t, then either it broadcasts v, or it already broadcast two other values before, with valid messages.
Lamport-Shostak-Pease Authenticating Alg. - Correctness

We conclude that for all correct lieutenants p:
- either W_p = ∅ for all p,
- or |W_p| ≥ 2 for all p,
- or W_p = {v} for all p, for some v ∈ V.

In the first two cases, all correct processes decide ⊥ (the general is Byzantine).

In the third case, they all decide v.
Example

N = 4 and t = 2.

(Figure: the general g and lieutenant r are Byzantine; lieutenants p and q are correct.)

pulse 1: g sends ⟨0, (S_g(0), g)⟩ to p and q
         g sends ⟨1, (S_g(1), g)⟩ to r
         W_p = W_q = {0}

pulse 2: p broadcasts ⟨0, (S_g(0), g) : (S_p(0), p)⟩
         q broadcasts ⟨0, (S_g(0), g) : (S_q(0), q)⟩
         r sends ⟨1, (S_g(1), g) : (S_r(1), r)⟩ to q
         W_p = {0} and W_q = {0, 1}

pulse 3: q broadcasts ⟨1, (S_g(1), g) : (S_r(1), r) : (S_q(1), q)⟩
         W_p = W_q = {0, 1}
         p and q decide ⊥
Mutual Exclusion

Processes p_0, ..., p_{N−1} contend for the critical section. A process that can enter the critical section is called privileged.

For each execution, we require mutual exclusion and no starvation:
- in every configuration at most one process is privileged;
- if a process p_i tries to enter the critical section, and no process remains privileged forever, then p_i will eventually become privileged.
Raymond's Algorithm

Requires an undirected graph, which must, also initially, form a sink tree. At any time, the root, holding a token, is privileged.

Each process maintains a FIFO queue, which may contain identities of its children, and its own id. Initially, this queue is empty.

Queue maintenance:
- When a process wants to enter the critical section, it adds its id to its own queue.
- When a process that is not the root gets a new head at its (non-empty) queue, it asks its father for the token.
- When a process receives a request for the token from a child, it adds this child to its queue.
Raymond's Algorithm

When the root exits the critical section (and its queue is non-empty), it sends the token to the process q at the head of its queue, makes q its father, and removes q from the head of its queue.

Let a process p get the token from its father, with process q at the head of its queue.
- If p ≠ q, then p sends the token to q, and makes q its father.
- If p = q, then p becomes the root (i.e., it has no father, and becomes privileged).

In both cases, p removes q from the head of its queue.
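A sketch of these token hand-off rules at a single process, in Python; message passing is not modelled, re-requesting the token when the queue is still non-empty after a hand-off is omitted, and all names are illustrative.

class RaymondNode:
    def __init__(self, pid, father):
        self.pid, self.father = pid, father   # father = None at the root
        self.queue = []                       # FIFO queue of requests

    def on_request(self, source):
        # source: a child's id, or this process's own id
        was_empty = not self.queue
        self.queue.append(source)
        # True: the queue has a new head, so ask the father for the token
        return was_empty and self.father is not None

    def on_token(self):
        q = self.queue.pop(0)                 # q is at the head of the queue
        if q == self.pid:
            self.father = None                # become the root: privileged
            return None
        self.father = q                       # otherwise forward the token
        return q                              # the caller delivers it to q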
Raymond's Algorithm - Example

(Sequence of figures: a sink tree on processes 1, ..., 7 with root 1. Process 7 requests the token, and the request propagates via its father 3 to the root; then processes 2 and 6 add requests. The token travels from 1 via 3 to 7, which becomes root and privileged; after 7 exits the critical section, the token travels via 3 to 6, and finally via 3 and 1 to 2, each requester becoming root in turn.)
Raymond's Algorithm - Correctness

Raymond's algorithm provides mutual exclusion, because at all times there is only one root.

Raymond's algorithm provides no starvation, because eventually each request in a queue moves to the head of this queue.

However, note that in the example, process 2 requests the token before process 6, but process 6 receives the token before process 2.

Drawback: Sensitive to failures.
Ricart-Agrawala Algorithm

When a process p_i wants to access the critical section, it sends request(ts_i, i) to all other processes, with ts_i its logical time stamp.

When p_j receives this request, it sends a permission if:
- p_j is neither inside nor trying to enter the critical section; or
- p_j sent a request with time stamp ts_j, and either ts_i < ts_j, or ts_i = ts_j and i < j.

p_i enters the critical section when it has received permission from all other processes.
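A sketch of the permission rule at process p_j, in Python; the 'state' fields are illustrative names for p_j's local bookkeeping.

def grant_permission(j, state, ts_i, i):
    if state.in_cs:
        return False                      # defer until p_j exits
    if not state.requesting:
        return True                       # p_j is not competing
    # both are competing: the smallest (time stamp, id) pair wins
    return (ts_i, i) < (state.ts, j)

class State: pass
s = State(); s.in_cs = False; s.requesting = True; s.ts = 2
print(grant_permission(0, s, 1, 1))       # True, since (1, 1) < (2, 0)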
Ricart-Agrawala Algorithm - Correctness

Mutual exclusion: Two processes cannot send permission to each other concurrently. Because when a process p sends permission to q, p has not issued a request, and the logical time of p is greater than the time stamp of q's request.

No starvation: Eventually a request will have the smallest time stamp of all requests in the network.
Ricart-Agrawala Algorithm - Example 1

N = 2, and p_0 and p_1 both are at logical time 0.

p_1 sends request(1, 1) to p_0. When p_0 receives this message, it sets its logical time to 1.

p_0 sends permission to p_1.

p_0 sends request(2, 0) to p_1. When p_1 receives this message, it does not send permission to p_0, because (1, 1) < (2, 0).

p_1 receives permission from p_0, and enters the critical section.
Ricart-Agrawala Algorithm - Example 2

N = 2, and p_0 and p_1 both are at logical time 0.

p_1 sends request(1, 1) to p_0, and p_0 sends request(1, 0) to p_1.

When p_0 receives the request from p_1, it does not send permission to p_1, because (1, 0) < (1, 1).

When p_1 receives the request from p_0, it sends permission to p_0.

p_0 receives permission from p_1, and enters the critical section.
Ricart-Agrawala Algorithm - Optimization

Drawback: High message complexity.

Carvalho-Roucairol optimization: After a first entry of the critical section, a process only needs to send a request to processes that it sent permission to (since its last exit from the critical section).
Question
Suppose a leader has been elected in the network. Give a mutual
exclusion algorithm, with no starvation.
What is a drawback of such a mutual exclusion algorithm?
Mutual Exclusion with Shared Variables
Hagit Attiya and Jennifer Welch, Distributed Computing, McGraw
Hill, 1998 (Chapter 4)
See also Chapter 10 of: Nancy Lynch, Distributed Algorithms, 1996
Processes communicate via shared variables (called registers) in
shared memory.
read/write registers allow a process to perform an atomic read or
write.
A single-reader (or single-writer) register is readable (or writable)
by one process.
A multi-reader (or multi-writer) register is readable (or writable) by
all processes.
An Incorrect Solution for Mutual Exclusion

Let flag be a multi-reader/multi-writer register, with range {0, 1}.

A process wanting to enter the critical section waits until flag = 0. Then it writes flag := 1, and becomes privileged. When it exits the critical section, it writes flag := 0.

The problem is that there is a time delay between reading flag = 0 and writing flag := 1, so that multiple processes can perform this read and write in parallel.
Dijkstra's Mutual Exclusion Algorithm

turn is a multi-reader/multi-writer register with range {0, ..., N−1}. flag[i] is a multi-reader/single-writer register, only writable by p_i, with range {0, 1, 2}. Initially they all have value 0.

We present the pseudocode for process p_i.

⟨Entry⟩: L: flag[i] := 1
            while turn ≠ i do
              if flag[turn] = 0 then turn := i
            flag[i] := 2
            for j ≠ i do
              if flag[j] = 2 then goto L
⟨Exit⟩:     flag[i] := 0
Dijkstra's Mutual Exclusion Algorithm - Correctness

This algorithm provides mutual exclusion.

And if a process p_i tries to enter the critical section, and no process remains in the critical section forever, then some process will eventually become privileged (no deadlock).

However, there can be starvation.
Dijkstra's Mutual Exclusion Algorithm - Example

Let N = 3.

flag[1] := 1
flag[2] := 1
p_1 and p_2 read turn = 0
p_1 and p_2 read flag[0] = 0
turn := 1
turn := 2
flag[1] := 2
flag[2] := 2
flag[1] := 1
flag[2] := 1
p_1 and p_2 read turn = 2
p_1 reads flag[2] ≠ 0
flag[2] := 2
p_2 enters the critical section
p_2 exits the critical section
flag[2] := 0
flag[2] := 1
p_1 reads flag[2] ≠ 0
flag[2] := 2
p_2 enters the critical section
Fischer's Algorithm

Uses time delays, and the assumption that an operation can be performed within one time unit.

turn is a multi-reader/multi-writer register with range {−1, 0, ..., N−1}. Initially it has value −1.

We present the pseudocode for process p_i.

⟨Entry⟩: L: wait until turn = −1
            turn := i (takes less than one time unit)
            delay of more than one time unit
            if turn ≠ i then goto L
⟨Exit⟩:     turn := −1

Fischer's algorithm guarantees mutual exclusion and no deadlock.
Lamport's Bakery Algorithm

Multi-reader/single-writer registers number[i] and choosing[i] range over N and {0, 1}, respectively; they are only writable by p_i. Initially they all have value 0.

When p_i wants to enter the critical section, it writes a number to number[i] that is greater than number[j] for all j ≠ i.

Different processes can concurrently obtain the same number; therefore the ticket of p_i is the pair (number[i], i).

choosing[i] = 1 while p_i is obtaining a number.

When the critical section is empty, and no process is obtaining a number, the process with the smallest ticket (n, i) with n > 0 enters.

When p_i exits the critical section, number[i] is set to 0.
Lamport's Bakery Algorithm

We present the pseudocode for process p_i.

⟨Entry⟩: choosing[i] := 1
         number[i] := max{number[0], ..., number[N−1]} + 1
         choosing[i] := 0
         for j ≠ i do
           wait until choosing[j] = 0
           wait until number[j] = 0 or (number[j], j) > (number[i], i)
⟨Exit⟩:  number[i] := 0

The bakery algorithm provides mutual exclusion and no starvation.

Drawback: Can have high synchronization delays.
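A runnable sketch of the bakery algorithm with Python threads; CPython's interpreter makes the individual list reads and writes atomic enough for a demonstration (real hardware would need memory barriers), and busy-waiting is kept as in the pseudocode.

import threading

N = 3
choosing = [0] * N
number = [0] * N
counter = 0                          # shared resource to exercise

def bakery(i, rounds):
    global counter
    for _ in range(rounds):
        choosing[i] = 1
        number[i] = max(number) + 1
        choosing[i] = 0
        for j in range(N):
            if j == i:
                continue
            while choosing[j]:
                pass
            while number[j] != 0 and (number[j], j) < (number[i], i):
                pass
        counter += 1                 # critical section
        number[i] = 0                # exit

threads = [threading.Thread(target=bakery, args=(i, 1000)) for i in range(N)]
for t in threads: t.start()
for t in threads: t.join()
print(counter)                       # 3000 if mutual exclusion held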
Lamport's Bakery Algorithm - Example

Let N = 2.

choosing[1] := 1
choosing[0] := 1
p_0 and p_1 read number[0] and number[1]
number[1] := 1
choosing[1] := 0
p_1 reads choosing[0] = 1
number[0] := 1
choosing[0] := 0
p_0 reads choosing[1] = 0 and (number[0], 0) < (number[1], 1)
p_0 enters the critical section
p_0 exits the critical section
number[0] := 0
choosing[0] := 1
p_0 reads number[0] and number[1]
number[0] := 2
choosing[0] := 0
p_1 reads choosing[0] = 0 and (number[1], 1) < (number[0], 0)
p_1 enters the critical section
Mutual Exclusion for Two Processes

Assume processes p_0 and p_1.

flag[i] is a multi-reader/single-writer register, only writable by p_i. Its range is {0, 1}; initially it has value 0.

code for p_0:
⟨Entry⟩: flag[0] := 1
         wait until flag[1] = 0
⟨Exit⟩:  flag[0] := 0

code for p_1:
⟨Entry⟩: L: flag[1] := 0
            wait until flag[0] = 0
            flag[1] := 1
            if flag[0] = 1 then goto L
⟨Exit⟩:     flag[1] := 0

This algorithm provides mutual exclusion and no deadlock. However, p_1 may never progress from ⟨Entry⟩ to the critical section, while p_0 enters the critical section infinitely often (starvation of p_1).
Question

How can this mutual exclusion algorithm for two processes be adapted so that it provides no starvation?
Peterson2P Algorithm

flag[i] is a multi-reader/single-writer register, only writable by p_i. priority is a multi-reader/multi-writer register. They all have range {0, 1}; initially they have value 0.

code for p_0:
⟨Entry⟩: L: flag[0] := 0
            wait until (flag[1] = 0 or priority = 0)
            flag[0] := 1
            if priority = 1 then
              if flag[1] = 1 then goto L
            else wait until flag[1] = 0
⟨Exit⟩:     priority := 1
            flag[0] := 0

code for p_1:
⟨Entry⟩: L: flag[1] := 0
            wait until (flag[0] = 0 or priority = 1)
            flag[1] := 1
            if priority = 0 then
              if flag[0] = 1 then goto L
            else wait until flag[0] = 0
⟨Exit⟩:     priority := 0
            flag[1] := 0

The Peterson2P algorithm provides mutual exclusion and no starvation.
Peterson2P Algorithm - Example

flag[1] := 0 (L)
flag[1] := 1
flag[0] := 0 (L)
flag[0] := 1
flag[1] := 0 (L)
p_0 enters the critical section
p_0 exits the critical section
priority := 1
flag[0] := 0
flag[0] := 0 (L)
flag[0] := 1
p_0 enters the critical section
flag[1] := 1
p_0 exits the critical section
priority := 1
flag[0] := 0
flag[0] := 0 (L)
p_1 enters the critical section
Question

How can the Peterson2P algorithm be transformed into a mutual exclusion algorithm for N ≥ 2 processes?
PetersonNP Algorithm

Assume processes p_0, ..., p_{N−1}.

Processes compete pairwise, using the Peterson2P algorithm, in a tournament tree, which is a complete binary tree.

- Each process begins in a leaf of the tree.
- The winner proceeds to the next higher level, where it competes with the winner of the competition on the other side of the subtree.
- The process that wins at the root becomes privileged.
PetersonNP Algorithm - Tournament Tree

Let k = ⌈log₂ N⌉ − 1.

Consider the complete binary tree of depth k. The root is numbered 1, and the left and right child of a node v are numbered 2v and 2v+1, respectively.

Each node has two sides, 0 and 1 (corresponding to priorities).

Initially, process p_i is associated to node 2^k + ⌊i/2⌋ and side i mod 2.

(Figure: for N = 8, the tree with nodes 1; 2, 3; 4, 5, 6, 7; processes p_0, p_1 start at node 4, p_2, p_3 at node 5, p_4, p_5 at node 6, and p_6, p_7 at node 7.)
PetersonNP Algorithm

Each node v has shared variables flag_v[0], flag_v[1] and priority_v. They all have range {0, 1}; initially they have value 0.

A process p_i repeatedly applies the procedure Node(2^k + ⌊i/2⌋, i mod 2).

procedure Node(v : N, side : 0..1)
L: flag_v[side] := 0
   wait until (flag_v[1−side] = 0 or priority_v = side)
   flag_v[side] := 1
   if priority_v = 1−side then
     if flag_v[1−side] = 1 then goto L
   else wait until flag_v[1−side] = 0
   if v = 1 then ⟨Enter Critical Section⟩ ⟨Exit⟩
   else Node(⌊v/2⌋, v mod 2)
   priority_v := 1−side
   flag_v[side] := 0

The PetersonNP algorithm provides mutual exclusion and no starvation.
PetersonNP Algorithm - Example

N = 8. p_1 starts in Node(4, 1) and p_6 in Node(7, 0). Redundant L events (flag_v[side] := 0) are omitted.

(Figure: the tournament tree with nodes 1; 2, 3; 4, 5, 6, 7.)

flag_7[1] := 1
flag_7[0] := 1
flag_7[1] := 0
p_6 continues with Node(3, 1)
flag_3[1] := 1
p_6 continues with Node(1, 1)
flag_1[1] := 1
flag_4[1] := 1
p_1 continues with Node(2, 0)
flag_2[0] := 1
p_1 continues with Node(1, 0)
flag_1[0] := 1
flag_1[1] := 0
p_1 enters the critical section
p_1 exits the critical section
priority_1 := 1
flag_1[0] := 0
flag_1[1] := 1
p_6 enters the critical section
priority_2 := 1
flag_2[0] := 0
priority_4 := 0
flag_4[1] := 0
p_1 continues with Node(4, 1)
Read-Modify-Write Registers
A read-modify-write register allows a process to (1) read its value,
(2) compute a new value, and (3) assign this new value to the
register, all in one instantaneous atomic operation.
In case of a read-modify-write register, mutual exclusion with no
starvation can be achieved with a single register.
The register maintains a FIFO queue of process identities. A
process that wants to enter the critical section, adds its id at the
end of the queue. A process at the head of the queue can enter
the critical section. When a process exits the critical section, it
deletes itself from the queue.
By contrast, mutual exclusion with no deadlock for N processes
can only be achieved with N read/write registers.
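A sketch of this queue-based scheme in Python; the read-modify-write register is emulated by a list guarded by an internal lock, which only stands in for the register's atomicity (it is not itself the mutex being built).

import threading

class RMWQueue:
    def __init__(self):
        self._internal = threading.Lock()
        self._queue = []

    def rmw(self, f):
        with self._internal:             # one atomic read-modify-write
            self._queue = f(self._queue)
            return list(self._queue)

    def enter(self, pid):
        self.rmw(lambda q: q + [pid])    # append my id at the end
        while self.rmw(lambda q: q)[0] != pid:
            pass                         # spin until I am at the head

    def exit(self, pid):
        self.rmw(lambda q: q[1:])        # delete myself from the head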
Wait-Free Consensus

Two read-modify-write registers:

Fetch-and-φ: Fetch the value of a register, and modify this value using the function φ.

Compare-and-swap(v_1, v_2): Compare the value of the register with v_1, and if they are equal, change this value to v_2.

Wait-free algorithm: Each process can complete any operation in a finite number of steps, even if other processes do not respond.

Herlihy showed that wait-free (N−1)-crash consensus can be achieved with compare-and-swap, but not with fetch-and-φ.
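A sketch of the standard compare-and-swap consensus construction: the first process to install its input into an initially empty register wins, and every process decides the installed value. The register's atomicity is emulated with an internal lock.

import threading

class CASRegister:
    def __init__(self):
        self._lock = threading.Lock()    # stands in for hardware atomicity
        self._value = None

    def compare_and_swap(self, old, new):
        with self._lock:
            if self._value == old:
                self._value = new
            return self._value           # the (possibly updated) value

def propose(reg, my_input):
    return reg.compare_and_swap(None, my_input)   # everyone gets the winner's input

reg = CASRegister()
print(propose(reg, 1), propose(reg, 0))  # -> 1 1: both decide the first input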
Question
How can wait-free consensus be achieved with compare-and-swap?
Ticket Lock

A ticket lock L maintains two multi-reader/multi-writer counters:
- the number of requests to acquire L; and
- the number of times L has been released.

A process p that wants to enter the critical section performs a fetch-and-increment on L's request counter; p stores the value of this counter as its ticket, and increments this counter by 1.

Next p keeps polling L's release counter; when this counter equals p's ticket, p enters the critical section.

When p exits the critical section, it increments L's release counter by 1.

The ticket lock is an optimization of Lamport's bakery algorithm.

Drawback: In large-scale systems, excessive polling of L's release counter by remote processes becomes a bottleneck.
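A sketch of a ticket lock in Python; the fetch-and-increment on the request counter is emulated with an internal lock standing in for the atomic instruction, and only the single lock holder ever touches the release counter.

import threading

class TicketLock:
    def __init__(self):
        self._internal = threading.Lock()
        self._requests = 0          # number of acquire requests so far
        self._releases = 0          # number of releases so far

    def acquire(self):
        with self._internal:        # fetch-and-increment on the request counter
            ticket = self._requests
            self._requests += 1
        while self._releases != ticket:
            pass                    # poll the release counter

    def release(self):
        self._releases += 1         # safe: only the holder releases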
Mellor-Crummey-Scott Lock

Lock L maintains a multi-reader/multi-writer register last, containing the last process that requested L (or ⊥ if L is not held).

Each process p maintains multi-reader/multi-writer registers locked_p of type boolean (initially false), and next_p containing a process id (initially ⊥).

A process p that wants to enter the critical section performs fetch-and-store(p) on last, to fetch the process that requested L last, and write its own id into last.

- If last = ⊥, then p enters the critical section.
- If last = q, then p sets locked_p := true and next_q := p. Now p must wait until locked_p = false.
Mellor-Crummey-Scott Lock

Let process q exit the critical section.

- If next_q = p, then q sets locked_p := false, upon which p can enter the critical section.
- If next_q = ⊥, then q performs compare-and-swap(q, ⊥) on last. If q finds that last = p ≠ q, it waits until next_q = p, and then sets locked_p := false.

Note that p only needs to repeatedly poll its local variables locked_p and (sometimes, for a short period) next_p.
Mellor-Crummey-Scott Lock - Example

q performs fetch-and-store(q) on last: last := q
q enters the critical section
p performs fetch-and-store(p) on last: last := p
p performs locked_p := true
q exits the critical section, and reads next_q = ⊥
q performs compare-and-swap(q, ⊥) on last
Since last = p ≠ q, q must wait until next_q = p
p performs next_q := p
q performs locked_p := false
p enters the critical section
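A sketch of the Mellor-Crummey-Scott lock in Python; fetch-and-store and compare-and-swap on 'last' are emulated with an internal lock standing in for the hardware atomics, and ⊥ is represented by None.

import threading

class MCSLock:
    def __init__(self):
        self._atomic = threading.Lock()
        self.last = None                  # last requester, None if L is free
        self.locked = {}                  # locked_p per process
        self.next = {}                    # next_p per process

    def acquire(self, p):
        self.locked[p], self.next[p] = False, None
        with self._atomic:                # fetch-and-store(p) on last
            pred, self.last = self.last, p
        if pred is not None:
            self.locked[p] = True
            self.next[pred] = p
            while self.locked[p]:         # spin on p's own variable only
                pass

    def release(self, q):
        if self.next[q] is None:
            with self._atomic:            # compare-and-swap(q, None) on last
                if self.last == q:
                    self.last = None      # no contender: L becomes free
                    return
            while self.next[q] is None:   # a successor is announcing itself
                pass
        self.locked[self.next[q]] = False # hand the lock to the successor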
Self-Stabilization

All configurations are initial configurations.

An algorithm is self-stabilizing if every execution reaches a correct configuration.

Advantages:
- fault tolerance
- robustness for dynamic topologies
- straightforward initialization

Processes communicate via registers in shared memory.
Dijkstra's Self-Stabilizing Token Ring

Let p_0, ..., p_{N−1} form a directed ring, where each p_i holds a value σ_i ∈ {0, ..., K−1} with K ≥ N.

- p_i with 0 < i < N is privileged if σ_i ≠ σ_{i−1}.
- p_0 is privileged if σ_0 = σ_{N−1}.

Each privileged process is allowed to change its value, causing the loss of its privilege:

- σ_i := σ_{i−1} when σ_i ≠ σ_{i−1}, for 0 < i < N;
- σ_0 := (σ_{N−1} + 1) mod K when σ_0 = σ_{N−1}.

If K ≥ N, then Dijkstra's token ring self-stabilizes. That is, each execution will reach a configuration where mutual exclusion is satisfied.

Moreover, Dijkstra's token ring guarantees no starvation.
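A runnable sketch simulating the token ring in Python from an arbitrary initial configuration, repeatedly firing a randomly chosen privileged process until mutual exclusion holds; sigma and K = 4 are arbitrary illustrative values.

import random

def privileged(sigma):
    n = len(sigma)
    return [i for i in range(n)
            if (i == 0 and sigma[0] == sigma[-1])
            or (i > 0 and sigma[i] != sigma[i - 1])]

def fire(sigma, k, i):
    if i == 0:
        sigma[0] = (sigma[-1] + 1) % k
    else:
        sigma[i] = sigma[i - 1]

sigma, K = [3, 1, 0, 2], 4
while len(privileged(sigma)) > 1:    # at least one process is always privileged
    fire(sigma, K, random.choice(privileged(sigma)))
print(sigma, privileged(sigma))      # a configuration with a single privilege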
Dijkstra's Token Ring - Example

Let N = K = 4.

(Figure: an initial configuration of the ring p_0, p_1, p_2, p_3 with values from {0, 1, 2, 3}.)

It is not hard to see that it self-stabilizes. For instance:

(Figures: two successor configurations, after which exactly one process is privileged.)
Dijkstra's Token Ring - Correctness

Theorem: If K ≥ N, then Dijkstra's token ring self-stabilizes.

Proof: In each configuration at least one process is privileged. A transition never increases the number of privileged processes.

Consider an execution. After at most ½(N−1)N events at p_1, ..., p_{N−1}, an event must happen at p_0. So during the execution, σ_0 ranges over all values in {0, ..., K−1}. Since p_1, ..., p_{N−1} only copy values, and K ≥ N, in some configuration of the execution, σ_0 ≠ σ_i for all 0 < i < N.

The next time p_0 becomes privileged, clearly σ_i = σ_0 for all 0 < i < N. So then mutual exclusion has been achieved.
Question

Can you argue why, if N ≥ 3, Dijkstra's token ring also self-stabilizes when K = N − 1?

This lower bound for K is sharp!
Dijkstra's Token Ring - Lower Bound for K

Example: Let N ≥ 4 and K = N−2, and consider the following initial configuration.

(Figure: a ring of N processes p_0, ..., p_{N−1} holding values from {0, ..., N−3}, among them 0, 1, 2, ..., N−4 and repeated occurrences of N−3.)

It does not always self-stabilize.
Dijkstra's Token Ring - Message Complexity

Worst-case message complexity: Mutual exclusion is achieved after at most O(N²) transitions.

p_i for 0 < i < N can copy the initial values of p_0, ..., p_{i−1}. (Total: ½(N−1)N events.)

p_0 takes on at most N new values to attain a fresh value. These values can be copied by p_1, ..., p_{N−1}. (Total: N² events.)
Arora-Gouda Self-Stabilizing Election Algorithm

Given an undirected network. Let an upper bound K on the network size be known to all processes.

The process with the largest id becomes the leader.

Each process p_i maintains the following variables:
- Neigh_i: the set of identities of its neighbors
- father_i: its father in the sink tree
- leader_i: the root of the sink tree
- dist_i: its distance from the root
Arora-Gouda Election Algorithm - Complications

Due to arbitrary initialization, there are three complications.

Complication 1: Multiple processes may consider themselves root of the sink tree.

Complication 2: There may be cycles in the sink tree.

Complication 3: leader_i may not be the id of any process in the network.
Arora-Gouda Election Algorithm

A process p_i declares itself leader, i.e.

  leader_i := i    father_i := ⊥    dist_i := 0

if it detects an inconsistency in its local variables:

- leader_i < i; or
- father_i = ⊥, and leader_i ≠ i or dist_i > 0; or
- father_i ∉ Neigh_i ∪ {⊥}; or
- dist_i ≥ K.

Suppose father_i = j with j ∈ Neigh_i and dist_j < K.

If leader_i ≠ leader_j, then leader_i := leader_j.

If dist_i ≠ dist_j + 1, then dist_i := dist_j + 1.
Arora-Gouda Election Algorithm

If leader_i < leader_j where j ∈ Neigh_i and dist_j < K, then

  leader_i := leader_j    father_i := j    dist_i := dist_j + 1

To obtain a breadth-first search tree, one can add:

If leader_i = leader_j where j ∈ Neigh_i and dist_j + 1 < dist_i, then

  father_i := j    dist_i := dist_j + 1
Arora-Gouda Election Algorithm - Example

(Sequence of figures: a network of five processes p_1, ..., p_5, arbitrarily initialized with leader = 6, a value that is not the id of any process, and with inconsistent father and dist values. An inconsistency, dist ≥ K, makes p_3 declare itself leader, and leader 3 spreads through the network; next p_4 and later p_5 declare themselves leader; finally leader 5, the largest id, is adopted by all processes, and the father pointers converge to a spanning tree rooted at p_5.)
Arora-Gouda Election Algorithm - Correctness

A subgraph in the network with a leader value j that is not an id of any node in this subgraph contains an inconsistency or a cycle. Such an inconsistency or cycle will eventually cause a process in this subgraph to declare itself leader.

Let i be the largest id of any process in the network.

- Leader values greater than i will eventually disappear; and
- p_i will eventually declare itself leader.

After p_i has declared itself leader, the algorithm will eventually converge to a spanning tree with root p_i.
Afek-Kutten-Yung Self-Stabilizing Election Algorithm

No upper bound on the network size needs to be known.

The process with the largest id becomes the leader.

A process p_i declares itself leader, i.e.

  leader_i := i    father_i := ⊥    dist_i := 0

if these three variables do not yet all have these values, and p_i detects even the slightest inconsistency in its local variables:

  leader_i ≤ i  or  father_i ∉ Neigh_i  or
  leader_i ≠ leader_{father_i}  or  dist_i ≠ dist_{father_i} + 1

p_i can make a neighbor p_j its father if leader_i < leader_j:

  leader_i := leader_j    father_i := j    dist_i := dist_j + 1
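A sketch of this local consistency check and the two moves in Python; 'state' maps each process id to its (leader, father, dist) triple, 'neigh' maps it to its neighbor set, and ⊥ is represented by None. Names are illustrative.

def inconsistent(i, state, neigh):
    leader, father, dist = state[i]
    if (leader, father, dist) == (i, None, 0):
        return False                      # already a proper root
    if leader <= i or father not in neigh[i]:
        return True
    fleader, _, fdist = state[father]
    return leader != fleader or dist != fdist + 1

def move(i, state, neigh):
    if inconsistent(i, state, neigh):
        state[i] = (i, None, 0)           # declare itself leader
        return
    for j in neigh[i]:
        if state[j][0] > state[i][0]:     # a neighbor with a larger leader
            state[i] = (state[j][0], j, state[j][2] + 1)
            return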
Question
Suppose that during an application of the Afek-Kutten-Yung leader
election algorithm, the created subgraph contains a cycle.
Why will at least one of the processes on this cycle declare itself
leader?
Afek-Kutten-Yung Election Algorithm - Complication

Processes can infinitely often join a component of the created subgraph with a false leader.

Example: Given two adjacent processes p_0 and p_1.

leader_0 = leader_1 = 2; father_0 = 1 and father_1 = 0; dist_0 = dist_1 = 0.

Since dist_0 ≠ dist_1 + 1, p_0 declares itself leader:
leader_0 := 0, father_0 := ⊥ and dist_0 := 0.

Since leader_0 < leader_1, p_0 makes p_1 its father:
leader_0 := 2, father_0 := 1 and dist_0 := 1.

Since dist_1 ≠ dist_0 + 1, p_1 declares itself leader:
leader_1 := 1, father_1 := ⊥ and dist_1 := 0.

Since leader_1 < leader_0, p_1 makes p_0 its father:
leader_1 := 2, father_1 := 0 and dist_1 := 2.

Et cetera.
Afek-Kutten-Yung Election Algorithm - Join Requests

Let leader_i < leader_j for some j ∈ Neigh_i.

Before p_i makes p_j its father, it first sends a join request to p_j.

This request is forwarded through p_j's component, toward the root (if any) of this component.

The root sends back a grant toward p_i, which travels the reverse path of the request.

When p_i receives this grant, it makes p_j its father:
leader_i := leader_j, father_i := j and dist_i := dist_j + 1.

If p_j's component has no root, p_i will never join this component.

Communication is performed using shared variables, so join requests and grants are encoded in shared variables.
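
For intuition, the request/grant round trip can be sketched as plain function calls on the Process objects above (the real algorithm encodes both in shared variables; the function names below are assumptions):

    def forward_request(p, seen=None):
        # Forward a join request along father pointers toward the root.
        seen = seen or set()
        if p.id in seen:
            return False        # rootless (cyclic) component: never granted
        seen.add(p.id)
        if p.father is None:    # BOTTOM: p is a proper root
            return True         # the root grants the request
        return forward_request(p.neighbors[p.father], seen)

    def try_join(p_i, j):
        # p_i wants to adopt its neighbor p_j as father.
        p_j = p_i.neighbors[j]
        if forward_request(p_j):    # the grant travels back to p_i
            p_i.leader, p_i.father, p_i.dist = p_j.leader, j, p_j.dist + 1
            return True
        return False

In the real algorithm a request into a rootless component is simply never granted; the cycle check above only serves to make the sketch terminate.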
Afek-Kutten-Yung Election Algorithm - Join Requests

A process can be forwarding (and awaiting a grant for) at most one join request at a time.

Join requests and grants between inconsistent nodes are not forwarded.

Example: Given a ring with nodes u, v, w, and let x > u, v, w.

Initially, u and v consider themselves leader, while w considers u its father and x the leader.

Since leader_w > leader_v, v sends a join request to w.

Without the aforementioned consistency check, w would forward this join request to u. Since u considers itself leader, it would send back a grant to v (via w), and v would make w its father.

Since leader_w ≠ leader_u, w would declare itself leader.

Now we would have a configuration symmetrical to the initial one.
Afek-Kutten-Yung Election Algorithm - Example

Given two adjacent processes p_0 and p_1.

leader_0 = leader_1 = 2; father_0 = 1 and father_1 = 0; dist_0 = dist_1 = 0.

Since dist_0 ≠ dist_1 + 1, p_0 declares itself leader:
leader_0 := 0, father_0 := ⊥ and dist_0 := 0.

Since leader_0 < leader_1, p_0 sends a join request to p_1.

This join request does not immediately trigger a grant.

Since dist_1 ≠ dist_0 + 1, p_1 declares itself leader:
leader_1 := 1, father_1 := ⊥ and dist_1 := 0.

Since p_1 is now a proper root, it grants the join request of p_0, which makes p_1 its father:
leader_0 := 1, father_0 := 1 and dist_0 := 1.
Afek-Kutten-Yung Election Algorithm - Correctness

A subgraph in the network with a leader value j that is not an id of any node in this subgraph contains an inconsistency, so a process in this subgraph will declare itself leader.

Each process can only finitely often (each time due to incorrect initial register values) join a subgraph with a false leader.

Let i be the largest id of any process in the network.

▶ Leader values greater than i will eventually disappear; and
▶ p_i will eventually declare itself leader.

After p_i has declared itself leader, the algorithm will eventually converge to a spanning tree with root p_i.
Garbage Collection

Processes are provided with memory, and root objects carry references to (local or remote) heap objects.

Heap objects can also carry references to each other.

Processes can perform three operations related to references:

▶ reference creation by the object owner;
▶ duplication of a remote reference to another process;
▶ reference deletion.

The aim of garbage collection is to reclaim inaccessible heap objects.
Garbage Collection - Reference Counting

Reference counting is based on keeping track of the number of references to an object. If it drops to zero, the object is garbage.

Drawbacks:

▶ Reference counting cannot reclaim cyclic garbage.
▶ In a distributed setting, each operation on a remote reference induces a message.
Garbage Collection - Race Conditions

In a distributed setting, garbage collection suffers from race conditions.

Example: Process p holds a reference to object O on process r.

p duplicates this reference to process q.

p deletes the reference to O, and sends a dereference message to r.

r receives this message from p, and marks O as garbage.

O is reclaimed prematurely by the garbage collector.

q receives from p the reference to O.
Reference Counting - Acknowledgements

Consider reference counting. One way to avoid race conditions is to use acknowledgements.

In the previous example, before p duplicates the O-reference to q, it first sends an increment message for O to r.

Upon reception of this increment message, r increments O's counter, and sends an acknowledgement to p.

Upon reception of this acknowledgement, p duplicates the O-reference to q.

Drawback: High synchronization delays.
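
A runnable sketch of this handshake, with messages simulated as method calls; the class and method names are assumptions for illustration:

    class Obj:
        def __init__(self):
            self.counter = 1        # the owner's initial reference
            self.reclaimed = False

    class Owner:
        # The owner process r of the object; messages become method calls.
        def on_increment(self, obj):
            obj.counter += 1        # count the new reference *before* it exists
            return "ack"

        def on_dereference(self, obj):
            obj.counter -= 1
            if obj.counter == 0:
                obj.reclaimed = True

    def duplicate(owner, obj, q_refs):
        # p first sends an increment and waits for the acknowledgement;
        # only then does it hand the reference to q.
        assert owner.on_increment(obj) == "ack"
        q_refs.append(obj)

    owner, obj, q_refs = Owner(), Obj(), []
    duplicate(owner, obj, q_refs)   # p duplicates its reference to q
    owner.on_dereference(obj)       # p deletes its own reference
    assert not obj.reclaimed        # q's reference keeps the object alive

The race from the previous slide cannot occur: the counter is already 2 when p's dereference message arrives.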
Weighted Reference Counting

Each object carries a total weight (equal to the summed weights of all references to the object), and a partial weight.

When a reference is created, the partial weight of the object is divided over the object and the reference.

When a reference is duplicated, the weight of the reference is divided over itself and the copy.

When a reference is deleted, the object owner is notified, and the weight of the deleted reference is subtracted from the total weight of the object.

When the total weight of the object becomes equal to its partial weight, the object can be reclaimed.
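
A minimal sketch, assuming one concrete weight-splitting scheme (halving, starting from a power of two); all names are illustrative:

    class WObject:
        def __init__(self, weight=64):
            self.total = weight      # invariant: total = partial + reference weights
            self.partial = weight    # the owner's own partial weight

    class WRef:
        def __init__(self, weight):
            self.weight = weight

    def create(obj):
        # Creation splits the object's partial weight; no message is needed.
        half = obj.partial // 2
        obj.partial -= half
        return WRef(half)

    def duplicate(ref):
        # Duplication splits the reference's weight; again no message.
        half = ref.weight // 2
        ref.weight -= half
        return WRef(half)

    def delete(obj, ref):
        # Only deletion notifies the owner.
        obj.total -= ref.weight
        if obj.total == obj.partial:
            print("reclaim")         # no references are left

    obj = WObject()
    r1 = create(obj)                 # total 64, partial 32, r1 32
    r2 = duplicate(r1)               # r1 16, r2 16
    delete(obj, r1)
    delete(obj, r2)                  # total drops to 32 == partial: "reclaim"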
Weighted Reference Counting - Drawback

When the weight of a reference (or object) has become 1, no more duplication (or creation) is possible.

Solution 1: The reference with weight 1 increases its weight, and tells the object owner to increase its total weight accordingly.

An acknowledgement from the object owner to the reference is needed, to avoid race conditions.

Solution 2: The duplicated reference is made to point to an artificial object with a new total weight, so that the reference to the original object becomes indirect.
Question

Why is the possibility of an underflow (weight 1) in weighted reference counting a much more serious problem than the possibility of an overflow of a reference counter?
Piquer's Indirect Reference Counting

The target object maintains a counter of how many references to it have been created.

Each reference is supplied with a counter of how many times it has been duplicated.

Processes store where duplicated references were duplicated from.

When a process receives a duplicated reference, but already holds a reference to this object, it sends a decrement to the sender of the duplicated reference, to decrease its counter for this reference.

When a duplicated (or created) reference has been deleted, and its counter has become zero, a decrement is sent to the reference where it was duplicated from (or to the object).

When the counter of the object becomes zero, it can be reclaimed.
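
A sketch of the resulting duplication tree, with decrements simulated as method calls; it omits the case where a receiver already holds a reference, and all names are assumptions:

    class OwnedObject:
        def __init__(self):
            self.created = 0         # how many references were created

        def create_ref(self):
            self.created += 1
            return Ref(self)

        def on_decrement(self):
            self.created -= 1
            if self.created == 0:
                print("object reclaimed")

    class Ref:
        def __init__(self, parent):
            self.parent = parent     # the object, or the reference duplicated from
            self.copies = 0          # how many times this reference was duplicated
            self.deleted = False

        def duplicate(self):
            self.copies += 1         # purely local: no message to the owner
            return Ref(self)

        def delete(self):
            self.deleted = True
            self._maybe_propagate()

        def on_decrement(self):
            self.copies -= 1
            self._maybe_propagate()

        def _maybe_propagate(self):
            # A decrement travels up only when this reference is gone and
            # all of its duplicates are gone (cf. Dijkstra-Scholten).
            if self.deleted and self.copies == 0:
                self.parent.on_decrement()

    o = OwnedObject()
    a = o.create_ref()
    b = a.duplicate()
    a.delete()      # nothing propagates: b still descends from a
    b.delete()      # a's subtree is now empty -> "object reclaimed"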
Indirect Reference Listing

Instead of a counter, the object and references keep track of the processes at which a reference has been created or duplicated.

Advantage: Resilience against process failures (at the expense of some memory overhead).
Garbage Collection ⇒ Termination Detection

Tel and Mattern showed that garbage collection algorithms can be transformed into (existing and new) termination detection algorithms.

Given a (basic) algorithm. Let each process p host one (artificial) root object O_p. There is also a special non-root object Z.

Initially, only for active processes p, there is a reference from O_p to Z.

Each basic message carries a duplication of the Z-reference.

When a process becomes passive, it deletes its Z-references.

The basic algorithm is terminated if and only if Z is garbage.
Garbage Collection ⇒ Termination Detection - Examples

Weighted reference counting yields weight-throwing termination detection.

Indirect reference counting yields Dijkstra-Scholten termination detection.
Garbage Collection - Mark-Scan

Mark-scan consists of two phases:

▶ A traversal of all accessible objects, which are marked.
▶ All unmarked objects are reclaimed.

Drawback: In a distributed setting, mark-scan usually requires freezing the basic computation.

In mark-copy, the second phase consists of copying all marked objects to contiguous empty memory space.

In mark-compact, the second phase compacts all marked objects without requiring empty space.

Copying is significantly faster than compaction.
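
A minimal single-process sketch of the two phases (a distributed version would additionally have to freeze the basic computation, as noted above); note that, unlike reference counting, it does reclaim cyclic garbage:

    def mark_scan(roots, heap):
        marked, stack = set(), list(roots)
        while stack:                 # phase 1: mark all accessible objects
            obj = stack.pop()
            if obj not in marked:
                marked.add(obj)
                stack.extend(obj.refs)
        return heap - marked         # phase 2: the unmarked objects are garbage

    class O:
        def __init__(self):
            self.refs = []

    a, b, c = O(), O(), O()
    a.refs, c.refs = [b], [c]        # c is an unreachable self-cycle
    print(mark_scan({a}, {a, b, c}) == {c})   # True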
Generational Garbage Collection in Java

In practice, most objects can be reclaimed shortly after their creation.

Garbage collection in Java, which is based on mark-scan, therefore divides objects into generations.

▶ Garbage in the young generation is collected frequently using mark-copy.
▶ Garbage in the old generation is collected less frequently using mark-compact.
On-Line Scheduling Algorithms
Jane Liu, Real-Time Systems, Prentice Hall, 2000
See also Chapter 2.4 of: Andrew Tanenbaum and Albert Woodhull,
Operating Systems: Design and Implementation, Prentice Hall, 1997
General Picture

[Figure: an embedded application runs on a system consisting of processors and resources (memory).]

Resources are allocated to processors.
Jobs

A job is a unit of work, scheduled and executed by the system.

Parameters of jobs are:

▶ functional behavior
▶ time constraints
▶ resource requirements

Jobs are divided over processors, and are competing for resources.

A scheduler decides in which order jobs are performed on a processor, and which resources they can claim.
Terminology
arrival time: when a job arrives at a processor
release time: when a job becomes available for execution
execution time: amount of processor time needed to perform the
job (assuming it executes alone and all resources are available)
absolute deadline: when a job is required to be completed
relative deadline: maximum allowed length of time from arrival
until completion of a job
hard deadline: late completion not allowed
soft deadline: late completion allowed
slack: available idle time of a job until the next deadline
A preemptive job can be suspended at any time during its execution.
Out of scope

▶ jitter: imprecise release and execution times
▶ overrun management
▶ penalty for missing a soft deadline
▶ performance
▶ communication between jobs
▶ migration of jobs
▶ use of distant resources
▶ different processor and resource types
▶ ...
Types of Tasks

A task is a set of related jobs.

A processor distinguishes three types of tasks:

▶ periodic: known input before the start of the system, with hard deadlines
▶ aperiodic: executed in response to some external event, with soft deadlines
▶ sporadic: executed in response to some external event, with hard deadlines
Periodic Tasks

A periodic task is defined by:

▶ release time r (of the first periodic job)
▶ period p (regular time interval, at the start of which a periodic job is released)
▶ execution time e

For simplicity we assume that the relative deadline of each periodic job is equal to its period.
Periodic Tasks - Example

T_1 = (1, 2, 1) and T_2 = (0, 3, 1).

[Figure: timeline over [0, 6] showing the jobs of T_1 and T_2.]

The hyperperiod is 6.

The conflict at time 3 must be resolved by some scheduler.
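
The hyperperiod is the least common multiple of the periods; for instance (plain Python, not part of the slides):

    from math import lcm   # Python >= 3.9

    def hyperperiod(tasks):
        # tasks are (release, period, execution) triples
        return lcm(*(p for (_, p, _) in tasks))

    print(hyperperiod([(1, 2, 1), (0, 3, 1)]))   # 6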
Job Queues at a Processor

We focus on individual aperiodic and sporadic jobs.

[Figure: periodic tasks, aperiodic jobs and sporadic jobs queue at a processor; sporadic jobs pass through an acceptance test and are either accepted or rejected.]

Sporadic jobs are only accepted when they can be completed in time.

Aperiodic jobs are always accepted, and performed such that periodic and accepted sporadic jobs do not miss their deadlines.

The queueing discipline of aperiodic jobs tries to minimize e.g. average tardiness (completion time monus deadline, i.e. max(0, completion time - deadline)), or the number of missed soft deadlines.
Utilization

The utilization of a periodic task (r, p, e) is e/p.

The utilization of a processor is the sum of the utilizations of its periodic tasks.

Assumptions: Jobs are preemptive, and there is no resource competition.

Theorem: The utilization of a processor is ≤ 1 if and only if scheduling its periodic tasks is feasible.
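
The feasibility condition of the theorem is easy to check mechanically; a sketch assuming integer task parameters:

    from fractions import Fraction

    def feasible(tasks):
        # Utilization of (r, p, e) is e/p; feasible iff the sum is <= 1.
        return sum(Fraction(e, p) for (_, p, e) in tasks) <= 1

    print(feasible([(0, 4, 2), (0, 6, 3)]))   # 1/2 + 1/2 = 1 -> True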
Scheduler

The scheduler of a processor schedules jobs and allocates resources to them (according to some scheduling algorithms and resource access control algorithms).

A schedule is valid if:

▶ jobs are not scheduled before their release times, and
▶ the total amount of processor time assigned to a job equals its (maximum) execution time.

A (valid) schedule is feasible if all hard deadlines are met.

A scheduler is optimal if it produces a feasible schedule whenever possible.
Clock-Driven Scheduling

Off-line scheduling: The schedule for periodic tasks is computed beforehand (typically with an algorithm for an NP-complete graph problem).

Time is divided into regular time intervals called frames.

In each frame, a predetermined set of periodic tasks is executed.

Jobs may be sliced into subjobs, to accommodate the frame length.

Clock-driven scheduling is conceptually simple, but cannot cope well with:

▶ jitter
▶ system modifications
▶ nondeterminism
Priority-Driven Scheduling

On-line scheduling: The schedule is computed at run-time.

Scheduling decisions are taken when:

▶ periodic jobs are released or aperiodic/sporadic jobs arrive
▶ jobs are completed
▶ resources are required or released

Released jobs are placed in priority queues, e.g. ordered by:

▶ release time (FIFO, LIFO)
▶ execution time (SETF, LETF)
▶ period of the task (RM)
▶ deadline (EDF)
▶ slack (LST)

We will focus on EDF scheduling.

Periodic tasks and jobs are assumed to be preemptive.
RM Scheduler

Rate Monotonic: A shorter period gives a higher priority.

Advantage: Priority on the level of tasks makes RM easier to analyze than EDF/LST.

Example: Non-optimality of the RM scheduler (one processor, preemptive jobs, no competition for resources).

Let T_1 = (0, 4, 2) and T_2 = (0, 6, 3).

[Figure: timelines over [0, 12] comparing the RM schedule, in which T_2 misses a deadline, with a feasible EDF/LST schedule.]

Remark: If for all periods p < p', p is always a divisor of p', then the RM scheduler is optimal.
EDF Scheduler

Earliest Deadline First: The earlier the deadline, the higher the priority.

Theorem: Given one processor, and preemptive jobs. When jobs do not compete for resources, the EDF scheduler is optimal.

Example: Non-optimality in case of non-preemption.

[Figure: jobs J_1 (released first, with the later deadline) and J_2 on a timeline over [0, 4]; the non-preemptive EDF schedule starts J_1 at its release and makes J_2 miss its deadline, while a non-EDF schedule meets both deadlines.]
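
A small preemptive EDF simulator over unit time steps (one processor, no resources); the job parameters in the demo are illustrative, not the ones from the figure:

    import heapq

    def edf(jobs):
        """Preemptive EDF on one processor. Jobs are (release, exec, deadline)
        triples with integer times; returns True if all deadlines are met."""
        jobs = sorted(jobs)                     # by release time
        ready, t, i = [], 0, 0
        while i < len(jobs) or ready:
            while i < len(jobs) and jobs[i][0] <= t:
                r, e, d = jobs[i]
                heapq.heappush(ready, (d, e))   # order by earliest deadline
                i += 1
            if not ready:
                t = jobs[i][0]                  # idle until the next release
                continue
            d, e = heapq.heappop(ready)
            t += 1                              # run the most urgent job one unit
            if t > d:
                return False                    # deadline missed
            if e > 1:
                heapq.heappush(ready, (d, e - 1))
        return True

    # J_1 = (release 0, exec 3, deadline 10), J_2 = (1, 1, 2): with preemption
    # J_1 is interrupted at time 1 for J_2; non-preemptively J_2 would miss.
    print(edf([(0, 3, 10), (1, 1, 2)]))   # True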
EDF Scheduler

Example: Non-optimality in case of resource competition.

Let J_1 and J_2 both require resource R.

[Figure: the same two jobs J_1 and J_2 over [0, 4]; since J_1 cannot preempt J_2 while J_2 holds R, the EDF schedule misses a deadline, while a non-EDF schedule meets both deadlines.]
EDF Scheduler - Drawbacks

The dynamic priority of periodic tasks makes it difficult to analyze which deadlines are met in case of overloads.

Late jobs can cause other jobs to miss their deadlines (good overrun management is needed).
LST Scheduler

Least Slack Time first: Less slack gives a job a higher priority.

With the LST scheduler, priorities of jobs change dynamically.

Remark: Continuous scheduling decisions would lead to context switch overhead in case of two jobs with the same slack.

Theorem: Given one processor, and preemptive jobs. When jobs do not compete for resources, the LST scheduler is optimal.

Drawback of LST: computationally expensive.
Scheduling Anomaly

Let jobs be non-preemptive. Then shorter execution times can lead to violation of deadlines.

Example: Consider the EDF (or LST) scheduler.

[Figure: three jobs J_1, J_2, J_3 with releases r_1, r_2, r_3 and deadlines d_1, d_2, d_3 on a timeline over [0, 5]; when J_1 completes earlier than expected, J_3 is started before J_2 is released, and J_2 misses its deadline.]

If jobs are preemptive, and there is no competition for resources, then there is no scheduling anomaly.
Scheduling Aperiodic Jobs

(For the moment, we ignore sporadic jobs.)

Background: Aperiodic jobs are only scheduled in idle time.

Drawback: Needless delay of aperiodic jobs.

Slack stealing: Periodic tasks and accepted sporadic jobs may be interrupted if there is sufficient slack.

Example: T_1 = (0, 2, 1/2) and T_2 = (0, 3, 1/2). Aperiodic jobs are available in [0, 6].

[Figure: timeline over [0, 6]; with slack stealing, aperiodic work is executed in slices between the periodic jobs, at offsets such as 0.5, 1 and 1.5.]

Drawback: Difficult to compute in case of jitter.
Polling Server

Given a period p_s, and an execution time e_s for aperiodic jobs in such a period.

At the start of a new period, the first e_s time units can be used to execute aperiodic jobs.

Consider periodic tasks T_k = (r_k, p_k, e_k) for k = 1, ..., n.

The polling server works if

  Σ_{k=1..n} e_k/p_k + e_s/p_s ≤ 1

Drawback: Aperiodic jobs released just after a polling may be delayed needlessly.
Deferrable Server

Allows a polling server to save its execution time within a period p_s (but not beyond this period!) if the aperiodic queue is empty.

The EDF scheduler treats the deadline of a deferrable server at the end of a period p_s as a hard deadline.

Remark: Σ_{k=1..n} e_k/p_k + e_s/p_s ≤ 1 does not guarantee that periodic jobs meet their deadlines.

Example: T = (2, 5, 3 1/3) and p_s = 3, e_s = 1. An aperiodic job with e = 2 arrives at time 2.

[Figure: timeline over [0, 7]; the server defers its budget and executes the aperiodic job back to back across the period boundary, so T misses its deadline at 7.]

Drawbacks: Partial use of the available bandwidth.

It is difficult to determine good values for p_s and e_s.
Total Bandwidth Server

Fix an allowed utilization rate u_s for the server, such that

  Σ_{k=1..n} e_k/p_k + u_s ≤ 1

When the aperiodic queue is non-empty, a deadline d is determined for the head of the queue, according to the rules below.

Let the head of the aperiodic queue have execution time e.

When, at a time t, either a job arrives at the empty aperiodic queue, or an aperiodic job completes and the tail of the aperiodic queue is non-empty:

  d := max(d, t) + e/u_s

Initially, d = 0.
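
The deadline-assignment rule directly in code, with exact arithmetic via fractions (the class and method names are assumptions); the demo reproduces the example on the next slide:

    from fractions import Fraction

    class TotalBandwidthServer:
        def __init__(self, u_s):
            self.u_s = u_s        # allowed utilization rate of the server
            self.d = 0            # deadline given to the previous aperiodic job

        def assign_deadline(self, t, e):
            # d := max(d, t) + e/u_s
            self.d = max(self.d, t) + e / self.u_s
            return self.d

    tbs = TotalBandwidthServer(Fraction(1, 6))
    print(tbs.assign_deadline(1, Fraction(1, 2)))                # 4
    print(tbs.assign_deadline(Fraction(5, 2), Fraction(2, 3)))   # 8
    print(tbs.assign_deadline(Fraction(19, 3), Fraction(1, 3)))  # 10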
Total Bandwidth Server

Aperiodic jobs can now be treated in the same way as periodic jobs, by the EDF scheduler.

In the absence of sporadic jobs, aperiodic jobs meet the deadlines assigned to them (which may differ from their actual soft deadlines).

Example: T_1 = (0, 2, 1) and T_2 = (0, 3, 1). We fix u_s = 1/6.

[Figure: timeline over [0, 12] with the releases of the three aperiodic jobs.]

A, released at 1 with e = 1/2, gets (at 1) deadline 1 + 3 = 4.

A', released at 2 with e' = 2/3, gets (at 2 1/2) deadline 4 + 4 = 8.

A'', released at 3 with e'' = 1/3, gets (at 6 1/3) deadline 8 + 2 = 10.
Acceptance Test for Sporadic Jobs

A sporadic job with deadline d and execution time e is accepted at time t if the utilization (of the periodic and accepted sporadic jobs) in the time interval [t, d] is never more than 1 - e/(d-t).

If accepted, the utilization in [t, d] is increased by e/(d-t).

Example: Periodic task T = (0, 2, 1).

A sporadic job with r = 1, e = 2 and d = 6 is accepted. The utilization in [1, 6] is increased to 9/10.

A sporadic job with r = 2, e = 2 and d = 20 is rejected (although it could be scheduled).

A sporadic job with r = 3, e = 1 and d = 13 is accepted. The utilization in [3, 6] is increased to 1, and the utilization in [6, 13] to 3/5.
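
A sketch of the test, representing utilization as the periodic base rate plus the rates of the accepted sporadic jobs; this representation is an assumption. The demo reproduces the example above:

    from fractions import Fraction as F

    def accept(sporadic, t, e, d, base):
        """Acceptance test at time t for a sporadic job with execution time e
        and deadline d. 'sporadic' holds accepted jobs as (start, end, rate)
        triples; 'base' is the utilization of the periodic tasks."""
        rate = F(e, d - t)
        # Utilization is piecewise constant; check it just after each
        # breakpoint in [t, d).
        points = {t, d} | {x for job in sporadic for x in job[:2] if t <= x <= d}
        for x in sorted(points):
            if x == d:
                continue
            u = base + sum(r for (s, f, r) in sporadic if s <= x < f)
            if u > 1 - rate:
                return False
        sporadic.append((t, d, rate))   # accepted: raise utilization on [t, d]
        return True

    jobs = []
    print(accept(jobs, 1, 2, 6, F(1, 2)))    # True  (utilization 9/10 on [1, 6])
    print(accept(jobs, 2, 2, 20, F(1, 2)))   # False (9/10 > 8/9 on [2, 6])
    print(accept(jobs, 3, 1, 13, F(1, 2)))   # True  (1 on [3, 6], 3/5 on [6, 13])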
Acceptance Test for Sporadic Jobs

Periodic and accepted sporadic jobs are scheduled by the EDF scheduler.

The acceptance test may reject schedulable sporadic jobs.

The total bandwidth server can be integrated with an acceptance test for sporadic jobs (e.g. by making the allowed utilization rate u_s dynamic).
Resource Access Control Algorithms

Resource units can be requested by jobs during their execution, and are allocated to jobs in a mutually exclusive fashion.

When a requested resource is refused, the job is preempted (blocked).

Resource sharing gives rise to scheduling anomalies.
Resource Access Control Algorithms

Dangers of resource sharing:

(1) Deadlock can occur.

Example: J_1 > J_2.

[Figure: J_2 acquires one resource; J_1 preempts it and acquires the other; then J_1 requires J_2's resource and J_2 requires J_1's resource: deadlock.]

(2) A job J can be blocked by lower-priority jobs.

Example: J > J_1 > ... > J_k, and J, J_k require the red resource.

[Figure: J_k acquires the red resource; J is blocked on it, while the intermediate jobs J_1, ..., J_{k-1} each preempt J_k and run to completion before J can proceed.]
Question
How can we avoid blocking by lower-priority jobs?
Priority Inheritance

When a job J requires a resource R and becomes blocked because a lower-priority job J' holds R, then J' inherits the priority of J until it releases R.

(1) Deadlock can still occur.

Example: J_1 > J_2.

[Figure: as before, J_1 and J_2 each hold one resource and require the other: deadlock.]

(2) Blocking by lower-priority jobs becomes less likely.

Example: J > J_1 > ... > J_k, and J, J_k require the red resource.

[Figure: while J is blocked on the red resource, J_k runs with J's inherited priority, so the intermediate jobs J_1, ..., J_{k-1} can no longer preempt J_k; J is blocked only for the remainder of J_k's critical section.]
Priority Ceiling

The priority ceiling of a resource R at time t is the highest priority of (known) jobs that will require R at some time ≥ t.

The priority ceiling of the system at time t is the highest priority ceiling of the resources that are in use at time t. (It has a special bottom value when no resources are in use.)

In the priority ceiling algorithm, from the arrival of a job, this job is not released until its priority is higher than the priority ceiling of the system.

There is also priority inheritance: A job inherits the priority of a higher-priority job that it is blocking.

Assumption: The resources required by a job are known beforehand.

Note: In the pictures to follow, an 'a' denotes the arrival of a job (different from its release).
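
A sketch of the two ceiling computations and the release test (larger numbers mean higher priority; the Job record and all names are assumptions):

    from collections import namedtuple

    Job = namedtuple("Job", "priority needs")   # larger priority = more urgent

    def system_ceiling(in_use, known_jobs):
        # Highest priority of any known job that still needs a resource that
        # is currently in use; None is the special bottom value.
        ceil = None
        for r in in_use:
            for j in known_jobs:
                if r in j.needs and (ceil is None or j.priority > ceil):
                    ceil = j.priority
        return ceil

    def may_release(job, in_use, known_jobs):
        # An arrived job is released only if its priority is strictly higher
        # than the system's priority ceiling.
        c = system_ceiling(in_use, known_jobs)
        return c is None or job.priority > c

    j1, j2 = Job(2, {"red", "green"}), Job(1, {"red", "green"})
    # j2 already holds "red": the ceiling equals j1's own priority 2, so j1
    # is not released, and the deadlock scenario above cannot start.
    print(may_release(j1, {"red"}, [j1, j2]))   # False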
Priority Ceiling

(1) No deadlocks, if job priorities are fixed (e.g. EDF, but not LST).

This is because a job can only start executing when all resources it will require are free.

Example: J_1 > J_2.

[Figure: J_2 acquires its resources first; J_1 arrives but is released only after J_2 has released them, so both jobs complete.]

(2) Blocking by lower-priority jobs becomes less likely.

Example: J > J_1 > ... > J_k, and J, J_k require the red resource.

[Figure: timeline of J, J_1, ..., J_k under the priority ceiling algorithm; while J_k holds the red resource, the ceiling keeps the intermediate jobs from being released, so J is blocked at most for the remainder of J_k's critical section.]

In this example, the future arrival of J is known when J_{k-1} arrives.
Questions

What would happen if J were only known at its arrival?

How can J get blocked by J_1, ..., J_k?
Priority Ceiling - Multiple Resource Units

Priority ceiling assumed only one unit per resource type.

In case of multiple units of the same resource type, the definition of priority ceiling needs to be adapted:

The priority ceiling of a resource R with k free units at time t is the highest priority level of the known jobs that require more than k units of R at some time ≥ t.
PODC Influential Paper Award
2000: Lamport, Time, Clocks, and the Ordering of Events in a
Distributed System, CACM 1978
2001: Fischer, Lynch, Paterson, Impossibility of Distributed Consensus
with One Faulty Process, JACM 1985
2002: Dijkstra, Self-Stabilizing Systems in Spite of Distributed Control,
CACM 1974
2003: Herlihy, Wait-Free Synchronization, ACM TOPLAS 1991
2004: Gallager, Humblet, Spira, A Distributed Algorithm for
Minimum-Weight Spanning Trees, ACM TOPLAS 1983
2005: Pease, Shostak, Lamport, Reaching Agreement in the Presence
of Faults, JACM 1980
2006: Mellor-Crummey, Scott, Algorithms for Scalable Synchronization
on Shared-Memory Multiprocessors, ACM TOCS 1991
2007: Dwork, Lynch, Stockmeyer, Consensus in the Presence of
Partial Synchrony, JACM 1988
2008: Awerbuch, Peleg, Sparse Partitions, FOCS 1990