
Distributed Systems

(4th edition, version 01)

Chapter 05: Coordination


Coordination Clock synchronization

Physical clocks
Problem
Sometimes we simply need the exact time, not just an ordering.

Solution: Universal Coordinated Time (UTC)


• Based on the number of transitions per second of the cesium 133
atom (pretty accurate).
• At present, the real time is taken as the average of some 50
cesium clocks around the world.
• A leap second is introduced from time to time to compensate for the fact
that days are getting longer.

Note
UTC is broadcast through short-wave radio and satellite. Satellites can give
an accuracy of about ±0.5 ms.

Physical clocks
Coordination Clock synchronization

Clock synchronization
Precision
The goal is to keep the deviation between two clocks on any two
machines within a specified bound, known as the precision π:

∀t, ∀p, q : |Cp(t) − Cq(t)| ≤ π

with Cp(t) the computed clock time of machine p at UTC time t.

Accuracy
In the case of accuracy, we aim to keep the clock bound to a value α:

∀t, ∀p : |Cp(t) − t| ≤ α

Synchronization
• Internal synchronization: keep clocks precise
• External synchronization: keep clocks accurate

Clock synchronization algorithms


Coordination Clock synchronization

Clock drift
Clock specifications
• A clock comes specified with its maximum clock drift rate ρ.
• F(t) denotes the oscillator frequency of the hardware clock at time t.
• F is the clock’s ideal (constant) frequency ⇒ living up to the
specification means:

∀t : (1 − ρ) ≤ F(t)/F ≤ (1 + ρ)

Observation
By using hardware interrupts, we couple a software clock to the hardware
clock, and thus also to its clock drift rate.

Fast, perfect, slow clocks

Clock synchronization algorithms


Coordination Clock synchronization

Detecting and adjusting incorrect times

Getting the current time from a timeserver
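A minimal sketch (not part of the slides) of getting the time from a timeserver in the spirit of Cristian’s algorithm: the round-trip time is measured on the local clock and half of it is assumed to be the one-way delay of the reply. The function get_server_time and its transport are placeholders.

import time

def get_server_time():
    """Placeholder: ask the timeserver for its current time (in seconds)."""
    raise NotImplementedError

def estimate_offset():
    t_send = time.time()                 # local clock when the request leaves
    server_time = get_server_time()      # server's clock value in the reply
    t_recv = time.time()                 # local clock when the reply arrives

    rtt = t_recv - t_send                # round-trip time, measured locally
    server_now = server_time + rtt / 2   # assume the reply traveled for rtt/2

    return server_now - t_recv           # positive: the local clock runs behind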

Clock synchronization algorithms


Coordination Clock synchronization

Reference broadcast synchronization


Essence
• A node broadcasts a reference message m ⇒ each receiving node p
records the time Tp,m at which it received m.
• Note: Tp,m is read from p’s local clock.

RBS minimizes the critical path

Problem: averaging will not capture clock drift ⇒ use linear regression

NO:  Offset[p, q](t) = ( ∑_{k=1}^{M} (Tp,k − Tq,k) ) / M
YES: Offset[p, q](t) = αt + β
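A small sketch (not from the slides) contrasting the two estimates: a plain average versus a least-squares fit Offset[p, q](t) ≈ αt + β computed with numpy.polyfit. T_p and T_q are assumed to be the reception times of the same M reference broadcasts as recorded by p and q.

import numpy as np

def offset_by_average(T_p, T_q):
    """Constant offset estimate: ignores clock drift."""
    return float(np.mean(np.array(T_p) - np.array(T_q)))

def offset_by_regression(T_p, T_q):
    """Linear offset estimate Offset(t) = alpha * t + beta, capturing drift."""
    T_p = np.array(T_p, dtype=float)
    T_q = np.array(T_q, dtype=float)
    alpha, beta = np.polyfit(T_p, T_p - T_q, deg=1)   # fit offsets against p's clock
    return alpha, beta

# Example: q's clock runs 0.1% fast relative to p and starts 5 time units ahead.
T_p = np.arange(0, 100, 10, dtype=float)
T_q = 1.001 * T_p + 5
print(offset_by_average(T_p, T_q))      # a single number that hides the drift
print(offset_by_regression(T_p, T_q))   # (alpha, beta) ≈ (-0.001, -5)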

Clock synchronization algorithms


Coordination Logical clocks

The Happened-before relationship


Issue
What usually matters is not that all processes agree on exactly what time it
is, but that they agree on the order in which events occur. Requires a notion
of ordering.

Lamport’s logical clocks


Coordination Logical clocks

The Happened-before relationship


Issue
What usually matters is not that all processes agree on exactly what time it
is, but that they agree on the order in which events occur. Requires a notion
of ordering.

The happened-before relation


• If a and b are two events in the same process, and a comes before
b, then a → b.
• If a is the sending of a message, and b is the receipt of that
message, then a → b
• If a → b and b → c, then a → c

Note
This introduces a partial ordering of events in a system with
concurrently operating processes.

Lamport’s logical clocks


Coordination Logical clocks

Logical clocks
Problem
How do we maintain a global view of the system’s behavior that is
consistent with the happened-before relation?

Lamport’s logical clocks


Coordination Logical clocks

Logical clocks
Problem
How do we maintain a global view of the system’s behavior that is
consistent with the happened-before relation?

Attach a timestamp C(e) to each event e, satisfying the following


properties:
P1 If a and b are two events in the same process, and a → b, then
we demand that C(a) < C(b).
P2 If a corresponds to sending a message m, and b to the receipt of
that message, then also C(a) < C(b).

Lamport’s logical clocks


Coordination Logical clocks

Logical clocks
Problem
How do we maintain a global view of the system’s behavior that is
consistent with the happened-before relation?

Attach a timestamp C(e) to each event e, satisfying the following


properties:
P1 If a and b are two events in the same process, and a → b, then
we demand that C(a) < C(b).
P2 If a corresponds to sending a message m, and b to the receipt of
that message, then also C(a) < C(b).

Problem
How to attach a timestamp to an event when there’s no global clock ⇒
maintain a consistent set of logical clocks, one per process.

Lamport’s logical clocks


Coordination Logical clocks

Logical clocks: solution


Each process Pi maintains a local counter Ci and adjusts this counter as follows:

1. For each new event that takes place within Pi , Ci is incremented by 1.


2. Each time a message m is sent by process Pi , the message receives
a timestamp ts(m) = Ci .
3. Whenever a message m is received by a process Pj , Pj adjusts its local
counter Cj to max{Cj , ts(m)}; then executes step 1 before passing m to
the application.

Notes
• Property P1 is satisfied by (1); Property P2 by (2) and (3).
• It can still occur that two events happen at the same time. Avoid this
by breaking ties through process IDs.
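A minimal sketch (not the book’s listing) of one process’s Lamport clock, implementing the three rules above; tie-breaking would be added by pairing the counter with the process ID.

class LamportClock:
    def __init__(self):
        self.time = 0

    def event(self):                 # Rule 1: local event increments the counter
        self.time += 1
        return self.time

    def send(self):                  # Rule 2: timestamp of an outgoing message
        return self.event()

    def receive(self, ts_m):         # Rule 3: adjust to the maximum, then apply rule 1
        self.time = max(self.time, ts_m)
        return self.event()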

Lamport’s logical clocks


Coordination Logical clocks

Logical clocks: example


Consider three processes with event counters operating at different
rates

Lamport’s logical clocks


Coordination Logical clocks

Logical clocks: where implemented


Adjustments implemented in middleware

Lamport’s logical clocks


Coordination Logical clocks

Example: Totally ordered multicast


Concurrent updates on a replicated database are seen in the same
order everywhere
• P1 adds $100 to an account (initial value: $1000)
• P2 increments account by 1%
• There are two replicas

Result
In the absence of proper synchronization:
replica #1 ← $1111, while replica #2 ← $1110.
Lamport’s logical clocks
Coordination Logical clocks

Example: Totally ordered multicast


Solution
• Process Pi sends timestamped message mi to all others. The
message itself is put in a local queue queuei .
• Any incoming message at Pj is queued in queuej , according to
its timestamp, and acknowledged to every other process.

Lamport’s logical clocks


Coordination Logical clocks

Example: Totally ordered multicast


Solution
• Process Pi sends timestamped message mi to all others. The
message itself is put in a local queue queuei .
• Any incoming message at Pj is queued in queuej , according to
its timestamp, and acknowledged to every other process.

Pj passes a message mi to its application if:

(1) mi is at the head of queuej


(2) for each process Pk , there is a message mk in queuej with a
larger timestamp.

Lamport’s logical clocks


Coordination Logical clocks

Example: Totally ordered multicast


Solution
• Process Pi sends timestamped message mi to all others. The
message itself is put in a local queue queuei .
• Any incoming message at Pj is queued in queuej , according to
its timestamp, and acknowledged to every other process.

Pj passes a message mi to its application if:

(1) mi is at the head of queuej


(2) for each process Pk , there is a message mk in queuej with a
larger timestamp.

Note
We are assuming that communication is reliable and FIFO ordered.
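A sketch of the delivery test only (assumptions made for illustration: each queued entry is a (timestamp, senderID, type) tuple, the queue is kept sorted, and timestamp ties are broken by process ID, as noted earlier).

def can_deliver(queue, other_procs):
    # Condition 1: there is a message at the head of the sorted queue.
    if not queue:
        return False
    head_ts, head_sender, _ = queue[0]
    # Condition 2: every other process has queued some message (request or
    # acknowledgment) with a larger timestamp, ties broken by process ID.
    later_senders = {sender for ts, sender, _ in queue[1:]
                     if (ts, sender) > (head_ts, head_sender)}
    return set(other_procs) <= later_senders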

Lamport’s logical clocks


Coordination Logical clocks

Lamport’s clocks for mutual exclusion
class Process:
    def __init__(self, chanID, procID, procIDSet):
        self.chan.join(procID)
        self.procID = int(procID)
        self.otherProcs.remove(self.procID)
        self.queue = []                    # The request queue
        self.clock = 0                     # The current logical clock

    def requestToEnter(self):
        self.clock = self.clock + 1        # Increment clock value
        self.queue.append((self.clock, self.procID, ENTER))   # Append request to queue
        self.cleanupQ()                    # Sort the queue
        self.chan.sendTo(self.otherProcs, (self.clock, self.procID, ENTER))    # Send request

    def ackToEnter(self, requester):
        self.clock = self.clock + 1        # Increment clock value
        self.chan.sendTo(requester, (self.clock, self.procID, ACK))            # Permit other

    def release(self):
        tmp = [r for r in self.queue[1:] if r[2] == ENTER]     # Remove all ACKs
        self.queue = tmp                   # and copy to new queue
        self.clock = self.clock + 1        # Increment clock value
        self.chan.sendTo(self.otherProcs, (self.clock, self.procID, RELEASE))  # Release

    def allowedToEnter(self):
        commProcs = set([req[1] for req in self.queue[1:]])    # See who has sent a message
        return (self.queue[0][1] == self.procID and len(self.otherProcs) == len(commProcs))

Lamport’s logical clocks


Coordination Logical clocks

Lamport’s clocks for mutual exclusion
    def receive(self):
        msg = self.chan.recvFrom(self.otherProcs)[1]   # Pick up any message
        self.clock = max(self.clock, msg[0])           # Adjust clock value...
        self.clock = self.clock + 1                    # ...and increment
        if msg[2] == ENTER:
            self.queue.append(msg)                     # Append an ENTER request
            self.ackToEnter(msg[1])                    # and unconditionally allow
        elif msg[2] == ACK:
            self.queue.append(msg)                     # Append a received ACK
        elif msg[2] == RELEASE:
            del(self.queue[0])                         # Just remove first message
        self.cleanupQ()                                # And sort and cleanup

Lamport’s logical clocks


Coordination Logical clocks

Lamport’s clocks for mutual exclusion


Analogy with totally ordered multicast
• With totally ordered multicast, all processes build identical
queues, delivering messages in the same order
• Mutual exclusion is about agreeing in which order processes are
allowed to enter a critical region

Lamport’s logical clocks


Coordination Logical clocks

Vector clocks
Observation
Lamport’s clocks do not guarantee that if C(a) < C(b), then a causally
preceded b.

Concurrent message transmission using logical clocks

Observation
Event a: m1 is received at T = 16;
Event b: m2 is sent at T = 20.

Vector clocks
Coordination Logical clocks

Vector clocks
Observation
Lamport’s clocks do not guarantee that if C(a) < C(b), then a causally
preceded b.

Concurrent message transmission using logical clocks

Observation
Event a: m1 is received at T = 16;
Event b: m2 is sent at T = 20.

Note
We cannot conclude that a
causally precedes b.

Vector clocks
Coordination Logical clocks

Causal dependency
Definition
We say that b may causally depend on a if ts(a) < ts(b), with:
• for all k , ts(a)[k ] ≤ ts(b)[k ] and
• there exists at least one index k′ for which ts(a)[k′] < ts(b)[k′]

Precedence vs. dependency


• We say that a causally precedes b.
• b may causally depend on a, as there may be information from a that
is propagated into b.

Vector clocks
Coordination Logical clocks

Capturing potential causality


Solution: each Pi maintains a vector VCi
• VCi [i ] is the local logical clock at process Pi .

• If VCi [j ] = k then Pi knows that k events have occurred at Pj .

Maintaining vector clocks


1. Before executing an event, Pi executes VCi [i ] ← VCi [i ] + 1.

2. When process Pi sends a message m to Pj , it sets m’s (vector)
timestamp ts(m) equal to VCi after having executed step 1.
3. Upon the receipt of a message m, process Pj sets
VCj [k ] ← max{VCj [k ], ts(m)[k ]} for each k , after which it executes
step 1 and then delivers the message to the application.
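A compact sketch (not from the slides) of these three rules for the process with index i in a system of n processes.

class VectorClock:
    def __init__(self, i, n):
        self.i = i                   # index of this process
        self.vc = [0] * n            # VCi[k] for k = 0 .. n-1

    def event(self):                 # Rule 1: increment own entry before an event
        self.vc[self.i] += 1

    def send(self):                  # Rule 2: timestamp of an outgoing message
        self.event()
        return list(self.vc)

    def receive(self, ts_m):         # Rule 3: merge componentwise, then apply rule 1
        self.vc = [max(a, b) for a, b in zip(self.vc, ts_m)]
        self.event()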

Vector clocks
Coordination Logical clocks

Vector clocks: Example


Capturing potential causality when exchanging messages

(a) (b)

Analysis

Situation   ts(m2)       ts(m4)       ts(m2) < ts(m4)   ts(m2) > ts(m4)   Conclusion
(a)         (2, 1, 0)    (4, 3, 0)    Yes               No                m2 may causally precede m4
(b)         (4, 1, 0)    (2, 3, 0)    No                No                m2 and m4 may conflict

Vector clocks
Coordination Logical clocks

Causally ordered multicasting


Observation
We can now ensure that a message is delivered only if all causally
preceding messages have already been delivered.

Adjustment
Pi increments VCi [i ] only when sending a message, and Pj “adjusts” VCj
when receiving a message (i.e., effectively does not change VCj [j ]).

Vector clocks
Coordination Logical clocks

Causally ordered multicasting


Observation
We can now ensure that a message is delivered only if all causally
preceding messages have already been delivered.

Adjustment
Pi increments VCi [i ] only when sending a message, and Pj “adjusts” VCj
when receiving a message (i.e., effectively does not change VCj [j ]).

Pj postpones delivery of m until:

1. ts(m)[i ] = VCj [i ] + 1
2. ts(m)[k ] ≤ VCj [k ] for all k ̸= i
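A sketch (assuming 0-based process indices and the sender-only increment just described) of the two conditions; a message failing the test would simply stay buffered until the conditions hold.

def can_deliver_causally(ts_m, sender_i, vc_j):
    """True if message m from process sender_i may be delivered at Pj."""
    # Condition 1: m is the next message Pj expects from the sender.
    if ts_m[sender_i] != vc_j[sender_i] + 1:
        return False
    # Condition 2: Pj has already seen everything m causally depends on.
    return all(ts_m[k] <= vc_j[k] for k in range(len(vc_j)) if k != sender_i)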

Vector clocks
Coordination Logical clocks

Causally ordered multicasting


Enforcing causal communication

Vector clocks
Coordination Logical clocks

Causally ordered multicasting


Enforcing causal communication

Example
Take VC3 = [0, 2, 2], ts(m) = [1, 3, 0] from P1. What information does P3
have, and what will it do when receiving m (from P1)?

Vector clocks
Coordination Mutual exclusion

Mutual exclusion
Problem
Several processes in a distributed system want exclusive access to
some resource.

Basic solutions
Permission-based: A process wanting to enter its critical region, or access
a resource, needs permission from other processes.
Token-based: A token is passed between processes. The one who has the
token may proceed in its critical region, or pass it on when
not interested.

Overview
Coordination Mutual exclusion

Permission-based, centralized
Simply use a coordinator

(a) (b)
(c)

(a) Process P1 asks the coordinator for permission to access a


shared resource. Permission is granted.
(b) Process P2 then asks permission to access the same resource.
The coordinator does not reply.
(c) When P1 releases the resource, it tells the coordinator, which then
replies to P2 .

A centralized algorithm
Coordination Mutual exclusion

Mutual exclusion: Ricart & Agrawala


The same as Lamport except that acknowledgments are not sent
Return a response to a request only when:
• The receiving process has no interest in the shared resource; or
• The receiving process is waiting for the resource, but has lower
priority (known through comparison of timestamps).
In all other cases, the reply is deferred, implying some more local
administration.

A distributed algorithm
Coordination Mutual exclusion

Mutual exclusion: Ricart & Agrawala


Example with three processes

(a) (b) (c)

(a) Two processes want to access a shared resource at the same moment.
(b) P0 has the lowest timestamp, so it wins.
(c) When process P0 is done, it sends an OK also, so P2 can now go ahead.

A distributed algorithm
Coordination Mutual exclusion

Mutual exclusion: Token ring algorithm


Essence
Organize processes in a logical ring, and let a token be passed between
them. The one that holds the token is allowed to enter the critical region (if it
wants to).

An overlay network constructed as a logical ring with a circulating token

A token-ring algorithm
Coordination Mutual exclusion

Decentralized mutual exclusion


Principle
Assume every resource is replicated N times, with each replica having its
own coordinator ⇒ access requires a majority vote from m > N/2
coordinators. A coordinator always responds immediately to a request.

Assumption
When a coordinator crashes, it will recover quickly, but will have
forgotten about permissions it had granted.

A decentralized algorithm
Coordination Mutual exclusion

Decentralized mutual exclusion


How robust is this system?
• Let p = ∆ t / T be the probability that a coordinator resets during a
time interval ∆ t , while having a lifetime of T .

A decentralized algorithm
Coordination Mutual exclusion

Decentralized mutual exclusion


How robust is this system?
• Let p = ∆ t / T be the probability that a coordinator resets during a
time interval ∆ t , while having a lifetime of T .
• The probability P[k] that k out of m coordinators reset during the
same interval is

P[k] = (m choose k) p^k (1 − p)^(m−k)

A decentralized algorithm
Coordination Mutual exclusion

Decentralized mutual exclusion


How robust is this system?
• Let p = ∆ t / T be the probability that a coordinator resets during a
time interval ∆ t , while having a lifetime of T .
• The probability P[k] that k out of m coordinators reset during the
same interval is

P[k] = (m choose k) p^k (1 − p)^(m−k)

• f coordinators reset ⇒ correctness is violated when there is only a
minority of nonfaulty coordinators: when N − (m − f) ≥ m, i.e., f ≥ 2m − N.

A decentralized algorithm
Coordination Mutual exclusion

Decentralized mutual exclusion


How robust is this system?
• Let p = ∆ t / T be the probability that a coordinator resets during a
time interval ∆ t , while having a lifetime of T .
• The probability P[k] that k out of m coordinators reset during the
same interval is

P[k] = (m choose k) p^k (1 − p)^(m−k)

• f coordinators reset ⇒ correctness is violated when there is only a
minority of nonfaulty coordinators: when N − (m − f) ≥ m, i.e., f ≥ 2m − N.
• The probability of a violation is ∑_{k=2m−N}^{m} P[k].
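A small sketch (not from the slides) that evaluates this sum with Python’s math.comb; with p = 3/3600 (3 seconds per hour of lifetime) and N = 8, m = 5 it yields roughly 7·10^−6, consistent with the “< 10^−5” entry in the table below.

from math import comb

def violation_probability(N, m, p):
    """Probability that at least 2m − N of the m granting coordinators reset."""
    return sum(comb(m, k) * p**k * (1 - p)**(m - k)
               for k in range(2 * m - N, m + 1))

print(violation_probability(8, 5, 3 / 3600))   # ≈ 7e-06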

A decentralized algorithm
Coordination Mutual exclusion

Decentralized mutual exclusion


Violation probabilities for various parameter values

N    m    p             Violation      |  N    m    p              Violation
8    5    3 sec/hour    < 10^−5        |  8    5    30 sec/hour    < 10^−3
8    6    3 sec/hour    < 10^−11       |  8    6    30 sec/hour    < 10^−7
16   9    3 sec/hour    < 10^−4        |  16   9    30 sec/hour    < 10^−2
16   12   3 sec/hour    < 10^−21       |  16   12   30 sec/hour    < 10^−13
32   17   3 sec/hour    < 10^−4        |  32   17   30 sec/hour    < 10^−2
32   24   3 sec/hour    < 10^−43       |  32   24   30 sec/hour    < 10^−27

A decentralized algorithm
Coordination Mutual exclusion

Decentralized mutual exclusion


Violation probabilities for various parameter values

N    m    p             Violation      |  N    m    p              Violation
8    5    3 sec/hour    < 10^−5        |  8    5    30 sec/hour    < 10^−3
8    6    3 sec/hour    < 10^−11       |  8    6    30 sec/hour    < 10^−7
16   9    3 sec/hour    < 10^−4        |  16   9    30 sec/hour    < 10^−2
16   12   3 sec/hour    < 10^−21       |  16   12   30 sec/hour    < 10^−13
32   17   3 sec/hour    < 10^−4        |  32   17   30 sec/hour    < 10^−2
32   24   3 sec/hour    < 10^−43       |  32   24   30 sec/hour    < 10^−27

So....
What can we conclude?

A decentralized algorithm
Coordination Mutual exclusion

Mutual exclusion: comparison

Algorithm        Messages per entry/exit                Delay before entry (in message times)
Centralized      3                                      2
Distributed      2(N − 1)                               2(N − 1)
Token ring       1, ..., ∞                              0, ..., N − 1
Decentralized    2kN + (k − 1)N/2 + N, k = 1, 2, ...    2kN + (k − 1)N/2

A decentralized algorithm
Coordination Mutual exclusion

Example: ZooKeeper
Basics (and keeping it simple)
• Centralized server setup
• All client-server communication is nonblocking: a client immediately
gets a response
• ZooKeeper maintains a tree-based namespace, akin to that of
a filesystem
• Clients can create, delete, or update nodes, as well as check
existence.

Example: Simple locking with ZooKeeper


Coordination Mutual exclusion

ZooKeeper race condition


Note
ZooKeeper allows a client to be notified when a node, or a branch in the
tree, changes. This may easily lead to race conditions.

Consider a simple locking mechanism


1. A client C1 creates a node /lock .
2. A client C2 wants to acquire the lock but is notified that the
associated node already exists.
3. Before C2 subscribes to a notification, C1 releases the lock, i.e.,
deletes /lock.
4. Client C2 subscribes to changes to /lock and blocks locally.

Solution
Use version numbers

Example: Simple locking with ZooKeeper


Coordination Mutual exclusion

ZooKeeper versioning

Notations
• W (n, k )a: request to write a to node n, assuming current version is k
.
• R(n, k ): current version of node n is k .
• R(n): client wants to know the current value of node n
• R(n, k )a: value a from node n is returned with its current version k .

Example: Simple locking with ZooKeeper


Coordination Mutual exclusion

ZooKeeper locking protocol


It is now very simple
1. lock: A client C1 creates a node /lock .
2. lock: A client C2 wants to acquire the lock but is notified that the
associated node already exists ⇒ C2 subscribes to notification
on changes of /lock .
3. unlock: Client C1 deletes node /lock ⇒ all subscribers to changes
are notified.
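A sketch of this locking protocol using the third-party kazoo client; the library choice, the ZooKeeper address, and the use of an ephemeral znode are assumptions made for illustration, not part of the slides.

from kazoo.client import KazooClient
from kazoo.exceptions import NodeExistsError
import threading

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

def acquire_lock():
    while True:
        released = threading.Event()
        try:
            # Ephemeral: the znode disappears automatically if this client crashes.
            zk.create("/lock", ephemeral=True)
            return                                      # lock acquired
        except NodeExistsError:
            # Subscribe to changes of /lock; the watch fires when it is deleted.
            if zk.exists("/lock", watch=lambda event: released.set()):
                released.wait()                         # block until notified
            # If exists() returned None, the lock vanished in between: just retry.

def release_lock():
    zk.delete("/lock")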

Example: Simple locking with ZooKeeper


Coordination Election algorithms

Election algorithms
Principle
An algorithm requires that some process acts as a coordinator. The question
is how to select this special process dynamically.

Note
In many systems, the coordinator is chosen manually (e.g., file servers).
This leads to centralized solutions ⇒ single point of failure.
Coordination Election algorithms

Election algorithms
Principle
An algorithm requires that some process acts as a coordinator. The question
is how to select this special process dynamically.

Note
In many systems, the coordinator is chosen manually (e.g., file servers).
This leads to centralized solutions ⇒ single point of failure.

Teasers
1. If a coordinator is chosen dynamically, to what extent can we speak
about a centralized or distributed solution?
2. Is a fully distributed solution, i.e. one without a coordinator, always
more robust than any centralized/coordinated solution?
Coordination Election algorithms

Basic assumptions

• All processes have unique IDs

• All processes know the IDs of all processes in the system (but not
whether they are up or down)
• Election means identifying the process with the highest ID that is up
Coordination Election algorithms

Election by bullying
Principle
Consider N processes {P0 , . . . , PN−1} and let id (Pk ) = k . When a process
Pk notices that the coordinator is no longer responding to requests, it initiates
an election:
1. Pk sends an ELECTION message to all processes with higher
identifiers: Pk+1, Pk+2, ..., PN−1.

2. If no one responds, Pk wins the election and becomes coordinator.

3. If one of the higher-ups answers, it takes over and Pk ’s job is done.

The bully algorithm


Coordination Election algorithms

Election by bullying
The bully election algorithm

The bully algorithm


Coordination Election algorithms

Election in a ring
Principle
Process priority is obtained by organizing processes into a (logical) ring.
The process with the highest priority should be elected as coordinator.
• Any process can start an election by sending an election message to its
successor. If a successor is down, the message is passed on to the
next successor.
• If a message is passed on, the sender adds itself to the list. When it
gets back to the initiator, every process has had a chance to make its
presence known.
• The initiator sends a coordinator message around the ring containing a
list of all living processes. The one with the highest priority is elected
as coordinator.

A ring algorithm
Coordination Election algorithms

Election in a ring
Election algorithm using a ring

• The solid line shows the election messages initiated by P6

• The dashed one, the messages by P3

A ring algorithm
Coordination Election algorithms

Example: Leader election in ZooKeeper server group


Basics
• Each server s in the server group has an identifier id(s)
• Each server has a monotonically increasing counter tx(s) of the
latest transaction it handled (i.e., series of operations on the
namespace).
• When follower s suspects that the leader has crashed, it broadcasts an
ELECTION message, along with the pair (voteID, voteTX). Initially:
• voteID ← id(s)
• voteTX ← tx(s)
• Each server s maintains two variables:
• leader(s): records the server that s believes may be the final
leader. Initially, leader(s) ← id(s).
• lastTX(s): what s knows to be the most recent transaction.
Initially, lastTX(s) ← tx(s).

Example: Leader election in ZooKeeper


Coordination Election algorithms

Example: Leader election in ZooKeeper server group


When s∗ receives (voteID,voteTX )
• If lastTX(s∗) < voteTX , then s∗ just received more up-to-date
information on the most recent transaction, and sets
• leader(s∗) ← voteID
• lastTX(s∗) ← voteTX
• If lastTX(s∗) = voteTX and leader(s∗) < voteID, then s∗ knows as
much about the most recent transaction as what it was just sent, but its
perspective on which server will be the next leader needs to be
updated:
• leader(s∗) ← voteID

Note
When s∗ believes it should be the leader, it broadcasts ⟨id(s∗),
tx(s∗)⟩. Essentially, we’re bullying.
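A sketch (not ZooKeeper’s actual code; the function name is mine) of the vote-handling rule above as a pure function on the pair (leader, lastTX) kept by server s∗.

def handle_vote(leader, lastTX, voteID, voteTX):
    """Return the updated (leader, lastTX) of s* after receiving (voteID, voteTX)."""
    if lastTX < voteTX:
        # The sender knows about a more recent transaction: adopt its vote entirely.
        return voteID, voteTX
    if lastTX == voteTX and leader < voteID:
        # Same knowledge of transactions, but a higher-numbered server should lead.
        return voteID, lastTX
    return leader, lastTX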

Example: Leader election in ZooKeeper


Coordination Election algorithms

Example: Leader election in Raft


Basics
• We have a (relatively small) group of servers
• A server is in one of three states: follower , candidate, or leader
• The protocol works in terms, starting with term 0
• Each server starts in the follower state.
• A leader is to regularly broadcast messages (perhaps just a
simple heartbeat)

Example: Leader election in Raft


Coordination Election algorithms

Example: Leader election in Raft


Selecting a new leader
When follower s∗ hasn’t received anything from the alleged leader s for some
time, s∗ broadcasts that it volunteers to be the next leader, increasing the
term by 1. s∗ enters the candidate state. Then:
• If leader s receives the message, it responds by acknowledging that it
is still the leader. s∗ returns to the follower state.
• If another follower s∗∗ gets the election message from s∗, and it is the
first election message during the current term, s∗∗ votes for s∗.
Otherwise, it simply ignores the election message from s∗. When s∗
has collected a majority of votes, a new term starts with a new leader.

Example: Leader election in Raft


Coordination Election algorithms

Example: Leader election in Raft


Selecting a new leader
When follower s∗ hasn’t received anything from the alleged leader s for some
time, s∗ broadcasts that it volunteers to be the next leader, increasing the
term by 1. s∗ enters the candidate state. Then:
• If leader s receives the message, it responds by acknowledging that it
is still the leader. s∗ returns to the follower state.
• If another follower s∗∗ gets the election message from s∗, and it is the
first election message during the current term, s∗∗ votes for s∗.
Otherwise, it simply ignores the election message from s∗. When s∗
has collected a majority of votes, a new term starts with a new leader.

Observation
By slightly varying the timeout values per follower for deciding when to
start an election, we can avoid concurrent elections, and the election will
rapidly converge.

Example: Leader election in Raft


Coordination Election algorithms

Elections by proof of work


Basics
• Consider a potentially large group of processes
• Each process is required to solve a computational puzzle
• When a process solves the puzzle, it broadcasts its victory to the group
• We assume there is a conflict resolution procedure when more than
one process claims victory

Solving a computational puzzle


• Make use of a secure hashing function H(m):
• m is some data; H(m) returns a fixed-length bit string
• computing h = H(m) is computationally efficient
• finding a function H^−1 such that m = H^−1(H(m)) is
computationally extremely difficult
• Practice: finding H^−1 boils down to an extensive trial-and-error
procedure
Elections in large-scale systems
Coordination Election algorithms

Elections by proof of work


Controlled race
• Assume a globally known secure hash function H∗. Let Hi be the
hash function used by process Pi .

• Task: given a bit string h = Hi(m), find a bit string h̃ such that

h∗ = H∗(Hi(h̃ ⊙ h)), where:

• h∗ is a bit string with K leading zeroes
• h̃ ⊙ h denotes some predetermined bitwise operation on h̃ and h

Elections in large-scale systems


Coordination Election algorithms

Elections by proof of work


Controlled race
• Assume a globally known secure hash function H∗. Let Hi be the
hash function used by process Pi .

• Task: given a bit string h = Hi(m), find a bit string h̃ such that

h∗ = H∗(Hi(h̃ ⊙ h)), where:

• h∗ is a bit string with K leading zeroes
• h̃ ⊙ h denotes some predetermined bitwise operation on h̃ and h

Observation
By controlling K , we control the difficulty of finding h˜ . If p is the probability
that a random guess for h˜ will suffice: p = (1/2)K .

Elections in large-scale systems


Coordination Election algorithms

Elections by proof of work


Controlled race
• Assume a globally known secure hash function H∗. Let Hi be the
hash function used by process Pi .

• Task: given a bit string h = Hi(m), find a bit string h̃ such that

h∗ = H∗(Hi(h̃ ⊙ h)), where:

• h∗ is a bit string with K leading zeroes
• h̃ ⊙ h denotes some predetermined bitwise operation on h̃ and h

Observation
By controlling K , we control the difficulty of finding h˜ . If p is the probability
that a random guess for h˜ will suffice: p = (1/2)K .

Current practice
In many PoW-based blockchain systems, K = 64
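A sketch of such a puzzle; the choice of SHA-256 (used here for both Hi and H∗), XOR as the ⊙ operation, and the counter-based encoding of h̃ are assumptions for illustration, not prescribed by the slides.

import hashlib, itertools

def leading_zero_bits(digest):
    """Number of leading zero bits in a byte string."""
    bits = bin(int.from_bytes(digest, "big"))[2:].zfill(8 * len(digest))
    return len(bits) - len(bits.lstrip("0"))

def solve_puzzle(h, K):
    """Search for a candidate h~ such that H*(Hi(h~ XOR h)) has K leading zero bits."""
    for counter in itertools.count():
        h_tilde = counter.to_bytes(len(h), "big")           # candidate h~
        mixed = bytes(a ^ b for a, b in zip(h_tilde, h))    # the '⊙' operation
        digest = hashlib.sha256(hashlib.sha256(mixed).digest()).digest()
        if leading_zero_bits(digest) >= K:
            return h_tilde

h = hashlib.sha256(b"some data m").digest()     # h = Hi(m)
proof = solve_puzzle(h, 16)                     # expect on the order of 2**16 attempts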


Elections in large-scale systems
Coordination Election algorithms

Elections by proof of stake


Basics
We assume a blockchain system in which N secure tokens are used:
• Each token has a unique owner
• Each token has a uniquely associated index 1 ≤ k ≤ N
• A token cannot be modified or copied without this going unnoticed

Principle
• Draw a random number k ∈ {1,..., N}
• Look up the process P that owns the token with index k . P is the
next leader.

Observation
The more tokens a process owns, the higher the probability it will be
selected as leader.
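A sketch of the principle; the dictionary mapping token indices to owners is an assumption made for illustration.

import random

def elect_by_stake(token_owner, N):
    """token_owner: dict mapping token index 1..N to the owning process."""
    k = random.randint(1, N)          # draw a random token index
    return token_owner[k]             # its owner becomes the next leader

# A process owning more tokens is proportionally more likely to be chosen.
owners = {1: "P1", 2: "P1", 3: "P1", 4: "P2", 5: "P3"}
print(elect_by_stake(owners, 5))      # "P1" with probability 3/5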

Elections in large-scale systems


Coordination Election algorithms

A solution for wireless networks


A sample network

Essence
Find the node with the highest capacity to select as the next
leader.

Elections in wireless environments


Coordination Election algorithms

A solution for wireless networks


A sample network

Elections in wireless environments


Coordination Election algorithms

A solution for wireless networks


A sample network

Essence
A node reports back only the node that it found to have the highest
capacity.

Elections in wireless environments


Coordination Gossip-based coordination

Gossip-based coordination: aggregation


Typical apps
• Data dissemination: Perhaps the most important one. Note that there
are many variants of dissemination.
• Aggregation: Let every node Pi maintain a variable vi . When two
nodes gossip, they each reset their variable to

vi , vj ← (vi + vj )/2

Result: in the end each node will have computed the average v̄ = (∑i vi)/N.
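A small simulation sketch (not from the slides) of the averaging rule: in each round a random pair of nodes gossips and averages its values, and all values converge to the global mean.

import random

def gossip_average(values, rounds=100):
    values = list(values)
    n = len(values)
    for _ in range(rounds):
        i, j = random.sample(range(n), 2)                  # two nodes gossip
        values[i] = values[j] = (values[i] + values[j]) / 2
    return values

vals = gossip_average([1.0] + [0.0] * 9)   # initially v_i = 1 at one node only
print(vals)                                # every entry approaches 1/10 = 0.1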

Aggregation
Coordination Gossip-based coordination

Gossip-based coordination: aggregation


Typical apps
• Data dissemination: Perhaps the most important one. Note that there
are many variants of dissemination.
• Aggregation: Let every node Pi maintain a variable vi . When two
nodes gossip, they each reset their variable to

vi , vj ← (vi + vj )/2

Result: in the end each node will have computed the average v̄ = (∑i vi)/N.

• What happens in the case that initially vi = 1 and vj = 0, j ̸= i ?

Aggregation
Coordination Gossip-based coordination

Gossip-based coordination: peer sampling


Problem
For many gossip-based applications, you need to select a peer uniformly at
random from the entire network. In principle, this means you need to know
all other peers. Impossible?

Basics
• Each node maintains a list of c references to other nodes
• Regularly, pick another node at random (from the list), and
exchange roughly c/2 references
• When the application needs to select a node at random, it also picks
a random one from its local list.

Observation
Statistically, it turns out that selecting a peer from the local list is
indistinguishable from selecting a peer uniformly at random from the
entire network.

A peer-sampling service
Coordination Gossip-based coordination

Gossip-based overlay construction


Essence
Maintain two local lists of neighbors. The lowest-level list is used for
providing a peer-sampling service; the highest-level list is used to
carefully select application-dependent neighbors.

Gossip-based overlay construction


Coordination Gossip-based coordination

Gossip-based overlay construction: a 2D torus


Consider a logical N × N grid, with a node on each point of the grid.
• Every node must maintain a list of c nearest neighbors
• Distance between the node at (a1, a2) and the one at (b1, b2) is d1 + d2,
with di = min(N − |ai − bi|, |ai − bi|) (see the sketch after this list).
• Every node picks a random other node from its lowest-level list,
and keeps only the closest one in its top-level list.
• Once every node has picked and selected a random node, we move
to the next round
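A small sketch (not from the slides) of this wrap-around distance, assuming 0-based grid coordinates.

def torus_distance(a, b, N):
    """Distance between grid points a = (a1, a2) and b = (b1, b2) on an N x N torus."""
    return sum(min(N - abs(ai - bi), abs(ai - bi)) for ai, bi in zip(a, b))

print(torus_distance((0, 0), (49, 1), N=50))   # 1 + 1 = 2: the grid wraps around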

start (N = 50) after 5 rounds after 20 rounds

Gossip-based overlay construction


Coordination Gossip-based coordination

A gossip-based 2D torus in Python (outline)


def maintainViews():
    for viewType in [viewOverlay, viewPSS]:                    # For each view, do the same
        peer[viewType] = None
        if time to maintain viewType:                          # This viewType needs to be updated
            peer[viewType] = selectPeer(viewType)              # Select a peer
            links = selectLinks(viewType, peer[viewType])      # Select links
            sendTo(peer[viewType], Request[viewType], links)   # Send links asynchronously

    while True:
        block = (peer[viewOverlay] != None) or (peer[viewPSS] != None)
        sender, msgType, msgData = recvFromAny(block)          # Block if expecting something

        if msgType == None:      # All work has been done, simply return from the call
            return

        for viewType in [viewOverlay, viewPSS]:                # For each view, do the same
            if msgType == Response[viewType]:                  # Response to previously sent links
                updateOwnView(viewType, msgData)               # Just update the own view

            elif msgType == Request[viewType]:                 # Request for exchanging links
                if peer[viewType] == None:                     # No outstanding exchange request
                    links = selectLinks(viewType, sender)      # Select links
                    sendTo(sender, Response[viewType], links)  # Send them asynchronously
                    updateOwnView(viewType, msgData)           # Update own view
                else:    # This node already has a pending exchange request, ignore this one
                    sendTo(sender, IgnoreRequest[viewType])

            elif msgType == IgnoreRequest[viewType]:           # Request has been denied, give up
                peer[viewType] = None

Gossip-based overlay construction


Coordination Gossip-based coordination

Secure gossiping
Dramatic attack
Consider what happens when, while exchanging references, a set of colluding
nodes systematically returns links only to each other ⇒ we are dealing with
a hub attack.

Situation
A network with 100,000 nodes, a local list size c = 30, and only 30
attackers. The y-axis shows the number of nodes with links only to the
attackers. After less than 300 rounds, the attackers have full control.

Secure gossiping
Coordination Gossip-based coordination

A solution: gathering statistics


This is what measuring indegree distributions tells us: which fraction of
nodes (y-axis) have how many other nodes pointing to them (x-axis)?

Basic approach
When a benign node initiates an exchange, it may either use the result for
gathering statistics, or for updating its local list. An attacker is in limbo: will
its response be used for statistical purposes or for functional purposes?

Secure gossiping
Coordination Gossip-based coordination

A solution: gathering statistics


This is what measuring indegree distributions tells us: which fraction of
nodes (y-axis) have how many other nodes pointing to them (x-axis)?

Basic approach
When a benign node initiates an exchange, it may either use the result for
gathering statistics, or for updating its local list. An attacker is in limbo: will
its response be used for statistical purposes or for functional purposes?

Observation
Because gathering statistics may reveal colluders, a colluding node is
forced to behave according to the protocol.
Secure gossiping
Coordination Distributed event matching

Distributed event matching

Principle
• A process specifies in which events it is interested (subscription S)
• When a process publishes a notification N we need to see whether S
matches N.
Coordination Distributed event matching

Distributed event matching

Principle
• A process specifies in which events it is interested (subscription S)
• When a process publishes a notification N we need to see whether S
matches N.

Hard part
Implementing the match function in a scalable manner.
Coordination Distributed event matching

General approach
What is needed
• sub2node(S): map a subscription S to a nonempty subset S of servers
• not2node(N): map a notification N to a nonempty subset N of servers

Make sure that S ∩ N ≠ ∅

Observations
• Centralized solution is simple: S = N = {s}, i.e. a single server.
• Topic-based publish-subscribe is also simple: each S and N is tagged
with a single topic; each topic is handled by a single server (a
rendezvous node). Several topics may be handled by the same server.
• Content-based publish-subscribe is tough: a subscription takes the form
of an (attribute, value) pair, with example values:
• range: “1 ≤ x < 10”
• containment: “x ∈ {red, blue}”
• prefix and suffix expressions: “url.startswith("https")”

Centralized implementations
Coordination Distributed event matching

Selective routing

(a) (b)
(a) first broadcast subscriptions
(b) forward notifications only to relevant rendezvous nodes

Centralized implementations
Coordination Distributed event matching

Selective routing

(a) (b)
(a) first broadcast subscriptions
(b) forward notifications only to relevant rendezvous nodes

Example of a (partially filled) routing table


Interface Filter
To node 3 a ∈ [0, 3]
To node 4 a ∈ [2, 5]
Toward router R1 (unspecified)

Centralized implementations
Coordination Distributed event matching

Gossiping: Sub-2-Sub
Basics
• Goal: To realize scalability, make sure that subscribers with the same
interests form just a single group
• Model: There are N attributes a 1 ,..., aN . An attribute value is
always (mappable to) a floating-point number.
• Subscription: Takes forms such as S = ⟨a1 → 3.0, a4 → [0.0, 0.5)⟩:
a1 should be 3.0; a4 should lie between 0.0 and 0.5; other attribute
values don’t matter.

Observations
• A subscription Si specifies a subset Si in an N-dimensional space.
• We are interested only in notifications that fall into S = ∪Si.

Centralized implementations
Coordination Distributed event matching

Gossiping: Sub-2-Sub
Basics
• Goal: To realize scalability, make sure that subscribers with the same
interests form just a single group
• Model: There are N attributes a 1 ,..., aN . An attribute value is
always (mappable to) a floating-point number.
• Subscription: Takes forms such as S = ⟨a1 → 3.0, a4 → [0.0, 0.5)⟩:
a1 should be 3.0; a4 should lie between 0.0 and 0.5; other attribute
values don’t matter.

Observations
• A subscription Si specifies a subset Si in an N-dimensional space.
• We are interested only in notifications that fall into S = ∪Si.

Centralized implementations
Coordination Distributed event matching

Gossiping: Sub-2-Sub

Centralized implementations
Coordination Distributed event matching

Secure publish-subscribe
We are facing nasty dilemmas
• Referential decoupling: messages should be able to flow from a
publisher to subscribers while guaranteeing mutual anonymity ⇒ we
cannot set up a secure channel.
• Not knowing where messages come from imposes integrity problems.
• Assuming a trusted broker may easily be practically impossible,
certainly when dealing with sensitive information ⇒ we now have a
routing problem.

Secure publish-subscribe solutions


Coordination Distributed event matching

Secure publish-subscribe
We are facing nasty dilemmas
• Referential decoupling: messages should be able to flow from a
publisher to subscribers while guaranteeing mutual anonymity ⇒ we
cannot set up a secure channel.
• Not knowing where messages come from imposes integrity problems.
• Assuming a trusted broker may easily be practically impossible,
certainly when dealing with sensitive information ⇒ we now have a
routing problem.

Solution
• Allow for searching (and matching) on encrypted data, without the
need for decryption.
• PEKS: accompany encrypted messages with a collection of
(again encrypted) keywords and search for matches on keywords.

Secure publish-subscribe solutions


Coordination Distributed event matching

Public-Key Encryption with Keyword Search (PEKS)


Basics
• Using a public key PK, a message m and its n keywords KW1, ..., KWn are
stored at a server as the message m∗:

m∗ = [PK(m) | PEKS(PK, KW1) | PEKS(PK, KW2) | ··· | PEKS(PK, KWn)]

i.e., each keyword KWi is stored as KWi∗ = PEKS(PK, KWi).

• A subscriber gets the accompanying secret key.
• For each keyword KWi, a trapdoor T_KWi is generated: T_W(m∗) will return
true iff W ∈ {KW1, ..., KWn}.


Secure publish-subscribe solutions
Coordination Location systems

Positioning nodes
Issue
In large-scale distributed systems in which nodes are dispersed across a
wide-area network, we often need to take some notion of proximity or
distance into account ⇒ it starts with determining a (relative) location of a
node.
Coordination Location systems

Computing position
Observation
A node P needs d + 1 landmarks to compute its own position in a
d-dimensional space. Consider the two-dimensional case.

Computing a position in 2D

Solution
P needs to solve three equations in two unknowns (xP, yP):

di = sqrt((xi − xP)^2 + (yi − yP)^2),  for i = 1, 2, 3
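A sketch (not from the slides) of the two-dimensional case: subtracting the squared-distance equations pairwise eliminates the quadratic terms and leaves a 2×2 linear system in (xP, yP), solved here with numpy. The landmark coordinates and distances are illustrative.

import numpy as np

def position_2d(landmarks, distances):
    """landmarks: three (x, y) points; distances: measured distances to them."""
    (x1, y1), (x2, y2), (x3, y3) = landmarks
    d1, d2, d3 = distances
    # Subtract equation 1 from equations 2 and 3 to eliminate the quadratic terms.
    A = np.array([[2 * (x2 - x1), 2 * (y2 - y1)],
                  [2 * (x3 - x1), 2 * (y3 - y1)]])
    b = np.array([d1**2 - d2**2 + x2**2 - x1**2 + y2**2 - y1**2,
                  d1**2 - d3**2 + x3**2 - x1**2 + y3**2 - y1**2])
    return np.linalg.solve(A, b)

# Node at (3, 4), landmarks at known positions, exact distances:
lm = [(0, 0), (10, 0), (0, 10)]
d = [np.hypot(3 - x, 4 - y) for x, y in lm]
print(position_2d(lm, d))        # ≈ [3. 4.]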

GPS: Global Positioning System


Coordination Location systems

Global Positioning System


Assuming that the clocks of the satellites are accurate and
synchronized
• It takes a while before a signal reaches the receiver
• The receiver’s clock is definitely out of sync with the
satellite

Observation
4 satellites ⇒ 4 equations in 4 unknowns (with the receiver’s clock
deviation ∆r as one of them)
GPS: Global Positioning System
Coordination Location systems

WiFi-based location services


Basic idea
• Assume we have a database of known access points (APs)
with coordinates
• Assume we can estimate distance to an AP
• Then: with 3 detected access points, we can compute a
position.

War driving: locating access points


• Use a WiFi-enabled device along with a GPS receiver, and move
through an area while recording observed access points.
• Compute the centroid: assume an access point AP has been detected at N
different locations {x⃗1, x⃗2, . . . , x⃗N }, with known GPS locations.
• Compute the location of AP as x⃗AP = (∑_{i=1}^{N} x⃗i) / N.
Problems
• Limited accuracy of each GPS detection point ⃗xi
• An access point has a nonuniform transmission range
• Number of sampled detection points N may be too low.
When GPS is not an option
Coordination Location systems

Computing position

Problems
• Measured latencies to landmarks fluctuate
• Computed distances will not even be consistent

Inconsistent distances in one-dimensional space

Solution: minimize errors
• Use N special landmark nodes L1, ..., LN.
• Landmarks measure their pairwise latencies d̃(Li, Lj).
• A central node computes the coordinates for each landmark, minimizing an
aggregated error over all pairs, where d̂(Li, Lj) is the distance between Li
and Lj after the nodes have been positioned.
Logical positioning of nodes
Coordination Location systems

Computing position
Choosing the dimension m
The hidden parameter is the dimension m, with N > m. A node P measures
its distance to each of the N landmarks and computes its coordinates by
minimizing the same kind of aggregated error.

Observation
Practice shows that m can be as small as 6 or 7 to achieve latency
estimations within a factor 2 of the actual value.

Logical positioning of nodes


Coordination Location systems

Vivaldi

Logical positioning of nodes
