Slides 05
Physical clocks
Problem
Sometimes we simply need the exact time, not just an ordering.
Note
UTC is broadcast through short-wave radio and satellite. Satellites can give
an accuracy of about ±0.5 ms.
Coordination Clock synchronization
Clock synchronization
Precision
The goal is to keep the deviation between the clocks on any two
machines within a specified bound, known as the precision π:

∀t, ∀p, q : |Cp(t) − Cq(t)| ≤ π
Accuracy
In the case of accuracy, we aim to keep each clock bound to the actual time within a value α:

∀t, ∀p : |Cp(t) − t| ≤ α
Synchronization
• Internal synchronization: keep clocks precise
• External synchronization: keep clocks accurate
Clock drift
Clock specifications
• A clock comes specified with its maximum clock drift rate ρ.
• F(t) denotes the oscillator frequency of the hardware clock at time t
• F is the clock’s ideal (constant) frequency ⇒ living up to
specifications:

∀t : (1 − ρ) ≤ F(t)/F ≤ (1 + ρ)
Note
This introduces a partial ordering of events in a system with
concurrently operating processes.
Logical clocks
Problem
How do we maintain a global view of the system’s behavior that is
consistent with the happened-before relation?
Problem
How to attach a timestamp to an event when there’s no global clock ⇒
maintain a consistent set of logical clocks, one per process.
Notes
• Property P1 is satisfied by (1); Property P2 by (2) and (3).
• It can still occur that two events happen at the same time. Avoid this
by breaking ties through process IDs.
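Lamport’s update rules and the tie-breaking scheme can be sketched in a few lines. This is a minimal illustration, not code from the source; the class and method names are invented for the example.

```python
class LamportClock:
    """Minimal sketch of a Lamport logical clock for one process."""

    def __init__(self, pid):
        self.pid = pid   # used only to break ties between equal timestamps
        self.time = 0

    def tick(self):
        # Increment before each local event.
        self.time += 1
        return self.time

    def send(self):
        # Timestamp an outgoing message with the (incremented) clock value.
        return self.tick()

    def receive(self, ts):
        # Adjust to the maximum of the local clock and the message
        # timestamp, then increment.
        self.time = max(self.time, ts) + 1
        return self.time

    def stamp(self):
        # Total order: break ties with the process ID.
        return (self.time, self.pid)
```

Because `stamp()` returns a (time, pid) pair, two events can never carry the same timestamp, which gives the total order mentioned above.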
Result
In the absence of proper synchronization:
replica #1 ← $1111, while replica #2 ← $1110.
Lamport’s logical clocks
Coordination Logical clocks
Note
We are assuming that communication is reliable and FIFO ordered.
Vector clocks
Observation
Lamport’s clocks do not guarantee that if C(a) < C(b), then a causally
preceded b.
Note
We cannot conclude that a causally precedes b.
Causal dependency
Definition
We say that b may causally depend on a if ts(a) < ts(b), with:
• for all k: ts(a)[k] ≤ ts(b)[k], and
• there exists at least one index k′ for which ts(a)[k′] < ts(b)[k′]
Analysis
Adjustment
Pi increments VCi [i ] only when sending a message, and Pj “adjusts” VCj
when receiving a message (i.e., effectively does not change VCj [j ]).
Pj postpones delivery of m (sent by Pi) until both conditions hold:
1. ts(m)[i ] = VCj [i ] + 1
2. ts(m)[k ] ≤ VCj [k ] for all k ̸= i
Example
Take VC3 = [0, 2, 2], ts(m) = [1, 3, 0] from P1. What information does P3
have, and what will it do when receiving m (from P1)?
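The delivery conditions above can be checked mechanically. The sketch below uses 0-based indexing (index 0 corresponds to P1); the function name is illustrative. For the example, ts(m)[2] = 3 exceeds VC3[2] = 2, so P3 must postpone delivery of m.

```python
def can_deliver(vc, ts, i):
    """Check whether a message with vector timestamp ts, sent by the
    process at index i, may be delivered at a receiver whose vector
    clock is vc (0-based indices; i = 0 means P1)."""
    # Condition 1: m is the next message expected from Pi.
    if ts[i] != vc[i] + 1:
        return False
    # Condition 2: the receiver has seen everything the sender had
    # seen from all other processes.
    return all(ts[k] <= vc[k] for k in range(len(vc)) if k != i)
```

Applied to the example: `can_deliver([0, 2, 2], [1, 3, 0], 0)` is false, so m is queued until P3 has also seen the third message P1 had received from P2.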
Coordination Mutual exclusion
Mutual exclusion
Problem
Several processes in a distributed system want exclusive access to
some resource.
Basic solutions
Permission-based: A process wanting to enter its critical region, or access
a resource, needs permission from other processes.
Token-based: A token is passed between processes. The one who has the
token may proceed in its critical region, or pass it on when
not interested.
Overview
Permission-based, centralized
Simply use a coordinator
A centralized algorithm
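The coordinator’s behavior can be sketched as a request queue: grant the resource if it is free, otherwise queue the requester and reply only on release. This is an illustrative sketch (class and method names are invented), not the slide’s exact protocol.

```python
from collections import deque

class Coordinator:
    """Minimal sketch of a centralized mutual-exclusion coordinator."""

    def __init__(self):
        self.holder = None     # process currently in its critical region
        self.queue = deque()   # pending requests, FIFO

    def request(self, pid):
        # Grant immediately if the resource is free; otherwise queue
        # the request and send no reply, so the requester blocks.
        if self.holder is None:
            self.holder = pid
            return "OK"
        self.queue.append(pid)
        return None

    def release(self, pid):
        assert pid == self.holder
        if self.queue:
            self.holder = self.queue.popleft()
            return ("OK", self.holder)   # grant to the next in line
        self.holder = None
        return (None, None)
```

The single coordinator makes the scheme simple and fair (FIFO), but it is also a single point of failure and a potential performance bottleneck.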
A distributed algorithm
(a) Two processes want to access a shared resource at the same moment.
(b) P0 has the lowest timestamp, so it wins.
(c) When process P0 is done, it sends an OK also, so P2 can now go ahead.
A token-ring algorithm
Assumption
When a coordinator crashes, it will recover quickly, but will have
forgotten about permissions it had granted.
A decentralized algorithm
 N   m   p            Violation    |   N   m   p             Violation
 8   5   3 sec/hour   < 10^−5     |   8   5   30 sec/hour   < 10^−3
 8   6   3 sec/hour   < 10^−11    |   8   6   30 sec/hour   < 10^−7
16   9   3 sec/hour   < 10^−4     |  16   9   30 sec/hour   < 10^−2
16  12   3 sec/hour   < 10^−21    |  16  12   30 sec/hour   < 10^−13
So....
What can we conclude?
Example: ZooKeeper
Basics (and keeping it simple)
• Centralized server setup
• All client-server communication is nonblocking: a client immediately
gets a response
• ZooKeeper maintains a tree-based namespace, akin to that of
a filesystem
• Clients can create, delete, or update nodes, as well as check
existence.
Solution
Use version numbers
ZooKeeper versioning
Notations
• W(n, k)a: request to write value a to node n, assuming the current version is k.
• R(n, k): the current version of node n is k.
• R(n): client wants to know the current value of node n.
• R(n, k)a: value a from node n is returned along with its current version k.
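The version-number scheme boils down to a compare-and-set on the node’s version: a write W(n, k)a succeeds only if k matches the node’s current version. This toy sketch mirrors the notation above; it is not the real ZooKeeper API, and all names are illustrative.

```python
class VersionedStore:
    """Toy version-checked store in the style of the W(n, k)a / R(n)
    notation (not the actual ZooKeeper client interface)."""

    def __init__(self):
        self.nodes = {}   # node name -> (version, value)

    def write(self, n, k, a):
        version, _ = self.nodes.get(n, (0, None))
        if k != version:
            # Rejected: the client's view of the version is stale.
            return ("R", n, version)
        self.nodes[n] = (version + 1, a)
        return ("OK", n, version + 1)

    def read(self, n):
        version, value = self.nodes.get(n, (0, None))
        return (n, version, value)
```

A rejected write returns the current version, so the client can re-read and retry, which is how lost updates are avoided without blocking.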
Election algorithms
Principle
An algorithm requires that some process acts as a coordinator. The question
is how to select this special process dynamically.
Note
In many systems, the coordinator is chosen manually (e.g., file servers).
This leads to centralized solutions ⇒ single point of failure.
Coordination Election algorithms
Teasers
1. If a coordinator is chosen dynamically, to what extent can we speak
about a centralized or distributed solution?
2. Is a fully distributed solution, i.e. one without a coordinator, always
more robust than any centralized/coordinated solution?
Basic assumptions
Election by bullying
Principle
Consider N processes {P0, . . . , PN−1} and let id(Pk) = k. When a process
Pk notices that the coordinator is no longer responding to requests, it initiates
an election:
1. Pk sends an ELECTION message to all processes with higher
identifiers: Pk+1, Pk+2, . . . , PN−1.
2. If no one responds, Pk wins the election and becomes coordinator.
3. If one of the higher-ups answers, it takes over the election and Pk’s job is done.
The bully election algorithm
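Since a higher-numbered process always takes over, the election outcome is simply the highest-numbered live process. The sketch below simulates the message exchange with set-membership tests instead of real messages; all names are illustrative.

```python
def bully_election(initiator, alive):
    """Sketch of the bully algorithm's outcome over a set of process
    IDs. 'alive' is the set of processes that respond to messages."""
    higher = [p for p in alive if p > initiator]
    if not higher:
        # No higher-numbered process answers: the initiator wins.
        return initiator
    # Otherwise a higher process takes over; applying the same rule
    # recursively, the highest live process becomes coordinator.
    return max(higher)
```

The name "bully" reflects that the biggest live ID always wins, regardless of who started the election.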
Election in a ring
Principle
Process priority is obtained by organizing processes into a (logical) ring.
The process with the highest priority should be elected as coordinator.
• Any process can start an election by sending an election message to its
successor. If a successor is down, the message is passed on to the
next successor.
• If a message is passed on, the sender adds itself to the list. When it
gets back to the initiator, everyone had a chance to make its presence
known.
• The initiator sends a coordinator message around the ring containing a
list of all living processes. The one with the highest priority is elected
as coordinator.
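The three steps above can be sketched as a single pass around the ring: collect the live processes into a list, then announce the highest one. This simulation replaces real message passing with a membership set; the function name is invented for the example.

```python
def ring_election(initiator, ring, alive):
    """Sketch of a ring election. 'ring' is the logical ring as a list
    of process IDs; 'alive' is the set of live processes. The ELECTION
    message travels once around the ring, skipping dead processes;
    each live process appends itself to the message's list."""
    members = []
    n = len(ring)
    start = ring.index(initiator)
    for step in range(n):
        p = ring[(start + step) % n]
        if p in alive:
            members.append(p)
    # The COORDINATOR message then announces the highest-priority
    # live process to everyone on the list.
    return max(members), members
```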
A ring algorithm
Coordination Election algorithms
Election algorithm using a ring
Note
When s∗ believes it should be the leader, it broadcasts ⟨id(s∗), tx(s∗)⟩. Essentially, we’re bullying.
Observation
By slightly varying the timeout values per follower for deciding when to
start an election, we can avoid concurrent elections, and an election will
rapidly converge.
• Task: given a bit string h = Hi(m), find a bit string h̃ such that the hash of the concatenation h∥h̃ has K leading zero bits.

Observation
By controlling K, we control the difficulty of finding h̃. If p is the probability
that a random guess for h̃ will suffice: p = (1/2)^K.
Current practice
In many PoW-based blockchain systems, K = 64
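The search for h̃ is plain brute force: try nonces until the hash has enough leading zero bits. A runnable sketch (using SHA-256 as the hash, an assumption for illustration; with K = 64 as above, the expected 2^64 guesses make this infeasible, so demos use a small K):

```python
import hashlib

def leading_zero_bits(digest):
    """Number of leading zero bits of a byte string."""
    return 8 * len(digest) - int.from_bytes(digest, "big").bit_length()

def proof_of_work(h, K):
    """Find a nonce h_tilde such that hash(h || h_tilde) has at least
    K leading zero bits. Each guess succeeds with p = (1/2)^K, so the
    expected number of iterations is 2^K."""
    nonce = 0
    while True:
        digest = hashlib.sha256(h + nonce.to_bytes(8, "big")).digest()
        if leading_zero_bits(digest) >= K:
            return nonce
        nonce += 1
```

Verification is the asymmetry that makes this useful: checking a claimed h̃ costs one hash, while finding it costs 2^K hashes on average.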
Elections in large-scale systems
Principle
• Draw a random number k ∈ {1,..., N}
• Look up the process P that owns the token with index k . P is the
next leader.
Observation
The more tokens a process owns, the higher the probability it will be
selected as leader.
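The token-based draw amounts to weighted random selection: conceptually, lay all N tokens in a row, draw an index k, and return the owner of the k-th token. A minimal sketch (the function name and token map are illustrative):

```python
import random

def pick_leader(tokens, rng=random):
    """Draw k in {1, ..., N} and return the process owning token k.
    'tokens' maps each process to its token count, so a process is
    selected with probability proportional to the tokens it owns."""
    N = sum(tokens.values())
    k = rng.randint(1, N)
    # Walk the processes in a fixed order, subtracting token counts
    # until the draw falls inside some process's range.
    for process, count in sorted(tokens.items()):
        if k <= count:
            return process
        k -= count
    raise AssertionError("unreachable: k is always within some range")
```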
Essence
Find the node with the highest capacity to select as the next
leader.
Essence
A node reports back only the node that it found to have the highest
capacity.
vi , vj ← (vi + vj )/2
Result: in the end each node will have computed the average v̄ = (∑i vi)/N.
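The pairwise-averaging step can be simulated directly: each exchange preserves the sum of all values (so the global average is invariant) while shrinking the spread, and repeated exchanges drive every value to v̄. A minimal simulation sketch:

```python
import random

def gossip_average(values, rounds=2000, seed=42):
    """Simulate gossip-based averaging: in each round, two randomly
    chosen nodes replace both of their values with the pair's mean.
    The global sum is preserved, so all values converge to the
    network-wide average."""
    v = list(values)
    rng = random.Random(seed)
    n = len(v)
    for _ in range(rounds):
        i, j = rng.sample(range(n), 2)
        v[i] = v[j] = (v[i] + v[j]) / 2
    return v
```

Knowing v̄ and N lets every node derive the sum ∑i vi locally, which is why this simple primitive supports many aggregates.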
Aggregation
Coordination Gossip-based coordination
Basics
• Each node maintains a list of c references to other nodes
• Regularly, pick another node at random (from the list), and
exchange roughly c/2 references
• When the application needs to select a node at random, it picks
a random one from its local list.
Observation
Statistically, the selection of a peer from the local list turns out to be
indistinguishable from selecting a peer uniformly at random from the
entire network.
A peer-sampling service
Secure gossiping
Dramatic attack
Consider that, when exchanging references, a set of colluding nodes
systematically returns links only to each other ⇒ we are dealing with a hub
attack.
Situation
A network with 100,000 nodes, a local list size c = 30, and only 30
attackers. The y-axis shows the number of nodes with links only to the
attackers. After less than 300 rounds, the attackers have full control.
Basic approach
When a benign node initiates an exchange, it may either use the result for
gathering statistics, or for updating its local list. An attacker is in limbo: will
its response be used for statistical purposes or for functional purposes?
Observation
Because gathering statistics may reveal colluders, a colluding node is
forced to behave according to the protocol.
Coordination Distributed event matching
Principle
• A process specifies in which events it is interested (subscription S)
• When a process publishes a notification N we need to see whether S
matches N.
Hard part
Implementing the match function in a scalable manner.
General approach
What is needed
• sub2node(S): map a subscription S to a nonempty subset S of servers
• not2node(N): map a notification N to a nonempty subset N of servers
• Make sure that S ∩ N ≠ ∅
Observations
• A centralized solution is simple: S = N = {s}, i.e., a single server.
• Topic-based publish-subscribe is also simple: each S and N is tagged
with a single topic; each topic is handled by a single server (a
rendezvous node). Several topics may be handled by the same server.
• Content-based publish-subscribe is tough: a subscription takes the form
of an (attribute, value) pair, with example values:
  • range: “1 ≤ x < 10”
  • containment: “x ∈ {red, blue}”
  • prefix and suffix expressions: “url.startswith("https")”
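A content-based match function can be sketched by representing a subscription as a map from attribute names to predicates; a notification matches when every predicate holds. This is a toy illustration of the matching semantics (names and predicates mirror the examples above), not a scalable implementation.

```python
def matches(subscription, notification):
    """A notification (a dict of attribute values) matches a
    subscription (a dict of attribute -> predicate) when every
    subscribed attribute is present and its predicate holds."""
    return all(name in notification and pred(notification[name])
               for name, pred in subscription.items())

# Example subscription mirroring the slide's value types:
# a range, a containment test, and a prefix expression.
sub = {
    "x": lambda v: 1 <= v < 10,
    "color": lambda v: v in {"red", "blue"},
    "url": lambda v: v.startswith("https"),
}
```

The hard, scalability-related part is not this per-event check but deciding which servers need to evaluate it for which notifications.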
Centralized implementations
Selective routing
(a) first broadcast subscriptions
(b) forward notifications only to relevant rendezvous nodes
Gossiping: Sub-2-Sub
Basics
• Goal: To realize scalability, make sure that subscribers with the same
interests form just a single group
• Model: There are N attributes a1, . . . , aN. An attribute value is
always (mappable to) a floating-point number.
• Subscription: Takes forms such as S = ⟨a1 → 3.0, a4 → [0.0, 0.5)⟩:
a1 should be 3.0; a4 should lie between 0.0 and 0.5; other attribute
values don’t matter.
Observations
• A subscription Si specifies a subset Si in an N-dimensional space.
• We are interested only in notifications that fall into S = ∪Si.
Secure publish-subscribe
We are facing nasty dilemmas
• Referential decoupling: messages should be able to flow from a
publisher to subscribers while guaranteeing mutual anonymity ⇒ we
cannot set up a secure channel.
• Not knowing where messages come from imposes integrity problems.
• Assuming a trusted broker may easily be practically impossible,
certainly when dealing with sensitive information ⇒ we now have a
routing problem.
Solution
• Allow for searching (and matching) on encrypted data, without the
need for decryption.
• PEKS: accompany encrypted messages with a collection of
(again encrypted) keywords and search for matches on keywords.
Positioning nodes
Issue
In large-scale distributed systems in which nodes are dispersed across a
wide-area network, we often need to take some notion of proximity or
distance into account ⇒ it starts with determining a (relative) location of a
node.
Coordination Location systems
Computing position
Observation
A node P needs d + 1 landmarks to compute its own position in a
d-dimensional space. Consider the two-dimensional case.
Computing a position in 2D
Solution
P needs to solve three equations in two unknowns (xP, yP):

(xi − xP)² + (yi − yP)² = di²,  i = 1, 2, 3

where (xi, yi) is the position of landmark i and di the measured distance to it.
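With exact distances, the three circle equations can be solved in closed form: subtracting the first equation from the other two cancels the quadratic terms and leaves a 2×2 linear system. A minimal sketch under the assumption of noise-free measurements (function and variable names are invented for the example):

```python
def position_2d(landmarks, distances):
    """Solve for (xP, yP) from three landmark positions and exact
    distances by subtracting the first circle equation from the
    other two, yielding a 2x2 linear system A [xP, yP]^T = b."""
    (x1, y1), (x2, y2), (x3, y3) = landmarks
    d1, d2, d3 = distances
    # Coefficients from eq1 - eq2 and eq1 - eq3 (quadratics cancel).
    a11, a12 = 2 * (x2 - x1), 2 * (y2 - y1)
    a21, a22 = 2 * (x3 - x1), 2 * (y3 - y1)
    b1 = d1**2 - d2**2 + x2**2 - x1**2 + y2**2 - y1**2
    b2 = d1**2 - d3**2 + x3**2 - x1**2 + y3**2 - y1**2
    # Solve the 2x2 system with Cramer's rule.
    det = a11 * a22 - a12 * a21
    return ((b1 * a22 - b2 * a12) / det,
            (a11 * b2 - a21 * b1) / det)
```

With noisy real-world distances the three circles no longer intersect in one point, which is why practical systems switch to least-squares minimization.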
Observation
4 satellites ⇒ 4 equations in 4 unknowns (with the receiver’s clock deviation ∆r as one of them)
GPS: Global Positioning System
N different locations {x⃗1, x⃗2, . . . , x⃗N}, each with a known GPS location.
Problems
• Limited accuracy of each GPS detection point ⃗xi
• An access point has a nonuniform transmission range
• Number of sampled detection points N may be too low.
When GPS is not an option
Computing position
Choosing the dimension m
The hidden parameter is the dimension m, with N > m. A node P measures
its distance to each of the N landmarks and computes its coordinates by
minimizing

∑i=1..N [ d̂(Li, P) − d(Li, P) ]²

where d̂(Li, P) is the measured distance to landmark Li and d(Li, P) the
distance computed from the coordinates in the m-dimensional space.
Observation
Practice shows that m can be as small as 6 or 7 to achieve latency
estimations within a factor 2 of the actual value.
Vivaldi