Lecture 05
Distributed Systems
Instructor: YU Haifeng
Today’s Roadmap
Chapter 6 of textbook
Mutual exclusion
Leader election
Haifeng Yu, CS5223, Some Contents Adapted (with permission) from © R.Ayani, G.Tan 2
Motivation for Physical Clocks
Example: Timestamp files
Creation time, modification time, etc.
Motivation for Physical Clock Synchronization
Electronic devices (including computers) usually advance
their clocks based on the oscillations of a crystal
Clock Synchronization Can be Confusing!
Human beings are used to talking about a single
clock
Confusion usually arises when there are multiple clocks, each of
which may advance at its own pace
Such confusion already arises in some places in the textbook…
Cristian’s Algorithm
Assumptions
Propagation delay for request and reply is the same
Clock rate drift is 0
[Figure: timelines of A (local times Ta1, Ta2) and B (local times Tb1, Tb2); B sends a reply to A]
T1, T2, T3, T4 are accurate times
Ta2 – Ta1 = T4 – T1 (since clock rate drift is 0)
Tb2 – Tb1 = T3 – T2
Propagation delay d = ((T4 - T1) – (T3 - T2)) / 2
[Figure: timelines of A and B annotated with accurate times: T1 at Ta1 and T4 at Ta2 on A; T2 at Tb1 and T3 at Tb2 on B]
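As a rough illustration of the delay formula, the sketch below computes d from A's and B's local timestamps and then estimates B's clock at the moment the reply reaches A. The function names are hypothetical, and the last step (taking Tb2 + d as the estimate) is the usual way the delay is used, not something stated on this slide.

```python
def propagation_delay(ta1, ta2, tb1, tb2):
    """One-way delay d, assuming symmetric delays and zero clock rate drift."""
    round_trip = ta2 - ta1      # equals T4 - T1 when A's rate drift is 0
    processing = tb2 - tb1      # equals T3 - T2 when B's rate drift is 0
    return (round_trip - processing) / 2


def estimate_b_time_at_reply(ta1, ta2, tb1, tb2):
    """Estimate of B's clock when the reply reaches A (assumed usage of d)."""
    return tb2 + propagation_delay(ta1, ta2, tb1, tb2)


# Timestamps in milliseconds; A's and B's clocks differ by a large offset.
d = propagation_delay(ta1=1000, ta2=1030, tb1=5000, tb2=5010)
print(d)                                                  # 10.0
print(estimate_b_time_at_reply(1000, 1030, 5000, 5010))   # 5020.0
```

Only differences of timestamps taken on the same clock are used, which is why the offset between A's and B's clocks does not affect d.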
Key idea: (T4 – T1) is accurate even if A has clock drift
What if A’s clock rate drift is not 0?
Consider a scenario where d = 0, Tb2 – Tb1 = 1 ms, and Ta2 – Ta1 = 2
minutes
The formula then estimates d ≈ 1 minute, although the true delay is 0
Thus this implicit assumption (not mentioned in the textbook) is
critical!
[Figure: timelines of A and B annotated with accurate times: T1 at Ta1 and T4 at Ta2 on A; T2 at Tb1 and T3 at Tb2 on B]
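Plugging the scenario into the delay formula shows how large the error can get. The numbers are the ones from the slide; the variable names are only for illustration.

```python
# All values in milliseconds.
true_delay = 0
b_processing = 1                   # Tb2 - Tb1
a_measured = 2 * 60 * 1000         # Ta2 - Ta1 as measured by A's drifting clock

estimated_delay = (a_measured - b_processing) / 2
print(estimated_delay)             # 59999.5 ms, i.e. about one minute
print(estimated_delay - true_delay)
```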
Network Time Protocol (NTP)
Inaccuracy in Cristian’s protocol comes from
Non-identical propagation delay for request and reply (quite
common in wide-area network)
Clock rate drift not being 0 -- impact is usually small (why?)
Stratum-1 servers:
Servers with UTC receivers or atomic clocks
The Berkeley Algorithm
Assumptions
Network delay = 0
Usually for cases where no machine has the “accurate time”
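The slide lists only the assumptions. As a hedged sketch of the commonly described form of the algorithm, a coordinator collects every machine's clock reading, averages them, and tells each machine how much to adjust. The function name and the sample readings (in minutes) are illustrative assumptions; network delay is taken as 0, as stated above.

```python
def berkeley_adjustments(readings):
    """Return, per machine, the amount to add to its clock to reach the average."""
    average = sum(readings.values()) / len(readings)
    return {name: average - value for name, value in readings.items()}


# Clock readings in minutes since midnight (3:00, 3:25, 2:50).
readings = {"coordinator": 180, "m1": 205, "m2": 170}
print(berkeley_adjustments(readings))
# {'coordinator': 5.0, 'm1': -20.0, 'm2': 15.0}
```

Sending adjustments rather than the average itself means no machine needs to be treated as holding the "accurate time".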
The Berkeley Algorithm: Example
The Berkeley Algorithm
Quite straightforward
Problems with Physical Clocks
Many protocols need to impose an ordering among events
If two players open the same treasure chest “almost” at the same
time, who should get the treasure?
Physical clocks:
Seem to completely solve the problem
But what about theory of relativity?
Even without theory of relativity – efficiency problems
Software “Clocks”
Software “clocks” can incur much lower overhead
than maintaining (sufficiently accurate) physical
clocks
But they do not allow comparison with external clocks
Assumptions
A process can perform three kinds of atomic
actions/events
Local computation
Send a single message to a single process
Receive a single message from a single process
No atomic broadcast
Communication model
Point-to-point
Error-free, infinite buffer
Potentially out of order
Visible Ordering to Users
A → B (process order)
B → C (send-receive order)
A → C (transitivity)
A ? D
B ? D
[Figure: events A and B on user1 (process1), event C on user2 (process2), event D on user3 (process3)]
“Happened-Before” Relation
“Happened-before” relation captures the ordering that
is visible to users when there is no physical clock
A partial order among events
Process-order, send-receive order, transitivity
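The three rules can be checked mechanically. The sketch below computes the transitive closure over process-order and send-receive edges. The event names follow the A, B, C, D example from the previous slide; that D has no edge to or from the other events is an assumption made for the illustration, and the function name is hypothetical.

```python
from itertools import product


def happened_before(events, direct_edges):
    """Transitive closure of the process-order and send-receive edges."""
    hb = set(direct_edges)
    # Warshall-style closure: the intermediate event k is the outermost loop.
    for k, i, j in product(events, repeat=3):
        if (i, k) in hb and (k, j) in hb:
            hb.add((i, j))
    return hb


events = ["A", "B", "C", "D"]
edges = {("A", "B"),   # process order on user1
         ("B", "C")}   # send-receive order from user1 to user2
hb = happened_before(events, edges)
print(("A", "C") in hb)                       # True, by transitivity
print(("A", "D") in hb or ("D", "A") in hb)   # False: A and D are not ordered
```

The second result is what makes the relation a partial order: some pairs of events are simply not comparable.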
Software “Clock” 1: Logical Clocks
Each event has a single integer as its logical clock
value
Each process has a local counter C
Increment C at each “local computation” and “send” event
When sending a message, logical clock value V is attached
to the message. At each “receive” event, C = max(C, V) + 1
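[Figure: example run with logical clock values. user1 (process1): 1, 2, 3, 4; user2 (process2): 1, 3, 4, 5, 6; user3 (process3): 1, 2, 6. Two receive events are annotated: 3 = max(1,2)+1 and 4 = max(3,3)+1]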
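The update rules can be sketched as a small class. Starting the counter at 0 (so that the first event gets value 1) and the method names are assumptions made for illustration.

```python
class LogicalClock:
    def __init__(self):
        self.c = 0          # local counter C (initial value 0 is an assumption)

    def local_event(self):
        self.c += 1
        return self.c

    def send(self):
        self.c += 1
        return self.c       # value V to attach to the outgoing message

    def receive(self, v):
        self.c = max(self.c, v) + 1
        return self.c


p1, p2 = LogicalClock(), LogicalClock()
p1.local_event()        # 1
v = p1.send()           # 2
p2.local_event()        # 1
print(p2.receive(v))    # 3 = max(1, 2) + 1
```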
Logical Clock Properties
Theorem:
Event s happens before t ⇒ the logical clock value of s is
smaller than the logical clock value of t.
The reverse may not be true
[Figure: example run with logical clock values. user1 (process1): 1, 2, 3, 4; user2 (process2): 1, 3, 4, 5, 6; user3 (process3): 1, 2, 6]
Example Application for Logical Clock
Total-ordered broadcast protocol is the “standard”
example application for logical clock (as in textbook)
But it is complex and will take half a lecture to go through
(The protocol is covered in CS4231…)
[Figure: Cary and Malice; Email 1 and Email 2 both read “I proved P=NP and here is my proof”]
Software “Clock” 2: Vector Clocks
Logical clock:
Event s happens before event t ⇒ the logical clock value of s
is smaller than the logical clock value of t.
Vector clock:
Event s happens before event t ⇔ the vector clock value of s
is “smaller” than the vector clock value of t.
Each event has a vector of n integers as its vector
clock value
v1 = v2 if all n fields are the same
v1 ≤ v2 if every field in v1 is less than or equal to the
corresponding field in v2
v1 < v2 if v1 ≤ v2 and v1 ≠ v2
Relation “<” here is
not a total order
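The comparison rules translate directly into code. This is a minimal sketch that assumes both vectors have the same length n; the function names are illustrative.

```python
def vc_leq(v1, v2):
    """v1 <= v2: every field of v1 is <= the corresponding field of v2."""
    return all(a <= b for a, b in zip(v1, v2))


def vc_less(v1, v2):
    """v1 < v2: v1 <= v2 and v1 != v2."""
    return vc_leq(v1, v2) and tuple(v1) != tuple(v2)


def incomparable(v1, v2):
    """Neither v1 < v2 nor v2 < v1, and the vectors differ."""
    return tuple(v1) != tuple(v2) and not vc_less(v1, v2) and not vc_less(v2, v1)


print(vc_less((1, 0, 0), (2, 1, 0)))        # True
print(vc_less((0, 1, 0), (0, 0, 2)))        # False
print(incomparable((0, 1, 0), (0, 0, 2)))   # True: "<" is not a total order
```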
Vector Clock Protocol
Each process i has a local vector C
Increment C[i] at each “local computation” and “send” event
When sending a message, vector clock value V is attached to
the message. At each “receive” event, C = pairwise-max(C, V);
C[i]++;
C = (0,1,0), V = (0,0,2)
pairwise-max(C, V) = (0,1,2)
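A minimal sketch of the protocol follows. The class and method names are illustrative, process indices are 0-based, and every vector is assumed to start at all zeros.

```python
class VectorClock:
    def __init__(self, i, n):
        self.i = i              # index of this process (0-based, an assumption)
        self.c = [0] * n        # local vector C

    def local_event(self):
        self.c[self.i] += 1
        return tuple(self.c)

    def send(self):
        self.c[self.i] += 1
        return tuple(self.c)    # value V to attach to the outgoing message

    def receive(self, v):
        self.c = [max(a, b) for a, b in zip(self.c, v)]   # pairwise-max(C, V)
        self.c[self.i] += 1                               # C[i]++
        return tuple(self.c)


p = VectorClock(i=1, n=3)
p.local_event()                 # C = (0, 1, 0)
print(p.receive((0, 0, 2)))     # (0, 2, 2): pairwise max (0, 1, 2), then C[1]++
```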
Vector Clock Properties
Event s happens before t ⇒ vector clock value of s < vector
clock value of t: There must be a chain from s to t
Event s happens before t ⇐ vector clock value of s < vector
clock value of t
If s and t are on the same process, done
If s is on p and t is on q, let VS be s’s vector clock and VT be t’s
VS < VT ⇒ VS[p] ≤ VT[p] ⇒ There must be a sequence of messages from
p to q after s and before t
[Figure: event s on process p and event t on process q]
Example Application of Vector Clock
Each Bitcoin node has some blocks (assume that
Bitcoin runs on a fixed set of nodes)
Want all nodes to know all blocks
Each block has a vector clock value
I have seen all blocks whose vector clock is smaller than
(2,3,2):
I can avoid linear search for existence testing
[Figure: blocks on user1 (process1) with vector clock values B1, (1,0,0); (2,0,0); (3,0,0)]
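One way to read this, sketched under assumptions: if a node summarizes what it has seen as a single vector, checking whether a block is already known is one field-by-field comparison rather than a search through the stored blocks. The summary vector (2,3,2) is from the slide; the function name and the use of ≤ rather than strict < are illustrative choices.

```python
def already_seen(block_clock, summary):
    """True if every field of block_clock is <= the matching field of summary."""
    return all(b <= s for b, s in zip(block_clock, summary))


summary = (2, 3, 2)
print(already_seen((1, 0, 0), summary))   # True
print(already_seen((2, 0, 0), summary))   # True
print(already_seen((3, 0, 0), summary))   # False: not covered by the summary
```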
Mutual Exclusion: Token-based Approach
Motivation is the same as in non-distributed systems
Token-based approach
A single token system-wide
Whoever holds the token has the privilege
[Figure: processes 0, 1, 2 and lock manager C with its queue; process 1 sends a request and C replies “ok”]
[Figure: processes 0, 1, 2 and lock manager C; process 2 requests the lock while process 1 is holding it, so C sends no reply and places 2 in its queue]
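The two figures can be sketched as a small lock manager. The class and method names are hypothetical, and granting the lock to queued processes in FIFO order on release is an assumption, since the slides show only the request side.

```python
from collections import deque


class LockManager:
    def __init__(self):
        self.holder = None
        self.queue = deque()

    def request(self, pid):
        """Return "ok" if granted now, or None (no reply yet) if queued."""
        if self.holder is None:
            self.holder = pid
            return "ok"
        self.queue.append(pid)
        return None

    def release(self, pid):
        """Holder gives the lock back; the next queued process, if any, gets it."""
        if pid != self.holder:
            raise ValueError("only the holder can release the lock")
        self.holder = self.queue.popleft() if self.queue else None
        return self.holder      # process that now receives "ok", or None


c = LockManager()
print(c.request(1))   # ok
print(c.request(2))   # None: process 2 waits in the queue
print(c.release(1))   # 2: the lock passes to process 2
```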
Mutual Exclusion: Token-based Approach
Also possible to implement it as a token ring
Pros:
Simplicity – lock manager approach is widely used
Fairness easy to ensure
No starvation
Cons:
Cannot deal with failures nicely
Mutual Exclusion: Token-based Approach
Dealing with client failures
Use leases – the lock is granted for only a finite amount of
time
Automatically revoked after time out
Problem: A client is writing to a large file but cannot renew
the lease due to lost network connection
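A minimal sketch of a lease-based lock follows, including the problem case. The class name, the explicit `now` argument (standing in for the lock manager's clock), and the 10-unit lease duration are all assumptions made for illustration.

```python
class LeaseLock:
    def __init__(self, duration):
        self.duration = duration
        self.holder = None
        self.expires_at = 0.0

    def _expire(self, now):
        if self.holder is not None and now >= self.expires_at:
            self.holder = None       # automatic revocation after timeout

    def acquire(self, client, now):
        self._expire(now)
        if self.holder is None:
            self.holder, self.expires_at = client, now + self.duration
            return True
        return False

    def renew(self, client, now):
        self._expire(now)
        if self.holder == client:
            self.expires_at = now + self.duration
            return True
        return False


lock = LeaseLock(duration=10)
print(lock.acquire("writer", now=0))    # True
print(lock.acquire("other", now=5))     # False: lease still valid
print(lock.acquire("other", now=12))    # True: writer's lease expired
print(lock.renew("writer", now=13))     # False: writer lost the lock mid-write
```

The last two calls show the problem: a writer that could not renew in time may still believe it holds the lock while another client has already acquired it.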
Mutual Exclusion: Voting
We want to deal with server crashes
Have n servers instead of 1
Each server has a vote
A server may crash and recover
Mutual Exclusion: Voting
Theorem:
No two clients can hold the lock at the same time with
majority voting (even if some servers crash)
Disadvantage:
Need n servers instead of one
Need to contact ⌊n/2⌋+1 servers to get the lock
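The sketch below illustrates why two clients cannot both collect a majority: each server gives its single vote to at most one client. The class and function names are hypothetical. Returning votes after a failed attempt and remembering a vote across a crash and recovery are not modeled here.

```python
class VotingServer:
    def __init__(self):
        self.voted_for = None
        self.up = True

    def request_vote(self, client):
        """Grant the single vote if the server is up and the vote is free."""
        if not self.up:
            return False
        if self.voted_for in (None, client):
            self.voted_for = client
            return True
        return False


def try_lock(client, servers):
    """The client holds the lock if it collects at least floor(n/2)+1 votes."""
    votes = sum(s.request_vote(client) for s in servers)
    return votes >= len(servers) // 2 + 1


servers = [VotingServer() for _ in range(5)]
servers[4].up = False                    # one crashed server
print(try_lock("c1", servers))           # True: 4 of 5 votes
print(try_lock("c2", servers))           # False: no free votes left
```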
Mutual Exclusion: Voting
Majority voting is extremely flexible
How about assigning different numbers of votes to different
servers?
A (5 votes), B (2 votes), C (2 votes), D (2 votes)
What is the impact on availability and performance?
Mutual Exclusion: Quorum Systems
A (5 votes), B (2 votes), C (2 votes), D (2 votes) =
{ {A, B}, {A, C}, {A, D},
{A, B, C}, {A, B, D}, {A, C, D}, {B, C, D},
{A, B, C, D} }
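The listed sets can be reproduced by enumerating every group of servers that holds a strict majority of the 11 votes. This is an illustrative check; the variable names are assumptions.

```python
from itertools import combinations

votes = {"A": 5, "B": 2, "C": 2, "D": 2}
total = sum(votes.values())                       # 11

quorums = [
    set(group)
    for size in range(1, len(votes) + 1)
    for group in combinations(sorted(votes), size)
    if 2 * sum(votes[s] for s in group) > total   # strict majority of votes
]
print(len(quorums))                                           # 8
print(all(q1 & q2 for q1 in quorums for q2 in quorums))       # True
```

The second line checks the property that matters for mutual exclusion: any two quorums share at least one server.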
History Readings (Non-compulsory)
“The reliability of voting mechanisms”, IEEE
Transactions on Computers, pages 1197-1208, Oct
1987.
Leader Election: Motivation
We often need a coordinator in distributed systems
Leader, distinguished node/process
Leader Election on General Graph (n known)
Complete graph
Each node sends its id to all other nodes
Wait until you receive n ids
Biggest id wins
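The steps above can be simulated in a few lines. The sketch assumes unique ids and counts a node's own id among the n ids it waits for; the function name is hypothetical, and message passing is replaced by direct writes into each node's inbox.

```python
def elect_leader(node_ids):
    """Simulate one round on a complete graph; returns each node's chosen leader."""
    n = len(node_ids)
    # inbox[v] starts with v's own id, then receives every other node's id.
    inbox = {v: {v} for v in node_ids}
    for sender in node_ids:
        for receiver in node_ids:
            if receiver != sender:
                inbox[receiver].add(sender)
    assert all(len(ids) == n for ids in inbox.values())   # n ids known
    return {v: max(ids) for v, ids in inbox.items()}


print(elect_leader([17, 4, 42, 8]))   # every node picks 42
```

Because every node sees the same set of ids, all nodes reach the same decision without any further communication.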
Leader Election on General Graph (n unknown)
Complete graph
n must be known
Spanning Tree Construction
Remember: No centralized coordinator
Goal of the protocol: Each node knows its parent and children (i.e.,
a distributed tree)
Counting Nodes Using a Spanning Tree
Disseminate “start count” msg down the tree
Count will be aggregated up the tree
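The aggregation step can be sketched with recursion standing in for the messages: the call going down plays the role of "start count", and the return value is the count a node reports to its parent. The tree shape and node names are made up for illustration.

```python
def count_nodes(children, node):
    """Count reported by `node` to its parent: itself plus its subtrees."""
    return 1 + sum(count_nodes(children, child) for child in children.get(node, []))


# Each node knows only its own children, as in a distributed tree.
children = {"root": ["a", "b"], "a": ["c", "d"], "b": ["e"]}
print(count_nodes(children, "root"))   # 6
```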
Summary
Physical clock synchronization
Mutual exclusion
Leader election