2016 DistributedSystems 1B L4
Dr Robert N. M. Watson
Last time
• Started to look at time in distributed systems
– Coordinating actions between processes
• Physical clocks ‘tick’ based on physical processes (e.g.
oscillations in quartz crystals, atomic transitions)
– Imperfect, so gain/lose time over time
– (wrt nominal perfect ‘reference’ clock (such as UTC))
• The process of gaining/losing time is clock drift
• The difference between two clocks is called clock skew
• Clock synchronization aims to minimize clock skew
between two (or a set of) different clocks
From last lecture
The clock synchronization problem
• In distributed systems, we’d like all the different
nodes to have the same notion of time, but
– quartz oscillators oscillate at slightly different
frequencies (time, temperature, manufacture)
• Hence clocks tick at different rates:
– create ever-widening gap in perceived time
– this is called clock drift
• The difference between two clocks at a given
point in time is called clock skew
• Clock synchronization aims to minimize clock
skew between two (or a set of) different clocks
Dealing with drift
• A clock can have positive or negative drift with
respect to a reference clock (e.g. UTC)
– Need to [re]synchronize periodically
• Can’t just set clock to ‘correct’ time
– Jumps (particularly backward!) can confuse apps
• Instead aim for gradual compensation
– If clock fast, make it run slower until correct
– If clock slow, make it run faster until correct
Compensation
• Most systems relate real-time to cycle counters or periodic
interrupt sources
– E.g. calibrate CPU Time-Stamp Counter (TSC) against CMOS
Real-Time Clock (RTC) at boot, and compute scaling factor (e.g.
cycles per ms)
– Can now convert TSC differences to real-time
– Similarly can determine how much real-time passes between
periodic interrupts: call this delta
– On interrupt, add delta to software real-time clock
• Making small changes to delta gradually adjusts time
– Once synchronized, change delta back to original value
– (Or try to estimate drift & continually adjust delta)
– Minimise time discontinuities from stepping
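A minimal sketch of the delta-adjustment idea above, written in Python for readability (a real kernel would do this in its timer interrupt handler). The class name `SlewedClock`, the `max_rate` parameter and its default value are assumptions made for illustration, not taken from any particular operating system.

```python
import math


class SlewedClock:
    """Software clock advanced by `delta` seconds on every periodic interrupt."""

    def __init__(self, nominal_delta, start=0.0):
        self.nominal_delta = nominal_delta  # real time per tick, from calibration
        self.delta = nominal_delta          # value actually added on each tick
        self.now = start                    # software real-time clock (seconds)
        self.slew_ticks = 0                 # ticks remaining with an adjusted delta

    def adjust(self, offset, max_rate=0.0005):
        """Schedule a gradual correction of `offset` seconds.

        A positive offset means the clock is slow, a negative one that it is fast.
        `max_rate` (assumed value) limits how far delta may deviate per tick.
        """
        if offset == 0:
            return
        per_tick = self.nominal_delta * max_rate
        self.slew_ticks = max(1, math.ceil(abs(offset) / per_tick))
        self.delta = self.nominal_delta + offset / self.slew_ticks

    def tick(self):
        """Called on every timer interrupt: add delta to the software clock."""
        self.now += self.delta
        if self.slew_ticks > 0:
            self.slew_ticks -= 1
            if self.slew_ticks == 0:
                # Correction complete: change delta back to its original value.
                self.delta = self.nominal_delta
```

Because delta stays positive throughout, the clock never jumps or runs backwards; it only runs slightly slow or fast until the offset has been absorbed.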
Obtaining accurate time
• Of course, need some way to know correct time
(e.g. UTC) in order to adjust clock!
– could attach a GPS receiver (or GOES receiver) to
computer, and get ±1ms (or ±0.1ms) accuracy…
– …but too expensive/clunky for general use
– (RF in server rooms and data centres non-ideal)
• Instead can ask some machine with a more
accurate clock over the network: a time server
– e.g. send RPC getTime() to server
– What’s the problem here?
Cristian’s Algorithm (1989)
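[Figure: client/server timeline. The client sends a request at T0 and receives the reply at T1; the server's time is Ts. Example values shown: server S, Ts = 08:02:04.325; client C, T1 = 08:02:02.130.]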
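The usual formulation of Cristian's algorithm can be sketched as follows, assuming the request and reply take roughly equal time on the network. The function name and the use of plain seconds as floats are choices made for this illustration; T0, T1 and Ts are the client send time, client receive time and server timestamp.

```python
def cristian_estimate(t0, t1, ts):
    """Estimate the server's time at the moment the reply arrives.

    t0, t1: client clock readings when the request was sent / reply received.
    ts:     server clock reading carried in the reply.
    Returns (estimated_server_time_at_t1, error_bound).
    """
    rtt = t1 - t0                 # round-trip time measured on the client clock
    estimate = ts + rtt / 2       # assume the reply spent half the RTT in flight
    error_bound = rtt / 2         # worst case: the whole RTT was in one direction
    return estimate, error_bound


# Hypothetical numbers: request sent at 10.0 s, reply received at 10.5 s,
# server reported 20.0 s.
estimate, bound = cristian_estimate(10.0, 10.5, 20.0)
correction = estimate - 10.5      # how far the client clock is behind the server
```

The shorter the round trip, the tighter the bound, which is why a client may repeat the request and keep the sample with the smallest RTT.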
Berkeley Algorithm (1989)
• Don’t assume have an accurate time server
• Try to synchronize a set of clocks to the average
– One machine, M, is designated the master
– M periodically polls all other machines for their time
– (can use Cristian’s technique to account for delays)
– Master computes average (including itself, but ignoring
outliers), and sends an adjustment to each machine
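[Figure: master M polling machines A, B and C, before and after adjustment. Example: Avg = (01:17 + 01:12 + 02:01)/3 = 04:30/3 = 01:30, i.e. 08:01:30; the clock reading 08:01:17 is adjusted by +00:00:13 and the clock reading 08:02:01 by -00:00:31.]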
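A small sketch of the master's averaging step, using the three readings from the figure expressed as seconds past 08:00. The outlier rule (drop readings further than `tolerance` from the median) and the machine names are assumptions for illustration; the slides only say that outliers are ignored.

```python
import statistics


def berkeley_adjustments(readings, tolerance):
    """Compute the adjustment the master sends to each machine.

    readings:  dict mapping machine name -> clock reading in seconds
               (the master's own reading is included).
    tolerance: readings further than this from the median are treated as
               outliers and left out of the average (assumed rule).
    Returns a dict mapping machine name -> adjustment in seconds.
    """
    median = statistics.median(readings.values())
    good = [v for v in readings.values() if abs(v - median) <= tolerance]
    avg = sum(good) / len(good)
    # Every machine, outlier or not, is told how far it is from the average.
    return {name: avg - value for name, value in readings.items()}


# 08:01:17, 08:01:12 and 08:02:01 as seconds past 08:00 (names are hypothetical).
adjustments = berkeley_adjustments({"m1": 77, "m2": 72, "m3": 121}, tolerance=60)
# adjustments == {"m1": 13.0, "m2": 18.0, "m3": -31.0}
```

Sending an adjustment rather than an absolute time means the delay of the final message does not matter.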
Network Time Protocol (NTP)
• Previous schemes designed for LANs; in practice
today’s systems use NTP:
– Global service designed to enable clients to stay
within (hopefully) a few ms of UTC
• Hierarchy of clocks arranged into strata
– Stratum0 = atomic clocks (or maybe GPS, GOES)
– Stratum1 = servers directly attached to stratum0 clock
– Stratum2 = servers that synchronize with stratum1
– … and so on
• Timestamps made up of seconds and ‘fraction’
– e.g. 32 bit seconds-since-epoch; 32 bit fraction of a second (resolution of roughly 232 picoseconds)
NTP algorithm
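[Figure: client/server timeline. The client sends the request at T0; the server receives it at T1 and sends the reply at T2; the client receives the reply at T3.]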
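Using the four timestamps in the figure, the standard NTP offset and delay estimates can be sketched as below. T0 and T3 are read from the client clock, T1 and T2 from the server clock; the function name is chosen for this example, and the offset estimate assumes the two network directions take about the same time.

```python
def ntp_offset_delay(t0, t1, t2, t3):
    """Return (offset, delay) for one NTP request/reply exchange.

    offset: estimated amount by which the server clock is ahead of the client.
    delay:  round-trip network delay, excluding server processing time.
    """
    offset = ((t1 - t0) + (t2 - t3)) / 2
    delay = (t3 - t0) - (t2 - t1)
    return offset, delay


# Hypothetical exchange: server clock 10 s ahead, 0.5 s each way on the wire,
# 0.25 s of processing on the server.
offset, delay = ntp_offset_delay(t0=0.0, t1=10.5, t2=10.75, t3=1.25)
# offset == 10.0, delay == 1.0
```

Unlike the single server timestamp in Cristian's scheme, recording both T1 and T2 lets the client subtract the server's processing time from the measured round trip.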
NTP: additional details (1)
• NTP uses multiple requests per server
– Remember <offset, delay> in each case
– Calculate the filter dispersion of the offsets & discard
outliers
– Chooses remaining candidate with the smallest delay
• NTP can also use multiple servers
– Servers report synchronization dispersion = estimate
of their quality relative to the root (stratum 0)
– Combined procedure to select best samples from best
servers (see RFC 5905 for the gory details)
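A deliberately simplified sketch of the per-server sample selection described above: keep several <offset, delay> pairs, drop offsets that look like outliers, and use the remaining sample with the smallest delay. The median-based outlier test and the `max_deviation` parameter are assumptions for illustration; the actual clock filter in RFC 5905 is more involved.

```python
import statistics


def pick_sample(samples, max_deviation):
    """Choose one (offset, delay) pair from several exchanges with a server.

    samples:       list of (offset, delay) tuples in seconds.
    max_deviation: offsets further than this from the median offset are
                   discarded as outliers (assumed rule, not RFC 5905's).
    Raises ValueError if every sample is discarded.
    """
    median_offset = statistics.median(o for o, _ in samples)
    kept = [(o, d) for o, d in samples if abs(o - median_offset) <= max_deviation]
    # Smallest delay = least room for asymmetric network delay to hide in.
    return min(kept, key=lambda s: s[1])


samples = [(0.010, 0.050), (0.012, 0.020), (0.500, 0.005), (0.011, 0.030)]
best = pick_sample(samples, max_deviation=0.05)
# best == (0.012, 0.020): the 0.500 offset is rejected despite its tiny delay
```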
NTP: additional details (2)
• Various operating modes:
– Broadcast (“multicast”): server advertises current
time
– Client-server (“procedure call”): as described on the
previous slide
– Symmetric: between a set of NTP servers
• Security is supported
– Authenticate server, prevent replays
– Cryptographic cost compensated for
Physical clocks: summary
• Physical devices exhibit clock drift
– Even if initially correct, they tick too fast or too slow, and
hence time ends up being wrong
– Drift rates depend on the specific device, and can vary
with time, temperature, acceleration, …
• Instantaneous difference between clocks is clock skew
• Clock synchronization algorithms attempt to minimize
the skew between a set of clocks
– Decide upon a target correct time (atomic, or average)
– Communicate to agree, compensating for delays
– In reality, will still have 1-10ms skew after sync ;-(
Ordering
• One use of time is to provide ordering
– If I withdrew £100 cash at 23:59:44…
– And the bank computes interest at 00:00:00…
– Then interest calculation shouldn’t include the £100
• But in distributed systems we can’t perfectly
synchronize time => cannot use this for ordering
– Clock skew can be large, and may not be trusted
– And over large distances, relativistic effects mean that
ordering depends on the observer
– (similar effect due to finite ‘speed of Internet’ ;-)
The “happens-before” relation
• Often don’t need to know when event a occurred
– Just need to know if a occurred before or after b
• Define the happens-before relation, a → b
– If events a and b are within the same process, then
a→ b if a occurs with an earlier local timestamp
– Messages between processes are ordered causally,
i.e. the event send(m) → the event receive(m)
– Transitivity: i.e. if a→ b and b→ c, then a→ c
• Note that this only provides a partial order:
– Possible for neither a→ b nor b→ a to hold
– We say that a and b are concurrent and write a ~ b
Example
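[Figure: three process timelines against physical time. P1 has events a and b; P2 has events c and d; P3 has events e and f. Message m1 goes from P1 to P2 and message m2 from P2 to P3; two pairs of events are marked "?" as ordering questions.]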
Implementing Happens-Before
• One early scheme due to Lamport [1978]
– Each process Pi has a logical clock Li
• Li can simply be an integer, initialized to 0
– Li is incremented on every local event e
• We write Li(e) or L(e) as the timestamp of e
– When Pi sends a message, it increments Li and copies
the value into the packet
– When Pi receives a message from Pj, it extracts Lj and
sets Li := max(Li,Lj), and then increments Li
• Guarantees that if a → b, then L(a) < L(b)
– However if L(x) < L(y), this doesn’t imply x → y !
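The three rules above can be sketched as a small class. The class and method names are chosen for this illustration; the replay at the end uses the event names from the example in these slides (a, b on P1; c, d on P2; e, f on P3), assuming b sends m1, c receives it, d sends m2 and f receives it.

```python
class LamportClock:
    """Logical clock Li for a single process Pi."""

    def __init__(self):
        self.time = 0  # Li, initialized to 0

    def local_event(self):
        """Any local event: increment Li and use it as the timestamp."""
        self.time += 1
        return self.time

    def send(self):
        """Sending is an event too; the returned value is copied into the packet."""
        self.time += 1
        return self.time

    def receive(self, msg_time):
        """On receipt, set Li := max(Li, Lj) and then increment Li."""
        self.time = max(self.time, msg_time) + 1
        return self.time


p1, p2, p3 = LamportClock(), LamportClock(), LamportClock()
a = p1.local_event()   # 1
b = p1.send()          # 2, carried in m1
c = p2.receive(b)      # max(0, 2) + 1 = 3
d = p2.send()          # 4, carried in m2
e = p3.local_event()   # 1
f = p3.receive(d)      # max(1, 4) + 1 = 5
```

Here L(e) < L(b) even though e and b are concurrent, which illustrates why a smaller timestamp does not imply happens-before.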
Lamport Clocks: Example
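[Figure: the same three timelines annotated with Lamport timestamps. P1: a = 1, b = 2 (sends m1). P2: c = 3 (receives m1), d = 4 (sends m2). P3: e = 1, f = 5 (receives m2).]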
• When P2 receives m1, it extracts timestamp 2 and sets its
clock to max(0, 2) before increment
• Possible for events to have duplicate timestamps
– e.g. event e has the same timestamp as event a
• If desired can break ties by looking at pids, IP addresses, …
– this gives a total order, but doesn’t imply happens-before!
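Tie-breaking can be sketched by sorting on the pair (timestamp, process id). The tuples below use the timestamps from this example; numbering the processes 1 to 3 is an assumption made for illustration.

```python
# (event name, Lamport timestamp, process id)
events = [
    ("a", 1, 1), ("b", 2, 1),
    ("c", 3, 2), ("d", 4, 2),
    ("e", 1, 3), ("f", 5, 3),
]


def total_order(events):
    """Order events by timestamp, breaking ties by process id."""
    return sorted(events, key=lambda ev: (ev[1], ev[2]))


names = [name for name, _, _ in total_order(events)]
# names == ["a", "e", "b", "c", "d", "f"]
```

The order places a before e only because P1 has the smaller id; the two events are concurrent, so the total order is consistent with happens-before but says nothing extra about causality.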
Vector clocks
• With Lamport clocks, given L(a) and L(b), we
can’t tell if a→ b or b→ a or a ~ b
• One solution is vector clocks:
– An ordered list of logical clocks, one per-process
– Each process Pi maintains Vi[], initially all zeroes
– On a local event e, Pi increments Vi[i]
• If the event is message send, new Vi[] copied into packet
– If Pi receives a message from Pj then, for all k = 0, 1, …,
it sets Vi[k] := max(Vj[k], Vi[k]), and increments Vi[i]
• Intuitively Vi[k] captures the number of events at
process Pk that have been observed by Pi
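A minimal sketch of these update rules, with processes indexed from 0. The class and method names are chosen for this example; the replay uses the event names from the example in these slides, assuming b sends m1 to P2 and d sends m2 to P3.

```python
class VectorClock:
    """Vector clock Vi[] for process number `pid` out of `n` processes."""

    def __init__(self, pid, n):
        self.pid = pid
        self.v = [0] * n  # initially all zeroes

    def local_event(self):
        """On a local event, increment our own entry Vi[i]."""
        self.v[self.pid] += 1
        return tuple(self.v)

    def send(self):
        """A send is a local event; the new vector is copied into the packet."""
        return self.local_event()

    def receive(self, other):
        """Merge entry by entry with the sender's vector, then increment Vi[i]."""
        self.v = [max(mine, theirs) for mine, theirs in zip(self.v, other)]
        self.v[self.pid] += 1
        return tuple(self.v)


p1, p2, p3 = VectorClock(0, 3), VectorClock(1, 3), VectorClock(2, 3)
a = p1.local_event()   # (1, 0, 0)
b = p1.send()          # (2, 0, 0), carried in m1
c = p2.receive(b)      # (2, 1, 0)
d = p2.send()          # (2, 2, 0), carried in m2
e = p3.local_event()   # (0, 0, 1)
f = p3.receive(d)      # (2, 2, 2)
```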
Vector clocks: example
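[Figure: the same three timelines annotated with vector timestamps. P1: a = (1,0,0), b = (2,0,0) (sends m1). P2: c = (2,1,0) (receives m1), d = (2,2,0) (sends m2). P3: e = (0,0,1), f = (2,2,2) (receives m2).]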
• When P2 receives m1, it merges the entries from P1’s clock
– choose the maximum value in each position
• Similarly when P3 receives m2, it merges in P2’s clock
– this incorporates the changes from P1 that P2 already saw
• Vector clocks explicitly track the transitive causal order: f’s
timestamp captures the history of a, b, c & d
Using vector clocks for ordering
• Can compare vector clocks piecewise:
– Vi = Vj iff Vi[k] = Vj[k] for k = 0, 1, 2, …
– Vi ≤ Vj iff Vi[k] ≤ Vj[k] for k = 0, 1, 2, …
– Vi < Vj iff Vi ≤ Vj and Vi ≠ Vj
– Vi ~ Vj otherwise, e.g. [2,0,0] versus [0,0,1]
• For any two event timestamps T(a) and T(b)
– if a → b then T(a) < T(b) ; and
– if T(a) < T(b) then a → b
• Hence can use timestamps to determine if there
is a causal ordering between any two events
– i.e. determine whether a → b, b → a or a ~ b
Does this seem familiar? Recall Time-Stamp Ordering and Optimistic
Concurrency Control for transactions last term.
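The comparison rules on this slide can be sketched as a single function that classifies two vector timestamps. The function name and the string labels it returns are choices made for this illustration.

```python
def compare(va, vb):
    """Classify two equal-length vector timestamps.

    Returns "equal", "before" (va < vb, so a → b), "after" (vb < va, so b → a)
    or "concurrent" (neither is ≤ the other, so a ~ b).
    """
    if tuple(va) == tuple(vb):
        return "equal"
    if all(x <= y for x, y in zip(va, vb)):
        return "before"
    if all(y <= x for x, y in zip(va, vb)):
        return "after"
    return "concurrent"


compare((2, 0, 0), (0, 0, 1))   # "concurrent"
compare((1, 0, 0), (2, 2, 2))   # "before"
compare((2, 2, 2), (0, 0, 1))   # "after"
```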
Summary + next time (ironically)
• The clock synchronisation problem
• Cristian’s Algorithm, Berkeley Algorithm, NTP
• Logical time via the happens-before relation
• Vector clocks