DS 3
This chapter focuses on the importance of time in distributed systems, which are
networks of computers that work together. It covers how we can monitor these systems
and manage the timing of events that happen across different computers.
1. Accurate Measurement:
▪ Data Consistency: Ensuring that all copies of data are the same
across different computers.
• Time can be tricky to measure because different observers may perceive time
differently. This concept is rooted in Einstein’s Special Theory of Relativity,
which tells us that:
o The speed of light is constant for all observers, regardless of their motion.
• Causality:
o While the order of events can differ for observers, if one event causes
another, all observers will agree on that order. However, the time it takes
between cause and effect can differ.
Time in Distributed Systems
Chapter Structure
1. Clock Synchronization:
o The chapter covers algorithms for synchronizing physical clocks, both externally with a time source and internally with one another.
2. Logical Clocks:
o The chapter introduces logical clocks, such as vector clocks, which help
establish the order of events without needing to measure physical time.
This section builds on the basics of distributed systems introduced earlier. Here, we
focus on how to understand and track the evolution of these systems, especially how to
order and timestamp events that occur within them.
2. Process State:
Events in a Process
• The sequence of events within a single process can be ordered. We denote this
order with a relation <_i:
• History of a Process:
o The history of a process p_i is the series of events that occur in that
process, ordered by the <_i relation:
history(p_i) = ⟨e^0, e^1, e^2, …⟩
• Physical Clocks:
o The operating system reads the hardware clock and adjusts it to create a
software clock C_i(t) that approximates real time. This clock might not
be perfectly accurate due to various factors.
• Timestamping Events:
o We can use the value of the software clock C_i(t) to timestamp events
in process p_i. However, for timestamps to differ between consecutive
events, the clock must update frequently enough (the clock resolution
must be smaller than the time between events).
• Clock Skew:
o The instantaneous difference between the readings of two clocks. Even clocks that start in agreement exhibit some skew over time.
• Clock Drift:
o Over time, clocks can run at slightly different speeds due to physical
variations. This means that even if two clocks start at the same time, they
will diverge as time passes.
o For example, a typical quartz clock might drift about one second every
1,000,000 seconds (or roughly every 11.6 days).
• Defining UTC:
o UTC signals are broadcast from various sources, including radio stations
and satellites like GPS, allowing computers to synchronize their clocks
with high accuracy (GPS can be accurate to about 1 microsecond).
Synchronizing Physical Clocks in Distributed Systems
In distributed systems, it's crucial to know the exact time when events occur for various
reasons, such as record-keeping and coordination. To achieve this, we need to
synchronize the clocks of the processes involved. There are two main types of
synchronization: external synchronization and internal synchronization.
1. Types of Synchronization
External Synchronization:
• The goal is to ensure that the clocks C_i of all processes are accurate to within a
specified bound D of the external time source S for all times t in a given
interval I.
• Mathematically, we say:
|S(t) − C_i(t)| < D for all t in I
• This means the clocks are accurate to within the bound D.
Internal Synchronization:
• This refers to the synchronization of clocks with each other, ensuring that any
two clocks agree to within a specified bound D: |C_i(t) − C_j(t)| < D.
• Even if the clocks are not synchronized with an external source, as long as they
are synchronized with each other, we can measure the time intervals between
events.
2. Clock Correctness
• Correctness: A hardware clock H is considered correct if its drift rate (the rate
at which it diverges from the true time) is within a known limit, such
as 10⁻⁶ seconds per second. This means the error in measuring time
is bounded:
|t − H(t)| < bounded error
• Monotonicity: A correct software clock must also never run backwards:
t1 < t2 ⟹ C(t1) < C(t2)
3. Faulty Clocks
A clock that does not adhere to the correctness conditions is considered faulty. There
are two types of clock failures:
• Crash Failure: The clock simply stops ticking altogether.
• Arbitrary Failure: The clock behaves unpredictably, such as when the "Y2K bug"
caused clocks to incorrectly register dates after December 31, 1999, as January
1, 1900.
Importantly, a clock does not need to be accurate (i.e., showing the correct current
time) to be considered correct. The focus is on the clock's ability to function reliably,
maintaining proper event ordering and synchronization with other clocks.
The chapter goes on to discuss various algorithms designed for both external and
internal synchronization of clocks in distributed systems. These algorithms help ensure
that processes can coordinate their actions based on time, which is essential for
maintaining consistency and correctness across the system.
Synchronization in a Synchronous System
A synchronous distributed system is one where certain timing constraints are known:
• Drift Rate: The maximum rate at which the clocks can drift from the true time.
• Execution Time: The maximum time required for a process to execute a single
step.
These known bounds help ensure that processes can synchronize their clocks
effectively.
Let’s consider two processes, P1 and P2, that need to synchronize their clocks.
o Process P1 sends its local clock time t to P2 in a message m.
o Upon receiving the message, P2 can adjust its clock. In principle, P2
could set its clock to t + T_trans, where T_trans is the time
taken for the message to travel from P1 to P2.
However, the challenge is that T_trans can vary and is unknown due to several
factors, including network congestion and the processing load on the nodes. In a
synchronous system, though, the transmission time is bounded by T_min and T_max, and
the uncertainty is:
u = T_max − T_min
• This means that the actual transmission time T_trans can vary anywhere
between T_min and T_max.
When P2 receives the message, it has to decide how to set its clock based on the
uncertainty:
o If P2 sets its clock to t + T_max, the clock skew (the difference
between the two clocks) can be as much as u if the message actually
took the minimum time to arrive.
o If it sets its clock to t + T_min, the skew could also be as large as u (if
the message took the maximum time).
• Optimal Setting:
o To minimize the skew, P2 can set its clock to the halfway point:
t + (T_min + T_max)/2
o The worst-case skew is then at most:
u/2
• For N clocks, the optimum bound on skew that any synchronization algorithm can
achieve is:
u · (1 − 1/N)
This bound, due to Lundelius and Lynch, approaches u as the number of clocks
increases, so the achievable skew always remains proportional to the uncertainty u.
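The midpoint rule and its skew bound can be sketched as a small calculation (a minimal sketch; the function names are illustrative, not from the chapter):

```python
def set_receiver_clock(t, t_min, t_max):
    """Midpoint rule: the receiver sets its clock to the sender's time t
    plus the average of the best- and worst-case transmission times."""
    return t + (t_min + t_max) / 2

def worst_case_skew(t_min, t_max):
    # The actual transmission time lies somewhere in [t_min, t_max], so the
    # midpoint estimate is off by at most u/2, where u = t_max - t_min.
    return (t_max - t_min) / 2

# Sender's clock reads 100.0; transmission takes between 2 and 8 time units.
print(set_receiver_clock(100.0, 2.0, 8.0))  # → 105.0
print(worst_case_skew(2.0, 8.0))            # → 3.0
```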
6. Asynchronous Systems
• The factors leading to message delays are unpredictable, and there is no upper
bound on transmission delays. The best we can say is:
T_trans = T_min + x
where x can be any non-negative value that is not known in advance.
• Time Server: Cristian’s method uses a time server (let's call it S) that is
connected to a device receiving UTC time signals. The time server is responsible
for providing the current time to requesting processes.
• Request-Response Mechanism:
o A process p sends a request message m_r to the server S, asking for the time.
o The server S responds with a message m_t that contains the current
time t. This time t is recorded just before the message is sent.
• Round-Trip Time: The time it takes for the request m_r to reach the server and
for the response m_t to come back to process p is called the round-trip
time T_round.
• Clock Drift: The accuracy of the round-trip time measurement depends on the
clock drift of process p. For example, if the drift rate is 10⁻⁶ seconds per
second, and the round-trip time is about 1-10 milliseconds (0.001 to 0.01
seconds), the clock drift during that period would be negligible
(at most about 0.01 microseconds).
• Setting the Clock: To set its clock accurately, process p can estimate the time
to which it should set its clock using:
t′ = t + T_round/2
This assumes that the time taken for the request to reach the server and for the
response to return is roughly equal.
• Accuracy: If T_min is the minimum one-way transmission time, the true time when
the reply arrives lies in the range [t + T_min, t + T_round − T_min]. The width of this
range, T_round − 2T_min, indicates the potential error in the time synchronization, so
the accuracy is ±(T_round/2 − T_min).
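One round of Cristian's method can be sketched as follows (a minimal sketch; the server callback is a hypothetical stand-in for a real time service):

```python
import time

def cristian_sync(request_server_time, t_min=0.0):
    """One round of Cristian's method: ask a time server for its time t,
    measure the round trip, and estimate the server's clock at the moment
    the reply arrives."""
    start = time.monotonic()
    t = request_server_time()          # server records t just before replying
    t_round = time.monotonic() - start
    estimate = t + t_round / 2         # assume symmetric transmission times
    accuracy = t_round / 2 - t_min     # half-width of the possible range
    return estimate, accuracy

# Hypothetical stand-in for a real UTC server (an assumption for this sketch):
estimate, accuracy = cristian_sync(lambda: 1_000_000.0)
print(estimate >= 1_000_000.0, accuracy >= 0.0)  # → True True
```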
4. Improving Accuracy
6. Related Work
• Cristian and his colleagues later proposed probabilistic protocols for internal
clock synchronization that tolerate certain failures.
• Other researchers, like Srikanth and Toueg, developed algorithms that optimize
accuracy while tolerating some faulty clocks. Dolev et al. highlighted the need for
a minimum number of correct clocks to achieve agreement in the presence of
faulty ones.
The Berkeley algorithm, developed by Gusella and Zatti in 1989, is a method for
synchronizing the clocks of multiple computers in a distributed system, specifically
designed for Berkeley UNIX. Here’s a detailed yet straightforward explanation of how the
algorithm works, its components, and its advantages.
• Master: One computer in the network is designated the master; it coordinates the
synchronization.
• Slaves: The other computers in the network are referred to as slaves. They send
their current clock readings to the master when requested.
• Polling Process: The master periodically sends requests to the slave computers
to ask for their current clock values.
• Averaging Clock Values: Once the master has collected the clock values from
all the slaves, it calculates an average clock time. This average includes the
master’s own clock reading. The idea is that averaging helps to cancel out any
individual clock errors (whether they run fast or slow).
• Handling Outliers: The master also sets a nominal maximum round-trip time. If
any clock readings are associated with round-trip times longer than this
maximum, those readings are ignored. This helps eliminate outliers caused by
faulty clocks or network delays.
• Sending Adjustments: Instead of sending the updated time back to the slaves
(which could introduce uncertainty due to transmission delays), the master
calculates how much each slave’s clock needs to be adjusted. This adjustment
can be either positive (to speed up the clock) or negative (to slow it down).
5. Fault Tolerance
• Subset Selection: The master selects a subset of clock readings that do not
differ from one another by more than a specified threshold. The average is then
calculated only from these selected clocks, which helps ensure that faulty
clocks do not skew the results.
• Master Election: If the master fails, another computer can be elected to take
over its role. This election process is crucial for maintaining synchronization in
the network.
• Election Algorithms: While there are algorithms for electing a new master, they
do not guarantee that a new master will be elected within a bounded time frame.
This means that if there is a delay in electing a new master, the clocks may drift
apart during that time.
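The master's averaging with outlier rejection can be sketched as follows (a simplification: the median is used as a stand-in for the algorithm's subset selection, and all names are illustrative):

```python
def berkeley_adjustments(master_time, slave_times, max_diff):
    """Average a subset of clock readings that agree within max_diff, then
    return the correction each clock (master first) should apply."""
    readings = sorted([master_time] + slave_times)
    median = readings[len(readings) // 2]
    # Fault tolerance: ignore readings far from the median (faulty clocks).
    subset = [r for r in readings if abs(r - median) <= max_diff]
    avg = sum(subset) / len(subset)
    # Send each clock its adjustment, not the absolute time.
    return [avg - r for r in [master_time] + slave_times]

# Master at 180; slaves at 185 and 175; one faulty clock reads 60.
print(berkeley_adjustments(180.0, [185.0, 175.0, 60.0], max_diff=20.0))
# → [0.0, -5.0, 5.0, 120.0]  (the faulty 60.0 reading is excluded from the average)
```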
1. Purpose of NTP
• Synchronize Clients to UTC: NTP helps clients across the Internet synchronize
their clocks accurately to Coordinated Universal Time (UTC), despite the large
delays that can occur in Internet communication.
• Provide Reliability: It offers a reliable service that can continue to function even
if some servers or connections fail.
• Ensure Security: NTP uses authentication techniques to verify that the time data
comes from trusted sources and to protect against malicious interference.
2. NTP Architecture
NTP operates through a hierarchical structure of time servers, organized into levels
called strata:
• Stratum 1: These are the primary servers, directly connected to a device
receiving UTC (such as a radio or GPS receiver).
• Stratum 2: These servers synchronize their clocks with Stratum 1 servers. They
are one step removed from the primary time sources.
• Stratum 3 and Lower: These servers synchronize with Stratum 2 servers, and so
on. Each lower stratum level may introduce some error, making them slightly
less accurate.
3. Synchronization Process
• Multicast Mode: Suitable for high-speed LANs, where one or more servers
periodically broadcast their time to other computers. This mode is less accurate
but useful for many applications.
• Procedure-Call Mode: Similar to Cristian’s method, where a server responds to
requests from clients with its current timestamp. This mode is used when higher
accuracy is required.
• Symmetric Mode: Used between servers that need to share time information
accurately. Servers exchange timing data, and this mode is designed for high
accuracy, especially between servers at lower strata.
In NTP, messages are sent using the User Datagram Protocol (UDP), which is an
unreliable transport method. Each NTP message carries timestamps of recent message
events between the pair of servers:
• Ti-3: Time when the first message was sent.
• Ti-2: Time when the first message was received by the other server.
• Ti-1: Time when the reply was sent.
• Ti: Time when the reply was received.
From these four timestamps, the servers estimate:
• Offset (o): The estimated difference between the two clocks.
• Delay (d): The total time taken for the messages to be sent and received.
• Data Filtering: NTP servers filter the timing data they receive, keeping track of
the most recent pairs of offset and delay values. This helps to identify and
discard unreliable data.
• Peer Selection: NTP servers prioritize synchronization with peers that have
lower stratum numbers (closer to the primary time source) and lower dispersion
(less variability in timekeeping).
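The standard NTP offset and delay estimates can be sketched as follows (a minimal sketch, writing t0..t3 for the timestamps Ti-3..Ti of one request/reply exchange):

```python
def ntp_offset_delay(t0, t1, t2, t3):
    """Standard NTP estimates from one request/reply pair:
    t0 = client send (Ti-3), t1 = server receive (Ti-2),
    t2 = server send (Ti-1), t3 = client receive (Ti)."""
    delay = (t3 - t0) - (t2 - t1)         # time the messages spent in transit
    offset = ((t1 - t0) + (t2 - t3)) / 2  # estimated server-minus-client offset
    return offset, delay

# Client runs 5 units behind the server; each one-way trip takes 2 units:
print(ntp_offset_delay(t0=10, t1=17, t2=18, t3=15))  # → (5.0, 4)
```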
NTP uses a phase lock loop model to adjust the local clock based on its observed
drift rate. For example, if a clock consistently gains time, NTP can slightly reduce
the frequency of the software clock to compensate, thereby improving overall accuracy.
7. Performance
• Redundant Servers: If one server fails, others can take over without disrupting
the time service.
1. Local Events: If two events occur in the same process, they are ordered by the
time they are observed in that process. For example, if process p1 performs
action a and then action b, we say a → b.
Concurrent Events
Not all events are related by the happened-before relation. If there are two events,
say a and e, that occur in different processes and there is no message sent between
them, we say that a and e are concurrent. This is denoted as a ∥ e. This means that
we cannot determine which event happened first based on the happened-before
relation.
Potential Causality
It’s important to note that the happened-before relation captures potential causality,
not actual causality. Just because one event happened before another in terms of the
relation does not mean that the first event caused the second. For example, if a server
receives a request and then sends a reply, the reply is ordered after the request.
However, if the server sends replies every five minutes regardless of requests, there is
no causal link between the two events, even though they are ordered by the happened-
before relation.
1. Logical Clocks
Logical clocks were introduced by Leslie Lamport in 1978 to provide a way to order
events in a distributed system where processes cannot perfectly synchronize their
physical clocks.
1. Increment Before Event: Before any event occurs (like sending a message), the
process increments its logical clock:
L_i := L_i + 1
2. Sending Messages: When process p_i sends a message m, it includes its
current logical clock value L_i with the message:
send(m, L_i)
3. Receiving Messages: When process p_j receives (m, t), it first sets its clock to
the later of its own value and the received timestamp, then increments it to count
the receive event:
L_j := max(L_j, t) + 1
Example:
• Suppose three processes p1, p2, and p3 start with L1 = 0, L2 = 0,
and L3 = 0.
This method allows us to maintain a logical order of events based on the happened-
before relation.
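Lamport's rules can be sketched as a small class (a minimal sketch; the class and method names are illustrative):

```python
class LamportClock:
    """Minimal sketch of Lamport's logical clock rules."""
    def __init__(self):
        self.time = 0

    def tick(self):
        # Rule 1: increment before each local event.
        self.time += 1
        return self.time

    def send(self):
        # Rule 2: sending is an event; the returned value is
        # piggybacked on the message.
        return self.tick()

    def receive(self, t):
        # Rule 3: merge with the sender's timestamp, then count
        # the receive event itself.
        self.time = max(self.time, t) + 1
        return self.time

p1, p2 = LamportClock(), LamportClock()
t = p1.send()         # p1's clock becomes 1
print(p2.receive(t))  # → 2  (p2 jumps to max(0, 1) + 1)
```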
While Lamport's logical clocks help to establish a partial order of events, sometimes we
need a total order, meaning every pair of events can be compared.
o Global timestamps: (T_i, i) for event e at process p_i and (T_j, j) for event e′
at process p_j.
o We define (T_i, i) < (T_j, j) if and only if either:
▪ T_i < T_j, or
▪ T_i = T_j and i < j.
Example:
• If e occurs at p1 with timestamp 2 and e′ occurs at p2 with timestamp 3,
then (2, 1) < (3, 2).
• If both events have the same timestamp (e.g., both have timestamp 2), we can
order them based on their process identifiers.
This total ordering is useful in scenarios like managing access to shared resources
(critical sections).
3. Vector Clocks
Vector clocks, developed by Mattern and Fidge, improve upon Lamport's clocks by
allowing us to determine concurrency (whether two events are independent) and
potential causality more effectively.
• Array of Counters: Each process p_i maintains a vector clock V_i, which is an
array of integers of size N (the total number of processes). Each entry V_i[j]
counts the number of events at process p_j that could have affected p_i.
1. Initialization: Initially, V_i[j] = 0 for all i and j.
2. Local Events: Before timestamping an event, p_i increments its own entry:
V_i[i] := V_i[i] + 1
3. Sending Messages: When p_i sends a message, it includes its vector clock:
send(m, V_i)
4. Receiving Messages: When p_j receives a message with vector clock V, it
merges it into its own vector clock:
V_j[k] := max(V_j[k], V[k]) for all k
and increments its own entry to count the receive event:
V_j[j] := V_j[j] + 1
To determine the relationship between two events based on their vector clocks V_e
and V_f:
• Happened-Before: If V_e ≤ V_f in every entry and V_e ≠ V_f (written V_e < V_f),
then e happened-before f.
• Concurrent Events: If neither V_e < V_f nor V_f < V_e, then the events are
concurrent.
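These comparison rules can be sketched as follows (vector clocks represented as plain lists; the function names are illustrative):

```python
def vc_leq(u, v):
    """True if u ≤ v in every entry."""
    return all(a <= b for a, b in zip(u, v))

def relation(ve, vf):
    """Classify two events by their vector clocks."""
    if ve == vf:
        return "same event"
    if vc_leq(ve, vf):
        return "e happened-before f"
    if vc_leq(vf, ve):
        return "f happened-before e"
    return "concurrent"   # neither dominates the other

print(relation([2, 0, 0], [2, 1, 0]))  # → e happened-before f
print(relation([2, 0, 0], [0, 0, 1]))  # → concurrent
```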
Key Points:
• References: An object is only useful if at least one process has a reference to it.
If no process has a reference, the object can be safely deleted.
• In-Transit Messages: We must also consider messages that are currently being
sent between processes. If a message contains a reference to an object, that
object cannot be considered garbage until the message is delivered.
Example:
• Suppose Process p1 has two objects: one referenced by itself and another
referenced by Process p2. If Process p2 has an object that no one
references, it can be considered garbage. However, if there’s a message in transit
from p2 to p1 that references another object, that object cannot be garbage
until the message is delivered.
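The rule can be sketched as a set calculation (a hypothetical model for illustration; all names are made up for the sketch):

```python
def garbage(objects, process_refs, in_transit_refs):
    """Objects are garbage only if no process holds a reference to them
    AND no in-transit message carries a reference to them."""
    live = set(in_transit_refs)          # references travelling in messages
    for refs in process_refs.values():   # references held by each process
        live |= set(refs)
    return objects - live

objs = {"a", "b", "c"}
refs = {"p1": ["a"], "p2": []}
print(garbage(objs, refs, in_transit_refs=["b"]))  # → {'c'}
```

Object "b" is unreferenced by any process, but a message in transit still mentions it, so only "c" is garbage.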
A deadlock occurs when a set of processes are blocked because each process is
waiting for a message from another process in the set, forming a cycle in the wait-for
graph.
Key Points:
• Wait-For Graph: This graph represents which process is waiting for which other
process. If there is a cycle in this graph, a deadlock exists.
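Cycle detection in a wait-for graph can be sketched as follows (a simplification that assumes each blocked process waits on exactly one other process):

```python
def has_deadlock(wait_for):
    """Detect a cycle in a wait-for graph given as {process: waiting_on}."""
    for start in wait_for:
        seen = set()
        node = start
        # Follow the chain of waits until it ends or revisits a node.
        while node in wait_for and node not in seen:
            seen.add(node)
            node = wait_for[node]
        if node in seen:   # revisited a node on this chain: cycle found
            return True
    return False

print(has_deadlock({"p1": "p2", "p2": "p1"}))  # → True  (p1 ⇄ p2)
print(has_deadlock({"p1": "p2", "p2": "p3"}))  # → False (p3 waits for no one)
```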
Key Points:
• Active vs. Passive Processes: An active process is currently doing work, while
a passive process is not actively engaged but is ready to respond to requests.
• Example Scenario: Imagine two processes, p1 and p2. If both are found to
be passive, you might think the algorithm has terminated. However, if there’s a
message in transit from p2 to p1, p1 could become active again when it
receives the message, indicating that the algorithm hasn’t truly terminated.
4. Distributed Debugging
Key Points:
In distributed systems, multiple processes run concurrently, and each process has its
own local state. A global state represents the collective state of all processes and their
communication channels at a particular moment. Understanding global states is
crucial for analyzing the behavior of distributed systems, especially when detecting
issues like deadlocks or ensuring consistency.
2. Meaningful Global States: Even if we could collect states from each process,
not all combinations of these states would represent a valid global state
because of the asynchronous nature of communication.
Definitions
• History of a Process: The history of a process p_i is the sequence of events that
occur in that process. We denote this as:
h_i = ⟨e_i^0, e_i^1, e_i^2, …⟩
• State of a Process: The state s_i^k of process p_i is its state immediately before
the k-th event occurs.
• Global History: The global history H of the system is the union of the histories
of all processes:
H = h_1 ∪ h_2 ∪ … ∪ h_N
• Cut: A cut is a subset of the global history that consists of prefixes from each
process's history. A cut can be represented as:
C = ⟨h_1^{c_1}, h_2^{c_2}, …, h_N^{c_N}⟩
where c_i indicates the last event from process p_i included in the cut.
Consistent Cuts
A cut is consistent if it respects the causal relationships between events. This means
that if an event ee in the cut happens after an event ff (i.e., ff happened-before ee),
then ff must also be included in the cut.
Causal Relationship
The happened-before relation (denoted as →) is a way to express the order of events:
• If a process sends a message, and another process receives it, the sending event
happened-before the receiving event.
• If two events occur in the same process, they are ordered by their occurrence.
Example of Cuts
Consider two processes p1 and p2, where p1's event e_1^0 is the sending of a
message m1 and p2's event e_2^0 is its receipt.
Inconsistent Cut: A cut that includes e_2^0 (the receipt of m1) but does not
include e_1^0 (the sending of m1) is inconsistent because it shows an effect
without its cause.
Consistent Cut: A cut that includes both e_1^0 and e_2^0 (the sending and
receiving of m1) is consistent. It reflects the actual execution where the message was
sent before it was received.
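The consistency condition can be sketched as a direct check (a minimal sketch; the happened-before relation is given explicitly as pairs):

```python
def is_consistent(cut, happened_before):
    """A cut (set of events) is consistent if, for every pair f → e with
    e in the cut, f is in the cut too (no effect without its cause)."""
    return all(f in cut for f, e in happened_before if e in cut)

# f = the send of m1 at p1, e = the receipt of m1 at p2:
hb = [("e1_0", "e2_0")]
print(is_consistent({"e2_0"}, hb))          # → False (receipt without its send)
print(is_consistent({"e1_0", "e2_0"}, hb))  # → True
```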
A global state corresponds to a consistent cut. The system transitions between global
states as events occur. Each transition involves a single event happening in one
process, which can be:
• Sending a message
• Receiving a message
• A run is a total ordering of all the events in the global history that is consistent
with each process's local order; it need not respect the order of
concurrent events across processes. A linearization (or consistent run)
additionally respects the happened-before relation, and all linearizations pass
through only consistent global states.
Reachability of States
A state S′ is said to be reachable from a state S if there is a linearization that passes
through S and then S′. This means you can transition from one global state to
another through a valid sequence of events.
• Global State Predicate: This is a function that evaluates the possible global
states of a distributed system and returns either True or False. For instance,
predicates can check conditions like:
1. Stability: A predicate is stable if, once a global state satisfies the predicate
(returns True), all future reachable states from that state also satisfy it. This is an
important property for certain conditions we want to monitor.
These concepts help categorize the properties of global states into two high-level
concepts: safety and liveness.
Safety
Liveness
Overview
• Stability: Once a stable predicate is True, it remains True in all reachable future
states.
• Safety: Ensures that undesirable conditions (like deadlock) do not occur from a
given initial state.
• Liveness: Guarantees that from any initial state, a desirable condition (like
termination) will eventually be satisfied.
Key Assumptions
Before we dive into the algorithm, here are some key assumptions it makes:
1. Reliable Communication: Neither channels nor processes fail, and every
message sent is eventually delivered intact, exactly once.
2. Unidirectional FIFO Channels: Channels are unidirectional and deliver
messages in the order they were sent.
3. Strong Connectivity: Every process can communicate with every other process,
meaning there’s a path from any process to any other process.
4. Independent Initiation: Any process can start the snapshot at any time.
The algorithm records the state of each process and the state of the communication
channels. Here’s how it operates:
Step-by-Step Process
1. Recording State:
o When a process first receives a marker (or spontaneously initiates the
snapshot), it records its own state.
o For each incoming channel, it then records the set of messages that arrive
after it recorded its state, until a marker arrives on that channel.
2. Marker Messages:
o After recording its state, the process sends a marker message over each of
its outgoing channels, before sending any other messages.
o When a marker later arrives on an incoming channel, the process stops
recording that channel; the channel's state is the set of messages recorded
for it so far.
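The marker rules for a single process can be sketched as follows (a minimal sketch under the FIFO assumption; it omits sending markers onward, and all names are illustrative):

```python
class SnapshotProcess:
    """One process's view of the snapshot: its recorded state plus the
    messages logged per incoming channel."""
    def __init__(self, name, state, incoming):
        self.name, self.state = name, state
        self.recorded_state = None
        self.channel_log = {c: [] for c in incoming}   # messages per channel
        self.done = {c: False for c in incoming}       # marker seen on channel?

    def on_marker(self, channel):
        if self.recorded_state is None:
            # First marker: record own state; this channel's state is empty.
            self.recorded_state = self.state
            self.channel_log[channel] = []
        self.done[channel] = True          # stop recording this channel

    def on_message(self, channel, msg):
        # Messages arriving between recording our state and this channel's
        # marker belong to the channel's recorded state.
        if self.recorded_state is not None and not self.done[channel]:
            self.channel_log[channel].append(msg)

p = SnapshotProcess("p1", state={"money": 1000}, incoming=["c1", "c2"])
p.on_marker("c1")                 # record own state; c1's state is empty
p.on_message("c2", "5 widgets")   # arrives before c2's marker: logged
p.on_marker("c2")
print(p.recorded_state, p.channel_log["c2"])  # → {'money': 1000} ['5 widgets']
```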
Example Scenario
Let’s illustrate the algorithm with a simple example involving two processes, p1
and p2, connected by two unidirectional channels, c1 and c2.
Initial States
• Process p1:
o Money: $1000
o Widgets: 0
• Process p2:
o Money: $50
o Widgets: 2000
Execution Steps
2. State Changes:
o After sending the marker, p1 sends the order message, and now the
system enters a new state S1.
o Since p2 has not yet recorded its state, it records its current state
as S1: (Money: $50, Widgets: 2000).
o Before p2 sends the marker, it also sends a message back to p1 (e.g.,
“5 widgets”).
o When p1 receives the message “5 widgets” and also receives p2’s
marker, it records the state of channel c1 as the set of messages it
received (which is the message “5 widgets”).
Final Notes
• Termination: The algorithm ensures that all processes will eventually record
their states due to the reliable communication and connectivity assumptions.
Termination refers to the point where all processes in a distributed system have
successfully recorded their states and the states of their channels.
o Each process that receives a marker message will record its state within a
finite amount of time.
o After recording its state, the process will also send marker messages over
each of its outgoing channels within a finite time.
2. Communication Path:
o If there is a communication path from process p_i to process p_j
(meaning p_i can send messages to p_j), then p_j will record its state a
finite time after p_i records its state.
3. Strong Connectivity:
o This is guaranteed because once one process starts the snapshot, the
markers will propagate through the system, leading all processes to
record their states.
The snapshot algorithm effectively selects a cut from the history of the execution of the
distributed system. A cut is a way to divide the events in the execution into two parts:
those that happened before the snapshot and those that happened after.
o If an event e_i occurs at process p_i and another event e_j occurs at
process p_j, and if e_i happens before e_j (denoted as e_i → e_j), then
if e_j is in the cut, e_i must also be in the cut.
2. Proof of Consistency:
o Suppose the opposite: e_j is in the cut but e_i is not, i.e., p_i recorded its
state before e_i occurred. This leads to a contradiction:
▪ Because channels are FIFO, the marker p_i sent just after recording
its state travels ahead of any later messages along every
communication path from p_i to p_j.
▪ So p_j must have received a marker, and therefore recorded its
state, before e_j occurred.
▪ This means e_j would not be in the cut, contradicting the
assumption that it is; hence our initial assumption must be
incorrect, and e_i is in the cut.
Reachability of States
Reachability refers to the relationship between the recorded global state (snapshot)
and the initial and final global states of the system.
1. States Defined:
o S_init: The global state just before the first process records its state.
o S_snap: The global state recorded by the snapshot algorithm.
o S_final: The global state when the snapshot algorithm terminates, after
the last channel state has been recorded.
2. Execution Linearization:
3. Ordering Events:
▪ By using the rules of the marker messages, we can ensure that the
order of events respects the happened-before relationship.
4. Establishing Reachability:
o It can be shown that S_snap is reachable from S_init and that S_final is
reachable from S_snap; the recorded state is therefore one the system
could actually have passed through.
1. Stable Predicates: