DC Unit II Notes
1. LOGICAL TIME
Logical clocks are based on capturing chronological and causal relationships of processes and
ordering events based on these relationships.
Precise physical clocking is not possible in distributed systems. Asynchronous
distributed systems therefore rely on logical clocks to coordinate events. Three types of
logical clock are used in distributed systems:
• Scalar clock
• Vector clock
• Matrix clock
Data structures:
Each process pi maintains data structures with the following capabilities:
• A local logical clock (lci), which helps process pi measure its own progress.
• A logical global clock (gci), which represents process pi’s local view
of the logical global time. It allows the process to assign consistent timestamps to
its local events.
Protocol:
The protocol ensures that a process’s logical clock, and thus its view of the global time,
is managed consistently with the following rules:
Rule 1: Governs how a process updates its local logical clock when it executes an event (send, receive or
internal).
Rule 2: Governs how a process updates its global logical clock to update its view of the global
time and global progress. It dictates what information about the logical time is piggybacked in a
message and how this information is used by the receiving process to update its view of the
global time.
3. Scalar Time
Scalar time was proposed by Lamport to order the events in a distributed system. A
Lamport logical clock is an incrementing counter maintained by each process. The counter value has
meaning only in relation to messages moving between processes. When a process receives a message,
it resynchronizes its logical clock with that of the sender, which preserves the causal relationship.
Lamport's algorithm can be captured in a few rules:
• All the process counters start with value 0.
• A process increments its counter for each event (internal event, message sending,
message receiving) in that process.
• When a process sends a message, it includes its (incremented) counter value with the
message.
• On receiving a message, the counter of the recipient is updated to the greater of its
current counter and the timestamp in the received message, and then incremented by
one.
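The rules above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation; the class name LamportClock and its method names are hypothetical.

```python
class LamportClock:
    """Scalar (Lamport) logical clock for one process."""

    def __init__(self):
        self.counter = 0  # all counters start at 0

    def internal_event(self):
        self.counter += 1
        return self.counter

    def send(self):
        # Increment first, then piggyback the new value on the message.
        self.counter += 1
        return self.counter

    def receive(self, msg_ts):
        # Take the greater of the local counter and the message timestamp,
        # then increment by one.
        self.counter = max(self.counter, msg_ts) + 1
        return self.counter


p1, p2 = LamportClock(), LamportClock()
p1.internal_event()        # p1 counter becomes 1
ts = p1.send()             # p1 counter becomes 2, message carries 2
recv_ts = p2.receive(ts)   # p2 counter becomes max(0, 2) + 1 = 3
```

The receive timestamp is always larger than the send timestamp, which is the property that keeps the clock consistent with causality.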
ii. Total ordering: Scalar clocks can be used to totally order the events in a distributed system. However,
two or more events at different processes may carry an identical timestamp. Hence a tie-breaking
mechanism is needed to order such events. The tie breaking is done as follows:
i. Process identifiers are linearly ordered.
ii. The process with the lower identifier value is given higher priority.
The timestamp of an event is denoted by a tuple (t, i), where t is its time of occurrence and i
is the identity of the process where it occurred.
The total order relation (≺) over two events x and y with timestamps (h, i) and (k, j) is given by:
x ≺ y ⇔ (h < k or (h = k and i < j))
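As a small illustration of the tie-breaking rule, the sketch below compares (t, i) timestamps, with the lower process identifier winning a tie. The function name precedes and the sample timestamps are made up for the example.

```python
def precedes(x, y):
    """True when timestamp x = (h, i) is ordered before y = (k, j)."""
    (h, i), (k, j) = x, y
    return h < k or (h == k and i < j)


events = [(3, 2), (1, 3), (3, 1), (2, 1)]
# Python compares tuples lexicographically, which is the same rule.
ordered = sorted(events)
```

Here the two events with scalar time 3 are ordered by process identifier, so (3, 1) is placed before (3, 2).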
In vector time, the time domain is represented by a set of n-dimensional non-negative integer
vectors.
Rule 2: Each message m is piggybacked with the vector clock vt of the sender
process at sending time. On the receipt of such a message (m, vt), process
pi executes the following sequence of actions:
1. update its global logical time
2. execute R1
3. deliver the message m
2. Strong consistency
The system of vector clocks is strongly consistent; thus, by examining
the vector timestamps of two events, we can determine if the events are causally
related.
3. Event counting
If an event e has timestamp vh, vh[j] denotes the number of events
executed by process pj that causally precede e.
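A minimal sketch of a vector clock and of the comparison behind strong consistency follows. It rests on two assumptions that these notes do not spell out: R1 increments the process's own component by 1, and "update its global logical time" means taking the component-wise maximum with the received vector. All names are hypothetical.

```python
class VectorClock:
    def __init__(self, pid, n):
        self.pid = pid
        self.vt = [0] * n

    def tick(self):
        # R1 (assumed): increment own component by 1 before an event.
        self.vt[self.pid] += 1

    def send(self):
        self.tick()
        return list(self.vt)  # copy piggybacked on the message

    def receive(self, msg_vt):
        # R2: merge the received vector (assumed component-wise max),
        # then execute R1, then the message can be delivered.
        self.vt = [max(a, b) for a, b in zip(self.vt, msg_vt)]
        self.tick()


def happened_before(u, v):
    """u causally precedes v: u <= v component-wise and u != v."""
    return all(a <= b for a, b in zip(u, v)) and u != v


def concurrent(u, v):
    return not happened_before(u, v) and not happened_before(v, u)


p0, p1, p2 = VectorClock(0, 3), VectorClock(1, 3), VectorClock(2, 3)
m = p0.send()      # [1, 0, 0]
p1.receive(m)      # p1.vt becomes [1, 1, 0]
p2.tick()          # p2.vt becomes [0, 0, 1]
```

Comparing the two vectors is enough to tell that the send causally precedes the receive, while the event at p2 is concurrent with both.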
− This cuts down the message size, communication bandwidth and buffer (to store messages)
requirements.
− The storage overhead is resolved by maintaining two vectors by process pi :
Fig 1.22: Vector clocks progress in Singhal–Kshemkalyani technique
− Di is updated as follows:
a. Whenever an event occurs at pi such that,
b. When a process pi sends a message to process pj, it piggybacks the updated value of Di[i] in the
message.
c. When pi receives a message from pj with piggybacked value d, pi updates its dependency vector as
follows: Di[j] := max{Di[j], d}.
This technique results in considerable saving in the cost; only one scalar is piggybacked on
every message.
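The dependency-vector rules can be sketched as below. Rule (a) is truncated in these notes, so the sketch assumes that every local event (including send and receive) increments Di[i]; that increment and the names used are assumptions for illustration only.

```python
class DependencyTracker:
    def __init__(self, pid, n):
        self.pid = pid
        self.D = [0] * n  # dependency vector Di

    def local_event(self):
        # Assumed form of rule (a): count the event in the own component.
        self.D[self.pid] += 1

    def send(self):
        self.local_event()
        return self.D[self.pid]  # rule (b): only one scalar is piggybacked

    def receive(self, sender, d):
        self.local_event()
        # Rule (c): record the direct dependency on the sender.
        self.D[sender] = max(self.D[sender], d)


p0, p1, p2 = (DependencyTracker(i, 3) for i in range(3))
d = p0.send()          # d == 1
p1.receive(0, d)       # p1.D becomes [1, 1, 0]
d = p1.send()          # d == 2
p2.receive(1, d)       # p2.D becomes [0, 2, 1]
```

Note that p2 learns only its direct dependency on p1; the transitive dependency on p0 is not carried in the message, which is the price paid for piggybacking a single scalar.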
Due to different clock rates, the clocks at various sites may diverge with time, and periodically
a clock synchronization must be performed to correct this clock skew in distributed systems.
Clocks are synchronized to an accurate real-time standard like UTC (Coordinated Universal
Time). Clocks that must not only be synchronized with each other but also have to adhere to
physical time are termed physical clocks. This degree of synchronization additionally makes it
possible to coordinate and schedule actions between multiple computers connected to a common network.
Basic terminologies:
If Ca and Cb are two different clocks, then:
• Time: The time of a clock in a machine p is given by the function Cp(t), where Cp(t) = t for a
perfect clock.
• Frequency: Frequency is the rate at which a clock progresses. The frequency at time t of
clock Ca is Ca’(t).
• Offset: Clock offset is the difference between the time reported by a clock and the real time.
The offset of the clock Ca is given by Ca(t) − t. The offset of clock Ca relative to Cb at time t
≥ 0 is given by Ca(t) − Cb(t).
• Skew: The skew of a clock is the difference in the frequencies of the clock and the perfect
clock. The skew of a clock Ca relative to clock Cb at time t is Ca’(t) − Cb’(t).
• Drift (rate): The drift of clock Ca is the second derivative of the clock value with respect to
time, namely Ca’’(t).
Clocking Inaccuracies
Physical clocks are synchronized to an accurate real-time standard like UTC (Coordinated
Universal Time). Due to the clock inaccuracy discussed above, a timer (clock) is said to be
working within its specification if:
1 − ρ ≤ dC/dt ≤ 1 + ρ
where the constant ρ is the maximum skew rate.
Fig 1.30 a) Offset and delay estimation between processes from same server
Fig 1.30 b) Offset and delay estimation between processes from different servers
Let T1, T2, T3, T4 be the values of the four most recent timestamps. The clocks A and B
are assumed to be stable and running at the same speed. Let a = T1 − T3 and b = T2 − T4. If the network delay
difference from A to B and from B to A, called differential delay, is small, the clock offset θ and
roundtrip delay δ of B relative to A at time T4 are approximately given by the following:
θ = (a + b)/2, δ = a − b
Each NTP message includes the latest three timestamps T1, T2, and T3, while T4 is determined
upon arrival.
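The offset and delay estimate can be worked through numerically. The sketch below uses the standard estimate, offset = (a + b)/2 and roundtrip delay = a − b, and assumes (since the figure is not reproduced here) that T3 is the send time on A's clock, T1 the receive time on B's clock, T2 the reply send time on B's clock, and T4 the reply receive time on A's clock. The function name is hypothetical.

```python
def ntp_offset_and_delay(t1, t2, t3, t4):
    """Return (offset, roundtrip_delay) of clock B relative to clock A."""
    a = t1 - t3   # forward path: one-way delay plus offset
    b = t2 - t4   # return path: offset minus one-way delay
    return (a + b) / 2, a - b


# Example: B runs 5 units ahead of A and each one-way trip takes 2 units.
# A sends at 100 (A's clock); B receives at 102 true time = 107 on B's clock.
# B replies at 108 (B's clock) = 103 true time; A receives at 105.
offset, delay = ntp_offset_and_delay(t1=107, t2=108, t3=100, t4=105)
```

With symmetric one-way delays the estimate recovers the offset of 5 exactly; with asymmetric delays the error is bounded by half the roundtrip delay.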
(i) non-FIFO
(ii) FIFO
(iii) causal order
(iv) synchronous order
There is always a trade-off between concurrency and ease of use and implementation.
Asynchronous Executions
An asynchronous execution (or A-execution) is an execution (E, ≺) for which the causality relation
is a partial order.
• No causality constraint is imposed on the order in which messages are delivered in an
asynchronous execution.
• Messages on a logical link may be delivered in any order, not necessarily FIFO.
• Although each physical link delivers the messages sent on it in FIFO order due to
the physical properties of the medium, a logical link may be formed as a composite of
physical links, and multiple paths may exist between the two end points of the logical link.
• If two send events s and s’ are related by causality ordering (not physical time ordering), then
a causally ordered execution requires that their corresponding receive events r and r’ occur in
the same order at all common destinations.
• If s and s’ are not related by causality, then CO is vacuously satisfied.
• Causal order is used in applications that update shared data, distributed shared memory, or
fair resource allocation.
• The delayed message m is then given to the application for processing. The event of an
application processing an arrived message is referred to as a delivery event.
• No message is overtaken by a chain of messages between the same (sender, receiver) pair.
If send(m1) ≺ send(m2), then for each common destination d of messages m1 and m2,
deliverd(m1) ≺ deliverd(m2) must be satisfied.
2. Empty Interval Execution: An execution (E, ≺) is an empty-interval (EI) execution if for
each pair of events (s, r) ∈ T, the open interval set { x ∈ E | s ≺ x ≺ r }
in the partial order is empty.
3. An execution (E, ≺) is CO if and only if for each pair of events (s, r) ∈ T and each event e
∈ E,
• weak common past: e ≺ r ⟹ ¬(s ≺ e)
• weak common future: s ≺ e ⟹ ¬(e ≺ r)
Synchronous Execution
• When all the communication between pairs of processes uses synchronous send and
receive primitives, the resulting order is the synchronous order.
• The synchronous communication always involves a handshake between the receiver
and the sender, the handshake events may appear to be occurring instantaneously and
atomically.
• The instantaneous communication property of synchronous executions requires a
modified definition of the causality relation because for each (s, r) ∈ T, the send
event is not causally ordered before the receive event.
• The two events are viewed as being atomic and simultaneous, and neither event
precedes the other.
S2: If (s, r) ∈ T, then for all x ∈ E, [(x ≪ s ⟺ x ≪ r) and (s ≪ x ⟺ r ≪ x)].
• An execution can be modeled to give a total order that extends the partial order
(E, ≺).
• In an A-execution, the messages can be made to appear instantaneous if there exists a
linear extension of the execution, such that each send event is immediately followed
by its corresponding receive event in this linear extension.
A non-separated linear extension of (E, ≺) is a linear extension of (E, ≺) such that
for each pair (s, r) ∈ T, the interval { x ∈ E | s ≺ x ≺ r } is empty.
An A-execution (E, ≺) is an RSC execution if and only if there exists a non-separated linear
extension of the partial order (E, ≺).
• In the non-separated linear extension, if the adjacent send event and its corresponding
receive event are viewed atomically, then that pair of events shares a common past
and a common future with each other.
Crown
Let E be an execution. A crown of size k in E is a sequence <(si, ri), i ∈ {0,…, k-1}> of pairs of
corresponding send and receive events such that: s0 ≺ r1, s1 ≺ r2, …, sk−2 ≺ rk−1, sk−1 ≺ r0.
For example, <(s1, r1), (s2, r2)> is a crown when s1 ≺ r2 and s2 ≺ r1. Cyclic dependencies
may exist in a crown. The crown criterion states that an A-computation is RSC, i.e., it can be
realized on a system with synchronous communication, if and only if it contains no crown.
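One way to test the crown criterion is to build a directed graph over the messages, with an edge from message i to message j whenever si ≺ rj, and look for a cycle. The sketch below does this for a small execution; the input format (per-process event lists plus send/receive pairs) and all function names are assumptions made for illustration.

```python
def happens_before(process_events, messages):
    """Return the set of pairs (a, b) with a ≺ b.

    process_events: list of per-process event names in local order.
    messages: list of (send_event, receive_event) pairs.
    """
    succ = {e: set() for evs in process_events for e in evs}
    for evs in process_events:
        for a, b in zip(evs, evs[1:]):
            succ[a].add(b)              # local program order
    for s, r in messages:
        succ[s].add(r)                  # send precedes its receive
    order = set()
    for start in succ:                  # reachability gives the transitive closure
        stack, seen = list(succ[start]), set()
        while stack:
            e = stack.pop()
            if e not in seen:
                seen.add(e)
                stack.extend(succ[e])
        order.update((start, e) for e in seen)
    return order


def has_crown(process_events, messages):
    hb = happens_before(process_events, messages)
    n = len(messages)
    # Edge i -> j when s_i ≺ r_j for two different messages.
    edges = {i: [j for j in range(n)
                 if j != i and (messages[i][0], messages[j][1]) in hb]
             for i in range(n)}
    WHITE, GREY, BLACK = 0, 1, 2
    colour = [WHITE] * n

    def visit(i):                       # depth-first search for a cycle
        colour[i] = GREY
        for j in edges[i]:
            if colour[j] == GREY or (colour[j] == WHITE and visit(j)):
                return True
        colour[i] = BLACK
        return False

    return any(colour[i] == WHITE and visit(i) for i in range(n))


msgs = [("s1", "r1"), ("s2", "r2")]
# Each process sends before it receives: s1 ≺ r2 and s2 ≺ r1, a crown.
crossing = has_crown([["s1", "r2"], ["s2", "r1"]], msgs)
# Two messages in the same direction, received in order: no crown.
in_order = has_crown([["s1", "s2"], ["r1", "r2"]], msgs)
```

In the first execution both senders would block forever under synchronous communication, which is exactly what the crown captures; the second execution can be realized synchronously.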
− The above hierarchy implies that some executions belonging to a class X will not belong
to any of the classes included in X. The degree of concurrency is most in A and least
in SYNC.
− A program using synchronous communication is easiest to develop and verify.
− A program using non-FIFO communication, resulting in an A execution, is hardest to
design and verify.
Simulations
− The events in the RSC execution are scheduled as per some non-separated linear
extension, and adjacent (s, r) events in this linear extension are executed sequentially
in the synchronous system.
− The partial order of the asynchronous execution remains unchanged.
− If an A-execution is not RSC, then there is no way to schedule the events to make
them RSC, without actually altering the partial order of the given A-execution.
− However, the following indirect strategy that does not alter the partial order can be
used.
− Each channel Ci,j is modeled by a control process Pi,j that simulates the channel buffer.
− An asynchronous communication from i to j becomes a synchronous communication
from i to Pi,j followed by a synchronous communication from Pi,j to j.
− This enables the decoupling of the sender from the receiver, a feature that is essential
in asynchronous systems.
Rendezvous
Rendezvous systems are a form of synchronous communication among an arbitrary
number of asynchronous processes. All the processes involved meet with each other, i.e.,
communicate synchronously with each other at one time. Two types of rendezvous systems
are possible:
• Binary rendezvous: When two processes agree to synchronize.
• Multi-way rendezvous: When more than two processes agree to synchronize.
Features of binary rendezvous:
• For the receive command, the sender must be specified. However, multiple receive
commands can exist. A type check on the data is implicitly performed.
• Send and receive commands may be individually disabled or enabled. A command is
disabled if it is guarded and the guard evaluates to false. The guard would likely contain
an expression on some local variables.
• Synchronous communication is implemented by scheduling messages under the
covers using asynchronous communication.
• Scheduling involves pairing of matching send and receive commands that are both
enabled. The communication events for the control messages under the covers do not
alter the partial order of the execution.
The message types used are: M, ack(M), request(M), and permission(M). Execution
events in the synchronous execution are only the send of the message M and the receive of the
message M. The send and receive events for the other message types, ack(M), request(M),
and permission(M), which are control messages, occur under the covers and are not included
in the synchronous execution. The messages request(M), ack(M), and
permission(M) use M’s unique tag; the message M is not included in these messages.
Pi executes send(M) and blocks until it receives ack(M) from Pj. The send event SEND(M) now
completes.
Any message M’ (from a higher priority process) and any request(M’) for synchronization (from
a lower priority process) received during the blocking period are queued.
(i) If a message M’ arrives from a higher priority process Pk, Pi accepts M’ by scheduling a
RECEIVE(M’) event and then executes send(ack(M’)) to Pk.
(ii) If a request(M’) arrives from a lower priority process Pk, Pi executes send(permission(M’)) to Pk
and blocks waiting for the message M’. When M’ arrives, the RECEIVE(M’) event is executed.
(2c) When the permission(M) arrives, Pi knows partner Pj is synchronized and Pi executes send(M). The
SEND(M) now completes.
At the time a request(M) is processed by Pi, process Pi executes send(permission(M)) to Pj and blocks
waiting for the message M. When M arrives, the RECEIVE(M) event is executed and the process
unblocks.
When Pi is unblocked, it dequeues the next (if any) message from the queue and processes it as a
message arrival (as per rules 3 or 4).
GROUP COMMUNICATION
Group communication is done by broadcasting of messages. A message broadcast is
the sending of a message to all members in the distributed system. The communication may
be
• Multicast: A message is sent to a certain subset or a group.
• Unicasting: A point-to-point message communication.
Propagation Constraint II: it is not known that a message has been sent to d in the causal
future of Send(M), and hence it is not guaranteed using a reasoning based on transitivity that
the message M will be delivered to d in CO.
Fig 2.6: Conditions for causal ordering
The Propagation Constraints also imply that if either (I) or (II) is false, the information
“d ∈ M.Dests” must not be stored or propagated, even to remember that (I) or (II) has been
falsified:
▪ not in the causal future of Deliverd(Mi,a)
▪ not in the causal future of ek,c, where d ∈ Mk,c.Dests and there is no other
message sent causally between Mi,a and Mk,c to the same destination d.
The data structures maintained are sorted row–major and then column–major:
1. Explicit tracking:
▪ Tracking of (source, timestamp, destination) information for messages (i) not known to be
delivered and (ii) not guaranteed to be delivered in CO, is done explicitly using the l.Dests
field of entries in local logs at nodes and the o.Dests field of entries in messages.
▪ Sets li,a.Dests and oi,a.Dests contain explicit information of destinations to which Mi,a is not
guaranteed to be delivered in CO and is not known to be delivered.
▪ The information about d ∈ Mi,a.Dests is propagated up to the earliest events on all causal
paths from (i, a) at which it is known that Mi,a is delivered to d or is guaranteed to be
delivered to d in CO.
2. Implicit tracking:
▪ Tracking of messages that are either (i) already delivered, or (ii) guaranteed to be
delivered in CO, is performed implicitly.
▪ The information about messages (i) already delivered or (ii) guaranteed to be delivered
in CO is deleted and not propagated because it is redundant as far as enforcing CO is
concerned.
▪ It is useful in determining what information that is being carried in other messages and
is being stored in logs at other nodes has become redundant and thus can be purged.
▪ The semantics are implicitly stored and propagated. This information about messages
that are (i) already delivered or (ii) guaranteed to be delivered in CO is tracked without
explicitly storing it.
▪ The algorithm derives it from the existing explicit information about messages (i) not
known to be delivered and (ii) not guaranteed to be delivered in CO, by examining only
oi,a.Dests or li,a.Dests, which is a part of the explicit information.
Multicast M4,3
At event (4, 3), the information P6 ∈ M5,1.Dests in Log4 is propagated on multicast M4,3 only to
process P6 to ensure causal delivery using the Delivery Condition. The piggybacked
information on message M4,3 sent to process P3 must not contain this information because of
constraint II. As long as any future message sent to P6 is delivered in causal order w.r.t. M4,3 sent
to P6, it will also be delivered in causal order w.r.t. M5,1. And as M5,1 is already delivered to P4,
the information M5,1.Dests = ∅ is piggybacked on M4,3 sent to P3. Similarly, the information
P6 ∈ M5,1.Dests must be deleted from Log4 as it will no longer be needed, because of constraint
II. M5,1.Dests = ∅ is stored in Log4 to remember that M5,1 has been delivered or is guaranteed
to be delivered in causal order to all its destinations.
Learning implicit information at P2 and P3
When message M4,2 is received by processes P2 and P3, they insert the (new) piggybacked
information in their local logs, as information M5,1.Dests = {P6}. They both continue to store
this in Log2 and Log3 and propagate this information on multicasts until they learn at events
(2, 4) and (3, 2) on receipt of messages M3,3 and M4,3, respectively, that any future message is
expected to be delivered in causal order to process P6, w.r.t. M5,1 sent to P6. Hence by
constraint II, this information must be deleted from Log2 and Log3. The flow of events is
as follows:
• When M4,3 with piggybacked information M5,1.Dests = ∅ is received by P3 at (3, 2), this
is inferred to be valid current implicit information about multicast M5,1 because the log
Log3 already contains explicit information P6 ∈ M5,1.Dests about that multicast.
Therefore, the explicit information in Log3 is inferred to be old and must be deleted to
achieve optimality. M5,1.Dests is set to ∅ in Log3.
• The logic by which P2 learns this implicit knowledge on the arrival of M3,3 is identical.
Processing at P6
When message M5,1 is delivered to P6, only M5,1.Dests = {P4} is added to Log6. Further, P6
propagates only M5,1.Dests = {P4} on message M6,2, and this conveys the current implicit
information that M5,1 has been delivered to P6 by its very absence in the explicit information.
• When the information P6 ∈ M5,1.Dests arrives on M4,3, piggybacked as M5,1.Dests
= {P6}, it is used only to ensure causal delivery of M4,3 using the Delivery Condition,
and is not inserted in Log6 (constraint I). Further, the presence of M5,1.Dests = {P4}
in Log6 implies the implicit information that M5,1 has already been delivered to
P6. Also, the absence of P4 in M5,1.Dests in the explicit piggybacked information
implies the implicit information that M5,1 has been delivered or is guaranteed to be
delivered in causal order to P4, and, therefore, M5,1.Dests is set to ∅ in Log6.
• When the information P6 ∈ M5,1.Dests arrives on M5,2, piggybacked as M5,1.Dests
= {P4, P6}, it is used only to ensure causal delivery of M5,2 using the Delivery
Condition, and is not inserted in Log6 because Log6 contains M5,1.Dests = ∅,
which gives the implicit information that M5,1 has been delivered or is guaranteed
to be delivered in causal order to both P4 and P6.
Processing at P1
• When M2,2 arrives carrying piggybacked information M5,1.Dests = {P6}, this (new)
information is inserted in Log1.
• When M6,2 arrives with piggybacked information M5,1.Dests = {P4}, P1 learns the implicit
information that M5,1 has been delivered to P6 by the very absence of explicit information
P6 ∈ M5,1.Dests in the piggybacked information, and hence marks information P6 ∈
M5,1.Dests for deletion from Log1. Simultaneously, M5,1.Dests = {P6} in Log1 implies
the implicit information that M5,1 has been delivered or is guaranteed to be delivered in
causal order to P4. Thus, P1 also learns that the explicit piggybacked information
M5,1.Dests = {P4} is outdated. M5,1.Dests in Log1 is set to ∅.
• The information “P6 ∈ M5,1.Dests” piggybacked on M2,3, which arrives at P1, is
inferred to be outdated using the implicit knowledge derived from “M5,1.Dests = ∅” in
Log1.
10. TOTAL ORDER
For each pair of processes Pi and Pj and for each pair of messages Mx and My that are delivered to
both the processes, Pi is delivered Mx before My if and only if Pj is delivered Mx before My.
Each process sends the message it wants to broadcast to a centralized process, which
relays all the messages it receives to every other process over FIFO channels.
Complexity: Each message transmission takes two message hops and exactly n messages
in a system of n processes.
Drawbacks: A centralized algorithm has a single point of failure and congestion, and is
not an elegant solution.
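The centralized scheme can be sketched as follows. Direct in-order method calls stand in for the FIFO channels, and the coordinator relays to every group member (including the original sender) so that all members see the same sequence; the class and method names are hypothetical.

```python
class Member:
    def __init__(self, pid):
        self.pid = pid
        self.delivered = []

    def deliver(self, sender, message):
        self.delivered.append((sender, message))


class Coordinator:
    """Central process that relays every broadcast in the order it arrives."""

    def __init__(self, members):
        self.members = members

    def relay(self, sender, message):
        # Hop 2: forward to all members; FIFO channels preserve this order.
        for m in self.members:
            m.deliver(sender, message)


members = [Member(i) for i in range(3)]
coordinator = Coordinator(members)
# Hop 1: each process hands its message to the coordinator.
coordinator.relay(2, "x := 1")
coordinator.relay(0, "x := 2")
```

Every member ends up with the same delivery sequence because the order is decided at a single place, which is also why the coordinator is a single point of failure and congestion.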
Sender side
Phase 1
• In the first phase, a process multicasts the message M with a locally unique tag and
the local timestamp to the group members.
Phase 2
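• The sender process awaits a reply from all the group members, who respond with a
tentative proposal for a revised timestamp for that message M.
• The await call is non-blocking.
• Once all replies are received, the sender takes the maximum of the proposed timestamps
as the final timestamp for M.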
Phase 3
• The process multicasts the final timestamp to the group.
Fig 2.9: Sender side of three phase distributed algorithm
Receiver Side
Phase 1
• The receiver receives the message with a tentative timestamp. It updates the variable
priority that tracks the highest proposed timestamp, then revises the proposed
timestamp to the priority, and places the message with its tag and the revised
timestamp at the tail of the queue temp_Q. In the queue, the entry is marked as
undeliverable.
Phase 2
• The receiver sends the revised timestamp back to the sender. The receiver then waits
in a non-blocking manner for the final timestamp.
Phase 3
• The final timestamp is received from the multicaster. The corresponding message
entry in temp_Q is identified using the tag, and is marked as deliverable after the
revised timestamp is overwritten by the final timestamp.
• The queue is then resorted using the timestamp field of the entries as the key. As the
queue is already sorted except for the modified entry for the message under
consideration, that message entry has to be placed in its sorted position in the queue.
• If the message entry is at the head of the temp_Q, that entry, and all consecutive
subsequent entries that are also marked as deliverable, are dequeued from temp_Q,
and enqueued in deliver_Q.
Complexity
This algorithm uses three phases, and, to send a message to n − 1 processes, it uses 3(n − 1)
messages and incurs a delay of three message hops.
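The receiver side of the three-phase algorithm can be sketched as below. Two details are assumptions not stated in these notes: the proposal is computed as max(priority + 1, sender timestamp), and ties between equal final timestamps are broken by the message tag. The sender's role (taking the maximum proposal as the final timestamp) is played by the driver code at the bottom. All names are hypothetical.

```python
class Receiver:
    def __init__(self):
        self.priority = 0     # highest timestamp proposed or seen so far
        self.temp_q = []      # entries: [timestamp, tag, message, deliverable]
        self.deliver_q = []   # messages in final delivery order

    def on_message(self, tag, message, sender_ts):
        # Phases 1 and 2: revise the timestamp, queue as undeliverable,
        # and return the proposal to the sender.
        self.priority = max(self.priority + 1, sender_ts)  # assumed rule
        self.temp_q.append([self.priority, tag, message, False])
        return self.priority

    def on_final(self, tag, final_ts):
        # Phase 3: overwrite the proposal, mark deliverable, re-sort,
        # then move deliverable entries from the head to deliver_q.
        for entry in self.temp_q:
            if entry[1] == tag:
                entry[0], entry[3] = final_ts, True
        self.priority = max(self.priority, final_ts)
        self.temp_q.sort(key=lambda e: (e[0], e[1]))  # tag breaks ties (assumed)
        while self.temp_q and self.temp_q[0][3]:
            self.deliver_q.append(self.temp_q.pop(0)[2])


r1, r2 = Receiver(), Receiver()
# Messages A and B reach the two receivers in opposite orders.
a1 = r1.on_message("A", "msg-A", 1)
b1 = r1.on_message("B", "msg-B", 1)
b2 = r2.on_message("B", "msg-B", 1)
a2 = r2.on_message("A", "msg-A", 1)
# Each sender uses the maximum proposal as the final timestamp.
final_a, final_b = max(a1, a2), max(b1, b2)
r1.on_final("A", final_a)
r1.on_final("B", final_b)
r2.on_final("B", final_b)
r2.on_final("A", final_a)
```

Although the two receivers saw the messages and the final timestamps in different orders, both deliver msg-A before msg-B, because an undeliverable entry at the head of temp_Q blocks everything behind it until its final timestamp is known.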
11. GLOBAL STATE AND SNAPSHOT RECORDING
ALGORITHMS
• A distributed computing system consists of processes that do not share a common
memory and communicate asynchronously with each other by message passing.
• Each component of the system has a local state. The state of a process is its local memory and a
history of its activity.
• The state of a channel is characterized by the set of messages sent along the channel
less the messages received along the channel. The global state of a distributed system
is a collection of the local states of its components.
• If shared memory were available, an up-to-date state of the entire system would be
available to the processes sharing the memory.
• The absence of shared memory necessitates ways of getting a coherent and complete
view of the system based on the local states of individual processes.
• A meaningful global snapshot can be obtained if the components of the distributed
system record their local states at the same time.
• This would be possible if the local clocks at processes were perfectly synchronized or
if there were a global system clock that could be instantaneously read by the processes.
• If processes read time from a single common clock, various indeterminate transmission
delays during the read operation will cause the processes to identify various physical
instants as the same time.
System Model
• The system consists of a collection of n processes, p1, p2,…,pn that are connected
by channels.
• Let Cij denote the channel from process pi to process pj.
• Processes and channels have states associated with them.
• The state of a process at any time is defined by the contents of processor registers,
stacks, local memory, etc., and may be highly dependent on the local context of
the distributed application.
• The state of channel Cij, denoted by SCij, is given by the set of messages in transit
in the channel.
• The events that may happen are: internal event, send (send (mij)) and receive
(rec(mij)) events.
• The occurrences of events cause changes in the process state.
• A channel is a distributed entity and its state depends on the local states of the
processes on which it is incident.
Law of conservation of messages: Every message mij that is recorded as sent in the local state of a
process pi must be captured in the state of the channel Cij or in the collected local state of the receiver
process pj.
➢ In a consistent global state, every message that is recorded as received is also recorded
as sent. Such a global state captures the notion of causality that a message cannot be
received if it was not sent.
➢ Consistent global states are meaningful global states and inconsistent global states are
not meaningful in the sense that a distributed system can never be in an inconsistent
state.
Interpretation of cuts
• Cuts in a space–time diagram provide a powerful graphical aid in representing and
reasoning about the global states of a computation. A cut is a line joining an arbitrary
point on each process line that slices the space–time diagram into a PAST and a
FUTURE.
• A consistent global state corresponds to a cut in which every message received in the
PAST of the cut has been sent in the PAST of that cut. Such a cut is known as a
consistent cut.
• In a consistent snapshot, all the recorded local states of processes are concurrent; that
is, the recorded local state of no process causally affects the recorded local state of any
other process.
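The consistent-cut condition is easy to check mechanically. In the sketch below a cut is given as the number of events of each process that lie in the PAST, and each message is a pair of (process, event index) positions; this representation and the function name are assumptions chosen for illustration.

```python
def is_consistent_cut(cut, messages):
    """cut[p] = number of events of process p in the PAST of the cut.

    messages: list of ((sender, send_index), (receiver, recv_index)),
    with event indices counted from 1 on each process line.
    """
    for (ps, send_idx), (pr, recv_idx) in messages:
        received_in_past = recv_idx <= cut[pr]
        sent_in_past = send_idx <= cut[ps]
        if received_in_past and not sent_in_past:
            return False  # a message is received in the PAST but sent in the FUTURE
    return True


# One message: sent as the 2nd event of process 0, received as the 1st event of process 1.
messages = [((0, 2), (1, 1))]
bad = is_consistent_cut([1, 1], messages)         # receive recorded, send not
good = is_consistent_cut([2, 1], messages)        # both in the PAST
in_transit = is_consistent_cut([2, 0], messages)  # sent but not yet received
```

The third cut is still consistent: a message that is sent in the PAST and received in the FUTURE simply belongs to the channel state.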
Issue 2:
How to determine the instant when a process takes its snapshot?
Answer:
A process pj must record its snapshot before processing a message mij that was sent by
process pi after recording its snapshot.
12. SNAPSHOT ALGORITHMS FOR FIFO CHANNELS
Each distributed application has a number of processes running on different physical
servers. These processes communicate with each other through messaging channels.
A snapshot captures the local states of each process along with the state of each communication channel.
Chandy–Lamport algorithm
• The algorithm records a global snapshot consisting of the state of each process and each channel.
• The Chandy-Lamport algorithm uses a control message, called a marker.
• After a site has recorded its snapshot, it sends a marker along all of its outgoing channels
before sending out any more messages.
• Since channels are FIFO, a marker separates the messages in the channel into those to
be included in the snapshot from those not to be recorded in the snapshot.
• This addresses issue I1. The role of markers in a FIFO system is to act as delimiters
for the messages in the channels so that the channel state recorded by the process at
the receiving end of the channel satisfies the condition C2.
Initiating a snapshot
• Process Pi initiates the snapshot
• Pi records its own state and prepares a special marker message.
• Send the marker message to all other processes.
• Start recording all incoming messages from channels Cij for j not equal to i.
Propagating a snapshot
• For every process Pj, consider a message arriving on channel Ckj.
• If the marker message is seen for the first time:
− Pj records its own state and marks Ckj as empty.
− Pj sends the marker message to all other processes.
− Pj records all incoming messages from channels Clj for l not equal to j or k.
• Else (Pj has already recorded its state), the state of Ckj is recorded as the messages
received on Ckj since Pj recorded its state, and recording on Ckj stops.
Terminating a snapshot
• Every process has received a marker and recorded its own state.
• Every process has received a marker on all of its N−1 incoming channels.
• A central server can gather the partial state to build a global snapshot.
Complexity
The recording part of a single instance of the algorithm requires O(e) messages
and O(d) time, where e is the number of edges in the network and d is the diameter of the
network.
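The marker rules can be tried out in a small single-threaded simulation. The sketch below assumes a fully connected system of three processes that transfer money over FIFO channels (modelled with deques), so a correct snapshot must account for the full 300 units. The class Process, the banking application and the delivery loop are all illustrative assumptions.

```python
from collections import deque

MARKER = "MARKER"


class Process:
    def __init__(self, pid, balance, peers):
        self.pid, self.balance, self.peers = pid, balance, peers
        self.recorded = None      # recorded local state (None until recorded)
        self.channel_state = {}   # sender id -> recorded in-transit messages
        self.recording = set()    # incoming channels still being recorded

    def send(self, dst, amount, net):
        self.balance -= amount
        net[(self.pid, dst)].append(amount)

    def record(self, net, marker_from=None):
        # Record own state, then send a marker on every outgoing channel
        # before any further application message.
        self.recorded = self.balance
        self.channel_state = {p: [] for p in self.peers}
        self.recording = set(self.peers) - {marker_from}  # that channel is empty
        for p in self.peers:
            net[(self.pid, p)].append(MARKER)

    def receive(self, src, msg, net):
        if msg == MARKER:
            if self.recorded is None:
                self.record(net, marker_from=src)   # first marker seen
            else:
                self.recording.discard(src)         # channel state is complete
        else:
            if src in self.recording:
                self.channel_state[src].append(msg)  # message was in transit
            self.balance += msg


ids = [0, 1, 2]
procs = {i: Process(i, 100, [j for j in ids if j != i]) for i in ids}
net = {(i, j): deque() for i in ids for j in ids if i != j}

procs[0].send(1, 10, net)   # in flight when the snapshot starts
procs[0].record(net)        # P0 initiates the snapshot
procs[2].send(0, 5, net)    # sent before P2 sees any marker

progress = True
while progress:             # deliver messages in FIFO order per channel
    progress = False
    for (src, dst), chan in net.items():
        if chan:
            procs[dst].receive(src, chan.popleft(), net)
            progress = True

in_transit = sum(sum(msgs) for p in procs.values()
                 for msgs in p.channel_state.values())
total = sum(p.recorded for p in procs.values()) + in_transit
```

The recorded process states add up to 295 and the recorded state of the channel from P2 to P0 holds the remaining 5, so no money is lost or counted twice in the snapshot.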