Distributed Systems
Communication

Based on: A. Kshemkalyani and M. Singhal, Distributed Computing: Principles, Algorithms, and Systems, Cambridge University Press, 2008. Chapters: Message Ordering and Group Communication; Global State and Snapshot Recording Algorithms.
Outline
- Message orders: non-FIFO, FIFO, causal order, synchronous order
- Group communication with multicast: causal order, total order
- Expected behaviour semantics when failures occur
- Multicasts: application layer on overlays; also at network layer
Notations
- Network (N, L); event set (E, ≺)
- Message m^i: send and receive events s^i and r^i
- Generic send and receive events: s and r
- M, send(M), and receive(M)
- a ∼ b denotes that events a and b occur at the same process
- Send-receive pairs T = {(s, r) ∈ E_i × E_j | s corresponds to r}
[Figure 6.1: (a) An A-execution that is FIFO. (b) An A-execution that is not FIFO.]
Asynchronous executions
An A-execution is an execution (E, ≺) for which the causality relation is a partial order; there are no causality cycles. Delivery on a logical link is not necessarily FIFO, e.g., the connectionless service of network-layer IPv4. (All physical links obey FIFO.)

FIFO executions
An A-execution in which, for all (s, r) and (s′, r′) ∈ T:
(s ∼ s′ and r ∼ r′ and s ≺ s′) =⇒ r ≺ r′.
A logical link is inherently non-FIFO. One can assume a connection-oriented service at the transport layer, e.g., TCP. To implement FIFO over a non-FIFO link, attach a ⟨seq_num, conn_id⟩ to each message; the receiver uses a buffer to order messages, as sketched below.
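
As an illustration, here is a minimal Python sketch (ours, not from the book) of such a receiver; the names FifoReceiver, expected, and buffer are illustrative, and sequence numbers are assumed to start at 0 per connection.

    class FifoReceiver:
        """Delivers each connection's messages in seq_num order."""
        def __init__(self):
            self.expected = {}   # conn_id -> next seq_num to deliver
            self.buffer = {}     # conn_id -> {seq_num: payload}

        def on_arrival(self, conn_id, seq_num, payload):
            # Buffer the arrival, then drain the buffer while the next
            # expected message for this connection is present.
            self.buffer.setdefault(conn_id, {})[seq_num] = payload
            exp = self.expected.get(conn_id, 0)
            delivered = []
            while exp in self.buffer[conn_id]:
                delivered.append(self.buffer[conn_id].pop(exp))
                exp += 1
            self.expected[conn_id] = exp
            return delivered

For example, if message 1 arrives first, on_arrival("c1", 1, "b") returns []; a later on_arrival("c1", 0, "a") returns ["a", "b"], restoring FIFO order.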
Causal order (CO)
If send events s and s′ are related by the causality ordering (not physical-time ordering), then their corresponding receive events r and r′ occur in the same order at all common destinations. If s and s′ are not related by causality, CO is vacuously satisfied.
[Figure 6.2: (a) Violates CO, as s^1 ≺ s^3 but r^3 ≺ r^1. (b) Satisfies CO. (c) Satisfies CO; no send events related by causality. (d) Satisfies CO.]
CO alternate definition
If send(m1) ≺ send(m2), then for each common destination d of messages m1 and m2, deliver_d(m1) ≺ deliver_d(m2) must be satisfied.
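
To make this concrete, a small sketch (ours) that checks the alternate definition against a recorded trace. It assumes the causality relation on send events is already available (e.g., from vector clocks), that dests[m] is each message's destination set, and that every message has been delivered at all of its destinations.

    def satisfies_co(messages, causally_before, dests, delivery_order):
        """Check: send(m1) ≺ send(m2) implies m1 is delivered before m2
        at every common destination d of m1 and m2."""
        for m1 in messages:
            for m2 in messages:
                if m1 != m2 and causally_before(m1, m2):
                    for d in dests[m1] & dests[m2]:
                        seq = delivery_order[d]  # msgs in delivery order at d
                        if seq.index(m1) > seq.index(m2):
                            return False         # CO violated at destination d
        return True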
Empty-interval executions
Fig 6.2(b): consider m^2. There is no event x such that s^2 ≺ x ≺ r^2. This holds for all messages, so the execution is an empty-interval (EI) execution.
For an EI pair ⟨s, r⟩, there exists some linear extension* < such that the corresponding interval {x ∈ E | s < x < r} is also empty.
An empty ⟨s, r⟩ interval in a linear extension implies that s and r may be scheduled arbitrarily close together; this is shown by a vertical arrow in a timing diagram.
An execution (E, ≺) is CO iff for each message M, there exists some space-time diagram in which that message can be drawn as a vertical arrow.
CO does not imply that all messages can be drawn as vertical arrows in the same space-time diagram (that would require all ⟨s, r⟩ intervals to be empty in the same linear extension, i.e., a synchronous execution).
If the past of s and the past of r are identical (and analogously for the future), viz., e ≺ r =⇒ e ≺ s and s ≺ e =⇒ r ≺ e, we get a subclass of CO executions called synchronous executions.

*A linear extension of a partial order (E, ≺) is any total order (E, <) that preserves each ordering relation of the partial order.
[Figure 6.3: (a) Execution in an asynchronous system. (b) Equivalent synchronous execution.]
Synchronous communication involves a handshake between the sender and the receiver. Instantaneous communication implies a modified definition of causality, in which s and r are atomic and simultaneous, neither preceding the other.
Process i        Process j
...              ...
Send(j)          Send(i)
Receive(j)       Receive(i)
...              ...
RSC Executions
RSC execution (realizable with synchronous communication)
An A-execution (E, ≺) is an RSC execution iff there exists a non-separated linear extension of the partial order (E, ≺), i.e., a linear extension in which each send event is immediately followed by its corresponding receive event.
Crown: Definition
Crown
Let E be an execution. A crown of size k in E is a sequence ⟨(s^i, r^i), i ∈ {0, …, k−1}⟩ of pairs of corresponding send and receive events such that:
s^0 ≺ r^1, s^1 ≺ r^2, …, s^{k−2} ≺ r^{k−1}, s^{k−1} ≺ r^0.
[Figure 6.5: Illustration of non-RSC A-executions and crowns.]
Fig 6.5(a): crown is ⟨(s^1, r^1), (s^2, r^2)⟩ as we have s^1 ≺ r^2 and s^2 ≺ r^1.
Fig 6.5(b): crown is ⟨(s^1, r^1), (s^2, r^2)⟩ as we have s^1 ≺ r^2 and s^2 ≺ r^1.
Fig 6.5(c): crown is ⟨(s^1, r^1), (s^3, r^3), (s^2, r^2)⟩ as we have s^1 ≺ r^3, s^3 ≺ r^2, and s^2 ≺ r^1.
Fig 6.2(a): crown is ⟨(s^1, r^1), (s^2, r^2), (s^3, r^3)⟩ as we have s^1 ≺ r^2, s^2 ≺ r^3, and s^3 ≺ r^1.
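
A crown is exactly a cycle in the directed graph whose nodes are the messages, with an edge i → j whenever s^i ≺ r^j for distinct i, j. A small Python sketch (ours) detects a crown by depth-first cycle detection, assuming a causality test precedes(a, b) for a ≺ b is available:

    def has_crown(msgs, precedes):
        """msgs: message id -> (send_event, recv_event).
        Returns True iff the execution contains a crown."""
        ids = list(msgs)
        # Edge i -> j iff s^i ≺ r^j for distinct messages i, j.
        succ = {i: [j for j in ids
                    if j != i and precedes(msgs[i][0], msgs[j][1])]
                for i in ids}
        WHITE, GRAY, BLACK = 0, 1, 2
        color = {i: WHITE for i in ids}

        def dfs(u):
            color[u] = GRAY
            for v in succ[u]:
                if color[v] == GRAY:           # back edge: a crown exists
                    return True
                if color[v] == WHITE and dfs(v):
                    return True
            color[u] = BLACK
            return False

        return any(color[i] == WHITE and dfs(i) for i in ids)

By the crown criterion (next), has_crown returning False means the A-execution is RSC.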
Crown criterion
An A-computation is RSC, i.e., it can be realized on a system with synchronous communication,
iff it contains no crown.
[Figure 6.7: Hierarchy of message ordering paradigms, SYNC ⊂ CO ⊂ FIFO ⊂ A. (a) Venn diagram. (b) Example executions.]
An A-execution is RSC iff it is an S-execution.
RSC ⊂ CO ⊂ FIFO ⊂ A.
The smaller classes place more restrictions on the possible message orderings.
The degree of concurrency is highest in A and lowest in SYNC.
A program using synchronous communication is easiest to develop and verify; a program using non-FIFO communication, resulting in an A-execution, is hardest to design and verify.
Simulating asynchronous executions on synchronous systems
If the A-execution is RSC: schedule events as per a non-separated linear extension; adjacent (s, r) events are scheduled sequentially; the partial order of the original A-execution is unchanged.
If the A-execution is not RSC:
- the partial order has to be changed; or
- model each channel C_{i,j} by a control process P_{i,j} and use synchronous communication (see Fig 6.8). This enables decoupling of the sender from the receiver, but the implementation is expensive.
[Figure 6.8: Modeling channels as processes to simulate an execution using asynchronous primitives on a synchronous system.]
Assumptions
- Receives are always enabled.
- A send, once enabled, remains enabled.
- To break deadlocks, process identifiers (PIDs) are used to introduce asymmetry.
- Each process schedules one send at a time.
- Message types: M, ack(M), request(M), permission(M).
- A process blocks when it knows it can successfully synchronize the current message.
[Figure: Rules to prevent message cycles, involving a higher-priority process Pi and a lower-priority process Pj exchanging M, request(M), permission(M), and ack(M). (a) The high-priority process blocks. (b) The low-priority process does not block.]
Rules for scheduling messages at process Pi:
1. If a message M′ arrives from a higher-priority process Pk, Pi accepts M′ by scheduling a RECEIVE(M′) event and then executes send(ack(M′)) to Pk.
2. If a request(M′) arrives from a lower-priority process Pk, Pi executes send(permission(M′)) to Pk and blocks waiting for the message M′. When M′ arrives, the RECEIVE(M′) event is executed.
3. When permission(M) arrives, Pi knows its partner Pj is synchronized, and Pi executes send(M). The SEND(M) now completes.
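
A schematic Python rendering (ours, heavily simplified; not the book's pseudocode) of these three rules as message handlers at one process. We assume a larger pid means higher priority, and send(dest, kind, m) is an abstract transport:

    class SyncScheduler:
        def __init__(self, pid, send):
            self.pid, self.send = pid, send
            self.awaiting = None   # message we granted permission for, if any

        def on_message(self, src, kind, m):
            if kind == "M":
                self.receive(m)                  # RECEIVE(M') event
                if src > self.pid:
                    self.send(src, "ack", m)     # rule 1: ack higher-priority M'
                elif m == self.awaiting:
                    self.awaiting = None         # rule 2: awaited M' arrived
            elif kind == "request" and src < self.pid:
                self.send(src, "permission", m)  # rule 2: grant permission,
                self.awaiting = m                # then block awaiting M'
            elif kind == "permission":
                self.send(src, "M", m)           # rule 3: SEND(M) completes

        def receive(self, m):
            print(self.pid, "receives", m)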
A higher-priority process Pi blocks on lower-priority processes to avoid cyclic waits (whether or not it is the intended sender or receiver of the message being scheduled).
Before sending M to Pi, Pj requests permission in a nonblocking manner. While waiting:
- if M′ arrives from another higher-priority process, ack(M′) is returned;
- if request(M′) arrives from a lower-priority process, Pj returns permission(M′) and blocks until M′ arrives.
Note: the receive(M′) event gets permuted with the send(M) event.
[Figure: Example schedule with Pi (highest priority), Pj, and Pk (lowest priority), showing M sent to lower- and higher-priority processes, request(M), permission(M), ack(M), and the blocking periods.]
Group Communication
- Unicast vs. multicast vs. broadcast.
- Network-layer or hardware-assisted multicast cannot easily provide:
  - application-specific semantics on message delivery order;
  - adaptation of groups to dynamic membership;
  - multicast to an arbitrary process set at each send;
  - multiple fault-tolerance semantics.
- Closed group (the source is part of the group) vs. open group.
- The number of groups can be O(2^n).
[Figure 6.11: (a) Updates to 3 replicas. (b) Causal order (CO) and total order violated. (c) Causal order violated.]
If message m did not exist, (b) and (c) would not violate CO.
Causal order: the Raynal-Schiper-Toueg algorithm

Assumptions/Correctness:
- FIFO channels.
- Safety: by Step (2a,b).
- Liveness: assuming no failures and finite propagation times.

Complexity:
- n^2 integers per process.
- n^2 integers per message.
- Time per send and receive event: O(n^2).
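
The n^2 figures above come from piggybacking an n × n matrix on every message. A single-threaded Python sketch (ours, in the style of the Raynal-Schiper-Toueg algorithm; a real implementation needs actual transport and blocking): SENT[k][l] counts the messages known to have been sent by k to l, and DELIV[k] counts the messages from k delivered locally.

    class CausalProcess:
        def __init__(self, pid, n):
            self.pid, self.n = pid, n
            self.SENT = [[0] * n for _ in range(n)]  # SENT[k][l]
            self.DELIV = [0] * n                     # DELIV[k]
            self.pending = []                        # buffered (src, stamp, msg)

        def multicast(self, msg, dests, transport):
            stamp = [row[:] for row in self.SENT]    # piggyback a copy
            for j in dests:
                transport(j, (self.pid, stamp, msg))
                self.SENT[self.pid][j] += 1

        def on_receive(self, src, stamp, msg):
            self.pending.append((src, stamp, msg))
            self._try_deliver()

        def _try_deliver(self):
            progress = True
            while progress:
                progress = False
                for src, stamp, msg in list(self.pending):
                    # Deliver once every message sent to this process
                    # causally before msg has already been delivered here.
                    if all(self.DELIV[k] >= stamp[k][self.pid]
                           for k in range(self.n)):
                        self.pending.remove((src, stamp, msg))
                        print(self.pid, "delivers", msg)
                        self.DELIV[src] += 1
                        for k in range(self.n):
                            for l in range(self.n):
                                self.SENT[k][l] = max(self.SENT[k][l],
                                                      stamp[k][l])
                        self.SENT[src][self.pid] += 1
                        progress = True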
Optimal algorithm for causal order: necessary and sufficient conditions

[Figure: The causal future of event e_{i,a} and of the corresponding events. Legend: a message sent to d; the border of the causal future of the corresponding event; an event at which a message is sent to d such that there is no such event on any causal path between event e and this event. The information "d is a dest. of M" must exist for correctness and must not exist (beyond that region) for optimality.]

"d ∈ M_{i,a}.Dests" must be available in the causal future of event e_{i,a}, but:
- not in the causal future of Deliver_d(M_{i,a}), and
- not in the causal future of e_{k,c}, where d ∈ M_{k,c}.Dests and there is no other message sent causally between M_{i,a} and M_{k,c} to the same destination d.

In the causal future of Deliver_d(M_{i,a}) and Send(M_{k,c}), this information is redundant; elsewhere, it is necessary.
Information about which messages have been delivered (or are guaranteed to be delivered without violating CO) is necessary for the Delivery Condition. For optimality, this information cannot be stored explicitly; the algorithm infers it using set-operation logic.
Fragment of the algorithm's pseudocode: clock_j ← clock_j + 1;
// Delivery Condition: ensure that messages sent causally before M are delivered.
for all o_{m,tm} ∈ O_M do
    if j ∈ o_{m,tm}.Dests then
        wait until tm ≤ SR_j[m];
PURGE_NULL_ENTRIES(Log_j);  // purge older entries l for which l.Dests = ∅ is implicitly inferred
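
Schematically (ours, simplified): each log entry piggybacked on M records that some earlier message, timestamped tm by its sender m, was sent to a set of destinations; M may be delivered at process j only once every such message destined for j has been delivered there.

    def can_deliver(j, piggybacked_log, SR):
        """piggybacked_log: iterable of (m, tm, dests) entries o_{m,tm}.
        SR[m]: timestamp of the last message from process m delivered at j.
        Returns True iff the Delivery Condition holds for message M at j."""
        return all(tm <= SR[m]
                   for (m, tm, dests) in piggybacked_log
                   if j in dests)

M stays buffered until can_deliver becomes true.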
[Figure 6.14: (a) A snapshot for PROPOSED_TS and REVISE_TS messages; the final timestamps are, e.g., max(7,10) = 10 and max(9,9) = 9. The dashed lines show the further execution after the snapshot. (b) The FINAL_TS messages.]
Complexity of the three-phase algorithm:
- Three phases.
- 3(n − 1) messages for n − 1 destinations.
- Delay: 3 message hops.
- Also implements causal order.
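
A condensed Python sketch (ours) of the three phases for a single message, with the sender and receivers collapsed into direct calls; a full version must also break timestamp ties, e.g., by process id.

    class TOReceiver:
        def __init__(self):
            self.clock = 0
            self.queue = {}   # msg_id -> [timestamp, is_final]

        def on_proposed(self, msg_id):           # phase 1: tentative enqueue
            self.clock += 1
            self.queue[msg_id] = [self.clock, False]
            return self.clock                    # PROPOSED_TS back to sender

        def on_final(self, msg_id, ts):          # phase 3: FINAL_TS arrives
            self.clock = max(self.clock, ts)
            self.queue[msg_id] = [ts, True]
            self._deliver_ready()

        def _deliver_ready(self):
            # Deliver while the earliest-timestamped queued message is final.
            while self.queue:
                mid, (ts, final) = min(self.queue.items(),
                                       key=lambda kv: kv[1][0])
                if not final:
                    break
                del self.queue[mid]
                print("deliver", mid, "with final ts", ts)

    def multicast(msg_id, receivers):
        proposed = [r.on_proposed(msg_id) for r in receivers]  # phases 1-2
        final_ts = max(proposed)                               # e.g., max(7,10) = 10
        for r in receivers:
            r.on_final(msg_id, final_ts)                       # phase 3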
Global State and Snapshot Recording Algorithms
Models of communication
Recall that there are three models of communication: FIFO, non-FIFO, and causal ordering (CO).
In the FIFO model, each channel acts as a first-in, first-out message queue; thus, message ordering is preserved by a channel.
In the non-FIFO model, a channel acts like a set to which the sender process adds messages and from which the receiver process removes messages in a random order.
A system that supports causal delivery of messages satisfies the following property: "for any two messages m_ij and m_kj, if send(m_ij) −→ send(m_kj), then rec(m_ij) −→ rec(m_kj)".
[Figure: Space-time diagram of four processes p1-p4 exchanging messages m1-m5, with two cuts C1 and C2.]
Chandy-Lamport algorithm
The Chandy-Lamport algorithm uses a control message, called a marker, whose role in a FIFO system is to separate the messages in the channels.
After a site has recorded its snapshot, it sends a marker along all of its outgoing channels before sending out any more messages.
A marker separates the messages in a channel into those to be included in the snapshot from those not to be recorded in the snapshot.
A process must record its snapshot no later than when it receives a marker on any of its incoming channels.
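
A minimal Python sketch (ours) of the marker rules at one process. The hooks record_local_state, deliver, and send are abstract stand-ins, and channels are FIFO, as the algorithm requires.

    class SnapshotProcess:
        def __init__(self, in_channels, out_channels, send):
            self.in_channels = in_channels     # ids of incoming channels
            self.out_channels = out_channels   # ids of outgoing channels
            self.send = send                   # send(channel, msg), abstract
            self.recorded = False

        def take_snapshot(self, first_marker=None):
            self.recorded = True
            self.local_state = self.record_local_state()   # abstract hook
            # Start recording in-transit messages on every incoming channel;
            # the channel the first marker arrived on is recorded as empty.
            self.chan_state = {c: [] for c in self.in_channels}
            self.done = {first_marker} if first_marker else set()
            for c in self.out_channels:        # marker goes out before any
                self.send(c, "MARKER")         # further application message

        def on_marker(self, channel):
            if not self.recorded:
                self.take_snapshot(first_marker=channel)
            else:
                self.done.add(channel)         # channel's state now complete

        def on_message(self, channel, msg):
            if self.recorded and channel not in self.done:
                self.chan_state[channel].append(msg)   # in-transit message
            self.deliver(msg)                  # normal processing, abstract

The local part of the snapshot is complete once done covers all incoming channels.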
Correctness
Due to the FIFO property of channels, no message sent after the marker on a channel is recorded in that channel's state. Thus, condition C2 is satisfied.
When a process pj receives a message mij that precedes the marker on channel Cij, it acts as follows: if pj has not yet taken its snapshot, it includes mij in its recorded snapshot; otherwise, it records mij in the state of channel Cij. Thus, condition C1 is satisfied.
Complexity
The recording part of a single instance of the algorithm requires O(e)
messages and O(d) time, where e is the number of edges in the network and
d is the diameter of the network.
Properties of the recorded global state
The recorded global state may not correspond to any of the global states that occurred during the computation. This happens because a process can change its state asynchronously before the markers it sent are received by other sites and those sites record their states.
- But the system could have passed through the recorded global state in some equivalent execution.
- The recorded global state is a valid state in an equivalent execution, and if a stable property (i.e., a property that persists) holds in the system before the snapshot algorithm begins, it holds in the recorded global snapshot.
- Therefore, a recorded global state is useful in detecting stable properties.
Spezialetti-Kearns algorithm
There are two phases in obtaining a global snapshot: locally recording the snapshot at every process, and distributing the resultant global snapshot to all the initiators.
Efficient snapshot recording
In the Spezialetti-Kearns algorithm, a marker carries the identifier of the initiator of the algorithm. Each process has a variable master to keep track of the initiator of the algorithm.
A key notion used by the optimizations is that of a region in the system. A region encompasses all the processes whose master field contains the identifier of the same initiator.
When the initiator's identifier in a marker received along a channel is different from the value in the master variable, the sender of the marker lies in a different region.
The identifier of the concurrent initiator is recorded in a local variable id-border-set.
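
A small sketch (ours) of just this master/region bookkeeping, layered over the basic marker rule; state recording and marker forwarding are elided as comments.

    class RegionProcess:
        def __init__(self, pid):
            self.pid = pid
            self.master = None          # initiator id of this process's region
            self.id_border_set = set()  # ids of concurrent initiators seen

        def initiate(self):
            self.master = self.pid      # this process starts its own region
            # ... record local state; send markers carrying self.pid ...

        def on_marker(self, channel, initiator_id):
            if self.master is None:
                self.master = initiator_id   # join this initiator's region
                # ... record local state; forward markers with initiator_id ...
            elif initiator_id != self.master:
                # Marker from another region: remember the concurrent
                # initiator; this channel lies on the region's border.
                self.id_border_set.add(initiator_id)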
If pi receives its first marker from pj, then pj is the parent of pi in the spanning tree.
When an intermediate process in the spanning tree has received the recorded states from all of its child processes and has recorded the states of all incoming channels, it forwards its locally recorded state and the locally recorded states of all its descendant processes to its parent.
When the initiator receives the locally recorded states of all its descendants from its children, it assembles the snapshot for all the processes in its region and the channels incident on these processes.
The initiator exchanges the snapshot of its region with the initiators in adjacent regions in rounds.
The message complexity of snapshot recording is O(e) irrespective of the number of concurrent initiations of the algorithm. The message complexity of assembling and disseminating the snapshot is O(rn^2), where r is the number of concurrent initiations.