
Unit 2: Message Ordering and Group Communication

A. Kshemkalyani and M. Singhal, Distributed Computing: Principles, Algorithms, and Systems, Cambridge University Press, 2008

Outline and Notations

Outline
▶ Message orders: non-FIFO, FIFO, causal order, synchronous order
▶ Group communication with multicast: causal order, total order
▶ Expected behaviour semantics when failures occur
▶ Multicasts: application layer on overlays; also at network layer
Notations
▶ Network (N, L); event set (E, ≺)
▶ Message mi: send and receive events si and ri
▶ Send and receive events: s and r
▶ M, send(M), and receive(M)
▶ a ∼ b denotes that events a and b occur at the same process
▶ Set of corresponding send-receive pairs: T = {(s, r) ∈ Ei × Ej | s corresponds to r}


Asynchronous and FIFO Executions

[Space-time diagrams omitted: processes P1 and P2, with messages m1–m3 in (a) and m1–m2 in (b).]

Figure 6.1: (a) An A-execution that is not FIFO. (b) An A-execution that is FIFO.
Asynchronous executions:
▶ A-execution: an execution (E, ≺) for which the causality relation is a partial order (no causality cycles).
▶ Delivery on a logical link is not necessarily FIFO, e.g., a connectionless network-layer service such as IPv4.
▶ All physical links obey FIFO.

FIFO executions:
▶ An A-execution in which, for all (s, r) and (s′, r′) ∈ T, (s ∼ s′ and r ∼ r′ and s ≺ s′) ⟹ r ≺ r′.
▶ A logical link is inherently non-FIFO; one can assume a connection-oriented service at the transport layer, e.g., TCP.
▶ To implement FIFO over a non-FIFO link: attach ⟨seq_num, conn_id⟩ to each message, and let the receiver use a buffer to order the messages, as sketched below.
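A minimal sketch (assumed class and field names) of this buffering scheme in Python: the sender tags each message with ⟨conn_id, seq_num⟩ and the receiver delays out-of-order arrivals until the gap is filled.

```python
class FifoSender:
    def __init__(self, conn_id):
        self.conn_id = conn_id
        self.next_seq = 0

    def wrap(self, payload):
        """Tag a payload with (conn_id, seq_num) before sending it."""
        msg = (self.conn_id, self.next_seq, payload)
        self.next_seq += 1
        return msg


class FifoReceiver:
    def __init__(self):
        self.expected = {}   # conn_id -> next sequence number to deliver
        self.buffer = {}     # conn_id -> {seq_num: payload} held back

    def on_arrival(self, msg):
        """Return the payloads that are now deliverable, in FIFO order."""
        conn, seq, payload = msg
        self.buffer.setdefault(conn, {})[seq] = payload
        nxt = self.expected.get(conn, 0)
        delivered = []
        while nxt in self.buffer[conn]:      # deliver any consecutive run
            delivered.append(self.buffer[conn].pop(nxt))
            nxt += 1
        self.expected[conn] = nxt
        return delivered
```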


Causal Order: Definition


Causal order (CO)
A CO execution is an A-execution in which, for all (s, r) and (s′, r′) ∈ T,
(r ∼ r′ and s ≺ s′) ⟹ r ≺ r′

If the send events s and s′ are related by causality (not physical-time ordering), then their corresponding receive events r and r′ occur in the same order at all common destinations.
If s and s′ are not related by causality, then CO is vacuously satisfied.
Figure 6.2: (a) Violates CO, as s1 ≺ s3 but r3 ≺ r1. (b) Satisfies CO. (c) Satisfies CO; no send events are related by causality. (d) Satisfies CO.


Causal Order: Definition from Implementation Perspective

CO alternate definition
If send(m1) ≺ send(m2), then for each common destination d of messages m1 and m2, deliver_d(m1) ≺ deliver_d(m2) must be satisfied.

Message arrival vs. delivery:


▶ A message m that arrives in the OS buffer at Pi may have to be delayed until the messages that were sent to Pi causally before m was sent (the "overtaken" messages) have arrived.
▶ The event of an application processing an arrived message is referred to as a delivery event (instead of as a receive event).
No message is overtaken by a chain of messages between the same (sender, receiver) pair. In Fig. 6.1(a), m1 is overtaken by the chain ⟨m2, m3⟩.
CO degenerates to FIFO when m1 and m2 are sent by the same process.
Uses: updates to shared data, implementing distributed shared memory, fair resource allocation, collaborative applications, event notification systems, distributed virtual environments. A sketch of a causal delivery rule follows.
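As an illustration of the delivery rule above, here is a minimal sketch (an assumed vector-clock mechanism, not the optimal algorithm presented later in this unit) of causal-order delivery for broadcast messages: a message is buffered until every message that causally precedes it has been delivered.

```python
class CausalBroadcastProcess:
    def __init__(self, pid, n):
        self.pid = pid
        self.vc = [0] * n        # vc[k] = number of msgs from k delivered here
        self.pending = []        # arrived but not yet deliverable

    def broadcast(self, payload):
        """Create a message that piggybacks the sender's vector clock."""
        self.vc[self.pid] += 1
        return (self.pid, list(self.vc), payload)

    def _deliverable(self, sender, mvc):
        # Next-in-sequence from the sender, and everything the sender had
        # delivered before sending must already be delivered locally.
        if mvc[sender] != self.vc[sender] + 1:
            return False
        return all(mvc[k] <= self.vc[k] for k in range(len(mvc)) if k != sender)

    def on_arrival(self, msg):
        """Buffer msg; return payloads that become deliverable, in CO."""
        self.pending.append(msg)
        delivered, progress = [], True
        while progress:
            progress = False
            for m in list(self.pending):
                sender, mvc, payload = m
                if self._deliverable(sender, mvc):
                    self.pending.remove(m)
                    self.vc[sender] += 1
                    delivered.append(payload)
                    progress = True
        return delivered
```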


Causal Order: Other Characterizations (1)

Message Order (MO)


An A-execution in which, for all (s, r) and (s′, r′) ∈ T, s ≺ s′ ⟹ ¬(r′ ≺ r).

Fig 6.2(a): s1 ≺ s3, yet r3 ≺ r1, so ¬(r3 ≺ r1) is false ⇒ MO is not satisfied.

Equivalently, a message m cannot be overtaken by a chain of messages.
Figure 6.2 (repeated): (a) Violates CO, as s1 ≺ s3 but r3 ≺ r1. (b) Satisfies CO. (c) Satisfies CO; no send events are related by causality. (d) Satisfies CO.


Causal Order: Other Characterizations (2)


Figure 6.2 (repeated): (a) Violates CO, as s1 ≺ s3 but r3 ≺ r1. (b) Satisfies CO. (c) Satisfies CO; no send events are related by causality. (d) Satisfies CO.

Empty-Interval (EI) property


(E, ≺) is an EI execution if for each (s, r) ∈ T, the open interval set {x ∈ E | s ≺ x ≺ r} in the partial order is empty.

Fig 6.2(b): consider m2; there is no event x such that s2 ≺ x ≺ r2. This holds for all messages ⇒ EI.
For an empty interval ⟨s, r⟩, there exists some linear extension¹ < such that the corresponding interval {x ∈ E | s < x < r} is also empty.
An empty ⟨s, r⟩ interval in a linear extension means that s and r may be arbitrarily close; this is shown by a vertical arrow in a timing diagram.
An execution (E, ≺) is CO iff for each message M, there exists some space-time diagram in which that message can be drawn as a vertical arrow.

¹A linear extension of a partial order (E, ≺) is any total order (E, <) in which each ordering relation of the partial order is preserved.



Causal Order: Other Characterizations (3)

CO does not imply that all messages can be drawn as vertical arrows in the same space-time diagram (that would require all ⟨s, r⟩ intervals to be empty in the same linear extension, i.e., a synchronous execution).

Common Past and Future


An execution (E, ≺) is CO iff for each pair (s, r) ∈ T and each event e ∈ E:
Weak common past: e ≺ r ⟹ ¬(s ≺ e)
Weak common future: s ≺ e ⟹ ¬(e ≺ r)

If the past of both s and r is identical (and analogously for the future), viz., e ≺ r ⟹ e ≺ s and s ≺ e ⟹ r ≺ e, we get a subclass of CO executions, called synchronous executions.


Synchronous Executions (SYNC)

Figure 6.3: (a) An execution in an asynchronous system. (b) An equivalent synchronous execution.
Handshake between sender and receiver.
Instantaneous communication ⇒ a modified definition of causality, where s and r are atomic and simultaneous, neither preceding the other.


Synchronous Executions: Definition


Causality in a synchronous execution.
The synchronous causality relation ≪ on E is the smallest transitive relation that satisfies the following:
S1. If x occurs before y at the same process, then x ≪ y.
S2. If (s, r) ∈ T, then for all x ∈ E, [(x ≪ s ⟺ x ≪ r) and (s ≪ x ⟺ r ≪ x)].
S3. If x ≪ y and y ≪ z, then x ≪ z.

Synchronous execution (or S-execution).

An execution (E, ≪) for which the causality relation ≪ is a partial order.

Timestamping a synchronous execution.

An execution (E, ≺) is synchronous iff there exists a mapping from E to T (scalar timestamps) such that
▶ for any message M, T(s(M)) = T(r(M));
▶ for each process Pi, if ei ≺ ei′ then T(ei) < T(ei′).

Asynchronous Execution with Synchronous Communication


Will a program written for an asynchronous system (A-execution) run correctly if
run with synchronous primitives?

Process i Process j

... ...
Send(j) Send(i)
Receive(j) Receive(i)
... ...

Figure 6.4: A-execution deadlocks when using synchronous primitives.


An A-execution that is realizable under synchronous communication is called a realizable with synchronous communication (RSC) execution.
Figure 6.5: Illustration of non-RSC A-executions.

RSC Executions

Non-separated linear extension of (E, ≺)

A linear extension of (E, ≺) such that for each pair (s, r) ∈ T, the interval {x ∈ E | s ≺ x ≺ r} is empty.

Exercise: identify a non-separated and a separated linear extension in Figs 6.2(d) and 6.3(b).

RSC execution
An A-execution (E , ≺) is an RSC execution iff there exists a non-separated linear
extension of the partial order (E , ≺).

Checking for all linear extensions has exponential cost!


Practical test using the crown characterization


Crown: Definition

Crown
Let E be an execution. A crown of size k in E is a sequence ⟨(si, ri), i ∈ {0, . . ., k−1}⟩ of pairs of corresponding send and receive events such that:
s0 ≺ r1, s1 ≺ r2, . . . , sk−2 ≺ rk−1, sk−1 ≺ r0.

Figure 6.5: Illustration of non-RSC A-executions and crowns.
Fig 6.5(a): the crown is ⟨(s1, r1), (s2, r2)⟩, as we have s1 ≺ r2 and s2 ≺ r1.
Fig 6.5(b): the crown is ⟨(s1, r1), (s2, r2)⟩, as we have s1 ≺ r2 and s2 ≺ r1.
Fig 6.5(c): the crown is ⟨(s1, r1), (s3, r3), (s2, r2)⟩, as we have s1 ≺ r3, s3 ≺ r2, and s2 ≺ r1.
Fig 6.2(a): the crown is ⟨(s1, r1), (s2, r2), (s3, r3)⟩, as we have s1 ≺ r2, s2 ≺ r3, and s3 ≺ r1.


Crown: Characterization of RSC Executions


Some observations
▶ In a crown, si and ri+1 may or may not be on the same process.
▶ A non-CO execution must have a crown.
▶ Even CO executions that are not synchronous have a crown (see Fig 6.2(b)).
▶ The cyclic dependencies of a crown ⇒ the messages cannot be scheduled serially ⇒ the execution is not RSC.


Crown Test for RSC executions


1. Define the relation ,→ on T × T over the messages in the execution (E, ≺) as follows: [s, r] ,→ [s′, r′] iff s ≺ r′. Observe that the condition s ≺ r′ (which has the form used in the definition of a crown) is implied by each of the four conditions (i) s ≺ s′, (ii) s ≺ r′, (iii) r ≺ s′, and (iv) r ≺ r′.
2. Now define a directed graph G,→ = (T, ,→), where the vertex set is the set of messages T and the edge set is defined by ,→. Observe that ,→ is a partial order iff G,→ has no cycle, i.e., there must not be a cycle with respect to ,→ on the set of corresponding (s, r) events.
3. Observe from the definition of a crown that G,→ has a directed cycle iff (E, ≺) has a crown.

Crown criterion
An A-computation is RSC, i.e., it can be realized on a system with synchronous communication,
iff it contains no crown.

Crown test complexity: O(|E|) (more precisely, linear in the number of communication events). A sketch of the test follows.
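A minimal sketch (assumed input encoding) of the crown test: build the graph G,→ whose vertices are messages and whose edges connect M to M′ whenever s(M) ≺ r(M′), then look for a directed cycle (cycle ⟺ crown ⟺ not RSC).

```python
def is_rsc(messages, precedes):
    """messages: list of (s, r) event pairs; precedes(a, b): oracle for a ≺ b."""
    n = len(messages)
    edges = {i: [] for i in range(n)}
    for i, (s, _) in enumerate(messages):
        for j, (_, r2) in enumerate(messages):
            # distinct messages only: s ≺ r within one message is trivial
            if i != j and precedes(s, r2):
                edges[i].append(j)

    WHITE, GRAY, BLACK = 0, 1, 2          # DFS colors for cycle detection
    color = [WHITE] * n

    def has_cycle(u):
        color[u] = GRAY
        for v in edges[u]:
            if color[v] == GRAY or (color[v] == WHITE and has_cycle(v)):
                return True
        color[u] = BLACK
        return False

    return not any(color[u] == WHITE and has_cycle(u) for u in range(n))
```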

Timestamps for an RSC execution

Execution (E, ≺) is RSC iff there exists a mapping from E to T (scalar timestamps) such that
▶ for any message M, T(s(M)) = T(r(M));
▶ for each (a, b) in (E × E) \ T, a ≺ b ⟹ T(a) < T(b).

Hierarchy of Message Ordering Paradigms


Figure 6.7: Hierarchy of message ordering paradigms (SYNC ⊂ CO ⊂ FIFO ⊂ A). (a) Venn diagram. (b) Example executions.
An A-execution is RSC iff A is an S-execution.
RSC ⊂ CO ⊂ FIFO ⊂ A.
The smaller classes place more restrictions on the possible message orderings.
The degree of concurrency is highest in A and lowest in SYNC.
A program using synchronous communication is easiest to develop and verify; a program using non-FIFO communication, resulting in an A-execution, is hardest to design and verify.

Simulations: Async Programs on Sync Systems

If the execution is RSC: schedule the events as per a non-separated linear extension:
▶ adjacent (s, r) events are executed sequentially;
▶ the partial order of the original A-execution is unchanged.
If the A-execution is not RSC:
▶ the partial order has to be changed; or
▶ model each channel Ci,j by a control process Pi,j and use synchronous communication (see Fig 6.8).
This enables decoupling of the sender from the receiver, but the implementation is expensive.

Figure 6.8: Modeling channels as processes to simulate an execution using asynchronous primitives on a synchronous system.

Simulations: Synch Programs on Async Systems

Schedule the messages in the order in which they appear in the S-program:
▶ the partial order of the S-execution is unchanged;
▶ communication on the asynchronous system uses asynchronous primitives;
▶ when a synchronous send is scheduled: wait for the acknowledgement before completing it, as sketched below.
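A minimal sketch (using Python queues as an assumed asynchronous transport) of this simulation: the sender's synchronous send completes only once the receiver's acknowledgement arrives.

```python
import queue

def sync_send(out_q: "queue.Queue", ack_q: "queue.Queue", msg):
    """Simulate a synchronous send over asynchronous primitives."""
    out_q.put(msg)               # asynchronous send of the data message
    ack_q.get()                  # block until the matching ack arrives

def receiver_loop(in_q: "queue.Queue", ack_q: "queue.Queue", deliver):
    """Deliver each message and return an ack that completes the send."""
    while True:
        msg = in_q.get()
        deliver(msg)             # deliver to the application
        ack_q.put(("ack", msg))  # the ack unblocks the sender's sync_send
```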


Sync Program Order on Async Systems


Deterministic program: repeated runs produce the same partial order.
A deterministic receive ⇒ a deterministic execution ⇒ (E, ≺) is fixed.
Sources of nondeterminism (besides unpredictable message delays):
▶ a receive call does not specify the sender;
▶ multiple sends and receives may be enabled at a process and can be executed in interchangeable order:
∗[G1 −→ CL1 || G2 −→ CL2 || · · · || Gk −→ CLk ]
▶ in the deadlock example of Fig 6.4, if the event order at a process is permuted, there is no deadlock!
How to schedule (nondeterministic) synchronous communication calls over an asynchronous system?
▶ Match each send or receive with the corresponding event.
▶ Binary rendezvous (implementation using tokens): a token for each enabled interaction.
▶ Schedule online, atomically, in a distributed manner.
▶ Crown-free scheduling (safety); progress must also be guaranteed.
▶ Fairness and efficiency in scheduling.

Bagrodia’s Algorithm for Binary Rendezvous (1)

Assumptions
▶ Receives are always enabled.
▶ A send, once enabled, remains enabled.
▶ To break deadlocks, process IDs are used to introduce asymmetry.
▶ Each process schedules one send at a time.
Message types: M, ack(M), request(M), permission(M).
A process blocks when it knows it can successfully synchronize the current message.
Figure 6.9: Rules to prevent message cycles. (a) The higher-priority process blocks. (b) The lower-priority process does not block.


Bagrodia’s Algorithm for Binary Rendezvous: Code


(message types)
M, ack(M), request(M), permission(M)

1. Pi wants to execute SEND(M) to a lower-priority process Pj:
Pi executes send(M) and blocks until it receives ack(M) from Pj. The send event SEND(M) now completes.
Any message M′ (from a higher-priority process) and any request(M′) for synchronization (from a lower-priority process) received during the blocking period are queued.

2. Pi wants to execute SEND(M) to a higher-priority process Pj:
(a) Pi seeks permission from Pj by executing send(request(M)).
// to avoid deadlock in which cyclically blocked processes queue messages
(b) While Pi is waiting for permission, it remains unblocked.
(i) If a message M′ arrives from a higher-priority process Pk, Pi accepts M′ by scheduling a RECEIVE(M′) event and then executes send(ack(M′)) to Pk.
(ii) If a request(M′) arrives from a lower-priority process Pk, Pi executes send(permission(M′)) to Pk and blocks waiting for the message M′. When M′ arrives, the RECEIVE(M′) event is executed.
(c) When the permission(M) arrives, Pi knows that its partner Pj is synchronized, and Pi executes send(M). The SEND(M) now completes.

3. request(M) arrives at Pi from a lower-priority process Pj:
At the time a request(M) is processed by Pi, process Pi executes send(permission(M)) to Pj and blocks waiting for the message M. When M arrives, the RECEIVE(M) event is executed and the process unblocks.

4. Message M arrives at Pi from a higher-priority process Pj:
At the time a message M is processed by Pi, process Pi executes RECEIVE(M) (which is assumed to be always enabled) and then send(ack(M)) to Pj.

5. Processing when Pi is unblocked:
When Pi is unblocked, it dequeues the next message (if any) from the queue and processes it as a message arrival (as per rules 3 or 4).

A condensed sketch of the send-side rules follows.
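A condensed sketch (assumed helpers send, recv, wait_for, and deliver; a larger PID is taken to mean higher priority) of the send-side dispatch implied by rules 1 and 2:

```python
def SEND(self_pid, dest_pid, M):
    if dest_pid < self_pid:                  # rule 1: partner has lower priority
        send(dest_pid, ("M", M))
        wait_for(self_pid, ("ack", M))       # block; other arrivals are queued
    else:                                    # rule 2: partner has higher priority
        send(dest_pid, ("request", M))       # 2(a): ask permission, nonblocking
        while True:                          # 2(b): remain unblocked meanwhile
            kind, payload, src = recv(self_pid)
            if kind == "M":                  # 2(b)(i): msg from higher priority
                deliver(payload)
                send(src, ("ack", payload))
            elif kind == "request":          # 2(b)(ii): lower-priority request
                send(src, ("permission", payload))
                deliver(wait_for(self_pid, ("M", payload)))  # block for it
            elif kind == "permission" and payload == M:
                send(dest_pid, ("M", M))     # 2(c): partner is now synchronized
                return
```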


Bagrodia’s Algorithm for Binary Rendezvous (2)

A higher-priority process Pi blocks on a lower-priority process Pj to avoid cyclic wait (whether or not it is the intended sender or receiver of the message being scheduled).
Before sending M to Pi, Pj requests permission in a nonblocking manner. While waiting:
▶ if M′ arrives from another higher-priority process, ack(M′) is returned;
▶ if request(M′) arrives from a lower-priority process, Pj returns permission(M′) and blocks until M′ arrives.
Note: the receive(M′) event gets permuted with the send(M) event.
Figure 6.10: Scheduling messages with synchronous communication. (a) M sent to a lower-priority process. (b) M sent to a higher-priority process, with request(M) and permission(M) exchanged during the blocking period.


Group Communication
Unicast vs. multicast vs. broadcast
Network-layer or hardware-assisted multicast cannot easily provide:
▶ application-specific semantics on message delivery order;
▶ adaptation of groups to dynamically changing membership;
▶ multicast to an arbitrary set of processes at each send;
▶ multiple fault-tolerance semantics.
Closed group (the source is part of the group) vs. open group.
The number of groups can be O(2^n).
Figure 6.11: (a) Updates to 3 replicas R1, R2, R3. (b) Causal order (CO) and total order violated. (c) Causal order violated.
If m did not exist, (b) and (c) would not violate CO.


Raynal-Schiper-Toueg (RST) Algorithm


(local variables)
array of int SENT[1 . . . n, 1 . . . n] // SENT[k, l] = number of messages sent by k to l, as known locally
array of int DELIV[1 . . . n] // DELIV[k] = number of messages sent by k that are delivered locally

(1) send event, where Pi wants to send message M to Pj:
(1a) send (M, SENT) to Pj;
(1b) SENT[i, j] ← SENT[i, j] + 1.

(2) message arrival, when (M, ST) arrives at Pi from Pj:
(2a) deliver M to Pi when, for each process x,
(2b) DELIV[x] ≥ ST[x, i];
(2c) ∀x, y: SENT[x, y] ← max(SENT[x, y], ST[x, y]);
(2d) DELIV[j] ← DELIV[j] + 1.

How does the algorithm simplify if all messages are broadcast? (A sketch of the general algorithm follows the complexity summary.)

Assumptions/Correctness:
▶ FIFO channels.
▶ Safety: steps (2a), (2b).
▶ Liveness: assuming no failures and finite propagation times.

Complexity:
▶ n² integers per process.
▶ n² integers per message.
▶ O(n²) time per send and per receive event.
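A minimal Python sketch (assumed transport callable) of the RST steps above; pending messages are retried so that the Delivery Condition of steps (2a), (2b) eventually releases them.

```python
class RSTProcess:
    def __init__(self, pid, n):
        self.pid = pid
        self.SENT = [[0] * n for _ in range(n)]  # SENT[k][l]: msgs k -> l known here
        self.DELIV = [0] * n                     # DELIV[k]: msgs from k delivered here
        self.pending = []

    def send(self, dest, M, transport):
        # (1a) piggyback a copy of SENT, then (1b) count this message
        transport(dest, (M, [row[:] for row in self.SENT], self.pid))
        self.SENT[self.pid][dest] += 1

    def on_arrival(self, packet):
        self.pending.append(packet)
        delivered, progress = [], True
        while progress:
            progress = False
            for pkt in list(self.pending):
                M, ST, j = pkt
                # (2a, 2b): every message sent to us in this send's causal
                # past must already be delivered locally
                if all(self.DELIV[x] >= ST[x][self.pid] for x in range(len(ST))):
                    self.pending.remove(pkt)
                    for x in range(len(ST)):          # (2c) merge SENT matrices
                        for y in range(len(ST)):
                            self.SENT[x][y] = max(self.SENT[x][y], ST[x][y])
                    self.DELIV[j] += 1                # (2d)
                    delivered.append(M)
                    progress = True
        return delivered
```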


Optimal KS Algorithm for CO: Principles


Mi,a: the a-th multicast message sent by Pi.

Delivery Condition for correctness:

A message M∗ that carries the information "d ∈ M.Dests", where message M was sent to d in the causal past of Send(M∗), is not delivered to d if M has not yet been delivered to d. (A simplified sketch of this check follows.)
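A simplified sketch (assumed data shapes: piggybacked entries as (sender, ts, dests) triples and delivered messages tracked as a set; the actual algorithm compares timestamps against the SR array instead) of checking this Delivery Condition at destination d:

```python
def deliverable(piggyback, delivered, d):
    """piggyback: (sender, ts, dests) entries carried by the arriving M*;
    delivered: set of (sender, ts) pairs already delivered at process d."""
    return all(
        (sender, ts) in delivered        # the causally earlier msg reached d
        for (sender, ts, dests) in piggyback
        if d in dests                    # only entries naming d matter here
    )
```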

Necessary and Sufficient Conditions for Optimality:

For how long should the information "d ∈ Mi,a.Dests" be stored in the log at a process, and piggybacked on messages? As long as, and only as long as:
▶ (Propagation Constraint I) it is not known that the message Mi,a is delivered to d, and
▶ (Propagation Constraint II) it is not known that a message has been sent to d in the causal future of Send(Mi,a), and hence it is not guaranteed, using reasoning based on transitivity, that the message Mi,a will be delivered to d in CO.

⇒ If either (I) or (II) is false, "d ∈ M.Dests" must not be stored or propagated, even to remember that (I) or (II) has been falsified.

Optimal KS Algorithm for CO: Principles


[Figure (illustration of the propagation constraints): events e1–e8 around the multicast Mi,a sent by i to d, with Deliverd(M) marked. Legend: message sent to d; border of the causal future of the corresponding event; event at which a message is sent to d with no such event on any causal path between event e and that event; the info "d is a dest. of M" must exist for correctness; the info "d is a dest. of M" must not exist for optimality.]

In the causal future of Deliverd(Mi,a) and of Send(Mk,c), the information is redundant; elsewhere, it is necessary.
Information about which messages have been delivered (or are guaranteed to be delivered without violating CO) is necessary for the Delivery Condition.
▶ For optimality, this information cannot be stored; the algorithm infers it using set-operation logic.
"d ∈ Mi,a.Dests" must be available in the causal future of event ei,a, but
▶ not in the causal future of Deliverd(Mi,a), and
▶ not in the causal future of ek,c, where d ∈ Mk,c.Dests and there is no other message sent causally between Mi,a and Mk,c to the same destination d.

Optimal KS Algorithm for CO: Principles


[The same figure is repeated here: events e1–e8 around the multicast Mi,a, with the regions where the info "d is a dest. of M" must exist for correctness and must not exist for optimality.]

Information about messages (i) not known to be delivered and (ii) not guaranteed to be delivered in CO is explicitly tracked using (source, ts, dest) triples. It must be deleted as soon as either (i) or (ii) becomes false.
Information about messages already delivered and messages guaranteed to be delivered in CO is implicitly tracked, without storing or propagating it:
▶ it is derived from the explicit information;
▶ it is used to determine when (i) or (ii) becomes false for the explicit information being stored or piggybacked.
In the figure, the information "d ∈ M.Dests":
▶ must exist at e1 and e2, because (I) and (II) are true;
▶ must not exist at e3, because (I) is false;
▶ must not exist at e4, e5, e6, because (II) is false;
▶ must not exist at e7, e8, because (I) and (II) are false.


Optimal KS Algorithm for CO: Code (1)


(local variables)
clockj ← 0; // local counter clock at node j
SRj[1 . . . n] ← 0; // SRj[i] is the timestamp of the last message from i delivered to j
LOGj = {(i, clocki, Dests)} ← {∀i: (i, 0, ∅)};
// Each entry denotes a message sent in the causal past, by i at time clocki. Dests is the set of remaining destinations
// for which it is not known that Mi,clocki (i) has been delivered, or (ii) is guaranteed to be delivered in CO.

SND: j sends a message M to Dests:
(1) clockj ← clockj + 1;
(2) for all d ∈ M.Dests do:
      OM ← LOGj; // OM denotes OMj,clockj
      for all o ∈ OM, modify o.Dests as follows:
        if d ∉ o.Dests then o.Dests ← (o.Dests \ M.Dests);
        if d ∈ o.Dests then o.Dests ← (o.Dests \ M.Dests) ∪ {d};
      // Do not propagate information about indirect dependencies that are
      // guaranteed to be transitively satisfied when the dependencies of M are satisfied.
      for all os,t ∈ OM do
        if os,t.Dests = ∅ ∧ (∃os,t′ ∈ OM | t < t′) then OM ← OM \ {os,t};
      // do not propagate older entries for which the Dests field is ∅
      send (j, clockj, M, Dests, OM) to d;
(3) for all l ∈ LOGj do l.Dests ← l.Dests \ Dests;
    // Do not store information about indirect dependencies that are guaranteed
    // to be transitively satisfied when the dependencies of M are satisfied.
    Execute PURGE_NULL_ENTRIES(LOGj); // purge l ∈ LOGj if l.Dests = ∅
(4) LOGj ← LOGj ∪ {(j, clockj, Dests)}.

Optimal KS Algorithm for CO: Code (2)


RCV: j receives a message (k, tk, M, Dests, OM) from k:
(1) // Delivery Condition: ensure that messages sent causally before M are delivered.
    for all om,tm ∈ OM do
      if j ∈ om,tm.Dests then wait until tm ≤ SRj[m];
(2) Deliver M; SRj[k] ← tk;
(3) OM ← {(k, tk, Dests)} ∪ OM;
    for all om,tm ∈ OM do om,tm.Dests ← om,tm.Dests \ {j};
    // delete the now-redundant dependency of the message represented by om,tm sent to j
(4) // Merge OM and LOGj by eliminating all redundant entries.
    // Implicitly track "already delivered" and "guaranteed to be delivered in CO" messages.
    for all om,t ∈ OM and ls,t′ ∈ LOGj such that s = m do
      if t < t′ ∧ ls,t ∉ LOGj then mark om,t;
      // ls,t had been deleted or never inserted, as ls,t.Dests = ∅ in the causal past
      if t′ < t ∧ om,t′ ∉ OM then mark ls,t′;
      // om,t′ ∉ OM because ls,t′ had become ∅ at another process in the causal past
    Delete all marked elements in OM and LOGj; // delete entries about redundant information
    for all ls,t′ ∈ LOGj and om,t ∈ OM such that s = m ∧ t′ = t do
      ls,t′.Dests ← ls,t′.Dests ∩ om,t.Dests; // delete destinations for which the Delivery
      // Condition is satisfied or guaranteed to be satisfied as per om,t
      Delete om,t from OM; // the information has been incorporated in ls,t′
    LOGj ← LOGj ∪ OM; // merge the nonredundant information of OM into LOGj
(5) PURGE_NULL_ENTRIES(LOGj). // Purge older entries l for which l.Dests = ∅

PURGE_NULL_ENTRIES(LOGj): // Purge older entries l for which l.Dests = ∅ is implicitly inferred
  for all ls,t ∈ LOGj do
    if ls,t.Dests = ∅ ∧ (∃ls,t′ ∈ LOGj | t < t′) then LOGj ← LOGj \ {ls,t}.


Optimal KS Algorithm for CO: Information Pruning

Explicit tracking: (s, ts, dest) triples per multicast, in the Log and in OM.
Implicit tracking of messages that are (i) delivered, or (ii) guaranteed to be delivered in CO:
▶ (Type 1) ∃d ∈ Mi,a.Dests such that d ∉ li,a.Dests ∨ d ∉ oi,a.Dests.
  ◦ When does li,a.Dests = ∅ or oi,a.Dests = ∅?
  ◦ Entries of the form li,ak for k = 1, 2, . . . will accumulate.
  ◦ Implemented in Step (2d).
▶ (Type 2) If a1 < a2 and li,a2 ∈ LOGj, then li,a1 ∈ LOGj (likewise for messages).
  ◦ Entries of the form li,a1 with li,a1.Dests = ∅ can be inferred by their absence, and should not be stored.
  ◦ Implemented in Step (2d) and in PURGE_NULL_ENTRIES.


Optimal KS Algorithm for CO: Example


[Space-time diagram omitted: processes P1–P6 exchanging the multicasts M5,1, M4,2, M2,2, M6,2, M4,3, M5,2, M2,3, M3,3; the causal past of the later events contains event (6,1).]

Message | Sent to | M5,1.Dests piggybacked
M5,1 | P4, P6 | {P4, P6}
M4,2 | P2, P3 | {P6}
M2,2 | P1 | {P6}
M6,2 | P1 | {P4}
M4,3 | P6 | {P6}
M4,3 | P3 | {}
M5,2 | P6 | {P4, P6}
M2,3 | P1 | {P6}
M3,3 | P2, P6 | {}

Information about P6 as a destination of the multicast at event (5,1) propagates as piggybacked information and in the logs.

Figure 6.13: Tracking of information about M5,1.Dests.


Total Message Order


Total order

For each pair of processes Pi and Pj and for each pair of messages Mx and My that are delivered to both processes, Pi is delivered Mx before My if and only if Pj is delivered Mx before My.
The same order is seen by all processes; this solves the coherence problem.

Centralized algorithm
(1) When Pi wants to multicast M to group G:
(1a) send M(i, G) to the coordinator.
(2) When M(i, G) arrives from Pi at the coordinator:
(2a) send M(i, G) to the members of G.
(3) When M(i, G) arrives at Pj from the coordinator:
(3a) deliver M(i, G) to the application.

Time complexity: 2 hops per transmission. Message complexity: n. (A sketch follows the figure caption.)

Fig 6.11 (repeated): (a) Updates to 3 replicas. (b) Total order violated. (c) Total order not violated.
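A minimal sketch (transport details assumed) of the centralized algorithm: the coordinator's forwarding order defines the single total order seen by every group member.

```python
class Coordinator:
    def __init__(self, members):
        self.members = members                 # pid -> callable taking a message

    def on_multicast(self, sender, group, M):
        # Forwarding in arrival order imposes one global delivery order.
        for pid in group:
            self.members[pid](("deliver", sender, M))

def total_order_multicast(coordinator, sender, group, M):
    # First hop: sender -> coordinator; second hop: coordinator -> members.
    coordinator.on_multicast(sender, group, M)
```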


Total Message Order: 3-phase Algorithm Code


record Q_entry
  M: int; // the application message
  tag: int; // unique message identifier
  sender_id: int; // sender of the message
  timestamp: int; // tentative timestamp assigned to the message
  deliverable: boolean; // whether the message is ready for delivery

(local variables)
queue of Q_entry: temp_Q, delivery_Q
int: clock // used as a variant of Lamport's scalar clock
int: priority // used to track the highest proposed timestamp

(message types)
REVISE_TS(M, i, tag, ts) // Phase 1 message sent by Pi, with initial timestamp ts
PROPOSED_TS(j, i, tag, ts) // Phase 2 message sent by Pj, with revised timestamp, to Pi
FINAL_TS(i, tag, ts) // Phase 3 message sent by Pi, with final timestamp

(1) When process Pi wants to multicast a message M with a tag tag:
(1a) clock = clock + 1;
(1b) send REVISE_TS(M, i, tag, clock) to all processes;
(1c) temp_ts = 0;
(1d) await PROPOSED_TS(j, i, tag, tsj) from each process Pj;
(1e) ∀j ∈ N, do temp_ts = max(temp_ts, tsj);
(1f) send FINAL_TS(i, tag, temp_ts) to all processes;
(1g) clock = max(clock, temp_ts).

(2) When REVISE_TS(M, j, tag, clk) arrives from Pj:
(2a) priority = max(priority + 1, clk);
(2b) insert (M, tag, j, priority, undeliverable) in temp_Q; // at the end of the queue
(2c) send PROPOSED_TS(i, j, tag, priority) to Pj.

(3) When FINAL_TS(j, tag, clk) arrives from Pj:
(3a) identify the entry Q_entry(tag) in temp_Q corresponding to tag;
(3b) mark Q_entry(tag) as deliverable;
(3c) update Q_entry.timestamp to clk and re-sort temp_Q based on the timestamp field;
(3d) if head(temp_Q) = Q_entry(tag) then
(3e)   move Q_entry(tag) from temp_Q to delivery_Q;
(3f)   while head(temp_Q) is deliverable do
(3g)     move head(temp_Q) from temp_Q to delivery_Q.

(4) When Pi removes a message (M, tag, j, ts, deliverable) from head(delivery_Qi):
(4a) clock = max(clock, ts) + 1.

Total Order: Distributed Algorithm: Example and Complexity
[Figure detail omitted: the queues temp_Q and delivery_Q at processes C and D before and after the exchange; sender A computes the final timestamp max(7, 10) = 10 and sender B computes max(9, 9) = 9.]

Figure 6.14: (a) A snapshot for PROPOSED_TS and REVISE_TS messages; the dashed lines show the further execution after the snapshot. (b) The FINAL_TS messages.
Complexity:
▶ Three phases.
▶ 3(n − 1) messages for n − 1 destinations.
▶ Delay: 3 message hops.
The algorithm also implements causal order.
Global State and Snapshot Recording Algorithms


Introduction

Recording the global state of a distributed system on-the-fly is an important paradigm.
The lack of globally shared memory and of a global clock, together with unpredictable message delays, makes this problem non-trivial.
This chapter first defines consistent global states and discusses the issues to be addressed to compute consistent distributed snapshots.
Then several algorithms to determine such snapshots on-the-fly are presented for several types of networks.


System model

The system consists of a collection of n processes p1, p2, ..., pn that are connected by channels.
There is no globally shared memory or physical global clock; processes communicate by passing messages through communication channels.
Cij denotes the channel from process pi to process pj, and its state is denoted by SCij.
The actions performed by a process are modeled as three types of events: internal events, message send events, and message receive events.
For a message mij that is sent by process pi to process pj, let send(mij) and rec(mij) denote its send and receive events.


System model

At any instant, the state of process pi, denoted by LSi, is a result of the sequence of all the events executed by pi up to that instant.
For an event e and a process state LSi, e ∈ LSi iff e belongs to the sequence of events that have taken process pi to state LSi, and e ∉ LSi otherwise.
For a channel Cij, the following set of messages can be defined based on the local states of the processes pi and pj:
Transit: transit(LSi, LSj) = {mij | send(mij) ∈ LSi ∧ rec(mij) ∉ LSj}


Models of communication

Recall that there are three models of communication: FIFO, non-FIFO, and causal order (CO).
In the FIFO model, each channel acts as a first-in, first-out message queue, and thus message ordering is preserved by a channel.
In the non-FIFO model, a channel acts like a set in which the sender process adds messages and the receiver process removes messages in a random order.
A system that supports causal delivery of messages satisfies the following property: for any two messages mij and mkj, if send(mij) −→ send(mkj), then rec(mij) −→ rec(mkj).


Consistent global state

The global state of a distributed system is a collection of the local states of the processes and the channels.
Notationally, the global state GS is defined as GS = { ∪i LSi , ∪i,j SCij }.
A global state GS is a consistent global state iff it satisfies the following two conditions:
C1: send(mij) ∈ LSi ⇒ mij ∈ SCij ⊕ rec(mij) ∈ LSj (⊕ is the exclusive-or operator).
C2: send(mij) ∉ LSi ⇒ mij ∉ SCij ∧ rec(mij) ∉ LSj.
(A sketch of checking C1 and C2 follows.)
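A minimal sketch (messages encoded by identifiers, an assumed representation) of checking C1 and C2 for a candidate global state: every sent message is accounted for exactly once, and nothing is received or in transit before being sent relative to the cut.

```python
def is_consistent(local_sends, local_recvs, channel_states, messages):
    """local_sends / local_recvs: msg ids recorded in some LS_i;
    channel_states: msg ids recorded in some SC_ij;
    messages: all msg ids of the computation."""
    for m in messages:
        if m in local_sends:
            # C1: sent within the cut => in a channel state XOR received
            if (m in channel_states) == (m in local_recvs):
                return False
        else:
            # C2: not sent within the cut => neither in transit nor received
            if m in channel_states or m in local_recvs:
                return False
    return True
```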


Interpretation in terms of cuts

A cut in a space-time diagram is a line joining an arbitrary point on each


process line that slices the space-time diagram into a PAST and a FUTURE.
A consistent global state corresponds to a cut in which every message
received in the PAST of the cut was sent in the PAST of that cut.
Such a cut is known as a consistent cut.
For example, consider the space-time diagram for the computation illustrated
in Figure 4.1.
Cut C1 is inconsistent because message m1 is flowing from the FUTURE to
the PAST.
Cut C2 is consistent and message m4 must be captured in the state of
channel C21 .


[Space-time diagram omitted: processes p1–p4, messages m1–m5, and the cuts C1 and C2.]

Figure 4.1: An interpretation in terms of a cut.


Issues in recording a global state

The following two issues need to be addressed:

I1: How to distinguish between the messages to be recorded in the snapshot and those not to be recorded.
▶ Any message that is sent by a process before recording its snapshot must be recorded in the global snapshot (from C1).
▶ Any message that is sent by a process after recording its snapshot must not be recorded in the global snapshot (from C2).

I2: How to determine the instant when a process takes its snapshot.
▶ A process pj must record its snapshot before processing a message mij that was sent by process pi after recording its snapshot.


Snapshot algorithms for FIFO channels

Chandy-Lamport algorithm
The Chandy-Lamport algorithm uses a control message, called a marker, whose role in a FIFO system is to separate the messages in the channels.
After a site has recorded its snapshot, it sends a marker along all of its outgoing channels before sending out any more messages.
A marker separates the messages in a channel into those to be included in the snapshot and those not to be recorded in the snapshot.
A process must record its snapshot no later than when it receives a marker on any of its incoming channels.


Chandy-Lamport algorithm

The algorithm can be initiated by any process by executing the “Marker


Sending Rule” by which it records its local state and sends a marker on each
outgoing channel.
A process executes the “Marker Receiving Rule” on receiving a marker. If the
process has not yet recorded its local state, it records the state of the channel
on which the marker is received as empty and executes the “Marker Sending
Rule” to record its local state.
The algorithm terminates after each process has received a marker on all of
its incoming channels.
All the local snapshots get disseminated to all other processes and all the
processes can determine the global state.


Chandy-Lamport algorithm

Marker Sending Rule for process i
1. Process i records its state.
2. For each outgoing channel C on which a marker has not been sent, i sends a marker along C before i sends further messages along C.

Marker Receiving Rule for process j
On receiving a marker along channel C:
if j has not recorded its state then
  record the state of C as the empty set;
  follow the "Marker Sending Rule"
else
  record the state of C as the set of messages received along C after j's state was recorded and before j received the marker along C.

(A sketch of these two rules follows.)
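A minimal sketch (assuming a FIFO transport supplied via send_fn; application delivery is omitted) of the two rules above:

```python
class CLProcess:
    def __init__(self, pid, out_channels, in_channels, send_fn, state_fn):
        self.out = out_channels
        self.send = send_fn                    # send(channel, msg) over FIFO
        self.state_fn = state_fn               # returns the current local state
        self.local_state = None                # None until recorded
        self.chan_state = {c: [] for c in in_channels}
        self.done = {c: False for c in in_channels}

    def record(self):                          # Marker Sending Rule
        self.local_state = self.state_fn()     # 1. record own state
        for c in self.out:                     # 2. marker precedes later msgs
            self.send(c, "MARKER")

    def on_receive(self, channel, msg):        # Marker Receiving Rule
        if msg == "MARKER":
            if self.local_state is None:
                self.record()                  # channel state stays empty
            self.done[channel] = True          # stop recording this channel
        elif self.local_state is not None and not self.done[channel]:
            self.chan_state[channel].append(msg)   # in-transit message
```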


Correctness and Complexity

Correctness
Due to the FIFO property of channels, no message sent after the marker on a channel is recorded in that channel's state. Thus, condition C2 is satisfied.
When a process pj receives message mij that precedes the marker on channel
Cij , it acts as follows: If process pj has not taken its snapshot yet, then it
includes mij in its recorded snapshot. Otherwise, it records mij in the state of
the channel Cij . Thus, condition C1 is satisfied.
Complexity
The recording part of a single instance of the algorithm requires O(e)
messages and O(d) time, where e is the number of edges in the network and
d is the diameter of the network.


Properties of the recorded global state

The recorded global state may not correspond to any of the global states that
occurred during the computation.
This happens because a process can change its state asynchronously before
the markers it sent are received by other sites and the other sites record their
states.
▶ But the system could have passed through the recorded global state in some equivalent execution.
▶ The recorded global state is a valid state in an equivalent execution, and if a stable property (i.e., a property that persists) holds in the system before the snapshot algorithm begins, it holds in the recorded global snapshot.
▶ Therefore, a recorded global state is useful in detecting stable properties.


Spezialetti-Kearns algorithm
There are two phases in obtaining a global snapshot: locally recording the
snapshot at every process and distributing the resultant global snapshot to all the
initiators.
Efficient snapshot recording
In the Spezialetti-Kearns algorithm, each marker carries the identifier of the initiator of the algorithm. Each process has a variable master to keep track of the initiator of the algorithm.
A key notion used by the optimizations is that of a region in the system. A region encompasses all the processes whose master field contains the identifier of the same initiator.
When the initiator's identifier in a marker received along a channel differs from the value in the master variable, the sender of the marker lies in a different region.
The identifier of the concurrent initiator is recorded in a local variable id-border-set.


The state of each channel is recorded just as in the Chandy-Lamport algorithm (including for channels that cross a border between regions).
Snapshot recording at a process is complete after it has received a marker
along each of its channels.
After every process has recorded its snapshot, the system is partitioned into
as many regions as the number of concurrent initiations of the algorithm.
The variable id-border-set at a process contains the identifiers of the neighboring regions.
Efficient dissemination of the recorded snapshot
In the snapshot recording phase, a forest of spanning trees is implicitly
created in the system. The initiator of the algorithm is the root of a spanning
tree and all processes in its region belong to its spanning tree.


Efficient dissemination of the recorded snapshot

If pi receives its first marker from pj, then process pj is the parent of process pi in the spanning tree.
When an intermediate process in a spanning tree has received the recorded states from all its child processes and has recorded the states of all incoming channels, it forwards its locally recorded state and the locally recorded states of all its descendant processes to its parent.
When the initiator receives the locally recorded states of all its descendants from its children, it assembles the snapshot for all the processes in its region and the channels incident on these processes.
The initiator exchanges the snapshot of its region with the initiators in adjacent regions in rounds.
The message complexity of snapshot recording is O(e) irrespective of the number of concurrent initiations of the algorithm. The message complexity of assembling and disseminating the snapshot is O(rn²), where r is the number of concurrent initiations.
