State Recording Algorithm

This document provides an introduction to snapshot algorithms in distributed computing. It discusses the challenges of capturing a consistent global state in a distributed system without shared memory or a global clock. It then presents several snapshot algorithms for different communication models, including FIFO, non-FIFO, and causal delivery. The algorithms allow processes to record global states in a consistent way by coordinating message passing between processes.


An introduction to snapshot algorithms in distributed computing


1995 Distrib. Syst. Engng. 2 224

(http://iopscience.iop.org/0967-1846/2/4/005)



Distrib. Syst. Engng 2 (1995) 224-233. Printed in the UK

An introduction to snapshot algorithms in distributed computing

Ajay D Kshemkalyani†, Michel Raynal‡ and Mukesh Singhal§
† IBM Corporation, PO Box 12195, Research Triangle Park, NC 27709, USA
‡ IRISA, campus de Beaulieu, 35042 Rennes-cedex, France
§ Department of Computer and Information Science, The Ohio State University, Columbus, OH 43210, USA
Received 14 November 1994, in final form 26 July 1995

Abstract. Recording on-the-fly global states of distributed executions is an
important paradigm when one is interested in analysing, testing, or verifying
properties associated with these executions. Since Chandy and Lamport's seminal
paper on this topic, this problem is called the snapshot problem. Unfortunately, the
lack of both a globally shared memory and a global clock in a distributed system,
added to the fact that transfer delays in these systems are finite but unpredictable,
makes this problem non-trivial.
This paper first discusses issues which have to be addressed to compute
distributed snapshots in a consistent way. Then several algorithms which
determine on-the-fly such snapshots are presented for several types of networks
(according to the properties of their communication channels, namely, FIFO,
non-FIFO, and causal delivery).

0967-1846/95/040224+10$19.50 © 1995 The British Computer Society, The Institution of Electrical Engineers and IOP Publishing Ltd

1. Introduction

A distributed computing system consists of spatially separated processes that do not share a common memory and communicate asynchronously with each other by passing messages over communication channels. Each component of a distributed system has a local state. The state of a process is characterized by the state of its local memory and a history of its activity. The state of a channel is characterized by the set of messages sent along the channel less the messages received along the channel. The global state of a distributed system is a collection of the local states of its components.

Recording the global state of a distributed system is an important paradigm and it finds applications in several aspects of distributed system design. For example, in detection of stable properties such as deadlocks [15] and termination [18], the global state of the system is examined for certain properties; for failure recovery, a global state of the distributed system (called a checkpoint) is periodically saved and recovery from a processor failure is done by restoring the system to the last saved global state [14]; for debugging distributed software, the system is restored to a consistent global state [7,8] and the execution resumes from there in a controlled manner. A snapshot-recording method has been used in the distributed debugging facility of Estelle [12,10], a distributed programming environment. Other applications include monitoring distributed events [25] such as in industrial process control, setting distributed breakpoints [20], protocol specification and verification [4,9,13], and discarding obsolete information [21]. Therefore, it is important that there be efficient ways of recording the global state of a distributed system [6]. Unfortunately, there is no shared memory and no global clock in a distributed system and the distributed nature of the local clocks and local memory makes it difficult to record the global state of the system efficiently.

If shared memory were available, an up-to-date state of the entire system would be available to the processes sharing the memory. The absence of shared memory necessitates ways of getting a coherent and complete view of the system based on the local states of individual processes. A meaningful global snapshot can be obtained if the components of the distributed system record their local states at the same time. This would be possible if the local clocks at processes were perfectly synchronized or if there were a global system clock that could be instantaneously read by the processes. However, it is technologically unfeasible to have perfectly synchronized clocks at various sites: clocks are bound to drift. If processes read time from a single common clock (maintained at one process), various indeterminate transmission delays during the read operation will cause the processes to identify various physical instants as the same time. In both cases, the collection of local state observations will be made at different times and may not be meaningful, as illustrated by the following example.

Example: Let S1 and S2 be two distinct sites of a distributed system which maintain bank accounts A and B, respectively. A site refers to a process in this example. Let the communication channels from site S1 to site S2 and from site S2 to site S1 be denoted by C12 and C21,
respectively. Consider the following sequence of actions, which are also illustrated in figure 1:

(i) Initially, Account A = $500, Account B = $200, C12 = $0, C21 = $0.
(ii) Site S1 initiates a transfer of $100 from Account A to Account B. Account A is decremented by $100 to $400 and a request for $100 credit to Account B is sent on Channel C12 to site S2. Account A = $400, Account B = $200, C12 = $100, C21 = $0.
(iii) Site S2 initiates a transfer of $50 from Account B to Account A. Account B is decremented by $50 to $150 and a request for $50 credit to Account A is sent on Channel C21 to site S1. Account A = $400, Account B = $150, C12 = $100, C21 = $50.
(iv) Site S1 receives the message for a $50 credit to Account A and updates Account A. Account A = $450, Account B = $150, C12 = $100, C21 = $0.
(v) Site S2 receives the message for a $100 credit to Account B and updates Account B. Account A = $450, Account B = $250, C12 = $0, C21 = $0.

Figure 1. A banking example.

Suppose the local state of Account A is recorded at the end of step 1 to show $500 and the local state of Account B and channels C12 and C21 are recorded at the end of step 3 to show $150, $100, and $50, respectively. Then the recorded global state shows $800 in the system. An extra of $100 appears in the system. The reason for the inconsistency is that Account A's state was recorded before the $100 transfer to Account B using channel C12 was initiated, whereas channel C12's state was recorded after the $100 transfer was initiated.

This simple example shows that recording a consistent global state of a distributed system is not a trivial task. This paper addresses this fundamental issue of distributed computing. The work presented in this paper will be useful to designers of distributed systems and designers of application support mechanisms.

The rest of the paper is organized as follows. Section 2 presents the system model and a formal definition of the notion of consistent global state. The subsequent sections present algorithms to record such global states under various communication models. These algorithms are called snapshot algorithms. Section 3 presents snapshot algorithms for FIFO communication channels. It presents the Chandy-Lamport snapshot algorithm followed by a short discussion on three variations of it. Section 4 presents snapshot algorithms for non-FIFO communication channels. Section 5 discusses algorithms for systems that support causal ordering of messages. Finally, Section 6 concludes the paper with summary remarks.

2. System model and definitions

2.1. System model

The system consists of a collection of n processes, indexed from 1 to n, that are connected by channels. There is no globally shared memory and processes communicate solely by passing messages. There is no physical global clock in the system. Message send and receive is asynchronous. Messages are delivered reliably with finite but arbitrary time delay. The system can be described as a directed graph in which vertices represent the processes and edges represent unidirectional communication channels. Let Cij denote the channel from process i to process j.

Processes and channels have states associated with them. The state of a process at any time is defined by the contents of processor registers, stacks, local memory, etc and may be highly dependent on the local context of the distributed application. The state of channel Cij, denoted by SCij, is given by the set of messages in transit in the channel.

The actions performed by a process are modelled as three types of events, namely, internal events, message send events, and message receive events. For a message mij that is sent by process i to process j, let send(mij) and rec(mij) denote its send and receive events, respectively. Occurrence of events changes the states of respective processes and channels, thus causing transitions in global system state. For example, an internal event changes the state of the process at which it occurs. A send event (or a receive event) changes the state of the process that sends (or receives) the message and the state of the channel on which the message is sent (or received). The events at a process are linearly ordered by their order of occurrence.

At any instant, the state of process i, denoted by LSi, results from the sequence of all the events executed by process i till that instant. For an event e and a process state LSi, e ∈ LSi iff e belongs to the sequence of events that have taken process i to state LSi.

A channel is a distributed entity and its state depends on the local states of the processes on which it is incident. For a channel Cij, the following set of messages can be defined, based on the local states of the processes i and j [11].


Transit:
\[ \mathrm{transit}(LS_i, LS_j) = \{\, m_{ij} \mid \mathrm{send}(m_{ij}) \in LS_i \wedge \mathrm{rec}(m_{ij}) \notin LS_j \,\} \]

Thus, if a snapshot recording algorithm records the state of processes i and j as LSi and LSj, respectively, then it must record the state of channel Cij as transit(LSi, LSj).

There are several models of communication among processes and different snapshot algorithms have assumed different models of communication. In a FIFO model, each channel acts as a first-in first-out message queue and thus, message ordering is preserved by a channel. In a non-FIFO model, a channel acts like a set in which the sender process adds messages and the receiver process removes messages from it in a random order. The 'causal ordering' model [5] is based on Lamport's 'happens before' relation on the system events. An event e1 happens before event e2, denoted by e1 → e2, if (a) e1 occurs before e2 on the same process, or (b) e1 is the send event of a message and e2 is the receive event of that message, or (c) there exists an event e' such that e1 happens before e' and e' happens before e2. A system that supports a causal ordering model satisfies the following property:

CO: For any two messages mij and mkj, if send(mij) → send(mkj), then rec(mij) → rec(mkj).

Causally ordered delivery of messages implies FIFO message delivery. The causal ordering model is useful in developing distributed algorithms and may simplify the algorithms themselves.

2.2. Global state

The global state of a distributed system is a collection of the local states of the processes and the channels. Notationally, a global state GS is defined as
\[ GS = \{\, \bigcup_{i} LS_i,\ \bigcup_{i,j} SC_{ij} \,\} \]

A global state GS is a consistent global state iff it satisfies the following two conditions:

C1:
\[ \mathrm{send}(m_{ij}) \in LS_i \Rightarrow m_{ij} \in SC_{ij} \oplus \mathrm{rec}(m_{ij}) \in LS_j \]
(⊕ is the Ex-OR operator.)

C2:
\[ \mathrm{send}(m_{ij}) \notin LS_i \Rightarrow m_{ij} \notin SC_{ij} \wedge \mathrm{rec}(m_{ij}) \notin LS_j \]

In a consistent global state, every message that is recorded as received is also recorded as sent and such a state captures the notion of causality that a message cannot be received if it was not sent. Consistent global states are meaningful global states and inconsistent global states are not meaningful in the sense that a distributed system can never be in an inconsistent state.

2.3. Issues in recording a global state

If a global physical clock were available, the following simple procedure could be used to record a consistent global snapshot of a distributed system: The initiator of the snapshot collection decides a future time at which the snapshot is to be taken and broadcasts this time to each process. All processes take their local snapshots at that instant in the global time. The snapshot of channel Cij includes all the messages that process j receives after taking the snapshot and whose timestamp is smaller than the time of the snapshot. (All messages are timestamped with the sender's time.) Clearly, if channels are not FIFO, a termination detection scheme will be needed to determine when to stop waiting for messages on channels.

However, a global physical clock is not available in a distributed system and the following two issues need to be addressed in recording a consistent global snapshot of a distributed system:

I1: How to distinguish between the messages to be recorded in the snapshot (either in a channel state or a process state) from those not to be recorded. The answer to this comes from conditions C1 and C2 as follows:
    Any message that is sent by a process before recording its snapshot must be recorded in the global snapshot (from C1).
    Any message that is sent by a process after recording its snapshot must not be recorded in the global snapshot (from C2).
I2: How to determine the instant when a process takes its snapshot. The answer to this comes from condition C2:
    A process j must record its snapshot before processing a message mij that was sent by process i after recording its snapshot.

2.4. Cuts of a distributed computation

A distributed computation can be conveniently represented using a timing diagram where horizontal lines represent the processes' time lines. Figure 2 shows a timing diagram for the computation illustrated in figure 1. A line joining one arbitrary point on each process line slices the timing diagram into a PAST and a FUTURE. Such a line is termed a cut in the computation. Every cut corresponds to a global state and every global state can be graphically represented as a cut in the computation's timing diagram [3]. A consistent global state corresponds to a cut in which every message received in the PAST of the cut has been sent in the PAST of that cut. Such a cut is known as a consistent cut. Cuts in a timing diagram provide a powerful graphical aid in representing and reasoning about global states of a computation.

Figure 2. Timing diagram for the banking example.

We next discuss a set of representative snapshot algorithms for distributed systems. These algorithms assume different interprocess communication capabilities about the underlying system and illustrate how interprocess communication affects the design complexity of these algorithms. There are two types of messages: computation messages and control messages.
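
The definitions of transit(LSi, LSj) and of conditions C1 and C2 in section 2 can be checked mechanically. The following is a minimal Python sketch and not part of the original paper: the event encoding ("send", m) / ("rec", m), the message names m12 and m21, and the helper names transit and is_consistent are assumptions made for illustration. It applies the conditions to two recordings of the banking example of figure 1: the mixed recording that shows $800, and one in which every component is recorded at the end of step 3.

```python
def transit(ls_i, ls_j, msgs_ij):
    """Messages whose send is in LS_i and whose receive is not in LS_j."""
    return {m for m in msgs_ij
            if ("send", m) in ls_i and ("rec", m) not in ls_j}


def is_consistent(local, channel, endpoints):
    """Check conditions C1 and C2 for every message.

    local[i]        -- events recorded in LS_i, as ("send", m) / ("rec", m)
    channel[(i, j)] -- message ids recorded in SC_ij
    endpoints[m]    -- (sender, receiver) of message m
    """
    for m, (i, j) in endpoints.items():
        sent = ("send", m) in local[i]
        in_channel = m in channel.get((i, j), set())
        received = ("rec", m) in local[j]
        if sent and in_channel == received:        # C1 needs exactly one
            return False
        if not sent and (in_channel or received):  # C2 needs neither
            return False
    return True


# Hypothetical names: m12 is the $100 credit request, m21 the $50 one.
endpoints = {"m12": (1, 2), "m21": (2, 1)}

# Account A recorded at the end of step 1, everything else at the end of step 3.
mixed = is_consistent(
    local={1: set(), 2: {("send", "m21")}},
    channel={(1, 2): {"m12"}, (2, 1): {"m21"}},
    endpoints=endpoints,
)

# Every component recorded at the end of step 3.
ls1, ls2 = {("send", "m12")}, {("send", "m21")}
aligned = is_consistent(
    local={1: ls1, 2: ls2},
    channel={(1, 2): transit(ls1, ls2, {"m12"}),
             (2, 1): transit(ls2, ls1, {"m21"})},
    endpoints=endpoints,
)

print(mixed, aligned)  # prints: False True
```

The mixed recording fails C2 because m12 appears in the recorded state of C12 although its send event is not in the recorded state of S1, which is exactly the source of the extra $100.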

Computation messages are exchanged by the underlying application and control messages are exchanged by the snapshot algorithm. Execution of a snapshot algorithm is transparent to the underlying application, except for occasional delaying of some actions of the application.

3. Snapshot algorithms for FIFO channels

This section presents the Chandy and Lamport algorithm [6], which was the first algorithm to record the global snapshot, and three of its variations.

3.1. Chandy-Lamport algorithm

3.1.1. Principle. After a site has recorded its snapshot, it sends a control message, called a marker, along all its outgoing channels before sending out any more messages. Since channels are FIFO, a marker separates the messages in the channel into those to be included in the snapshot (i.e. channel state or process state) from those not to be recorded in the snapshot. (This addresses issue I1.) The role of markers in a FIFO system is to act as delimiters for the messages in the channels so that the channel state recorded by the process at the receiving end of the channel satisfies the condition C2.

Since all messages that follow a marker on channel Cij have been sent by process i after i has taken its snapshot, process j must record its snapshot not later than when it receives a marker on channel Cij. (This addresses issue I2.)

3.1.2. The algorithm. The algorithm is given in figure 3. A process initiates snapshot collection by executing the 'Marker Sending Rule' by which it records its local state and sends a marker on each outgoing channel. A process executes the 'Marker Receiving Rule' on receiving a marker. If the process has not yet recorded its local state, it executes the 'Marker Sending Rule' to record its local state. The state of the incoming channel on which the marker is received is recorded as being the set of computation messages received on that channel after recording the local state but before receiving the marker on that channel. The algorithm can be initiated by any process by executing the 'Marker Sending Rule'.

Marker Sending Rule for process i
(i) Process i records its state.
(ii) For each outgoing channel C on which a marker has not been sent, i sends a marker along C before i sends further messages along C.

Marker Receiving Rule for process j
On receiving a marker along channel C:
    if j has not recorded its state then
        begin
            Record the state of C as the empty set
            Follow the 'Marker Sending Rule'
        end
    else
        Record the state of C as the set of messages received along C after j's state was recorded and before j received the marker along C

Figure 3. The Chandy-Lamport algorithm.

To prove the correctness of the algorithm, we now show that a recorded snapshot satisfies conditions C1 and C2. Since a process records its snapshot when it receives the first marker on any incoming channel, no messages that follow markers on the channels incoming to it are recorded in the process's snapshot. Moreover, a process stops recording the state of an incoming channel when a marker is received on that channel. Due to the FIFO property of channels, it follows that no message sent after the marker on that channel is recorded in the channel state. Thus, condition C2 is satisfied. When a process j receives message mij that precedes the marker on channel Cij, it acts as follows: if process j has not taken its snapshot yet, then it includes mij in its recorded snapshot. Otherwise, it records mij in the state of the channel Cij. Thus, condition C1 is satisfied.

The recorded local snapshots can be put together to create the global snapshot in several ways. One policy is to have each process send its local snapshot to the initiator of the algorithm. Another policy is to have each process send the information it records along all outgoing channels, and to have each process receiving such information for the first time propagate it along its outgoing channels. All the local snapshots get disseminated to all other processes and all the processes can determine the global state.

The recording part of a single instance of the algorithm requires O(e) messages and O(d) time, where e is the number of edges in the graph and d is the diameter of the graph.

[Figure 4: timing diagram with time lines S1: Account A and S2: Account B.]

3.2. Property of the recorded global state

The recorded global state may not correspond to any of the global states that occurred during the computation. Consider a possible execution of the snapshot algorithm for the money transfer example of figure 2 using a timing diagram in figure 4. Let site S1 initiate the algorithm at the end of step 1. Site S1 records its local state (Account A = $500) and sends a marker to site S2. The marker is received by site S2 at the end of step 4. When site S2 receives the marker, it records its local state (Account B = $150), the state of channel C12 as $0, and sends a marker along channel C21. When site S1 receives this marker, it records the state of channel C21 as $50. The $700 amount in the system is conserved in the recorded global state.
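
The execution just described can be replayed with a small simulation of the marker rules of figure 3. This is a hedged sketch rather than the authors' implementation: the Site class, the use of collections.deque as a FIFO channel, and the hand-written schedule of deliveries are assumptions chosen to mirror the banking example. Because channels are FIFO, the marker from S1 reaches S2 ahead of the $100 credit, so in this sketch S2 records Account B before that credit is applied.

```python
from collections import deque

MARKER = "MARKER"  # control message; any other message is a credit amount


class Site:
    def __init__(self, name, balance):
        self.name = name
        self.balance = balance
        self.out = {}           # destination name -> FIFO channel (deque)
        self.incoming = []      # names of sites with a channel into this site
        self.recorded = None    # recorded local state, None until recorded
        self.chan_state = {}    # source name -> recorded channel state
        self.recording = set()  # incoming channels still being recorded

    def record_and_send_markers(self):
        """Marker Sending Rule: record the local state, then emit markers."""
        self.recorded = self.balance
        self.recording = set(self.incoming)
        self.chan_state = {src: [] for src in self.incoming}
        for chan in self.out.values():
            chan.append(MARKER)

    def transfer(self, dest, amount):
        self.balance -= amount
        self.out[dest].append(amount)

    def receive(self, src, chan):
        msg = chan.popleft()
        if msg == MARKER:
            # Marker Receiving Rule: first marker triggers the local recording;
            # in either case recording of this channel stops here.
            if self.recorded is None:
                self.record_and_send_markers()
            self.recording.discard(src)
        else:
            self.balance += msg
            if src in self.recording:
                self.chan_state[src].append(msg)


s1, s2 = Site("S1", 500), Site("S2", 200)
c12, c21 = deque(), deque()
s1.out["S2"], s2.out["S1"] = c12, c21
s1.incoming, s2.incoming = ["S2"], ["S1"]

s1.record_and_send_markers()  # end of step (i): S1 initiates
s1.transfer("S2", 100)        # step (ii)
s2.transfer("S1", 50)         # step (iii)
s1.receive("S2", c21)         # step (iv): $50 arrives after S1 recorded
s2.receive("S1", c12)         # marker reaches S2 ahead of the $100
s2.receive("S1", c12)         # step (v): $100 credited after S2 recorded
s1.receive("S2", c21)         # marker from S2 closes channel C21

total = (s1.recorded + s2.recorded
         + sum(s1.chan_state["S2"]) + sum(s2.chan_state["S1"]))
print(s1.recorded, s2.recorded, s2.chan_state["S1"], s1.chan_state["S2"], total)
# prints: 500 150 [] [50] 700
```

The recorded components add up to $700 even though the final balances are $450 and $250, so the recorded state differs from the state the sites actually end in.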


However, this recorded global state never occurred in the execution. This happens because a process can change its state asynchronously before the markers it sent are received by other sites and the other sites record their states.

Nevertheless, as we discuss next, the system could have passed through the recorded global state in an equivalent execution [6]. Suppose the algorithm is initiated in global state Si and it terminates in global state St. Let seq be the sequence of events which takes the system from Si to St. Let S* be the global state recorded by the algorithm. Chandy and Lamport [6] showed that there exists a sequence seq' which is a permutation of seq such that S* is reachable from Si by executing a prefix of seq' and St is reachable from S* by executing the rest of the events of seq'.

Thus, the recorded global state is a valid state in an equivalent execution and if a stable property (i.e. a property that persists, such as termination or deadlock) holds in the system before the snapshot algorithm begins, it holds in the recorded global snapshot. Therefore, a recorded global state is useful in detecting stable properties.

A physical interpretation of the collected global state is as follows. Consider the two instants of recording of the local states in the banking example. These instants are marked by crosses in figure 4. If the cut formed by these instants is viewed as being an elastic band and if the elastic band is stretched so that it is vertical, then all the recorded states of all processes occur simultaneously at one physical instant and the recorded global state occurs in the execution that is depicted in this modified timing diagram (figure 5). Note that the system execution would have been like this, had the processors' speeds and message delays been different. Yet another physical interpretation of the collected global state is as follows: all the recorded process states are mutually concurrent, that is, no process state causally depends upon another. Therefore, we can view logically that all these process states occurred simultaneously even though they might have occurred at different instants in physical time.

Figure 5. Applying the rubber-band criterion.

3.3. Variations of the Chandy-Lamport algorithm

Several variants of the Chandy-Lamport snapshot algorithm followed. These variants refined and optimized the basic algorithm. For example, the Spezialetti and Kearns algorithm [24] optimizes concurrent initiation of snapshot collection and efficiently distributes the recorded snapshot. Venkatesan's algorithm [23] optimizes the basic snapshot algorithm to efficiently record repeated snapshots of a distributed system that are required in recovery algorithms with synchronous checkpointing.

Spezialetti-Kearns method. There are two phases in obtaining a global snapshot: locally recording the snapshot at every process and distributing the resultant global snapshot to all the initiators. Spezialetti and Kearns [24] optimized the Chandy-Lamport algorithm by exploiting the work of combining concurrently initiated snapshots (in the first phase) to efficiently distribute the resultant global snapshot to only the concurrent initiators (in the second phase). A process needs to take only one snapshot, irrespective of the number of concurrent initiators, and not all processes are sent the global snapshot.

This algorithm assumes bidirectional channels in the system. The message complexity of snapshot recording is O(e) irrespective of the number of concurrent initiations of the algorithm. The message complexity of assembling and disseminating the snapshot is O(rn^2) where r is the number of concurrent initiations.

Venkatesan's incremental snapshot method. Many applications require repeated collection of global snapshots of the system. For example, recovery algorithms with synchronous checkpointing need to advance their checkpoints periodically. This can be achieved by repeated invocations of the Chandy-Lamport algorithm. However, Venkatesan [23] proposed the following efficient approach: Execute an algorithm to record an incremental snapshot since the most recent snapshot was taken and combine it with the most recent snapshot to obtain the latest snapshot of the system. The incremental snapshot algorithm of Venkatesan [23] modifies the global snapshot algorithm of Chandy-Lamport to save on messages when computation messages are sent only on a few of the network channels, between the recording of two successive snapshots.

The incremental snapshot algorithm assumes bidirectional FIFO channels, the presence of a single initiator, a fixed spanning tree in the network, and four types of control messages: init_snap, snap_completed, regular, and ack. init_snap and snap_completed messages traverse spanning edges. regular and ack messages, which serve to record states of non-spanning edges, are not sent on those edges on which no computation message has been sent since the previous snapshot.

Venkatesan [23] showed that the lower bound on the message complexity of an incremental snapshot algorithm is Ω(u + n) where u is the number of edges on which a computation message has been sent since the previous snapshot. Venkatesan's algorithm achieves this lower bound in message complexity.

Helary's wave synchronization method. Helary's snapshot algorithm [11] incorporates the concept of message waves in the Chandy-Lamport algorithm. A wave is a flow of control messages such that every process in the system is visited exactly once by a wave control message, and at least one process in the system can determine when this flow of control messages terminates. A wave is initiated after the previous wave terminates. Wave sequences may be implemented by various traversal structures such as a ring. A process begins recording
the local snapshot when it is visited by the wave control message.

Note that in this algorithm, the primary function of wave synchronization is to evaluate functions over the recorded global snapshot. This algorithm has a message complexity of O(e) to record a snapshot (because all channels can be traversed to implement the wave).

4. Snapshot algorithms for non-FIFO channels

A FIFO system ensures that all messages sent after a marker on a channel will be delivered after the marker. This ensures that condition C2 is satisfied in the recorded snapshot if LSi, LSj, and SCij are recorded as described in the Chandy-Lamport algorithm. In a non-FIFO system, the problem of global snapshot recording is complicated because a marker cannot be used to delineate messages into those to be recorded in the global state from those not to be recorded in the global state. In such systems, different techniques have to be used to ensure that a recorded global state satisfies condition C2.

In a non-FIFO system, either some degree of inhibition (i.e. temporarily delaying the execution of an application process or delaying the send of a computation message) or piggybacking of control information on computation messages to capture out-of-sequence messages is necessary to record a consistent global snapshot [22]. The non-FIFO algorithm by Helary uses message inhibition [11]. The non-FIFO algorithms by Lai and Yang [16], Li et al [17] and Mattern [19] use message piggybacking to distinguish computation messages sent after the marker from those sent before the marker.

The non-FIFO algorithm of Helary [11] uses message inhibition to avoid an inconsistency in a global snapshot in the following way: When a process receives a marker, it immediately returns an acknowledgement. After a process i has sent a marker on the outgoing channel to process j, it does not send any messages on this channel until it is sure that j has recorded its local state. Process i can conclude this if it has received an acknowledgement for the marker sent to j, or has received a marker for this snapshot from j.

We next discuss snapshot recording algorithms for systems with non-FIFO channels that use piggybacking of computation messages.

4.1. Lai-Yang algorithm

Lai and Yang's global snapshot algorithm for non-FIFO systems [16] is based on two observations on the role of a marker in a FIFO system. The first observation is that a marker ensures that condition C2 is satisfied for LSi and LSj when the snapshots are recorded at processes i and j, respectively. The Lai-Yang algorithm fulfils this role of a marker in a non-FIFO system by using a colouring scheme on computation messages as follows.

(i) Every process is initially white and turns red while taking a snapshot. The equivalent of the 'Marker Sending Rule' is executed when a process turns red.
(ii) Every message sent by a white (red) process is coloured white (red). Thus, a white (red) message is a message that was sent before (after) the sender of that message recorded its local snapshot.
(iii) Every white process takes its snapshot at its convenience, but no later than the instant it receives a red message.

Thus, when a white process receives a red message, it records its local snapshot before processing the message. This ensures that no message sent by a process after recording its local snapshot is processed by the destination process before the destination records its local snapshot. Thus, an explicit marker message is not required in this algorithm and the 'marker' is piggybacked on computation messages using a colouring scheme.

The second observation is that the marker informs process j of the value of {send(mij) | send(mij) ∈ LSi} so that transit(LSi, LSj) can be computed. The Lai-Yang algorithm fulfils this role of the marker in the following way.

(iv) Every white process records a history of all white messages sent or received by it along each channel.
(v) When a process turns red, it sends these histories along with its snapshot to the initiator process that collects the global snapshot.
(vi) The initiator process evaluates transit(LSi, LSj) for each channel Cij as given below:
\[ SC_{ij} = \{\, \mathrm{send}(m_{ij}) \mid \mathrm{send}(m_{ij}) \in LS_i \,\} - \{\, \mathrm{rec}(m_{ij}) \mid \mathrm{rec}(m_{ij}) \in LS_j \,\} \]

Condition C2 holds because a red message is not included in the snapshot of the recipient process and a channel state is the difference of two sets of white messages. Condition C1 holds because a white message mij is included in the snapshot of process j if j receives mij before taking its snapshot. Otherwise, mij is included in the state of channel Cij.

Though marker messages are not required in the algorithm, each process has to record the entire message history on each channel as part of the local snapshot. Thus, the space requirements of the algorithm may be large. Lai and Yang describe how the size of the local storage and snapshot recording can be reduced by storing only the messages sent and received since the previous snapshot recording, assuming that the previous snapshot is still available. This approach can be very useful to applications that require repeated snapshots of a distributed system.

4.2. Li et al's algorithm

Li et al's algorithm [17] for recording a global snapshot in a non-FIFO system is similar to the Lai-Yang algorithm. Markers are tagged so as to generalize the red/white colours of the Lai-Yang algorithm to accommodate repeated invocations of the algorithm and multiple initiators. In addition, the algorithm is not concerned with the contents of computation messages and the state of a channel is computed as the number of messages in transit in the channel.

229
A D Kshemkalyani et a/

incremental technique to compute channel states, also outlined by Lai and Yang, which reduces the size of message histories to be stored and transmitted. The initiator computes the state of Cij as: (the number of messages in Cij in the previous snapshot) + (the number of messages sent on Cij since the last snapshot at i) − (the number of messages received on Cij since the last snapshot at j).

Though this algorithm does not require any additional message to record a global snapshot provided computation messages are eventually sent on each channel, the local storage and the size of tags on computation messages are of size O(n), where n is the number of initiators.

4.3. Mattern's algorithm

Mattern's algorithm [19] is based on vector clocks. In vector clocks, the clock at a process is an integer vector of length n, with one component for each process. The component for a given process in the vector clock at a process advances whenever that process learns, through messages, that the component's value has advanced.

Mattern's algorithm assumes a single initiator process and works as follows.

(i) The initiator 'ticks' its local clock and selects a future vector time s at which it would like a global snapshot to be recorded. It then broadcasts this time s and freezes all activity until it receives acknowledgements of the receipt of this broadcast.
(ii) When a process receives the broadcast, it remembers the value s and returns an acknowledgement to the initiator.
(iii) After having received an acknowledgement from every process, the initiator increases its vector clock to s and broadcasts a dummy message to all processes. (Observe that before broadcasting this dummy message, the local clocks of other processes have a value ≱ s.)
(iv) The receipt of this dummy message forces each recipient to increase its clock to a value ≥ s if not already ≥ s.
(v) Each process takes a local snapshot and sends it to the initiator when (just before) its clock increases from a value less than s to a value ≥ s. Observe that this may happen before the dummy message arrives at the process.
(vi) The state of Cij is all messages sent along Cij whose timestamp is smaller than s and which are received by process j after recording LSj.

Processes record their local snapshot as per rule (v). Any message mij sent by process i after it records its local snapshot LSi has a timestamp > s. Assume that this mij is received by j before it records LSj. After receiving this mij and before j records LSj, j's local clock reads a value > s, as per the rules for updating vector clocks. This implies j must have already recorded LSj as per rule (v), which contradicts the assumption. Therefore, mij cannot be received by j before it records LSj. By rule (vi), mij is not recorded in SCij and therefore, condition C2 is satisfied. Condition C1 holds because each message mij with a timestamp less than s is included in the snapshot of process j if j receives mij before taking its snapshot. Otherwise, mij is included in the state of channel Cij.

The following observations about the above algorithm lead to various optimizations. (i) The initiator can be made a 'virtual' process: so, no process has to freeze. (ii) As long as a new higher value of s is selected, the phase of broadcasting s and returning the acks can be eliminated. (iii) Only the initiator's component of s is used to determine when to record a snapshot. Also, one needs to know only if the initiator's component of the vector timestamp in a message has increased beyond the value of the corresponding component in s. Therefore, it suffices to have just two values of s, say, white and red, which can be represented using one bit.

With these optimizations, the algorithm becomes similar to the Lai-Yang algorithm except for the manner in which transit(LSi, LSj) is evaluated for channel Cij. In Mattern's algorithm, a process is not required to store message histories to evaluate the channel states. The state of any channel is the set of all the white messages that are received by a red process on which that channel is incident. A termination detection scheme for non-FIFO channels is required to detect that no white messages are in transit to ensure that the recording of all the channel states is complete.

The savings of not storing and transmitting entire message histories, over the Lai-Yang algorithm, come at the expense of delay in the termination of the snapshot recording algorithm and the need for a termination detection scheme (e.g. a message counter per channel).

5. Snapshots in a causal delivery system

Two global snapshot-recording algorithms, namely, Acharya-Badrinath [1] and Alagar-Venkatesan [2], assume that the underlying system supports causal message delivery. The causal message delivery property CO provides a built-in message synchronization to control and computation messages. Consequently, snapshot algorithms for such systems are considerably simplified. For example, these algorithms do not send control messages (i.e. markers) on every channel and are simpler than the snapshot algorithms for a FIFO system.

Both these algorithms use an identical principle to record the state of processes. An initiator process broadcasts a token, denoted as token, to every process including itself. Let the copy of the token received by process i be denoted tokeni. A process i records its local snapshot LSi when it receives tokeni and sends the recorded snapshot to the initiator.

These algorithms do not require each process to send markers on each channel, and the processes do not coordinate their local snapshot recordings with every other process. Nonetheless, for any two processes i and j the following property (called Property P1) is satisfied:

$$send(m_{ij}) \notin LS_i \Rightarrow rec(m_{ij}) \notin LS_j.$$

This is due to the causal ordering property of the underlying system as explained next. Let a message
Table 1. Comparison of the snapshot algorithms.

Algorithms                       Features
Chandy-Lamport [6], 1985         Baseline algorithm. FIFO systems. O(e) messages to record snapshot.
Spezialetti-Kearns [24], 1986    Improvements to [6]: supports concurrent initiators, efficient assembly and distribution of snapshot. Assumes bidirectional channels. O(e) messages to record, O(rn²) messages to assemble and distribute snapshot.
Venkatesan [23], 1989            Based on [6]. Selective sending of markers. Provides message-optimal incremental snapshots. Ω(n + u) messages to record snapshot.
Helary [11], 1989                Based on [6]. Uses wave synchronization. Evaluates function over recorded global state. Adaptable to non-FIFO systems but requires inhibition.
Lai-Yang [16], 1987              Non-FIFO system. Markers piggybacked on computation messages. Message history required to compute channel states.
Li et al [17], 1987              Similar to [16]. Small message history needed as channel states are computed incrementally.
Mattern [19], 1989               Similar to [16]. No message history required. Termination detection (e.g. a message counter per channel) required to compute channel states.
Acharya-Badrinath [1], 1992      Requires causal delivery support. Centralized computation of channel states. Channel message contents not known. Requires 2n messages, 2 time units.
Alagar-Venkatesan [2], 1993      Requires causal delivery support. Distributed computation of channel states. Requires 3n messages, 3 time units, small messages.

n = # processes, u = # edges on which messages were sent after previous snapshot, e = # channels, r = # concurrent initiators.

mij be such that rec(tokeni) → send(mij). Then send(tokenj) → send(mij) and the underlying causal ordering property ensures that rec(tokenj), at which instant j records LSj, happens before rec(mij). Thus, mij, whose send is not recorded in LSi, is not recorded as received in LSj.

Methods of channel state recording are different in these two algorithms and are discussed next.

5.1. Channel recording in the Acharya-Badrinath algorithm

Each process i maintains arrays SENTi[1, ..., N] and RECDi[1, ..., N]. SENTi[j] is the number of messages sent by process i to process j and RECDi[j] is the number of messages received by process i from process j. The arrays may not contribute to the storage complexity of the algorithm because the underlying causal ordering protocol may require these arrays to enforce causal ordering.

Channel states are recorded as follows: when a process i records its local snapshot LSi on the receipt of tokeni, it includes arrays RECDi and SENTi in its local state before sending the snapshot to the initiator. When the algorithm terminates, the initiator determines the state of channels in the global snapshot being assembled as follows:

(i) The state of each channel from the initiator to each process is empty.
(ii) The state of the channel from process i to process j is the set of messages whose sequence numbers are given by {RECDj[i] + 1, ..., SENTi[j]}.

We now show that the algorithm satisfies conditions C1 and C2.

Let a message mij be such that rec(tokeni) → send(mij). Clearly, send(tokenj) → send(mij) and the sequence number of mij is greater than SENTi[j]. Therefore, mij is not recorded in SCij. Thus, send(mij) ∉ LSi ⇒ mij ∉ SCij. This in conjunction with Property P1 implies that the algorithm satisfies condition C2.

Consider a message mij which is the kth message from process i to process j before i takes its snapshot. The two possibilities below imply that condition C1 is satisfied.

Process j receives mij before taking its snapshot. In this case, mij is recorded in j's snapshot.
Otherwise, RECDj[i] < k ≤ SENTi[j] and the message mij will be included in the state of channel Cij.

This algorithm requires 2n messages and 2 time units for recording and assembling the snapshot, where one time unit is required for the delivery of a message. If the contents of messages in channel states are required, the algorithm requires 2n messages and 2 time units additionally.

5.2. Channel recording in the Alagar-Venkatesan algorithm

A message is referred to as new if the send of the token causally precedes the send of the message. Otherwise, the message is referred to as old. Whether a message is new or old can be determined by examining the vector timestamp in the message, which is needed to enforce causal ordering among messages.

In the Alagar-Venkatesan algorithm [2], channel states are recorded as follows.

(i) When a process receives the token, it takes its snapshot, initializes the state of all channels to empty, and returns a Done message to the initiator. From then onwards, a process includes a message received on a channel in the channel state only if it is an old message.
(ii) After the initiator has received a Done message from all processes, it broadcasts a Terminate message.
(iii) A process stops the snapshot algorithm after receiving a Terminate message.
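The per-process side of the Alagar-Venkatesan steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the class SnapshotProcess, its method names and the tuple-based vector timestamps are hypothetical, and the transport is assumed to deliver messages in causal order already. A message is treated here as new when the send of the token causally precedes its send (the token's timestamp is strictly smaller in the vector-clock order) and as old otherwise, which is the reading under which causal delivery keeps new messages from arriving before the token.

```python
from dataclasses import dataclass, field


def causally_precedes(a, b):
    """Vector-clock order: a -> b iff a <= b componentwise and a != b."""
    return all(x <= y for x, y in zip(a, b)) and tuple(a) != tuple(b)


@dataclass
class SnapshotProcess:
    """One process in the sketch; every name here is hypothetical."""
    pid: int
    local_state: str = ""
    token_ts: tuple = ()
    snapshot: str = ""
    channel_state: dict = field(default_factory=dict)  # sender -> payloads
    recording: bool = False

    def on_token(self, token_ts, senders):
        # Step (i): record the local state, set every incoming channel
        # state to empty, and answer the initiator with Done.
        self.token_ts = tuple(token_ts)
        self.snapshot = self.local_state
        self.channel_state = {s: [] for s in senders}
        self.recording = True
        return "Done"

    def on_message(self, sender, msg_ts, payload):
        # After the snapshot, a received message goes into the channel
        # state only if it is old, i.e. not causally after the token send.
        is_new = causally_precedes(self.token_ts, msg_ts)
        if self.recording and not is_new:
            self.channel_state[sender].append(payload)

    def on_terminate(self):
        # Step (iii): stop recording and return LS and the channel states.
        self.recording = False
        return self.snapshot, self.channel_state


# Process 2 of three; process 0 is the initiator and sends the token at (1, 0, 0).
p = SnapshotProcess(pid=2, local_state="s0")
reply = p.on_token((1, 0, 0), senders=[0, 1])
p.on_message(1, (0, 1, 0), "m_old")  # concurrent with the token send: old
p.on_message(1, (1, 3, 0), "m_new")  # sent after process 1 saw the token: new
ls, sc = p.on_terminate()
```

In this run only m_old ends up in the recorded state of the channel from process 1 to process 2. The initiator's side (collecting Done messages and broadcasting Terminate) is left out to keep the sketch short.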


An interesting observation is that a process receives all the old messages in its incoming channels before it receives the Terminate message. This is ensured by the underlying causal message delivery property.

The causal ordering property ensures that no new message is delivered to a process prior to the token and only old messages are recorded in the channel states. Thus, send(mij) ∉ LSi ⇒ mij ∉ SCij. This together with Property P1 implies that condition C2 is satisfied. Condition C1 is satisfied because each old message mij is delivered either before the token is delivered or before the Terminate is delivered to a process and thus gets recorded in LSj or SCij, respectively.

6. Summary

Recording the global state of a distributed system is an important paradigm in the design of distributed systems, and the design of efficient methods of recording the global state is an important issue. Recording of a global state of a distributed system is complicated by the lack of both a globally shared memory and a global clock in a distributed system. This paper first presented a formal definition of the global state of a distributed system and exposed issues related to its capture; it then described several algorithms to record a snapshot of a distributed system under various communication models.

Table 1 gives a comparison of the salient features of the various snapshot-recording algorithms. Clearly, the higher the level of abstraction provided by a communication model, the simpler the snapshot algorithm. However, there is no best-performing snapshot algorithm and an appropriate algorithm can be chosen based on the application's requirement. For example, for termination detection, a snapshot algorithm that computes a channel state as the number of messages is adequate; for checkpointing for recovery from failures, an incremental snapshot algorithm is likely to be the most efficient; for global state monitoring, rather than recording and evaluating complete snapshots at regular intervals, it is more efficient to monitor changes to the variables that affect the predicate and evaluate the predicate only when some component variable changes.

As indicated in the introduction, the paradigm of global snapshots finds a large number of applications (among others: detection of stable properties, checkpointing, monitoring, debugging, analyses of distributed computation, discarding of obsolete information). Moreover, in addition to the problems they solve, the algorithms presented in this paper are of great importance to people interested in distributed computing, since these algorithms illustrate the incidence of properties of communication channels (FIFO, non-FIFO, causal ordering) on the design of a class of distributed algorithms.

Acknowledgments

The authors are grateful to Professors F Mattern and S Venkatesan for providing useful feedback on an earlier version of the paper.

References

[1] Acharya A and Badrinath B R 1992 Recording distributed snapshots based on causal order of message delivery Information Processing Lett. 44 317-21
[2] Alagar S and Venkatesan S 1994 An optimal algorithm for distributed snapshots with causal message ordering Information Processing Lett. 50 311-6
[3] Babaoglu O and Marzullo K 1993 Consistent global states of distributed systems: fundamental concepts and mechanisms Distributed Systems ed S J Mullender (ACM Press) ch 4
[4] Babaoglu O and Raynal M 1995 Specification and verification of dynamic properties in distributed computations J. Parallel Distributed Systems 28
[5] Birman K and Joseph T 1987 Reliable communication in presence of failures ACM Trans. Comput. Systems 3 47-76
[6] Chandy K M and Lamport L 1985 Distributed snapshots: determining global states of distributed systems ACM Trans. Comput. Systems 3 63-75
[7] Cooper R and Marzullo K 1991 Consistent detection of global predicates Proc. ACM/ONR Workshop on Parallel and Distributed Debugging (May 1991) pp 163-73
[8] Fromentin E, Plouzeau N and Raynal M 1995 An introduction to the analysis and debug of distributed computations Proc. 1st IEEE Int. Conf. on Algorithms and Architectures for Parallel Processing (Brisbane, April 1995) pp 545-54
[9] Geihs K and Seifert M 1986 Automated validation of a cooperation protocol for distributed systems Proc. 6th Int. Conf. on Distributed Computing Systems pp 436-43
[10] Gerstel O, Hurfin M, Plouzeau N, Raynal M and Zaks S 1995 On-the-fly replay: a practical paradigm and its implementation for distributed debugging Proc. 6th IEEE Int. Symp. on Parallel and Distributed Debugging (Dallas, TX, Oct. 1995) pp 266-72
[11] Helary J-M 1989 Observing global states of asynchronous distributed applications Proc. 3rd Int. Workshop on Distributed Algorithms, LNCS 392 (Berlin: Springer) pp 124-34
[12] Hurfin M, Plouzeau N and Raynal M 1993 A debugging tool for distributed Estelle programs J. Comput. Commun. 16 328-33
[13] Kamal J and Singhal M 1992 Specification and verification of distributed mutual exclusion algorithms Technical Report (Columbus, OH: The Ohio State University, Department of Computer and Information Science)
[14] Koo R and Toueg S 1987 Checkpointing and rollback-recovery in distributed systems IEEE Trans. Software Engineering
[15] Kshemkalyani A and Singhal M 1994 Efficient detection and resolution of generalized distributed deadlocks IEEE Trans. Software Engineering 20 43-54
[16] Lai T H and Yang T H 1987 On distributed snapshots Information Processing Lett. 25 153-8
[17] Li H F, Radhakrishnan T and Venkatesh K 1987 Global state detection in non-FIFO networks Proc. 7th Int. Conf. on Distributed Computing Systems pp 364-70
[18] Mattern F 1987 Algorithms for distributed termination detection Distributed Computing pp 161-75
[19] Mattern F 1993 Efficient algorithms for distributed snapshots and global virtual time approximation J. Parallel Distributed Computing 18 423-34
[20] Miller B and Choi J 1988 Breakpoints and halting in distributed programs Proc. 8th Int. Conf. on Distributed Computing Systems pp 316-23
[21] Sarin S and Lynch N 1987 Discarding obsolete information in a replicated database system IEEE Trans. Software Engineering 13 39-47
[22] Taylor K 1989 The role of inhibition in consistent cut protocols Proc. 3rd Int. Workshop on Distributed Algorithms, LNCS 392 (Berlin: Springer) pp 124-34
[23] Venkatesan S 1993 Message-optimal incremental snapshots J. Comput. Software Engineering 1 211-31
[24] Spezialetti M and Kearns P 1986 Efficient distributed snapshots Proc. 6th Int. Conf. on Distributed Computing Systems pp 382-8
[25] Spezialetti M and Kearns P 1989 Simultaneous regions: a framework for the consistent monitoring of distributed systems Proc. 9th Int. Conf. on Distributed Computing Systems pp 61-8
