State Recording Algorithm
This article has been downloaded from IOPscience (http://iopscience.iop.org/0967-1846/2/4/005).
An introduction to snapshot algorithms in distributed computing
Ajay D Kshemkalyani†, Michel Raynal‡ and Mukesh Singhal§
† IBM Corporation, PO Box 12195, Research Triangle Park, NC 27709, USA
‡ IRISA, campus de Beaulieu, 35042 Rennes-cedex, France
§ Department of Computer and Information Science, The Ohio State University,
Columbus, OH 43210, USA
Received 14 November 1994, in final form 26 July 1995
0967-1846/95/040224+10$19.50 © 1995 The British Computer Society, The Institution of Electrical Engineers and IOP Publishing Ltd
computing. The work presented in this paper will be useful to designers of distributed systems and designers of application support mechanisms.

The rest of the paper is organized as follows. Section 2
Marker Sending Rule for process i
(i) Process i records its state.
(ii) For each outgoing channel C on which a marker has not been sent, i sends a marker along C before i sends further messages along C.
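The Marker Sending Rule can be illustrated with a small Python sketch. It is only a toy model under stated assumptions: FIFO channels are modelled as deques, and the class name Process, the MARKER sentinel and the account values are hypothetical names chosen for illustration, not part of the original algorithm description.

```python
from collections import deque

MARKER = "MARKER"  # hypothetical sentinel standing in for the marker message


class Process:
    """Toy process whose outgoing FIFO channels are modelled as deques."""

    def __init__(self, pid, state, out_channels):
        self.pid = pid
        self.state = state              # application state, e.g. an account balance
        self.out = out_channels         # {destination pid: deque of messages}
        self.recorded_state = None      # filled in by rule (i)
        self.marker_sent = set()        # destinations whose channel already has a marker

    def send(self, dest, msg):
        """Send an ordinary computation message on the channel to dest."""
        self.out[dest].append(msg)

    def marker_sending_rule(self):
        # (i) Record the local state.
        self.recorded_state = self.state
        # (ii) Put a marker on every outgoing channel that has none yet,
        #      before any further message is sent on that channel.
        for dest, channel in self.out.items():
            if dest not in self.marker_sent:
                channel.append(MARKER)
                self.marker_sent.add(dest)


p = Process("A", state=550, out_channels={"B": deque(), "C": deque()})
p.send("B", 50)           # sent before the state is recorded
p.marker_sending_rule()
p.send("B", 20)           # sent after the state is recorded, so it follows the marker
```

Because each channel is FIFO, the marker separates the messages sent before the local state was recorded from those sent after it, which is the property the receiving side relies on.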
Figure 5. Applying the rubber-band criterion.

state never occurred in the execution. This happens because a process can change its state asynchronously before the markers it sent are received by other sites and the other sites record their states.

Nevertheless, as we discuss next, the system could have passed through the recorded global state in an equivalent execution [6]. Suppose the algorithm is initiated in global state S_i and it terminates in global state S_t. Let seq be the sequence of events which takes the system from S_i to S_t. Let S* be the global state recorded by the algorithm. Chandy and Lamport [6] showed that there exists a sequence seq' which is a permutation of seq such that S* is reachable from S_i by executing a prefix of seq' and S_t is reachable from S* by executing the rest of the events of seq'.

Thus, the recorded global state is a valid state in an equivalent execution and if a stable property (i.e. a property that persists, such as termination or deadlock) holds in the system before the snapshot algorithm begins, it holds in the recorded global snapshot. Therefore, a recorded global state is useful in detecting stable properties.

A physical interpretation of the collected global state is as follows. Consider the two instants of recording of the local states in the banking example. These instants are marked by crosses in figure 4. If the cut formed by these instants is viewed as being an elastic band and if the elastic band is stretched so that it is vertical, then all the recorded states of all processes occur simultaneously at one physical instant and the recorded global state occurs in the execution that is depicted in this modified timing diagram (figure 5). Note that the system execution would have been like this, had the processors' speeds and message delays been different. Yet another physical interpretation of the collected global state is as follows: all the recorded process states are mutually concurrent, that is, no process state causally depends upon another. Therefore, we can view logically that all these process states occurred simultaneously even though they might have occurred at different instants in physical time.

3.3. Variations of the Chandy-Lamport algorithm

Several variants of the Chandy-Lamport snapshot algorithm followed. These variants refined and optimized the basic algorithm. For example, the Spezialetti and Kearns algorithm [24] optimizes concurrent initiation of snapshot collection and efficiently distributes the recorded snapshot. Venkatesan's algorithm [23] optimizes the basic snapshot algorithm to efficiently record repeated snapshots of a distributed system that are required in recovery algorithms with synchronous checkpointing.

Spezialetti-Kearns method. There are two phases in obtaining a global snapshot: locally recording the snapshot at every process and distributing the resultant global snapshot to all the initiators. Spezialetti and Kearns [24] optimized the Chandy-Lamport algorithm by exploiting the work of combining concurrently initiated snapshots (in the first phase) to efficiently distribute the resultant global snapshot to only the concurrent initiators (in the second phase). A process needs to take only one snapshot, irrespective of the number of concurrent initiators, and not all processes are sent the global snapshot.

This algorithm assumes bidirectional channels in the system. The message complexity of snapshot recording is O(e) irrespective of the number of concurrent initiations of the algorithm. The message complexity of assembling and disseminating the snapshot is O(rn²) where r is the number of concurrent initiations.

Venkatesan's incremental snapshot method. Many applications require repeated collection of global snapshots of the system. For example, recovery algorithms with synchronous checkpointing need to advance their checkpoints periodically. This can be achieved by repeated invocations of the Chandy-Lamport algorithm. However, Venkatesan [23] proposed the following more efficient approach: execute an algorithm to record an incremental snapshot since the most recent snapshot was taken and combine it with the most recent snapshot to obtain the latest snapshot of the system. The incremental snapshot algorithm of Venkatesan [23] modifies the global snapshot algorithm of Chandy-Lamport to save on messages when computation messages are sent only on a few of the network channels between the recording of two successive snapshots.

The incremental snapshot algorithm assumes bidirectional FIFO channels, the presence of a single initiator, a fixed spanning tree in the network, and four types of control messages: initsnap, snap-completed, regular, and ack. initsnap and snap-completed messages traverse spanning edges. regular and ack messages, which serve to record states of non-spanning edges, are not sent on those edges on which no computation message has been sent since the previous snapshot.

Venkatesan [23] showed that the lower bound on the message complexity of an incremental snapshot algorithm is Ω(u + n) where u is the number of edges on which a computation message has been sent since the previous snapshot. Venkatesan's algorithm achieves this lower bound in message complexity.

Helary's wave synchronization method. Helary's snapshot algorithm [11] incorporates the concept of message waves in the Chandy-Lamport algorithm. A wave is a flow of control messages such that every process in the system is visited exactly once by a wave control message, and at least one process in the system can determine when this flow of control messages terminates. A wave is initiated after the previous wave terminates. Wave sequences may be implemented by various traversal structures such as a ring. A process begins recording
the local snapshot when it is visited by the wave control message.

Note that in this algorithm, the primary function of wave synchronization is to evaluate functions over the recorded global snapshot. This algorithm has a message complexity of O(e) to record a snapshot (because all channels can be traversed to implement the wave).

4. Snapshot algorithms for non-FIFO channels

A FIFO system ensures that all messages sent after a marker on a channel will be delivered after the marker. This ensures that condition C2 is satisfied in the recorded snapshot if LS_i, LS_j, and SC_ij are recorded as described in the Chandy-Lamport algorithm. In a non-FIFO system, the problem of global snapshot recording is complicated because a marker cannot be used to separate the messages to be recorded in the global state from those not to be recorded in the global state. In such systems, different techniques have to be used to ensure that a recorded global state satisfies condition C2.

In a non-FIFO system, either some degree of inhibition (i.e. temporarily delaying the execution of an application process or delaying the send of a computation message) or piggybacking of control information on computation messages to capture out-of-sequence messages is necessary to record a consistent global snapshot [22]. The non-FIFO algorithm by Helary uses message inhibition [11]. The non-FIFO algorithms by Lai and Yang [16], Li et al [17] and Mattern [19] use message piggybacking to distinguish computation messages sent after the marker from those sent before the marker.

The non-FIFO algorithm of Helary [11] uses message inhibition to avoid an inconsistency in a global snapshot in the following way. When a process receives a marker, it immediately returns an acknowledgement. After a process i has sent a marker on the outgoing channel to process j, it does not send any messages on this channel until it is sure that j has recorded its local state. Process i can conclude this if it has received an acknowledgement for the marker sent to j, or has received a marker for this snapshot from j.

We next discuss snapshot recording algorithms for systems with non-FIFO channels that use piggybacking on computation messages.

4.1. Lai-Yang algorithm

Lai and Yang's global snapshot algorithm for non-FIFO systems [16] is based on two observations on the role of a marker in a FIFO system. The first observation is that a marker ensures that condition C2 is satisfied for LS_i and LS_j when the snapshots are recorded at processes i and j, respectively. The Lai-Yang algorithm fulfils this role of a marker in a non-FIFO system by using a colouring scheme on computation messages as follows.

(i) Every process is initially white and turns red while taking a snapshot. The equivalent of the 'Marker Sending Rule' is executed when a process turns red.
(ii) Every message sent by a white (red) process is coloured white (red). Thus, a white (red) message is a message that was sent before (after) the sender of that message recorded its local snapshot.
(iii) Every white process takes its snapshot at its convenience, but no later than the instant it receives a red message.

Thus, when a white process receives a red message, it records its local snapshot before processing the message. This ensures that no message sent by a process after recording its local snapshot is processed by the destination process before the destination records its local snapshot. Thus, an explicit marker message is not required in this algorithm and the 'marker' is piggybacked on computation messages using a colouring scheme.

The second observation is that the marker informs process j of the value of {send(m_ij) | send(m_ij) ∈ LS_i} so that transit(LS_i, LS_j) can be computed. The Lai-Yang algorithm fulfils this role of the marker in the following way.

(iv) Every white process records a history of all white messages sent or received by it along each channel.
(v) When a process turns red, it sends these histories along with its snapshot to the initiator process that collects the global snapshot.
(vi) The initiator process evaluates transit(LS_i, LS_j) for each channel C_ij as given below:

SC_ij = {send(m_ij) | send(m_ij) ∈ LS_i} − {rec(m_ij) | rec(m_ij) ∈ LS_j}.

Condition C2 holds because a red message is not included in the snapshot of the recipient process and a channel state is the difference of two sets of white messages. Condition C1 holds because a white message m_ij is included in the snapshot of process j if j receives m_ij before taking its snapshot. Otherwise, m_ij is included in the state of channel C_ij.

Though marker messages are not required in the algorithm, each process has to record the entire message history on each channel as part of the local snapshot. Thus, the space requirements of the algorithm may be large. Lai and Yang describe how the size of the local storage and snapshot recording can be reduced by storing only the messages sent and received since the previous snapshot recording, assuming that the previous snapshot is still available. This approach can be very useful to applications that require repeated snapshots of a distributed system.

4.2. Li et al's algorithm

Li et al's algorithm [17] for recording a global snapshot in a non-FIFO system is similar to the Lai-Yang algorithm. Markers are tagged so as to generalize the red/white colours of the Lai-Yang algorithm to accommodate repeated invocations of the algorithm and multiple initiators. In addition, the algorithm is not concerned with the contents of computation messages and the state of a channel is computed as the number of messages in transit in the channel. This simplification is combined with the
incremental technique to compute channel states, also outlined by Lai and Yang, which reduces the size of message histories to be stored and transmitted. The initiator computes the state of C_ij as: (the number of messages in C_ij in the previous snapshot) + (the number of messages sent on C_ij since the last snapshot at i) − (the number of messages received on C_ij since the last snapshot at j).

Though this algorithm does not require any additional message to record a global snapshot provided computation messages are eventually sent on each channel, the local storage and the tags on computation messages are of size O(n), where n is the number of initiators.

4.3. Mattern's algorithm

Mattern's algorithm [19] is based on vector clocks. In vector clocks, the clock at a process is an integer vector of length n, with one component for each process. Each process advances its own component independently, and advances the component of another process whenever it learns, through messages, that the component value has advanced.

Mattern's algorithm assumes a single initiator process and works as follows.

(i) The initiator 'ticks' its local clock and selects a future vector time s at which it would like a global snapshot to be recorded. It then broadcasts this time s and freezes all activity until it receives acknowledgements of the receipt of this broadcast.
(ii) When a process receives the broadcast, it remembers the value s and returns an acknowledgement to the initiator.
(iii) After having received an acknowledgement from every process, the initiator increases its vector clock to s and broadcasts a dummy message to all processes. (Observe that before broadcasting this dummy message, the local clocks of other processes have a value ≱ s.)
(iv) The receipt of this dummy message forces each recipient to increase its clock to a value ≥ s if not already ≥ s.
(v) Each process takes a local snapshot and sends it to the initiator when (just before) its clock increases from a value less than s to a value ≥ s. Observe that this may happen before the dummy message arrives at the process.
(vi) The state of C_ij is all messages sent along C_ij, whose timestamp is smaller than s and which are received by p_j after recording LS_j.

Processes record their local snapshot as per rule (v). Any message m_ij sent by process i after it records its local snapshot LS_i has a timestamp > s. Assume that this m_ij is received by j before it records LS_j. After receiving this m_ij and before j records LS_j, j's local clock reads a value > s, as per rules for updating vector clocks. This implies j must have already recorded LS_j as per rule (v), which contradicts the assumption. Therefore, m_ij cannot be received by j before it records LS_j. By rule (vi), m_ij is not recorded in SC_ij and therefore, condition C2 is satisfied. Condition C1 holds because each message m_ij with a timestamp less than s is included in the snapshot of process j if j receives m_ij before taking its snapshot. Otherwise, m_ij is included in the state of channel C_ij.

The following observations about the above algorithm lead to various optimizations. (i) The initiator can be made a 'virtual' process; so, no process has to freeze. (ii) As long as a new higher value of s is selected, the phase of broadcasting s and returning the acks can be eliminated. (iii) Only the initiator's component of s is used to determine when to record a snapshot. Also, one needs to know only if the initiator's component of the vector timestamp in a message has increased beyond the value of the corresponding component in s. Therefore, it suffices to have just two values of s, say, white and red, which can be represented using one bit.

With these optimizations, the algorithm becomes similar to the Lai-Yang algorithm except for the manner in which transit(LS_i, LS_j) is evaluated for channel C_ij. In Mattern's algorithm, a process is not required to store message histories to evaluate the channel states. The state of any channel is the set of all the white messages that are received by a red process on which that channel is incident. A termination detection scheme for non-FIFO channels is required to detect that no white messages are in transit to ensure that the recording of all the channel states is complete.

The savings of not storing and transmitting entire message histories, over the Lai-Yang algorithm, come at the expense of delay in the termination of the snapshot recording algorithm and the need for a termination detection scheme (e.g. a message counter per channel).

5. Snapshots in a causal delivery system

Two global snapshot-recording algorithms, namely, Acharya-Badrinath [1] and Alagar-Venkatesan [2], assume that the underlying system supports causal message delivery. The causal message delivery property CO provides a built-in message synchronization to control and computation messages. Consequently, snapshot algorithms for such systems are considerably simplified. For example, these algorithms do not send control messages (i.e. markers) on every channel and are simpler than the snapshot algorithms for a FIFO system.

Both these algorithms use an identical principle to record the state of processes. An initiator process broadcasts a token, denoted as token, to every process including itself. Let the copy of the token received by process i be denoted token_i. A process i records its local snapshot LS_i when it receives token_i and sends the recorded snapshot to the initiator.

These algorithms do not require each process to send markers on each channel, and the processes do not coordinate their local snapshot recordings with every other process. Nonetheless, for any two processes i and j the following property (called Property P1) is satisfied:

send(m_ij) ∉ LS_i ⇒ rec(m_ij) ∉ LS_j.

This is due to the causal ordering property of the underlying system as explained next. Let a message
m_ij be such that rec(token_i) → send(m_ij). Then send(token_j) → send(m_ij) and the underlying causal ordering property ensures that rec(token_j), at which instant j records LS_j, happens before rec(m_ij). Thus, m_ij, whose send is not recorded in LS_i, is not recorded as received in LS_j.

Methods of channel state recording are different in these two algorithms and are discussed next.

5.1. Channel recording in the Acharya-Badrinath algorithm

Each process i maintains arrays SENT_i[1, ..., N] and RECD_i[1, ..., N]. SENT_i[j] is the number of messages sent by process i to process j and RECD_i[j] is the number of messages received by process i from process j. The arrays may not contribute to the storage complexity of the algorithm because the underlying causal ordering protocol may require these arrays to enforce causal ordering.

Channel states are recorded as follows: when a process i records its local snapshot LS_i on the receipt of token_i, it includes arrays RECD_i and SENT_i in its local state before sending the snapshot to the initiator. When the algorithm terminates, the initiator determines the state of channels in the global snapshot being assembled as follows:

(i) The state of each channel from the initiator to each process is empty.
(ii) The state of the channel from process i to process j is the set of messages whose sequence numbers are given by {RECD_j[i] + 1, ..., SENT_i[j]}.

We now show that the algorithm satisfies conditions C1 and C2.

Let a message m_ij be such that rec(token_i) → send(m_ij). Clearly, send(token_j) → send(m_ij) and the sequence number of m_ij is greater than SENT_i[j]. Therefore, m_ij is not recorded in SC_ij. Thus, send(m_ij) ∉ LS_i ⇒ m_ij ∉ SC_ij. This in conjunction with property P1 implies that the algorithm satisfies condition C2.

Consider a message m_ij which is the kth message from process i to process j before i takes its snapshot. The two possibilities below imply that condition C1 is satisfied.

Process j receives m_ij before taking its snapshot. In this case, m_ij is recorded in j's snapshot.

Otherwise, RECD_j[i] < k ≤ SENT_i[j] and the message m_ij will be included in the state of channel C_ij.

This algorithm requires 2n messages and 2 time units for recording and assembling the snapshot, where one time unit is required for the delivery of a message. If the contents of messages in channel states are required, the algorithm requires 2n messages and 2 time units additionally.

5.2. Channel recording in the Alagar-Venkatesan algorithm

A message is referred to as old if the send of the message causally precedes the send of the token. Otherwise, the message is referred to as new. Whether a message is new or old can be determined by examining the vector timestamp in the message, which is needed to enforce causal ordering among messages.

In the Alagar-Venkatesan algorithm [2], channel states are recorded as follows.

(i) When a process receives the token, it takes its snapshot, initializes the state of all channels to empty, and returns a Done message to the initiator. From then on, a process includes a message received on a channel in the channel state only if it is an old message.
(ii) After the initiator has received a Done message from all processes, it broadcasts a Terminate message.
(iii) A process stops the snapshot algorithm after receiving a Terminate message.
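Returning to the Acharya-Badrinath rule of section 5.1, the initiator's computation of channel states from the SENT and RECD arrays can be sketched in Python. This is a minimal sketch rather than the authors' code: the function name channel_states, the dictionary layout of the collected snapshots and the sample counts are assumptions made for illustration.

```python
def channel_states(initiator, snapshots):
    """Assemble channel states from the SENT and RECD arrays of each local snapshot.

    snapshots maps pid -> {"SENT": {dest: count}, "RECD": {src: count}}
    (a hypothetical layout). Returns {(i, j): sequence numbers in transit from i to j}.
    """
    states = {}
    for i in snapshots:
        for j in snapshots:
            if i == j:
                continue
            if i == initiator:
                # Rule (i): every channel leaving the initiator is empty.
                states[(i, j)] = []
            else:
                # Rule (ii): the channel holds messages RECD_j[i] + 1, ..., SENT_i[j].
                sent = snapshots[i]["SENT"].get(j, 0)
                recd = snapshots[j]["RECD"].get(i, 0)
                states[(i, j)] = list(range(recd + 1, sent + 1))
    return states


# Three processes; process 0 is the initiator. The counts are made-up sample data.
snapshots = {
    0: {"SENT": {1: 2, 2: 0}, "RECD": {1: 1, 2: 0}},
    1: {"SENT": {0: 3, 2: 1}, "RECD": {0: 2, 2: 0}},
    2: {"SENT": {0: 0, 1: 2}, "RECD": {0: 0, 1: 1}},
}
states = channel_states(initiator=0, snapshots=snapshots)
```

In this sample, process 1 has sent three messages to process 0 but only one was received before the snapshot at process 0, so messages 2 and 3 form the state of the channel from 1 to 0. Only sequence numbers are produced here; recovering message contents would need the additional exchange mentioned in the text.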
[22] Taylor K 1989 The role of inhibition in consistent cut protocols Proc. 3rd Int. Workshop on Distributed Algorithms LNCS 392 (Berlin: Springer) pp 12/1-34
[23] Venkatesan S 1993 Message-optimal incremental snapshots J. Comput. Software Engineering 1 211-31
[24] Spezialetti M and Kearns P 1986 Efficient distributed snapshots Proc. 6th Int. Conf. on Distributed Computing Systems pp 382-8
[25] Spezialetti M and Kearns P 1989 Simultaneous regions: a framework for the consistent monitoring of distributed systems Proc. 9th Int. Conf. on Distributed Computing Systems pp 61-8