0% found this document useful (0 votes)
11 views10 pages

Pfoceduf 8

This paper introduces a new replication algorithm for distributed systems that enhances availability through a primary copy method. The algorithm ensures that even in the event of node crashes or network partitions, services remain accessible by reorganizing backups and utilizing a unique timestamp called a viewstamp to track state changes. The method guarantees minimal delay in user computations and maintains data integrity during view changes, making it suitable for highly-available services.

Uploaded by

prathamshah354
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
11 views10 pages

Pfoceduf 8

This paper introduces a new replication algorithm for distributed systems that enhances availability through a primary copy method. The algorithm ensures that even in the event of node crashes or network partitions, services remain accessible by reorganizing backups and utilizing a unique timestamp called a viewstamp to track state changes. The method guarantees minimal delay in user computations and maintains data integrity during view changes, making it suitable for highly-available services.

Uploaded by

prathamshah354
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 10

Viewstamped Replication: A New Primary Copy Method to

Support Highly-Available Distributed Systems

Brian M. Oki
Barbara H. Liskov
Massachusetts institute of Technology
Laboratory for Computer Science
Cambridge, MA 02139

Abstract Our algorithm runs on a system consisting of nodes connected by


a communication network. Nodes are independent computers that
One of the potential benefits of distributed systems is their use in communicate with each other only by sending messages over the
providing highly-available services that are likely to be usable when network. Although both nodes and the netwolrk may fail, we assume
needed. Availabilay is achieved through replication. By having inore these failures are not byzantine [24]. Nodes can crash, but we
than one copy of information, a service continues to be usable even assume they are faiistop processors (341. The network may bse,
when some copies are inaccessible, for example, because of a crash delay, and duplicate messages, or delver messages out of order.
of the computer where a copy was stored. This paper presents a Link failures may cause the network to parlition into subnetworks that
new replication algorithm that has desirable performance properties. are unable to communicate with each other. We assume that nodes
Our approach is based on the primary copy technique. eventually recover from crashes and partitions are eventually
Computations run at a primary. which notifies its backups of what it repaired.
has done. If the primary crashes, the backups are reorganized, and
one of the backups becomes the new primary. Our method works in Our replication method assumes a model of computation in which
a general network with both node crashes and partitions. Replication a distributed program consists of modu/es, each of which resides at
a single node of the network. Each module contains within it both
causes little delay in user computations and little information is lost in
a reorganization; we use a special kind of timestamp called a data objects and code that manipulates the objects; modules can
viewstamp to detect lost information. recover from crashes with some of their state intact. No other
module can access the data objects of another module directly.
instead, each module provides procedures that can b8 used to
1 Introduction access its objects; modules communicate by means of remote
One of the potential benefits of distributed systems is their use in pfoceduf8 calls. Modules that make calls (are called clients; the
called module is a server.
providing highly-available services, that is, services that are likely to
be up and accessible when needed. Availability is essential to many Modules are the unit of replication in our method. Ideally,
computer-based services; for example, in airline reservation systems
programmers would write programs without concern for availability in
the failure of a single computer can prevent ticket sales for a
some (distributed) programming language that supports our model of
considerable time, causing a loss of revenue and passenger
computation. The language implementation thlen uses our technique
goodwill. to replicate individual modules automatically; ithe resulting programs
are highly available.
Availability is achieved through replication. By having more than
one copy of important information, the service continues to be usable We assume that computations run as atomic transactions [14].
even when some copies are inaccessible, for example, because of a Our method guarantees the one-copy serializability correctness
crash of the computer where a copy was stored. Various replication
criterion [3, 331: the concurrent execution of transactions on
algorithms have been proposed to achieve availability
replicated data is equivalent to a serial execution on non-replicated
[2, 4, 9, 11, 12, 16, 21,351. This paper presents a new replication
data.
algorithm that has desirable performance properties.
Our approach is based on the primary o~py technique [t, 361.
which works roughly as follows. One replica is designated the
primary; the others are backups. The primary is responsible for the
This research was supported in part by the Advanced Research processing of transactions that use its objects; it notifies the backups
Projects Agency of the Department of Defense, monitored by the of what it has done. When a replica crashes or is separated from the
Office of Naval Research under contract NO0014-83-K-01 25, and in others by a partition. or when a replica recovers from a crash or a
parl by the National Science Foundation under granl DCR-8503662. parfition is repaired, the replicas are reorganized and a new primary
is selected if necessary. We refer to this reorganization as a view
change (131. Once the view change is complete, the (new) primary
can continue with transaction processing.
Permission to copy without fee all or part of this material is granted pro-
vided that the copies are not made or distributed for direct commercial The primary copy technique as originally proposed worked only if
advantage, the ACM copyright notice and the title of the publication and node failures were distinguishable from network failures; in general
its date appear, and notice is given that copying is by permission of the such a distinction cannot be made and our method does not require
Association for Computing Machinery. To copy otherwise, or to republish,
requires a fee and/or specific permission. it. In addition, our method exhibits useful performance properties.
Transactions encounter little delay in interacting with the replicas, yet
0 1988 ACM O-89791-277-2/88/0007/0008 $1.50 little information is lost in a view change. Remote procedure calls to

8
access or modify objects are executed entirely at the primary, which information survives a view change. Because a timestamp is
notifies the backups in background mode. If a view change meaningful only within a view, we introduce viewstamps. A
happens, the effects of a call may or may not survive into the new viewstamp is simply a timestamp concatenated with the viewid of the
view. If they do survive, the transaction can commit; otherwise, it view In which the timestamp was generated; we refer to the parts of
must abort. We use a special kind of timestamp called a viewsramp viewstamp Y as v.id and v.ts. Each cohort maintains a history.
to distinguish the two situations. consisting of a sequence of viewstamps. each with a different viewfd.
We guarantee that for each viewstamp v in its history, the cohort’s
We begin in the next section with an overview of our method. state reflects event e from view v.id iff e’s timestamp is less than or
Sections 3 and 4 describe the two parts of the method, transaction equal to v.ts.
processing and view changes. Section 5 discusses how our method
compares with other replication techniques. We conclude in Section The correctness of our algorithm depends on the interaction of
6 with a summary of what we have accomplished. transaction processing and the view change algorithm. Transaction
processing guarantees that transactions are serialized properly. In
addition, it guarantees that a transaction can commit only if all its
2 Overview of our Method events are known to at least a majority of cohorts. The view change
The method replicates individual modules to obtain module algorithm guarantees that events known to a majority of cohorts
groups. A module group consists of several copies of the module, survive into subsequent views. Thus, events of committed
called cohods, which behave as a single, logical entity; the program transactions will survive view changes. Not all events survive view
can indicate the number of cohorts when the group is created. The changes, however; for example, the processing of a particular
set of cohorts is the group’s configuration. Each cohort has a unique remote calf may be lost. We use the hisfory plus some information
name called an mid; the group as a whole bears a unique groupid. that arrives in the prepare message to ensure that the transaction
We expect a small number of cohorts per group, on the order of will be forced to abort in such a case. On the other hand, if the
three or five. history and the information in the prepare message indicate that all
the events associated with the transaction survived the view change,
One cohort is designated as the primary; it executes procedure the transaction can commit.
calls, and participates in two-phase commit.
In the next two sections we describe our technique. First, we
The remaining cohorts are backups, which are essentially passive describe transaction processing and then the view change algorithm.
and merely receive state information from the primary.

Failures and recoveries are masked when they are noticed. Over 3 Running Transactions
time the communication capability within a group may change as Our system runs transactions in a manner similar to a system
cohorts or communication links fail and recover. To reflect this without replication. There are two main differences: we use
changing situation, each cohort runs in a view. A view is a set of viewstamps to determine whether a transaction can commit, and
cohorts that are (or were) capable of communicating with each other, instead of writing Information to stable storage [25] during two-phase
together with an indication of which cohort is the primary; it is a commit, the primary sends it to the backups using the
subset of the configur&ion and must contain a majority of group communication buffer discussed above.
members.
The part of a cohort’sstate that affects transaction processing is
A group switches to a new view by executing a view change summarized in Figure 1. Each cohort has a status; if it is “active,” it
protocol; our protocol is a simpfiffcation and modification of the virtual can participate in transaction processing, and otherwise it is involved
partitions protocol [12]. Each view is identified by a unique viewid; in a view change. We say that a cohort is active if its status is
we guarantee that viewids are totally ordered. The view change “active”; otherwise it is inactive. The g&ate consists of all objects
protocol generates a new view and viewid. If a majority of cohorts that constitute the group state. Each object has a unique name
accept the new view, cohorts switch to the new, active view; (relative to the group) and a current value, and also whatever
otherwise, they remain in their old views, but the views become information is needed to implement synchronization and recovery. In
inactive. Transactions are processed only in active views.

Views and viewids reflect the current communication patterns, but status: status % cohort is active or doing a view change
not the information about committed and active transactions that gstate: (object) % objects in the group state
have run at the group. This additional information is obtained by mygroupid: int % the name of the module group
using fimestamps. Timestamps are unique within a view and form a cur-viewid: viewid % the current viewid
total order: they are generated by the primary and are easy to cur-view: view % the current view
produce, for example, by incrementing a local counter. The primary history: [viewstamp] % indicates events known to cohort
generates a new timestamp each time it needs to communicate timestamp: int % the timestamp generator
information to its backups; we refer to each such occurrence as an buffer: [event-record] % the communication buffer
event. Examples of events are the completion of processing of a
where
remote calf or of a prepare or commit message. Each event is
assigned a unique timestamp, and later events receive later Status = oneof[active, view-manager, underling: null]
timestamps. Instead of checkpointing events directly to the backups, object - cuid: int, base: T, lockers: (lock-info)>
the primary maintains a communication buffer (similar to a fifo lock-info = clocker: aM, info: oneofjread: null, write: TJ>
queue) to which it writes event records. An event record identifies viewid = <cnt: int, mid: int>
the type of the event, and contains other relevant information about view = <primary: int, backups: (int},
the event. Information in the buffer is sent to the backups in viewstamp = <id: viewid. ts: int>
timestamp order. The buffer implementation provides reliable
delivery of event records to all backups in the primary’s view; if it fails
to defiver a message, then a crash or communication failure has Flgure 1: Partial State of a Cohort: () denotes a set, [ 1denotes
occurred that will cause a view change. a sequence, oneof means a tagged union wilh component
tags and types as indicated, and CX=denotes a record, with
We use timestamps as an inexpensive way of determining what component names and types as indicated.

9
the remainder of this paper, we assume that transactions 3.1 Actlve Prlmarles of Clients
synchronized by means of strict P-phase locking [18] with read and Recall that we intend to use viewstamps to determine whether a
write locks. Each object has a base version of some type T; different transaction can commit. Each time a server finishes processing a
objects can be of different types, but we ignore these differences in remote call on behalf of a transaction, sit assigns the call a
the paper. A transaction modifies a tentaUve version, which is viewstamp. Information about these viewstamps is collected as the
discarded if the transaction aborts and becomes the base version if it transaction runs in a data structure called the ,oset, which is a set of
commits. Thus, in addition to its name and base version, an object cgroupid: int, vs: viewstamp>
contains a set of lockers that identifies transactions holding locks on
the objects, the kinds of locks held, and any tentative versions pairs. The pser contains an entry for every call made by the
created for them. transaction; a pair cQ, v, indicates that group g ran a call for the
transaction and as ;igned it viewstamp v.
The primary uses the buffer to communicate information about
events to the backups; the implementation of the buffer guarantees The processing at the client’s primary is summarized in Fiiure 2.
reliable delivery of event records to all backups in timestamp order. When a transaction is created, it receives a unique transaction
We distinguish between writing and forcing information to the buffer; identifier aid and an empty pset. (We make the aid unique across
a similar distinction is made in transaction systems that use stable view changes by including mygrcupid and cur-viewid in it.) To make
storage. Writing means that the information will be delivered to the a remote call, the system looks up the primary and viewid for the
backups at a convenient time; this is accomplished by calling the add group in its cache, initializing the cache if necessary, and then sends
operation on the buffer. Add takes an event record as an argument. the call message to the primary. The message contains the viewid
It atomically assigns the event a timestamp (advancing the from the cache, a unique call id (to prevent duplicate processing of a
timestamp and updating the history in the process) and adds the single call), and information about the call itself (the procedure name
event record to the buffer; it returns the event’s viewstamp. There and the arguments).
may be concurrent execution within a module, so the implementation
of add must serialize the use of the buffer and ensure that event There are three possible results of such a message. The first, and
records are recorded in the buffer in timesramp order. most likely, is a reply message for the call. The reply message
contains a pset that records cgruupid, viewslamp pairs for this call
The farce-to operation is used to force the buffer. Since
sometimes it is not necessary to force the entire buffer, the operation
takes a viewstamp v as an argument. If the viewstamp is not for the Starting a transactlon:
current view it returns immediately; othenvise it waits until a
sub-major&y of backups know about all events in the current view Create the transaction aid and an empty pset.
with timestamps less than or equal to V.&Z.’ A sub-majority is one
less than a majority of the configuration; if a sub-majority of backups Maklng a remote call:
knows about an event, then a majority of the cohorts in the 1. Look up the server in th$ cache, updating the cache if
configuration knows about that event. As mentioned earlier, if a necessary. Send the call message to the primary; the
majority of cohorts knows some information, the view change message contains the unique call id and also the
algorithm guarantees that the information will be known in all viewid obtained from the cache.
subsequent views. 2. If a repfy message arrives, add the elements of the
pset in the reply message to the tran!;action’s pset.
Running transactions requires the collaboration of both clients and User code at the client can now continue running.
servers. Clients create transactions, make any remote calls they
contain, and act as coordinators of two-phase commit. Servers 3. If there is no reply, abort the transaction: send abort
process remote calls and participate in two-phase cammft; in messages to the participants (determiined from the
processing a call, a server may make further calls. pset), and add an <“aborted”, aid> record to the buffer.
4. If the reply indicates that the view has charged, update
We assume the system provides a highly-available location server the cache, if possible, and QOto sfep 1. If a more
that maps Qroupids to configurations; various implementations are recent view cannot be discovered, atlorl the
discussed in [15, 20, 22, 311.* To find a server it has not used transaction as described above.
before, a cohort fetches the configuration from the location server
and communicates with members of the configuration to determine Coordinator for two-phase Commit:
the current primary and viewid. It stores this information in a local 1. Send prepare messages containing tlhe aid and pset to
cache. the participants, which can be determined from the
pset.
Below we discuss the work done by active primaries of clients and 2. If all participants agree to commit, release any local
servers, other processing at cohorts, and processing of query locks held by the transaction and install its tentative
messages. We assume that both clients and servers are replicated; versions, add a <“committing”, plist, aid:. record to the
we discuss an alternative to replicating clients in Section 3.5. Our buffer, where the plist is a list of non-read-only
discussion assumes that transactions are one-level; we discuss participants, and then do a force-fo(new-vs), where
nested transactions in Section 3.6. new-vs is the viewstamp returned by the call on the
addoperation. Send commif messages to the
participants; when all of them acknowledge the
commit, add a <“done*, aid> record to the buffer.
‘Force-to delays its caller, but other work, including adding and 3. If there is no answer after repeated tries, update the
forcing the buffer, can still go on at the cohott in other processes. If cache, if possible, and retry the prepare. If a more
communication with some backups is impossible, the calf of force-to recent view cannot be discovered, or if any participant
will be abandoned, and the cohort will switch to running the view refuses to prepare, discard any local locks and
change algorithm. versions held by the transaction and send aborf
messages to the participants. Add an -?‘aborted”. aid>
*Note that the location server defines the limits of availability: no record to the buffer.
module group can be more available than it is. Flgure 2: Processing at the Active Primary of a Client.

10
and any further remote calls made in processing it. The pairs in the compatible(ps, g, vh) =
reply’s pset are added to the transaction’s psef. Q p E ps (p.groupid = g 3
V v E vh (p.vs.id = v.id ti p.vs.ts i v.ts))
The second possibility is no reply at all (after a sufficient number If the pser is not compatible with the hisfory, it refuses the prepare.
of probes). In this case, we abort the transaction; we also attempt to Otherwise, it computes the viewstamp of the most recent
update the cache, so that the next use of the server will not cause an “completed-call” event by calling vs-max(pset, mygmup~~:
abort. The transaction must abort because we cannot know whether
vs_max(ps, 9) = p.vs s.t.
the call message would be a duplicate if we sent it to a new primary. p E ps & p.groupid = g & V p’ E ps (p’.groupid = g
The message might be a new one, or it might be a duplicate for a Call 3 p’.vs.id < p.vs.id v (p’.vs.id = p.vs.id & p’.vs.ts 5 p.vs.ts))
that ran before the view change or was running when the view
change happened. In the first case, we need to do the call; in the
second case, we must not redo it. To resolve this uncertainty, we It uses this viewstamp to force the buffer to ensure that all
aborf the transaction. “completed-call” events are known to at least a sub-majority of
backups and then sends an acceptance to the coordinator.
The lhird possibility is a reply indicating that the view has
changed. In this case, we update the cache and retry the call. We When it receives a commit message, the primary forces a
assume the message delivery system maintains some connection “committed” record to the buffer and then sends an acknowledgment
information that enables it to not deliver duplicate messages even in to the coordinator. If it receives an abort message, it adds an
the case when the module crashes and recovers between deliveries. “aborted” record to the buffer.
If duplicate messages are possible, we must abort the transaction In
this case too.
3.3 Other Processing at Cohorts
When the transaction commits, the client’s primary acts as the Cohorts that are not active primaries reject messages sent to them
coordinator of the two-phase commit protocol [19]. It determines by other module groups, except for queries as discussed in the next
who the participants are from the psef. It sends the pset in the section. The response to the re jetted message COntainSinformation
prepare messages to allow each participant to determine whether it about the current viewid and primary if the cohort knows them (for
knows all events of the preparing transaction. example, if it is a backup in an active view).

If all participants agree to prepare, the coordinator adds a


“committing” record to its buffer and forces the entire buffer to the Processing a call:
backups. This ensures that the commit wilt be known across a view 1. If the viewid in the call message is not equal to the
change of the coordinator. The “committing” record lists only the primary’s cur-viewid, send back a rejection message
participants where the transaction holds write locks, since only these containing the new viewid and view.
must take part in phase two; the reply from a participant indicates 2. Create an empty pset. Then run the call. If it makes
whether or not it is read-only. Then the coordinator sends commit any nested calls, process them as described in Figure
messages, and, when all are acknowledged, adds a “done” record to 2.
the buffer. Note that user code can continue running as soon as the
“committing” record has been forced to the backups. 3. When the call finishes, add a &ompleted-call”, object-
list, aid> record to the buffer; the object-list lists all
If the transaction aborts, or if any participant refuses the prepare, objects used by the remote call, together with the type
the coordinator sends abort messages to the participants and adds of lock acquired and the tentative version if any. Add a
emygroupid, new-vs> pair to the pset, where new-vs
an “aborted” record to the buffer. This record is not really needed is the viewstamp returned by the call on the add
because a view change at the coordinator that leads to a new operation of the buffer, and send back a reply message
primary will cause any of the group’s transactions to aboti containing the pset.
automatically. (To avoid such aborts would require some kind of
checkpoint mechanism [f7j.) However, the record is useful for query Processing a Prepare Message:
processing as discussed in Section 3.4. 1. If wmpati&le(pset, history, mygroupid), perform a
forceJo( vs-max(psef, mygroupid)), release read locks
held by the transaction, and then reply prepared. In
3.2 Active Primaries of Servers the reply message, indicate whether the transaction
Servers process remote calls and act as participants in two-phase held only read locks at this participant. If the
commit. Each time a call completes, the primary assigns it a transaction is read-only, add a <“committed”, aid>
viewstamp, and returns this information in the reply message. The record to the buffer.
primary can agree to prepare only if it knows about all remote calls 2. Otherwise, send a message to the coordinator refusing
its group has done on behalf of the preparing transaction. It uses its the prepare and abort the transaction: discard its locks
history and the pset in the prepare message to determine this. and versions and add an c”abort”. aid> event record to
the buffer.
Processing at the primary of the server is summarized in Figure 3.
When the primary receives a call message, tt rejects the call if the PrOceSSlng a Commit Message:
call’s viewid is not equal to cur-viewid. Otherwise, it creates an 1. Release locks and install versions held by the
empty pset and runs the call, possibly making further nested calls as transaction. Add a c’%ommitted”, aid> record to the
described above. When the call completes, it adds a “completed- buffer, do a force-to(new-vs), where new-vs is the
call” record to the buffer; this record identifies each atomic object that VieWStampreturn by add, and send a done message to
was read or written in processing the call, together with the type of the coordinator.
lock obtained and the tentative version if any. Then it adds a pair for
this call to the call’s pset and returns the psetin the reply message. Processing an Abort Message:
1. Discard locks and versions held by the aborted
When the primary receives a prepare message, it checks whether tranSaCtiOnand add an <“aborted”, aid> record to the
it knows about all calls made by the transaction to its group by calling buffer.
wmpafi&fe@sef, mygroupid, history): Figure 3: Processing at the Active Primary of a Server,

11
Active backups receive messages containing information from the concurrency within a transaction in a way that allows the concurrent
communication buffer. They process event records in timestamp activities to be serialized. Second, they provide a checkpointing
order, updating the state accordingly. The backup can simply store mechanism: if some part of a transaction cannot complete, we can
the records, or it can perform them, for example, by setting locks and avoid aborting the entire transaction by running that part as a
Creating versions for a “completed-call” record. There is a tradeoff subactiin.
here between the amount of processing at the backups, and how
much work is needed during a view change before a backup can Checkpointing is what allows us to minimize the effects of view
become a primary. Perhaps a good compromise is to store changes. If the call is made as a subaction, we need not abort the
“completed-call” records (as part of the gs&fe) until the “committed” entire transaction if there is no reply. Instead. we can abort just the
or “aborted” record for the call’s transaction is received; at this point subaction, and then do the call again as a new subaction. An
records for a committed transaction would be processed, while those algorithm for our ,nethod in a system with nested transactions is
for an aborted transaction would be discarded. described in[32]: lt is based on the implementation of nested
transactions in Argus (26, 281.

3.4 Queries Subactions are an economical way to cope with view changes.
Our implementation does not guarantee that all messages about They are not expensive to implement [27j; they are much cheaper
transaction events arrive where they might be needed. For example, than either of the alternatives for avoiding aborts sketched above.
if the transaction aborts, we send abort messages to the participants, Furthermore, we need to abort and redo a call subaction Only when
but do not guarantee they will arrive. Instead, a cohort that needs to the view changes; thus we do extra work only when the problem
know whether an abort occurred sends a query to another cohofi arises.
that might know. For example, the primary of the participant can
send a query to the primary of the coordinator.
3.7 Dlscusslon
To speed up the processing of queries, we allow any cohort to There is a one-to-one correspondence between event records and
respond to a query whenever it knows the answer. For example, a information written to stable storage by a coniventional transaction
cohort that is not a primary may know about the abort of a system and therefore our system works because a conventional one
transaction because B received the “aborted” event record from the does. The “completed-call” records are equivalent to the data
primary. records that must be forced to stable storage before preparing, and
the “commit” and “abort” records are the same as their stable
storage counterparts. The only difference is our treatment of
3.5 Replicated Clients prepares, since we have no prepare record. In a conventional
The algorithms above assumed that both the client and the server system, the prepare record tells the participant #aftera crash whether
are replicated. It is good to replicate servers, since they do work on a transaction that ran there before a crash is ablle to commit. We do
behalf of many clients. Replicating a client that is not a server, not need the prepare record because we use the primary’s history
however, may not be worthwhile. and the psef in the prepare message to determine what to do.

If the client is not replicated, it is still desirable for the coordinator Even when a transaction only has read locks, we must force the
to be highly available, since this can reduce the “window of “completed-call” records to the backups when preparing to ensure
vulnerability” [30] in two-phase commit. This can be accomplished that read locks are held across a view change. A view change may
by providing a replicated “coordinator-server.” The client have happened without this primary being aware of it. and there may
communicates with such a server when it starts a transaction, and be a new primary already processing user requests in the other view.
when t commits or aborts the transaction. The coordinator-server Furthermore, the preparing transaction’s read-locks may not be
carries out two-phase commit as described above on the client’s known in the new view, so the new primary may allow other
behalf. It also responds to queries about the outcome of the transactions to obtain conflicting locks. Forcing the buffer
transaction; its groupid is part of the transaction’s aid, so that guarantees that the prepare can succeed only if the transaction’s
participants know who it is. In answering a query about a transaction locks survived the view change. Without the force, the prepare could
that appears to still be active, tt would check wtth the client, but if no succeed at the old primary even though the locks did not survive. fn
reply is forthcoming, it can abort the transaction unilaterally. essence, not doing the force is equivalent to not sending the prepare
message to a read-only participant; such prepare messages are
needed to prevent violations of two-phase locking.
3.6 Nested Transactlons
The protocol discussed above is quite permisstve about when a We believe that our method will perform better than a non-
transaction can prepare, but much less permissive when a client replicated system. Remote calls in our system run only at the
sends a message to a cohort that does not respond. A lack of primary and need not involve the backups and therefore their
response causes the entire transaction to abort. Such an abort can performance is the same as in a non-replicated system. We expect
cause lots of work to be lost. that pfepafe messages are usually processed entirely at the primary
because the needed “completed-call” event records for remote calls
Obviously, there are ways to reduce the number of situations in of the preparing transaction will already be stored at a sub-majority
which the abort happens. For example, we couM force a special of cohorts; otherwise, the primary must watt while the relevant part of
“start call” record to the backups before making a nested remote call. the buffer is forced to the backups. Careful engineering is needed
It would be safe to run the call at the new primary lf there were no here to provide both speedy delivery and small numbers of
such record, since even if the call ran before the view change, its messages. Committing a transaction requires forcing the
effects were bcal lo this group and therefore have been undone by “committed” record to the coordinators backups; the remainder of
the view change. Atternatively, the client could do a probe before the protocol can run in background. For both preparing and
making the call to determine the current primary. However, neither committing, our method will be faster than using non-replicated
of these techniques is satisfactory, since they delay normal clients and sewers if communication is faster than writing to stable
processing. storage, which is often the case provided that the number of backups
is small.
A better approach is to use nested transactions [lo, 28.301.
Nested transactions have two desirable properties. First, they allow

12
4 Changing Views cohort receives a “change” message, this means that the exchange
Transaction processing depends upon forcing information to of “I’m alive” messages indicates the need for a view change; it
backups so that a majority of cohorts know about particular events. becomes the view manager by changing its StatUS to
The job of the view change algorithm is to ensure that events known “view manager.” If it receives an invitation to join a view, and if the
to a majority of cohorts survive into subsequent views. It does this new view’s viewid is greater than any it has seen so far, it accepts
by ensuring that every view contains at least a majority of cohorts the view and becomes an underling by changing its status to
and by starting up the new view in the latest possible state. “underling.” The procedure do-accept records the new viewid in
max-viewid and sends an acceptance message. There are two
If every view has at least a majority of cohorts, then it contains at kinds of acceptance messages, “normal” ones, and *crashed” ones.
least one cohort that knows about any event that was forced to a If the cohort is up to date (i.e., up-to-date = true), it SendS an
majority of cohorts. Thus we need only make sure that the state of acceptance containing its current viewstamp and an indication of
the new view includes what that cohort knows. This is done using whether it is the primary in the current view. Otherwise, it sends a
viewstamps: the state of the cohort with the highest viewstamp for “crash-accept” response; this response contains only its viewid, and
the previous view is used to initialize the state in the new view. This means that it has forgotten its gstate.
scheme works because event records are sent to the backups in
timestamp order, and therefore a cohort with a later viewstamp for If it is a view manager, the cohorl sends invitations to join the new
some view knows everything known to a cohort with an earlier view to all other cohorts, and waits for responses. The procedure
viewstamp for that view. make_invifations creates a new viewid by pairing mymid with a
number greater than max-viewid.cnf and stores it in max-viewid.
The view change algorithm requires some information to be Notice that the new viewid will be different from any produced by
recorded in the cohort state. This information is summarized in another cohort. Then it sends invitations containing max-viewid to
Figure 4, which shows the complete cohort state. Most of this state the other cohorts, records its own response (“crashed” or “normal”),
is volatile and will be lost in a crash; the ramifications of such and collects the other responses. If an invitation with a higher viewid
crashes are discussed in Section 4.2. The exceptions are mymid, arrives, it signals invited, returning the new viewid and the mid of the
configuration, and mygroupid, which are stored on stable storage inviter. In this case, the view manager accepts the invitation and
when the cohort is first created, and cur-viewid, which is stored at
the end of a view change. When a cohort recovers from a crash, it
initializes up-to-dale to be false, indicating that its gsfate is not up to while true do
date, and initializes max-wiewid to cur viewid. Then it initializes tagcaee status
status to be “view-manager; this causesft to start a view change as tag active:
discussed below. receive % accept a message
when change: status := view=manager
Cohorts send periodic “I’m Alive” messages to other cohorts in the when invite (vid: viewid, m: mid):
configuration. If a cohort notices that it is not communicating with If vid < max viewid then conllnue end % ignore the msg
some other cohort in its view, or if ft notices that it is communicating do-accept(vid, m)
with a cohort that it could not communicate with previously, or if ft status := underling
has just recovered from a crash, ft initiates a view change. It is the others: % transaction messages handled here
manager Of this protocol; the other, cohorts are the underlings. end % receive

An overview of the algorithm run by a cohort is shown in Figure 5. tag view-manager:


The figure shows what the cohort does in each of its three states, responses := make-invitations( )
“active,” “view-manager,” and “underling.” In the “active” state, the except when invited (vid: viewid, m: mid):
cohotl waits for messages to arrive; the receive statement selects do-accept(vid, m)
an atbiirary waiting message for delivery to the program, and status := underling
continue % continue at next iteration
dispatches to the arm that matches the name of that message. If the end except
v: view := form-view(responses)
except when cannot: continue end % wait and then try again
If v.primafy = mymid
status: status % cohort is active or doing a view change then start-view(v)
gstate: (object) % objects in the group’s state status :- active
up-to-date: bool % true if gstate is meaningful else send init-view(max-viewid. v) to v.primary
configuration: [int) % modules in the configuration status := underling
mymid: int % name of this module end % if
mygroupid: int % name of the group
cur-viewid: viewid % current viewid tag underling:
cur-view: view % current view await-view( )
history: [viewstamp] % indicates events known to cohort except
max-viewid: viewid % highest viewid seen so far when timeout:
timestamp: int % the timestamp generator status := view-manager
buffer: [event-recordj % the communication buffer continue
when Invited (vid: viewid, m: mid):
where do-accept(vid, m)
continue
view = status = oneoflactive, view-manager, underling: null] when becomegrimary(v: view): start-view(v)
object = euid: int, base: T, lockers: (lock-info}> end % except
lock-info - clocker: aid, info: oneof[read: null, wrtte: Tj> status := active
viewid - <cnt: int. mid: inb
view = cprimaty: int, backups: (in+- end % tagcase
viewstamp = cid: viewid, ts: inb end % while

Flgure 4: State of a Cohort. Figure 5: The View Change Algorithm.


becomes an underling. Otherwise, when all cohorts accept the messages is all that is needed when the manager is also the primary
invitation or a timeout expires, make-invitations returns the in the last active view; otherwise, one round plus one message is
responses. In this case, the view manager attempts to form a new needed.
view (the details are discussed below). If the attempt fails,
(form_view signals cannor), the cohort attempts another view The system performs correctly even if there are several active
formation later. If the attempt succeeds, and if the view manager is primaries. This situation could arise when them is a panlion and the
not the new primary, it sends an “init-view” message to the new old primary is slow to notice the need for a view change and
primary, and becomes an underling. Otherwise it starts the new continues to respond to client requests even after the new view is
view: it updates cur-view and cur-viewid, stores zero in timestamp formed. The old primary will not be able to prepare and commit user
and appends <cur-viewid, (h to the history, and writes cur-viewidto transactions, however, since it Cannot force their effects to the
stable storage. Then it initializes the buffer to contain a single backups.
“newview” event record; this record contains cur-view, history, and
gstate. Finally, it becomes active. If the same cohort is the primary both befo,re and after the view
change, then no user work is lost in the change. Otherwise, we
View formation can succeed only if two conditions are satisfied: at guarantee the following: Transactions that prepared in the okl view
least a majority of cohorts must have accepted the invitation, and at will be able to commit, and those that committed will still be
least one of them must know all forced information from previous committed. Transactions that had not yet prepared before the
views. The latter condition may not be true if some acceptances are change may be able to prepare afterwards, depending on whether
of the “crashed” variety. For example, suppose there are three the completion events of the remote calls are known in the new view.
cohorts, A, 8 and C, and that view vl = <primary: A, backups: IS, Aborts of transactions may have been forgotten, but delivery of abort
C+. Suppose that A committed a transaction, forcing its event messages is not guaranteed in any case: recovery from lost
records to f? but not C, then A crashed and recovered, and then a messages is done by using queries (see Section 3.4). To minimize
partition occurred that separated 6 from A and C. In this case we disruption while a view change is happening, or when there is no
cannot form a new view until the partition is repaired because A hes active view, queries can be answered by any cohort that knows the
lost information and there are forced events that C does not know. answer.

The correct rule for view fofination is: a majority of cohorts have The algorithm is tolerant to several cohorts simultaneously acting
accepted and as managers; the one that chooses the higher viewid will ultimately
1. a majority of cohorts accepted normally, or succeed. Having several managers will slow things down, since
there will be more message traffic, but the slow down will be slight.
2. crash-viewid -Znormal-viewid, or Furthermore, we can avoid concurrent managers to some extent by
3. crash-viewid = normal-viewid and the primary of view various policies. For example, the cohorts could be ordered, and a
normal-viewid has done a normal acceptance of the cohort would become a manager only if all higher-priority cohorts
invitation. appear to be inaccessible.
Here crash-viewid is the largest viewid returned in a “crashed
acceptance, and normalviewid is the largest viewstamp returned in However, the algofhm is not tolerant of lost messages and slow
a “normal” acceptance. Condition (1) says we can ignore crashed responses. For example, suppose a manager waits only until it
acceptances if we have enough normal ones; condition (2) says we hears from a sub-majority even though there are other cohorts that
can ignore crashed acceptances if they are from old views; and could respond. This would result in those other cohorts being
condition (3) says we can ignore a crashed acceptance if we have excluded from the new view, which in turn will mean another round of
information from the primary of its view, because the primary always view changing will occur shortly. If that next view change also
knows at least as much as any backup. excludes some potential members, that will lead to another view
change, and so on.
If the view can be formed, the cohort returning the largest
viewstamp (in a “normal” acceptance) is selected as the new To avoid such a situation, a manager should use a fairly long
primary; the old primary of that view is selected if possible, since this timeout while it waits to hear from all cohorts8 that the “I’m alive”
causes minimal disruption in the system. messages indicate should reply. Similarly, an underling should use a
fairly long timeout before it becomes a manager. In addition, it is
A cohort in the underling state Calls await_view to wait to find out worthwhile to mask lost messages by sending duplicates, so that a
what happened to the new view. If no message arrives within some lost message won’t trigger another view change.
interval, await_view signals timeout and the cohort becomes the view
manager and attempts to form another view. If an invitation for a A final point is that not all view changes descrfbed above really
higher viewid arrives, await_view signals invited, and the cohort need to be done. One special case is wheln an active primary
accepts the invitation. If an “init-view” message containing a viewid notices that it cannot communicate wfth a backup, but it still has a
equal to maw_viewid arrives, await_view signals becomegrimary; sub-majority of other backups. In this case, the primary can
the cohort initializes itself to be a primary as discussed above, and unilaterally exclude the inaccessible backup from the view. Similarly,
becomes active. If a “newview” record for a view with viewid equal an acttte primary can unilaterally add a backup to its view. View
to max-viewid arrives from the buffer, await-view inftializes the changes are really needed only when the primary is lost, or when a
cohort state before returning: it initializes cur-view, wr-viewid, current active view loses enough members that it is no longer a
hisfory and gstate from the information in the message, writes majority. In the latter case, we need not do a view change either; we
cur-viewid to stable storage, sets up-lo-date to true (to indicate that make the primary inactive since this stops it from working on
it now has information in gsnte), and returns normally. Then the transactions when it wilt not be able to commit them.
cohoR becomes active.

4.2 Stable Storage


4.1 Discussion In our algorithm we assumed that most of a cohort’s state was
When failures or recoveries are detected by the system, the view volatile. Such an assumption means that if a majority of cohorts are
change protocol runs in each affected module group. The protocol crashed “simultaneously,” we may lose information about the module
requires relatively little message-passing in the simple case of no group’s state. Here we view a cohort as crashed if either it is really
additional failures and no concurrent view managers. One round of crashed, or if it has recovered from a crash, but its up-to-dare
variable is false. Note that a catastrophe does not cause a group to of later events implies knowledge of earlier ones. A total order on
enter a new view missing some needed information. Rather, it viewstamps would be costly to implement with voting, since there iS
causes the algorithm to never again form a new view. no single place (like our primary) to generate the viewstamp. it might
be possible to use multipart viewstamps (23, 291, however. This is a
Whether it is worthwhile to worry about catastrophes depends on matter for future research.
the likelihood of occurrence and the importance of the information in
the group state. The considerations here are similar to decisions A different approach to replication is taken in Isis [4, 51. Isis works
about when it is necessary to store information in stable storage in a only in a local area network because its view change protocol does
nonreplicated system, except that replication makes the probability of not tolerate partitions. In Isis, calls are sent to a single cohort. If the
. called procedure is a read, the cohort acquires a read lock IOCally
catastrophe smaller to begin with.
and performs the operation locally. If the procedure is a write, the
If protection against catastrophes is desired, there are various cohort acquires write locks at all cohorts before doing the Call. (Write
techniques that could be tried. For example, we might use stable locks are acquired using a two-phase algorithm that prevents
storage only at the primary or we might supply each cohort with a deadlocks in the case of concurrent writes.) Then the cohoR
universal power supply and have them write information to performs the call. In either case, the Cohort communicates the
nonvolatile storage in the background. effects of reads3 and writes to other cohorts in background ‘mode,
and piggybacks them on reply messages. This piggybacked
information accompanies all future client messages, including calls to
5 Related Work other servers as well as prepare and commit messages. This
In this section we discuss the relationship of our approach to other means, for example, that if the prepare message is sent to a different
work on replication and view changes. cohort from the one that performed the call, the information about the
effect of the call wilt be present at the cohort doing the prepare, SO
The best known replication technique is voting [16,21]. With there will be no need for that cohort to wait for the background
voting, write operations are usually performed at all Cohorts, and message to arrive, and no possibility that it would need to reject the
reads are performed at only one cohort, but in general writes can be prepare. Unlike our pset, however, piggybacked information in Isis
performed at a majority of cohorts and reads at enough cohorts that cannot be discarded when transactions commit. A disadvantage of
each read will intersect each write at at least one cohort. The write Isis is the large amount of extra information flowing on every
ail/read one choice is preferred when reads are much more common message, and the difficulty in garbage collecting that information.
than writes.
Our method avoids these problems at the Cost of a possible delay
Our method is faster than voting for write operations since we at prepare time (to force the buffer) and of an occasional abort when
require fewer messages. Also, we avoid the deadlocks that can there is a view change. The viewstamps in our method represent the
arise if messages for concurrent updates arrive at the cohorts in information flowing in Isis. However, since the viewstamps only
different orders. Our method will also be faster for read operations if indicate that certain events have occurred, but not what these events
these take place at several cohorts. If reads take place at just one are, we must sometimes wait for information about its events to
cohorl, voting may outperform our method because reading can arrive in buffer messages. Also, we must sometimes abort a
occur at any cohort, while reading in our scheme must happen at the transaction because information about its events is lost in a view
primary, which could become a performance bottleneck. On the change.
other hand, the real source of a bottlenedc is a node, not a cohort,
and we can organize our system so that primaries of different groups In Cooper’s replicated remote procedure calls [9], each procedure
usually run on different nodes. Furthermore, the system can be call is replicated and executed at every cohort of a server. This
Configured to place primaries at more powerful nodes most of the technique has high overhead during normal system operation: il
time. This organization Could lead to better performance than voting. requires lots of messages, is wasteful of computation, and requires
that programs be deterministic. The advantage of the method is that
Voting allows operations to continue running as long as the recovery is inexpensive.
needed number of cohorts are up and accessible. However, when
writes must happen at all Cohorts, the lost of a single cohort can Finally, Tandem’s Nonstop systemI2, 7,8] and Ihe Auragen
cause writes to become unavailable. The virtual partitions system [6] are primary copy methods but there is just one backup, so
pro~ocol[l2, 131 was invented to solve this problem. Our view they can survive onfy a single failure. Furlhermore, the
change protocol is a simplification and modification of this protocol primary/backup pair must reside at a single node (containing multiple
and has better performance. The virtual partitions protocol requires processors). If these constraints are acceptable, these methods are
three phases. The first round establishes the new view, the second efficient. Ours is more general.
informs the cohorts of the new view, and in the third, the Cohorts all
communicate with one another to find out the current state. We
avoid extra work by using viewstamps in phase 1 (the first round) to 6 Conclusions
determine what each cohort knows. Our technique can be used in This paper has described a new replication method for providing
conjunction with voting when writes are done at all members of a high availability. The method performs well in the normal case, does
view. Just as we use viewstamps, in such a system timestamps view changes efficiently, and loses little information in a view
assigned when transactions commit could be used to determine change. We expect the performance of our method to be
which replica has the most information about transaction Commits comparable to that of a system in which modules are not replicated
(the timestamps would not contain information about the state of and better than most other replication methods. At present we are
active transactions). Systems in which writes only go to a majority implementing our method; we will be able to run experiments about
are more difficult to optimize in this way since there is usually no system performance when our implementation is complete.
cohort whose state Contains at least as much information as the
state of any other cohort. Our view change algorithm is highly likely not to lose work in a
view change. If a transaction’s effects are known at the new primary,
Virtual partitions force transactions that were active across a view
change to abort. For example, a transaction that did a remote call in
the old view will not be able to prepare in the new view. We use
viewstamps to avoid the abort and we rely on the fact that knowledge 3The effect of a read is that a read lock has been acquired.

15
Viewstamps allow us to determine inexpensively how much each cohort knows and whether a transaction can be committed. Our policy of choosing the primary of the last active view to be the new primary whenever possible avoids losing work altogether; even remote calls that were running before the view change can continue to run afterwards. Note that the probability of aborts can be decreased further if desired. There is a tradeoff here between loss of information in view changes and speed of processing calls. For example, if "completed call" records were forced to the backups before the call returned, there would be no aborts due to view changes, but calls would be processed more slowly.
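A minimal sketch of the two ends of this tradeoff follows; the classes and method names are assumptions made for illustration, not the protocol's actual interface.

# Illustrative only: contrasts returning as soon as the primary has recorded
# a completed call with waiting until every backup has recorded it.

class Backup:
    def __init__(self):
        self.log = []

    def deliver(self, record):
        self.log.append(record)            # backup records the event

class Primary:
    def __init__(self, backups):
        self.backups = backups
        self.log = []

    def complete_call_fast(self, call, result):
        # Return as soon as the completed-call record exists at the primary.
        # Calls finish quickly, but a view change that discards the primary's
        # recent events can force the surrounding transaction to abort.
        self.log.append(("completed call", call, result))
        return result                      # record reaches backups later

    def complete_call_safe(self, call, result):
        # Force the completed-call record to the backups before returning.
        # View changes then cause no aborts, because every backup already
        # knows the call completed, but each call pays an extra message round.
        record = ("completed call", call, result)
        self.log.append(record)
        for b in self.backups:
            b.deliver(record)              # stands in for send plus wait for ack
        return result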
Choosing the primary of the old view to be the new primary minimizes information loss and makes the view change protocol run quickly. On the other hand, we could modify the protocol to always choose a particular cohort to be the primary if possible. Such a policy matches the needs of some applications. The policy would not cause loss of information: if the old primary is a member of the new view, all its events will survive into the new view. However, work in progress at the old primary would be lost in the change (unless some additional mechanism is included); this includes aborting transactions for which the primary is the coordinator. In addition, a few extra messages will sometimes be needed in the view change protocol.
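A small sketch of the two selection policies; the function and its arguments are assumptions made for illustration.

# Illustration only: prefer a designated cohort if one is wanted and present,
# otherwise keep the old primary, otherwise fall back to a deterministic rule.

def choose_primary(new_view, old_primary, preferred=None):
    """new_view is the set of cohorts in the newly formed view."""
    if preferred is not None and preferred in new_view:
        return preferred                  # the "particular cohort" policy
    if old_primary in new_view:
        return old_primary                # the policy preferred in this paper
    return min(new_view)                  # any deterministic tie-breaker

# With no preferred cohort, the old primary is kept and its events survive.
assert choose_primary({1, 2, 3}, old_primary=2) == 2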
We presented our algorithm in a system with one-level transactions. However, as noted earlier, such a system can lead to aborts in which a substantial amount of work can be lost. The problem arises when a client gets no reply for a remote call; the transaction must be aborted to avoid running a call more than once. Nested transactions prevent the abort of the top-level transaction and, furthermore, do so efficiently.

In defining our algorithm, we chose to avoid the use of stable storage as much as possible because we were interested in understanding the extent to which having several replicas eliminated the need for stable storage. We found that catastrophes (loss of a group's state) that would not happen if events were recorded on stable storage could sometimes occur in our system. The probability of a catastrophe depends on the configuration, e.g., on whether the cohorts' nodes are failure independent. The algorithm can be modified in various ways to reduce the probability of catastrophe if it is considered to be too high.

The use of viewstamps is an interesting compromise between loss of work in failures and extra information. Isis represents one extreme here: no work is lost when there is a failure, but large amounts of information must flow around the system. Other systems have no information like viewstamps and must abort all transactions affected by a failure.

Viewstamps may also be worthwhile in a nonreplicated system. In such a system, records containing the effects of calls could be written to stable storage in background mode; the records, like event records, would contain viewstamps. When the prepare message arrives, it would only be necessary to force the records; no delay would be encountered if the records had already been written. A crash would not cause active transactions to abort automatically; instead, queries would be sent to coordinators to determine the outcomes. The result would be a system that is more tolerant of crashes (by avoiding aborts) and also faster at prepare time.
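A minimal sketch of how such a nonreplicated participant might be structured follows; the record layout, the background writer, and all names are assumptions made for illustration.

# Illustration only: call-effect records carrying viewstamps are written to
# stable storage in the background, and prepare forces only what has not yet
# reached the disk.

class StableLog:
    def __init__(self):
        self.on_disk = []      # records already on stable storage
        self.pending = []      # written in background mode, not yet forced

    def append(self, record):
        self.pending.append(record)        # returns immediately

    def background_flush(self):
        # runs off the critical path of user calls
        self.on_disk.extend(self.pending)
        self.pending.clear()

    def force(self):
        # called when a prepare message arrives; cheap if the background
        # writer has already emptied `pending`
        self.background_flush()

class Participant:
    def __init__(self):
        self.log = StableLog()

    def record_call(self, viewstamp, effects):
        self.log.append((viewstamp, effects))

    def prepare(self):
        self.log.force()                   # no delay if already written
        return "prepared"

    def recover(self, coordinators):
        # After a crash, active transactions need not abort automatically;
        # instead, each coordinator is asked for its transaction's outcome.
        # `coordinators` maps transaction id -> a coordinator object with an
        # assumed query_outcome method.
        return {t: c.query_outcome(t) for t, c in coordinators.items()}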
Acknowledgments

We are thankful for the helpful comments of readers of earlier drafts of this paper, and especially to Sanjay Ghemawat, Dave Gifford, Bob Gruber, Deborah Hwang, Elliot Kolodner, Gary Leavens, Sharon Perl, Liuba Shrira, and Bill Weihl.

References

1. Alsberg, P. A., and Day, J. D. A Principle for Resilient Sharing of Distributed Resources. Proc. of the 2nd International Conference on Software Engineering, October, 1976, pp. 627-644. Also available in unpublished form as CAC Document number 202, Center for Advanced Computation, University of Illinois, Urbana-Champaign, Illinois 61801, by Alsberg, Belford, Day, and Grapa.

2. Bartlett, J. F. A NonStop Kernel. Proc. of the 8th ACM Symposium on Operating Systems Principles, SIGOPS Operating Systems Review 15, 5, December, 1981, pp. 22-29.

3. Bernstein, P. A., and Goodman, N. The Failure and Recovery Problem for Replicated Databases. Second ACM Symposium on the Principles of Distributed Computing, August, 1983, pp. 114-122.

4. Birman, K. P., Joseph, T. A., Raeuchle, T., and El Abbadi, A. "Implementing Fault-Tolerant Distributed Objects". IEEE Trans. on Software Engineering 11, 6 (June 1985), 502-508.

5. Birman, K. P., and Joseph, T. A. "Reliable Communication in the Presence of Failures". ACM Trans. on Computer Systems 5, 1 (February 1987), 47-76.

6. Borg, A., Baumbach, J., and Glazer, S. A Message System Supporting Fault Tolerance. Proc. of the 9th ACM Symposium on Operating Systems Principles, SIGOPS Operating Systems Review 17, 5, October, 1983, pp. 90-99.

7. Borr, A. J. Transaction Monitoring in Encompass: Reliable Distributed Transaction Processing. Proc. of the Seventh International Conference on Very Large Data Bases, September, 1981, pp. 155-165.

8. Borr, A. J. Robustness to Crash in a Distributed Database: A Non Shared-Memory Multi-Processor Approach. Proc. of the Tenth International Conference on Very Large Data Bases, August, 1984, pp. 445-453.

9. Cooper, E. C. Replicated Distributed Programs. UCB/CSD 85/231, University of California, Berkeley, CA, May, 1985.

10. Davies, C. T. "Data Processing Spheres of Control". IBM Systems Journal 17, 2 (February 1978), 179-198.

11. Eager, D. L., and Sevcik, K. C. "Achieving Robustness in Distributed Database Systems". ACM Trans. on Database Systems 8, 3 (September 1983), 354-381.

12. El Abbadi, A., Skeen, D., and Cristian, F. An Efficient, Fault-Tolerant Protocol for Replicated Data Management. Proc. of the 4th ACM SIGACT/SIGMOD Conference on Principles of Data Base Systems, 1985.

13. El Abbadi, A., and Toueg, S. Maintaining Availability in Partitioned Replicated Databases. Proc. of the 5th ACM SIGACT/SIGMOD Conference on Principles of Data Base Systems, 1986.

14. Eswaran, K. P., Gray, J. N., Lorie, R. A., and Traiger, I. L. "The Notions of Consistency and Predicate Locks in a Database System". Comm. of the ACM 19, 11 (November 1976), 624-633.

15. Fowler, R. J. Decentralized Object Finding Using Forwarding Addresses. 85-12-1, University of Washington, Dept. of Computer Science, Seattle, WA, December, 1985.

16. Gifford, D. K. Weighted Voting for Replicated Data. Proc. of the 7th ACM Symposium on Operating Systems Principles, SIGOPS Operating Systems Review 13, 5, December, 1979, pp. 150-162.

17. Gifford, D. K., and Donahue, J. E. Coordinating Independent Atomic Actions. Proc. of IEEE CompCon, February, 1985, pp. 92-95.

18. Gray, J. N., Lorie, R. A., Putzolu, G. F., and Traiger, I. L. Granularity of locks and degrees of consistency in a shared data base. In Modelling in Data Base Management Systems, G. M. Nijssen, Ed., Elsevier North-Holland, New York, 1976, pp. 365-394.
19. Gray, J. N. Notes on Database Operating Systems. In Lecture Notes in Computer Science 60, Goos and Hartmanis, Eds., Springer-Verlag, Berlin, 1978, pp. 393-481.

20. Henderson, C. Locating Migratory Objects in an Internet. M.I.T. Laboratory for Computer Science, Cambridge, MA, 1983.

21. Herlihy, M. P. "A Quorum-Consensus Replication Method for Abstract Data Types". ACM Trans. on Computer Systems 4, 1 (February 1986), 32-53.

22. Hwang, D. J. Constructing a Highly-Available Location Service for a Distributed Environment. Technical Report MIT/LCS/TR-410, M.I.T. Laboratory for Computer Science, Cambridge, MA, January, 1988.

23. Ladin, R., Liskov, B., and Shrira, L. A Technique for Constructing Highly-Available Services. M.I.T. Laboratory for Computer Science, Cambridge, MA, January, 1988. To be published in Algorithmica.

24. Lamport, L., Shostak, R., and Pease, M. "The Byzantine Generals Problem". ACM Trans. on Programming Languages and Systems 4, 3 (July 1982), 382-401.

25. Lampson, B. W., and Sturgis, H. E. Crash Recovery in a Distributed Data Storage System. Xerox Palo Alto Research Center, Palo Alto, CA, 1979.

26. Liskov, B., and Scheifler, R. "Guardians and Actions: Linguistic Support for Robust Distributed Programs". ACM Trans. on Programming Languages and Systems 5, 3 (July 1983), 381-404.

27. Liskov, B., Curtis, D., Johnson, P., and Scheifler, R. Implementation of Argus. Proc. of the Eleventh ACM Symposium on Operating Systems Principles, SIGOPS Operating Systems Review 21, 5, November, 1987, pp. 111-122.

28. Liskov, B. "Distributed Programming in Argus". Comm. of the ACM 31, 3 (March 1988), 300-312.

29. Liskov, B., and Ladin, R. Highly-Available Distributed Services and Fault-Tolerant Distributed Garbage Collection. Proc. of the Fifth ACM Symposium on the Principles of Distributed Computing, August, 1986.

30. Moss, J. E. B. Nested Transactions: An Approach to Reliable Distributed Computing. Technical Report MIT/LCS/TR-260, M.I.T. Laboratory for Computer Science, June, 1981.

31. Mullender, S., and Vitanyi, P. Distributed Match-Making for Processes in Computer Networks -- Preliminary Version. Proc. of the Fourth Symposium on the Principles of Distributed Computing, ACM, August, 1985.

32. Oki, B. M. Viewstamped Replication for Highly-Available Distributed Systems. Ph.D. thesis, Massachusetts Institute of Technology, Laboratory for Computer Science, Cambridge, MA, May, 1988. Forthcoming.

33. Papadimitriou, C. H. "Serializability of Concurrent Database Updates". J. of the ACM 26, 4 (October 1979), 631-653.

34. Schneider, F. B. Fail-Stop Processors. Digest of Papers from Spring CompCon '83, 26th IEEE Computer Society International Conference, March, 1983, pp. 66-70.

35. Skeen, D., and Wright, D. D. Increasing Availability in Partitioned Database Systems. TR 83-581, Dept. of Computer Science, Cornell University, March, 1984.

36. Stonebraker, M. "Concurrency Control and Consistency of Multiple Copies of Data in Distributed INGRES". IEEE Trans. on Software Engineering 5, 3 (May 1979), 188-194.