A QUORUM-BASED COMMIT PROTOCOL
Dale Skeen
TR 82-483
February 1982
Department of Computer Science
Cornell University
Ithaca, New York 14853
Abstract
Herein, we propose a commit protocol and an associated recovery protocol
that are resilient to site failures, lost messages, and network partitioning.
The protocols do not require that a failure be correctly identified or even
detected. The only potential effect of undetected failures is a degradation
in performance. The protocols use a weighted voting scheme that supports an
arbitrary degree of data replication (including none) and allows unilateral
aborts by any site. This last property facilitates the integration
of these protocols with concurrency control protocols. Both protocols are
centralized protocols with low message overhead.
1. Introduction
A transaction is, by definition, an atomic operation on a distributed
database system. Either all changes by the transaction are permanently
installed in the database, in which case the transaction is said to be
committed, or no changes persist, in which case the transaction is said to be
aborted. It is the task of a commit protocol to ensure that a transaction
is atomically executed.
In this paper we propose a commit protocol that is resilient to multi-
ple occurrences of the following classes of benevolent failures: arbitrary
site failures, lost messages, and network partitioning. It does not require
that the type of failure be correctly determined; in fact, resiliency is
guaranteed even if failures go undetected.
The protocol uses a weighted voting scheme to resolve conflicts during
failures. When failures occur, a transaction is committed only if a
minimum number of votes, called a commit quorum and denoted V_c, are cast for
committing. Similarly, in the presence of failures, a transaction will be
aborted only if a minimum number of votes, called an abort quorum and
denoted V_a, are cast for aborting. A commit quorum does not have to equal
an abort quorum, but their sum must exceed the total number of votes.
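The constraint on the two quorum sizes can be checked mechanically. The following is a minimal sketch; the function name and the sample vote counts are illustrative assumptions and do not come from the protocol description.

```python
def quorums_are_compatible(total_votes: int, commit_quorum: int, abort_quorum: int) -> bool:
    """Return True if the two quorum sizes sum to more than the total number of votes."""
    return commit_quorum + abort_quorum > total_votes


# Hypothetical configuration with 7 votes in total.
print(quorums_are_compatible(7, 5, 3))  # 5 + 3 = 8, which exceeds 7
print(quorums_are_compatible(7, 4, 3))  # 4 + 3 = 7, which does not exceed 7
```

When the sum does not exceed the total, two disjoint groups of sites could each hold enough votes for a different kind of quorum, which is exactly what the constraint rules out.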
Voting schemes have been proposed previously for transaction management.
Thomas introduced a majority voting scheme to ensure consistency in a
fully replicated database ([THOM79]). Gifford extended the scheme by
assigning weights to sites and using quorums rather than a simple majority
([GIFF79]). The proposed protocol differs from the previous work in several
important ways:
(1) It is a commit protocol, not a concurrency control scheme. It provides
atomicity on a per-transaction basis. Nonetheless, it is straightforward
to integrate any type of concurrency control protocol into this
protocol.
(2) It allows unilateral aborts during the first phase of the transaction.
A site may decide to abort for any of several reasons; for example,
because a deadlock is detected locally.
(3) It is primarily intended for partially replicated distributed databases
where a transaction can read from any copy but must update all copies.
In addition, the protocol exhibits the following properties:
(1) It is a centralized protocol and, thus, benefits from the economy of
centralized protocols.
(2) In the absence of failures it is no more expensive than previously
proposed protocols that are resilient only to coordinator failures (and
not to a partitioning of the network).
(3) If all failures are eventually repaired, then the protocol will eventu-
ally terminate.
(4) It is a blocking protocol -- operational sites must occasionally wait
until a failure is repaired. This is an undesirable but necessary
property exhibited by any protocol that is resilient to network
partitioning ([SKEE81a]). However, the protocol can be tuned so that the
frequency of blocking is low.
This paper is divided into six sections. The second section states our
assumptions and defines the terminology used in the remainder of the paper.
The third section develops a resilient quorum-based commit protocol, and the
fourth section develops a resilient quorum-based recovery protocol. The
recovery protocol is invoked whenever a group of sites can no longer
communicate with the original coordinator (either it has failed or the network has
partitioned). Like the commit protocol, it is a centralized protocol. The
fifth section discusses performance, and the sixth section concludes the
paper.
Although the protocols proposed are resilient to many classes of
failures, this paper will focus on the problem of network partitioning.
This class of failures is generally agreed to be the most difficult class to
handle. The other two classes, site failures and lost messages, can be cast
as special cases of a partitioned network. In a site failure, a single site
is isolated (partitioned) from the remainder of the network. A lost message
can be viewed as a very short-lived partitioning. In all cases, the
protocols work without modification.
2. Background
We assume that an underlying communications network provides point-to-
point communication between any pair of sites. We also assume that it gen-
erates no spontaneous messages, and that garbled messages are detected and
deleted. We assume neither that messages arrive in order nor that the
network detects lost messages.
A partitioned network occurs when there are two or more disjoint groups
of sites such that no communication is possible between the groups. Each of
the disjoint groups is called a partition.
A distributed transaction T is decomposed into subtransactions T_1, T_2,
..., T_N, where a subtransaction is executed at one of the N participating
sites. Any subtransaction can be unilaterally aborted, which results in the
abortion of the entire transaction. Hence, for transaction T to be committed,
all sites must agree to commit their subtransaction. We assume that a
subtransaction can be atomically executed by a local transaction management
system ([GRAY79, LIND79]).
It is the responsibility of a commit protocol to ensure that all sub-
transactions are consistently committed or aborted. One of the simplest
commit protocols is the two-phase protocol ([GRAY79, LAMP76]) depicted in
Figure 1. The protocol uses a central site, the coordinator, to direct the
execution of the transaction at the other sites. Each slave has a chance to
abort the transaction by replying with a "no" in the first round.
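The coordinator's decision rule in the two-phase protocol can be sketched in a few lines. This is a failure-free simulation only: the function names, the site identifiers, and the choice to abort when no replies are present are assumptions made for illustration, and no timeouts or message losses are modeled.

```python
from typing import Dict


def two_phase_decision(replies: Dict[str, str]) -> str:
    """Second-round rule: commit only if every site replied "yes"."""
    if replies and all(reply == "yes" for reply in replies.values()):
        return "commit"
    return "abort"


def run_two_phase(votes: Dict[str, str]) -> Dict[str, str]:
    """Simulate one failure-free execution and return the outcome seen by each slave."""
    # Round 1: each slave receives its subtransaction and replies "yes" or "no".
    replies = dict(votes)
    # Round 2: the coordinator sends the same decision to every slave.
    decision = two_phase_decision(replies)
    return {site: decision for site in replies}


print(run_two_phase({"s1": "yes", "s2": "yes"}))
print(run_two_phase({"s1": "yes", "s2": "no"}))
```

A single "no" is enough to abort the whole transaction, which is how a unilateral abort by one slave propagates to every site.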
A commit protocol can be conveniently described by a set of state
diagrams, one for each participating site ([SKEE81a]). The diagram for Site
i describes the processing of subtransaction T_i. A state in the diagram is
called a local transaction state.
        COORDINATOR                          SLAVE
(1) Transaction is received.
    Subtransactions are
    sent to each slave.
                                     Subtransaction is received.
                                     A reply is sent:
                                       yes to commit,
                                       no to abort.
(2) If all sites respond yes,
    then commit is sent;
    else, abort is sent.
                                     Either commit or abort is
                                     received and processed.
Figure 1. The two-phase commit protocol.
Figure 2. The state diagram for the two-phase commit protocol.
In the two-phase commit protocol, a single state diagram (illustrated
in Figure 2) suffices to describe processing at all sites. For both the
coordinator and the slaves, there are four distinct and easily identified
local transaction states: the initial state (state q in the diagram), the
wait state (w), the abort state (a), and the commit state (c). A site
occupies the initial state until it decides whether to unilaterally abort the
transaction. If the site decides against an abort, then the wait state is
entered. This state represents a period of uncertainty for the site, where
it has agreed to proceed with the transaction but does not yet know its
outcome (i.e., committed or aborted). The commit and abort states are
self-explanatory.
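The four local transaction states can be written down as a small state machine. The sketch below is an illustration only: the transition table is inferred from the prose description of the states rather than copied from the state diagram, and the names LocalState, TRANSITIONS, and step are hypothetical.

```python
from enum import Enum


class LocalState(Enum):
    INITIAL = "q"
    WAIT = "w"
    ABORT = "a"
    COMMIT = "c"


# Assumed transitions: from q a site either aborts unilaterally or enters w;
# from w it learns the outcome and moves to c or a; a and c are final.
TRANSITIONS = {
    LocalState.INITIAL: {LocalState.WAIT, LocalState.ABORT},
    LocalState.WAIT: {LocalState.COMMIT, LocalState.ABORT},
    LocalState.ABORT: set(),
    LocalState.COMMIT: set(),
}


def step(current: LocalState, target: LocalState) -> LocalState:
    """Move to `target` if the assumed diagram allows it, otherwise raise ValueError."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target


state = step(LocalState.INITIAL, LocalState.WAIT)  # site agrees to proceed
state = step(state, LocalState.COMMIT)             # outcome becomes known
print(state.value)
```

Under these assumed transitions a site cannot reach the commit state without first passing through the wait state.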
The local transaction states of any protocol form two disjoint subsets:
the committable states and the noncommittable states. A site occupies a
committable state only if all sites have agreed to proceed with the transaction.
For example, the only committable state in the two-phase commit protocol
is the commit state. A state that is not a committable state is a
noncommittable state.
3. A Resilient Commit Protocol
The two-phase commit protocol is not a very robust protocol. Whenever
the coordinator fails or becomes partitioned from the slaves, the slaves
must block until the failure can be repaired.
In this section we develop a very resilient commit protocol that allows
recovery from both of these types of failures. The section develops the
commit protocol in detail; the next section discusses the associated
recovery protocols for handling coordinator failures and partitioning.
Each site is assigned an integral nonnegative number of votes. (The
number can be 0, in which case the site is a passive participant.) The basic
idea is that whenever a group of communicating sites establishes a quorum,
they are allowed to proceed. There are two distinct types of quorums -- a
commit quorum and an abort quorum.
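The vote counting behind this idea can be sketched as follows. The sketch only tallies the votes held by a group of communicating sites and reports which kinds of quorum that tally is large enough for; it deliberately ignores the local transaction states of the sites. The site names, weights, and quorum sizes are hypothetical.

```python
from typing import Dict, Iterable, Set


def votes_in_group(weights: Dict[str, int], group: Iterable[str]) -> int:
    """Total number of votes held by the sites in `group`."""
    return sum(weights[site] for site in group)


def quorums_reachable(
    weights: Dict[str, int],
    group: Iterable[str],
    commit_quorum: int,
    abort_quorum: int,
) -> Set[str]:
    """Kinds of quorum for which the group holds enough votes."""
    votes = votes_in_group(weights, group)
    reachable = set()
    if votes >= commit_quorum:
        reachable.add("commit")
    if votes >= abort_quorum:
        reachable.add("abort")
    return reachable


# Hypothetical assignment: 4 votes in total; site D is a passive participant.
weights = {"A": 2, "B": 1, "C": 1, "D": 0}
print(quorums_reachable(weights, ["A", "B"], commit_quorum=3, abort_quorum=2))
print(quorums_reachable(weights, ["C", "D"], commit_quorum=3, abort_quorum=2))
```

In this example the group {A, B} holds three votes, enough for either kind of quorum, while the group {C, D} holds one vote and has to wait.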
Let V, V_c, and V_a denote, respectively, the total number of votes, the number
required for a commit quorum, and the number required for an abort quorum.
A resilient quorum-based protocol must obey the following properties
([SKEE81c]):
(1) V_c + V_a > V, where 0 ...
... V_a. One argument
concerns protocols allowing unilateral aborts: if a significant number of
transactions are unilaterally aborted, then clearly V_a should be smaller. A
stronger argument is that most site failures are expected to occur during
Phase 1 of the commit protocol, since most of the transaction execution time
is spent in Phase 1. This phase is time-consuming because the majority of
the data processing takes place during it, whereas Phase 2 and Phase 3
synchronize state information among the sites and require very little local
processing. If sites fail during Phase 1, then the transaction must be
aborted -- hence, it should be easy to abort.
An interesting heuristic for choosing V_a is based on a rough estimate
of the failure distribution of the sites. This heuristic is useful in
environments where site failures, rather than network partitions, predominate.
Let P(V_a) be the probability that at least an abort quorum is operational.
P(V_a) is a decreasing function of V_a. The point is to choose the
maximum V_a such that V_a <= V_c and P(V_a) exceeds a minimum level of desired
availability.
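One way to apply this heuristic is sketched below. It rests on assumptions that the text does not state: sites fail independently, each site has a known probability of being operational, and the commit quorum paired with a candidate abort quorum is the smallest one whose sum with it exceeds the total number of votes. The function names and the sample figures are hypothetical.

```python
from typing import List, Optional, Tuple


def prob_votes_at_least(sites: List[Tuple[int, float]], threshold: int) -> float:
    """Probability that the operational sites hold at least `threshold` votes.

    `sites` is a list of (votes, probability_site_is_up); failures are assumed independent.
    """
    total = sum(votes for votes, _ in sites)
    dist = [1.0] + [0.0] * total  # dist[k] = P(exactly k votes are operational)
    for votes, p_up in sites:
        new = [0.0] * (total + 1)
        for k, p in enumerate(dist):
            if p == 0.0:
                continue
            new[k] += p * (1.0 - p_up)   # this site is down
            new[k + votes] += p * p_up   # this site is up
        dist = new
    return sum(dist[max(threshold, 0):])


def choose_abort_quorum(sites: List[Tuple[int, float]], min_availability: float) -> Optional[int]:
    """Largest abort quorum not exceeding its paired commit quorum whose availability exceeds the target."""
    total = sum(votes for votes, _ in sites)
    best = None
    for v_a in range(1, total + 1):
        v_c = total - v_a + 1  # assumed pairing: smallest v_c with v_a + v_c > total
        if v_a <= v_c and prob_votes_at_least(sites, v_a) > min_availability:
            best = v_a
    return best


# Three hypothetical sites, one vote each, each operational with probability 0.9.
sites = [(1, 0.9), (1, 0.9), (1, 0.9)]
print(choose_abort_quorum(sites, 0.95))
print(choose_abort_quorum(sites, 0.98))
```

With these figures an abort quorum of two votes is operational with probability 0.972, so a target of 0.95 selects two votes while a target of 0.98 forces the choice down to one.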
As mentioned before, the weight of a site can be zero, in which case
the site contributes nothing toward forming a quorum. (However, such a site
can still unilaterally abort the transaction.) When designing a protocol, a
zero-weighted site can be eliminated from all phases requiring the formation
of a quorum. In the extreme case, where only a single site has a nonzero
weight, a quorum-based commit protocol degenerates into the standard two-phase
protocol with all of its disadvantages. Specifically, all sites must
block on the failure of the only nonzero-weighted site (which is normally
the coordinator).
6. Conclusion
The use of quorums is a standard recovery technique for handling network
partitioning (even primary site schemes, e.g. [STON79], are a degenerate
case of using quorums). We have presented a very general quorum-based
commit protocol that can be used with both replicated and nonreplicated
data. Unlike previous schemes, it allows a single site to unilaterally abort
the transaction.
Quorum-based protocols are resilient because a site is allowed to par-
ticipate in only one type of quorum. Quorum sizes are carefully chosen such
that the formation of both a commit and an abort quorum requires the participation
of a common site. In this way mutual exclusion is assured -- only
one type of quorum can be formed during the execution of a transaction.
(However, it is possible for multiple occurrences of a single type of quorum
to be formed. For example, since abort quorums are usually small, more than
one can be formed concurrently.) In such a scheme the concurrent execution
of several coordinators, even if they are within the same partition, does
not destroy consistency.
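The common-site argument can be checked by brute force on a small configuration. The sketch below enumerates every group of sites holding enough votes for each kind of quorum and searches for a commit group and an abort group with no site in common. The site names, weights, and function names are hypothetical, and the enumeration is exponential in the number of sites, so it is suitable only for small examples.

```python
from itertools import combinations
from typing import Dict, List, Optional, Tuple


def groups_with_votes(weights: Dict[str, int], needed: int) -> List[frozenset]:
    """All nonempty groups of sites whose combined votes reach `needed`."""
    sites = list(weights)
    return [
        frozenset(group)
        for size in range(1, len(sites) + 1)
        for group in combinations(sites, size)
        if sum(weights[s] for s in group) >= needed
    ]


def find_disjoint_quorums(
    weights: Dict[str, int], commit_quorum: int, abort_quorum: int
) -> Optional[Tuple[frozenset, frozenset]]:
    """Return a (commit group, abort group) pair with no common site, or None."""
    for commit_group in groups_with_votes(weights, commit_quorum):
        for abort_group in groups_with_votes(weights, abort_quorum):
            if commit_group.isdisjoint(abort_group):
                return commit_group, abort_group
    return None


weights = {"A": 2, "B": 1, "C": 1}  # 4 votes in total
print(find_disjoint_quorums(weights, 3, 2))  # 3 + 2 exceeds 4
print(find_disjoint_quorums(weights, 2, 2))  # 2 + 2 does not exceed 4
```

In the first configuration every commit group shares a site with every abort group, so no disjoint pair is found. The same weights also show the parenthetical point about multiple quorums of one type: {A} and {B, C} each hold two votes, so two abort quorums can exist side by side.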
When a new coordinator is elected in the proposed recovery protocol, it
polls all sites about their current local state. In making a commit decision,
only the replies from the latest poll are used -- information obtained
in earlier polls is ignored. Less conservative approaches that use previous
information can be found in [SKEE81c].
REFERENCES
[ALSB76] Alsberg, P. and Day, J., "A Principle for Resilient Sharing of
Distributed Resources," Proc. 2nd International Conference on
Software Engineering, San Francisco, Ca., October 1976.
[GARC79] Garcia-Molina, Hector, Ph.D. Thesis, Stanford University,
1979.
[GARC81] Garcia-Molina, Hector, "Elections in a Distributed Computing
System," TR No. 280, Princeton University, December, 1980.
[GIFF79] Gifford, David, "Weighted Voting for Replicated Data," Operating
Systems Review 13, 5, Dec. 1979, pp. 150-9.
[GRAY79] Gray, J. N., "Notes on Database Operating Systems," in Operating
Systems: An Advanced Course, Springer-Verlag, 1979.
[HAMM79] Hammer, M. and Shipman, D., "Reliability Mechanisms for SDD-1:
A System for Distributed Databases," Computer Corporation of
America, Cambridge, Mass., July 1979.
[LAMP76] Lampson, B. and Sturgis, H., "Crash Recovery in a Distributed
Storage System," Tech. Report, Computer Science Laboratory,
Xerox PARC, Palo Alto, California, 1976.
[LIND79] Lindsay, B.G. et al., "Notes on Distributed Databases," IBM
Research Report, no. RJ2571 (July 1979).
[SKEE81a] Skeen, D. and M. Stonebraker, "A Formal Model of Crash
Recovery in a Distributed System," IEEE Transactions on
Software Engineering, (to appear).
[SKEE81b] Skeen, D., "Nonblocking Commit Protocols," SIGMOD International
Conf. on Management of Data, Ann Arbor, Michigan, 1981.
[SKEE81c] Skeen, D., "Crash Recovery in a Distributed Database System,"
Ph.D. Thesis, University of California, Berkeley (in preparation).
[STON79] Stonebraker, M., "Concurrency Control and Consistency of Multiple
Copies in Distributed INGRES," IEEE Transactions on
Software Engineering, May 1979.
[THOM79] Thomas, Robert, "A Majority Consensus Approach to Concurrency
Control," Transactions on Database Systems, 4, 2, June 1979.