TIMESTAMP-BASED ALGORITHMS FOR CONCURRENCY
CONTROL IN DISTRIBUTED DATABASE SYSTEMS*

Philip A. Bernstein**
Nathan Goodman**

Computer Corporation of America
and Harvard University

* This work was supported by Rome Air Development Center under contract no. F30602-79-C-0191. The views expressed are those of the authors and do not necessarily represent the opinion of Rome Air Development Center or the U.S. Government.

** Author’s address: Aiken Computation Laboratory, Harvard University, Cambridge, MA 02138.

CH1534-7/80/0000-0285$00.75 © 1980 IEEE

Abstract

We decompose the problem of concurrency control into the sub-problems of read-write and write-write synchronization. We present a series of timestamp-based algorithms (called synchronization techniques) that achieve read-write and/or write-write synchronization. And we show how to combine any read-write technique with any write-write technique to yield a complete concurrency control algorithm (called a method). Using this framework we describe 12 “principal” concurrency control methods in detail. Each principal method can be modified by refinements described in the paper, leading to more than 50 distinct concurrency control algorithms.

1. Introduction

In this paper we present a framework for the design and analysis of concurrency control algorithms for distributed database management systems (DDBMS). This framework permits us to describe a large number of concurrency control algorithms in concise terms and guides us in the discovery of new algorithms. Using this framework we describe 12 “principal” concurrency control algorithms in detail and show how these principal algorithms can be refined to yield more than 50 distinct algorithms. These algorithms subsume about half of the literature on DDBMS concurrency control [BG1,2]; in addition these algorithms extend the state-of-the-art in DDBMS concurrency control, because most of them are new.

We begin by defining (in Section 2) a standard terminology for describing DDBMS concurrency control algorithms and a standard model of the DDBMS environment. Using this terminology and model as a foundation, we decompose the problem of concurrency control into the sub-problems of read-write and write-write synchronization (in Section 3). In Section 4 we present a series of timestamp-based algorithms (called synchronization techniques) that achieve read-write and/or write-write synchronization. Finally in Section 5 we show how each read-write technique can be integrated with each write-write technique to form a complete and correct concurrency control algorithm.

This work is part of a larger study of concurrency control [BG2] that considers locking-based synchronization techniques in addition to timestamp-based ones.

2. Transaction Processing Model

To understand how a concurrency control algorithm operates, one must understand how the algorithm fits into an overall DDBMS. In this section we present a simple model of a DDBMS, emphasizing how the DDBMS processes transactions.

2.1 Preliminary Definitions

A distributed database management system (DDBMS) is a collection of sites interconnected by a network. Each site is a computer running one or both of the following software modules: a transaction manager (TM) or a data manager (DM). Briefly, TMs supervise user interactions with the DDBMS while DMs manage the actual database. A network is a computer-to-computer communication system. The network is assumed to be perfectly reliable -- if site A sends a message to site B, site B is guaranteed to receive the message without error. In addition, we assume that between any pair of sites the network delivers messages in the order they were sent.

From a user’s perspective, a database consists of a collection of logical data items, denoted X,Y,Z,... We leave the granularity of logical data items unspecified. In practice, logical data items may be files, records, etc. A logical database state is


an assignment of values to the logical data items comprising a database. Each logical data item may be stored at any DM in the system or redundantly at several DMs. A stored copy of a logical data item is called a stored data item; x1,...,xm denote the stored copies of logical data item X. When no confusion is possible we use the term data item for stored data item. A stored database state is an assignment of values to the stored data items of a database.

Users interact with the DDBMS by executing transactions. Transactions may be on-line queries expressed in a self-contained query language; application programs written in a general-purpose programming language; etc. The concurrency control algorithms we study pay no attention to the computations performed by transactions. Instead these algorithms make all of their decisions based on the data items a transaction reads and writes, and so the detailed form of transactions is unimportant in our analysis. However we do assume that transactions represent complete and correct computations; i.e. each transaction if executed alone on an initially consistent database would terminate, output correct results, and leave the database consistent. The logical readset (resp. writeset) of a transaction is the set of logical data items the transaction reads (resp. writes). Stored readsets and stored writesets are defined analogously. Two transactions are said to conflict if the stored readset or writeset of one intersects the stored writeset of the other.

The correctness of a concurrency control algorithm is defined relative to user expectations regarding transaction execution. There are two correctness criteria. (1) Users expect that each transaction submitted to the system will eventually be executed. And (2) users expect the computation performed by each transaction to be the same whether it executes alone in a dedicated system or in parallel with other transactions in a multiprogrammed system; the attainment of this expectation is the principal issue in concurrency control.

2.2 DDBMS Architecture

A DDBMS contains four components (see fig. 2.1): transactions, TMs, DMs, and data. Transactions communicate with TMs, TMs communicate with DMs, and DMs manage the data. (TMs do not communicate with other TMs, nor do DMs communicate with other DMs.) The interface between transactions and TMs is the external interface of the DDBMS; the interface between TMs and DMs is its internal interface.

Figure 2.1 DDBMS System Architecture [diagram omitted]

TMs supervise transactions. Each transaction executed in the DDBMS is supervised by a single TM, meaning that the transaction issues all of its database operations to that TM. Any distributed computation that is needed to execute the transaction is managed by the TM. Therefore, each transaction believes the system consists of a single TM and multiple DMs.

Four operations are defined at the external interface. Let X be any logical data item. READ(X) returns the value of X in the current logical database state. WRITE(X, new-value) creates a new logical database state in which X has the specified new value. Since transactions are assumed to represent complete computations, we use BEGIN and END operations to bracket transaction executions.

DMs manage the stored database, functioning essentially as back-end database processors. In response to commands from transactions, TMs issue commands to DMs specifying stored data items to be read or written. The details of the TM-DM interface constitute the core of our transaction processing model and are discussed in Sections 2.3 and 2.4. Section 2.3 describes the TM-DM interaction in a centralized database environment. Section 2.4 extends the discussion to a distributed database setting.

2.3 Centralized Transaction Processing Model

A centralized DBMS consists of one TM and one DM executing at the same site. A transaction, T, accesses the DBMS by issuing BEGIN, READ, WRITE, and END operations, which are processed as follows.

BEGIN: The TM initializes a private workspace for T. The private workspace functions as a temporary buffer for values that T writes into the database, and as a cache for values that T reads from the database.
READ(X): The TM looks for a copy of X in T’s private workspace. If the copy exists, its value is returned to T. Otherwise the TM issues a command to the DM asking it to retrieve a stored copy of X from the database. This operation is denoted dm-read(x). The value retrieved by the DM is given to T and put into T’s private workspace.

WRITE(X, new-value): The TM again checks the private workspace. If the workspace has a copy of X, its value is updated to new-value; otherwise a copy of X with that value is created in the workspace. The new value of X is not stored in the database at this time.

END: The TM issues an operation denoted dm-write(x) for each logical data item X updated by T. Each dm-write(x) requests that the DM update the value of X in the stored database to the value of X in T’s local workspace. When all dm-writes are processed, T is finished executing, and its private workspace is discarded.

The DBMS may restart T any time before a dm-write has been processed. The effect of restarting T is to obliterate its private workspace and to re-execute T from the beginning. As we will see, many concurrency control algorithms use transaction restarts as a tactic for attaining correct executions. However, once a single dm-write has been processed, T cannot be restarted. This is because each dm-write permanently installs an update into the database, and we cannot permit the database to reflect partial effects of transactions.

A DBMS can fail in many ways and a detailed treatment of reliability issues is beyond the scope of this paper. However, a reliability problem called atomic commitment has a major impact on concurrency control. Consider a transaction T that updates data items x,y,z,... and suppose the DBMS fails while processing T’s END. If this occurs, some of T’s updates may have been installed in the stored database while others have not, and the database may contain incorrect information (see fig. 2.2). To avoid this problem, the DBMS must ensure that all of a transaction’s dm-writes are processed or none are.

Figure 2.2 The Need for Atomic Commitment

* Consider a database of banking information.

* Suppose Acme Corp.'s savings account has $2,000,000 and its checking account has $500,000. And suppose the DBMS fails while processing the following transaction.

    T: Move $1,000,000 from savings to checking

* In the absence of atomic commitment, the following incorrect execution could occur.

    Execution of T                                        Database
                                                          S $2,000,000   C   500,000
    READ savings                      S $2,000,000
    READ checking                     C    500,000
    Subtract $1,000,000 from savings
    Add $1,000,000 to checking        S $1,000,000
                                      C  1,500,000
    WRITE savings                                         S $1,000,000   C   500,000
    ------SYSTEM CRASHES------
    WRITE checking --- never executed

The “standard” way to implement atomic commitment involves a procedure called two-phase commit [LS, Gray].* Again suppose T is updating x,y,z,... When T issues its END, the first phase of two-phase commit begins. During this phase the DM copies the values of x,y,z,... from T’s private workspace onto secure storage. If the DBMS fails during the first phase, no harm is done, since none of T’s updates have yet been applied to the stored database. During the second phase, the DBMS copies the values of x,y,z,... into the stored database. If the DBMS fails during the second phase, the database may contain incorrect information. However since the values of x,y,z,... are stored on secure storage, this inconsistency can be rectified when the system recovers: the recovery procedure reads the values of x,y,z,... from secure storage and resumes the commitment activity.

To model two-phase commit, it is convenient to add a third TM-DM operation, pre-commit, which instructs the DM to copy a data item from the private workspace to secure storage.

2.4 Distributed Transaction Processing Model

Our model of transaction processing in a distributed environment differs from the centralized case in two areas: how private workspaces are handled, and the implementation of two-phase commit.

Private Workspaces in a DDBMS

In a centralized DBMS we assumed that private workspaces were part of the TM. We also assumed that data could freely move between a transaction and its workspace, and between a workspace and the DM. These assumptions are not appropriate in a DDBMS because TMs and DMs may run at different sites and the movement of data between a TM and a DM can be expensive. To reduce this cost, many DDBMSs employ query optimization procedures which regulate (and hopefully reduce) the flow of data between sites.
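As an illustration of the centralized model of Section 2.3, the following minimal Python sketch (not taken from the paper) plays the banking example of fig. 2.2 through BEGIN, READ, WRITE, and END with two-phase commit. The class names DataManager and TransactionManager, the ready marker that records completion of the first phase, and the crash_after parameter that simulates a failure are all hypothetical and introduced only for this example.

```python
class DataManager:
    """Hypothetical DM: stored database plus secure storage."""

    def __init__(self, initial):
        self.database = dict(initial)  # the stored database
        self.secure = {}               # secure storage for pending updates
        self.ready = False             # assumption: durable "phase 1 done" marker

    def dm_read(self, x):
        return self.database[x]

    def pre_commit(self, x, value):
        # Phase 1: copy the new value onto secure storage only.
        self.secure[x] = value

    def mark_ready(self):
        self.ready = True

    def dm_write(self, x):
        # Phase 2: install the value from secure storage into the database.
        self.database[x] = self.secure.pop(x)
        if not self.secure:
            self.ready = False

    def recover(self):
        # Crash in phase 1: nothing was installed, so drop the pending values.
        if not self.ready:
            self.secure.clear()
            return
        # Crash in phase 2: resume the commitment activity.
        for x in list(self.secure):
            self.dm_write(x)


class TransactionManager:
    """Hypothetical TM for the centralized case (one TM, one DM)."""

    def __init__(self, dm):
        self.dm = dm
        self.workspace = {}
        self.updated = set()

    def begin(self):
        self.workspace, self.updated = {}, set()

    def read(self, x):
        if x not in self.workspace:          # not cached: issue dm-read(x)
            self.workspace[x] = self.dm.dm_read(x)
        return self.workspace[x]

    def write(self, x, value):
        self.workspace[x] = value            # database is not touched yet
        self.updated.add(x)

    def end(self, crash_after=None):
        items = sorted(self.updated)
        for x in items:                      # phase 1: pre-commits
            self.dm.pre_commit(x, self.workspace[x])
        if items:
            self.dm.mark_ready()
        for done, x in enumerate(items):     # phase 2: dm-writes
            if done == crash_after:
                return False                 # simulated failure mid-commit
            self.dm.dm_write(x)
        self.workspace, self.updated = {}, set()
        return True


def transfer_with_crash():
    dm = DataManager({"savings": 2_000_000, "checking": 500_000})
    tm = TransactionManager(dm)
    tm.begin()
    tm.write("savings", tm.read("savings") - 1_000_000)
    tm.write("checking", tm.read("checking") + 1_000_000)
    tm.end(crash_after=1)                    # fail after one dm-write
    before = dict(dm.database)               # partially updated database
    dm.recover()
    return before, dict(dm.database)
```

Calling transfer_with_crash() returns the database as it stood right after the simulated crash, when only one of the two updates had been installed, and the database after recovery, when both have been. The ready marker is one possible way to let recovery tell a first-phase failure from a second-phase failure; the paper itself does not prescribe a mechanism.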
For example, in SDD-1 the private workspace for transaction T is distributed across all sites at which T accesses data [GBWRR]. The details of how T reads and writes data in these workspaces is a query optimization problem, and has no direct effect on concurrency control. Consequently, we factor this issue out of our model for distributed transaction processing.

In detail, our model of distributed transaction execution is as follows.

1. When transaction T issues its BEGIN operation, T’s TM creates a private workspace for T. The location and organization of this workspace is left unspecified.
2. When T issues a READ(X) operation, the TM checks T’s private workspace to see if a copy of X is present. If so, the value of that copy is made available to T. Otherwise the TM selects some stored copy of X, say xi, and issues dm-read(xi) to the DM at which xi is stored. The DM responds by retrieving the stored value of xi from the database, placing this value in the private workspace. The TM then returns this value to T.
3. When T issues a WRITE(X, new-value) operation, the value of X in T’s private workspace is updated to new-value, assuming the workspace contains a copy of X. Otherwise, a copy of X with the new value is created in the workspace.
4. When T issues its END operation, two-phase commit begins. For each X updated by T, and for each stored copy xi of X, the TM issues a pre-commit(xi) to the DM that stores xi. The DM responds by copying the value of X from T’s private workspace onto secure storage internal to the DM. After all pre-commits are processed, the TM issues dm-writes for all copies of all logical data items updated by T. A DM responds to dm-write(xi) by copying the value of xi from secure storage into the stored database. After all dm-writes are installed, T’s execution is finished.

Two-Phase Commit in a DDBMS

The problem of atomic commitment is aggravated in a DDBMS by the possibility of one site failing while the remainder of the system continues to operate. Suppose T is updating x,y,z,... stored at DMx, DMy, DMz,... (resp.) and suppose T’s TM fails after issuing the dm-write(x), but before issuing the dm-writes for y,z,... At this point, the database contains incorrect information as illustrated in fig. 2.2. In a centralized DBMS, this phenomenon is not harmful because no transaction can access the database until the TM recovers from the failure. However, in a DDBMS, other TMs remain operational, and the incorrect database can be accessed from these TMs.

To avoid this problem, each DM that receives a pre-commit must be able to determine which other DMs are involved in the commitment activity. (This information could be a parameter to the pre-commit operation, stored in a private workspace, etc.) If T’s TM fails before issuing all dm-writes, the DMs whose dm-writes were not issued can recognize the situation and consult the other DMs involved in the commitment. If any DM received a dm-write, the remaining ones act as if they had also received the command. Thus, if any DM applies an update to the database, they all do (see also [HS2]).

3. Decomposition of Concurrency Control Problem

In this section we review concurrency control theory with two objectives: to define “correct executions” in precise terms, and to decompose the concurrency control problem into more tractable sub-problems.

3.1 Serializability

Let E denote an execution of transactions T1,...,Tn. E is a serial execution if no transactions ever execute concurrently in E; i.e., each transaction is executed to completion before the next one begins. Every serial execution is defined to be correct, because the properties of transactions (see Section 2.1) imply that a serial execution terminates properly and preserves database consistency. An execution is serializable if it is computationally equivalent to a serial execution, that is, if it produces the same output and has the same effect on the database as some serial execution. Since serial executions are correct and every serializable execution is equivalent to a serial one, every serializable execution is also correct. The goal of database concurrency control is to ensure that all executions are serializable.

The only operations that access the stored database are dm-read and dm-write. Hence, insofar as serializability is concerned, it is sufficient to model an execution of transactions by the execution of dm-reads and dm-writes at the various DMs of the DDBMS. In this spirit, we formally model an execution of transactions by a set of logs, one log per DM. Each log indicates the order in which dm-reads and dm-writes are processed at one DM (see fig. 3.1).

Figure 3.1 Modelling Executions as Logs

    Transactions                                Database
    T1: BEGIN; READ(X); WRITE(Y); END           DM A: x1, y1
    T2: BEGIN; READ(Y); WRITE(Z); END           DM B: y2, z2
    T3: BEGIN; READ(Z); WRITE(X); END           DM C: z3

One possible execution of T1, T2, and T3 is represented by the following logs. (Note: ri[x] denotes the operation dm-read(x) issued by Ti; wi[x] has the analogous meaning.)

    Log for DM A: r1[x1] w1[y1] r2[y1] w3[x1]
    Log for DM B: w1[y2] w2[z2]
    Log for DM C: w2[z3] r3[z3]

An execution modelled by a set of logs is serial if (1) for each log, and for each pair of transactions Ti and Tj whose operations appear in the log, either all of Ti’s operations precede all of Tj’s operations, or vice versa; and (2) for each pair of transactions, Ti and Tj, if Ti’s operations precede Tj’s operations in one log, then Ti’s operations precede Tj’s operations in every log in which operations from both Ti and Tj appear (see fig. 3.2). Intuitively, (1) says that at each DM no two transactions are interleaved, and (2) says that transactions execute in the same order at all DMs.

Figure 3.2 Serial and Non-Serial Logs

The execution modelled in figure 3.1 is serial. Condition (1) holds since each log is itself serial -- i.e., there is no interleaving of operations from different transactions. Condition (2) holds since at DM A, T1 precedes T2 precedes T3; at DM B, T1 precedes T2; and at DM C, T2 precedes T3.

The following execution is not serial; it satisfies (1) but not (2).

    DM A: r1[x1] w1[y1] r2[y1] w3[x1]
    DM B: w2[z2] w1[y2]
    DM C: w2[z3] r3[z3]

The following execution is also not serial; it doesn't satisfy (1) or (2).

    DM A: r1[x1] r2[y1] w3[x1] w1[y1]
    DM B: w2[z2] w1[y2]
    DM C: w2[z3] r3[z3]

Two operations conflict if they operate on the same data item and one of the operations is a dm-write. The order in which operations execute is computationally significant iff the operations conflict. To illustrate the notion of conflict, consider a data item x and transactions Ti and Tj. If Ti issues dm-read(x) and Tj issues dm-write(x), the value read by Ti will (in general) differ depending on whether the dm-read precedes or follows the dm-write. Similarly, if both transactions issue dm-write(x) operations, the final value of x depends on which dm-write happens last. Those conflict situations are called read-write conflicts and write-write conflicts respectively.

The notion of conflict helps characterize the equivalence of executions. Let E1 and E2 be two executions, modelled by logs {L1,1,...,L1,n} and {L2,1,...,L2,n} where Li,j models the execution at DMj for Ei. E1 and E2 are computationally equivalent if [PBR, Papadimitriou]: for each j, 1 <= j <= n, L1,j and L2,j contain the same set of dm-reads and dm-writes and each pair of conflicting operations appears in the same relative order in both logs. Intuitively, computational equivalence must hold in this case because (1) each dm-read operation reads data item values that were produced by the same dm-writes in both executions; and (2) the final dm-write on each data item is the same in both executions. Condition (1) ensures that each transaction reads the same input in both executions (and therefore performs the same computation). Combined with (2), it ensures that both executions leave the database in the same final state.

We can now characterize serializable executions precisely.

Theorem 1 [PBR, Papadimitriou, SLR] Let T = {T1,...,Tn} be a set of transactions and let E be an execution of these transactions modelled by logs {L1,...,Ln}. E is serializable if there exists a total ordering of T such that for each pair of conflicting operations Oi and Oj from distinct transactions Ti and Tj (resp.), Oi precedes Oj in a log iff Ti precedes Tj in the total ordering.

The total order hypothesized in Theorem 1 is called a serialization order. A serialization order indicates a serial execution of the transactions that is computationally equivalent to the original execution E. Thus, if the transactions had executed serially in the hypothesized order, the computation
performed by the transactions would have been iden- Theorem 2 Let ->rwr and ->ww be associated with
tical to the computation represented by E. execution E. Then E is serializable if (a) ->rwr
and ->ww are acyclic, and (b) there is a total or-
To attain serializability, the DDBKS must guarantee dering of the transactions consistent both with all
that all executions satisfy the condition of ->rwr and all ->ww relationships.
Theorem 1. Those conditions require that conflict-
ing dm-reads and dm-writes be processed in certain Theorem 2 emphasizes a point overlooked in Theorem
relative orders. Concurrency control is the activ- 1: read-write and write-write conflicts interact
ity of controlling the relative order of conflict- only insofar as there must be a total ordering of
ing operations; an algorithm to perform such con- the transactions consistent with both types of con-
trol is called a synchronization technique. so, to flicts. This suggests that read-write and write-
be correct, a DBNS must incorporate synchronization write conflicts can, to some extent, be synchron-
techniques that guarantee the conditions of Theorem ized independently. We can use one technique to
1. guarantee an acyclic ->rwr relation (which amounts
to read-write svnchronization) and a different
technique to guarantee an acyclic ->ww relation
3.2 A Paradigm for Concurrency Control (write-write svnchronization). However, Theorem 2
says that having both ->rwr and ->ww acyclic is not
In Theorem 1, read-write and write-write conflicts enough. There must also be one serial order con-
are treated together under the general notion of sistent with & -> relations. This serial order
conflict . However, we can decompose the concept of is the cement that binds together the read-write
serializability by distinguishing these two types and write-write synchronization techniques.
of conflict. Let E be an execution modelled by a
set of logs. I!e define three binary relations on Decomposing the serializability problem into the
transactions in E, denoted ->rw, --)wr, and ->ww. problems of read-write and write-write synchroniza-
For each pair of transactions, Ti and T. tion is the cornerstone of our paradigm for concur-
3 rency control. In Section 4 we describe algorithms
Ti reads that accomplish read-write (rw) and/or write-write
1. Ti ->rw Tj iff in some log of E,
(wwl synchronization, and in Section 5 we show how
some data item into which Tj subsequently to combine rw and ww synchronization algorithms
into correct concurrency control algorithms. It
writes;
will be important hereafter to distinguish algo-
2. Ti ->wr Tj iff in some log of E, Ti writes
rithms that attain rw and/or ww synchronization
into some data item that Tj subsequently from algorithms that solve the entire distributed
concurrency control problem. We shall use SJg-
reads ;
chronization technique for the former type of algo-
3. Ti ->ww Tj iff in some log of E, Ti writes
rithm, and concurrency control method for the
into some data item into which T. subse- latter.
J
quently writes.

Rotationally, we use ->rwr = (->rw U --)wr) and -> =


(->rwr U ->ww).
4. Timestamp Ordering (T/O) Techniques
Intuitively, -> (with any subscript) means “in any
serialization must precede”. For example, T. ->rw
1
Tj means ” Ti in any serialization must precede T.“.
4.1 Specification
This interpretation follows from Theorem 1: If ‘Ti
reads x before T. writes into x, then the hypothet- Timestamp ordering (T/O) is a technique whereby a
3
ical serialization in Theorem 1 must have Ti pre- serialization order is selected a priori and trans-
.action execution is forced to obey this order.
ceding T.. When a transaction begins, its TM creates a unique
J timestamp for it by reading the local clock time
and appending a unique TM identifier to the low
Every conflict between operations in E is repre- order bit. The TM also agrees not to assign
sented b an -> relationship. Therefore,+we can another timestamp until the next clock tick. Thus
restate T t eorem 1 in terms of ->. According to
timestamps assigned by different TMs differ in
Theorem 1, E is serializable if there is a total
their low order bits while timestamps assigned by
order of transactions that is consistent with the
the same TM differ in their high order bits, and SO
order of all conflicts. In terms of ->, this means
all timestamps are unique system-wide. (Kotice
that E is serializable if there is a total order of
that this algorithm does not require that clocks at
transactions that is consistent with ->. This
different sites be synchronized.)
latter condition holds iff -> is acyclic ( A rela-
tion, ->, is acyclic if there is no sequence il -,
The TM attaches the timestamp to all dm-read and
i2, i2 -> i3,..., in-l -> in such that il = in. 1 dm-write operations issued on behalf of the trans-
action. DMs are required to process conflicting op-
In addition, we can decompose -> into its compo-
theorem in erations in timestamp order. The definition of
nent s, ->rwr and ->ww, and restate the
conflicting ouerations depends on the type of syn-
terms of these components.

290
chronization being perforted. For rw synchroniza- restart policy can lead to a cyclic restart situa-
tion, two operations conflict iff both operate on tion, meaning that some transaction can be continu-
the same data item and one is a dm-read r;ni the ally restarted without ever f i.nishing. Cyclic re-
other is a dm-write. For ww sync’hronization, two start can ‘be avoided by assigning an especially
operations conflict iff both operate on the same large timestamp to the transaction, thereby reduc-
data item and both are dm-writes. ing the probability of a subsequent restart. Other
restart policies are discussed in later sections.
It is easy to prove that T/O attains an acyclic
->rm (resp. -->ww) relation when used for rw (resp. This implementation of T/O requires a substantial
~7) synchronization. Since ezch DI’ processes con- amount of storage for maintaining timestamps.
flicting operations in timestamp order, each edge Techniques for reducing this storage requirement
of the ->rwr (resp. ->ww) relation is in timestamp are discussed in Section 4.E.
order. Gence, all paths in the relation are in
timestamp order and, since all transactions have
unio ue timestamps, no cycles are possible. In ad- 4.3 The Thomas Krite Rule
dition, the timestamp order is a valid serializa-
tion order. For w synchronization the basic T/O scheduler can
be optimized using an observation of [Thomas 1,21.
suppose the timestamp of a dm-write(x) is smaller
4.2 Basic Implementation

An implementation of T/O amounts to building a scheduler, a software module that receives dm-read and dm-write operations and outputs these operations according to the T/O specification. In practice, pre-commits must also be processed through the T/O scheduler for two-phase commit to operate properly. In Sections 4.1-4.8 we describe T/O implementations without considering the impact of two-phase commit. Section 4.9 considers two-phase commitment issues.

The basic T/O implementation distributes the schedulers along with the database. Consider the T/O scheduler at some particular DM. For each data item x stored at the DM, the scheduler keeps track of the largest timestamp of any dm-read (resp. dm-write) that has operated on x. This timestamp is denoted R-timestamp(x) (resp. W-timestamp(x)).

For rw synchronization the basic T/O scheduler operates as follows. To process a dm-read(x), the scheduler compares the timestamp of the dm-read to W-timestamp(x). If the former timestamp is larger, the scheduler outputs the dm-read and updates R-timestamp(x) to the maximum of (a) the old R-timestamp(x) or (b) the timestamp of the dm-read. If the timestamp of the dm-read is smaller than W-timestamp(x), the dm-read is rejected and the issuing transaction is aborted. Similarly, to process a dm-write(x), the scheduler compares the timestamp of the dm-write to R-timestamp(x). If the former timestamp is larger, the dm-write is output and W-timestamp(x) is updated to the maximum of (a) the old W-timestamp(x) or (b) the timestamp of the dm-write. Otherwise, the dm-write is rejected and the transaction is aborted.

For ww synchronization, the T/O scheduler operates as follows. To process a dm-write(x), the scheduler compares the timestamp of the dm-write to W-timestamp(x). If the dm-write has a larger timestamp, the dm-write is output and W-timestamp(x) is set equal to the timestamp of the dm-write. Otherwise, the dm-write is rejected and the transaction is aborted.

When a transaction is aborted, it is assigned a larger timestamp by its TM and is restarted. This [...]

[...] than W-timestamp(x). Instead of rejecting the dm-write (and restarting the issuing transaction), we can simply ignore the dm-write. We call this the Thomas Write Rule (TWR).

Intuitively, TWR only applies to a dm-write that tries to put obsolete information into the database. The rule guarantees that the effect of applying a set of dm-writes to x is identical to what would have happened had the dm-writes been applied in timestamp order.

4.4 Multi-Version T/O

For rw synchronization the basic T/O scheduler can be improved by using the multi-version data item concept of [Reed]. For each data item x we maintain a set of R-timestamps, and a set of <W-timestamp, value> pairs (called versions). The R-timestamps of x record the timestamps of all dm-reads that have ever read x; the versions record the timestamps of all dm-writes that have ever written into x, along with the values written.

Using multi-versions, one can achieve rw synchronization without ever rejecting dm-reads. Consider a dm-read(x) with timestamp TS. To process this operation, we simply read the version(x) with largest timestamp less than TS; see fig. 4.1a. However, dm-writes can still be rejected. Consider a dm-write(x) with timestamp TS1, and let TS2* be the smallest W-timestamp greater than TS1; see fig. 4.1b. If any R-timestamp lies between TS1 and TS2 then the dm-write is rejected. If no R-timestamp lies in that range, then the scheduler outputs the dm-write; this causes a new version of x to be created with timestamp TS1.

To prove the correctness of this technique, consider a dm-read(x) with timestamp TS1 that is processed "out of order"; i.e., suppose the dm-read(x) has timestamp TS1 yet it is processed after some dm-write(x) with a larger timestamp TS2. The dm-read ignores all versions(x) with timestamps
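As an aside on Sections 4.2 and 4.3, the basic T/O rules and the Thomas Write Rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class and method names are hypothetical, timestamps are positive integers, an operation whose timestamp equals the stored timestamp is accepted (the text does not discuss ties), and the rw and ww rules are applied together by one scheduler.

```python
from dataclasses import dataclass, field


@dataclass
class BasicTOScheduler:
    """Scheduler state at one DM (hypothetical names, not from the paper)."""
    use_twr: bool = False
    r_ts: dict = field(default_factory=dict)   # x -> R-timestamp(x)
    w_ts: dict = field(default_factory=dict)   # x -> W-timestamp(x)
    store: dict = field(default_factory=dict)  # x -> current value

    def dm_read(self, x, ts):
        # rw rule: a dm-read older than W-timestamp(x) is rejected.
        if ts < self.w_ts.get(x, 0):
            return ("reject", None)
        self.r_ts[x] = max(self.r_ts.get(x, 0), ts)
        return ("output", self.store.get(x))

    def dm_write(self, x, ts, value):
        # rw rule: a dm-write older than R-timestamp(x) is rejected.
        if ts < self.r_ts.get(x, 0):
            return "reject"
        # ww rule: a dm-write older than W-timestamp(x) is rejected by
        # basic T/O, or ignored under the Thomas Write Rule.
        if ts < self.w_ts.get(x, 0):
            return "ignore" if self.use_twr else "reject"
        self.w_ts[x] = ts
        self.store[x] = value
        return "output"
```

With `use_twr=True` an obsolete dm-write leaves the stored value untouched and no transaction is restarted; with the default setting the same dm-write is rejected.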

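The multi-version rules of Section 4.4 can be illustrated in the same spirit. The sketch below is an assumption-laden illustration, not the authors' implementation: the class name is hypothetical, timestamps of distinct versions are assumed distinct, and "between TS1 and TS2" is read as a strict inequality on both sides. The data reproduce the example of Figure 4.1.

```python
import bisect
import math


class MultiVersionItem:
    """Versions and R-timestamps of one data item x (hypothetical names)."""

    def __init__(self, versions, r_timestamps=()):
        # versions: iterable of (W-timestamp, value) pairs
        self.versions = sorted(versions, key=lambda p: p[0])
        self.r_timestamps = sorted(r_timestamps)

    def dm_read(self, ts):
        # Read the version with the largest W-timestamp less than ts.
        w_ts = [w for w, _ in self.versions]
        i = bisect.bisect_left(w_ts, ts) - 1
        if i < 0:
            raise LookupError("no version older than the dm-read")
        bisect.insort(self.r_timestamps, ts)
        return self.versions[i][1]

    def dm_write(self, ts1, value):
        # TS2 is the smallest W-timestamp greater than TS1 (infinity if none).
        later = [w for w, _ in self.versions if w > ts1]
        ts2 = min(later) if later else math.inf
        # Reject if some dm-read between TS1 and TS2 has already been output.
        if any(ts1 < r < ts2 for r in self.r_timestamps):
            return False
        self.versions.append((ts1, value))
        self.versions.sort(key=lambda p: p[0])
        return True


x = MultiVersionItem(
    [(5, "V1"), (10, "V2"), (20, "V3"), (92, "Vn-1"), (100, "Vn")],
    r_timestamps=[5, 7, 15, 92],
)
value_read = x.dm_read(95)         # reads the version written at 92
accepted = x.dm_write(93, "V")     # rejected: R-timestamp 95 lies in (93, 100)
```

A dm-write with timestamp 96 would be accepted in this state, since no recorded R-timestamp lies between 96 and 100.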
Figure 4.1 Multi-version Reading and Writing

a) Let us represent the versions of a data item x on a "time line":

   values          V1    V2    V3    ...   Vn-1   Vn
   W-timestamps    5     10    20    ...   92     100

   To process a dm-read(x) with timestamp 95, find the biggest W-timestamp less than 95; in this case 92. That is the version you read. So in this case, the value read by the dm-read is Vn-1.

b) Let us represent the R-timestamps of x similarly:

   R-timestamps    5     7     15    ...   92     95
   values          V1    V2    V3    ...   Vn-1   Vn
   W-timestamps    5     10    20    ...   92     100

   To process a dm-write(x) with timestamp 93, we create a new version of x with that timestamp:

   R-timestamps    5     7     15    ...   92     95
   values          V1    V2    V3    ...   Vn-1   V     Vn
   W-timestamps    5     10    20    ...   92     93    100

   However, this new version "invalidates" the dm-read of part (a), because if the dm-read had arrived after the dm-write, it would have read value V instead of Vn-1. Therefore, we must reject the dm-write.

larger than TS1; thus, the value read by the dm-read equals the value it would have read had it been processed "in order". Now consider a dm-write(x) with timestamp TS1 that is processed "out of order". I.e., suppose the dm-write is processed after some dm-read with a larger timestamp TS2. Since the dm-write was not rejected, there must exist a version(x) with timestamp TS' such that TS1 < TS' < TS2. Again the effect is identical to that of a timestamp ordered execution. Q.E.D.

Notice that the multi-version concept achieves ww synchronization "automatically"; insofar as ww synchronization is concerned, multi-versions are an embellished implementation of TWR.

It is usually not possible to keep all versions forever, so a technique for forgetting (i.e., deleting) versions is needed (see Section 4.8).

4.5 Conservative T/O

Conservative timestamp ordering is a technique for eliminating restarts during T/O scheduling [BP, BSR, HV, KNTH, SM1, SM2]. When a scheduler receives an operation O that might cause a future restart, the scheduler delays O until it is certain that no future restarts are possible.

Imagine that each T/O scheduler has a set of input queues, one R-queue and one W-queue per TM. Each R-queue (resp. W-queue) is a FIFO channel transmitting dm-reads (resp. dm-writes) from one TM to one scheduler. In addition, each TM is required to place operations into any given queue in timestamp order.

This structure can be used for rw synchronization as follows. Suppose scheduler Sj receives a dm-read(x) with timestamp TS. If Sj outputs this dm-read "too early", subsequent dm-writes may have to be rejected. Sj can avoid this possibility by scanning its W-queues and only outputting the dm-read if (a) every W-queue is non-empty, and (b) the first dm-write on each W-queue has timestamp greater than TS. This guarantees that Sj will not output the dm-read until it has processed every dm-write with timestamp less than TS that Sj will ever receive. To avoid the rejection of dm-reads, Sj can use multi-version T/O, or it can delay the processing of dm-writes until it has processed all dm-reads with smaller timestamps using an algorithm similar to the above.

For ww synchronization, the scheduler need only wait until every W-queue is nonempty and then output the dm-write with smallest timestamp. If conservative T/O is used for both rw and ww synchronization, the scheduler waits until every queue is nonempty and then outputs the operation with smallest timestamp.

The above implementation of conservative T/O suffers three major problems. First, the implementation does not guarantee termination -- if some TM never sends an operation to some scheduler, the scheduler will "get stuck" due to the empty queue and will never output any operations. Second, the implementation requires that all TMs communicate regularly with all schedulers -- this is infeasible in large networks. Third, the implementation is overly conservative -- e.g., the combined rw and ww algorithm processes all operations in timestamp order, not merely conflicting operations. The first two problems are addressed below. The third is considered in Section 4.6.

Guaranteeing Termination -- Null Operations

To guarantee termination, we require that TMs periodically send timestamped null-operations to each scheduler, in the absence of any "real" traffic. A null-operation is a dm-read or dm-write that does not reference a data item. When TMi sends a null-dm-read (resp. null-dm-write) with timestamp TS to
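As an aside on the queue-based rules of Section 4.5, the two waiting conditions can be sketched as follows. The function names and the representation of a W-queue as a deque of (timestamp, item, value) tuples, one deque per TM, are assumptions made for illustration only.

```python
from collections import deque


def can_output_read(read_ts, w_queues):
    """Conservative rw rule: a dm-read may be output only if every W-queue
    is non-empty and its first dm-write has a timestamp above read_ts."""
    return all(len(q) > 0 and q[0][0] > read_ts for q in w_queues.values())


def next_write(w_queues):
    """Conservative ww rule: once every W-queue is non-empty, output the
    queued dm-write with the smallest timestamp; otherwise keep waiting."""
    if not w_queues or not all(len(q) > 0 for q in w_queues.values()):
        return None
    tm = min(w_queues, key=lambda name: w_queues[name][0][0])
    return w_queues[tm].popleft()


queues = {"TM1": deque([(12, "x", "a")]), "TM2": deque()}
stuck = next_write(queues)              # None: TM2 has sent nothing yet
queues["TM2"].append((15, "x", "b"))
first = next_write(queues)              # the dm-write with timestamp 12
```

The `stuck` case is exactly the termination problem noted in the text: an empty queue blocks the scheduler until the silent TM sends something.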

scheduler Sj, this signifies that TMi will not send Sj any more dm-reads (resp. dm-writes) with timestamps smaller than TS. Thus, any scheduling decision requiring that Sj receive all dm-reads (resp. dm-writes) from TMi timestamped less than TS can be made after that null-dm-read (resp. null-dm-write) is received. An impatient scheduler can prompt a TM for a null-operation by sending a request-null operation to it.

Avoiding Unnecessary Communication

To avoid unnecessary communication between TMs and schedulers, null-operations with very large timestamps can be used. In extreme cases, TMi can send Sj a null-operation with infinite timestamp, signifying that TMi does not intend to communicate with Sj until further notice. Of course, when TMi needs to send a "real" operation to Sj, some mechanism is required to retract the infinite timestamp and replace it by a finite one.

4.6 Conservative T/O with Transaction Classes

Another technique for reducing communication is transaction classes [BRGP]. Here, we assume that the readset and writeset of every transaction is known in advance. This information is used to group transactions into predefined classes. Class definitions help support a less conservative scheduling policy.

A transaction class is defined by a readset and writeset (see fig. 4.2). Transaction T is a member of class C iff T's readset is a subset of C's readset, and T's writeset is a subset of C's writeset. (Classes need not be disjoint.) Class definitions are not expected to change frequently during normal operation of the system. Changing a class definition is akin to changing the database schema and requires mechanisms beyond the scope of this paper. We assume that class definitions are stored in static tables which are available at any site requiring them.

Figure 4.2 Transaction Classes

* A class is defined by a readset and a writeset. E.g.,

   C1: readset = {x1},      writeset = {y1, y2}
   C2: readset = {x1, y2},  writeset = {y1, y2, z2, z3}
   C3: readset = {y2, z3},  writeset = {x1, z2, z3}

* A transaction is a member of a class if its readset is a subset of the class readset and its writeset is a subset of the class writeset. E.g.,

   T1: readset = {x1},  writeset = {y1, y2}
   T2: readset = {y2},  writeset = {z2, z3}
   T3: readset = {z3},  writeset = {x1}

* T1 is a member of C1 and C2
* T2 is a member of C2 and C3
* T3 is a member of C3

Classes are associated with TMs. Every transaction that executes at a TM must be a member of a class associated with the TM. If a transaction is submitted to a TM at which this property does not hold, the transaction is forwarded to another TM that has an appropriate class. We assume that every class is associated with exactly one TM, and conversely, every TM is associated with exactly one class. We use Ci to denote the class associated with TMi. This notation simplifies our discussion, but does not constrain system operation in any way. For example, to execute transactions that are members of class C1 at two TMs, we define another class C2 with the same readset and writeset as C1 and associate C1 with one TM and C2 with the other. On the other hand, to execute transactions that are members of two classes at one site, we multi-program two TMs at the same site.

Transaction classes are exploited by conservative T/O schedulers as follows. Consider rw synchronization and suppose scheduler Sj wants to output a dm-read(x) with timestamp TS. Instead of waiting for dm-writes with smaller timestamp from all TMs, Sj need only wait for dm-writes from those TMs whose class writeset contains x. Similarly, to process a dm-write(x) with timestamp TS, Sj need only wait for dm-reads with smaller timestamp from those TMs whose class readset contains x. Thus, the level of concurrency in the system is increased. ww synchronization proceeds analogously.

This technique also reduces communication requirements, since a TM need only communicate with a scheduler if its class readset or writeset contains data items protected by the scheduler.

4.7 Conservative T/O with Conflict Graph Analysis

Conflict graph analysis is a technique for further improving the performance of conservative T/O with classes. A conflict graph is an undirected graph that summarizes potential conflicts between transactions in different classes. For each class Ci the graph contains two nodes, denoted ri and wi, which intuitively represent the readset and writeset of Ci. The edges of the graph are defined as follows (see fig. 4.3). (i) For each class Ci there is a vertical edge between ri and wi. (ii) For each pair of classes Ci and Cj (with i ≠ j) there

is a horizontal edge between wi and wj iff the writeset of Ci intersects the writeset of Cj. (iii) For each pair of classes Ci and Cj (with i ≠ j) there is a diagonal edge between ri and wj iff the readset of Ci intersects the writeset of Cj.

Intuitively, a horizontal edge indicates that a scheduler Sk may be forced to delay dm-writes for purposes of ww synchronization. Suppose classes Ci and Cj are connected by a horizontal edge (i.e., there is an edge between wi and wj). Then the class writesets intersect and so, if Sk receives a dm-write from Ci, Sk must delay the dm-write until Sk receives all dm-writes with smaller timestamps from Cj. Similarly, a diagonal edge indicates that Sk may need to delay operations for rw synchronization.

Conflict graph analysis improves the situation by identifying inter-class conflicts that never cause non-serializable behavior. This corresponds to identifying horizontal and diagonal edges that do not require synchronization. In particular, schedulers need only synchronize dm-writes from Ci and Cj if either (1) the edge (wi, wj) is embedded in a cycle of the conflict graph; or (2) portions of the intersection of Ci's writeset and Cj's writeset are stored at two or more DMs [BS]. That is, if conditions (1) and (2) do not hold, a scheduler Sk need not process dm-writes from Ci and Cj in timestamp order. Similarly, dm-reads from Ci and dm-writes from Cj need only be processed in timestamp order if either (1) the edge (ri, wj) is embedded in a cycle of the conflict graph; or (2) portions of the intersection of Ci's readset and Cj's writeset are stored at two or more DMs [BS].

Since classes are defined statically, conflict graph analysis is also performed statically. The output of this analysis is a table indicating which horizontal and diagonal edges require synchronization and which do not. This information, like class definitions, is distributed in advance to all schedulers that require it.

Conservative T/O with conflict graph analysis has been implemented in the SDD-1 DDBMS [BSR]. In principle, conflict graph analysis can be applied to other synchronization techniques to improve their performance as well. Theoretical aspects of this integration are examined in [BSW], but many details remain to be worked out.

4.8 Timestamp Management

A common criticism of T/O schedulers is that too much memory is needed to store timestamps. This problem can be overcome by "forgetting" old timestamps.

Timestamps are used in basic T/O to reject operations that "arrive late", e.g., to reject a dm-read(x) with timestamp TS1 that arrives after a dm-write(x) with timestamp TS2 > TS1. In principle, TS1 and TS2 can differ by an arbitrary amount, but in practice these timestamps are unlikely to differ by more than a few minutes. Consequently we may store timestamps in small tables which are periodically purged.

R-timestamps are stored in the R-table with entries of the form <x, R-timestamp>; for any data item x, there is at most one entry. In addition, there is a variable, R-min, which tells the maximum value of any timestamp that has been purged from the table. To find R-timestamp(x), a scheduler searches the R-table for an <x, TS> entry. If such an entry is found, TS = R-timestamp(x); otherwise, R-timestamp(x) <= R-min and to err on the side of safety, the scheduler assumes R-timestamp(x) = R-min.

Figure 4.3 Conflict Graph

Define C1, C2, C3 as in figure 4.2. The nodes of the graph are labelled with the class readsets and writesets:

   r1: C1 readset = {x1}           r2: C2 readset = {x1, y2}            r3: C3 readset = {y2, z3}
   w1: C1 writeset = {y1, y2}      w2: C2 writeset = {y1, y2, z2, z3}   w3: C3 writeset = {x1, z2, z3}
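The class-membership test of Section 4.6 and the edge rules of Section 4.7 can be sketched as below. This is an illustration under assumptions: the function names are hypothetical, the class definitions follow Figure 4.2 with subscripts as best they can be recovered from the source, and only condition (1), the cycle test, is covered; condition (2), which concerns where data are stored, is not modelled.

```python
from itertools import combinations

# class number -> (readset, writeset); subscripts are a best-effort reading.
CLASSES = {
    1: ({"x1"}, {"y1", "y2"}),
    2: ({"x1", "y2"}, {"y1", "y2", "z2", "z3"}),
    3: ({"y2", "z3"}, {"x1", "z2", "z3"}),
}


def is_member(readset, writeset, cls):
    """T is a member of a class iff both of its sets are subsets."""
    class_readset, class_writeset = cls
    return readset <= class_readset and writeset <= class_writeset


def conflict_graph(classes):
    edges = set()
    for i in classes:
        edges.add(frozenset({("r", i), ("w", i)}))          # vertical
    for i, j in combinations(classes, 2):
        if classes[i][1] & classes[j][1]:
            edges.add(frozenset({("w", i), ("w", j)}))      # horizontal
    for i in classes:
        for j in classes:
            if i != j and classes[i][0] & classes[j][1]:
                edges.add(frozenset({("r", i), ("w", j)}))  # diagonal
    return edges


def on_cycle(edges, u, v):
    """Edge {u, v} lies on a cycle iff v is still reachable from u once
    the edge itself has been removed."""
    rest = edges - {frozenset({u, v})}
    seen, stack = {u}, [u]
    while stack:
        node = stack.pop()
        for edge in rest:
            if node in edge:
                (other,) = edge - {node}
                if other not in seen:
                    seen.add(other)
                    stack.append(other)
    return v in seen


graph = conflict_graph(CLASSES)
needs_sync = on_cycle(graph, ("w", 1), ("w", 2))
```

For two classes whose only overlap is a shared writeset item, the horizontal edge between them lies on no cycle, so under condition (1) alone their dm-writes would not need to be ordered.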

To update R-timestamp(x), the scheduler modifies the <x, TS> entry, if one exists; otherwise, a new entry is created and added to the table. When the R-table is full, the scheduler selects an appropriate value for R-min and deletes all entries from the table with smaller timestamp. W-timestamps are managed similarly; analogous techniques can be devised for multi-version databases.

Maintaining timestamps for conservative T/O is even cheaper, since conservative T/O only requires timestamped operations, not timestamped data. If conservative T/O is used for rw synchronization, the R-timestamps of data items are rendered useless and may be discarded. If conservative T/O is used for both rw and ww synchronization, W-timestamps can be eliminated too.

4.9 Integrating Two-Phase Commit into T/O

It is necessary to integrate two-phase commit into the T/O implementations described above to ensure atomic commitment of updates (see Section 2). This is done by timestamping pre-commits and modifying the T/O implementations to accept or reject pre-commits instead of dm-writes. If a scheduler rejects a pre-commit, the issuing transaction is aborted. However, if a scheduler accepts a pre-commit, it must accept the corresponding dm-write no matter when that operation arrives. To make this guarantee, the scheduler may be forced to delay conflicting operations that arrive before the dm-write.

Integrating Two-Phase Commit Into Basic T/O

Consider a pre-commit(x) with timestamp TS. Let P denote this operation and let W denote the corresponding dm-write. Assume that basic T/O is used for rw synchronization. P can be accepted by a scheduler iff TS > R-timestamp(x); i.e., P is accepted iff the scheduler can still output W. Once the scheduler accepts P, it must guarantee that TS will remain greater than R-timestamp(x) until W is received. To make this guarantee, the scheduler refuses to output any dm-read(x) with timestamp greater than TS, until W is received. All such dm-reads that arrive before W are placed on a waiting queue.

For ww synchronization, P is accepted by the scheduler iff TS > W-timestamp(x). Once the scheduler accepts P, it agrees not to output any dm-write(x) with timestamp greater than TS until it receives W. All such dm-writes that arrive before W are placed on a waiting queue as above.

Integrating Two-Phase Commit Into Thomas Write Rule

TWR applies only to ww synchronization and eliminates the possibility of rejecting dm-writes for purposes of ww synchronization. Hence there is no need to incorporate two-phase commit into the ww synchronization algorithm. Pre-commits must still be sent to all sites being updated, but the pre-commits need not be processed by the ww scheduler.

Integrating Two-Phase Commit Into Multi-Version T/O

Like TWR, multi-versions eliminate the need for two-phase commit insofar as ww synchronization is concerned. However, two-phase commit remains an issue for rw synchronization. Let P be a pre-commit(x) with timestamp TS1 and let W be the corresponding dm-write. When P arrives at a scheduler, the scheduling rule of Section 4.4 is applied: let TS2 be the smallest W-timestamp > TS1; if any R-timestamp lies between TS1 and TS2, P is rejected, otherwise P is accepted. If the scheduler accepts P, it agrees not to output any dm-read(x) with timestamp between TS1 and TS2 until W is received. As before, all such dm-reads that arrive before W are placed on a waiting queue.

Integrating Two-Phase Commit Into Conservative T/O

Two-phase commit need not be tightly integrated into conservative T/O, because dm-writes are never rejected. However, scheduling delay can be reduced by transmitting pre-commits via W-queues. For example, suppose conservative T/O is used for rw synchronization, and suppose scheduler Sj wants to output a dm-read(x) with timestamp TS. Sj need only delay this dm-read until each W-queue contains a pre-commit with timestamp greater than TS; it need not wait for the corresponding dm-writes. (However, the dm-read may have to wait for some dm-writes with smaller timestamp; i.e., if Sj has accepted a pre-commit(x) with timestamp TS' < TS, the dm-read cannot be output until the dm-write(x) with timestamp TS' is received.)

4.10 Heuristics for Reducing Restarts

This section describes three heuristics for reducing the cost or probability of restarts for non-conservative T/O implementations.

Predeclaration of Readsets and Writesets

To reduce the cost of restarts, transactions should issue their dm-reads and pre-commits as early as possible. The extreme version of this heuristic calls for transactions to predeclare their readsets and writesets, so that dm-reads and pre-commits are issued for the entire readset and writeset before a transaction begins its main execution. If no operation is rejected, the transaction is guaranteed to execute with no danger of restart.

Delaying of Operations

To reduce the probability of restart, a scheduler can delay the processing of operations to wait for "earlier" operations (i.e., ones with smaller timestamps) to arrive. This heuristic is essentially a compromise between conservative and non-conservative T/O, and trades response time for a
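As an aside on the R-table of Section 4.8, the lookup, update, and purge steps can be sketched as follows. The class name, the fixed capacity, and in particular the purge policy (dropping roughly the older half of the entries) are assumptions for illustration; the text only says that the scheduler selects "an appropriate value" for R-min.

```python
class RTable:
    """Bounded table of <x, R-timestamp> entries plus the variable R-min."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = {}   # x -> R-timestamp(x)
        self.r_min = 0      # largest timestamp purged so far

    def lookup(self, x):
        # A missing entry may have been purged: err on the side of safety
        # and assume R-timestamp(x) = R-min.
        return self.entries.get(x, self.r_min)

    def update(self, x, ts):
        if x not in self.entries and len(self.entries) >= self.capacity:
            self.purge()
        self.entries[x] = max(self.lookup(x), ts)

    def purge(self):
        # Hypothetical policy: raise R-min to the median stored timestamp
        # and delete every entry that is not newer than it.
        ordered = sorted(self.entries.values())
        self.r_min = max(self.r_min, ordered[(len(ordered) - 1) // 2])
        self.entries = {x: t for x, t in self.entries.items() if t > self.r_min}


table = RTable(capacity=2)
table.update("a", 10)
table.update("b", 20)
table.update("c", 30)     # table is full: the entry for "a" is purged
```

After the purge, `table.lookup("a")` returns R-min, which is never smaller than the true R-timestamp, so a scheduler using the table can only reject more operations than strictly necessary, never fewer.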

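The handling of pre-commits under basic T/O (Section 4.9) can also be sketched. The code below models only the rw side for a single data item x, and it is an illustration under assumptions: the class and attribute names are hypothetical, rejection of dm-reads against W-timestamp(x) is left out, and delayed dm-reads are released in timestamp order once no accepted pre-commit with a smaller timestamp is still awaiting its dm-write.

```python
class PreCommitScheduler:
    """rw side of basic T/O with pre-commits, for one data item x."""

    def __init__(self):
        self.r_ts = 0          # R-timestamp(x)
        self.pending = set()   # timestamps of accepted pre-commits awaiting W
        self.waiting = []      # timestamps of delayed dm-reads
        self.output = []       # operations released, in order

    def pre_commit(self, ts):
        # P is accepted iff TS > R-timestamp(x).
        if ts <= self.r_ts:
            return False
        self.pending.add(ts)
        return True

    def _blocked(self, read_ts):
        return any(read_ts > p for p in self.pending)

    def dm_read(self, ts):
        # A dm-read newer than an accepted pre-commit waits for its dm-write.
        if self._blocked(ts):
            self.waiting.append(ts)
        else:
            self._release(ts)

    def dm_write(self, ts):
        self.pending.discard(ts)
        self.output.append(("write", ts))
        ready = sorted(r for r in self.waiting if not self._blocked(r))
        self.waiting = [r for r in self.waiting if self._blocked(r)]
        for r in ready:
            self._release(r)

    def _release(self, ts):
        self.r_ts = max(self.r_ts, ts)
        self.output.append(("read", ts))


s = PreCommitScheduler()
accepted = s.pre_commit(10)    # accepted: 10 > R-timestamp(x) = 0
s.dm_read(15)                  # delayed behind the pre-commit
s.dm_read(5)                   # output at once
late = s.pre_commit(3)         # rejected: 3 <= R-timestamp(x) = 5
s.dm_write(10)                 # W arrives; the delayed dm-read follows it
```

The waiting queue is what lets the scheduler keep its promise: because the dm-read with timestamp 15 is held back, R-timestamp(x) stays below 10 until W has been output.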
reduction in probability of restart. The amount of delay can be tuned to optimize this trade-off.

Reading Old Versions

The performance of multi-version T/O can be improved by permitting queries (i.e., read-only transactions) to read old versions of data items. Recall that in multi-version T/O, dm-read operations are never rejected but may cause subsequent pre-commits to be rejected. (E.g., once dm-read(x) with timestamp TS is processed, a subsequent pre-commit(x) with timestamp TS', where TS' < TS, may be rejected.) To reduce the probability of rejecting a pre-commit, we may assign old (i.e. small) timestamps to queries. Of course, this also causes the query to read older data. Thus, this technique entails a compromise between system performance and timeliness of data. Little is known about this tradeoff in general, but a good compromise should often be achievable. For example, if queries are assigned timestamps that are five minutes old, we would expect few queries to interfere with updates. And in many applications, five minute old data is perfectly acceptable.

As a fringe benefit, this technique also improves the response time for queries by reducing the probability that a query's dm-reads will be blocked by pre-commits.

5. Integrated T/O Concurrency Control Methods

The synchronization techniques of Section 4 can be integrated to form twelve principal T/O concurrency control methods:

   Method   rw technique         ww technique
   1        basic T/O            basic T/O
   2        basic T/O            Thomas Write Rule (TWR)
   3        basic T/O            multi-version T/O
   4        basic T/O            conservative T/O
   5        multi-version T/O    basic T/O
   6        multi-version T/O    TWR
   7        multi-version T/O    multi-version T/O
   8        multi-version T/O    conservative T/O
   9        conservative T/O     basic T/O
   10       conservative T/O     TWR
   11       conservative T/O     multi-version T/O
   12       conservative T/O     conservative T/O

Each T/O method that incorporates a non-conservative component can be further refined by including (1) techniques for forgetting timestamps (see Section 4.8) and (2) heuristics for reducing restarts (see Section 4.10). Each method that incorporates a conservative component may also incorporate classes (see Section 4.6) and conflict graph analysis (see Section 4.7). Thus, these 12 principal methods produce over 50 distinct methods. In this section we describe the twelve principal methods in detail.

5.1 Using Basic T/O for rw Synchronization

Methods 1-4 use basic T/O for rw synchronization. Each stored data item, e.g. xi, has an R-timestamp and a W-timestamp. Let T be a transaction with timestamp TS. To read xi, T issues a dm-read(xi) with timestamp TS; this dm-read is accepted iff TS > W-timestamp(xi). To write xi, T issues a pre-commit(xi) with timestamp TS; this pre-commit is accepted iff (a) TS > R-timestamp(xi), and (b) a condition determined by the ww synchronization technique is also satisfied.

Method 1 -- Basic T/O for ww synchronization. The pre-commit is accepted iff TS > R-timestamp(xi) and TS > W-timestamp(xi).

Method 2 -- TWR for ww synchronization. The pre-commit is accepted iff TS > the largest R-timestamp(xi). However, if the pre-commit is accepted and TS < the W-timestamp(xi), the corresponding dm-write has no effect on the database. This method represents an optimization of Method 1 that is apparently preferable in most situations.

Method 3 -- Multi-version T/O for ww synchronization. The pre-commit is accepted iff TS > R-timestamp(xi); the W-timestamp is irrelevant. If the pre-commit is accepted, the corresponding dm-write creates a new version of xi. While this method appears to be a space-inefficient version of Method 2, it can yield better performance by letting queries read old versions of data items; see Section 4.10.

Method 4 -- Conservative T/O for ww synchronization. Pre-commits are processed by each scheduler in timestamp order. I.e., a scheduler S will not process a pre-commit with timestamp TS until it has processed all pre-commits with smaller timestamp. When S processes a pre-commit(xi) with timestamp TS, it accepts the pre-commit iff TS > R-timestamp(xi). At first glance this method appears to be a time-inefficient version of Method 2. However, unlike Method 2, this method applies updates to each DM in timestamp order. Consequently, the database at each DM is always consistent between updates, a property which may be useful for reliability reasons.

5.2 Using Multi-version T/O for rw Synchronization

Methods 5-8 use multi-version T/O for rw synchronization. Let T be a transaction with timestamp TS. To read xi, T issues a dm-read(xi) with timestamp TS; this dm-read is always accepted. To write xi, T issues a pre-commit(xi) with timestamp TS; this pre-commit is accepted iff (a) there is no R-timestamp that lies between TS and the smallest

W-timestamp larger than TS, and (b) a condition determined by the ww synchronization technique is also satisfied.

Method 5 -- Basic T/O for ww synchronization. For basic T/O, condition (b) requires that TS be greater than the largest W-timestamp(xi). So, for Method 5, conditions (a) and (b) may be simplified: The pre-commit is accepted iff TS > largest R-timestamp(xi) and the largest W-timestamp(xi). If the pre-commit is accepted, the corresponding dm-write creates a new version of xi.

Method 6 -- TWR for ww synchronization. This method is incorrect. TWR requires that a dm-write(xi) with timestamp TS be ignored if TS < the maximum W-timestamp(xi). This may cause subsequent dm-reads to read inconsistent data; see fig. 5.1. (Method 6 is the only incorrect method we will encounter.)

Method 7 -- Multi-version T/O for ww synchronization. This achieves the goals of TWR in conjunction with multi-version rw synchronization. The pre-commit is accepted iff condition (a) holds. If the pre-commit is accepted, the corresponding dm-write creates a new version of xi. This method is similar to the algorithms of [Reed, Montgomery].

Method 8 -- Conservative T/O for ww synchronization. A scheduler S will not process a pre-commit with timestamp TS until it has processed all pre-commits with smaller timestamps, and none with larger timestamps. This permits us to simplify the condition for acceptance of a pre-commit: A pre-commit(xi) with timestamp TS is accepted iff TS is greater than the largest R-timestamp(xi).

Figure 5.1 Inconsistent Retrievals in Method 6

* Consider data items x and y with the following versions:

   x:  values          0     100
       W-timestamps    0     100

   y:  values          0
       W-timestamps    0

* Now suppose T has timestamp 50 and writes x := 50, y := 50. Under Method 6, the update to x is ignored, and the result is

   x:  values          0     100
       W-timestamps    0     100

   y:  values          0     50
       W-timestamps    0     50

* Finally, suppose T' has timestamp 75 and reads x and y. The values it will read are x = 0, y = 50, which is incorrect. T' should read x = 50, y = 50.

Systematic Forgetting of Old Versions

In Methods 5 and 8, the versions of each data item xi are created in timestamp order. That is, once a version of xi has been created with timestamp TS, no subsequent transaction can create a version with a smaller timestamp. When this property holds, it is possible to forget (i.e., delete) old versions such that we never delete a version needed by a later transaction.

Let W-max(xi) be the maximum W-timestamp(xi) and W-min be the minimum value of W-max(xi) over all data items xi. Observe that no pre-commit with timestamp smaller than W-min can be accepted in the future: since W-min <= W-max(xi) for all xi, all future update transactions with timestamps less than W-min are guaranteed to be restarted. So, insofar as update transactions are concerned, we can safely forget all versions of every data item timestamped less than W-min. Queries are handled in this framework by interpreting all dm-reads with timestamps less than W-min as if they had timestamps equal to W-min.

Notice also that Methods 5 and 8 only require that the largest R-timestamp of each data item be stored. Smaller R-timestamps may be forgotten at once.

Systematic Reading of Old Versions

Methods 5 and 8 also support a systematic technique for assigning old timestamps to queries (see Section 4.10) so that (a) no dm-read issued by a query will ever cause a pre-commit to be rejected; and (b) the timestamp assigned to the query is the largest one satisfying (a). This technique is similar to the technique for systematic forgetting of old versions.

Let Q be a query. The technique we describe requires that Q's readset be predeclared. Before Q begins its main execution Q's readset is examined; for each xi in the readset, W-max(xi) is ascertained. In addition, we calculate W-min = min{W-max(xi) | xi is in Q's readset}. The timestamp assigned to Q is W-min - 1. The correctness of this technique is shown in [BG2].

5.3 Using Conservative T/O for rw Synchronization

The remaining T/O methods use conservative T/O for
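As an aside on Method 6, the anomaly of Figure 5.1 can be replayed in a few lines. The helper names are hypothetical, each data item is represented as a dictionary from W-timestamp to value, and the acceptance test of condition (a) is omitted because no dm-reads have been recorded when T writes.

```python
def read_version(versions, ts):
    """Multi-version dm-read: value of the version with the largest
    W-timestamp less than ts."""
    return versions[max(w for w in versions if w < ts)]


def write_with_twr(versions, ts, value):
    """ww rule of Method 6: ignore the write if ts < maximum W-timestamp."""
    if ts < max(versions):
        return
    versions[ts] = value


def write_multi_version(versions, ts, value):
    """ww rule of Method 7: an accepted write always adds a version."""
    versions[ts] = value


# State of Figure 5.1, then transaction T (timestamp 50) under Method 6.
x, y = {0: 0, 100: 100}, {0: 0}
write_with_twr(x, 50, 50)      # ignored, since 50 < 100
write_with_twr(y, 50, 50)      # applied
seen_by_t_prime = (read_version(x, 75), read_version(y, 75))

# The same writes under Method 7.
x7, y7 = {0: 0, 100: 100}, {0: 0}
write_multi_version(x7, 50, 50)
write_multi_version(y7, 50, 50)
seen_under_method_7 = (read_version(x7, 75), read_version(y7, 75))
```

Under Method 6 the query T' with timestamp 75 sees T's update to y but not to x; under Method 7 it sees both.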

rw synchronization. In these methods, a scheduler S will not process a dm-read(xi) with timestamp TS until it has processed all pre-commits with smaller timestamps and none with larger timestamps. Symmetrically, S will not process a pre-commit(xi) with timestamp TS until it has processed all dm-reads with smaller timestamps and none with larger timestamps. When S processes a pre-commit(xi) with timestamp TS, its action depends on the ww technique.

Method 9 -- Basic T/O for ww synchronization. The pre-commit is accepted iff TS > W-timestamp(xi).

Method 10 -- TWR for ww synchronization. The pre-commit is always accepted. However, if TS < W-timestamp(xi), the corresponding dm-write has no effect on the database.

Method 10 is essentially the concurrency control of SDD-1 [BSR]. In SDD-1, however, the method is refined in several ways to reduce delay. First, SDD-1 uses classes and conflict graph analysis and requires predeclaration of readsets. In addition, SDD-1 only enforces the conservative scheduling rule on dm-reads, meaning that dm-reads wait for pre-commits, but pre-commits need not wait for all dm-reads with smaller timestamps. Consequently, it is possible for dm-reads to be rejected in SDD-1. The SDD-1 designers accepted this possibility for two reasons: (1) since readsets are predeclared, all dm-reads are issued before the transaction begins its main execution and the cost of rejecting a dm-read is modest. (2) The probability that a dm-read will be rejected can be reduced by assigning large timestamps to transactions. Other techniques for reducing restarts are described by [Lin].

Method 11 -- Multi-version T/O for ww synchronization. The pre-commit is always accepted and the corresponding dm-write always creates a new version of xi. When multi-versions are used, the conservative rw technique can be optimized as follows: a dm-read can never be rejected, and so there is no reason to force pre-commits to wait for dm-reads. (dm-reads must still wait for pre-commits to ensure that pre-commits are never rejected.)

Method 12 -- Conservative T/O for ww synchronization. Scheduler S will not process a pre-commit with timestamp TS until it has processed all pre-commits with smaller timestamps and none with larger timestamps. Combined with conservative rw synchronization, the effect is to process all operations in timestamp order. Method 12 has been recommended by [BP, HV, KNTH, SM1, SM2].

6. Conclusion

We have presented a framework for DDBMS concurrency control and have used that framework to describe a [...] framework has two main parts: (1) a model of distributed transaction execution, in which transactions execute by issuing dm-read, pre-commit, and dm-write operations; and (2) a decomposition of the concurrency control problem into the sub-problems of rw and ww synchronization.

We presented several timestamp-based synchronization techniques for solving each sub-problem. Four of these techniques were deemed to be "principal": basic T/O, the Thomas Write Rule, multi-version T/O, and conservative T/O. These techniques vary substantially in their behavior but are united by a common underlying objective: each technique seeks to execute conflicting operations in timestamp order, or in some equivalent order. Basic T/O achieves this objective by rejecting operations that are received out of timestamp order. The Thomas Write Rule ignores operations that are received out of timestamp order. (This technique is only suitable for ww synchronization.) Multi-version T/O retains multiple "versions" of data items to permit many operations that are received out of order to be executed as if they had been received in order. And conservative T/O delays operations that are received out of order to permit all operations with smaller timestamps to be processed first.

Finally we showed how to integrate any principal rw technique with any principal ww technique to yield a principal concurrency control method. Twelve principal methods can be constructed in this way. Each principal method can be refined by several non-principal techniques so that more than 50 distinct concurrency control algorithms can be built using the framework and material of this paper.

Most of the principal methods we describe are new algorithms. These are Methods 1-4 (which use basic T/O for rw synchronization); Methods 5 and 8 (multi-version T/O for rw, with basic T/O or conservative T/O for ww); and Methods 9 and 11 (conservative T/O for rw, with basic T/O or multi-version T/O for ww). Of the remaining methods, Method 6 (multi-version T/O with TWR) is an incorrect method; Method 7 (multi-version T/O for rw and ww) is similar but not identical to the algorithms of [Montgomery, Reed]; Method 10 (conservative T/O with TWR) is essentially the SDD-1 concurrency control algorithm [BSR]; and Method 12 (conservative T/O for rw and ww) is essentially the algorithm recommended by [BP, HV, KNTH, SM1, SM2].

A major issue we have not addressed concerns the performance of these algorithms. This issue is addressed qualitatively in [BG2]. However, little quantitative performance analysis has been reported in the literature and this remains a topic for future research.

* The term "two-phase commit" is commonly used to denote the distributed version of this procedure. However, since the centralized and distributed versions are identical in structure, we use "two-phase commit" to describe both.

* TS2 equals infinity if TS1 is the largest
number of DDBMSconcurrency control methods. The W-tiZestamp(x). 1

298
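
As an illustration only, the following Python sketch contrasts the three ways of handling out-of-order writes named in this paper: the Thomas Write Rule ignores a late write, multi-version T/O keeps it as an additional version, and conservative T/O delays processing until timestamp order is safe. It is not the SDD-1 code or any published implementation. All class and method names are hypothetical, timestamps are assumed to be unique, and the conservative scheduler assumes that each sender delivers its operations in increasing timestamp order. The read-timestamp bookkeeping needed for rw synchronization is omitted.

```python
import bisect


class ThomasWriteRuleItem:
    """One data item under the Thomas Write Rule (ww synchronization only)."""

    def __init__(self, value=None):
        self.value = value
        self.w_timestamp = float("-inf")  # timestamp of the last applied write

    def write(self, ts, value):
        # A write that arrives too late is ignored rather than rejected:
        # it has no effect on the stored value.
        if ts < self.w_timestamp:
            return False
        self.value = value
        self.w_timestamp = ts
        return True


class MultiVersionItem:
    """One data item under multi-version T/O: every write adds a version."""

    def __init__(self):
        self.timestamps = []  # W-timestamps, kept sorted
        self.values = []      # values[i] was written with timestamps[i]

    def write(self, ts, value):
        # The write is always accepted and inserted in timestamp position.
        i = bisect.bisect_left(self.timestamps, ts)
        self.timestamps.insert(i, ts)
        self.values.insert(i, value)

    def read(self, ts):
        # Return the version with the largest W-timestamp smaller than ts.
        i = bisect.bisect_left(self.timestamps, ts)
        return self.values[i - 1] if i > 0 else None


class ConservativeWriteScheduler:
    """Conservative T/O for writes: delay until timestamp order is safe.

    Assumption (hypothetical): each sender submits operations in increasing
    timestamp order, so the smallest buffered timestamp can be processed once
    every sender has at least one operation buffered.
    """

    def __init__(self, senders):
        self.queues = {s: [] for s in senders}  # expects at least one sender
        self.processed = []  # (ts, value) pairs in processing order

    def submit(self, sender, ts, value):
        self.queues[sender].append((ts, value))
        self._drain()

    def _drain(self):
        # Process operations only while no sender's queue is empty.
        while all(self.queues.values()):
            sender = min(self.queues, key=lambda s: self.queues[s][0][0])
            self.processed.append(self.queues[sender].pop(0))
```

In this sketch a write with timestamp 3 that arrives after a write with timestamp 5 is dropped by `ThomasWriteRuleItem`, stored as an older version by `MultiVersionItem`, and could not arrive late at all under `ConservativeWriteScheduler`, because the write with timestamp 5 would have been held back until every sender had something buffered.
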
References

[BG1]
Bernstein, P.A., and Goodman, N., "Approaches to Concurrency Control in Distributed Databases", Proc. 1979 National Computer Conf., June 1979.

[BG2]
Bernstein, P.A., and Goodman, N., "Fundamental Algorithms for Concurrency Control in Distributed Database Systems", Tech. Rep., Computer Corp. of Am., Feb. 1980.

[BP]
Badal, D.Z., and Popek, G.J., "A Proposal for Distributed Concurrency Control for Partially Redundant Distributed Data Base System", Proc. 3rd Berkeley Workshop on Distributed Data Management and Computer Networks, 1978, pp. 273-288.

[BS]
Bernstein, P.A., and Shipman, D.W., "The Correctness of Concurrency Control Mechanisms in a System for Distributed Databases (SDD-1)", ACM Trans. on Database Syst., Vol. 5, No. 1, March 1980.

[BSR]
Bernstein, P., Shipman, D., and Rothnie, J., "Concurrency Control in a System for Distributed Databases (SDD-1)", ACM Trans. on Database Syst., Vol. 5, No. 1, March 1980.

[BSW]
Bernstein, P.A., Shipman, D.W., and Wong, W.S., "Formal Aspects of Serializability in Database Concurrency Control", IEEE Trans. on Software Engineering, Vol. SE-5, No. 3, May 1979.

[GBWRR]
Goodman, N., P.A. Bernstein, E. Wong, C.L. Reeve, and J.B. Rothnie, "Query Processing in SDD-1", Tech. Rep. 79-06, Computer Corp. of Am., Oct. 1979.

[Gray]
Gray, J.N., Notes on Database Operating Systems, unpublished lecture notes, IBM San Jose Research Laboratory, San Jose, Calif., 1977.

[HS1]
Hammer, M.M., and Shipman, D.W., "An Overview of Reliability Mechanisms for a Distributed Data Base System", Proc. 1977 COMPCON, IEEE, N.Y.

[HS2]
Hammer, M.M., and Shipman, D.W., "Reliability Mechanisms for SDD-1", Tech. Rep. 79-05, Computer Corp. of Am., July 1979.

[HV]
Herman, D., and J.P. Verjus, "An Algorithm for Maintaining the Consistency of Multiple Copies", Proc. First International Conf. on Distributed Computing Systems, IEEE, N.Y., pp. 625-631.

[KNTH]
Kaneko, A., Y. Nishihara, K. Tsuruoka, and M. Hattori, "Logical Clock Synchronization Method for Duplicated Database Control", Proc. First International Conf. on Distributed Computing Systems, IEEE, N.Y., Oct. 1979, pp. 601-611.

[LS]
Lampson, B., and Sturgis, H., "Crash Recovery in a Distributed Data Storage System", Tech. Rep., Computer Science Lab., Xerox Palo Alto Research Center, 1976.

[Lin]
Lin, W.K., "Concurrency Control in a Multiple Copy Distributed Data Base System", Proc. 4th Berkeley Work. on Distributed Data Management & Computer Networks, August 1979.

[Montgomery]
Montgomery, W.A., "Robust Concurrency Control for a Distributed Information System", Ph.D. dissertation, Laboratory for Computer Science, MIT, Dec. 1978.

[PBR]
Papadimitriou, C.H., Bernstein, P.A., and Rothnie, J.B., Jr., "Some Computational Problems Related to Database Concurrency Control", Proc. Conf. on Theoretical Computer Science, Waterloo, Ontario, August 1977.

[Papadimitriou]
Papadimitriou, C.H., "Serializability of Concurrent Updates", J. of the ACM, Vol. 26, No. 4, Oct. 1979, pp. 631-653.

[Reed]
Reed, D.P., Naming and Synchronization in a Decentralized Computer System, Ph.D. Thesis, M.I.T. Department of Electrical Engineering, Sept. 1978.

[SM1]
Shapiro, R.M., and Millstein, R.E., "Reliability and Fault Recovery in Distributed Processing", Oceans '77 Conference Record, Vol. II, Los Angeles, 1977.

[SM2]
Shapiro, R.M., and Millstein, R.E., NSW Reliability Plan, Mass. Computer Associates, Tech. Rep. 7701-1411, June 1977.

[SLR]
Stearns, R.E., Lewis, P.M. II, and Rosenkrantz, D.J., "Concurrency Controls for Database Systems", Proc. 17th Symp. on Found. of Computer Science, IEEE, 1976, pp. 19-32.

[Thomas1]
Thomas, R.H., "A Solution to the Concurrency Control Problem for Multiple Copy Databases", Proc. 1978 COMPCON Conference, IEEE, N.Y.

[Thomas2]
Thomas, R.H., "A Majority Consensus Approach to Concurrency Control for Multiple Copy Databases", ACM Trans. on Database Syst., Vol. 4, No. 2, June 1979, pp. 180-203.
