Timestamp-Based Algorithms for Concurrency Control in Distributed Database Systems
Philip A. Bernstein**
Nathan Goodman**
READ(X): The TM looks for a copy of X in T's private workspace. If the copy exists, its value is returned to T. Otherwise the TM issues a command to the DM asking it to retrieve a stored copy of X from the database. This operation is denoted dm-read(x). The value retrieved by the DM is given to T and put into T's private workspace.

WRITE(X, new-value): The TM again checks the private workspace. If the workspace has a copy of X, its value is updated to new-value; otherwise a copy of X with that value is created in the workspace. The new value of X is not stored in the database at this time.

END: The TM issues an operation denoted dm-write(x) for each logical data item X updated by T. Each dm-write(x) requests that the DM update the value of X in the stored database to the value of X in T's local workspace. When all dm-writes are processed, T is finished executing, and its private workspace is discarded.

The DBMS may restart T any time before a dm-write has been processed. The effect of restarting T is to obliterate its private workspace and to re-execute T from the beginning. As we will see, many concurrency control algorithms use transaction restarts as a tactic for attaining correct executions. However, once a single dm-write has been processed, T cannot be restarted. This is because each dm-write permanently installs an update into the database, and we cannot permit the database to reflect partial effects of transactions.

A DBMS can fail in many ways and a detailed treatment of reliability issues is beyond the scope of this paper. However, a reliability problem called atomic commitment has a major impact on concurrency control. Consider a transaction T that updates data items x,y,z,... and suppose the DBMS fails while processing T's END. If this occurs, some of T's updates may have been installed in the stored database while others have not, and the database may contain incorrect information (see fig. 2.2). To avoid this problem, the DBMS must ensure that all of a transaction's dm-writes are processed or none are.

Figure 2.2 The Need for Atomic Commitment

* Consider a database of banking information.
* Suppose Acme Corp.'s savings account has $2,000,000 and its checking account has $500,000. And suppose the DBMS fails while processing the following transaction.

    T: Move $1,000,000 from savings to checking

* In the absence of atomic commitment, the following incorrect execution could occur.

    Execution of T                                        Database
                                                          S $2,000,000   C $500,000
    READ savings     (S = $2,000,000)
    READ checking    (C = $500,000)
    Subtract $1,000,000 from savings
    Add $1,000,000 to checking
                     (S = $1,000,000, C = $1,500,000)
    WRITE savings                                         S $1,000,000   C $500,000
    ------ SYSTEM CRASHES ------
    WRITE checking --- never executed

The "standard" way to implement atomic commitment involves a procedure called two-phase commit [LS, Gray].* Again suppose T is updating x,y,z,... When T issues its END, the first phase of two-phase commit begins. During this phase the DM copies the values of x,y,z,... from T's private workspace onto secure storage. If the DBMS fails during the first phase, no harm is done, since none of T's updates have yet been applied to the stored database. During the second phase, the DBMS copies the values of x,y,z,... into the stored database. If the DBMS fails during the second phase, the database may contain incorrect information. However, since the values of x,y,z,... are stored on secure storage, this inconsistency can be rectified when the system recovers: the recovery procedure reads the values of x,y,z,... from secure storage and resumes the commitment activity.

To model two-phase commit, it is convenient to add a third TM-DM operation, pre-commit, which instructs the DM to copy a data item from the private workspace to secure storage.

2.4 Distributed Transaction Processing Model

Our model of transaction processing in a distributed environment differs from the centralized case in two areas: how private workspaces are handled, and the implementation of two-phase commit.

Private Workspaces in a DDBMS

In a centralized DBMS we assumed that private workspaces were part of the TM. We also assumed that data could freely move between a transaction and its workspace, and between a workspace and the DM. These assumptions are not appropriate in a DDBMS because TMs and DMs may run at different sites and the movement of data between a TM and a DM can be expensive. To reduce this cost, many DDBMSs employ query optimization procedures which regulate (and hopefully reduce) the flow of data between sites.
For example, in SDD-1 the private workspace for transaction T is distributed across all sites at which T accesses data [GBWRR]. The details of how T reads and writes data in these workspaces is a query optimization problem, and has no direct effect on concurrency control. Consequently, we factor this issue out of our model for distributed transaction processing.

In detail, our model of distributed transaction execution is as follows.

1. When transaction T issues its BEGIN operation, T's TM creates a private workspace for T. The location and organization of this workspace is left unspecified.
2. When T issues a READ(X) operation, the TM checks T's private workspace to see if a copy of X is present. If so, the value of that copy is made available to T. Otherwise the TM selects some stored copy of X, say xi, and issues dm-read(xi) to the DM at which xi is stored. The DM responds by retrieving the stored value of xi from the database, placing this value in the private workspace. The TM then returns this value to T.
3. When T issues a WRITE(X, new-value) operation, the value of X in T's private workspace is updated to new-value, assuming the workspace contains a copy of X. Otherwise, a copy of X with the new value is created in the workspace.
4. When T issues its END operation, two-phase commit begins. For each X updated by T, and for each stored copy xi of X, the TM issues a pre-commit(xi) to the DM that stores xi. The DM responds by copying the value of X from T's private workspace onto secure storage internal to the DM. After all pre-commits are processed, the TM issues dm-writes for all copies of all logical data items updated by T. A DM responds to dm-write(xi) by copying the value of xi from secure storage into the stored database. After all dm-writes are installed, T's execution is finished.

Two-Phase Commit in a DDBMS

The problem of atomic commitment is aggravated in a DDBMS by the possibility of one site failing while the remainder of the system continues to operate. Suppose T is updating x,y,z,... stored at DMx, DMy, DMz,... (resp.) and suppose T's TM fails after issuing the dm-write(x), but before issuing the dm-writes for y,z,... At this point, the database contains incorrect information as illustrated in fig. 2.2. In a centralized DBMS, this phenomenon is not harmful because no transaction can access the database until the TM recovers from the failure. However, in a DDBMS, other TMs remain operational, and the incorrect database can be accessed from these TMs.

To avoid this problem, each DM that receives a pre-commit must be able to determine which other DMs are involved in the commitment activity. (This information could be a parameter to the pre-commit operation, stored in a private workspace, etc.) If T's TM fails before issuing all dm-writes, the DMs whose dm-writes were not issued can recognize the situation and consult the other DMs involved in the commitment. If any DM received a dm-write, the remaining ones act as if they had also received the command. Thus, if any DM applies an update to the database, they all do (see also [HS2]).


3. Decomposition of Concurrency Control Problem

In this section we review concurrency control theory with two objectives: to define "correct executions" in precise terms, and to decompose the concurrency control problem into more tractable sub-problems.

3.1 Serializability

Let E denote an execution of transactions T1, ..., Tn. E is a serial execution if no transactions ever execute concurrently in E; i.e., each transaction is executed to completion before the next one begins. Every serial execution is defined to be correct, because the properties of transactions (see Section 2.1) imply that a serial execution terminates properly and preserves database consistency. An execution is serializable if it is computationally equivalent to a serial execution, that is, if it produces the same output and has the same effect on the database as some serial execution. Since serial executions are correct and every serializable execution is equivalent to a serial one, every serializable execution is also correct. The goal of database concurrency control is to ensure that all executions are serializable.

The only operations that access the stored database are dm-read and dm-write. Hence, insofar as serializability is concerned, it is sufficient to model an execution of transactions by the execution of dm-reads and dm-writes at the various DMs of the DDBMS. In this spirit, we formally model an execution of transactions by a set of logs, one log per DM. Each log indicates the order in which dm-reads and dm-writes are processed at one DM (see fig. 3.1).

An execution modelled by a set of logs is serial if (1) for each log, and for each pair of transactions Ti and Tj whose operations appear in the log, either all of Ti's operations precede all of Tj's operations, or vice versa; and (2) for each pair of transactions, Ti and Tj, if Ti's operations precede Tj's operations in one log, then Ti's operations precede Tj's operations in every log in which operations of both transactions appear.
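The two conditions for a serial execution can be checked mechanically. The following is a minimal sketch, not part of the paper's formalism: it assumes each log is a list of (transaction, operation, item) triples, and the names `block_order`, `is_serial`, `serial` and `swapped` are hypothetical. The example logs are illustrative, written over the stored copies of fig. 3.1.

```python
from itertools import combinations

def block_order(log):
    """Return the transactions of one log in order of appearance,
    or None if operations of different transactions interleave."""
    order = []
    for txn, _op, _item in log:
        if order and order[-1] == txn:
            continue              # same transaction is still running
        if txn in order:
            return None           # txn resumes after another one ran
        order.append(txn)
    return order

def is_serial(logs):
    """logs maps a DM name to its list of (txn, op, item) entries."""
    precedes = set()              # (Ti, Tj) pairs seen so far: Ti before Tj
    for log in logs.values():
        order = block_order(log)
        if order is None:
            return False          # condition (1) fails
        for ti, tj in combinations(order, 2):
            if (tj, ti) in precedes:
                return False      # condition (2) fails
            precedes.add((ti, tj))
    return True

# Illustrative logs (an assumption of this sketch, not copied from a figure).
serial = {
    "A": [("T1", "r", "x1"), ("T1", "w", "y1"), ("T2", "r", "y1"), ("T3", "w", "x1")],
    "B": [("T1", "w", "y2"), ("T2", "w", "z2")],
    "C": [("T2", "w", "z3"), ("T3", "r", "z3")],
}
# Same execution, but DM B processes T2's write before T1's.
swapped = dict(serial, B=[("T2", "w", "z2"), ("T1", "w", "y2")])
print(is_serial(serial), is_serial(swapped))
```

The second set of logs satisfies condition (1) at every DM but violates condition (2), because DM A orders T1 before T2 while DM B orders T2 before T1.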
Figure 3.1 Modelling Executions as Logs

    Transactions                                  Database
    T1: BEGIN; READ(X); WRITE(Y); END             DM A: x1, y1
    T2: BEGIN; READ(Y); WRITE(Z); END             DM B: y2, z2
    T3: BEGIN; READ(Z); WRITE(X); END             DM C: z3

One possible execution of T1, T2, and T3 is represented by the following logs. (Note: ri[x] denotes the operation dm-read(x) issued by Ti; wi[x] has the analogous meaning.)

Figure 3.2 Serial and Non-Serial Logs

The execution modelled in figure 3.1 is serial. Condition (1) holds since each log is itself serial; i.e., there is no interleaving of operations from different transactions. Condition (2) holds since at DM A, T1 precedes T2 precedes T3; at DM B, T1 precedes T2; and at DM C, T2 precedes T3.

The following execution is not serial; it satisfies (1) but not (2).

    DM A: r1[x1] w1[y1] r2[y1] w3[x1]
    DM B: w2[z2] w1[y2]
    DM C: w2[z3] r3[z3]

The following execution is also not serial; it doesn't satisfy (1) or (2).

... executions, modelled by logs {L1,1, ..., L1,n} and {L2,1, ..., L2,n}, where Li,j models the execution at DMj for Ei. E1 and E2 are computationally equivalent if [PBR, Papadimitriou]: for each j, 1 ≤ j ≤ n, L1,j and L2,j contain the same set of dm-reads and dm-writes and each pair of conflicting operations ...

The total order hypothesized in Theorem 1 is called a serialization order. A serialization order indicates a serial execution of the transactions that is computationally equivalent to the original execution E. Thus, if the transactions had executed serially in the hypothesized order, the computation
performed by the transactions would have been identical to the computation represented by E.

To attain serializability, the DDBMS must guarantee that all executions satisfy the conditions of Theorem 1. Those conditions require that conflicting dm-reads and dm-writes be processed in certain relative orders. Concurrency control is the activity of controlling the relative order of conflicting operations; an algorithm to perform such control is called a synchronization technique. So, to be correct, a DBMS must incorporate synchronization techniques that guarantee the conditions of Theorem 1.


3.2 A Paradigm for Concurrency Control

In Theorem 1, read-write and write-write conflicts are treated together under the general notion of conflict. However, we can decompose the concept of serializability by distinguishing these two types of conflict. Let E be an execution modelled by a set of logs. We define three binary relations on transactions in E, denoted ->rw, ->wr, and ->ww. For each pair of transactions, Ti and Tj:

1. Ti ->rw Tj iff in some log of E, Ti reads some data item into which Tj subsequently writes;
2. Ti ->wr Tj iff in some log of E, Ti writes into some data item that Tj subsequently reads;
3. Ti ->ww Tj iff in some log of E, Ti writes into some data item into which Tj subsequently writes.

Theorem 2. Let ->rwr and ->ww be associated with execution E. Then E is serializable if (a) ->rwr and ->ww are acyclic, and (b) there is a total ordering of the transactions consistent both with all ->rwr and all ->ww relationships.

Theorem 2 emphasizes a point overlooked in Theorem 1: read-write and write-write conflicts interact only insofar as there must be a total ordering of the transactions consistent with both types of conflicts. This suggests that read-write and write-write conflicts can, to some extent, be synchronized independently. We can use one technique to guarantee an acyclic ->rwr relation (which amounts to read-write synchronization) and a different technique to guarantee an acyclic ->ww relation (write-write synchronization). However, Theorem 2 says that having both ->rwr and ->ww acyclic is not enough. There must also be one serial order consistent with all -> relations. This serial order is the cement that binds together the read-write and write-write synchronization techniques.

Decomposing the serializability problem into the problems of read-write and write-write synchronization is the cornerstone of our paradigm for concurrency control. In Section 4 we describe algorithms that accomplish read-write (rw) and/or write-write (ww) synchronization, and in Section 5 we show how to combine rw and ww synchronization algorithms into correct concurrency control algorithms. It will be important hereafter to distinguish algorithms that attain rw and/or ww synchronization from algorithms that solve the entire distributed concurrency control problem. We shall use synchronization technique for the former type of algorithm, and concurrency control method for the latter.
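As an illustration of Theorem 2, the sketch below derives the relations from a set of logs and looks for a total order consistent with both. It rests on assumptions that are not taken from the text: logs are lists of (transaction, operation, item) triples, ->rwr is read as the union of ->rw and ->wr, and the function names are hypothetical. It uses the fact that a total order consistent with both relations exists exactly when their union is acyclic, which in turn implies condition (a).

```python
from graphlib import CycleError, TopologicalSorter

def conflict_relations(logs):
    """Return (rwr, ww) as sets of (Ti, Tj) pairs derived from the logs."""
    rwr, ww = set(), set()
    for log in logs.values():
        for pos, (ti, op_i, item_i) in enumerate(log):
            for tj, op_j, item_j in log[pos + 1:]:
                if ti == tj or item_i != item_j:
                    continue
                if op_i == "w" and op_j == "w":
                    ww.add((ti, tj))
                elif "w" in (op_i, op_j):
                    rwr.add((ti, tj))   # read-then-write or write-then-read
    return rwr, ww

def serialization_order(logs):
    """Return a total order consistent with ->rwr and ->ww, or None."""
    rwr, ww = conflict_relations(logs)
    sorter = TopologicalSorter()
    for log in logs.values():
        for txn, _op, _item in log:
            sorter.add(txn)             # include transactions without conflicts
    for ti, tj in rwr | ww:
        sorter.add(tj, ti)              # ti must be ordered before tj
    try:
        return list(sorter.static_order())
    except CycleError:
        return None

# A non-serial execution that still admits a serialization order.
logs = {
    "A": [("T1", "r", "x1"), ("T1", "w", "y1"), ("T2", "r", "y1"), ("T3", "w", "x1")],
    "B": [("T2", "w", "z2"), ("T1", "w", "y2")],
    "C": [("T2", "w", "z3"), ("T3", "r", "z3")],
}
print(serialization_order(logs))
```

Here the only conflicts are T1 before T2 (on y1), T1 before T3 (on x1) and T2 before T3 (on z3), so the order T1, T2, T3 is returned even though DM B processes T2's write first; the two writes at DM B touch different data items and do not conflict.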
... synchronization being performed. For rw synchronization, two operations conflict iff both operate on the same data item and one is a dm-read and the other is a dm-write. For ww synchronization, two operations conflict iff both operate on the same data item and both are dm-writes.

It is easy to prove that T/O attains an acyclic ->rwr (resp. ->ww) relation when used for rw (resp. ww) synchronization. Since each DM processes conflicting operations in timestamp order, each edge of the ->rwr (resp. ->ww) relation is in timestamp order. Hence, all paths in the relation are in timestamp order and, since all transactions have unique timestamps, no cycles are possible. In addition, the timestamp order is a valid serialization order.


4.2 Basic Implementation

An implementation of T/O amounts to building a T/O scheduler, a software module that receives dm-read and dm-write operations and outputs these operations according to the T/O specification. In practice, pre-commits must also be processed through the T/O scheduler for two-phase commit to operate properly. In Sections 4.1-4.8 we describe T/O implementations without considering the impact of two-phase commit. Section 4.9 considers two-phase commitment issues.

The basic T/O implementation distributes the schedulers along with the database. Consider the T/O scheduler at some particular DM. For each data item x stored at the DM, the scheduler keeps track of the largest timestamp of any dm-read (resp. dm-write) that has operated on x. This timestamp is denoted R-timestamp(x) (resp. W-timestamp(x)).

For rw synchronization the basic T/O scheduler operates as follows. To process a dm-read(x), the scheduler compares the timestamp of the dm-read to W-timestamp(x). If the former timestamp is larger, the scheduler outputs the dm-read and updates R-timestamp(x) to the maximum of (a) the old R-timestamp(x) or (b) the timestamp of the dm-read. If the timestamp of the dm-read is smaller than W-timestamp(x), the dm-read is rejected and the issuing transaction is aborted. Similarly, to process a dm-write(x), the scheduler compares the timestamp of the dm-write to R-timestamp(x). If the former timestamp is larger, the dm-write is output and W-timestamp(x) is updated to the maximum of (a) the old W-timestamp(x) or (b) the timestamp of the dm-write. Otherwise, the dm-write is rejected and the transaction is aborted.

For ww synchronization, the T/O scheduler operates as follows. To process a dm-write(x), the scheduler compares the timestamp of the dm-write to W-timestamp(x). If the dm-write has a larger timestamp, the dm-write is output and W-timestamp(x) is set equal to the timestamp of the dm-write. Otherwise, the dm-write is rejected and the transaction is aborted.

... restart policy can lead to a cyclic restart situation, meaning that some transaction can be continually restarted without ever finishing. Cyclic restart can be avoided by assigning an especially large timestamp to the transaction, thereby reducing the probability of a subsequent restart. Other restart policies are discussed in later sections.

This implementation of T/O requires a substantial amount of storage for maintaining timestamps. Techniques for reducing this storage requirement are discussed in Section 4.8.


4.3 The Thomas Write Rule

For ww synchronization the basic T/O scheduler can be optimized using an observation of [Thomas 1,2]. Suppose the timestamp of a dm-write(x) is smaller than W-timestamp(x). Instead of rejecting the dm-write (and restarting the issuing transaction), we can simply ignore the dm-write. We call this the Thomas Write Rule (TWR).

Intuitively, TWR only applies to a dm-write that tries to put obsolete information into the database. The rule guarantees that the effect of applying a set of dm-writes to x is identical to what would have happened had the dm-writes been applied in timestamp order.


4.4 Multi-Version T/O

For rw synchronization the basic T/O scheduler can be improved by using the multi-version data item concept of [Reed]. For each data item x we maintain a set of R-timestamps, and a set of <W-timestamp, value> pairs (called versions). The R-timestamps of x record the timestamps of all dm-reads that have ever read x; the versions record the timestamps of all dm-writes that have ever written into x, along with the values written.

Using multi-versions, one can achieve rw synchronization without ever rejecting dm-reads. Consider a dm-read(x) with timestamp TS. To process this operation, we simply read the version of x with the largest timestamp less than TS; see fig. 4.1a. However, dm-writes can still be rejected. Consider a dm-write(x) with timestamp TS1, and let TS2 be the smallest W-timestamp greater than TS1; see fig. 4.1b. If any R-timestamp lies between TS1 and TS2 then the dm-write is rejected. If no R-timestamp lies in that range, then the scheduler outputs the dm-write; this causes a new version of x to be created with timestamp TS1.

To prove the correctness of this technique, consider a dm-read(x) with timestamp TS1 that is processed "out of order"; i.e., suppose the dm-read(x) has timestamp TS1 yet it is processed after
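
The scheduling rules of Sections 4.2 to 4.4 can be sketched in a few lines. This is an illustration under stated assumptions rather than the paper's implementation: the class names `BasicTO` and `MultiVersionItem` are hypothetical, timestamps are taken to be unique positive integers, every item is assumed to start with a version at timestamp 0, and `BasicTO` applies the rw and ww rules together at one scheduler, which is only one possible combination.

```python
import bisect

class BasicTO:
    """Basic T/O state for the data items at one DM (hypothetical helper)."""
    def __init__(self, use_twr=False):
        self.r_ts = {}            # item -> largest timestamp of any dm-read output
        self.w_ts = {}            # item -> largest timestamp of any dm-write output
        self.use_twr = use_twr    # apply the Thomas Write Rule for ww conflicts

    def dm_read(self, x, ts):
        if ts < self.w_ts.get(x, 0):
            return "reject"       # rw rule: the read arrived after a later write
        self.r_ts[x] = max(self.r_ts.get(x, 0), ts)
        return "output"

    def dm_write(self, x, ts):
        if ts < self.r_ts.get(x, 0):
            return "reject"       # rw rule: the write arrived after a later read
        if ts < self.w_ts.get(x, 0):
            # ww rule: an obsolete write is rejected, or ignored under TWR
            return "ignore" if self.use_twr else "reject"
        self.w_ts[x] = ts
        return "output"

class MultiVersionItem:
    """Versions and R-timestamps of a single data item (hypothetical helper)."""
    def __init__(self, initial_value):
        self.w_ts = [0]           # sorted W-timestamps, one per version
        self.values = [initial_value]
        self.r_ts = []            # timestamps of all dm-reads processed so far

    def dm_read(self, ts):
        # Read the version with the largest W-timestamp less than ts.
        i = bisect.bisect_left(self.w_ts, ts) - 1
        self.r_ts.append(ts)
        return self.values[i]

    def dm_write(self, ts, value):
        # TS2 is the smallest W-timestamp greater than ts, if there is one.
        i = bisect.bisect_right(self.w_ts, ts)
        ts2 = self.w_ts[i] if i < len(self.w_ts) else float("inf")
        if any(ts < r < ts2 for r in self.r_ts):
            return "reject"       # some read should have seen this version
        self.w_ts.insert(i, ts)
        self.values.insert(i, value)
        return "output"
```

For instance, after a write at timestamp 10 and a read at timestamp 15 on a `MultiVersionItem`, a write at timestamp 12 is rejected because the read at 15 lies between 12 and the next W-timestamp, while a write at timestamp 5 is still accepted as an older version.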
Figure 4.1 Multi-version Reading and Writing

a) Let us represent the versions of a data item x on a "time line"

    values          v1    v2    v3    ...    vn-1    vn
    W-timestamps     5    10    20    ...     92    100

Notice that the multi-version concept achieves ww synchronization "automatically"; insofar as ww synchronization is concerned, multi-versions are an embellished implementation of TWR.

It is usually not possible to keep all versions forever, so a technique for forgetting (i.e., deleting) versions is needed (see Section 4.8).


4.5 Conservative T/O

Conservative timestamp ordering is a technique for eliminating restarts during T/O scheduling [BP, BSR, HV, KNTH, SM1, SM2]. When a scheduler receives an operation O that might cause a future restart, the scheduler delays O until it is certain that no future restarts are possible.

Guaranteeing Termination -- Null Operations

To guarantee termination, we require that TMs periodically send timestamped null-operations to each scheduler, in the absence of any "real" traffic. A null-operation is a dm-read or dm-write that does not reference a data item. When TMi sends a null-dm-read (resp. null-dm-write) with timestamp TS to
scheduler Sj, this signifies that TMi will not send Sj any more dm-reads (resp. dm-writes) with timestamps smaller than TS. Thus, any scheduling decision requiring that Sj receive all dm-reads (resp. dm-writes) from TMi timestamped less than TS can be made after that null-dm-read (resp. null-dm-write) is received. An impatient scheduler can prompt a TM for a null-operation by sending a request-null operation to it.

Avoiding Unnecessary Communication

To avoid unnecessary communication between TMs and schedulers, null-operations with very large timestamps can be used. In extreme cases, TMi can send Sj a null-operation with infinite timestamp, signifying that TMi does not intend to communicate with Sj until further notice. Of course, when TMi needs to send a "real" operation to Sj, some mechanism is required to retract the infinite timestamp and replace it by a finite one.


4.6 Conservative T/O with Transaction Classes

Another technique for reducing communication is transaction classes [BRGP]. Here, we assume that the readset and writeset of every transaction is known in advance. This information is used to group transactions into predefined classes. Class definitions help support a less conservative scheduling policy.

A transaction class is defined by a readset and writeset (see fig. 4.2). Transaction T is a member of class C iff T's readset is a subset of C's readset, and T's writeset is a subset of C's writeset. (Classes need not be disjoint.) Class definitions are not expected to change frequently during normal operation of the system. Changing a class definition is akin to changing the database schema and requires mechanisms beyond the scope of this paper. We assume that class definitions are stored in static tables which are available at any site requiring them.

Figure 4.2 Transaction Classes

* A class is defined by a readset and a writeset. E.g.,

    C1: readset = {x1},      writeset = {y1, y2}
    C2: readset = {x1, y2},  writeset = {y1, y2, z2, z3}
    C3: readset = {y2, z3},  writeset = {x1, z2, z3}

* A transaction is a member of a class if its readset is a subset of the class readset and its writeset is a subset of the class writeset. E.g.,

    T1: readset = {x1},  writeset = {y1, y2}
    T2: readset = {y2},  writeset = {z2, z3}
    T3: readset = {z3},  writeset = {x1}

* T1 is a member of C1 and C2
* T2 is a member of C2 and C3
* T3 is a member of C3

Classes are associated with TMs. Every transaction that executes at a TM must be a member of a class associated with the TM. If a transaction is submitted to a TM at which this property does not hold, the transaction is forwarded to another TM that has an appropriate class. We assume that every class is associated with exactly one TM, and conversely, every TM is associated with exactly one class. We use Ci to denote the class associated with TMi. This notation simplifies our discussion, but does not constrain system operation in any way. For example, to execute transactions that are members of class C1 at two TMs, we define another class C2 with the same readset and writeset as C1 and associate C1 with one TM and C2 with the other. On the other hand, to execute transactions that are members of two classes at one site, we multi-program two TMs at the same site.

Transaction classes are exploited by conservative T/O schedulers as follows. Consider rw synchronization and suppose scheduler Sj wants to output a dm-read(x) with timestamp TS. Instead of waiting for dm-writes with smaller timestamp from all TMs, Sj need only wait for dm-writes from those TMs whose class writeset contains x. Similarly, to process a dm-write(x) with timestamp TS, Sj need only wait for dm-reads with smaller timestamp from those TMs whose class readset contains x. Thus, the level of concurrency in the system is increased. ww synchronization proceeds analogously.

This technique also reduces communication requirements, since a TM need only communicate with a scheduler if its class readset or writeset contains data items protected by the scheduler.


4.7 Conservative T/O with Conflict Graph Analysis

Conflict graph analysis is a technique for further improving the performance of conservative T/O with classes. A conflict graph is an undirected graph that summarizes potential conflicts between transactions in different classes. For each class Ci the graph contains two nodes, denoted ri and wi, which intuitively represent the readset and writeset of Ci. The edges of the graph are defined as follows (see fig. 4.3). (i) For each class Ci there is a vertical edge between ri and wi. (ii) For each pair of classes Ci and Cj (with i ≠ j) there
is a horizontal edge between w. and w. iff the Since classes are defined statically, conflict
1 J
writeset of C. intersects the writeset of C.. graph analysis is also performed statically. The
1
output of this analysis is a table indicating which
(iii) For each pair of classes Ci and Cj (with i+:)
horizontal and vertical edges require synchroniza-
there is a diagonal edge between ri and w. iff the tion and which do not. This information, like
3 class definitions, is distributed in advance to all
readset of Ci intersects the writeset of C..
J schedulers that require it.
Intuitively, a horizontal edge indicates that a Conservative T/O with conflict graph analysis has
scheduler Sk may be forced to delay dm-writes for been implemented in the SDD-1 DDBMS [BSRI. In
principle, conflict graph analysis can be applied
purposes of ww synchronization. Suppose classes Ci
to other synchronization techniques to improve
and C. are connected by a horizontal edge (i.e., their performance as well. Theoretical aspects of
J this integration are examined in [BSWI, but many
there is an edge between wi and wj). Then the
details remain to be worked out.
class writesets intersect and so, if Sk receives a
dm-write from Ci, Sk must delay the dm-write until
4.8 Timestamp Management
receives all dm-writes with smaller timestamps
sk
A common criticism of T/O schedulers is that too
from C.. Similarly, a diagonal edge indicates that
J much memory is needed to store timestamps. This
Sk may need to delay operations for rw synchroniza- problem can be overcome by “forgetting” old time-
stamps.
tion.
Timestamps are used in basic T/O to reject opera-
Conflict graph analysis improves the situation by tions that “arrive late”, e.g., to reject a dm-
identifying inter-class conflicts that never cause read(x) with timestamp TSl that arrives after a dm-
non-serializable behavior. This corresponds to
identifying horizontal and diagonal edges that do write(x) with timestamp TS2 > TS1. In principle,
not require synchronization. In particular, sche- and TS2 can differ by an arbitrary but amount,
dulers need only synchronize dm-writes from C. and TSI
1 in practice these timestamps are unlikely to differ
Cj if either (1) the edge (wi, w.) is embedded in a by more than a few minutes. Consequently we may
& of the conflict graph; or i 2) portions of the store timestamps in small tables which are periodi-
intersection of Ci’s writeset and C.‘s writeset are cally purged.
stored at two or more DMs[BSI. Thai is, if condi-
R-timestamps are stored in the R-table with entries
tions (1) and (2) do not hold, a scheduler S need of the form <x, R-timestamps; for any data item x,
k
not process dm-writes from Ci and C. in timestamp there is at most one entry. In addition, there is
J a variable, R-min which tells the maximum value of
order. Similarly, dm-reads from Ci and dm-writes any timestam-; has been purged from the table.
from Ci need only be processed in timestamp order To find R-timestamp( a scheduler searches the
J
if either R-table for an <x, TS> entry. If such an entry is
(‘) the edge (‘i, w : ) is embedded in a found, TS = R-timestamp( otherwise, R-
.I
cycle of the conflict graph; or (2) portions of the timestamp < R-min and to err on the side of
intersection of Ci’s readset and Cj’s writeset are safety, the scheduler assumes R-timestamp = R-
l=~y2>z31
Cl readset = {xl>
Clwriteset = {y,, y2 1 C2 writeset = {y,, y2, z2, z,} C3 writeset = Ix,, z2, 233
294
min. To update R-timestamp( the scheduler modi- commits need not be processed by the ww scheduler.
fies the <x, TS> entry, if one exists; otherwise,
a new entry is created and added to the table. Integrating Two-Phase Commit Into Multi-Version T/O
When the R-table is full, the scheduler selects an
appropriate value for R-min and deletes all entries Like TWR, multi-versions eliminate the need for
from the table with smaller tinestamp. W- two-phase commit insofar as ww synchronization is
tinestamps are managed similarly; analogous tech- concerned. However, two-phase commit remains as
niques can be devised for multi-version databases. issue for rw synchronization.
Let P be a pre-commit(x) with timestamp TSl and let
Ilaintaining timestamps for conservative T/O is even
cheaper, since conservative T/O only requires time- W be the corresponding dm-write. When P arrives at
a scheduler, the scheduling rule of Section 4.4 is
;~;;;;~v~ner~tions, not timestamped data. If con- applied:
T 0 is used for rw svnchronization, the let TS2 be the smallest W-timestamp >
R-timestamps of data items are rendered useless and
may be discarded. If conservative T/O is used for both rw and ww synchronization, R-timestamps can be eliminated too.

4.9 Integrating Two-Phase Commit into T/O

It is necessary to integrate two-phase commit into the T/O implementations described above to ensure atomic commitment of updates (see Section 2). This is done by timestamping pre-commits and modifying the T/O implementations to accept or reject pre-commits instead of dm-writes. If a scheduler rejects a pre-commit, the issuing transaction is aborted. However, if a scheduler accepts a pre-commit, it must accept the corresponding dm-write no matter when that operation arrives. To make this guarantee, the scheduler may be forced to delay conflicting operations that arrive before the dm-write.

Integrating Two-Phase Commit Into Basic T/O

Consider a pre-commit(x) with timestamp TS. Let P denote this operation and let W denote the corresponding dm-write. Assume that basic T/O is used for rw synchronization. P can be accepted by a scheduler iff TS > R-timestamp(x); i.e., P is accepted iff the scheduler can still output W. Once the scheduler accepts P, it must guarantee that TS will remain greater than R-timestamp(x) until W is received. To make this guarantee, the scheduler refuses to output any dm-read(x) with timestamp greater than TS until W is received. All such dm-reads that arrive before W are placed on a waiting queue.

For ww synchronization, P is accepted by the scheduler iff TS > W-timestamp(x). Once the scheduler accepts P, it agrees not to output any dm-write(x) with timestamp greater than TS until it receives W. All such dm-writes that arrive before W are placed on a waiting queue as above.

TWR applies only to ww synchronization and eliminates the possibility of rejecting dm-writes for purposes of ww synchronization. Hence there is no need to incorporate two-phase commit into the ww synchronization algorithm. Pre-commits must still be sent to all sites being updated, but the pre-

TS1; if any R-timestamp lies between TS1 and TS2, P is rejected, otherwise P is accepted. If the scheduler accepts P, it agrees not to output any dm-read(x) with timestamp between TS1 and TS2 until W is received. As before, all such dm-reads that arrive before W are placed on a waiting queue.

Integrating Two-Phase Commit Into Conservative T/O

Two-phase commit need not be tightly integrated into conservative T/O, because dm-writes are never rejected. However, scheduling delay can be reduced by transmitting pre-commits via W-queues. For example, suppose conservative T/O is used for rw synchronization, and suppose scheduler Sj wants to output a dm-read(x) with timestamp TS. Sj need only delay this dm-read until each W-queue contains a pre-commit with timestamp greater than TS; it need not wait for the corresponding dm-writes. (However, the dm-read may have to wait for some dm-writes with smaller timestamp; i.e., if Sj has accepted a pre-commit(x) with timestamp TS' < TS, the dm-read cannot be output until the dm-write(x) with timestamp TS' is received.)

4.10 Heuristics for Reducing Restarts

This section describes three heuristics for reducing the cost or probability of restarts for non-conservative T/O implementations.

Predeclaration of Readsets and Writesets

To reduce the cost of restarts, transactions should issue their dm-reads and pre-commits as early as possible. The extreme version of this heuristic calls for transactions to predeclare their readsets and writesets, so that dm-reads and pre-commits are issued for the entire readset and writeset before a transaction begins its main execution. If no operation is rejected, the transaction is guaranteed to execute with no danger of restart.

To reduce the probability of restart, a scheduler can delay the processing of operations to wait for "earlier" operations (i.e., ones with smaller timestamps) to arrive. This heuristic is essentially a compromise between conservative and non-conservative T/O, and trades response time for a
reduction in probability of restart. The amount of delay can be tuned to optimize this trade-off.

Reading Old Versions

The performance of multi-version T/O can be improved by permitting queries (i.e., read-only transactions) to read old versions of data items. Recall that in multi-version T/O, dm-read operations are never rejected but may cause subsequent pre-commits to be rejected. (E.g., once dm-read(x) with timestamp TS is processed, a subsequent pre-commit(x) with timestamp TS', where TS' < TS, may be rejected.) To reduce the probability of rejecting a pre-commit, we may assign old (i.e., small) timestamps to queries. Of course, this also causes the query to read older data. Thus, this technique entails a compromise between system performance and timeliness of data. Little is known about this tradeoff in general, but a good compromise should often be achievable. For example, if queries are assigned timestamps that are five minutes old, we would expect few queries to interfere with updates. And in many applications, five-minute-old data is perfectly acceptable.

As a fringe benefit, this technique also improves the response time for queries by reducing the probability that a query's dm-reads will be blocked by pre-commits.

5. Integrated T/O Concurrency Control Methods

The synchronization techniques of Section 4 can be integrated to form twelve principal T/O concurrency control methods:

analysis (see Section 4.7). Thus, these 12 principal methods produce over 50 distinct methods. In this section we describe the twelve principal methods in detail.

5.1 Using Basic T/O for rw Synchronization

Methods 1-4 use basic T/O for rw synchronization. Each stored data item, e.g. xi, has an R-timestamp and a W-timestamp. Let T be a transaction with timestamp TS. To read xi, T issues a dm-read(xi) with timestamp TS; this dm-read is accepted iff TS > W-timestamp(xi). To write xi, T issues a pre-commit(xi) with timestamp TS; this pre-commit is accepted iff (a) TS > R-timestamp(xi), and (b) a condition determined by the ww synchronization technique is also satisfied.

Method 1 -- Basic T/O for ww synchronization. The pre-commit is accepted iff TS > R-timestamp(xi) and TS > W-timestamp(xi).

Method 2 -- TWR for ww synchronization. The pre-commit is accepted iff TS > the largest R-timestamp(xi). However, if the pre-commit is accepted and TS < the W-timestamp(xi), the corresponding dm-write has no effect on the database. This method represents an optimization of Method 1 that is apparently preferable in most situations.

Method 3 -- Multi-version T/O for ww synchronization. The pre-commit is accepted iff TS > R-timestamp(xi); the W-timestamp is irrelevant. If the pre-commit is accepted, the corresponding dm-write creates a new version of xi. While this method appears to be a space-inefficient version of Method 2, it can yield better performance by letting queries read old versions of data items; see Section 4.10.

T issues a pre-commit(xi) with timestamp TS; this pre-commit is accepted iff (a) there is no R-timestamp that lies between TS and the smallest
W-timestamp larger than TS, and (b) a condition determined by the ww synchronization technique is also satisfied.

Method 5 -- Basic T/O for ww synchronization. For basic T/O, condition (b) requires that TS be greater than the largest W-timestamp(xi). So, for Method 5, conditions (a) and (b) may be simplified: The pre-commit is accepted iff TS > largest R-timestamp(xi) and the largest W-timestamp(xi). If the pre-commit is accepted, the corresponding dm-write creates a new version of xi.

Method 6 -- TWR for ww synchronization. This method is incorrect. TWR requires that a dm-write(xi) with timestamp TS be ignored if TS < the maximum W-timestamp(xi). This may cause subsequent dm-reads to read inconsistent data; see Fig. 5.1. (Method 6 is the only incorrect method we will encounter.)

Figure 5.1 Inconsistent Retrievals in Method 6

*Consider data items x and y with the following versions

       values          0      100
   x
       W-timestamps    0      100

       values          0
   y
       W-timestamps    0

*Now suppose T has timestamp 50 and writes x:=50, y:=50. Under Method 6, the update to x is ignored, and the result is

       values          0      100
   x
       W-timestamps    0      100

       values          0      50
   y
       W-timestamps    0      50

version of xi has been created with timestamp TS, no subsequent transaction can create a version with a smaller timestamp. When this property holds, it is possible to forget (i.e., delete) old versions such that we never delete a version needed by a later transaction.

Let W-max(xi) be the maximum W-timestamp of xi and W-min be the minimum value of W-max(xi) over all data items xi. Observe that no pre-commit with timestamp smaller than W-min can be accepted in the future: since W-min <= W-max(xi) for all xi, all future update transactions with timestamps less than W-min are guaranteed to be restarted. So, insofar as update transactions are concerned, we can safely forget all versions of every data item timestamped less than W-min. Queries are handled in this framework by interpreting all dm-reads with timestamps less than W-min as if they had timestamps equal to W-min.

Methods 5 and 8 also support a systematic technique for assigning old timestamps to queries (see Section 4.10) so that (a) no dm-read issued by a query will ever cause a pre-commit to be rejected; and (b) the timestamp assigned to the query is the largest one satisfying (a). This technique is similar to the technique for systematic forgetting of old versions.

Let Q be a query. The technique we describe requires that Q's readset be predeclared. Before Q begins its main execution, Q's readset is examined; for each xi in the readset, W-max(xi) is ascertained. In addition, we calculate W-min = min{W-max(xi) | xi is in Q's readset}. The timestamp assigned to Q is W-min - 1. The correctness of this technique is shown in [BG2].

5.3 Using Conservative T/O for rw Synchronization

The remaining T/O methods use conservative T/O for
rw synchronization. In these methods, a scheduler S will not process a dm-read(xi) with timestamp TS until it has processed all pre-commits with smaller timestamps and none with larger timestamps. Symmetrically, S will not process a pre-commit(xi) with timestamp TS until it has processed all dm-reads with smaller timestamps and none with larger timestamps. When S processes a pre-commit(xi) with timestamp TS, its action depends on the ww technique.

Method 9 -- Basic T/O for ww synchronization. The pre-commit is accepted iff TS > W-timestamp(xi).

Method 10 -- TWR for ww synchronization. The pre-commit is always accepted. However, if TS < W-timestamp(xi), the corresponding dm-write has no effect on the database.

Method 10 is essentially the concurrency control of SDD-1 [BSR]. In SDD-1, however, the method is refined in several ways to reduce delay. First, SDD-1 uses classes and conflict graph analysis and requires predeclaration of readsets. In addition, SDD-1 only enforces the conservative scheduling rule on dm-reads, meaning that dm-reads wait for pre-commits, but pre-commits need not wait for all dm-reads with smaller timestamps. Consequently, it is possible for dm-reads to be rejected in SDD-1. The SDD-1 designers accepted this possibility for two reasons: (1) Since readsets are predeclared, all dm-reads are issued before the transaction begins its main execution and the cost of rejecting a dm-read is modest. (2) The probability that a dm-read will be rejected can be reduced by assigning large timestamps to transactions. Other techniques for reducing restarts are described by [Lin].

Method 11 -- Multi-version T/O for ww synchronization. The pre-commit is always accepted and the corresponding dm-write always creates a new version of xi. When multi-versions are used, the conservative rw technique can be optimized as follows: a dm-read can never be rejected, and so there is no reason to force pre-commits to wait for dm-reads. (dm-reads must still wait for pre-commits to ensure that pre-commits are never rejected.)

Method 12 -- Conservative T/O for ww synchronization. Scheduler S will not process a pre-commit with timestamp TS until it has processed all pre-commits with smaller timestamps and none with larger timestamps. Combined with conservative rw synchronization, the effect is to process all operations in timestamp order. Method 12 has been recommended by [BP, HV, KNTH, SM1, SM2].

framework has two main parts: (1) a model of distributed transaction execution, in which transactions execute by issuing dm-read, pre-commit, and dm-write operations; and (2) a decomposition of the concurrency control problem into the sub-problems of rw and ww synchronization.

We presented several timestamp-based synchronization techniques for solving each sub-problem. Four of these techniques were deemed to be "principal": basic T/O, the Thomas Write Rule, multi-version T/O, and conservative T/O. These techniques vary substantially in their behavior but are united by a common underlying objective: each technique seeks to execute conflicting operations in timestamp order, or in some equivalent order. Basic T/O achieves this objective by rejecting operations that are received out of timestamp order. The Thomas Write Rule ignores operations that are received out of timestamp order. (This technique is only suitable for ww synchronization.) Multi-version T/O retains multiple "versions" of data items to permit many operations that are received out of order to be executed as if they had been received in order. And conservative T/O delays operations that are received out of order to permit all operations with smaller timestamps to be processed first.

Finally we showed how to integrate any principal rw technique with any principal ww technique to yield a principal concurrency control method. Twelve principal methods can be constructed in this way. Each principal method can be refined by several non-principal techniques so that more than 50 distinct concurrency control algorithms can be built using the framework and material of this paper.

Most of the principal methods we describe are new algorithms. These are Methods 1-4 (which use basic T/O for rw synchronization); Methods 5 and 8 (multi-version T/O for rw, with basic T/O or conservative T/O for ww); and Methods 9 and 11 (conservative T/O for rw, with basic T/O or multi-version T/O for ww). Of the remaining methods, Method 6 (multi-version T/O with TWR) is an incorrect method; Method 7 (multi-version T/O for rw and ww) is similar but not identical to the algorithms of [Montgomery, Reed]; Method 10 (conservative T/O with TWR) is essentially the SDD-1 concurrency control algorithm [BSR]; and Method 12 (conservative T/O for rw and ww) is essentially the algorithm recommended by [BP, HV, KNTH, SM1, SM2].

A major issue we have not addressed concerns the performance of these algorithms. This issue is addressed qualitatively in [BG2]. However, little quantitative performance analysis has been reported in the literature and this remains a topic for future research.
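
As an illustration of how the pre-commit acceptance rules of Methods 1-3 (Section 5.1) differ only in their ww condition, the rules can be sketched in Python. This is a minimal sketch under stated assumptions, not an implementation from the paper: the names Item, accept_dm_read, accept_pre_commit and dm_write_has_effect are hypothetical, timestamps are modeled as integers, and the waiting queues of Section 4.9 and the updating of timestamps are omitted.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """Timestamps kept for one stored data item xi (hypothetical layout)."""
    r_ts: int = 0  # R-timestamp(xi)
    w_ts: int = 0  # W-timestamp(xi)


def accept_dm_read(item: Item, ts: int) -> bool:
    """Basic T/O for rw: a dm-read is accepted iff TS > W-timestamp(xi)."""
    return ts > item.w_ts


def accept_pre_commit(item: Item, ts: int, ww: str) -> bool:
    """Pre-commit test for Methods 1-3; ww is 'basic', 'twr' or 'multi'."""
    if ts <= item.r_ts:
        # Condition (a) fails: TS must exceed R-timestamp(xi).
        return False
    if ww == "basic":
        # Method 1: additionally require TS > W-timestamp(xi).
        return ts > item.w_ts
    if ww in ("twr", "multi"):
        # Methods 2 and 3: the W-timestamp never causes a rejection.
        return True
    raise ValueError(f"unknown ww technique: {ww}")


def dm_write_has_effect(item: Item, ts: int, ww: str) -> bool:
    """Method 2 (TWR): an accepted dm-write with TS < W-timestamp(xi) is ignored."""
    return not (ww == "twr" and ts < item.w_ts)
```

For an item with R-timestamp 10 and W-timestamp 20, a pre-commit with timestamp 15 is rejected under Method 1 and accepted under Methods 2 and 3; under Method 2 the corresponding dm-write is ignored, while under Method 3 it creates a new version.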
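
The W-min calculation used for Methods 5 and 8 to assign timestamps to queries with predeclared readsets can likewise be sketched. The representation below is an assumption made for illustration only: each data item is mapped to the list of W-timestamps of its stored versions, and the function names w_min and query_timestamp are hypothetical.

```python
def w_min(w_timestamps, items=None):
    """Minimum of W-max(xi) over the chosen items.

    w_timestamps maps an item name to the W-timestamps of its versions;
    items defaults to every item in the mapping.
    """
    names = list(w_timestamps) if items is None else items
    return min(max(w_timestamps[name]) for name in names)


def query_timestamp(w_timestamps, readset):
    """Timestamp assigned to a query Q with a predeclared readset: W-min - 1."""
    return w_min(w_timestamps, readset) - 1
```

With versions of x written at timestamps 0 and 100 and versions of y written at 0, 40 and 60, a query reading only x is assigned timestamp 99, and a query reading both x and y is assigned timestamp 59.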
References

[BG1]
Bernstein, P.A., and Goodman, N., "Approaches to Concurrency Control in Distributed Databases", Proc. 1979 National Computer Conf., June 1979.

[BG2]
Bernstein, P.A., and Goodman, N., "Fundamental Algorithms for Concurrency Control in Distributed Database Systems", Tech. Rep., Computer Corp. of Am., Feb. 1980.

[BP]
Badal, D.Z., and Popek, G.J., "A Proposal for Distributed Concurrency Control for Partially Redundant Distributed Data Base System", Proc. 3rd Berkeley Workshop on Distributed Data Management and Computer Networks, 1978, pp. 273-288.

[BS]
Bernstein, P.A., and Shipman, D.W., "The Correctness of Concurrency Control Mechanisms in a System for Distributed Databases (SDD-1)", ACM Trans. on Database Syst., Vol. 5, No. 1, March 1980.

[BSR]
Bernstein, P., Shipman, D., and Rothnie, J., "Concurrency Control in a System for Distributed Databases (SDD-1)", ACM Trans. on Database Syst., Vol. 5, No. 1, March 1980.

[BSW]
Bernstein, P.A., Shipman, D.W., and Wong, W.S., "Formal Aspects of Serializability in Database Concurrency Control", IEEE Trans. on Software Engineering, Vol. SE-5, No. 3, May 1979.

[GBWRR]
Goodman, N., P.A. Bernstein, E. Wong, C.L. Reeve, and J.B. Rothnie, "Query Processing in SDD-1", Tech. Rep. 79-06, Computer Corp. of Am., Oct. 1979.

[Gray]
Gray, J.N., Notes on Database Operating Systems, unpublished lecture notes, IBM San Jose Research Laboratory, San Jose, Calif., 1977.

[HS1]
Hammer, M.M., and Shipman, D.W., "An Overview of Reliability Mechanisms for a Distributed Data Base System", Proc. 1977 COMPCON, IEEE, N.Y.

[HS2]
Hammer, M.M., and Shipman, D.W., "Reliability Mechanisms for SDD-1", Tech. Rep. 79-05, Computer Corp. of Am., July 1979.

[HV]
Herman, D., and J.P. Verjus, "An Algorithm for Maintaining the Consistency of Multiple Copies", Proc. First International Conf. on Distributed Computing Systems, IEEE, N.Y., pp. 625-631.

[KNTH]
Kaneko, A., Y. Nishihara, K. Tsuruoka, and M. Hattori, "Logical Clock Synchronization Method for Duplicated Database Control", Proc. First International Conf. on Distributed Computing Systems, IEEE, N.Y., Oct. 1979, pp. 601-611.

[LS]
Lampson, B., and Sturgis, H., "Crash Recovery in a Distributed Data Storage System", Tech. Rep., Computer Science Lab., Xerox Palo Alto Research Center, 1976.

[Lin]
Lin, W.K., "Concurrency Control in a Multiple Copy Distributed Data Base System", Proc. 4th Berkeley Workshop on Distributed Data Management & Computer Networks, August 1979.

[Montgomery]
Montgomery, W.A., "Robust Concurrency Control for a Distributed Information System", Ph.D. dissertation, Laboratory for Computer Science, MIT, Dec. 1978.

[PBR]
Papadimitriou, C.H., Bernstein, P.A., and Rothnie, J.B., Jr., "Some Computational Problems Related to Database Concurrency Control", Proc. Conf. on Theoretical Computer Science, Waterloo, Ontario, August 1977.

[Papadimitriou]
Papadimitriou, C.H., "Serializability of Concurrent Updates", J. of the ACM, Vol. 26, No. 4, Oct. 1979, pp. 631-653.

[Reed]
Reed, D.P., Naming and Synchronization in a Decentralized Computer System, Ph.D. Thesis, M.I.T. Department of Electrical Engineering, Sept. 1978.

[SM1]
Shapiro, R.M., and Millstein, R.E., "Reliability and Fault Recovery in Distributed Processing", Oceans '77 Conference Record, Vol. II, Los Angeles, 1977.

[SM2]
Shapiro, R.M., and Millstein, R.E., NSW Reliability Plan, Mass. Computer Associates, Tech. Rep. 7701-1411, June 1977.

[SLR]
Stearns, R.E., Lewis, P.M., II, and Rosenkrantz, D.J., "Concurrency Controls for Database Systems", Proc. 17th Symp. on Found. of Computer Science, IEEE, 1976, pp. 19-32.

[Thomas1]
Thomas, R.H., "A Solution to the Concurrency Control Problem for Multiple Copy Databases", Proc. 1978 COMPCON Conference, IEEE, N.Y.

[Thomas2]
Thomas, R.H., "A Majority Consensus Approach to Concurrency Control for Multiple Copy Databases", ACM Trans. on Database Syst., Vol. 4, No. 2, June 1979, pp. 180-203.