Distributed Database Systems
M. Tamer Özsu
University of Waterloo
Department of Computer Science
Waterloo, Ontario Canada N2L 3G1
{[email protected]}
Outline
In this article, we discuss the fundamentals of distributed DBMS technology. We
address the data distribution and architectural design issues as well as the algorithms that
need to be implemented to provide the basic DBMS functions such as query processing, concurrency control, reliability, and replication.
Glossary
Atomicity: The property of transaction processing whereby either all the operations of a transaction are reflected in the database or none of them are.
Client/server architecture: A distributed DBMS architecture where a number of client machines with limited functionality access a set of servers which manage data.
Distributed database management system: The software system that manages a database that is distributed across the nodes of a computer network and makes the distribution transparent to the users.
Deadlock: An occurrence where each transaction in a set of transactions circularly waits for locks that are held by other transactions in the set.
Isolation: The property of transaction execution which states that the effects of one transaction on the database are isolated from other transactions until the first one completes its execution.
Locking: A method of concurrency control where locks are placed on database units (e.g., pages) on behalf of transactions that attempt to access them.
Logging protocol: The protocol which records, in a separate location, the changes that a transaction makes to the database before the change is actually made.
One copy equivalence: Replica control policy which asserts that the values of all copies
of a logical data item should be identical when the transaction that updates that item
terminates.
Query optimization: The process by which the "best" execution strategy for a given query is found from among a set of alternatives.
Query processing: The process by which a declarative query is translated into low-level data manipulation operations.
Quorum-based voting algorithm: A replica control protocol where transactions collect votes to read and write copies of data items. They are permitted to read or write a data item only if they obtain a sufficient number of votes (a quorum).
Read-Once/Write-All protocol: The replica control protocol which maps each logical read operation to a read on one of the physical copies and maps a logical write operation to writes on all of the physical copies.
Serializability: The concurrency control correctness criterion which requires that the concurrent execution of a set of transactions be equivalent to the effect of some serial execution of those transactions.
Termination protocol: A protocol by which individual sites can decide how to terminate
a particular transaction when they cannot communicate with other sites where the
transaction executes.
Two-phase commit: An atomic commitment protocol which ensures that a transaction is terminated the same way at every site where it executes. The name comes from the fact that two rounds of messages are exchanged during this process.
Two-phase locking: A locking algorithm where transactions are not allowed to request new locks once they release a previously held lock.
1. Introduction
The maturation of database management system (DBMS) technology has coincided with significant developments in distributed computing and parallel processing technologies. The end result is the emergence of distributed database management systems. These
systems have started to become the dominant data management tools for highly data-
intensive applications. Many DBMS vendors have incorporated some degree of distribution into their products. A distributed database (DDB) is a collection of multiple, logically interrelated databases distributed over a computer network. A distributed database management system (distributed DBMS) is the software system that permits the management of the distrib-
uted database and makes the distribution transparent to the users. The term “distributed
database system” (DDBS) is typically used to refer to the combination of DDB and the
distributed DBMS. These definitions point to two identifying architectural principles. The
first is that the system consists of a (possibly empty) set of query sites and a non-empty
set of data sites. The data sites have data storage capability while the query sites do not.
The latter only run the user interface routines in order to facilitate the data access at data
sites. The second is that each site (query or data) is assumed to logically consist of a sin-
gle, independent computer. Therefore, each site has its own primary and secondary stor-
age, runs its own operating system (which may be the same or different at different sites),
and has the capability to execute applications on its own. A computer network, rather
than a multiprocessor configuration, interconnects the sites. The important point here is
the emphasis on loose interconnection between processors that have their own operating systems and operate independently.
2. Data Distribution Alternatives
A distributed database is physically distributed across the data sites by fragmenting
and replicating the data. Given a relational database schema, fragmentation subdivides each relation into horizontal or vertical partitions. Horizontal fragmentation of a relation is accomplished by a selection operation that places each tuple of the relation in a different partition based on a fragmentation predicate (e.g., an Employee relation may be fragmented according to employee location). Vertical fragmentation divides a relation into a number of fragments by projecting over its attributes (e.g., the Employee relation may be fragmented such that the Emp_number, Emp_name and Address information is in one fragment, and the remaining attributes, together with Emp_number, are in another). Fragmentation is desirable because it enables the placement of data in close proximity to its place of use, thus potentially reducing transmission cost, and it
reduces the size of relations that are involved in user queries. Based on the user access
patterns, each of the fragments may also be replicated. This is preferable when the same
data are accessed from applications that run at a number of sites. In this case, it may be
more cost-effective to duplicate the data at a number of sites rather than continuously
moving it between them. Figure 1 depicts a data distribution where Employee, Project and Assignment relations are fragmented, replicated and distributed across multiple sites.
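To make these operations concrete, the following Python sketch (not part of the original article) computes horizontal fragments by selection and vertical fragments by projection over a toy Employee relation; the attribute names, values and the fragmentation predicate are assumptions made purely for illustration.

    # A minimal sketch of fragmentation over a hypothetical Employee relation,
    # represented as a list of dictionaries (one dictionary per tuple).
    employee = [
        {"emp_number": 1, "emp_name": "Ada",  "address": "Waterloo", "salary": 70000, "site": "A"},
        {"emp_number": 2, "emp_name": "Bert", "address": "Toronto",  "salary": 65000, "site": "B"},
    ]

    # Horizontal fragmentation: a selection places each tuple in a fragment
    # according to an assumed fragmentation predicate (here, the 'site' attribute).
    emp_site_a = [t for t in employee if t["site"] == "A"]
    emp_site_b = [t for t in employee if t["site"] == "B"]

    # Vertical fragmentation: a projection keeps a subset of the attributes; the
    # key (emp_number) is retained in every fragment so the relation can be rebuilt.
    emp_personal = [{k: t[k] for k in ("emp_number", "emp_name", "address")} for t in employee]
    emp_pay = [{k: t[k] for k in ("emp_number", "salary")} for t in employee]

    # Reconstruction: union of the horizontal fragments, join of the vertical ones.
    rejoined = [{**p, **q} for p in emp_personal for q in emp_pay
                if p["emp_number"] == q["emp_number"]]
    assert sorted(emp_site_a + emp_site_b, key=lambda t: t["emp_number"]) == employee

Reconstructing the global relation by union (for horizontal fragments) or join (for vertical fragments) is exactly the inverse of the fragmentation rules, a fact that the data localization step of query processing relies on later in the article.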
3. Architectural Alternatives
There are many possible alternatives for architecting a distributed DBMS. The sim-
plest is the client/server architecture, where a number of client machines access a single
database server. The simplest client/server systems involve a single server that is accessed by multiple clients (a multiple-client/single-server architecture). In this case, the database management problems are considerably simplified since the database is
stored on a single server. The pertinent issues relate to the management of client buffers
and the caching of data and (possibly) locks. The data management is done centrally at
the single server. A more distributed, and more flexible, architecture is the multiple-client/multiple-server architecture, in which the database is distributed across a number of servers that have to communicate with each other in responding to user queries and in exe-
cuting transactions. Each client machine has a “home” server to which it directs user re-
quests. The communication of the servers among themselves is transparent to the users.
Most current database management systems implement one or the other type of the cli-
ent-server architectures. A truly distributed DBMS does not distinguish between client
and server machines. Ideally, each site can perform the functionality of a client and a server. Such architectures require sophisticated protocols to manage the data that is distributed across multiple sites. The complexity of the required software has delayed the offering of truly distributed DBMS products.
If the distributed database systems at various sites are autonomous and (possibly)
exhibit some form of heterogeneity, they are usually referred to as multidatabase systems
or federated database systems. If the data and DBMS functionality distribution is accomplished on a multiprocessor computer, the resulting system is referred to as a parallel database system.
These are different than a distributed database system where the logical integration
among distributed data is tighter than is the case with multidatabase systems or federated
database systems, but the physical control is looser than that in parallel DBMSs. In this article we restrict ourselves to distributed DBMSs.
4. Overview of Technical Issues
Distributed DBMSs provide the same functionality that their centralized counterparts provide, such as support for declarative user queries and their optimization, transactional access to the database involving concurrency control and reliability, enforcement of integrity constraints and others. In the remaining sections we discuss some of these functions in the distributed setting.
Query processing deals with designing algorithms that analyze queries and convert
them into a series of data manipulation operations. Besides the methodological issues, an
important aspect of query processing is query optimization. The problem is how to decide
on a strategy for executing each query over the network in the most cost-effective way,
however cost is defined. The factors to be considered are the distribution of data, com-
munication costs, and lack of sufficient locally available information. The objective is to devise an execution strategy in which the inherent parallelism of the distributed system is used to improve the performance of executing the query.
User accesses to shared databases are formulated as transactions, which are units of
execution that satisfy four properties: atomicity, consistency, isolation, and durability –
jointly known as the ACID properties. Atomicity means that a transaction is an atomic
unit and either the effects of all of its actions are reflected in the database, or none of
them are. Consistency generally refers to the correctness of the individual transactions;
i.e., that a transaction does not violate any of the integrity constraints that have been de-
fined over the database. Isolation addresses the concurrent execution of transactions and
specifies that actions of concurrent transactions do not impact each other. Finally, dura-
bility concerns the persistence of database changes in the face of failures. ACID proper-
ties are enforced by means of concurrency control algorithms and reliability protocols.
Concurrency control involves the synchronization of concurrent accesses to the distributed database, such that the integrity of the database is maintained. The concurrency control problem in a distributed context is somewhat different than in centralized systems. One not only has to worry about the integrity of a single database, but also about the con-
sistency of multiple copies of the database. The condition that requires all the values of
multiple copies of every data item to converge to the same value is called mutual consis-
tency.
The reliability protocols of a distributed DBMS ensure consistent behavior in the face of failures. In addition to the typical failure types (i.e., transaction failures and system failures), distributed DBMSs have to account for communication failures as well. Reliability requires that, when a failure occurs and various sites become either inoperable or inaccessible, the databases at
the operational sites remain consistent and up to date. This complicates the picture, as the
actions of these sites have to be eventually reconciled with those of failed ones. There-
fore, recovery protocols coordinate the termination of transactions so that they terminate
uniformly (i.e., they either abort or they commit) at all the sites where they execute. Fur-
thermore, when the computer system or network recovers from the failure, the distributed
DBMS should be able to recover and bring the databases at the failed sites up-to-date.
This may be especially difficult in the case of network partitioning, where the sites are divided into groups that cannot communicate with each other.
Distributed databases are typically replicated; that is, a number of the data items re-
side at more than one site. Replication improves performance (since data access can be
localized) and availability (since the failure of a site does not make a data item inaccessi-
ble). However, management of replicated data requires that the values of multiple copies
of a data item are the same. This is called the one copy equivalence property. Distributed
DBMSs that allow replicated data implement replication protocols to enforce one copy
equivalence.
5. Distributed Query Processing
Query processing is the process by which a declarative query is translated into low-level data manipulation operations. SQL is the standard query language that is supported
in current DBMSs. Query optimization refers to the process by which the “best” execu-
tion strategy for a given query is found from among a set of alternatives.
In distributed DBMSs, the process typically involves four steps (Figure 2): query
decomposition, data localization, global optimization, and local optimization. Query de-
composition takes an SQL query and translates it into one expressed in relational algebra.
In the process, the query is analyzed semantically so that incorrect queries are detected
and rejected as early as possible, and correct queries are simplified. Simplification in-
volves the elimination of redundant predicates that may be introduced as a result of query
modification to deal with views, security enforcement and semantic integrity control.
The initial algebraic query generated by the query decomposition step is input to the
second step: data localization. The initial algebraic query is specified on global relations, irrespective of their fragmentation or distribution. The main role of data localization is to localize the query’s data using data distribution information. In this step, the fragments that are involved in the query are determined and the query is transformed into one that operates on fragments rather than on global relations. Recall that fragmentation is defined through fragmentation rules that can be expressed as relational operations (horizontal fragmentation by selection, vertical fragmentation by projection), and that a global relation can be reconstructed by applying the inverse of the fragmentation rules. This is called a localization program; the localization program for a horizontally (vertically) fragmented query is the union (join) of the fragments. Thus, during the data localization
step each global relation is first replaced by its localization program, and then the result-
ing fragment query is simplified and restructured to produce another “good” query. Sim-
plification and restructuring may be done according to the same rules used in the decom-
position step. As in the decomposition step, the final fragment query is generally far
from optimal; the process has only eliminated “bad” algebraic queries.
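As a minimal illustration of this step, the Python sketch below (with a hypothetical, horizontally fragmented Employee relation) first expresses the global relation as its localization program, the union of its fragments, and then pushes the selection into each fragment to obtain a reduced fragment query.

    # Hypothetical horizontal fragments of a global Employee relation,
    # assumed to be stored at two different sites.
    emp_frag1 = [{"emp_number": 1, "salary": 70000}]
    emp_frag2 = [{"emp_number": 2, "salary": 65000}]

    def localization_program():
        # The global relation is reconstructed as the union of its fragments.
        return emp_frag1 + emp_frag2

    def localized_query(predicate):
        # The selection is pushed into each fragment, so each site
        # evaluates the predicate only over its own fragment.
        return [t for frag in (emp_frag1, emp_frag2) for t in frag if predicate(t)]

    # Both formulations give the same answer; the second is the "better" fragment
    # query in the sense described above, since it avoids shipping irrelevant tuples.
    high_paid = lambda t: t["salary"] > 66000
    assert [t for t in localization_program() if high_paid(t)] == localized_query(high_paid)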
For a given SQL query, there is usually more than one possible algebraic query, and some of these algebraic queries are “better” than others. The quality of an algebraic query is de-
fined in terms of expected performance. The process of query optimization involves tak-
ing the initial algebraic query and, using algebraic transformation rules, transforming it
into other algebraic queries until the “best” one is found. The “best” algebraic query is
determined according to a cost function that calculates the cost of executing the query according to that algebraic specification. In a distributed setting, the process is typically split into two steps: global optimization to handle operations that involve data from multiple sites (e.g., join)
followed by local optimization for further optimizing operations that will be performed at
a given site.
The input to the third step, global optimization, is a fragment query, that is, an alge-
braic query on fragments. The goal of query optimization is to find an execution strategy
for the query that is close to optimal. Remember that finding the optimal solution is computationally intractable. An execution strategy for a distributed query can be described with relational algebra operations and communication primitives (send/receive operations) for transferring data between sites. The previous layers have already optimized the query – for example, by eliminating redundant expressions. However, this optimization is independent of fragment characteristics such as cardinalities, and communication operations are not yet specified. By permuting the order of operations
within one fragment query, many equivalent query execution plans may be found. Query
optimization consists of finding the “best” one among candidate plans examined by the
optimizer.[1]
[1] The difference between an optimal plan and the best plan is that the optimizer does not, because of computational intractability, examine all of the possible plans.
The final step, local optimization, takes a part of the global query (called a
subquery) that will run at a particular site and optimizes it further. This step is very simi-
lar to query optimization in centralized DBMSs. Thus, it is at this stage that local infor-
mation about data storage, such as indexes, etc., is used to determine the best execution plan for the subquery. Query optimization, whether global or local, can be characterized by three elements: a search space, a cost model, and a search strategy. The search space is the set of alternative exe-
cution plans to represent the input query. These plans are equivalent, in the sense that
they yield the same result but they differ on the execution order of operations and the way
these operations are implemented. The cost model predicts the cost of a given execution
plan. To be accurate, the cost model must have accurate knowledge about the parallel
execution environment. The search strategy explores the search space and selects the
best plan. It defines which plans are examined and in which order.
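These three elements can be pictured with the toy optimizer below; the relation names, the cardinalities and the cost formula are stand-ins invented for the example, not values from the article.

    from itertools import permutations

    # Search space: all join orders over three hypothetical relations.
    cardinality = {"EMP": 10_000, "ASG": 50_000, "PROJ": 1_000}

    def plan_cost(join_order):
        # Stand-in cost model: accumulate a crude estimate of each intermediate
        # result size; a real optimizer would use fragment statistics,
        # selectivities, and I/O, CPU and communication cost weights.
        cost, intermediate = 0, cardinality[join_order[0]]
        for rel in join_order[1:]:
            intermediate = min(intermediate, cardinality[rel])  # assumes key joins
            cost += intermediate
        return cost

    # Search strategy: exhaustive enumeration of the search space,
    # keeping the cheapest plan examined.
    best = min(permutations(cardinality), key=plan_cost)
    print(best, plan_cost(best))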
In a distributed environment, the cost function, often defined in terms of time units,
refers to computing resources such as disk space, disk I/Os, buffer space, CPU cost,
communication cost, and so on. Generally, it is a weighted combination of I/O, CPU, and communication costs. A common simplifying assumption is to consider communication cost as the most significant factor. This is valid for wide area networks, where the limited bandwidth makes communication much more costly than local processing. To select the ordering of operations it is necessary to predict execution costs of alternative candidate orderings. Determining execution costs be-
fore query execution (i.e., static optimization) is based on fragment statistics and the for-
mulas for estimating the cardinalities of results of relational operations. Thus the optimi-
zation decisions depend on the available statistics on fragments. An important aspect of
query optimization is join ordering, since permutations of the joins within the query may
lead to improvements of several orders of magnitude. One basic technique for optimizing
a sequence of distributed join operations is through use of the semijoin operator. The
main value of the semijoin in a distributed system is to reduce the size of the join oper-
ands and thus the communication cost. However, more recent techniques, which consider
local processing costs as well as communication costs, do not use semijoins because they
might increase local processing costs. The output of the query optimization layer is an optimized algebraic query with communication operations included on the fragments.
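The size-reducing effect of the semijoin can be illustrated with the short sketch below, in which one site ships only the distinct join-attribute values rather than its whole fragment; the fragments and their contents are hypothetical.

    # Hypothetical fragments: Employee tuples at site 1, Assignment tuples at site 2.
    emp_site1 = [{"eno": 1, "name": "Ada"}, {"eno": 2, "name": "Bert"}, {"eno": 3, "name": "Cleo"}]
    asg_site2 = [{"eno": 2, "pno": 7}, {"eno": 2, "pno": 9}]

    # Step 1: site 2 sends only the distinct join-attribute values to site 1,
    # which is usually much smaller than shipping the whole Assignment fragment.
    join_values = {t["eno"] for t in asg_site2}

    # Step 2: site 1 computes the semijoin (Employee semijoin Assignment),
    # keeping only the Employee tuples that will participate in the join.
    emp_reduced = [t for t in emp_site1 if t["eno"] in join_values]

    # Step 3: only the reduced operand is shipped to site 2 for the final join.
    result = [{**e, **a} for e in emp_reduced for a in asg_site2 if e["eno"] == a["eno"]]
    print(result)  # only the tuples for employee 2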
6. Distributed Concurrency Control
User accesses to the shared database are encapsulated as transactions, which at the lowest level are a set of read and write operations to the database. Concurrency control
algorithms enforce the isolation property of transaction execution, which states that the
effects of one transaction on the database are isolated from other transactions until the first one completes its execution. The most popular concurrency control algorithms are locking-based. In such schemes, a lock, in either shared or exclusive mode, is placed on some unit of storage
(usually a page) whenever a transaction attempts to access it. These locks are placed ac-
cording to lock compatibility rules such that read-write, write-read, and write-write con-
flicts are avoided. It is a well known theorem that if lock actions on behalf of concurrent
transactions obey a simple rule, then it is possible to ensure the serializability of these
transactions: “No lock on behalf of a transaction should be set once a lock previously
held by the transaction is released.” This is known as two-phase locking, since transac-
tions go through a growing phase when they obtain locks and a shrinking phase when
they release locks. In general, releasing of locks prior to the end of a transaction is prob-
lematic. Thus, most of the locking-based concurrency control algorithms are strict in that they hold on to their locks until the end (commit or abort) of the transaction. In distributed DBMSs, the concurrency control problem is complicated by the fact that in these systems, the operations of a given transaction may execute at multiple sites where
they access data. In such a case, the serializability argument is more difficult to specify
and enforce. The complication is due to the fact that the serialization order of the same
set of transactions may be different at different sites. Therefore, the execution of a set of distributed transactions is serializable if and only if
1. the execution of the set of transactions at each site is serializable, and
2. the serialization orders of these transactions at all these sites are identical.
This is known as global serializability. Three locking-based approaches can be used to enforce global serializability: centralized locking, primary copy locking, and distributed locking algorithms.
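The lock compatibility rules and the strictness discussed above can be condensed into the single-site lock manager sketched below; the class interface, and the choice to reject rather than queue conflicting requests, are simplifications made for the example. Under centralized, primary copy and distributed locking, such a manager would run at one site, at the sites holding primary copies, or at every site, respectively.

    # A minimal sketch of a strict two-phase locking lock manager.
    class LockManager:
        def __init__(self):
            self.table = {}          # data item -> (mode, set of holder transaction ids)

        def request(self, txn, item, mode):
            """Try to grant a 'S' (shared) or 'X' (exclusive) lock; return False on conflict."""
            held = self.table.get(item)
            if held is None:
                self.table[item] = (mode, {txn})
                return True
            held_mode, holders = held
            if mode == "S" and held_mode == "S":
                holders.add(txn)     # shared locks are compatible with each other
                return True
            if holders == {txn}:     # re-request or upgrade by the sole holder
                self.table[item] = ("X" if mode == "X" else held_mode, holders)
                return True
            return False             # read-write or write-write conflict: caller waits or aborts

        def release_all(self, txn):
            """Strict 2PL: all of a transaction's locks are released together at commit/abort."""
            for item in list(self.table):
                mode, holders = self.table[item]
                holders.discard(txn)
                if not holders:
                    del self.table[item]

    lm = LockManager()
    assert lm.request("T1", "x", "S") and lm.request("T2", "x", "S")  # compatible reads
    assert not lm.request("T2", "x", "X")                             # write-read conflict
    lm.release_all("T1"); lm.release_all("T2")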
In centralized locking, there is a single lock table for the entire distributed database.
This lock table is placed, at one of the sites, under the control of a single lock manager.
The lock manager is responsible for setting and releasing locks on behalf of transactions.
Since all locks are managed at one site, this is similar to centralized concurrency control
and it is straightforward to enforce the global serializability rule. These algorithms are
simple to implement, but suffer from two problems. The central site may become a bot-
tleneck, both because of the amount of work it is expected to perform and because of the
traffic that is generated around it; and the system may be less reliable since the failure or
inaccessibility of the central site would cause system unavailability. Primary copy lock-
ing is a concurrency control algorithm that is useful in replicated databases where there
may be multiple copies of a data item stored at different sites. One of the copies is desig-
nated as a primary copy and it is this copy that has to be locked in order to access that
item. All the sites know the set of primary copies for each data item in the distributed
system, and the lock requests on behalf of transactions are directed to the appropriate
primary copy. If the distributed database is not replicated, primary copy locking degenerates into a distributed locking algorithm.
In distributed (or decentralized) locking, the lock management duty is shared by all
the sites in the system. The execution of a transaction involves the participation and co-
ordination of lock managers at more than one site. Locks are obtained at each site where
the transaction accesses a data item. Distributed locking algorithms do not have the over-
head of centralized locking ones. However, both the communication overhead to obtain
all the locks, and the complexity of the algorithm are greater.
One side effect of all locking-based concurrency control algorithms is that they can cause deadlocks; the detection and resolution of deadlocks in a distributed system is difficult. Nevertheless, the relative simplicity and better performance of locking algorithms make them more popular than alternatives such as timestamp-based algorithms or optimistic algorithms.
7. Distributed Reliability Protocols
Two transaction properties are maintained by the reliability protocols of a distributed DBMS: atomicity and durability. Atomicity requires that either all the operations of a transaction are executed
or none of them are (all-or-nothing property). Thus, the set of operations contained in a
transaction is treated as one atomic unit. Atomicity is maintained in the face of failures.
Durability requires that the effects of successfully completed (i.e., committed) transactions survive failures and persist in the database.
The underlying issue addressed by reliability protocols is how the DBMS can con-
tinue to function properly in the face of various types of failures. In a distributed DBMS,
four types of failures are possible: transaction failures, site (system) failures, media (disk)
failures and communication failures. Transactions can fail for a number of reasons: due
to an error in the transaction caused by input data or by an error in the transaction code,
or the detection of a present or potential deadlock. The usual approach to take in cases of
transaction failure is to abort the transaction, resetting the database to its state prior to the start of that transaction.
Site (or system) failures are due to a hardware failure (e.g., processor, main memory,
power supply) or a software failure (bugs in system code). The effect of system failures
is the loss of main memory contents. Therefore, any updates to the parts of the database
that are in the main memory buffers (also called volatile database) are lost as a result of
system failures. However, the database that is stored in secondary storage (also called
stable database) is safe and correct. To achieve this, DBMSs typically employ logging
protocols, such as Write-Ahead Logging, which record changes to the database in system
logs and move these log records and the volatile database pages to stable storage at ap-
propriate times. From the perspective of distributed transaction execution, site failures
are important since the failed sites cannot participate in the execution of any transaction.
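The write-ahead logging idea mentioned above can be sketched as follows; the file names, the record format and the single-page granularity are assumptions made for the illustration and do not describe any particular DBMS.

    import json, os

    LOG_FILE = "wal.log"          # stands in for the stable log
    DB_FILE = "stable_db.json"    # stands in for the stable database

    def write_ahead(txn_id, page_id, before, after, volatile_db):
        # 1. Append a log record describing the change and force it to stable
        #    storage *before* the database page itself is modified.
        with open(LOG_FILE, "a") as log:
            log.write(json.dumps({"txn": txn_id, "page": page_id,
                                  "before": before, "after": after}) + "\n")
            log.flush()
            os.fsync(log.fileno())
        # 2. Only now may the volatile page be changed and later flushed to the
        #    stable database; after a crash, the log suffices to redo or undo it.
        volatile_db[page_id] = after
        with open(DB_FILE, "w") as f:
            json.dump(volatile_db, f)

    db = {}
    write_ahead("T1", "p42", None, "new value", db)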
Media failures refer to the failure of secondary storage devices that store the stable database. These failures are typically addressed by duplexing storage devices and maintaining archival copies of the database. Media failures are frequently treated as problems local to one site and therefore are not specifically addressed in the reliability mechanisms of distributed DBMSs.
The three types of failures described above are common to both centralized and
distributed DBMSs. Communication failures, on the other hand, are unique to distributed
systems. There are a number of types of communication failures. The most common
ones are errors in the messages, improperly ordered messages, lost (or undelivered) mes-
sages, and line failures. Generally, the first two of these are considered to be the respon-
sibility of the computer network protocols and are not addressed by the distributed
DBMS. The last two, on the other hand, have an impact on the distributed DBMS proto-
cols and, therefore, need to be considered in the design of these protocols. If one site is
expecting a message from another site and this message never arrives, this may be be-
cause (a) the message is lost, (b) the line(s) connecting the two sites may be broken, or
(c) the site that is supposed to send the message may have failed. Thus, it is not always
possible to distinguish between site failures and communication failures. The waiting
site simply times out and has to assume that the other site is incommunicado. Distributed
DBMS protocols have to deal with this uncertainty. One drastic result of line failures
may be network partitioning in which the sites form groups where communication within
each group is possible but communication across groups is not. This is difficult to deal
with in the sense that it may not be possible to make the database available for access in every partition while at the same time guaranteeing its consistency. Distributed reliability protocols consist of atomic commitment protocols and distributed recovery protocols. The most popular atomic
commitment protocol is two-phase commit. The recovery protocols are built on top of the local recovery protocols, which are dependent upon the supported mode of interaction of the DBMS with the operating system.
Two-phase commit (2PC) is a very simple and elegant protocol that ensures the atomic commitment of distributed transactions. It extends the effects of local atomic commit actions to distributed transactions by insisting that all sites involved in the exe-
cution of a distributed transaction agree to commit the transaction before its effects are
made permanent (i.e., all sites terminate the transaction in the same manner). If all the
sites agree to commit a transaction then all the actions of the distributed transaction take
effect; if one of the sites declines to commit the operations at that site, then all of the
other sites are required to abort the transaction. Thus, the fundamental 2PC rule states:
1. If even one site declines to commit (which means it votes to abort) the transac-
tion, the distributed transaction has to be aborted at each site where it executes;
and
2. If all the sites vote to commit the transaction, the distributed transaction is committed at every site where it executes.
The simple execution of the 2PC protocol is as follows (Figure 3). There is a coor-
dinator process at the site where the distributed transaction originates, and participant
processes at all the other sites where the transaction executes. Initially, the coordinator
sends a “prepare” message to all the participants each of which independently determines
whether or not it can commit the transaction at that site. Those that can commit send back
a “vote-commit” message while those who are not able to commit send back a “vote-
abort” message. Once a participant registers its vote, it cannot change it. The coordinator
collects these messages and determines the fate of the transaction according to the 2PC
rule. If the decision is to commit, the coordinator sends a “global-commit” message to all
the participants; if the decision is to abort, it sends a “global-abort” message to those par-
ticipants who had earlier voted to commit the transaction. No message needs to be sent to
those participants who had originally voted to abort since they can assume, according to
the 2PC rule, that the transaction is going to be eventually globally aborted. This is known as the unilateral abort option.
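The message flow just described can be condensed into the small coordinator sketch below; the participants are modeled as local functions that return their votes, which is a simplification, since a real implementation exchanges network messages and writes log records at each step for recovery.

    # A minimal sketch of the coordinator side of two-phase commit.
    def two_phase_commit(participants):
        # Phase 1: send "prepare" and collect the votes.
        votes = [p("prepare") for p in participants]
        decision = "global-commit" if all(v == "vote-commit" for v in votes) else "global-abort"
        # Phase 2: send the decision. Participants that already voted to abort
        # need no message; by the 2PC rule they may abort unilaterally.
        for p, v in zip(participants, votes):
            if decision == "global-commit" or v == "vote-commit":
                p(decision)
        return decision

    # Two hypothetical participants: one that can commit, one that cannot.
    ready = lambda msg: "vote-commit" if msg == "prepare" else msg
    faulty = lambda msg: "vote-abort" if msg == "prepare" else msg
    assert two_phase_commit([ready, ready]) == "global-commit"
    assert two_phase_commit([ready, faulty]) == "global-abort"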
There are two rounds of message exchanges between the coordinator and the par-
ticipants; hence the name 2PC protocol. There are a number of variations of 2PC, such as
the linear 2PC and distributed 2PC, that have not found much favor among distributed
DBMS vendors. Two important variants of 2PC are the presumed abort 2PC and pre-
sumed commit 2PC. These are important because they reduce the message and I/O over-
head of the protocols. The presumed abort protocol is included in the X/Open XA standard
and has been adopted as part of the ISO standard for Open Distributed Processing.
One important characteristic of 2PC protocol is its blocking nature. Failures can oc-
cur during the commit process. As discussed above, the only way to detect these failures
is by means of a time-out of the process waiting for a message. When this happens, the sites that are still operational follow a termination protocol to determine what to do with the transaction that was in the middle of the commit process. A
non-blocking commit protocol is one whose termination protocol can determine what to
do with a transaction in case of failures under any circumstance. In the case of 2PC, if a
site failure occurs at the coordinator site and one participant site while the coordinator is
collecting votes from the participants, the remaining participants cannot determine the
fate of the transaction among themselves, and they have to remain blocked until the coor-
dinator or the failed participant recovers. During this period, the locks that are held by the
transaction cannot be released, which reduces the availability of the database. There have
been attempts to devise non-blocking commit protocols (e.g., three-phase commit), but the added overhead of these protocols has precluded their adoption in practice.
The inverse of termination is recovery. When a failed site recovers from the failure,
what actions does it have to take to recover the database at that site to a consistent state?
This is the domain of distributed recovery protocols. If each site can look at its own log
and decide what to do with the transaction, then the recovery protocol is said to be inde-
pendent. For example, if the coordinator fails after it sends the “prepare” command and
while waiting for the responses from the participants, upon recovery, it can determine
from its log where it was in the process and can restart the commit process for the trans-
action from the beginning by sending the “prepare” message one more time. If the par-
ticipants had already terminated the transaction, they can inform the coordinator. If they
were blocked, they can now resend their earlier votes and resume the commit process.
However, this is not always possible and the failed site has to ask others for the fate of
the transaction.
8. Replication Protocols
In replicated distributed databases, each logical data item has a number of physical
instances. For example, the salary of an employee (logical data item) may be stored at
three sites (physical copies). The issue in this type of a database system is to maintain
some notion of consistency among the copies. The most discussed consistency criterion is
one copy equivalence, which asserts that the values of all copies of a logical data item should be identical when the transaction that updates that item terminates.
If replication transparency is maintained, transactions will issue read and write op-
erations on a logical data item x. The replica control protocol is responsible for mapping these operations to operations on the physical copies. The simplest replica control protocol is the Read-Once/Write-All (ROWA) protocol. ROWA maps each read on x [Read(x)] to a read on one of the physical copies xi
[Read(xi)]. The copy that is read is insignificant from the perspective of the replica con-
trol protocol and may be determined by performance considerations. On the other hand,
each write on logical data item x is mapped to a set of writes on all copies of x.
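A minimal sketch of this mapping is given below, with a hypothetical dictionary standing in for the physical copies of x stored at different sites.

    import random

    copies = {"site1": 100, "site2": 100, "site3": 100}  # physical copies of x (assumed)

    def read_x():
        # A logical Read(x) is mapped to a read of any one physical copy;
        # which copy is chosen is irrelevant to correctness.
        return copies[random.choice(list(copies))]

    def write_x(value):
        # A logical Write(x) is mapped to writes on all physical copies; under
        # ROWA the transaction cannot terminate if any copy is unavailable.
        for site in copies:
            copies[site] = value

    write_x(200)
    assert read_x() == 200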
The ROWA protocol is simple and straightforward, but it requires that all copies of all
logical data items that are updated by a transaction be accessible for the transaction to
terminate. Failure of one site may block a transaction, reducing database availability.
A number of alternative algorithms have been proposed which reduce the require-
ment that all copies of a logical data item be updated before the transaction can terminate.
They relax ROWA by mapping each write to only a subset of the physical copies. The majority consensus approach, for example, allows a transaction that updates a logical data item to terminate successfully as long as a majority of the copies can be updated. Thus, all the copies of a logical data item
may not be updated to the new value when the transaction terminates.
This idea of possibly updating only a subset of the copies, but nevertheless success-
fully terminating the transaction, has formed the basis of quorum-based voting for replica
control protocols. The majority consensus algorithm can be viewed from a slightly differ-
ent perspective: It assigns equal votes to each copy and a transaction that updates that
logical data item can successfully complete as long as it has a majority of the votes.
Based on this idea, an early quorum-based voting algorithm assigns a (possibly unequal)
vote to each copy of a replicated data item. Each operation then has to obtain a read quo-
rum (Vr) or a write quorum (Vw) to read or write a data item, respectively. If a given data
item has a total of V votes, the quorums have to obey the following rules:
1. Vr + Vw > V (a data item is not read and written by two transactions concurrently); and
2. Vw > V/2 (two write operations from two transactions cannot occur concurrently on the same data item).
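The two rules can be checked with the small sketch below, which uses a hypothetical vote assignment; the read and write quorums are chosen so that any read quorum intersects any write quorum, and any two write quorums intersect.

    # Hypothetical vote assignment: total V = 4 votes over three copies.
    votes = {"copy1": 2, "copy2": 1, "copy3": 1}
    V = sum(votes.values())
    Vr, Vw = 2, 3          # satisfies Vr + Vw > V and Vw > V / 2

    def has_quorum(reachable_copies, needed):
        # An operation may proceed only if the reachable copies hold enough votes.
        return sum(votes[c] for c in reachable_copies) >= needed

    assert has_quorum({"copy1"}, Vr)               # a read may use copy1 alone
    assert not has_quorum({"copy2", "copy3"}, Vw)  # a write quorum must include copy1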
The difficulty with this approach is that transactions are required to obtain a quo-
rum even to read data. This significantly and unnecessarily slows down read access to the
database. An alternative quorum-based voting protocol that overcomes this serious per-
formance drawback has also been proposed. However, this protocol makes unrealistic
assumptions about the underlying communication system. It requires that all sites detect
failures that change the network’s topology instantaneously, and that each site has a view
of the network consisting of all the sites with which it can communicate. In general,
communication networks cannot guarantee to meet these requirements. The single copy equivalence protocols described above are generally considered to be restrictive in terms of the availability they provide. Voting-based protocols, on the other hand, are considered
too complicated with high overheads. Therefore, these techniques are not used in current
distributed DBMS products. More flexible replication schemes have been investigated
where the type of consistency between copies is under user control. A number of replica-
tion servers have been developed or are being developed with this principle. Unfortu-
nately, there is no clear theory that can be used to reason about the consistency of a repli-
cated database when the more relaxed replication policies are used.