Distributed Database Systems (David Bell, Jane Grimson)

INTERNATIONAL COMPUTER SCIENCE SERIES

Distributed Database Systems
David Bell
Jane Grimson, Trinity College, Dublin

ADDISON-WESLEY PUBLISHING COMPANY
The programs in this book have been included for their instructional value.
They have been tested with care but are not guaranteed for any particular
purpose. The publisher does not offer any warranties or representations, nor
does it accept any liabilities with respect to the programs.
Overall objectives
For many types of organization, the ultimate information handling aspir-
ation is to have an integrated database system so that common files can
be consulted by various sections of the organization, in an efficient and
effective manner. However, database systems are not just pertinent to
organizations or even large sections of enterprises. Many users of personal
computers now accept a database management system (DBMS) as an
integral part of their computing software almost as important as an
operating system. Databases are no longer confined to the province of
large-scale users.
The addition of communications facilities to a database system can
take it from a centralized concept to a decentralized concept. The reservoir
of data need not all be situated at a single site on a single computing
facility. Neither does the database have to be confined within the bound-
aries of a single organization or corporation.
Now this has caused some debate among commentators on tech-
nology. For a long time, even predating the advent of electronic com-
puters, the notion has been around that computers will start to 'take over'
and control the destiny of homo sapiens. This has recently been countered
by pointing to the wide availability of personal computers and the possi-
bility of linking them through increasingly sophisticated communications
facilities. A message of great freedom and decentralization is now being
preached in some quarters. The other side can of course argue back.
We do not wish to contribute directly to this debate. On the other
hand, we do wish to present the core knowledge of one particular aspect
- the data-handling aspect - of distributed computing which impacts on
both of the views stated above.
The idea of writing this book dates back to a St Patrick's Day in
the mid-1980s when we were in Rome to plan the third phase of an EC
project on distributed databases (DDBs). The motivation was that we
could share with others our findings on that project. It has taken longer
than expected to get round to writing, and our goal has widened a little
in the interim. We are interested in presenting the details of a new tool
for information handling to help cope with the bulk of information that
is in existence at present or being accumulated in the modern world. In
particular we are interested in showing that generalized data management
which is integrated but decentralized is already here.
So we make the implicit assumption that this is desirable, at least
in some computer applications, for reasons of increasing robustness,
efficiency, and local autonomy. It could of course be argued that we
therefore favour the second of the two viewpoints above - that we are
backing the optimistic notion of greater openness and availability, but we
are very aware that the same technology could be used for negative
purposes. Our view as technical researchers is that this is really a perennial
problem with technological advance. Social and moral considerations have
to be taken into account before the use of any powerful tool is contem-
plated. Widening the data landscape will not of itself make the world a
better place. Knives and telephones have negative potential as well as
positive - but few would say that it would be better if they had not been
invented.
Information is in many ways an unusual commodity. For one thing
it is one of the world's few increasing resources. The essential features of
an information bank are that the information can be lodged in it, saved
indefinitely without deterioration, and withdrawn as desired. Books and
other media have served the purposes of previous generations admirably
but cannot cope effectively with the bulk of information generated in a
space-age, post-industrial society. Computers emerged at just the right
time as tools to manage the mountains of data accumulated and encoun-
tered in many walks of life. Networks of computers and cheap stores and
processors mean that the systems need no longer be centralized. Huge
collections of information can be made available at arm's length, and
telecommuting and electronic cottages are being talked about. Some
observers have even said that there are enough electronic storage devices,
or quickly could be, to hold all the recorded information presently in the
world (estimated to be of the order of 10^21 bytes).
Again, we are not in the business of commenting on the desirability
or otherwise of such developments. We simply recognize that distributed
database systems are needed as tools if even the more modest dreams for
distributed information handling are to become a reality.
However, many difficulties are encountered when we attempt to
make a diverse collection of data stores appear as a single unit to the
user. Not only do we have the problem of providing a common interrog-
ation interface to the data, but we are also faced with problems of
efficiency, harmonization and representation as we develop distributed
database systems. Not all of the problems have been satisfactorily solved
yet, but there are now some systems available commercially to provide
some of the required functionality. This book is intended as a basic
treatment of most of these problems.
Rationale
Our approach to the book derives from accumulated experience in course
design. When designing degree courses in disciplines which have a strong
vocational character, we believe we should address four aspects of the
subject. First we should ensure that on completion of the course the
student should have a good grasp of the 'timeless' aspects of the subject.
As computing matures an increasing body of theoretical underpinning is
being built up. We believe that students should acquire this knowledge.
Now, this is not a theoretical book, but its primary objective is to acquaint
the reader with the most significant results of research and development
relevant to DDB systems at an appropriate level. So we have to make
sure that this aspect is recognized and acknowledged at appropriate places
and it is the foundation for the second objective of course design. This is
that the students should acquire a knowledge of the product-related state
of the art of the topics considered, so that they have an awareness of the
differences between research ideas and what is available to the prac-
titioner. It is, of course, impossible, using a static medium such as a book,
to ensure that the necessarily incomplete snapshot of today's situation is
kept up-to-date. The representation of any supplier's wares must be
verified and supplemented by anyone seriously surveying the market in
connection with a particular application. Nevertheless, an appreciation of
the broad view can be imparted.
The remaining two objectives of course design are exceedingly
important for students. They are that the students should have skills to
sell - a very important aspect for a vocational course - and (a not entirely
orthogonal objective) that they should have a systematic approach to
problem solving. Now while these two facets of course design are really
outside the scope of this book, we have endeavoured to ensure that the
presentation is compatible with the tutorial/practical aspects of courses for
which the book is a suitable text. This particular aspect of our philosophy is
reflected in the worked examples and the exercises for the reader in each
chapter of the book.
the data items, move data around the network and carry out operations
greatly influences the responsiveness and cost of servicing queries. The
language in which the queries are expressed is high level in the sense that
the users need not specify the path of access to the data items. But the
system has to do this, and the query optimizer is the system module
responsible for this.
Chapters 6 and 7 deal with transaction management. Transactions on
a database are 'collections of actions which comprise a consistent trans-
formation of the state of a system', invoked on the objects in the database.
Transformations are changes to the data records and devices. In many
applications a halt in the operation of the computer due to transaction or
other system failure is unacceptable. For example, backup or replica
database systems are often used to track the state of a primary database
system. These take over transaction processing in the event of disaster in the
main system. This is an example of database recovery. The difficulties of
ensuring that database recovery is effective and efficient in distributed
computer systems are addressed in Chapter 7.
Having multiple versions of data in distributed sites can increase
concurrency as well as supporting failure recovery. This is possible because
out-of-order requests to read data can be processed by reading suitable older
versions of data. More generally, it is important for transaction management
that such out-of-order operations do not interfere with the operation of
other transactions addressed to the system at the same time. For example, it
is possible to lose updates if the operations within different transactions are
out of sequence. Various techniques for enforcing a correct sequence of
operations within and between transactions are discussed in Chapter 6.
Efficiency is important because the enforcement of good 'schedules' can
result in indefinite blocking leading to long delays, especially in distributed
systems.
Some potential users of distributed systems are put off by the
perceived lack of control they have if their data is scattered over several
computers linked by telecommunications lines which are vulnerable to
unauthorized access or accidental corruption and failure. Chapter 8 deals
with these problems. Another major drawback of distributed databases is
due to the problems of database design and management, and these are
discussed in Chapter 9.
Given our course design philosophy expressed above, it would be
unforgivable to treat all of these topics in isolation from an example of
the sort of applications which might be expected to use DDBs. So Chapter
10 is devoted to an introduction of a case study based on a project we
jointly worked on for several years in health care. The idea is not to look
at the details of the DDB components already addressed by this point in
the book, but to provide the basis for a practical case study which readers
can use to appreciate the value of the DDB approach in applications, and
the difficulties that can be expected to arise.
Audience
The text should be useful for a variety of readers. We have focused our
attention on the needs of students doing a third or fourth year undergrad-
uate course on computing, but we have also tried to make it suitable as
a management information system text. Some engineering courses may
find it useful as well. We believe that it will be suitable for introductory
postgraduate readings. We have also tried to ensure that it is readable by
managers and professionals who simply want to keep abreast of the
changing technology and the challenges they can expect in the near future.
By limiting the coverage to key issues and to basic aspects of various
topics, supplemented by pointers to the research literature for more
advanced readers, we believe that the material can be covered in a two-
semester course. This would permit coverage in detail of the algorithms
in the book and allow a substantial case study to be carried out.
Acknowledgments
The material for this book was drawn from our notes for lectures and
research projects, and these were accumulated over a fairly long period.
There were several key contributors to this knowledge, and also to our
way of presenting it, and while we have not attempted to make a compre-
hensive list of these, we hope it will be clear from the references at the
ends of the chapters who were the most influential. We ask to be forgiven
if we have underemphasized or misrepresented any individual's work.
All our friends who worked with us on the Multistar project had
an impact on our approach to DDBs. We are also grateful to all our
postgraduate students and research staff who contributed to our database
projects over the years.
We would like to thank colleagues who were inveigled into reading
drafts of the various chapters and whose comments have undoubtedly
improved the quality of the book. Also thanks to Rita whose assistance
in the preparation of the manuscript was invaluable and to Simon Plumtree
and Stephen Bishop of Addison-Wesley, and Mary Matthews of Keyword
Publishing Services for all their help and advice.
Finally, to our families - Sheelagh, Michael and Allistair, Bill,
Andrew and Sarah - go special thanks for patience, forbearance and the
"Colle Umberto" spirit during this project.
David Bell, Jane Grimson January 1992
Contents
Preface v
1 Introduction 1
1.1 Introduction 1
1.2 The pressure to distribute data 2
1.3 Heterogeneity and data distribution 5
1.4 Integrating other kinds of information system 8
4.5 A practical combinatorial optimization
approach to the file allocation problem 100
4.6 Integration of heterogeneous database systems 106
4.7 The global data model 108
4.8 Getting a relational schema equivalent to a
network schema 110
4.9 Processing relational queries against the
network database 116
Glossary/Acronyms 397
Author Index 402
Subject Index 406
Trademark notice
dBASE III™ is a trademark of Ashton-Tate Incorporated
DEC™, VAX™ and VMS™ are trademarks of Digital Equipment Corporation
FOCUS™ is a trademark of Application Builders Incorporated
IBM™ is a trademark of International Business Machines Corporation
IDMS™ is a trademark of Cullinet Corporation
INGRES™ is a trademark of Relational Technology Incorporated
MS-DOS™ is a trademark of Microsoft Corporation
OPEN-LOOK™ and UNIX™ are trademarks of AT&T
ORACLE® is a registered trademark of Oracle Corporation UK Limited
OSF/Motif™ is a trademark of the Open Software Foundation, Inc.
SMALLTALK™ is a trademark of Xerox Corporation
SPARC™ is a trademark of Sparc International Inc.
Sun™ is a trademark of Sun Microsystems, Inc.
1 Introduction
1.1 Introduction
At no previous time in the history of computing have there been so many
challenging innovations vying for the attention of information systems
engineers as there are at present. Technologically, advances are continu-
ally being made in hardware, software and 'methodologies'. Improving
speeds of, possibly parallel, action and the increased size of rapid access
stores available, along with new announcements of products and new
ideas from researchers bring with them demands from the user population
for their exploitation. For applications, even if users are not aware of
particular developments relevant to their operations, there is a perpetual
demand for more functionality, service, flexibility and performance. So
the information engineer designing a new information system, or pro-
longing the life of an old one, must always be seeking ways of linking
solutions offered by the technologists to the needs of users' applications.
One area in which solutions are becoming increasingly viable is in distrib-
uted information systems. These are concerned with managing data stored
in computing facilities at many nodes linked by communications networks.
Systems specifically aimed at the management of distributed databases
were first seriously discussed in the mid-1970s. Schemes for architectures
really started to appear at the end of that decade.
Figure 1.2 Ptolemaic computing - users 'fitting in' with a centralized computer.
Figure 1.3 Copernican computing - computers are distributed and services are tailored to users' needs.
1.2.3 Difficulties
A number of objections can be raised to counterbalance these advantages.
Some technological and user-related difficulties still hinder the take-up of
distributed information systems ideas, although these difficulties are being
energetically addressed by researchers and developers.
Technological problems occupying the attention of researchers
worldwide include
on these subjects. Each of these has generated its own propaganda, and
it is frequently found that a single concept appears under different names
in different classes. The forces for distribution have led to the development
of corresponding distributed architectures in all of the classes, and it is
advisable to keep some perspective on the current state of this evolution.
For example, it is possible to consider the information retrieval
systems class as having been generalized to cover multimedia data gener-
ally, including voice, graphics and video as well as text. An example of
a generalized system developed for this purpose is a distributed multimedia
database system called KALEID. Distributed expert systems prototypes
are also beginning to appear. An example is the HECODES hetero-
geneous expert system framework which has been demonstrated on a
widely distributed computer network. The result of this development
pattern is a collection of distributed but discrete, individually-limited
systems, each very imperfectly representing the world and hard to link
up to the others. There is, as could be expected from previous experience,
a user pull to permit these heterogeneous classes to co-exist, conveniently,
in a single system.
Pairwise integration has been taking place for a considerable time.
Knowledge based system-database system integration, information
SUMMARY
The pressure to distribute data comes from both user pull and
technology push. User pull is prompted by the growth of information
systems and by the need for security and consistency and for
maintenance and local autonomy reasons. Technology push is because, on
the processor side, Grosch's Law has been repealed, and this is
complemented by the fact that communications networks are continually
being enhanced.
The difficulties that this brings also come from users and
technology. We are convinced that these are secondary, but it must be
said that by no means everyone agrees that the virtues outweigh the
difficulties. We will discuss this further at various points in this book.
Users need to have applications and development tool kits appropriate for
distribution; conflicts between users can also be expected. Technology
brings difficulties in optimizing the evaluation of queries, harmonizing
the processing of diverse nodal contributors to solutions, allocating data
around the nodes in a sensible way, controlling access to the data,
facilitating recovery from failure of system components and maintaining
the fidelity of the model of the real world represented by the distributed
data collections.
EXERCISES
1.1 Discuss three important technological reasons for distributing data.
1.2 Discuss three important user-related issues which suggest that data should be
distributed.
1.3 State Grosch's Law. On what grounds is it claimed that this 'law' has been
repealed?
1.5 'No chief information executive in his right mind would allow heterogeneous
data collections to arise in his organization!' Discuss the ways in which data
collections conforming to many styles and data models and using many
different types of computers and programming languages can evolve in a large
organization (such as a hospital).
1.6 In a hospital the medical clinicians wish to access 'structured' data records
describing their patients (holding, for example, date of birth, address,
profession and previous illnesses and treatment episodes). From time to time
they all wish to access various non-text and non-structured data collections for
their work. Identify examples of some of the additional types of information
they might require access to, and suggest how communications networks and
other computer system support can help in meeting their needs.
1.7 Suggest how pressure to decentralize data handling could arise from problems
in capacity planning and maintenance.
Bibliography
Abul-Huda B. and Bell D.A. An overview of a distributed multi-media
database management system (KALEID). In Proc. 1st European Conf. on IT
for Organisational Systems, Athens, Greece, May 1988
KALEID is an attempt to adopt the principles of multidatabase systems (see
Chapter 2) for application to multimedia data systems. It is a generalization
to cover not only multisite data but also multimedia data.
Bell D.A. (1985). An architecture for integrating data, knowledge, and
information bases. Proc. ASLIB Informatics 8 Conf., Oxford, England
Many of the concepts described in Section 1.4 are considered in this paper.
An extension of the multidatabase approach to the retrieval of structured
text data to allow integration of multimedia information, structured text and
rules is proposed.
Bell D.A. and O'Hare G.M. (1985). The coexistence approach to knowledge
representation. Expert Systems J., 2(4), 230-238
This is a short but interesting case study where a network database structure
is used to handle some aspects of medical diagnosis which would normally be
expected to require artificial intelligence techniques.
Brodie M.L. and Mylopoulos J. (1986). Knowledge bases and databases:
semantic versus computational theories of information. In New Directions for
Database Systems (Ariav G. and Clifford J. eds.), pp. 186-218, New Jersey:
Ablex Publishing Co.
Knowledge based systems are one of the three classes of information systems
which have evolved largely in isolation from each other. This paper is one
example of a useful introduction, vintage 1982, to this subject.
Ein-Dor P. (1977). Grosch's Law revisited. Datamation, June, 103-108
Grosch's 'Law' - really a very rough rule of thumb - stated that the returns
on an investment in terms of computer power increased as the square of the
investment. However the more recent studies reported here have shown that
this is no longer the case, and that returns are now about linear to the
investment.
Mostardi T. and Staniszkis W. (1989). Multidatabase system design
methodology. Database Technology J., 1(3), 27-37
There are few papers treating the subject of distributed database design.
CASE tools are not yet distributed. This paper takes a step in the right
direction by addressing the multidatabase design issue.
Otten A.M. (1989). The influence of data distribution on the systems
development life cycle. Database Technology J., 1(3), 39-46
A hard-headed look at the practical importance of distributed database
systems on systems design, building and operation.
Pavlovic-Lazetic G. and Wong E. (1985). Managing text as data. Proc. 11th
Conf. on VLDB, Tokyo, Japan
The integration of database and information retrieval systems is considered
here.
Salton G. and McGill M.J. (1983). Introduction to Modern Information
Retrieval. Tokyo: McGraw-Hill
This is one of the many excellent books that are available as introductions to information retrieval.
2 Overview of Databases and Computer Networks
2.1 Introduction
Distributed database technology involves the merging of two divergent
concepts, namely integration through the database element and distri-
bution through the networking element, as shown in Figure 2.1. This
chapter gives a short overview of database technology, followed by an
introduction to computer networks, a basic understanding of which is
necessary to many of the issues involving distributed databases addressed
in the rest of the book. The final section of the chapter is devoted to an
introduction to distributed databases.
[Figure 2.1: database (integration) and network (distribution) combine to give a distributed database.]
(1) relational
(2) network
(3) hierarchical.
[Table 2.1: the INPATIENT relation.]
as shown in Table 2.2, which records requisitions for laboratory tests for
patients. There is a relationship between row 3 of the INPATIENT
relation (for patient Mary Geraghty) and rows 1 and 2 of the LABREQ
relation, as all three rows share the same patient# (41268). Similarly row
5 of INPATIENT is related to row 3 of LABREQ (patient# 97231) and
the last rows in the two relations are also related (patient# 80233).
This example serves to illustrate another important concept in
relational DBs, namely foreign keys. Patient# in LABREQ is a foreign
key referencing INPATIENT. This means that the values of patient# in
LABREQ are constrained to a subset of the values of patient# in the
INPATIENT relation. In practical terms, this rule means that it is not
possible to requisition laboratory test results (insert a tuple into the
LABREQ relation) for a non-existent patient (a patient for whom there
is no corresponding tuple in the INPATIENT relation).
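As an illustration only (a sketch, not the book's own definitions: the column lists and types below are assumed, the composite key of LABREQ follows the assumptions stated later in this chapter, and some SQL dialects require the '#' in attribute names to be renamed or quoted), the two constraints could be declared in SQL roughly as follows:

CREATE TABLE INPATIENT
      (patient#   INTEGER NOT NULL,
       name       CHAR(30),
       address    CHAR(50),
       sex        CHAR(1),
       gp         CHAR(30),
       PRIMARY KEY (patient#));        -- duplicate or null patient# values are rejected

CREATE TABLE LABREQ
      (patient#   INTEGER NOT NULL,
       test_type  CHAR(10) NOT NULL,
       test_date  DATE     NOT NULL,
       req_dr     CHAR(30),
       PRIMARY KEY (patient#, test_type, test_date),
       FOREIGN KEY (patient#) REFERENCES INPATIENT (patient#));
                                       -- a requisition must name an existing patient

With declarations like these in place, an attempt to insert a LABREQ tuple whose patient# has no counterpart in INPATIENT is rejected by the DBMS itself rather than by application code.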
Both primary keys and foreign keys express integrity constraints on
the database, that is they constrain the allowable values of attributes. In
the case of primary keys, duplicate values are forbidden and no part of
the primary key may be null, and in the case of foreign keys, the value
must already exist in some other relation in the DB. DBs generally reflect
a real-world situation - they represent physical data and events. For
example, Mary Geraghty is an actual patient in the hospital for whom Dr
Keogh has ordered two laboratory tests. Ensuring the integrity of the DB
is mainly concerned with making sure that the DB obeys the rules of the
outside world it is modelling. This is a complex issue and will be discussed
in more detail in Chapter 8.
An integral part of the relational approach is the relational algebra,
which consists of two groups of operators, enabling users to operate on an
entire relation. Non-relational systems by contrast generally only support
Table 2.4 (a) The UNION operator. (b) The INTERSECT operator. (c) The DIFFERENCE (MINUS) operator.
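As a sketch of how these operators appear in SQL (assuming, as the table does, that INPATIENT and OUTPATIENT are union compatible, i.e. have identical column lists; INTERSECT and EXCEPT are not supported by every dialect):

SELECT * FROM INPATIENT
UNION                          -- (a) tuples in either relation, duplicates removed
SELECT * FROM OUTPATIENT;

SELECT * FROM INPATIENT
INTERSECT                      -- (b) tuples common to both relations
SELECT * FROM OUTPATIENT;

SELECT * FROM INPATIENT
EXCEPT                         -- (c) tuples in INPATIENT but not in OUTPATIENT (MINUS in some dialects)
SELECT * FROM OUTPATIENT;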
would produce the relation shown in Table 2.6(a). Selects are denoted by
σ condition, so the above select would be written σ sex='F' (INPATIENT).
The PROJECT operator takes a vertical subset from a relation,
that is it selects particular columns from a relation, eliminating duplicates.
The effect of
giving the relation shown in Table 2.6(c). The JOIN operator is denoted ⋈ a=b.
to describe them all, with the exception of the semi-join which plays a
particularly important role in query optimization in distributed DBMSs
(DDBMSs) as described in Chapter 5. The SEMI-JOIN of two relations,
A and B, is the join of A and B, projected back on the attributes of A.
Thus
would return the relation shown in Table 2.6(d). The semi-join operator
is denoted
⋉ a=b
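Standard SQL has no semi-join keyword, but the same effect can be sketched with a subquery that keeps only INPATIENT attributes (the join column patient# is assumed here, as in the earlier examples):

-- semi-join of INPATIENT with LABREQ on patient#:
-- INPATIENT tuples that have at least one matching LABREQ tuple,
-- projected back onto the attributes of INPATIENT
SELECT DISTINCT I.*
FROM   INPATIENT I
WHERE  EXISTS (SELECT *
               FROM   LABREQ L
               WHERE  L.patient# = I.patient#);

In a distributed setting this operator matters because only the joining attribute values need travel between sites to compute the reduction, which is why the semi-join features so prominently in the query optimization techniques of Chapter 5.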
Figure 2.3 (a) Hierarchical database structure. (b) Example hierarchical database.
[Figure: example network (CODASYL) database structures, showing CUST-ACC and ACC-TRANS sets and a student/course example with STUD-RES and CRS-RES sets.]
patient# → address
the integrity checker would signal an error.
Figure 2.7 Sample determinancy diagram.
(3) Patients can have any number of tests on any date but (for
simplicity) only one test of any given type on any one day ordered
by one doctor.
Figure 2.9 Determinancy diagram for simple patient and laboratory test database.
above is in BCNF, as is the student example whose determinancy diagram shows the composite {student#, course#} as the determinant and candidate key for date and grade.
However, examination of Figure 2.9 shows that there are two identifiers,
both of which are determinants, but not of the same attributes. Patient#
is a determinant of and a candidate identifier for name, date-of-birth,
address, sex and gp, whereas the composite attribute {patient#, test-type,
date} is a determinant of and candidate identifier for reqdr. Hence to
normalize this relation we must split the relation into two, such that each
relation is in BCNF:
PATIENT (patient#, name, date of birth, address,
sex, GP)
and
LABREQ (patient#, test-type, date, reqdr)
relational operators, that is it includes not only the select operator but
also projection and join. The basic syntax of the SQL SELECT is
SELECT <attribute list>
FROM <relation list>
WHERE <condition>;
For example, to retrieve all the female patients from Dublin from
the INPATIENT relation of Table 2.1, the command would be
SELECT *
FROM INPATIENT
WHERE sex = 'F' AND address = 'Dublin';
and the resulting relation is shown in Table 2.7(b). This example combines
both the relational select and project operators. Similarly, we can use the
SQL SELECT command to implement the relational join operator. The
example given in Table 2.6(c) would be specified in SQL by
SELECT *
FROM INPATIENT, LABREQ
WHERE INPATIENT. patient# = LABREQ. patient#;
SELECT name
FROM INPATIENT, LABREQ
WHERE INPATIENT.patient# = LABREQ.patient#
AND INPATIENT. gp = 'Dr Corrigan';
The result is shown in Table 2.7(c). Note that while two tests, FT4
and TT3, had been ordered for patients of Dr Corrigan, they are both
for the same patient, Mary Geraghty and since duplicates are eliminated,
only one tuple appears in the final relation.
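In many SQL implementations duplicate rows are only removed when explicitly requested, so a defensive form of this query adds DISTINCT (a minor variant of the query above, not a change to its meaning):

SELECT DISTINCT name
FROM   INPATIENT, LABREQ
WHERE  INPATIENT.patient# = LABREQ.patient#
AND    INPATIENT.gp = 'Dr Corrigan';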
Table 2.7 (a) Example of simple SQL SELECT. (b) Example including relational project. (c) Example incorporating relational select, project and join operators.
Computer networks are now very widely used; they range from
simple systems connecting a few personal computers (PCs) together to
worldwide networks with over 10000 machines and over a million users.
A network can be defined as an interconnected collection of autonomous
computers. A distributed system, such as a distributed database manage-
ment system (DDBMS), is built on top of a network in such a way as to
hide the network from the user.
Computer networks are generally classified according to whether
the computers they connect are separated by long (wide area network)
or short (local area network) distances.
Wide area networks (WANs) are used where the computers are
separated by distances greater than, say, 10 km. Typical transmission rates
for WANs are from 2 to 2000 kilobits per second.
The computers, which are referred to as hosts, nodes or sites, are
connected to a subnet. A subnet consists of two parts:
Figure 2.13 Alternative point-to-point network topologies.
tation details from the upper layer. The International Standards Organiz-
ation (ISO) has defined a seven-layer manufacturer-independent protocol,
known as the ISO OSI Open Systems Interconnection Reference Model
and most existing systems are migrating towards this standard.
[Figure: broadcast network topologies: (a) ring, (b) bus, (c) satellite.]
via open relay systems (OSI terminology for PSEs) to the corresponding
physical layer on host B and then up through the 'stack' on host B to
layer n. The functions of the various layers can be summarized as follows:
(1) The physical layer is concerned with the actual physical transmission
of bits across the communications media. It therefore addresses
issues such as whether the link is simplex, duplex or half-duplex,
[Figure: the seven-layer ISO OSI Reference Model, showing the application, presentation, session, transport, network, data link and physical layers, the peer protocols between two hosts, and the communication subnet boundary.]
[Figure 2.16: taxonomy of distributed data sharing systems: homogeneous versus heterogeneous, unfederated versus federated, single versus multi.]
use the term distributed database management system to cover both types
and only differentiate when necessary. Most of the concepts described in
this book are applicable to both types of DDBMS. We give examples of
each sub-class in Chapter 3.
A homogeneous DDBMS has multiple data collections; it integrates
multiple data resources. Some of the most venerable systems fall into this
class. They can be further divided into classes depending on whether or
not they are autonomous. We look at autonomy in more detail later on,
but in the meantime we will use this term to indicate a declared aim of
the systems designers to give the local systems control of their own
destinies.
A homogeneous DDB resembles a centralized DB, but instead of
storing all the data at one site, the data is distributed across a number of
sites in a network. Figure 2.17 shows the overall architecture of a pure
DDBMS. Note that there are no local users; all users access the underlying
DBs through the global interface. The global schema is the union of all
underlying local data descriptions (not specified in a schema) and user
views are defined against this global schema.
In Figure 2.17, we have not considered the case where there are
local schemas for the local databases. If we want to come up with a
standard conceptual architecture for a DDBMS evolved from the
ANSI-SPARC architecture, we could include local DBMSs and local
schemas, remembering that these do not have to be explicitly present in
any particular implementation. Indeed, in practice, most of the homo-
geneous systems do not have local schemas and have limited data manage-
ment software at the local level.
To handle the distribution aspects, we must add two additional
levels, as shown in Figure 2.18, to the standard three-level ANSI-SPARC
architecture shown in Figure 2.2, namely the fragmentation and allocation
schemas. The fragmentation schema describes how the global relations
are divided amongst the local DBs. Figure 2.19 gives an example of a
relation, R, which is divided into five separate fragments, each of which
could be stored at a different site. To materialize R (i.e. to reconstruct it
from its fragments), the following operations are required:
R = (A JOIN B) UNION (C JOIN D) UNION E
where JOIN and UNION have the normal relational meaning. It must of
course be possible to reconstitute a global relation from its fragments by
the application of standard relational operators. In practice this means
that the primary key of the relation R must be included in all vertical
fragments. Thus both A and B, and C and D, must be joinable and (A
JOIN B), (C JOIN D) and E all union compatible. A collection of
materialized relations at some particular time is called a snapshot.
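A minimal sketch of this reconstruction in SQL, assuming the fragments are visible as relations A to E and using hypothetical column names (r_key for the primary key, x and y for the remaining attributes), might read:

-- materialize R: join the vertical fragments on the primary key,
-- then union in the horizontal fragment E
CREATE VIEW R (r_key, x, y) AS
      SELECT A.r_key, A.x, B.y
      FROM   A, B
      WHERE  A.r_key = B.r_key
UNION
      SELECT C.r_key, C.x, D.y
      FROM   C, D
      WHERE  C.r_key = D.r_key
UNION
      SELECT r_key, x, y
      FROM   E;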
The allocation schema then specifies at which site each fragment
is stored. Hence fragments can migrate from site to site in response to
changing access patterns. Also, replication of fragments is easily supported
[Figure 2.19: the relation R divided into fragments A, B, C, D and E.]
by allocating a fragment to more than one site. The optimizer (see Chapter
5) can then select the most efficient materialization of the relation.
In Figure 2.16, we see that the other main class of data sharing
systems, namely heterogeneous systems, is the class characterized by the
use of different DBMSs at the local nodes. There are two main sub-
classes: those that do their integration fully within the system, and those
providing simpler 'hooks' or external appendages called gateways, to
permit linkage to alien systems.
The former sub-class can be further refined into a sub-class which
provides a significant subset of the functions one would expect from any
DBMS (see Section 2.2.3), and those which emphasize the more pragmatic
aspects of collective data handling, such as conversions between systems
and some basic performance features (called multidatabase management
systems here).
Multidatabase systems (MDBMSs) have multiple DBMSs, possibly
of different types, and multiple, pre-existing DBs. Integration is therefore
performed by multiple software sub-systems. The overall architecture of
an MDBMS is shown in Figure 2.20. Note that in contrast to the homo-
geneous DDBMS, there are both local and global users. MDBMSs inte-
grate pre-existing, heterogeneous data resources, although homogeneous
systems can also be accommodated. It is an important feature of these
systems that local users continue to access their local DBs in the normal
way unaffected by the existence of the MDB. We follow the taxonomy
of Sheth and Larson (1990) in Figure 2.16 closely for MDBMSs.
There are federated and unfederated MDBMSs. In unfederated
systems, there are no local users and this is a relatively obscure sub-class.
The federated systems are split into those that have a global schema
(tightly coupled) and those which do not (loosely coupled).
The schema architecture for a typical tightly coupled MDBMS is
shown in Figure 2.21. The global conceptual schema is a logical view of
all the data in the MDB. It is only a subset of the union of all the local
conceptual schemas, since local DBMSs are free to decide what parts of
their local DBs they wish to contribute to the global schema. This freedom
is known as local autonomy (see Section 2.4.2). An individual node's
participation in the MDB is defined by means of a participation schema,
and represents a view defined over the underlying local conceptual
schema. Three additional levels - the participation schemas, the global
conceptual schema and the global external views - have been added to
the ANSI-SPARC architecture. Support for user views is possibly even
more important in DDBMS environments, where the global schema is
likely to be extremely large and complex, representing as it does the
integration of the entire organization's data resource. The auxiliary
schema, which is shown in Figure 2.21, when it is used, describes the
rules which govern the mappings between the local and global levels. For
example, rules for unit conversion may be required when one site
expresses distance in kilometres and another in miles. Rules for handling
null values may be necessary where one site stores additional information
which is not stored at another site, for example one site stores the name,
home address and telephone number of its employees, whereas another
just stores names and addresses. Some MDBMSs also have a fragmen-
tation schema, although not an allocation schema, since the allocation of
fragments to sites is already fixed as MDBMSs integrate pre-existing
DBs.
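As a sketch of the kind of rule such an auxiliary schema might capture (the relation and column names here are invented for illustration), a site holding distances in kilometres and no telephone numbers could contribute a participation view such as:

-- map the local relation onto the global representation:
-- convert kilometres to miles and pad the column this site does not hold
CREATE VIEW GLOBAL_EMPLOYEE (name, home_address, phone, distance_miles) AS
SELECT name,
       home_address,
       CAST(NULL AS CHAR(20)),     -- telephone number not stored at this site
       distance_km * 0.6214        -- kilometres converted to miles
FROM   LOCAL_EMPLOYEE;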
The loosely coupled MDBMS, in which there has been growing
interest recently, is sometimes called an interoperable database system.
The important characteristic of these systems is that they have no global
conceptual schema. The construction of a global conceptual schema is a
difficult and complex task and involves resolving both semantic and syntac-
tic differences between sites. Sometimes these differences are so extensive
that it does not warrant the huge investment involved in developing the
global schema, especially when the number of multisite global queries is
relatively low.
There are two main approaches to building loosely coupled
MDBMSs. In one the onus is on the users to build their own views over
the local conceptual schemas, as shown in Figure 2.22, using powerful
query languages such as Litwin and Abdellatif's MSQL. Alternatively,
local DBs can define their contribution to the federated DB by means of
an export schema (analogous to the participation schema of MDBMS),
as shown in Figure 2.23.
The reader should be aware that there is, as yet, no standard
agreement as to terminology in this field.
site and as MM-DD-YY at another. More subtle differences can also arise
in relation to the interpretation of local integrity constraints at the global
level. For example, all sites might have a constraint which states that an
employee can only have one basic salary. Imagine the case of an employee
who is working part-time and is recorded on two different local DBs. The
DDBMS would have to have a way of resolving such difficulties and
presenting the global user with a uniform, integrated view.
Much of the early work on distributed databases focused on the
problems associated with integrating heterogeneous data models. The
basic decision which had to be made was whether to adopt a single model
at the global level and map all the local models onto this model or whether
to simply provide bidirectional translators between all pairs of models.
This latter approach has the advantage that a user of the IDMS network
DBMS can view the entire DDB as a network DB, thereby avoiding the
necessity of learning another model and language. However, this approach
requires n(n-1) translators, where n is the number of different models.
The alternative and much more widely used approach is to adopt a single
model, the so-called canonical model, at the global level and map all the
local models onto it. This only requires 2n translators. Moreover, most
systems adopt the relational model as the canonical model, and since many
of the local models are already relational, the mapping can concentrate on
semantic rather than syntactic issues.
The data can be distributed across the sites in different ways. The
data may be fully replicated, in which case the entire global DB is
stored at each node. Full replication is used when fault tolerance and
performance are important; the data is still available even in the event of
site and network failure and there are no communication delays (for
retrieval). Fully replicated DBs are designed top-down since they really
do consist of a single DB duplicated in its entirety across the network. The
problem of supporting updates and recovering from failure in replicated
databases is discussed in Chapters 6 and 7.
At the other extreme, the global DB can be fully partitioned, that
is there is no replication whatsoever. Such a situation is common with
multidatabases where local pre-existing DBs are integrated in a bottom-
up fashion. In between these two extremes is partial replication of data.
Partial replication is generally required when certain parts of the global
DB are accessed frequently from a number of different sites in the
network.
The degree of local autonomy or local control supported in a DDB
environment is another important factor. Where a system allows full nodal
autonomy, integration is more difficult, as will become apparent in later
chapters of this book. Autonomy is concerned with the distribution of
control as opposed to the distribution of data. Thus, at one extreme, if
there is no local autonomy this implies that there is full global control.
Such a situation could be found in a homogeneous distributed database
SUMMARY
In this chapter, we have presented a brief overview of the two
technologies of database management and computer networks, which are
combined together to produce distributed database systems. This
overview is intended to remind the reader of concepts with which the
reader is assumed to be familiar.
In Section 2.4 we present a comprehensive and consistent
taxonomy of distributed data sharing systems which will be used
throughout the book. Systems are classified at the top level as being
either homogeneous or heterogeneous. Although much of the research to
EXERCISES
2.1 (a) Is the natural join operator commutative, that is, is A ⋈ B = B ⋈ A?
(b) Is the semi-join operator commutative, that is, is A ⋉ B = B ⋉ A?
(c) Derive a set of fully normalized relations for the database and identify the
primary key of each.
2.5 What is the difference between point-to-point and broadcast networks? For
each type, describe a number of alternative topologies.
2.6 Explain the seven layers of the ISO OSI Reference Model.
2.7 Explain in detail the difference between homogeneous and heterogeneous data
sharing systems.
2.8 Why do you think homogeneous distributed database systems have been
described as 'solutions in search of problems'?
Bibliography
ANSI/X3/SPARC (1975). Study Group on Data Base Management Systems.
Interim Report, ACM SIGMOD 7(2)
This report presents the standard reference architecture for DBMSs
consisting of three levels: conceptual, external and internal schemas (see
Section 2.2.2)
ANSI (1986). American National Standard for Information Systems, Database
Language-SQL. ANSI X3.135-1986
The ANSI SQL standard, see also Date (1990).
ANSI/ISO (1988). Information Processing Systems - Remote Database Access,
Draft proposal ISO/IEC DP 9579
This proposed standard for remote database access could provide a lot of
useful functions (e.g. atomic commit) to distributed data sharing systems
which are built using the RDAP protocol (see Section 2.3.2).
Beeri C., Bernstein P.A. and Goodman N. (1978). A sophisticate's introduction
to database normalization theory. In Proc. VLDB, West Berlin.
Breitbart Y. (1990). Multidatabase interoperability, ACM SIGMOD 19(3),
53-60
This short paper gives a brief introduction to multidatabases, focusing in
particular on the problems of schema integration (producing a single, unified
global conceptual schema) and transaction management.
CODASYL Programming Language Committee (1971). Data Base Task Group
Report of CODASYL Programming Language Committee
This report contains the specification of network DBMSs; the so-called
CODASYL systems (see Section 2.2.5.2).
Codd E.F. (1970). A relational model of data for large shared data banks,
CACM, 13(6)
The seminal paper on the relational approach to DBM.
Coulouris G.F. and Dollimore J.B. (1988). Distributed Systems, Concepts and
Design. Wokingham: Addison-Wesley
An excellent introduction to distributed systems, concentrating on distributed
file as opposed to database systems. Many of the issues discussed (e.g.
concurrency control, recovery) are equally relevant to DDBMSs.
Date C.J. (1982). An Introduction to Database Systems, vol. II. Wokingham:
Addison-Wesley
Date C.J. (1990). A Guide to the SQL Standard 2nd edn. Wokingham:
Addison-Wesley
Date C.J. (1990). An Introduction to Database Systems vol. I 5th edn.
Wokingham: Addison-Wesley
Day J.D. and Zimmermann H. (1983) The OSI reference model. Proc. IEEE,
71, 1334-1340
SOLUTIONS TO EXERCISES
2.1
(a) Yes (see Ullman (1980) for proof)
(b) No
2.2
2.3
(i) (a) A
(b) RI (A, B) R2 (B, C)
(ii) (a) E
(b) R3 (E, F, G)
2.4
(a) The following assumptions are made:
* Students can take several courses and each course can be taken by several students;
* Lecturing staff can teach on more than one course and a course can be taught by more than one lecturer;
* A student has only one tutor, but a tutor can have many students;
* A given course is taught to one year only;
* Names are not necessarily unique;
* Addresses are not necessarily unique;
* Not all lecturers are tutors, but all tutors are lecturers.
3.1 Introduction
The purpose of a database is to integrate and manage the data relevant
to a given enterprise, such as a health provision network, an industrial
corporation or a bank. The motivation for establishing a database is to
get all of the data relevant to the operations of the enterprise gathered
together in a single reservoir, so that all of the problems associated with
applications in the enterprise can be serviced in a uniform manner. This is
to be done through a consistent set of languages, physical data structures,
constraint checkers, consistency checkers, development tools and other
data management functions.
However, we have already seen that centralized databases are beset
with problems that cannot be solved without a radical revision of the basic
concepts of data sharing, and this has led to the DDB concept. An
important factor to be considered if a change from centralization to
distribution of data is contemplated is that the user will typically have to
deal with more than one DBMS. It may be necessary to pull data derived
from a system managed by a particular DBMS into an apparently incom-
patible one or to write a query that simultaneously invokes more than
one 'brand' of DBMS. The user of course requires as much system support
as he/she can get for this and preferably 'off-the-peg' software offering
this support is desired.
were not suitable for what is clearly a common scenario for DDB develop-
ment, where considerable investment has already been made in developing
a variety of local databases on a variety of types and sizes of machines.
Most of them are homogeneous systems. These products were really only
suitable for adoption in 'green fields' situations.
From now on we simplify our terminology considerably by using a
common definition of the multidatabase approach. This is taken to be
simply an approach which allows a collection of pre-existing databases to
be treated as a single integrated unit. The software is superimposed on
local, possibly heterogeneous, DBs to provide transparent access to data
at multiple sites. Local accesses proceed as usual through the local DBMS.
Products which use this approach are starting to appear.
For many applications the multidatabase approach is likely to be
more feasible than the homogeneous, logically-integrated DDB approach
since it allows DBs to contract in to or out of the global DB almost as
desired and appropriate, thereby eliminating the need for design of a
finished DDB before any implementation takes place. The interfaces
needed for the addition of a new DB using a DBMS which is different
from all those already in the system require 'mere hours of coding'. There
are also several well known, non-multidatabase system prototypes which
manage data spread over heterogeneous computers.
The products already available, despite their immaturity, and even
the existence of impressive research prototypes, mean that the DDB
approach is worthy of serious consideration by those planning distributed
data handling systems in the 1990s. In the next sections we consider the
features of two representatives of each of the three classes of system. The
classes are: products (and we limit our attention here at least predomi-
nantly, to homogeneous systems), multidatabase systems and research
systems.
3.3.1 INGRES/STAR
The DDB manager developed and marketed by Relational Technology
Incorporated (RTI), INGRES/STAR, was originally restricted to running
on Unix computers which hosted their DBMS, INGRES. It has now been
further developed to become a useful general purpose DDB system which
can deal, albeit at a lower level of functionality and performance, with
data distributed over a large number of different computers, and managed
by a variety of DBMSs (see Figure 3.1). It provides a good subset of the
data management functions one expects from a centralized DBMS, and
future releases are promised to extend the size of that subset. It seems
to be well suited to a distributed environment where about 10% of the
data accesses are to remote sites. A user views the DDB as though it
were a local DB, and existing INGRES DBs can be integrated or a
transparency required to shield the user from the various dialects of SQL
that the participating systems (e.g. INGRES, SQL/DS, DB2, RMS and
Rdb) offer, and also from the nasty diversity of error messages submitted
by the heterogeneous systems. Non-relational gateways are also being
developed, and this requires the provision of relational interfaces to the
alien systems (see Chapter 4). The gateway carries out the parsing and
evaluating of a query over the INGRES 'view' of contributing databases.
The local data managers retrieve the data. A key goal of the gateway
feature is to keep the architecture fixed regardless of both the particular
INGRES product using it and the local database systems. GCA makes
INGRES/NET topology-independent, makes it easier and more efficient
to add new protocols and gives it the multiserver architecture needed
to enhance system throughput and to allow an unlimited number of
simultaneous users. This release also improves distributed query optimiz-
ation (see Chapter 5) and simplifies information management through the
introduction of the STAR*VIEW distributed database monitor.
The data is catalogued in the 'name server', and the user views the
data as a set of local relations. These are referred to by 'synonyms' which
translate to particular remote data objects (which form, for example,
fragments of a global relation). The location transparency this gives allows
existing database applications to be used without pain or extensive user
retraining. The local DBs retain their autonomy. Many useful products,
such as business support interfaces developed by RTI's 'corporate part-
ners' for centralized INGRES DBs carry over to DDBs.
The history of this product typifies the phased approach mentioned
earlier. The developments listed above mark the end of the second phase
of a four-phase development project for INGRES/STAR. The first phase,
completed in 1986, was to develop the homogeneous version of the system.
Several local and remote INGRES databases running under UNIX or
VMS operating systems on computing facilities linked by DECnet or
TCP/IP could be integrated into a single DDB using this software. In phase
3 the functionality will be further improved by adding better distributed
transaction processing support, replication management, efficiency, secur-
ity and administration features. We will be examining features of the
subsystems needed to deal with these facilities in distributed database
systems in general in subsequent chapters.
The more futuristic object-oriented system features (see later) being
designed and prototyped in POSTGRES will be merged with
INGRES/STAR in phase 4, and the issues of nested configurations,
fragments and better configuration strategies are also being addressed for
that phase.
In summary, INGRES/STAR provides
3.4.1 Multibase
In the Multibase system, which was designed by Computer Corporation
of America (CCA), a simple functional query language (DAPLEX) is used
to reference and manipulate distributed data, which is viewed through a
single global schema. DAPLEX is also used for defining the data. The
users of the system are therefore enabled to access in a uniform manner,
and with ease and efficiency, data which is of interest to them but is
scattered over multiple, non-uniform local databases (Figure 3.3(a)). The
users are each provided with a view of the data as belonging to just one
non-distributed database. The pre-existing DBs, their DBMSs, and their
local applications programs are unaffected (at least logically but probably
not performance-wise), so that a global DB is easily extensible.
All of the underlying operations which are necessary to access data
relevant to a query are provided automatically and transparently to the
user, by Multibase. These operations include: locating the data needed
to service the query; knowing the local formats and the query languages
that have to be used to access the local data; breaking the global query
down into subqueries which can each be evaluated at a local site; correctly
resolving any inconsistencies that may occur between data from different
sites; combining the partial results to obtain an answer to the global
query; and presenting the result to the original user's site. By providing
the functions transparently within the system, many sources of error and
many tedious procedures can be eliminated.
The local host computers are all connected to a communications
network to permit global access through Multibase. The network can be
a local area network or a wide area network. The global user has to have
an interface to this network, so Multibase is connected into the network so that its services can be used at the user's node. Local sites retain autonomy for local operation and for updates - a local DB can be updated only locally in Multibase. Global concurrency control would therefore require control over specific locking and timestamping mechanisms at the local sites; for instance, global deadlock detection algorithms could well depend upon having local wait-for graphs available from the local DBMSs.
[Figure 3.3 (b): Multibase components - the GDM and the LDIs sitting above the local DBs on the local computers]
generalizations. Again it can be seen that Figure 2.21 has been interpreted
somewhat.
The view definition mechanisms of the local DBMSs are enhanced
by mapping languages to arrive at the NDMS internal schema and use is
made of semantic constraints to resolve conflicts due to semantic incom-
patibilities at the local level.
A system encyclopaedia containing all definitions pertaining to a
node is stored at the node, together with the NDMS control software.
This represents the local part of the global data dictionary. The local
DBA defines the relational application views needed at the node over the
internal schema. The end users may define, or have defined for them,
their own data abstractions over these views. Both permanent views and
periodically updated (by the system) snapshots are supported.
Local or distributed on-line and queued transactions, which can be
initiated at any point of the NDMS controlled network and executed at
any node, are controlled by the distributed transaction processing facility.
These may exchange messages and are synchronized by a two-phase
commit protocol. A transaction recovery mechanism, supported by a
system journal, is also available. Thus, unlike Multibase, update trans-
actions are supported. In ensuring the correctness of concurrent trans-
actions, no locks are applied at global level, so that deadlocks will be
localized and can be resolved by the local DBMS mechanisms. The query
processing strategy first ensures that the restrictions and projections are
pushed towards the leaves of the query tree to reduce the cardinalities
of intermediate relations. Then an intermediate storage representation,
vectorial data representation (VDR), based on transposed files is used to
enhance the performance of distributed JOINs. Query optimization is
founded on a novel set of heuristics for use at both the global and local
levels.
A product called Distributed Query System (DQS) based on this
prototype has been developed by CRAI for IBM environments, and
should be released in the near future. DQS is a retrieval-only multidatab-
ase system, using the relational data model for its global data model and
SQL as its query language (see Figure 3.5). It currently works over IMS,
IDMS, ADABAS and DB2 local databases and operating systems files
using BDAM, BSAM, QSAM, ISAM and VSAM. It provides trans-
parency over languages, data models and distribution, and pre-existing
DBs are left unaltered. Snapshots can be made and used in the same way
as any other relation. DQS uses a slightly modified version of the NDMS
query optimizer. Pilot installations were established in 1988 and initial
commercial releases started in February 1989.
In collaboration with a number of European Community partners,
including the institutions of the authors of this book, CRAI have also
developed a further product, called Multistar (M*), which complements
DQS by providing similar functionality for Unix environments. It is also
based on NDMS (see Figure 3.6). The global user accesses data via SQL,
either in interactive mode or embedded in 'C' programs. Applications
which generate SQL statements dynamically from a source other than a
terminal will also be supported. A number of versions of the system have
been implemented on DEC (VAX using INGRES), ICL (CLAN6 using
INGRES and ORACLE and also Series 39 under OS VME), SUN (Series
3.5.1 R*
A system called R* has been developed in the Almaden Research Centre
of IBM (formerly the San José Research Laboratory) to extend their
experimental centralized system, System R. This system permits the data
collections comprising the database to be stored at geographically dis-
persed sites. The major objectives of this experimental homogeneous
[Figure: R* query processing - catalog information and results are exchanged via SNA with other sites]
SQL query with the names having been changed using information from
the system catalog and other transformations having been carried out.
The work done at the master site is repeated, but here only the centralized
versions of the modules are activated. Clearly no distribution plans or
such like objects are required at this stage. The input to this process is
not shown in Figure 3.7, but the outline of the execution is similar to that
shown for the local access to data at the master site. A low-level access
plan and program for the local part of the query is then produced and
executed. Independent local recompilation is possible if required because
all of the original SQL query is stored at the local site.
Figure 3.8 Components of EDDS.
is the global data dictionary, GDD, which has received special attention
in EDDS. For small system nodes an additional system layer is required
to generate global queries and return their results when the local system
needs supplementation.
The GDD holds all the meta-data for the system. Participation
schemas, view mappings, security and integrity constraints and infor-
mation about sites and networks are all held here. In early implemen-
[Figure: the EDDS schema levels - global user views over a global schema, participation schemas and auxiliary relational schemas above the local schemas and local DBs]
tations the GDD was replicated at each site, and it performs its tasks in
an active management manner, as well as a passive administrative one.
A more sophisticated version where the GDD is distributed and managed
by a database management system is also available.
Perhaps the most interesting feature of the system is its usefulness
for linking local systems with very limited capability, such as small PCs,
into the DDB system in a manner which permits them to act as worksta-
tions in tandem with computing facilities with much greater capability,
albeit with somewhat reduced performance and functionality. The devel-
opers offer an 'indirect node' facility to support this feature. For example
(see Figure 3.10), a microcomputer hosting a limited DBMS, such as
DBaseIII, can be fitted with interfacing software which permits the user
to access data to which access rights are held using only the query facilities
of the local DBMS. In a manner transparent to the user, any data not
held locally is requested by an EDDS SQL query, generated by the
interfacing software, and posted off to a node of the pertinent EDDS
DDB for evaluation as though it were a conventional EDDS global query.
The results of the request are returned to the original query site and concatenated with the local results from DBaseIII to give a final reply to the enquirer. By using a gateway between EDDS and Multistar, and hence DQS, this opens up an extensive world for such a small workstation. Clearly the user will detect a difference in performance between such a query and a local DBaseIII query, but this sort of functionality has been
found to be in great demand by PC users in large organizations such as
hospitals.
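A rough sketch of how such an interfacing layer might behave is given below. It is illustrative only: the object interfaces, the simple test for which tables are held locally, and the assumption that the same query text can be posted unchanged to an EDDS node are all ours, not a description of the actual EDDS software.

def answer_query(local_db, edds_gateway, wanted_tables, sql_text):
    """Front-end for an 'indirect node': answer what we can locally and ask
    an EDDS node, via a generated request, for the rest.

    local_db:     object with .tables and .run(sql) for the small local DBMS
    edds_gateway: object with .run(sql) that posts a query to an EDDS node
    """
    local_part = [t for t in wanted_tables if t in local_db.tables]
    remote_part = [t for t in wanted_tables if t not in local_db.tables]

    rows = []
    if local_part:
        rows += local_db.run(sql_text)        # evaluated by the local DBMS
    if remote_part:
        # Transparent to the user: the request for data not held locally is
        # posted to EDDS for evaluation as though it were a global query.
        rows += edds_gateway.run(sql_text)
    return rows                                # concatenated final reply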
There are several other examples of innovation in EDDS, as would
be expected in a research prototype. One of these is the approach taken
to the estimation of intermediate relation sizes for distributed query
optimization (see Chapter 5).
SUMMARY
In this chapter we have endeavoured to present an overview of the state
of the art both of the available commercial products in the DDB area
and of the rising stars of research and development. We focused on two
products, of which the first, INGRES*, marks what we consider to be
the leading edge of commercial package development. Although only two
out of four phases of its development have been completed, a good subset of database functions has been provided in INGRES* for systems
where the data is distributed. The other product for larger IBM system
configurations, CA-DB:STAR, gives a range of capabilities which is
representative of a number of other systems at an earlier stage of
evolution than INGRES*. However, in another sense this product is
more mature than INGRES*, having been around for a longer period in
product form. It is a worthwhile contender for some environments.
The two multidatabase systems reviewed were Multibase and DQS.
The former, based on the DAPLEX data model, in many senses marked
the goal for such systems in the early 1980s and inspired efforts at
development of working prototypes, although it never itself became a
product. This contrasts with DQS, which together with its 'stablemate'
Multistar, is based on an early relational prototype, NDMS, which has a
number of features making its design novel. In tandem, DQS and
Multistar, which can be linked by a tailored gateway feature, cover IBM
and Unix configurations, running a variety of DBMSs.
The examples we have chosen to illustrate current research
frontiers are R* from IBM's research labs and EDDS, a system produced
as a result of a number of European Community Programme projects in
the DDB area. R* is a respected system and many research papers have
resulted from work done on it. In particular contributions to query
optimization, which will be dealt with in a subsequent chapter, have
been very influential on research and development in this area. EDDS is
a research vehicle which emphasizes global data dictionary structure and
the accommodation of users of small systems in DDB systems, and is
interesting for the research issues addressed within it.
Interest in producing DDB prototypes appeared to wane a little in
the mid-1980s, but there is currently a new wave of enthusiasm for
producing heterogeneous DDBs, which promises to bear fruit for the
next few years.
Figure 3.11 adds some leaves to the tree in Figure 2.16 by
indicating some of the systems falling in the various classes.
The performance issues remain a major focus for development and
we shall have to wait for a few more years for a full answer to the
questions posed at the beginning of this chapter.
EXERCISES
3.1 In a university many kinds of database are in use for three basic types of
application function, dealing respectively with:
"* Student records
"* Salaries
"* Resources (including rooms and equipment).
Write a report to the university finance officer explaining how an
INGRES/STAR DDB can help in the integration of data in this
environment. Assume that the present system is UNIX-based. Pay
particular attention to explaining the roles of the system modules and
schemas, and how they can save on development costs for a replacement
application system which consultants have identified as essential.
3.2 Repeat Exercise 3.1 but assume that the present system is IBM-based and so
you are advocating the adoption of CA-DB:STAR in your report.
[Figure 3.11: the classification tree of Figure 2.16 with example systems added - homogeneous distributed data systems (no local schemas) and heterogeneous (multidatabase) systems; the heterogeneous branch divides into unfederated and federated systems, and the federated branch into loosely coupled systems (no global schema, e.g. MRDSM) and tightly coupled systems with single or multiple global schemas (e.g. EDDS, DQS); UNIBASE also appears among the examples]
3.3 What functions would you expect to have most impact on the decision as to
whether or not DQS would be more appropriate than CA-DB:STAR for
integrating data when three banks with IBM computer installations merge?
3.5 Someone has calculated that all of the recorded information in the world (less
than 10²⁰ bytes) could be accommodated on the available computer storage
devices which are currently in existence, or which could be made available
within a reasonable period of time by obliging suppliers if ordered. Why do
you think that this has not happened?
Bibliography
Bell D.A., Fernández Pérez de Talens A., Gianotti N. et al. (1987). Multi-star:
a multidatabase system for health information systems, Proc. 7th Int. Congress
on Medical Informatics, Rome, Italy
The Multistar prototype was developed by a group of collaborators in an
EEC-funded project. It is largely based on NDMS and has been influenced
by EDDS. Its functionality is illustrated by reference to medical examples
which were studied in detail in this project.
Bell D.A., Grimson J.B. and Ling D.H.O. (1989). Implementation of an
integrated multidatabase-Prolog system. Information and Software Technology,
31(1), 29-38.
This paper gives some implementation and design information on the EDDS
prototype.
Breitbart Y.J. and Tieman L.R. (1984). ADDS - heterogeneous distributed
database system. Proc. 3rd Int. Seminar on Distributed Data Sharing Systems,
Parma, Italy
Another DDBS, previously called the Amoco Distributed Database System, was
a multidatabase system developed for the oil industry. An interesting feature
of the system was that its designers considered the problems associated with
concurrent access to the DDB which many other similar systems ignore.
Breitbart has continued his work in this area and, for example, presented a
paper at the 1987 SIGMOD Conference on the multidatabase update
problem.
Brzezinski Z., Getta J., Rybnik J. and Stepniewski W. (1984). UNIBASE -
An integrated access to data, Proc. 10th Int. Conf. on VLDB, Singapore.
Deen S.M., Amin R.R., Ofori-Dwumfo G.O. and Taylor M.C. (1985). The
architecture of a generalised distributed database system - PRECI*, Computer
J. 28(3), 282-290.
PRECI* is the distributed version of the centralized research vehicle,
PRECI, developed at the University of Aberdeen. An interesting feature of
its architecture is that, like PRECI itself, it attempts to be all things to all
men to some extent. In the present context this means being a suitable
architecture for both heterogeneous and homogeneous DDBs, each giving
full DDB functionality, and also for multidatabases where the goal of
distribution transparency was shed.
Esculier C. and Popescu-Zeletin R. (1982). A Description of Distributed
Database Systems. EEC Report ITTF/2024/83.
This EC report presents some details of each of the systems available in
Europe which, in 1982, were either products or 'would be within two years'.
The products were
* ADR's D-net
* Intertechnique's Reality D-DBMS
* ICL's DDB-50
* Siemen's Sesa***
* Nixdorf's VDN
* Tandem's Encompass
* Software AG's Datanet.
It is enlightening to note how few of these systems actually appeared on
the market.
Holloway S. (1986). ADR/DATACOM/DB - the high-performance relational
database management system. In Pergamon Infotech State of the Art Report on
Relational Database Systems (Bell D.A., ed.), pp. 114-135. Oxford: Pergamon
Infotech
This paper gives an excellent account of the system from which CA-DB:Star
evolved.
Landers T. and Rosenberg R.L. (1982). An overview of Multibase. Proc. 2nd
Int. Symposium on Dist. Databases, Berlin, Germany
This is an excellent, readable account of the Multibase system architecture.
Litwin W., Boudenant J., Esculier C. et al. Sirius systems for distributed
database management. In Distributed Databases (Schneider, H.-J., ed.), pp.
311-66. Amsterdam: North-Holland
Like SDD-1, the Sirius series of designs and prototypes from INRIA in
France had a significant influence on the design of full DDBs and
multidatabases. This is a very comprehensive paper, and it contains a rich
collection of well-considered solutions to DDB management and design
problems. The latter part of the paper describes MRDSM (see also Wong
and Bazex (1983)) which is a 'multidatabase system' by a slightly different
definition than that used most frequently in this book (see also Wolski
(1989) and Figure 2.16). In MRDSM, when data from different local
databases is to be accessed, the query must refer to the data items by
qualifying them by their respective database names (in a way analogous to
the way attribute names are qualified by relation names if necessary). This
somewhat waters down the aspirations of those seeking data transparency,
but could be exceedingly effective in some practical application
environments. Litwin has recently been continuing his work in interdatabase
dependencies and other interoperability issues.
Lohmann G., Mohan C., Haas L.M. et al. (1985). Query processing in R*. In
Query Processing in Database Systems (Kim, W., Batory D. and Reiner D.,
eds), pp. 31-47. Heidelberg: Springer.
Mallamaci C.L. and Kowalewski M. (1988). DQS: a system for heterogeneous
distributed database integration. Proc. Eurinformation Conf., Athens, Greece
DQS is a product based on the NDMS prototype.
Neuhold E.J. and Walter B. (1982). Architecture of the distributed database
system POREL. In Distributed Databases (Schneider H.-J., ed.), pp. 247-310.
Amsterdam: North-Holland
POREL is another of the DDB proposals which became trend-setters in the
late 1970s and early 1980s. It is a heterogeneous system, although it appears
to be aimed at 'green fields' situations mainly, which pays particular
attention to communication issues which have arguably become somewhat
irrelevant since 1982 with, for example, the advent of the ISO/OSI reference
model.
Rothnie J.B., Bernstein P., Fox S. et al. (1980). Introduction to a system of
distributed databases (SDD-1). ACM TODS, 5(1), 1-17
SDD-1, one of the first comprehensive specifications for a DDBMS, was
produced by Computer Corporation of America. Its general structure is built
around three types of module or 'virtual machine'. Transaction modules plan
and monitor the execution of the distributed transactions, data modules can
read, move, manipulate or write sections of a DDB, and the reliable
network connects the other two kinds of module, guarantees delivery of
messages, all-or-nothing updates, site monitoring and synchronization.
The access language is Datalanguage which can be used in stand-alone
mode or embedded in an application program. The term 'transaction' is used
to describe a statement in Datalanguage; an atomic unit of interaction
between the world and SDD-1.
This particular paper, one of a well-known series on SDD-1, marked an
important milestone in the history of research on DDBs. It describes a
prototype (first built in 1978) homogeneous DDB system which worked on
4.1 Introduction
An essential aspect of handling distributed data arises when we are
deciding how to distribute the data around the sites in order to take
advantage of the 'natural' parallelism of execution inherent in the distrib-
uted system or simply to get the best level of 'localization' of processing.
We explain this term later. Another fundamental issue is that of providing
automatic methods of transforming accesses to data which are written in
an 'alien' access language (in this case the global query language), into
access requests which can be dealt with by a given local DBMS.
In the first half of this chapter the data distribution issue is addressed
and a simple solution is provided. In Section 4.2 we establish the case for
careful allocation and placement of data over the storage devices of the
available computing facilities, and this is followed by a section giving
some examples which illustrate the advantages of careful data design.
Section 4.4 (which can be omitted on a first reading) shows how a
[Example residue: alternative placements of pages 1-4 over sites 1 and 2 - pages (1,4) and (2,3), pages (1,2) and (3,4), or pages (1,3) and (2,4)]
EXAMPLE 4.2
The case where there is a 1 : N (one to many) relationship between the
tuples of the two relations is relatively simple and so we focus on the
M : N case (many-to-many). Assume that the tuples of both relations
have the same width, and consider the following placement on a single
site. (Note that the pages in this example are twice as big as those in the
preceding example).
such a way that the evaluation of q is the UNION over i (i here is the
site index) of the local evaluations of the qi.
That is, if G is the global database, G = ∪i Li, where the Li are the local fragments, then

evaluation(q, G) = ∪i evaluation(qi, Li)
Having this property means that to process any query in Q, no data
transfers between sites are required apart from a final transmission of a
partial result from each Li to the querying site, where the UNION is
formed. Usually redundancy among the fragments is required for this
property to be attainable. This is acceptable for retrieval queries, provided
the storage costs do not rule it out. However, for update queries, which
have costs which are a non-decreasing function of the 'extent' of the
redundancy, we would certainly like to minimize redundancy. That is, we
would like to find a distribution strategy with minimal redundancy.
Putting this together with the local sufficiency goal above, we can express our goal as being to find a locally sufficient set of local fragments {Li} such that, for any other locally sufficient set of fragments {Ji} in which each Ji is at most as large as the corresponding Li, we have {Ji} = {Li}. That is, there is no different locally sufficient distribution less redundant than {Li}.
To demonstrate this qualitative approach to data distribution, we
assume in the rest of this section that we have only one relation for each
primitive object in our database and we restrict our queries to have either
only one variable (a unary operation), or to have an equi-JOIN on a
primary key or a foreign key, or to have finite combinations of both of
these. Although apparently very restrictive, this characterizes a large
natural class of queries and is therefore likely to be of practical value
and, perhaps more importantly in the present context, the queries so
defined reflect the semantic structure of the real world. So we use them
as the basis of the qualitative distribution approach outlined here.
An example of such a query in SQL is
This is realistic when we consider that most files are not fragmented
because many important accesses need to fetch all records in the files and
it is relatively hard, in general, to determine 'hit-groups', or active parts
of files or records, which are to be treated separately from the rest.
Anyway, sub-files can be treated just as complete files are, in cases where
they can be identified.
Furthermore, in real design situations it is unlikely that we will try
to exploit parallelism in our file allocation. The current conventional
wisdom is that this task should be delegated to the query optimization
sub-module of the distributed database management system.
The problem of optimizing the allocation of files to sites in distrib-
uted databases has a fairly long and varied history. Many of the solutions
are too general (and hence too complex) to be used on practical problems
because the number of parameters to be considered tends to be prohibi-
tive.
In this section we present a solution which is applicable only for a
particular manifestation of the problem, but which is pragmatic and useful
and also has the virtue of being illustrative of the solutions one can find
to this problem.
The particular solution given here was developed for distributing
data in a dedicated network of computing facilities which have tight limits
on their local mass storage capacity. It is obtained by transforming the
original problem (that of minimizing access costs) into the isomorphic
problem of maximizing local accesses.
Storage is considered to be a constraint on the optimization, rather
than as a cost: adding a file to a site does not incur any additional cost,
so long as the site capacity is not exceeded. The cost of exceeding the capacity at a site is considered to be so great, involving
as it does reconfiguring and reorganizing local mass storage, that it is out
of the question.
Another characteristic of the method considered here is that it is
assumed that the transaction traffic is known in advance - any ad hoc
queries are therefore optimized purely by the run-time query optimization
module in the DBMS. A given transaction can arise at any of the network's
nodes, and the frequency of occurrence at each site is input to the
algorithm as a table (see Example 4.3 for an illustration). Each transaction
is specified to the allocator simply as a set of entries in a table. These
show the number of update or retrieval accesses from the transaction to
each of the files (see Example 4.3 for an illustration). Reads and updates
are assumed to have equal costs. Local accesses are assumed to be much
cheaper than remote accesses. Remote accesses all have the same unit
cost, which is a reasonable assumption if we assume that the communi-
cation network underpinning the DDB is the same throughout. In the
version of the algorithm given here, no redundancy is permitted and
fragmentation of files is forbidden. So each file is assigned to one and
only one site in the network.
"* There are nki accesses (for retrieval or update) from transaction k
to file i
"* xj is a decision variable which is 1 if file i is allocated to node j, 0
"otherwise.
We know therefore that
"1,xij = 1 Vii 11 --i -- M (4.1)
Step 3
Inspect
If this solution is feasible (i.e. meets the constraints of Equation 4.2) it
is our answer; go to Step 7.
Step 4
Otherwise, identify all nodes which cause the constraints in Equation 4.2
to be broken.
Step 5
For every such over-subscribed node, solve the corresponding knapsack
problem (see below), thereby eliminating a node and the files allocated
to that node from further consideration. (If there is more than one such
node this step treats them in order of 'allocated value' and eliminates
them all.)
Step 6
Consider J(i) for any nodes j which remain. If there are some such nodes
go to Step 2.
Step 7
Otherwise, we have finished.
The knapsack problem is a classical problem in operations research
where a scarce resource (here a node's storage capacity) is to be allocated
to a number of competing users. In this case we want to 'pack' the node's
storage in a way which maximizes the 'value' of the contained goods, Vij.
For an introduction to solution methods see Daellenbach et al. (1983).
The knapsack problem instances to be solved in the exercises in this
chapter, and in many real design problems, are sufficiently small-scale to
permit exhaustive search for the answer.
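Since the knapsack instances met in practice are small, the whole heuristic can be sketched compactly. The Python below is a minimal illustration under the assumptions stated above (no redundancy, no fragmentation, storage as a hard constraint); because Steps 1 and 2 are not reproduced here, the 'value' Vij is taken, as an assumption, to be the number of accesses to file i that become local if it is stored at site j.

from itertools import combinations

def allocate(sizes, capacity, value):
    """Greedy file-to-site allocation with knapsack repair.

    sizes:    {file: Mbytes}
    capacity: {site: Mbytes}
    value:    {(file, site): Vij}, assumed to be the accesses made local
              by storing file i at site j
    """
    files, sites = set(sizes), set(capacity)
    assignment = {}
    while files and sites:
        # Steps 1/2: provisionally give every remaining file to its best site J(i)
        best = {f: max(sites, key=lambda s: value.get((f, s), 0)) for f in files}
        loads = {s: sum(sizes[f] for f, b in best.items() if b == s) for s in sites}
        # Step 3: if no site is over capacity, this is our answer
        over = [s for s in sites if loads[s] > capacity[s]]
        if not over:
            assignment.update(best)
            return assignment
        # Steps 4/5: take the over-subscribed site with greatest allocated value
        # and solve its knapsack by exhaustive search (fine for small problems)
        s = max(over, key=lambda s: sum(value.get((f, s), 0)
                                        for f, b in best.items() if b == s))
        candidates = [f for f, b in best.items() if b == s]
        best_pack, best_val = (), 0
        for r in range(len(candidates) + 1):
            for pack in combinations(candidates, r):
                if sum(sizes[f] for f in pack) <= capacity[s]:
                    v = sum(value.get((f, s), 0) for f in pack)
                    if v > best_val:
                        best_pack, best_val = pack, v
        for f in best_pack:
            assignment[f] = s
        # Step 6: eliminate the site and its packed files, then iterate
        files -= set(best_pack)
        sites.remove(s)
    return assignment

Applied repeatedly, the loop mirrors the worked example below: a provisional allocation to the best sites, a knapsack repair of the most heavily over-subscribed site, and iteration over the remaining files and sites.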
Suppose we are to allocate eight files among five sites, each with
20 Mbytes disk storage and given the following access rates of transactions
to files (nki)
Files (size/Mbytes)
1 2 3 4 5 6 7 8
Transactions (10) (5) (18) (9) (9) (7) (4) (4)
1 10 10 10 20
2 20 10
3 75 75 150 15
4 5 5 10 10 10
5 5 5 10 10
6 10 2 5 1
7 2 10 1 5
8 6 6 3 3
9 1 1
10 5
Sites
Transactions 1 2 3 4 5
1 30
2 20 10
3 3 5 4 4
4 12 9 9
5 10 7 3
6 180 20 100 1 1
7 100 1 30 35
8 30 20 10 10 10
9 3 3
10 4
File i (size)
1 2 3 4 5 6 7 8
Site j (10) (5) (18) (9) (9) (7) (4) (4)
Step 1
At Step 1 of the algorithm the J(i) are the bold elements for each i.
Step 2
If we assign xij = 1 for these and 0 for the other entries we have our first
'solution'.
Step 3
Site 1 has been allocated 55 Mbytes of files. Its capacity is only 20 MBytes,
and so this is not a feasible solution.
Step 4
Site 1 (has been allocated too much)
Step 5
The maximum value (Vi1) we can get from storing any files on site 1 is
obtained by storing files 1, 2 and 8 there.
Step 6
Our new Vij table is obtained by eliminating row 1 above and columns 1,
2 and 8.
The new J(i) are the underlined entries (all allocated to site 3)
Step 2'
Assign xij = 1 to these, xij = 0 to the remainder of the entries.
Step 3'
Site 3 has been allocated 47 MBytes of files, but it can only store
20 MBytes.
Step 4'
Site 3 (has been overloaded)
Step 5'
The maximum value we can get from storing files on site 3 is obtained
by storing files 4 and 5 there.
Step 6'
Our new Vij table is obtained by eliminating row 3 and columns 4 and 5
from the reduced table.
          File
Site       3      6      7
2        161      0      0
4        445    150     70
5        430    150     30
Step 2''
Assign xij = 1 for these entries, 0 for the rest.
Step 3''
Site 4 has been allocated 29 MBytes, but it can take only 20 MBytes.
Step 4''
Site 4 (has been overloaded)
Step 5''
Store file 3 at site 4.
Step 6''
Our new Vij table is obtained by eliminating the row for site 4 and the column for file 3 from the table above.
Without spelling out the details it is clear that the remaining 2 files, 6
and 7, are allocated to site 5.
So our solution is

Site    Files allocated    Storage used (Mbytes)
1       1, 2, 8            19
2       (none)             0
3       4, 5               18
4       3                  18
5       6, 7               11
abstractions. For each of these some details are deliberately omitted from
the representation.
One of these abstractions is aggregation, where a relationship
between entities is represented as a higher level object. For example an
'appointment' entity could be used to represent a relationship between a
patient, a doctor and a clinic. Another abstraction is generalization, where
a set of similar entities is considered to be a single generic entity. For
example a 'person' object could be a generalization of doctor, patient and
nurse entities. Restrictions on the particular class of objects in order to
obtain a subset of interest give another abstracting mechanism. For exam-
ple, the set of patients in the orthopaedic department could form an
object called 'orthopaedic patients'.
There are various other conversions and transformations that can
take place between schemas. Some are needed for syntactic reasons, for
example, if different units such as millilitres and fluid ounces are used to
measure volume in two different laboratories. Another syntactic trans-
formation would be needed if the structure of the records (or relations,
for example) in the local databases were different.
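A small sketch of the kind of syntactic mapping involved is shown below; the field names, the site identifier and the decision to standardize on millilitres are illustrative assumptions only.

ML_PER_FLUID_OUNCE = 28.4130625   # UK fluid ounce in millilitres

def to_global_record(local_record, site):
    """Map a local laboratory record into the form expected by the global schema."""
    volume = local_record['volume']
    if site == 'lab_b':                       # assume lab B records volumes in fl oz
        volume = volume * ML_PER_FLUID_OUNCE
    return {
        'sample_id': local_record['id'],      # structural renaming of the key field
        'volume_ml': round(volume, 2),
    }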
Another set of transformations may be needed for semantic reasons.
For example, some hospitals in a region may refer to themselves only implicitly, as a default, in a file used for recording the transport of patients between hospitals by ambulance. The other hospital for a particular transfer might be represented
explicitly, and the way this is done could vary between local systems. A
particular trip could then be represented twice within the distributed
system, but it might not be obvious that this is the case from any syntactic
considerations. The DDB system would have to 'figure out' the meanings
of the different structures from semantic information provided by the
system designer.
Each local data model is defined by the corresponding DDL and
DML semantics. In order to integrate pre-existing heterogeneous datab-
ases, special methods of mapping between different data models are
needed. An important goal for such mapping is that both the information
stored under the local data models and the operators which can be
addressed to them should be 'preserved'. That is, we should ideally be
able to access precisely all of the local data via the global model, in all
of the ways supported by the local DBMS. The use of this capability is,
of course, subject to logical and physical access constraints which exist in
a particular environment.
The pattern used here for transforming data models is that of
mapping a source data description, according to a source data model, into
a target description, according to a target data model. These descriptions,
or schemas, are said to be equivalent if they produce sets of database
states (actual databases) which are one-to-one and onto (i.e. nothing is
left out), the one-to-one states being equivalent. Database states under
the source and target data models are, in turn, said to be equivalent if
(1) that the global data model at any stage of its evolution is capable
of absorbing new data models (by extending the DDL and DML
using a system of axioms expressed in global data model terms);
(2) that the information and operators are preserved in the sense
above (i.e. there are commutative, or order-independent, mappings
between the schemas and operators, and a one-to-one, onto map-
ping between the database states);
(3) that the local data models can be synthesized into a unified global
data model (by constructing global data model kernel extensions
equivalent to the local data models, and unifying these extensions).
One such global data model, called the unifying conceptual data
model (UCDM), has been developed by starting from the relational data
model and extending it incrementally as in (1) above by adding axiomatic
extensions equivalent to various well-known (local) data models. The
principles in (2) and (3) above were systematically applied in order to
obtain these results.
Examples of simple axioms added to the RDM in order to make it
equivalent to the network model are the axioms of uniqueness, conditional
uniqueness and obligation for unique, unique non-null and mandatory
attributes (data items) respectively. More complex interrelational axioms
include the axiom of total functional dependency (between attributes in
different relations).
Another, perhaps more pragmatic, contender is DAPLEX. This is
a special purpose system for use as a pivot language and common data
model to which each local data model maps in order to integrate the pre-
existing databases. This system uses the functional model as a pivot
representation onto which the heterogeneous contributors map on a one-
to-one basis. Real-world entities and their properties are represented
as DAPLEX entities and functions respectively. Separate views of the
integrated 'database' are provided to meet local requirements.
Mapping between most well-known data models, and even
operating system files, is simple using DAPLEX. For example, a relational
database can be accommodated by representing each relation as a
DAPLEX entity and each domain as a DAPLEX single-valued function.
The network data model can be described in a similar way. Sets are
represented as multivalued functions which return member entities.
EXAMPLE 4.4
Consider the network schema at the 'remote' site as given below in Figure
4.1. The DAPLEX representation for this is given in Schema 4.1. The two
figures in the above example can also be used to outline how DAPLEX
can be used to integrate heterogeneous (here relational and network)
databases. DAPLEX views of the distributed data are defined using view
derivation over the DAPLEX versions of the local schemas.
Imagine a second DAPLEX schema equivalent to the relational
local schema at a site, which is very straightforwardly mapped from the
'local' schema, containing, in particular, an entity called LOCAL-CUST.
The view derivation for a global entity called GLOBAL-CUSTOMER is
given in Algorithm 4.1. This shows how to get the data from a local
database into the format required for global usage. It shows when cus-
define GLOBAL-CUSTOMER to be
   CNAME -> CNAME(I);
   MAIN-PURCH -> MAIN-PURCH(I);
   RATING -> max(int(TOT-PURCH(I)/4), RATING(I));
   end case
end loop
end DERIVE
The first two sources give the relations and attribute names, the third and
fourth sources supply integrity constraints for the relational schema.
Making some simplifying assumptions which we will identify (for
subsequent removal) later, the method derives a collection of relational schemas, as follows.
STEP 1
Construct the 'first-fix' relations
Each record type generates one relation. Its database key, and a combined
database key and set name for each set owning the record are appended.
There is a tuple in the extension of a relation for every record of that
type in the network database. Candidate keys (C-keys) are derived mainly
from the 'duplicates not allowed' clauses and indicated in the relational
schemas. Foreign keys are also derived from the logical linkage between
an owner record and its member, so there is a one-to-one mapping
between foreign key constraints and set types. The result of this step is a
'relational analog' of the network database, and it is now subjected to
two refining steps.
STEP 2
Top-down Synonym Substitution
Data items are propagated from the owner to the member record occur-
rences of the sets. A synonym of a database key for record R is a
combination of data items from its 'predecessors' which uniquely identify
the occurrences of R. This step replaces an owner database key in any
relation by a synonym (for example a key) of the owner relation. The
synonym is added to the key of the relation under consideration. This
process is repeated until no further substitution is possible.
STEP 3
Project out unwanted items
The database keys are eliminated from the schema. If there is a synonym
for every owner in the database, projecting out the database key of each
relation removes all database keys.
Hence we end up with a set of relations which have attributes
corresponding to the original network database's data items, concatenated
with some inherited attributes.
EXAMPLE 4.5
STEP 1
First-Fix Relations
CUSTOMER
c-db-key, CNAME, ADDRESS, MAIN-PURCHASE, TOT-PURCH
C-key: CNAME
SALESMAN
s-db-key, SNAME, OFFICE, SALARY
C-key: SNAME
PURCHASE
p-db-key, c-db-key, s-db-key, DATE, PRICE, QTY
C-key: none
STEP 2
Replace owner record keys in member record by synonym
PURCHASE
p-db-key, CNAME, SNAME, DATE, PRICE, QTY
STEP 3
Simplify
CUSTOMER
CNAME, ADDRESS, MAIN-PURCHASE, TOT-PURCH
KEY: CNAME
SALESMAN
SNAME, OFFICE, SALARY
KEY : SNAME
PURCHASE
CNAME, SNAME, DATE, PRICE, QTY
KEY: CNAME, SNAME
FOREIGN KEY : CNAME of CUSTOMER
SNAME of SALESMAN
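The three steps can be sketched in Python on a toy description of the Example 4.5 schema. This is an illustration only: the dictionary representation, the invented set names 'makes' and 'sells', and the simplification that every owner record type already has a candidate key of its own (so that synonyms need not be built transitively) are ours rather than part of the published method.

def derive_relational_schema(record_types, sets):
    """Steps 1-3 of the network-to-relational mapping, on a toy schema description.

    record_types: {name: {'items': [data items], 'ckey': [candidate key] or None}}
    sets:         {set_name: (owner_record_type, member_record_type)}
    """
    # STEP 1: first-fix relations - data items plus own and owner database keys
    relations = {}
    for name, rt in record_types.items():
        owners = [owner for owner, member in sets.values() if member == name]
        relations[name] = {
            'attrs': [name + '-db-key'] + [o + '-db-key' for o in owners] + list(rt['items']),
            'key': list(rt['ckey']) if rt['ckey'] else [],
            'owners': owners,          # one foreign-key constraint per set type
        }
    # STEP 2: top-down synonym substitution - replace each owner db-key by the
    # owner's candidate key, adding it to the member's key where needed
    changed = True
    while changed:
        changed = False
        for name, rel in relations.items():
            for owner in rel['owners']:
                db_key = owner + '-db-key'
                synonym = record_types[owner]['ckey']
                if db_key in rel['attrs'] and synonym:
                    rel['attrs'].remove(db_key)
                    rel['attrs'] += [a for a in synonym if a not in rel['attrs']]
                    if not record_types[name]['ckey']:
                        rel['key'] += [a for a in synonym if a not in rel['key']]
                    changed = True
    # STEP 3: project out the remaining database keys
    for rel in relations.values():
        rel['attrs'] = [a for a in rel['attrs'] if not a.endswith('-db-key')]
    return relations

# The record types of Example 4.5 (the set names 'makes' and 'sells' are invented)
schema = derive_relational_schema(
    {'CUSTOMER': {'items': ['CNAME', 'ADDRESS', 'MAIN-PURCHASE', 'TOT-PURCH'],
                  'ckey': ['CNAME']},
     'SALESMAN': {'items': ['SNAME', 'OFFICE', 'SALARY'], 'ckey': ['SNAME']},
     'PURCHASE': {'items': ['DATE', 'PRICE', 'QTY'], 'ckey': None}},
    {'makes': ('CUSTOMER', 'PURCHASE'), 'sells': ('SALESMAN', 'PURCHASE')})
# schema['PURCHASE'] now has attributes DATE, PRICE, QTY, CNAME, SNAME
# with key (CNAME, SNAME), as in the worked example.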
EXAMPLE 4.6
Suppose we have the relational schema obtained in Example 4.5 and we
have the following query addressed to it: Find the names of all those
salesmen who sold bolts as main-purchase items to customers in Belfast
in August. An SQL version of this query is
EXAMPLE 4.7
Consider the query given in the previous example. Using the 'brute-force'
nested loop method and ignoring complications such as duplicates, we get
a simplified network DML program as follows.
SUMMARY
This chapter is in two parts. In the first part we have studied the question
of how to allocate data to the nodes of a computer network in a manner
which makes subsequent retrievals for the expected transactions as efficient
as possible. The case for careful consideration of the basis of distribution
of the data is made, and both qualitative and quantitative approaches to this
procedure are presented.
EXERCISES
4.1 Find an arrangement of the tuples of relations R1 and R2 of the example in
Section 4.3.2 for which just one page access is required to answer the original
query. Discuss the effect of page to site allocations for this placement.
4.2 Work through Example 4.3 for the case where each site can hold 30 Mbytes,
and file 7 has size 3 Mbytes instead of 4 Mbytes.
4.3 Determine an efficient allocation of six files to four sites, each of which is
capable of holding 50 Mbytes. The access profiles of six transactions are as
follows:
File (size)
1 2 3 4 5 6
Transactions (30) (20) (35) (10) (30) (25)
1 30 30
2 15 15 100
3 40 20 20
4 50 20
5 10 10 10
6 10 10 30
Sites
Transactions 1 2 3 4
1 50 100
2 20 10
3 20 10
4 20 20 20
5 20 50 10
6 50 10
4.5 Discuss informally the interrelationship between query optimization (see Chapter
5) and file allocation. Illustrate your answer by reference to the design of roads
in an urban development, and their subsequent use.
4.6 Explain the desirable features of a global schema for a heterogeneous database
system.
4.7 Name some distributed database systems which use the relational data model for
their global schemas. Suggest reasons for this choice, and discuss the difficulties
it could raise.
4.8 Discuss the pros and cons of converting local databases to make them homo-
geneous, as an alternative to using the integrating methods presented in this
chapter.
4.10 Integration can require conversions of units and various other changes between
local schemas and a global schema in addition to data model homogenization.
Give examples of three such conversions, and describe the 'integration rules'
which would be required in the system for this purpose. Where in the system
architecture would you suggest that this particular conversion functionality should
be provided?
Bibliography
Bell D.A. (1984). Difficult data placement problems. Computer J., 27(4),
315-320
This paper establishes the fact that several interesting and practical data
allocation problems are NP-complete.
Ceri S., Martella G. and Pelagatti G. (1980). Optimal file allocation for a
distributed database on a network of minicomputers. In Proc. ICOD 1 Conf.
Aberdeen, Scotland
A different version of the method described in Section 4.5, extended to the
case where redundancy is permitted in the allocation, can be found here.
Example 4.3 is based on figures given in an example in this paper.
Chang C.C. and Shielke. (1985). On the complexity of the file allocation
problem. Conf. on Foundations of Data Organisation, Kyoto, Japan
The file allocation problem (see Section 4.5) is established to be NP-hard by
reducing it to a well-known graph colouring problem.
Chen H. and Kuck S.M. (1984). Combining relational and network retrieval
methods. Proc. SIGMOD Conf., Boston, USA
These authors have made a number of contributions to different aspects of
the conversion of data models, of which this paper is an example. It shows
how to get efficient DML programs from relational queries.
Daellenbach H.G., Genge J.A. and McNickle D.C. (1983). Introduction to
Operational Research Techniques. Massachusetts: Allyn and Bacon, Inc.
A treatment of the knapsack problem.
Dayal U. and Goodman N. (1982). Optimisation methods for CODASYL
database systems. Proc. ACM SIGMOD Conf., Orlando, USA
Fang M.T., Lee R.C.T. and Chang C.C. (1986). The idea of declustering and
its applications. Proc. 12th Int. Conf. on VLDB, Kyoto, Japan
Declustering is the inverse procedure to clustering. It can be appreciated by
thinking of the task of selecting two basketball teams for a match from a
group of players, half of whom are over 7 feet tall and the rest are under 5
feet tall. If we want the match to be evenly balanced we choose dissimilar
players, heightwise, to be on the same team. Similar requirements are met in
some computing environments, for example when parallelism is to be
exploited.
Kalinichenko L.A. (1987). Reusable database programming independent on
DBMS. Proc. 10th ISDMS Conf., Cedzyna, Poland
The UCDM, described briefly in Section 4.7, is developed in this paper.
Various principles for conversion between data models are given, and a
series of axiomatic extensions to allow well-known data models to be
covered are given.
Katz R.H. and Wong E. (1982). Decompiling CODASYL DML into relational
queries. ACM Trans. on Database Systems, 7(1), 1-23
Rosenthal A. and Reiner D. (1982). An architecture for query optimisation.
Proc. ACM SIGMOD Conf. on Management of Data, Orlando, USA
Shipman D. (1981). The functional data model and the data language
DAPLEX. ACM Trans. on Database Systems 6(1), 140-173
The DAPLEX language, which underlies the Multibase system, is presented
here.
Smith J.M. and Smith D.C.P. (1977). Database abstractions: aggregations and
generalisations. ACM Trans. on Database Systems, 2(2), 106-133
This is something of a classic as a treatment of the issues of data
abstractions (see Section 4.6).
Taylor R.W., Fry J.P. and Shneiderman B. (1979). Database program
conversion - a framework for research, Proc. 5th Int. Conf. on VLDB, Rio de
Janeiro, Brazil
The problem of finding out how well we can mechanically produce, from a
given program and database, an equivalent program to interact with a
converted database, is formulated and addressed in this paper. The precise
definition given in this chapter is taken from this paper.
Wong E. and Katz R.H. (1983). Distributing a database for parallelism. Proc.
ACM SIGMOD Conf., San Jose, California
This paper describes the semantic approach to optimizing the placement and
allocation of data. There is no need for the development of a quantitative
cost model in this case. The approach described in Section 4.4 of this
chapter is developed here.
Zaniolo C. (1979). Design of relational views over network schemas. Proc.
ACM SIGMOD Conf., Boston, Massachusetts
If the RDM is to be safely employed as a global data model for DDBs, as it
frequently is (in contrast to systems using functional data models, such as
SOLUTIONS TO EXERCISES
4.2
Files 1, 2, 5 and 8 are allocated to site 1 in the first iteration
Files 3, 4 and 7 are allocated to site 3 in the second iteration
File 6 is allocated to site 4 in the third iteration.
4.3
The Vij table is as follows:
File (size)
1 2 3 4 5 6
Sites (30) (20) (35) (10) (30) (25)
In this case the J(i) entries are underlined. This would give two overloaded sites
on the first iteration of the algorithm, namely sites 3 and 4.
Solving the knapsack problem for site 4 first (because it has the greater 'allocated value'), we store files 1 and 2 at site 4. Similarly we store files 4 and 5 at site 3. On the
second iteration we still have two files to assign to either node 1 or node 2. Can you
think of a better approach?
Distributed Query
Optimization
5.1 Introduction
In databases in general, a major aim is to hide the structural details of
the data from the user as much as possible. In distributed databases one
of the main goals is that the distribution details should also be hidden, so
that invoking the capabilities and functionality of the system are as easy
and effective as possible.
The relational data model, which we focus upon here, can be used
to provide a data-independent interface to a centralized database. The
high level of this interface is due to its non-procedurality. An early
example of a relational data language was the Relational Calculus, which is considered by database experts to be a non-procedural language. Using it, the data requested in the query is simply described and no procedure
has to be specified for its extraction. On the other hand most people
would say that Relational Algebra is a procedural language because when
using it a method is specified for the retrieval of the requested data.
The Relational Calculus is a somewhat mathematically-oriented
language. So, in order not to restrict the usability and applicability of the
concept of non-procedurality, languages of equivalent power to that of
the Relational Calculus, but avoiding its technical complexity and
The task of the query optimizer is to govern and expedite the processing
and data transmission required for responding to queries. We discuss the
issues of distributed query processing using a rather traditional and well-
tried pedagogical approach.
We clarify these concepts in Section 5.2, and in Section 5.3 we look
at methods of ordering relational operations that can be tried for giving
the correct result more efficiently than an initial specification. Section 5.4
deals briefly with the issue of selecting query plans in centralized systems
and in Section 5.5 we study the important issues of choosing a method
for a JOIN execution, looking in detail at SDD-1's SEMI-JOIN method
and also at the non-SEMI-JOIN method used in R*. The chapter ends
with a summary of some issues currently engaging the minds of database
researchers in this area.
* CPU
* I/O channels
* Telecommunications lines.
can be found in the references at the end of this chapter. The operators
can then be implemented using a variety of low-level algorithms and
access devices, such as indices, pointers and methods of executing
relational operators. This is called query mapping. Query mapping there-
fore corresponds to choices 2 and 3 in Section 5.1 above; query transform-
ation corresponds to choice 1 (but it has an impact on choice 4 also, in
distributed systems).
In distributed systems message transmission on data communications lines is required to control the query evaluation operations and the order in which they are executed; data transmission is needed to pass results and partial results between sites. An optimizer must choose these and this corresponds to choice 4 in Section 5.1. Communications line transmission rates are usually assumed to be some factors of ten (orders of magnitude) slower than I/O channels and so figure prominently for heavy traffic, but the orders of magnitude vary somewhat. For example, we may use a high-speed network (nominal speed 24 Mbits s⁻¹, actual speed 4 Mbits s⁻¹) for some communications and a medium-speed network (nominally 56 kbits s⁻¹, actual 40 kbits s⁻¹) for others. These speeds can be compared with corresponding disk costs of 23.48 ms for 4 kbytes (i.e. 1.3 Mbits s⁻¹). These figures come from a study carried out in 1985 by Mackert and Lohmann, so the factors should be treated with care in current environments.
We now illustrate the influence of the choices of ordering of oper-
ations with reference to a simple distributed database over 3 nodes.
This shows how the CPU, I/O and data transmission costs are affected
by the strategy chosen.
EXAMPLE 5.1
We use a simple JOIN and SELECT query over three relations at three
different sites to illustrate the effect of changing the order of execution
of the relational operations.
Site              Relation
Hospital          HOSPITALIZATION
Community care    SURVEY
Health centre     PATIENT
Assume that all the communications links have the same cost/tuple and
that size=cardinality. Assume also that 0.1% of the patients in HOSPI-
TALIZATION satisfy the criterion 'orthopaedic department since 1 Janu-
ary' and that this figure is reduced by 50% by the effects of the other
operations, and that 1% of the patients weigh over 100 kg.
Option 1
Move PATIENT relation to community care site, JOIN it with the SUR-
VEY relation there, and move the result to the query site for JOINing
with the other relation, shipped from the hospital. See Figure 5.1 (a).
Costs (let the cost of transmitting a tuple be t and the overall cost of comparing and possibly concatenating two tuples be c):

JOINs:     10 000 × 1000c for PATIENT and restricted SURVEY (we assume here that this gives a 1000-tuple result)
           + 200 × 1000c for JOINing this result with the restricted HOSPITALIZATION

Transmit:  200t + 10 000t + 1000t
Option 2
Send the restricted HOSPITALIZATION relation to the community care
site. Join it with restricted SURVEY there. Join the result with PATIENT
at the health centre, and ship the result to the query site. See Figure 5.1
(b).
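The comparison can be made concrete with a few lines of arithmetic. The figures for Option 1 are those used above; for Option 2 the worked figures are not reproduced here, so the sketch assumes that the JOIN of the restricted HOSPITALIZATION and restricted SURVEY relations at the community care site yields about 100 tuples.

# Cost comparison for Example 5.1 (the Option 2 intermediate size is an assumption)
t, c = 1.0, 0.01                       # illustrative relative unit costs

# Option 1: ship PATIENT (10 000 tuples) to the community care site, JOIN with
# the restricted SURVEY (1000 tuples), then JOIN with restricted HOSPITALIZATION
opt1_cpu = 10_000 * 1_000 * c + 200 * 1_000 * c
opt1_net = (10_000 + 1_000 + 200) * t

# Option 2: ship restricted HOSPITALIZATION (200 tuples) instead, JOIN it with
# the restricted SURVEY (assume ~100 tuples survive), then JOIN that with PATIENT
opt2_cpu = 200 * 1_000 * c + 100 * 10_000 * c
opt2_net = (200 + 100 + 100) * t

print(opt1_cpu, opt1_net)              # 102000.0  11200.0
print(opt2_cpu, opt2_net)              # 12000.0   400.0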
The latter strategy incurs less CPU cost and involves less traffic, but there is less parallelism of execution than with the other strategy. So the ordering of operations can make a significant impact on
strategy. So the ordering of operations can make a significant impact on
both costs and responsiveness even for this simple example. This example
is based on one developed by Hevner. A more dramatic illustration of
the effect of the choice of strategy is given in the references. Using six
different strategies, response times varying from seconds to days were
obtained by Rothnie and Goodman!
The complexity of a query optimizer to handle even a simple query
such as this can be appreciated as the example is studied. In practice the
situation is often very much more complex than this. Consider, for exam-
ple, the case where the patients are spread over several sites, and so the
SURVEY relation is horizontally partitioned over the network. There are
Figure 5.1 (a) A JOIN order for Example 5.1. (b) An alternative JOIN order for Example 5.1.
A few examples are given later in Figure 5.4. Ceri and Pelagatti
present a comprehensive list of equivalence transforms in their book which
is referenced below.
The rule of thumb used in centralized optimization that 'PROJEC-
TIONS and SELECTIONS should be performed early, and JOINS (and
other binary operations) late' applies to an even greater degree to distrib-
uted systems because large join sizes are undesirable for transfers between
distributed sites.
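A minimal sketch of this heuristic, applied to a toy operator tree, is given below. The node representation, the SCHEMA table and the attribute sets attached to each predicate are our own simplifications; only one of the equivalence rules (moving a SELECT below a JOIN when its predicate mentions a single input) is shown.

from dataclasses import dataclass, field

@dataclass
class Node:
    op: str                       # 'relation', 'select', 'project' or 'join'
    arg: object = None            # relation name, predicate or attribute list
    children: list = field(default_factory=list)

def attributes_of(node):
    """Attributes produced by a subtree (assumed known from the schema)."""
    if node.op == 'relation':
        return SCHEMA[node.arg]
    if node.op == 'project':
        return set(node.arg)
    return set().union(*(attributes_of(c) for c in node.children))

def push_selects_down(node):
    """Move a SELECT below a JOIN when its predicate mentions only
    attributes of one JOIN input - one of the rules of Figure 5.4."""
    node.children = [push_selects_down(c) for c in node.children]
    if node.op == 'select' and node.children[0].op == 'join':
        join = node.children[0]
        pred, attrs = node.arg
        for i, branch in enumerate(join.children):
            if attrs <= attributes_of(branch):   # predicate local to one side
                join.children[i] = Node('select', node.arg, [branch])
                return join                      # SELECT has been pushed down
    return node

# Toy schema and query tree for the Example 5.1 relations (names assumed)
SCHEMA = {'HOSPITALIZATION': {'pat-name', 'dept', 'admit'},
          'SURVEY': {'pat-name', 'weight', 'job'}}
tree = Node('select', ('dept = orthopaedics', {'dept'}),
            [Node('join', {'pat-name'},
                  [Node('relation', 'HOSPITALIZATION'),
                   Node('relation', 'SURVEY')])])
optimized = push_selects_down(tree)   # the SELECT now sits under the JOIN

The symmetrical rule for PROJECTs and the cascading of SELECTs of Figure 5.4(c) can be added in the same style.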
EXAMPLE 5.2
The second version of the query of Example 5.1 can be represented as
an operator graph as in Figure 5.3.
[Figure 5.3: the operator graph for the query, over the HOSPITALIZATION, SURVEY and PATIENT relations, with SELECTs on Dept and Admit and PROJECTs on Pat-name, DOB and GP near the leaves; the diagrams of Figure 5.4 (a)-(c) follow]
Figure 5.4 Some equivalence rules. (a) SELECT after and before JOIN - same result. (b) PROJECT after and before JOIN - same result. (c) Cascade of SELECTs.
with poor execution times and so have attracted the attention of query
subsystem designers.
'Naive' query evaluation involves exhaustive evaluation and com-
parison of the query blocks, and the top level determines the sequence
of evaluation of the blocks. Evaluation is therefore clearly dependent on
the user's block structure.
A number of efforts have been made to avoid this exhaustive
enumeration approach. If we assume just two levels of nesting and call
the nested SELECT . .. FROM ... WHERE blocks outer and inner
respectively with the obvious meaning, five basic types of nesting can be
identified.
Four of these types are distinguished by whether or not the inner
block has a JOIN predicate in its WHERE clause; and whether or not its
SELECT clause contains an aggregate function such as a MAX, AVG or
SUM operation. The remaining type has a JOIN predicate and a DIVIDE
predicate.
Some of the types can still be treated by a conceptually simple
nested iteration method, where the inner block is processed for each outer
block tuple, but this is costly and algorithms are available to transform
nested queries of four of the above types into standardized non-nested
JOIN predicates and the fifth into a simple query that can be handled
easily by System R. These cases and their treatments were identified and
discussed by Kim. However, care is needed when applying these results,
as has been pointed out by Kiessling.
A further, intuitive, approach to reducing the number of plans
generated is common expression analysis. Ad hoc query evaluation can
be greatly speeded up by making use of stored final or partial results from
previous queries which have query sub-graph paths or individual nodes
in common with the current query. Analysis shows that the processing
costs are not significantly increased if none of these common expressions
can be found. This sharing of processing is even more important if
transaction 'scenarios' (sets of transactions which hit the system
simultaneously) are identified, in which case multiple queries are optim-
ized. We return to this subject in Section 5.6.
Cost models for selection of plans are usually based on the number
of secondary storage accesses, buffer sizes and operand sizes. The avail-
able physical access structures such as indexes affect the cost of evaluating
queries. These parameters are known for the database relations and
indices and can be derived for the intermediate results. There is no general
consensus on the method of estimating the sizes of intermediate results. In
cases where only given database information is used during the estimation,
rather a lot of assumptions are made. Typical assumptions made are:
This formula can clearly be replaced (see Figure 5.4 (b)) by the equivalent
form
Procedure SEMI-JOIN
Begin
   Step 1. Project the JOIN attributes from R2 at B (= πR2), after applying any required SELECTIONs.
   Step 2. Transmit πR2 to A.
   Step 3. Compute the SEMI-JOIN of R1 at A.
   Step 4. Move the result to C.
end
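A minimal sketch of the procedure, with the sites simulated by in-memory lists of tuples, is shown below. The function and parameter names are ours, the tuple count is used as a crude stand-in for transmission cost, and the final JOIN with the restricted R2 is included for completeness even though the procedure above stops at shipping the reduced R1.

def semi_join_strategy(r1_at_A, r2_at_B, join_attrs, select_r2=lambda t: True):
    """Evaluate R1 JOIN R2 using the SEMI-JOIN tactic of the procedure above.

    r1_at_A, r2_at_B: lists of dict tuples held at sites A and B
    join_attrs:       the common JOIN attributes
    select_r2:        any SELECTIONs to be applied to R2 at B first
    Returns the joined tuples and the number of 'transmitted' tuples.
    """
    # Step 1: at B, restrict R2 and project out just the JOIN attributes
    r2_selected = [t for t in r2_at_B if select_r2(t)]
    projected = {tuple(t[a] for a in join_attrs) for t in r2_selected}
    # Step 2: transmit the (small) projection to A
    shipped = len(projected)
    # Step 3: SEMI-JOIN at A - keep only R1 tuples that will find a partner
    r1_reduced = [t for t in r1_at_A
                  if tuple(t[a] for a in join_attrs) in projected]
    # Step 4: move the reduced R1 onward and complete the JOIN there
    result = [{**t1, **t2} for t1 in r1_reduced for t2 in r2_selected
              if all(t1[a] == t2[a] for a in join_attrs)]
    shipped += len(r1_reduced) + len(result)
    return result, shipped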
EXAMPLE 5.3
Suppose that a researcher in the orthopaedics department of a hospital
suspects that the combination of the job a patient does and his/her weight
is in some way related to the patient's likelihood of having a particular
bone ailment. To investigate this the researcher wants to examine the
type of work of all the people treated in orthopaedic wards since 1 January
who are overweight. Accesses are made therefore to the two relations
HOSPITALIZATION and SURVEY of Example 5.1. We ignore
SELECT and PROJECT CPU costs.
A query is submitted to the DDB as follows:
The result is required at the hospital site (where the researcher works).
To evaluate this query using SEMI-JOINs, the following steps are
taken.
Step 1 (a) Restrict HOSPITALIZATION (dept = orthopaedics, admit
> 1 Jan)
(b) PROJECT pat-name from restricted HOSPITALIZATION
This gives a relation R of 200 tuples (each being of 'breadth'
1 attribute)
i.e. Card(R) = 200
Step 2 Transmit R to community care site
Transmission cost = 200 transmission cost/attribute (ta)
So total cost is
500 ta + 220000 c, + 200 c, (5.2)
(1) Benefit of PROJECTing relation R onto a set of attributes X (i.e. removing
the unwanted attributes)
= [B(R) - B(X)] x C(R)
(a) It should be noted that we have taken some liberties with the
estimation formulae given here. There are some variations
from the SDD-1 formulae, but we feel that the derivations of
our formulae are easier to understand.
(b) This first formula ignores the possibility of having duplicated
tuples after removing unwanted attributes;
(2) Benefit of restricting R to tuples with R.A = c (a constant)
= B(R) x C(R) x (1 - 1/C(RA))
(3) Benefit of SEMI-JOINing R and S over attribute A
= MAX{0, B(R) x C(R) x (1 - C(RA)/C(SA))}
(Assume R is the relation contributing attributes to the SEMI-JOIN.)
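As a rough illustration of how such benefit estimates might be computed, the sketch below codes formulae (2) and (3) directly, with B(R) standing for tuple breadth, C(R) for cardinality, and C(RA) and C(SA) for the numbers of distinct values of attribute A. The function names and the example figures are ours, chosen only for illustration.

# Hedged sketch of the benefit estimates discussed above.
def benefit_restrict(b_r, c_r, c_ra):
    # (2) Benefit of restricting R to tuples with R.A = c (a constant).
    return b_r * c_r * (1 - 1 / c_ra)

def benefit_semi_join(b_r, c_r, c_ra, c_sa):
    # (3) Benefit of SEMI-JOINing R and S over attribute A
    #     (R is the relation contributing attributes to the SEMI-JOIN).
    return max(0, b_r * c_r * (1 - c_ra / c_sa))

# Purely illustrative numbers.
print(benefit_restrict(b_r=30, c_r=1000, c_ra=100))
print(benefit_semi_join(b_r=30, c_r=1000, c_ra=10, c_sa=100))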
EXAMPLE 5.4
Suppose a relation DOC has three attributes:
"* analyse the SQL query and perform object name resolution;
"* look up the catalog to check authorization (privilege) and view
integration;
"* plan the minimum cost access strategy (a string of relation accesses,
PROJECTIONs, SELECTIONs, data communications transfers,
SORTs and JOINs) for this global query;
"* generate the access modules and store them at a master site.
where the three weights Cp, Ccpu and Cm represent relative importance
values distinguishing between I/O, CPU and messages respectively. More
detailed formulae are presented at the end of this section. The ratios vary
with the system of course.
The criteria used for the partitioning of any relation are taken into
account so that the costs of reconstituting it are kept low. The JOIN order
and presentation order are considered in addition to the I/O, CPU and
message factors above. This makes the fan-out of the search tree exceed-
ingly large. Pruning heuristics for the centralized DBMS System R are
used to reduce the space to be searched for an optimal plan. These are
illustrated in Example 5.5, in which the published System R operation is
'interpreted' for clarity. The basic ideas behind the R* optimization
method for non-nested queries are also illustrated in Example 5.5. This
example serves to demonstrate the complexity of query optimization.
To find the optimal execution for a JOIN-based query like the one
in this example, System R builds a decision tree of the strategies for
executing the query. This tree has branches corresponding to operations
such as sorts or index accesses on the relations and each leaf of the tree
identifies a different way of executing the query expressed as the root of
the tree (see Figure 5.6 for an illustration).
Heuristic rules are used to constrain the size of the tree. Some of
the rules inspect the individual relations to discard unpromising strategies
early (i.e. even before consideration of the JOIN method to be used).
Exhaustive searching is then used to find the most efficient JOIN
sequence, methods and sites. A well-known operations research technique
called Dynamic Programming is used to speed up the search.
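The following Python sketch shows the flavour of that dynamic programming search: it builds up the cheapest left-deep JOIN order by combining best plans for ever-larger subsets of relations. The cost function is a deliberate toy (it simply multiplies cardinalities), and the relation names and cardinalities are invented; a real optimizer would plug in I/O, CPU and message estimates of the kind described above.

from itertools import combinations

# Toy dynamic-programming search for a left-deep JOIN order.
# cards maps relation name -> cardinality; the "cost" of joining another
# relation is just the product of cardinalities, a stand-in for a real model.
def best_join_order(cards):
    relations = list(cards)
    best = {frozenset([r]): (cards[r], [r]) for r in relations}
    for size in range(2, len(relations) + 1):
        for subset in map(frozenset, combinations(relations, size)):
            for r in subset:
                rest = subset - {r}
                if rest not in best:
                    continue
                sub_cost, sub_order = best[rest]
                cost = sub_cost * cards[r]      # toy cost of joining r last
                if subset not in best or cost < best[subset][0]:
                    best[subset] = (cost, sub_order + [r])
    return best[frozenset(relations)]

# Invented cardinalities, for illustration only.
print(best_join_order({"HOSPITALIZATION": 200, "PATIENT": 1000, "SURVEY": 500}))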
EXAMPLE 5.5
Consider a query on the relations of Example 5.1 which requires the 3
relations to be JOINed, ignoring any costs due to data transmission.
To illustrate the optimization technique used by System R, we
define some indexes on the given relations. An index is said to be clustered
if tuples with equal values of the indexed attribute are stored consecutively
on the secondary storage devices. Unclustered indexes may require a
particular page to be accessed many times for a single query, and so
clustering has a potentially significant impact on efficiency. However, in
order to highlight the essentials of the method, clustering is not considered
in the following discussion.
There are indexes to HOSPITALIZATION on pat-name, admit
and discharge. This means that there are four possible ways of accessing
HOSPITALIZATION: by each of these indexes and by a serial scan.
Relation PATIENT also has an index on pat-name, so there are
two possible ways of accessing it.
Relation SURVEY has no index on pat-name but has an index on
type-of-work. System R's optimizer excludes all 'non-relevant' indexes
(those not affecting the given query's efficiency) from consideration at
this stage.
Given this information, we can limit the size of the query tree even
before considering the JOIN orders. Look first at the HOSPITALIZ-
ATION relation and consider JOINing it with PATIENT. Assume that
accessing a relation on any of the indexes presents the tuples in the
corresponding order. We can use the admit index to restrict the HOSPI-
TALIZATION relation before JOINing. However, the orders on discharge
and admit are not interesting orders (i.e. they are not used in a JOIN or
specified in ORDER BY or GROUP BY clauses). A simple heuristic is
to prune such 'uninteresting order' paths from further consideration for
the JOIN tree. So we can discard the admit and discharge index accesses
from the tree here.
Now assume that it is not possible to discard the serial scan method
at this site. Note, however, that we discard it later, for illustration, using
another heuristic, 'prune paths whose costs of execution are too great for
some clear reason'.
Now look at the alternatives for executing these JOINs. They
are: to use merge-scan, or to use one of the two nested-loop methods
(HOSPITALIZATION inner and PATIENT inner respectively). Figure
5.6 shows the pruned query tree for this JOIN. For the merge scan method
we assume the pat-name index on HOSPITALIZATION is the only
contender because it is cheap and gives the correct JOIN order. One of
the two possible methods of accessing PATIENT still remains, as shown.
Figure 5.6 shows nine possible routes, even having discarded the
admit index and the discharge index. Suppose we prune the 'scan' branches
because they take a much greater access time than the indexes, which are
assumed here to be very selective. We are left with the following 'leaf'
branches from above (all in pat-name order): {4, 6, 9}.
Consider now the second part of the evaluation path being con-
sidered in the JOIN order for this example. We need to find the effect
of each of the access paths on the JOIN of the sub-result with SURVEY.
In SURVEY (S) there is no index on pat-name and the index on type-
of-work is not relevant in the sense above (although it could be used for
restricting S). Probably a nested-loops JOIN would be chosen because
the time to sort for the merge-scan would probably exclude it from
consideration.
So in this case the cheapest of the three routes found earlier, {4,
6, 9}, would be chosen and the time or cost would be increased by that
for the second part, in order to find the optimum strategy for the JOIN
order being considered ((H JOIN P) JOIN S). This would then be compared
with the other two possible JOIN orders to get the 'global' minimum.
Total cost
TC = (P x Cp) + (R x Ccpu) + Mc
where P is the number of pages fetched
R is the number of tuples fetched
Mc is the cost of messages (= Cpm + Cmb x L)
where L is the number of bytes transmitted
The C-factors are weights. Cp and Ccpu give the I/O and CPU weights
respectively, as above. Cpm is the cost per message and Cmb is the
cost per message byte.
Shipping cost
This gives the message cost for single relation access and is a component
of JOIN costs.
JOIN costs
Here we present the R* formula for the nested-loop (NL) method for
single-site JOINs, which is a component of the total cost.
Jss(R*) = Ro x Rio x Ccpu + Pio x Cp + Oss
where Ro is the cardinality of the outer relation
Rio is the average number of matching tuples in the inner
relation
Pio is the average number of inner relation pages fetched
for each outer relation tuple
Oss is the single-site cost of accessing the outer table
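Both formulae are simple weighted sums and can be written down directly. A minimal sketch follows; the parameter names mirror the text, but the functions themselves, and the example values passed to them, are our own illustration rather than the R* code.

# Weighted-sum cost formulae in the style of the text.
def message_cost(n_bytes, cpm, cmb):
    # Mc = Cpm + Cmb x L, as given above.
    return cpm + cmb * n_bytes

def total_cost(pages, tuples, n_bytes, cp, ccpu, cpm, cmb):
    return pages * cp + tuples * ccpu + message_cost(n_bytes, cpm, cmb)

def nl_join_single_site(ro, rio, pio, oss, cp, ccpu):
    # Nested-loop JOIN cost at a single site, following Jss(R*) above.
    return ro * rio * ccpu + pio * cp + oss

# Illustrative weights and counts only.
print(total_cost(pages=120, tuples=900, n_bytes=4000,
                 cp=1.0, ccpu=0.01, cpm=5.0, cmb=0.002))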
"* Ship all data from the two sites to the requestor site for the
intersection to be performed;
"* Exchange identifiers of customers between sites, and ship only the
intersection data to the requestor site;
"* Ship identifiers from one site (A) to the other (B), and at the same
time ship all A's customer account data to the requestor site, to be
intersected later with the relevant data from B (in the light of the
identifiers received from A).
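A rough way to compare the three strategies is simply to tally the volume of data each one ships. The sketch below does this for assumed relative sizes; the identifier size, record size and customer counts are invented for illustration and carry no significance beyond that.

# Back-of-envelope comparison of the three shipping strategies above.
# All sizes are illustrative assumptions, in bytes.
ID_SIZE, REC_SIZE = 8, 200          # customer identifier vs full account record
N_A, N_B = 10_000, 8_000            # customers held at sites A and B
N_COMMON = 2_000                    # customers present at both sites

ship_everything = (N_A + N_B) * REC_SIZE
exchange_ids_then_ship = (N_A + N_B) * ID_SIZE + 2 * N_COMMON * REC_SIZE
ship_ids_and_a_data = N_A * ID_SIZE + N_A * REC_SIZE + N_COMMON * REC_SIZE

for name, cost in [("ship all", ship_everything),
                   ("exchange ids", exchange_ids_then_ship),
                   ("ids + A's data", ship_ids_and_a_data)]:
    print(f"{name:15s} {cost:>12,d} bytes")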
SUMMARY
In this chapter we have explored the techniques that are available for
obtaining executions of queries which are as fast or as cheap as possible.
We saw that changing the order of execution of relational
operations such as JOINs has a profound effect on the efficiency of
evaluation of queries. Moreover we saw that the methods chosen to
access data and carry out the operations, notably the algorithms for
performing JOINs, have an important impact on efficiency. Both of these
factors are of even greater importance when the database is spread over
distributed computing facilities.
An interesting feature of the chapter is the survey of some of the
open questions facing researchers and developers concerned with query
optimization. The variety and complexity of these issues demonstrate
that we have really only introduced what is a subject of great
importance in present and future database systems.
EXERCISES
5.1 Derive an estimation formula for the size of the result of the JOIN of two
relations of cardinalities c1 and c2 respectively, making the assumption that
the values of the JOIN attributes are uniformly distributed. Assume that the
relations have k1 and k2 distinct values of the JOIN attribute respectively, and
that r of these keys are common to both relations.
5.2 (a) Devise some relational extensions which demonstrate formulae a(i) and
b(ii) of Section 5.3.
(b) Illustrate that SELECT is distributive over JOIN provided that the
SELECT condition over the JOIN is the intersection of the SELECT
conditions over the individual relations.
5.3 Suppose that, for a given query, data from two relations EMP (10³ tuples)
and PROJ (10² tuples) at sites about 100 miles apart, must be concatenated
(using a JOIN operation), and that there must be some way of screening out
unwanted tuples. EMP shows employees' details and there is a tuple for each
project they work on. PROJ gives details of the projects.
EMP {employee-no, employee-name, spouse, salary, project-no, . . .}
at headquarters
PROJ {project-no, type, leader, phone-no, status, location, . . .}
at branch. See Table 5.1.
Suppose that the following query, expressed in SQL, is addressed to the
'database' consisting of these two relations
That is, 'get the husbands or wives and project leaders of people earning
more than £20000'.
The common attribute 'project-no' is the JOIN column here. Suppose
that the join-factor for this concatenation is ten (i.e. there is an average 'fan-
out' of ten projects to each employee) and that 10% of the employees are
paid more than 20 000. Show two alternative ways of evaluating the query,
based on two alternative ways of transforming it (considering only the
ordering of the relational operations and ignoring the final projection needed
to get the two target attributes). This treatment therefore ignores the details
of implementation which would of course add to the choices available.
5.4 Let A and B be two sites, headquarters and branch respectively, holding
relations R1 and R2 respectively.
R1 is EMP as defined in Exercise 5.3
R2 is PROJ
A (Headquarters)
EMP (employee-no, employee-name, spouse, salary, project-no)
B (Branch)
PROJ (project-no, type, leader, phone-no, location, status)
C (Main)
JOIN site
The query is the JOIN over project-no of
pi spouse, project-no (sigma salary > 20 000 (EMP)) at A
and
pi project-no, leader (sigma type = government (PROJ)) at B
5.9 Discuss the question of efficiency of the query optimization process itself.
Clearly if this process takes too long it defeats the whole purpose of the
exercise!
5.10 Discuss the issue of static versus dynamic query optimization. Under what
circumstances do you think dynamic optimizers would have the edge?
Bibliography
Beeri C., Fagin R., Maier D., Mendelzon A., Ullman J.D. and Yannakakis
M. (1981). Properties of acyclic database schemes. In Proc. 13th ACM
Symposium on Theory of Computing, Milwaukee, WI.
A particular problem dealt with here is that of converting cyclic queries to
tree queries.
Bell D.A., Grimson J.B. and Ling D.H.O. (1986). Query optimisation for
Multi-Star. EEC Report 773 D7.1, University of Ulster, N. Ireland.
Bell D.A., Ling D.H.O. and McClean S. (1989). Pragmatic estimation of join
sizes and attribute correlations. In Proc. 5th IEEE Data Engineering Conf. Los
Angeles.
This paper suggests a simple ten-parameter universal characterization of any
distribution at any level of the execution hierarchy (see Section 5.6) which
promises to give much superior estimates to those using the uniformity
assumption.
Bernstein P.A. and Chiu D.M. (1981). Using semi-joins to solve relational
queries. J. ACM, 28(1), 25-40.
A treatment of SEMI-JOINs, as used on the SDD-prototype, is given.
Object graphs are introduced.
Bernstein P. and Goodman N. (1981). The power of natural semi-joins, SIAM
J. Computing, 10(4), 751-71.
Bernstein P., Goodman N., Wong E., Reeve C.L. and Rothnie J.B. (1981).
Query processing for a system of distributed databases. SDD-1. ACM TODS,
6(4), 602-25.
This paper describes the CCA prototype DDB system called SDD-1. The
paper gives a useful introduction to SEMI-JOINs.
Bloom B.H. (1970). Space/time trade-offs in hash coding with allowable errors.
Comm. ACM, 13(7), 422-26.
Bodorik P., Pyra J. and Riordon J.S. (1990). Correcting execution of distributed
queries. In Proc. 2nd IEEE Int. Symposium on Databases in Parallel and
Distributed Systems, Dublin, Ireland.
Cai F-F., Hull M.E.C. and Bell D.A. (1989). Design of a predictive buffering
scheme for an experimental parallel database system. In Computing and Infor-
mation (Janicki R. and Koczkodaj W.W., eds). pp. 291-99. Amsterdam: Elsevier.
Ceri S. (1984). Query optimization in relational database systems. In Infotech
State of the Art Report on "Database Performance", 12(4) (Bell D.A., ed.).
pp. 3-20. Oxford: Pergamon Infotech.
This paper gives a very clear overview of the issues of centralized and distributed
query optimization, using a pedagogical approach similar to the one taken in
this chapter.
Ceri S. and Pelagatti G. (1984). Distributed Databases: Principles and Systems.
New York: McGraw-Hill.
The authors give a set of seven rules which can be used to define the results of
applying relational algebra operations where qualifications may apply on local
relations (for example, site 1 only stores tuples with 'Supplier' = 'London',
whereas site 2 holds 'Supplier' = 'Edinburgh' tuples). There are also five criteria
which can be applied to simplify the execution of queries. These are often
simple, such as push SELECTIONs towards the leaves of the operator tree
(e.g. distribute a SELECTION over a UNION).
Chan A. and Naimir B. (1982). On estimating cost of accessing records in blocked
database organisations. Computer J., 25(3), 368-74.
Chan and Naimir present a formula for the expected number of pages
holding records pertinent to a query. This formula can be approximated
easily and accurately for fixed length records.
Christodoulakis S. (1983). Estimating block transfers and join sizes. In Proc.
ACM SIGMOD Conf., San Jose, California.
Christodoulakis shows how to obtain and use estimates of the number of pages
moved across the 'storage gap' in hierarchical stores for the estimation of the
sizes of JOINs (and SEMI-JOINs, where only one relation contributing to the
JOIN is presented in the result) when the distribution of records to pages is non-
uniform. Iterative formulae are developed for the calculation of the probability
distributions of pages containing a given number of records.
Christodoulakis S. (1984). Query optimisation in relational databases using
improved approximations of selectivities. In Pergamon-Infotech State of the Art
Report on "Database Performance". pp. 21-38. (Bell D.A., ed.).
Christodoulakis argues here that many common assumptions are unrealistic
and presents estimates of the number of pages to be transferred across the
storage gap for a query by generalizing some results for uniform distribution.
He also derives formulae for the calculation of the probability distributions
of pages containing a given number of records, which is perhaps less useful
in practical systems.
Chen M.-S. and Yu P.S. (1990). Using join operations as reducers in
distributed query processing. In Proc. 2nd IEEE Int. Symposium on Databases
in Parallel and Distributed Systems, Dublin, Ireland.
Cornell D.W. and Yu P.S. (1989). Integration of buffer management and
query optimisation in relational database environment. In Proc. 15th Conf. on
VLDB, Amsterdam, Holland.
Daniels D., Selinger P.G., Haas L. et al. (1982). An introduction to
distributed query compilation in R*. In Distributed Databases (Schneider H.-J.,
ed.), pp. 291-309. Amsterdam: North-Holland.
An excellent introduction to R*.
Epstein R. and Stonebraker M. (1980) Analysis of distributed database
processing strategies. In Proc. Sixth Int. Conf. on VLDB, Montreal, Canada.
Finkelstein S. (1982). Common expression analysis in database applications. In
Proc. ACM SIGMOD Conf., Orlando, USA.
Goodman N. and Shmueli O. (1982). The tree property is fundamental for
query processing (Extended Abstract). In Proc. ACM SIGMOD Conf., Orlando,
USA.
This paper includes some help on converting cyclic queries to tree queries.
Grant J. and Minker J. (1981). Optimisation in deductive and conventional
relational database systems. In Advances in Database Theory (Gallaire H.,
Minker J. and Nicolas J.M., eds.), pp. 195-234. New York: Plenum.
Hammer M. and Zdonik S.B. (1980). Knowledge-based query processing. In
Proc. Sixth Int. Conf. on VLDB.
Hevner A.R. (1982). Methods for data retrieval in distributed systems. In
Proc. Eighth Conf. on VLDB, Mexico City.
Jarke M. (1985). Common subexpression isolation in multiple query
optimisation. In Query Processing in Distributed Systems (Kim W., Reiner D.
and Batory D., eds.), pp. 191-205. Berlin: Springer-Verlag.
Jarke M. and Koch J. (1984). Query optimisation in database systems, ACM
Computing Surveys, 16(2), 111-52.
Transformation by ordering is called 'ameliorization' by Jarke and Koch.
They identify two additional methods of transformation. One is
'simplification', which largely corresponds to identifying and capitalizing upon
SOLUTIONS TO EXERCISES
5.1
Size is r x c1c2 / k1k2
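A brief justification, under the uniformity assumption stated in the exercise: each of the r common JOIN values matches c1/k1 tuples in the first relation and c2/k2 tuples in the second, so the JOIN contains
r x (c1/k1) x (c2/k2) = r c1c2 / k1k2
tuples.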
5.3
(a) (1) JOIN EMP AND PROJ at Branch site (i.e. after transmitting full EMP
from headquarters) producing a temporary relation R of 10³ tuples (note
that each employee is associated with a number of tuples: one/project)
(2) SELECT THOSE WITH SALARY > 20 000 giving a result of (10% of
EMP) = 100 tuples
(b) (1) SELECT FROM EMP TUPLES WITH SALARY > 20 000 at headquarters
producing a temporary relation S of 100 tuples (10% of EMP)
(2) JOIN S WITH PROJ, at branch site (i.e. after transmitting S from
headquarters) giving the same 100 tuples as in (a)
5.7
For EMP, as given:  B(R) = 30
                    C(R) = 10³
                    C(RA) = 100 (10 projects/employee, and no SELECT on salary)
                            (10 with SELECT on salary)
                    C(SA) = 10
So benefit = 27 000
For PROJ, as given: B(R) = 36
                    C(R) = 10²
                    C(RA) = 40 (40% are government)
                    C(SA) = 100
So benefit = 0
Estimation is an imperfect science. Heuristics for pruning query trees in general
have not yet been developed to the stage where everyone is happy with them.
6 Concurrency Control
6.1 Introduction
One of the most important characteristics of database management sys-
tems is that they should support multiuser access (i.e. several users 'simul-
taneously' reading and writing to the database). The problems associated
with the provision of concurrent access, in particular to those writing to
the database, are well known and have been the subject of extensive
research both for centralized and distributed databases.
In this chapter, we will first introduce the notion of a transaction as
the basic unit of work in a DBMS, and explain why control of concurrently
executing transactions is required, by looking at the types of problems
which can arise in the absence of any attempt at control. The module
responsible for concurrency control in a DBMS is known as a scheduler,
since its job is to schedule transactions correctly to avoid interference.
Communication between application programs and the scheduler is via
the transaction manager, which coordinates database operations on behalf
of applications. The theory of serializability as the most common means
of proving the correctness of these schedulers is then discussed. Having
therefore established the background necessary to an understanding of
6.2 Transactions
6.2.1 Basic transaction concepts
Of fundamental importance to concurrency control is the notion of a
transaction. A transaction is defined as a series of actions, carried out by
a single user/application program, which must be treated as an indivisible
unit. The transfer of funds from one account to another is an example of
a transaction; either the entire operation is carried out (for example, one
account is debited with an amount and the other account is credited with
the same amount), or none of the operation is carried out.
Transactions transform the database from one consistent state to
another consistent state, although consistency may be violated during
transaction execution. In the funds transfer example, the database is in
an inconsistent state during the period between the debiting of one account
and the crediting of the second account. If a failure were to occur during
this time, then the database could be inconsistent. It is the task of the
recovery manager (see Chapter 7) of the DBMS to ensure that all trans-
actions active at the time of failure are rolled back, or undone. The effect
of the rollback operation is to restore the database to the state it was in
prior to the start of the transaction and hence a consistent state. The four
basic, or so-called A.C.I.D., properties of a transaction are atomicity,
consistency, independence and durability.
Figure 6.2 (a) End-to-end transaction execution. (b) Concurrent execution of transactions.
begin transaction T1
read balance_x
balance_x = balance_x - 100
if balance_x < 0 then
begin
print 'insufficient funds'
abort T1
end
write balance_x
read balance_y
balance_y = balance_y + 100
write balance_y
commit T1
begin transaction TA
read balance_x
balance_x = balance_x - 100
if balance_x < 0 then
begin
print 'insufficient funds'
abort TA
end
end-if
write balance_x
commit TA
begin transaction TB
read balance_y
balance_y = balance_y + 100
write balance_y
commit TB
These problems are relevant to both the centralized and the distributed
case, although for simplicity, we will discuss them in terms of centralized
DBMSs.
(Figure: the interleaved execution of the two transactions, showing the value of balance_x in the database at each step.)
SCHEDULE
SURGEON
surgeon-name operation
Tom Tonsilectomy
Mary Tonsilectomy
Mary Appendectomy
Concurrency Control 171
T, T4
1 indicates switch of operations not possible because surgeon scheduled on 04.04.91 is not
qualified to perform new operation
2 indicates switch of surgeons not possible because new surgeon (Tom) is not qualified to
perform operation scheduled
SCHEDULE
SURGEON
surgeon-name operation
Tom Tonsilectomy
Mary Tonsilectomy
Mary Appendectomy
time
begin transaction T5
sum = 0
do while not end-of-relation
read balance_a
sum = sum + balance_a
end-do
commit T5
(The summary transaction T5 is shown interleaved with the transfer transaction T1, illustrating the incorrect summary problem.)
where Oi indicates either a read (R) or write (W) operation executed by
a transaction on a data item. Furthermore O1 precedes O2, which in turn
precedes O3, and so on. This is generally denoted
(Two schedules showing the interleaved execution of transactions T6 and T7: T6 reads, increments and writes data items x and y, and T7 reads, increments and writes y.)
(1) Each read operation reads the same values in both schedules; this
effectively means that those values must have been produced by
the same write operations in both schedules;
(2) The final database state is the same for both schedules; thus the final
write operation on each data item is the same in both schedules.
since in the first case T2 reads the post-T1 value of data item x, whereas
in the second it sees the pre-T1 value of x. In general, R1(x) and R2(x)
do not conflict, whereas R1(x) and W2(x) do conflict, as do W1(x) and
W2(x). In terms of schedule equivalence, it is the ordering of conflicting
operations which must be the same in both schedules. The conflict between
a read and a write operation is called a read-write conflict, and a conflict
between two write operations a write-write conflict.
Thus, globally, the two transactions are not serializable even though their
agents execute serially at each site. It is easy to envisage how such a
situation could arise. For example, if the global transactions T1 and T 2
were launched 'simultaneously' by different users at sites A and B, then
the schedulers operating independently at each site could schedule them
in this way. Hence, for distributed transactions, we require serializability
of all local schedules (both purely local and local agents of global
transactions) and global serializability for all global transactions. Effec-
tively, this means that all sub-transactions of global transactions appear
in the same order in the equivalent serial schedule at all sites, that is
if T_i^K < T_j^K
then T_i < T_j for all sites K at which T_i and T_j have agents.
T_1^K < T_2^K < T_3^K < . . . < T_n^K is known as the local ordering for site
K,
while
T_1 < T_2 < T_3 < T_4 < . . . < T_n is known as the global ordering for all
sites.
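One simple consequence of this requirement can be checked mechanically: every pair of conflicting transactions must be ordered the same way in all local schedules. The Python sketch below does exactly that and nothing more; it is a simplification of full serializability testing, and the schedule format and example data are our own.

# Sketch: check that conflicting transaction pairs are ordered consistently
# across all local schedules (a necessary condition, not a full test).
# A schedule is a list of (transaction, action, item) triples.
def pairwise_order(schedule):
    order = {}
    for i, (ti, ai, xi) in enumerate(schedule):
        for tj, aj, xj in schedule[i + 1:]:
            if ti != tj and xi == xj and "w" in (ai, aj):
                order.setdefault((ti, tj), True)    # ti precedes tj here
    return order

def globally_consistent(schedules):
    seen = {}
    for sched in schedules:
        for (ti, tj) in pairwise_order(sched):
            if (tj, ti) in seen:
                return False       # ordered one way at one site, the other way elsewhere
            seen[(ti, tj)] = True
    return True

s_a = [("T1", "r", "x1"), ("T1", "w", "x1"), ("T2", "r", "x2"), ("T2", "w", "x2")]
s_b = [("T2", "w", "x4"), ("T1", "r", "x4")]
print(globally_consistent([s_a, s_b]))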
begin
Step 1: A global transaction is initiated at site A via the global trans-
action manager (GTMA)
Step 2: Using information about the location of data (from the catalogue
or data dictionary), the GTMA divides the global transaction
into a series of agents at each relevant site
Step 3: The global communications manager (GCMA) at A sends these
agents to the appropriate sites via the communications network
Step 4: Once all agents have completed, the results are communicated
back to site A via the GCMs.
end
Note that agents do not normally communicate directly with each other,
rather they communicate via the coordinator.
These methods have been mainly developed for centralized DBMSs and
then extended for the distributed case. Both locking and timestamping
database. In the first case the granule size for locking is a single tuple,
while in the second it is the entire database, and would prevent any
other transactions from executing until the lock is released; this would
clearly be undesirable. On the other hand, if a transaction was updating
90% of the tuples in a relation, then it would be more efficient to allow
it to lock the entire relation rather than forcing it to lock each individual
tuple separately. Ideally, the DBMS should support mixed granularity
with tuple, page and relation level locking. Many systems will automati-
cally upgrade locks from tuple/page to relation if a particular transaction
is locking more than a certain percentage of the tuples/pages in the
relation.
The most common locking protocol is known as two-phase locking
(2PL). 2PL is so-called because transactions which obey the 2PL protocol
operate in two distinct phases: a growing phase during which the trans-
action acquires locks and a shrinking phase during which it releases those
locks. The rules for transactions which obey 2PL are:
(1) a transaction must acquire a lock on a data item before operating on it;
and
(2) once the transaction has released a lock, it may never acquire any new
locks.
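A minimal lock-manager sketch illustrating the two phases follows: locks are acquired as the transaction proceeds and are all released together at commit, so the shrinking phase is a single step (as in strict 2PL). The class, its methods and the data item names are our own; only write locks are modelled, for brevity.

# Minimal sketch of (strict) two-phase locking, write locks only.
class LockManager:
    def __init__(self):
        self.locks = {}                      # data item -> transaction holding it

    def acquire(self, txn, item):
        holder = self.locks.get(item)
        if holder not in (None, txn):
            return False                     # caller must wait or roll back
        self.locks[item] = txn
        return True

    def release_all(self, txn):
        # Shrinking phase: all locks released together at commit.
        for item in [i for i, t in self.locks.items() if t == txn]:
            del self.locks[item]

lm = LockManager()
assert lm.acquire("T1", "balance_x")
assert not lm.acquire("T2", "balance_x")     # T2 is blocked by T1's lock
lm.release_all("T1")                         # T1 commits
assert lm.acquire("T2", "balance_x")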
6.5.2 Deadlock
A transaction T can be viewed as a sequence of read (R) and write (W)
operations, which navigates through the database claiming read or write
locks as it progresses. It can be represented by its schedule ST where
If it obeys 2PL, then it will hold all write locks until commit. Imagine,
however, that a transaction T1 requests a write-lock on data item xi, which
is currently locked by another transaction, T2. There are two possibilities:
(1) T1 waits until the lock is released, or (2) T1 is rolled back and restarted.
In the case of the first option, T1 retains all the locks it currently holds
and just enters a wait state. In the case of the second option, however,
T, must release all its locks and restart. For a complex transaction,
particularly one that is distributed, the overhead of having to restart the
entire transaction could be very high, as the rollback at one site will cause
cascading rollback of all agents of that transaction at all other sites. A
protocol which adopts this approach is referred to as a deadlock prevention
protocol for reasons which will be clarified in the next paragraph.
Allowing a blocked transaction to retain all its locks while it waits
for a lock to be released by another transaction can lead to deadlock.
Deadlock occurs when one transaction is waiting for a lock to be released
by a second transaction, which is in turn waiting for a lock currently held
by the first transaction. Where there is a possibility of deadlock, a deadlock
detection protocol is required, which will be invoked periodically to check
if the system is deadlocked. Methods for detecting and resolving deadlock
in both centralized and distributed DBMSs are discussed below. If, on
the other hand, a transaction releases all its locks on becoming blocked,
deadlock cannot occur; such protocols are therefore referred to as dead-
lock prevention protocols.
It is also possible in lock-based protocols for transactions to be
repeatedly rolled back or to be left in a wait state indefinitely, unable to
acquire their locks, even though the system is not deadlocked. Such a
situation is referred to as livelock, since the transaction which is livelocked
is blocked, yet all other transactions are 'live' and can continue normal
operations. Consider, for example, a 'bed-state' transaction in a hospital
which calculates the number of beds occupied at a particular time; this is
a similar type of transaction to the summary transaction shown in Example
6.5. At the same time as this 'bed-state' transaction is executing, other
transactions are admitting, discharging and transferring patients. The 'bed-
state' transaction will therefore require read-locks on all in-patient records
in order to get a consistent snapshot of the DB. However, in the presence
of these other transactions, it may be very difficult for the 'bed-state'
transaction to acquire its full lock-set. We say that the transaction is
livelocked. To avoid livelock, most schedulers operate a priority system,
whereby the longer a transaction has to wait, the higher its priority.
Deadlock in centralized DBMSs is generally detected by means of
wait-for graphs. In a wait-for graph, transactions (or their agents in the
distributed case) are represented by nodes and blocked requests for locks
G = T1 -> T2 -> T1
G' = T1 -> T2 -> T3 -> T4 -> T5 -> T1
Figure 6.6 Distributed deadlock.
(1) centralized
(2) hierarchical
(3) distributed.
many sites involved and many transaction agents active. Also, as with
many centralized solutions to distributed problems, the site at which
the global graph is constructed could very easily become a bottleneck.
Furthermore, it would be necessary to assign a backup site in the event
of failure of the original deadlock detection site.
With hierarchical deadlock detection, the sites in the network are
organized into a hierarchy, such that a site which is blocked sends its local
wait-for graph to the deadlock detection site above it in the hierarchy.
Figure 6.7 shows the hierarchy for eight sites, A to H. The leaves of the
tree (level 4) are the sites themselves, where local deadlock detection is
performed. The level 3 deadlock detectors, DDij, detect deadlock involv-
ing pairs of sites i and j, while level 2 detectors perform detection for
four sites. The root of the tree at level 1 is effectively a centralized global
deadlock detector, so that if, for example, the deadlock was between sites
A and G, it would be necessary to construct the entire global wait-for
graph to detect it. Hierarchical deadlock detection reduces communication
costs compared with centralized detection, but it is difficult to implement,
especially in the face of site and communication failures.
There have been various proposals for distributed deadlock detec-
tion algorithms, which are potentially more robust than the hierarchical
or centralized methods, but since no one site contains all the information
necessary to detect deadlock, a lot of intersite communication may be
required.
One of the most well-known distributed deadlock detection methods
was developed by Obermarck, a variation of which was used in System R*.
(Figure 6.7: the hierarchy of deadlock detectors for eight sites, A to H.)
Figure 6.8 Global deadlock detection using nodes to represent external agents
(EXT).
This does not necessarily imply that there is global deadlock since the
EXT nodes could represent totally disjoint agents, but cycles of this form
must appear in the graphs if there is a genuine deadlock. To determine
whether or not there is in fact deadlock, it is necessary to merge the two
graphs. Hence site A transmits its graph to B (or vice versa). The result,
GAB, will be the same as in Figure 6.6 and the cycle indicating actual
deadlock will appear:
G_AB = T_1^A -> T_2^A -> T_2^B -> T_1^B -> T_1^A
In the general case, where a cycle involving the EXT node at site X
appears in the wait-for graph, the wait-for graph for site X should be sent
to site Y, for which X is waiting, where the two wait-for graphs are
combined. If no cycle is detected then the process is continued with
successive augmentation at each site of the wait-for graph. The process
stops if either a cycle appears, in which case one transaction is rolled back
and restarted together with all its agents, or the entire global wait-for
graph is constructed and no cycle has been detected. In this case there is
no deadlock in the system.
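The essence of this merging scheme can be captured in a few lines: represent each local wait-for graph as a set of edges (with EXT standing for agents at other sites), take the union of the graphs being combined, and look for a cycle among the real transactions. The sketch below is only a schematic rendering of that idea, not Obermarck's algorithm itself; the edge format and example graphs are our own.

# Schematic cycle check over the union of local wait-for graphs.
# Edges are (waiting transaction, holding transaction); 'EXT' marks an
# edge to or from an agent at another site and is dropped after merging.
def has_cycle(edges):
    graph = {}
    for a, b in edges:
        if "EXT" not in (a, b):
            graph.setdefault(a, set()).add(b)

    def visit(node, path):
        if node in path:
            return True
        return any(visit(nxt, path | {node}) for nxt in graph.get(node, ()))

    return any(visit(n, set()) for n in graph)

site_a = [("EXT", "T1"), ("T1", "T2"), ("T2", "EXT")]
site_b = [("EXT", "T2"), ("T2", "T1"), ("T1", "EXT")]
print(has_cycle(site_a + site_b))    # True: T1 -> T2 -> T1 after merging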
Even with the use of the external agent, represented by the addition
of the EXT nodes to the local wait-for graphs, deadlock detection in a
distributed system is still potentially a very costly exercise. It is difficult
to decide at what point it is necessary to check for deadlock. It would be
far too expensive to take the approach of centralized systems which
generally test for deadlock every time a transaction has a lock request
refused. It would probably also not be worthwhile checking for deadlock
every time a cycle involving the external agent node appears in a local
wait-for graph unless it is known that there is a lot of contention in the
distributed system and that deadlock is likely to be present. One possible
option is to use a time-out mechanism whereby deadlock detection is
initiated only after the local node has apparently 'hung' for a certain
period of time. However, distributed systems are prone to all sorts of
delays, particularly in communications (e.g. heavy traffic on the network),
which have nothing to do with deadlock. Time-outs in distributed systems
are not as useful as indicators of the possible occurrence of deadlock as
they are in a centralized system.
Concurrency control algorithms in which the possibility, as distinct
from the reality, of deadlock is detected and avoided are known as
deadlock prevention protocols. How does the transaction manager decide
whether or not to allow a transaction T 1, which has requested a lock on
data item xi currently held by transaction T2 , to wait and to guarantee
that this waiting cannot give rise to deadlock? One possible way is to
order the data by forcing locks to be acquired in a certain data-dependent
order. However, such an ordering would be virtually impossible to define,
since users access the DB through non-disjoint user views which can be
defined across any subset of the DB. A more realistic approach therefore
is to impose an ordering on the transactions and ensure that all conflicting
operations are executed in sequence according to this order. Deadlock is
thus prevented by only allowing blocked transactions to wait under certain
circumstances which will maintain this ordering.
The ordering mechanism is generally based on timestamps. The
problems associated with defining unique timestamps in distributed sys-
tems will be discussed in Section 6.5.3. By assigning a unique timestamp
to each transaction when it is launched, we can ensure that either older
transactions wait for younger ones (Wait-die) or vice versa (Wound-wait),
as proposed by Rosenkrantz et al. Algorithm 6.1 is the algorithm for wait-
die, while Algorithm 6.2 is the algorithm for wound-wait.
Note that if a transaction is rolled back, it retains its original
timestamp, otherwise it could be repeatedly rolled back. Effectively,
the timestamp mechanism supports a priority system by which older
transactions have a higher priority than younger ones or vice versa. Note
that the first part of the name of these protocols, wait- and wound-,
describes what happens when T, is older than T 2, while the second part
(die and wait) describes what happens if it is not. Wait-die and wound-
wait use locks as the primary concurrency control mechanism and are
therefore classified as lock-based rather than timestamp (see Section 6.5.3
begin
T1 requests a lock on a data item currently held by T2
if T1 is older than T2, i.e. ts(T1) < ts(T2)
then T1 waits for T2 to commit or rollback
else T1 is rolled back
end-if
end
begin
T1 requests a lock on a data item currently held by T2
if T1 is older than T2, i.e. ts(T1) < ts(T2)
then T2 is rolled back
else T1 waits for T2 to commit or rollback
end-if
end
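Both protocols reduce to a comparison of timestamps when one transaction requests a lock held by another. A compact sketch of just that decision logic follows; the surrounding lock and rollback machinery is omitted, and the function names and return strings are ours.

# Decision logic of Algorithms 6.1 and 6.2. A smaller timestamp means older.
def wait_die(ts_requester, ts_holder):
    # Older requester waits; younger requester dies (is rolled back).
    return "wait" if ts_requester < ts_holder else "rollback requester"

def wound_wait(ts_requester, ts_holder):
    # Older requester wounds (rolls back) the holder; younger requester waits.
    return "rollback holder" if ts_requester < ts_holder else "wait"

print(wait_die(5, 9), "|", wound_wait(5, 9))   # an older transaction requesting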
Conservative -> Liberal
(1) The local site clock is advanced one unit for every event occurring
at that site; the events of interest are transaction starts and the
sending and receiving of messages;
(2) Intersite messages are timestamped by the sender; when site B
receives a message from site A, B advances its local clock to
where ts(ei) and ts(ej) are the values of site-clock_A when events ei
and ej respectively occurred. Rule 2 effectively maintains a degree of
synchronization of local clocks between two communicating sites, such
that if event ei at site A is the sending of a message to site B occurring
at time ti, then we can ensure that event ek, the receipt of the message
at site B, occurs at tk where tk > ti. If there is no communication between
two sites then their clocks will drift apart, but this does not matter since,
in the absence of such communication, there is no need for synchronization
in the first place.
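The two clock rules can be coded directly. The sketch below keeps one counter per site and, on receipt of a message, advances the local clock past the sender's timestamp; it also appends a site identifier to break ties, which is a common way of making timestamps unique and is an assumption on our part rather than something stated above.

# Sketch of the local-clock rules: tick on every local event, and on receipt
# advance to just past the sender's timestamp. (site_id breaks ties.)
class SiteClock:
    def __init__(self, site_id):
        self.site_id, self.time = site_id, 0

    def local_event(self):
        self.time += 1
        return (self.time, self.site_id)

    def send(self):
        return self.local_event()            # sending is itself an event

    def receive(self, msg_ts):
        self.time = max(self.time, msg_ts[0]) + 1
        return (self.time, self.site_id)

a, b = SiteClock("A"), SiteClock("B")
ts = a.send()
print(b.receive(ts) > ts)                    # receipt is ordered after the send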
We must now consider the atomicity of transactions with times-
tamps. The purpose of the commit operation performed by a transaction
is to make the updates performed by the transaction permanent and
visible to other transactions (i.e. to ensure transaction atomicity and
durability).
Transactions which have been committed can never be undone.
With lock-based protocols, transaction atomicity is guaranteed by write-
locking all records until commit-time and, with 2PL in particular, all locks
are released together. With timestamp protocols, however, we do not
have the possibility of preventing other transactions from seeing partial
begin
Ti attempts to pre-write data item x
if x has been read or written by a younger transaction, i.e.
ts(Ti) < ts(read x) or ts(Ti) < ts(write x)
then reject and restart Ti
else accept the pre-write: buffer the (pre)write together with ts(Ti)
end-if
end
begin
Ti attempts to update (write) data item x
if there is an update pending on x by an older transaction Tj,
i.e. for which ts(Tj) < ts(Ti)
then Ti waits until Tj is committed or restarted
else Ti commits the update and sets ts(write x) = ts(Ti)
end-if
end
begin
Ti attempts a read operation on data item x
if x has been updated by a younger transaction,
i.e. ts(Ti) < ts(write x)
then reject the read operation and restart Ti
else if there is an update pending on x by an older transaction Tj,
i.e. ts(Tj) < ts(Ti)
then Ti waits for Tj to commit or restart
else accept the read operation and
set ts(read x) = max(ts(read x), ts(Ti))
end-if
end-if
end
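Putting the read and write rules together for a single data item gives a small state machine: each item carries ts(read x) and ts(write x), and an operation is either accepted or rejected (restarting the transaction). The sketch below is a condensed illustration only; it ignores the pre-write buffering and the cases where a transaction waits for a pending older update.

# Condensed sketch of the timestamp-ordering rules for one data item.
class Item:
    def __init__(self):
        self.read_ts = 0
        self.write_ts = 0

    def read(self, ts):
        if ts < self.write_ts:
            return "reject"                  # updated by a younger transaction
        self.read_ts = max(self.read_ts, ts)
        return "accept"

    def write(self, ts):
        if ts < self.read_ts or ts < self.write_ts:
            return "reject"                  # read or written by a younger transaction
        self.write_ts = ts
        return "accept"

x = Item()
print(x.write(ts=5), x.read(ts=3), x.read(ts=7), x.write(ts=6))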
* Let RQ_A^B denote the read queue at site A of read requests from
transactions originating at site B.
* Let UQ_A^B denote the write queue at site A of update requests
from transactions originating at site B.
* Let ts(RQ_A^B) denote the timestamp of the read operation at the
head of queue RQ_A^B; similarly for ts(UQ_A^B).
* Let r_A^B and u_A^B denote a read and an update request, respectively,
to site A from site B, with timestamps ts(r_A^B) and ts(u_A^B).
The algorithms for conservative timestamping for the read and write
operations are given in Algorithms 6.6 and 6.7.
It is an essential requirement with this method that all queues are
non-empty, since if one queue, say UQ_N^M, is empty and an update request
u_N^K is received at site N from any other site K, there is no guarantee that
site M could not at some time in the future issue a request u_N^M such that
ts(u_N^M) < ts(u_N^K)
begin
Site B issues a read request, r_A^B, for a data item stored at site A
Insert r_A^B into the appropriate place, according to its timestamp
ts(r_A^B), in the read queue RQ_A^B
Check that all update requests queued at A from all sites i are
younger than r_A^B, otherwise wait
begin
Site B issues a write request, u_A^B, on a data item stored at site A
Insert u_A^B into the appropriate place, according to its timestamp
ts(u_A^B), in the update queue UQ_A^B
Check that all update queues at A from all sites i are non-empty
in the future with a timestamp less than the timestamp on the null request.
Alternatively, a site which is currently blocked due to an empty queue
could issue a specific request for such a null request.
(1) Read phase: this phase represents the body of the transaction up
to commit (no writes to the database during this phase);
(1) Ti completes its write phase before Tj starts its read phase; this
effectively means that Ti has finished before Tj begins (condition 1
of Figure 6.10);
(2) Writeset(Ti) ∩ Readset(Tj) = ∅ and Ti completes its write phase
before Tj starts its write phase; this means that the set of data
objects updated by Ti (writeset(Ti)) cannot affect the set of data
objects read by Tj (readset(Tj)) and that Ti cannot overwrite Tj,
because it will have finished writing before Tj (condition 2 of Figure
6.10);
(3) Writeset(Ti) ∩ (Readset(Tj) ∪ Writeset(Tj)) = ∅ and Ti completes
its read phase before Tj completes its read phase; this ensures that
Ti does not affect either the read or write phase of Tj (condition 3
of Figure 6.10).
Algorithm 6.8 Validation using optimistic concurrency control for centralized DBMS.
begin
Validate transaction Tj (transaction number tn(Tj)) against all
other older, committed transactions Ti
for all Ti where tn(Ti) < tn(Tj) do
begin
Condition 1: Ti has completed its write phase before Tj starts its
read phase
if tn(Ti) < stn(Tj)
then return success
end-if
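The three validation conditions can also be expressed with simple set operations. The sketch below checks Tj against one older committed Ti; the function signature, the boolean phase-timing flags and the example sets are our own and only mirror the structure of the conditions, not the exact book algorithm.

# Sketch of validation conditions (1)-(3) for Tj against an older Ti.
def validate(writeset_i, readset_j, writeset_j,
             i_writes_before_j_reads,
             i_writes_before_j_writes,
             i_reads_done_before_j_reads_done):
    if i_writes_before_j_reads:
        return True                                            # condition 1
    if not (writeset_i & readset_j) and i_writes_before_j_writes:
        return True                                            # condition 2
    if not (writeset_i & (readset_j | writeset_j)) and i_reads_done_before_j_reads_done:
        return True                                            # condition 3
    return False                                               # Tj must restart

print(validate({"x"}, {"y"}, {"y"}, False, True, False))       # True via condition 2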
would require all sites at which replicas are stored to be operational and
connected to the network. In the event of a network or site failure, it
would not be possible to update such replicated data items at all. Clearly,
such an approach would run counter to the fault-tolerant aims of repli-
cation. Hence, it is common to adopt a consensus approach, whereby if
the majority of sites vote to accept the update of a replicated data item,
then the global scheduler instructs all sites to update (commit). Sites
which are inaccessible during the update due to local site or network
failure could simply be notified of the update when they rejoin the network
(see also Section 6.6.1).
One of the earliest optimistic methods of concurrency control for
fully replicated databases was developed by Thomas. As with other opti-
mistic methods for concurrency control, transactions execute in three
phases: a read phase during which updates are made to local copies of
the data only, a validation phase during which the proposed update is
checked for conflicts at all sites and a write phase during which the
transaction is committed. Thomas' validation method is based on data item
and transaction timestamps for a fully replicated database. Transactions
execute in their entirety at one site. Along with every copy of every data
item is stored the timestamp of the transaction which last successfully
updated that data item. Thus for global consistency the value of the
timestamp should be the same for all copies of a given data item. In
summary, the method proceeds as follows. On entering the validation
phase at site S, transaction Ti sends details of its readsets, writesets and
corresponding timestamps to all other sites. Each site then validates Ti
against its local state and then votes to accept or reject it. If the majority
vote 'accept' then Ti commits and all sites are notified. To validate Ti,
site I checks the timestamps of the readset against the timestamps of the
local copies of the corresponding data items. If these are the same, this
indicates that if the updates performed by Ti were propagated to site I,
no inconsistencies would result. If even one timestamp for a single data
item is different, then this would indicate that Ti has read inconsistent
data. It is also necessary for each site I to validate Ti against all other
pending (concurrent) transactions at site I. If a pending transaction Tj is
found to conflict and is younger than Ti, then site I rejects Ti; if it is
older, then validation for Ti at site I is deferred until the conflicting request from
Tj is resolved. This avoids deadlocks by ensuring that younger transactions
always wait for older transactions. If the majority of the votes are 'accept',
then Ti is accepted and validation succeeds, otherwise it is rejected and
validation fails.
(1) Every data item has one copy (at one site) designated as the primary
copy; all other replicas are slave copies. Updates are directed to the
primary copy only and then propagated to the slave copies. Also,
all reads must first acquire a read lock on the primary copy before
reading a slave copy. In the event of network partitioning, only
primary copies are available, assuming of course that they are
accessible. If the primary site itself fails, it is possible to promote
one of the slave copies and designate it as the primary copy. This
is generally accomplished using a voting strategy (see below), but
this requires that the system can distinguish between site and net-
work failures. A new primary copy cannot be elected if the network
was partitioned due to communications failure, as the original pri-
mary site could still be operational but the system would have no
way of knowing this;
(2) Under the voting (also called quorum consensus) strategy, a trans-
action is permitted to update a data item only if it has access to
and can therefore lock a majority, of the copies of that data item.
This majority is known as a write quorum. In the event of the
transaction obtaining a majority, all copies are updated together as
a single unit and the results are then propagated to other sites. A
similar system, based on a read quorum, operates for reads to
prevent transactions reading out-of-date versions of data items. If
consistent reads are required, then the read quorum must also
represent a majority. Hence it is often the case that the write
quorum = read quorum (a small sketch of this quorum arithmetic is given
after this list). If, however, applications can tolerate versions of the data
which are slightly out of date, then for the sake of higher data availability
the read quorum can be reduced;
(3) While the voting strategy provides greater data availability than
primary copy in the event of failure, this availability is achieved at
the expense, during normal operation, of checking read or write
quorums for every read or write operation. The missing writes
strategy reverses this situation by involving much less overhead
during normal operation at the expense of higher overhead when
things go wrong. Under the missing writes strategy, transactions
operate in one of two modes: normal mode when all copies are
available, and failure mode when one or more sites may have failed.
Timeouts are used to detect failures so that a transaction in normal
mode, which issues a write to a site from which it fails to receive
an acknowledgement, switches to failure mode. This switching can
be made either dynamically, if possible, or else by rolling back and
restarting the transaction in failure mode. During failure mode, the
voting strategy outlined above is used.
(4) Conflict class analysis can be used as a general concurrency control
strategy and is not restricted to replicated databases. It will there-
fore be discussed separately in Section 6.8.1.
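To illustrate the voting strategy of point (2), the following sketch checks whether a read or write may proceed given the number of reachable copies. The quorum sizes follow the usual rule that any two write quorums overlap and that a consistent read quorum overlaps every write quorum; the function names and thresholds are our own illustration.

# Illustrative quorum check for a replicated data item with n copies.
def quorums(n_copies):
    write_q = n_copies // 2 + 1          # any two write quorums overlap
    read_q = n_copies + 1 - write_q      # a read quorum overlaps every write quorum
    return read_q, write_q

def can_write(reachable, n_copies):
    return reachable >= quorums(n_copies)[1]

def can_read_consistently(reachable, n_copies):
    return reachable >= quorums(n_copies)[0]

print(quorums(5))                        # (3, 3) for five copies
print(can_write(2, 5), can_read_consistently(3, 5))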
(1) Local transaction managers must guarantee local atomicity for both
purely local transactions and agents of global transactions;
(2) Local transaction managers must guarantee to preserve the order
of execution of agents of local transactions determined by the global
transaction manager;
(3) Each global transaction may spawn only one agent at a given site;
(4) The details of the readsets and writesets of all global transaction
agents must be available to the global transaction manager;
(5) The global transaction manager must be able to detect and resolve
global deadlocks; this means that local transaction managers must
make the local wait-for graph available to the global transaction
manager, which also must be able to view the local state.
and
or
R1 = W1 = R2 = W2 = S
R3 = S and W3 = ∅
R4 = T and W4 = ∅
Figure 6.11 Conflict graph.
S ∩ T = ∅
SUMMARY
In this chapter, we have discussed the problems associated with and the
solutions to the control of users who are concurrently accessing a DB.
These problems are compounded in the DDB environment, where
multiple local users accessing their own local DBs only, are combined
with global users accessing data stored at different sites across the
network.
We began with a review of the background. The transaction is the
basic unit of work in a DBMS and has four A.C.I.D. properties, namely
atomicity, consistency, independence and durability. In Section 6.3, the
three classes of problems, lost updates, integrity constraint violation and
incorrect summaries, which can arise when transactions are allowed to
proceed without any attempt at synchronization, were discussed. This
general background to concurrency control for both centralized and
distributed DBs finished with a discussion of schedules and serialization.
A schedule is the entire sequence in order of the reads and writes of all
EXERCISES
6.1 What are the four A.C.I.D. properties of a transaction, and why are they
necessary?
where ri(xj) and wi(xj) denote a read and a write operation by transaction i
on data item xj. Data items x1 and x2 are stored at site A, while x3 and x4 are
stored at site B. In addition, two local transactions, L3 and L4:
L3 = [r3(x1), r3(x2)] at site A
L4 = [r4(x3), r4(x4)] at site B
execute concurrently with T, and T 2.
Suppose that the schedules SA and SB produced by the local schedulers
at site A and B respectively are as follows:
SA = [r3(x1), r1(x1), w1(x1), r2(x2), w2(x2), r3(x2)]
SB = [r4(x3), r1(x3), w1(x3), r2(x4), w2(x4), r4(x4)]
(a) Are these schedules locally serializable? If so, what are the equivalent
local serial schedules?
(b) Are they globally serializable? If so, what is the equivalent global serial
schedule?
6.3 Repeat Exercise 6.2, assuming that SA is as before, but that the local
scheduler at site B produces a different schedule, SB'
where SB' = [r1(x3), w1(x3), r2(x4), r4(x3), r4(x4), w2(x4)]
6.4 The diagram in Figure 6.12 shows the interleaved execution of three
concurrent transactions, T1, T2 and T3, with timestamps
ts(T1) < ts(T2) < ts(T3)
What would happen under each of the following protocols:
(a) 2PL with deadlock prevention
(b) 2PL with deadlock detection and recovery
(c) wait-die
(d) wound-wait?
(Figure 6.12: the interleaved execution of the three concurrent transactions T1 (oldest), T2 and T3 (youngest) over time.)
6.6 Describe Obermarck's method for deadlock detection in a DDB. Redraw the
global wait-for graph for Exercise 6.5 using Obermarck's method.
6.7 Why does Obermarck's method for deadlock detection in a DDB detect false
deadlocks if transactions are allowed to abort spontaneously (e.g. to abort
under application control (as in the 'insufficient funds' example in Example
6.1)) and not as a result of concurrency control?
Bibliography
Agrawal R. and DeWitt D.J. (1985). Integrated concurrency control and
recovery mechanisms: design and performance evaluation. ACM TODS, 10(4),
529-64.
This paper presents an integrated study of concurrency control and recovery
mechanisms for centralized DBs. It is particularly interesting because it
considers both concurrency control and recovery, which are clearly intimately
related, in a unified way. The model for analysing the relative costs of the
various approaches is especially useful since it not only isolates the costs of
the various components of a particular mechanism, but can also be used to
identify why a particular mechanism is expensive and hence where
improvements can be directed to greatest effect. See Section 6.7.2.
Alonso R., Garcia-Molina H. and Salem K. (1987). Concurrency control and
recovery for global procedures in federated database systems. Data
Engineering, 10(3), 5-11.
Two approaches to synchronizing global transactions in an MDBS are
discussed, namely sagas and altruistic locking (see Section 6.8).
Bernstein P.A. and Shipman D.W. (1980). The correctness of concurrency
control mechanisms in a system for distributed databases (SDD-1). ACM
TODS, 5(1), 52-68.
Bernstein P.A., Shipman D.W. and Rothnie J.B. (1980). Concurrency control
in a system for distributed databases (SDD-1). ACM TODS, 5(1), 18-51.
These two papers give a comprehensive overview of the concurrency control
and recovery mechanisms of the homogeneous DDBMS, SDD-1 (see Section
6.5.2.4).
Bernstein P.A. and Goodman N. (1981). Concurrency control in distributed
database systems. ACM Computing Surveys, 13(2), 185-222.
Bernstein P.A., Hadzilacos V. and Goodman N. (1987). Concurrency control
and recovery in database systems. Wokingham: Addison-Wesley.
A comprehensive study of concurrency control and recovery issues for both
centralized and distributed DBMSs.
Breitbart Y. and Silberschatz A. (1988). Multidatabase update issues. In Proc.
ACM SIGMOD Conf. pp. 135-42. Chicago, Illinois.
Brodie M.L. (1989). Future intelligent information systems: AI and database
different transactions. The method guarantees that only acyclic graphs are
created and that a correct schedule can be formed by ordering the nodes of
the graph topologically. The method could be particularly useful for long-
lived transactions and for supporting concurrency control in MDBMS.
Furtado A.L. and Casanova M.A. (1985). Updating relational views. In Query
Processing in Database Systems. (Kim W., Reiner D. and Batory D., eds.),
Berlin: Springer-Verlag.
Garcia-Molina H. (1983). Using semantic knowledge for transaction processing
in a distributed database, ACM TODS, 8(2), 186-213.
The method proposed in this paper, one of the early ones on exploiting
application-specific knowledge for improving concurrency control, is
discussed briefly in Section 6.8. In addition to pointing out the advantages of
the approach, the paper also discussed its weaknesses.
Garcia-Molina H. (1991). Global consistency constraints considered harmful for
heterogeneous database systems. In Proc. 1st Int. Workshop on Interoperability
in Multidatabase Systems, Kyoto, Japan, April 1991, 248--250.
This paper argues that the existence of global integrity (consistency)
constraints for MDBs is, by definition, a violation of nodal autonomy. The
implication of the absence of such constraints for concurrency control is
discussed; in particular MDBs and serializable schedules are seen as
contradictory. The author argues in favour of the use of sagas to provide
concurrency control for MDBs (see Section 6.8).
Garcia-Molina H. and Salem K. (1987). Sagas. In ACM SIGMOD Conf. May
1987, 249-259. San Francisco, California.
Sagas are long-lived transactions which can be written as a sequence of
transactions which can be interleaved with other transactions (see Section
6.8).
Garcia-Molina H. and Wiederhold G. (1982). Read-only transactions in a
distributed database, ACM TODS, 7(2), 209-34.
A useful study of the requirements of read-only transactions (queries), which
are likely to constitute a large part of transaction processing in a DDB
environment. The requirements for queries are grouped under five headings:
strong consistency (schedule of all update transactions and strong consistency
queries must be consistent); weak consistency (only query's view of data has
to be consistent); t-vintage query (requires a view of data as it existed at
time t); t-bound query (requires data it reads to reflect all updates
committed before time t); and latest-bound query (special case of t-bound
query with t=current time, thereby giving latest versions of all data accessed
by the query).
Gligor V.D. and Luckenbaugh G.L. (1984). Interconnecting heterogeneous
database management systems. IEEE Computer, 17(11), 33-43.
Gligor V.D. and Popescu-Zeletin R. (1985). Concurrency control issues in
distributed heterogeneous database management systems. In Distributed Data
Sharing Systems (Schreiber F.A. and Litwin W., eds.), 43-56. Amsterdam:
North Holland.
The requirements for global concurrency control mechanisms for MDBSs,
based on the concatenation of local concurrency control mechanisms, are
discussed.
Gligor V. and Popescu-Zeletin R. (1986). Transaction management in
distributed heterogeneous database management systems. Information Systems,
11(4), 287-97.
Gray J.N. (1978). Notes on database operating systems. In Operating Systems -
An Advanced Course. Berlin: Springer-Verlag.
Grimson J.B., O'Sullivan D., Lawless P. et al. (1988). Research Issues in
SOLUTIONS TO EXERCISES
6.2
(a) SA is serializable:
L3 sees x1 in the pre-T1 version, ∴ L3 < T1
L3 sees x2 in the post-T2 version, ∴ L3 > T2
∴ T2 < L3 < T1
The equivalent serial schedule is
SRA = [r2(x2), w2(x2), r3(x1), r3(x2), r1(x1), w1(x1)]
(b) Both T1 and T2 appear in the same order in both local schedules and hence
the global schedule is also serializable with
T2 < T1
and the equivalent global serial schedule is
SGR = [r2(x2), w2(x2), r2(x4), w2(x4), r1(x1), w1(x1), r1(x3), w1(x3)]
6.3
(a) Both local schedules are serializable:
Site A as before, i.e. T2 < L3 < T1
For site B, T1 < L4 < T2
∴ the equivalent serial schedule is
SRB = [r1(x3), w1(x3), r4(x3), r4(x4), r2(x4), w2(x4)]
(b) T1 and T2 appear in different orders in the local schedules and therefore the
global schedule is not serializable.
6.4
(a) 2PL with deadlock prevention
t4: T3 is rolled back
t6: T2 is rolled back
(c) Wait-die
t4: T1 is rolled back
t6: T2 waits
t7: T2 is rolled back
(d) Wound-wait
t4: T3 waits
t5: T3 is rolled back
t6: T2 waits
t8: T2 is rolled back
Figure 6.13 Global wait-for graph for Exercise 6.5.
6.5
See Figure 6.13.
The global wait-for graph, G, contains a cycle and hence there is deadlock:
G: T1 → T2 → T4 → T6 → T5 → T3 → T1
6.6
See Figure 6.14, which shows the final global wait-for graph:
Note that, for simplicity, the EXT nodes have been omitted.
Figure 6.14 Global wait-for graph (Obermarck's method) for Exercise 6.6.
6.7
Assume that Obermarck's wait-for graph contains a cycle of the form
T1 → T2 → . . . → Tn → T1
where → denotes the wait-for relationship. Thus for every pair of consecutive
edges Ti → Ti+1, transaction Ti must have been waiting for transaction Ti+1 at
the time the edge was inserted into the graph.
However, the wait-for graph is not constructed instantaneously and
hence by the time the graph has been completed, it is possible that one of the Ti could
have aborted independently (spontaneously) and hence broken the deadlock.
Using Obermarck's method, this would not be detected and there would be
unnecessary rollback of a transaction to break the false deadlock.
6.8
(a) There are a total of eight operations to be performed (two reads and two
writes by each transaction). They cannot, however, be interleaved in an arbitrary
order; reads must precede writes and the decrement must precede the increment
for each transaction. There are in fact a total of 8C4 = 70 possible interleavings.
(b) Let r1(x) denote the read balance operation by transaction T1 and w1(x)
denote the write balance operation by T1 (similarly for T2).
Only 2 of the 70 schedules are serializable, namely
S1 = [r1(x), w1(x), r1(y), w1(y), r2(y), w2(y), r2(x), w2(x)]
and
S2 = [r2(y), w2(y), r2(x), w2(x), r1(x), w1(x), r1(y), w1(y)]
and both are in fact serial schedules with T1 < T2 in S1 and T2 < T1 in S2.
(c) All 70 schedules are correct and will leave the DB in a consistent state. This
illustrates the point that serializable schedules form only a subset of correct
schedules (i.e. those which guarantee the integrity of the DB). In this
particular example, although T, and T2 access the same data, they are
independent of one another. The decrement and increment operations are also
independent of each other, since they do not depend on the value stored in
the DB. Note, however, that if an 'insufficient funds' clause (as in Example
6.1) was incorporated into either or both of the transactions, the situation
would be different.
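The counting argument in parts (a) to (c) can be checked mechanically. The short Python sketch below is illustrative only (the operation lists and helper names are ours, not the book's): it generates the 8C4 = 70 interleavings of the two fixed operation sequences and counts those whose precedence graph is acyclic, confirming that only the two serial schedules are (conflict) serializable.

from itertools import combinations

T1 = [('r', 'x'), ('w', 'x'), ('r', 'y'), ('w', 'y')]   # T1: decrement x, then increment y
T2 = [('r', 'y'), ('w', 'y'), ('r', 'x'), ('w', 'x')]   # T2: decrement y, then increment x

def conflict_serializable(schedule):
    # schedule is a list of (transaction, operation, item) triples; with only
    # two transactions a precedence cycle means edges in both directions.
    edges = set()
    for i, (ti, oi, xi) in enumerate(schedule):
        for tj, oj, xj in schedule[i + 1:]:
            if ti != tj and xi == xj and 'w' in (oi, oj):
                edges.add((ti, tj))
    return not ((1, 2) in edges and (2, 1) in edges)

schedules = []
for slots in combinations(range(8), 4):        # positions taken by T1's operations
    it1, it2 = iter(T1), iter(T2)
    schedules.append([(1, *next(it1)) if k in slots else (2, *next(it2))
                      for k in range(8)])

print(len(schedules))                                       # 70
print(sum(conflict_serializable(s) for s in schedules))     # 2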
Recovery
7.1 Introduction
The ability to ensure the consistency of the DB in the presence of
unpredictable failures of both hardware and software components is an
essential feature of any DBMS. It is the role of the recovery manager of
the DBMS to restore the DB to a consistent state following a failure
which has rendered it either inconsistent or at least suspect.
In this chapter, we will introduce the background to recovery
in DBMSs, both centralized and distributed. The fundamental role of
transactions is discussed and the use of logs and the causes of failure are
presented. An overview of recovery protocols for centralized DBMSs is
then discussed in order to give a good understanding of the starting point
for the development of recovery protocols for DDBMSs. The chapter
ends with a brief discussion of the problems associated with recovery in
MDBSs.
* begin transaction
* write (includes insert, delete and update)
"* commit transaction
"* abort transaction
Details of the meaning of these commands are given in Section 6.1.
Note the log is often also used for purposes other than recovery (e.g. for
performance monitoring and auditing). In this case, additional information
may be recorded in the log (e.g. DB reads, user logons, logoffs, etc.) but
these are not relevant to recovery and hence are omitted from this
discussion. Each log record contains the following information, not all of
which is required for all actions as indicated below:
how this is achieved). Log files were traditionally stored on magnetic tape
because tape was a more reliable form of stable storage than magnetic
disk and was cheaper. However, the DBMSs of today are expected to be
able to recover quickly from minor failures. This requires, as we shall
see, that the log be stored on-line on a fast direct-access storage device.
Moreover, the discs of today are, if anything, more reliable than tapes.
In systems with a high transaction rate, a huge amount of logging
information will be generated every day (>10⁷ bytes daily is quite
possible) so it is not realistic, or indeed useful, to hold all this data on-
line all the time. The log is needed on-line for quick recovery after minor
failures (e.g. rollback of a transaction following deadlock). Major failures,
such as disk head crashes, obviously take longer to recover from and
would almost certainly require access to a large part of the log. In such
circumstances, it would be acceptable to wait until parts of the log on
archival storage are transferred back to on-line storage.
A common approach to handling the archiving of the log is to
divide the on-line log into two separate direct access files. Log records
are written to the first file until it is, say, 95% full. The logging system
then opens the second file and writes all log records for new transactions
to the second file. Old transactions continue to use the first file until they
have finished, at which time the first file is transferred to archival storage.
In this way, the set of log records for an individual transaction cannot
be split between archival and on-line storage, making recovery of that
transaction more straightforward.
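As a sketch of this two-file scheme (the class and attribute names below are invented for illustration and do not come from any particular DBMS), the switching and archiving rules might look as follows in Python:

class TwoFileLog:
    def __init__(self, capacity, archive):
        self.files = [[], []]            # two on-line direct-access log files
        self.current = 0                 # file assigned to *new* transactions
        self.capacity = capacity
        self.archive = archive
        self.using = {}                  # transaction id -> log file index

    def append(self, txn, record):
        if txn not in self.using:
            if len(self.files[self.current]) >= 0.95 * self.capacity:
                self.current = 1 - self.current          # open the other file
            self.using[txn] = self.current
        self.files[self.using[txn]].append((txn, record))

    def end_transaction(self, txn):
        idx = self.using.pop(txn)
        # Archive the old file once no active transaction still writes to it,
        # so a transaction's records are never split across the two stores.
        if idx != self.current and idx not in self.using.values():
            self.archive.extend(self.files[idx])
            self.files[idx] = []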
The log is treated by the operating system as just another file on
secondary storage. Hence log records must first be written into log buffers
which are periodically flushed to secondary storage in the normal way.
Logs can be written synchronously or asynchronously. With synchronous
log-writing, every time a record is written to the log, it is forced out onto
stable storage, while under asynchronous log-writing, the buffers are only
flushed periodically (e.g. when a transaction commits) and/or when they
become full. Synchronous writing imposes a delay on all transaction
operations, which may well prove unacceptable. The log is a potential
bottleneck in the overall DBMS and the speed of the write-log operation
can be a crucial factor in determining the overall performance of the
DBMS. However, the delay due to synchronous logging must be traded
off against the obvious advantages of a more up-to-date log when it comes
to recovery.
It is essential that log records (or at least certain parts of them) be
written before the corresponding write to the DB. This is known as the
write-ahead log protocol. If updates were made to the DB first and failure
occurred before the log record was written, then the recovery manager
would have no way of undoing (or redoing) the operation. Under the
write-ahead log protocol, the recovery manager can safely assume that,
if there is no commit transaction entry in the log for a particular trans-
action, then that transaction was still active at the time of failure and
must therefore be undone.
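A minimal sketch of the write-ahead rule, assuming toy in-memory stand-ins for the stable log and the stable DB (all names and values are invented): the before-image is forced to the log before the in-place write, and on restart any transaction with no commit record is undone.

stable_log, stable_db = [], {'x': 100}        # invented stand-ins for stable storage

def force_log(record):
    stable_log.append(record)                 # synchronous flush of the log buffer

def db_write(txn, item, new_value):
    force_log({'txn': txn, 'op': 'write', 'item': item,
               'before': stable_db[item], 'after': new_value})   # log first ...
    stable_db[item] = new_value                                  # ... then update the DB

def commit(txn):
    force_log({'txn': txn, 'op': 'commit'})

def restart():
    # Transactions with log records but no commit record were active at the
    # time of failure and are undone from their before-images, working backwards.
    committed = {r['txn'] for r in stable_log if r['op'] == 'commit'}
    for r in reversed(stable_log):
        if r['op'] == 'write' and r['txn'] not in committed:
            stable_db[r['item']] = r['before']

db_write('T1', 'x', 50)        # T1 updates x but the system fails before commit
restart()
assert stable_db['x'] == 100   # T1's update has been undone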
Figure 7.1 shows the interaction between the local recovery man-
ager, which oversees local recovery at that site, the buffer manager, which
manages the buffers, the log buffers for communication to and from the
log, and the DB buffers for communication to and from the DB and
which at any time will contain cached portions of the DB.
Figure 7.1 Local recovery manager and its interfaces.
7.1.3 Checkpointing
One of the difficulties facing the recovery manager, following major
failure, is to know how far back in the log to go in order to identify
transactions which might have to be redone (i.e. those which had committed
prior to failure) or undone (those which were active at the time of
failure). To limit its search, the recovery manager takes periodic check-
points and on recovery it only has to go back as far as the last checkpoint.
There are two approaches to checkpointing, namely synchronous and
asynchronous. With asynchronous checkpointing, system processing is
allowed to continue uninterrupted by the checkpoint as shown in Figure
7.2(a). When a synchronous checkpoint is being taken, the system stops
accepting any new transactions until all executing transactions have
finished, as shown in Figure 7.2(b).
Figure 7.2 Transaction classes TC1 to TC5 relative to the time of the last checkpoint (tc) and the time of failure (tf): (a) asynchronous checkpointing; (b) synchronous checkpointing.
The following actions are carried out at an asynchronous check-
point.
Most DBMSs use immediate or in-place updating (i.e. they write directly
to the DB via the DB buffers). The advantage of this approach is that
the updated pages are in place when the transaction commits and no
further action is required. However, this approach suffers from the disad-
vantage that, in the event of transaction failure, updates may have to be
undone. To avoid this, some systems use shadow writing or differential
files.
With shadow writing, updates are written to a separate part of the
DB on secondary storage and the DB indexes are not updated to point
to the updated pages until the transaction commits. The old versions of
the pages are then used for recovery and effectively become part of the
log.
With differential files, the main DB is not updated at all; rather
the changes effected by transactions are recorded in a separate part of
the DB called the differential file. When the differential file becomes too
large, resulting in a noticeable deterioration in overall performance, it
can be merged with the read-only DB to produce a new read-only DB
and an empty differential file.
The main DB is read-only, while the differential file is read-write.
Stonebraker (1975) proposed that the differential file be divided into two
distinct parts, D and I; deletions are stored in D and insertions are
recorded separately in I. An update is treated as a deletion followed by
an insertion. The logical DB seen by users, LDB, is then given by
LDB = (RDB ∪ I) - D
where RDB is the read-only main DB.
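The read rule can be pictured with Python sets standing in for the read-only DB and the two differential files (the sample tuples are invented):

RDB = {('p1', 'Murphy'), ('p2', 'Kelly')}      # read-only main database
I   = {('p3', 'Byrne')}                        # insertions recorded by transactions
D   = {('p2', 'Kelly')}                        # deletions (an update = delete + insert)

LDB = (RDB | I) - D                            # the logical database seen by readers
assert LDB == {('p1', 'Murphy'), ('p3', 'Byrne')}

# Merging when the differential files grow too large: fold I and D into a new
# read-only database and start again with empty differential files.
RDB, I, D = LDB, set(), set()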
Figure 7.3 (a) Mirroring. (b) Mirroring using primary and fallback partitions.
the fallback partitions, such that the same segment of the DB is stored
on different discs. Thus if segments A and B are stored in the primary
area on disk 1, then their mirrors will be stored in the fallback area of
disk 2 and vice versa. The disc controllers are multiply connected to the
discs. This approach is used in the Teradata DBC/1012 Database Machine,
although for added reliability there are multiple processors and duplicated
communication buses. In normal operation, the DBMS uses both disk
controllers to search disks in parallel, whereas such parallelism is not
possible with standard mirroring.
being partitioned into two or more sub-networks. Sites within the same
partition can communicate with one another, but not with sites in other
partitions. In Figure 7.4, following the failure of the line connecting sites
C and E, sites {A,B,C} in one partition are isolated from sites {D,E,F}
in the other partition. One of the difficulties of operating in a distributed
environment is knowing when and where a site or communication failure
has occurred. For example, suppose site C in Figure 7.4 sends site E a
message which E then fails to acknowledge within a certain time period
called a timeout. How can C decide whether E has failed or whether the
network has been partitioned due to communication failure in such a way
that C and E are in separate partitions and hence cannot communicate
with one another? In fact, all that C can conclude from E's failure to
respond, is that it is unable to send and receive messages to and from E.
Choosing the correct value for the timeout which will trigger this con-
clusion is difficult. It has to be at least equal to the maximum possible
time for the round-trip of message plus acknowledgement plus the pro-
cessing time at E.
If agents of the same global transaction are active at both C and E
and a network failure occurs which puts C and E in different partitions,
then it is possible for C, and other sites in the same partition, to decide
to commit the global transaction, while E, and other sites in its partition,
decide to abort it. Such an occurrence violates global transaction atom-
icity.
In general, it is not possible to design a non-blocking atomic com-
mitment protocol for arbitrarily partitioned networks. A non-blocking
protocol is one which does not block operational sites in the event of
failure. Sites which are still capable of processing should be allowed to
do so, without having to wait for failed sites to recover. Since recovery
Figure 7.4 Partitioning of a network.
(1) Undo one (or more) of the offending transactions - this could have
a cascading effect on other transactions in the same partition;
(2) Apply a compensating transaction, which involves undoing one of
the transactions and notifying any affected external agent that the
correction has been made; in the above example, this would not
only involve undoing the effect of one of the transactions (TA or
TB) but also asking the customer to return the £100!
(3) Apply a correcting transaction, which involves correcting the data-
base to reflect all the updates; in our example this would mean
amending the balance of the account to reflect both withdrawals,
thus setting the account balance to -£100 and applying an over-
drawn interest charge.
(1) Undo/redo
(2) Undo/no-redo
(3) No-undo/redo
(4) No-undo/no-redo.
Each of these will be described below. The four algorithms specify how
the recovery manager handles each of the different transaction operations:
* restart.
7.3.1 Undo/redo
Recovery managers based on the undo/redo algorithm are the most com-
plex since they involve both undoing and redoing of transactions following
failure. However, this approach has the advantage of allowing the buffer
manager to decide when to flush the buffers, hence reducing I/O overhead.
Its overall effect is to provide maximum efficiency during normal operation
(i.e. in the absence of transaction aborts and failures) at the expense of
greater overhead at recovery. The actions of the recovery manager in
response to the various operations are as follows:
Begin
transaction: this triggers some DBMS management functions such as
adding the new transaction to the list of currently active
transactions. Conceptually also an entry is made in the log,
list), rather than having to go through the building up of a redo list and
an undo list. Clearly, for recoverability, these lists have to be kept on
stable storage and generally form part of the log.
Transactions fall into five classes as shown in Figure 7.2(a), where
tc is the time of the last checkpoint and tf the time at which failure occurs
(tc < tf). Let TCi(start) denote the start time of transaction class i (i.e.
when the begin transaction entry was made on the log) and TCi (end) its
finishing time (i.e. when the commit or abort entry was written to the
log). Remember that, for performance reasons, the actual begin trans-
action entry may not be made in the log until the transaction performs
its first write operation. Referring to Figure 7.2(a):
The action of taking the checkpoint at tc has ensured that all changes
made by such transactions will have been permanently recorded in the
DB (durability).
TC2: Transactions belonging to this class started before the last check-
TC4: Transactions of this class began after the checkpoint and finished
before the failure,
TC4(start) > tc and TC4(end) < tf.
Hence they are treated like transactions of class TC2.
TC5: The final class of transactions began after the last checkpoint was
taken but were still active at the time of failure,
TC5(start) > tc and TC5(end) = undefined.
The log for this transaction class contains begin transaction and
possibly before- and after-images for data objects updated by the trans-
action, but no corresponding commit transaction. As with transactions of
class TC 3 , transactions in class TC 5 must be undone.
Algorithm 7.1 outlines the restart procedure under the undo/redo
protocol. Note that the situation is somewhat simplified if checkpoints are
taken at quiescent points (synchronous checkpointing), as only trans-
actions of classes TC1, TC4 and TC5 would have to be considered, as
shown in Figure 7.2(b). Recovery procedures for these three classes are
the same as for asynchronous checkpointing.
When the recovery manager encounters an abort transaction com-
mand on the log, it adds the transaction to the undo-list. If a failure
occurs during the recovery procedure, on restart the recovery manager
must continue repairing the DB. It is therefore essential that the effect
of undoing or redoing an operation any number of times, will be the same
begin
STEP 2 CLASSIFY
do while not end-of-log
read next entry from log into log-record
if log-record type = 'begin transaction' or 'abort transaction'
then add transaction identifier for log-record into undo-list
else if log-record type = 'commit transaction'
then move transaction identifier for log-record from undo-list
to redo-list
end-if
end-if
end-do
as undoing it or redoing it only once. Formally, both undo and redo must
be idempotent, that is
UNDO(UNDO(UNDO(. . . Oi))) = UNDO(Oi)
and
REDO(REDO(REDO(. . . Oi))) = REDO(Oi)
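A small sketch of why undo and redo based on before- and after-images are idempotent (the record layout below is an invented illustration, not any particular DBMS's log format): each operation installs a state rather than re-applying a delta, so repeating it is harmless.

db = {'x': 80}
log_record = {'item': 'x', 'before': 100, 'after': 80}

def undo(rec):
    db[rec['item']] = rec['before']     # install the before-image

def redo(rec):
    db[rec['item']] = rec['after']      # install the after-image

undo(log_record); undo(log_record)      # repeated undo: same result as one undo
assert db['x'] == 100
redo(log_record); redo(log_record)      # repeated redo: same result as one redo
assert db['x'] == 80

# Contrast with a non-idempotent 'logical' record such as 'subtract 20 from x':
# re-applying it after a crash during recovery would corrupt the database.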
7.3.2 Undo/no-redo
Using the undo/no-redo algorithm, the DB buffers are flushed at commit
so there will never be any need to redo transactions on restart, and hence
there is no need to store after-images on the log. Referring to Figure
7.2(a), the recovery manager only has to concern itself with transactions
active at the time of failure (i.e. transactions belonging to classes TC3
and TC5) which will have to be undone. The detailed actions are as
follows:
7.3.3 No-undo/redo
Using the no-undo/redo algorithm, the recovery manager does not write
uncommitted transactions to the stable DB. The buffer manager is forced
to retain the records in main memory in the DB buffers until commit and
this is known as pinning the buffers. Alternatively, updates can be written
to the log instead of the DB buffers or shadowing can be used. Referring
to Figure 7.2(a), the recovery manager only has to handle transactions of
classes TC2 and TC4, which will have to be redone. No writes by trans-
actions of classes TC3 and TC5 will have reached stable storage. Details
of the actions required are as follows:
begin
STEP 2 CLASSIFY
do while not end-of-log
read next entry from log into log-record
if log-record type = 'begin transaction' or 'abort transaction'
then add transaction identifier for log-record into undo-list
else if log-record type = 'commit transaction'
then remove transaction identifier for log record from
undo-list
end-if
end-if
end-do
STEP 3 RECOVER
do while not end-of-undo-list (working backwards)
undo transaction
end-do
end
commit: either the buffer manager is told it may flush the DB buffers
or, where updates have been written to the log, the after-
images of updated records are transferred to the DB via
the buffers. A commit record is written to the log, or the
transaction identifier is added to the commit-list, if one is
being used.
abort: if updates are being written to the log, then the recovery
manager simply writes an abort record to the log or adds
the transaction identifier to the abort-list; strictly speaking,
neither of these operations is necessary, but they represent
good housekeeping and facilitate subsequent garbage collec-
tion in the log. If the updates are in the DB buffers, then
they are erased and an abort record added to the log or
abort-list.
restart: the recovery manager must perform global undo.
Algorithm 7.3 outlines the procedure for restart under the no-
undo/redo protocol.
Algorithm 7.3 Restart procedure following site failure for no-undo/redo protocol.
begin
STEP 2 CLASSIFY
do until checkpoint record in log is reached (working backwards)
read next entry from log into log-record
if log-record type = 'commit transaction'
then add transaction identifier for log-record into redo-list
end-if
end-do
STEP 3 RECOVER
do while not end-of-redo-list (working forwards)
redo transaction
end-do
end
7.3.4 No-undo/no-redo
In order to avoid having to undo transactions, the recovery manager has
to ensure that no updates of transactions are written to the stable DB
prior to commit, whereas to avoid having to redo transactions, the recov-
ery manager requires that all updates have been written to the stable DB
prior to commit. This apparent paradox can be resolved by writing to the
stable DB in a single atomic action at commit. To do this, the system
uses shadowing as described in Section 7.1.4. Updates are written directly
via the buffers to stable storage (as for in-place updating) but to a separate
part of the stable DB. Addresses are recorded in a shadow address list.
All that is then required at commit is to update the DB indexes to point
to the new area using the shadow address list. This can be implemented
as an atomic operation. No action is required on restart since the stable
DB is guaranteed to reflect the effects of all committed transactions and
none of the uncommitted ones. The before- and after-images of data
objects, which are normally recorded in the log, are provided during
transaction execution by the DB itself and the shadow area, respectively.
A separate log is no longer required for recovery, although, as was
indicated previously, a log may be maintained for other purposes. The
details of no-undo/no-redo actions are as follows:
restart: the shadow address list, which contains all transactions active
at the time of failure, is garbage collected, thereby leaving
the DB indexes as they were.
We assume that every global transaction has one site which will act
as coordinator for that transaction. This will generally be the site at which
the transaction was submitted. Sites at which the global transaction has
(1) Each participant has one vote which can be either 'commit' or
'abort';
(2) Having voted, a participant cannot change its vote;
(3) If a participant votes 'abort' then it is free to abort the transaction
immediately; any site is in fact free to abort a transaction at any
time up until it records a 'commit' vote. Such a transaction abort
is known as a unilateral abort.
(4) If a participant votes 'commit', then it must wait for the coordinator
to broadcast either the 'global-commit' or 'global-abort' message;
(5) If all participants vote 'commit' then the global decision by the
coordinator must be 'commit';
(6) The global decision must be adopted by all participants.
Algorithm 7.4 (a) 2PC coordinator algorithm. (b) 2PC participants algorithm.
(a) begin
then begin
write 'global commit' record to log
send 'global commit' to all participants
end
STEP C3 TERMINATION
do until acknowledgement received from all participants
wait
end-do
write 'end global transaction' record to log
finish
end
(b) begin
STEP P1 VOTE
if vote = 'commit' then send 'commit' to coordinator
else send 'abort' and go to STEP P2b
do until global vote received from coordinator
wait
end-do
STEP P3 TERMINATION
send acknowledgement to coordinator
finish
end
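The failure-free message flow of centralized 2PC can be sketched in a few lines of Python (the participants are simulated as in-memory callables and the message strings are invented; this is an illustration of the rules above, not the book's algorithm):

def two_phase_commit(participants):
    log = ['begin global transaction']
    votes = [vote() for vote in participants]            # phase 1: collect the votes
    # Phase 2: rule 5 -- commit only if every participant voted 'commit'.
    decision = 'global commit' if all(v == 'commit' for v in votes) else 'global abort'
    log.append(decision)
    acknowledgements = [decision for _ in participants]  # rule 6: all adopt the decision
    log.append('end global transaction')
    return decision, log

decision, _ = two_phase_commit([lambda: 'commit', lambda: 'abort', lambda: 'commit'])
assert decision == 'global abort'                        # one unilateral abort is enough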
Figure 7.5(a) Centralized 2PC: the 'vote' request, the participants' votes and the global commit, followed by local commit processing.
The stages of 2PC for global commit and global abort are shown
diagrammatically in Figures 7.5(a) and (b) respectively, while details of
the algorithm are given in Algorithm 7.4(a) for the coordinator and in
Algorithm 7.4(b) for the participants.
2PC involves processes waiting for messages from other sites. To
avoid processes being blocked unnecessarily, a system of timeouts is used.
Timeouts in a distributed system have to be fairly accommodating due to
possible queues at computing facilities and network delays!
At the start (STEP P0 of Algorithm 7.4(b)), the participant waits
for the 'vote' instruction from the coordinator. Since unilateral aborts are
allowed, the participant is free to abort at any time until it actually casts
its vote. Hence, if it fails to receive a vote instruction from the coordinator,
Figure 7.5(b) Centralized 2PC: the 'vote' request followed by a global abort.
(b) Termination
begin
do while P0 is blocked
restart performs the same action as all other participants and that this
restarting can be done independently (i.e. without the need to consult
either the coordinator or the other participants).
Let Pr be the participant process which is attempting to restart
following failure. If Pr had not voted prior to failure, then it can safely
abort unilaterally and recover independently. It can also recover indepen-
dently if it had received the global decision (global-commit or global-
abort) prior to failure. If, however, Pr had voted 'commit' and had not
been informed of the global decision prior to failure, then it cannot
recover independently. It must therefore ask the coordinator (or other
participants) what the global decision was. Algorithm 7.6 shows the restart
protocol.
A number of improvements to the centralized 2PC protocol have
been proposed which attempt to improve its overall performance, either
by reducing the number of messages which need to be exchanged, or by
speeding up the decision making process. These improvements depend
on adopting different ways of exchanging messages, or communication
begin
do while Pr is blocked
Figure 7.6 (a) Centralized 2PC topology. (b) Linear 2PC topology. (c) Distrib-
uted (decentralized) 2PC with four processes.
7.4(a); this would effectively mean that the rightmost node, N, would be
able to broadcast the global decision to all participants in parallel.
The second variation to centralized 2PC, which has been proposed
by Skeen, uses a distributed topology as shown in Figure 7.6(c) and is
therefore known as distributed or decentralized 2PC. Both coordinator
and all participants receive all the votes of all other processes and hence
can make the global decision consistently, but independently.
Algorithm 7.7 (a) 3PC termination co-ordinator algorithm. (b) 3PC participants' algorithm.
(a) begin
STEP CI VOTE
write 'begin transaction' in log
send 'vote' instruction to all participants
do until all votes received
wait
on timeout go to STEP C2b
end-do
STEP C4 TERMINATION
do until acknowledgements received from all participants
wait
end-do
write 'end transaction' entry in log
finish
end
(b) begin
STEP P1 VOTE
if participant is prepared to commit
then send 'commit' message to coordinator
else send 'abort' message to coordinator and go to STEP P2b
do until global instruction received from coordinator
wait
end-do
STEP P3 COMMIT
do until 'global commit' instruction received from coordinator
wait
end-do
perform local commit processing
STEP P4 TERMINATION
send acknowledgement to coordinator
finish
end
Figure: the stages of 3PC for (a) global commit, with the participants 'ready to go either way' and then standing by to commit before local commit processing, and (b) global abort, followed by local abort processing.
unilaterally, then there could not have been a 'global pre-commit' and
hence the global decision again could not have been commit. A more
rigorous proof of the correctness of 3PC and non-blocking nature of 3PC
has been given by Bernstein et al. (1987).
On restart, a process must first ascertain what state it was in prior
to failure, which it does by examining its log. As with 2PC, if it had failed
prior to voting or had unilaterally aborted, then, on restart, it can safely
Algorithm 7.8 (a) 3PC termination protocol for new coordinator. (b) 3PC termination
protocol for participant under new coordinator.
(a) begin
STEP C3 TERMINATION
do until acknowledgements received from all participants
wait
end-do
write 'end transaction' record to log
finish
end
(b) begin
STEP P5 TERMINATION
Execute standard participant's 3PC algorithm with new coordinator
finish
end
from C prior to P1's failure. Assume then that before sending pre-commit
messages to P2, P3 and P4, C also fails. Following an election, one of the
three operational sites left, say P3, is elected the new coordinator. The
result of the termination protocol, since none of the operational sites had
received a global pre-commit instruction from the old coordinator, will
be global-abort. Hence a participant such as P1, which received a global
pre-commit prior to failure cannot recover independently and must on
restart therefore consult other sites. It seeks help in deciding how to
terminate and it can still process a 'global abort' decision correctly since
it has not actually committed the transaction.
In the case of total failure, each participant will attempt to recover
independently and then communicate its decision to all other participants.
If none can recover independently, then the last participant site to fail
applies the termination protocol. How a site knows that it was the last
site to fail is discussed in the next two paragraphs. Note that the termin-
ation protocol is normally only invoked by operational sites.
Once total site failure has occurred, the termination protocol may
only be invoked by the last site to fail, otherwise a decision could be
taken to commit or abort a transaction which is inconsistent with the
action already taken by the now-failed last process. Assume that each
site, i, maintains a list of operational sites, OPi. This can easily be
accomplished by appending a list of participants to the vote instruction
from the coordinator, as suggested above. These lists are updated by sites
as failures are detected. As sites recover from total failure, they can
invoke the termination protocol if and only if the set of recovered oper-
ational sites, RS, includes the last process to fail. This condition can be
easily verified, as the last site to fail must be common to each OPi. Hence,
for the set of sites, RS, to ensure consistent recovery, RS must contain
all those common sites, i.e.
RS ⊇ ∩ OPi,  i ∈ RS
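As an illustration of this condition (site numbers and the failure order below are invented), a few lines of Python check whether a set of recovered sites RS may invoke the termination protocol:

def can_invoke_termination(RS, OP):
    # RS may proceed only if it contains every site that all members of RS
    # still believed operational, i.e. RS ⊇ ∩ OPi for i in RS.
    common = set.intersection(*(OP[i] for i in RS))
    return common <= set(RS)

# Sites 1, 2 and 3 failed in that order, updating their OP lists as they went.
OP = {1: {1, 2, 3}, 2: {2, 3}, 3: {3}}
assert can_invoke_termination({2, 3}, OP)        # the last site to fail (3) is back
assert not can_invoke_termination({1, 2}, OP)    # site 3 may have decided alone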
SUMMARY
The ability of a (D)DBMS to be able to recover automatically, without
user intervention, from unpredictable hardware and software failures is
difficult and expensive to achieve. This job is the responsibility of the
recovery manager. It is estimated, for example, that some 10% of the
code in System R was devoted to recovery and furthermore this
particular 10% of code was some of the most complex in the whole
system. The recovery manager has to cater for a wide range of failures
from a user-induced transaction rollback to a total failure of the entire
system.
EXERCISES
7.1 List the main causes of failure in a DDBMS and describe briefly what action
the recovery manager must take to recover in each case.
7.2 Write algorithms for restart under the following recovery protocols
(a) undo/redo
(b) no-undo/redo
(c) undo/no-redo
(d) no-undo/no-redo.
7.3 Explain in detail the 2PC protocol.
7.4 Explain in detail the 3PC protocol.
7.8 What are the implications of synchronous logging under the undo/redo
protocol?
7.9 How many messages and rounds are required for n participants plus one
coordinator (i.e. n+1 sites), in the absence of failures, for each of the
following protocols:
(a) centralized 2PC
(b) distributed 2PC
(c) linear 2PC
(d) 3PC.
7.10 Explain, by example, how communication failures can cause sites in different
partitions to make inconsistent decisions (i.e. sites in one partition abort a
transaction, while sites in a different partition commit the transaction) under
3PC.
Bibliography
Agrawal R. and DeWitt D.J. (1985). Integrated concurrency control and
recovery mechanisms: design and performance evaluation. ACM TODS, 10(4),
529-64.
SOLUTIONS TO EXERCISES
7.7
Algorithm 7.9 (a) Coordinator's algorithm for distributed 2PC (Exercise 7.7).
(b) Participants' algorithm for distributed 2PC (Exercise 7.7).
(a) begin
then begin
write 'global commit' record to log
end
STEP C3 TERMINATION
write 'end global transaction' record to log
finish
end
(b) begin
STEP P1 VOTE
if vote = 'commit' then send 'commit' to coordinator
else send 'abort' and go to STEP P2b
do until global vote received from co-ordinator
wait
end-do
STEP P3 TERMINATION
finish
end
7.8
Fewer redos would be needed.
7.9
(a) Centralized 2PC
Three rounds: (1) coordinator issues vote instruction
(2) participants cast their votes
(3) coordinator broadcasts the decision.
3n messages: n per round.
(b) Distributed 2PC
Two rounds: (1) coordinator broadcasts its vote
(2) participants vote.
n + n² messages: n for round 1 and n² for round 2.
(c) Linear 2PC
2n rounds: (1) to pass vote instruction and decision to
each participant (no broadcasting of messages in parallel).
2n messages: one per round.
(d) 3PC
Five rounds:
(1) coordinator issues vote instruction
(2) participants cast their votes
(3) coordinator broadcasts pre-commit
(4) participants acknowledge pre-commit
(5) coordinator broadcasts global commit.
5n messages: n per round.
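The counts above, for n participants plus one coordinator in the failure-free case, can be written as a small helper (illustrative only):

def message_count(protocol, n):
    # Failure-free message totals for n participants plus one coordinator.
    return {'centralized 2PC': 3 * n,      # vote request, votes, decision
            'distributed 2PC': n + n * n,  # coordinator's broadcast, then all-to-all votes
            'linear 2PC': 2 * n,           # one message per round along the chain
            '3PC': 5 * n}[protocol]        # extra pre-commit and acknowledgement rounds

assert message_count('centralized 2PC', 4) == 12
assert message_count('3PC', 4) == 20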
7.10
Suppose that a communications failure has resulted in sites being partitioned
into two separate partitions, P1 and P2. It is possible for all sites in P1 to be
ready to commit (i.e. they have received the pre-commit instruction), while
sites in P2 are uncertain by virtue of the fact that the network partitioned
before they received the pre-commit instruction. According to the 3PC
termination protocol, sites in P1 will commit, while those in P2 will abort.
7.11
(a) Failure of (B, F) would result in two partitions:
{F, C, D, E, G} and {A, B, H}
Failure of (F, C) would result in two partitions:
{C, G, D, E} and {A, B, H, F}
Failure of (C, D) would result in two partitions:
{D, E, G} and {A, B, H, F, C}
8.1 Introduction
The integrity of a DB is concerned with its consistency, correctness,
validity and accuracy. DBs generally model real-world organizations such
as banks, insurance companies and hospitals, and the state of the DB, if
it were to be frozen at a particular point in time, should accurately reflect
a real-world state. This freezing must be done when the DB is quiescent
(i.e. no transactions active). Since we are not talking about real-time
systems, we cannot say that the state of the DB corresponds exactly to a
real-world state at a given point in time. We are concerned with integrity
of the DB at a higher level. Looking at it in another way, we can say
that DB integrity is concerned with whether or not the state of the DB
obeys the rules of the organization it models. These rules, called integrity
rules or integrity constraints, take the form of general statements govern-
ing the way the organization works.
Integrity can be viewed as addressing the issue of accidental corrup-
tion of the DB, for example, by inserting an invalid patient#, by failure
of a concurrency control algorithm to generate serializable schedules, or
the recovery manager not restoring the DB correctly following failure.
Security, on the other hand, is concerned with deliberate attempts to gain
unauthorized access to the data and possibly alter it in some way. This
chapter reviews DB integrity issues for centralized DBMSs and then
outlines how these methods can be transferred to DDBs. This is followed
by a similar overview of security in centralized DBMSs. Security is con-
cerned with ensuring that the only operations which are accepted by the
(D)DBMS are from users who are authorized to perform those operations
on the data in question. For example, in a university DB, a student would
normally be allowed to read their own record in its entirety, but only the
lecturer on a particular course would be allowed to alter the student's
grade in that course. Thus while the motivations for DB integrity and
security are quite different, similar techniques can be used to assist in
safeguarding both.
The chapter ends with a section on security in DDBMSs, where
the existence of the underlying network must be taken into consideration.
and no two patients will have the same patient#. Such a constraint is
called a relation constraint.
Finally, Rule 3 could be specified as 'patient# is an integer in the
range 1 to 99999'. This is an example of a domain constraint.
There are a number of different aspects to integrity in addition to
domain, relational and referential integrity. Concurrency control and
recovery are very much concerned with ensuring the integrity of the DB
through transaction atomicity and durability. However, these topics have
already been discussed in detail in Chapters 6 and 7 and this chapter will
therefore concentrate on other integrity issues.
(1) Domain
(2) Relation
(3) Referential
(4) Explicit.
The first three are often grouped together and referred to as implicit
constraints because they are an integral part of the relational data model.
Relation constraints simply define the relation and its attributes and are
supported by all RDBMSs. Domain constraints define the underlying
domains on which individual attributes are defined and these are not
explicitly supported by all RDBMSs, although they should be. Referential
integrity constraints (see Section 2.2.4) are also not universally supported,
although most vendors of RDBMSs promise to provide such support in
their next release! Only a few systems, mainly research prototypes, sup-
port the specification of explicit constraints and usually only in a limited
way. Generally, explicit constraints are imposed by the rules of the real
world and are not directly related to the relational model itself. For
example, in a banking system the accounts of customers with a poor credit
rating are not allowed to be overdrawn. Such explicit constraints can also
be used to trigger a specific action. For example, if a stock level falls
below the reorder level, then an order is automatically produced. We will
see examples of different types of integrity constraints later.
The integrity subsystem of the DBMS is conceptually responsible
for enforcing integrity constraints. It has to detect violations and, in the
event of a violation, take appropriate action. In the absence of
failures, and assuming correct concurrency control, the only way in which
the integrity of a DB can be compromised is as a result of an update
operation. The integrity subsystem must therefore monitor all update
operations. In a large multi-user DB environment, data will be updated
Recall that while primary keys, or parts thereof, are not allowed
to be null, foreign keys may be null. Whether it makes sense to allow
null foreign keys will depend on the rules governing the application. For
example, patient# in the LABREQ relation is a foreign key of the
INPATIENT relation, but it would not make sense to have a null value
for patient# in LABREQ, since we would have no way of telling for
which patient the test had been ordered. Of course, in this particular
case, patient# in LABREQ is in fact part of a composite primary key
{patient#, test-type} of LABREQ and hence by definition is not allowed
to be null. Assume in addition to the INPATIENT relation
INPATIENT (patient#, name, date-of-birth, address,
sex, gpn)
there is a second relation containing information about GPs:
GPLIST (gpname, gpaddress, gptelno)
(1) Disallow the deletion of primary keys as long as there are foreign
keys referencing that primary key (RESTRICTED);
(2) Deletion of the primary key has a cascading effect on all tuples
whose foreign key references that primary key, and they too are
deleted (CASCADES);
(3) Deletion of the primary key results in the referencing foreign keys
being set to null (NULLIFIES).
(1) Disallow the update as long as there are foreign keys referencing
the primary key (RESTRICTED);
(2) Update of the primary key cascades to the corresponding foreign
keys, which are also updates (CASCADES);
(3) Update of the primary key results in the referencing foreign keys
being set to null (NULLIFIES).
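To make the delete options above concrete, here is a small Python sketch using the INPATIENT and GPLIST relations introduced earlier (the data values and function name are invented; a real system enforces these rules declaratively in the schema):

def delete_gp(gplist, inpatient, name, option):
    # Enforce one of the three delete rules for the foreign key gpn in
    # INPATIENT, which references gpname in GPLIST.
    if option == 'RESTRICTED':
        if any(p['gpn'] == name for p in inpatient):
            raise ValueError('deletion rejected: referencing tuples exist')
    elif option == 'CASCADES':
        inpatient[:] = [p for p in inpatient if p['gpn'] != name]
    elif option == 'NULLIFIES':
        for p in inpatient:
            if p['gpn'] == name:
                p['gpn'] = None
    gplist[:] = [g for g in gplist if g['gpname'] != name]

gplist = [{'gpname': 'Dr Ryan', 'gpaddress': 'Dublin', 'gptelno': '000000'}]
inpatient = [{'patient#': 12345, 'name': 'Murphy', 'gpn': 'Dr Ryan'}]
delete_gp(gplist, inpatient, 'Dr Ryan', 'NULLIFIES')
assert inpatient[0]['gpn'] is None and gplist == []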
The choice of which of the three approaches for both update and
deletion is appropriate will depend on the application, and it is likely
that different options will be specified for different foreign keys in the
same DB. For example, to our definition of the INPATIENT relation,
we could add the following foreign key definition:
ASSERT constraint-name
ON relations-names: condition
as
DATA OBJECTS
USERS    R1    R2    R3    R4
(1) User A can read object x if and only if clearance(A) ≥ classification(x);
(2) User A can update object x if and only if clearance(A) = classifi-
cation(x).
We wish to form a view over these two relations which contains the names
and addresses of all students who passed course 1BA1 in 1990. This can
be accomplished using the following CREATE command:
For example:
If USER-6 then issues a query to list all the female patients of the
INPATIENT relation in QUEL, that is
the PERSON relation satisfying the search predicate so the system will
respond:
Having located Patrick Murphy's tuple in the relation, the user can then
issue the following legitimate query to obtain the income:
The SQL SUM function calculates the sum of the values in a given column
of a relation.
Of course it is unlikely that users would be allowed to issue search
predicates involving names of individuals in such an application. However,
if the user knows something about an individual, such as their profession
or their date of birth, it is possible for the user to experiment with different
search predicates until eventually locating a single tuple. Such a search
predicate is known as a single tracker because it enables the user to track
down an individual tuple. The existence of a single tracker potentially
compromises the security of the DB.
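A small Python sketch shows the essence of the attack (the table, names and salaries below are invented): even when only aggregate queries are answered, the difference between two legal SUMs isolates the individual identified by the tracker predicate.

rows = [{'name': 'Murphy', 'job': 'programmer', 'city': 'Dublin', 'salary': 30000},
        {'name': 'Kelly',  'job': 'programmer', 'city': 'Cork',   'salary': 28000},
        {'name': 'Byrne',  'job': 'analyst',    'city': 'Dublin', 'salary': 32000},
        {'name': 'Walsh',  'job': 'programmer', 'city': 'Galway', 'salary': 27000}]

def sum_query(predicate):
    # The only kind of query the statistical DB is prepared to answer.
    return sum(r['salary'] for r in rows if predicate(r))

tracker = lambda r: r['job'] == 'programmer' and r['city'] == 'Dublin'   # isolates Murphy
disclosed = sum_query(lambda r: True) - sum_query(lambda r: not tracker(r))
assert disclosed == 30000    # Murphy's salary, obtained from two 'harmless' aggregates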
A detailed discussion of statistical DBs is beyond the scope of this
book, and for further information, the reader is referred to Date (1982)
and Fernández et al. (1981), and for a discussion on trackers to Denning
et al. (1979). As yet the security problems associated with statistical DBs
have not been fully and satisfactorily solved. Although this might seem
like a peripheral and rather specialized issue, our research has shown that
there is great potential for the use of multidatabase technology in the
medical domain, specifically for the purpose of building large statistical
DBs for epidemiological research.
8.5.3 Encryption
Encryption is intended to overcome the problem of people who bypass
the security controls of the DBMS and gain direct access to the DB or
sent by anyone since everyone has access to eB. Thus instead of simply
encrypting the message, A first of all applies its own private deciphering
algorithm to the message, then encrypts it using B's encryption procedure
and sends it to B, that is eB (dA (m)). By applying the inverse functions
to the message, B can not only decipher the message but also be certain
that A sent it since only A knows dA. B first applies the deciphering
procedure, dB, to 'undo' the effect of A's eB, and then 'undoes' the effect
of A's dA by applying the public procedure eA to yield the message m.
Simply stated,
eA(dB(eB(dA(m)))) = m
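The same composition can be walked through with a toy RSA-style key pair for each of A and B (the primes and message below are tiny, insecure and purely illustrative; a real system would use a standard cryptographic library):

def make_keys(p, q, e):
    # Toy RSA key generation: e is the public exponent, d its inverse mod phi.
    n, phi = p * q, (p - 1) * (q - 1)
    d = pow(e, -1, phi)                  # modular inverse (Python 3.8+)
    return (e, n), (d, n)                # (public procedure, private procedure)

eA, dA = make_keys(61, 53, 17)           # A's key pair (nA = 3233)
eB, dB = make_keys(101, 113, 3)          # B's key pair (nB = 11413 > nA, so dA(m) fits)

def apply_key(key, x):
    exponent, modulus = key
    return pow(x, exponent, modulus)     # both e and d are modular exponentiations

m = 65                                   # the message, m < nA
c = apply_key(eB, apply_key(dA, m))      # A sends eB(dA(m))
recovered = apply_key(eA, apply_key(dB, c))   # B applies dB, then the public eA
assert recovered == m                    # the message is recovered and authenticated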
SUMMARY
In this chapter, we have reviewed the basic concepts and issues relating
to both integrity and security. Under the heading of integrity, we have
seen how integrity constraints are used to implement the rules of the
real-world organization which the DB is modelling. The transfer of these
ideas to homogeneous DDBMSs presents few problems, since the
constraints can always be incorporated into the system catalogue in just
the same way as for centralized DBMSs. The specification and
enforcement of integrity constraints for MDBMSs with local autonomy is
still, however, an open research issue. We looked at a number of
problems including inconsistencies between local constraints, the
specification of global constraints, and inconsistencies between local and
global constraints. Existing centralized DBMSs generally fall far short of
providing full integrity support, and in the case of MDBMS, there is
virtually no support at all. Enforcement of integrity is left to the
application programmers, who have to 'hardcode' the constraints and
their enforcement into the application programs. The run-time checking
of integrity constraints imposes an additional overhead on normal update
operations, but it is generally more efficient to allow the system to do the
checking, when the process can be optimized, than to leave it to the
programmer. In a centralized DBMS, a new integrity constraint will only
be accepted if the current DB state does not violate the new constraint
and does not contradict any existing constraints. For example, if a
constraint was specified that limited the maximum number of hours an
employee could work in one week to, say 60 hours, then prior to
inserting this constraint into the system catalogue thereby activating it,
the integrity checker will first verify that no current employee has
worked more than 60 hours in a week. If such an employee is found,
then the new constraint will be rejected. Even in a centralized system,
such checking of new constraints could be tedious and time-consuming.
In a large, distributed DB, this checking would involve validation against
local integrity constraints and local DB states across several sites. So it is
easy to see why the provision of integrity in MDBMSs at the global level
is still an open issue.
Integrity and security are related - the former is concerned with
accidental corruption of the DB leading to inconsistencies, while the
latter is concerned with deliberate tampering with, or unauthorized access
to, the data. DDBMSs rely, in the main, on the local security facilities,
both organizational through a security policy, and technical through
access controls on the data. Distribution adds a new dimension in that
data must be transmitted across potentially insecure networks. The most
widespread technical solution to this problem is to use encryption. As
has been indicated in this chapter, it is possible to develop quite
elaborate security measures at the local level, but if all sites in the
network do not enforce the same level of security, then security will
inevitably be compromised. As with integrity, security in MDBMSs with
full nodal autonomy for security policy is still an open, unresolved issue.
EXERCISES
8.1 A simple student DB is being established containing information about
students and the courses they are taking. The following information is to be
stored:
For each student: Student number (unique), name, address together
with a list of the courses being taken by the student
and the grades obtained.
For each course: Course number (unique), title and lecturer.
Design a plausible set of domain, relation and referential constraints for this
DB.
8.2 For the DB given in Exercise 8.1, define a trigger constraint which will notify
a student if they have failed a course (grade = 'F').
8.3 Discuss the problems associated with specifying and enforcing global integrity
constraints for MDBMSs.
8.4 Discuss the four main types of security policies for access control within
organizations.
8.7 What advantages does the view mechanism have over the authorization matrix
approach as a security measure?
8.8 Queries on the statistical DB, STATDB, shown in Table 8.2 are restricted to
COUNT and SUM. COUNT returns the cardinality (number of tuples) of the
result of a query and SUM returns the arithmetic sum of the values of an
attribute (or set of attributes) satisfying the query. As an added security
measure, only queries where the cardinality of the result is greater than 2 and
less than 8 are allowed. The intention of this constraint is to prevent users
from identifying small subsets of records. Devise a series of statistical SQL
queries, using only COUNT and SUM, which will disclose Murphy's salary,
given that the user knows that Murphy is a programmer and lives in Dublin.
STATDB
8.10 What advantages do public key cryptosystems have over the Data Encryption
Standard?
Bibliography
ANSI (1986). American National Standard for Information Systems, Database
Language SQL, ANSI X3.135-1986.
Date C.J. (1982). An Introduction to Database Systems. Vol. II. Wokingham:
Addison-Wesley.
Chapter 4 of this book contains an overview of security issues for centralized
DBs.
Date C.J. (1990). A Guide to the SQL Standard, 2nd edn. Reading, MA:
Addison-Wesley.
Date C.J. (1990a). Referential Integrity and Foreign Keys. Part I: Basic
concepts; Part II: Further considerations. In Relational Database Writings.
Wokingham: Addison-Wesley.
SOLUTIONS TO EXERCISES
8.1
8.2
8.5
Table 8.3 Authorisation matrix for Exercise 8.5.
DATA OBJECTS
8.8
SELECT COUNT (*)
FROM STATDB
WHERE job = 'programmer';
Response: 4
8.11
9.1 Introduction
This chapter is divided into two separate, but related, parts. The first part
deals with logical DDB design, while the second discusses the adminis-
tration of a DDB. In Chapter 4, we examined the problems associated
with physical DDB design (i.e. how best to distribute the data amongst
the sites in order to improve overall performance). Of course, in the case
of a multidatabase system, which integrates pre-existing DBs, the physical
layout of the data is already fixed and cannot be altered. It is the job of
the query optimizer alone to obtain the 'optimum' performance. The
design of the global conceptual schema (logical DDB) design is by contrast
applicable to both homogeneous DDBs and MDBs.
The role of the DDB administrator is to decide initially what data
to include in the DDB, how to structure it and subsequently how to
manage it. Hence logical design of a DDB is part of this remit. For
simplicity, we will present the issues of DDB design and management
separately.
We begin with an overview of the software life cycle to see where
the DB and DDB design processes fit into the overall framework of
systems development. We differentiate between DDB design for homo-
geneous and heterogeneous DDBs, since each presents quite separate
(1) During the feasibility study phase, a careful analysis of the feasibility
and potential of the proposed software system is carried out;
(2) The requirements collection and analysis phase involves extensive
discussion between system designer(s) and the end-users of the
system. The objective of the designer is to gain a detailed under-
standing of what the proposed system is intended to do, what data
is needed by the various applications of the system, and what
processes are to be performed on the data;
(3) The design phase involves the detailed design of the system. In the
case of (D)DB systems, it involves the design of the (D)DB itself
and of the applications which access the (D)DB;
(4) During the implementation phase the system is fully implemented
and tested; in a (D)DB system, this phase includes initial loading
of the (D)DB;
(5) Phase 2 produces a detailed requirements specification for the
system and phase 5, the validation and acceptance testing phase, is
concerned with evaluating the newly developed system against those
requirements. This evaluation includes both the functional and
performance requirements;
(6) Operation is the final phase of the software life cycle when the
system 'goes live'.
Although these phases are well defined and involve quite different
processes, there are often feedback loops, especially between the earlier
phases. However, if a problem is found during the operational phase,
which requires a major change to the system because the requirements in
stage 2 were not properly specified, the cost of rectifying the error can
be very high. New applications will be developed during the operational
phase and this is a normal part of the software life cycle. However, the
system should be able to integrate these new applications smoothly,
without requiring major restructuring of the system. In recent years, many
tools have been developed to support various phases of the life cycle, in
particular CASE (computer-assisted software engineering) tools, which
are mainly aimed at phases 2 to 4. They often provide facilities for rapid
prototyping which are very useful in the proper and accurate capture and
specification of user requirements, which is a notoriously difficult task.
In the context of this chapter, we are concerned with phase 3 of the life
cycle, the design phase.
In Section 2.2.6, we introduced the process of normalization of
relations. Normalization represents the final phase of the logical DB
design process. It is preceded by the mapping of the enterprise data
model onto relational tables. The enterprise data model is an abstract
representation of the entities of interest in an organization, together with
the relationships (1 : 1, reflexive, 1 : n, m : n, etc.) between those enti-
ties. The enterprise model is independent of any particular DBMS or DB
model. One of the most common representation methods for enterprise
models is based on the entity-relationship (ER) approach. Many CASE
tools offer the possibility of converting the formalized enterprise model
automatically into schema definitions for a target DBMS (centralized),
without any 'human intervention'. If the functional dependencies are
known, it is possible to ensure that only normalized tables are generated.
Thus, in terms of the software life cycle, the output of phase 2 includes
an enterprise model which can be converted relatively simply into a set
of normalized tables. At this stage, the user views, and the applications
which access the (D)DB through those views, are designed. The final design
stage is then the physical DB or internal schema design. It is really only
at this stage that distribution aspects should come into the picture.
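As a rough illustration of this conversion step, the following Python sketch
(the entity, attribute and relationship names are invented for the example,
and no particular CASE tool works exactly this way) maps a tiny enterprise
model onto relational table definitions by posting keys across 1 : n
relationships and generating a separate table for each m : n relationship:

    # Illustrative sketch only: mapping a small enterprise model onto tables.
    entities = {
        'PATIENT':    {'key': ['patient_no'],    'attrs': ['name', 'address']},
        'CONSULTANT': {'key': ['consultant_no'], 'attrs': ['cname', 'caddress']},
    }
    relationships = [
        # A 1:n referral relationship: one consultant, many patients.
        {'name': 'REFERRAL', 'kind': '1:n', 'one': 'CONSULTANT', 'many': 'PATIENT'},
    ]

    def enterprise_to_tables(entities, relationships):
        # Each entity becomes a table of its key plus its attributes.
        tables = {name: e['key'] + e['attrs'] for name, e in entities.items()}
        for r in relationships:
            if r['kind'] == '1:n':
                # Post the key of the 'one' side into the 'many' side.
                tables[r['many']] = tables[r['many']] + entities[r['one']]['key']
            elif r['kind'] == 'm:n':
                # An m:n relationship becomes a table holding both keys.
                tables[r['name']] = (entities[r['one']]['key'] +
                                     entities[r['many']]['key'])
        return tables

    for name, columns in enterprise_to_tables(entities, relationships).items():
        print(name, columns)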
Given a collection of local DBs, what facilities does the MDBS offer to integrate those
databases and produce a global schema? Ideally an integrator's workbench
should be provided, the output of which is the global data dictionary (see
Section 9.6.3 below), the global and participation schemas, mapping rules
and auxiliary DBs. However, in order to design this workbench, it is
necessary first to develop a methodology for performing the integration
on which the workbench can be built. Consequently, most work to date
in this area has focused on the methodological aspects of database inte-
gration, rather than on the provision of tool support.
Schema integration in an MDB is a complex task. The problems
arise from the structural and semantic differences between the local
schemas. These local schemas have been developed independently follow-
ing not only different methodologies but also different philosophies with
regard to information systems development.
At a basic level, different names may be assigned to the same
concept and vice versa. Furthermore, data can be represented at different
levels of abstraction. One schema might view a data object as the attribute
of an entity, whereas another might view it as an entity in its own right.
The problems associated with database integration and the stages
required to produce an integrated global schema are best understood by
working through an example. The example presented below in Figures
9.2(a) to (f) is very simple, but serves as a guide to the issues involved.
There are two local schemas, Schema A, which is the schema for the
outpatient department of a hospital and Schema B, which models a G.P.'s
DB for these patients. Figure 9.2(a) shows the original schemas. The
objective is to produce a single, integrated, uniform global schema from
the two local schemas as shown in Figure 9.2(f). In Figure 9.2(b) we have
identified that result in Schema B is a synonym for diagnosis in Schema
A and we have chosen to use diagnosis at the global level and we
must therefore replace result in Schema B with diagnosis. In Schema A,
consultant is an attribute of the Outpatient entity, whereas in Schema B,
the consultant is represented by the entity Consultant Referral. If we
decide that it is more appropriate to make consultant an entity at the
global level, then we must make consultant in Schema A into an entity
as shown in Figure 9.2(c).
We are now ready to superimpose (merge) the two schemas using
the Consultant Referral entity as the link between them, as shown in
Figure 9.2(d). Next we recognize that the entity Patient is in fact a subset
of the Outpatient entity and we can therefore create a subset relationship
as shown in Figure 9.2(e). Finally, we note that there are certain properties
in common between the Patient and Outpatient entities and since Patient
is a subset of Outpatient, we can drop these common properties from
Patient, leaving it with only the properties which are peculiar to it. The
final integrated model of the global conceptual schema is shown in Figure
9.2(f).
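The transformations of Figure 9.2 can also be expressed, very informally, as
operations on simple schema descriptions. The Python sketch below uses our own
dictionary-based representation (step (c), the promotion of consultant to an
entity, is omitted for brevity) and simply renames the synonym, superimposes the
two schemas and drops the properties Patient shares with Outpatient:

    # Simplified sketch of the integration steps of Figure 9.2.
    schema_a = {'OUTPATIENT': ['name', 'number', 'address', 'consultant',
                               'diagnosis', 'ecg', 'eeg', 'pathology']}
    schema_b = {'PATIENT': ['name', 'address', 'married', 'children',
                            'cname', 'caddress', 'result']}

    def rename_attribute(schema, entity, old, new):
        # Step (b): resolve synonyms, e.g. result -> diagnosis.
        schema[entity] = [new if a == old else a for a in schema[entity]]

    def merge(s1, s2):
        # Step (d): superimpose the two schemas.
        merged = dict(s1)
        merged.update(s2)
        return merged

    def make_subset(schema, subset, superset):
        # Steps (e) and (f): record the subset relationship and drop the
        # properties the subset shares with its superset.
        schema[subset] = [a for a in schema[subset] if a not in schema[superset]]
        schema.setdefault('SUBSET_OF', []).append((subset, superset))

    rename_attribute(schema_b, 'PATIENT', 'result', 'diagnosis')
    global_schema = merge(schema_a, schema_b)
    make_subset(global_schema, 'PATIENT', 'OUTPATIENT')
    print(global_schema)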
Alternatively, we could model these two entities by a single global entity
called global-patient, which has all the attributes of both Outpatient and Patient.
Figure 9.2 (a) Original schemas A and B before integration. (b) Choose diagnosis
for result in B. (c) Make CONSULTANT REFERRAL into an ENTITY in A.
(d) Superimposition of schemas. (e) Creation of a subset relationship. (f) Drop
properties of PATIENT common to OUTPATIENT.
At this point, it would be useful to enter the data into the dictionary,
which would then form the basis for the global data dictionary (see Section
9.6.3) for the multidatabase. Where a local node already has a data
dictionary and/or ER model, the process can clearly be speeded up and
greatly simplified.
[Table 9.1: example tuples from the DUBLIN and CORK relations (contents not reproduced).]
If arr/dep = 'arr' then the airport attribute records the source of the flight
and the destination is Dublin. If arr/dep = 'dep' then the airport attribute
records the destination of the flight and the source is Dublin. The CORK DB is much
more straightforward, simply recording the source and destination for
each flight. An example of the relations is given in Table 9.1 and the
corresponding ER diagrams in Figure 9.4.
Assume that the global relation, FLIGHT-INFO, which is the
view containing all the information from both the DUBLIN and CORK
relations, has the same format as the CORK relation. The mapping from the
local relations to FLIGHT-INFO can then be expressed as follows:
begin
    do while not end-of-relation DUBLIN
        FLIGHT-INFO.flight-no = DUBLIN.flight-no
        FLIGHT-INFO.date = DUBLIN.date
        FLIGHT-INFO.dep-time = DUBLIN.dep-time
        FLIGHT-INFO.arrival-time = DUBLIN.arrival-time
        if DUBLIN.arr/dep = 'arr'
        then begin
            FLIGHT-INFO.source = DUBLIN.airport
            FLIGHT-INFO.destination = 'Dublin'
        end
        else begin
            FLIGHT-INFO.source = 'Dublin'
            FLIGHT-INFO.destination = DUBLIN.airport
        end
        end-if
        next tuple
    end-do
    FLIGHT-INFO = FLIGHT-INFO UNION CORK
end
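The same mapping can be made concrete in an ordinary programming notation. The
Python sketch below is purely illustrative: the tuple contents are invented and
tuples are represented as dictionaries rather than as stored relations:

    # Sketch of the DUBLIN -> FLIGHT-INFO mapping; relation contents are invented.
    dublin = [
        {'flight_no': 'EI123', 'date': '1-6-92', 'dep_time': '09:00',
         'arrival_time': '10:10', 'arr_dep': 'arr', 'airport': 'London'},
        {'flight_no': 'EI456', 'date': '1-6-92', 'dep_time': '11:00',
         'arrival_time': '12:05', 'arr_dep': 'dep', 'airport': 'Paris'},
    ]
    cork = [
        {'flight_no': 'EI789', 'date': '1-6-92', 'dep_time': '14:00',
         'arrival_time': '15:00', 'source': 'Cork', 'destination': 'London'},
    ]

    def dublin_to_flight_info(t):
        # For an arrival the airport is the source and Dublin the destination;
        # for a departure the roles are reversed.
        if t['arr_dep'] == 'arr':
            source, destination = t['airport'], 'Dublin'
        else:
            source, destination = 'Dublin', t['airport']
        return {'flight_no': t['flight_no'], 'date': t['date'],
                'dep_time': t['dep_time'], 'arrival_time': t['arrival_time'],
                'source': source, 'destination': destination}

    # The global view is the mapped DUBLIN tuples unioned with CORK.
    flight_info = [dublin_to_flight_info(t) for t in dublin] + cork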
over part of the DDB only. Where there are no export schemas (as in
Figure 2.22), there is effectively no logical DDB design stage at all, and
the federation is much more loosely coupled. Users define their own views
at run-time using powerful query languages, such as MSQL developed by
Litwin et al. There is also considerable interest in the development of
standards for DBs which make them self-describing. Such standards would
be of enormous benefit to such loosely-coupled database systems.
have been performed by the DBA would now become the remit of
the data administrator (DA). As indicated above, DB administration is
increasingly a highly technical, specialist function related to the physical
management of a single DB or group of DBs. Generally speaking, there-
fore, the administration of a DDB is more likely to be viewed as the
function of a DA rather than a DBA. In this chapter, therefore, we will
refer to administration of the local DBs as local database administration
and administration at a global level as data administration.
is responsible for seeing that the MDB network functions reliably. Also,
the notion that each local DB is completely and absolutely autonomous
and independent is probably not a very common scenario. After all, if
there are global users of the MDB, then there must be some organizational
coupling, however loose, between the individual nodes. A common use
for MDB technology would be in organizations with strong departmental
or branch autonomy, where departments had invested in a variety of
centralized DBMSs, but where there is a need to support some appli-
cations which cross organizational boundaries, as for example in the field
of strategic planning. In such a situation, there is definitely a notion of
control over the MDB as a whole (i.e. global control). This control would
be vested in the hands of the DA, who would liaise with the local DBAs
and would report to corporate senior management.
The only realistic application scenario where such a notion of overall
control of the MDB appears absent is where the MDB is being used to
assist in the provision of summary information from the local DBs to a
third party. This situation could arise, for example, in the health field,
where each hospital reports on a periodic basis to a regional health
authority, giving statistical (depersonalized) information, such as number
of inpatients, average length of stay, diagnosis, treatment required and
so on. Each hospital will have its own independent, autonomous DB.
There can be no doubt that in such an application, confidentiality of medical
data requires that nodal autonomy is supreme. It is most unlikely, given
the nature of the interaction between the systems, that a global schema
would be required in such a scenario. The various sites, including the
regional site, would simply agree protocols amongst themselves, which
would then allow them to cooperate together in a loosely-coupled feder-
ation.
Thus the functions of the DDBA incorporate all 11 of the items listed
above and those of a DBA of a centralized DB only the first 7 items.
Since actual data does not reside at the global level - only meta-
data - the global DA's role is quite different from that of the local DBA.
The principal functions of the DA can be summarized as follows:
paid every week, after tax, social insurance, health insurance, pension
contribution and so on have been deducted. To the tax inspector, the
WAGE is the gross pay before any deductions have been made. To the
employer, on the other hand, the WAGE might mean the amount that
has to be paid out each week to and on behalf of an employee. For
example, it would include not only the gross wage of the employee prior
to any deductions, but also any contributions made on the employee's
behalf, such as employer's contributions to social welfare, pension schemes
and so on. Hence there is plenty of room for confusion even with the
relatively simple concept of a weekly WAGE. It was to overcome this
type of problem and also to provide much more extensive and flexible
management of meta-data that data dictionary systems (DDS) were
developed. Note that some writers differentiate between the data diction-
ary (inventory) and data directory (location) aspects of this system, calling
the combined system a data dictionary/directory system. However, in this
chapter, we prefer to use the term data dictionary (DD) to encompass
both these functions.
The principal aim of any DDS, whether in a centralized or distri-
buted environment, is to document all aspects of the data resource. To
perform this task successfully and also to ensure that the DD is regarded
as the sole authority for the definition and description of meta-data
within the organization, it must provide a number of facilities, which are
summarized below. Note that the catalogues are effectively a subset of
the DDS.
Figure 9.5 (a) Independent DDS. (b) Embedded DDS. (c) DB-application DDS.
need to maintain its own internal DDL tables, including meta-data for
the meta-data which is stored in the data dictionary DB! Meta-data for
non-DB systems can also be included. However, this approach suffers
from the same drawback as the independent DDS, since once again meta-
data is duplicated between the DD and the DBMS(s).
Such a classification of DDSs is helpful, but in recent years a
distinction has been made which is more useful in the context of DDBMSs.
DDs are classified as either active or passive. An active DD is one which
is accessed by the DBMS in order to process a user query. An embedded
DDS is clearly active. A passive DD, on the other hand, is one which
essentially provides documentation of the data resource. It is not accessed
on-line by the DBMSs, or by any other system whose meta-data it man-
ages. Independent DDSs are generally passive. Indeed, it would not be
realistic for them to be active since the performance of the overall systems
would be unacceptable.
The advantages of active DDs are obvious. They provide a single
repository of meta-data, thereby ensuring its integrity. However, a passive
DD allows for a more flexible evolution of the system in that it does not
depend on a single DBMS package for its operation. A passive DD can
be made to appear to the end-users as an active DD, thereby going some
way towards achieving the advantages of both approaches. To do this,
procedures are established, automatically enforced through the use of security
constraints on the meta-data in the DBMS, which effectively force all
updates to meta-data to be applied to the DD first. The
DDS automatically produces the necessary transactions to update the
meta-data in the DBMSs. Access to the DBMS meta-data (the schema
tables) is effectively barred to users. Of course, the DBMS itself will
continue to access its own meta-data, but the schema tables effectively
become internal to the DBMS.
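A very rough sketch of such an arrangement is given below. The class and method
names are invented for illustration and do not correspond to any particular DDS
product; the point is simply that every change to meta-data goes through the DD,
which then generates the corresponding updates to each DBMS's own schema tables:

    # Sketch: a passive data dictionary made to appear active.
    class LocalDBMS:
        def __init__(self, name):
            self.name = name
            self.schema_tables = {}   # internal to the DBMS; barred to users

        def apply_schema_change(self, object_name, definition):
            self.schema_tables[object_name] = definition

    class DataDictionary:
        def __init__(self, dbms_list):
            self.meta = {}            # the documented data resource
            self.dbms_list = dbms_list

        def update_metadata(self, object_name, definition):
            # The DD is updated first ...
            self.meta[object_name] = definition
            # ... and the DDS then generates the transactions needed to bring
            # each DBMS's schema tables into line.
            for dbms in self.dbms_list:
                dbms.apply_schema_change(object_name, definition)

    nodes = [LocalDBMS('DUBLIN'), LocalDBMS('CORK')]
    dd = DataDictionary(nodes)
    dd.update_metadata('FLIGHT-INFO',
                       ['flight_no', 'date', 'source', 'destination'])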
A DDS in a distributed environment, especially in a multidatabase,
is more likely to be passive, with the active component being provided
by a catalogue or encyclopaedia stored at each node. These local catalogues
can then be generated directly from the centralized DD, in the same way
as schema DDL for independent DDSs (see above). The information
stored in the catalogues corresponds to the nine functions listed in Section
9.6.3, which have to be added to a conventional DDS operating in a
distributed environment.
SUMMARY
This chapter has reviewed two different aspects of DDB management,
namely logical DDB design and DDB administration. In both cases, we
saw that it is important to distinguish between top-down homogeneous
DDBs and heterogeneous bottom-up designed DDBSs such as
multidatabases. In the case of the former, both design and administration
are the same as for the centralized case, with a few extra facilities to
incorporate the distributed dimension. Multidatabases, on the other
hand, offer a quite different challenge. The main problem for logical
MDB design is DB integration (i.e. the integration of the local
(participation) schemas into a single, uniform global schema). A manual
methodology for performing DB integration was given. It is likely that
schema integration methods of the future will be object-oriented, possibly
EXERCISES
9.1 (a) If the information systems within an organization are managed by a
heterogeneous collection of independent file systems and DBMSs, outline
the problems which you would expect the organization to be facing which
could be solved using a multidatabase approach.
(b) What is a data dictionary system (DDS) and what are its functions?
(c) If the organization mentioned in part (a) above purchased an independent,
passive DDS, which supported all the functions you have listed in your
answer to part (b), discuss in detail the effect this would have on the
organization. You should explain how the DDS would or would not help
the organization overcome the information processing problems you have
identified in your answer to part (a).
9.2 A university has three separate DBs (A, B, C) containing information about
full-time students, part-time students and staff. DB A contains information
about full-time students (name, address, degree for which they are registered
and their overall result), together with the grades they have obtained in the
various courses they have taken; the lecturer for each course is also
recorded. DB B contains the same information for part-time students, but with
an additional attribute to indicate their full-time job (day-job). DB C contains
information about the lecturing staff and the courses which they teach; the
number of hours per course and the rate per hour are also recorded.
Figure 9.6 gives the ER model for the three DB schemas.
Integrate the three schemas to produce a global schema; show each step of
the DB integration process, identifying the transformations performed at each
stage.
[Figure 9.6: ER models for the three DB schemas A, B and C (diagrams not reproduced).]
Bibliography
Allen F.W., Loomis M.E.S. and Mannino M.V. (1982). The integrated
dictionary/directory system. ACM Computing Surveys, 14 (2), 245-86.
A good easy-to-follow introduction to the functions and facilities of data
dictionary systems in a centralized DB environment.
Appleton D.S. (1986). Rule-based data resource management. Datamation,
86-99.
This paper presents a high level view of managing the data resource in which
the vital role of meta-data is emphasized.
Batini C., Lenzerini M. and Navathe S.B. (1986). Comparative analysis of
methodologies for database schema integration. ACM Computing Surveys,
18 (4), 323-64.
An excellent introduction to view and schema integration, on which much of
the material presented in Section 9.3.1 is based.
Bell D.A., Fernandez Perez de Talens A., Gianotti N., Grimson J., Hutt A.
and Turco G. (1986). Functional Requirements for a Multidatabase System.
MAP Project 773, Report 04. Available from Institute of Informatics, Ulster
University, N. Ireland.
Bertino E. (1991). Integration of heterogeneous data repositories by using
object-oriented views. In Proc. 1st Int. Workshop on Interoperability in
Multidatabase Systems, 22-9, Kyoto, Japan, April 1991.
This paper gives an overview of an object-oriented approach to schema
integration. The approach involves building a local object-oriented view of
each underlying heterogeneous data resource, which may optionally be
combined into integrated object-oriented views. Views are defined using an
object-oriented query language described by Bertino et al. (1989).
Bertino E., Negri M., Pelagatti G. and Sbattella L. (1989). Integration of
heterogeneous database applications through an object-oriented interface.
Information Systems, 14 (5).
Braithwaite K.S. (1985). Data Administration: Selected Topics of Data Control.
New York: Wiley.
Ceri S. and Pelagatti G. (1984). Distributed Databases: Principles and Systems.
New York: McGraw-Hill.
Chen P.P. (1976). The entity relationship model - towards a unified view of
data. ACM TODS, 1 (1), 48-61.
The original paper on entity relationship modelling, one of the most widely
used modelling techniques in the business world.
Czejdo B. and Taylor M. (1991). Integration of database systems using an
object-oriented approach. In Proc. 1st Int. Workshop on Interoperability in
Multidatabase Systems, 30-7, Kyoto, Japan, April 1991.
This paper presents an interesting and fairly concrete approach to object-
oriented schema integration. Global users access the underlying
heterogeneous data collections through Smalltalk. The approach can be used
for both tightly and loosely coupled MDBSs (i.e. with or without a global
conceptual schema).
Dao S., Keirsey D.M., Williamson R., Goldman S. and Dolan C.P. (1991).
Smart data dictionary: a knowledge-object-oriented approach for
interoperability of heterogeneous information management systems. In Proc.
1st Int. Workshop on Interoperability in Multidatabase Systems, 88-91, Kyoto,
Japan, April 1991.
This paper gives a brief overview of a method of schema integration which
combines case-based reasoning with object-oriented techniques.
Database Architecture Framework Task Group (DAFTG) of the
SOLUTIONS TO EXERCISES
9.2
Step 1: Figure 9.7(a).
Choose course# for crsno in schema B.
Choose grade for mark in schema B.
Choose ID# for student# in schemas A and B and for staff# in schema C.
Choose degree-reg for degree-crs in schema B.
Choose overall-result for overall-mark in schema B.
Choose RESULTS for EXAMINATION RESULTS in schema A.
Figure 9.7 (a) Standardization of attribute names. (b) Make lecturer in TEACH
into an entity in schemas A and B.
Figure 9.7 (c) Superposition of schemas. (d) Create new STUDENT and PER-
SON entities and subset relationships.
9.4 Outline the three main ways in which data dictionary systems and DBMSs can
be linked and compare the relative advantages and disadvantages of each
approach.
10.1 Introduction
10.2.1 Objectives
The demands of computerized information systems in a health network
with regard to hardware and software are rarely matched by industrial
applications. The problems to be solved are therefore correspondingly
diverse and require many innovative techniques.
The objectives of such computerization are to:
"* Reduce the need for, and duration of, treatment of patients by
good prevention methods and early diagnoses;
"* Increase the effectiveness of treatment to the extent allowed by
improved information;
"* Relieve professionals and other workers in care units of information
processing and documentation burdens, thereby freeing them for
more direct care of the patient;
"* Enhance the exploitation of resources available for health care by
good management and administration;
"* Archive clinical data, facilitate the compilation of medical statistics
and otherwise support research for the diagnosis and treatment of
disease.
health units can be greatly improved. Mobile populations and the 'cen-
tralized' nature of the specialist care units, for example, could mean that
data in another subregion, or even region, might be required. Remote
accesses such as these are likely to remain rare, and even accesses between
neighbouring hospitals are comparatively rare, but current information
technology now offers attractive alternatives to the traditional methods
of meeting these remote access requirements as well as internal ones.
A STATIC SCHEMA
B DYNAMIC SCHEMA
Events
EC1 Request for service is received
EC2 Appointment is agreed
EC3 Services are rendered
EC5 Documentation is completed
Conditions
CC1 Enquirer qualified
C DISTRIBUTION SCHEMA
[Schema for a further unit (A static schema, B dynamic schema, C distribution schema): details not recovered.]
A STATIC SCHEMA
General objects
OK19 Receptionist (receptionist #, name)
OK20 Transportation (code #, means)
OK21 Dr (type #, Dr #, name)
OK22 Place (place #, place)
OK23 Subregion (subregion #, wording)
OK24 Region (region #, wording)
OK25 Diagnosis (diagnosis #, wording)
OK26 Medicine (medicine #, wording)
B DYNAMIC SCHEMA
Functions
FK1 interview of patient
FK2 call Dr
FK3 wait for Dr
FK4 clinical exam (elementary)
FK5 clinical exam (regional)
FK6 determine destination in hospital if CK1
Events
Conditions
CK1 Patient must be moved
CK2 Pharmacy services required
CK3 Surgery necessary
CK4 Lab test required
C DISTRIBUTION SCHEMA
A STATIC SCHEMA
General aspects
OL11 Lab-receptionist (receipt #, name)
OL12 Doctor (Doctor #, name, speciality)
OL13 Collector (collector #, name, site)
B DYNAMIC SCHEMA
Functions
Events
Conditions
CL1 The results are within reference limits
CL2 Further tests are required
CL3 Enquirer is qualified
C DISTRIBUTION SCHEMA
Functions
Events
Conditions
CR1 Patient has been hospitalized here before
CR2 Patient is to be admitted
CR3 Patient dies or completes treatment
CR4 Patient sent to/admitted from other hospital
CR5 X-raying required
CR6 Other treatment required
C DISTRIBUTION SCHEMA
[Schema for a further unit (A static schema, B dynamic schema, C distribution schema): details not recovered.]
Hospital A Hospital B
* Professional
* Patient
* Ailment
* Patient-professional.
* Patient
* Doctor
* Hospitalization
* Treatment
* Out-patient.
* Patient
* Sample
* Test request
* Collector.
* Patient
* Doctor
* Doctor-patient-encounter
* Treatment session.
some stage and to some degree. Evolutionary forces within the domain
encouraged the development of the systems in this way, and these are
unlikely to change much as a result of adopting a distributed database
approach. So this characteristic will probably persist.
There is an interesting hierarchy or network of (distributed) datab-
ases within the distributed databases for this application. There are many
(distributed) databases in the system which correspond to individual pati-
ents, doctors and many other objects. Some of these may appear as
'simple' entities in other distributed databases. Consider an extreme case
of a patient who has some chronic condition such as asthma or hyperten-
sion, which may be particularly persistent. He or she could quite conceiv-
ably also be found, perhaps several years later, to be suffering from some
other chronic condition, for example arthritis. A patient like this could,
over a decade say, accumulate a sizeable collection of details resulting
from frequent consultations, tests and treatments. Now all patients' rec-
ords constitute a (distributed) database in a structural or intensional sense.
However the physical size of the record of an individual chronically ill
patient could mean that it qualifies as a database in the extensional sense
also.
Suppose such a doubly-unfortunate patient has a bulky record of
investigations for asthma and an extensive record of orthopaedic consul-
tations over many years. Access to the record is triggered by events such
as changes of consultants, or re-examinations of the original orthopaedic
diagnosis and prognosis with hindsight. Or a search could be requested
of the records to find when a new aspect of either condition, now clearly
apparent, first started (albeit imperceptibly at the time). Searches such as
these may demand the most efficient access techniques available to help
the care professional to cope with the bulk of data. Database methods
are attractive in that they provide such access techniques.
Databases may exist for a great variety of purposes, some of which
can be rather unexpected. An example is of a database used by a clinical
dermatologist to assist in the identification of where in a patient's environ-
ment a particular allergen exists. A database developed for this application
uses a CODASYL-like network structure to find intersections of inverted
lists which link articles and materials associated with allergies, jobs,
hobbies associated with the patient and the site of the rash on the patient's
body.
[Figure/table residue - recoverable labels: other hospitals; homogeneous(?); standard global distributed database; records-wards DB; labs-wards DB; research; ad hoc remote heterogeneous; GP-labs; from a GP to Hospital A; within Hospital A.]
Clearly there are situations which fall between these gradings, for
example, varying levels of homogeneity of subcomponents further strati-
fies case 1.
Another important factor is that of the localization of queries and
updates. The difficulty faced by allowing distributed updates as well as
queries has a dramatic negative effect on the case for adoption of the
approach. So, again ranking in descending order of attractiveness, we can
categorize the degree of difficulty as follows:
(1) Multiple copies of certain sets of data must all be kept consistent;
(2) Multiple copies must be kept again, but with only the requirement
that some single primary copy is kept fully correct;
(3) Only one version of each data object exists.
lines indicate that the most homogeneous of these subsystems are likely
to find distributed database systems most feasible.
By careful design of the network the problem of updatability will
probably not be too important for most of the applications outlined in
this chapter, or for those envisaged as ad hoc transactions. The updating
of a particular relation, tuple or attribute is unlikely to demand a sophisti-
cated scheduler to ensure the correctness, at high performance, of a set
of interleaved transactions on the item at these granular levels. The
simple reason for this is that the probability of two transactions sharing
simultaneous access to any such item is very low and 'inconsistent' replicas
may be tolerable. This is not to say that this aspect of transaction pro-
cessing can be ignored, but rather that relatively simple mechanisms, such
as keeping a 'primary copy' for conflict resolution and supporting locks,
which can be fairly pessimistic in most cases because of the performance
levels required and workload characteristics, are adequate. Moreover, it
is expected that by far the greater proportion of the access traffic in
general, and updates in particular, will be local.
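A minimal sketch of such a primary-copy scheme is given below (the structure and
names are purely illustrative, not a prescription for any particular system):
every update is applied to the designated primary copy, which then propagates it
to the other replicas, possibly after a delay which this kind of workload can
tolerate:

    # Sketch of a simple primary-copy scheme for replicated health-care data.
    class Replica:
        def __init__(self, site):
            self.site = site
            self.data = {}

    class PrimaryCopy(Replica):
        def __init__(self, site, secondaries):
            super().__init__(site)
            self.secondaries = secondaries

        def update(self, key, value):
            # The primary is kept fully correct; secondaries may lag briefly,
            # which is tolerable for this workload.
            self.data[key] = value
            for replica in self.secondaries:
                replica.data[key] = value   # propagation could be deferred

    wards = [Replica('ward-3'), Replica('ward-7')]
    records = PrimaryCopy('records-office', wards)
    records.update('patient-1234', {'name': 'J. Murphy', 'diagnosis': 'asthma'})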
The amount of redundant data will be high in this application, but
the low required consistency level is likely to remove any problem here.
All three kinds of partitioning of individual relations can be foreseen, and
relatively modest query evaluation algorithms will probably be sufficient
for the application's needs.
SUMMARY
In this chapter we have looked at the objectives of computerization of
health care systems, and we have concentrated upon the information
handling problems encountered when distributed computing facilities are
used. In particular we have examined the suitability of the DDB
approach to handling these problems. We have presented 'evidence' that
several databases exist at various functionally and organizationally
distributed sites in health care networks, and use the distributed nature
of the information processing in this application to make some of the
concepts dealt with in earlier chapters more concrete. We believe that the
distributed database approach can help in making sense of several
dichotomies in medical informatics.
We also provide an intuitively acceptable set of criteria to help
determine if the DDB approach is appropriate for a particular application
environment.
EXERCISES
10.1 Why are health care networks considered to be potentially excellent
application domains for distributed database systems?
10.2 In a countrywide national health care system a choice is to be made between
using a global schema or not when adopting the multidatabase approach to
handling accesses to data between sites.
What factors would have led to the choice of the multidatabase
approach, and what factors should be taken into consideration when choosing
between the alternatives above?
10.3 The Department of Education in a country has decided to install distributed
databases in the regions of the country (each of a few million inhabitants) in
order to further their objective of increased sharing of teaching resources and
also to help in the vertical reporting they require for regional administration
purposes.
Sketch and explain a sample layout of an intraregional network which
might be useful for this, indicating why pre-existing databases are likely to be
available at the sites.
How would the multimedia aspects of the application be catered for?
How would the interregional aspects of the application domain be handled?
10.4 Repeat Exercise 10.3 for the administrative and operational needs of a
futuristic supermarket chain dealing with warehouses, retail stores and home-
delivery services. Assume that the organization of the enterprise is based on a
two-level geographical grouping of services (analogous to the regional and
national levels of the previous question).
10.5 Repeat Exercise 10.4 for a group of future travel agents working at regional
and national levels. (Multimedia data here could be the equivalent of current
travel brochures, maps, railway networks, etc.).
10.6 Repeat Exercise 10.4 for a national police force seeking to integrate
multimedia police files from various locally-autonomous sites.
10.7 Suggest how the relations of Section 10.4 would be supplemented if the
radiology department of the hospital, which has to deal with collections of
digitized X-ray images layered by age (e.g. there is a two-week file held on
reasonably fast access storage, and a two-year file and a ten-year file which
are less accessible, being required much less frequently). Assume that the
images are represented by descriptions held on structured records in computer
systems.
10.8 Suggest some further queries of each type in the example in Figure 10.1.
Bibliography
Bell D.A. (1985). The application of distributed database technology to health
care networks. Int. J. Biomedical Engineering, 16, 173-82.
Bell D.A. (1990). Multidatabases in health care. Database Technology, 3(1),
31-9.
Bell D.A. and Carolan M. (1984). Data access techniques for diagnosis in
clinical dermatology. In Proc. 5th Int. Congress of EFMI, Brussels, Belgium.
Bell D.A., Fernández Pérez de Talens A., Gianotti N. et al. (1987). Multi-
Star: A Multi-database system for health information systems. In Proc. 7th Int.
Congress of EFMI, Rome, Italy.
Bell D.A., Grimson J.B., Ling D.H.O. and O'Sullivan D. (1987). EDDS - A
system to harmonise access to databases on mainframes and micro. J.
Information and Software Technology, 29(7), 362-70.
Blois M.S. (1983). Information and computers in medicine. In Proc.
MEDINFO Conf. 83, Amsterdam, Holland.
Bush I.E. (1981). Hospital computer systems - for medicine or money? In
Proc. Computer Applications in Medical Care.
Fernández Pérez de Talens A. and Giovanne P.M. (1987). Use of a
multidatabase management system MULTISTAR to obtain the centripetal
information flows for the national health statistics. In Proc. 7th Int. Congress
of EFMI, Rome, Italy.
Giere W. (1981). Foundations of clinical data automation in co-operative
programs. In Proc. Computer Applications in Medical Care, Washington DC,
IEEE Computer Society Press.
Grimson J.B. (1982). Supporting hospital information systems. In Proc. 4th
Congress of European Federation of Medical Informatics (EFMI), Dublin,
Ireland.
Huet B., Polland C. and Martin J. (1982). Information analysis of an
emergency unit and a pre-diagnosis unit. In Proc. 4th Int. Congress of EFMI,
Dublin, Ireland.
The conceptual schemas in Section 10.3 were modelled on a single-site
schema given here.
Isaksson A.I., Gerdin-Jelger U., Lindelow B., Peterson H.E. and Sjöberg P.
(1983). Communications network structure within the Stockholm county health
care. In Proc. 4th World Conf. on Medical Informatics, Amsterdam, Holland.
Whiting-O'Keefe Q.E., Simborg D.W. and Tolchin S. (1981). The argument
for distributed hospital information systems (HIS). In Proc. Computer
Applications in Medical Care.
Future Developments
in Distributed
Databases
11.1 Introduction
In this chapter we will review technological developments in a number of
areas, which are likely to impact on future generations of distributed
database systems. There can be no doubt that distributed database sys-
tems, especially in the form of multidatabase systems which allow inte-
grated access to heterogeneous data collections, will grow in importance
over the next few years. Most large organizations now take distribution
for granted. The availability of reliable, standard data communications
makes it possible for each user to have their own PC or workstation
connected to the network. The network, in turn, provides a wide range
of facilities to its users including printing, access to database servers and
electronic mail.
Brodie (1989) has predicted that the information systems of the
future will be based not simply on distributed database technology but
rather on intelligent interoperability. Such systems will require not only
the applications of techniques from the field of distributed databases to
support interoperation of systems, but also the incorporation of techniques
from artificial intelligence, in particular knowledge-based systems and
natural language processing, to provide the intelligence. In addition to
* Object-oriented DBMSs.
* Extended relational systems
11.2.1 Introduction
The field of artificial intelligence (AI) is very broad and covers a wide
range of disciplines including psychology, philosophy, linguistics and soci-
ology, as well as computer science itself. The unifying aim of AI is to
There are two main reasons for the lack of exploitation of the
potential of this technology in practice:
In cases where ESs have been applied, the domains considered are
often too restricted and are too simplistically viewed to be of practical
use.
Complex problems require more knowledge, of a 'deeper' kind,
than is normally to be found in demonstrations of the expert systems
approach to problem solving. We will return to this issue in Section 11.3.
ESs are generally developed as stand-alone systems which are not
integrated into the general information processing environment. A busy
hospital laboratory, for example, can make very effective use of ES
technology by analysing patient test results to detect abnormal results,
suggest possible diagnoses, automatically schedule further tests on the
basis of the results obtained to date and so on. The potential for improving
the efficiency and cost-effectiveness of the laboratory through the use of
ES technology is there, but it can never be fully realized if it is not
integrated into the routine laboratory system.
How can this integration be achieved? We can identify three main
ways in which ESs and DBs can be coupled together:
Figure 11.2 Coupling ES and DBs. (a) enhanced expert system, (b) intelligent
DB, (c) intersystem communication.
which will be filled correctly by the pharmacist; the nurse knows how to
call the porter and request that a patient be brought down to X-ray and
so on.
The agents in the information systems based on intelligent inter-
operability correspond to various subsystems: one might be the hospital
information system, based on conventional DB or DDB system, another
might be an ES in a particular medical speciality and yet another an X-
ray imaging system. Efficient management of the hospital - staff, patients
and resources - requires a cooperative effort between all these subsystems.
We discuss this in more detail in Section 11.3.
11.3.1 Introduction
Extensive domains require larger knowledge bases of the conventional
kind described above. One way of tackling this problem is to use groups
of smaller expert systems, like those already in existence, in concert. So
two prominent characteristics of 'second generation' expert systems are
11.3.2 Motivation
There has been great interest in distributed artificial intelligence (DAI)
in the last ten years or so. This turns out to be a very broad area and it
is not easy to produce a definitive taxonomy of the efforts that have been
made in it. We will attempt this task shortly, but in the meantime we
content ourselves with looking at the forces which have led to it. The
objective is to get individual problem-solving modules, or agents, to
interact constructively in the solution of problems that are beyond the
capability of any one of the agents by itself.
Two basic scenarios exist in DESs. As in DDBs, one is for use in
a retrofit situation, where the individual expertise is already in existence,
manifested by the regular work of the local agents, and their resources -
reasoning methods and facts or evidence - are required to be collectively
exploited to solve 'global' problems. The other is where the system
At the highest level another agent, H, would use the results from
MV and MD via B3, and summarize changes in velocity and distance, using
them to determine, for example, if the target is human. The approach is
illustrated in Figure 11.4.
An example of the second sub-type, which incidentally does
decompositions as well if required, is the distributed vehicle monitoring test-
bed (DVMT), which is also used for target tracking. This time the appli-
cation is to trace vehicles' histories in some domain of interest on the
basis of signals it receives. We choose this system as our illustration of
this sub-type because the emphasis is on the integration of signals coming
from various sources, in order to get a holistic picture of the space
being monitored, rather than on dividing a problem amongst different
specialists. The idea is that if several, say four, overlapping subregions
are monitored by pieces of equipment with different levels of reliability,
then the picture of the real 'world' that is required can be built up from
the different jigsaw pieces.
In the example in Figure 11.5, a sensing agent S would be assigned
to each of the four overlapping subregions, R1, . . ., R4. Corresponding
to each region there would also be an interpreting agent I carrying out
appropriate subtasks of interpretation for the local region. A further
integrating agent could then be used at a level above these with the task
of making gross sense of these local interpretations. Further levels can be
envisaged in order to multiply the area being monitored. For example
four regions the size of R can be monitored as a unit by incorporating
another gross interpreter above them.
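The division of labour between sensing, interpreting and integrating agents
might be sketched as follows; this is a purely illustrative Python outline with
invented names, not a description of how the testbed described above is
actually organized:

    # Sketch of a sensing / interpreting / integrating agent hierarchy.
    class SensingAgent:
        def __init__(self, region):
            self.region = region
        def readings(self):
            return [(self.region, 'signal')]     # stand-in for real sensor data

    class InterpretingAgent:
        def __init__(self, sensor):
            self.sensor = sensor
        def interpret(self):
            # Local interpretation of the local region only.
            return [('track', region) for region, _ in self.sensor.readings()]

    class IntegratingAgent:
        def __init__(self, interpreters):
            self.interpreters = interpreters
        def integrate(self):
            # Build the holistic picture from the local interpretations.
            picture = []
            for agent in self.interpreters:
                picture.extend(agent.interpret())
            return picture

    regions = ['R1', 'R2', 'R3', 'R4']
    interpreters = [InterpretingAgent(SensingAgent(r)) for r in regions]
    print(IntegratingAgent(interpreters).integrate())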
In the case of collaborative reasoning all the agents address the
same problem. The problem is not normally decomposed in this case, but
each agent uses its local knowledge to come up with a result in collabor-
ation with the others. The agents can all be considered to be cooperating
on the solution of one problem, although it is possible that the system as
a whole will work on several problems concurrently. The idea here is that
of horizontal collaboration such as can be found in the medical world. A
consultant could confer with another kind of specialist in a complex case.
A team of experts with different specialisms work together on a case. An
example would be where a specialist orthopaedic surgeon requires the
[Figure residue - recoverable labels: integrating agent; interpreting agent; blackboard sub-system; control sub-system; expert systems.]
the node and the control subsystem. The blackboard subsystem stores
shared information for the Ss and structured information for the control
subsystem. The control subsystem is responsible for managing the cooper-
ation and communications between the nodes and for the run-time man-
agement of HECODES. In practice the blackboard and control sub-
systems are stored at one (central) node to minimize communications.
If we now focus on the agent nodes in the system, we can identify
three main functional modules: the management, communications and
man-machine-interface modules. The modules are all designed to be
domain-independent.
The scheduler module in the central (control) node is a key element
of HECODES (see Figure 11.7). When it is fed meta-knowledge compris-
ing the control information for all the ESs, it is able to schedule the whole
system's operation and manage the cooperation between the agents. The
blackboard manager carries out all operations involving the blackboard,
which has its usual role for communications between the agents. The
communications module has a self-evident role, and the front-end pro-
[Figure 11.7 residue - recoverable labels: user; man-machine interface; meta-knowledge; scheduler; blackboard manager; communications module; other nodes and expert systems; expert system manager; expert system.]
cessors in Figure 11.6 provide the interfacing between the ESs and the
control node. Each of these modules can be very complex. For example,
the scheduler must detect and remove, or better still avoid, deadlocks in
the system, which can occur due to cyclic waiting loops, much as for
DDBs. Another example is the interfacing that is required between differ-
ent ESs, for example, where they use different methods of dealing with
evidence or other knowledge which is inexact, as is usually the case.
* Complex objects
* Long-lived transactions.
[Example relations (residue): TEST; FAMILY-MEMBER (Family-name, patient-name, ...).]
Splitting a patient's data across several relations in this way incurs a
performance penalty. If the GP wants to display all the data relating to one
patient, such a simple request would involve three JOINs!
Another important characteristic of many non-business applications
is their use of long-lived transactions. For example, in an engineering
design system, it is quite common for a design engineer to check a
component out of the DB, work on it over several days, or even weeks
and then check it back in. The recovery managers of RDBMSs are not
designed to handle such long-lived transactions.
Finally, conventional RDBMSs restrict users to a finite set of prede-
fined data types such as string, integer and real. Many are not suited to
the storage of arbitrary strings of text and very few can handle the
enormous bitmaps generated by imaging systems. There is a need in
many applications to handle multimedia information - conventional data,
images, graphics, text, even voice. Users need to be able to define abstract
data types and operations on these types. For example, we might wish to
search for a particular word in a text string or to rotate a graphical image
through a certain angle.
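A user-defined abstract data type of this kind might look as follows; the sketch
is illustrative Python only, with invented type and operation names:

    # Sketch of user-defined abstract data types with their own operations,
    # of the kind conventional RDBMSs cannot easily support.
    class TextField:
        def __init__(self, text):
            self.text = text
        def contains(self, word):
            return word.lower() in self.text.lower()

    class Image:
        def __init__(self, bitmap):
            self.bitmap = bitmap          # e.g. rows of pixel values
        def rotate90(self):
            # Rotate the bitmap through 90 degrees.
            self.bitmap = [list(row) for row in zip(*self.bitmap[::-1])]

    report = TextField('Chest X-ray shows no abnormality')
    print(report.contains('x-ray'))       # True
    scan = Image([[0, 1], [2, 3]])
    scan.rotate90()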
are encapsulated in the object). This is quite different from the way
relations and procedures which operate on those relations are handled in
RDBs. Relations (the attributes) are stored in the DB and procedures
(the methods) are stored separately in a software library. An important
advantage of encapsulation is that it facilitates reuse of code. Most large
software projects start coding from scratch, making little if any use of
existing and possibly similar code. Compare this to hardware projects,
where most of a new machine could be made up of components of an
existing design. Some of these components may not require any modifi-
cation whatsoever for use in the new machine, whereas others might
require only minor modifications. Very few components would have to
be built totally from scratch.
Objects which share the same methods and attributes can be
grouped together into a class (e.g. the class of patients). Classes them-
selves can be organised into a hierarchy such that subclasses inherit
attributes and methods from superclasses. For example, the patient class
could be divided into two subclasses - in-patients and out-patients - as
shown in Figure 11.10. Both in-patients and out-patients share some
methods and attributes with either parent class, but each has some
additional features which are peculiar to themselves. For example, an in-
patient will have a ward number and a bed number, whereas an out-
patient will have an appointment date and so on. Inheritance provides a
powerful tool for modelling and also facilitates code reuse since the code
for registering a new patient can be shared by both in-patients and out-
patients.
[Figure 11.10 residue: the patient class with its in-patient (ward) and out-patient (clinic) subclasses.]
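The class hierarchy just described might be sketched as follows (illustrative
Python only; the attribute and method names are our own, not those of any
particular OODBMS):

    # Sketch of the patient class hierarchy of Figure 11.10.
    class Patient:
        def __init__(self, patient_no, name):
            self.patient_no = patient_no
            self.name = name
        def register(self):
            # Shared by both subclasses - the code is written once.
            print('registering', self.patient_no, self.name)

    class InPatient(Patient):
        def __init__(self, patient_no, name, ward_no, bed_no):
            super().__init__(patient_no, name)
            self.ward_no = ward_no
            self.bed_no = bed_no

    class OutPatient(Patient):
        def __init__(self, patient_no, name, appointment_date):
            super().__init__(patient_no, name)
            self.appointment_date = appointment_date

    InPatient('P123', 'J. Murphy', ward_no=4, bed_no=12).register()
    OutPatient('P456', 'M. Kelly', appointment_date='1-6-92').register()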
Under each of the three tenets, the Committee also put forward a
number of more detailed propositions. In particular, there is agreement
11.6.1 Introduction
In Chapter 1 of this book we pointed out that information systems which
deal with only structured text (i.e. using record-based databases), even if
they are supplemented by techniques for reasoning from Al (such as
expert systems), can provide only imperfect models of the real world.
Information can be communicated to users from suppliers using images
(including graphics, pictures and diagrams), voice, unstructured text and
even video, and in certain situations, it is impossible to dispense with
these. Modelling the real world as, for example, 2-D tables and communi-
cating on this basis is bound to fall short of the ideal.
Applications which demand the use of 'non-structured media'
include office information systems, geographical information systems,
engineering design systems, medical information systems and military
command and control systems. For some time database researchers have
been endeavouring to address the additional issues this raises for gen-
eralized database management systems.
The issues include modelling the part of the world of interest,
interrogating the database, presentation of results and ensuring acceptable
performance - problems which present a higher level of challenge than
the corresponding problems for conventional databases. When the systems
are distributed as well, clearly each of these problems is aggravated and,
of course, additional harmonization and integration difficulties are met as
well.
Interestingly, the ideas of multidatabases give us a hint as to a
possible approach to this problem. If the heterogeneity we dealt with in
previous chapters is extended to include multiple media, we can use some
of the solutions to MDB problems as a starting point for addressing the
distributed multimedia problems.
In the next subsection we rehearse the arguments for multimedia
databases in a little more detail, and from this get a clear picture of the
11.6.2 Motivation
Multimedia data can be defined for our purposes as data in the form of
computerized text (structured and unstructured), sound and/or images. If
computers are to be exploited fully in the communication process
described in Chapter 1 (Figure 1.2), they must be able to handle infor-
mation of the type that applications end-users normally handle. This
'handling' must include basic information processing for storing and
accessing information and also some more complex and specialized infor-
mation processing for identifying material of interest - matching, clus-
tering, retrieving on the basis of content, to list just a few examples.
A multimedia database management system (MMDBMS) is a
software system which manages the data in the same manner as conven-
tional DBMSs. For structured data alone such DBMSs do a fine job. For
unstructured text, such as in office documents and bibliographic systems,
a variety of tools can be found for most computers and workstations.
These include electronic mail services, text editors and formatters and
information retrieval system 'databases' (see Chapter 1).
However, in non-computer systems, users and suppliers of infor-
mation communicate using a rich variety of media and language elements.
An obvious example is where they talk directly to each other. A dialogue
of verbal questions and answers is a particularly venerable mode of
communication. Using maps, diagrams and little sketches, the communi-
cation process can be greatly expedited. Examples of this are when
computer experts are designing new applications, and this is a reflection
of the methods of communication commonly used in society at
large - for giving directions to someone from out of town, or instructing
an architect on your proposed new house. Pointing to objects or following
the course of a circuit diagram with a pencil and suchlike 'body language'
certainly helps in clarifying concepts which are difficult to articulate
verbally or in text.
So we can say that if computers are to be applied in any way
optimally to communications between people, the goal must be to store
and utilize information recorded using the kinds of media with which the
people are familiar. Clearly a very high-level means of addressing queries
hierarchy of types, for example) and non-first normal forms (where entities
can have other entities as their attributes). Most commentators agree that
some sort of object-orientation is needed.
Handling complex interobject relationships, permanent (system-
defined) identifiers or surrogates, property inheritance, allowing objects
to have procedures as attributes and allowing users to define data types
of their own are required. There are various difficulties which still retard
the development of this approach, such as how to incorporate the new
data types and functions into query languages, how to implement the
functions and how to exploit novel system features during query optimiz-
ation.
An alternative to this bottom-up approach is to make a fresh start
and develop a MMDBMS which from the outset is geared to integrating
multimedia data. Typical of the research that has been going on in this
area is the work carried out at the University of Waterloo in Canada on
the MINOS project. The researchers aim to handle compound documents
using a variety of media. The documents are subdivided into pages or
screenfulls of visual information and fixed lengths of audio information.
Browsing capabilities for both voice and text are provided by the system
which is designed ab initio for multimedia data handling. The system
architecture of MINOS, implemented on Sun workstations, consists of a
multimedia object server and numerous networked work stations. The
server subsystem is backed by high-speed magnetic disks and optical disks.
It provides performance-oriented features such as physical access methods,
recovery subsystems, caches and schedulers.
A query-trace for MINOS is instructive. Object-oriented queries
are addressed to it by users at work stations, and evaluation is carried
out at the server. Visual indexing cues (arrays of miniature images, for
example) can be provided to help navigation and in the clearer formulation
of fuzzy queries. A presentation manager in the work station is used to
help the user carry out a browse-mode perusal of data objects related to
his/her enquiry.
The third approach is to develop a multimedia database for some
application as a one-off exercise. This was a common pattern in the early
days of conventional databases. A specific data management system is
designed and bespoke-tailored to some very specific environment. In fact
this produces an elaborate, one-off, application system rather than the
generalized system support we are looking for here.
So while we can learn a lot from the experiences gained developing
these systems, they are outside the scope of our treatment of the subject.
We are considering the multimedia equivalent of the conventional DBMS.
The applications could be any of those listed above, or any other appli-
cations which require minimum cost, maximum flexibility and other,
often conflicting, characteristics of its accesses to multimedia data in a
'corporate' data reservoir.
When the bulk objects have been delivered to the workstations, local
image processing and other manipulations can be carried out to suit the
application.
We believe that these sorts of considerations will be important in
other application domains as well. However, each domain will have some
idiosyncratic problems, so the highly flexible, customizable MIDAM
approach can be used, where the fixed part of the system is EDDS and
the rest can be tailored given some important set-in-concrete system
support.
SUMMARY
We have reviewed some of the developments which are being
enthusiastically pursued by many research teams worldwide in areas
which are clearly related to distributed databases. We have made a point
of emphasizing throughout this book that DDBs are just one of a
number of ways in which the needs of future information users and
suppliers will be communicated. We have shown here how artificial
intelligence methods, the object-oriented approach, and multimedia data
handling hold some promise of much more comprehensive, flexible and
efficient information processing than current systems provide. Watch this
space!
EXERCISES
11.1 Outline the architecture of a distributed expert system for use in
medical diagnosis. Explain the functions of the different modules, paying
particular attention to the features which handle inexactness of evidence.
11.2 Explain the differences between DESs and distributed deductive databases.
Give an example of a typical application of each class.
11.3 Suppose it is the year 2010 and you are responsible for the design of a system
to keep track of movements of the remaining gnus in an area of one square
mile in the Kilimanjaro National Park. Discuss the sort of system you might
use for this and discuss some of the more difficult technical problems you
would expect to encounter.
11.4 Discuss the inadequacies of the RDM for modelling in multimedia
applications, and describe how object-orientation could help.
11.5 What sort of effect do you think object-orientation would have on the system
architecture of MIDAM?
11.6 Drawing illustrations from a variety of application areas, discuss the
motivations for having multimedia databases.
11.7 (a) What are the three main approaches to coupling databases and expert
systems?
(b) Discuss the relative advantages and disadvantages of each approach.
11.8 What are the characteristics of the object-oriented approach which make it a
particularly suitable choice for the canonical model for interoperable DBMSs?
11.9 (a) What are the two rival contenders for the title 'third generation DBMS'?
(b) Which contender is described as evolutionary and why?
(c) Which contender is described as revolutionary and why?
Bibliography
Abul-Huda B. and Bell D.A. (1988). An overview of a distributed multimedia
DBMS (KALEID). In Proc. EURINFO Conf. on Information Technology for
Organisational Systems, Athens, Greece.
This paper presents an overview of the problems of integrating distributed
multimedia data. KALEID was an early implementation of a system like the
MIDAM system, but it was restricted neither to medical applications nor
to text and images only.
Alexander D., Grimson J., O'Moore R. and Brosnan P. (1990). Analysis of
Decision Support Requirements for Laboratory Medicine. Eolas Strategic
Research Programme Project on Integrating Knowledge and Data in
Laboratory Medicine, Deliverables D1, D2 and D3. Department of Computer
Science, Trinity College, Dublin, Ireland.
This report gives a comprehensive review of computerized decision support
techniques in laboratory medicine from instrumentation through to patient
management systems.
Al-Zobaidie A. and Grimson J.B. (1988). Use of metadata to drive the
interaction between databases and expert systems. Information and Software
Technology, 30(8), 484-96.
The taxonomy for coupling ESs and DBs presented in Section 11.2.1 is
based on this paper, which also describes the DIFEAD system mentioned in
that section.
Atkinson M., Bancilhon F., De Witt D., Dittrich K., Maier D. and Zdonik S.
(1989). The object-oriented database system manifesto. In Proc. Deductive and
Object-oriented Databases, Kyoto, December 1989. Amsterdam: Elsevier Science.
This much publicized paper presents the ingredients which the authors feel
should go into a system for it to be called an OODBMS. They divide
features into those which they consider mandatory (complex objects, object
identity, encapsulation, types or classes, inheritance, overriding combined
with late binding, extensibility, computational completeness, persistence,
secondary storage management, concurrency, recovery and ad hoc query
support) and those which are optional (multiple inheritance, type checking,
inferencing, distribution, design transactions and versions).
Bancilhon F. and Ramakrishnan R. (1986). An amateur's introduction to
recursive query processing strategies. In Proc. ACM SIGMOD Conf., 16-52,
Washington, D.C.
This paper gives an easy-to-follow overview of recursive query processing
strategies, which form an important part of the research on deductive DBs.
Beech D. (1988). A foundation for evolution from relational to object
databases. In Proc. EDBT 88, Advances in Database Technology, Venice,
Italy, March 1988. Also in Lecture Notes in Computer Science Vol. 303
(Schmidt J.W., Ceri S. and Missikoff M., eds.), 251-70. Berlin: Springer-
Verlag.
This is an interesting proposal for an object-oriented extension to SQL,
called OSQL.
Bell D.A., Grimson J.B. and Ling D.H.O. (1989). Implementation of an
integrated multidatabase-PROLOG system. Information and Software
Technology, 31(1), 29-38.
Bell D.A. and Zhang C. (1990). Description and treatment of deadlocks in
the HECODES distributed expert system. IEEE Trans. on Systems, Man and
Cybernetics, 20(3), 645-64.
The causes of deadlocks in distributed expert systems in general are
examined and classified, and methods of avoiding them are put forward.
Brighton, England.
This paper advocates the use of object-oriented data modelling concepts for
multimedia databases. It is well worth a read as an introduction to
multimedia databases.
Woelk D., Kim W. and Luther W. (1986). An object-oriented approach to
multimedia databases. In Proc. 1986 ACM SIGMOD Conf., 311-25,
Washington, D.C., May 1986.
Zhang C. and Bell D.A. Some aspects of second generation expert systems. In
Proc. 1st Irish Conf. on Artificial Intelligence, Dublin, Ireland.
An account of some of the issues which expert systems will address over
the next few years.
Zhang C. and Bell D.A. (1991). HECODES: a framework for heterogeneous
cooperative distributed expert systems. Data and Knowledge Engineering, 6,
251-73.
This paper describes the architecture of HECODES in minute detail.
Glossary/Acronyms
C A programming language 79
C, Cost of joining two tuples 127
Ccp Cost per tuple concatenation 139
Importance weight of CPU 144, 147
Cm Importance weight of messages 144
Crb Cost per message byte 147
Cp Importance weight of I/O 147
t Transmission cost/tuple 27
ta Transmission cost/attribute 139
TCP/IP A networking system for connecting processors 68
TM* Transaction Manager for R* 81
Author Index
Lindsay B. 59, 268, 328
Ling D.H.O. 89, 158, 362, 393, 394
Litwin W. 51, 59, 90, 314, 328
Lohmann G.M. 90, 150, 160, 269
Loomis M.E.S. 327
Lorie R.A. 59, 162, 218
Lu R. 395
Luckenbaugh G.L. 219
Luther W. 396
Lynch N. 220
Mackert L.F. 150, 160
Maier D. 158, 393
Mallamaci C.L. 90
Mannino M.V. 327, 328
Mark L. 59, 328
Martella G. 119
Martin J. 362
Matsuo F. 395
McClean S. 159
McErlean F. 162
McGee W.C. 220
McGill M.J. 12
McJones P. 59
McNickle D.C. 120
Menasce D.A. 220
Mendelzon A. 158
Merrett T.H. 21, 59, 160
Merritt M.J. 220
Mezjk S. 91, 329
Minker J. 152, 159, 395
Mitchell D.P. 220
Mohan C. 90, 160, 268, 269
Monds F.C. 394
Morzy T. 218
Mostardi T. 12, 310, 328
Mukhopadhyay U. 395
Muntz R.R. 220
Mutchler D. 220
Myopoulus J. 12
Naimir B. 149, 158
Nakamura F. 161
Navathe S.B. 33, 59, 299, 327, 328, 329
Negri M. 327
Neuhold E.J. 90, 328
Nguyen G.T. 148, 161
Nicolas B. 328
Nygaard K. 394
O'Hare G.M. 12
O'Mahony D. 221
O'Moore R.R. 393, 395
O'Sullivan D. 219, 362
Obermarck R. 186, 187, 216, 220, 223
Ofori-Dwumfo G.O. 89
Oomachi K. 161
Oszu M.T. 59, 209, 218, 268, 299
Otten A.M. 12
Owicki S. 199, 218
Papadimitriou C.H. 221
Parsaye K. 395
Patterson D.A. 269
Pavlovic-Lazetic G. 12
Pelagatti G. 119, 158, 218, 327
Peterson H.E. 362
Pirahesh H. 268
Poggio A. 395
Polland C. 362
Popek G.J. 220
Popescu-Zeletin R. 89, 208, 219
Price T.G. 59, 162
Putzolu F. 59
Pyra J. 158
Ramakrishnan R. 393
Ramarao K.V.S. 268
Reeve C.L. 159
Reiner D. 120, 161
Reuter A. 268
Riordan J.S. 158
Rivest R.L. 299
Robinson J.T. 198, 220
Rosenberg R.L. 90, 160
Rosenkrantz D.J. 189, 221
Rosenthal A. 120, 161
Rothnel K. 269
Rothnie J.B. 90, 127, 158, 161, 217, 328
Roussopoulos N. 59, 328
Rowe L. 91
Rybnik J. 89
Sacco M.G. 161
Salem K. 209, 217, 219
Salton G. 12
Satoh K. 151, 161
Sbattella L. 327
Schwartz M.D. 299
Schwartz P. 268
Selinger P.G. 149, 159, 162, 328
Sevcik K.C. 218
Severance D.G. 269
Shamir A. 299
Shao J. 394, 395
Shaw M.J. 395
Sheth A.P. 48, 59, 221, 299, 328
Shielke 98, 119
Subject Index
2PC 43, 69, 78, 82, 247-56
2PL 181ff
3GL 15, 18, 34
3PC 247, 256-65
4GL 15, 34
Abstract data type 388
Access plan 82, 122ff
Ad hoc queries 101, 135, 352
Ad hoc transactions 206, 360
Aggregate functions 123
Aggregation 77
Allocation 95ff, 102ff
Archive 138, 336, 356, 234-36
Artificial Intelligence
  distributed expert systems 370-81
  expert systems 365-69
  integration with DB 364-81
  knowledge-based systems 8, 365ff
  use in schema integration 325
Associativity 129ff
Autonomy 4, 53-4, 81, 357
Bindings 123, 143
Candidate key 31-4, 113
Canonical form 132
Canonical model 53
Capacity planning 3, 12
Cardinality 141ff, 154
Catalogue see System Catalogue
Colorability problem 98
Combinatorial optimization 100
Commutative operators 108, 129ff
Concurrency control
  application-specific 208-13
  conflict class analysis 204, 210-13
  locking 165, 179-90, 232
  methods 164ff
  multidatabases 204-13
  optimistic 165, 179, 197-201, 206-7
  problems 169-73
  serializability 173ff
  timestamping 165, 179, 190, 197
  transactions 165ff
Constraints 273-80, 102, 123, 152, 160
Conversion 7, 106ff, 305-13
Data
  allocation 93
  independence 16, 69, 122
  locality 94
  model, global 53, 108ff
  model, local 53, 109
  placement 93, 151
  resource 16
  transmission 37ff, 124ff, 143
Data Administration
  data administrator (DA) 152, 316ff
  data dictionary 69, 71, 82, 320-4
  database administrator (DBA) 315-18
  distributed database administrator (DDBA) 316-19
  homogeneous DDB 317-19
  multidatabases 318-19
  system catalogues 16, 319-20
Data Dictionary
  active vs passive 323-4
  centralized 319
  DB-application 322-3
  distributed 320
  global 78, 320-4
David Bell is Professor of Computing in the Department of Informatics at the University of Ulster at
Jordanstown. He has authored or edited several books and over 100 papers on database and related
computing developments. Jane Grimson is a senior lecturer in computer science and a Fellow of Trinity
College, Dublin. Both authors have a wealth of teaching and research experience.
Addison-Wesley Publishing Company ISBN 0-201-54400-8