Bookdb
Antonio Albano
University of Pisa
Department of Informatics
Largo B. Pontecorvo 3, 56127 Pisa – Italy
[email protected]
1 Introduction
[Figure: a database (DB) under the control of a DBMS and shared by the applications of an organization: Logistics, Production, Sales and Distribution, Accounting, Inventory, Human Resources]
The data are under the control of a Data Base Management System (DBMS),
a centralized or distributed software system, which provides the tools to de-
fine the database, to select the data structures needed to store and retrieve the
data easily, and to access the data, interactively or by means of a programming
language.
Another application domain in which databases play a key role is Decision
Support. The main goal of such applications is to turn the data into information
useful to support management decisions. Three categories of decision support
are reporting data, analyzing data, and knowledge discovery with data mining
techniques.
Decision support applications, sometimes called online analytic processing
(OLAP), involve quite complex queries which cannot be efficiently executed
against operational databases, optimized for online transaction processing (OLTP).
For this reason, organizations maintain a separate database, called a data
warehouse, specifically organized for such complex OLAP queries.
This report presents and discusses the principal topics in the database area.
The emphasis is on the concepts underlying database languages, systems and
design. The discussion is organized into three main sections: database from
the designer perspective, DBMS from a user perspective, and DBMS from a
system perspective.
Section 4 presents introductory and fundamental concepts regarding infor-
mation modeling. The problem of building a symbolic model of the knowledge
on some aspect of the world is addressed and an object formalism is introduced
to define this model. The formalism is used to explain the basic concepts used
in the rest of the paper. The basic features of the relational model are also
presented. The emphasis will be on the relational model since it has gained
wide acceptance among database researchers and practitioners and has a solid
theoretical basis. Moreover, an overview is given of the fundamental results of
normalization theory to design relational databases.
Section 3, DBMS from a user perspective, presents the functionality of
a DBMS: the separation of database description and application programs;
database languages; data control; facilities for the database administrator. A
large part of the presentation is devoted to the most important feature charac-
terizing a DBMS: the data model it supports, i.e. the abstraction mechanisms
used to model the databases. The basic features of the relational language
SQL are also presented to define and use databases. The inclusion of SQL
statements in a program written in a conventional programming language is
also discussed.
Chapter 2
will present a formal language which can be used to implement the model. The
emphasis will be on the construction of a conceptual model, i.e. on a model
built using a formalism which is suitable for natural and direct modeling. The
examples in the following sections mainly refer to a simplified library or a
university administration system.
Definition 2.1.1 An entity is anything for which certain facts should be recorded,
independently of the existence of other entities.
Examples of properties in a library are the user's name and address. The
difference between a property and an entity results from a different interpretation
of the represented fact in the model: properties are facts which are of interest
only because they describe other facts which are considered as entities.
Entities with the same properties are said to have the same type and they
are classified into the same collection (also called an entity set). For instance,
John, Mary, and Ann may be classified into the collection Persons based on
the fact that they have the same properties and represent humans.
Collections of entities with the same type are certainly an important aspect
of the knowledge about the reality to be modeled, but much more information
is carried by facts which establish associations among entities.
In the library, examples of relationships between entities can be the fact that
a bibliographic description refers to a book, or more than one book if more
than one copy of the same book exists, or the fact that the user Smith has
borrowed a copy of a particular book.
A relationship is usually binary, that is, it involves two entities, but in general
it may be n-ary. Moreover, several relationship sets might involve the same
entity sets.
If we take a picture of a given time slice of the reality, the entities of interest,
the values of their properties and the relationships in which they participate
constitute a state of this reality. In general, the reality undergoes changes
because entities are subjects of processes. These may be continuous processes
or discrete event processes such as a change in the address of a user, the loan
of a book, the acquisition of a new book, etc.
c A. Albano 2.1 What to Model 7
For example, a book is borrowed by at most one person, but a person can
borrow several books (the relationship is said to be one-to-many or 1:N). In contrast,
the relationship AppearsIn between authors and bibliographic descriptions, in
which an author has written several books and a book has been written by
several authors, is said to be many-to-many or N:M. A book must be related
to a bibliographic description (total), but a bibliographic description may not
be related to a library book (partial).
Procedural knowledge concerns the elementary actions (or operations) in
the application environment which are applied to concrete knowledge to cause
changes. It must be understood that concrete knowledge is about the structure
of the entities and procedural knowledge is about their behavior. Moreover,
while abstract knowledge imposes restrictions on possible values of concrete
knowledge, procedural knowledge imposes restrictions on the possible ways in
which concrete knowledge can be used or modified.
Examples of elementary actions for a university student are: enroll, graduate,
change address, and change telephone number.
Dynamics concerns how concrete and procedural knowledge can be used to
model complex activities in the application environment.
Dynamics regards changes in the reality triggered by events and accom-
plished by standard procedures. An example of such a procedure in a uni-
versity situation is: When a professor moves to another university, then stop
salary; exclude the professor from mailing lists (usually more than one); for
each course held by the professor, start procedure to assign new professor; for
each commission of which the professor was a member, start procedure for
new nominations; etc.
Finally, communications concern how information is entered in the informa-
tion system and is exchanged among members of the organization.
For the sake of simplicity, in the following we will not consider procedural
knowledge, dynamics, and communications.
8 CHAPTER 2 Database: The Conceptual Designer Perspective c A. Albano
2.3.1 Object
An object is the computer representation of certain facts about an entity of
the observed world. An object is a software entity which has an internal state
(instance variables) and it is equipped with a set of local operations (methods)
to manipulate that state. The request to an object to execute an operation is
called a message, to which the object can reply. The structure of an object
state is modeled by a set of variables (or attributes) which can have values
of arbitrary complexity, including other objects which become components of
the object. When the state of an object can only be accessed and modified
through operations associated with that object, we say that the object is a data
abstraction or that it encapsulates its state.
Finally, each object is distinct from all other objects and has an identity
that persists over time, independently of changes to the value of its state, e.g.,
if X and Y are identifiers bound to objects of type T , X will be equal to Y if
they are bound to the same object. For instance, the object representing the
c A. Albano 2.3 ODM: An Object Data Model 9
person John is different from any other object representing another person,
but will remain the same even if his address or some other attribute changes.
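The difference between the identity of an object and the value of its state can be illustrated with a minimal Python sketch. The class `Person`, its attributes, and the address values are illustrative assumptions and not part of the ODM formalism; Python's `is` operator is used here to play the role of the identity test described above.

```python
class Person:
    """Hypothetical object type with a mutable state."""

    def __init__(self, name, address):
        self.name = name
        self.address = address

    def move_to(self, new_address):
        # A method changes the state of the object, never its identity.
        self.address = new_address


john = Person("John", "Address 1")
alias = john                        # a second identifier bound to the same object
twin = Person("John", "Address 1")  # a distinct object with an equal state

john.move_to("Address 2")

same_object = alias is john       # identity persists across the state change
different_object = twin is john   # an equal initial state does not imply identity
```

In this sketch `alias` observes the new address because it denotes the same object, while `twin` is unaffected.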
2.3.2 Type
An object is an instance of a type defined with a generative type constructor,
i.e. each object type definition produces a new type, which is different from
any other previously defined types. An object type describes the state fields
and the implementation of methods of its possible instances. An object type
definition introduces a constructor of its instances, and so an object can be
constructed only after its object type definition has been given.
In the object programming context this approach to objects is called class-
based since the description of objects is called a class; we prefer the term
“type” since we will use “class” with a different meaning according to the
database tradition.
The signature ⇓T of an object type T is the set of label-type pairs of the
messages which can be sent to its instances.
Each object is a value of a certain type and objects of the same type have the
same properties, i.e. they have the same structure and the same operations,
specified by the type definition. The operations (the methods) to manipulate
the state are specified by giving a specific implementation (concrete behavior).
The type mechanism makes it possible to create many objects of the same
type using an appropriate constructor.
Example 2.1 Figure 2.1 shows a graphic representation of types (see footnote 1). Attributes are
represented by the pair (Name : Type). Attributes can be multivalued (have a type seq
T); they can be optional, meaning that the value can be left unspecified. Methods are
represented by (Name (Parameters) := Body).
[Figure 2.1: the object type APerson with attributes Name: string, BirthDate: date, SpokenLanguages: seq string, OwnerOf: Car]
2.3.3 Class
An object data model supports a mechanism to define a collection of homoge-
neous values to model multivalued attributes or collections of objects to model
databases. Usually two different mechanisms are provided:
1. Currently there is no standard notation for an ODM model. Most books use the ER
notation. We instead use a notation based on UML (Unified Modelling Language).
We assume that when an object with the type of the elements of a class is
constructed, then the object will itself become an element of that class.
2.3.4 Relationship
Classes of objects model sets of entities of the observed world, while relationships
between such entities are represented with a separate mechanism, as
shown in the following examples.
Example 2.2 Figure 2.2 shows a graphic representation of classes with different levels
of detail: (a) class name only, (b) class name and the attributes of its elements, and
(c) class name and the attributes of its elements together with the types of their values.
A binary relationship between classes is represented by an oriented arc (Figure 2.3a).
The arc is labeled with the relationship's name. A binary relationship with attributes
is represented by a relationship class attached to the arc using a dashed line (Fig-
ure 2.3b). The arcs may be labeled to clarify the role that entities play in the rela-
tionship: In this case the labels are used to name direct and inverse relationships; the
labels are mandatory in the case of recursive relationships (Figure 2.3c).
The graphic notation also represents the structural properties of relationships:
cardinality and participation, which model, respectively, how many elements of one class can
be associated with elements of another class and whether an element of one class can
have elements of another class associated with it. Multivalued relationships are
represented graphically with a double arrow; optional relationships with a crossed line.
For example, a student might have passed zero or more exams, but an exam result
must be associated to a student. Figure 2.4 shows a schema for a library information
system.
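[Figure 2.2: the class Persons drawn at three levels of detail: (a) class name only, (b) class name and attributes, (c) class name, attributes, and value types]
[Figure 2.3: (a) the relationship HasPassed between Students and ExamResults; (b) the relationship Borrows between Users and Books, with the attribute Date; (c) the recursive relationship IsMotherOf on Persons, with the labels HasMother and HasChildren]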
[Figure 2.4: schema for a library information system, with the classes Terms, BibliographicDescriptions, Authors, Books, Borrowers, and Loans, and the labels UseAlso, Broader Term, Narrower Term, Indexes, Use, UsedFor, UseFor, AppearsIn, Describes, HasLoaned, and Regards]
Example 2.3 If we are interested both in Persons and Students, we have to model
two different and essential facts: the type of Students elements is a subtype of the
type of Persons elements, because all the possible students are a subset of all the possible
persons; the set of all actual Students is a subset of all actual Persons (i.e. the class
Students is a subclass of the class Persons) (Figure 2.5).
[Figure 2.5: the type of Students elements (Code: int, RegistrationYear: int, OwnerOf: CityCar) as a subtype of the type of Persons elements (Name: string, BirthDate: date, SpokenLanguages: seq string, OwnerOf: Car)]
[Figures: class hierarchies involving Persons, Students, Employees, Instructors, and Drivers]
[Figure: the library schema of Figure 2.4 with attributes: Terms (Term: string); BibliographicDescriptions (ISBN: string, Title: string, Publisher: string, Year: int); Authors (Name: string, Nationality: string, BirthDate: date); Books (Position: string, CopyNumber: int); Borrowers (Name: string, Address: string, Phones: seq string); Loans (Date: date, DueDate: date); ShortTermLoanBooks (Until: date); Students (Code: string); Faculty (OfficePhone: string)]
in 1976 and later extended with hierarchies. More recently, the Unified
Modeling Language (UML) has been establishing itself as the standard notation for object
modeling. Several tools also exist to specify diagrams; examples are ERwin from
Computer Associates, ER/Studio from Embarcadero Technologies, and
Rational Rose from Rational Software for UML. In addition, DBMS vendors
provide their own design tools, such as Oracle Designer and PowerDesigner
from Sybase. Just to give an idea of the alternative graphical notations, the
library schema in Figure 2.4 is shown in Figure 2.9 using the ER notation.
2.4 Exercises
1. A university database contains information about professors (identified by social
security number, or SSN) and courses (identified by courseid). Professors teach
courses; each of the situations described below concerns the Teaches relationship
set. For each of the following situations, draw a diagram that captures it (assuming
that no further constraints hold):
(a) Professors can teach the same course in several semesters, and each offering
must be recorded.
(b) Professors can teach the same course in several semesters, and only the most
recent such offering needs to be recorded. (Assume that this is the case in all
subsequent questions.)
(c) Every professor must teach some course.
(d) Every professor teaches exactly one course (no more, no less).
(e) Every professor teaches exactly one course (no more, no less), and every course
must be taught by some professor.
(f) Now suppose that certain courses can be taught by a team of professors jointly,
but it is possible that no one professor in a team can teach the course. Model
this, introducing additional entity sets and relationship sets if necessary.
[Figure 2.9: the library schema of Figure 2.4 in the ER notation, with (min,max) cardinality annotations such as (0,1), (0,n), (1,1), and (1,n) on the relationships]
2. Let us assume that a company uses the following worksheet to store data about
its computers.
4. Design a conceptual schema for a database to keep track of actors and directors of
films. Each actor or director has a unique name, a birth year, and a nationality. An
actor may be also a director. Each film has a title, the production year, the actors,
a director, and a producer. Films produced the same year have different titles.
5. We would like to design a database to maintain the following facts. Trains are
either local trains or express trains, but never both. A train has a unique number
and an engineer. Stations are either express stops or local stops, but never both.
A station has a name (assumed unique) and an address. All local trains stop at
all stations. Express trains stop only at express stations. For each train and each
station the train stops at, there is a time. Design a conceptual schema for the
database.
6. Consider the following information about a manufacturing company's parts and
suppliers database. The database contains information about the way certain parts
are manufactured out of other parts: the subparts that are involved in the manufacture
of a part, the number of subparts used, the cost of manufacturing a part from
its subparts, the mass of the part as a result of assembling the subparts. The manufactured
parts may themselves be subparts in a further manufacturing process.
In addition, certain information must be held on the parts themselves: their code,
name and, if they are imported (i.e., manufactured externally), the supplier and
the purchase cost. Suppliers have a code, a name, several phones and an address.
Design a conceptual schema for the database.
7. Design a conceptual schema for a Company database to keep track of a company's
employees, departments, and projects. The company is organized into departments.
Each department has a unique name, a unique number, a location, and a manager
who is one of its employees. We keep track of the start date when the employee
began managing the department. A department controls a number of projects,
each of which has a unique name and a unique number. An employee has a name, a
social security number, address, salary, sex (m or f), and birthdate. An employee
is assigned to one department but may work on several projects, which are not
necessarily controlled by the same department. We keep track of the percent-time
that an employee works on each project. We also keep track of the direct supervisor
of each employee, who belongs to the same department, and the start date when
the employee began acting as supervisor. We want to keep track of the dependents
of each employee for insurance purposes. We keep each dependent's name, sex,
birthdate, and relationship (spouse or child or other) to the employee (assume
that only one parent works for the company). We are not interested in information
about dependents once the parent leaves the company.
Chapter 3
- In the object data model the structure of the objects can be complex,
whereas in the relational data model the structure of a tuple is simple,
i.e. the values of the components of a tuple are elementary.
- In the object data model the associations are modeled as sets of tuples of objects, whereas
in the relational data model associations are described by attributes whose
values can only be the key values of the associated elements of some other
relation.
- In the object data model the structure of an object is defined together with
the representation of the procedural knowledge, whereas in the relational
data model only a mechanism to describe the structure of the tuples is
provided.
[Figure: fragment of the library schema: BibliographicDescriptions (ISBN: string, Title: string, Publisher: string, Year: int), Authors, and Books, with the relationships AppearsIn and Describes]
[Figure: the ODM schema with the classes Students and ExamResults and the relationship HasPassed, and the corresponding relational schema, in which the relationship is represented by the foreign key Candidate:
Students(Name: string, StudentCode: string <<PK>>, City: string, BirthYear: int)
ExamResults(Subject: string <<PK>>, Candidate: string <<PK>> <<FK(Students)>>, Date: string, Grade: int)]
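As an illustration of how a foreign key replaces an ODM relationship, the relational schema above could be declared as in the following sketch, which uses Python's standard `sqlite3` module. The SQL column types and the rejected test row are assumptions made for the example; the two accepted rows are taken from the sample database used later in this chapter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only on request
conn.executescript("""
CREATE TABLE Students (
    Name        TEXT,
    StudentCode TEXT PRIMARY KEY,
    City        TEXT,
    BirthYear   INTEGER
);
CREATE TABLE ExamResults (
    Subject   TEXT,
    Candidate TEXT REFERENCES Students(StudentCode),
    Date      TEXT,
    Grade     INTEGER,
    PRIMARY KEY (Subject, Candidate)
);
""")
conn.execute("INSERT INTO Students VALUES ('Rossi', '067459', 'Lucca', 1960)")
conn.execute("INSERT INTO ExamResults VALUES ('DA', '067459', '15/09/84', 30)")

try:
    # Hypothetical row: no student has code 999999, so the foreign key rejects it.
    conn.execute("INSERT INTO ExamResults VALUES ('DA', '999999', '01/01/85', 18)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

stored = conn.execute("SELECT COUNT(*) FROM ExamResults").fetchone()[0]
```

The association HasPassed has no table of its own: it is recovered by matching `Candidate` values against `StudentCode` values.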
1. A constant relation is written by listing its tuples within { }, for example {(A1 := 2, A2 :=
125); (A1 := 3, A2 := 250)}.
20 CHAPTER 3 Database: The Relational Designer Perspective c A. Albano
The result is a relation with the same type as R, whose tuples are those of R
which satisfy the condition.
Set union: R ∪ S
R and S are relations of the same type, and the result is a relation with tuples
which are in R or S or both.
Set difference: R − S
R and S are relations of the same type, and the result is a relation with tuples
which are in R but not in S.
Product: R × S
$R\{A_1 : T_1, \dots, A_n : T_n\}$ and $S\{A_{n+1} : T_{n+1}, \dots, A_{n+m} : T_{n+m}\}$ are relations
with disjoint sets of attributes. The result is a relation of type
$\{A_1 : T_1, \dots, A_n : T_n, A_{n+1} : T_{n+1}, \dots, A_{n+m} : T_{n+m}\}$ whose tuples are all possible tuples whose
first $n$ components form a tuple in $R$ and whose last $m$ components form a
tuple in $S$.
Let us show how these operators can be used to write queries using the
following database:
Students
Name      StudentCode   City      BirthYear
Isaia     071523        Pisa      1962
Rossi     067459        Lucca     1960
Bianchi   079856        Livorno   1961
Bonini    075649        Pisa      1962

ExamResults
Subject   Candidate   Date       Grade
DA        071523      12/01/85   28
DA        067459      15/09/84   30
MTI       079856      25/10/84   30
DA        075649      27/06/84   25
LFC       071523      10/10/83   18
Example 3.2 First, we find the name and the student code of all the students of
Pisa.
$\pi_{Name,\,StudentCode}(\sigma_{City = \text{'Pisa'}}(Students))$
Name     StudentCode
Isaia    071523
Bonini   075649
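The same query can be reproduced with a small Python sketch in which a relation is a list of dictionaries. The helper names `select` and `project` are assumptions introduced for the illustration; they are not part of any library.

```python
students = [
    {"Name": "Isaia",   "StudentCode": "071523", "City": "Pisa",    "BirthYear": 1962},
    {"Name": "Rossi",   "StudentCode": "067459", "City": "Lucca",   "BirthYear": 1960},
    {"Name": "Bianchi", "StudentCode": "079856", "City": "Livorno", "BirthYear": 1961},
    {"Name": "Bonini",  "StudentCode": "075649", "City": "Pisa",    "BirthYear": 1962},
]


def select(relation, condition):
    """Selection (sigma): keep the tuples that satisfy the condition."""
    return [t for t in relation if condition(t)]


def project(relation, attributes):
    """Projection (pi): keep the listed attributes and drop duplicate tuples."""
    result = []
    for t in relation:
        row = {a: t[a] for a in attributes}
        if row not in result:
            result.append(row)
    return result


pisa_students = project(
    select(students, lambda t: t["City"] == "Pisa"),
    ["Name", "StudentCode"],
)
```

The selection is applied first because the projection removes the attribute `City` on which the condition depends.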
Next, suppose we want to find the names of all those students who have passed the
exam "DA" with grade 30, plus the examination date. Let us compute the result in
more than one step, using the following strategy: since we need information from both
the Students relation and the ExamResults relation, let us first compute the product
of the two relations, producing the following temporary relation T :
T := Students × ExamResults
which can be very large: if there are n tuples in Students and m tuples in ExamResults,
then there are n × m tuples in T .
c A. Albano 3.2 Relational Algebra 21
However, the only meaningful tuples in T are those with equal values for the attributes
StudentCode and Candidate.
$R := \sigma_{StudentCode = Candidate}(T)$
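The step-by-step strategy can be traced on the sample database with the following Python sketch. The first two steps mirror the definitions of T and R above; the final selection and projection are an assumed completion of the query, written in the obvious way.

```python
from itertools import product as cartesian

students = [
    {"Name": "Isaia",   "StudentCode": "071523", "City": "Pisa",    "BirthYear": 1962},
    {"Name": "Rossi",   "StudentCode": "067459", "City": "Lucca",   "BirthYear": 1960},
    {"Name": "Bianchi", "StudentCode": "079856", "City": "Livorno", "BirthYear": 1961},
    {"Name": "Bonini",  "StudentCode": "075649", "City": "Pisa",    "BirthYear": 1962},
]
exam_results = [
    {"Subject": "DA",  "Candidate": "071523", "Date": "12/01/85", "Grade": 28},
    {"Subject": "DA",  "Candidate": "067459", "Date": "15/09/84", "Grade": 30},
    {"Subject": "MTI", "Candidate": "079856", "Date": "25/10/84", "Grade": 30},
    {"Subject": "DA",  "Candidate": "075649", "Date": "27/06/84", "Grade": 25},
    {"Subject": "LFC", "Candidate": "071523", "Date": "10/10/83", "Grade": 18},
]

# T := Students x ExamResults; the attribute sets are disjoint, so the
# dictionaries can be merged without name clashes.
T = [{**s, **e} for s, e in cartesian(students, exam_results)]

# R := selection of the tuples of T with StudentCode = Candidate.
R = [t for t in T if t["StudentCode"] == t["Candidate"]]

# Assumed final steps: keep the "DA" exams with grade 30, then project.
answer = [
    {"Name": t["Name"], "Date": t["Date"]}
    for t in R
    if t["Subject"] == "DA" and t["Grade"] == 30
]
```

With 4 students and 5 exam results, T holds 20 tuples, of which only 5 survive the selection; this is the cost that the join operator avoids paying explicitly.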
R and S are relations of the same type, and the result is a relation with tuples
which are both in R and in S.
Join: $R \bowtie_{R.A_i = S.A_j} S$
Natural Join: $R \bowtie S$
The natural join is only applicable when both R and S have attributes with the
same name. Let us assume that R and S have the common attribute Ai . The
result is computed by selecting those tuples of R × S that have the same value
for the common attribute Ai , and excluding one of the common attributes
from the result.
The following operators are examples of useful extended relational algebra
operations.
Generalized projection: $\pi_{e_1\ \mathrm{AS}\ ide_1,\ e_2\ \mathrm{AS}\ ide_2,\ \dots,\ e_n\ \mathrm{AS}\ ide_n}(E)$
1. The tuples of $E$ are partitioned in groups in such a way that all the tuples
in a group have the same values for $A_1, \dots, A_n$.
2. For each group with attribute values $a_1, \dots, a_n$, the result has a tuple
$(a_1, \dots, a_n, v_1, \dots, v_m)$
where for each $i$, $v_i$ is the result of applying the aggregation function $f_i$ on
the multiset of $B_i$ values in the group.
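The two steps above can be sketched in Python as follows. The function `group_by` and the representation of the aggregates as (function, attribute) pairs are assumptions made for the illustration; the data are the ExamResults tuples of the sample database.

```python
from collections import defaultdict

exam_results = [
    {"Subject": "DA",  "Candidate": "071523", "Date": "12/01/85", "Grade": 28},
    {"Subject": "DA",  "Candidate": "067459", "Date": "15/09/84", "Grade": 30},
    {"Subject": "MTI", "Candidate": "079856", "Date": "25/10/84", "Grade": 30},
    {"Subject": "DA",  "Candidate": "075649", "Date": "27/06/84", "Grade": 25},
    {"Subject": "LFC", "Candidate": "071523", "Date": "10/10/83", "Grade": 18},
]


def group_by(relation, group_attrs, aggregates):
    """aggregates is a list of (f_i, B_i) pairs: function and attribute name."""
    # Step 1: partition the tuples by their values on the grouping attributes.
    groups = defaultdict(list)
    for t in relation:
        groups[tuple(t[a] for a in group_attrs)].append(t)
    # Step 2: one result tuple per group, extended with the aggregate values.
    result = []
    for key, members in groups.items():
        values = tuple(f([t[b] for t in members]) for f, b in aggregates)
        result.append(key + values)
    return result


# For each candidate: the best grade and the number of exams passed.
per_candidate = group_by(exam_results, ["Candidate"], [(max, "Grade"), (len, "Grade")])
```

Each aggregate receives a list, not a set, so repeated values are counted as many times as they occur, in line with the multiset semantics of step 2.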
[Figure: graphical representation of the grouping operator applied to R, with result attributes $(A_1, A_2, \dots, A_n, f_1, f_2, \dots, f_k)$]
For example, to find for each value of A1 the maximum value of A2 , and the
sum of the A3 values, we write the expression:
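1. Cascade of select
$\sigma_{\phi_X}(\sigma_{\phi_Y}(E)) = \sigma_{\phi_X \wedge \phi_Y}(E)$,
2. Select and project are commutative
$\pi_Y(\sigma_{\phi_X}(E)) = \sigma_{\phi_X}(\pi_Y(E))$, if $X \subseteq Y$.
If $X \not\subseteq Y$, then:
$\pi_Y(\sigma_{\phi_X}(E)) = \pi_Y(\sigma_{\phi_X}(\pi_{XY}(E)))$.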
STEP 1: Representation of 1:N and 1:1 associations with the rules in Fig-
ure 3.5.
[Figure 3.5: rules for representing 1:N and 1:1 associations. An association R between A and B, together with its attributes, if any, is represented in the relation for A by an attribute R <<FK(B)>>.]
[Figure: rules for representing N:M associations. An association R between A and B becomes a relation R with the attributes A <<PK>> <<FK(A)>> and B <<PK>> <<FK(B)>>, plus the attributes of R, if any; for a recursive association on A, the relation R has the attributes A1 <<PK>> <<FK1(A)>> and A2 <<PK>> <<FK2(A)>>.]
3.3.1 Exercises
1. Convert the following conceptual schemas to a relational database schema.
(a) Your solution to Exercise 2.4(3).
(b) Your solution to Exercise 2.4(5).
(c) Your solution to Exercise 2.4(6).
[Figure: alternative relational representations of an ODM subclass hierarchy with superclass A (key KA <<PK>>, attributes XA) and subclasses B (attributes XB) and C (attributes XC)]
The library has a set of books (not more than one copy per book), each
identified by a unique book number. Books may be loaned to borrowers, each
identified by a unique name, and having an address and telephone number; a
library user can have more than one book on loan at the same time; the lending
date is also recorded. The key of the relation is {UserName, CallNumber}. An
example of an instance of the relation is:
c A. Albano 3.4 Relational Database Design: Normalization Theory 29
The above schema is "bad" because it presents the following main undesirable
properties:
- Repetition of information. Every time a user borrows another book, the
information about his address and telephone will be repeated; this wastes
space and complicates database updating when a user changes address.
- Inability to represent certain information. Information about users can be
stored only when they borrow a book.
An alternative design is to replace the schema with two relation schemas, but
a careless decomposition may lead to another kind of "bad" design. Consider
the following rather absurd decomposition where the association between loans
and borrowers is modeled by the telephone numbers:
Users(UserName, Address, Tel)
Loans(CallNumber, Author, Title, Date, Tel)
The instances of the two relations are obtained by projections of the Library
relation as follows:
Users = $\pi_{UserName,\,Address,\,Tel}(Library)$ =
which is wrong since Laura Paolicchi has not borrowed a book in January.
Thus, when we join Users and Loans we have more tuples in the result than
those we expect. This anomaly is called a loss of information and the decom-
position is called a lossy decomposition. The reason for this anomaly is that
we have selected the wrong foreign key to describe the association between users and
loans. A correct design would have been
Users(UserName, Address, Tel)
Loans(CallNumber, Author, Title, Date, UserName)
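The difference between the two designs can be checked mechanically. The following Python sketch uses a hypothetical two-tuple instance of the Library relation (all names and values are placeholders, not the instance discussed in the text) in which two users share a telephone number, and compares the join obtained with Tel as the connecting attribute against the one obtained with UserName.

```python
def project(relation, attributes):
    """Projection with duplicate elimination."""
    result = []
    for t in relation:
        row = {a: t[a] for a in attributes}
        if row not in result:
            result.append(row)
    return result


def natural_join(r, s):
    """Combine the tuples of r and s that agree on all common attributes."""
    return [{**t, **u} for t in r for u in s
            if all(t[a] == u[a] for a in t.keys() & u.keys())]


# Hypothetical instance: two users at the same address share a telephone.
library = [
    {"UserName": "User A", "Address": "Addr 1", "Tel": "555-0100",
     "CallNumber": "C1", "Author": "Author 1", "Title": "Title 1", "Date": "Jan"},
    {"UserName": "User B", "Address": "Addr 1", "Tel": "555-0100",
     "CallNumber": "C2", "Author": "Author 2", "Title": "Title 2", "Date": "Feb"},
]

users = project(library, ["UserName", "Address", "Tel"])
loans_by_tel = project(library, ["CallNumber", "Author", "Title", "Date", "Tel"])
loans_by_user = project(library, ["CallNumber", "Author", "Title", "Date", "UserName"])

lossy = natural_join(users, loans_by_tel)      # joins on Tel
lossless = natural_join(users, loans_by_user)  # joins on UserName
spurious = [t for t in lossy if t not in library]
```

Joining on Tel attributes each loan to both users, so the result contains tuples that were never in Library; joining on UserName returns exactly the original tuples.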
The main goal of relational design theory is to give formal criteria to design
databases without anomalies of the types represented by the above examples.
In the following, we will assume that attributes have a global meaning, i.e.
attributes mean the same wherever they occur in a database schema, and we
adopt the following conventions:
- Capital letters near the beginning of the alphabet stand for single attributes
(A, B, A1 , A2 , etc.).
- Capital letters near the end of the alphabet stand for sets of attributes
(X, Y, U, Z, etc.).
- XY is used as a shorthand for X ∪ Y , AB as a shorthand for {A, B}, and
AX as a shorthand for {A} ∪ X.
- A1 A2 . . . An is a shorthand for {A1 , A2 , . . . , An }.
- Names beginning with a capital letter denote relation schemas, and R(T ) a
relation with a set of attributes T .
- Let t be a tuple, R(T ) a relation schema, and X ⊆ T , then t[X] denotes the
X-value of t.
$\{X \to Y, X \to Z\} \models X \to YZ$
and, if $W \subseteq X$,
$\{\} \models X \to W$
An interesting question is whether there is a way of computing all the possible
FDs logically implied by a set F , using a set of inference rules with the property
of being sound and complete so that we can derive mechanically all the FDs
implied by F , and only those.
F1 (reflexivity) If Y ⊆ X, then X → Y
F2 (augmentation) If X → Y, Z ⊆ T, then XZ → Y Z
F3 (transitivity) If X → Y, Y → Z, then X → Z
Using these rules, the following rules can also be proved correct:
{X → Y, X → Z} ⊢ X → YZ (union rule)
if Z ⊆ Y, {X → Y} ⊢ X → Z (decomposition rule)
{} ⊢ X → X
{X → Y} ⊢ XZ → Y
if W ⊆ Z and V ⊆ Y, {X → Y} ⊢ XZ → VW
So far, we have discussed derived dependencies in two ways: we have talked
about logically implied dependencies (|=) and about dependencies which are
inferred using Armstrong's axioms as deduction rules (⊢). In fact, these two
ways of defining derived dependencies are the same: if a functional dependency
f can be inferred from a set F using Armstrong's axioms, then f is logically
implied by F (soundness), and, vice versa, if f is logically implied by F , then
f can also be inferred using Armstrong's axioms (completeness).
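In practice, implication is usually tested without enumerating derivations, by computing the closure of a set of attributes. The following Python sketch uses this standard technique, which is not presented in the text above; it relies on the fact that X → Y is implied by F exactly when Y is contained in the closure of X with respect to F. The function names and the single-letter attribute encoding are assumptions of the example.

```python
def closure(attributes, fds):
    """Return the set of attributes functionally determined by `attributes`."""
    result = set(attributes)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If the left-hand side is already derivable, so is the right-hand side.
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def implies(fds, lhs, rhs):
    """True when lhs -> rhs is logically implied by fds."""
    return set(rhs) <= closure(lhs, fds)


# Single-letter attribute names, so the string "YZ" stands for the set {Y, Z}.
F = [("X", "Y"), ("X", "Z")]

union_rule = implies(F, "X", "YZ")    # {X -> Y, X -> Z} |= X -> YZ
reflexivity = implies([], "XW", "W")  # {} |= XW -> W, since W is a subset of XW
not_implied = implies(F, "Y", "X")    # Y -> X does not follow from F
```

The loop stops as soon as a full pass adds no attribute, so the number of passes is bounded by the number of attributes.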
2. There are several equivalent sets of rules and we present just one of them here.
Definition 3.4.6 Given the schema $R\langle T, F\rangle$, we say that $W \subseteq T$ is a key
(or a candidate key) of $R$ if
1. $W \to T \in F^+$
2. $\forall V \subset W,\ V \to T \notin F^+$
In general, there are many candidate keys for a relation, and we designate one
of them as the primary key to be used in representing associations. We also
use the term superkey for any superset of a key and the term prime attribute
for an attribute which belongs to a candidate key. The following results have
been proved for keys:
1. The problem of finding all the keys of a relation requires an algorithm with
an exponential time complexity.
2. The problem of testing whether an attribute is prime is NP-complete.
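To make the exponential cost concrete, the following Python sketch finds all candidate keys by brute force, testing the subsets of T in order of increasing size. It is only a naive illustration, not an algorithm from the literature; the schema R(A, B, C) with the FDs A → B and B → A is a hypothetical example, and the closure test is the standard one.

```python
from itertools import combinations


def closure(attributes, fds):
    """Return the set of attributes functionally determined by `attributes`."""
    result = set(attributes)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def candidate_keys(attributes, fds):
    """Enumerate subsets by increasing size, so minimal superkeys come first."""
    attributes = sorted(attributes)
    keys = []
    for size in range(1, len(attributes) + 1):
        for subset in combinations(attributes, size):
            candidate = set(subset)
            if any(key <= candidate for key in keys):
                continue  # contains a smaller key: a superkey, but not minimal
            if closure(candidate, fds) == set(attributes):
                keys.append(candidate)
    return keys


# Hypothetical schema R(A, B, C) with A -> B and B -> A.
keys = candidate_keys("ABC", [("A", "B"), ("B", "A")])
prime_attributes = set().union(*keys)
```

In the worst case the loop visits all the non-empty subsets of T, which is where the exponential behavior comes from.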
Definition 3.4.7 Two sets of FDs, $F$ and $G$, over schema $R$ are equivalent,
written $F \equiv G$, iff $F^+ = G^+$. If $F \equiv G$, then $F$ is a cover for $G$ (and $G$ a cover
for $F$).
- no dependency in F is redundant.
The following example shows that in general a set F of FDs can have more
than one canonical cover.
That is, every legal instance $r$ is the natural join of its projections onto the
$R_i$'s. From the definition of the natural join operator, the following result can
be proved.
Example 3.4 Let us consider the following instance of the relation R(A, B, C):
A B C
a1 b c1
a2 b c2
The following decomposition is not data preserving because $r \subset (\pi_{A,B}\, r) \bowtie (\pi_{B,C}\, r)$, i.e. the join contains tuples that are not in $r$.
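The containment can be verified on the instance of Example 3.4 with a short Python sketch. The helpers `project` and `natural_join` are assumptions introduced for the illustration.

```python
def project(relation, attributes):
    """Projection with duplicate elimination."""
    result = []
    for t in relation:
        row = {a: t[a] for a in attributes}
        if row not in result:
            result.append(row)
    return result


def natural_join(r, s):
    """Combine the tuples of r and s that agree on all common attributes."""
    return [{**t, **u} for t in r for u in s
            if all(t[a] == u[a] for a in t.keys() & u.keys())]


r = [
    {"A": "a1", "B": "b", "C": "c1"},
    {"A": "a2", "B": "b", "C": "c2"},
]

joined = natural_join(project(r, ["A", "B"]), project(r, ["B", "C"]))
spurious = [t for t in joined if t not in r]
```

Since both tuples share the value b for B, every AB-tuple joins with every BC-tuple: the result has four tuples, two of which are not in r.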
$\pi_{T_i}(F) = \{X \to Y \in F^+ \mid X, Y \subseteq T_i\}$
Example 3.6 Let us consider the following schema ZipCodes(City, Street, Zip), with
FDs
City Street → Zip
Zip → City
That is, the address (city and street) determines the zip code, and the zip code
determines the city, although not the street address. Since the candidate keys are
{City, Street} and {Street, Zip}, all attributes are prime, and thus the schema is in 3NF,
but it suffers from the repetition of information problem. Consequently, 3NF does not
completely solve the problem of detecting "bad" schemas, and another normal form is
required.
The schema ZipCodes(City, Street, Zip) from Example 3.6 is a well-known
example showing that a relation schema can be in 3NF without being in BCNF.
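The claim can be checked with the following Python sketch, which computes the candidate keys of ZipCodes by brute force and then tests the two normal forms on the FDs of F. It assumes, as is the case here, that every FD in F has a single attribute on its right-hand side, so that inspecting F alone is enough; the variable and function names are illustrative.

```python
from itertools import combinations


def closure(attributes, fds):
    """Return the set of attributes functionally determined by `attributes`."""
    result = set(attributes)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


T = {"City", "Street", "Zip"}
F = [(("City", "Street"), ("Zip",)), (("Zip",), ("City",))]


def is_superkey(attributes):
    return closure(attributes, F) == T


# Candidate keys: minimal attribute sets whose closure is T.
keys = []
for size in range(1, len(T) + 1):
    for subset in combinations(sorted(T), size):
        if is_superkey(subset) and not any(k <= set(subset) for k in keys):
            keys.append(set(subset))
prime = set().union(*keys)

# A nontrivial FD whose left-hand side is not a superkey violates BCNF.
bcnf_violations = [(lhs, rhs) for lhs, rhs in F
                   if not set(rhs) <= set(lhs) and not is_superkey(lhs)]

# 3NF tolerates such an FD when every attribute it adds is prime.
in_3nf = all(set(rhs) - set(lhs) <= prime for lhs, rhs in bcnf_violations)
in_bcnf = not bcnf_violations
```

The only offending dependency is Zip → City: Zip is not a superkey, but City is prime, which is exactly the case that 3NF allows and BCNF forbids.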
If F is a canonical cover, then the following result holds
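$\rho = \{R\langle T, F\rangle\}$
while there exists in $\rho$ an $R_i\langle T_i, F_i\rangle$ not in BCNF because of the FD $X \to A$ do
    $T_1 = XA$
    $F_1 = \pi_{T_1}(F_i)$
    $T_2 = T_i - A$
    $F_2 = \pi_{T_2}(F_i)$
    $\rho = \rho - R_i + \{R_1\langle T_1, F_1\rangle, R_2\langle T_2, F_2\rangle\}$
end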
The decomposition is data preserving but, in general, not dependency preserving,
as shown by the following example: $R\langle\{J, K, L\}, \{JK \to L, L \to K\}\rangle$
is not in BCNF; however, every decomposition will fail to preserve $JK \to L$.
Thus, a BCNF decomposition that is both data preserving and dependency
preserving cannot always be obtained.
[?] gave an algorithm with polynomial time complexity O(a^5 p) to compute
a data preserving decomposition in BCNF, although it will sometimes
decompose a relation that is already in BCNF. However, the problem of deciding
whether a relation schema has a dependency preserving decomposition
in BCNF is NP-hard.
Employees
EmplName   ChildName   Salary    Year
Bragazzi   Maurizio    1000000   1980
Bragazzi   Maurizio    1200000   1984
Bragazzi   Maurizio    1400000   1988
Bragazzi   Marcello    1000000   1980
Bragazzi   Marcello    1200000   1984
Bragazzi   Marcello    1400000   1988
Fantini    Maria       1000000   1980
Fantini    Maria       800000    1984
Fantini    Maria       600000    1988
Y ”, iff in any instance r of R, for any two tuples t1, t2 ∈ r with t1[X] = t2[X],
there exists a tuple t3 ∈ r such that t3[X] = t1[X] = t2[X], t3[Y] = t1[Y], and
t3[Z] = t2[Z].
Theorem 3.4.6 [?] The following axioms are sound and complete for functional
and multivalued dependencies:
F1 (reflexivity) If Y ⊆ X, then X → Y
F2 (augmentation) If X → Y, Z ⊆ T, then XZ → YZ
F3 (transitivity) If X → Y, Y → Z, then X → Z
M1 (complementation) If X→→Y, then X→→T − XY
M2 (multivalued augmentation) If V ⊆ W, W ⊆ T, X→→Y, then XW→→YV
M3 (multivalued transitivity) If X→→Y, Y→→Z, then X→→Z − Y
M4 (replication) If X → Y, then X→→Y
M5 If Z′ ⊆ Z, Y ∩ Z = ∅, X→→Y, Y → Z′, then X → Z′
The following theorem shows how MVDs are related to lossless decomposition.
Definition 3.4.18 A relation schema R < T, D > is in 4NF if for every non-
trivial MVD X→→Y in R, X is a superkey of R.
A relation that is not in 4NF can be decomposed in much the same way
as we constructed BCNF database schemas. The resulting decomposition is
data preserving. However, in general, it is not possible to design a database
schema that meets the three criteria: 4NF, dependency preservation, and data
preservation. Moreover, it is not known how (or if) a synthesis algorithm can
handle MVDs.
Other kinds of dependencies have been defined to avoid other forms of data
redundancy in a relation schema. The interested reader may consult [?] for a
fuller discussion of dependency theory, including other topics which have not
been addressed here.
3.4.10 Exercises
1. Prove that for a schema R < T, F >, with F a canonical cover, if an attribute Ai
does not appear in the right side of any FD, then Ai belongs to every key of R.
2. Prove that if a schema R < T, F > has two attributes only, then it is in BCNF.
3. Prove that if a schema R < T, F > is in 3NF, and all keys consist of a single
attribute, then it is in BCNF. Hint: prove that for each X → A ∈ F, X is a
superkey.
4. For each of the following relational schemas and set of functional dependencies:
(a) R(A, B, C, D) with functional dependencies AB → C, C → D, and D → A.
(b) R(A, B, C, D) with functional dependencies A → B, and A → C.
(c) R(A, B, C, D) with functional dependencies A → B, and B → C.
do the following:
(a) Find all the keys of R,
(b) Indicate all the BCNF violations.
(c) Decompose the relations, as necessary, into collections of relations that are in
BCNF. Say if the decomposition is dependency preserving.
(d) Indicate all the 3NF violations.
(e) Decompose the relations, as necessary, into collections of relations that are in
3NF and are data preserving.
5. Consider the following poorly designed relational schema:
UnivInfo(studID, studName, course, profID, profOffice)
Each tuple in relation UnivInfo encodes the fact that the student with the given ID
and name took the given course from the professor with the given ID and office.
Assume that students have unique IDs but not necessarily unique names, and
professors have unique IDs but not necessarily unique offices. Each student has
one name; each professor has one office.
(a) Specify a set of completely nontrivial functional dependencies for relation Uni-
vInfo that encodes the assumptions described above but no additional assump-
tions.
(b) Decompose relation UnivInfo into BCNF according to your functional
dependencies in part (a).
(c) Now add the following two assumptions: (1) No student takes two different
courses from the same professor; (2) No course is taught by more than one
professor. Modify your set of functional dependencies from part (a) to take
these new assumptions into account.
42 CHAPTER 3 Database: The Relational Designer Perspective c A. Albano
Chapter 4
All above features are guaranteed by a Data Base Management System (DBMS),
defined as follows:
1. The term “user” is adopted throughout this paper to mean either an end-user or an
application program which is performing data manipulation operations.
c A. Albano 4.2 Functions of a DBMS 45
The physical level is the lowest level of abstraction at which the database is
described. This level contains the description of the data structures used to
store and access the data. The principal data structures used will be discussed
in sections ??–??.
The logical level, often called the conceptual level, is the next level of ab-
straction and describes the logical structure of the data and the relationships
established among them, i.e. the schema, using a language which supports the
abstraction mechanisms of a particular data model. The language used for the
classical data models — the hierarchical, network, and relational data models,
discussed below — is called the Data Description Language (DDL), since only
data are described in the database schema and not procedural aspects.
The logical view level is the level at which that part of the entire database
which is accessible to a certain class of users is described (external schema).
There may be many views of the same database, and all of them are defined in
terms of the schema given at the logical level. For example, only some classes
may be accessible and only a subset of the attributes of an element are visible
for a particular user category. An external schema is not necessarily a subset
of a schema; it can also contain new classes, defined in terms of those actually
present in the database.
The description of the database at these different levels is given by the
person responsible for creating the database, usually known as the database
administrator (DBA), and the information in the schema is usually stored in
a system catalog, described in the following, which constitutes an additional
database that can be queried by users.
Example 4.1 The difference between the levels of data description can be under-
stood using an example of a relational database for university employees. At the
logical level, the database structure is described in terms of the following table:
At the logical view level, the administration office and the library are not allowed
to access all the information in the table Persons, but only a subset of it:
These three levels of data description were proposed in 1978 by the ANSI/X3/
SPARC study group on DBMSs, with the aim of guaranteeing two important
properties: physical and logical data independence.
Physical data independence means that modifications to the physical database
organization will not imply modifications to applications programs.
Logical data independence means that the mechanism used to define external
schemas should ensure that certain modifications to the logical schema, such
as adding new definitions, will not entail changes to the application
programs, but simply a redefinition of the associated external schemas
in terms of the new logical schema. The only kind of change in the logical
schema that cannot be reflected in a redefinition of an external schema is the
deletion of information in the logical schema which corresponds to information
present in the external schema. Logical data independence is highly desirable
because of the costs involved in software maintenance.
Although these three levels of data description are not supported in most
DBMSs, some systems, for example the relational ones, have physical and
logical data independence.
- procedural, which are “record oriented”, in the sense that they deliver one
record at a time and require that a user, wishing to retrieve a particular
set of records, writes a procedure which implements an appropriate search
strategy to “navigate” through the database structure;
- nonprocedural, or declarative, which are “set oriented”, in the sense that
they deliver a set of records satisfying a condition and require a user to
characterize the data he wants, with the system assuming the responsibility
for devising an appropriate search strategy.
- access control which limits the kind of access to the database allowed to
a particular user. In fact, although the purpose of a DBMS is to facilitate
database sharing by users, this sharing must be selective. The owner of data
should be able to specify the nature of the access privileges allowed to those
users who will access the data (i.e. read only or read/write), to allow certain
users to see only certain fields or certain records, or even to allow only a
view of aggregate values (such as averages);
- integrity control which prevents data which violate the constraints declared
in the database schema from being entered into the database;
- concurrency control which ensures that users simultaneously accessing a
database do not interfere with one another. In fact, when more than one
user accesses the same data, unpredictable results can occur.
Example 4.2 Let us assume that John and Jane have a joint savings account and
both go to different tellers. The current balance is $350. Jane wishes to add $400
to the account. John wishes to withdraw $50. Let us assume the following events
happen in the order in which they are shown:
Jane’s teller reads $350,
John’s teller reads $350,
Jane’s teller writes $750,
John’s teller writes $300.
The account now reads $300, and this certainly is not a correct way to allow more
than one person to use the same account.
- data recovery which entails restoring the database to a consistent state after
the occurrence and detection of a failure. A database may become incon-
sistent because of a transaction failure, a system failure, or a media (disk)
failure.
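The interleaving of Example 4.2 can be replayed with a few lines of code. This is only a simulation under simplifying assumptions: the function `run` and the schedule format are hypothetical, each teller keeps a private copy of the balance it has read, and no DBMS is involved.

```python
def run(schedule, balance=350):
    """Replay a schedule of (teller, operation, amount) steps on one account."""
    account = {"balance": balance}
    local = {}  # value last read by each teller
    for teller, op, amount in schedule:
        if op == "read":
            local[teller] = account["balance"]
        else:  # "write": store the value read earlier plus the amount
            account["balance"] = local[teller] + amount
    return account["balance"]

# The order of events of Example 4.2: both tellers read before either writes
interleaved = [("Jane", "read", 0), ("John", "read", 0),
               ("Jane", "write", 400), ("John", "write", -50)]

# A serial execution: Jane's operation completes before John's starts
serial = [("Jane", "read", 0), ("Jane", "write", 400),
          ("John", "read", 0), ("John", "write", -50)]

print(run(interleaved))
print(run(serial))
```

In the interleaved schedule John's write is based on a stale value and overwrites Jane's deposit; a concurrency control mechanism must make the outcome equal to that of some serial execution.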
A transaction can be interrupted because (a) the program has been coded
in such a way that if certain conditions are detected then an abort must
be issued, (b) because the DBMS detects a violation by the transaction of
some integrity constraint or access right, or (c) because it was decided to
terminate the transaction since it was involved in a deadlock detected by the
DBMS. When a transaction aborts, its actions are undone automatically by the
recovery facility, restoring the database to the state it had at the beginning
of the transaction.
When a media failure occurs, the recovery facility can use its historical data
to reconstruct the current database contents starting from a prior version of
the database.
Techniques used by DBMSs for concurrency control and data recovery will
be considered later.
Null values are not allowed in keys. One additional feature to note is that a
default value can be specified for an attribute. This value will be automatically
assigned to the attribute of a tuple should the tuple be inserted without this
attribute being given a specific value. Semantic constraints are specified using
the CHECK clause.
A relation schema can be modified using the ALTER TABLE statement and
deleted with the DROP TABLE statement.
In relational databases, it is common for tuples in one relation to reference
tuples in the same or other relations to model associations. It is a violation of
data integrity if the referenced tuple does not exist in the appropriate relation.
For example, it makes no sense to have an ExamResults tuple with candidate
100 and not have the tuple with StudentCode = 100 in the relation Students.
The requirement that the referenced tuple must exist is called referential
integrity. One important type of referential integrity is the so-called foreign
key constraint.
The following example shows how foreign key constraints are specified in
SQL:
The FOREIGN KEY clause has the option ON DELETE to specify what to do if a
referenced tuple is deleted. NO ACTION means that any attempt to remove a
Students tuple must be rejected outright if the student is referenced by an
ExamResults tuple. The option ON DELETE CASCADE means that the referencing tuples
are to be removed too. The option ON DELETE SET NULL means that the foreign
key attributes in the referencing tuples must be set to NULL. Similar options are
provided for ON UPDATE. NO ACTION is the default when
ON DELETE or ON UPDATE is not specified.
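The three ON DELETE options described above can be tried out with SQLite through Python's standard sqlite3 module. This is a sketch under assumptions: the columns Subject and Grade and the function `demo` are hypothetical, and SQLite enforces foreign keys only after PRAGMA foreign_keys = ON, a detail specific to that system.

```python
import sqlite3

def demo(on_delete):
    """Delete a referenced Students tuple under the given ON DELETE option."""
    con = sqlite3.connect(":memory:")
    con.execute("PRAGMA foreign_keys = ON")  # SQLite-specific switch
    con.execute("""CREATE TABLE Students (
                       StudentCode INTEGER PRIMARY KEY,
                       Name        TEXT)""")
    con.execute(f"""CREATE TABLE ExamResults (
                       Subject   TEXT,
                       Candidate INTEGER,
                       Grade     INTEGER,
                       FOREIGN KEY (Candidate)
                           REFERENCES Students (StudentCode)
                           ON DELETE {on_delete})""")
    con.execute("INSERT INTO Students VALUES (100, 'Rossi')")
    con.execute("INSERT INTO ExamResults VALUES ('DB', 100, 30)")
    try:
        con.execute("DELETE FROM Students WHERE StudentCode = 100")
    except sqlite3.IntegrityError:
        return "rejected"
    # report what is left in the referencing table
    return con.execute("SELECT Candidate FROM ExamResults").fetchall()

for option in ("NO ACTION", "CASCADE", "SET NULL"):
    print(option, "->", demo(option))
```

With NO ACTION the deletion is refused, with CASCADE the exam result disappears together with the student, and with SET NULL the exam result survives with a null Candidate.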
More general remedial actions can be specified when a constraint is violated
using the trigger mechanism: whenever a specific event occurs, a specified
action is executed.
Besides ordinary tables, virtual tables (called views) can also be defined with
the CREATE VIEW statement. A view can be queried as an ordinary table, but
its content does not physically exist in the database; instead, a definition of
how to construct the view from ordinary database tables is given as a query
with the CREATE VIEW statement and stored in the system catalog.
For example, the following view defines the students of Pisa:
CREATE VIEW PisaStudents AS
SELECT Name, StudentCode, BirthYear
FROM Students
WHERE City = 'Pisa';
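That a view is computed from the base table at query time, rather than stored, can be observed with a short experiment. The sketch below uses SQLite through Python's sqlite3 module; the sample rows and the column types are assumptions made for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE Students (
                   Name        TEXT,
                   StudentCode TEXT PRIMARY KEY,
                   City        TEXT,
                   BirthYear   INTEGER)""")
con.executemany("INSERT INTO Students VALUES (?, ?, ?, ?)",
                [("Rossi", "01234", "Pisa", 1990),
                 ("Bianchi", "05678", "Lucca", 1991)])

# The view definition given in the text
con.execute("""CREATE VIEW PisaStudents AS
                   SELECT Name, StudentCode, BirthYear
                   FROM Students
                   WHERE City = 'Pisa'""")

query = "SELECT Name FROM PisaStudents ORDER BY Name"
before = con.execute(query).fetchall()

# Changing the base table changes what the view returns
con.execute("UPDATE Students SET City = 'Pisa' WHERE Name = 'Bianchi'")
after = con.execute(query).fetchall()

print(before, after)
```

The second query returns one more row although nothing was ever inserted into PisaStudents, because only the definition of the view is stored.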
Access Control
Since databases often contain sensitive information, a DBMS ensures that
only those authenticated users who are authorized to access the database are
allowed to and they are only allowed to access information that has been
specifically made available to them.
SQL provides the GRANT and REVOKE statements to allow security to be set
up on the tables in the database. When a user creates a table, he automatically
becomes the owner of the table and receives full privileges for the table. To
allow other users access to the table, the owner must explicitly grant them
the necessary privileges using the GRANT statement:
GRANT { privilegeList | ALL PRIVILEGES } [(columnName [, columnName])]
ON objectName
TO { authorizationIdList | PUBLIC }
[ WITH GRANT OPTION ]
Privileges are the actions that a user is permitted to carry out on a given base
table or view (the objectName); examples are:
GRANT SELECT
ON Students
TO PUBLIC;
REVOKE SELECT
ON Students
FROM PUBLIC;
SELECT *
FROM Students
WHERE Name = 'Rossi';
SELECT is a keyword telling the database that this is a query. The asterisk
means to retrieve all columns; alternatively, you could have listed the desired
columns by name, separated by commas. The FROM Students clause identifies
the table from which you want to retrieve the data.
WHERE Name = 'Rossi' is a predicate, and all rows that make the predicate TRUE
are returned. This is an example of a set-at-a-time operation. The predicate is
optional, but in its absence the operation is performed on the entire table, so
that, in this case, the entire table would have been retrieved. The semicolon
is the statement terminator.
The relationship between SQL and relational algebra is as follows:
Set union: R ∪ S is equivalent to
SELECT *
FROM R
UNION
SELECT *
FROM S;
Set difference: R − S is equivalent to
SELECT *
FROM R
EXCEPT
SELECT *
FROM S;
Projection: πA1,A2,...,Am(R) is equivalent to
SELECT DISTINCT A1, A2, ..., Am
FROM R;
52 CHAPTER 4 DBMS: The User Perspective c A. Albano
Restriction: σCondition(R) is equivalent to
SELECT *
FROM R
WHERE Condition;
Product: R × S is equivalent to
SELECT *
FROM R, S;
Join: R ⊲⊳ S on the condition R.Ai = S.Aj is equivalent to
SELECT *
FROM R, S
WHERE R.Ai = S.Aj;
Natural join: R ⊲⊳ S is equivalent to
SELECT *
FROM R NATURAL JOIN S;
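The correspondence between the algebra operators and the SQL statements can be checked on a small instance by comparing SQL results with Python set operations. This is a sketch only: the relations R(A, B) and S(A, B), their contents, and the helper `q` are assumptions, and SQLite is used through the sqlite3 module.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE R (A INTEGER, B INTEGER)")
con.execute("CREATE TABLE S (A INTEGER, B INTEGER)")
R = {(1, 10), (2, 20)}
S = {(2, 20), (3, 30)}
con.executemany("INSERT INTO R VALUES (?, ?)", sorted(R))
con.executemany("INSERT INTO S VALUES (?, ?)", sorted(S))

def q(sql):
    """Run a query and return its result as a set of tuples."""
    return set(con.execute(sql).fetchall())

union = q("SELECT * FROM R UNION SELECT * FROM S")
difference = q("SELECT * FROM R EXCEPT SELECT * FROM S")
projection = q("SELECT DISTINCT A FROM R")
restriction = q("SELECT * FROM R WHERE B > 10")
product = q("SELECT * FROM R, S")
join = q("SELECT * FROM R, S WHERE R.A = S.A")
natural = q("SELECT * FROM R NATURAL JOIN S")

print(union == R | S, difference == R - S)
print(product == {r + s for r in R for s in S})
print(natural)
```

Since R and S have the same attributes here, the natural join coincides with the intersection of the two relations.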
– AVG([DISTINCT] Attr): Compute the average of the values in column Attr of the
query result. Again DISTINCT means that each value should be used only
once.
– MAX(Attr), MIN(Attr): Compute the maximum or the minimum value in the
column Attr.
For example, the following query returns the number of tuples in Students:
SELECT COUNT(*)
FROM Students;
The HAVING condition (unlike the WHERE condition) is applied to groups, not
to individual tuples (Figure 4.2).
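The different roles of WHERE and HAVING can be seen by running a grouped query with and without a WHERE condition. The following sketch uses SQLite through Python's sqlite3 module; the table ExamResults(Subject, Candidate, Grade) and its rows are assumptions made for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE ExamResults (Subject TEXT, Candidate INTEGER, Grade INTEGER)")
con.executemany("INSERT INTO ExamResults VALUES (?, ?, ?)",
                [("DB", 100, 30), ("Algo", 100, 26),
                 ("DB", 200, 24),
                 ("Algo", 300, 18), ("DB", 300, 28)])

# WHERE discards single tuples before the groups are formed;
# HAVING then discards whole groups.
filtered = con.execute("""
    SELECT Candidate, COUNT(*), AVG(Grade)
    FROM ExamResults
    WHERE Grade >= 20
    GROUP BY Candidate
    HAVING COUNT(*) >= 2
    ORDER BY Candidate""").fetchall()

# Without WHERE, candidate 300 keeps both tuples and passes the HAVING test.
unfiltered = con.execute("""
    SELECT Candidate, COUNT(*), AVG(Grade)
    FROM ExamResults
    GROUP BY Candidate
    HAVING COUNT(*) >= 2
    ORDER BY Candidate""").fetchall()

print(filtered)
print(unfiltered)
```

Candidate 200 is excluded in both cases by HAVING, while candidate 300 is excluded only when the WHERE condition removes one of its tuples first.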
Finally, the order of tuples in the query result is generally unpredictable. If
a particular ordering is desired, the ORDER BY clause can be used:
SELECT Name, BirthYear
FROM Students
ORDER BY Name;
Ascending order is used by default, but descending order can also be specified:
SELECT Name, BirthYear
FROM Students
ORDER BY Name DESC;
Nested Queries
Nested subqueries increase the expressive power of SQL, but are one of the
most complex, expensive, and error-prone features of SQL.
Consider the query "list the student codes of the students who did not pass any
exam":
SELECT StudentCode
FROM Students
WHERE StudentCode NOT IN (
-- Students who have passed an exam
SELECT Candidate
FROM ExamResults ) ;
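One reason nested queries are error-prone is the treatment of null values by NOT IN. The sketch below runs the query above on sample data in SQLite (via Python's sqlite3 module) and compares it with a NOT EXISTS formulation; the sample rows and the columns Name, Subject and Grade are assumptions, and so is the presence of an ExamResults row with a null Candidate.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Students (StudentCode INTEGER PRIMARY KEY, Name TEXT)")
con.execute("CREATE TABLE ExamResults (Subject TEXT, Candidate INTEGER, Grade INTEGER)")
con.executemany("INSERT INTO Students VALUES (?, ?)",
                [(100, "Rossi"), (200, "Bianchi"), (300, "Verdi")])
con.executemany("INSERT INTO ExamResults VALUES (?, ?, ?)",
                [("DB", 100, 30), ("Algo", 100, 27), ("DB", 300, 25)])

NOT_IN = """SELECT StudentCode
            FROM Students
            WHERE StudentCode NOT IN (SELECT Candidate FROM ExamResults)"""
NOT_EXISTS = """SELECT StudentCode
                FROM Students S
                WHERE NOT EXISTS (SELECT *
                                  FROM ExamResults E
                                  WHERE E.Candidate = S.StudentCode)"""

without_exams = con.execute(NOT_IN).fetchall()
same_result = con.execute(NOT_EXISTS).fetchall()

# A single null in the subquery result makes NOT IN unknown for every
# student without a matching exam, so the outer query returns no rows.
con.execute("INSERT INTO ExamResults VALUES ('DB', NULL, 18)")
not_in_with_null = con.execute(NOT_IN).fetchall()
not_exists_with_null = con.execute(NOT_EXISTS).fetchall()

print(without_exams, same_result)
print(not_in_with_null, not_exists_with_null)
```

The two formulations agree as long as Candidate contains no nulls; declaring Candidate NOT NULL, or using NOT EXISTS, avoids the surprise.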
For INSERT, you simply identify the table and its columns and list the values,
as follows:
INSERT INTO Students (Name, StudentCode, City, BirthYear)
VALUES ('Rossi', '01234', 'Pisa', 1990);
This statement inserts a row with a value for every column. If a value is
specified for every column of the table, and the values are given in the same
order as the columns in the table, the column list can be omitted. A SELECT
statement can be used in place of the VALUES clause of the INSERT statement
to retrieve data from elsewhere in the database.
UPDATE is similar to SELECT in that it takes a predicate and operates on all
rows that make the predicate TRUE. For example:
UPDATE Students
SET City = 'Florence'
WHERE Name = 'Rossi';
This sets to ‘Florence’ the city for the student named ‘Rossi’. The SET clause
of an UPDATE command can refer to current column values. “Current” in this
case means the values in the column before any changes were made by this
statement.
DELETE is quite similar to UPDATE. The following statement deletes all rows
for students from ‘Pisa’:
DELETE FROM Students
WHERE City = 'Pisa';
You can only delete entire rows, not individual values. To do the latter, use
UPDATE to set the values to null. Be careful not to omit the predicate in a
DELETE statement: without it, every row of the table is deleted.
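The three modification statements can be exercised together on a scratch database. This is a sketch using SQLite through Python's sqlite3 module; the second student and the column types are assumptions, and `rowcount` is the attribute of sqlite3 cursors that reports how many rows a statement changed.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE Students (
                   Name        TEXT,
                   StudentCode TEXT PRIMARY KEY,
                   City        TEXT,
                   BirthYear   INTEGER)""")

# INSERT with and without the column list
con.execute("""INSERT INTO Students (Name, StudentCode, City, BirthYear)
               VALUES ('Rossi', '01234', 'Pisa', 1990)""")
con.execute("INSERT INTO Students VALUES ('Bianchi', '05678', 'Pisa', 1991)")

# UPDATE acts on every row that satisfies the predicate
updated = con.execute(
    "UPDATE Students SET City = 'Florence' WHERE Name = 'Rossi'").rowcount

# SET may refer to the current value of a column
con.execute("UPDATE Students SET BirthYear = BirthYear + 1")

# DELETE removes entire rows
deleted = con.execute("DELETE FROM Students WHERE City = 'Pisa'").rowcount
remaining = con.execute("SELECT Name, City, BirthYear FROM Students").fetchall()

# DELETE without a predicate empties the table
con.execute("DELETE FROM Students")
left = con.execute("SELECT COUNT(*) FROM Students").fetchone()[0]

print(updated, deleted, remaining, left)
```

Only Rossi survives the first DELETE, because the earlier UPDATE moved that student to Florence.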
constants, and types, structured data, and customized error handling. The
language compiler can fully check that SQL statements are well
formed. A notable example is Oracle PL/SQL.
Let us illustrate the approach by showing two programs which print the
name and birth year of the students of Pisa. The first example (Figure 4.3)
uses the standard cursor, while the second example uses a special construct
FOR with an implicit cursor (Figure 4.4).
import java.sql.*;

class PrintStudentsName {
  public static void main(String[] argv) {
    try { // to handle exceptions
      Class.forName("DBMS driver"); // load the JDBC driver
      Connection con = // connect
          DriverManager.getConnection("url", "login", "psw");
      Statement stmt = con.createStatement(); // set up stmt
      // The city is taken from the command line and concatenated into the
      // query text; a PreparedStatement would avoid SQL injection.
      String query = "SELECT Name, BirthYear " +
                     "FROM Students " +
                     "WHERE City = '" + argv[0] + "'";
      ResultSet iter = stmt.executeQuery(query);
      System.out.println("Names retrieved:");
      // loop through result tuples
      while (iter.next()) {
        String name = iter.getString("Name");
        int year = iter.getInt("BirthYear");
        System.out.println(" Name: " + name + "; BirthYear: " + year);
      }
      stmt.close(); con.close();
    } catch (SQLException ex) {
      System.out.println(ex.getMessage() + ex.getSQLState() + ex.getErrorCode());
    } catch (ClassNotFoundException ex) {
      System.out.println(ex.getMessage());
    }
  }
}
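The same cursor-style loop can be written with a parameterized query, so that the value supplied by the user is passed separately from the SQL text. The sketch below is an analogue in Python with the sqlite3 module, not a translation of the JDBC figure; the function `students_of` and the sample rows are hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE Students (
                   Name        TEXT,
                   StudentCode TEXT PRIMARY KEY,
                   City        TEXT,
                   BirthYear   INTEGER)""")
con.executemany("INSERT INTO Students VALUES (?, ?, ?, ?)",
                [("Rossi", "01234", "Pisa", 1990),
                 ("Bianchi", "05678", "Pisa", 1991),
                 ("Verdi", "09012", "Lucca", 1989)])

def students_of(city):
    """Return one formatted line per student of the given city."""
    # The "?" placeholder is bound to city; the value is never parsed as SQL.
    cur = con.execute(
        "SELECT Name, BirthYear FROM Students WHERE City = ? ORDER BY Name",
        (city,))
    # Iterating over the cursor delivers one result tuple at a time.
    return [f"Name: {name}; BirthYear: {year}" for name, year in cur]

for line in students_of("Pisa"):
    print(line)

# A malicious argument is treated as an (unknown) city name, not as SQL.
print(students_of("Pisa' OR '1'='1"))
```

With string concatenation the second call would return every student; with a bound parameter it returns none.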
gram. Before the program can be compiled by the host language compiler,
the SQL statements must be processed by a pre-compiler, which checks SQL
syntax and the number and types of arguments and results, and replaces the
statements with calls to a library of functions. At runtime these functions
communicate with the DBMS.
Let us illustrate the approach by showing a C program which prints the
name and birth year of the students of Pisa (Figure 4.6).
Figure 4.7 shows the same example in SQLJ, a dialect of embedded SQL
that can be included in Java programs. The pre-compiler replaces SQLJ
constructs with calls to a library which accesses a database using calls to a JDBC
driver.
The statement #SQL iterator GetInfoStIte . . . in the figure tells the pre-compiler
to generate a class GetInfoStIte which implements an iterator with the next()
method. The class GetInfoStIte is used to store result sets in which each row
has two columns: a string and an integer. The declaration gives a Java name
to these columns, Name and Year, and implicitly defines the column accessor
methods, Name() and Year(), which can be used to return data stored in the
corresponding columns.
4.3.5 Exercises
1. Give a relational schema in SQL for the following databases:
(a) Your solution to Exercise 3.3.1(1).
(b) Your solution to Exercise 3.3.1(2).
2. Give a relational schema in SQL for your solution to Exercise 3.3.1(3), and write
the following queries:
(a) Retrieve the birth-date and name of the female employees.
#include <stdio.h>
#include <string.h>

char SQLSTATE[6];
EXEC SQL BEGIN DECLARE SECTION;
char c_sname[20]; short c_BirthYear;
char c_City[] = "Pisa";
EXEC SQL END DECLARE SECTION;
EXEC SQL DECLARE sinfo CURSOR FOR
    SELECT S.Name, S.BirthYear
    FROM Students S
    WHERE S.City = :c_City
    ORDER BY S.Name;
EXEC SQL OPEN sinfo;
/* fetch until SQLSTATE reports "no data" (02000) */
EXEC SQL FETCH sinfo INTO :c_sname, :c_BirthYear;
while (strcmp(SQLSTATE, "02000") != 0) {
    printf("Name: %s; BirthYear: %d\n", c_sname, c_BirthYear);
    EXEC SQL FETCH sinfo INTO :c_sname, :c_BirthYear;
}
EXEC SQL CLOSE sinfo;
#SQL iter = {
    SELECT Name, BirthYear AS Year
    FROM Students
    WHERE City = :(argv[0]) };
System.out.println("Students retrieved");
while (iter.next()) {
    String name = iter.Name();
    int year = iter.Year();
    System.out.println(" Name = " + name + " Year = " + year);
}
iter.close();
Oracle.close(); }
(b) For each employee, retrieve the employee name and the name of the depart-
ment where he works.
(c) Retrieve the distinct salary of every employee.
(d) Retrieve the names and the ages of female employees older than their super-
visor.
(e) Retrieve the names of all employees who do not have supervisors.
(f) Retrieve the name and address of all employees who work for the “Research”
department.
(g) For every project located in “Pisa”, list the project number, the controlling
department number, and the department manager’s name, address, and
birth-date.
(h) Make a list of all project numbers for projects that involve an employee whose
last name is Smith, either as a worker or as a manager of the department that
controls the project.
(i) Retrieve the names of employees who have no dependents.
(j) List the names of supervisors who have at least one dependent.
(k) For each employee, retrieve the employee’s name and the name of his or her
immediate supervisor.
(l) Retrieve the name of each employee who has a dependent with the same first
name and sex as the employee.
(m) Retrieve a list of employees and the projects they are working on, ordered by
department and, within each department, ordered alphabetically by name.
(n) Find the sum of the salaries of all the employees of the Research department,
as well as the maximum salary, the minimum salary, and the average salary
in this department.
(o) For each department, retrieve the department number, the number of employ-
ees in the department, and their average salary.
(p) For each project on which more than two employees work, retrieve the project
number, the project name, and the number of employees who work on the
project.
(q) For each project, retrieve the project number, the project name, and the num-
ber of employees from department 5 who work on the project.
(r) For each department having more than five employees, retrieve the department
number and the number of employees making more than 40.000.
(s) Retrieve the name of each employee who has all dependents with the same sex
as the employee.
(t) Retrieve the name of each employee who has all dependents with the same
sex.
(u) Retrieve the names of the employees who work on projects only for 20 percent
of their time.
(v) Retrieve the name of each employee who works only on projects controlled by
department number 5.
(w) Retrieve the name of each employee who works only on projects controlled by
the same department.
(x) Retrieve the name of each employee who works on all the projects (and only
those) in which employee 100 participates.