Database Normalization
Materials:
I. Introduction
A. We have already looked at some issues arising in connection with the design
of relational databases. We now want to take the intuitive concepts and
expand and formalize them.
B. We will base most of our examples in this series of lectures on a simplified
library database similar to the one we used in our introduction to relational
algebra and SQL lectures, with some modifications:
1. We will deal with only book and borrower entities and the checked_out
relationship between them (we will ignore the reserve_book and
employee tables)
C. There are two major kinds of problems that can arise when designing a
relational database. We illustrate each with an example.
a) Obviously, this scheme is useful in the sense that a desk attendant
might want to see all of this information at one time.
b) But this makes a poor relation scheme for the conceptual level of
database design. (It might, however, be a desirable view to construct for
the desk attendant at the view level, using joins on conceptual relations.)
c) If a borrower returns the last book he/she has checked out, all
record of the borrower disappears from the database. (There is no
good solution to this.)
d) Up until now, we have given intuitive arguments that designing the
database around a single table like this is bad - though not something
that a naive user is incapable of! What we want to do in this series of
lectures is formalize that intuition into a more comprehensive, formal
set of tests we can apply to a proposed database design.
b) There is now no way to represent the fact that a certain borrower has a
certain book out - or that a particular date_due pertains to a particular
Borrower/Book combination.
(1) To see where this term comes from, suppose we have two
borrowers and two books in our database, each of which is checked
out - i.e., using our original scheme, we would have the following
single table:
(2) Now suppose we decompose this along the lines of the proposed
decomposition. We get the following three tables.
(3) [Table showing the result of joining the three tables back together;
the date_due values 2016-11-15 and 2016-11-10 now appear with every
borrower/book combination.]
(4) We say that the result is one in which information has been lost. At first,
that sounds strange - it appears that information has actually been gained,
since the new table is 4 times as big as the original, with 6 extraneous
rows. But we call this an information loss because
(a) Any table is a subset of the Cartesian product of the domains
of its attributes.
(b) The information in a table can be thought of as the knowledge
that certain rows from the set of potential rows are / are not
present.
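The counting in (4) above can be reproduced with a short sketch (hypothetical data; assumes the bad decomposition splits the borrower, book, and date information into three tables with no shared attributes, so the natural join degenerates into a Cartesian product):

```python
from itertools import product

# Original single table: two borrowers, each with one book checked out.
original = {
    ("b1", "c1", "2016-11-15"),
    ("b2", "c2", "2016-11-10"),
}

# A (bad) decomposition into three tables sharing no attributes.
borrowers = {("b1",)}
borrowers = {("b1",), ("b2",)}
books = {("c1",), ("c2",)}
dates = {("2016-11-15",), ("2016-11-10",)}

# With no common attributes, the natural join is just the Cartesian product.
rejoined = {(b, c, d) for (b,), (c,), (d,) in product(borrowers, books, dates)}

print(len(rejoined))             # 8 rows - 4 times the original size
print(len(rejoined - original))  # 6 extraneous rows
print(original <= rejoined)      # the original rows are all still present
```

The original rows survive, but we can no longer tell which of the 8 rows were really in the database - which is exactly the sense in which information is lost.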
ASK
D. First, some notes on terminology that we will use in this lecture:
1. A relation scheme is the set of attributes for some relation - e.g. the
scheme for Borrower is { borrower_id, last_name, first_name}.
II. Functional Dependencies
A. Though we have not said so formally, what was lurking in the background of
our discussion of decompositions was the notion of FUNCTIONAL
DEPENDENCIES. A functional dependency is a property of the
UNDERLYING REALITY which we are modeling, and affects the way we
model it.
(borrower_id, last_name, first_name, call_number, copy_number,
barcode, title, author, date_due)
(Note: these FD’s imply a lot of other FD’s - we’ll talk about this shortly)
call_number → author
ASK
c) The relationship that exists is one that we will introduce later, called a
multi-valued dependency.
d) For now, we will make the simplifying assumption that each book has
a single, principal author which is the only one listed in the database.
Thus, we will assume that, for now:
call_number → author
1. We begin by looking at the reality being modeled, and make explicit the
dependencies that are present in it. This is not always trivial.
call_number → author
2. Note that there is a correspondence between FD’s and symbols in an ER
diagram - so if we start with an ER diagram, we can list the dependencies
in it.
[ER diagram: entity A with attributes B and C]
ASK
A → BC
[ER diagram: entities A (attributes B, C) and W (attributes X, Y),
joined by a relationship with attribute M]
ASK
A → BCMWXY
W → XY
c) How about this?
[ER diagram: entities A (attributes B, C) and W (attributes X, Y),
joined by a relationship with attribute M]
ASK
A → BCMWXY
W → XYMABC
d) Or this?
[ER diagram: entities A (attributes B, C) and W (attributes X, Y),
joined by a relationship with attribute M]
ASK
A → BC
W → XY
AW → M
e) Thus, the same kind of thinking that goes into deciding on keys and
one-to-one, one-to-many, or many-to-many relationships in ER
diagrams goes into identifying dependencies in relational schemes.
1. Example: given the dependencies
(The call number of a (checked out) book determines the name of the
borrower who has it)
We want to show that, given any two legal tuples t1 and t2 such that
t1[call_number, copy_number] = t2[call_number, copy_number], it
must be the case that t1[last_name, first_name] = t2[last_name,
first_name].
(1) Suppose there are two tuples t1 and t2 such that this does not hold
- e.g.
t1[borrower_id] ≠ t2[borrower_id]
But if it is the case that
t1[borrower_id] = t2[borrower_id]
then the FD borrower_id → last_name, first_name is violated.
(2) Example of augmentation:
Given:
It follows that
borrower_id → last_name
and
borrower_id → first_name
c) Each of these additional rules can be proved from the ones in the
basic set of Armstrong’s axioms - e.g.
(1) Proof of the union rule:
Given: 𝛂 → 𝛃, 𝛂 → 𝝲
Prove: 𝛂 → 𝛃𝝲
(2) Proof of the decomposition rule:
Given: 𝛂 → 𝛃𝝲
Prove: 𝛂 → 𝛃 and 𝛂 → 𝝲
Proof: 𝛃𝝲 → 𝛃 and 𝛃𝝲 → 𝝲 (by reflexivity)
𝛂 → 𝛃 and 𝛂 → 𝝲 (by transitivity using given)
(3) Proof of the pseudotransitivity rule:
Given: 𝛂 → 𝛃 and 𝛃𝝲 → 𝝳
Prove: 𝛂𝝲 → 𝝳
Proof: 𝛂𝝲 → 𝛃𝝲 (augmentation of first given with 𝝲)
𝛂𝝲 → 𝝳 (transitive rule using second given)
d) Note that the union and decomposition rules, together, give us some
choices as to how we choose to write a set of FD's.
For example, given the FD's
𝛂 → 𝛃𝝲 and 𝛂 → 𝝳𝝴
e) In practice, F+ can be computed algorithmically. An algorithm is
given in the text for determining F+ given F:
PROJECT - 5th ed slide
f) Note that, using an algorithm like this, we end up with a rather large
set of FD’s. (Just the reflexivity rule alone generates lots of FD’s.)
For this reason, it is often more useful to consider finding the closure
of a given attribute, or set of attributes. (If we apply this process to all
attributes appearing on the left-hand side of an FD, we end up with all
the interesting FD’s)
(2) Example of applying the algorithm to left hand sides of each of the
FD’s for our library:
Starting set (F):
borrower_id → last_name, first_name
call_number → title
call_number, copy_number → barcode, borrower_id, date_due
barcode → call_number, copy_number
call_number → author
(b) Compute call_number +
Initial: { call_number }
On first iteration through loop, add
title
author
(d) Compute barcode +:
Initial: { barcode }
On first iteration through loop, add
call_number
copy_number
On second iteration through loop, add
title
borrower_id
date_due
author
On third iteration through loop, add
last_name
first_name
∴ barcode → barcode, call_number,
copy_number, title, borrower_id, date_due, author,
last_name, first_name
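The hand computations above follow the usual fixed-point attribute-closure algorithm, which can be sketched in a few lines of Python (FD set F as listed at the start of this example):

```python
def closure(attrs, fds):
    """Compute the closure of a set of attributes under a set of FDs.
    fds is a list of (lhs, rhs) pairs, each side a frozenset of attributes."""
    result = set(attrs)
    changed = True
    while changed:              # repeat until no FD adds anything new
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

F = [
    ({"borrower_id"}, {"last_name", "first_name"}),
    ({"call_number"}, {"title"}),
    ({"call_number", "copy_number"}, {"barcode", "borrower_id", "date_due"}),
    ({"barcode"}, {"call_number", "copy_number"}),
    ({"call_number"}, {"author"}),
]
F = [(frozenset(l), frozenset(r)) for l, r in F]

print(sorted(closure({"call_number"}, F)))  # call_number, title, author
print(sorted(closure({"barcode"}, F)))      # all nine attributes, as above
```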
could find a superkey by combining sets of attributes to get a
set that determines everything.
3. Given that we can infer additional dependencies from a set of FD's, we might
ask if there is some way to define a minimal set of FD's for a given reality.
(3) No two dependencies have the same left side (i.e. the right sides of
dependencies with the same left side are combined)
(2) Rewrite with a single attribute on the right hand side of each
borrower_id → borrower_id
borrower_id → last_name
borrower_id → first_name
call_number → call_number
call_number → title
call_number → author
call_number, copy_number → call_number
call_number, copy_number → copy_number
call_number, copy_number → title
call_number, copy_number → barcode
call_number, copy_number → borrower_id
call_number, copy_number → date_due
call_number, copy_number → author
call_number, copy_number → last_name
call_number, copy_number → first_name
barcode → barcode
barcode → call_number
barcode → copy_number
barcode → title
barcode → borrower_id
barcode → date_due
barcode → author
barcode → last_name
barcode → first_name
(4) There are dependencies in this list which are implied by other
dependencies in the list, and so should be eliminated. Which ones?
ASK
• call_number, copy_number → title
call_number, copy_number → author
(Since the same RHS appears with only call_number on the LHS)
• barcode → title
barcode → author
(These are implied by the transitive rule given that
barcode → call_number and call_number determines these).
• call_number, copy_number → last_name
call_number, copy_number → first_name
(These are implied by the transitive rule given that call_number,
copy_number → borrower_id and borrower_id determines these)
• barcode → last_name
barcode → first_name
(These are implied by the transitive rule given that
barcode → borrower_id and borrower_id determines these)
• Either one of the following - but not both!
call_number, copy_number → borrower_id
call_number, copy_number → date_due
or
barcode → borrower_id
barcode → date_due
(Either set is implied by the transitive rule from the other set given
barcode → call_number, copy_number or call_number,
copy_number → barcode.)
(Assume we keep the ones with call_number, copy_number on the
LHS)
borrower_id → last_name
borrower_id → first_name
call_number → title
call_number → author
call_number, copy_number → barcode
call_number, copy_number → borrower_id
call_number, copy_number → date_due
barcode → call_number
barcode → copy_number
c) Unfortunately, for any given set of FD’s, the canonical cover is not
necessarily unique - there may be more than one set of FD’s that
satisfies the requirement.
Example: For the above, we could have kept
barcode → borrower_id, date_due
and dropped
call_number, copy_number → borrower_id, date_due.
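The "keep one set but not both" reasoning can be checked mechanically: an FD is redundant in a set exactly when its right-hand side is contained in the closure of its left-hand side computed from the set *without* that FD. A minimal sketch (the closure helper is the standard fixed-point algorithm; the FD list is the cover kept above, plus the dropped barcode → borrower_id, date_due):

```python
def closure(attrs, fds):
    """Fixed-point computation of an attribute set's closure under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def is_redundant(fd, fds):
    """An FD is redundant iff it can be inferred from the remaining FDs."""
    rest = [f for f in fds if f is not fd]
    lhs, rhs = fd
    return rhs <= closure(lhs, rest)

# The canonical cover kept above, plus the alternative we dropped.
dropped = (frozenset({"barcode"}), frozenset({"borrower_id", "date_due"}))
fds = [
    (frozenset({"borrower_id"}), frozenset({"last_name", "first_name"})),
    (frozenset({"call_number"}), frozenset({"title", "author"})),
    (frozenset({"call_number", "copy_number"}),
     frozenset({"barcode", "borrower_id", "date_due"})),
    (frozenset({"barcode"}), frozenset({"call_number", "copy_number"})),
    dropped,
]

print(is_redundant(dropped, fds))  # True - inferable from the others
print(is_redundant(fds[0], fds))   # False - nothing else gives last_name
```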
D. Functional dependencies are used in two ways in database design
d) We run into a problem when all of these FD’s appear in a single table
- we will formalize this soon.
Example: If our decomposition includes a scheme including the
following attributes:
call_number, copy_number, barcode ...
then when we are inserting a new tuple we can easily test to see
whether or not it violates the following dependencies
barcode → call_number, copy_number
call_number, copy_number → barcode
Now suppose we decomposed this scheme in such a way that no
table contains all three of these attributes - i.e. into something like:
call_number, barcode ....
and
copy_number, barcode ...
When inserting a new book entity (now as two tuples in two
tables), we can still test
barcode → call_number, copy_number
by testing each part of the right hand side separately for each table
- but the only way we can test whether
call_number, copy_number → barcode
is satisfied by a new entity is by joining the two tables to make sure
that the same call_number and copy_number don’t appear with a
different barcode
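This join-based test can be sketched as follows (hypothetical barcode values; the point is that each decomposed table looks consistent on its own, and the violation of call_number, copy_number → barcode only becomes visible after the join):

```python
# Hypothetical rows in the two decomposed tables:
# (call_number, barcode) and (copy_number, barcode).
t1 = {("QA76.9.D3 S5637", "B100"), ("QA76.9.D3 S5637", "B101")}
t2 = {("1", "B100"), ("1", "B101")}

# Join on barcode to reassemble (call_number, copy_number, barcode) facts.
joined = {(cn, cp, b1) for cn, b1 in t1 for cp, b2 in t2 if b1 == b2}

# call_number, copy_number -> barcode holds iff no (call_number,
# copy_number) pair appears with two different barcodes.
seen = {}
violated = any(seen.setdefault((cn, cp), b) != b for cn, cp, b in joined)
print(violated)  # True: copy 1 of this call number has two barcodes
```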
(3) A decomposition is dependency preserving if the transitive closure
of the original set is equal to the transitive closure of the set of
restrictions to each scheme.
Example: if we have a scheme (ABCD) with dependencies
A→B
B → CD
Example: suppose we have the tuple a1 b1 c1 d1 in the table,
and try to insert a2 b1 c2 d2.
E. Note that the notions of superkey, candidate key, and primary key we
developed earlier can now be stated in terms of functional dependencies.
b) If K is composite, then for K to be a candidate key it must be the case
that for each proper subset of K there is some attribute in R that is
NOT functionally dependent on that subset, though it is on K.
4. Since a relation is a set, it must have a superkey (possibly the entire set of
attributes.) Therefore, it must have one or more candidate keys, and a
primary key can be chosen. We assume, in all further discussions of
design, that each relation scheme we work with has a primary key.
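These definitions translate directly into code: K is a superkey of R iff K+ ⊇ R, and a candidate key iff, in addition, no proper subset of K is a superkey. A minimal sketch using the library FDs (the closure helper is the standard fixed-point algorithm):

```python
from itertools import combinations

def closure(attrs, fds):
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def is_superkey(K, R, fds):
    return closure(K, fds) >= R

def is_candidate_key(K, R, fds):
    """A superkey none of whose proper subsets is itself a superkey."""
    if not is_superkey(K, R, fds):
        return False
    K = set(K)
    return all(not is_superkey(set(sub), R, fds)
               for i in range(len(K)) for sub in combinations(K, i))

R = {"borrower_id", "last_name", "first_name", "call_number",
     "copy_number", "barcode", "title", "author", "date_due"}
F = [
    (frozenset({"borrower_id"}), frozenset({"last_name", "first_name"})),
    (frozenset({"call_number"}), frozenset({"title", "author"})),
    (frozenset({"call_number", "copy_number"}),
     frozenset({"barcode", "borrower_id", "date_due"})),
    (frozenset({"barcode"}), frozenset({"call_number", "copy_number"})),
]

print(is_candidate_key({"call_number", "copy_number"}, R, F))  # True
print(is_candidate_key({"barcode"}, R, F))                     # True
print(is_superkey({"call_number"}, R, F))                      # False
```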
III. Using Functional Dependencies to Design Database Schemes
4. However, all three may not be achievable at the same time in all cases,
in which case some compromise is needed. One thing we never
compromise, however, is lossless-join, since that involves the destruction
of information. We may have to accept some redundancy to preserve
dependencies, or we may have to give up dependency-preservation in
order to eliminate all redundancies. (We’ll see an example of this later.)
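For a binary split, the lossless-join property has a simple mechanical test: a decomposition of R into R1 and R2 is lossless iff the shared attributes functionally determine all of R1 or all of R2. A minimal sketch (attribute and FD names from the library example; the closure helper is the standard fixed-point algorithm):

```python
def closure(attrs, fds):
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def lossless_binary(R1, R2, fds):
    """A decomposition of R into R1, R2 has a lossless join iff the
    common attributes functionally determine all of R1 or all of R2."""
    c = closure(R1 & R2, fds)
    return R1 <= c or R2 <= c

F = [
    (frozenset({"call_number"}), frozenset({"title", "author"})),
    (frozenset({"borrower_id"}), frozenset({"last_name", "first_name"})),
]

Book_info = {"call_number", "title", "author"}
Rest = {"borrower_id", "last_name", "first_name",
        "call_number", "copy_number", "barcode", "date_due"}
print(lossless_binary(Book_info, Rest, F))  # True: call_number determines Book_info

# Splitting on no shared (or non-determining) attributes is lossy:
print(lossless_binary({"borrower_id", "last_name"},
                      {"call_number", "title"}, F))  # False
```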
2. We will use the library database and set of FD’s we just developed, and
will progressively normalize it to 4NF.
C. First Normal Form (1NF):
a) Repeating groups.
4. 1NF is desirable for most applications, because it guarantees that each attribute
in R is functionally dependent on the primary key, and simplifies queries.
However, there are some applications for which atomicity may be undesirable -
e.g. keyword fields in bibliographic databases. There are some who have
argued for not requiring normalization in such cases, though the pure
relational model certainly does require it.
NOTE: We only require attributes not part of a candidate key to be fully
functionally dependent on each candidate key. An attribute that IS part
of a candidate key CAN be dependent on just part of some other
candidate key. We address this situation in conjunction with BCNF.)
c) The result of this is that we cannot record the fact that QA76.9.D3
S5637 is the call number for “Database System Concepts ” unless we
actually own a copy of the book. (Maybe this is a problem, maybe
not.) Moreover, if we do own a copy and it is lost, and we delete it
from the database, then we have to re-enter this information when we
get a new copy.
Book_info(call_number, title, author)
and
Everything_else(borrower_id, last_name, first_name,
call_number, copy_number, barcode, date_due)
This is now 2NF.
e) Observe that any 1NF relation scheme which does NOT have a
COMPOSITE primary key is, of necessity, in 2NF.
a) We cannot record information about a borrower who does not have a
book checked out.
c) If a borrower has only one book checked out and returns it, all
information about the borrower’s name is also deleted.
call_number, copy_number → barcode, borrower_id, date_due
barcode → call_number, copy_number
c) The fourth dependency does not lead to adding a scheme, since all of
its attributes occur together in the third scheme.
d) The set of schemes includes a candidate key for the whole relation -
so we are done.
1. The first three normal forms were developed in a context in which it was
tacitly assumed that each relation scheme would have a single candidate
key. Later consideration of schemes in which there were multiple
candidate keys led to the realization that 3NF was not a strong enough
criterion, and led to the proposal of a new definition for 3NF. To avoid
confusion with the old definition, this new definition has come to be
known as Boyce-Codd Normal Form or BCNF.
2. BCNF is a strictly stronger requirement than 3NF. That is, every BCNF
relation scheme is also 3NF (though the reverse may not be true.) It also
has a cleaner, simpler definition than 3NF, since no reference is made to
other normal forms (except for an implicit requirement of 1NF, since a
BCNF relation is a normalized relation and a normalized relation is
1NF). Thus, for most applications, attention will be focused on finding a
design that satisfies BCNF, and the previous definitions of 1NF, 2NF,
and 3NF will not be needed. There will, however, be times when BCNF
is not possible without sacrificing dependency-preservation; in these
cases, we may use 3NF as a compromise.
3. Definition of BCNF: A normalized relation R is in BCNF iff every
nontrivial functional dependency that must be satisfied by R is of the
form A → B, where A is a superkey for R.
with FD’s
call_number → author
ASK
(3) Is it 3NF?
ASK
(5) Of course, this scheme is not BCNF - the BCNF definition does
not have the “loophole” and would force us to decompose further
into something like:
(call_number, copy_number, barcode)
(call_number, copy_number, author)
a) Example: The previous example we used DOES allow a dependency
preserving decomposition into BCNF - e.g. the BCNF decomposition
above does preserve the FD’s.
F+ = F =
call_number, copy_number → barcode
barcode → call_number, copy_number
(i.e. in this case taking the transitive closure of F adds no new
dependencies of interest.)
candidate keys are (call_number, copy_number, author) and
(barcode, author)
At the first iteration of the while, we find that the one scheme found in
result is non-BCNF. We look at our dependencies and find that the first is
of the form 𝛂 → 𝛃, where 𝛂 is call_number, copy_number and 𝛃 is
barcode, but call_number, copy_number is not a key for this scheme - so
we replace the scheme in result by:
R - 𝛃 = (call_number, copy_number, author)
plus 𝛂 ∪ 𝛃 = (call_number, copy_number, barcode)
At the second iteration of the while, we find that both schemes in result
are BCNF, so we stop - which is the same as the BCNF scheme we
introduced earlier.
G.We said earlier that we had three goals we wanted to achieve in design:
1. ASK
2. We have seen how to use FD's to help accomplish the first goal, and how
to use FD’s to test whether the second is satisfied. Obviously, the set of
FD’s is what we want to preserve, though this is not always attainable if
we want to go to the highest normal form.
IV. Normalization Using Multivalued Dependencies
t1[A] = t2[A] = t3[A] = t4[A] and
t3[B] = t1[B] and t4[B] = t2[B] and
t3[R-A-B] = t2[R-A-B] and t4[R-A-B] = t1[R-A-B]
Note: if t1[B] = t2[B], then this requirement is satisfied by letting t3 = t2
and t4 = t1. Likewise, if t1[R-A-B] = t2[R-A-B], then the requirement is
satisfied by setting t3 = t1 and t4 = t2. Thus, this definition is only
interesting when t1[B] ≠ t2[B] and t1[R-A-B] ≠ t2[R-A-B].
c) Thus once we know that the author values associated with QA76.9.D3
S5637 (the call number for our textbook) are Korth, Silberschatz, and
Sudarshan, the multivalued dependency from call_number to author
tells us two things:
will be either Korth or Silberschatz or Sudarshan - but never some
other name such as Peterson.
Thus t3 is
QA76.9.D3 S5637 1 Silberschatz
And t4 is
QA76.9.D3 S5637 2 Korth
While the former occurs in the database, the latter does not, and so
must be added.
(4) On the other hand, suppose our database contains just one copy -
i.e.
QA76.9.D3 S5637 1 Silberschatz
QA76.9.D3 S5637 1 Korth
QA76.9.D3 S5637 1 Sudarshan
This satisfies the multivalued dependency call_number ->> author
as it stands.
To see this, let t1 be the first tuple and t2 the second. Since they
agree on call_number but differ on author, we require the presence
of tuples t3 and t4 which have the same call_number, and with
t3 agreeing with t1 on author (Silberschatz)
t4 agreeing with t2 on author (Korth)
t3 agreeing with t2 on everything else (copy_number = 1)
t4 agreeing with t1 on everything else (copy_number = 1)
Of course, now t3 and t4 are already in the database (indeed, t3 is
just t1 and t4 is just t2) so the definition is satisfied.
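The tuple-swapping check used in this example can be mechanized. A minimal sketch (hypothetical rows mirroring the example; `satisfies_mvd` is an illustrative helper, not a standard API):

```python
def satisfies_mvd(rows, A, B, attrs):
    """Check A ->> B by the tuple-swap definition: whenever t1 and t2
    agree on A, the tuple taking t1's values on B and t2's values on
    everything else must also appear in the relation."""
    table = {tuple(r[a] for a in attrs) for r in rows}
    for t1 in rows:
        for t2 in rows:
            if all(t1[a] == t2[a] for a in A):
                t3 = {**t2, **{b: t1[b] for b in B}}
                if tuple(t3[a] for a in attrs) not in table:
                    return False
    return True

attrs = ["call_number", "copy_number", "author"]
rows = [
    {"call_number": "QA76.9.D3 S5637", "copy_number": 1, "author": a}
    for a in ("Silberschatz", "Korth", "Sudarshan")
]
print(satisfies_mvd(rows, ["call_number"], ["author"], attrs))  # True

# A second copy listed with only one of the three authors breaks the MVD:
rows.append({"call_number": "QA76.9.D3 S5637", "copy_number": 2,
             "author": "Korth"})
print(satisfies_mvd(rows, ["call_number"], ["author"], attrs))  # False
```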
[diagram/table with attributes A, B, C]
b) Of course, a functional dependency is a much stronger statement than
a multi-valued dependency, so we don’t want to simply replace FD’s
with MVD’s in our set of dependencies.
or
b) The union of its left hand and right hand sides is the whole scheme
i.e. For any MVD on a relation R of the form 𝛂 ->> 𝛃,
set of FD’s and MVD’s D, we can find their closure D+ by using
appropriate rules of inference. These are discussed in Appendix B of the
text.
PROJECT: Rules of inference for FD’s and MVD’s
a) Note that this set includes both the FD rules of inference we
considered earlier, and new MVD rules of inference
b) Note, in particular, that though there is a union rule for MVD’s just
like there is a union rule for FD’s, there is no MVD rule analogous to
the decomposition rule for FD’s.
e.g. given A → BC, we can infer A → B and A → C.
However, given A ->> BC, we cannot necessarily infer A ->> B or
A ->> C unless certain other conditions hold.
2. Note that every 4NF relation is also BCNF. BCNF requires that, for
each nontrivial functional dependency A → B that must hold on R, A is a
superkey for R.
But if A → B, then A ->> B. Further, if R is in 4NF, then for every
nontrivial multivalued dependency of the form A ->> B, A must be a
superkey. This is precisely what BCNF requires.
3. An algorithm is given in the book for converting a non-4NF scheme to 4NF.
PROJECT - 5th ed slide
It basically operates by isolating MVDs in their own relation, so that they
become trivial.
Example: application of this algorithm to our library database (with multiple
authors).
a) Our canonical cover for F+, with added MVDs for author
borrower_id → last_name, first_name
call_number → title
call_number ->> author
call_number, copy_number → barcode, borrower_id, date_due
barcode → call_number, copy_number, borrower_id, date_due
Notes:
(1) We do not include call_number ->> copy_number
(a) It is not the case that if we have two different copies of some
call_number, each copy_number value appears with each
barcode - just with the one for that book.
(b) Likewise, it is not the case that if we have two different copies
of some call_number, each copy_number value appears with
each borrower/date_due - just to the one (if any) that pertains to
that particular copy.
(2) The definition of 4NF - and the 4NF decomposition algorithm - are both
couched solely in terms of MVD’s. However, since every FD is also an
MVD, we will use the above set, remembering that when we have say
borrower_id → last_name, first_name
we necessarily also have
borrower_id ->> last_name, first_name
(3) The algorithm calls for using D+ - the transitive closure of D, the
set of FD’s and MVD’s. As it turns out, all we really need to know
for this problem is Fc (the canonical cover for the FD’s) plus the
MVD’s. (The transitive closure of the set of MVD’s is huge!)
b) Initial scheme:
{ (borrower_id, last_name, first_name, call_number,
copy_number, barcode, title, author, date_due)
}
V. Higher Normal Forms
VI. Some Final Thoughts About Database Design
In a good design, every attribute depends on the key, the whole key, and
nothing but the key.
b) This is often the way a naive user designs a database - though the
naive user may not get around to normalization!
PROJECT
PROJECT
However, a simplistic conversion into tables would, in this case, lead to more
tables than we need, since the Author and Title tables contain no information
other than their keys.
(If they did contain additional information, then making them separate entities
would make sense. We can imagine having more information about authors,
thus warranting a separate Author entity; it would be hard to imagine what
would warrant a separate Title entity)
Thus, we may do better to take the set of tables arising from converting
the original ER diagram to tables and then normalizing those tables,
leading to the following set of tables:
3. Note that these two approaches lead, after normalization, to similar but
not identical designs.
a) How does the design that comes from normalizing our original ER
diagram differ from the design we came to by normalization of a
universal relation?
ASK
The former has a separate Checked_out table, rather than keeping
borrower_id and date_due in the Book table.
b) Which is better?
ASK
The latter design avoids the necessity of storing null for the borrower
id of a book that is not checked out, at the expense of having an
additional table. Thus,
(1) To record the return of a book under the first model, we set the
borrower_id attribute of the Book tuple to null.
(2) To record the return of a book under the second model, we delete
the Checked_out tuple
2. Note that views can be used to give the illusion of a denormalized design
to users, but do not address the performance issue, since the DBMS must
still do the join when a user accesses the view.