Ch7 Functional Dependencies and Normalization
Outline:
7.3 Normalization.
Definitions.
First Normal Form (1NF)
Second Normal Form (2NF)
Third Normal Form (3NF)
Boyce-Codd Normal Form (BCNF)
Informal Design Guidelines for Relational Databases
There are four informal measures of quality for relation schema design:
1. Semantics of the relation attributes.
2. Redundant values in tuples and update anomalies.
3. Null values in tuples.
4. The possibility of generating spurious tuples.
These measures are not always independent of one another.
Guideline1:
Map each entity set to a relation & each relationship set to a relation.
o Don’t mix these two
o Only use foreign keys to refer to other entities as opposed to
duplicating attributes in other relations.
EMP_DEPT
Is it a good design?
o The semantics of the attributes of the EMP_DEPT relation are not clear.
o It is a poor design because it violates Guideline 1 by mixing attributes from
distinct real-world entities.
o Solution: decompose the “EMP_DEPT” relation into two relations linked by a
foreign key.
EMP DEPT
Example: Consider the following “EMP_DEPT” relation.
EMP_DEPT
….
Redundancy
o There are problems that result from having such a large table:
B. Deletion Anomalies:
Deleting departments will delete their employees.
department, the information concerning that
department is lost from the database.
Guideline2:
Minimize the amount of data redundancy in the database.
o If some attributes do not apply to many tuples, then the occurrence of data
in those columns is sparse and storage space is wasted.
o Queries involving tables having null values might have difficulty dealing
with the null values, namely joins and aggregates.
Guideline3:
o Try to minimize the occurrences of NULL values in your design.
o Attributes that are NULL frequently could be placed in separate
relations (with the primary keys).
4. Spurious Tuples
o Example: Consider the following table
Employee
EID EName TelNo SDate
12345 Kim 555-1234 01/23/2004
54321 Kim 555-5432 05/03/2001
Emp1
EID EName
12345 Kim
54321 Kim
Emp2
EName TelNo SDate
Kim 555-1234 01/23/2004
Kim 555-5432 05/03/2001
Guideline4:
o The relations should be designed to satisfy the lossless join
condition.
o No spurious tuples should be generated by doing a natural-join of
any relations.
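The effect of the Emp1/Emp2 split can be checked with a minimal sketch. The rows are copied from the Employee example; the helper name natural_join_on_ename is hypothetical and only illustrates a natural join on the shared attribute EName.

```python
# Rows copied from the Employee table in the example.
employee = {
    ("12345", "Kim", "555-1234", "01/23/2004"),
    ("54321", "Kim", "555-5432", "05/03/2001"),
}

# Projections Emp1(EID, EName) and Emp2(EName, TelNo, SDate).
emp1 = {(eid, ename) for eid, ename, _, _ in employee}
emp2 = {(ename, tel, sdate) for _, ename, tel, sdate in employee}


def natural_join_on_ename(left, right):
    """Join Emp1 and Emp2 on their only common attribute, EName."""
    return {
        (eid, ename, tel, sdate)
        for eid, ename in left
        for ename2, tel, sdate in right
        if ename == ename2
    }


joined = natural_join_on_ename(emp1, emp2)
spurious = joined - employee  # tuples that were never in Employee

print(len(joined), "tuples after the join,", len(spurious), "of them spurious")
for row in sorted(spurious):
    print(row)
```

Because EName is not a key of either projection, the join pairs each EID with both telephone numbers, so the decomposition is not lossless.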
7.2 Functional Dependencies
Definitions
A B
1 4
2 5
1 7
On these instances, A→B does not hold, but B→A does hold.
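As a minimal sketch, the check on the A/B instance can be written as follows. The function name fd_holds is hypothetical. It tests one instance only, so it can refute an FD but cannot establish it for the schema.

```python
def fd_holds(rows, lhs, rhs):
    """Return True if lhs -> rhs is satisfied by this instance (a list of dicts)."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        # Two rows that agree on lhs must also agree on rhs.
        if seen.setdefault(key, val) != val:
            return False
    return True


# The A/B instance from the table.
rows = [{"A": 1, "B": 4}, {"A": 2, "B": 5}, {"A": 1, "B": 7}]

print(fd_holds(rows, ["A"], ["B"]))  # False: A = 1 appears with B = 4 and B = 7
print(fd_holds(rows, ["B"], ["A"]))  # True: every B value appears once
```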
X Y Z
X1 Y1 Z1
X2 Y1 Z1
X3 Y2 Z3
X4 Y3 Z2
K is a superkey for relation schema r if and only if K→r (K determines all
attributes of r; all attributes of the relation occur on the right-hand side of the FD).
K is a candidate key for r if and only if
K→r, and
for no α ⊂ K, α→r (no proper subset of K is a superkey)
Example: Consider the relation EMP (SSN, Ename, Deptno). Suppose EMP
is used to represent a many-to-one relationship from EMP to DEPT, where SSN
is a key in the EMP relation and Deptno is a key in the DEPT relation.
o Do the following FDs hold in the EMP relation?
1. {SSN}→{Deptno}
2. {Deptno}→{SSN}
o What about a many-to-many relationship and a one-to-one relationship?
Relation extensions r(R) that satisfy the FD constraints are called legal
extensions (or legal relation states) of R.
An FD is a property of the relation schema (intension) R, not of a particular
legal relation state (extension) r of R. Hence, an FD cannot be inferred
automatically from a given relation extension r but must be defined explicitly
by someone who knows the semantics of the attributes of R.
Example: Consider the following relation
Text does not determine Course.
Given a set of FDs F, we can infer additional FDs that hold whenever the FDs
in F hold.
The set of all such dependencies is called the closure of F and is denoted by F+.
A. Armstrong’s Axioms
We can find all of F+ by applying Armstrong’s axioms.
1. X→XY (Augmentation by X)
2. YX→YZ (Augmentation by Y)
3. X→YZ (Transitivity)
A→B
B→H
Transitive: A→H
2. AG→I
By augmenting A→C with G we get AG→CG; transitivity with CG→I then gives AG→I.
A→C, Aug with G: AG→CG
AG→CG and CG→I, Transitive: AG→I
3. CG→HI
By augmenting CG→I with CG we get CG→CGI; by augmenting CG→H with I we get
CGI→HI; transitivity then gives CG→HI.
CG→I, Aug with CG: CG→CGI
CG→H, Aug with I: CGI→HI
CG→CGI and CGI→HI, Transitive: CG→HI
4. IR4: Decomposition rule
{X→YZ}╞ {X→Y and X→Z}
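By reflexivity, YZ→Y; by transitivity with X→YZ, X→Y.
By reflexivity, YZ→Z; by transitivity with X→YZ, X→Z.
Given X→Y and X→Z:
  X→Y, Aug with X: X→XY
  X→Z, Aug with Y: XY→YZ
  X→XY and XY→YZ, Transitive: X→YZ
Given X→Y and W→Z:
  X→Y, Aug with W: XW→YW
  W→Z, Aug with Y: WY→YZ
  XW→YW and WY→YZ, Transitive: XW→YZ
Given X→Y and WY→Z:
  X→Y, Aug with W: XW→YW
  XW→YW and WY→Z, Transitive: WX→Z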
F+ can grow quite large, as we keep applying rules to find more FDs.
Sometimes we want to find all of F+, and other times we just want to find part
of it.
We are often interested in finding the part that tells us whether or not some
subset of attributes X is a superkey for R.
If you can uniquely determine all attributes in R by some subset of attributes
X, then X is a superkey for R.
The closure of X under F, denoted X+, is the subset of attributes that are
uniquely determined by X under F.
Definition:
o Given a schema R, a set X of attributes in R, and a set F of FDs that
hold for R, then the set of all attributes of R that are functionally
dependent on X is called closure of X under F (X+)
X+ := X;
Repeat
  oldX+ := X+;
  for each functional dependency Y→Z in F do
    if Y ⊆ X+ then X+ := X+ ∪ Z;
Until (X+ = oldX+);
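The loop above can be sketched in Python as follows. The function name closure and the representation of F as a list of (left side, right side) set pairs are assumptions made for illustration. The FD set F collects the FDs used as premises in the Armstrong derivations earlier (A→B, A→C, CG→H, CG→I, B→H); treating these as the complete F is an assumption.

```python
def closure(attrs, fds):
    """Attribute closure X+ of `attrs` under `fds`, a list of (lhs, rhs) sets."""
    result = set(attrs)
    changed = True
    while changed:  # corresponds to Repeat ... Until (X+ = oldX+)
        changed = False
        for lhs, rhs in fds:
            # Y is contained in X+ and Z still adds something new.
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


# Assumed FD set, taken from the premises of the earlier derivations.
F = [
    ({"A"}, {"B"}),
    ({"A"}, {"C"}),
    ({"C", "G"}, {"H"}),
    ({"C", "G"}, {"I"}),
    ({"B"}, {"H"}),
]

print(sorted(closure({"A", "G"}, F)))
print(sorted(closure({"A"}, F)))
```

The loop stops as soon as a full pass over F adds no attribute, which is the same termination test as X+ = oldX+.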
oldX+ X+
AB AB
ABC
ABCF
ABCEF
ABCEF
{AB}+ = {A B C E F}
{AB} is not a superkey because it does not determine all attributes of R.
Is {AG} a candidate key?
oldX+ X+
AG AG
ABG
ABCG
ABCGH
ABCGH
ABCGHI
ABCGHI
{AG}+ = {A B C G H I}
{AG} is a superkey
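Whether {AG} is also a candidate key depends on minimality: no proper subset may be a superkey. The following is a sketch under stated assumptions: R = {A, B, C, G, H, I} is taken from the closure computed above, and F is assumed to be the FD set used in the Armstrong derivations (A→B, A→C, CG→H, CG→I, B→H). The function names are hypothetical.

```python
def closure(attrs, fds):
    """Attribute closure of `attrs` under `fds`, a list of (lhs, rhs) sets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def is_superkey(attrs, schema, fds):
    return closure(attrs, fds) >= schema


def is_candidate_key(attrs, schema, fds):
    """A superkey is a candidate key if dropping any one attribute breaks it.

    Checking single-attribute removals is enough, because every superset
    of a superkey is itself a superkey.
    """
    if not is_superkey(attrs, schema, fds):
        return False
    return all(not is_superkey(attrs - {a}, schema, fds) for a in attrs)


R = {"A", "B", "C", "G", "H", "I"}  # assumed schema
F = [                                # assumed FD set
    ({"A"}, {"B"}),
    ({"A"}, {"C"}),
    ({"C", "G"}, {"H"}),
    ({"C", "G"}, {"I"}),
    ({"B"}, {"H"}),
]

print(is_candidate_key({"A", "G"}, R, F))
```

Under these assumptions neither {A} nor {G} alone determines R, so {AG} is minimal.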
7.3 Normalization
Definitions:
A superkey of a relation schema R = {A1, A2, ..., An} is a set of attributes
S ⊆ R with the property that no two tuples t1 and t2 in any legal relation
state r of R will have t1[S] = t2[S].
A key K is a superkey with the additional property that removal of any
attribute from K will cause K not to be a superkey any more.
If a relation schema has more than one key, each is called a candidate key.
o One of the candidate keys is arbitrarily designated to be the primary
key, and the others are called alternate keys.
A prime attribute is a member of some candidate key.
A nonprime attribute is not a prime attribute; that is, it is not a member of
any candidate key.
There are three main techniques to achieve first normal form for such a
relation:
I. First Solution: Decompose the non-1NF relation into two 1NF
relations by removing the attribute Dlocation that violates 1NF
and placing it in a separate relation Dept_Location. The primary
key of this relation is the combination (Dnumber, Dlocation).
Dnumber Dlocation
5 LocX
5 LocY
5 LocZ
4 LocW
1 LocZ
II. Second solution: Expand the key so that there will be a separate
tuple in the original DEPARTMENT relation for each location
of a department.
The primary key is (Dnumber, Dlocation)
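The first solution can be sketched as a flattening step. The dictionary below is a hypothetical representation of the multivalued Dlocation attribute, rebuilt from the Dept_Location rows shown above; it is not the original DEPARTMENT relation.

```python
# Hypothetical non-1NF form: each Dnumber maps to a set of locations.
department_locations = {
    5: {"LocX", "LocY", "LocZ"},
    4: {"LocW"},
    1: {"LocZ"},
}

# 1NF form: one atomic (Dnumber, Dlocation) pair per tuple.
dept_location = sorted(
    (dnumber, loc)
    for dnumber, locs in department_locations.items()
    for loc in locs
)

for row in dept_location:
    print(row)

# (Dnumber, Dlocation) is a key: no pair occurs twice.
print(len(dept_location) == len(set(dept_location)))
```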
Headquarters 1 88677 LocZ NULL NULL
The EMPLOYEE relation is not in 1NF, because the Name attribute is not an
atomic attribute.
SSN is the primary key of the EMP_PROJ relation and Pnumber is the
partial key of the nested relation; that is, within each tuple, the nested
relation must have unique values of Pnumber.
To normalize this into 1NF, we move the nested relation attributes into
a new relation and propagate the primary key into it; the primary key of
the new relation combines the partial key with the primary key of the
original relation.
Solution:
SSN Ename
EMP_PROJ
SSN Pnumber Hours Ename Pname Plocation
FD1: {SSN, Pnumber}→{Hours}
FD2: {SSN}→{Ename}
FD3: {Pnumber}→{Pname, Plocation}
A. The semantics of the attributes are not clear.
B. Redundant data (update anomalies)
The EMP_PROJ relation is in 1NF but is not in 2NF,
because the nonprime attribute Ename violates 2NF
(FD2), as do the nonprime attributes Pname and
Plocation (FD3).
FD2 and FD3 make Ename, Pname and Plocation
partially dependent on the primary key {SSN,
Pnumber} of EMP_PROJ.
Solution: move each FD that violates 2NF into a new
relation as follows:
o After decomposition:
No redundant values
No spurious tuples through join
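The 2NF test for EMP_PROJ can be sketched as a search for partial dependencies. Two assumptions are made: {SSN, Pnumber} is the only candidate key, and the FDs have the left and right sides given in the code (FD2 and FD3 follow from the discussion above; FD1 determining Hours is assumed). The function name is hypothetical.

```python
def partial_dependencies(key, fds):
    """Return FDs whose left side is a proper subset of `key` and whose
    right side contains nonprime attributes.

    Simplifying assumption: `key` is the only candidate key, so the prime
    attributes are exactly the attributes of `key`.
    """
    found = []
    for lhs, rhs in fds:
        nonprime = rhs - key
        if lhs < key and nonprime:  # `<` tests for a proper subset
            found.append((lhs, nonprime))
    return found


key = {"SSN", "Pnumber"}
fds = [
    ({"SSN", "Pnumber"}, {"Hours"}),        # FD1 (assumed)
    ({"SSN"}, {"Ename"}),                   # FD2
    ({"Pnumber"}, {"Pname", "Plocation"}),  # FD3
]

violations = partial_dependencies(key, fds)
for lhs, rhs in violations:
    # Each violating FD moves into its own relation: determinant plus dependents.
    print(sorted(lhs), "->", sorted(rhs), "| new relation:", sorted(lhs | rhs))
```

FD1 is not reported because its left side is the whole key, so Hours stays with {SSN, Pnumber} in the remaining relation.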
Solution:
B. General definition of 2NF
A B C D E F
FD1
FD2
A B C D
FD1
D E F
FD2
EMP_DEPT
Ename SSN BDate Address Dnumber Dname DMGRSSN
FD1
FD2
Solution:
ED1
Ename SSN BDate Address Dnumber
FD1
ED2
Dnumber Dname DMGRSSN
FD2
FD1: {Item}→{Category}
FD2: {Item}→{Discount}
FD3: {Category}→{Discount}
The relation is in 1NF and in 2NF but it is not in 3NF (FD3)
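The 3NF test on these three FDs can be sketched as follows. The schema {Item, Category, Discount} and the single candidate key {Item} are assumptions inferred from the FDs. The function names are hypothetical.

```python
def closure(attrs, fds):
    """Attribute closure of `attrs` under `fds`, a list of (lhs, rhs) sets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def third_nf_violations(schema, fds, candidate_keys):
    """Return (lhs, attribute) pairs where lhs is not a superkey
    and the attribute is nonprime."""
    prime = set().union(*candidate_keys)
    bad = []
    for lhs, rhs in fds:
        if closure(lhs, fds) >= schema:
            continue  # a superkey on the left side is always allowed
        for attr in sorted(rhs - lhs):  # skip the trivial part of the FD
            if attr not in prime:
                bad.append((lhs, attr))
    return bad


schema = {"Item", "Category", "Discount"}  # assumed schema
fds = [
    ({"Item"}, {"Category"}),      # FD1
    ({"Item"}, {"Discount"}),      # FD2
    ({"Category"}, {"Discount"}),  # FD3
]

print(third_nf_violations(schema, fds, candidate_keys=[{"Item"}]))
```

Only FD3 is reported: Category is not a superkey and Discount is nonprime, which is the transitive dependency Item→Category→Discount.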
Solution:
o Example: Consider the following relation
LOTS1 LOTS2
Property_ID# County_Name Lot# Area Price County_Name Tax_Rate
FD1 FD3
FD2
FD4
LOTS1 is in 2NF but not in 3NF (FD4).
Decompose LOTS1 into two relations as follows:
LOTS1A LOTS1B
Property_ID# County_Name Lot# Area Area Price
FD1 FD4
FD2
BOOK_AUTH
Book Author Book List Publisher Author
Publisher
Title Name Type Price Date Affil
FD1
FD2
FD3
The BOOK_AUTH relation is in 1NF but is not in 2NF (FD1 and FD3).
FD1
FD2
B_A2_1 B_A2_2
Book Book Book List
Publisher
Title Type Type Price
FD1 FD2
Boyce-Codd Normal Form (BCNF)
BCNF was proposed as a simpler form of 3NF, but it was found to be stricter
than 3NF, because every relation in BCNF is also in 3NF; however, a relation
in 3NF is not necessarily in BCNF.
A relation schema R is in BCNF if whenever a nontrivial FD X→A holds in
R, then X is a superkey of R.
The only difference between the definitions of BCNF and 3NF is that the second
condition of 3NF, which allows A to be prime, is absent from BCNF.
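The BCNF definition translates into a short check: every nontrivial FD must have a superkey on its left side. The sketch below applies it to a relation over Student, Course and Instructor. FD2 {Instructor}→{Course} appears in these notes; FD1 {Student, Course}→{Instructor} is an assumption. The function names are hypothetical.

```python
def closure(attrs, fds):
    """Attribute closure of `attrs` under `fds`, a list of (lhs, rhs) sets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def bcnf_violations(schema, fds):
    """Return the nontrivial FDs whose left side is not a superkey."""
    return [
        (lhs, rhs)
        for lhs, rhs in fds
        if not rhs <= lhs and not closure(lhs, fds) >= schema
    ]


schema = {"Student", "Course", "Instructor"}
fds = [
    ({"Student", "Course"}, {"Instructor"}),  # FD1 (assumed)
    ({"Instructor"}, {"Course"}),             # FD2
]

print(bcnf_violations(schema, fds))
```

Under these assumptions FD2 violates BCNF because Instructor is not a superkey. 3NF still accepts it, since Course belongs to the candidate key {Student, Course} and is therefore prime.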
A B C
FD1
FD2
a1 b1 c1
a1 b2 c3
a2 b2 c4
a3 b2 c3
a3 b3 c5
A C C B
FD2: {Instructor}→{Course}
Solution:
There are three possible decompositions of the relation:
1. {student, instructor} and {student, course}
2. {course, instructor} and {course, student}
3. {instructor, course} and {instructor, student}
Of these three, only the third decomposition does not generate
spurious tuples after a join.
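The three decompositions can be compared with the standard test for a binary decomposition: it is lossless if the common attributes functionally determine all of one side. This is a sketch that assumes FD1 is {student, course}→{instructor}, which is not shown here, alongside the given FD2. The function names are hypothetical.

```python
def closure(attrs, fds):
    """Attribute closure of `attrs` under `fds`, a list of (lhs, rhs) sets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def is_lossless(r1, r2, fds):
    """Binary lossless-join test: (R1 ∩ R2) -> (R1 - R2) or (R1 ∩ R2) -> (R2 - R1)."""
    common_closure = closure(r1 & r2, fds)
    return (r1 - r2) <= common_closure or (r2 - r1) <= common_closure


fds = [
    ({"student", "course"}, {"instructor"}),  # FD1 (assumed)
    ({"instructor"}, {"course"}),             # FD2
]

decompositions = [
    ({"student", "instructor"}, {"student", "course"}),
    ({"course", "instructor"}, {"course", "student"}),
    ({"instructor", "course"}, {"instructor", "student"}),
]

for number, (r1, r2) in enumerate(decompositions, start=1):
    print(number, is_lossless(r1, r2, fds))
```

Only the third pair shares instructor, and instructor determines course, so that join cannot create spurious tuples.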
CustID Name Order# OrderDate ProductCode ProdDesc Price Qty
C004 Adams, Anne Ord001 03/08/2000 P100 TopDeck 3.25 150
C004 Adams, Anne Ord001 03/08/2000 P300 KitKat 3.10 100
C003 Jones, Carol Ord002 05/08/2000 P200 BarOne 2.95 240
C002 Black, Roger Ord003 05/08/2000 P500 MilkyBar 3.20 370
C002 Black, Roger Ord003 05/08/2000 P400 Flake 3.40 120
C001 Smith, John Ord004 09/08/2000 P300 KitKat 3.10 280
C005 Rhodes, Sean Ord005 12/08/2000 P400 Flake 3.40 150
C002 Black, Roger Ord006 13/08/2000 P100 TopDeck 3.25 320
C001 Smith, John Ord007 15/08/2000 P500 MilkyBar 3.20 240
How would you split the previous relation to minimize data redundancy?