Normalization Illustrated
We have noted that relations that form the database must satisfy some properties,
for example, relations have no duplicate tuples, tuples have no ordering associated
with them, and each element in the relation is atomic. Relations that satisfy these
basic requirements may still have some undesirable properties, for example, data
redundancy and update anomalies. We illustrate these properties and study how
relations may be transformed or decomposed (or normalised) to eliminate them.
Most such undesirable properties do not arise if the database modelling has been
carried out very carefully using some technique like the Entity-Relationship Model
that we have discussed but it is still important to understand the techniques in this
chapter to check the model that has been obtained and ensure that no mistakes
have been made in modelling.
The above table satisfies the properties of a relation and is said to be in first
normal form (or 1NF). Conceptually it is convenient to have all the information in
one relation since it is then likely to be easier to query the database. But the above
relation has the following undesirable features:
1. Repetition of information --- A lot of information is being repeated. Student name, address,
course name, instructor name and office number are being repeated often. Every time we wish to
insert a student enrolment, say, in CP302 we must insert the name of the course CP302 as well as
the name and office number of its instructor. Also every time we insert a new enrolment for, say
Smith, we must repeat his name and address. Repetition of information results in wastage of
storage as well as other problems.
2. Update Anomalies --- Redundant information not only wastes storage but makes updates more
difficult since, for example, changing the name of the instructor of CP302 would require that all
tuples containing CP302 enrolment information be updated. If for some reason, all tuples are not
The above problems arise primarily because the relation student has information
about students as well as subjects. One solution to deal with the problems is to
decompose the relation into two or more smaller relations.
Decomposition may provide further benefits, for example, in a distributed database
different relations may be stored at different sites if necessary. Of course,
decomposition does increase the cost of query processing since the decomposed
relations will need to be joined, sometimes frequently.
The above relation may be easily decomposed into three relations to remove most
of the above undesirable properties:
S (sno, sname, address)
C (cno, cname, instructor, office)
SC (sno, cno)
Such decomposition is called normalization and is essential if we wish to overcome
undesirable anomalies. As noted earlier, normalization often has an adverse effect
on performance. Data which could have been retrieved from one relation before
normalization may require several relations to be joined after normalization.
Normalization does however lead to more efficient updates since an update that
may have required several tuples to be updated before normalization could well
need only one tuple to be updated after normalization.
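The trade-off can be seen on a small instance. The sketch below uses hypothetical rows (the names, addresses and offices are invented for illustration). It projects the single relation onto S, C and SC and then joins the three projections back together.

```python
# Hypothetical rows of the single student relation:
# (sno, sname, address, cno, cname, instructor, office)
student = {
    (1, "Smith", "12 High St", "CP302", "Databases", "Jones", "102"),
    (1, "Smith", "12 High St", "CP303", "Networks", "Lee", "210"),
    (2, "Brown", "4 Low Rd", "CP302", "Databases", "Jones", "102"),
}

# Project the relation onto the three schemas S, C and SC.
# Sets drop the duplicate tuples that projection produces.
S = {(sno, sname, addr) for sno, sname, addr, *_ in student}
C = {(cno, cname, ins, off) for _, _, _, cno, cname, ins, off in student}
SC = {(sno, cno) for sno, _, _, cno, *_ in student}

# Natural join of S, SC and C on sno and cno.
rejoined = {
    (sno, sname, addr, cno, cname, ins, off)
    for sno, sname, addr in S
    for sno2, cno in SC if sno2 == sno
    for cno2, cname, ins, off in C if cno2 == cno
}

print(rejoined == student)  # the join recovers the original relation
print(sum(1 for t in student if t[3] == "CP302"))  # tuples naming CP302 before
print(sum(1 for t in C if t[0] == "CP302"))        # tuples naming CP302 after
```

Changing the instructor of CP302 touches every CP302 enrolment in the single relation but only one tuple in C. Any query that needs student and course details together now pays for the join.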
Although in the above case we are able to look at the original relation and propose
a suitable decomposition that eliminates the anomalies that we have discussed, in
general this approach is not possible. A relation may have one hundred or more
attributes and it is then almost impossible for a person to conceptualise all the
information and suggest a suitable decomposition to overcome the problems. We
therefore need an algorithmic approach to finding if there are problems in a
proposed database design and how to eliminate them if they exist.
There are several stages of the normalization process. These are called the first
normal form (1NF), the second normal form (2NF), the third normal form (3NF),
Boyce-Codd normal form (BCNF), the fourth normal form (4NF) and the fifth normal
form (5NF). For all practical purposes, 3NF or the BCNF are quite adequate since
they remove the anomalies discussed above for most common situations. It should
be clearly understood that there is no obligation to normalise relations to the
Role of Normalization
In an earlier chapter, we have discussed data modeling using the entity-
relationship model. In the E-R model, a conceptual schema using an entity-
relationship diagram is built and then mapped to a set of relations. This technique
ensures that each entity has information about only one thing and once the
conceptual schema is mapped into a set of relations, each relation would have
information about only one thing. The relations thus obtained would normally not
suffer from any of the anomalies that have been discussed in the last section. It
can be shown that if the entity-relationship is built using the guidelines that were
presented in Chapter 2, the resulting relations are in BCNF. (How??? Check?)
Of course, mistakes can often be made in database modeling, especially when the
database is large and complex or one may, for some reasons, carry out database
schema design using techniques other than a modelling technique like the entity-
relationship model. For example, one could collect all the information that an
enterprise possesses and build one giant relation (often called the universal
relation) to hold it. This bottom-up approach is likely to lead to a relation that is
likely to suffer from all the problems that we have discussed in the last section. For
example, the relation is highly likely to have redundant information and update,
deletion and insertion anomalies. Normalization of such a large relation will then be
essential to avoid (or at least minimise) these problems.
Now to define the normal forms more formally, we first need to define the concept
of functional dependence.
Single-Valued Dependencies
Initially Codd (1972) presented three normal forms (1NF, 2NF and 3NF) all based
on functional dependencies among the attributes of a relation. Later Boyce and
Codd proposed another normal form called the Boyce-Codd normal form (BCNF).
The fourth and fifth normal forms are based on multivalued and join dependencies
and were proposed later.
Functional Dependency
Consider a relation R that has two attributes A and B. The attribute B of the
relation is functionally dependent on the attribute A if and only if for each value of
A no more than one value of B is associated. In other words, the value of attribute
A uniquely determines the value of B and if there were several tuples that had the
same value of A then all these tuples will have an identical value of attribute B.
That is, if t1 and t2 are two tuples in the relation R and
t1(A) = t2(A) then we must have t1(B) = t2(B).
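The tuple condition above can be tested against one instance of a relation. The sketch below is illustrative only: the function name fd_holds and the sample rows are assumptions, not part of these notes. A result of False shows that the dependency is violated by the instance. A result of True only shows that this particular instance does not contradict it.

```python
def fd_holds(rows, lhs, rhs):
    """Return True if no two rows agree on lhs but differ on rhs."""
    seen = {}
    for row in rows:
        a = tuple(row[x] for x in lhs)
        b = tuple(row[x] for x in rhs)
        # Remember the first rhs value seen for each lhs value and
        # compare every later row that has the same lhs value with it.
        if seen.setdefault(a, b) != b:
            return False
    return True

# Hypothetical enrolment rows.
rows = [
    {"sno": 1, "sname": "Smith", "cno": "CP302"},
    {"sno": 1, "sname": "Smith", "cno": "CP303"},
    {"sno": 2, "sname": "Smith", "cno": "CP302"},
]

print(fd_holds(rows, ["sno"], ["sname"]))  # sno -> sname
print(fd_holds(rows, ["sname"], ["sno"]))  # sname -> sno
```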
A and B need not be single attributes. They could be any subsets of the attributes
of a relation R (possibly single attributes). We may then write A -> B to state that
B is functionally dependent on A.
The second functional dependency above assumes that the grades are dependent
only on the marks. This may sometimes not be true since the instructor may decide
to take other considerations into account in assigning grades, for example, the
class average mark.
For example, in the student database that we have discussed earlier, we have the
following functional dependencies:
sno -> sname
sno -> address
cno -> cname
cno -> instructor
instructor -> office
These functional dependencies imply that there can be only one name for each
sno, only one address for each student and only one subject name for each cno. It
is of course possible that several students may have the same name and several
students may live at the same address. If we consider cno -> instructor, the
dependency implies that no subject can have more than one instructor (perhaps
this is not a very realistic assumption). Functional dependencies therefore place
constraints on what information the database may store. In the above example,
one may be wondering if the following FDs hold
Functional dependencies arise from the nature of the real world that the database
models. Often A and B are facts about an entity where A might be some identifier
for the entity and B some characteristic. Functional dependencies cannot be
automatically determined by studying one or more instances of a database. They
can be determined only by a careful study of the real world and a clear
understanding of what each attribute means.
We have noted above that the definition of functional dependency does not require
that A and B be single attributes. In fact, A and B may be collections of attributes.
For example
The above example illustrates full functional dependence. However the following
dependence
Closure
Let a relation R have some functional dependencies F specified. The closure of F
(usually written as F+) is the set of all functional dependencies that may be
logically derived from F. Often F is the set of most obvious and important
functional dependencies and F+, the closure, is the set of all the functional
dependencies including F and those that can be deduced from F. The closure is
important and may, for example, be needed in finding one or more candidate keys
of the relation.
For example, the student relation has the following functional dependencies
sno -> sname
cno -> cname
sno -> address
cno -> instructor
instructor -> office
To determine F+, we need rules for deriving all functional dependencies that are
implied by F. A set of rules that may be used to infer additional dependencies was
proposed by Armstrong in 1974. These rules (or axioms) are a complete set of
rules in that all possible functional dependencies may be derived from them. The
rules are:
1. Reflexivity Rule --- If X is a set of attributes and Y is a subset of X, then X -> Y holds.
The reflexivity rule is the most simple (almost trivial) rule. It states that each
subset of X is functionally dependent on X.
2. Augmentation Rule --- If X -> Y holds and W is a set of attributes, then WX -> WY holds.
3. Transitivity Rule --- If X -> Y and Y -> Z hold, then X -> Z holds.
The transitivity rule is perhaps the most important one. It states that if X
functionally determines Y and Y functionally determines Z then X functionally
determines Z.
Three further rules can be derived from the three rules above:
1. Union Rule --- If X -> Y and X -> Z hold, then X -> YZ holds.
2. Decomposition Rule --- If X -> YZ holds, then so do X -> Y and X -> Z.
3. Pseudotransitivity Rule --- If X -> Y and WY -> Z hold then so does WX -> Z.
Based on the above axioms and the functional dependencies specified for relation
student, we may write a large number of functional dependencies. Some of these
are:
(sno, cno) -> sno (Rule 1)
(sno, cno) -> cno (Rule 1)
(sno, cno) -> (sname, cname) (Rule 2)
cno -> office (Rule 3)
sno -> (sname, address) (Union Rule)
etc.
Often a very large list of dependencies can be derived from a given set F since
Rule 1 itself will lead to a large number of dependencies. Since we have seven
attributes (sno, sname, address, cno, cname, instructor, office), there are 128 (that
is, 2^7) subsets of these attributes. These 128 subsets could form 128 values of X
in functional dependencies of the type X -> Y. Of course, each value of X will then
be associated with a number of values for Y (Y being a subset of X) leading to
several thousand dependencies. This large number of dependencies is not
particularly helpful in achieving our aim of normalizing relations.
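The size of this list can be checked with a short calculation. The sketch below counts only the dependencies produced by the reflexivity rule (Y a subset of X) and, as a simplifying assumption, includes the empty set among the subsets.

```python
from math import comb

n = 7  # sno, sname, address, cno, cname, instructor, office

# Number of possible left-hand sides X (all subsets of the attributes).
left_sides = 2 ** n

# A left-hand side X with k attributes has 2**k subsets Y, so the number
# of reflexive dependencies X -> Y is the sum over k of C(n, k) * 2**k.
trivial_fds = sum(comb(n, k) * 2 ** k for k in range(n + 1))

print(left_sides)   # 128
print(trivial_fds)  # 2187, which equals 3**7
```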
Although we could follow the present procedure and compute the closure of F to
find all the functional dependencies, the computation requires exponential time
and the list of dependencies is often very large and therefore not very useful.
There are two possible approaches that can be taken to avoid dealing with the
large number of dependencies in the closure. One is to deal with one attribute or a
set of attributes at a time and find its closure (i.e. all functional dependencies
relating to them). The aim of this exercise is to find what attributes depend on a
given set of attributes and therefore ought to be together. The other approach is to
find the minimal covers. We will discuss both approaches briefly.
As noted earlier, we need not deal with the large number of dependencies that
might arise in a closure since often one is only interested in determining closure of
a set of attributes given a set of functional dependencies. Closure of a set of
attributes X is all the attributes that are functionally dependent on X given some
functional dependencies F while the closure of F was all functional dependencies
that are implied by F. Computing the closure of a set of attributes is a much
simpler task if we are dealing with a small number of attributes. We will denote the
closure of a set of attributes X given F by X+.
An algorithm to determine the closure:
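A minimal sketch of the usual fixed-point procedure follows. It assumes that each dependency is given as a pair of attribute sets (left side, right side); the function name closure and this representation are choices made for illustration. Starting from X, it keeps adding the right side of any dependency whose left side is already contained in the result, until nothing changes.

```python
def closure(attrs, fds):
    """Return X+, the set of attributes functionally determined by attrs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If the left side is already covered, the right side follows.
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

# Dependencies of the student relation.
F = [
    ({"sno"}, {"sname"}),
    ({"sno"}, {"address"}),
    ({"cno"}, {"cname"}),
    ({"cno"}, {"instructor"}),
    ({"instructor"}, {"office"}),
]

print(sorted(closure({"cno"}, F)))
# ['cname', 'cno', 'instructor', 'office']
print(len(closure({"sno", "cno"}, F)))
# 7, so (sno, cno) determines every attribute
```

A set X is a superkey exactly when X+ contains all the attributes of the relation, which is how the closure helps in finding candidate keys.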
Requirement (a), as already noted, can be met easily given any set of
dependencies F. Requirement (b) guarantees that we cannot remove any
dependencies from F and still have a set of dependencies equivalent to F or no
attribute on the left hand side of a dependency is redundant. Requirement (c)
makes sure that no dependencies may be replaced by a dependency that involves
a subset of the left hand side.
Single-Valued Dependencies
A relation is in 1NF if and only if all underlying domains contain atomic values only.
The first normal form deals only with the basic structure of the relation and does
not resolve the problems of redundant information or the anomalies discussed
earlier. All relations discussed in these notes are in 1NF.
The attribute dob is the date of birth and the primary key of the relation is sno with
the functional dependencies sno -> sname and sno -> dob. The relation is in
1NF as long as dob is considered an atomic value and not consisting of three
components (day, month, year). The above relation of course suffers from all the
anomalies that we have discussed earlier and needs to be normalized. (add
example with date of birth)
To understand the above definition of 2NF we need to define the concept of key
attributes. Each attribute of a relation that participates in at least one candidate
key is a key attribute of the relation. All other attributes are called non-key.
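Candidate keys, and hence key attributes, can be found by brute force for small relations. The sketch below is an illustration under assumptions: the function names are invented, dependencies are pairs of attribute sets, and the search tries every subset of attributes, so it is only practical for a handful of attributes.

```python
from itertools import combinations

def closure(attrs, fds):
    """Attributes functionally determined by attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def candidate_keys(attributes, fds):
    """Return all minimal attribute sets whose closure is the whole relation."""
    attributes = set(attributes)
    keys = []
    # Try smaller sets first so that every key found is minimal.
    for size in range(1, len(attributes) + 1):
        for combo in combinations(sorted(attributes), size):
            combo = set(combo)
            if any(k <= combo for k in keys):
                continue  # contains a smaller key, so not minimal
            if closure(combo, fds) == attributes:
                keys.append(combo)
    return keys

def key_attributes(attributes, fds):
    """Attributes that take part in at least one candidate key."""
    return set().union(*candidate_keys(attributes, fds))

R = {"sno", "sname", "cno", "cname"}
F = [({"sno"}, {"sname"}), ({"cno"}, {"cname"})]
print(candidate_keys(R, F))          # the single key (sno, cno)
print(sorted(key_attributes(R, F)))  # ['cno', 'sno']
```

If the dependencies sname -> sno and cname -> cno are added (names assumed unique), the same search reports four candidate keys and no non-key attributes remain.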
The concept of 2NF requires that all attributes that are not part of a candidate key
be fully dependent on each candidate key. If we consider the relation
student(sno, sname, cno, cname)
and assume that (sno, cno) is the only candidate key (and therefore the primary
key), the relation is not in 2NF since sname and cname are not fully dependent on
the key: sname depends on sno alone and cname on cno alone. The relation may be
decomposed into:
S1 (sno, sname)
S2 (cno, cname)
SC (sno, cno)
Use an example that leaves one relation in 2NF but not in 3NF.
We may recover the original relation by taking the natural join of the three
relations.
If however we assume that sname and cname are unique and therefore we have
the following candidate keys
(sno, cno)
(sno, cname)
(sname, cno)
(sname, cname)
The above relation is now in 2NF since the relation has no non-key attributes. The
relation still has the same problems as before but it then does satisfy the
requirements of 2NF. Higher level normalization is needed to resolve such problems
with relations that are in 2NF and further normalization will result in decomposition
of such relations.
Assume that cname is not unique and therefore cno is the only candidate key. The
following functional dependencies exist
We can derive cno -> office from the above functional dependencies and
therefore the above relation is in 2NF. The relation is however not in 3NF since
office is not directly dependent on cno. This transitive dependence is an indication
that the relation has information about more than one thing (viz. course and
instructor) and should therefore be decomposed. The primary difficulty with the
above relation is that an instructor might be responsible for several subjects and
therefore his office address may need to be repeated many times. This leads to all
the problems that we identified at the beginning of this chapter. To overcome these
difficulties we need to decompose the above relation. Two relations, (cno, cname,
instructor) and (instructor, office), are sufficient; a finer decomposition is:
s(cno, cname)
inst(instructor, office)
si(cno, instructor)
The decomposition into three relations is not necessary since the original relation
is based on the assumption of one instructor for each course.
The 3NF is usually quite adequate for most relational database designs. There are
however some situations, for example the relation student(sno, sname, cno,
cname) discussed in 2NF above, where 3NF may not eliminate all the redundancies
and inconsistencies. If sname and cname are unique, the relation student(sno, sname,
cno, cname) has the following candidate keys:
(sno, cno)
(sno, cname)
(sname, cno)
(sname, cname)
Since the relation has no non-key attributes, the relation is in 2NF and also in 3NF,
in spite of the relation suffering the problems that we discussed at the beginning of
this chapter.
The difficulty in this relation is being caused by dependence within the candidate
keys. The second and third normal forms assume that all attributes not part of the
candidate keys depend on the candidate keys but do not deal with dependencies
within the keys. BCNF deals with such dependencies.
It should be noted that most relations that are in 3NF are also in BCNF.
Infrequently, a 3NF relation is not in BCNF and this happens only if
(a) the candidate keys in the relation are composite keys (that is, they are not
single attributes),
(b) there is more than one candidate key in the relation, and
(c) the keys are not disjoint, that is, some attributes in the keys are common.
The BCNF differs from the 3NF only when there is more than one candidate key
and the keys are composite and overlapping. Consider for example, the
relationship
Let us assume that the relation has the following candidate keys:
(sno, cno)
(sno, cname)
(sname, cno)
(sname, cname)
where attributes that are part of a candidate key are dependent on part of another
candidate key. Such dependencies indicate that although the relation is about
some entity or association that is identified by the candidate keys e.g. (sno, cno),
there are attributes that are not about the whole thing that the keys identify. For
example, the above relation is about an association (enrolment) between students
and subjects and therefore the relation needs to include only one identifier to
identify students and one identifier to identify subjects. Providing two identifiers
about students (sno, sname) and two keys about subjects (cno, cname) means that
some information about students and subjects that is not needed is being
provided. This provision of information will result in repetition of information and
the anomalies that we discussed at the beginning of this chapter. If we wish to
include further information about students and courses in the database, it should
not be done by putting the information in the present relation but by creating new
relations that represent information about entities student and subject.
(sno, sname)
(cno, cname)
(sno, cno, date-of-enrolment)
We now have a relation that only has information about students, another only
about subjects and the third only about enrolments. All the anomalies and
repetition of information have been removed.
1. Attribute preservation
2. Lossless-join decomposition
3. Dependency preservation
4. Lack of redundancy
Attribute Preservation
This is a simple and obvious requirement that involves preserving all the
attributes that were there in the relation that is being decomposed.
Lossless-Join Decomposition
In these notes so far we have normalised a number of relations by decomposing
them. We decomposed a relation intuitively. We need a better basis for deciding
decompositions since intuition may not always be correct. We illustrate how a
careless decomposition may lead to problems including loss of information.
Suppose we decompose the above relation into two relations enrol1 and enrol2 as
follows
There are problems with this decomposition but we wish to focus on one aspect at
the moment. Let an instance of the relation enrol be
All the information that was in the relation enrol appears to be still available in
enrol1 and enrol2 but this is not so. Suppose, we wanted to retrieve the student
numbers of all students taking a course from Wilson, we would need to join enrol1
and enrol2. The join would have 11 tuples as follows:
The join contains a number of spurious tuples that were not in the original relation
Enrol. Because of these additional tuples, we have lost the information about which
students take courses from WILSON. (Yes, we have more tuples but less
information because we are unable to say with certainty who is taking courses
from WILSON). Such decompositions are called lossy decompositions. A nonloss or
lossless decomposition is that which guarantees that the join will result in exactly
the same relation as was decomposed. One might think that there might be other
ways of recovering the original relation from the decomposed relations but, sadly,
no other operators can recover the original relation if the join does not (why?).
We need to analyse why some decompositions are lossy. The common attribute in
the above decomposition was Date-enrolled. The common attribute is the glue that
gives us the ability to find the relationships between different relations by joining
the relations together. If the common attribute is not unique, the relationship
information is not preserved. If each tuple had a unique value of Date-enrolled, the
problem of losing information would not have existed. The problem arises because
several enrolments may take place on the same date.
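The effect can be reproduced on a small instance. The rows below are hypothetical and the relation is cut down to four attributes (sno, cno, date enrolled, instructor), so this is a sketch of the problem rather than the instance used in these notes.

```python
# Hypothetical enrolments: (sno, cno, date_enrolled, instructor)
enrol = {
    (830057, "CP302", "1984-02-01", "Wilson"),
    (830057, "CP303", "1984-02-01", "Jones"),
    (820159, "CP302", "1984-01-10", "Wilson"),
}

# Decompose so that the only common attribute is the date of enrolment.
enrol1 = {(sno, cno, date) for sno, cno, date, _ in enrol}
enrol2 = {(date, ins) for _, _, date, ins in enrol}

# Natural join of enrol1 and enrol2 on the date.
joined = {
    (sno, cno, date, ins)
    for sno, cno, date in enrol1
    for date2, ins in enrol2 if date2 == date
}

# Tuples produced by the join that were never in the original relation.
spurious = joined - enrol
for t in sorted(spurious):
    print(t)
```

Two enrolments share a date, so the join pairs each of them with both instructors. The join then claims that student 830057 takes CP303 from Wilson, which the original relation never said.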
That is, the common attributes in R1 and R2 must include a candidate key of either
R1 or R2. How do you know that you have a lossless-join decomposition?
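For a decomposition into two relations, the condition can be tested with attribute closures: the common attributes must functionally determine all of R1 or all of R2. The sketch below assumes that only functional dependencies are given and that they are written as pairs of attribute sets. The function names and the dependencies used for the enrol example are illustrative assumptions.

```python
def closure(attrs, fds):
    """Attributes functionally determined by attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def is_lossless(r1, r2, fds):
    """True if the common attributes determine all of r1 or all of r2."""
    common = set(r1) & set(r2)
    determined = closure(common, fds)
    return set(r1) <= determined or set(r2) <= determined

F = [({"sno"}, {"sname"}), ({"sno"}, {"address"})]
# S(sno, sname, address) and SC(sno, cno) share sno, a key of S.
print(is_lossless({"sno", "sname", "address"}, {"sno", "cno"}, F))

# Assumed dependencies for a cut-down enrol relation.
G = [({"sno", "cno"}, {"date"}), ({"cno"}, {"instructor"})]
# The shared attribute date is a key of neither part.
print(is_lossless({"sno", "cno", "date"}, {"date", "instructor"}, G))
```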
Dependency Preservation
Let us consider a relation R(A, B, C, D) that has the dependencies F that include
the following:
A -> B
A -> C
etc
If we decompose the above relation into R1(A, B) and R2(B, C, D) the dependency
A -> C cannot be checked (or preserved) by looking at only one relation. It is
desirable that decompositions be such that each dependency in F may be checked
by looking at only one relation and that no joins need be computed for checking
dependencies. In some cases, it may not be possible to preserve each and every
dependency in F but as long as the dependencies that are preserved are
equivalent to F, it should be sufficient.
We can partition the dependencies given by F into F1, F2, ..., Fn, where Fi contains
the dependencies that only involve attributes from relation Ri (i = 1, 2, ..., n).
If the union of the dependencies Fi implies all the dependencies in F, then we say
that the decomposition has preserved dependencies, otherwise not.
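This test can be carried out without listing each Fi explicitly. The sketch below uses a standard technique that is not described in these notes. For each dependency X -> Y in F, it grows a set Z from X using only what can be inferred inside one decomposed relation at a time, and then checks that Y is contained in Z. The function names and the set-pair representation are assumptions.

```python
def closure(attrs, fds):
    """Attributes functionally determined by attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def preserves(fds, decomposition):
    """True if every dependency in fds follows from the projected ones."""
    for lhs, rhs in fds:
        z = set(lhs)
        changed = True
        while changed:
            changed = False
            for ri in decomposition:
                ri = set(ri)
                # What z implies using only the attributes of ri.
                gained = (closure(z & ri, fds) & ri) - z
                if gained:
                    z |= gained
                    changed = True
        if not set(rhs) <= z:
            return False
    return True

F = [({"A"}, {"B"}), ({"A"}, {"C"})]
print(preserves(F, [{"A", "B"}, {"B", "C", "D"}]))  # A -> C is lost
print(preserves(F, [{"A", "B"}, {"A", "C", "D"}]))  # both can be checked locally
```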
If the decomposition does not preserve the dependencies F, then the decomposed
relations may contain relations that do not satisfy F or the updates to the
decomposed relations may require a join to check that the constraints implied by
the dependencies still hold.
S1(sno, instructor)
S2(sno, office)
Lack of Redundancy
We have discussed the problems of repetition of information in a database. Such
repetition should be avoided as much as possible.
Deriving BCNF
Should we also include deriving 3NF? page 409-411 Ullman.
Once we have obtained relations by using the above approach we need to check
that they are indeed in BCNF. If there is any relation R that has a dependency A ->
B and A is not a key, the relation violates the conditions of BCNF and may be
decomposed in AB and R - B. The relation AB is now in BCNF and we can now check
if R - B is also in BCNF. If not, we can apply the above procedure again until all the
relations are in fact in BCNF.
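The procedure can be sketched as a recursive function. This is a simplified illustration under stated assumptions, not a complete algorithm. It looks for violations only among the left sides of the given dependencies, so a violation that shows up only after projecting the dependencies onto a sub-relation may be missed. It also splits off A together with everything A determines inside R, keeping A in the remainder so that the two parts can be joined again. The names are hypothetical.

```python
def closure(attrs, fds):
    """Attributes functionally determined by attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result

def bcnf_decompose(relation, fds):
    """Return a set of attribute sets obtained by splitting on violations."""
    relation = frozenset(relation)
    for lhs, _ in fds:
        lhs = set(lhs)
        if not lhs <= relation:
            continue
        determined = closure(lhs, fds) & relation
        # lhs determines something new but not the whole relation,
        # so lhs is not a key: split the relation on this dependency.
        if determined != relation and determined - lhs:
            rest = (relation - determined) | lhs
            return bcnf_decompose(determined, fds) | bcnf_decompose(rest, fds)
    return {relation}

F = [
    ({"cno"}, {"cname"}),
    ({"cno"}, {"instructor"}),
    ({"instructor"}, {"office"}),
]
for r in bcnf_decompose({"cno", "cname", "instructor", "office"}, F):
    print(sorted(r))
```

On the course relation this yields (instructor, office) and (cno, cname, instructor), because instructor -> office is the only dependency whose left side is not a key.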
Multivalued Dependencies
Recall that when we discussed database modelling using the E-R Modelling
technique, we noted difficulties that can arise when an entity has multivalued
attributes. It was because in the relational model, if all of the information about
such an entity is to be represented in one relation, it will be necessary to repeat all
the information other than the multivalued attribute value to represent all the
information that we wish to represent. This results in many tuples about the same
instance of the entity in the relation and the relation having a composite key (the
entity id and the multivalued attribute). Of course the other option suggested was
to represent this multivalued information in a separate relation. The situation of
course becomes much worse if an entity has more than one multivalued attribute
and these values are represented in one relation by a number of tuples for each
entity instance such that every value of one of the multivalued attributes appears
with every value of the second multivalued attribute to maintain consistency. The
multivalued dependency relates to this problem when more than one multivalued
attribute exists. Consider the following relation that represents an entity employee
that has one multivalued attribute proj:
So far we have dealt with multivalued facts about an entity by having a separate
relation for that multivalued attribute and then inserting a tuple for each value of
that fact. This resulted in composite keys since the multivalued fact must form part
of the key. In none of our examples so far have we dealt with an entity having more
than one multivalued attribute in one relation. We do so now.
The fourth and fifth normal forms deal with multivalued dependencies. Before
discussing the 4NF and 5NF we discuss the following example to illustrate the
concept of multivalued dependency.
The above relation is therefore in 3NF (even in BCNF) but it still has some
disadvantages. Suppose a programmer has several qualifications (B.Sc, Dip. Comp.
Sc, etc) and is proficient in several programming languages; how should this
information be represented? There are several possibilities.
Other variations are possible (we remind the reader that there is no relationship
between qualifications and programming languages). All these variations have
some disadvantages. If the information is repeated we face the same problems of
repeated information and anomalies as we did when second or third normal form
conditions are violated. If there is no repetition, there are still some difficulties with
search, insertions and deletions. For example, the role of NULL values in the above
relations is confusing. Also the candidate key in the above relations is (emp name,
qualifications, language) and existential integrity requires that no NULLs be
specified. These problems may be overcome by decomposing a relation like the
one above as follows:
emp_name qualifications
SMITH B.Sc
SMITH Dip.CS
emp_name languages
SMITH FORTRAN
In the example above, if there was some dependence between the attributes
qualifications and language, for example perhaps, the language was related to the
qualifications (perhaps the qualification was a training certificate in a particular
language), then the relation would not have MVD and could not be decomposed
into two relations as above. In the above situation whenever X ->> Y holds, so does
X ->> Z since the role of the attributes Y and Z is symmetrical.
(a) Z is a single valued attribute. In this situation, we deal with R(X, Y, Z) as before
by entering several tuples about each entity.
(b) Z is multivalued.
Now, more formally, X ->> Y is said to hold for R(X, Y, Z) if, whenever t1 and t2 are
two tuples in R that have the same values for attributes X, that is t1[X] = t2[X],
then R also contains tuples t3 and t4 (not necessarily distinct) such that
t3[X] = t4[X] = t1[X] = t2[X]
t3[Y] = t1[Y] and t3[Z] = t2[Z]
t4[Y] = t2[Y] and t4[Z] = t1[Z]
We are therefore insisting that every value of Y appears with every value of Z to
keep the relation instances consistent. In other words, the above conditions insist
that Y and Z are determined by X alone and there is no relationship between Y and
Z since Y and Z appear in every possible pair and hence these pairings present no
information and are of no significance. Only if some of these pairings were not
present, there would be some significance in the pairings.
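The requirement that every value of Y appears with every value of Z can be checked on an instance. The sketch below is illustrative: the function name mvd_holds is an assumption, and the sample rows extend the SMITH example with a second, hypothetical language (COBOL).

```python
def mvd_holds(rows, x, y):
    """Check X ->> Y on rows (dicts); Z is every attribute not in X or Y."""
    if not rows:
        return True
    z = [a for a in rows[0] if a not in x and a not in y]
    present = {tuple(r[a] for a in x + y + z) for r in rows}
    for t1 in rows:
        for t2 in rows:
            if all(t1[a] == t2[a] for a in x):
                # The tuple with the Y values of t1 and the Z values of t2
                # must exist; t4 is covered when the roles are swapped.
                t3 = tuple(t1[a] for a in x + y) + tuple(t2[a] for a in z)
                if t3 not in present:
                    return False
    return True

rows = [
    {"emp_name": "SMITH", "qualifications": q, "languages": lang}
    for q in ("B.Sc", "Dip.CS")
    for lang in ("FORTRAN", "COBOL")
]
print(mvd_holds(rows, ["emp_name"], ["qualifications"]))       # all pairings present
print(mvd_holds(rows[:-1], ["emp_name"], ["qualifications"]))  # one pairing missing
```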
Give example (instructor, quals, subjects) --- explain if subject was single valued;
otherwise all combinations must occur. Discuss duplication of info in that case.
Multivalued Normalisation - Fourth Normal Form
We have considered an example of Programmer(Emp name, qualification,
languages) and discussed the problems that may arise if the relation is not
normalised further. We also saw how the relation could be decomposed into
P1(Emp name, qualifications) and P2(Emp name, languages) to overcome these
problems. The decomposed relations are in fourth normal form (4NF) which we
shall now define.
In fourth normal form, we have a relation that has information about only one
entity. If a relation has more than one multivalued attribute, we should decompose it
to remove difficulties with multivalued facts.
The fifth normal form deals with join dependencies, which are a generalisation of the
MVD. The aim of fifth normal form is to have relations that cannot be decomposed
further. A relation in 5NF cannot be constructed from several smaller relations.
A relation R satisfies join dependency (R1, R2, ..., Rn) if and only if R is equal to the
join of
R1, R2, ..., Rn where Ri are subsets of the set of attributes of R.
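Whether a given instance satisfies a join dependency can be checked by projecting it onto each Ri and joining the projections. The sketch below represents each tuple as a frozenset of (attribute, value) pairs. The function names and the sample rows are hypothetical and are not the example relation of these notes.

```python
def natural_join(r, s):
    """Natural join of two sets of tuples given as frozensets of pairs."""
    out = set()
    for t in r:
        values = dict(t)
        for u in s:
            # Join t and u when they agree on every shared attribute.
            if all(values.get(a, v) == v for a, v in u):
                out.add(t | u)
    return out

def satisfies_jd(rows, components):
    """True if rows equals the join of its projections on components."""
    relation = {frozenset(r.items()) for r in rows}
    joined = None
    for attrs in components:
        part = {frozenset((a, r[a]) for a in attrs) for r in rows}
        joined = part if joined is None else natural_join(joined, part)
    return joined == relation

def row(dept, subject, student):
    return {"dept": dept, "subject": subject, "student": student}

rows = [
    row("CS", "CP1000", "John"),
    row("CS", "CP2000", "Arun"),
    row("Math", "CP1000", "Arun"),
]
jd = [["dept", "subject"], ["subject", "student"], ["student", "dept"]]
print(satisfies_jd(rows, jd))
print(satisfies_jd(rows + [row("CS", "CP1000", "Arun")], jd))
```

With the three hypothetical rows the join of the projections produces the extra tuple (CS, CP1000, Arun), so the join dependency does not hold. Once that tuple is added to the relation, the join returns exactly the relation.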
A relation R is in 5NF (or project-join normal form, PJNF) if for all join dependencies
at least one of the following holds.
An example of 5NF can be provided by the example below that deals with
departments, subjects and students.
The above relation says that Comp. Sc. offers subjects CP1000, CP2000 and
CP3000 which are taken by a variety of students. No student takes all the subjects
and no subject has all students enrolled in it and therefore all three fields are
needed to represent the information.
The above relation does not show MVDs since the attributes subject and student
are not independent; they are related to each other and the pairings have
significant information in them. The relation can therefore not be decomposed in
two relations