Normalization
A large database defined as a single relation may result in data duplication. This repetition of
data leads to the redundancy and anomaly problems that normalization is meant to remove.
What is Normalization?
o Normalization is the process of organizing the data in the database.
o Normalization is used to minimize redundancy in a relation or set of relations. It is
also used to eliminate undesirable characteristics such as insertion, update, and deletion
anomalies.
o Normalization divides a larger table into smaller tables and links them using relationships.
o Normal forms are used to reduce redundancy in database tables.
Why do we need Normalization?
The main reason for normalizing relations is to remove the anomalies listed below. Failure to
eliminate anomalies leads to data redundancy and can cause data integrity and other
problems as the database grows. Normalization consists of a series of guidelines that help
you create a good database structure.
o Insertion Anomaly: An insertion anomaly occurs when a new tuple cannot be inserted into a
relation because some of the required data is missing.
o Deletion Anomaly: A deletion anomaly occurs when deleting data results in the unintended
loss of other important data.
o Update Anomaly: An update anomaly occurs when changing a single data value
requires multiple rows of data to be updated.
Advantages of Normalization
o Normalization helps to minimize data redundancy.
o Greater overall database organization.
o Data consistency within the database.
o Much more flexible database design.
o Enforces the concept of relational integrity.
Disadvantages of Normalization
o You cannot start building the database before knowing what the user needs.
o Performance can degrade when relations are normalized to higher normal forms such as
4NF and 5NF.
o It is very time-consuming and difficult to normalize relations of a higher degree.
o Careless decomposition may lead to a bad database design, leading to serious problems.
EMPLOYEE table:
EMP_ID EMP_NAME EMP_PHONE EMP_STATE
14 John 7272826385, 9064738238 UP
20 Harry 8574783832 Bihar
12 Sam 7390372389, 8589830302 Punjab
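The EMP_PHONE attribute holds more than one value for some employees, so the table is not
in 1NF. To overcome this problem, we have to eliminate the non-atomic values of the EMP_PHONE
attribute.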
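The decomposition of the EMPLOYEE table into 1NF is shown below:
EMP_ID EMP_NAME EMP_PHONE EMP_STATE
14 John 7272826385 UP
14 John 9064738238 UP
20 Harry 8574783832 Bihar
12 Sam 7390372389 Punjab
12 Sam 8589830302 Punjab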
Method 1:
To remove the repeating values in a column, the EMPLOYEE table is converted to a flat
relation, EMPLOYEE_1, by repeating the pair (EMP_ID, EMP_NAME) for every entry in
the table. The new relation contains no non-atomic values, so the table is
normalized and is in First Normal Form.
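Method 1 can be sketched in a few lines of Python. This is only an illustration, not part of the original material: the function name flatten_phones and the tuple layout (EMP_ID, EMP_NAME, list of phone numbers, state) are assumptions made for the example.

```python
# Unnormalized EMPLOYEE rows: the phone field holds a list of values.
# The tuple layout (id, name, phones, state) is an assumption of this sketch.
employee = [
    (14, "John", ["7272826385", "9064738238"], "UP"),
    (20, "Harry", ["8574783832"], "Bihar"),
    (12, "Sam", ["7390372389", "8589830302"], "Punjab"),
]


def flatten_phones(rows):
    """Return one row per (employee, phone) pair so every value is atomic."""
    flat = []
    for emp_id, name, phones, state in rows:
        for phone in phones:
            # EMP_ID, EMP_NAME and the state are repeated for each phone.
            flat.append((emp_id, name, phone, state))
    return flat


employee_1 = flatten_phones(employee)
for row in employee_1:
    print(row)
```

Every field in employee_1 is a single value, at the cost of repeating the employee's name and state once per phone number.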
Method 2:
Another method is to remove the attributes that violate 1NF and place them in a separate relation
along with the primary key. The unnormalized EMPLOYEE table is thus decomposed into
two sub-relations, EMP_DETAILS and EMP_PERFORMANCE.
EMP_DETAILS
EMP_ID EMP_NAME
14 John
20 Harry
12 Sam
EMP_PERFORMANCE
EMP_ID EMP_PHONE EMP_STATE
14 7272826385 UP
14 9064738238 UP
20 8574783832 Bihar
12 7390372389 Punjab
12 8589830302 Punjab
The main idea of decomposing the relations is to keep different types of information in
separate relations, as first normal form disallows multivalued attributes and attributes that are
composite in nature. In the EMP_DETAILS relation, the attribute (EMP_ID) acts as the primary key, and in
the EMP_PERFORMANCE relation the attributes (EMP_ID, EMP_PHONE) together act as the primary
key. The decomposition now satisfies both conditions for a relation to be in 1NF.
o One relation consists of the primary key (EMP_ID) of the original relation (i.e. EMPLOYEE)
and the non-repeating attributes of the original relation (i.e. EMP_NAME).
o The other relation consists of a copy of the primary key of the original relation and all the
repeating attributes of the original relation.
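The decomposition of Method 2 can be sketched as follows. The function name decompose and the input tuple layout are assumptions made for illustration; the two output lists mirror the EMP_DETAILS and EMP_PERFORMANCE relations above.

```python
# Unnormalized EMPLOYEE rows; the layout (id, name, phones, state)
# is an assumption of this sketch.
employee = [
    (14, "John", ["7272826385", "9064738238"], "UP"),
    (20, "Harry", ["8574783832"], "Bihar"),
    (12, "Sam", ["7390372389", "8589830302"], "Punjab"),
]


def decompose(rows):
    """Split the rows into EMP_DETAILS and EMP_PERFORMANCE."""
    emp_details = []      # (EMP_ID, EMP_NAME); key: EMP_ID
    emp_performance = []  # (EMP_ID, EMP_PHONE, state); key: (EMP_ID, EMP_PHONE)
    for emp_id, name, phones, state in rows:
        emp_details.append((emp_id, name))
        for phone in phones:
            emp_performance.append((emp_id, phone, state))
    return emp_details, emp_performance


emp_details, emp_performance = decompose(employee)

# Each relation has a primary key: no key value occurs twice.
details_keys = {emp_id for emp_id, _ in emp_details}
performance_keys = {(emp_id, phone) for emp_id, phone, _ in emp_performance}
print(len(details_keys) == len(emp_details))
print(len(performance_keys) == len(emp_performance))
```

Unlike the flat relation of Method 1, each employee name is stored only once here.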
Method 3:
The third method of normalizing an unnormalized relation into 1NF is explained with the
following example, in which the number of skills an employee of some company can have is fixed.
Suppose an employee can have a maximum of five skills.
EMP_SKILL relation
EMP_ID Skill
14 DBMS, C, C++
20 JAVA, C
Here the EMP_SKILL relation is not 1NF as the skill attribute contains a set of values. So to
remove this problem, we define multiple Skill columns as shown.
EMP_ID Skill1 Skill2 Skill3 Skill4 Skill5
14 DBMS C C++ - -
20 JAVA C - - -
The above representation is in 1NF, but this technique is not preferred because it may cause
problems such as:
o The relation is difficult to query. For example, it would be difficult to answer
queries like “Which employees share a skill?” or “Which employees have skill C?”
o Employee skills are restricted to 5. If an employee with more skills appears, the extra
skills would be left unrecorded.
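The querying problem can be made concrete with Python's built-in sqlite3 module. This is a minimal sketch: the table names emp_skill_wide and emp_skill_rows and the column names skill1 to skill5 are assumptions, not names from the original material. It asks “Which employees have skill C?” against the five-column layout and against a one-row-per-skill layout.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Five fixed skill columns, as in Method 3 (column names are assumed).
conn.execute(
    "CREATE TABLE emp_skill_wide (emp_id INTEGER PRIMARY KEY, "
    "skill1 TEXT, skill2 TEXT, skill3 TEXT, skill4 TEXT, skill5 TEXT)"
)
conn.executemany(
    "INSERT INTO emp_skill_wide VALUES (?, ?, ?, ?, ?, ?)",
    [(14, "DBMS", "C", "C++", None, None), (20, "JAVA", "C", None, None, None)],
)

# One row per (employee, skill) pair.
conn.execute(
    "CREATE TABLE emp_skill_rows (emp_id INTEGER, skill TEXT, "
    "PRIMARY KEY (emp_id, skill))"
)
conn.executemany(
    "INSERT INTO emp_skill_rows VALUES (?, ?)",
    [(14, "DBMS"), (14, "C"), (14, "C++"), (20, "JAVA"), (20, "C")],
)

# Wide layout: every skill column has to be named in the condition.
wide = conn.execute(
    "SELECT emp_id FROM emp_skill_wide "
    "WHERE skill1 = :s OR skill2 = :s OR skill3 = :s "
    "OR skill4 = :s OR skill5 = :s ORDER BY emp_id",
    {"s": "C"},
).fetchall()

# Row layout: a single comparison is enough.
rows = conn.execute(
    "SELECT emp_id FROM emp_skill_rows WHERE skill = ? ORDER BY emp_id",
    ("C",),
).fetchall()

print(wide)
print(rows)
```

Both queries return the same employees, but the wide query grows with every skill column, and a sixth skill cannot be stored without changing the table definition.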
To sum up, all three approaches are correct, because each transforms an unnormalized
table into a first normal form table. However, the second approach, in which the table is
decomposed into relations, is more efficient because it minimizes duplication of data. So for a
relation to be in first normal form, each set of repeating groups should appear in its own table
and every relation should have a primary key.
Whereas the first normal form is concerned with the structure of the representation of a
relation, the second normal form is concerned with eliminating redundancy in these
relations. A relation that is only in 1NF may still suffer from the following anomalies:
o Insertion anomaly
o Deletion anomaly
o Update anomaly
These anomalies take their names from the relational operations during which they occur.
ORDER_BOOK relation:
Order_No B_Name Quantity B_price
4253 C 15 175
4154 IT 30 200
4256 C 50 175
Insertion anomaly: Suppose that we want to insert information about a new book into the
ORDER_BOOK relation. We cannot insert this information until some order is placed for
the book, because the primary key of this relation is a composite key made up of two
attributes, Order_No and B_Name.
Under the entity integrity rule, no attribute of a primary key can contain a null value, so
neither Order_No nor B_Name can be null. A new book whose order has not yet been placed
would have a null Order_No, so its information cannot be inserted. This is shown
in the following figure:
4253 C 15 175
4154 IT 30 200
4256 C 50 175
Relations that exhibit this kind of undesirable property are said to suffer from an insertion
anomaly.
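The insertion anomaly can be reproduced with Python's built-in sqlite3 module. This is a sketch under stated assumptions: the column name quantity is a guess for the third column, and the new book "DBMS" with price 250 is hypothetical. NOT NULL is written out explicitly because SQLite, unlike the relational model, does not by default reject nulls in a composite primary key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE order_book (
        order_no INTEGER NOT NULL,
        b_name   TEXT    NOT NULL,
        quantity INTEGER,           -- assumed column name
        b_price  INTEGER,
        PRIMARY KEY (order_no, b_name)
    )
    """
)
conn.executemany(
    "INSERT INTO order_book VALUES (?, ?, ?, ?)",
    [(4253, "C", 15, 175), (4154, "IT", 30, 200), (4256, "C", 50, 175)],
)

# A new book with no order yet: order_no would have to be NULL.
try:
    conn.execute(
        "INSERT INTO order_book VALUES (?, ?, ?, ?)",
        (None, "DBMS", None, 250),  # hypothetical book and price
    )
    inserted = True
except sqlite3.IntegrityError as exc:
    inserted = False
    print("insert rejected:", exc)

row_count = conn.execute("SELECT COUNT(*) FROM order_book").fetchone()[0]
print(inserted, row_count)
```

The insert is rejected, so the book's price cannot be recorded until someone orders it.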
Deletion anomaly: Suppose that the order with order number 4154 is cancelled for
some reason. We would then have to delete this order from the
ORDER_BOOK relation.
As the relation shows, this order contains information about the book
named “IT”. Deleting this record would therefore also delete the
information about the “IT” book, because it is the only record that mentions that book. This
may lead to the loss of vital information. Removing any other record
causes no such problem, because the relation still holds information about that record's book in
another record.
For example, deleting the record whose Order_No = 4154 gives the result shown in the following figure:
4253 C 15 175
4256 C 50 175
Relations that exhibit this kind of undesirable property are said to suffer from a deletion
anomaly.
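The effect of the deletion can be sketched in plain Python. The helper names known_books and cancel_order are hypothetical, and the rows are held as (Order_No, B_Name, third column, B_price) tuples.

```python
order_book = [
    (4253, "C", 15, 175),
    (4154, "IT", 30, 200),
    (4256, "C", 50, 175),
]


def known_books(rows):
    """Return the sorted names of all books the relation still describes."""
    return sorted({b_name for _, b_name, _, _ in rows})


def cancel_order(rows, order_no):
    """Delete every tuple belonging to the given order."""
    return [row for row in rows if row[0] != order_no]


after_4154 = cancel_order(order_book, 4154)
after_4253 = cancel_order(order_book, 4253)

print(known_books(order_book))  # ['C', 'IT']
print(known_books(after_4154))  # ['C']
print(known_books(after_4253))  # ['C', 'IT']
```

Cancelling order 4154 removes the only trace of the book "IT", while cancelling order 4253 loses nothing about book "C" because order 4256 still mentions it.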
Update anomaly: Modifying values in a relation may also prove cumbersome.
Suppose that the price of the book C is changed to 190. Every tuple referring to this
book then has to be updated, and multiple updates always carry some risk of inconsistency.
In the ORDER_BOOK relation the update looks easy, because only
two tuples have a B_price of 175. But if a relation has thousands of tuples containing
a large amount of redundant data, the updates may lead to inconsistency, as humans are
prone to errors.
Relations that exhibit this kind of undesirable property are said to suffer from an update
anomaly.
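A short Python sketch shows how an incomplete update leaves the relation inconsistent. The function names are hypothetical; update_price_first_match deliberately models a careless update that stops after the first matching tuple.

```python
order_book = [
    (4253, "C", 15, 175),
    (4154, "IT", 30, 200),
    (4256, "C", 50, 175),
]


def update_price_first_match(rows, b_name, new_price):
    """Careless update: only the first tuple for the book is changed."""
    updated, done = [], False
    for order_no, name, qty, price in rows:
        if name == b_name and not done:
            price, done = new_price, True
        updated.append((order_no, name, qty, price))
    return updated


def update_price_everywhere(rows, b_name, new_price):
    """Correct update: every tuple for the book is changed."""
    return [
        (order_no, name, qty, new_price if name == b_name else price)
        for order_no, name, qty, price in rows
    ]


def prices_for(rows, b_name):
    """Return the set of distinct prices recorded for one book."""
    return {price for _, name, _, price in rows if name == b_name}


partial = update_price_first_match(order_book, "C", 190)
complete = update_price_everywhere(order_book, "C", 190)

print(len(prices_for(partial, "C")))   # 2: the book now has two prices
print(len(prices_for(complete, "C")))  # 1
```

If the price were stored once, in a relation keyed by the book name, a single update could not leave two prices behind.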
These considerations lead us to the conclusion that relations in 1NF have undesirable
data manipulation properties, so bringing a relation to 1NF does not complete logical
database design. Further transformations are needed to eliminate these anomalies from
the original set of relations. This brings us to the concept of second normal form.
TEACHER table
TEACHER_ID SUBJECT TEACHER_AGE
25 Chemistry 30
25 Biology 30
47 English 35
83 Math 38
83 Computer 38
To convert the given table into 2NF, we decompose it into two tables:
TEACHER_DETAIL table:
TEACHER_ID TEACHER_AGE
25 30
47 35
83 38
TEACHER_SUBJECT table:
TEACHER_ID SUBJECT
25 Chemistry
25 Biology
47 English
83 Math
83 Computer
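The decomposition can be checked with a small Python sketch. It assumes, as is usual for this example but not stated in the surviving text, that the key of TEACHER is (TEACHER_ID, SUBJECT); the helper name determines is hypothetical. The sketch confirms that TEACHER_AGE depends on TEACHER_ID alone, which is only part of that key, and that joining the two new tables gives back the original rows.

```python
teacher = [
    (25, "Chemistry", 30),
    (25, "Biology", 30),
    (47, "English", 35),
    (83, "Math", 38),
    (83, "Computer", 38),
]


def determines(rows, lhs, rhs):
    """True if column index lhs functionally determines column index rhs."""
    seen = {}
    for row in rows:
        # setdefault stores the first rhs value seen for this lhs value.
        if seen.setdefault(row[lhs], row[rhs]) != row[rhs]:
            return False
    return True


age_depends_on_id = determines(teacher, 0, 2)      # TEACHER_ID -> TEACHER_AGE
subject_depends_on_id = determines(teacher, 0, 1)  # TEACHER_ID -> SUBJECT

# Decompose into TEACHER_DETAIL and TEACHER_SUBJECT.
teacher_detail = sorted({(tid, age) for tid, _, age in teacher})
teacher_subject = [(tid, subject) for tid, subject, _ in teacher]

# Join the two tables on TEACHER_ID and compare with the original.
age_of = dict(teacher_detail)
rejoined = [(tid, subject, age_of[tid]) for tid, subject in teacher_subject]

print(age_depends_on_id, subject_depends_on_id)
print(rejoined == teacher)
```

Each teacher's age is now stored once in TEACHER_DETAIL, and no information is lost by the split.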
To discuss the various anomalies, we will consider the STUDENT relation that holds
information about students and teachers.
In the above table, Stu_Id is the primary key and acts as the roll number of the student.
Since the primary key of the STUDENT relation consists of a single attribute
(Stu_Id), there can be no partial dependency, so the relation is in 2NF. But it still suffers from
insertion, deletion and update anomalies, which are explained as follows.
Insertion anomaly: Suppose that we want to insert a new record with information
about a new teacher who has not yet been assigned a student. This insertion
is not allowed, because the primary key Stu_Id would contain a null value, which
violates the entity integrity rule.
For example, suppose that we want to insert information about a new teacher ‘Mayank’
with Teach_Id = ‘206’ and Teach_Qual = ‘MCA’ who has not yet been allotted any student. This
is not possible, as Stu_Id would contain a null value.
Deletion anomaly: Suppose that the student whose Stu_Id = 1768 decides to leave the college,
so we have to delete this tuple from the STUDENT relation.
As the relation shows, this student is the last student of the teacher
whose Teach_Id = ‘205’. Deleting this tuple would therefore also delete the information
about the teacher. This may lead to the loss of vital information. This is the deletion anomaly.
There would be no deletion problem if the student who decides to leave the college were not the
last student of that teacher.
For example, deleting the student record with Stu_Id = 2523 will not delete the information
about the teacher whose Teach_Id = ‘201’, because it is present elsewhere.
Update anomaly: A relation in second normal form can also suffer from an update anomaly.
For example, suppose the qualification (Teach_Qual) of the teacher whose Teach_Id
= ‘204’ is updated from MCA to Ph.D. The update has to be made in
every tuple where this information recurs. Because this relation
has only a few tuples, that is not a big problem here, but a teacher normally teaches
many students. In a huge database it becomes a big problem and may lead to
inconsistencies, as humans are prone to errors.
These considerations lead us to the conclusion that relations in 2NF have undesirable data
manipulation properties, so bringing a relation to 2NF does not complete logical database
design. Further transformations are needed to eliminate these kinds of anomalies from the
original relation. This brings us to the concept of third normal form.
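A relation is in 3NF if at least one of the following conditions holds for every non-trivial
functional dependency X → Y:
1. X is a super key.
2. Y is a prime attribute, i.e., each element of Y is part of some candidate key.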
To explain the 3NF, let us consider the example of Employee_Detail relation as shown below.
Example:
EMPLOYEE_DETAIL table:
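Non-prime attributes: In the given table, all attributes except EMP_ID are non-prime.
Here, EMP_STATE and EMP_CITY depend on EMP_ZIP, and EMP_ZIP depends on EMP_ID. The
non-prime attributes EMP_STATE and EMP_CITY are therefore transitively dependent on EMP_ID,
which violates the rule of third normal form.
That is why we need to move EMP_CITY and EMP_STATE to the new
<EMPLOYEE_ZIP> table, with EMP_ZIP as the primary key.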
EMPLOYEE table:
EMPLOYEE_ZIP table:
EMP_ZIP EMP_STATE EMP_CITY
201010 UP Noida
02228 US Boston
60007 US Chicago
06389 UK Norwich
462007 MP Bhopal
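The reasoning behind this decomposition can be checked mechanically. The following is a minimal sketch, assuming the functional dependencies EMP_ID → EMP_ZIP and EMP_ZIP → EMP_STATE, EMP_CITY (inferred from the example, not stated as such in the text). The helper names closure and violations_3nf are hypothetical. Each listed dependency X → Y is tested against the two 3NF conditions: X is a super key, or every attribute of Y is prime. A complete check would also consider dependencies implied by the listed ones.

```python
def closure(attrs, fds):
    """Return all attributes determined by attrs under the given FDs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def violations_3nf(attributes, fds, candidate_keys):
    """List the FDs X -> Y where X is no super key and Y is not all prime."""
    prime = set().union(*candidate_keys)
    bad = []
    for lhs, rhs in fds:
        is_super_key = closure(lhs, fds) >= attributes
        non_trivial = rhs - lhs
        if not is_super_key and not non_trivial <= prime:
            bad.append((lhs, rhs))
    return bad


fs = frozenset
# Assumed dependencies for EMPLOYEE_DETAIL.
fd_id_zip = (fs({"EMP_ID"}), fs({"EMP_ZIP"}))
fd_zip_place = (fs({"EMP_ZIP"}), fs({"EMP_STATE", "EMP_CITY"}))

original = violations_3nf(
    {"EMP_ID", "EMP_ZIP", "EMP_STATE", "EMP_CITY"},
    [fd_id_zip, fd_zip_place],
    [fs({"EMP_ID"})],
)
employee = violations_3nf(
    {"EMP_ID", "EMP_ZIP"}, [fd_id_zip], [fs({"EMP_ID"})]
)
employee_zip = violations_3nf(
    {"EMP_ZIP", "EMP_STATE", "EMP_CITY"}, [fd_zip_place], [fs({"EMP_ZIP"})]
)

print(len(original), len(employee), len(employee_zip))
```

Under these assumptions only EMP_ZIP → EMP_STATE, EMP_CITY violates 3NF in the original table, and neither of the two decomposed tables has a violation.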
To discuss the various anomalies, we will consider the STAFF relation that holds information
about the staff, the equipment key they have been allocated and the language in which they
are fluent.
STAFF Relation:
The representation above is not a relation. To represent it as a relation
in 1NF, we need to convert it to the form shown in the following table.
STAFF Relation:
S_Name Equipment Language
Anurag PC English
Anurag PC French
Kapil PC English
Kapil PC French
Kapil PC Japanese
In the above relation, every staff member has two independent sets of features associated with
them: equipment and languages. The STAFF relation has a primary key composed of S_Name, Equipment
and Language. There is no transitive dependency, so the relation STAFF is in 3NF. But it still
suffers from insertion, deletion and update anomalies, which are explained as follows.
Insertion anomaly: Suppose Anurag learns a new language, Japanese. We then have
to insert two new records into the STAFF relation. Similarly, if the number of Equipment values
for Anurag increases from 2 to 5, then introducing the new
language Japanese requires inserting multiple records into the STAFF relation, which
results in redundancy.
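The growth in rows can be illustrated with a short Python sketch. The allocation used here is hypothetical: Anurag is given two equipment items, "PC" and an invented "Mainframe", so that the count matches the two new records mentioned above. The function name staff_rows is also an assumption.

```python
from itertools import product

# Hypothetical data: the equipment and language sets are assumptions.
equipment = {"Anurag": ["PC", "Mainframe"], "Kapil": ["PC"]}
languages = {
    "Anurag": ["English", "French"],
    "Kapil": ["English", "French", "Japanese"],
}


def staff_rows(equipment, languages):
    """Build STAFF rows: each equipment item paired with each language."""
    rows = []
    for name in equipment:
        for item, language in product(equipment[name], languages[name]):
            rows.append((name, item, language))
    return rows


before = staff_rows(equipment, languages)

# Anurag learns Japanese: a single new fact about one person.
languages["Anurag"].append("Japanese")
after = staff_rows(equipment, languages)

new_rows = [row for row in after if row not in before]
print(len(before), len(after))
print(new_rows)
```

One new fact produces one new row per equipment item, because equipment and language vary independently of each other within the same relation.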
Deletion anomaly: Suppose that the staff member Kapil has his equipment PC deallocated. All
the information about his language skills would then be lost when these records are deleted,
which may result in the loss of vital information.
Update anomaly: Suppose that the name of the staff member Anurag is changed to Anuraj
for some reason. Multiple records then need to be updated, which may result in
inconsistent data.
All these anomalies are encountered due to the presence of multivalued dependency which
is removed in fourth normal form. The concept of multivalued dependency and 4NF will be
explained later.