Normalization

Normalization is a process in database design aimed at reducing data redundancy and eliminating anomalies such as insertion, deletion, and update anomalies by organizing data into smaller, well-structured relations. It involves multiple normal forms (1NF, 2NF, 3NF, BCNF, 4NF, 5NF), each with specific criteria to ensure data integrity and minimize redundancy. While normalization offers advantages like improved data consistency and organization, it can also lead to performance issues and complexity in database design.

Uploaded by

sipeji8490
Copyright
© © All Rights Reserved

Normalization

A large database defined as a single relation may result in data duplication. This repetition of
data may result in:

o Relations becoming very large.
o Difficulty maintaining and updating data, since every change involves searching many records
in the relation.
o Wastage and poor utilization of disk space and resources.
o A greater likelihood of errors and inconsistencies.

To handle these problems, we analyze the relations that contain redundant data and decompose
them into smaller, simpler, well-structured relations that satisfy desirable properties.
Normalization is the process of decomposing relations into relations with fewer attributes.

What is Normalization?
o Normalization is the process of organizing the data in the database.
o Normalization is used to minimize redundancy in a relation or set of relations. It is
also used to eliminate undesirable characteristics such as insertion, update, and deletion
anomalies.
o Normalization divides a larger table into smaller tables and links them using relationships.
o Normal forms are the criteria used to reduce redundancy in database tables.

Why do we need Normalization?

The main reason for normalizing relations is to remove insertion, update, and deletion
anomalies. Failure to eliminate them leaves redundant data in place and can cause data
integrity and other problems as the database grows. Normalization consists of a series of
guidelines that help you create a good database structure.

Data modification anomalies can be categorized into three types:

o Insertion Anomaly: An insertion anomaly occurs when a new tuple cannot be inserted into a
relation because some other, unrelated data is not yet available.
o Deletion Anomaly: A deletion anomaly occurs when deleting some data results in the
unintended loss of other important data.
o Updation Anomaly: An update anomaly occurs when changing a single data value requires
multiple rows to be updated, and missing any of them leaves the data inconsistent.

Types of Normal Forms:


Normalization works through a series of stages called normal forms. The normal forms apply
to individual relations. A relation is said to be in a particular normal form if it satisfies
the constraints of that form.

Following are the various types of Normal forms:

Normal Form   Description

1NF           A relation is in 1NF if every attribute contains only atomic values.

2NF           A relation is in 2NF if it is in 1NF and all non-key attributes are
              fully functionally dependent on the primary key.

3NF           A relation is in 3NF if it is in 2NF and no transitive dependency exists.

BCNF          A stronger definition of 3NF is known as Boyce-Codd normal form.

4NF           A relation is in 4NF if it is in Boyce-Codd normal form and has no
              multi-valued dependency.

5NF           A relation is in 5NF if it is in 4NF and does not contain any join
              dependency; joining should be lossless.

Advantages of Normalization
o Normalization helps to minimize data redundancy.
o Greater overall database organization.
o Data consistency within the database.
o Much more flexible database design.
o Enforces the concept of relational integrity.

Disadvantages of Normalization
o You cannot start building the database before knowing what the user needs.
o Query performance can degrade when relations are normalized to higher normal forms, i.e.,
4NF and 5NF, because answering a query then requires joining more tables.
o Normalizing relations of a higher degree is time-consuming and difficult.
o Careless decomposition may lead to a bad database design, leading to serious problems.

First Normal Form (1NF)


o A relation is in 1NF if every attribute contains only atomic values.
o An attribute of a table cannot hold multiple values. It must hold only a single value.
o First normal form disallows multi-valued attributes, composite attributes, and their
combinations.

Example: Relation EMPLOYEE is not in 1NF because of the multi-valued attribute EMP_PHONE.

EMPLOYEE table:

EMP_ID   EMP_NAME   EMP_PHONE                EMP_STATE

14       John       7272826385, 9064738238   UP

20       Harry      8574783832               Bihar

12       Sam        7390372389, 8589830302   Punjab

The above EMPLOYEE table is an unnormalized relation because it contains multiple values for
the EMP_PHONE attribute, i.e. these values are non-atomic. Relations with multi-valued
entries are called unnormalized relations.

To overcome this problem, we have to eliminate the non-atomic values of the EMP_PHONE
attribute.

The decomposition of the EMPLOYEE table into 1NF has been shown below:

EMP_ID EMP_NAME EMP_PHONE EMP_STATE

14 John 7272826385 UP

14 John 9064738238 UP

20 Harry 8574783832 Bihar

12 Sam 7390372389 Punjab

12 Sam 8589830302 Punjab

There are 3 ways to achieve first normal form.

Method 1:

To remove the repeating values in a column, the EMPLOYEE table is converted to a flat
relation, EMPLOYEE_1, by repeating the other attributes (EMP_ID, EMP_NAME, EMP_STATE) for
every phone number, as in the table above. The new relation does not contain any non-atomic
values, so it is normalized and is in First Normal Form.
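Method 1 can be sketched in a few lines of Python. This is only an illustration: the function name to_first_normal_form and the comma-separated encoding of EMP_PHONE are assumptions made for the example, not part of the original notes.

```python
# Unnormalized rows: EMP_PHONE holds a comma-separated list (non-atomic value).
employee = [
    (14, "John", "7272826385, 9064738238", "UP"),
    (20, "Harry", "8574783832", "Bihar"),
    (12, "Sam", "7390372389, 8589830302", "Punjab"),
]


def to_first_normal_form(rows):
    """Return one row per (employee, phone) pair so that every value is atomic."""
    flat = []
    for emp_id, name, phones, state in rows:
        for phone in phones.split(","):
            # The other attributes are repeated for each phone number.
            flat.append((emp_id, name, phone.strip(), state))
    return flat


employee_1 = to_first_normal_form(employee)
for row in employee_1:
    print(row)
```

The three unnormalized rows become five flat rows, matching the 1NF table shown earlier.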

Method 2:

Another method is to remove the attributes that violate 1NF and place them in a separate
relation along with the primary key. The unnormalized EMPLOYEE table is thus decomposed into
two sub-relations, EMP_DETAILS and EMP_PERFORMANCE.

EMP_DETAILS

EMP_ID EMP_NAME

14 John

20 Harry

12 Sam

EMP_PERFORMANCE

EMP_ID EMP_PHONE EMP_STATE

14 7272826385 UP

14 9064738238 UP

20 8574783832 Bihar

12 7390372389 Punjab

12 8589830302 Punjab

The main idea of decomposing the relations is to keep different types of information in
separate relations, since first normal form disallows multi-valued and composite attributes.
In the EMP_DETAILS relation, the attribute (EMP_ID) acts as the primary key, and in
the EMP_PERFORMANCE relation the attributes (EMP_ID, EMP_PHONE) together act as the primary
key. Each relation now contains only atomic values and has a primary key, so both are in 1NF.

The relation is decomposed according to the following rules:

o One relation consists of the primary key (EMP_ID) of the original relation (i.e. EMPLOYEE)
and the non-repeating attributes of the original relation (i.e. EMP_NAME).
o The other relation consists of a copy of the primary key of the original relation and all
the repeating attributes of the original relation.

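The decomposition of Method 2 can be sketched as follows. The helper name decompose is hypothetical, and the column split simply mirrors the EMP_DETAILS and EMP_PERFORMANCE tables as they are printed above.

```python
# Flat 1NF rows: (EMP_ID, EMP_NAME, EMP_PHONE, EMP_STATE)
employee_1nf = [
    (14, "John", "7272826385", "UP"),
    (14, "John", "9064738238", "UP"),
    (20, "Harry", "8574783832", "Bihar"),
    (12, "Sam", "7390372389", "Punjab"),
    (12, "Sam", "8589830302", "Punjab"),
]


def decompose(rows):
    """Split the rows into EMP_DETAILS and EMP_PERFORMANCE as in the tables above."""
    # A set removes the repeated (EMP_ID, EMP_NAME) pairs; EMP_ID is the key.
    details = sorted({(emp_id, name) for emp_id, name, _, _ in rows})
    # (EMP_ID, EMP_PHONE) is the key of the second relation.
    performance = [(emp_id, phone, state) for emp_id, _, phone, state in rows]
    return details, performance


emp_details, emp_performance = decompose(employee_1nf)

# Joining the two relations on EMP_ID rebuilds the flat relation.
names = dict(emp_details)
rejoined = [(emp_id, names[emp_id], phone, state)
            for emp_id, phone, state in emp_performance]
print(emp_details)
print(rejoined == employee_1nf)
```

Each employee name is now stored once, and the join on EMP_ID recovers every original row.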
Method 3:

The third method of normalizing an unnormalized relation into 1NF is explained with the
following example, where the number of skills an employee of some company can have is fixed.
Suppose an employee can have a maximum of five skills.

EMP_SKILL relation

EMP_ID Skill

14 DBMS, C, C++

20 JAVA, C

12 DBMS, HTML, VB, MS OFFICE

Here the EMP_SKILL relation is not in 1NF because the Skill attribute contains a set of
values. To remove this problem, we define multiple Skill columns, as shown in the following
table.

EMP_ID   Skill_1   Skill_2   Skill_3   Skill_4     Skill_5

14       DBMS      C         C++       -           -

20       JAVA      C         -         -           -

12       DBMS      HTML      VB        MS OFFICE   -

The above representation is in 1NF, but this technique is not preferred because it may cause
problems such as:

o It is difficult to query the relation. For example, it is difficult to answer queries
like “Which employees share a skill?” or “Which employees have skill C?”
o Employee skills are restricted to 5. If an employee with more skills appears, the extra
skills are left unrecorded.

To sum up, all three approaches are correct because they transform an unnormalized
table into a first normal form table. However, the second approach, where the table is
decomposed into relations, is more efficient because it minimizes the duplication of data. So
for a relation to be in first normal form, each set of repeating groups should appear in its
own table and every relation should have a primary key.
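The querying problem of Method 3 can be seen in a small sketch. The data structures and function names below are assumptions chosen for illustration; unused skill slots are represented by None.

```python
# Method 3 layout: one column per skill slot, None for an unused slot.
emp_skill_wide = {
    14: ["DBMS", "C", "C++", None, None],
    20: ["JAVA", "C", None, None, None],
    12: ["DBMS", "HTML", "VB", "MS OFFICE", None],
}


def employees_with_skill_wide(table, skill):
    # Every slot must be inspected because the skill may sit in any column.
    return sorted(emp for emp, slots in table.items() if skill in slots)


# Separate-relation layout: one (EMP_ID, Skill) row per skill, no fixed limit.
emp_skill_rows = [(emp, skill)
                  for emp, slots in emp_skill_wide.items()
                  for skill in slots if skill is not None]


def employees_with_skill_rows(rows, skill):
    # A single column is compared, whatever the number of skills per employee.
    return sorted(emp for emp, s in rows if s == skill)


print(employees_with_skill_wide(emp_skill_wide, "C"))   # [14, 20]
print(employees_with_skill_rows(emp_skill_rows, "C"))   # [14, 20]
```

In SQL the wide layout would need a condition on all five Skill columns, whereas the row layout needs a condition on one column and accepts a sixth skill without a schema change.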

Anomalies in first normal form

Whereas the first normal form is concerned with the structure of the representation of a
relation, the second normal form is concerned with eliminating redundancy in these
relations. A relation that is only in 1NF can still suffer from anomalies.

The various anomalies can be divided as:

o Insertion anomaly
o Deletion anomaly
o Updation anomaly
These anomalies are named after the relational operations (insert, delete, update) that
trigger them.

Let us take the following example of the ORDER_BOOK relation:

Order_No B_Name Quantity Price

4253 C 15 175

4253 Database 20 225


4154 IT 30 200

4256 C 50 175

4186 Database 15 225

Insertion anomaly: Suppose that we want to insert information about a new book into the
ORDER_BOOK relation. We cannot insert this information until some order is placed for
the book, because in this relation the primary key is a composite key made up of two
attributes, Order_No and B_Name.

Neither Order_No nor B_Name can contain null values, because that would violate the entity
integrity rule, i.e. a primary key cannot contain null values. So we cannot insert the
information of a new book whose order has not been placed yet, because in that case the
attribute Order_No would be null. The rejected row is shown last in the following figure:

Order_No B_Name Quantity Price

4253 C 15 175

4253 Database 20 225

4154 IT 30 200

4256 C 50 175

4186 Database 15 225

NULL Operating System 30 300

Relations which exhibit such kind of undesirable property are said to suffer from insertion
anomaly.
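
The rejected insert can be reproduced with Python's built-in sqlite3 module. This is a minimal sketch: the lower-case table and column names and the helper try_insert are assumptions for the example. NOT NULL is written out explicitly because SQLite does not enforce it on ordinary primary key columns by default.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE order_book (
        order_no INTEGER NOT NULL,
        b_name   TEXT    NOT NULL,
        quantity INTEGER,
        price    INTEGER,
        PRIMARY KEY (order_no, b_name)   -- composite key
    )
    """
)
conn.executemany(
    "INSERT INTO order_book VALUES (?, ?, ?, ?)",
    [
        (4253, "C", 15, 175),
        (4253, "Database", 20, 225),
        (4154, "IT", 30, 200),
        (4256, "C", 50, 175),
        (4186, "Database", 15, 225),
    ],
)


def try_insert(row):
    """Return True if the row was stored, False if a constraint rejected it."""
    try:
        conn.execute("INSERT INTO order_book VALUES (?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False


# A new book with no order yet: Order_No would have to be NULL.
inserted = try_insert((None, "Operating System", 30, 300))
row_count = conn.execute("SELECT COUNT(*) FROM order_book").fetchone()[0]
print(inserted, row_count)  # False 5
```

The book cannot be recorded until an order exists for it, which is exactly the insertion anomaly.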

Deletion anomaly: Suppose that the order whose order number is 4154 is cancelled for
some reason. We would then have to delete this order's information from the
ORDER_BOOK relation.

As we can see from the relation, this particular order contains information about the book
whose name is “IT”. Deleting this record from the relation would therefore result in the loss
of information about the “IT” book, since it is the only record which contains information
about that book. Removing any other record from the relation would cause no such problem,
because the information about its book is still present in another record.

For example, deleting the record whose Order_No = 4154 leaves the relation shown in the
following figure:

Order_No B_Name Quantity Price

4253 C 15 175

4253 Database 20 225

4256 C 50 175

4186 Database 15 225

Relations which exhibit such kind of undesirable property are said to suffer from deletion
anomaly.

Updation Anomaly: Modifying some values in the relation may also prove cumbersome.
Suppose that the price of the book C is changed to 190. Then every tuple referring to this
book has to be updated, and multiple updates always carry some risk of inconsistency.

In the ORDER_BOOK relation, the updation seems very easy because only two tuples have
B_Name C with Price 175. But if a relation has thousands of tuples containing
a large amount of redundant data, the updations may lead to inconsistency, as humans are
prone to errors.

Relations which exhibit this kind of undesirable property are said to suffer from updation
anomaly.
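
The risk of a partial update can be illustrated with a short sketch. The dictionary layout and the helper prices_by_book are assumptions made for the example; the careless update deliberately changes only one of the two tuples for book C.

```python
order_book = [
    {"Order_No": 4253, "B_Name": "C", "Quantity": 15, "Price": 175},
    {"Order_No": 4253, "B_Name": "Database", "Quantity": 20, "Price": 225},
    {"Order_No": 4154, "B_Name": "IT", "Quantity": 30, "Price": 200},
    {"Order_No": 4256, "B_Name": "C", "Quantity": 50, "Price": 175},
    {"Order_No": 4186, "B_Name": "Database", "Quantity": 15, "Price": 225},
]


def prices_by_book(rows):
    """Collect the set of distinct prices recorded for each book."""
    prices = {}
    for row in rows:
        prices.setdefault(row["B_Name"], set()).add(row["Price"])
    return prices


# Careless update: the new price reaches order 4253 but not order 4256.
for row in order_book:
    if row["B_Name"] == "C" and row["Order_No"] == 4253:
        row["Price"] = 190

# A book with more than one recorded price is now inconsistent.
inconsistent = {book: sorted(p)
                for book, p in prices_by_book(order_book).items()
                if len(p) > 1}
print(inconsistent)  # {'C': [175, 190]}
```

If the price were stored once per book in its own relation, a single update would be enough and this inconsistency could not arise.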

The above considerations lead us to the conclusion that relations in 1NF have undesirable
data manipulation properties, so bringing a relation to 1NF does not complete logical
database design. Further transformations are needed to eliminate these kinds of anomalies from
a set of original relations. This brings us to the concept of second normal form.

Second Normal Form (2NF)


o To be in 2NF, a relation must first be in 1NF.
o In second normal form, all non-key attributes are fully functionally dependent on the
primary key.

Example: Let's assume a school stores data about teachers and the subjects they teach.
In a school, a teacher can teach more than one subject.

TEACHER table

TEACHER_ID SUBJECT TEACHER_AGE

25 Chemistry 30

25 Biology 30

47 English 35

83 Math 38

83 Computer 38

In the given table, the candidate key is {TEACHER_ID, SUBJECT}. The non-prime attribute
TEACHER_AGE is dependent on TEACHER_ID alone, which is a proper subset of the candidate key.
That is why the table violates the rule for 2NF.

To convert the given table into 2NF, we decompose it into two tables:

TEACHER_DETAIL table:
TEACHER_ID TEACHER_AGE

25 30

47 35

83 38

TEACHER_SUBJECT table:

TEACHER_ID SUBJECT

25 Chemistry

25 Biology

47 English

83 Math

83 Computer
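
The partial dependency and the decomposition above can be checked with a small sketch. The function holds is a hypothetical helper that tests a functional dependency against the sample rows only; it says nothing about rows that are not in the table.

```python
# (TEACHER_ID, SUBJECT, TEACHER_AGE)
teacher = [
    (25, "Chemistry", 30),
    (25, "Biology", 30),
    (47, "English", 35),
    (83, "Math", 38),
    (83, "Computer", 38),
]


def holds(rows, lhs, rhs):
    """True if the dependency lhs -> rhs (column indexes) holds in the given rows."""
    seen = {}
    for row in rows:
        key = tuple(row[i] for i in lhs)
        value = tuple(row[i] for i in rhs)
        # The same left-hand side must always map to the same right-hand side.
        if seen.setdefault(key, value) != value:
            return False
    return True


# TEACHER_AGE depends on TEACHER_ID alone: a partial dependency on the key.
partial = holds(teacher, [0], [2])

# Decompose into TEACHER_DETAIL and TEACHER_SUBJECT.
teacher_detail = sorted({(tid, age) for tid, _, age in teacher})
teacher_subject = [(tid, subject) for tid, subject, _ in teacher]

# Joining on TEACHER_ID gives back the original rows (lossless decomposition).
ages = dict(teacher_detail)
rejoined = [(tid, subject, ages[tid]) for tid, subject in teacher_subject]
print(partial, rejoined == teacher)  # True True
```

Each teacher's age is now stored once in TEACHER_DETAIL, and no information is lost by the split.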

Anomalies in Second Normal Form


Even if a relation is in 2NF, it can still suffer from insertion, deletion and updation
anomalies. So before discussing the third normal form, we will explain these anomalies.

To discuss the various anomalies, we will consider the STUDENT relation, which holds
information about students and teachers.

Stu_Id Stu_Name Teach_Id Teach_Name Teach_Qual


2523 Anurag 201 Mohan MCA

3712 Raju 202 Ravi M.Tech

4906 Raman 203 Mahima Ph.D

2716 Jyoti 204 Anjali MCA

1768 Meetali 205 Sonia M.Tech

In the above table, Stu_Id is the primary key, which acts as the roll number of the student.

Since the primary key of the STUDENT relation consists of only one attribute (Stu_Id), no
partial dependency is possible, so the relation is in 2NF. But it suffers from the insertion,
deletion and updation anomalies, which are explained as follows.

Insertion anomaly: Suppose that we want to insert a new record with some information
about a new teacher who has not yet been assigned a student. This insertion
is not allowed because the primary key Stu_Id would contain a null value, which is
against the entity integrity rule.

For example: Suppose that we want to insert information about a new teacher ‘Mayank’
having Teach_Id = ‘206’ and Teach_Qual = ‘MCA’ who has not yet been allotted any student. This
is not possible, as Stu_Id would contain a null value. The rejected row is shown last in the
following table.

Stu_Id Stu_Name Teach_Id Teach_Name Teach_Qual

2523 Anurag 201 Mohan MCA

3712 Raju 202 Ravi M.Tech

4906 Raman 203 Mahima Ph.D

2716 Jyoti 204 Anjali MCA


1768 Meetali 205 Sonia M.Tech

NULL NULL 206 Mayank MCA

Deletion anomaly: Suppose that the student whose Stu_Id = 1768 decides to leave the college,
so we have to delete this tuple from the STUDENT relation.

As we can see from the relation, this particular student is the last student of the teacher
whose Teach_Id = ‘205’. Thus, on deleting this tuple, the information about the teacher would
also be deleted. This may lead to the loss of vital information. This is the deletion anomaly.

There would be no deletion problem if the student who decides to leave the college were not
the last student of the particular teacher, because the teacher's information would still be
present in another tuple.

For example: In the sample data each teacher has only one student, so deleting the student
record with Stu_Id = 2523 also removes the information about the teacher whose
Teach_Id = ‘201’, as the table below shows.

Stu_Id Stu_Name Teach_Id Teach_Name Teach_Qual

3712 Raju 202 Ravi M.Tech

4906 Raman 203 Mahima Ph.D

2716 Jyoti 204 Anjali MCA

1768 Meetali 205 Sonia M.Tech

Updation Anomaly: A relation in second normal form also suffers from the updation anomaly.

For example: Suppose the qualification (Teach_Qual) of the teacher whose Teach_Id
= ‘204’ is updated from MCA to Ph.D. The updation has to be made in every
tuple where this information reoccurs. This relation has only a few tuples, so it would not
be a big problem here, but normally a teacher teaches many students. In a huge database
this becomes a big problem and may lead to inconsistencies, as humans are prone to errors.

The above considerations lead us to the conclusion that relations in 2NF have undesirable data
manipulation properties, so bringing a relation to 2NF does not complete logical database
design. Further transformations are needed to eliminate these kinds of anomalies from an
original relation. This brings us to the concept of the third normal form.

Third Normal Form (3NF)


o A relation is in 3NF if it is in 2NF and does not contain any transitive dependency of a
non-prime attribute on the primary key.
o 3NF is used to reduce data duplication. It is also used to achieve data integrity.
o If a relation is in 2NF and there is no transitive dependency for non-prime attributes, then
the relation is in third normal form.

A relation is in third normal form if it holds at least one of the following conditions for
every non-trivial functional dependency X → Y.

1. X is a super key.
2. Y is a prime attribute, i.e., each element of Y is part of some candidate key.

To explain 3NF, let us consider the example of the EMPLOYEE_DETAIL relation shown below.

Example:

EMPLOYEE_DETAIL table:

EMP_ID EMP_NAME EMP_ZIP EMP_STATE EMP_CITY

222 Harry 201010 UP Noida

333 Stephan 02228 US Boston

444 Lan 60007 US Chicago

555 Katharine 06389 UK Norwich


666 John 462007 MP Bhopal

Super keys in the table above:

1. {EMP_ID}, {EMP_ID, EMP_NAME}, {EMP_ID, EMP_NAME, EMP_ZIP}, and so on.

Candidate key: {EMP_ID}

Non-prime attributes: In the given table, all attributes except EMP_ID are non-prime.

Here, EMP_STATE and EMP_CITY are dependent on EMP_ZIP, and EMP_ZIP is dependent on
EMP_ID. The non-prime attributes (EMP_STATE, EMP_CITY) are therefore transitively dependent
on the super key (EMP_ID), which violates the rule of third normal form.

Reducing a 2NF relation to 3NF consists of splitting it into appropriate relations such
that, in each resulting relation, every non-key attribute is functionally dependent on the
primary key directly, not transitively or indirectly.
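
The two 3NF conditions can be tested mechanically with an attribute-closure computation. The sketch below is an illustration under stated assumptions: the functional dependencies are the ones described for EMPLOYEE_DETAIL (with EMP_ID → EMP_NAME taken from EMP_ID being the candidate key), and the function names closure and violations_3nf are hypothetical.

```python
def closure(attrs, fds):
    """Return every attribute functionally determined by attrs under fds."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If the left side is already determined, so is the right side.
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


ALL_ATTRS = {"EMP_ID", "EMP_NAME", "EMP_ZIP", "EMP_STATE", "EMP_CITY"}
FDS = [
    (frozenset({"EMP_ID"}), frozenset({"EMP_NAME", "EMP_ZIP"})),
    (frozenset({"EMP_ZIP"}), frozenset({"EMP_STATE", "EMP_CITY"})),
]
PRIME = {"EMP_ID"}  # attributes that belong to some candidate key


def violations_3nf(all_attrs, fds, prime):
    """List the dependencies X -> Y that satisfy neither 3NF condition."""
    bad = []
    for lhs, rhs in fds:
        is_super_key = closure(lhs, fds) == all_attrs   # condition 1
        rhs_is_prime = (rhs - lhs) <= prime             # condition 2
        if not is_super_key and not rhs_is_prime:
            bad.append((sorted(lhs), sorted(rhs)))
    return bad


print(sorted(closure({"EMP_ID"}, FDS)))
print(violations_3nf(ALL_ATTRS, FDS, PRIME))
# [(['EMP_ZIP'], ['EMP_CITY', 'EMP_STATE'])]
```

The closure of {EMP_ID} contains every attribute, so EMP_ID is a super key, while EMP_ZIP → (EMP_STATE, EMP_CITY) is reported as the dependency that breaks 3NF.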

That's why we need to move the EMP_CITY and EMP_STATE to the new
<EMPLOYEE_ZIP> table, with EMP_ZIP as a Primary key.

EMPLOYEE table:

EMP_ID EMP_NAME EMP_ZIP

222 Harry 201010

333 Stephan 02228

444 Lan 60007

555 Katharine 06389

666 John 462007

EMPLOYEE_ZIP table:
EMP_ZIP EMP_STATE EMP_CITY

201010 UP Noida

02228 US Boston

60007 US Chicago

06389 UK Norwich

462007 MP Bhopal
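
The two 3NF tables can be created and joined back together with Python's built-in sqlite3 module. This is a sketch only: the lower-case table and column names are assumptions, and EMP_ZIP is stored as text so that leading zeros such as 02228 are preserved.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when asked
conn.executescript(
    """
    CREATE TABLE employee_zip (
        emp_zip   TEXT PRIMARY KEY,
        emp_state TEXT NOT NULL,
        emp_city  TEXT NOT NULL
    );
    CREATE TABLE employee (
        emp_id   INTEGER PRIMARY KEY,
        emp_name TEXT NOT NULL,
        emp_zip  TEXT NOT NULL REFERENCES employee_zip(emp_zip)
    );
    """
)
conn.executemany(
    "INSERT INTO employee_zip VALUES (?, ?, ?)",
    [
        ("201010", "UP", "Noida"),
        ("02228", "US", "Boston"),
        ("60007", "US", "Chicago"),
        ("06389", "UK", "Norwich"),
        ("462007", "MP", "Bhopal"),
    ],
)
conn.executemany(
    "INSERT INTO employee VALUES (?, ?, ?)",
    [
        (222, "Harry", "201010"),
        (333, "Stephan", "02228"),
        (444, "Lan", "60007"),
        (555, "Katharine", "06389"),
        (666, "John", "462007"),
    ],
)

# Joining on EMP_ZIP reconstructs the original EMPLOYEE_DETAIL rows.
employee_detail = conn.execute(
    """
    SELECT e.emp_id, e.emp_name, e.emp_zip, z.emp_state, z.emp_city
    FROM employee AS e
    JOIN employee_zip AS z ON z.emp_zip = e.emp_zip
    ORDER BY e.emp_id
    """
).fetchall()
for row in employee_detail:
    print(row)
```

Because the state and city of a ZIP code are stored once in employee_zip, changing them requires updating a single row regardless of how many employees share that ZIP code.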

Anomalies in Third Normal Form


Even if a relation is in 3NF, it can still suffer from insertion, deletion and updation
anomalies. So before discussing the next higher normal form, we will explain these anomalies.

To discuss the various anomalies, we will consider the STAFF relation, which holds information
about the staff, the equipment they have been allocated and the languages in which they
are fluent.

STAFF (@S_Name + @Equipment + @Language), where the @ symbol marks an attribute that is part
of the primary key.

STAFF Relation:

S_Name   Equipment       Language

Anurag   PC, Mainframe   English, French

Kapil    PC              English, French, Japanese

The above representation is not a relation, because Equipment and Language hold multiple
values. In order to represent it as a relation in 1NF, we need to convert it to the form shown
in the following table.

STAFF Relation:

S_Name Equipment Language

Anurag PC English

Anurag PC French

Anurag Mainframe English

Anurag Mainframe French

Kapil PC English

Kapil PC French

Kapil PC Japanese

In the above relation, every staff member has two independent sets of features associated
with them: equipment and languages.

The STAFF relation has a primary key composed of S_Name, Equipment and Language.
There is no transitive dependency, so the relation STAFF is in 3NF. But it still suffers from
the insertion, deletion and updation anomalies, which are explained as follows.

Insertion anomaly: Suppose Anurag learns a new language, Japanese. Then we have
to insert two new records into the STAFF relation, one for each of his two Equipment values.
Similarly, if the number of Equipment values for Anurag grows from 2 to 5, then introducing
the new language Japanese requires five new records in the STAFF relation, which
results in redundancy.
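
The row counts in this example follow from the fact that STAFF stores every combination of a staff member's equipment and languages. A minimal sketch, assuming a hypothetical helper staff_rows that builds those combinations:

```python
from itertools import product


def staff_rows(name, equipment, languages):
    """Build the STAFF tuples for one person: every equipment/language pair."""
    return [(name, e, lang) for e, lang in product(equipment, languages)]


before = staff_rows("Anurag", ["PC", "Mainframe"], ["English", "French"])
after = staff_rows("Anurag", ["PC", "Mainframe"], ["English", "French", "Japanese"])

# Rows that must be inserted when Anurag learns Japanese.
new_rows = [row for row in after if row not in before]
print(len(before), len(after))  # 4 6
print(new_rows)
# [('Anurag', 'PC', 'Japanese'), ('Anurag', 'Mainframe', 'Japanese')]
```

One new fact (Anurag speaks Japanese) costs one row per piece of equipment, so with five equipment values it would cost five rows.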

Deletion anomaly: Suppose that the staff member Kapil has his equipment PC deallocated. Then
all the information about his language skills is lost due to the deletion of these records,
which may result in the loss of vital information.

Updation anomaly: Suppose that the name of a staff member changes from Anurag to Anuraj
for some reason. Then multiple records need to be updated, which may result in
inconsistency of data.


All these anomalies are encountered due to the presence of multivalued dependency which
is removed in fourth normal form. The concept of multivalued dependency and 4NF will be
explained later.
