Database Normalization

Database normalization is a technique to organize data in a database. It removes duplicated data and ensures data is stored logically. The normalization process converts tables to normal forms like 1NF, 2NF and 3NF to reduce anomalies like insertion, update and deletion anomalies. Higher normal forms like BCNF, 4NF, 5NF and 6NF further reduce dependencies but are not always needed in practice. Normalization through at least 3NF is important to maintain data integrity.
DATABASE NORMALIZATION
INTRO1
Database normalization is a technique for organizing the data in a database.
Normalization is a systematic approach of decomposing tables to eliminate data
redundancy (repetition) and undesirable characteristics such as insertion, update and
deletion anomalies.
It is a multi-step process that puts data into tabular form, removing duplicated data
from the relation tables.
Without normalization, changes made to the database to reflect changes in
reality will create inconsistencies in the data, effectively making the data
useless.
DATABASE NORMALIZATION
INTRO2
The normalization process involves converting tables into various types of normal
forms.
A table that contains a repeating group, or multiple entries for a single row, is called
an unnormalized table.
Normalization starts at 1NF and can continue up to 6NF:
• 1NF (First Normal Form)
• 2NF (Second Normal Form)
• 3NF (Third Normal Form)
• BCNF (Boyce-Codd Normal Form)
• 4NF (Fourth Normal Form)
• 5NF (Fifth Normal Form)
• 6NF (Sixth Normal Form)
DATABASE NORMALIZATION
INTRO3
Many businesses will be satisfied with their database being in 3NF and stop there, as at this point the database will
accurately represent reality, provided the data itself is accurate.

There are, however, further stages of normalization beyond 3NF and BCNF that are rarely needed in practice.

4NF removes non-trivial multi-valued dependencies.

5NF removes join dependencies that are not implied by the candidate keys.

6NF is sometimes equated with Domain-Key Normal Form (DKNF), although the two are usually defined as distinct normal
forms. DKNF states that every constraint is a logical consequence of domain and key constraints.
DKNF is theoretically the holy grail of normalization, where insertion, deletion, and update anomalies can never occur.

The process for reaching these last normal forms is difficult and largely academic. In most cases it is enough just to
know that they exist.

As an aside, many database designers do not remove all violations of all normal form rules. For example, state and zip
codes, though a transitive dependency, will rarely be projected out because it is largely unnecessary to do so. Creating
additional tables requires more CPU at runtime to join the tables back together, and the likelihood of a new state being added is
remote.
NORMALIZATION
Normalization is necessary: without it, the overall integrity of the data stored in the
database will eventually degrade. Specifically, this is due to data anomalies. These anomalies occur naturally
and result in data that does not match the real world the database purports to represent.

Anomalies are caused when there is too much redundancy in the database's information. Anomalies can
often be caused when the tables that make up the database suffer from poor construction.

So, what does "poor construction" mean? A table is poorly constructed when the designer
fails to identify the entities that depend on each other for existence, such as
a hotel and its rooms, and so fails to minimize the chance that one could ever exist in the database independent of
the other.

The normalization process was created largely in order to reduce the negative effects of creating tables
that will introduce anomalies into the database.
PROBLEMS WITHOUT
NORMALIZATION

If a table is not properly normalized and has data redundancy, it
will not only take up extra storage space but will also make it difficult to
handle and update the database without facing data loss.
Insertion, update and deletion anomalies are very frequent if a
database is not normalized.
To understand these anomalies, let us take the example of a Student
table.
DATA ANOMALIES
There are three types of Data Anomalies: Update Anomalies, Insertion Anomalies, and Deletion Anomalies.
Update Anomalies happen when the person charged with the task of keeping all the records current and
accurate, is asked, for example, to change an employee’s title due to a promotion. If the data is stored
redundantly in the same table, and the person misses any of them, then there will be multiple titles
associated with the employee. The end user has no way of knowing which is the correct title.

Insertion Anomalies happen when inserting vital data into the database is not possible because other data is
not already there. For example, if a system is designed to require that a customer be on file before a sale can
be made to that customer, but you cannot add a customer until they have bought something, then you have
an insert anomaly. It is the classic "catch-22" situation.

Deletion Anomalies happen when the deletion of unwanted information causes desired information to be
deleted as well. For example, if a single database record contains information about a particular product
along with information about a salesperson for the company and the salesperson quits, then information
about the product is deleted along with salesperson information.
PROBLEMS WITHOUT
NORMALIZATION
rollno  name  branch  hod    office_tel
401     Akon  CSE     Mr. X  53337
402     Bkon  CSE     Mr. X  53337
403     Ckon  CSE     Mr. X  53337
404     Dkon  CSE     Mr. X  53337

In the table above, we have data for 4 Computer Science students.
Data for the fields branch, hod (Head of Department) and office_tel is repeated for
the students who are in the same branch of the college.
This is data redundancy.
INSERTION ANOMALIES
Suppose a new student is admitted. Until the student opts for a branch, the student's
data cannot be inserted, or else we will have to set the branch information to
NULL.
Also, if we have to insert data for 100 students of the same branch, then the branch
information will be repeated for all those 100 students.
These scenarios are insertion anomalies.
UPDATE ANOMALIES
What if Mr. X leaves the college, or is no longer the HOD of the computer science
department? In that case all the student records will have to be updated, and if by
mistake we miss any record, it will lead to data inconsistency. This is an update
anomaly.

DELETION ANOMALIES
In our Student table, two different kinds of information are kept together: student
information and branch information. Hence, at the end of the academic year, if
student records are deleted, we will also lose the branch information. This is a
deletion anomaly.
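The update and deletion anomalies described above can be reproduced with a minimal sketch using Python's built-in sqlite3 module. The replacement HOD name 'Mr. Y' and the split into a separate branch table are assumptions made for illustration; they are not part of the original slides.

```python
import sqlite3

STUDENTS = [
    (401, "Akon", "CSE", "Mr. X", "53337"),
    (402, "Bkon", "CSE", "Mr. X", "53337"),
    (403, "Ckon", "CSE", "Mr. X", "53337"),
    (404, "Dkon", "CSE", "Mr. X", "53337"),
]

# --- Unnormalized design: one table holds student AND branch data ---
flat = sqlite3.connect(":memory:")
flat.execute(
    "CREATE TABLE student (rollno INTEGER PRIMARY KEY, name TEXT, "
    "branch TEXT, hod TEXT, office_tel TEXT)"
)
flat.executemany("INSERT INTO student VALUES (?, ?, ?, ?, ?)", STUDENTS)

# Update anomaly: the HOD changes ('Mr. Y' is hypothetical) but row 404 is missed.
flat.execute("UPDATE student SET hod = 'Mr. Y' WHERE rollno IN (401, 402, 403)")
hods = sorted(
    row[0] for row in flat.execute("SELECT DISTINCT hod FROM student WHERE branch = 'CSE'")
)

# Deletion anomaly: deleting the students also erases all branch information.
flat.execute("DELETE FROM student")
branch_rows_left = flat.execute(
    "SELECT COUNT(*) FROM student WHERE branch = 'CSE'"
).fetchone()[0]

# --- Normalized design (an assumed decomposition): branch data stored once ---
norm = sqlite3.connect(":memory:")
norm.execute("CREATE TABLE branch (branch TEXT PRIMARY KEY, hod TEXT, office_tel TEXT)")
norm.execute(
    "CREATE TABLE student (rollno INTEGER PRIMARY KEY, name TEXT, "
    "branch TEXT REFERENCES branch(branch))"
)
norm.execute("INSERT INTO branch VALUES ('CSE', 'Mr. X', '53337')")
norm.executemany(
    "INSERT INTO student VALUES (?, ?, ?)", [(r[0], r[1], r[2]) for r in STUDENTS]
)

# The HOD change now touches exactly one row, so no row can be missed.
rows_touched = norm.execute("UPDATE branch SET hod = 'Mr. Y' WHERE branch = 'CSE'").rowcount

# Deleting every student leaves the branch record intact.
norm.execute("DELETE FROM student")
branch_survives = norm.execute("SELECT COUNT(*) FROM branch").fetchone()[0]

print(hods, branch_rows_left, rows_touched, branch_survives)
```

In the single-table design the same branch ends up with two different HOD names, while in the two-table design the change is made in one place.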
FUNCTIONAL DEPENDENCIES
The attributes of a table are said to be dependent on each other when an attribute of a table uniquely identifies another
attribute of the same table.
For example: Suppose we have a student table with attributes Stu_Id, Stu_Name, Stu_Age. Here the Stu_Id attribute
uniquely identifies the Stu_Name attribute of the student table, because if we know the student id we can tell the student
name associated with it.
This is known as a functional dependency and can be written as Stu_Id -> Stu_Name, or in words:
Stu_Name is functionally dependent on Stu_Id.
Formally:
If column A of a table uniquely identifies column B of the same table, then this can be represented as A -> B (attribute B is
functionally dependent on attribute A).
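As a rough illustration of the definition, the sketch below checks whether a dependency A -> B is contradicted by a set of sample rows. The function name fd_holds and the three sample students are hypothetical. Sample data can only refute a functional dependency, never prove that it holds for every possible state of the table.

```python
def fd_holds(rows, lhs, rhs):
    """Return True if no two rows agree on `lhs` but differ on `rhs`."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        value = tuple(row[a] for a in rhs)
        # setdefault stores the first value seen for this key and returns it.
        if seen.setdefault(key, value) != value:
            return False
    return True


# Hypothetical sample rows: two different students share the same name.
students = [
    {"Stu_Id": 1, "Stu_Name": "Asha", "Stu_Age": 20},
    {"Stu_Id": 2, "Stu_Name": "Ravi", "Stu_Age": 21},
    {"Stu_Id": 3, "Stu_Name": "Asha", "Stu_Age": 22},
]

id_determines_name = fd_holds(students, ["Stu_Id"], ["Stu_Name"])
name_determines_id = fd_holds(students, ["Stu_Name"], ["Stu_Id"])
print(id_determines_name, name_determines_id)
```

Here Stu_Id -> Stu_Name is consistent with the rows, while Stu_Name -> Stu_Id is refuted because the name "Asha" maps to two different ids.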
Types of Functional Dependencies
• Trivial functional dependency
• Non-trivial functional dependency
• Multivalued dependency
• Transitive dependency
TRIVIAL FUNCTIONAL
DEPENDENCY
In relational database theory, a functional dependency exists when one attribute
uniquely determines another attribute.
A trivial functional dependency is one in which the dependent attribute is already part of the
attribute or set of attributes that determines it.
In other words, the dependency of an attribute on a set of attributes is known as a trivial functional
dependency if the set of attributes includes that attribute.
Symbolically: A -> B is a trivial functional dependency if B is a subset of A.
The following dependencies are also trivial: A -> A and B -> B.
For example: Consider a table with two columns Student_id and Student_Name.
{Student_Id, Student_Name} -> Student_Id is a trivial functional dependency as Student_Id is a
subset of {Student_Id, Student_Name}. That makes sense because if we know the values of
Student_Id and Student_Name then the value of Student_Id can be uniquely determined.
Also, Student_Id -> Student_Id & Student_Name -> Student_Name are trivial dependencies too.
NON TRIVIAL FUNCTIONAL
DEPENDENCY
If a functional dependency X -> Y holds where Y is not a subset of X, then this dependency is called a non-trivial
functional dependency.
For example: An employee table with three attributes: emp_id, emp_name, emp_address.
The following functional dependencies are non-trivial:
emp_id -> emp_name (emp_name is not a subset of emp_id)
emp_id -> emp_address (emp_address is not a subset of emp_id)

On the other hand, the following dependency is trivial:
{emp_id, emp_name} -> emp_name [emp_name is a subset of {emp_id, emp_name}]
Completely non-trivial FD: If an FD X -> Y holds where the intersection of X and Y is empty, then this dependency is said to
be a completely non-trivial functional dependency.
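The three cases (trivial, non-trivial, completely non-trivial) depend only on how the two attribute sets overlap, which a short sketch with Python sets can make concrete. The function name classify_fd is a hypothetical helper, not something defined in the slides.

```python
def classify_fd(lhs, rhs):
    """Classify the dependency lhs -> rhs purely by set overlap."""
    lhs, rhs = set(lhs), set(rhs)
    if rhs <= lhs:
        # Every attribute on the right already appears on the left.
        return "trivial"
    if lhs & rhs:
        # Some overlap, but the right side adds at least one new attribute.
        return "non-trivial"
    # No attribute is shared between the two sides.
    return "completely non-trivial"


print(classify_fd({"emp_id", "emp_name"}, {"emp_name"}))
print(classify_fd({"emp_id"}, {"emp_name"}))
print(classify_fd({"emp_id", "emp_name"}, {"emp_name", "emp_address"}))
```

Note that this only classifies the form of a dependency; whether the dependency actually holds is a property of the data.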
NON-TRIVIAL FUNCTIONAL
DEPENDENCY EXAMPLE 2
Company    CEO            Age
Microsoft  Satya Nadella  51
Google     Sundar Pichai  46
Apple      Tim Cook       57

A non-trivial functional dependency occurs when A -> B holds
and B is not a subset of A.
Example:
{Company} -> {CEO} (if we know the Company, we know the CEO's name)
CEO is not a subset of Company, and hence this is a non-trivial functional dependency.
MULTIVALUED DEPENDENCY
A multivalued dependency occurs when there is more than one independent multivalued attribute in a
table.
For example: Consider a bike manufacturing company which produces two colors (Black and Red) of
each model every year.

bike_model  manuf_year  color
M1001       2007        Black
M1001       2007        Red
M2012       2008        Black
M2012       2008        Red
M2222       2009        Black
M2222       2009        Red

Here the columns manuf_year and color are independent of each other and dependent on bike_model.
In this case these two columns are said to be multivalued dependent on bike_model. These
dependencies can be represented like this:
bike_model ->> manuf_year
bike_model ->> color
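One way to test a multivalued dependency X ->> Y on sample rows is to check that, for each value of X, every Y value is paired with every value of the remaining attributes. The sketch below does this; the function name mvd_holds and the second, incomplete table are assumptions added for illustration.

```python
def mvd_holds(rows, x, y):
    """Check X ->> Y: within each X group, Y values and the remaining
    attributes (Z) must appear in every possible combination."""
    attrs = list(rows[0].keys())
    z = [a for a in attrs if a not in x and a not in y]
    groups = {}
    for row in rows:
        xv = tuple(row[a] for a in x)
        yv = tuple(row[a] for a in y)
        zv = tuple(row[a] for a in z)
        group = groups.setdefault(xv, {"y": set(), "z": set(), "pairs": set()})
        group["y"].add(yv)
        group["z"].add(zv)
        group["pairs"].add((yv, zv))
    # The observed pairs are always a subset of the cross product,
    # so equal sizes means the two sets are equal.
    return all(
        len(g["pairs"]) == len(g["y"]) * len(g["z"]) for g in groups.values()
    )


bikes = [
    {"bike_model": "M1001", "manuf_year": 2007, "color": "Black"},
    {"bike_model": "M1001", "manuf_year": 2007, "color": "Red"},
    {"bike_model": "M2012", "manuf_year": 2008, "color": "Black"},
    {"bike_model": "M2012", "manuf_year": 2008, "color": "Red"},
    {"bike_model": "M2222", "manuf_year": 2009, "color": "Black"},
    {"bike_model": "M2222", "manuf_year": 2009, "color": "Red"},
]
holds_color = mvd_holds(bikes, ["bike_model"], ["color"])
holds_year = mvd_holds(bikes, ["bike_model"], ["manuf_year"])

# Hypothetical counter-example: the (2008, Red) combination is missing.
incomplete = [
    {"bike_model": "M1", "manuf_year": 2007, "color": "Black"},
    {"bike_model": "M1", "manuf_year": 2007, "color": "Red"},
    {"bike_model": "M1", "manuf_year": 2008, "color": "Black"},
]
holds_incomplete = mvd_holds(incomplete, ["bike_model"], ["color"])
print(holds_color, holds_year, holds_incomplete)
```

As with functional dependencies, a check on sample rows can show that a dependency is violated but cannot establish that it holds in general.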
TRANSITIVE DEPENDENCY
A transitive dependency is a type of functional dependency that is
indirectly formed by two functional dependencies. A transitive dependency can only
occur in a relation of three or more attributes. Identifying such dependencies helps us normalize the
database to 3NF (Third Normal Form).

Book                Author               Author_age
Game of Thrones     George R.R. Martin   66
Harry Potter        J.K. Rowling         49
Dying of the Light  George R.R. Martin   66

{Book} -> {Author} (if we know the book, we know the author's name)
{Author} does not -> {Book}
{Author} -> {Author_age}
Therefore, by the rule of transitive dependency, {Book} -> {Author_age} should hold.
That makes sense, because if we know the book name we can know the author's age.
ANOTHER TRANSITIVE DEPENDENCY EXAMPLE

Company    CEO            Age
Microsoft  Satya Nadella  51
Google     Sundar Pichai  46
Alibaba    Jack Ma        54

{Company} -> {CEO} (if we know the company, we know its CEO's name)
{CEO} -> {Age} (if we know the CEO, we know the age)
Therefore, according to the rule of transitive dependency,
{Company} -> {Age} should hold. That makes sense, because if we know the
company name, we can know the CEO's age.
Again, remember that a transitive dependency can only occur in a relation of three or
more attributes.
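The rule of transitivity can be applied mechanically by computing an attribute closure: start from a set of attributes and keep adding whatever the known dependencies imply. The sketch below uses the standard closure algorithm; the function name closure and the encoding of dependencies as pairs of sets are choices made for this illustration.

```python
def closure(attrs, fds):
    """Return every attribute determined by `attrs` under the given FDs.

    `fds` is a list of (lhs, rhs) pairs, each side a set of attribute names.
    """
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If we already know all of lhs, we also know all of rhs.
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


fds = [({"Company"}, {"CEO"}), ({"CEO"}, {"Age"})]

company_closure = closure({"Company"}, fds)
ceo_closure = closure({"CEO"}, fds)
print(sorted(company_closure), sorted(ceo_closure))
```

Age appears in the closure of {Company} even though no dependency lists it directly, which is exactly the transitive dependency {Company} -> {Age}. Company does not appear in the closure of {CEO}, so the reverse direction is not implied.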
LET’S HAVE A KEY DISCUSSION
AGAIN:
DEFINITION OF CANDIDATE KEY IN DBMS:
A super key with no redundant attribute is known as candidate key. Candidate keys
are selected from the set of super keys, the only thing we take care while selecting
candidate key is that the candidate key should not have any redundant attributes.
That’s the reason they are also termed as minimal super key.
CANDIDATE KEY EXAMPLE
Let's take the example of a table “Employee”. This table has three attributes: Emp_Id,
Emp_Number & Emp_Name. Here Emp_Id & Emp_Number will have unique values, and
Emp_Name can have duplicate values, as more than one employee can have the same name.

Emp_Id  Emp_Number  Emp_Name
------  ----------  ---------
E01     2264        Steve
E22     2278        Ajeet
E23     2288        Chaitanya
E45     2290        Robert

How many super keys can the above table have?
1. {Emp_Id}
2. {Emp_Number}
3. {Emp_Id, Emp_Number}
4. {Emp_Id, Emp_Name}
5. {Emp_Id, Emp_Number, Emp_Name}
6. {Emp_Number, Emp_Name}

Let's select the candidate keys from the above set of super keys.
SELECTING THE CANDIDATE
KEY
1. {Emp_Id} – No redundant attributes
2. {Emp_Number} – No redundant attributes
3. {Emp_Id, Emp_Number} – Redundant attribute. Either of those attributes can be a minimal super key as both of
these columns have unique values.
4. {Emp_Id, Emp_Name} – Redundant attribute Emp_Name.
5. {Emp_Id, Emp_Number, Emp_Name} – Redundant attributes. Emp_Id or Emp_Number alone are sufficient enough
to uniquely identify a row of Employee table.
6. {Emp_Number, Emp_Name} – Redundant attribute Emp_Name.
The candidate keys we have selected are:
{Emp_Id}
{Emp_Number}
Note: A primary key is selected from the set of candidate keys. That means we can either have Emp_Id or Emp_Number
as primary key. The decision is made by DBA (Database administrator)
DEFINITION OF SUPER KEY IN
DBMS
A super key is a set of one or more attributes (columns) which can uniquely
identify a row in a table. DBMS beginners often confuse super keys
and candidate keys, so we will also discuss the candidate key and its relation to the super
key.
How candidate key is different from super key?
Answer is simple – Candidate keys are selected from the set of super keys, the only
thing we take care while selecting candidate key is: It should not have any redundant
attribute. That’s the reason they are also termed as minimal super key.
LET’S TAKE AN EXAMPLE TO
UNDERSTAND THIS:
Table: Employee

Emp_SSN    Emp_Number  Emp_Name
---------  ----------  ---------
123456789  226         Steve
999999321  227         Ajeet
888997212  228         Chaitanya
777778888  229         Robert

Super keys: The table above has the following super keys. Each of the following sets is able
to uniquely identify a row of the Employee table.
{Emp_SSN}
{Emp_Number}
{Emp_SSN, Emp_Number}
{Emp_SSN, Emp_Name}
{Emp_SSN, Emp_Number, Emp_Name}
{Emp_Number, Emp_Name}
CANDIDATE KEYS AGAIN
As mentioned in the beginning, a candidate key is a minimal super key with no
redundant attributes. The following two super keys are chosen from the above
sets, as there are no redundant attributes in them.

{Emp_SSN}
{Emp_Number}

Only these two sets are candidate keys as all other sets are having redundant
attributes that are not necessary for unique identification.
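The selection of candidate keys from super keys can be sketched by brute force over sample rows: a set of columns is a super key if no two rows share the same values in it, and a candidate key is a super key with no smaller super key inside it. The function names and the fifth employee row (a second "Steve", added so that Emp_Name has a duplicate as the text describes) are assumptions for this illustration. Keys are really a property of the schema, so sample rows can only suggest them.

```python
from itertools import combinations


def super_keys(rows, attrs):
    """Return every attribute combination whose values are unique across rows."""
    keys = []
    for size in range(1, len(attrs) + 1):
        for combo in combinations(attrs, size):
            projected = {tuple(row[a] for a in combo) for row in rows}
            if len(projected) == len(rows):
                keys.append(frozenset(combo))
    return keys


def candidate_keys(keys):
    """Keep only the minimal super keys (no proper subset is also a super key)."""
    return [k for k in keys if not any(other < k for other in keys)]


employees = [
    {"Emp_SSN": "123456789", "Emp_Number": 226, "Emp_Name": "Steve"},
    {"Emp_SSN": "999999321", "Emp_Number": 227, "Emp_Name": "Ajeet"},
    {"Emp_SSN": "888997212", "Emp_Number": 228, "Emp_Name": "Chaitanya"},
    {"Emp_SSN": "777778888", "Emp_Number": 229, "Emp_Name": "Robert"},
    # Hypothetical extra row: a second employee named Steve.
    {"Emp_SSN": "555554444", "Emp_Number": 230, "Emp_Name": "Steve"},
]

supers = super_keys(employees, ["Emp_SSN", "Emp_Number", "Emp_Name"])
candidates = candidate_keys(supers)
print(len(supers), [sorted(k) for k in candidates])
```

With the duplicate name present, the sketch finds the same six super keys listed above and reduces them to {Emp_SSN} and {Emp_Number}.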
SUPER KEY VS CANDIDATE
KEY

There can be some confusion between super keys and candidate keys. Here is a clear explanation.
1. First, all candidate keys are super keys. This is because the candidate keys
are chosen out of the super keys.
2. How do we choose candidate keys from the set of super keys? We look for those keys from which we cannot
remove any fields. In the above example, we have not chosen {Emp_SSN, Emp_Name} as a candidate key
because {Emp_SSN} alone can identify a unique row in the table and Emp_Name is redundant.

Primary key:
A Primary key is selected from a set of candidate keys. This is done by database admin or database
designer. We can say that either {Emp_SSN} or {Emp_Number} can be chosen as a primary key for the
table Employee.
FIRST NORMAL FORM (1NF)
For a table to be in the First Normal Form, it should follow these 4 rules:
1. It should only have single (atomic) valued attributes/columns.
2. Values stored in a column should be of the same domain.
3. All the columns in a table should have unique names.
4. The order in which data is stored does not matter.
SECOND NORMAL FORM (2NF)
For a table to be in the Second Normal Form:
1. It should be in the First Normal Form.
2. It should not have any partial dependency.
THIRD NORMAL FORM (3NF)
A table is said to be in the Third Normal Form when:
1. It is in the Second Normal Form.
2. It doesn't have any transitive dependency.
BOYCE AND CODD NORMAL
FORM (BCNF)
Boyce and Codd Normal Form is a stricter version of the Third Normal Form. This
form deals with a certain type of anomaly that is not handled by 3NF. A 3NF table
which does not have multiple overlapping candidate keys is said to be in BCNF. For
a table R to be in BCNF, the following conditions must be satisfied:
• R must be in Third Normal Form,
• and, for each non-trivial functional dependency (X → Y), X should be a super key.
FOURTH NORMAL FORM (4NF)
A table is said to be in the Fourth Normal Form when:
1. It is in the Boyce-Codd Normal Form.
2. It doesn't have any multi-valued dependency.
FIRST NORMAL FORM
Rule 1: Single Valued Attributes
Each column of your table should be single valued, which means it should not contain multiple values.
We will explain this with the help of an example later; let's see the other rules for now.
Rule 2: Attribute Domain should not change
This is more of a "common sense" rule. In each column the values stored must be of the same kind or type.
For example: If you have a column dob to save the dates of birth of a set of people, then you
must not save the names of some of them in that column along with the dates of birth of others. It
should hold only a date of birth for all the records/rows.
Rule 3: Unique name for Attributes/Columns
This rule expects that each column in a table should have a unique name. This is to avoid confusion at the
time of retrieving data or performing any other operation on the stored data.
If two or more columns have the same name, the DBMS cannot tell them apart.
Rule 4: Order doesn't matter
This rule says that the order in which you store the data in your table doesn't matter.
1NF EXAMPLE
roll_no  name  subject
101      Akon  OS, CN
103      Ckon  Java
102      Bkon  C, C++

The table already satisfies 3 of the 4 rules: all column names are unique, the order of the
rows does not matter, and no column mixes different types of data.
But out of the 3 students in the table, 2 have opted for more than 1 subject, and as per the First
Normal Form each column must contain atomic values.
HOW TO SOLVE 1NF
Here is our updated table, which now satisfies the First Normal Form.

roll_no  name  subject
101      Akon  OS
101      Akon  CN
103      Ckon  Java
102      Bkon  C
102      Bkon  C++

By doing so, although a few values are repeated, the values in the subject
column are now atomic for each record/row.
Under the First Normal Form, data redundancy increases, as the same
data appears in multiple rows, but each row as a whole is unique.
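The conversion to 1NF shown above amounts to splitting each multi-valued cell into one row per value. A minimal sketch follows; the function name to_1nf and the assumption that values are separated by commas are illustrative choices.

```python
def to_1nf(rows, column, sep=","):
    """Split a multi-valued column so that each output row holds one atomic value."""
    flat = []
    for row in rows:
        for value in row[column].split(sep):
            new_row = dict(row)              # copy the other columns unchanged
            new_row[column] = value.strip()  # one atomic value per row
            flat.append(new_row)
    return flat


unnormalized = [
    {"roll_no": 101, "name": "Akon", "subject": "OS, CN"},
    {"roll_no": 103, "name": "Ckon", "subject": "Java"},
    {"roll_no": 102, "name": "Bkon", "subject": "C, C++"},
]

students_1nf = to_1nf(unnormalized, "subject")
for row in students_1nf:
    print(row["roll_no"], row["name"], row["subject"])
```

The three original rows become five, and roll_no alone no longer identifies a row; the pair (roll_no, subject) does.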
SECOND NORMAL FORM (2NF)
For a table to be in the Second Normal Form:
1. It should be in the First Normal Form.
2. No non-prime attribute is dependent on a proper subset of any candidate key of the
table.

An attribute that is not part of any candidate key is known as a non-prime attribute.
HOW TO SOLVE 2NF
Example: Suppose a school wants to store the data of teachers and the subjects they
teach. They create a table that looks like the one below. Since a teacher can teach more than one
subject, the table can have multiple rows for the same teacher.

teacher_id  subject    teacher_age
111         Math       38
111         Physics    38
222         Biology    38
333         Physics    40
333         Chemistry  40

Candidate keys: {teacher_id, subject}
Non-prime attribute: teacher_age
The table is in 1NF because each attribute has atomic values. However, it is not in 2NF
because the non-prime attribute teacher_age is dependent on teacher_id alone, which is a proper
subset of the candidate key. This violates the rule for 2NF, as the rule says “no non-prime
attribute is dependent on a proper subset of any candidate key of the table”.
TO MAKE THE TABLE COMPLY WITH 2NF WE
CAN BREAK IT INTO TWO TABLES LIKE THIS:

teacher_details table:

teacher_id  teacher_age
111         38
222         38
333         40

teacher_subject table:

teacher_id  subject
111         Math
111         Physics
222         Biology
333         Physics
333         Chemistry

Now the tables comply with Second Normal Form (2NF).
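The decomposition above can be viewed as two projections of the original table, and joining them back on teacher_id should reproduce the original rows exactly. The sketch below checks this on the sample data; the helper names project and natural_join are hypothetical.

```python
def project(rows, attrs):
    """Keep only `attrs`, dropping duplicate rows (first occurrence wins)."""
    seen, out = set(), []
    for row in rows:
        values = tuple(row[a] for a in attrs)
        if values not in seen:
            seen.add(values)
            out.append(dict(zip(attrs, values)))
    return out


def natural_join(left, right, on):
    """Join two row lists on a single shared column."""
    return [{**l, **r} for l in left for r in right if l[on] == r[on]]


teacher = [
    {"teacher_id": 111, "subject": "Math", "teacher_age": 38},
    {"teacher_id": 111, "subject": "Physics", "teacher_age": 38},
    {"teacher_id": 222, "subject": "Biology", "teacher_age": 38},
    {"teacher_id": 333, "subject": "Physics", "teacher_age": 40},
    {"teacher_id": 333, "subject": "Chemistry", "teacher_age": 40},
]

teacher_details = project(teacher, ["teacher_id", "teacher_age"])
teacher_subject = project(teacher, ["teacher_id", "subject"])
rejoined = natural_join(teacher_subject, teacher_details, "teacher_id")


def as_set(rows):
    """Order-independent view of a list of rows, for comparison."""
    return {tuple(sorted(row.items())) for row in rows}


lossless = as_set(rejoined) == as_set(teacher)
print(len(teacher_details), len(teacher_subject), lossless)
```

Each teacher's age is now stored once, and no information is lost because the join gives back the original table.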
THIRD NORMAL FORM (3NF)
• The table must be in 2NF.
• Any transitive functional dependency of a non-prime attribute on a super key should be
removed.
An attribute that is not part of any candidate key is known as a non-prime attribute.

In other words, 3NF can be explained like this: A table is in 3NF if it is in 2NF and, for each
functional dependency X -> Y, at least one of the following conditions holds:
• X is a super key of the table
• Y is a prime attribute of the table
An attribute that is a part of one of the candidate keys is known as a prime attribute.
EXAMPLE: SUPPOSE A COMPANY WANTS TO STORE THE
COMPLETE ADDRESS OF EACH EMPLOYEE, THEY CREATE A
TABLE NAMED EMPLOYEE_DETAILS THAT LOOKS LIKE THIS:
emp_id  emp_name  emp_zip  emp_state  emp_city  emp_district
1001    John      85250    AZ         Ajo       Tan Dan
1002    Ajeet     92110    CA         Beach     White Light
1006    Lara      92106    CA         Beach     Blue Light
1101    Lillie    38655    MS         Elvis     Red Light
1201    Steve     38677    MS         Elvis     Green Cash

Super keys: {emp_id}, {emp_id, emp_name}, {emp_id, emp_name, emp_zip}… and so on
Candidate keys: {emp_id}
Non-prime attributes: all attributes except emp_id are non-prime, as they are not part of any candidate key.
Here, emp_state, emp_city & emp_district are dependent on emp_zip, and emp_zip is dependent on emp_id. That makes the
non-prime attributes (emp_state, emp_city & emp_district) transitively dependent on the super key (emp_id). This
violates the rule of 3NF.
TO MAKE THIS TABLE COMPLY WITH
3NF WE HAVE TO BREAK THE TABLE INTO
TWO TABLES TO REMOVE THE
TRANSITIVE DEPENDENCY

employee table:

emp_id  emp_name  emp_zip
1001    John      85250
1002    Ajeet     92110
1006    Lara      92106
1101    Lillie    38655
1201    Steve     38677

employee_zip table:

emp_zip  emp_state  emp_city  emp_district
85250    AZ         Ajo       Tan Dan
92110    CA         Beach     White Light
92106    CA         Beach     Blue Light
38655    MS         Elvis     Red Light
38677    MS         Elvis     Green Cash
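The 3NF condition stated earlier (for each X -> Y, either X is a super key or Y is a prime attribute) can be checked mechanically once the functional dependencies and candidate keys are written down. The sketch below does so for the employee example; the function names and the encoding of dependencies as pairs of sets are assumptions for this illustration.

```python
def closure(attrs, fds):
    """All attributes determined by `attrs` under the FDs (pairs of sets)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def violations_3nf(all_attrs, fds, candidate_keys):
    """Return (determinant, attribute) pairs that break the 3NF condition."""
    prime = set().union(*candidate_keys)
    bad = []
    for lhs, rhs in fds:
        is_super_key = closure(lhs, fds) >= set(all_attrs)
        for attr in set(rhs) - set(lhs):  # ignore the trivial part of the FD
            if not is_super_key and attr not in prime:
                bad.append((frozenset(lhs), attr))
    return bad


# Original employee_details table
details_attrs = {"emp_id", "emp_name", "emp_zip", "emp_state", "emp_city", "emp_district"}
details_fds = [
    ({"emp_id"}, {"emp_name", "emp_zip"}),
    ({"emp_zip"}, {"emp_state", "emp_city", "emp_district"}),
]
before = violations_3nf(details_attrs, details_fds, [{"emp_id"}])

# After the decomposition into employee and employee_zip
after_employee = violations_3nf(
    {"emp_id", "emp_name", "emp_zip"},
    [({"emp_id"}, {"emp_name", "emp_zip"})],
    [{"emp_id"}],
)
after_zip = violations_3nf(
    {"emp_zip", "emp_state", "emp_city", "emp_district"},
    [({"emp_zip"}, {"emp_state", "emp_city", "emp_district"})],
    [{"emp_zip"}],
)
print(sorted(attr for _, attr in before), after_employee, after_zip)
```

The original table reports emp_state, emp_city and emp_district as violations, all with emp_zip as the determinant, and both decomposed tables report none.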
BOYCE AND CODD NORMAL
FORM (BCNF)
Boyce and Codd Normal Form is an advanced version of the Third Normal Form, which is why
it is also referred to as 3.5NF. It deals with a certain type of anomaly that is not handled
by 3NF. A 3NF table which does not have multiple overlapping candidate keys is said to be in BCNF.
BCNF is stricter than 3NF: a table complies with BCNF if it is in 3NF and, for every
non-trivial functional dependency X -> Y, X is a super key of the table.
EXAMPLE: SUPPOSE THERE IS A COMPANY WHEREIN
EMPLOYEES WORK IN MORE THAN ONE DEPARTMENT.
THEY STORE THE DATA LIKE THIS:
emp_id  emp_nationality  emp_dept                      dept_type  dept_no_of_emp
1001    Austrian         Production and planning       D001       200
1001    Austrian         Stores                        D001       250
1002    American         Design and technical support  D134       100
1002    American         Purchasing department         D134       600

Functional dependencies in the table above:
emp_id -> emp_nationality
emp_dept -> {dept_type, dept_no_of_emp}

Candidate key: {emp_id, emp_dept}

The table is not in BCNF, because the determinants emp_id and emp_dept are not super keys on their own.
TO MAKE THE TABLE COMPLY WITH BCNF WE CAN BREAK THE TABLE INTO THREE TABLES
LIKE THIS:

emp_nationality table:

emp_id  emp_nationality
1001    Austrian
1002    American

emp_dept table:

emp_dept                      dept_type  dept_no_of_emp
Production and planning       D001       200
Stores                        D001       250
Design and technical support  D134       100
Purchasing department         D134       600

emp_dept_mapping table:

emp_id  emp_dept
1001    Production and planning
1001    Stores
1002    Design and technical support
1002    Purchasing department

Functional dependencies:
emp_id -> emp_nationality
emp_dept -> {dept_type, dept_no_of_emp}

Candidate keys:
For first table: emp_id
For second table: emp_dept
For third table: {emp_id, emp_dept}

This is now in BCNF, as in both functional dependencies the left side is a
key.
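The BCNF condition (every non-trivial determinant is a super key) can be verified for the original table and for each of the three decomposed tables. The following is a sketch under the same assumptions as the dependencies listed above; the function names are hypothetical.

```python
def closure(attrs, fds):
    """All attributes determined by `attrs` under the FDs (pairs of sets)."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def bcnf_violations(all_attrs, fds):
    """Return the non-trivial FDs whose left side is not a super key."""
    return [
        (lhs, rhs)
        for lhs, rhs in fds
        if not set(rhs) <= set(lhs) and not closure(lhs, fds) >= set(all_attrs)
    ]


fd_nationality = ({"emp_id"}, {"emp_nationality"})
fd_dept = ({"emp_dept"}, {"dept_type", "dept_no_of_emp"})

# Original single table: both determinants fail to be super keys.
original = bcnf_violations(
    {"emp_id", "emp_nationality", "emp_dept", "dept_type", "dept_no_of_emp"},
    [fd_nationality, fd_dept],
)

# Decomposed tables, each with only the dependencies that apply to it.
t_nationality = bcnf_violations({"emp_id", "emp_nationality"}, [fd_nationality])
t_dept = bcnf_violations({"emp_dept", "dept_type", "dept_no_of_emp"}, [fd_dept])
t_mapping = bcnf_violations({"emp_id", "emp_dept"}, [])

print(len(original), t_nationality, t_dept, t_mapping)
```

Both dependencies violate BCNF in the single table, and none of the three decomposed tables has a violation.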
QUICK SUMMARY
Conversion to First Normal Form
• In general, when converting a non-1NF table to 1NF, the primary key
typically will include the original primary key concatenated with the key of
the repeating group, that is, the field that distinguishes one occurrence of the
repeating group from another within a given row in the table.
Second Normal Form
• A table (relation) is in second normal form (2NF) if it is in first normal
form and no nonkey field is dependent on only a portion of the primary key.
Third Normal Form
• A table is in third normal form (3NF) if it is in second normal form and if
the only determinants it contains are candidate keys. (Stated this strictly, the
condition is the one usually given for BCNF.)
• Any field or collection of fields that determines another field is called a
determinant.
