Week 2

Normalization is a process in database management systems that organizes data to eliminate redundancy and maintain data integrity by breaking down large tables into smaller, related tables. It helps prevent data anomalies such as insertion, update, and deletion anomalies, and involves multiple steps including achieving First, Second, and Third Normal Forms. Each normal form has specific rules to ensure that data is stored logically and efficiently, enhancing the scalability and adaptability of the database structure.

Normalization
Meaning
Normalization is the process of
organizing data within a relational
database to eliminate redundancy and
the data anomalies it causes.
In simpler terms, it involves
breaking down a large, complex
table into smaller and simpler tables
while maintaining data relationships.
Normalization is commonly used
when dealing with large datasets.
• If a dataset is maintained in the form of
just a single table, it leads to Data
Redundancy, which means a single
value of data is stored multiple times.
• This leads to many issues like an
increase in the database size, slower
data retrieval, and data inconsistency.
• Thus, to overcome this, Normalization in
DBMS is used, in which a large table is
reduced into smaller tables until each of
the single tables contains one relation.
• Normalization in DBMS is a technique using
which you can organize the data in the
database tables so that:
◦ There is less repetition of data,
◦ A large set of data is structured into a bunch of
smaller tables, and the tables have a proper
relationship between them.
• DBMS Normalization is a systematic approach
to decompose (break down) tables to
eliminate data redundancy (repetition) and
undesirable characteristics like the Insertion
anomaly, Update anomaly, and Delete anomaly.
• It is a multi-step process that puts data
into tabular form, removes duplicate data,
and sets up the relationships between tables.
Why do we need Normalization in
DBMS?
Normalization is required for:
• Eliminating redundant (repeated) data and thereby
protecting data integrity, because repeated data
increases the chances of inconsistent data.
• Keeping data consistent by storing each piece of data
in one table and referencing it everywhere else.
• Storage optimization, although that is less of an issue
these days because database storage is cheap.
• Breaking down large tables into smaller tables with
relationships, which makes the database structure
more scalable and adaptable.
• Ensuring data dependencies make sense, i.e. data is
logically stored.
Problems without Normalization
in DBMS
If a table is not properly
normalized and has data
redundancy(repetition) then it
will not only eat up extra
memory space but will also
make it difficult for you to handle
and update the data in the
database, without losing data.
Insertion, Updation, and Deletion
Anomalies are very frequent if
the database is not normalized.
To understand these anomalies let us
take an example of a Student table.
rollno  name  branch  hod    office_tel
401     Akon  CSE     Mr. X  53337
402     Bkon  CSE     Mr. X  53337
403     Ckon  CSE     Mr. X  53337
404     Dkon  CSE     Mr. X  53337

As we can see, data for the fields branch, hod (Head of
Department), and office_tel are repeated for the students
who are in the same branch in the college. This is Data
Redundancy.
1. Insertion Anomaly in DBMS
• For a new admission, until and unless the student opts for a branch,
the student's data cannot be inserted, or else we will have to set the
branch information as NULL.
• Also, if we have to insert data for 100 students of the same branch, then
the branch information will be repeated for all those 100 students.
• These scenarios are nothing but Insertion anomalies.
• If you have to repeat the same data in every row, it's better
to keep the data separately and reference that data in each row.
• So in the above table, we can keep the branch information separately, and
just use the branch_id in the student table, where branch_id can be
used to get the branch information.
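The idea of keeping the branch information separately can be sketched with Python's built-in sqlite3 module. This is only an illustration: the table names Branch and Student, the branch_id value 1 and the in-memory database are assumptions, while the column names and rows come from the Student table above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database (assumption)
conn.executescript("""
CREATE TABLE Branch (
    branch_id  INTEGER PRIMARY KEY,
    branch     TEXT NOT NULL,
    hod        TEXT,
    office_tel TEXT
);
CREATE TABLE Student (
    rollno    INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,
    branch_id INTEGER REFERENCES Branch(branch_id)  -- may stay NULL until a branch is chosen
);
""")

# Branch details are stored exactly once.
conn.execute("INSERT INTO Branch VALUES (1, 'CSE', 'Mr. X', '53337')")

# Each student row carries only the branch_id.
conn.executemany(
    "INSERT INTO Student VALUES (?, ?, ?)",
    [(401, "Akon", 1), (402, "Bkon", 1), (403, "Ckon", 1), (404, "Dkon", 1)],
)

# The branch information is fetched through branch_id when it is needed.
rows = conn.execute("""
    SELECT s.rollno, s.name, b.branch, b.hod, b.office_tel
    FROM Student AS s
    JOIN Branch AS b ON b.branch_id = s.branch_id
    ORDER BY s.rollno
""").fetchall()

for row in rows:
    print(row)
```

The join returns the same four rows as the original single table, but the HOD name and office telephone are now stored in one place only.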
2. Updation Anomaly in DBMS
• What if Mr. X leaves the college, or Mr. X is no longer the HOD of the
computer science department? In that case, all the student records will
have to be updated, and if by mistake we miss any record, it will lead to
data inconsistency.
• This is an Updation anomaly because you need to update all the records in
your table just because one piece of information changed.
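A rough way to see the Updation anomaly is to count how many rows one change touches. The sketch below uses Python's sqlite3 module; the table names StudentFlat and Branch and the replacement HOD 'Mr. Y' are hypothetical and used only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE StudentFlat (
    rollno INTEGER PRIMARY KEY, name TEXT, branch TEXT, hod TEXT, office_tel TEXT
);
CREATE TABLE Branch (
    branch_id INTEGER PRIMARY KEY, branch TEXT, hod TEXT, office_tel TEXT
);
""")
conn.executemany(
    "INSERT INTO StudentFlat VALUES (?, ?, 'CSE', 'Mr. X', '53337')",
    [(401, "Akon"), (402, "Bkon"), (403, "Ckon"), (404, "Dkon")],
)
conn.execute("INSERT INTO Branch VALUES (1, 'CSE', 'Mr. X', '53337')")

# Single table: the new HOD ('Mr. Y', a made-up name) must be written into every CSE row.
flat_rows_changed = conn.execute(
    "UPDATE StudentFlat SET hod = 'Mr. Y' WHERE branch = 'CSE'"
).rowcount

# Separate Branch table: the HOD is stored once, so only one row changes.
branch_rows_changed = conn.execute(
    "UPDATE Branch SET hod = 'Mr. Y' WHERE branch_id = 1"
).rowcount

print(flat_rows_changed, branch_rows_changed)  # 4 1
```

With 100 students in the branch, the first count would be 100 while the second would still be 1, which is why a missed row cannot happen in the separated design.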
3. Deletion Anomaly in DBMS
• In our Student table, two different pieces of information are kept
together, the Student information and the Branch information.
• So if only a single student is enrolled in a branch, and that student leaves
the college, or for some reason the entry for the student is deleted, we
will lose the branch information too.
Types of Normalization
First Normal Form (1NF)
For a table to be in the First Normal
Form, it should follow the following 4
rules:
It should only have single(atomic)
valued attributes/columns.
Values stored in a column should be of
the same domain.
All the columns in a table should have
unique names.
And the order in which data is stored
should not matter.
If we have an Employee table in
which we store the employee
information along with the employee
skillset, the table will look like this:

emp_id  emp_name      emp_mobile  emp_skills
1       John Tick     9999957773  Python, JavaScript
2       Darth Trader  8888853337  HTML, CSS, JavaScript
3       Rony Shark    7777720008  Java, Linux, C++

The table has 4 columns:
• All the columns have different names.
• All the columns hold values of the same type: emp_name has all
the names, emp_mobile has all the contact numbers, etc.
• The order in which we save data doesn't matter.
• But the emp_skills column holds multiple comma-separated
values, while as per the First Normal Form, each column should
have a single value.

So how do you fix the above table? There are two ways to do this:
• Remove the emp_skills column from the Employee table and
keep it in some other table.
• Or add multiple rows for the employee, with each row linked
to a single skill.
Rules for First Normal Form
The first normal form expects you to follow a few simple rules
while designing your database, and they are:
Rule 1: Single Valued Attributes
• Each column of your table should be single valued, which means
it should not contain multiple values.
Rule 2: Attribute Domain should not change
• This is more of a "Common Sense" rule. In each column the
values stored must be of the same kind or type.
• For example: if you have a column dob to save the dates of birth
of a set of people, then you must not save
the 'names' of some of them in that column along with the 'date of
birth' of others. It should hold only the 'date of birth'
for all the records/rows.
Rule 3: Unique name for Attributes/Columns
• This rule expects that each column in a table should have a
unique name. This is to avoid confusion at the time of retrieving
data or performing any other operation on the stored data.
• If two or more columns have the same name, then the DBMS
will be left confused.
Rule 4: Order doesn't matter
• This rule says that the order in which you store the data in your
table doesn't matter.
roll_no name subject
101 Akon OS, CN
103 Ckon Java
102 Bkon C, C++

Our table already satisfies 3 rules out of the 4 rules, as all our
column names are unique, we have stored data in the order we
wanted to and we have not inter-mixed different type of data in
columns.

But out of the 3 different students in our table, 2 have opted for
more than 1 subject. And we have stored the subject names in a
single column. But as per the 1st Normal Form each column must
contain atomic values.

Breaking the subject values into atomic values gives the following table:

roll_no  name  subject
101      Akon  OS
101      Akon  CN
103      Ckon  Java
102      Bkon  C
102      Bkon  C++

By doing so, although a few values are getting repeated, the
values for the subject column are now atomic for each
record/row.
Using the First Normal Form, data redundancy increases, as
there will be many columns with the same data in multiple rows, but
each row as a whole will be unique.
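The step from comma-separated subjects to atomic values can be sketched in a few lines of Python. This is a minimal illustration, assuming the rows are held as plain tuples; the variable names are made up for the example.

```python
# Rows as first stored: the subject column holds comma-separated values.
students = [
    (101, "Akon", "OS, CN"),
    (103, "Ckon", "Java"),
    (102, "Bkon", "C, C++"),
]

# Emit one row per (student, subject) pair so that every value is atomic.
atomic_rows = [
    (roll_no, name, subject.strip())
    for roll_no, name, subjects in students
    for subject in subjects.split(",")
]

for row in atomic_rows:
    print(row)
```

The roll_no and name values repeat for students with more than one subject, which is the redundancy that the later normal forms deal with.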
Second Normal Form (2NF)
For a table to be in the Second Normal Form:
• It should be in the First Normal Form.
• And, it should not have Partial Dependency.

What is Partial Dependency?

• When a table has a primary key that is made up
of two or more columns, then all the columns (not
included in the primary key) in that table should
depend on the entire primary key and not on a
part of it. If any column (which is not in the
primary key) depends on a part of the primary
key then we say we have Partial Dependency in
the table.
What is Dependency?
• Let's take an example of a Student table with
columns student_id, name, reg_no (registration
number), branch and address (student's home address).

student_id  name  reg_no  branch  address
10          Akon  07-WY   CSE     Kerala
11          Akon  08-WY   IT      Gujarat
In this table, student_id is the primary key
and will be unique for every row, hence we
can use student_id to fetch any row of data
from this table
Even for a case, where student names are
same, if we know the student_id we can
easily fetch the correct record.
Hence we can say a Primary Key for a table is the column
or a group of columns(composite key) which can uniquely
identify each record in the table.
I can ask for the branch name of the student
with student_id 10, and I can get it. Similarly, if I ask
for the name of the student with student_id 10 or 11, I will get
it.
So all I need is student_id, and every other
column depends on it, or can be fetched using it.

This is Dependency, and we also call it Functional
Dependency.
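A functional dependency can be tested against sample rows with a small helper. The function name determines and the dict-based row layout are assumptions for this sketch; note that a check on sample data can disprove a dependency but cannot prove it for all future rows.

```python
def determines(rows, lhs, rhs):
    """Return True if, in these rows, equal lhs values always come with equal rhs values."""
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        value = tuple(row[c] for c in rhs)
        # setdefault keeps the first value seen for this key
        if seen.setdefault(key, value) != value:
            return False
    return True


students = [
    {"student_id": 10, "name": "Akon", "reg_no": "07-WY", "branch": "CSE", "address": "Kerala"},
    {"student_id": 11, "name": "Akon", "reg_no": "08-WY", "branch": "IT", "address": "Gujarat"},
]

# student_id identifies every other column.
print(determines(students, ("student_id",), ("name", "reg_no", "branch", "address")))  # True

# name does not: both students are called Akon but are in different branches.
print(determines(students, ("name",), ("branch",)))  # False
```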
1. Create Separate tables for Employee and
Employee Skills
So the Employee table will look like this:

emp_id  emp_name      emp_mobile
1       John Tick     9999957773
2       Darth Trader  8888853337
3       Rony Shark    7777720008

And the new Employee_Skill table:

emp_id  emp_skill
1       Python
1       JavaScript
2       HTML
2       CSS
2       JavaScript
3       Java
3       Linux
3       C++
2. Add Multiple rows for Multiple skills
• You can also simply add multiple rows to add
multiple skills. This will lead to repetition of the
data, but that can be handled as you further
Normalize your data using the Second Normal
Form and the Third Normal Form.

emp_id  emp_name      emp_mobile  emp_skill
1       John Tick     9999957773  Python
1       John Tick     9999957773  JavaScript
2       Darth Trader  8888853337  HTML
2       Darth Trader  8888853337  CSS
2       Darth Trader  8888853337  JavaScript
3       Rony Shark    7777720008  Java
3       Rony Shark    7777720008  Linux
3       Rony Shark    7777720008  C++


Suppose we have two tables, Student and
Subjects, to store student
information and information related
to subjects.

Student table:

student_id  student_name  branch
1           Akon          CSE
2           Bkon          Mechanical

Subjects table:

subject_id  subject_name
1           C Language
2           DSA
3           Operating System

And we have another table Score to store the
marks scored by students in any subject, like this:

student_id  subject_id  marks  teacher_name
1           1           70     Miss. C
1           2           82     Mr. D
2           1           65     Miss. C
• Now in the above table, the primary key is student_id +
subject_id, because both pieces of information are required
to select any row of data.
• But in the Score table, we have a
column teacher_name, which depends only on the subject
information, i.e. just the subject_id, so we should not
keep that information in the Score table.
• The column teacher_name should be in
the Subjects table. And then the entire system will be
normalized as per the Second Normal Form.

Updated Subjects table:

subject_id  subject_name      teacher_name
1           C Language        Miss. C
2           DSA               Mr. D
3           Operating System  Mr. Op

Updated Score table:

student_id  subject_id  marks
1           1           70
1           2           82
2           1           65
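The partial dependency in the Score table can be found mechanically by testing each proper part of the composite key against each non-key column. The sketch below is an illustration only: the helper names determines and partial_dependencies are hypothetical, and the teacher for each subject_id is taken from the updated Subjects table, so subject 1 is always paired with Miss. C.

```python
from itertools import combinations


def determines(rows, lhs, rhs):
    """True if, in these rows, equal lhs values always come with equal rhs values."""
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        value = tuple(row[c] for c in rhs)
        if seen.setdefault(key, value) != value:
            return False
    return True


def partial_dependencies(rows, key, non_key):
    """Return (part_of_key, column) pairs where a proper subset of key determines column."""
    found = []
    for size in range(1, len(key)):
        for part in combinations(key, size):
            for column in non_key:
                if determines(rows, part, (column,)):
                    found.append((part, column))
    return found


score = [
    {"student_id": 1, "subject_id": 1, "marks": 70, "teacher_name": "Miss. C"},
    {"student_id": 1, "subject_id": 2, "marks": 82, "teacher_name": "Mr. D"},
    {"student_id": 2, "subject_id": 1, "marks": 65, "teacher_name": "Miss. C"},
]

violations = partial_dependencies(
    score, ("student_id", "subject_id"), ("marks", "teacher_name")
)
print(violations)

# Move teacher_name next to the part of the key it depends on.
teacher_by_subject = {row["subject_id"]: row["teacher_name"] for row in score}
score_2nf = [
    {"student_id": r["student_id"], "subject_id": r["subject_id"], "marks": r["marks"]}
    for r in score
]
print(teacher_by_subject)
```

Only teacher_name is reported, because marks needs both student_id and subject_id to be identified.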
Third Normal Form (3NF)
A table is said to be in the Third Normal Form
when:
• It satisfies the First Normal Form and the
Second Normal Form.
• And, it doesn't have Transitive Dependency.

What is Transitive Dependency?

In a table we have some column that acts as
the primary key, and other columns depend
on this column. But what if a column that is
not the primary key depends on another
column that is also not a primary key or part
of it? Then we have Transitive Dependency in
our table.
Let's take an example. We had
the Score table in the Second
Normal Form above. Suppose we have to
store some extra information in it,
like:
◦ exam_type
◦ total_marks
These store the type of exam and the
total marks in the exam so that we
can later calculate the percentage
of marks scored by each student.
The Score table will look like
this:

student_id  subject_id  marks  exam_type  total_marks
1           1           70     Theory     100
1           2           82     Theory     100
2           1           42     Practical  50
In the table above, the column exam_type depends on
both student_id and subject_id, because,
• a student can be in the CSE branch or the Mechanical
branch,
• and based on that they may have different exam types
for different subjects.
• The CSE students may have both Practical and Theory for
Compiler Design,
• whereas Mechanical branch students may only have
Theory exams for Compiler Design.
But the column total_marks just depends on
the exam_type column. And the exam_type column is not a
part of the primary key, because the primary key is student_id
+ subject_id. This is a Transitive Dependency.
How to remove Transitive Dependency?
You can create a separate table
for ExamType and use it in
the Score table.

New ExamType table:

exam_type_id  exam_type  total_marks  duration
1             Practical  50           45
2             Theory     100          180
3             Workshop   150          300
We have created a new table ExamType and
we have added more related information in it
like duration (duration of exam in mins.), and
now we can use the exam_type_id in
the Score table.
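As a sketch of how the Score table can use exam_type_id, the example below builds both tables with Python's sqlite3 module and calculates the percentage with a join. The column types, the in-memory database and the mapping of each Score row to an exam_type_id (Theory = 2, Practical = 1) are assumptions derived from the tables above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ExamType (
    exam_type_id INTEGER PRIMARY KEY,
    exam_type    TEXT NOT NULL,
    total_marks  INTEGER NOT NULL,
    duration     INTEGER NOT NULL  -- duration of exam in minutes
);
CREATE TABLE Score (
    student_id   INTEGER,
    subject_id   INTEGER,
    marks        INTEGER,
    exam_type_id INTEGER REFERENCES ExamType(exam_type_id),
    PRIMARY KEY (student_id, subject_id)
);
INSERT INTO ExamType VALUES
    (1, 'Practical', 50, 45),
    (2, 'Theory', 100, 180),
    (3, 'Workshop', 150, 300);
INSERT INTO Score VALUES
    (1, 1, 70, 2),
    (1, 2, 82, 2),
    (2, 1, 42, 1);
""")

# total_marks is looked up through exam_type_id instead of being stored per score.
percentages = conn.execute("""
    SELECT s.student_id, s.subject_id, e.exam_type,
           100.0 * s.marks / e.total_marks AS percentage
    FROM Score AS s
    JOIN ExamType AS e ON e.exam_type_id = s.exam_type_id
    ORDER BY s.student_id, s.subject_id
""").fetchall()

for row in percentages:
    print(row)
```

If the total marks for Practical exams ever change, only one row in ExamType needs to be updated.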
Boyce-Codd Normal Form
(BCNF)
Boyce-Codd Normal Form is a
stricter version of the Third Normal Form.
This form deals with a certain type of
anomaly that is not handled by 3NF.
A 3NF table that does not
have multiple overlapping candidate
keys is guaranteed to be in BCNF.
For a table R to be in BCNF, the following
conditions must be satisfied:
◦ R must be in the 3rd Normal Form
◦ and, for each non-trivial functional dependency ( X →
Y ), X should be a Super Key.
The second point sounds a bit tricky, right?
In simple words, it means that for a
dependency A → B, A cannot be a non-prime
attribute if B is a prime attribute.
Below we have a college enrolment
table with columns student_id, subject and professor.

student_id  subject  professor
101         Java     P.Java
101         C++      P.Cpp
102         Java     P.Java2
103         C#       P.Chash
104         Java     P.Java

In the table above:
• One student can enrol for multiple subjects. For example, the student
with student_id 101 has opted for subjects Java & C++.
• For each subject, a professor is assigned to the student.
• And, there can be multiple professors teaching one subject, like
we have for Java.
• Well, in the table above, student_id and subject together form
the primary key, because using student_id and subject, we
can find all the columns of the table.
• One more important point to note here is that one professor
teaches only one subject, but one subject may have two
different professors.
• Hence, there is a dependency
between subject and professor here,
where subject depends on the professor name.
• This table satisfies the 1st Normal Form because all the
values are atomic, column names are unique and all the
values stored in a particular column are of the same domain.
• This table also satisfies the 2nd Normal Form as there is
no Partial Dependency.
• And, there is no Transitive Dependency, hence the table
also satisfies the 3rd Normal Form.
Why is this table not in BCNF?
In the table above, student_id and
subject form the primary key, which
means the subject column is a prime
attribute.
But, there is one more
dependency, professor → subject.
And while subject is a prime
attribute, professor is a non-prime
attribute and not a super key, which is not allowed by
BCNF.
To make this relation (table) satisfy
BCNF, we will decompose this table
into two tables, student table
and professor table.

Student Table:

student_id  p_id
101         1
101         2
and so on...

And, Professor Table:

p_id  professor  subject
1     P.Java     Java
2     P.Cpp      C++
and so on...
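The BCNF violation and the decomposition can be checked on the sample rows with a short Python sketch. The helper determines is hypothetical, and the p_id values beyond 1 and 2 are simply assigned in order of first appearance, which is an assumption since the tables above do not list them.

```python
def determines(rows, lhs, rhs):
    """True if, in these rows, equal lhs values always come with equal rhs values."""
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        value = tuple(row[c] for c in rhs)
        if seen.setdefault(key, value) != value:
            return False
    return True


COLUMNS = ("student_id", "subject", "professor")
enrolment = [
    {"student_id": 101, "subject": "Java", "professor": "P.Java"},
    {"student_id": 101, "subject": "C++", "professor": "P.Cpp"},
    {"student_id": 102, "subject": "Java", "professor": "P.Java2"},
    {"student_id": 103, "subject": "C#", "professor": "P.Chash"},
    {"student_id": 104, "subject": "Java", "professor": "P.Java"},
]

# professor -> subject holds, but professor does not identify a whole row.
fd_holds = determines(enrolment, ("professor",), ("subject",))
professor_is_superkey = determines(enrolment, ("professor",), COLUMNS)
violates_bcnf = fd_holds and not professor_is_superkey
print(violates_bcnf)  # True

# Decompose: give each professor a p_id (assumed numbering, in order of appearance).
professor_ids = {}
for row in enrolment:
    professor_ids.setdefault(row["professor"], len(professor_ids) + 1)

professor_table = {
    professor_ids[row["professor"]]: (row["professor"], row["subject"])
    for row in enrolment
}
student_table = [(row["student_id"], professor_ids[row["professor"]]) for row in enrolment]

# Joining the two tables on p_id gives back the original rows.
rejoined = [
    {
        "student_id": student_id,
        "subject": professor_table[p_id][1],
        "professor": professor_table[p_id][0],
    }
    for student_id, p_id in student_table
]
print(rejoined == enrolment)  # True
```

In the Professor table, p_id is the key and both professor and subject depend on it, so every dependency now has a super key on its left side.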
Fourth Normal Form (4NF)
A table is said to be in the Fourth Normal Form
when:
• It is in the Boyce-Codd Normal Form.
• And, it doesn't have Multi-Valued Dependency.

Fourth Normal Form comes into the picture
when a Multi-valued Dependency occurs in any
relation. In this tutorial we will learn about
Multi-valued Dependency, how to remove it and
how to make any table satisfy the fourth
normal form.
What is Multi-valued Dependency?
A table is said to have a multi-valued
dependency if the following conditions are
true:
• For a dependency A → B, if for a single value
of A multiple values of B exist, then the
table may have a multi-valued dependency.
• Also, a table should have at least 3 columns
for it to have a multi-valued dependency.
• And, for a relation R(A,B,C), if there is a
multi-valued dependency between A and B,
then B and C should be independent of each
other.
If all these conditions are true for any
relation (table), it is said to have a multi-valued
dependency.
Below we have a college enrolment table with columns s_id, course and hobby.
s_id course hobby
1 Science Cricket
1 Maths Hockey
2 C# Cricket
2 Php Hockey
The student with s_id 1 has opted for two
courses, Science and Maths, and has two
hobbies, Cricket and Hockey.
You must be thinking what problem this can lead to, right?
Well, the two records for the student with s_id 1 will give rise to
two more records, as shown below, because for one student
two hobbies exist, hence along with both the courses, these
hobbies should be specified.

s_id  course   hobby
1     Science  Cricket
1     Maths    Hockey
1     Science  Hockey
1     Maths    Cricket
How to satisfy 4th Normal Form?
To make the above relation satisfy the 4th normal form, we
can decompose the table into 2 tables.

CourseOpted Table:

s_id  course
1     Science
1     Maths
2     C#
2     Php

And, Hobbies Table:

s_id  hobby
1     Cricket
1     Hockey
2     Cricket
2     Hockey

Now this relation satisfies the fourth normal form.

A table can also have functional dependency along with multi-valued
dependency. In that case, the functionally dependent columns are
moved to a separate table and the multi-valued dependent
columns are moved to separate tables.
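The decomposition into CourseOpted and Hobbies tables can be sketched in Python as two projections of the enrolment rows. The variable names are made up for this example; joining the two projections on s_id shows every course and hobby combination that the single table would have had to store.

```python
# Rows of the original enrolment table: (s_id, course, hobby).
enrolment = [
    (1, "Science", "Cricket"),
    (1, "Maths", "Hockey"),
    (2, "C#", "Cricket"),
    (2, "Php", "Hockey"),
]

# Project onto the two independent facts about a student.
course_opted = sorted({(s_id, course) for s_id, course, _ in enrolment})
hobbies = sorted({(s_id, hobby) for s_id, _, hobby in enrolment})

# Joining on s_id lists every course/hobby pairing per student.
combinations_implied = sorted(
    (s_id, course, hobby)
    for s_id, course in course_opted
    for h_id, hobby in hobbies
    if s_id == h_id
)

print(len(course_opted), len(hobbies), len(combinations_implied))
for row in combinations_implied:
    print(row)
```

The two small tables hold 4 rows each, while a single table listing every pairing needs 8 rows for the same two students, and the gap grows with each extra course or hobby.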
