DBM - Unit - 3
Normalization:
A large database defined as a single relation may result in data duplication. This repetition of
data may result in:
● It isn't easy to maintain and update data as it would involve searching many records in
relation.
So to handle these problems, we should analyze and decompose the relations with redundant data
into smaller, simpler, and well-structured relations that satisfy desirable properties.
Normalization is a process of decomposing the relations into relations with fewer attributes.
Edgar Codd, the inventor of the relational model, proposed the theory of normalization of
data with the introduction of the First Normal Form, and he continued to extend the theory with
the Second and Third Normal Forms. Later he joined Raymond F. Boyce to develop the theory of
Boyce-Codd Normal Form.
Purpose of Normalization:
1. To eliminate redundant (repetitive) data
2. To ensure data is stored logically.
Functional Dependency
Functional Dependency (FD) is a constraint that describes how one attribute determines
another attribute in a Database Management System (DBMS). Functional Dependency helps to
maintain the quality of data in the database. It plays a vital role in telling the difference between
good and bad database design.
Example:
Employee number Employee Name Salary City
In this example, if we know the value of Employee number, we can obtain the Employee Name,
city, salary, etc. By this, we can say that the city, Employee Name, and salary are functionally
dependent on Employee number.
Key terms
Here are some key terms for Functional Dependency in Database:
Axiom: Axioms are a set of inference rules used to infer all the functional dependencies
on a relational database.
Decomposition: It is a rule that suggests that if you have a table that appears to contain two entities
which are determined by the same primary key, then you should consider
breaking them up into two different tables.
Union: It suggests that if two tables are separate, and the primary key (PK) is the same, you should
consider putting them together.
Below are the three most important rules for Functional Dependency in Database:
● Reflexive rule: If X is a set of attributes and Y is a subset of X, then X -> Y holds.
● Augmentation rule: When x -> y holds, and c is an attribute set, then xc -> yc also holds.
That is, adding the same attributes to both sides does not change the basic dependency.
● Transitivity rule: This rule is very much like the transitive rule in algebra: if x -> y
holds and y -> z holds, then x -> z also holds. x -> y is read as "x functionally
determines y".
There are mainly four types of Functional Dependency in DBMS. Following are the types of
Functional Dependencies in DBMS:
● Multivalued Dependency
● Trivial Functional Dependency
● Non-Trivial Functional Dependency
● Transitive Dependency
In this example, maf_year and color are independent of each other but dependent on car_model.
In this example, these two columns are said to be multivalue dependent on car_model.
For example:
Emp_id Emp_name
AS555 Harry
AS811 George
AS999 Kevin
Example:
{Company} -> {CEO} (if we know the Company, we know the CEO's name)
But CEO is not a subset of Company, and hence it’s non-trivial functional dependency.
Example:
Company CEO Age
Alibaba Jack Ma 54
{Company} -> {CEO} (if we know the company, we know its CEO's name)
{CEO} -> {Age} (if we know the CEO, we know the Age)
{Company} -> {Age} should therefore hold, which makes sense because if we know the company name, we
can know its CEO's age.
Note: You need to remember that transitive dependency can only occur in a relation of three or
more attributes.
Normalization Definition :
Normalization is a method of organizing the data in the database which helps you to avoid data
redundancy and insertion, update & deletion anomalies. It is a process of analyzing the relation
schemas based on their functional dependencies and primary keys.
Normalization is inherent to relational database theory. It splits relations apart to stop the
same data being duplicated within the database, which may result in the creation of additional tables.
● Functional Dependency helps avoid data redundancy, so the same data does not repeat at
multiple locations in the database
● It helps you to maintain the quality of data in the database
● It helps you to define the meanings and constraints of databases
● It helps you to identify bad designs
● It helps you to find the facts regarding the database design
Armstrong axioms
Inference Rules
● Using the inference rules, we can derive additional functional dependencies from the initial
set.
An inference rule is an assertion that can be applied to a set of functional dependencies to
derive other FDs (functional dependencies). William W. Armstrong developed these axioms for
database management systems in 1974.
Following are the most essential inference rules for functional dependency:
Reflexive Rule
In the reflexive rule, if X is a set of attributes and Y is a subset of X, then X functionally
determines Y, i.e., if X ⊇ Y then X -> Y.
E.g.:
X={a,e,i,o,u}
Y={a,e,o}
Augmentation Rule
In the augmentation rule, if X determines Y and Z is any attribute set, then XZ determines YZ,
i.e., if X -> Y then XZ -> YZ.
E.g.: if A -> B holds, then AC -> BC also holds.
Union Rule
This rule is also known as the additive rule. In the union rule, if X determines Y and X
determines Z, then X also determines both Y and Z, i.e., if X -> Y and X -> Z then
X -> YZ.
Decomposition Rule
This rule is the reverse of the Union rule. In the decomposition rule, if X determines Y and Z
together, then X determines Y and Z separately, i.e., if X -> YZ then X -> Y and X -> Z.
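The inference rules can be checked mechanically with an attribute closure. The following is a minimal sketch, not a definitive implementation: the function names closure and implies, and the attributes W, X, Y, Z, are assumptions made only for illustration. An FD lhs -> rhs follows from a set of FDs exactly when rhs is contained in the closure of lhs.

```python
def closure(attrs, fds):
    """Return the closure of `attrs` under `fds`.

    `fds` is a list of (lhs, rhs) pairs; each side is a set of attribute names.
    """
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # Once every attribute on the left side is known,
            # the right side can be added to the closure.
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def implies(fds, lhs, rhs):
    """True if lhs -> rhs can be derived from `fds`."""
    return set(rhs) <= closure(lhs, fds)


# Hypothetical starting set: X -> Y and X -> Z
fds = [({"X"}, {"Y"}), ({"X"}, {"Z"})]

print(implies(fds, {"X"}, {"Y", "Z"}))       # union rule: X -> YZ
print(implies(fds, {"X", "W"}, {"Y", "W"}))  # augmentation rule: XW -> YW
print(implies(fds, {"X", "Y"}, {"Y"}))       # reflexive rule: XY -> Y
print(implies(fds, {"Y"}, {"X"}))            # Y -> X is not derivable
```

The first three checks succeed and the last one fails, because nothing in the starting set lets Y determine X.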
Minimal Cover:
A canonical cover is also called a minimal cover: a minimum set of FDs that is equivalent to the
given set. Formally, a set of FDs FC is called a canonical cover of F if each FD in FC is a −
● Simple FD.
● Left reduced FD.
● Non-redundant FD.
Important definitions:
Extraneous attributes: An attribute of a functional dependency is said to be extraneous if we
can remove it without changing the closure of the set of functional dependencies.
Canonical cover: A canonical cover of a set of functional dependencies F is a set of
dependencies equivalent to F such that all the following properties are satisfied:
● No functional dependency in it contains an extraneous attribute.
● Each left side of a functional dependency in it is unique. That is, there are no two
dependencies in it with the same left side.
Example
Consider an example to find canonical cover of F.
The given functional dependencies are as follows −
A -> BC
B -> C
A -> B
AB -> C
➢ Minimal cover: The minimal cover is the smallest set of FDs which is equivalent to the
given FDs.
➢ Canonical cover: In canonical cover, the LHS (Left Hand Side) must be unique.
First of all, we will find the minimal cover and then the canonical cover.
First step − Convert RHS attribute into singleton attribute.
A -> B
A -> C
B -> C
A -> B
AB -> C
Second step − Remove the extra LHS attribute
Find the closure of A.
A+ = {A, B, C}
Since C is already in A+, B is extraneous in AB -> C, so AB -> C can be converted into A -> C
A -> B
A -> C
B -> C
A -> B
A -> C
Third step − Remove the duplicate and redundant FDs. A -> C is redundant because it follows from A -> B and B -> C by transitivity.
A -> B
B -> C
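Now, we will convert the above set of FDs into canonical cover.
Every LHS is already unique, so there is nothing to combine. (A -> BC would not qualify, because C is extraneous on its right side.)
The canonical cover for the above set of FDs will be as follows −
A -> B
B -> C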
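The three steps of the example can be sketched in code. This is a rough illustration under stated assumptions, not a reference algorithm: the names closure and minimal_cover are hypothetical, and the order in which extraneous attributes are tried may differ from the hand-worked steps, although the final result for this example is the same.

```python
def closure(attrs, fds):
    """Closure of `attrs` under `fds`, a list of (lhs_set, rhs_set) pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def as_sets(items):
    """Turn (frozenset, attribute) pairs back into (set, set) pairs."""
    return [(set(lhs), {rhs}) for lhs, rhs in items]


def minimal_cover(fds):
    # First step: make every right side a single attribute, dropping duplicates.
    work = []
    for lhs, rhs in fds:
        for attr in sorted(rhs):
            fd = (frozenset(lhs), attr)
            if fd not in work:
                work.append(fd)

    # Second step: remove extraneous attributes from each left side.
    reduced = []
    for lhs, rhs in work:
        lhs = set(lhs)
        for attr in sorted(lhs):
            smaller = lhs - {attr}
            if smaller and rhs in closure(smaller, as_sets(work)):
                lhs = smaller
        fd = (frozenset(lhs), rhs)
        if fd not in reduced:
            reduced.append(fd)

    # Third step: remove every FD that follows from the remaining ones.
    result = list(reduced)
    for fd in reduced:
        rest = [other for other in result if other != fd]
        if fd[1] in closure(fd[0], as_sets(rest)):
            result = rest
    return result


# The four dependencies from the example above.
fds = [({"A"}, {"B", "C"}), ({"B"}, {"C"}), ({"A"}, {"B"}), ({"A", "B"}, {"C"})]

for lhs, rhs in minimal_cover(fds):
    print("".join(sorted(lhs)), "->", rhs)
```

For the example input this prints A -> B and B -> C, matching the result of the third step.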
Decomposition
Properties of Decomposition
1. Lossless Decomposition
2. Dependency Preservation
3. Lack of Data Redundancy
1. Lossless Decomposition
● Decomposition must be lossless. It means that no information should get lost from
the relation that is decomposed.
● It gives a guarantee that joining the decomposed relations will result in the same relation that was decomposed.
Example:
Let 'E' be a relational schema with instance 'e' that is decomposed into E1, E2, E3, . . . .
En with instances e1, e2, e3, . . . . en. If e1 ⋈ e2 ⋈ e3 . . . . ⋈ en = e, then it is called a 'Lossless
Join Decomposition'.
In other words, if the natural join of all the decomposed relations gives the original
relation, then it is said to be a lossless join decomposition.
Example: <Employee_Department> Table
Eid Ename Age City Salary Deptid DeptName
● Decompose the above relation into two relations to check whether a decomposition is
lossless or lossy.
● Now, we have decomposed the relation into Employee and Department.
Relation 1 : <Employee> Table
If the <Employee> table contains (Eid, Ename, Age, City, Salary) and the <Department>
table contains (Deptid and DeptName), then it is not possible to join the two tables or
relations back into the original, because there is no common column between them. It becomes a Lossy
Join Decomposition.
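The lossless test can be tried on a small instance: project the relation onto each part, natural-join the parts, and compare with the original. The sketch below is only an illustration; the helper names (project, natural_join, same_rows) and the two sample rows are made up, and Age, City and Salary are left out to keep it short.

```python
def project(rows, columns):
    """Project a list of dict rows onto `columns`, removing duplicate rows."""
    seen = []
    for row in rows:
        small = {c: row[c] for c in columns}
        if small not in seen:
            seen.append(small)
    return seen


def natural_join(left, right):
    """Natural join of two lists of dict rows on their shared column names."""
    joined = []
    for l in left:
        for r in right:
            shared = set(l) & set(r)
            # With no shared columns this condition is always true,
            # so the join degenerates into a cross product.
            if all(l[c] == r[c] for c in shared):
                merged = {**l, **r}
                if merged not in joined:
                    joined.append(merged)
    return joined


def same_rows(a, b):
    """Compare two relations as sets of rows, ignoring row order."""
    return {frozenset(r.items()) for r in a} == {frozenset(r.items()) for r in b}


# Hypothetical sample instance of Employee_Department.
emp_dept = [
    {"Eid": 1, "Ename": "Asha", "Deptid": 10, "DeptName": "Sales"},
    {"Eid": 2, "Ename": "Ravi", "Deptid": 20, "DeptName": "HR"},
]

# No common column between the two parts: lossy.
lossy = natural_join(project(emp_dept, ["Eid", "Ename"]),
                     project(emp_dept, ["Deptid", "DeptName"]))

# Deptid kept in both parts: lossless.
lossless = natural_join(project(emp_dept, ["Eid", "Ename", "Deptid"]),
                        project(emp_dept, ["Deptid", "DeptName"]))

print(len(lossy), same_rows(lossy, emp_dept))
print(len(lossless), same_rows(lossless, emp_dept))
```

Without a common column the join pairs every employee with every department, which adds rows that were never in the original relation. Keeping Deptid in the Employee part is one way to make the join give back exactly the original rows.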
2. Dependency Preservation
● Dependency is an important constraint on the database.
● Every dependency must be satisfied by at least one decomposed table.
● If {A → B} holds, then the two sets are functionally dependent, and checking the dependency
is easiest if both sets are in the same relation.
● This decomposition property can only be achieved by maintaining the functional
dependencies.
● This property allows updates to be checked without computing the natural join of the
database structure.
Normalization
If a table is not properly normalized and has data redundancy, then it will not only eat up extra
memory space but will also make it difficult to handle and update the database without facing
data loss. Insertion, Update and Deletion Anomalies are very frequent if a database is not
normalized. To understand these anomalies let us take an example of a Student table.
In the table above, we have data of 4 Computer Sci. students. As we can see, data for the fields
branch, hod (Head of Department) and office_tel is repeated for the students who are in the same
branch in the college. This is Data Redundancy.
Normalization Rule
For a table to be in the First Normal Form, it should follow a few simple rules while you design your database:
Each column of your table should be single valued, which means it should not contain multiple
values. We will explain this with the help of an example later; let's see the other rules for now.
This is more of a "Common Sense" rule. In each column the values stored must be of the same
kind or type.
For example: If you have a column dob to save the dates of birth of a set of people, then you
must not save the 'names' of some of them in that column along with the 'date of birth' of others.
It should hold only 'date of birth' for all the records/rows.
This rule expects that each column in a table should have a unique name. This is to avoid
confusion at the time of retrieving data or performing any other operation on the stored data.
If two or more columns have the same name, then the DBMS will be left confused.
EMPLOYEE table:
14 John 7272826385, 9064738238 UP
The decomposition of the EMPLOYEE table into 1NF has been shown below:
14 John 7272826385 UP
14 John 9064738238 UP
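The same split into single-valued rows can be sketched in code. This is a minimal illustration: the function name to_first_normal_form and the column names emp_id, emp_name, emp_phone and emp_state are assumptions, since the table header is not shown here.

```python
def to_first_normal_form(rows, column, separator=","):
    """Split a multi-valued `column` so that every row holds exactly one value."""
    flat = []
    for row in rows:
        for value in str(row[column]).split(separator):
            # Copy the row and overwrite the multi-valued column
            # with one of its individual values.
            flat.append({**row, column: value.strip()})
    return flat


# Hypothetical column names for the EMPLOYEE row shown above.
employee = [
    {"emp_id": 14, "emp_name": "John",
     "emp_phone": "7272826385, 9064738238", "emp_state": "UP"},
]

for row in to_first_normal_form(employee, "emp_phone"):
    print(row)
```

Each phone number ends up in its own row, with the other column values repeated, as in the decomposed table.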
Dependency
Let's take an example of a Student table with columns student_id, name, reg_no(registration
number), branch and address(student's home address).
In this table, student_id is the primary key and will be unique for every row, hence we can use
student_id to fetch any row of data from this table
Even for a case, where student names are the same, if we know the student_id we can easily
fetch the correct record.
Hence we can say a Primary Key for a table is the column or a group of columns(composite key)
which can uniquely identify each record in the table.
I can ask for the branch name of the student with student_id 10, and I can get it. Similarly, if I
ask for the name of a student with student_id 10 or 11, I will get it. So all I need is student_id
and every other column depends on it, or can be fetched using it.
Partial Dependency
Now that we know what dependency is, we are in a better state to understand what partial
dependency is.
For a simple table like Student, a single column like student_id can uniquely identify all the
records in a table.
But this is not true all the time. So now let's extend our example to see if more than 1 column
together can act as a primary key.
Let's create another table for Subject, which will have subject_id and subject_name fields and
subject_id will be the primary key.
subject_id subject_name
1 Java
2 C++
3 Php
Now we have a Student table with student information and another table Subject for storing
subject information.
Let's create another table Score, to store the marks obtained by students in the respective
subjects. We will also be saving name of the teacher who teaches that subject along with marks.
1 10 1 70 Java Teacher
2 10 2 75 C++ Teacher
3 11 1 80 Java Teacher
In the Score table we are saving the student_id to know which student's marks these are and
subject_id to know which subject the marks are for.
Together, student_id + subject_id forms a Candidate Key for this
table, which can be the Primary key.
See, if I ask you to get me marks of student with student_id 10, can you get it from this table?
No, because you don't know for which subject. And if I give you subject_id, you would not
know for which student. Hence we need student_id + subject_id to uniquely identify any row.
Now if you look at the Score table, we have a column named teacher which is only dependent on
the subject: for Java it's Java Teacher, for C++ it's C++ Teacher, and so on.
As we just discussed, the primary key for this table is a composition of two columns,
student_id & subject_id, but the teacher's name only depends on the subject, hence on
subject_id, and has nothing to do with student_id.
This is Partial Dependency, where an attribute in a table depends on only a part of the primary
key and not on the whole key.
There can be many different solutions for this, but our objective is to remove the teacher's name
from the Score table.
The simplest solution is to remove the teacher column from the Score table and add it to the Subject
table. Hence, the Subject table will become:
And our Score table is now in the second normal form, with no partial dependency.
1 10 1 70
2 10 2 75
3 11 1 80
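A partial dependency can be found from the functional dependencies alone: take every proper subset of the composite key and see which non-key attributes it determines. The sketch below is only an illustration. The function names are hypothetical, the FDs are written down from the description of the Score table, and it assumes student_id + subject_id is the only candidate key.

```python
from itertools import combinations


def closure(attrs, fds):
    """Closure of `attrs` under `fds`, a list of (lhs_set, rhs_set) pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def partial_dependencies(key, attributes, fds):
    """Return (part_of_key, attribute) pairs where a non-key attribute
    depends on a proper subset of the composite key."""
    key = set(key)
    non_key = set(attributes) - key
    found = []
    for size in range(1, len(key)):
        for subset in combinations(sorted(key), size):
            for attr in sorted(closure(subset, fds) & non_key):
                found.append((set(subset), attr))
    return found


key = {"student_id", "subject_id"}

# Score table before the change: teacher depends on subject_id only.
before = partial_dependencies(
    key,
    {"student_id", "subject_id", "marks", "teacher"},
    [({"student_id", "subject_id"}, {"marks"}), ({"subject_id"}, {"teacher"})],
)

# Score table after teacher has moved to the Subject table.
after = partial_dependencies(
    key,
    {"student_id", "subject_id", "marks"},
    [({"student_id", "subject_id"}, {"marks"})],
)

print(before)
print(after)
```

The first call reports teacher as depending on subject_id alone, and the second call reports nothing, which is what being in the second normal form requires here.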
Third Normal Form (3NF)
So let's use the same example, where we have 3 tables, Student, Subject and Score.
Student Table
Subject Table
Score Table
1 10 1 70
2 10 2 75
3 11 1 80
In the Score table, we need to store some more information, which is the exam name and total
marks, so let's add 2 more columns to the Score table.
Transitive Dependency:
With exam_name and total_marks added to our Score table, it saves more data now. Primary key
for our Score table is a composite key, which means it's made up of two attributes or columns →
student_id + subject_id.
Our new column exam_name depends on both student and subject. For example, a mechanical
engineering student will have a Workshop exam but a computer science student won't. And for
some subjects you have Practical exams and for some you don't. So we can say that exam_name is
dependent on both student_id and subject_id.
And what about our second new column total_marks? Does it depend on our Score table's
primary key?
Well, the column total_marks depends on exam_name, as the total score changes with the exam type.
For example, practicals carry fewer marks while theory exams carry more marks.
But exam_name is just another column in the Score table. It is not a primary key or even a part
of the primary key, and total_marks depends on it. This is Transitive Dependency: a non-prime
attribute depends on another non-prime attribute rather than on the primary key.
1 Workshop 200
2 Mains 70
3 Practicals 30
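The third normal form condition can also be checked from the functional dependencies: a dependency X -> A causes trouble when X is not a super key and A is not a prime attribute. The following is a rough sketch under assumptions: the function name third_nf_violations is made up, the candidate keys are passed in by hand, and the FDs are written from the description of the Score table.

```python
def closure(attrs, fds):
    """Closure of `attrs` under `fds`, a list of (lhs_set, rhs_set) pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def third_nf_violations(attributes, candidate_keys, fds):
    """Return (lhs, attribute) pairs that break the third normal form."""
    attributes = set(attributes)
    prime = set().union(*candidate_keys)  # attributes that belong to some key
    violations = []
    for lhs, rhs in fds:
        is_superkey = closure(lhs, fds) >= attributes
        for attr in sorted(rhs - lhs):
            if not is_superkey and attr not in prime:
                violations.append((lhs, attr))
    return violations


keys = [{"student_id", "subject_id"}]

# Score table with exam_name and total_marks added.
with_total = third_nf_violations(
    {"student_id", "subject_id", "marks", "exam_name", "total_marks"},
    keys,
    [({"student_id", "subject_id"}, {"marks", "exam_name"}),
     ({"exam_name"}, {"total_marks"})],
)

# Score table once total_marks has been moved out to its own table.
without_total = third_nf_violations(
    {"student_id", "subject_id", "marks", "exam_name"},
    keys,
    [({"student_id", "subject_id"}, {"marks", "exam_name"})],
)

print(with_total)
print(without_total)
```

Only exam_name -> total_marks is reported for the first table, and nothing is reported once total_marks no longer sits in the Score table.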
For a table to satisfy the Boyce-Codd Normal Form, it should satisfy the following two
conditions:
1. It should be in the Third Normal Form.
2. For any dependency A → B, A should be a super key.
The second point sounds a bit tricky, right? In simple words, it means that for a dependency A →
B, A cannot be a non-prime attribute if B is a prime attribute.
Below we have a college enrolment table with columns student_id, subject and professor.
103 C# P.Chash
As you can see, we have also added some sample data to the table.
● One student can enrol for multiple subjects. For example, student with student_id 101,
has opted for subjects - Java & C++
● For each subject, a professor is assigned to the student.
● And, there can be multiple professors teaching one subject like we have for Java.
Well, in the table above student_id and subject together form the primary key, because using
student_id and subject, we can find all the columns of the table.
One more important point to note here is that one professor teaches only one subject, but one subject
may have two different professors.
Hence, there is a dependency between subject and professor here, where subject depends on the
professor name (professor → subject).
This table satisfies the 1st Normal Form because all the values are atomic, column names are
unique and all the values stored in a particular column are of the same domain.
This table also satisfies the 2nd Normal Form as there is no Partial Dependency.
And, there is no Transitive Dependency, hence the table also satisfies the 3rd Normal Form.
In the table above, student_id and subject form the primary key, which means the subject column is a prime
attribute.
But subject depends on professor, and while subject is a prime attribute, professor is a non-prime
attribute, which is not allowed by BCNF.
To make this relation(table) satisfy BCNF, we will decompose this table into two tables, student
table and professor table.
Student Table
student_id p_id
101 1
101 2
and so on...
Professor Table
p_id professor subject
1 P.Java Java
2 P.Cpp C++
and so on...
And now, this relation satisfies Boyce-Codd Normal Form.
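The BCNF condition reduces to one test per dependency: the left side of every non-trivial FD must be a super key. The sketch below is only an illustration; the function name bcnf_violations is hypothetical, and the FDs for the decomposed Professor table assume that p_id and the professor name identify each other.

```python
def closure(attrs, fds):
    """Closure of `attrs` under `fds`, a list of (lhs_set, rhs_set) pairs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def bcnf_violations(attributes, fds):
    """Return the non-trivial FDs whose left side is not a super key."""
    attributes = set(attributes)
    return [
        (lhs, rhs)
        for lhs, rhs in fds
        if not rhs <= lhs and not closure(lhs, fds) >= attributes
    ]


# Original enrolment table.
enrolment = bcnf_violations(
    {"student_id", "subject", "professor"},
    [({"student_id", "subject"}, {"professor"}),
     ({"professor"}, {"subject"})],
)

# Decomposed Professor table (assumed FDs: p_id and professor determine each other).
professor_table = bcnf_violations(
    {"p_id", "professor", "subject"},
    [({"p_id"}, {"professor", "subject"}),
     ({"professor"}, {"p_id"})],
)

print(enrolment)
print(professor_table)
```

The enrolment table reports professor -> subject, since professor alone cannot identify a row. The decomposed Professor table reports nothing under the stated assumption.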