Database Normalization
Database normalization is a technique for organizing the data in a database.
Normalization is a systematic approach of decomposing tables to eliminate data
redundancy (repetition) and undesirable characteristics such as insertion, update, and
deletion anomalies.
It is a multi-step process that puts data into tabular form and removes duplicated data
from the relation tables.
Without normalization, changes made to the database to reflect changes in
reality will create inconsistencies in the data, effectively making the data
useless.
The normalization process involves converting tables into successively stricter normal
forms.
A table that contains a repeating group, or multiple entries for a single row, is called
an unnormalized table.
Normalization starts at 1NF and can continue up to 6NF:
1NF (First Normal Form)
2NF (Second Normal Form)
3NF (Third Normal Form)
BCNF (Boyce-Codd Normal Form)
4NF (Fourth Normal Form)
5NF (Fifth Normal Form)
6NF (Sixth Normal Form)
Many businesses are satisfied with a database in 3NF and stop there, because at this point the database will
accurately represent reality, provided the data entered is accurate.
There are, however, further stages of normalization beyond 3NF that are rarely applied in practice.
Domain-Key Normal Form (DKNF), which is sometimes loosely described as the sixth stage but is distinct from 6NF, states that
every constraint is a logical consequence of domain and key constraints.
DKNF is theoretically the holy grail of normalization, in which insertion, deletion, and update anomalies can never occur.
The process for reaching these higher normal forms is difficult and largely academic. In most cases it is enough just to
know that they exist.
As an aside, many database designers may not remove all violations of all normal form rules. For example, state and zip
codes, though a transitive dependency, will rarely be projected out because it is largely unnecessary to do so. Creating
additional tables requires more CPU at runtime to join tables back together and the likelihood of a new state being added is
remote.
NORMALIZATION
Normalization is necessary: without it, the overall integrity of the data stored in the
database will eventually degrade. Specifically, this is due to data anomalies. These anomalies occur
naturally and result in data that does not match the real world the database purports to represent.
Anomalies are caused when there is too much redundancy in the database's information. They can
often be traced to tables that suffer from poor construction.
So, what does "poor construction" mean? Poor table design becomes evident if, when creating the
database, the designer does not identify the entities that depend on each other for existence, like the
rooms of a hotel and the hotel, and then minimize the chance that one could ever exist independently of
the other.
The normalization process was created largely in order to reduce the negative effects of creating tables
that will introduce anomalies into the database.
PROBLEMS WITHOUT NORMALIZATION
Insertion Anomalies happen when inserting vital data into the database is not possible because other data is
not already there. For example, if a system is designed to require that a customer be on file before a sale can
be made to that customer, but you cannot add a customer until they have bought something, then you have
an insert anomaly. It is the classic "catch-22" situation.
Deletion Anomalies happen when the deletion of unwanted information causes desired information to be
deleted as well. For example, if a single database record contains information about a particular product
along with information about a salesperson for the company and the salesperson quits, then information
about the product is deleted along with salesperson information.
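For example, in the student table below, the branch details (hod and office_tel) are repeated for every student:
rollno  name  branch  hod    office_tel
------  ----  ------  -----  ----------
401     Akon  CSE     Mr. X  53337
402     Bkon  CSE     Mr. X  53337
403     Ckon  CSE     Mr. X  53337
404     Dkon  CSE     Mr. X  53337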
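The effect of this redundancy can be sketched in a few lines of Python. This is only an illustration on the student rows above: the helper names and the replacement HOD "Mr. Y" are hypothetical, and a real database would show the same problems through UPDATE and DELETE statements.

```python
# Unnormalized student rows from the table above: branch details repeat on every row.
students = [
    {"rollno": 401, "name": "Akon", "branch": "CSE", "hod": "Mr. X", "office_tel": "53337"},
    {"rollno": 402, "name": "Bkon", "branch": "CSE", "hod": "Mr. X", "office_tel": "53337"},
    {"rollno": 403, "name": "Ckon", "branch": "CSE", "hod": "Mr. X", "office_tel": "53337"},
    {"rollno": 404, "name": "Dkon", "branch": "CSE", "hod": "Mr. X", "office_tel": "53337"},
]


def update_hod_on_one_row(rows, rollno, new_hod):
    """Change the HOD on a single row only, simulating an incomplete update."""
    return [dict(r, hod=new_hod) if r["rollno"] == rollno else dict(r) for r in rows]


def hods_per_branch(rows):
    """Map each branch to the set of HOD names recorded for it."""
    result = {}
    for r in rows:
        result.setdefault(r["branch"], set()).add(r["hod"])
    return result


# Update anomaly: one row is changed, the others are not, so the branch now has two HODs.
after_update = update_hod_on_one_row(students, 401, "Mr. Y")  # "Mr. Y" is a made-up value
inconsistent = {b: h for b, h in hods_per_branch(after_update).items() if len(h) > 1}

# Deletion anomaly: deleting all CSE students also deletes the only record of the branch's HOD.
after_delete = [r for r in students if r["branch"] != "CSE"]
branch_info_left = hods_per_branch(after_delete)

print(inconsistent)
print(branch_info_left)
```

Storing the branch details once, in a table of their own, would remove both problems.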
NON-TRIVIAL FUNCTIONAL DEPENDENCY
A functional dependency A -> B holds when the value of A determines the value of B. It is called
non-trivial when B is not a subset of A.
Example:
{Company} -> {CEO} (if we know the company, we know the CEO's name)
CEO is not a subset of Company, hence this is a non-trivial functional dependency.
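A functional dependency can be tested against sample rows with a short Python sketch. The function names and the company data are hypothetical, chosen only for illustration. Note that sample data can only refute a dependency: whether it truly holds is a property of the business rules, not of any particular set of rows.

```python
def holds_fd(rows, lhs, rhs):
    """Return True if every pair of rows that agrees on lhs also agrees on rhs."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        value = tuple(row[a] for a in rhs)
        # setdefault stores the first rhs value seen for this lhs value
        if seen.setdefault(key, value) != value:
            return False
    return True


def is_trivial(lhs, rhs):
    """A dependency lhs -> rhs is trivial when rhs is a subset of lhs."""
    return set(rhs) <= set(lhs)


# Made-up sample data: each company has exactly one CEO.
companies = [
    {"company": "Acme", "ceo": "Alice"},
    {"company": "Globex", "ceo": "Bob"},
    {"company": "Acme", "ceo": "Alice"},
]

print(holds_fd(companies, ["company"], ["ceo"]))
print(is_trivial(["company"], ["ceo"]))
```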
MULTIVALUED DEPENDENCY
A multivalued dependency occurs when a table contains two or more independent multivalued
attributes.
For example, consider a bike manufacturing company that produces each model in two colors
(Black and Red) every year.
bike_model  manuf_year  color
----------  ----------  -----
M1001       2007        Black
M1001       2007        Red
M2012       2008        Black
M2012       2008        Red
M2222       2009        Black
M2222       2009        Red
Here columns manuf_year and color are independent of each other and dependent on bike_model.
In this case these two columns are said to be multivalued dependent on bike_model. These
dependencies can be represented like this:
bike_model ->> manuf_year
bike_model ->> color
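One way to check such a dependency on sample rows is sketched below in Python. The function name `holds_mvd` is hypothetical. The idea is that X ->> Y holds when, for every value of X, the rows form the full cross product of the Y-values and the values of the remaining attributes.

```python
from collections import defaultdict
from itertools import product


def holds_mvd(rows, x, y):
    """Check X ->> Y on a list of dict rows (all rows share the same keys)."""
    rest = [a for a in rows[0] if a not in x and a not in y]
    groups = defaultdict(set)
    for r in rows:
        x_val = tuple(r[a] for a in x)
        groups[x_val].add((tuple(r[a] for a in y), tuple(r[a] for a in rest)))
    for pairs in groups.values():
        y_vals = {p[0] for p in pairs}
        rest_vals = {p[1] for p in pairs}
        # Independence means every combination of Y-value and rest-value must be present.
        if pairs != set(product(y_vals, rest_vals)):
            return False
    return True


bikes = [
    {"bike_model": "M1001", "manuf_year": 2007, "color": "Black"},
    {"bike_model": "M1001", "manuf_year": 2007, "color": "Red"},
    {"bike_model": "M2012", "manuf_year": 2008, "color": "Black"},
    {"bike_model": "M2012", "manuf_year": 2008, "color": "Red"},
    {"bike_model": "M2222", "manuf_year": 2009, "color": "Black"},
    {"bike_model": "M2222", "manuf_year": 2009, "color": "Red"},
]

print(holds_mvd(bikes, ["bike_model"], ["color"]))
print(holds_mvd(bikes, ["bike_model"], ["manuf_year"]))
```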
TRANSITIVE DEPENDENCY
A transitive dependency is a type of functional dependency that is
formed indirectly by two other functional dependencies. A transitive dependency can only
occur in a relation of three or more attributes. Identifying this dependency helps us normalize the
database to 3NF (Third Normal Form).
Book                Author              Author_age
------------------  ------------------  ----------
Game of Thrones     George R.R. Martin  66
Harry Potter        J.K. Rowling        49
Dying of the Light  George R.R. Martin  66
{Book} -> {Author} (if we know the book, we know the author's name)
{Author} does not -> {Book}
{Author} -> {Author_age}
Therefore, as per the rule of transitive dependency, {Book} -> {Author_age} should hold.
That makes sense, because if we know the book name we can find out the author's age.
ANOTHER TRANSITIVE DEPENDENCY EXAMPLE
{Company} -> {CEO} (if we know the company, we know its CEO's name)
{CEO} -> {Age} (if we know the CEO, we know the CEO's age)
Therefore, according to the rule of transitive dependency,
{Company} -> {Age} should hold. That makes sense, because if we know the
company name, we can find out its CEO's age.
Again, remember that a transitive dependency can only occur in a relation of three or
more attributes.
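The transitive rule used in both examples can be mechanized with the standard attribute-closure computation. The sketch below is a minimal Python version. The function name `closure` and the encoding of each dependency as a pair of sets are choices made here for illustration.

```python
def closure(attrs, fds):
    """Return every attribute functionally determined by attrs under the given dependencies."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If we already know the whole left side, we also know the right side.
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


book_fds = [({"Book"}, {"Author"}), ({"Author"}, {"Author_age"})]
company_fds = [({"Company"}, {"CEO"}), ({"CEO"}, {"Age"})]

# {Book} reaches Author_age through Author, which is the transitive dependency.
print(sorted(closure({"Book"}, book_fds)))
print(sorted(closure({"Company"}, company_fds)))
```

Because {Author} does not determine {Book}, the closure of {Author} contains only Author and Author_age.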
LET’S HAVE A KEY DISCUSSION AGAIN:
DEFINITION OF CANDIDATE KEY IN DBMS:
A super key with no redundant attributes is known as a candidate key. Candidate keys
are selected from the set of super keys; the only requirement when selecting a
candidate key is that it must not have any redundant attributes.
That is why candidate keys are also called minimal super keys.
CANDIDATE KEY EXAMPLE
Let's take an example of the table "Employee". This table has three attributes: Emp_Id,
Emp_Number and Emp_Name. Here Emp_Id and Emp_Number have unique values, while
Emp_Name can have duplicate values, as more than one employee can have the same name.
Emp_Id  Emp_Number  Emp_Name
------  ----------  ---------
E01     2264        Steve
E22     2278        Ajeet
E23     2288        Chaitanya
E45     2290        Robert
How many super keys can the above table have?
1. {Emp_Id}
2. {Emp_Number}
3. {Emp_Id, Emp_Number}
4. {Emp_Id, Emp_Name}
5. {Emp_Id, Emp_Number, Emp_Name}
6. {Emp_Number, Emp_Name}
Let's select the candidate keys from the above set of super keys.
SELECTING THE CANDIDATE KEY
1. {Emp_Id} – No redundant attributes
2. {Emp_Number} – No redundant attributes
3. {Emp_Id, Emp_Number} – Redundant attribute. Either of those attributes can be a minimal super key as both of
these columns have unique values.
4. {Emp_Id, Emp_Name} – Redundant attribute Emp_Name.
5. {Emp_Id, Emp_Number, Emp_Name} – Redundant attributes. Emp_Id or Emp_Number alone is sufficient
to uniquely identify a row of the Employee table.
6. {Emp_Number, Emp_Name} – Redundant attribute Emp_Name.
The candidate keys we have selected are:
{Emp_Id}
{Emp_Number}
Note: A primary key is selected from the set of candidate keys. That means either Emp_Id or Emp_Number can serve
as the primary key. The decision is made by the DBA (database administrator).
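The selection above can be reproduced with a small Python sketch. It assumes two functional dependencies inferred from the statement that Emp_Id and Emp_Number are unique: each of them determines the other two attributes. The function names are hypothetical. A super key is any attribute set whose closure covers the whole table, and a candidate key is a super key with no smaller super key inside it.

```python
from itertools import combinations


def closure(attrs, fds):
    """Return every attribute functionally determined by attrs."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if set(lhs) <= result and not set(rhs) <= result:
                result |= set(rhs)
                changed = True
    return result


def super_keys(all_attrs, fds):
    """Enumerate every attribute subset that determines all attributes."""
    keys = []
    for size in range(1, len(all_attrs) + 1):
        for combo in combinations(all_attrs, size):
            if closure(combo, fds) == set(all_attrs):
                keys.append(frozenset(combo))
    return keys


def candidate_keys(all_attrs, fds):
    """Keep only the minimal super keys (no proper subset is itself a super key)."""
    sks = super_keys(all_attrs, fds)
    return [k for k in sks if not any(other < k for other in sks)]


attrs = ["Emp_Id", "Emp_Number", "Emp_Name"]
# Assumed dependencies: Emp_Id and Emp_Number are each unique per employee.
fds = [
    ({"Emp_Id"}, {"Emp_Number", "Emp_Name"}),
    ({"Emp_Number"}, {"Emp_Id", "Emp_Name"}),
]

for key in super_keys(attrs, fds):
    print("super key:", sorted(key))
for key in candidate_keys(attrs, fds):
    print("candidate key:", sorted(key))
```

The enumeration is exponential in the number of attributes, which is acceptable for a three-column teaching example but not for wide tables.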
DEFINITION OF SUPER KEY IN DBMS
A super key is a set of one or more attributes (columns) that can uniquely
identify a row in a table. DBMS beginners often confuse super keys
with candidate keys, so we will also discuss the candidate key and its relation to the super
key.
How is a candidate key different from a super key?
The answer is simple: candidate keys are selected from the set of super keys, and the only
requirement when selecting a candidate key is that it should not have any redundant
attribute. That is why candidate keys are also called minimal super keys.
LET’S TAKE AN EXAMPLE TO UNDERSTAND THIS:
Table: Employee
Emp_SSN    Emp_Number  Emp_Name
---------  ----------  ---------
123456789  226         Steve
999999321  227         Ajeet
888997212  228         Chaitanya
777778888  229         Robert
Super keys: The table above has the following super keys. Each of the following sets is able
to uniquely identify a row of the Employee table.
{Emp_SSN}
{Emp_Number}
{Emp_SSN, Emp_Number}
{Emp_SSN, Emp_Name}
{Emp_SSN, Emp_Number, Emp_Name}
{Emp_Number, Emp_Name}
CANDIDATE KEYS AGAIN
As mentioned in the beginning, a candidate key is a minimal super key with no
redundant attributes. The following two sets are chosen from the above
super keys because they contain no redundant attributes.
{Emp_SSN}
{Emp_Number}
Only these two sets are candidate keys, as all other sets contain redundant
attributes that are not necessary for unique identification.
SUPER KEY VS CANDIDATE KEY
There can be some confusion between super keys and candidate keys. The two points below should make the distinction clear.
1. All candidate keys are super keys, because the candidate keys
are chosen out of the super keys.
2. How do we choose candidate keys from the set of super keys? We look for those keys from which we cannot
remove any fields. In the above example, we have not chosen {Emp_SSN, Emp_Name} as a candidate key
because {Emp_SSN} alone can identify a unique row in the table and Emp_Name is redundant.
Primary key:
A Primary key is selected from a set of candidate keys. This is done by database admin or database
designer. We can say that either {Emp_SSN} or {Emp_Number} can be chosen as a primary key for the
table Employee.
FIRST NORMAL FORM (1NF)
For a table to be in the First Normal Form, it should follow these 4 rules:
1. It should only have single-valued (atomic) attributes/columns.
2. Values stored in a column should be of the same domain.
3. All the columns in a table should have unique names.
4. The order in which data is stored does not matter.
SECOND NORMAL FORM (2NF)
For a table to be in the Second Normal Form:
1. It should be in the First Normal Form.
2. It should not have any partial dependency.
THIRD NORMAL FORM (3NF)
A table is said to be in the Third Normal Form when:
1. It is in the Second Normal Form.
2. It does not have any transitive dependency.
BOYCE AND CODD NORMAL FORM (BCNF)
Boyce and Codd Normal Form is a stricter version of the Third Normal Form. This
form deals with a certain type of anomaly that is not handled by 3NF. A 3NF table
that does not have multiple overlapping candidate keys is said to be in BCNF. For
a table R to be in BCNF, the following conditions must be satisfied:
• R must be in the Third Normal Form,
• and for each non-trivial functional dependency (X → Y), X should be a super key.
FOURTH NORMAL FORM (4NF)
A table is said to be in the Fourth Normal Form when:
1. It is in the Boyce-Codd Normal Form.
2. It has no non-trivial multivalued dependency whose left-hand side is not a super key.
FIRST NORMAL FORM
Rule 1: Single Valued Attributes
Each column of the table should be single valued, which means it should not contain multiple values.
An example of this rule follows once the other rules have been covered.
Rule 2: Attribute Domain should not change
This is more of a "common sense" rule. In each column, the values stored must be of the same kind or type.
For example, if you have a column dob to store the dates of birth of a set of people, then you must not
store the names of some of them in that column alongside the dates of birth of others. It
should hold only dates of birth for all the records/rows.
Rule 3: Unique name for Attributes/Columns
This rule expects that each column in a table should have a unique name. This is to avoid confusion at the
time of retrieving data or performing any other operation on the stored data.
If two or more columns have the same name, the DBMS cannot tell which one a query refers to.
Rule 4: Order doesn't matter
This rule says that the order in which you store the data in your table doesn't matter.
1NF EXAMPLE
roll_no  name  subject
-------  ----  -------
101      Akon  OS, CN
103      Ckon  Java
102      Bkon  C, C++
The table already satisfies 3 of the 4 rules: all column names are unique, the order in which the data is
stored does not matter, and no column mixes different types of data.
But of the 3 students in the table, 2 have opted for more than one subject, whereas the First
Normal Form requires each column to contain an atomic value.
HOW TO SOLVE 1NF
Here is our updated table and it now satisfies the First Normal Form.
roll_no  name  subject
-------  ----  -------
101      Akon  OS
101      Akon  CN
103      Ckon  Java
102      Bkon  C
102      Bkon  C++
By doing so, a few values are repeated, but the values in the subject
column are now atomic for each record/row.
With the First Normal Form, data redundancy increases, as some column values are
repeated across multiple rows, but each row as a whole is unique.
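The conversion to 1NF shown above can be sketched in Python as follows. The function name `to_first_normal_form` is hypothetical, and the sketch assumes the multiple subjects are stored as a comma-separated string, as in the example table.

```python
def to_first_normal_form(rows, column, sep=","):
    """Split a multi-valued column into one row per atomic value."""
    result = []
    for row in rows:
        for value in row[column].split(sep):
            # Copy the row and replace the multi-valued cell with a single value.
            result.append({**row, column: value.strip()})
    return result


students = [
    {"roll_no": 101, "name": "Akon", "subject": "OS, CN"},
    {"roll_no": 103, "name": "Ckon", "subject": "Java"},
    {"roll_no": 102, "name": "Bkon", "subject": "C, C++"},
]

atomic = to_first_normal_form(students, "subject")
for row in atomic:
    print(row["roll_no"], row["name"], row["subject"])
```

The roll_no and name values repeat in the output, which is the redundancy that the later normal forms remove.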
SECOND NORMAL FORM (2NF)
For a table to be in the Second Normal Form:
1. It should be in the First Normal Form.
2. No non-prime attribute should be dependent on a proper subset of any candidate key of the
table.
An attribute that is not part of any candidate key is known as a non-prime attribute.
HOW TO SOLVE 2NF
Example: Suppose a school wants to store the data of teachers and the subjects they
teach. They create a table that looks like the one below. Since a teacher can teach more than one
subject, the table can have multiple rows for the same teacher.
teacher_ID  subject    teacher_age
----------  ---------  -----------
111         Math       38
111         Physics    38
222         Biology    38
333         Physics    40
333         Chemistry  40
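In this table only the pair {teacher_ID, subject} identifies a row, while teacher_age depends on teacher_ID alone. That is a dependency on a proper subset of the candidate key, so the table is not in 2NF. A common fix is to split it into two tables, sketched here in Python. The names `teacher_details` and `teacher_subject` and the helper `project` are hypothetical, used only for illustration.

```python
teaching = [
    {"teacher_ID": 111, "subject": "Math", "teacher_age": 38},
    {"teacher_ID": 111, "subject": "Physics", "teacher_age": 38},
    {"teacher_ID": 222, "subject": "Biology", "teacher_age": 38},
    {"teacher_ID": 333, "subject": "Physics", "teacher_age": 40},
    {"teacher_ID": 333, "subject": "Chemistry", "teacher_age": 40},
]


def project(rows, columns):
    """Keep only the given columns and drop duplicate rows, preserving order."""
    seen = []
    for row in rows:
        item = {c: row[c] for c in columns}
        if item not in seen:
            seen.append(item)
    return seen


# teacher_age now depends on the whole key of its table (teacher_ID).
teacher_details = project(teaching, ["teacher_ID", "teacher_age"])
# The key of this table is the pair (teacher_ID, subject); there are no other attributes.
teacher_subject = project(teaching, ["teacher_ID", "subject"])

print(teacher_details)
print(teacher_subject)
```

Each teacher's age is now stored once, so changing it cannot leave the table inconsistent.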
In other words, 3NF can be explained like this: a table is in 3NF if it is in 2NF and, for each
functional dependency X -> Y, at least one of the following conditions holds:
• X is a super key of the table
• Y is a prime attribute of the table
An attribute that is part of one of the candidate keys is known as a prime attribute.
EXAMPLE: SUPPOSE A COMPANY WANTS TO STORE THE
COMPLETE ADDRESS OF EACH EMPLOYEE, THEY CREATE A
TABLE NAMED EMPLOYEE_DETAILS THAT LOOKS LIKE THIS:
emp_ID emp_name emp_zip emp_state emp_city emp_district