Data Normalization
Functional Dependency
A functional dependency (FD) is a constraint between two sets of attributes in a relation. It
says that if two tuples have the same values for attributes A1, A2, ..., An, then those two
tuples must also have the same values for attributes B1, B2, ..., Bn. A functional dependency is
written with an arrow (→), as in X → Y, which reads "X functionally determines Y". The attributes
on the left-hand side determine the values of the attributes on the right-hand side.
Transitivity rule: as with the transitive rule in algebra, if a → b holds and b → c holds, then
a → c also holds. Here a → b is read as "a functionally determines b".
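The transitivity rule can be applied mechanically by computing the closure of a set of attributes, that is, everything those attributes determine directly or indirectly. The following is a minimal sketch, not a definitive implementation; the function name closure and the representation of each dependency as a pair of Python sets are assumptions made for illustration.

```python
def closure(attributes, fds):
    """Return every attribute determined by `attributes` under the given FDs.

    `fds` is a list of (lhs, rhs) pairs, each a set of attribute names.
    """
    result = set(attributes)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If we already know all of lhs, we also know all of rhs.
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


# a -> b and b -> c, as in the transitivity rule.
fds = [({"a"}, {"b"}), ({"b"}, {"c"})]
print(sorted(closure({"a"}, fds)))  # ['a', 'b', 'c']
```

Because c appears in the closure of a, the dependency a → c holds even though it was never listed explicitly.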
We might say that Salesperson Number defines Salesperson Name. If I give you a Salesperson
Number, you can give me back the one and only name that goes with it. These defining
associations are commonly written with a right-pointing arrow like this:
Salesperson Number → Salesperson Name
In the more formal terms of functional dependencies, the attribute on the left side is referred to as
the determinant attribute. This is because its value determines the value of the attribute on the
right side. Conversely, we also say that the attribute on the right is functionally dependent on the
attribute on the left.
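Whether a functional dependency holds in a particular set of rows can be tested directly: no two rows may agree on the determinant yet disagree on the dependent attribute. The sketch below assumes rows are stored as Python dictionaries; the function name fd_holds and the sample names are hypothetical and serve only as illustration.

```python
def fd_holds(rows, lhs, rhs):
    """Return True if the functional dependency lhs -> rhs holds in `rows`."""
    seen = {}
    for row in rows:
        determinant = tuple(row[a] for a in lhs)
        dependent = tuple(row[a] for a in rhs)
        # The first value seen for a determinant must match all later ones.
        if seen.setdefault(determinant, dependent) != dependent:
            return False
    return True


# Hypothetical sample rows (the names are made up for illustration).
rows = [
    {"salesperson_number": 137, "salesperson_name": "Baker", "product_number": 19440},
    {"salesperson_number": 137, "salesperson_name": "Baker", "product_number": 24013},
    {"salesperson_number": 186, "salesperson_name": "Adams", "product_number": 19440},
]

print(fd_holds(rows, ["salesperson_number"], ["salesperson_name"]))  # True
print(fd_holds(rows, ["salesperson_number"], ["product_number"]))    # False
```

A check like this can only refute a dependency on the data at hand; whether a dependency must always hold is a statement about the business rules, not about one sample of rows.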
Figure 4-23 Salesperson entity attributes.
Salesperson Number
Salesperson Name
Commission Percentage
Year of Hire
Department Number
Manager Name
Product Number
Product Name
Unit Price
Quantity
If a database design is flawed, it may contain anomalies, and a database with anomalies is very
difficult to manage and keep consistent. Data normalization
is a methodology for organizing attributes into tables so that redundancy among the nonkey
attributes is eliminated. Each of the resultant tables deals with a single data focus, which is just
another way of saying that each resultant table will describe a single entity type or a single
many-to-many relationship. Furthermore, foreign keys will appear exactly where they are
needed. In other words, the output of the data normalization process is a properly structured
relational database.
Update anomalies: if data items are scattered and not properly linked to each other, then
updating a data item that has copies in several places may update some copies correctly
while others are left with their old values. This leaves the database in an inconsistent
state.
Deletion anomalies: we try to delete a record, but parts of it are left undeleted because,
unknown to us, the same data is also stored somewhere else.
Insert anomalies: we try to insert data into a record that does not exist at all.
Normalization is a method for removing all of these anomalies and bringing the database to a
consistent state.
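An update anomaly is easy to reproduce in a small sketch. The rows below are hypothetical: they repeat a manager name for every salesperson in the same department, and the department number 73 and the names are made up. The helper conflicting_managers is likewise an illustrative assumption.

```python
# Hypothetical denormalized rows: the manager name is stored once per
# salesperson, so the same fact appears in several places.
salespersons = [
    {"salesperson_number": 137, "department_number": 73, "manager_name": "Scott"},
    {"salesperson_number": 186, "department_number": 73, "manager_name": "Scott"},
]


def conflicting_managers(rows):
    """Return the departments recorded with more than one manager name."""
    names = {}
    for row in rows:
        names.setdefault(row["department_number"], set()).add(row["manager_name"])
    return {dept: found for dept, found in names.items() if len(found) > 1}


before = conflicting_managers(salespersons)

# An update that reaches only one of the two copies.
salespersons[0]["manager_name"] = "Lopez"
after = conflicting_managers(salespersons)

print(before)             # {}
print(sorted(after[73]))  # ['Lopez', 'Scott']
```

If the manager name were stored once per department in a separate table, the same update would touch a single row and could not leave the data contradicting itself.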
The table in Figure 4-25 is unnormalized. The table has four records, one for each salesperson.
But since each salesperson has sold several products and there is only one record for each
salesperson, several attributes of each record must have multiple values. For example, the record
for salesperson 137 has three product numbers, 19440, 24013, and 26722, in its Product Number
attribute because salesperson 137 has sold all three of those products. Having such multivalued
attributes is not permitted and so this table is unnormalized.
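One way to remove a multivalued attribute is to emit one row per combination of salesperson and product, so that every attribute value becomes atomic. The sketch below uses the three product numbers quoted above for salesperson 137; the dictionary layout and the function name to_first_normal_form are assumptions made for illustration.

```python
# One unnormalized record: the product numbers form a multivalued attribute.
unnormalized = [
    {"salesperson_number": 137, "product_numbers": [19440, 24013, 26722]},
]


def to_first_normal_form(rows):
    """Flatten the multivalued attribute into one row per product."""
    flat = []
    for row in rows:
        for product_number in row["product_numbers"]:
            flat.append({
                "salesperson_number": row["salesperson_number"],
                "product_number": product_number,
            })
    return flat


for row in to_first_normal_form(unnormalized):
    print(row)
```

After flattening, the salesperson number alone no longer identifies a row, so the primary key becomes the combination of salesperson number and product number.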
In the first normal form, each attribute value is atomic, that is, no attribute is multivalued. The
table in Figure 4-26 is the first normal form representation of the data. The attributes under
consideration have been listed in one table, and a primary key has been established. In this
1NF:
Remove multivalued attributes
Student database (Student_ID, Student_Name, Batch, Advisor, Department_Name,
Department_Head, Course_No, Course_Title)
2NF:
Remove partial functional dependencies, where data depends on only part of the primary key.
Student (Student_ID, Student_Name, Batch, Advisor, Department_Name, Department_Head)
Student_Course (Student_ID, Course_No, Course_Title)
3NF:
Remove transitive dependencies
Student (Student_ID, Student_Name, Batch, Department_Name)
Advisor (Batch, Advisor)
Department (Department_Name, Department_Head)
Student_Course (Student_ID, Course_No)
Course (Course_No, Course_Title)
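The reasoning behind the decomposition can be sketched by classifying each dependency against the primary key. The dependencies below are not stated in the text; they are assumptions inferred from how the tables are split, as is the composite key (Student_ID, Course_No). The classification is also simplified: it assumes a single candidate key and that every right-hand side is a nonkey attribute.

```python
# Assumed primary key of the single 1NF table.
KEY = frozenset({"Student_ID", "Course_No"})

# Assumed functional dependencies, inferred from the decomposition.
FDS = [
    ({"Student_ID"}, {"Student_Name", "Batch", "Department_Name"}),
    ({"Course_No"}, {"Course_Title"}),
    ({"Batch"}, {"Advisor"}),
    ({"Department_Name"}, {"Department_Head"}),
]


def classify(lhs, key):
    """Classify a determinant relative to the primary key (simplified)."""
    lhs = set(lhs)
    if lhs >= key:
        return "full"        # determined by the whole key
    if lhs < key:
        return "partial"     # determined by part of the key: a 2NF issue
    return "transitive"      # determined by a nonkey attribute: a 3NF issue


for lhs, rhs in FDS:
    print(sorted(lhs), "->", sorted(rhs), ":", classify(lhs, KEY))
```

Under these assumptions the dependencies on Student_ID and Course_No are partial, while those on Batch and Department_Name are transitive, which matches the tables that appear in the final step.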
2NF
Employee (Employee_ID, Employee_Name, Mobile, Department_Name, Department_Location)
Project (Project_ID, Project_Name, Employee_ID)
3NF
Employee (Employee_ID, Employee_Name, Mobile, Department_ID)
Department (Department_ID, Department_Name, Department_Location)
Project (Project_ID, Project_Name, Employee_ID)
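The move from the 2NF to the 3NF Employee table can be sketched as a split that stores each department once and leaves only a Department_ID behind. All sample values below are hypothetical, and generating Department_ID as a running number is an assumption; the text does not say how the identifier is assigned.

```python
# Hypothetical 2NF rows: department details are repeated per employee.
employees_2nf = [
    {"Employee_ID": 1, "Employee_Name": "Asha", "Mobile": "555-0101",
     "Department_Name": "Sales", "Department_Location": "Floor 1"},
    {"Employee_ID": 2, "Employee_Name": "Ben", "Mobile": "555-0102",
     "Department_Name": "Sales", "Department_Location": "Floor 1"},
    {"Employee_ID": 3, "Employee_Name": "Chen", "Mobile": "555-0103",
     "Department_Name": "Support", "Department_Location": "Floor 2"},
]


def split_departments(rows):
    """Split rows into Employee and Department tables linked by Department_ID."""
    department_ids = {}  # (name, location) -> assigned Department_ID
    employees = []
    for row in rows:
        dept = (row["Department_Name"], row["Department_Location"])
        # Reuse the existing ID, or assign the next running number.
        dept_id = department_ids.setdefault(dept, len(department_ids) + 1)
        employees.append({
            "Employee_ID": row["Employee_ID"],
            "Employee_Name": row["Employee_Name"],
            "Mobile": row["Mobile"],
            "Department_ID": dept_id,
        })
    departments = [
        {"Department_ID": dept_id, "Department_Name": name,
         "Department_Location": location}
        for (name, location), dept_id in department_ids.items()
    ]
    return employees, departments


employees, departments = split_departments(employees_2nf)
print(len(departments))  # 2
```

Each department's location now lives in exactly one row, so changing it is a single update rather than one update per employee.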
Example 1a
• DB(Patno, PatName, appNo, time, doctor)
• No repeating groups, so in 1NF
• 2NF – eliminate partial key dependencies:
– DB(Patno, appNo, time, doctor)
– R1(Patno, PatName)
• 3NF – no transitive dependencies, so in 3NF
• Now try BCNF.
Rewrite to BCNF
• DB(Patno, appNo, time, doctor)
R1(Patno, PatName)
• BCNF: rewrite to
DB(Patno, time, doctor)
R1(Patno, PatName)
R2(time, appNo)
• Here time → appNo: time alone is enough to work out the appointment number of a patient, so
appNo moves into its own relation R2. BCNF is now satisfied, and the final relations shown are
in BCNF.
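The BCNF test can be sketched as a check that every applicable determinant is a superkey of its relation. The dependencies below are assumptions inferred from the example (Patno → PatName; Patno, appNo → time, doctor; time → appNo); only the last one is stated in the text. The check is simplified: it examines only the listed dependencies rather than every dependency implied by them.

```python
def closure(attributes, fds):
    """Return every attribute determined by `attributes` under the given FDs."""
    result = set(attributes)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result


def bcnf_violations(relation, fds):
    """List the given FDs whose determinant is not a superkey of `relation`."""
    violations = []
    for lhs, rhs in fds:
        dependents = (rhs & relation) - lhs
        # Skip FDs that do not apply inside this relation or are trivial.
        if not lhs <= relation or not dependents:
            continue
        if not relation <= closure(lhs, fds):
            violations.append((lhs, dependents))
    return violations


# Assumed functional dependencies for the patient example.
FDS = [
    ({"Patno"}, {"PatName"}),
    ({"Patno", "appNo"}, {"time", "doctor"}),
    ({"time"}, {"appNo"}),
]

print(bcnf_violations({"Patno", "appNo", "time", "doctor"}, FDS))  # [({'time'}, {'appNo'})]
print(bcnf_violations({"Patno", "time", "doctor"}, FDS))           # []
print(bcnf_violations({"Patno", "PatName"}, FDS))                  # []
print(bcnf_violations({"time", "appNo"}, FDS))                     # []
```

Under these assumptions the only violation in the 3NF relation DB is time → appNo, because time alone does not determine Patno or doctor, and the three relations after the rewrite pass the check.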
Example 1b
• DB(Patno, PatName, appNo, time, doctor)
• No repeating groups, so in 1NF
• 2NF – eliminate partial key dependencies:
– DB(Patno, time, doctor)
– R1(Patno, PatName)
– R2(time, appNo)
• 3NF – no transitive dependencies, so in 3NF
• Now try BCNF.
Summary - Example 1
This example has demonstrated three things:
• BCNF is stronger than 3NF: relations that are in 3NF are not necessarily in BCNF.
• BCNF is needed in certain situations to obtain a full understanding of the data model.
• There are several routes to arrive at the same set of relations in BCNF.
– Unfortunately, there are no rules as to which route will be the easiest one to take.