Normalization: Basic Concepts
Normalization: Basic Concepts
Normalization is a design technique that is widely used as a guide in designing relational databases. Normalization is essentially a two step process that puts data into tabular form by removing repeating groups and then removes duplicated data from the relational tables. Normalization theory is based on the concepts of normal forms. A relational table is said to be a particular normal form if it satisfied a certain set of constraints. There are currently five normal forms that have been defined. In this section, we will cover the first three normal forms that were defined by E. F. Codd.
Basic Concepts
The goal of normalization is to create a set of relational tables that are free of redundant data and that can be consistently and correctly modified. This means that all tables in a relational database should be in the third normal form (3NF). A relational table is in 3NF if and only if all non-key columns are (a) mutually independent and (b) fully dependent upon the primary key. Mutual independence means that no non-key column is dependent upon any combination of the other columns. The first two normal forms are intermediate steps to achieve the goal of having all tables in 3NF. In order to better understand the 2NF and higher forms, it is necessary to understand the concepts of functional dependencies and lossless decomposition.
Functional Dependencies
The concept of functional dependencies is the basis for the first three normal forms. A column, Y, of the relational table R is said to be functionally dependent upon column X of R if and only if each value of X in R is associated with precisely one value of Y at any given time. X and Y may be composite. Saying that column Y is functionally dependent upon X is the same as saying the values of column X identify the values of column Y. If column X is a primary key, then all columns in the relational table R must be functionally dependent upon X. A short-hand notation for describing a functional dependency is: R.x >; R.y which can be read as in the relational table named R, column x functionally determines (identifies) column y. Full functional dependence applies to tables with composite keys. Column Y in relational table R is fully functional on X of R if it is functionally dependent on X and not functionally dependent upon any subset of X. Full functional dependence means that when a primary key is composite, made of two or more columns, then the other columns must be identified by the entire key and not just some of the columns that make up the key.
Overview
Simply stated, normalization is the process of removing redundant data from relational tables by decomposing (splitting) a relational table into smaller tables by projection. The goal is to have only primary keys on the left hand side of a functional dependency. In order to be correct, decomposition must be lossless. That is, the new tables can be recombined by a natural join to recreate the original table without creating any spurious or redundant data.
Sample Data
Data taken from Date [Date90] is used to illustrate the process of normalization. A company obtains parts from a number of suppliers. Each supplier is located in one city. A city can have more than one supplier located there and each city has a status code associated with it. Each supplier may provide many parts. The company creates a simple relational table to store this information that can be expressed in relational notation as: FIRST (s#, status, city, p#, qty) where
supplier identifcation number (this is the primary key) status code assigned to city name of city where supplier is located part number of part supplied quantity of parts supplied to date
In order to uniquely associate quantity supplied (qty) with part (p#) and supplier (s#), a composite primary key composed of s# and p# is used.
Although the table FIRST is in 1NF it contains redundant data. For example, information about the supplier's location and the location's status have to be repeated for every part supplied. Redundancy causes what are called update anomalies. Update anomalies are problems that arise when information is inserted, deleted, or updated. For example, the following anomalies could occur in FIRST:
INSERT. The fact that a certain supplier (s5) is located in a particular city (Athens) cannot be added until they supplied a part.
DELETE. If a row is deleted, then not only is the information about quantity and part lost but also information about the supplier. UPDATE. If supplier s1 moved from London to New York, then six rows would have to be updated with this new information.
4.
5.
To transform FIRST into 2NF we move the columns s#, status, and city to a new table called SECOND. The column s# becomes the primary key of this new table. The results are shown below in Figure 2.
Tables in 2NF but not in 3NF still contain modification anomalies. In the example of SECOND, they are: INSERT. The fact that a particular city has a certain status (Rome has a status of 50) cannot be inserted until there is a supplier in the city. DELETE. Deleting any row in SUPPLIER destroys the status information about the city as well as the association between supplier and city.
2. 3.
Create and name a new table for each determinant and the unique columns it determines. Move the determined columns from the original table to the new table. The determinate becomes the primary key of the new table.
4.
Delete the columns you just moved from the original table except for the determinate which will serve as a foreign key.
5.
To transform SUPPLIER into 3NF, we create a new table called CITY_STATUS and move the columns city and status into it. Status is deleted from the original table, city is left behind to serve as a foreign key to CITY_STATUS, and the original table is renamed to SUPPLIER_CITY to reflect its semantic meaning. The results are shown in Figure 3 below. Figure 3: Tables in 3NF
The results of putting the original table into 3NF has created three tables. These can be represented in "psuedo-SQL" as: PARTS (#s, p#, qty) Primary Key (s#,#p) Foreign Key (s#) references SUPPLIER_CITY.s# SUPPLIER_CITY(s#, city) Primary Key (s#) Foreign Key (city) references CITY_STATUS.city CITY_STATUS (city, status) Primary Key (city)
Advanced Normalization
After 3NF, all normalization problems involve only tables which have three or more columns and all the columns are keys. Many practitioners argue that placing entities in 3NF is generally sufficient because it is rare that entities that are in 3NF are not also in 4NF and 5NF. They further argue that the benefits
gained from transforming entities into 4NF and 5NF are so slight that it is not worth the effort. However, advanced normal forms are presented because there are cases where they are required.
A table is in the fifth normal form (5NF) if it cannot have a lossless decomposition into any number of smaller tables. While the first four normal forms are based on the concept of functional dependence, the fifth normal form is based on the concept of join dependence. Join dependency means that an table, after it has been decomposed into three or more smaller tables, must be capable of being joined again on common keys to form the original table. Stated another way, 5NF indicates when an entity cannot be further decomposed. 5NF is complex and not intuitive. Most experts agree that tables that are in the 4NF are also in 5NF except for "pathological" cases. Teorey suggests that true many-to-many-to-many ternary relations are one such case. Adding an instance to an table that is not in 5NF creates spurious results when the tables are decomposed and then rejoined. For example, let's suppose that we have an employee who uses design skills on one project and programming skills on another. This information is shown below.
prj# 11 28 prj# 11 28 11
Next we add an employee (1544) who uses programming skills on Project 11.
Next, we project this information into three tables as we did above. However, when we rejoin the tables, the recombined table contains spurious results.
prj# 11 11 28 11 11
skill Design Program <<spurious data Program Design <<spurious data Program
By adding one new instance to a table not in 5NF, two false assertions were stated:
Employee 1211 has been assigned to Project 11. Project 11 requires programming skills. Therefore, Employee 1211 must use programming skills while assigned to Project 11.
Assertion 2
Employee 1544 has been assigned to project 11. Project 11 needs Design skills. Therefore, Employee 1544 must use Design skills in Project 11.