Database Normalization
The first thing to notice is that this table serves many purposes, including:
Consider what happens if we move the Chicago office to Evanston, IL. To properly reflect
this in our table, we need to update the entries for all the SalesPersons currently in
Chicago. Our table is a small example, but you can see that if it were larger, this
could potentially involve hundreds of updates.
Also consider what would happen if John Hunt quits. If we remove his entry, then we
lose the information for New York.
These situations are modification anomalies. There are three modification anomalies
that can occur:
Insert Anomaly
There are facts we cannot record until we know information for the entire row. In our
example we cannot record a new sales office until we also know the sales person.
Why? Because in order to create the record, we need to provide a primary key. In our
case this is the EmployeeID.
Update Anomaly
The same information is recorded in multiple rows. For instance if the office number
changes, then there are multiple updates that need to be made. If these updates are
not successfully completed across all rows, then an inconsistency
occurs.
Deletion Anomaly
Deletion of a row can cause more than one set of facts to be removed. For instance,
if John Hunt retires, then deleting that row causes us to lose information about the
New York office.
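To see the three anomalies happen, here is a minimal sketch using Python’s built-in sqlite3 module. The single SalesStaff table, the office numbers, and every row other than John Hunt in New York are invented for illustration. One more assumption: SQLite assigns an INTEGER PRIMARY KEY automatically, so the sketch stands in for the “we must already know the sales person” requirement with a NOT NULL constraint on SalesPerson.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE SalesStaff (
        EmployeeID   INTEGER PRIMARY KEY,
        SalesPerson  TEXT NOT NULL,   -- assumption: a row needs a sales person
        SalesOffice  TEXT NOT NULL,
        OfficeNumber TEXT
    )
""")
# Hypothetical sample rows; only John Hunt / New York comes from the text.
conn.executemany(
    "INSERT INTO SalesStaff VALUES (?, ?, ?, ?)",
    [
        (1003, "Mary Smith", "Chicago", "312-555-0100"),
        (1004, "John Hunt", "New York", "212-555-0100"),
        (1005, "Martin Hap", "Chicago", "312-555-0100"),
    ],
)

# Update anomaly: moving one office means touching every Chicago row.
cur = conn.execute(
    "UPDATE SalesStaff SET SalesOffice = 'Evanston' WHERE SalesOffice = 'Chicago'"
)
rows_updated = cur.rowcount

# Deletion anomaly: removing John Hunt also removes all trace of New York.
conn.execute("DELETE FROM SalesStaff WHERE SalesPerson = 'John Hunt'")
new_york_rows = conn.execute(
    "SELECT COUNT(*) FROM SalesStaff WHERE SalesOffice = 'New York'"
).fetchone()[0]

# Insert anomaly: an office with no sales person yet cannot be recorded.
try:
    conn.execute(
        "INSERT INTO SalesStaff (SalesOffice, OfficeNumber) "
        "VALUES ('Boston', '617-555-0100')"
    )
    insert_rejected = False
except sqlite3.IntegrityError:
    insert_rejected = True

print(rows_updated, new_york_rows, insert_rejected)
```

With only three rows the office move already needs two updates; the count grows with every sales person assigned to that office.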
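-- Find the sales office for a customer; every customer column must be checked
SELECT SalesOffice
FROM SalesStaff
WHERE Customer1 = 'Ford' OR
Customer2 = 'Ford' OR
Customer3 = 'Ford'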
Clearly, if the customers were all in one column, our query would be
simpler. Also, consider what happens if you want to run a query and sort by customer.
The way the table is currently defined, this isn’t possible unless you use three
separate queries with a UNION.
These anomalies can be eliminated or reduced by properly separating
the data into different tables, so that each table serves a single
purpose. The process to do this is called normalization, and the various stages you
can achieve are called the normal forms.
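To make the sorting problem concrete, here is a rough sketch of the UNION workaround, again using Python’s sqlite3 module. The Customer1 column is assumed by analogy with Customer2 and Customer3, and all of the rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE SalesStaff (
        EmployeeID  INTEGER PRIMARY KEY,
        SalesOffice TEXT,
        Customer1   TEXT,
        Customer2   TEXT,
        Customer3   TEXT
    )
""")
# Hypothetical rows; unused customer slots are left NULL.
conn.executemany(
    "INSERT INTO SalesStaff VALUES (?, ?, ?, ?, ?)",
    [
        (1003, "Chicago", "Ford", "GM", None),
        (1004, "New York", "Dell", "HP", "Apple"),
        (1005, "Chicago", "Boeing", None, None),
    ],
)

# One SELECT per customer column, stitched together so the result can be sorted.
union_sql = """
    SELECT Customer1 AS Customer FROM SalesStaff WHERE Customer1 IS NOT NULL
    UNION
    SELECT Customer2 FROM SalesStaff WHERE Customer2 IS NOT NULL
    UNION
    SELECT Customer3 FROM SalesStaff WHERE Customer3 IS NOT NULL
    ORDER BY Customer
"""
sorted_customers = [row[0] for row in conn.execute(union_sql)]
print(sorted_customers)
```

Every extra customer column would need another SELECT in the UNION, which is the kind of upkeep normalization is meant to remove.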
Definition of Normalization
There are three common forms of normalization: 1st, 2nd, and 3rd normal form. There
are several additional forms, such as BCNF, but I consider those advanced, and not
too necessary to learn in the beginning. The forms are progressive, meaning that to
qualify for 3rd normal form a table must first satisfy the rules for 2nd normal form, and
2nd normal form must adhere to those for 1st normal form. Before we discuss the
various forms and rules in detail, let’s summarize the various forms:
First Normal Form – The information is stored in a relational table, each column
contains atomic values, and there are no repeating groups of columns.
Second Normal Form – The table is in first normal form and all the columns depend on the
table’s primary key.
Third Normal Form – The table is in second normal form and none of its columns are
transitively dependent on the primary key.
For the examples, we’ll use the Sales Staff Information shown below as a starting
point. As we pointed out in the last post’s modification anomalies section, there are
several issues with keeping the information in this form. By normalizing the data,
you’ll see that we eliminate duplicate data as well as modification anomalies.
One or more columns, called the primary key, uniquely identify each row.
Each column contains atomic values, and there are not repeating groups of columns.
Tables in first normal form cannot contain sub columns. That is, if you are listing
several cities, you cannot list them in one column and separate them with a
semicolon.
When a value is atomic, it cannot be further subdivided. For example, the value
“Chicago” is atomic, whereas “Chicago; Los Angeles; New York” is not. Related to this
requirement is the concept that a table should not contain repeating groups of columns such
as Customer1Name, Customer2Name, and Customer3Name.
Our example table is transformed to first normal form by placing the repeating
customer-related columns into their own table. This is shown below:
The repeating groups of columns now become separate rows in the Customer table
linked by the EmployeeID foreign key. As mentioned in the lesson on Data Modeling,
a foreign key is a value which matches back to another table’s primary key. In this
case, the customer table contains the corresponding EmployeeID for the
SalesStaffInformation row. Here is our data in first normal form.
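As a rough sketch of this first-normal-form layout, the two tables could be created and queried as follows with Python’s sqlite3 module. The column types, the sample rows, and the composite key on Customer (CustomerID plus EmployeeID) are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE SalesStaffInformation (
        EmployeeID   INTEGER PRIMARY KEY,
        SalesPerson  TEXT,
        SalesOffice  TEXT,
        OfficeNumber TEXT
    );
    -- Each former CustomerN column group becomes one row here.
    CREATE TABLE Customer (
        CustomerID         INTEGER,
        EmployeeID         INTEGER REFERENCES SalesStaffInformation(EmployeeID),
        CustomerName       TEXT,
        CustomerCity       TEXT,
        CustomerPostalCode TEXT,
        PRIMARY KEY (CustomerID, EmployeeID)
    );
""")
# Hypothetical sample data.
conn.executemany(
    "INSERT INTO SalesStaffInformation VALUES (?, ?, ?, ?)",
    [
        (1003, "Mary Smith", "Chicago", "312-555-0100"),
        (1004, "John Hunt", "New York", "212-555-0100"),
    ],
)
conn.executemany(
    "INSERT INTO Customer VALUES (?, ?, ?, ?, ?)",
    [
        (1, 1003, "Ford", "Dearborn", "48123"),
        (2, 1003, "GM", "Detroit", "48213"),
        (3, 1004, "Dell", "Round Rock", "78682"),
    ],
)

# The customer search now tests a single column instead of three.
ford_offices = [row[0] for row in conn.execute("""
    SELECT s.SalesOffice
    FROM SalesStaffInformation AS s
    JOIN Customer AS c ON c.EmployeeID = s.EmployeeID
    WHERE c.CustomerName = 'Ford'
""")]

# Sorting by customer no longer needs a UNION.
customers_sorted = [row[0] for row in conn.execute(
    "SELECT CustomerName FROM Customer ORDER BY CustomerName"
)]
print(ford_offices, customers_sorted)
```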
Now it is time to take a look at the second normal form. I like to think the reason
we place tables in 2nd normal form is to narrow them to a single purpose. Doing so
brings clarity to the database design, makes it easier for us to describe and use a
table, and tends to eliminate modification anomalies.
This stems from the primary key identifying the main topic at hand, such as
buildings, employees, or classes, and the columns serving to add
meaning through descriptive attributes.
An EmployeeID isn’t much on its own, but add a name, height, hair color and age, and
now you’re starting to describe a real person.
All the non-key columns are dependent on the table’s primary key.
We already know about the 1st normal form, but what about the second
requirement? Let me try to explain.
The primary key provides a means to uniquely identify each row in a table. When we
talk about columns depending on the primary key, we mean that, in order to find a
particular value, such as what color Kris’ hair is, you would first have to know the
primary key, such as an EmployeeID, to look up the answer.
Once you identify a table’s purpose, then look at each of the table’s columns and ask
yourself, “Does this column serve to describe what the primary key identifies?”
If you answer “yes,” then the column is dependent on the primary key and belongs in the
table.
If you answer “no,” then the column should be moved to a different table.
When all the columns relate to the primary key, they naturally share a common
purpose, such as describing an employee. That is why I say that when a table is in
second normal form, it has a single purpose, such as storing employee information.
The first issue is the SalesStaffInformation table has two columns which aren’t
dependent on the EmployeeID. Though they are used to describe which office the
SalesPerson is based out of, the SalesOffice and OfficeNumber columns themselves
don’t serve to describe who the employee is.
The second issue is that there are several attributes which don’t completely rely on
the entire Customer table primary key. For a given customer, it doesn’t make sense
that you should have to know both the CustomerID and EmployeeID to find the
customer.
It stands to reason you should only need to know the CustomerID. Given this, the
Customer table isn’t in 2nd normal form as there are columns that aren’t dependent
on the full primary key. They should be moved to another table.
These issues are identified below in red.
In the case of SalesOffice and OfficeNumber, a SalesOffice table was created. A foreign
key was then added to SalesStaffInformation so we can still describe in which office a
sales person is based.
The changes to make Customer a second normal form table are a little
trickier. Rather than move the offending columns CustomerName, CustomerCity, and
CustomerPostalCode to a new table, recognize that the issue is EmployeeID! The three
columns don’t depend on this part of the key. Really, this table is trying to serve two
purposes:
With these changes made the data model, in second normal form, is shown below.
The SalesStaffCustomer table is a strange one. It’s just all keys! This type of table is
called an intersection table. An intersection table is useful when you need to
model a many-to-many relationship.
Each column is a foreign key. If you look at the data model, you’ll notice that there is
a one-to-many relationship to this table from SalesStaffInformation and another from
Customer. In effect, the table allows you to bridge the two tables together.
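Here is a hedged sketch of how such an intersection table bridges the two sides, using Python’s sqlite3 module. The tables are trimmed to a key and a name column each, and all of the rows are hypothetical. Note that SQLite only enforces foreign keys after PRAGMA foreign_keys = ON.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only on request
conn.executescript("""
    CREATE TABLE SalesStaffInformation (
        EmployeeID  INTEGER PRIMARY KEY,
        SalesPerson TEXT NOT NULL
    );
    CREATE TABLE Customer (
        CustomerID   INTEGER PRIMARY KEY,
        CustomerName TEXT NOT NULL
    );
    -- Intersection table: nothing but keys, one row per staff/customer pairing.
    CREATE TABLE SalesStaffCustomer (
        CustomerID INTEGER NOT NULL REFERENCES Customer(CustomerID),
        EmployeeID INTEGER NOT NULL REFERENCES SalesStaffInformation(EmployeeID),
        PRIMARY KEY (CustomerID, EmployeeID)
    );
""")
# Hypothetical data: Ford is served by two sales people.
conn.executemany("INSERT INTO SalesStaffInformation VALUES (?, ?)",
                 [(1003, "Mary Smith"), (1004, "John Hunt")])
conn.executemany("INSERT INTO Customer VALUES (?, ?)",
                 [(1, "Ford"), (2, "GM"), (3, "Dell")])
conn.executemany("INSERT INTO SalesStaffCustomer VALUES (?, ?)",
                 [(1, 1003), (2, 1003), (1, 1004), (3, 1004)])

# Walk the bridge from a customer to its sales people.
ford_staff = [row[0] for row in conn.execute("""
    SELECT s.SalesPerson
    FROM Customer AS c
    JOIN SalesStaffCustomer AS sc ON sc.CustomerID = c.CustomerID
    JOIN SalesStaffInformation AS s ON s.EmployeeID = sc.EmployeeID
    WHERE c.CustomerName = 'Ford'
    ORDER BY s.SalesPerson
""")]

# Walk it the other way, from a sales person to their customers.
hunt_customers = [row[0] for row in conn.execute("""
    SELECT c.CustomerName
    FROM SalesStaffInformation AS s
    JOIN SalesStaffCustomer AS sc ON sc.EmployeeID = s.EmployeeID
    JOIN Customer AS c ON c.CustomerID = sc.CustomerID
    WHERE s.SalesPerson = 'John Hunt'
    ORDER BY c.CustomerName
""")]

# A pairing that points at a customer that does not exist is rejected.
try:
    conn.execute("INSERT INTO SalesStaffCustomer VALUES (99, 1003)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

print(ford_staff, hunt_customers, orphan_rejected)
```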
For all practical purposes this is a pretty workable database. Three out of the four
tables are even in third normal form, but there is one table which still has a minor
issue, preventing it from being so.
The third post focused on the second normal form, its definition, and examples to
hammer it home.
Once a table is in second normal form, we are guaranteed that every column is
dependent on the primary key, or as I like to say, the table serves a single
purpose. But what about relationships among the columns? Could there be
dependencies between columns that could cause an inconsistency?
A table containing columns for both an employee’s age and birth date spells
trouble: there lurks an opportunity for a data inconsistency!
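To illustrate, here is a small Python sketch with a made-up employee record. The age_on helper and the sample values are hypothetical; the point is only that a stored age can drift away from the birth date it is supposed to follow.

```python
from datetime import date

def age_on(birth_date: date, today: date) -> int:
    """Whole years elapsed between birth_date and today."""
    years = today.year - birth_date.year
    # Subtract one if this year's birthday has not happened yet.
    if (today.month, today.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years

# Hypothetical row that stores both facts; Age was never updated.
stored = {"EmployeeID": 1003, "BirthDate": date(1980, 6, 15), "Age": 40}

derived_age = age_on(stored["BirthDate"], date(2024, 1, 10))
is_consistent = stored["Age"] == derived_age
print(derived_age, is_consistent)
```

Storing only the birth date and deriving the age when it is needed removes the chance for the two to disagree.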
It contains only columns that are non-transitively dependent on the primary key
Transitive
When something is transitive, a relationship that holds from the beginning to the
middle, and from the middle to the end, also holds across the whole, from the
beginning to the end. If it helps, think of the prefix trans as meaning “across.”
Since ten is greater than five, and five is greater than three, you can infer that ten is
greater than three.
Transitive Dependence
Now let’s put the two words together to formulate a meaning for transitive
dependence that we can understand and use for database columns.
This can be generalized as being three columns: A, B and PK. If the value of A relies
on PK, and B relies on PK, and A also relies on B, then you can say that A relies on
PK through B. That is, A is transitively dependent on PK.
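One way to get a feel for “relies on” is to test it against sample rows. The sketch below is a hypothetical helper, not a standard tool: determines reports whether each value of one column always pairs with the same value of another. The rows are invented, with CustomerID playing PK, CustomerPostalCode playing B, and CustomerCity playing A. Sample data can only disprove a dependency, never prove it, so treat a True result as a hint.

```python
def determines(rows, lhs, rhs):
    """Return True if every value of column lhs pairs with exactly one value of rhs."""
    seen = {}
    for row in rows:
        key, value = row[lhs], row[rhs]
        # setdefault records the first pairing and returns it on later visits.
        if seen.setdefault(key, value) != value:
            return False
    return True

# Hypothetical customer rows.
customers = [
    {"CustomerID": 1, "CustomerPostalCode": "48123", "CustomerCity": "Dearborn"},
    {"CustomerID": 2, "CustomerPostalCode": "48123", "CustomerCity": "Dearborn"},
    {"CustomerID": 3, "CustomerPostalCode": "78682", "CustomerCity": "Round Rock"},
]

pk_to_b = determines(customers, "CustomerID", "CustomerPostalCode")   # B relies on PK
b_to_a = determines(customers, "CustomerPostalCode", "CustomerCity")  # A relies on B
possible_transitive = pk_to_b and b_to_a

# The reverse direction fails: one city appears with several CustomerIDs.
city_to_pk = determines(customers, "CustomerCity", "CustomerID")
print(possible_transitive, city_to_pk)
```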
Let’s look at some examples to understand further.
Primary Key (PK)    Column A    Column B    Transitive Dependence?
To be non-transitively dependent, then, means that all the columns are dependent on
the primary key (a criterion for 2nd normal form) and on no other columns in the table.
To better visualize this, here are the Customer and PostalCode tables with data.
Now each column in the customer table is dependent on the primary key. Also, the
columns don’t rely on one another for values. Their only dependency is on the
primary key.
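A minimal sketch of this split, using Python’s sqlite3 module, might look like the following. The column names inside the two tables and all of the rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- City depends on the postal code, so it lives with the postal code.
    CREATE TABLE PostalCode (
        PostalCode TEXT PRIMARY KEY,
        City       TEXT NOT NULL
    );
    CREATE TABLE Customer (
        CustomerID   INTEGER PRIMARY KEY,
        CustomerName TEXT NOT NULL,
        PostalCode   TEXT REFERENCES PostalCode(PostalCode)
    );
""")
# Hypothetical sample data: two customers share one postal code.
conn.executemany("INSERT INTO PostalCode VALUES (?, ?)",
                 [("48123", "Dearborn"), ("78682", "Round Rock")])
conn.executemany("INSERT INTO Customer VALUES (?, ?, ?)",
                 [(1, "Ford", "48123"), (2, "Acme Parts", "48123"), (3, "Dell", "78682")])

# The city is stored once per postal code, so a correction is a one-row update.
cur = conn.execute(
    "UPDATE PostalCode SET City = 'Dearborn Heights' WHERE PostalCode = '48123'"
)
rows_updated = cur.rowcount

# Every customer in that postal code picks up the change through the join.
cities = conn.execute("""
    SELECT c.CustomerName, p.City
    FROM Customer AS c
    JOIN PostalCode AS p ON p.PostalCode = c.PostalCode
    ORDER BY c.CustomerName
""").fetchall()
print(rows_updated, cities)
```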
At this point our data model fulfills the requirements for the third normal form. For
most practical purposes this is usually sufficient; however, there are cases where
even further data model refinements can take place. If you are curious to know
about these advanced normalization forms, I would encourage you to read
about BCNF (Boyce-Codd Normal Form) and more!
I think you should normalize if you feel that introducing update or insert anomalies
can severely impact the accuracy or performance of your database application. If
not, then determine whether you can rely on the user to recognize and update the
fields together.
There are times when you’ll intentionally denormalize data. If you need to present
summarized or compiled data to a user, and that data is very time-consuming or
resource-intensive to create, it may make sense to maintain this data separately.