Unit-IV: I. Pitfalls in Relational Database Design
Example
Consider the relation schema:
Lending-schema = (branch-name, branch-city, assets,
customer-name, loan-number, amount)
Redundancy:
Data for branch-name, branch-city, assets are repeated for each loan that a
branch makes
Wastes space
Complicates updating, introducing possibility of inconsistency of assets value
Null values:
Cannot store information about a branch if no loans exist
Can use null values, but they are difficult to handle.
II. ANOMALIES
REDUNDANT INFORMATION IN TUPLES AND UPDATE ANOMALIES
One goal of schema design is to minimize the storage space used by the base relations.
Attributes of different entities (EMPLOYEEs, DEPARTMENTs, PROJECTs) should
not be mixed in the same relation.
o Mixing attributes of multiple entities may cause problems
Only foreign keys should be used to refer to other entities.
Entity and relationship attributes should be kept apart as much as possible.
Redundant information wastes storage and leads to inconsistency.
Problems with update anomalies
o Insertion anomalies
o Deletion anomalies
o Modification anomalies
Update Anomaly:
Changing the name of project number 20 from “Reorganization” to “Customer-
Accounting” may cause this update to be made for all 100 employees working on
project 20.
Insertion Anomaly:
A new project cannot be recorded until at least one employee is assigned to it.
Deletion Anomaly:
When a project is deleted, it will result in deleting all the employees who work on that
project. Alternately, if an employee is the sole employee on a project, deleting that
employee would result in deleting the corresponding project.
GOAL
Design a schema that does not suffer from insertion, deletion and update anomalies.
Design Goals:
1. Avoid redundant data
2. Ensure that relationships among attributes are represented
3. Facilitate the checking of updates for violation of database integrity constraints.
Consider the relation schema EMP_PROJ in Figure 1.35. From the semantics of the
attributes and the relation, the following functional dependencies should hold:
FD1. SSN -> ENAME
o Social security number determines employee name
FD2. PNUMBER -> {PNAME, PLOCATION}
o Project number determines project name and location
FD3. {SSN, PNUMBER} -> HOURS
o The combination of employee SSN and project number determines the number of
hours per week the employee works on the project.
IR1 (reflexivity), IR2 (augmentation), and IR3 (transitivity) form a sound and complete
set of inference rules.
The complete set of all possible dependencies, the closure F+, can be deduced from F by
applying IR1, IR2, and IR3 (completeness property).
IV. NORMALIZATION
The normalization process, as first proposed by Codd (1972a), takes a relation schema through
a series of tests to certify whether it satisfies a certain normal form.
Definition: The normal form of a relation is the highest normal-form condition that it
meets; at minimum, a relation may not be nested within another relation.
Normalization of data:
Database designers need not normalize to the highest possible normal form; in practice,
relations are usually normalized only up to 3NF, BCNF, or 4NF.
1. LOSSLESS DECOMPOSITION
Decomposition must be lossless. It means that the information should not get lost
from the relation that is decomposed.
It guarantees that the natural join of the decomposed relations results in exactly the
original relation.
Example:
Let E be the relation schema, with instance e, decomposed into E1, E2, E3, ..., En with
instances e1, e2, e3, ..., en. If e1 ⋈ e2 ⋈ e3 ⋈ ... ⋈ en = e, then it is called
a 'Lossless Join Decomposition'.
In other words, if the natural join of all the decomposed instances gives back the
original relation, the decomposition is said to be lossless.
To check whether a decomposition is lossless or lossy, decompose a relation into two
relations, Employee and Department.
If the <Employee> table contains (Eid, Ename, Age, City, Salary) and the <Department>
table contains (Deptid, DeptName), then it is not possible to join the two relations,
because there is no common column between them. This is a Lossy Join Decomposition;
a small sketch follows.
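A minimal Python sketch with hypothetical data (relation and attribute names are illustrative): the natural join of a lossless decomposition reproduces exactly the original tuples, while a decomposition whose projections share no column degenerates into a Cartesian product and is lossy.

def project(rows, attrs):
    # Project rows (dicts) onto attrs, dropping duplicate tuples.
    seen = {tuple(sorted((a, r[a]) for a in attrs)) for r in rows}
    return [dict(t) for t in seen]

def natural_join(r1, r2):
    # Combine rows that agree on all shared attributes.
    shared = set(r1[0]) & set(r2[0]) if r1 and r2 else set()
    return [{**a, **b} for a in r1 for b in r2
            if all(a[c] == b[c] for c in shared)]  # no shared columns => Cartesian product

def canon(rows):
    # Order-independent canonical form for comparing relations.
    return sorted(tuple(sorted(r.items())) for r in rows)

emp = [{"Eid": 1, "Ename": "Asha", "Deptid": 10},
       {"Eid": 2, "Ename": "Ravi", "Deptid": 20}]

# Lossless: both projections keep the common column Deptid.
lossless = natural_join(project(emp, ["Eid", "Ename", "Deptid"]),
                        project(emp, ["Deptid"]))
assert canon(lossless) == canon(emp)

# Lossy: the projections share no column, so tuples cannot be matched back up.
lossy = natural_join(project(emp, ["Eid", "Ename"]),
                     project(emp, ["Deptid"]))
print(len(lossy))  # 4 rows instead of the original 2: information is lost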
2. DEPENDENCY PRESERVATION
Dependency is an important constraint on the database.
Every dependency must be satisfied by at least one decomposed table.
If we decompose a relation R into relations R1 and R2, all dependencies of R must either
be part of R1 or R2, or be derivable from the combination of the FDs of R1 and R2.
For example, a relation R(A, B, C, D) with FD set {A -> BC} is decomposed into
R1(ABC) and R2(AD); this decomposition is dependency preserving
because the FD A -> BC is part of R1(ABC). A sketch of the simple containment check follows.
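A minimal Python sketch of the simple (sufficient) containment test: an FD X -> Y is directly preserved if X ∪ Y fits inside some decomposed relation. The full test also admits FDs derivable from the union of the projected FDs, which requires an attribute-closure computation (a closure sketch appears later in this unit).

# FD set of R(A, B, C, D): A -> BC, represented as (lhs, rhs) attribute sets.
fds = [({"A"}, {"B", "C"})]
decomposition = [{"A", "B", "C"},   # R1(ABC)
                 {"A", "D"}]        # R2(AD)

def directly_preserved(fd, relations):
    # True if some decomposed relation contains all attributes of the FD.
    lhs, rhs = fd
    return any(lhs | rhs <= attrs for attrs in relations)

for fd in fds:
    print(fd, "preserved:", directly_preserved(fd, decomposition))
# ({'A'}, {'B', 'C'}) preserved: True  -- A -> BC is part of R1(ABC)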
3. FIRST NORMAL FORM (1NF)
A relation is in 1NF if the value of every attribute is atomic. A DEPARTMENT relation
with a multivalued Dlocations attribute violates 1NF; three techniques can achieve 1NF.
First technique:
1. Remove the attribute Dlocations and place it in a separate relation DEPT_LOCATIONS,
along with the primary key Dnumber.
2. The primary key of this relation is the combination {Dnumber, Dlocation}.
3. A distinct tuple in DEPT_LOCATIONS exists for each location of a department.
4. This decomposes the non-1NF relation into two 1NF relations.
Second Technique:
1. Expand the key so that there is a separate tuple in the original DEPARTMENT
relation for each location of a department, as depicted in Fig. 1.37.
2. The primary key becomes the combination {Dnumber, Dlocation}.
3. Disadvantage: introducing redundancy in the relation.
Third technique:
1. If a maximum number of values is known for the attribute—for example, if it is known that
at most three locations can exist for a department—replace the Dlocations attribute by three
atomic attributes: Dlocation1, Dlocation2, and Dlocation3.
2. Disadvantage: Introducing NULL values if most departments have fewer than three
locations.
The first solution is considered best because it does not suffer from redundancy and it
is completely general, having no limit placed on a maximum number of values.
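A minimal Python sketch of the first technique (data values are illustrative): the multivalued Dlocations attribute is moved into a separate DEPT_LOCATIONS relation keyed by {Dnumber, Dlocation}, leaving two 1NF relations.

# Non-1NF input: Dlocations holds a set of values per department.
departments = [
    {"Dnumber": 5, "Dname": "Research", "Dlocations": {"Bellaire", "Sugarland", "Houston"}},
    {"Dnumber": 4, "Dname": "Administration", "Dlocations": {"Stafford"}},
]

# 1NF relation DEPARTMENT: the multivalued attribute is dropped.
department = [{"Dnumber": d["Dnumber"], "Dname": d["Dname"]} for d in departments]

# 1NF relation DEPT_LOCATIONS: one tuple per (Dnumber, Dlocation) pair.
dept_locations = [{"Dnumber": d["Dnumber"], "Dlocation": loc}
                  for d in departments for loc in sorted(d["Dlocations"])]

for row in dept_locations:
    print(row)  # every attribute value is now atomic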
Definition of 2NF:
A relation schema R is in Second Normal Form (2NF) if every non-prime attribute A in R
is fully functionally dependent on every key of R.
In EMP_PROJ, the non-prime attributes Ename, Pname, and Plocation are not fully
functionally dependent on the primary key {Ssn, Pnumber}: Ename depends on Ssn alone
(FD1), while Pname and Plocation depend on Pnumber alone (FD2).
Each Ssn and Pnumber value can appear multiple times in EMP_PROJ, but this is
acceptable because the combination {Ssn, Pnumber} is the primary key.
The redundancy of Ename, Pname, and Plocation is, however, avoidable by breaking
down or decomposing the original relation EMP_PROJ.
It is represented in Fig 1.38 as follows.
Fig 1.38. EMP_PROJ decomposed into three 2NF relations, one per functional dependency:
R1(Ssn, Pnumber, Hours) for FD3, R2(Ssn, Ename) for FD1, and
R3(Pnumber, Pname, Plocation) for FD2.
Example - II
Second normal form says that every non-prime attribute must be fully functionally
dependent on the whole key. That is, if X → A holds, then there should be no proper
subset Y of X for which Y → A also holds.
In the Student_Project relation (Ref Fig 1.39), the prime attributes are Stu_ID and
Proj_ID. According to the rule, the non-key attributes Stu_Name and Proj_Name must
depend on both together, not on either prime attribute individually. But Stu_Name can be
identified by Stu_ID alone and Proj_Name by Proj_ID alone. This is called partial
dependency, which is not allowed in Second Normal Form. So the relation is decomposed
as follows.
We broke the relation, as in Fig 1.40, into two relations to bring it into 2NF, so no
partial dependency remains and the relation is in 2NF. A sketch of the decomposition follows.
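A minimal Python sketch of the decomposition (sample data is illustrative; the student-project relationship is kept as a third projection here so the join stays lossless, a detail a two-relation figure may leave implicit).

student_project = [
    {"Stu_ID": 1, "Proj_ID": "P1", "Stu_Name": "Anu",  "Proj_Name": "Payroll"},
    {"Stu_ID": 1, "Proj_ID": "P2", "Stu_Name": "Anu",  "Proj_Name": "Library"},
    {"Stu_ID": 2, "Proj_ID": "P1", "Stu_Name": "Balu", "Proj_Name": "Payroll"},
]

# After decomposition every non-prime attribute depends on a whole key.
student = {(r["Stu_ID"], r["Stu_Name"]) for r in student_project}    # Stu_ID -> Stu_Name
project = {(r["Proj_ID"], r["Proj_Name"]) for r in student_project}  # Proj_ID -> Proj_Name
works_on = {(r["Stu_ID"], r["Proj_ID"]) for r in student_project}    # key {Stu_ID, Proj_ID}

print(sorted(student))  # [(1, 'Anu'), (2, 'Balu')] -- each name stored once
print(sorted(project))  # [('P1', 'Payroll'), ('P2', 'Library')]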
Definition of 3NF:
A relation schema R is in Third Normal Form (3NF) if every non-prime attribute A in
R is non-transitively dependent on every superkey of R.
Example - I:
The FDs are:
FD1. SSN → {ENAME, DOB, ADDRESS, DNUMBER}
FD2. DNUMBER → {DMGRSSN, DLOC}
Here
SSN → DNUMBER → DMGRSSN
is a transitive dependency, so the relation can be decomposed into
R1 (SSN, ENAME, DOB, ADDRESS, DNUMBER)
R2 (DNUMBER, DMGRSSN, DLOC)
In the corresponding figure, FD1 holds in R1 and FD2 holds in R2.
Example - II:
Fig 1.42 depicts the Student_detail relation, in which Stu_ID is the key and the only
prime attribute. We find that City can be identified by Stu_ID as well as by Zip itself.
Zip is not a superkey and City is not a prime attribute; additionally,
Stu_ID → Zip → City, so a transitive dependency exists.
Fig. 1.43. Relation in 3NF
We broke the relation as in Fig 1.43 into two relations to bring it into 3NF.
Definition of BCNF:
A relation schema R is in Boyce-Codd Normal Form (BCNF) if, whenever a nontrivial
functional dependency X → A holds in R, X is a superkey of R.
Example:
Consider a relation TEACH (Ref Fig 1.44) with the following dependencies:
FD1. {Student, Course} → Instructor
FD2. Instructor → Course
Decomposition of this relation schema into two schemas is not straightforward because it
may be decomposed into one of the three following possible pairs:
1. {Student, Instructor} and {Student, Course}
2. {Course, Instructor} and {Course, Student}
3. {Instructor, Course} and {Instructor, Student}
All three decompositions lose the functional dependency FD1.
Out of the above three, only the third decomposition will not generate spurious tuples
after a join (hence it has the non-additivity property).
A relation not in BCNF should be decomposed so as to meet this property; non-additive
decomposition is a must during normalization.
The BCNF relations of Fig 1.44 are shown in Fig 1.44(a); a join sketch follows the figure.
Fig 1.44(a). The BCNF relations Ins_Course(Instructor, Course) and
Ins_Student(Instructor, Student).
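A minimal Python sketch with sample TEACH tuples (data illustrative) comparing decomposition 1 with decomposition 3: rejoining decomposition 1 manufactures spurious tuples, while decomposition 3 reproduces TEACH exactly.

# TEACH tuples as (Student, Course, Instructor).
teach = {("Narayan", "Database", "Mark"),
         ("Smith", "Database", "Navathe"),
         ("Smith", "Operating Systems", "Ammar"),
         ("Wallace", "Database", "Mark"),
         ("Wong", "Database", "Omiecinski")}

# Decomposition 1: {Student, Instructor} and {Student, Course}, rejoined on Student.
stu_ins = {(s, i) for s, c, i in teach}
stu_crs = {(s, c) for s, c, i in teach}
rejoin1 = {(s, c, i) for s, i in stu_ins for s2, c in stu_crs if s == s2}

# Decomposition 3: {Instructor, Course} and {Instructor, Student}, rejoined on Instructor.
ins_crs = {(i, c) for s, c, i in teach}
ins_stu = {(i, s) for s, c, i in teach}
rejoin3 = {(s, c, i) for i, c in ins_crs for i2, s in ins_stu if i == i2}

print(len(teach), len(rejoin1), len(rejoin3))  # 5 7 5: rejoin1 contains spurious tuples
assert rejoin3 == teach  # decomposition 3 is non-additive (lossless)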
Definition of 4NF:
A relation schema R is in Fourth Normal Form (4NF) if, for every nontrivial multivalued
dependency X →→ Y that holds in R, X is a superkey of R.
Fig 1.45. The EMP relation with two MVDs: Ename →→ Pname and Ename →→ Dname
Fig 1.46. Decomposing the EMP relation into two 4NF relations EMP_PROJECTS and
EMP_DEPENDENTS
Example (computing the closure F+ of a set F of functional dependencies):
R = (A, B, C, G, H, I)
F = { A → B, A → C, CG → H, CG → I, B → H }
Some members of F+:
A → H: by transitivity, from A → B and B → H
AG → I: by augmenting A → C with G to get AG → CG, and then transitivity with CG → I
CG → HI: from CG → H and CG → I. The 'union rule' can be inferred from the
definition of functional dependencies, or by augmentation of CG → I to infer CG → CGI,
augmentation of CG → H to infer CGI → HI, and then transitivity.
F+ can be computed by repeatedly applying the inference rules:
F+ = F
repeat
    for each functional dependency f in F+
        apply reflexivity and augmentation rules on f
        add the resulting functional dependencies to F+
    for each pair of functional dependencies f1 and f2 in F+
        if f1 and f2 can be combined using transitivity
            then add the resulting functional dependency to F+
until F+ does not change any further
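Because computing all of F+ this way is exponential in the worst case, the practical membership test uses the attribute-closure algorithm instead: X+ is the set of attributes functionally determined by X, and X → Y belongs to F+ exactly when Y ⊆ X+. A minimal Python sketch, verifying the derived dependencies from the example above:

def closure(attrs, fds):
    # Compute X+: repeatedly absorb the right side of any FD whose left side is covered.
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

F = [({"A"}, {"B"}), ({"A"}, {"C"}), ({"C", "G"}, {"H"}),
     ({"C", "G"}, {"I"}), ({"B"}, {"H"})]

assert {"H"} <= closure({"A"}, F)            # A -> H is in F+
assert {"I"} <= closure({"A", "G"}, F)       # AG -> I is in F+
assert {"H", "I"} <= closure({"C", "G"}, F)  # CG -> HI is in F+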
XIV. DENORMALIZATION
Denormalization is the process of storing the join of higher normal form relations as a
base relation, which is in a lower normal form. It is the process of attempting to optimize the
read performance of a database by adding redundant data or by grouping data. It is also a
means of addressing performance or scalability in relational database software.
A normalized database is the starting point for the denormalization process. It is important
to distinguish between a database that has never been normalized and one that was
normalized first and then denormalized later. The latter is acceptable; the former is often
the result of bad database design or a lack of knowledge.
Methods of De-normalization
A few denormalization methods are discussed below.
1. Adding Redundant columns
2. Adding derived columns
3. Collapsing the tables
4. Snapshots
5. VARRAYS
6. Materialized Views
Adding Redundant columns
A redundant column that is frequently used in joins is added to the main table; the other
table is retained as it is.
For example, consider the EMPLOYEE and DEPT tables. Suppose we have to generate a
report showing employee details along with the department name. Normally we would need
to join EMPLOYEE with DEPT to get the department name; by copying the DeptName
column into EMPLOYEE, the report no longer needs the join (a sketch follows).
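A minimal sketch using Python's built-in sqlite3 module (table and column names are illustrative): the department name is copied into EMPLOYEE as a redundant column so the report query no longer needs the join.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dept (deptid INTEGER PRIMARY KEY, dname TEXT);
    CREATE TABLE employee (eid INTEGER PRIMARY KEY, ename TEXT,
                           deptid INTEGER REFERENCES dept(deptid));
    INSERT INTO dept VALUES (10, 'Sales'), (20, 'Research');
    INSERT INTO employee VALUES (1, 'Asha', 10), (2, 'Ravi', 20);
""")

# Denormalize: add the redundant column and populate it from DEPT.
conn.execute("ALTER TABLE employee ADD COLUMN dept_name TEXT")
conn.execute("""UPDATE employee
                SET dept_name = (SELECT dname FROM dept
                                 WHERE dept.deptid = employee.deptid)""")

# The report now reads a single table; the cost is keeping dept_name
# in sync whenever DEPT.dname changes.
print(conn.execute("SELECT ename, dept_name FROM employee").fetchall())
# [('Asha', 'Sales'), ('Ravi', 'Research')]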
Adding derived columns
Suppose we have a STUDENT table with student details such as ID, name, address, and
course, and another table MARKS with the student's internal marks in different subjects.
To generate a report for an individual student with his details, total marks, and grade, we
would have to query the STUDENT table, join the MARKS table to calculate the total
marks across subjects, and derive the grade in the same select query before printing the
report. Storing the total as a derived column in STUDENT avoids this work at retrieval
time (a sketch follows).
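A minimal sqlite3 sketch of a derived column (names illustrative): total_marks is stored in STUDENT and refreshed at insert time, so the report performs no aggregation.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE student (sid INTEGER PRIMARY KEY, sname TEXT,
                          total_marks INTEGER DEFAULT 0);  -- derived column
    CREATE TABLE marks (sid INTEGER REFERENCES student(sid),
                        subject TEXT, score INTEGER);
    INSERT INTO student (sid, sname) VALUES (1, 'Anu');
""")

def add_mark(sid, subject, score):
    # Insert a mark, then refresh the derived total at write time.
    conn.execute("INSERT INTO marks VALUES (?, ?, ?)", (sid, subject, score))
    conn.execute("""UPDATE student
                    SET total_marks = (SELECT SUM(score) FROM marks
                                       WHERE marks.sid = student.sid)
                    WHERE sid = ?""", (sid,))

add_mark(1, "DBMS", 78)
add_mark(1, "OS", 85)

# The report reads the precomputed total; no join or SUM at retrieval time.
print(conn.execute("SELECT sname, total_marks FROM student").fetchall())
# [('Anu', 163)]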
Collapsing the tables
In this method, frequently used tables are combined into one table to reduce the joins among
the tables, thus increasing the performance of retrieval queries. Combining the tables can
introduce redundancy, but this is accepted as long as it does not affect the meaning of
other records in the table.
For example, after denormalization of STUDENT and ADDRESS, it should have all the
students with correct address. It should not lead to wrong address of students.
In addition to collapsing tables, we can also duplicate or even split tables if doing so
improves query performance, though duplicating and splitting are not, strictly speaking,
methods of denormalization.
Snapshots
This is one of the earliest methods of creating data redundancy. The database tables are
duplicated and stored on various database servers, and refreshed at specific intervals to
keep the copies consistent. Users located at different places can access the server nearest
to them and hence retrieve the data quickly; they need not reach tables located on remote
servers. This makes access faster.
VARRAYS
In this method tables are created as VARRAY tables, where repeating groups of values are
stored in a single table. This VARRAY method overrides the 1NF condition that each
column value must be atomic: it allows the repeating values to be stored in separate
columns of the same record.
Consider the example of STUDENT and MARKS: say the MARKS table has marks in
three subjects for each student; in a VARRAY table these would be stored as three mark
columns of a single record.
Materialized Views
Materialized views are similar to tables: all the columns and derived values are
pre-calculated and stored. If a query matches the query that defines a materialized view,
it can be answered from the view; since the view already holds the joined columns and
pre-calculated values, nothing needs to be computed again, which reduces the time
consumed by the query.
The only problem with a materialized view is that it is not refreshed automatically, like
an ordinary view, when the table data changes. We have to refresh it explicitly to get
correct data from the materialized view.
Advantages and Disadvantages of De-normalization
Advantages of De-normalization
Minimizes the table joins
It reduces the number of foreign keys and indexes. This saves storage and reduces
data-manipulation time.
If aggregation columns are used to denormalize, the computations are carried out at
data-manipulation time rather than at retrieval time. For instance, if 'total marks' is a
denormalized column, the total is calculated and updated when the related rows (the
student's details and his marks) are inserted; when we later query the STUDENT table
for his details and marks, the total need not be recalculated. Hence it saves retrieval time.
It reduces the number of tables in the database. As the number of tables increases, the
mappings, joins, and memory usage increase as well.
Disadvantages of De-normalization
Although it supports faster retrieval, it slows down data manipulation. If a
column is frequently updated, the speed of updates suffers.
If the requirements change, the data and tables must be analyzed again to understand
the performance implications. Denormalization is therefore specific to the requirement
or application it serves.
The complexity of the code and the number of tables depend on the requirement or
application; denormalization can increase or decrease the number of tables. The code
may also become more complex because of the redundancy in the tables. Hence it needs
a thorough analysis of the requirements, queries, data, etc.
Difference between Normalization and Denormalization:
Normalization and denormalization are two processes that are completely opposite.
Normalization is the process of dividing larger tables into smaller ones to reduce
redundant data, while denormalization is the process of adding redundant data to
optimize performance.
Normalization is carried out to prevent database anomalies.
Denormalization is usually carried out to improve the read performance of the
database, but because the redundant copies introduced by denormalization must be kept
consistent, writes (i.e. insert, update and delete operations) can become slower.
Therefore, a denormalized database can offer worse write performance than a
normalized database.
It is often recommended that you should “normalize until it hurts, denormalize until it
works.”