UNIT IV: DATA NORMALIZATION
Informal Design Guidelines for Relational Databases
What is relational database design?
The grouping of attributes to form "good" relation schemas
Two levels of relation schemas
The logical "user view" level
The storage "base relation" level
Design is concerned mainly with base relations
Guidelines:
1. Semantics of the Relation Attributes (the semantics of the attributes should be easy to interpret)
2. Redundant Information in Tuples and Update Anomalies
3. Null Values in Tuples
4. Spurious Tuples
2. Redundant Information in Tuples and Update Anomalies
Information is stored redundantly
Wastes storage
Causes problems with update anomalies
i. Insertion anomalies
ii. Deletion anomalies
iii. Modification anomalies
Anomalies are problems that occur in poorly planned, un-normalized databases
where all the data is stored in one table.
Update anomalies − If copies of a data item are scattered over several places and are not
linked to each other properly, strange situations can arise: when we try to update the
item, a few copies get updated properly while others are left with the old values. Such
instances leave the database in an inconsistent state.
Deletion anomalies − We try to delete a record, but parts of the data are left behind
because, without our being aware of it, the same data is also stored somewhere else.
Insertion anomalies − We try to insert data into a record that does not exist at all.
Normalization is a method to remove all these anomalies and bring the database to a
consistent state.
Eg:
Consider the relation:
EMP_PROJ(Emp#, Proj#, Ename, Pname, No_hours)
Update Anomaly:
Changing the name of project number P1 from “Billing” to “Customer-
Accounting” may cause this update to be made for all 100 employees working
on project P1.
Consider the relation:
EMP_PROJ(Emp#, Proj#, Ename, Pname, No_hours)
Insert Anomaly:
Cannot insert a project unless an employee is assigned to it.
Conversely
Cannot insert an employee unless he/she is assigned to a project.
Consider the relation:
EMP_PROJ(Emp#, Proj#, Ename, Pname, No_hours)
Delete Anomaly:
When a project is deleted, it will result in deleting all the employees who work
on that project.
Alternately, if an employee is the sole employee on a project, deleting that
employee would result in deleting the corresponding project.
Guideline to Redundant Information in Tuples and Update Anomalies
GUIDELINE 2:
Design a schema that does not suffer from the insertion, deletion and update
anomalies.
If there are any anomalies present, then note them so that applications can be
made to take them into account.
3. Null Values in Tuples
GUIDELINE 3:
Relations should be designed such that their tuples will have as few NULL
values as possible
Attributes that are NULL frequently could be placed in separate relations
(with the primary key)
Reasons for nulls:
Attribute not applicable or invalid
Attribute value unknown (may exist)
Value known to exist, but unavailable
4. Generation of Spurious Tuples
Bad designs for a relational database may result in erroneous results for certain JOIN
operations
The "lossless join" property is used to guarantee meaningful results for join operations
A NATURAL JOIN of badly decomposed relations may produce many more tuples than the
original relation (such as EMP_PROJ) contained
These extra tuples are called spurious tuples
They represent spurious information that is not valid (see the sketch after Guideline 4 below)
Guideline 4
Design relation schemas to be joined with equality conditions on attributes that are
appropriately related
Guarantees that no spurious tuples are generated
Avoid relations that contain matching attributes that are not (foreign key, primary key)
combinations
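The effect of Guideline 4 can be seen in a short sketch. The decomposition and the rows below
are hypothetical (they are not the textbook EMP_PROJ data); the point is only that joining
fragments on an attribute that is not a (foreign key, primary key) combination manufactures
tuples that were never in the original relation.

# Minimal sketch with hypothetical data: a bad decomposition whose fragments
# share only a non-key attribute (Hours) generates spurious tuples on join.

emp_proj = [                      # original relation: (Emp, Proj, Hours)
    ("Smith", "P1", 20),
    ("Wong",  "P2", 10),
    ("Lee",   "P3", 20),          # Lee also works 20 hours, but on another project
]

# Bad decomposition: Hours is not a key of either fragment.
r1 = {(emp, hrs) for emp, proj, hrs in emp_proj}    # (Emp, Hours)
r2 = {(proj, hrs) for emp, proj, hrs in emp_proj}   # (Proj, Hours)

# NATURAL JOIN of r1 and r2 on the common attribute Hours.
joined = {(emp, proj, h1) for emp, h1 in r1 for proj, h2 in r2 if h1 == h2}

print(len(emp_proj), len(joined))   # 3 original tuples vs. 5 joined tuples
# The join also contains ("Smith", "P3", 20) and ("Lee", "P1", 20) --
# spurious tuples that do not correspond to any real assignment.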
Summary and Discussion of Design Guidelines
1. Anomalies cause redundant work to be done
2. Waste of storage space due to NULLs
3. Difficulty of performing operations and joins due to NULL values
4. Generation of invalid and spurious data during joins
Functional Dependencies
Formal tool for analysis of relational schemas
Enables us to detect and describe some of the above-mentioned problems in precise terms
Definition: A functional dependency, denoted X->Y, between two sets of attributes X and Y
that are subsets of R specifies a constraint on the possible tuples that can form a relation state
r of R. The constraint is that, for any two tuples t1 and t2 in r that have t1[X]=t2[X], they must
also have t1[Y]= t2[Y].
A functional dependency says that if two tuples have the same values for attributes A1,
A2, ..., An, then those two tuples must also have the same values for attributes B1,
B2, ..., Bn.
Functional dependency is represented by an arrow sign (→) that is, X→Y, where X
functionally determines Y.
Abbreviated as FD
X is called left-hand side of the FD
Y is called right-hand side of the FD
A simple example of a single-valued functional dependency is when A is the primary key of an entity (e.g., SID) and B
is some single-valued attribute of the entity (e.g., Sname). Then A → B must always hold.
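The definition above translates directly into a check over a relation instance. The sketch below
is a minimal, hypothetical implementation (the helper name fd_holds and the sample rows are
assumptions, not part of the text): any two tuples that agree on X must also agree on Y.

def fd_holds(rows, x_attrs, y_attrs):
    """Return True if the FD X -> Y holds in this relation instance."""
    seen = {}
    for row in rows:
        x_val = tuple(row[a] for a in x_attrs)
        y_val = tuple(row[a] for a in y_attrs)
        if x_val in seen and seen[x_val] != y_val:
            return False              # two tuples agree on X but differ on Y
        seen[x_val] = y_val
    return True

# SID functionally determines Sname, but Sname does not determine SID.
students = [
    {"SID": 1, "Sname": "Adam"},
    {"SID": 2, "Sname": "Alex"},
    {"SID": 3, "Sname": "Adam"},      # same name, different SID
]
print(fd_holds(students, ["SID"], ["Sname"]))   # True:  SID -> Sname
print(fd_holds(students, ["Sname"], ["SID"]))   # False: Sname does not determine SID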
Normalization of Database
Database normalization is a technique of organizing the data in the database.
Normalization is a systematic approach of decomposing tables to eliminate data
redundancy and undesirable characteristics like insertion, update and deletion
anomalies.
It is a multi-step process that puts data into tabular form by removing duplicated data
from the relation tables.
Normalization is used mainly for two purposes:
o Eliminating redundant (useless) data.
o Ensuring data dependencies make sense, i.e., data is logically stored.
Definition: The normal form of a relation refers to the highest normal form condition
that it meets, and hence indicates the degree to which it has been normalized.
Without normalization, it becomes difficult to handle and update the database without facing
data loss. Insertion, update and deletion anomalies are very frequent if the database is not
normalized. To understand these anomalies, let us take an example of a Student table.
S_id S_Name S_Address Subject_opted
401 Adam Noida Bio
402 Alex Panipat Maths
403 Stuart Jammu Maths
404 Adam Noida Physics
Update anomaly: To update the address of a student who occurs twice or more in
the table, we have to update the S_Address column in all those rows; otherwise the data
will become inconsistent.
Insertion anomaly: Suppose that for a new admission we have the student id (S_id), name
and address of a student, but the student has not opted for any subject yet; then we have
to insert NULL there, leading to an insertion anomaly.
Deletion anomaly: If student (S_id) 401 has only one subject and temporarily drops it,
deleting that row deletes the entire student record along with it.
Normalization Rules
1. First Normal Form (1NF)
As per First Normal Form, no row of data may contain a repeating group of information,
i.e., each column must hold a single atomic value, so that multiple columns are not needed
to represent the same fact. Each table should be organized into rows, and each row should have a
primary key that distinguishes it as unique.
The primary key is usually a single column, but sometimes more than one column can be
combined to create a single primary key. For example, consider a table which is not in First
Normal Form.
In First Normal Form, a row must not have a column in which more than one value is
stored, e.g., separated by commas. Instead, such data must be separated into multiple
rows. We re-arrange the relation (table) accordingly to convert it to First Normal Form.
After conversion to First Normal Form, data redundancy increases, as there will be many columns with
the same data in multiple rows, but each row as a whole will be unique.
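A minimal sketch of this conversion is shown below, using a hypothetical Python representation
of the table (the dictionaries and the Subjects column are assumptions for illustration only):
each comma-separated value becomes its own row.

# Hypothetical un-normalized rows: Subjects holds comma-separated values.
unnormalized = [
    {"S_id": 401, "S_Name": "Adam", "Subjects": "Bio, Physics"},
    {"S_id": 402, "S_Name": "Alex", "Subjects": "Maths"},
]

# Convert to First Normal Form: one atomic Subject value per row.
first_normal_form = [
    {"S_id": row["S_id"], "S_Name": row["S_Name"], "Subject": subj.strip()}
    for row in unnormalized
    for subj in row["Subjects"].split(",")
]

for row in first_normal_form:
    print(row)
# Each row now holds a single Subject, and (S_id, Subject) can serve as the
# primary key; S_id and S_Name are repeated, which is the increase in
# redundancy mentioned above.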
2. Second Normal Form (2NF)
Definition: A relation schema R is in second normal form (2NF) if it is in 1NF and
every non-prime attribute A in R is fully functionally dependent on the primary key.
As per Second Normal Form, there must not be any partial dependency of any column on
the primary key. For a table that has a concatenated (composite) primary key, each column
that is not part of the primary key must depend on the entire concatenated key for its
existence. If any column depends on only one part of the concatenated key, the table
fails Second Normal Form.
Non-prime attribute − An attribute that is not part of any candidate key is said to
be a non-prime attribute.
In the Student_Project relation here, the prime key attributes are Stu_ID and Proj_ID.
According to the rule, the non-key attributes, i.e. Stu_Name and Proj_Name, must depend
on both of them together and not on either prime key attribute individually. But we find that
Stu_Name can be identified by Stu_ID alone and Proj_Name can be identified by Proj_ID alone.
This is called a partial dependency, which is not allowed in Second Normal Form.
We therefore break the relation into separate relations so that no partial dependency remains.
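The decomposition can be sketched as below. The rows are hypothetical and, for clarity, the
sketch splits the data into three relations (student, project and the enrolment that keeps only
the composite key); this is one possible way of removing the partial dependencies, not
necessarily the exact split shown in the original figure.

# Hypothetical Student_Project rows: (Stu_ID, Proj_ID, Stu_Name, Proj_Name)
student_project = [
    (1, "P1", "Adam", "Billing"),
    (2, "P1", "Alex", "Billing"),
    (1, "P2", "Adam", "Payroll"),
]

# Stu_Name depends only on Stu_ID and Proj_Name only on Proj_ID, so each
# goes into its own relation; the enrolment keeps only the composite key.
student   = {(sid, sname) for sid, pid, sname, pname in student_project}
project   = {(pid, pname) for sid, pid, sname, pname in student_project}
enrolment = {(sid, pid)   for sid, pid, sname, pname in student_project}

print(sorted(student))     # [(1, 'Adam'), (2, 'Alex')]
print(sorted(project))     # [('P1', 'Billing'), ('P2', 'Payroll')]
print(sorted(enrolment))   # only (Stu_ID, Proj_ID) pairs remain here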
3. Third Normal Form (3NF)
Definition: A relation schema R is in third normal form (3NF) if it is in 2NF and no non-
prime attribute A in R is transitively dependent on the primary key.
Third Normal Form requires that every non-prime attribute of a table depend on the
primary key directly; in other words, no non-prime attribute should be determined by another
non-prime attribute. Such a transitive functional dependency must be removed from the table,
and the table must also be in Second Normal Form. For example, consider a table with the
following fields.
Transitive functional dependency: a FD X -> Z that can be derived from
two FDs X -> Y and Y -> Z
In the Student_Detail relation above, Stu_ID is the key and the only prime attribute.
City can be identified by Stu_ID as well as by Zip itself. Zip is not a
superkey, nor is City a prime attribute. Additionally, Stu_ID → Zip → City, so there exists a
transitive dependency.
To bring this relation into third normal form, we break the relation into two relations as
follows −
After the decomposition, Stu_ID is the superkey in the relation Student_Detail and Zip is the
superkey in the relation ZipCodes. So,
Stu_ID → Zip
and
Zip → City,
which confirms that both relations are in BCNF (and hence in 3NF), since the determinant of
each dependency is a superkey of its relation.
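A small sketch of this 3NF decomposition with hypothetical rows (the Zip and City values are
assumptions) shows that each Zip/City pair ends up stored exactly once, so the transitive
dependency disappears.

# Hypothetical un-normalized rows: (Stu_ID, Zip, City); City repeats for
# every student who shares a Zip because of Stu_ID -> Zip -> City.
student_detail_old = [
    (401, 201301, "Noida"),
    (402, 132103, "Panipat"),
    (404, 201301, "Noida"),
]

student_detail = {(sid, zip_) for sid, zip_, city in student_detail_old}   # Stu_ID -> Zip
zip_codes      = {(zip_, city) for sid, zip_, city in student_detail_old}  # Zip -> City

print(sorted(student_detail))   # [(401, 201301), (402, 132103), (404, 201301)]
print(sorted(zip_codes))        # [(132103, 'Panipat'), (201301, 'Noida')] -- stored once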
1. Primary Index
Dense Index
In this case, an index is created on the primary key as well as on the other columns on which we
run queries. That means the user can fire queries not only on the primary key column but on any
column of the table, according to his requirement. An index only on the primary key will not help
in this case; hence index entries are stored for all the search key columns. An index that keeps
an entry for every record of the search key is called a dense index.
For example, a student can be searched by his ID, which is the primary key. In addition, we may
search for a student by first name, last name, a particular age group, place of residence,
course opted for, and so on. That means most of the columns in the table can be used to search
for a student by different criteria. But if we have an index only on the ID, those other
searches will not be efficient; hence indexes on the other search columns are also stored to
make fetches faster.
Though this allows quick search on any search key, the space used for the (index, address)
entries becomes a memory overhead: the (index, address) mapping grows almost as large as the
(record, address) content of the table itself. Hence more and more space is consumed for
indexes as the number of records increases.
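The idea of a dense index can be sketched as a simple in-memory mapping; the record layout and
list positions below stand in for real block addresses and are purely illustrative.

records = [                       # hypothetical student records
    {"ID": 100, "Name": "Adam"},
    {"ID": 101, "Name": "Alex"},
    {"ID": 102, "Name": "Stuart"},
]

# Dense index on ID: one (key, address) entry per record; here the
# "address" is just the position of the record in the list.
dense_index = {rec["ID"]: pos for pos, rec in enumerate(records)}

def lookup(student_id):
    pos = dense_index.get(student_id)
    return records[pos] if pos is not None else None

print(lookup(102))                # {'ID': 102, 'Name': 'Stuart'}
# A second dense index could be built on Name in the same way, at the cost
# of one extra index entry per record for every indexed column.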
Sparse Index
To address the issues of dense indexing, sparse indexing is introduced. In this method of
indexing, a whole range of search key values shares the same data block address in the index,
and when data is to be retrieved, the block is fetched and scanned linearly until the requested
record is found.
Let us see how the above example of a dense index is converted into a sparse index.
In the above diagram we can see that indexes are not stored for all the records; index entries
are stored for only 3 records. Now, if we have to search for a student with ID 102, the entry
for the largest indexed ID less than or equal to 102 is located, which returns the address of
ID 100. From that address the records are fetched and scanned linearly until the record for 102
is found. Hence searching stays reasonably fast while the storage space for indexes is reduced.
The range of key values covered by each index entry can be increased or decreased
depending on the number of records in the table. The main goal of this method is a more
efficient search with less memory space.
But if we have a very huge table, providing a very large range between indexed values will
not work, and the ranges have to be made considerably shorter. In this situation the
(index, address) mapping file again grows, as we saw with dense indexing.
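A sketch of a sparse index over a file sorted on ID is shown below; the block size of 4 and the
generated rows are assumptions chosen only to keep the example small.

from bisect import bisect_right

sorted_records = [{"ID": i} for i in range(100, 112)]   # data file sorted on ID
BLOCK_SIZE = 4

# Sparse index: one (first key in block, block start position) entry per block.
sparse_index = [(sorted_records[p]["ID"], p)
                for p in range(0, len(sorted_records), BLOCK_SIZE)]

def lookup(student_id):
    keys = [k for k, _ in sparse_index]
    i = bisect_right(keys, student_id) - 1        # largest indexed key <= ID
    if i < 0:
        return None
    start = sparse_index[i][1]
    for rec in sorted_records[start:start + BLOCK_SIZE]:   # linear scan in the block
        if rec["ID"] == student_id:
            return rec
    return None

print(lookup(102))   # found via the index entry for ID 100, then a short scan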
2. Secondary Index
With sparse indexing, as the table size grows, the (index, address) mapping file size also
grows. These mappings are usually kept in primary memory so that the address fetch is faster,
and the actual data is later retrieved from secondary memory based on the address obtained from
the mapping. If the size of this mapping grows too large, fetching the address itself becomes
slower and the sparse index is no longer efficient. To overcome this problem, the next
refinement of sparse indexing is introduced: secondary (multilevel) indexing.
In this method, another level of indexing is introduced to reduce the size of the
(index, address) mapping. Initially, large ranges of key values are selected so that the first
level of mapping stays small; each range is then further divided into smaller ranges. The first
level of mapping is stored in primary memory so that the address fetch is fast, while the
second level of mapping and the actual data are stored in secondary memory (hard disk).
In the above diagram, the key values are first divided into groups of 100, and these groups
are stored in primary memory. In secondary memory, each group is further divided into
sub-groups, and the actual data records are stored in the data blocks. Notice that each
first-level index entry points to the first address of its group in the second level, and each
second-level index entry points to the first address of a data block. To search for a value
that falls between these anchors, the corresponding addresses are looked up at the first and
second levels respectively; the search then goes to that address in the data blocks and
performs a linear search to get the data.
For example, to search for 111 in the above diagram, the first-level index is scanned for the
largest key that is less than or equal to 111, which gives 100. Then, in the second-level index
for that group, the largest key less than or equal to 111 is found, which gives 110. The search
now goes to the data block at address 110 and examines each record until it reaches 111. This
is how a search is done in this method; inserting, deleting and updating are handled in the
same manner.
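The two-level lookup described above can be sketched as follows; the group sizes (blocks of 10
records, first-level groups of 2 second-level entries) are scaled-down assumptions compared
with the groups of 100 in the diagram.

from bisect import bisect_right

records = [{"ID": i} for i in range(100, 140)]          # data file sorted on ID

# Second level: one (first ID, block start) entry per block of 10 records.
second_level = [(records[p]["ID"], p) for p in range(0, len(records), 10)]

# First level (kept in primary memory): one entry per group of 2 second-level entries.
first_level = [(second_level[g][0], g) for g in range(0, len(second_level), 2)]

def lookup(student_id):
    # Step 1: first-level index -> position of the group in the second level.
    g = bisect_right([k for k, _ in first_level], student_id) - 1
    if g < 0:
        return None
    lo = first_level[g][1]
    hi = min(lo + 2, len(second_level))
    # Step 2: second-level index (within that group) -> data block start.
    i = lo + bisect_right([k for k, _ in second_level[lo:hi]], student_id) - 1
    start = second_level[i][1]
    # Step 3: linear search inside the data block.
    for rec in records[start:start + 10]:
        if rec["ID"] == student_id:
            return rec
    return None

print(lookup(111))   # first level picks the group at 100, second level the block at 110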
3. Clustering Index
In some cases, the index is created on non-primary-key columns, which may not be unique for
each record. In such cases, in order to identify the records faster, we group two or more
columns together to get unique values and create an index out of them. This method is
known as a clustering index. Basically, records with similar characteristics are grouped
together, and indexes are created for these groups.
For example, the students studying in each semester are grouped together: 1st semester
students, 2nd semester students, 3rd semester students, and so on.
In the above diagram we can see that an index entry is created for each semester in the index
file. In the data blocks, the students of each semester are grouped together to form a cluster,
and the address in the index file points to the beginning of that cluster. Within the data
blocks, the requested student ID is then searched for sequentially.
New records are inserted into the clusters based on their group. In the above case, if a new
student joins the 3rd semester, his record is inserted into the semester-3 cluster in
secondary memory. Updates and deletes are handled in the same manner.
If any cluster runs short of space, new data blocks are added to that cluster.
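A minimal sketch of a clustering index on the Semester column follows; the student rows and the
in-memory "data file" are hypothetical stand-ins for real data blocks.

from collections import defaultdict

students = [
    {"ID": 1, "Name": "Adam",   "Semester": 1},
    {"ID": 2, "Name": "Alex",   "Semester": 2},
    {"ID": 3, "Name": "Stuart", "Semester": 1},
    {"ID": 4, "Name": "Ben",    "Semester": 3},
]

# Group records into clusters by semester (the order of the data file).
clusters = defaultdict(list)
for s in students:
    clusters[s["Semester"]].append(s)

data_file, clustering_index = [], {}
for sem in sorted(clusters):
    clustering_index[sem] = len(data_file)      # index points to the cluster start
    data_file.extend(clusters[sem])

def find(semester, student_id):
    start = clustering_index.get(semester)
    if start is None:
        return None
    for rec in data_file[start:]:               # sequential search inside the cluster
        if rec["Semester"] != semester:
            break
        if rec["ID"] == student_id:
            return rec
    return None

print(find(1, 3))   # {'ID': 3, 'Name': 'Stuart', 'Semester': 1}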
This method of file organization provides a clean distribution of records compared to other
methods, and hence makes searching easier and faster. But within each cluster there may be
unused space left, so it takes more storage than other methods.