Unit 3 - Database Management System - WWW - Rgpvnotes.in
Unit-III
Normalization:
Normalization is a database design technique that organizes tables in a manner that reduces redundancy and dependency of data. It divides larger tables into smaller tables and links them using relationships.
If a database design is not perfect, it may contain anomalies, which are like a bad dream for any database
administrator. Managing a database with anomalies is next to impossible.
Update anomalies − If data items are scattered and are not linked to each other properly, then it could lead
to strange situations. For example, when we try to update one data item having its copies scattered over
several places, a few instances get updated properly while a few others are left with old values. Such instances
leave the database in an inconsistent state.
Deletion anomalies − We tried to delete a record, but parts of it were left undeleted because, unknown to us, the data is also saved somewhere else.
Insert anomalies − We tried to insert data in a record that does not exist at all.
Normalization is a method to remove all these anomalies and bring the database to a consistent state.
Here are the most commonly used normal forms:
First normal form(1NF)
Second normal form(2NF)
Third normal form(3NF)
Boyce & Codd normal form (BCNF)
Normal forms:
First normal form (1NF)
As per the rule of first normal form, an attribute (column) of a table cannot hold multiple values. It should hold
only atomic values.
Example: Suppose a company wants to store the names and contact details of its employees. It creates a table
that looks like this:
emp_id emp_name emp_address emp_mobile
102 Jon Kanpur 8812121212, 9900012222
104 Lester Bangalore 9990000123, 8123450987
Two employees (Jon & Lester) have two mobile numbers each, so the company stored them in the same field, as you can see in the table above.
This table is not in 1NF: the rule says "each attribute of a table must have atomic (single) values", and the emp_mobile values for employees Jon & Lester violate that rule.
To make the table comply with 1NF we should store each mobile number in its own row:
emp_id emp_name emp_address emp_mobile
102 Jon Kanpur 8812121212
102 Jon Kanpur 9900012222
104 Lester Bangalore 9990000123
104 Lester Bangalore 8123450987
Second normal form (2NF)
A table is in 2NF if it is in 1NF and no non-prime attribute depends on a proper subset of any candidate key (i.e. there is no partial dependency). Example: a teacher table whose candidate key is {teacher_id, subject}:
teacher_id subject teacher_age
111 Maths 38
111 Physics 38
222 Biology 38
333 Physics 40
333 Chemistry 40
Here teacher_age depends only on teacher_id, which is a proper part of the key, so the table is not in 2NF. To fix it, we split the table in two:
teacher_details table:
teacher_id teacher_age
111 38
222 38
333 40
teacher_subject table:
teacher_id subject
111 Maths
111 Physics
222 Biology
333 Physics
333 Chemistry
Now the tables comply with Second normal form (2NF).
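The 1NF fix above (one atomic mobile number per row) can be sketched in code. This is a minimal illustration; the column names and helper function are assumptions for the example, not part of any DBMS API:

```python
# Flatten a non-1NF table (multi-valued emp_mobile field) into 1NF rows.
# Column names (emp_id, emp_name, emp_mobile) are illustrative assumptions.
def to_1nf(rows):
    atomic = []
    for r in rows:
        for mobile in r["emp_mobile"]:          # one output row per value
            atomic.append({**r, "emp_mobile": mobile})
    return atomic

non_1nf = [
    {"emp_id": 102, "emp_name": "Jon",    "emp_mobile": ["8812121212", "9900012222"]},
    {"emp_id": 104, "emp_name": "Lester", "emp_mobile": ["9990000123", "8123450987"]},
]
flat = to_1nf(non_1nf)   # 4 rows, each with a single atomic mobile number
```

Each output row now holds exactly one atomic value in emp_mobile, which is the 1NF requirement.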
Third normal form (3NF)
A table is in 3NF if it is in 2NF and no non-prime attribute depends transitively on a candidate key. For example, if an Employee table stores each employee's zip code along with the city (which depends on the zip code, not on the employee), the transitive dependency is removed by splitting into two tables:
Employee table:
Employee_zip table:
Boyce & Codd normal form (BCNF)
A table is in BCNF if, for every functional dependency X → Y that holds in it, X is a super key. To make the table comply with BCNF we can break it into three tables like this:
Emp_nationality table:
emp_id emp_nationality
1001 Indian
1002 American
Emp_dept table:
Emp_dept_mapping table:
emp_id emp_dept
1001 stores
Functional Dependency:
Functional dependency (FD) is a constraint between two sets of attributes in a relation. Functional dependency says that if two tuples have the same values for attributes A1, A2, ..., An, then those two tuples must also have the same values for attributes B1, B2, ..., Bn.
Functional dependency is represented by an arrow sign (→); that is, X → Y, where X functionally determines Y.
The left-hand side attributes determine the values of attributes on the right-hand side.
Fig: part of the COMPANY schema: DEPARTMENT, DEPT_LOCATIONS(DNUMBER, DLOCATION), PROJECT(PNAME, PNUMBER, PLOCATION, DNUM) and WORKS_ON(SSN, PNUMBER, HOURS), with their PK and FK attributes marked.
Partial Dependency –
Given a relation R(A, B, C, D, E) with candidate key {A, B}, any FD whose left-hand side is a proper subset of the key is a partial dependency. Then { A → C; A → D; A → E; B → C; B → D; B → E }
all are Partial Dependencies.
Transitive Dependency –
Given a relation R(A,B,C), dependencies like A → B and B → C form a transitive dependency, since A → C is implied.
In the above Fig,
SSN --> DMGRSSN is a transitive FD
{since SSN --> DNUMBER and DNUMBER --> DMGRSSN hold}
FD Axioms
Armstrong's axioms for FDs:
Reflexivity: if Y ⊆ X, then X → Y.
Augmentation: if X → Y, then XZ → YZ.
Transitivity: if X → Y and Y → Z, then X → Z.
Understanding: Functional dependencies are recognized by analysis of the real world; there is no automation or algorithm for this. Finding or recognizing them is the database designer's task.
FD manipulations:
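Whether a given FD holds on a concrete relation instance can be checked mechanically: no two tuples may agree on X while disagreeing on Y. A minimal sketch (the teacher data mirrors the 2NF example; the function name is an assumption for illustration):

```python
# Check whether the FD X -> Y holds in a relation instance (list of dict rows).
def fd_holds(rows, X, Y):
    seen = {}
    for r in rows:
        lhs = tuple(r[a] for a in X)
        rhs = tuple(r[a] for a in Y)
        if lhs in seen and seen[lhs] != rhs:
            return False            # same X values, different Y values
        seen[lhs] = rhs
    return True

teacher = [
    {"teacher_id": 111, "subject": "Maths",   "teacher_age": 38},
    {"teacher_id": 111, "subject": "Physics", "teacher_age": 38},
    {"teacher_id": 333, "subject": "Physics", "teacher_age": 40},
]
# teacher_id -> teacher_age holds; subject -> teacher_age does not
# (subject "Physics" occurs with two different ages).
```

Note that an instance can only refute an FD; confirming one on a sample never proves it holds for the real world, which is why FD discovery remains the designer's task.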
Decomposition:
Decomposition in DBMS is the mechanism by which Normalization is carried out. Normal forms, i.e. the concept of Normalization, make a relation or table free from insert/update/delete anomalies and save space by removing duplicate data.
Decomposition is the process of breaking down in parts or elements.
It replaces a relation with a collection of smaller relations.
It breaks the table into multiple tables in a database.
It should always be lossless, because it confirms that the information in the original relation can be accurately
reconstructed based on the decomposed relations.
If there is no proper decomposition of the relation, then it may lead to problems like loss of information.
Properties of Decomposition
1. Lossless Decomposition
2. Dependency Preservation
3. Lack of Data Redundancy
1. Lossless Decomposition
Decomposition must be lossless. It means that the information should not get lost from the relation that is
decomposed.
It gives a guarantee that the join will result in the same relation as it was decomposed.
Example:
Let's say 'E' is the relational schema, with instance 'e', decomposed into E1, E2, E3, ..., En, with instances e1, e2, e3, ..., en. If e = e1 ⋈ e2 ⋈ e3 ⋈ ... ⋈ en, then it is called a 'Lossless Join Decomposition'.
In other words, if the natural join of all the decomposed instances gives back the original relation, the decomposition is said to be a lossless join decomposition.
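The lossless-join condition can be tested on a small instance by projecting and natural-joining back. The sketch below uses hypothetical attribute names (emp_id, dept, mgr); the decomposition is lossless here because dept → mgr holds:

```python
# Decompose a relation instance into two projections and natural-join them back;
# the decomposition is lossless iff the join reproduces the original instance.
def project(rows, attrs):
    return {tuple((a, r[a]) for a in attrs) for r in rows}

def natural_join(s1, s2):
    out = set()
    for t1 in s1:
        for t2 in s2:
            d1, d2 = dict(t1), dict(t2)
            common = set(d1) & set(d2)
            if all(d1[a] == d2[a] for a in common):   # match on shared attributes
                out.add(tuple(sorted({**d1, **d2}.items())))
    return out

e = [{"emp_id": 1, "dept": "D1", "mgr": "Ram"},
     {"emp_id": 2, "dept": "D1", "mgr": "Ram"}]
e1 = project(e, ["emp_id", "dept"])          # E1(emp_id, dept)
e2 = project(e, ["dept", "mgr"])             # E2(dept, mgr)
original = {tuple(sorted(r.items())) for r in e}
lossless = natural_join(e1, e2) == original  # True: dept -> mgr holds
```

Had mgr not been determined by dept, the join could produce spurious extra tuples, making the decomposition lossy.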
Example: <Employee_Department> Table (sample row):
E005 STU 32 Bangalore 25000 D005 Human Resource
Decompose the above relation into two relations, Employee and Department, to check whether the decomposition is lossless or lossy.
Dangling tuples: A property of the join operator is that it is possible for certain tuples to be "dangling"; that
is, they fail to match any tuple of the other relation in the common attributes. Dangling tuples do not have any
trace in the result of the join, so the join may not represent the data of the original relations completely.
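A dangling tuple is easy to see in a tiny sketch: a tuple whose join-attribute value has no partner in the other relation simply vanishes from the join result. The relation names and values below are made up for illustration:

```python
# R(emp, dept) joined with S(dept, mgr): the tuple ("Shyam", "D9") is dangling,
# because "D9" never appears in S, so it leaves no trace in the join result.
R = [("Ram", "D1"), ("Shyam", "D9")]
S = [("D1", "Mohan")]
join = [(emp, dept, mgr) for (emp, dept) in R
        for (d, mgr) in S if d == dept]
# join == [("Ram", "D1", "Mohan")]; the "Shyam" tuple is lost.
```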
Multivalued dependency:
Multivalued dependency occurs when there are two or more independent multivalued attributes in a table.
Formally, a multivalued dependency is a full constraint between two sets of attributes in a relation.
In contrast to the functional dependency, the multivalued dependency requires that certain tuples be present
in a relation. Therefore, a multivalued dependency is a special case of tuple-generating dependency. The
multivalued dependency plays a role in the 4NF database normalization.
A multivalued dependency is a special case of a join dependency, with only two sets of values involved, i.e. it is
a binary join dependency.
For example: Consider a bike manufacture company, which produces two colors (Black and white) in each
model every year.
bike_model manuf_year color
Here columns manuf_year and color are independent of each other and dependent on bike_model. In this
case these two columns are said to be multivalued dependent on bike_model. These dependencies can be
represented like this:
bike_model →→ manuf_year
bike_model →→ color
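The pair of multivalued dependencies above means that, for each bike_model, every combination of that model's years and colors must appear as a row. A sketch of that check (the data and function name are illustrative assumptions):

```python
from itertools import product

# bike_model ->> manuf_year and bike_model ->> color require that, per model,
# the rows form the full cross product of the model's years and colors.
def mvd_pair_holds(rows):
    models = {m for (m, _, _) in rows}
    for m in models:
        years  = {y for (mm, y, _) in rows if mm == m}
        colors = {c for (mm, _, c) in rows if mm == m}
        tuples = {(y, c) for (mm, y, c) in rows if mm == m}
        if tuples != set(product(years, colors)):
            return False
    return True

bikes = [("M1", 2019, "Black"), ("M1", 2019, "White"),
         ("M1", 2020, "Black"), ("M1", 2020, "White")]
# Dropping any one of the four rows breaks the cross product, so the MVD fails.
```

This "all combinations must be present" requirement is exactly why MVDs are tuple-generating dependencies, and why 4NF removes them into separate tables.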
Query optimization:
Query optimization is a function of many relational database management systems. The query optimizer attempts to determine the most efficient way to execute a given query by considering the possible query plans, i.e. the alternative but equivalent expressions for evaluating the query.
Query Flow:
Query Parser – Verify the validity of the SQL statement and translate the query into an internal structure using relational calculus.
Query Optimizer – Find the best expression from the various equivalent algebraic expressions. The criterion used is cheapness.
Code Generator/Interpreter – Make calls for the Query Processor as a result of the work done by the optimizer.
Query Processor – Execute the calls obtained from the code generator.
The first technique is based on Heuristic Rules for ordering the operations in a query execution strategy.
The second technique involves the systematic estimation of the cost of the different execution strategies and
choosing the execution plan with the lowest cost.
Semantic query optimization is used in combination with the heuristic query transformation rules.
It uses constraints specified on the database schema such as unique attributes and other more complex
constraints, in order to modify one query into another query that is more efficient to execute.
1. Heuristic Rules: -
The heuristic rules are used as an optimization technique to modify the internal representation of the query.
Usually, heuristic rules are used in the form of query tree of query graph data structure, to improve its
performance.
One of the main heuristic rules is to apply SELECT operation before applying the JOIN or other BINARY
operations.
This is because the size of the file resulting from a binary operation such as JOIN is usually a multiplicative function of the sizes of the input files.
The SELECT and PROJECT operations reduce the size of a file and hence should be applied before the JOIN or other binary operation.
Heuristic query optimizer transforms the initial (canonical) query tree into final query tree using equivalence
transformation rules.
This final query tree is efficient to execute.
Now let us consider a query on the above database to find the names of employees born after 1957 who work on a project named 'Growth':
SELECT EName
Fig 3.1
General Transformation Rules: -
Transformation rules are used by the query optimizer to transform one relational algebra expression into an
equivalent expression that is more efficient to execute.
A relation is considered equivalent to another relation if the two relations have the same set of attributes, possibly in a different order, but representing the same information.
These transformation rules are used to restructure the initial (canonical) relational algebra query tree produced during query decomposition.
1. Cascade of σ :-
σ c1 AND c2 AND … AND cn (R) = σ c1 ( σ c2 ( … ( σ cn (R) ) … ) )
2. Commutativity of σ :-
σ c1 ( σ c2 (R) ) = σ c2 ( σ c1 (R) )
3. Cascade of Л :-
Л L1 ( Л L2 ( … ( Л Ln (R) ) … ) ) = Л L1 (R), provided L1 ⊆ L2 ⊆ … ⊆ Ln
4. Commuting σ with Л :-
If the selection condition c involves only attributes in the projection list L, then:
Л L ( σ c (R) ) = σ c ( Л L (R) )
5. Commutativity of ⋈ (and x) :-
R x S = S x R and R ⋈ S = S ⋈ R
6. Commuting σ with ⋈ (or x) :-
If all attributes in selection condition c involve only attributes of one of the relation schemas (say R), then:
σ c (R ⋈ S) = ( σ c (R) ) ⋈ S
Alternatively, if selection condition c can be written as (c1 AND c2), where condition c1 involves only attributes of R and condition c2 involves only attributes of S, then:
σ c (R ⋈ S) = ( σ c1 (R) ) ⋈ ( σ c2 (S) )
7. Commuting Л with ⋈ (or x) :-
Suppose the projection list is L = {A1, A2, …, An, B1, B2, …, Bm}, where A1…An are attributes of R and B1…Bm are attributes of S. If the join condition c involves only attributes in L, then:
Л L (R ⋈c S) = ( Л A1,…,An (R) ) ⋈c ( Л B1,…,Bm (S) )
8. Commutativity of set operations :-
R ∪ S = S ∪ R and R ∩ S = S ∩ R; Minus (R - S) is not commutative.
9. Associativity of ⋈, x, ∪, and ∩ :-
If ∅ stands for any one of these operations throughout the expression, then:
(R ∅ S) ∅ T = R ∅ (S ∅ T)
10. Commuting σ with set operations :-
If ∅ stands for any one of ∪, ∩, or -, then:
σ c (R ∅ S) = ( σ c (R) ) ∅ ( σ c (S) )
11. The Л operation commutes with ∪ :-
Л L (R ∪ S) = ( Л L (R) ) ∪ ( Л L (S) )
12. Converting a (σ, x) sequence into ⋈ :-
σ c (R x S) = (R ⋈c S)
Heuristic Optimization Algorithm: -
The Database Management System Use Heuristic Optimization Algorithm that utilizes some of the
transformation rules to transform an initial query tree into an optimized and efficient executable query tree.
The steps of the heuristic optimization algorithm that could be applied during query processing and
optimization are:
Step-1: -
Perform SELECT operation to reduce the subsequent processing of the relation:
Use the transformation rule 1 to break up any SELECT operation with conjunctive condition into a cascade
of SELECT operation.
Step-2: -
Perform commutativity of SELECT operation with other operation at the earliest to move each SELECT
operation down to query tree.
Use transformation rules 2, 4, 6 and 10 concerning the commutativity of SELECT with other operation such
as unary and binary operations and move each select operation as far down the tree as is permitted by the
attributes involved in the SELECT condition. Keep selection predicates on the same
relation together.
Step-3: -
Combine the Cartesian product with subsequent SELECT operation whose predicates represents the join
condition into a JOIN operation.
Use the transformation rule 12 to combine the Cartesian product operation with subsequent SELECT
operation.
Step-4: -
Use the commutativity and associativity of the binary operations.
Use transformation rules 5 and 9 concerning commutativity and associativity to rearrange the leaf nodes of
the tree so that the leaf node with the most restrictive selection operation is executed first in the query tree
representation. The most restrictive SELECT operation means:
Either the one that produces a relation with the fewest tuples or with the smallest size.
The one with the smallest selectivity.
Step-5: -
Perform the projection operation as early as possible to reduce the cardinality of the relation and the
subsequent processing of the relation, and move the Projection operations as far down the query tree as
possible.
Use transformation rules 3, 4, 7 and 10 concerning the cascading and commuting of projection operations
with other binary operation. Break down and move the projection attributes down the tree as far as needed.
Keep the projection attributes in the same relation together.
Step-6: -
Compute common expressions once.
Identify sub-trees that represent groups of operations that can be executed by a single algorithm.
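The payoff of Step 2 (pushing SELECT below a join) can be seen by counting the intermediate tuples each plan builds. The toy tables, sizes, and plan labels below are made up for illustration:

```python
# Compare intermediate result sizes: join-then-select vs select-then-join.
emp  = [(i, i % 10) for i in range(100)]      # (emp_id, dept_id): 100 rows
dept = [(d, f"D{d}") for d in range(10)]      # (dept_id, dname): 10 rows

# Plan A: join first, then apply the selection emp_id < 5.
joined = [(e, d, n) for (e, d) in emp for (dd, n) in dept if dd == d]
plan_a_intermediate = len(joined)             # 100 joined tuples materialized

# Plan B: apply the selection first, then join (SELECT pushed down the tree).
selected = [(e, d) for (e, d) in emp if e < 5]
plan_b_intermediate = len(selected)           # only 5 tuples enter the join
result_b = [(e, d, n) for (e, d) in selected for (dd, n) in dept if dd == d]
```

Both plans yield the same final answer, but Plan B builds a 20x smaller intermediate result, which is exactly what the heuristic exploits.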
Simple selection
S1: Linear search
S2: Binary search
S3a: Primary index
S3b: Hash key
S4: Primary index, multiple records
(For a comparison clause like >, ≥.) Use the index to retrieve the record satisfying the corresponding equality condition, then iterate over the subsequent records in the file.
S5: Clustering index, multiple records
For an equality comparison on a non-key attribute.
S6: Secondary index (B+-Tree)
Complex selection
A conjunctive selection combines subclauses with logical And.
S7: Conjunctive selection, individual index
Retrieve records based on the indexed attribute that satisfy its associated condition. Check each one whether it
satisfies the remaining conditions.
S8: Conjunctive selection, composite index
If the attributes participating in the composite index have equality conditions, use the index directly.
S9: Conjunctive selection, intersection of result sets
Retrieve records satisfying the clause of their indexed attributes, separately, to form result sets. Compute the
intersection of these sets.
Condition selectivity
The selectivity sl is the ratio of tuples satisfying the selection condition to the total tuples in the relation. Low
values of sl are usually desirable. More importantly, a DBMS will tend to keep an estimate of the distribution of
values among the attributes of the rows of a table, as a histogram, to be able to estimate the selectivity of query
operations.
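Selectivity estimation from such a histogram can be sketched in a few lines; the histogram contents and function name here are assumptions for illustration:

```python
# Estimate the selectivity sl of an equality condition "attr = value"
# from a value-frequency histogram kept by the DBMS.
def eq_selectivity(histogram, value, total):
    return histogram.get(value, 0) / total    # matching tuples / total tuples

hist = {25: 40, 30: 35, 35: 25}      # age value -> row count, 100 rows total
sl = eq_selectivity(hist, 30, 100)   # 0.35: 35 of the 100 tuples qualify
```

Multiplying sl by the relation's cardinality gives the estimated result size, which is the quantity the cost-based optimizer compares across plans.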
PROJECT operation
If the < attribute list > of the PROJECT operation π< attribute list >(R) includes a key of R, then the number of tuples in the projection result is equal to the number of tuples in R, but with only the values for the attributes in < attribute list > in each tuple.
If the < attribute list > does not contain a key of R, duplicate tuples must be eliminated. The following methods
can be used to eliminate duplication.
Sorting the result of the operation and then eliminating duplicate tuples.
Hashing the result of the operation into hash file in memory and check each hashed record against those in
the same bucket; if it is a duplicate, it is not inserted.
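The hash-based duplicate elimination described above is, in effect, what a hash set of projected tuples does; a minimal sketch with illustrative names:

```python
# Duplicate elimination for a projection on a non-key attribute list:
# each projected tuple is hashed, and kept only if no equal tuple was seen.
def project_distinct(rows, attrs):
    seen, out = set(), []
    for r in rows:
        t = tuple(r[a] for a in attrs)   # projected tuple
        if t not in seen:                # hash lookup replaces sorting
            seen.add(t)
            out.append(t)
    return out

emp = [{"emp_id": 1, "dept": "D1"}, {"emp_id": 2, "dept": "D1"},
       {"emp_id": 3, "dept": "D2"}]
depts = project_distinct(emp, ["dept"])  # [("D1",), ("D2",)]
```

Compared with the sort-based method, hashing avoids the O(n log n) sort but needs the hash table to fit in memory.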
CARTESIAN PRODUCT operation: – It is an extremely expensive operation. Hence, it is important to avoid this
operation and to substitute other equivalent operations during optimization.
SELECT LNAME, FNAME FROM EMPLOYEE WHERE SALARY > ( SELECT MAX (SALARY) FROM EMPLOYEE WHERE
DNO=5);
We can decompose it into two blocks:
Inner block: ( SELECT MAX (SALARY) FROM EMPLOYEE WHERE DNO=5 )
Outer block: SELECT LNAME, FNAME FROM EMPLOYEE WHERE SALARY > c
where c is the value returned by the inner block.
Optimization methods:
Use the associativity of binary operations to rearrange leaf nodes so that the leaf nodes with the most restrictive operations are executed first.
One of the main heuristic rules is to apply SELECT and PROJECT operations before applying the JOIN or other binary operations. This is because the size of the file resulting from a binary operation, such as JOIN, is usually a multiplicative function of the sizes of the input files. The SELECT and PROJECT operations reduce the size of a file and hence should be applied first.
Systems may use heuristics to reduce the number of choices that must be made in a cost-based fashion.
Heuristic optimization transforms the query-tree by using a set of rules that typically (but not in all cases)
improve execution performance:
o Perform selection early (reduces the number of tuples)
o Perform projection early (reduces the number of attributes)
o Perform most restrictive selection and join operations (i.e. with smallest result size) before other similar
operations.
Some systems use only heuristics, others combine heuristics with partial cost-based optimization.
CS-5003
DBMS