CHAPTER_2_Query_Processing_&_Optimization_Handout_Material
Chapter 2
11.1 Introduction
Query processing requires that the DBMS identify and execute a strategy for retrieving the
results of the query. The query determines what data is to be found, but does not define
the method by which the data manager searches the database. Query optimization is therefore
necessary to determine the best alternative for processing a query. There are two main
techniques for query optimization. The first approach uses a rule-based or heuristic
method for ordering the operations in a query execution strategy. The second approach
estimates the cost of different execution strategies and chooses the cheapest one. In general,
most commercial database systems use a combination of both techniques.
Introduction to Database Management System
factors such as the existence of certain database structures, the presence of different indexes,
whether the file is sorted, the cost of transformation, and the physical characteristics of the
data. After transformation, the query is evaluated using a number of strategies known as
access plans. While generating access plans, factors such as the physical properties of data
and storage are taken into account, and the optimal access plan is chosen. The next step is
to validate the user's privileges and ensure that the query does not violate the relevant
integrity constraints. Finally, the execution plan is executed to generate the result.
[Figure: query processing steps. Recoverable labels: query normalization, semantic analysis, query simplifier, query restructuring, algebraic expressions, query result.]
[Figure: query tree fragment. Recoverable labels: condition E.Department = D.Department over Employee (E) and Department (D).]
Query graph notation : A graph data structure is also used for the internal representation
of a query. In query graphs:
Relation nodes – represent relations by a single circle.
Constant nodes – represent constant values by a double circle.
Edges – represent selection and join conditions.
Square brackets – represent the attributes retrieved from each relation.
[Figure: query graph. Constant node 'Delhi' is linked to relation node D by the edge
Department-Location = 'Delhi'; D is linked to relation node E by the edge
D.Department = E.Department. D is annotated with [D.Manager-ID] and E with
[E.Employee_Name, E.Address, E.Job, E.Department].]
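The query graph notation described above can be sketched as a small data structure. This is only an illustration: the class `QueryGraph`, its fields and its methods are hypothetical names, not part of any DBMS, and it assumes that constant values never share a name with a relation alias. The node and edge labels are the ones from the example graph.

```python
from dataclasses import dataclass, field


@dataclass
class QueryGraph:
    """Hypothetical in-memory form of the query graph notation."""
    # Relation nodes (single circles): alias -> attributes retrieved
    # from that relation (the square-bracket annotation).
    relations: dict = field(default_factory=dict)
    # Constant nodes (double circles).
    constants: set = field(default_factory=set)
    # Edges stored as (node, node, condition) triples.
    edges: list = field(default_factory=list)

    def add_relation(self, alias, retrieved=()):
        self.relations[alias] = list(retrieved)

    def add_constant(self, value):
        self.constants.add(value)

    def add_edge(self, a, b, condition):
        self.edges.append((a, b, condition))

    def join_conditions(self):
        """Conditions on edges that connect two relation nodes."""
        return [c for a, b, c in self.edges
                if a in self.relations and b in self.relations]

    def selection_conditions(self):
        """Conditions on edges that touch a constant node."""
        return [c for a, b, c in self.edges
                if a in self.constants or b in self.constants]


def build_example_graph():
    """Build the graph for the Employee/Department example."""
    g = QueryGraph()
    g.add_relation("E", ["E.Employee_Name", "E.Address", "E.Job", "E.Department"])
    g.add_relation("D", ["D.Manager-ID"])
    g.add_constant("Delhi")
    g.add_edge("Delhi", "D", "Department-Location = 'Delhi'")
    g.add_edge("D", "E", "D.Department = E.Department")
    return g
```

Calling `build_example_graph().join_conditions()` returns the single join edge between D and E, while `selection_conditions()` returns the edge that ties D to the constant node.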
(iii) Query Optimization : The aim of the query optimization step is to choose the best
possible query execution plan, that is, the plan that requires the minimum resources to
execute. Query optimization is discussed in detail in Section 11.3.
(iv) Execution Plan : An execution plan specifies the basic algorithm used for each operation
in the query. Execution plans are classified into the following four types : (a) left-deep tree
query execution plan, (b) right-deep tree query execution plan, (c) linear tree execution plan,
(d) bushy execution plan.
(a) Left-deep tree query execution plan : In a left-deep tree query execution plan, the
plan starts with a single relation and successively adds an operation involving one
more relation until the query is complete. Only the left-hand input of a join is
allowed to be the result of a previous join, hence the name left-deep tree. It is
shown in Figure 11.5.
[Figure 11.5: left-deep tree over relations R1, R2, R3 and R4, producing Result.]
Advantages : The main advantages of the left-deep tree query execution plan are:
It reduces the search space.
It allows the query optimiser to be based on dynamic programming techniques.
It is convenient for pipelined evaluation, as only one input to each join is pipelined.
Disadvantage : The disadvantage of the left-deep tree query execution plan is:
The reduced search space may cause some lower-cost execution strategies to be missed.
(b) Right-deep tree query execution plan : It is almost the same as the left-deep query
execution plan, the only difference being that only the right-hand input of a join is
allowed to be the result of a previous join, hence the name right-deep tree. It is
suitable for applications that have a large main memory. It is shown in Figure 11.6.
[Figure 11.6: right-deep tree over relations R1, R2, R3 and R4, producing Result.]
(c) Linear tree execution plan : A combination of left-deep and right-deep execution
plans, with the restriction that the relation on one side of each operator is always a
base relation, is known as a linear tree. It is shown in Figure 11.7.
[Figure 11.7: linear tree over relations R1, R2, R3 and R4, producing Result.]
(d) Bushy execution plan : A bushy execution plan is the most general type of execution
plan. Both inputs of a join may themselves be intermediate results. It is shown in
Figure 11.8.
[Figure 11.8: bushy tree over relations R1, R2, R3, R4 and R5, producing Result.]
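To make the search-space trade-off between left-deep and bushy plans concrete, the sketch below enumerates both kinds of join tree for a handful of relations. It is an illustration under simplifying assumptions: plans are plain nested tuples `(left, right)`, every pair of relations is treated as joinable, and the function names `left_deep_plans` and `bushy_plans` are hypothetical.

```python
from itertools import permutations


def left_deep_plans(relations):
    """All left-deep plans: the left input of each join is the
    result of the previous join, the right input a base relation."""
    plans = []
    for order in permutations(relations):
        plan = order[0]
        for rel in order[1:]:
            plan = (plan, rel)
        plans.append(plan)
    return plans


def bushy_plans(relations):
    """All binary join trees: either input of a join may itself
    be an intermediate result."""
    relations = tuple(relations)
    n = len(relations)
    if n == 1:
        return [relations[0]]
    plans = []
    # Each bitmask picks a non-empty, proper subset for the left input.
    for mask in range(1, 2 ** n - 1):
        left = [r for i, r in enumerate(relations) if (mask >> i) & 1]
        right = [r for i, r in enumerate(relations) if not (mask >> i) & 1]
        for l in bushy_plans(left):
            for r in bushy_plans(right):
                plans.append((l, r))
    return plans


RELATIONS = ["R1", "R2", "R3", "R4"]
n_left_deep = len(left_deep_plans(RELATIONS))  # 4! orderings
n_bushy = len(bushy_plans(RELATIONS))
```

For four relations this gives 24 left-deep plans against 120 bushy plans, and every left-deep plan also appears in the bushy set. An optimiser restricted to left-deep trees therefore examines far fewer candidates, at the price of never seeing the remaining shapes.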
The query performance of a database system depends not only on the database structure,
but also on the way in which the query is optimized. Query optimization means converting a
query into an equivalent form that is more efficient to execute. It is necessary for high-level
relational queries, and it gives the DBMS an opportunity to systematically evaluate alternative
query execution strategies and to choose an optimal strategy. A typical query optimization
process is shown in Figure 11.9.
[Figure 11.9: query optimization process. Inputs to the query optimiser: the simplified relational algebra query tree, statistical data, and estimation formulas (which determine the cardinality of intermediate result tables).]
Rule 6. Commutativity of selection (σ) and join (⋈) or Cartesian product (×), where the
selection condition X involves only attributes of R.
σ_X (R ⋈_Y S) ≡ (σ_X (R)) ⋈_Y S
σ_X (R × S) ≡ (σ_X (R)) × S
Ex. σ_{Employee.Age > 30 ∧ Dept-Name = 'MARKETING'} (Employee ⋈_{Employee.Dept-ID = Department.Dept-ID} Department)
≡ (σ_{Employee.Age > 30} (Employee)) ⋈_{Employee.Dept-ID = Department.Dept-ID} (σ_{Dept-Name = 'MARKETING'} (Department))
Rule 7. Commutativity of projection (π) and join (⋈) or Cartesian product (×), where L1
contains attributes of R, L2 contains attributes of S, and the join condition Z involves
only attributes in L1 ∪ L2.
π_{L1 ∪ L2} (R ⋈_Z S) ≡ (π_{L1} (R)) ⋈_Z (π_{L2} (S))
Ex. π_{Emp-Name, Dept-Name, Dept-ID} (Employee ⋈_{E.Dept-ID = D.Dept-ID} Department)
≡ (π_{Emp-Name, Dept-ID} (Employee)) ⋈_{E.Dept-ID = D.Dept-ID} (π_{Dept-Name, Dept-ID} (Department))
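The two examples for Rules 6 and 7 can be checked on a toy database. The sketch below is not an optimiser, only a direct evaluation of both sides of each equivalence. Relations are lists of dictionaries, the helper names (`select`, `join`, `project`, `canon`) are hypothetical, and the sample rows are invented for illustration.

```python
def select(rows, pred):
    """Selection: keep the rows that satisfy pred."""
    return [r for r in rows if pred(r)]


def join(left, right, cond):
    """Theta join: merge every pair of rows that satisfies cond."""
    return [{**l, **r} for l in left for r in right if cond(l, r)]


def project(rows, attrs):
    """Projection: keep only attrs and drop duplicate rows."""
    out = []
    for r in rows:
        t = {a: r[a] for a in attrs}
        if t not in out:
            out.append(t)
    return out


def canon(rows):
    """Order-independent form of a relation, used to compare results."""
    return sorted(tuple(sorted(r.items())) for r in rows)


# Hypothetical sample data, not taken from the text.
EMPLOYEE = [
    {"Emp_Name": "Asha", "Age": 42, "Dept_ID": 1},
    {"Emp_Name": "Ravi", "Age": 28, "Dept_ID": 1},
    {"Emp_Name": "Meena", "Age": 36, "Dept_ID": 2},
]
DEPARTMENT = [
    {"Dept_ID": 1, "Dept_Name": "MARKETING"},
    {"Dept_ID": 2, "Dept_Name": "SALES"},
]


def same_dept(e, d):
    return e["Dept_ID"] == d["Dept_ID"]


def rule6_sides():
    """Select after the join versus select before the join."""
    lhs = select(join(EMPLOYEE, DEPARTMENT, same_dept),
                 lambda r: r["Age"] > 30 and r["Dept_Name"] == "MARKETING")
    rhs = join(select(EMPLOYEE, lambda e: e["Age"] > 30),
               select(DEPARTMENT, lambda d: d["Dept_Name"] == "MARKETING"),
               same_dept)
    return lhs, rhs


def rule7_sides():
    """Project after the join versus project before the join."""
    lhs = project(join(EMPLOYEE, DEPARTMENT, same_dept),
                  ["Emp_Name", "Dept_Name", "Dept_ID"])
    rhs = join(project(EMPLOYEE, ["Emp_Name", "Dept_ID"]),
               project(DEPARTMENT, ["Dept_Name", "Dept_ID"]),
               same_dept)
    return lhs, rhs
```

Comparing `canon(lhs)` with `canon(rhs)` for each pair confirms that both sides return the same rows on this data. The right-hand sides simply feed smaller inputs into the join.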
Consider the SQL query below and the transformation of its initial query tree into an
optimal query tree.
SELECT Emp_Name
FROM Employee e, Department d, Project p
WHERE p.Name = 'LUXMI PUB.'
AND d.Proj_ID = p.Proj_ID
AND e.Dept_ID = d.Dept_ID
AND e.Age > 35;
This query displays the names of all employees who work on the project 'LUXMI PUB.'
and are more than 35 years old.
Figure 11.10 shows the initial query tree for the given SQL query. If the tree is executed
directly, it produces the Cartesian product of the entire Employee, Department, and Project
tables, although the query needs only one record from the relation Project and only the
records of employees whose age is greater than 35 years.
[Figure 11.10: initial query tree. Recoverable labels: Emp_Name, ×, Project, Employee, Department.]
- We can improve the performance by first applying the selection operations, to reduce the
number of records that appear in the Cartesian product. Figure 11.11 shows the improved query
tree.
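The effect of applying selections before the Cartesian product can be seen from the sizes of the intermediate results alone. The arithmetic below is a sketch: all cardinalities are hypothetical, except that only one Project record is assumed to match, as the text states.

```python
from math import prod


def product_size(*cardinalities):
    """Number of tuples in the Cartesian product of relations
    with the given cardinalities."""
    return prod(cardinalities)


# Hypothetical table sizes, chosen only for illustration.
employees, departments, projects = 1000, 20, 50
employees_over_35 = 300   # assumed number of rows with e.Age > 35
matching_projects = 1     # one Project record matches p.Name

# Initial tree: the product of the three complete tables.
naive = product_size(employees, departments, projects)

# Improved tree: selections applied before the product.
pushed = product_size(employees_over_35, departments, matching_projects)
```

With these assumed sizes the product shrinks from 1,000,000 tuples to 6,000 before any join condition is evaluated.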
[Figure 11.11: improved query tree. Recoverable labels: Emp_Name, d.Proj_ID = p.Proj_ID, ×, Project, Employee.]
- The query tree can be further improved by applying the more restrictive selection operations
first. The positions of the relations Project and Employee are therefore switched, because a
single project may have more than one employee. Figure 11.12 shows the improved query tree.
[Figure: query tree. Recoverable labels: Emp_Name, e.Dept_ID = d.Dept_ID, ×, Employee, Project.]
FIGURE 11.12. Improved query tree by applying more restrictive selection operations.
[Figure: query tree. Recoverable labels: p.Dept_ID = d.Dept_ID, Employee, p.Name = 'LUXMI PUB.', Department, Project.]
- The query tree can be improved further by keeping only the required attributes
(columns) of the relations, applying projection operations as early as possible in the query
tree. The optimizer keeps, in the intermediate relations, the attributes to be displayed and
the attributes needed by subsequent operations. The modified query tree is shown in
Figure 11.14.
[Figure 11.14: modified query tree. Recoverable labels: Emp_Name, e.Dept_ID = d.Dept_ID, Project.]
The SELECT clause in SQL is equivalent to the projection operation in relational algebra,
and the WHERE clause is equivalent to the selection operation.
11.3.3 Cost Based Query Optimization
In cost-based query optimization, the optimizer estimates the cost of running each
alternative for a query and chooses the optimum alternative. The alternative that uses the
minimum resources has the minimum cost. The cost of a query operation depends mainly on
its selectivity, i.e., the proportion of the input relations that forms the output. The
main components used to determine the cost of executing a query are the following:
(a) Access cost of secondary storage : The access cost to secondary storage consists of the
cost of database manipulation operations, which include searching, writing, and reading
data blocks stored in secondary memory. The cost of searching depends on the type
of index (primary, secondary, hashed), the type of file structure, and the ordering of
the relation, as well as on the physical storage layout, i.e., whether the file blocks
are allocated contiguously on the same disk or scattered across the disk.
(b) Storage cost : The storage cost consists of the cost of storing the intermediate results
(tables or files) that are generated by the execution strategy for the query.
(c) Computation cost : The computation cost consists of the cost of performing in-memory
operations during query execution, such as sorting the records in a file, merging
records, performing computations on field values, and searching for records. These
are mainly performed on data buffers.
(d) Memory usage cost : It consists of the cost pertaining to the number of memory
buffers needed during query execution.
(e) Communication cost : It consists of the cost of transferring the query and its result
from the database location to the terminal where the query originated.
Of all the above components, the most important is the access cost to secondary storage,
because secondary storage is comparatively slower than main memory. The optimizer tries to
minimize the computation cost for small databases, as most of the data files are held in main
memory. For large databases, it tries to minimize the access cost to secondary storage, and
for distributed databases, it tries to minimize the communication cost, because data must be
transferred between various sites.
To estimate the cost of execution strategies, the optimizer accesses statistical data stored in
the DBMS catalog. The information stored in the DBMS catalog is given below:
(i) Number of records in relation X, given as R.
(ii) Number of blocks required to store relation X, given as B.
(iii) Blocking factor of relation X, given as BFR.
(iv) Primary access method for each file and attributes for each file.
(v) Number of levels of each multi-level index on an attribute A, given as I_A.
(vi) Number of first-level index blocks for an attribute A, given as B_AI1.
(vii) Selection cardinality of attribute A in relation X, given as S_A, where S_A = R × SL_A
and SL_A is the selectivity of the attribute.
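Two of the catalog statistics listed above follow from the others by simple arithmetic: the number of blocks is the number of records divided by the blocking factor, rounded up, and the selection cardinality is the number of records multiplied by the selectivity. The sketch below uses hypothetical function names. The Department figures (600 records, blocking factor 60) are those of the chapter's own example, while the selectivity of 0.25 is an assumed value.

```python
def blocks_needed(num_records, blocking_factor):
    """Blocks required to store a relation: records divided by
    records per block, rounded up to a whole block."""
    return (num_records + blocking_factor - 1) // blocking_factor


def selection_cardinality(num_records, selectivity):
    """Expected number of records satisfying a selection condition:
    number of records times the selectivity of the attribute."""
    return num_records * selectivity


# Department table figures from the chapter's example.
department_records = 600
department_bfr = 60

department_blocks = blocks_needed(department_records, department_bfr)

# Assumed selectivity; the text does not give one.
expected_matches = selection_cardinality(department_records, 0.25)
```

Here `department_blocks` evaluates to 10, which agrees with the example, and `expected_matches` evaluates to 150 under the assumed selectivity.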
Cost Function for Selection Operation : The selection operation works on a single relation in
relational algebra and retrieves the records that satisfy the given condition. Depending on
the file structure, the available indexes, and the searching methods, the estimated costs of
the strategies for the selection operation are as given below.
Example. Suppose the Department table has 600 records. The BFR for the Department table
is 60, so the number of blocks is 600/60 = 10.
For the join operation, Employee ⋈_{Dept-ID} Department