
Advanced Database Systems

Chapter 2: Query Processing and Optimization

Sirage Z.
Department of Computer Science
Debre Berhan University
2022
Query Processing
 A query is a request for data or information from a database table.
 Query processing (QP) is the activity of finding the requested information in one or more databases and delivering it to the user quickly and efficiently.
 Traditional techniques work well for databases with standard, single-site relational structures.
 Databases containing more complex and diverse types of data demand new query processing and optimization techniques.
 Query Processing can be divided into four main phases:
• Decomposition,
• Optimization,
• Code generation, and
• Execution
Query Processing Steps

 Parser and Translator: checks the query for correct syntax and verifies that the referenced schema elements exist.

 Eg. SELECT empName FROM employee HAVING name='abc'
> HAVING is invalid here (WHERE is correct): syntax error
 Assume emp.gmail is not available in the table;
> SELECT emp.gmail FROM employee → schema element error


…CON’T
 Optimizer:
 Translates the query into relational algebra
 Generates all equivalent relational algebra expressions
 Chooses the expression with the least estimated cost
 Uses statistics about the data, such as the size of each relation:
 Number of blocks used to store the relation
 Number of records in the relation
 Number of indexes
 Query Evaluation engine:
 Executes the chosen plan and produces the result
 It is the component that directly accesses the stored data
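The optimizer's role can be illustrated with a tiny sketch (not part of the slides): given several equivalent plans and their estimated costs, pick the cheapest one. The plan names are made up; the cost figures are the disk-access counts derived later in this chapter for the Staff/Branch example.

# Illustrative sketch of cost-based plan selection (names and numbers are for demonstration).
candidate_plans = {
    "select_over_cartesian_product": 101_050,   # estimated disk accesses
    "select_over_join": 3_050,
    "select_inputs_then_join": 1_160,
}

def choose_plan(plans):
    """Return the name of the plan with the least estimated cost."""
    return min(plans, key=plans.get)

print(choose_plan(candidate_plans))   # -> select_inputs_then_join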
Phases of Query processing
 Query processing can be divided into four phases: decomposition, optimization, code generation, and execution.
 Query Decomposition
 is the process of transforming a high-level query into a relational algebra query and checking that the query is syntactically and semantically correct.
 Consists of parsing and validation
 Output: relational algebra query on global relations
 Typical stages in query decomposition are:
 1) Analysis: lexical and syntactical analysis of the query (correctness).
> A query tree will be built for the query, containing leaf nodes for the base relations, one or more non-leaf nodes for relations produced by relational algebra operations, and a root node for the result of the query.
> The sequence of operations runs from the leaves to the root.


…CON’T
 2) Normalization: convert the query into a normalized form.
> The WHERE predicate will be converted to conjunctive (∧) or disjunctive (∨) normal form (a small sketch follows this list).
 3) Semantic Analysis: reject normalized queries that are incorrectly formulated or contradictory.
> Incorrect if its components do not contribute to generating the result.
> Contradictory if its predicate cannot be satisfied by any tuple.
> Algorithms: relation connection graph and normalized attribute connection graph.
 4) Simplification: detect redundant qualifications, eliminate common sub-expressions, and transform the query into a semantically equivalent but more easily and efficiently computed form.
 5) Query Restructuring: more than one translation is possible, so transformation rules are used.
> Nodes are rearranged so that the most restrictive condition is executed first.
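As an aside (not from the slides), the normalization in step 2 can be sketched with the sympy library, which rewrites a Boolean predicate into conjunctive or disjunctive normal form. The propositional symbols below are hypothetical stand-ins for atomic WHERE conditions.

from sympy import symbols
from sympy.logic.boolalg import to_cnf, to_dnf

# Hypothetical atomic predicates: p = position='Manager', c = city='Addis', s = salary>2000
p, c, s = symbols("p c s")
where_clause = (p & c) | s          # (p AND c) OR s

print(to_cnf(where_clause))         # conjunctive normal form, e.g. (c | s) & (p | s)
print(to_dnf(where_clause))         # disjunctive normal form, e.g. (c & p) | s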
…CON’T
 Most real-world data is not well structured.
 Today's databases typically contain much non-structured data
such as text, images, video, and audio, often distributed
across computer networks.
 In this complex environment, efficient and accurate query
processing becomes quite challenging.
 There could be tons of tricks (not only in storage and query processing, but also in concurrency control, recovery, etc.)
 Different tricks may work better in different usage scenarios.
 The same tricks get used over and over again in different applications.
Query processing
 Execute transactions on behalf of this query and print the result. Steps in query processing:


…CON’T
 Consider the following query on two relations, Staff and Branch:
 Staff (StaffNo, name, position, salary, sex, BranchNo)
 Branch (BranchNo, name, city)
 Eg. Find all managers who manage a branch at Addis.
 We can write this query in SQL as follows:
SELECT s.*
FROM staff s, branch b
WHERE s.branchNo = b.branchNo AND
      (s.position = 'Manager' AND b.city = 'Addis')


…CON’T
 Assume
i. one record is accessed at a time; there are n staff tuples, m branch tuples, x non-managers, and y non-Addis branches for some integers n, m, x, y;
ii. intermediate results are stored on disk;
iii. the cost of writing the final result is ignored because it is the same for all the expressions.
 Then, this high-level SQL query can be transformed into the following three equivalent low-level relational algebra expressions.

σ(position='Manager') ∧ (city='Addis') ∧ (Staff.branchNo=Branch.branchNo) (Staff × Branch)
 Analysis:
i. read each tuple from the two relations → n + m reads
ii. create a table of the Cartesian product → n × m writes
iii. test each tuple of step 2 → n × m reads
 Total no. of disk accesses: 2(n × m) + n + m, or
…CON’T
OR
 σ(position='Manager') ∧ (city='Addis') (Staff ⋈ Staff.branchNo=Branch.branchNo Branch)
 Analysis:
 i. read each tuple from the two relations → n + m reads
 ii. create a table of the join → n writes
 iii. test each tuple of step 2 → n reads
 Total no. of disk accesses: 3(n) + m, or
 (σ position='Manager' (Staff)) ⋈ Staff.branchNo=Branch.branchNo (σ city='Addis' (Branch))
 Analysis:
 i. test each tuple from the two relations → n + m reads
 ii. create the "manager_Staff" and "addis_Branch" relations → (n−x) + (m−y) writes
 iii. create a join of the two relations at step 2 → (n−x) + (m−y) reads
 Total no. of disk accesses: 2(n−x) + 2(m−y) + n + m
 Which of the expressions given above do you think is best (optimal)?
…CON’T
Assume:
– 1000 tuples in Staff.
– 50 Managers
– 50 tuples in Branch.
– 5 London branches
– No indexes or sort keys
– All temporary results are written back to disk
(memory is small)
– Tuples are accessed one at a time (not in blocks)



…CON’T
 Query 1 (Bad)
σ(position='Manager') ∧ (city='London') ∧ (Staff.branchNo=Branch.branchNo) (Staff × Branch)
 Requires (1000+50) disk accesses to read from the Staff and Branch relations
 Creates a temporary relation of the Cartesian product with (1000*50) tuples
 Requires (1000*50) disk accesses to read the temporary relation and test the predicate
Total Work = (1000+50) + 2*(1000*50) = 101,050 I/O operations
 Query 2 (Better)
σ(position='Manager') ∧ (city='London') (Staff ⋈ Staff.branchNo=Branch.branchNo Branch)
– Again requires (1000+50) disk accesses to read from Staff and Branch
– Joins Staff and Branch on branchNo, producing 1000 tuples (1 employee : 1 branch)
– Requires (1000) disk accesses to read the joined relation and check the predicate
Total Work = (1000+50) + 2*(1000) = 3,050 I/O operations
≈33× improvement over Query 1
…CON’T
 Query 3 (Best)
[ σ(position='Manager') (Staff) ] ⋈ Staff.branchNo=Branch.branchNo [ σ(city='London') (Branch) ]
 Read the Staff relation to determine the 'Managers' (1000 reads)
> Create a 50-tuple relation (50 writes)
 Read the Branch relation to determine the 'London' branches (50 reads)
> Create a 5-tuple relation (5 writes)
 Join the reduced relations and check the predicate (50 + 5 reads)
Total Work = 1000 + 2*(50) + 5 + (50 + 5) = 1,160 I/O operations
≈87× improvement over Query 1
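The counting above can be reproduced with a short script. This is only a sketch of the slide's arithmetic under the same assumptions (tuple-at-a-time access, intermediate results written back to disk); the variable names are my own.

# I/O counts for the three strategies, using the slide's figures.
n_staff, n_branch = 1000, 50        # tuples in Staff and Branch
n_managers, n_london = 50, 5        # tuples surviving the two selections

q1 = (n_staff + n_branch) + 2 * (n_staff * n_branch)       # Cartesian product, then select
q2 = (n_staff + n_branch) + 2 * n_staff                    # join (1000 tuples), then select
q3 = (n_staff + n_branch) + 2 * (n_managers + n_london)    # select both inputs, then join

print(q1, q2, q3)                   # 101050 3050 1160
print(round(q1 / q3))               # Query 3 does roughly 87x fewer I/Os than Query 1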
Query Optimization
 What is wrong with the ordinary query?
 Everyone wants the performance of their database to be optimal.
 In particular, there is often a requirement for a specific query, or an object that is query-based, to run faster.
 The problem of query optimization is to find the sequence of steps that produces the answer to the user's request in the most efficient manner, given the database structure.
 The performance of a query is affected by the tables or queries that underlie it and by the complexity of the query.
 When data/workload characteristics change:
 the best navigation strategy changes
 the best way of organizing the data changes


…CON’T
 Query optimizers are one of the main means by which modern database systems
achieve their performance advantages.
 Given a request for data manipulation or retrieval, an optimizer will choose an
optimal plan for evaluating the request from among the many alternative
strategies.
 i.e. there are many ways (access paths) for accessing desired file/record.
 The optimizer tries to select the most efficient (cheapest) access path for accessing
the data.
 The DBMS is responsible for picking the best execution strategy based on various considerations.
 Query optimizers are among the largest and most complex modules of database systems.


…CON’T
 Most efficient processing: Least amount of I/O and CPU resources.
 Selection of the best method: In a non-procedural language the system
does the optimization at the time of execution.
 For optimizing the execution of a query the programmer must know:
> File organization
> Record access mechanism and primary or secondary key.
> Data location on disk.
> Data access limitations.
 To write correct code, application programmers need to know how data
is organized physically (e.g., which indexes exist)
 To write efficient code, application programmers also need to worry
about data/workload characteristics
 One has to cope with change! (Data and workloads change in real time; hence it is preferable to give the responsibility of optimization to the DBMS.)


…CON’T
 Example: Consider relations r(A,B) and s(C,D). We require r × s.
 Method 1
a) Load the next record of r into RAM.
b) Load all records of s, one at a time, and concatenate each with the record of r.
c) All records of r concatenated?
> NO: go to a.
> YES: exit (the result is in RAM or on disk).
 Performance: too many accesses.
 Method 2: Improvement
a) Load as many blocks of r as possible, leaving room for one block of s.
b) Run through the s file completely, one block at a time.
 Performance: reduces the number of times the s blocks are loaded by a factor equal to the number of r records that can fit in main memory.
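A minimal sketch of Method 2, assuming Python lists stand in for the files and chunk_size plays the role of the buffer space available for r (everything here is illustrative, not the slide's notation):

def block_nested_loop_product(r, s, chunk_size):
    """Compute r x s, scanning s once per chunk of r instead of once per record."""
    result = []
    for i in range(0, len(r), chunk_size):
        r_chunk = r[i:i + chunk_size]         # load as many records of r as fit
        for s_rec in s:                       # run through the whole s file
            for r_rec in r_chunk:
                result.append(r_rec + s_rec)  # concatenate the two tuples
    return result

r = [("a1", "b1"), ("a2", "b2"), ("a3", "b3")]   # r(A, B)
s = [("c1", "d1"), ("c2", "d2")]                 # s(C, D)
print(block_nested_loop_product(r, s, chunk_size=2))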
 Considerations during query Optimization:
 Narrow down intermediate result sets quickly. SELECT before JOIN
 Use access structures (indexes).



Approaches to Query Optimization
 Heuristics Approach
Uses the knowledge of the characteristics of the relational
algebra operations and the relationship between the
operators to optimize the query.
Thus the heuristic approach of optimization will make use
of:
 Properties of individual operators
 Association between operators
 Query Tree: a graphical representation of the operators,
relations, attributes and predicates and processing
sequence during query processing.
…CON’T
 A query tree is composed of three main parts:
> The leaves: the base relations used for processing the query / extracting the required information
> The root: the final result/relation produced as output by the operations on the relations used for query processing
> Nodes: intermediate results or relations before reaching the final result
 The sequence of execution of operations in a query tree starts from the leaves, continues through the intermediate nodes, and ends at the root.
 The properties of each operation and the associations between operators are analyzed using a set of rules called TRANSFORMATION RULES.
 Use of the transformation rules transforms the query into a relatively good execution strategy.


Transformation Rules for Relational Algebra

 1. Cascade of SELECTION: conjunctive selection operations can be cascaded into individual selection operations and vice versa
σ(c1 ∧ c2 ∧ c3)(R) = σc1(σc2(σc3(R))) where each ci is a predicate
 2. Commutativity of SELECTION operations
σc1(σc2(R)) = σc2(σc1(R)) where each ci is a predicate
 3. Cascade of PROJECTION: in a sequence of projection operations, only the last in the sequence is required
πL1(πL2(πL3(πL4(R)))) = πL1(R)
 4. Commutativity of SELECTION with PROJECTION and vice versa
 If the predicate c1 involves only the attributes in the projection list (L1), then the selection and projection operations commute
π<a1,a2,…,an>(σc1(R)) = σc1(π<a1,a2,…,an>(R))
where the attributes of c1 ∈ {a1, a2, …, an}


…CON’T
 5. Commutativity of THETA JOIN / Cartesian Product
R × S is equivalent to S × R
 Also holds for Equi-Join and Natural-Join
(R ⋈c1 S) = (S ⋈c1 R)
 6. Commutativity of SELECTION with THETA JOIN
 a. If the predicate c1 involves only attributes of one of the relations (R) being joined, then the Selection and Join operations commute
σc1(R ⋈c S) = (σc1(R)) ⋈c S
 b. If the predicate is of the form c1 ∧ c2, where c1 involves only attributes of R and c2 involves only attributes of S, then the Selection and Theta Join operations commute
σ(c1 ∧ c2)(R ⋈c S) = (σc1(R)) ⋈c (σc2(S))
…CON’T
 7. Commutativity of PROJECTION and THETA JOIN
 If the projection list is of the form L1, L2, where L1 involves only attributes of R and L2 involves only attributes of S being joined, and the join predicate c involves only attributes in the projection list, then the PROJECTION and JOIN operations commute as:
π<L1,L2>(R ⋈c S) = (πL1(R)) ⋈c (πL2(S))
 However, if the join condition c contains additional attributes not in the projection list L = L1 ∪ L2, say M = M1 ∪ M2 where M1 is from R and M2 is from S, then the PROJECTION and JOIN operations commute as follows:
π<L1,L2>(R ⋈c S) = π<L1,L2>((π<L1,M1>(R)) ⋈c (π<L2,M2>(S)))
 8. Commutativity of the Set Operations: holds for UNION and INTERSECTION but not for SET DIFFERENCE
R ∩ S = S ∩ R and R ∪ S = S ∪ R


…CON’T
 9. Associativity of THETA JOIN, CARTESIAN PRODUCT, UNION and INTERSECTION
(R θ S) θ T = R θ (S θ T), where θ is one of these operations
 10. Commuting SELECTION with SET OPERATIONS
σc(R θ S) = (σc(R)) θ (σc(S)), where θ is one of the set operations (∪, ∩, −)
 11. Commuting PROJECTION with UNION
πL1(S ∪ R) = πL1(S) ∪ πL1(R)
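These equivalences can be sanity-checked on toy data. Below is a small sketch (relations and helper functions are made up for illustration) verifying rule 6a, i.e. σc1(R ⋈ S) = (σc1(R)) ⋈ S when c1 touches only attributes of R:

R = [{"branchNo": "B1", "position": "Manager"},
     {"branchNo": "B2", "position": "Assistant"}]
S = [{"branchNo": "B1", "city": "Addis"},
     {"branchNo": "B2", "city": "Bahir Dar"}]

def join(r, s):
    """Equi-join on branchNo."""
    return [{**x, **y} for x in r for y in s if x["branchNo"] == y["branchNo"]]

def select(rel, pred):
    return [t for t in rel if pred(t)]

c1 = lambda t: t["position"] == "Manager"   # involves only attributes of R

left  = select(join(R, S), c1)              # sigma_c1(R join S)
right = join(select(R, c1), S)              # (sigma_c1(R)) join S
print(left == right)                        # True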


…CON’T
 The heuristic approach is implemented by applying the above transformation rules in the following sequence of steps.
 Sequence for Applying Transformation Rules
 1. Use
 Rule 1: cascade of SELECTION
 2. Use
 Rule 2: commutativity of SELECTION
 Rule 4: commuting SELECTION with PROJECTION
 Rule 6: commuting SELECTION with JOIN and CARTESIAN PRODUCT
 Rule 10: commuting SELECTION with SET OPERATIONS
…CON’T
 3. Use
 Rule 9: associativity of binary operations (JOIN, CARTESIAN PRODUCT, UNION and INTERSECTION). Rearrange the nodes so that the most restrictive operations are performed first (moving them as far down the tree as possible)
 4. Combine each CARTESIAN PRODUCT with the subsequent SELECTION operation (into a JOIN)
 5. Use
 Rule 3: cascade of PROJECTION
 Rule 4: commuting PROJECTION with SELECTION
 Rule 7: commuting PROJECTION with JOIN and CARTESIAN PRODUCT

Main Heuristic
 The main heuristic is to first apply operations that reduce the size
(the cardinality and/or the degree) of the intermediate relation.
 That is:
 a. Perform SELECTION as early as possible: that will reduce the cardinality (number of tuples) of the
relation.
 b. Perform PROJECTION as early as possible: that will reduce the degree (number of attributes) of the
relation.
> Both a and b will be accomplished by placing the SELECT and PROJECT operations
as far down the tree as possible.
 c. SELECT and JOIN operations with the most restrictive conditions (those producing the smallest result relations) should be executed before other similar operations.
 This is achieved by reordering the JOIN nodes.


…CON’T
 Example: consider the following schemas and the query, where the
EMPLOYEE and the PROJECT relations are related by the WORKS_ON
relation.
 EMPLOYEE (EEmpID, FName, LName, Salary, Dept, Sex, DoB)
 PROJECT (PProjID, PName, PLocation, PFund, PManagerID)
 WORKS_ON (WEmpID, WProjID)
 WEmpID (referring to employee identification) and WProjID (referring to project identification) are foreign keys in the WORKS_ON relation, referencing the EMPLOYEE and PROJECT relations respectively.


…CON’T
 Query: The manager of the company working on road construction would like to view the names of employees born before January 1, 1965 who are working on the project named Ring Road.
 The relational algebra representation of the query will be:
π<FName, LName> (σ<DoB < Jan 1 1965 ∧ WEmpID = EEmpID ∧ PProjID = WProjID ∧ PName = 'Ring Road'> (EMPLOYEE × WORKS_ON × PROJECT))
…CON’T
 The SQL equivalent of the above query will be:
 SELECT FName, LName FROM EMPLOYEE, WORKS_ON, PROJECT WHERE DoB < 'Jan 1 1965' AND EEmpID = WEmpID AND WProjID = PProjID AND PName = 'Ring Road'
 The initial query tree will be:


…CON’T
 By applying the first step (cascading the selection) we come up with the following expression:
σ(DoB<Jan 1 1965)(σ(WEmpID=EEmpID)(σ(PProjID=WProjID)(σ(PName='Ring Road')(EMPLOYEE × WORKS_ON × PROJECT))))
 By applying the second step, it can be seen that some conditions involve attributes that belong to a single relation (DoB belongs to EMPLOYEE and PName belongs to PROJECT), so those selection operations can be commuted with the Cartesian product.
 Then, since the condition WEmpID=EEmpID relates the EMPLOYEE and WORKS_ON relations, the selection with this condition can be pushed down over their Cartesian product:
σ(PProjID=WProjID)((σ(PName='Ring Road')(PROJECT)) × (σ(WEmpID=EEmpID)(WORKS_ON × (σ(DoB<Jan 1 1965)(EMPLOYEE)))))


…CON’T
 The query tree after this modification will be:

π<FName, LName>
 └─ σ(PProjID = WProjID)
     └─ ×
         ├─ σ(PName = 'Ring Road')
         │   └─ PROJECT
         └─ σ(WEmpID = EEmpID)
             └─ ×
                 ├─ WORKS_ON
                 └─ σ(DoB < Jan 1 1965)
                     └─ EMPLOYEE


…CON’T
 Using the third step, perform the most restrictive operations first.
 From the query given we can see that the selection on PROJECT is more restrictive than the selection on EMPLOYEE.
 Thus, it is better to perform the selection on PROJECT before the one on EMPLOYEE; the nodes are rearranged to achieve this.


The query tree after rearranging the nodes:

π<FName, LName>
 └─ σ(WEmpID = EEmpID)
     └─ ×
         ├─ σ(DoB < Jan 1 1965)
         │   └─ EMPLOYEE
         └─ σ(PProjID = WProjID)
             └─ ×
                 ├─ σ(PName = 'Ring Road')
                 │   └─ PROJECT
                 └─ WORKS_ON


 Using the fourth step, perform join operations by combining each Cartesian product with the subsequent selection operation:

π<FName, LName>
 └─ ⋈ (WEmpID = EEmpID)
     ├─ ⋈ (PProjID = WProjID)
     │   ├─ σ(PName = 'Ring Road')
     │   │   └─ PROJECT
     │   └─ WORKS_ON
     └─ σ(DoB < Jan 1 1965)
         └─ EMPLOYEE


 Using the fifth step, perform the projections as early as possible:

π<FName, LName>
 └─ ⋈ (WEmpID = EEmpID)
     ├─ π<WEmpID>
     │   └─ ⋈ (PProjID = WProjID)
     │       ├─ π<PProjID>
     │       │   └─ σ(PName = 'Ring Road')
     │       │       └─ PROJECT
     │       └─ WORKS_ON
     └─ π<FName, LName, EEmpID>
         └─ σ(DoB < Jan 1 1965)
             └─ EMPLOYEE
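To make the final plan concrete, here is a small sketch (toy data and simplified DoB values, not from the slides) that evaluates the optimized tree bottom-up in the same order: selections first, early projections, then the two joins.

EMPLOYEE = [{"EEmpID": 1, "FName": "Abebe", "LName": "Kebede", "DoB": 1960},
            {"EEmpID": 2, "FName": "Sara",  "LName": "Tadesse", "DoB": 1980}]
PROJECT  = [{"PProjID": 10, "PName": "Ring Road"},
            {"PProjID": 20, "PName": "Dam"}]
WORKS_ON = [{"WEmpID": 1, "WProjID": 10}, {"WEmpID": 2, "WProjID": 20}]

# Leaves: restrictive selections with early projections
proj = [{"PProjID": p["PProjID"]} for p in PROJECT if p["PName"] == "Ring Road"]
emp  = [{"FName": e["FName"], "LName": e["LName"], "EEmpID": e["EEmpID"]}
        for e in EMPLOYEE if e["DoB"] < 1965]

# Join the reduced PROJECT with WORKS_ON, keeping only WEmpID
mid = [{"WEmpID": w["WEmpID"]}
       for p in proj for w in WORKS_ON if p["PProjID"] == w["WProjID"]]

# Final join with the reduced EMPLOYEE, then the root projection
result = [{"FName": e["FName"], "LName": e["LName"]}
          for m in mid for e in emp if m["WEmpID"] == e["EEmpID"]]
print(result)    # [{'FName': 'Abebe', 'LName': 'Kebede'}]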


Cost Estimation Approach to Query Optimization
 The main idea is to minimize the cost of processing a query.
 The cost function is comprised of:
 I/O cost + CPU processing cost + communication cost + Storage cost
 These components might have different weights in different processing environments.
 The DBMS will use information stored in the system catalogue for the purpose of
estimating cost.
 The main target of query optimization is to minimize the size of the intermediate
relation.
 The size will have an effect on the cost of:
 Disk access
 Data transportation
 Storage space in primary memory
 Writing to disk
…CON’T
 The statistics in the system catalogue used for cost
estimation purpose are:
 Cardinality of a relation: the number of tuples contained in a relation
currently (r)
 Degree of a relation: number of attributes of a relation
 Number of tuples on a relation that can be stored in one block of
memory
 Total number of blocks used by a relation
 Number of distinct values of an attribute (d)
 Selection cardinality of an attribute (S): the average number of records that will satisfy an equality condition on that attribute, S = r/d
 Using the above information, one can calculate the cost of executing a query and select the best strategy, i.e. the one with the minimum processing cost.
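A small sketch of how these statistics feed a cost estimate (the numbers and the index cost formula are illustrative simplifications, not from the slides):

# Catalogue statistics for a hypothetical relation and attribute.
r  = 10_000          # cardinality: current number of tuples
d  = 50              # distinct values of the attribute being tested
bf = 20              # tuples per block (blocking factor)
b  = r // bf         # total blocks used by the relation

S = r / d                        # selection cardinality: avg tuples matching an equality
cost_full_scan = b               # no index: read every block
cost_with_index = 1 + S / bf     # rough estimate: traverse the index, then read ~S/bf blocks

print(S, cost_full_scan, round(cost_with_index, 1))   # 200.0 500 11.0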
Cost Components for Query Optimization
 The costs of query execution can be calculated for the following major processes involved in query processing.
1) Access Cost of Secondary Storage
 Data is accessed from secondary storage, since a query will need some part of the data stored in the database.
 The disk access cost can again be analyzed in terms of:
> Searching
> Reading, and
> Writing, data blocks used to store some portion of a
relation.
 The disk access cost will vary depending on the file organization
used and the access method implemented for the file
organization.
…CON’T
2) Storage Cost
 While processing a query, as any query would be
composed of many database operations, there
could be one or more intermediate results before
reaching the final output.
 These intermediate results should be stored in
primary memory for further processing.
 The bigger the intermediate relation, the larger the memory requirement, which has an impact on the limited available space.


…CON’T
3) Computation Cost
 Query is composed of many operations.
 The operations could be database operations like reading and writing to a disk, or mathematical
and other operations like:
> Searching
> Sorting
> Merging
> computation on field values
4) Communication Cost
 In most database systems the database resides at one site, while queries originate from different terminals.
 This has an impact on the performance of the system, adding cost for query processing.
 Thus, the cost of transporting data between the database site and the terminal from which the query originates should be analyzed.


Pipelining
 Pipelining is another method used for query
optimization.
 It is sometimes referred to as on-the-fly processing of queries.
 While query optimization tries to reduce the size of the intermediate results, pipelining goes further by applying several conditions to a single intermediate result in one pass.
 The technique is thus said to reduce the number of intermediate relations in query execution.
 Pipelining performs multiple operations on a single relation in a pipeline.
…CON’T
 Ex: Let's say we have a relation on employees with the following schema:
 Employee(ID, FName, LName, DoB, Salary, Position, Dept)
 If a query would like to extract supervisors with salary greater than 2000, the relational algebra representation of the query will be:
 σ(Salary>2000) ∧ (Position='Supervisor') (Employee)
 After reading the relation into memory, the system could perform the operation by cascading the SELECT operation.
…CON’T
 1) Approach One
 σ(Salary>2000)(σ(Position='Supervisor')(Employee))
 Using this approach we will have the following relations:
 Employee
 The relation created by the operation:
R1 = σ(Position='Supervisor')(Employee)
 The resulting relation of the operation:
R2 = σ(Salary>2000)(R1)
 2) Approach Two
 One can select a single tuple from the relation Employee, perform both tests in a pipeline, and build the final relation at once.
 This is what is called PIPELINING
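A minimal sketch of the two approaches in Python (not from the slides): Approach One materializes the intermediate relation R1, while Approach Two streams each tuple through both tests with a generator, so no intermediate relation is ever built.

Employee = [{"ID": 1, "Position": "Supervisor", "Salary": 2500},
            {"ID": 2, "Position": "Supervisor", "Salary": 1800},
            {"ID": 3, "Position": "Clerk",      "Salary": 3000}]

# Approach One: materialize the intermediate relation R1, then filter it again.
R1 = [t for t in Employee if t["Position"] == "Supervisor"]
R2 = [t for t in R1 if t["Salary"] > 2000]

# Approach Two (pipelining): both tests applied to each tuple on the fly.
pipelined = (t for t in Employee
             if t["Position"] == "Supervisor" and t["Salary"] > 2000)

print(R2 == list(pipelined))    # True - same result, no intermediate relation materialized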