Chapter 1 Query Processing

This document summarizes key aspects of query processing and optimization in database systems. It discusses the main steps in query processing including parsing, optimization, code generation and execution. It also covers transformation rules for relational algebra operations, algorithms for sorting and implementing selection, join and other operations, and different approaches to query optimization including heuristic, cost-based and semantic approaches.

Uploaded by

Waal Mk
Copyright © All Rights Reserved

Advanced Database Systems

Chapter one

Query processing and Optimization


Contents
• Query Processing steps
• Transformation Rules for Relational Algebra Operations
• Basic algorithms for executing query operations -External sorting
• Implementing the SELECT Operation
• Implementing the JOIN Operation
• Factors affecting JOIN performance
• Algorithms for PROJECT and SET Operations
• Query Optimization
• Approaches to Query Optimization
o Heuristics Approach
o Cost-based query optimization
o Semantic Query Optimization
• Steps in Typical Heuristic optimization
Query Processing and Optimization
• Query processing: the range of activities involved in extracting data from a database.
• Query optimization: the process of choosing a suitable execution strategy for processing a query.
• Before a query is optimized, it is represented in an internal (intermediate) form using one of two data structures:
◦ Query tree: a tree data structure that corresponds to a relational algebra expression. The input relations of the query are the leaf nodes of the tree, and the relational algebra operations are the internal nodes.
◦ Query graph: a graph data structure that corresponds to a relational calculus expression. It does not indicate the order in which operations are performed. There is only a single graph corresponding to each query.
Query Processing (cont.…)

• Scanner: identifies the language tokens, such as SQL keywords, attribute names, and relation names, in the text of the query.
• Parser: checks the query syntax to determine whether the query is formulated according to the syntax rules of the query language.
• Validation: checks that all attribute and relation names are valid and semantically meaningful names in the schema of the particular database being queried.
Query Processing (cont…)
• Query optimization: the process of choosing a suitable execution strategy for processing a query.
◦ The query optimizer module has the task of producing an execution plan.
• Query code generator: generates the code to execute the plan.
• Runtime database processor: runs the query code, whether in compiled or interpreted mode.
◦ If a runtime error results, the runtime database processor generates an error message.
Query Processing (cont…)
• Query processing can be divided into four main phases:
◦ Decomposition: transforming a high-level query into relational algebra
◦ Optimization: choosing an efficient execution strategy
◦ Code generation: generating the code that implements the chosen strategy
◦ Execution: running the code
• Decomposition is the process of transforming a high-level query into a relational algebra query and checking that the query is syntactically and semantically correct. Query decomposition consists of parsing and validation.
Cont…
• Query block:
◦ The basic unit that can be translated into algebraic operators.
◦ A query block contains a single SELECT-FROM-WHERE expression, as well as GROUP BY and HAVING clauses if these are part of the block.
• Nested queries within a query are identified as separate query blocks.
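Cont…
Example: a nested query and its two query blocks.

SELECT LNAME, FNAME
FROM EMPLOYEE
WHERE SALARY > ( SELECT MAX (SALARY)
                 FROM EMPLOYEE
                 WHERE DNO = 5);

Outer block:
SELECT LNAME, FNAME
FROM EMPLOYEE
WHERE SALARY > C
Relational algebra: π_{LNAME, FNAME} (σ_{SALARY>C} (EMPLOYEE))

Inner block:
SELECT MAX (SALARY)
FROM EMPLOYEE
WHERE DNO = 5
Relational algebra: ℱ_{MAX SALARY} (σ_{DNO=5} (EMPLOYEE))

Here C stands for the value returned by the inner block.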
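The block-by-block evaluation above can be sketched in Python: the inner block is evaluated once to obtain the constant C, and the outer block then uses C. This is only a minimal sketch; the sample EMPLOYEE rows and the function names inner_block and outer_block are made up for illustration.

```python
# Hypothetical EMPLOYEE rows (illustrative data, not taken from the chapter).
EMPLOYEE = [
    {"LNAME": "Smith", "FNAME": "John", "SALARY": 30000, "DNO": 5},
    {"LNAME": "Wong", "FNAME": "Franklin", "SALARY": 40000, "DNO": 5},
    {"LNAME": "Wallace", "FNAME": "Jennifer", "SALARY": 43000, "DNO": 4},
    {"LNAME": "Borg", "FNAME": "James", "SALARY": 55000, "DNO": 1},
]

def inner_block(employee):
    """Inner block: MAX(SALARY) over the employees with DNO = 5."""
    return max(e["SALARY"] for e in employee if e["DNO"] == 5)

def outer_block(employee, c):
    """Outer block: (LNAME, FNAME) of the employees with SALARY > c."""
    return [(e["LNAME"], e["FNAME"]) for e in employee if e["SALARY"] > c]

C = inner_block(EMPLOYEE)           # the inner block is evaluated once
result = outer_block(EMPLOYEE, C)   # the outer block treats C as a constant
print(C, result)
```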
Transformation Rules for Relational Algebra Operations
• Relational algebra operations:
◦ Select (σ), Project (π), Join (⋈), Union (∪), Intersection (∩), Cartesian Product (×)
General transformation rules for relational algebra (useful in query optimization):
1. Cascade of σ: a conjunctive selection condition can be broken up into a cascade (sequence) of individual σ operations:
◦ σ_{c1 AND c2 AND ... AND cn}(R) = σ_{c1}(σ_{c2}(...(σ_{cn}(R))...))
2. Commutativity of σ: the σ operation is commutative:
◦ σ_{c1}(σ_{c2}(R)) = σ_{c2}(σ_{c1}(R))
3. Cascade of π: in a cascade (sequence) of π operations, all but the last (outermost) one can be ignored:
◦ π_{List1}(π_{List2}(...(π_{Listn}(R))...)) = π_{List1}(R)
Cont…
4. Commuting σ with π: if the selection condition c involves only the attributes A1, ..., An in the projection list, the two operations can be commuted:
◦ π_{A1, A2, ..., An}(σ_c(R)) = σ_c(π_{A1, A2, ..., An}(R))
5. Commutativity of ⋈ and ×: both operations are commutative:
◦ R ⋈_c S = S ⋈_c R
◦ R × S = S × R
Cont…
6. Commuting π with ⋈: if the projection list is L = {A1, ..., An, B1, ..., Bm}, where A1, ..., An are attributes of R and B1, ..., Bm are attributes of S, and the join condition c involves only attributes in L, the two operations can be commuted as follows:
◦ π_L(R ⋈_c S) = (π_{A1, ..., An}(R)) ⋈_c (π_{B1, ..., Bm}(S))
7. Converting a (σ, ×) sequence into ⋈: if the condition c of a σ that follows a × corresponds to a join condition, convert the (σ, ×) sequence into a ⋈ as follows:
◦ σ_c(R × S) = R ⋈_c S
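The transformation rules can be checked on small in-memory relations. The sketch below is an illustration only: relations are modelled as lists of dicts, and the helpers select, project, cartesian, join and same, as well as the sample data, are assumptions introduced here rather than part of the chapter.

```python
from itertools import product

# Toy relations as lists of dicts; names and values are illustrative only.
R = [{"a": 1, "x": 10}, {"a": 2, "x": 20}, {"a": 3, "x": 20}]
S = [{"b": 2, "y": "p"}, {"b": 3, "y": "q"}, {"b": 3, "y": "r"}]

def select(rel, cond):
    """sigma: keep the tuples that satisfy cond."""
    return [t for t in rel if cond(t)]

def project(rel, attrs):
    """pi: keep only attrs and remove duplicate tuples."""
    seen, out = set(), []
    for t in rel:
        row = tuple(t[a] for a in attrs)
        if row not in seen:
            seen.add(row)
            out.append(dict(zip(attrs, row)))
    return out

def cartesian(r, s):
    return [{**t, **u} for t, u in product(r, s)]

def join(r, s, cond):
    return [{**t, **u} for t in r for u in s if cond({**t, **u})]

def same(rel1, rel2):
    """Compare two relations while ignoring tuple and attribute order."""
    canon = lambda rel: sorted(tuple(sorted(t.items())) for t in rel)
    return canon(rel1) == canon(rel2)

c1 = lambda t: t["a"] > 1
c2 = lambda t: t["x"] == 20
jc = lambda t: t["a"] == t["b"]   # join condition

rule1 = same(select(R, lambda t: c1(t) and c2(t)), select(select(R, c2), c1))
rule2 = same(select(select(R, c2), c1), select(select(R, c1), c2))
rule3 = same(project(project(R, ["a", "x"]), ["x"]), project(R, ["x"]))
rule4 = same(project(select(R, c2), ["x"]), select(project(R, ["x"]), c2))
rule5 = same(join(R, S, jc), join(S, R, jc))
rule6 = same(project(join(R, S, jc), ["a", "b"]),
             join(project(R, ["a"]), project(S, ["b"]), jc))
rule7 = same(select(cartesian(R, S), jc), join(R, S, jc))
print(rule1, rule2, rule3, rule4, rule5, rule6, rule7)
```

A check on one data set does not prove a rule, but it makes the left-hand and right-hand sides of each equivalence concrete.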
Example
• Convert the following SQL query to relational algebra:

select salary
from employee
where salary < 5000;

Solution:
Strategy 1: σ_{salary<5000} (π_{salary} (employee))
Strategy 2: π_{salary} (σ_{salary<5000} (employee))
Basic algorithms for executing query operations
• Sorting is one of the primary algorithms used in query processing, e.g. to implement the ORDER BY clause.
• External sorting:
◦ Refers to sorting algorithms that are suitable for large files of records stored on disk that do not fit entirely in main memory, such as most database files.
• External sorting uses a sort-merge strategy:
◦ It starts by sorting small subfiles (runs) of the main file and then merges the sorted runs, creating larger sorted subfiles that are merged in turn.
◦ Sorting phase:
▪ Number of initial runs: nR = ⌈b/nB⌉
◦ Merging phase:
▪ Degree of merging: dM = min(nB − 1, nR)
▪ Number of passes: nP = ⌈log_dM(nR)⌉
Cont…
◦ nR: number of initial runs
◦ b: number of file blocks
◦ nB: available buffer space (in blocks)
◦ dM: degree of merging
◦ nP: number of passes
• The size of a run and the number of initial runs depend on the number of file blocks (b) and the available buffer space (nB).
• Example: if nB = 5 blocks and the size of the file is b = 1024 blocks,
◦ nR = ⌈b/nB⌉ = ⌈1024/5⌉ = 205 runs, each of size 5 blocks except the last run, which has 4 blocks.
◦ Hence, after the sort phase, 205 sorted runs are stored as temporary subfiles on disk.
Cont…
• In the merging phase, the sorted runs are merged during one or more passes.
• The degree of merging (dM) is the number of runs that can be merged in each pass.
◦ dM = min(nB − 1, nR)
• The number of passes is nP = ⌈log_dM(nR)⌉.
• In each pass, one buffer block is needed to hold one block from each of the runs being merged, and one more block is needed to hold one block of the merge result.
• In the above example, dM = 4 (four-way merging).
• Hence, the 205 initial sorted runs are merged into:
◦ 52 runs at the end of the first pass
◦ 13 runs at the end of the second pass
◦ 4 runs at the end of the third pass
◦ 1 run at the end of the fourth pass
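The formulas for nR, dM and the number of passes can be turned into a small calculation. This is a sketch under the definitions given above; the function name external_sort_plan is made up, and the passes are counted by repeated merging rather than by a floating-point logarithm to avoid rounding problems.

```python
import math

def external_sort_plan(b, nB):
    """Return (nR, dM, passes, runs_after_each_pass) for an external sort-merge.

    b  -- number of file blocks
    nB -- available buffer space in blocks (at least 3 is assumed here)
    """
    if nB < 3:
        raise ValueError("need at least 3 buffer blocks to merge two runs")
    nR = math.ceil(b / nB)       # number of initial sorted runs
    dM = min(nB - 1, nR)         # degree of merging
    runs, passes, history = nR, 0, []
    while runs > 1:              # each pass merges dM runs into one
        runs = math.ceil(runs / dM)
        passes += 1
        history.append(runs)
    return nR, dM, passes, history

print(external_sort_plan(1024, 5))
```

For b = 1024 and nB = 5 this reproduces the worked example: 205 initial runs, four-way merging, and 52, 13, 4 and 1 runs after the successive passes.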
Cont…
Exercise
• A file of 4096 blocks is to be sorted with an available buffer space of 64 blocks. How many passes will be needed in the merge phase of the external sort-merge algorithm?
Implementing the SELECT Operation
• There are many options for executing a SELECT operation.
• Some options depend on the file having specific access paths and may apply only to certain types of selection conditions.
Examples:
◦ (OP1): σ_{SSN='123456789'} (EMPLOYEE)
◦ (OP2): σ_{DNUMBER>5} (DEPARTMENT)
◦ (OP3): σ_{DNO=5} (EMPLOYEE)
◦ (OP4): σ_{DNO=5 AND SALARY>30000 AND SEX='F'} (EMPLOYEE)
◦ (OP5): σ_{ESSN='123456789' AND PNO=10} (WORKS_ON)
Use the following logical model to understand the discussions in this chapter.
[Figure: logical model; not reproduced here]
Cont…
• Search methods for implementing simple selection:
◦ S1 Linear search (brute force):
▪ Retrieve every record in the file and test whether its attribute values satisfy the selection condition.
◦ S2 Binary search:
▪ If the selection condition involves an equality comparison on a key attribute on which the file is ordered, binary search (which is more efficient than linear search) can be used.
▪ An example is OP1 if SSN is the ordering attribute of the EMPLOYEE file.
◦ S3 Using a primary index or hash key to retrieve a single record:
▪ If the selection condition involves an equality comparison on a key attribute with a primary index (or a hash key), use the primary index (or the hash key) to retrieve the record.
▪ For example, OP1 can use a primary index on SSN to retrieve the record.
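Methods S1 to S3 can be contrasted on an in-memory stand-in for the EMPLOYEE file. The sketch below is illustrative only: the sample records are invented, a sorted Python list plays the role of a file ordered on SSN, and a dict plays the role of a hash key.

```python
from bisect import bisect_left

# Hypothetical EMPLOYEE file, physically ordered on the key attribute SSN.
EMPLOYEE = [
    ("123456789", "Smith"),
    ("333445555", "Wong"),
    ("453453453", "English"),
    ("987654321", "Wallace"),
]
SSN_KEYS = [rec[0] for rec in EMPLOYEE]        # ordering key values
SSN_HASH = {rec[0]: rec for rec in EMPLOYEE}   # stands in for a hash key

def s1_linear(ssn):
    """S1: test every record in turn."""
    for rec in EMPLOYEE:
        if rec[0] == ssn:
            return rec
    return None

def s2_binary(ssn):
    """S2: binary search on the ordering key."""
    i = bisect_left(SSN_KEYS, ssn)
    if i < len(SSN_KEYS) and SSN_KEYS[i] == ssn:
        return EMPLOYEE[i]
    return None

def s3_hash(ssn):
    """S3: direct lookup through the hash key."""
    return SSN_HASH.get(ssn)

print(s1_linear("123456789"), s2_binary("123456789"), s3_hash("123456789"))
```

All three return the same record for OP1; they differ only in how many records must be examined to find it.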
Cont…
• Search methods for implementing simple selection (continued):
◦ S4 Using a primary index to retrieve multiple records:
▪ If the comparison condition is >, ≥, <, or ≤ on a key field with a primary index, use the index to find the record satisfying the corresponding equality condition, then retrieve all subsequent (or preceding) records in the ordered file (see OP2).
◦ S5 Using a clustering index to retrieve multiple records:
▪ If the selection condition involves an equality comparison on a non-key attribute with a clustering index, use the clustering index to retrieve all the records satisfying the selection condition (see OP3).
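A rough sketch of S4 and S5 follows. It assumes a DEPARTMENT file ordered on DNUMBER, where a bisect lookup stands in for the primary index, and a dict from DNO to records standing in for a clustering index on EMPLOYEE. The data and function names are invented for illustration.

```python
from bisect import bisect_right
from collections import defaultdict

# Hypothetical DEPARTMENT file, ordered on the key DNUMBER.
DEPARTMENT = [(1, "Headquarters"), (4, "Administration"), (5, "Research"),
              (7, "Sales"), (9, "Support")]
DNUMBERS = [d[0] for d in DEPARTMENT]

def s4_greater_than(value):
    """S4 for DNUMBER > value: locate the boundary, then read onward."""
    start = bisect_right(DNUMBERS, value)
    return DEPARTMENT[start:]

# Hypothetical EMPLOYEE records (LNAME, DNO); DNO is a non-key attribute.
EMPLOYEE = [("Borg", 1), ("Zelaya", 4), ("Smith", 5), ("Wong", 5)]
CLUSTERING_INDEX = defaultdict(list)   # stands in for a clustering index on DNO
for rec in EMPLOYEE:
    CLUSTERING_INDEX[rec[1]].append(rec)

def s5_equal(dno):
    """S5 for DNO = dno: fetch every record with that value."""
    return CLUSTERING_INDEX.get(dno, [])

print(s4_greater_than(5))   # OP2
print(s5_equal(5))          # OP3
```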
Cont…
◦ S6 Using a secondary index on an equality comparison:
▪ This search method can be used to retrieve a single record if the indexing field is a key (has unique values), or to retrieve multiple records if the indexing field is not a key.
• Search methods for implementing complex selection:
◦ S7 Conjunctive selection using an individual index:
▪ If an attribute involved in any single simple condition in the conjunctive condition has an access path that permits the use of one of the methods S2 to S6, use that condition to retrieve the records, and then check whether each retrieved record satisfies the remaining simple conditions in the conjunctive condition.
Cont…
• Whenever a single condition specifies the selection, we can only check whether an access path exists on the attribute involved in that condition.
◦ If an access path exists, the method corresponding to that access path is used.
◦ Otherwise, the brute-force linear search approach of method S1 is used.
• For conjunctive selection conditions, whenever more than one of the attributes involved in the conditions has an access path, query optimization should be done to choose the access path that retrieves the fewest records in the most efficient way.
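The idea of picking the access path that retrieves the fewest records and then checking the remaining conditions can be sketched as follows for OP4. The records, the dict-based indexes and the function conjunctive_select are assumptions made for this illustration; a real optimizer would estimate the counts from catalog statistics instead of looking them up.

```python
from collections import defaultdict

# Hypothetical EMPLOYEE records.
EMPLOYEE = [
    {"SSN": "1", "DNO": 5, "SALARY": 40000, "SEX": "F"},
    {"SSN": "2", "DNO": 5, "SALARY": 25000, "SEX": "F"},
    {"SSN": "3", "DNO": 5, "SALARY": 38000, "SEX": "M"},
    {"SSN": "4", "DNO": 4, "SALARY": 43000, "SEX": "F"},
    {"SSN": "5", "DNO": 4, "SALARY": 25000, "SEX": "M"},
    {"SSN": "6", "DNO": 1, "SALARY": 55000, "SEX": "M"},
    {"SSN": "7", "DNO": 5, "SALARY": 31000, "SEX": "M"},
]

def build_index(records, attr):
    index = defaultdict(list)
    for rec in records:
        index[rec[attr]].append(rec)
    return index

# Assumed access paths: one index on DNO and one on SEX.
INDEXES = {attr: build_index(EMPLOYEE, attr) for attr in ("DNO", "SEX")}

def conjunctive_select(equalities, residual):
    """equalities: {attr: value} conditions; residual: remaining predicate."""
    indexed = [a for a in equalities if a in INDEXES]
    if indexed:
        # Choose the indexed condition that retrieves the fewest records.
        chosen = min(indexed,
                     key=lambda a: len(INDEXES[a].get(equalities[a], [])))
        candidates = INDEXES[chosen].get(equalities[chosen], [])
    else:
        chosen, candidates = None, EMPLOYEE   # fall back to linear search (S1)
    rows = [r for r in candidates
            if all(r[a] == v for a, v in equalities.items()) and residual(r)]
    return chosen, rows

# OP4: DNO = 5 AND SALARY > 30000 AND SEX = 'F'
chosen, rows = conjunctive_select({"DNO": 5, "SEX": "F"},
                                  lambda r: r["SALARY"] > 30000)
print(chosen, [r["SSN"] for r in rows])
```

With this data the SEX index returns three candidates and the DNO index four, so the SEX condition is used for retrieval and the other two conditions are checked on each candidate.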
Cont…
• Disjunctive selection conditions: simple conditions are connected by the OR logical connective rather than by AND.
◦ Compared to a conjunctive selection, a disjunctive selection is much harder to process and optimize.
▪ Example: σ_{DNO=5 OR SALARY>3000 OR SEX='F'} (EMPLOYEE)
◦ Little optimization can be done, because the records satisfying the disjunctive condition are the union of the records satisfying the individual conditions.
◦ Hence, if any one of the individual conditions does not have an access path, we are compelled to use the brute-force approach.
Implementing the JOIN Operation
◦ The join operation is one of the most time-consuming operations in query processing.
◦ Join:
▪ Two-way join: a join on two files
▪ e.g. R ⋈_{A=B} S
▪ Multi-way join: a join involving more than two files
▪ e.g. R ⋈_{A=B} S ⋈_{C=D} T
• Examples:
◦ (OP6): EMPLOYEE ⋈_{DNO=DNUMBER} DEPARTMENT
◦ (OP7): DEPARTMENT ⋈_{MGRSSN=SSN} EMPLOYEE
Implementing the JOIN Operation (cont…)
• Methods for implementing joins:
◦ J1 Nested-loop join (brute force):
▪ For each record t in R (outer loop), retrieve every record s from S (inner loop) and test whether the two records satisfy the join condition t[A] = s[B].
◦ J2 Single-loop join (using an access structure to retrieve the matching records):
▪ If an index (or hash key) exists for one of the two join attributes, say B of S, retrieve each record t in R, one at a time, and then use the access structure to retrieve directly all matching records s from S that satisfy s[B] = t[A].
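J1 and J2 can be compared on OP6 with a small sketch. The tuples below are invented sample data, and a dict built on DNUMBER stands in for an existing index on the join attribute of the inner file.

```python
from collections import defaultdict

# Hypothetical records: EMPLOYEE is (LNAME, DNO), DEPARTMENT is (DNUMBER, DNAME).
EMPLOYEE = [("Smith", 5), ("Wong", 5), ("Zelaya", 4), ("Borg", 1), ("Jabbar", 8)]
DEPARTMENT = [(1, "Headquarters"), (4, "Administration"), (5, "Research")]

def nested_loop_join(R, S):
    """J1: compare every record of R with every record of S."""
    result = []
    for t in R:                      # outer loop
        for s in S:                  # inner loop
            if t[1] == s[0]:         # join condition DNO = DNUMBER
                result.append(t + s)
    return result

def single_loop_join(R, S):
    """J2: loop over R only and probe an access structure on S."""
    index = defaultdict(list)        # stands in for an index on S.DNUMBER
    for s in S:
        index[s[0]].append(s)
    result = []
    for t in R:
        for s in index.get(t[1], []):
            result.append(t + s)
    return result

print(nested_loop_join(EMPLOYEE, DEPARTMENT))
```

Both functions return the same tuples; J1 performs |R| × |S| comparisons, while J2 performs one index probe per record of R.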
Implementing the JOIN Operation (cont…)
• Methods for implementing joins:
◦ J3 Sort-merge join:
▪ If the records of R and S are physically sorted (ordered) by the values of the join attributes A and B, respectively, we can implement the join in the most efficient way possible.
▪ Both files are scanned in order of the join attributes, matching the records that have the same values for A and B.
▪ In this method, the records of each file are scanned only once for matching with the other file, unless both A and B are non-key attributes.
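A minimal sketch of the sort-merge join follows. It assumes in-memory lists of tuples and sorts them first (the description above assumes the files are already sorted); the function name and the sample data are illustrative. The inner handling of groups covers the case where the join attributes are non-key on both sides.

```python
def sort_merge_join(R, S, a, b):
    """Join R and S on R[a] = S[b], where a and b are tuple positions."""
    R = sorted(R, key=lambda t: t[a])   # a no-op if R is already ordered on a
    S = sorted(S, key=lambda s: s[b])
    out, i, j = [], 0, 0
    while i < len(R) and j < len(S):
        if R[i][a] < S[j][b]:
            i += 1
        elif R[i][a] > S[j][b]:
            j += 1
        else:
            value = R[i][a]
            j_end = j                    # find the group of S with this value
            while j_end < len(S) and S[j_end][b] == value:
                j_end += 1
            while i < len(R) and R[i][a] == value:
                for k in range(j, j_end):   # pair with the whole S group
                    out.append(R[i] + S[k])
                i += 1
            j = j_end
    return out

# Hypothetical data with duplicate join values on both sides.
EMPLOYEE = [("Smith", 5), ("Borg", 1), ("Wong", 5), ("Zelaya", 4), ("Jabbar", 8)]
DEPARTMENT = [(5, "Research"), (4, "Administration"), (1, "Headquarters"),
              (5, "Research-Annex")]
joined = sort_merge_join(EMPLOYEE, DEPARTMENT, 1, 0)
print(joined)
```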
Implementing the JOIN Operation (cont…)
• Factors affecting JOIN performance:
◦ Available buffer space
◦ Join selection factor
◦ Choice of inner vs. outer relation
Algorithms for PROJECT and SET Operations
• Algorithm for the PROJECT operation: π_{<attribute list>}(R)
1. If <attribute list> includes a key of relation R, extract all tuples from R with only the values for the attributes in <attribute list>.
2. If <attribute list> does NOT include a key of relation R, duplicate tuples must be removed from the result.
• Methods to remove duplicate tuples:
1. Sorting
2. Hashing
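The two duplicate-elimination methods can be sketched as below. This is an in-memory illustration with invented data: a Python set plays the role of the hash table, and sorted plays the role of an external sort.

```python
# Hypothetical EMPLOYEE records; (DNO, SEX) does not include a key.
EMPLOYEE = [
    {"SSN": "1", "DNO": 5, "SEX": "F"},
    {"SSN": "2", "DNO": 5, "SEX": "F"},
    {"SSN": "3", "DNO": 5, "SEX": "M"},
    {"SSN": "4", "DNO": 4, "SEX": "F"},
]

def project_hash(rel, attrs):
    """Duplicate removal by hashing: keep a row only the first time it is seen."""
    seen, out = set(), []
    for t in rel:
        row = tuple(t[a] for a in attrs)
        if row not in seen:
            seen.add(row)
            out.append(row)
    return out

def project_sort(rel, attrs):
    """Duplicate removal by sorting: equal rows become adjacent."""
    rows = sorted(tuple(t[a] for a in attrs) for t in rel)
    out = []
    for row in rows:
        if not out or out[-1] != row:
            out.append(row)
    return out

print(project_hash(EMPLOYEE, ["DNO", "SEX"]))
print(project_sort(EMPLOYEE, ["DNO", "SEX"]))
```

When the attribute list includes a key such as SSN, no two projected rows can be equal, so the duplicate check could be skipped.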
Query Optimization
• Given the database structure, the challenge of query optimization is to find the sequence of steps that produces the answer to a user request in the most efficient manner.
• The performance of a query is affected by the tables or queries that underlie it and by the complexity of the query.
• Given a request for data manipulation or retrieval, an optimizer chooses an optimal plan for evaluating the request from among the available alternative strategies.
• There are many ways (access paths) of accessing the desired file or record. The optimizer tries to select the most efficient (cheapest) access path for accessing the data.
• The DBMS is responsible for picking the best execution strategy based on various considerations.
Query Optimization (cont…)
• Example: consider relations r(A, B) and s(C, D).
◦ We want r × s.
• Method 1
a. Load the next record of r into RAM.
b. Load all records of s, one at a time, and concatenate each with the current record of r.
c. Have all records of r been concatenated?
▪ NO: go to a.
▪ YES: exit (the result is in RAM or on disk).
◦ Performance: too many accesses.
Query Optimization (cont…)
• Method 2: improvement
a. Load as many blocks of r as possible, leaving room for one block of s.
b. Run through the s file completely, one block at a time.
c. Repeat a and b until all blocks of r have been processed.
• Performance: reduces the number of times the s blocks are loaded by a factor equal to the number of r records that can fit in main memory.
• Considerations during query optimization:
◦ Narrow down intermediate result sets quickly: SELECT before JOIN.
◦ Use access structures (indexes).
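The difference between the two methods can be made concrete by counting how many times blocks of s are loaded. The sketch below is a simplified cost model with made-up parameter names: it assumes that Method 1 rereads the whole s file once per record of r and that Method 2 rereads it once per memory-sized chunk of r.

```python
import math

def s_block_loads_method1(r_records, s_blocks):
    """Method 1: the s file is read once for every record of r."""
    return r_records * s_blocks

def s_block_loads_method2(r_records, r_bfr, s_blocks, memory_blocks):
    """Method 2: the s file is read once for every chunk of r held in memory.

    r_bfr         -- records of r per block (blocking factor)
    memory_blocks -- buffer blocks available; one is reserved for s
    """
    r_blocks = math.ceil(r_records / r_bfr)
    chunk = memory_blocks - 1            # blocks of r loaded at a time
    return math.ceil(r_blocks / chunk) * s_blocks

# Hypothetical sizes: r has 1000 records at 10 per block, s has 50 blocks,
# and 11 buffer blocks are available (10 for r, 1 for s).
m1 = s_block_loads_method1(1000, 50)
m2 = s_block_loads_method2(1000, 10, 50, 11)
print(m1, m2, m1 // m2)
```

With these numbers 100 records of r fit in memory at a time, and the number of s block loads drops by that same factor of 100.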
Approaches to Query Optimization
• Heuristics approach
◦ The heuristic approach uses knowledge of the characteristics of the relational algebra operations and of the relationships between the operators to optimize the query.
◦ Thus the heuristic approach to optimization makes use of:
▪ Properties of individual operators
▪ Associations between operators
▪ Query tree: a graphical representation of the operators, relations, attributes, predicates, and processing sequence used during query processing.
Using Heuristics in Query Optimization
• Process for heuristic optimization:
1. The parser of a high-level query generates an initial internal representation.
2. Heuristic rules are applied to optimize the internal representation.
3. A query execution plan is generated to execute groups of operations based on the access paths available on the files involved in the query.
• The main heuristic is to apply first the operations that reduce the size of intermediate results.
◦ E.g., apply SELECT and PROJECT operations before applying JOIN or other binary operations.
Using Heuristics in Query Optimization (cont…)
• Heuristic optimization of query trees:
◦ The same query could correspond to many different relational algebra expressions, and hence to many different query trees.
◦ The task of heuristic optimization of query trees is to find a final query tree that is efficient to execute.
• Example: find the last names of all employees who work on the AQUARIUS project and were born after '1957-12-31'.
Q: SELECT LNAME
FROM EMPLOYEE, WORKS_ON, PROJECT
WHERE PNAME = 'AQUARIUS' AND
PNUMBER = PNO AND ESSN = SSN
AND BDATE > '1957-12-31';
Steps in Typical Heuristic Optimization
1. Deconstruct conjunctive selections into a sequence of single selection operations (transformation rule 1).
2. Move selection operations down the query tree for the earliest possible execution.
3. Execute first those selection and join operations that will produce the smallest relations.
4. Replace Cartesian product operations that are followed by a selection condition with join operations.
5. Deconstruct lists of projection attributes and move them as far down the tree as possible, creating new projections where needed.
Steps in converting a query tree during heuristic optimization
[Figures: five query trees, not reproduced here. Their captions, in order:]
1. Initial query tree for the query Q above
2. Moving the select operations down the tree
3. Applying the more restrictive select operation first
4. Replacing Cartesian product and select with join
5. Moving project operations down the query tree
Using Heuristics in Query Optimization
• Summary of heuristics for algebraic optimization:
1. The main heuristic is to apply first the operations that reduce the size of intermediate results.
2. Perform select operations as early as possible to reduce the number of tuples, and perform project operations as early as possible to reduce the number of attributes. (This is done by moving select and project operations as far down the tree as possible.)
3. The select and project operations that are most restrictive should be executed before other similar operations. (This is done by reordering the leaf nodes of the tree among themselves and adjusting the rest of the tree appropriately.)
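The effect of these heuristics on intermediate result sizes can be illustrated with the AQUARIUS query. The sketch below uses invented sample data and two hand-written evaluation orders: one forms the full Cartesian product and filters afterwards, the other applies the selections first and then joins. It only counts in-memory tuples and is not a model of a real optimizer.

```python
from itertools import product

# Hypothetical sample data.
EMPLOYEE = [  # (SSN, LNAME, BDATE)
    ("1", "Smith", "1965-01-09"),
    ("2", "Wong", "1955-12-08"),
    ("3", "English", "1972-07-31"),
    ("4", "Narayan", "1962-09-15"),
]
WORKS_ON = [("1", 10), ("2", 10), ("3", 20), ("4", 10), ("3", 10)]  # (ESSN, PNO)
PROJECT = [(10, "AQUARIUS"), (20, "PRODUCTX"), (30, "REORG")]       # (PNUMBER, PNAME)

def naive():
    """Cartesian product of all three relations, then one big selection."""
    inter = list(product(EMPLOYEE, WORKS_ON, PROJECT))
    rows = [e[1] for e, w, p in inter
            if p[1] == "AQUARIUS" and p[0] == w[1]
            and w[0] == e[0] and e[2] > "1957-12-31"]
    return sorted(rows), len(inter)

def heuristic():
    """Selections first, then joins on the reduced inputs."""
    proj = [p for p in PROJECT if p[1] == "AQUARIUS"]
    emp = [e for e in EMPLOYEE if e[2] > "1957-12-31"]
    pw = [(p, w) for p in proj for w in WORKS_ON if p[0] == w[1]]
    rows = [e[1] for (p, w) in pw for e in emp if w[0] == e[0]]
    return sorted(rows), max(len(proj), len(emp), len(pw))

naive_rows, naive_size = naive()
heur_rows, heur_size = heuristic()
print(naive_rows, naive_size)
print(heur_rows, heur_size)
```

Both orders give the same answer, but the largest intermediate result shrinks from 60 tuples to 4 on this tiny data set.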
Using Selectivity and Cost Estimates in Query Optimization
• Cost-based query optimization:
◦ Estimate and compare the costs of executing a query using different execution strategies, and choose the strategy with the lowest cost estimate.
• Cost components for query execution:
1. Access cost to secondary storage: the cost of transferring data blocks between secondary storage and main memory buffers.
2. Computation cost: the cost of performing in-memory operations on the records within the data buffers during query execution.
3. Memory usage cost: the cost pertaining to the number of main memory buffers needed during query execution.
4. Communication cost: the cost of shipping the query and its results from the database site to the site or terminal where the query originated (e.g., in distributed databases).
• Note: different database systems may focus on different cost components.
Using Selectivity and Cost Estimates in Query Optimization
• Catalog information used in cost functions:
◦ Information about the size of a file:
▪ Number of records (tuples) (r)
▪ Record size (R)
▪ Number of blocks (b)
▪ Blocking factor (bfr)
▪ b = ⌈r/bfr⌉
◦ Information about indexes and indexing attributes of a file:
▪ Number of levels (x) of each multilevel index
▪ Number of first-level index blocks (bI1)
▪ Number of distinct values (d) of an attribute
▪ Selectivity (sl) of an attribute
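A short sketch of how these catalog values are typically combined follows. The formula b = ⌈r/bfr⌉ is the one stated above. The estimates sl = 1/d and s = sl × r for an equality condition are a common textbook assumption (values uniformly distributed over the d distinct values), not something this chapter states, and the function names and numbers are illustrative.

```python
import math

def num_blocks(r, bfr):
    """b = ceil(r / bfr): blocks needed for r records at bfr records per block."""
    return math.ceil(r / bfr)

def equality_selectivity(d):
    """Assumed estimate: sl = 1/d for an equality condition (uniform values)."""
    return 1 / d

def selection_cardinality(r, d):
    """Assumed estimate of the number of matching records: s = sl * r."""
    return r * equality_selectivity(d)

# Hypothetical catalog entries for an EMPLOYEE file.
r, bfr, d_dno = 10000, 5, 125
print(num_blocks(r, bfr))               # blocks in the file
print(equality_selectivity(d_dno))      # selectivity of DNO = value
print(selection_cardinality(r, d_dno))  # expected matching records
```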
Semantic Query Optimization
• Semantic query optimization:
◦ Uses constraints specified on the database schema to modify one query into another query that is more efficient to execute.
• Consider the following SQL query:
SELECT E.LNAME, M.LNAME
FROM EMPLOYEE E, EMPLOYEE M
WHERE E.SUPERSSN = M.SSN AND
E.SALARY > M.SALARY
• Explanation:
◦ Suppose that we had a constraint on the database schema stating that no employee can earn more than his or her direct supervisor. If the semantic query optimizer checks for the existence of this constraint, it need not execute the query at all, because it knows that the result of the query will be empty.
