Chapter - 2 Query Processing
Systems(CoSc2042)
Chapter Two
Overview of Query Processing
Query processing: The activities involved in parsing, validating, optimizing, and executing a query.
Steps of Query Processing
1. Parsing and translation
2. Optimization
3. Evaluation
The DBMS has algorithms to implement relational algebra expressions.
SQL is a high-level, declarative language: a query specifies what is wanted, not how it is obtained.
Query optimization:
The activity of choosing an efficient execution
strategy for processing a query.
Task: Find an efficient physical query plan (aka
execution plan) for an SQL query
Goal: Minimize the evaluation time for the query, i.e.,
compute query result as fast as possible
Cost Factors: Disk accesses, read/write operations,
[I/O, page transfer] (CPU time is typically ignored)
Optimization: a query can usually be evaluated in more than one way, so the optimizer looks for an efficient evaluation plan among the equivalent alternatives.
Example:
Find all Managers who work at a London branch.
SELECT * FROM Staff s, Branch b
WHERE s.branchNo = b.branchNo
AND (s.position = 'Manager' AND b.city = 'London');
The equivalent relational algebra queries corresponding
to this SQL statement are:
The Different Strategies
Cost Comparison
Measured in disk accesses, Cartesian product and join operations are much more expensive than Selection.
We will see shortly that one of the fundamental strategies in query
processing is to perform the unary operations, Selection and Projection,
as early as possible, thereby reducing the operands of any subsequent
binary operations.
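A rough sketch of why the order of operations matters for the Managers-in-London query. The statistics below (1000 Staff tuples, 50 Branch tuples, 50 managers, 5 London branches) are hypothetical, and the cost model is an assumption: one disk access per tuple read or written, no indexes, and every intermediate result written to disk and read back.

```python
# Hypothetical statistics, chosen only for illustration.
N_STAFF = 1000      # tuples in Staff
N_BRANCH = 50       # tuples in Branch
N_MANAGERS = 50     # Staff tuples with position = 'Manager'
N_LONDON = 5        # Branch tuples with city = 'London'


def cost_product_then_select() -> int:
    """Cartesian product first, then one Selection over the product."""
    read_inputs = N_STAFF + N_BRANCH
    product = N_STAFF * N_BRANCH          # written once, then read once
    return read_inputs + 2 * product


def cost_join_then_select() -> int:
    """Join on branchNo first, then Selection over the join result."""
    read_inputs = N_STAFF + N_BRANCH
    join_result = N_STAFF                 # each Staff tuple matches one Branch
    return read_inputs + 2 * join_result


def cost_select_then_join() -> int:
    """Selections first, then join the two reduced relations."""
    select_staff = N_STAFF + N_MANAGERS   # read Staff, write the managers
    select_branch = N_BRANCH + N_LONDON   # read Branch, write London branches
    join_inputs = N_MANAGERS + N_LONDON   # read both reduced relations
    return select_staff + select_branch + join_inputs


costs = {
    "product, then select": cost_product_then_select(),
    "join, then select": cost_join_then_select(),
    "select, then join": cost_select_then_join(),
}
for name, cost in costs.items():
    print(f"{name}: {cost} disk accesses")
```

Under these assumptions, performing the Selections first is cheaper by roughly two orders of magnitude than forming the Cartesian product first.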
Phases of query processing
Query Decomposition
Transforms a high-level query into a relational algebra (RA) query. Typical stages:
Analysis,
Normalization,
Semantic analysis,
Simplification,
Query restructuring.
Analysis
Analyze query lexically and syntactically using compiler techniques.
Verify relations and attributes exist.
Verify operations are appropriate for object type.
Analysis
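Finally, the query is transformed into a query tree, constructed as follows: a leaf node for each base relation, an internal node for each relational algebra operation, and the root as the result of the query.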
Normalization
Converts query into a normalized form for easier
manipulation.
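The predicate can be converted into one of two forms: conjunctive normal form (a conjunction of disjunctions) or disjunctive normal form (a disjunction of conjunctions).
Example in disjunctive normal form:
(position = 'Manager' ∧ branchNo = 'B003') ∨ (salary > 20000 ∧ branchNo = 'B003')
The same predicate in conjunctive normal form:
(position = 'Manager' ∨ salary > 20000) ∧ (branchNo = 'B003')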
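A small check, as a sketch, that the disjunctive form of the example predicate and its conjunctive counterpart (position = 'Manager' ∨ salary > 20000) ∧ branchNo = 'B003' select the same tuples. The three comparisons are treated as independent Boolean variables, which is an assumption made for the truth table; the function names are hypothetical.

```python
from itertools import product


def disjunctive_form(is_manager: bool, high_salary: bool, in_b003: bool) -> bool:
    """(position='Manager' AND branchNo='B003') OR (salary>20000 AND branchNo='B003')"""
    return (is_manager and in_b003) or (high_salary and in_b003)


def conjunctive_form(is_manager: bool, high_salary: bool, in_b003: bool) -> bool:
    """(position='Manager' OR salary>20000) AND branchNo='B003'"""
    return (is_manager or high_salary) and in_b003


# Enumerate all 8 truth assignments of the three atomic conditions.
assignments = list(product([False, True], repeat=3))
equivalent = all(
    disjunctive_form(*values) == conjunctive_form(*values)
    for values in assignments
)
print("Forms are equivalent:", equivalent)
```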
Semantic Analysis
Rejects normalized queries that are incorrectly
formulated or contradictory.
Query is incorrectly formulated if components do
not contribute to generation of result.
Query is contradictory if its predicate cannot be
satisfied by any tuple.
Algorithms to determine correctness exist only for
queries that do not contain disjunction and
negation.
Semantic Analysis
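To detect incorrectly formulated queries, two graphs can be used:
➠ connection graph (query graph)
➠ join graph
If the query graph is not connected, the query is incorrectly formulated.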
Relation connection graph
Example 2
SELECT Ename, Resp FROM Emp, Works, Project
WHERE Emp.Eno = Works.Eno AND Works.Pno = Project.Pno
AND Pname = 'CAD/CAM' AND Dur > 36
AND Title = 'Programmer'
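Example (simplification)
SELECT TITLE FROM Emp E
WHERE (NOT (TITLE = 'Programmer') AND TITLE = 'Programmer')
OR (TITLE = 'Electrical Eng.' AND NOT (TITLE = 'Electrical Eng.'))
OR ENAME = 'J.Doe';
is equivalent to
SELECT TITLE FROM Emp E WHERE ENAME = 'J.Doe';
The first two disjuncts each have the form p AND NOT p, which no tuple can satisfy, so they can be dropped.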
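The equivalence can be tried out on a toy table. This is only a sketch: the Emp schema and the four rows below are hypothetical, and SQLite (through Python's standard sqlite3 module) stands in for the DBMS. ORDER BY is added to both queries so the results can be compared directly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Emp (ENO TEXT, ENAME TEXT, TITLE TEXT)")
conn.executemany(
    "INSERT INTO Emp VALUES (?, ?, ?)",
    [
        ("E1", "J.Doe", "Electrical Eng."),   # hypothetical sample rows
        ("E2", "M.Smith", "Programmer"),
        ("E3", "A.Lee", "Analyst"),
        ("E4", "J.Doe", "Programmer"),
    ],
)

original_query = """
SELECT TITLE FROM Emp E
WHERE (NOT (TITLE = 'Programmer') AND TITLE = 'Programmer')
   OR (TITLE = 'Electrical Eng.' AND NOT (TITLE = 'Electrical Eng.'))
   OR ENAME = 'J.Doe'
ORDER BY TITLE
"""
simplified_query = "SELECT TITLE FROM Emp E WHERE ENAME = 'J.Doe' ORDER BY TITLE"

original_rows = conn.execute(original_query).fetchall()
simplified_rows = conn.execute(simplified_query).fetchall()
print(original_rows == simplified_rows, simplified_rows)
conn.close()
```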
Restructuring
Convert SQL to relational algebra.
Make use of query trees.
Example:
SELECT Ename FROM Emp, Works, Project
WHERE Emp.Eno = Works.Eno AND Works.Pno = Project.Pno
AND Ename <> 'J. Doe' AND Pname = 'CAD/CAM'
AND (Dur = 12 OR Dur = 24)
Query tree:
A tree data structure that corresponds to a relational algebra
expression.
It represents the input relations of the query as leaf nodes
of the tree, and represents the relational algebra operations
as internal nodes.
Query graph:
A graph data structure that corresponds to a relational
calculus expression.
It does not indicate the order in which the operations are performed.
There is only a single graph corresponding to each query.
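A query tree is straightforward to represent in code. The sketch below builds one possible (unoptimized) tree for the Emp/Works/Project restructuring example: base relations are leaves and relational algebra operations are internal nodes. The Node class and the text labels are hypothetical, and other equivalent trees exist for the same query.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One node of a query tree: a base relation or an RA operation."""
    label: str
    children: List["Node"] = field(default_factory=list)


def leaves(node: Node) -> List[str]:
    """Return the base relations (leaf labels) in left-to-right order."""
    if not node.children:
        return [node.label]
    found: List[str] = []
    for child in node.children:
        found.extend(leaves(child))
    return found


def render(node: Node, depth: int = 0) -> str:
    """Indented text form of the tree, root first."""
    lines = ["  " * depth + node.label]
    for child in node.children:
        lines.append(render(child, depth + 1))
    return "\n".join(lines)


emp_works = Node("JOIN Emp.Eno = Works.Eno", [Node("Emp"), Node("Works")])
all_three = Node("JOIN Works.Pno = Project.Pno", [emp_works, Node("Project")])
selected = Node(
    "SELECT Ename <> 'J. Doe' AND Pname = 'CAD/CAM' AND (Dur = 12 OR Dur = 24)",
    [all_three],
)
tree = Node("PROJECT Ename", [selected])

print(render(tree))
```

Execution proceeds from the leaves towards the root, so the root node produces the final result.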
Transformation Rules for RA Operations
1. Conjunctive Selection operations can cascade into individual Selection operations (and vice versa).
σ_{p ∧ q ∧ r}(R) = σ_p(σ_q(σ_r(R)))
2. Commutativity of Selection.
σ_p(σ_q(R)) = σ_q(σ_p(R))
3. In a sequence of Projection operations, only the last in the sequence is required.
Π_L(Π_M(... Π_N(R))) = Π_L(R)
4. Commutativity of Selection and Projection.
If predicate p involves only attributes in the projection list A1, ..., Am, the Selection and Projection operations commute:
Π_{A1, ..., Am}(σ_p(R)) = σ_p(Π_{A1, ..., Am}(R))
5. Commutativity of Theta join (and Cartesian product).
R ⋈_p S = S ⋈_p R
R × S = S × R
6. Commutativity of Selection and Theta join (or Cartesian product).
If the selection predicate p involves only attributes of one of the join relations (here R), the Selection and Join (or Cartesian product) operations commute:
σ_p(R ⋈_r S) = (σ_p(R)) ⋈_r S
σ_p(R × S) = (σ_p(R)) × S
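7. Commutativity of Projection and Theta join (or Cartesian product).
If the projection list is L = L1 ∪ L2, where L1 contains only attributes of R and L2 only attributes of S, and the join condition r involves only attributes in L:
Π_{L1 ∪ L2}(R ⋈_r S) = (Π_{L1}(R)) ⋈_r (Π_{L2}(S))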
8. Commutativity of Union and Intersection (but not Set difference).
R ∪ S = S ∪ R
R ∩ S = S ∩ R
9. Commutativity of Selection and set operations (Union, Intersection, and Set difference).
σ_p(R ∪ S) = σ_p(R) ∪ σ_p(S)
σ_p(R ∩ S) = σ_p(R) ∩ σ_p(S)
σ_p(R - S) = σ_p(R) - σ_p(S)
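Several of these rules can be checked on small relations modelled as Python sets of tuples. This is a sketch, not a proof: the relations R and S, the predicates p and q, and the select helper are all hypothetical, and a passing check on one data set only illustrates the rule.

```python
# Two union-compatible relations with schema (id, code); hypothetical data.
R = {(1, "a"), (2, "b"), (3, "c"), (4, "d")}
S = {(3, "c"), (4, "d"), (5, "e")}


def select(predicate, relation):
    """Selection: keep the tuples that satisfy the predicate."""
    return {t for t in relation if predicate(t)}


def p(t):
    return t[0] % 2 == 0      # id is even


def q(t):
    return t[0] > 1           # id greater than 1


# Rule 1: a conjunctive Selection cascades into individual Selections.
rule1 = select(lambda t: p(t) and q(t), R) == select(p, select(q, R))
# Rule 2: Selections commute.
rule2 = select(p, select(q, R)) == select(q, select(p, R))
# Rule 8: Union and Intersection commute; Set difference does not.
rule8 = (R | S == S | R) and (R & S == S & R)
difference_commutes = (R - S) == (S - R)
# Rule 9: Selection distributes over Union, Intersection, Set difference.
rule9_union = select(p, R | S) == select(p, R) | select(p, S)
rule9_intersection = select(p, R & S) == select(p, R) & select(p, S)
rule9_difference = select(p, R - S) == select(p, R) - select(p, S)
# Swapping the operands of the difference gives a different relation.
swapped_matches = select(p, R - S) == select(p, S) - select(p, R)

print(rule1, rule2, rule8, rule9_union, rule9_intersection, rule9_difference)
```

Note the operand order for Set difference: the Selection is pushed onto R and S in their original positions.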
2. Query Optimization
Optimization: the chosen plan is not necessarily "optimal", but reasonably efficient.
Techniques:
Heuristic rules
Cost-based optimization
Heuristic rules:
Apply Selections early.
Apply the more restrictive Selections first.
Form Joins (combine a Cartesian product with a following Selection on the join predicate).
Apply Projections early.
Cost-Based Optimization
Statistics on the inputs to each operator are needed.
Statistics on leaf relations are stored in the system
catalog.
Statistics on intermediate relations must be estimated; the most important are the relations' cardinalities.
Cost can be CPU time, I/O time, communication time, main
memory usage, or a combination.
The candidate query tree with the least total cost is selected
for execution.
Measures of Query Cost
There are many possible ways to estimate cost, e.g., based on disk accesses, CPU time, or communication overhead.
Disk access is the cost of block transfers from/to
disks.
Simplifying assumption: each block transfer has
the same cost
Cost of algorithm (e.g., for join or selection) depends
on database buffer size;
More memory for DB buffer reduces disk accesses.
Selectivity and Cost Estimates in Query Optimization
Catalog Information Used in Cost Functions
Information about the size of a file
number of records (tuples) (r),
record size (R),
number of blocks (b)
blocking factor (bfr)
Information about indexes and indexing attributes of
a file
Number of levels (x) of each multilevel index
Number of first-level index blocks (bI1)
Number of distinct values (d) of an attribute
Selectivity (sl) of an attribute
Selection cardinality (s) of an attribute. (s = sl * r)
Database Statistics
For each base relation R
nTuples(R) – the number of tuples (records) in relation R (that is, its
cardinality).
bFactor(R) – the blocking factor of R (that is, the number of tuples of R that
fit into one block).
nBlocks(R) – the number of blocks required to store R. If the tuples of R
are stored physically together, then:
nBlocks(R) = [nTuples(R)/bFactor(R)]
We use [x] to indicate that the result of the calculation is rounded to the
smallest integer that is greater than or equal to x.
For each attribute A of base relation R
nDistinctA(R) – the number of distinct values that appear for
attribute A in relation R.
minA(R),maxA(R) – the minimum and maximum possible
values for the attribute A in relation R.
SCA(R) – the selection cardinality of attribute A in relation R.
This is the average number of tuples that satisfy an equality
condition on attribute A.
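These statistics translate directly into small helper functions. The sketch below assumes uniformly distributed attribute values, so that the selection cardinality for an equality condition is nTuples(R) divided by nDistinctA(R), rounded up, and 1 for a key attribute. The function names and the sample figures are hypothetical.

```python
def ceil_div(numerator: int, denominator: int) -> int:
    """Smallest integer greater than or equal to numerator / denominator."""
    return -(-numerator // denominator)


def n_blocks(n_tuples: int, b_factor: int) -> int:
    """nBlocks(R) = [nTuples(R) / bFactor(R)], tuples stored together."""
    return ceil_div(n_tuples, b_factor)


def selection_cardinality(n_tuples: int, n_distinct: int, is_key: bool = False) -> int:
    """SC_A(R) for an equality condition, assuming a uniform distribution."""
    if is_key:
        return 1              # at most one tuple per key value
    return ceil_div(n_tuples, n_distinct)


# Hypothetical relation: 3000 tuples, 30 tuples per block.
blocks = n_blocks(3000, 30)
sc_key = selection_cardinality(3000, 3000, is_key=True)
sc_500_values = selection_cardinality(3000, 500)
sc_10_values = selection_cardinality(3000, 10)

print(blocks, sc_key, sc_500_values, sc_10_values)
```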
Example
For the purposes of this example, we make the following assumptions about
the Staff relation:
There is a hash index with no overflow on the primary key attribute staffNo.
There is a clustering index on the foreign key attribute branchNo.
There is a B+-tree index on the salary attribute.
The Staff relation has the following statistics stored in the system catalog:
Q1
The estimated cost of a linear search on the key attribute staffNo is 50 blocks,
and the cost of a linear search on a non-key attribute is 100 blocks.
Now we consider the following Selection operations, and use the above
strategies to improve on these two costs:
S1: σ_{staffNo='SG5'}(Staff)
S2: σ_{position='Manager'}(Staff)
S3: σ_{branchNo='B003'}(Staff)
S4: σ_{salary>20000}(Staff)
Solution:
S1: This Selection operation contains an equality condition on the primary key. Therefore, as the attribute staffNo is hashed, we can use strategy 3 defined above to estimate the cost as 1 block. The estimated cardinality of the result relation is SCstaffNo(Staff) = 1.
S2: The attribute in the predicate is a non-key, non-indexed attribute, so we cannot improve on the linear search method, giving an estimated cost of 100 blocks. The estimated cardinality of the result relation is SCposition(Staff) = 300.
S3: The attribute in the predicate is a foreign key with a clustering index, so we can use strategy 6 to estimate the cost as 2 + [6/30] = 3 blocks. The estimated cardinality of the result relation is SCbranchNo(Staff) = 6.
S4: The predicate here involves a range search on the salary attribute, which has a B+-tree index, so we can use strategy 7 to estimate the cost as 2 + [50/2] + [3000/2] = 1527 blocks. However, this is significantly worse than the linear search strategy, so in this case we would use the linear search method. The estimated cardinality of the result relation is
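The arithmetic in the solution for S1 to S4 can be reproduced as a sketch. The catalog values are read off the figures quoted in the solution (3000 tuples, blocking factor 30, 2 index levels, 50 leaf blocks, selection cardinality 6); how each number maps to a statistic is an assumption, as are all the variable names.

```python
import math

# Values inferred from the arithmetic in the solution (assumed mapping).
N_TUPLES = 3000
B_FACTOR = 30
N_BLOCKS = math.ceil(N_TUPLES / B_FACTOR)

linear_search_non_key = N_BLOCKS          # scan every block
linear_search_key = N_BLOCKS // 2         # stop, on average, half-way

# S1: equality on the hashed primary key, no overflow.
cost_s1 = 1

# S2: non-key, non-indexed attribute, so linear search is the only option.
cost_s2 = linear_search_non_key

# S3: clustering index on branchNo: index levels + blocks of matching tuples.
LEVELS_BRANCH_INDEX = 2
SC_BRANCH = 6
cost_s3 = LEVELS_BRANCH_INDEX + math.ceil(SC_BRANCH / B_FACTOR)

# S4: range search through the B+-tree on salary, assuming half of the
# leaf blocks and half of the tuples are visited.
LEVELS_SALARY_INDEX = 2
LEAF_BLOCKS_SALARY = 50
cost_s4_index = (
    LEVELS_SALARY_INDEX
    + math.ceil(LEAF_BLOCKS_SALARY / 2)
    + math.ceil(N_TUPLES / 2)
)
cost_s4 = min(cost_s4_index, linear_search_non_key)   # pick the cheaper plan

print(cost_s1, cost_s2, cost_s3, cost_s4_index, cost_s4)
```

The last line mirrors the decision in the solution: the index-based plan for S4 is rejected because the linear search is cheaper.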
Selection Operation
S1 - Linear search
cost(S1) = bR (the number of blocks in R)
Cost of Operations
Cost = I/O cost + CPU cost
I/O cost: # pages (reads & writes) or # operations
(multiple pages)
CPU cost: # comparisons or # tuples processed
Cost depends on
Types of query conditions
bR: # pages in R
Simple Selection
Simple selection: σ_{A op a}(R)
A is a single attribute, a is a constant, and op is one of =, ≠, <, ≤, >, ≥.
The ≠ case is not discussed further because it requires a sequential scan of the table.
How many tuples will be selected?
Selectivity factor SF_{A op a}(R): the fraction of tuples of R satisfying "A op a"
0 ≤ SF_{A op a}(R) ≤ 1
Number of tuples selected: NS = nR × SF_{A op a}(R)
Options of Simple Selection
Sequential (linear) Scan
General condition: cost = bR
Equality on key: average cost = bR / 2
Binary Search
Records are stored in sorted order
Equality on key: cost = ⌈log2(bR)⌉
Equality on non-key (duplicates allowed):
cost = ⌈log2(bR)⌉ + ⌈NS/bfR⌉ - 1
= search cost to find the first match + blocks holding the selected tuples - 1 (the first block is already counted)
Example: Cost of Selection
Relation: R(A, B, C)
nR = 10000 tuples
bfR = 20 tuples/page
dist(A) = 50, dist(B) = 500
B+ tree clustering index on A with order 25 (p = 25)
B+ tree secondary index on B with order 25
Query:
select * from R where A = a1 and B = b1
Relational Algebra: σ_{A=a1 ∧ B=b1}(R)
Example: Cost of Selection (cont.)
Option 1: Sequential Scan
The entire relation has to be read.
Cost = bR = 10000/20 = 500
Option 2: Binary Search using A = a1
The file is sorted on A (why? because the index on A is a clustering index).
NS = 10000/50 = 200, assuming a uniform distribution of values
Cost = ⌈log2(bR)⌉ + ⌈NS/bfR⌉ - 1
= ⌈log2(500)⌉ + 200/20 - 1 = 9 + 10 - 1 = 18
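The two options can be recomputed from the statistics of R given in the example. This is a minimal sketch: the variable names are hypothetical, and rounding up the logarithm and the block count is an assumption that reproduces the figure of 18.

```python
import math

n_r = 10000        # tuples in R
bf_r = 20          # tuples per page
dist_a = 50        # distinct values of A

b_r = math.ceil(n_r / bf_r)                    # pages in R

# Option 1: sequential scan reads every page.
sequential_cost = b_r

# Option 2: binary search on the sorted attribute A.
ns = n_r // dist_a                             # tuples with A = a1 (uniform)
binary_search_cost = (
    math.ceil(math.log2(b_r))                  # locate the first matching page
    + math.ceil(ns / bf_r)                     # pages holding the matches
    - 1                                        # first page counted twice
)

print(sequential_cost, ns, binary_search_cost)
```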
Cost of Join
Estimate Size of Join Result
For a join of R and S on R.A = S.B:
NJ = min( nR × nS / dist(R.A), nR × nS / dist(S.B) )
Estimate Size of Join Result (cont.)
How wide is a tuple in join result?
Natural join: W = W(R) + W(S) - W(S ∩ R), where S ∩ R denotes the attributes common to both relations
Theta join: W = W(R) + W(S)
What is blocking factor of join result?
bfJoin = block size / W
How many blocks does join result have?
bJoin = NJ / bfJoin
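The join size formulas can be combined into one short calculation. All figures below (tuple counts, distinct values, tuple widths in bytes, and a 4096-byte block) are hypothetical, and the rounding choices (whole tuples per block, block count rounded up) are assumptions the formulas leave open.

```python
def join_cardinality(n_r: int, n_s: int, dist_ra: int, dist_sb: int) -> int:
    """NJ = min(nR*nS/dist(R.A), nR*nS/dist(S.B)) for a join on R.A = S.B."""
    product = n_r * n_s
    return min(product // dist_ra, product // dist_sb)


def join_blocks(n_join: int, tuple_width: int, block_size: int):
    """Return (blocking factor, number of blocks) of the join result."""
    bf_join = block_size // tuple_width        # whole tuples per block
    b_join = -(-n_join // bf_join)             # rounded up
    return bf_join, b_join


# Hypothetical statistics.
n_r, n_s = 10000, 2000
dist_ra, dist_sb = 50, 100
width_r, width_s, width_common = 100, 60, 10   # bytes
block_size = 4096                              # bytes

nj = join_cardinality(n_r, n_s, dist_ra, dist_sb)
natural_width = width_r + width_s - width_common
theta_width = width_r + width_s
bf_join, b_join = join_blocks(nj, natural_width, block_size)

print(nj, natural_width, theta_width, bf_join, b_join)
```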
Query Execution Plans
An execution plan for a relational algebra query consists
of a combination of the relational algebra query tree
and information about the access methods to be used
for each relation as well as the methods to be used in
computing the relational operators stored in the tree.
Materialized evaluation: the result of an operation is
stored as a temporary relation.
Pipelined evaluation: as the result of an operator is
produced, it is forwarded to the next operator in
sequence
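The difference between the two evaluation styles can be sketched with Python lists (materialized) and generators (pipelined). The sample rows and the operator functions are hypothetical; a real DBMS would write temporary relations to disk and use iterator-style operators.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, object]

# Hypothetical sample data.
STAFF: List[Row] = [
    {"staffNo": "SG5", "position": "Manager", "salary": 24000},
    {"staffNo": "SG14", "position": "Supervisor", "salary": 18000},
    {"staffNo": "SL21", "position": "Manager", "salary": 30000},
]


def select_materialized(rows: Iterable[Row], pred: Callable[[Row], bool]) -> List[Row]:
    """Selection that builds the whole temporary relation before returning."""
    return [row for row in rows if pred(row)]


def project_materialized(rows: Iterable[Row], attrs: List[str]) -> List[Row]:
    """Projection over an already stored temporary relation."""
    return [{a: row[a] for a in attrs} for row in rows]


def select_pipelined(rows: Iterable[Row], pred: Callable[[Row], bool]) -> Iterator[Row]:
    """Selection that forwards each qualifying tuple as soon as it is found."""
    for row in rows:
        if pred(row):
            yield row


def project_pipelined(rows: Iterable[Row], attrs: List[str]) -> Iterator[Row]:
    """Projection that consumes tuples one at a time from its input."""
    for row in rows:
        yield {a: row[a] for a in attrs}


def is_manager(row: Row) -> bool:
    return row["position"] == "Manager"


# Materialized: the Selection result exists in full as a temporary list.
temporary = select_materialized(STAFF, is_manager)
materialized_result = project_materialized(temporary, ["staffNo"])

# Pipelined: no temporary relation; tuples flow through both operators.
pipeline = project_pipelined(select_pipelined(STAFF, is_manager), ["staffNo"])
pipelined_result = list(pipeline)

print(materialized_result == pipelined_result, pipelined_result)
```

Both styles give the same answer; pipelining avoids storing the intermediate relation.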
Query Tuning
Monitoring or revising the query to increase throughput,
to lower response time for time-critical applications.
Having to tune queries is a fact of life.
Query tuning has a localized effect and is thus relatively
attractive.
Assignment one
Using the heuristic algorithm, optimize the following SQL query.
SELECT LNAME FROM EMPLOYEE, WORKS_ON, PROJECT