CH 11
Optimization
Chapter 11
Query Evaluation
• Problem: An SQL query is declarative – does
not specify a query execution plan.
• A relational algebra expression is procedural
– there is an associated query execution plan.
• Solution: Convert the SQL query to an
equivalent relational algebra expression and evaluate it
using the associated query execution plan.
– But which equivalent expression is best?
Naive Conversion
SELECT DISTINCT TargetList
FROM R1, R2, …, RN
WHERE Condition
is equivalent to
πTargetList (σCondition (R1 × R2 × ... × RN))
but this may imply a very inefficient query execution plan.
Example: πName (σId=ProfId ∧ CrsCode=‘CS532’ (Professor × Teaching))
• Result can be < 100 bytes
• But if each relation is 50K then we end up computing an
intermediate result Professor × Teaching of size 500M
before shrinking it down to just a few bytes.
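The naive plan can be sketched in a few lines to make the size of the intermediate result visible. This is a minimal illustration, assuming tiny made-up relations held as lists of dicts; the rows and the helper name `naive_plan` are hypothetical, and only the attribute names come from the example above.

```python
from itertools import product

# Hypothetical toy data; attribute names follow the slide's example.
professor = [
    {"Id": 1, "Name": "Smith"},
    {"Id": 2, "Name": "Jones"},
    {"Id": 3, "Name": "Lee"},
]
teaching = [
    {"ProfId": 1, "CrsCode": "CS532"},
    {"ProfId": 2, "CrsCode": "CS305"},
    {"ProfId": 3, "CrsCode": "MA101"},
    {"ProfId": 3, "CrsCode": "CS305"},
]

def naive_plan(professor, teaching, crs_code):
    """Evaluate project(select(Professor x Teaching)) literally."""
    # Step 1: materialize the full Cartesian product.
    intermediate = [{**p, **t} for p, t in product(professor, teaching)]
    # Step 2: apply the whole WHERE condition to the product.
    selected = [
        row for row in intermediate
        if row["Id"] == row["ProfId"] and row["CrsCode"] == crs_code
    ]
    # Step 3: DISTINCT projection on Name.
    names = {row["Name"] for row in selected}
    return names, len(intermediate)

naive_names, naive_rows = naive_plan(professor, teaching, "CS532")
print(naive_names, "from an intermediate result of", naive_rows, "rows")
```

The intermediate result always has |Professor| × |Teaching| rows, no matter how few rows survive the selection.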
Query Optimizer
• Uses heuristic algorithms to evaluate relational
algebra expressions. This involves:
– estimating the cost of a relational algebra expression
– transforming one relational algebra expression to an
equivalent one
– choosing access paths for evaluating the subexpressions
• Query optimizers do not “optimize” – just try to find
“reasonably good” evaluation strategies
Selection and Projection Rules
• Join commutativity: R ⋈ S ≡ S ⋈ R
– used to reduce cost of nested loop evaluation strategies (smaller relation
should be in outer loop)
• Join associativity: R ⋈ (S ⋈ T) ≡ (R ⋈ S) ⋈ T
– used to reduce the size of intermediate relations in computation of a
multi-relational join – first compute the join that yields the smaller
intermediate result
• N-way join has T(N)× N! different evaluation plans
– T(N) is the number of parenthesized expressions
– N! is the number of permutations
• Query optimizer cannot look at all plans (might take longer to
find an optimal plan than to compute query brute-force). Hence it
does not necessarily produce optimal plan
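To get a feel for how fast T(N) × N! grows, the count can be computed directly. The sketch below assumes that T(N), the number of ways to fully parenthesize an N-operand expression, is the Catalan number C(N−1); that identification is a standard combinatorial fact rather than something stated on the slide, and the function names are hypothetical.

```python
from math import comb, factorial

def parenthesizations(n: int) -> int:
    """T(N): ways to fully parenthesize N operands (Catalan number C(N-1))."""
    k = n - 1
    return comb(2 * k, k) // (k + 1)

def plan_count(n: int) -> int:
    """Number of evaluation plans for an N-way join: T(N) * N!."""
    return parenthesizations(n) * factorial(n)

for n in (2, 4, 8, 10):
    print(f"{n}-way join: {plan_count(n)} plans")
```

As on the slide, this count treats R ⋈ S and S ⋈ R as different plans.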
Pushing Selections and Projections
• σCond (R × S) ≡ R ⋈Cond S
– Cond relates attributes of both R and S
– Reduces size of intermediate relation since rows can be
discarded sooner
• σCond (R × S) ≡ σCond (R) × S
– Cond involves only the attributes of R
– Reduces size of intermediate relation since rows of R are
discarded sooner
• πattr(R × S) ≡ πattr(πattr′ (R) × S),
if attributes(R) ⊇ attr′ ⊇ (attr ∩ attributes(R))
– reduces the size of an operand of product
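The second rule, σCond (R × S) ≡ σCond (R) × S when Cond mentions only attributes of R, can be checked on small data. This is a sketch under assumed toy relations; `R`, `S`, `cond`, and both helper functions are hypothetical names chosen for illustration.

```python
from itertools import product

# Hypothetical toy relations as lists of dicts.
R = [{"a": i, "dept": "CS" if i % 2 == 0 else "EE"} for i in range(6)]
S = [{"b": j} for j in range(4)]

def cond(row):
    """A condition that involves only attributes of R."""
    return row["dept"] == "CS"

def select_after_product(R, S):
    """sigma_Cond(R x S): build the whole product, then filter."""
    prod_rows = [{**r, **s} for r, s in product(R, S)]
    return [row for row in prod_rows if cond(row)], len(prod_rows)

def select_before_product(R, S):
    """sigma_Cond(R) x S: filter R first, then build a smaller product."""
    filtered = [r for r in R if cond(r)]
    prod_rows = [{**r, **s} for r, s in product(filtered, S)]
    return prod_rows, len(prod_rows)

late, late_size = select_after_product(R, S)
early, early_size = select_before_product(R, S)
print(late == early, late_size, early_size)
```

Both plans return the same rows, but the pushed-down version never builds the rows that the selection would discard.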
Equivalence Example
Cost - Example 1
SELECT P.Name
FROM Professor P, Teaching T
WHERE P.Id = T.ProfId -- join condition
AND P.DeptId = ‘CS’ AND T.Semester = ‘F1994’
πName(σDeptId=‘CS’ ∧ Semester=‘F1994’(Professor ⋈Id=ProfId Teaching))
Master query execution plan (nothing pushed), root to leaves:
πName
σDeptId=‘CS’ ∧ Semester=‘F1994’
⋈Id=ProfId
Professor, Teaching
Estimating Cost - Example 1
• Join – block-nested loops with 52-page buffer (50
pages – input for Professor, 1 page – input for
Teaching, 1 – output page)
– Scanning Professor (outer loop): 200 page transfers,
(4 iterations, 50 transfers each)
– Finding matching rows in Teaching (inner loop):
1000 page transfers for each iteration of outer loop
– Total cost = 200+4*1000 = 4200 page transfers
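The arithmetic above can be written as a small cost function. This is a sketch, assuming (as in the example) that one buffer page is reserved for the inner relation's input and one for output, and that Professor occupies 200 pages and Teaching 1000 pages; the function name is hypothetical.

```python
from math import ceil

def block_nested_loops_cost(outer_pages, inner_pages, buffer_pages):
    """Page transfers for a block-nested loops join, excluding final output.

    Assumption: one buffer page holds inner input, one holds output,
    and the remaining pages hold a chunk of the outer relation.
    """
    outer_chunk = buffer_pages - 2
    iterations = ceil(outer_pages / outer_chunk)
    # The outer relation is read once; the inner is rescanned per chunk.
    return outer_pages + iterations * inner_pages

cost = block_nested_loops_cost(outer_pages=200, inner_pages=1000, buffer_pages=52)
swapped = block_nested_loops_cost(outer_pages=1000, inner_pages=200, buffer_pages=52)
print(cost, swapped)
```

Swapping the roles of the two relations gives 1000 + 20 × 200 = 5000 page transfers, which is why the smaller relation belongs in the outer loop.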
Pipelining
• Solution: use pipelining:
– join and select/project act as coroutines, operate as
producer/consumer sharing a buffer in main memory.
• When join fills the buffer, select/project filters it and outputs
the result
• Process is repeated until select/project has processed last
output from join
– Performing select/project adds no additional cost
Figure: join → intermediate result (buffer) → select/project → output final result
• Total cost:
4200 + (cost of outputting final result)
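The producer/consumer arrangement maps naturally onto Python generators. The sketch below is only an analogy: it hands over one row at a time rather than one buffer at a time, it uses a simple nested loop for the join, and the sample rows are made up.

```python
def join(professor, teaching):
    """Producer: yields joined rows as they are found, never storing them all."""
    for p in professor:
        for t in teaching:
            if p["Id"] == t["ProfId"]:
                yield {**p, **t}

def select_project(rows, dept, semester):
    """Consumer: filters each joined row and projects on Name."""
    for row in rows:
        if row["DeptId"] == dept and row["Semester"] == semester:
            yield row["Name"]

# Hypothetical sample data.
professor = [
    {"Id": 1, "Name": "Smith", "DeptId": "CS"},
    {"Id": 2, "Name": "Jones", "DeptId": "EE"},
]
teaching = [
    {"ProfId": 1, "CrsCode": "CS532", "Semester": "F1994"},
    {"ProfId": 1, "CrsCode": "CS305", "Semester": "S1995"},
    {"ProfId": 2, "CrsCode": "EE101", "Semester": "F1994"},
]

result = list(select_project(join(professor, teaching), "CS", "F1994"))
print(result)
```

Because the join result is consumed as it is produced, it is never written out as a whole, which is the reason select/project adds no I/O cost.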
Cost Example 2
SELECT P.Name
FROM Professor P, Teaching T
WHERE P.Id = T.ProfId AND
P.DeptId = ‘CS’ AND T.Semester = ‘F1994’
πName(σSemester=‘F1994’ (σDeptId=‘CS’ (Professor) ⋈Id=ProfId Teaching))
Partially pushed plan (selection pushed to Professor), root to leaves:
πName
σSemester=‘F1994’
⋈Id=ProfId
σDeptId=‘CS’ (Professor), Teaching
Figure labels: clustered index on DeptId; rows of Professor
Cost Example 2 -- join
[Slide body not recovered; surviving fragments: 1.2, 20 × 10]
Cost Example 2 – select/project
• Pipe result of join to select (on Semester) and
project (on Name) at no I/O cost
• Cost of output same as for Example 1
• Total cost:
6 (select on Professor) + 224 (join) = 230
• Comparison:
4200 (example 1) vs. 230 (example 2) !!!
Estimating Output Size
• For the query: SELECT TargetList
FROM R1, R2, …, Rn
WHERE Condition
Reduction Due to Simple Condition
• reduction(Ri.A = val) = 1 / Values(Ri.A)
• reduction(Ri.A = Rj.B) = 1 / max(Values(Ri.A), Values(Rj.B))
• reduction(Ri.A > val) = (MaxVal(Ri.A) – val) / (MaxVal(Ri.A) – MinVal(Ri.A))
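The three reduction factors for simple conditions can be written as plain functions. This is a sketch: the function and parameter names are hypothetical, and the statistics (number of distinct values, minimum and maximum) are assumed to be supplied by the caller, for example from catalog data.

```python
def reduction_eq_value(num_values):
    """reduction(Ri.A = val) = 1 / Values(Ri.A)."""
    return 1 / num_values

def reduction_eq_attr(values_a, values_b):
    """reduction(Ri.A = Rj.B) = 1 / max(Values(Ri.A), Values(Rj.B))."""
    return 1 / max(values_a, values_b)

def reduction_gt_value(max_val, min_val, val):
    """reduction(Ri.A > val) = (MaxVal - val) / (MaxVal - MinVal)."""
    return (max_val - val) / (max_val - min_val)

# Hypothetical statistics, for illustration only.
print(reduction_eq_value(50))          # attribute with 50 distinct values
print(reduction_eq_attr(100, 400))     # equality between two attributes
print(reduction_gt_value(100, 0, 75))  # values range over 0..100, condition A > 75
```

Each factor is the estimated fraction of rows that survive the condition; the first two amount to assuming that values are evenly distributed.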
Reduction Due to TargetList
• reduction(TargetList) = number-of-attributes(TargetList) / Σi number-of-attributes(Ri)
weight(R.A) =
Tuples(R) × reduction(R.A=value)
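The target-list reduction and the weight of an attribute are both simple ratios. A sketch with hypothetical function names and made-up statistics follows; it assumes reduction(R.A=value) = 1 / Values(R.A), the estimate for an equality condition against a constant.

```python
def reduction_target_list(target_attrs, attrs_per_relation):
    """Attributes kept by the projection over all attributes of R1..Rn."""
    return target_attrs / sum(attrs_per_relation)

def weight(tuples, num_values):
    """weight(R.A) = Tuples(R) * reduction(R.A = value).

    Assumes reduction(R.A = value) = 1 / Values(R.A), so the product
    simplifies to Tuples(R) / Values(R.A).
    """
    return tuples / num_values

# Hypothetical numbers: one target attribute out of relations with 3 and 4
# attributes; a 10,000-row relation whose attribute has 50 distinct values.
print(reduction_target_list(1, [3, 4]))
print(weight(10_000, 50))
```

Under these assumptions the weight is the expected number of rows of R that share one particular value of A.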
Choosing Query Execution Plan
• Step 1: Choose a logical plan
• Step 2: Reduce search space
• Step 3: Use a heuristic search to further
reduce complexity
Step 2: Reduce Search Space
• Deal with associativity of binary operators
(join, union, …)
Figure: a logical query execution plan joining A, B, C, D; an equivalent query tree; and an equivalent left-deep query tree, ((A ⋈ B) ⋈ C) ⋈ D.
Step 2 (cont’d)
• Two issues:
– Choose a particular shape of a tree (like in the
previous slide)
• The number of shapes equals the number of ways to parenthesize an N-way
join – grows very rapidly
– Choose a particular permutation of the leaves
• E.g., 4! permutations of the leaves A, B, C, D
Step 2: Dealing With Associativity
• Too many trees to evaluate: settle on a particular
shape: left-deep tree.
– Used because it allows pipelining:
P1: A ⋈ B → X, P2: X ⋈ C → Y, P3: Y ⋈ D
– Property: once a row of X has been output by P1, it need not
be output again (but C may have to be processed several
times in P2 for successive portions of X)
– Advantage: none of the intermediate relations (X, Y) has to
be completely materialized and saved on disk.
• Important if one such relation is very large, but the final result is
small
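Once the shape is fixed to left-deep, the remaining search space is just the orderings of the leaves. The sketch below enumerates them as nested tuples; the representation and the function name are hypothetical, and no costs are attached to the plans.

```python
from functools import reduce
from itertools import permutations

def left_deep_plans(relations):
    """Yield one left-deep join tree per permutation of the leaves.

    A tree is a nested tuple: (left_subtree, right_leaf).
    """
    for order in permutations(relations):
        # Fold left to right: ((A, B), C), D and so on.
        yield reduce(lambda left, right: (left, right), order)

plans = list(left_deep_plans(["A", "B", "C", "D"]))
print(len(plans))
print(plans[0])
```

For four relations this yields 4! = 24 candidate plans, the first being ((A ⋈ B) ⋈ C) ⋈ D, and every right operand is a base relation, which is what makes the pipelined evaluation possible.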
Step 3: Heuristic Search
Index-Only Queries
• A B+ tree index with search key attributes A1, A2, …,
An has stored in it the values of these attributes for each
row in the table.
– Queries involving a prefix of the attribute list A1, A2, …, An
can be satisfied using only the index – no access to the actual
table is required.
• Example: Transcript has a clustered B+ tree index on
StudId. A frequently asked query is one that requests
all grades for a given CrsCode.
– Problem: Already have a clustered index on StudId – cannot
create another one (on CrsCode)
– Solution: Create an unclustered index on (CrsCode, Grade)
• Keep in mind, however, the overhead in maintaining extra indices
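The Transcript example can be imitated with a sorted list standing in for the leaf level of the (CrsCode, Grade) index. This is only a sketch: a real B+ tree is not a Python list, the rows are made up, and `grades_for_course` is a hypothetical helper.

```python
from bisect import bisect_left

# Hypothetical Transcript rows: (StudId, CrsCode, Grade).
transcript = [
    (111, "CS305", "A"),
    (111, "CS532", "B"),
    (222, "CS305", "C"),
    (333, "MA101", "A"),
    (333, "CS305", "B"),
]

# Stand-in for the leaf entries of an unclustered index on (CrsCode, Grade):
# the search-key values of every row, kept in key order.
index = sorted((crs, grade) for _, crs, grade in transcript)

def grades_for_course(index, crs_code):
    """Answer the query from index entries alone, never touching transcript."""
    # (crs_code,) sorts before every (crs_code, grade) entry, so this finds
    # the first entry for the course, if there is one.
    start = bisect_left(index, (crs_code,))
    grades = []
    for crs, grade in index[start:]:
        if crs != crs_code:
            break
        grades.append(grade)
    return grades

print(grades_for_course(index, "CS305"))
```

The lookup uses CrsCode, a prefix of the search key, and the answer (Grade) is already stored in the index entries, so the table itself is never read.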