07 QueryOptimisation-no Blanks
Presented by
A/Prof Uwe Roehm
School of Computer Science
Learning Objectives
– Query Optimisation
– Equivalence Rules
– Join Ordering
– Cost-Based Query Optimisation
– Optimising nested and dynamic queries
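[Figure: query processing overview. A relational algebra expression (e.g. Student ⋈ Enrolled with a selection σ_{uosCode='DATA3404'}) goes to the optimizer (focus this week), which uses statistics about the data to produce an execution plan; the evaluation engine runs the plan against the data and returns the query output.]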
DATA3404 "Data Science Platforms" - 2020 (Roehm) 5
Query Optimization
Central Problems:
– A query is (by definition) declarative,
i.e. it does not specify the execution order.
But we need an executable plan.
[Figure: components of query time; recoverable labels: processing, network comms]
› Typically disk access is the predominant cost,
and it is also relatively easy to estimate
- Based on statistics about the size of relations,
the spread of values in a column, etc.
› For simplicity, we just use the number of page
transfers from disk (I/Os) as the cost measure
SELECT *
FROM Student JOIN Enrolled USING(sid)
WHERE grade='HD'

Student
sid  name   address         dob
123  Alice  1 Acacia Drive  18-AUG-1989
124  Bob    7 Belmont Av    07-JUN-1993
…

Enrolled
sid  uosCode   grade
123  COMP9120  HD
124  COMP5338  P
…

[Figure: execution plan for this query; recoverable operator labels: ⋈ (BLOCK NESTED LOOPS), σ (FILTER) above Enrolled]
Equivalence rules

Enrolled
sid  uosCode
101  DATA3404
102  COMP5338
103  COMP5338
103  DATA3404

Student
sid  Name
103  Clare
102  Bob
101  Alice

[Figure: equivalent expression trees combining π_{name}, σ_{uosCode='DATA3404'} and Student ⋈ Enrolled; each yields the names Alice and Clare]
Selection Equivalence
SELECT *
FROM Enrolled
WHERE sid=102
AND uos='DATA3404'

Enrolled
sid  uos       grade
101  DATA3404  D
101  COMP5338  CR
102  DATA3404  HD
102  COMP5338  P

σ_{sid=102 ∧ uos='DATA3404'}(Enrolled) ≡ σ_{uos='DATA3404'}(σ_{sid=102}(Enrolled)) ≡ σ_{sid=102}(σ_{uos='DATA3404'}(Enrolled))

– σ_{sid=102} first: (102, DATA3404, HD), (102, COMP5338, P); then σ_{uos='DATA3404'} leaves (102, DATA3404, HD)
– σ_{uos='DATA3404'} first: (101, DATA3404, D), (102, DATA3404, HD); then σ_{sid=102} leaves (102, DATA3404, HD)
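The selection equivalence above can be checked on the slide's Enrolled rows with a minimal Python sketch. The helper `select` and the predicate names are illustrative assumptions, not part of the lecture material.

```python
# Enrolled rows as (sid, uos, grade), copied from the example above.
enrolled = [
    (101, "DATA3404", "D"),
    (101, "COMP5338", "CR"),
    (102, "DATA3404", "HD"),
    (102, "COMP5338", "P"),
]


def select(rows, predicate):
    """Relational selection: keep the rows that satisfy the predicate."""
    return [row for row in rows if predicate(row)]


def is_sid_102(row):
    return row[0] == 102


def is_data3404(row):
    return row[1] == "DATA3404"


# One conjunctive selection ...
conjunctive = select(enrolled, lambda r: is_sid_102(r) and is_data3404(r))
# ... versus the two possible cascades of single-condition selections.
sid_first = select(select(enrolled, is_sid_102), is_data3404)
uos_first = select(select(enrolled, is_data3404), is_sid_102)

print(conjunctive == sid_first == uos_first)
```

All three variants return the same rows; they differ only in how many rows the second filter has to look at, which is what an optimizer exploits.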
sid  name   home
101  Alice  Austria
102  Bob    Brazil
103  Clare  Chile

[Figure: expression tree combining π_{sid,name,home}, σ_{sid=102} and π_{name,home} over this table; the selection σ_{sid=102} yields the row (102, Bob, Brazil)]
A projection commutes with a selection that only uses attributes retained by the projection.
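As a rough illustration of this rule, the following Python sketch uses the three rows from the slide. The helpers `project` and `select` are assumptions made for this example, and `project` skips duplicate elimination because the sample data produces no duplicates.

```python
students = [
    {"sid": 101, "name": "Alice", "home": "Austria"},
    {"sid": 102, "name": "Bob", "home": "Brazil"},
    {"sid": 103, "name": "Clare", "home": "Chile"},
]


def project(rows, attrs):
    """Projection: keep only the listed attributes of every row."""
    return [{a: row[a] for a in attrs} for row in rows]


def select(rows, predicate):
    """Selection: keep the rows that satisfy the predicate."""
    return [row for row in rows if predicate(row)]


def sid_is_102(row):
    return row["sid"] == 102


# sid is retained by the projection, so the two operators commute.
select_then_project = project(select(students, sid_is_102), ["sid", "name"])
project_then_select = select(project(students, ["sid", "name"]), sid_is_102)

# sid is NOT retained by this projection, so the selection must run first.
name_home = project(select(students, sid_is_102), ["name", "home"])
try:
    select(project(students, ["name", "home"]), sid_is_102)
    swap_possible = True
except KeyError:
    # The predicate needs sid, which the projection has already dropped.
    swap_possible = False

print(select_then_project == project_then_select, swap_possible)
```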
Joins and Cross Products
SELECT * FROM Student S
JOIN Enrolled E ON (E.sid=S.sid)

E
sid  uosCode
101  COMP9120
102  COMP9120
102  COMP5338
104  COMP9120

S
sid  Name
101  Alice
102  Bob
103  Clare

E × S pairs every row of E with every row of S (4 × 3 = 12 rows):
E.sid  uosCode   S.sid  Name
101    COMP9120  101    Alice
101    COMP9120  102    Bob
101    COMP9120  103    Clare
102    COMP9120  101    Alice
…

(R ⋈_θ S) ≡ σ_θ(R × S)

E ⋈_{E.sid=S.sid} S
E.sid  uosCode   S.sid  Name
101    COMP9120  101    Alice
102    COMP9120  102    Bob
102    COMP5338  102    Bob
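The equivalence (R ⋈_θ S) ≡ σ_θ(R × S) can be tried out on the E and S rows from this slide. The sketch below is only an illustration; the hash-based lookup is one assumed way to compute the join directly, not necessarily what a particular DBMS would choose.

```python
from itertools import product

# E(sid, uosCode) and S(sid, name), copied from the example above.
enrolled = [(101, "COMP9120"), (102, "COMP9120"), (102, "COMP5338"), (104, "COMP9120")]
students = [(101, "Alice"), (102, "Bob"), (103, "Clare")]

# Cross product: every E row concatenated with every S row.
cross = [e + s for e, s in product(enrolled, students)]


def theta(row):
    """Join condition E.sid = S.sid on a concatenated (E.sid, uosCode, S.sid, name) row."""
    return row[0] == row[2]


# Join expressed as a selection over the cross product.
join_via_cross = [row for row in cross if theta(row)]

# Join computed directly with a lookup table on S.sid.
students_by_sid = {}
for s in students:
    students_by_sid.setdefault(s[0], []).append(s)
direct_join = [e + s for e in enrolled for s in students_by_sid.get(e[0], [])]

print(len(cross), join_via_cross == direct_join)
```

Both routes give the same three rows, but the cross-product route materialises all 12 combinations first, which is why optimizers prefer a real join operator.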
Join Equivalences
SELECT * FROM Student S JOIN Enrolled E ON (E.sid=S.sid)

E
sid  uosCode
101  COMP9120
102  COMP9120
102  COMP5338
104  COMP9120

S
sid  Name
101  Alice
102  Bob
103  Clare

Joins commute:
(R ⋈_θ S) ≡ (S ⋈_θ R)
e.g. E ⋈_{E.sid=S.sid} S ≡ S ⋈_{E.sid=S.sid} E
Joins are associative:
R ⋈_θ (S ⋈_η T) ≡ (R ⋈_θ S) ⋈_η T

[Figure: equivalent expression trees over Student and Enrolled combining ⋈, σ_{uosCode='DATA3404'} and π_{sid,name}]
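Join commutativity, (R ⋈ S) ≡ (S ⋈ R), and associativity, (R ⋈ S) ⋈ T ≡ R ⋈ (S ⋈ T), can be checked with a small natural-join sketch in Python. The third relation `units` and its `points` attribute are hypothetical and only added so that there are three relations to join; results are compared as sets because the equivalences hold up to row and column order.

```python
def natural_join(left, right):
    """Natural join of two lists of dict rows on their shared attribute names."""
    out = []
    for l_row in left:
        for r_row in right:
            shared = l_row.keys() & r_row.keys()
            # With no shared attributes this degenerates to a cross product.
            if all(l_row[a] == r_row[a] for a in shared):
                out.append({**l_row, **r_row})
    return out


def as_set(rows):
    """Order-insensitive view of a relation, used to compare results."""
    return {frozenset(row.items()) for row in rows}


enrolled = [
    {"sid": 101, "uosCode": "COMP9120"},
    {"sid": 102, "uosCode": "COMP9120"},
    {"sid": 102, "uosCode": "COMP5338"},
    {"sid": 104, "uosCode": "COMP9120"},
]
students = [
    {"sid": 101, "name": "Alice"},
    {"sid": 102, "name": "Bob"},
    {"sid": 103, "name": "Clare"},
]
# Hypothetical third relation, not from the slides.
units = [
    {"uosCode": "COMP9120", "points": 6},
    {"uosCode": "COMP5338", "points": 6},
]

commutes = as_set(natural_join(enrolled, students)) == as_set(natural_join(students, enrolled))

left_first = natural_join(natural_join(enrolled, students), units)
right_first = natural_join(enrolled, natural_join(students, units))
associates = as_set(left_first) == as_set(right_first)

print(commutes, associates)
```

Note that `natural_join(students, units)` has no shared attribute and therefore builds a cross product, so although both orders give the same answer, their intermediate results differ in size.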
Optimization Heuristics
– Working through all possible join orders can be a big job as the number of relations
in the query gets large
– Dynamic programming can be used to store and reuse intermediate results
– Cost-based optimization is expensive, even with dynamic programming.
– Systems may use heuristics to reduce the number of choices that must be made in a
cost-based fashion.
– Heuristic optimization transforms the query-tree by using a set of rules that
typically (but not in all cases) improve execution performance:
– Perform selection early (reduces the number of tuples)
– Perform projection early (reduces the number of attributes)
– Perform the most restrictive selection and join operations before other similar
operations.
– Some systems use only heuristics; others combine heuristics with partial
cost-based optimization.
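To make the "perform selection early" heuristic concrete, here is a minimal Python sketch that counts how many row pairs a naive nested-loops join has to compare. The generated data (100 students, every tenth one with grade HD) is entirely hypothetical, and the comparison count is only a stand-in for a real cost measure such as I/Os.

```python
# Hypothetical data: 100 students, each enrolled once; every 10th student has grade 'HD'.
students = [(sid, f"student{sid}") for sid in range(1, 101)]
enrolled = [(sid, "DATA3404", "HD" if sid % 10 == 0 else "P") for sid in range(1, 101)]


def nested_loops_join(outer, inner):
    """Naive nested-loops equi-join on the first column; also counts comparisons."""
    comparisons = 0
    result = []
    for o in outer:
        for i in inner:
            comparisons += 1
            if o[0] == i[0]:
                result.append(o + i)
    return result, comparisons


# Plan 1: join first, then apply the selection grade = 'HD' (grade is column 4 of a joined row).
joined, late_comparisons = nested_loops_join(students, enrolled)
late_result = [row for row in joined if row[4] == "HD"]

# Plan 2: push the selection below the join.
hd_only = [e for e in enrolled if e[2] == "HD"]
early_result, early_comparisons = nested_loops_join(students, hd_only)

print(late_comparisons, early_comparisons, late_result == early_result)
```

Both plans return the same rows, but the pushed-down selection shrinks the join input by a factor of ten in this made-up data set.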
Join Optimization
– Fundamental problem for query optimization: Join Order
– In principle, naïve join optimization could enumerate all possible execution
plans, i.e., all possible 2-way join combinations for each query block.
C_n = \sum_{k=0}^{n-1} C_k \cdot C_{n-k-1}   (Catalan numbers, with C_0 = 1)
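Assuming the formula above is the Catalan recurrence C_n = Σ_{k=0}^{n-1} C_k · C_{n-k-1} with base case C_0 = 1, it can be evaluated directly with a short memoised Python function. The function name `catalan` is chosen for this sketch only.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def catalan(n):
    """Catalan number C_n computed from the recurrence, with C_0 = 1."""
    if n == 0:
        return 1
    # Split the remaining nodes between a left part (k) and a right part (n - k - 1).
    return sum(catalan(k) * catalan(n - k - 1) for k in range(n))


# C_{n-1} counts the binary tree shapes with n leaves, i.e. join tree shapes for n relations.
for n in (2, 3, 4, 5):
    print(n, catalan(n - 1))
```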
Search Space
– The resulting search space is enormous:
Possible bushy join trees joining n relations
Number of relations n   C_{n-1}   Join trees (C_{n-1} · n!)
2                       1         2
3                       2         12
4                       5         120
5                       14        1,680
6                       42        30,240
7                       132       665,280
8                       429       17,297,280
10                      4,862     17,643,225,600
– And we haven't yet even considered the use of m different join algorithms
(yielding another factor of m^(n−1))!
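The join-tree counts in the table equal C_{n-1} · n!: the number of tree shapes times the number of ways to assign the n relations to the leaves. For n = 6 that is 42 · 720 = 30,240. The sketch below recomputes the numbers using the closed form C_n = (2n choose n) / (n + 1); the function names are assumptions made for this example.

```python
from math import comb, factorial


def catalan(n):
    """Catalan number C_n via the closed form (2n choose n) / (n + 1)."""
    return comb(2 * n, n) // (n + 1)


def bushy_join_trees(n):
    """Number of bushy join trees over n relations: tree shapes times leaf orderings."""
    return catalan(n - 1) * factorial(n)


def plans_with_algorithms(n, m):
    """Each of the n - 1 joins can additionally use any of m join algorithms."""
    return bushy_join_trees(n) * m ** (n - 1)


for n in (2, 3, 4, 5, 6, 7, 8, 10):
    print(n, catalan(n - 1), bushy_join_trees(n))
```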
Left-Deep Join Plans
– To master this search space, System R (the father of all query optimizers)
made a fundamental decision: only left-deep join trees are considered.
– In left-deep join trees, the right-hand-side input for each join is a relation, not
the result of an intermediate join.
– Left-deep trees allow us to generate all fully pipelined plans.
• Intermediate results not written to temporary files.
• Not all left-deep trees are fully pipelined (e.g., Sort-Merge join).
[Figure: three example join trees, labelled "bushy join plan", "left-deep join plan" and "non-left-deep plan"; annotation: "Join results must be cached for use in next join"]
// Assumes optPlan({Rj}) already holds the best access plan for each single relation Rj
for i := 2 to n do
{
  for all S ⊆ {R1, …, Rn} s.t. |S| = i do
  {
    bestPlan := a dummy plan w/ infinite cost
    for all Rj, Sj s.t. S = {Rj} ∪ Sj and Rj ∉ Sj do
    {
      p := joinPlan(optPlan(Sj), Rj);
      if cost(p) ≤ cost(bestPlan) then
        bestPlan := p
    }
    optPlan(S) := bestPlan
  }
}
return (optPlan({R1, …, Rn}))
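The dynamic-programming loop above can be sketched in runnable Python as follows. The cost model is an assumption made purely for illustration (cost = sum of estimated intermediate result sizes, with independent pairwise selectivities), as are the relation names, cardinalities and selectivities; a real optimizer would plug in I/O cost estimates for specific join algorithms.

```python
from fractions import Fraction
from itertools import combinations


def best_left_deep_plan(cards, selectivity):
    """Dynamic programming over relation subsets, left-deep plans only.

    cards: relation name -> cardinality.
    selectivity: frozenset({a, b}) -> join selectivity (missing pair = cross product).
    Assumed cost model: sum of the estimated sizes of all intermediate results.
    """
    names = sorted(cards)
    # opt_plan[S] = (cost, estimated cardinality, plan); base case: single relations.
    opt_plan = {frozenset([r]): (0, cards[r], r) for r in names}
    for size in range(2, len(names) + 1):
        for subset in combinations(names, size):
            s = frozenset(subset)
            best = None
            for rj in subset:  # rj is the right-hand input of the last join
                sj = s - {rj}
                sub_cost, sub_card, sub_plan = opt_plan[sj]
                sel = Fraction(1)
                for other in sj:
                    sel *= selectivity.get(frozenset({other, rj}), 1)
                card = sub_card * cards[rj] * sel
                cost = sub_cost + card
                if best is None or cost <= best[0]:
                    best = (cost, card, (sub_plan, rj))
            opt_plan[s] = best
    cost, _, plan = opt_plan[frozenset(names)]
    return cost, plan


# Hypothetical statistics for three relations A, B, C.
cards = {"A": 1000, "B": 100, "C": 10}
selectivity = {
    frozenset({"A", "B"}): Fraction(1, 100),
    frozenset({"B", "C"}): Fraction(1, 10),
}
cost, plan = best_left_deep_plan(cards, selectivity)
print(cost, plan)
```

With these made-up numbers the cheapest left-deep plan joins the two small relations B and C first and adds A last. The table `opt_plan` holds one entry per subset of relations, so the loop considers far fewer alternatives than enumerating every join order separately.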
Examples of Cost-based Query Optimisation
[Figure: example execution plans; recoverable operator labels: FILTER (pod=5), INDEX NESTED JOIN, PROJECTION (tid,cid); one plan is marked "Pushed-down selection"]
– Questions / Caveats:
– How to optimise parameterized queries?
• Just take a ‘typical’ value for placeholders? Which value is ‘typical’?
• E.g. Oracle: Optimizer peeks into the actual bind values, then optimises.
Re-uses this plan even if cursor uses the query with different bind values
– How to cache/re-use these queries?
• If a query is re-issued with a different bind value, should we still re-use the plan?
• E.g. Oracle: Does not share plans with bind values!
Key Concepts
– Build on topics from past two weeks:
  – Index choice
  – Expression trees
  – Physical operations
  – Access Paths
  – Estimating I/O cost for all the above
– Execution Plans
  – Should be able to annotate an expression tree with appropriate physical operations
  – Should be able to identify plans that involve indexes, and propose suitable indexes for these plans
  – Should be able to compare plans based upon I/O cost
– RA Expression Equivalence
  – Should be able to translate between expression trees using RA equivalence rules
– Textbooks
– Ramakrishnan/Gehrke: Chapter 22
– Kifer/Bernstein/Lewis: Chapter 24