Relational Query Optimization: CS186 R & G Chapters 12/15
Review
Implementation of single Relational Operations
Choices depend on indexes, memory, stats, etc.
Joins
Blocked nested loops:
simple, exploits extra memory
Sort/Merge Join
good with small amount of memory, bad with duplicates
Hash Join
fast (given enough memory), bad with skewed data
SELECT S.sname
FROM Reserves R, Sailors S
WHERE R.sid=S.sid AND
      R.bid=100 AND
      S.rating>5
π_{sname}(σ_{bid=100 ∧ rating > 5}(Reserves ⋈ Sailors))
Query tree: π sname, over σ bid=100 ∧ rating > 5, over ⋈ sid=sid of Reserves and Sailors.
Physical plan: π sname (sort project), over σ bid=100 ∧ rating > 5 (inline selection), over ⋈ sid=sid (hash join) of Reserves and Sailors.
Queries (e.g., SELECT * FROM Blah B WHERE B.blah = blah)
Query Parser
Query Optimizer
    Plan Generator
    Plan Cost Estimator
    (consults the Catalog Manager: Schema, Statistics)
Query Executor
Usually there is a heuristics-based rewriting step before the cost-based steps.
Motivating Example
SELECT S.sname
FROM Reserves R, Sailors S
WHERE R.sid=S.sid AND
      R.bid=100 AND
      S.rating>5
Cost: 500+500*1000 I/Os
Plan:
π sname (on-the-fly), over σ bid=100 ∧ rating > 5 (on-the-fly), over ⋈ sid=sid (page-oriented nested loops) with Sailors as outer and Reserves as inner.
500,500 IOs
Alternative plans:
Push σ rating > 5 below the join, onto Sailors (on-the-fly).
250,500 IOs
Also push σ bid=100 below the join, onto Reserves (on-the-fly).
250,500 IOs
Swap the join order: Reserves with σ bid=100 (on-the-fly) as outer, Sailors with σ rating > 5 (on-the-fly) as inner.
6000 IOs
Materialize the inner: scan Sailors, apply σ rating > 5, and write to temp T2.
4250 IOs
1000 + 500 + 250 + (10 * 250)
Swap the join order again: Sailors with σ rating > 5 (on-the-fly) as outer; scan Reserves, apply σ bid=100, and write to temp T2 as inner.
4010 IOs
500 + 1000 + 10 + (250 * 10)
Sort-merge join plan: π sname (on-the-fly), over ⋈ sid=sid (sort-merge join) of σ bid=100 on Reserves (scan; write to temp T1) and σ rating > 5 on Sailors (scan; write to temp T2).
With 5 buffers, cost of plan:
Scan Reserves (1000) + write temp T1 (10 pages, if we have 100 boats, uniform distribution) = 1010.
Scan Sailors (500) + write temp T2 (250 pages, if we have 10 ratings) = 750.
Sort T1 (2*2*10) + sort T2 (2*4*250) + merge (10+250) = 2300.
Total: 1010 + 750 + 2300 = 4060 page I/Os.
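The I/O figures quoted for these plans follow from simple page arithmetic. The sketch below reproduces them, assuming the page counts implied by the scan costs (Reserves 1000 pages, Sailors 500 pages) and the uniformity assumptions stated above (bid=100 keeps 10 pages, rating>5 keeps 250 pages). The helper names are hypothetical, chosen only for this illustration.

```python
# Page counts taken from the scan costs quoted in the slides.
RESERVES_PAGES = 1000
SAILORS_PAGES = 500
# Pages surviving each selection under the stated uniformity assumptions.
T_RESERVES = 10   # bid=100: 1000 pages / 100 boats
T_SAILORS = 250   # rating>5: half of the 10 rating values


def page_nested_loops(outer_scan, outer_pages_joined, inner_pages):
    """Read the outer once, then read the inner once per outer page reaching the join."""
    return outer_scan + outer_pages_joined * inner_pages


def external_sort_cost(pages, passes):
    """Each pass reads and writes every page once."""
    return 2 * passes * pages


costs = {
    "selections above the join": page_nested_loops(SAILORS_PAGES, SAILORS_PAGES, RESERVES_PAGES),
    "rating>5 pushed onto outer Sailors": page_nested_loops(SAILORS_PAGES, T_SAILORS, RESERVES_PAGES),
    "Reserves (bid=100) as outer": page_nested_loops(RESERVES_PAGES, T_RESERVES, SAILORS_PAGES),
    # Inner is scanned once, written to a temp file, then the temp is rescanned.
    "materialized Sailors inner": RESERVES_PAGES + (SAILORS_PAGES + T_SAILORS) + T_RESERVES * T_SAILORS,
    "materialized Reserves inner": SAILORS_PAGES + (RESERVES_PAGES + T_RESERVES) + T_SAILORS * T_RESERVES,
    # Pass counts (2 and 4) are the ones used on the slide for 5 buffers.
    "sort-merge join": (RESERVES_PAGES + T_RESERVES) + (SAILORS_PAGES + T_SAILORS)
    + external_sort_cost(T_RESERVES, 2) + external_sort_cost(T_SAILORS, 4)
    + (T_RESERVES + T_SAILORS),
}

for plan, ios in costs.items():
    print(f"{plan}: {ios} I/Os")
```

Writing each plan as an expression over the same four page counts makes it easy to see which term a given rewrite (pushing a selection, swapping the join order, materializing the inner) actually changes.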
Plan: π sname (on-the-fly), over σ rating > 5 (on-the-fly), over ⋈ sid=sid (index nested loops, with pipelining), with σ bid=100 on Reserves as outer (use hash index; do not write to temp) and Sailors as inner.
With clustered index on bid of Reserves, we get 100,000/100 = 1000 tuples on 1000/100 = 10 pages.
INL with outer not materialized.
Projecting out unnecessary fields from outer doesn't help.
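The tuple and page counts for the bid=100 selection can be checked with a few lines. This is a minimal sketch assuming a uniform distribution of reservations over boats and a clustered index, so that matching tuples sit on consecutive pages; the function name is hypothetical.

```python
from math import ceil


def clustered_equality_selection(total_tuples, total_pages, distinct_values):
    """Estimate matching tuples and pages for col = value with a clustered index.

    Assumes values are uniformly distributed over distinct_values.
    """
    matching_tuples = total_tuples // distinct_values
    matching_pages = ceil(total_pages / distinct_values)
    return matching_tuples, matching_pages


# Reserves: 100,000 tuples on 1000 pages, 100 distinct boats.
tuples_out, pages_out = clustered_equality_selection(100_000, 1000, 100)
print(tuples_out, pages_out)
```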
A search algorithm
To sift through the plan space based on cost!
Summary
Query optimization is an important task in a
relational DBMS.
Must understand optimization in order to
understand the performance impact of a given
database design (relations, indexes) on a
workload (set of queries).
Two parts to optimizing a query:
Consider a set of alternative plans.
Must prune search space; typically, left-deep plans only.
Query Optimization
Query performance can be dramatically improved by
changing access methods and the order of operators.
Iterator interface
Cost estimation
Size estimation and reduction factors
Statistics and Catalogs
Relational Algebra Equivalences
Choosing alternate plans
Multiple relation queries
Will focus on System R-style optimizers
Highlights of System R Optimizer
Impact:
Most widely used currently; works well for < 10 joins.
Cost estimation:
Very inexact, but works ok in practice.
Statistics, maintained in system catalogs, used to estimate cost
of operations and result sizes.
Considers combination of CPU and I/O costs.
More sophisticated techniques known now.
Plan Space: Too large, must be pruned.
Many plans share common, overpriced subtrees
SELECT S.sname
FROM Sailors S
WHERE S.age IN
   (SELECT MAX (S2.age)
    FROM Sailors S2
    GROUP BY S2.rating)
π_{S.sid, MIN(R.day)} (HAVING_{COUNT(*)>2} (
   GROUP BY_{S.sid} (
      σ_{B.color = "red"} (
         Sailors ⋈ Reserves ⋈ Boats))))
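Relational Algebra Equivalences
Allow us to choose different join orders and to "push"
selections and projections ahead of joins.
Selections:
σ_{c1 ∧ ... ∧ cn}(R) ≡ σ_{c1}(...(σ_{cn}(R))...)
(cascade)
σ_{c1}(σ_{c2}(R)) ≡ σ_{c2}(σ_{c1}(R))
(commute)
Joins:
R ⋈ (S ⋈ T) ≡ (R ⋈ S) ⋈ T
(associative)
(R ⋈ S) ≡ (S ⋈ R)
(commutative)
This means we can do joins in any order.
But beware of Cartesian products!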
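The join equivalences can be checked on a toy instance. The sketch below is an illustration only: relations are modeled as lists of dicts, `natural_join` is a naive hypothetical helper (not an optimizer component), and the sample tuples are made up. It also shows why a careless order is dangerous: Sailors and Boats share no attribute, so joining them first yields a Cartesian product.

```python
def natural_join(r, s):
    """Naive natural join: pair up tuples that agree on all shared attributes."""
    out = []
    for a in r:
        for b in s:
            shared = a.keys() & b.keys()
            if all(a[k] == b[k] for k in shared):
                out.append({**a, **b})
    return out


def as_set(rel):
    """Order-insensitive view of a relation, for comparing results."""
    return {frozenset(t.items()) for t in rel}


# Made-up sample data.
sailors = [{"sid": 1, "sname": "Dustin"}, {"sid": 2, "sname": "Lubber"}]
reserves = [{"sid": 1, "bid": 100}, {"sid": 2, "bid": 101}, {"sid": 1, "bid": 101}]
boats = [{"bid": 100, "color": "red"}, {"bid": 101, "color": "green"}]

# Commutative: R join S == S join R.
commutes = as_set(natural_join(reserves, sailors)) == as_set(natural_join(sailors, reserves))

# Associative: S join (R join B) == (S join R) join B.
left = natural_join(natural_join(sailors, reserves), boats)
right = natural_join(sailors, natural_join(reserves, boats))
associates = as_set(left) == as_set(right)

# Same final answer, but the intermediate result is a full cross product.
cross = natural_join(sailors, boats)
via_cross = natural_join(cross, reserves)

print(commutes, associates, len(cross), as_set(via_cross) == as_set(left))
```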
More Equivalences
Eager projection
Can cascade and push some projections through
selections
Can cascade and push some projections below one
side of a join
Rule of thumb: can project anything not needed
downstream
Selection between attributes of the two arguments
of a cross-product converts cross-product to a join.
A selection on just attributes of R commutes with
R ⋈ S (i.e., σ(R ⋈ S) ≡ σ(R) ⋈ S).
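A small check of the selection pushdown rule, as a sketch under stated assumptions: the relations are made-up lists of dicts, the join is a naive equijoin on sid, and the helper names are hypothetical. The predicate mentions only a Reserves attribute, so it may be applied before or after the join.

```python
def join_on_sid(r, s):
    """Naive equijoin on sid, r as the outer relation."""
    return [{**a, **b} for a in r for b in s if a["sid"] == b["sid"]]


def select(rel, pred):
    """Keep the tuples of rel that satisfy pred."""
    return [t for t in rel if pred(t)]


# Made-up sample data.
reserves = [{"sid": 1, "bid": 100}, {"sid": 2, "bid": 101}, {"sid": 3, "bid": 100}]
sailors = [{"sid": 1, "rating": 7}, {"sid": 2, "rating": 9}, {"sid": 3, "rating": 4}]


def bid_is_100(t):
    # Uses only an attribute of Reserves.
    return t["bid"] == 100


select_after_join = select(join_on_sid(reserves, sailors), bid_is_100)
select_before_join = join_on_sid(select(reserves, bid_is_100), sailors)

print(select_after_join == select_before_join)
# Pushing the selection shrinks the join's outer input.
print(len(reserves), len(select(reserves, bid_is_100)))
```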
Cost Estimation
For each plan considered, must estimate total
cost:
Must estimate cost of each operation in plan tree.
Depends on input cardinalities.
We've already discussed how to estimate the cost of
operations (sequential scan, index scan, joins, etc.)
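Each term in the WHERE clause has a reduction factor (RF).
Max # tuples in result is the product of the cardinalities of the relations in the FROM clause.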
Postgres 8:
include/utils/selfuncs.h
backend/optimizer/path/clausesel.c
/*
*
* THIS IS A HACK TO GET V4 OUT THE DOOR.
*
-- JMH 7/9/92
*/
s1 = (Selectivity) 0.3333333;
Term col1=col2
RF = 1/MAX(NKeys(I1), NKeys(I2))
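A minimal sketch of how a reduction factor turns into a result-size estimate. Only the col1=col2 formula comes from the slide. The rest is assumed for illustration: multiplying the RFs treats the terms as independent, the function names are hypothetical, and the Sailors cardinality of 40,000 and the NKeys values are made up.

```python
from math import prod


def rf_col_eq_col(nkeys_i1, nkeys_i2):
    """RF for a term col1 = col2: 1 / MAX(NKeys(I1), NKeys(I2))."""
    return 1.0 / max(nkeys_i1, nkeys_i2)


def estimate_result_size(cardinalities, reduction_factors):
    """Max # tuples (product of cardinalities) scaled by every term's RF.

    Assumes the terms are independent, so their RFs multiply.
    """
    return prod(cardinalities) * prod(reduction_factors)


# Reserves has 100,000 tuples; 40,000 Sailors and the NKeys values are assumed.
rf_join = rf_col_eq_col(nkeys_i1=40_000, nkeys_i2=40_000)
estimate = estimate_result_size([100_000, 40_000], [rf_join])
print(round(estimate))
```

With sid a key of Sailors, the estimate comes out at roughly one result tuple per reservation, which matches intuition for a foreign-key join.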
Example
If we have an index on rating:
SELECT S.sid
FROM Sailors S
WHERE S.rating=8
Enumeration of Left-Deep Plans
Left-deep plans differ only in the order of
relations, the access method for each relation,
and the join method for each join.
Enumerated using N passes (if N relations joined):
Relations   Interesting order   Best plan        Cost
{R, S}      <none>              hashjoin(R,S)    1000
{R, S}      <R.a, S.b>          sortmerge(R,S)   1500
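The table of best plans can be pictured as a dictionary keyed by the set of relations and the interesting order, keeping only the cheapest plan per key. The sketch below is an illustration, not System R's actual data structure: the two entries from the table are used as given, and the third candidate (a nested loops plan with cost 4000) is hypothetical, added only to show pruning.

```python
# (set of relations, interesting order) -> (plan, cost)
best = {}


def consider(relations, order, plan, cost):
    """Keep a candidate only if it beats the current best for the same key."""
    key = (frozenset(relations), tuple(order))
    if key not in best or cost < best[key][1]:
        best[key] = (plan, cost)


consider({"R", "S"}, (), "hashjoin(R,S)", 1000)
# More expensive, but kept: its output order may save a later sort.
consider({"R", "S"}, ("R.a", "S.b"), "sortmerge(R,S)", 1500)
# Hypothetical candidate: same key as the hash join, costlier, so pruned.
consider({"R", "S"}, (), "nestedloops(R,S)", 4000)

for (relations, order), (plan, cost) in best.items():
    print(sorted(relations), order or "<none>", plan, cost)
```

Keying on the order as well as the relation set is what lets a costlier but usefully sorted plan survive alongside the overall cheapest one.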
Enumeration of Plans (Contd.)
An N-1 way plan is not combined with an
additional relation unless there is a join
condition between them, unless all predicates in
WHERE have been used up.
i.e., avoid Cartesian products if possible.
ORDER BY, GROUP BY, aggregates etc. handled as a
final step, using either an "interestingly ordered"
plan or an additional sort/hash operator.
In spite of pruning plan space, this approach is
still exponential in the # of tables.
Recall that in practice, the COST considered is #IOs
+ factor * #CPU instructions
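The pass-by-pass enumeration can be sketched as a small dynamic program. This is a toy under loudly stated assumptions, not the System R implementation: the cost of a plan is simply the sum of its intermediate result sizes (a stand-in for the I/O and CPU model), result sizes use RF = 1/NKeys per join condition, interesting orders are ignored, Cartesian products are always skipped, and all names and cardinalities except Reserves' 100,000 tuples are hypothetical.

```python
from itertools import combinations


def best_left_deep_plans(cards, join_nkeys):
    """Return {relation set: (result size, cost, join order)} for left-deep plans.

    cards: relation name -> cardinality.
    join_nkeys: frozenset({A, B}) -> MAX(NKeys) for the join condition between A and B.
    """
    # Pass 1: one plan per single relation (toy cost 0).
    best = {frozenset([r]): (cards[r], 0, (r,)) for r in cards}
    # Pass k: extend each (k-1)-relation plan with one more relation as the inner.
    for k in range(2, len(cards) + 1):
        for subset in combinations(sorted(cards), k):
            target = frozenset(subset)
            for inner in subset:
                outer = target - {inner}
                if outer not in best:
                    continue
                links = [join_nkeys[frozenset((o, inner))]
                         for o in outer if frozenset((o, inner)) in join_nkeys]
                if not links:
                    continue  # no join condition: would be a Cartesian product
                out_size, out_cost, order = best[outer]
                size = out_size * cards[inner]
                for nkeys in links:
                    size //= nkeys
                cost = out_cost + size
                if target not in best or cost < best[target][1]:
                    best[target] = (size, cost, order + (inner,))
    return best


cards = {"Sailors": 40_000, "Reserves": 100_000, "Boats": 10}
join_nkeys = {
    frozenset(("Sailors", "Reserves")): 40_000,
    frozenset(("Reserves", "Boats")): 100,
}
plans = best_left_deep_plans(cards, join_nkeys)
print(plans[frozenset(cards)])
```

Each pass only extends the best plan kept for a smaller set, which is the pruning step. The number of relation subsets still grows exponentially with the number of tables.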
Example
Sailors: Hash, B+ on sid
Reserves: Clustered B+ tree on bid; B+ on sid
Boats: B+, Hash on color
[Figure: query tree with GROUP BY sid at the root, joins on sid=sid and bid=bid, and the selection color=red, over Sailors, Reserves, and Boats]
Pass 1
Best plan for accessing each
relation regarded as the first
relation in an execution plan
Reserves, Sailors: File Scan
Boats: B+ tree & Hash on color
Pass 2
For each of the plans in pass 1, generate plans
joining another relation as the inner, using all join
methods (and matching inner access methods)
Points to Remember
Must understand optimization in order to
understand the performance impact of a given
database design (relations, indexes) on a
workload (set of queries).
Two parts to optimizing a query:
Consider a set of alternative plans.
Good to prune search space; e.g., left-deep plans only,
avoid Cartesian products.
Points to Remember
Single-relation queries:
All access paths considered, cheapest is chosen.
Issues: Selections that match index, whether index
key has all needed fields and/or provides tuples in a
desired order.
Summary
Optimization is the reason for the lasting
power of the relational system
But it is primitive in some ways
New areas: smarter summary statistics
(fancy histograms and sketches), auto-tuning statistics, adaptive runtime re-optimization (e.g., eddies)