DATA3404 Data Science Platforms

Week 7: Query Optimisation

Presented by
A/Prof Uwe Roehm
School of Computer Science

DATA3404 "Data Science Platforms" - 2020 (Roehm) 1

Learning Objectives
– Query Optimisation
– Equivalence Rules
– Join Ordering
– Cost-Based Query Optimisation
– Optimising nested and dynamic queries



Cost-based Query Optimisation


Motivation for Query Optimisation


Query is (by definition) declarative, so it does not specify the execution plan.
If a query runs slowly, what can you do?
– Sometimes you can write a better query
– E.g. by avoiding cross-joins or trying to use more selective filter conditions
– Add indexes
– Some systems allow hints (assumes you know better than the query planner)
– Oracle hint comments such as /*+ ALL_ROWS */, etc
– PostgreSQL doesn’t have them (deliberately)
– SQL Server OPTION clause
– MySQL USE/IGNORE/FORCE INDEX
– Configuration tweaks & Hardware upgrades
To do most of the above you need to understand what’s going on inside, and how the
query optimizer works



Reminder: Query Processing
Example query:
SELECT name
FROM Student NATURAL JOIN Enrolled
WHERE uosCode='DATA3404'

Processing pipeline:
1. query → parser and translator → relational algebra expression
2. relational algebra expression → optimizer (focus this week; uses statistics about data) → execution plan
3. execution plan → evaluation engine (reads data) → query output

Relational algebra expression for the example query:
π_{name}(σ_{uosCode='DATA3404'}(Enrolled) ⋈ Student)

Query Optimization
Central Problems:
– Query is (by definition) declarative,
i.e. it does not specify the execution order.
But we need an executable plan.

– The goal of query optimization is to find a suitable execution plan.


– Ideally: Want to find best plan.
– Practically: Avoid worst plans!
Time for query optimization adds to total query execution time.



Cost-based Query Optimization
– Generation of query-evaluation plans for an expression involves several steps:
1. Generating logically equivalent expressions
• Use equivalence rules to transform an expression into an equivalent one.
2. Annotating resultant expressions to get alternative query plans
3. Choosing the cheapest plan based on estimated cost

– The overall process is called cost based optimization.


– Two main issues:
• For a given query, what plans are considered?
– Algorithm to search plan space for cheapest (estimated) plan.
• How is the cost of a plan estimated?
– Alternative: rule-based optimization


The Importance of I/O Cost


› Amongst all equivalent evaluation plans, choose the one with the lowest estimated cost
› Usually cost is measured in terms of time to answer the query
› Typically disk access is the predominant cost, and is also relatively easy to estimate
- Based on statistics about size of relations, spread of values in a column, etc.
› For simplicity, we just use the number of page transfers from disk (I/Os) as the cost measure

[Figure: query time broken down into CPU processing, disk I/O, and network comms]


Cost Estimation
– For each plan considered, must estimate cost:
– Must estimate cost of each operation in plan tree.
• Depends on input cardinalities.
• We’ve already discussed how to estimate the cost of operations
(sequential scan, index scan, joins, etc.)
– Must also estimate size of result for each operation in tree!
• Use information about the input relations.
• For selections and joins, assume independence of predicates
– Database Statistics play a crucial role here
– Our assumption: data values are uniformly distributed
(Note: for many data sets this is not the case)
– How to determine a data distribution over large data sets?
– How to keep those statistics up-to-date?
⇒ Periodic sampling
– How to handle data inter-dependencies? Hard!
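
The uniformity and independence assumptions above can be sketched as a small size estimator. This is only an illustration: the function names and the example table (10,000 rows, 2,500 distinct values in one column, 100 in another) are assumptions made for the sketch, not part of any particular DBMS.

```python
def eq_selectivity(distinct_values: int) -> float:
    """Selectivity of `column = constant`, assuming uniformly distributed values."""
    return 1.0 / distinct_values


def and_selectivity(*selectivities: float) -> float:
    """Selectivity of a conjunction, assuming the predicates are independent."""
    result = 1.0
    for s in selectivities:
        result *= s
    return result


def estimated_rows(table_rows: int, selectivity: float) -> float:
    """Estimated result size of a selection over a table."""
    return table_rows * selectivity


# Hypothetical statistics for one table with two filtered columns.
rows = 10_000
sel_a = eq_selectivity(2_500)   # a = constant
sel_b = eq_selectivity(100)     # b = constant

print(estimated_rows(rows, sel_a))
print(estimated_rows(rows, and_selectivity(sel_a, sel_b)))
```

If the two columns are correlated, the product of selectivities can be far off, which is why data inter-dependencies are hard to handle.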

Exercise 1: Query Costs

SELECT *
FROM Student JOIN Enrolled USING(sid)
WHERE grade='HD'

Student
sid  name   address         dob
123  Alice  1 Acacia Drive  18-AUG-1989
124  Bob    7 Belmont Av    07-JUN-1993
…

Enrolled
sid  uosCode   grade
123  COMP9120  HD
124  COMP5338  P
…

Step 1: BLOCK NESTED LOOPS join of Student and Enrolled
sid  name   address         dob          sid  uosCode   grade
123  Alice  1 Acacia Drive  18-AUG-1989  123  COMP9120  HD
124  Bob    7 Belmont Av    07-JUN-1993  124  COMP5338  P
…

Step 2: σ FILTER (grade='HD')
sid  name   address         dob          sid  uosCode   grade
123  Alice  1 Acacia Drive  18-AUG-1989  123  COMP9120  HD
…
Equivalence Rules


Equivalent Algebra Expressions


– Two relational algebra expressions are said to be equivalent if on every
legal database instance the two expressions generate the same set of tuples
– Note: order of tuples is irrelevant
– In SQL, inputs and outputs are multisets of tuples, so for SQL the above
definition is applied to multisets of tuples (duplicates count)

– An equivalence rule says that expressions of two forms are equivalent


– Can replace expression of first form by second, or vice versa



Equivalent Algebra Expressions
Two plans for the same query, related by equivalence rules:
Plan 1: π_{name}(σ_{uosCode='DATA3404'}(Enrolled) ⋈ Student)
Plan 2: π_{name}(σ_{uosCode='DATA3404'}(Student ⋈ Enrolled))

Instance 1
Enrolled            Student
sid  uosCode        sid  Name
101  DATA3404       101  Alice
102  COMP5338       102  Bob
Result of both plans: Alice

Instance 2
Enrolled            Student
sid  uosCode        sid  Name
101  DATA3404       103  Clare
102  COMP5338       102  Bob
103  COMP5338       101  Alice
103  DATA3404
Result of both plans: Alice, Clare (the order of the tuples may differ)

Selection Equivalence
SELECT *
FROM Enrolled
WHERE sid=102
AND uos='DATA3404'

Enrolled
sid  uos       grade
101  DATA3404  D
101  COMP5338  CR
102  DATA3404  HD
102  COMP5338  P

Three equivalent evaluations, all returning (102, DATA3404, HD):
– σ_{sid=102 ∧ uos='DATA3404'}(Enrolled)
– σ_{uos='DATA3404'}(σ_{sid=102}(Enrolled)), intermediate result: (102, DATA3404, HD), (102, COMP5338, P)
– σ_{sid=102}(σ_{uos='DATA3404'}(Enrolled)), intermediate result: (101, DATA3404, D), (102, DATA3404, HD)

Selections with conditions joined with the ∧ (AND) operator cascade:
σ_{c1 ∧ … ∧ cn}(R) ≡ σ_{c1}(… σ_{cn}(R))

Nested selection operations commute:
σ_{c1}(σ_{c2}(R)) ≡ σ_{c2}(σ_{c1}(R))
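
The cascade and commutativity rules for selections can be checked on the slide's Enrolled rows with a minimal sketch. The helper `select` and the predicate names are illustrative assumptions; relations are modelled as plain Python lists of tuples.

```python
# Enrolled(sid, uos, grade), rows as on the slide.
enrolled = [
    (101, "DATA3404", "D"),
    (101, "COMP5338", "CR"),
    (102, "DATA3404", "HD"),
    (102, "COMP5338", "P"),
]


def select(pred, rows):
    """Relational selection: keep the rows that satisfy the predicate."""
    return [r for r in rows if pred(r)]


def sid_is_102(r):
    return r[0] == 102


def uos_is_data3404(r):
    return r[1] == "DATA3404"


# One selection with a conjunctive condition ...
combined = select(lambda r: sid_is_102(r) and uos_is_data3404(r), enrolled)
# ... equals a cascade of two selections, in either order.
sid_first = select(uos_is_data3404, select(sid_is_102, enrolled))
uos_first = select(sid_is_102, select(uos_is_data3404, enrolled))

print(combined, sid_first, uos_first)
```

All three lists contain the same single row, although the size of the intermediate result depends on which condition is applied first.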
Projection Equivalence
SELECT name, home
FROM Student

Student
sid  name   age  home
101  Alice  21   Austria
102  Bob    32   Brazil
103  Clare  23   Chile

– π_{name,home}(Student) yields (Alice, Austria), (Bob, Brazil), (Clare, Chile)
– π_{name,home}(π_{sid,name,home}(Student)) yields the same result; the intermediate result keeps sid, name and home
– Projecting with π_{sid,name} first does not work: home attribute not available afterwards

Projection operations cascade:
π_{a1}(R) ≡ π_{a1}(… (π_{a1,…,an}(R)))

Projections and Selections

SELECT name, home
FROM Student
WHERE sid=102

Student
sid  name   age  home
101  Alice  21   Austria
102  Bob    32   Brazil
103  Clare  23   Chile

– π_{name,home}(σ_{sid=102}(Student)): the selection yields (102, Bob, 32, Brazil), the projection then yields (Bob, Brazil)
– σ_{sid=102}(π_{name,home}(Student)): not possible, sid attribute not available after the projection
– π_{name,home}(σ_{sid=102}(π_{sid,name,home}(Student))): the inner projection keeps sid, the selection yields (102, Bob, Brazil), the outer projection yields (Bob, Brazil)

A projection commutes with a selection that only uses attributes retained by the projection.
Joins and Cross Products
SELECT * FROM Student S
JOIN Enrolled E ON (E.sid=S.sid)

E                   S
sid  uosCode        sid  Name
101  COMP9120       101  Alice
102  COMP9120       102  Bob
102  COMP5338       103  Clare
104  COMP9120

(R ⋈_θ S) ≡ σ_θ(R × S)

E × S
E.sid  uosCode   S.sid  Name
101    COMP9120  101    Alice
101    COMP9120  102    Bob
101    COMP9120  103    Clare
102    COMP9120  101    Alice
102    COMP9120  102    Bob
102    COMP9120  103    Clare
102    COMP5338  101    Alice
102    COMP5338  102    Bob
102    COMP5338  103    Clare
104    COMP9120  101    Alice
104    COMP9120  102    Bob
104    COMP9120  103    Clare

E ⋈_{E.sid=S.sid} S = σ_{E.sid=S.sid}(E × S)
E.sid  uosCode   S.sid  Name
101    COMP9120  101    Alice
102    COMP9120  102    Bob
102    COMP5338  102    Bob
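
The rule (R ⋈_θ S) ≡ σ_θ(R × S) can be illustrated on the slide's data. This is a sketch with relations as lists of tuples; it computes the join once as a filtered cross product and once as a direct nested-loop join. The variable names are assumptions for illustration.

```python
from itertools import product

# E = Enrolled(sid, uosCode), S = Student(sid, Name), rows as on the slide.
enrolled = [(101, "COMP9120"), (102, "COMP9120"), (102, "COMP5338"), (104, "COMP9120")]
student = [(101, "Alice"), (102, "Bob"), (103, "Clare")]

# Cross product E x S: every Enrolled row concatenated with every Student row.
cross = [e + s for e, s in product(enrolled, student)]

# Selection sigma_{E.sid = S.sid} over the cross product
# (E.sid is at position 0, S.sid at position 2).
join_via_cross = [row for row in cross if row[0] == row[2]]

# Direct nested-loop join with the same condition.
join_direct = [e + s for e in enrolled for s in student if e[0] == s[0]]

print(len(cross), len(join_via_cross))
for row in join_via_cross:
    print(row)
```

Both routes give the same rows, but the cross product materialises every pair first, which is why a plan with a real join operator is normally preferred.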

Join Equivalences
SELECT * FROM Student S JOIN Enrolled E ON (E.sid=S.sid)

Joins commute:
(R ⋈_θ S) ≡ (S ⋈_θ R)
Joins are associative:
R ⋈_θ (S ⋈_η T) ≡ (R ⋈_θ S) ⋈_η T
(assuming θ only uses attributes of R and S, and η only uses attributes of S and T)

E                   S
sid  uosCode        sid  Name
101  COMP9120       101  Alice
102  COMP9120       102  Bob
102  COMP5338       103  Clare
104  COMP9120

E ⋈_{E.sid=S.sid} S
E.sid  uosCode   S.sid  Name
101    COMP9120  101    Alice
102    COMP9120  102    Bob
102    COMP5338  102    Bob

S ⋈_{E.sid=S.sid} E
S.sid  Name   E.sid  uosCode
101    Alice  101    COMP9120
102    Bob    102    COMP5338
102    Bob    102    COMP9120

Both joins contain the same tuples; only the column order and the tuple order differ.


More Equivalences
Pushing down selections:
σ_{uosCode='DATA3404'}(Enrolled ⋈ Student) ≡ σ_{uosCode='DATA3404'}(Enrolled) ⋈ Student

Pushing down projections:
π_{name}(Enrolled ⋈ Student) ≡ π_{name}(π_{sid,name}(Student) ⋈ Enrolled)
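
Pushing a selection below a join can be sketched as follows. The rows are the small Enrolled and Student example used earlier in these slides; the helper `join` is an assumption for illustration and always joins on the first attribute (sid).

```python
# Enrolled(sid, uosCode) and Student(sid, name).
enrolled = [(101, "DATA3404"), (102, "COMP5338"), (103, "COMP5338"), (103, "DATA3404")]
student = [(103, "Clare"), (102, "Bob"), (101, "Alice")]


def join(left, right):
    """Nested-loop equi-join on the first attribute (sid) of both inputs."""
    return [l + r for l in left for r in right if l[0] == r[0]]


# Selection applied after the join: the join produces all matching pairs first.
joined = join(enrolled, student)
late = [row for row in joined if row[1] == "DATA3404"]

# Selection pushed down: filter Enrolled first, then join the smaller input.
filtered = [e for e in enrolled if e[1] == "DATA3404"]
early = join(filtered, student)

print(len(joined), len(filtered))
print(sorted(late) == sorted(early))
```

The two plans return the same tuples, but the pushed-down version feeds fewer rows into the join.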

Optimization Heuristics
– Working through all possible join orders can be a big job as number of relations
in query gets large
– Can use dynamic programming to store intermediate results
– Cost-based optimization is expensive, even with dynamic programming.
– Systems may use heuristics to reduce the number of choices that must be made in a
cost-based fashion.
– Heuristic optimization transforms the query-tree by using a set of rules that
typically (but not in all cases) improve execution performance:
– Perform selection early (reduces the number of tuples)
– Perform projection early (reduces the number of attributes)
– Perform most restrictive selection and join operations before other similar
operations.
– Some systems use only heuristics, others combine heuristics with partial cost-
based optimization.



Join Ordering


Join Optimization
– Fundamental problem for query optimization: Join Order
– In principle, naïve join optimization could enumerate all possible execution
plans, i.e., all possible 2-way join combinations for each query block.

[Source: DB2 lecture, Uni Tuebingen]


How Many Such Combinations Are There?
– A join over n+1 relations R1,...,Rn+1 requires n binary joins.
– Its root-level operator joins sub-plans of k and n − k − 1 join operators
(0 ≤ k ≤ n − 1):
left sub-plan: k joins over R1, …, Rk+1
right sub-plan: n−k−1 joins over Rk+2, …, Rn+1

– Let C_i be the number of possibilities to construct a binary tree of i inner
nodes (join operators):

$$C_n = \sum_{k=0}^{n-1} C_k \cdot C_{n-k-1}$$
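
The recurrence for the number of binary tree shapes can be evaluated directly. The sketch below assumes the usual base case C_0 = 1 (a single relation, no join), which the recurrence needs to terminate; the function name is illustrative.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def tree_shapes(n: int) -> int:
    """Number of binary trees with n inner nodes (join operators)."""
    if n == 0:
        return 1  # assumed base case: one leaf, no join operator
    # The root splits the remaining n - 1 joins into k on the left
    # and n - k - 1 on the right.
    return sum(tree_shapes(k) * tree_shapes(n - k - 1) for k in range(n))


print([tree_shapes(i) for i in range(10)])
```

These are the Catalan numbers; memoisation keeps the evaluation polynomial even though the values grow exponentially.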

Search Space
– The resulting search space is enormous:
Possible bushy join trees joining n relations
Number of relations n   C_{n−1}   Join trees
2                       1         2
3                       2         12
4                       5         120
5                       14        1,680
6                       42        30,240
7                       132       665,280
8                       429       17,297,280
10                      4,862     17,643,225,600

– And we haven’t yet even considered the use of m different join algorithms
(yielding another factor of m^(n−1))!
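
The join-tree counts can be recomputed as C_{n−1} · n!: each of the C_{n−1} tree shapes can have its n leaves labelled with the relations in n! ways. The sketch below uses the closed form of the Catalan numbers; the function names are assumptions for illustration.

```python
from math import comb, factorial


def catalan(n: int) -> int:
    """Closed form of the n-th Catalan number."""
    return comb(2 * n, n) // (n + 1)


def bushy_join_trees(n: int) -> int:
    """Bushy join trees over n relations: tree shapes times leaf orderings."""
    return catalan(n - 1) * factorial(n)


def plans_with_algorithms(n: int, m: int) -> int:
    """Each of the n - 1 joins can additionally use one of m join algorithms."""
    return bushy_join_trees(n) * m ** (n - 1)


for n in (2, 3, 4, 5, 6, 7, 8, 10):
    print(n, catalan(n - 1), bushy_join_trees(n))
```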
Left-Deep Join Plans
– To master this search space, fundamental decision in System R (the father of
all query optimizers): only left-deep join trees are considered.
– In left-deep join trees, the right-hand-side input for each join is a relation, not
the result of an intermediate join.
– Left-deep trees allow us to generate all fully pipelined plans.
• Intermediate results not written to temporary files.
• Not all left-deep trees are fully pipelined (e.g., Sort-Merge join).


[Figure: three join trees over relations A, B, C, D.
Bushy join plan: (A ⋈ B) ⋈ (C ⋈ D).
Left-deep join plan: ((A ⋈ B) ⋈ C) ⋈ D.
Non-left-deep plan: a plan in which a join result appears as the right-hand input of another join.
In the bushy and the non-left-deep plan, join results must be cached for use in the next join.]

Reminder: Passing Records between Operations


› Materialization (set-at-a-time):
1. Evaluate whole operation
2. Store (materialize) results in a temporary relation
3. Next operation reads in temporary relation.
• Always applicable, expensive I/O

› Pipelining (tuple-at-a-time):
1. Evaluate one row of output of a relation
2. Pass (pipeline) each row to next operation
• Much cheaper (all in memory)
• Some operations not compatible with pipelining
(e.g., inner table of join, sorts, hash joins, aggregations)

[Figure: example operator tree with π_{b,d} at the root, joins over R, S and T, and the selection σ_{S.e>=100 ∧ S.e<=119} on S]


Dynamic Programming:
Bottom-Up Enumeration of Left-Deep Plans
– Left-deep plans differ only in the order of relations, the access method for each relation,
and the join method for each join.
– Enumerated using N passes (if N relations joined):
Dynamic programming approach
– Pass 1: Best 1-relation plans
Find best access path for each relation individually.
– Pass 2: Best 2-relation plans
Find best way to join result of each pair of tables Ri and Rj using previous best access paths:
optPlan( {Ri, Rj} ) = best of Ri ⋈ Rj and Rj ⋈ Ri (12 plans to consider)
– Pass N: Best N-relation plans
Find best way to join the result of an (N-1)-relation plan (as outer) to the N-th relation,
based on the best (N-1)-relation plans.
– For each subset of relations, retain only:
– Cheapest plan overall, plus
– Cheapest plan for each interesting order of the tuples.


Dynamic Programming in Bottom-Up Query Optimization


for i := 1 to n do
    optPlan({Ri}) := best_access_plans(Ri)

for i := 2 to n do
{
    for all S ⊆ {R1, …, Rn} s.t. |S| = i do
    {
        bestPlan := a dummy plan w/ infinite cost
        for all Rj, Sj s.t. S = {Rj} ∪ Sj do
        {
            p := joinPlan(optPlan(Sj), Rj);
            if cost(p) ≤ cost(bestPlan) then
                bestPlan := p
        }
        optPlan(S) := bestPlan
    }
}
return (optPlan({R1, …, Rn}))
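
The bottom-up enumeration can be sketched in runnable form under strong simplifying assumptions. The cost model here is a toy one chosen for illustration (cost of a plan = sum of the estimated sizes of its join results, access cost 0), the function name and the statistics are hypothetical, and interesting orders as well as the choice of access and join methods are left out.

```python
from itertools import combinations


def best_left_deep_plan(cards, selectivity):
    """Dynamic programming over subsets of relations, left-deep plans only.

    cards:       dict relation name -> estimated row count
    selectivity: dict frozenset({a, b}) -> join selectivity
                 (a missing pair is treated as a cross product, i.e. 1.0)
    Returns (cost, estimated rows, join order) for the full set.
    """
    rels = sorted(cards)
    # Pass 1: single-relation plans (toy assumption: access cost 0).
    opt = {frozenset([r]): (0.0, float(cards[r]), (r,)) for r in rels}

    # Pass i: best plan for every subset of i relations.
    for size in range(2, len(rels) + 1):
        for subset in combinations(rels, size):
            s = frozenset(subset)
            best = None
            for r in subset:                      # r is joined last (inner)
                cost, rows, order = opt[s - {r}]  # best plan for the rest (outer)
                out_rows = rows * cards[r]
                for other in s - {r}:
                    out_rows *= selectivity.get(frozenset((other, r)), 1.0)
                new_cost = cost + out_rows        # toy cost: intermediate sizes
                if best is None or new_cost < best[0]:
                    best = (new_cost, out_rows, order + (r,))
            opt[s] = best
    return opt[frozenset(rels)]


# Hypothetical statistics: A joins B, B joins C, no predicate between A and C.
cards = {"A": 1000, "B": 10, "C": 100}
sel = {frozenset(("A", "B")): 0.01, frozenset(("B", "C")): 0.1}
cost, rows, order = best_left_deep_plan(cards, sel)
print(order, cost, rows)
```

With these numbers the plan that starts with the cross product of A and C is never chosen, because its first intermediate result alone is far larger than the total cost of the alternatives.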
Examples of
Cost-based Query Optimisation


Exercise 2: Execution Trees


– Scenario:
Car(cid, pod)
Trip(tid,cid)

SELECT C.cid, T.tid
FROM Car C, Trip T
WHERE T.cid = C.cid
AND C.pod = 5;

– Physical operations available:
  – TABLE ACCESS
  – INDEX SCAN
  – (simple) NESTED LOOPS
  – BLOCK NESTED LOOPS
  – INDEX NESTED LOOPS
  – MERGE JOIN

– Give execution plans:
  – No indexes
  – Clustered primary index on cid
  – Proposed extra index


Example Solutions

Basic Plan using no indexes
PROJECTION (tid,cid)
  FILTER (pod=5)
    BLOCK NESTED JOIN
      TABLE SCAN Car
      TABLE SCAN Trip

Using clustered primary index
PROJECTION (tid,cid)
  FILTER (pod=5)
    INDEX NESTED JOIN
      TABLE SCAN Trip
      INDEX SCAN Car.cid

Using unclustered secondary index on Trip(cid)
PROJECTION (tid,cid)
  INDEX NESTED JOIN
    INDEX SCAN ON Car.pod (pod=5)
    INDEX SCAN ON Trip.cid

Pushed-down selection
PROJECTION (tid,cid)
  BLOCK NESTED JOIN
    TABLE SCAN Car (pod=5)
    TABLE SCAN Trip

Exercise 3: Choosing an Optimal Plan


Query
SELECT C.cid, T.tid
FROM Car C, Trip T
WHERE T.cid = C.cid
AND C.pod = 5;
– Identify the best plan by costing out several logically equivalent plans.

Statistics
– Car(cid, pod):
  – 10,000 rows, each 50 bytes
  – 10,000 values for cid
  – 2500 values for pod
– Trip(tid, cid):
  – Foreign key (cid) references Car
  – 50,000 rows, each 40 bytes
  – 50,000 values for tid
  – 10,000 values for cid
– Assume a page is 4096 bytes, of which 4000 are useful for data records
(the rest is header)
Basic Plan using No Indexes

PROJECTION (tid,cid)
  FILTER (pod=5)
    BLOCK NESTED JOIN
      TABLE SCAN Car
      TABLE SCAN Trip
Selection and projection in memory (pipelined)

Cost estimate
– Cost to scan Car
  – Car has 4000/50=80 data records per page, so is stored on 10000/80=125 pages
– Cost to scan Trip
  – Trip has 4000/40=100 data records per page, so is stored on 50000/100=500 pages
– So we read 125 pages of Car, and we scan Trip 125 times (doing 500 page reads each time)
– Total disk I/O is 125+125*500 = 62,625 pages
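
The page counts and the cost of the basic plan follow from the exercise statistics. A minimal sketch of the arithmetic, with illustrative names and the 4000 usable bytes per page taken from the exercise:

```python
USABLE_PAGE_BYTES = 4000  # data bytes per page, as stated in the exercise


def pages(rows: int, row_bytes: int) -> int:
    """Number of pages needed when records do not span pages."""
    records_per_page = USABLE_PAGE_BYTES // row_bytes
    return -(-rows // records_per_page)  # ceiling division


car_pages = pages(10_000, 50)    # Car: 10,000 rows of 50 bytes
trip_pages = pages(50_000, 40)   # Trip: 50,000 rows of 40 bytes

# Block nested loops as costed on the slide: read Car once,
# scan Trip once per page of Car.
basic_cost = car_pages + car_pages * trip_pages
print(car_pages, trip_pages, basic_cost)
```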

Plan using Pushed-down Selection

PROJECTION (tid,cid)
  BLOCK NESTED JOIN
    TABLE SCAN Car (pod=5)
    TABLE SCAN Trip
Selection and projection in memory (pipelined)

Cost estimate
– Cost to scan Car
  – Car is stored on 125 pages
– Selectivity on Car
  – Filter condition has selectivity of 1/2500
  – about 4 cars will get through filter (10,000 cars * selectivity)
– So we read 125 pages of Car, and we scan Trip 4 times (doing 500 page reads each time)
– Total disk I/O is 125+4*500 = 2125 pages
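
The effect of the pushed-down selection on the cost can be recomputed in a few lines. This sketch follows the slide's costing (one scan of Trip per car that passes the filter) and assumes uniformly distributed pod values; the names are illustrative.

```python
car_rows, pod_values = 10_000, 2_500
car_pages, trip_pages = 125, 500

# Uniformity assumption: every pod value occurs equally often.
matching_cars = car_rows // pod_values

# Read Car once, then scan Trip once per matching car.
pushed_down_cost = car_pages + matching_cars * trip_pages
print(matching_cars, pushed_down_cost)
```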


Plan using Clustered Primary Index
A plan suitable if Car has clustered primary index on Car.cid:

PROJECTION (tid,cid)
  FILTER (pod=5)
    INDEX NESTED JOIN
      TABLE SCAN Trip
      INDEX SCAN Car.cid
Selection and projection in memory (pipelined)

Cost estimate
– Assume index on Car.cid has 2 levels (index height 1)
– Cost to scan Trip
  – Read 500 pages
– Cost to look up row of Car with given cid
  – Read one page per level, then read the one data record that is pointed to
  (recall Car.cid is primary key of Car)
  – Cost of lookup is 3 pages
– So we read 500 pages of Trip, and we do index lookup on C once for each record in Trip
(i.e., we do 50,000 index lookups)
– Total disk I/O is: 500 + 50000*3 = 150,500 pages

Plan using Unclustered Secondary Index

A plan suitable if we add an unclustered secondary index on Car.pod, and an
unclustered secondary index on Trip.cid:

PROJECTION (tid,cid)
  INDEX NESTED JOIN
    INDEX SCAN ON Car.pod (pod=5)
    INDEX SCAN ON Trip.cid
Selection and projection in memory (pipelined)

Cost estimate
– Assume index on Car.pod has 2 levels, and index on Trip.cid has 3 levels
– Cost to use index to find rows with Car.pod = 5
  – Read one page per level, then fetch the data records pointed to
  – There are 2 index levels (incl root), and there will be 4 records matching pod with search value 5
  – Cost for the lookup is 2+4 = 6 pages read
– Cost to use index to find rows of Trip with given cid
  – Read one page per level (3 levels incl root), then fetch the data records pointed to
  – There will be 50000/10000 = 5 records with a given value of Trip.cid
  – Cost of a lookup is 3 + 5 = 8 pages
– Total disk I/O is: 6 + 4*8 = 38 pages
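
The four plan costs from this exercise can be put side by side. The sketch below only replays the slides' arithmetic (one page read per index level plus one page per fetched record for unclustered lookups); the variable names and the dictionary layout are assumptions for illustration.

```python
# Statistics from the exercise.
CAR_ROWS, TRIP_ROWS = 10_000, 50_000
CAR_PAGES, TRIP_PAGES = 125, 500
POD_VALUES, TRIP_CID_VALUES = 2_500, 10_000

matching_cars = CAR_ROWS // POD_VALUES         # cars with pod = 5
trips_per_car = TRIP_ROWS // TRIP_CID_VALUES   # trips per Trip.cid value

# Unclustered lookups: index levels plus one page per matching record.
car_pod_lookup = 2 + matching_cars             # 2-level index on Car.pod
trip_cid_lookup = 3 + trips_per_car            # 3-level index on Trip.cid

costs = {
    "no indexes": CAR_PAGES + CAR_PAGES * TRIP_PAGES,
    "pushed-down selection": CAR_PAGES + matching_cars * TRIP_PAGES,
    # 2 index levels + 1 data page per lookup, one lookup per Trip row.
    "clustered primary index": TRIP_PAGES + TRIP_ROWS * (2 + 1),
    "unclustered secondary indexes": car_pod_lookup + matching_cars * trip_cid_lookup,
}

for name, cost in sorted(costs.items(), key=lambda item: item[1]):
    print(f"{name}: {cost} page I/Os")

best_plan = min(costs, key=costs.get)
```

Under these estimates the plan with the two secondary indexes is cheapest by a wide margin, and the clustered primary index plan is the most expensive because it performs one lookup per Trip row.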


FYI: Nested Queries
SELECT S.name
FROM Student S
WHERE EXISTS
(SELECT *
FROM Enrolled E
WHERE E.uos='DATA3404' AND
E.sid=S.sid)

– Nested block is optimized independently, with the outer tuple
considered as providing a selection condition.
Nested block to optimize:
SELECT *
FROM Enrolled E
WHERE E.uos='DATA3404' AND
E.sid=outer_value

– Outer block is optimized with the cost of 'calling' nested block
computation taken into account.

– Implicit ordering of these blocks means that some good strategies
are not considered.
Equivalent non-nested query:
SELECT S.name
FROM Student S, Enrolled E
WHERE S.sid=E.sid
AND E.uos='DATA3404'
The non-nested version of the query is typically optimized better.
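
The nested and the non-nested form can be compared on a small in-memory database. This sketch uses Python's built-in sqlite3 module with made-up rows; it only shows that both queries return the same names here and says nothing about how any particular optimizer treats them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Student (sid INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Enrolled (sid INTEGER, uos TEXT);
INSERT INTO Student VALUES (101, 'Alice'), (102, 'Bob'), (103, 'Clare');
INSERT INTO Enrolled VALUES
    (101, 'DATA3404'), (102, 'COMP5338'), (103, 'COMP5338'), (103, 'DATA3404');
""")

nested = conn.execute("""
    SELECT S.name
    FROM Student S
    WHERE EXISTS (SELECT *
                  FROM Enrolled E
                  WHERE E.uos = 'DATA3404' AND E.sid = S.sid)
""").fetchall()

flat = conn.execute("""
    SELECT S.name
    FROM Student S, Enrolled E
    WHERE S.sid = E.sid AND E.uos = 'DATA3404'
""").fetchall()

print(sorted(nested), sorted(flat))
conn.close()
```

The two forms agree as long as each student has at most one matching Enrolled row; with duplicate (sid, uos) pairs the join form would return the name once per matching row.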

FYI: Optimising Dynamic Queries


– Common: Dynamic queries with bind parameters:
PreparedStatement stmt = connection.prepareStatement("SELECT * FROM R WHERE R.A=?");
stmt.setInt(1,4711);
ResultSet rs = stmt.executeQuery();

– More secure than building the SQL string from user input (guards against SQL injection)


– Commonly advised as this would be faster (because parsed only once)

– Questions / Caveats:
– How to optimise parameterized queries?
• Just take a ‘typical’ value for placeholders? Which value is ‘typical’?
• E.g. Oracle: Optimizer peeks into the actual bind values, then optimises.
Re-uses this plan even if cursor uses the query with different bind values
– How to cache/re-use these queries?
• If re-issued a query with a different bind value, shall we still re-use plans?
• E.g. Oracle: Does not share plans with bind values!
Key Concepts
– Build on topics from past two weeks:
  – Index choice
  – Expression trees
  – Physical operations
  – Access Paths
  – Estimating I/O cost for all the above

– RA Expression Equivalence
  – Should be able to translate between expression trees using RA equivalence rules

– Execution Plans
  – Should be able to annotate an expression tree with appropriate physical operations
  – Should be able to identify plans that involve indexes, and propose suitable
  indexes for these plans
  – Should be able to compare plans based upon I/O cost

Next Lecture (after the Easter break)


Next Week: Easter break – no lectures
Then:
– Distributed Data Management
– Data Partitioning & Data Sharing
– Data Replication
– Distributed Query/Join Processing

– Textbooks
– Ramakrishnan/Gehrke: Chapter 22
– Kifer/Bernstein/Lewis: Chapter 24
