0% found this document useful (0 votes)
16 views55 pages

Lecture09 Optimization

This document discusses query optimization and cardinality estimation. It explains the goals of query optimization, the major components including cardinality and cost estimation, search space, and plan enumeration algorithms. It also covers selectivity factors, assumptions made in cardinality estimation, and computing the cost of query plans.

Uploaded by

deponly
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
16 views55 pages

Lecture09 Optimization

This document discusses query optimization and cardinality estimation. It explains the goals of query optimization, the major components including cardinality and cost estimation, search space, and plan enumeration algorithms. It also covers selectivity factors, assumptions made in cardinality estimation, and computing the cost of query plans.

Uploaded by

deponly
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 55

CSE544

Data Management
Lectures 9
Query Optimization (Part 1)

CSEP 544 - Spring 2021 1


Announcements
• HW2 is due tomorrow!
• HW3 will be posted on Wednesday

• Review 5 (How good?) due Wednesday

• Mini-project guidelines posted

CSEP 544 - Spring 2021 2


Query Optimization Motivation
SQL query
Declarative query
Recall physical and
Parse & Rewrite Query logical data independence

Logical
Query Select Logical Plan
plan
optimization
Select Physical Plan
Physical
plan
Query Execution

Disk 3
Query Optimization
Goal:

• Given a query plan, find a cheaper


(cheapest?) equivalent plan

• Why difficult:
– Need to explore a large number of plans
– Need to estimate the cost of each plan
4
Query Optimization
Three major components:

1. Cardinality and cost estimation

2. Search space

3. Plan enumeration algorithms

CSEP 544 - Spring 2021 5


Query Optimization
Three major components:

1. Cardinality and cost estimation

2. Search space

3. Plan enumeration algorithms

CSEP 544 - Spring 2021 6


Cardinality Estimation
Problem: given statistics on base tables
and a query, estimate size of the answer

Very difficult, because:


• Need to do it very fast
• Need to use very little memory

CSEP 544 - Spring 2021 7


Statistics on Base Data
• Number of tuples (cardinality) T(R)
• Number of physical pages B(R)
• Indexes, number of keys in the index V(R,a)
• Histogram on single attribute (1d)
• Histogram on two attributes (2d)

Computed periodically, often using sampling

CSEP 544 - Spring 2021 8


Assumptions
• Uniformity

• Independence

• Containment of values

• Preservation of values
CSEP 544 - Spring 2021 9
Size Estimation

Selection: size decreases by selectivity factor θ

T(σpred(R)) = θpred * T(R)

CSEP 544 - Spring 2021 10


Size Estimation

Selection: size decreases by selectivity factor θ

T(σpred(R)) = θpred * T(R)


T(R ⨝A=B S) = θA=B * T(R) * T(S)

CSEP 544 - Spring 2021 11


T(σpred(R)) = θpred * T(R)

Selectivity Factors
Uniformity assumption σA=c(R)
Equality:

12
T(σpred(R)) = θpred * T(R)

Selectivity Factors
Uniformity assumption σA=c(R)
Equality:
• θA=c = 1/V(R,A)

13
T(σpred(R)) = θpred * T(R)

Selectivity Factors
Uniformity assumption σA=c(R)
Equality:
• θA=c = 1/V(R,A)
σc1<A<c2(R)
Range:
• θc1<A<c2 = (c2 – c1)/(max(R,A) - min(R,A))

14
T(σpred(R)) = θpred * T(R)

Selectivity Factors
Uniformity assumption σA=c(R)
Equality:
• θA=c = 1/V(R,A)
σc1<A<c2(R)
Range:
• θc1<A<c2 = (c2 – c1)/(max(R,A) - min(R,A))

Conjunction σA=c and B=d(R)


T(σpred(R)) = θpred * T(R)

Selectivity Factors
Uniformity assumption σA=c(R)
Equality:
• θA=c = 1/V(R,A)
σc1<A<c2(R)
Range:
• θc1<A<c2 = (c2 – c1)/(max(R,A) - min(R,A))

Conjunction σA=c and B=d(R)


Independence assumption
• θpred1 and pred2 = θpred1*θpred2 = 1/V(R,A) * 1/V(R,B)
T(R ⨝A=B S) = θA=B * T(R) * T(S)

Selectivity Factors
R ⋈R.A=S.B S

Join

CSEP 544 - Spring 2021 17


T(R ⨝A=B S) = θA=B * T(R) * T(S)

Selectivity Factors
R ⋈R.A=S.B S

Join
• θ R.A=S.B = 1/ ( MAX( V(R,A), V(S,B))

Why? Will explain next...

CSEP 544 - Spring 2021 18


T(R ⨝A=B S) = θA=B * T(R) * T(S)

Selectivity Factors
R ⋈R.A=S.B S
Containment of values: if V(R,A) ≤ V(S,B), then
the set of A values of R is included in the set of
B values of S

• Note: this indeed holds when A is a foreign


key in R, and B is a key in S

CSEP 544 - Spring 2021 19


T(R ⨝A=B S) = θA=B * T(R) * T(S)

Selectivity Factors
R ⋈R.A=S.B S
Assume V(R,A) ≤ V(S,B)
• Tuple t in R joins with T(S)/V(S,B) tuples in S

CSEP 544 - Spring 2021 20


T(R ⨝A=B S) = θA=B * T(R) * T(S)

Selectivity Factors
R ⋈R.A=S.B S
Assume V(R,A) ≤ V(S,B)
• Tuple t in R joins with T(S)/V(S,B) tuples in S
• Hence T(R ⨝A=B S) = T(R) T(S) / V(S,B)

CSEP 544 - Spring 2021 21


T(R ⨝A=B S) = θA=B * T(R) * T(S)

Selectivity Factors
R ⋈R.A=S.B S
Assume V(R,A) ≤ V(S,B)
• Tuple t in R joins with T(S)/V(S,B) tuples in S
• Hence T(R ⨝A=B S) = T(R) T(S) / V(S,B)

In general:
• T(R ⨝A=B S) = T(R) T(S) / max(V(R,A),V(S,B))
• θ R.A=S.B = 1/ ( max( V(R,A), V(S,B))

CSEP 544 - Spring 2021 22


Final Assumption
Preservation of values:
For any other attribute C:
• V(R ⨝A=B S, C) = V(R, C) or
• V(R ⨝A=B S, C) = V(S, C)

• This is needed higher up in the plan

CSEP 544 - Spring 2021 23


Computing the Cost of a Plan

• Estimate cardinalities bottom-up

• Estimate cost by using estimated cardinalities

• Examples next...

CSEP 544 - Spring 2021 24


Supplier(sid, sname, scity, sstate)
Supply(sid, pno, quantity)

Logical Query Plan 1


πsname
SELECT sname
FROM Supplier x, Supply y
WHERE x.sid = y.sid
and y.pno = 2
σpno=2∧scity=‘Seattle’∧sstate=‘WA’ and x.scity = ‘Seattle’
and x.sstate = ‘WA’

sid = sid

Supply Supplier
T(Supplier) = 1000
T(Supply) = 10000 B(Supplier) = 100
B(Supply) = 100 V(Supplier, scity) = 20
V(Supplier, state) = 10
M=11
V(Supply, pno) = 2500
Supplier(sid, sname, scity, sstate)
Supply(sid, pno, quantity)

Logical Query Plan 1


πsname
SELECT sname
Estimated FROM Supplier x, Supply y
(why?) WHERE x.sid = y.sid
and y.pno = 2
σpno=2∧scity=‘Seattle’∧sstate=‘WA’ and x.scity = ‘Seattle’
and x.sstate = ‘WA’
T = 10000

sid = sid

Supply Supplier
T(Supplier) = 1000
T(Supply) = 10000 B(Supplier) = 100
B(Supply) = 100 V(Supplier, scity) = 20
V(Supplier, state) = 10
M=11
V(Supply, pno) = 2500
Supplier(sid, sname, scity, sstate)
Supply(sid, pno, quantity)

Logical Query Plan 1


πsname
SELECT sname
Estimated FROM Supplier x, Supply y
(why?) WHERE x.sid = y.sid
and y.pno = 2
σpno=2∧scity=‘Seattle’∧sstate=‘WA’ and x.scity = ‘Seattle’
and x.sstate = ‘WA’
T = 10000

Because key / foreign-key


sid = sid

Supply Supplier
T(Supplier) = 1000
T(Supply) = 10000 B(Supplier) = 100
B(Supply) = 100 V(Supplier, scity) = 20
V(Supplier, state) = 10
M=11
V(Supply, pno) = 2500
Supplier(sid, sname, scity, sstate)
Supply(sid, pno, quantity)

Logical Query Plan 1


πsname
SELECT sname
Estimated FROM Supplier x, Supply y
(why?) WHERE x.sid = y.sid
and y.pno = 2
σpno=2∧scity=‘Seattle’∧sstate=‘WA’ and x.scity = ‘Seattle’
and x.sstate = ‘WA’
T = 10000

Because key / foreign-key


sid = sid
Also: θ = 1/max(V(Supply,sid)*V(Supplier,sid) = 1/V(Supplier,sid)

Supply Supplier
T(Supplier) = 1000
T(Supply) = 10000 B(Supplier) = 100
B(Supply) = 100 V(Supplier, scity) = 20
V(Supplier, state) = 10
M=11
V(Supply, pno) = 2500
Supplier(sid, sname, scity, sstate)
Supply(sid, pno, quantity)

Estimated
(why?) Logical Query Plan 1
πsname
SELECT sname
T <1 FROM Supplier x, Supply y
WHERE x.sid = y.sid
and y.pno = 2
σpno=2∧scity=‘Seattle’∧sstate=‘WA’ and x.scity = ‘Seattle’
and x.sstate = ‘WA’
T = 10000

sid = sid

Supply Supplier
T(Supplier) = 1000
T(Supply) = 10000 B(Supplier) = 100
B(Supply) = 100 V(Supplier, scity) = 20
V(Supplier, state) = 10
M=11
V(Supply, pno) = 2500
Supplier(sid, sname, scity, sstate)
Supply(sid, pno, quantity)

Logical Query Plan 2


πsname
SELECT sname
FROM Supplier x, Supply y
WHERE x.sid = y.sid
and y.pno = 2
and x.scity = ‘Seattle’
and x.sstate = ‘WA’

sid = sid

σpno=2 σscity=‘Seattle’∧sstate=‘WA’

Supply Supplier
T(Supplier) = 1000
T(Supply) = 10000 B(Supplier) = 100
B(Supply) = 100 V(Supplier, scity) = 20
V(Supplier, state) = 10
M=11
V(Supply, pno) = 2500
Supplier(sid, sname, scity, sstate)
Supply(sid, pno, quantity)

Logical Query Plan 2


πsname
SELECT sname
FROM Supplier x, Supply y
WHERE x.sid = y.sid
and y.pno = 2
and x.scity = ‘Seattle’
and x.sstate = ‘WA’

T=4 sid = sid T= 5

σpno=2 σscity=‘Seattle’∧sstate=‘WA’

Supply Supplier
T(Supplier) = 1000
T(Supply) = 10000 B(Supplier) = 100
B(Supply) = 100 V(Supplier, scity) = 20
V(Supplier, state) = 10
M=11
V(Supply, pno) = 2500
Supplier(sid, sname, scity, sstate)
Supply(sid, pno, quantity)

Logical Query Plan 2


πsname
SELECT sname
FROM Supplier x, Supply y
WHERE x.sid = y.sid
and y.pno = 2
and x.scity = ‘Seattle’
and x.sstate = ‘WA’

T=4 sid = sid T= 5


Very wrong!
Why?
σpno=2 σscity=‘Seattle’∧sstate=‘WA’

Supply Supplier
T(Supplier) = 1000
T(Supply) = 10000 B(Supplier) = 100
B(Supply) = 100 V(Supplier, scity) = 20
V(Supplier, state) = 10
M=11
V(Supply, pno) = 2500
Supplier(sid, sname, scity, sstate)
Supply(sid, pno, quantity)

Logical Query Plan 2


πsname
SELECT sname
FROM Supplier x, Supply y
T=4 WHERE x.sid = y.sid
and y.pno = 2
and x.scity = ‘Seattle’
and x.sstate = ‘WA’

T=4 sid = sid T= 5


Very wrong!
Why?
σpno=2 σscity=‘Seattle’∧sstate=‘WA’

Supply Supplier
T(Supplier) = 1000
T(Supply) = 10000 B(Supplier) = 100
B(Supply) = 100 V(Supplier, scity) = 20
V(Supplier, state) = 10
M=11
V(Supply, pno) = 2500
Supplier(sid, sname, scity, sstate)
Supply(sid, pno, quantity)

Logical Query Plan 2


πsname
Different
estimate L SELECT sname
FROM Supplier x, Supply y
T=4 WHERE x.sid = y.sid
and y.pno = 2
and x.scity = ‘Seattle’
and x.sstate = ‘WA’

T=4 sid = sid T= 5


Very wrong!
Why?
σpno=2 σscity=‘Seattle’∧sstate=‘WA’

Supply Supplier
T(Supplier) = 1000
T(Supply) = 10000 B(Supplier) = 100
B(Supply) = 100 V(Supplier, scity) = 20
V(Supplier, state) = 10
M=11
V(Supply, pno) = 2500
Supplier(sid, sname, scity, sstate)
Supply(sid, pno, quantity)

Physical Plan 1
πsname
T <1

σpno=2∧scity=‘Seattle’∧sstate=‘WA’
T = 10000
Total cost: 100/10 * 100 = 1000

sid = sid
Block nested loop join

Scan
Supply Scan
Supplier
T(Supplier) = 1000
T(Supply) = 10000 B(Supplier) = 100
B(Supply) = 100 V(Supplier, scity) = 20
V(Supplier, state) = 10
M=11
V(Supply, pno) = 2500
Supplier(sid, sname, scity, sstate)
Supply(sid, pno, quantity)

Physical Plan 1
πsname
T <1

σpno=2∧scity=‘Seattle’∧sstate=‘WA’
T = 10000
Total cost: 100+100*100/10 = 1100

sid = sid
Block nested loop join

Scan
Supply Scan
Supplier
T(Supplier) = 1000
T(Supply) = 10000 B(Supplier) = 100
B(Supply) = 100 V(Supplier, scity) = 20
V(Supplier, state) = 10
M=11
V(Supply, pno) = 2500
Supplier(sid, sname, scity, sstate)
Supply(sid, pno, quantity)

Physical Plan 2
πsname
Cost of Supply(pno) = 4
T=4 Cost of Supplier(scity) = 50
Total cost: 54

T= 5
T=4 sid = sid
Main memory join
σsstate=‘WA’
Unclustered σpno=2 T= 50
index lookup σscity=‘Seattle’ Unclustered
Supply(pno) index lookup
Supply Supplier Supplier(scity)

T(Supplier) = 1000
T(Supply) = 10000 B(Supplier) = 100
B(Supply) = 100 V(Supplier, scity) = 20
V(Supplier, state) = 10
M=11
V(Supply, pno) = 2500
Supplier(sid, sname, scity, sstate)
Supply(sid, pno, quantity)

Physical Plan 2
πsname
Cost of Supply(pno) = 4
T=4 Cost of Supplier(scity) = 50
Total cost: 54

T= 5
T=4 sid = sid
Main memory join
σsstate=‘WA’
Unclustered σpno=2 T= 50
index lookup σscity=‘Seattle’ Unclustered
Supply(pno) index lookup
Supply Supplier Supplier(scity)

T(Supplier) = 1000
T(Supply) = 10000 B(Supplier) = 100
B(Supply) = 100 V(Supplier, scity) = 20
V(Supplier, state) = 10
M=11
V(Supply, pno) = 2500
Supplier(sid, sname, scity, sstate)
Supply(sid, pno, quantity)

Physical Plan 2
πsname
Cost of Supply(pno) = 4
T=4 Cost of Supplier(scity) = 50
Total cost: 54

T= 5
T=4 sid = sid
Main memory join
σsstate=‘WA’
Unclustered σpno=2 T= 50
index lookup σscity=‘Seattle’ Unclustered
Supply(pno) index lookup
Supply Supplier Supplier(scity)

T(Supplier) = 1000
T(Supply) = 10000 B(Supplier) = 100
B(Supply) = 100 V(Supplier, scity) = 20
V(Supplier, state) = 10
M=11
V(Supply, pno) = 2500
Supplier(sid, sname, scity, sstate)
Supply(sid, pno, quantity)

Physical Plan 3
πsname
T=4

σscity=‘Seattle’∧sstate=‘WA’ Cost of Supply(pno) = 4


Cost of Index join = 4
Total cost: 8

T=4 sid = sid


Clustered
Index join
Unclustered σpno=2
index lookup
Supply(pno)
Supply Supplier
T(Supplier) = 1000
T(Supply) = 10000 B(Supplier) = 100
B(Supply) = 100 V(Supplier, scity) = 20
V(Supplier, state) = 10
M=11
V(Supply, pno) = 2500
Supplier(sid, sname, scity, sstate)
Supply(sid, pno, quantity)

Physical Plan 3
πsname
T=4

σscity=‘Seattle’∧sstate=‘WA’ Cost of Supply(pno) = 4


Cost of Index join = 4
Total cost: 8

T=4 sid = sid


Clustered
Index join
Unclustered σpno=2
index lookup
Supply(pno)
Supply Supplier
T(Supplier) = 1000
T(Supply) = 10000 B(Supplier) = 100
B(Supply) = 100 V(Supplier, scity) = 20
V(Supplier, state) = 10
M=11
V(Supply, pno) = 2500
Supplier(sid, sname, scity, sstate)
Supply(sid, pno, quantity)

Physical Plan 3
πsname
T=4

σscity=‘Seattle’∧sstate=‘WA’ Cost of Supply(pno) = 4


Cost of Index join = 4
Total cost: 8

T=4 sid = sid


Clustered
Index join
Unclustered σpno=2
index lookup
Supply(pno)
Supply Supplier
T(Supplier) = 1000
T(Supply) = 10000 B(Supplier) = 100
B(Supply) = 100 V(Supplier, scity) = 20
V(Supplier, state) = 10
M=11
V(Supply, pno) = 2500
Discussion
• We considered only IO cost;
real systems need to consider IO+CPU

• Each system has its own hacks

• We assumed that all index pages were in


memory: sometimes we need to add the cost
of fetching index pages from disk

CSEP 544 - Spring 2021 43


Histograms
• T(R), V(R,A) too coarse
• Histogram: separate stats per bucket

• In each bucket store:


– T(bucket)
– V(bucket,A) – optional

CSEP 544 - Spring 2021 44


Histograms
Employee(ssn, name, age)
T(Employee) = 25000, V(Empolyee, age) = 50
σage=48(Empolyee) = ?
Histograms
Employee(ssn, name, age)
T(Employee) = 25000, V(Empolyee, age) = 50
σage=48(Empolyee) = ?
Estimate: T(Employee) / V(Employee,age) = 500
Histograms
Employee(ssn, name, age)
T(Employee) = 25000, V(Empolyee, age) = 50
σage=48(Empolyee) = ?
Estimate: T(Employee) / V(Employee,age) = 500

Age: 0..20 20..29 30-39 40-49 50-59 > 60

T= 200 800 5000 12000 6500 500


Histograms
Employee(ssn, name, age)
T(Employee) = 25000, V(Empolyee, age) = 50
σage=48(Empolyee) = ?
Estimate: T(Employee) / V(Employee,age) = 500

Age: 0..20 20..29 30-39 40-49 50-59 > 60

T= 200 800 5000 12000 6500 500

Assume V = 10
Histograms
Employee(ssn, name, age)
T(Employee) = 25000, V(Empolyee, age) = 50
σage=48(Empolyee) = ?
Estimate: T(Employee) / V(Employee,age) = 500

Age: 0..20 20..29 30-39 40-49 50-59 > 60

T= 200 800 5000 12000 6500 500

Estimate: 12000/10 = 1200 Assume V = 10


Histograms
Employee(ssn, name, age)
T(Employee) = 25000, V(Empolyee, age) = 50
σage=48(Empolyee) = ?
Estimate: T(Employee) / V(Employee,age) = 500

Age: 0..20 20..29 30-39 40-49 50-59 > 60

T= 200 800 5000 12000 6500 500

V= 3 10 7 6 5 4

Estimate: 12000/10 = 1200


Histograms
Employee(ssn, name, age)
T(Employee) = 25000, V(Empolyee, age) = 50
σage=48(Empolyee) = ?
Estimate: T(Employee) / V(Employee,age) = 500

Age: 0..20 20..29 30-39 40-49 50-59 > 60

T= 200 800 5000 12000 6500 500

V= 3 10 7 6 5 4

Estimate: 12000/10 = 1200 12000/6 = 2000


Types of Histograms
• Eq-Width

• Eq-Depth

• Compressed: store outliers separately

• V-Optimal histograms
CSEP 544 - Spring 2021 52
Employee(ssn, name, age)
Histograms
Eq-width:

Age: 0..20 20..29 30-39 40-49 50-59 > 60

Tuples 200 800 5000 12000 6500 500

Eq-depth:

Age: 0..32 33..41 42-46 47-52 53-58 > 60

Tuples 1800 2000 2100 2200 1900 1800

Compressed: store separately highly frequent values: (48,1900)


V-Optimal Histograms
• Error:
0
# 𝜎*,! 𝑅 − 𝑒𝑠𝑡-'./ 𝜎*,! 𝑅
!∈#$%&'((*)

• Bucket boundaries = argminHist(Error)


• Dynamic programming
• Modern databases systems use V-
optimal histograms or some variations
CSEP 544 - Spring 2021 54
Discussion
• Cardinality estimation = still unsolved
• Histograms:
– Small number of buckets (why?)
– Updated only periodically (why?)
– No 2d histograms (except db2) why?
• Samples:
– Fail for low selectivity estimates, joins
• Cross-join correlation – still unsolved
55

You might also like