Query Optimization
Distributed Query Processing Methodology
[Figure: processing layers, top to bottom: Calculus Query on Distributed Relations → Query Decomposition (uses GLOBAL SCHEMA) → Fragment Query → Global Optimization (uses STATS ON FRAGMENTS) → Local Optimization at the LOCAL SITES (uses LOCAL SCHEMAS) → Optimized Local Queries]
Step 3 – Global Query Optimization
Cost-Based Optimization
- Solution space
  - The set of equivalent algebra expressions (query trees).
Query Optimization Process
[Figure: Input Query → Equivalent QEPs → Best QEP]
Search Space
- The search space is characterized by alternative execution plans
- Focus on join trees
- For N relations, there are O(N!) equivalent join trees that can be obtained by applying commutativity and associativity rules

SELECT ENAME, RESP
FROM EMP, ASG, PROJ
WHERE EMP.ENO=ASG.ENO
AND ASG.PNO=PROJ.PNO

[Figure: three equivalent join trees for this query: (EMP ⋈ENO ASG) ⋈PNO PROJ; (PROJ ⋈PNO ASG) ⋈ENO EMP; (PROJ × EMP) ⋈ENO,PNO ASG]
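To make the O(N!) growth concrete, the following minimal sketch enumerates every left-deep join order for the three relations of the example query. The function name `left_deep_orders` is hypothetical, and the enumeration deliberately includes orders that start with a Cartesian product (such as PROJ followed by EMP), which a real optimizer would usually try to avoid.

```python
from itertools import permutations
from math import factorial

def left_deep_orders(relations):
    """Yield every left-deep join order as a parenthesized string."""
    for order in permutations(relations):
        expr = order[0]
        for rel in order[1:]:
            expr = f"({expr} ⋈ {rel})"
        yield expr

# Relations of the example query; N = 3 gives 3! = 6 left-deep orders.
orders = list(left_deep_orders(["EMP", "ASG", "PROJ"]))
for expr in orders:
    print(expr)
print(len(orders), "orders; N! =", factorial(3))
```

Bushy trees are not counted here, so the full space of equivalent join trees is larger still.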
CS742 – Distributed & Parallel DBMS M. Tamer Özsu Page 4. 6
Search Space
- Restrict by means of heuristics
  - Perform unary operations before binary operations
  - …
- Restrict the shape of the join tree
  - Consider only linear trees, ignore bushy ones
[Figure: a linear join tree ((R1 ⋈ R2) ⋈ R3) ⋈ R4 next to a bushy join tree (R1 ⋈ R2) ⋈ (R3 ⋈ R4)]
Search Strategy
- How to "move" in the search space.
- Deterministic
  - Start from base relations and build plans by adding one relation at each step
  - Dynamic programming: breadth-first
  - Greedy: depth-first
- Randomized
  - Search for optimalities around a particular starting point
  - Trade optimization time for execution time
  - Better when > 10 relations
  - Simulated annealing
  - Iterative improvement
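The two deterministic strategies can be contrasted with a small sketch. Everything below is an illustrative assumption rather than the method of any particular system: the cardinalities and join selectivity factors are invented, the cost of a plan is taken to be the sum of its intermediate result cardinalities, and only left-deep (linear) trees are considered.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical statistics, chosen only for illustration.
CARD = {"EMP": 400, "ASG": 1000, "PROJ": 100}
SF = {  # join selectivity factors for pairs linked by a join predicate
    frozenset(("EMP", "ASG")): Fraction(1, 400),
    frozenset(("ASG", "PROJ")): Fraction(1, 100),
}

def result_card(rels):
    """Estimated cardinality of joining all relations in rels."""
    card = Fraction(1)
    for r in rels:
        card *= CARD[r]
    for a, b in combinations(sorted(rels), 2):
        card *= SF.get(frozenset((a, b)), 1)  # 1 means Cartesian product
    return card

def dp_left_deep(relations):
    """Breadth-first: best plan for every subset of size 1, then 2, ..."""
    best = {frozenset((r,)): (Fraction(0), (r,)) for r in relations}
    for size in range(2, len(relations) + 1):
        for subset in map(frozenset, combinations(relations, size)):
            candidates = []
            for last in subset:
                cost, order = best[subset - {last}]
                candidates.append((cost + result_card(subset), order + (last,)))
            best[subset] = min(candidates)
    return best[frozenset(relations)]

def greedy_left_deep(relations):
    """Depth-first: extend one plan with the locally cheapest relation."""
    start = min(relations, key=lambda r: CARD[r])
    order, cost = (start,), Fraction(0)
    remaining = set(relations) - {start}
    while remaining:
        nxt = min(sorted(remaining), key=lambda r: result_card(order + (r,)))
        order += (nxt,)
        cost += result_card(order)
        remaining.remove(nxt)
    return cost, order

dp_cost, dp_order = dp_left_deep(["EMP", "ASG", "PROJ"])
greedy_cost, greedy_order = greedy_left_deep(["EMP", "ASG", "PROJ"])
print(dp_order, dp_cost)
print(greedy_order, greedy_cost)
```

Dynamic programming keeps the best plan for every subset and is therefore exhaustive within the restricted space, whereas the greedy variant follows a single path and can miss the optimum on other inputs.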
Search Strategies
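- Deterministic
  [Figure: plans built one relation at a time: R1 ⋈ R2, then (R1 ⋈ R2) ⋈ R3, then ((R1 ⋈ R2) ⋈ R3) ⋈ R4]
- Randomized
  [Figure: a plan transformed into a neighboring plan: (R1 ⋈ R2) ⋈ R3 becomes (R1 ⋈ R3) ⋈ R2]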
Cost Functions
- Total Time (or Total Cost)
  - Reduce each cost (in terms of time) component individually
  - Do as little of each cost component as possible
  - Optimizes the utilization of the resources
Total Cost
Response Time
Example
[Figure: Site 1 sends x units to Site 3; Site 2 sends y units to Site 3]
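The difference between total cost and response time for this example can be sketched as follows. The cost model is an assumption, since the formulas are not reproduced in the text above: each transfer is charged a fixed per-message cost `t_msg` plus a per-unit transfer cost `t_tr`, local processing is ignored, and the two transfers to Site 3 are assumed to run in parallel. All names and numbers are hypothetical.

```python
def total_time(t_msg, t_tr, transfers):
    """Sum of all communication costs, regardless of parallelism."""
    return sum(t_msg + t_tr * units for units in transfers)

def response_time(t_msg, t_tr, transfers):
    """Elapsed time when all transfers proceed in parallel."""
    return max(t_msg + t_tr * units for units in transfers)

# Hypothetical values: x = 100 units from Site 1, y = 250 units from Site 2.
x, y = 100, 250
print(total_time(5, 1, [x, y]))     # 360
print(response_time(5, 1, [x, y]))  # 255
```

Minimizing total time favors doing less work overall, while minimizing response time favors spreading the work across sites.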
Optimization Statistics
- Primary cost factor: size of intermediate relations
  - Need to estimate their sizes
Statistics
Intermediate Relation Sizes
Selection
size(R) = card(R) ⋅ length(R)
card(σF(R)) = SFσ(F) ⋅ card(R)
where
SFσ(A = value) = 1 / card(ΠA(R))
SFσ(A > value) = (max(A) – value) / (max(A) – min(A))
SFσ(A < value) = (value – min(A)) / (max(A) – min(A))
SFσ(p(Ai) ∧ p(Aj)) = SFσ(p(Ai)) ⋅ SFσ(p(Aj))
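A minimal sketch of these selection estimates, assuming uniformly distributed attribute values and independent predicates. The function names and the sample statistics are hypothetical.

```python
def sf_equal(distinct_values):
    """SF for A = value: one out of card(Π_A(R)) distinct values."""
    return 1 / distinct_values

def sf_greater(value, min_a, max_a):
    """SF for A > value."""
    return (max_a - value) / (max_a - min_a)

def sf_less(value, min_a, max_a):
    """SF for A < value."""
    return (value - min_a) / (max_a - min_a)

def card_selection(card_r, *factors):
    """card(σ_F(R)) for a conjunction of predicates with the given SFs."""
    estimate = card_r
    for sf in factors:
        estimate *= sf
    return estimate

# Hypothetical relation: 10000 tuples, 50 distinct values of A, B in [0, 100].
estimate = card_selection(10000, sf_equal(50), sf_greater(75, 0, 100))
print(estimate)  # about 50 tuples satisfy A = value AND B > 75
```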
Projection
card(ΠA(R))=card(R)
Cartesian Product
card(R × S) = card(R) * card(S)
Union
upper bound: card(R ∪ S) = card(R) + card(S)
lower bound: card(R ∪ S) = max{card(R), card(S)}
Set Difference
upper bound: card(R–S) = card(R)
lower bound: 0
Intermediate Relation Size
Join
  - Special case: A is a key of R and B is a foreign key of S
    card(R ⋈A=B S) = card(S)
  - More general:
    card(R ⋈ S) = SF⋈ ⋅ card(R) ⋅ card(S)
Semijoin
card(R ⋉A S) = SF⋉(S.A) ⋅ card(R)
where
SF⋉(R ⋉A S) = SF⋉(S.A) = card(ΠA(S)) / card(dom[A])
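The join and semijoin estimates translate directly into code. This is a minimal sketch with hypothetical function names and invented statistics.

```python
def card_join_key_fk(card_s):
    """A is a key of R and B a foreign key of S: each S tuple matches once."""
    return card_s

def card_join(sf_join, card_r, card_s):
    """General case: card(R ⋈ S) = SF_join * card(R) * card(S)."""
    return sf_join * card_r * card_s

def sf_semijoin(distinct_a_in_s, domain_size):
    """SF of R ⋉_A S: fraction of dom[A] that appears in S.A."""
    return distinct_a_in_s / domain_size

def card_semijoin(card_r, distinct_a_in_s, domain_size):
    """card(R ⋉_A S) = SF_semijoin(S.A) * card(R)."""
    return sf_semijoin(distinct_a_in_s, domain_size) * card_r

# Hypothetical numbers: R has 400 tuples, S.A covers 100 of 400 domain values.
print(card_join_key_fk(1000))         # 1000
print(card_semijoin(400, 100, 400))   # 100.0
```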
Histogram Example
Dynamic Algorithm
- Decompose each multi-variable query into a sequence of mono-variable queries with a common variable
- Process each by a one-variable query processor
  - Choose an initial execution plan (heuristics)
  - Order the rest by considering intermediate relation sizes
Dynamic Algorithm–Decomposition
- Replace an n-variable query q by a series of queries
  q1 → q2 → … → qn
  where qi uses the result of qi-1.
- Detachment
  - Query q is decomposed into q' → q" where q' and q" have a common variable which is the result of q'
- Tuple substitution
  - Replace one variable by the actual values of its tuples and simplify the query
    q(V1, V2, ..., Vn) → (q'(t1, V2, V3, ..., Vn), t1 ∈ R)
Detachment
⇓
q': SELECT V1.A1 INTO R1'
FROM R1 V1
WHERE P1(V1.A1)
Detachment Example
Names of employees working on CAD/CAM project
q1: SELECT EMP.ENAME
FROM EMP, ASG, PROJ
WHERE EMP.ENO=ASG.ENO
AND ASG.PNO=PROJ.PNO
AND PROJ.PNAME="CAD/CAM"
⇓
q11: SELECT PROJ.PNO INTO JVAR
FROM PROJ
WHERE PROJ.PNAME="CAD/CAM"
Detachment Example (cont’d)
q': SELECT EMP.ENAME
FROM EMP,ASG,JVAR
WHERE EMP.ENO=ASG.ENO
AND ASG.PNO=JVAR.PNO
⇓
q12: SELECT ASG.ENO INTO GVAR
FROM ASG,JVAR
WHERE ASG.PNO=JVAR.PNO
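The effect of the two detachments can be simulated on toy data. The tuples below are invented for illustration, and the final step (joining EMP with GVAR to obtain the names) is an assumption about how the sequence ends, since that last query does not appear in the text above.

```python
# Hypothetical sample tuples.
EMP = [
    {"ENO": "E1", "ENAME": "J. Doe"},
    {"ENO": "E2", "ENAME": "M. Smith"},
    {"ENO": "E3", "ENAME": "A. Lee"},
]
ASG = [
    {"ENO": "E1", "PNO": "P1"},
    {"ENO": "E2", "PNO": "P3"},
    {"ENO": "E3", "PNO": "P3"},
]
PROJ = [
    {"PNO": "P1", "PNAME": "Instrumentation"},
    {"PNO": "P3", "PNAME": "CAD/CAM"},
]

# q11: mono-variable query on PROJ, result stored in JVAR.
JVAR = [{"PNO": p["PNO"]} for p in PROJ if p["PNAME"] == "CAD/CAM"]

# q12: ASG restricted by JVAR, result stored in GVAR.
GVAR = [{"ENO": a["ENO"]} for a in ASG for j in JVAR if a["PNO"] == j["PNO"]]

# Remaining step (assumed): EMP restricted by GVAR yields the names.
names = [e["ENAME"] for e in EMP for g in GVAR if e["ENO"] == g["ENO"]]
print(names)
```

Each step touches a small intermediate relation instead of the full three-way join, which is the point of detachment.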
Tuple Substitution
Static Algorithm
- Execute joins
  - Determine the possible ordering of joins
Static Algorithm
For joins, two alternative algorithms:
- Nested loops
    for each tuple of external relation (cardinality n1)
      for each tuple of internal relation (cardinality n2)
        join two tuples if the join predicate is true
      end
    end
  - Complexity: n1 * n2
- Merge join
    sort relations
    merge relations
  - Complexity: n1 + n2 if relations are previously sorted and equijoin
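Both join methods can be written out as a minimal in-memory sketch. Relations are modeled as lists of dictionaries, the merge join assumes an equijoin on a key function shared by both inputs, and the function names and sample tuples are hypothetical.

```python
def nested_loop_join(outer, inner, predicate):
    """Compare every outer tuple with every inner tuple: n1 * n2 tests."""
    return [(o, i) for o in outer for i in inner if predicate(o, i)]

def merge_join(left, right, key):
    """Equijoin: sort both inputs on the key, then merge in one pass."""
    left = sorted(left, key=key)
    right = sorted(right, key=key)
    result, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        kl, kr = key(left[i]), key(right[j])
        if kl < kr:
            i += 1
        elif kl > kr:
            j += 1
        else:
            # Locate the run of right tuples sharing this key value.
            j_end = j
            while j_end < len(right) and key(right[j_end]) == kl:
                j_end += 1
            # Pair every left tuple with that key against the whole run.
            while i < len(left) and key(left[i]) == kl:
                result.extend((left[i], r) for r in right[j:j_end])
                i += 1
            j = j_end
    return result

emp = [{"ENO": 2, "ENAME": "B"}, {"ENO": 1, "ENAME": "A"}]
asg = [{"ENO": 1, "PNO": "P1"}, {"ENO": 2, "PNO": "P1"},
       {"ENO": 2, "PNO": "P2"}, {"ENO": 3, "PNO": "P9"}]

nl = nested_loop_join(emp, asg, lambda e, a: e["ENO"] == a["ENO"])
mj = merge_join(emp, asg, key=lambda t: t["ENO"])
print(len(nl), len(mj))
```

The sort inside `merge_join` is what the n1 + n2 complexity leaves out: that bound holds only when both inputs already arrive sorted on the join attribute.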
Static Algorithm – Example
[Figure: join graph with EMP connected to ASG on ENO and ASG connected to PROJ on PNO]
Example (cont’d)
- Choose the best access paths to each relation
  - EMP: sequential scan (no selection on EMP)
  - ASG: sequential scan (no selection on ASG)
  - PROJ: index on PNAME (there is a selection on PROJ based on PNAME)
- Determine the best join ordering
  - EMP ⋈ ASG ⋈ PROJ
  - ASG ⋈ PROJ ⋈ EMP
  - PROJ ⋈ ASG ⋈ EMP
  - ASG ⋈ EMP ⋈ PROJ
  - EMP × PROJ ⋈ ASG
  - PROJ × EMP ⋈ ASG
  - Select the best ordering based on the join costs evaluated according to the two methods
Static Algorithm
Alternatives
[Figure: search tree over the two-relation alternatives EMP ⋈ ASG, EMP × PROJ, ASG ⋈ EMP, ASG ⋈ PROJ, PROJ ⋈ ASG, and PROJ × EMP; four of the six are marked as pruned]
Static Algorithm
Hybrid optimization
- In general, static optimization is more efficient than dynamic optimization
  - Adopted by all commercial DBMS
- But even with a sophisticated cost model (with histograms), accurate cost prediction is difficult
- Example
  - Consider a parametric query with predicate
    WHERE R.A = $a /* $a is a parameter */
  - The only possible assumption at compile time is uniform distribution of values
- Solution: Hybrid optimization
  - Choose-plan done at runtime, based on the actual parameter binding
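The choose-plan idea can be sketched as a runtime decision between plans prepared at compile time. Everything here is a hypothetical illustration: the histogram, the two plan names, and the 10% selectivity threshold are assumptions, not values from the text.

```python
# Hypothetical value-frequency histogram for attribute R.A.
HISTOGRAM = {"A": 9000, "B": 50, "C": 950}
CARD_R = sum(HISTOGRAM.values())

def choose_plan(a_value, threshold=0.1):
    """Pick one of the precompiled plans from the actual binding of $a."""
    selectivity = HISTOGRAM.get(a_value, 0) / CARD_R
    # Many matching tuples: scanning is cheaper than many index probes.
    return "sequential_scan" if selectivity > threshold else "index_scan"

print(choose_plan("A"))  # sequential_scan: 90% of the tuples match
print(choose_plan("B"))  # index_scan: 0.5% of the tuples match
```

Under the compile-time uniformity assumption every value would get the same selectivity (one third here), so a single static plan would be chosen for both bindings.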
Hybrid Optimization
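Example
[Figure: alternative plans selected by a choose-plan operator according to the runtime binding of $a, e.g. $a=A]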
Join Ordering in Fragment Queries
  - System R*
  - Two-step
Join Ordering
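[Figure: two relations R and S at different sites; the smaller relation is shipped to the site of the larger one, e.g. S is sent to R's site if size(R) > size(S)]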
Join Ordering – Example
Consider
PROJ ⋈PNO ASG ⋈ENO EMP
[Figure: join graph with EMP at Site 1, ASG at Site 2, and PROJ at Site 3; EMP and ASG are joined on ENO, ASG and PROJ on PNO]
5. EMP → Site 2
PROJ → Site 2
Site 2 computes EMP ⋈ PROJ ⋈ ASG
Semijoin Algorithms
- Consider the join of two relations:
  - R[A] (located at site 1)
  - S[A] (located at site 2)
- Alternatives:
  1. Do the join R ⋈A S
  2. Perform one of the semijoin equivalents
     R ⋈A S ⇔ (R ⋉A S) ⋈A S
            ⇔ R ⋈A (S ⋉A R)
            ⇔ (R ⋉A S) ⋈A (S ⋉A R)
Semijoin Algorithms
- Perform the join
  - Send R to Site 2
  - Site 2 computes R ⋈A S
- Consider semijoin (R ⋉A S) ⋈A S
  - S' = ΠA(S)
  - S' → Site 1
  - Site 1 computes R' = R ⋉A S'
  - R' → Site 2
  - Site 2 computes R' ⋈A S
Semijoin is better if
size(ΠA(S)) + size(R ⋉A S) < size(R)
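The comparison can be checked on sample data. This is a minimal sketch that counts only the bytes shipped between the two sites; the relations, tuple widths, and function names are hypothetical.

```python
def join_transfer(r, width_r):
    """Plain join: all of R is shipped to Site 2."""
    return len(r) * width_r

def semijoin_transfer(r, s, attr, width_r, width_a):
    """Semijoin program: ship Π_A(S) to Site 1, then R ⋉_A S to Site 2."""
    s_proj = {t[attr] for t in s}                     # S' = Π_A(S)
    r_reduced = [t for t in r if t[attr] in s_proj]   # R' = R ⋉_A S'
    return len(s_proj) * width_a + len(r_reduced) * width_r

# Hypothetical data: R has 1000 tuples of 50 bytes, A values are 4 bytes.
R = [{"A": i % 100, "payload": i} for i in range(1000)]
S_small = [{"A": v} for v in range(10)]    # selective: 10 of 100 A values
S_full = [{"A": v} for v in range(100)]    # not selective: all A values

print(join_transfer(R, 50))                          # 50000
print(semijoin_transfer(R, S_small, "A", 50, 4))     # 5040
print(semijoin_transfer(R, S_full, "A", 50, 4))      # 50400
```

With `S_small` the semijoin program ships far less than the plain join; with `S_full` no R tuple is eliminated, so the extra projection transfer makes it worse.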
Distributed Dynamic Algorithm
Static Approach
- Compilation
Static Approach – Performing Joins
Static Approach – Vertical Partitioning & Joins
Static Approach – Vertical Partitioning & Joins
2. Move inner relation to the site of outer relation
   Cannot join as they arrive; they need to be stored
Static Approach – Vertical Partitioning & Joins
Static Approach – Vertical Partitioning & Joins
4. Fetch inner tuples as needed
   (a) Retrieve qualified tuples at outer relation site
   (b) Send request containing join column value(s) for outer tuples to inner relation site
   (c) Retrieve matching inner tuples at inner relation site
   (d) Send the matching inner tuples to outer relation site
   (e) Join as they arrive
Total Cost = cost(retrieving qualified outer tuples)
           + msg. cost * (no. of outer tuples fetched)
           + (no. of outer tuples fetched * no. of inner tuples fetched * avg. inner tuple size * msg. cost / msg. size)
           + no. of outer tuples fetched * cost(retrieving matching inner tuples for one outer value)
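The cost expression for the fetch-as-needed strategy can be evaluated term by term. This is a minimal sketch in which the parameter names are hypothetical, `n_inner` is read as the number of inner tuples fetched per outer tuple, and the sample values are invented.

```python
def fetch_as_needed_cost(outer_retrieval_cost, msg_cost, msg_size,
                         n_outer, n_inner, avg_inner_size, inner_lookup_cost):
    """Total cost of strategy 4, following the four terms of the formula."""
    # One request message per outer tuple fetched.
    requests = msg_cost * n_outer
    # Messages needed to ship the matching inner tuples back.
    replies = n_outer * n_inner * avg_inner_size * msg_cost / msg_size
    # One lookup of matching inner tuples per outer value.
    lookups = n_outer * inner_lookup_cost
    return outer_retrieval_cost + requests + replies + lookups

# Hypothetical values, in arbitrary cost units.
cost = fetch_as_needed_cost(
    outer_retrieval_cost=100, msg_cost=10, msg_size=1000,
    n_outer=20, n_inner=5, avg_inner_size=100, inner_lookup_cost=2,
)
print(cost)  # 100 + 200 + 100 + 40 = 440.0
```

The request term grows linearly with the number of outer tuples, which is why this strategy tends to pay off only when few outer tuples qualify.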
2-Step Optimization
1. At compile time, generate a static plan with
operation ordering and access methods only
2. At startup time, carry out site and copy
selection and allocate operations to sites
2-Step Algorithm
- For each q in Q compute load(Sq)
- While Q not empty do
  1. Select subquery a with least allocation flexibility
  2. Select best site b for a (with least load and best benefit)
  3. Remove a from Q and recompute loads if needed
Note: if, in iteration 2, q2 were allocated to s4, this would have produced a better plan. So hybrid optimization can still miss optimal plans.
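The allocation loop can be sketched as a greedy procedure. Several simplifications are assumptions made for illustration: "allocation flexibility" is taken to be the number of candidate sites, the "best" site is simply the least-loaded candidate (benefit is ignored), and allocating a subquery adds one unit of load. The function name and the sample input are hypothetical.

```python
def allocate(subqueries, site_load):
    """Greedy site selection.

    subqueries: dict mapping subquery name -> set of candidate sites
    site_load:  dict mapping site name -> current load
    Returns a dict mapping subquery name -> chosen site.
    """
    load = dict(site_load)
    pending = dict(subqueries)
    plan = {}
    while pending:
        # 1. Subquery with least allocation flexibility (fewest candidates).
        a = min(sorted(pending), key=lambda q: len(pending[q]))
        # 2. Best site for a: here, the least-loaded candidate site.
        b = min(sorted(pending[a]), key=lambda s: load[s])
        plan[a] = b
        # 3. Remove a from the pending set and update the loads.
        del pending[a]
        load[b] += 1
    return plan

plan = allocate(
    {"q1": {"s1", "s3"}, "q2": {"s2"}, "q3": {"s1", "s2", "s3"}},
    {"s1": 0, "s2": 1, "s3": 0},
)
print(plan)
```

Because each choice is final once made, a greedy loop of this kind can settle on a site early that a later subquery would have used better, which is the kind of missed plan the note describes.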