7 Query Processing
[Figure: query processor components: query compiler, plan generator, plan cost estimator]
What are we trying to do?
■ Consider query
● “For each project whose budget is greater than $250000 and which
employs more than two employees, list the names and titles of
employees.”
■ In SQL
SELECT Ename, Title
FROM Emp, Project, Works
WHERE Budget > 250000
AND Emp.Eno = Works.Eno
AND Project.Pno = Works.Pno
AND Project.Pno IN
    (SELECT w.Pno
     FROM Works w
     GROUP BY w.Pno
     HAVING COUNT(*) > 2)
■ How to execute this query?
A Possible Execution Plan
Π_{Ename, Title}(Group_{Pno, Eno}(Emp ⋈ (σ_{Budget>250000}(Project) ⋈ Works)))
Query Processing Methodology
[Figure: pipeline from SQL queries through Normalization, Analysis, Simplification, Restructuring, and Optimization, shown alongside the System Catalog]
Analysis
■ Refute incorrect queries
■ Type incorrect
● If any of its attribute or relation names are not defined in
the global schema
● If operations are applied to attributes of the wrong type
■ Semantically incorrect
● Components do not contribute in any way to the
generation of the result
● Only a subset of relational calculus queries can be tested
for correctness: those that do not contain disjunction and negation
● To detect semantically incorrect queries, use
➠ connection graph (query graph)
➠ join graph
Analysis – Example
SELECT Ename, Resp
FROM Emp, Works, Project
WHERE Emp.Eno = Works.Eno
AND Works.Pno = Project.Pno
AND Pname = 'CAD/CAM'
AND Dur > 36
AND Title = 'Programmer'
[Figure: query graph and join graph for this query; in both, Works is linked to Emp by Emp.Eno=Works.Eno and to Project by Works.Pno=Project.Pno]
Analysis
If the query graph is not connected, the query
may be wrong.
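SELECT Ename, Resp
FROM Emp, Works, Project
WHERE Emp.Eno = Works.Eno
AND Pname = 'CAD/CAM'
AND Dur > 36
AND Title = 'Programmer'
[Figure: query graph for this query; with no join predicate on Project, the Project node (Pname='CAD/CAM') is not connected to Emp, Works, or RESULT]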
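The connectivity test can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name `connected_components` is hypothetical, the graph is given as a plain edge list, and Ename and Resp are assumed to be attributes of Emp and Works respectively (the slides do not show the schema). Selection predicates such as Dur > 36 are self-loops and do not affect connectivity, so they are left out.

```python
def connected_components(nodes, edges):
    """Return the connected components of an undirected graph as a list of sets."""
    adjacency = {node: set() for node in nodes}
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:  # depth-first traversal from `start`
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        components.append(component)
    return components


nodes = ["Emp", "Works", "Project", "RESULT"]
# Projection edges (assumed): Ename comes from Emp, Resp from Works.
projection_edges = [("Emp", "RESULT"), ("Works", "RESULT")]

# Query with both join predicates.
full_query = [("Emp", "Works"), ("Works", "Project")] + projection_edges
# Query with the predicate Works.Pno = Project.Pno missing.
missing_join = [("Emp", "Works")] + projection_edges

print(len(connected_components(nodes, full_query)))
print(connected_components(nodes, missing_join))
```

A query whose graph has more than one component is flagged as possibly wrong; here the second query leaves Project in a component of its own.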
Simplification
■ Why simplify?
● The simpler the query, the easier (and more efficient) it
is to execute it
■ How? Use transformation rules
● elimination of redundancy
➠ idempotency rules
p1 ∧ ¬( p1) ⇔ false
p1 ∧ (p1 ∨ p2) ⇔ p1
p1 ∨ false ⇔ p1
…
● application of transitivity
● use of integrity rules
Simplification – Example
SELECT Title
FROM Emp
WHERE Ename = 'J. Doe'
OR (NOT(Title = 'Programmer')
    AND (Title = 'Programmer'
         OR Title = 'Elect. Eng.')
    AND NOT(Title = 'Elect. Eng.'))
⇓
SELECT Title
FROM Emp
WHERE Ename = 'J. Doe'
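One way to convince yourself that the two WHERE clauses are equivalent is a brute-force truth table. The sketch below treats the three atomic predicates as independent Boolean variables (an assumption made for illustration; the names `original` and `simplified` are hypothetical) and compares both forms on all eight assignments.

```python
from itertools import product


def original(p1, p2, p3):
    # p1: Ename = 'J. Doe', p2: Title = 'Programmer', p3: Title = 'Elect. Eng.'
    return p1 or ((not p2) and (p2 or p3) and (not p3))


def simplified(p1, p2, p3):
    # The bracketed conjunction above is always false, so only p1 remains.
    return p1


equivalent = all(
    original(*values) == simplified(*values)
    for values in product([False, True], repeat=3)
)
print(equivalent)
```

The check succeeds because ¬p2 ∧ (p2 ∨ p3) ∧ ¬p3 can never be true: it requires p2 and p3 to be false while p2 ∨ p3 is true.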
Restructuring
■ Convert SQL to relational algebra
[Figure: operator tree rooted at the projection ΠENAME]
How to implement operators
■ Projection
● Without duplicate elimination – O(n)
● With duplicate elimination
➠ Sorting-based – O(n log n)
➠ Hash-based – O(n+t), where t is the size of the result of the hashing phase
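The two duplicate-elimination approaches can be sketched as follows. This is a minimal in-memory illustration, assuming rows are Python dicts; the function names `project_sort` and `project_hash` and the sample rows are made up for the example.

```python
def project_sort(rows, attrs):
    """Projection with duplicate elimination by sorting: O(n log n)."""
    projected = sorted(tuple(row[a] for a in attrs) for row in rows)
    result = []
    for t in projected:
        # After sorting, duplicates are adjacent, so one comparison suffices.
        if not result or result[-1] != t:
            result.append(t)
    return result


def project_hash(rows, attrs):
    """Projection with duplicate elimination by hashing: one pass over the input."""
    seen = set()
    result = []
    for row in rows:
        t = tuple(row[a] for a in attrs)
        if t not in seen:  # hash lookup instead of a sort
            seen.add(t)
            result.append(t)
    return result


emp = [
    {"Ename": "J. Doe", "Title": "Programmer"},
    {"Ename": "M. Smith", "Title": "Programmer"},
    {"Ename": "A. Lee", "Title": "Elect. Eng."},
]
print(project_sort(emp, ["Title"]))
print(project_hash(emp, ["Title"]))
```

Both functions return the same set of tuples; the sort-based version returns them in sorted order, the hash-based version in order of first appearance.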
How to implement operators
(cont’d)
■ Join
● Nested loop join: R⋈S
foreach tuple r∈R do
    foreach tuple s∈S do
        if r and s agree on the join attribute then add <r,s> to result
● O(n*m), where n and m are the numbers of tuples in R and S
● Improvements possible by
➠ page-oriented nested loop join
➠ block-oriented nested loop join
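A small Python sketch of the tuple-at-a-time loop and a block-oriented variant follows. It is an in-memory illustration only: relations are lists of dicts, `block_size` stands in for the number of outer tuples that fit in the buffer, and all names and sample rows are assumptions.

```python
def nested_loop_join(R, S, r_attr, s_attr):
    """Compare every tuple of R with every tuple of S."""
    result = []
    for r in R:
        for s in S:
            if r[r_attr] == s[s_attr]:
                result.append((r, s))
    return result


def block_nested_loop_join(R, S, r_attr, s_attr, block_size):
    """Scan S once per block of R instead of once per tuple of R."""
    result = []
    for start in range(0, len(R), block_size):
        block = R[start:start + block_size]  # one buffer-load of outer tuples
        for s in S:                          # a single pass over S for the whole block
            for r in block:
                if r[r_attr] == s[s_attr]:
                    result.append((r, s))
    return result


emp = [
    {"Eno": 1, "Ename": "J. Doe"},
    {"Eno": 2, "Ename": "M. Smith"},
    {"Eno": 3, "Ename": "A. Lee"},
]
works = [
    {"Eno": 1, "Pno": "P1"},
    {"Eno": 2, "Pno": "P1"},
    {"Eno": 1, "Pno": "P2"},
]
pairs = nested_loop_join(emp, works, "Eno", "Eno")
print([(r["Ename"], s["Pno"]) for r, s in pairs])
```

The number of tuple comparisons is the same in both versions; the block variant saves I/O because the inner relation is read fewer times.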
How to implement operators
(cont’d)
■ Join
● Index nested loop join: R⋈S
foreach tuple r∈R do
    use the index on the join attribute to find matching tuples of S
    foreach such tuple s∈S do
        add <r,s> to result
● Sort-merge join
➠ Sort R and S on the join attribute
➠ Merge the sorted relations
● Hash join
➠ Hash R and S using a common hash function
➠ Within each bucket, find tuples of R and S whose join attribute values are equal
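The hash and sort-merge joins can be sketched in Python as below. Both are simplified in-memory versions written for illustration: `hash_join` builds a table on R and probes it with S rather than partitioning both relations to disk, and the function names and sample rows are assumptions.

```python
from collections import defaultdict


def hash_join(R, S, r_attr, s_attr):
    """Build a hash table on R, then probe it with each tuple of S."""
    buckets = defaultdict(list)
    for r in R:                                   # build phase
        buckets[r[r_attr]].append(r)
    result = []
    for s in S:                                   # probe phase
        for r in buckets.get(s[s_attr], []):
            result.append((r, s))
    return result


def sort_merge_join(R, S, r_attr, s_attr):
    """Sort both inputs on the join attribute, then merge them."""
    R = sorted(R, key=lambda r: r[r_attr])
    S = sorted(S, key=lambda s: s[s_attr])
    result, i, j = [], 0, 0
    while i < len(R) and j < len(S):
        a, b = R[i][r_attr], S[j][s_attr]
        if a < b:
            i += 1
        elif a > b:
            j += 1
        else:
            # Locate the run of S tuples sharing this join value.
            j_end = j
            while j_end < len(S) and S[j_end][s_attr] == a:
                j_end += 1
            # Pair every R tuple with this value against the whole run.
            while i < len(R) and R[i][r_attr] == a:
                for k in range(j, j_end):
                    result.append((R[i], S[k]))
                i += 1
            j = j_end
    return result


emp = [{"Eno": 1, "Ename": "J. Doe"}, {"Eno": 2, "Ename": "M. Smith"},
       {"Eno": 3, "Ename": "A. Lee"}]
works = [{"Eno": 1, "Pno": "P1"}, {"Eno": 2, "Pno": "P1"}, {"Eno": 1, "Pno": "P2"}]
print(len(hash_join(emp, works, "Eno", "Eno")))
print(len(sort_merge_join(emp, works, "Eno", "Eno")))
```

An index nested loop join has the same shape as the probe phase of `hash_join`, with a persistent index on S playing the role of the hash table.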
Index Selection Guidelines
Example 1
Example 2
SELECT e.Ename, w.Resp
FROM Emp e, Works w
WHERE e.Age BETWEEN 45 AND 60
AND e.Title = 'Programmer'
AND e.Eno = w.Eno
■ Clearly, Emp should be the outer relation.
● Suggests that we build a hash index on w.Eno.
■ What index should we build on Emp?
● B+ tree on e.Age could be used, OR an index on e.Title could be used.
Only one of these is needed, and which is better depends upon the
selectivity of the conditions.
➠ As a rule of thumb, equality selections are more selective than range selections.
■ As both examples indicate, our choice of indexes is guided by
the plan(s) that we expect an optimizer to consider for a query.
We therefore have to understand optimizers!
Examples of Clustering
SELECT e.Title
FROM Emp e
WHERE e.Age > 40
■ B+ tree index on e.Age can be used to get
qualifying tuples.
● How selective is the condition?
● Is the index clustered?
Clustering and Joins
SELECT e.Ename, p.Pname
FROM Emp e, Project p
WHERE p.Budget = 350000
AND e.City = p.City
■ Clustering is especially important when accessing inner tuples
in Index Nested Loop join.
● Should make index on e.City clustered.
■ Suppose that the WHERE clause is instead:
WHERE e.Title = 'Programmer' AND e.City = p.City
● If many employees are Programmers, Sort-Merge join may be worth
considering. A clustered index on p.City would help.
■ Summary: Clustering is useful whenever many tuples are to
be retrieved.
Selecting Alternatives
SELECT Ename
FROM Emp e,Works w
WHERE e.Eno = w.Eno
AND w.Dur > 37
Strategy 1
Π_{Ename}(σ_{Dur>37 ∧ Emp.Eno=Works.Eno}(Emp × Works))
Strategy 2
Π_{Ename}(Emp ⋈_{Eno} (σ_{Dur>37}(Works)))
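To see why the choice between the two strategies matters, the sketch below runs both on a tiny in-memory database and records the size of each intermediate result. The data, the function names, and the use of intermediate-result sizes as a cost proxy are all assumptions made for illustration.

```python
def strategy_1(emp, works):
    """Cartesian product first, then selection, then projection."""
    product = [(e, w) for e in emp for w in works]            # Emp × Works
    selected = [(e, w) for e, w in product
                if w["Dur"] > 37 and e["Eno"] == w["Eno"]]
    names = [e["Ename"] for e, _ in selected]
    return names, [len(product), len(selected)]


def strategy_2(emp, works):
    """Selection on Works first, then join, then projection."""
    filtered = [w for w in works if w["Dur"] > 37]            # σ Dur>37 (Works)
    joined = [(e, w) for e in emp for w in filtered if e["Eno"] == w["Eno"]]
    names = [e["Ename"] for e, _ in joined]
    return names, [len(filtered), len(joined)]


emp = [{"Eno": 1, "Ename": "J. Doe"}, {"Eno": 2, "Ename": "M. Smith"},
       {"Eno": 3, "Ename": "A. Lee"}]
works = [{"Eno": 1, "Pno": "P1", "Dur": 12}, {"Eno": 2, "Pno": "P1", "Dur": 40},
         {"Eno": 2, "Pno": "P2", "Dur": 48}, {"Eno": 3, "Pno": "P3", "Dur": 24}]

names_1, sizes_1 = strategy_1(emp, works)
names_2, sizes_2 = strategy_2(emp, works)
print(names_1 == names_2, sizes_1, sizes_2)
```

Both strategies return the same names, but Strategy 1 materializes every pair of Emp and Works tuples before filtering, while Strategy 2 never holds more tuples than survive the selection.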
Query Optimization Issues –
Optimization Timing
■ Static
● compilation ⇒ optimize prior to the execution
● difficult to estimate the size of the intermediate results
⇒ error propagation
● can amortize over many executions
■ Dynamic
● run time optimization
● exact information on the intermediate relation sizes
● have to reoptimize for multiple executions
■ Hybrid
● compile using a static algorithm
● if the error in estimated sizes exceeds a threshold, reoptimize at
run time
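The hybrid trigger can be pictured as a simple check performed once an intermediate result has been produced. The sketch below is an assumption, not a description of any particular system: the function name `should_reoptimize` and the choice of relative error as the measure are both hypothetical.

```python
def should_reoptimize(estimated_card, actual_card, threshold):
    """Return True when the relative estimation error exceeds `threshold`.

    Relative error is measured against the observed cardinality; the
    denominator is clamped to 1 so that an empty result does not divide by zero.
    """
    relative_error = abs(estimated_card - actual_card) / max(actual_card, 1)
    return relative_error > threshold


# Estimate was close: keep the statically chosen plan.
print(should_reoptimize(1000, 950, threshold=0.5))
# Estimate was off by an order of magnitude: reoptimize the rest of the plan.
print(should_reoptimize(1000, 100, threshold=0.5))
```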
Query Optimization Issues –
Statistics
■ Relation
● cardinality
● size of a tuple
● fraction of tuples participating in a join with another
relation
● …
■ Attribute
● cardinality of domain
● actual number of distinct values
● …
■ Common assumptions
● independence between different attribute values
● uniform distribution of attribute values within their
domain
Query Optimization Components
Intermediate Relation Sizes
Selection
size(R) = card(R) ∗ length(R)
card(σ_F(R)) = SF_σ(F) ∗ card(R)
where
SF_σ(A = value) = 1 / card(Π_A(R))
SF_σ(A > value) = (max(A) – value) / (max(A) – min(A))
SF_σ(A < value) = (value – min(A)) / (max(A) – min(A))
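The selectivity formulas translate directly into code. In the sketch below the function names are hypothetical and the numbers (1000 Works tuples, Dur ranging from 1 to 49) are invented purely to have something to compute; the formulas themselves rely on the uniform-distribution assumption.

```python
def sf_equals(distinct_values):
    """SF for A = value: one over the number of distinct values of A."""
    return 1 / distinct_values


def sf_greater(value, min_a, max_a):
    """SF for A > value under a uniform distribution on [min_a, max_a]."""
    return (max_a - value) / (max_a - min_a)


def sf_less(value, min_a, max_a):
    """SF for A < value under a uniform distribution on [min_a, max_a]."""
    return (value - min_a) / (max_a - min_a)


def card_selection(card_r, selectivity):
    """Estimated cardinality of a selection."""
    return selectivity * card_r


# Hypothetical statistics: card(Works) = 1000, min(Dur) = 1, max(Dur) = 49.
sf = sf_greater(37, min_a=1, max_a=49)
print(sf, card_selection(1000, sf))
```

With these made-up statistics, the predicate Dur > 37 is estimated to keep a quarter of the Works tuples.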
Intermediate Relation Sizes
Projection
card(Π_A(R)) = card(R) (holds without duplicate elimination, or when A contains a key of R)
Cartesian Product
card(R × S) = card(R) ∗ card(S)
Union
upper bound: card(R ∪ S) = card(R) + card(S)
lower bound: card(R ∪ S) = max{card(R), card(S)}
Set Difference
upper bound: card(R–S) = card(R)
lower bound: 0
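The bounds for the set operations can be checked against small concrete sets. The helper names below are hypothetical, and the two sets are arbitrary examples.

```python
def card_product(card_r, card_s):
    """Exact cardinality of a Cartesian product."""
    return card_r * card_s


def card_union_bounds(card_r, card_s):
    """(lower, upper) bounds on card(R ∪ S)."""
    return max(card_r, card_s), card_r + card_s


def card_difference_bounds(card_r, card_s):
    """(lower, upper) bounds on card(R – S), as given on the slide."""
    return 0, card_r


R = {1, 2, 3}
S = {3, 4}
low, high = card_union_bounds(len(R), len(S))
print(low <= len(R | S) <= high)
low, high = card_difference_bounds(len(R), len(S))
print(low <= len(R - S) <= high)
```

The union reaches its upper bound when R and S are disjoint and its lower bound when one relation contains the other.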
Intermediate Relation Size
Join
● Special case: A is a key of R and B is a foreign key of S:
card(R ⋈_{A=B} S) = card(S)
● More general:
card(R ⋈ S) = SF_J ∗ card(R) ∗ card(S)
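The two join estimates are consistent with each other: the key/foreign-key case is the general formula with SF_J = 1/card(R). The sketch below shows this with exact fractions; the function names and the cardinalities (400 Emp tuples, 1000 Works tuples) are assumptions.

```python
from fractions import Fraction


def card_join(card_r, card_s, sf_j):
    """General estimate: SF_J * card(R) * card(S)."""
    return sf_j * card_r * card_s


def card_key_fk_join(card_s):
    """Key/foreign-key case: every S tuple matches exactly one R tuple."""
    return card_s


# Hypothetical sizes: Emp (key Eno) has 400 tuples, Works (foreign key Eno) has 1000.
card_emp, card_works = 400, 1000
general = card_join(card_emp, card_works, Fraction(1, card_emp))
print(general == card_key_fk_join(card_works))
```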
Search Space
Search Space – Join Trees
■ For N relations, there are O(N!) equivalent join trees that can be
obtained by applying commutativity and associativity rules
SELECT Ename, Resp
FROM Emp, Works, Project
WHERE Emp.Eno = Works.Eno
AND Works.Pno = Project.Pno
[Figure: three equivalent join trees: (Emp ⋈ Works) ⋈ Project; (Project ⋈ Works) ⋈ Emp; (Project × Emp) ⋈ Works]
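The factorial growth is easy to observe by enumerating join orders. The sketch below lists only linear (left-deep) trees, one per permutation of the relations, so it illustrates the N! figure rather than the full space that includes bushy trees; the function names are hypothetical.

```python
from itertools import permutations
from math import factorial


def left_deep_orders(relations):
    """Every ordering of the relations corresponds to one left-deep join tree."""
    return list(permutations(relations))


def as_tree(order):
    """Render an ordering as a parenthesized left-deep tree."""
    tree = order[0]
    for rel in order[1:]:
        tree = f"({tree} ⋈ {rel})"
    return tree


orders = left_deep_orders(["Emp", "Works", "Project"])
print(len(orders), factorial(3))
for order in orders:
    print(as_tree(order))
```

Some of the listed orders, such as those that start by combining Emp and Project, have no join predicate between the first two relations and therefore require a Cartesian product.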
Transformation Rules
■ Commutativity of binary operations
● R×S⇔S×R
● R⋈S⇔S⋈R
● R∪S⇔S∪R
■ Associativity of binary operations
● ( R × S ) × T ⇔ R × (S × T)
● ( R ⋈ S ) ⋈ T ⇔ R ⋈ (S ⋈ T )
■ Idempotence of unary operations
● Π_{A'}(Π_{A''}(R)) ⇔ Π_{A'}(R)
● σ_{p1(A1)}(σ_{p2(A2)}(R)) ⇔ σ_{p1(A1) ∧ p2(A2)}(R)
where R[A] and A' ⊆ A, A'' ⊆ A and A' ⊆ A''
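The two rules for unary operations can be checked on a small relation. This is a sketch with made-up data and hypothetical helper names (`select`, `project`); projection eliminates duplicates, as in the relational algebra.

```python
def select(rows, predicate):
    """σ: keep the rows that satisfy the predicate."""
    return [row for row in rows if predicate(row)]


def project(rows, attrs):
    """Π: keep only `attrs` and drop duplicate tuples."""
    seen, result = set(), []
    for row in rows:
        key = tuple(row[a] for a in attrs)
        if key not in seen:
            seen.add(key)
            result.append({a: row[a] for a in attrs})
    return result


emp = [
    {"Ename": "J. Doe", "Title": "Programmer", "Age": 50},
    {"Ename": "M. Smith", "Title": "Programmer", "Age": 35},
    {"Ename": "A. Lee", "Title": "Elect. Eng.", "Age": 47},
]

# Cascaded projections collapse to the outermost one, since A' ⊆ A''.
lhs = project(project(emp, ["Ename", "Title"]), ["Title"])
rhs = project(emp, ["Title"])
print(lhs == rhs)

# Cascaded selections are equivalent to one selection on the conjunction.
is_programmer = lambda r: r["Title"] == "Programmer"
over_40 = lambda r: r["Age"] > 40
lhs = select(select(emp, over_40), is_programmer)
rhs = select(emp, lambda r: is_programmer(r) and over_40(r))
print(lhs == rhs)
```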
Transformation Rules
■ Commuting selection with projection
■ Commuting selection with binary operations
● σ_{p(A)}(R × S) ⇔ (σ_{p(A)}(R)) × S
● σ_{p(Ai)}(R ⋈_{(Aj,Bk)} S) ⇔ (σ_{p(Ai)}(R)) ⋈_{(Aj,Bk)} S
● σ_{p(Ai)}(R ∪ T) ⇔ σ_{p(Ai)}(R) ∪ σ_{p(Ai)}(T)
where Ai belongs to R and T
■ Commuting projection with binary operations
● Π_C(R × S) ⇔ Π_{A'}(R) × Π_{B'}(S)
● Π_C(R ⋈_{(Aj,Bk)} S) ⇔ Π_{A'}(R) ⋈_{(Aj,Bk)} Π_{B'}(S)
where R[A], S[B], C = A' ∪ B', A' ⊆ A and B' ⊆ B
Equivalent Query
[Figure: equivalent operator tree rooted at ΠEname]
Another Equivalent Query
[Figure: equivalent operator tree rooted at ΠEname over a join ⋈, with an inner projection ΠPno,Ename]
Search Strategy
■ How to “move” in the search space.
■ Deterministic
● Start from base relations and build plans by adding one
relation at each step
● Dynamic programming: breadth-first
● Greedy: depth-first
■ Randomized
● Search for an optimum around a particular starting point
● Trade optimization time for execution time
● Better when there are more than 5-6 relations
● Simulated annealing
● Iterative improvement
Search Algorithms
■ Restrict the search space
● Use heuristics
➠ E.g., Perform unary operations before binary operations
● Restrict the shape of the join tree
➠ Consider only linear trees, ignore bushy ones
[Figure: linear join tree ((R1 ⋈ R2) ⋈ R3) ⋈ R4 versus bushy join tree (R1 ⋈ R2) ⋈ (R3 ⋈ R4)]
Search Strategies
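■ Deterministic
[Figure: plan built one relation at a time: R1 ⋈ R2, then (R1 ⋈ R2) ⋈ R3, then ((R1 ⋈ R2) ⋈ R3) ⋈ R4]
■ Randomized
[Figure: a complete plan (R1 ⋈ R2) ⋈ R3 transformed into the neighbouring plan (R1 ⋈ R3) ⋈ R2]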
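The two deterministic strategies can be contrasted in a short sketch. Everything specific here is an assumption made for illustration: the cardinalities and selectivity factors are invented, plans are restricted to linear trees, and the cost of a plan is taken to be the sum of its intermediate result sizes. Dynamic programming keeps the best plan for every subset of relations (breadth-first), whereas the greedy search commits to one relation at each step (depth-first).

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical statistics.
CARD = {"Emp": 400, "Works": 1000, "Project": 100}
SF = {
    frozenset(("Emp", "Works")): Fraction(1, 400),
    frozenset(("Works", "Project")): Fraction(1, 100),
}


def join_card(left_rels, left_card, rel):
    """Estimated size of (left ⋈ rel); a missing predicate means a Cartesian product."""
    card = left_card * CARD[rel]
    for other in left_rels:
        card *= SF.get(frozenset((other, rel)), 1)
    return card


def dynamic_programming(relations):
    """Cheapest linear plan, built bottom-up over subsets of increasing size."""
    # best[subset] = (cost, cardinality, join order)
    best = {frozenset([r]): (0, CARD[r], (r,)) for r in relations}
    for size in range(2, len(relations) + 1):
        for subset in map(frozenset, combinations(relations, size)):
            candidates = []
            for rel in subset:  # rel is the last relation joined
                rest = subset - {rel}
                cost, card, order = best[rest]
                new_card = join_card(rest, card, rel)
                candidates.append((cost + new_card, new_card, order + (rel,)))
            best[subset] = min(candidates)
    return best[frozenset(relations)]


def greedy(relations):
    """Start from the smallest relation and always add the cheapest next join."""
    start = min(relations, key=lambda r: CARD[r])
    order, card, cost = (start,), CARD[start], 0
    remaining = sorted(set(relations) - {start})
    while remaining:
        rel = min(remaining, key=lambda r: join_card(order, card, r))
        card = join_card(order, card, rel)
        cost += card
        order += (rel,)
        remaining.remove(rel)
    return cost, card, order


relations = ["Emp", "Works", "Project"]
print(dynamic_programming(relations))
print(greedy(relations))
```

On larger inputs the greedy search can miss the optimum that dynamic programming finds, but it examines far fewer plans.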
Summary
■ Declarative SQL queries need to be converted into
low level execution plans
■ These plans need to be optimized to find the
“best” plan
■ Optimization involves
● Search space: identifies the alternative plans and
alternative execution algorithms for algebra operators
➠ This is done by means of transformation rules
● Cost function: calculates the cost of executing each plan
➠ CPU and I/O costs
● Search algorithm: controls which alternative plans are
investigated