
7 QueryProcessing

The document discusses query processing and optimization. It describes how a high-level SQL query is compiled and optimized to generate an efficient execution plan. The key components are the query compiler, query processor, plan generator, plan estimator and plan evaluator. It provides examples of query normalization, analysis, simplification, restructuring and implementation of relational operators like selection, projection and join. Index selection guidelines are also presented to efficiently evaluate queries.

Uploaded by

Ra Nim gh

Query Processing

[Figure: query processing architecture. Components shown: Query Compiler, Query Processor, Plan Generator, Plan Cost Estimator, Plan Evaluator. Input: high-level user query (SQL). Output: low-level data manipulation commands (execution plan).]
7-1
Query Processing Components

■ Query language that is used
● SQL: “intergalactic dataspeak”
■ Query execution methodology
● The steps that one goes through in executing high-level
(declarative) user queries.
■ Query optimization
● How do we determine a good execution plan?

7-2
What are we trying to do?
■ Consider query
● “For each project whose budget is greater than $250000 and which
employs more than two employees, list the names and titles of
employees.”
■ In SQL
SELECT Ename, Title
FROM Emp, Project, Works
WHERE Budget > 250000
AND Emp.Eno=Works.Eno
AND Project.Pno=Works.Pno
AND Project.Pno IN
(SELECT w.Pno
FROM Works w
GROUP BY w.Pno
HAVING COUNT(*) > 2)
■ How to execute this query?
7-3
A Possible Execution Plan

1. T1 ← Scan Project table and select all tuples
with Budget value > 250000
2. T2 ← Join T1 with the Works relation
3. T3 ← Join T2 with the Emp relation
4. T4 ← Group tuples of T3 over Pno
5. Scan tuples in each group of T4 and for groups
that have more than 2 tuples, Project over
Ename, Title
Note: Overly simplified – we’ll detail later.
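The five steps can be traced over small in-memory tables. This is only an illustrative sketch: the sample rows and the helper `join` are assumptions made for the example, and a real engine would not materialize every intermediate table this way.

```python
from collections import defaultdict

# Hypothetical sample data; column names follow the slides.
project = [{"Pno": "P1", "Budget": 300000}, {"Pno": "P2", "Budget": 100000}]
works = [{"Eno": "E1", "Pno": "P1"}, {"Eno": "E2", "Pno": "P1"},
         {"Eno": "E3", "Pno": "P1"}, {"Eno": "E1", "Pno": "P2"}]
emp = [{"Eno": "E1", "Ename": "A. Lee", "Title": "Programmer"},
       {"Eno": "E2", "Ename": "B. Kim", "Title": "Analyst"},
       {"Eno": "E3", "Ename": "C. Roy", "Title": "Engineer"}]

def join(left, right, attr):
    """Naive equi-join on a shared attribute (nested loops)."""
    return [{**l, **r} for l in left for r in right if l[attr] == r[attr]]

t1 = [p for p in project if p["Budget"] > 250000]   # step 1: selection
t2 = join(t1, works, "Pno")                         # step 2: join with Works
t3 = join(t2, emp, "Eno")                           # step 3: join with Emp
t4 = defaultdict(list)                              # step 4: group over Pno
for row in t3:
    t4[row["Pno"]].append(row)
# step 5: keep groups with more than 2 tuples, project over Ename, Title
result = [(r["Ename"], r["Title"])
          for group in t4.values() if len(group) > 2 for r in group]
print(result)
```

Only project P1 passes the budget filter here, and it has three employees, so all three names are returned.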
7-4
Pictorial Representation
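[Figure: operator tree of the plan, read bottom-up. T1 = σBudget>250000(Project); T2 = T1 ⋈ Works; T3 = T2 ⋈ Emp; T4 = Group by over T3; root = ΠEname, Title over T4.]
Two questions follow from this picture:
1. How do we get this plan?
2. How do we execute each of the nodes?

ΠEname, Title(GroupPno,Eno(Emp ⋈ (σBudget>250000(Project) ⋈ Works)))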
7-5
Query Processing Methodology
SQL Queries
→ Normalization
→ Analysis
→ Simplification
→ Restructuring
→ Optimization
→ “Optimal” Execution Plan
[The figure also shows the System Catalog alongside these steps.]


7-6
Query Normalization
■ Lexical and syntactic analysis
● check validity (similar to compilers)
● check for attributes and relations
● type checking on the qualification
■ Put into (query) normal form
● Conjunctive normal form
(p11∨p12∨…∨p1n) ∧…∧ (pm1∨pm2∨…∨pmn)
● Disjunctive normal form
(p11∧p12 ∧…∧p1n) ∨…∨ (pm1 ∧pm2∧…∧ pmn)
● OR's mapped into union
● AND's mapped into join or selection

7-7
Analysis
■ Refute incorrect queries
■ Type incorrect
● If any of its attribute or relation names are not defined in
the global schema
● If operations are applied to attributes of the wrong type
■ Semantically incorrect
● Components do not contribute in any way to the
generation of the result
● Only a subset of relational calculus queries can be tested
for correctness
● Those that do not contain disjunction and negation
● To detect
➠ connection graph (query graph)
➠ join graph
7-8
Analysis – Example
SELECT Ename,Resp
FROM Emp, Works, Project
WHERE Emp.Eno = Works.Eno
AND Works.Pno = Project.Pno
AND Pname = ‘CAD/CAM’
AND Dur > 36
AND Title = ‘Programmer’

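Query graph
● Nodes: Emp, Works, Project, RESULT
● Join edges: Emp – Works (Emp.Eno=Works.Eno), Works – Project (Works.Pno=Project.Pno)
● Selection labels: Dur>36 on Works, Title=‘Programmer’ on Emp, Pname=‘CAD/CAM’ on Project
● Projection edges: Emp – RESULT (Ename), Works – RESULT (Resp)

Join graph
● Emp – Works (Emp.Eno=Works.Eno), Works – Project (Works.Pno=Project.Pno)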

7-9
Analysis
If the query graph is not connected, the query
may be wrong.
SELECT Ename,Resp
FROM Emp, Works, Project
WHERE Emp.Eno = Works.Eno
AND Pname = ‘CAD/CAM’
AND Dur > 36
AND Title = ‘Programmer’

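Query graph
● Emp – Works (Emp.Eno=Works.Eno)
● Emp – RESULT (Ename), Works – RESULT (Resp)
● Project (Pname=‘CAD/CAM’) has no edge to any other node, so the graph is not connected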
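The connectivity test on the query graph can be sketched with a depth-first traversal. The function name `is_connected` and the edge lists are assumptions for illustration; they encode the two Analysis examples, with and without the join predicate on Project.

```python
def is_connected(nodes, edges):
    """Return True if the undirected graph (nodes, edges) is connected."""
    nodes = set(nodes)
    if not nodes:
        return True
    adjacent = {n: set() for n in nodes}
    for a, b in edges:
        adjacent[a].add(b)
        adjacent[b].add(a)
    seen, stack = set(), [next(iter(nodes))]
    while stack:
        n = stack.pop()
        if n not in seen:
            seen.add(n)
            stack.extend(adjacent[n] - seen)
    return seen == nodes

relations = ["Emp", "Works", "Project", "RESULT"]
result_edges = [("Emp", "RESULT"), ("Works", "RESULT")]   # Ename, Resp
good = [("Emp", "Works"), ("Works", "Project")] + result_edges
bad = [("Emp", "Works")] + result_edges   # Works.Pno = Project.Pno is missing

print(is_connected(relations, good))   # True
print(is_connected(relations, bad))    # False
```

A disconnected graph does not prove the query is wrong, but it flags a likely missing join predicate, which would otherwise turn into a Cartesian product.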

7-10
Simplification
■ Why simplify?
● The simpler the query, the easier (and more efficient) it
is to execute it
■ How? Use transformation rules
● elimination of redundancy
➠ idempotency rules
p1 ∧ ¬( p1) ⇔ false
p1 ∧ (p1 ∨ p2) ⇔ p1
p1 ∨ false ⇔ p1

● application of transitivity
● use of integrity rules
7-11
Simplification – Example
SELECT Title
FROM Emp
WHERE Ename = ‘J. Doe’
OR (NOT(Title = ‘Programmer’)
AND (Title = ‘Programmer’
OR Title = ‘Elect. Eng.’)
AND NOT(Title = ‘Elect. Eng.’))


SELECT Title
FROM Emp
WHERE Ename = ‘J. Doe’
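The simplification can be checked by brute force over all truth assignments. In this sketch the three Boolean variables stand for the atomic predicates of the WHERE clause; the variable and function names are assumptions for the example.

```python
from itertools import product

def original(j, p, e):
    # j: Ename = 'J. Doe', p: Title = 'Programmer', e: Title = 'Elect. Eng.'
    return j or ((not p) and (p or e) and (not e))

def simplified(j, p, e):
    return j

# Compare the two qualifications on all 8 combinations of truth values.
equivalent = all(original(j, p, e) == simplified(j, p, e)
                 for j, p, e in product([False, True], repeat=3))
print(equivalent)   # True
```

The second disjunct is always false: if p holds, NOT p fails; if p does not hold, the middle term reduces to e, which contradicts NOT e.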

7-12
Restructuring
■ Convert SQL to relational algebra
■ Make use of query trees
■ Example
SELECT Ename
FROM Emp, Works, Project
WHERE Emp.Eno = Works.Eno
AND Works.Pno = Project.Pno
AND Ename <> ‘J. Doe’
AND Pname = ‘CAD/CAM’
AND (Dur = 12 OR Dur = 24)

Query tree, from root to leaves:
Projection: ΠENAME
Selections: σDUR=12 OR DUR=24, then σPNAME=“CAD/CAM”, then σENAME≠“J. DOE”
Joins: ⋈PNO with inputs Project and (Works ⋈ENO Emp)

ΠENAME(σDUR=12 OR DUR=24(σPNAME=“CAD/CAM”(σENAME≠“J. DOE”(Project ⋈PNO (Works ⋈ENO Emp)))))

7-13
How to implement operators

■ Selection (assume R has n pages)
● Scan without an index – O(n)
● Scan with index
➠ B+ tree index – O(log n)
➠ Hash index – O(1) (equality conditions only)

■ Projection
● Without duplicate elimination – O(n)
● With duplicate elimination
➠ Sorting-based – O(n log n)
➠ Hash-based – O(n+t), where t is the size of the result of the hashing phase
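The two duplicate-elimination strategies for projection can be sketched in memory as follows. The function names are assumptions, and a Python set stands in for the hash table; a disk-based system would partition to pages instead.

```python
def project_sort(rows, attrs):
    """Sort-based projection: sort projected tuples, then drop adjacent duplicates."""
    projected = sorted(tuple(r[a] for a in attrs) for r in rows)
    out = []
    for t in projected:
        if not out or out[-1] != t:
            out.append(t)
    return out

def project_hash(rows, attrs):
    """Hash-based projection: a set plays the role of the hash table."""
    seen, out = set(), []
    for r in rows:
        t = tuple(r[a] for a in attrs)
        if t not in seen:
            seen.add(t)
            out.append(t)
    return out

# Hypothetical sample rows.
emp = [{"Ename": "A. Lee", "Title": "Programmer"},
       {"Ename": "B. Kim", "Title": "Analyst"},
       {"Ename": "C. Roy", "Title": "Programmer"}]
print(project_sort(emp, ["Title"]))   # [('Analyst',), ('Programmer',)]
print(project_hash(emp, ["Title"]))   # [('Programmer',), ('Analyst',)]
```

Both return the same set of tuples; the sort-based variant also delivers them in sorted order, which a later sort-merge join could reuse.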

7-14
How to implement operators
(cont’d)
■ Join
● Nested loop join: R ⋈A=B S
foreach tuple r∈R do
    foreach tuple s∈S do
        if r.A == s.B then add <r,s> to result
● O(n*m), where n and m are the sizes of R and S
● Improvements possible by
➠ page-oriented nested loop join
➠ block-oriented nested loop join
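A minimal sketch of the tuple-at-a-time nested loop join and a block-oriented variant. The function names, the `block_size` parameter and the sample data are assumptions; the counter `s_scans` only illustrates that the inner relation is scanned once per block of the outer rather than once per outer tuple.

```python
def nested_loop_join(R, S, a, b):
    """Tuple-at-a-time nested loop join on R.a = S.b."""
    return [(r, s) for r in R for s in S if r[a] == s[b]]

def block_nested_loop_join(R, S, a, b, block_size=2):
    """Read R in blocks of block_size tuples and scan S once per block."""
    result, s_scans = [], 0
    for i in range(0, len(R), block_size):
        block = R[i:i + block_size]
        s_scans += 1                      # one pass over S for this block
        for s in S:
            for r in block:
                if r[a] == s[b]:
                    result.append((r, s))
    return result, s_scans

# Hypothetical sample data.
emp = [{"Eno": e} for e in ("E1", "E2", "E3", "E4")]
works = [{"Eno": "E1", "Pno": "P1"}, {"Eno": "E3", "Pno": "P2"},
         {"Eno": "E3", "Pno": "P3"}]
simple = nested_loop_join(emp, works, "Eno", "Eno")
blocked, scans = block_nested_loop_join(emp, works, "Eno", "Eno")
print(len(simple), len(blocked), scans)   # 3 3 2
```

With four outer tuples and a block size of two, Works is scanned twice instead of four times, while the join result is unchanged.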

7-15
How to implement operators
(cont’d)
■ Join
● Index nested loop join: R ⋈ S
foreach tuple r∈R do
    use index on join attr. to find matching tuples of S
    foreach such tuple s∈S do
        add <r,s> to result
● Sort-merge join
➠ Sort R and S on the join attribute
➠ Merge the sorted relations
● Hash join
➠ Hash R and S using a common hash function
➠ Within each bucket, find tuples where r and s agree on the join attribute
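Hash join and sort-merge join can be sketched in memory as below. This is a simplification under stated assumptions: the hash join builds a table on R and probes it with S rather than partitioning both inputs to disk, and all names and sample rows are made up for the example.

```python
from collections import defaultdict

def hash_join(R, S, a, b):
    """Build a hash table on R.a, then probe it with each tuple of S."""
    buckets = defaultdict(list)
    for r in R:
        buckets[r[a]].append(r)
    return [(r, s) for s in S for r in buckets.get(s[b], [])]

def sort_merge_join(R, S, a, b):
    """Sort both inputs on the join attribute, then merge matching runs."""
    R, S = sorted(R, key=lambda r: r[a]), sorted(S, key=lambda s: s[b])
    result, i, j = [], 0, 0
    while i < len(R) and j < len(S):
        if R[i][a] < S[j][b]:
            i += 1
        elif R[i][a] > S[j][b]:
            j += 1
        else:
            key, j_start = R[i][a], j
            while i < len(R) and R[i][a] == key:
                j = j_start               # rescan the matching run of S
                while j < len(S) and S[j][b] == key:
                    result.append((R[i], S[j]))
                    j += 1
                i += 1
    return result

def pairs(res):
    return sorted((r["Eno"], s["Pno"]) for r, s in res)

emp = [{"Eno": "E2"}, {"Eno": "E1"}, {"Eno": "E3"}]
works = [{"Eno": "E1", "Pno": "P1"}, {"Eno": "E3", "Pno": "P2"},
         {"Eno": "E3", "Pno": "P3"}]
print(pairs(hash_join(emp, works, "Eno", "Eno")))
print(pairs(sort_merge_join(emp, works, "Eno", "Eno")))
```

Both calls print the same three (Eno, Pno) pairs; the algorithms differ in cost, not in result.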

7-16
Index Selection Guidelines

■ Hash vs tree index
● Hash index on inner is very good for Index Nested Loops.
➠ Should be clustered if join column is not key for inner, and
inner tuples need to be retrieved.
● Clustered B+ tree on join column(s) good for Sort-Merge.

7-17
Example 1

SELECT e.Ename, w.Dur
FROM Emp e, Works w
WHERE w.Resp=‘Mgr’
AND e.Eno=w.Eno
■ Hash index on w.Resp supports ‘Mgr’ selection.
■ Hash index on e.Eno allows us to get matching (inner) Emp
tuples for each selected (outer) Works tuple.
■ What if WHERE included: “AND e.Title=‘Programmer’”?
● Could retrieve Emp tuples using index on e.Title, then join with
Works tuples satisfying Resp selection.

7-18
Example 2
SELECT e.Ename, w.Resp
FROM Emp e, Works w
WHERE e.Age BETWEEN 45 AND 60
AND e.Title=‘Programmer’
AND e.Eno=w.Eno
■ Clearly, Emp should be the outer relation.
● Suggests that we build a hash index on w.Eno.
■ What index should we build on Emp?
● B+ tree on e.Age could be used, OR an index on e.Title could be used.
Only one of these is needed, and which is better depends upon the
selectivity of the conditions.
➠ As a rule of thumb, equality selections are more selective than range selections.
■ As both examples indicate, our choice of indexes is guided by
the plan(s) that we expect an optimizer to consider for a query.
Have to understand optimizers!
7-19
Examples of Clustering

SELECT e.Title
FROM Emp e
WHERE e.Age > 40
■ B+ tree index on e.Age can be used to get
qualifying tuples.
● How selective is the condition?
● Is the index clustered?

7-20
Clustering and Joins
SELECT e.Ename, p.Pname
FROM Emp e, Project p
WHERE p.Budget=350000
AND e.City=p.City
■ Clustering is especially important when accessing inner tuples
in Index Nested Loop join.
● Should make index on e.City clustered.
■ Suppose that the WHERE clause is instead:
WHERE e.Title=‘Programmer’ AND e.City=p.City
● If many employees are Programmers, Sort-Merge join may be worth
considering. A clustered index on p.City would help.
■ Summary: Clustering is useful whenever many tuples are to
be retrieved.
7-21
Selecting Alternatives
SELECT Ename
FROM Emp e,Works w
WHERE e.Eno = w.Eno
AND w.Dur > 37
Strategy 1
ΠENAME(σDUR>37∧EMP.ENO=WORKS.ENO (Emp × Works))
Strategy 2
ΠENAME(Emp ⋈ENO (σDUR>37 (Works)))

■ Strategy 2 is “better” because
● It avoids Cartesian product
● It selects a subset of Works before joining
■ How to determine the “better” alternative?
7-22
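One rough way to compare the two strategies is to count the intermediate tuples each one touches. The sketch below uses made-up data (100 employees, 300 Works tuples) and assumes Eno is a key of Emp; it is not a cost model, only a size comparison.

```python
# Hypothetical data: Dur takes the values 30..39, so Dur > 37 keeps 2 in 10.
emp = [{"Eno": i, "Ename": f"emp{i}"} for i in range(100)]
works = [{"Eno": i % 100, "Dur": 30 + (i % 10)} for i in range(300)]

# Strategy 1: Cartesian product first, then select.
product_size = len(emp) * len(works)
s1 = [e["Ename"] for e in emp for w in works
      if w["Dur"] > 37 and e["Eno"] == w["Eno"]]

# Strategy 2: select on Works first, then join.
selected = [w for w in works if w["Dur"] > 37]
by_eno = {e["Eno"]: e for e in emp}      # assumes Eno is a key of Emp
s2 = [by_eno[w["Eno"]]["Ename"] for w in selected]

print(product_size, len(selected), sorted(s1) == sorted(s2))
```

Strategy 1 examines 30000 combined tuples, Strategy 2 joins only the 60 Works tuples that survive the selection, and both return the same names.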
Query Optimization Issues –
Types of Optimizers
■ “Exhaustive” search
● cost-based
● optimal
● combinatorial complexity in the number of relations
■ Heuristics
● not optimal
● regroup common sub-expressions
● perform selection, projection as early as possible
● reorder operations to reduce intermediate relation size
● optimize individual operations
7-23
Query Optimization Issues –
Optimization Granularity

■ Single query at a time
● cannot use common intermediate results
■ Multiple queries at a time
● efficient if many similar queries
● decision space is much larger

7-24
Query Optimization Issues –
Optimization Timing
■ Static
● compilation ⇒ optimize prior to the execution
● difficult to estimate the size of the intermediate results
⇒ error propagation
● can amortize over many executions
■ Dynamic
● run time optimization
● exact information on the intermediate relation sizes
● have to reoptimize for multiple executions
■ Hybrid
● compile using a static algorithm
● if the error in estimated sizes > threshold, reoptimize at
run time
7-25
Query Optimization Issues –
Statistics
■ Relation
● cardinality
● size of a tuple
● fraction of tuples participating in a join with another
relation
● …
■ Attribute
● cardinality of domain
● actual number of distinct values
● …
■ Common assumptions
● independence between different attribute values
● uniform distribution of attribute values within their
domain
7-26
Query Optimization Components

■ Cost function (in terms of time)
● I/O cost + CPU cost
● These might have different weights
● Can also maximize throughput
■ Solution space
● The set of equivalent algebra expressions (query trees).
■ Search algorithm
● How do we move inside the solution space?
● Exhaustive search, heuristic algorithms (iterative
improvement, simulated annealing, genetic,…)
7-27
Cost Calculation

■ Cost function takes CPU and I/O processing into
account
● Instruction and I/O path lengths
■ Estimate the cost of executing each node of the
query tree
● Is pipelining used or are temporary relations created?
■ Estimate the size of the result of each node
● Selectivity of operations – “reduction factor”
● Error propagation is possible

7-28
Intermediate Relation Sizes
Selection
size(R) = card(R) ∗ length(R)
card(σF(R)) = SFσ(F) ∗ card(R)
where
SFσ(A = value) = 1 / card(ΠA(R))
SFσ(A > value) = (max(A) – value) / (max(A) – min(A))
SFσ(A < value) = (value – min(A)) / (max(A) – min(A))
SFσ(p(Ai) ∧ p(Aj)) = SFσ(p(Ai)) ∗ SFσ(p(Aj))
SFσ(p(Ai) ∨ p(Aj)) = SFσ(p(Ai)) + SFσ(p(Aj)) – (SFσ(p(Ai)) ∗ SFσ(p(Aj)))
SFσ(A ∈ {values}) = SFσ(A = value) ∗ card({values})
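The selectivity formulas translate directly into small functions. The function names and the statistics used below (Dur ranging over [0, 48], five distinct Resp values, 10000 Works tuples) are assumptions for illustration only.

```python
def sf_eq(distinct_values):
    """SF(A = value) = 1 / card(Π_A(R)), assuming uniformly distributed values."""
    return 1 / distinct_values

def sf_gt(value, lo, hi):
    """SF(A > value) = (max(A) - value) / (max(A) - min(A))."""
    return (hi - value) / (hi - lo)

def sf_lt(value, lo, hi):
    """SF(A < value) = (value - min(A)) / (max(A) - min(A))."""
    return (value - lo) / (hi - lo)

def sf_and(sf1, sf2):
    """Conjunction, assuming the two predicates are independent."""
    return sf1 * sf2

def sf_or(sf1, sf2):
    """Disjunction, assuming the two predicates are independent."""
    return sf1 + sf2 - sf1 * sf2

def sf_in(distinct_values, n_values):
    """SF(A ∈ {values}) = SF(A = value) * card({values})."""
    return sf_eq(distinct_values) * n_values

card_works = 10_000
# Estimate card(σ Dur>36 ∧ Resp='Mgr' (Works)) under the assumed statistics.
sf = sf_and(sf_gt(36, 0, 48), sf_eq(5))
estimate = sf * card_works
print(sf, estimate)
```

Under these assumptions the range predicate keeps a quarter of the tuples and the equality predicate a fifth, so the estimate is about 500 tuples.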

7-29
Intermediate Relation Sizes
Projection
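card(ΠA(R)) = card(R) (when duplicates are kept or A contains a key of R; with duplicate elimination card(R) is only an upper bound)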

Cartesian Product
card(R × S) = card(R) ∗ card(S)

Union
upper bound: card(R ∪ S) = card(R) + card(S)
lower bound: card(R ∪ S) = max{card(R), card(S)}

Set Difference
upper bound: card(R–S) = card(R)
lower bound: 0

7-30
Intermediate Relation Size

Join
● Special case: A is a key of R and B is a foreign key of S:
card(R ⋈A=B S) = card(S)
● More general:
card(R ⋈ S) = SFJ ∗ card(R) ∗ card(S)
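The two join estimates can be written as small functions. The names and the cardinalities below are assumptions; the example also shows that the key/foreign-key case corresponds to a join selectivity factor of 1/card(R) in the general formula.

```python
def card_join_fk(card_s):
    """A is a key of R and B a foreign key of S: every S tuple matches one R tuple."""
    return card_s

def card_join(sf_j, card_r, card_s):
    """General case: join selectivity factor times the Cartesian product size."""
    return sf_j * card_r * card_s

card_emp, card_works = 400, 1000   # hypothetical statistics
special = card_join_fk(card_works)
general = card_join(1 / card_emp, card_emp, card_works)
print(special, general)
```

Both estimates come out at (approximately) 1000 tuples, the cardinality of Works.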

7-31
Search Space

■ Characterized by “equivalent” query plans
● Equivalence is defined in terms of equivalent query
results
■ Equivalent plans are generated by means of
algebraic transformation rules
■ The cost of each plan may be different
■ Focus on joins

7-32
Search Space – Join Trees

■ For N relations, there are O(N!) equivalent join trees that can be
obtained by applying commutativity and associativity rules
SELECT Ename,Resp
FROM Emp, Works, Project
WHERE Emp.Eno=Works.Eno
AND Works.Pno=Project.Pno

Three of the equivalent join trees:
● (Emp ⋈ Works) ⋈ Project
● (Project ⋈ Works) ⋈ Emp
● (Project × Emp) ⋈ Works
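The growth of the search space can be made concrete by counting join orders. In this sketch every permutation of the relations is one linear (left-deep) order, giving N! of them; the closed form used in `bushy_tree_count` for ordered binary trees, (2(N-1))!/(N-1)!, is a standard combinatorial result and is not taken from the slides.

```python
from itertools import permutations
from math import factorial

def left_deep_orders(relations):
    """Every permutation is one left-deep (linear) join order."""
    return list(permutations(relations))

def bushy_tree_count(n):
    """Ordered binary join trees over n relations: (2(n-1))! / (n-1)!."""
    return factorial(2 * (n - 1)) // factorial(n - 1)

orders = left_deep_orders(["Emp", "Works", "Project"])
print(len(orders))                      # 3! = 6
for n in (3, 5, 10):
    print(n, factorial(n), bushy_tree_count(n))
```

Already at ten relations there are millions of linear orders, which is why practical optimizers restrict or sample the space instead of enumerating it.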
7-33
Transformation Rules
■ Commutativity of binary operations
● R×S⇔S×R
● R⋈S⇔S⋈R
● R∪S⇔S∪R
■ Associativity of binary operations
● ( R × S ) × T ⇔ R × (S × T)
● ( R ⋈ S ) ⋈ T ⇔ R ⋈ (S ⋈ T )
■ Idempotence of unary operations
● ΠA'(ΠA"(R)) ⇔ ΠA'(R)
● σp1(A1)(σp2(A2)(R)) ⇔ σp1(A1) ∧ p2(A2)(R)
where R[A] and A' ⊆ A, A" ⊆ A and A' ⊆ A"
7-34
Transformation Rules
■ Commuting selection with projection
■ Commuting selection with binary operations
● σp(A)(R × S) ⇔ (σp(A) (R)) × S
● σp(Ai)(R ⋈(Aj,Bk) S) ⇔ (σp(Ai) (R)) ⋈(Aj,Bk) S
● σp(Ai)(R ∪ T) ⇔ σp(Ai) (R) ∪ σp(Ai) (T)
where Ai belongs to R and T
■ Commuting projection with binary operations
● ΠC(R × S) ⇔ ΠA'(R) × ΠB'(S)
● ΠC(R ⋈(Aj,Bk) S) ⇔ ΠA'(R) ⋈(Aj,Bk) ΠB'(S)
● ΠC(R ∪ S) ⇔ ΠC(R) ∪ ΠC(S)
where R[A] and S[B]; C = A' ∪ B' where A' ⊆ A, B' ⊆ B, Aj ⊆ A',
Bk ⊆ B'
7-35
Example
Consider the query:
Find the names of employees other than J.
Doe who worked on the CAD/CAM project
for either one or two years.

SELECT Ename
FROM Project p, Works w,
Emp e
WHERE w.Eno=e.Eno
AND w.Pno=p.Pno
AND Ename<>‘J. Doe’
AND p.Pname=‘CAD/CAM’
AND (Dur=12 OR Dur=24)

Query tree, from root to leaves:
Projection: ΠENAME
Selections: σDUR=12 OR DUR=24, then σPNAME=“CAD/CAM”, then σENAME≠“J. DOE”
Join: ⋈ combining Project, Works and Emp

7-36
Equivalent Query
ΠEname

σPname=‘CAD/CAM’ ∧ (Dur=12 ∨ Dur=24) ∧ Ename<>‘J. Doe’

Works Project Emp

7-37
Another Equivalent Query
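[Figure: query tree, written here as an algebra expression; the join operators are inferred from the projected attributes.]
ΠEname(
  ΠPno(σPname=‘CAD/CAM’(Project))
  ⋈Pno
  ΠPno,Ename(
    ΠPno,Eno(σDur=12 ∨ Dur=24(Works))
    ⋈Eno
    ΠEno,Ename(σEname<>‘J. Doe’(Emp))))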

7-38
Search Strategy
■ How to “move” in the search space.
■ Deterministic
● Start from base relations and build plans by adding one
relation at each step
● Dynamic programming: breadth-first
● Greedy: depth-first
■ Randomized
● Search for optima around a particular starting point
● Trade optimization time for execution time
● Better when there are more than 5-6 relations
● Simulated annealing
● Iterative improvement
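The two deterministic strategies can be contrasted on a toy cost model. Everything below is an assumption made for the sketch: the cardinalities, the join selectivities, the restriction to linear plans, and the cost measure (sum of estimated intermediate result sizes). The dynamic-programming version keeps the best plan for every subset of relations, level by level; the greedy version commits to one choice at each step.

```python
from itertools import combinations

card = {"Emp": 400, "Works": 1000, "Project": 50}          # hypothetical
sel = {frozenset(("Emp", "Works")): 1 / 400,               # hypothetical
       frozenset(("Works", "Project")): 1 / 50}            # missing pair = product

def join_size(size, joined, new):
    """Estimated size after joining relation `new` to the set `joined`."""
    out = size * card[new]
    for r in joined:
        out *= sel.get(frozenset((r, new)), 1.0)
    return out

def dp_left_deep(relations):
    """Best (cost, size, order) for every subset, built by increasing subset size."""
    best = {frozenset([r]): (0.0, card[r], (r,)) for r in relations}
    for k in range(2, len(relations) + 1):
        for subset in map(frozenset, combinations(relations, k)):
            for new in subset:
                cost, size, order = best[subset - {new}]
                new_size = join_size(size, subset - {new}, new)
                cand = (cost + new_size, new_size, order + (new,))
                if subset not in best or cand[0] < best[subset][0]:
                    best[subset] = cand
    return best[frozenset(relations)]

def greedy(relations):
    """Start from the smallest relation and always add the cheapest next join."""
    start = min(relations, key=card.get)
    order, size, cost = [start], card[start], 0.0
    remaining = set(relations) - {start}
    while remaining:
        new = min(remaining, key=lambda r: join_size(size, order, r))
        size = join_size(size, order, new)
        cost += size
        order.append(new)
        remaining.remove(new)
    return cost, size, tuple(order)

rels = ["Emp", "Works", "Project"]
print(dp_left_deep(rels))
print(greedy(rels))
```

On this input both strategies avoid the Emp × Project product and reach a cost of about 2000; greedy is not guaranteed to match dynamic programming in general.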

7-39
Search Algorithms
■ Restrict the search space
● Use heuristics
➠ E.g., Perform unary operations before binary operations
● Restrict the shape of the join tree
➠ Consider only linear trees, ignore bushy ones
Linear join tree: ((R1 ⋈ R2) ⋈ R3) ⋈ R4
Bushy join tree: (R1 ⋈ R2) ⋈ (R3 ⋈ R4)

7-40
Search Strategies
■ Deterministic: build the plan one relation at a time
R1 ⋈ R2 → (R1 ⋈ R2) ⋈ R3 → ((R1 ⋈ R2) ⋈ R3) ⋈ R4
■ Randomized: move between neighboring plans
(R1 ⋈ R2) ⋈ R3 ↔ (R1 ⋈ R3) ⋈ R2

7-41
Summary
■ Declarative SQL queries need to be converted into
low level execution plans
■ These plans need to be optimized to find the
“best” plan
■ Optimization involves
● Search space: identifies the alternative plans and
alternative execution algorithms for algebra operators
➠ This is done by means of transformation rules
● Cost function: calculates the cost of executing each plan
➠ CPU and I/O costs
● Search algorithm: controls which alternative plans are
investigated
7-42
