Query Optimization
Introduction
• Alternative ways of evaluating a given query
– Equivalent expressions
– Different algorithms for each operation
Introduction (Cont.)
• An evaluation plan defines exactly what algorithm is used for each
operation, and how the execution of the operations is coordinated.
■ Find out how to view query execution plans on your favorite database
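As one way to try this, the sketch below asks SQLite for a plan through Python's standard sqlite3 module. The table and index are hypothetical, and EXPLAIN QUERY PLAN is SQLite-specific; other systems expose plans through their own commands, and the plan text differs between versions.

```python
import sqlite3

# In-memory database with a hypothetical instructor table (illustration only).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE instructor (ID INTEGER PRIMARY KEY, name TEXT, dept_name TEXT)"
)
conn.execute("CREATE INDEX idx_dept ON instructor(dept_name)")

# Ask SQLite to describe how it would evaluate the query, without running it.
plan_rows = conn.execute(
    "EXPLAIN QUERY PLAN SELECT name FROM instructor WHERE dept_name = 'Music'"
).fetchall()

for row in plan_rows:
    # The last column holds a human-readable description of one plan step.
    print(row[-1])
```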
Introduction (Cont.)
■ Cost difference between evaluation plans for a query can be enormous
● E.g. seconds vs. days in some cases
■ Steps in cost-based query optimization
1. Generate logically equivalent expressions using equivalence rules
2. Annotate resultant expressions to get alternative query plans
3. Choose the cheapest plan based on estimated cost
■ Estimation of plan cost based on:
● Statistical information about relations. Examples:
number of tuples, number of distinct values for an attribute
● Statistics estimation for intermediate results
to compute cost of complex expressions
● Cost formulae for algorithms, computed using statistics
Generating Equivalent Expressions
Transformation of Relational Expressions
• Two relational algebra expressions are said to be equivalent if the two
expressions generate the same set of tuples on every legal database
instance
– Note: order of tuples is irrelevant
– we don’t care if they generate different results on databases that violate
integrity constraints
• In SQL, inputs and outputs are multisets of tuples
– Two expressions in the multiset version of the relational algebra are said
to be equivalent if the two expressions generate the same multiset of
tuples on every legal database instance.
• An equivalence rule says that expressions of two forms are equivalent
– Can replace expression of first form by second, or vice versa
Equivalence Rules
1. Conjunctive selection operations can be deconstructed into a sequence of
individual selections.
σθ1∧θ2(E) = σθ1(σθ2(E))
2. Selection operations are commutative.
σθ1(σθ2(E)) = σθ2(σθ1(E))
3. Only the last in a sequence of projection operations is needed, the others
can be omitted.
ΠL1(ΠL2(…(ΠLn(E))…)) = ΠL1(E)
4. Selections can be combined with Cartesian products and theta joins.
a. σθ(E1 × E2) = E1 ⋈θ E2
ΠL1∪L2(E1 ⋈θ E2) = ΠL1∪L2((ΠL1∪L3(E1)) ⋈θ (ΠL2∪L4(E2)))
Equivalence Rules (Cont.)
9. The set operations union and intersection are commutative
E1 ∪ E2 = E2 ∪ E1
E1 ∩ E2 = E2 ∩ E1
■ (set difference is not commutative).
10. Set union and intersection are associative.
(E1 ∪ E2) ∪ E3 = E1 ∪ (E2 ∪ E3)
(E1 ∩ E2) ∩ E3 = E1 ∩ (E2 ∩ E3)
11. The selection operation distributes over ∪, ∩ and –.
σθ (E1 – E2) = σθ (E1) – σθ(E2)
and similarly for ∪ and ∩ in place of –
Also: σθ (E1 – E2) = σθ(E1) – E2
and similarly for ∩ in place of –, but not for ∪
12. The projection operation distributes over union
ΠL(E1 ∪ E2) = (ΠL(E1)) ∪ (ΠL(E2))
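Rule 11 can be spot-checked on small sets. The sketch below is only a sanity check on one made-up instance (the sets E1, E2 and the predicate theta are arbitrary assumptions), not a proof of the rule.

```python
def select(pred, rel):
    """Selection: keep the tuples of rel that satisfy pred."""
    return {t for t in rel if pred(t)}

# Arbitrary sample relations over a single integer attribute.
E1 = {1, 2, 3, 4, 5, 6}
E2 = {4, 5, 6, 7, 8}
theta = lambda t: t % 2 == 0  # selection predicate: even values

# Selection distributes over set difference.
lhs_diff = select(theta, E1 - E2)
rhs_diff = select(theta, E1) - select(theta, E2)

# The shortcut form: select only on the first operand.
rhs_diff_short = select(theta, E1) - E2

# The same shortcut does NOT hold for union.
lhs_union = select(theta, E1 | E2)
bad_union = select(theta, E1) | E2
```

On this instance the three difference expressions agree, while the union shortcut lets odd values from E2 leak into the result.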
Transformation Example: Pushing Selections
• Query: Find the names of all instructors in the Music department, along with
the titles of the courses that they teach
– Πname, title(σdept_name = “Music”
(instructor ⋈ (teaches ⋈ Πcourse_id, title(course))))
• Transformation using rule 7a.
(r1 ⋈ r2) ⋈ r3
so that we compute and store a smaller temporary relation.
Join Ordering Example (Cont.)
• Consider the expression
Πname, title((σdept_name = “Music”(instructor) ⋈ teaches)
⋈ Πcourse_id, title(course))
• Could compute teaches ⋈ Πcourse_id, title(course) first, and join result with
σdept_name= “Music” (instructor)
but the result of the first join is likely to be a large relation.
• Only a small fraction of the university’s instructors are likely to be from the
Music department
– it is better to compute
σdept_name = “Music”(instructor) ⋈ teaches
first.
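The effect of join order on intermediate result size can be seen on toy data. The sketch below uses made-up relations (10 instructors, one of them in Music, each teaching 2 courses) and a naive nested-loop join; the numbers are illustrative assumptions only.

```python
def join(r, s, key):
    """Naive nested-loop equi-join of two lists of dicts on a shared attribute."""
    return [{**a, **b} for a in r for b in s if a[key] == b[key]]

# Hypothetical sample data: only instructor 0 is in the Music department.
instructor = [
    {"ID": i, "name": f"inst{i}", "dept_name": "Music" if i == 0 else "Physics"}
    for i in range(10)
]
teaches = [{"ID": i, "course_id": 100 + 2 * i + k} for i in range(10) for k in range(2)]
course = [{"course_id": 100 + c, "title": f"course{c}"} for c in range(20)]

music = [t for t in instructor if t["dept_name"] == "Music"]

# Plan A: join teaches with course first, then with the Music instructors.
temp_a = join(teaches, course, "course_id")
result_a = join(music, temp_a, "ID")

# Plan B: join the Music instructors with teaches first, then with course.
temp_b = join(music, teaches, "ID")
result_b = join(temp_b, course, "course_id")

print(len(temp_a), len(temp_b))  # intermediate sizes: 20 vs. 2
```

Both plans return the same tuples, but plan B materializes a temporary relation one tenth the size of plan A's.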
Enumeration of Equivalent Expressions
• Query optimizers use equivalence rules to systematically
generate expressions equivalent to the given expression
• Can generate all equivalent expressions as follows:
– Repeat
• apply all applicable equivalence rules on every subexpression of every
equivalent expression found so far
• add newly generated expressions to the set of equivalent expressions
Until no new equivalent expressions are generated above
• The above approach is very expensive in space and time
– Two approaches
• Optimized plan generation based on transformation rules
• Special case approach for queries with only selections, projections and
joins
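The repeat-until-no-change procedure above can be sketched for join-only expressions. This is a minimal illustration, assuming just two rules (join commutativity and one direction of join associativity) and a made-up tuple representation of expression trees; a real optimizer would use many more rules and share subtrees.

```python
def rewrites(expr):
    """Yield every expression reachable by applying one rule at one position."""
    if isinstance(expr, str):  # a base relation has no rewrites
        return
    _, left, right = expr
    # Join commutativity at the root.
    yield ("join", right, left)
    # Join associativity at the root: (a ⋈ b) ⋈ c  ->  a ⋈ (b ⋈ c).
    if not isinstance(left, str):
        _, a, b = left
        yield ("join", a, ("join", b, right))
    # Apply the rules inside each subexpression.
    for new_left in rewrites(left):
        yield ("join", new_left, right)
    for new_right in rewrites(right):
        yield ("join", left, new_right)

def enumerate_equivalent(expr):
    """Repeat rule application until no new equivalent expression appears."""
    found = {expr}
    frontier = [expr]
    while frontier:
        current = frontier.pop()
        for candidate in rewrites(current):
            if candidate not in found:
                found.add(candidate)
                frontier.append(candidate)
    return found

plans = enumerate_equivalent(("join", ("join", "r1", "r2"), "r3"))
print(len(plans))  # 12 join orders for three relations
```

Even for three relations the set has 12 members, which hints at why exhaustive enumeration becomes expensive as the number of relations grows.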
Implementing Transformation Based Optimization
• Space requirements reduced by sharing common sub-expressions:
– when E1 is generated from E2 by an equivalence rule, usually only the top level of the two
is different; the subtrees below are the same and can be shared using pointers
• E.g. when applying join commutativity to E1 ⋈ E2
• Equi-width histograms
• Equi-depth histograms
Selection Size Estimation
• σA=v(r)
• nr / V(A,r) : number of records that will satisfy the selection
• Equality condition on a key attribute: size estimate = 1
• σA≤v(r) (case of σA≥v(r) is symmetric)
– Let c denote the estimated number of tuples satisfying the
condition.
– If min(A,r) and max(A,r) are available in catalog
• c = 0 if v < min(A,r)
• c = nr ∗ (v − min(A,r)) / (max(A,r) − min(A,r))
– If histograms available, can refine above estimate
– In absence of statistical information c is assumed to be nr / 2.
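The estimates above translate directly into code. The following is a sketch under stated assumptions: the function and parameter names are made up, and capping the range estimate at nr when v ≥ max(A,r) is an added assumption not stated in the bullets.

```python
def estimate_equality(n_r, distinct_values, is_key=False):
    """Estimated size of σA=v(r): 1 for a key attribute, else nr / V(A,r)."""
    if is_key:
        return 1
    return n_r / distinct_values

def estimate_less_equal(n_r, v, min_a=None, max_a=None):
    """Estimated size of σA≤v(r) from catalog min/max values, if available."""
    if min_a is None or max_a is None:
        return n_r / 2  # no statistics: assume half the tuples qualify
    if v < min_a:
        return 0
    if v >= max_a:
        return n_r  # assumption: every tuple qualifies, so cap at nr
    return n_r * (v - min_a) / (max_a - min_a)

print(estimate_equality(10000, 50))            # 200.0
print(estimate_less_equal(1000, 50, 0, 100))   # 500.0
```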
Size Estimation of Complex Selections
• The selectivity of a condition θi is the probability that a tuple
in the relation r satisfies θi .
– If si is the number of satisfying tuples in r, the selectivity of θi is
given by si /nr.
• Conjunction: σθ1∧θ2∧…∧θn(r). Assuming independence, the estimated number of
tuples in the result is: nr ∗ (s1 ∗ s2 ∗ … ∗ sn) / (nr)^n
• Disjunction: σθ1∨θ2∨…∨θn(r). Estimated number of tuples:
nr ∗ (1 − (1 − s1/nr) ∗ (1 − s2/nr) ∗ … ∗ (1 − sn/nr))
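A small sketch of the two estimates, assuming independent conditions as above. The function names are made up; `sizes` holds the values s1 … sn and `n_r` is the number of tuples in r.

```python
from math import prod

def estimate_conjunction(n_r, sizes):
    """nr times the product of the individual selectivities si / nr."""
    return n_r * prod(s / n_r for s in sizes)

def estimate_disjunction(n_r, sizes):
    """nr times the probability that at least one condition holds."""
    return n_r * (1 - prod(1 - s / n_r for s in sizes))

# Example: 1000 tuples, two conditions satisfied by 100 and 500 tuples.
print(estimate_conjunction(1000, [100, 500]))  # about 50
print(estimate_disjunction(1000, [100, 500]))  # about 550
```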
■ Nested Subqueries
■ Materialized Views
Optimizing Nested Subqueries**
• Nested query example:
select name
from instructor
where exists (select *
from teaches
where instructor.ID = teaches.ID and
teaches.year = 2007)
• SQL conceptually treats nested subqueries in the where
clause as functions that take parameters and return a single
value or set of values
– Parameters are variables from outer level query that are used in
the nested subquery; such variables are called correlation
variables
• Conceptually, nested subquery is executed once for each
tuple in the cross-product generated by the outer level from
clause
– Such evaluation is called correlated evaluation
– Note: other conditions in where clause may be used to compute a
join (instead of a cross-product) before executing the nested
subquery
Optimizing Nested Subqueries (Cont.)
• Correlated evaluation may be quite inefficient since
– a large number of calls may be made to the nested query
– there may be unnecessary random I/O as a result
• SQL optimizers attempt to transform nested subqueries to
joins where possible, enabling use of efficient join
techniques
• E.g.: earlier nested query can be rewritten as
select name
from instructor, teaches
where instructor.ID = teaches.ID and teaches.year = 2007
– Note: the two queries generate different numbers of duplicates
(why?)
• teaches can have duplicate IDs
• Can be modified to handle duplicates correctly as we will see
• In general, it is not possible/straightforward to move the
nested subquery's entire from clause into the outer level
query's from clause
– A temporary relation is created instead, and used in body of outer
level query
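The duplicate issue noted above can be reproduced with SQLite through Python's sqlite3 module. The rows below are made-up sample data in which one instructor teaches two sections in 2007.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE instructor (ID INTEGER, name TEXT);
    CREATE TABLE teaches (ID INTEGER, course_id TEXT, year INTEGER);
    INSERT INTO instructor VALUES (1, 'Mozart'), (2, 'Einstein');
    INSERT INTO teaches VALUES
        (1, 'MU-199', 2007), (1, 'MU-201', 2007), (2, 'PHY-101', 2008);
""")

# Nested form: each instructor appears at most once.
nested = conn.execute("""
    SELECT name FROM instructor
    WHERE EXISTS (SELECT * FROM teaches
                  WHERE instructor.ID = teaches.ID AND teaches.year = 2007)
""").fetchall()

# Naive join rewrite: one output row per matching teaches tuple.
joined = conn.execute("""
    SELECT name FROM instructor, teaches
    WHERE instructor.ID = teaches.ID AND teaches.year = 2007
""").fetchall()

print(nested)  # [('Mozart',)]
print(joined)  # [('Mozart',), ('Mozart',)]
```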
Optimizing Nested Subqueries (Cont.)
In general, SQL queries of the form below can be rewritten as
shown
• Rewrite: select …
from L1
where P1 and exists (select *
from L2
where P2)
• To: create table t1 as
select distinct V
from L2
where P2^1
– P2^1 contains predicates in P2 that do not involve any
correlation variables
• In the earlier example, the outer query then becomes
select name
from instructor, t1
where t1.ID = instructor.ID
• The process of replacing a nested query by a query
with a join (possibly with a temporary relation) is
called decorrelation.
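Decorrelation of the earlier example can be tried in SQLite. This is a sketch on made-up sample rows: the temporary relation t1 keeps the distinct IDs satisfying the uncorrelated predicate (year = 2007), and the correlated predicate becomes the join condition.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE instructor (ID INTEGER, name TEXT);
    CREATE TABLE teaches (ID INTEGER, course_id TEXT, year INTEGER);
    INSERT INTO instructor VALUES (1, 'Mozart'), (2, 'Einstein'), (3, 'Kim');
    INSERT INTO teaches VALUES
        (1, 'MU-199', 2007), (1, 'MU-201', 2007),
        (2, 'PHY-101', 2008), (3, 'EE-181', 2007);
    -- Temporary relation: one row per ID that taught in 2007.
    CREATE TABLE t1 AS SELECT DISTINCT ID FROM teaches WHERE year = 2007;
""")

# Decorrelated form: plain join with the temporary relation.
decorrelated = sorted(conn.execute(
    "SELECT name FROM instructor, t1 WHERE t1.ID = instructor.ID"
).fetchall())

# Original nested form, for comparison.
nested = sorted(conn.execute("""
    SELECT name FROM instructor
    WHERE EXISTS (SELECT * FROM teaches
                  WHERE instructor.ID = teaches.ID AND teaches.year = 2007)
""").fetchall())

print(decorrelated == nested)  # True
```

Because t1 is built with select distinct, the join produces each instructor once, matching the nested query's duplicate behaviour on this data.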
Optimizing Nested Subqueries (Cont.)
• Decorrelation is more complicated when
– the nested subquery uses aggregation, or
– when the result of the nested subquery is used to test for equality,
or
– when the condition linking the nested subquery to the other
query is not exists,
– and so on.
Materialized Views**
• A materialized view is a view whose contents are
computed and stored.
• Consider the view
create view department_total_salary(dept_name,
total_salary) as
select dept_name, sum(salary)
from instructor
group by dept_name
• Materializing the above view would be very useful if the total
salary by department is required frequently
– Saves the effort of finding multiple tuples and adding up their
amounts
Materialized View Maintenance
• The task of keeping a materialized view up-to-date with the
underlying data is known as materialized view
maintenance
• Materialized views can be maintained by recomputation on
every update
• A better option is to use incremental view maintenance
– Changes to database relations are used to compute changes
to the materialized view, which is then updated
• View maintenance can be done by
– Manually defining triggers on insert, delete, and update of each
relation in the view definition
– Manually written code to update the view whenever database
relations are updated
– Periodic recomputation (e.g. nightly)
– Above methods are directly supported by many database systems
• Avoids manual effort/correctness issues
Incremental View Maintenance
• The changes (inserts and deletes) to a relation or
expressions are referred to as its differential
– Set of tuples inserted to and deleted from r are denoted ir and dr
• To simplify our description, we only consider inserts and
deletes
– We replace updates to a tuple by deletion of the tuple followed by
insertion of the updated tuple
• We describe how to compute the change to the result of
each relational operation, given changes to its inputs
• We then outline how to handle relational algebra
expressions
Join Operation
• Consider the materialized view v = r ⋈ s and an update to r
• Let rold and rnew denote the old and new states of relation r
• Consider the case of an insert to r:
– We can write rnew ⋈ s as (rold ∪ ir) ⋈ s
– And rewrite the above to (rold ⋈ s) ∪ (ir ⋈ s)
– But (rold ⋈ s) is simply the old value of the materialized view, so the
incremental change to the view is just ir ⋈ s
• Thus, for inserts vnew = vold ∪ (ir ⋈ s)
• Similarly for deletes vnew = vold – (dr ⋈ s)
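• Example: r = {(A, 1), (B, 2)}, s = {(1, p), (2, r), (2, s)}
– r ⋈ s = {(A, 1, p), (B, 2, r), (B, 2, s)}
– Inserting (C, 2) into r adds (C, 2, r) and (C, 2, s) to the view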
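The insert and delete rules for joins can be checked on the small relations from the example above. This is a sketch with relations as Python sets of tuples; the helper name natural_join and the deleted tuple are assumptions made for illustration.

```python
def natural_join(r, s):
    """Join r(X, B) with s(B, Y) on the shared attribute B."""
    return {(x, b, y) for (x, b) in r for (b2, y) in s if b == b2}

r_old = {("A", 1), ("B", 2)}
s = {(1, "p"), (2, "r"), (2, "s")}
v_old = natural_join(r_old, s)

# Insert: only the new tuples are joined with s.
i_r = {("C", 2)}
v_new = v_old | natural_join(i_r, s)

# Delete: remove the join of the deleted tuples with s.
d_r = {("B", 2)}
v_after_delete = v_new - natural_join(d_r, s)

# Both incremental results match full recomputation.
print(v_new == natural_join(r_old | i_r, s))                   # True
print(v_after_delete == natural_join((r_old | i_r) - d_r, s))  # True
```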
Selection and Projection Operations
• Selection: Consider a view v = σθ(r).
– On inserts: vnew = vold ∪ σθ(ir)
– On deletes: vnew = vold – σθ(dr)
• Projection is a more difficult operation
– R = (A,B), and r(R) = { (a,2), (a,3)}
– ∏A(r) has a single tuple (a).
– If we delete the tuple (a,2) from r, we should not delete the tuple
(a) from ∏A(r), but if we then delete (a,3) as well, we should delete
the tuple
• For each tuple in a projection ∏A(r) , we will keep a count of
how many times it was derived
– On insert of a tuple to r, if the resultant tuple is already in ∏A(r) we
increment its count, else we add a new tuple with count = 1
– On delete of a tuple from r, we decrement the count of the
corresponding tuple in ∏A(r), and delete that tuple from ∏A(r) if its count becomes 0
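The counting scheme for projections can be sketched as follows. The class name and interface are hypothetical, and the sketch assumes every delete refers to a tuple that is actually in r.

```python
from collections import Counter

class ProjectionView:
    """Materialized projection on one attribute, with derivation counts."""

    def __init__(self, rows, attr_index):
        self.attr_index = attr_index
        self.counts = Counter(row[attr_index] for row in rows)

    def insert(self, row):
        # Counter starts missing keys at 0, so a new value ends up with count 1.
        self.counts[row[self.attr_index]] += 1

    def delete(self, row):
        key = row[self.attr_index]
        self.counts[key] -= 1
        if self.counts[key] == 0:
            del self.counts[key]  # last derivation gone: drop the tuple

    def tuples(self):
        return set(self.counts)

# The example from the text: r = {(a, 2), (a, 3)}, projected on A.
view = ProjectionView([("a", 2), ("a", 3)], attr_index=0)
view.delete(("a", 2))
after_first_delete = view.tuples()   # (a) is still derivable from (a, 3)
view.delete(("a", 3))
after_second_delete = view.tuples()  # now empty
```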
Aggregation Operations
• count: v = A𝒢count(B)(r)
– When a set of tuples ir is inserted
• For each tuple t in ir, if the group t.A is already present in v, we increment
its count, else we add a new tuple with count = 1
– When a set of tuples dr is deleted
• For each tuple t in dr, we look for the group t.A in v, and subtract 1 from the count for the
group.
– If the count becomes 0, we delete from v the tuple for the group t.A
• sum: v = A𝒢sum(B)(r)
– We maintain the sum in a manner similar to count, except we add/subtract the B value
instead of adding/subtracting 1 for the count
– Additionally we maintain the count in order to detect groups with no tuples. Such groups
are deleted from v
• Cannot simply test for sum = 0 (why?)
• To handle the case of avg, we maintain the sum and count
aggregate values separately, and divide at the end
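A sketch of maintaining sum, count and avg together, following the description above. The class and method names are hypothetical, and deletes are assumed to refer to tuples that exist in r.

```python
class SumCountView:
    """Per-group sum and count, maintained incrementally."""

    def __init__(self):
        self.groups = {}  # group value -> [sum of B, number of tuples]

    def insert(self, a, b):
        entry = self.groups.setdefault(a, [0, 0])
        entry[0] += b
        entry[1] += 1

    def delete(self, a, b):
        entry = self.groups[a]
        entry[0] -= b
        entry[1] -= 1
        if entry[1] == 0:
            del self.groups[a]  # the count, not the sum, detects empty groups

    def avg(self, a):
        total, count = self.groups[a]
        return total / count  # divide only when the average is requested

view = SumCountView()
view.insert("g1", 5)
view.insert("g1", -5)   # sum is now 0, but the group still has two tuples
sum_g1, count_g1 = view.groups["g1"]
view.insert("g2", 10)
view.insert("g2", 20)
view.delete("g2", 10)
```

The group g1 shows why testing sum = 0 is not enough: its sum is 0 while it still contains two tuples.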
Aggregation Operations (Cont.)
• min, max: v = A𝒢min(B)(r).
– Handling insertions on r is straightforward.
– Maintaining the aggregate values min and max on deletions may
be more expensive. We have to look at the other tuples of r that
are in the same group to find the new minimum
Other Operations
• Set intersection: v = r ∩ s
– when a tuple is inserted in r we check if it is present in s, and if so
we add it to v.
– If the tuple is deleted from r, we delete it from the intersection if it
is present.
– Updates to s are symmetric
– The other set operations, union and set difference are handled in a
similar fashion.
• Outer joins are handled in much the same way as joins but
with some extra work
– we leave details to you.
Handling Expressions
• To handle an entire expression, we derive expressions for
computing the incremental change to the result of each
sub-expression, starting from the smallest sub-expressions.
• E.g. consider E1 ⋈ E2 where each of E1 and E2 may be a
complex expression
– Suppose the set of tuples to be inserted into E1 is given by D1
• Computed earlier, since smaller sub-expressions are handled first
– Then the set of tuples to be inserted into E1 ⋈ E2 is given by
D1 ⋈ E2
• This is just the usual way of maintaining joins
Query Optimization and Materialized Views
• Rewriting queries to use materialized views:
– A materialized view v = r ⋈ s is available
– A user submits a query r ⋈ s ⋈ t
– We can rewrite the query as v ⋈ t
• Whether to do so depends on cost estimates for the two alternatives
• Replacing a use of a materialized view by the view
definition:
– A materialized view v = r ⋈ s is available, but without any index
on it
– User submits a query σA=10(v).
– Suppose also that s has an index on the common attribute B, and r
has an index on attribute A.
– The best plan for this query may be to replace v by r ⋈ s, which
can lead to the query plan σA=10(r) ⋈ s
• Query optimizer should be extended to consider all above
alternatives and choose the best overall plan
Materialized View Selection
• Materialized view selection: “What is the best set of views
to materialize?”.
• Index selection: “What is the best set of indices to create?”
– closely related to materialized view selection
• but simpler
• Materialized view selection and index selection based on
typical system workload (queries and updates)
– Typical goal: minimize time to execute workload, subject to
constraints on space and time taken for some critical
queries/updates
– One of the steps in database tuning
• more on tuning in later chapters
• Commercial database systems provide tools (called “tuning
assistants” or “wizards”) to help the database administrator
choose what indices and materialized views to create