15 Optimization
1 Overview
Because SQL is declarative, the query only tells the DBMS what to compute, but not how to compute
it. Thus, the DBMS needs to translate a SQL statement into an executable query plan. But there are
different ways to execute each operator in a query plan (e.g., join algorithms) and there will be differences
in performance among these plans. The job of the DBMS’s optimizer is to pick an optimal plan for any
given query.
The first query optimizer was implemented in IBM System R, which was designed in the 1970s. Prior to
this, people did not believe that a DBMS could ever construct a query plan better than a human. Many
concepts and design decisions from the System R optimizer are still in use today.
Query optimization is the most difficult part of building a DBMS. Some systems have attempted to apply
machine learning to improve the accuracy and efficiency of optimizers, but no major DBMS currently
deploys an optimizer based on this technique.
Figure 2: Predicate Pushdown – Instead of performing the filter after the join, the
filter can be applied earlier in order to pass fewer elements into the join.
Figure 3: Projection Pushdown – Since the query only asks for the student name
and ID, the DBMS can remove all columns except for those two before applying the
join.
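The effect of both pushdowns can be sketched in a few lines of Python. The tables, column names, and the naive nested-loop join below are assumptions made for illustration only; the point is that filtering and projecting before the join yields the same answer while the join compares fewer, narrower tuples.

```python
# Hypothetical toy tables; names and columns are assumptions for illustration.
students = [
    {"sid": 1, "name": "Ada", "age": 20},
    {"sid": 2, "name": "Bob", "age": 22},
    {"sid": 3, "name": "Cy", "age": 19},
]
enrolled = [
    {"sid": 1, "cid": "15-445", "grade": "A"},
    {"sid": 2, "cid": "15-445", "grade": "B"},
    {"sid": 3, "cid": "15-721", "grade": "A"},
]


def join(left, right, key):
    """Naive nested-loop join that also counts how many pairs it compares."""
    out, comparisons = [], 0
    for l in left:
        for r in right:
            comparisons += 1
            if l[key] == r[key]:
                out.append({**l, **r})
    return out, comparisons


# Plan 1: join first, then filter on grade and project to (sid, name).
joined, cost_late = join(students, enrolled, "sid")
late = [{"sid": t["sid"], "name": t["name"]} for t in joined if t["grade"] == "A"]

# Plan 2: push the filter and the projections below the join.
enrolled_a = [{"sid": e["sid"]} for e in enrolled if e["grade"] == "A"]
students_p = [{"sid": s["sid"], "name": s["name"]} for s in students]
early, cost_early = join(students_p, enrolled_a, "sid")

print(late == early, cost_late, cost_early)
```

Both plans return the same rows, but the second one feeds a smaller input into the join.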
5 Cost Estimations
DBMSs use cost models to estimate the cost of executing a plan. These models evaluate equivalent plans
for a query to help the DBMS select the best one.
The cost of a query depends on several underlying metrics split between physical and logical costs,
including:
• CPU: small cost, but tough to estimate.
• Disk I/O: the number of block transfers.
• Memory: the amount of DRAM used.
• Network: the number of messages sent.
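One simple way to combine these metrics is a weighted sum, sketched below. The `PlanEstimate` type, the weights, and the example numbers are all hypothetical; real systems tune or derive their cost formulas in system-specific ways, so this only illustrates how an optimizer might compare equivalent plans.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlanEstimate:
    """Estimated resource usage of one candidate plan (hypothetical type)."""
    name: str
    cpu_ops: int
    disk_blocks: int
    memory_bytes: int
    network_msgs: int


# Hypothetical weights: one block transfer is treated as the unit of cost.
WEIGHTS = {
    "cpu_ops": 0.001,
    "disk_blocks": 1.0,
    "memory_bytes": 0.000001,
    "network_msgs": 5.0,
}


def cost(plan):
    """Weighted sum of the plan's estimated resource usage."""
    return sum(weight * getattr(plan, field) for field, weight in WEIGHTS.items())


def cheapest(plans):
    """Pick the candidate with the lowest estimated cost."""
    return min(plans, key=cost)


plans = [
    PlanEstimate("nested_loop_join", 1_000_000, 5000, 10_000, 0),
    PlanEstimate("hash_join", 10_000, 200, 1_000_000, 0),
]
print(cheapest(plans).name)
```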
Exhaustive enumeration of all valid plans for a query is much too slow for an optimizer to perform. For
joins alone, which are commutative and associative, there are 4^n different orderings of every n-way join.
Optimizers must limit their search space in order to work efficiently.
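To get a feel for how quickly the search space grows, the sketch below counts join trees under one particular set of assumptions: every binary tree shape is allowed (bushy plans included), left and right inputs are distinguished, and each relation appears exactly once. The number of tree shapes with n leaves is a Catalan number, and Catalan numbers grow roughly like 4^n. The function names are made up for this example.

```python
from math import comb, factorial


def tree_shapes(n):
    """Number of binary join-tree shapes with n leaf relations: Catalan(n - 1)."""
    k = n - 1
    return comb(2 * k, k) // (k + 1)


def join_orderings(n):
    """Tree shapes times the n! ways of assigning relations to the leaves."""
    return tree_shapes(n) * factorial(n)


for n in (2, 4, 8, 12):
    print(n, tree_shapes(n), join_orderings(n))
```

Even for a dozen tables the count is far too large to enumerate, which is why optimizers prune the space.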
To approximate costs of queries, DBMSs maintain internal statistics about tables, attributes, and indexes in
their internal catalogs. Different systems maintain these statistics in different ways. Most systems attempt
to avoid on-the-fly computation by maintaining an internal table of statistics. These internal tables may
then be updated in the background.
Selection Statistics
The selection cardinality can be used to determine the number of tuples that will be selected for a given
input.
Equality predicates on unique keys are simple to estimate (see Figure 4). A more complex predicate is
shown in Figure 5.
The selectivity (sel) of a predicate P is the fraction of tuples that qualify. The formula used to compute
selectivity depends on the type of predicate. Selectivity for complex predicates is hard to estimate
accurately, which can pose a problem for certain systems. An example of a selectivity computation is shown in
Figure 6.
Observe that the selectivity of a predicate is equivalent to the probability of that predicate. This allows
probability rules to be applied in many selectivity computations. This is particularly useful when dealing
with complex predicates. For example, if we assume that multiple predicates involved in a conjunction are
independent, we can compute the total selectivity of the conjunction as the product of the selectivities of
the individual predicates.
Assumptions like this are often not satisfied by real data. For example, correlated attributes break the
assumption of independence of predicates.
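A small sketch makes the problem concrete. The car table below is a made-up data set in which the model determines the make, so the two predicates are fully correlated. Multiplying the individual selectivities under the independence assumption underestimates the true selectivity of the conjunction.

```python
def selectivity(rows, predicate):
    """Fraction of rows that satisfy the predicate."""
    return sum(1 for r in rows if predicate(r)) / len(rows)


# Hypothetical table: every Accord is a Honda, so make and model are correlated.
rows = (
    [{"make": "Honda", "model": "Accord"}] * 10
    + [{"make": "Honda", "model": "Civic"}] * 10
    + [{"make": "Toyota", "model": "Camry"}] * 20
)


def is_honda(r):
    return r["make"] == "Honda"


def is_accord(r):
    return r["model"] == "Accord"


sel_make = selectivity(rows, is_honda)
sel_model = selectivity(rows, is_accord)

# Independence assumption: sel(P1 AND P2) = sel(P1) * sel(P2).
estimated = sel_make * sel_model

# True selectivity of the conjunction, computed directly from the data.
actual = selectivity(rows, lambda r: is_honda(r) and is_accord(r))

print(estimated, actual)
```

Here the estimate is half the true value; with more correlated predicates the error compounds.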
6 Histograms
Real data is often skewed, which makes simplifying assumptions about it unreliable. However, storing every
single value of a data set is expensive. One way to reduce the amount of memory used is to store the data in
a histogram that groups values together. An example of a histogram with buckets is shown in Figure 7.
Figure 7: Equi-Width Histogram – The first figure shows the original frequency
count of the entire data set. The second figure is an equi-width histogram that
combines together the counts for adjacent keys to reduce the storage overhead.
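A minimal sketch of an equi-width histogram follows. It assumes integer keys in a known range that divides evenly into buckets, and it estimates the count of a single key by spreading a bucket's count uniformly over the keys it covers. The function names and the data are made up for illustration.

```python
def equi_width_histogram(values, lo, hi, num_buckets):
    """Count values per fixed-width bucket over the inclusive key range [lo, hi]."""
    width = (hi - lo + 1) // num_buckets  # assumes the range divides evenly
    counts = [0] * num_buckets
    for v in values:
        counts[(v - lo) // width] += 1
    return width, counts


def estimate_equal(width, counts, lo, key):
    """Estimated occurrences of key, assuming a uniform spread inside its bucket."""
    return counts[(key - lo) // width] / width


# Skewed hypothetical data: key 1 is far more common than its bucket neighbours.
values = [1] * 9 + [2] * 2 + [3] * 1 + [4] * 2 + [5] * 2 + [6] * 2

width, counts = equi_width_histogram(values, lo=1, hi=6, num_buckets=2)
print(counts)
print(estimate_equal(width, counts, 1, 1), values.count(1))
```

The histogram stores two counters instead of six, at the price of a poor estimate for the skewed key.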
Another approach is to use an equi-depth histogram that varies the width of buckets so that the total number
of occurrences for each bucket is roughly the same. An example is shown in Figure 8.
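One way to build such buckets is sketched below: sort the values, cut the sorted list into runs of roughly equal length, and extend a run when a cut would split identical values. This is only one possible construction, and because of that extension it may return fewer buckets than requested on heavily skewed data. The function name and data are hypothetical.

```python
def equi_depth_histogram(values, num_buckets):
    """Return (low, high, count) buckets that hold roughly equal numbers of values."""
    ordered = sorted(values)
    target = max(1, round(len(ordered) / num_buckets))
    buckets = []
    start = 0
    while start < len(ordered):
        end = min(len(ordered), start + target)
        # Keep identical values in the same bucket by extending the cut point.
        while end < len(ordered) and ordered[end] == ordered[end - 1]:
            end += 1
        buckets.append((ordered[start], ordered[end - 1], end - start))
        start = end
    return buckets


values = [1] * 4 + [2] * 2 + [3] * 2 + [4] * 1 + [5] * 1 + [6] * 2
print(equi_depth_histogram(values, num_buckets=3))
```

Each bucket here covers a different key range but holds the same number of values.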
In place of histograms, some systems may use sketches to generate approximate statistics about a data set.
7 Sampling
DBMSs can use sampling to apply predicates to a smaller copy of the table with a similar distribution (see
Figure 9). The DBMS updates the sample whenever the number of changes to the underlying table exceeds
some threshold (e.g., 10% of the tuples).
Figure 8: Equi-Depth Histogram – To ensure that each bucket has roughly the
same number of counts, the histogram varies the range of each bucket.
Figure 9: Sampling – Instead of using one billion values in the table to estimate
selectivity, the DBMS can derive the selectivities for predicates from a subset of the
original table.
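The idea can be sketched with a uniform random sample. The table, the predicate, the sample size, and the helper names below are assumptions for illustration; a real DBMS may sample pages or use other schemes rather than rows drawn by `random.sample`.

```python
import random


def estimate_selectivity(table, predicate, sample_size, seed=0):
    """Estimate a predicate's selectivity from a uniform random sample of rows."""
    rng = random.Random(seed)
    sample = rng.sample(table, sample_size)
    return sum(1 for row in sample if predicate(row)) / sample_size


def needs_refresh(num_changes, table_size, threshold=0.10):
    """True when the fraction of changed tuples exceeds the refresh threshold."""
    return num_changes / table_size > threshold


# Hypothetical table of 100,000 rows with ages 0..99, evenly distributed.
table = [{"age": i % 100} for i in range(100_000)]


def older_than_50(row):
    return row["age"] > 50


est = estimate_selectivity(table, older_than_50, sample_size=1000)
true_sel = sum(1 for row in table if older_than_50(row)) / len(table)
print(est, true_sel)
```

The estimate touches 1,000 rows instead of 100,000, and its accuracy depends on the sample size and on how well the sample reflects the current table.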
12 Nested Sub-Queries
The DBMS treats nested sub-queries in the WHERE clause as functions that take parameters and return a
single value or set of values. There are two approaches to optimizing them:
• Re-write the query by de-correlating and / or flattening nested subqueries. An example of this is
shown in Figure 10.
• Decompose the nested query and store the result to a temporary table. An example of this is shown
in Figure 11.
Figure 10: Subquery Optimization (Rewriting) – The former query can be rewritten
as the latter query by rewriting the subquery as a JOIN. Removing a level of nesting
in this way effectively flattens the query.
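The rewrite can be tried out with Python's built-in `sqlite3` module. The schema and data below are made up for this sketch and are not taken from the figure. Note that the two queries are only interchangeable here because each sailor has at most one matching reservation; otherwise the join form would return duplicate names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sailors (sid INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE reserves (sid INTEGER, day TEXT);
    INSERT INTO sailors VALUES (1, 'Ann'), (2, 'Ben'), (3, 'Cat');
    INSERT INTO reserves VALUES
        (1, '2022-10-25'), (3, '2022-10-25'), (2, '2022-11-01');
""")

# Correlated sub-query: the inner query refers to S.sid from the outer query.
nested = """
    SELECT name FROM sailors AS S
    WHERE EXISTS (
        SELECT * FROM reserves AS R
        WHERE S.sid = R.sid AND R.day = '2022-10-25'
    )
    ORDER BY name
"""

# Flattened form: the sub-query becomes a join with the same predicates.
flattened = """
    SELECT name FROM sailors AS S, reserves AS R
    WHERE S.sid = R.sid AND R.day = '2022-10-25'
    ORDER BY name
"""

nested_rows = conn.execute(nested).fetchall()
flat_rows = conn.execute(flattened).fetchall()
print(nested_rows, flat_rows)
```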
13 Expression Rewriting
An optimizer transforms a query’s expressions (e.g., WHERE/ON clause predicates) into a minimal set of
expressions.
• Search for expressions that match a pattern.
• When a match is found, rewrite the expression.
• Halt if there are no more rules that match.
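The three steps above can be sketched as a small rule engine that runs to a fixed point. Expressions are represented as nested tuples such as `("AND", left, right)`, and the two rules are examples chosen for this sketch; the representation, rule set, and function names are assumptions, not a description of any particular DBMS.

```python
def rule_constant_eq(expr):
    """Fold ("=", a, b) on two integer literals into True or False."""
    if (isinstance(expr, tuple) and expr[0] == "="
            and isinstance(expr[1], int) and isinstance(expr[2], int)):
        return expr[1] == expr[2]
    return None  # no match


def rule_and_true(expr):
    """Simplify ("AND", True, x) or ("AND", x, True) to x."""
    if isinstance(expr, tuple) and expr[0] == "AND":
        if expr[1] is True:
            return expr[2]
        if expr[2] is True:
            return expr[1]
    return None  # no match


RULES = [rule_constant_eq, rule_and_true]


def rewrite_once(expr):
    """Apply the first matching rule found in the tree; return (expr, changed)."""
    for rule in RULES:
        result = rule(expr)
        if result is not None:
            return result, True
    if isinstance(expr, tuple):
        for i, child in enumerate(expr[1:], start=1):
            new_child, changed = rewrite_once(child)
            if changed:
                return expr[:i] + (new_child,) + expr[i + 1:], True
    return expr, False


def rewrite(expr):
    """Keep rewriting until no rule matches anywhere in the expression."""
    changed = True
    while changed:
        expr, changed = rewrite_once(expr)
    return expr


# WHERE 1 = 1 AND val > 5  becomes  WHERE val > 5
print(rewrite(("AND", ("=", 1, 1), (">", "val", 5))))
```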
An example of expression rewriting is shown in Figure 13.
Figure 13: Merging Predicates – The WHERE predicate in query 1 is redundant
because it matches any value between 1 and 150. Query 2 shows the more
succinct way to express the request in query 1.
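Merging range predicates of this kind amounts to merging overlapping intervals. The sketch below assumes the predicate is a disjunction (OR) of inclusive integer ranges on a single column; the function name and the input ranges are hypothetical and chosen so that the result matches the range described in the caption.

```python
def merge_ranges(ranges):
    """Merge overlapping or adjacent inclusive integer ranges joined by OR."""
    merged = []
    for lo, hi in sorted(ranges):
        # Adjacent ranges such as (1, 10) and (11, 20) merge only because the
        # column is assumed to hold integers.
        if merged and lo <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged


# val BETWEEN 1 AND 100 OR val BETWEEN 50 AND 150  becomes  val BETWEEN 1 AND 150
print(merge_ranges([(1, 100), (50, 150)]))
```

Ranges that do not overlap are left as separate predicates.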