Query Optimization
Relational Algebra
Since the introduction of the relational model by Codd in 1970 [Codd70], two classes
of languages have been proposed and implemented to work with a relational
database. The first class is called nonprocedural and includes relational calculus
and Quel. The second class is known as procedural and includes relational
algebra and the Structured Query Language (SQL) [SQL92]. In procedural
languages, the query directs the DBMS on how to arrive at the answer. In
contrast, in a nonprocedural language, the query indicates what is needed and
leaves it to the system to find the process for arriving at the answer. Although it
sounds easier to tell the system what is needed instead of how to get the answer,
nonprocedural languages are not as popular as procedural languages. As a
matter of fact, SQL (a procedural language) is the only widely accepted
language for end user interface to relational systems today.
SQL is typically used as a high-level language interface. Even though SQL is
an accepted and popular interface for end users, it does not lend itself nicely to
internal processing. Perhaps the most problematic aspect of SQL is its power in
representing complex queries easily at a very high level without specifying how
the operations should be performed. That is why most commercial database
systems use an internal representation based on relational algebra that specifies
the ordering of different operations within the query. Therefore, to understand
how SQL queries are processed, we need to understand how their equivalent
relational algebra commands work.
For the remainder of this section we will use the following notations:
Relational algebra (RA) supports unary and binary types of operations. Unary
operations take one relation (table) as an input and produce another as the
output. Binary operations take two relations as input and produce one relation as
the output. Note that regardless of the type of operation, the output is always a
relation. RA operators are divided into basic operators and derived operators.
Basic operators need to be supported by the language compiler since they
cannot be created from any other operations. Derived operators, on the other
hand, are optional since they can be expressed in terms of the basic operators.
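Notations in focus: SL (select), PJ (project), UN (union), SD (set difference), CP (Cartesian product), SI (set intersect), JN (join), NJN (natural join), and DV (divide).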
Select Operator in Relational Algebra
The select operator returns all tuples of the
relation whose attribute(s) satisfy the given predicates (conditions). If no
condition is specified, the select operator returns all tuples of the relation.
For example, “SL bal = 1200 (Account)” returns all accounts that have a balance
of $1200. The result is a relation with four attributes (since the Account relation has
four attributes) and as many rows as the number of accounts with a balance of
exactly $1200.
Project Operator in Relational Algebra
The project operator returns the values of
all attributes specified in the project operation for all tuples of the relation passed
as a parameter. In a project operation, all rows qualify but only those attributes
specified are returned.
For instance, “PJ Cname,Ccity (Customer)” returns the customer name and the
city where the customer lives for each and every customer of the bank.
Combining Select and Project
We can combine the select and project operators
in forming complex RA expressions that not only apply a given set of predicates
to the tuples of a relation but also trim the attributes to a desired set.
For example, assume we want to get the customer ID and customer name for all
customers who live in Edina. We can do this by combining the SL and the PJ
expressions as “PJ CID, Cname (SL Ccity =‘Edina’ (Customer)).”
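The combined expression can be sketched in a few lines of Python. This is a minimal illustration only: relations are modeled as lists of dictionaries, the sample rows are hypothetical, and the helper names `select` and `project` are assumptions made for this sketch rather than part of any DBMS.

```python
# Hypothetical sample of the Customer relation.
customer = [
    {"CID": 1, "Cname": "Ann", "Ccity": "Edina"},
    {"CID": 2, "Cname": "Bob", "Ccity": "Eden Prairie"},
    {"CID": 3, "Cname": "Cho", "Ccity": "Edina"},
]

def select(relation, predicate=lambda row: True):
    """SL: keep the tuples that satisfy the predicate (all tuples if none is given)."""
    return [row for row in relation if predicate(row)]

def project(relation, attributes):
    """PJ: keep only the named attributes and drop duplicate tuples."""
    seen, result = set(), []
    for row in relation:
        key = tuple(row[a] for a in attributes)
        if key not in seen:
            seen.add(key)
            result.append(dict(zip(attributes, key)))
    return result

# PJ CID, Cname (SL Ccity = 'Edina' (Customer))
edina = project(select(customer, lambda r: r["Ccity"] == "Edina"), ["CID", "Cname"])
print(edina)
```

The select is applied first, so the project only has to trim the tuples that already qualified.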
The following statements are true for the union operation in RA:
• We cannot union relations “R(a1, a2, a3)” and “S(b1, b2)” because they have
different degrees.
• We cannot union relations “R(a1 char(10), a2 Integer)” and “S(b1 char(15), b2
Date)” because the a2 and b2 attributes have different data types.
• If relation “R(a1 char(10), a2 Integer)” has cardinality K and relation “S(b1
char(10), b2 Integer)” has cardinality L, then “R UN S” has cardinality of at most
“K + L” (exactly “K + L” when R and S have no tuples in common) and is of the
form “(c1 char(10), c2 Integer).”
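The union rules above can be checked with a small sketch. It assumes set semantics (a tuple present in both inputs appears once in the result) and simplifies a schema to a tuple of type names; the function name `union` and the sample relations are hypothetical.

```python
def union(r, s):
    """UN over relations given as (type_names, set_of_tuples)."""
    (r_types, r_rows), (s_types, s_rows) = r, s
    if len(r_types) != len(s_types):
        raise ValueError("relations have different degrees")
    if r_types != s_types:
        raise ValueError("attribute data types differ")
    return (r_types, r_rows | s_rows)

R = (("char", "Integer"), {("Ann", 1), ("Bob", 2)})   # cardinality K = 2
S = (("char", "Integer"), {("Cho", 3), ("Dee", 4)})   # cardinality L = 2, disjoint from R
T = (("char", "Integer"), {("Bob", 2), ("Eve", 5)})   # shares one tuple with R

disjoint = union(R, S)[1]      # no common tuples: K + L tuples
overlapping = union(R, T)[1]   # the shared tuple is kept once
print(len(disjoint), len(overlapping))
```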
Suppose we need to get the name and the address for all of the customers who
live in a city named “Edina” or “Eden Prairie.” To find the results, we first need to
create a temporary relation that holds Cname and Ccity for all customers in
Edina; then we need to repeat this for all the customers in Eden Prairie; and
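finally, we need to union the two relations. We can write this RA expression as
follows:
PJ Cname, Ccity (SL Ccity =‘Edina’ (Customer)) UN PJ Cname, Ccity (SL Ccity =‘Eden Prairie’ (Customer))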
Assume we need to print the customer ID for all customers who have an account
at the Main branch but do not have a loan there. To do this, we first form the set
of all customers with accounts at the Main branch and then subtract all the
customers with a loan at the Main branch from that set. This excludes the
customers who are in the intersection of the two sets (those who have both an
account and a loan at the Main branch) leaving behind the desired customers.
The RA expression for this question is written as
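A minimal sketch of this set difference in Python follows. The Account and Loan rows are hypothetical, and the sketch assumes that both relations carry CID and Bname attributes; the helper `cids_at` is introduced only for illustration.

```python
# Hypothetical rows; only the attributes needed for the question are shown.
account = [
    {"CID": 1, "Bname": "Main"},
    {"CID": 2, "Bname": "Main"},
    {"CID": 3, "Bname": "West"},
]
loan = [
    {"CID": 2, "Bname": "Main"},
    {"CID": 3, "Bname": "Main"},
]

def cids_at(relation, branch):
    """Select on Bname, then project on CID; a Python set removes duplicates."""
    return {row["CID"] for row in relation if row["Bname"] == branch}

# Customers with an account at Main, minus customers with a loan at Main.
result = cids_at(account, "Main") - cids_at(loan, "Main")
print(result)
```

Customer 2 is in both sets and is therefore removed; customer 3 has no account at Main and never enters the first set.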
In addition to the basic operators in RA, the language also has a set of derived
operators. These operators are called “derived” since they can be expressed in
terms of the basic operators. As a result, they are not required by the language,
but are supported for ease of programming. These operators are SI, JN (NJN),
and DV. The following sections present an overview of these operators.
Set Intersect Operator in Relational Algebra
Set intersect (SI) is a binary operator
that returns the tuples in the intersection of two relations. If the two relations do
not intersect, the operator returns an empty relation. Suppose that we need to
get the customer name for all customers who have an account and a loan at
the Main branch in the bank. Considering the set of customers who have an
account at the Main branch and the set of customers who have a loan at that
branch, the answer to the question falls in the intersection of the two sets and
can be expressed as follows:
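Because SI is a derived operator, it can be built from set difference alone. The sketch below illustrates the standard identity R SI S = R SD (R SD S) on hypothetical sets of customer names; the function names are assumptions for this example.

```python
def difference(r, s):
    """SD: tuples of r that are not in s."""
    return r - s

def intersect(r, s):
    """SI expressed with the basic SD operator: r SD (r SD s)."""
    return difference(r, difference(r, s))

# Hypothetical customer names at the Main branch.
with_account = {"Ann", "Bob", "Cho"}
with_loan = {"Bob", "Cho", "Dee"}

both = intersect(with_account, with_loan)
print(sorted(both))
```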
Join Operator in Relational Algebra
The join (JN) operator in RA is a special case
of the CP operator. In a CP operation, rows from the two relations are
concatenated without any restrictions. However, in a JN, before the tuples are
concatenated, they are checked against some condition(s). JN is a binary
operation that returns a relation by combining tuples from two input relations
based on some specified conditions. These operations are known as conditional
joins, where conditions are applied to the attributes of the two relations before
the tuples are concatenated.
One popular join condition is to force equality on the values of the attributes of
the two relations. These types of joins are known as equi-joins. For a relation R
with cardinality K and a relation S with cardinality L, the expression “R JN a2=b2 S”
is a join that returns a relation with at most K ∗ L tuples, each of the form
“[a1, a2, ... , an, b1, b2, ... , bm]” and satisfying the condition “a2 = b2.”
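The relationship between CP and JN can be sketched as a Cartesian product followed by a filter. This is an illustration of the definition, not of how a DBMS would execute a join; the relations are hypothetical lists of tuples and attributes are addressed by position.

```python
from itertools import product

def cartesian_product(r, s):
    """CP: concatenate every tuple of r with every tuple of s."""
    return [a + b for a, b in product(r, s)]

def equi_join(r, s, i, j):
    """JN: keep the concatenated tuples where attribute i of r equals attribute j of s."""
    width = len(r[0]) if r else 0
    return [t for t in cartesian_product(r, s) if t[i] == t[width + j]]

R = [(1, "x"), (2, "y"), (3, "y")]   # attributes (a1, a2), cardinality K = 3
S = [("p", "y"), ("q", "z")]         # attributes (b1, b2), cardinality L = 2

joined = equi_join(R, S, 1, 1)       # condition a2 = b2
print(joined)
```

The product has K ∗ L = 6 tuples, of which only those satisfying the condition survive.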
Divide Operator in Relational Algebra
Divide (DV) is a binary operator that takes
two relations as input and produces one relation as the output. The DV operation
in RA is similar to the divide operation in math. We will use an example to show
how the divide operation works in relational algebra.
The following explains what the DV operator actually does:
• First, the DV operation performs a group-by on the Cname attribute of the first
relation, which results in a set of branch names for each customer.
• Then, it checks to see if the set of Bname values associated with every unique
value of Cname is the same set or a superset of the set of Bname values from the
second relation. If it is (the same set or superset), then the customer identified by
that Cname is part of the division result.
The DV operator can be expressed in terms of the basic operators as follows:
R1 = PJ cname (R) CP S
R2 = PJ cname (R1 SD R)
R(cname, bname) DV S(bname) = PJ cname (R) SD R2
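The three-step derivation can be traced on a small example. The sketch below uses Python sets of tuples with hypothetical data, and also computes the group-by description of DV so the two can be compared; the function names are assumptions for this illustration.

```python
# Hypothetical R(cname, bname) and S(bname).
R = {("Ann", "Main"), ("Ann", "West"), ("Bob", "Main"),
     ("Cho", "Main"), ("Cho", "West"), ("Cho", "East")}
S = {("Main",), ("West",)}

def divide(r, s):
    """R DV S built from PJ, CP, and SD only."""
    cnames = {(c,) for c, _ in r}                 # PJ cname (R)
    r1 = {c + b for c in cnames for b in s}       # R1 = PJ cname (R) CP S
    r2 = {(c,) for c, _ in (r1 - r)}              # R2 = PJ cname (R1 SD R)
    return cnames - r2                            # PJ cname (R) SD R2

def divide_by_grouping(r, s):
    """Group bname values per cname and keep customers whose set covers S."""
    required = {b for (b,) in s}
    groups = {}
    for c, b in r:
        groups.setdefault(c, set()).add(b)
    return {(c,) for c, branches in groups.items() if branches >= required}

print(sorted(divide(R, S)))
```

R1 SD R holds the (customer, branch) pairs that are required but missing, so R2 is exactly the set of customers to exclude.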
Computing Selection
The general form of a select statement is “SL p (R).” Predicate p is either empty (in
which case all tuples in R are returned), a simple predicate, or a complex
predicate formed by combining simple predicates with “AND,” “OR,” or “NOT.”
How tuples within R are examined depends on the complexity of the predicate p
and availability of indexes on attributes of R.
No Index on R
If R does not have any index and it is not sorted, then the only way to process the
select statement is to scan all disk pages (or disk blocks) of R. Each page is
brought into memory and the tuples inside the page are examined against the
predicate “p.” Each qualifying tuple is placed in the output of
the select operator. A relation scan is the most expensive approach to
processing a select, since every tuple in the relation must be examined. The disk
I/O cost of performing a select on R using a relation scan equals the number of
disk pages that R occupies, since each page is read once.
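A relation scan can be sketched as a loop over pages that counts one read per page. The page layout (a list of lists of tuples) and the function name are assumptions made for this illustration.

```python
def relation_scan(pages, predicate):
    """Examine every tuple of every page; return the matches and the page reads."""
    reads, output = 0, []
    for page in pages:
        reads += 1                                   # one disk read per page
        output.extend(t for t in page if predicate(t))
    return output, reads

# Hypothetical relation of 10 tuples (id, balance) stored 4 tuples per page.
tuples = [(i, 1200 if i % 5 == 0 else 100 * i) for i in range(10)]
pages = [tuples[i:i + 4] for i in range(0, len(tuples), 4)]

matches, reads = relation_scan(pages, lambda t: t[1] == 1200)
print(matches, reads)
```

The number of reads depends only on how many pages the relation occupies, not on how many tuples satisfy the predicate.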
B+ Tree Index on R
Simple Predicates
To calculate the actual disk I/O cost of using an index, we have to know
(1) if the index is a unique index or a nonunique index, and (2) whether the
index is clustered or nonclustered. In a unique index, the key value we are
looking for can appear at most once in the leaves of the index, while in a
nonunique index, the value may appear multiple times. In a clustered
index, the tuples that are addressed by the key entries in adjacent leaves
are stored “close to each other on the disk” (clustered). In a nonclustered
index, the tuples may be scattered across the disk. As a result, finding
a set of clustered tuples requires fewer disk I/Os than finding the same
tuples if they were not clustered.
If all of the matching tuples identified by the index are stored on only one
disk page, we need to spend at most one additional disk access to get to
the database page where the tuple(s) are stored. This is true for both a
clustered and a nonclustered index.
If, on the other hand, the index identifies matching tuple(s) that are stored
across multiple pages, the access cost could be much greater. When
multiple tuples are returned by the index search, it is important to know if
the index is clustered or not. In a clustered index, all the tuples that have
the same value for the key attribute are stored next to each other
physically on the disk, in contiguous disk pages. Therefore, for a clustered
index the cost of reading all tuples required is much less than it would be
for the nonclustered case. In the best case scenario, it might require only a
single disk I/O to read all the disk pages containing the matching tuples.
But, if the index is not clustered, the matching tuples could be scattered
on the disk, and hence, in the worst case scenario, the cost of retrieving
the tuples could be as many additional disk I/Os as the number of data
pages used by the matching tuples. That number is at most the number of
matching tuples, which is reached when every matching tuple resides on a
different page.
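The difference between the two cases can be put into a rough estimate. The sketch below is a simplified cost model, not a formula from any particular DBMS: it counts only data-page reads (not index traversal), assumes one I/O per page, and assumes a clustered run of matching tuples starts at a page boundary. The function name and parameters are hypothetical.

```python
import math

def data_page_reads(matching_tuples, tuples_per_page, clustered):
    """Estimate data-page reads needed to fetch the tuples an index search returned."""
    if matching_tuples == 0:
        return 0
    if clustered:
        # Matching tuples are stored contiguously, so they fill whole pages.
        return math.ceil(matching_tuples / tuples_per_page)
    # Worst case for a nonclustered index: every tuple is on a different page.
    return matching_tuples

print(data_page_reads(100, 20, clustered=True))
print(data_page_reads(100, 20, clustered=False))
```

When all matching tuples fit on one page, both cases cost a single additional read only if the nonclustered tuples happen to share that page; the nonclustered figure here is the worst-case bound.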
Complex Predicates
Hash Index on R
One of the issues with using hash indexing is collision handling. A collision
happens when two different keys hash into the same value. All keys that
collide have to be put in the same bucket. This can cause a bucket
overflow, requiring additional pages to accommodate the collided keys.
Collisions and bucket overflows can cause additional disk I/Os when the
index is used to find matching tuples.
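The effect of bucket overflow on lookup cost can be seen in a toy static hash index. This is a sketch under stated assumptions: each bucket is a chain of fixed-capacity pages, a full page causes an overflow page to be appended, and a lookup reads every page in the chain. The class and its methods are hypothetical and not the design of any specific system.

```python
class HashIndex:
    def __init__(self, num_buckets, page_capacity):
        self.page_capacity = page_capacity
        # Each bucket starts with one primary page; overflow pages are appended.
        self.buckets = [[[]] for _ in range(num_buckets)]

    def _chain(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def insert(self, key, rid):
        chain = self._chain(key)
        if len(chain[-1]) == self.page_capacity:
            chain.append([])                 # bucket overflow: add a page
        chain[-1].append((key, rid))

    def lookup(self, key):
        """Return the matching record ids and the number of index pages read."""
        reads, rids = 0, []
        for page in self._chain(key):
            reads += 1
            rids.extend(rid for k, rid in page if k == key)
        return rids, reads

index = HashIndex(num_buckets=4, page_capacity=2)
for key in (0, 4, 8, 12, 16):                # all collide in bucket 0
    index.insert(key, f"rid-{key}")
index.insert(1, "rid-1")                     # lands in bucket 1, no collision

print(index.lookup(16))
print(index.lookup(1))
```

Five colliding keys with a page capacity of two force two overflow pages, so a lookup in that bucket reads three pages instead of one.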
The goal of the query processing subsystem of a DBMS should be to minimize the
amount of time it takes to return the answer to a user’s query.
In a centralized system, the goals of the query processor may include the
following:
The system might be unable to realize all of these goals. For example, minimizing
the total resource usage may not yield minimum query response time. It is
understood that minimizing the amount of memory allocated to sorting relations
can have a direct impact on how fast a relation can be sorted. Faster sorting of
the relations reduces the total amount of time needed to join two relations
using the sort–merge strategy. The more memory pages (memory frames)
allocated to the sort process the faster the sort can be done, but since total
physical memory is limited, increasing the size of sort memory decreases the
amount of memory that can be allocated to other data structures, temporary
table storage in memory, and other processes. In effect, this may increase the
query response time.
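The trade-off between sort memory and sort speed can be illustrated with the usual external merge sort model, which this section does not spell out and is assumed here: with B buffer pages, the first pass produces sorted runs of B pages each, and every later pass merges B − 1 runs at a time. The function name and the page counts are hypothetical.

```python
import math

def sort_passes(num_pages, buffer_pages):
    """Number of passes over the data for an external merge sort (assumed model)."""
    if buffer_pages < 3:
        raise ValueError("need at least 3 buffer pages to merge two runs")
    runs = math.ceil(num_pages / buffer_pages)   # pass 1: sorted runs of B pages
    passes = 1
    while runs > 1:
        runs = math.ceil(runs / (buffer_pages - 1))   # merge B - 1 runs at a time
        passes += 1
    return passes

for buffers in (5, 50):
    print(buffers, sort_passes(1000, buffers))
```

Each pass reads and writes the whole relation, so fewer passes mean proportionally less disk I/O, at the price of memory taken away from other uses.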
As shown in Figure 29, the first step in processing a query is parsing and
translation. During this step, the query is checked for syntax and
correctness of its data types. If this check passes, the query is translated
from its SQL representation to an equivalent relational algebra expression.
Example 1:
Suppose we want to retrieve the name of all customers who have one
or more accounts in branches in the city of Edina. We can write the SQL
statement for this question as
Select c.Cname
From Customer c, Branch b, Account a
Where c.CID = a.CID
AND a.Bname = b.Bname
AND b.Bcity = 'Edina';
There are two join conditions and one select condition (known as a filter)
in this statement. The relational algebra (RA) expression that the parser
might generate is shown below:
The DBMS does not execute this expression as is. The expression must go
through a series of transformations and optimization before it is ready to
run. The query optimizer is the component responsible for doing that.
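To see why an unoptimized expression is not run as is, the sketch below evaluates Example 1 in two ways over hypothetical toy tables: once by forming the Cartesian product of all three tables and filtering, and once by filtering Branch first and then joining. The second ordering is only one possible rewrite, chosen here for illustration.

```python
from itertools import product

# Hypothetical data: Customer(CID, Cname), Branch(Bname, Bcity), Account(CID, Bname).
customer = [(cid, f"C{cid}") for cid in range(1, 6)]
branch = [("Main", "Edina"), ("West", "Eden Prairie")]
account = [(1, "Main"), (2, "West"), (3, "Main"), (3, "West")]

# Cartesian-product form: build every combination, then apply all conditions.
combos = list(product(customer, account, branch))
names_cp = {c[1] for c, a, b in combos
            if c[0] == a[0] and a[1] == b[0] and b[1] == "Edina"}

# Join form: apply the filter on Branch first, then join on the key attributes.
edina_branches = {b[0] for b in branch if b[1] == "Edina"}
edina_accounts = [a for a in account if a[1] in edina_branches]
names_jn = {c[1] for c in customer for a in edina_accounts if c[0] == a[0]}

print(len(combos), len(edina_accounts), sorted(names_jn))
```

Both forms return the same customers, but the first materializes 40 intermediate tuples for 11 input rows, while the second carries only 2 account tuples into the final join.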
Query Optimization
There are three steps that make up query optimization. These are cost
estimation, plan generation, and query plan code generation. In some
DBMSs (e.g., DB2), an extra step called “Query Rewrite” is performed
before query optimization is undertaken. In query rewrite, the query
optimizer rewrites the query by eliminating redundant predicates,
expanding operations on views, eliminating redundant subexpressions,
and simplifying complex expressions such as nested queries.
The RA expression for Example 1 does not run efficiently, since forming
Cartesian products of the three tables involved in the query produces
large intermediate relations. Instead, join operators are used and the
expression is rewritten as
Cost Estimation
These properties are outlined below (the symbol “≡” stands for
equivalence):
Uop1(Uop2(R)) ≡ Uop2(Uop1(R))
Example:
Uop(R) ≡ Uop1(Uop2(R))
Example:
R Bop1 S ≡ S Bop1 R
Example:
Customer NJN Account ≡ Account NJN Customer
Example
Example
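The equivalences listed above can be checked on small relations. The sketch below uses hypothetical data and reads the second property as splitting one select with a conjunctive predicate into two cascaded selects, which is an assumption about the intended meaning. Rows of the joined relations are stored as sets of (attribute, value) pairs so that attribute order does not matter.

```python
R = {(1, "a"), (2, "b"), (3, "a")}

def sel_first_gt_1(rel):
    return {t for t in rel if t[0] > 1}

def sel_second_is_a(rel):
    return {t for t in rel if t[1] == "a"}

# Uop1(Uop2(R)) ≡ Uop2(Uop1(R)): two selects can be applied in either order.
commuted = sel_first_gt_1(sel_second_is_a(R)) == sel_second_is_a(sel_first_gt_1(R))

# Uop(R) ≡ Uop1(Uop2(R)): one select with "p1 AND p2" equals two cascaded selects.
combined = {t for t in R if t[0] > 1 and t[1] == "a"}
cascaded = sel_first_gt_1(sel_second_is_a(R))

def row(**attrs):
    return frozenset(attrs.items())

def natural_join(r, s):
    """NJN: combine rows that agree on all attributes the two relations share."""
    out = set()
    for a in r:
        for b in s:
            da, db = dict(a), dict(b)
            if all(da[k] == db[k] for k in da.keys() & db.keys()):
                out.add(frozenset({**da, **db}.items()))
    return out

customer = {row(CID=1, Cname="Ann"), row(CID=2, Cname="Bob")}
account = {row(CID=1, Bname="Main"), row(CID=1, Bname="West")}

# R Bop S ≡ S Bop R: Customer NJN Account ≡ Account NJN Customer.
print(commuted, combined == cascaded,
      natural_join(customer, account) == natural_join(account, customer))
```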
Plan Generation
In this optimization step, the query plan is generated. A query plan (or
simply, a plan, as it is known by almost all DBMSs) is an extended
query tree that includes access paths for all operations in the tree. Access
paths provide detailed information on how each operation in the tree is to
be performed.
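One way to picture a plan is as a tree whose nodes carry both an operator and the access path chosen for it. The sketch below is a hypothetical data structure, not the plan format of any DBMS; the operators and access paths in the example are made up for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class PlanNode:
    operator: str                 # RA operator or base relation name
    access_path: str              # how this node is to be evaluated
    children: list = field(default_factory=list)

    def render(self, depth=0):
        """Return the plan as indented text, one node per line."""
        line = "  " * depth + f"{self.operator} [{self.access_path}]"
        return "\n".join([line] + [c.render(depth + 1) for c in self.children])

plan = PlanNode("PJ Cname", "pipelined from child", [
    PlanNode("JN CID = CID", "nested-loop join", [
        PlanNode("Customer", "relation scan"),
        PlanNode("Account", "index lookup on CID"),
    ]),
])
print(plan.render())
```

The query tree alone would stop at the operator names; the access-path annotations are what turn it into a plan.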
Query optimization has been researched for many years and is still the
focus of research due to its importance in centralized and distributed
DBEs. Among the proposed algorithms, exhaustive search and heuristics-based
algorithms are most popular. The difference between these two
approaches is in their time and space complexity requirements and in the
quality of the plans they generate.
Exhaustive Search Optimization
Algorithms in this class will first form all possible query plans for a given
query and then select the “best” plan for the query. Dynamic
programming (DP) is an example of an algorithm in this class. Because the
solution space containing all the possible query execution alternatives
is very large, DP has exponential time and space complexity.
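A minimal sketch of dynamic programming for join ordering follows. It makes several simplifying assumptions that are not from the text: only left-deep join orders are considered, the cost of a plan is the sum of its estimated intermediate result sizes, and the cardinalities and selectivities are hypothetical. The table `best` holds one entry per subset of relations, which is where the exponential space requirement comes from.

```python
from fractions import Fraction
from itertools import combinations

def best_join_order(cards, selectivity):
    """cards: {relation: cardinality}; selectivity: {frozenset({a, b}): fraction}."""
    names = sorted(cards)
    # best[subset] = (cost so far, estimated result size, join order)
    best = {frozenset([n]): (0, cards[n], (n,)) for n in names}
    for size in range(2, len(names) + 1):
        for combo in combinations(names, size):
            subset = frozenset(combo)
            for last in combo:                       # relation joined last
                cost, rows, order = best[subset - {last}]
                sel = 1                              # 1 means no predicate (Cartesian product)
                for other in subset - {last}:
                    sel *= selectivity.get(frozenset({other, last}), 1)
                out_rows = rows * cards[last] * sel
                candidate = (cost + out_rows, out_rows, order + (last,))
                if subset not in best or candidate[0] < best[subset][0]:
                    best[subset] = candidate
    return best[frozenset(names)], len(best)

cards = {"Customer": 1000, "Account": 2000, "Branch": 10}
selectivity = {
    frozenset({"Customer", "Account"}): Fraction(1, 1000),
    frozenset({"Account", "Branch"}): Fraction(1, 10),
}
(cost, rows, order), table_size = best_join_order(cards, selectivity)
print(cost, order, table_size)
```

With three relations the table holds 7 subsets; each added relation doubles that count, and the plan that starts with the Customer and Branch Cartesian product is rejected because of its large intermediate result.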
Heuristics-Based Optimization
Code Generation
The last step in query optimization is code generation. Code is the final
representation of the query and is executed or interpreted depending on
the type of operating system or hardware. The query code is turned over
to the execution manager (EM) to execute.