Query Optimization

This document discusses query optimization and relational algebra. It introduces relational algebra and some of its basic operators like select, project, join, union, set difference. It also discusses derived operators like set intersect and divide. Relational algebra is used to represent SQL queries internally in database systems.

Uploaded by

ninjaqtr
Relational Algebra

Since the introduction of the relational model by Codd in 1970 [Codd70], two classes
of languages have been proposed and implemented to work with a relational
database. The first class is called nonprocedural and includes relational calculus,
Quel, and the Structured Query Language (SQL) [SQL92]. The second class is
known as procedural and includes relational algebra. In procedural languages,
the query directs the DBMS on how to arrive at the answer. In contrast, in a
nonprocedural language, the query indicates what is needed and leaves it to the
system to find the process for arriving at the answer. SQL is usually classified as
nonprocedural, since a SQL statement does not fix the order in which its
operations are carried out. As a matter of fact, SQL is the only widely accepted
language for end user interface to relational systems today.

SQL is typically used as a high-level language interface. Even though SQL is
an accepted and popular interface for end users, it does not lend itself nicely to
internal processing. Perhaps the most problematic aspect of SQL is its power in
representing complex queries easily at a very high level without specifying how
the operations should be performed. That is why most commercial database
systems use an internal representation based on relational algebra that specifies
the ordering of different operations within the query. Therefore, to understand
how SQL queries are processed, we need to understand how their equivalent
relational algebra commands work.

For the remainder of this section we will use the following notations:

• R and S are two relations.
• The number of tuples in a relation is called the cardinality of that relation.
• R has attributes a1, a2, ... , an and has cardinality of K.
• S has attributes b1, b2, ... , bm and has cardinality of L.
• r is a tuple in R and is shown as r[a1, a2, ... , an].
• s is a tuple in S and is shown as s[b1, b2, ... , bm].

Subset of Relational Algebra Commands

Relational algebra (RA) supports unary and binary types of operations. Unary
operations take one relation (table) as an input and produce another as the
output. Binary operations take two relations as input and produce one relation as
the output. Note that regardless of the type of operation, the output is always a
relation. RA operators are divided into basic operators and derived operators.
Basic operators need to be supported by the language compiler since they
cannot be created from any other operations. Derived operators, on the other
hand, are optional since they can be expressed in terms of the basic operators.

Notations in focus:

• SL represents the relational algebra SELECT operator.
• PJ represents the relational algebra PROJECT operator.
• JN represents the relational algebra JOIN operator.
• NJN represents the relational algebra natural JOIN operator.
• UN represents the relational algebra UNION operator.
• SD represents the relational algebra SET DIFFERENCE operator.
• CP represents the relational algebra CROSS PRODUCT operator.
• SI represents the relational algebra SET INTERSECT operator.
• DV represents the relational algebra DIVIDE operator.

Figure 28. Symbols used in Relational Algebra

Relational Algebra Basic Operators

Select Operator in Relational Algebra

The select operator returns all tuples of the relation whose attribute(s) satisfy
the given predicates (conditions). If no condition is specified, the select
operator returns all tuples of the relation.

For example, “SL bal = 1200 (Account)” returns all accounts that have a balance
of $1200. The result is a relation with four attributes (since the Account relation has
four attributes) and as many rows as the number of accounts with a balance of
exactly $1200.

Project Operator in Relational Algebra

The project operator returns the values of all attributes specified in the project
operation for all tuples of the relation passed as a parameter. In a project
operation, all rows qualify but only those attributes specified are returned.

For instance, “PJ Cname, Ccity (Customer)” returns the customer name and the
city where the customer lives for each and every customer of the bank.

Combining Select and Project

We can combine the select and project operators in forming complex RA
expressions that not only apply a given set of predicates to the tuples of a
relation but also trim the attributes to a desired set.

For example, assume we want to get the customer ID and customer name for all
customers who live in Edina. We can do this by combining the SL and the PJ
expressions as “PJ CID, Cname (SL Ccity = ‘Edina’ (Customer)).”
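
The select and project operators can be sketched in a few lines of Python. This is only an illustration, assuming a relation is modeled as a list of dictionaries and that project removes duplicate result tuples (set semantics); the helper names select and project and the sample Customer rows are hypothetical.

```python
def select(relation, predicate=None):
    """SL: keep the tuples that satisfy the predicate (all tuples if none is given)."""
    if predicate is None:
        return list(relation)
    return [row for row in relation if predicate(row)]


def project(relation, attributes):
    """PJ: keep only the named attributes and drop duplicate result tuples."""
    seen = set()
    result = []
    for row in relation:
        values = tuple(row[a] for a in attributes)
        if values not in seen:
            seen.add(values)
            result.append(dict(zip(attributes, values)))
    return result


# Hypothetical sample data for the Customer relation.
customer = [
    {"CID": 1, "Cname": "Ann", "Ccity": "Edina"},
    {"CID": 2, "Cname": "Bob", "Ccity": "Eden Prairie"},
    {"CID": 3, "Cname": "Cal", "Ccity": "Edina"},
]

# PJ CID, Cname (SL Ccity = 'Edina' (Customer))
edina = project(select(customer, lambda r: r["Ccity"] == "Edina"), ["CID", "Cname"])
print(edina)  # [{'CID': 1, 'Cname': 'Ann'}, {'CID': 3, 'Cname': 'Cal'}]
```

Because the output of select is itself a relation, it can be passed straight into project, which is what makes the nested RA expression possible.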

Union Operator in Relational Algebra

Union is a binary operation in RA that combines the tuples from two relations
into one relation. Any tuple in the union is in the first relation, the second
relation, or both relations. In a sense, the union operator in RA behaves the
same way that the addition operator works in math: it adds up the elements of
two sets. There are two compatibility requirements for the union operation.
First, the two relations have to be of the same degree, that is, they have to
have the same number of attributes. Second, corresponding attributes of the two
relations have to be from compatible domains.

The following statements are true for the union operation in RA:
• We cannot union relations “R(a1, a2, a3)” and “S(b1, b2)” because they have
different degrees.
• We cannot union relations “R(a1 char(10), a2 Integer)” and “S(b1 char(15), b2
Date)” because the a2 and b2 attributes have different data types.
• If relation “R(a1 char(10), a2 Integer)” has cardinality K and relation “S(b1
char(10), b2 Integer)” has cardinality L, then “R UN S” is of the form “(c1
char(10), c2 Integer)” and has cardinality of at most “K + L,” since a tuple that
appears in both relations appears only once in the union.

Suppose we need to get the customer ID and the name for all of the customers who
live in a city named “Edina” or “Eden Prairie.” To find the results, we first need to
create a temporary relation that holds CID and Cname for all customers in
Edina; then we need to repeat this for all the customers in Eden Prairie; and
finally, we need to union the two relations. We can write this RA expression as
follows:

PJ CID, Cname (SL Ccity = ‘Edina’ (Customer))
UN
PJ CID, Cname (SL Ccity = ‘Eden Prairie’ (Customer))

The union operator is commutative, meaning that “R UN S = S UN R.” Also, the
union operator is associative, meaning that “R UN (S UN P) = (R UN S) UN P.”
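
The compatibility checks and the set behavior of union can be illustrated with a short Python sketch. It assumes a relation is a set of tuples and that a schema is a tuple of Python types standing in for attribute domains; the function name union and the sample data are hypothetical.

```python
def union(r_schema, r_rows, s_schema, s_rows):
    """UN: check union compatibility, then merge the two sets of tuples."""
    if len(r_schema) != len(s_schema):
        raise ValueError("relations have different degrees")
    if any(a is not b for a, b in zip(r_schema, s_schema)):
        raise ValueError("corresponding attributes have incompatible domains")
    return set(r_rows) | set(s_rows)


schema = (int, str)            # Python types used as stand-ins for domains
r = {(1, "a"), (2, "b")}       # cardinality K = 2
s = {(2, "b"), (3, "c")}       # cardinality L = 2; (2, "b") is also in r

result = union(schema, r, schema, s)
print(len(result))                               # 3
print(result == union(schema, s, schema, r))     # True
```

The tuple that occurs in both inputs is kept once, so the result here has three tuples rather than four, and swapping the operands gives the same relation.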

Set Difference Operator in Relational Algebra

Set difference (SD) is a binary operation in RA that subtracts the tuples in one
relation from the tuples of another relation. In other words, SD removes the
tuples that are in the intersection of the two relations from the first relation
and returns the result. In “S SD R,” the tuples in the set difference belong to
the S relation but do not belong to R. In a sense, the set difference operator in
RA behaves the same way that the subtraction operator works in math. As with
union, there are two compatibility requirements for this operation. First, the
two relations have to be of the same degree, and second, the corresponding
attributes of the two relations have to come from compatible domains.

Assume we need to print the customer ID for all customers who have an account
at the Main branch but do not have a loan there. To do this, we first form the set
of all customers with accounts at the Main branch and then subtract all the
customers with a loan at the Main branch from that set. This excludes the
customers who are in the intersection of the two sets (those who have both an
account and a loan at the Main branch) leaving behind the desired customers.
The RA expression for this question is written as

PJ CID (SL Bname = ‘Main’ (Account))
SD
PJ CID (SL Bname = ‘Main’ (Loan))

Cartesian Product Operator in Relational Algebra

Cartesian product (CP), which is also known as cross product, is a binary
operation that concatenates each and every tuple from the first relation with
each and every tuple from the second relation. CP is a set operator that
multiplies the elements of two sets. In a sense, the CP operator in RA behaves
the same way that the multiplication operator works in math. For relations R
and S with cardinalities K and L, the result has K ∗ L tuples. This operation is
rarely used in practice, since it produces a large number of tuples, most of
which do not contain any useful information.
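
A minimal Python sketch of the Cartesian product, assuming relations are lists of tuples (the name cross_product and the sample data are hypothetical), shows how quickly the result grows:

```python
from itertools import product


def cross_product(r_rows, s_rows):
    """CP: concatenate every tuple of R with every tuple of S."""
    return [r + s for r, s in product(r_rows, s_rows)]


r = [(1, "x"), (2, "y"), (3, "z")]   # cardinality K = 3
s = [("a",), ("b",)]                 # cardinality L = 2

cp = cross_product(r, s)
print(len(cp))   # 6, that is K * L
print(cp[0])     # (1, 'x', 'a')
```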

Relational Algebra Derived Operators

In addition to the basic operators in RA, the language also has a set of derived
operators. These operators are called “derived” since they can be expressed in
terms of the basic operators. As a result, they are not required by the language,
but are supported for ease of programming. These operators are SI, JN (NJN),
and DV. The following sections present an overview of these operators.

Set Intersect Operator in Relational Algebra

Set intersect (SI) is a binary operator that returns the tuples in the
intersection of two relations. If the two relations do not intersect, the
operator returns an empty relation. Suppose that we need to get the customer
name for all customers who have an account and a loan at the Main branch in the
bank. Considering the set of customers who have an account at the Main branch
and the set of customers who have a loan at that branch, the answer to the
question falls in the intersection of the two sets and can be expressed as
follows:

PJ Cname (SL Bname = ‘Main’ (Account))
SI
PJ Cname (SL Bname = ‘Main’ (Loan))

The SI operation is associative and commutative. Therefore, “R SI S = S SI R” and
“R SI (S SI P) = (R SI S) SI P.”
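
Since SI is a derived operator, it can be built from set difference alone using the identity R SI S = R SD (R SD S). The Python sketch below checks this on hypothetical data, assuming relations are sets of tuples; the function names are illustrative.

```python
def set_difference(r, s):
    """SD: tuples of r that do not appear in s."""
    return {t for t in r if t not in s}


def set_intersect(r, s):
    """SI derived from the basic operator SD: R SI S = R SD (R SD S)."""
    return set_difference(r, set_difference(r, s))


# Hypothetical customer names with an account or a loan at the Main branch.
account_main = {("Ann",), ("Bob",), ("Cal",)}
loan_main = {("Bob",), ("Cal",), ("Dee",)}

both = set_intersect(account_main, loan_main)
print(sorted(both))  # [('Bob',), ('Cal',)]
```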

Join Operator in Relational Algebra

The join (JN) operator in RA is a special case of the CP operator. In a CP
operation, rows from the two relations are concatenated without any
restrictions. However, in a JN, before the tuples are concatenated, they are
checked against some condition(s). JN is a binary operation that returns a
relation by combining tuples from two input relations based on some specified
conditions. These operations are known as conditional joins, where conditions
are applied to the attributes of the two relations before the tuples are
concatenated.

One popular join condition is to force equality on the values of the attributes of
the two relations. These types of joins are known as equi-joins. The expression
“R JN a2 = b2 S” is a join that returns a relation with at most “K ∗ L” tuples,
and each tuple is in the form “[a1, a2, ... , an, b1, b2, ... , bm],” satisfying
the condition “a2 = b2.”

RA supports the concept of natural join, where equality is enforced automatically
on the attributes of the two relations that have the same name.
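
The relationship between CP, conditional join, and natural join can be sketched in Python as follows. This is an illustration only: relations are lists of dictionaries, the conditional join assumes the two relations use distinct attribute names, and the function names and sample rows are hypothetical.

```python
from itertools import product


def theta_join(r_rows, s_rows, condition):
    """JN: a Cartesian product whose tuple pairs are filtered by a condition."""
    return [{**r, **s} for r, s in product(r_rows, s_rows) if condition(r, s)]


def natural_join(r_rows, s_rows):
    """NJN: equality is enforced on every attribute the relations share by name."""
    result = []
    for r, s in product(r_rows, s_rows):
        common = r.keys() & s.keys()
        if all(r[a] == s[a] for a in common):
            result.append({**r, **s})
    return result


# Equi-join R JN a2 = b2 S on hypothetical data.
r = [{"a1": 1, "a2": 10}, {"a1": 2, "a2": 20}]
s = [{"b1": "x", "b2": 10}, {"b1": "y", "b2": 30}]
equi = theta_join(r, s, lambda x, y: x["a2"] == y["b2"])
print(equi)  # [{'a1': 1, 'a2': 10, 'b1': 'x', 'b2': 10}]

# Natural join on the shared attribute CID.
customer = [{"CID": 1, "Cname": "Ann"}, {"CID": 2, "Cname": "Bob"}]
account = [{"CID": 1, "Bname": "Main"}, {"CID": 1, "Bname": "West"},
           {"CID": 3, "Bname": "Main"}]
print(len(natural_join(customer, account)))  # 2
```

Both functions enumerate the full Cartesian product and then filter it, which mirrors the definition but is not how a DBMS would normally execute a join.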

Divide Operator in Relational Algebra

Divide (DV) is a binary operator that takes two relations as input and produces
one relation as the output. The DV operation in RA is similar to the divide
operation in math. For a first relation R(Cname, Bname) and a second relation
S(Bname), the following explains what the DV operator actually does:
• First, the DV operation performs a group-by on the Cname attribute of the first
relation, which results in a set of branch names for each customer.
• Then, it checks to see if the set of Bname values associated with each unique
value of Cname is the same set as, or a superset of, the set of Bname values from
the second relation. If it is, then the customer identified by that Cname is part
of the division result.

Since DV is a derived operation, it can be expressed using the basic RA
operations as follows:

R1 = PJ Cname (R) CP S
R2 = PJ Cname (R1 SD R)
R(Cname, Bname) DV S(Bname) = PJ Cname (R) SD R2
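
The three-step derivation of DV can be traced in Python. The sketch assumes R is a set of (cname, bname) tuples and S is a set of one-element (bname,) tuples; the function name divide and the sample data are hypothetical.

```python
from itertools import product


def divide(r, s):
    """R(cname, bname) DV S(bname), built only from PJ, CP, and SD."""
    cnames = {(c,) for c, _ in r}                    # PJ cname (R)
    r1 = {c + b for c, b in product(cnames, s)}      # R1 = PJ cname (R) CP S
    r2 = {(c,) for c, _ in (r1 - r)}                 # R2 = PJ cname (R1 SD R)
    return cnames - r2                               # PJ cname (R) SD R2


r = {
    ("Ann", "Main"), ("Ann", "West"),
    ("Bob", "Main"),
    ("Cal", "Main"), ("Cal", "West"), ("Cal", "East"),
}
s = {("Main",), ("West",)}

print(sorted(divide(r, s)))  # [('Ann',), ('Cal',)]
```

R1 holds every (customer, branch) pair that would have to exist, so R1 SD R contains exactly the missing pairs. Bob is missing ("Bob", "West"), so he ends up in R2 and is removed from the result.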

Computing Relational Algebra Operations

Computing Selection

The general form of a select statement is “SLp (R).” Predicate p is either empty (in
which case all tuples in R are returned), a simple predicate, or a complex
predicate formed by combining simple predicates with “AND,” “OR,” or “NOT.”
How tuples within R are examined depends on the complexity of the predicate p
and availability of indexes on attributes of R.

No Index on R

If R does not have any index and it is not sorted, then the only way to process the
select statement is to scan all disk pages (or disk blocks) of R. Each page is
brought into memory and the tuples inside the page are examined against the
predicate “p.” Once a qualified tuple is found, it is put in the output of
the select operator. Using a relation scan is the most expensive approach to
processing a select, since every tuple in the relation must be examined. The disk
I/O cost of performing a select on R using a relation scan is equal to the number
of disk pages that R occupies, since every page has to be read once.

B+Tree Index on R

If R has an index (usually a B+Tree in most modern DBMSs) keyed on the
attribute that is used in the select predicate, we also have the option of using this
index instead of scanning the relation to qualify all the matching tuples in the
relation. A B+Tree index sorts the key values (the attribute that the index is built
on) at the leaves of the index. Whether or not an index should be used to process
a select depends on the complexity of the select predicate.

Simple Predicates

A simple predicate is of the form attribute op value, where op is
a relational operator (one of “=,” “>,” “<,” “<>,” “>=,” and “<=”).
Finding every tuple in the relation whose attribute value satisfies this simple
predicate requires traversing the index tree from the root node to a leaf.
Assuming the worst case scenario, each node in the index tree requires a
separate disk read (equivalent to needing one disk page per index node
when no I/O blocking is used); therefore, the number of disk I/Os required
to locate a given target leaf is the same as the height of the index. Once
a leaf node of the index has been identified, the next step is to read the
disk page(s) containing the actual matching tuple (or tuples) in the
relation. Usually a DBMS stores a rowid as one of the entries in the leaves of
the index, which is then used to directly access the relation’s tuples.

To calculate the actual disk I/O cost of using an index, we have to know
(1) if the index is a unique index or a nonunique index, and (2) whether the
index is clustered or nonclustered. In a unique index, the key value we are
looking for can appear at most once in the leaves of the index, while in a
nonunique index, the value may appear multiple times. In a clustered
index, the tuples that are addressed by the key entries in adjacent leaves
are stored “close to each other on the disk” (clustered). In a nonclustered
index, the tuples may be scattered across the disk. As a result, finding
a set of clustered tuples requires fewer disk I/Os than finding the same
tuples if they were not clustered.

If all of the matching tuples identified by the index are stored on only one
disk page, we need to spend at most one additional disk access to get to
the database page where the tuple(s) are stored. This is true for both a
clustered and a nonclustered index.

If, on the other hand, the index identifies matching tuple(s) that are stored
across multiple pages, the access cost could be much greater. When
multiple tuples are returned by the index search, it is important to know if
the index is clustered or not. In a clustered index, all the tuples that have
the same value for the key attribute are stored next to each other
physically on the disk, in contiguous disk pages. Therefore, for a clustered
index the cost of reading all tuples required is much less than it would be
for the nonclustered case. In the best case scenario, it might require only a
single disk I/O to read all the disk pages containing the matching tuples.
But, if the index is not clustered, the matching tuples could be scattered
on the disk, and hence, in the worst case scenario, the cost of retrieving
the tuples could be as many additional disk I/Os as the number of data
pages used by the matching tuples, which is at most the actual number
of matching tuples.
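
The cost reasoning above can be turned into a rough estimator. The Python sketch below is a simplification under stated assumptions: one disk read per index node and per data page (no I/O blocking), all matching index entries fit in one leaf, and clustered matches start at a page boundary. The function names and the numbers are hypothetical.

```python
import math


def scan_cost(relation_pages):
    """Relation scan: every page of R is read once."""
    return relation_pages


def index_select_cost(index_height, matching_tuples, tuples_per_page, clustered):
    """Worst-case page reads for an equality select through a B+Tree index."""
    if matching_tuples == 0:
        return index_height                   # only the root-to-leaf traversal
    if clustered:
        # Matching tuples sit next to each other in contiguous pages.
        data_pages = math.ceil(matching_tuples / tuples_per_page)
    else:
        # Each matching tuple may sit on a different page.
        data_pages = matching_tuples
    return index_height + data_pages


print(scan_cost(10_000))                                    # 10000
print(index_select_cost(3, 200, 50, clustered=True))        # 7
print(index_select_cost(3, 200, 50, clustered=False))       # 203
```

With these numbers, both index-based plans beat the scan, but the nonclustered index needs roughly thirty times as many reads as the clustered one for the same 200 matching tuples.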

Complex predicates

A complex predicate for a select operation has either a normal
conjunctive or normal disjunctive form. A predicate of normal conjunctive
form uses “AND” to combine simple predicates, while a normal disjunctive
form uses “OR.”

Hash Index on R

A hash index uses a hash function to calculate the address of a bucket
where the key value corresponding to the hash value is to be stored. As a
result, a hash index has only one level, as compared to the B+Tree, which
has multiple levels. To find a key value in the index, the value is run through
the hash function and a bucket address is generated. The key values in
the bucket are then searched to see if a match exists. If it does, the rowid
is used to read the actual tuple from the disk page in
the database. Similar to the B+Tree index, the hash index can be clustered
or not. In a clustered hash index, all the pages that correspond to a given
set of keys in a bucket are clustered together as contiguous pages.

One of the issues with using hash indexing is collision handling. A collision
happens when two different keys hash into the same value. All keys that
collide have to be put in the same bucket. This can cause a bucket
overflow, requiring additional pages to accommodate the collided keys.
Collisions and bucket overflows can cause additional disk I/Os when the
index is used to find matching tuples.
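
A toy hash index makes the effect of collisions and bucket overflow visible. This Python sketch is an illustration only: it assumes a fixed number of buckets, a fixed number of entries per index page, and integer keys; the class HashIndex and its methods are hypothetical.

```python
class HashIndex:
    """A toy static hash index: bucket number -> list of (key, rowid) entries."""

    def __init__(self, num_buckets, bucket_capacity):
        self.buckets = [[] for _ in range(num_buckets)]
        self.bucket_capacity = bucket_capacity    # entries that fit in one page

    def _address(self, key):
        # The hash function maps a key to a bucket address.
        return hash(key) % len(self.buckets)

    def insert(self, key, rowid):
        self.buckets[self._address(key)].append((key, rowid))

    def lookup(self, key):
        """Return (matching rowids, index pages read in the worst case)."""
        bucket = self.buckets[self._address(key)]
        # One primary page, plus overflow pages once the bucket is full.
        pages_read = max(1, -(-len(bucket) // self.bucket_capacity))
        rowids = [rid for k, rid in bucket if k == key]
        return rowids, pages_read


index = HashIndex(num_buckets=4, bucket_capacity=2)
for key, rowid in [(7, "r1"), (11, "r2"), (15, "r3"), (4, "r4")]:
    index.insert(key, rowid)

print(index.lookup(11))  # (['r2'], 2)
print(index.lookup(4))   # (['r4'], 1)
```

Keys 7, 11, and 15 collide in the same bucket, which overflows its two-entry page, so a lookup for any of them may read two index pages instead of one.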

Query Processing in Centralized Systems

The goal of the query processing subsystem of a DBMS should be to minimize the
amount of time it takes to return the answer to a user’s query.

In a centralized system, the goals of the query processor may include the
following:

• Minimize the query response time.
• Maximize the parallelism in the system.
• Maximize the system throughput.
• Minimize the total resources used (amount of memory, disk space, cache, etc.).
• Other goals?

The system might be unable to realize all of these goals. For example, minimizing
the total resource usage may not yield minimum query response time. It is
understood that minimizing the amount of memory allocated to sorting relations
can have a direct impact on how fast a relation can be sorted. Faster sorting of
the relations reduces the total amount of time needed to join two relations
using the sort–merge strategy. The more memory pages (memory frames)
allocated to the sort process the faster the sort can be done, but since total
physical memory is limited, increasing the size of sort memory decreases the
amount of memory that can be allocated to other data structures, temporary
table storage in memory, and other processes. In effect, this may increase the
query response time.

The center point of any query processor in a centralized or distributed system is
the data dictionary (DD) (the catalog). In a centralized system, the catalog
contains dictionary information about tables, indexes, views, and columns
associated with each table or index. The catalog also contains statistics about
the structures in the database. A system may store the number of pages used by
each relation and indexes, the number of rows per page for a given relation, the
number of unique values in the key columns of a given relation, the types of keys,
the number of leaf index pages, and so on.

Query Parsing and Translation

As shown in Figure 29, the first step in processing a query is parsing and
translation. During this step, the query is checked for syntax and
correctness of its data types. If this check passes, the query is translated
from its SQL representation to an equivalent relational algebra expression.

Figure 29. Query Processing Architecture for a DBE

Example 1:

Suppose we want to retrieve the name of all customers who have one
or more accounts in branches in the city of Edina. We can write the SQL
statement for this question as

Select c.Cname
From Customer c, Branch b, Account a
Where c.CID = a.CID
  AND a.Bname = b.Bname
  AND b.Bcity = 'Edina';

There are two join conditions and one select condition (known as a filter)
in this statement. The relational algebra (RA) expression that the parser
might generate is shown below:

PJ Cname (SL Customer.CID = Account.CID AND Account.Bname = Branch.Bname
AND Bcity = ‘Edina’ (Customer CP (Account CP Branch)))

The DBMS does not execute this expression as is. The expression must go
through a series of transformations and optimization before it is ready to
run. The query optimizer is the component responsible for doing that.

Query Optimization

There are three steps that make up query optimization. These are cost
estimation, plan generation, and query plan code generation. In some
DBMSs (e.g., DB2), an extra step called “Query Rewrite” is performed
before query optimization is undertaken. In query rewrite, the query
optimizer rewrites the query by eliminating redundant predicates,
expanding operations on views, eliminating redundant subexpressions,
and simplifying complex expressions such as nesting.

The RA expression for Example 1 does not run efficiently, since forming
Cartesian products of the three tables involved in the query produces
large intermediate relations. Instead, join operators are used and the
expression is rewritten as

PJ Cname ((Customer NJN Account) NJN
(Account NJN (SL Bcity = ‘Edina’ (Branch))))

This expression can be refined further by eliminating the redundant
joining of the Account relation as

PJ Cname (Customer NJN (Account NJN (SL Bcity = ‘Edina’ (Branch))))
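
The difference between the two plans for Example 1 can be seen by counting intermediate tuples on a small data set. The Python sketch below is an illustration with hypothetical sample rows; it assumes the shared attribute names CID and Bname, as in the SQL statement.

```python
from itertools import product

# Hypothetical sample data.
customer = [{"CID": i, "Cname": f"c{i}"} for i in range(1, 5)]
account = [
    {"CID": 1, "Bname": "B1"},
    {"CID": 2, "Bname": "B2"},
    {"CID": 3, "Bname": "B1"},
    {"CID": 4, "Bname": "B3"},
]
branch = [
    {"Bname": "B1", "Bcity": "Edina"},
    {"Bname": "B2", "Bcity": "Eden Prairie"},
    {"Bname": "B3", "Bcity": "Edina"},
]


def natural_join(r_rows, s_rows):
    """NJN on all attributes the two relations share by name."""
    return [
        {**r, **s}
        for r, s in product(r_rows, s_rows)
        if all(r[a] == s[a] for a in r.keys() & s.keys())
    ]


# Plan A: Cartesian products first, then one selection over all triples.
triples = list(product(customer, account, branch))
plan_a = sorted({
    c["Cname"] for c, a, b in triples
    if c["CID"] == a["CID"] and a["Bname"] == b["Bname"] and b["Bcity"] == "Edina"
})

# Plan B: select on Branch first, then join toward Customer.
edina_branches = [b for b in branch if b["Bcity"] == "Edina"]
edina_accounts = natural_join(account, edina_branches)
joined = natural_join(customer, edina_accounts)
plan_b = sorted({row["Cname"] for row in joined})

print(len(triples), len(edina_accounts), len(joined))  # 48 3 3
print(plan_a == plan_b)                                # True
```

Both plans return the same customers, but Plan A materializes 48 intermediate tuples while Plan B never holds more than 3.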

Cost Estimation

Given a query with multiple relational algebra operators,
there are usually multiple alternatives that can be used to express the
query. These alternatives are generated by applying the associative,
commutative, idempotent, distributive, and factorization properties of the
basic relational operators [Ceri84].

These properties are outlined below (the symbol “≡” stands for
equivalence):

• Unary operator (Uop) is commutative:

Uop1(Uop2(R)) ≡ Uop2(Uop1(R))

Example:

SL Bname = ‘Main’ (SL Assets > 12000000 (Branch)) ≡
SL Assets > 12000000 (SL Bname = ‘Main’ (Branch))

• Unary operator is idempotent:

Uop(R) ≡ Uop1(Uop2(R))

Example:

SL Bname = ‘Main’ AND Assets > 12000000 (Branch) ≡
SL Bname = ‘Main’ (SL Assets > 12000000 (Branch))

• Binary operator (Bop) is commutative except for set difference:

R Bop1 S ≡ S Bop1 R

Example:

Customer NJN Account ≡ Account NJN Customer

• Binary operator is associative:

R Bop1 (S Bop2 T) ≡ (R Bop1 S) Bop2 T

Example:

Customer NJN (Account NJN Branch) ≡
(Customer NJN Account) NJN Branch

• Unary operator is distributive with respect to some binary operations:

(Uop(R)) Bop (Uop(S)) ≡ Uop(R Bop S)

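Example:

SL sal > 50000 (PJ Cname, sal (Customer))
UN SL sal > 50000 (PJ Ename, sal (Employee)) ≡
SL sal > 50000 ((PJ Cname, sal (Customer)) UN (PJ Ename, sal (Employee)))

Read from left to right, this equivalence factors the unary operator out of
the binary operation, which is the inverse of the distributive property.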
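
These equivalences can be spot-checked on small relations. The Python sketch below assumes relations are sets of tuples and uses hypothetical data; agreement on one data set illustrates the rules but does not prove them.

```python
def select(rows, predicate):
    """SL over a relation modeled as a set of tuples."""
    return {t for t in rows if predicate(t)}


# Hypothetical Branch(Bname, Assets) tuples.
branch = {("Main", 15_000_000), ("West", 20_000_000), ("East", 9_000_000)}


def is_main(t):
    return t[0] == "Main"


def is_large(t):
    return t[1] > 12_000_000


# Unary operators commute: the order of two selects does not matter.
one_way = select(select(branch, is_large), is_main)
other_way = select(select(branch, is_main), is_large)

# Idempotence (cascade): one select with AND equals two nested selects.
combined = select(branch, lambda t: is_main(t) and is_large(t))

# Distribution of select over union, with hypothetical (name, sal) tuples.
customer = {("Ann", 60_000), ("Bob", 40_000)}
employee = {("Eve", 70_000), ("Dan", 30_000)}


def high_salary(t):
    return t[1] > 50_000


distributed = select(customer, high_salary) | select(employee, high_salary)
factored = select(customer | employee, high_salary)

print(one_way == other_way == combined)  # True
print(distributed == factored)           # True
```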

For query optimization discussion, it is more convenient to use a query tree
instead of the RA expression. A query tree is a tree whose leaves represent
the relations and whose internal nodes represent the query’s relational operators.
Unary operators take one relation as input and produce one relation as
output, while binary operators take two relations as input and produce
one relation as output. That is why every operator’s output can be fed into
any other operator. Results from one level’s operators are used by the next
level’s operators in the tree until the final results are gathered at the root.
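
A query tree and its bottom-up evaluation can be sketched in Python. This is an illustration only: the Node class, the operator tags, and the sample database are hypothetical, and relations are modeled as lists of dictionaries.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Node:
    op: str                          # "REL", "SL", "PJ", or "NJN"
    arg: Any = None                  # relation name, predicate, or attribute list
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def evaluate(node, database):
    """Evaluate the tree bottom-up; each operator consumes its children's output."""
    if node.op == "REL":
        return database[node.arg]
    if node.op == "SL":
        return [row for row in evaluate(node.left, database) if node.arg(row)]
    if node.op == "PJ":
        result = []
        for row in evaluate(node.left, database):
            trimmed = {a: row[a] for a in node.arg}
            if trimmed not in result:
                result.append(trimmed)
        return result
    if node.op == "NJN":
        left_rows = evaluate(node.left, database)
        right_rows = evaluate(node.right, database)
        return [
            {**l, **r}
            for l in left_rows
            for r in right_rows
            if all(l[a] == r[a] for a in l.keys() & r.keys())
        ]
    raise ValueError(f"unknown operator {node.op}")


database = {
    "Customer": [{"CID": 1, "Cname": "Ann"}, {"CID": 2, "Cname": "Bob"}],
    "Account": [{"CID": 1, "Bname": "Main"}, {"CID": 2, "Bname": "West"}],
    "Branch": [{"Bname": "Main", "Bcity": "Edina"},
               {"Bname": "West", "Bcity": "Eden Prairie"}],
}

# PJ Cname (Customer NJN (Account NJN (SL Bcity = 'Edina' (Branch))))
tree = Node("PJ", ["Cname"], Node(
    "NJN", None,
    Node("REL", "Customer"),
    Node("NJN", None,
         Node("REL", "Account"),
         Node("SL", lambda row: row["Bcity"] == "Edina", Node("REL", "Branch")))))

print(evaluate(tree, database))  # [{'Cname': 'Ann'}]
```

The leaves are the stored relations, and the result of each subtree flows upward until the root produces the final answer.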

Plan Generation

In this optimization step, the query plan is generated. A query plan (or
simply, a plan, as it is known by almost all DBMSs) is an extended
query tree that includes access paths for all operations in the tree. Access
paths provide detailed information on how each operation in the tree is to
be performed.

In addition to the access paths specified for each individual RA operator,
the plan also specifies how the intermediate relations should be passed
from one operator to the next: materializing temporary tables and/or
pipelining can be used. Furthermore, operation combinations may also be
specified.

Query optimization has been researched for many years and is still the
focus of research due to its importance in centralized and distributed
DBEs. Among the proposed algorithms, exhaustive search and heuristics-based
algorithms are most popular. The difference between these two
approaches is in their time and space complexity requirements and
the quality of the plans they generate.

Exhaustive Search Optimization

Algorithms in this class will first form all possible query plans for a given
query and then select the “best” plan for the query. Dynamic
programming (DP) is an example of an algorithm in this class. Since the
solution space containing all the possible query execution alternatives
is very large, DP has an exponential time and space complexity.
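
The idea of forming every plan and keeping the cheapest one can be shown with a brute-force enumeration of left-deep join orders. Note that this sketch is plain enumeration, not dynamic programming, and that its cost model (the total size of all intermediate results, estimated from assumed cardinalities and join selectivities) is a hypothetical simplification.

```python
from fractions import Fraction
from itertools import permutations


def plan_cost(order, cardinality, selectivity):
    """Toy cost: total estimated size of all intermediate results."""
    joined = [order[0]]
    size = cardinality[order[0]]
    cost = 0
    for rel in order[1:]:
        size *= cardinality[rel]
        for prev in joined:
            # A pair without a join predicate behaves like a cross product.
            size *= selectivity.get(frozenset((prev, rel)), 1)
        joined.append(rel)
        cost += size
    return cost


def exhaustive_search(cardinality, selectivity):
    """Enumerate every left-deep join order (n! of them) and keep the cheapest."""
    return min(
        permutations(cardinality),
        key=lambda order: plan_cost(order, cardinality, selectivity),
    )


# Assumed statistics; EdinaBranch stands for SL Bcity = 'Edina' (Branch).
cardinality = {"Customer": 1000, "Account": 2000, "EdinaBranch": 2}
selectivity = {
    frozenset(("Customer", "Account")): Fraction(1, 1000),
    frozenset(("Account", "EdinaBranch")): Fraction(1, 10),
}

best = exhaustive_search(cardinality, selectivity)
print(best)                                       # ('Account', 'EdinaBranch', 'Customer')
print(plan_cost(best, cardinality, selectivity))  # 800
```

With three relations there are only 6 orders to examine, but the count grows as n!, which is why exhaustive approaches become impractical for queries that join many relations.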

Heuristics-Based Optimization

Originally, most commercial DBMSs used heuristics-based approaches,
which are also known as rule-based optimization (RBO) approaches, as
their optimization technique. The algorithms in this class have a
polynomial time-and-space complexity as compared to the exponential
complexity of the exhaustive search-based algorithms, but the
heuristics-based algorithms do not always generate the best query plan. In this
approach, a small set of rules (heuristics) are used to order the operators in
an RA expression without much regard to the database’s statistics. One
such heuristic is based on the observation that if we reduce the size of the
input relations to a join, we reduce the overall cost of the operation. This
rule is enforced by applying the select and/or project operations before
the join. This rule is also known as “pushing” the select and project toward
the leaves of the query tree. Another heuristic joins the two smallest tables
together first. This is based on the belief that since not all tuples from the
relations being joined satisfy the join condition, fewer tuples will be used in
the next operation of the query tree, resulting in a smaller overall cost for
the query. Another popular heuristic is simply to disallow/avoid using the
cross-product operation, which is based on the fact that the intermediate
result of this operation is usually a huge relation.
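
The “pushing” heuristic can be sketched as a tree rewrite. In this Python illustration a query tree is a nested tuple, only equality selects and natural joins are handled, and the SCHEMA attribute lists are assumptions made for the example; none of this reflects a specific DBMS.

```python
# Tree encoding: ("REL", name), ("SL", attribute, value, child), ("NJN", left, right).
SCHEMA = {  # hypothetical attribute lists
    "Customer": {"CID", "Cname", "Ccity"},
    "Account": {"CID", "Bname", "Bal"},
    "Branch": {"Bname", "Bcity"},
}


def attributes(tree):
    """Attributes available in the output of a subtree."""
    if tree[0] == "REL":
        return SCHEMA[tree[1]]
    if tree[0] == "SL":
        return attributes(tree[3])
    return attributes(tree[1]) | attributes(tree[2])    # NJN


def push_select(tree):
    """Move each select as close to the leaves as its attribute allows."""
    if tree[0] == "REL":
        return tree
    if tree[0] == "NJN":
        return ("NJN", push_select(tree[1]), push_select(tree[2]))
    _, attr, value, child = tree                        # an SL node
    child = push_select(child)
    if child[0] == "NJN":
        left, right = child[1], child[2]
        if attr in attributes(left):
            return ("NJN", push_select(("SL", attr, value, left)), right)
        if attr in attributes(right):
            return ("NJN", left, push_select(("SL", attr, value, right)))
    return ("SL", attr, value, child)


# SL Bcity = 'Edina' (Customer NJN (Account NJN Branch))
before = ("SL", "Bcity", "Edina",
          ("NJN", ("REL", "Customer"),
           ("NJN", ("REL", "Account"), ("REL", "Branch"))))
after = push_select(before)
print(after)
```

The rewritten tree applies the select directly to Branch, so the joins above it receive only the Edina branches.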

Code Generation

The last step in query optimization is code generation. Code is the final
representation of the query and is executed or interpreted depending on
the type of operating system or hardware. The query code is turned over
to the execution manager (EM) to execute.
