Chapter 1 Part II
QUERY ALGORITHMS
Queries are ultimately reduced to a number of file
scan operations on the underlying physical file
structures.
For each relational operation, there can exist several
different access paths to the particular records
needed.
The query execution engine can have a multitude of
specialized algorithms designed to process particular
relational operation and access path combinations.
We will look at some examples of algorithms for both
the select and join operations.
Selection Algorithms
The Select operation must search through the
data files for records meeting the selection
criteria. The following are some examples of
simple (one attribute) selection algorithms:
Linear search
Every record from the file is read and compared
to the selection criteria. The execution cost for
searching on a non-key attribute is br, where br
is the number of blocks in the file representing
relation r. On a key attribute, the scan can stop at
the first match, so the average cost is
br / 2, with a worst case of br.
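The linear search cost can be illustrated with a minimal Python sketch. It assumes a file is modelled as a list of blocks and each block as a list of records (dicts); the function name linear_search and the key_attribute flag are illustrative, not part of any real DBMS.

```python
def linear_search(blocks, predicate, key_attribute=False):
    """Scan the file block by block; return (matches, blocks_read).

    blocks        -- list of blocks, each block a list of record dicts (assumed layout)
    predicate     -- function returning True for records meeting the selection criteria
    key_attribute -- if True, at most one record can match, so stop at the first hit
    """
    matches = []
    blocks_read = 0
    for block in blocks:
        blocks_read += 1  # one block access per block scanned
        for record in block:
            if predicate(record):
                matches.append(record)
                if key_attribute:
                    return matches, blocks_read
    return matches, blocks_read
```

On a non-key attribute the function always reads all br blocks. With key_attribute=True it stops at the block holding the match, which is what gives the br / 2 average.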
Binary search on primary key
Search using a primary index on equality
Search using a primary index on comparison
Search using a secondary index on equality
Join Algorithms
Like selection, the join operation can be
implemented in a variety of ways. In terms of
disk accesses, the join operations can be very
expensive, so implementing and utilizing
efficient join algorithms is critical in
minimizing a query’s execution time. The
following are four well-known types of join
algorithms:
Nested-Loop Join
This algorithm consists of an inner for loop nested
within an outer for loop. To illustrate this
algorithm, we will use the following notations:
r, s   Relations r and s
tr     Tuple (record) in relation r
ts     Tuple (record) in relation s
nr     Number of records in relation r
ns     Number of records in relation s
br     Number of blocks with records in relation r
bs     Number of blocks with records in relation s
Here is a sample pseudocode listing for
joining the two relations r and s using
nested for loops:
for each tuple tr in r
    for each tuple ts in s
        if join condition is true for (tr, ts)
            add tr+ts to the result
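The nested-loop pseudocode can be written as a short runnable Python sketch. It assumes relations are in-memory lists of dicts and that "tr+ts" means merging the two records' attributes; the function name nested_loop_join is illustrative.

```python
def nested_loop_join(r, s, condition):
    """Join relations r and s (lists of dicts) with a nested for loop."""
    result = []
    for tr in r:                      # outer relation: each tuple visited once
        for ts in s:                  # inner relation: rescanned for every tr
            if condition(tr, ts):     # join condition on the pair (tr, ts)
                result.append({**tr, **ts})  # tr+ts: merged attributes
    return result
```

Because the join condition is an arbitrary function, this form handles any join predicate, not only equality.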
Each record in the outer relation r is scanned once, and each
record in the inner relation s is scanned nr times, resulting
in nr * ns total record comparisons. If only one block of each
relation can fit into memory, then the worst-case cost (number of
block accesses) is nr * bs + br. If all blocks in both relations
can fit into memory, then the cost is br + bs. If only the blocks
of relation s (the inner relation) can fit into memory, then the
cost is the same as when both relations fit in memory:
br + bs.
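To make the two cost cases concrete, the formulas can be wrapped in a small Python helper. The function name and the sample figures in the comment (5,000 records, 100 and 400 blocks) are hypothetical values chosen only for illustration.

```python
def nested_loop_cost(n_r, b_r, b_s, inner_fits_in_memory=False):
    """Estimated block accesses for a nested-loop join of r (outer) and s (inner).

    n_r -- number of records in r
    b_r -- number of blocks in r
    b_s -- number of blocks in s
    """
    if inner_fits_in_memory:
        # s is read once and kept in memory while r is scanned once.
        return b_r + b_s
    # One block per relation in memory: s is rescanned for every record of r.
    return n_r * b_s + b_r

# Hypothetical figures: nr = 5000, br = 100, bs = 400
worst = nested_loop_cost(5000, 100, 400)
best = nested_loop_cost(5000, 100, 400, inner_fits_in_memory=True)
```

With these assumed figures the worst case is 5000 * 400 + 100 = 2,000,100 block accesses, against 100 + 400 = 500 when s fits in memory.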
Index Nested-Loop Join
This algorithm is the same as the Nested-Loop
Join, except that an index file on the inner
relation’s (s) join attribute is used instead of a
data-file scan on s. Each index lookup in
the inner loop is essentially an equality
selection on s using one of the selection
algorithms. Let c be the cost of one lookup;
then the worst-case cost for joining r and s is
br + nr * c.
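A minimal Python sketch of the index nested-loop join follows. It assumes an in-memory dict stands in for the index file on s, and that the join is an equality join on a single attribute; build_index and index_nested_loop_join are illustrative names.

```python
from collections import defaultdict

def build_index(s, attr):
    """Stand-in for an index file on s: maps join-attribute value -> records."""
    index = defaultdict(list)
    for ts in s:
        index[ts[attr]].append(ts)
    return index

def index_nested_loop_join(r, s_index, attr):
    """For each tuple of r, look up matching tuples of s through the index."""
    result = []
    for tr in r:                                  # outer relation scanned once
        for ts in s_index.get(tr[attr], []):      # one index lookup (cost c) per tr
            result.append({**tr, **ts})
    return result
```

The inner for loop over all of s is replaced by one lookup per outer tuple, which is where the nr * c term in the cost comes from.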
Sort-Merge Join
This algorithm can be used to perform natural
joins and equi-joins and requires that each
relation (r and s) be sorted by the common
attributes between them (R ∩ S).
Once both relations are sorted, each record in r
and s is scanned only once, thus producing a worst-
and best-case cost of br + bs (not counting the
cost of any sorting that is needed first).
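The merge step can be sketched in Python as follows. This is an illustrative in-memory version that assumes a single join attribute and sorts both inputs itself; a real engine would work on sorted files block by block.

```python
def sort_merge_join(r, s, attr):
    """Equi-join r and s (lists of dicts) on attr using sort then merge."""
    r = sorted(r, key=lambda t: t[attr])
    s = sorted(s, key=lambda t: t[attr])
    result = []
    i = j = 0
    while i < len(r) and j < len(s):
        if r[i][attr] < s[j][attr]:
            i += 1                      # advance the side with the smaller value
        elif r[i][attr] > s[j][attr]:
            j += 1
        else:
            key = r[i][attr]
            # Find the run of equal join values on each side.
            i_end = i
            while i_end < len(r) and r[i_end][attr] == key:
                i_end += 1
            j_end = j
            while j_end < len(s) and s[j_end][attr] == key:
                j_end += 1
            # Pair every record in r's run with every record in s's run.
            for tr in r[i:i_end]:
                for ts in s[j:j_end]:
                    result.append({**tr, **ts})
            i, j = i_end, j_end
    return result
```

Both cursors only move forward, so each sorted input is passed over once, matching the br + bs cost of the merge phase.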
Hash Join
Like the sort-merge join, the hash join
algorithm can be used to perform natural joins
and equi-joins.
The hash join uses two hash table file
structures (one for each relation) to
partition each relation’s records into sets
containing identical hash values on the join
attributes.
Each relation is scanned and its corresponding
hash table on the join attribute values is built.
Note that collisions may occur, so some partitions
contain several distinct sets of records, each set
sharing its own join attribute value.
After the two hash tables are built, for each
pair of matching partitions in the hash tables, an
in-memory hash index of the smaller relation’s (the
build relation) records is built and a nested-loop
join is performed against the corresponding
records in the other relation, writing each joined
record out to the result.
Note that the above works only if enough
memory is available to hold the
hash index and the records in any
partition of the build relation. If not, then a
process known as recursive partitioning is
performed.
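The partition, build and probe steps of the hash join can be sketched in Python as follows. This sketch assumes in-memory lists of partitions in place of hash table files, uses Python's built-in hash() as the partitioning function, and does not implement recursive partitioning; all names are illustrative.

```python
from collections import defaultdict

def partition(relation, attr, n_partitions):
    """Split a relation into partitions by hash of the join attribute."""
    parts = [[] for _ in range(n_partitions)]
    for t in relation:
        parts[hash(t[attr]) % n_partitions].append(t)
    return parts

def hash_join(r, s, attr, n_partitions=4):
    """Equi-join r and s on attr; the smaller input is the build relation."""
    build, probe = (r, s) if len(r) <= len(s) else (s, r)
    build_parts = partition(build, attr, n_partitions)
    probe_parts = partition(probe, attr, n_partitions)
    result = []
    # Only partitions with the same number can contain matching records.
    for b_part, p_part in zip(build_parts, probe_parts):
        index = defaultdict(list)          # in-memory hash index on the build side
        for tb in b_part:
            index[tb[attr]].append(tb)
        for tp in p_part:                  # probe with the other relation's records
            for tb in index.get(tp[attr], []):
                result.append({**tb, **tp})
    return result
```

The in-memory index separates records that collided into the same partition but carry different join values, so only true matches are written to the result.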
QUERY OPTIMIZATION
The function of a DBMS’ query optimization
engine is to find an evaluation plan that
reduces the overall execution cost of a query.
We have seen in the previous sections that the
costs for performing particular operations
such as select and join can vary quite
dramatically. As an example, consider 2
relations r and s, with the following
characteristics:
In heuristic-based optimization,
mathematical rules are applied to the
components of the query to generate an
evaluation plan that, theoretically, will
result in a lower execution time.
Typically, these components are the
data elements within an internal data
structure, such as a query tree, that the
query parser has generated from a
higher-level representation of the query
(e.g., SQL).
Another way of optimizing a query is
semantic-based query optimization. In many cases, the
data within and between relations contain
“rules” and patterns that are based upon
“real-world” situations that the DBMS does
not “know” about. For example, vehicles like
the Delorean were not made after 1990, so a
query like “Retrieve all vehicles with make
equal to Delorean and year > 2000” will
produce zero records. Injecting these types of
semantic rules into a DBMS can thus further
reduce a query’s execution time.
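A semantic rule of this kind can be sketched in Python. The sketch assumes the rule is stored as a hypothetical table mapping a make to the last year it was produced (the 1990 value is taken from the example above), and the function name find_vehicles is illustrative.

```python
def find_vehicles(vehicles, make, year_greater_than, last_year_by_make):
    """Return (matches, records_scanned) for: make = ? AND year > ?.

    last_year_by_make -- hypothetical semantic rule table: make -> last production year
    """
    last_year = last_year_by_make.get(make)
    if last_year is not None and year_greater_than >= last_year:
        # The rule proves the result is empty, so the data is never scanned.
        return [], 0
    matches = [
        v for v in vehicles
        if v["make"] == make and v["year"] > year_greater_than
    ]
    return matches, len(vehicles)

# Rule from the example above: no Delorean was made after 1990.
rules = {"Delorean": 1990}
```

When the rule contradicts the query predicate, the answer is returned without touching the data file; otherwise the query falls back to an ordinary scan.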