4.6 Algorithms For Select and Join Operations
4.6 Algorithms For Select and Join Operations
CS8492
DATABASE MANAGEMENT SYSTEMS
(Common to CSE & IT)
UNIT NO. 4
Version: 1.0
CS8492
There are many algorithms for executing a SELECT operation, which is basically a search
operation to locate the records in a disk file that satisfy a certain condition. Some of the search
algorithms depend on the file having specific access paths, and they may apply only to certain types
of selection conditions. We discuss some of the algorithms for implementing SELECT in this
section. We will use the following operations, specified on the relational database in Figure 3.5, to
illustrate our dis-cussion:
Search Methods for Simple Selection. A number of search algorithms are pos-sible for
selecting records from a file. These are also known as file scans, because they scan the records of a
file to search for and retrieve records that satisfy a selec-tion condition.4 If the search algorithm
involves the use of an index, the index search is called an index scan. The following search methods
(S1 through S6) are examples of some of the search algorithms that can be used to implement a
select operation:
S1—Linear search (brute force algorithm). Retrieve every record in the file, and test whether
its attribute values satisfy the selection condition. Since the records are grouped into disk blocks,
each disk block is read into a main memory buffer, and then a search through the records within the
disk block is conducted in main memory.
S2—Binary search. If the selection condition involves an equality compari-son on a key
attribute on which the file is ordered, binary search—which is
more efficient than linear search—can be used. An example is OP1 if Ssn is the ordering attribute for
the EMPLOYEE file.5
S3a—Using a primary index. If the selection condition involves an equality comparison on a
key attribute with a primary index—for example, Ssn = ‘123456789’ in OP1—use the primary
index to retrieve the record. Note that this condition retrieves a single record (at most).
S3b—Using a hash key. If the selection condition involves an equality com-parison on a key
attribute with a hash key—for example, Ssn = ‘123456789’ in OP1—use the hash key to retrieve
the record. Note that this condition retrieves a single record (at most).
S4—Using a primary index to retrieve multiple records. If the comparison condition is >,
>=, <, or <= on a key field with a primary index—for exam-ple, Dnumber > 5 in OP2—use the index
CS8492
to find the record satisfying the cor-responding equality condition (Dnumber = 5), then retrieve all
subsequent records in the (ordered) file. For the condition Dnumber < 5, retrieve all the preceding
records.
S5—Using a clustering index to retrieve multiple records. If the selection condition involves
an equality comparison on a nonkey attribute with a clustering index—for example, Dno = 5 in
OP3—use the index to retrieve all the records satisfying the condition.
S6—Using a secondary (B+-tree) index on an equality comparison. This search method can
be used to retrieve a single record if the indexing field is a key (has unique values) or to retrieve
multiple records if the indexing field is not a key. This can also be used for comparisons involving
>, >=, <, or <=.
In Section 19.8, we discuss how to develop formulas that estimate the access cost of these
search methods in terms of the number of block accesses and access time. Method S1 (linear
search) applies to any file, but all the other methods depend on having the appropriate access path on
the attribute used in the selection condition. Method S2 (binary search) requires the file to be sorted
on the search attribute. The methods that use an index (S3a, S4, S5, and S6) are generally referred to
as index searches, and they require the appropriate index to exist on the search attribute. Methods S4
and S6 can be used to retrieve records in a certain range—for example, 30000 <= Salary <= 35000.
Queries involving such conditions are called range queries.
S8—Conjunctive selection using a composite index. If two or more attrib-utes are involved in
equality conditions in the conjunctive select condition and a composite index (or hash structure)
exists on the combined fields— for example, if an index has been created on the composite key
(Essn, Pno) of the WORKS_ON file for OP5—we can use the index directly.
satisfies the remaining conditions.7 In general, method S9 assumes that each of the indexes is on a
nonkey field of the file, because if one of the conditions is an equality condition on a key field, only
one record will satisfy the whole condition.
Whenever a single condition specifies the selection—such as OP1, OP2, or OP3— the
DBMS can only check whether or not an access path exists on the attribute involved in that
condition. If an access path (such as index or hash key or sorted file) exists, the method
corresponding to that access path is used; otherwise, the brute force, linear search approach of
method S1 can be used. Query optimization for a SELECT operation is needed mostly for
conjunctive select conditions whenever more than one of the attributes involved in the conditions
have an access path. The optimizer should choose the access path that retrieves the fewest records in
the most efficient way by estimating the different costs (see Section 19.8) and choosing the method
with the least estimated cost.
Although exact selectivities of all conditions may not be available, estimates of selectivities
are often kept in the DBMS catalog and are used by the optimizer. For example, for an equality
condition on a key attribute of relation r(R), s = 1/|r(R)|, where |r(R)| is the number of tuples in
relation r(R). For an equality condition on a nonkey attribute with i distinct values, s can be
estimated by (|r(R)|/i)/|r(R)| or 1/i, assuming that the records are evenly or uniformly distributed
among the distinct values. Under this assumption, |r(R)|/i records will satisfy an equality condition
on this attribute. In general, the number of records satisfying a selection condition with selectivity sl
is estimated to be |r(R)| * sl. The smaller this estimate is, the higher the desirability of using that
condition first to retrieve records. In certain cases, the actual distribution of records among the
various distinct values of the attribute is kept by the DBMS in the form of a histogram, in order to
get more accurate esti-mates of the number of records that satisfy a particular condition.
With such a condition, little optimization can be done, because the records satisfying the
disjunctive condition are the union of the records satisfying the individual conditions. Hence, if any
one of the conditions does not have an access path, we are compelled to use the brute force, linear
CS8492
search approach. Only if an access path exists on every simple condition in the disjunction can we
optimize the selection by retrieving the records satisfying each condition—or their record ids—and
then applying the union operation to eliminate duplicates.
A DBMS will have available many of the methods discussed above, and typically many
additional methods. The query optimizer must choose the appropriate one for executing each
SELECT operation in a query. This optimization uses formulas that estimate the costs for each
available access method, as we will discuss in Section 19.8. The optimizer chooses the access
method with the lowest estimated cost.
CS8492