Cont’d… The query must also be validated by checking that all attribute and relation names are valid and semantically meaningful in the schema. An internal representation of the query is then created, usually as a tree data structure called a query tree or query graph. The DBMS must then devise an execution strategy or query plan for retrieving the results of the query from the database files.
Cont’d… A query typically has many possible execution strategies, and the process of choosing a suitable one for processing a query is known as query optimization.
Cont’d… The query optimizer module has the task of producing a good execution plan. The code generator generates the code to execute that plan. The runtime database processor has the task of running (executing) the query code, whether in compiled or interpreted mode, to produce the query result. If a runtime error results, an error message is generated by the runtime database processor.
Cont’d… Finding the optimal strategy is usually too time-consuming—except for the simplest of queries. Hence, planning of a good execution strategy may be a more accurate description than query optimization. A relational DBMS must systematically evaluate alternative query execution strategies and choose a reasonably efficient or near-optimal strategy.
Cont’d… Each DBMS typically has a number of general database access algorithms that implement relational algebra operations such as SELECT or JOIN. Only execution strategies that can be implemented by the DBMS access algorithms, and that apply to the particular query and to the particular physical database design, can be considered by the query optimization module.
Cont’d… There are two main techniques that are employed during query optimization. The first technique is based on heuristic rules for ordering the operations in a query execution strategy. A heuristic is a rule that works well in most cases but is not guaranteed to work well in every case. The rules typically reorder the operations in a query tree.
Cont’d… The second technique involves systematically estimating the cost of different execution strategies and choosing the execution plan with the lowest cost estimate. These techniques are usually combined in a query optimizer.
Translating SQL into Relational Algebra An SQL query is first translated into an equivalent extended relational algebra expression—represented as a query tree data structure—that is then optimized. Typically, SQL queries are decomposed into query blocks, which form the basic units that can be translated into the algebraic operators and optimized.
Cont’d… A query block contains a single SELECT-FROM-WHERE expression, as well as GROUP BY and HAVING clauses if these are part of the block. Nested queries are identified as separate blocks. For example, a query whose WHERE clause contains a nested subquery—say, Salary > ( SELECT MAX (Salary) FROM EMPLOYEE WHERE Dno = 5 )—is decomposed into an outer block and an inner block, each of which is translated and optimized separately.
Algorithm for external sorting Sorting is one of the primary algorithms used in query processing. External sorting refers to sorting algorithms that are suitable for large files of records stored on disk that do not fit entirely in main memory, such as most database files. The typical external sorting algorithm uses a sort-merge strategy, which starts by sorting small subfiles (runs) of the main file and then merges the sorted runs into larger and larger sorted subfiles.
Cont’d… The sort-merge algorithm, like other database algorithms, requires buffer space in main memory, where the actual sorting and merging of the runs is performed. In the sorting phase, runs (portions or pieces) of the file that can fit in the available buffer space are read into main memory, sorted using an internal sorting algorithm, and written back to disk as temporary sorted subfiles (or runs).
Cont’d… In the merging phase, the sorted runs are merged during one or more merge passes. Each merge pass can have one or more merge steps. The degree of merging (dM) is the number of sorted subfiles that can be merged in each merge step. The performance of the sort-merge algorithm can be measured in the number of disk block reads and writes before the sorting of the whole file is completed.
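The two phases can be pictured with the minimal Python sketch below; it assumes the file is simply an in-memory list of records and that a hypothetical `buffer_size` records fit in the buffer at once, and it merges all runs in a single step rather than at most dM runs per merge step as a real DBMS would.

```python
import heapq

def external_sort(records, buffer_size):
    """Rough sketch of two-phase sort-merge (not actual DBMS code)."""
    # Sorting phase: cut the file into runs that fit in the buffer,
    # sort each run in memory, and keep it as a temporary sorted subfile.
    runs = []
    for i in range(0, len(records), buffer_size):
        runs.append(sorted(records[i:i + buffer_size]))

    # Merging phase: merge the sorted runs; heapq.merge plays the role
    # of a multiway merge over the runs (here, all runs in one step).
    return list(heapq.merge(*runs))

print(external_sort([7, 3, 9, 1, 8, 2, 5], buffer_size=3))
```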
Algorithms for SELECT and JOIN Implementing the SELECT Operation There are many algorithms for executing a SELECT operation, which is basically a search operation to locate the records in a disk file that satisfy a certain condition.
Search Methods for Simple Selection A number of search algorithms are possible for selecting records from a file. These are also known as file scans, because they scan the records of a file that satisfy a selection condition. If the search algorithm involves the use of an index, the index search is called an index scan.
Search Algorithms for SELECT Operation S1—Linear search (brute force algorithm) Retrieve every record in the file, and test whether its attribute values satisfy the selection condition. Each disk block is read into a main memory buffer, and then a search through the records within the disk block is conducted in main memory.
Cont’d… S2—Binary search. If the selection condition involves an equality comparison on a key attribute on which the file is ordered, binary search—which is more efficient than linear search—can be used. S3a—Using a primary index. If the selection condition involves an equality comparison on a key attribute with a primary index—for example, Ssn = ‘123456789’—use the primary index to retrieve the single matching record.
Cont’d… S3b—Using a hash key. If the selection condition involves an equality comparison on a key attribute with a hash key—for example, Ssn = ‘123456789’ in OP1—use the hash key to retrieve the single matching record. S4—Using a primary index to retrieve multiple records. If the comparison condition is >, >=, <, or <= on a key field with a primary index—for example, Dnumber > 5 in OP2—use the index to find the record satisfying the corresponding equality condition (Dnumber = 5), then retrieve all subsequent records in the (ordered) file.
Cont’d… S5—Using a clustering index to retrieve multiple records. If the selection condition involves an equality comparison on a nonkey attribute with a clustering index, use the index to retrieve all the records satisfying the condition. S6—Using a secondary (B+-tree) index on an equality comparison. This search method can be used to retrieve a single record if the indexing field is a key (has unique values).
Cont’d… It is used to retrieve multiple records if the indexing field is not a key. This can also be used for comparisons involving >, >=, <, or <=.
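As a rough illustration of the difference between S1 and S2, the sketch below contrasts a linear scan with a binary search over a file that is ordered on the search key; the record layout and function names are assumptions made up for the example.

```python
from bisect import bisect_left

employees = [  # file ordered on Ssn (the ordering key attribute)
    ("111111111", "Alice"), ("123456789", "Bob"), ("555555555", "Carol"),
]

def linear_search(file, ssn):      # S1: brute force, examines every record
    return [rec for rec in file if rec[0] == ssn]

def binary_search(file, ssn):      # S2: usable only because the file is ordered on Ssn
    keys = [rec[0] for rec in file]
    i = bisect_left(keys, ssn)
    return [file[i]] if i < len(file) and file[i][0] == ssn else []

print(linear_search(employees, "123456789"))
print(binary_search(employees, "123456789"))
```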
Search Methods for Complex Selection If a condition of a SELECT operation is a conjunctive condition—that is, if it is made up of several simple conditions connected with the AND logical connective—the DBMS can use the following additional methods to implement the operation: S7—Conjunctive selection using an individual index. If an attribute involved in any single simple condition in the conjunctive select condition has an access path that permits the use of one of the methods S2 to S6, use that method to retrieve the records and then check whether each retrieved record satisfies the remaining simple conditions.
Cont’d… S8—Conjunctive selection using a composite index. If two or more attributes are involved in equality conditions in the conjunctive select condition and a composite index (or hash structure) exists on the combined fields, the composite index can be used directly.
Cont’d… S9—Conjunctive selection by intersection of record pointers. If secondary indexes (or other access paths) are available on more than one of the fields involved in simple conditions in the conjunctive select condition, and if the indexes include record pointers (rather than block pointers), then each index can be used to retrieve the set of record pointers that satisfy the individual condition. The intersection of these sets of pointers gives the pointers to the records that satisfy the conjunctive condition, and only those records are then retrieved.
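A minimal sketch of S9, assuming each secondary index is modeled as a dictionary that maps an attribute value to the set of record pointers (RIDs) for that value; the index names and data are hypothetical.

```python
# Hypothetical secondary indexes: attribute value -> set of record pointers (RIDs)
dno_index    = {4: {10, 11}, 5: {12, 13, 14}}
salary_index = {30000: {11, 13}, 40000: {12, 14}}

def conjunctive_select(indexes_and_values):
    """S9: fetch the pointer set for each simple condition, then intersect them."""
    pointer_sets = [index.get(value, set()) for index, value in indexes_and_values]
    # Only the records whose pointers survive the intersection are read from disk.
    return set.intersection(*pointer_sets) if pointer_sets else set()

# Record pointers satisfying Dno = 5 AND Salary = 40000
print(conjunctive_select([(dno_index, 5), (salary_index, 40000)]))
```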
Cont’d… Whenever a single condition specifies the selection—such as OP1, OP2, or OP3—the DBMS can only check whether an access path exists on the attribute involved in that condition. If an access path (index or hash key or sorted file) exists, the method corresponding to that access path is used; otherwise, the brute force, linear search approach of method S1 can be used.
Cont’d… Query optimization for a SELECT operation is needed mostly for conjunctive select conditions whenever more than one of the attributes involved in the conditions has an access path. The optimizer should choose the access path that retrieves the fewest records in the most efficient way, by estimating the different costs and choosing the method with the least estimated cost.
Selectivity of a Condition When the optimizer is choosing between multiple simple conditions in a conjunctive select condition, it typically considers the selectivity of each condition. The selectivity (sl) is defined as the ratio of the number of records (tuples) that satisfy the condition to the total number of records in the file (relation), and hence is a number between 0 and 1. Zero selectivity means none of the records in the file satisfies the condition, and a selectivity of 1 means that all the records satisfy it.
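For instance, under the usual assumption of uniformly distributed values, the selectivity of an equality condition on an attribute with NDV distinct values can be estimated as 1/NDV, and the estimated result size is sl × r. The numbers below are invented purely for illustration.

```python
r = 10000          # assumed number of EMPLOYEE records
ndv_dno = 125      # assumed number of distinct Dno values

sl = 1 / ndv_dno                  # estimated selectivity of an equality on Dno
estimated_result_size = sl * r    # expected number of qualifying records

print(sl, estimated_result_size)  # 0.008 and 80 records
```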
Disjunctive Selection Conditions Compared to a conjunctive selection condition, a disjunctive condition (where simple conditions are connected by the OR logical connective rather than by AND) is much harder to process and optimize. For example, consider OP4: OP4: σDno=5 OR Salary > 30000 OR Sex=‘F’ (EMPLOYEE)
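One common estimate, assuming the simple conditions are statistically independent, is sl(c1 OR c2) = sl(c1) + sl(c2) − sl(c1) · sl(c2). In general, the result of a disjunctive selection is the union of the records retrieved for the individual conditions, and if any one of the conditions has no access path, the DBMS must fall back on a brute-force linear search.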
Implementing the JOIN Operation The JOIN operation is one of the most time-consuming operations in query processing. Many of the join operations encountered in queries are of the EQUIJOIN and NATURAL JOIN varieties.
Methods for Implementing Joins. J1—Nested-loop join (or nested-block join). This is the default (brute force) algorithm, as it does not require any special access paths on either file in the join. For each record t in R (outer loop), retrieve every record s from S (inner loop) and test whether the two records satisfy the join condition.
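A minimal sketch of J1 in Python, treating each file as an in-memory list of tuples; a real implementation loops over disk blocks rather than individual records, and the relations and join attributes below are invented for the example.

```python
# Hypothetical relations: EMPLOYEE(Name, Dno) and DEPARTMENT(Dnumber, Dname)
R = [("Alice", 5), ("Bob", 4)]
S = [(4, "Administration"), (5, "Research")]

def nested_loop_join(R, S):
    """J1: for each record of the outer file, scan the entire inner file."""
    result = []
    for t in R:                  # outer loop over R
        for s in S:              # inner loop over S
            if t[1] == s[0]:     # join condition R.Dno = S.Dnumber
                result.append(t + s)
    return result

print(nested_loop_join(R, S))
```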
Cont’d… J2—Single-loop join (using an access structure to retrieve the matching records). If an index (or hash key) exists for one of the two join attributes—say, attribute B of file S—retrieve each record t in R (loop over file R), and then use the access structure to directly retrieve all matching records s from S that satisfy s[B] = t[A].
Cont’d… J3—Sort-merge join. If the records of R and S are physically sorted (ordered) by value of the join attributes A and B, respectively, we can implement the join in the most efficient way possible: both files are scanned concurrently in order of the join attributes, matching records that have the same values for A and B. J4—Partition-hash join. The records of files R and S are partitioned into smaller files, using the same hashing function h on the join attribute A of R (for partitioning file R) and on B of S (for partitioning file S).
Cont’d… First, a single pass through the file with fewer records (say, R) hashes its records to the various partitions of R; this is called the partitioning phase, since the records of R are partitioned into the hash buckets. Records with the same value of h(A) are placed in the same partition, which is a hash bucket in a hash table in main memory.
Cont’d… In the second phase, called the probing phase, a single pass through the other file (S) then hashes each of its records using the same hash function h(B) to probe the appropriate bucket, and that record is combined with all matching records from R in that bucket.
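The two phases can be sketched roughly as follows, under the simplifying assumption that the smaller file R fits its hash table entirely in memory (this is the in-memory special case; the general algorithm first partitions both files to disk). The relations and attribute positions are the same hypothetical ones used above.

```python
from collections import defaultdict

def partition_hash_join(R, S, a, b):
    """J4 sketch: partition R on attribute a, then probe with each record of S."""
    buckets = defaultdict(list)
    for t in R:                              # partitioning phase (one pass over R)
        buckets[hash(t[a])].append(t)

    result = []
    for s in S:                              # probing phase (one pass over S)
        for t in buckets.get(hash(s[b]), []):
            if t[a] == s[b]:                 # recheck: a bucket may hold hash collisions
                result.append(t + s)
    return result

R = [("Alice", 5), ("Bob", 4)]
S = [(4, "Administration"), (5, "Research")]
print(partition_hash_join(R, S, a=1, b=0))
```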
Algorithms for PROJECT and SET Operations A PROJECT operation π<attribute list>(R) is straightforward to implement if <attribute list> includes a key of relation R. If <attribute list> does not include a key of R, duplicate tuples must be eliminated, typically by sorting. Hashing can also be used to eliminate duplicates: as each record is hashed and inserted into a bucket of the hash file in memory, it is checked against those records already in the bucket;
Cont’d… if it is a duplicate, it is not inserted in the bucket. Set operations—UNION, INTERSECTION, SET DIFFERENCE, and CARTESIAN PRODUCT—are sometimes expensive to implement. In particular, the CARTESIAN PRODUCT operation R × S is quite expensive because its result includes a record for each combination of records from R and S. The other three set operations—UNION, INTERSECTION, and SET DIFFERENCE—apply only to type-compatible relations, which must have the same number of attributes and the same attribute domains.
Cont’d… The customary way to implement these operations is to use variations of the sort-merge technique: The two relations are sorted on the same attributes, and, after sorting, a single scan through each relation is sufficient to produce the result.
Cont’d… Hashing can also be used to implement UNION, INTERSECTION, and SET DIFFERENCE. One table is first scanned and then partitioned into an in-memory hash table with buckets, and the records in the other table are then scanned one at a time and used to probe the appropriate partition.
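A rough sketch of hash-based INTERSECTION, assuming both inputs are lists of type-compatible tuples that fit in memory; UNION and SET DIFFERENCE follow the same build-then-probe pattern.

```python
def hash_intersection(R, S):
    """Build an in-memory hash table on R, then probe it with each record of S."""
    table = {t: True for t in R}   # build phase: one scan of R
    result = []
    for s in S:                    # probe phase: one scan of S
        if s in table:
            result.append(s)
            del table[s]           # avoid emitting duplicates coming from S
    return result

print(hash_intersection([(1, "a"), (2, "b")], [(2, "b"), (3, "c")]))  # [(2, 'b')]
```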
Implementing Aggregate Operations The aggregate operators (MIN, MAX, COUNT, AVERAGE, SUM), when applied to an entire table, can be computed by a table scan or by using an appropriate index, if available. For example, SELECT MAX (Salary) FROM EMPLOYEE can be answered by following the rightmost pointers of a B+-tree index on Salary down to the largest value. When a GROUP BY clause is present, the aggregate must be computed separately for each group of tuples; the usual technique for such queries is to first use either sorting or hashing on the grouping attributes to partition the file into the appropriate groups.
Cont’d… Then the algorithm computes the aggregate function for the tuples in each group, which have the same grouping attribute(s) value. Notice that if a clustering index exists on the grouping attribute(s), then the records are already partitioned (grouped) into the appropriate subsets.
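As an illustration of hashing on the grouping attribute, the sketch below computes the equivalent of SELECT Dno, MAX(Salary) FROM EMPLOYEE GROUP BY Dno over an in-memory list; the relation contents and attribute positions are assumptions for the example.

```python
employees = [("Alice", 5, 42000), ("Bob", 4, 30000), ("Carol", 5, 55000)]

def max_salary_per_dno(records):
    """Hash-based grouping: one pass, keeping a running MAX per group."""
    groups = {}
    for name, dno, salary in records:
        groups[dno] = max(groups.get(dno, salary), salary)
    return groups

print(max_salary_per_dno(employees))   # {5: 55000, 4: 30000}
```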
Combining Operations using Pipelining It is common to generate the query execution code dynamically so that it implements multiple operations together. The generated code for producing the query result combines several algorithms that correspond to individual operations; the relational algebra optimization can group such operations together for execution, so that the result tuples of one operation are streamed directly into the next operation instead of being written to a temporary intermediate file. This is called pipelining or stream-based processing.
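Pipelining can be pictured with Python generators: each operator consumes tuples from the previous one as they are produced, so no intermediate result is materialized. This is only a conceptual sketch, not how a real executor is written; the relation and operator functions are invented.

```python
employees = [("Alice", 5, 42000), ("Bob", 4, 30000), ("Carol", 5, 55000)]

def scan(records):                    # leaf operator: produces tuples one at a time
    yield from records

def select_dno(tuples, dno):          # SELECT: filters the stream
    return (t for t in tuples if t[1] == dno)

def project_name(tuples):             # PROJECT: trims each tuple as it flows by
    return (t[0] for t in tuples)

# The operators form a pipeline; tuples stream through without temporary files.
for name in project_name(select_dno(scan(employees), 5)):
    print(name)
```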
Using Heuristics in Query Optimization Heuristic rules are applied to the internal representation of a query, which is usually in the form of a query tree or a query graph data structure. The scanner and parser of an SQL query first generate a data structure that corresponds to an initial query representation, which is then optimized according to heuristic rules. This leads to an optimized query representation, which corresponds to the query execution strategy.
Cont’d… A query tree is used to represent a relational algebra or extended relational algebra expression, whereas a query graph is used to represent a relational calculus expression. A query tree is a tree data structure that corresponds to a relational algebra expression. It represents the input relations of the query as leaf nodes of the tree, and represents the relational algebra operations as internal nodes.
Cont’d… An execution of the query tree consists of executing an internal node operation whenever its operands are available and then replacing that internal node by the relation that results from executing the operation. The order of execution of operations starts at the leaf nodes, which represent the input database relations for the query, and ends at the root node, which represents the final operation of the query.
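A toy representation of a query tree and its bottom-up execution might look like the sketch below; the node classes and the tiny operator set are invented purely for illustration.

```python
class Relation:                       # leaf node: an input relation
    def __init__(self, tuples): self.tuples = tuples
    def execute(self): return self.tuples

class Select:                         # internal node: runs once its operand is available
    def __init__(self, child, pred): self.child, self.pred = child, pred
    def execute(self):
        return [t for t in self.child.execute() if self.pred(t)]

class Project:
    def __init__(self, child, cols): self.child, self.cols = child, cols
    def execute(self):
        return [tuple(t[c] for c in self.cols) for t in self.child.execute()]

# PROJECT_Name( SELECT_Dno=5 (EMPLOYEE) ): leaves are evaluated first, the root last.
employee = Relation([("Alice", 5), ("Bob", 4)])
tree = Project(Select(employee, lambda t: t[1] == 5), cols=[0])
print(tree.execute())                 # [('Alice',)]
```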
Cont’d… The optimizer must include rules for equivalence among relational algebra expressions that can be applied to transform the initial tree into the final, optimized query tree. A typical heuristic is to apply SELECT and PROJECT operations before JOIN or other binary operations, because SELECT and PROJECT tend to reduce the size of intermediate results.
Cont’d… A related technique, semantic query optimization, uses constraints specified on the database schema to transform a query. For example, if a constraint states that no employee can earn more than his or her direct supervisor, a query that retrieves employees who earn more than their supervisors can be answered without accessing the data: if the semantic query optimizer checks for the existence of this constraint, it does not need to execute the query at all because it knows that the result of the query will be empty. This may save considerable time if the constraint checking can be done efficiently.