Chapter 1 Query Processing
Query Processing (cont.…)
Scanner: The scanner identifies the language
tokens such as SQL Keywords, attribute
names, and relation names in the text of the
query.
Transformation Rules for Relational Algebra Operations:
Relational algebra operations:
o Select (σ), Project (π), Join, Union, Intersection, Cartesian Product
General transformation rules for relational algebra (useful in
query optimization):
1. Cascade of σ: A conjunctive selection condition
can be broken up into a cascade (sequence) of
individual σ operations:
◦ σ_{c1 AND c2 AND ... AND cn}(R) = σ_{c1}(σ_{c2}(...(σ_{cn}(R))...))
2. Commutativity of σ:
◦ The σ operation is commutative:
◦ σ_{c1}(σ_{c2}(R)) = σ_{c2}(σ_{c1}(R))
3. Cascade of π: In a cascade (sequence) of π
operations, all but the last one can be ignored:
◦ π_{List1}(π_{List2}(...(π_{Listn}(R))...)) = π_{List1}(R)
Cont…
4. Commuting σ with π:
◦ If the selection condition c involves only the
attributes A1, ..., An in the projection list, the
two operations can be commuted:
π_{A1, A2, ..., An}(σ_{c}(R)) = σ_{c}(π_{A1, A2, ..., An}(R))
◦ A selection applied to a Cartesian product is
equivalent to a join on the same condition:
σ_{c}(R × S) = R ⋈_{c} S
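Rules 1 to 4 can be checked on a small relation. The following is a minimal sketch, not part of the original slides: the helper functions `select` and `project`, the relation `EMP`, and the conditions `c1` and `c2` are all hypothetical names chosen for illustration.

```python
# Toy relational operators over lists of dicts (illustrative only).
def select(rows, pred):
    """sigma: keep the rows that satisfy the predicate."""
    return [r for r in rows if pred(r)]

def project(rows, attrs):
    """pi: keep only the listed attributes and drop duplicate tuples."""
    seen, out = set(), []
    for r in rows:
        t = tuple((a, r[a]) for a in attrs)
        if t not in seen:
            seen.add(t)
            out.append(dict(t))
    return out

# Hypothetical sample relation.
EMP = [
    {"name": "A", "dno": 5, "salary": 40000},
    {"name": "B", "dno": 5, "salary": 25000},
    {"name": "C", "dno": 4, "salary": 43000},
]
c1 = lambda r: r["dno"] == 5
c2 = lambda r: r["salary"] > 30000

# Rule 1: a conjunctive selection equals a cascade of selections.
rule1_lhs = select(EMP, lambda r: c1(r) and c2(r))
rule1_rhs = select(select(EMP, c2), c1)

# Rule 2: selections commute.
rule2_lhs = select(select(EMP, c2), c1)
rule2_rhs = select(select(EMP, c1), c2)

# Rule 3: only the outermost projection matters.
rule3_lhs = project(project(EMP, ["name", "dno"]), ["name"])
rule3_rhs = project(EMP, ["name"])

# Rule 4: c1 uses only "dno", which is in the projection list.
rule4_lhs = project(select(EMP, c1), ["name", "dno"])
rule4_rhs = select(project(EMP, ["name", "dno"]), c1)
```

Each `lhs`/`rhs` pair holds the same rows, which is what lets an optimizer swap one form for the other.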
Example
Solution:
Strategy 1: σ_{salary < 75000}(π_{salary}(employee))
Basic algorithms for executing query operations
Sorting is one of the primary algorithms used in query
processing, e.g., to implement the ORDER BY clause.
External sorting:
◦ Refers to sorting algorithms that are suitable for large
files of records stored on disk that do not fit entirely in
main memory, such as most database files
External sorting uses Sort-Merge strategy:
◦ Starts by sorting small subfiles (runs) of the main file and
then merges the sorted runs, creating larger sorted
subfiles that are merged in turn
◦ Sorting phase:
Number of subfiles (runs): nR = ⌈b/nB⌉
◦ Merging phase:
Degree of merging: dM = min(nB − 1, nR); number of passes: nP = ⌈log_dM(nR)⌉
Cont…
◦ nR: number of initial runs;
◦ b: number of file blocks;
◦ nB: available buffer space (in blocks);
◦ dM: degree of merging;
◦ nP: number of passes
The size of a run and the number of initial runs depend on the
number of file blocks (b) and the available buffer space (nB)
Example: if nB = 5 blocks and the size of the file is b = 1024 blocks,
◦ nR = ⌈b/nB⌉ = ⌈1024/5⌉ = 205 runs, each of size 5 blocks
except the last run, which will have 4 blocks.
◦ Hence, after the sort phase, 205 sorted runs are stored as
temporary subfiles on disk
Cont…
In the merging phase, the sorted runs are merged
during one or more passes.
The degree of merging (dM) is the number of runs
that can be merged in each pass.
◦ dM = min(nB − 1, nR)
The number of passes: nP = ⌈log_dM(nR)⌉
In each pass, one buffer block is needed to hold one
block from each of the runs being merged, and one more block
is needed to hold one block of the merge result
In the above example, dM = min(5 − 1, 205) = 4 (four-way merging)
Hence, the 205 initial sorted runs would be merged
into:
◦ 52 at the end of the first pass
◦ 13 at the end of the second pass
◦ 4 at the end of the third pass
◦ 1 at the end of the fourth pass
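The arithmetic above can be reproduced with a short script. This is a minimal sketch under the formulas on these slides; the function name `external_sort_plan` is hypothetical, and it only counts runs and passes without sorting any data.

```python
def external_sort_plan(b, nB):
    """Return (nR, dM, runs_after_each_pass) for a file of b blocks
    sorted with nB buffer blocks, using the sort-merge strategy."""
    nR = -(-b // nB)          # ceiling of b / nB: number of initial runs
    dM = min(nB - 1, nR)      # degree of merging
    if nR > 1 and dM < 2:
        raise ValueError("need at least 3 buffer blocks to merge runs")
    runs, runs_after_each_pass = nR, []
    while runs > 1:
        runs = -(-runs // dM)  # each pass merges dM runs into one
        runs_after_each_pass.append(runs)
    return nR, dM, runs_after_each_pass

nR, dM, passes = external_sort_plan(b=1024, nB=5)
print(nR, dM, passes)  # 205 4 [52, 13, 4, 1]
```

The length of the returned list is the number of merge passes nP, which agrees with ⌈log_dM(nR)⌉ for this example.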
Cont…
Exercise
A file of 4096 blocks is to be sorted with
an available buffer space of 64 blocks.
How many passes will be needed in the
merge phase of the external sort-merge
algorithm?
Implementing the SELECT Operation
There are many options for executing a SELECT
operation
Some options depend on the file having specific
access paths and may apply only to certain types
of selection conditions
Examples:
◦ (OP1): σ_{SSN='123456789'}(EMPLOYEE)
◦ (OP2): σ_{DNUMBER>5}(DEPARTMENT)
◦ (OP3): σ_{DNO=5}(EMPLOYEE)
◦ (OP4): σ_{DNO=5 AND SALARY>30000 AND SEX='F'}(EMPLOYEE)
◦ (OP5): σ_{ESSN='123456789' AND PNO=10}(WORKS_ON)
Use the following logical model to understand the discussions in this
chapter
Cont…
Search Methods for implementing Simple Selection:
◦ S1 Linear search (brute force):
Retrieve every record in the file, and test whether its
attribute values satisfy the selection condition.
◦ S2 Binary search:
If the selection condition involves an equality
comparison on a key attribute on which the file is
ordered, binary search (which is more efficient than
linear search) can be used.
An example is OP1 if SSN is the ordering attribute for
EMPLOYEE file
◦ S3 Using a primary index or hash key to retrieve a
single record:
If the selection condition involves an equality
comparison on a key attribute with a primary index
(or a hash key), use the primary index (or the hash key)
to retrieve the record.
For example, OP1 can use a primary index on SSN to retrieve the
record.
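The difference between S1 and S2 can be seen by counting how many records each method examines. This is a minimal sketch, assuming an in-memory list stands in for a file ordered on SSN; the function names and the generated `EMPLOYEE` data are hypothetical.

```python
def linear_search(records, attr, value):
    """S1: scan records in file order; stop at the first match
    (valid because attr is assumed to be a key)."""
    examined = 0
    for rec in records:
        examined += 1
        if rec[attr] == value:
            return rec, examined
    return None, examined

def binary_search(records, attr, value):
    """S2: records must be ordered on attr, which must be a key."""
    lo, hi, examined = 0, len(records) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        examined += 1
        if records[mid][attr] == value:
            return records[mid], examined
        if records[mid][attr] < value:
            lo = mid + 1
        else:
            hi = mid - 1
    return None, examined

# Hypothetical file of 1000 records ordered on SSN.
EMPLOYEE = [{"SSN": f"{n:09d}"} for n in range(1, 1001)]

rec_lin, n_lin = linear_search(EMPLOYEE, "SSN", "000000750")
rec_bin, n_bin = binary_search(EMPLOYEE, "SSN", "000000750")
print(n_lin, n_bin)  # linear examines 750 records; binary at most 10
```

In a real DBMS the unit of cost is the disk block rather than the record, but the ratio between the two methods follows the same pattern.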
Cont…
Search Methods for implementing Simple Selection
◦ S4 Using a primary index to retrieve
multiple records:
If the comparison condition is >, ≥, <, or ≤ on a
key field with a primary index, use the index
to find the record satisfying the corresponding
equality condition, then retrieve all subsequent
records in the (ordered) file. (see OP2)
◦ S5 Using a clustering index to retrieve
multiple records:
If the selection condition involves an equality
comparison on a non-key attribute with a
clustering index, use the clustering index to
retrieve all the records satisfying the selection
condition. (See OP3)
cont.…
• S6: using a secondary index on an equality comparison:
• This search method can be used to retrieve a single
record if the indexing field is a key (has unique values) or
to retrieve multiple records if the indexing field is not a
key
• Search Methods for implementing complex Selection
• S7: Conjunctive selection using an individual index :
If an attribute involved in any single simple condition in
the conjunctive condition has an access path that
permits the use of one of the methods S2 to S6, use
that condition to retrieve the records and then check
whether each retrieved record satisfies the
remaining simple conditions in the conjunctive
condition
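Method S7 can be sketched on OP4 as follows. This is an illustrative sketch only: the sample `EMPLOYEE` rows are made up, a Python dictionary stands in for an index on DNO, and `conjunctive_select` is a hypothetical helper name.

```python
from collections import defaultdict

# Hypothetical sample rows.
EMPLOYEE = [
    {"SSN": "1", "DNO": 5, "SALARY": 40000, "SEX": "F"},
    {"SSN": "2", "DNO": 5, "SALARY": 25000, "SEX": "F"},
    {"SSN": "3", "DNO": 4, "SALARY": 43000, "SEX": "F"},
    {"SSN": "4", "DNO": 5, "SALARY": 38000, "SEX": "M"},
]

# A dictionary from DNO value to record positions plays the role of the index.
dno_index = defaultdict(list)
for pos, rec in enumerate(EMPLOYEE):
    dno_index[rec["DNO"]].append(pos)

def conjunctive_select(records, index, indexed_value, remaining):
    """Use the index for one simple condition, then test the
    remaining simple conditions on each retrieved record."""
    candidates = [records[p] for p in index.get(indexed_value, [])]
    return [r for r in candidates if all(cond(r) for cond in remaining)]

# OP4: DNO=5 AND SALARY>30000 AND SEX='F'
op4 = conjunctive_select(
    EMPLOYEE, dno_index, 5,
    [lambda r: r["SALARY"] > 30000, lambda r: r["SEX"] == "F"],
)
print([r["SSN"] for r in op4])  # ['1']
```

Only the three records with DNO=5 are fetched; the record with DNO=4 is never examined.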
Cont…
Whenever a single condition specifies the
selection, we can only check whether an access
path exists on the attribute involved in that
condition.
If an access path exists, the method
corresponding to that access path is used;
Otherwise, the “brute force” linear search
approach of method S1 is used
◦ For conjunctive selection conditions,
whenever more than one of the attributes
involved in the conditions have an access path,
query optimization should be done to choose
the access path that retrieves the fewest records in
the most efficient way
Cont…
Disjunctive selection conditions: This is a
situation where simple conditions are connected by
the OR logical connective rather than AND
◦ Compared to conjunctive selection, it is much
harder to process and optimize
Example: σ_{DNO=5 OR SALARY>3000 OR SEX='F'}(EMPLOYEE)
◦ Little optimization can be done because the
records satisfying the disjunctive condition are the
union of the records satisfying the individual
conditions
◦ Hence, if any of the individual conditions does not
have an access path, we are compelled to use the
brute force approach
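The fallback described above can be sketched as follows. This is a simplified illustration, assuming equality conditions only and dictionaries in place of real indexes; `build_index`, `disjunctive_select`, and the sample rows are hypothetical.

```python
def build_index(records, attr):
    """Map each value of attr to the positions of the records holding it."""
    index = {}
    for pos, rec in enumerate(records):
        index.setdefault(rec[attr], []).append(pos)
    return index

def disjunctive_select(records, conditions, indexes):
    """conditions: list of (attr, value) equality tests joined by OR.
    If every condition has an access path, take the union of the
    indexed lookups; otherwise fall back to a linear search."""
    if all(attr in indexes for attr, _ in conditions):
        positions = set()
        for attr, value in conditions:
            positions.update(indexes[attr].get(value, []))
        return [records[p] for p in sorted(positions)], "index union"
    rows = [r for r in records if any(r[a] == v for a, v in conditions)]
    return rows, "linear search"

# Hypothetical sample rows.
EMPLOYEE = [
    {"SSN": "1", "DNO": 5, "SEX": "F"},
    {"SSN": "2", "DNO": 4, "SEX": "M"},
    {"SSN": "3", "DNO": 4, "SEX": "F"},
    {"SSN": "4", "DNO": 5, "SEX": "M"},
    {"SSN": "5", "DNO": 1, "SEX": "M"},
]
conditions = [("DNO", 5), ("SEX", "F")]

both = {"DNO": build_index(EMPLOYEE, "DNO"), "SEX": build_index(EMPLOYEE, "SEX")}
only_dno = {"DNO": build_index(EMPLOYEE, "DNO")}

rows_a, how_a = disjunctive_select(EMPLOYEE, conditions, both)
rows_b, how_b = disjunctive_select(EMPLOYEE, conditions, only_dno)
print(how_a, how_b)  # index union, then linear search
```

Both calls return the same rows; only the access method differs once one condition lacks an index.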
Implementing the JOIN Operation:
Algorithms for PROJECT and SET Operations
Algorithm for PROJECT operations:
π_{<attribute list>}(R)
Query Optimization(cont…)
Method 2: Improvement
a. Load as many blocks of r as possible, leaving room
for one block of s.
b. Run through the s file completely, one block at a
time.
Performance: Reduces the number of times s
blocks are loaded by a factor equal to the
number of r records that can fit in main memory.
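Method 2 can be sketched as a block nested-loop join that counts how often a block of s is loaded. This is a minimal in-memory sketch, assuming each file is a list of blocks and each block is a list of records; the function name, the `match` predicate, and the sample data are hypothetical.

```python
def block_nested_loop_join(r_blocks, s_blocks, buffer_blocks, match):
    """Join r and s, holding (buffer_blocks - 1) blocks of r in memory
    and reserving one buffer block for the current block of s."""
    chunk_size = buffer_blocks - 1
    if chunk_size < 1:
        raise ValueError("need at least 2 buffer blocks")
    result, s_loads = [], 0
    for start in range(0, len(r_blocks), chunk_size):
        # a. load as many blocks of r as fit
        chunk = [rec for blk in r_blocks[start:start + chunk_size] for rec in blk]
        # b. run through s completely, one block at a time
        for s_blk in s_blocks:
            s_loads += 1
            for s_rec in s_blk:
                for r_rec in chunk:
                    if match(r_rec, s_rec):
                        result.append((r_rec, s_rec))
    return result, s_loads

# Hypothetical data: r has 4 blocks, s has 3 blocks, 2 records per block.
r = [[1, 2], [3, 4], [5, 6], [7, 8]]
s = [[2, 4], [6, 8], [10, 12]]
same = lambda a, b: a == b

pairs_small, loads_small = block_nested_loop_join(r, s, buffer_blocks=3, match=same)
pairs_large, loads_large = block_nested_loop_join(r, s, buffer_blocks=5, match=same)
print(loads_small, loads_large)  # 6 3
```

With more buffer space, more of r fits at once and s is scanned fewer times, while the join result stays the same.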
Considerations during query Optimization:
◦ Narrow down intermediate result sets quickly.
SELECT before JOIN
◦ Use access structures (indexes).
Approaches to Query Optimization
Heuristics Approach
◦ The heuristic approach uses the knowledge of the
characteristics of the relational algebra operations
and the relationship between the operators to
optimize the query.
◦ Thus the heuristic approach of optimization will
make use of:
Properties of individual operators
Association between operators
Query Tree: a graphical representation of the
operators, relations, attributes and predicates and
processing sequence during query processing.
Using Heuristics in Query Optimization
Process for heuristics optimization
1. The parser of a high-level query generates an initial
internal representation;
2. Apply heuristics rules to optimize the internal
representation.
3. A query execution plan is generated to execute groups
of operations based on the access paths available on
the files involved in the query
The main heuristic is to apply first the operations
that reduce the size of intermediate results
◦ E.g., Apply SELECT and PROJECT operations before
applying the JOIN or other binary operations
Using Heuristics in Query Optimization (cont…)
Heuristic Optimization of Query Trees:
◦ The same query could correspond to many
different relational algebra expressions — and
hence many different query trees.
◦ The task of heuristic optimization of query trees is to
find a final query tree that is efficient to execute.
Example: Find the last names of all employees who
work on the AQUARIUS project and were born
after '1957-12-31'.
Q: SELECT LNAME
FROM EMPLOYEE, WORKS_ON, PROJECT
WHERE PNAME = 'AQUARIUS' AND
PNUMBER=PNO AND ESSN=SSN
AND BDATE > '1957-12-31';
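The effect of the main heuristic on query Q can be illustrated by comparing intermediate result sizes for two equivalent plans. This is a sketch on made-up sample rows, not the optimizer's actual procedure; the table contents and variable names are hypothetical.

```python
from itertools import product

# Hypothetical sample rows.
EMPLOYEE = [
    {"SSN": "1", "LNAME": "Smith", "BDATE": "1965-01-09"},
    {"SSN": "2", "LNAME": "Wong", "BDATE": "1955-12-08"},
    {"SSN": "3", "LNAME": "English", "BDATE": "1972-07-31"},
    {"SSN": "4", "LNAME": "Borg", "BDATE": "1937-11-10"},
]
PROJECT = [
    {"PNUMBER": 1, "PNAME": "AQUARIUS"},
    {"PNUMBER": 2, "PNAME": "ProductY"},
    {"PNUMBER": 3, "PNAME": "ProductZ"},
]
WORKS_ON = [
    {"ESSN": "1", "PNO": 1}, {"ESSN": "2", "PNO": 1},
    {"ESSN": "3", "PNO": 2}, {"ESSN": "3", "PNO": 1},
    {"ESSN": "4", "PNO": 3},
]

# Plan A: Cartesian product of all three relations, then select.
everything = list(product(EMPLOYEE, WORKS_ON, PROJECT))
plan_a = [e["LNAME"] for e, w, p in everything
          if p["PNAME"] == "AQUARIUS" and p["PNUMBER"] == w["PNO"]
          and w["ESSN"] == e["SSN"] and e["BDATE"] > "1957-12-31"]
plan_a_largest = len(everything)

# Plan B: select first, then join the reduced relations.
proj = [p for p in PROJECT if p["PNAME"] == "AQUARIUS"]
proj_works = [w for w in WORKS_ON for p in proj if w["PNO"] == p["PNUMBER"]]
emp = [e for e in EMPLOYEE if e["BDATE"] > "1957-12-31"]
plan_b = [e["LNAME"] for e in emp for w in proj_works if w["ESSN"] == e["SSN"]]
plan_b_largest = max(len(proj), len(proj_works), len(emp))

print(plan_a_largest, plan_b_largest)  # 60 3
```

Both plans return the same last names, but the second never materializes more than three tuples at any step.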
Steps in Typical Heuristic optimization
1. Deconstruct conjunctive selections into a
sequence of single selection operations
(equivalence rule 1).
2. Move selection operations down the query tree
for the earliest possible execution.
3. Execute first those selection and join operations
that will produce the smallest relations.
4. Replace Cartesian product operations that are
followed by a selection condition with join
operations.
5. Deconstruct lists of projection attributes and move
them as far down the tree as possible, creating
new projections where needed.
Steps in converting a query tree during heuristic optimization
Using Heuristics in Query Optimization
Summary of Heuristics for Algebraic Optimization:
1. The main heuristic is to apply first the operations
that reduce the size of intermediate results.
2. Perform select operations as early as possible to
reduce the number of tuples and perform project
operations as early as possible to reduce the
number of attributes. (This is done by moving
select and project operations as far down the tree
as possible.)
3. The select and project operations that are most
restrictive should be executed before other
similar operations. (This is done by reordering the
leaf nodes of the tree among themselves and
adjusting the rest of the tree appropriately.)
Using Selectivity and Cost Estimates in Query Optimization
Cost-based query optimization:
◦ Estimate and compare the costs of executing a query
using different execution strategies and choose the
strategy with the lowest cost estimate
Cost Components for Query Execution
1. Access cost to secondary storage: Cost of transferring
data blocks between secondary storage and main memory
buffers.
2. Computation cost: Cost of performing in-memory
operations on the records within the data buffers during
query execution.
3. Memory usage cost: Cost pertaining to the number of
main memory buffers needed during query execution.
4. Communication cost: Cost of shipping the query and its
results from the database site to the site or terminal where
the query originated (e.g., in distributed databases).
Note: Different database systems may focus on different
cost components.
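Cost-based selection of a strategy can be sketched by estimating only the first component, access cost in block reads, for an equality selection on a key. The estimates used here (about b/2 blocks for a linear search, about log2(b) for a binary search, index levels plus one for a primary index) are common textbook approximations and are assumptions of this sketch, as are the function names and sample numbers.

```python
import math

def estimate_costs(b, ordered=False, index_levels=None):
    """Estimated block accesses for an equality selection on a key.
    b: number of file blocks; index_levels: levels of a primary index, if any."""
    costs = {"S1 linear search": b / 2}            # average case
    if ordered:
        costs["S2 binary search"] = math.ceil(math.log2(b))
    if index_levels is not None:
        costs["S3 primary index"] = index_levels + 1
    return costs

def choose_strategy(costs):
    """Pick the strategy with the lowest cost estimate."""
    return min(costs, key=costs.get)

costs = estimate_costs(b=2000, ordered=True, index_levels=2)
best = choose_strategy(costs)
print(costs, best)
```

A real optimizer would combine several cost components and use catalog statistics, but the compare-and-pick-minimum structure is the same.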
Semantic Query Optimization
Semantic Query Optimization:
◦ Uses constraints specified on the database schema in
order to modify one query into another query that is
more efficient to execute
Consider the following SQL query:
SELECT E.LNAME, M.LNAME
FROM EMPLOYEE E, EMPLOYEE M
WHERE E.SUPERSSN=M.SSN AND
E.SALARY>M.SALARY;
Explanation:
◦ Suppose that we had a constraint on the database
schema that stated that no employee can earn more
than his or her direct supervisor. If the semantic query
optimizer checks for the existence of this constraint, it
need not execute the query at all because it knows that
the result of the query will be empty.
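The check described above can be sketched as a lookup for a query predicate that directly contradicts a known constraint. This is a toy illustration, assuming predicates and constraints are written as (left, operator, right) triples over the same aliases and that the constraint applies under the supervisor join E.SUPERSSN = M.SSN; the names `NEGATION` and `provably_empty` are hypothetical.

```python
# Each comparison operator mapped to its logical negation.
NEGATION = {"<=": ">", ">": "<=", "<": ">=", ">=": "<", "=": "<>", "<>": "="}

def provably_empty(predicates, constraints):
    """True if some query predicate is the exact negation of a constraint,
    so no tuple can satisfy the query."""
    return any((lhs, NEGATION[op], rhs) in constraints
               for lhs, op, rhs in predicates)

# Assumed constraint: no employee earns more than his or her direct supervisor.
constraints = {("E.SALARY", "<=", "M.SALARY")}

query_predicates = [("E.SALARY", ">", "M.SALARY")]
other_predicates = [("E.SALARY", "<", "M.SALARY")]

print(provably_empty(query_predicates, constraints))  # True
print(provably_empty(other_predicates, constraints))  # False
```

When the check succeeds, the optimizer can return an empty result without touching the EMPLOYEE file.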