Query Processing Concepts
Objectives
Query Processing
Analysis
On completion of the analysis, the high-level query has been
transformed into some internal representation (a query tree) that
is more suitable for processing.
[Figure: query tree showing the root, intermediate operations, and leaves.]
Normalization
Conjunctive normal form
Disjunctive normal form
Semantic analysis
Simplification
Query restructuring
Query optimization
Dynamic query optimization
Static query optimization
Transformation Rules for the Relational Algebra Operations
Heuristics rules
Cost estimation
Join operation
Pipelining
Left-deep trees
Query Processing Tutorial
Basic Steps in Query Processing
Optimization: finding the cheapest evaluation plan for a query.
A given relational-algebra expression may have many equivalent
expressions.
E.g. $\sigma_{balance<2500}(\Pi_{balance}(account))$ is equivalent to
$\Pi_{balance}(\sigma_{balance<2500}(account))$.
Any relational-algebra expression can be evaluated in
many ways. An annotated expression specifying a detailed
evaluation strategy is called an evaluation plan.
E.g. can use an index on balance to find accounts with
balance < 2500, or can perform a complete relation scan and
discard accounts with balance ≥ 2500.
Amongst all equivalent expressions, try to choose the
one with the cheapest possible evaluation plan. The cost
estimate of a plan is based on statistical information in the
DBMS catalog.
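The equivalence of the two expressions above can be checked on a small example. The following is a minimal sketch, assuming a hypothetical in-memory account relation stored as a list of dicts; the helper names select and project and the sample rows are illustrative and do not come from the slides.

```python
# Hypothetical sample rows; attribute names follow the account relation above.
account = [
    {"account_number": "A-101", "branch_name": "Downtown", "balance": 500},
    {"account_number": "A-215", "branch_name": "Mianus", "balance": 700},
    {"account_number": "A-102", "branch_name": "Perryridge", "balance": 4000},
    {"account_number": "A-305", "branch_name": "Round Hill", "balance": 3500},
]


def select(rows, predicate):
    """Selection: keep the rows that satisfy the predicate."""
    return [row for row in rows if predicate(row)]


def project(rows, attributes):
    """Projection with duplicate elimination (set semantics)."""
    projected_rows = []
    for row in rows:
        projected = {a: row[a] for a in attributes}
        if projected not in projected_rows:
            projected_rows.append(projected)
    return projected_rows


def cheap(t):
    return t["balance"] < 2500


# Plan A: project first, then select.
plan_a = select(project(account, ["balance"]), cheap)
# Plan B: select first, then project.
plan_b = project(select(account, cheap), ["balance"])

print(plan_a == plan_b)
```

Both plans return the same tuples; an optimizer is free to pick whichever one it estimates to be cheaper.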
Catalog Information for Cost Estimation
Measures of Query Cost
Selection Cost Estimate Example
$\sigma_{\text{branch-name} = \text{"Perryridge"}}(account)$
Number of blocks is $b_{account} = 500$: 10,000 tuples in
the relation; each block holds 20 tuples.
Assume account is sorted on branch-name.
– $V(\text{branch-name}, account)$ is 50
– 10000/50 = 200 tuples of the account relation pertain to
the Perryridge branch
– 200/20 = 10 blocks for these tuples
– A binary search to find the first record would take
$\lceil \log_2(500) \rceil = 9$ block accesses
Total cost of binary search is 9 + 10 – 1 = 18 block
accesses (versus 500 for linear scan)
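The arithmetic of this estimate can be reproduced in a few lines. This is a sketch only: the variable names are illustrative, and the numbers are the ones given in the example above.

```python
import math

# Catalog statistics taken from the example above.
n_account = 10_000          # tuples in account
f_account = 20              # tuples per block
distinct_branch_names = 50  # V(branch-name, account)

b_account = math.ceil(n_account / f_account)               # blocks in account
matching_tuples = n_account // distinct_branch_names        # tuples for one branch
matching_blocks = math.ceil(matching_tuples / f_account)    # blocks holding them

# Binary search locates the first matching block; the remaining
# matching blocks are then read sequentially.
probes = math.ceil(math.log2(b_account))
binary_search_cost = probes + matching_blocks - 1
linear_scan_cost = b_account

print(binary_search_cost, linear_scan_cost)
```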
Selections Using Indices
Example of Cost Estimate for Complex Selection
External Sort-Merge
Nested-Loop Join
Compute the theta join, $r \bowtie_\theta s$
for each tuple tr in r do begin
    for each tuple ts in s do begin
        test pair (tr, ts) to see if they satisfy the join condition
        if they do, add tr · ts to the result.
    end
end
r is called the outer relation and s the inner relation
of the join.
Requires no indices and can be used with any kind of
join condition.
Expensive since it examines every pair of tuples in
the two relations. If the smaller relation fits entirely in
main memory, use that relation as the inner relation.
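The pseudocode above translates almost line for line into Python. The sketch below assumes tuples are dicts and that the join condition is passed in as a function; the sample depositor and customer rows are hypothetical.

```python
def nested_loop_join(r, s, theta):
    """Tuple-at-a-time nested-loop join: r is the outer relation, s the inner."""
    result = []
    for tr in r:                  # outer relation
        for ts in s:              # inner relation, rescanned for every tr
            if theta(tr, ts):     # any join condition works, not only equality
                result.append({**tr, **ts})  # concatenate the two tuples
    return result


# Hypothetical sample data.
depositor = [
    {"customer_name": "Hayes", "account_number": "A-102"},
    {"customer_name": "Jones", "account_number": "A-101"},
]
customer = [
    {"customer_name": "Hayes", "customer_city": "Harrison"},
    {"customer_name": "Smith", "customer_city": "Rye"},
]

joined = nested_loop_join(
    depositor,
    customer,
    lambda tr, ts: tr["customer_name"] == ts["customer_name"],
)
print(joined)
```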
Block Nested-Loop Join
Variant of nested-loop join in which every block of the inner
relation is paired with every block of the outer relation.
for each block Br of r do begin
    for each block Bs of s do begin
        for each tuple tr in Br do begin
            for each tuple ts in Bs do begin
                test pair (tr, ts) for satisfying the join condition
                if they do, add tr · ts to the result.
            end
        end
    end
end
Worst case: each block in the inner relation s is read only once
for each block in the outer relation (instead of once for each
tuple in the outer relation)
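A minimal sketch of the block-at-a-time variant follows, assuming each relation is given as a list of blocks and each block is a list of tuples. The counter inner_block_reads is added only to illustrate how often the inner relation is read; the sample blocks are hypothetical.

```python
def block_nested_loop_join(r_blocks, s_blocks, theta):
    """Block nested-loop join; returns the result and the number of inner block reads."""
    result = []
    inner_block_reads = 0
    for block_r in r_blocks:
        for block_s in s_blocks:
            inner_block_reads += 1        # one read of an inner block per outer block
            for tr in block_r:
                for ts in block_s:
                    if theta(tr, ts):
                        result.append((tr, ts))
    return result, inner_block_reads


# Hypothetical data: 2 outer blocks (4 tuples), 3 inner blocks (6 tuples).
r_blocks = [[1, 2], [3, 4]]
s_blocks = [[2, 3], [4, 5], [6, 7]]

pairs, reads = block_nested_loop_join(r_blocks, s_blocks, lambda a, b: a == b)
print(pairs, reads)
```

Here the inner blocks are read 2 × 3 = 6 times, whereas a tuple-at-a-time nested loop over the same data would read them 4 × 3 = 12 times.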
Block Nested-Loop Join (Cont.)
Worst case estimate: $b_r \cdot b_s + b_r$ block accesses.
Best case: $b_r + b_s$ block accesses.
Improvements to nested-loop and block nested-loop
algorithms:
– If equi-join attribute forms a key on inner relation, stop inner
loop with first match
– In block nested-loop, use M – 2 disk blocks as blocking unit
for outer relation, where M = memory size in blocks; use
remaining two blocks to buffer inner relation and output.
This greatly reduces the number of scans of the inner relation.
– Scan inner loop forward and backward alternately, to make
use of blocks remaining in buffer (with LRU replacement)
– Use index on inner relation if available
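The estimates above are easy to compare numerically. In the sketch below, the worst-case and best-case functions follow the formulas on this slide; the blocked estimate with M – 2 outer blocks uses the commonly quoted textbook formula, which is an assumption here since the slide does not state it. The sizes (100 and 400 blocks, M = 20) are illustrative values.

```python
import math


def bnl_worst_case(b_r, b_s):
    """Worst case: the inner relation is read once per outer block."""
    return b_r * b_s + b_r


def bnl_best_case(b_r, b_s):
    """Best case: both relations are read exactly once."""
    return b_r + b_s


def bnl_blocked(b_r, b_s, m):
    """Assumed textbook estimate when M - 2 blocks buffer the outer relation."""
    return math.ceil(b_r / (m - 2)) * b_s + b_r


b_r, b_s, m = 100, 400, 20   # illustrative sizes in blocks
print(bnl_worst_case(b_r, b_s), bnl_best_case(b_r, b_s), bnl_blocked(b_r, b_s, m))
```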
Indexed Nested-Loop Join
If an index is available on the inner loop's join attribute and join is
an equi-join or natural join, more efficient index lookups can
replace file scans.
Can construct an index just to compute a join.
For each tuple tr in the outer relation r, use the index to look up
tuples in s that satisfy the join condition with tuple tr.
Worst case: buffer has space for only one page of r and one page
of the index.
– $b_r$ disk accesses are needed to read relation r, and, for each
tuple in r, we perform an index lookup on s.
– Cost of the join: $b_r + n_r \cdot c$, where c is the cost of a single
selection on s using the join condition.
If indices are available on both r and s, use the one with fewer
tuples as the outer relation.
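As a rough illustration, an in-memory dict can stand in for the index on the inner relation. This is a sketch under that assumption, not a disk-based index; the function names, the sample rows, and the numbers passed to the cost function are hypothetical.

```python
from collections import defaultdict


def build_index(s, attribute):
    """Build an index on s just to compute the join (dict: key -> matching tuples)."""
    index = defaultdict(list)
    for ts in s:
        index[ts[attribute]].append(ts)
    return index


def indexed_nested_loop_join(r, s_index, attribute):
    """For each outer tuple, look up matching inner tuples instead of scanning s."""
    result = []
    for tr in r:
        for ts in s_index.get(tr[attribute], []):
            result.append({**tr, **ts})
    return result


def indexed_join_cost(b_r, n_r, c):
    """Worst-case cost from the formula above: b_r + n_r * c."""
    return b_r + n_r * c


depositor = [
    {"customer_name": "Hayes", "account_number": "A-102"},
    {"customer_name": "Jones", "account_number": "A-101"},
]
customer = [
    {"customer_name": "Hayes", "customer_city": "Harrison"},
    {"customer_name": "Smith", "customer_city": "Rye"},
]

index = build_index(customer, "customer_name")
joined = indexed_nested_loop_join(depositor, index, "customer_name")
print(joined)
print(indexed_join_cost(b_r=100, n_r=5000, c=5))  # hypothetical statistics
```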
Example of Index Nested-Loop Join
Hash-Join (Cont.)
[Figure: hash partitioning of relations r and s into partitions 0 to 4 of r and partitions 0 to 4 of s.]
Hash-Join algorithm
Cost of Hash-Join
$customer \bowtie depositor$
Assume that memory size is 20 blocks.
$b_{depositor} = 100$ and $b_{customer} = 400$.
Depositor is to be used as build input. Partition it into
five partitions, each of size 20 blocks. This
partitioning can be done in one pass.
Similarly, partition customer into five partitions, each
of size 80. This is also done in one pass.
Therefore total cost: 3(100 + 400) = 1500 block
transfers (ignores cost of writing partially filled blocks).
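The partition, build, and probe phases, together with the cost estimate used above, can be sketched as follows. This is an in-memory illustration only: Python's built-in hash stands in for the partitioning hash function, a dict stands in for the in-memory hash index, and the sample rows are hypothetical.

```python
def hash_join(build, probe, attribute, n_partitions=5):
    """Partitioned hash join on a single equi-join attribute."""
    # Partition phase: the same hash function is applied to both relations.
    build_parts = [[] for _ in range(n_partitions)]
    probe_parts = [[] for _ in range(n_partitions)]
    for t in build:
        build_parts[hash(t[attribute]) % n_partitions].append(t)
    for t in probe:
        probe_parts[hash(t[attribute]) % n_partitions].append(t)

    result = []
    for build_part, probe_part in zip(build_parts, probe_parts):
        # Build phase: in-memory hash index on the build partition.
        index = {}
        for tb in build_part:
            index.setdefault(tb[attribute], []).append(tb)
        # Probe phase: look up each probe tuple in the index.
        for tp in probe_part:
            for tb in index.get(tp[attribute], []):
                result.append({**tb, **tp})
    return result


def hash_join_cost(b_build, b_probe):
    """Block transfers without recursive partitioning: 3(b_r + b_s)."""
    return 3 * (b_build + b_probe)


depositor = [
    {"customer_name": "Hayes", "account_number": "A-102"},
    {"customer_name": "Jones", "account_number": "A-101"},
]
customer = [
    {"customer_name": "Hayes", "customer_city": "Harrison"},
    {"customer_name": "Jones", "customer_city": "Harrison"},
    {"customer_name": "Smith", "customer_city": "Rye"},
]

joined = hash_join(depositor, customer, "customer_name")
print(sorted(t["customer_name"] for t in joined))
print(hash_join_cost(100, 400))
```

The factor 3 reflects reading both relations to partition them, writing the partitions out, and reading them back for the build and probe phases.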
Hybrid Hash-Join
Useful when memory sizes are relatively large, and the build
input is bigger than memory.
With a memory size of 25 blocks, depositor can be partitioned
into five partitions, each of size 20 blocks.
Keep the first of the partitions of the build relation in memory. It
occupies 20 blocks; one block is used for input, and one block
each is used for buffering the other four partitions.
Customer is similarly partitioned into five partitions each of size
80; the first is used right away for probing, instead of being
written out and read back in.
Ignoring the cost of writing partially filled blocks, the cost is
3(80 + 320) + 20 + 80 = 1300 block transfers with hybrid hash-join,
instead of 1500 with plain hash-join.
Hybrid hash-join is most useful if $M \gg \sqrt{b_s}$.
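The saving in this example comes from the one partition of each relation that never goes to disk. The sketch below recomputes the figures from the example; the variable names are illustrative.

```python
# Sizes in blocks, taken from the example above.
b_depositor, b_customer = 100, 400
n_partitions = 5

build_partition = b_depositor // n_partitions   # 20 blocks kept in memory
probe_partition = b_customer // n_partitions    # 80 blocks probed right away

# Only the remaining four partitions of each relation are written out and read back.
spilled_build = b_depositor - build_partition   # 80
spilled_probe = b_customer - probe_partition    # 320

# Spilled blocks cost three transfers each (read, write, read back);
# the first partitions are read only once.
hybrid_cost = 3 * (spilled_build + spilled_probe) + build_partition + probe_partition
plain_cost = 3 * (b_depositor + b_customer)

print(hybrid_cost, plain_cost)
```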
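Complex Joins
Join with a conjunctive condition:
$r \bowtie_{\theta_1 \wedge \theta_2 \wedge \dots \wedge \theta_n} s$
– Compute the result of one of the simpler joins $r \bowtie_{\theta_i} s$
– final result comprises those tuples in the intermediate result
that satisfy the remaining conditions
$\theta_1 \wedge \dots \wedge \theta_{i-1} \wedge \theta_{i+1} \wedge \dots \wedge \theta_n$
– Test these conditions as tuples in $r \bowtie_{\theta_i} s$ are generated.
Join with a disjunctive condition:
$r \bowtie_{\theta_1 \vee \theta_2 \vee \dots \vee \theta_n} s$
Compute as the union of the records in individual joins $r \bowtie_{\theta_i} s$:
$(r \bowtie_{\theta_1} s) \cup (r \bowtie_{\theta_2} s) \cup \dots \cup (r \bowtie_{\theta_n} s)$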
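Both strategies can be sketched on top of a simple theta join. The code below is illustrative only: it uses a plain nested loop for each simple join, represents tuples as Python tuples, and the function names and sample data are hypothetical.

```python
def theta_join(r, s, theta):
    """Simple theta join returning matching (tr, ts) pairs."""
    return [(tr, ts) for tr in r for ts in s if theta(tr, ts)]


def conjunctive_join(r, s, thetas, i=0):
    """Join on theta_i, then test the remaining conditions on each generated pair."""
    remaining = thetas[:i] + thetas[i + 1:]
    return [
        (tr, ts)
        for (tr, ts) in theta_join(r, s, thetas[i])
        if all(theta(tr, ts) for theta in remaining)
    ]


def disjunctive_join(r, s, thetas):
    """Union of the individual joins, with duplicate pairs removed."""
    result = []
    for theta in thetas:
        for pair in theta_join(r, s, theta):
            if pair not in result:
                result.append(pair)
    return result


# Hypothetical two-attribute tuples.
r = [(1, 10), (2, 20)]
s = [(1, 10), (2, 99), (3, 20)]
thetas = [
    lambda tr, ts: tr[0] == ts[0],   # theta_1: first attributes equal
    lambda tr, ts: tr[1] == ts[1],   # theta_2: second attributes equal
]

print(conjunctive_join(r, s, thetas))
print(disjunctive_join(r, s, thetas))
```

In practice the simple join chosen for the conjunctive case would be the one with the cheapest available algorithm, for example one supported by an index.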
Other Operations (Cont.)
E.g., Set operations using hashing:
1. Partition both relations using the same hash function,
thereby creating $H_{r0}, \dots, H_{rmax}$ and $H_{s0}, \dots, H_{smax}$.
2. Process each partition i as follows. Using a different hashing
function, build an in-memory hash index on $H_{ri}$ after it is
brought into memory.
3. $r \cup s$: Add tuples in $H_{si}$ to the hash index if they are not
already in it. Then add the tuples in the hash index to the
result.
$r \cap s$: output tuples in $H_{si}$ to the result if they are already
there in the hash index.
$r - s$: for each tuple in $H_{si}$, if it is there in the hash index,
delete it from the index. Add remaining tuples in the hash
index to the result.
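The three cases in step 3 can be sketched in one function. This is a simplified in-memory illustration: Python's built-in hash does the partitioning, and a Python set plays the role of the in-memory hash index (so the "different hashing function" of step 2 is not modelled separately). The function name and sample tuples are hypothetical.

```python
def hashed_set_operation(r, s, operation, n_partitions=4):
    """Union, intersection, or difference (r - s) computed partition by partition."""
    # Step 1: partition both relations with the same hash function.
    r_parts = [[] for _ in range(n_partitions)]
    s_parts = [[] for _ in range(n_partitions)]
    for t in r:
        r_parts[hash(t) % n_partitions].append(t)
    for t in s:
        s_parts[hash(t) % n_partitions].append(t)

    result = []
    for h_r, h_s in zip(r_parts, s_parts):
        index = set(h_r)                      # Step 2: in-memory index on H_ri
        if operation == "union":
            index.update(h_s)                 # add tuples of H_si not already present
            result.extend(index)
        elif operation == "intersection":
            result.extend(t for t in set(h_s) if t in index)
        elif operation == "difference":
            index.difference_update(h_s)      # delete tuples that also occur in H_si
            result.extend(index)
        else:
            raise ValueError(f"unknown operation: {operation}")
    return result


r = [(1,), (2,), (3,), (4,)]
s = [(3,), (4,), (5,)]

print(sorted(hashed_set_operation(r, s, "union")))
print(sorted(hashed_set_operation(r, s, "intersection")))
print(sorted(hashed_set_operation(r, s, "difference")))
```

Because equal tuples always hash to the same partition, each partition pair can be processed independently of the others.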
Other Operations (Cont.)
[Figure: expression tree combining the selection balance < 2500 on account with customer.]
Evaluation of Expressions (Cont.)
[Figure: Equivalent expressions. (a) Initial expression tree and (b) transformed expression tree; both apply the selection branch-city = Brooklyn over the relations branch, account, and depositor.]
Equivalence Rules
Join Ordering Example (Cont.)
[Figure: annotated evaluation plan with a hash-join and a merge-join involving depositor; the pipelined inputs are the selection branch-name = Brooklyn on branch (use index 1) and the selection balance < 1000 on account (use linear scan).]
Choice of Evaluation Plans
Must consider the interaction of evaluation techniques when
choosing evaluation plans: choosing the cheapest algorithm for
each operation independently may not yield the best overall
algorithm. E.g.
– Merge-join may be costlier than hash-join, but may provide a
sorted output which reduces the cost for an outer level
aggregation.
– Nested-loop join may provide opportunity for pipelining
Practical query optimizers incorporate elements of the following
two broad approaches:
1. Search all the plans and choose the best plan in a cost-based
fashion.
2. Use heuristics to choose a plan.
Cost Based Optimization
[Figure: (a) Left-deep join tree and (b) non-left-deep join tree over relations r1 to r5.]
Dynamic Programming in Optimization
Structure of Query Optimizers
Some query optimizers integrate heuristic selection and the
generation of alternative access plans.
– System R and Starburst use a hierarchical procedure based
on the nested-block concept of SQL: heuristic rewriting
followed by cost-based join-order optimization.
– The Oracle7 optimizer supports a heuristic based on
available access paths.
Even with the use of heuristics, cost-based query optimization
imposes a substantial overhead.
This expense is usually more than offset by savings at
query-execution time, particularly by reducing the number of slow
disk accesses.