
An Overview of Query Optimization
Chapter 11

Query Evaluation
• Problem: An SQL query is declarative – it does not specify a query execution plan.
• A relational algebra expression is procedural – there is an associated query execution plan.
• Solution: Convert the SQL query to an equivalent relational algebra expression and evaluate it using the associated query execution plan.
  – But which equivalent expression is best?

Naive Conversion
SELECT DISTINCT TargetList
FROM R1, R2, …, RN
WHERE Condition
is equivalent to
πTargetList (σCondition (R1 × R2 × ... × RN))
but this may imply a very inefficient query execution plan.
Example: πName (σId=ProfId ∧ CrsCode=‘CS532’ (Professor × Teaching))
• Result can be < 100 bytes
• But if each relation is 50K, we end up computing an intermediate result Professor × Teaching of size 500M before shrinking it down to just a few bytes.

Problem: Find an equivalent relational algebra expression that can be evaluated “efficiently”.

Query Processing Architecture

Query Optimizer
• Uses heuristic algorithms to evaluate relational
algebra expressions. This involves:
– estimating the cost of a relational algebra expression
– transforming one relational algebra expression to an
equivalent one
– choosing access paths for evaluating the subexpressions
• Query optimizers do not “optimize” – just try to find
“reasonably good” evaluation strategies

Equivalence Preserving Transformations

• To transform a relational expression into another equivalent expression we need transformation rules that preserve equivalence
• Each transformation rule
  – is provably correct (i.e., preserves equivalence)
  – has a heuristic associated with it

Selection and Projection Rules

• Break a complex selection into simpler ones:
  – σCond1∧Cond2 (R) ≡ σCond1 (σCond2 (R))
• Break a projection into stages:
  – πattr (R) ≡ πattr (πattr′ (R)), if attr ⊆ attr′
• Commute projection and selection:
  – πattr (σCond (R)) ≡ σCond (πattr (R)), if attr ⊇ all attributes mentioned in Cond

Commutativity and Associativity of Join
(and Cartesian Product as a Special Case)

• Join commutativity: R ⋈ S ≡ S ⋈ R
  – used to reduce the cost of nested-loops evaluation strategies (the smaller relation should be in the outer loop)
• Join associativity: R ⋈ (S ⋈ T) ≡ (R ⋈ S) ⋈ T
  – used to reduce the size of intermediate relations when computing a multi-relational join – first compute the join that yields the smaller intermediate result
• An N-way join has T(N) × N! different evaluation plans (see the sketch below)
  – T(N) is the number of parenthesized expressions
  – N! is the number of permutations of the relations
• The query optimizer cannot look at all plans (it might take longer to find an optimal plan than to compute the query brute-force). Hence it does not necessarily produce an optimal plan.
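The size of this search space can be made concrete with a short calculation. The following Python sketch (illustrative only; the function names are mine) computes T(N) as the Catalan number C(N−1) and multiplies it by N!:

from math import comb, factorial

def num_tree_shapes(n):
    # T(n): number of ways to parenthesize an n-way join = Catalan number C(n-1)
    return comb(2 * (n - 1), n - 1) // n

def num_plans(n):
    # T(n) * n!: tree shapes times permutations of the leaf relations
    return num_tree_shapes(n) * factorial(n)

for n in range(2, 8):
    print(n, num_tree_shapes(n), num_plans(n))
# already for n = 4: 5 shapes * 24 permutations = 120 plans; the count grows roughly as 4^n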

Pushing Selections and Projections
• σCond (R × S) ≡ R ⋈Cond S
  – Cond relates attributes of both R and S
  – reduces the size of the intermediate relation since rows can be discarded sooner
• σCond (R × S) ≡ σCond (R) × S
  – Cond involves only the attributes of R
  – reduces the size of the intermediate relation since rows of R are discarded sooner
• πattr (R × S) ≡ πattr (πattr′ (R) × S), if attributes(R) ⊇ attr′ ⊇ attr ∩ attributes(R)
  – reduces the size of an operand of the product

Equivalence Example

• σC1∧C2∧C3 (R × S) ≡ σC1 (σC2 (σC3 (R × S)))
               ≡ σC1 (σC2 (R) × σC3 (S))
               ≡ σC2 (R) ⋈C1 σC3 (S)

assuming C2 involves only attributes of R,
C3 involves only attributes of S,
and C1 relates attributes of R and S

Cost - Example 1
SELECT P.Name
FROM Professor P, Teaching T
WHERE P.Id = T.ProfId          -- join condition
  AND P.DeptId = ‘CS’ AND T.Semester = ‘F1994’

πName (σDeptId=‘CS’ ∧ Semester=‘F1994’ (Professor ⋈Id=ProfId Teaching))

Master query execution plan (nothing pushed):

        πName
          |
  σDeptId=‘CS’ ∧ Semester=‘F1994’
          |
      ⋈Id=ProfId
      /        \
Professor    Teaching

Metadata on Tables (in system catalogue)


– Professor (Id, Name, DeptId)
• size: 200 pages, 1000 rows, 50 departments
• indexes: clustered, 2-level B+tree on DeptId, hash on Id
– Teaching (ProfId, CrsCode, Semester)
• size: 1000 pages, 10,000 rows, 4 semesters
• indexes: clustered, 2-level B+tree on Semester;
hash on ProfId
– Definition: Weight of an attribute – average number
of rows that have a particular value
• weight of Id = 1 (it is a key)
• weight of ProfId = 10 (10,000 classes/1000 professors)

Estimating Cost - Example 1
• Join – block-nested loops with a 52-page buffer (50 pages for input from Professor, 1 page for input from Teaching, 1 output page)
  – Scanning Professor (outer loop): 200 page transfers (4 iterations, 50 transfers each)
  – Finding matching rows in Teaching (inner loop): 1000 page transfers for each iteration of the outer loop
  – Total cost = 200 + 4 × 1000 = 4200 page transfers (see the sketch below)
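The block-nested-loops arithmetic above can be expressed as a small Python sketch (illustrative only; the function and parameter names are mine):

from math import ceil

def block_nested_loops_cost(outer_pages, inner_pages, outer_buffer_pages):
    # read the outer relation once; scan the inner relation once per group of
    # outer pages that fits into the buffer reserved for the outer input
    return outer_pages + ceil(outer_pages / outer_buffer_pages) * inner_pages

# Professor (200 pages) as outer, Teaching (1000 pages) as inner, 50 buffer pages for the outer input
print(block_nested_loops_cost(200, 1000, 50))   # 200 + 4 * 1000 = 4200 page transfers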


Estimating Cost - Example 1 (cont’d)

• Selection and projection – scan the rows of the intermediate file, discard those that don’t satisfy the selection, project those that do, and write the result when the output buffer is full.
• Complete algorithm:
  – do the join, write the result to an intermediate file on disk
  – read the intermediate file, do select/project, write the final result
  – Problem: unnecessary I/O

Pipelining
• Solution: use pipelining:
  – join and select/project act as coroutines; they operate as producer/consumer sharing a buffer in main memory.
    • When join fills the buffer, select/project filters it and outputs the result
    • The process is repeated until select/project has processed the last output from join
  – Performing select/project adds no additional I/O cost (a small sketch follows the figure)

[Figure: the join writes its intermediate result into a shared main-memory buffer; select/project consumes the buffer and produces the final result]
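A minimal way to picture the producer/consumer idea is with Python generators (an illustrative sketch only, not how a real engine is coded; the table contents and column names are made up): rows flow from the join to select/project as they are produced, so no intermediate file is written.

def join(professors, teachings):
    # naive nested-loops join on Id = ProfId; yields joined rows one at a time
    for p in professors:
        for t in teachings:
            if p["Id"] == t["ProfId"]:
                yield {**p, **t}

def select_project(rows):
    # consume joined rows as they arrive, filter on DeptId/Semester, project on Name
    for r in rows:
        if r["DeptId"] == "CS" and r["Semester"] == "F1994":
            yield r["Name"]

professors = [{"Id": 1, "Name": "Smith", "DeptId": "CS"}]
teachings = [{"ProfId": 1, "CrsCode": "CS532", "Semester": "F1994"}]
print(list(select_project(join(professors, teachings))))   # ['Smith']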

Estimating Cost - Example 1 (cont’d)

• Total cost: 4200 + (cost of outputting the final result)
  – We will disregard the cost of outputting the final result when comparing with other query evaluation strategies, since it is the same for all of them

Cost Example 2
SELECT P.Name
FROM Professor P, Teaching T
WHERE P.Id = T.ProfId
  AND P.DeptId = ‘CS’ AND T.Semester = ‘F1994’

πName (σSemester=‘F1994’ (σDeptId=‘CS’ (Professor) ⋈Id=ProfId Teaching))

Partially pushed plan (selection pushed to Professor):

        πName
          |
   σSemester=‘F1994’
          |
      ⋈Id=ProfId
      /        \
σDeptId=‘CS’  Teaching
      |
  Professor

Cost Example 2 -- selection

• Compute σDeptId=‘CS’ (Professor) (to reduce the size of one join table) using the clustered, 2-level B+ tree on DeptId.
  – 50 departments and 1000 professors; hence the weight of DeptId is 20 (roughly 20 CS professors). These rows are in ~4 consecutive pages of Professor.
  • Cost = 4 (to get the rows) + 2 (to search the index) = 6
  • Keep the resulting 4 pages in memory and pipe them to the next step

[Figure: clustered index on DeptId pointing to consecutive pages of Professor rows]

Cost Example 2 -- join

• Index-nested-loops join using the hash index on ProfId of Teaching and looping on the selected professors (computed on the previous slide)
  – Since the selection on Semester was not pushed, the hash index on ProfId of Teaching can be used
  – Note: if the selection on Semester were pushed, the index on ProfId would have been lost – an advantage of not using a fully pushed query execution plan

Cost Example 2 – join (cont’d)

  – Each professor matches ~10 Teaching rows. Since there are 20 CS professors, there are ~200 matching Teaching records.
  – All index entries for a particular ProfId are in the same bucket. Assume ~1.2 I/Os to get a bucket.
  • Cost = 1.2 × 20 (to fetch the index entries for the 20 CS professors) + 200 (to fetch the Teaching rows, since the hash index is unclustered) = 224

[Figure: ~1.2 I/Os per bucket of the ProfId hash index, then 20 × 10 fetches of Teaching rows]

Cost Example 2 – select/project
• Pipe the result of the join to select (on Semester) and project (on Name) at no I/O cost
• Cost of output is the same as for Example 1
• Total cost: 6 (select on Professor) + 224 (join) = 230 (see the sketch below)
• Comparison: 4200 (Example 1) vs. 230 (Example 2) !!!
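The Example 2 arithmetic can be packaged into a small Python sketch (illustrative only; the function and parameter names are mine, and 5 rows per Professor page follows from the catalog figures of 1000 rows on 200 pages):

from math import ceil

def clustered_btree_selection_cost(matching_rows, rows_per_page, index_levels):
    # matching rows sit on consecutive pages (clustered index), plus the index search
    return ceil(matching_rows / rows_per_page) + index_levels

def index_nested_loops_cost(outer_rows, matches_per_outer_row, ios_per_bucket):
    # probe the hash index once per outer row, then fetch every matching
    # Teaching row separately because the hash index on ProfId is unclustered
    return outer_rows * ios_per_bucket + outer_rows * matches_per_outer_row

select_cost = clustered_btree_selection_cost(20, 5, 2)   # 4 + 2 = 6
join_cost = index_nested_loops_cost(20, 10, 1.2)         # 24 + 200 = 224
print(select_cost + join_cost)                           # 230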


Estimating Output Size


• It is important to estimate the size of the output of a
relational expression – size serves as input to the next
stage and affects the choice of how the next stage will
be evaluated.
• Size estimation uses the following measures on a
particular instance of R:
– Tuples(R): number of tuples
– Blocks(R): number of blocks
– Values(R.A): number of distinct values of A
– MaxVal(R.A): maximum value of A
– MinVal(R.A): minimum value of A


Estimating Output Size
• For the query: SELECT TargetList
                 FROM R1, R2, …, Rn
                 WHERE Condition

  – The reduction factor is

        Blocks(result set) / (Blocks(R1) × ... × Blocks(Rn))

  • It estimates by how much the query result is smaller than the input

Estimation of Reduction Factor

• Assume that the reduction factors due to the target list and the query condition are independent
• Thus:
      reduction(Query) = reduction(TargetList) × reduction(Condition)

Reduction Due to Simple Condition

• reduction(Ri.A = val) = 1 / Values(Ri.A)

• reduction(Ri.A = Rj.B) = 1 / max(Values(Ri.A), Values(Rj.B))

  – Assume that values are uniformly distributed, Tuples(Ri) < Tuples(Rj), and every row of Ri matches a row of Rj. Then the number of tuples that satisfy Condition is:

        Values(Ri.A) × (Tuples(Ri)/Values(Ri.A)) × (Tuples(Rj)/Values(Rj.B))

• reduction(Ri.A > val) = (MaxVal(Ri.A) – val) / (MaxVal(Ri.A) – MinVal(Ri.A))

Reduction Due to Complex Condition

• reduction(Cond1 AND Cond2) = reduction(Cond1) × reduction(Cond2)
• reduction(Cond1 OR Cond2) = min(1, reduction(Cond1) + reduction(Cond2))

Reduction Due to TargetList

• reduction(TargetList) = number-of-attributes(TargetList) / Σi number-of-attributes(Ri)

Estimating Weight of Attribute

weight(R.A) = Tuples(R) × reduction(R.A = value)

Choosing Query Execution Plan
• Step 1: Choose a logical plan
• Step 2: Reduce search space
• Step 3: Use a heuristic search to further
reduce complexity


Step 1: Choosing a Logical Plan


• Involves choosing a query tree, which indicates the
order in which algebraic operations are applied
• Heuristic: Pushed trees are good, but sometimes “nearly
fully pushed” trees are better due to indexing (as we
saw in the example)
• So: Take the initial “master plan” tree and produce a
fully pushed tree plus several nearly fully pushed trees.


Step 2: Reduce Search Space
• Deal with the associativity of binary operators (join, union, …)

[Figure: a logical query execution plan over relations A, B, C, D, an equivalent query tree, and an equivalent left-deep query tree]

Step 2 (cont’d)
• Two issues:
– Choose a particular shape of a tree (like in the
previous slide)
• Equals the number of ways to parenthesize N-way
join – grows very rapidly
– Choose a particular permutation of the leaves
• E.g., 4! permutations of the leaves A, B, C, D


Step 2: Dealing With Associativity
• Too many trees to evaluate: settle on a particular shape: the left-deep tree.
  – Used because it allows pipelining:

[Figure: left-deep plan – P1 joins A and B producing X, P2 joins X with C producing Y, P3 joins Y with D]

  – Property: once a row of X has been output by P1, it need not be output again (but C may have to be processed several times in P2 for successive portions of X)
  – Advantage: none of the intermediate relations (X, Y) has to be completely materialized and saved on disk.
    • Important if one such relation is very large, but the final result is small

Step 2: Dealing with Associativity

• Consider the alternative: if we use the association ((A ⋈ B) ⋈ (C ⋈ D))

[Figure: bushy plan – P1 joins A and B producing X, P2 joins C and D producing Y, P3 joins X and Y]

  – Each row of X must be processed against all of Y. Hence all of Y (which can be very large) must be stored in P3, or P2 has to recompute it several times.

Step 3: Heuristic Search

• The choice of left-deep trees still leaves open too many options (N! permutations):
  – (((A ⋈ B) ⋈ C) ⋈ D),
  – (((C ⋈ A) ⋈ D) ⋈ B), …
• A heuristic (often dynamic-programming-based) algorithm is used to get a ‘good’ plan

Step 3: Dynamic Programming Algorithm

• Just an idea – see the book for details (a simplified sketch appears below)
• To compute a join of E1, E2, …, EN in a left-deep manner:
  – Start with 1-relation expressions (can involve σ, π)
  – Choose the best and “nearly best” plans for each (a plan is considered nearly best if its output has some “interesting” form, e.g., is sorted)
  – Combine these 1-relation plans into 2-relation expressions. Retain only the best and nearly best 2-relation plans
  – Do the same for 3-relation expressions, etc.
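Below is a highly simplified Python sketch of the left-deep dynamic-programming idea (illustrative only, not the book’s algorithm: it keeps a single cheapest plan per subset of relations, uses a made-up cost model and size estimate, and ignores “interesting orders”):

from itertools import combinations

def left_deep_dp(relations, join_cost):
    # relations: dict of relation name -> estimated number of rows
    # join_cost(left_size, right_size): assumed cost of one more join in a left-deep plan
    best = {}   # frozenset of relation names -> (cost, plan string, estimated result size)
    for name, size in relations.items():
        best[frozenset([name])] = (0, name, size)
    for k in range(2, len(relations) + 1):
        for subset in map(frozenset, combinations(relations, k)):
            candidates = []
            for right in subset:                 # left-deep: the right operand is a base relation
                left = subset - {right}
                lcost, lplan, lsize = best[left]
                cost = lcost + join_cost(lsize, relations[right])
                size = lsize * relations[right] // 100   # toy size estimate (assumed reduction)
                candidates.append((cost, "(" + lplan + " ⋈ " + right + ")", size))
            best[subset] = min(candidates)       # keep only the cheapest plan for this subset
    return best[frozenset(relations)]

# toy example with made-up sizes and a made-up cost model
print(left_deep_dp({"Professor": 1000, "Teaching": 10000, "Transcript": 20000},
                   join_cost=lambda lsize, rsize: lsize * rsize))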


Index-Only Queries
• A B+ tree index with search key attributes A1, A2, …,
An has stored in it the values of these attributes for each
row in the table.
– Queries involving a prefix of the attribute list A1, A2, .., An
can be satisfied using only the index – no access to the actual
table is required.
• Example: Transcript has a clustered B+ tree index on
StudId. A frequently asked query is one that requests
all grades for a given CrsCode.
– Problem: Already have a clustered index on StudId – cannot
create another one (on CrsCode)
– Solution: Create an unclustered index on (CrsCode, Grade)
• Keep in mind, however, the overhead in maintaining extra indices

