Relational Query Optimization: Warih Maharani, ST.,MT

The document discusses query optimization in relational databases. It covers: 1) Query optimization chooses the most efficient execution plan for a query by estimating costs of different plans. 2) System R's optimizer uses statistics and cost models considering CPU and I/O costs to estimate costs of possible plans. 3) Query processing involves parsing, optimization, and evaluation. Optimization selects the lowest cost plan from equivalent options.

Uploaded by

redy2006
Copyright
© Attribution Non-Commercial (BY-NC)
Relational Query Optimization

Warih Maharani, ST.,MT.

1
What is Query Optimization and Why
v Whereas declarative query languages (including
SQL) offer a great comfort for users, they place a
considerable burden on a Query Processor
v A Query Processor is responsible for producing an
execution plan that will guarantee an acceptable
response time
v Choosing a query execution plan is called Query
Optimization and it mainly means making decisions
about data access methods
v Query Optimization strongly relies on File
Organization techniques
2
Overview of Query Optimization
v Plan: Tree of relational op, with choice of alg for each op.
v Two main issues:
– For a given query, what plans are considered?
u Algorithm to search plan space for cheapest (estimated) plan.
– How is the cost of a plan estimated?
v Ideally: Want to find best plan. Practically: Avoid
worst plans!
v We will study the System R approach.

3
Highlights of System R Optimizer
v Impact:
– Most widely used currently; works well for < 10 joins.
v Cost estimation: Approximate art at best.
– Statistics, maintained in system catalogs, used to estimate
cost of operations and result sizes.
– Considers combination of CPU and I/O costs.
v Plan Space: Too large, must be pruned.

4
Basic Steps in Query Processing
1. Parsing and translation
2. Optimization
3. Evaluation

5
Basic Steps in Query Processing
(Cont.)

v Parsing and translation


– translate the query into its internal form. This is
then translated into relational algebra.
– Parser checks syntax, verifies relations
v Evaluation
– The query-execution engine takes a query-
evaluation plan, executes that plan, and returns the
answers to the query.

6
Basic Steps in Query Processing :
Optimization
v A relational algebra expression may have many equivalent
expressions
– E.g., $\sigma_{balance<2500}(\Pi_{balance}(account))$ is equivalent to
$\Pi_{balance}(\sigma_{balance<2500}(account))$
v Each relational algebra operation can be evaluated using one of
several different algorithms
– Correspondingly, a relational-algebra expression can be
evaluated in many ways.
v Annotated expression specifying detailed evaluation strategy is
called an evaluation-plan.
– E.g., can use an index on balance to find accounts with balance <
2500,
– or can perform complete relation scan and discard accounts with
balance ≥ 2500
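The equivalence of the two expressions above can be checked on a tiny relation. This is a minimal sketch: the rows in `account` and the helpers `select` and `project` are made up for illustration and are not part of the slides.

```python
# Hypothetical sample rows for the account relation (not from the slides).
account = [
    {"account_number": "A-101", "balance": 500},
    {"account_number": "A-102", "balance": 3000},
    {"account_number": "A-103", "balance": 2400},
]

def select(rows, pred):
    """Selection: keep the rows that satisfy the predicate."""
    return [r for r in rows if pred(r)]

def project(rows, attrs):
    """Projection without duplicate elimination: keep only the named attributes."""
    return [{a: r[a] for a in attrs} for r in rows]

def low_balance(row):
    return row["balance"] < 2500

# Selection applied after the projection ...
plan_a = select(project(account, ["balance"]), low_balance)
# ... and projection applied after the selection.
plan_b = project(select(account, low_balance), ["balance"])

print(plan_a == plan_b)
```

Both orders give the same answer here because the selection only uses an attribute that the projection keeps.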
7
Basic Steps: Optimization (Cont.)

v Query Optimization: Amongst all equivalent evaluation plans
choose the one with lowest cost.
– Cost is estimated using statistical information from the
database catalog
u e.g. number of tuples in each relation, size of tuples, etc.

v In this chapter we study


– How to measure query costs
– Algorithms for evaluating relational algebra operations
– How to combine algorithms for individual operations in
order to evaluate a complete expression
v We study how to optimize queries, that is, how to
find an evaluation plan with lowest estimated cost
8
Cost-based Query Sub-System
[Figure: cost-based query sub-system. Queries (e.g., Select * From Blah B Where B.blah = blah) enter the Query Parser, then the Query Optimizer (Plan Generator and Plan Cost Estimator), then the Query Executor. The optimizer consults the Catalog Manager (Schema, Statistics).]
v Usually there is a heuristics-based rewriting step before the cost-based steps.
9
Measures of Query Cost
v Cost is generally measured as total elapsed time for
answering query
– Many factors contribute to time cost
u disk accesses, CPU, or even network communication

v Typically disk access is the predominant cost, and is also
relatively easy to estimate. Measured by taking into
account
– Number of seeks * average-seek-cost
– Number of blocks read * average-block-read-cost
– Number of blocks written * average-block-write-cost
u Cost to write a block is greater than cost to read a
block
– data is read back after being written to ensure
that the write was successful

10
Measures of Query Cost (Cont.)
v For simplicity we just use the number of block transfers from disk and
the number of seeks as the cost measures
– tT – time to transfer one block
– tS – time for one seek
– Cost for b block transfers plus S seeks: b * tT + S * tS
v We ignore CPU costs for simplicity
– Real systems do take CPU cost into account
v We do not include the cost of writing output to disk in our cost
formulae
v Several algorithms can reduce disk IO by using extra buffer space
– Amount of real memory available to buffer depends on other
concurrent queries and OS processes, known only during execution
u We often use worst case estimates, assuming only the minimum
amount of memory needed for the operation is available
v Required data may be buffer resident already, avoiding disk I/O
– But hard to take into account for cost estimation
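The cost measure b * tT + S * tS can be written as a small helper. This is only a sketch: the timing constants below are hypothetical device values chosen for illustration, not figures from the slides.

```python
def disk_cost(block_transfers, seeks, t_transfer, t_seek):
    """Estimated time for b block transfers plus S seeks: b * tT + S * tS."""
    return block_transfers * t_transfer + seeks * t_seek

# Hypothetical device timings in milliseconds (assumptions, not from the slides).
T_TRANSFER_MS = 0.1   # tT: time to transfer one block
T_SEEK_MS = 4.0       # tS: time for one seek

# A sequential scan of a 1000-block file: 1000 transfers and 1 seek.
scan_ms = disk_cost(1000, 1, T_TRANSFER_MS, T_SEEK_MS)
# 1000 random block reads: 1000 transfers and 1000 seeks.
random_ms = disk_cost(1000, 1000, T_TRANSFER_MS, T_SEEK_MS)

print(scan_ms, random_ms)
```

With these assumed timings the seek term dominates for random access, which is why the number of seeks is tracked separately from the number of block transfers.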
11
Selection Operation
v File scan – search algorithms that locate and retrieve
records that fulfill a selection condition.
v Algorithm A1 (linear search). Scan each file block and
test all records to see whether they satisfy the
selection condition.
– Cost estimate = br block transfers + 1 seek
u br denotes number of blocks containing records from
relation r
– If selection is on a key attribute, can stop on finding record
u average cost = (br /2) block transfers + 1 seek
– Linear search can be applied regardless of
u selection condition or
u ordering of records in the file, or
u availability of indices
– Applicable in all cases, but if other algorithms are applicable,
they are generally faster than linear search
12
Selection Operation (Cont.)
v A2 (binary search). Applicable if selection is
an equality comparison on the attribute on
which file is ordered.
– Assume that the blocks of a relation are stored
contiguously
– Cost estimate (number of disk blocks to be
scanned):
u cost of locating the first tuple by a binary search on
the blocks
– ⌈log2(br)⌉ * (tT + tS)
u If there are multiple records satisfying selection
– Add transfer cost of the number of blocks containing
records that satisfy selection condition
– Will see how to estimate this cost in Chapter 14
13
Selections Using Indices
v Index scan – search algorithms that use an index
– selection condition must be on search-key of index.
v A3 (primary index on candidate key, equality). Retrieve a single
record that satisfies the corresponding equality condition
– Cost = (hi + 1) * (tT + tS)
v A4 (primary index on nonkey, equality) Retrieve multiple records.
– Records will be on consecutive blocks
u Let b = number of blocks containing matching records

– Cost = hi * (tT + tS) + tS + tT * b


v A5 (equality on search-key of secondary index).
– Retrieve a single record if the search-key is a candidate key
u Cost = (hi + 1) * (tT + tS)

– Retrieve multiple records if search-key is not a candidate key


u each of n matching records may be on a different block

u Cost = (hi + n) * (tT + tS)

– Can be very expensive!
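The cost formulas for A1 through A5 can be compared side by side. The sketch below transcribes them directly; the function names and the sample values (1000 blocks, index height 3, 500 matching records, and the timings) are assumptions chosen for illustration.

```python
import math

def a1_linear(b_r, tT, tS, key_equality=False):
    """A1: b_r transfers + 1 seek; on average b_r/2 transfers for equality on a key."""
    transfers = b_r / 2 if key_equality else b_r
    return transfers * tT + tS

def a2_binary(b_r, tT, tS):
    """A2: cost of locating the first matching tuple by binary search on blocks."""
    return math.ceil(math.log2(b_r)) * (tT + tS)

def a3_primary_key(h_i, tT, tS):
    """A3: primary index, equality on candidate key."""
    return (h_i + 1) * (tT + tS)

def a4_primary_nonkey(h_i, b, tT, tS):
    """A4: primary index, equality on nonkey; b consecutive blocks hold the matches."""
    return h_i * (tT + tS) + tS + tT * b

def a5_secondary_nonkey(h_i, n, tT, tS):
    """A5: secondary index, n matching records, each possibly on a different block."""
    return (h_i + n) * (tT + tS)

# Hypothetical values: timings in ms, 1000-block relation, index height 3.
tT, tS, b_r, h_i = 0.1, 4.0, 1000, 3

linear_ms = a1_linear(b_r, tT, tS)
secondary_ms = a5_secondary_nonkey(h_i, 500, tT, tS)
print(linear_ms, secondary_ms)
```

Under these assumed numbers, fetching 500 records through a secondary index costs far more than a full linear scan, which illustrates the "can be very expensive" remark.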


14
Selections Involving Comparisons
v Can implement selections of the form $\sigma_{A \le V}(r)$ or $\sigma_{A \ge V}(r)$ by using
– a linear file scan or binary search,
– or by using indices in the following ways:
v A6 (primary index, comparison). (Relation is sorted on A)
u For $\sigma_{A \ge V}(r)$ use index to find first tuple ≥ V and scan relation
sequentially from there
u For $\sigma_{A \le V}(r)$ just scan relation sequentially till first tuple > V;
do not use index
v A7 (secondary index, comparison).
u For $\sigma_{A \ge V}(r)$ use index to find first index entry ≥ V and scan
index sequentially from there, to find pointers to records.
u For $\sigma_{A \le V}(r)$ just scan leaf pages of index finding pointers to
records, till first entry > V
u In either case, retrieve records that are pointed to
– requires an I/O for each record
– Linear file scan may be cheaper
15
Implementation of Complex Selections
v Conjunction: $\sigma_{\theta_1 \wedge \theta_2 \wedge \dots \wedge \theta_n}(r)$
v A8 (conjunctive selection using one index).
– Select a combination of $\theta_i$ and algorithms A1 through A7
that results in the least cost for $\sigma_{\theta_i}(r)$.
– Test other conditions on tuple after fetching it into memory
buffer.
v A9 (conjunctive selection using multiple-key index).
– Use appropriate composite (multiple-key) index if available.
v A10 (conjunctive selection by intersection of identifiers).
– Requires indices with record pointers.
– Use corresponding index for each condition, and take
intersection of all the obtained sets of record pointers.
– Then fetch records from file
– If some conditions do not have appropriate indices, apply
test in memory.
16
Algorithms for Complex Selections
v Disjunction: $\sigma_{\theta_1 \vee \theta_2 \vee \dots \vee \theta_n}(r)$.
v A11 (disjunctive selection by union of identifiers).
– Applicable if all conditions have available indices.
u Otherwise use linear scan.
– Use corresponding index for each condition, and take
union of all the obtained sets of record pointers.
– Then fetch records from file
v Negation: $\sigma_{\neg\theta}(r)$
– Use linear scan on file
– If very few records satisfy ¬θ, and an index is
applicable to θ
u Find satisfying records using index and fetch from file
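A10 and A11 both reduce to set operations on record pointers before any record is fetched. The sketch below models each index as a dictionary from a search-key value to a set of record identifiers; the index contents and function names are hypothetical.

```python
# Hypothetical secondary indexes: search-key value -> set of record ids (pointers).
index_on_rating = {7: {1, 4, 6}, 8: {2, 3}}
index_on_age = {35: {1, 2}, 40: {3, 4, 6}}

def conjunctive_rids(rid_sets):
    """A10: intersect the pointer sets obtained from each index."""
    return set.intersection(*rid_sets)

def disjunctive_rids(rid_sets):
    """A11: union the pointer sets obtained from each index."""
    return set.union(*rid_sets)

# rating = 7 AND age = 40
both = conjunctive_rids([index_on_rating[7], index_on_age[40]])
# rating = 8 OR age = 35
either = disjunctive_rids([index_on_rating[8], index_on_age[35]])

# Only now are the records fetched from the file, once per surviving pointer.
fetch_order = sorted(both)
print(both, either, fetch_order)
```

Sorting the surviving pointers before fetching is an extra assumption of this sketch; it is a common way to keep the file accesses roughly sequential.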
17
Schema for Examples
Sailors (sid: integer, sname: string, rating: integer, age: real)
Reserves (sid: integer, bid: integer, day: dates, rname: string)

v Similar to old schema; rname added for variations.


v Reserves:
– Each tuple is 40 bytes long, 100 tuples per page, 1000 pages.
v Sailors:
– Each tuple is 50 bytes long, 80 tuples per page, 500 pages.

18
Motivating Example
SELECT S.sname
FROM Reserves R, Sailors S
WHERE R.sid=S.sid AND
R.bid=100 AND S.rating>5
v RA Tree: $\pi_{sname}(\sigma_{bid=100 \wedge rating>5}(Reserves \bowtie_{sid=sid} Sailors))$
v Plan: join Reserves and Sailors on sid=sid (Simple Nested Loops), apply
the selection bid=100 and rating > 5 (On-the-fly), then project sname (On-the-fly).
v Cost: 1000 + 100*1000*500 I/Os.
v By no means the worst plan!
v Misses several opportunities:
selections could have been
`pushed’ earlier, no use is made
of any available indexes, etc.
v Goal of optimization: To find more
efficient plans that compute the
same answer.
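The cost figure for this plan follows from the schema statistics given earlier. The arithmetic below is a sketch that assumes tuple-at-a-time simple nested loops with Reserves as the outer relation, which is what the formula 1000 + 100*1000*500 corresponds to; the variable names are illustrative.

```python
# Statistics from the example schema.
RESERVES_PAGES = 1000
RESERVES_TUPLES_PER_PAGE = 100
SAILORS_PAGES = 500

# Simple nested loops: scan the outer once, and scan the whole inner
# relation once for every outer tuple.
outer_tuples = RESERVES_TUPLES_PER_PAGE * RESERVES_PAGES
snl_cost = RESERVES_PAGES + outer_tuples * SAILORS_PAGES

print(snl_cost)
```

The selections and the projection add no I/O because they are applied on the fly to the join output.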
19
Alternative Plans 1
(No Indexes)
v Plan: scan Reserves, select bid=100, write to temp T1; scan Sailors,
select rating > 5, write to temp T2; join T1 and T2 on sid=sid
(Sort-Merge Join); project sname (On-the-fly).
v Main difference: push selects.
v With 5 buffers, cost of plan:
– Scan Reserves (1000) + write temp T1 (10 pages, if we have 100 boats,
uniform distribution).
– Scan Sailors (500) + write temp T2 (250 pages, if we have 10 ratings).
– Sort T1 (2*2*10), sort T2 (2*3*250), merge (10+250)
– Total: 3560 page I/Os.
v If we used BNL join, join cost = 10+4*250, total cost = 2770.
v If we `push’ projections, T1 has only sid, T2 only sid and sname:
– T1 fits in 3 pages, cost of BNL drops to under 250 pages, total < 2000.
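The totals on this slide can be reproduced step by step. In the sketch below, the number of sort passes for T1 and T2 (2 and 3) is taken from the slide as given rather than derived, and the helper names are made up for illustration.

```python
import math

def external_sort_cost(pages, passes):
    """Each pass reads and writes every page once."""
    return 2 * passes * pages

def block_nested_loops_cost(outer_pages, inner_pages, buffers):
    """Read the outer once in blocks of (buffers - 2) pages; scan the inner once per block."""
    outer_blocks = math.ceil(outer_pages / (buffers - 2))
    return outer_pages + outer_blocks * inner_pages

T1_PAGES, T2_PAGES, BUFFERS = 10, 250, 5

# Scan Reserves and write T1, scan Sailors and write T2.
select_cost = (1000 + T1_PAGES) + (500 + T2_PAGES)

# Sort-merge join: sort both temps (pass counts as stated on the slide), then merge.
smj_cost = (external_sort_cost(T1_PAGES, 2)
            + external_sort_cost(T2_PAGES, 3)
            + T1_PAGES + T2_PAGES)
smj_total = select_cost + smj_cost

# Block nested loops join with T1 as the outer relation.
bnl_cost = block_nested_loops_cost(T1_PAGES, T2_PAGES, BUFFERS)
bnl_total = select_cost + bnl_cost

print(smj_total, bnl_total)
```

The same `block_nested_loops_cost` helper shows the effect of pushing projections: if T1 shrinks to 3 pages it fits in a single outer block, so T2 is scanned only once.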

20
Alternative Plans 2
With Indexes
v Plan: select bid=100 on Reserves (use hash index; do not write result to
temp); join with Sailors on sid=sid (Index Nested Loops, with pipelining);
select rating > 5 (On-the-fly); project sname (On-the-fly).
v With clustered index on bid of Reserves, we get 100,000/100 =
1000 tuples on 1000/100 = 10 pages.
v INL with pipelining (outer is not materialized).
– Projecting out unnecessary fields from outer doesn’t help.
v Join column sid is a key for Sailors.
– At most one matching tuple, unclustered index on sid OK.
v Decision not to push rating>5 before the join is based on
availability of sid index on Sailors.
v Cost: Selection of Reserves tuples (10 I/Os); for each,
must get matching Sailors tuple (1000*1.2); total 1210 I/Os.
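The 1210 I/O total can be checked with a short calculation. This sketch assumes, as the slide does, 100 boats with uniformly distributed reservations and about 1.2 I/Os per hash-index probe; the variable names are illustrative.

```python
RESERVES_TUPLES = 100_000
RESERVES_PAGES = 1000
NUM_BOATS = 100            # assumption stated on the slides: uniform distribution
HASH_PROBE_IOS = 1.2       # typical cost of one hash-index lookup

# Clustered index on bid: matching tuples sit on consecutive pages.
matching_tuples = RESERVES_TUPLES // NUM_BOATS
selection_ios = RESERVES_PAGES // NUM_BOATS

# Index nested loops: one probe of the Sailors sid index per matching Reserves tuple.
probe_ios = matching_tuples * HASH_PROBE_IOS
total_ios = selection_ios + probe_ios

print(matching_tuples, selection_ios, total_ios)
```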
21
What is needed for optimization?

v A closed set of operators


– Relational ops (table in, table out)
– Encapsulation based on iterators
v Plan space
– Based on relational equivalences
v Cost Estimation, based on
– Cost formulas
– Size estimation, based on
u Catalog information on base tables
u Selectivity (Reduction Factor) estimation
v A search algorithm
– To sift through the plan space based on cost!

22
Query Blocks: Units of Optimization
SELECT S.sname
FROM Sailors S
WHERE S.age IN
(SELECT MAX (S2.age)
FROM Sailors S2
GROUP BY S2.rating)
(Outer block: the query on Sailors S. Nested block: the parenthesized query on Sailors S2.)
v An SQL query is parsed into a
collection of query blocks, and these
are optimized one block at a time.
v Nested blocks are usually treated as
calls to a subroutine, made once per
outer tuple. (This is an over-
simplification, but serves for now.)
v For each block, the plans considered are:
– All available access methods, for each reln in FROM clause.
– All left-deep join trees (i.e., all ways to join the relations one-
at-a-time, with the inner reln in the FROM clause, considering
all reln permutations and join methods.)
24
Cost Estimation
v For each plan considered, must estimate cost:
– Must estimate cost of each operation in plan tree.
u Depends on input cardinalities.
u We’ve already discussed how to estimate the cost of operations
(sequential scan, index scan, joins, etc.)
– Must estimate size of result for each operation in tree!
u Use information about the input relations.
u For selections and joins, assume independence of predicates.
v We’ll discuss the System R cost estimation approach.
– Very inexact, but works ok in practice.
– More sophisticated techniques known now.
25
Statistics and Catalogs
v Need information about the relations and indexes
involved. Catalogs typically contain at least:
– # tuples (NTuples) and # pages (NPages) for each relation.
– # distinct key values (NKeys) and NPages for each index.
– Index height, low/high key values (Low/High) for each
tree index.
v Catalogs updated periodically.
– Updating whenever data changes is too expensive; lots of
approximation anyway, so slight inconsistency ok.
v More detailed information (e.g., histograms of the
values in some field) is sometimes stored.
26
Size Estimation and Reduction Factors
v Consider a query block:
SELECT attribute list
FROM relation list
WHERE term1 AND ... AND termk
v Maximum # tuples in result is the product of the
cardinalities of relations in the FROM clause.
v Reduction factor (RF) associated with each term reflects
the impact of the term in reducing result size. Result
cardinality = Max # tuples * product of all RF’s.
– Implicit assumption that terms are independent!
– Term col=value has RF 1/NKeys(I), given index I on col
– Term col1=col2 has RF 1/MAX(NKeys(I1), NKeys(I2))
– Term col>value has RF (High(I)-value)/(High(I)-Low(I))
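The reduction-factor rules translate directly into a size estimator. The sketch below applies them to the motivating query; the catalog values (100 distinct bid values, ratings ranging from 1 to 10, 40,000 distinct sid values) are assumptions chosen to match the example schema, and the function names are hypothetical.

```python
import math

def rf_col_equals_value(nkeys):
    """Term col = value: RF = 1 / NKeys(I)."""
    return 1 / nkeys

def rf_col_equals_col(nkeys1, nkeys2):
    """Term col1 = col2: RF = 1 / MAX(NKeys(I1), NKeys(I2))."""
    return 1 / max(nkeys1, nkeys2)

def rf_col_greater_than(value, low, high):
    """Term col > value: RF = (High(I) - value) / (High(I) - Low(I))."""
    return (high - value) / (high - low)

def result_cardinality(cardinalities, reduction_factors):
    """Max # tuples (product of cardinalities) times the product of all RFs."""
    return math.prod(cardinalities) * math.prod(reduction_factors)

# Assumed catalog statistics (not all are stated on the slides).
NTUPLES_RESERVES, NTUPLES_SAILORS = 100_000, 40_000

rfs = [
    rf_col_equals_value(100),             # R.bid = 100
    rf_col_greater_than(5, 1, 10),        # S.rating > 5
    rf_col_equals_col(40_000, 40_000),    # R.sid = S.sid
]
estimate = result_cardinality([NTUPLES_RESERVES, NTUPLES_SAILORS], rfs)
print(round(estimate))
```

Multiplying the factors is only valid under the independence assumption noted above; correlated predicates would make this estimate too low or too high.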
27
Relational Algebra Equivalences
v Allow us to choose different join orders and to `push’
selections and projections ahead of joins.
v Selections: $\sigma_{c_1 \wedge \dots \wedge c_n}(R) \equiv \sigma_{c_1}(\dots \sigma_{c_n}(R))$ (Cascade)
$\sigma_{c_1}(\sigma_{c_2}(R)) \equiv \sigma_{c_2}(\sigma_{c_1}(R))$ (Commute)
v Projections: $\pi_{a_1}(R) \equiv \pi_{a_1}(\dots(\pi_{a_n}(R)))$ (Cascade)
v Joins: $R \bowtie (S \bowtie T) \equiv (R \bowtie S) \bowtie T$ (Associative)
$(R \bowtie S) \equiv (S \bowtie R)$ (Commute)
+ Show that: $R \bowtie (S \bowtie T) \equiv (T \bowtie R) \bowtie S$


28
More Equivalences
v A projection commutes with a selection that only
uses attributes retained by the projection.
v Selection between attributes of the two arguments of
a cross-product converts cross-product to a join.
v A selection on just attributes of R commutes with
$R \bowtie S$. (i.e., $\sigma(R \bowtie S) \equiv \sigma(R) \bowtie S$)
v Similarly, if a projection follows a join $R \bowtie S$, we can
`push’ it by retaining only attributes of R (and S) that
are needed for the join or are kept by the projection.
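Pushing a selection below a join can be demonstrated on small in-memory relations. This is a sketch with made-up rows and helper functions; it shows that both orders give the same result while the pushed version feeds fewer tuples into the join.

```python
# Hypothetical sample rows (not from the slides).
reserves = [
    {"sid": 1, "bid": 100},
    {"sid": 2, "bid": 101},
    {"sid": 1, "bid": 101},
]
sailors = [
    {"sid": 1, "sname": "Ann"},
    {"sid": 2, "sname": "Bob"},
]

def select(rows, pred):
    """Selection: keep the rows that satisfy the predicate."""
    return [r for r in rows if pred(r)]

def join(left, right, attr):
    """Equijoin on a shared attribute, as a nested loop over both inputs."""
    return [{**l, **r} for l in left for r in right if l[attr] == r[attr]]

def bid_is_100(row):
    return row["bid"] == 100

# Selection applied after the join: the join sees every Reserves tuple.
late = select(join(reserves, sailors, "sid"), bid_is_100)

# Selection pushed below the join: the join sees only matching Reserves tuples.
pushed_input = select(reserves, bid_is_100)
early = join(pushed_input, sailors, "sid")

print(late == early, len(reserves), len(pushed_input))
```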

29
Enumeration of Alternative Plans
v There are two main cases:
– Single-relation plans
– Multiple-relation plans
v For queries over a single relation, queries consist of a
combination of selects, projects, and aggregate ops:
– Each available access path (file scan / index) is considered,
and the one with the least estimated cost is chosen.
– The different operations are essentially carried out
together (e.g., if an index is used for a selection, projection
is done for each retrieved tuple, and the resulting tuples
are pipelined into the aggregate computation).
30
Cost Estimates for Single-Relation Plans

v Index I on primary key matches selection:


– Cost is Height(I)+1 for a B+ tree, about 1.2 for hash index.
v Clustered index I matching one or more selects:
– (NPages(I)+NPages(R)) * product of RF’s of matching selects.
v Non-clustered index I matching one or more selects:
– (NPages(I)+NTuples(R)) * product of RF’s of matching selects.
v Sequential scan of file:
– NPages(R).
+ Note: Typically, no duplicate elimination on projections!
(Exception: Done on answers if user says DISTINCT.)
31
Example
SELECT S.sid
FROM Sailors S
WHERE S.rating=8
v If we have an index on rating:
– (1/NKeys(I)) * NTuples(R) = (1/10) * 40000 tuples retrieved.
– Clustered index: (1/NKeys(I)) * (NPages(I)+NPages(R)) =
(1/10) * (50+500) pages are retrieved. (This is the cost.)
– Unclustered index: (1/NKeys(I)) * (NPages(I)+NTuples(R))
= (1/10) * (50+40000) pages are retrieved.
v If we have an index on sid:
– Would have to retrieve all tuples/pages. With a clustered
index, the cost is 50+500, with unclustered index, 50+40000.
v Doing a file scan:
– We retrieve all file pages (500).
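The alternatives in this example can be tabulated and compared. The sketch below uses exact fractions to avoid rounding noise; the dictionary of plans is illustrative, and in practice only the indexes that actually exist would be candidates.

```python
from fractions import Fraction

# Statistics used on the slide for Sailors and its index.
NPAGES_R = 500
NTUPLES_R = 40_000
NPAGES_I = 50
RF_RATING = Fraction(1, 10)   # 1 / NKeys(I) for the index on rating

# Estimated page I/Os for each access path considered on the slide.
plans = {
    "clustered index on rating": (NPAGES_I + NPAGES_R) * RF_RATING,
    "unclustered index on rating": (NPAGES_I + NTUPLES_R) * RF_RATING,
    "clustered index on sid": NPAGES_I + NPAGES_R,
    "unclustered index on sid": NPAGES_I + NTUPLES_R,
    "file scan": NPAGES_R,
}

cheapest = min(plans, key=plans.get)
for name, cost in sorted(plans.items(), key=lambda item: item[1]):
    print(f"{name}: {int(cost)} pages")
print("cheapest:", cheapest)
```

Note that the file scan beats every unclustered option here, which is why an unclustered index is only attractive when the reduction factor is very small.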
32
Queries Over Multiple Relations
v Fundamental decision in System R: only left-deep join
trees are considered.
– As the number of joins increases, the number of alternative
plans grows rapidly; we need to restrict the search space.
– Left-deep trees allow us to generate all fully pipelined plans.
u Intermediate results not written to temporary files.

u Not all left-deep trees are fully pipelined (e.g., SM join).

[Figure: example join trees over relations A, B, C, and D.]
33
Enumeration of Left-Deep Plans
v Left-deep plans differ only in the order of relations,
the access method for each relation, and the join
method for each join.
v Enumerated using N passes (if N relations joined):
– Pass 1: Find best 1-relation plan for each relation.
– Pass 2: Find best way to join result of each 1-relation plan
(as outer) to another relation. (All 2-relation plans.)
– Pass N: Find best way to join result of a (N-1)-relation plan
(as outer) to the N’th relation. (All N-relation plans.)
v For each subset of relations, retain only:
– Cheapest plan overall, plus
– Cheapest plan for each interesting order of the tuples.
34
Enumeration of Plans (Contd.)
v ORDER BY, GROUP BY, aggregates etc. handled as a
final step, using either an `interestingly ordered’
plan or an additional sorting operator.
v An N-1 way plan is not combined with an
additional relation unless there is a join condition
between them, unless all predicates in WHERE have
been used up.
– i.e., avoid Cartesian products if possible.
v In spite of pruning plan space, this approach is still
exponential in the # of tables.
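The pass-by-pass enumeration can be sketched as dynamic programming over subsets of relations. This is a heavily simplified illustration, not the System R implementation: the cardinalities and selectivities are hypothetical, the cost of a plan is taken to be the sum of its intermediate result sizes, access methods and join methods are ignored, and interesting orders are not tracked.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical statistics: tuples per relation and selectivity of each join predicate.
CARD = {"A": 1000, "B": 100, "C": 10}
SEL = {
    frozenset({"A", "B"}): Fraction(1, 100),
    frozenset({"B", "C"}): Fraction(1, 10),
}

def joined_size(outer_rels, outer_size, inner):
    """Estimated size of (outer plan JOIN inner); None if it would be a cross product."""
    factors = [SEL[frozenset({o, inner})] for o in outer_rels
               if frozenset({o, inner}) in SEL]
    if not factors:
        return None
    size = outer_size * CARD[inner]
    for f in factors:
        size *= f
    return size

def best_left_deep(relations):
    # Pass 1: one plan per single relation, stored as (cost, size, join order).
    best = {frozenset({r}): (0, CARD[r], (r,)) for r in relations}
    # Pass k: extend each retained (k-1)-relation plan with one more inner relation.
    for k in range(2, len(relations) + 1):
        for subset in combinations(relations, k):
            target = frozenset(subset)
            for inner in subset:
                outer = target - {inner}
                if outer not in best:
                    continue
                cost, size, order = best[outer]
                new_size = joined_size(outer, size, inner)
                if new_size is None:        # avoid Cartesian products
                    continue
                new_cost = cost + new_size  # toy cost: sum of intermediate sizes
                if target not in best or new_cost < best[target][0]:
                    best[target] = (new_cost, new_size, order + (inner,))
    return best[frozenset(relations)]

cost, size, order = best_left_deep(["A", "B", "C"])
print(order, cost)
```

Only one plan is kept per subset of relations, so the table has at most 2^N entries; that is the pruning that makes the search feasible, and also why the approach remains exponential in the number of tables.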
35
Example
Sailors:
B+ tree on rating
Hash on sid
Reserves:
B+ tree on bid
v Query: $\pi_{sname}(\sigma_{bid=100 \wedge rating>5}(Reserves \bowtie_{sid=sid} Sailors))$
v Pass 1:
– Sailors: B+ tree matches rating>5,
and is probably cheapest. However,
if this selection is expected to
retrieve a lot of tuples, and index is
unclustered, file scan may be cheaper.
u Still, B+ tree plan kept (because tuples are in rating order).
– Reserves: B+ tree on bid matches bid=100; cheapest.
v Pass 2:
– We consider each plan retained from Pass 1 as the outer,
and consider how to join it with the (only) other relation.
u e.g., Reserves as outer: Hash index can be used to get Sailors tuples
that satisfy sid = outer tuple’s sid value.
36
Nested Queries
SELECT S.sname
FROM Sailors S
WHERE EXISTS
(SELECT *
FROM Reserves R
WHERE R.bid=103
AND R.sid=S.sid)
v Nested block is optimized
independently, with the outer
tuple considered as providing a
selection condition.
v Outer block is optimized with
the cost of `calling’ nested block
computation taken into account.
v Implicit ordering of these blocks
means that some good strategies
are not considered. The non-
nested version of the query is
typically optimized better.
Nested block to optimize:
SELECT *
FROM Reserves R
WHERE R.bid=103
AND R.sid= outer value
Equivalent non-nested query:
SELECT S.sname
FROM Sailors S, Reserves R
WHERE S.sid=R.sid
AND R.bid=103
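The nested and non-nested forms can be run side by side with Python's built-in sqlite3 module. This is a sketch on made-up rows; it compares the results as sorted lists because the join form may return a name more than once if the same sailor has several reservations for boat 103, which the sample data avoids.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Sailors (sid INTEGER PRIMARY KEY, sname TEXT, rating INTEGER, age REAL);
CREATE TABLE Reserves (sid INTEGER, bid INTEGER, day TEXT, rname TEXT);
-- Hypothetical sample rows, chosen only for illustration.
INSERT INTO Sailors VALUES (1, 'Dustin', 7, 45.0), (2, 'Lubber', 8, 55.5), (3, 'Rusty', 10, 35.0);
INSERT INTO Reserves VALUES (1, 103, '2024-01-01', 'r1'), (3, 103, '2024-01-02', 'r2'), (2, 101, '2024-01-03', 'r3');
""")

nested = conn.execute("""
    SELECT S.sname FROM Sailors S
    WHERE EXISTS (SELECT * FROM Reserves R
                  WHERE R.bid = 103 AND R.sid = S.sid)
""").fetchall()

flat = conn.execute("""
    SELECT S.sname FROM Sailors S, Reserves R
    WHERE S.sid = R.sid AND R.bid = 103
""").fetchall()

print(sorted(nested), sorted(flat))
conn.close()
```

Prefixing either statement with `EXPLAIN QUERY PLAN` shows the plan SQLite chooses for it, which is one way to see whether an optimizer treats the two forms differently.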
37
Summary
v Query optimization is an important task in a
relational DBMS.
v Must understand optimization in order to understand
the performance impact of a given database design
(relations, indexes) on a workload (set of queries).
v Two parts to optimizing a query:
– Consider a set of alternative plans.
u Must prune search space; typically, left-deep plans only.
– Must estimate cost of each plan that is considered.
u Must estimate size of result and cost for each plan node.
u Key issues: Statistics, indexes, operator implementations.
38
Summary (Contd.)
v Single-relation queries:
– All access paths considered, cheapest is chosen.
– Issues: Selections that match index, whether index key has
all needed fields and/or provides tuples in a desired order.
v Multiple-relation queries:
– All single-relation plans are first enumerated.
u Selections/projections considered as early as possible.

– Next, for each 1-relation plan, all ways of joining another
relation (as inner) are considered.
– Next, for each 2-relation plan that is `retained’, all ways of
joining another relation (as inner) are considered, etc.
– At each level, for each subset of relations, only best plan for
each interesting order of tuples is `retained’.
39
