Overview of Query Evaluation: R&G Chapter 12

The document provides an overview of query evaluation and optimization in a database management system. It discusses that a query optimizer must consider different possible query execution plans involving various join algorithms and access paths. The optimizer's goal is to minimize the number of disk I/Os by choosing the most efficient plan. The document reviews different algorithms for relational operators like selection, projection, joins, and how to estimate their costs using statistics stored in catalogs. It provides examples of using indexes and sorting to evaluate queries more efficiently.

Uploaded by

budisetiono56

Overview of Query Evaluation
R&G Chapter 12
Lecture 13

Administrivia
Exams graded
HW2 due in a week
No Office Hours Today

Review: Storage
A DBMS has layers

Query Optimization and Execution
Relational Operators
Files and Access Methods
Buffer Management
Disk Space Management

DB

Now to Midterm 2

Review
We studied Relational Algebra
Many equivalent queries, produce same result
Which expression is most efficient?
We studied file organizations
Hash files, Sorted files, Clustered & Unclustered Indexes
Compared scans, sorting, searches, insert, delete
Today: costs to implement relational operations
Thurs, Tues: Sorting, Joins

Queries today, more on sorting next time
Remember: SQL is a declarative language
It describes the query result, but not how to get it
Relational Algebra describes how to get results
But many rel. algebra queries are equivalent
How to choose the right one for an SQL query?
In a nutshell:
When a database executes a query, it must generate a variety of possible plans (relational algebra queries), and find the cheapest one to execute.

Review: Relational Algebra


First, remember Relational Algebra
Selection ($\sigma$): Selects a subset of rows from relation (horizontal).
Projection ($\pi$): Retains only wanted columns from relation (vertical).
Cross-product ($\times$): Allows us to combine two relations.
Set-difference ($-$): Tuples in r1, but not in r2.
Union ($\cup$): Tuples in r1 and/or in r2.
Intersection ($\cap$)
Join ($\bowtie$)
Division ( / )

Overview of Query Evaluation


Plan: Tree of R.A. ops, with choice of alg for each op.
Two main issues in query optimization:
For a given query, what plans are considered?
Algorithm to search plan space for cheapest (estimated) plan.
How is the cost of a plan estimated?
Ideally: Want to find best plan.
Practically: Avoid worst plans!
We will study the System R approach.

Overview (cont)
Query Evaluation involves:
Choosing an Access Path to get at each
table
Evaluating different algorithms for each
relational operator
Choosing the order to apply the relational
operators
These choices interrelated

Overview (cont)
Overall goal: minimize I/Os
Algorithms for evaluating relational
operators use simple ideas extensively:
Indexing: Can use WHERE conditions to
retrieve small set of tuples (selections, joins)
Iteration: Sometimes, faster to scan all tuples
even if there is an index. (sometimes scan the
data entries in an index instead of the table
itself.)
Partitioning: By using sorting or hashing, we can partition the input tuples and replace an expensive operation by similar operations on smaller inputs.
* Watch for these techniques as we discuss query evaluation!

Intermission: a preview of
sorting
Data can only be sorted when in memory
But tables often *much* bigger than
memory
One solution: merge sort
Everyone stand up
Go to the aisle by the windows
I will take 10 people at a time onto the
stage
I will sort each group of 10 on last name
from A to Z
Groups will then be merged

Two-Way External Merge Sort


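Each pass we read + write each page in file.
N pages in the file => the number of passes = $\lceil \log_2 N \rceil + 1$
So total cost is: $2N(\lceil \log_2 N \rceil + 1)$
Idea: Divide and conquer: sort subfiles and merge

Figure: two-way external merge sort of a 7-page input file (pages separated by |, runs by ||)
Input file:           3,4 | 6,2 | 9,4 | 8,7 | 5,6 | 3,1 | 2
PASS 0, 1-page runs:  3,4 || 2,6 || 4,9 || 7,8 || 5,6 || 1,3 || 2
PASS 1, 2-page runs:  2,3 | 4,6 || 4,7 | 8,9 || 1,3 | 5,6 || 2
PASS 2, 4-page runs:  2,3 | 4,4 | 6,7 | 8,9 || 1,2 | 3,5 | 6
PASS 3, 8-page runs:  1,2 | 2,3 | 3,4 | 4,5 | 6,6 | 7,8 | 9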
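
The pass count and cost formula for two-way external merge sort can be checked with a small simulation. This is a minimal sketch in Python, not a DBMS implementation: the function names are made up for illustration, the pages are in-memory lists, and page I/Os are counted arithmetically (one read plus one write of every page per pass) rather than measured.

```python
import heapq
import math

def two_way_external_sort(pages):
    """Simulate two-way external merge sort; return (sorted keys, passes, page I/Os)."""
    n_pages = len(pages)
    # Pass 0: sort each page by itself, producing 1-page runs.
    runs = [sorted(page) for page in pages]
    passes = 1
    while len(runs) > 1:
        # Each later pass merges adjacent pairs of runs into longer runs.
        runs = [list(heapq.merge(*runs[i:i + 2])) for i in range(0, len(runs), 2)]
        passes += 1
    # Every pass reads and writes each page once.
    return runs[0], passes, 2 * n_pages * passes

def predicted_cost(n_pages):
    """Closed-form cost 2N(ceil(log2 N) + 1)."""
    return 2 * n_pages * (math.ceil(math.log2(n_pages)) + 1)

# The 7-page input file from the two-way merge sort example.
pages = [[3, 4], [6, 2], [9, 4], [8, 7], [5, 6], [3, 1], [2]]
keys, passes, io = two_way_external_sort(pages)
print(keys, passes, io, predicted_cost(len(pages)))
```

For the 7-page file the simulation uses 4 passes (PASS 0 through PASS 3), which agrees with the closed form.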

Schema for Examples

Sailors (sid: integer, sname: string, rating: integer, age: real)
Reserves (sid: integer, bid: integer, day: dates, rname: string)
Similar to old schema; rname added for variations.
Reserves:
Each tuple is 40 bytes long, 100 tuples per page, 1000 pages.
Sailors:
Each tuple is 50 bytes long, 80 tuples per page, 500 pages.

Example 1
SELECT sname, bid FROM Sailors S, Reserves R WHERE S.sid = R.sid AND S.age > 99
Several possible rel. algebra queries:
$\pi_{sname,bid}(\sigma_{S.age>99}(S \bowtie R))$
$\pi_{sname,bid}((\sigma_{S.age>99} S) \bowtie R)$
Second one may be much cheaper if right indexes exist.

Statistics and Catalogs


Need information about relations and indexes
involved. Catalogs typically contain at least:
# tuples (NTuples), # pages (NPages) for each relation.
# distinct key values (NKeys) and NPages for each index.
Index height, low/high key values (Low/High) for each
tree index.
Catalogs updated periodically.
Updating whenever data changes is too expensive; lots of approximation anyway, so slight inconsistency ok.
More detailed information (e.g., histograms of the values in some field) is sometimes stored.

Access Paths: Getting tuples from a Table

Access path: a method of retrieving tuples:
File scan, or index that matches a selection (in the query)
Is an index useful for a query? If it matches predicate:
Tree index matches (a conjunction of) terms that involve only attributes in a prefix of the search key.
E.g., Tree index on <a, b, c> matches the selection a=5 AND b=3, and a=5 AND b>6, but not b=3.
Hash index matches (a conjunction of) terms that has a term attribute = value for every attribute in the search key.
E.g., Hash index on <a, b, c> matches a=5 AND b=3 AND c=5; but it does not match b=3, or a=5 AND b=3, or a>5.
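
The two matching rules can be written down as small predicates. The sketch below is illustrative only: the function names and the representation of a conjunction as a list of (attribute, operator) pairs are assumptions, and each function tests whether the whole conjunction matches the index, exactly as the rules are stated above.

```python
def tree_index_matches(search_key, terms):
    """True if the terms involve only attributes in a prefix of the search key."""
    attrs = {attr for attr, _op in terms}
    if not attrs:
        return False
    # The attributes used must be exactly the first len(attrs) key attributes.
    return attrs == set(search_key[:len(attrs)])

def hash_index_matches(search_key, terms):
    """True if there is an equality term for every attribute in the search key."""
    eq_attrs = {attr for attr, op in terms if op == "="}
    return set(search_key) <= eq_attrs

key = ("a", "b", "c")
print(tree_index_matches(key, [("a", "="), ("b", "=")]))   # a=5 AND b=3
print(tree_index_matches(key, [("a", "="), ("b", ">")]))   # a=5 AND b>6
print(tree_index_matches(key, [("b", "=")]))               # b=3
print(hash_index_matches(key, [("a", "="), ("b", "="), ("c", "=")]))
print(hash_index_matches(key, [("a", "="), ("b", "=")]))
```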

A Note on Complex Selections

(day<8/9/94 AND rname='Paul') OR bid=5 OR sid=3
Selection conditions are first converted to conjunctive normal form (CNF):
(day<8/9/94 OR bid=5 OR sid=3) AND (rname='Paul' OR bid=5 OR sid=3)
We only discuss case with no ORs; see text if you are curious about the general case.

One Approach to Selections


Find the most selective access path, retrieve tuples using it, and apply any remaining terms that don't match the index:
Most selective access path: An index or file scan that will require the fewest I/Os.
Terms that match this index reduce the number of tuples retrieved; other terms are used to discard some retrieved tuples, but do not affect number of tuples/pages fetched.
Consider day<8/9/94 AND bid=5 AND sid=3. A B+ tree index on day can be used; then, bid=5 and sid=3 must be checked for each retrieved tuple. Similarly, a hash index on <bid, sid> could be used; day<8/9/94 must then be checked.

Using an Index for Selections


Cost depends on #qualifying tuples, and clustering.
Cost of finding qualifying data entries (typically small) plus cost of retrieving records (could be large w/o clustering).
For example, assuming uniform distribution of names, about 5% of tuples qualify (50 pages, 5000 tuples). With a clustered index, cost is little more than 50 I/Os; if unclustered, up to 1000 I/Os!
SELECT *
FROM Reserves R
WHERE R.rname < 'C%'
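
The numbers in this example follow from the Reserves statistics (100,000 tuples on 1000 pages). The sketch below recomputes them; the function name and the dictionary keys are hypothetical. It ignores the small cost of finding the data entries and shows two unclustered bounds. The second bound (each data page fetched at most once, for instance if rids are sorted by page before fetching) is an assumption about how the 1000 I/O figure arises, not something the slide states.

```python
def index_selection_io(n_tuples, n_pages, percent_qualifying):
    """Rough record-retrieval I/O for an index selection (index lookup cost ignored)."""
    tuples = n_tuples * percent_qualifying // 100
    pages = n_pages * percent_qualifying // 100
    return {
        "qualifying_tuples": tuples,
        # Clustered: qualifying records sit on consecutive pages.
        "clustered": pages,
        # Unclustered, naive: one I/O per qualifying tuple.
        "unclustered_one_io_per_tuple": tuples,
        # Unclustered, assuming each data page is fetched at most once.
        "unclustered_each_page_once": min(tuples, n_pages),
    }

print(index_selection_io(n_tuples=100_000, n_pages=1000, percent_qualifying=5))
```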

Projection

SELECT DISTINCT R.sid, R.bid
FROM Reserves R
Expensive part is removing duplicates.
SQL systems don't remove duplicates unless the keyword DISTINCT is specified in a query.
Sorting Approach: Sort on <sid, bid> and remove duplicates. (Can optimize this by dropping unwanted information while sorting.)
Hashing Approach: Hash on <sid, bid> to create partitions. Load partitions into memory one at a time, build in-memory hash structure, and eliminate duplicates.
If there is an index with both R.sid and R.bid in the search key, may be cheaper to sort data entries!
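
Both duplicate-elimination approaches can be sketched in memory. This is an illustration under simplifying assumptions: the function names and the partition count are made up, rows are Python tuples, and the "partitions" are lists rather than files on disk. The sample rows are the (sid, bid, day, rname) Reserves tuples used in the sort-merge join example.

```python
def project_distinct_sort(rows, cols):
    """Sorting approach: project, sort, then drop adjacent duplicates."""
    projected = sorted(tuple(row[c] for c in cols) for row in rows)
    result = []
    for t in projected:
        if not result or result[-1] != t:
            result.append(t)
    return result

def project_distinct_hash(rows, cols, n_partitions=4):
    """Hashing approach: partition on the projected fields, then dedupe each partition."""
    partitions = [[] for _ in range(n_partitions)]
    for row in rows:
        t = tuple(row[c] for c in cols)
        partitions[hash(t) % n_partitions].append(t)
    result = []
    for part in partitions:
        seen = set()  # in-memory hash structure for one partition
        for t in part:
            if t not in seen:
                seen.add(t)
                result.append(t)
    return result

reserves = [
    (28, 103, "12/4/96", "guppy"), (28, 103, "11/3/96", "yuppy"),
    (31, 101, "10/10/96", "dustin"), (31, 102, "10/12/96", "lubber"),
    (31, 101, "10/11/96", "lubber"), (58, 103, "11/12/96", "dustin"),
]
print(project_distinct_sort(reserves, cols=(0, 1)))
print(sorted(project_distinct_hash(reserves, cols=(0, 1))))
```

Duplicates always land in the same partition because they hash identically, which is why each partition can be deduplicated independently.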

Join: Index Nested Loops


foreach tuple r in R do
    foreach tuple s in S where ri == sj do
        add <r, s> to result
(M = # pages in R, N = # pages in S, pR = # R tuples per page)
No index: Cost M + M * N
If there is an index on the join column of one relation (say S), can make it the inner and exploit the index.
Cost: M + ( (M*pR) * cost of finding matching S tuples)
For each R tuple, cost of probing S index is about 1.2 for hash index, 2-4 for B+ tree. Cost of then finding S tuples (assuming Alt. (2) or (3) for data entries) depends on clustering.
Clustered index: 1 I/O (typical), unclustered: up to 1 I/O per matching S tuple.
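
The loop structure of index nested loops can be sketched as follows. This is an in-memory illustration, not a disk-based implementation: the function name is hypothetical, and a Python dictionary built on the inner relation stands in for a hash index that would already exist in a real system.

```python
from collections import defaultdict

def index_nested_loops_join(outer, inner, outer_col, inner_col):
    """Join outer and inner on outer[outer_col] == inner[inner_col] via an index on inner."""
    # Stand-in for a pre-existing hash index on the inner relation's join column.
    index = defaultdict(list)
    for s in inner:
        index[s[inner_col]].append(s)
    result = []
    for r in outer:                                # one scan of the outer relation
        for s in index.get(r[outer_col], []):      # index probe instead of scanning inner
            result.append((r, s))
    return result

sailors = [(22, "dustin", 7, 45.0), (28, "yuppy", 9, 35.0), (31, "lubber", 8, 55.5),
           (44, "guppy", 5, 35.0), (58, "rusty", 10, 35.0)]
reserves = [(28, 103, "12/4/96", "guppy"), (28, 103, "11/3/96", "yuppy"),
            (31, 101, "10/10/96", "dustin"), (31, 102, "10/12/96", "lubber"),
            (31, 101, "10/11/96", "lubber"), (58, 103, "11/12/96", "dustin")]
pairs = index_nested_loops_join(reserves, sailors, outer_col=0, inner_col=0)
print(len(pairs))
```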

Examples of Index Nested Loops


Hash-index (Alt. 2) on sid of Sailors (as inner):
Scan Reserves: 1000 page I/Os, 100*1000 tuples.
For each Reserves tuple: 1.2 I/Os to get data entry in index, plus 1 I/O to get (the exactly one) matching Sailors tuple. Total: about 221,000 I/Os (1000 + 100,000 * 2.2).
Hash-index (Alt. 2) on sid of Reserves (as inner):
Scan Sailors: 500 page I/Os, 80*500 tuples.
For each Sailors tuple: 1.2 I/Os to find index page with data entries, plus cost of retrieving matching Reserves tuples. Assuming uniform distribution, 2.5 reservations per sailor (100,000 / 40,000). Cost of retrieving them is 1 or 2.5 I/Os depending on whether the index is clustered.
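
The arithmetic behind both examples can be spelled out with the cost formula M + (M*pR) * (probe cost + fetch cost). The helper below is a hypothetical sketch; the totals for the second example are not given on the slide and are simply what the formula yields for its stated per-tuple costs.

```python
def inl_cost(outer_pages, tuples_per_page, probe_io, fetch_io):
    """Index nested loops cost: scan the outer, then probe and fetch per outer tuple."""
    outer_tuples = outer_pages * tuples_per_page
    return outer_pages + round(outer_tuples * (probe_io + fetch_io))

# Sailors as inner: 1.2 I/Os per probe plus 1 I/O for the single matching tuple.
print(inl_cost(outer_pages=1000, tuples_per_page=100, probe_io=1.2, fetch_io=1))
# Reserves as inner, clustered index: about 1 I/O for the 2.5 matching tuples.
print(inl_cost(outer_pages=500, tuples_per_page=80, probe_io=1.2, fetch_io=1))
# Reserves as inner, unclustered index: up to 2.5 I/Os for the matching tuples.
print(inl_cost(outer_pages=500, tuples_per_page=80, probe_io=1.2, fetch_io=2.5))
```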


Join: Sort-Merge ($R \bowtie_{i=j} S$)

Sort R and S on the join column, then scan them to do a "merge" (on join col.), and output result tuples.
Advance scan of R until current R-tuple >= current S tuple, then advance scan of S until current S-tuple >= current R tuple; do this until current R tuple = current S tuple.
At this point, all R tuples with same value in Ri (current R group) and all S tuples with same value in Sj (current S group) match; output <r, s> for all pairs of such tuples.
Then resume scanning R and S.
R is scanned once; each S group is scanned once per matching R tuple. (Multiple scans of an S group are likely to find needed pages in buffer.)
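
The merge phase described above can be sketched in memory as follows. The function name is hypothetical and Python's built-in sort stands in for the external sort of each relation; the point is the advancing of the two scans and the re-scan of the current S group for each matching R tuple.

```python
def sort_merge_join(R, S, i, j):
    """Join R and S on R[i] == S[j] by sorting both and merging."""
    R = sorted(R, key=lambda r: r[i])
    S = sorted(S, key=lambda s: s[j])
    result = []
    ri = si = 0
    while ri < len(R) and si < len(S):
        if R[ri][i] < S[si][j]:
            ri += 1                      # advance scan of R
        elif R[ri][i] > S[si][j]:
            si += 1                      # advance scan of S
        else:
            key = R[ri][i]
            # Find the end of the current S group.
            s_end = si
            while s_end < len(S) and S[s_end][j] == key:
                s_end += 1
            # Each R tuple in the current R group scans the S group once.
            while ri < len(R) and R[ri][i] == key:
                for k in range(si, s_end):
                    result.append((R[ri], S[k]))
                ri += 1
            si = s_end                   # resume scanning after the S group
    return result

sailors = [(22, "dustin"), (28, "yuppy"), (31, "lubber"), (44, "guppy"), (58, "rusty")]
reserves = [(28, 103), (28, 103), (31, 101), (31, 102), (31, 101), (58, 103)]
print(sort_merge_join(sailors, reserves, 0, 0))
```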

Example of Sort-Merge Join


Sailors:
sid  sname   rating  age
22   dustin  7       45.0
28   yuppy   9       35.0
31   lubber  8       55.5
44   guppy   5       35.0
58   rusty   10      35.0

Reserves:
sid  bid  day       rname
28   103  12/4/96   guppy
28   103  11/3/96   yuppy
31   101  10/10/96  dustin
31   102  10/12/96  lubber
31   101  10/11/96  lubber
58   103  11/12/96  dustin

Cost: M log M + N log N + (M+N)


The cost of scanning, M+N, could be M*N (very unlikely!)
With 35, 100 or 300 buffer pages, both Reserves and
Sailors can be sorted in 2 passes; total join cost: 7500.
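
The 7500 figure can be reproduced by counting sort passes and adding the merge scan. The sketch below assumes the general external merge sort (pass 0 builds runs of B pages, each later pass merges B-1 runs), which is the variant that gives 2 passes for these buffer sizes; the function names are hypothetical.

```python
def external_sort_passes(n_pages, buffers):
    """Passes of external merge sort: pass 0 makes B-page runs, then (B-1)-way merges."""
    runs = -(-n_pages // buffers)            # ceiling division
    passes = 1
    while runs > 1:
        runs = -(-runs // (buffers - 1))
        passes += 1
    return passes

def sort_merge_join_cost(m_pages, n_pages, buffers):
    """Sort both inputs (read + write every page per pass), then one merging scan."""
    sort_m = 2 * m_pages * external_sort_passes(m_pages, buffers)
    sort_n = 2 * n_pages * external_sort_passes(n_pages, buffers)
    return sort_m + sort_n + (m_pages + n_pages)

for b in (35, 100, 300):
    print(b, sort_merge_join_cost(1000, 500, b))
```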

Highlights of System R Optimizer


Impact:
Most widely used currently; works well for < 10 joins.
Cost estimation: Approximate art at best.
Statistics, maintained in system catalogs, used to
estimate cost of operations and result sizes.
Considers combination of CPU and I/O costs.
Plan Space: Too large, must be pruned.
Only the space of left-deep plans is considered.
Left-deep plans allow output of each operator to be pipelined
into the next operator without storing it in a temporary relation.

Cartesian products avoided.

Cost Estimation
For each plan considered, must estimate cost:
Must estimate cost of each operation in plan tree.
Depends on input cardinalities.
We've already discussed how to estimate the cost of operations (sequential scan, index scan, joins, etc.)
Must also estimate size of result for each operation in tree!
Use information about the input relations.
For selections and joins, assume independence of predicates.

Size Estimation and Reduction Factors

Consider a query block:
SELECT attribute list
FROM relation list
WHERE term1 AND ... AND termk
Maximum # tuples in result is the product of the cardinalities of relations in the FROM clause.
Reduction factor (RF) associated with each term reflects the impact of the term in reducing result size. Result cardinality = Max # tuples * product of all RFs.
Implicit assumption that terms are independent!
Term col=value has RF 1/NKeys(I), given index I on col
Term col1=col2 has RF 1/MAX(NKeys(I1), NKeys(I2))
Term col>value has RF (High(I)-value)/(High(I)-Low(I))
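
The three reduction-factor rules translate directly into code. The sketch below uses exact fractions to avoid rounding; the function names are hypothetical, and the catalog values in the example (40,000 distinct sid values in both sid indexes, 100 distinct bid values, ratings ranging from 1 to 10) are assumptions chosen to be in line with the Sailors/Reserves examples, not statistics given in the text.

```python
from fractions import Fraction

def rf_col_eq_value(nkeys):
    """RF for col = value, given an index on col with nkeys distinct keys."""
    return Fraction(1, nkeys)

def rf_col_eq_col(nkeys1, nkeys2):
    """RF for col1 = col2, given indexes on both columns."""
    return Fraction(1, max(nkeys1, nkeys2))

def rf_col_gt_value(value, low, high):
    """RF for col > value, given low/high key values of a tree index on col."""
    return Fraction(high - value, high - low)

def result_cardinality(cardinalities, reduction_factors):
    """Max # tuples (product of cardinalities) times the product of all RFs."""
    size = Fraction(1)
    for card in cardinalities:
        size *= card
    for rf in reduction_factors:
        size *= rf
    return size

# Reserves (100,000 tuples) joined with Sailors (40,000 tuples) on sid, with bid = 100.
rfs = [rf_col_eq_col(40_000, 40_000), rf_col_eq_value(100)]
print(result_cardinality([100_000, 40_000], rfs))
print(rf_col_gt_value(value=5, low=1, high=10))
```

Multiplying the reduction factors together is exactly the independence assumption noted above; correlated predicates would make the estimate too small.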

Motivating Example

SELECT S.sname
FROM Reserves R, Sailors S
WHERE R.sid=S.sid AND R.bid=100 AND S.rating>5

RA Tree:
$\pi_{sname}$
  $\sigma_{bid=100 \wedge rating>5}$
    $\bowtie_{sid=sid}$
      Reserves
      Sailors

Plan:
$\pi_{sname}$ (On-the-fly)
  $\sigma_{bid=100 \wedge rating>5}$ (On-the-fly)
    $\bowtie_{sid=sid}$ (Simple Nested Loops)
      Reserves
      Sailors

Cost: 500+500*1000 I/Os
By no means the worst plan!
Misses several opportunities: selections could have been "pushed" earlier, no use is made of any available indexes, etc.
Goal of optimization: To find more efficient plans that compute the same answer.

Alternative Plans 1 (No Indexes)

Plan:
$\pi_{sname}$
  $\bowtie_{sid=sid}$ (Sort-Merge Join)
    $\sigma_{bid=100}$ (Scan; write to temp T1)
      Reserves
    $\sigma_{rating>5}$ (Scan; write to temp T2)
      Sailors

Main difference: push selects.
With 5 buffers, cost of plan:
Scan Reserves (1000) + write temp T1 (10 pages, if we have 100 boats, uniform distribution).
Scan Sailors (500) + write temp T2 (250 pages, if we have 10 ratings).
Sort T1 (2*2*10), sort T2 (2*3*250), merge (10+250)
Total: 3560 page I/Os.
If we used BNL join, join cost = 10+4*250, total cost = 2770.
If we "push" projections, T1 has only sid, T2 only sid and sname: T1 fits in 3 pages, cost of BNL drops to under 250 pages, total < 2000.
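
The totals for this plan are sums of the listed components, which the sketch below recomputes. The function names are hypothetical. The sort pass counts (2 for T1, 3 for T2) are taken as stated in the plan's cost breakdown rather than derived, and the block nested loops (BNL) cost assumes B-2 buffer pages hold each block of the outer, which gives the 4 scans of T2 used above.

```python
def bnl_join_cost(outer_pages, inner_pages, buffers):
    """Block nested loops: read the outer once, scan the inner once per outer block."""
    block_pages = buffers - 2                     # one page for inner input, one for output
    n_blocks = -(-outer_pages // block_pages)     # ceiling division
    return outer_pages + n_blocks * inner_pages

def plan1_costs(buffers=5, t1_sort_passes=2, t2_sort_passes=3):
    t1 = 1000 // 100      # bid=100 keeps 1/100 of Reserves: 10 pages
    t2 = 500 // 2         # rating>5 keeps half of Sailors: 250 pages
    selections = (1000 + t1) + (500 + t2)         # scan inputs, write temps
    sort_merge = 2 * t1_sort_passes * t1 + 2 * t2_sort_passes * t2 + (t1 + t2)
    bnl = bnl_join_cost(t1, t2, buffers)
    return selections + sort_merge, selections + bnl

sort_merge_total, bnl_total = plan1_costs()
print(sort_merge_total, bnl_total)
```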

Alternative Plans 2 (With Indexes)

Plan:
$\pi_{sname}$ (On-the-fly)
  $\sigma_{rating>5}$ (On-the-fly)
    $\bowtie_{sid=sid}$ (Index Nested Loops, with pipelining)
      $\sigma_{bid=100}$ (Use hash index; do not write result to temp)
        Reserves
      Sailors

With clustered index on bid of Reserves, we get 100,000/100 = 1000 tuples on 1000/100 = 10 pages.
INL with pipelining (outer is not materialized).
Projecting out unnecessary fields from outer doesn't help.
Join column sid is a key for Sailors.
At most one matching tuple, unclustered index on sid OK.
Decision not to push rating>5 before the join is based on availability of sid index on Sailors.
Cost: Selection of Reserves tuples (10 I/Os); for each, must get matching Sailors tuple (1000*1.2); total 1210 I/Os.
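
As a quick check of the 1210 figure, the cost is the clustered selection on Reserves plus one index probe on Sailors per selected tuple. The function and parameter names below are hypothetical; the 1.2 I/Os per probe is the per-tuple figure used in the cost line above.

```python
def plan2_cost(reserves_tuples=100_000, reserves_pages=1000, n_boats=100, probe_io=1.2):
    """Clustered-index selection on bid, then index nested loops into Sailors."""
    selected_tuples = reserves_tuples // n_boats   # 1000 tuples with bid = 100
    selection_io = reserves_pages // n_boats       # stored on 10 consecutive pages
    return selection_io + round(selected_tuples * probe_io)

print(plan2_cost())
```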

Summary
There are several alternative evaluation algorithms for each relational operator.
A query is evaluated by converting it to a tree of operators and evaluating the operators in the tree.
Must understand query optimization in order to fully understand the performance impact of a given database design (relations, indexes) on a workload (set of queries).
Two parts to optimizing a query:
Consider a set of alternative plans.
Must prune search space; typically, left-deep plans only.
Must estimate cost of each plan that is considered.
Must estimate size of result and cost for each plan node.
Key issues: Statistics, indexes, operator implementations.
