Query-Optimization
Today’s Session:
DBMS Internals - Part IX
Query Optimization
Announcements:
PS4 is due on April 15
P3 is due on April 18
DBMS Layers
[Figure: DBMS layers. Queries; Query Optimization and Execution; Relational Operators; Files and Access Methods; Buffer Management; Disk Space Management; DB. Alongside: Transaction Manager, Lock Manager, Recovery Manager.]
Outline
A Brief Primer on Query Optimization
Evaluating Query Plans
Enumerating Plans
Cost-Based Query Sub-System
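[Figure: the cost-based query sub-system. An example query (Select * From Blah B Where B.blah = blah) is shown entering the sub-system; the components shown are the Query Parser, the Query Optimizer, the Query Plan Evaluator, and the Schema and Statistics.]
Usually there is a heuristics-based rewriting step before the cost-based steps.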
Query Optimization Steps
Step 1: Queries are parsed into internal forms
(e.g., parse trees)
Catalog Manager: The Schema
What kind of information do we store in the schema?
Information about tables (e.g., table names and
integrity constraints) and attributes (e.g., attribute
names and types)
Information about indices (e.g., index structures)
Information about users
An SQL block can be thought of as an algebra expression containing:
A cross-product of all relations in the FROM clause
Selections in the WHERE clause
Projections in the SELECT clause
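The three parts listed above can be assembled mechanically. The following is a minimal sketch, not a real parser: the function name `canonical_form` and the textual `project[...]`/`select[...]` notation are illustrative assumptions, and the example query is pieced together from the plan annotations used elsewhere in these slides (Reserves, Sailors, bid=100, rating > 5, sid=sid, sname).

```python
def canonical_form(select_list, from_list, where_conditions):
    """Return the canonical algebra expression for one SQL block as a string."""
    # Cross-product of all relations in the FROM clause.
    expr = " x ".join(from_list)
    # Selections from the WHERE clause, combined into one conjunctive selection.
    if where_conditions:
        expr = f"select[{' AND '.join(where_conditions)}]({expr})"
    # Projections from the SELECT clause are applied last.
    return f"project[{', '.join(select_list)}]({expr})"


expr = canonical_form(
    select_list=["S.sname"],
    from_list=["Reserves R", "Sailors S"],
    where_conditions=["R.sid = S.sid", "R.bid = 100", "S.rating > 5"],
)
print(expr)
```

The output nests the operators in the same order as the list: cross-product innermost, then selection, then projection.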
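[Figure: the query's algebra tree in canonical form (projection over selection over STUDENT and TAKES), and the same tree annotated with the choices left to the optimizer: join algorithm (hash join; merge join; nested loops) and access method (index; seq scan).]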
Enumerating Plans
Query Evaluation Plans
A query evaluation plan (or simply a plan) consists of an
extended relational algebra tree (or simply a tree)
The cost of the join is 1000 + 1000 * 500 = 501,000 I/Os (assuming a page-oriented simple nested loops join)
The selection and projection are done on-the-fly; hence, they do not incur additional I/Os
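The join cost above follows the usual formula for a page-oriented nested loops join: scan the outer relation once, and scan the inner relation once per outer page. A small sketch of that arithmetic, where the function name is an illustrative choice and the page counts (1000 for Reserves, 500 for Sailors) are the ones used in these slides:

```python
def page_nested_loops_cost(outer_pages, inner_pages):
    """I/O cost of a page-oriented simple nested loops join.

    The outer relation is scanned once; the inner relation is scanned
    once for every page of the outer relation.
    """
    return outer_pages + outer_pages * inner_pages


RESERVES_PAGES = 1000  # outer relation
SAILORS_PAGES = 500    # inner relation

cost = page_nested_loops_cost(RESERVES_PAGES, SAILORS_PAGES)
print(f"{cost:,} I/Os")  # 501,000 I/Os
```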
Pushing Selections
How can we reduce the cost of a join?
By reducing the sizes of the input relations!
[Figure: two plans for the same query. Left: Reserves and Sailors are read with file scans and joined on sid=sid using simple nested loops; the selections bid=100 and rating > 5 and the projection of sname are done on-the-fly. Right: the selection bid=100 is applied to Reserves (scan; write to temp T1) and the selection rating > 5 is applied to Sailors (scan; write to temp T2); T1 and T2 are joined on sid=sid using a sort-merge join; the projection of sname is done on-the-fly.]
The I/O Cost of the New Q Plan
What is the I/O cost of the following evaluation plan?
[Figure: selection bid=100 on Reserves (scan; write to temp T1) and selection rating > 5 on Sailors (scan; write to temp T2), joined on sid=sid with a sort-merge join; projection of sname on-the-fly.]
Cost of Scanning Reserves = 1000 I/Os
Cost of Writing T1 = 10 I/Os (explained later)
Cost of Scanning Sailors = 500 I/Os
Cost of Writing T2 = 250 I/Os (explained later)
Merge Cost = 10 + 250 = 260 I/Os
The projection of sname is done on-the-fly and thus does not incur additional I/Os
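The component costs above can be added up in a short sketch. One caveat: the slide text lists the scan, write, and merge costs but not the cost of sorting T1 and T2. The sketch below fills that in under an explicit assumption of 5 buffer pages with a standard external merge sort; that buffer count does not appear in the slides and is chosen because it reproduces the 4060 I/Os total quoted for this plan.

```python
def external_sort_cost(pages, buffers):
    """I/O cost of external merge sort: every pass reads and writes each page."""
    if buffers < 3:
        raise ValueError("need at least 3 buffer pages to merge")
    runs = -(-pages // buffers)           # pass 0: ceil(pages / buffers) sorted runs
    passes = 1
    while runs > 1:                       # each merge pass is (buffers - 1)-way
        runs = -(-runs // (buffers - 1))
        passes += 1
    return 2 * pages * passes


BUFFERS = 5  # assumption, not stated in the slides

scan_and_write = (1000 + 10) + (500 + 250)   # Reserves -> T1, Sailors -> T2
sort = external_sort_cost(10, BUFFERS) + external_sort_cost(250, BUFFERS)
merge = 10 + 250                             # one pass over sorted T1 and T2

total = scan_and_write + sort + merge
print(total)  # 4060
```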
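“Push” projections ahead of the join
[Figure: the plan with pushed selections: selection bid=100 on Reserves (scan; write to temp T1) and selection rating > 5 on Sailors (scan; write to temp T2), joined on sid=sid, with the projection of sname on top.]
The cost after applying this heuristic can become 2000 I/Os (as opposed to 4060 I/Os with only pushing the selection)!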
Using Indexes
What if indexes are available on Reserves and Sailors?
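[Figure: plan using indexes. The selection bid=100 on Reserves uses the hash index and does not write its result to temp; its output is joined with Sailors (hash index on sid); the projection of sname is done on-the-fly.]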
With clustered index on bid of Reserves, we get 100,000/100 = 1000 tuples (assuming 100
boats and uniform distribution of reservations across boats)
Since the index is clustered, the 1000 tuples appear consecutively within the same
bucket; thus # of pages = 1000/100 = 10 pages
For each selected Reserves tuple, we can retrieve matching Sailors tuples using the hash
index on the sid field
Selected Reserves tuples need not be materialized and the join result can be pipelined!
For each tuple in the join result, we apply rating > 5 and the projection of sname on-the-fly
Cost of retrieving the selected Reserves tuples (clustered hash index on bid) = 10 I/Os
Cost of the join on sid=sid (index nested loops, with pipelining) = 1.2 I/Os for each of the 1000 Reserves tuples; hence, 1200 I/Os
Total Cost = 501,000 I/Os (simple nested loops plan)
Total Cost = 4060 I/Os (selections pushed, sort-merge join)
Total Cost = 1210 I/Os (indexes, index nested loops join)
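The arithmetic behind the index-based plan can be reproduced as a quick check. All figures come from these slides; the constant names are illustrative, and the 100 tuples per page is read off the slide's own division (1000 tuples / 100 = 10 pages).

```python
RESERVES_TUPLES = 100_000
BOATS = 100               # uniform distribution of reservations across boats
TUPLES_PER_PAGE = 100     # implied by "1000/100 = 10 pages"
IO_PER_PROBE = 1.2        # cost of one hash-index lookup on Sailors.sid

# Selection bid=100 through the clustered index on Reserves.
matching_tuples = RESERVES_TUPLES // BOATS             # 1000 tuples
selection_cost = matching_tuples // TUPLES_PER_PAGE    # 10 I/Os

# Index nested loops: one probe of the Sailors index per selected tuple.
join_cost = round(matching_tuples * IO_PER_PROBE)      # 1200 I/Os

total = selection_cost + join_cost

plans = {
    "simple nested loops": 501_000,
    "pushed selections + sort-merge": 4060,
    "indexes + index nested loops": total,
}
for name, cost in plans.items():
    print(f"{name}: {cost:,} I/Os")
```

The selection and projection applied to the join result add nothing to the total because they are done on-the-fly.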
But, How Can we Ensure Correctness?
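[Figure: the canonical form (projection of sname over the selection bid=100 and rating > 5 over the join on sid=sid) next to a rewritten plan (projection of sname over the selection rating > 5 over the join on sid=sid, with the selection bid=100 applied below the join and Sailors as the other join input).]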
Enumerating Plans
Relational Algebra Equivalences
A relational query optimizer uses relational algebra
equivalences to identify many equivalent expressions for a
given query
$\pi_{a_1}(R) \equiv \pi_{a_1}(\ldots(\pi_{a_n}(R))\ldots)$
It follows:
$R \bowtie (S \bowtie T) \equiv (T \bowtie R) \bowtie S$
This says that regardless of the order in which the relations are
considered, the final result is the same!
This order-independence is fundamental to how a query optimizer
generates alternative query evaluation plans
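The order-independence described above can be checked on a toy example. This is only an illustrative sketch: relations are modelled as lists of dicts, `natural_join` and `as_set` are hypothetical helper names, and the rows are made-up data rather than anything from the slides.

```python
def natural_join(r, s):
    """Natural join of two relations given as lists of dicts."""
    out = []
    for t in r:
        for u in s:
            shared = t.keys() & u.keys()
            # Tuples match when they agree on every shared attribute;
            # with no shared attributes this degenerates to a cross-product.
            if all(t[a] == u[a] for a in shared):
                out.append({**t, **u})
    return out


def as_set(rel):
    """Order-insensitive view of a relation, for comparing results."""
    return {frozenset(t.items()) for t in rel}


# Made-up toy relations.
R = [{"sid": 1, "bid": 100}, {"sid": 2, "bid": 101}]
S = [{"sid": 1, "sname": "a"}, {"sid": 2, "sname": "b"}]
T = [{"bid": 100, "color": "red"}, {"bid": 101, "color": "green"}]

left = natural_join(R, natural_join(S, T))    # R join (S join T)
right = natural_join(natural_join(T, R), S)   # (T join R) join S
print(as_set(left) == as_set(right))  # True
```

Both orders return the same two tuples, but the intermediate results differ in size: S and T share no attribute, so joining them first produces a cross-product. That difference in intermediate sizes is exactly what the optimizer weighs when it compares equivalent plans.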
RA Equivalences: Selections, Projections, Cross
Products and Joins
Selections with Projections:
$\pi_a(\sigma_c(R)) \equiv \sigma_c(\pi_a(R))$
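A selection and a projection can be swapped as long as the selection condition only refers to attributes that the projection keeps. A minimal sketch of that check, using hypothetical helpers `select` and `project` and made-up Sailors rows:

```python
def select(rel, pred):
    """Keep the tuples that satisfy the predicate."""
    return [t for t in rel if pred(t)]


def project(rel, attrs):
    """Keep only the given attributes, eliminating duplicate tuples."""
    seen, out = set(), []
    for t in rel:
        key = tuple(t[a] for a in attrs)
        if key not in seen:
            seen.add(key)
            out.append(dict(zip(attrs, key)))
    return out


# Made-up toy rows.
sailors = [
    {"sid": 1, "sname": "x", "rating": 7},
    {"sid": 2, "sname": "y", "rating": 3},
    {"sid": 3, "sname": "z", "rating": 10},
]

attrs = ("sname", "rating")                    # the projection list a
def high_rating(t):                            # the condition c
    return t["rating"] > 5

lhs = project(select(sailors, high_rating), attrs)   # project after select
rhs = select(project(sailors, attrs), high_rating)   # select after project
print(lhs == rhs)  # True
```

If the projection dropped rating, the right-hand side could no longer evaluate the condition, which is why the restriction on the condition's attributes matters.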
[Figure: DBMS layers diagram, with the Query Optimization and Execution layer marked "Continue…".]