Query-Optimization

This document discusses query optimization in database management systems, detailing the steps involved in optimizing queries, including parsing, transforming, enumerating plans, and estimating costs. It emphasizes the importance of the catalog manager for retrieving necessary information and outlines techniques such as pushing selections and projections to reduce join costs. Additionally, it highlights the use of indexes to enhance query performance and the impact of different evaluation plans on I/O costs.

Uploaded by

Faiz Dosani

Database Applications (15-415)

DBMS Internals- Part IX


Lecture 22, April 12, 2020
Mohammad Hammoud
Today…
• Last Session:
  - DBMS Internals- Part VIII
  - Algorithms for Relational Operations (Cont’d)

• Today’s Session:
  - DBMS Internals- Part IX
  - Query Optimization

• Announcements:
  - PS4 is due on April 15
  - P3 is due on April 18
DBMS Layers
(Figure: the DBMS layers, from top to bottom)
• Queries
• Query Optimization and Execution
• Relational Operators
• Files and Access Methods
• Buffer Management
• Disk Space Management
• DB
(Alongside the layers: Transaction Manager, Lock Manager, Recovery Manager)
Outline
• A Brief Primer on Query Optimization (current)
• Evaluating Query Plans
• Relational Algebra Equivalences
• Estimating Plan Costs
• Enumerating Plans
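Cost-Based Query Sub-System
(Figure: the cost-based query sub-system)
• Queries (e.g., Select * From Blah B Where B.blah = blah) enter the Query Parser
• The Query Optimizer consists of a Plan Generator and a Plan Cost Estimator
• The Query Optimizer consults the Catalog Manager (Schema and Statistics)
• The Query Plan Evaluator executes the chosen plan
• Usually there is a heuristics-based rewriting step before the cost-based steps.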
Query Optimization Steps
• Step 1: Queries are parsed into internal forms (e.g., parse trees)

• Step 2: Internal forms are transformed into ‘canonical forms’ (syntactic query optimization)

• Step 3: A subset of alternative plans is enumerated

• Step 4: Costs for alternative plans are estimated

• Step 5: The query evaluation plan with the least estimated cost is picked
Required Information to Evaluate Queries
• To estimate the costs of query plans, the query optimizer examines the system catalog and retrieves:
  - Information about the types and lengths of fields
  - Statistics about the referenced relations
  - Access paths (indexes) available for relations

• In particular, the Schema and Statistics components in the Catalog Manager are inspected to find a good enough query evaluation plan
Catalog Manager: The Schema
• What kind of information do we store in the Schema?
  - Information about tables (e.g., table names and integrity constraints) and attributes (e.g., attribute names and types)
  - Information about indices (e.g., index structures)
  - Information about users

• Where do we store such information?
  - In tables; hence, it can be queried like any other table
  - For example: Attribute_Cat (attr_name: string, rel_name: string, type: string, position: integer)
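Because the catalog is itself stored in tables, it can be queried with ordinary SQL. The following is a minimal sketch using Python's built-in sqlite3 module; the table layout follows the Attribute_Cat example, while the rows, positions, and the helper name attributes_of are assumptions made up for illustration.

```python
import sqlite3

# Hypothetical catalog table modeled on the Attribute_Cat example.
# The rows below are illustrative, not taken from any real DBMS catalog.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Attribute_Cat ("
    "attr_name TEXT, rel_name TEXT, type TEXT, position INTEGER)"
)
conn.executemany(
    "INSERT INTO Attribute_Cat VALUES (?, ?, ?, ?)",
    [
        ("sid", "Sailors", "integer", 1),
        ("sname", "Sailors", "string", 2),
        ("rating", "Sailors", "integer", 3),
        ("sid", "Reserves", "integer", 1),
        ("bid", "Reserves", "integer", 2),
    ],
)


def attributes_of(rel_name):
    """Return the attribute names of a relation, ordered by position."""
    rows = conn.execute(
        "SELECT attr_name FROM Attribute_Cat "
        "WHERE rel_name = ? ORDER BY position",
        (rel_name,),
    )
    return [name for (name,) in rows]


print(attributes_of("Sailors"))
```

An optimizer would issue lookups of this kind to learn field types and positions before costing a plan.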
Catalog Manager: Statistics
• What would you store in the Statistics component?
  - NTuples(R): # records for table R
  - NPages(R): # pages for R
  - NKeys(I): # distinct key values for index I
  - INPages(I): # pages for index I
  - IHeight(I): # levels for I
  - ILow(I), IHigh(I): range of values for I
  - ...

• Such statistics are important for estimating plan costs and result sizes (to be discussed shortly!)
SQL Blocks
• SQL queries are optimized by decomposing them into a collection of smaller units, called blocks

• A block is an SQL query with:
  - No nesting
  - Exactly one SELECT clause and one FROM clause
  - At most one WHERE clause, one GROUP BY clause and one HAVING clause

• A typical relational query optimizer concentrates on optimizing a single block at a time
Translating SQL Queries Into Relational Algebra Trees
Query:
    select name
    from STUDENT, TAKES
    where c-id=‘415’ and
          STUDENT.ssn=TAKES.ssn

RA tree (root first):
  $\pi$
    $\sigma$
      $\times$
        STUDENT
        TAKES

• An SQL block can be thought of as an algebra expression containing:
  - A cross-product of all relations in the FROM clause
  - Selections in the WHERE clause
  - Projections in the SELECT clause

• Remaining operators can be carried out on the result of such an SQL block
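Translating SQL Queries Into Relational Algebra Trees (Cont’d)
Left tree (root first):
  $\pi$
    $\sigma$
      $\times$
        STUDENT
        TAKES

Right tree (root first):
  $\pi$
    $\bowtie$
      STUDENT
      $\sigma$
        TAKES

(The slide labels the step between the two trees "Canonical form".)

Still the same result!

How can this be guaranteed?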


Translating SQL Queries Into Relational Algebra Trees (Cont’d)
OBSERVATION: try to perform selections and projections early!


Translating SQL Queries Into Relational Algebra Trees (Cont’d)
Each operator in the tree over STUDENT and TAKES can be implemented in several ways:
• Join: hash join; merge join; nested loops
• Selection: index; seq scan

How to evaluate a query plan (as opposed to evaluating an operator)?
Outline
• A Brief Primer on Query Optimization
• Evaluating Query Plans (current)
• Relational Algebra Equivalences
• Estimating Plan Costs
• Enumerating Plans
Query Evaluation Plans
• A query evaluation plan (or simply a plan) consists of an extended relational algebra tree (or simply a tree)

• A plan tree carries annotations at each node indicating:
  - The access methods to use for each relation
  - The implementation method to use for each operator

• Consider the following SQL query Q:

    SELECT S.sname
    FROM Reserves R, Sailors S
    WHERE R.sid=S.sid AND
          R.bid=100 AND
          S.rating>5

  What is the corresponding RA of Q?
Query Evaluation Plans (Cont’d)
• Q can be expressed in relational algebra as follows:
  $\pi_{sname}(\sigma_{bid=100 \wedge rating>5}(Reserves \bowtie_{sid=sid} Sailors))$

An RA Tree (root first):
  $\pi_{sname}$
    $\sigma_{bid=100 \wedge rating>5}$
      $\bowtie_{sid=sid}$
        Reserves
        Sailors

An Extended RA Tree (root first):
  $\pi_{sname}$ (On-the-fly)
    $\sigma_{bid=100 \wedge rating>5}$ (On-the-fly)
      $\bowtie_{sid=sid}$ (Simple Nested Loops)
        Reserves (File Scan)
        Sailors (File Scan)
Pipelining vs. Materializing
• When a query is composed of several operators, the result of one operator can sometimes be pipelined to another operator

Plan (root first):
  $\pi_{sname}$ (On-the-fly)
    $\sigma_{bid=100 \wedge rating>5}$ (On-the-fly)
      $\bowtie_{sid=sid}$ (Simple Nested Loops)
        Reserves (File Scan)
        Sailors (File Scan)

• Pipelining: the output of the join is pipelined into the selection and projection that follow, which are applied on-the-fly
• Materializing: in contrast, a temporary table can be materialized to hold the intermediate result of the join and read back by the selection operation!

Pipelining can significantly save I/O cost!
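The difference between the two strategies can be sketched with Python generators. This is only an in-memory analogy under stated assumptions: the toy tuples are made up, and a Python list stands in for the temporary table that a DBMS would write to disk.

```python
# Toy relations; the tuples are made up for illustration only.
reserves = [{"sid": 1, "bid": 100}, {"sid": 2, "bid": 101}, {"sid": 3, "bid": 100}]
sailors = [
    {"sid": 1, "sname": "Ann", "rating": 7},
    {"sid": 2, "sname": "Bob", "rating": 9},
    {"sid": 3, "sname": "Cat", "rating": 3},
]


def nested_loops_join(outer, inner):
    """Yield joined tuples one at a time; nothing is stored."""
    for r in outer:
        for s in inner:
            if r["sid"] == s["sid"]:
                yield {**r, **s}


def select(tuples):
    """Apply bid=100 AND rating>5 to each incoming tuple as it arrives."""
    for t in tuples:
        if t["bid"] == 100 and t["rating"] > 5:
            yield t


def project_sname(tuples):
    """Keep only sname from each incoming tuple."""
    for t in tuples:
        yield t["sname"]


# Pipelined: each join tuple flows straight into the selection and projection.
pipelined = list(project_sname(select(nested_loops_join(reserves, sailors))))

# Materialized: the whole join result is stored first (the list stands in for
# a temporary table), then read back by the selection.
temp = list(nested_loops_join(reserves, sailors))
materialized = list(project_sname(select(temp)))

print(pipelined, materialized)
```

Both variants return the same answer; the materialized one additionally holds the full join result, which in a DBMS means extra writes and reads.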


The I/O Cost of the Q Plan
• What is the I/O cost of the following evaluation plan?

Plan (root first):
  $\pi_{sname}$ (On-the-fly)
    $\sigma_{bid=100 \wedge rating>5}$ (On-the-fly)
      $\bowtie_{sid=sid}$ (Simple Nested Loops)
        Reserves (File Scan)
        Sailors (File Scan)

• With Reserves occupying 1000 pages and Sailors 500 pages, the cost of the join is 1000 + 1000 * 500 = 501,000 I/Os (assuming page-oriented Simple NL join)
• The selection and projection are done on-the-fly; hence, they do not incur additional I/Os
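The join cost above can be reproduced with a small helper. This is a sketch; the function name page_nl_join_cost is an assumption, and the page counts are the ones used in the lecture.

```python
def page_nl_join_cost(outer_pages: int, inner_pages: int) -> int:
    """I/Os of a page-oriented simple nested loops join:
    scan the outer once, and scan the inner once per outer page."""
    return outer_pages + outer_pages * inner_pages


RESERVES_PAGES = 1000  # outer relation
SAILORS_PAGES = 500    # inner relation

cost = page_nl_join_cost(RESERVES_PAGES, SAILORS_PAGES)
print(cost)  # 501000
```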
Pushing Selections
• How can we reduce the cost of a join?
  - By reducing the sizes of the input relations!

• bid=100 involves only bid in Reserves; hence, “push” it ahead of the join!
• rating > 5 involves only rating in Sailors; hence, “push” it ahead of the join!

Original plan (root first):
  $\pi_{sname}$ (On-the-fly)
    $\sigma_{bid=100 \wedge rating>5}$ (On-the-fly)
      $\bowtie_{sid=sid}$ (Simple Nested Loops)
        Reserves (File Scan)
        Sailors (File Scan)

New plan (root first):
  $\pi_{sname}$ (On-the-fly)
    $\bowtie_{sid=sid}$ (Sort-Merge Join)
      $\sigma_{bid=100}$ (Scan; write to temp T1)
        Reserves
      $\sigma_{rating>5}$ (Scan; write to temp T2)
        Sailors
The I/O Cost of the New Q Plan
• What is the I/O cost of the following evaluation plan?

Plan (root first):
  $\pi_{sname}$ (On-the-fly)
    $\bowtie_{sid=sid}$ (Sort-Merge Join)
      $\sigma_{bid=100}$ (Scan; write to temp T1)
        Reserves
      $\sigma_{rating>5}$ (Scan; write to temp T2)
        Sailors

• Cost of Scanning Reserves = 1000 I/Os
• Cost of Writing T1 = 10 I/Os (assuming 100 boats and uniform distribution of reservations across boats; details later)
• Cost of Scanning Sailors = 500 I/Os
• Cost of Writing T2 = 250 I/Os (assuming 10 ratings and uniform distribution over ratings; details later)
• Cost to sort T1 = 2×2×10 = 40 I/Os (assuming B = 5)
• Cost to sort T2 = 2×4×250 = 2000 I/Os (assuming B = 5)
• Merge Cost = 10 + 250 = 260 I/Os
• The projection is done on-the-fly; thus, it does not incur additional I/Os

Total Cost = 1000 + 10 + 500 + 250 + 40 + 2000 + 260 = 4060 I/Os
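The total can be recomputed step by step. The sketch below assumes the standard external merge sort pass count (one run-forming pass with B buffers, then (B-1)-way merge passes), which matches the 2 and 4 passes used above; all function names are hypothetical.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling of a / b."""
    return -(-a // b)


def sort_passes(pages: int, buffers: int) -> int:
    """Passes of an external merge sort: one pass forms sorted runs of
    `buffers` pages, then each merge pass combines (buffers - 1) runs."""
    runs = ceil_div(pages, buffers)
    passes = 1
    while runs > 1:
        runs = ceil_div(runs, buffers - 1)
        passes += 1
    return passes


def sort_cost(pages: int, buffers: int) -> int:
    """Each pass reads and writes every page once."""
    return 2 * sort_passes(pages, buffers) * pages


def sort_merge_plan_cost(r_pages, s_pages, t1_pages, t2_pages, buffers):
    """Scan both relations, write both temps, sort both temps, merge them."""
    scan_and_write = r_pages + t1_pages + s_pages + t2_pages
    sorting = sort_cost(t1_pages, buffers) + sort_cost(t2_pages, buffers)
    merging = t1_pages + t2_pages
    return scan_and_write + sorting + merging


total = sort_merge_plan_cost(1000, 500, 10, 250, buffers=5)
print(total)  # 4060
```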


The I/O Costs of the Two Q Plans
• Plan 1 (Simple Nested Loops join over file scans of Reserves and Sailors; selection and projection on-the-fly): Total Cost = 501,000 I/Os
• Plan 2 (selections pushed ahead of the join and written to temps T1 and T2; Sort-Merge Join; projection on-the-fly): Total Cost = 4060 I/Os


Pushing Projections
• How can we reduce the cost of a join?
  - By reducing the sizes of the input relations!

• Consider (again) the following plan:

Plan (root first):
  $\pi_{sname}$
    $\bowtie_{sid=sid}$
      $\sigma_{bid=100}$ (Scan; write to temp T1)
        Reserves
      $\sigma_{rating>5}$ (Scan; write to temp T2)
        Sailors

• What are the attributes required from T1 and T2?
  - sid from T1
  - sid and sname from T2

• Hence, as we scan Reserves and Sailors we can also remove unwanted columns (i.e., “push” the projections ahead of the join)!

• The cost after applying this heuristic can become 2000 I/Os (as opposed to 4060 I/Os with only pushing the selection)!
Using Indexes
• What if indexes are available on Reserves and Sailors?

Plan (root first):
  $\pi_{sname}$ (On-the-fly)
    $\sigma_{rating>5}$ (On-the-fly)
      $\bowtie_{sid=sid}$ (Index Nested Loops, with pipelining)
        $\sigma_{bid=100}$ (Use hash index; do not write result to temp)
          Reserves (Clustered hash index on bid)
        Sailors (Hash index on sid)

• With a clustered index on bid of Reserves, we get 100,000/100 = 1000 tuples (assuming 100 boats and uniform distribution of reservations across boats)
• Since the index is clustered, the 1000 tuples appear consecutively within the same bucket; thus # of pages = 1000/100 = 10 pages
• For each selected Reserves tuple, we can retrieve matching Sailors tuples using the hash index on the sid field
• Selected Reserves tuples need not be materialized and the join result can be pipelined!
• For each tuple in the join result, we apply rating > 5 and the projection of sname on-the-fly
• Is it necessary to project out unwanted columns? NO, since selection results are NOT materialized
• Does the hash index on sid need to be clustered? NO, since there is at most 1 matching Sailors tuple per Reserves tuple! Why?
• Cost of probing the hash index on sid = 1.2 I/Os (if A(1)) or 2.2 (if A(2))
• Why not push the selection rating > 5 ahead of the join? It would require a scan on Sailors!


The I/O Cost of the New Q Plan
• What is the I/O cost of the following evaluation plan?

Plan (root first):
  $\pi_{sname}$ (On-the-fly)
    $\sigma_{rating>5}$ (On-the-fly)
      $\bowtie_{sid=sid}$ (Index Nested Loops, with pipelining)
        $\sigma_{bid=100}$ (Use hash index; do not write result to temp)
          Reserves (Clustered hash index on bid)
        Sailors (Hash index on sid)

• Cost of the selection on Reserves = 10 I/Os
• Cost of probing Sailors = 1.2 I/Os for each of the 1000 selected Reserves tuples; hence, 1200 I/Os

Total Cost = 10 + 1200 = 1210 I/Os
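The arithmetic behind this plan can be sketched from catalog-style statistics. The constant names are hypothetical; the values (100,000 Reserves tuples, 100 boats, 100 tuples per page, 1.2 I/Os per probe) are the ones used in the lecture, and the uniform-distribution assumption is the same.

```python
import math

NTUPLES_RESERVES = 100_000  # NTuples(Reserves)
NKEYS_BID = 100             # distinct bid values (100 boats)
TUPLES_PER_PAGE = 100       # Reserves tuples per page
PROBE_COST = 1.2            # I/Os per hash-index probe on Sailors.sid

# Uniform distribution: an equality selection keeps NTuples / NKeys tuples.
selected_tuples = NTUPLES_RESERVES // NKEYS_BID

# Clustered index: the matching tuples are stored consecutively.
selection_pages = math.ceil(selected_tuples / TUPLES_PER_PAGE)

# Index nested loops: one probe of the Sailors index per selected tuple.
probe_total = selected_tuples * PROBE_COST

total_cost = selection_pages + probe_total
print(selected_tuples, selection_pages, round(total_cost))
```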


Comparing I/O Costs: Recap
• Plan 1 (Simple Nested Loops join over file scans of Reserves and Sailors; selection and projection on-the-fly): Total Cost = 501,000 I/Os
• Plan 2 (selections pushed ahead of the join and written to temps T1 and T2; Sort-Merge Join; projection on-the-fly): Total Cost = 4060 I/Os
• Plan 3 (hash index for bid=100 on Reserves; Index Nested Loops with pipelining using the hash index on sid of Sailors; rating > 5 and projection on-the-fly): Total Cost = 1210 I/Os
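Picking the plan with the least estimated cost then amounts to taking a minimum over the enumerated plans. A minimal sketch, with hypothetical plan labels and the three totals from the recap:

```python
# Estimated I/O costs of the three plans for Q (labels are illustrative).
plan_costs = {
    "simple nested loops, selections on-the-fly": 501_000,
    "selections pushed, sort-merge join": 4_060,
    "index nested loops with pipelining": 1_210,
}

# The optimizer keeps the plan whose estimated cost is smallest.
best_plan = min(plan_costs, key=plan_costs.get)
print(best_plan, plan_costs[best_plan])
```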
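But, How Can we Ensure Correctness?
Left tree (root first):
  $\pi_{sname}$
    $\sigma_{bid=100 \wedge rating>5}$
      $\bowtie_{sid=sid}$
        Reserves
        Sailors

Right tree (root first):
  $\pi_{sname}$
    $\sigma_{rating>5}$
      $\bowtie_{sid=sid}$
        $\sigma_{bid=100}$
          Reserves
        Sailors

(The slide labels the step between the two trees "Canonical form".)

Still the same result!

How can this be guaranteed?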


Outline
• A Brief Primer on Query Optimization
• Evaluating Query Plans
• Relational Algebra Equivalences (current)
• Estimating Plan Costs
• Enumerating Plans
Relational Algebra Equivalences
• A relational query optimizer uses relational algebra equivalences to identify many equivalent expressions for a given query

• Two relational algebra expressions over the same set of input relations are said to be equivalent if they produce the same result on all relations’ instances

• Relational algebra equivalences allow us to:
  - Push selections and projections ahead of joins
  - Combine selections and cross-products into joins
  - Choose different join orders
RA Equivalences: Selections
• Two important equivalences involve selections:
  1. Cascading of Selections:
     $\sigma_{c_1 \wedge \dots \wedge c_n}(R) \equiv \sigma_{c_1}(\dots(\sigma_{c_n}(R))\dots)$
     Allows us to combine several selections into one selection
     OR: Allows us to replace a selection with several smaller selections

  2. Commutation of Selections:
     $\sigma_{c_1}(\sigma_{c_2}(R)) \equiv \sigma_{c_2}(\sigma_{c_1}(R))$
     Allows us to test selection conditions in either order
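Both selection equivalences can be checked on a small instance. This sketch only tests one made-up relation R and two made-up conditions; it illustrates the rules and does not prove them.

```python
# A made-up relation R with attributes a and b.
R = [
    {"a": 1, "b": 10},
    {"a": 2, "b": 20},
    {"a": 3, "b": 10},
    {"a": 4, "b": 30},
]


def select(pred, rel):
    """sigma_pred(rel): keep the tuples that satisfy pred."""
    return [t for t in rel if pred(t)]


def c1(t):
    return t["a"] > 1


def c2(t):
    return t["b"] == 10


combined = select(lambda t: c1(t) and c2(t), R)  # sigma_{c1 AND c2}(R)
cascaded = select(c1, select(c2, R))             # sigma_{c1}(sigma_{c2}(R))
swapped = select(c2, select(c1, R))              # sigma_{c2}(sigma_{c1}(R))

print(combined == cascaded == swapped)
```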
RA Equivalences: Projections
• One important equivalence involves projections:
  - Cascading of Projections:
    $\pi_{a_1}(R) \equiv \pi_{a_1}(\dots(\pi_{a_n}(R))\dots)$
    where each $a_i$ is a set of attributes of R and $a_i \subseteq a_{i+1}$

This says that successively eliminating columns from a relation is equivalent to simply eliminating all but the columns retained by the final projection!
RA Equivalences: Cross-Products and Joins
• Two important equivalences involve cross-products and joins:
  1. Commutative Operations:
     $(R \times S) \equiv (S \times R)$
     $(R \bowtie S) \equiv (S \bowtie R)$
     This allows us to choose which relation to be the inner and which to be the outer!

  2. Associative Operations:
     $R \times (S \times T) \equiv (R \times S) \times T$
     $R \bowtie (S \bowtie T) \equiv (R \bowtie S) \bowtie T$
     It follows:
     $R \bowtie (S \bowtie T) \equiv (T \bowtie R) \bowtie S$
     This says that regardless of the order in which the relations are considered, the final result is the same!
     This order-independence is fundamental to how a query optimizer generates alternative query evaluation plans
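Commutativity and associativity of the natural join can be checked on toy data. In this sketch the relations R, S, T and their tuples are made up, and results are compared as sets because join order changes the order in which tuples are produced, not which tuples are produced.

```python
# Made-up relations: R(sid, bid), S(sid, sname), T(bid, bname).
R = [{"sid": 1, "bid": 100}, {"sid": 2, "bid": 101}]
S = [{"sid": 1, "sname": "Ann"}, {"sid": 2, "sname": "Bob"}]
T = [{"bid": 100, "bname": "Red"}, {"bid": 101, "bname": "Blue"}]


def natural_join(r, s):
    """Join on all attributes the two tuples share (cross-product if none)."""
    out = []
    for t in r:
        for u in s:
            common = t.keys() & u.keys()
            if all(t[k] == u[k] for k in common):
                out.append({**t, **u})
    return out


def as_set(rel):
    """Order-insensitive view of a relation, for comparing results."""
    return {frozenset(t.items()) for t in rel}


commutes = as_set(natural_join(R, S)) == as_set(natural_join(S, R))

left = as_set(natural_join(R, natural_join(S, T)))      # R join (S join T)
right = as_set(natural_join(natural_join(R, S), T))     # (R join S) join T
reordered = as_set(natural_join(natural_join(T, R), S)) # (T join R) join S

print(commutes, left == right == reordered)
```

Note that S and T share no attribute, so joining them first yields a cross-product; the final result is still the same, only the intermediate size differs.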
RA Equivalences: Selections, Projections, Cross-Products and Joins
• Selections with Projections:
  $\pi_a(\sigma_c(R)) \equiv \sigma_c(\pi_a(R))$
  This says we can commute a selection with a projection if the selection involves only attributes retained by the projection!

• Selections with Cross-Products:
  $R \bowtie_c S \equiv \sigma_c(R \times S)$
  This says we can combine a selection with a cross-product to form a join (as per the definition of a join)!
RA Equivalences: Selections, Projections, Cross-Products and Joins
• Selections with Cross-Products and with Joins:
  $\sigma_c(R \times S) \equiv \sigma_c(R) \times S$
  $\sigma_c(R \bowtie S) \equiv \sigma_c(R) \bowtie S$
  Caveat: The attributes mentioned in c must appear only in R and NOT in S

This says we can commute a selection with a cross-product or a join if the selection condition involves only attributes of one of the arguments to the cross-product or join!
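Pushing a selection ahead of a join can be checked on the running example. In this sketch the toy Reserves and Sailors tuples are made up, and the condition bid=100 mentions only an attribute of Reserves, as the caveat requires.

```python
# Made-up instances of Reserves(sid, bid) and Sailors(sid, sname, rating).
reserves = [{"sid": 1, "bid": 100}, {"sid": 2, "bid": 101}, {"sid": 3, "bid": 100}]
sailors = [
    {"sid": 1, "sname": "Ann", "rating": 7},
    {"sid": 2, "sname": "Bob", "rating": 9},
    {"sid": 3, "sname": "Cat", "rating": 3},
]


def select(pred, rel):
    """sigma_pred(rel)."""
    return [t for t in rel if pred(t)]


def join_on_sid(r, s):
    """Equijoin of r and s on sid."""
    return [{**t, **u} for t in r for u in s if t["sid"] == u["sid"]]


def bid_is_100(t):
    return t["bid"] == 100  # involves only an attribute of Reserves


late = select(bid_is_100, join_on_sid(reserves, sailors))   # sigma_c(R join S)
early = join_on_sid(select(bid_is_100, reserves), sailors)  # sigma_c(R) join S

print(late == early, len(select(bid_is_100, reserves)))
```

The results agree, but the second form feeds only 2 of the 3 Reserves tuples into the join, which is where the I/O saving comes from.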
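RA Equivalences: Selections, Projections, Cross-Products and Joins
• Selections with Cross-Products and with Joins (Cont’d):
  $\sigma_c(R \times S) \equiv \sigma_{c_1 \wedge c_2 \wedge c_3}(R \times S)$
  $\equiv \sigma_{c_1}(\sigma_{c_2}(\sigma_{c_3}(R \times S)))$
  $\equiv \sigma_{c_1}(\sigma_{c_2}(R) \times \sigma_{c_3}(S))$
  Here $c_1$ involves attributes of both R and S, $c_2$ involves only attributes of R, and $c_3$ involves only attributes of S

This says we can push part of the selection condition c ahead of the cross-product!
This applies to joins as well!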
RA Equivalences: Selections, Projections, Cross-Products and Joins
• Projections with Cross-Products and with Joins:
  $\pi_a(R \times S) \equiv \pi_{a_1}(R) \times \pi_{a_2}(S)$
  $\pi_a(R \bowtie_c S) \equiv \pi_{a_1}(R) \bowtie_c \pi_{a_2}(S)$
  $\pi_a(R \bowtie_c S) \equiv \pi_a(\pi_{a_1}(R) \bowtie_c \pi_{a_2}(S))$

Intuitively, we need to retain only those attributes of R and S that are either mentioned in the join condition c or included in the set of attributes a retained by the projection
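The third equivalence can be checked on the running example, keeping sid from Reserves and sid and sname from Sailors before the join. In this sketch the toy tuples (including the day attribute) are made up, and projections keep duplicates, which does not affect the comparison here.

```python
# Made-up instances; only sid is needed from Reserves, sid and sname from Sailors.
reserves = [
    {"sid": 1, "bid": 100, "day": "Mon"},
    {"sid": 2, "bid": 101, "day": "Tue"},
]
sailors = [
    {"sid": 1, "sname": "Ann", "rating": 7},
    {"sid": 2, "sname": "Bob", "rating": 9},
]


def project(attrs, rel):
    """pi_attrs(rel), without duplicate elimination."""
    return [{a: t[a] for a in attrs} for t in rel]


def join_on_sid(r, s):
    """Equijoin of r and s on sid."""
    return [{**t, **u} for t in r for u in s if t["sid"] == u["sid"]]


# pi_a(R join_c S) with a = {sname}
late = project(["sname"], join_on_sid(reserves, sailors))

# pi_a(pi_{a1}(R) join_c pi_{a2}(S)) with a1 = {sid}, a2 = {sid, sname}
early = project(
    ["sname"],
    join_on_sid(project(["sid"], reserves), project(["sid", "sname"], sailors)),
)

print(late == early)
```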
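How to Estimate the Cost of Plans?
• Now that correctness is ensured, how can the DBMS estimate the costs of various plans?

(Figure: two equivalent trees for Q, related by the "Canonical form" step. One applies $\sigma_{bid=100 \wedge rating>5}$ above the join of Reserves and Sailors on sid; the other applies $\sigma_{bid=100}$ to Reserves below the join and $\sigma_{rating>5}$ above it. Both have $\pi_{sname}$ at the root.)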


Next Class
• Continue with Query Optimization and Execution (the top layer in the DBMS layers figure)
