13 QP1
Relational Operations
R&G - Chapters 12 and 14
Introduction
Today's topic: QUERY PROCESSING
Some database operations are EXPENSIVE
Can greatly improve performance by being smart
e.g., can speed up 1,000,000x over the naive approach
Main weapons are:
1. clever implementation techniques for operators
2. exploiting relational algebra equivalences
3. using statistics and cost models to choose among these.
predicates
Then, as needed:
tables
Apply any projections and output expressions
Apply duplicate elimination and/or ORDER BY
[Figure: the query "Select * From Blah B Where B.blah = blah" enters the Query Parser. Components shown: Query Parser; Query Optimizer (Plan Generator, Plan Cost Estimator); Catalog Manager (Schema, Statistics).]
Relational Operations
We will consider how to implement:
Selection ( σ ) Select a subset of rows.
Projection ( π ) Remove unwanted columns.
Join ( ⋈ ) Combine two relations.
Set-difference ( - ) Tuples in reln. 1, but not in reln. 2.
Union ( ∪ ) Tuples in reln. 1 or in reln. 2.
Q: What about Intersection?
Simple Selections
σ_{R.attr op value}(R)
SELECT *
FROM Reserves R
WHERE R.rname < 'C%'
Our options
If no appropriate index exists:
Must scan the whole relation
cost = [R]. For Reserves, [R] = 1000 pages, so cost = 1000 I/Os.
Our options
With index on selection attribute:
1. Use index to find qualifying data entries
2. Retrieve corresponding data records
Total cost = cost of step 1 + cost of step 2
For Reserves, if selectivity = 10% (100 pages, 10000 tuples):
If clustered index, cost is a little over 100 I/Os;
If unclustered, could be up to 10000 I/Os! unless
[Figure: CLUSTERED vs. UNCLUSTERED index. In both panels, index entries direct the search for data entries (index file), and the data entries point to data records (data file).]
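The I/O estimates above can be reproduced with a little arithmetic. The following is a rough sketch rather than a definitive cost formula: the function names are hypothetical, the cost of searching the index itself (step 1) is ignored, and the unclustered case assumes the worst case of one I/O per qualifying tuple. The total of 100,000 tuples is inferred from "10% = 10000 tuples".

```python
def scan_cost(num_pages):
    """No index: read every page of the relation once."""
    return num_pages

def index_fetch_cost(num_pages, num_tuples, selectivity, clustered):
    """Approximate I/Os to retrieve the qualifying records (step 2 only)."""
    if clustered:
        # Qualifying records are stored together, so read only their pages.
        return round(selectivity * num_pages)
    # Worst case: each qualifying tuple lives on a different page.
    return round(selectivity * num_tuples)

# Reserves: 1000 pages; 10% selectivity = 100 pages = 10000 tuples.
print(scan_cost(1000))
print(index_fetch_cost(1000, 100_000, 0.10, clustered=True))
print(index_fetch_cost(1000, 100_000, 0.10, clustered=False))
```

With these numbers the unclustered index is ten times worse than simply scanning the file, which is why the choice of access path matters.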
2 Approaches to General Selections
Approach I:
1. Find the cheapest access path
2. retrieve tuples using it
3. Apply any remaining terms that don't match the index
2 Approaches to General Selections
Approach II: use 2 or more matching indexes.
1. From each index, get set of rids
2. Compute intersection of rid sets
3. Retrieve records for rids in intersection
4. Apply any remaining terms
EXAMPLE: day<8/9/94 AND bid=5 AND sid=3
Suppose we have an index on day, and another index on sid.
Get rids of records satisfying day<8/9/94.
Also get rids of records satisfying sid=3.
Find intersection, then retrieve records, then check bid=5.
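Approach II can be sketched in a few lines. This is a toy illustration under stated assumptions: the table contents are made up, `rids_matching` is a hypothetical stand-in for a real index lookup, and the date 8/9/94 is read as August 9, 1994.

```python
from datetime import date

# Toy Reserves table mapping rid -> record (rows invented for illustration).
reserves = {
    0: {"sid": 3, "bid": 5, "day": date(1994, 7, 1)},
    1: {"sid": 3, "bid": 7, "day": date(1994, 7, 15)},
    2: {"sid": 3, "bid": 5, "day": date(1994, 9, 1)},
    3: {"sid": 4, "bid": 5, "day": date(1994, 6, 30)},
}

def rids_matching(table, predicate):
    """Stand-in for an index lookup: the set of rids whose record satisfies predicate."""
    return {rid for rid, rec in table.items() if predicate(rec)}

def select_with_two_indexes(table):
    # Step 1: get a set of rids from each index.
    day_rids = rids_matching(table, lambda r: r["day"] < date(1994, 8, 9))
    sid_rids = rids_matching(table, lambda r: r["sid"] == 3)
    # Step 2: intersect the rid sets.
    candidates = day_rids & sid_rids
    # Steps 3 and 4: retrieve the records, then check the remaining term bid=5.
    return sorted(rid for rid in candidates if table[rid]["bid"] == 5)

print(select_with_two_indexes(reserves))  # [0]
```

Only records whose rids survive the intersection are fetched, so the data file is touched far less often than it would be using either index alone.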
Projection
SELECT DISTINCT R.sid, R.bid
FROM Reserves R
Use sorting!!
1. Scan R, extract only the needed attributes
2. Sort the resulting set
3. Remove adjacent duplicates
Cost:
Reserves with size ratio 0.25 = 250 pages.
With 20 buffer pages can sort in 2 passes, so:
1000 + 250 + 2 * 2 * 250 + 250 = 2500 I/Os
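The three steps and the cost formula above can be sketched as follows. This is a minimal illustration, not a real external sort: the sort runs in memory, the function names are hypothetical, and the sample rows reuse (sid, bid) values from the Reserves example later in these slides.

```python
def project_distinct(rows, attrs):
    """Sort-based projection with duplicate elimination."""
    # Step 1: scan the input and keep only the needed attributes.
    projected = [tuple(row[a] for a in attrs) for row in rows]
    # Step 2: sort the result (in-memory stand-in for an external sort).
    projected.sort()
    # Step 3: remove adjacent duplicates.
    result = []
    for t in projected:
        if not result or result[-1] != t:
            result.append(t)
    return result

def sort_projection_cost(input_pages, size_ratio, passes):
    """I/O cost: scan + write projected file + sort passes + final dedup read."""
    temp_pages = int(input_pages * size_ratio)
    scan_and_write = input_pages + temp_pages
    sort = 2 * passes * temp_pages   # each pass reads and writes the file
    dedup_read = temp_pages
    return scan_and_write + sort + dedup_read

rows = [
    {"sid": 28, "bid": 103}, {"sid": 28, "bid": 103}, {"sid": 31, "bid": 101},
    {"sid": 31, "bid": 102}, {"sid": 31, "bid": 101}, {"sid": 58, "bid": 103},
]
print(project_distinct(rows, ["sid", "bid"]))
print(sort_projection_cost(1000, 0.25, passes=2))  # 2500
```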
Projection -- improved
Modify the external sort algorithm:
Modify Pass 0 to eliminate unwanted fields.
Modify Passes 1+ to eliminate duplicates.
Cost:
Reserves with size ratio 0.25 = 250 pages.
With 20 buffer pages can sort in 2 passes, so:
1. Read 1000 pages
2. Write 250 (in runs of 40 pages each)
3. Read and merge runs
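Adding up the three steps above gives the total for the improved version. The sketch below assumes that step 3 reads each of the 250 run pages exactly once and that the final output is not written back to disk. The variable names are illustrative.

```python
import math

input_pages = 1000    # step 1: read Reserves
temp_pages = 250      # step 2: write projected tuples as sorted runs
run_length = 40       # pages per run, as stated above
buffer_pages = 20

num_runs = math.ceil(temp_pages / run_length)
# With B buffer pages, one merge pass can combine up to B-1 runs.
one_merge_pass_suffices = num_runs <= buffer_pages - 1

total_io = input_pages + temp_pages + temp_pages   # read + write + read-and-merge
print(num_runs, one_merge_pass_suffices, total_io)  # 7 True 1500
```

Under these assumptions the modified sort needs 1500 I/Os, compared with 2500 for the unmodified approach.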
[Figure: plan tree with HeapScan at the bottom, feeding Sort (name, gpa), feeding Distinct (name, gpa)]
Iterators
Example: Sort
init():
generate the sorted runs on disk (passes 0 to n-1)
Allocate runs[] array and fill in with disk pointers.
Initialize numberOfRuns
Allocate nextRID array and initialize to first RID of each run
next():
nextRID array tells us where we're up to in each run
find the next tuple to return based on nextRID array
advance the corresponding nextRID entry
return tuple (or EOF -- End of Fun -- if no tuples remain)
close():
deallocate the runs and nextRID arrays
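The init/next/close outline above can be sketched as a small class. This is a toy under stated assumptions: runs are kept in memory rather than on disk, only pass 0 is performed before the final merge, `next_pos` plays the role of the nextRID array, and the class and helper names are hypothetical.

```python
class SortIterator:
    """Toy Sort operator following the init/next/close interface."""

    def __init__(self, child_rows, run_size=3):
        self.child_rows = child_rows
        self.run_size = run_size   # tuples per sorted run

    def init(self):
        rows = list(self.child_rows)
        # Generate the sorted runs (a real system would write them to disk).
        self.runs = [sorted(rows[i:i + self.run_size])
                     for i in range(0, len(rows), self.run_size)]
        self.number_of_runs = len(self.runs)
        # next_pos[i] is the next unread position in run i.
        self.next_pos = [0] * self.number_of_runs

    def next(self):
        best = None
        for i in range(self.number_of_runs):
            pos = self.next_pos[i]
            if pos >= len(self.runs[i]):
                continue   # run i is exhausted
            if best is None or self.runs[i][pos] < self.runs[best][self.next_pos[best]]:
                best = i
        if best is None:
            return None    # EOF: no tuples remain
        row = self.runs[best][self.next_pos[best]]
        self.next_pos[best] += 1   # advance the corresponding entry
        return row

    def close(self):
        # Deallocate the runs and position arrays.
        self.runs = None
        self.next_pos = None


def drain(it):
    """Pull every tuple out of an iterator, as a parent operator would."""
    it.init()
    out = []
    while True:
        row = it.next()
        if row is None:
            break
        out.append(row)
    it.close()
    return out

print(drain(SortIterator([58, 22, 44, 31, 28, 31, 22])))
```

Because every operator exposes the same three methods, a parent such as Distinct can consume a Sort without knowing how the sort is implemented.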
Postgres Version
src/backend/executor/nodeSort.c
ExecInitSort (init)
ExecSort (next)
ExecEndSort (close)
The encapsulation stuff is hardwired into the Postgres C code
Postgres predates even C++!
See src/backend/executor/execProcNode.c for the code that dispatches the methods explicitly!
Joins
SELECT *
FROM Reserves R1, Sailors S1
WHERE R1.sid = S1.sid
[Figure: join buffers: a block of R tuples (B-2 pages), an input buffer for S, and an output buffer producing the join result]
[Figure: a tuple ri probes the INDEX on S; the data entries lead to matching data records sj in S]
Sort-Merge Join 1.
Example:
SELECT *
FROM Reserves R1, Sailors S1
WHERE R1.sid = S1.sid

Sailors S1 (sid): 22, 28, 31, 44, 58

Reserves R1:
sid  bid  day       rname
28   103  12/4/96   guppy
28   103  11/3/96   yuppy
31   101  10/10/96  dustin
31   102  10/12/96  lubber
31   101  10/11/96  lubber
58   103  11/12/96  dustin
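The merge step of a sort-merge join on this example can be sketched as below. This is a minimal in-memory illustration, not a disk-based implementation: the function name is hypothetical, both inputs are sorted with Python's built-in sort, and the Sailors rows carry only the sid values listed above.

```python
def sort_merge_join(left, right, attr):
    """Equijoin of two lists of dicts on attr; returns (left_row, right_row) pairs."""
    left = sorted(left, key=lambda r: r[attr])
    right = sorted(right, key=lambda r: r[attr])
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][attr], right[j][attr]
        if lk < rk:
            i += 1            # left value too small, advance left
        elif lk > rk:
            j += 1            # right value too small, advance right
        else:
            # Find the end of the group of right rows with this join value.
            j_end = j
            while j_end < len(right) and right[j_end][attr] == lk:
                j_end += 1
            # Pair every left row in the group with every right row in the group.
            while i < len(left) and left[i][attr] == lk:
                for k in range(j, j_end):
                    out.append((left[i], right[k]))
                i += 1
            j = j_end
    return out

sailors = [{"sid": s} for s in (22, 28, 31, 44, 58)]
reserves = [
    {"sid": 28, "bid": 103, "day": "12/4/96", "rname": "guppy"},
    {"sid": 28, "bid": 103, "day": "11/3/96", "rname": "yuppy"},
    {"sid": 31, "bid": 101, "day": "10/10/96", "rname": "dustin"},
    {"sid": 31, "bid": 102, "day": "10/12/96", "rname": "lubber"},
    {"sid": 31, "bid": 101, "day": "10/11/96", "rname": "lubber"},
    {"sid": 58, "bid": 103, "day": "11/12/96", "rname": "dustin"},
]
result = sort_merge_join(sailors, reserves, "sid")
print([(s["sid"], r["rname"]) for s, r in result])
```

Sailors 22 and 44 have no reservations and drop out of the result, while the sailors with sid 28, 31, and 58 contribute two, three, and one result tuples respectively.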
Other Considerations
1. An important refinement:
Do the join during the final merging pass of the sort!
If we have enough memory, can do:
1. Read R and write out sorted runs
2. Read S and write out sorted runs
3. Merge R-runs and S-runs, and find R ⋈ S matches
Cost = 3*[R] + 3*[S] (each relation is read once and written once to form runs, then read once more in the merge)
Q: how much memory is enough? (will answer next time)
Summary
A virtue of relational DBMSs:
queries are composed of a few basic operators
The implementation of these operators can be carefully tuned
Many alternative implementation techniques for each operator
No universally superior technique for most operators.
Must consider available alternatives
This is called query optimization -- we will study this topic soon!