L10-Query Evaluation

Database Query Evaluation

Uploaded by

Jason Wong

Evaluating Relational Operators
Background
• Data pages must be read into memory to be processed
• The memory area where a data page is stored for processing is called a buffer page (or simply a buffer)
• More available buffers can usually speed up processing
• To speed up query evaluation, a data file (i.e., a relation) often needs to be sorted on the search key
• Suppose a file contains M pages and B buffers are available in memory. Then (see Section 13.3: External Merge Sort)
– Sorting the file requires $2M(\lceil \log_{B-1} \lceil M/B \rceil \rceil + 1)$ page accesses
– Example: a file contains 1000 pages and 20 buffer pages are available, so sorting the file requires $2 \times 1000 \times (\lceil \log_{19} \lceil 1000/20 \rceil \rceil + 1) = 2 \times 1000 \times (2+1) = 6000$ page accesses
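The sorting cost formula can be checked with a small Python sketch. The function name `external_sort_cost` is illustrative only; the code counts passes with integer arithmetic instead of evaluating the logarithm, which is assumed to be equivalent to the ceiling expressions in the formula.

```python
def external_sort_cost(m_pages: int, b_buffers: int) -> int:
    """Estimated page accesses for an external merge sort of m_pages pages."""
    if b_buffers < 3:
        raise ValueError("external merge sort needs at least 3 buffer pages")
    # Pass 0 reads B pages at a time and produces ceil(M/B) sorted runs.
    runs = (m_pages + b_buffers - 1) // b_buffers
    merge_passes = 0
    # Each later pass merges up to B-1 runs into one.
    while runs > 1:
        runs = (runs + b_buffers - 2) // (b_buffers - 1)
        merge_passes += 1
    # Every pass reads and writes each page once.
    return 2 * m_pages * (merge_passes + 1)


print(external_sort_cost(1000, 20))  # 6000, as in the example above
```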

Access Paths
• An access path is a method of retrieving tuples:
– A file scan, or an index that matches a selection (in the query)
• Selectivity of an access path:
– The number of pages retrieved (index pages + data pages)
• For a single operation, there may be different access paths with different selectivities
• Most selective access path for an operation:
– The one that retrieves the fewest pages
• Using the most selective access path minimizes the cost of the operation

Examples and Cost Calculations
• Given the following schema:
– Sailors(sid: integer, sname: string, rating: integer, age: real)
– Reserves(sid: integer, bid: integer, day: dates, rname: string)
• rname is the name of the person who made the reservation
• sid is the id of the person on whose behalf the reservation was made
• Thus, rname and sid may refer to different persons
• Assume the following sizes:
– Sailors: 500 pages, 80 tuples/page, 50 bytes/tuple
– Reserves: 1000 pages, 100 tuples/page, 40 bytes/tuple
• We consider only I/O cost: the number of pages that are read or written
• If alternatives involve the same cost for writing pages, we ignore that cost when comparing them
A Motivating Example
• Consider the following simple query
SELECT *
FROM Reserves R
WHERE R.rname = 'Joe'
• If no index is created on rname, the most selective access path (also the only one) is:
1. Scan the entire Reserves relation, reading its pages one after another
2. For each page scanned, check the condition on each tuple
3. If the condition is met, add the tuple to the result
– Cost: 1000 I/Os
• If an index is available on rname, we can do it much faster

Selection: $\sigma_{R.attr\ op\ value}(R)$
• No index, unsorted (R contains M pages, same below)
– Most selective access path is a file scan
– Cost: M I/Os. Let R be Reserves. Then M = 1000
• No index, R sorted on R.attr. Most selective access path:
– Binary search on R.attr for value to locate the first tuple that satisfies the condition
– Start at this position and scan the relation until the condition becomes false
– Cost of binary search: $\log_2 M$
– Cost of scanning after binary search: depends on the number of tuples satisfying the condition; it can vary from zero to M pages
– Example: let R be Reserves and the condition be 'Chan' < rname < 'Lin'. Assume 10% of the tuples satisfy the condition.
• Cost of binary search: $\log_2 1000 \approx 10$
• Cost of scanning after binary search: 100
• Total cost: 110
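A minimal sketch of the two no-index estimates follows. The function names and the `matching_pages` parameter are assumptions made for illustration; the binary search term is rounded up to a whole number of page reads.

```python
import math


def file_scan_cost(m_pages: int) -> int:
    """Unsorted file, no index: every page is read once."""
    return m_pages


def sorted_file_selection_cost(m_pages: int, matching_pages: int) -> int:
    """Sorted file, no index: binary search, then scan the matching pages."""
    binary_search = math.ceil(math.log2(m_pages))
    return binary_search + matching_pages


# Reserves has 1000 pages; 10% of them (100 pages) hold qualifying tuples.
print(file_scan_cost(1000))
print(sorted_file_selection_cost(1000, 100))
```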
Selection: $\sigma_{R.attr\ op\ value}(R)$
• B+ tree index on R.attr (assuming op is not ≠)
1. Search the tree to find the first data entry that points to a qualifying tuple of R
2. Scan the leaf pages to retrieve all the data entries in which the key value satisfies the selection condition (not needed for a clustered index)
3. For each data entry retrieved, follow the pointer to get the corresponding tuple of R
• Cost
– Step 1: height of the tree, usually 2 or 3 I/Os
– Step 2: depends on the number of such data entries
– Step 3: let N be the number of qualifying tuples and P be the number of tuples that can be stored in a page.
• Index is clustered: ≈ N/P + 1 (in reality, most likely N < P, so cost ≈ 2)
• Index is not clustered: N (since in the worst case, each qualifying tuple could be on a different page)

Example Sel.: B+ Tree Clustered Index
Assume the search key is sid. Consider the selection condition: 19 < sid < 29
[Figure: B+ tree with root key 17 and second-level keys 5, 13, 24, 30; the leaf level holds the data entries 2* 3* 5* 7* 8* 14* 16* 19* 20* 22* 24* 27* 29* 30* 33* 34* 38* 39*, which point into a data file clustered on sid]
• Step 1 (searching the tree): 3 pages
• Step 2 (need not scan the leaf nodes): 0 pages
• Step 3 (retrieving the qualifying tuples): 2 pages
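Example Sel.: B+ Tree Unclustered Index
Assume the search key is sid. Consider the selection condition: 19 < sid < 29
[Figure: B+ tree with root key 17 and second-level keys 5, 13, 24, 30; the leaf level holds the data entries 2* 3* 5* 7* 8* 14* 16* 19* 20* 22* 24* 27* 29* 30* 33* 34* 38* 39*, which point into a data file that is not ordered on sid]
• Searching the tree: 3 pages
• Scanning the leaf nodes: 2 pages
• Retrieving the qualifying tuples: 4 pages
• Total = 3 + 2 - 1 + 4 = 8 (the first leaf page is already read during the tree search, so it is counted only once)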
Sample DB Selection Cost for B+ Tree Index
• Consider the selection 'Chan' < rname < 'Lin', and assume 10% of the tuples satisfy the condition
– Search B+ tree: a few I/Os, say 3
– Clustered index:
• Scan the data file: 1000 × 10% = 100
– Unclustered index:
• Search B+ tree: 3
• Scan the leaf nodes: assume the total number of leaf pages is 1/10 of the number of data pages. The cost is 1000 × 1/10 × 1/10 = 10
• Retrieve the qualifying tuples:
– Total number of tuples in Reserves: 100,000
– Total number of qualifying tuples: 100,000 × 1/10 = 10,000
– Cost: 10,000 I/Os (since in the worst case we need to read a page for each tuple. Note that a page containing more than one qualifying tuple may be read repeatedly)
• Thus, an unclustered B+ tree index is not appropriate for range search
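The per-step numbers above can be reproduced with a short Python sketch. The function and parameter names are hypothetical, the selectivity is passed as an integer percentage to keep the arithmetic exact, and the unclustered case uses the worst-case assumption of one page read per qualifying tuple.

```python
def btree_range_selection_cost(data_pages, tuples_per_page, matching_percent,
                               clustered, tree_height=3,
                               data_pages_per_leaf_page=10):
    """Return the estimated I/Os of each step as a dict."""
    search = tree_height
    if clustered:
        # Qualifying tuples are stored contiguously in the data file.
        leaf_scan = 0
        data = data_pages * matching_percent // 100
    else:
        # Scan the matching share of the leaf pages, then fetch each tuple.
        leaf_pages = data_pages // data_pages_per_leaf_page
        leaf_scan = leaf_pages * matching_percent // 100
        data = data_pages * tuples_per_page * matching_percent // 100
    return {"search": search, "leaf_scan": leaf_scan, "data": data}


print(btree_range_selection_cost(1000, 100, 10, clustered=True))
print(btree_range_selection_cost(1000, 100, 10, clustered=False))
```

With the Reserves figures, the data-retrieval step grows from 100 page reads (clustered) to 10,000 (unclustered), which is the gap the slide points to.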
Selection: $\sigma_{R.attr\ op\ value}(R)$ (cont.)
• Hash index on R.attr (assuming op is =)
1. Calculate the hash value for value. (The hash value identifies the directory entry.)
2. Get the directory entry identified by the hash value
3. Retrieve the bucket page(s) pointed to by the directory entry
4. For each data entry in the bucket, retrieve the qualifying tuple
• Cost
– Step 1: 0
– Step 2: 1 if the directory does not fit in memory, 0 otherwise
– Step 3: typically 1.2 (recall that we may go through overflow pages)
– Step 4: if R.attr is a key for R, then 1; otherwise, it depends on the number of qualifying tuples and on whether or not the index is clustered

Example Selection: Hash Index Clustered
• Selection condition: age = 7
• h(7) = 7 mod 32 = 7 = $111_2$
• Cost
– Calculate h(7): 0
– Get directory entry: 0 (assume directory is in main memory)
– Get bucket page: 1
– Get qualifying tuples: 2
[Figure: hash index with directory entries 00, 01, 10, 11 and bucket pages 4* 12* 32* 48*, 1* 5* 21* 25*, 10*, and 3* 7* 19*, pointing into a data file with columns sid, sname, age]
Example Selection: Hash Index Unclustered
• Selection condition: age = 7
• h(7) = 7 mod 32 = 7 = $111_2$
• Cost
– Get directory entry: 0 (assume directory is in main memory)
– Get bucket page: 1
– Get qualifying tuples: 3
[Figure: hash index with directory entries 00, 01, 10, 11 and bucket pages 4* 12* 32* 48*, 1* 5* 21* 25*, 10*, and 3* 7* 19*, pointing into a data file with columns sid, sname, age]
Projection
• Consider the example:
SELECT DISTINCT R.sid, R.bid
FROM Reserves R
• General method
– Scan relation R and discard unwanted attributes
– Eliminate duplicates (this is an expensive operation)
• Projection based on sorting
1. Scan R and write sid and bid of each tuple to a temporary file T
2. Sort T on both sid and bid
3. Scan the sorted file, compare adjacent tuples, and discard duplicates
• Cost, assuming T has 250 pages and 20 buffers are available
– Step 1: 1000 + 250 = 1250 I/Os
– Step 2: 2 × 250 × 2 = 1000 I/Os (using the external merge sort formula from the Background slide)
– Step 3: 250 I/Os
– Total: 2500 I/Os
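The three cost terms can be recomputed as follows. This is only a sketch: `sort_based_projection_cost` and its helper are hypothetical names, and the size of the temporary file T (250 pages) is taken as an input rather than derived.

```python
def external_sort_cost(pages: int, buffers: int) -> int:
    """2 * pages * (number of passes) for an external merge sort."""
    runs = (pages + buffers - 1) // buffers
    passes = 1
    while runs > 1:
        runs = (runs + buffers - 2) // (buffers - 1)
        passes += 1
    return 2 * pages * passes


def sort_based_projection_cost(r_pages: int, t_pages: int, buffers: int):
    """Return (step 1, step 2, step 3) I/O estimates."""
    step1 = r_pages + t_pages                      # scan R, write T
    step2 = external_sort_cost(t_pages, buffers)   # sort T
    step3 = t_pages                                # scan sorted T
    return step1, step2, step3


steps = sort_based_projection_cost(1000, 250, 20)
print(steps, sum(steps))
```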

Example: $\pi_{sid,bid}(Reserves)$ based on sorting
[Figure: a sample Reserves instance (sid, bid, day, rname) passes through three steps: discard unwanted attributes to obtain (sid, bid) pairs, sort on sid and bid, and discard duplicates]
The Join Operation
• Example:
SELECT *
FROM Reserves R, Sailors S
WHERE R.sid = S.sid
(R has 1000 pages, 100 tuples/page; S has 500 pages, 80 tuples/page)
• Nested Loops Join
foreach tuple r in R do (R is called the outer relation)
  foreach tuple s in S do (S is called the inner relation)
    if r.sid = s.sid then add <r, s> to result
• Cost
– Scan R: 1000 I/Os
– S is scanned once for each tuple of R: 1000 × 100 × 500 I/Os
– Total: 1000 × 100 × 500 + 1000 = 50,001,000 I/Os
– If we switch R and S, the total is: 500 × 80 × 1000 + 500 = 40,000,500 I/Os
– If each I/O takes 10 ms, the total time is over 100 hours!
Join Operation (cont.)
• Refinement to Nested Loops Join: a page at a time
For each page p of R
  for each page q of S
    output all r ∈ p and s ∈ q such that r.sid = s.sid
• Cost
– Scan R: 1000 I/Os
– For each page of R, read 500 pages of S
– Total: 1000 + 1000 × 500 = 501,000 I/Os
– Amount of time: 1.4 hours
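The tuple-at-a-time and page-at-a-time estimates can be compared with a few lines of Python. The function names are illustrative, and `io_hours` assumes the 10 ms per I/O figure used on the slides.

```python
def simple_nested_loops_cost(outer_pages, outer_tuples_per_page, inner_pages):
    """Inner relation is scanned once per outer tuple."""
    return outer_pages + outer_pages * outer_tuples_per_page * inner_pages


def page_nested_loops_cost(outer_pages, inner_pages):
    """Inner relation is scanned once per outer page."""
    return outer_pages + outer_pages * inner_pages


def io_hours(io_count, ms_per_io=10):
    """Convert an I/O count to hours at a fixed cost per I/O."""
    return io_count * ms_per_io / 1000 / 3600


tuple_at_a_time = simple_nested_loops_cost(1000, 100, 500)
page_at_a_time = page_nested_loops_cost(1000, 500)
print(tuple_at_a_time, io_hours(tuple_at_a_time))
print(page_at_a_time, io_hours(page_at_a_time))
```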

• Block Nested Loops Join
– Suppose we have enough buffers to hold B+2 pages. Then read B pages of Reserves at a time (one page is reserved to read Sailors and the other page is reserved for output):
For each block P of Reserves
  for each page q of Sailors
    for each r ∈ P and s ∈ q such that r.sid = s.sid
      add <r, s> to the result
• Cost (assuming B = 100, so Reserves contains 10 blocks)
– Scan Reserves: 1000 I/Os
– For each block, scan Sailors, for 500 I/Os
– Total: 1000 + 10 × 500 = 6000
– If we choose Sailors to be the outer relation, the cost is: 500 + 5 × 1000 = 5500
• Note that the definition of B is a little bit different from that in your textbook.
• If the buffer is large enough to hold the smaller relation (+ 2 more pages), the cost is 1000 + 500 = 1500
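A small sketch of the block nested loops estimate follows. The function name is hypothetical, and `block_pages` is B in the sense used on this slide (the number of outer pages held at once, excluding the input and output pages).

```python
def block_nested_loops_cost(outer_pages: int, inner_pages: int,
                            block_pages: int) -> int:
    """Scan the outer once; scan the inner once per outer block."""
    blocks = -(-outer_pages // block_pages)  # ceiling division
    return outer_pages + blocks * inner_pages


print(block_nested_loops_cost(1000, 500, 100))  # Reserves as outer
print(block_nested_loops_cost(500, 1000, 100))  # Sailors as outer
print(block_nested_loops_cost(500, 1000, 500))  # smaller relation fits in memory
```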
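Diagrammatic View for Block Nested Loops Join
[Figure: relations Sailors and Reserves are read from disk into memory, which holds B pages for Reserves, one input buffer to scan Sailors, and one output buffer; the result is written from the output buffer to disk]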
• Index Nested Loops Join
– Assume an index on sid of the Sailors relation
For each r ∈ Reserves do
  for each s ∈ Sailors where r.sid = s.sid (use index)
    add <r, s> to the result
• Note: for each tuple in Reserves, we use the index to find the matching tuples in Sailors
• Cost, assuming a hash index (assume the directory is in main memory)
– Scan Reserves: 1000 I/Os
– For each tuple in Reserves, an average of 1.2 I/Os to get to the bucket page containing the matching Sailors data entry
– For each matching Sailors data entry, retrieve the Sailors tuple for 1 I/O (note: sid is the primary key of the Sailors relation)
– Each page of Reserves contains 100 tuples
– Total: 1000 + 100 × 1000 × (1 + 1.2) = 221,000 I/Os
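Index on sid of Sailors
[Figure: a Reserves tuple (sid, bid, day, rname) with sid = 16 is hashed with h(16) through the directory to a bucket containing the data entries 16* and 24*; the entry 16* points to the matching tuple in Sailors (sid, age, level, name)]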
• Now assume an index on sid of the Reserves relation. We use the following algorithm:
For each s ∈ Sailors do
  for each r ∈ Reserves where s.sid = r.sid
    add <r, s> to the result
• Cost, assuming a hash index (assume the directory is in main memory)
– Scan Sailors: 500 I/Os
– For each tuple in Sailors, an average of 1.2 I/Os to get to the bucket page containing the matching Reserves data entries
– For each matching Reserves data entry, retrieve the Reserves tuples. Estimation: 100,000 reservations for 40,000 sailors, so each sailor makes 2.5 reservations on average
• Clustered index: the 2.5 reservations are likely on the same page, so 40,000 × 1 I/Os. Total cost: 500 + 40,000 × 1.2 + 40,000 × 1 = 88,500 I/Os
• Unclustered index: the 2.5 reservations are not likely on the same page, so 40,000 × 2.5 I/Os. Total cost: 500 + 40,000 × 1.2 + 40,000 × 2.5 = 148,500 I/Os
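The index nested loops estimates, with either relation as the outer, follow one pattern: scan the outer, then pay a probe cost and a fetch cost per outer tuple. The sketch below uses hypothetical names, and the result is rounded only to suppress floating-point noise from the 1.2 average.

```python
def index_nested_loops_cost(outer_pages, outer_tuples, probe_ios, fetch_ios):
    """Scan the outer; for each outer tuple, probe the index and fetch matches."""
    total = outer_pages + outer_tuples * (probe_ios + fetch_ios)
    return round(total)  # rounding only removes floating-point noise


# Reserves as outer, hash index on Sailors.sid (sid is a key: 1 fetch).
print(index_nested_loops_cost(1000, 100_000, 1.2, 1))
# Sailors as outer, clustered hash index on Reserves.sid.
print(index_nested_loops_cost(500, 40_000, 1.2, 1))
# Sailors as outer, unclustered hash index on Reserves.sid.
print(index_nested_loops_cost(500, 40_000, 1.2, 2.5))
```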
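Unclustered Index on sid of Reserves
[Figure: a Sailors tuple (sid, age, level, name) with sid = 16 is hashed with h(16) through the directory to a bucket containing the data entries 16* and 24*; the 16* entries point to several matching tuples in Reserves (sid, bid, day, rname)]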
Idea for Sort-Merge Join ($R \bowtie_{i=j} S$)
• Sort R and S on the join column, then scan them to do a "merge" (on the join column), and output result tuples.
– Advance the scan of R until the current R tuple >= the current S tuple, then advance the scan of S until the current S tuple >= the current R tuple; do this until the current R tuple = the current S tuple.
– At this point, all R tuples with the same value in $R_i$ (current R group) and all S tuples with the same value in $S_j$ (current S group) match; output <r, s> for all pairs of such tuples.
– Then resume scanning R and S.
• R is scanned once; each S group is scanned once per matching R tuple. (Multiple scans of an S group are likely to find the needed pages in the buffer.)
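The merge step described above can be sketched in memory with Python lists standing in for the sorted files. This is an illustration under that assumption, not a disk-based implementation; `sort_merge_join` and the small sample instances are chosen for the example.

```python
def sort_merge_join(r, s, r_key, s_key):
    """Join two lists of tuples on r_key(r) == s_key(s)."""
    r = sorted(r, key=r_key)
    s = sorted(s, key=s_key)
    result = []
    i = j = 0
    while i < len(r) and j < len(s):
        if r_key(r[i]) < s_key(s[j]):
            i += 1                      # advance scan of R
        elif r_key(r[i]) > s_key(s[j]):
            j += 1                      # advance scan of S
        else:
            key = r_key(r[i])
            # Find the end of the current S group.
            j_end = j
            while j_end < len(s) and s_key(s[j_end]) == key:
                j_end += 1
            # Rescan the S group once per tuple of the current R group.
            while i < len(r) and r_key(r[i]) == key:
                result.extend((r[i], s[k]) for k in range(j, j_end))
                i += 1
            j = j_end
    return result


reserves = [(28, 103, "12/4/96", "guppy"), (28, 103, "11/3/96", "yuppy"),
            (31, 101, "10/10/96", "dustin"), (31, 102, "10/12/96", "lubber"),
            (31, 101, "10/11/96", "lubber"), (58, 103, "11/12/96", "dustin")]
sailors = [(22, "dustin", 7, 45.0), (28, "yuppy", 9, 35.0),
           (31, "lubber", 8, 55.5), (44, "guppy", 5, 35.0),
           (58, "rusty", 10, 35.0)]

joined = sort_merge_join(reserves, sailors, lambda t: t[0], lambda t: t[0])
for pair in joined:
    print(pair)
```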
Example of Sort-Merge Join

Sailors
sid  sname   rating  age
22   dustin  7       45.0
28   yuppy   9       35.0
31   lubber  8       55.5
44   guppy   5       35.0
58   rusty   10      35.0

Reserves
sid  bid  day       rname
28   103  12/4/96   guppy
28   103  11/3/96   yuppy
31   101  10/10/96  dustin
31   102  10/12/96  lubber
31   101  10/11/96  lubber
58   103  11/12/96  dustin

• Cost
– Cost of sorting the two relations (use the external merge sort formula)
– Cost of joining the two relations: if one of the join attributes is a primary key, then this cost is M + N, where M and N are the sizes (in pages) of the two relations
Size of the Join Result
• Terminology
– $n_r$: number of tuples in relation r.
– $f_r$: blocking factor of relation r, i.e., the number of tuples of relation r that fit into one block (page).
– If tuples of relation r are stored together physically, the number of blocks is $b_r = \lceil n_r / f_r \rceil$
– $s_r$: the size of a record (tuple) of relation r.
– V(A,r): the number of distinct values that appear in relation r for attribute A
• This is the size of $\pi_A(r)$.
• V(A,r) = $n_r$ if A is a key for relation r.
• SC(A,r): the selection cardinality of attribute A of relation r, i.e., the average number of records that satisfy an equality condition on attribute A.
– SC(A,r) = 1 if A is a key of r.
– Assuming distinct values are distributed evenly and A is not a key, $SC(A,r) = n_r / V(A,r)$
• Example
– $n_{customer}$ = 10,000
– $f_{customer}$ = 25, so $b_{customer}$ = 10000/25 = 400
– $n_{depositor}$ = 5000
– $f_{depositor}$ = 50, so $b_{depositor}$ = 5000/50 = 100
– V(customer-name, depositor) = 2500
• On average, each customer has two accounts
– Assume customer-name in depositor is a foreign key referencing customer.
• $r \times s$ contains $n_r n_s$ tuples.
• Each tuple of $r \times s$ occupies $s_r + s_s$ bytes.
• Size of natural join
– Let r(R) and s(S) be relations.
– If $R \cap S = \emptyset$:
• $r \bowtie s$ is the same as $r \times s$.
– If $R \cap S$ is a key of r:
• A tuple of s will join with at most one tuple from r.
• The number of tuples in $r \bowtie s$ is no greater than the number of tuples in s.
• If $R \cap S$ is a foreign key of s referencing r, then the number of tuples in $r \bowtie s$ is exactly the same as the number of tuples in s.
• Example: $depositor \bowtie customer$
» customer-name in depositor is a foreign key that refers to customer-name in customer.
– The size of the result is $n_{depositor}$ = 5000.
– If $R \cap S$ is a key for neither r nor s:
• Let $R \cap S = \{A\}$.
• Assume each value appears with equal probability.
• We estimate that each tuple t in r produces $n_s / V(A,s)$ tuples in $r \bowtie s$
• The total number of tuples in $r \bowtie s$ is estimated to be $n_r n_s / V(A,s)$
– Similarly, if we reverse the roles of r and s, the total number of tuples is estimated to be $n_r n_s / V(A,r)$
– These two estimates differ if $V(A,r) \neq V(A,s)$.
– The lower of the two estimates is probably the better one, since there are likely to be some dangling tuples.
– Example without using information about foreign keys:
• V(customer-name, depositor) = 2500
– Size = 5000 × 10000 / 2500 = 20,000
• V(customer-name, customer) = 10000
– Size = 5000 × 10000 / 10000 = 5000
• The lower one is 5000.
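The two estimates and the choice of the lower one can be written as a short function. The name `join_size_estimates` is illustrative; r is taken to be depositor and s to be customer, as in the example.

```python
def join_size_estimates(n_r, n_s, v_a_r, v_a_s):
    """Return (estimate using V(A,s), estimate using V(A,r), the lower one)."""
    using_s = n_r * n_s / v_a_s
    using_r = n_r * n_s / v_a_r
    return using_s, using_r, min(using_s, using_r)


# r = depositor (5000 tuples, 2500 distinct names),
# s = customer (10000 tuples, 10000 distinct names).
print(join_size_estimates(5000, 10000, 2500, 10000))
```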

Selection Operation
• Linear search
– All blocks have to be read: $b_r$
– Selection on a key attribute: $b_r / 2$ (on average)
• Binary search
– Locating the first tuple: $\lceil \log_2(b_r) \rceil$
– Assume the total number of records that will satisfy the selection is SC(A,r).
– These records will occupy $\lceil SC(A,r) / f_r \rceil$ blocks
– Total block accesses: $\lceil \log_2(b_r) \rceil + \lceil SC(A,r) / f_r \rceil - 1$
– If the equality condition is on a key attribute
• SC(A,r) = 1
• Total cost = $\lceil \log_2(b_r) \rceil$
• Clustered index, equality on key
– Cost = index access + 1
• Clustered index, equality on nonkey
– SC(A,r) tuples will satisfy an equality condition.
– Block accesses to retrieve the records: $\lceil SC(A,r) / f_r \rceil$
– Total: index access + $\lceil SC(A,r) / f_r \rceil$
• Unclustered index, equality
– SC(A,r) tuples will satisfy an equality condition.
– Worst-case scenario: each matching record resides on a different block.
– Cost = index access + SC(A,r)
– For a key indexing attribute: index access + 1

• Selections involving comparisons
– Let the lowest and highest values be min(A,r) and max(A,r).
– Assume the values are uniformly distributed.
– Number of records that will satisfy $A \le v$:
• 0, if $v < \min(A,r)$
• $n_r$, if $v \ge \max(A,r)$
• otherwise, $n_r \cdot (v - \min(A,r)) / (\max(A,r) - \min(A,r))$
• Clustered index, comparison
– Let the number of records that satisfy the condition be c.
– Cost = index access + $\lceil c / f_r \rceil$
• Unclustered index, comparison
– Cost = index access + c
– E.g., assume one-half of the records satisfy the condition
• Cost = index access + $n_r / 2$
• Suppose a B+ tree is used
– Cost = number of levels + (number of leaf nodes)/2 - 1 + $n_r / 2$

Complex Selections
• Conjunctive selection $\sigma_{\theta_1 \wedge \theta_2 \wedge \cdots \wedge \theta_n}(r)$
– Let $s_i$ be the size of $\sigma_{\theta_i}(r)$
– The probability of satisfying condition $\theta_i$ is $s_i / n_r$
– Estimated size = $n_r \cdot (s_1 \cdot s_2 \cdots s_n) / n_r^n$
• Disjunctive selection $\sigma_{\theta_1 \vee \theta_2 \vee \cdots \vee \theta_n}(r)$
– Estimated size = $n_r \cdot (1 - (1 - s_1/n_r)(1 - s_2/n_r) \cdots (1 - s_n/n_r))$
• Negation $\sigma_{\neg\theta}(r)$
– Estimated size = $size(r) - size(\sigma_{\theta}(r))$
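The three size estimates translate directly into code. The function names and the sample numbers (a 1000-tuple relation with two conditions matching 500 and 250 tuples) are made up for illustration, and the conditions are assumed to be independent, as the formulas require.

```python
from math import prod


def conjunction_size(n_r, sizes):
    """n_r * product of the individual selectivities s_i / n_r."""
    return n_r * prod(s / n_r for s in sizes)


def disjunction_size(n_r, sizes):
    """n_r * (1 - probability that a tuple satisfies none of the conditions)."""
    return n_r * (1 - prod(1 - s / n_r for s in sizes))


def negation_size(n_r, s):
    """Tuples of r that do not satisfy the condition."""
    return n_r - s


# Hypothetical relation: 1000 tuples, conditions matching 500 and 250 tuples.
print(conjunction_size(1000, [500, 250]))
print(disjunction_size(1000, [500, 250]))
print(negation_size(1000, 500))
```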

• Conjunctive selection using one index
– Use a selection algorithm (one of those discussed before) to retrieve the records.
– Complete the operation by testing, in the memory buffer (on the fly), whether the remaining conditions are satisfied.
• Conjunctive selection using a composite index
– As discussed before.
• Conjunctive selection by intersection of identifiers
– Each index is scanned for pointers to tuples that satisfy an individual condition.
– Intersect all the retrieved pointers.
– Use the resulting pointers to retrieve the actual records.
– If indices are not available on all the individual conditions, then the retrieved records are tested against the remaining conditions.
• Disjunctive selection by union of identifiers
– Each index is scanned for pointers to tuples that satisfy an individual condition.
– Union all the retrieved pointers.
– Use the resulting pointers to retrieve the actual records.
– If even one of the conditions does not have an access path, we will have to perform a linear scan on the relation.
More Examples
Sailors (sid: integer, sname: string, rating: integer, age: real)
Reserves (sid: integer, bid: integer, day: dates, rname: string)
• Reserves:
– Each tuple is 40 bytes long, 100 tuples per page, 1000 pages.
• Sailors:
– Each tuple is 50 bytes long, 80 tuples per page, 500 pages.
RA Tree
SELECT S.sname
FROM Reserves R, Sailors S
WHERE R.sid=S.sid AND R.bid=100 AND S.rating>5
• Translating to a relational algebra expression:
$\pi_{sname}(\sigma_{bid=100 \wedge rating>5}(Reserves \bowtie_{sid=sid} Sailors))$
• Tree representation
– Nodes are relational operators
– Edges point to where the input comes from
– In general, we must also add an access path to each node
[Figure: RA tree with $\pi_{sname}$ at the root, $\sigma_{bid=100 \wedge rating>5}$ below it, and $\bowtie_{sid=sid}$ over the leaves Reserves and Sailors]
• Plan: $\pi_{sname}$ (on-the-fly) over $\sigma_{bid=100 \wedge rating>5}$ (on-the-fly) over $\bowtie_{sid=sid}$ (Simple Nested Loops), with Sailors and Reserves as inputs
• Cost: 500 + 500 × 1000 = 500,500 I/Os (page at a time)
• Misses several opportunities: selections could have been "pushed" earlier, no use is made of any available indexes, etc.
• Goal of optimization: to find more efficient plans that compute the same answer.
On-the-Fly vs. Materialized Evaluations
• Let op1 and op2 be two relational algebra operations, where op1 is performed on the result of op2
• The evaluation of op1 is on-the-fly if the result of op2 is sent directly to op1 (i.e., not stored in a temporary file)
• It is materialized if that result is first stored in a temporary file
• On-the-fly evaluation is also called pipelined evaluation
• On-the-fly can be more efficient than materialized
– Example: $\sigma_{bid=100}(Reserves \bowtie_{sid=sid} Sailors)$
• Whether on-the-fly should be used depends on the case
– Example: $(\sigma_{age<25} Sailors) \bowtie_{sid=sid} (\sigma_{bid=100} Reserves)$
– What if only one sailor has ever reserved bid=100, and only once (i.e., only one tuple in Reserves has bid=100)?
On-the-Fly vs. Materialized Evaluations: example
$\sigma_{bid=100}(Reserves \bowtie_{sid=sid} Sailors)$
[Figure: four animation frames comparing materialized and on-the-fly evaluation on small Reserves (sid, bid) and Sailors (sid, name) instances. Each side shows the relations on disk and, in RAM, a buffer for Reserves, a buffer for Sailors, and a buffer for the temporary result. In the materialized version the join result is written to disk as a temporary file before the selection is applied; in the on-the-fly version the joined tuples are filtered in RAM as they are produced.]
Alternative Plans 1 (No Indexes)
[Figure: plan tree with $\pi_{sname}$ (on-the-fly) over $\bowtie_{sid=sid}$ (Sort-Merge Join), whose inputs are $\sigma_{bid=100}$(Reserves) (scan; write to temp T1) and $\sigma_{rating>5}$(Sailors) (scan; write to temp T2)]
• Main difference: push selects.
• With 5 buffer pages, cost of plan:
– Scan Reserves (1000) + write temp T1 (10 pages, assuming 100 boats and a uniform distribution. There are 100,000 tuples in Reserves, with 1000 tuples per boat, stored in 10 pages.)
– Scan Sailors (500) + write temp T2 (250 pages, assuming 10 ratings and a uniform distribution. There are 40,000 tuples in Sailors, with 4000 tuples per rating. Thus 20,000 tuples have ratings > 5, stored in 20000/80 = 250 pages.)
– Sort T1 (2 × 10 × 2), sort T2 (2 × 250 × 4), merge (10 + 250)
– Total: 4060 page I/Os.
• If we used BNL join, join cost = 10 + 4 × 250 = 1010, total cost = 1010 + 750 + 1010 = 2770. (Note that for BNL we do not sort T1 and T2.)
• Furthermore, we can push projections (next slide)
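The two totals for this plan can be recomputed step by step. The sketch below hard-codes the slide's assumptions (T1 = 10 pages, T2 = 250 pages, 5 buffer pages); the function names are hypothetical, and the BNL variant assumes that 3 of the 5 buffers hold the outer block.

```python
def external_sort_cost(pages, buffers):
    """2 * pages * (number of passes) for an external merge sort."""
    runs = (pages + buffers - 1) // buffers
    passes = 1
    while runs > 1:
        runs = (runs + buffers - 2) // (buffers - 1)
        passes += 1
    return 2 * pages * passes


RESERVES_PAGES, SAILORS_PAGES = 1000, 500
T1_PAGES, T2_PAGES = 10, 250   # sizes of the temporary files, from the slide
# Scan both relations and write the selected tuples: 1010 + 750.
SELECT_COST = (RESERVES_PAGES + T1_PAGES) + (SAILORS_PAGES + T2_PAGES)


def sort_merge_plan_cost(buffers=5):
    sort = external_sort_cost(T1_PAGES, buffers) + external_sort_cost(T2_PAGES, buffers)
    merge = T1_PAGES + T2_PAGES
    return SELECT_COST + sort + merge


def bnl_plan_cost(buffers=5):
    block = buffers - 2                      # pages left for the outer block
    blocks = -(-T1_PAGES // block)           # ceiling division
    join = T1_PAGES + blocks * T2_PAGES
    return SELECT_COST + join


print(sort_merge_plan_cost())
print(bnl_plan_cost())
```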
Alternative Plans 1 (cont.) (No Indexes)
[Figure: plan tree with $\bowtie_{sid=sid}$ (BNL) over two inputs: a file scan of Reserves followed by $\sigma_{bid=100}$ (scan) and $\pi_{sid}$ (on-the-fly; write to T1), and a file scan of Sailors followed by $\sigma_{rating>5}$ (scan) and $\pi_{sname,sid}$ (on-the-fly; write to T2)]
• When we "push" projections as well, T1 has only sid, and T2 has only sid and sname:
– Assume T1 fits in 3 pages, and T2 fits in 200 pages.
– Join cost of BNL: 3 + 1 × 200 = 203
– Total cost: 1000 + 3 + 500 + 200 + 203 = 1906
Alternative Plans 2 with Indexes
[Figure: plan tree with $\pi_{sname}$ (on-the-fly) over $\sigma_{rating>5}$ (on-the-fly) over $\bowtie_{sid=sid}$ (Index Nested Loops, with pipelining), whose inputs are $\sigma_{bid=100}$(Reserves) (use hash index; do not write result to temp) and Sailors]
Assumption: clustered hash index on bid of Reserves, and hash index on sid of Sailors
1. Hash on 100 and get all Reserves tuples with bid = 100
2. Use index nested loops with pipelining (outer is not materialized)
– For each tuple obtained at step 1, hash its sid value to get the matching Sailors tuple (note: sid is the key of Sailors, so at most one matching tuple exists; an unclustered index on sid of Sailors is OK)
– Join these two tuples, and select the result if rating > 5
– If selected, project out its sname
• Cost, assuming 100 boats uniformly distributed in Reserves
– We get 100,000/100 = 1000 tuples with a bid of 100. They are in 10 pages (note: Reserves is sorted on bid). For each such tuple, use 2.2 I/Os (0 [directory in main memory] + 1.2 [with overflow buckets] + 1 [sid is the key in Sailors]) to get the matching tuple in Sailors, for a total of 1000 × 2.2 = 2200 I/Os
– Total cost of the query: 2 + 10 + 2200 = 2212 I/Os (assume the index cost of Reserves is 2, i.e., the directory is on disk and there is no overflow bucket)
Alternative Plans 2 with Indexes (cont.)
[Figure: plan tree with $\pi_{sname}$ (on-the-fly) over $\sigma_{rating>5}$ (on-the-fly) over $\bowtie_{sid=sid}$ (Index Nested Loops, with pipelining), whose inputs are $\sigma_{bid=100}$(Reserves) (use hash index; do not write result to temp) and Sailors]
• Note: we did not push the selection on rating before the join. Why?
– If we perform the selection before the join, we must scan Sailors, with an extra cost of 500 I/Os.
– More importantly, once the selection is performed, we no longer have an index on the sid field of the result
– Note: the result is a brand new file. Its location is independent of the location of Sailors.
– We have to build a new index solely for the subsequent join operation. Is this worthwhile?
Alternative Plans 2 with Indexes (cont.)
• Assume: Sailors sorted on sid, clustered hash index on bid of Reserves, 3 buffer pages available, and 1000 tuples with bid=100 in Reserves (i.e., in 10 pages)
[Figure, upper plan: $\pi_{sname}$ (on-the-fly) over $\sigma_{rating>5}$ (on-the-fly) over $\bowtie_{sid=sid}$ (Nested Loops, page at a time), with inputs $\sigma_{bid=100}$(Reserves) (use hash index; do not write result to temp relation) and Sailors]
[Figure, lower plan: $\pi_{sname}$ (on-the-fly) over $\sigma_{rating>5}$ (on-the-fly) over $\bowtie_{sid=sid}$ (Merge join), with inputs $\sigma_{bid=100}$(Reserves) (use hash index; write result to T1, then sort it on sid) and Sailors]
• The cost for the upper plan:
– 2 + 10 + 10 × 500 = 5012
• The cost for the lower plan:
– 2 + 10 (select) + 10 (write) + 2 × 3 × 10 (sort) + (10 + 500) (sort-merge join) = 592
• This implies that on-the-fly evaluation is not always the best