Query Processing: Sorting
• Join – several algorithms:
– Nested Loops
– Indexed Nested Loops
– Sort-Merge
– Hash Join
[Figure: query plan. Projection on sname (on-the-fly), over selections bid=100 and rating > 5 (on-the-fly), over the join sid=sid (Simple Nested Loops) of Reserves and Sailors.]
Review: Cost Estimation
• To compute operator costs, DBMS needs:
– Sizes of relations
– Info on indexes (type, size)
– Info on attribute values (high, low, distribution, etc.)
– This info stored in catalogs
• Efficient sorting
[Figure: streaming through RAM. INPUT goes to an input buffer, f(x) is applied, and the result goes to an output buffer and then OUTPUT.]
2-Way Sort
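[Figure: two-way merge with three buffer pages: INPUT 1, INPUT 2, OUTPUT.]
• Number of passes $= \lceil \log_2 N \rceil + 1$ for a file of N pages
• Idea: Divide and conquer: sort subfiles and merge
[Figure: merge passes on a sample file, with labels "PASS 2" and "8-page runs"; the final sorted pages read 1,2 | 2,3 | 3,4 | 4,5 | 6,6 | 7,8 | 9.]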
General External Merge Sort
[Figure: B main memory buffers between disk and disk: B-1 input buffers (INPUT 1, INPUT 2, ..., INPUT B-1) and one OUTPUT buffer.]
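The B-buffer scheme can be sketched in Python roughly as follows. This is a minimal in-memory simulation, not a real disk-based implementation: it assumes a "page" is a list of integer keys and a "file" is a list of pages. The names `paginate`, `external_merge_sort`, and `page_size`, and the sample data, are illustrative assumptions, not from the slides. Pass 0 sorts B pages at a time; each later pass merges B-1 runs at once.

```python
import heapq
from typing import List, Tuple

Page = List[int]
Run = List[Page]  # a sorted run is a list of pages


def paginate(records: List[int], page_size: int) -> List[Page]:
    """Split a flat list of records into pages of at most page_size records."""
    return [records[i:i + page_size] for i in range(0, len(records), page_size)]


def external_merge_sort(pages: List[Page], B: int, page_size: int) -> Tuple[List[Page], int]:
    """Sort `pages` with B buffer pages; return (sorted pages, number of passes)."""
    if B < 3:
        raise ValueError("need at least 3 buffer pages (2 inputs + 1 output)")
    if not pages:
        return [], 0

    # Pass 0: read B pages at a time, sort them in memory, write one run.
    runs: List[Run] = []
    for i in range(0, len(pages), B):
        chunk = sorted(rec for page in pages[i:i + B] for rec in page)
        runs.append(paginate(chunk, page_size))
    passes = 1

    # Later passes: (B-1)-way merges until a single run remains.
    while len(runs) > 1:
        merged_runs: List[Run] = []
        for i in range(0, len(runs), B - 1):
            group = runs[i:i + B - 1]
            # One record stream per input buffer; heapq.merge picks the smallest head.
            streams = [(rec for page in run for rec in page) for run in group]
            merged_runs.append(paginate(list(heapq.merge(*streams)), page_size))
        runs = merged_runs
        passes += 1
    return runs[0], passes


# Illustrative 7-page file with 2 records per page, sorted with B = 3 buffers.
sample = [[3, 4], [6, 2], [9, 4], [8, 7], [5, 6], [3, 1], [2]]
sorted_pages, n_passes = external_merge_sort(sample, B=3, page_size=2)
print(sorted_pages, n_passes)
```

With B = 3 every merge is a two-way merge, so the only difference from plain 2-way sort is that Pass 0 already produces 3-page runs.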
Cost of External Merge Sort
• Number of passes: $1 + \lceil \log_{B-1} \lceil N/B \rceil \rceil$
• Cost = 2N * (# of passes)
• E.g., with 5 buffer pages, to sort a 108-page file:
– Pass 0: $\lceil 108/5 \rceil = 22$ sorted runs of 5 pages each (last run is only 3 pages)
• Now, do four-way (B-1) merges
– Pass 1: $\lceil 22/4 \rceil = 6$ sorted runs of 20 pages each (last run is only 8 pages)
– Pass 2: 2 sorted runs, 80 pages and 28 pages
– Pass 3: Sorted file of 108 pages
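The arithmetic in this example can be checked with a short Python sketch. The helper names `ceil_div`, `run_counts`, and `sort_cost` are hypothetical; the sketch uses integer arithmetic only, to avoid floating-point rounding in the logarithm.

```python
from typing import List


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling of a / b."""
    return -(-a // b)


def run_counts(n_pages: int, B: int) -> List[int]:
    """Number of sorted runs left after each pass (index 0 is Pass 0)."""
    if B < 3:
        raise ValueError("need at least 3 buffer pages")
    runs = ceil_div(n_pages, B)        # Pass 0 produces ceil(N/B) runs
    counts = [runs]
    while runs > 1:
        runs = ceil_div(runs, B - 1)   # each later pass is a (B-1)-way merge
        counts.append(runs)
    return counts


def sort_cost(n_pages: int, B: int) -> int:
    """Total page I/Os: each pass reads and writes every page once."""
    return 2 * n_pages * len(run_counts(n_pages, B))


print(run_counts(108, 5))  # [22, 6, 2, 1], i.e. 4 passes
print(sort_cost(108, 5))   # 864 = 2 * 108 * 4
```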
Number of Passes of External Sort
[Figure: B+ tree index. The index (directs search) sits above the data entries ("sequence set"), which lead to the data records.]
External Sorting vs. Unclustered Index
Simple Nested Loops Join
foreach tuple r in R do
    foreach tuple s in S do
        if r_i == s_j then add <r, s> to result
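A runnable version of the tuple-at-a-time loop might look like the Python sketch below. It assumes relations are lists of tuples and that `i` and `j` are the positions of the join columns in R and S; the function name and the sample rows are made up for illustration.

```python
from typing import List, Sequence, Tuple

Row = Tuple


def simple_nested_loops_join(R: Sequence[Row], S: Sequence[Row],
                             i: int, j: int) -> List[Tuple[Row, Row]]:
    """For each tuple r in R, scan all of S and keep pairs with r[i] == s[j]."""
    result = []
    for r in R:              # outer relation
        for s in S:          # inner relation, rescanned once per outer tuple
            if r[i] == s[j]:
                result.append((r, s))
    return result


# Hypothetical sample rows: Reserves(sid, bid) and Sailors(sid, sname).
reserves = [(22, 101), (58, 103), (22, 102)]
sailors = [(22, "dustin"), (31, "lubber"), (58, "rusty")]
for pair in simple_nested_loops_join(reserves, sailors, 0, 0):
    print(pair)
```

Every R tuple triggers a full scan of S, which is why this variant is the most expensive of the join algorithms listed.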
Index Nested Loops Join
foreach tuple r in R do
    foreach tuple s in S where r_i == s_j do
        add <r, s> to result
• If there is an index on the join column of one relation
(say S), can make it the inner and exploit the index.
– Cost: M + ((M * pR) * cost of finding matching S tuples), where M is the number of pages of R and pR is the number of R tuples per page
• For each R tuple, cost of probing S index is about 1.2
for hash index, 2-4 for B+ tree. Cost of then finding S
tuples (assuming Alt. (2) or (3) for data entries)
depends on clustering.
– Clustered index: 1 I/O (typical); unclustered: up to 1 I/O per matching S tuple.
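As a rough sketch of the idea, the Python below uses a dictionary as a stand-in for a hash index on S's join column and probes it once per R tuple, then evaluates the cost formula for some assumed numbers. The function names and the example values (M = 1000 pages, pR = 100 tuples per page, 1.2 I/Os per probe, 1 I/O per fetch) are assumptions chosen for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Row = Tuple


def build_hash_index(S: Sequence[Row], j: int) -> Dict[object, List[Row]]:
    """Stand-in for a hash index on column j of S: key -> matching S tuples."""
    index = defaultdict(list)
    for s in S:
        index[s[j]].append(s)
    return index


def index_nested_loops_join(R: Sequence[Row], S_index: Dict[object, List[Row]],
                            i: int) -> List[Tuple[Row, Row]]:
    """For each r in R, probe the index on S instead of scanning all of S."""
    result = []
    for r in R:
        for s in S_index.get(r[i], []):   # only the matching S tuples
            result.append((r, s))
    return result


def inl_cost(M: int, p_R: int, probe_cost: float, fetch_cost: float) -> float:
    """M + (M * p_R) * (cost of finding matching S tuples), in page I/Os."""
    return M + (M * p_R) * (probe_cost + fetch_cost)


# Assumed numbers: hash index probe of 1.2 I/Os plus 1 I/O to fetch the S tuple.
print(inl_cost(M=1000, p_R=100, probe_cost=1.2, fetch_cost=1.0))
```

With these assumed values the estimate is about 221,000 I/Os, dominated by the 100,000 index probes rather than by the scan of R.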
Examples of Index Nested Loops
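Hash-Join
• Partition both relations using hash function h: R tuples in partition i will only match S tuples in partition i.
[Figure: partitioning phase. The original relation is read through one INPUT buffer and split by hash function h into B-1 partitions (1, 2, ..., B-1).]
• Read in a partition of R, hash it using h2 (<> h!). Scan matching partition of S, search for matches.
[Figure: probing phase. Partitions of R & S on disk; B main memory buffers hold the hash table for partition Ri (k < B-1 pages) built with h2, an input buffer for Si, and an output buffer; the join result goes to disk.]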
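The two phases can be sketched in Python as below. This is an in-memory illustration under stated assumptions, not a disk-based implementation: relations are lists of tuples, `h` is taken to be `hash(key) % (B - 1)`, and a Python dictionary stands in for the in-memory hash table built with h2. The function name and sample rows are hypothetical.

```python
from collections import defaultdict
from typing import List, Sequence, Tuple

Row = Tuple


def hash_join(R: Sequence[Row], S: Sequence[Row], i: int, j: int,
              B: int) -> List[Tuple[Row, Row]]:
    """Partition R and S with h, then join each pair of matching partitions."""
    if B < 3:
        raise ValueError("need at least 3 buffer pages")
    n_parts = B - 1

    def h(key) -> int:
        # Assumed partitioning function: one input buffer, B-1 partitions.
        return hash(key) % n_parts

    # Partitioning phase: equal keys always land in the same partition number.
    R_parts: List[List[Row]] = [[] for _ in range(n_parts)]
    S_parts: List[List[Row]] = [[] for _ in range(n_parts)]
    for r in R:
        R_parts[h(r[i])].append(r)
    for s in S:
        S_parts[h(s[j])].append(s)

    # Probing phase: build a table on R_k, then scan S_k and look up matches.
    result = []
    for R_k, S_k in zip(R_parts, S_parts):
        table = defaultdict(list)      # dict plays the role of the h2 hash table
        for r in R_k:
            table[r[i]].append(r)
        for s in S_k:
            for r in table.get(s[j], []):
                result.append((r, s))
    return result


# Hypothetical sample rows: Reserves(sid, bid) and Sailors(sid, sname).
reserves = [(22, 101), (58, 103), (22, 102)]
sailors = [(22, "dustin"), (31, "lubber"), (58, "rusty")]
print(sorted(hash_join(reserves, sailors, 0, 0, B=5)))
```

The output order depends on partition order, so the usage line sorts the result before printing.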
Observations on Hash-Join