Quiz 2 Cheatsheet
Magnetic Disks: Access Time

Access time – the time it takes from when a read or write request is issued to when data transfer begins. Consists of:
• Seek time – time it takes to reposition the arm over the correct track.
  • Average seek time is 1/2 the worst-case seek time.
    • Would be 1/3 if all tracks had the same number of sectors, and we ignore the time to start and stop arm movement.
  • 4 to 10 milliseconds on typical disks.
• Rotational latency – time it takes for the sector to be accessed to appear under the head.
  • 4 to 11 milliseconds on typical disks (5400 to 15000 r.p.m.).
  • Average latency is 1/2 of the above latency.
• Overall latency is 5 to 20 msec depending on disk model.
Data-transfer rate – the rate at which data can be retrieved from or stored to the disk.
• 25 to 200 MB per second max rate, lower for inner tracks.

Flash Storage

Erase happens in units of erase block.
• Takes 2 to 5 milliseconds.
• Erase block typically 256 KB to 1 MB (128 to 256 pages).
Remapping of logical page addresses to physical page addresses avoids waiting for erase.
• Flash translation table tracks the mapping.
  • Also stored in a label field of each flash page (extra bytes), along with the logical address and a valid bit.
  • Remapping carried out by the flash translation layer.
After 100,000 to 1,000,000 erases, an erase block becomes unreliable and cannot be used.
• Wear leveling.

Improvement of Reliability via Redundancy

Redundancy – store extra information that can be used to rebuild information lost in a disk failure.
E.g., Mirroring (or shadowing):
• Duplicate every disk. A logical disk consists of two physical disks.
• Every write is carried out on both disks.
  • Reads can take place from either disk.
• If one disk in a pair fails, data is still available in the other.
  • Data loss would occur only if a disk fails, and its mirror disk also fails before the system is repaired.
    • Probability of the combined event is very small.
      • Except for dependent failure modes such as fire or building collapse or electrical power surges.
Mean time to data loss depends on mean time to failure and mean time to repair.
• E.g., MTTF of 100,000 hours and mean time to repair of 10 hours give a mean time to data loss of 500 * 10^6 hours (or 57,000 years) for a mirrored pair of disks (ignoring dependent failure modes).

Improvement in Performance via Parallelism

Two main goals of parallelism in a disk system:
1. Load balance multiple small accesses to increase throughput.
2. Parallelize large accesses to reduce response time.
Improve transfer rate by striping data across multiple disks.
Bit-level striping – split the bits of each byte across multiple disks.
• In an array of eight disks, write bit i of each byte to disk i.
• Each access can read data at eight times the rate of a single disk.
• But seek/access time is worse than for a single disk.
• Bit-level striping is not used much any more.
Block-level striping – with n disks, block i of a file goes to disk (i mod n) + 1.
• Requests for different blocks can run in parallel if the blocks reside on different disks.
• A request for a long sequence of blocks can utilize all disks in parallel.
Database System Concepts - 7th Edition ©Silberschatz, Korth and Sudarshan
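The mirrored-pair example above can be reproduced with the standard approximation MTTDL ≈ MTTF² / (2 · MTTR), which assumes independent disk failures; this is a sketch, not from the slides:

```python
# Sketch: mean time to data loss (MTTDL) for a mirrored pair, using the
# common approximation MTTF^2 / (2 * MTTR); assumes independent failures.
def mirrored_mttdl(mttf_hours, mttr_hours):
    return mttf_hours ** 2 / (2 * mttr_hours)

hours = mirrored_mttdl(100_000, 10)   # values from the example above
print(hours)                          # 500000000.0, i.e. 500 * 10^6 hours
print(hours / (24 * 365))             # roughly 57,000 years
```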
RAID Levels

Schemes to provide redundancy at lower cost by using disk striping combined with parity bits.
• Different RAID organizations, or RAID levels, have differing cost, performance and reliability characteristics.
RAID Level 0: Block striping; non-redundant.
• Used in high-performance applications where data loss is not critical.
RAID Level 1: Mirrored disks with block striping.
• Offers best write performance.
• Popular for applications such as storing log files in a database system.

Parity blocks: parity block j stores the XOR of bits from block j of each disk.
• When writing data to a block j, parity block j must also be computed and written to disk.
  • Can be done using the old parity block, the old value of the current block and the new value of the current block (2 block reads + 2 block writes).
  • Or by recomputing the parity value using the new values of all blocks corresponding to the parity block.
    • More efficient for writing large amounts of data sequentially.
• To recover data for a block, compute the XOR of bits from all other blocks in the set, including the parity block.

RAID Level 5: Block-Interleaved Distributed Parity; partitions data and parity among all N + 1 disks, rather than storing data in N disks and parity in 1 disk.
• E.g., with 5 disks, the parity block for the nth set of blocks is stored on disk (n mod 5) + 1, with the data blocks stored on the other 4 disks.
• Block writes occur in parallel if the blocks and their parity blocks are on different disks.
RAID Level 6: P+Q Redundancy scheme; similar to Level 5, but stores two error-correction blocks (P, Q) instead of a single parity block, to guard against multiple disk failures.
• Better reliability than Level 5 at a higher cost.
• Becoming more important as storage sizes increase.
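The parity computation, recovery, and incremental-update steps above can be sketched byte-wise; the block contents here are made up for illustration:

```python
from functools import reduce

# Byte-wise XOR parity over equal-sized blocks (illustrative data).
def xor_blocks(blocks):
    return bytes(reduce(lambda a, b: a ^ b, byts) for byts in zip(*blocks))

data = [b"AAAA", b"BBBB", b"CCCC"]     # data blocks on three disks
p = xor_blocks(data)                   # parity block on a fourth disk

# Recovery: XOR of all surviving blocks (including parity) rebuilds the lost one.
assert xor_blocks([data[0], data[2], p]) == data[1]

# Incremental write: new parity = old parity XOR old block XOR new block
# (the "2 block reads + 2 block writes" scheme above).
new_b1 = b"DDDD"
p_new = xor_blocks([p, data[1], new_b1])
assert p_new == xor_blocks([data[0], new_b1, data[2]])
```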
Fixed-Length Records

Deletion of record i – alternatives:
• move records i + 1, . . ., n to i, . . . , n – 1
• move record n to i
• do not move records, but link all free records on a free list

Variable-Length Records: Slotted Page Structure

Slotted page header contains:
• number of record entries
• end of free space in the block
• location and size of each record
Records can be moved around within a page to keep them contiguous with no empty space between them; the entry in the header must be updated.
Pointers should not point directly to the record – instead they should point to the entry for the record in the header.

Heap File Organization

Records can be placed anywhere in the file where there is free space.
Records usually do not move once allocated.
Important to be able to efficiently find free space within the file.
Free-space map:
• Array with 1 entry per block. Each entry is a few bits to a byte, and records the fraction of the block that is free.
  • E.g., with 3 bits per block, the value divided by 8 indicates the fraction of the block that is free.
• Can have a second-level free-space map.
  • E.g., each entry stores the maximum of 4 entries of the first-level free-space map.
Free-space map is written to disk periodically; it is OK to have wrong (old) values for some entries (they will be detected and fixed).

Multitable Clustering File Organization

Store several relations in one file using a multitable clustering file organization, e.g., a multitable clustering of department and instructor.
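The two-level free-space map above can be sketched as follows; the entry width and group size follow the 3-bit / 4-entry example, while the per-block free fractions are made up:

```python
# Sketch: 3-bit free-space map entries (value/8 = fraction free) plus a
# second-level map storing the max over each group of 4 first-level entries.
def fsm_entry(free_fraction):
    return min(7, int(free_fraction * 8))      # clamp to 3 bits

first_level = [fsm_entry(f) for f in (0.1, 0.9, 0.5, 0.0, 1.0, 0.2, 0.3, 0.4)]
second_level = [max(first_level[i:i + 4]) for i in range(0, len(first_level), 4)]

def find_block(needed_fraction):
    need = int(needed_fraction * 8)
    for g, mx in enumerate(second_level):
        if mx >= need:                          # this group may contain a fit
            for i in range(g * 4, g * 4 + 4):
                if first_level[i] >= need:
                    return i                    # block number
    return None

print(find_block(0.5))   # 1 (block 1 is 90% free)
```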
Ordered Indices

In an ordered index, index entries are stored sorted on the search key value.
Clustering index: in a sequentially ordered file, the index whose search key specifies the sequential order of the file.
• Also called primary index.
• The search key of a primary index is usually but not necessarily the primary key.
Secondary index: an index whose search key specifies an order different from the sequential order of the file. Also called nonclustering index.
Index-sequential file: sequential file ordered on a search key, with a clustering index on the search key.

Sparse Index Files

Sparse index: contains index records for only some search-key values.
• Applicable when records are sequentially ordered on the search key.
To locate a record with search-key value K we:
• Find the index record with the largest search-key value < K.
• Search the file sequentially starting at the record to which the index record points.

Multilevel Index

If an index does not fit in memory, access becomes expensive.
Solution: treat the index kept on disk as a sequential file and construct a sparse index on it.
• outer index – a sparse index of the basic index
• inner index – the basic index file
If even the outer index is too large to fit in main memory, yet another level of index can be created, and so on.
Indices at all levels must be updated on insertion or deletion from the file.

B+-Tree Index Files

A B+-tree is a rooted tree satisfying the following properties:
• All paths from root to leaf are of the same length.
• Each node that is not a root or a leaf has between ⌈n/2⌉ and n children.
• A leaf node has between ⌈(n–1)/2⌉ and n–1 values.
• Special cases:
  • If the root is not a leaf, it has at least 2 children.
  • If the root is a leaf (that is, there are no other nodes in the tree), it can have between 0 and n–1 values.
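As a quick sanity check, the node-occupancy bounds above can be computed directly for a given fanout n; the fanout values below are illustrative:

```python
import math

# Occupancy bounds implied by the B+-tree properties above, for fanout n.
def bplus_bounds(n):
    return {
        "internal_children": (math.ceil(n / 2), n),
        "leaf_values": (math.ceil((n - 1) / 2), n - 1),
    }

print(bplus_bounds(4))
# internal nodes: between 2 and 4 children; leaves: between 2 and 3 values
```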
B+-Tree Node Structure

Typical node: P1, K1, P2, K2, . . ., Pn–1, Kn–1, Pn.
• Ki are the search-key values.
• Pi are pointers to children (for non-leaf nodes) or pointers to records or buckets of records (for leaf nodes).
The search keys in a node are ordered: K1 < K2 < K3 < . . . < Kn–1.
(Initially assume no duplicate keys; duplicates are addressed later.)

Non-Leaf Nodes in B+-Trees

Non-leaf nodes form a multi-level sparse index on the leaf nodes. For a non-leaf node with m pointers:
• All the search keys in the subtree to which P1 points are less than K1.
• For 2 ≤ i ≤ n – 1, all the search keys in the subtree to which Pi points have values greater than or equal to Ki–1 and less than Ki.
• All the search keys in the subtree to which Pn points have values greater than or equal to Kn–1.

Observations about B+-Trees

Since the inter-node connections are done by pointers, "logically" close blocks need not be "physically" close.
The non-leaf levels of the B+-tree form a hierarchy of sparse indices.
The B+-tree contains a relatively small number of levels:
• The level below the root has at least 2 * ⌈n/2⌉ values.
• The next level has at least 2 * ⌈n/2⌉ * ⌈n/2⌉ values, etc.
• If there are K search-key values in the file, the tree height is no more than ⌈log⌈n/2⌉(K)⌉.
  • Thus searches can be conducted efficiently.
Insertions and deletions to the main file can be handled efficiently, as the index can be restructured in logarithmic time (as we shall see).

Queries on B+-Trees

function find(v)
1. C = root
2. while (C is not a leaf node)
   a. Let i be the least number such that v ≤ Ki.
   b. if there is no such number i then
          set C = last non-null pointer in C
   c. else if (v = C.Ki) set C = C.Pi+1
   d. else set C = C.Pi
3. if for some i, Ki = v then return C.Pi
4. else return null /* no record with search-key value v exists */
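The find(v) procedure above can be sketched as a minimal in-memory Python version. The node layout is assumed: `keys = [K1..Km-1]`, `pointers = [P1..Pm]`, and in a leaf, `pointers[i]` is the record pointer for `keys[i]`:

```python
# Minimal sketch of B+-tree find(v); node layout is an assumption, not the
# slides' on-disk format.
class Node:
    def __init__(self, keys, pointers, leaf=False):
        self.keys, self.pointers, self.leaf = keys, pointers, leaf

def find(root, v):
    c = root
    while not c.leaf:
        # least i such that v <= K_i
        i = next((j for j, k in enumerate(c.keys) if v <= k), None)
        if i is None:
            c = c.pointers[-1]          # last non-null pointer in C
        elif v == c.keys[i]:
            c = c.pointers[i + 1]       # C = C.P_{i+1}
        else:
            c = c.pointers[i]           # C = C.P_i
    for i, k in enumerate(c.keys):      # at the leaf
        if k == v:
            return c.pointers[i]
    return None                         # no record with search-key value v

leaf1 = Node([10, 20], ["r10", "r20"], leaf=True)
leaf2 = Node([30, 40], ["r30", "r40"], leaf=True)
root = Node([30], [leaf1, leaf2])
assert find(root, 20) == "r20" and find(root, 30) == "r30"
assert find(root, 25) is None
```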
Updates on B+-Trees: Insertion

Assume the record has already been added to the file. Let
• pr be the pointer to the record, and
• v be the search-key value of the record.
1. Find the leaf node in which the search-key value would appear.
2. If there is room in the leaf node, insert the (v, pr) pair in the leaf node.
3. Otherwise, split the node (along with the new (v, pr) entry) as discussed below, and propagate updates to parent nodes.

Splitting a leaf node:
• Take the n (search-key value, pointer) pairs (including the one being inserted) in sorted order. Place the first ⌈n/2⌉ in the original node, and the rest in a new node.
• Let the new node be p, and let k be the least key value in p. Insert (k, p) in the parent of the node being split.
• If the parent is full, split it and propagate the split further up.
Splitting of nodes proceeds upwards till a node that is not full is found.
• In the worst case the root node may be split, increasing the height of the tree by 1.
Example: splitting the node containing Brandt, Califieri and Crick on inserting Adams; the next step is to insert the entry (Califieri, pointer-to-new-node) into the parent.

Splitting a non-leaf node: when inserting (k, p) into an already full internal node N:
• Copy N to an in-memory area M with space for n+1 pointers and n keys.
• Insert (k, p) into M.
• Copy P1, K1, …, K⌈n/2⌉–1, P⌈n/2⌉ from M back into node N.
• Copy P⌈n/2⌉+1, K⌈n/2⌉+1, …, Kn, Pn+1 from M into a newly allocated node N'.
• Insert (K⌈n/2⌉, N') into the parent of N.
Read the pseudocode in the book!

Example of B+-Tree Deletion

Before and after deletion of "Gold":
• The node with Gold and Katz became underfull, and was merged with its sibling.
• The parent node becomes underfull, and is merged with its sibling.
  • The value separating the two nodes (at the parent) is pulled down when merging.
• The root node then has only one child, and is deleted.
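The leaf-split step described above can be sketched on sorted (key, pointer) pairs; the record pointers here are placeholder integers:

```python
import math

# Sketch of a B+-tree leaf split: n (key, ptr) pairs, the first ceil(n/2)
# stay in the old leaf, the rest move to a new leaf whose least key is
# pushed up to the parent.
def split_leaf(pairs):                   # pairs sorted by key, len(pairs) == n
    cut = math.ceil(len(pairs) / 2)
    old, new = pairs[:cut], pairs[cut:]
    k = new[0][0]                        # least key in new node, goes to parent
    return old, new, k

# The Adams/Brandt/Califieri/Crick example from above (pointers made up):
old, new, k = split_leaf([("Adams", 1), ("Brandt", 2), ("Califieri", 3), ("Crick", 4)])
print(k)   # Califieri — the entry inserted into the parent
```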
Updates on B+-Trees: Deletion

Assume the record has already been deleted from the file. Let V be the search-key value of the record, and Pr be the pointer to the record.
Remove (Pr, V) from the leaf node.
If the node has too few entries due to the removal, and the entries in the node and a sibling fit into a single node, then merge siblings:
• Insert all the search-key values in the two nodes into a single node (the one on the left), and delete the other node.
• Delete the pair (Ki–1, Pi), where Pi is the pointer to the deleted node, from its parent, recursively using the above procedure.
Otherwise, if the node has too few entries due to the removal, but the entries in the node and a sibling do not fit into a single node, then redistribute pointers:
• Redistribute the pointers between the node and a sibling such that both have more than the minimum number of entries.
• Update the corresponding search-key value in the parent of the node.
The node deletions may cascade upwards till a node which has ⌈n/2⌉ or more pointers is found.
If the root node has only one pointer after deletion, it is deleted and the sole child becomes the root.

Complexity of Updates

Cost (in terms of number of I/O operations) of insertion and deletion of a single entry is proportional to the height of the tree.
• With K entries and maximum fanout of n, worst-case complexity of insert/delete of an entry is O(⌈log⌈n/2⌉(K)⌉).
In practice, the number of I/O operations is less:
• Internal nodes tend to be in the buffer.
• Splits/merges are rare; most insert/delete operations only affect a leaf node.
Average node occupancy depends on insertion order:
• 2/3rds with random order, 1/2 with insertion in sorted order.

B+-Tree File Organization

B+-tree file organization:
• Leaf nodes in a B+-tree file organization store records, instead of pointers.
• Helps keep data records clustered even when there are insertions/deletions/updates.
Leaf nodes are still required to be half full.
• Since records are larger than pointers, the maximum number of records that can be stored in a leaf node is less than the number of pointers in a nonleaf node.
Insertion and deletion are handled in the same way as insertion and deletion of entries in a B+-tree index.
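The worst-case update bound above is easy to evaluate; the entry count and fanout below are illustrative values, not from the slides:

```python
import math

# Worst-case insert/delete I/O bound from above: ceil(log_{ceil(n/2)} K).
def worst_case_ios(K, n):
    return math.ceil(math.log(K, math.ceil(n / 2)))

print(worst_case_ios(1_000_000, 100))   # a million entries, fanout 100 -> 4
```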
Bulk Loading and Bottom-Up Build

Inserting entries one at a time into a B+-tree requires ≥ 1 I/O per entry,
• assuming the leaf level does not fit in memory,
• and can be very inefficient for loading a large number of entries at a time (bulk loading).
Efficient alternative 1:
• Sort entries first (using efficient external-memory sort algorithms, discussed later in Section 12.4).
• Insert in sorted order.
  • Each insertion will go to an existing page (or cause a split).
  • Much improved I/O performance, but most leaf nodes end up half full.
Efficient alternative 2: bottom-up B+-tree construction.
• As before, sort entries.
• Then create the tree layer by layer, starting with the leaf level (details as an exercise).
• Implemented as part of the bulk-load utility by most database systems.

Handling of Bucket Overflows

Overflow chaining – the overflow buckets of a given bucket are chained together in a linked list.
• This scheme is called closed addressing (also called closed hashing or open hashing depending on the book you use).
• An alternative, called open addressing (also called open hashing or closed hashing depending on the book you use), which does not use overflow buckets, is not suitable for database applications.

Deficiencies of Static Hashing

In static hashing, function h maps search-key values to a fixed set B of bucket addresses. Databases grow or shrink with time.
• If the initial number of buckets is too small, and the file grows, performance will degrade due to too many overflows.
• If space is allocated for anticipated growth, a significant amount of space will be wasted initially (and buckets will be underfull).
• If the database shrinks, again space will be wasted.
One solution: periodic reorganization of the file with a new hash function.
• Expensive, disrupts normal operations.
Better solution: allow the number of buckets to be modified dynamically.

Dynamic Hashing

Periodic rehashing:
• If the number of entries in a hash table becomes (say) 1.5 times the size of the hash table,
  • create a new hash table of (say) 2 times the size of the previous hash table,
  • and rehash all entries to the new table.
Linear hashing:
• Do the rehashing in an incremental manner.
Extendable hashing:
• Tailored to disk-based hashing, with buckets shared by multiple hash values.
• Allows doubling the number of entries in the hash table without doubling the number of buckets.
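The periodic-rehashing scheme above (grow when entries exceed 1.5× the bucket count, double the table, rehash everything) can be sketched with an in-memory chained hash table; the thresholds follow the slide's "(say)" values:

```python
# Sketch of periodic rehashing: chained buckets, doubled and fully rehashed
# whenever the entry count would exceed 1.5x the number of buckets.
class RehashingTable:
    def __init__(self, nbuckets=4):
        self.buckets = [[] for _ in range(nbuckets)]
        self.count = 0

    def insert(self, key):
        if self.count + 1 > 1.5 * len(self.buckets):
            old = [k for b in self.buckets for k in b]
            self.buckets = [[] for _ in range(2 * len(self.buckets))]
            for k in old:                               # rehash all entries
                self.buckets[hash(k) % len(self.buckets)].append(k)
        self.buckets[hash(key) % len(self.buckets)].append(key)
        self.count += 1

t = RehashingTable()
for i in range(20):
    t.insert(i)
print(len(t.buckets))   # 16: doubled 4 -> 8 -> 16 as entries grew
```

Linear and extendable hashing avoid exactly this stop-the-world rehash step by growing one bucket (or one directory doubling) at a time.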
Measures of Query Cost

Disk cost can be estimated as:
• Number of seeks * average seek cost
• Number of blocks read * average block-read cost
• Number of blocks written * average block-write cost
For simplicity we just use the number of block transfers from disk and the number of seeks as the cost measures:
• tT – time to transfer one block (assuming for simplicity that write cost is the same as read cost)
• tS – time for one seek
• Cost for b block transfers plus S seeks: b * tT + S * tS
tS and tT depend on where data is stored; with 4 KB blocks:
• High-end magnetic disk: tS = 4 msec and tT = 0.1 msec
• SSD: tS = 20-90 microsec and tT = 2-10 microsec for 4 KB

Selection Operation

File scan:
A1 (linear search). Scan each file block and test all records to see whether they satisfy the selection condition.
• Cost estimate = br block transfers + 1 seek
  • br denotes the number of blocks containing records from relation r.
• If the selection is on a key attribute, can stop on finding the record:
  • cost = (br/2) block transfers + 1 seek
• Linear search can be applied regardless of selection condition, ordering of records in the file, or availability of indices.
Note: binary search generally does not make sense since data is not stored consecutively,
• except when there is an index available,
• and binary search requires more seeks than index search.

Selections Using Indices

Index scan – search algorithms that use an index.
• The selection condition must be on the search key of the index.
A2 (clustering index, equality on key). Retrieve a single record that satisfies the corresponding equality condition.
• Cost = (hi + 1) * (tT + tS)
A3 (clustering index, equality on nonkey). Retrieve multiple records.
• Records will be on consecutive blocks.
• Let b = number of blocks containing matching records.
• Cost = hi * (tT + tS) + tS + tT * b
A4 (secondary index, equality on key/nonkey).
• Retrieve a single record if the search key is a candidate key.
  • Cost = (hi + 1) * (tT + tS)
• Retrieve multiple records if the search key is not a candidate key.
  • Each of the n matching records may be on a different block.
  • Cost = (hi + n) * (tT + tS) – can be very expensive!
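Plugging the index-scan formulas above into the magnetic-disk parameters from the cost-measures section gives concrete numbers; the index height hi = 3 and the n = 50 matching records are assumed values for illustration:

```python
# Example costs for A2 and A4 using tS = 4 msec, tT = 0.1 msec (from above).
# h_i = 3 and n = 50 are assumed illustrative values.
tS, tT = 4.0, 0.1        # milliseconds
h_i = 3

A2 = (h_i + 1) * (tT + tS)     # clustering index, equality on key
n = 50
A4 = (h_i + n) * (tT + tS)     # secondary index, n matching records

print(round(A2, 1))   # 16.4 ms
print(round(A4, 1))   # 217.3 ms: roughly one random I/O per matching record
```

This is why A4 on a non-candidate key "can be very expensive": cost grows linearly with n, and a linear scan can win for large n.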
Selections Involving Comparisons

Can implement selections of the form σA≤V(r) or σA≥V(r) by using
• a linear file scan,
• or by using indices in the following ways:
A5 (clustering index, comparison). (Relation is sorted on A.)
• For σA≥V(r), use the index to find the first tuple ≥ V and scan the relation sequentially from there.
• For σA≤V(r), just scan the relation sequentially till the first tuple > V; do not use the index.
A6 (secondary index, comparison).
• For σA≥V(r), use the index to find the first index entry ≥ V and scan the index sequentially from there, to find pointers to records.
• For σA≤V(r), just scan the leaf pages of the index finding pointers to records, till the first entry > V.
• In either case, retrieving the records that are pointed to requires an I/O per record; a linear file scan may be cheaper!

Implementation of Complex Selections

Conjunction: σθ1∧θ2∧. . .∧θn(r)
A7 (conjunctive selection using one index).
• Select a combination of θi and algorithms A1 through A7 that results in the least cost for σθi(r).
• Test the other conditions on each tuple after fetching it into the memory buffer.
A8 (conjunctive selection using composite index).
• Use an appropriate composite (multiple-key) index if available.
A9 (conjunctive selection by intersection of identifiers).
• Requires indices with record pointers.
• Use the corresponding index for each condition, and take the intersection of all the obtained sets of record pointers.
• Then fetch the records from the file.
• If some conditions do not have appropriate indices, apply the test in memory.
Disjunction: σθ1∨θ2∨. . .∨θn(r)
A10 (disjunctive selection by union of identifiers).
• Applicable if all conditions have available indices; otherwise use linear scan.
• Use the corresponding index for each condition, and take the union of all the obtained sets of record pointers.
• Then fetch the records from the file.
Negation: σ¬θ(r)
• Use a linear scan on the file.
• If very few records satisfy ¬θ, and an index is applicable to θ:
  • Find the satisfying records using the index and fetch them from the file.
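The identifier-set operations behind A9 and A10 can be sketched with Python sets; the record ids and the two conditions θ1, θ2 are made up:

```python
# Sketch of A9/A10: each index probe yields a set of record pointers (record
# ids here); conjunctions intersect them, disjunctions union them, and only
# the surviving ids are fetched from the file.
ptrs_theta1 = {101, 103, 107, 109}      # assumed rids matching condition θ1
ptrs_theta2 = {103, 104, 109, 110}      # assumed rids matching condition θ2

conjunction = ptrs_theta1 & ptrs_theta2     # A9: fetch only rids in both
disjunction = ptrs_theta1 | ptrs_theta2     # A10: fetch rids in either

print(sorted(conjunction))   # [103, 109]
```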
External Sort-Merge

Let M denote memory size (in pages).
1. Create sorted runs. Let i be 0 initially. Repeatedly do the following till the end of the relation:
   (a) Read M blocks of the relation into memory.
   (b) Sort the in-memory blocks.
   (c) Write the sorted data to run Ri; increment i.
   Let the final value of i be N.
2. Merge the runs (N-way merge). We assume (for now) that N < M.
   1. Use N blocks of memory to buffer input runs, and 1 block to buffer output. Read the first block of each run into its buffer page.
   2. repeat
      1. Select the first record (in sort order) among all buffer pages.
      2. Write the record to the output buffer. If the output buffer is full, write it to disk.
      3. Delete the record from its input buffer page. If the buffer page becomes empty, read the next block (if any) of the run into the buffer.
   3. until all input buffer pages are empty.
If N ≥ M, several merge passes are required.
• In each pass, contiguous groups of M – 1 runs are merged.
• A pass reduces the number of runs by a factor of M – 1, and creates runs longer by the same factor.
  • E.g., if M = 11 and there are 90 runs, one pass reduces the number of runs to 9, each 10 times the size of the initial runs.
• Repeated passes are performed till all runs have been merged into one.
Cost analysis:
• 1 block per run leads to too many seeks during the merge.
  • Instead use bb buffer blocks per run, and read/write bb blocks at a time.
  • Can merge ⌊M/bb⌋ – 1 runs in one pass.
• Total number of merge passes required: ⌈log⌊M/bb⌋–1(br/M)⌉.
• Block transfers for initial run creation as well as in each pass is 2br.
  • For the final pass, we don't count the write cost.
    • We ignore the final write cost for all operations, since the output of an operation may be sent to the parent operation without being written to disk.
• Thus the total number of block transfers for external sorting is:
  br (2 ⌈log⌊M/bb⌋–1(br/M)⌉ + 1)
• Seeks: analyzed below.
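The block-transfer formula above is easy to evaluate; the relation size br = 900 blocks below is an assumed value, chosen so that with M = 11 and bb = 1 (merge fan-in 10) two merge passes are needed:

```python
import math

# Total block transfers for external sort-merge, per the formula above:
#   b_r * (2 * ceil(log_{floor(M/bb)-1}(b_r / M)) + 1)
def sort_block_transfers(br, M, bb=1):
    fanin = M // bb - 1
    passes = math.ceil(math.log(br / M, fanin))
    return br * (2 * passes + 1)

print(sort_block_transfers(900, 11))   # 4500: 2 merge passes at 2*900, plus 900
```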
External Sort-Merge: Cost of Seeks

• During run generation: one seek to read each run and one seek to write each run:
  2 ⌈br/M⌉
• During the merge phase:
  • Need 2 ⌈br/bb⌉ seeks for each merge pass,
  • except the final one, which does not require a write.
• Total number of seeks:
  2 ⌈br/M⌉ + ⌈br/bb⌉ (2 ⌈log⌊M/bb⌋–1(br/M)⌉ – 1)

Nested-Loop Join

To compute the theta join r ⨝θ s:
    for each tuple tr in r do begin
        for each tuple ts in s do begin
            test pair (tr, ts) to see if they satisfy the join condition θ
            if they do, add tr • ts to the result
        end
    end
r is called the outer relation and s the inner relation of the join.
Requires no indices and can be used with any kind of join condition.
Expensive, since it examines every pair of tuples in the two relations.
In the worst case, if there is enough memory only to hold one block of each relation, the estimated cost is nr ∗ bs + br block transfers, plus nr + br seeks.
If the smaller relation fits entirely in memory, use that as the inner relation.
• Reduces cost to br + bs block transfers and 2 seeks.
Assuming worst-case memory availability, the cost estimate is:
• with student as the outer relation: 5000 ∗ 400 + 100 = 2,000,100 block transfers, 5000 + 100 = 5100 seeks
• with takes as the outer relation: 10000 ∗ 100 + 400 = 1,000,400 block transfers and 10,400 seeks
If the smaller relation (student) fits entirely in memory, the cost estimate will be 500 block transfers.
The block nested-loops algorithm (below) is preferable.

Block Nested-Loop Join

Variant of nested-loop join in which every block of the inner relation is paired with every block of the outer relation.
    for each block Br of r do begin
        for each block Bs of s do begin
            for each tuple tr in Br do begin
                for each tuple ts in Bs do begin
                    check if (tr, ts) satisfy the join condition
                    if they do, add tr • ts to the result
                end
            end
        end
    end
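The block nested-loop pseudocode above can be sketched in memory by modeling each relation as a list of blocks, each block a list of tuples; the data and the equality-on-column-0 join condition are made up for illustration:

```python
# In-memory sketch of block nested-loop join: the inner relation is rescanned
# once per outer *block* rather than once per outer *tuple*.
def block_nested_loop_join(r_blocks, s_blocks):
    out = []
    for Br in r_blocks:                 # one pass over the outer relation
        for Bs in s_blocks:             # inner relation, block by block
            for tr in Br:
                for ts in Bs:
                    if tr[0] == ts[0]:  # join condition θ (equality here)
                        out.append(tr + ts)
    return out

r = [[(1, "a"), (2, "b")], [(3, "c")]]      # 2 blocks of r
s = [[(2, "x")], [(3, "y"), (4, "z")]]      # 2 blocks of s
print(block_nested_loop_join(r, s))          # [(2, 'b', 2, 'x'), (3, 'c', 3, 'y')]
```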
Hash Join

The hash join of r and s is computed as follows:
1. Partition the relation s using hashing function h. When partitioning a relation, one block of memory is reserved as the output buffer for each partition.
2. Partition r similarly.
3. For each i:
   (a) Load si into memory and build an in-memory hash index on it using the join attribute. This hash index uses a different hash function than the earlier one, h.
   (b) Read the tuples in ri from disk one by one. For each tuple tr, locate each matching tuple ts in si using the in-memory hash index. Output the concatenation of their attributes.
Relation s is called the build input and r is called the probe input.
The value n and the hash function h are chosen such that each si should fit in memory.
• Typically n is chosen as ⌈bs/M⌉ * f, where f is a "fudge factor", typically around 1.2.
• The probe-relation partitions ri need not fit in memory.
Recursive partitioning is required if the number of partitions n is greater than the number of pages M of memory.
• Instead of partitioning n ways, use M – 1 partitions for s.
• Further partition the M – 1 partitions using a different hash function.
• Use the same partitioning method on r.
• Rarely required: e.g., with a block size of 4 KB, recursive partitioning is not needed for relations of < 1 GB with a memory size of 2 MB, or relations of < 36 GB with a memory of 12 MB.
Partitioning is said to be skewed if some partitions have significantly more tuples than others.
Hash-table overflow occurs in partition si if si does not fit in memory. Reasons could be:
• Many tuples in s with the same value for the join attributes.
• A bad hash function.
Overflow resolution can be done in the build phase:
• Partition si is further partitioned using a different hash function.
• Partition ri must be similarly partitioned.
Overflow avoidance performs the partitioning carefully to avoid overflows during the build phase:
• E.g., partition the build relation into many partitions, then combine them.
Both approaches fail with large numbers of duplicates.
• Fallback option: use block nested-loops join on the overflowed partitions.
Cost of hash join:
If recursive partitioning is not required:
• 3(br + bs) + 4 ∗ nh block transfers + 2(⌈br/bb⌉ + ⌈bs/bb⌉) seeks
If recursive partitioning is required:
• The number of passes required for partitioning the build relation s to less than M blocks per partition is ⌈log⌊M/bb⌋–1(bs/M)⌉.
• Best to choose the smaller relation as the build relation.
• Total cost estimate:
  2(br + bs) ⌈log⌊M/bb⌋–1(bs/M)⌉ + br + bs block transfers +
  2(⌈br/bb⌉ + ⌈bs/bb⌉) ⌈log⌊M/bb⌋–1(bs/M)⌉ seeks
If the entire build input can be kept in main memory, no partitioning is required.
• The cost estimate goes down to br + bs.
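The partition-then-probe structure above can be sketched in memory; the relations, the equality-on-column-0 join attribute, and the n = 4 partitions are made up, and Python's dict stands in for the second (in-memory) hash function:

```python
from collections import defaultdict

# In-memory sketch of hash join: partition both inputs with h, then for each
# partition i build a hash index on s_i (the build input) and probe it with
# r_i (the probe input).
def hash_join(r, s, n=4):
    h = lambda key: key % n                     # partitioning hash function h
    r_parts, s_parts = defaultdict(list), defaultdict(list)
    for t in r:
        r_parts[h(t[0])].append(t)              # partition probe input r
    for t in s:
        s_parts[h(t[0])].append(t)              # partition build input s
    out = []
    for i in range(n):
        index = defaultdict(list)               # in-memory index on s_i,
        for ts in s_parts[i]:                   # using a different hash
            index[ts[0]].append(ts)             # function (the dict's own)
        for tr in r_parts[i]:                   # probe with r_i
            for ts in index[tr[0]]:
                out.append(tr + ts)
    return out

r = [(1, "a"), (2, "b"), (5, "c")]
s = [(2, "x"), (5, "y"), (7, "z")]
print(sorted(hash_join(r, s)))   # [(2, 'b', 2, 'x'), (5, 'c', 5, 'y')]
```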