Quiz 2 Cheatsheet

A condensed set of lecture-slide notes covering disk performance measures, flash storage, RAID, record and buffer organization, B+-tree and hash indices, query-cost estimation (selection, external sorting, and join algorithms), and transaction ACID properties and states.


Performance Measures of Disks
 Access time – the time it takes from when a read or write request is issued to when data transfer begins. Consists of:
   • Seek time – time it takes to reposition the arm over the correct track.
       Average seek time is 1/2 the worst case seek time.
         • Would be 1/3 if all tracks had the same number of sectors, and we ignore the time to start and stop arm movement
       4 to 10 milliseconds on typical disks
   • Rotational latency – time it takes for the sector to be accessed to appear under the head.
       4 to 11 milliseconds on typical disks (5400 to 15000 r.p.m.)
       Average latency is 1/2 of the above latency.
   • Overall latency is 5 to 20 msec depending on disk model
 Data-transfer rate – the rate at which data can be retrieved from or stored to the disk.
   • 25 to 200 MB per second max rate, lower for inner tracks

Flash Storage (Cont.)
 Erase happens in units of erase block
   • Takes 2 to 5 millisecs
   • Erase block typically 256 KB to 1 MB (128 to 256 pages)
 Remapping of logical page addresses to physical page addresses avoids waiting for erase
   • Flash translation table tracks mapping
       also stored in a label field of flash page
   • remapping carried out by flash translation layer
 After 100,000 to 1,000,000 erases, erase block becomes unreliable and cannot be used
   • wear leveling
(Figure: page write through the flash translation table – the table maps logical page addresses to physical page addresses; the logical page address and a valid bit are stored with each physical page in extra bytes.)

Improvement of Reliability via Redundancy
 Redundancy – store extra information that can be used to rebuild information lost in a disk failure
 E.g., Mirroring (or shadowing)
   • Duplicate every disk. Logical disk consists of two physical disks.
   • Every write is carried out on both disks
       Reads can take place from either disk
   • If one disk in a pair fails, data still available in the other
 Data loss would occur only if a disk fails, and its mirror disk also fails before the system is repaired
   • Probability of combined event is very small
       Except for dependent failure modes such as fire or building collapse or electrical power surges
 Mean time to data loss depends on mean time to failure, and mean time to repair
   • E.g. MTTF of 100,000 hours, mean time to repair of 10 hours gives mean time to data loss of 500*10^6 hours (or 57,000 years) for a mirrored pair of disks (ignoring dependent failure modes); see the Python sketch after this slide group.

Improvement in Performance via Parallelism
 Two main goals of parallelism in a disk system:
   1. Load balance multiple small accesses to increase throughput
   2. Parallelize large accesses to reduce response time.
 Improve transfer rate by striping data across multiple disks.
 Bit-level striping – split the bits of each byte across multiple disks
   • In an array of eight disks, write bit i of each byte to disk i.
   • Each access can read data at eight times the rate of a single disk.
   • But seek/access time worse than for a single disk
       Bit level striping is not used much any more
 Block-level striping – with n disks, block i of a file goes to disk (i mod n) + 1
   • Requests for different blocks can run in parallel if the blocks reside on different disks
   • A request for a long sequence of blocks can utilize all disks in parallel
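A small Python sketch of the two calculations above (the helper names are made up for illustration): the mean-time-to-data-loss estimate for a mirrored pair, MTTF^2 / (2 * MTTR), and the block-to-disk mapping used by block-level striping.

    # Mean time to data loss for a mirrored pair, assuming independent failures:
    # some disk of the pair fails roughly every MTTF/2 hours, and data is lost only
    # if its mirror also fails within the MTTR repair window (probability MTTR/MTTF).
    def mirrored_mttdl(mttf_hours, mttr_hours):
        return mttf_hours ** 2 / (2 * mttr_hours)

    # Block-level striping: with n disks, block i of a file goes to disk (i mod n) + 1.
    def striping_disk(block_i, n_disks):
        return (block_i % n_disks) + 1

    print(mirrored_mttdl(100_000, 10))              # 500000000.0 hours, roughly 57,000 years
    print([striping_disk(i, 4) for i in range(8)])  # [1, 2, 3, 4, 1, 2, 3, 4]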

RAID Levels
 Schemes to provide redundancy at lower cost by using disk striping combined with parity bits
   • Different RAID organizations, or RAID levels, have differing cost, performance and reliability characteristics
 RAID Level 0: Block striping; non-redundant.
   • Used in high-performance applications where data loss is not critical.
 RAID Level 1: Mirrored disks with block striping
   • Offers best write performance.
   • Popular for applications such as storing log files in a database system.

RAID Levels (Cont.)
 Parity blocks: Parity block j stores XOR of bits from block j of each disk (see the Python sketch after this slide group)
   • When writing data to a block j, parity block j must also be computed and written to disk
       Can be done by using old parity block, old value of current block and new value of current block (2 block reads + 2 block writes)
       Or by recomputing the parity value using the new values of blocks corresponding to the parity block
         • More efficient for writing large amounts of data sequentially
   • To recover data for a block, compute XOR of bits from all other blocks in the set including the parity block

RAID Levels (Cont.)
 RAID Level 5: Block-Interleaved Distributed Parity; partitions data and parity among all N + 1 disks, rather than storing data in N disks and parity in 1 disk.
   • E.g., with 5 disks, parity block for nth set of blocks is stored on disk (n mod 5) + 1, with the data blocks stored on the other 4 disks.

RAID Levels (Cont.)
 RAID Level 5 (Cont.)
   • Block writes occur in parallel if the blocks and their parity blocks are on different disks.
 RAID Level 6: P+Q Redundancy scheme; similar to Level 5, but stores two error correction blocks (P, Q) instead of single parity block to guard against multiple disk failures.
   • Better reliability than Level 5 at a higher cost
       Becoming more important as storage sizes increase

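A minimal sketch (not from the slides) of the parity arithmetic above, with blocks modelled as equal-length byte strings: computing a parity block, updating it from the old parity plus the old and new data values, and recovering a lost block.

    from functools import reduce

    def xor_blocks(blocks):
        # Bytewise XOR over a list of equal-length blocks.
        return reduce(lambda a, b: bytes(x ^ y for x, y in zip(a, b)), blocks)

    data = [b"\x0f\xf0", b"\xaa\x55", b"\x01\x02"]       # block j of each data disk
    parity = xor_blocks(data)                            # parity block j

    # Small write: new parity = old parity XOR old value XOR new value
    # (2 block reads + 2 block writes in total, as stated above).
    new_block0 = b"\xff\x00"
    parity = xor_blocks([parity, data[0], new_block0])
    data[0] = new_block0

    # Recovery: a lost block is the XOR of all surviving blocks and the parity block.
    assert xor_blocks([data[1], data[2], parity]) == data[0]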

Fixed-Length Records
 Deletion of record i: alternatives:
   • move records i + 1, . . ., n to i, . . . , n – 1
   • move record n to i
   • do not move records, but link all free records on a free list

Variable-Length Records: Slotted Page Structure
 Slotted page header contains:
   • number of record entries
   • end of free space in the block
   • location and size of each record
 Records can be moved around within a page to keep them contiguous with no empty space between them; entry in the header must be updated.
 Pointers should not point directly to record — instead they should point to the entry for the record in header.

Heap File Organization
 Records can be placed anywhere in the file where there is free space
 Records usually do not move once allocated
 Important to be able to efficiently find free space within file
 Free-space map (see the Python sketch after this slide group)
   • Array with 1 entry per block. Each entry is a few bits to a byte, and records fraction of block that is free
   • In example below, 3 bits per block, value divided by 8 indicates fraction of block that is free
   • Can have second-level free-space map
   • In example below, each entry stores maximum from 4 entries of first-level free-space map
 Free space map written to disk periodically, OK to have wrong (old) values for some entries (will be detected and fixed)

Multitable Clustering File Organization
 Store several relations in one file using a multitable clustering file organization
(Figures: the department and instructor relations, and the multitable clustering of department and instructor.)
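A rough sketch of the free-space map described above, under the same assumptions as the slide's example (3 bits per block, so a stored value v means roughly v/8 of the block is free, plus a second-level map holding the maximum over groups of 4 first-level entries); the function names are illustrative only.

    # First-level map: one 3-bit entry per block; value v means ~v/8 of the block is free.
    def encode_free_fraction(free_fraction):
        return min(7, int(free_fraction * 8))

    def find_block_with_space(first_level, needed_fraction):
        # Second-level map: maximum over groups of 4 first-level entries, used to skip full regions.
        second_level = [max(first_level[i:i + 4]) for i in range(0, len(first_level), 4)]
        threshold = encode_free_fraction(needed_fraction)
        for g, g_max in enumerate(second_level):
            if g_max >= threshold:                              # some block in this group may fit
                for i in range(g * 4, min((g + 1) * 4, len(first_level))):
                    if first_level[i] >= threshold:
                        return i
        return None                                             # no block has enough free space

    fsm = [encode_free_fraction(f) for f in (0.0, 0.1, 0.0, 0.0, 0.9, 0.5, 0.0, 0.2)]
    print(find_block_with_space(fsm, 0.5))                      # 4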

Buffer Manager
 Programs call on the buffer manager when they need a block from disk.
   • If the block is already in the buffer, buffer manager returns the address of the block in main memory
   • If the block is not in the buffer, the buffer manager
       Allocates space in the buffer for the block
         • Replacing (throwing out) some other block, if required, to make space for the new block.
         • Replaced block written back to disk only if it was modified since the most recent time that it was written to/fetched from the disk.
       Reads the block from the disk to the buffer, and returns the address of the block in main memory to requester.

Buffer Manager
 Buffer replacement strategy (details coming up!)
 Pinned block: memory block that is not allowed to be written back to disk (see the Python sketch after this slide group)
   • Pin done before reading/writing data from a block
   • Unpin done when read/write is complete
   • Multiple concurrent pin/unpin operations possible
       Keep a pin count; a buffer block can be evicted only if pin count = 0
 Shared and exclusive locks on buffer
   • Needed to prevent concurrent operations from reading page contents as they are moved/reorganized, and to ensure only one move/reorganize at a time
   • Readers get shared lock, updates to a block require exclusive lock
   • Locking rules:
       Only one process can get exclusive lock at a time
       Shared lock cannot be held concurrently with exclusive lock
       Multiple processes may be given shared lock concurrently

Buffer-Replacement Policies
 Most operating systems replace the block least recently used (LRU strategy)
   • Idea behind LRU – use past pattern of block references as a predictor of future references
   • LRU can be bad for some queries
 Queries have well-defined access patterns (such as sequential scans), and a database system can use the information in a user's query to predict future references
 Mixed strategy with hints on replacement strategy provided by the query optimizer is preferable
 Example of bad access pattern for LRU: when computing the join of 2 relations r and s by a nested loops join
      for each tuple tr of r do
         for each tuple ts of s do
            if the tuples tr and ts match …

Buffer-Replacement Policies (Cont.)
 Toss-immediate strategy – frees the space occupied by a block as soon as the final tuple of that block has been processed
 Most recently used (MRU) strategy – system must pin the block currently being processed. After the final tuple of that block has been processed, the block is unpinned, and it becomes the most recently used block.
 Buffer manager can use statistical information regarding the probability that a request will reference a particular relation
   • E.g., the data dictionary is frequently accessed. Heuristic: keep data-dictionary blocks in main memory buffer
 Operating system or buffer manager may reorder writes
   • Can lead to corruption of data structures on disk
       E.g. linked list of blocks with missing block on disk
       File systems perform consistency check to detect such situations
   • Careful ordering of writes can avoid many such problems

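A toy sketch (hypothetical class, not the book's code) of a buffer pool that keeps a pin count per block and evicts the least recently used unpinned block, combining the pinning and LRU ideas above.

    from collections import OrderedDict

    class BufferPool:
        def __init__(self, capacity, read_block):
            self.capacity = capacity
            self.read_block = read_block           # function: block_id -> block contents
            self.frames = OrderedDict()            # block_id -> (contents, pin_count), in LRU order

        def pin(self, block_id):
            if block_id in self.frames:
                contents, pins = self.frames.pop(block_id)
            else:
                if len(self.frames) >= self.capacity:
                    self._evict()
                contents, pins = self.read_block(block_id), 0
            self.frames[block_id] = (contents, pins + 1)   # re-insert at most-recently-used end
            return contents

        def unpin(self, block_id):
            contents, pins = self.frames[block_id]
            self.frames[block_id] = (contents, pins - 1)

        def _evict(self):
            for block_id, (contents, pins) in self.frames.items():   # oldest first
                if pins == 0:                      # only a block with pin count 0 may be evicted
                    del self.frames[block_id]      # a real manager would also write back a modified block
                    return
            raise RuntimeError("all buffer blocks are pinned")

    pool = BufferPool(2, read_block=lambda b: f"contents of {b}")
    pool.pin("B1"); pool.unpin("B1"); pool.pin("B2"); pool.pin("B3")   # B3 evicts B1, the LRU unpinned block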

Ordered Indices
 In an ordered index, index entries are stored sorted on the search key value.
 Clustering index: in a sequentially ordered file, the index whose search key specifies the sequential order of the file.
   • Also called primary index
   • The search key of a primary index is usually but not necessarily the primary key.
 Secondary index: an index whose search key specifies an order different from the sequential order of the file. Also called nonclustering index.
 Index-sequential file: sequential file ordered on a search key, with a clustering index on the search key.

Sparse Index Files
 Sparse Index: contains index records for only some search-key values.
   • Applicable when records are sequentially ordered on search-key
 To locate a record with search-key value K we (see the Python sketch after this slide group):
   • Find index record with largest search-key value < K
   • Search file sequentially starting at the record to which the index record points

Multilevel Index
 If index does not fit in memory, access becomes expensive.
 Solution: treat index kept on disk as a sequential file and construct a sparse index on it.
   • outer index – a sparse index of the basic index
   • inner index – the basic index file
 If even outer index is too large to fit in main memory, yet another level of index can be created, and so on.
 Indices at all levels must be updated on insertion or deletion from the file.

B+-Tree Index Files (Cont.)
A B+-tree is a rooted tree satisfying the following properties:
 All paths from root to leaf are of the same length
 Each node that is not a root or a leaf has between ⌈n/2⌉ and n children.
 A leaf node has between ⌈(n–1)/2⌉ and n–1 values
 Special cases:
   • If the root is not a leaf, it has at least 2 children.
   • If the root is a leaf (that is, there are no other nodes in the tree), it can have between 0 and (n–1) values.

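A minimal sketch of the sparse-index lookup described above (illustrative names: index_keys/index_ptrs form the sparse index, file_blocks is the sequentially ordered file): find the index record with the largest search-key value below K, then scan the file forward from the block it points to.

    from bisect import bisect_left

    def sparse_lookup(index_keys, index_ptrs, file_blocks, k):
        i = bisect_left(index_keys, k) - 1          # largest indexed key strictly less than k, as on the slide
        start = index_ptrs[i] if i >= 0 else 0
        for b in range(start, len(file_blocks)):    # search the file sequentially from that block
            for key, record in file_blocks[b]:
                if key == k:
                    return record
                if key > k:
                    return None                     # file is ordered: we have passed k
        return None

    blocks = [[(10, "r10"), (20, "r20")], [(30, "r30"), (40, "r40")]]
    print(sparse_lookup([10, 30], [0, 1], blocks, 30))   # r30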

B+-Tree Node Structure
 Typical node (figure: P1, K1, P2, …, Pn–1, Kn–1, Pn)
   • Ki are the search-key values
   • Pi are pointers to children (for non-leaf nodes) or pointers to records or buckets of records (for leaf nodes).
 The search-keys in a node are ordered
   K1 < K2 < K3 < . . . < Kn–1
   (Initially assume no duplicate keys, address duplicates later)

Non-Leaf Nodes in B+-Trees
 Non leaf nodes form a multi-level sparse index on the leaf nodes. For a non-leaf node with m pointers:
   • All the search-keys in the subtree to which P1 points are less than K1
   • For 2 ≤ i ≤ n – 1, all the search-keys in the subtree to which Pi points have values greater than or equal to Ki–1 and less than Ki
   • All the search-keys in the subtree to which Pn points have values greater than or equal to Kn–1

Observations about B+-trees
 Since the inter-node connections are done by pointers, “logically” close blocks need not be “physically” close.
 The non-leaf levels of the B+-tree form a hierarchy of sparse indices.
 The B+-tree contains a relatively small number of levels
   • Level below root has at least 2*⌈n/2⌉ values
   • Next level has at least 2*⌈n/2⌉*⌈n/2⌉ values
   • .. etc.
   • If there are K search-key values in the file, the tree height is no more than ⌈log⌈n/2⌉(K)⌉
   • thus searches can be conducted efficiently.
 Insertions and deletions to the main file can be handled efficiently, as the index can be restructured in logarithmic time (as we shall see).

Queries on B+-Trees
function find(v)
1. C = root
2. while (C is not a leaf node)
   1. Let i be least number s.t. V ≤ Ki.
   2. if there is no such number i then
   3.    Set C = last non-null pointer in C
   4. else if (v = C.Ki) Set C = Pi+1
   5. else set C = C.Pi
3. if for some i, Ki = V then return C.Pi
4. else return null /* no record with search-key value v exists. */

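The find(v) pseudocode above, transcribed into a runnable Python sketch over a simplified, hypothetical node representation (sorted keys plus child pointers in internal nodes, record pointers in leaves).

    from bisect import bisect_left

    class Node:
        def __init__(self, keys, pointers, is_leaf):
            self.keys = keys            # K1..Kn-1, sorted
            self.pointers = pointers    # P1..Pn: children (internal) or record pointers (leaf)
            self.is_leaf = is_leaf

    def find(root, v):
        c = root
        while not c.is_leaf:
            i = bisect_left(c.keys, v)          # least i with v <= Ki (0-based)
            if i == len(c.keys):
                c = c.pointers[-1]              # no such i: follow the last non-null pointer
            elif v == c.keys[i]:
                c = c.pointers[i + 1]           # v = Ki: follow Pi+1
            else:
                c = c.pointers[i]               # v < Ki: follow Pi
        if v in c.keys:
            return c.pointers[c.keys.index(v)]  # record pointer for v
        return None                             # no record with search-key value v

    leaf1 = Node(["Brandt", "Califieri"], ["r1", "r2"], True)
    leaf2 = Node(["Crick", "Einstein"], ["r3", "r4"], True)
    root = Node(["Crick"], [leaf1, leaf2], False)
    print(find(root, "Califieri"))              # r2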

Updates on B+-Trees: Insertion
Assume record already added to the file. Let
 pr be pointer to the record, and let
 v be the search key value of the record
1. Find the leaf node in which the search-key value would appear
2. If there is room in the leaf node, insert (v, pr) pair in the leaf node
3. Otherwise, split the node (along with the new (v, pr) entry) as discussed in the next slide, and propagate updates to parent nodes.

Updates on B+-Trees: Insertion (Cont.)
 Splitting a leaf node (see the Python sketch after this slide group):
   • take the n (search-key value, pointer) pairs (including the one being inserted) in sorted order. Place the first ⌈n/2⌉ in the original node, and the rest in a new node.
   • let the new node be p, and let k be the least key value in p. Insert (k,p) in the parent of the node being split.
   • If the parent is full, split it and propagate the split further up.
 Splitting of nodes proceeds upwards till a node that is not full is found.
   • In the worst case the root node may be split increasing the height of the tree by 1.
(Figure) Result of splitting node containing Brandt, Califieri and Crick on inserting Adams. Next step: insert entry with (Califieri, pointer-to-new-node) into parent

Insertion in B+-Trees (Cont.)
 Splitting a non-leaf node: when inserting (k,p) into an already full internal node N
   • Copy N to an in-memory area M with space for n+1 pointers and n keys
   • Insert (k,p) into M
   • Copy P1,K1, …, K⌈n/2⌉–1,P⌈n/2⌉ from M back into node N
   • Copy P⌈n/2⌉+1,K⌈n/2⌉+1, …, Kn,Pn+1 from M into newly allocated node N'
   • Insert (K⌈n/2⌉, N') into parent N
 Example (figure in the slides)

Example of B+-tree Deletion (Cont.)
(Figure) Before and after deletion of “Gold”
 Node with Gold and Katz became underfull, and was merged with its sibling
 Parent node becomes underfull, and is merged with its sibling
   • Value separating two nodes (at the parent) is pulled down when merging
 Root node then has only one child, and is deleted
 Read pseudocode in book!
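A small sketch of the leaf-split step described above (hypothetical helper; pairs are the n (search-key, pointer) pairs including the new one, already in sorted order): the first ⌈n/2⌉ pairs stay in the original node, the rest move to a new node, and the least key of the new node is handed to the parent.

    from math import ceil

    def split_leaf(pairs, n):
        keep = ceil(n / 2)
        left, right = pairs[:keep], pairs[keep:]   # first ceil(n/2) stay in the original node
        separator = right[0][0]                    # least key in the new node: insert (separator, new node) into parent
        return left, right, separator

    pairs = [("Adams", 1), ("Brandt", 2), ("Califieri", 3), ("Crick", 4)]
    print(split_leaf(pairs, 4))
    # ([('Adams', 1), ('Brandt', 2)], [('Califieri', 3), ('Crick', 4)], 'Califieri')
    # -- matches the Adams/Califieri example above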

Updates on B+-Trees: Deletion
Assume record already deleted from file. Let V be the search key value of the record, and Pr be the pointer to the record.
 Remove (Pr, V) from the leaf node
 If the node has too few entries due to the removal, and the entries in the node and a sibling fit into a single node, then merge siblings:
   • Insert all the search-key values in the two nodes into a single node (the one on the left), and delete the other node.
   • Delete the pair (Ki–1, Pi), where Pi is the pointer to the deleted node, from its parent, recursively using the above procedure.

Updates on B+-Trees: Deletion
 Otherwise, if the node has too few entries due to the removal, but the entries in the node and a sibling do not fit into a single node, then redistribute pointers:
   • Redistribute the pointers between the node and a sibling such that both have more than the minimum number of entries.
   • Update the corresponding search-key value in the parent of the node.
 The node deletions may cascade upwards till a node which has ⌈n/2⌉ or more pointers is found.
 If the root node has only one pointer after deletion, it is deleted and the sole child becomes the root.

Complexity of Updates
 Cost (in terms of number of I/O operations) of insertion and deletion of a single entry proportional to height of the tree
   • With K entries and maximum fanout of n, worst case complexity of insert/delete of an entry is O(log⌈n/2⌉(K)) (worked example after this slide group)
 In practice, number of I/O operations is less:
   • Internal nodes tend to be in buffer
   • Splits/merges are rare, most insert/delete operations only affect a leaf node
 Average node occupancy depends on insertion order
   • 2/3rds with random, ½ with insertion in sorted order

B+-Tree File Organization
 B+-Tree File Organization:
   • leaf nodes in a B+-tree file organization store records, instead of pointers
   • Helps keep data records clustered even when there are insertions/deletions/updates
 Leaf nodes are still required to be half full
   • Since records are larger than pointers, the maximum number of records that can be stored in a leaf node is less than the number of pointers in a nonleaf node.
 Insertion and deletion are handled in the same way as insertion and deletion of entries in a B+-tree index.

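A quick worked check of the O(log⌈n/2⌉(K)) height bound referenced above, for assumed values of K and the fanout n.

    from math import ceil, log

    def max_btree_height(K, n):
        # Worst case: every node is minimally full, so each level multiplies
        # the number of reachable entries by at least ceil(n/2).
        return ceil(log(K, ceil(n / 2)))

    print(max_btree_height(1_000_000, 100))   # 4: a million entries with fanout 100 stay within ~4 levels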
Bulk Loading and Bottom-Up Build
 Inserting entries one-at-a-time into a B+-tree requires ≥ 1 IO per entry
   • assuming leaf level does not fit in memory
   • can be very inefficient for loading a large number of entries at a time (bulk loading)
 Efficient alternative 1:
   • sort entries first (using efficient external-memory sort algorithms discussed later in Section 12.4)
   • insert in sorted order
       insertion will go to existing page (or cause a split)
       much improved IO performance, but most leaf nodes half full
 Efficient alternative 2: Bottom-up B+-tree construction
   • As before sort entries
   • And then create tree layer-by-layer, starting with leaf level
       details as an exercise
   • Implemented as part of bulk-load utility by most database systems

Handling of Bucket Overflows (Cont.)
 Overflow chaining – the overflow buckets of a given bucket are chained together in a linked list.
 Above scheme is called closed addressing (also called closed hashing or open hashing depending on the book you use)
   • An alternative, called open addressing (also called open hashing or closed hashing depending on the book you use), which does not use overflow buckets, is not suitable for database applications.

Deficiencies of Static Hashing
 In static hashing, function h maps search-key values to a fixed set B of bucket addresses. Databases grow or shrink with time.
   • If initial number of buckets is too small, and file grows, performance will degrade due to too many overflows.
   • If space is allocated for anticipated growth, a significant amount of space will be wasted initially (and buckets will be underfull).
   • If database shrinks, again space will be wasted.
 One solution: periodic re-organization of the file with a new hash function
   • Expensive, disrupts normal operations
 Better solution: allow the number of buckets to be modified dynamically.

Dynamic Hashing
 Periodic rehashing
   • If number of entries in a hash table becomes (say) 1.5 times size of hash table,
       create new hash table of size (say) 2 times the size of the previous hash table
       Rehash all entries to new table
 Linear Hashing
   • Do rehashing in an incremental manner
 Extendable Hashing
   • Tailored to disk based hashing, with buckets shared by multiple hash values
   • Doubling of # of entries in hash table, without doubling # of buckets
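A minimal sketch of the periodic-rehashing policy above: when the number of entries exceeds 1.5 times the number of buckets, allocate a table twice as large and rehash every entry (the class and thresholds are illustrative).

    class RehashingTable:
        def __init__(self, n_buckets=4):
            self.buckets = [[] for _ in range(n_buckets)]
            self.count = 0

        def insert(self, key, value):
            if self.count + 1 > 1.5 * len(self.buckets):     # load threshold from the slide
                self._rehash(2 * len(self.buckets))          # new table twice the previous size
            self.buckets[hash(key) % len(self.buckets)].append((key, value))
            self.count += 1

        def _rehash(self, new_size):
            old_entries = [kv for bucket in self.buckets for kv in bucket]
            self.buckets = [[] for _ in range(new_size)]
            for key, value in old_entries:                   # rehash all entries to the new table
                self.buckets[hash(key) % new_size].append((key, value))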

Measures of Query Cost
 Disk cost can be estimated as:
   • Number of seeks * average-seek-cost
   • Number of blocks read * average-block-read-cost
   • Number of blocks written * average-block-write-cost
 For simplicity we just use the number of block transfers from disk and the number of seeks as the cost measures
   • tT – time to transfer one block
       Assuming for simplicity that write cost is same as read cost
   • tS – time for one seek
   • Cost for b block transfers plus S seeks: b * tT + S * tS
 tS and tT depend on where data is stored; with 4 KB blocks:
   • High end magnetic disk: tS = 4 msec and tT = 0.1 msec
   • SSD: tS = 20-90 microsec and tT = 2-10 microsec for 4 KB

Selection Operation
 File scan
 Algorithm A1 (linear search). Scan each file block and test all records to see whether they satisfy the selection condition.
   • Cost estimate = br block transfers + 1 seek
       br denotes number of blocks containing records from relation r
   • If selection is on a key attribute, can stop on finding record
       cost = (br/2) block transfers + 1 seek
   • Linear search can be applied regardless of
       selection condition or
       ordering of records in the file, or
       availability of indices
 Note: binary search generally does not make sense since data is not stored consecutively
   • except when there is an index available,
   • and binary search requires more seeks than index search

Selections Using Indices
 Index scan – search algorithms that use an index
   • selection condition must be on search-key of index.
 A2 (clustering index, equality on key). Retrieve a single record that satisfies the corresponding equality condition
   • Cost = (hi + 1) * (tT + tS)
 A3 (clustering index, equality on nonkey). Retrieve multiple records.
   • Records will be on consecutive blocks
       Let b = number of blocks containing matching records
   • Cost = hi * (tT + tS) + tS + tT * b

Selections Using Indices
 A4 (secondary index, equality on key/non-key).
   • Retrieve a single record if the search-key is a candidate key
       Cost = (hi + 1) * (tT + tS)
   • Retrieve multiple records if search-key is not a candidate key
       each of n matching records may be on a different block
       Cost = (hi + n) * (tT + tS)
         • Can be very expensive!
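The selection cost formulas above, evaluated in Python for assumed values (br = 100 blocks, index height hi = 3, n = 10 matching records spread over b = 2 blocks, magnetic-disk timings).

    tS, tT = 4.0, 0.1               # seek and block-transfer times in milliseconds (high-end magnetic disk)
    br, hi, n, b = 100, 3, 10, 2    # blocks of r, index height, matching records, blocks holding the matches

    a1 = br * tT + 1 * tS                 # A1 linear search: br transfers + 1 seek
    a2 = (hi + 1) * (tT + tS)             # A2 clustering index, equality on key
    a3 = hi * (tT + tS) + tS + tT * b     # A3 clustering index, equality on nonkey
    a4 = (hi + n) * (tT + tS)             # A4 secondary index, nonkey: one I/O per matching record
    print(a1, a2, a3, a4)                 # 14.0 16.4 16.5 53.3 (milliseconds)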

Selections Involving Comparisons
 Can implement selections of the form σA≤V(r) or σA≥V(r) by using
   • a linear file scan,
   • or by using indices in the following ways:
 A5 (clustering index, comparison). (Relation is sorted on A)
   • For σA≥V(r) use index to find first tuple ≥ v and scan relation sequentially from there
   • For σA≤V(r) just scan relation sequentially till first tuple > v; do not use index
 A6 (nonclustering index, comparison).
   • For σA≥V(r) use index to find first index entry ≥ v and scan index sequentially from there, to find pointers to records.
   • For σA≤V(r) just scan leaf pages of index finding pointers to records, till first entry > v
   • In either case, retrieve records that are pointed to
       requires an I/O per record; linear file scan may be cheaper!

Implementation of Complex Selections
 Conjunction: σθ1∧θ2∧...∧θn(r)
 A7 (conjunctive selection using one index).
   • Select a combination of θi and algorithms A1 through A7 that results in the least cost for σθi(r).
   • Test other conditions on tuple after fetching it into memory buffer.
 A8 (conjunctive selection using composite index).
   • Use appropriate composite (multiple-key) index if available.
 A9 (conjunctive selection by intersection of identifiers) (see the Python sketch after this slide group).
   • Requires indices with record pointers.
   • Use corresponding index for each condition, and take intersection of all the obtained sets of record pointers.
   • Then fetch records from file
   • If some conditions do not have appropriate indices, apply test in memory.

Algorithms for Complex Selections
 Disjunction: σθ1∨θ2∨...∨θn(r)
 A10 (disjunctive selection by union of identifiers).
   • Applicable if all conditions have available indices.
       Otherwise use linear scan.
   • Use corresponding index for each condition, and take union of all the obtained sets of record pointers.
   • Then fetch records from file
 Negation: σ¬θ(r)
   • Use linear scan on file
   • If very few records satisfy ¬θ, and an index is applicable to θ
       Find satisfying records using index and fetch from file

Example: External Sorting Using Sort-Merge
(Figure: worked example of run creation and merging.)
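A small sketch of A9 and A10 (the index_lookup callables are hypothetical stand-ins for per-condition index scans returning record pointers): intersect the pointer sets for a conjunction, union them for a disjunction, then fetch the records from the file.

    def conjunctive_selection(index_lookups):
        # A9: intersection of the record-pointer sets obtained from each condition's index.
        return set.intersection(*(set(lookup()) for lookup in index_lookups))

    def disjunctive_selection(index_lookups):
        # A10: union of the record-pointer sets; applicable only if every condition has an index.
        return set.union(*(set(lookup()) for lookup in index_lookups))

    rids = conjunctive_selection([lambda: {1, 2, 3}, lambda: {2, 3, 5}])
    print(rids)   # {2, 3}: fetch these records, then test any remaining conditions in memory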

External Sort-Merge
Let M denote memory size (in pages).
1. Create sorted runs. Let i be 0 initially.
   Repeatedly do the following till the end of the relation:
   (a) Read M blocks of relation into memory
   (b) Sort the in-memory blocks
   (c) Write sorted data to run Ri; increment i.
   Let the final value of i be N
2. Merge the runs (next slide)…..

External Sort-Merge (Cont.)
2. Merge the runs (N-way merge). We assume (for now) that N < M.
   1. Use N blocks of memory to buffer input runs, and 1 block to buffer output. Read the first block of each run into its buffer page
   2. repeat
      1. Select the first record (in sort order) among all buffer pages
      2. Write the record to the output buffer. If the output buffer is full write it to disk.
      3. Delete the record from its input buffer page.
         If the buffer page becomes empty then read the next block (if any) of the run into the buffer.
   3. until all input buffer pages are empty

External Sort-Merge (Cont.)
 If N ≥ M, several merge passes are required.
   • In each pass, contiguous groups of M - 1 runs are merged.
   • A pass reduces the number of runs by a factor of M - 1, and creates runs longer by the same factor.
       E.g. If M=11, and there are 90 runs, one pass reduces the number of runs to 9, each 10 times the size of the initial runs
   • Repeated passes are performed till all runs have been merged into one.

External Merge Sort (Cont.)
 Cost analysis (worked example after this slide group):
   • 1 block per run leads to too many seeks during merge
       Instead use bb buffer blocks per run
         • read/write bb blocks at a time
       Can merge ⌊M/bb⌋–1 runs in one pass
   • Total number of merge passes required: ⌈log⌊M/bb⌋–1(br/M)⌉
   • Block transfers for initial run creation as well as in each pass is 2br
       for final pass, we don’t count write cost
         • we ignore final write cost for all operations since the output of an operation may be sent to the parent operation without being written to disk
 Thus total number of block transfers for external sorting:
   br (2⌈log⌊M/bb⌋–1(br / M)⌉ + 1)
   • Seeks: next slide

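The pass-count and transfer formulas above, evaluated for assumed values matching the 90-run example (br = 990 blocks, M = 11 pages, bb = 1 buffer block per run).

    from math import ceil, log

    def external_sort_cost(br, M, bb):
        initial_runs = ceil(br / M)
        fan_in = (M // bb) - 1                      # runs merged per pass
        passes = ceil(log(initial_runs, fan_in))    # number of merge passes
        transfers = br * (2 * passes + 1)           # run creation plus 2*br per pass, final write not counted
        return passes, transfers

    print(external_sort_cost(br=990, M=11, bb=1))   # (2, 4950): 90 initial runs, merged 10 at a time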

External Merge Sort (Cont.)
 Cost of seeks
   • During run generation: one seek to read each run and one seek to write each run
       2⌈br / M⌉
   • During the merge phase
       Need 2⌈br / bb⌉ seeks for each merge pass
         • except the final one which does not require a write
 Total number of seeks:
   2⌈br / M⌉ + ⌈br / bb⌉ (2⌈log⌊M/bb⌋–1(br / M)⌉ – 1)

Nested-Loop Join
 To compute the theta join r ⨝θ s
      for each tuple tr in r do begin
         for each tuple ts in s do begin
            test pair (tr, ts) to see if they satisfy the join condition θ
            if they do, add tr • ts to the result.
         end
      end
 r is called the outer relation and s the inner relation of the join.
 Requires no indices and can be used with any kind of join condition.
 Expensive since it examines every pair of tuples in the two relations.

Nested-Loop Join (Cont.)
 In the worst case, if there is enough memory only to hold one block of each relation, the estimated cost is
   nr ∗ bs + br block transfers, plus nr + br seeks
 If the smaller relation fits entirely in memory, use that as the inner relation.
   • Reduces cost to br + bs block transfers and 2 seeks
 Assuming worst case memory availability, cost estimate is
   • with student as outer relation:
       5000 ∗ 400 + 100 = 2,000,100 block transfers,
       5000 + 100 = 5100 seeks
   • with takes as the outer relation:
       10000 ∗ 100 + 400 = 1,000,400 block transfers and 10,400 seeks
 If smaller relation (student) fits entirely in memory, the cost estimate will be 500 block transfers.
 Block nested-loops algorithm (next slide) is preferable.

Block Nested-Loop Join
 Variant of nested-loop join in which every block of inner relation is paired with every block of outer relation.
      for each block Br of r do begin
         for each block Bs of s do begin
            for each tuple tr in Br do begin
               for each tuple ts in Bs do begin
                  Check if (tr, ts) satisfy the join condition
                  if they do, add tr • ts to the result.
               end
            end
         end
      end
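The block nested-loop pseudocode above as a runnable Python sketch, with each relation represented as a list of blocks (lists of tuples) and the join condition passed in as a predicate.

    def block_nested_loop_join(r_blocks, s_blocks, theta):
        result = []
        for Br in r_blocks:                 # each block of the outer relation is read once...
            for Bs in s_blocks:             # ...and paired with every block of the inner relation
                for tr in Br:
                    for ts in Bs:
                        if theta(tr, ts):
                            result.append(tr + ts)   # concatenate the matching tuples
        return result

    r = [[(1, "a"), (2, "b")], [(3, "c")]]
    s = [[(1, "x")], [(3, "y"), (4, "z")]]
    print(block_nested_loop_join(r, s, lambda tr, ts: tr[0] == ts[0]))
    # [(1, 'a', 1, 'x'), (3, 'c', 3, 'y')]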

Indexed Nested-Loop Join
 Index lookups can replace file scans if
   • join is an equi-join or natural join and
   • an index is available on the inner relation’s join attribute
       Can construct an index just to compute a join.
 For each tuple tr in the outer relation r, use the index to look up tuples in s that satisfy the join condition with tuple tr.
 Worst case: buffer has space for only one page of r, and, for each tuple in r, we perform an index lookup on s.
 Cost of the join: br (tT + tS) + nr ∗ c
   • Where c is the cost of traversing index and fetching all matching s tuples for one tuple of r
   • c can be estimated as cost of a single selection on s using the join condition.
 If indices are available on join attributes of both r and s, use the relation with fewer tuples as the outer relation.

Merge-Join
1. Sort both relations on their join attribute (if not already sorted on the join attributes).
2. Merge the sorted relations to join them
   1. Join step is similar to the merge stage of the sort-merge algorithm.
   2. Main difference is handling of duplicate values in join attribute — every pair with same value on join attribute must be matched
   3. Detailed algorithm in book

Merge-Join (Cont.)
 Can be used only for equi-joins and natural joins
 Each block needs to be read only once (assuming all tuples for any given value of the join attributes fit in memory)
 Thus the cost of merge join is:
   br + bs block transfers + ⌈br / bb⌉ + ⌈bs / bb⌉ seeks
   • + the cost of sorting if relations are unsorted.
 hybrid merge-join: If one relation is sorted, and the other has a secondary B+-tree index on the join attribute
   • Merge the sorted relation with the leaf entries of the B+-tree.
   • Sort the result on the addresses of the unsorted relation’s tuples
   • Scan the unsorted relation in physical address order and merge with previous result, to replace addresses by the actual tuples
       Sequential scan more efficient than random lookup

Hash-Join
 Applicable for equi-joins and natural joins.
 A hash function h is used to partition tuples of both relations
 h maps JoinAttrs values to {0, 1, ..., n}, where JoinAttrs denotes the common attributes of r and s used in the natural join.
   • r0, r1, . . ., rn denote partitions of r tuples
       Each tuple tr ∈ r is put in partition ri where i = h(tr[JoinAttrs]).
   • s0, s1, . . ., sn denote partitions of s tuples
       Each tuple ts ∈ s is put in partition si, where i = h(ts[JoinAttrs]).
 Note: In book, Figure 12.10, ri is denoted as Hri, si is denoted as Hsi and n is denoted as nh.

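A minimal sketch of the partitioning step described above, using Python’s built-in hash as a stand-in for h: tuples of r and s with the same join-attribute value always land in partitions with the same index, so each (ri, si) pair can then be joined independently.

    def partition(tuples, join_attr, n):
        parts = [[] for _ in range(n + 1)]              # partitions 0..n, as on the slide
        for t in tuples:
            parts[hash(t[join_attr]) % (n + 1)].append(t)
        return parts

    r = [{"id": 1, "a": "x"}, {"id": 2, "a": "y"}, {"id": 3, "a": "z"}]
    s = [{"id": 1, "b": "p"}, {"id": 3, "b": "q"}]
    r_parts, s_parts = partition(r, "id", 3), partition(s, "id", 3)
    # Each pair (r_parts[i], s_parts[i]) is now joined on its own,
    # e.g. by building an in-memory hash index on s_parts[i] and probing it with r_parts[i].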

Hash-Join Algorithm
The hash-join of r and s is computed as follows.
1. Partition the relation s using hashing function h. When partitioning a relation, one block of memory is reserved as the output buffer for each partition.
2. Partition r similarly.
3. For each i:
   (a) Load si into memory and build an in-memory hash index on it using the join attribute. This hash index uses a different hash function than the earlier one h.
   (b) Read the tuples in ri from the disk one by one. For each tuple tr locate each matching tuple ts in si using the in-memory hash index. Output the concatenation of their attributes.
Relation s is called the build input and r is called the probe input.

Hash-Join Algorithm (Cont.)
 The value n and the hash function h is chosen such that each si should fit in memory.
   • Typically n is chosen as ⌈bs/M⌉ * f where f is a “fudge factor”, typically around 1.2
   • The probe relation partitions ri need not fit in memory
 Recursive partitioning required if number of partitions n is greater than number of pages M of memory.
   • instead of partitioning n ways, use M – 1 partitions for s
   • Further partition the M – 1 partitions using a different hash function
   • Use same partitioning method on r
   • Rarely required: e.g., with block size of 4 KB, recursive partitioning not needed for relations of < 1 GB with memory size of 2 MB, or relations of < 36 GB with memory of 12 MB

Handling of Overflows
 Partitioning is said to be skewed if some partitions have significantly more tuples than some others
 Hash-table overflow occurs in partition si if si does not fit in memory. Reasons could be
   • Many tuples in s with same value for join attributes
   • Bad hash function
 Overflow resolution can be done in build phase
   • Partition si is further partitioned using different hash function.
   • Partition ri must be similarly partitioned.
 Overflow avoidance performs partitioning carefully to avoid overflows during build phase
   • E.g. partition build relation into many partitions, then combine them
 Both approaches fail with large numbers of duplicates
   • Fallback option: use block nested loops join on overflowed partitions

Cost of Hash-Join
 If recursive partitioning is not required: cost of hash join is
   3(br + bs) + 4 ∗ nh block transfers + 2(⌈br / bb⌉ + ⌈bs / bb⌉) seeks
 If recursive partitioning required:
   • number of passes required for partitioning build relation s to less than M blocks per partition is ⌈log⌊M/bb⌋–1(bs/M)⌉
   • best to choose the smaller relation as the build relation.
   • Total cost estimate is:
       2(br + bs)⌈log⌊M/bb⌋–1(bs/M)⌉ + br + bs block transfers + 2(⌈br / bb⌉ + ⌈bs / bb⌉)⌈log⌊M/bb⌋–1(bs/M)⌉ seeks
 If the entire build input can be kept in main memory no partitioning is required
   • Cost estimate goes down to br + bs.

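The non-recursive hash-join cost formula above, evaluated for assumed values (br = 400, bs = 100, nh = 5 partitions, bb = 4 buffer blocks).

    from math import ceil

    def hash_join_cost(br, bs, nh, bb):
        # Non-recursive case: each relation is read, written out partitioned, and read again.
        transfers = 3 * (br + bs) + 4 * nh
        seeks = 2 * (ceil(br / bb) + ceil(bs / bb))
        return transfers, seeks

    print(hash_join_cost(br=400, bs=100, nh=5, bb=4))   # (1520, 250)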

Hybrid Hash-Join
 Useful when memory sizes are relatively large, and the build input is bigger than memory.
 Main feature of hybrid hash join: Keep the first partition of the build relation in memory.
 E.g. With memory size of 25 blocks, instructor can be partitioned into five partitions, each of size 20 blocks.
   • Division of memory:
       The first partition occupies 20 blocks of memory
       1 block is used for input, and 1 block each for buffering the other 4 partitions.
 teaches is similarly partitioned into five partitions each of size 80
   • the first is used right away for probing, instead of being written out
 Cost of 3(80 + 320) + 20 + 80 = 1300 block transfers for hybrid hash join, instead of 1500 with plain hash-join.
 Hybrid hash-join most useful if M >> √bs

Complex Joins
 Join with a conjunctive condition:
   r ⨝θ1∧θ2∧...∧θn s
   • Either use nested loops/block nested loops, or
   • Compute the result of one of the simpler joins r ⨝θi s
       final result comprises those tuples in the intermediate result that satisfy the remaining conditions
         θ1 ∧ . . . ∧ θi–1 ∧ θi+1 ∧ . . . ∧ θn
 Join with a disjunctive condition
   r ⨝θ1∨θ2∨...∨θn s
   • Either use nested loops/block nested loops, or
   • Compute as the union of the records in individual joins r ⨝θi s:
       (r ⨝θ1 s) ∪ (r ⨝θ2 s) ∪ . . . ∪ (r ⨝θn s)

ACID Properties
A transaction is a unit of program execution that accesses and possibly updates various data items. To preserve the integrity of data the database system must ensure:
 Atomicity. Either all operations of the transaction are properly reflected in the database or none are.
 Consistency. Execution of a transaction in isolation preserves the consistency of the database.
 Isolation. Although multiple transactions may execute concurrently, each transaction must be unaware of other concurrently executing transactions. Intermediate transaction results must be hidden from other concurrently executed transactions.
   • That is, for every pair of transactions Ti and Tj, it appears to Ti that either Tj finished execution before Ti started, or Tj started execution after Ti finished.
 Durability. After a transaction completes successfully, the changes it has made to the database persist, even if there are system failures.

Transaction State
 Active – the initial state; the transaction stays in this state while it is executing
 Partially committed – after the final statement has been executed.
 Failed – after the discovery that normal execution can no longer proceed.
 Aborted – after the transaction has been rolled back and the database restored to its state prior to the start of the transaction. Two options after it has been aborted:
   • restart the transaction
       can be done only if no internal logical error
   • kill the transaction
 Committed – after successful completion.
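A tiny sketch (not from the slides) of the legal transitions between the transaction states listed above, useful as a quick self-check.

    # active -> partially committed or failed; partially committed -> committed or failed; failed -> aborted.
    TRANSITIONS = {
        "active":              {"partially committed", "failed"},
        "partially committed": {"committed", "failed"},
        "failed":              {"aborted"},
        "aborted":             set(),      # restart creates a new transaction; kill simply ends it
        "committed":           set(),
    }

    def advance(state, new_state):
        if new_state not in TRANSITIONS[state]:
            raise ValueError(f"illegal transition {state} -> {new_state}")
        return new_state

    state = advance(advance("active", "partially committed"), "committed")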
All material above is condensed from the Database System Concepts, 7th Edition slides, ©Silberschatz, Korth and Sudarshan.
