ch12 4
Database System Concepts 12.1 ©Silberschatz, Korth and Sudarshan Database System Concepts 12.2 ©Silberschatz, Korth and Sudarshan
Indexing techniques evaluated on basis of:
- Access types supported efficiently, e.g.,
  - records with a specified value in the attribute,
  - or records with an attribute value falling in a specified range of values.
- Access time
- Insertion time
- Deletion time
- Space overhead

- In an ordered index, index entries are stored sorted on the search-key value. E.g., author catalog in library.
- Primary index: in a sequentially ordered file, the index whose search key specifies the sequential order of the file.
  - Also called clustering index
  - The search key of a primary index is usually but not necessarily the primary key.
- Secondary index: an index whose search key specifies an order different from the sequential order of the file. Also called non-clustering index.
- Index-sequential file: ordered sequential file with a primary index.
Dense Index Files
- Dense index: an index record appears for every search-key value in the file.

Sparse Index Files
- Sparse index: contains index records for only some search-key values.
  - Applicable when records are sequentially ordered on search-key
- To locate a record with search-key value K we:
  - Find the index record with the largest search-key value ≤ K
  - Search the file sequentially starting at the record to which the index record points
- Less space and less maintenance overhead for insertions and deletions.
- Generally slower than dense index for locating records.
- Good tradeoff: sparse index with an index entry for every block in the file, corresponding to the least search-key value in the block.
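The sparse-index lookup above can be sketched in a few lines of Python. This is a minimal illustration, assuming one index entry per block (the "good tradeoff" case) and an in-memory list of blocks standing in for the file. The function name `sparse_lookup`, the branch names and the account numbers are made-up sample data, not part of the slides.

```python
from bisect import bisect_right

def sparse_lookup(index_keys, blocks, key):
    """Return the record stored under `key`, or None if it is absent.

    index_keys[i] is the least search key in blocks[i]; each block is a
    list of (key, record) pairs, and the whole file is sorted on key.
    """
    # Index record with the largest search-key value <= key.
    pos = bisect_right(index_keys, key) - 1
    if pos < 0:
        return None  # key is smaller than every key in the file
    # Sequential search starting at the block the index record points to.
    for block in blocks[pos:]:
        for k, record in block:
            if k == key:
                return record
            if k > key:
                return None  # passed the place where key would be
    return None

# Hypothetical sample file: three blocks of two records each.
blocks = [
    [("Brighton", "A-217"), ("Downtown", "A-101")],
    [("Mianus", "A-215"), ("Perryridge", "A-102")],
    [("Redwood", "A-222"), ("Round Hill", "A-305")],
]
index_keys = [block[0][0] for block in blocks]  # one entry per block
```

With this data, `sparse_lookup(index_keys, blocks, "Perryridge")` follows the index entry for "Mianus" and scans that block.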
Multilevel Index (Cont.)

Index Update: Deletion
- If the deleted record was the only record in the file with its particular search-key value, the search-key is deleted from the index also.
- Single-level index deletion:
  - Dense indices: deletion of search-key is similar to file record deletion.
  - Sparse indices: if an entry for the search key exists in the index, it is deleted by replacing the entry in the index with the next search-key value in the file (in search-key order). If the next search-key value already has an index entry, the entry is deleted instead of being replaced.
Secondary Index on balance field of account

Primary and Secondary Indices
- Secondary indices have to be dense.
- Indices offer substantial benefits when searching for records.
- When a file is modified, every index on the file must be updated. Updating indices imposes overhead on database modification.
- Sequential scan using primary index is efficient, but a sequential scan using a secondary index is expensive
  - each record access may fetch a new block from disk
B+-Tree Node Structure
- Typical node
  - Ki are the search-key values
  - Pi are pointers to children (for non-leaf nodes) or pointers to records or buckets of records (for leaf nodes).
- The search-keys in a node are ordered
    K1 < K2 < K3 < . . . < Kn–1

Leaf Nodes in B+-Trees
Properties of a leaf node:
- For i = 1, 2, . . ., n–1, pointer Pi either points to a file record with search-key value Ki, or to a bucket of pointers to file records, each record having search-key value Ki. Only need bucket structure if search-key does not form a primary key.
- If Li, Lj are leaf nodes and i < j, Li’s search-key values are less than Lj’s search-key values
- Pn points to next leaf node in search-key order
- Non-leaf nodes form a multi-level sparse index on the leaf nodes. For a non-leaf node with m pointers:
  - All the search-keys in the subtree to which P1 points are less than K1
  - For 2 ≤ i ≤ m – 1, all the search-keys in the subtree to which Pi points have values greater than or equal to Ki–1 and less than Ki
  - All the search-keys in the subtree to which Pm points are greater than or equal to Km–1
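The ordering rules for leaf and non-leaf nodes are what make a search a single root-to-leaf walk. The sketch below is one way to express that in Python, assuming an in-memory tree. The `Node` class, the function names and the three-leaf sample tree are illustrative only, not taken from the slides.

```python
from bisect import bisect_right

class Node:
    """B+-tree node. Non-leaf: keys K1..K(m-1) and m child pointers.
    Leaf: one record pointer per key plus a pointer to the next leaf (Pn)."""
    def __init__(self, keys, children=None, records=None, next_leaf=None):
        self.keys = keys
        self.children = children    # None for leaf nodes
        self.records = records      # None for non-leaf nodes
        self.next_leaf = next_leaf

def find(root, key):
    node = root
    while node.children is not None:
        # The number of keys <= key selects the child: keys below K1 go to
        # P1, keys in [K(i-1), Ki) go to Pi, keys >= K(m-1) go to Pm.
        node = node.children[bisect_right(node.keys, key)]
    for k, record in zip(node.keys, node.records):
        if k == key:
            return record
    return None

def scan_leaves(leaf):
    """Follow the next-leaf pointers to visit keys in search-key order."""
    while leaf is not None:
        yield from leaf.keys
        leaf = leaf.next_leaf

# Hypothetical sample tree with n = 3 pointers per node.
leaf3 = Node(["Redwood", "Round Hill"], records=["A-222", "A-305"])
leaf2 = Node(["Mianus", "Perryridge"], records=["A-215", "A-102"], next_leaf=leaf3)
leaf1 = Node(["Brighton", "Downtown"], records=["A-217", "A-101"], next_leaf=leaf2)
root = Node(["Mianus", "Redwood"], children=[leaf1, leaf2, leaf3])
```

Buckets for non-unique search keys are left out to keep the sketch short; each leaf key maps to exactly one record here.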
Example of B+-tree

Observations about B+-trees
Updates on B+-Trees: Insertion
- Find the leaf node in which the search-key value would appear
- If the search-key value is already there in the leaf node, record is added to file and if necessary a pointer is inserted into the bucket.
- If the search-key value is not there, then add the record to the main file and create a bucket if necessary. Then:
  - If there is room in the leaf node, insert (key-value, pointer) pair in the leaf node
  - Otherwise, split the node (along with the new (key-value, pointer) entry) as discussed in the next slide.

Updates on B+-Trees: Insertion (Cont.)
- Splitting a node:
  - take the n (search-key value, pointer) pairs (including the one being inserted) in sorted order. Place the first ⌈n/2⌉ in the original node, and the rest in a new node.
  - let the new node be p, and let k be the least key value in p. Insert (k,p) in the parent of the node being split. If the parent is full, split it and propagate the split further up.
- The splitting of nodes proceeds upwards till a node that is not full is found. In the worst case the root node may be split increasing the height of the tree by 1.
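The leaf-splitting rule can be sketched as a small Python function. This is only an illustration of the ⌈n/2⌉ rule for one leaf, assuming the leaf is given as a sorted list of (key, pointer) pairs. The function name `split_leaf` and the sample keys are hypothetical, and inserting (k, p) into the parent is left to the caller.

```python
from math import ceil

def split_leaf(pairs, new_pair, n):
    """Split a full leaf of a B+-tree whose nodes hold up to n pointers.

    pairs:    the n-1 (key, pointer) pairs already in the leaf
    new_pair: the (key, pointer) pair being inserted
    Returns (left, right, k): the pairs that stay in the original node,
    the pairs for the new node p, and k, the least key value in p.
    """
    all_pairs = sorted(pairs + [new_pair], key=lambda pair: pair[0])
    cut = ceil(n / 2)                 # first ceil(n/2) pairs stay put
    left, right = all_pairs[:cut], all_pairs[cut:]
    return left, right, right[0][0]   # caller inserts (k, p) in the parent

# Hypothetical example with n = 3: a full leaf receives "Clearview".
left, right, k = split_leaf(
    [("Brighton", "r1"), ("Downtown", "r2")], ("Clearview", "r3"), n=3
)
```

Here the original node keeps Brighton and Clearview, the new node holds Downtown, and "Downtown" is the key passed up to the parent.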
Updates on B+-Trees: Deletion
- Otherwise, if the node has too few entries due to the removal, and the entries in the node and a sibling do not fit into a single node, then
  - Redistribute the pointers between the node and a sibling such that both have at least the minimum number of entries.
  - Update the corresponding search-key value in the parent of the node.
- The node deletions may cascade upwards till a node which has ⌈n/2⌉ or more pointers is found. If the root node has only one pointer after deletion, it is deleted and the sole child becomes the root.

Examples of B+-Tree Deletion
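The choice facing an underfull leaf after a deletion can be sketched as follows. The merge branch is an assumption here: this excerpt only shows the redistribution case, and the "Otherwise" implies that entries which do fit into a single node are merged. The function name `rebalance_leaves` and the sample keys are hypothetical, and the node is assumed to sit immediately to the left of its sibling.

```python
def rebalance_leaves(node, sibling, max_entries):
    """Decide how an underfull leaf and its right sibling are repaired.

    node, sibling: sorted key lists of two adjacent leaves
    max_entries:   the most keys a leaf can hold (n - 1)
    Returns (action, left, right, separator); separator is the key the
    parent should now use to separate the two leaves, or None on a merge.
    """
    combined = node + sibling
    if len(combined) <= max_entries:
        # Assumed case: everything fits in one node, so the two are merged
        # and the parent loses a pointer (which may cascade upwards).
        return "merge", combined, [], None
    # Entries do not fit in one node: share them out evenly.
    half = (len(combined) + 1) // 2
    left, right = combined[:half], combined[half:]
    return "redistribute", left, right, right[0]

# Hypothetical example with at most 3 keys per leaf (so at least 2).
result = rebalance_leaves(["Downtown"], ["Mianus", "Perryridge", "Redwood"], 3)
```

In the example both leaves end up with two keys, exactly the minimum, and the parent's separator becomes "Perryridge".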
B-Tree Index Files (Cont.)
- Advantages of B-Tree indices:
  - May use fewer tree nodes than a corresponding B+-Tree.
  - Sometimes possible to find search-key value before reaching leaf node.
- Disadvantages of B-Tree indices:
  - Only a small fraction of all search-key values are found early
  - Non-leaf nodes are larger, so fan-out is reduced. Thus B-Trees typically have greater depth than the corresponding B+-Tree
  - Insertion and deletion more complicated than in B+-Trees
  - Implementation is harder than B+-Trees.
- Typically, advantages of B-Trees do not outweigh disadvantages.

Static Hashing
- A bucket is a unit of storage containing one or more records (a bucket is typically a disk block).
- In a hash file organization we obtain the bucket of a record directly from its search-key value using a hash function.
- Hash function h is a function from the set of all search-key values K to the set of all bucket addresses B.
- Hash function is used to locate records for access, insertion as well as deletion.
- Records with different search-key values may be mapped to the same bucket; thus entire bucket has to be searched sequentially to locate a record.
Hash Functions
- Worst hash function maps all search-key values to the same bucket; this makes access time proportional to the number of search-key values in the file.
- An ideal hash function is uniform, i.e., each bucket is assigned the same number of search-key values from the set of all possible values.
- Ideal hash function is random, so each bucket will have the same number of records assigned to it irrespective of the actual distribution of search-key values in the file.
- Typical hash functions perform computation on the internal binary representation of the search-key.
  - For example, for a string search-key, the binary representations of all the characters in the string could be added and the sum modulo the number of buckets could be returned.

Handling of Bucket Overflows
- Bucket overflow can occur because of
  - Insufficient buckets
  - Skew in distribution of records. This can occur due to two reasons:
    - multiple records have same search-key value
    - chosen hash function produces non-uniform distribution of key values
- Although the probability of bucket overflow can be reduced, it cannot be eliminated; it is handled by using overflow buckets.
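The string hash function and the overflow handling described above can be put together in a short Python sketch. It assumes characters are encoded as UTF-8 bytes and that overflow buckets are chained behind the primary bucket. The class `StaticHashFile`, the bucket count and the account numbers are illustrative choices, not from the slides.

```python
def string_hash(key, n_buckets):
    """Add up the byte values of the characters; return the sum modulo
    the number of buckets."""
    return sum(key.encode("utf-8")) % n_buckets

class StaticHashFile:
    def __init__(self, n_buckets, capacity):
        self.n_buckets = n_buckets
        self.capacity = capacity   # records per bucket, e.g. one disk block
        # Each chain starts with the primary bucket; any further lists in
        # the chain are overflow buckets.
        self.chains = [[[]] for _ in range(n_buckets)]

    def insert(self, key, record):
        chain = self.chains[string_hash(key, self.n_buckets)]
        if len(chain[-1]) == self.capacity:
            chain.append([])       # bucket overflow: chain a new bucket
        chain[-1].append((key, record))

    def lookup(self, key):
        # Different keys may share a bucket, so the whole chain is
        # searched sequentially.
        chain = self.chains[string_hash(key, self.n_buckets)]
        return [rec for bucket in chain for k, rec in bucket if k == key]

# Skew from repeated search-key values: three records, bucket capacity 2.
accounts = StaticHashFile(n_buckets=10, capacity=2)
for number in ("A-101", "A-110", "A-215"):
    accounts.insert("Downtown", number)
```

The third "Downtown" record does not fit in the primary bucket, so the chain for that bucket grows by one overflow bucket.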
Example of Hash Index

Deficiencies of Static Hashing
- In static hashing, function h maps search-key values to a fixed set B of bucket addresses.
  - Databases grow with time. If initial number of buckets is too small, performance will degrade due to too many overflows.
  - If file size at some point in the future is anticipated and number of buckets allocated accordingly, significant amount of space will be wasted initially.
  - If database shrinks, again space will be wasted.
  - One option is periodic re-organization of the file with a new hash function, but it is very expensive.
- These problems can be avoided by using techniques that allow the number of buckets to be modified dynamically.
Use of Extendable Hash Structure
- Each bucket j stores a value ij; all the entries that point to the same bucket have the same values on the first ij bits.
- To locate the bucket containing search-key Kj:
  1. Compute h(Kj) = X
  2. Use the first i high order bits of X as a displacement into bucket address table, and follow the pointer to appropriate bucket
- To insert a record with search-key value Kj
  - follow same procedure as look-up and locate the bucket, say j.
  - If there is room in the bucket j insert record in the bucket.
  - Else the bucket must be split and insertion re-attempted (next slide.)
    - Overflow buckets used instead in some cases (will see shortly)

Updates in Extendable Hash Structure
To split a bucket j when inserting record with search-key value Kj:
- If i > ij (more than one pointer to bucket j)
  - allocate a new bucket z, and set ij and iz to the old ij + 1.
  - make the second half of the bucket address table entries pointing to j to point to z
  - remove and reinsert each record in bucket j.
  - recompute new bucket for Kj and insert record in the bucket (further splitting is required if the bucket is still full)
- If i = ij (only one pointer to bucket j)
  - increment i and double the size of the bucket address table.
  - replace each entry in the table by two entries that point to the same bucket.
  - recompute new bucket address table entry for Kj. Now i > ij so use the first case above.
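The lookup, insert and split steps above can be sketched as a small in-memory Python class. Several things here are assumptions made for illustration: CRC-32 from the standard `zlib` module stands in for the hash function h, buckets are Python lists, and the class and method names are made up. The sketch is intended for distinct keys; many records with the same key would keep forcing splits, which is where the slides use overflow buckets instead.

```python
import zlib

HASH_BITS = 32  # width of the hash value X = h(K)

def h(key):
    """Illustrative 32-bit hash (CRC-32); an assumption, not the slides' h."""
    return zlib.crc32(key.encode("utf-8"))

class Bucket:
    def __init__(self, depth):
        self.depth = depth   # i_j: leading hash bits shared by all keys here
        self.items = []      # (key, record) pairs

class ExtendableHash:
    def __init__(self, capacity=2):
        self.capacity = capacity
        self.i = 0                  # hash bits used by the address table
        self.table = [Bucket(0)]    # bucket address table, 2**i entries

    def _slot(self, key):
        # The first i high-order bits of X are the table displacement.
        return h(key) >> (HASH_BITS - self.i)

    def lookup(self, key):
        bucket = self.table[self._slot(key)]
        return [rec for k, rec in bucket.items if k == key]

    def insert(self, key, record):
        while True:
            bucket = self.table[self._slot(key)]
            if len(bucket.items) < self.capacity or bucket.depth == HASH_BITS:
                # Room in bucket j. (If no hash bits are left to split on,
                # the list grows past capacity, standing in for overflow.)
                bucket.items.append((key, record))
                return
            self._split(bucket)     # then re-attempt the insertion

    def _split(self, bucket):
        if bucket.depth == self.i:
            # Only one pointer to the bucket: double the table, replacing
            # each entry by two entries that point to the same bucket.
            self.table = [b for b in self.table for _ in range(2)]
            self.i += 1
        # Now i > i_j: the second half of the entries point to new bucket z.
        slots = [s for s, b in enumerate(self.table) if b is bucket]
        bucket.depth += 1
        new_bucket = Bucket(bucket.depth)
        for s in slots[len(slots) // 2:]:
            self.table[s] = new_bucket
        # Remove and reinsert each record of the old bucket.
        old_items, bucket.items = bucket.items, []
        for key, record in old_items:
            self.table[self._slot(key)].items.append((key, record))
```

After any sequence of inserts the table has 2**i entries, and a bucket with local depth ij is pointed to by 2**(i - ij) of them.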
Example (Cont.)
- Hash structure after insertion of one Brighton and two Downtown records

Example (Cont.)
- Hash structure after insertion of Mianus record
Extendable Hashing vs. Other Schemes

Comparison of Ordered Indexing and Hashing
Indices on Multiple Attributes
Suppose we have an index on combined search-key (branch-name, balance).
- With the where clause
    where branch-name = “Perryridge” and balance = 1000
  the index on the combined search-key will fetch only records that satisfy both conditions.
  Using separate indices is less efficient: we may fetch many records (or pointers) that satisfy only one of the conditions.
- Can also efficiently handle
    where branch-name = “Perryridge” and balance < 1000
- But cannot efficiently handle
    where branch-name < “Perryridge” and balance = 1000
  May fetch many records that satisfy the first but not the second condition.

Grid Files
- Structure used to speed the processing of general multiple search-key queries involving one or more comparison operators.
- The grid file has a single grid array and one linear scale for each search-key attribute. The grid array has number of dimensions equal to number of search-key attributes.
- Multiple cells of grid array can point to same bucket
- To find the bucket for a search-key value, locate the row and column of its cell using the linear scales and follow pointer
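Both structures for the (branch-name, balance) search key can be sketched in Python. The first half models the combined index as a sorted list of tuples, which shows why an equality on branch-name plus a range on balance is one contiguous slice, while a range on branch-name plus an equality on balance forces a scan. The second half models a grid file lookup. All data, scale boundaries, bucket names and function names are hypothetical, and treating a value equal to a scale boundary as belonging to the upper cell is an assumption.

```python
from bisect import bisect_left, bisect_right

# Combined search-key index: entries kept sorted on (branch_name, balance).
index = sorted([
    ("Brighton", 750), ("Downtown", 500), ("Downtown", 600),
    ("Mianus", 700), ("Perryridge", 400), ("Perryridge", 900),
    ("Perryridge", 1000), ("Redwood", 700),
])

def branch_eq_balance_lt(branch, limit):
    """branch-name = branch and balance < limit: one contiguous range."""
    lo = bisect_left(index, (branch, float("-inf")))
    hi = bisect_left(index, (branch, limit))
    return index[lo:hi]

def branch_lt_balance_eq(branch, balance):
    """branch-name < branch and balance = balance: every entry with a
    smaller branch name is fetched, then filtered on balance.
    Returns (matches, number of entries fetched)."""
    hi = bisect_left(index, (branch, float("-inf")))
    fetched = index[:hi]
    return [e for e in fetched if e[1] == balance], len(fetched)

# Grid file: one linear scale per attribute and a 2-D grid array of
# bucket names. Several cells may point to the same bucket.
branch_scale = ["Mianus", "Perryridge"]   # 3 rows
balance_scale = [500, 1000]               # 3 columns
grid = [
    ["b0", "b0", "b1"],
    ["b2", "b3", "b3"],
    ["b4", "b5", "b6"],
]

def grid_bucket(branch, balance):
    row = bisect_right(branch_scale, branch)    # locate the row
    col = bisect_right(balance_scale, balance)  # locate the column
    return grid[row][col]                       # follow the pointer
```

In the sample data the second query fetches four entries and keeps none of them, which is the inefficiency the slide describes.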
Grid Files (Cont.)
- During insertion, if a bucket becomes full, new bucket can be created if more than one cell points to it.
  - Idea similar to extendable hashing, but on multiple dimensions
  - If only one cell points to it, either an overflow bucket must be created or the grid size must be increased
- Linear scales must be chosen to uniformly distribute records across cells.
  - Otherwise there will be too many overflow buckets.
- Periodic re-organization to increase grid size will help.
  - But reorganization can be very expensive.
- Space overhead of grid array can be high.
- R-trees (Chapter 23) are an alternative

Bitmap Indices
- Bitmap indices are a special type of index designed for efficient querying on multiple keys
- Records in a relation are assumed to be numbered sequentially from, say, 0
  - Given a number n it must be easy to retrieve record n
    - Particularly easy if records are of fixed size
- Applicable on attributes that take on a relatively small number of distinct values
  - E.g. gender, country, state, …
  - E.g. income-level (income broken up into a small number of levels such as 0-9999, 10000-19999, 20000-50000, 50000-infinity)
- A bitmap is simply an array of bits
- In its simplest form a bitmap index on an attribute has a bitmap for each value of the attribute
  - Bitmap has as many bits as records
  - In a bitmap for value v, the bit for a record is 1 if the record has the value v for the attribute, and is 0 otherwise

- Bitmap indices are useful for queries on multiple attributes
  - not particularly useful for single attribute queries
- Queries are answered using bitmap operations
  - Intersection (and)
  - Union (or)
  - Complementation (not)
- Intersection and union each take two bitmaps of the same size and apply the operation on corresponding bits to get the result bitmap; complementation takes a single bitmap
  - E.g. 100110 AND 110011 = 100010
         100110 OR 110011 = 110111
         NOT 100110 = 011001
  - Males with income level L1: 10010 AND 10100 = 10000
    - Can then retrieve required tuples.
    - Counting number of matching tuples is even faster
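The bitmap operations above map directly onto integer bit operations. The following Python sketch stores each bitmap as an int with record 0 as the leftmost bit. The two attribute columns are assumed sample data chosen so that the bitmaps match the ones quoted in the example (10010 for males, 10100 for income level L1), and the helper names are hypothetical.

```python
def build_bitmaps(column):
    """Return {value: bitmap}, one bitmap per distinct attribute value.
    Bitmaps are ints; record 0 is the leftmost (most significant) bit."""
    n = len(column)
    bitmaps = {}
    for pos, value in enumerate(column):
        bitmaps[value] = bitmaps.get(value, 0) | (1 << (n - 1 - pos))
    return bitmaps

def to_bits(bitmap, n):
    """Render an n-bit bitmap as a string of 0s and 1s."""
    return format(bitmap, f"0{n}b")

def matching_records(bitmap, n):
    """Record numbers whose bit is set."""
    return [pos for pos in range(n) if bitmap >> (n - 1 - pos) & 1]

# Assumed sample relation with five records, numbered from 0.
gender = ["m", "f", "f", "m", "f"]
income = ["L1", "L2", "L1", "L4", "L3"]
n = len(gender)

by_gender = build_bitmaps(gender)
by_income = build_bitmaps(income)

males_l1 = by_gender["m"] & by_income["L1"]   # intersection (and)
either = by_gender["m"] | by_income["L1"]     # union (or)
# Complementation needs a mask because Python ints are unbounded.
not_male = ~by_gender["m"] & ((1 << n) - 1)
```

Counting the matches is `bin(males_l1).count("1")`; no record has to be fetched for that.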
Bitmap Indices (Cont.)
- Bitmap indices generally very small compared with relation size
  - E.g. if record is 100 bytes, space for a single bitmap is 1/800 of space used by relation.
    - If number of distinct attribute values is 8, the bitmaps together take only 1% of relation size
- Deletion needs to be handled properly
  - Existence bitmap to note if there is a valid record at a record location
  - Needed for complementation
    - not(A=v): (NOT bitmap-A-v) AND ExistenceBitmap
- Should keep bitmaps for all values, even null value
  - To correctly handle SQL null semantics for NOT(A=v):
    - intersect above result with (NOT bitmap-A-Null)

Efficient Implementation of Bitmap Operations
- Bitmaps are packed into words; a single word and (a basic CPU instruction) computes and of 32 or 64 bits at once
  - E.g. 1-million-bit maps can be anded with just 31,250 instructions (using 32-bit words)
- Counting number of 1s can be done fast by a trick:
  - Use each byte to index into a precomputed array of 256 elements each storing the count of 1s in the binary representation
    - Can use pairs of bytes to speed up further at a higher memory cost
  - Add up the retrieved counts
- Bitmaps can be used instead of Tuple-ID lists at leaf levels of B+-trees, for values that have a large number of matching records
  - Worthwhile if > 1/64 of the records have that value, assuming a tuple-id is 64 bits
  - Above technique merges benefits of bitmap and B+-tree indices
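The byte-table counting trick and the handling of NOT(A=v) with an existence bitmap and a null bitmap can be sketched in Python. Bitmaps are `bytes` objects here, which is an assumption made for illustration; a real implementation would work on machine words. The function names and the one-byte sample bitmaps are made up.

```python
# Precomputed array of 256 elements: ONES[b] is the number of 1s in byte b.
ONES = [bin(b).count("1") for b in range(256)]

def count_ones(bitmap):
    """Use each byte as an index into ONES and add up the counts."""
    return sum(ONES[b] for b in bitmap)

def bitmap_and(a, b):
    return bytes(x & y for x, y in zip(a, b))

def bitmap_not(a):
    return bytes(~x & 0xFF for x in a)

def not_equal(bitmap_v, existence, bitmap_null):
    """NOT(A = v): complement the bitmap for v, keep only locations that
    hold a valid record, then drop records whose A is null."""
    result = bitmap_and(bitmap_not(bitmap_v), existence)
    return bitmap_and(result, bitmap_not(bitmap_null))

# Hypothetical 8-record relation (record 0 is the leftmost bit).
a_eq_v    = bytes([0b10010000])   # records 0 and 3 have A = v
existence = bytes([0b11110110])   # locations 4 and 7 hold deleted records
a_is_null = bytes([0b00100000])   # record 2 has A = null
answer = not_equal(a_eq_v, existence, a_is_null)
```

The plain complement would wrongly include the deleted locations 4 and 7 and the null record 2; the two extra intersections remove them, leaving records 1, 5 and 6.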
End of Chapter

Partitioned Hashing
- Hash values are split into segments that depend on each attribute of the search-key.
    (A1, A2, . . . , An) for n attribute search-key
- Example: n = 2, for customer, search-key being (customer-street, customer-city)

    search-key value      hash value
    (Main, Harrison)      101 111
    (Main, Brooklyn)      101 001
    (Park, Palo Alto)     010 010
    (Spring, Brooklyn)    001 001
    (Alma, Palo Alto)     110 010

- To answer equality query on single attribute, need to look up multiple buckets. Similar in effect to grid files.
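The partitioned hashing example can be sketched in Python using the 3-bit segments from the table above. The per-attribute hash values are copied from that table into lookup dictionaries instead of being computed, since the slides do not say which hash function produces them; the function names are hypothetical.

```python
BITS = 3  # bits per attribute segment, as in the table

# Segment values taken from the example table.
STREET_HASH = {"Main": 0b101, "Park": 0b010, "Spring": 0b001, "Alma": 0b110}
CITY_HASH = {"Harrison": 0b111, "Brooklyn": 0b001, "Palo Alto": 0b010}

def bucket_of(street, city):
    """Concatenate the street segment and the city segment."""
    return (STREET_HASH[street] << BITS) | CITY_HASH[city]

def buckets_for_city(city):
    """An equality query on customer-city alone fixes only the last
    segment, so every possible street segment must be tried."""
    return [(s << BITS) | CITY_HASH[city] for s in range(1 << BITS)]

def buckets_for_street(street):
    """Likewise for customer-street alone: all city segments are tried."""
    return [(STREET_HASH[street] << BITS) | c for c in range(1 << BITS)]
```

A query on both attributes touches one bucket, while a query on either attribute alone touches 8 of the 64 buckets.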