Week 9
Week 9 Lecture 1
Class BSCCS2001
Materials
Module # 41
Type Lecture
Week # 9
Use "Name" Index — sorted on 'Name', search 'Pabitra Mitra' and navigate on pointer (red #)
Use "Phone" Index — sorted on 'Phone', search '84772' and navigate on pointer (rec #)
We can keep the records sorted on 'Name' or on 'Phone' (called the primary index), but not on both
Basic Concepts
Indexing mechanisms used to speed up access to desired data
For example → an author catalog in a library
Index files are typically much smaller than the original file
Hash indices → search keys are distributed uniformly across buckets using a hash function
Index evaluation metrics →
Access types supported efficiently — for example, finding records with a specified attribute value, or records with an attribute value falling in a specified range
Access time
Insertion time
Deletion time
Space overhead
Ordered Indices
In an ordered index, index entries are stored sorted on the search key value
Primary index → In a sequentially ordered file, the index whose search key specifies the sequential order of the file
The search key of a primary index is usually but not necessarily the primary key
Secondary index → An index whose search key specifies an order different from the sequential order of the file
Dense index on dept_name, with instructor file stored on dept_name
Search file sequentially starting at the record to which the index record points
Compared to dense indices →
Less space and less maintenance overhead for insertions and deletions
Good tradeoff → a sparse index with an index entry for every block in the file, corresponding to the least search-key value in the block
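A minimal sketch of the two lookup styles, assuming an in-memory index over a file of sorted blocks (names and record layout are illustrative, not from the lecture):

```python
from bisect import bisect_right

def dense_lookup(dense_index, key):
    # Dense index: one entry per search-key value
    return dense_index.get(key)

def sparse_lookup(sparse_index, blocks, key):
    # Sparse index: sorted list of (least key in block, block number),
    # one entry per block of the file
    keys = [k for k, _ in sparse_index]
    i = bisect_right(keys, key) - 1        # largest indexed key <= search key
    if i < 0:
        return None                        # key smaller than every indexed key
    _, block_no = sparse_index[i]
    for rec_key, rec in blocks[block_no]:  # scan that block sequentially
        if rec_key == key:
            return rec
    return None
```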
Index record points to a bucket that contains pointers to all the actual records with that particular search-key value
BUT → Updating indices imposes overhead on database modification — when a file is modified, every index on the
file must be updated
Sequential scan using primary index is efficient, but a sequential scan using a secondary index is expensive
Each record access may fetch a new block from the disk
Block fetch requires about 5 to 10 milliseconds, versus about 100 nanoseconds for memory access
Multilevel Index
If primary index does not fit in memory, access becomes expensive
Solution → treat primary index kept on disk as a sequential file and construct a sparse index on it
If even outer index is too large to fit in the main memory, yet another level of index can be created and so on
Indices at all levels must be updated on insertion or deletion from the file
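For instance (illustrative numbers): with 100 index entries per block, a primary index of 1,000,000 entries occupies 10,000 blocks; a sparse outer index over those blocks needs only 10,000 entries (100 blocks), which fits comfortably in memory.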
Index update — deletion
Sparse indices —
If an entry for the search key exists in the index, it is deleted by replacing the entry in the index with the next
search-key value in the file (in search-key order)
If the next search-key value already has an index entry, the entry is deleted instead of being replaced
Index update — insertion
Perform a lookup using the search-key value appearing in the record to be inserted
Dense indices → if the search-key value does not appear in the index, insert it
Sparse indices → if index stores an entry for each block of the file, no change needs to be made to the index
unless a new block is created
If a new block is created, the first search-key value appearing in the new block is inserted into the index
Multilevel insertion and deletion → algorithms are simple extensions of the single-level algorithms
Secondary Indices
Frequently, one wants to find all the records whose values in a certain field (which is not the search-key of the primary
index) satisfy some condition
Example 1 → In the instructor relation stored sequentially by the ID, we may want to find all instructors in a
particular department
Example 2 → as above, but where we want to find all instructors with a specified salary or with salary in a
specific range of values
We can have a secondary index with an index record for each search-key value
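As a minimal sketch, a secondary index can be modeled as a map from each search-key value to a bucket of record pointers (the records and field names below are illustrative):

```python
from collections import defaultdict

def build_secondary_index(records, attr):
    index = defaultdict(list)            # search-key value -> bucket
    for rid, rec in enumerate(records):
        index[rec[attr]].append(rid)     # pointer to an actual record
    return index

instructor = [
    {"ID": "10101", "dept_name": "Comp. Sci.", "salary": 65000},
    {"ID": "12121", "dept_name": "Finance", "salary": 90000},
    {"ID": "22222", "dept_name": "Physics", "salary": 95000},
]
by_dept = build_secondary_index(instructor, "dept_name")
print(by_dept["Finance"])                # -> [1]: all Finance instructors
```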
Week 9 Lecture 2
Class BSCCS2001
Materials
Module # 42
Type Lecture
Week # 9
Worst-case time (n data items in the data structure)
Between an array and a list, there is a trade-off between search and insert/delete complexity — a sorted array gives O(log n) search but O(n) insert/delete, while a linked list gives O(1) insert/delete (at a known position) but O(n) search
For a BST of n nodes, log n ≤ h < n, where h is the height of the tree
A BST is balanced if h ∼ O(log n) → that is what we desire
This is possible if the tree is kept balanced under insertions and deletions, as discussed next
Balanced Binary Search Trees
A BST is balanced if h ∼ O(log n)
Balancing guarantees may be of various types →
Worst-case
Heights of the two child subtrees of any node differ by at most one: |h_L − h_R| ≤ 1
If they differ by more than one, rebalancing is done by rotation
Randomized
Randomized BST
A BST of n keys is random if either it is empty (n = 0), or the probability that a given key is at the root is 1/n and the left and the right subtrees are themselves random
Skip List
Amortized
Splay
These data structures have the optimal complexity for the required operations →
Search: O(log n)
Insert: Search + O(1) → O(log n)
Delete: Search + O(1) → O(log n)
And they are →
2-3-4 Trees
All leaves are at the same depth
h ∼ O(log n)
Complexity of search, insert and delete → O(h) ∼ O(log n)
All data is kept in sorted order
Every node (leaf or internal) is a 2-node, 3-node or a 4-node (based on the number of links or children) and holds one,
two or three data elements, respectively
A 2-node must contain a single data item (S) and two links
A 3-node must contain two data items (S, L) and three links
A 4-node must contain three data items (S, M, L) and four links
2-3-4 Trees → Search
Search proceeds as in a BST, comparing the key against the one to three data items in a node to choose the branch to follow
2-3-4 Trees → Insert
Search for the leaf where the item belongs; if a 4-node is encountered on the way, split the node by moving the middle item to the parent node and then continue the insert
Node splitting
A 4-node is split as soon as it is encountered during a search from the root to a leaf
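A minimal sketch of the node layout and the descent used by both search and insert (splitting omitted); class and field names are illustrative:

```python
class Node234:
    def __init__(self, items, children=None):
        self.items = items               # 1 to 3 sorted data items
        self.children = children or []   # len(items) + 1 children, [] at a leaf

def search(node, key):
    if node is None:
        return False
    for i, item in enumerate(node.items):
        if key == item:
            return True
        if key < item:                   # descend into the child left of item i
            return search(node.children[i], key) if node.children else False
    # key is greater than every item: rightmost child
    return search(node.children[-1], key) if node.children else False

root = Node234([30], [Node234([10, 20]), Node234([40, 50, 60])])
print(search(root, 50), search(root, 25))   # -> True False
```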
The 4-node being split may: be the root, have a 2-node parent, or have a 3-node parent
Splitting at the root
Splitting with a 3-node parent
It ensures that the tree does not have a path with multiple 4-nodes at any point
Without top-down splitting, a single insert might need up to O(h) splits cascading up to the root
Worked example (figures) → keys inserted in order: 10, 30, 60, 20 (triggers a split of the 4-node (10, 30, 60)), 50, 40, 70, 80, 15, 90, 100
Find theItem's inorder successor and swap it with theItem (deletion will always be at the leaf)
2-3-4 Tree
Advantages
All data is kept in sorted order
Disadvantages
Uses a variety of node types — nodes must be destructed and constructed repeatedly when converting a 2-node to a 3-node, a 3-node to a 4-node, for splitting, etc
Consider only one node type with space for 3 items and 4 links
Wastes some space, but has several advantages for external data structures
Each node that is not a root or a leaf has between ⌈n/2⌉ and n children
A leaf node has between ⌈(n−1)/2⌉ and (n − 1) values
Special cases →
If the root is not a leaf, it has at least 2 children
If the root is a leaf (that is, there are no other nodes in the tree), it can have between 0 and (n − 1) values
B-Tree
Week 9 Lecture 3
Class BSCCS2001
Materials
Module # 43
Type Lecture
Week # 9
Ensures that all leaf nodes remain at the same height (like a 2-3-4 tree)
Its leaf nodes are linked together in a linked list
Example
Each node holds at most n pointers
B+ Tree: Search
Suppose we have to search for 55 in the B+ tree below
First, we reach the intermediary node that directs us to the leaf node that can contain a record for 55
So, in the intermediary node, we take the branch between the keys 50 and 75
B+ Tree: Insert
Suppose we want to insert a record 60 that goes to the 3rd leaf node, after 55
The leaf node of this tree is already full, so we cannot insert 60 there
So we have to split the leaf node, so that 60 can be inserted into the tree without affecting the fill factor, balance and order
The 3rd leaf node would have the values (50, 55, 60, 65, 70), while the key for it in its parent (intermediate) node is 50
We split the leaf node in the middle so that the tree's balance is not altered
So we can group (50, 55) and (60, 65, 70) into 2 leaf nodes
If these two are to be leaf nodes, the intermediate node cannot branch from 50 alone
It should have 60 added to it, and then we can have a pointer to the new leaf node
In a normal scenario (when the leaf is not full), it is very easy to find the leaf node where the key fits and place it there
B+ Tree: Delete
To delete 60, we have to remove 60 from the intermediate node as well as from the 4th leaf node
If we simply remove it from the intermediate node, the tree will no longer satisfy the B+ tree rules, so the intermediate node must be re-arranged as well
Index-sequential files degrade in performance as the file grows, since many overflow blocks get created
A B+ tree, in contrast, automatically re-organizes itself with small, local changes in the face of insertions and deletions
B+ Tree Index Files: Structure
A B+ tree is a rooted tree satisfying the following properties:
All paths from the root to a leaf are of the same length
Each node that is not a root node or a leaf node has between ⌈n/2⌉ and n children
If the root is a leaf (that is, there are no other nodes in the tree), it can have between 0 and (n − 1) values
B+ Tree Index Files: Non-Leaf Nodes
Non-leaf nodes form a multi-level sparse index on the leaf nodes
All the search-keys in the sub-tree to which P_1 points are less than K_1
For 2 ≤ i ≤ n − 1, all the search-keys in the sub-tree to which P_i points have values greater than or equal to K_{i−1} and less than K_i
All the search-keys in the sub-tree to which P_n points have values greater than or equal to K_{n−1}
If there are K search-key values in the file, the tree height is no more than ⌈log_{⌈n/2⌉}(K)⌉
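For instance, with n = 100 pointers per node and K = 1,000,000 search-key values, the height is at most ⌈log₅₀(1,000,000)⌉ = 4, so a lookup touches only about 4 nodes.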
Insertions and deletions to the main file can be handled efficiently, as the index can be restructured in logarithmic time
B+ Tree Index Files: Queries
Find record with search-key value V
C = root
If there are K search-key values in the file, the height of the tree is no more than ⌈log_{⌈n/2⌉}(K)⌉
The above difference is significant, since every node access may need a disk I/O, costing around 20 milliseconds
Traverse P_i even if V = K_i
As soon as we reach a leaf node C, check if C has only search-key values less than V
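A minimal sketch of this lookup, assuming each internal node stores sorted keys [K_1 .. K_k] and child pointers [P_1 .. P_{k+1}] (node layout and names are illustrative):

```python
from bisect import bisect_left

class BPNode:
    def __init__(self, keys, pointers=None, records=None):
        self.keys = keys                 # sorted search-key values
        self.pointers = pointers or []   # children (internal nodes only)
        self.records = records or []     # records (leaf nodes only)
        self.is_leaf = pointers is None

def find(root, v):
    c = root                             # C = root
    while not c.is_leaf:
        i = bisect_left(c.keys, v)       # least i such that K_i >= V
        if i < len(c.keys) and c.keys[i] == v:
            c = c.pointers[i + 1]        # V = K_i: follow P_(i+1)
        else:
            c = c.pointers[i]            # otherwise follow P_i
    i = bisect_left(c.keys, v)           # at the leaf: look for V itself
    if i < len(c.keys) and c.keys[i] == v:
        return c.records[i]
    return None                          # no record with search-key value V

leaf1 = BPNode([10, 20], records=["r10", "r20"])
leaf2 = BPNode([30, 40], records=["r30", "r40"])
root = BPNode([30], pointers=[leaf1, leaf2])
print(find(root, 30))                    # -> "r30"
```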
Find the leaf node in which the search-key value would appear
Add the record to the main file (and create a bucket if necessary)
If there is room in leaf node, insert (key-value, pointer) pair in the leaf node
Otherwise, split the node (along with the new (key-value, pointer) entry)
take the n (search-key value, pointer) pairs (including the one being inserted) in sorted order
Place the first ⌈n/2⌉ in the original node, and the rest in a new node
let the new node be p and let k be the least key value in p
Splitting of nodes proceeds upwards till a node that is not full is found
in the worst case, the root node may be split increasing the height of the tree by 1
Splitting a non-leaf node → when inserting (k, p) into an already full internal node N:
Copy N to an in-memory area M with room for n+1 pointers and n keys, and insert (k, p) into M
Copy P_1, K_1, …, K_{⌈n/2⌉−1}, P_{⌈n/2⌉} from M back into node N
Copy P_{⌈n/2⌉+1}, K_{⌈n/2⌉+1}, …, K_n, P_{n+1} from M into the newly allocated node N′
Insert (K_{⌈n/2⌉}, N′) into the parent of N
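A minimal sketch of the leaf-split step, following this section's ⌈n/2⌉ convention; entries are assumed to be (search-key value, pointer) pairs kept in sorted order (names illustrative):

```python
import math

def split_leaf(entries, new_entry, n):
    # entries: the n-1 (search-key value, pointer) pairs of a full leaf
    pairs = sorted(entries + [new_entry])   # the n pairs in sorted order
    half = math.ceil(n / 2)
    original = pairs[:half]                 # first ceil(n/2) stay in place
    p = pairs[half:]                        # the rest go to the new node p
    k = p[0][0]             # least key in p: (k, p) is inserted in the parent
    return original, p, k

full_leaf = [(50, "r50"), (55, "r55"), (65, "r65"), (70, "r70")]  # n = 5
print(split_leaf(full_leaf, (60, "r60"), 5))
```

With n = 5 this keeps (50, 55, 60) and moves (65, 70) with k = 65; the earlier worked example chose the other valid split point, and either keeps the occupancy bounds.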
Updates on B+ Trees: Deletion
Find the record to be deleted, and remove it from the main file and from the bucket (if present)
Remove (search-key value, pointer) from the leaf node if there is no bucket or if the bucket has become empty
If the node has too few entries due to the removal, and the entries in the node and a sibling fit into a single node, then
merge siblings:
Insert all the search-key values in the two nodes into a single node (the one on the left) and delete the other node
Delete the pair (K_{i−1}, P_i), where P_i is the pointer to the deleted node, from its parent, recursively using the above procedure
Otherwise, if the node has too few entries due to the removal, but the entries in the node and a sibling do not fit into a
single node, then redistribute pointers:
Redistribute the pointers between the node and a sibling such that both have more than the minimum number of
entries
The node deletions may cascade upwards till a node which has ⌈n/2⌉ or more pointers is found
If the root node has only one pointer after deletion, it is deleted and the sole child becomes the root
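A minimal sketch of the merge-vs-redistribute decision on underflow, assuming each node exposes its list of pointers (names illustrative):

```python
import math

def on_underflow(node_ptrs, sibling_ptrs, n):
    # A non-root node must keep at least ceil(n/2) pointers
    if len(node_ptrs) >= math.ceil(n / 2):
        return "ok"                       # no underflow after the removal
    if len(node_ptrs) + len(sibling_ptrs) <= n:
        return "merge siblings"           # all entries fit into a single node
    return "redistribute pointers"        # both nodes end up above the minimum

print(on_underflow(["p1"], ["p2", "p3", "p4"], 4))   # -> "merge siblings"
```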
Leaf containing Singh and Wu became underfull, and borrowed the value Kim from its left sibling
Node with "Gold" and "Katz" became underfull, and was merged with its sibling
Value separating two nodes (at the parent) is pulled down when merging
The leaf nodes in a B+ tree file organization store records, instead of pointers
Since records are larger than pointers, the maximum number of records that can be stored in a leaf node is less
than the number of pointers in a non-leaf node
Insertion and deletion are handled in the same way as insertion and deletion of entries in a B+ tree index
Good space utilization important since records use more space than pointers
To improve space utilization, involve more sibling nodes in redistribution during splits and merges
Involving 2 siblings in redistribution (to avoid split/merge where possible) results in each node having at least ⌈2n/3⌉ entries
Deletion of a tuple can be expensive if there are many duplicates on search key
Widely used
Record relocation during leaf-node splits would require updating every secondary index that stores record pointers, making node splits in B+ tree file organizations very expensive
Solution → use the primary-index search key instead of the record pointer in secondary indices
Indexing Strings
Variable length strings as keys
Variable fanout
Prefix compression
Keep enough characters to distinguish entries in the subtrees separated by the key value
B-Tree Index Files
Search keys in non-leaf nodes appear nowhere else in the B-tree; an additional pointer field for each search key in a non-leaf node must be included
Comparison of B-Tree and B+ Tree Index Files
Advantages of B-Tree indices:
Sometimes possible to find search-key value before reaching the leaf node
Week 9 Lecture 4
Class BSCCS2001
Materials
Module # 44
Type Lecture
Week # 9
Static hashing
A bucket is a unit of storage containing one or more records (a bucket is typically a disk block)
In a hash file organization we obtain the bucket of a record directly from its search-key value using a hash function
Hash function h is a function from the set of all search-key values K to the set of all bucket addresses B
Hash function is used to locate records for access, insertion as well as deletion
Records with different search-key values may be mapped to the same bucket; thus, entire bucket has to be searched
sequentially to locate a record
The hash function returns the sum of the binary representations of the characters modulo 10
For example
h(Music) = 1, h(History) = 2, h(Physics) = 3, h(Elec. Eng.) = 3
Hash file organization of instructor file, using dept_name as key
Hash functions
Worst hash function maps all search-key values to the same bucket; this makes access time proportional to the
number of search-key values in the file
An ideal hash function is uniform, i.e., each bucket is assigned the same number of search-key values from the set of all
possible values
An ideal hash function is random, so each bucket will have the same number of records assigned to it irrespective of the
actual distribution of search-key values in the file
Typical hash functions perform computation on the internal binary representation of the search-key
For example, for a string search-key, the binary representations of all the characters in the string could be added
and the sum modulo the number of buckets could be returned
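A minimal sketch of such a function; the actual bucket numbers depend on the character encoding and the bucket count, so they need not match the figures above:

```python
def h(key, n_buckets):
    # Sum of the byte values of the search key, modulo the number of buckets
    return sum(key.encode()) % n_buckets

for dept in ["Music", "History", "Physics", "Elec. Eng."]:
    print(dept, "->", h(dept, 10))
```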
Bucket overflow can occur because of insufficient buckets or skew in the distribution of records
Overflow chaining → the overflow buckets of a given bucket are chained together in a linked list (this scheme is called closed hashing)
An alternative, called open hashing, which does not use overflow buckets, is not suitable for database
applications
Hash Indices
Hashing can be used not only for file organization, but also for index-structure creation
A hash index organizes the search keys, with their associated record pointers, into a hash file structure
If the file itself is organized using hashing, a separate primary hash index on it using the same search-key is
unnecessary
However, we use the term hash index to refer to both secondary index structures and hash organized files
Hash index on instructor, on attribute ID
If the initial number of buckets is too small, and the file grows, performance will degrade due to too many
overflows
If space is allocated for anticipated growth, a significant amount of space will be wasted initially (and the
buckets will be underfull)
One solution → Periodic re-organization of the file with a new hash function
Dynamic Hashing
Good for database that grows and shrinks in size
Hash function generates values over a large range — typically b-bit integers with b = 32
At any time, use only a prefix of the hash function to index into a table of bucket addresses
Bucket address table size = 2^i; initially, i = 0
Value of i grows and shrinks as the size of the database grows and shrinks
Multiple entries in the bucket address table may point to a bucket (Why?)
The number of buckets also changes dynamically due to coalescing and splitting of buckets
Each bucket j stores a value i_j, with i_j ≤ i
All the entries that point to the same bucket have the same values on the first i_j bits
To locate the bucket containing search-key value K_j →
Compute h(K_j) = X
Use the first i high-order bits of X as a displacement into the bucket address table and follow the pointer to the appropriate bucket
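A minimal sketch of this lookup, assuming a 32-bit hash value and a bucket address table indexed by the first i high-order bits (the hash choice and names are illustrative):

```python
import hashlib

B = 32                                   # width of the hash value in bits

def h(key):
    # Any hash with a large range works; md5 is used here only for illustration
    return int.from_bytes(hashlib.md5(str(key).encode()).digest()[:4], "big")

def lookup(bucket_table, i, key):
    x = h(key)                           # compute h(K_j) = X
    idx = x >> (B - i)                   # first i high-order bits of X
    return bucket_table[idx]             # follow the pointer to the bucket

buckets = ["bucket-0", "bucket-1"]       # bucket address table with i = 1
print(lookup(buckets, 1, "Srinivasan"))
```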
To insert a record with search-key value K_j, locate its bucket as above; if the bucket is full, split it and re-attempt the insert (overflow buckets are used instead in some cases, e.g., when many records share the same search-key value)
To split bucket j →
If i > i_j (more than one table entry points to bucket j): allocate a new bucket z; set i_j and i_z to the old i_j + 1; make half of the entries that pointed to j point to z; re-distribute the records of bucket j between j and z
Else (i = i_j, only one entry points to bucket j): increment i and double the size of the bucket address table; replace each entry in the table by 2 entries that point to the same bucket; then split as above
Re-compute the new bucket for K_j and insert the record in that bucket (further splitting is required if the bucket is still full)
The bucket itself can be removed if it becomes empty (with appropriate updates to the bucket address table)
Coalescing of buckets can be done (can coalesce only with a "buddy" bucket having the same value of i_j and the same i_j − 1 bit prefix, if it is present)
Note → decreasing bucket address table size is an expensive operation and should be done only if number of
buckets becomes much smaller than the size of the table
Example
Initial hash structure; bucket size = 2
Insert "Mozart", "Srinivasan" and "Wu" records
Insert "Gold" and "El Said" records
Insert "Singh", "Califieri", "Crick" and "Brandt" records
Extendable Hashing vs Other Schemes
Benefits of extendable hashing → hash performance does not degrade as the file grows; minimal space overhead
Disadvantages → extra level of indirection to find a record; bucket address table may itself become very big (larger than memory)
Linear hashing is an alternative mechanism
Is it desirable to optimize average access time at the expense of worst-case access time?
Hashing is generally better at retrieving records having a specified value of the key
In practice:
PostgreSQL supports hash indices, but discourages use due to poor performance
Bitmap Indices
Bitmap indices are a special type of index designed for efficient querying on multiple keys
For example, income-level (income broken up into small number of levels such as 0-9999, 10000-19999, 20000-
50000, 50000-infinity)
In its simplest form a bitmap index on an attribute has a bitmap for each value of the attribute
In a bitmap for value v, the bit for a record is 1 if the record has the value v for the attribute and is 0 otherwise
Intersection (AND)
Union (OR)
Complementation (NOT)
Each operation takes two bitmaps of the same size and applies the operation on corresponding bits to get the result
bitmap
For example, if a record is 100 bytes long, the space for a single bitmap is 1/800 of the space occupied by the relation
For example, two 1-million-bit maps can be ANDed with just 31,250 instructions (32 bits per word)
Use each byte to index into a pre-computed array of 256 elements each storing the count of 1s in the binary
representation
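A minimal sketch of these bitmap operations, assuming bitmaps are stored as Python ints with one bit per record (records and attribute names are illustrative):

```python
def bitmap_for(records, attr, value):
    # Bit i is 1 iff records[i] has the given value for the attribute
    bm = 0
    for i, rec in enumerate(records):
        if rec[attr] == value:
            bm |= 1 << i
    return bm

# Pre-computed array of 256 elements: count of 1s in each possible byte
POPCOUNT = [bin(b).count("1") for b in range(256)]

def count_ones(bm):
    # Count set bits one byte at a time via the pre-computed table
    total = 0
    while bm:
        total += POPCOUNT[bm & 0xFF]
        bm >>= 8
    return total

records = [
    {"gender": "f", "income": "L1"},
    {"gender": "m", "income": "L1"},
    {"gender": "f", "income": "L2"},
]
women = bitmap_for(records, "gender", "f")
low = bitmap_for(records, "income", "L1")
print(count_ones(women & low))   # records matching both predicates -> 1
```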
Bitmaps can be used instead of Tuple-ID lists at leaf levels of B+-trees for values that have a large number of
matching records
Worthwhile if > 1/64 of the records have that value, assuming a tuple-id is 64 bits
The above technique merges the benefits of bitmap and B+-tree indices
Week 9 Lecture 5
Class BSCCS2001
Materials
Module # 45
Type Lecture
Week # 9
Use create unique index to indirectly specify and enforce the condition that the search key is a candidate key
To drop an index
drop index <index-name>
A composite index key cannot exceed roughly one-half (minus some overhead) of the available space in the
data block
Specify several storage settings explicitly for the index
Create index on two columns, to speed up queries that test either the first column or both columns
If a query is going to sort on the function UPPER(ENAME), an index on the ENAME column itself would not speed up
this operation and it might be slow to call the function for each result row
A function-based index pre-computes the result of the function for each column value, speeding up queries that
use the function for searching or sorting:
Example (Oracle; index and table names illustrative):
CREATE INDEX upper_ename_idx ON emp (UPPER(ENAME));
Multiple-Key Access
Use multiple indices for certain types of queries
Example:
select ID
from instructor
where dept_name = "Finance" and salary = 80000
Possible strategies for processing this query using indices on single attributes:
Use index on dept_name to find instructors with department name Finance; test salary = 80000
Use index on salary to find instructors with a salary of 80000; test dept_name = "Finance"
Use the dept_name index to find pointers to all records pertaining to the "Finance" department; similarly use the salary index; then take the intersection of both sets of pointers
Composite search keys — entries are sorted lexicographically: (a1, a2) < (b1, b2) if a1 < b1, or a1 = b1 and a2 < b2
Hence, the order of the attributes is important
With an index on the composite search key (dept_name, salary):
For where dept_name = "Finance" and salary = 80000, the index can be used to fetch only records that satisfy both conditions
Using separate indices is less efficient — we may fetch many records (or pointers) that satisfy only one of the conditions
The composite index also efficiently handles where dept_name = "Finance" and salary < 80000
But for where dept_name < "Finance" and salary = 80000, it may fetch many records that satisfy the first but not the second condition
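A minimal sketch: Python's tuple comparison implements exactly this lexicographic order, which is why the leading attribute determines which predicates the composite index can serve (data illustrative):

```python
entries = [("Biology", 72000), ("Finance", 80000),
           ("Finance", 92000), ("History", 60000)]
entries.sort()   # (a1, a2) < (b1, b2) iff a1 < b1, or a1 = b1 and a2 < b2

# dept_name = "Finance" and salary = 80000: one contiguous run of the index
print([e for e in entries if e == ("Finance", 80000)])

# dept_name < "Finance" and salary = 80000: the index narrows only on the
# leading attribute, so many fetched entries fail the salary condition
print([e for e in entries if e[0] < "Finance" and e[1] == 80000])
```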
You must own, or have the INDEX object privilege for the corresponding table
The schema that contains the index must also have a quota for the tablespace intended to contain the index, or
the UNLIMITED TABLESPACE system privilege
To create an index in another user's schema, you must have the CREATE ANY INDEX system privilege
Function-based indexes also require the QUERY_REWRITE privilege, and the QUERY_REWRITE_ENABLED
initialization parameter must be set to TRUE
Efficiency of access and update — a better normalized design often gives better performance
The performance of a database system, however, is also significantly impacted by the way the data is physically
organized and managed
While normalization and design are startup time activities that are usually performed once at the beginning (and rarely
changed later), the performance behavior continues to evolve as the database is used over time
Collect statistics about the data of various tables to learn its patterns
Rather, we take a quick look into a few common guidelines that can help you keep your database agile in its
behaviour
Every query (access) results in a 'search' on the underlying physical data structures
Every update (insert/delete/values update) results in update of the index files — an overhead or penalty for
quicker access
Having unnecessary indexes can cause significant degradation of the performance of various operations
Index files may also occupy significant space on your disk
Create an index if you frequently want to retrieve less than 15% of the rows in a large table
The percentage varies greatly according to the relative speed of a table scan and how clustered the row data
is about the index key
Index columns used for joins to improve performance on joins of multiple tables
Primary and unique keys automatically have indexes, but you might want to create an index on a foreign key
If a query is taking too long, then the table might have grown from small to large
The column contains many nulls, but queries often select all rows having a value
In this case, a comparison that matches all the non-null values, such as
WHERE COL_X > -9.99 * power(10, 125)
is preferable to WHERE COL_X IS NOT NULL
This is because the first uses an index on COL_X (if COL_X is a numeric column)
Columns with the following characteristics are less suitable for indexing:
There are many nulls in the column and you do not search on the non-null values
The size of single index entry cannot exceed roughly one-half (minus some overhead) of the available space in
the data block
The more indexes, the more overhead is incurred as the table is altered
When rows are inserted or deleted, all indexes on the table must be updated
You must weigh the performance benefit of indexes for queries against the performance overhead of the updates
If a table is primarily read-only, you might use more indexes, but, if a table is heavily updated, you might use
fewer indexes
The order of columns in the CREATE INDEX statement can affect performance
You can create a composite index (using several columns) and the same index can be used for queries that
reference all of these columns, or just some of them
For the VENDOR_PARTS table, assume that there are 5 vendors and each vendor has about 1,000 parts
Create a composite index with the most selective column (the one with the most distinct values) first — here PART_NO, e.g. CREATE INDEX ind_vendor_parts ON VENDOR_PARTS (PART_NO, VENDOR_ID) (index name illustrative)
Composite indexes speed up queries that use the leading portion of the index:
So, queries with WHERE clauses using only the PART_NO column also run faster
With only 5 distinct values, a separate index on VENDOR_ID does not help
The database can use indexes more effectively when it has statistical information about the tables involved in the
queries
Gather statistics when the indexes are created by including the keywords COMPUTE STATISTICS in the
CREATE INDEX statement
As data is updated and the distribution of values changes, periodically refresh the statistics by calling
procedures like (in Oracle):
DBMS_STATS.GATHER_TABLE_STATISTICS
DBMS_STATS.GATHER_SCHEMA_STATISTICS
The table might be very small, or there might be many rows in the table but very few index entries
When you drop an index, all extents of the index's segment are returned to the containing tablespace and become
available for other objects in the tablespace
If you drop a table, then all associated indexes are dropped too
To drop an index, the index must be contained in your schema or you must have the DROP ANY INDEX system
privilege