Indexing Structures

Professor Navneet Goyal


Department of Computer Science & Information Systems
BITS, Pilani
Topics
 Basic Concepts
 Classification of Indices
 Tree-based Indexing
 Hash-based Indexing
 Comparison

© Prof. Navneet Goyal, BITS, Pilani


Basic Concepts
 Indexing mechanisms used to speed up access to
desired data.
 E.g., index at the end of a book
 E.g., author catalog in library
 Search Key – attribute(s) used to look up records
in a file
 Multiple indexes for a single file
 An index file consists of records (called index
entries) of the form ⟨search-key, pointer⟩



Basic Concepts
 Index files are typically much smaller than
the original file
 Kinds of indices:
 Ordered indices: search keys are stored in
sorted order (single-level)
 Tree indices: search keys are arranged in a tree
(multi-level)
 Hash indices: search keys are distributed
uniformly across “buckets” using a “hash
function”



Classification
 Single-level vs. Multi-level
 Dense vs. Sparse
 Static vs. Dynamic



Choosing an Index
 No single indexing structure suitable
for all database applications
 Can be chosen based on the following
factors:
 Access types supported efficiently. E.g.,
• records with a specified value in the attribute
• or records with an attribute value falling in a specified range
of values.
 Access time
 Insertion time
 Deletion time
 Space overhead



Primary Index
 Example of an ordered index
 In an ordered index, index entries are
stored sorted on the search key value.
E.g., topics in book index.
 Requires relation to be sorted on the
search key
 Search key should be a ‘KEY’ of the
relation
 If not, then it is called a Clustering Index



Primary Index
[Figure: sparse primary index with block anchors 10, 30, 50, 70, 90, each pointing to one of five data-file blocks that hold two records each (search-key values 10 to 100)]


Primary Index
 Primary index requires that the ordering field of
the data file have a distinct value for each record.
 Primary index is sparse
 Contains as many entries as there are blocks in
the data file (there are 5 blocks in this example
and each block can hold only 2 records).
 The first record in each block of the data file is
called anchor record of the block, or simply block
anchor.
 There can be only one primary index on a table



Clustering Index
[Figure: clustering index (Option 1) on a non-key ordering field; one index entry per distinct value 1 to 5, each pointing to the first block containing a record with that value]


Clustering Index

Figure taken from Elmasri, 4e





Clustering Index
 Data file is sorted on a non-key field
 Retrieves cluster of records for a
given search key
 Clustering index is always sparse



Secondary Index
(key)

[Figure: dense secondary index on a key field; index entries 1 to 8 in sorted order, each pointing to a record stored in arbitrary order in the data file]


Secondary Index (key)

Figure taken from Elmasri, 4e



Secondary Index
(Non-key)
 Option 1 is to include several index entries
with the same index field value, one for each
record
 This would be a dense index
 Option 2 is to have variable-length records
for the index entries, with a repeating field
for the pointer: one pointer to each block that
contains a record with a matching indexing
field value
 This would be a non-dense (sparse) index.



Secondary Index
(Non-key)
[Figure: secondary index on the non-key field Dept# of EMPLOYEE(Emp#, SSN, Name, Dept#, DOB, Salary).
OPTION 1: one dense index entry per record, e.g. five entries for Dept# = 3.
OPTION 2: one variable-length entry per distinct value, each with a list of block pointers:
1 -> B1(1); 2 -> B2(1); 3 -> B3(1), B3(2), B3(3), B3(4); 4 -> B4(1); 5 -> B5(1)]


Secondary Index (Non-key)
 Option 3 is most commonly used
 Record pointers
 Implemented using one level of indirection so that
index entries are of fixed length and have unique
field values

Figure taken from Elmasri, 4e


Types of Single-level Indexes

              Ordering Field     Nonordering Field
Key Field     Primary Index      Secondary Index (key)
Nonkey Field  Clustering Index   Secondary Index (nonkey)


Example 1: Primary Index
EMPLOYEE(NAME, SSN, ADDRESS, JOB, SAL, ... )
Suppose that:
record size R=100 bytes
block size B=1024 bytes
r=30000 records
Then, we get:
blocking factor Bfr= B div R= 1024 div 100= 10 records/block
number of file blocks b= ⌈r/Bfr⌉= ⌈30000/10⌉= 3000 blocks
For an index on the SSN field, assume the field size VSSN=9 bytes and
the block pointer size PR=6 bytes. Then:
index entry size Ri= (VSSN+PR)= (9+6)= 15 bytes
index blocking factor Bfri= B div Ri= 1024 div 15= 68 entries/block
number of index blocks bi= ⌈ri/Bfri⌉= ⌈3000/68⌉= 45 blocks
binary search needs ⌈log2 bi⌉= ⌈log2 45⌉= 6 block accesses (+ 1 for the data block)
This is compared to a binary search cost on the data file of:
⌈log2 b⌉= ⌈log2 3000⌉= 12 block accesses
Example 2: Secondary Index (Non-key Field)
EMPLOYEE(NAME, SSN, ADDRESS, JOB, SAL, ... )
Suppose that:
record size R=100 bytes
block size B=1024 bytes
r=30000 records
Then, we get:
blocking factor Bfr= B div R= 1024 div 100= 10 records/block
number of file blocks b= ⌈r/Bfr⌉= ⌈30000/10⌉= 3000 blocks
For an index on the JOB field, assume the field size VJOB=9 bytes and
the block pointer size PR=6 bytes. Then:
index entry size Ri= (VJOB+PR)= (9+6)= 15 bytes
index blocking factor Bfri= B div Ri= 1024 div 15= 68 entries/block
number of index blocks bi= ⌈ri/Bfri⌉= ⌈30000/68⌉= 442 blocks
binary search needs ⌈log2 bi⌉= ⌈log2 442⌉= 9 block accesses (+ 1 for the data block)
This is compared to the linear search cost of:
b/2= 3000/2= 1500 block accesses
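The arithmetic in Examples 1 and 2 can be replayed with a short script (a sketch using the slides' assumed record, block, and pointer sizes; the function name is ours):

```python
import math

def index_cost(num_entries, entry_size=15, block_size=1024):
    """Index blocks and binary-search block accesses for a single-level index."""
    bfri = block_size // entry_size                 # index blocking factor
    blocks = math.ceil(num_entries / bfri)          # number of index blocks
    accesses = math.ceil(math.log2(blocks)) + 1     # binary search + 1 data block
    return blocks, accesses

# Example 1: sparse primary index on SSN -> one entry per data block (3000)
print(index_cost(3000))    # (45, 7)
# Example 2: dense secondary index on JOB -> one entry per record (30000)
print(index_cost(30000))   # (442, 10)
```

The only difference between the two examples is the number of index entries: one per block for the sparse primary index, one per record for the dense secondary index.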
Properties of Single-level Indexes

Type of Index        Number of Index Entries                Dense or Sparse   Block Anchoring
Primary              No. of blocks in data file             Sparse            Yes
Clustering           No. of distinct index field values     Sparse            Yes/No*
Secondary (key)      No. of records in data file            Dense             No
Secondary (nonkey)   No. of records**, or                   Dense             No
                     No. of distinct index field values***  Sparse

* Yes if every distinct value of the ordering field starts from a new block; no otherwise
** For Option 1
*** For Options 2 & 3
Multilevel Indexes
 In all single level indexes, the index file
is always sorted on the search key
 For an index with bi blocks, a binary
search requires approximately (log2 bi)
block accesses
 The idea behind multilevel indexes is to
reduce the part of the index file that we
continue to search by a factor of bfri
(blocking factor)



Multilevel Indexes
 Blocking Factor=block size in bytes/record size
in bytes
 bfri, the blocking factor for the index, is always
greater than 2
 Search space is reduced much faster
 bfri is called the fan-out (fo) for the multilevel
index
 Searching a multilevel index requires (logfo bi)
block accesses, which is a smaller number
than for binary search if fo > 2.



Multilevel Indexes
 MLI considers the index file (first or base
level of the MLI) as an ordered file with a distinct
value for each entry
 We can create a PI for the first level
 Index to the first level is called the 2nd level
of the MLI
 2nd level is a PI, so block anchors can be used
 2nd level has one record for each block of the
1st level



Multilevel Indexes
 Blocking factor for the 2nd level & all
subsequent levels is the same as that of
the 1st level index
 If the 1st level has r1 entries, & the
blocking factor is bfri =fo, then the 1st
level needs r1/fo blocks
 r2=r1/fo
 The same process can be repeated for the
second level & we get r3 = r2/fo



Multilevel Indexes
 Note that we require the 2nd level only if
the 1st level needs more than 1 block of
disk space
 Similarly, we require the 3rd level only
if the 2nd level needs more than 1 block
of disk space
 Repeat the preceding process until all
the entries of some index level t fit in a
single block



Multilevel Indexes
 The process stops at level t, when the whole
level fits in a single block: r1/(fo)^t <= 1
 An MLI with r1 1st-level entries will have
approx. t levels, where
t = ⌈logfo r1⌉
 MLI can be used for any type of index,
primary, clustering, or secondary, as
long as the 1st level index has distinct
search key values and fixed-length
entries



Multilevel Indexes

Figure taken from Elmasri, 4e


Example: MLI
EMPLOYEE(NAME, SSN, ADDRESS, JOB, SAL, ... )
Suppose that:
record size R=100 bytes
block size B=1024 bytes
r=30000 records
The dense secondary index of Ex. 2 is converted into an MLI:
index blocking factor Bfri= B div Ri= 1024 div 15= 68 entries/block (= fan-out fo)
number of 1st-level index blocks bi= ⌈ri/Bfri⌉= ⌈30000/68⌉= 442
number of 2nd-level index blocks = ⌈442/68⌉= 7
number of 3rd-level index blocks = ⌈7/68⌉= 1
number of block accesses = t+1= 3+1= 4 block accesses
This is compared to 10 block accesses using the dense secondary index
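The level-by-level shrinkage can be traced with a small loop (a sketch using the example's numbers; the function name is ours):

```python
import math

def mli_levels(first_level_entries, fan_out):
    """Number of index levels t until a level fits in a single block."""
    blocks = math.ceil(first_level_entries / fan_out)  # 1st-level index blocks
    t = 1
    while blocks > 1:                                  # add levels until one block remains
        blocks = math.ceil(blocks / fan_out)
        t += 1
    return t

t = mli_levels(30000, 68)
print(t, t + 1)   # 3 levels, 3 + 1 = 4 block accesses
```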
Multi-Level Indexes
 Such a multi-level index is a form
of search tree ; however, insertion
and deletion of new index entries
is a severe problem because every
level of the index is an ordered
file.



Multiple-key Access
 Implicit assumption that the index is
created on only one attribute
 In many retrieval & update requests,
multiple attributes are involved
 Option 1: Multiple such indexes on a relation
can be used to answer queries
 Option 2: Have a composite search key



Multiple-key Access
 Example: List all employees with DNO=4 and
AGE=59
 Case 1: DNO has an index, but AGE does not
 Case 2: AGE has an index, but DNO does not
 Case 3: Both DNO and AGE have indexes. Each
gives a set of records or a set of pointers (to
blocks or records) as its result. The intersection of
these records or pointers yields the records
that satisfy both conditions, or the blocks in
which such records are located
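The intersection in Case 3 is just a set operation; a minimal sketch with hypothetical record-id sets (the index contents below are invented for illustration):

```python
# hypothetical single-attribute indexes: attribute value -> set of record ids
dno_index = {4: {101, 102, 103, 104}}
age_index = {59: {103, 250, 251}}

# records satisfying DNO = 4 AND AGE = 59
matches = dno_index.get(4, set()) & age_index.get(59, set())
print(matches)   # {103}
```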



Multiple-key Access
 All the above alternatives give the correct
result
 If the sets of records that satisfy each
condition (DNO=4 or AGE=59) individually
are large, yet only a few records satisfy the
combined condition, then none of the above
techniques is efficient.
 Try having a composite search key
<DNO, AGE> or <AGE, DNO>



Index Update
 Insert
 Delete
 Update (first delete & then insert)
 Compare single-level & ML indexes
 DO IT YOURSELF!!!



Indexed Sequential File
 Common file organization used in data
processing
 Ordered file with a ML primary index on its
ordering key field
 Indexed sequential file
 Used in large no. of early IBM systems
 Insertions handled by some form of overflow
file that is merged periodically with the data
file
 Index is recreated during file reorganization



IBM’s ISAM
 Indexed Sequential Access Method
 2-level index
 Closely related to the organization of the
disk



Tree-based Indexing
 ISAM & B/B+-trees
 Based on tree data structures
 Provide:
 Efficient support for range queries
 Efficient support for insertion & deletion
 Support for equality queries (not as efficient as
hash-based indexes)
 ISAM is static, whereas B- and B+-trees are
dynamic and adjust gracefully under inserts
and deletes



Search Tree
 Search tree is a special type of tree
that is used to guide the search for a
record, given the search key
 MLI is a variation of the search tree
[Figure: a node in a search tree with pointers to subtrees below it]



Search Tree

[Figure: a search tree of order p = 3]



Search Tree
 Each key value in the tree is
associated with a pointer to the record
in the data file having that value.
 Pointer could be to the disk block
containing the record
 Search tree itself can be stored on the
disk by assigning each tree node to a
disk block



Search Tree
Constraints (for a node <P1, K1, P2, K2, ..., Kq-1, Pq>):
 Search key values within a node are ordered
(increasing from L to R): K1 < K2 < ... < Kq-1
 For all values X in the subtree pointed
to by Pi, we have:
X < K1 for i = 1
Ki-1 < X < Ki for 1 < i < q
Kq-1 < X for i = q
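A lookup obeying these constraints descends exactly one subtree per node. A minimal in-memory sketch (the tuple layout (keys, children, records) is a hypothetical representation, not the on-disk node format):

```python
def tree_search(node, x):
    """Return the record pointer for key x, or None if x is absent."""
    if node is None:
        return None
    keys, children, records = node
    for i, k in enumerate(keys):
        if x == k:
            return records[i]          # key found in this node
        if x < k:                      # X < K_i: descend into subtree P_i
            return tree_search(children[i], x)
    return tree_search(children[-1], x)  # X > K_{q-1}: descend into P_q

# tiny two-level example
leaf = ([3], [None, None], ["rec3"])
root = ([5], [leaf, None], ["rec5"])
print(tree_search(root, 3), tree_search(root, 7))
```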


Search Tree
 Algorithms for inserts and deletes do not
guarantee that a search tree is balanced
 Keeping a search tree balanced HELPS!!
 Keeping search tree balanced yields a
uniform search speed regardless of the
value of the search key
 Deletions may lead to nearly empty nodes,
thus wasting space and increasing no. of
levels



B-Tree
 B-tree has additional constraints that ensure
that tree is always balanced and that the
space wasted by deletion is never excessive
 Algorithms for inserts and deletes are more
complex in order to maintain these
additional constraints
 They are mostly simple
 Become complicated only when inserts and
deletes lead to splitting and merging of
nodes respectively



B-Tree
 One or two levels of index are often
very helpful in speeding up queries
 More general structure that is used in
commercial systems
 This family of data structures is called
B-trees, & the particular variant that
is most often used is known as the B+-tree



B-Tree: Characteristics
 Automatically maintains as many
levels of index as is appropriate for
the size of the file being indexed
 Manages space on the blocks it uses
so that every block is between half
full & completely full
 Each node corresponds to a disk block



Structure of B-Trees
 Balanced tree
 All paths from the root to a leaf have the same
length
 Three layers in a B-tree
 Root
 Intermediate layer
 Leaves
 A parameter n is associated with each B-tree
 Each node holds up to n search keys & n+1 pointers
 Pick n to be as large as will allow n+1 pointers &
n keys to fit in one block



Example
 Block size = 4096 bytes
 Search key – 4 byte integer
 Pointer - 8 bytes
 Assume no header information kept in block
 We choose n such that
4n + 8(n+1) <= 4096
 n=340
 Block can hold 340 keys & 341 pointers
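The largest n satisfying 4n + 8(n+1) <= 4096 can be computed directly (a sketch with the slide's sizes; the function name is ours):

```python
def max_keys(block_size, key_size, ptr_size):
    # largest n with key_size*n + ptr_size*(n+1) <= block_size
    return (block_size - ptr_size) // (key_size + ptr_size)

n = max_keys(4096, 4, 8)
print(n, n + 1)   # 340 keys, 341 pointers
```

Check: 4·340 + 8·341 = 4088 <= 4096, while n = 341 would need 4100 bytes.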



B-Trees & B+-Trees
 An insertion into a node that is not full
is quite efficient; if a node is full the
insertion causes a split into two nodes
 Splitting may propagate to other tree
levels
 A deletion is quite efficient if a node
does not become less than half full
 If a deletion causes a node to become
less than half full, it must be merged
with neighboring nodes



Difference between B-tree
and B+-tree
 In a B-tree, pointers to data records
exist at all levels of the tree
 In a B+-tree, all pointers to data
records exist at the leaf-level nodes
 A B+-tree can therefore have fewer levels (or
a higher capacity of search values) than
the corresponding B-tree


Rules for B-Trees
 At the root, there are at least two used
pointers. All pointers point to the B-tree blocks
at the lower level
 At a leaf, the last pointer points to the next
leaf block to the right, i.e., to the block with
next higher keys
 Among the other n pointers in a leaf, at least
(n+1)/2 are used to point to data records and
unused pointers can be thought of as null and
do not point anywhere
 The ith pointer, if it is used, points to a record
with the ith key
Rules for B-Trees
 At any interior node, all the n+1 pointers can be used
to point to B-tree blocks at the next lower level
 At least (n+1)/2 of them are actually used
 If j pointers are used, then there will be j-1 keys, k1,
k2,…., kj-1.
 The 1st pointer points to a part of the B-tree where
some of the records with keys less than k1 will be
found.
 The 2nd pointer goes to that part of the tree where all
the records with keys that are at least k1, but less than
k2 will be found, and so on
 Finally, the jth pointer gets us to that part of the B-tree
where some of the records with keys greater than or
equal to kj-1 are found.



Rules for B-Trees
 Note that some of the records with
keys far below k1 or far above kj-1
may not be reachable from this block
at all, but will be reached via another
block at the same level.
 The nodes at any level, left to right,
contain keys in non-decreasing order.



Hash-based Indexing
 Intuition behind hash-based indexes
 Good for equality searches
 Useless for range searches
 Static hashing
 Dynamic hashing
 Extendible hashing
 Linear hashing



Static Hashing
 A bucket is a unit of storage containing one or more records
(a bucket is typically a disk block).
 In a hash file organization we obtain the bucket of a record
directly from its search-key value using a hash function.
 Hash function h is a function from the set of all search-key
values K to the set of all bucket addresses B.
 Hash function is used to locate records for access, insertion as
well as deletion.
 Records with different search-key values may be mapped to
the same bucket; thus entire bucket has to be searched
sequentially to locate a record.



Static Hashing
Hash file organization of the account file, using branch_name
as the key:
 There are 10 buckets
 The ith letter of the alphabet is assumed to be
represented by the integer i
 The hash function returns the sum of the
representations of the characters modulo 10
 E.g. h(Perryridge) = 5, h(Round Hill) = 3,
h(Brighton) = 3
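This hash function can be sketched as follows (assuming the ith letter of the alphabet represents the integer i, case-insensitively, and that non-letter characters such as spaces are ignored):

```python
def h(branch_name, num_buckets=10):
    # sum the alphabet positions of the letters, then take modulo the bucket count
    total = sum(ord(c.lower()) - ord('a') + 1 for c in branch_name if c.isalpha())
    return total % num_buckets

print(h("Perryridge"), h("Round Hill"), h("Brighton"))  # 5 3 3
```

Note that Round Hill and Brighton collide in bucket 3, which is why an entire bucket must be searched sequentially to locate a record.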



Static Hashing



Hash Functions
 Worst hash function maps all search-key values to
the same bucket; this makes access time
proportional to the number of search-key values in
the file.
 An ideal hash function is uniform, i.e., each bucket
is assigned the same number of search-key values
from the set of all possible values.
 Ideal hash function is random, so each bucket will
have the same number of records assigned to it
irrespective of the actual distribution of search-
key values in the file.
 Typical hash functions perform computation on the
internal binary representation of the search-key.



Bucket Overflow
 Bucket overflow can occur because of
 Insufficient buckets
 Skew in distribution of records. This can
occur due to two reasons:
• multiple records have same search-key value
• chosen hash function produces non-uniform
distribution of key values
 Although the probability of bucket
overflow can be reduced, it cannot be
eliminated; it is handled by using
overflow buckets.



Bucket Overflows
 Overflow chaining – the overflow
buckets of a given bucket are
chained together in a linked list
 The above scheme is called closed
hashing.



Bucket Overflows



Hash Indexes
 Hashing can be used not only for file
organization, but also for index-structure
creation.
 A hash index organizes the search keys, with
their associated record pointers, into a hash file
structure.
 Strictly speaking, hash indices are always
secondary indices
 if the file itself is organized using hashing, a separate
primary hash index on it using the same search-key is
unnecessary.
 However, we use the term hash index to refer to both
secondary index structures and hash organized files.



Example of Hash Index



Deficiencies of Static
Hashing
 Databases grow with time. If the initial number of
buckets is too small, performance will degrade
due to too many overflows.
 If file size at some point in the future is
anticipated and number of buckets allocated
accordingly, significant amount of space will be
wasted initially.
 If database shrinks, again space will be wasted.
 One option is periodic re-organization of the file
with a new hash function, but it is very expensive.
These problems can be avoided by using techniques that
allow the number of buckets to be modified dynamically.



Dynamic Hashing
 Long overflow chains can develop
and degrade performance.
 Extendible and Linear Hashing:
Dynamic techniques to fix this
problem.



Extendible Hashing
 Inserting a new data entry into a full bucket:
 Add an overflow page, OR
 Reorganize the file using double the no.
of buckets & redistributing the
entries
 Drawback – the entire file has to be read
& twice as many pages have to be
written



Extendible Hashing
 Idea: Use directory of pointers to
buckets, double # of buckets by
doubling the directory, splitting
just the bucket that overflowed!
 Directory much smaller than file, so
doubling it is much cheaper. Only
one page of data entries is split.
No overflow page!
 Trick lies in how hash function is
adjusted!



Extendible Hashing
 Directory is an array of size 4 (GLOBAL DEPTH = 2)
 To find the bucket for r, take the
last 'global depth' # bits of h(r)
 If h(r) = 5 = binary 101, the last two bits
are 01, so r is in the bucket pointed to by
directory entry 01

[Figure: directory entries 00, 01, 10, 11 pointing to buckets A (4*, 12*, 32*, 16*), B (1*, 5*, 21*, 13*), C (10*), and D (15*, 7*, 19*), each with LOCAL DEPTH 2]
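Taking the last 'global depth' bits of h(r) is a simple bit mask; a minimal sketch (the function name is ours):

```python
def directory_slot(hash_value, global_depth):
    # keep only the last `global_depth` bits of the hash value
    return hash_value & ((1 << global_depth) - 1)

print(directory_slot(5, 2))   # 5 = binary 101 -> slot 01 = 1
print(directory_slot(20, 3))  # 20 = binary 10100 -> slot 100 = 4
```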


Extendible Hashing
Insert: If bucket is full, split it (allocate
new page, re-distribute).

 If necessary, double the directory. (As


we will see, splitting a bucket does not
always require doubling; we can tell by
comparing global depth with local depth
for the split bucket.)



Insert h(r)=20
(Causes Doubling)
[Figure: before the insert, global depth 2 with buckets A (4*, 12*, 32*, 16*), B (1*, 5*, 21*, 13*), C (10*), D (15*, 7*, 19*). Inserting h(r) = 20 splits the full bucket A into A (32*, 16*) and its split image A2 (4*, 12*, 20*), both with local depth 3, and the directory doubles to 8 entries 000-111 (global depth 3)]
Points to Note
 20 = binary 10100. Last 2 bits (00) tell us r
belongs in A or A2. Last 3 bits needed to tell
which.
 Global depth of directory: Max # of bits needed to tell
which bucket an entry belongs to.
 Local depth of a bucket: # of bits used to determine if an
entry belongs to this bucket.
 When does bucket split cause directory doubling?
 Before insert, local depth of bucket = global depth.
Insert causes local depth to become > global depth;
directory is doubled by copying it over and `fixing’
pointer to split image page. (Use of least significant bits
enables efficient doubling via copying of directory!)



Points to Note
 Does splitting a bucket always
necessitate a directory doubling?
 Try inserting 9*
 It belongs to bucket B, which is already full
 Split bucket B, using directory
entries 001 & 101 to point to the
bucket & its split image; no doubling is needed



Points to Ponder

 Why use LSB, why not MSB?


 What if a bucket becomes empty?



Directory Doubling
Why use least significant bits in directory?
 Allows for doubling via copying!

[Figure: doubling a directory from global depth 2 to global depth 3, illustrated with 6 = 110. With least-significant bits the new directory is two verbatim copies of the old one; with most-significant bits the entries must be interleaved. Least Significant vs. Most Significant]


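With least-significant-bit indexing, doubling really is a copy: slot d and slot d + 2^depth of the new directory initially point to the same bucket, and only the split bucket's slots are then fixed up. A sketch (bucket names are hypothetical):

```python
def double_directory(directory):
    # LSB indexing: the doubled directory is two verbatim copies of the old one
    return directory + directory

d2 = ["A", "B", "C", "D"]     # global depth 2
d3 = double_directory(d2)     # global depth 3
print(d3)                     # ['A', 'B', 'C', 'D', 'A', 'B', 'C', 'D']
# 6 = binary 110: slot 6 of d3 and slot 2 (= 6 mod 4) of d2 name the same bucket
print(d3[6] == d2[6 % 4])     # True
```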
Comments on Extendible
Hashing
 If directory fits in memory, equality search
answered with one disk access; else two.
 100MB file, 100 bytes/rec, 4K pages contains 1,000,000
records (as data entries) and 25,000 directory elements;
chances are high that directory will fit in memory.
 Directory grows in spurts, and, if the distribution of hash
values is skewed, directory can grow large.
 Multiple entries with same hash value cause problems!
 Delete: If removal of data entry makes bucket
empty, can be merged with `split image’. If each
directory element points to same bucket as its split
image, can halve directory.
Extendible Hashing
 Benefits of extendible hashing:
 Hash performance does not degrade with growth of
file
 Minimal space overhead
 Disadvantages of extendible hashing
 Extra level of indirection to find desired record
 Bucket address table may itself become very big
(larger than memory)
• Need a tree structure to locate desired record in the
structure!
 Changing size of bucket address table is an
expensive operation



Linear Hashing
 Linear hashing is an alternative mechanism
which avoids these disadvantages at the
possible cost of more bucket overflows
 This is another dynamic hashing scheme, an
alternative to Extendible Hashing.
 Motivation: Ext. Hashing uses a directory
that grows by doubling… Can we do
better? (smoother growth)
 LH: split buckets from left to right,
regardless of which one overflowed
(simple, but it works!!)



Linear Hashing
 Does not require a directory
 LH provides a way to control
chains from growing too large
on average
 It accomplishes this by
expanding address space
gracefully, one chain at a time
 Achieved using chain splitting



Linear Hashing: Example
Suppose M=3, three buckets
[0], [1], and [2]
[1] = {106, 217, 151, 418, 379}
Three issues with chain splitting:
 How can a chain be split?
 Which chain should be split?
 When should a chain be split?



Linear Hashing: Example
 How can a chain be split?
 Split a chain [m] evenly into two chains using a
mod function
 Since we want to expand the address space, the
argument for a hash fn. need not be M
 Use 2M to rehash the records in [m]
 On average, mod 2M will hash half of the records
to chain [m], and the other half to chain [M+m]
[1] = {106, 217, 151, 418, 379}
Rehash using mod 2M(=6)
[1] = {217, 151, 379}
[4] = {106, 418}
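The split of chain [1] can be replayed directly (a sketch using the slide's values):

```python
M = 3
chain_1 = [106, 217, 151, 418, 379]   # all hash to 1 under x % M

# rehash with mod 2M: records either stay in [1] or move to [M + 1] = [4]
stay  = [x for x in chain_1 if x % (2 * M) == 1]
moved = [x for x in chain_1 if x % (2 * M) == 1 + M]
print(stay, moved)   # [217, 151, 379] [106, 418]
```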
Linear Hashing: Example
 Which chain to be split?
 Following possibilities:
• Split chain [0]: this will create chain [3]
• Split chain [1]: this will create chain [4]
• Split chain [2]: this will create chain [5]
 Linear hashing gets its name from the fact that
chains are designated linearly for splitting
 In the example, we will first split the chain [0],
then [1], and then [2]
 Note that this is independent of where the
insertions are taking place



Linear Hashing: Example
 Which chain to be split?
[Figure: chains [0], [1], [2] under M = 3 (all using mod 3); after chain [0] splits, chains [0] and [3] use mod 6 while [1] and [2] still use mod 3]


Linear Hashing: Example
Initially: h(x) = x mod N (N=4 here)
Assume 3 records/bucket
Insert 17: 17 mod 4 = 1

[Figure: buckets 0 = {4, 8}, 1 = {5, 9, 13}, 2 = {6}, 3 = {7, 11}; 17 hashes to bucket 1]


Linear Hashing: Example
Initially: h(x) = x mod N (N=4 here)
Assume 3 records/bucket
Insert 17: 17 mod 4 = 1, which overflows bucket 1

[Figure: bucket 1 = {5, 9, 13} is already full, so 17 goes to an overflow page]

Split bucket 0, anyway!!

Linear Hashing: Example
To split bucket 0, use another hash function h1(x):
h0(x) = x mod N, h1(x) = x mod (2*N)

[Figure: split pointer now at bucket 1; bucket 0 has split into buckets 0 = {8} and 4 = {4} using h1, while buckets 1 = {5, 9, 13} with overflow {17}, 2 = {6}, and 3 = {7, 11} are unchanged]
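The split pointer determines which hash function applies to a lookup; a minimal sketch of the addressing rule (assuming `split_ptr` counts the buckets already split in the current round):

```python
def address(x, N, split_ptr):
    b = x % N              # h0
    if b < split_ptr:      # this bucket has already been split this round
        b = x % (2 * N)    # use h1 instead
    return b

# after splitting bucket 0 (split_ptr = 1), with N = 4:
print(address(4, 4, 1), address(8, 4, 1), address(17, 4, 1))  # 4 0 1
```

So 4 now lands in the new bucket 4, 8 stays in bucket 0, and 17 (bucket 1, not yet split) is still addressed by h0.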


Q&A
Thank You
