2.8. ADS - Collision Resolution-Extendible Hashing-1
Extendible Hashing
UNIT 2 Maps and Hash Tables
● Maps and Hash Tables: Maps and Dictionaries, Hash Tables.
● Hash table representation: hash functions; collision resolution: separate
chaining, open addressing (linear probing, quadratic probing, double
hashing), rehashing, extendible hashing; comparison of hashing and skip
lists.
Dynamic Hashing
● Dynamic hashing is a technique used in computer science for the
organization and storage of data in a hash table, which allows the table
to expand or shrink dynamically as more data is added or removed. This
method helps maintain an efficient load factor, improving lookup and
storage times.
● A major drawback of the static hashing scheme is that the hash address
space is fixed. Hence, it is difficult to expand or shrink the file
dynamically.
Extendible Hashing
● Extendible hashing is a dynamic approach to managing data. In this hashing
method, flexibility is a crucial factor.
● It allows the hash table to grow or shrink as needed, accommodating varying
amounts of data without requiring a complete rehashing of the contents.
This adaptability reduces data management overhead and improves
performance.
● If either open addressing or separate chaining is used, the major
problem is that collisions can cause several blocks to be examined
during a Find, even for a well-distributed hash table. Extendible hashing
allows a Find to be performed in two disk accesses. Insertions also
require few disk accesses.
● Extendible Hashing is a dynamic hashing method wherein directories and
buckets are used to hash data.
Main features of Extendible Hashing:
● Directories: The directories store the addresses of the buckets in
pointers. An ID is assigned to each directory, and this ID may change
each time Directory Expansion takes place.
● Buckets: The buckets are used to hash the actual data.
Contd…
● Directories: These containers store pointers to buckets. The hash
function returns this directory id which is used to navigate to the
appropriate bucket. Number of Directories = 2^Global Depth.
● Buckets: They store the hashed keys. Directories point to buckets. A
bucket may have more than one pointer to it if its local depth is less
than the global depth.
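To make the directory/bucket picture concrete, here is a minimal sketch in Python (the class and variable names are my own, not from the slides): a bucket records its keys and its local depth, and the directory is just a list of 2^(global depth) bucket references, several of which may alias the same bucket.

```python
class Bucket:
    """A fixed-capacity container of hashed keys with its own local depth."""
    def __init__(self, local_depth, capacity=3):
        self.local_depth = local_depth
        self.capacity = capacity
        self.keys = []

# Global depth 2, so the directory has 2**2 = 4 entries. Both buckets
# here have local depth 1 < global depth 2, so each bucket is pointed
# to by two directory entries (ids 00/10 share b0, ids 01/11 share b1).
global_depth = 2
b0, b1 = Bucket(local_depth=1), Bucket(local_depth=1)
directory = [b0, b1, b0, b1]
```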
Contd…
● Global Depth: It is associated with the directories. It denotes the number
of bits that the hash function uses to categorize the keys. Global
Depth = number of bits in the directory id.
● Local Depth: It is the same as the Global Depth except that Local Depth
is associated with the buckets rather than the directories. The local
depth, compared against the global depth, is used to decide the action to
be performed when an overflow occurs. Local Depth is always less than or
equal to the Global Depth.
● Bucket Splitting: When the number of elements in a bucket exceeds a
particular size, then the bucket is split into two parts.
● Directory Expansion: Directory Expansion takes place when a bucket
overflows and the local depth of the overflowing bucket is equal to the
global depth.
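As a rough illustration of directory expansion, assuming the list-of-pointers layout sketched above: doubling the directory simply repeats the existing pointers, so the new entries alias the old buckets until a bucket split re-points some of them.

```python
# Before expansion: global depth 1, directory ids 0 and 1.
directory = [b0, b1]          # b0, b1 are Bucket objects as above
# Directory Expansion: global depth 1 -> 2. Entries i and i + 2 now
# alias the same bucket until a bucket split re-points one of them.
directory = directory * 2     # [b0, b1, b0, b1]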
Steps:
1. Initialize the bucket depths and the global depth of the directories.
2. Convert the data into a binary representation.
3. Consider the "global depth" number of least significant bits (LSBs) of the data.
4. Map the data according to the ID of a directory.
5. Check for the following conditions if a bucket overflows (i.e., the number of
elements in a bucket exceeds the set limit):
   a. Global depth == bucket depth: Split the bucket into two and increment the
      global depth and the bucket's depth. Rehash the elements that were
      present in the split bucket.
   b. Global depth > bucket depth: Split the bucket into two and increment the
      bucket depth only. Rehash the elements that were present in the split
      bucket.
6. Repeat the steps above for each element. (A code sketch of this procedure
appears below.)
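The steps above can be condensed into a short Python sketch. This is a minimal illustration rather than reference code from the slides: the names (ExtendibleHashTable, Bucket, _index, _split) and the integer-key hash (taking the global-depth LSBs) are assumptions chosen to match the walkthrough that follows.

```python
class Bucket:
    def __init__(self, local_depth, capacity):
        self.local_depth = local_depth
        self.capacity = capacity
        self.keys = []

    def is_full(self):
        return len(self.keys) >= self.capacity


class ExtendibleHashTable:
    def __init__(self, bucket_capacity=3):
        self.global_depth = 1
        self.bucket_capacity = bucket_capacity
        # 2**global_depth directory entries, each pointing to a bucket.
        self.directory = [Bucket(1, bucket_capacity),
                          Bucket(1, bucket_capacity)]

    def _index(self, key):
        # Steps 2-4: the 'global depth' LSBs of the key are the directory id.
        return key & ((1 << self.global_depth) - 1)

    def insert(self, key):
        bucket = self.directory[self._index(key)]
        if not bucket.is_full():
            bucket.keys.append(key)          # no overflow: done
            return
        if bucket.local_depth == self.global_depth:
            # Step 5a / Case 1: directory expansion (doubling).
            self.directory = self.directory * 2
            self.global_depth += 1
        # Step 5b / Case 2 (and the tail of Case 1): bucket split.
        self._split(bucket)
        self.insert(key)                     # retry after the split

    def _split(self, bucket):
        bucket.local_depth += 1
        new_bucket = Bucket(bucket.local_depth, self.bucket_capacity)
        high_bit = 1 << (bucket.local_depth - 1)
        # Re-point the directory entries whose new distinguishing bit is 1.
        for i in range(len(self.directory)):
            if self.directory[i] is bucket and i & high_bit:
                self.directory[i] = new_bucket
        # Rehash the elements that were present in the split bucket.
        old_keys, bucket.keys = bucket.keys, []
        for k in old_keys:
            self.directory[self._index(k)].keys.append(k)
```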
Basic Working of Extendible Hashing:
● Step 1 – Analyze Data Elements: Data elements
may exist in various forms, e.g., integer, string,
float, etc. For now, let us consider data
elements of type integer, e.g., 49.
● Step 2 – Convert into binary format: Convert
the data element into binary form. For string
elements, consider the ASCII-equivalent integer
of the starting character and then convert that
integer into binary form. Since we have 49 as
our data element, its binary form is 110001.
Contd…
● Step 3 – Check Global Depth of the directory. Suppose the global depth
of the Hash-directory is 3.
● Step 4 – Identify the Directory: Consider the ‘Global Depth’ number of
LSBs in the binary number and match it to the directory id.
● E.g., the binary obtained is 110001 and the global depth is 3. So, the
hash function will return the 3 LSBs of 110001, viz. 001.
● Step 5 – Navigation: Now, navigate to the bucket pointed by the
directory with directory-id 001.
● Step 6 – Insertion and Overflow Check: Insert the element and check whether
the bucket overflows. If an overflow is encountered, go to Step 7
followed by Step 8; otherwise, go to Step 9.
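Steps 2–4 for the running example can be checked with a small snippet (the helper name directory_index is mine): the bit mask (1 << global_depth) - 1 keeps exactly the global-depth LSBs.

```python
def directory_index(key, global_depth):
    # Keep only the 'global depth' least significant bits of the key.
    return key & ((1 << global_depth) - 1)

print(format(49, 'b'))                          # 110001
print(format(directory_index(49, 3), '03b'))    # 001 -> directory id
```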
Contd…
● Step 7 – Tackling the Overflow Condition during Data Insertion: Many times,
while inserting data into the buckets, it might happen that a bucket
overflows.
● First, check whether the local depth is less than or equal to the global depth.
Then choose one of the cases below.
● Case 1: If the local depth of the overflowing bucket is equal to the global
depth, then Directory Expansion as well as a Bucket Split needs to be
performed. Then increment the global depth and the local depth by 1,
and assign the appropriate pointers.
● Directory expansion doubles the number of directories present in the hash
structure.
● Case 2: If the local depth is less than the global depth, then only a Bucket
Split takes place. Then increment only the local depth by 1, and assign
the appropriate pointers.
Contd…
● Step 8 – Rehashing of Split-Bucket Elements:
The elements present in the overflowing
bucket that was split are rehashed w.r.t. the
new global depth of the directory.
● Step 9 – The element is successfully hashed.
Example
● Elements: 16, 4, 6, 22, 24, 10, 31, 7, 9, 20, 26
● Bucket Size: 3 (assumed)
● Hash Function: Suppose the global depth is X. Then the hash function
returns the X LSBs.
● Solution: First, calculate the binary form of each of the given numbers:
  16 - 10000
  4 - 00100
  6 - 00110
  22 - 10110
  24 - 11000
  10 - 01010
  31 - 11111
  7 - 00111
  9 - 01001
  20 - 10100
  26 - 11010
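The binary forms above are easy to verify with a line of Python (a 5-bit width is assumed to match the table):

```python
for n in [16, 4, 6, 22, 24, 10, 31, 7, 9, 20, 26]:
    print(n, '-', format(n, '05b'))   # e.g. 16 - 10000
```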
Steps:
● Initially, the global depth and local depth are always 1. Thus, the
hashing frame looks like this:
● Inserting 16: The binary format of 16 is 10000 and the global depth is 1.
The hash function returns the 1 LSB of 10000, which is 0. Hence, 16 is
mapped to the directory with id = 0.
● Inserting 4 and 6: Both 4 (100) and 6 (110) have 0 as their LSB. Hence,
they are hashed as follows:
● Inserting 22: The binary form of 22 is 10110. Its LSB is 0. The bucket
pointed to by directory 0 is already full. Hence, overflow occurs.
● As directed by Step 7, Case 1: since Local Depth = Global Depth, the bucket splits
and directory expansion takes place. Rehashing of the numbers present in the
overflowing bucket takes place after the split. And since the global depth is
incremented by 1, the global depth is now 2. Hence, 16, 4, 6, 22 are now rehashed
w.r.t. 2 LSBs. [ 16 (10000), 4 (100), 6 (110), 22 (10110) ]
Steps:
● Notice that the bucket which did not overflow has remained untouched.
But, since the number of directories has doubled, we now have two
directories, 01 and 11, pointing to the same bucket. This is because the
local depth of that bucket has remained 1, and any bucket having a local
depth less than the global depth is pointed to by more than one
directory.
● Inserting 24 and 10: 24 (11000) and 10 (01010) are hashed to the
directories with ids 00 and 10. Here, we encounter no overflow
condition.
● Inserting 31, 7, 9: All of these elements [31 (11111), 7 (00111),
9 (01001)] have either 01 or 11 as their 2 LSBs. Hence, they are mapped
to the buckets pointed to by 01 and 11. We do not encounter any
overflow condition here.
● Inserting 20: Insertion of data element 20 (10100) will again cause an
overflow: 20 maps to the bucket pointed to by 00, which is full. As
directed by Step 7, Case 1, since the local depth of the bucket = the
global depth, directory expansion (doubling) takes place along with
bucket splitting. The elements present in the overflowing bucket are
rehashed with the new global depth. Now, the new hash table looks like
this:
● Inserting 26: The global depth is now 3. Hence, the 3 LSBs of 26 (11010)
are considered. Therefore, 26 best fits in the bucket pointed to by
directory 010. That bucket overflows, and, as directed by Step 7, Case 2,
since the local depth of the bucket < the global depth (2 < 3), the
directories are not doubled; only the bucket is split and its elements
are rehashed.
Finally, the output of hashing the given list of numbers is obtained.
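Running the hedged ExtendibleHashTable sketch from earlier on this list reproduces the walkthrough: the global depth ends at 3, and (for example) keys 16 and 24 share the 000 bucket while 4 and 20 share the 100 bucket.

```python
table = ExtendibleHashTable(bucket_capacity=3)
for key in [16, 4, 6, 22, 24, 10, 31, 7, 9, 20, 26]:
    table.insert(key)

print("global depth:", table.global_depth)       # 3
for i, bucket in enumerate(table.directory):
    print(format(i, '03b'), '-> depth', bucket.local_depth,
          'keys', bucket.keys)
```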
Key Observations
● A bucket will have more than one pointer pointing to it if its local depth
is less than the global depth.
● When an overflow condition occurs in a bucket, all the entries in the
bucket are rehashed with respect to the new local depth.
● Global depth == bucket depth: Split the bucket into two and increment
the global depth and the buckets' depth.
● Global depth > bucket depth: Split the bucket into two and increment
the bucket depth only.
Advantages
● It provides a good balance between space usage and search efficiency.
● Data retrieval is less expensive (in terms of computing).
● No problem of Data-loss since the storage capacity increases
dynamically.
● It adapts dynamically to the data distribution, avoiding the need for
frequent rehashing.
● It offers predictable performance for search, insertion, and deletion
operations.
Contd…
● Dynamic Expansion: Extendible hashing allows the hash table to grow or
shrink dynamically without rehashing the entire table, making it efficient for
applications with unpredictable data growth.
● Efficient Access: Data access time remains constant (O(1)) on average, as long
as the hash function distributes keys evenly.
● Reduced Collisions: By splitting buckets when they become full, it minimizes
the likelihood of collisions, ensuring efficient data retrieval.
● Space Efficiency: Only the buckets that need to grow are split, leading to
better space utilization compared to other static hashing methods.
● No Fixed Size: There is no need to define the size of the hash table in
advance, unlike static hashing methods.
● Adaptability: Extendible hashing adapts well to large datasets and varying
workloads, making it suitable for database systems.
Disadvantages
● Overhead: Requires maintaining a directory structure (global depth and local
depth), which introduces additional memory overhead.
● Complex Implementation: The algorithm is more complex to implement
compared to simpler static hashing techniques.
● Directory Doubling: When the global depth increases, the directory size
doubles, which may lead to significant memory usage and overhead in certain
cases.
● Split Operations: Splitting buckets can be time-consuming, especially if there
are frequent insertions, leading to potential performance degradation.
● Dependency on Hash Function: The efficiency heavily depends on the quality
of the hash function. Poor hash functions can lead to skewed bucket
distributions.
● Fragmentation: Splitting buckets can cause fragmentation, where many
buckets are underutilized, leading to wasted space.
Advantages/Disadvantages
● The main advantage of extendible hashing that makes it attractive is that the
performance of the file does not degrade as the file grows, as opposed to static
external hashing, where collisions increase and the corresponding chaining effectively
increases the average number of accesses per key.
● Additionally, no space is allocated in extendible hashing for future growth; additional
buckets are allocated dynamically as needed. The space overhead for the directory
table is negligible. The maximum directory size is 2^k, where k is the number of bits
in the hash value.
● Another advantage is that splitting causes minor reorganization in most cases, since
only the records in one bucket are redistributed to the two new buckets. The only
time reorganization is more expensive is when the directory has to be doubled (or
halved).
● A disadvantage is that the directory must be searched before accessing the buckets
themselves, resulting in two block accesses instead of one as in static hashing. This
performance penalty is considered minor, and thus the scheme is considered quite
desirable for dynamic files.
Limitations Of Extendible Hashing
● The directory size may increase significantly if the record distribution is
non-uniform and many records hash to the same directory entries.
● The size of every bucket is fixed.
● Memory is wasted in pointers when the difference between the global
depth and the local depths becomes drastic.
● This method is complicated to code.
Example
● Let's take the following example to see how this hashing method works,
where:
● Data = {28, 4, 19, 1, 22, 16, 12, 0, 5, 7}
● Bucket limit = 3
● Convert the data into binary representation:
  28 = 11100
  4 = 00100
  19 = 10011
  1 = 00001
  22 = 10110
  16 = 10000
  12 = 01100
  0 = 00000
  5 = 00101
  7 = 00111
Search Complexity
● Time Complexity: O(1) (average case).
● Extendible hashing ensures efficient lookups by using a hash function and a directory
structure. The key is hashed, and the directory directly points to the appropriate bucket.
Applications
● Databases: storing dynamic data with a need for quick lookups and inserts.
● File Systems: managing large directories.
● Distributed Systems: storing hash-indexed data that grows unpredictably.
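A lookup against the sketch above takes one hash, one directory probe, and one bucket scan, mirroring the two-disk-access bound. This helper (search is my own name, not from the slides) is illustrative:

```python
def search(table, key):
    # One hash -> one directory probe -> one bucket scan.
    bucket = table.directory[key & ((1 << table.global_depth) - 1)]
    return key in bucket.keys
```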