14 Hashing

Hashing in Data Structures
Mrs. K Jyothsna Devi
SCOPE
Hashing – Search the Record
• Hashing generates a fixed-size output from a variable-size input using a mathematical (hash) function.

• Hashing determines an index or location for the storage of an item in a data structure.

Three primary components:

1. Key: A key is a string or integer passed as input to the hash function; it determines the index at which an item is stored in the data structure.

2. Hash Function: The hash function receives the input key and returns the index (hash index) of an element in an array called a hash table.

3. Hash Table: A hash table is a data structure that maps keys to values using a hash function.

– The data is stored associatively in an array, where each data value has its own unique index.
Hash Table & Hash Function Representation

[Figure: universe of keys U containing the actual keys K = {k1, ..., k5}; the hash function maps each key to an index/value slot, with h(k2) = h(k5) sharing a slot.]
• A bucket in a hash file is a unit of storage (typically a disk block) that can hold one or more records.
• A hash table uses a hash function h to compute the slot for each key.
• The element is stored in slot h(k).
• A hash function h transforms a key into an index in a hash table T[0…m-1], where m is the table size:
h : U → {0, 1, . . . , m - 1}
h(k) = k mod m, i.e., key k hashes to slot h(k)
Advantages:
• When the domain K of actual keys is much smaller than the universe U, a hash table requires much less storage space.
o It can reduce storage requirements to O(|K|).
o It still achieves O(1) search time in the average case, though not the worst case.
• Complexity of computing a hash value with h: O(n) time in the key length, O(1) space.
Hashing Applications
• Keeping track of customers' account information at a bank
– Search through records to check balances and perform transactions
• Keeping track of reservations on flights
– Search to find empty seats, cancel/modify reservations
• Search engines
– Look for all documents containing a given word
Hashing working principle
Problem Statement: Consider the strings {"hi", "an", "is"}. Store these in an array (table) using hashing.
• The values must be accessible (searched) or updatable in the table quickly, i.e., in O(1) time. The ordering (sorting) of the strings in the table is not considered. The given strings act both as the keys and as the values.
Compute the key to store the value at the corresponding index:
1. Apply the hash function to calculate the hash value that acts as the index of the data structure where the value is stored.
2. Use the ASCII values of the characters: "a" = 97, "h" = 104, "i" = 105, "n" = 110, "s" = 115, etc.
3. The numerical value of a string is the sum of the ASCII values of all its characters:
"hi" = 104 + 105 = 209, "an" = 97 + 110 = 207, "is" = 105 + 115 = 220
4. Now assume a table of size 9 to store these strings.
5. The hash function is the sum of the characters of the key mod the table size.
6. Calculate the storage location (bucket) for each string as sum(string) mod 9:
"hi" = 209 mod 9 = 2, "an" = 207 mod 9 = 0, "is" = 220 mod 9 = 4
Index:  0   1   2   3   4   5   6   7   8
Key:    an      hi      is

key and indices of array mapping
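The ASCII-sum hash above can be sketched as follows. This is a minimal sketch of the example only: the table size of 9 and the collision-free placement mirror the slide, and the class and method names are illustrative, not part of any standard API.

```java
public class AsciiSumHash {
    static final int TABLE_SIZE = 9;   // table size taken from the example

    // Sum the character codes of the key, then reduce modulo the table size.
    static int hash(String key) {
        int sum = 0;
        for (char c : key.toCharArray())
            sum += c;
        return sum % TABLE_SIZE;
    }

    public static void main(String[] args) {
        String[] table = new String[TABLE_SIZE];
        for (String s : new String[] {"hi", "an", "is"})
            table[hash(s)] = s;        // ignores collisions, as the example does
        // "hi" -> 209 % 9 = 2, "an" -> 207 % 9 = 0, "is" -> 220 % 9 = 4
        System.out.println(hash("hi") + " " + hash("an") + " " + hash("is"));
    }
}
```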


Operations in Hashing

• To insert a record into the structure, compute the hash value h(Ki) and place the record at that bucket address.

• To search, compute the hash value and search the bucket for the specific record.

• To delete, search for the record and remove it.
Properties of the Hash Function

• The distribution should be uniform.
– An ideal hash function assigns the same number of records to each bucket.

• The distribution should be random.
– Regardless of the actual search keys, each bucket should hold the same number of records on average.
– Hash values should not depend on any ordering of the search keys.
Implement Dictionary using Direct Addressing
Assumptions:
– Key values are distinct
– Each key is drawn from a universe U = {0, 1, . . . , m - 1}
Goal:
– Store the items in an array, indexed by keys

Direct-address table representation:
– An array T[0 . . . m - 1]
– Each slot, or position, in T corresponds to a key in U
– For an element x with key k, a pointer to x (or x itself) is placed in location T[k]
– If there are no elements with key k in the set, T[k] is empty, represented by NIL
Dictionaries
• A dictionary is a data structure that mainly supports two basic operations:
• Insert a new item
• Search for an item with a given key
• Queries: return information about the set S
o Search(S, k) – search for element k in set S
o Minimum(S), Maximum(S) – find the min and max elements in set S
o Successor(S, k) – successor of element k in set S
o Predecessor(S, k) – predecessor of element k in set S
• Modifying operations: change the set
o Insert(S, k) – insert k into set S
o Delete(S, k) – delete k from set S
Operations
Insert the data into dictionary

• insert(string word, string meaning)

return word

Delete the data from dictionary

• deleteWord(string word)

delete word

Search / Access the data into dictionary

• find(string word)

return word
Dictionary using Hashing
class node {
    String key, value;
    node next;

    node(String key, String value) {
        this.key = key;
        this.value = value;
        next = null;
    }
}

class dictionary {
    static final int MAX = 100;   // table size; must cover the modulus used in hashf
    node[] head;

    dictionary() {
        head = new node[MAX];
    }

    // Hash function: sum of character codes, reduced modulo the table size
    int hashf(String word) {
        int asciiSum = 0;
        for (int i = 0; i < word.length(); i++)
            asciiSum = asciiSum + word.charAt(i);
        return asciiSum % 100;
    }
Dictionary using Hashing
    String find(String word) {
        int index = hashf(word);
        node start = head[index];
        while (start != null) {
            if (start.key.equals(word))
                return start.value;
            start = start.next;
        }
        return "-1";               // word not present
    }

    boolean deleteWord(String word) {
        int index = hashf(word);
        node tmp = head[index];
        node prev = null;
        if (tmp == null)           // no word is present at this index
            return false;
Dictionary using Hashing
        // only one word is present at this index
        if (tmp.key.equals(word) && tmp.next == null) {
            head[index] = null;
            return true;
        }
        // traverse to the node holding the word
        while (!tmp.key.equals(word) && tmp.next != null) {
            prev = tmp;
            tmp = tmp.next;
        }
        if (!tmp.key.equals(word)) // reached the end without a match
            return false;
        if (prev == null)          // word is at the head of the chain
            head[index] = tmp.next;
        else                       // word is in the middle or at the end
            prev.next = tmp.next;
        return true;
    }
Dictionary using Hashing
    boolean insert(String word, String meaning) {
        int index = hashf(word);
        node p = new node(word, meaning);
        if (head[index] == null) {
            head[index] = p;
        } else {
            // append at the end of the chain
            node start = head[index];
            while (start.next != null)
                start = start.next;
            start.next = p;
        }
        System.out.print(word + " inserted. ");
        return true;
    }
}
Dictionary Implementations
• Implementing dictionaries using:
– Direct addressing
– Ordered/unordered arrays
– Ordered/unordered linked lists

Insert Search
direct addressing O(1) O(1)
ordered array O(n) O(log n)
ordered list O(n) O(n)
unordered array O(1) O(n)
unordered list O(1) O(n)
Static Hashing
• In static hashing, when a search-key value is provided, the hash function always computes the same address. It is used in databases.
E.g.,
• Generating the address for STUDENT_ID = 76 using the mod 5 hash function always results in the same bucket address 1: h(76) = 76 mod 5 = 1.
• The bucket address never changes, i.e., the number of buckets in memory is constant.
• Insertion, deletion, update, and search are done in constant time.

Disadvantage:
• Suppose the bucket corresponding to the hash key is not free at the time of inserting new data into the table, i.e., data already exists at that address.
• This is called bucket overflow, also called a collision.
• This issue is overcome by open hashing and closed hashing.
Collisions
• Hashing can generate the same hash value for two different input keys, so a newly inserted key maps to an already occupied storage location. This must be handled using collision-handling methods.
E.g., both strings {"is", "si"} produce the same hash value:
"is" = 105 + 115 = 220 mod 9 = 4
"si" = 115 + 105 = 220 mod 9 = 4
• So both strings map to the same storage location.
• This creates problems in searching, insertion, deletion, and updating of values.
• The second string collides with the first when stored using its hash value.

[Figure: hash table with buckets; h(k2) = h(k5), so Value2 and Value5 collide in the same slot.]
Collisions
• Two or more keys hash to the same slot!!

• For a given set K of keys

– If |K| ≤ m, collisions may or may not happen, depending on the hash function

– If |K| > m, collisions will definitely happen (i.e., there must be at least two
keys that have the same hash value)
• Avoiding collisions completely is hard, even with a good hash function
Collision Resolution Methods

Methods for Handling Collisions:


– Separate Chaining (Open Hashing)
– Open addressing (Closed Hashing)
• Linear probing
• Quadratic probing
• Double hashing
Separate Chaining (Open Hashing)
 Store all the elements that have the same hash value in a single slot using a linked list.
 Here, each slot of the array holds a linked list.
 Open hashing: keys (elements) are stored outside the hash table, i.e., in a separate linked list (chain).
 To search for a key, the linked list is identified using the hash code generated by h for the key.
 Key K is then searched in the linked list by linear traversal; if a node's key equals K, that node holds the data.
 If the pointer reaches the end of the linked list without finding the data, the data does not exist.
 In separate chaining, if two different elements have the same hash value, both elements are stored in the same linked list, one after the other.
Separate Chaining (Open Hashing)
• Choosing the size of the table
– Small enough not to waste space
– Large enough such that lists remain short
– Typically 1/5 or 1/10 of the total number of elements

• How should we keep the lists: ordered or not?


– Not ordered!
• Insert is fast
• Can easily remove the most recently inserted elements
Separate Chaining - Insertion operation
Table size = 10
• Apply the simple hash function "key mod 10" to the input key sequence 32, 30, 0, 55, 45, 65 to find each storage location.
• h(k) = k mod Table_Size
• h(32) = 32 mod 10 = 2
• h(30) = 30 mod 10 = 0
• h(0) = 0 mod 10 = 0
• h(55) = 55 mod 10 = 5
• h(45) = 45 mod 10 = 5
• h(65) = 65 mod 10 = 5

Resulting chains (index → linked list): 0 → 30 → 0; 2 → 32; 5 → 55 → 45 → 65; all other slots are empty.
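The insertion steps above can be sketched as follows. This is a minimal sketch under the example's assumptions (table size 10, integer keys, `h(k) = k mod 10`); the class and method names are illustrative.

```java
import java.util.LinkedList;

public class ChainingDemo {
    static final int M = 10;   // table size from the example

    // Insert each key into the chain at slot k mod M.
    static LinkedList<Integer>[] insertAll(int[] keys) {
        @SuppressWarnings("unchecked")
        LinkedList<Integer>[] table = new LinkedList[M];
        for (int k : keys) {
            int i = k % M;                        // bucket index
            if (table[i] == null) table[i] = new LinkedList<>();
            table[i].add(k);                      // append to the chain
        }
        return table;
    }

    public static void main(String[] args) {
        LinkedList<Integer>[] t = insertAll(new int[] {32, 30, 0, 55, 45, 65});
        System.out.println(t[5]);                 // 55, 45, 65 all hash to slot 5
    }
}
```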
Separate Chaining - Search operation
Table size = 10
• Apply the simple hash function to search for the data 75 and 32.
• Search for an element with key k in the list T[h(k)].
• h(k) = k mod Table_Size
• h(75) = 75 mod 10 = 5
• h(32) = 32 mod 10 = 2
• Running time is proportional to the length of the list of elements in slot h(k).
Separate Chaining - Deletion operation
Table size = 10
• First find the element to be deleted.
• Apply the simple hash function to delete the data 32:
• h(k) = k mod Table_Size
• h(32) = 32 mod 10 = 2
Worst-case running time:
• Deletion depends on searching the corresponding list.
Separate Chaining - Time complexity
Table size = 10
• N is the number of keys stored in the table.
• M is the number of buckets in the table.
• Load factor = number of keys stored in the table / number of buckets in the table = N / M
• Average-case running time for insert, delete, and search is O(1 + load factor).
Separate Chaining (Open Hashing)
class HashNode<K, V> {
    K key;
    V value;
    HashNode<K, V> next;

    public HashNode(K key, V value) {
        this.key = key;
        this.value = value;
    }
}

// hash table
class Map<K, V> {
    private ArrayList<HashNode<K, V>> bucketArray;   // bucket array
    private int numBuckets;                          // current capacity of array list
    private int size;                                // number of stored key-value pairs

    public Map() {
        bucketArray = new ArrayList<>();
        numBuckets = 10;
        size = 0;
        for (int i = 0; i < numBuckets; i++)         // empty chains
            bucketArray.add(null);
    }

    // hash function to find the bucket index for a key (assumes a non-negative hashCode)
    private int getBucketIndex(K key) {
        int hashCode = key.hashCode();
        int index = hashCode % numBuckets;
        return index;
    }

    // Returns the value for a key
    public V get(K key) {
        // Find head of chain for given key
        int bucketIndex = getBucketIndex(key);
        HashNode<K, V> head = bucketArray.get(bucketIndex);
        while (head != null) {                       // search key in chain
            if (head.key.equals(key))
                return head.value;
            head = head.next;
        }
        return null;                                 // key not found
    }
Separate Chaining (Open Hashing)
// Remove a given key }
public V remove(K key)
// find index for given key // Adds a key value pair to hash
{ int bucketIndex = getBucketIndex(key); public void add(K key, V value)
// head of chain {
HashNode<K, V> head = bucketArray.get(bucketIndex); // Find head of chain for given key
HashNode<K, V> prev = null; //Search for key in its chain int bucketIndex = getBucketIndex(key);
while (head != null) { HashNode<K, V> head = bucketArray.get(bucketIndex);
if (head.key.equals(key)) // If Key found // Check if key is already
break; present while (head != null)
prev = head; // Else traverse in {
chain if (head.key.equals(key))
} head = head.next; {
if (head == null) // If key was not there head.value = value;
return null; return;
size--; // Reduce }
size
if (prev != null) // Remove key head = head.next;
prev.next = head.next; }
else
bucketArray.set(bucketIndex, head.next);
return head.value;
Separate Chaining (Open Hashing)
// Insert key in chain {
size++; add(headNode.key, headNode.value);
head = bucketArray.get(bucketIndex); headNode = headNode.next;
HashNode<K, V> newNode = new HashNode<K, V>(key, value); }
newNode.next = head; }
bucketArray.set(bucketIndex, newNode); }
// If load factor goes beyond threshold, then double hash table size }
if ((1.0*size)/numBuckets >= 0.7)
{
ArrayList<HashNode<K, V>> temp = bucketArray;
bucketArray = new ArrayList<>();
numBuckets = 2 * numBuckets;
size = 0;
for (int i = 0; i < numBuckets;
i++)
bucketArray.add(null);
for (HashNode<K, V> headNode : temp)
{
while (headNode != null)
Separate Chaining (Open Hashing)
Advantages:
• Simple to implement.
• The hash table never runs out of room: more elements can always be added to a chain's linked list.
• It is mostly used when the size of the data is unknown, i.e., the element count and how frequently keys may be inserted or deleted.
Limitations:
• The cache performance of chaining is poor, as keys are stored in linked lists; open addressing provides better cache performance because everything is stored in the same table.
• Wastage of space (some slots of the hash table are never used).
• If a chain becomes long, search time can become O(n) in the worst case.
Open Addressing (Closed Hashing)
• If we have enough contiguous memory to store all the keys (m > N), store the keys in the table itself.
• No need to use linked lists anymore.
Generalized hash function for open addressing
• The hash function takes two arguments here: (i) the key value k, and (ii) the probe number p:
h(k, p), p = 0, 1, ..., m-1
• Probe sequence: <h(k,0), h(k,1), ..., h(k,m-1)>
Insert operation
 The hash function computes the hash value for the key to be inserted; the hash value is then used as an index to store the key in the hash table.
If a collision occurs,
 Probing is performed until an empty bucket is found.
 Once an empty bucket is found, the key is inserted.
 Probing is performed in accordance with the technique used for open addressing.
• Search time depends on the length of the probe sequence.

E.g., inserting with h(k) = k mod 13: h(69) = 4, h(72) = 7, h(79) = 1, h(98) = 7 (collision), h(14) = 1 (collision), h(50) = 11.
Open Addressing (Closed Hashing)
Delete operation
 The key is first searched for and then deleted.
 After deleting the key, that particular bucket is marked as "deleted".
 During insertion, buckets marked as "deleted" are treated as empty buckets.
 E.g., 72 mod 13 = 7

Search operation:
 To search for a desired key, compute the hash value for the input key.
 Check the bucket of the hash table at the calculated hash value.
 If the desired key is found, the target key is accessed.
 Otherwise, the subsequent buckets are checked until the desired key or an empty bucket is found.
 An empty bucket denotes that the key does not exist in the hash table.
 During searching, the search is not terminated on encountering a bucket marked as "deleted".
 The search terminates only after the required key or an empty bucket is found.
 E.g., searching for 14 (14 mod 13 = 1) probes the indices <1, 5, 9>, continuing past a slot marked DELETED.
Linear probing: Inserting a key (data)
• The hash table is searched sequentially, starting from the original hash location.
• If there is a collision, check the next available position in the table (i.e., probing).
h(k, i) = (h1(k) + i) mod m ; i = 0, 1, 2, ...
• First slot probed: h1(k)
• Second slot probed: h1(k) + 1
• Third slot probed: h1(k) + 2, and so on

• Probe sequence is: < h1(k), h1(k)+1 , h1(k)+2 , ....>

• Can generate m probe sequences maximum, why?

wrap around
Linear probing: Inserting a key (data)
• Apply linear probing to insert the key sequence 10, 30, 16, 44, 35 using the hash function "key mod 5".
• Create an empty hash table of size 5, i.e., hash values range from 0 to 4.
h(k, i) = (h1(k) + i) mod m ; i = 0, 1, 2, ...
• h(10, 0) = 10 mod 5 = 0. Stored at index 0.
• h(30, 0) = 30 mod 5 = 0; but slot 0 is filled, so a collision occurs.
• The next empty slot is h(30, 1) = (30+1) mod 5 = 1. Stored at index 1.
• h(16, 0) = 16 mod 5 = 1; but slot 1 is filled, so a collision occurs.
• The next empty slot is h(16, 1) = (16+1) mod 5 = 2. Stored at index 2.
• h(44, 0) = 44 mod 5 = 4. Stored at index 4.
• h(35, 0) = 35 mod 5 = 0; but slot 0 is filled, so a collision occurs.
• h(35, 1) = (35+1) mod 5 = 1; slot 1 is also filled.
• h(35, 2) = (35+2) mod 5 = 2; slot 2 is also filled.
• h(35, 3) = (35+3) mod 5 = 3; slot 3 is empty. Stored at index 3.

Final table: index 0 → 10, 1 → 30, 2 → 16, 3 → 35, 4 → 44.
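The insertion walk-through above can be sketched as follows. This is a minimal sketch under the example's assumptions (table size 5, positive integer keys); the class and method names are illustrative.

```java
import java.util.Arrays;

public class LinearProbing {
    static final int M = 5;   // table size from the example

    // Insert each key by probing slots h1(k), h1(k)+1, ... (mod M)
    // until an empty slot is found; a full table leaves the key unplaced.
    static Integer[] insertAll(int[] keys) {
        Integer[] table = new Integer[M];
        for (int k : keys) {
            for (int i = 0; i < M; i++) {
                int slot = (k % M + i) % M;       // wraps around the table
                if (table[slot] == null) {
                    table[slot] = k;
                    break;
                }
            }
        }
        return table;
    }

    public static void main(String[] args) {
        // 35 collides at slots 0, 1, 2 and lands in slot 3
        System.out.println(Arrays.toString(insertAll(new int[] {10, 30, 16, 44, 35})));
    }
}
```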
Linear probing: Searching for a key (data)
• Three cases: Index Key
0 10
(1) Position in table is occupied with an element of equal key 1 30
2 16
(2) Position in table is empty
3 35
4 44
(3) Position in table occupied with a different element

• Case 2: Probe the next higher index until the element is found or an empty
position

is found

• The process wraps around to the beginning of the table


Linear probing: Deleting a key
• Delete: 30
• Problems
– Cannot simply mark the slot as empty.
– Otherwise it becomes impossible to retrieve keys inserted after that slot was occupied (e.g., 35, whose probe sequence passed through index 1).
• Solution
– Mark the slot with a sentinel value DELETED.
• The deleted slot can later be reused for insertion.
• Searching will still be able to find all the keys.

Table after deleting 30: index 0 → 10, 1 → DELETED, 2 → 16, 3 → 35, 4 → 44.
Advantages & Limitations of Linear Probing
Advantages
• Easy to compute.
Limitations
• The primary issue is clustering, i.e., many consecutive occupied slots form groups.
• It then takes longer to search for an element or to find an empty bucket.

Primary Clustering Problem
• Some slots become more likely to be filled than others.
• Long runs of occupied slots are created, so search time increases.
• Initially, every slot is probed first with probability 1/m; a slot immediately following a cluster of c occupied slots is filled next with probability (c+1)/m (e.g., 2/m, 4/m, 5/m for slots following clusters of lengths 1, 3, and 4).
Quadratic probing : Open Addressing
• Quadratic probing starts from the original hash value and adds successive values of a quadratic polynomial until an empty slot is found.

h′(k) = k mod m
h(k, i) = (h′(k) + i²) mod m ; i = 0, 1, 2, ..., (m-1)

• Clustering is less serious but still an issue (secondary clustering): the initial probe position determines the entire probe sequence, so big clusters can still form.
• Start from the original hash location. If this location is not empty, check the other slots:
• Let hash(x) be the slot index computed using the hash function and n be the size of the hash table.
• If slot hash(x) % n is full, then check (hash(x) + 1²) % n.
• If (hash(x) + 1²) % n is also full, then check (hash(x) + 2²) % n.
• If (hash(x) + 2²) % n is also full, then check (hash(x) + 3²) % n.
• Repeat for increasing values of i until an empty slot is found.
Quadratic probing : Open Addressing
• Apply quadratic probing to insert the key sequence 10, 30, 16, 46, 35 using the hash function "key mod 7".
• Create an empty hash table of size 7, i.e., hash values range from 0 to 6.

h′(k) = k mod m
h(k, i) = (h′(k) + i²) mod m ; i = 0, 1, 2, ..., (m-1)

• h(10) = 10 mod 7 = 3. Stored at index 3.
• h(30) = 30 mod 7 = 2. Stored at index 2.
• h(16) = 16 mod 7 = 2; but slot 2 is filled, so a collision occurs.
• The next probe is (2 + 1²) mod 7 = 3; index 3 is also filled.
• The next probe is (2 + 2²) mod 7 = 6; index 6 is empty. Stored at index 6.
• h(46) = 46 mod 7 = 4. Stored at index 4.
• h(35) = 35 mod 7 = 0. Stored at index 0.

Final table: index 0 → 35, 2 → 30, 3 → 10, 4 → 46, 6 → 16.

Advantages:
• More efficient than linear probing for a closed hash table.

Disadvantage:
• Secondary clustering: two keys with the same initial hash value follow the same probe sequence and contend for the same locations.
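The quadratic-probing example can be sketched as follows. This is a minimal sketch under the example's assumptions (table size 7, positive integer keys); note, as a caveat, that probing i = 0..m-1 with i² offsets is not guaranteed to visit every slot for arbitrary m.

```java
public class QuadraticProbing {
    static final int M = 7;   // table size from the example

    // Insert each key by probing slots (h'(k) + i*i) mod M for i = 0..M-1.
    static Integer[] insertAll(int[] keys) {
        Integer[] table = new Integer[M];
        for (int k : keys) {
            for (int i = 0; i < M; i++) {
                int slot = (k % M + i * i) % M;
                if (table[slot] == null) {
                    table[slot] = k;
                    break;
                }
            }
        }
        return table;
    }

    public static void main(String[] args) {
        Integer[] t = insertAll(new int[] {10, 30, 16, 46, 35});
        // 16 collides at slots 2 and 3, then lands at (2 + 2*2) mod 7 = 6
        System.out.println(t[3] + " " + t[2] + " " + t[6] + " " + t[4] + " " + t[0]);
    }
}
```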
Double Hashing (Open Addressing)
• Double hashing applies two hash functions:
(1) The first hash function determines the initial slot.
(2) The second hash function determines the increment (step size) for the probe sequence.
• The second hash function is used when the first function causes a collision; it provides an offset index at which to store the value.
h1(k) = k mod m
h2(k) = k mod p, where p is a prime number with p < m
h(k, i) = (h1(k) + i * h2(k)) mod m, i = 0, 1, ...
• Initial probe: h1(k)
• Each subsequent probe is offset by a further h2(k) mod m, and so on.
• i is a non-negative integer that indicates the collision number.
• m = hash table size; k = element/key being hashed.
• It can generate up to m² distinct probe sequences.
• Worst-case search time: O(n).
Double Hashing (Open Addressing): Example
Insert the key sequence 69, 72, 79, 98, and 14 using double hashing.
Given hash functions:
h1(k) = k mod 13 (hash table size m = 13)
h2(k) = 1 + (k mod 11) (here, 11 is a prime < 13)
h(k, i) = (h1(k) + i * h2(k)) mod 13

Insertion:
h1(69) = 69 mod 13 = 4  → 69 is stored at index 4.
h1(72) = 72 mod 13 = 7  → 72 is stored at index 7.
h1(79) = 79 mod 13 = 1  → 79 is stored at index 1.
Double Hashing: Example (continued), with h2(k) = 1 + (k mod 11)

h1(98) = 98 mod 13 = 7 → collision at index 7.
h(98, 1) = (h1(98) + 1 * h2(98)) mod 13 = (7 + 1 * (1 + (98 mod 11))) mod 13
         = (7 + (1 + 10)) mod 13 = 18 mod 13 = 5 → 98 is stored at index 5.

h(14, 0) = 14 mod 13 = 1 → collision at index 1!
h(14, 1) = (h1(14) + 1 * h2(14)) mod 13 = (1 + 1 * (1 + (14 mod 11))) mod 13
         = (1 + 4) mod 13 = 5 → collision at index 5!
h(14, 2) = (h1(14) + 2 * h2(14)) mod 13 = (1 + 2 * (1 + (14 mod 11))) mod 13
         = (1 + 8) mod 13 = 9 → free; 14 is stored at index 9.

Final table: index 1 → 79, 4 → 69, 5 → 98, 7 → 72, 9 → 14.

Advantage:
• Avoids clustering.
Disadvantage:
• Harder to delete an element.
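The double-hashing example can be sketched as follows. This is a minimal sketch under the example's assumptions (m = 13, h2(k) = 1 + (k mod 11), positive integer keys); the class and method names are illustrative.

```java
public class DoubleHashing {
    static final int M = 13;   // table size from the example

    // Insert each key by probing slots (h1(k) + i * h2(k)) mod M.
    static Integer[] insertAll(int[] keys) {
        Integer[] table = new Integer[M];
        for (int k : keys) {
            int h1 = k % M;
            int h2 = 1 + (k % 11);            // prime 11 < m = 13
            for (int i = 0; i < M; i++) {     // probe until a free slot
                int slot = (h1 + i * h2) % M;
                if (table[slot] == null) {
                    table[slot] = k;
                    break;
                }
            }
        }
        return table;
    }

    public static void main(String[] args) {
        Integer[] t = insertAll(new int[] {69, 72, 79, 98, 14});
        // 98 collides at index 7 and lands at 5; 14 collides at 1 and 5, lands at 9
        System.out.println(t[4] + " " + t[7] + " " + t[1] + " " + t[5] + " " + t[9]);
    }
}
```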
Rehashing
• Rehashing means hashing the stored entries again, into a new table.
• It preserves the performance of the hash table's operations.
• Rehashing expands the size of the map, array, or hash table dynamically so that store and retrieve operations keep their O(1) time complexity.
• Rehashing is the process of re-calculating the hash values of already stored entries and moving them to a bigger hash table when the number of elements in the current, smaller hash table reaches a maximum threshold value.
• The load factor plays a vital role in rehashing.
Need of Rehashing
• Rehashing is required when the load factor increases.
• The load factor increases as keys are inserted, and a high load factor increases the time complexity of operations.
• Generally, the expected time complexity of a hash table operation is O(1).
• To keep the time complexity and load factor of the hash table low, apply the rehashing technique.
Apply rehashing in the following conditions:
• When the table is half full.
• When the load factor reaches a certain level (e.g., load > 1/2).
• When an insertion fails.
• Heuristically: choose a load factor threshold, and rehash when the threshold is breached.
Rehashing Working Principle
1. Check the load factor on each new insertion of an element into the hash table.
2. If the load factor is greater than its default value (0.75), perform rehashing.
3. Create a new bucketArray of double size (i.e., double the size of the previous array).
4. Iterate over each element of the previous array (the old bucketArray) and call the insert() method for each element.
5. The insert() method inserts all the elements into the newly created, larger bucketArray.
• Time complexity of rehashing n elements: n * O(1) + O(n) = O(n)
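The steps above can be sketched as follows. This is a minimal sketch; the initial capacity of 4 is an illustrative assumption (chosen small so the resize triggers quickly), and only the bucket-doubling logic from steps 2-5 is shown.

```java
import java.util.ArrayList;

public class RehashDemo {
    // Build a fresh list of n empty chains.
    static ArrayList<ArrayList<Integer>> newBuckets(int n) {
        ArrayList<ArrayList<Integer>> b = new ArrayList<>();
        for (int i = 0; i < n; i++) b.add(new ArrayList<>());
        return b;
    }

    // Insert keys into a chained table that starts with 4 buckets and doubles
    // its bucket count whenever load factor = size / buckets exceeds 0.75.
    static int bucketCountAfter(int[] keys) {
        ArrayList<ArrayList<Integer>> buckets = newBuckets(4);
        int size = 0;
        for (int k : keys) {
            buckets.get(k % buckets.size()).add(k);
            size++;
            if ((1.0 * size) / buckets.size() > 0.75) {      // threshold breached
                ArrayList<ArrayList<Integer>> old = buckets;
                buckets = newBuckets(2 * old.size());         // double the capacity
                for (ArrayList<Integer> chain : old)          // re-hash every stored key
                    for (int key : chain)
                        buckets.get(key % buckets.size()).add(key);
            }
        }
        return buckets.size();
    }

    public static void main(String[] args) {
        // the 4th insertion pushes the load factor to 4/4 = 1.0 > 0.75
        System.out.println(bucketCountAfter(new int[] {1, 2, 3, 4}));
    }
}
```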
Load Factor in Rehashing
Load Factor
• The load factor is a measure that decides when to increase the HashMap or Hashtable size so that store and search operations stay O(1).
• The default load factor of a hash table is 0.75 (75% of the table size).
• The load factor decides when to increase the number of buckets used to store the keys in the hash table.
• A higher load factor means lower space consumption but more expensive lookups.
• A lower load factor means larger space consumption relative to the required number of elements.
Threshold = initial capacity of the hash table * load factor of the hash table
E.g.,
• Initial capacity of the hash table = 16;
• Default load factor of the hash table = 0.75
• Threshold = 16 * 0.75 = 12.
• This means the hash table keeps its size of 16 up to the 12th key. Once the 13th key enters the hash table, the size increases from the default 2^4 = 16 buckets to 2^5 = 32 buckets.
Rehashing: Load Factor Numerical Example
• Let the default bucket count of the hash table be 16, i.e., the initial capacity of the hash table = 16.
• Default load factor of the hash table = 0.75; threshold = 16 * 0.75 = 12.
• Insert the first element and check whether the hash table capacity needs to increase.
• The current load factor is determined using:
Load factor = number of entries in the hash table / number of buckets
• If the number of entries = 1 and the bucket count = 16, then the load factor for the 1st element = 1/16 = 0.0625.
• Compare the obtained load factor with the default load factor (0.75): 0.0625 < 0.75.
• The obtained value is less than the default load factor, so there is no need to increase the hash table size.
• Therefore, the hash table size does not need to increase until the 12th element, because the load factor at the 12th element = 12/16 = 0.75, which equals the default load factor.
• Once the 13th element is inserted into the hash table, the size of the hash table is increased (doubled to 32 buckets).
Dynamic Hashing or Extendible Hashing
• Data buckets grow or shrink (are added or removed dynamically) as the number of records increases or decreases.
• Extendible hashing splits and combines buckets as appropriate for the data size, i.e., buckets are added and deleted on demand.
• The hash function produces a large number of values uniformly and randomly.
• Hash indices are typically a prefix (the most significant bits, MSBs, of the binary value) of the entire hash value.
• The least significant bits (LSBs) of the binary value can also be used.
• More than one consecutive index can point to the same bucket.
– Such indices share a hash prefix, which can be shorter than the length of the index.

Features of Dynamic Hashing

1. Directories
The directories store addresses of the buckets as pointers. A unique id is assigned to each directory; it may change during directory expansion, i.e., when the hash table's buckets expand.
2. Buckets
Buckets store the hashed keys. Directories point to buckets.
Terminologies & Structure Representation of Extendible Hashing
Directories - store addresses of the buckets as pointers.
• The hash function returns the directory id, which is used to navigate to the appropriate bucket. Number of directories = 2^(Global Depth).
Buckets - store the hashed keys. Directories point to buckets.
Global Depth
• Global depth is associated with the directories. It represents the number of bits used by the hash function to categorize the keys.
• Global Depth = number of bits in the directory id.
Local Depth
• Local depth is associated with the buckets, not the directories.
• Local depth, in combination with the global depth, decides the action to be performed when an overflow occurs. Local depth is always less than or equal to the global depth.
Bucket Splitting
• When the number of elements in a bucket exceeds a particular size, the bucket is split into two parts.
Directory Expansion
• Directory expansion is applied when a bucket overflows and the local depth of the overflowing bucket is equal to the global depth.

[Figure: directory with global depth 2 (ids 00, 01, 10, 11), each entry pointing to a data bucket with local depth 2.]
Working Principle of Extendible Hashing

Flow: input data → converted into binary → hash function f(h) → directory location identified → data stored in the bucket the directory points to.

1. Analyze the data: data elements can be of various types such as Integer, String, Float, etc., e.g., 49, 'Kaven', 2.35.
2. Binary format conversion: convert the data element into binary. For string elements, take the ASCII-equivalent integers of the characters and convert them into binary. Here, 49 in binary is 110001.
3. Check the global depth of the directory: assume the global depth of the hash directory is 2.
4. Identify the directory: take the 'global depth' number of MSBs of the binary number and match them to a directory id.
 E.g., the obtained binary is 110001 and the global depth is 2, so the hash function returns the two MSBs of 110001, which are 11.
5. Navigation: find the bucket pointed to by the directory with directory id 11.
6. Insertion and overflow check: insert the element and check whether the bucket is free or overflows. If it overflows, move to step 7 followed by step 8; otherwise, move to step 9.
Working Principle of Extendible Hashing
When overflow occurs:

Local Depth = Global Depth → apply bucket split and directory expansion.
Local Depth < Global Depth → apply bucket split only.

7. Handling overflow during data insertion: handle the bucket overflow with the appropriate procedure. First check whether the local depth is less than or equal to the global depth, then choose one of the cases below.
Case 1: if the local depth of the overflowing bucket is equal to the global depth, then directory expansion as well as a bucket split needs to be performed. Increment the global depth and the local depth by 1, and assign appropriate pointers. Directory expansion doubles the number of directories present in the hash structure.
Case 2: if the local depth is less than the global depth, then only a bucket split takes place. Increment only the local depth by 1, and assign appropriate pointers.
8. Rehashing of split-bucket elements: the elements present in the overflowing bucket that is split are rehashed with respect to the new global depth of the directory.
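Steps 2-4 above (deriving the directory id from the most significant bits) can be sketched as follows. This is a minimal sketch; the `width` parameter (the assumed bit-width of the key's binary form) and the names are illustrative.

```java
public class MsbDirectory {
    // Return the top `globalDepth` bits of the `width`-bit binary form of k,
    // which serve as the directory id in extendible hashing.
    static int directoryId(int k, int width, int globalDepth) {
        return k >> (width - globalDepth);
    }

    public static void main(String[] args) {
        // 49 = 110001 in 6 bits; with global depth 2, the two MSBs are 11 (= 3)
        System.out.println(Integer.toBinaryString(directoryId(49, 6, 2)));
    }
}
```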
Numerical Example: Extendible hashing
• Apply extendible hashing to insert the values 17, 5, 6, 22, 24, 11, 30, 7, 10, 21, 27 into the storage.
• Let the bucket size be 3.
• Hash Function:
  – Let the global depth be x. Convert each key into binary; the hash function returns its x MSBs.
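Since every key in this example fits in 5 bits (all values are below 32), taking the x MSBs amounts to a right shift. A small sketch (the helper name `msb` is illustrative):

```java
class MsbHashFunction {
    /** Top `depth` bits of a `bits`-wide key, e.g. msb(17, 5, 2) = 0b10 = 2. */
    static int msb(int key, int bits, int depth) {
        return key >>> (bits - depth);
    }
}
```

For example, `msb(17, 5, 1)` returns 1 (the single MSB of 10001), matching the first insertion below.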
• Solution: First, calculate the binary form of each of the given numbers.
• Analyze the Data: the keys are integers; convert each into a 5-bit binary number.

   17 - 10001    5 - 00101    6 - 00110    22 - 10110
   24 - 11000   11 - 01011   30 - 11110    7 - 00111
   10 - 01010   21 - 10101   27 - 11011

• Initially, the global depth and each local depth are always 1. The hashing frame is:

   Global Depth = 1
   Directory 0 → Bucket A (Local Depth 1): [ ]
   Directory 1 → Bucket B (Local Depth 1): [ ]
• Insert the value 17. Its binary form is 10001. The global depth is 1, so the hash function returns the MSB of 10001, which is 1. Hence 17 is mapped to the directory entry with id 1.

   Directory 0 → Bucket A: [ ]
   Directory 1 → Bucket B: [ 17 ]

• Insert the value 5. Its binary form is 00101. The MSB of 00101 is 0, so 5 is mapped to the directory entry with id 0.

   Directory 0 → Bucket A: [ 5 ]
   Directory 1 → Bucket B: [ 17 ]
• Insert the value 6. Its binary form is 00110. The MSB is 0, so 6 is mapped to the directory entry with id 0.

   Directory 0 → Bucket A: [ 5, 6 ]
   Directory 1 → Bucket B: [ 17 ]

• Insert the value 22. Its binary form is 10110. The MSB is 1, so 22 is mapped to the directory entry with id 1.

   Directory 0 → Bucket A: [ 5, 6 ]
   Directory 1 → Bucket B: [ 17, 22 ]
• Insert the value 24. Its binary form is 11000. The MSB is 1, so 24 is mapped to the directory entry with id 1.

   Directory 0 → Bucket A: [ 5, 6 ]
   Directory 1 → Bucket B: [ 17, 22, 24 ]

• Insert the value 11. Its binary form is 01011. The MSB is 0, so 11 is mapped to the directory entry with id 0.

   Directory 0 → Bucket A: [ 5, 6, 11 ]
   Directory 1 → Bucket B: [ 17, 22, 24 ]
• Insert the value 30. Its binary form is 11110. The MSB is 1, but the bucket pointed to by directory entry 1 is already full: overflow occurs.
• The overflow occurs with Local Depth = Global Depth (both are 1).
• As per Case 1 in Step 7, both a bucket split and a directory expansion are applied, and the numbers in the overflowing bucket are rehashed after the split.
• Increment the global depth by 1, so the global depth becomes 2.
• Hence 17, 22, 24, 30 are rehashed with respect to their 2 MSBs: [ 17 (10001), 22 (10110), 24 (11000), 30 (11110) ].

   Global Depth = 2
   Directory 00 → Bucket A (Local Depth 1): [ 5, 6, 11 ]
   Directory 01 → Bucket A (shared; Local Depth 1)
   Directory 10 → Bucket C (Local Depth 2): [ 17, 22 ]
   Directory 11 → Bucket D (Local Depth 2): [ 24, 30 ]
• Observe that the bucket which did not overflow remains unchanged.
• Insert the value 7. Its binary form is 00111. The global depth is 2, so the hash function returns the 2 MSBs, which are 00. The bucket pointed to by directory entry 00 is already full: overflow occurs.
• The overflow occurs with Local Depth < Global Depth (1 < 2).
• As per Case 2 in Step 7, the directory is not doubled; only the bucket is split and its elements are rehashed.
• Hence 5, 6, 11, 7 are rehashed with respect to their 2 MSBs: [ 5 (00101), 6 (00110), 11 (01011), 7 (00111) ].

   Global Depth = 2
   Directory 00 → Bucket A (Local Depth 2): [ 5, 6, 7 ]
   Directory 01 → Bucket E (Local Depth 2): [ 11 ]
   Directory 10 → Bucket C (Local Depth 2): [ 17, 22 ]
   Directory 11 → Bucket D (Local Depth 2): [ 24, 30 ]
• Again, the buckets that did not overflow remain unchanged.
• Insert the value 10. Its binary form is 01010. The global depth is 2, so the hash function returns the 2 MSBs of 01010, which are 01. Hence 10 is mapped to the directory entry with id 01.
• Insert the value 21. Its binary form is 10101. The 2 MSBs are 10, so 21 is mapped to the directory entry with id 10.
• Insert the value 27. Its binary form is 11011. The 2 MSBs are 11, so 27 is mapped to the directory entry with id 11.

   Global Depth = 2
   Directory 00 → Bucket A (Local Depth 2): [ 5, 6, 7 ]
   Directory 01 → Bucket E (Local Depth 2): [ 11, 10 ]
   Directory 10 → Bucket C (Local Depth 2): [ 17, 22, 21 ]
   Directory 11 → Bucket D (Local Depth 2): [ 24, 30, 27 ]

• All of 17, 5, 6, 22, 24, 11, 30, 7, 10, 21, 27 are now inserted.
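The final placement can be checked independently: at global depth 2, each 5-bit key belongs in the directory entry given by its two MSBs (`key >>> 3`). A minimal self-check of the result above (class and method names are illustrative):

```java
class FinalPlacementCheck {
    /** Directory entry of a 5-bit key at global depth 2: its two MSBs. */
    static int entry(int key) { return key >>> 3; }

    /** True iff every key sits in the bucket whose directory id equals its 2 MSBs. */
    static boolean consistent(int[][] buckets) {
        for (int dir = 0; dir < buckets.length; dir++)
            for (int key : buckets[dir])
                if (entry(key) != dir) return false;
        return true;
    }
}
```

For instance, `entry(22)` is 2 (22 = 10110, two MSBs 10), which is exactly the bucket that holds 17, 22, 21.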
Extendible hashing advantages & limitations
Observed points:
• A bucket has more than one directory entry pointing to it whenever its local depth is less than the global depth.
• When an overflow occurs in a bucket, all the entries of that bucket are rehashed with respect to a new (incremented) local depth.
• The size of a bucket cannot be changed after the data insertion process begins.
Advantages:
• Data retrieval needs few computations.
• The storage capacity grows dynamically.
• When the hash function changes dynamically (the depth grows), the affected old values are rehashed with respect to the new hash function.
Limitations:
• The directory size may increase significantly if many records hash to the same directory entry, i.e., when the record distribution is non-uniform.
• The size of every bucket is fixed.
• Memory is wasted on pointers when the difference between the global depth and the local depths becomes large.