14 Hashing in Data Structures
Mrs. K Jyothsna Devi
SCOPE
Hashing – Search the Record
• Hashing generates a fixed-size output from a variable-size input using a mathematical (hash) function.
• Hashing determines an index or location for the storage of an item in a data structure.
1. Key: A key is a string or integer passed as input to the hash function; it determines the index at which the item is stored.
2. Hash Function: The hash function receives the input key and returns the index (hash index) of the hash table.
3. Hash Table: A hash table is a data structure that maps keys to values using a hash function.
   – The hash table stores the data associatively in an array, where each data value has its own unique index.
Hash Table & Hash Function Representation
[Figure: keys k1...k5 (actual keys K) from the universe U are mapped by the hash function h to indexed slots holding Value1...Value5; k2 and k5 hash to the same slot, h(k2) = h(k5).]
• A bucket in a hash file is a unit of storage (typically a disk block) that can hold one or more records.
• The hash table uses a hash function h to compute the slot for each key.
• Store the element in slot h(k).
• A hash function h transforms a key into an index in a hash table T[0...m-1], where m is the table size:
  h : U → {0, 1, ..., m - 1}
  h(k) = k mod m, i.e., k hashes to slot h(k)
Advantages:
• When the set of actual keys K is much smaller than the universe U, a hash table requires much less storage space.
  o It can reduce storage requirements to |K|.
  o It still achieves O(1) search time in the average case, though not the worst case.
• Complexity of computing a hash value with h: time O(n), where n is the key length, and space O(1).
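As a minimal sketch of the division method above (the table size m = 13 and the sample keys are illustrative choices, not from a specific figure):

```java
public class DivisionHash {
    // h(k) = k mod m maps a non-negative integer key into a slot in 0..m-1
    static int hash(int key, int m) {
        return key % m;
    }

    public static void main(String[] args) {
        int m = 13; // illustrative table size
        System.out.println(hash(69, m)); // prints 4
        System.out.println(hash(98, m)); // prints 7
    }
}
```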
Hashing Applications
• Keeping track of customers' account information at a bank
  – Search through records to check balances and perform transactions
• Keeping track of reservations on flights
  – Search to find empty seats, cancel or modify reservations
• Search engine
– Looks for all documents containing a given word
Hashing working principle
Problem Statement: Consider the strings {“hi”, “an”, “is”}. Store them in an array (table) using hashing.
• Values must be accessed (searched) or updated in the table quickly, i.e., in O(1) time. Ordering (sorting) of the strings in the table is not considered. Hence each given string acts as both the key and the value.
Compute the key to store the value at the corresponding index:
1. Apply the hash function to calculate the hash value, which acts as the index of the data structure where the value is stored.
2. Use the ASCII values of the characters: “a” = 97, “h” = 104, “i” = 105, “n” = 110, “s” = 115, etc.
3. The numerical value of a string is the sum of its characters:
   “hi” = 104 + 105 = 209,  “an” = 97 + 110 = 207,  “is” = 105 + 115 = 220
4. Now, assume a table of size 9 to store these strings.
5. The hash function is the sum of the characters in the key mod the table size.
6. Calculate the storage location (bucket) for each string as sum(string) mod 9:
   “hi”: 209 mod 9 = 2,  “an”: 207 mod 9 = 0,  “is”: 220 mod 9 = 4
  Index:  0    1    2    3    4    5    6    7    8
  Value:  an   -    hi   -    is   -    -    -    -
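The steps above can be sketched in Java (class and method names are my own):

```java
public class StringSumHash {
    // Hash = sum of the characters' ASCII codes, mod the table size
    static int hash(String s, int tableSize) {
        int sum = 0;
        for (int i = 0; i < s.length(); i++)
            sum += s.charAt(i);
        return sum % tableSize;
    }

    public static void main(String[] args) {
        String[] table = new String[9];            // table of size 9, as in the example
        for (String s : new String[]{"hi", "an", "is"})
            table[hash(s, table.length)] = s;      // "an" -> 0, "hi" -> 2, "is" -> 4
        System.out.println(table[2]);              // prints "hi"
    }
}
```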
• To insert a record into the structure, compute the hash value h(Ki) for the record's search key Ki and store the record in that bucket.
Implement Dictionary using Direct Addressing
Assumptions:
– Key values are distinct
– Each key is drawn from a universe U = {0, 1, ..., m - 1}
Goal:
– Store the items in an array, indexed by keys
Operations:
• insert(string word): store word
• deleteWord(string word): delete word
• find(string word): return word
Dictionary using Hashing
class node
{
    string key, value;
    node next;
    node(string key, string value)
    {
        this.key = key;
        this.value = value;
        next = NULL;
    }
};

class dictionary
{
    node head[MAX];
    dictionary()
    {
        for (int i = 0; i < MAX; i++)
            head[i] = NULL;
    }
    int hashf(string word);
    bool insert(string, string);
    string find(string word);
    bool deleteWord(string word);
};

int hashf(string word)
{
    int asciiSum = 0;
    for (int i = 0; i < word.length(); i++)
        asciiSum = asciiSum + word[i];
    return (asciiSum % 100);
}

string find(string word)
{
    int index = hashf(word);
    node start = head[index];
    if (start == NULL)
        return "-1";                 // no word is present at that index
    while (start != NULL)
    {
        if (start.key == word)
            return start.value;
        start = start.next;
    }
    return "-1";
}

bool deleteWord(string word)
{
    int index = hashf(word);
    node tmp = head[index];
    node prev = NULL;
    if (tmp == NULL)                 // no word is present at that index
        return false;
    while (tmp != NULL)              // walk the chain, unlinking the match
    {
        if (tmp.key == word)
        {
            if (prev == NULL)
                head[index] = tmp.next;
            else
                prev.next = tmp.next;
            return true;
        }
        prev = tmp;
        tmp = tmp.next;
    }
    return false;
}
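The pseudocode above can be written as runnable Java. This is a sketch, not the chapter's exact code: the chain is built by prepending, and "-1" is kept as the not-found sentinel:

```java
public class Dictionary {
    static final int MAX = 100;                 // bucket count, matching asciiSum % 100

    static class Node {
        String key, value;
        Node next;
        Node(String key, String value) { this.key = key; this.value = value; }
    }

    Node[] head = new Node[MAX];

    // Same hash as the pseudocode: ASCII sum of the characters, mod 100
    static int hashf(String word) {
        int asciiSum = 0;
        for (int i = 0; i < word.length(); i++)
            asciiSum += word.charAt(i);
        return asciiSum % MAX;
    }

    boolean insert(String key, String value) {
        int index = hashf(key);
        Node n = new Node(key, value);
        n.next = head[index];                   // prepend to the bucket's chain
        head[index] = n;
        return true;
    }

    String find(String word) {
        for (Node start = head[hashf(word)]; start != null; start = start.next)
            if (start.key.equals(word))
                return start.value;
        return "-1";                            // not-found sentinel, as in the pseudocode
    }

    boolean deleteWord(String word) {
        int index = hashf(word);
        Node tmp = head[index], prev = null;
        while (tmp != null) {
            if (tmp.key.equals(word)) {
                if (prev == null) head[index] = tmp.next;
                else prev.next = tmp.next;      // unlink the matching node
                return true;
            }
            prev = tmp;
            tmp = tmp.next;
        }
        return false;
    }

    public static void main(String[] args) {
        Dictionary d = new Dictionary();
        d.insert("hi", "a greeting");
        System.out.println(d.find("hi"));       // prints "a greeting"
        d.deleteWord("hi");
        System.out.println(d.find("hi"));       // prints "-1"
    }
}
```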
Dictionary using Hashing: complexity comparison

  Structure            Insert    Search
  direct addressing    O(1)      O(1)
  ordered array        O(n)      O(log n)
  ordered list         O(n)      O(n)
  unordered array      O(1)      O(n)
  unordered list       O(1)      O(n)
Static Hashing
• In static hashing, when a search-key value is provided, the hash function always computes the same address. It is used in databases.
E.g.,
• Generating the address for STUDENT_ID = 76 using the mod(5) hash function always results in the same bucket address 1: h(76) = 76 mod 5 = 1.
• The bucket address never changes here, i.e., the number of buckets in memory is constant.
• Insertion, deletion, update and search are done in constant time.
Disadvantage:
• Assume the bucket corresponding to the hash key is not free when inserting new data into the table, i.e., data already exists at that address.
• This is called bucket overflow, also called a collision.
• This issue is overcome by Open Hashing and Closed Hashing.
Collisions
• Hashing can generate the same hash value for two different input keys; the newly inserted key then maps to an already occupied storage location. This must be handled using collision-handling methods.
• E.g.,
  “is” = 105 + 115 = 220; 220 mod 9 = 4
  “si” = 115 + 105 = 220; 220 mod 9 = 4
• Both strings {“is”, “si”} have the same hash value, so both would be stored at the same storage location.
• This creates problems in searching, insertion, deletion, and updating of values: the second string collides with the first when stored using the hash value.
[Figure: hash table with buckets; k2 and k5 hash to the same slot, h(k2) = h(k5), so Value2 and Value5 collide in one bucket.]
Collisions
• Two or more keys hash to the same slot!!
– If |K| ≤ m, collisions may or may not happen, depending on the hash function
– If |K| > m, collisions will definitely happen (i.e., there must be at least two
keys that have the same hash value)
• Avoiding collisions completely is hard, even with a good hash function
Collision Resolution Methods
• Separate Chaining (Open Hashing)
• Open Addressing (Closed Hashing): linear probing, quadratic probing, double hashing
Separate Chaining (Open Hashing)
import java.util.ArrayList;

class HashNode<K, V> {
    K key;
    V value;
    HashNode<K, V> next;
    public HashNode(K key, V value) {
        this.key = key;
        this.value = value;
    }
}

// hash table
class Map<K, V> {
    private ArrayList<HashNode<K, V>> bucketArray; // bucket array
    private int numBuckets;                        // current capacity of the array list
    private int size;

    public Map() {
        bucketArray = new ArrayList<>();
        numBuckets = 10;
        size = 0;
        for (int i = 0; i < numBuckets; i++)       // empty chains
            bucketArray.add(null);
    }

    // hash function to find the index for a key
    private int getBucketIndex(K key) {
        int hashCode = key.hashCode();
        int index = Math.abs(hashCode % numBuckets); // abs: hashCode may be negative
        return index;
    }

    // Returns the value for a key
    public V get(K key) {
        // Find head of chain for given key
        int bucketIndex = getBucketIndex(key);
        HashNode<K, V> head = bucketArray.get(bucketIndex);
        while (head != null) {                     // search key in chain
            if (head.key.equals(key))
                return head.value;
            head = head.next;
        }
        return null;                               // key not found
    }

    // Removes a given key
    public V remove(K key) {
        int bucketIndex = getBucketIndex(key);     // find index for given key
        HashNode<K, V> head = bucketArray.get(bucketIndex); // head of chain
        HashNode<K, V> prev = null;
        while (head != null) {                     // search for key in its chain
            if (head.key.equals(key))              // key found
                break;
            prev = head;                           // else traverse the chain
            head = head.next;
        }
        if (head == null)                          // key was not there
            return null;
        size--;                                    // reduce size
        if (prev != null)                          // remove key
            prev.next = head.next;
        else
            bucketArray.set(bucketIndex, head.next);
        return head.value;
    }

    // Adds a key-value pair to the hash table
    public void add(K key, V value) {
        // Find head of chain for given key
        int bucketIndex = getBucketIndex(key);
        HashNode<K, V> head = bucketArray.get(bucketIndex);
        // Check if key is already present
        while (head != null) {
            if (head.key.equals(key)) {
                head.value = value;
                return;
            }
            head = head.next;
        }
        // Insert key in chain
        size++;
        head = bucketArray.get(bucketIndex);
        HashNode<K, V> newNode = new HashNode<K, V>(key, value);
        newNode.next = head;
        bucketArray.set(bucketIndex, newNode);
        // If load factor goes beyond threshold, double the hash table size
        if ((1.0 * size) / numBuckets >= 0.7) {
            ArrayList<HashNode<K, V>> temp = bucketArray;
            bucketArray = new ArrayList<>();
            numBuckets = 2 * numBuckets;
            size = 0;
            for (int i = 0; i < numBuckets; i++)
                bucketArray.add(null);
            for (HashNode<K, V> headNode : temp) {
                while (headNode != null) {
                    add(headNode.key, headNode.value);
                    headNode = headNode.next;
                }
            }
        }
    }
}
Separate Chaining (Open Hashing)
Advantages:
• Simple to implement.
• The hash table never fills up: more elements can always be added to a chain via the linked list.
• It is mostly used when the size of the data is unknown, i.e., the element count and how frequently keys may be inserted or deleted.
Limitations:
• The cache performance of chaining is not good as keys are stored using a linked list.
• Open addressing provides better cache performance as everything is stored in the same table.
• Wastage of Space (few slots of the hash table are never used)
• If the chain becomes long, then search time can become O(n) in the worst case
Open Addressing (Closed Hashing)
• If we have enough contiguous memory to store all the keys (m > N), we can store the keys in the table itself.
• No need to use linked lists anymore.
Generalized hash function for Open Addressing
• The hash function takes two arguments here:
  (i) the key value k, and (ii) the probe number p:
  h(k, p),  p = 0, 1, ..., m-1
• Probe sequence: <h(k,0), h(k,1), ..., h(k,m-1)>
Insert Operation
• The hash function computes the hash value for the key to be inserted; the hash value is then used as an index to store the key in the hash table.
• If a collision occurs, probing is performed until an empty bucket is found; once an empty bucket is found, the key is inserted there.
• Probing is performed in accordance with the technique used for open addressing.
• Search time depends on the length of the probe sequence.
Open Addressing (Closed Hashing)
Delete Operation
• The key is first searched and then deleted.
• After deleting the key, that particular bucket is marked as “deleted”.
• During insertion, buckets marked as “deleted” are treated as empty buckets.
• Probing wraps around to the start of the table when it runs past the last slot.
Linear probing: Inserting a key
• Apply linear probing to insert the sequence of keys 10, 30, 16, 44, 35 using the hash function “key mod 5”.
• Create an empty hash table of size 5, i.e., hash values range from 0 to 4.
  h(k, i) = (h1(k) + i) mod m ;  i = 0, 1, 2, ...
• h(10, 0) = 10 mod 5 = 0. Stored at index 0.
• h(30, 0) = 30 mod 5 = 0; slot 0 is filled, so a collision occurs. Next slot: h(30, 1) = (30+1) mod 5 = 1. Stored at index 1.
• h(16, 0) = 16 mod 5 = 1; slot 1 is filled, collision. Next slot: h(16, 1) = (16+1) mod 5 = 2. Stored at index 2.
• h(44, 0) = 44 mod 5 = 4. Stored at index 4.
• h(35, 0) = 35 mod 5 = 0; slot 0 filled, collision. h(35, 1) = (35+1) mod 5 = 1; filled, collision. h(35, 2) = (35+2) mod 5 = 2; filled, collision. h(35, 3) = (35+3) mod 5 = 3. Stored at index 3.
Final table:
  Index  Key
  0      10
  1      30
  2      16
  3      35
  4      44
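The worked example can be checked with a short sketch (class and method names are my own):

```java
public class LinearProbing {
    // h(k, i) = (k mod m + i) mod m: probe consecutive slots until one is empty
    static Integer[] insertAll(int[] keys, int m) {
        Integer[] table = new Integer[m];
        for (int k : keys) {
            for (int i = 0; i < m; i++) {      // at most m probes per key
                int slot = (k % m + i) % m;
                if (table[slot] == null) {
                    table[slot] = k;
                    break;
                }
            }
        }
        return table;
    }

    public static void main(String[] args) {
        Integer[] t = insertAll(new int[]{10, 30, 16, 44, 35}, 5);
        for (int i = 0; i < t.length; i++)
            System.out.println(i + " -> " + t[i]);  // 0->10, 1->30, 2->16, 3->35, 4->44
    }
}
```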
Linear probing: Searching for a key
• Three cases arise when probing a position (table as above: 0:10, 1:30, 2:16, 3:35, 4:44):
  (1) The position in the table is occupied by an element with an equal key: the search succeeds.
  (2) The position in the table is empty: the key is not present.
  (3) The position is occupied by a different element: probe the next higher index (wrapping around) until the element is found or an empty position is found.
• Problem with deletion:
  – A deleted slot cannot simply be marked as empty; otherwise later searches would stop at it and miss keys stored further along the probe sequence.
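The “deleted” marker idea can be sketched as follows; the DELETED sentinel object and the class layout are my own implementation choices:

```java
public class Tombstone {
    static final Object DELETED = new Object();   // tombstone marker, never equal to a key
    final Object[] table;

    Tombstone(int m) { table = new Object[m]; }

    void insert(int key) {
        for (int i = 0; i < table.length; i++) {
            int slot = (key % table.length + i) % table.length;
            if (table[slot] == null || table[slot] == DELETED) {  // reuse tombstoned slots
                table[slot] = key;
                return;
            }
        }
    }

    boolean search(int key) {
        for (int i = 0; i < table.length; i++) {
            int slot = (key % table.length + i) % table.length;
            if (table[slot] == null)
                return false;                     // a truly empty slot ends the probe
            if (table[slot].equals(key))
                return true;                      // tombstones are skipped, not terminal
        }
        return false;
    }

    boolean delete(int key) {
        for (int i = 0; i < table.length; i++) {
            int slot = (key % table.length + i) % table.length;
            if (table[slot] == null)
                return false;
            if (table[slot].equals(key)) {
                table[slot] = DELETED;            // mark, don't empty
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Tombstone t = new Tombstone(5);
        for (int k : new int[]{10, 30, 16, 44, 35}) t.insert(k);
        t.delete(30);
        System.out.println(t.search(16));  // true: the probe passes the tombstone at slot 1
    }
}
```

Deleting 30 and then searching for 16 still succeeds, because the probe skips the tombstone instead of stopping there.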
Quadratic probing: Open Addressing
• Quadratic probing starts from the original hash value and adds successive values of an arbitrary quadratic polynomial until an empty slot is found.
  h'(k) = k mod m
• It starts from the original hash location; if that location is not empty, it checks further slots.
• Let hash(x) be the slot index computed by the hash function and n the size of the hash table.
• If slot hash(x) % n is full, check (hash(x) + 1²) % n.
• If (hash(x) + 1²) % n is also full, check (hash(x) + 2²) % n.
• If (hash(x) + 2²) % n is also full, check (hash(x) + 3²) % n, and so on.
Example (m = 7, h'(k) = k mod 7): after inserting 30 (slot 2), 10 (slot 3), 46 (slot 4), and 16, which collides at slot 2, finds (2 + 1²) mod 7 = 3 full, and lands at (2 + 2²) mod 7 = 6:
  Index  Key
  0
  1
  2      30
  3      10
  4      46
  5
  6      16
• Insert 35: h(35) = 35 mod 7 = 0; slot 0 is empty, so 35 is stored at index 0.
  Index  Key
  0      35
  1
  2      30
  3      10
  4      46
  5
  6      16
Advantages:
• More efficient than linear probing for a closed hash table (it reduces primary clustering).
Disadvantage:
• Secondary clustering: two keys with the same initial hash value follow the same probe sequence and contend for the same locations.
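A sketch of the probe sequence above; the table size and keys match the worked example, while the class and method names are my own:

```java
public class QuadraticProbing {
    // Probe slots (k mod m + i*i) mod m for i = 0, 1, 2, ...
    static Integer[] insertAll(int[] keys, int m) {
        Integer[] table = new Integer[m];
        for (int k : keys) {
            for (int i = 0; i < m; i++) {
                int slot = (k % m + i * i) % m;
                if (table[slot] == null) {
                    table[slot] = k;
                    break;
                }
            }
        }
        return table;
    }

    public static void main(String[] args) {
        // 16 collides at slot 2 (with 30), probes 2+1=3 (full), then 2+4=6 (empty)
        Integer[] t = insertAll(new int[]{30, 10, 46, 16, 35}, 7);
        System.out.println(t[6]);   // prints 16
        System.out.println(t[0]);   // prints 35
    }
}
```

Note a known caveat: unlike linear probing, quadratic probing is not guaranteed to visit every slot, so an insertion into a nearly full table may fail even though empty slots exist; keeping the load factor below 1/2 with a prime table size avoids this.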
Double Hashing (Open Addressing)
• Double hashing applies two hash functions:
  (1) The first hash function determines the initial slot.
  (2) The second hash function determines the step (increment) of the probe sequence.
• The second hash function is used when the first causes a collision; it provides an offset index for storing the value.
  h1(k) = k mod m
  h2(k) = k mod p, where p is a prime number smaller than m
  h(k, i) = (h1(k) + i * h2(k)) mod m,  i = 0, 1, ...
• Initial probe: h1(k); the second probe is offset by h2(k) mod m, and so on.
• i is a non-negative integer indicating the collision number; m is the hash table size; k is the element/key being hashed.
• Double hashing can generate at most m² distinct probe sequences.
• Time complexity: O(n) in the worst case.
Double Hashing (Open Addressing): Example
• Insert the key sequence 69, 72, 79, 98, and 14 using double hashing.
• Given hash functions: h1(k) = k mod 13, with hash table size m = 13.
• E.g., h1(79) = 79 mod 13 = 1, so 79 is stored at index 1; keys whose first probe collides are placed using the offset h2.
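A sketch of double hashing with the key sequence above (m = 13). The step function used here is 1 + (k mod 11), an assumed variant of the slide's h2(k) = k mod p that guarantees a non-zero step size:

```java
public class DoubleHashing {
    static int h1(int k, int m) { return k % m; }
    static int h2(int k) { return 1 + (k % 11); }   // assumed: the +1 avoids a zero step

    // h(k, i) = (h1(k) + i * h2(k)) mod m
    static Integer[] insertAll(int[] keys, int m) {
        Integer[] table = new Integer[m];
        for (int k : keys) {
            for (int i = 0; i < m; i++) {
                int slot = (h1(k, m) + i * h2(k)) % m;
                if (table[slot] == null) {
                    table[slot] = k;
                    break;
                }
            }
        }
        return table;
    }

    public static void main(String[] args) {
        Integer[] t = insertAll(new int[]{69, 72, 79, 98, 14}, 13);
        // 98 collides with 72 at slot 7 and steps by h2(98) = 11 to slot 5;
        // 14 collides with 79 at slot 1, steps by h2(14) = 4 to slot 5 (taken), then 9
        for (int i = 0; i < t.length; i++)
            if (t[i] != null) System.out.println(i + " -> " + t[i]);
    }
}
```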
Need of Rehashing
• Rehashing is required when the load factor increases.
• The load factor increases as keys are inserted, and a high load factor increases the time complexity of operations.
• Generally, the time complexity of hash table operations is O(1).
• To keep the time complexity and the load factor of the hash table low, apply the rehashing technique.
Apply rehashing in the following conditions:
• When the table is half full.
• When the load factor reaches a certain level (e.g., load > 1/2).
• When an insertion fails.
• Heuristically: choose a load factor threshold and rehash when the threshold is breached.
Rehashing Working Principle
1. Check the load factor on each new insertion of an element into the hash table.
2. If the load factor is greater than its default value (0.75), perform rehashing.
3. Create a new bucketArray of double the size of the previous array.
4. Iterate over each element of the previous (old) bucketArray, and call the insert() method for each element.
5. The insert() method places all the elements into the newly created, larger bucketArray.
• Time complexity of rehashing n elements: n * O(1) + O(n) = O(n).
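Steps 1-5 above can be sketched with integer keys and chained buckets; the threshold, the small initial capacity, and the names are my own choices for the demo:

```java
import java.util.ArrayList;

public class RehashDemo {
    ArrayList<ArrayList<Integer>> buckets = new ArrayList<>();
    int size = 0;
    static final double THRESHOLD = 0.75;     // default load factor from the text

    RehashDemo() {
        for (int i = 0; i < 4; i++)           // small initial capacity for the demo
            buckets.add(new ArrayList<>());
    }

    void insert(int key) {
        buckets.get(key % buckets.size()).add(key);
        size++;
        if ((double) size / buckets.size() > THRESHOLD)  // steps 1-2: check on every insert
            rehash();
    }

    // Steps 3-5: double the bucket array and re-insert every old element
    void rehash() {
        ArrayList<ArrayList<Integer>> old = buckets;
        buckets = new ArrayList<>();
        for (int i = 0; i < 2 * old.size(); i++)
            buckets.add(new ArrayList<>());
        for (ArrayList<Integer> chain : old)
            for (int key : chain)
                buckets.get(key % buckets.size()).add(key);
    }

    int capacity() { return buckets.size(); }

    public static void main(String[] args) {
        RehashDemo d = new RehashDemo();
        for (int k = 1; k <= 4; k++) d.insert(k);
        System.out.println(d.capacity());     // prints 8: grew from 4 on the 4th insert
    }
}
```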
Load Factor in Rehashing
Load Factor
• The load factor is a measure that decides when to increase the HashMap or Hashtable capacity so that store and search operations keep O(1) complexity.
• The default load factor of a hash table is 0.75 (75% of the table size).
• The load factor decides when to increase the number of buckets used to store the keys.
• Higher load factor: lower space consumption but slower lookups.
• Lower load factor: larger space consumption than required by the number of elements.
Resize threshold = initial capacity of the hash table * load factor
E.g.,
• Initial capacity of the hash table = 16; default load factor = 0.75.
• Threshold = 16 * 0.75 = 12.
• Up to the 12th key, the hash table keeps its size of 16. Once the 13th key enters the hash table, it grows from 2^4 = 16 buckets to 2^5 = 32 buckets.
Rehashing: Load Factor Numerical Example
• Let the default bucket count of the hash table be 16 (initial capacity = 16).
• Default load factor = 0.75, so the resize threshold = 16 * 0.75 = 12.
• Insert the first element and check whether the hash table capacity needs to increase, using:
  load factor = number of elements in the hash table (m) / number of buckets (n)
• With 1 element and 16 buckets, the load factor for the 1st element = 1/16 = 0.0625.
• Compare the obtained load factor with the default load factor (0.75): 0.0625 < 0.75.
• The obtained value is below the default, so there is no need to increase the hash table size.
• Therefore, the hash table size does not need to increase until the 12th element, because the load factor for the 12th element = 12/16 = 0.75, which equals the default load factor.
• Once the 13th element is inserted, the hash table size is increased (doubled).
Dynamic Hashing or Extendible hashing
• Data buckets grow or shrink (are added or removed dynamically) as the number of records increases or decreases.
• Extendible hashing splits and combines buckets to match the data size, i.e., buckets are added and deleted on demand.
• The hash function produces a large number of values uniformly and randomly.
• Hash indices are typically a prefix (the most significant bits, MSBs) of the entire binary hash value.
• Alternatively, the least significant bits (LSBs) of the binary value can be used.
• More than one consecutive directory index can point to the same bucket:
  – such indices share a hash prefix that may be shorter than the length of the index.
7. Handling Overflow during Data Insertion: Handle bucket overflow with the appropriate procedure. First, compare the overflowing bucket's local depth with the global depth, then choose one of the cases below.
Case 1: If the local depth of the overflowing bucket is equal to the global depth, then both Directory Expansion and a Bucket Split must be performed. Increment the global depth and the local depth by 1, and assign the appropriate pointers. Directory expansion doubles the number of directory entries in the hash structure.
Case 2: If the local depth is less than the global depth, then only a Bucket Split takes place. Increment only the local depth by 1, and assign the appropriate pointers.
8. Rehashing of Split-Bucket Elements: The elements in the overflowing bucket that is split are rehashed with respect to the new depth of the directory.
Numerical Example: Extendible hashing
• Apply extendible hashing to insert the values 17, 5, 6, 22, 24, 11, 30, 7, 10, 21, 27 into storage.
• Let the bucket size be 3.
• Hash function: let the global depth be x; convert each value to binary and take its x most significant bits (MSBs).
• Solution: first calculate the binary form of each of the given numbers (the data contains integers):
  17 - 10001    30 - 11110
   5 - 00101     7 - 00111
   6 - 00110    10 - 01010
  22 - 10110    21 - 10101
  24 - 11000    27 - 11011
  11 - 01011
• Initially, the global depth and the local depth of every bucket are 1. The hashing frame is a directory with entries 0 and 1, each pointing to an empty bucket of local depth 1.
Numerical Example: Extendible hashing
• Insert 17. Binary 10001; the global depth is 1, so the hash function returns the MSB of 10001, which is 1. Hence 17 is mapped to the directory entry with id 1.
  Directory: 0 → [ ],  1 → [17]
• Insert 5. Binary 00101; the MSB of 00101 is 0. Hence 5 is mapped to the directory entry with id 0.
  Directory: 0 → [5],  1 → [17]
Numerical Example: Extendible hashing
• Insert 6. Binary 00110; the MSB is 0, so 6 is mapped to directory entry 0.
  Directory: 0 → [5, 6],  1 → [17]
• Insert 22. Binary 10110; the MSB is 1, so 22 is mapped to directory entry 1.
  Directory: 0 → [5, 6],  1 → [17, 22]
Numerical Example: Extendible hashing
• Insert 24. Binary 11000; the MSB is 1, so 24 is mapped to directory entry 1.
  Directory: 0 → [5, 6],  1 → [17, 22, 24]
• Insert 11. Binary 01011; the MSB is 0, so 11 is mapped to directory entry 0.
  Directory: 0 → [5, 6, 11],  1 → [17, 22, 24]
Numerical Example: Extendible hashing
• Insert 30. Binary 11110; the MSB is 1, but the bucket pointed to by directory entry 1 ([17, 22, 24]) is already full. Hence overflow occurs, with Local Depth = Global Depth (= 1).
• Per Case 1, both Directory Expansion and Bucket Split are performed: the global depth becomes 2, the directory doubles to entries 00, 01, 10, 11, and the overflowing bucket's elements [17, 22, 24, 30] are rehashed on 2 MSBs:
  17 (10001) → 10,  22 (10110) → 10,  24 (11000) → 11,  30 (11110) → 11
  Directory: 00 → [5, 6, 11],  01 → [5, 6, 11] (same bucket, local depth still 1),  10 → [17, 22],  11 → [24, 30]
• Insert 7. Binary 00111; its 2 MSBs are 00, but the bucket [5, 6, 11] is already full. Hence overflow occurs, this time with Local Depth (1) < Global Depth (2).
• As per Case 2 in Step 7, since Local Depth < Global Depth, the directory is not doubled; only the bucket is split, and its elements are rehashed on 2 MSBs:
  [5 (00101), 6 (00110), 11 (01011), 7 (00111)] → 00 gets [5, 6, 7], 01 gets [11]
  Directory: 00 → [5, 6, 7],  01 → [11],  10 → [17, 22],  11 → [24, 30] (all local depths now 2)
Numerical Example: Extendible hashing
• Observe that a bucket that did not overflow remains unchanged.
• Insert 10. Binary 01010; the global depth is 2, so the hash function returns the 2 MSBs of 01010, which are 01. Hence 10 is mapped to directory entry 01.
  Directory: 00 → [5, 6, 7],  01 → [11, 10],  10 → [17, 22],  11 → [24, 30]
• Insert 21. Binary 10101; its 2 MSBs are 10, so 21 is mapped to directory entry 10.
• Insert 27. Binary 11011; its 2 MSBs are 11, so 27 is mapped to directory entry 11.
  Directory: 00 → [5, 6, 7],  01 → [11, 10],  10 → [17, 22, 21],  11 → [24, 30, 27]
• All of 17, 5, 6, 22, 24, 11, 30, 7, 10, 21, 27 are now inserted.
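The MSB-based directory index used throughout the example can be written as a one-line bit operation; the 5-bit key width matches the example, and the names are my own:

```java
public class ExtendibleIndex {
    // Directory index = the globalDepth most-significant bits of a 5-bit key
    static int directoryIndex(int key, int globalDepth) {
        return (key >> (5 - globalDepth)) & ((1 << globalDepth) - 1);
    }

    public static void main(String[] args) {
        System.out.println(directoryIndex(17, 1));  // 17 = 10001 -> MSB 1, prints 1
        System.out.println(directoryIndex(10, 2));  // 10 = 01010 -> bits 01, prints 1
        System.out.println(directoryIndex(27, 2));  // 27 = 11011 -> bits 11, prints 3
    }
}
```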
Extendible hashing advantages & limitations
Observed points:
• A bucket has more than one pointer pointing to it if its local depth is less than the global depth.
• When an overflow occurs in a bucket, all the entries in that bucket are rehashed with a new local depth.
• The size of a bucket is fixed and cannot be changed after the data insertion process begins.
Advantages:
• Data retrieval needs fewer computations.
• The storage capacity increases dynamically.
• Due to dynamic changes in the hashing function, associated old values are rehashed with respect to the new hash function.
Limitations:
• The directory size may increase significantly if several records hash to the same directory entry while the record distribution is non-uniform.
• The size of every bucket is fixed.
• Memory is wasted on pointers when the difference between the global depth and the local depth becomes large.
References
1. Michael T. Goodrich and Roberto Tamassia, “Data Structures & Algorithms in Java”, Wiley, 2015.
2. Sartaj Sahni, “Data Structures, Algorithms and Applications in Java”, Second Edition, Universities Press.
3. Mark Allen Weiss, “Data Structures & Algorithm Analysis in Java”, Third Edition, Pearson.
4. https://www.javatpoint.com/rehashing-in-java