UNIT 10 HASHING
Structure
10.0 Introduction
10.1 Objectives
10.2 Drivers and Motivations for Hashing
10.3 Index Mapping
10.3.1 Challenges with Index Mapping
10.3.2 Hash Function
10.3.3 Simple Hash Example
10.4 Collision Resolution
10.4.1 Separate Chaining
10.4.2 Open Addressing
10.4.3 Double Hashing
10.5 Comparison of Collision Resolution Methods
10.6 Load Factor and Rehashing
10.6.1 Rehashing
10.7 Summary
10.8 Solutions/Answers
10.9 Further Readings
10.0 INTRODUCTION
Hashing is a key technique in information retrieval. It transforms the input data
into a small set of keys that can be stored and retrieved efficiently, and it provides
constant-time, highly efficient retrieval irrespective of the size of the total search
space.
10.1 OBJECTIVES
As part of searching and information retrieval, we use hashing mainly for the
following reasons:
• To provide constant-time data retrieval and insertion
• To manage the data related to a large set of input keys efficiently
• To provide cost-efficient hash key computations
As we can see from Figure 2, instead of creating a huge hash table of size 9875, we
have now managed to store the elements within an array of size 10. The hash function
has converted the input into a smaller set of keys that are used as indexes into the hash
table.
Let us re-look at the two challenges we saw with the earlier direct access method. The
examples we discussed in Figure 2 use integer values as input keys. If we use non-
integer values such as images or strings, we first need to convert them into a non-negative
integer value. For instance, using the corresponding ASCII value of each of its
characters, we can get a numeric value for a string. We can then use the hash function
to create a fingerprint for the numeric value and store the corresponding details in a
right-sized hash table.
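To make this concrete, given below is a minimal sketch of converting a string into a
non-negative integer by summing its character codes and then compressing it with a
mod-based hash function. The class name, the helper names stringToKey and hash, and
the table size of 10 are illustrative assumptions, not part of the unit's own code.

// Illustrative sketch: convert a string key into a non-negative integer by
// summing its character (ASCII/Unicode) codes, then compress it with a
// mod-based hash function.
public class SimpleStringHash {

    static final int TABLE_SIZE = 10;   // assumed hash table size

    // Sum of character codes gives a non-negative numeric value for the string.
    static int stringToKey(String input) {
        int key = 0;
        for (char c : input.toCharArray()) {
            key += c;                   // each character contributes its code
        }
        return key;
    }

    // Mod-based hash function maps the numeric key into the range 0..9.
    static int hash(int key) {
        return key % TABLE_SIZE;
    }

    public static void main(String[] args) {
        int key = stringToKey("user2");
        System.out.println("numeric key = " + key);
        System.out.println("hash value  = " + hash(key));
    }
}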
// the function returns the stored value for the input key
return this.hashArray[hashValue].value;

// the function adds the value for the given input key
this.hashArray[hashValue].value = inputValue;

// the function removes the stored value for the input key
this.hashArray[hashValue].value = null;
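The code fragments above appear to come from a small hash table class. Given below
is a self-contained sketch of such a class; the class name SimpleHashTable, the
DataItem wrapper and the method names get, put and remove are assumptions made
for illustration.

// A minimal sketch of the kind of hash table the fragments above belong to.
public class SimpleHashTable {

    // Each slot stores one value (no collision handling yet).
    static class DataItem {
        Integer value;
    }

    private final DataItem[] hashArray;

    public SimpleHashTable(int size) {
        hashArray = new DataItem[size];
        for (int i = 0; i < size; i++) {
            hashArray[i] = new DataItem();
        }
    }

    private int hash(int key) {
        return key % hashArray.length;   // simple mod-based hash function
    }

    // the function returns the stored value for the input key
    public Integer get(int key) {
        int hashValue = hash(key);
        return this.hashArray[hashValue].value;
    }

    // the function adds the value for the given input key
    public void put(int key, int inputValue) {
        int hashValue = hash(key);
        this.hashArray[hashValue].value = inputValue;
    }

    // the function removes the stored value for the input key
    public void remove(int key) {
        int hashValue = hash(key);
        this.hashArray[hashValue].value = null;
    }
}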
10.4 COLLISION RESOLUTION
For a large set of input keys, we might end up having the same hash value for two different
input values. For instance, let us consider our simple hash function h(x) = x mod 10.
As the hash function returns the remainder, the two input keys 24 and
4874 both produce the hash value 4. Such cases cause a collision, as both input keys 24
and 4874 compete for the same slot in the hash table.
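Given below is a small sketch that demonstrates this collision; the class and method
names are illustrative.

// With h(x) = x mod 10, the keys 24 and 4874 compete for the same slot.
public class CollisionDemo {
    static int hash(int x) {
        return x % 10;
    }

    public static void main(String[] args) {
        System.out.println(hash(24));    // prints 4
        System.out.println(hash(4874));  // prints 4 -> collision at slot 4
    }
}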
We discuss three key collision resolution techniques in subsequent sections.
10.4.1 Separate Chaining

The example in Figure 3 depicts collision handling using the modulo-based hash
function. The input values 0051 and 821 result in the same hash value of 1, so in the
hash table we chain the data values user2 and user5 at slot 1.
Insert Operation
Given below are the steps of the insert operation for a key k using the separate
chaining method:
1. Compute the hash value of the key k using the hash function.
2. Use the computed hash value as the index in the hash table to find the slot for
the key k.
3. Add the key k to the linked list at the slot.
Search operation
Given below are the steps for the search operation for key k:
1. Compute the hash value of the key k using the hash function.
2. Use the computed hash value as the index in the hash table to find the slot for
searching the key k.
3. Check all the elements in the linked list at the slot for a match with key k.
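Given below is a minimal sketch of the insert and search operations using separate
chaining, with each slot of the hash table holding a linked list. The class layout and
the names ChainedHashTable, insert and search are assumptions made for illustration.

import java.util.LinkedList;

// Separate chaining: each slot of the hash table holds a linked list, and
// colliding keys are chained in the list for their slot.
public class ChainedHashTable {

    private final LinkedList<Integer>[] table;

    @SuppressWarnings("unchecked")
    public ChainedHashTable(int size) {
        table = new LinkedList[size];
        for (int i = 0; i < size; i++) {
            table[i] = new LinkedList<>();
        }
    }

    private int hash(int key) {
        return key % table.length;          // step 1: compute the hash value
    }

    // Insert: find the slot for key k and append k to its linked list.
    public void insert(int k) {
        table[hash(k)].add(k);              // steps 2 and 3 of the insert operation
    }

    // Search: check every element of the linked list at the slot for key k.
    public boolean search(int k) {
        for (int element : table[hash(k)]) {
            if (element == k) {
                return true;                // match found
            }
        }
        return false;                       // key k is not in the table
    }
}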
10.4.2 Open Addressing

There are mainly two variants of open addressing – linear probing and quadratic
probing. In linear probing we sequentially iterate to the next available slot.
We define the linear probe for the ith iteration by the following equation:

h(k, i) = (h(k) + i) mod m

where h(k) is the initial hash value of the key k and m is the size of the hash table.
Linear probing leads to a situation known as “primary clustering”, wherein
consecutive occupied slots form a “cluster” of keys in the hash table. As the cluster size
grows, it impacts the efficiency of probing for the placement of the next key.
Quadratic probing makes larger jumps to avoid primary clustering. We define the
quadratic probe for the ith iteration by the following equation:

h(k, i) = (h(k) + i*i) mod m

As we can see from the equation, with every iteration the quadratic probing makes larger
jumps. Quadratic probing, however, encounters the “secondary clustering” problem:
keys that hash to the same initial slot follow exactly the same probe sequence.
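The difference between the two probe sequences can be seen by printing the slots they
would examine for a key whose initial slot is occupied. In the sketch below, the table
size of 10 and the base slot of 4 are arbitrary assumptions made for illustration.

// Compare the slots visited by linear and quadratic probing for the same
// initial hash value.
public class ProbeSequences {
    public static void main(String[] args) {
        int m = 10;        // hash table size (assumed)
        int base = 4;      // h(k), the initial slot for the key (assumed)

        for (int i = 0; i < 5; i++) {
            int linear = (base + i) % m;           // h(k, i) = (h(k) + i) mod m
            int quadratic = (base + i * i) % m;    // h(k, i) = (h(k) + i*i) mod m
            System.out.println("i=" + i + "  linear=" + linear
                    + "  quadratic=" + quadratic);
        }
    }
}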
Let us look at an example of linear probing to resolve collisions. Consider the
following set of keys [56, 1072, 97, 84, 60] and a hash table of size 5. When we
apply the mod-based hash function and start placing the keys in the appropriate slots,
we get the placement depicted in Figure 4.
The value 56 goes to position 1 due to the result of the (56 mod 5) operation.
Similarly, 1072 takes position 2. However, when we try to place the next element,
97, we end up with a collision at slot 2, so we find the next empty slot, slot 3, and
place 97 there. The remaining elements 84 and 60 go to positions 4 and 0 respectively,
based on their mod values.
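The placement described above can be reproduced with a short linear probing routine.
Given below is a sketch of such a routine; the array-based layout and the names used
are illustrative.

// Linear probing that reproduces the placement described above for the keys
// [56, 1072, 97, 84, 60] in a table of size 5.
public class LinearProbingDemo {

    static Integer[] table = new Integer[5];

    // Probe sequentially from h(k) until an empty slot is found.
    static void insert(int key) {
        int slot = key % table.length;
        while (table[slot] != null) {
            slot = (slot + 1) % table.length;   // move to the next slot
        }
        table[slot] = key;
    }

    public static void main(String[] args) {
        int[] keys = {56, 1072, 97, 84, 60};
        for (int key : keys) {
            insert(key);
        }
        // Expected layout: slot 0 -> 60, 1 -> 56, 2 -> 1072, 3 -> 97, 4 -> 84
        for (int i = 0; i < table.length; i++) {
            System.out.println("slot " + i + ": " + table[i]);
        }
    }
}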
10.4.3 Double Hashing

In double hashing, the first hash function finds the initial slot for the key and the
second hash function determines the size of the jumps for the probe. The ith probe is
defined as follows:

H(x, i) = (H1(x) + i*H2(x)) mod m

where m is the size of the hash table.
Let us look at an example of double hashing to understand it better. Consider a hash
table of size 5 and the following hash functions:
H1(x) = x mod 5
H2(x) = x mod 7
Let us try to insert two elements, 60 and 80, into the hash table. We can place the first
element, 60, at slot 0 based on the first hash function. When we try to insert the second
element, 80, we face a collision at slot 0. For the first iteration we apply double
hashing as follows:
H(80,1) = (0 + 1*3) mod 5 = 3
Hence, we now place the element 80 in slot 3 to avoid the collision, as depicted in
Figure 5.
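Given below is a sketch of double hashing using the two hash functions from this
example. Note that the sketch assumes H2(x) is non-zero for the keys being inserted, as
is the case for 60 and 80; the class and method names are illustrative.

// Double hashing with H1(x) = x mod 5 and H2(x) = x mod 7 on a table of size 5.
public class DoubleHashingDemo {

    static Integer[] table = new Integer[5];

    static int h1(int x) { return x % 5; }
    static int h2(int x) { return x % 7; }

    // i-th probe: H(x, i) = (H1(x) + i * H2(x)) mod table size.
    // Assumes h2(key) is non-zero, as in the example above.
    static void insert(int key) {
        int i = 0;
        int slot = h1(key);
        while (table[slot] != null) {
            i++;
            slot = (h1(key) + i * h2(key)) % table.length;
        }
        table[slot] = key;
    }

    public static void main(String[] args) {
        insert(60);   // H1(60) = 0, so 60 goes to slot 0
        insert(80);   // collision at slot 0; H(80,1) = (0 + 1*3) mod 5 = 3
        for (int i = 0; i < table.length; i++) {
            System.out.println("slot " + i + ": " + table[i]);
        }
    }
}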
10.5 COMPARISON OF COLLISION RESOLUTION METHODS

A comparison of linear probing, quadratic probing and double hashing is given below:

Probing method       ith probe                     Clustering behaviour
Linear probing       (h(k) + i) mod m              Suffers from primary clustering
Quadratic probing    (h(k) + i*i) mod m            Avoids primary clustering but suffers from secondary clustering
Double hashing       (H1(k) + i*H2(k)) mod m       Avoids both primary and secondary clustering, at the cost of a second hash computation
10.6 LOAD FACTOR AND REHASHING

The hash table provides constant time complexity for operations such as retrieval,
insertion and deletion when it holds fewer keys. As the number of keys grows, we run
out of vacant slots in the hash table, leading to collisions that impact the time
complexity. When collisions start happening, we need to re-adjust the hash table size
so that we can accommodate additional keys. The load factor defines the threshold at
which we should resize the hash table to maintain the constant time complexity.
Load factor is the ratio of the elements in the hash table to the total size of the hash
table. We define load factor as follows:
Load factor = (Number of keys stored in the hash table) / (Total size of the hash table)
In open addressing, as all the keys are stored within the hash table itself, the load
factor is <= 1. In the separate chaining method, as keys can be stored outside the hash
table, the load factor can exceed the value of 1.
If the load factor threshold is 0.75, then as soon as the hash table reaches 75% of its
size, we increase its capacity. For instance, let us consider a hash table of size 10 with
a load factor threshold of 0.75. We can insert seven hash keys into this hash table
without triggering a resize. As soon as we add the eighth key, the load factor becomes
0.80, which exceeds the configured threshold and triggers the hash table resize. We
normally double the hash table size during the resize operation.
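The resize trigger described above can be sketched as follows; the field and method
names (numberOfKeys, loadFactor, insert) and the omission of the actual key
placement are simplifications made for illustration.

// Sketch of the load factor check that decides when to resize.
public class LoadFactorCheck {

    private int tableSize = 10;           // current hash table size
    private int numberOfKeys = 0;         // keys currently stored
    private final double threshold = 0.75;

    // load factor = keys stored / total size of the hash table
    private double loadFactor() {
        return (double) numberOfKeys / tableSize;
    }

    public void insert(int key) {
        numberOfKeys++;
        // place the key in the table here (omitted in this sketch)
        if (loadFactor() > threshold) {
            tableSize = tableSize * 2;    // we normally double the size
            // rehash the existing keys into the larger table (see 10.6.1)
        }
    }
}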
10.6.1 Rehashing
When the load factor exceeds the configured value, we increase the size of the hash
table. Once we do so, we must also re-compute the hash values of the existing keys, as
the size of the hash table has changed. This process is called “rehashing”. Rehashing
is a costly exercise, especially if the number of keys is huge. Hence it is necessary to
carefully select an optimal initial size for the hash table.
Let us look at rehashing with an example. We have a hash table of size 4 with a load
factor threshold of 0.60. Let us start by inserting the elements 30, 31 and 32. We insert
30 at slot 2, 31 at slot 3 and 32 at slot 0. The insertion of 32 triggers the hash table
resize, as the load factor has breached the threshold of 0.60. As a result, we double the
hash table size to 8.
With the new hash table size, we need to recalculate the hash values of the already
inserted keys. Key 30 will now be placed in slot 6, key 31 will be placed in slot 7 and
key 32 in slot 0.
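Given below is a sketch of rehashing that reproduces this example, with a table of size
4, a load factor threshold of 0.60 and the keys 30, 31 and 32. Collision handling is
omitted and the names used are illustrative.

import java.util.Arrays;

// Rehashing: double the table size and recompute the slot of every existing key.
public class RehashingDemo {

    static Integer[] table = new Integer[4];
    static int count = 0;
    static final double THRESHOLD = 0.60;

    // Simple insert; collision handling is omitted in this sketch.
    static void insert(int key) {
        table[key % table.length] = key;
        count++;
        if ((double) count / table.length > THRESHOLD) {
            rehash();                                   // threshold breached
        }
    }

    static void rehash() {
        Integer[] old = table;
        table = new Integer[old.length * 2];
        for (Integer key : old) {
            if (key != null) {
                table[key % table.length] = key;        // new slot in the bigger table
            }
        }
    }

    public static void main(String[] args) {
        insert(30);   // slot 2 in the table of size 4
        insert(31);   // slot 3
        insert(32);   // slot 0; load factor 0.75 > 0.60 triggers rehashing
        System.out.println(Arrays.toString(table));
        // After rehashing (size 8): 30 -> slot 6, 31 -> slot 7, 32 -> slot 0
    }
}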
10.7 SUMMARY
In this unit, we started by discussing the main motivations for hashing. Hashing
allows us to store and retrieve large amounts of data efficiently.
Index mapping uses the input values as direct indexes into the hash table. Index
mapping requires a huge hash table, leading to inefficiencies. When we handle large
sets of input values we encounter collisions, where multiple input values compete for
the same slot in the hash table. The main collision resolution techniques are separate
chaining and open addressing. In separate chaining we chain the values that get
mapped to a slot. We use linear probing and quadratic probing as part of the open
addressing technique to find the next available slot. We use two hash functions as part
of double hashing. The load factor determines the trigger for resizing the hash table,
and once the hash table is resized, we re-compute the hash values of the existing keys
through rehashing.
10.8 SOLUTIONS/ANSWERS
☞ Check Your Progress – 1
1. Index mapping
2. Handling non-integer keys and large hash table size
3. Computing efficiency, uniform distribution, deterministic behaviour and minimal
collisions
4. O(1)
5. Linked List
6. True
7. Rehashing
10.9 FURTHER READINGS

Lafore, Robert. Data Structures and Algorithms in Java. Sams Publishing, 2017.
Karumanchi, Narasimha. Data Structures and Algorithms Made Easy: Data Structure
and Algorithmic Puzzles. Narasimha Karumanchi, 2011.
West, Douglas Brent. Introduction to Graph Theory. Vol. 2. Upper Saddle River:
Prentice Hall, 2001.
https://en.wikipedia.org/wiki/Hash_function#Trivial_hash_function
https://en.wikibooks.org/wiki/A-level_Computing/AQA/Paper_1/Fundamentals_of_data_structures/Hash_tables_and_hashing
https://ieeexplore.ieee.org/book/8039591