DS Module-X
Hash Tables
Hash functions
Applications
Hash Table Representation
A hash table is one of the most important data structures; it uses a special function, known as a hash function, that maps a given value to a key so that elements can be accessed faster.
(or)
Hashing is the process of mapping a large amount of data to a smaller table with the help of a hashing function.
(or)
Hashing is a technique used to uniquely identify a specific object from a group of similar objects.
Why Hashing? (Need for Hashing)
Suppose we want to design a system for storing employee records keyed by phone numbers, and we want the following queries to be performed efficiently:
Insert a phone number and corresponding information.
Search a phone number and fetch the information.
Delete a phone number and related information.
We can think of using the following data structures to maintain information about different phone numbers.
Array of phone numbers and records.
Linked List of phone numbers and records.
Balanced binary search tree with phone numbers as keys.
Direct Access Table.
For arrays and linked lists, we need to search in a linear fashion, which can be costly in practice.
With a balanced binary search tree, we get moderate search, insert, and delete times; all of these operations can be guaranteed to run in O(log n) time.
Another solution that one can think of is to use a direct access table, where we make a big array and use phone numbers as indexes into the array.
The first problem with this solution is that the extra space required is huge.
For example, if a phone number is n digits long, we need O(m * 10^n) space for the table, where m is the size of a pointer to a record.
Another problem is that an integer type in a programming language may not be able to store an n-digit number.
Due to the above limitations, a Direct Access Table cannot always be used.
Hashing is the solution that can be used in almost all such situations, and in practice it performs extremely well compared to the above data structures (array, linked list, balanced BST).
With hashing we get O(1) search time on average (under reasonable assumptions) and O(n) in the worst case.
Hashing is thus an improvement over the Direct Access Table.
A Hash table is a data structure that stores some information, and the information has basically two
main components, i.e., key and value. The hash table can be implemented with the help of an
associative array. The efficiency of mapping depends upon the efficiency of the hash function used
for mapping.
Hashing is also known as Hashing Algorithm or Message Digest Function.
Hashing allows us to update and retrieve any data entry in constant time O(1) on average.
Constant time O(1) means the operation does not depend on the size of the data.
It minimizes the number of comparisons while performing the search.
Hashing is used with a database to enable items to be retrieved more quickly.
It is used in the encryption and decryption of digital signatures.
In hashing, large keys are converted into small keys by using hash functions.
The values are then stored in a data structure called hash table.
The idea of hashing is to distribute entries (key/value pairs) uniformly across
an array.
Hashing Implementation
Hashing is implemented using two important concepts:
Hash Function -- An element's key is converted into an integer by using a hash function.
Hash Table -- The integer generated by the hash function is used as an index at which the original element is stored in the hash table.
Hash Table
A hash table is a type of data structure used for storing and accessing data very quickly.
Insertion of data into the table is based on a key value; hence every entry in the hash table is associated with some key.
Hash table or hash map is a data structure used to store key-value pairs
It is a collection of items stored to make it easy to find them later.
It uses a hash function to compute an index into an array of buckets or slots from which the desired value can
be found.
It is an array of lists, where each list is known as a bucket.
It contains values based on their keys.
In Java, for example, the Hashtable class implements the Map interface and extends the Dictionary class; it is synchronized and contains only unique keys.
Hash Function
A hash function is any function that can be used to map data of arbitrary size to fixed-size values.
A hash function is a function applied to a key to produce an integer, which can be used as an address in the hash table.
Hash function is a function that maps any big number or string to a small integer value.
A fixed process that converts a key to a hash key is known as a hash function.
This function takes a key and maps it to a value of a certain length which is called a Hash value or Hash.
The hash value represents the original string of characters, but it is normally smaller than the original.
The values returned by a hash function are called hash values, hash codes, hash sums, digests, or simply hashes.
These values are used to index a fixed-size table called a hash table.
Use of a hash function to index a hash table is called hashing or scatter storage addressing.
Characteristics of good hash function
To achieve a good hashing mechanism, It is important to have a good hash function with
the following basic requirements:
The hash function should generate different hash values even for similar strings.
The hash function should be easy to understand and simple to compute.
The hash function should distribute the keys uniformly over the array.
The number of collisions should be low when placing the data in the hash table.
A hash function is a perfect hash function when it maps distinct keys to distinct slots, i.e. it causes no collisions.
Types of hash function
There are various types of hash functions available such as-
Division Hash Function
Mid Square Hash Function
Multiplicative Hash Function
Digit Folding Hash Function
Digit Analysis
Division method
The division method (or remainder method) takes an item, divides it by the table size, and returns the remainder as its hash value:
h(key) = key mod table_size
i.e. key % table_size
Since it requires only a single division operation, hashing by division is quite fast.
It has been found that the best results with the division method are achieved when the table size is prime
After computing the hash values, we can insert each item into the hash table at the designated position
It is easy to search for an item using the hash function: it computes the slot for the item, and we then check the hash table to see if the item is present.
The hash function depends only on the remainder of a division.
For example, suppose the records 52, 68, 99, 84 are to be placed in a hash table, and let us take the table size to be 10.
Then, with h(key) = key % table_size:
h(52) = 52 % 10 = 2
h(68) = 68 % 10 = 8
h(99) = 99 % 10 = 9
h(84) = 84 % 10 = 4
In the hash table, 4 of the 10 slots are now occupied. This ratio is referred to as the load factor and denoted by λ = number of items / table size.
For this example, λ = 4/10 = 0.4.
With this hash function, many keys can end up with the same hash value. This is called a collision.
A prime not too close to an exact power of 2 is often a good choice for table_size.
For example, suppose string keys are interpreted as numbers in radix r = 256 and table_size = 17; then r % table_size, i.e. 256 % 17, equals 1, so a key's hash depends only on the sum of its digits.
So for key = 37599, the hash is
37599 % 17 = 12
But for key = 573, the hash is also
573 % 17 = 12
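A minimal sketch of the division method in C, using the records and table size from the example above (the function and variable names are illustrative):

    #include <stdio.h>

    /* Division method: the hash is the remainder of key / table_size. */
    unsigned int hash_division(unsigned int key, unsigned int table_size) {
        return key % table_size;
    }

    int main(void) {
        unsigned int keys[] = {52, 68, 99, 84};
        unsigned int table_size = 10;   /* table size from the example */
        for (int i = 0; i < 4; i++)
            printf("h(%u) = %u\n", keys[i], hash_division(keys[i], table_size));
        return 0;
    }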
Mid square method
In this method, the key is first squared, and then the middle part of the result is taken as the index.
The mid square method is a very good hash function. It involves squaring
the value of the key and then extracting the middle r digits as the hash
value. The value of r can be decided according to the size of the hash table.
Suppose the hash table has 100 memory locations. So r=2 because two
digits are required to map the key to memory location.
k = 50
k * k = 2500
Extracting the middle two digits of 2500 gives h(50) = 50.
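A small C sketch of the mid-square method under the assumptions of the example above (a table of 100 slots, so r = 2; extracting the middle of a 4-digit square is a simplification):

    #include <stdio.h>

    /* Mid-square method: square the key, then extract the middle r digits.
       Here r = 2, matching a table of 100 memory locations. This sketch
       assumes a 4-digit square; a general version would locate the middle
       digits based on the number of digits in k*k. */
    unsigned int hash_midsquare(unsigned int key) {
        unsigned long sq = (unsigned long)key * key;  /* e.g. 50*50 = 2500 */
        return (sq / 10) % 100;                       /* middle two digits */
    }

    int main(void) {
        printf("h(50) = %u\n", hash_midsquare(50));   /* 2500 -> 50 */
        return 0;
    }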
Multiplicative Hash Function
A faster but often misused alternative is multiplicative hashing, in which the hash index is computed as
h(k) = floor(m * frac(k * A))
Here k is again an integer hash code, m is the table size, frac is the function that returns the fractional part of a real number, and A is a real constant with 0 < A < 1. Donald Knuth suggested A = (√5 - 1)/2 ≈ 0.6180339887.
Multiplicative hashing sets the hash index from the fractional part of multiplying k by this real number.
Multiplicative hashing is cheaper than modular hashing because multiplication is usually considerably faster than division (or mod).
It also works well with a bucket array of size m = 2^p, which is convenient.
Worked example, with k = 45 and m = 25:
h(45) = floor(25 * frac(45 * 0.6180339887))
= floor(25 * frac(27.8115294915))
= floor(25 * 0.8115294915)
= floor(20.2882...)
h(45) = 20
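The same computation in C, a minimal sketch assuming Knuth's constant and the example values k = 45, m = 25:

    #include <stdio.h>
    #include <math.h>

    /* Multiplicative method: h(k) = floor(m * frac(k * A)),
       with A = (sqrt(5) - 1) / 2 ~= 0.6180339887 as suggested by Knuth. */
    unsigned int hash_multiplicative(unsigned int key, unsigned int m) {
        const double A = 0.6180339887;
        double prod = key * A;
        double frac = prod - floor(prod);   /* fractional part of k*A */
        return (unsigned int)(m * frac);
    }

    int main(void) {
        /* reproduces the worked example: k = 45, m = 25 -> h(45) = 20 */
        printf("h(45) = %u\n", hash_multiplicative(45, 25));
        return 0;
    }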
Digit Folding Hash Function
In this method the key is interpreted as an integer using some radix (say 10).
The integer is divided into segments, each segment except possibly the last having the same number of digits.
These segments are then added to obtain the home address.
Two folding methods are used; they are:
Fold shift
Fold boundary
Fold Shift
In fold shift the key value is divided into parts whose size matches the size of the required address. Then the
left and right parts are shifted and added with the middle part.
Fold Boundary
In fold boundary, the left and right numbers are folded on a fixed boundary between them and the center number; the two outside values are thus reversed.
If the need is to generate a 5-digit address, make units of 4 digits and add them, which will give a 4- or 5-digit address.
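A minimal C sketch of fold-shift hashing; the 2-digit group size, the sample key, and the table size are illustrative assumptions, not values from the notes:

    #include <stdio.h>

    /* Fold shift: split the key into fixed-size digit groups, add the
       groups, and take the sum modulo the table size as the home address. */
    unsigned int hash_fold_shift(unsigned long key, unsigned int table_size) {
        unsigned long sum = 0;
        while (key > 0) {
            sum += key % 100;   /* take the low two-digit group */
            key /= 100;         /* shift to the next group */
        }
        return sum % table_size;
    }

    int main(void) {
        /* key 123456 -> groups 12, 34, 56 -> sum 102 -> 102 % 100 = 2 */
        printf("h(123456) = %u\n", hash_fold_shift(123456UL, 100));
        return 0;
    }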
Digit Analysis
Digit analysis is used with static files. A static file is one in which all the identifiers are known in advance.
Using this method, we first transform the identifiers into numbers using some radix, r.
We then examine the digits of each identifier, deleting those digits that have the most skewed distributions.
We continue deleting digits until the number of remaining digits is small enough to give an address in the range of
the hash table.
The digits used to calculate the hash address must be the same for all identifiers and must not have abnormally
high peaks or valleys (the standard deviation must be small).
Identifiers:                                     Abrd  gftr  poid  werd
Numeric forms:                                   2465  2685  2975  2535
Delete the 4th digit (all 5s, most skewed):      246   268   297   253
Delete the 1st digit (all 2s, most skewed):      46    68    97    53
Delete one more digit to fit the address range:  6     8     7     3
Collision
A situation in which the resultant hashes for two or more data elements in the data set U map to the same location in the hash table is called a hash collision.
In such a situation two or more data elements would qualify to be stored/mapped to the same location
in the hash table.
In computer science, a collision or clash is a situation that occurs when two distinct pieces of data have the same hash value.
Hash collisions are practically unavoidable when hashing a random subset of a large set of possible keys; for example, if 2,450 keys are hashed into a million buckets, then according to the birthday problem there is approximately a 95% chance of at least two of the keys being hashed to the same slot.
Because of this, almost all hash table implementations have some collision resolution strategy to handle such events.
The impact of collisions depends on the application.
It is practically never possible to find a hash function that computes unique hashes for each element in the data set U.
Collision Resolution Techniques
Collision resolution techniques are the techniques used for resolving or handling collisions.
Collision resolution techniques are classified as separate chaining and open addressing.
Separate Chaining
This technique attaches a linked list to the slot for which a collision occurs.
The new key is then inserted into that linked list.
These linked lists hanging off the slots look like chains.
That is why this technique is called separate chaining.
Let us consider a simple hash function as “key mod 7” and sequence of keys as 50, 700, 76, 85, 92, 73, 101.
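A minimal C sketch of separate chaining for this example (hash function key mod 7; the names and the small driver are illustrative):

    #include <stdio.h>
    #include <stdlib.h>

    #define TABLE_SIZE 7

    /* Each slot holds the head of a singly linked chain of keys. */
    struct node { int key; struct node *next; };
    struct node *table[TABLE_SIZE];      /* heads start out NULL */

    void insert(int key) {
        int idx = key % TABLE_SIZE;      /* hash function: key mod 7 */
        struct node *n = malloc(sizeof *n);
        n->key = key;
        n->next = table[idx];            /* prepend to the slot's chain */
        table[idx] = n;
    }

    int search(int key) {
        for (struct node *p = table[key % TABLE_SIZE]; p; p = p->next)
            if (p->key == key) return 1;
        return 0;
    }

    int main(void) {
        int keys[] = {50, 700, 76, 85, 92, 73, 101};
        for (int i = 0; i < 7; i++) insert(keys[i]);
        printf("search(76) = %d, search(13) = %d\n", search(76), search(13));
        return 0;
    }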
Advantages:
1) Simple to implement.
2) Hash table never fills up, we can always add more elements to chain.
3) Less sensitive to the hash function or load factors.
4) It is mostly used when it is unknown how many and how frequently keys may be inserted or
deleted.
Disadvantages:
1) Cache performance of chaining is not good, as keys are stored in linked lists. Open addressing provides better cache performance, as everything is stored in the same table.
2) Wastage of Space (Some Parts of hash table are never used)
3) If the chain becomes long, then search time can become O(n) in worst case.
4) Uses extra space for links.
Time Complexity
For Searching-
In worst case, all the keys might map to the same bucket of the hash
table.
In such a case, all the keys will be present in a single linked list.
Sequential search will have to be performed on the linked list to perform
the search.
So, time taken for searching in worst case is O(n).
For Deletion-
In worst case, the key might have to be searched first and
then deleted.
In worst case, time taken for searching is O(n).
So, time taken for deletion in worst case is O(n).
Open Addressing
Like separate chaining, open addressing is a method for handling
collisions.
In open addressing, all elements are stored in the hash table itself.
So at any point, the size of the table must be greater than or equal to the total number of keys.
The three common probe sequences are:
Linear Probing
Quadratic Probing
Double Hashing
Linear Probing
In linear probing,
When collision occurs, we linearly probe for the next bucket.
We keep probing until an empty bucket is found.
Let us consider a simple hash function as “key mod 7” and sequence of
keys as 50, 700, 76, 85, 92, 73, 101.
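A minimal C sketch of linear probing for the same example (key mod 7 and the key sequence above); the EMPTY sentinel and the driver are illustrative:

    #include <stdio.h>

    #define TABLE_SIZE 7
    #define EMPTY -1

    int table[TABLE_SIZE] = {EMPTY, EMPTY, EMPTY, EMPTY, EMPTY, EMPTY, EMPTY};

    /* Linear probing: on collision, try the next slot (wrapping around)
       until an empty one is found. Returns the slot used, or -1 if full. */
    int insert(int key) {
        int h = key % TABLE_SIZE;
        for (int i = 0; i < TABLE_SIZE; i++) {
            int slot = (h + i) % TABLE_SIZE;
            if (table[slot] == EMPTY) { table[slot] = key; return slot; }
        }
        return -1;   /* table is full */
    }

    int main(void) {
        int keys[] = {50, 700, 76, 85, 92, 73, 101};
        for (int i = 0; i < 7; i++)
            printf("%d -> slot %d\n", keys[i], insert(keys[i]));
        return 0;
    }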
Advantage-
It is easy to compute.
Linear probing has the best cache performance
Disadvantages-
Primary Clustering: One of the problems with linear probing is primary clustering: many consecutive elements form groups, and it starts taking time to find a free slot or to search for an element.
Secondary Clustering: Secondary clustering is less severe; two records only have the same collision chain (probe sequence) if their initial position is the same.
Quadratic Probing
Quadratic probing is similar to linear probing, but the interval between probes grows quadratically.
How Quadratic Probing is done?
Let hash(K) be the slot index computed using the hash function.
If the slot hash(K) % S is full, then we try (hash(K) + 1*1) % S.
If (hash(K) + 1*1) % S is also full, then we try (hash(K) + 2*2) % S.
If (hash(K) + 2*2) % S is also full, then we try (hash(K) + 3*3) % S.
This process is repeated for all the values of i until an empty slot is found.
With linear probing we know that we will always find an open spot if one
exists (It might be a long search but we will find it).
However, this is not the case with quadratic probing unless you take care in
the choosing of the table size.
In order to guarantee that our quadratic probes will eventually hit every available slot, the table size must meet these requirements:
Be a prime number
Never be more than half full (even by one element)
For example, with table size S = 7 and hash(K) = K % S, the probe sequences (K + i*i) % S look like this:
Key 48: 48 % 7 = 6; further probes (48+1) % 7 = 0, (48+4) % 7 = 3, ...
Key 5: 5 % 7 = 5; further probes (5+1) % 7 = 6, (5+4) % 7 = 2, ...
Key 55: 55 % 7 = 6; further probes (55+1) % 7 = 0, (55+4) % 7 = 3, ...
Key 7: 7 % 7 = 0; further probes (7+1) % 7 = 1, (7+4) % 7 = 4, ...
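A minimal C sketch of quadratic probing using the keys from the example above (table size 7; the EMPTY sentinel and the driver are illustrative):

    #include <stdio.h>

    #define TABLE_SIZE 7
    #define EMPTY -1

    int table[TABLE_SIZE] = {EMPTY, EMPTY, EMPTY, EMPTY, EMPTY, EMPTY, EMPTY};

    /* Quadratic probing: probe (h + i*i) % S for i = 0, 1, 2, ...
       The guarantees discussed above (prime table size, at most half full)
       are the caller's responsibility in this sketch. */
    int insert(int key) {
        int h = key % TABLE_SIZE;
        for (int i = 0; i < TABLE_SIZE; i++) {
            int slot = (h + i * i) % TABLE_SIZE;
            if (table[slot] == EMPTY) { table[slot] = key; return slot; }
        }
        return -1;   /* no empty slot found along this probe sequence */
    }

    int main(void) {
        int keys[] = {48, 5, 55, 7};   /* 55 and 7 collide and get probed on */
        for (int i = 0; i < 4; i++)
            printf("%d -> slot %d\n", keys[i], insert(keys[i]));
        return 0;
    }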
Advantages
Quadratic probing is a very simple and fast way to reduce the clustering problem of linear probing.
It is normally used only when the table size is prime (which may also be good for other reasons).
Though quadratic probing isn't perfect, it does offer some advantages over alternatives:
Quadratic probing can be a more efficient algorithm in a closed (open-addressed) hash table, since it better avoids the clustering problem that can occur with linear probing, although it is not immune.
Simpler logic for storage management.
Reduced storage requirement in general.
Disadvantages
It suffers from secondary clustering: two keys have the same probe sequence whenever they hash to the same initial location.
Double hashing
Double hashing is a collision resolving technique in Open Addressed Hash tables.
Double hashing uses the idea of applying a second hash function to key when a collision occurs.
The result of the second hash function will be the number of positions from the point of collision to
insert.
There are a couple of requirements for the second function:
it must never evaluate to 0
must make sure that all cells can be probed
The first hash function is typically hash1(key) = key % TABLE_SIZE.
A popular second hash function is hash2(key) = R - (key % R), where R is a prime number smaller than TABLE_SIZE.
Advantages-
It drastically reduces clustering.
It requires fewer comparisons.
Double hashing is useful if an application requires a smaller hash table, since it can effectively find the next free slot faster than the linear probing approach.
Disadvantages
As the table fills up the performance degrades.
computational cost may be high
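A minimal C sketch of double hashing with hash1(key) = key % TABLE_SIZE and hash2(key) = R - (key % R); the choices TABLE_SIZE = 13, R = 7, and the sample keys are illustrative:

    #include <stdio.h>

    #define TABLE_SIZE 13
    #define R 7            /* a prime smaller than TABLE_SIZE */
    #define EMPTY -1

    int table[TABLE_SIZE];

    int hash1(int key) { return key % TABLE_SIZE; }
    int hash2(int key) { return R - (key % R); }   /* never evaluates to 0 */

    /* Double hashing: probe (hash1 + i * hash2) % TABLE_SIZE for i = 0, 1, ... */
    int insert(int key) {
        int h1 = hash1(key), h2 = hash2(key);
        for (int i = 0; i < TABLE_SIZE; i++) {
            int slot = (h1 + i * h2) % TABLE_SIZE;
            if (table[slot] == EMPTY) { table[slot] = key; return slot; }
        }
        return -1;   /* table is full */
    }

    int main(void) {
        for (int i = 0; i < TABLE_SIZE; i++) table[i] = EMPTY;
        int keys[] = {19, 27, 36, 10, 64};   /* 10 collides with 36 at slot 10 */
        for (int i = 0; i < 5; i++)
            printf("%d -> slot %d\n", keys[i], insert(keys[i]));
        return 0;
    }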
Rehashing
As the name suggests, rehashing means hashing again
Once the hash table gets too full, the running time for operations will start to take too long and may fail.
To solve this problem, a table at least twice the size of the original will be built and the elements will be transferred to the new
table.
Basically, when the load factor increases beyond its pre-defined value (the default load factor is 0.75), the complexity of operations increases.
So to overcome this, the size of the array is increased (doubled) and all the values are hashed again and stored in the new, doubled table.
The new size of the hash table:
should also be prime
will be used to calculate the new insertion spot (hence the name rehashing)
This is a very expensive operation! O(N) since there are N elements to
rehash and the table size is roughly 2N.
This is ok though since it doesn't happen that often.
The question becomes: when should rehashing be applied? Some possible answers:
once the table becomes half full
once an insertion fails
once a specific load factor has been reached, where the load factor is the ratio of the number of elements in the hash table to the table size
In every case, the purpose is to reduce the load factor and, with it, the time complexity.
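A hedged C sketch of rehashing: when an insertion would push the load factor above 0.75, the table is doubled and every key is re-inserted under the new size. A production version would grow to the next prime rather than exactly 2n; all names here are illustrative:

    #include <stdio.h>
    #include <stdlib.h>

    #define EMPTY -1
    #define MAX_LOAD 0.75

    typedef struct { int *slots; int size; int count; } HashTable;

    void insert(HashTable *t, int key);   /* forward declaration */

    void rehash(HashTable *t) {
        int old_size = t->size;
        int *old = t->slots;
        t->size = old_size * 2;                    /* doubled table size */
        t->slots = malloc(t->size * sizeof(int));
        for (int i = 0; i < t->size; i++) t->slots[i] = EMPTY;
        t->count = 0;
        for (int i = 0; i < old_size; i++)         /* O(n): re-insert all keys */
            if (old[i] != EMPTY) insert(t, old[i]);
        free(old);
    }

    void insert(HashTable *t, int key) {
        if ((double)(t->count + 1) / t->size > MAX_LOAD) rehash(t);
        int slot = key % t->size;                  /* linear probing on collision */
        while (t->slots[slot] != EMPTY) slot = (slot + 1) % t->size;
        t->slots[slot] = key;
        t->count++;
    }

    int main(void) {
        HashTable t = { malloc(4 * sizeof(int)), 4, 0 };
        for (int i = 0; i < 4; i++) t.slots[i] = EMPTY;
        for (int k = 1; k <= 10; k++) insert(&t, k * 7);
        printf("size after growth: %d, keys stored: %d\n", t.size, t.count);
        return 0;
    }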
Extendible Hashing
Extendible Hashing is a dynamic hashing method wherein directories, and buckets are
It is an aggressively flexible method in which the hash function also experiences dynamic
changes.
Main features of Extendible Hashing: The main features in this hashing technique are:
Directories: The directories store addresses of the buckets in pointers. An id is assigned to
each directory which may change each time when Directory Expansion takes place.
Buckets: The buckets are used to hash the actual data.
Frequently used terms in Extendible Hashing:
Directories: These containers store pointers to buckets. Each directory is given a unique id which
may change each time when expansion takes place. The hash function returns this directory id
which is used to navigate to the appropriate bucket. Number of Directories = 2^Global Depth.
Buckets: They store the hashed keys. Directories point to buckets. A bucket may have more than one pointer to it if its local depth is less than the global depth.
Global Depth: It is associated with the directories. It denotes the number of bits used by the hash function to categorize the keys. Global Depth = number of bits in the directory id.
Local Depth: It is the same as the global depth, except that the local depth is associated with the buckets rather than the directories. The local depth, considered together with the global depth, is used to decide what action to perform when an overflow occurs. The local depth is always less than or equal to the global depth.
Bucket Splitting: When the number of elements in a bucket exceeds a particular size, then
the bucket is split into two parts.
Directory Expansion: Directory Expansion Takes place when a bucket overflows.
Directory Expansion is performed when the local depth of the overflowing bucket is equal
to the global depth.
Basic Working of Extendible Hashing:
Steps to Follow
Step 1 – Analyze Data Elements: Data elements may exist in various forms, e.g. integer, string, float, etc. For this walkthrough, let us consider data elements of type integer, e.g. 49.
Step 2 – Convert into binary format: Convert the data element into binary form. For string elements, consider the ASCII equivalent integer of the starting character and then convert that integer into binary form. Since we have 49 as our data element, its binary form is 110001.
Step 3 – Check Global Depth of the directory. Suppose the global depth of the Hash-directory is 3.
Step 4 – Identify the Directory: Consider the ‘Global-Depth’ number of LSBs in the binary number and match it to the directory
id.
Eg. The binary obtained is: 110001 and the global-depth is 3. So, the hash function will return 3 LSBs of 110001 viz. 001.
Step 5 – Navigation: Now, navigate to the bucket pointed by the directory with directory-id 001.
Step 6 – Insertion and Overflow Check: Insert the element and check if the bucket overflows. If an overflow is encountered, go to Step 7 followed by Step 8; otherwise, go to Step 9.
Step 7 – Tackling the Overflow Condition: First, check if the local depth is less than or equal to the global depth. Then choose one of the cases below.
Case1: If the local depth of the overflowing Bucket is equal to the global depth, then Directory Expansion, as well as Bucket
Split, needs to be performed. Then increment the global depth and the local depth value by 1. And, assign appropriate pointers.
Directory expansion will double the number of directories present in the hash structure.
Case2: In case the local depth is less than the global depth, then only Bucket Split takes place. Then increment only the local
depth value by 1. And, assign appropriate pointers.
Step 8 – Rehashing of Split Bucket Elements: The Elements present in the overflowing bucket that is split are rehashed w.r.t
the new global depth of the directory.
Step 9 – The element is successfully hashed
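A small C sketch of the directory-id step only (Steps 3-5): the global-depth least-significant bits of the key's binary form select the directory. The directory and bucket structures themselves are omitted:

    #include <stdio.h>

    /* Extendible hashing keeps 2^global_depth directories; the hash function
       returns the global_depth least-significant bits of the key. */
    unsigned int directory_id(unsigned int key, unsigned int global_depth) {
        return key & ((1u << global_depth) - 1);   /* keep the low bits */
    }

    int main(void) {
        /* 49 is 110001 in binary; with global depth 3, the 3 LSBs are 001 = 1 */
        printf("directory id of 49 at depth 3: %u\n", directory_id(49, 3));
        return 0;
    }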
Example
let us consider a prominent example of hashing the following elements: 16,4,6,22,24,10,31,7,9,20,26.
Bucket Size: 3 (Assume)
Hash Function: Suppose the global depth is X. Then the Hash Function returns X LSBs.
First, calculate the binary forms of each of the given numbers.
4- 00100
16- 10000
6- 00110
22- 10110
24- 11000
10- 01010
31- 11111
7- 00111
9- 01001
20- 10100
26- 11010
Initially, the global-depth and local-depth is always 1. Thus, the hashing
frame looks like this:
Inserting 16:
The binary format of 16 is 10000 and the global depth is 1. The hash function returns the 1 LSB of 10000, which is 0. Hence, 16 is mapped to the directory with id 0.
Inserting 4 and 6:
Both 4 (00100) and 6 (00110) have 0 as their LSB. Hence, they are hashed as follows:
Inserting 22: The binary form of 22 is 10110. Its LSB is 0. The bucket pointed to by directory 0 is already full; hence, overflow occurs. Since the local depth of that bucket equals the global depth (1 = 1), the directory is doubled and the bucket is split: the global depth becomes 2 and the elements are rehashed (Step 7, Case 1).
Inserting 24 and 10: 24 (11000) and 10 (01010) can now be hashed based on the directories with ids 00 and 10. Here, we encounter no overflow condition.
Inserting 31, 7, 9: All of these elements [31 (11111), 7 (00111), 9 (01001)] have either 01 or 11 as their two LSBs.
Hence, they are mapped to the buckets pointed to by directories 01 and 11. We do not encounter any overflow condition here.
Inserting 20: Insertion of data element 20 (10100) will again
cause the overflow problem.
20 is inserted in bucket pointed out by 00. As directed by Step 7-
Case 1, since the local depth of the bucket = global-depth, directory
expansion (doubling) takes place along with bucket splitting.
Elements present in overflowing bucket are rehashed with the new
global depth. Now, the new Hash table looks like this:
Inserting 26: Global depth is 3. Hence, 3 LSBs of 26(11010) are considered.
Therefore 26 best fits in the bucket pointed out by directory 010.
The bucket overflows, and, as directed by Step 7-Case 2, since the local
depth of bucket < Global depth (2<3), directories are not doubled but,
only the bucket is split and elements are rehashed.
Finally, the output of hashing the given list of numbers is obtained.
Advantages:
Data retrieval is less expensive (in terms of computing).
No problem of Data-loss since the storage capacity increases dynamically.
With dynamic changes in hashing function, associated old values are rehashed w.r.t the new hash function.
Disadvantages:
The directory size may increase significantly if several records are hashed to the same directory while the record distribution remains non-uniform.
The size of every bucket is fixed.
Memory is wasted in pointers when the difference between the global depth and the local depth becomes drastic.
This method is complicated to code.
Applications of Extendible Hashing
Extendible hashing can be used in applications where an exact-match query is the most important kind of query, such as a hash join.