UNIT V - Hashing
▪ Hashing is the process of mapping a large amount of data items to a smaller table with the
help of a hashing function.
▪ Hashing is also known as Hashing Algorithm or Message Digest Function.
▪ It is a technique to convert a range of key values into a range of indexes of an array.
▪ It supports faster searching than linear or binary search.
▪ Hashing allows any data entry to be updated and retrieved in constant time, O(1), on average.
▪ Constant time O(1) means the operation does not depend on the size of the data.
▪ Hashing is used with a database to enable items to be retrieved more quickly.
▪ Cryptographic hash functions are used in creating and verifying digital signatures.
Hashing is an important technique designed to solve the problem of efficiently finding and
storing data in an array.
Example:
If you have a list of 20000 numbers and are given a number to search for in that list,
you would scan each number in the list until you find a match.
Searching the entire list to locate that specific number takes a significant amount of time.
This sequential scanning is both time-consuming and inefficient. With hashing, the location
of the number is computed directly from its value, so the search is narrowed down to a single
position instead of the whole list.
Examples of Hashing in Data Structure
• In schools, the teacher assigns a unique roll number to each student. Later, the teacher
uses that roll number to retrieve information about that student.
• A library has a very large number of books. The librarian assigns a unique number to each
book. This unique number helps in identifying the position of the book on the bookshelf.
1. The hash function converts the item into a small integer or hash value. This integer is
used as an index to store the original data.
2. It stores the data in a hash table. You can use a hash key to locate data quickly.
Solution: Hashing
Division method
Choose a number m larger than the number n of keys in K. (The number m is
usually chosen to be a prime number or a number without small divisors, since
this frequently minimizes the number of collisions.) The hash function H is
defined by
H(k) = k (mod m) or H(k) = k (mod m) + 1
Here k (mod m) denotes the remainder when k is divided by m. The second
formula is used when we want the hash addresses to range from 1 to m rather
than from 0 to m-1.
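The division method can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `hash_division`, the `one_based` flag, and the sample keys are assumptions chosen for the demonstration.

```python
def hash_division(k: int, m: int, one_based: bool = False) -> int:
    """Division-method hash: the remainder of k divided by m.

    With one_based=True the address range is 1..m instead of 0..m-1.
    """
    h = k % m
    return h + 1 if one_based else h


# Sample 4-digit keys (assumed for illustration) with the prime m = 97.
keys = [3205, 7148, 2345]
print([hash_division(k, 97) for k in keys])                  # [4, 67, 17]
print([hash_division(k, 97, one_based=True) for k in keys])  # [5, 68, 18]
```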
Midsquare method
The key k is squared. Then the hash function H is defined by
H(k)=l
where l is obtained by deleting digits from both ends of k². We emphasize that
the same positions of k² must be used for all of the keys.
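A small Python sketch of the midsquare method follows. The parameters `drop_right` and `digits` are assumptions introduced here to say which positions of k² are kept; the defaults keep the fourth and fifth digits counting from the right, which suits a two-digit address.

```python
def hash_midsquare(k: int, drop_right: int = 3, digits: int = 2) -> int:
    """Mid-square hash: square k, drop `drop_right` digits from the right,
    then keep the next `digits` digits. The same positions are used for
    every key."""
    squared = k * k
    return (squared // 10 ** drop_right) % 10 ** digits


# Sample keys assumed for illustration.
for k in (3205, 7148, 2345):
    print(k, k * k, hash_midsquare(k))
# 3205 10272025 72
# 7148 51093904 93
# 2345 5499025 99
```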
Folding method
The key k is partitioned into a number of parts, k1, ..., kr, where each part, except
possibly the last, has the same number of digits as the required address. Then
the parts are added together, ignoring the last carry. That is,
H(k) = k1 + k2 + ... + kr
where the leading-digit carries, if any, are ignored. Sometimes, for extra "milling",
the even-numbered parts, k2, k4, ..., are each reversed before the addition.
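The folding method, including the optional reversal of even-numbered parts, might be written as below. This is a sketch under stated assumptions: keys are non-negative integers split by their decimal digits, and the names `hash_folding`, `part_digits`, and `reverse_even` are made up for the example.

```python
def hash_folding(k: int, part_digits: int = 2, reverse_even: bool = False) -> int:
    """Folding hash: split the decimal digits of k into parts of
    `part_digits` digits, add the parts, and ignore the leading carry."""
    s = str(k)
    parts = [s[i:i + part_digits] for i in range(0, len(s), part_digits)]
    if reverse_even:
        # k2, k4, ... are the parts at 0-based indexes 1, 3, ...
        parts = [p[::-1] if idx % 2 == 1 else p for idx, p in enumerate(parts)]
    total = sum(int(p) for p in parts)
    return total % 10 ** part_digits  # discard the carry


print(hash_folding(7148))                     # 71 + 48 = 119 -> 19
print(hash_folding(7148, reverse_even=True))  # 71 + 84 = 155 -> 55
```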
Example
Consider the company in the above Example, each of whose 68 employees is assigned a
unique 4-digit employee number. Suppose L consists of 100 two-digit addresses: 00, 01,
02, ..., 99. We apply the above hash functions to each of the following employee
numbers: 3205, 7148, 2345
Division Method
Choose a prime number m close to 99, such as m=97. Then H(3205)=4, H(7148)=67,
H(2345)=17.
That is, dividing 3205 by 97 gives a remainder of 4, dividing 7148 by 97 gives a remainder
of 67, and dividing 2345 by 97 gives a remainder of 17. In the case that the memory
addresses begin with 01 rather than 00, we choose the function H(k)=k(mod m)+1 to
obtain:
H(3205)=4+1=5, H(7148)=67+1=68, H(2345)=17+1=18
Midsquare method
k:    3205      7148      2345
k²:   10272025  51093904  5499025
H(k): 72        93        99
Observe that the fourth and fifth digits, counting from the right, are chosen for the hash
address.
Folding method
Chopping the key k into two parts and adding yields the following hash addresses:
H(3205)=32+05=37, H(7148)=71+48=19, H(2345)=23+45=68
Observe that the leading digit 1 in H(7148) is ignored. Alternatively, one may want to
reverse the second part before adding, thus producing the following hash addresses:
H(3205)=32+50=82, H(7148)=71+84=55, H(2345)=23+54=77
Collision Resolution
Collisions occur when the hash function maps two different keys to the same location.
Obviously, two records cannot be stored in the same location.
Suppose we want to add a new record R with key k to our file F, but
the memory location address H(k) is already occupied. This situation is
called a collision.
Therefore, a method used to solve the problem of collision, also called collision resolution
technique, is applied. The two most popular methods of resolving collisions are:
1. Chaining
2. Open addressing
Separate Chaining:
The idea is to make each cell of the hash table point to a linked list of records that have the
same hash function value.
Let us consider a simple hash function as “key mod 7” and sequence of keys as 50, 700, 76,
85, 92, 73, 101.
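The chains produced by this example can be reproduced with a short Python sketch. As a simplification, each chain is a Python list standing in for the linked list described above; `build_chained_table` is a hypothetical helper name.

```python
def build_chained_table(keys, m):
    """Separate chaining: slot i holds every key whose hash value is i.
    A Python list stands in for the linked list of each slot."""
    table = [[] for _ in range(m)]
    for k in keys:
        table[k % m].append(k)  # hash function: key mod m
    return table


table = build_chained_table([50, 700, 76, 85, 92, 73, 101], 7)
for slot, chain in enumerate(table):
    print(slot, chain)
# 0 [700]
# 1 [50, 85, 92]
# 2 []
# 3 [73, 101]
# 4 []
# 5 []
# 6 [76]
```

Slots 1 and 3 each receive more than one key, so their chains grow, while slots 2, 4, and 5 stay empty.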
Advantages:
1) Simple to implement.
2) Hash table never fills up, we can always add more elements to the chain.
3) Less sensitive to the hash function or load factors.
4) It is mostly used when it is unknown how many and how frequently keys may be inserted or
deleted.
Disadvantages:
1) Cache performance of chaining is not good as keys are stored using a linked list. Open
addressing provides better cache performance as everything is stored in the same table.
2) Wastage of Space (Some Parts of hash table are never used)
3) If the chain becomes long, then search time can become O(n) in the worst case.
4) Uses extra space for links.
Performance of Chaining:
Performance of hashing can be evaluated under the assumption that each key is equally likely
to be hashed to any slot of table (simple uniform hashing).
Open Addressing:
The hash table contains two types of values: sentinel values (e.g., -1) and data values. The
presence of a sentinel value indicates that the location contains no data value at present but
can be used to hold a value.
When a key is mapped to a particular memory location, the value it holds is checked. If it
contains a sentinel value, then the location is free and the data value can be stored in it.
However, if the location already has some data value stored in it, then other slots are examined
systematically in the forward direction to find a free slot. If no free location is found,
then we have an OVERFLOW condition.
The process of examining memory locations in the hash table is called probing.
Open addressing can be implemented using linear probing, quadratic probing, or double hashing.
Linear Probing
The simplest approach to resolve a collision is linear probing. In this technique, if a value is
already stored at a location generated by h(k), then the following hash function is used to
resolve the collision:
h(k, i) = [h'(k) + i] mod m
where m is the size of the hash table, h'(k) = (k mod m), and i is the probe number that varies
from 0 to m-1.
Therefore, for a given key k, first the location generated by [h'(k) mod m] is probed because for
the first time i = 0. If the location is free, the value is stored in it, else the second probe
generates the address of the location given by [h'(k) + 1] mod m. Similarly, if the location is
occupied, then [h'(k) + 2] mod m, [h'(k) + 3] mod m, [h'(k) + 4] mod m, [h'(k) + 5] mod m, and so
on are probed, until a free location is found.
Note: Linear probing is known for its simplicity. When we have to store a value, we try the slots
[h'(k)] mod m, [h'(k) + 1] mod m, [h'(k) + 2] mod m, [h'(k) + 3] mod m, [h'(k) + 4] mod m,
[h'(k) + 5] mod m, and so on, until a vacant location is found.
Example Consider a hash table of size 10. Using linear probing, insert the keys 72, 27,
36, 24, 63, 81, 92, and 101 into the table.
Let h'(k) = k mod m, m = 10
Initially, the hash table is empty: every location holds the sentinel value (e.g., -1).
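The insertions of this example can be traced with a short Python sketch of linear probing. It assumes -1 as the sentinel for a free slot; the names `EMPTY` and `insert_linear` are made up for illustration.

```python
EMPTY = -1  # sentinel marking a free slot (assumed value)


def insert_linear(table, k):
    """Insert k using linear probing: try (k mod m + i) mod m for i = 0..m-1.
    Returns the slot used, or raises OverflowError if the table is full."""
    m = len(table)
    for i in range(m):
        slot = (k % m + i) % m
        if table[slot] == EMPTY:
            table[slot] = k
            return slot
    raise OverflowError("no free location in the hash table")


table = [EMPTY] * 10
for key in [72, 27, 36, 24, 63, 81, 92, 101]:
    print(key, "->", insert_linear(table, key))
print(table)  # [-1, 81, 72, 63, 24, 92, 36, 27, 101, -1]
```

Keys 92 and 101 collide with earlier keys and end up in slots 5 and 8 after several probes, which shows how occupied slots build up into a cluster.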
One main disadvantage of linear probing is that records tend to cluster, that is,
appear next to one another, when the load factor is greater than 50 percent.
Such a clustering substantially increases the average search time for a record.
Two techniques to minimize clustering are as follows:
Quadratic probing
Suppose a record R with key k has the hash address H(k)=h. Then, instead of
searching the locations with addresses h, h+1, h+2, ..., we linearly search the
locations with addresses
h, h+1, h+4, h+9, h+16, ..., h+i², ...
If the number m of locations in the table T is a prime number, then the above
sequence will access half of the locations in T.
Double hashing
Here a second hash function H' is used for resolving a collision, as follows.
Suppose a record R with key k has the hash addresses H(k)=h and H'(k)=h'≠m.
Then we linearly search the locations with addresses
h, h+h', h+2h', h+3h', ...
ADVANTAGES :
Linear probing finds an empty location by doing a linear search in the array beginning from
position h(k). The algorithm provides good memory caching through good locality of
reference.
DISADVANTAGES :
Linear probing results in clustering, and thus there is a higher risk of more collisions where
one collision has already taken place. The performance of linear probing is sensitive to the
distribution of input values.
As the hash table fills, clusters of consecutive cells are formed and the time required for a
search increases with the size of the cluster.
Quadratic Probing
In this technique, if a value is already stored at a location generated by h(k), then the following
hash function is used to resolve the collision:
h(k, i) = [h'(k) + c1·i + c2·i²] mod m
where m is the size of the hash table, h'(k) = (k mod m), i is the probe number that varies from
0 to m-1, and c1 and c2 are constants such that c1 and c2 ≠ 0.
Quadratic probing eliminates the primary clustering phenomenon of linear probing because
instead of doing a linear search, it does a quadratic search.
For a given key k, first the location generated by h'(k) mod m is probed. If the location is free,
the value is stored in it, else subsequent locations probed are offset by factors that depend in a
quadratic manner on the probe number i.
Although quadratic probing performs better than linear probing, in order to maximize the
utilization of the hash table, the values of c1, c2, and m need to be constrained.
Example
Consider a hash table of size 10. Using quadratic probing, insert the keys 72,
27, 36, 24, 63, 81, and 101 into the table. Take c1 = 1 and c2 = 3.
Solution
Let h'(k) = k mod m, m = 10
Initially, the hash table is empty: every location holds the sentinel value (e.g., -1).
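The same example can be worked through with a short Python sketch of quadratic probing, using c1 = 1 and c2 = 3 as given. The sentinel -1 and the names `EMPTY` and `insert_quadratic` are assumptions for illustration.

```python
EMPTY = -1  # sentinel marking a free slot (assumed value)


def insert_quadratic(table, k, c1=1, c2=3):
    """Insert k using quadratic probing:
    h(k, i) = (k mod m + c1*i + c2*i*i) mod m for i = 0..m-1.
    Returns the slot used, or None if no free slot is reached."""
    m = len(table)
    for i in range(m):
        slot = (k % m + c1 * i + c2 * i * i) % m
        if table[slot] == EMPTY:
            table[slot] = k
            return slot
    return None


table = [EMPTY] * 10
for key in [72, 27, 36, 24, 63, 81, 101]:
    print(key, "->", insert_quadratic(table, key))
print(table)  # [-1, 81, 72, 63, 24, 101, 36, 27, -1, -1]
```

Only key 101 collides (with 81 in slot 1); its second probe, (1 + 1 + 3) mod 10 = 5, finds a free slot.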
In double hashing, if m is a prime number, then the probe sequence h, h+h', h+2h', ... will
access all the locations in the table T.
ADVANTAGES
Quadratic probing resolves the primary clustering problem that exists in the linear probing
technique. Quadratic probing provides good memory caching because it preserves some locality
of reference.
DISADVANTAGES
Quadratic probing suffers from secondary clustering: if there is a collision between two keys, then the same probe
sequence will be followed for both. With quadratic probing, the probability for multiple collisions
increases as the table becomes full. This situation is usually encountered when the hash table is
Double Hashing
In double hashing, we use two hash functions rather than a single function. The hash function in
the case of double hashing can be given as:
h(k, i) = [h1(k) + i·h2(k)] mod m
where m is the size of the hash table, h1(k) and h2(k) are two hash functions given as h1(k) = k
mod m, h2(k) = k mod m', i is the probe number that varies from 0 to m-1, and m' is chosen to be
less than m.
When we have to insert a key k in the hash table, we first probe the location given by applying
[h1(k) mod m] because during the first probe, i = 0. If the location is vacant, the key is inserted
into it, else subsequent probes generate locations that are at an offset of [h2(k) mod m] from the
previous location. Since the offset may vary with every probe depending on the value generated by the
second hash function, the performance of double hashing is very close to the performance of the
Example
Consider a hash table of size 10. Using double hashing, insert the keys 72,
27, 36, 24, 63, 81, 92, and 101 into the table. Take h1 = (k mod 10) and h2 = (k mod 8).
Solution
Let m = 10
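A Python sketch of this example follows, probing h(k, i) = [h1(k) + i·h2(k)] mod m for i = 0 to m-1. The sentinel -1, the names `EMPTY` and `insert_double`, and the choice to report failure after m probes are assumptions of the sketch, not part of the original solution.

```python
EMPTY = -1  # sentinel marking a free slot (assumed value)


def insert_double(table, k, m2=8):
    """Insert k using double hashing with h1(k) = k mod m and h2(k) = k mod m2.
    Probes (h1 + i*h2) mod m for i = 0..m-1.
    Returns the slot used, or None if no free slot is reached."""
    m = len(table)
    h1, h2 = k % m, k % m2
    for i in range(m):
        slot = (h1 + i * h2) % m
        if table[slot] == EMPTY:
            table[slot] = k
            return slot
    return None


table = [EMPTY] * 10
failed = []
for key in [72, 27, 36, 24, 63, 81, 92, 101]:
    if insert_double(table, key) is None:
        failed.append(key)
print(table)   # [92, 81, 72, 63, 24, -1, 36, 27, -1, -1]
print(failed)  # [101]
```

Key 92 lands in slot 0 on its third probe (2, 6, 0). Key 101 has h1 = 1 and h2 = 5, so with m = 10 its probes only alternate between slots 1 and 6, both occupied, and it is never placed even though free slots remain. A prime table size avoids this kind of short probe cycle.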
ADVANTAGES
Double hashing minimizes repeated collisions and the effects of clustering. That is, double
hashing is free from problems associated with primary clustering as well as secondary
clustering.
Rehashing
When the hash table becomes nearly full, the number of collisions increases, thereby degrading
the performance of insertion and search operations. In such cases, a better option is to create a
new hash table with size double of the original hash table.
All the entries in the original hash table will then have to be moved to the new hash table. This
is done by taking each entry, computing its new hash value, and then inserting it in the new hash
table.
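The rehashing procedure can be sketched in Python as below. The sample table contents, the use of x % size as the hash function for both tables, and linear probing for collisions in the new table are all assumptions made for illustration; `EMPTY` and `rehash` are hypothetical names.

```python
EMPTY = -1  # sentinel marking a free slot (assumed value)


def rehash(old_table):
    """Create a table of double the size and re-insert every entry,
    recomputing its hash as x % new_size (linear probing on collision)."""
    new_size = 2 * len(old_table)
    new_table = [EMPTY] * new_size
    for x in old_table:
        if x == EMPTY:
            continue
        slot = x % new_size
        while new_table[slot] != EMPTY:  # a free slot always exists here
            slot = (slot + 1) % new_size
        new_table[slot] = x
    return new_table


# Hypothetical table of size 5 built with h(x) = x % 5.
old = [EMPTY, 26, 31, 43, 17]
print(rehash(old))  # [-1, 31, -1, 43, -1, -1, 26, 17, -1, -1]
```

Every entry is visited once and hashed again, which is why the cost of rehashing grows with the number of stored entries.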
Though rehashing seems to be a simple process, it is quite expensive and must therefore not
The hash function used is h(x) = x % 5. Rehash the entries into a new hash table.
COMPARISON BETWEEN SEPARATE CHAINING AND OPEN ADDRESSING