Hashing FINAL


Block-II

Block Name: Dictionary ADT and Sorting and Selection Techniques

Unit 4: Hashing
Unit Contents: Hash function, Hash Tables, Hash Collision, Collision resolution

CONTENTS:

4.0 Introduction
4.1 Unit objective
4.2 Hashing
4.3 Hash Table
4.4 Collision in Hash Table
4.5 Hash Function
4.6 Types of Hash Function
4.6.1 Division method
4.6.2 Multiplication Method
4.6.3 Mid-Square Method
4.7 Collision Resolution
4.7.1 Open Addressing( Closed Hashing)
4.7.1.1 Linear Probing
4.7.1.2 Quadratic Probing
4.7.1.3 Double Hashing
4.7.2 Separate chaining(Open Hashing)
4.8 Check your progress
4.9 Summing Up
4.10 Questions
4.10.1 Short types question
4.10.2 Broad type question
4.11 Suggested Readings

4.0 Introduction:
In data structures and algorithms we generally study how to retrieve an element from the data
structure in which it is stored. In sequential search, binary search and all the search trees, the
searching time depends on the number of elements and the number of key comparisons. Linear search
runs in O(n) time, while binary search runs in O(log n) time. In a balanced binary search tree, the
running time can be guaranteed to be O(log n). Here we will discuss a searching approach in which
fewer key comparisons are involved and searching can be performed in constant time, i.e. O(1), on
average. In this approach the searching time is independent of the number of elements. Hashing is
such an approach. Before going to hashing we will discuss a data structure named the direct access
table, using an example. Suppose we would like to store students' records keyed by phone numbers.

In the direct access table we create an array and use the phone numbers as indexes into the
array. If a phone number is not present, its entry in the array is NIL. Otherwise the array entry
stores a pointer to the record corresponding to that phone number. In this approach, we can
insert, delete and search in O(1) time, so in terms of time complexity this solution is the best
of all. For example, to insert a phone number, we create a record with the details of the given
phone number, use the phone number as the index and store the pointer to the created record in the
table. The direct access table has practical limitations. First, it requires a huge amount of extra
space: for n-digit phone numbers the table needs 10^n entries. Second, an integer in a programming
language may not be able to store n digits.

Hashing overcomes these problems. With hashing we get O(1) search time on
average and O(n) in the worst case. Hashing is an improvement over the Direct Access Table.

4.1 Unit Objective:

In this Chapter we will study the concepts of the following:


- Understand the basic concept of Hashing.
- Importance of hashing.
- Hash Table and Hash function.
- Different types of Hash function
- Collision Resolution
- Different Types of Collision Resolution Techniques.

4.2 Hashing:
Hashing is a technique to convert a range of key values into a range of indexes of an array. In
other words, hashing is the process of mapping keys and their values into a hash table by using a
hash function. Hash functions of a related kind, called message digest functions, are widely used
in cryptography, for example in digital signatures.
Let us describe the concept by using an example.
Suppose we have to store the information of 100 students. Each student has a name and a unique
roll number in the range 0-99. Roll numbers will be considered as keys. We can take
an array of size 100 to store the information.

Key                      Array of Student information

Key 0   ----->  [0]      0   Ramen
Key 1   ----->  [1]      1   Raja
Key 2   ----->  [2]      2   Bishnu
Key 3   ----->  [3]      3   Rajib
Key 4   ----->  [4]      4   Mohan
Key 5   ----->  [5]      5   Punam
Key 6   ----->  [6]      6   Rupshree
...             ...      ...
Key 98  ----->  [98]     98  Nilima
Key 99  ----->  [99]     99  Amit

Here, we can directly access any student's information through the key index because the key index
and the student's roll number are the same. This method of searching is called direct addressing.
It is useful when the set of possible keys is very small.
Let us consider the case where we need to store the information of 100 students and a five-digit
roll_no is taken as the key index. If we use the direct addressing method, the key index will be in
the range 00000 to 99999 and the array size will be 100000.

Key                            Array of Student information

Key 00000  ----->  [0]         00000  Ramen
Key 00001  ----->  [1]         00001  Raja
Key 00002  ----->  [2]         00002  Bishnu
Key 00003  ----->  [3]         00003  Rajib
Key 00004  ----->  [4]         00004  Mohan
Key 00005  ----->  [5]         00005  Punam
Key 00006  ----->  [6]         00006  Rupshree
...                ...         ...
Key 99998  ----->  [99998]     99998  Nilima
Key 99999  ----->  [99999]     99999  Amit

But in this method we will use only 100 of the 100000 locations of the array. All the other
locations will be unused, which means wastage of space. That is why direct addressing is rarely
used.
Now let us discuss how this technique can be improved so that less memory is wasted.
Here, we will adopt an approach through which we can convert the key into a value within a small
range. This value will be used as the key index. There are many techniques by which we can reduce
the size of the array. For example, we can use the last two digits of the key to identify a student.
If we use this technique, then the information of the student bearing roll no 54567 is stored at
array index 67. Similarly, the information of the student bearing roll no 45859 is stored at array
index 59. This process of converting a key to an array index is called Hashing, and the conversion
is done by a hash function.

Key -----> Hash Function() -----> Address
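
The last-two-digits idea can be sketched in C as follows. This is only a minimal illustration: the names hash_last_two, store_student and table are hypothetical and are not taken from the unit, and collisions are ignored here.

```c
#include <assert.h>
#include <stddef.h>

#define TABLE_SIZE 100            /* two decimal digits give indexes 0..99 */

/* Hypothetical table: each slot points to a student's name, or is NULL. */
const char *table[TABLE_SIZE];

/* Keep only the last two decimal digits of the roll number. */
unsigned hash_last_two(unsigned long roll_no)
{
    return (unsigned)(roll_no % TABLE_SIZE);
}

/* Store a name at the hashed index. Any earlier entry at that index is
   simply overwritten, because collision handling is not covered yet. */
void store_student(unsigned long roll_no, const char *name)
{
    unsigned idx = hash_last_two(roll_no);
    assert(idx < TABLE_SIZE);
    table[idx] = name;
}
```

For example, hash_last_two(54567) evaluates to 67 and hash_last_two(45859) evaluates to 59, matching the indexes used in the text.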

4.3 Hash Table:

A hash table is a data structure in which keys are mapped to array locations by a hash function. In
simple words, a hash table is an array in which insertion and searching are done through hashing.
A hash table stores elements which consist of two main components, i.e., key and
value. The key is a unique integer used for indexing the values, whereas the value is the data
associated with the key.
4.4 Collision in Hash Table:
Suppose we generate two array addresses from two different keys using a hash function. If both
the addresses generated by the hash function are the same, then this situation is called a
Collision in the hash table. Let us describe collision by using an example.
Suppose we need to store employee information through Emp_id. Here we consider a
hash function that maps a key to an array address by summing up all the digits in the Emp_id.
If the Emp_id is 62865, then this employee's information will be stored at array location
6+2+8+6+5 = 27.

Address      Employee information

...          ...
24           Emp_id 34764
25           ...
26           Emp_id 56357
27           Emp_id 62865
28           Emp_id 75691
29           ...
...          ...

Now, we would like to store the employee information of Emp_id 59823. If we add all the digits
of this Emp_id, the sum is 27. But location 27 has already been occupied by Emp_id 62865.
This situation is a Collision in the hash table. Keys which are mapped to the same address are
called synonyms.
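
A short C sketch of this digit-sum hash function shows the collision directly. The function names hash_digit_sum and is_synonym are hypothetical, chosen only for this illustration.

```c
#include <assert.h>

/* Address = sum of the decimal digits of the employee id. */
unsigned hash_digit_sum(unsigned long emp_id)
{
    unsigned sum = 0;
    while (emp_id > 0) {
        sum += (unsigned)(emp_id % 10);   /* add the last digit */
        emp_id /= 10;                     /* drop the last digit */
    }
    return sum;
}

/* Returns 1 when two different keys are synonyms, i.e. they map to the
   same address, and 0 otherwise. */
int is_synonym(unsigned long a, unsigned long b)
{
    assert(a != b);
    return hash_digit_sum(a) == hash_digit_sum(b);
}
```

With the keys from the text, hash_digit_sum(62865) and hash_digit_sum(59823) both give 27, so is_synonym(62865, 59823) returns 1.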

4.5 Hash function:

A hash function is a function which is applied to a key and generates an integer within some
suitable range. This integer is used as an address in the hash table and is called the hash key.
In practice there is no hash function which can fully eliminate collisions, but a good hash
function keeps the number of collisions small.

Properties of good hash function:

1. The hash function should generate different hash values for similar strings.
2. The hash function should be easy to compute.
3. The hash function should distribute the keys as uniformly as possible over the array.
4. The hash function should generate addresses with a minimum number of collisions.
5. The hash function should use all of the input data, i.e. the whole key. A hash function that
   produces no collisions at all for a given set of keys is called a perfect hash function.

4.6 Types of Hash Function:


Here we will discuss some hash functions which use numeric keys. In the real world, keys are
sometimes alphanumeric; such keys are first converted to numbers before these methods are applied.

4.6.1 Division Method(Modulo-Division):

Here, the key ‘x’ is divided by the table size ‘m’ and the remainder is taken as
the address in the hash table. This method ensures that the address lies within the limited
range of the table, i.e. in the range {0, 1, 2, ..., m-1}. The hash function
can be given as

h(x) = x mod m

In this method collisions can be reduced if m is chosen to be a prime number.
For example

For key x = 545 and m = 17,

h(x) = 545 % 17 = 1 // in C language ‘%’ is the modulo operator

For key x = 78549 and m = 17,

h(x) = 78549 % 17 = 9

For key x = 59467 and m = 17,

h(x) = 59467 % 17 = 1
From the above example we can observe that collisions are still possible even when m is a prime
number: the keys 545 and 59467 both hash to address 1. The chance of collision can be reduced
further by choosing for the table size a prime number that is not too close to an exact power of 2.
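
The division method is a one-line function in C. The sketch below uses the hypothetical name hash_division and assumes non-negative keys.

```c
#include <assert.h>

/* Division (modulo-division) method: h(x) = x mod m.
   x is the key and m is the table size, which must be positive. */
unsigned hash_division(unsigned long x, unsigned m)
{
    assert(m > 0);
    return (unsigned)(x % m);   /* result is always in 0..m-1 */
}
```

Calling hash_division(545, 17) and hash_division(59467, 17) both return 1, which is the collision observed in the example, while hash_division(78549, 17) returns 9.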

4.6.2 Multiplication Method:


In the multiplication method we compute the hash value in 4 steps:
1. Choose a constant A between 0 and 1.
2. Multiply the key k by A.
3. Take the fractional part of the product.
4. Multiply the fractional part by m, and take the floor of the result.
Hence the hash function can be written as
h(k) = ⌊ m ✕ ((k ✕ A) mod 1) ⌋
where (k ✕ A) mod 1 gives the fractional part of kA and m is the total number of indices in the
hash table.
Note: This method works better if the constant A is chosen according to the characteristics of
the keys being hashed. Knuth has suggested that a good choice of A is
(√5 - 1)/2 = 0.6180339887.

For example:
Hash table of size m = 1000,
key k = 24561
h(24561) = ⌊ 1000 ✕ ((24561 ✕ 0.6180339887) mod 1) ⌋
= ⌊ 1000 ✕ (15179.5327964 mod 1) ⌋
= ⌊ 1000 ✕ 0.5327964 ⌋
= ⌊ 532.7964 ⌋
= 532
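
The four steps can be sketched in C using double-precision arithmetic. The names KNUTH_A and hash_multiplication are hypothetical, and the sketch assumes keys small enough that the product k ✕ A is represented accurately by a double.

```c
#include <assert.h>

#define KNUTH_A 0.6180339887      /* (sqrt(5) - 1) / 2, as suggested by Knuth */

/* Multiplication method: h(k) = floor(m * ((k * A) mod 1)). */
unsigned hash_multiplication(unsigned long k, unsigned m)
{
    double product, fraction;

    assert(m > 0);
    product = (double)k * KNUTH_A;                              /* step 2 */
    fraction = product - (double)(unsigned long long)product;  /* step 3 */
    /* Step 4: for a non-negative value, converting to unsigned
       truncates, which is the same as taking the floor. */
    return (unsigned)(fraction * m);
}
```

For the key and table size of the worked example, hash_multiplication(24561, 1000) returns 532.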

4.6.3 Mid Square Method:


In the mid square method, we square the value of the key and take some digits or bits from
the middle of this square as the address. This technique can generate addresses with high
randomness if we take a big enough value. It has some limitations. As the value is squared, if a
6-digit number is taken, then the square can have up to 12 digits. This exceeds the range of a
32-bit int data type, so overflow must be taken care of. If the key is too large, we can take a
part of the key and perform the mid square method on that part rather than on the whole key. The
chances of a collision in mid-square hashing are low, but not zero.
Suppose our keys are four-digit integers and the table size is 1000, so we need a 3-digit address.
We square the keys and take the 3rd, 4th and 5th digits of each squared number
as the hash address.

Key:             1562       1232       1355       1656
Square of key:   2439844    1517824    1836025    2742336
Address:         398        178        360        423
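
A minimal C sketch of this example is given below. It assumes, as in the table above, that the square has exactly seven digits, which holds for keys from 1000 to 3162; the 3rd, 4th and 5th digits are then obtained by dropping the last two digits and keeping the next three. The name hash_mid_square is hypothetical.

```c
#include <assert.h>

/* Mid-square method for four-digit keys and a table of size 1000.
   Assumption: the square has exactly 7 digits (key in 1000..3162). */
unsigned hash_mid_square(unsigned key)
{
    unsigned long long square;

    assert(key >= 1000 && key <= 3162);
    square = (unsigned long long)key * key;    /* wide type avoids overflow */
    return (unsigned)((square / 100) % 1000);  /* middle three digits */
}
```

For example, hash_mid_square(1562) squares the key to 2439844 and returns 398.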

4.7 Collision Resolution:

An ideal hash function would map every possible key to a different location, but in practice this
is not possible because there are far more possible keys than table locations. A collision occurs
when a hash function maps two different keys to the same location of a hash table. So, we use
collision resolution techniques by which these keys can be placed in an alternate location.
We will study the two most important collision resolution techniques here:
i) Open Addressing (Closed Hashing)
ii) Separate Chaining (Open Hashing)

4.7.1 Open Addressing (Closed Hashing):

When we map a key to a location in the hash table by using the hash function and the
location is already occupied, we search for some other empty location in the hash table and
insert the value there. This method is also known as Closed Hashing because all the keys are
stored inside the hash table itself.
Note: The process of examining memory locations in the hash table is called probing. We
will mainly study three methods to search for an empty location inside the table when we face a
collision.
1. Linear Probing
2. Quadratic Probing
3. Double hashing

4.7.1.1 Linear Probing:

When a hash function gives an address which is already occupied in the hash table, we search
linearly for the next empty cell in the hash table. For example, if the hash
function gives an address ‘a’ and it is not empty, then we try the next
location, i.e. ‘a+1’. If this location is also occupied, we try the next location, i.e. ‘a+2’, and
this procedure continues till we find an empty location where the key can be inserted. During the
search for an empty position, we assume the array is closed, or circular. For example, if we
need to store a key value at position 4, where the table size is 10, then we search for an empty
location in the sequence 4, 5, 6, 7, 8, 9, 0, 1, 2, 3. If location 4 is empty, we insert the value
at position 4; otherwise we search linearly for an empty position and store the key value at
that position.

Advantage-

• It is very easy to compute.

Disadvantage-

• The main disadvantage in linear probing is clustering.
• Many consecutive occupied locations form groups (clusters).
• As the clusters grow, it takes more time to search for an element or to find an empty bucket.

Time Complexity-
The worst-case time to search for an element in linear probing is O(table size).
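
Insertion with linear probing can be sketched in C as follows. The sketch assumes non-negative integer keys, the hash function h(key) = key % TABLE_SIZE (the text does not fix a hash function for this section) and the value -1 as the marker for an empty cell. The names table_init and insert_linear are hypothetical.

```c
#include <assert.h>

#define TABLE_SIZE 10
#define EMPTY (-1)                /* assumed marker for an unused cell */

/* Mark every cell of the table as empty. */
void table_init(int table[TABLE_SIZE])
{
    int i;
    for (i = 0; i < TABLE_SIZE; i++)
        table[i] = EMPTY;
}

/* Insert key using linear probing: try a, a+1, a+2, ... (circularly).
   Returns the index used, or -1 if the table is full. */
int insert_linear(int table[TABLE_SIZE], int key)
{
    int start, i;

    assert(key >= 0);
    start = key % TABLE_SIZE;                 /* address 'a' */
    for (i = 0; i < TABLE_SIZE; i++) {
        int pos = (start + i) % TABLE_SIZE;   /* wrap around the end */
        if (table[pos] == EMPTY) {
            table[pos] = key;
            return pos;
        }
    }
    return -1;                                /* every cell is occupied */
}
```

The loop examines at most TABLE_SIZE cells, which is why the worst case is proportional to the table size.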

4.7.1.2 Quadratic Probing:

In linear probing, the main disadvantage is clustering. In quadratic probing this problem is reduced
by storing the colliding keys away from the initial collision point.
H(k, i) = (h(k) + i²) mod Table_size
Here, i varies from 0 to Table_size-1 and h is the hash function. Here also the array is assumed to
be closed. The search for empty locations will be in the sequence:
h(k), h(k)+1, h(k)+4, h(k)+9, ... all mod Table_size.

For example,
Table size=10
h(key)=key%10
Key to be inserted 48, 34, 29, 68, 98, 54, 53, 74
h(48)= 48%10= 8
h(34)= 34%10= 4
h(29)= 29%10= 9
h(68)= 68%10= 8
h(98)= 98%10= 8
h(54)= 54%10= 4
h(53)= 53%10= 3
h(74)= 74%10= 4

Keys 48, 34 and 29 are inserted without collision. But for the key 68 the address is 8, which is
already occupied. In such cases we search for the next free location using the probing
function. The next location is (8+1²)%10 = 9, which is also occupied. The next location
is (8+2²)%10 = 12%10 = 2, which is empty, so we insert the key at position 2.
Continuing in the same way, 98 is placed at position 7, 54 at position 5, 53 at position 3 and
74 at position 0.
Advantages:

• Quadratic probing is less likely to suffer from primary clustering and is
simpler to implement than double hashing.

Disadvantages:

• Quadratic probing has secondary clustering. This happens when two keys hash to the
same location: they then have the same probe sequence, so it can take many
attempts before an insertion is made.
• Also, a probe sequence may not visit all the locations within the table.
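
The quadratic probing example can be reproduced with a short C sketch. As before, it assumes non-negative keys, h(key) = key % 10 as in the example and -1 as the empty marker; the names table_init and insert_quadratic are hypothetical.

```c
#include <assert.h>

#define TABLE_SIZE 10
#define EMPTY (-1)                /* assumed marker for an unused cell */

/* Mark every cell of the table as empty. */
void table_init(int table[TABLE_SIZE])
{
    int i;
    for (i = 0; i < TABLE_SIZE; i++)
        table[i] = EMPTY;
}

/* Insert key using H(k, i) = (h(k) + i*i) mod TABLE_SIZE for
   i = 0, 1, ..., TABLE_SIZE-1. Returns the index used, or -1 if no
   empty cell is found on the probe sequence. */
int insert_quadratic(int table[TABLE_SIZE], int key)
{
    int i;

    assert(key >= 0);
    for (i = 0; i < TABLE_SIZE; i++) {
        int pos = (key % TABLE_SIZE + i * i) % TABLE_SIZE;
        if (table[pos] == EMPTY) {
            table[pos] = key;
            return pos;
        }
    }
    return -1;   /* may happen even if the table still has empty cells */
}
```

Because the probe sequence does not visit every location, the function can return -1 while empty cells remain, which is the second disadvantage listed above.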

4.7.1.3 Double Hashing:

In double hashing we use two independent hash functions rather than a single hash
function. Hence, it is called double hashing. The double hash function can be defined as:
h(k, i) = [h1(k) + i ✕ h2(k)] mod t

Here, t is the table size, and h1(k) and h2(k) are two independent hash functions, where
h1(k) = k mod t and h2(k) = k mod t’, with t’ less than t. The probe number i starts at 0 and goes
up to t-1.

For example:
Table size(t)= 10, Inserted keys are 54,97,43,27,34
h1(k)= (k mod 10) and h2(k)=(k mod 8)
Initially, the hash table is empty:

Index:  0   1   2   3   4   5   6   7   8   9
Key:    -   -   -   -   -   -   -   -   -   -

We have,
h(k, i) = [h1(k) + i ✕ h2(k)] mod t

Step 1: Key = 54
h(54, 0) = [(54 mod 10) + 0 ✕ (54 mod 8)] mod 10
= [4 + (0 ✕ 6)] mod 10
= 4 mod 10
= 4
(Since position 4 is empty, we will put 54 in location 4.)

Index:  0   1   2   3   4   5   6   7   8   9
Key:    -   -   -   -   54  -   -   -   -   -

Step 2: Key = 97
h(97, 0) = [(97 mod 10) + 0 ✕ (97 mod 8)] mod 10
= [7 + (0 ✕ 1)] mod 10
= 7 mod 10
= 7
(Since position 7 is empty, we will put 97 in location 7.)

Index:  0   1   2   3   4   5   6   7   8   9
Key:    -   -   -   -   54  -   -   97  -   -

Step 3: Key = 43
h(43, 0) = [(43 mod 10) + 0 ✕ (43 mod 8)] mod 10
= [3 + (0 ✕ 3)] mod 10
= 3 mod 10
= 3
(Since position 3 is empty, we will put 43 in location 3.)

Index:  0   1   2   3   4   5   6   7   8   9
Key:    -   -   -   43  54  -   -   97  -   -

Step 4: Key = 27
h(27, 0) = [(27 mod 10) + 0 ✕ (27 mod 8)] mod 10
= [7 + (0 ✕ 3)] mod 10
= 7 mod 10
= 7
(Location 7 is already occupied by the key 97, so we cannot store the key 27 in location 7.
We need to find the next location by taking the probe number i = 1.)
h(27, 1) = [(27 mod 10) + 1 ✕ (27 mod 8)] mod 10
= [7 + (1 ✕ 3)] mod 10
= 10 mod 10
= 0
(Since position 0 is empty, we will put 27 in location 0.)

Index:  0   1   2   3   4   5   6   7   8   9
Key:    27  -   -   43  54  -   -   97  -   -

Step 5: Key = 34
h(34, 0) = [(34 mod 10) + 0 ✕ (34 mod 8)] mod 10
= [4 + (0 ✕ 2)] mod 10
= 4 mod 10
= 4
(Location 4 is already occupied by the key 54, so we cannot store the key 34 in location 4.
We need to find the next location by taking the probe number i = 1.)
h(34, 1) = [(34 mod 10) + 1 ✕ (34 mod 8)] mod 10
= [4 + (1 ✕ 2)] mod 10
= 6 mod 10
= 6
(Since position 6 is empty, we will put 34 in location 6.)

Index:  0   1   2   3   4   5   6   7   8   9
Key:    27  -   -   43  54  -   34  97  -   -

In general, we repeat the process, increasing the probe number i by 1 each time, until an
empty location is found.

Advantages:
• Double hashing largely eliminates the clustering problems of linear and quadratic probing.

Disadvantages:
• Double hashing is more complicated to implement than linear or quadratic probing.
• Double hashing can cause thrashing.
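
Insertion with double hashing can be sketched in C as below, using h1(k) = k mod 10 and h2(k) = k mod 8 from the example. The sketch assumes non-negative keys and -1 as the empty marker, and the names T_PRIME, table_init and insert_double are hypothetical. Note that when h2(k) is 0 every probe lands on the same location, so this sketch simply reports failure in that case.

```c
#include <assert.h>

#define TABLE_SIZE 10             /* t  */
#define T_PRIME 8                 /* t', less than t */
#define EMPTY (-1)                /* assumed marker for an unused cell */

/* Mark every cell of the table as empty. */
void table_init(int table[TABLE_SIZE])
{
    int i;
    for (i = 0; i < TABLE_SIZE; i++)
        table[i] = EMPTY;
}

/* Insert key using h(k, i) = [h1(k) + i * h2(k)] mod t, i = 0..t-1.
   Returns the index used, or -1 if no empty cell is found. */
int insert_double(int table[TABLE_SIZE], int key)
{
    int h1, h2, i;

    assert(key >= 0);
    h1 = key % TABLE_SIZE;
    h2 = key % T_PRIME;           /* may be 0: then all probes equal h1 */
    for (i = 0; i < TABLE_SIZE; i++) {
        int pos = (h1 + i * h2) % TABLE_SIZE;
        if (table[pos] == EMPTY) {
            table[pos] = key;
            return pos;
        }
    }
    return -1;
}
```

The return value is the location finally chosen for the key, so the probing steps of a worked example can be checked by inserting the keys in order.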

4.7.2 Separate Chaining (Open Hashing):

In this technique, a linked list is maintained for all the elements that hash to the same
address. Here the hash table does not contain the actual keys and records; it is just an
array of pointers, where each pointer points to a linked list. For example, location 1 in the hash
table points to the head of the linked list of all the key values that hashed to 1. If no key
value hashes to 1, the location is set to NULL.
For example:
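The keys 267, 341, 223, 674, 755, 921, 733, 874, 231, 397 are to be hashed.
Table size = 10

h(267) = 267 mod 10 = 7
h(341) = 341 mod 10 = 1
h(223) = 223 mod 10 = 3
h(674) = 674 mod 10 = 4
h(755) = 755 mod 10 = 5
h(921) = 921 mod 10 = 1
h(733) = 733 mod 10 = 3
h(874) = 874 mod 10 = 4
h(231) = 231 mod 10 = 1
h(397) = 397 mod 10 = 7

If each key is added at the end of its list, the resulting chains are:

[1] -> 341 -> 921 -> 231
[3] -> 223 -> 733
[4] -> 674 -> 874
[5] -> 755
[7] -> 267 -> 397

All other locations are NULL.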
Advantages:

• The number of keys that can be stored does not depend on the size of the table.
• Implementation is very simple.

Disadvantages:

• Keys may not be evenly distributed among the lists.
• Space is wasted when many locations in the table remain empty.
• The list at a location can become very long.
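
A minimal C sketch of separate chaining is given below. It assumes non-negative integer keys, h(key) = key % 10 as in the example, and insertion at the end of each list; the names struct node, chain_insert, chain_length and chain_free are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

#define TABLE_SIZE 10

struct node {
    int key;
    struct node *next;
};

/* Append key to the linked list at table[key % TABLE_SIZE].
   Returns 1 on success and 0 if memory allocation fails. */
int chain_insert(struct node *table[TABLE_SIZE], int key)
{
    struct node **link;
    struct node *n;

    assert(key >= 0);
    n = malloc(sizeof *n);
    if (n == NULL)
        return 0;
    n->key = key;
    n->next = NULL;

    link = &table[key % TABLE_SIZE];   /* head pointer of the chain */
    while (*link != NULL)              /* walk to the end of the chain */
        link = &(*link)->next;
    *link = n;
    return 1;
}

/* Number of keys stored in the chain at the given index. */
int chain_length(struct node *table[TABLE_SIZE], int index)
{
    int count = 0;
    struct node *p;

    assert(index >= 0 && index < TABLE_SIZE);
    for (p = table[index]; p != NULL; p = p->next)
        count++;
    return count;
}

/* Release every node and reset all locations to NULL. */
void chain_free(struct node *table[TABLE_SIZE])
{
    int i;
    for (i = 0; i < TABLE_SIZE; i++) {
        while (table[i] != NULL) {
            struct node *next = table[i]->next;
            free(table[i]);
            table[i] = next;
        }
    }
}
```

The table must start with every location set to NULL, for example by declaring it as struct node *table[TABLE_SIZE] = {0}; before the first call to chain_insert.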

4.8 Check your progress:

4.8.1 Multiple Choice Question and Answer:


1. The searching technique that takes O(1) time on average to find an item of data is
a) Linear Search
b) Hashing
c) Binary Search
d) Tree Search

2.What is a hash table?


a) A structure that maps keys to values
b) A structure that maps values to keys
c) A structure used for storage
d) A structure used to implement stack and queue

3.Which Open addressing technique is free from clustering problem?


a) Linear Probing
b) quadratic probing
c) Double hashing
d) None of the above

4.What is direct addressing?


a) Distinct array position for every possible key
b) Fewer array positions than keys
c) Fewer keys than array positions
d) Same array position for all keys

5.What is a hash function?


a) A function that allocates memory to keys
b) A function that computes the location of the key in the array
c) A function that creates an array
d) A function that computes the location of the values in the array.
6. In separate chaining, which data structure is normally used?
a) Singly linked list
b) Doubly linked list
c) Circular linked list
d) Binary trees

7. Which hash function is used in the Division method?


a) h(k)= k/m, where k is the key and m is the table size.
b) h(k)= k mod m, where k is the key and m is the table size
c) h(k)= m/k, where k is the key and m is the table size
d) h(k)= m mod k, where k is the key and m is the table size

Answer:
Question Number    Answer Key
1                  b) Hashing
2                  a) A structure that maps keys to values
3                  c) Double hashing
4                  a) Distinct array position for every possible key
5                  b) A function that computes the location of the key in the array
6                  a) Singly linked list
7                  b) h(k)= k mod m, where k is the key and m is the table size

4.8.2 Fill up the Blanks and Answer:


1. The hash function should generate different ___________for the similar string.
2. A collision occurs when a hash function maps _____________in the same location of a
hash table.
3. The main disadvantage in linear probing is___________.
4. Worst time to search an element in linear probing is ___________.
5. Quadratic probing has __________clustering.
6. In Division method, the key is divided by the table size and the ___________will be
considered as the address for the hash table.
Answer

1. Hash values
2. two different keys
3. clustering
4. O (table size).
5. Secondary
6. Remainder

4.8.3 Short Type question and Answer:

1. What is Hashing?

Answer: Hashing is a technique to convert a range of key values into a range of indexes of an
array. In other words, hashing is the process of mapping keys and their values into a hash table
by using a hash function. Hash functions of a related kind, called message digest functions, are
widely used in cryptography, for example in digital signatures.

2. What is Hash Table?

Answer: Hash table is a data structure in which a key is mapped to array locations by a hash
function. In simple words, a hash table is an array in which insertion and searching is done through
hashing. Hash table stores some elements which basically consist of two main components, i.e.,
key and value. Key is a unique integer used for indexing the values, whereas Value means data
which is associated with key.

3. Explain the Division method and the Multiplication method of hash function.

Answer: In the Division method, the key is divided by the table size and the remainder is
taken as the address in the hash table. The Multiplication method applies the hash
function h(k) = ⌊ m ✕ ((k ✕ A) mod 1) ⌋,
where A is a constant between 0 and 1, (k ✕ A) mod 1 gives the fractional part of kA and m is
the total number of indices in the hash table.

4. What is collision in Hash Table?

Answer: An ideal hash function would map every possible key to a different location, but in
practice this is not possible. A collision occurs when a hash function maps two different keys to
the same location of a hash table. So, we use collision resolution techniques by which
these keys can be placed in an alternate location.
The two most important collision resolution techniques are:
i) Open Addressing (Closed Hashing)
ii) Separate Chaining (Open Hashing)

5. Explain Double Hashing.

Answer: In double hashing we use two independent hash functions rather than a single hash
function. Hence, it is called double hashing. The double hash function can be defined as:
h(k, i) = [h1(k) + i ✕ h2(k)] mod t

Here, t is the table size, and h1(k) and h2(k) are two independent hash functions, where
h1(k) = k mod t and h2(k) = k mod t’, with t’ less than t. The probe number i starts at 0 and goes
up to t-1.

4.9 Summing Up

• In hashing we get O(1) search time on an average and O(n) in the worst case. Hashing is
an improvement over Direct Access Table.
• Hashing is a technique to convert a range of key values into a range of indexes of an array.
• Hash table is a data structure in which a key is mapped to array locations by a hash
function. In simple words, a hash table is an array in which insertion and searching is done
through hashing.
• Suppose we generated two addresses of an array from different keys using a hash function.
If both the addresses of the array generated by the hash function are the same, then this
situation is called Collision in Hash table.
• A hash function is a function which is applied to a key and generates an integer
within some suitable range; this integer is used as an address in the hash table. A good hash
function keeps the number of collisions small.
• The most popular hash functions are the Division Method (Modulo-Division), the
Multiplication Method, the Mid-Square Method, etc.
• In the Division method, the key is divided by the table size and the remainder is
taken as the address in the hash table.
• The Multiplication method applies the hash function h(k) = ⌊ m ✕ ((k ✕ A) mod 1) ⌋,
where (k ✕ A) mod 1 gives the fractional part of kA and m is the total number of indices
in the hash table.
• In the Mid-Square Method, the key value is squared and some digits or bits from the middle
of this squared value are taken as the address. This technique can generate addresses with high
randomness if we take a big enough value.
• A collision occurs when a hash function maps two different keys in the same location of a
hash table. Different collision resolution techniques can be adopted by which these keys
can be placed in an alternate location.
• The two most important collision resolution techniques studied here are Open
Addressing (Closed Hashing) and Separate Chaining (Open Hashing).
• In Open Addressing, a new position is computed using a probe sequence and the record
is stored in that position.
• Open addressing can be implemented by using three methods, named Linear Probing,
Quadratic Probing and Double Hashing.
• In Linear Probing, we search linearly for the next empty cell in the hash
table. For example, if a hash function gives an address ‘a’ and it is not empty, then
we try the next location, i.e. ‘a+1’, and so on.
• In Quadratic Probing, if the location generated by h(k) is already occupied,
then the following function is used to find another location:

H(k, i) = (h(k) + i²) mod Table_size

Here, i varies from 0 to Table_size-1 and h is the hash function. Here also the array is
assumed to be closed. The search for empty locations will be in the sequence:
h(k), h(k)+1, h(k)+4, h(k)+9, ... all mod Table_size.
• In Double Hashing, we use two hash functions rather than a single function. The double
hash function can be defined as:

h(k, i) = [h1(k) + i ✕ h2(k)] mod t

Here, t is the table size, and h1(k) and h2(k) are two independent hash functions, where
h1(k) = k mod t and h2(k) = k mod t’, with t’ less than t. The probe number i starts at 0 and
goes up to t-1.
• In the Separate Chaining method, each location in a hash table stores a pointer to a linked
list that contains all the key values that were hashed to that location.

4.10 Questions:
4.10.1 Short type questions:
1. What is Hashing?
2. What is Hash Table?
3. What is the importance of Hashing?
4. What is collision in Hash Table?
5. Explain the Division Method.
6. Explain Double Hashing.

4.10.2 Broad type questions:


1. What is hashing? Write few applications of Hashing?
2. Explain different types of Hash function.
3. What is collision in Hash table? Write a brief overview of different collision
Resolution Techniques.

4.11 Suggested Readings:


1. Data Structures Through C in Depth by S. K. Srivastava and Deepali Srivastava
2. Data Structures Using C by Reema Thareja
