Hashing FINAL
Unit 4: Hashing
Unit Contents: Hash function, Hash Tables, Hash Collision, Collision resolution
CONTENTS:
4.0 Introduction
4.1 Unit objective
4.2 Hashing
4.3 Hash Table
4.4 Collision in Hash Table
4.5 Hash Function
4.6 Types of Hash Function
4.6.1 Division method
4.6.2 Multiplication Method
4.6.3 Mid-Square Method
4.7 Collision Resolution
4.7.1 Open Addressing (Closed Hashing)
4.7.1.1 Linear Probing
4.7.1.2 Quadratic Probing
4.7.1.3 Double Hashing
4.7.2 Separate Chaining (Open Hashing)
4.8 Check your progress
4.9 Summing Up
4.10 Questions
4.10.1 Short type questions
4.10.2 Broad type questions
4.11 Suggested Readings
4.0 Introduction:
In data structures and algorithms we generally study how to retrieve an element from the data
structure in which it is stored. In sequential search, binary search and all the search trees, the
searching time depends on the number of elements and the number of key comparisons. Linear search
runs in O(n) time, while binary search runs in O(log n) time. In a balanced binary search tree, the
running time is guaranteed to be O(log n). Here we will discuss a searching approach that involves
fewer key comparisons and in which searching can be performed in constant time, i.e. O(1), on
average. In this approach the searching time is independent of the number of elements. Hashing is
such an approach. Before going to hashing we will discuss a data structure named the direct access
table, using an example. Suppose we would like to store students' records keyed by phone number.
In the direct access table we create an array and use the phone numbers as indexes into the
array. If no record exists for a phone number, the array entry is NIL. Otherwise the array entry
stores a pointer to the record corresponding to that phone number. In this approach, we can
insert, delete and search in O(1) time. In terms of time complexity this is the best solution of
all. For example, to insert a phone number, we create a record with the details of the given phone
number, use the phone number as the index and store the pointer to the created record in the table.
The direct access table has many practical limitations. First, it requires a huge amount of extra
space, because the array needs one entry for every possible phone number. Second, an integer in a
programming language may not be able to store n digits.
Hashing overcomes these problems. With hashing we get O(1) search time on average and O(n) in
the worst case. Hashing is an improvement over the direct access table.
4.2 Hashing:
Hashing is a technique to convert a range of key values into a range of indexes of an array. In
other words, hashing is the process of mapping keys and their values into a hash table by using a
hash function. In cryptography, hash functions are also called message digest functions, and they
are widely used in creating and verifying digital signatures.
Let us describe the concept by using an example.
Suppose we have to store the information of 100 students. Each student has a unique roll number
in the range 0-99 and a name. The roll numbers are used as keys. We can take an array of size 100
to store the information.
Here, we can directly access any student's information through the key index, because the key
index and the student's roll number are the same. This method of searching is called direct
addressing. It is useful when the set of possible keys is very small.
Let us consider the case where we need to store the information of 100 students and a five-digit
roll_no is taken as the key index. If we use the direct addressing method, the key index will be in
the range 00000 to 99999 and the array size will be 100000.
But in this method we will use only 100 locations of the array. The remaining 99900 locations
will be unused, which is a waste of space. That is why direct addressing is rarely used.
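The space cost of direct addressing can be seen in a minimal Python sketch. The two roll numbers and the student names are made-up sample data, and the names `table`, `insert` and `search` are chosen only for this illustration.

```python
# Direct addressing: the key itself is the array index.
# Roll numbers and names below are made-up sample data.
TABLE_SIZE = 100_000            # one slot for every possible five-digit roll number

table = [None] * TABLE_SIZE     # None plays the role of NIL


def insert(roll_no, name):
    table[roll_no] = name       # O(1): a single index operation


def search(roll_no):
    return table[roll_no]       # O(1); None means no record is stored


insert(54567, "Student A")
insert(45859, "Student B")

used = sum(1 for slot in table if slot is not None)
print(search(54567))                          # Student A
print(used, "of", TABLE_SIZE, "slots used")   # 2 of 100000 slots used
```

Every operation is a single array access, but almost all of the allocated slots stay empty.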
Now let us discuss how this technique can be improved so that less memory is wasted. We adopt an
approach that converts the key into a value within a smaller range, and this value is used as the
key index. There are many techniques by which we can reduce the size of the array. For example,
we can use the last two digits of the key to identify a student. With this technique, the
information of the student with roll no 54567 is stored at array index 67. Similarly, the
information of the student with roll no 45859 is stored at array index 59. This process of
converting a key to an array index is called hashing, and the conversion is done by a hash
function.
[Figure: Key → Hash Function() → Address]
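The "last two digits" idea can be sketched in Python as follows. The function name `last_two_digits` is chosen for this illustration only; taking the remainder modulo 100 is one way to extract the last two digits of a non-negative integer.

```python
def last_two_digits(roll_no):
    """Hash function from the example: keep only the last two digits of the key."""
    return roll_no % 100


table = [None] * 100            # 100 slots instead of 100000

for roll_no in (54567, 45859):
    table[last_two_digits(roll_no)] = roll_no

print(last_two_digits(54567))   # 67
print(last_two_digits(45859))   # 59
```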
4.3 Hash Table:
A hash table is a data structure in which a key is mapped to array locations by a hash function.
In simple words, a hash table is an array in which insertion and searching are done through
hashing. A hash table stores elements that consist of two main components, i.e., a key and a
value. The key is a unique integer used for indexing the values, whereas the value is the data
associated with the key.
4.4 Collision in Hash Table:
Suppose we generate two addresses of an array from two different keys using a hash function. If
both addresses generated by the hash function are the same, the situation is called a collision
in the hash table. Let us describe collision by using an example.
Suppose we need to store employee information keyed by Emp_id. Here we consider a hash function
that maps a key to an array address by summing all the digits in the Emp_id. If the Emp_id is
62865, then this employee's information will be stored in array location 6+2+8+6+5 = 27.
Now, we would like to store the information of the employee with Emp_id 59823. If we add all the
digits of this Emp_id, the sum is again 5+9+8+2+3 = 27. But location 27 is already occupied by
Emp_id 62865. This situation is a collision in the hash table. Keys that are mapped to the same
address are called synonyms.
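The collision between the two Emp_id values can be reproduced with a short Python sketch. The function name `digit_sum_hash` and the use of a dictionary to record occupied addresses are choices made for this illustration.

```python
def digit_sum_hash(emp_id):
    """Hash function from the example: the sum of the decimal digits of the key."""
    return sum(int(digit) for digit in str(emp_id))


occupied = {}                   # address -> Emp_id already stored there
for emp_id in (62865, 59823):
    address = digit_sum_hash(emp_id)
    if address in occupied:
        print(f"collision: {emp_id} and {occupied[address]} both map to {address}")
    else:
        occupied[address] = emp_id
# prints: collision: 59823 and 62865 both map to 27
```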
4.5 Hash Function:
A hash function is a function that is applied to a key and generates an integer within some
suitable range; this integer is used as an address in the hash table. The integer returned by the
hash function is called the hash key. In practice there is no hash function that can eliminate
collisions completely, but a good hash function can minimize them.
A good hash function has the following characteristics:
1. The hash function should generate different hash values for similar strings.
2. The hash function should be easy to compute.
3. The hash function should distribute the keys as uniformly as possible over the array.
4. The hash function should generate addresses with a minimum number of collisions.
5. The hash function should use all of the input data when computing the address.
4.6.1 Division method:
Here, the key ‘x’ is divided by the table size ‘m’ and the remainder is used as the address in
the hash table. This method ensures that the address lies within the limited range of the table,
i.e. in the range {0, 1, 2, ..., m-1}. The hash function can be given as
h(x) = x mod m
In this method collisions can be reduced if m is taken to be a prime number.
For example, with x = 59467 and m = 17:
h(59467) = 59467 mod 17 = 1
Even when m is a prime number, collisions can still occur. The distribution can be improved
further by choosing for table_size a prime number that is not too close to an exact power of 2.
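A minimal Python sketch of the division method is given below. The function name `division_hash` and the keys 16, 32 and 48 are illustrative choices, not part of the unit; they are used only to show why a table size that is an exact power of 2 can behave badly.

```python
def division_hash(x, m):
    """Division method: the remainder of the key x divided by the table size m."""
    return x % m


print(division_hash(59467, 17))                        # 1

# Illustrative keys (an assumption of this sketch): multiples of 16.
keys = (16, 32, 48)
print([division_hash(k, 16) for k in keys])            # [0, 0, 0]   m is a power of 2
print([division_hash(k, 17) for k in keys])            # [16, 15, 14] m is prime
```

With m = 16 only the lowest four bits of the key influence the address, so all three keys collide; with the prime m = 17 they are spread over different addresses.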
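4.6.2 Multiplication Method:
The multiplication method applies the hash function
h(k) = ⌊m((k ✕ A) mod 1)⌋
where A is a constant with 0 < A < 1, (k ✕ A) mod 1 gives the fractional part of kA and m is the
total number of indices in the hash table.
For example:
Hash table of size m = 1000, key k = 24561, A = 0.6180339887
h(24561) = ⌊1000 ✕ ((24561 ✕ 0.6180339887) mod 1)⌋
= ⌊1000 ✕ (15179.5327964 mod 1)⌋
= ⌊1000 ✕ 0.5327964⌋
= ⌊532.7964⌋
= 532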
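The calculation above can be checked with a few lines of Python. This is only a sketch: the function name `multiplication_hash` is hypothetical, the constant 0.6180339887 is the one used in the example, and floating-point arithmetic is assumed to be accurate enough for keys of this size.

```python
import math

A = 0.6180339887                # constant used in the worked example


def multiplication_hash(k, m, a=A):
    """Multiplication method: floor(m * fractional part of k * a)."""
    fractional = (k * a) % 1.0  # fractional part of k * a
    return math.floor(m * fractional)


print(multiplication_hash(24561, 1000))   # 532
```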
4.7 Collision Resolution:
An ideal hash function would map every possible key to a distinct location, but in practice this
is impossible when there are more possible keys than table locations. A collision occurs when a
hash function maps two different keys to the same location of a hash table. So, we use collision
resolution techniques by which these keys can be placed in an alternate location.
The two most important collision resolution techniques, which we will study here, are:
i) Open Addressing (Closed Hashing)
ii) Separate Chaining (Open Hashing)
4.7.1 Open Addressing (Closed Hashing):
When we map a key to a particular location in the hash table by using the hash function and the
location is already occupied, we search for some other empty location in the hash table and
insert the value there. This method is also known as Closed Hashing because the array is
assumed to be closed.
Note: The whole process of examining memory locations in the hash table is called probing. We
will mainly study three methods of searching for an empty location inside the table when we face
a collision.
1. Linear Probing
2. Quadratic Probing
3. Double hashing
4.7.1.1 Linear Probing:
When the hash function gives an address that is already occupied in the hash table, the table is
searched linearly for the next empty cell. For example, if the hash function gives an address ‘a’
and it is not empty, then the next location, i.e. ‘a+1’, is examined. If this location is also
occupied, the next location, i.e. ‘a+2’, is examined, and this procedure continues until an empty
location is found where the key can be inserted. During the search for an empty position, we
assume the array is closed, i.e. circular. For example, if we need to store a key value in
position 4, where the table size is 10, then we search for an empty location in the sequence
4, 5, 6, 7, 8, 9, 0, 1, 2, 3. If location 4 is empty, we insert the value in position 4; otherwise
we search linearly for an empty position and store the key value in that position.
Advantage-
Disadvantage-
Time Complexity-
The worst-case time to search for an element with linear probing is O(table size).
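The circular probe sequence of linear probing can be sketched in Python as follows. The hash function h(key) = key mod table size and the sample keys 24, 34, 49 and 59 are assumptions made for this illustration; they are chosen so that two collisions occur, one of which wraps around to the start of the table.

```python
def probe_sequence(start, table_size):
    """Locations examined by linear probing, treating the array as circular."""
    return [(start + i) % table_size for i in range(table_size)]


def insert_linear(table, key):
    """Insert key with linear probing; h(key) = key mod table size is assumed."""
    size = len(table)
    for index in probe_sequence(key % size, size):
        if table[index] is None:        # first empty location wins
            table[index] = key
            return index
    raise RuntimeError("hash table is full")


print(probe_sequence(4, 10))            # [4, 5, 6, 7, 8, 9, 0, 1, 2, 3]

table = [None] * 10
for key in (24, 34, 49, 59):            # sample keys chosen to collide
    print(key, "->", insert_linear(table, key))
# 24 -> 4, 34 -> 5, 49 -> 9, 59 -> 0
```

In the worst case the loop visits every location once, which corresponds to the O(table size) bound.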
4.7.1.2 Quadratic Probing:
In linear probing, the main disadvantage is clustering. In quadratic probing this problem is
reduced by storing the colliding keys away from the initial collision point.
H(k, i) = (h(k) + i²) mod Table_size
Here, i varies from 0 to Table_size-1 and h is the hash function. Here also the array is assumed
to be closed. The search for empty locations will be in the sequence:
h(k), h(k)+1, h(k)+4, h(k)+9, ... all mod Table_size.
For example,
Table size = 10
h(key) = key % 10
Keys to be inserted: 48, 34, 29, 68, 98, 54, 53, 74
h(48) = 48 % 10 = 8
h(34) = 34 % 10 = 4
h(29) = 29 % 10 = 9
h(68) = 68 % 10 = 8
h(98) = 98 % 10 = 8
h(54) = 54 % 10 = 4
h(53) = 53 % 10 = 3
h(74) = 74 % 10 = 4
Keys 48, 34 and 29 are inserted without collision. But for the key 68 the address is 8, which is
already occupied. In such cases we search for the next free location using the probing function.
The next location is (8+1²) % 10 = 9, which is also occupied. So the next location is
(8+2²) % 10 = 12 % 10 = 2, which is empty, and the key is inserted at position 2.
The remaining keys are placed in the same way: 98 goes to location 7, 54 to location 5, 53 to
location 3 and 74 to location 0.
Advantages:
Quadratic probing is less likely to suffer from primary clustering and is easier to
implement than Double Hashing.
Disadvantages:
• Quadratic probing has secondary clustering. This happens because two keys that hash to
the same location have the same probe sequence, so many attempts may be needed
before an insertion is made.
• Also, the probe sequence does not necessarily probe all locations in the table.
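Quadratic probing with table size 10, h(key) = key % 10 and the keys 48, 34, 29, 68, 98, 54, 53, 74 can be traced with the following Python sketch. The function name `insert_quadratic` is hypothetical, and raising an error when no free slot is found within Table_size probes is a design choice of this sketch, reflecting the fact that the probe sequence may not reach every location.

```python
def insert_quadratic(table, key):
    """Insert key using H(k, i) = (h(k) + i*i) mod Table_size, h(k) = k mod Table_size."""
    size = len(table)
    home = key % size
    for i in range(size):               # i = 0, 1, ..., Table_size-1
        index = (home + i * i) % size
        if table[index] is None:
            table[index] = key
            return index
    raise RuntimeError(f"no free location found for key {key}")


table = [None] * 10
for key in (48, 34, 29, 68, 98, 54, 53, 74):
    insert_quadratic(table, key)

print(table)
# [74, None, 68, 53, 34, 54, None, 98, 48, 29]
```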
4.7.1.3 Double Hashing:
In double hashing we use two independent hash functions rather than a single hash function.
Hence, it is called double hashing. The double hash function can be defined as:
h(k, i) = [h1(k) + i ✕ h2(k)] mod t
Here, t is the table size, and h1(k) and h2(k) are two independent hash functions, where
h1(k) = k mod t and h2(k) = k mod t’, with t’ less than t. The probe number i runs from 0 to t-1.
For example:
Table size (t) = 10; the keys to be inserted are 54, 97, 43, 27, 34
h1(k) = (k mod 10) and h2(k) = (k mod 8)
Initially, the hash table is empty (an empty location is shown as -):
Index:   0   1   2   3   4   5   6   7   8   9
Key:     -   -   -   -   -   -   -   -   -   -
We have,
h(k, i) = [h1(k) + i ✕ h2(k)] mod t
Step 1: Key = 54
h(54, 0) = [54 mod 10 + (0 ✕ (54 mod 8))] mod 10
= [4 + (0 ✕ 6)] mod 10
= 4 mod 10
= 4
(Since position 4 is empty, we put 54 in location 4.)
Index:   0   1   2   3   4   5   6   7   8   9
Key:     -   -   -   -  54   -   -   -   -   -
Step 2: Key = 97
h(97, 0) = [97 mod 10 + (0 ✕ (97 mod 8))] mod 10
= [7 + (0 ✕ 1)] mod 10
= [7 + 0] mod 10
= 7
(Since position 7 is empty, we put 97 in location 7.)
Index:   0   1   2   3   4   5   6   7   8   9
Key:     -   -   -   -  54   -   -  97   -   -
Step 3: Key = 43
h(43, 0) = [43 mod 10 + (0 ✕ (43 mod 8))] mod 10
= [3 + (0 ✕ 3)] mod 10
= [3 + 0] mod 10
= 3
(Since position 3 is empty, we put 43 in location 3.)
Index:   0   1   2   3   4   5   6   7   8   9
Key:     -   -   -  43  54   -   -  97   -   -
Step 4: Key = 27
h(27, 0) = [27 mod 10 + (0 ✕ (27 mod 8))] mod 10
= [7 + (0 ✕ 3)] mod 10
= [7 + 0] mod 10
= 7
(Location 7 is already occupied by the key 97, so we cannot store the key 27 in location 7.
We find the next location by taking the probe number i = 1.)
h(27, 1) = [27 mod 10 + (1 ✕ (27 mod 8))] mod 10
= [7 + (1 ✕ 3)] mod 10
= 10 mod 10
= 0
(Since position 0 is empty, we put 27 in location 0.)
Index:   0   1   2   3   4   5   6   7   8   9
Key:    27   -   -  43  54   -   -  97   -   -
Step 5: Key = 34
h(34, 0) = [34 mod 10 + (0 ✕ (34 mod 8))] mod 10
= [4 + (0 ✕ 2)] mod 10
= [4 + 0] mod 10
= 4
(Location 4 is already occupied by the key 54, so we cannot store the key 34 in location 4.
We find the next location by taking the probe number i = 1.)
h(34, 1) = [34 mod 10 + (1 ✕ (34 mod 8))] mod 10
= [4 + (1 ✕ 2)] mod 10
= 6 mod 10
= 6
(Since position 6 is empty, we put 34 in location 6.)
Index:   0   1   2   3   4   5   6   7   8   9
Key:    27   -   -  43  54   -  34  97   -   -
Advantages:
• Double hashing largely avoids the clustering problems of linear and quadratic probing.
Disadvantages:
• Double hashing is more complicated to implement than linear or quadratic probing.
• Double hashing can cause thrashing.
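A hand calculation of double hashing can be checked step by step with a short Python sketch that evaluates h(k, i) = [h1(k) + i ✕ h2(k)] mod t directly, here for t = 10, h1(k) = k mod 10, h2(k) = k mod 8 and the keys 54, 97, 43, 27, 34. The function name `insert_double` and the parameter name `t_prime` are hypothetical. Note that h2(k) = k mod t’ is 0 for some keys (for example k = 16), in which case the probe sequence never moves; this sketch does not guard against that beyond giving up after t probes.

```python
def insert_double(table, key, t_prime=8):
    """Insert key using h(k, i) = (h1(k) + i * h2(k)) mod t."""
    t = len(table)
    h1 = key % t                # first hash function: k mod t
    h2 = key % t_prime          # second hash function: k mod t'
    for i in range(t):          # probe number i = 0, 1, ..., t-1
        index = (h1 + i * h2) % t
        if table[index] is None:
            table[index] = key
            return index
    raise RuntimeError(f"no free location found for key {key}")


table = [None] * 10
for key in (54, 97, 43, 27, 34):
    print(key, "->", insert_double(table, key))
# 54 -> 4, 97 -> 7, 43 -> 3, 27 -> 0, 34 -> 6

print(table)
# [27, None, None, 43, 54, None, 34, 97, None, None]
```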
4.7.2 Separate Chaining (Open Hashing):
In this technique, when several keys produce the same hash address, a linked list is maintained
for those elements. Here the hash table does not contain the actual keys and records; it is just
an array of pointers, where each pointer points to a linked list. That is, location 1 in the hash
table points to the head of the linked list of all the key values that hash to 1. If no key value
hashes to 1, the location is set to NULL.
For example:
The keys to be hashed are 267, 341, 223, 674, 755, 921, 733, 874, 231, 397.
Table size = 10
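The example can be sketched in Python as follows. Two assumptions are made here: the hash function is taken to be h(key) = key mod 10, since the unit does not state which one is used, and ordinary Python lists stand in for the linked lists. The function names `build_chained_table` and `search_chained` are hypothetical.

```python
def build_chained_table(keys, table_size):
    """Separate chaining: each table location holds a chain of colliding keys."""
    table = [[] for _ in range(table_size)]   # an empty chain plays the role of NULL
    for key in keys:
        table[key % table_size].append(key)   # assumed hash function: key mod table size
    return table


def search_chained(table, key):
    """Search only the chain at the key's hash address."""
    return key in table[key % len(table)]


keys = (267, 341, 223, 674, 755, 921, 733, 874, 231, 397)
table = build_chained_table(keys, 10)

for index, chain in enumerate(table):
    if chain:
        print(index, chain)
# 1 [341, 921, 231]
# 3 [223, 733]
# 4 [674, 874]
# 5 [755]
# 7 [267, 397]
```

A search has to walk only the chain stored at one location, so its cost depends on the length of that chain rather than on the total number of keys.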
Disadvantages:
4.8 Check your progress
Answer:
Question Number    Answer Key
1                  b) Hashing
2                  a) A structure that maps keys to values
3                  c) Double hashing
4                  a) Distinct array position for every possible key
1. Hash values
2. two different keys
3. clustering
4. O (table size).
5. Secondary
6. Remainder
1. What is Hashing?
Answer: Hashing is a technique to convert a range of key values into a range of indexes of an
array. In other words, hashing is the process of mapping keys and their values into a hash table
by using a hash function. In cryptography, hash functions are also called message digest
functions, and they are widely used in creating and verifying digital signatures.
Answer: A hash table is a data structure in which a key is mapped to array locations by a hash
function. In simple words, a hash table is an array in which insertion and searching are done
through hashing. A hash table stores elements that consist of two main components, i.e., a key
and a value. The key is a unique integer used for indexing the values, whereas the value is the
data associated with the key.
Answer: The multiplication method applies the hash function h(k) = ⌊m((k ✕ A) mod 1)⌋,
where (k ✕ A) mod 1 gives the fractional part of kA and m is the total number of indices
in the hash table.
Answer: An ideal hash function would map every possible key to a distinct location, but in
practice this is impossible when there are more possible keys than table locations. A collision
occurs when a hash function maps two different keys to the same location of a hash table. So, we
use collision resolution techniques by which these keys can be placed in an alternate location.
The two most important collision resolution techniques are:
i) Open Addressing (Closed Hashing)
ii) Separate Chaining (Open Hashing)
Answer: In double hashing we use two independent hash functions rather than a single hash
function. Hence, it is called double hashing. The double hash function can be defined as:
h(k, i) = [h1(k) + i ✕ h2(k)] mod t
Here, t is the table size, and h1(k) and h2(k) are two independent hash functions, where
h1(k) = k mod t and h2(k) = k mod t’, with t’ less than t. The probe number i runs from 0 to t-1.
4.9 Summing Up
• With hashing we get O(1) search time on average and O(n) in the worst case. Hashing is
an improvement over the direct access table.
• Hashing is a technique to convert a range of key values into a range of indexes of an array.
• A hash table is a data structure in which a key is mapped to array locations by a hash
function. In simple words, a hash table is an array in which insertion and searching are done
through hashing.
• If a hash function generates the same array address for two different keys, the
situation is called a collision in the hash table.
• A hash function is a function that is applied to a key and generates an integer within
some suitable range; this integer is used as an address in the hash table. A good hash
function keeps the number of collisions low.
• The most popular hash functions are the Division Method (Modulo-Division),
Multiplication Method, Mid-Square Method, etc.
• In the Division method, the key is divided by the table size and the remainder is
used as the address in the hash table.
• The Multiplication method applies the hash function h(k) = ⌊m((k ✕ A) mod 1)⌋,
where (k ✕ A) mod 1 gives the fractional part of kA and m is the total number of indices
in the hash table.
• In the Mid-Square Method, the key value is squared and some digits or bits from the middle
of the squared value are taken as the address. This technique can generate addresses with
high randomness if we take a big enough value.
• A collision occurs when a hash function maps two different keys to the same location of a
hash table. Different collision resolution techniques can be adopted by which these keys
can be placed in an alternate location.
• The two most important collision resolution techniques are Open
Addressing (Closed Hashing) and Separate Chaining (Open Hashing).
• In Open Addressing, a new position is computed using a probe sequence and the record
is stored in that position.
• Open addressing can be implemented using three methods: Linear Probing,
Quadratic Probing and Double Hashing.
• In Linear Probing, the hash table is searched linearly for the next empty cell. For
example, if the hash function gives an address ‘a’ and it is not empty, then the next
location, i.e. ‘a+1’, is examined, and so on.
• In Quadratic Probing, if the location generated by h(k) is already occupied, the
following hash function is used to resolve the problem:
H(k, i) = (h(k) + i²) mod Table_size
4.10 Questions:
4.10.1 Short type questions:
1. What is Hashing?
2. What is a Hash Table?
3. What is the importance of Hashing?
4. What is a collision in a Hash Table?
5. Explain the Division Method.
6. Explain Double Hashing.