UNIT V - Hashing

Uploaded by

VVM
Copyright
© © All Rights Reserved
What is Hashing?

▪ Hashing is the process of mapping a large number of data items to a smaller table with the
help of a hash function.
▪ A hash function is also known as a hashing algorithm or message digest function.
▪ It is a technique to convert a range of key values into a range of indexes of an array.
▪ It provides a faster way of searching than linear or binary search.
▪ Hashing allows any data entry to be updated and retrieved in constant time, O(1), on average.
▪ Constant time O(1) means the time for the operation does not depend on the size of the data.
▪ Hashing is used with databases to enable items to be retrieved more quickly.
▪ It is used in creating and verifying digital signatures.

Hashing is an important technique designed to solve the problem of efficiently finding and
storing data in an array.

Example:

If you have a list of 20000 numbers and are given a number to search for in that list, you
will scan each number in the list until you find a match. Searching the entire list to locate
that specific number takes a significant amount of time. This manual process of scanning is
not only time-consuming but also inefficient. With hashing, you can narrow down the search
and go almost directly to the location where the number is stored.
Examples of Hashing in Data Structure

The following are real-life examples of hashing in the data structure –

• In schools, the teacher assigns a unique roll number to each student. Later, the teacher
uses that roll number to retrieve information about that student.

• A library has a very large number of books. The librarian assigns a unique number to each
book. This unique number helps in identifying the position of the book on the bookshelf.

What is Hash Function?


▪ A fixed process that converts a key to a hash key is known as a hash function.
▪ This function takes a key and maps it to a value of a certain length, which is called a hash
value or hash.
▪ The hash value represents the original string of characters, but it is normally smaller than the
original.
▪ When used with digital signatures, the hash value is sent to the receiver along with the
signature. The receiver uses the same hash function to generate the hash value and then
compares it to the one received with the message.
▪ If the hash values are the same, the message was transmitted without errors.
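
The receiver-side check described in the last two points can be sketched as follows. The notes do not name a particular hash function, so SHA-256 from Python's standard hashlib module is an assumption here, and the message contents are made up for illustration.

```python
import hashlib


def digest(message: bytes) -> str:
    """Return the SHA-256 hash value of the message as a hex string."""
    return hashlib.sha256(message).hexdigest()


# Sender side: compute the hash value and send it along with the message.
message = b"transfer 100 to account 42"   # hypothetical message
sent_hash = digest(message)

# Receiver side: recompute the hash with the same function and compare.
received_ok = digest(message) == sent_hash
received_tampered = digest(b"transfer 900 to account 42") == sent_hash

print(received_ok)        # True
print(received_tampered)  # False
```

Whatever the length of the message, the hash value has a fixed length (64 hex characters for SHA-256), which is why it is normally smaller than the original.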

What is Hash Table?


▪ A hash table, or hash map, is a data structure used to store key-value pairs.
▪ It is a collection of items stored in a way that makes them easy to find later.
▪ It uses a hash function to compute an index into an array of buckets or slots, from which
the desired value can be found.
▪ With chaining, it is an array of lists, where each list is known as a bucket.
▪ Each value is stored and retrieved based on its key.
▪ In Java, the Hashtable class implements the Map interface and extends the Dictionary class;
it is synchronized and contains only unique keys.
▪ The above figure shows a hash table of size n = 10. Each position of the hash table is
called a slot. The table has n slots, named {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}: slot 0, slot 1,
slot 2, and so on. Initially the hash table contains no items, so every slot is empty.
▪ The mapping between an item and the slot where the item belongs in the hash table is
called the hash function. The hash function takes any item in the collection and returns an
integer in the range of slot names, from 0 to n-1.
▪ Suppose we have the integer items {26, 70, 18, 31, 54, 93}. One common method of
determining a hash key is the division method of hashing, whose formula is:

Hash Key = Key Value % Number of Slots in the Table
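
As a minimal sketch, the division formula can be applied to the six items above, assuming the 10-slot table just described. The names hash_key, NUM_SLOTS, and table are illustrative only, and None stands for an empty slot.

```python
def hash_key(key: int, num_slots: int) -> int:
    """Division method: the remainder of the key divided by the slot count."""
    return key % num_slots


NUM_SLOTS = 10
items = [26, 70, 18, 31, 54, 93]

# Start with every slot empty, then place each item in its computed slot.
table = [None] * NUM_SLOTS
for item in items:
    table[hash_key(item, NUM_SLOTS)] = item

print(table)
# [70, 31, None, 93, 54, None, 26, None, 18, None]
```

With these particular items no two keys land in the same slot; when they do, a collision resolution technique is needed.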

Hashing in a data structure is a two-step process.

1. The hash function converts the item into a small integer or hash value. This integer is
used as an index to store the original data.

2. It stores the data in a hash table. You can use a hash key to locate data quickly.

Note: Why do we need hashing?


Many applications deal with lots of data
- Search engines and web pages
- There are myriad lookups.
- The lookups are time critical.
- Typical data structures, like arrays and lists, may not be sufficient to
handle efficient lookups.
- In general: when lookups need to occur in near constant time, O(1).
- We need something that can do better than a binary search, O(log N).
We want O(1).

Solution: Hashing
Division method
Choose a number m larger than the number n of keys in K. (The number m is
usually chosen to be a prime number or a number without small divisors, since
this frequently minimizes the number of collisions.) The hash function H is
defined by
H(k) = k (mod m) or H(k) = k (mod m) + 1
Here k (mod m) denotes the remainder when k is divided by m. The second
formula is used when we want the hash addresses to range from 1 to m rather
than from 0 to m-1.

Midsquare method
The key k is squared. Then the hash function H is defined by

H(k) = l
where l is obtained by deleting digits from both ends of k^2. We emphasize that
the same positions of k^2 must be used for all of the keys.
Folding method
The key k is partitioned into a number of parts, k1, ..., kr, where each part, except
possibly the last, has the same number of digits as the required address. Then
the parts are added together, ignoring the last carry. That is,
H(k) = k1 + k2 + ... + kr
where the leading-digit carries, if any, are ignored. Sometimes, for extra "milling",
the even-numbered parts, k2, k4, ..., are each reversed before the addition.
Example
Consider the company in the above Example, each of whose 68 employees is assigned a
unique 4-digit employee number. Suppose L consists of 100 two-digit addresses: 00, 01,
02, ......, 99. We apply the above hash functions to each of the following employee
numbers:

Division Method

3205, 7148, 2345

Choose a prime number m close to 99, such as m=97. Then H(3205)=4, H(7148)=67,
H(2345)=17

That is, dividing 3205 by 97 gives a remainder of 4, dividing 7148 by 97 gives a remainder
of 67, and dividing 2345 by 97 gives a remainder of 17. In the case that the memory
addresses begin with 01 rather than 00, we use the function H(k) = k (mod m) + 1 to
obtain:

H(3205)=4+1=5, H(7148)=67+1=68, H(2345)=17+1=18

Midsquare method

The following calculations are performed:

k:      3205          7148          2345

k^2:    10 272 025    51 093 904    5 499 025

H(k):   72            93            99

Observe that the fourth and fifth digits, counting from the right, are chosen for the hash
address.
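
The midsquare calculation above can be checked with a short sketch. The function name midsquare is illustrative, and the choice of the fourth and fifth digits from the right is taken from this example rather than being a general rule.

```python
def midsquare(key: int) -> int:
    """Square the key and keep the 4th and 5th digits counted from the right."""
    squared = key * key
    # Dropping the three rightmost digits and then keeping the next two
    # selects the same digit positions for every key.
    return (squared // 1000) % 100


for k in (3205, 7148, 2345):
    print(k, k * k, midsquare(k))
# 3205 10272025 72
# 7148 51093904 93
# 2345 5499025 99
```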

Folding method

Chopping the key k into two parts and adding yields the following hash addresses:

H(3205) = 32+05 = 37, H(7148) = 71+48 = 19, H(2345) = 23+45 = 68

Observe that the leading digit 1 in H(7148) is ignored. Alternatively, one may want to
reverse the second part before adding, thus producing the following hash addresses:

H(3205) = 32+50 = 82, H(7148) = 71+84 = 55, H(2345) = 23+54 = 77
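
Both folding variants can be reproduced with the sketch below. It assumes 4-digit keys split into two 2-digit parts, as in this example; the name fold and the flag reverse_even_parts are illustrative.

```python
def fold(key: int, reverse_even_parts: bool = False) -> int:
    """Folding method for 4-digit keys and 2-digit addresses."""
    digits = f"{key:04d}"                       # e.g. 3205 -> "3205"
    parts = [digits[i:i + 2] for i in range(0, len(digits), 2)]
    total = 0
    for position, part in enumerate(parts, start=1):
        if reverse_even_parts and position % 2 == 0:
            part = part[::-1]                   # reverse k2, k4, ...
        total += int(part)
    return total % 100                          # ignore the leading carry


for k in (3205, 7148, 2345):
    print(k, fold(k), fold(k, reverse_even_parts=True))
# 3205 37 82
# 7148 19 55
# 2345 68 77
```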
Collision Resolution

Collisions occur when the hash function maps two different keys to the same location.
Two records cannot be stored in the same location.
Suppose we want to add a new record R with key k to our file F, but
the memory location address H(k) is already occupied. This situation is
called a collision.

Therefore, a method for handling collisions, called a collision resolution
technique, is applied. The two most popular methods of resolving collisions are:

1. Chaining

2. Open addressing

Separate Chaining:
The idea is to make each cell of the hash table point to a linked list of records that have
the same hash function value.
Let us consider a simple hash function, "key mod 7", and the sequence of keys 50, 700, 76,
85, 92, 73, 101.
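
A minimal sketch of separate chaining for this key sequence follows. Python lists stand in for the linked lists, which is a simplification, and the function names are illustrative.

```python
def build_chained_table(keys, num_slots):
    """Insert each key at the end of the chain for slot (key mod num_slots)."""
    table = [[] for _ in range(num_slots)]
    for key in keys:
        table[key % num_slots].append(key)
    return table


def search(table, key):
    """Look only in the one chain the key can hash to."""
    return key in table[key % len(table)]


table = build_chained_table([50, 700, 76, 85, 92, 73, 101], 7)
for slot, chain in enumerate(table):
    print(slot, chain)
# 0 [700]
# 1 [50, 85, 92]
# 2 []
# 3 [73, 101]
# 4 []
# 5 []
# 6 [76]

print(search(table, 92), search(table, 8))  # True False
```

Keys 50, 85, and 92 all hash to slot 1 and share one chain, while slots 2, 4, and 5 stay empty.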
Advantages:
1) Simple to implement.
2) Hash table never fills up, we can always add more elements to the chain.
3) Less sensitive to the hash function or load factors.
4) It is mostly used when it is unknown how many and how frequently keys may be inserted or
deleted.

Disadvantages:
1) Cache performance of chaining is not good as keys are stored using a linked list. Open
addressing provides better cache performance as everything is stored in the same table.
2) Wastage of Space (Some Parts of hash table are never used)
3) If the chain becomes long, then search time can become O(n) in the worst case.
4) Uses extra space for links.

Performance of Chaining:
Performance of hashing can be evaluated under the assumption that each key is equally likely
to be hashed to any slot of the table (simple uniform hashing).

m = number of slots in the hash table
n = number of keys to be inserted in the hash table
Load factor α = n/m

Expected time to search = O(1 + α)
Expected time to delete = O(1 + α)
Time to insert = O(1)

The time complexity of search, insert, and delete is O(1) if α is O(1).

Open Addressing:

Linear Probing and Modifications

The hash table contains two types of values: sentinel values (e.g., –1) and data values. The
presence of a sentinel value indicates that the location contains no data value at present but
can be used to hold a value.

When a key is mapped to a particular memory location, the value that location holds is
checked. If it contains a sentinel value, the location is free and the data value can be stored
in it. However, if the location already has a data value stored in it, other slots are examined
systematically in the forward direction to find a free slot. If no free location is found, we
have an OVERFLOW condition.

The process of examining memory locations in the hash table is called probing. The open
addressing technique can be implemented using linear probing, quadratic probing, double
hashing, and rehashing.

Linear Probing

The simplest approach to resolving a collision is linear probing. In this technique, if a value
is already stored at the location generated by h(k), then the following hash function is used to
resolve the collision:

h(k, i) = [h'(k) + i] mod m

where m is the size of the hash table, h'(k) = (k mod m), and i is the probe number, which
varies from 0 to m–1.

Therefore, for a given key k, the location generated by h'(k) mod m is probed first, because
i = 0 on the first probe. If the location is free, the value is stored in it; otherwise the second
probe generates the address given by [h'(k) + 1] mod m. Similarly, if that location is occupied,
subsequent probes generate the addresses

[h'(k) + 2] mod m, [h'(k) + 3] mod m, [h'(k) + 4] mod m, [h'(k) + 5] mod m, and so on, until a
free location is found.

Note: Linear probing is known for its simplicity.
Example: Consider a hash table of size 10. Using linear probing, insert the keys 72, 27,
36, 24, 63, 81, 92, and 101 into the table.
Let h'(k) = k mod m, with m = 10.
Initially, every slot of the hash table is empty, that is, it holds the sentinel value –1.
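
The insertions for this example can be traced with a short sketch. It uses –1 as the sentinel, as suggested earlier in these notes; the function name insert_linear is illustrative.

```python
EMPTY = -1  # sentinel value marking a free slot


def insert_linear(table, key):
    """Probe h'(k), h'(k)+1, h'(k)+2, ... (mod m) until a free slot is found."""
    m = len(table)
    home = key % m                      # h'(k) = k mod m
    for i in range(m):                  # probe number i = 0 .. m-1
        slot = (home + i) % m
        if table[slot] == EMPTY:
            table[slot] = key
            return slot
    raise OverflowError("hash table is full")


table = [EMPTY] * 10
for key in [72, 27, 36, 24, 63, 81, 92, 101]:
    insert_linear(table, key)

print(table)
# [-1, 81, 72, 63, 24, 92, 36, 27, 101, -1]
```

Keys 72 through 81 go straight to their home slots. Key 92 finds slots 2, 3, and 4 occupied and settles in slot 5; key 101 has to walk from slot 1 all the way to slot 8, which shows how a cluster lengthens later probes.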

One main disadvantage of linear probing is that records tend to cluster, that is,
appear next to one another, when the load factor is greater than 50 percent.
Such clustering substantially increases the average search time for a record.
Two techniques to minimize clustering are as follows:

Quadratic probing
Suppose a record R with key k has the hash address H(k) = h. Then, instead of
searching the locations with addresses h, h+1, h+2, ..., we linearly search the
locations with addresses
h, h+1, h+4, h+9, h+16, ..., h+i^2, ...
If the number m of locations in the table T is a prime number, then the above
sequence will access half of the locations in T.

Double hashing
Here a second hash function H' is used for resolving a collision, as follows.
Suppose a record R with key k has the hash addresses H(k) = h and H'(k) = h' ≠ m.
Then we linearly search the locations with addresses
h, h+h', h+2h', h+3h', ...
ADVANTAGES :

Linear probing finds an empty location by doing a linear search in the array beginning from
position h(k). The algorithm provides good memory caching through good locality of
reference.

DISADVANTAGES :

Linear probing results in clustering, and thus there is a higher risk of more collisions where
one collision has already taken place. The performance of linear probing is sensitive to the
distribution of input values.

As the hash table fills, clusters of consecutive cells are formed and the time required for a
search increases with the size of the cluster.

Quadratic Probing
In this technique, if a value is already stored at a location generated by h(k), then the following
hash function is used to resolve the collision:
h(k, i) = [h'(k) + c1·i + c2·i^2] mod m
where m is the size of the hash table, h'(k) = (k mod m), i is the probe number that varies from
0 to m–1, and c1 and c2 are constants such that c1 and c2 ≠ 0.
Quadratic probing eliminates the primary clustering phenomenon of linear probing because
instead of doing a linear search, it does a quadratic search.
For a given key k, first the location generated by h'(k) mod m is probed. If the location is free,
the value is stored in it; otherwise subsequent locations probed are offset by amounts that
depend in a quadratic manner on the probe number i.

Although quadratic probing performs better than linear probing, in order to maximize the
utilization of the hash table, the values of c1, c2, and m need to be constrained.

Example
Consider a hash table of size 10. Using quadratic probing, insert the keys 72,
27, 36, 24, 63, 81, and 101 into the table. Take c1 = 1 and c2 = 3.
Solution
Let h'(k) = k mod m, with m = 10.
Initially, every slot of the hash table is empty, that is, it holds the sentinel value –1.
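
A minimal sketch of this example follows, using the probe function h(k, i) = [h'(k) + c1·i + c2·i^2] mod m with c1 = 1, c2 = 3, and –1 as the sentinel. The function name insert_quadratic is illustrative.

```python
EMPTY = -1  # sentinel value marking a free slot


def insert_quadratic(table, key, c1=1, c2=3):
    """Probe [h'(k) + c1*i + c2*i*i] mod m for i = 0 .. m-1."""
    m = len(table)
    home = key % m                      # h'(k) = k mod m
    for i in range(m):
        slot = (home + c1 * i + c2 * i * i) % m
        if table[slot] == EMPTY:
            table[slot] = key
            return slot
    return None                         # no free slot on this probe sequence


table = [EMPTY] * 10
for key in [72, 27, 36, 24, 63, 81, 101]:
    insert_quadratic(table, key)

print(table)
# [-1, 81, 72, 63, 24, 101, 36, 27, -1, -1]
```

Only key 101 collides: slot 1 already holds 81, so the second probe (i = 1) gives (1 + 1 + 3) mod 10 = 5, which is free.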
Note that for the double hashing sequence h, h+h', h+2h', h+3h', ..., if m is a prime
number, the sequence will access all the locations in the table T.

Remark: One major disadvantage in any type of open addressing procedure is in
the implementation of deletion. Specifically, suppose a record R is deleted from
the location T[r]. Afterwards, suppose we meet T[r] while searching for another
record R'. This does not necessarily mean that the search is unsuccessful. Thus,
when deleting the record R, we must label the location T[r] to indicate that it
previously contained a record.

ADVANTAGES

Quadratic probing resolves the primary clustering problem that exists in the linear probing
technique. Quadratic probing provides good memory caching because it preserves some locality
of reference.

DISADVANTAGES

Quadratic probing suffers from secondary clustering. This means that if two keys collide at
the same initial location, the same probe sequence will be followed for both. With quadratic
probing, the probability of multiple collisions increases as the table becomes full. This
situation is usually encountered when the hash table is more than half full.

Double Hashing

In double hashing, we use two hash functions rather than a single function. The hash function in
the case of double hashing can be given as:

h(k, i) = [h1(k) + i·h2(k)] mod m

where m is the size of the hash table, h1(k) and h2(k) are two hash functions given as
h1(k) = k mod m and h2(k) = k mod m', i is the probe number that varies from 0 to m–1, and
m' is chosen to be less than m. We can choose m' = m–1 or m–2.

When we have to insert a key k in the hash table, we first probe the location given by
h1(k) mod m, because i = 0 during the first probe. If the location is vacant, the key is
inserted into it; otherwise subsequent probes generate locations that are at an offset of
h2(k) mod m from the previous location. Since the offset varies from key to key depending on
the value generated by the second hash function, the performance of double hashing is very
close to the performance of the ideal scheme of uniform hashing.


Example

Consider a hash table of size 10. Using double hashing, insert the keys 72,
27, 36, 24, 63, 81, 92, and 101 into the table. Take h1 = (k mod 10) and h2 = (k mod 8).
Solution
Let m = 10.

Initially, every slot of the hash table is empty, that is, it holds the sentinel value –1.
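
This example can be traced with the sketch below, which applies h(k, i) = [h1(k) + i·h2(k)] mod m with h1 = k mod 10 and h2 = k mod 8. The function name insert_double, the m_prime parameter, and the choice to return None after m unsuccessful probes are assumptions of the sketch.

```python
EMPTY = -1  # sentinel value marking a free slot


def insert_double(table, key, m_prime=8):
    """Probe [h1(k) + i*h2(k)] mod m for i = 0 .. m-1."""
    m = len(table)
    h1 = key % m
    h2 = key % m_prime
    for i in range(m):
        slot = (h1 + i * h2) % m
        if table[slot] == EMPTY:
            table[slot] = key
            return slot
    return None                         # every probed slot was occupied


table = [EMPTY] * 10
results = {key: insert_double(table, key)
           for key in [72, 27, 36, 24, 63, 81, 92, 101]}

print(table)          # [92, 81, 72, 63, 24, -1, 36, 27, -1, -1]
print(results[92])    # 0
print(results[101])   # None
```

Key 92 collides at slot 2, then at slot 6, and is placed in slot 0 on the third probe. Key 101 has h1 = 1 and h2 = 5; because 5 and the table size 10 share a common factor, its probes alternate between slots 1 and 6, both occupied, so it is not placed within m probes even though free slots remain. This is one reason a prime table size is usually preferred.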

ADVANTAGES

Double hashing minimizes repeated collisions and the effects of clustering. That is, double
hashing is free from problems associated with primary clustering as well as secondary
clustering.
Rehashing

When the hash table becomes nearly full, the number of collisions increases, thereby degrading
the performance of insertion and search operations. In such cases, a better option is to create
a new hash table with double the size of the original hash table.

All the entries in the original hash table will then have to be moved to the new hash table. This
is done by taking each entry, computing its new hash value, and then inserting it in the new hash
table.

Though rehashing seems to be a simple process, it is quite expensive and must therefore not
be done frequently. Consider the hash table of size 5 given below.

The hash function used is h(x) = x % 5. Rehash the entries into a new hash table.
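
The table for this exercise is not reproduced in these notes, so the sketch below uses hypothetical keys (26, 31, 43, 17) purely for illustration. It also assumes linear probing for collisions and a new hash function h(x) = x % 10 for the doubled table.

```python
EMPTY = -1  # sentinel value marking a free slot


def insert(table, key):
    """Insert with h(x) = x % m and linear probing on collision."""
    m = len(table)
    for i in range(m):
        slot = (key % m + i) % m
        if table[slot] == EMPTY:
            table[slot] = key
            return
    raise OverflowError("hash table is full")


def rehash(old_table):
    """Create a table of double the size and re-insert every stored key."""
    new_table = [EMPTY] * (2 * len(old_table))
    for key in old_table:
        if key != EMPTY:
            insert(new_table, key)      # hash value is recomputed with new m
    return new_table


old_table = [EMPTY] * 5
for key in [26, 31, 43, 17]:            # hypothetical sample keys
    insert(old_table, key)

new_table = rehash(old_table)
print(old_table)  # [-1, 26, 31, 43, 17]
print(new_table)  # [-1, 31, -1, 43, -1, -1, 26, 17, -1, -1]
```

Every stored key is visited and hashed again, which is why rehashing costs time proportional to the table size and should not be done frequently.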
COMPARISON BETWEEN SEPARATE CHAINING AND OPEN ADDRESSING

| S.No. | Separate Chaining | Open Addressing |
|-------|-------------------|-----------------|
| 1. | Chaining is simpler to implement. | Open addressing requires more computation. |
| 2. | In chaining, the hash table never fills up; we can always add more elements to a chain. | In open addressing, the table may become full. |
| 3. | Chaining is less sensitive to the hash function or load factors. | Open addressing requires extra care to avoid clustering and a high load factor. |
| 4. | Chaining is mostly used when it is unknown how many and how frequently keys may be inserted or deleted. | Open addressing is used when the frequency and number of keys is known. |
| 5. | Cache performance of chaining is not good, as keys are stored using a linked list. | Open addressing provides better cache performance, as everything is stored in the same table. |
| 6. | Wastage of space (some parts of the hash table in chaining are never used). | In open addressing, a slot can be used even if an input doesn't map to it. |
| 7. | Chaining uses extra space for links. | No links in open addressing. |
