
Unit 5 Data Structure

Hashing is a technique that maps keys to values in a hash table using a hash function. It allows for fast access and retrieval of elements in constant time. Hashing works by applying a hash function to a key to generate a hash value that maps to an index location in a hash table where the value can be stored. Good hash functions uniformly distribute keys and minimize collisions where different keys map to the same index. Collisions are handled using techniques like separate chaining, which links collided elements together in buckets, or open addressing, which probes through the table to find empty slots for collided elements. Rehashing doubles the size of the table when load factors become too high to maintain efficiency.

Uploaded by

Jaff Bezos
Copyright
© All Rights Reserved

What is Hashing?

Hashing is a technique for mapping keys and their values into a hash table by using a
hash function. It is done for faster access to elements. The efficiency of the mapping depends
on the efficiency of the hash function used.
Need for Hash data structure
Storing an element in an array takes O(1) time, but searching for it takes O(log n) time at
best (binary search on a sorted array) and O(n) time if the array is unsorted. This cost
appears small, but for a large data set it adds up and makes a plain array inefficient for
lookups. With a hash data structure, data can be stored and retrieved in constant time on
average.

Components of Hashing
There are three major components of hashing:
1. Key: A key can be any value, such as a string or an integer, that is fed as input to the
hash function, which determines an index or location for storing an item in a data
structure.
2. Hash Function: The hash function receives the input key and returns the index of an
element in an array called a hash table. The index is known as the hash index.
3. Hash Table: A hash table is a data structure that maps keys to values using a hash
function. It stores the data associatively in an array, ideally with each data value
under its own unique index.

How does Hashing work?


Suppose we have a set of strings {“ab”, “cd”, “efg”} and we would like to store it in a
table.
Our main objective here is to search or update the values stored in the table quickly, in
O(1) time; we are not concerned about the ordering of the strings in the table. Each string
acts as the key and also as the value stored under that key. But how do we store the value
corresponding to a key?
Step 1: A hash function (some mathematical formula) is used to calculate the hash value,
which acts as the index in the data structure where the value will be stored.
Step 2: So, let's assign
“a” = 1,
“b” = 2, and so on, to all alphabetical characters.
Step 3: The numerical value of a string is then the sum of the values of its characters:
“ab” = 1 + 2 = 3,
“cd” = 3 + 4 = 7,
“efg” = 5 + 6 + 7 = 18
Step 4: Now, assume that we have a table of size 7 to store these strings. The hash function
used here is the sum of the characters in the key mod the table size. We can compute the
location of a string in the array as sum(string) mod 7.
Step 5: So we will then store
• “ab” in 3 mod 7 = 3,
• “cd” in 7 mod 7 = 0, and
• “efg” in 18 mod 7 = 4.

The above technique enables us to calculate the location of a given string by using a simple
hash function and rapidly find the value that is stored in that location. Therefore the idea of
hashing seems like a great way to store (key, value) pairs of the data in a table.
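
The steps above can be sketched in C as follows. This is a minimal illustration, not a production hash function: the name letter_sum_hash is made up for this example, and the code assumes keys contain only the lowercase letters 'a' to 'z'.

```c
#include <assert.h>
#include <stddef.h>

/* Sum the alphabet positions of the letters in key ('a' = 1, 'b' = 2, ...)
   and reduce the sum modulo table_size.
   Assumption: key contains only the lowercase letters 'a'..'z'. */
size_t letter_sum_hash(const char *key, size_t table_size)
{
    assert(table_size > 0);

    size_t sum = 0;
    for (const char *p = key; *p != '\0'; ++p) {
        sum += (size_t)(*p - 'a' + 1);
    }
    return sum % table_size;
}
```

With a table size of 7, letter_sum_hash("ab", 7) gives 3, letter_sum_hash("cd", 7) gives 0 and letter_sum_hash("efg", 7) gives 4, matching Step 5. Because the function only adds letter values, any two strings with the same letters in a different order (such as “ab” and “ba”) receive the same index.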
What is a Hash function?
A hash function creates a mapping between a key and the location of its value by means of a
mathematical formula. The result of the hash function is referred to as a hash value or
hash. The hash value is a representation of the original key, but is usually smaller than
the original.
For example: Consider an array as a map where the key is the index and the value is the
value at that index. For an array A, if we treat the index i as the key, then we can find
the value by simply looking at A[i].

Deletion
When deleting records from a hash table, there are two important considerations.
1. Deleting a record must not hinder later searches. Thus, the delete process cannot simply
mark the slot as empty, because this will isolate records further down the probe sequence.
2. We do not want to make positions in the hash table unusable because of deletion. The
freed slot should be available to a future insertion.

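Define a data item having some data and a key, based on which the search is to be conducted
in a hash table.
    struct DataItem {
        int data;   /* value stored in the table */
        int key;    /* key used to locate the item */
    };

Hash Method

Define a hashing method to compute the hash code of the key of the data item.
    #define SIZE 20   /* number of slots; 20 is an assumed example value */

    int hashCode(int key) {
        return key % SIZE;   /* assumes key is non-negative */
    }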

Whenever an element is to be deleted, compute the hash code of the key passed and use it as
an index into the array. If the element is not found at that index, use linear probing to
move ahead through the following slots until it is found. When found, store a dummy item
there so that later searches can still probe past the slot.
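
A minimal sketch of this deletion scheme is shown below. It reuses the struct DataItem and hashCode definitions from the text; the names hashArray, dummyItem, insertItem, searchItem and deleteItem, the table size of 20, and the use of key -1 for the dummy item are assumptions made for this example. Keys are assumed to be distinct and non-negative.

```c
#include <assert.h>
#include <stddef.h>

#define SIZE 20   /* assumed table size */

struct DataItem {
    int data;
    int key;
};

/* Dummy item stored in a slot whose record was deleted.
   Assumption: real keys are never negative, so key -1 is free to use. */
static struct DataItem dummyItem = { -1, -1 };

/* NULL means "never used"; &dummyItem means "deleted". */
static struct DataItem *hashArray[SIZE];

int hashCode(int key)
{
    return key % SIZE;
}

/* Insert with linear probing. Deleted slots are reused, so deletion
   does not make positions unusable. Returns 0, or -1 if the table is full. */
int insertItem(struct DataItem *item)
{
    assert(item != NULL && item->key >= 0);
    int index = hashCode(item->key);
    for (int probes = 0; probes < SIZE; ++probes) {
        if (hashArray[index] == NULL || hashArray[index] == &dummyItem) {
            hashArray[index] = item;
            return 0;
        }
        index = (index + 1) % SIZE;
    }
    return -1;
}

/* Search stops at a never-used slot but continues past dummy items. */
struct DataItem *searchItem(int key)
{
    int index = hashCode(key);
    for (int probes = 0; probes < SIZE; ++probes) {
        if (hashArray[index] == NULL) {
            return NULL;
        }
        if (hashArray[index]->key == key) {
            return hashArray[index];
        }
        index = (index + 1) % SIZE;
    }
    return NULL;
}

/* Replace the found item with the dummy item and return the removed item. */
struct DataItem *deleteItem(int key)
{
    int index = hashCode(key);
    for (int probes = 0; probes < SIZE; ++probes) {
        if (hashArray[index] == NULL) {
            return NULL;
        }
        if (hashArray[index]->key == key) {
            struct DataItem *removed = hashArray[index];
            hashArray[index] = &dummyItem;
            return removed;
        }
        index = (index + 1) % SIZE;
    }
    return NULL;
}
```

The dummy item addresses both considerations listed under Deletion: a search does not stop at a deleted slot, and an insertion may overwrite it.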

Rehashing:
Rehashing means hashing again. When the load factor rises above its pre-defined threshold
(a common default, used for example by Java's HashMap, is 0.75), collisions become more
frequent and operations slow down. To overcome this, the size of the array is increased
(typically doubled) and all the values are hashed again and stored in the new, larger array
to maintain a low load factor and low complexity.

Why rehashing?
Rehashing is done because whenever key value pairs are inserted into the map, the load
factor increases, which implies that the time complexity also increases as explained above.
This might not give the required time complexity of O(1).
Hence, rehash must be done, increasing the size of the bucketArray so as to reduce the load
factor and the time complexity.

How is Rehashing done?


Rehashing can be done as follows:
• For each addition of a new entry to the map, check the load factor.
• If it is greater than its pre-defined value (or the default value of 0.75 if none is
given), then rehash.
• To rehash, make a new array of double the previous size and make it the new
bucketArray.
• Then traverse each element in the old bucketArray and call insert() for each, so as
to insert it into the new, larger bucketArray.
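
The procedure above might look as follows for a simple table of non-negative integer keys that resolves collisions by linear probing. This is a sketch under stated assumptions: struct Table, table_init, table_insert, the EMPTY marker and the 0.75 threshold are choices made for this example, and the load factor is checked against the count the table would have after the insertion.

```c
#include <assert.h>
#include <stdlib.h>

#define EMPTY (-1)              /* assumed marker: real keys are non-negative */
#define MAX_LOAD_FACTOR 0.75    /* assumed threshold */

struct Table {
    int *slots;
    size_t capacity;
    size_t count;
};

/* Allocate an empty table. Returns 0 on success, -1 on allocation failure. */
int table_init(struct Table *t, size_t capacity)
{
    assert(capacity > 0);
    t->slots = malloc(capacity * sizeof *t->slots);
    if (t->slots == NULL) {
        return -1;
    }
    for (size_t i = 0; i < capacity; ++i) {
        t->slots[i] = EMPTY;
    }
    t->capacity = capacity;
    t->count = 0;
    return 0;
}

/* Place key by linear probing; the caller guarantees a free slot exists. */
static void place(struct Table *t, int key)
{
    size_t i = (size_t)key % t->capacity;
    while (t->slots[i] != EMPTY) {
        i = (i + 1) % t->capacity;
    }
    t->slots[i] = key;
    t->count++;
}

/* Make an array of double the size and re-insert every old key. */
static int rehash(struct Table *t)
{
    struct Table bigger;
    if (table_init(&bigger, t->capacity * 2) != 0) {
        return -1;
    }
    for (size_t i = 0; i < t->capacity; ++i) {
        if (t->slots[i] != EMPTY) {
            place(&bigger, t->slots[i]);
        }
    }
    free(t->slots);
    *t = bigger;
    return 0;
}

/* Check the load factor on every insertion and rehash when it is exceeded. */
int table_insert(struct Table *t, int key)
{
    assert(key >= 0);
    double load = (double)(t->count + 1) / (double)t->capacity;
    if (load > MAX_LOAD_FACTOR) {
        if (rehash(t) != 0) {
            return -1;
        }
    }
    place(t, key);
    return 0;
}
```

Every key has to be hashed again because its index depends on the table size: a key that sat at key % 4 in the old array belongs at key % 8 in the new one.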

Properties of a Good hash function

A hash function that maps every item into its own unique slot is known as a perfect hash
function. We can construct a perfect hash function if we know the items and the collection
will never change, but there is no systematic way to construct a perfect hash function
given an arbitrary collection of items. Fortunately, we still gain performance efficiency
even if the hash function isn't perfect. We can also achieve a perfect hash function by
increasing the size of the hash table so that every possible value can be accommodated. As
a result, each item will have a unique slot. Although this approach is feasible for a small
number of items, it is not practical when the number of possibilities is large.
So we usually construct our own hash function instead, and there are a few things we must
be careful about while doing so.
A good hash function should have the following properties:
1. Efficiently computable.
2. Should uniformly distribute the keys (each table position is equally likely for each
key).
3. Should minimize collisions.
4. Should have a low load factor (number of items in the table divided by the size of the
table).

Problem with Hashing
If we consider the above example, the hash function we used is the sum of the letters. If
we examine this hash function closely, the problem is easy to see: for different strings,
the same hash value is being generated by the hash function.
For example: {“ab”, “ba”} both have the same hash value, and the strings {“cd”, “be”} also
generate the same hash value. This is known as a collision, and it creates problems in
searching, insertion, deletion, and updating of values.

What is collision?
The hashing process generates a small number for a big key, so there is a possibility that
two keys could produce the same value. The situation where a newly inserted key maps to an
already occupied slot is called a collision, and it must be handled using some collision
handling technique.
How to handle Collisions?
There are mainly two methods to handle collision:
1. Separate Chaining:
2. Open Addressing:

1) Separate Chaining
The idea is to make each cell of the hash table point to a linked list of records that have
the same hash function value. In other words, the table is implemented as an array of
linked lists, each of which is called a chain. Separate chaining is one of the most popular
and commonly used techniques for handling collisions. Chaining is simple but requires
additional memory outside the table.
Example: We are given a hash function and have to insert some elements into the hash table using
separate chaining as the collision resolution technique.
Hash function = key % 5,
Elements = 12, 15, 22, 25 and 37.
Let's see, step by step, how to solve the above problem:
• Step 1: First draw the empty hash table, which will have a possible range of hash values from 0 to 4
according to the hash function provided.
• Step 2: Now insert all the keys in the hash table one by one. The first key to be inserted is 12, which is
mapped to bucket number 2, calculated by using the hash function 12%5=2.
• Step 3: Now the next key is 22. It will map to bucket number 2 because 22%5=2. But bucket 2 is
already occupied by key 12, so the separate chaining method handles the collision by linking 22 into
the list at bucket 2.
• Step 4: The next key is 15. It will map to slot number 0 because 15%5=0.
• Step 5: Now the next key is 25. Its bucket number will be 25%5=0. But bucket 0 is already occupied
by key 15. So the separate chaining method will again handle the collision by linking 25 into the
list at bucket 0.
• Step 6: The last key is 37. It maps to bucket number 2 because 37%5=2, so it is linked into the list at
bucket 2 after 12 and 22.
In this way, the separate chaining method is used as the collision resolution technique.
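
The example above can be sketched in C with an array of linked lists. The names BUCKETS, struct Node, table, chain_insert and chain_contains are made up for this illustration, and appending new keys to the end of a chain is just one possible choice (inserting at the head is equally valid and faster).

```c
#include <assert.h>
#include <stdlib.h>

#define BUCKETS 5   /* hash function is key % 5 */

struct Node {
    int key;
    struct Node *next;
};

/* Each cell points to the chain of keys that hash to that bucket. */
static struct Node *table[BUCKETS];

/* Append key to the chain of its bucket.
   Returns 0 on success, -1 if memory could not be allocated. */
int chain_insert(int key)
{
    assert(key >= 0);
    struct Node *node = malloc(sizeof *node);
    if (node == NULL) {
        return -1;
    }
    node->key = key;
    node->next = NULL;

    /* Walk to the end of the chain and link the new node there. */
    struct Node **link = &table[key % BUCKETS];
    while (*link != NULL) {
        link = &(*link)->next;
    }
    *link = node;
    return 0;
}

/* Return 1 if key is present in its bucket's chain, otherwise 0. */
int chain_contains(int key)
{
    for (struct Node *n = table[key % BUCKETS]; n != NULL; n = n->next) {
        if (n->key == key) {
            return 1;
        }
    }
    return 0;
}
```

After inserting 12, 22, 15, 25 and 37 in that order, bucket 2 holds the chain 12, 22, 37 and bucket 0 holds the chain 15, 25. The nodes are allocated outside the table, which is the additional memory mentioned above.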

2) Open Addressing
In open addressing, all elements are stored in the hash table itself. Each table entry contains either a
record or NIL. When searching for an element, we examine the table slots one by one until the desired
element is found or it is clear that the element is not in the table.

2.a) Linear Probing

In linear probing, the hash table is searched sequentially, starting from the original location given by
the hash. If that location is already occupied, we check the next location.
Example: Let us consider a simple hash function “key mod 5” and a sequence of keys to be
inserted: 50, 70, 76, 85, 93.
• Step 1: First draw the empty hash table, which will have a possible range of hash values from 0 to 4
according to the hash function provided.
• Step 2: Now insert all the keys in the hash table one by one. The first key is 50. It will map to slot
number 0 because 50%5=0. So insert it into slot number 0.
• Step 3: The next key is 70. It will map to slot number 0 because 70%5=0, but 50 is already at slot
number 0, so search for the next empty slot and insert it there (slot number 1).
• Step 4: The next key is 76. It will map to slot number 1 because 76%5=1, but 70 is already at slot
number 1, so search for the next empty slot and insert it there (slot number 2).
• Step 5: The next key is 85. It will map to slot number 0 because 85%5=0, but slots 0, 1 and 2 are
already occupied, so insert it into the next empty slot, slot number 3.
• Step 6: The next key is 93. It will map to slot number 3 because 93%5=3, but 85 is already at slot
number 3, so insert it into the next empty slot, slot number 4.
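
A minimal sketch of linear probing insertion for the “key mod 5” example follows. The names TABLE_SIZE, EMPTY, slots and linear_insert are assumptions made for this illustration, and keys are assumed to be non-negative.

```c
#include <assert.h>

#define TABLE_SIZE 5
#define EMPTY (-1)   /* assumed marker: real keys are non-negative */

static int slots[TABLE_SIZE] = { EMPTY, EMPTY, EMPTY, EMPTY, EMPTY };

/* Insert key at its home slot, or at the next free slot after it,
   wrapping around at the end of the table.
   Returns the slot index used, or -1 if the table is full. */
int linear_insert(int key)
{
    assert(key >= 0);
    int home = key % TABLE_SIZE;
    for (int i = 0; i < TABLE_SIZE; ++i) {
        int index = (home + i) % TABLE_SIZE;
        if (slots[index] == EMPTY) {
            slots[index] = key;
            return index;
        }
    }
    return -1;
}
```

Inserting the keys in the order 50, 70, 76, 85, 93 places them in slots 0, 1, 2, 3 and 4 respectively, after which the table is full and any further insertion returns -1.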

2.b) Quadratic Probing

Quadratic probing is an open addressing scheme in computer programming for resolving hash collisions
in hash tables. Quadratic probing operates by taking the original hash index and adding successive
values of an arbitrary quadratic polynomial until an open slot is found.
Example: Let us consider table size = 7, hash function Hash(x) = x % 7 and collision resolution
strategy f(i) = i^2. Insert 22, 30, and 50.
• Step 1: Create a table of size 7.
• Step 2: Insert 22 and 30
• Hash(22) = 22 % 7 = 1. Since the cell at index 1 is empty, we can easily insert 22 at slot 1.
• Hash(30) = 30 % 7 = 2. Since the cell at index 2 is empty, we can easily insert 30 at slot 2.
• Step 3: Insert 50
• Hash(50) = 50 % 7 = 1
• In our hash table slot 1 is already occupied. So we will search for slot 1 + 1^2, i.e. 1 + 1 = 2.
• Again slot 2 is found occupied, so we will search for cell 1 + 2^2, i.e. 1 + 4 = 5.
• Now, cell 5 is not occupied so we will place 50 in slot 5.
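
The probing rule f(i) = i^2 can be sketched as below. The names TABLE_SIZE, EMPTY, slots and quadratic_insert are assumptions for this illustration, and giving up after TABLE_SIZE probes is a simplification chosen here.

```c
#include <assert.h>

#define TABLE_SIZE 7
#define EMPTY (-1)   /* assumed marker: real keys are non-negative */

static int slots[TABLE_SIZE] = { EMPTY, EMPTY, EMPTY, EMPTY, EMPTY, EMPTY, EMPTY };

/* Try slots (home + i*i) % TABLE_SIZE for i = 0, 1, 2, ...
   Returns the slot index used, or -1 if no free slot was reached. */
int quadratic_insert(int key)
{
    assert(key >= 0);
    int home = key % TABLE_SIZE;
    for (int i = 0; i < TABLE_SIZE; ++i) {
        int index = (home + i * i) % TABLE_SIZE;
        if (slots[index] == EMPTY) {
            slots[index] = key;
            return index;
        }
    }
    return -1;
}
```

Inserting 22, 30 and 50 places them in slots 1, 2 and 5. Unlike linear probing, this probe sequence does not visit every slot, so an insertion can fail even though the table still has free cells; with a prime table size, a free slot is guaranteed to be found only while the table is less than half full.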

2.c) Double Hashing

Double hashing is a collision resolution technique for open-addressed hash tables. Double hashing makes
use of two hash functions:
• The first hash function is h1(k), which takes the key and gives a location in the hash table. If
that location is empty, we can simply place our key there.
• But in case the location is occupied (collision), we use a secondary hash function h2(k) in
combination with the first hash function h1(k) to find a new location in the hash table.

This combination of hash functions is of the form
h(k, i) = (h1(k) + i * h2(k)) % n
where
• i is a non-negative integer that indicates the collision number,
• k = element/key which is being hashed,
• n = hash table size.
Complexity of the double hashing algorithm:
Time complexity: O(n) in the worst case
Example: Insert the keys 27, 43, 692, 72 into a hash table of size 7, where the first hash function is
h1(k) = k mod 7 and the second hash function is h2(k) = 1 + (k mod 5).
• Step 1: Insert 27
• 27 % 7 = 6, location 6 is empty so insert 27 into slot 6.
• Step 2: Insert 43
• 43 % 7 = 1, location 1 is empty so insert 43 into slot 1.
• Step 3: Insert 692
• 692 % 7 = 6, but location 6 is already occupied and this is a collision.
• So we need to resolve this collision using double hashing.
hnew = [h1(692) + i * h2(692)] % 7
= [6 + 1 * (1 + 692 % 5)] % 7
= 9 % 7
= 2
Now, as 2 is an empty slot, we can insert 692 into slot 2.
• Step 4: Insert 72
• 72 % 7 = 2, but location 2 is already occupied and this is a collision.
• So we need to resolve this collision using double hashing.
hnew = [h1(72) + i * h2(72)] % 7
= [2 + 1 * (1 + 72 % 5)] % 7
= 5 % 7
= 5
Now, as 5 is an empty slot, we can insert 72 into slot 5.
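
The worked example can be reproduced with the short sketch below, which uses h1(k) = k mod 7 and h2(k) = 1 + (k mod 5) from the example. The names TABLE_SIZE, EMPTY, slots and double_hash_insert are assumptions for this illustration.

```c
#include <assert.h>

#define TABLE_SIZE 7
#define EMPTY (-1)   /* assumed marker: real keys are non-negative */

static int slots[TABLE_SIZE] = { EMPTY, EMPTY, EMPTY, EMPTY, EMPTY, EMPTY, EMPTY };

/* First hash function: the home slot of the key. */
static int h1(int k)
{
    return k % TABLE_SIZE;
}

/* Second hash function: the step size, always between 1 and 5. */
static int h2(int k)
{
    return 1 + (k % 5);
}

/* Probe h(k, i) = (h1(k) + i * h2(k)) % TABLE_SIZE for i = 0, 1, 2, ...
   Returns the slot index used, or -1 if the table is full. */
int double_hash_insert(int key)
{
    assert(key >= 0);
    for (int i = 0; i < TABLE_SIZE; ++i) {
        int index = (h1(key) + i * h2(key)) % TABLE_SIZE;
        if (slots[index] == EMPTY) {
            slots[index] = key;
            return index;
        }
    }
    return -1;
}
```

Inserting 27, 43, 692 and 72 returns slots 6, 1, 2 and 5, as in the steps above. Because the table size 7 is prime and the step h2(k) is never 0 or a multiple of 7, the probe sequence visits every slot before repeating.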

What is meant by Load Factor in Hashing?


The load factor of a hash table is defined as the number of items the hash table
contains divided by the size of the hash table. The load factor is the decisive parameter
used when deciding whether to rehash the table as more elements are added to it.
It indicates how full the table is and therefore how likely collisions are, which helps us
judge how efficiently the hash table is operating.
Load Factor = Total elements in hash table / Size of hash table
Applications of Hash Data structure
• Hash is used in databases for indexing.
• Hash is used in disk-based data structures.
• In some programming languages, like Python and JavaScript, hash is used to implement
objects.

Real-Time Applications of Hash Data structure

• Hash is used for cache mapping for fast access to the data.
• Hash can be used for password verification.
• Hash is used in cryptography as a message digest.

Advantages of Hash Data structure
• Hash provides better synchronization than other data structures.
• Hash tables are, on average, more efficient for lookups than search trees or other
lookup structures.
• Hash provides constant time for searching, insertion, and deletion operations on average.

Disadvantages of Hash Data structure

• Hash is inefficient when there are many collisions.
• Hash collisions practically cannot be avoided for a large set of possible keys.
• Some hash table implementations (for example, Java's Hashtable) do not allow null keys
or values.

Coalesced hashing
Coalesced hashing is a collision resolution technique for data of a fixed size. It is a
combination of both separate chaining and open addressing. It uses the concept of open
addressing (linear probing) to find the first empty place for a colliding element, searching
from the bottom of the hash table, and the concept of separate chaining to link the colliding
elements to each other through pointers.
The hash function used is h = (key) % (total number of keys). Inside the hash table, each node
has three fields:
• h(key): The value of the hash function for a key.
• Data: The key itself.
• Next: The link to the next colliding element.
The basic operations of coalesced hashing are:
1. INSERT(key): Inserts the key at the position given by its hash value if that position in
the table is empty. Otherwise the key is inserted in the first empty place from the bottom
of the hash table, and the address of this place is stored in the NEXT field of the last
node of the chain.
2. DELETE(key): The key, if present, is deleted. If the node to be deleted contains the
address of another node in the hash table, then this address is stored in the NEXT field of
the node pointing to the node which is to be deleted.
3. SEARCH(key): Returns True if the key is present, otherwise returns False.
The best case complexity of all these operations is O(1) and the worst case complexity is
O(n), where n is the total number of keys. It can use less memory than separate chaining
because it stores the colliding elements inside the hash table itself instead of creating
separate linked lists.
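
A minimal sketch of INSERT and SEARCH for coalesced hashing is given below. It makes several assumptions that are not in the text: a table of 10 slots, a hash function of key % 10, keys that are distinct and non-negative, and the names struct Slot, table, coalesced_insert and coalesced_search. The h(key) field is left out because the slot index already carries that information, and DELETE is omitted for brevity.

```c
#include <assert.h>

#define TABLE_SIZE 10   /* assumed table size */
#define NIL (-1)        /* "no next element" */

struct Slot {
    int used;   /* 0 while the slot is empty */
    int key;    /* Data: the key itself */
    int next;   /* Next: index of the next colliding element, or NIL */
};

/* Static storage starts zeroed, so every slot begins with used == 0. */
static struct Slot table[TABLE_SIZE];

/* Returns the slot index used, or -1 if the table is full. */
int coalesced_insert(int key)
{
    assert(key >= 0);
    int home = key % TABLE_SIZE;
    if (!table[home].used) {
        table[home] = (struct Slot){ 1, key, NIL };
        return home;
    }

    /* Collision: find the first empty place from the bottom of the table. */
    int free_slot = TABLE_SIZE - 1;
    while (free_slot >= 0 && table[free_slot].used) {
        --free_slot;
    }
    if (free_slot < 0) {
        return -1;
    }

    /* Follow the chain from the home slot to its last node. */
    int tail = home;
    while (table[tail].next != NIL) {
        tail = table[tail].next;
    }

    table[free_slot] = (struct Slot){ 1, key, NIL };
    table[tail].next = free_slot;
    return free_slot;
}

/* Returns 1 if key is present, otherwise 0. */
int coalesced_search(int key)
{
    int i = key % TABLE_SIZE;
    if (!table[i].used) {
        return 0;
    }
    while (i != NIL) {
        if (table[i].key == key) {
            return 1;
        }
        i = table[i].next;
    }
    return 0;
}
```

For example, inserting 20, 35, 16, 40, 45 and 29 in that order puts 20, 35 and 16 in their home slots 0, 5 and 6, sends 40 to slot 9 and 45 to slot 8, and then places 29 (whose home slot 9 is taken by 40) in slot 7. The chain starting at slot 0 now runs 0, 9, 7 and mixes keys with different hash values, which is the coalescing the technique is named after.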

What is Dynamic Hashing?


It is a hashing technique that enables users to look up a dynamic data set. That is, the data
set is modified on demand by adding data to it or removing data from it, hence the name
'Dynamic' hashing. As a result, the set of data buckets keeps growing or shrinking
depending on the number of records.
In this hashing technique, the resulting number of data buckets in memory is ever-changing.
Operations Provided by Dynamic Hashing
Dynamic hashing provides the following operations:
• Delete: Locate the desired location and support deleting data (or a chunk of data) at
that location.
• Insertion: Support inserting new data into the data bucket if there is space available
in the data bucket.
• Query: Perform querying to compute the bucket address.
• Update: Perform a query to update the data.

Advantages of Dynamic Hashing
Dynamic hashing is advantageous in the following ways:
• It works well with scalable data.
• It can handle addressing a large amount of memory in which the data size is always
changing.
• Bucket overflow comes rarely or very late.
Disadvantages of Dynamic Hashing
Dynamic hashing comes with the following disadvantage:
• The location of the data in memory keeps changing according to the bucket size. Hence,
if there is a phenomenal increase in data, maintaining the bucket address table
becomes a challenge.

Differences between Static and Dynamic Hashing

Static Hashing                                 | Dynamic Hashing
-----------------------------------------------|-------------------------------------------------
Fixed-size, non-changing data.                 | Variable-size, changing data.
The resulting data bucket is of fixed length.  | The resulting data bucket is of variable length.
Bucket overflow can arise often, depending     | Bucket overflow can occur very late or
upon memory size.                              | doesn't occur at all.
Simple                                         | Complex

Extendible Hashing is a dynamic hashing method wherein directories and buckets are
used to hash data. It is a highly flexible method in which the hash function also
changes dynamically.
Main features of Extendible Hashing: The main features of this hashing technique are:
• Directories: The directories store the addresses of the buckets in pointers. An id is assigned
to each directory, which may change each time Directory Expansion takes place.
• Buckets: The buckets are used to store the actual hashed data.
