
MODULE-4

DICTIONARIES AND HASH TABLES


Dictionaries: A dictionary is an unordered collection of data items that stores
data in the form of key-value pairs.
• Every element in a dictionary must have a key, and some value is
associated with that key.
• A dictionary is also called a hash, a map, or a hashmap in different
programming languages.
• Keys in a dictionary must be simple types (such as integers or strings),
while the values can be of any type.
• Keys in a dictionary must be unique.

A dictionary can be represented in C in two ways:

1. Sorted Array - an array data structure is used to implement the dictionary.
2. Sorted Chain - a linked list data structure is used to implement the dictionary.

Linear List Representation of Dictionaries:
When a linked list is used to represent a dictionary, each node contains a key, a value
and the address of the next node.
struct Node {
    int key;
    int value;
    struct Node* next;
} *head = NULL, *temp;

Operations:
• Insertion
• Deletion
• Searching
Insertion:
• Insertion in a dictionary is similar to insertion in a linked list. We create a
new node and store its address in the previous node.
• While inserting a value into the dictionary, we first search the dictionary for the
given key:
o If we find the key, we overwrite the value stored in that node.
o If we do not find the key, we create a new node at the correct
position (keys are kept in ascending order) and store the key and
value in the new node.
Example: Insert (2,90),(11,86),(4,65) into a dictionary
Step 1:

Step 2:

Step 3:
void insert(int key, int value)
{
    temp = head;
    while (temp != NULL)            // Traverse to find the key
    {
        if (temp->key == key)       // Key found: overwrite the value
        {
            temp->value = value;
            return;
        }
        temp = temp->next;
    }

    // Key not found, create a new node
    struct Node* newNode = (struct Node*)malloc(sizeof(struct Node));
    newNode->key = key;
    newNode->value = value;
    newNode->next = NULL;

    // Link the node so that keys stay in ascending order
    if (head == NULL || key < head->key)
    {
        newNode->next = head;
        head = newNode;
    }
    else
    {
        temp = head;
        while (temp->next != NULL && temp->next->key < key)
            temp = temp->next;
        newNode->next = temp->next;
        temp->next = newNode;
    }
}
Deletion: Deletion in a dictionary is similar to deletion in a linked list. We delete
a node based on its key.
• If we want to delete a middle node: prev->next = curr->next
• If we delete the first node, we need to assign the head pointer to the
next node.
Example: Delete 4

void deleteKey(int key)
{
    temp = head;
    struct Node* prev = NULL;

    while (temp != NULL)
    {
        if (temp->key == key)
        {
            if (prev == NULL)          // Deleting the first node
                head = temp->next;
            else                       // Deleting a middle or last node
                prev->next = temp->next;

            free(temp);
            return;
        }
        prev = temp;
        temp = temp->next;
    }
}
Searching: To search for a value,
• we traverse the dictionary looking for the matching key;
• if we find the key, we return the corresponding value.
int search(int key)
{
    temp = head;
    while (temp != NULL)
    {
        if (temp->key == key)    // Check if the key matches
        {
            return temp->value;
        }
        temp = temp->next;
    }
    return -1;                   // Key not found
}
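
A minimal driver (a sketch, assuming the struct and the insert, deleteKey and
search functions above are in the same file, with <stdio.h> and <stdlib.h>
included at the top) shows the three operations working together:

int main(void)
{
    insert(2, 90);
    insert(11, 86);
    insert(4, 65);

    printf("key 11 -> %d\n", search(11));   // prints 86
    deleteKey(4);
    printf("key 4  -> %d\n", search(4));    // prints -1, since the key was removed

    return 0;
}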

HASH TABLE REPRESENTATION: Hash tables are an implementation of the
dictionary abstract data type, used for storing key-value pairs. A hash table is a
data structure that maps keys to values using a special function called a hash
function. It stores the data in an associative manner in an array, where each
data value has its own unique index.

Hashing: Hashing is the technique or process of mapping keys and values into
the hash table by using a hash function. It is done for faster access to
elements. The efficiency of the mapping depends on the efficiency of the hash
function used.

Hash Function: A hash function converts an input (such as a string or
integer) into a fixed-size output (referred to as a hash code or hash value).
The data is then stored and retrieved using this hash value as an index in an
array or hash table.

Types of Hash Function:
• Division Method (k mod M)
• Folding Method

Division Method: This is the simplest method of generating a hash value. The
hash function divides the value k by M and uses the remainder obtained.

Formula:
h(k) = k mod M
Here,
k is the key value, and
M is the size of the hash table.

M is best chosen to be a prime number, as that helps distribute the keys more
uniformly. The hash value depends only on the remainder of the division.
Example:
k = 12345, M = 95
h(12345) = 12345 mod 95 = 90

k = 1276, M = 11
h(1276) = 1276 mod 11 = 0
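
In C, the division method is a single modulo operation; a minimal sketch
(hashDivision is an illustrative name, not from the original):

/* Division method: the hash value is the remainder of key divided by M,
   where M is the (preferably prime) table size. */
unsigned int hashDivision(unsigned int key, unsigned int M)
{
    return key % M;
}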

Digit Folding Method: This method involves two steps:

1. Divide the key k into a number of parts k1, k2, k3, ..., kn,
where each part has the same number of digits except for the last
part, which may have fewer digits than the others.
2. Add the individual parts. The hash value is obtained by ignoring the
last carry, if any.
Formula:
k = k1, k2, k3, ..., kn
s = k1 + k2 + k3 + ... + kn
h(k) = s
Here,
s is obtained by adding the parts of the key k.
Example:
k = 12345
k1 = 12, k2 = 34, k3 = 5
s = k1 + k2 + k3
  = 12 + 34 + 5
  = 51
h(k) = 51
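
A sketch of the folding method for non-negative integer keys, splitting the
decimal digits into two-digit parts from the left so that it reproduces the
worked example (hashFolding is an illustrative name; requires <stdio.h> for
sprintf):

/* Digit folding: split the key's decimal digits into parts of two digits
   (the last part may have one digit) and add the parts.
   For k = 12345 the parts are 12, 34, 5 and the hash value is 51. */
unsigned int hashFolding(unsigned int key)
{
    char digits[16];
    int len = sprintf(digits, "%u", key);        /* decimal digits of the key */
    unsigned int sum = 0;

    for (int i = 0; i < len; i += 2)
    {
        unsigned int part = digits[i] - '0';
        if (i + 1 < len)                         /* second digit of the pair, if any */
            part = part * 10 + (digits[i + 1] - '0');
        sum += part;
    }
    return sum;
}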

Problem with Hashing (Collision): The hashing process generates a small
number for a big key, so there is a possibility that two keys could produce the
same hash value. The situation where a newly inserted key maps to an already
occupied slot is called a collision, and it must be handled using some collision
handling technique.

Collision resolution techniques:
• Separate Chaining
• Open Addressing
o Linear Probing
o Quadratic Probing
o Double Hashing

Separate Chaining: Separate chaining is a collision resolution technique that
is implemented using linked lists. When two or more elements hash to the
same location, these elements are stored in a singly linked list, forming a
chain. Since this method uses extra memory to resolve collisions, it is also
known as open hashing.
Example: Let us consider a simple hash function as “key mod 7” and a
sequence of keys as 50, 700, 76, 85, 92, 73, 101

Open Addressing: In open addressing, all elements are stored in the hash table
itself. So at any point, the size of the table must be greater than or equal to the
total number of keys (note that we can increase the table size by copying the old
data if needed). This approach is also known as closed hashing.

1. Linear Probing: In linear probing, the hash table is searched sequentially,
starting from the original hash location. If the location we get is already
occupied, we check the next location.
Formula:
index = ((key % TABLE_SIZE) + i) % TABLE_SIZE
Example: Insert the following sequence of keys in the hash table
{9, 7, 11, 13, 12, 8}
Use linear probing technique for collision resolution
h(k, i) = [h(k) + i] mod m
h(k) = 2k + 5
m=10
Solution:
Step 01:
First Draw an empty hash table of Size 10.
The possible range of hash values will be [0, 9].
Step 02:
Insert the given keys one by one in the hash table.
First Key to be inserted in the hash table = 9.
h(k) = 2k + 5
h(9) = 2*9 + 5 = 23
h(k, i) = [h(k) + i] mod m
h(9, 0) = [23 + 0] mod 10 = 3
So, key 9 will be inserted at index 3 of the hash table

Step 03:

Next Key to be inserted in the hash table = 7.


h(k) = 2k + 5
h(7) = 2*7 + 5 = 19
h(k, i) = [h(k) + i] mod m
h(7, 0) = [19 + 0] mod 10 = 9
So, key 7 will be inserted at index 9 of the hash table
Step 04:
Next Key to be inserted in the hash table = 11.
h(k) = 2k + 5
h(11) = 2*11 + 5 = 27
h(k, i) = [h(k) + i] mod m
h(11, 0) = [27 + 0] mod 10 = 7
So, key 11 will be inserted at index 7 of the hash table

Step 05:
Next Key to be inserted in the hash table = 13.
h(k) = 2k + 5
h(13) = 2*13 + 5 = 31
h(k, i) = [h(k) + i] mod m
h(13, 0) = [31 + 0] mod 10 = 1
So, key 13 will be inserted at index 1 of the hash table
Step 06:
Next key to be inserted in the hash table = 12.
h(k) = 2k + 5
h(12) = 2*12 + 5 = 29
h(k, i) = [h(k) + i] mod m
h(12, 0) = [29 + 0] mod 10 = 9
Here a collision occurs because index 9 is already filled (key 7).
Now we increase i by 1.
h(12, 1) = [29 + 1] mod 10 = 0
So, key 12 will be inserted at index 0 of the hash table.

Step 07:
Next key to be inserted in the hash table = 8.
h(k) = 2k + 5
h(8) = 2*8 + 5 = 21
h(k, i) = [h(k) + i] mod m
h(8, 0) = [21 + 0] mod 10 = 1
Here a collision occurs because index 1 is already filled (key 13).
Now we increase i by 1, so i becomes 1.
h(8, 1) = [21 + 1] mod 10 = 2
Index 2 is vacant, so 8 will be inserted at index 2.
This is how the linear probing collision resolution technique works.
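
The probe positions in this worked example can be reproduced with a short
helper (a sketch; exampleProbe is an illustrative name, using h(k) = 2k + 5 and
m = 10 as above):

// h(k, i) = (h(k) + i) mod m, with h(k) = 2k + 5 and m = 10 as in the example.
int exampleProbe(int key, int i)
{
    return ((2 * key + 5) + i) % 10;
}
// exampleProbe(12, 0) == 9 (occupied), exampleProbe(12, 1) == 0 (free),
// so key 12 lands at index 0.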

Operations:
• Insertion
• Deletion
• Searching

Insertion: Consider an underlying array that is already populated with a few
elements. If we would like to add another element with key k4 that hashes to
the same location as an existing element, we probe forward until we find an
empty slot:
void insert(int key)
{
    int index = hashFunction(key);
    // -1 represents an empty slot. If the slot is not empty,
    // we perform linear probing (move to the next slot).
    while (hashTable[index] != -1)
    {
        index = (index + 1) % SIZE;
    }
    hashTable[index] = key;                 // Insert the key
    printf("Inserted %d at index %d\n", key, index);
}
Searching: Linear probing works as its name suggests: a probe linearly
traverses the underlying array. If it finds the element it is looking for along
the way, it returns it to the user.

However, if it finds an empty bucket during its search, it stops at that point
and signals to the user that it has found nothing.
int search(int key)
{
    int index = hashFunction(key);
    while (hashTable[index] != -1)
    {
        if (hashTable[index] == key)    // Check if the key matches
        {
            return index;               // Key found, return the index
        }
        index = (index + 1) % SIZE;     // Linear probing: move to the next slot
    }
    return -1;                          // Key not found
}
Deletion: Deleting an item while using linear probing is a bit tricky. Suppose
we naively remove an item by simply marking its slot as empty. If we then
search for an element that hashes to the same location as the deleted one but
was placed further to the right in the table, the search fails, because the
linear probe stops as soon as it encounters an empty slot.

To avoid such unpredictable behaviour, at the location of the deleted element
we place what is called a tombstone.

If the linear probe encounters a tombstone while searching the array, it
ignores it and continues its search for the element we want. With this idea, a
search for an element placed after the deleted slot still succeeds, because the
probe passes over the tombstone instead of stopping.

The deletion routine therefore marks the slot with a tombstone value
(here -2) rather than marking it empty:
#define TOMBSTONE -2   // deleted slot: the probe in search() passes over it

void delete(int key)
{
    int index = search(key);
    if (index != -1)
    {
        hashTable[index] = TOMBSTONE;   // Mark the slot with a tombstone, not as empty
        printf("Deleted key %d at index %d\n", key, index);
    }
    else
    {
        printf("Key %d not found in the hash table\n", key);
    }
}
Display Function:
void display()
{
    printf("Hash Table:\n");
    for (int i = 0; i < SIZE; i++)
    {
        printf("[%d] -> ", i);
        if (hashTable[i] == -1)
            printf("Empty");
        else if (hashTable[i] == TOMBSTONE)
            printf("Deleted");
        else
            printf("%d", hashTable[i]);
        printf("\n");
    }
}
Disadvantage of Linear Probing: Clustering

Clustering is a phenomenon that occurs as elements are added to a hash table.
Elements tend to clump together, forming clusters, which over time
significantly degrades the performance of searching and adding elements,
approaching a worst-case O(n) time complexity.

2. Quadratic Probing: Quadratic probing is an open-addressing scheme in which
we look for the i²-th slot in the i-th iteration if the given hash value x collides in
the hash table.

Procedure: Let hash(x) be the slot index computed using the hash function.

• If the slot hash(x) % S is full, then we try (hash(x) + 1*1) % S.
• If (hash(x) + 1*1) % S is also full, then we try (hash(x) + 2*2) % S.
• If (hash(x) + 2*2) % S is also full, then we try (hash(x) + 3*3) % S.
• This process is repeated for all the values of i until an empty slot is
found.
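
A sketch of the quadratic probe loop, reusing hashTable, SIZE and the empty
marker -1 from the linear probing code above (quadraticInsert is an
illustrative name):

/* Quadratic probing: on the i-th attempt, try (hash(x) + i*i) % SIZE. */
void quadraticInsert(int key)
{
    int base = key % SIZE;                  /* hash(x) */
    for (int i = 0; i < SIZE; i++)
    {
        int index = (base + i * i) % SIZE;
        if (hashTable[index] == -1)         /* empty slot found */
        {
            hashTable[index] = key;
            return;
        }
    }
    /* No empty slot found within SIZE probes: the table needs rehashing. */
}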

Disadvantage of Quadratic Probing: Secondary clustering. It occurs when many
elements being added to the hash table hash to the same initial slot and
therefore follow the same probe sequence.

Example: Let us consider a simple hash function as “key mod 7” and sequence
of keys as 50, 700, 76, 85, 92, 73, 101
3. Double Hashing: Double hashing is a technique in which two hash functions are
used when a collision occurs. The first hash function is the simple division
method, but a good second hash function must follow these rules:

1. It must never evaluate to zero.

2. It must ensure that all cells can be probed.

The hash functions for this technique are:

H1(key) = key % TABLE_SIZE

H2(key) = P - (key % P)

where P is a prime number smaller than the size of the hash table.

Double hashing is done using:

index = (H1(key) + i * H2(key)) % TABLE_SIZE
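
A sketch of the probe computation, assuming TABLE_SIZE = 10 and P = 7 as in the
example that follows (doubleHashInsert is an illustrative name):

#define TABLE_SIZE 10
#define P 7                     /* prime smaller than the table size */

int H1(int key) { return key % TABLE_SIZE; }
int H2(int key) { return P - (key % P); }   /* always between 1 and P, never zero */

/* Probe (H1 + i*H2) mod TABLE_SIZE until an empty slot (-1) is found. */
void doubleHashInsert(int table[], int key)
{
    for (int i = 0; i < TABLE_SIZE; i++)
    {
        int index = (H1(key) + i * H2(key)) % TABLE_SIZE;
        if (table[index] == -1)
        {
            table[index] = key;   /* e.g. key 17: H1 = 7 collides, H2 = 4, lands at index 1 */
            return;
        }
    }
    /* Table full: rehashing would be required. */
}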

Example: Insert 67, 90, 55, 17, 49 (TABLE_SIZE = 10, P = 7)

67 % 10 = 7  -> index 7
90 % 10 = 0  -> index 0
55 % 10 = 5  -> index 5
17 % 10 = 7  -> collision, so apply the second hash function:
    H2(17) = 7 - (17 % 7) = 7 - 3 = 4
    Final index = (7 + 1*4) % 10 = 11 % 10 = 1
49 % 10 = 9  -> index 9

Resulting hash table:
0: 90
1: 17
2:
3:
4:
5: 55
6:
7: 67
8:
9: 49
Example: Insert 76,93,40,47,10,55

Differences between Separate Chaining and Open Addressing:

Rehashing: Rehashing is the process of increasing the size of a hashmap and
redistributing the elements to new buckets based on their new hash values.
There are situations in which rehashing is required:
• When the table is completely full.
• With quadratic probing, when the table is more than half full.
• When insertions fail due to overflow.
In such situations, we transfer the entries from the old table to a new table
by recomputing their positions using the hash function. Rehashing is done to
improve the performance of the hashmap and to prevent collisions caused by a
high load factor.
Load Factor: The ratio of the number of elements to the number of buckets,
i.e. load factor = n / m, where n is the number of stored entries and m is the
number of buckets.

Procedure:

• For each addition of a new entry to the map, check the load factor.
• If it is greater than its pre-defined value (or the default value of 0.75 if
none is given), then rehash.
• To rehash, make a new array of double the previous size and make it
the new bucket array.
• Then traverse each element in the old bucket array and call
insert() on each so as to insert it into the new, larger bucket array.
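
A sketch of the load-factor check and rehash for an open-addressed table of
integer keys (names such as maybeRehash, bucketArray, capacity and count are
illustrative assumptions, and the caller is assumed to have initialized them):

#include <stdlib.h>

#define LOAD_FACTOR_LIMIT 0.75

int *bucketArray;     /* open-addressed table, -1 marks an empty slot */
int capacity;         /* current number of buckets */
int count;            /* number of stored entries */

/* Re-insert a key into the (already enlarged) table using linear probing. */
static void reinsert(int key)
{
    int index = key % capacity;
    while (bucketArray[index] != -1)
        index = (index + 1) % capacity;
    bucketArray[index] = key;
}

/* After every insertion, rehash if count / capacity exceeds the limit. */
void maybeRehash(void)
{
    if ((double)count / capacity <= LOAD_FACTOR_LIMIT)
        return;

    int oldCapacity = capacity;
    int *oldArray = bucketArray;

    capacity = 2 * oldCapacity;                  /* double the number of buckets */
    bucketArray = malloc(capacity * sizeof(int));
    for (int i = 0; i < capacity; i++)
        bucketArray[i] = -1;

    for (int i = 0; i < oldCapacity; i++)        /* recompute every old entry's position */
        if (oldArray[i] != -1)
            reinsert(oldArray[i]);

    free(oldArray);
}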

Extendible Hashing: Extendible hashing is a dynamic hashing method in which
directories and buckets are used to hash data. It is an aggressively flexible
method in which the hash function also experiences dynamic changes.
• Directories: The directories store addresses of the buckets in pointers.
An id is assigned to each directory entry, which may change each time
directory expansion takes place.
• Buckets: The buckets hold the actual hashed data.
• Global Depth: It is associated with the directories. It denotes the
number of bits which are used by the hash function to categorize the
keys.
Global Depth = number of bits in the directory id.
• Local Depth: It is the same as the global depth except that local depth
is associated with the buckets and not the directories. The local depth,
in accordance with the global depth, is used to decide the action to be
performed when an overflow occurs.
Local depth is always less than or equal to the global depth.
Procedure:
Step 1 – Analyze the data elements: Data elements may exist in various forms,
e.g. integer, string, float, etc. Here, let us consider data elements of
type integer, e.g. 49.
Step 2 – Convert into binary format: Convert the data element into binary
form. For string elements, consider the ASCII value of the starting
character and then convert that integer into binary form. Since we
have 49 as our data element, its binary form is 110001.
Step 3 – Check the global depth of the directory: Suppose the global depth of
the hash directory is 3.
Step 4 – Identify the directory: Consider the 'global depth' number of
LSBs in the binary number and match them to the directory id.
E.g. the binary obtained is 110001 and the global depth is 3, so the hash
function returns the 3 LSBs of 110001, i.e. 001.
Step 5 – Navigation: Navigate to the bucket pointed to by the directory
entry with directory id 001.
Step 6 – Insertion and overflow check: Insert the element and check whether the
bucket overflows. If an overflow is encountered, go to Step 7 followed
by Step 8; otherwise, go to Step 9.
Step 7 – Tackling the overflow condition during data insertion: When a bucket
overflows, first check whether the local depth is equal to or less than the
global depth, then choose one of the cases below.
Step 8 – Rehashing of split bucket elements: The elements present in the
overflowing bucket that is split are rehashed w.r.t. the new global depth of
the directory.
Step 9 – The element is successfully hashed.
Overflow Handling:
Case 1: If local depth = global depth, then directory expansion as well as a
bucket split needs to be performed.
Case 2: If local depth < global depth, then only a bucket split takes place.
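
The directory lookup in Step 4 is just a mask of the global-depth LSBs; a
minimal sketch (directoryIndex is an illustrative name):

/* Return the directory id for a key: its globalDepth least significant bits.
   E.g. key 49 (binary 110001) with globalDepth 3 gives 001 = 1. */
unsigned int directoryIndex(unsigned int key, unsigned int globalDepth)
{
    return key & ((1u << globalDepth) - 1);   /* mask off the low globalDepth bits */
}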
Example: Now, let us consider an example of hashing the following
elements: 16, 4, 6, 22, 24, 10, 31, 7, 9, 20, 26.
Bucket Size: 3
Hash Function: Suppose the global depth is X. Then the hash function returns
the X LSBs.
Binary forms of each of the given numbers:
16 - 10000     31 - 11111
4 - 00100      7 - 00111
6 - 00110      9 - 01001
22 - 10110     20 - 10100
24 - 11000     26 - 11010
10 - 01010
Initially, the global depth and local depth are always 1. Thus, the hashing frame
looks like this:
Inserting 16:
The binary format of 16 is 10000 and global-depth is 1. The hash function
returns 1 LSB of 10000 which is 0. Hence, 16 is mapped to the directory with
id=0.

Inserting 4 and 6:
Both 4 (100) and 6 (110) have 0 as their LSB. Hence, they are hashed as follows:

Inserting 22: The binary form of 22 is 10110. Its LSB is 0. The bucket pointed to
by directory 0 is already full. Hence, overflow occurs.
• Apply Step 7, Case 1.
• Since local depth = global depth, the bucket splits and directory
expansion takes place.
• Also, rehashing of the numbers present in the overflowing bucket
takes place after the split.

• Since the global depth is incremented by 1, the global depth
is now 2.
• Hence, 16, 4, 6, 22 are now rehashed w.r.t. their 2 LSBs
[ 16 (10000), 4 (100), 6 (110), 22 (10110) ].
Notice that the bucket which did not overflow has remained untouched, but 01
and 11 now point to the same bucket.
Inserting 24 and 10: 24 (11000) and 10 (1010) are hashed to the directories
with ids 00 and 10. Here, we encounter no overflow condition.

Inserting 31, 7, 9: All of these elements [ 31 (11111), 7 (111), 9 (1001) ] have
either 01 or 11 as their 2 LSBs. Hence, they are mapped to the bucket
pointed to by 01 and 11. We do not encounter any overflow condition
here.
Inserting 20: Insertion of data element 20 (10100) will again cause the
overflow problem.

20 is inserted in the bucket pointed to by 00. As directed by Step 7, Case 1,
since the local depth of the bucket = global depth, directory expansion
(doubling) takes place along with bucket splitting. Elements present in the
overflowing bucket are rehashed with the new global depth. Now, the new
hash table looks like this:

Inserting 26: The global depth is 3. Hence, the 3 LSBs of 26 (11010) are
considered. Therefore, 26 best fits in the bucket pointed to by directory 010.
The bucket overflows, and, as directed by Step 7, Case 2, since the local
depth of the bucket < global depth (2 < 3), the directories are not doubled;
only the bucket is split and its elements are rehashed.
Finally, the output of hashing the given list of numbers is obtained.
