
Advanced Data Structures and Algorithms

UNIT-I: DICTIONARIES & HASHING


Dictionaries: Definition, Dictionary Abstract Data Type, Implementation of
Dictionaries; Hashing: Review of Hashing, Hash Function, Collision Resolution
Techniques in Hashing, Separate Chaining, Open Addressing, Linear Probing,
Quadratic Probing, Double Hashing, Rehashing, Extendible Hashing.

DICTIONARIES

Definition: A Dictionary is an ordered or unordered list of key-element pairs,
in which keys are used to locate elements in the list.

It allows efficient retrieval, insertion, and deletion of values based on
their associated keys. Keys are always unique within a dictionary; the values
may or may not be unique, and values of heterogeneous types can be stored.
The size of a dictionary can change dynamically as elements are inserted and
deleted.

In data structures, a dictionary is referred to by various names such as map,
symbol table, and associative array.

For example: Consider a data structure that stores bank account data,
where the account number serves as a key to identify the account details.

Abstract Data Type (ADT): It's a theoretical concept used in computer science
to define a data type purely by its behavior (operations and their properties)
rather than its implementation.
In an ADT, you specify:
1. Data: The type of data that the ADT can hold.
2. Operations: The operations that can be performed on that data (like
adding, removing, or accessing elements).
3. Properties: The expected behavior of these operations, such as their time
complexity.

M.Tech-CSE & CSE(AI&ML) SREC, Nandyal Page 1


Advanced Data Structures and Algorithms

Dictionary ADT:

1. Data: A set of unique <key, value> (i.e., <k, v>) pairs.


2. Operations:

1. Insert (key, value): Inserts the <key, value> pair into the dictionary.
If the key is already in use, the existing <key, value> pair is overwritten.

2. Delete (key): Removes the <key, value> pair associated with the
specified key.

3. Search (key): Retrieves the value associated with the specified key.

4. Update (key, value): Modifies the value for the specified key.

5. Keys ( ): Returns a collection of all keys in the dictionary.

6. Values ( ): Returns a collection of all values in the dictionary.

7. Items ( ): Returns a collection of all key-value pairs.

3. Properties:

A Dictionary ADT can be implemented using various data structures, including:

 Hash Table: Offers average-case O(1) time complexity for insertions,
deletions, and lookups.

 Binary Search Tree (BST): Provides O(log n) time complexity for balanced
trees (such as AVL or Red-Black trees).

 List or Array: Simple but less efficient, especially for large datasets
(O(n) for search).
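
As an illustration, here is a minimal sketch of the Dictionary ADT in Python,
using the built-in dict (itself hash-table based) as the backing store; the
class and method names simply mirror the operations listed above and are not
a standard API.

class Dictionary:
    def __init__(self):
        self._table = {}                 # built-in dict as backing store

    def insert(self, key, value):        # overwrites if key already present
        self._table[key] = value

    def delete(self, key):
        del self._table[key]

    def search(self, key):
        return self._table.get(key)      # None if the key is absent

    def update(self, key, value):
        self._table[key] = value

    def keys(self):
        return list(self._table.keys())

    def values(self):
        return list(self._table.values())

    def items(self):
        return list(self._table.items())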


Applications of Dictionaries:

Dictionaries are widely used in various applications, such as:

 Database indexing

 Caching mechanisms

 Implementing associative arrays

 Language processing (e.g., symbol tables in compilers)

HASHING
HASH TABLE:

A hash table is a data structure used for storing and retrieving data very
quickly. Insertion of data into the hash table is based on the key value;
hence every entry in the hash table is associated with some key. Using the
hash key, the required piece of data can be searched in the hash table with
only a few key comparisons. The searching time then depends upon the size of
the hash table.

A dictionary can be represented effectively using a hash table. We place the
dictionary entries in the hash table using a hash function, which is a
function used to put data into the hash table; the same hash function is then
used to retrieve the data. Thus the hash function is used to implement the
hash table. The integer returned by the hash function is called the hash key.

For example: Consider that we want to place some employee records in the hash
table. Each record is placed with the help of a key: the employee ID, which
is a 7-digit number. To place a record, the 7-digit number is reduced to
3 digits by taking only the last three digits of the key. If the key is
4967000, the record is stored at position 0. For the second key, 8421002, the
record is placed at position 2 in the array. Hence the hash function will be
H(key) = key % 1000, where key % 1000 is the hash function and the value
obtained from it is called the hash key.
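
A minimal sketch of this example in Python, assuming the hash key is just
the last three digits of the employee ID:

def h(key):
    # Keep only the last three digits of the 7-digit employee ID.
    return key % 1000

print(h(4967000))   # 0 -> record stored at position 0
print(h(8421002))   # 2 -> record stored at position 2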


Bucket and Home bucket: The hash function H(key) maps dictionary entries into
the hash table. Each position of the hash table is called a bucket, and
H(key) gives the home bucket for the pair whose key value is key.

For example, with H(key) = key % 10:

H(21) = 21 % 10 = 1 (home bucket for 21)
H(44) = 44 % 10 = 4 (home bucket for 44)
H(58) = 58 % 10 = 8 (home bucket for 58)

Index   Key
0       NULL
1       21      (home bucket for 21)
2       NULL
3       NULL
4       44      (home bucket for 44)
5       NULL
6       NULL
7       NULL
8       58      (home bucket for 58)
9       NULL

TYPES OF HASH FUNCTIONS: There are various types of hash functions used to
place records in the hash table:

1. Division Method: The hash function depends upon the remainder of division;
typically the divisor is the table length. For example, if the records 54,
72, 89, 37 are placed in the hash table and the table size is 10, then

h(key) = record % table size

54 % 10 = 4
72 % 10 = 2
89 % 10 = 9
37 % 10 = 7

Index   Key
0
1
2       72
3
4       54
5
6
7       37
8
9       89


2. Mid Square: In the mid square method, the key is squared and the middle
(mid) part of the result is used as the index. If the key is a string, it has
to be preprocessed to produce a number.

Consider that we want to place the record 3111. Then 3111^2 = 9678321, and
for a hash table of size 1000, H(3111) = 783 (the middle 3 digits).

3. Multiplicative hash function: The given record is multiplied by some
constant value. The formula for computing the hash key is
H(key) = floor(p * frac(key * A)), where p is an integer constant, A is a
constant real number, and frac denotes the fractional part. Donald Knuth
suggested using the constant A = 0.61803398987 (the reciprocal of the golden
ratio).

If key = 107 and p = 50 then

key * A = 107 * 0.61803398987 = 66.1296369...

H(key) = floor(50 * 0.1296369...)

= floor(6.4818458045)

= 6

The record 107 will be placed at location 6 in the hash table.

4. Digit Folding: The key is divided into separate parts, and these parts are
combined using some simple operation to produce the hash key.

For example, consider the record 12365412. It is divided into the parts 123,
654 and 12, which are added together: H(key) = 123 + 654 + 12 = 789. The
record will be placed at location 789.

5. Digit Analysis: Digit analysis is used when all the identifiers are known
in advance. We first transform the identifiers into numbers using some
radix r and then examine the digits of each identifier. The digits with the
most skewed distributions are deleted, and this deletion continues until the
number of remaining digits is small enough to give an address in the range of
the hash table. These remaining digits are then used to calculate the hash
address.
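
The first four methods can be sketched in Python as follows (digit analysis
is omitted, since it depends on a known identifier set); the function names
are illustrative, and the multiplicative example follows the fractional-part
formula as stated above.

import math

def division_hash(key, m=10):
    # Division method: remainder after dividing by the table size.
    return key % m

def mid_square_hash(key, digits=3):
    # Mid-square method: square the key and take the middle digits.
    sq = str(key * key)
    start = len(sq) // 2 - digits // 2
    return int(sq[start:start + digits])

def multiplicative_hash(key, p=50, A=0.61803398987):
    # Multiplicative method: H(key) = floor(p * frac(key * A)).
    return math.floor(p * ((key * A) % 1))

def folding_hash(key, width=3):
    # Digit folding: split the key's digits into groups and add them.
    s = str(key)
    return sum(int(s[i:i + width]) for i in range(0, len(s), width))

print(division_hash(54))          # 4
print(mid_square_hash(3111))      # 783 (middle digits of 9678321)
print(multiplicative_hash(107))   # 6, per the worked example above
print(folding_hash(12365412))     # 123 + 654 + 12 = 789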

COLLISION:

The hash function returns the hash key with which a record is placed in the
hash table. It thus helps us place each record at an appropriate position so
that the record can be retrieved directly from that location. The function
therefore needs to be designed very carefully: it should not return the same
hash key address for two different records, which is an undesirable situation
in hashing.

Definition: The situation in which the hash function returns the same hash
key (home bucket) for more than one record is called a collision, and two
different records that receive the same hash key are called synonyms.
Similarly, when there is no room for a new pair in the hash table, the
situation is called an overflow. Sometimes handling a collision may lead to
an overflow condition. Frequent collisions and overflows indicate a poor hash
function.

Example: Consider the hash function H(key) = recordkey % 10, with a hash
table of size 10.

The record keys to be placed are 131, 44, 43, 78, 19, 36, 57 and 77.

131 % 10 = 1
44 % 10 = 4
43 % 10 = 3
78 % 10 = 8
19 % 10 = 9
36 % 10 = 6
57 % 10 = 7
77 % 10 = 7 -> collision occurs at index 7

Index   Key
0
1       131
2
3       43
4       44
5
6       36
7       57
8       78
9       19

Now, if we try to place 77 in the hash table, we get the hash key 7, and the
record key 57 is already placed at index 7: this situation is called a
collision. If we then look for the next vacant position at the subsequent
indices 8 and 9, we find there is no room to place 77 in the hash table: this
situation is called an overflow.


COLLISION RESOLUTION TECHNIQUES:

If a collision occurs, it should be handled by applying some technique. Such
a technique is called a collision handling (resolution) technique.

1. Separate Chaining

2. Open addressing

i) Linear probing

ii) Quadratic probing

iii) Double hashing

3. Rehashing

4. Extendible hashing

1. SEPARATE CHAINING (Open Hashing):

Each slot (i.e., bucket) in the hash table points to a linked list (or
another data structure) that stores all the elements that hash to the same
index. When a collision occurs, the new element is simply added to that list.

For example, consider that the keys to be placed in their home buckets are
131, 3, 4, 21, 61, 7, 97, 8, 9. We apply the hash function H(key) = key % D,
where D is the size of the table; here D = 10. The hash table, with its
chains, will be:

Index   Chain
0       NULL
1       131 -> 21 -> 61
2       NULL
3       3
4       4
5       NULL
6       NULL
7       7 -> 97
8       8
9       9


A chain is maintained for colliding elements: for instance, 131 has home
bucket 1, and keys 21 and 61 also demand home bucket 1, hence a chain is
maintained at index 1.
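
A minimal separate-chaining sketch in Python, using a Python list per bucket
as the chain (a linked list would serve equally well); the class and method
names are illustrative.

class ChainedHashTable:
    def __init__(self, size=10):
        self.size = size
        self.buckets = [[] for _ in range(size)]   # one chain per bucket

    def _hash(self, key):
        return key % self.size

    def insert(self, key):
        # Colliding keys are simply appended to the chain.
        self.buckets[self._hash(key)].append(key)

    def search(self, key):
        return key in self.buckets[self._hash(key)]

    def delete(self, key):
        self.buckets[self._hash(key)].remove(key)

table = ChainedHashTable()
for k in (131, 3, 4, 21, 61, 7, 97, 8, 9):
    table.insert(k)
print(table.buckets[1])   # [131, 21, 61] -- the chain at home bucket 1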

Advantages:

1. Simplicity:

o The implementation of separate chaining is straightforward. Each bucket can
easily hold multiple entries, making it easier to manage collisions.

2. Dynamic Size:

o There is no need to worry about resizing the table when the number of
entries grows, as each bucket can expand independently as needed.

3. Flexible Handling of Collisions:

o Multiple keys can hash to the same index without losing data. The linked
list or similar structure can grow to accommodate additional entries.

4. Good Load Factor Management:

o Even with a load factor greater than 1 (number of elements / number of
buckets), performance remains acceptable, as retrieval times depend more on
the length of the chains than on the total number of entries.

5. Ease of Deletion:

o Deleting an entry is easier, as you only need to adjust the pointers in the
linked list rather than rehashing the entire table.

Disadvantages:

1. Memory Overhead:

o Each entry in a chain requires additional memory for the pointers in the
linked list, leading to more overhead compared to open addressing methods.

2. Performance Degradation:

o If many keys hash to the same index (poor hash function), chains can become
long, leading to O(n) time complexity for search, insert, and delete
operations in the worst case.

3. Cache Performance:

o Linked lists may lead to poor cache performance due to non-contiguous
memory allocation, which can slow down access times compared to data stored
in a contiguous array.

4. Increased Complexity:

o While the basic structure is simple, the handling of chains can complicate
operations, especially in terms of maintaining performance across different
load factors.

5. Rehashing:

o If the load factor becomes too high, rehashing the entire table (i.e.,
creating a larger table and redistributing the entries) may be required,
which can be expensive.


2. OPEN ADDRESSING (Closed Hashing):

Instead of using linked lists, this technique finds another FREE slot within
the hash table itself when a collision occurs. This process is known as
probing, and the sequence of slots examined is known as the probing sequence.

i) Linear Probing:

When a collision occurs, i.e., when two records demand the same home bucket
in the hash table, the collision can be solved by placing the second record
linearly down at the next empty bucket found. When using linear probing
(open addressing), the hash table is represented as a one-dimensional array
with indices that range from 0 to the desired table size - 1. Before
inserting any elements into this table, we must initialize the table to
represent the situation where all slots are empty. This allows us to detect
overflows and collisions when we insert elements into the table. Then, using
some suitable hash function, the elements can be inserted into the hash
table.

This method uses the below formula to find the next empty slot in
sequence.

Hash function: H(key) = (H(key) + i) % m, for i = 1, 2, ...

Example: Consider the division hash function.

Hash function: H(key) = key % tablesize

The following keys are to be inserted into the hash table: 131, 4, 8, 7, 21,
5, 31, 61, 9, 29.

Assume tablesize=10

For instance, the element 131 is placed at H(131) = 131 % 10 = 1.

Index 1 will be the home bucket for 131. Continuing in this fashion we will
place 4, 8, 7. Now the next key to be inserted is 21.

According to the hash function H(key)=21%10

H(21) = 1


But index 1 is already occupied by 131, i.e., a collision occurs. To resolve
this collision we move linearly down and place the element at the next empty
location:

H(21) = (H(21) + 1) % 10 = (1 + 1) % 10 = 2

Therefore 21 is placed at index 2. If the next element is 5, its home bucket
is index 5, and since this bucket is empty we put the element 5 at index 5.

After inserting 131, 4, 8, 7; after inserting 21, 5, 31, 61 (21, 31 and 61
all collide at index 1); and after inserting 9, 29:

Index  Key       Index  Key       Index  Key
0      NULL      0      NULL      0      29
1      131       1      131       1      131
2      NULL      2      21        2      21
3      NULL      3      31        3      31
4      4         4      4         4      4
5      NULL      5      5         5      5
6      NULL      6      61        6      61
7      7         7      7         7      7
8      8         8      8         8      8
9      NULL      9      NULL      9      9

The next record key is 9. According to the division hash function it demands
home bucket 9, so we place 9 at index 9. The final record key is 29, and it
also hashes to key 9; but home bucket 9 is already occupied, and there is no
next empty bucket since the table size is limited to index 9, so an overflow
occurs. To handle it we wrap around to bucket 0; since that location is
empty, 29 is placed at index 0.
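
A minimal linear-probing sketch in Python that reproduces the worked example
above, wrapping around to index 0 when the end of the table is reached:

def linear_probe_insert(table, key):
    m = len(table)
    home = key % m
    for i in range(m):                  # probe at most m slots
        slot = (home + i) % m           # wraps around past the last index
        if table[slot] is None:
            table[slot] = key
            return slot
    raise OverflowError("hash table is full")

table = [None] * 10
for k in (131, 4, 8, 7, 21, 5, 31, 61, 9, 29):
    linear_probe_insert(table, k)
print(table)   # [29, 131, 21, 31, 4, 5, 61, 7, 8, 9]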

Problem with linear probing:

One major problem with linear probing is primary clustering: a contiguous
block of occupied slots forms in the hash table as collisions are resolved,
and any further key hashing into that block extends it.


For example, with linear probing:

19 % 10 = 9
18 % 10 = 8
39 % 10 = 9 -> collision, wraps around to index 0
29 % 10 = 9 -> collision, probes to index 1
8 % 10 = 8  -> collision, probes to index 2

Index   Key
0       39   |
1       29   | cluster
2       8    |
3       NULL
4       NULL
5       NULL
6       NULL
7       NULL
8       18   | cluster
9       19   |

Advantages:

1. Simplicity:

o The implementation of linear probing is straightforward and easy to
understand. It requires minimal additional data structures, using the
existing array for both storage and collision resolution.

2. Cache Performance:

o Since elements are stored in contiguous memory locations, linear probing
can benefit from better cache performance compared to chaining. This leads to
faster access times due to spatial locality.

3. Reduced Memory Overhead:

o Unlike separate chaining, which requires additional memory for pointers in
linked lists, linear probing uses the same array for storage, leading to less
memory overhead.

4. Easy to Implement:

o Insertion, deletion, and search operations are simple to implement with
linear probing, as they only require checking the next available slots.

5. Fewer Data Structures:

o With linear probing, there is no need for auxiliary data structures to
manage collisions, making the overall design cleaner.

Disadvantages:

1. Clustering:

o Linear probing can lead to primary clustering, where a group of contiguous
filled slots increases the likelihood of further collisions. This can degrade
performance as the load factor increases.

2. Performance Degradation:

o As the load factor approaches 1, performance can degrade significantly,
with search, insertion, and deletion operations potentially reaching O(n)
time complexity in the worst case.

3. Complex Deletion:

o Deleting an element requires careful handling to avoid breaking the chain
of probes. Simply marking a slot as deleted can lead to issues when searching
for other elements.

4. Load Factor Sensitivity:

o The performance of linear probing is highly sensitive to the load factor.
Maintaining a lower load factor (e.g., less than 0.7) is often necessary to
ensure efficient operation.

5. Rehashing Complexity:

o If the table needs to be resized, rehashing all elements can be a costly
operation, especially if many elements are clustered together.

ii) Quadratic Probing:

Quadratic probing operates by taking the original hash value and adding
successive values of a quadratic polynomial to the starting value. This
reduces primary cluster formation in the hash table.


This method uses the following formula.

Hash function: H(key) = (H(key) + i^2) % m

where m is the table size, preferably a prime number.

For example, let us insert the following keys into the hash table:
37, 90, 55, 22, 11, 17, 49, 87.

H(37) = 37 % 10 = 7
H(90) = 90 % 10 = 0
H(55) = 55 % 10 = 5
H(22) = 22 % 10 = 2
H(11) = 11 % 10 = 1

Index   Key
0       90
1       11
2       22
3       NULL
4       NULL
5       55
6       NULL
7       37
8       NULL
9       NULL

When we want to insert key 17, a collision occurs at index 7, so we use
quadratic probing for i = 1, 2, ... to find an empty bucket:

H(17) = ((17 % 10) + 1^2) % 10 = 7 + 1 = 8

Index 8 is empty, so we place the key 17 at index 8.

Index   Key
0       90
1       11
2       22
3       NULL
4       NULL
5       55
6       NULL
7       37   <- collision for 17
8       17
9       49


Next, for key 49: H(49) = 49 % 10 = 9.

Next, for key 87: H(87) = 87 % 10 = 7, so a collision occurs at index 7, and
we use quadratic probing for i = 1, 2, ... to find an empty bucket:

H(87) = (H(87) + 1^2) % 10 = (7 + 1) % 10 = 8, occupied
H(87) = (H(87) + 2^2) % 10 = (7 + 4) % 10 = 1, occupied
H(87) = (H(87) + 3^2) % 10 = (7 + 9) % 10 = 6, empty, so place key 87 at
index 6.

Index   Key
0       90
1       11
2       22
3       NULL
4       NULL
5       55
6       87
7       37
8       17
9       49
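
A minimal quadratic-probing sketch in Python that reproduces the example
above; m = 10 is kept to match the example, even though a prime table size is
preferred:

def quadratic_probe_insert(table, key):
    m = len(table)
    home = key % m
    for i in range(m):
        slot = (home + i * i) % m      # i-th probe: (h + i^2) % m
        if table[slot] is None:
            table[slot] = key
            return slot
    raise OverflowError("no empty slot found in the probe sequence")

table = [None] * 10
for k in (37, 90, 55, 22, 11, 17, 49, 87):
    quadratic_probe_insert(table, k)
print(table)   # [90, 11, 22, None, None, 55, 87, 37, 17, 49]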

Advantages:

1. Reduced Clustering:

o Quadratic probing helps mitigate the primary clustering problem found in
linear probing. The quadratic formula spreads the probes more evenly across
the hash table, reducing the chances of forming long runs of occupied slots.

2. Improved Performance:

o In many cases, quadratic probing can offer better average-case performance
than linear probing, especially as the load factor increases, due to its more
dispersed probing pattern.

3. Simple Implementation:

o Like linear probing, quadratic probing is relatively easy to implement. It
requires minimal additional data structures, operating directly on the
existing array.

4. Cache Efficiency:

o While it may not be as cache-friendly as linear probing due to the
non-contiguous access pattern, it still benefits from better locality
compared to separate chaining, as it operates on a single array.

5. Flexibility in Load Factor Management:

o Quadratic probing can handle a higher load factor more efficiently than
linear probing, making it suitable for hash tables that anticipate moderate
to high usage.

Disadvantages:

1. Complexity in Implementation:

o The mathematical calculation for probing (i.e., (h(key) + i^2) mod m) makes
the implementation slightly more complex than linear probing, which uses a
simple increment.

2. Secondary Clustering:

o While quadratic probing reduces primary clustering, it can still suffer
from secondary clustering: if two keys hash to the same initial slot, they
follow the same probe sequence, which can lead to inefficiencies.

3. Load Factor Limitations:

o Like other open addressing techniques, performance can degrade
significantly as the load factor approaches 1. A careful choice of load
factor is necessary to maintain efficiency.

4. Need for Table Size Constraints:

o The size of the hash table should typically be a prime number to ensure
that all slots can be probed. This requirement can complicate the choice of
table size.

5. Deletion Complexity:

o Similar to linear probing, handling deletions in quadratic probing can be
complex. Marking a slot as deleted can interfere with the probing sequence,
making it difficult to locate other elements.

iii) Double Hashing:

Double hashing is a technique in which a second hash function is applied to
the key when a collision occurs. Applying the second hash function gives the
number of positions from the point of collision at which to insert.

There are two important rules to be followed for the second function:

i) It must never evaluate to zero.
ii) It must make sure that all cells can be probed.

The formulas to be used for double hashing are

1. H1(key) = key % tablesize
2. H2(key) = M - (key % M), where M is a prime number and M < tablesize

When a collision occurs, probing is done using the formula below, for
i = 1, 2, ...

(H1(key) + i * H2(key)) % tablesize

For example, consider a hash table of size 10 and insert the following keys:
37, 90, 45, 22, 17, 49, 55. Assume tablesize = 10 and prime number M = 7.


H1(37) = 37 % 10 = 7
H1(90) = 90 % 10 = 0
H1(45) = 45 % 10 = 5
H1(22) = 22 % 10 = 2

H1(17) = 17 % 10 = 7 -> collision occurs at index 7. So calculate H2(key):

H2(17) = M - (key % M) = 7 - (17 % 7) = 7 - 3 = 4

Now perform probing to find the next index, for i = 1, 2, ...

Index = (H1(key) + i * H2(key)) % tablesize
Index = (7 + 1 * 4) % 10 = 11 % 10 = 1

H1(49) = 49 % 10 = 9

H1(55) = 55 % 10 = 5 -> collision occurs at index 5. So calculate H2(key):

H2(55) = M - (key % M) = 7 - (55 % 7) = 7 - 6 = 1

Index = (H1(key) + i * H2(key)) % tablesize
Index = (5 + 1 * 1) % 10 = 6 % 10 = 6

The final table:

Index   Key
0       90
1       17
2       22
3       NULL
4       NULL
5       45
6       55
7       37
8       NULL
9       49
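
A minimal double-hashing sketch in Python that reproduces the example above
(tablesize = 10, M = 7); in practice a prime table size is preferred so that
all cells can be probed:

def double_hash_insert(table, key, M=7):
    m = len(table)
    h1 = key % m                   # H1 gives the home bucket
    if table[h1] is None:
        table[h1] = key
        return h1
    h2 = M - (key % M)             # H2 gives the step size; never zero
    for i in range(1, m):
        slot = (h1 + i * h2) % m
        if table[slot] is None:
            table[slot] = key
            return slot
    raise OverflowError("probe sequence found no empty slot")

table = [None] * 10
for k in (37, 90, 45, 22, 17, 49, 55):
    double_hash_insert(table, k)
print(table)   # [90, 17, 22, None, None, 45, 55, 37, None, 49]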

Advantages:

1. Reduced Clustering:

o Double hashing effectively minimizes both primary and secondary clustering,
as it uses a second hash function to determine the next probe position,
making the probing sequence less predictable.

2. Better Distribution:

o The combination of two hash functions generally leads to a more uniform
distribution of entries across the hash table, improving performance even as
the load factor increases.

3. Flexibility:

o Double hashing allows more flexibility in handling collisions compared to
linear or quadratic probing, adapting better to varying data distributions.

4. Performance:

o In many cases, double hashing can achieve better average-case performance
than other open addressing methods, particularly in high-load scenarios.

5. Efficiency in Space:

o Like other open addressing techniques, double hashing does not require
additional memory for pointers, as it uses the existing array for storage.

Disadvantages:

1. Complexity:

o The implementation of double hashing is more complex than other collision
resolution techniques. It requires defining two hash functions and ensuring
they work well together, which can increase the coding overhead.

2. Performance Sensitivity:

o The efficiency of double hashing depends heavily on the choice of both hash
functions. Poor choices can lead to clustering and degrade performance.

3. Table Size Constraints:

o The size of the hash table should typically be a prime number to ensure
that all slots can be probed. This requirement complicates the choice of
table size.

4. Increased Computation:

o Each collision requires additional computation to evaluate the second hash
function, which can impact performance, especially if the hash functions are
computationally expensive.

5. Deletion Complexity:

o Similar to other open addressing methods, deletion can be tricky. Marking
slots as deleted may disrupt the probing sequence, complicating subsequent
searches for other keys.

3. REHASHING:

Rehashing is a technique in which the table is resized, i.e., the size of the
table is doubled by creating a new table. It is preferable that the new table
size be a prime number.

There are situations in which rehashing is required:

i) When the table is completely full.
ii) With quadratic probing, when the table is half full.
iii) When insertions fail due to overflow.

In such situations, we have to transfer the entries from the old table to the
new table by recomputing their positions using the hash function.

For example, consider the following keys to insert into the hash table:
37, 90, 55, 22, 17, 49 and 87. Assume tablesize = 10.

Hash function: H(key) = key % 10

H(37) = 37 % 10 = 7
H(90) = 90 % 10 = 0
H(55) = 55 % 10 = 5
H(22) = 22 % 10 = 2

Index   Key
0       90
1       87
2       22
3
4
5       55
6
7       37   <- collision for 17, 87
8       17
9       49


H(17) = 17 % 10 = 7, a collision occurs for 17 and is resolved by linear
probing.

H(49) = 49 % 10 = 9

H(87) = 87 % 10 = 7, a collision occurs for 87 and is resolved by linear
probing.

Now this table is almost full, and if we try to insert more elements,
collisions will occur and eventually further insertions will fail. Hence we
rehash by doubling the table size. The old table size is 10, so doubling
gives 20; but 20 is not a prime number, so we prefer to make the new table
size 23.

And the new hash function will be

H(key) = key mod 23

Now all the old hash table entries are mapped into the new hash table:

37 % 23 = 14
90 % 23 = 21
55 % 23 = 9
22 % 23 = 22
17 % 23 = 17
49 % 23 = 3
87 % 23 = 18

Index   Key
0
1
2
3       49
4
5
6
7
8
9       55
10
11
12
13
14      37
15
16
17      17
18      87
19
20
21      90
22      22

Now the hash table is sufficiently large to accommodate new insertions.
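
A minimal rehashing sketch in Python: it grows the table to the next prime at
least double the old size (10 -> 23, as above) and re-inserts every key,
resolving collisions here by linear probing; the helper names are
illustrative.

def is_prime(n):
    return n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1))

def next_prime(n):
    while not is_prime(n):
        n += 1
    return n

def rehash(old_table):
    new_size = next_prime(2 * len(old_table))   # 10 -> 23
    new_table = [None] * new_size
    for key in old_table:
        if key is not None:
            idx = key % new_size                # recompute the position
            while new_table[idx] is not None:   # linear probing on collision
                idx = (idx + 1) % new_size
            new_table[idx] = key
    return new_table

old = [90, 87, 22, None, None, 55, None, 37, 17, 49]
new = rehash(old)
print(len(new))        # 23
print(new.index(37))   # 14, since 37 % 23 == 14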


Advantages:

1. Improved Performance:

o By increasing the size of the hash table, rehashing can reduce the load
factor, which generally leads to improved performance for insertion,
deletion, and search operations. A lower load factor minimizes collisions and
clustering.

2. Better Distribution:

o A larger hash table allows for a more even distribution of keys, reducing
the chances of collisions and improving access times.

3. Dynamic Growth:

o Rehashing allows the hash table to grow dynamically in response to
increasing data, making it suitable for applications with variable input
sizes.

4. Efficiency Maintenance:

o Regular rehashing helps maintain efficient performance levels over time,
particularly in environments where the number of entries fluctuates
significantly.

5. Flexibility:

o Rehashing can be implemented with different strategies (e.g., changing the
size to the next prime number), allowing for customization based on specific
use cases.

Disadvantages:

1. High Overhead:

o Rehashing can be computationally expensive, as it involves creating a new
table and redistributing all existing entries. This can temporarily slow down
performance during the rehashing process.

2. Memory Consumption:

o During rehashing, both the old and new tables exist simultaneously, leading
to increased memory usage until the old table is no longer needed.

3. Complexity:

o Implementing rehashing adds complexity to the hash table design. It
requires careful management of the hash functions, resizing logic, and
redistribution of entries.

4. Temporary Performance Degradation:

o While rehashing can lead to long-term performance improvements, it can
cause temporary performance degradation during the rehashing process,
affecting operations in the short term.

5. Potential Data Loss:

o If not implemented carefully, there is a risk of data loss during the
transition from the old table to the new one, especially if the rehashing
code contains bugs.

Load Factor:

The load factor in a hash table is a measure of how full the hash table is, and
it plays a crucial role in the performance of hash table operations. It is defined
as the ratio of the number of entries (elements) stored in the hash table to the
total number of slots (buckets) available in the table.

Definition

The load factor (α) can be calculated using the formula:

α = n / m

where:

 n = number of entries (or keys) in the hash table.

 m = number of slots (or buckets) in the hash table.


Importance of Load Factor

1. Performance:

o The load factor directly affects the performance of the hash table. A
higher load factor typically leads to more collisions, which can
degrade the average time complexity of operations such as
insertion, deletion, and search.

2. Collision Resolution:

o As the load factor increases, the probability of collisions rises, leading
to longer chains in separate chaining or longer probe sequences in open
addressing methods. This can slow down operations significantly.

3. Resizing:

o Many hash table implementations use a threshold for the load factor to
trigger resizing. For instance, if the load factor exceeds a certain value
(commonly around 0.7 to 0.8), the hash table might be resized (typically
doubled), and all existing entries rehashed into the new table.

4. Space Efficiency:

o A lower load factor can lead to wasted space, as many buckets may
remain empty. Balancing between load factor and space usage is
key to efficient hash table design.

4. EXTENDIBLE HASHING:

Extendible hashing is a dynamic hashing technique designed to handle large,
growing datasets efficiently. It allows the hash table to expand as new keys
are added, while maintaining quick access times. Here are the key concepts:


Structure

1. Directory: The main component is a directory that holds pointers to
buckets. Each bucket contains a set of records, and the directory size can
grow or shrink as needed.

2. Buckets: Each bucket can hold multiple records. When a bucket becomes
full, it can split to accommodate more entries.

Key Concepts

1. Hash Function: A hash function generates an initial hash value for a key.
The output is typically a binary value that indicates which bucket to use.

2. Global Depth: This is the number of bits in the hash value used to index
into the directory. The global depth determines the size of the directory.

3. Local Depth: Each bucket has its own local depth, which indicates how
many bits from the hash value are used for that particular bucket.


Basic working of Extensible Hashing:

1. Insertion:

 Step 1 – Analyze Data Elements: Data elements may exist in various forms,
e.g. integer, string, float, etc. Currently, let us consider data elements of
type integer, e.g. 49.

 Step 2 – Convert into binary format: Convert the data element into binary
form. For string elements, consider the ASCII equivalent integer of the
starting character and then convert that integer into binary form. Since we
have 49 as our data element, its binary form is 110001.

 Step 3 – Check the Global Depth of the directory: Suppose the global depth
of the hash directory is 3.

 Step 4 – Identify the Directory: Consider the 'Global-Depth' number of
LSBs in the binary number and match it to the directory id. E.g., the binary
obtained is 110001 and the global depth is 3, so the hash function returns
the 3 LSBs of 110001, viz. 001.

 Step 5 – Navigation: Now navigate to the bucket pointed to by the
directory with directory-id 001.


 Step 6 – Insertion and Overflow Check: Insert the element and check if
the bucket overflows. If an overflow is encountered, go to step 7 followed
by Step 8, otherwise, go to step 9.

 Step 7 – Tackling the Overflow Condition during Data Insertion: Many
times, while inserting data in the buckets, the bucket may overflow. In such
cases, we need to follow an appropriate procedure to avoid mishandling of
data. First, check whether the local depth is less than or equal to the
global depth, then choose one of the cases below.

o Case 1: If the local depth of the overflowing bucket is equal to the global
depth, then Directory Expansion as well as a Bucket Split needs to be
performed. Then increment the global depth and the local depth by 1, and
assign appropriate pointers. Directory expansion doubles the number of
directories present in the hash structure.

o Case 2: If the local depth is less than the global depth, then only a
Bucket Split takes place. Then increment only the local depth by 1, and
assign appropriate pointers.


 Step 8 – Rehashing of Split Bucket Elements: The elements present in the
overflowing bucket that is split are rehashed w.r.t. the new global depth of
the directory.

 Step 9 – The element is successfully hashed.

2. Search:

 Compute the hash value of the key.

 Use the global depth to find the appropriate bucket and retrieve
the key.

3. Deletion:

 If a key is removed, the bucket may become under-utilized. If a bucket's
local depth falls below a certain threshold, it may be merged with another
bucket.

For example, let us consider a prominent example: hashing the following
elements: 16, 4, 6, 22, 24, 10, 31, 7, 9, 20, 26. Bucket size: 3.

Hash Function: Suppose the global depth is N. Then the hash function returns
the N LSBs of the key.

First, calculate the binary form of each of the given numbers:

16 - 10000
4  - 00100
6  - 00110
22 - 10110
24 - 11000
10 - 01010
31 - 11111
7  - 00111
9  - 01001
20 - 10100
26 - 11010


 Initially, the global depth and the local depth are always 1. The
directory therefore has two entries, 0 and 1, each pointing to an empty
bucket.

Inserting 16:

The binary format of 16 is 10000 and the global depth is 1. The hash function
returns 1 LSB of 10000, which is 0. Hence, 16 is mapped to the directory with
id = 0.

Inserting 4 and 6:

Both 4 (100) and 6 (110) have 0 as their LSB. Hence, they are also placed in
the bucket pointed to by directory 0, which now holds 16, 4 and 6.


Inserting 22: The binary form of 22 is 10110. Its LSB is 0. The bucket
pointed to by directory 0 is already full; hence, an overflow occurs.

As directed by Step 7, Case 1, since local depth = global depth, the bucket
splits and directory expansion takes place. The numbers present in the
overflowing bucket are rehashed after the split, and since the global depth
is incremented by 1, the global depth is now 2. Hence, 16, 4, 6, 22 are
rehashed w.r.t. their 2 LSBs [16 (10000), 4 (00100), 6 (00110), 22 (10110)]:
16 and 4 (LSBs 00) go to the bucket of directory 00, while 6 and 22 (LSBs 10)
go to the bucket of directory 10.


Inserting 24 and 10: 24 (11000) and 10 (01010) are hashed into the buckets of
directories 00 and 10, respectively. Here, we encounter no overflow
condition.

Inserting 31, 7, 9: All of these elements [31 (11111), 7 (00111), 9 (01001)]
have either 01 or 11 as their LSBs. Hence, they are mapped to the buckets
pointed to by 01 and 11. We do not encounter any overflow condition here.


Inserting 20: Insertion of the data element 20 (10100) again causes an
overflow problem, since 20 belongs in the full bucket pointed to by 00.

As directed by Step 7, Case 1, since the local depth of that bucket =
global depth, directory expansion (doubling) takes place along with a bucket
split, and the elements of the overflowing bucket are rehashed with the new
global depth of 3: bucket 000 now holds 16 and 24, while bucket 100 holds
4 and 20.


Inserting 26: The global depth is 3, so the 3 LSBs of 26 (11010) are
considered. Therefore 26 best fits in the bucket pointed to by directory 010.

That bucket overflows and, as directed by Step 7, Case 2, since the local
depth of the bucket < global depth (2 < 3), the directories are not doubled;
only the bucket is split and its elements are rehashed: bucket 010 now holds
10 and 26, while bucket 110 holds 6 and 22.

Finally, the output of hashing the given list of numbers is obtained.
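
A compact extendible-hashing sketch in Python following the steps above: the
directory is a list indexed by the global-depth LSBs of the key, each bucket
carries a local depth, and an overflow triggers either a bucket split alone
(Case 2) or directory doubling plus a split (Case 1). This is a teaching
sketch (distinct integer keys assumed), not a production structure.

class Bucket:
    def __init__(self, local_depth, capacity=3):
        self.local_depth = local_depth
        self.capacity = capacity
        self.keys = []

class ExtendibleHashTable:
    def __init__(self, capacity=3):
        self.global_depth = 1
        self.capacity = capacity
        self.directory = [Bucket(1, capacity), Bucket(1, capacity)]

    def _index(self, key):
        # Directory id = global-depth LSBs of the key's binary form.
        return key & ((1 << self.global_depth) - 1)

    def insert(self, key):
        bucket = self.directory[self._index(key)]
        if len(bucket.keys) < self.capacity:
            bucket.keys.append(key)
            return
        # Overflow handling (Step 7).
        if bucket.local_depth == self.global_depth:
            # Case 1: directory expansion (doubling) plus bucket split.
            self.directory += self.directory[:]
            self.global_depth += 1
        # Bucket split (both cases): raise the local depth and create a
        # sibling bucket for entries whose new distinguishing bit is 1.
        bucket.local_depth += 1
        sibling = Bucket(bucket.local_depth, self.capacity)
        bit = bucket.local_depth - 1
        for i, b in enumerate(self.directory):
            if b is bucket and (i >> bit) & 1:
                self.directory[i] = sibling
        # Step 8: rehash the split bucket's keys, then retry the insert.
        pending = bucket.keys + [key]
        bucket.keys = []
        for k in pending:
            self.insert(k)

table = ExtendibleHashTable()
for k in (16, 4, 6, 22, 24, 10, 31, 7, 9, 20, 26):
    table.insert(k)
print(table.global_depth)   # 3, as in the worked example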

Advantages:

 Dynamic Growth: Extensible hashing can grow efficiently as more keys are
added, without needing to rehash the entire table immediately.

 Flexible Size: The directory can grow or shrink based on the data,
allowing for efficient space utilization.

 Performance: Search, insertion, and deletion operations can remain
efficient even as the dataset grows.

Disadvantages:

 Complexity: The implementation is more complex than static hashing
techniques due to the need to manage dynamic resizing and splitting of
buckets.

 Overhead: Maintaining the directory and managing splits can introduce some
overhead, especially in scenarios with frequent insertions and deletions.

Overall, extensible hashing is a powerful method for managing dynamic
datasets with efficient operations, making it well-suited for applications
like databases and file systems.


UNIVERSAL HASH FUNCTION:

A universal hash function is a type of hash function that provides a strong
guarantee against collisions in hash tables. The concept was introduced to
minimize the probability of collisions when hashing keys into a hash table,
particularly when adversarial inputs are a concern.

Definition:

A family of hash functions H is said to be universal if, for any two distinct keys
x and y, the probability that they collide (i.e., hash to the same value) when a
hash function h from this family is chosen at random is low. Formally, a family
of hash functions H is universal if:

P(h(x) = h(y)) ≤ 1/m

for any x≠y in the input set, where m is the number of possible hash values
(i.e., the size of the hash table).

Example of a Universal Hash Function:

One commonly used example of a universal hash function is based on the
multiplication method. Given a prime number p, a random integer a (where
1 ≤ a < p) and a size m (the number of slots in the hash table), the hash
function can be defined as:

h(x) = ((a * x) mod p) mod m

This function maps an integer x to a position in a hash table of size m. The
prime p should be chosen larger than any input value, ensuring a good
distribution.
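
A minimal sketch of such a family in Python; the particular prime
p = 2^31 - 1 (a Mersenne prime) and table size m = 100 are assumptions for
illustration, and each call to make_hash draws a fresh random a, i.e.,
selects one function from the family.

import random

p = 2**31 - 1          # a prime larger than any expected key
m = 100                # number of slots in the hash table

def make_hash():
    # Choosing a at random selects one function h_a from the family H.
    a = random.randrange(1, p)
    return lambda x: ((a * x) % p) % m

h = make_hash()
print(h(4967000), h(8421002))   # two slot indices in range(m)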

Advantages:

1. Reduced Collision Probability:

o By using a universal hash function, the probability of collisions is
significantly reduced, which improves the average performance of hash table
operations.


2. Adversarial Resistance:

o Universal hash functions are resilient against adversarial input
distributions. Even if an attacker knows the hash family, they cannot easily
force collisions, because the particular function is chosen at random.

3. Simplicity:

o Many universal hash functions are simple to compute, making them efficient
for practical applications.

4. Theoretical Foundations:

o The mathematical basis for universal hashing provides strong guarantees
about performance, which can be advantageous in theoretical analysis.

Disadvantages:

1. Randomness Requirement:

o Universal hashing often requires the use of random values (like the random
integer a), which can complicate implementation, especially in deterministic
settings.

2. Performance Overhead:

o Generating random numbers and selecting hash functions from a family can
introduce overhead compared to simpler, fixed hash functions.

3. Not Always the Best Choice:

o While universal hashing provides strong guarantees, for many practical
applications simpler hash functions might suffice, especially when the input
set is known and controlled.


DICTIONARY IMPLEMENTATION USING HASHING:

1. Insertion:

 Compute the hash of the key.
 Store the value at the index given by the hash.
 Handle collisions using chaining (linked lists) or open addressing.

2. Deletion:

 Compute the hash of the key.
 Find the corresponding index and remove the key-value pair.
 Handle any necessary rehashing in case of open addressing.

3. Search:

 Compute the hash of the key.
 Locate the index and retrieve the value.
 Handle collisions appropriately.

4. Update:

 Compute the hash of the key.
 Locate the index and update the value.
 Handle collisions appropriately.
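
Tying these steps together, here is a minimal sketch in Python of a key/value
dictionary built over a separate-chaining hash table; the class and method
names are illustrative.

class HashDictionary:
    def __init__(self, size=16):
        self.size = size
        self.buckets = [[] for _ in range(size)]   # chains of (key, value)

    def _bucket(self, key):
        return self.buckets[hash(key) % self.size]

    def insert(self, key, value):          # also serves as update
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)   # overwrite the existing pair
                return
        bucket.append((key, value))

    def search(self, key):
        for k, v in self._bucket(key):
            if k == key:
                return v
        return None

    def delete(self, key):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:
                del bucket[i]
                return

d = HashDictionary()
d.insert(4967000, "savings account")   # account number as the key
print(d.search(4967000))               # "savings account"

With a good hash function and a bounded load factor, each of these operations
takes O(1) time on average.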
