ADS Unit-2
WHAT IS HASHING?
❖ Sequential search requires, on average, O(n) comparisons to locate an
element; this many comparisons are undesirable for a large database of elements.
❖ Binary search requires far fewer comparisons, O(log n) on average, but
there is an additional requirement that the data be sorted.
Even with the best sorting algorithm, sorting the elements requires O(n log n)
comparisons.
❖ There is another widely used technique for storing data called hashing.
It does away with the requirement of keeping data sorted (as in binary search),
and its best-case time complexity is of constant order, O(1).
In the worst case, a hashing algorithm degrades to the O(n) behaviour of linear search.
Introduction to Hashing
• Suppose that we want to store 10,000 student records (each with a 5-digit ID) in
a given container.
• Using an array of size 100,000 would give O(1) access time, but it would waste a
lot of space, since only 10,000 of the slots would ever be used.
• Is there some way that we could get O(1) access without wasting a lot of space?
Components of Hashing
1. Key: A key can be anything, such as a string or an integer, that is fed as
input to the hash function, the technique that determines an index or
location for storing an item in a data structure.
2. Hash Function: The hash function receives the input key and returns
the index of an element in an array called a hash table. The index is
known as the hash index.
3. Hash Table: A hash table is a data structure that maps keys to values
using a special function called a hash function. A hash table stores the data
in an associative manner in an array, where each data value has its own
unique index.
Hash keys are also used for efficient data retrieval and storage in
hash tables or data structures, as they allow quick look-up and
comparison operations.
1. Static hashing: In static hashing, the hash function maps search-key values to a
fixed set of locations.
2. Dynamic hashing: In dynamic hashing a hash table can grow to handle more
items. The associated hash function must change as the table grows.
If, for two (key: value) pairs, the same index is obtained after applying the hash
function, this condition is called a collision. We should choose a hash function
for which collisions are as rare as possible, since they cannot always be avoided entirely.
Terminology:
1. Hashing: The whole process
2. Hash value/code: The index in the hash table at which the value is stored,
obtained by computing the hash function on the corresponding key.
3. Hash Table: The data structure associated with hashing in which keys are
mapped with values stored in the array.
4. Hash Function/ Hash: The mathematical function to be applied on keys to
obtain indexes for their corresponding values into the Hash Table.
There are various types of hash functions that are used to place records in the
hash table:
1. Division Method
2. Mid Square Method
3. Multiplication Method
4. Digit Folding Method
5. Digit Analysis Method
1. Division Method:
Say that we have a hash table of size M, and we want to store a (key, value) pair in the
hash table. The hash function, according to the division method, would be:
h(key) = key mod M
For example, with M = 5:
1. {10:"Sudha"}
Key mod M = 10 mod 5 = 0
2. {11:"Venkat"}
Key mod M = 11 mod 5 = 1
3. {12:"Jeevani"}
Key mod M = 12 mod 5 = 2
Observe that the hash values were consecutive. This is the disadvantage of
this type of hash function: we get consecutive indexes for consecutive keys,
which clusters related records together and, because the mapping is easy to
predict, decreases security. The table size M must therefore be chosen with
care; a prime number not too close to a power of two usually spreads the keys
more evenly.
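To make the method concrete, here is a minimal sketch in Python (the function name is illustrative):

def division_hash(key: int, m: int) -> int:
    """Map an integer key to an index in [0, m-1] using the division method."""
    return key % m

# Keys 10, 11, 12 land in the consecutive slots 0, 1, 2 of a table of size 5.
for key in (10, 11, 12):
    print(key, "->", division_hash(key, 5))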
2. Mid Square Method:
In the mid square method, the key is squared and the middle digits of the
result are taken as the index. We should choose the number of digits to
extract based on the size of the hash table. Suppose the hash table size is
100; indexes will range from 0 to 99. Hence, we should select 2 digits from
the middle.
Suppose the size of the hash table is 10 (so one middle digit is extracted) and the
keys are 10, 11 and 12:
H(10) = 10 * 10 = 100 → middle digit 0
H(11) = 11 * 11 = 121 → middle digit 2
H(12) = 12 * 12 = 144 → middle digit 4
o All the digits in the key are utilized to contribute to the index, thus
increasing the performance of the Data Structure.
o If the key is a large value, squaring it increases the value further,
which is a drawback.
o Collisions might occur too, but we can try to reduce or handle them.
o Another important point is that, with huge numbers, we need to take
care of overflow conditions. For example, if we take a 6-digit key, we
get a 12-digit number when we square it, which exceeds the range of
standard integers. We can use a long integer type or a string-multiplication
technique.
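A minimal sketch of the mid square method, assuming one middle digit is extracted from the square (as in the size-10 table above):

def mid_square_hash(key: int, digits: int = 1) -> int:
    """Square the key and extract `digits` middle digits as the index."""
    squared = str(key * key)
    mid = len(squared) // 2
    start = max(mid - digits // 2, 0)
    return int(squared[start:start + digits])

# 10*10 = 100 -> 0, 11*11 = 121 -> 2, 12*12 = 144 -> 4
for key in (10, 11, 12):
    print(key, "->", mid_square_hash(key))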
3. Folding Method
Given a {key: value} pair and a table size of 100 (indexes 0 - 99), the key
is broken down into segments of 2 digits each, except possibly the last
segment, which can have fewer digits. The hash function adds the segments
and reduces the sum modulo the table size:
h(key) = (sum of the segments) mod table size
For example, with a table of size 100 (0 - 99):
1234 → 12 + 34 = 46; 46 % 100 = 46
5678 → 56 + 78 = 134; 134 % 100 = 34
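A minimal sketch of the folding method, assuming 2-digit segments as above:

def folding_hash(key: int, table_size: int, seg: int = 2) -> int:
    """Break the key into `seg`-digit segments, sum them, and reduce
    the sum modulo the table size."""
    digits = str(key)
    total = sum(int(digits[i:i + seg]) for i in range(0, len(digits), seg))
    return total % table_size

print(folding_hash(1234, 100))  # 12 + 34 = 46
print(folding_hash(5678, 100))  # 56 + 78 = 134 -> 34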
4. Multiplication method
Unlike the three methods above, this method has more steps involved:
1. Choose a constant A with 0 < A < 1.
2. Multiply the key k by A and take the fractional part, frac(k × A).
3. Multiply that fraction by the table size m and take the floor:
h(k) = floor(m × frac(k × A))
For example, with A = 0.56:
For k = 9 and m = 100: frac(9 × 0.56) = frac(5.04) = 0.04
h(9) = floor(100 × 0.04) = floor(4) = 4
For k = 3 and m = 99: frac(3 × 0.56) = frac(1.68) = 0.68
h(3) = floor(99 × 0.68) = floor(67.32) = 67
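A minimal sketch of the multiplication method, assuming A = 0.56 as in the example:

import math

def multiplication_hash(key: int, m: int, a: float = 0.56) -> int:
    """h(k) = floor(m * frac(k * A)), where 0 < A < 1."""
    frac = (key * a) % 1          # fractional part of k * A
    return math.floor(m * frac)

print(multiplication_hash(9, 100))  # frac(5.04) = 0.04 -> 4
print(multiplication_hash(3, 99))   # frac(1.68) = 0.68 -> 67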
5. Digit Analysis:
In digit analysis, the distribution of digits in each position is examined over
all the keys, and the positions whose digits are most uniformly distributed are
selected to form the index. This method is applicable only when the set of
keys is known in advance.
Perfect hashing:
Perfect hashing is a type of hash function that ensures no collisions; that is, each
input has a unique hash value. It is particularly useful when the set of keys is
known in advance, allowing the creation of a hash function that maps each key to a
distinct slot in the hash table. There are two main types of perfect hashing:
1. Static Perfect Hashing: This is used when the set of keys does not change
over time. It involves creating a hash function that guarantees a collision-
free mapping for the given set of keys.
2. Dynamic Perfect Hashing: This is more complex and allows for keys to be
added or removed. It involves a more sophisticated structure that can adjust
to changes while maintaining the collision-free property.
Perfect hashing is typically built in two levels:
1. First Level Hashing: A primary hash function divides the keys into several
smaller groups or buckets. The aim is to have each bucket contain a
manageable number of keys.
2. Second Level Hashing: For each bucket created by the first-level hash
function, a secondary hash function is designed such that there are no
collisions within that bucket. This often requires some trial and error or
probabilistic methods to find a suitable secondary hash function.
Example
Suppose the keys are 10, 22, 31, 40 and 50, and the first-level hash function
is h(k) = k mod 3:
o 10 → 1
o 22 → 1
o 31 → 1
o 40 → 1
o 50 → 2
o Bucket 1: {10, 22, 31, 40}
o Bucket 2: {50}
A collision-free second-level hash function is then constructed for bucket 1.
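A simplified sketch of this two-level construction (the prime p, the retry loop, and the function names are illustrative assumptions; a full FKS construction also bounds the total space used):

import random

def build_perfect_table(keys, m):
    """First level: bucket keys with k mod m. Second level: for each
    bucket of size n, search for a collision-free function
    h(k) = ((a*k + b) mod p) mod s with s = n*n."""
    p = 10**9 + 7                        # a prime larger than any key
    buckets = [[] for _ in range(m)]
    for k in keys:
        buckets[k % m].append(k)         # first-level hash

    tables = []
    for bucket in buckets:
        s = max(len(bucket) ** 2, 1)     # quadratic space per bucket
        while True:                      # retry until collision-free
            a, b = random.randrange(1, p), random.randrange(p)
            slots = [None] * s
            ok = True
            for k in bucket:
                i = ((a * k + b) % p) % s
                if slots[i] is not None:
                    ok = False           # collision: try a new (a, b)
                    break
                slots[i] = k
            if ok:
                tables.append((a, b, slots))
                break
    return tables

tables = build_perfect_table([10, 22, 31, 40, 50], 3)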
Disadvantages:
o The set of keys must be known in advance for static perfect hashing;
insertions of new keys are not supported.
o Construction can be expensive, and the quadratically sized second-level
tables add space overhead.
Universal hashing:
Universal hashing is a concept in computer science used to construct hash
functions with certain randomness properties.
This approach ensures that, on average, the performance of the hash function is
good, regardless of the input data.
Universal hashing aims to minimize the probability of collision (two keys hashing
to the same value) by using a family of hash functions and selecting one at random.
A family of hash functions H is called universal if, for any two distinct keys x and y, the
probability that a randomly chosen hash function h from H maps x and y to the same value is at
most 1/m, where m is the size of the hash table. Formally, for x ≠ y:
Pr[h(x) = h(y)] ≤ 1/m, when h is chosen uniformly at random from H.
To create a universal family of hash functions, we can use modular arithmetic and
random coefficients. One common method is to use the following formula:
h(x) = ((a·x + b) mod p) mod m
where:
o p is a prime number larger than the largest possible key,
o a is chosen at random from {1, 2, ..., p − 1},
o b is chosen at random from {0, 1, ..., p − 1},
o m is the size of the hash table.
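A minimal sketch of drawing one random member of this family (the prime p = 2^31 − 1 is an illustrative choice):

import random

def make_universal_hash(m: int, p: int = 2**31 - 1):
    """Pick h(x) = ((a*x + b) mod p) mod m at random from the family."""
    a = random.randrange(1, p)   # a in {1, ..., p-1}
    b = random.randrange(0, p)   # b in {0, ..., p-1}
    return lambda x: ((a * x + b) % p) % m

h = make_universal_hash(10)
print(h(42), h(1000))  # indexes in [0, 9]; they vary with the chosen (a, b)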
A key advantage is fast data retrieval: hashing allows quick access to elements
with constant average-time complexity.
What is Collision?
The hashing process generates a small number for a big key, so there is a
possibility that two keys could produce the same value. The situation where
a newly inserted key maps to an already occupied slot is called a collision,
and it must be handled using some collision handling technique.
Collision in Hashing
This can happen even with a good hash function, especially if the hash table
is full or the keys are similar. Collisions become more likely in the
following situations:
Poor Hash Function: A hash function that does not distribute keys
evenly across the hash table can lead to more collisions.
High Load Factor: A high load factor (ratio of keys to hash table size)
increases the probability of collisions.
Similar Keys: Keys that are similar in value or structure are more likely to
collide.
Overflow handling:
Collisions (overflows) can be handled in two broad ways:
1. Closed Addressing (Chaining): Store colliding keys in a linked list or
binary search tree at each index.
2. Open Addressing: All keys are stored in the hash table itself, and a
collision is resolved by probing for another slot. The basic operations are:
Insert(k): Keep probing until an empty slot is found. Once an empty slot is
found, insert k.
Search(k): Keep probing until either the slot’s key becomes equal to k or an
empty slot is reached.
Delete(k): Delete operation is interesting. If we simply delete a key, then the
search may fail. So slots of deleted keys are marked specially as “deleted”.
The insert can insert an item in a deleted slot, but the search does not stop at a
deleted slot.
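A minimal sketch of these three operations, using linear probing and a special DELETED marker (the table size and class name are illustrative):

DELETED = object()   # tombstone marker for deleted slots

class OpenAddressingTable:
    def __init__(self, m=11):
        self.m = m
        self.slots = [None] * m

    def insert(self, k):
        for i in range(self.m):
            j = (hash(k) + i) % self.m          # linear probe sequence
            if self.slots[j] is None or self.slots[j] is DELETED:
                self.slots[j] = k               # may reuse a deleted slot
                return
        raise RuntimeError("hash table is full")

    def search(self, k):
        for i in range(self.m):
            j = (hash(k) + i) % self.m
            if self.slots[j] is None:           # truly empty: not present
                return False
            if self.slots[j] == k:              # does not stop at DELETED
                return True
        return False

    def delete(self, k):
        for i in range(self.m):
            j = (hash(k) + i) % self.m
            if self.slots[j] is None:
                return
            if self.slots[j] == k:
                self.slots[j] = DELETED         # mark, do not empty
                return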
1. Linear Probing:
In linear probing, the table is searched sequentially starting from the
original hash location: the i-th probe examines slot (h(k) + i) mod m. Its
drawback is primary clustering: runs of occupied slots build up, and every
later insertion into a cluster lengthens the probe sequences further.
2. Quadratic Probing:
Quadratic probing is a method with the help of which we can solve the
clustering problem discussed above. Instead of stepping one slot at a time,
the interval between probes grows quadratically with the attempt number: the
i-th probe examines slot (h(k) + i²) mod m.
3. Double Hashing:
It works by using two hash functions to compute two different hash values
for a given key.
The first hash function is used to compute the initial hash value, and the
second hash function is used to compute the step size for the probing
sequence.
Double hashing has the ability to have a low collision rate, as it uses two
hash functions to compute the hash value and the step size.
However, double hashing also has drawbacks. First, it requires the use of two
hash functions, which can increase the computational complexity of the
insertion and search operations.
If the hash functions are not well designed, the collision rate may still be
high.
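A minimal sketch of a double-hashing probe sequence, assuming the common choice h2(k) = 1 + (k mod (m − 1)) for the step size:

def double_hash_probes(k: int, m: int):
    """Yield the probe sequence h(k, i) = (h1(k) + i*h2(k)) mod m.
    h2 must never evaluate to 0, or probing would stand still."""
    h1 = k % m
    h2 = 1 + (k % (m - 1))   # step size from the second hash function
    for i in range(m):
        yield (h1 + i * h2) % m

# With a prime table size, the sequence visits every slot exactly once.
print(list(double_hash_probes(42, 11)))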
Separate Chaining:
The idea behind separate chaining is to make each slot of the array point to
a linked list called a chain.
So what happens is, when multiple elements are hashed into the same
slot index, then these elements are inserted into a singly-linked list
which is known as a chain.
Here, all those elements that hash into the same slot index are inserted
into a linked list.
Now, we can use a key K to search in the linked list by just linearly
traversing.
If the intrinsic key for any entry is equal to K then it means that we
have found our entry.
If we have reached the end of the linked list and yet we haven’t found
our entry then it means that the entry does not exist.
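A minimal sketch of separate chaining, using Python lists as the chains (class and method names are illustrative):

class ChainedHashTable:
    """Each slot of the array holds a list that acts as the chain."""
    def __init__(self, m=10):
        self.m = m
        self.chains = [[] for _ in range(m)]

    def insert(self, key, value):
        self.chains[hash(key) % self.m].append((key, value))

    def search(self, key):
        for k, v in self.chains[hash(key) % self.m]:  # linear traversal
            if k == key:
                return v        # entry found
        return None             # end of chain reached: entry not present

t = ChainedHashTable()
t.insert(15, "A"); t.insert(25, "B")   # 15 and 25 collide in slot 5
print(t.search(25))                     # "B"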
Secure Hash Functions
A secure (cryptographic) hash function takes an input of arbitrary length and
produces a fixed-size output. The output, often called a "digest," appears
random and is effectively unique to each distinct input.
Here are some key properties and functions of secure hash functions:
1. Deterministic: The same input will always produce the same hash output.
2. Fast Computation: It should be quick to compute the hash value for any
given input.
3. Small Changes in Input Change the Output: A small change to the input
should produce a significantly different hash.
4. Collision Resistance: It should be infeasible to find two different inputs that
produce the same hash output.
5. Fixed Output Length: The output length of the hash should be fixed,
regardless of the input length.
How MD5 Works
1. Padding the Message: The original message is padded so that its length is
congruent to 448 modulo 512. This involves appending a single '1' bit,
followed by a number of '0' bits, and finally the length of the original
message as a 64-bit integer.
2. Initialization: Four 32-bit variables (A, B, C, and D) are initialized with
specific constants.
3. Processing in Blocks: The padded message is processed in 512-bit blocks.
Each block modifies the values of A, B, C, and D through a series of bitwise
operations and additions.
4. Final Output: The final values of A, B, C, and D are concatenated to
produce the final 128-bit hash value.
Example Input
"hello"
Step-by-Step Process
1. Padding the Message:
o The ASCII message "hello" is 5 bytes (40 bits):
01101000 01100101 01101100 01101100 01101111
o Append a single '1' bit, then enough '0' bits so that the length of the
padded message is congruent to 448 modulo 512. The length before the
64-bit length field must be 56 bytes; with 5 message bytes and 1 byte
carrying the '1' bit, that leaves 56 - 5 - 1 = 50 bytes of zero padding.
o Finally, the original length (40) is appended as a 64-bit integer.
2. Initialization:
o Initialize the four MD5 buffer variables (A, B, C, and D) to specific
constants:
A = 0x67452301
B = 0xEFCDAB89
C = 0x98BADCFE
D = 0x10325476
3. Processing in Blocks and Final Output:
o The single padded 512-bit block is processed, updating A, B, C and D
through rounds of bitwise operations and additions.
o The final values of A, B, C and D are concatenated to give the 128-bit
digest. The MD5 hash of "hello" is:
5d41402abc4b2a76b9719d911017c592
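The worked example can be checked with Python's standard hashlib module:

import hashlib

# Verify the digest computed in the worked example above.
digest = hashlib.md5(b"hello").hexdigest()
print(digest)  # 5d41402abc4b2a76b9719d911017c592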
Security of MD5
MD5 was initially considered secure but has since been found vulnerable to
various types of attacks:
1. Collision Attacks: A collision occurs when two different inputs produce the
same hash output. Researchers have demonstrated that MD5 is susceptible to
collision attacks, making it insecure for cryptographic purposes.
2. Pre-image and Second Pre-image Attacks: While MD5 is more resistant to
pre-image and second pre-image attacks than to collision attacks, its
weaknesses in collision resistance make it unsuitable for security-sensitive
applications.
Applications of MD5
1. File integrity checks: Detecting accidental corruption during downloads or
transfers, where protection against deliberate tampering is not required.
2. Fingerprinting and de-duplication: Quickly identifying duplicate data.
3. Legacy uses: MD5 was historically used for password storage and digital
signatures, but because of its collision weaknesses it should no longer be
used for any security-sensitive purpose.
Analysis of Collision Resolution Techniques
1. Open Addressing
Linear Probing
Time Complexity:
o Average case: O(1) for insertion, deletion, and search when the load
factor is low.
o Worst case: O(n) when the load factor approaches 1, due to clustering.
Space Complexity:
o O(n), where n is the number of slots in the hash table.
Pros:
o Simple to implement.
o Good cache performance due to locality of reference.
Cons:
o Primary clustering can degrade performance.
Quadratic Probing
Time Complexity:
o Average case: O(1) for insertion, deletion, and search.
o Worst case: O(n) in rare cases, due to secondary clustering.
Space Complexity:
o O(n).
Pros:
o Reduces primary clustering compared to linear probing.
Cons:
o Secondary clustering still exists.
o May fail to find an empty slot even if the table is not full (requires
careful selection of constants).
Double Hashing
Time Complexity:
o Average case: O(1) for insertion, deletion, and search.
o Worst case: O(n) if both hash functions produce poor results.
Space Complexity:
o O(n).
Pros:
o Minimizes clustering.
o More uniform distribution.
Cons:
o Slightly more complex to implement due to multiple hash functions.
2. Separate Chaining
Time Complexity:
o Average case: O(1) for insertion.
o Worst case: O(n) for search and deletion if all keys hash to the same slot.
o Average case for search and deletion: O(1 + α), where α is the load factor.
Space Complexity:
o O(n + m), where m is the number of elements stored.
Pros:
o Simple to implement.
o Load factor can exceed 1.
o Easy to resize.
Cons:
o Extra memory required for pointers in linked lists.
o Cache performance can be poor.
3. Cuckoo Hashing
Time Complexity:
o Average case: O(1) for insertion, deletion, and search.
o Worst case: O(n) for insertion due to potential rehashing.
Space Complexity:
o O(n).
Pros:
o Guarantees O(1) worst-case lookup time.
o Simplifies deletion.
Cons:
o Insertion can be complex and may require multiple relocations.
o High memory overhead due to maintaining multiple hash tables.
Summary:
Linear Probing: Simple and efficient with good cache performance, but
suffers from clustering.
Quadratic Probing: Reduces clustering but requires careful selection of
parameters.
Double Hashing: Provides good distribution at the cost of more complex
implementation.
Separate Chaining: Handles high load factors well but has poor cache
performance.
Cuckoo Hashing: Guarantees constant-time lookups but can be complex
and requires multiple tables.
Dynamic Hashing
Dynamic hashing allows easy insertion and deletion of records in the database
and avoids the performance issues of a fixed-size table. Why it is needed:
Static hash tables have a fixed size determined at creation. When the data set
grows, the table can become full, leading to overflow issues.
If the table is too large initially, it wastes space when the data set is small.
Dynamic hashing allows the hash table to grow and shrink as needed,
efficiently accommodating varying data sizes without significant space
wastage or performance degradation.
When a bucket in a static hash table becomes full, additional records are
typically handled using overflow chains or linked lists.
These overflow chains can become long and degrade performance, as they
require multiple accesses to retrieve a single record.
As the load factor of a static hash table increases (i.e., as it becomes more
full), the likelihood of collisions and the average access time for records
increase.
Static hash tables are not adaptable to changes in the dataset size without a
costly rehashing operation, which involves creating a larger table and
re-inserting all existing records.
Dynamic hashing adjusts the table size based on actual data needs, ensuring
that space is used efficiently without significant waste or the need for
frequent resizing operations.
Dynamic hashing, through its ability to grow and shrink, simplifies the
management of large and growing datasets, making it easier to handle data
expansion without major disruptions or performance hits.
Extendible Hashing
Extendible hashing is a dynamic hashing technique in which a directory of
pointers refers to buckets of keys. Its components are:
Directories: These containers store pointers to the buckets. Each directory
has an id, which may change when a directory expansion takes place.
Buckets: They store the hashed keys. Directories point to buckets; more than
one directory pointer may refer to the same bucket if its local depth is less
than the global depth.
Global Depth: It is associated with the Directories. They denote the number
of bits which are used by the hash function to categorize the keys. Global
Depth = Number of bits in directory id.
Local Depth: It is the same as that of Global Depth except for the fact that
Local Depth is associated with the buckets and not the directories. Local depth
in accordance with the global depth is used to decide the action that to be
performed in case an overflow occurs. Local Depth is always less than or
equal to the Global Depth.
Step 1 – Analyze Data Elements: Data elements may exist in various forms,
e.g. integer, string, float, etc. Here, let us consider data elements of type
integer, e.g. 49.
Step 2 – Convert into binary format: Convert the data element in Binary
form. For string elements, consider the ASCII equivalent integer of the
starting character and then convert the integer into binary form. Since we have
49 as our data element, its binary form is 110001.
Step 3 – Check Global Depth of the directory: Suppose the global depth of
the hash directory is 3.
Step 4 – Identify the Directory: Take the global-depth number of least
significant bits of the binary number; these bits form the directory id. For
49 (110001) with global depth 3, the 3 LSBs are 001, i.e. directory id 1.
Step 5 – Navigation: Navigate to the bucket pointed to by the directory with
that id.
Step 6 – Insertion and Overflow Check: Insert the element and check if the
bucket overflows. If an overflow is encountered, go to step 7 followed
by Step 8, otherwise, go to step 9.
Step 7 – Tackling Overflow Condition: First, check whether the local depth
is less than or equal to the global depth.
o Case 1: If the local depth of the overflowing bucket is equal to the
global depth, then both directory expansion and a bucket split take
place. The global depth and the local depth of the split bucket are
incremented by 1, and appropriate pointers are assigned.
o Case 2: If the local depth is less than the global depth, then only a
bucket split takes place. Only the local depth is incremented by 1,
and appropriate pointers are assigned.
Step 8 – Rehashing of Split Bucket Elements: The elements present in the
overflowing bucket that was split are rehashed w.r.t. the new global depth of
the directory.
Step 9 – The element is inserted successfully; the steps are repeated for
every element.
Hash Function: Suppose the global depth is X. Then the Hash Function returns
X LSBs.
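A minimal sketch of this hash function, extracting the X least significant bits of a key:

def directory_id(key: int, global_depth: int) -> int:
    """Return the global_depth least significant bits of the key."""
    return key & ((1 << global_depth) - 1)

print(directory_id(49, 3))  # 49 = 110001 -> 3 LSBs are 001 -> 1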
Example: Insert the elements 16, 4, 6, 22, 24, 10, 31, 7, 9, 20 and 26 into
an extendible hash table (assume a bucket capacity of 3).
Solution: First, calculate the binary forms of each of the given numbers.
16- 10000
4- 00100
6- 00110
22- 10110
24- 11000
10- 01010
31- 11111
7- 00111
9- 01001
20- 10100
26- 11010
Initially, the global depth and local depth are always 1. Thus, the hashing frame
looks like this:
Inserting 16:
The binary format of 16 is 10000 and global-depth is 1. The hash function
returns 1 LSB of 10000 which is 0. Hence, 16 is mapped to the directory with
id=0.
Inserting 4 and 6:
Both 4 (00100) and 6 (00110) have 0 as their LSB. Hence, they are hashed as
follows:
Inserting 22: The binary form of 22 is 10110. Its LSB is 0. The bucket
pointed to by directory 0 is already full. Hence, overflow occurs.
As directed by Step 7, Case 1: since local depth = global depth, the bucket
splits and directory expansion takes place.
Also, rehashing of numbers present in the overflowing bucket takes place after
the split.
Inserting 20: Insertion of data element 20 (10100) will again cause the
overflow problem.
Dynamic Hashing Without Directories
This approach simplifies the structure and often relies on techniques such as linear
hashing to handle dynamic growth and reorganization of the hash table.
It avoids the need for a global directory by directly managing the bucket splits and
the rehashing process.
How Linear Hashing Works
1. Initial Setup: Start with an initial number of buckets, say M, and a hash
function h(k) = k mod M.
2. Split Pointer: Maintain a split pointer that initially points to the first bucket.
3. Progressive Splitting: When a bucket overflows, the bucket pointed to by
the split pointer is split, and the split pointer is incremented.
4. Doubling and Modifying the Hash Function: When the split pointer
reaches the end of the current set of buckets, the number of buckets doubles,
and the hash function is adjusted accordingly.
Insertion: Insert the key into the appropriate bucket. If the bucket
overflows, split the bucket pointed to by the split pointer.
Splitting: When splitting a bucket, redistribute the keys using the new hash
function, and increment the split pointer.
Doubling: When the split pointer completes a cycle through the buckets,
double the number of buckets and update the hash function.
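A simplified sketch of linear hashing under these rules (the bucket capacity and initial size are illustrative, and overflow chains are omitted for brevity):

class LinearHashTable:
    """Simplified linear hashing: buckets are split in order as the table
    grows; keys are addressed with k mod m or, once their home bucket
    has been split this round, k mod 2m."""
    def __init__(self, m=4, capacity=2):
        self.m = m                  # buckets at the start of the round
        self.split = 0              # split pointer: next bucket to split
        self.capacity = capacity    # keys per bucket before overflow
        self.buckets = [[] for _ in range(m)]

    def _addr(self, k):
        i = k % self.m
        if i < self.split:          # bucket already split this round
            i = k % (2 * self.m)
        return i

    def insert(self, k):
        i = self._addr(k)
        self.buckets[i].append(k)
        if len(self.buckets[i]) > self.capacity:
            self._split()           # splits the pointed-to bucket,
                                    # which need not be the full one

    def _split(self):
        self.buckets.append([])                 # image bucket m + split
        old, self.buckets[self.split] = self.buckets[self.split], []
        for k in old:                           # redistribute with k mod 2m
            self.buckets[k % (2 * self.m)].append(k)
        self.split += 1
        if self.split == self.m:                # round complete:
            self.m *= 2                         # double the bucket count
            self.split = 0                      # and reset the pointer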