
UNIT-2

WHAT IS HASHING?
❖ Sequential search requires, on average, O(n) comparisons to locate an
element; that many comparisons are undesirable for a large database of elements.

❖ Binary search requires far fewer comparisons, O(log n) on average, but
there is an additional requirement that the data be kept sorted.

Even with the best sorting algorithms, sorting the elements requires O(n log n)
comparisons.

❖ There is another widely used technique for storing data called hashing.

It does away with the requirement of keeping data sorted (as in binary search),
and its best-case time complexity is of constant order, O(1).

In the worst case, hashing degrades to a linear search.

❖ Best-case time complexity of searching using hashing = O(1)

❖ Worst-case time complexity of searching using hashing = O(n)

Introduction to Hashing
• Suppose that we want to store 10,000 student records (each with a 5-digit ID) in
a given container.

• A linked list implementation would take O(n) access time.

• A height-balanced tree would give O(log n) access time.

• Using an array of size 100,000 would give O(1) access time but would lead to a lot
of space wastage.

• Is there some way that we could get O(1) access without wasting a lot of space?

• The answer is hashing.


Introduction to Hashing in Data Structure:
Hashing is a popular technique in computer science that involves
mapping large data sets to fixed-length values.

It is a process of converting a data set of variable size into a data
set of a fixed size.

The ability to perform efficient lookup operations makes hashing
an essential concept in data structures.

Components of Hashing

There are majorly three components of hashing:

1. Key: A key can be anything, a string or an integer, that is fed as input to
the hash function, the technique that determines an index or location for
storage of an item in a data structure.

2. Hash Function: The hash function receives the input key and returns
the index of an element in an array called a hash table. The index is
known as the hash index.

3. Hash Table: A hash table is a data structure that maps keys to values
using a special function called a hash function. It stores the data in an
associative manner in an array where each data value has its own unique
index.

The hash key serves several purposes.

It is commonly used for data integrity checks, as even a small
change in the input data will produce a significantly different hash
key.

Hash keys are also used for efficient data retrieval and storage in
hash tables or data structures, as they allow quick look-up and
comparison operations.

How Hashing Works?


The process of hashing can be broken down into three steps:

o Input: The data to be hashed is input into the hashing algorithm.

o Hash Function: The hashing algorithm takes the input data and applies a
mathematical function to generate a fixed-size hash value.
The hash function should be designed so that different input values produce
different hash values, and small changes in the input produce large changes
in the output.

o Output: The hash value is returned, which is used as an index to store or
retrieve data in a data structure.

Types of Hash Functions

• There are two types of hashing:

1. Static hashing: In static hashing, the hash function maps search-key values to a
fixed set of locations.

2. Dynamic hashing: In dynamic hashing a hash table can grow to handle more
items. The associated hash function must change as the table grows.

Hashing is the technique/process of mapping key: value pairs by calculating a
Hash code using the Hash Function. When given a (key: value) pair, the Hash
Function calculates a small integer value from the key. The obtained integer is
called the Hash value/Hash code and acts as the index to store the corresponding
value inside the Hash Table.

If, for two (key: value) pairs, the same index is obtained after applying the Hash
Function, this condition is called a Collision. We need to choose a Hash Function
such that collisions occur as rarely as possible.
Terminology:
1. Hashing: The whole process
2. Hash value/ code: The index in the Hash Table for storing the value
obtained after computing the Hash Function on the corresponding key.
3. Hash Table: The data structure associated with hashing in which keys are
mapped with values stored in the array.
4. Hash Function/ Hash: The mathematical function to be applied on keys to
obtain indexes for their corresponding values into the Hash Table.

Below are the different types of Hash Functions that programmers frequently
use. These are the five Hash Functions we can choose from, based on the key being
numeric or alphanumeric:

1. Division Method
2. Mid Square Method
3. Multiplication Method
4. Digit Folding Method
5. Digit Analysis Method

Each of these hash functions can be used to place a record in the
hash table, as follows.

1. Division Method:
Say that we have a Hash Table of size 'S', and we want to store a (key, value) pair in the Hash
Table. The Hash Function, according to the Division method, would be:

H(key) = key mod M


o Here M is an integer used for calculating the Hash value. M must not
exceed the table size S (or indexes would fall out of range); usually, S
itself is used as M.
o This is the simplest and easiest method to obtain a Hash value.
o The best practice is to use this method with M a prime number, as
this distributes the keys more uniformly.
o It is also fast, as it requires only one computation: the modulus.
Let us now take an example to understand the cons of this method:

Size of the Hash Table = 5 (so M = S = 5)

Key: Value pairs: {10: "Sudha", 11: "Venkat", 12: "Jeevani"}

For every pair:

1. {10:"Sudha"}
Key mod M = 10 mod 5 = 0
2. {11:"Venkat"}
Key mod M = 11 mod 5 = 1
3. {12:"Jeevani"}
Key mod M = 12 mod 5 = 2

Observe that the Hash values were consecutive. This is the disadvantage of
this type of Hash Function: consecutive keys get consecutive indexes, which
leads to clustering and makes the mapping easy to predict, hurting
performance. The Hash Table size must therefore be chosen with care.

A simple program to demonstrate the mechanism of the division method:


#include<stdio.h>
int main()
{
    int size, i, indexes[3];
    int keys[3] = {10, 11, 12};
    printf("Enter the size of the Hash Table: ");
    scanf("%d", &size);
    int M = size;
    for(i = 0; i < 3; i++)
    {
        indexes[i] = (keys[i] % M);
    }
    printf("\nThe indexes of the values in the Hash Table: ");
    for(i = 0; i < 3; i++)
    {
        printf("%d ", indexes[i]);
    }
    return 0;
}
Output:
Enter the size of the Hash Table: 5
The indexes of the values in the Hash Table: 0 1 2

2. Mid Square Method:


It is a two-step process of computing the Hash value. Given a {key: value}
pair, the Hash Function would be calculated by:

1. Square the key -> key * key


2. Choose some digits from the middle of the number to obtain the Hash value.

We should choose the number of digits to extract based on the size of the
Hash Table. Suppose the Hash Table size is 100; indexes will range from 0 to
99. Hence, we should select 2 digits from the middle.

Suppose the size of the Hash Table is 10 and the key: value pairs are:

{10: "Sudha, 11: "Venkat", 12: "Jeevani"}

Number of digits to be selected: indexes range over 0 - 9, so 1 digit.

H(10) = 10 * 10 = 100 → middle digit: 0

H(11) = 11 * 11 = 121 → middle digit: 2

H(12) = 12 * 12 = 144 → middle digit: 4

o All the digits in the key are utilized to contribute to the index, thus
increasing the performance of the Data Structure.
o If the key is a large value, squaring it further increases the value,
which is considered the con.
o Collisions might occur, too, but we can try to reduce or handle them.
o Another important point here is that, with huge numbers, we need
to take care of overflow conditions. For example, if we take a 6-digit
key, squaring it gives a 12-digit number that exceeds the range of
standard integers. We can use long int or a string multiplication
technique.
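A minimal C sketch of the mid-square method for a table of size 10, matching the worked example above (the function name and the digit-extraction loop are illustrative choices, not from the notes):

#include <stdio.h>

/* Mid-square hash for a table of size 10: square the key, count the
   digits of the square, then keep one middle digit as the index. */
int mid_square_hash(int key)
{
    long long sq = (long long)key * key;   /* long long guards against overflow */

    int digits = 0;
    for (long long t = sq; t > 0; t /= 10)
        digits++;                          /* count the digits of the square */

    long long shifted = sq;
    for (int i = 0; i < digits / 2; i++)
        shifted /= 10;                     /* discard the lower half of the digits */

    return (int)(shifted % 10);            /* one middle digit */
}

int main(void)
{
    int keys[] = {10, 11, 12};
    for (int i = 0; i < 3; i++)
        printf("H(%d) = %d\n", keys[i], mid_square_hash(keys[i]));
    return 0;   /* prints 0, 2, 4 as in the example */
}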

3. Folding Method
Given a {key: value} pair and a table size of 100 (indexes 0 - 99), the key
is broken down into segments of 2 digits each; the last segment may have
fewer digits. Now, the Hash Function would be:

H(x) = (sum of the segments) mod (size of the Hash Table)

o The last segment, which may have fewer digits, is added as-is when
calculating the Hash value.

For example, suppose "k" is a 10-digit key and the size of the table is
100 (0 - 99). k is divided into:

sum = (k1k2) + (k3k4) + (k5k6) + (k7k8) + (k9k10)

Now, H(x) = sum % 100

Let us now take an example:

The {key: value} pairs: {1234: "Sudha", 5678: "Venkat"}

Size of the table: 100 (0 - 99)

For {1234: "Sudha"}:

1234 = 12 + 34 = 46

46 % 100 = 46

For {5678: "Venkat"}:

5678 = 56 + 78 = 134

134 % 100 = 34
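A minimal C sketch of the folding method, assuming 2-digit segments and a table of size 100 as in the example (the function name is illustrative):

#include <stdio.h>

/* Digit-folding hash: split the key into 2-digit segments, add the
   segments, then reduce modulo the table size. */
int folding_hash(long long key, int table_size)
{
    long long sum = 0;
    while (key > 0) {
        sum += key % 100;   /* take the lowest 2-digit segment */
        key /= 100;         /* move on to the next segment */
    }
    return (int)(sum % table_size);
}

int main(void)
{
    printf("H(1234) = %d\n", folding_hash(1234, 100));  /* 12 + 34 = 46 */
    printf("H(5678) = %d\n", folding_hash(5678, 100));  /* 56 + 78 = 134; 134 mod 100 = 34 */
    return 0;
}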
4. Multiplication method
Unlike the three methods above, this method has more steps involved:

1. We must choose a constant between 0 and 1, say, A.


2. Multiply the key with the chosen A.
3. Now, take the fractional part from the product and multiply it by the table
size.
4. The Hash will be the floor (only the integer part) of the above result.

So, the Hash Function under this method is:

H(x) = floor(size × ((key × A) mod 1))

For example:

{Key: value} pairs: {1234: "Sudha", 5678: "Venkat"}

Size of the table: 100

A = 0.56

For {1234: "Sudha"}:

H(1234) = floor(size × (1234 × 0.56 mod 1))

= floor(100 × 0.04)

= floor(4) = 4

For {5678: "Venkat"}:

H(5678) = floor(size × (5678 × 0.56 mod 1))

= floor(100 × 0.68)

= floor(68) = 68

o It is considered best practice to use the multiplication method when
the Hash Table size is a power of 2, as this makes access and all the
operations faster.
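A minimal C sketch of the multiplication method with A = 0.56 and size 100 as above (the function name is illustrative; note that floating-point rounding can perturb results that land exactly on an integer boundary):

#include <stdio.h>
#include <math.h>

/* Multiplication-method hash: H(x) = floor(size * ((key * A) mod 1)). */
int multiplication_hash(int key, int table_size, double A)
{
    double product = key * A;
    double frac = product - floor(product);   /* fractional part of key*A */
    return (int)floor(table_size * frac);
}

int main(void)
{
    printf("H(1234) = %d\n", multiplication_hash(1234, 100, 0.56));  /* 4 */
    printf("H(5678) = %d\n", multiplication_hash(5678, 100, 0.56));  /* 68 */
    return 0;
}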

5. Digit Analysis:

In the digit analysis method for hashing, we transform identifiers
into numbers using a specified radix, r.

Then, we focus on specific digits within each identifier, eliminating
those with skewed distributions.

By repeating this process, we reduce the number of digits until we
obtain an address within the hash table's range.

For example, if our key is 1234567, we might select digits in positions
2 through 4 (yielding 234) and then manipulate them (e.g., reversing
or performing circular shifts) to form the home address.

This technique is particularly useful when dealing with static files
where all identifiers are known in advance.

In the digit analysis method, we manipulate specific digits from a key to
form an index. Here's how it works:
1. Divide the Key: Take the key (e.g., 1234567) and divide it into separate
parts. For instance, if we choose positions 2 through 4, we get 234.
2. Manipulate the Digits: You can manipulate these digits in various ways:
o Reverse the digits: If we reverse 234, we get 432.
o Perform a circular shift to the right: Shifting 234 to the right gives us 423.
3. Hash Function: Combine the manipulated digits to produce a hash key. For
example:
o If we add the digits: H(key) = 2 + 3 + 4 = 9
o If we reverse the digits: H(key) = 432
o If we perform a circular shift: H(key) = 423
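A minimal C sketch of selecting and reversing digits as in the example (the function name and the position convention are illustrative assumptions):

#include <stdio.h>

/* Digit analysis: pick 'len' digits of the key starting at 0-indexed
   position 'start' (counting from the most significant digit) and
   reverse them to form the home address. */
int digit_analysis_hash(const char *key, int start, int len)
{
    int value = 0;
    for (int i = start + len - 1; i >= start; i--)   /* read the chosen digits backwards */
        value = value * 10 + (key[i] - '0');
    return value;
}

int main(void)
{
    /* Positions 2 through 4 of 1234567 give "234"; reversed, 432. */
    printf("%d\n", digit_analysis_hash("1234567", 1, 3));
    return 0;
}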

Perfect hashing:
Perfect hashing is a type of hash function that ensures no collisions; that is, each
input has a unique hash value. It is particularly useful when the set of keys is
known in advance, allowing the creation of a hash function that maps each key to a
distinct slot in the hash table. There are two main types of perfect hashing:

1. Static Perfect Hashing: This is used when the set of keys does not change
over time. It involves creating a hash function that guarantees a collision-
free mapping for the given set of keys.
2. Dynamic Perfect Hashing: This is more complex and allows for keys to be
added or removed. It involves a more sophisticated structure that can adjust
to changes while maintaining the collision-free property.

Static Perfect Hashing

The process generally involves two levels of hashing:

1. First Level Hashing: A primary hash function divides the keys into several
smaller groups or buckets. The aim is to have each bucket contain a
manageable number of keys.
2. Second Level Hashing: For each bucket created by the first-level hash
function, a secondary hash function is designed such that there are no
collisions within that bucket. This often requires some trial and error or
probabilistic methods to find a suitable secondary hash function.

Steps to Create a Static Perfect Hash Function

1. Select a First-Level Hash Function: Choose a hash function that distributes


keys into buckets.
2. Create Buckets: Use the first-level hash function to assign each key to a
bucket.
3. Design Second-Level Hash Functions: For each bucket, find a secondary
hash function that maps keys in that bucket to unique slots.

Example

Consider a small set of keys {10, 22, 31, 40, 50}.

1. First-Level Hash Function: h1(x) = x mod 3

This function maps the keys as follows:

o 10 → 1
o 22 → 1
o 31 → 1
o 40 → 1
o 50 → 2

Now we have two non-empty buckets:

o Bucket 2: {50}
o Bucket 1: {10, 22, 31, 40}

2. Second-Level Hash Functions:

o For Bucket 2: Since there's only one key, no collision occurs.
o For Bucket 1: We need to find a secondary hash function h2 that maps
each key in the bucket to a unique slot. We might try several
functions until we find one that works; for instance,
h2(x) = (x mod 5) mod 4 is a candidate, provided it avoids
collisions within the bucket.
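As a small illustration of this trial-and-error step, the C sketch below tests whether the candidate h2(x) = (x mod 5) mod 4 is collision-free on Bucket 1. Running it actually rejects this candidate (10 and 40 both map to slot 0), showing why several candidates may need to be tried; all names here are illustrative:

#include <stdio.h>
#include <string.h>

/* Candidate second-level function for Bucket 1 = {10, 22, 31, 40}. */
int h2(int x) { return (x % 5) % 4; }

int main(void)
{
    int bucket[] = {10, 22, 31, 40};
    int used[4];
    memset(used, 0, sizeof used);

    for (int i = 0; i < 4; i++) {
        int slot = h2(bucket[i]);
        if (used[slot]) {                 /* two keys landed in the same slot */
            printf("collision at slot %d: reject this h2\n", slot);
            return 0;
        }
        used[slot] = 1;
    }
    printf("h2 is perfect for this bucket\n");
    return 0;
}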
Advantages:

 No collisions mean constant time complexity for search operations.
 Efficient use of space if designed carefully.

Disadvantages:

 Designing perfect hash functions can be computationally expensive,
especially for large sets of keys.
 Dynamic changes (additions/removals) are difficult to handle and often
require rehashing.

Universal hashing:
Universal hashing is a concept in computer science used to construct hash
functions with certain randomness properties.

This approach ensures that, on average, the performance of the hash function is
good, regardless of the input data.

Universal hashing aims to minimize the probability of collision (two keys hashing
to the same value) by using a family of hash functions and selecting one at random.

Definition and Properties

A family of hash functions H is called universal if, for any two distinct keys x and y, the
probability that a randomly chosen hash function h from H maps x and y to the same value is at
most 1/m, where m is the size of the hash table. Formally, for x ≠ y:

Pr [h(x) = h(y)] ≤ 1/m

Construction of Universal Hash Functions

To create a universal family of hash functions, we can use modular arithmetic and
random coefficients. One common method is to use the following formula:
h(x) = ((a·x + b) mod p) mod m

where:

 p is a prime number larger than the largest possible key.


 a and b are randomly chosen integers such that 1 ≤ a < p and 0 ≤ b < p.
 m is the size of the hash table.
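A minimal C sketch of drawing one function from this family at random (P = 101 is an assumed prime larger than the keys in this toy example; the function names are illustrative):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define P 101   /* assumed prime larger than the largest key */

int a, b;       /* random coefficients, fixed once per table */

/* Pick one member of the family h(x) = ((a*x + b) mod p) mod m at random. */
void pick_hash_function(void)
{
    srand((unsigned)time(NULL));
    a = 1 + rand() % (P - 1);   /* 1 <= a < p */
    b = rand() % P;             /* 0 <= b < p */
}

int universal_hash(int x, int m)
{
    return (int)((((long long)a * x + b) % P) % m);
}

int main(void)
{
    pick_hash_function();
    printf("h(42) = %d\n", universal_hash(42, 10));
    return 0;
}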

Benefits of Universal Hashing

 Reduced Collisions: By using a random hash function from a universal
family, the expected number of collisions for any given set of keys is
minimized.
 Security: In adversarial settings where an attacker might choose input keys
to cause collisions, universal hashing can prevent predictable collisions since
the attacker does not know which hash function will be used.
 Simplicity: Universal hash functions are simple to implement and analyze,
making them practical for real-world applications.
Applications

 Hash Tables: Universal hashing is used to improve the average-case
performance of hash tables.
 Cryptography: It is used in various cryptographic protocols to ensure
security properties.
 Load Balancing: Universal hashing can be applied to distribute loads
evenly across servers or other resources.

Advantages of Hashing in Data Structures

 Key-value support: Hashing is ideal for implementing key-value data
structures.

 Fast data retrieval: Hashing allows for quick access to elements with
constant-time complexity.

 Efficiency: Insertion, deletion, and searching operations are highly
efficient.

 Memory usage reduction: Hashing requires less memory as it allocates
a fixed space for storing elements.

 Scalability: Hashing performs well with large data sets, maintaining
constant access time.

 Security and encryption: Hashing is essential for secure data storage
and integrity verification.

What is Collision?

The hashing process generates a small number for a big key, so there is a
possibility that two keys could produce the same value. The situation where
a newly inserted key maps to an already occupied slot in the hash table is
called a collision, and it must be handled using some collision handling
technique.

Collision in Hashing

What is a Hash Collision?


A hash collision occurs when two different keys map to the same index in a
hash table.

This can happen even with a good hash function, especially if the hash table
is full or the keys are similar.

Causes of Hash Collisions:

 Poor Hash Function: A hash function that does not distribute keys
evenly across the hash table can lead to more collisions.

 High Load Factor: A high load factor (ratio of keys to hash table size)
increases the probability of collisions.

 Similar Keys: Keys that are similar in value or structure are more likely to
collide.
Overflow handling:

Collision Resolution Techniques


There are two types of collision resolution techniques:
1. Open Addressing:
 Linear Probing: Search for an empty slot sequentially
 Quadratic Probing: Search for an empty slot using a quadratic
function
 Double hashing

2. Closed Addressing:
 Chaining: Store colliding keys in a linked list or binary search tree at
each index

Open Addressing Collision Handling Technique in Hashing

 Open Addressing is a method for handling collisions.

 In Open Addressing, all elements are stored in the hash table itself.

 So at any point, the size of the table must be greater than or equal to
the total number of keys.

 This approach is also known as closed hashing. The entire procedure is
based upon probing.

Before looking at the types of probing, here is how the basic operations work
under open addressing (a sketch in C follows the list):

 Insert(k): Keep probing until an empty slot is found. Once an empty slot is
found, insert k.

 Search(k): Keep probing until the slot's key becomes equal to k or an
empty slot is reached.

 Delete(k): The delete operation is interesting. If we simply delete a key, then
a later search may fail. So slots of deleted keys are marked specially as "deleted".
Insert can place an item in a deleted slot, but search does not stop at a
deleted slot.
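The following is a minimal C sketch of these three operations, assuming linear probing (step size 1) as the probe sequence and a "deleted" marker per slot; the state names and the table size are illustrative choices:

#include <stdio.h>

#define S 7
enum state { EMPTY, OCCUPIED, DELETED };

int keys[S];            /* stored keys */
enum state st[S];       /* per-slot state; globals start as EMPTY */

void insert(int k)
{
    for (int i = 0; i < S; i++) {
        int idx = (k % S + i) % S;        /* linear probe sequence */
        if (st[idx] != OCCUPIED) {        /* empty or deleted slots are reusable */
            keys[idx] = k;
            st[idx] = OCCUPIED;
            return;
        }
    }
}

int search(int k)
{
    for (int i = 0; i < S; i++) {
        int idx = (k % S + i) % S;
        if (st[idx] == EMPTY)
            return -1;                    /* stop only at a truly empty slot */
        if (st[idx] == OCCUPIED && keys[idx] == k)
            return idx;
        /* a DELETED slot: keep probing past it */
    }
    return -1;
}

void delete(int k)
{
    int idx = search(k);
    if (idx != -1)
        st[idx] = DELETED;                /* mark as deleted, do not empty */
}

int main(void)
{
    insert(10);                 /* 10 % 7 = 3 */
    insert(17);                 /* 17 % 7 = 3, probes on to slot 4 */
    delete(10);
    printf("17 found at slot %d\n", search(17));  /* still found past the tombstone */
    return 0;
}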

Different ways of Open Addressing:

1. Linear Probing:

In linear probing, the hash table is searched sequentially, starting from
the original hash location.

If the location that we get is already occupied, then we check the next
location.

The function used for rehashing is: rehash(key) = (n + 1) % table_size,
where n is the index just tried.

The typical gap between two probes is 1, as seen in the example below:

Let hash(x) be the slot index computed using a hash function
and S be the table size.

If slot hash(x) % S is full, then we try (hash(x) + 1) % S

If (hash(x) + 1) % S is also full, then we try (hash(x) + 2) % S

If (hash(x) + 2) % S is also full, then we try (hash(x) + 3) % S

Example: Let us consider a simple hash function, "key mod 5",
and a sequence of keys to be inserted: 50, 70, 76, 85 and 93.
2. Quadratic Probing

If you observe carefully, you will see that the interval between probes
increases with each successive attempt.

Quadratic probing is a method with the help of which we can solve the
problem of primary clustering that was discussed above.

In this method, we look for the i^2'th slot in the i'th iteration. We
always start from the original hash location. If the location is occupied,
then we check the slots given by the quadratic sequence.

Let hash(x) be the slot index computed using the hash function.

If slot hash(x) % S is full, then we try (hash(x) + 1*1) % S

If (hash(x) + 1*1) % S is also full, then we try (hash(x) + 2*2) % S

If (hash(x) + 2*2) % S is also full, then we try (hash(x) + 3*3) % S

Example: Let us consider table size = 7, the hash function Hash(x) = x % 7,
and collision resolution strategy f(i) = i^2. Insert 22, 30, and 50.
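A minimal C sketch of this example (table size 7, Hash(x) = x % 7, probe offsets i*i); the function name is illustrative:

#include <stdio.h>

#define S 7
#define EMPTY -1

int table[S];

/* Quadratic probing: try (Hash(x) + i*i) % S for i = 0, 1, 2, ... */
void insert_quadratic(int key)
{
    for (int i = 0; i < S; i++) {
        int idx = (key % S + i * i) % S;
        if (table[idx] == EMPTY) {
            table[idx] = key;
            printf("%d placed at index %d (i = %d)\n", key, idx, i);
            return;
        }
    }
    printf("no free slot found for %d\n", key);
}

int main(void)
{
    for (int i = 0; i < S; i++) table[i] = EMPTY;
    insert_quadratic(22);   /* 22 % 7 = 1, free */
    insert_quadratic(30);   /* 30 % 7 = 2, free */
    insert_quadratic(50);   /* 50 % 7 = 1 taken, (1+1) % 7 = 2 taken, (1+4) % 7 = 5 free */
    return 0;
}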
Double Hashing

 Double hashing is a collision resolution technique used in hash tables.

It works by using two hash functions to compute two different hash values
for a given key.

The first hash function is used to compute the initial hash value, and the
second hash function is used to compute the step size for the probing
sequence.

Double hashing has the ability to have a low collision rate, as it uses two
hash functions to compute the hash value and the step size.

This means that the probability of a collision occurring is lower than in
other collision resolution techniques such as linear probing or quadratic
probing.
 However, double hashing has a few drawbacks.

First, it requires the use of two hash functions, which can increase the
computational complexity of the insertion and search operations.

Second, it requires a good choice of hash functions to achieve good
performance.

If the hash functions are not well designed, the collision rate may still be
high.

Advantages of Double hashing

 The advantage of double hashing is that it is one of the best forms of
probing, producing a uniform distribution of records throughout a hash
table.

 This technique does not yield any clusters.

 It is one of the effective methods for resolving collisions.

Double hashing can be done using:

(hash1(key) + i * hash2(key)) % TABLE_SIZE

Here hash1() and hash2() are hash functions and TABLE_SIZE is the size of
the hash table. (We repeat with increasing i when a collision occurs.)

Method 1: The first hash function is typically hash1(key) = key % TABLE_SIZE.

A popular second hash function is hash2(key) = PRIME - (key % PRIME),
where PRIME is a prime smaller than TABLE_SIZE.

A good second hash function:

 must never evaluate to zero, and

 must ensure that all cells can be probed.
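Below is a minimal C sketch of double hashing with hash1 and hash2 chosen as above (TABLE_SIZE = 7 and PRIME = 5 are illustrative values, and the sample keys are not from the notes):

#include <stdio.h>

#define TABLE_SIZE 7
#define PRIME 5        /* a prime smaller than TABLE_SIZE */
#define EMPTY -1

int table[TABLE_SIZE];

int hash1(int key) { return key % TABLE_SIZE; }

/* PRIME - (key % PRIME) lies in 1..PRIME, so it never evaluates to zero. */
int hash2(int key) { return PRIME - (key % PRIME); }

void insert_double(int key)
{
    for (int i = 0; i < TABLE_SIZE; i++) {
        int idx = (hash1(key) + i * hash2(key)) % TABLE_SIZE;
        if (table[idx] == EMPTY) {
            table[idx] = key;
            return;
        }
    }
    printf("could not place %d\n", key);
}

int main(void)
{
    for (int i = 0; i < TABLE_SIZE; i++) table[i] = EMPTY;
    int keys[] = {27, 43, 69, 34};     /* illustrative keys */
    for (int i = 0; i < 4; i++) insert_double(keys[i]);
    for (int i = 0; i < TABLE_SIZE; i++)
        printf("slot %d: %d\n", i, table[i]);
    return 0;
}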
Closed Addressing:

Separate Chaining:
The idea behind separate chaining is to attach a linked list, called a
chain, to each slot of the array.

The linked list data structure is used to implement this technique.

So what happens is, when multiple elements are hashed into the same
slot index, these elements are inserted into a singly-linked list
which is known as a chain.

All those elements that hash into the same slot index are inserted
into that slot's linked list.

Now, we can search for a key K in the linked list by just linearly
traversing it. If the key of any entry is equal to K, it means that we
have found our entry.

If we have reached the end of the linked list and haven't found
our entry, it means that the entry does not exist.

Hence, the conclusion is that in separate chaining, if two different
elements have the same hash value, we store both elements in
the same linked list, one after the other.

Example: Let us consider a simple hash function, "key mod 5", and a
sequence of keys: 12, 22, 15, 25.
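A minimal C sketch of separate chaining for this example (hash function "key mod 5", keys 12, 22, 15, 25); the struct and function names are illustrative:

#include <stdio.h>
#include <stdlib.h>

#define M 5   /* table size; hash function is key % 5 */

struct node {
    int key;
    struct node *next;
};

struct node *table[M];   /* each slot heads a chain (globals start as NULL) */

void insert_chain(int key)
{
    int idx = key % M;
    struct node *n = malloc(sizeof *n);
    n->key = key;
    n->next = table[idx];   /* prepend to the chain at this slot */
    table[idx] = n;
}

int search_chain(int key)
{
    for (struct node *p = table[key % M]; p != NULL; p = p->next)
        if (p->key == key)      /* linear traversal of the chain */
            return 1;
    return 0;                   /* end of chain: entry does not exist */
}

int main(void)
{
    int keys[] = {12, 22, 15, 25};   /* keys from the example */
    for (int i = 0; i < 4; i++) insert_chain(keys[i]);
    printf("found 22? %d\n", search_chain(22));  /* 12 and 22 share slot 2 */
    return 0;
}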
Secure Hash Function
A secure hash function is a cryptographic algorithm that takes an input (or
"message") and returns a fixed-size string of bytes, typically in the form of a hash
value.

The output, often called a "digest," appears random and unique to each unique
input.

Here are some key properties and functions of secure hash functions:

Properties of Secure Hash Functions

1. Deterministic: The same input will always produce the same hash output.
2. Fast Computation: It should be quick to compute the hash value for any
given input.
3. Small Changes in Input Change the Output: A small change to the input
should produce a significantly different hash.
4. Collision Resistance: It should be infeasible to find two different inputs that
produce the same hash output.
5. Fixed Output Length: The output length of the hash should be fixed,
regardless of the input length.

Common Secure Hash Functions

1. MD5 (Message Digest Algorithm 5): Produces a 128-bit hash value.
However, it's no longer considered secure due to vulnerabilities to collision
attacks.
2. SHA-1 (Secure Hash Algorithm 1): Produces a 160-bit hash value. Also
considered insecure due to vulnerability to collision attacks.
3. SHA-2 (Secure Hash Algorithm 2): Includes SHA-224, SHA-256, SHA-384,
and SHA-512, producing hash values of 224, 256, 384, and 512 bits
respectively. SHA-256 and SHA-512 are widely used and considered secure.
4. SHA-3 (Secure Hash Algorithm 3): The latest member of the Secure Hash
Algorithm family, using the Keccak algorithm. It is considered secure and is
available in different output lengths, similar to SHA-2.

Applications of Secure Hash Functions

 Data Integrity: Ensuring that data has not been altered.
 Password Storage: Storing hashed passwords instead of plain text.
 Digital Signatures: Verifying the authenticity and integrity of a message.
 Blockchain and Cryptocurrencies: Ensuring the integrity of data within
blocks and transactions.

MD5 (Message Digest Algorithm 5)

MD5 (Message Digest Algorithm 5) is a widely used cryptographic hash function
that produces a 128-bit (16-byte) hash value, typically rendered as a 32-character
hexadecimal number.

It was designed by Ronald Rivest in 1991 to replace an earlier hash function,
MD4.

Characteristics of MD5

1. Output Size: MD5 generates a 128-bit (16-byte) hash value.
2. Input Size: MD5 can take any input size and produce a fixed-size hash.
3. Performance: MD5 is designed to be fast and efficient, making it suitable
for many applications.
3. Performance: MD5 is designed to be fast and efficient, making it suitable
for many applications.
MD5 Algorithm Process

1. Padding the Message: The original message is padded so that its length is
congruent to 448 modulo 512. This involves appending a single '1' bit,
followed by a number of '0' bits, and finally the length of the original
message as a 64-bit integer.
2. Initialization: Four 32-bit variables (A, B, C, and D) are initialized with
specific constants.
3. Processing in Blocks: The padded message is processed in 512-bit blocks.
Each block modifies the values of A, B, C, and D through a series of bitwise
operations and additions.
4. Final Output: The final values of A, B, C, and D are concatenated to
produce the final 128-bit hash value.

Here is a step-by-step example of how the MD5 algorithm processes an input to
produce a hash value.

Example Input

Let's take the simple input string:

"hello"
Step-by-Step Process

1. Padding the Message:


o Convert the message to its binary form:

"hello" -> 01101000 01100101 01101100 01101100 01101111

o Append a single '1' bit to the end of the message:

01101000 01100101 01101100 01101100 01101111 1

o Append enough '0' bits so that the length of the message is congruent
to 448 modulo 512. In bytes: the content before the 8-byte length field
must reach 56 bytes; the message is 5 bytes and the appended '1' bit
occupies 1 more byte (10000000), leaving 56 - 5 - 1 = 50 bytes of zero
padding:
01101000 01100101 01101100 01101100 01101111 10000000 +
00000000...00000000 (448 bits in total)

o Append the original length of the message, in bits, as a 64-bit integer.
The length of "hello" is 5 bytes (40 bits), and 40 in binary is 00101000.
MD5 stores this 64-bit length least-significant byte first, so the
appended bytes are:

00101000 00000000 00000000 00000000 00000000 00000000 00000000 00000000

2. Initialization:
o Initialize the four MD5 buffer variables (A, B, C, and D) to specific
constants:

A = 0x67452301
B = 0xEFCDAB89
C = 0x98BADCFE
D = 0x10325476

3. Processing in 512-bit Blocks:


o Split the padded message into 512-bit blocks (since our message is
short and fits into one block, we process only one block).
o Each block is divided into 16 words of 32 bits each.

4. Processing Each Block:


o For each 512-bit block, perform a series of operations (64 steps) that
involve bitwise operations and modular additions.
o The operations are grouped into four rounds, each round consisting of
16 operations:
 Round 1: Perform operations using a non-linear function, the
current values of A, B, C, and D, and one of the 16 words from
the block.
 Round 2: Similar to Round 1 but using a different non-linear
function and shifting.
 Round 3: Another set of operations with yet another non-linear
function.
 Round 4: Final set of operations.

o After processing the block, update the values of A, B, C, and D with


the results.
5. Final Output:
o After processing all blocks, concatenate the final values of A, B, C,
and D to produce the final 128-bit hash value.

5d41402abc4b2a76b9719d911017c592

(This is the well-known 128-bit MD5 digest of "hello", written as 32
hexadecimal characters.)

Security of MD5

MD5 was initially considered secure but has since been found vulnerable to
various types of attacks:

1. Collision Attacks: A collision occurs when two different inputs produce the
same hash output. Researchers have demonstrated that MD5 is susceptible to
collision attacks, making it insecure for cryptographic purposes.
2. Pre-image and Second Pre-image Attacks: While MD5 is more resistant to
pre-image and second pre-image attacks than to collision attacks, its
weaknesses in collision resistance make it unsuitable for security-sensitive
applications.

Applications of MD5

Despite its vulnerabilities, MD5 is still used in various non-cryptographic


applications, such as:

1. Checksums: Verifying the integrity of files and data transfers.


2. Fingerprinting: Generating unique identifiers for data and files.
3. Non-cryptographic Uses: MD5 is often used in situations where security is
not a primary concern, such as checksums for data integrity verification.
Theoretical Evaluation of Overflow
Techniques:
Evaluating the various overflow handling techniques theoretically involves
considering several factors: space complexity, time complexity (for insertion,
deletion, and search operations), and the impact of load factors.

Here's a detailed theoretical evaluation of the common techniques:

1. Open Addressing
Linear Probing

 Time Complexity:
o Average case: O(1) for insertion, deletion, and search when
the load factor is low.
o Worst case: O(n) when the load factor approaches 1, due to
clustering.
 Space Complexity:
o O(n), where n is the number of slots in the hash table.
 Pros:
o Simple to implement.
o Good cache performance due to locality of reference.
 Cons:
o Primary clustering can degrade performance.

Quadratic Probing

 Time Complexity:
o Average case: O(1) for insertion, deletion, and search.
o Worst case: O(n) in rare cases due to secondary clustering.
 Space Complexity:
o O(n).
 Pros:
o Reduces primary clustering compared to linear probing.
 Cons:
o Secondary clustering still exists.
o May fail to find an empty slot even if the table is not full (requires
careful selection of constants).
Double Hashing

 Time Complexity:
o Average case: O(1) for insertion, deletion, and search.
o Worst case: O(n) if both hash functions produce poor results.
 Space Complexity:
o O(n).
 Pros:
o Minimizes clustering.
o More uniform distribution.
 Cons:
o Slightly more complex to implement due to multiple hash functions.

2. Separate Chaining

 Time Complexity:
o Average case: O(1) for insertion.
o Worst case: O(n) for search and deletion if all keys hash to
the same slot.
o Average case for search and deletion: O(1 + α),
where α is the load factor.
 Space Complexity:
o O(n + m), where m is the number of elements stored.
 Pros:
o Simple to implement.
o Load factor can exceed 1.
o Easy to resize.
 Cons:
o Extra memory required for pointers in linked lists.
o Cache performance can be poor.

3. Cuckoo Hashing

 Time Complexity:
o Average case: O(1) for insertion, deletion, and search.
o Worst case: O(n) for insertion due to potential rehashing.
 Space Complexity:
o O(n).
 Pros:
o Guarantees O(1) worst-case lookup time.
o Simplifies deletion.
 Cons:
o Insertion can be complex and may require multiple relocations.
o High memory overhead due to maintaining multiple hash tables.

Summary of Performance and Use Cases

 Linear Probing: Simple and efficient with good cache performance, but
suffers from clustering.
 Quadratic Probing: Reduces clustering but requires careful selection of
parameters.
 Double Hashing: Provides good distribution at the cost of more complex
implementation.
 Separate Chaining: Handles high load factors well but has poor cache
performance.
 Cuckoo Hashing: Guarantees constant-time lookups but can be complex
and requires multiple tables.

The choice of technique depends on the specific requirements of the application,
such as memory constraints, the expected load factor, and performance needs for
insertion, deletion, and search operations.

Dynamic Hashing in DBMS:


Hashing in DBMS is used for searching for the needed data on the disk.

As static hashing is not efficient for large databases, dynamic hashing
provides a way to work efficiently with databases that can be scaled.

What is Dynamic Hashing in DBMS?


Dynamic hashing is a technique used to dynamically add and
remove data buckets on demand.

Dynamic hashing can be used to solve problems like bucket
overflow, which can occur in static hashing.

In this method, the data bucket size grows or shrinks as the
number of records increases or decreases.

This allows easy insertion and deletion in the database and reduces
performance issues.

Motivation for Dynamic Hashing:


Dynamic hashing, including techniques like extendable hashing, addresses the
limitations and inefficiencies of static hashing methods.

Here are the motivations for dynamic hashing:

1. Handling Dynamic Data Sizes

Static Hashing Limitations:

 Static hash tables have a fixed size determined at creation. When the data set
grows, the table can become full, leading to overflow issues.
 If the table is too large initially, it wastes space when the data set is small.

Dynamic Hashing Solution:

 Dynamic hashing allows the hash table to grow and shrink as needed,
efficiently accommodating varying data sizes without significant space
wastage or performance degradation.

2. Avoiding Overflow Chains

Static Hashing Limitations:

 When a bucket in a static hash table becomes full, additional records are
typically handled using overflow chains or linked lists.
 These overflow chains can become long and degrade performance, as they
require multiple accesses to retrieve a single record.

Dynamic Hashing Solution:

 Dynamic hashing techniques like extendable hashing split buckets and
redistribute records to avoid overflow chains.
 This ensures more uniform distribution of records across buckets,
maintaining efficient access times.

3. Maintaining Consistent Performance

Static Hashing Limitations:

 As the load factor of a static hash table increases (i.e., as it becomes more
full), the likelihood of collisions and the average access time for records
increase.

Dynamic Hashing Solution:

 By dynamically adjusting the structure (e.g., doubling the directory size in
extendable hashing), dynamic hashing maintains a low load factor, thereby
ensuring consistent and efficient performance for insertions, deletions, and
lookups.

4. Flexibility and Adaptability

Static Hashing Limitations:

 Static hash tables are not adaptable to changes in the dataset size without a
costly rehashing operation, which involves creating a larger table and
re-inserting all existing records.

Dynamic Hashing Solution:

 Dynamic hashing methods can adapt on-the-fly to changes in dataset size,
allowing for incremental growth or reduction without needing a complete
rehash.
 This makes them more flexible and suitable for applications with highly
variable data sizes.
 This makes them more flexible and suitable for applications with highly
variable data sizes.
5. Efficient Space Utilization

Static Hashing Limitations:

 Choosing the initial size of a static hash table is challenging.
Underestimating leads to overflows, while overestimating leads to wasted
space.

Dynamic Hashing Solution:

 Dynamic hashing adjusts the table size based on actual data needs, ensuring
that space is used efficiently without significant waste or the need for
frequent resizing operations.

6. Simplified Management of Large Datasets

Static Hashing Limitations:

 Managing large datasets with static hash tables can be cumbersome,
especially when dealing with data growth that exceeds initial expectations.

Dynamic Hashing Solution:

 Dynamic hashing, through its ability to grow and shrink, simplifies the
management of large and growing datasets, making it easier to handle data
expansion without major disruptions or performance hits.

Dynamic Hashing Using Directories (Extendible Hashing):

Extendible Hashing is a dynamic hashing method wherein directories and
buckets are used to hash data.

It is an aggressively flexible method in which the hash function also
experiences dynamic changes.

Main features of Extendible Hashing: The main features of this hashing
technique are:

Directories: The directories store addresses of the buckets in pointers. An id is
assigned to each directory, which may change each time Directory
Expansion takes place.

Buckets: The buckets are used to hash the actual data.

Basic Structure of Extendible Hashing :

Frequently used terms in Extendible Hashing :

 Directories: These containers store pointers to buckets. Each directory is
given a unique id, which may change each time expansion takes place.
The hash function returns this directory id, which is used to navigate to the
appropriate bucket. Number of Directories = 2^Global Depth.

 Buckets: They store the hashed keys. Directories point to buckets. A bucket
may have more than one pointer to it if its local depth is less than the
global depth.

 Global Depth: It is associated with the Directories. They denote the number
of bits which are used by the hash function to categorize the keys. Global
Depth = Number of bits in directory id.

 Local Depth: It is the same as the Global Depth except that Local Depth is
associated with the buckets rather than the directories. The local depth,
in accordance with the global depth, is used to decide the action to be
performed when an overflow occurs. Local Depth is always less than or
equal to the Global Depth.

 Bucket Splitting: When the number of elements in a bucket exceeds a
particular size, the bucket is split into two parts.

 Directory Expansion: Directory Expansion takes place when a bucket
overflows and the local depth of the overflowing bucket is equal to the
global depth.

Basic Working of Extendible Hashing :

 Step 1 – Analyze Data Elements: Data elements may exist in various forms,
e.g., integer, string, float, etc. Here, let us consider data elements of type
integer, e.g., 49.

 Step 2 – Convert into binary format: Convert the data element into binary
form. For string elements, consider the ASCII equivalent integer of the
starting character and then convert that integer into binary form. Since we
have 49 as our data element, its binary form is 110001.

 Step 3 – Check Global Depth of the directory. Suppose the global depth of
the Hash-directory is 3.

 Step 4 – Identify the Directory: Consider the 'Global-Depth' number of
LSBs in the binary number and match it to the directory id.
E.g., the binary obtained is 110001 and the global depth is 3. So, the hash
function will return the 3 LSBs of 110001, viz. 001.
 Step 5 – Navigation: Now, navigate to the bucket pointed by the directory
with directory-id 001.

 Step 6 – Insertion and Overflow Check: Insert the element and check if the
bucket overflows. If an overflow is encountered, go to step 7 followed
by Step 8, otherwise, go to step 9.

 Step 7 – Tackling the Overflow Condition during Data Insertion: Many
times, while inserting data in the buckets, the bucket may overflow.

In such cases, we need to follow an appropriate procedure to avoid
mishandling of data.

First, check whether the local depth is less than or equal to the global depth.
Then choose one of the cases below.

o Case 1: If the local depth of the overflowing bucket is equal to the
global depth, then Directory Expansion as well as Bucket Split
needs to be performed. Then increment the global depth and the local
depth value by 1, and assign appropriate pointers.
Directory expansion will double the number of directories present in
the hash structure.

o Case 2: If the local depth is less than the global depth, then only a
Bucket Split takes place. Then increment only the local depth value
by 1, and assign appropriate pointers.
 Step 8 – Rehashing of Split Bucket Elements: The Elements present in the
overflowing bucket that is split are rehashed w.r.t the new global depth of the
directory.

 Step 9 – The element is successfully hashed.
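As a small illustration of Steps 2-4, this C sketch extracts the 'Global Depth' least significant bits of a key to obtain its directory id (the function name is an illustrative choice):

#include <stdio.h>

/* The hash function of extendible hashing: keep the 'global_depth'
   least significant bits of the key as the directory id. */
int directory_id(int key, int global_depth)
{
    return key & ((1 << global_depth) - 1);   /* mask off global_depth LSBs */
}

int main(void)
{
    /* 49 is 110001 in binary; with global depth 3, the 3 LSBs 001 give id 1. */
    printf("directory id of 49 at depth 3: %d\n", directory_id(49, 3));
    return 0;
}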

Example based on Extendible Hashing: Now, let us consider a
prominent example of hashing the following
elements: 16, 4, 6, 22, 24, 10, 31, 7, 9, 20, 26.

Bucket Size: 3 (assumed)

Hash Function: Suppose the global depth is X. Then the Hash Function returns
the X LSBs.

 Solution: First, calculate the binary forms of each of the given numbers.
16- 10000
4- 00100
6- 00110
22- 10110
24- 11000
10- 01010
31- 11111
7- 00111
9- 01001
20- 10100
26- 11010

 Initially, the global depth and local depth are always 1. Thus, the hashing frame
looks like this:

 Inserting 16:
The binary format of 16 is 10000 and global-depth is 1. The hash function
returns 1 LSB of 10000 which is 0. Hence, 16 is mapped to the directory with
id=0.

 Inserting 4 and 6:
Both 4 (00100) and 6 (00110) have 0 as their LSB. Hence, they are hashed as
follows:
 Inserting 22: The binary form of 22 is 10110. Its LSB is 0. The bucket
pointed to by directory 0 is already full. Hence, overflow occurs.

 As directed by Step 7 - Case 1, since Local Depth = Global Depth, the bucket
splits and directory expansion takes place.

Also, rehashing of numbers present in the overflowing bucket takes place after
the split.

And, since the global depth is incremented by 1, the global depth is now 2.

Hence, 16, 4, 6, 22 are now rehashed w.r.t. their 2 LSBs:

[16(10000), 4(00100), 6(00110), 22(10110)]
Notice that the bucket that did not overflow has remained untouched. But,
since the number of directories has doubled, we now have two directories, 01 and
11, pointing to the same bucket. This is because the local depth of that bucket has
remained 1, and any bucket having a local depth less than the global depth is
pointed to by more than one directory.
 Inserting 24 and 10: 24 (11000) and 10 (01010) are hashed into the buckets
pointed to by the directories with ids 00 and 10. Here, we encounter no
overflow condition.
 Inserting 31, 7, 9: All of these elements [31 (11111), 7 (00111), 9 (01001)] have
either 01 or 11 as their 2 LSBs.
 Hence, they are mapped to the buckets pointed to by 01 and 11. We do not
encounter any overflow condition here.

 Inserting 20: Insertion of data element 20 (10100) will again cause an
overflow.

 20 is inserted in the bucket pointed to by 00. As directed by Step 7 - Case 1,
since the local depth of the bucket = the global depth, directory expansion
(doubling) takes place along with bucket splitting.
 Elements present in the overflowing bucket are rehashed with the new global
depth. Now, the new hash table looks like this:

 Inserting 26: The global depth is 3. Hence, the 3 LSBs of 26 (11010), i.e.,
010, are considered. Therefore 26 best fits in the bucket pointed to by
directory 010.
 That bucket overflows and, as directed by Step 7 - Case 2, since the local
depth of the bucket < the global depth (2 < 3), the directories are not doubled;
only the bucket is split and its elements are rehashed.

 Finally, the output of hashing the given list of numbers is obtained.

 Hashing of 11 Numbers is Thus Completed.

Directoryless Dynamic Hashing:

Directoryless dynamic hashing is a technique where the hash table dynamically
adjusts its size without the need for a separate directory to manage the buckets.

This approach simplifies the structure and often relies on techniques such as linear
hashing to handle dynamic growth and reorganization of the hash table.

Key Concepts of Directoryless Dynamic Hashing

Linear Hashing

Linear hashing is a directoryless dynamic hashing technique that
grows the hash table by splitting buckets in a linear sequence.

It avoids the need for a global directory by directly managing the bucket splits and
the rehashing process.
How Linear Hashing Works

1. Initial Setup: Start with an initial number of buckets, say M, and a hash
function h(k) = k mod M.
2. Split Pointer: Maintain a split pointer that initially points to the first bucket.
3. Progressive Splitting: When a bucket overflows, the bucket pointed to by
the split pointer is split, and the split pointer is incremented.
4. Doubling and Modifying the Hash Function: When the split pointer
reaches the end of the current set of buckets, the number of buckets doubles,
and the hash function is adjusted accordingly.

Operations in Linear Hashing

 Insertion: Insert the key into the appropriate bucket. If the bucket
overflows, split the bucket pointed to by the split pointer.
 Splitting: When splitting a bucket, redistribute the keys using the new hash
function, and increment the split pointer.
 Doubling: When the split pointer completes a cycle through the buckets,
double the number of buckets and update the hash function.
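A minimal C sketch of the address computation this implies, under the common convention that buckets below the split pointer have already been split and are re-addressed with the doubled-range hash function; M, level, and the function name are illustrative assumptions, not from the notes:

#include <stdio.h>

/* Address computation in linear hashing: buckets before the split
   pointer have already been split, so they are re-addressed with the
   doubled-range hash function. M is the initial bucket count and
   'level' counts completed doubling rounds. */
int linear_hash_address(int key, int M, int level, int split_ptr)
{
    int addr = key % (M << level);          /* h_level(k) = k mod (M * 2^level) */
    if (addr < split_ptr)
        addr = key % (M << (level + 1));    /* already split: use h_(level+1) */
    return addr;
}

int main(void)
{
    /* With M = 4, level = 0 and the split pointer at 2: key 9 first maps to
       9 % 4 = 1, which is below the split pointer, so it is re-addressed
       as 9 % 8 = 1. */
    printf("address of 9: %d\n", linear_hash_address(9, 4, 0, 2));
    return 0;
}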
