Module 6 DSA 24
HASHING
by
Dr. Swapnil Gundewar
DEPARTMENT OF AIDS/AIML/CSD/CSME/CSE
FACULTY OF ENGINEERING
Course outcomes
Topic CO
Introduction to Hashing 1
Collision-Resolution Technique 2
File Structure: Concepts of fields, records and files 3
• Hashing creates a mapping between a key and a value. This is done through the use
of mathematical formulas known as hash functions.
• The result of the hash function is referred to as a hash value or hash.
• Division Method
• Multiplication Method
• Mid-Square Method
• Folding Method
• Universal Hashing
• Perfect Hashing
Division Method
Keys: 15,28,37,45,60,75,92
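A minimal sketch of the division method, h(k) = k mod m, applied to the keys listed above. The table size m = 7 is an assumption borrowed from the multiplication example later in this module (the slide does not state it), and the name division_hash is illustrative.

```python
def division_hash(key: int, m: int) -> int:
    """Division method: the index is the remainder of key divided by m."""
    return key % m


keys = [15, 28, 37, 45, 60, 75, 92]
m = 7  # assumed table size; not given on the slide

indices = [division_hash(k, m) for k in keys]
for k, idx in zip(keys, indices):
    print(f"h({k}) = {k} mod {m} = {idx}")
```

With this assumed table size, keys 15 and 92 both map to index 1, which is a collision that a collision-resolution technique has to handle.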
Multiplication Method
• The multiplication method is another popular technique for designing hash functions.
• It involves multiplying the key by a constant, extracting the fractional part, and then
multiplying it by the size of the hash table to determine the index.
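• Hash Function: h(k) = floor(m × (k × A mod 1))
• Here, k is the key, m is the size of the hash table, A is a constant (typically a fraction
between 0 and 1), and (k × A mod 1) is the fractional part of k × A.
Steps:
• Step 1: Multiply the key k by the constant A.
• Step 2: Extract the fractional part of k × A.
• Step 3: Multiply the fractional part by the table size m.
• Step 4: Take the floor of the result to obtain the index.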
• Let's work through an example using the multiplication method. Assume you have a
hash table of size 7 (m=7) and you want to hash the following 7 keys:
• Let's choose A = 0.6180339887.
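The multiplication method can be sketched as follows, using m = 7 and A = 0.6180339887 from the example. The key list did not survive on this slide, so the seven keys from the Division Method slide are reused here as an assumption; the function name multiplication_hash is illustrative.

```python
import math


def multiplication_hash(key: int, m: int, A: float = 0.6180339887) -> int:
    """Multiplication method: floor(m * fractional_part(key * A))."""
    fractional = (key * A) % 1.0      # keep only the fractional part of key * A
    return math.floor(m * fractional)  # scale to the table size and round down


keys = [15, 28, 37, 45, 60, 75, 92]  # assumed: keys taken from the Division Method slide
m = 7

indices = [multiplication_hash(k, m) for k in keys]
for k, idx in zip(keys, indices):
    print(f"h({k}) = {idx}")
```

Every index falls in the range 0 to m - 1 because the fractional part is always smaller than 1.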
Mid-Square Method
• The Mid-Square Method involves squaring the key and then extracting the middle
portion of the resulting number as the hash value.
• This method is particularly useful because it tends to distribute keys more uniformly
across the hash table, even if the input keys are not well-distributed.
Steps in the Mid-Square Method
• Square the Key: Compute the square of the key.
• Extract the Middle Digits: From the squared result, extract a specific number of digits
from the middle. The number of digits to extract typically corresponds to the size of
the hash table or the number of bits required.
• Calculate the Index: The extracted middle digits are used as the index in the hash
table.
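A small sketch of the mid-square steps, under some assumptions that are not on the slide: keys are treated as decimal numbers, the square is zero-padded to 4 digits, and the middle 2 digits are extracted (so the table has 100 slots). The keys are reused from the Division Method slide, and mid_square_hash is an illustrative name.

```python
def mid_square_hash(key: int, r: int = 2, width: int = 4) -> int:
    """Square the key, pad the square to `width` digits, return the middle r digits."""
    squared = str(key * key).zfill(width)  # e.g. 15 * 15 = 225 -> "0225"
    start = (len(squared) - r) // 2        # position where the middle r digits begin
    return int(squared[start:start + r])


keys = [15, 28, 37, 45, 60, 75, 92]  # assumed: keys from the Division Method slide
indices = [mid_square_hash(k) for k in keys]
for k, idx in zip(keys, indices):
    print(f"{k}^2 = {k * k:04d} -> index {idx}")
```

For example, 60 squared is 3600, and its middle two digits give index 60.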
Folding Method
• The Folding Method is another hashing technique that involves dividing the key into
parts, applying a specific operation (like addition), and then combining the results to
form the final hash value. This method is particularly useful when the keys are long
numbers or strings.
Steps in the Folding Method
• Divide the Key: Split the key into equal parts. The parts can be of any size, but they
are typically chosen to match the size of the desired hash table or hash function.
• Fold the Parts: Combine the parts using a simple operation, typically addition, to
produce a single number. The parts might be folded using other methods, such as
reversing certain segments before summing them.
• Apply the Hash Function: The resulting number is then reduced by the hash table
size using a modulus operation to ensure it fits within the bounds of the hash table.
Variations of Folding
1. Simple Folding: The key is divided into equal parts, and the parts are added together
directly to form the hash value.
2. Folding by Reversal: The key is divided into parts, but before adding, alternate parts
may be reversed to ensure that all parts contribute uniquely to the final hash value.
3. Folding by Boundaries: The key is divided at certain boundaries (like between digits),
and then the parts are added together.
Example of the Folding Method
• Let's consider an 8-digit key, 12345678, and a hash table of size m = 100. We'll use the
simple folding method in this example.
1. Divide the Key: Split the key into parts of 4 digits each: Part 1 = 1234, Part 2 = 5678.
2. Fold the Parts: Add the parts together: 1234 + 5678 = 6912.
3. Apply the Hash Function: Reduce the result using the modulus operation with the hash
table size: 6912 mod 100 = 12. The final hash value is 12, which is the index where the
key will be stored in the hash table.
Example with Reversal
• Let's use the same key, 12345678, but apply folding by reversal:
1. Divide the Key: Split the key into 4-digit parts: Part 1 = 1234, Part 2 = 5678.
2. Reverse the Second Part: Reverse Part 2 before adding: reversed Part 2 = 8765.
3. Fold the Parts: Add the original Part 1 and the reversed Part 2: 1234 + 8765 = 9999.
4. Apply the Hash Function: Reduce the result using the modulus operation:
9999 mod 100 = 99. The final hash value is 99, which is the index in the hash table.
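The two folding examples can be reproduced with a short sketch. The function name fold_hash and its parameters are illustrative; the key 12345678, the 4-digit part length and m = 100 come from the examples above.

```python
def fold_hash(key: int, part_len: int, m: int, reverse_alternate: bool = False) -> int:
    """Folding method: split the decimal digits into parts, add them, reduce mod m."""
    digits = str(key)
    parts = [digits[i:i + part_len] for i in range(0, len(digits), part_len)]
    if reverse_alternate:
        # Folding by reversal: reverse every second part (Part 2, Part 4, ...)
        parts = [p[::-1] if i % 2 == 1 else p for i, p in enumerate(parts)]
    total = sum(int(p) for p in parts)
    return total % m


simple = fold_hash(12345678, part_len=4, m=100)
reversal = fold_hash(12345678, part_len=4, m=100, reverse_alternate=True)
print("simple folding:", simple)
print("folding by reversal:", reversal)
```

The sketch gives 12 for simple folding and 99 for folding by reversal, matching the worked examples.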
Cryptographic Hash Functions
• Deterministic: The same input will always produce the same hash output.
• Pre-image Resistance: Given a hash value, it should be computationally infeasible to
find the original input.
• Collision Resistance: It should be computationally infeasible to find two different
inputs that produce the same hash output.
• Avalanche Effect: A small change in the input should produce a drastically different
hash.
• Fast Computation: Hash functions should be efficient to compute for any given input.
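Two of these properties, determinism and the avalanche effect, can be observed with a short sketch. SHA-256 from Python's standard hashlib module is used here only as one example of a cryptographic hash function; the slides do not name a specific algorithm, and the helper names and sample inputs are illustrative.

```python
import hashlib


def sha256_hex(text: str) -> str:
    """Return the SHA-256 digest of text as a 64-character hex string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def differing_bits(hex_a: str, hex_b: str) -> int:
    """Count the bit positions in which two hex digests differ."""
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


h1 = sha256_hex("hashing")
h2 = sha256_hex("hashing")   # same input again
h3 = sha256_hex("hashinh")   # last character changed

print("deterministic:", h1 == h2)
print("bits changed by a one-character edit:", differing_bits(h1, h3), "of 256")
```

A one-character change typically flips roughly half of the 256 output bits, which is the avalanche effect in practice.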
Perfect Hashing
• Perfect Hashing is a technique used to create hash functions that ensure no collisions
occur, meaning that each key maps to a unique slot in the hash table.
• This is particularly useful when the set of keys is known in advance and is static (i.e.,
the keys do not change over time).
• Perfect hashing is often divided into two categories: static perfect hashing and
dynamic perfect hashing.
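For a static key set, one simple way to obtain a collision-free function is to search a family of candidate functions until one maps every key to a distinct slot. The sketch below is an illustration under assumptions, not the construction used in the slides: it tries functions of the form h(k) = ((a × k + b) mod p) mod m, with the prime p = 101, a table of size m = n² and the keys from the Division Method slide. All names are hypothetical.

```python
from typing import List, Optional, Tuple


def find_perfect_hash(keys: List[int], m: int, p: int) -> Optional[Tuple[int, int]]:
    """Search for (a, b) such that ((a*k + b) % p) % m is collision-free on keys."""
    for a in range(1, p):
        for b in range(p):
            slots = {((a * k + b) % p) % m for k in keys}
            if len(slots) == len(keys):  # every key landed in its own slot
                return a, b
    return None  # no collision-free function in this family for the given m and p


keys = [15, 28, 37, 45, 60, 75, 92]  # assumed static key set
p = 101                               # assumed prime larger than every key
m = len(keys) ** 2                    # assumed table size n^2 = 49

params = find_perfect_hash(keys, m, p)
if params is not None:
    a, b = params
    for k in keys:
        print(f"key {k} -> slot {((a * k + b) % p) % m}")
```

Because the keys never change, the search runs once and every later lookup needs a single probe.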
Key Concepts of Perfect Hashing
Collision-Resolution Techniques (Open Addressing)
1. Linear probing
2. Quadratic probing
3. Double hashing
Linear probing
• Given a hash table of size 20 with 15 keys currently inserted, what is the load factor of the hash table?
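The load factor is the number of stored keys divided by the table size, so the question above works out to 15 / 20 = 0.75. The sketch below computes it and also shows linear probing in its usual textbook form: on a collision, try the next slot, wrapping around the table. The table size of 10, the division-method home slot and the keys from the Division Method slide are assumptions for illustration.

```python
from typing import List, Optional


def load_factor(n_keys: int, table_size: int) -> float:
    """Load factor = number of stored keys / number of slots."""
    return n_keys / table_size


def insert_linear(table: List[Optional[int]], key: int) -> int:
    """Insert key with linear probing and return the slot that was used."""
    m = len(table)
    home = key % m                 # home slot from the division method
    for i in range(m):
        slot = (home + i) % m      # probe home, home + 1, home + 2, ... (wrapping)
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("hash table is full")


print("load factor:", load_factor(15, 20))

table: List[Optional[int]] = [None] * 10   # assumed table size
for k in [15, 28, 37, 45, 60, 75, 92]:     # assumed keys
    insert_linear(table, k)
print(table)
```

Key 75 has home slot 5 but finds slots 5 to 8 occupied, so it ends up in slot 9. Long probe sequences like this become more common as the load factor grows.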
Quadratic probing
File Structure: Concepts of fields, records and files
• Record: A record is a collection of related fields that together represent a single entity.
For example, an employee record might include fields such as Employee ID,
Name, Address, and Department.
• File Organization refers to the logical relationships among various records that
constitute the file, particularly with respect to the means of identification and access
to any specific record.
• In simple terms, storing the records of a file in a certain order is called File Organization.
• File Structure refers to the format of the label and data blocks and of any logical
control record.
Types of file organization
There are various methods of file organization. Each method has pros and cons with
respect to access or selection. The programmer chooses the file organization method
best suited to the requirement.
Sequential File Organization
• In this method, a new record is always inserted at the file's end, and the file is then
sorted in ascending or descending order. Records are sorted on the primary key or any
other key.
• When a record is modified, the record is updated first, the file is then sorted again,
and the updated record ends up in its correct position.
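The insert-then-sort behaviour described above can be sketched with an in-memory list standing in for the file. The field names id and name and both function names are hypothetical, and a real implementation would work on disk blocks rather than a Python list.

```python
from typing import Dict, List

Record = Dict[str, object]


def insert_record(file_records: List[Record], record: Record, key: str = "id") -> None:
    """Append the new record at the end of the file, then sort on the key."""
    file_records.append(record)
    file_records.sort(key=lambda r: r[key])


def update_record(file_records: List[Record], key_value: object,
                  changes: Record, key: str = "id") -> None:
    """Update the first record whose key matches, then sort the file again."""
    for r in file_records:
        if r[key] == key_value:
            r.update(changes)
            break
    file_records.sort(key=lambda r: r[key])


seq_file: List[Record] = []
for rec_id, name in [(3, "C"), (1, "A"), (2, "B")]:
    insert_record(seq_file, {"id": rec_id, "name": name})
print([r["id"] for r in seq_file])

update_record(seq_file, 1, {"id": 5})   # changing the key moves the record
print([r["id"] for r in seq_file])
```

Re-sorting after every insert or update keeps the file ordered but makes these operations expensive, which is the main cost of this organization.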
Heap file organization
• It is the simplest and most basic type of organization. It works with data blocks. In heap file organization,
records are inserted at the file's end, and inserting a record requires no sorting or ordering of records.
• When a data block is full, the new record is stored in some other block. This new data block need not be the
very next data block; any data block in memory can be selected to store new records. The heap file is also
known as an unordered file.
• In the file, every record has a unique id, and every page in a file is of the same size. It is the DBMS's
responsibility to store and manage the new records.
Hash File Organization
• Hash File Organization computes a hash function on some fields of the records.
The hash function's output determines the disk block in which the record is to be
placed.
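A rough sketch of this idea, with Python lists standing in for disk blocks. The choice of four blocks, the division-method hash on a hypothetical emp_id field and the sample records are assumptions for illustration.

```python
from typing import Dict, List, Optional

Record = Dict[str, object]
NUM_BLOCKS = 4  # assumed number of disk blocks


def block_for(emp_id: int, num_blocks: int = NUM_BLOCKS) -> int:
    """Hash the key field to choose the block that holds the record."""
    return emp_id % num_blocks


def store(blocks: List[List[Record]], record: Record) -> int:
    """Place the record in the block selected by the hash function."""
    b = block_for(int(record["emp_id"]), len(blocks))
    blocks[b].append(record)
    return b


def find(blocks: List[List[Record]], emp_id: int) -> Optional[Record]:
    """Search only the one block the hash function points to."""
    for record in blocks[block_for(emp_id, len(blocks))]:
        if record["emp_id"] == emp_id:
            return record
    return None


blocks: List[List[Record]] = [[] for _ in range(NUM_BLOCKS)]
placements = [store(blocks, {"emp_id": e, "name": n})
              for e, n in [(101, "Asha"), (102, "Ravi"), (105, "Meera")]]
print(placements)
print(find(blocks, 105))
```

A lookup reads a single block instead of scanning the whole file. Records 101 and 105 share block 1 here, so one block can hold several records.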
B+ File Organization
• B+ tree file organization is an advanced form of the indexed sequential access method. It uses a
tree-like structure to store records in a file.
• It uses the same key-index concept, where the primary key is used to sort the records. For each
primary key, an index value is generated and mapped to the record.
• The B+ tree is similar to a binary search tree (BST), but a node can have more than two children. In this
method, all the records are stored only at the leaf nodes. Intermediate nodes act as pointers to the
leaf nodes. They do not contain any records.
Indexed sequential access method (ISAM)
• ISAM is an advanced sequential file organization. In this method, records are stored in the file
using the primary key. An index value is generated for each primary key and mapped to the record.
This index contains the address of the record in the file.
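The key-to-address mapping can be sketched with a sorted index that is searched by binary search. Here a Python list stands in for the file and a list position stands in for a disk address; the field names, sample records and function name are hypothetical.

```python
import bisect
from typing import Dict, List, Optional

Record = Dict[str, object]

# Records stored in primary-key order; the list position plays the role of an address.
records: List[Record] = [
    {"emp_id": 101, "name": "Asha"},
    {"emp_id": 105, "name": "Meera"},
    {"emp_id": 110, "name": "Ravi"},
]

# The index: a sorted list of primary keys and, in parallel, each record's address.
index_keys = [int(r["emp_id"]) for r in records]
index_addrs = list(range(len(records)))


def lookup(emp_id: int) -> Optional[Record]:
    """Binary-search the index, then fetch the record at the stored address."""
    i = bisect.bisect_left(index_keys, emp_id)
    if i < len(index_keys) and index_keys[i] == emp_id:
        return records[index_addrs[i]]
    return None


print(lookup(105))
print(lookup(999))
```

The index is much smaller than the data file, so searching it first avoids a sequential scan of the records.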
Cluster file organization
• When records from two or more tables are stored in the same file, the file is known as a
cluster. Such files hold two or more tables in the same data block, and the key attributes
used to map these tables together are stored only once.
• This method reduces the cost of searching for related records in different files.
• Cluster file organization is used when there is a frequent need to join the tables on
the same condition, and the joins return only a few records from both tables. In the
given example, records are retrieved for particular departments only; this method
cannot be used to retrieve the records for the entire department.
Properties of a good hash function
• Deterministic: For the same input, a hash function must always produce the same output.
• Fast Computation: It should be computationally efficient to compute the hash value for any given input.
• Uniformity: Ideally, the hash function should distribute hash values uniformly across its entire output range,
reducing the likelihood of collisions (different inputs producing the same hash value).
• Pre-image resistance: Given a hash value h, it should be computationally difficult to find any input x such
that hash(x) = h.
• Collision resistance: It should be hard to find two different inputs x and y such that hash(x) = hash(y).
• Avalanche effect: A small change in the input should produce a significant change in the output (hash value).
This property ensures that similar inputs result in vastly different hash values.
Summary
• Hashing: the symbol table, hashing functions, collision-resolution techniques.
• File Structure: concepts of fields, records and files; sequential, indexed and
relative/random file organization.
• Indexing structure for index files, hashing for direct files, multi-key file
organization and access methods.
References
1. Data Structures Using C and C++, A. M. Tenenbaum, Y. Langsam and M. J. Augenstein, Prentice Hall India.
2. Data Structures and Program Design in C, Robert Kruse, Bruce Leung, Pearson Education.
Expected Questions
BAQ
1. Define buckets.
2. Restate random access.
3. Restate extendible hashing.
SAQ
1. Relate indexing for index files.
2. Explain hashing.
LAQ
1. Illustrate collision in hashing.
2. Illustrate common applications of hashing.