Module 6 DSA 24

Uploaded by

pratiknagre34
Subject: Data Structure and Algorithms

HASHING
by
Dr. Swapnil Gundewar
DEPARTMENT OF AIDS/AIML/CSD/CSME/CSE

FACULTY OF ENGINEERING
Course outcomes
Topic                                                        CO
Introduction to Hashing                                      1
Collision-Resolution Technique                               2
File Structure: Concepts of fields, records and files        3
Sequential, Indexed and Relative/Random File Organization    3
Indexing structure for index files                           3
Hashing for direct files                                     3
Multi-Key file organization and access methods               4
Specific Learning Objectives

S/No  Learning objectives                              Level      Criteria      Condition
1     Understand the basic concepts of hashing         Must know  At least one  -
2     Learn the characteristics of hashing function    Must know  At least two  -
3     Understand techniques to resolve collisions      Must know  All           -

What is Hashing?
Hashing refers to the process of generating a fixed-size output from an input of
variable size using the mathematical formulas known as hash functions. This
technique determines an index or location for the storage of an item in a data structure.
It is done for faster access to elements.
The efficiency of mapping depends on the efficiency of the hash function used.
Components of Hashing
Key: A key can be any value, such as a string or an integer, that is fed as
input to the hash function. The hash function uses it to determine the index
or location at which an item is stored in a data structure.
Hash Function: The hash function receives the input key and returns the
index of an element in an array called a hash table. The index is known as
the hash index.
Hash Table: A hash table is a data structure that maps keys to values using
a special function called a hash function. It stores the data in an
associative manner in an array, where the hash index of a key determines the
slot that holds its value.
Need for Hash data structure
• Every day, the data on the internet is increasing multifold and it is always a struggle to store this data
efficiently. In day-to-day programming, this amount of data might not be that big, but still, it needs to
be stored, accessed, and processed easily and efficiently.
• A very common data structure that is used for such a purpose is the Array data structure. An array
stores an element in O(1) time, but searching it takes O(n) time (or O(log n) if it is kept sorted). So
we are looking for a data structure that can both store the data and search in it in constant time,
i.e. in O(1) time.
• This is where the Hash data structure comes into play. With a hash table it is possible, on average,
to store data in constant time and retrieve it in constant time as well.
How does Hashing work?
• Step 4: Now, assume that we have a table of size 7 to store these strings. The hash
function used here is the sum of the character values in the key (a = 1, b = 2, c = 3, ...)
mod the table size. We can compute the location of a string in the array as sum(string) mod 7.

• Step 5: So we will then store
• “ab” in 3 mod 7 = 3,
• “cd” in 7 mod 7 = 0, and
• “efg” in 18 mod 7 = 4.
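The steps above can be sketched in a few lines of Python. This is only an illustration: the function name char_sum_hash is hypothetical, and the letter values a = 1, b = 2, ... are inferred from the sums 3, 7 and 18 shown on the slide.

```python
def char_sum_hash(key: str, table_size: int) -> int:
    """Sum the letter positions (a=1, ..., z=26) and reduce modulo the table size."""
    total = sum(ord(ch) - ord("a") + 1 for ch in key)
    return total % table_size


# Store each string at the slot computed by the hash function.
table = [None] * 7
for key in ["ab", "cd", "efg"]:
    table[char_sum_hash(key, 7)] = key

print(table)
```

With a table of size 7, "ab" lands in slot 3, "cd" in slot 0 and "efg" in slot 4, matching Step 5.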
Hash function

• The hash function creates a mapping between a key and the location of its value. It does
this with a mathematical formula applied to the key.
• The result of the hash function is referred to as a hash value or hash.
• The hash value is a representation of the original string of characters but is
usually smaller than the original.
Types of Hash functions

• Division Method.

• Multiplication Method

• Mid-Square Method

• Folding Method

• Cryptographic Hash Functions

• Universal Hashing

• Perfect Hashing
Division Method
Keys: 15,28,37,45,60,75,92
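The division method computes h(k) = k mod m. A minimal sketch for the keys above follows; the table size m = 7 is an assumption borrowed from the multiplication example in this module, and the function name division_hash is hypothetical.

```python
def division_hash(key: int, m: int) -> int:
    """Division method: the index is the remainder of key divided by table size m."""
    return key % m


keys = [15, 28, 37, 45, 60, 75, 92]
m = 7  # assumed table size
division_indices = {k: division_hash(k, m) for k in keys}
print(division_indices)
```

Under this assumption, keys 15 and 92 both map to index 1, which is an example of a collision.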
Multiplication Method

• The multiplication method is another popular technique for designing hash functions.
• It involves multiplying the key by a constant, extracting the fractional part, and then
multiplying it by the size of the hash table to determine the index.
• Hash Function: h(k) = floor(m x (k x A mod 1))
• Here, k is the key, m is the size of the hash table, A is a constant (typically a fraction
between 0 and 1), and (k x A mod 1) is the fractional part of k x A.
Steps:
•Step 1: Multiply the key k by the constant A.
•Step 2: Extract the fractional part of the product.
•Step 3: Multiply the fractional part by m (the size of the hash table).
•Step 4: Apply the floor function to get the final index.

• Let's work through an example using the multiplication method. Assume you have a
hash table of size 7 (m=7) and you want to hash the following 7 keys:
Keys: 15, 28, 37, 45, 60, 75, 92
• Let's choose A = 0.6180339887
• For k = 15: 15 x A ≈ 9.2705, the fractional part is ≈ 0.2705, 7 x 0.2705 ≈ 1.89, and the
floor gives index 1.
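The four steps can be written as a short Python function. This is a sketch using the values from the example (m = 7, A = 0.6180339887); the name multiplication_hash is hypothetical.

```python
import math


def multiplication_hash(key: int, m: int, A: float = 0.6180339887) -> int:
    """Multiplication method: floor(m * fractional_part(key * A))."""
    fractional = (key * A) % 1.0        # Steps 1 and 2: multiply, keep the fractional part
    return math.floor(m * fractional)   # Steps 3 and 4: scale by m, take the floor


keys = [15, 28, 37, 45, 60, 75, 92]
multiplication_indices = [multiplication_hash(k, 7) for k in keys]
print(multiplication_indices)
```

For these keys the indices come out as 1, 2, 6, 5, 0, 2, 6, so 28 and 75 collide at index 2 and 37 and 92 collide at index 6.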
Mid-Square Method

• The Mid-Square Method involves squaring the key and then extracting the middle
portion of the resulting number as the hash value.
• This method is particularly useful because it tends to distribute keys more uniformly
across the hash table, even if the input keys are not well-distributed.
• Steps in the Mid-Square Method:

• Square the Key: Start by squaring the key k.

• Extract the Middle Digits: From the squared result, extract a specific number of digits
from the middle. The number of digits to extract typically corresponds to the size of
the hash table or the number of bits required.
• Calculate the Index: The extracted middle digits are used as the index in the hash
table.
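A minimal sketch of these steps is shown below. It assumes the middle digits are taken from the decimal representation of the square and that two digits are extracted (so the table would have 100 slots); both choices and the name mid_square_hash are illustrative assumptions.

```python
def mid_square_hash(key: int, digits: int = 2) -> int:
    """Mid-square method: square the key and return its middle `digits` decimal digits."""
    squared = str(key * key).zfill(digits)   # pad very small squares with leading zeros
    start = (len(squared) - digits) // 2     # position of the middle slice
    return int(squared[start:start + digits])


print(mid_square_hash(60))    # 60 * 60 = 3600, middle two digits are "60"
print(mid_square_hash(1234))  # 1234 * 1234 = 1522756, middle two digits are "22"
```

When the square has an odd number of digits, the slice cannot be perfectly centred; this sketch leans towards the left.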
Folding Method

• The Folding Method is another hashing technique that involves dividing the key into
parts, applying a specific operation (like addition), and then combining the results to
form the final hash value. This method is particularly useful when the keys are long
numbers or strings.
Steps in the Folding Method

• Divide the Key: Split the key into equal parts. The parts can be of any size, but they
are typically chosen to match the size of the desired hash table or hash function.
• Fold the Parts: Combine the parts using a simple operation, typically addition, to
produce a single number. The parts might be folded using other methods, such as
reversing certain segments before summing them.
• Apply the Hash Function: The resulting number is then reduced by the hash table
size using a modulus operation to ensure it fits within the bounds of the hash table.
Variations of Folding

1.Simple Folding:
1. The key is divided into equal parts, and each part is added together directly to
form the hash value.
2.Folding by Reversal:
1. The key is divided into parts, but before adding, alternate parts may be reversed
to ensure that all parts contribute uniquely to the final hash value.
3.Folding by Boundaries:
1. The key is divided at certain boundaries (like between digits), and then the parts
are added together.
Example of the Folding Method:

• Let's consider a key of 8 digits, 12345678, and a hash table size m = 100. We'll use the simple folding
method in this example.
1.Divide the Key:
1. Split the key into parts of 4 digits each:
1. Part 1: 1234
2. Part 2: 5678
2.Fold the Parts:
1. Add the parts together: 1234+5678=6912
3.Apply the Hash Function:
1. Reduce the result using the modulus operation with the hash table size: 6912 mod 100 = 12
2. The final hash value is 12, which is the index where the key will be stored in the hash table.
Example with Reversal
• Let's use the same key 12345678 but apply folding by reversal:
1.Divide the Key:
1. Split the key into 4-digit parts:
1. Part 1: 1234
2. Part 2: 5678
2.Reverse the Second Part:
1. Reverse Part 2 before adding:
1. Reversed Part 2: 8765
3.Fold the Parts:
1. Add the original Part 1 and the reversed Part 2: 1234+8765=9999
4.Apply the Hash Function:
1. Reduce the result using the modulus operation: 9999 mod 100 = 99
2. The final hash value is 99, which is the index in the hash table.
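Both worked examples can be reproduced with one small function. This is a sketch: the name folding_hash and its parameters are hypothetical, and the reversal variant reverses every second part, as in the example above.

```python
def folding_hash(key: int, part_size: int, m: int, reverse_alternate: bool = False) -> int:
    """Folding method: split the decimal digits into parts, add them, reduce mod m."""
    digits = str(key)
    parts = [digits[i:i + part_size] for i in range(0, len(digits), part_size)]
    total = 0
    for position, part in enumerate(parts):
        if reverse_alternate and position % 2 == 1:
            part = part[::-1]  # folding by reversal: reverse every second part
        total += int(part)
    return total % m


print(folding_hash(12345678, 4, 100))                          # 1234 + 5678 = 6912 -> 12
print(folding_hash(12345678, 4, 100, reverse_alternate=True))  # 1234 + 8765 = 9999 -> 99
```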
Cryptographic Hash Functions

• Cryptographic hash functions are mathematical algorithms that take an
input (or "message") and return a fixed-size string of bytes.
• The output, typically called a hash value, digest, or simply hash, is for
practical purposes unique to each input: collisions exist in principle, but
they should be infeasible to find.
• Even a small change in the input should produce a significantly different
hash, a property known as the avalanche effect.
Key Properties of Cryptographic Hash Functions:

• Deterministic: The same input will always produce the same hash output.
• Pre-image Resistance: Given a hash value, it should be computationally infeasible to
find the original input.
• Collision Resistance: It should be computationally infeasible to find two different
inputs that produce the same hash output.
• Avalanche Effect: A small change in the input should produce a drastically different
hash.
• Fast Computation: Hash functions should be efficient to compute for any given input.

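Common Cryptographic Hash Functions: MD5, SHA-1, SHA-256 (MD5 and SHA-1 are no longer
considered collision resistant and should not be used where security matters).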
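The deterministic and avalanche properties can be observed with Python's standard hashlib module. The sketch below uses SHA-256; the helper names sha256_hex and differing_bits are hypothetical.

```python
import hashlib


def sha256_hex(message: str) -> str:
    """Return the SHA-256 digest of a text message as 64 hexadecimal characters."""
    return hashlib.sha256(message.encode("utf-8")).hexdigest()


def differing_bits(hex_a: str, hex_b: str) -> int:
    """Count the bit positions in which two hex digests differ."""
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")


first = sha256_hex("hello")
second = sha256_hex("hellp")  # one character changed

print(first == sha256_hex("hello"))   # deterministic: same input, same digest
print(differing_bits(first, second))  # avalanche: many of the 256 bits change
```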


Universal Hashing

• Universal Hashing is a concept in computer science that refers to a technique where a
hash function is selected at random from a family of hash functions with the goal of
reducing the probability of collisions, even when the set of inputs is adversarially
chosen.
• Universal hashing is often used in hash tables and other data structures to ensure
good average-case performance, even in the presence of a worst-case scenario.
• In contrast to a fixed hash function, where adversarially chosen inputs might lead to
many collisions, universal hashing introduces randomness, ensuring that the
expected number of collisions remains low.
• This randomness makes it difficult for an adversary to cause many collisions
deliberately.
• Universal hashing is thus a powerful tool for designing efficient and robust hash
tables, even in worst-case scenarios.
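One well-known universal family for integer keys is h(k) = ((a x k + b) mod p) mod m, where p is a prime larger than every key and a, b are chosen at random. The slides do not name a specific family, so the sketch below is an assumption, as are the function name make_universal_hash and the choice p = 2^31 - 1.

```python
import random


def make_universal_hash(m: int, p: int = 2_147_483_647, rng=None):
    """Pick a random member of the family h(k) = ((a*k + b) mod p) mod m.

    p must be a prime larger than any key; 2**31 - 1 is used here as an assumption.
    """
    if rng is None:
        rng = random.Random()
    a = rng.randrange(1, p)  # a is drawn from 1 .. p-1
    b = rng.randrange(0, p)  # b is drawn from 0 .. p-1

    def h(key: int) -> int:
        return ((a * key + b) % p) % m

    return h


h = make_universal_hash(7)
print([h(k) for k in [15, 28, 37, 45, 60, 75, 92]])
```

Each call to make_universal_hash may return a different function, so an adversary cannot prepare a set of colliding keys in advance.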
Perfect Hashing

• Perfect Hashing is a technique used to create hash functions that ensure no collisions
occur, meaning that each key maps to a unique slot in the hash table.
• This is particularly useful when the set of keys is known in advance and is static (i.e.,
the keys do not change over time).
• Perfect hashing is often divided into two categories: static perfect hashing and
dynamic perfect hashing.
Key Concepts of Perfect Hashing

• Static Perfect Hashing:
In static perfect hashing, the set of keys is fixed and known in advance.
Once the hash function is constructed, it guarantees that there are no
collisions for this set of keys.
• Dynamic Perfect Hashing:
Dynamic perfect hashing allows the hash table to handle a dynamic set of
keys (i.e., keys can be added or removed over time) while still minimizing
collisions.
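For a small static key set, a collision-free function can be found by simply trying candidates. The sketch below searches the family h(k) = ((a x k + b) mod p) mod m; the family, the prime p = 101, the table size and the name find_perfect_hash are all assumptions made for illustration, not a construction given in the slides.

```python
from itertools import product


def find_perfect_hash(keys, m: int, p: int = 101):
    """Search for (a, b) such that ((a*k + b) mod p) mod m has no collisions on keys.

    Returns the first pair found, or None if no member of the family works.
    """
    for a, b in product(range(1, p), range(p)):
        slots = {((a * k + b) % p) % m for k in keys}
        if len(slots) == len(keys):  # every key received its own slot
            return a, b
    return None


keys = [15, 28, 37, 45, 60, 75, 92]
params = find_perfect_hash(keys, m=len(keys) ** 2)
print(params)
```

A table of size n x n for n keys is used here because it makes a collision-free function easy to find; smaller tables are possible but may need a longer search or a different family.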
Collision Resolution Techniques

• When using hash tables, a common issue is collisions, where two
different keys hash to the same index in the hash table. To handle these
collisions, several techniques have been developed. These are known as
collision-resolution techniques.

• Separate chaining (open hashing)
• Open addressing (closed hashing)
Separate chaining
• This method involves making a linked list out of the slot where the
collision happened, then adding the new key to the list.
• The technique is called separate chaining because the linked list
attached to a slot resembles a chain.
• It is more frequently used when we do not know in advance how many keys
will be added or removed.

Example: consider the hash function “key mod 5” and the sequence of keys
12, 22, 15, 25
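The example can be traced with the sketch below. Python lists stand in for the linked lists, which is a simplification, and the function name build_chained_table is hypothetical.

```python
def build_chained_table(keys, size: int):
    """Separate chaining: each slot holds a chain (here a Python list) of keys."""
    table = [[] for _ in range(size)]
    for key in keys:
        table[key % size].append(key)  # append to the chain of the hashed slot
    return table


chained = build_chained_table([12, 22, 15, 25], 5)
for index, chain in enumerate(chained):
    print(index, chain)
```

Keys 12 and 22 share the chain at index 2, and keys 15 and 25 share the chain at index 0; the other slots stay empty.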
Open Addressing

• Open Addressing is a method for handling collisions. In Open Addressing,
all elements are stored in the hash table itself.
• So at any point, the size of the table must be greater than or equal to the
total number of keys (Note that we can increase table size by copying old
data if needed). This approach is also known as closed hashing.
The following techniques are used in open addressing:

1.Linear probing
2.Quadratic probing
3.Double hashing
Linear probing

• Linear probing is a collision resolution technique used in hash tables, particularly in
open addressing. When a collision occurs (i.e., two keys hash to the same index in the
table), linear probing tries to find the next available slot in the hash table by
sequentially checking the subsequent indices.
How Linear Probing Works

•Initial Insertion: When a key is inserted, it is placed at the position determined by
the hash function.
•Collision Handling: If the hash index is already occupied, linear probing checks the next
index (index + 1). If that’s occupied too, it continues to check the subsequent indices
(index + 2, index + 3, etc.) until an empty slot is found.
•Wrap-Around: If the end of the table is reached, linear probing wraps around to the
beginning of the table and continues the search.
•Search: When searching for a key, start at the position determined by the hash function.
If the key is not found there, continue checking subsequent indices until the key is found
or an empty slot is encountered (which indicates the key is not in the table).
Example: insert the keys 50, 70, 76, 85, 93 in that order.

Hash Function: Key mod 5

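Insertion and search with linear probing can be sketched as follows. The table size of 5 is an assumption inferred from the hash function key mod 5, and the function names are hypothetical.

```python
def linear_probe_insert(table, key: int) -> int:
    """Insert key using linear probing; return the index where it was stored."""
    size = len(table)
    home = key % size
    for step in range(size):
        index = (home + step) % size  # wraps around at the end of the table
        if table[index] is None:
            table[index] = key
            return index
    raise OverflowError("hash table is full")


def linear_probe_search(table, key: int) -> int:
    """Return the index of key, or -1 if it is not in the table."""
    size = len(table)
    home = key % size
    for step in range(size):
        index = (home + step) % size
        if table[index] is None:  # an empty slot means the key is absent
            return -1
        if table[index] == key:
            return index
    return -1


probe_table = [None] * 5
for key in [50, 70, 76, 85, 93]:
    linear_probe_insert(probe_table, key)
print(probe_table)
```

Key 50 goes to index 0; 70 collides there and moves to 1; 76 collides at 1 and moves to 2; 85 probes 0, 1, 2 and lands at 3; 93 collides at 3 and lands at 4.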
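Load Factor Calculation

• The load factor is the number of keys stored divided by the size of the table: load factor = n / m.
• Given a hash table of size 20 with 15 keys currently inserted, what is the load factor of the hash table?
Load factor = 15 / 20 = 0.75.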
Quadratic probing

• Quadratic probing is a collision resolution technique used in open addressing hash
tables.
• When a collision occurs (i.e., when the hash function maps two or more keys to the
same index), quadratic probing searches for the next available slot in a quadratic
sequence.
• This means the distance from the original hash position increases quadratically with
each attempt.

Probe index = (Hash Value + i x i) mod (size of table), for i = 0, 1, 2, ...

Example 1: consider the hash function “key mod 7” and the sequence of keys 22, 30 and 50.
Example 2: insert the keys 79, 69, 98, 72, 14, 50 into a hash table of size 13.
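A minimal sketch of quadratic probing for the first example follows. It assumes a table of size 7 to match key mod 7; the function name quadratic_probe_insert is hypothetical.

```python
def quadratic_probe_insert(table, key: int) -> int:
    """Insert key using quadratic probing; return the index where it was stored."""
    size = len(table)
    home = key % size
    for i in range(size):
        index = (home + i * i) % size  # offsets 0, 1, 4, 9, ... from the home slot
        if table[index] is None:
            table[index] = key
            return index
    # Quadratic probing does not visit every slot, so this can happen before the table is full.
    raise OverflowError("no free slot found on the probe sequence")


quad_table = [None] * 7
for key in [22, 30, 50]:
    quadratic_probe_insert(quad_table, key)
print(quad_table)
```

Key 22 goes to index 1 and 30 to index 2. Key 50 also hashes to 1; the probe at 1 + 1 = 2 is occupied, so it is stored at 1 + 4 = 5.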
Concepts of Fields, Records, and Files

•Field: A field is a single piece of data or attribute within a record.
For example, in a database of employees, fields might include Employee ID, Name,
and Address.

•Record: A record is a collection of related fields that together represent a single entity.
For example, an employee record might include fields such as Employee ID,
Name, Address, and Department.

•File: A file is a collection of records, typically stored together on a storage medium.
Files can be organized in various ways to optimize access and retrieval.
File Organization

• File Organization refers to the logical relationships among various records that
constitute the file, particularly with respect to the means of identification and access
to any specific record.
• In simple terms, storing the records of a file in a certain order is called File Organization.
• File Structure refers to the format of the label and data blocks and of any logical
control record.
Types of file organization

There are various methods of file organization. Each method has pros and cons
with respect to access or selection. The programmer chooses the file organization
method best suited to the requirements.
Sequential File Organization

Pile File Method


•It is quite a simple method. In this method, we store the records in a sequence, i.e., one after another.
Records are stored in the order in which they are inserted into the tables.
•To update or delete a record, the record is first searched for in the memory blocks. When
it is found, it is marked for deletion, and the new record is inserted.
Sorted File Method

• In this method, the new record is always inserted at the file's end, and then it will sort
the sequence in ascending or descending order. Sorting of records is based on any
primary key or any other key.
• In the case of modification of any record, it will update the record and then sort the
file, and lastly, the updated record is placed in the right place.
Heap file organization

• It is the simplest and most basic type of organization. It works with data blocks. In heap file organization, the
records are inserted at the file's end. When the records are inserted, it doesn't require the sorting and ordering of
records.
• When the data block is full, the new record is stored in some other block. This new data block need not be the
very next data block; the DBMS can select any data block in the memory to store new records. The heap file is also
known as an unordered file.
• In the file, every record has a unique id, and every page in a file is of the same size. It is the DBMS's responsibility to
store and manage the new records.
Hash File Organization

• Hash File Organization uses the computation of hash function on some fields of the
records. The hash function's output determines the location of disk block where the
records are to be placed.
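The idea can be sketched with in-memory lists standing in for disk blocks. The field name emp_id, the number of blocks and the function name place_records are hypothetical choices for illustration.

```python
def place_records(records, num_blocks: int, key_field: str = "emp_id"):
    """Hash file organization: the hash of the key field selects the block for a record."""
    blocks = [[] for _ in range(num_blocks)]
    for record in records:
        block_number = record[key_field] % num_blocks  # hash function on the key field
        blocks[block_number].append(record)
    return blocks


employees = [
    {"emp_id": 101, "name": "A"},
    {"emp_id": 205, "name": "B"},
    {"emp_id": 304, "name": "C"},
]
blocks = place_records(employees, num_blocks=4)
for number, block in enumerate(blocks):
    print(number, block)
```

To read a record back, the same hash function is applied to its key, so only one block has to be examined.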
B+ File Organization

• B+ tree file organization is an advanced form of the indexed sequential access method. It uses a
tree-like structure to store records in a file.
• It uses the same concept of key-index where the primary key is used to sort the records. For each
primary key, the value of the index is generated and mapped with the record.
• The B+ tree is similar to a binary search tree (BST), but a node can have more than two children. In this
method, all the records are stored only at the leaf nodes. Intermediate nodes act as pointers to the
leaf nodes. They do not contain any records.
Indexed sequential access method (ISAM)

• ISAM method is an advanced sequential file organization. In this method, records are stored in the file
using the primary key. An index value is generated for each primary key and mapped with the record.
This index contains the address of the record in the file.
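A simplified sketch of the index idea is shown below: a sorted list of keys with a parallel list of record addresses, searched with Python's bisect module. The data layout and the names build_index and lookup are assumptions; a real ISAM file keeps its index and data on disk.

```python
import bisect


def build_index(data_file):
    """Build a sorted index of (primary key, address) pairs for a list of records."""
    entries = sorted((record[0], address) for address, record in enumerate(data_file))
    index_keys = [key for key, _ in entries]
    addresses = [address for _, address in entries]
    return index_keys, addresses


def lookup(data_file, index_keys, addresses, key):
    """Binary-search the index, then fetch the record at the stored address."""
    position = bisect.bisect_left(index_keys, key)
    if position < len(index_keys) and index_keys[position] == key:
        return data_file[addresses[position]]
    return None


# Each record is (primary key, name); the list position plays the role of the address.
data_file = [(30, "C"), (10, "A"), (20, "B")]
index_keys, addresses = build_index(data_file)
print(lookup(data_file, index_keys, addresses, 20))
```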
Cluster file organization

• When records from two or more tables are stored in the same file, the organization is known as a cluster.
These files hold two or more tables in the same data block, and the key attributes
which are used to map these tables together are stored only once.
• This method reduces the cost of searching for various records in different files.
• The cluster file organization is used when there is a frequent need for joining the
tables with the same condition. These joins return only a few records from both
tables, for example the records for particular departments only. This method is
not suited to retrieving the records for an entire department.
Properties of a good hash function

•Deterministic: For the same input, a hash function must always produce the same output.
•Fast Computation: It should be computationally efficient to compute the hash value for any given input.
•Uniformity: Ideally, the hash function should distribute hash values uniformly across its entire output range,
reducing the likelihood of collisions (different inputs producing the same hash value).
•Pre-image resistance: Given a hash value h, it should be computationally difficult to find any input x such
that hash(x) = h.
•Collision resistance: It should be hard to find two different inputs x and y such that hash(x) = hash(y).
•Avalanche effect: A small change in the input should produce a significant change in the output (hash value).
This property ensures that similar inputs result in vastly different hash values.
Summary
• Hashing: The symbol table, Hashing Functions, Collision-Resolution
Techniques, File Structure:
• Concepts of fields, records and files, Sequential, Indexed and
Relative/Random File Organization,
• Indexing structure for index files, hashing for direct files, Multi-Key file
organization and access methods.
Expected Questions
BAQ
1.Define buckets.
2.Restate random access.
3.Restate the extendible hashing.
SAQ
1.Relate Indexing for Index files
2.Explain hashing.
LAQ
1. Illustrate collision in hashing.
2. Illustrate common applications of hashing.