Week 9_Hash Functions and Collision

The document provides an overview of hash functions and their role in data structures, explaining how they convert input values into hash values for efficient data storage and retrieval. It discusses various types of hash functions, collision handling, and the implementation of hash tables, emphasizing their efficiency in operations like insertion, deletion, and lookup. Additionally, it highlights the importance of choosing appropriate hash functions to minimize collisions and optimize performance.

Uploaded by

looser1432019

Hash Functions and Collision Handling

By: Dr. Dipti Verma


Associate Professor
CSVTU, Bhilai
• In this presentation we cover the following topics:
1. What is Hashing?
2. What is Hashing in Data Structure?
3. How does Hashing in Data Structure Work?
4. What is the 'Key' in Hashing?
5. What are Hash Function and Hash Table?
6. Types of Hash functions
7. What is a Hash Collision?
8. Types of Hashing in Data Structure
9. The Application of Hashing Functions
What is a Hash Function in Data Structure?
• A hash function is a mathematical function that converts a given input value into a resulting hash value. A hash function can be used to generate a checksum for data, index data in a database, or verify that data has not been altered. (Hashing is one-way, so it is not a form of encryption.) In data structures, a hash function is used to calculate the hash value of a key, which is then used to store and retrieve the corresponding data.
• Hash functions are often used in conjunction with an array,
where the hash value is used as an index in the array. This
allows for fast insertion and retrieval of data from the array.
However, if two different keys result in the same hash
value, this is called a collision, and it must be handled
appropriately.
• There are many different types of hash functions, each with
its own strengths and weaknesses. The most common type
of hash function is the modular hashing function, which
uses the modulus operator to calculate the hash value. Other
types of hash functions include multiplicative hashing,
additive hashing, and universal hashing.
Hash Function
• Any function that converts data of any size into fixed-
size values is a hash function. Hash values, hash
codes, digests, or just hashes are the names given to
the results of a hash function. The values are typically
used to index a hash table, a fixed-size table.
• Hashing is the process of generating a value from a
text or a list of numbers using a mathematical function
known as a hash function.
• A hash function converts a given numeric or alphanumeric key to a small practical integer value. The mapped integer value is used as an index in the hash table. In simple terms, a hash function maps a large number or string to a small integer that can be used as the index in the hash table.

• The pair is of the form (key, value), where for a given key,
one can find a value using some kind of a “function” that
maps keys to values. The key for a given object can be
calculated using a function called a hash function. For
example, given an array A, if i is the key, then we can find the
value by simply looking up A[i].
Describe the hash function.
• A hash function is a fixed procedure that changes a key into a hash key.
• This function converts a key into a length-restricted value known as a hash value or hash.
• Although the hash value is typically shorter than the original, it still serves as a compact representation of the original string of characters.
• In a digital signature scheme, the sender transmits the message together with its hash value and signature. The receiver computes the hash value of the received message using the same hash algorithm and compares it with the hash value that was sent.
• If the hash values match, the message arrived unaltered.
• Assume we want to create a system for storing employee
records that include phone numbers (as keys). We also want
the following queries to run quickly:

• Insert a phone number and any necessary information.


• Look up a phone number and get the information.
• Remove a phone number and any associated information
• We can consider using the following data structures to store information about various phone numbers:

• An array of phone numbers and records.
• A linked list of phone numbers and records.
• A balanced binary search tree with phone numbers as keys.
• A direct access table.
• For arrays and linked lists, we must search linearly, which can be costly in practice. If we use an array and keep the data sorted, we can use binary search to find a phone number in O(log n) time, but insert and delete operations become expensive because we must keep the data sorted.

• With a balanced binary search tree, we get moderate search, insert, and delete times. All of these operations complete in O(log n) time.
• Another option is a direct access table, in which we create a large array and use phone numbers as indexes. If the phone number is not present, the array entry is NIL; otherwise, the array entry stores a pointer to the record corresponding to the phone number. In terms of time complexity, this solution is the best of the bunch; we can perform all operations in O(1) time. To insert a phone number, for example, we create a record with the phone number's details, use the phone number as an index, and store the pointer to the newly created record in the table.
• This solution has a number of practical drawbacks. The first is the amount of extra space required. For example, if a phone number has n digits, we require O(m × 10^n) table space, where m is the size of a pointer to a record. Another issue is that a built-in integer type in a programming language may not be able to hold n digits.

• Because of the limitations mentioned above, a direct access table cannot always be used. In practice, hashing is the solution that can be used in almost all such situations and, on average, outperforms the above data structures such as arrays, linked lists, and balanced BSTs. With hashing we get O(1) search time on average (under reasonable assumptions) and O(n) in the worst case. Let's break down what hashing is.
• What exactly do you mean by hashing?
• Hashing is a popular technique for quickly storing and retrieving data. The primary reason for using hashing is that it gives constant-time search on average.
• Why should you use Hashing?
• If we search for, insert, or delete an element in a balanced binary search tree, the time complexity is O(log n). There may be times when our applications need to perform the same operations faster, and this is where hashing comes into play. With hashing, all of the above operations can be completed in O(1), or constant time, on average. It is critical to understand that hashing's worst-case time complexity remains O(n), but its average time complexity is O(1).
• Let us now look at some fundamental hashing operations.
Fundamental Operations:
• HashTable: Use this operation to create a new hash table.
• Delete: This operation is used to remove a specific key-value
pair from the hash table.
• Get: This operation is used to find a key within the hash table
and return the value associated with that key.
• Put: This operation is used to add a new key-value pair to the
hash table.
• DeleteHashTable: This operation is used to remove the hash
table.
What is a Hash Table?
• A hash table is a data structure that stores key-value
pairs. The keys are used to access the values, which
are usually stored in an array. Hash tables are used to
implement associative arrays, which is a type of data
structure that allows you to store and retrieve data
based on keys instead of indices.
• Hash tables are efficient for storing and retrieving data
because the key is used to directly access the value in the
array. This means that there is no need to search through the
entire array for the desired value. Furthermore, hash tables
can be implemented with different collision handling
strategies, which further improve the efficiency of the data
structure.
• Overall, hash tables are a powerful and efficient data
structure that can be used in a variety of applications. If you
need to store and retrieve data based on keys, then a hash
table is likely the best data structure for the task.
• A data structure called a hash table or hash map is used to hold key-value pairs.
• It is a collection of items organised so that they can be found easily later.
• It uses a hash function to compute an index into an array of buckets or slots, from which the requested value can be located.
• Each list in the array is referred to as a bucket.
• Each value is stored and retrieved on the basis of its key.
• In Java, the Hashtable class implements the Map interface and also extends the Dictionary class.
• The Java Hashtable is synchronised and contains only unique keys.
Components of Hashing:
• Hash Table: An array that stores pointers to records that
correspond to a specific phone number. If no existing phone
number has a hash function value equal to the index for the
entry, the entry in the hash table is NIL. In simple terms, a
hash table is a generalisation of an array. A hash table
provides the functionality of storing a collection of data in
such a way that it is easy to find those items later if needed.
This makes element searching very efficient
• Hash Function: A function that reduces a large phone number to a small practical integer value. In a hash table, the mapped integer value serves as an index. So, to put it simply, a hash function is used to convert a given key into a specific slot index. Its primary job is to map every possible key to a slot index, ideally a distinct one. The hash function is referred to as a perfect hash function if each key maps to a distinct slot index. Although it is exceedingly challenging to construct a perfect hash function, it is our responsibility as programmers to design the hash function in a way that minimises the likelihood of collisions. Collisions are covered later in this presentation.
How Does Hashing in Data
Structure Work?
• Hashing is a process of mapping keys to values in a
data structure. It is used to store and retrieve data from
a data structure quickly. Hashing works by converting
the key into an index that is used to access the data.
•Hash functions are used to create hash tables.
Hash tables are data structures that store data in
an array using a hash function to map keys to
values. Hash tables are used to store data in a
way that is efficient and easy to search.
•Hash functions are used in many different
applications, such as creating digital signatures
and verifying message integrity.
• Hashing in a data structure is used to quickly locate a specific value within a given array. It computes a hash code for each element and uses that code to decide where the element is stored, so a lookup goes straight to the right slot instead of scanning the whole array. This allows for quick lookup when searching for a specific value, as well as easy identification of duplicates, since equal elements always produce the same hash code.
• A hash value can also be used to validate an imported file: if the hash computed from the file matches the expected hash value, the file has not changed. Within a data structure, looking an item up by its hash key speeds up the process and improves search and retrieval efficiency. Hashing is an important tool to have in your arsenal when constructing data structures and can be used in many different ways.
• Without hashing, finding one particular number in a list means searching through the full list. This manual scanning technique is both time-consuming and inefficient. Hashing the data lets you speed up the search and locate the number directly.
Types of Hash functions
• There are many hash functions that use numeric or alphanumeric keys. This section discusses the following hash functions:

• Division Method.
• Mid Square Method.
• Digit Folding Method.
• Multiplication Method.
• Let’s begin discussing these methods in detail.
1. Division Method:
• This is the simplest method to generate a hash value. The hash function divides the key k by M and uses the remainder obtained.
• Formula: h(k) = k mod M
• Here,
• k is the key value, and
• M is the size of the hash table.
• It is best for M to be a prime number, as that helps distribute the keys more uniformly. The hash function depends on the remainder of a division.
Example:
k = 12345
M = 95
h(12345) = 12345 mod 95
= 90

k = 1276
M = 11
h(1276) = 1276 mod 11
= 0
• Pros:
• This method works reasonably well for any value of M.
• The division method is very fast since it requires only a single division operation.
• Cons:
• This method can lead to poor performance because consecutive keys map to consecutive hash values in the hash table.
• Extra care is sometimes needed when choosing the value of M.
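The division method can be sketched in a few lines of Python. The function name division_hash is an illustrative choice, not something from the slides; the two calls reproduce the worked examples above.

```python
def division_hash(k: int, m: int) -> int:
    """Division-method hash: the remainder of key k divided by table size m."""
    if m <= 0:
        raise ValueError("table size m must be positive")
    return k % m


print(division_hash(12345, 95))  # 90
print(division_hash(1276, 11))   # 0
```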
2. Mid Square Method:
• The mid-square method is a very good hashing method. It involves two steps to compute the hash value:
• Square the value of the key k, i.e. k^2.
• Extract the middle r digits as the hash value.
• Formula:
• h(k) = middle r digits of (k x k)
• Here,
• k is the key value.
• The value of r can be decided based on the size of the table.
• Example:
• Suppose the hash table has 100 memory locations. So r = 2 because two digits are required to map the key to the memory location.
• k = 60
• k x k = 60 x 60 = 3600
• h(60) = 60 (the middle two digits of 3600)
• The hash value obtained is 60.
• Pros:
• The performance of this method is good because most or all digits of the key value contribute to the middle digits of the squared result.
• The result is not dominated by the distribution of the top digit or bottom digit of the original key value.
• Cons:
• The size of the key is one of the limitations of this method: if the key is large, its square has about twice as many digits.
• Another disadvantage is that collisions still occur, although we can try to reduce them.
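A minimal sketch of the mid-square method follows. The name mid_square_hash is hypothetical, and two conventions are assumptions not stated in the slides: when the middle is not exactly centred the window leans left, and squares with fewer than r digits are padded with leading zeros.

```python
def mid_square_hash(k: int, r: int) -> int:
    """Mid-square hash: square the key and keep the middle r digits."""
    squared = str(k * k).zfill(r)    # pad so at least r digits exist (assumption)
    start = (len(squared) - r) // 2  # left-leaning middle window (assumption)
    return int(squared[start:start + r])


print(mid_square_hash(60, 2))  # 60, the middle two digits of 3600
```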
3. Digit Folding Method:

• This method involves two steps:
• Divide the key value k into a number of parts, i.e. k1, k2, k3, …, kn, where each part has the same number of digits except for the last part, which can have fewer digits than the other parts.
• Add the individual parts. The hash value is obtained by ignoring the last carry, if any.
• Formula:
• k = k1, k2, k3, k4, …, kn
• s = k1 + k2 + k3 + k4 + … + kn
• h(k) = s
• Here,
• s is obtained by adding the parts of the key k.
• Example:
• k = 12345
• k1 = 12, k2 = 34, k3 = 5
• s = k1 + k2 + k3
• = 12 + 34 + 5
• = 51
• h(k) = 51
• Note:
• The number of digits in each part depends on the size of the hash table. For example, if the size of the hash table is 100, then each part must have two digits, except for the last part, which can have fewer digits.
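The folding steps can be sketched as below. The name folding_hash is illustrative, the key is assumed to be a non-negative integer, and "ignoring the last carry" is interpreted here as keeping only the lowest part_digits digits of the sum, which is one reasonable reading rather than the only one.

```python
def folding_hash(k: int, part_digits: int) -> int:
    """Digit-folding hash: split the key's digits into parts and add them."""
    digits = str(k)
    parts = [int(digits[i:i + part_digits])
             for i in range(0, len(digits), part_digits)]
    total = sum(parts)
    # Drop any carry beyond part_digits digits (interpretation of the slides).
    return total % (10 ** part_digits)


print(folding_hash(12345, 2))   # 12 + 34 + 5 = 51
print(folding_hash(987654, 2))  # 98 + 76 + 54 = 228, carry dropped -> 28
```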
4. Multiplication Method

• This method involves the following steps:
1. Choose a constant value A such that 0 < A < 1.
2. Multiply the key value k by A.
3. Extract the fractional part of kA.
4. Multiply the result of the above step by the size of the hash table, i.e. M.
5. The resulting hash value is obtained by taking the floor of the result obtained in step 4.
• Formula:
• h(k) = floor(M (kA mod 1))
• Here,
• M is the size of the hash table.
• k is the key value.
• A is a constant value.
• Example:
• k = 12345
• A = 0.357840
• M = 100

• h(12345) = floor[ 100 (12345*0.357840 mod 1)]


• = floor[ 100 (4417.5348 mod 1) ]
• = floor[ 100 (0.5348) ]
• = floor[ 53.48 ]
• = 53
• Pros:
• The advantage of the multiplication method is that it can work with any value of A between 0 and 1, although some values tend to give better results than others.
• Cons:
• The multiplication method is generally suitable only when the table size is a power of two; in that case, computing the index from the key is very fast.
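The multiplication method can be sketched with floating-point arithmetic as follows. The name multiplication_hash is illustrative; the call reproduces the worked example above.

```python
import math


def multiplication_hash(k: int, a: float, m: int) -> int:
    """Multiplication-method hash: floor(m * fractional part of k * a)."""
    if not 0 < a < 1:
        raise ValueError("constant a must satisfy 0 < a < 1")
    fractional = (k * a) % 1.0  # fractional part of k * a
    return math.floor(m * fractional)


print(multiplication_hash(12345, 0.357840, 100))  # 53
```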
What are the Applications of Hashing Functions?
• Hashing functions are commonly used in computer security. They can be used to store passwords safely, verify data integrity, and generate unique identifiers. Hashing functions are also used in data structures such as hash tables and hash maps.
A good hash function should have the following characteristics:
• Efficiently computable.
• The keys should be distributed uniformly across all table positions.
• Should minimise collisions.
• Should keep the load factor low (number of items in the table divided by the size of the table).
• A poor hash function for phone numbers, for instance, would be to use the first three digits. Using the last three digits is a better function. Please be aware that this hash function might not be the best. There could be better options.
Separate Chaining Collision Handling Technique in
Hashing
• What is Collision?
• Since a hash function gets us a small number for a key
which is a big integer or string, there is a possibility
that two keys result in the same value. The situation
where a newly inserted key maps to an already
occupied slot in the hash table is called collision and
must be handled using some collision handling
technique.
• What are the chances of collisions with a large table?
• Collisions are very likely even if we have a big table to store keys. An important observation is the Birthday Paradox: with only 23 people, the probability that two of them share a birthday is about 50%.
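The birthday figure can be checked with a short calculation. This sketch assumes 365 equally likely birthdays; the function name is illustrative. The same formula gives the chance of at least one collision when a number of keys are hashed uniformly into a table with that many slots.

```python
def birthday_collision_probability(people: int, days: int = 365) -> float:
    """Probability that at least two of `people` share one of `days` birthdays."""
    p_all_distinct = 1.0
    for i in range(people):
        p_all_distinct *= (days - i) / days
    return 1.0 - p_all_distinct


print(round(birthday_collision_probability(23), 3))  # 0.507
```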

• How to handle collisions?
• There are mainly two methods to handle collisions:

• Separate Chaining
• Open Addressing
• Separate chaining is discussed first; open addressing is discussed after it.
• Separate Chaining:
• The idea behind separate chaining is to make each slot of the hash table array point to a linked list, called a chain, of the records that hash to that slot. Separate chaining is one of the most popular and commonly used techniques for handling collisions.
• The linked list data structure is used to implement this technique. When multiple elements hash into the same slot index, these elements are inserted into a singly-linked list, which is known as a chain.
• We can search for a key K by linearly traversing the linked list for its slot. If the key of an entry is equal to K, we have found our entry. If we reach the end of the linked list without finding it, the entry does not exist. Hence, in separate chaining, if two different elements have the same hash value, we store both elements in the same linked list, one after the other.
• Example: Let us consider a simple hash function, “key mod 7”, and a sequence of keys 50, 700, 76, 85, 92, 73, 101. Keys 50, 85, and 92 all hash to slot 1 and keys 73 and 101 both hash to slot 3, so each of those slots holds a chain; 700 goes to slot 0 and 76 to slot 6.
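The example can be traced with a small separate-chaining table. This is a sketch under stated assumptions: the class name ChainedHashTable and the record-… values are made up for illustration, and each chain is a Python list rather than a hand-written linked list.

```python
class ChainedHashTable:
    """Separate chaining with 'key mod slots' as the hash function."""

    def __init__(self, slots: int = 7):
        self.slots = slots
        self.table = [[] for _ in range(slots)]  # one chain per slot

    def _index(self, key: int) -> int:
        return key % self.slots

    def put(self, key: int, value) -> None:
        chain = self.table[self._index(key)]
        for i, (existing, _) in enumerate(chain):
            if existing == key:          # key already present: overwrite
                chain[i] = (key, value)
                return
        chain.append((key, value))       # otherwise append to the chain

    def get(self, key: int):
        for existing, value in self.table[self._index(key)]:
            if existing == key:
                return value
        return None                      # reached end of chain: not found

    def delete(self, key: int) -> bool:
        chain = self.table[self._index(key)]
        for i, (existing, _) in enumerate(chain):
            if existing == key:
                del chain[i]
                return True
        return False


table = ChainedHashTable(7)
for key in [50, 700, 76, 85, 92, 73, 101]:
    table.put(key, f"record-{key}")

for slot, chain in enumerate(table.table):
    print(slot, [key for key, _ in chain])
```

Slot 1 prints [50, 85, 92] and slot 3 prints [73, 101], the two chains formed by the collisions in this key sequence.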
Advantages:
• Simple to implement.
• Hash table never fills up, we can always add more
elements to the chain.
• Less sensitive to the hash function or load factors.
• It is mostly used when it is unknown how many and
how frequently keys may be inserted or deleted.
Disadvantages:
• The cache performance of chaining is not good as keys are
stored using a linked list. Open addressing provides better
cache performance as everything is stored in the same table.
• Wastage of Space (Some Parts of the hash table are never
used)
• If the chain becomes long, then search time can become O(n)
in the worst case
• Uses extra space for links
Performance of Chaining:
• Performance of hashing can be evaluated under the
assumption that each key is equally likely to be hashed to any
slot of the table (simple uniform hashing).
• m = Number of slots in hash table
• n = Number of keys to be inserted in hash table
• Load factor α = n/m
• Expected time to search = O(1 + α)
• Expected time to delete = O(1 + α)
• Time to insert = O(1)
• Time complexity of search insert and delete is O(1) if α is O(1)
Data Structures For Storing Chains:
• 1. Linked lists
• Search: O(l) where l = length of linked list
• Delete: O(l)
• Insert: O(l)
• Not cache friendly
• 2. Dynamic Sized Arrays (Vectors in C++, ArrayList in Java, list in Python)
• Search: O(l) where l = length of array
• Delete: O(l)
• Insert: O(l)
• Cache friendly
• 3. Self Balancing BST (AVL Trees, Red-Black Trees)
• Search: O(log(l)) where l = number of keys in the tree
• Delete: O(log(l))
• Insert: O(log(l))
• Not cache friendly
• Java 8 onwards uses this for HashMap when a chain grows long
Hashing - Open Addressing for Collision Handling

• So far we have seen that:
• Hashing is a well-known search method.
• A collision occurs when a new key's hash value matches an already-occupied bucket in the hash table.
Open Addressing for Collision Handling
• Like separate chaining, open addressing is a technique for dealing with collisions. In open addressing, all of the elements are stored in the hash table itself. The size of the table must therefore always be greater than or equal to the total number of keys (note that we can increase the table size by copying the old data if needed). This strategy is also referred to as closed hashing. The foundation of this entire process is probing. The different forms of probing are described below.
• Insert (k): Keep probing until an empty slot is found. Put k in the first empty slot you find.
• Search (k): Keep probing until either the slot's key equals k or an empty slot is reached.
• Delete (k): Deletion needs special care. If we simply empty the slot of a deleted key, a later search may stop there and fail to find keys stored further along the probe sequence. Therefore, slots of deleted keys are specially marked as "deleted."
• An item can be inserted into a deleted slot, but a search does not stop at a deleted slot.

• NOTE: The "deleted" buckets are handled the same as any other empty buckets during insertion.
• When searching, the search does not stop when it comes across a "deleted" bucket.
• The search ends only when the required key or an empty bucket is found.
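The three operations and the "deleted" marker can be sketched as follows. Several things here are assumptions for illustration: the class name LinearProbingTable, the use of linear probing as the probe sequence, Python's built-in hash() as the hash function, and storing keys only. Insertion also keeps probing past a "deleted" slot before reusing it, so that a key already further along the sequence is not stored twice.

```python
EMPTY = object()    # slot that has never held a key
DELETED = object()  # marker left behind by delete()


class LinearProbingTable:
    def __init__(self, size: int = 8):
        self.size = size
        self.keys = [EMPTY] * size

    def _probe(self, key):
        start = hash(key) % self.size
        for i in range(self.size):
            yield (start + i) % self.size

    def insert(self, key) -> bool:
        first_free = None
        for idx in self._probe(key):
            slot = self.keys[idx]
            if slot is EMPTY:
                if first_free is None:
                    first_free = idx
                break                   # key cannot be further along
            if slot is DELETED:
                if first_free is None:
                    first_free = idx    # reusable, but keep checking for key
            elif slot == key:
                return True             # already present
        if first_free is None:
            return False                # table is full
        self.keys[first_free] = key
        return True

    def search(self, key) -> bool:
        for idx in self._probe(key):
            slot = self.keys[idx]
            if slot is EMPTY:
                return False            # an empty slot ends the search
            if slot is not DELETED and slot == key:
                return True
        return False

    def delete(self, key) -> bool:
        for idx in self._probe(key):
            slot = self.keys[idx]
            if slot is EMPTY:
                return False
            if slot is not DELETED and slot == key:
                self.keys[idx] = DELETED  # mark, do not empty
                return True
        return False


t = LinearProbingTable(8)
for k in (10, 18, 26):      # all three hash to slot 2
    t.insert(k)
t.delete(18)                # slot 3 becomes DELETED
print(t.search(26))         # True: the search walks past the marker
print(t.search(18))         # False
```

If delete() reset the slot to EMPTY instead, the search for 26 would stop at slot 3 and wrongly report that the key is missing.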
Open Addressing
• In open addressing:
• All the keys are kept inside the hash table, unlike separate chaining.
• The hash table contains only the keys and their data; no links are stored.
• The following techniques are used for open addressing:

• Linear Probing
• Quadratic Probing
• Double Hashing
• (a) Linear probing
• In linear probing, the hash table is searched sequentially, starting from the original hash location. If the location we get is already occupied, we check the next one.

• The rehashing function is as follows: rehash(key) = (n+1) % table-size. As the sequence below shows, the usual gap between two probes is 1.

• Let S be the size of the table and let hash(x) be the slot index
calculated using a hash algorithm.
• If slot hash (x) % S is full, then we try ( hash (x) + 1 ) % S
• If ( hash (x) + 1 ) % S is also full, then we try ( hash (x) + 2) % S
• If ( hash (x) + 2 ) % S is also full, then we try ( hash (x) + 3 ) % S
• ..................................................
• Linear probing problems:
• Primary Clustering: Primary clustering is one of the issues with linear probing. Many consecutive occupied slots form clusters, making it time-consuming to locate a free slot or to search for an element.
• Secondary Clustering: Secondary clustering is less severe; two records share the same collision chain (also known as a probe sequence) only if they start out at the same location.
• Advantage:
• It is simple to compute.
• Disadvantage:
• Clustering is the fundamental issue with linear probing.
• Clusters are formed by many adjacent occupied slots.
• As a result, searching for an element or an empty bucket takes time.

• Time Complexity:
• The worst-case time to search for an element in linear probing is O(table size). This is because, even if only one element is present and all the other elements have been deleted, the "deleted" markers left in the hash table can force a search of the whole table.
• Quadratic probing
• In quadratic probing, the interval between probes grows with each attempt. The clustering issue discussed above can be reduced with the aid of the quadratic probing technique. In the i'th iteration we probe the slot i^2 positions past the original hash location. We always begin at the location the hash function gives, and we check the other slots only if that location is taken.
• let hash (x) be the slot index computed using hash function.
• If slot hash(x) % S is full, then we try ( hash (x) + 1*1 ) % S
• If ( hash (x) + 1*1 ) % S is also full, then we try ( hash (x) +
2*2 ) % S
• If ( hash (x) + 2*2 ) % S is also full, then we try ( hash (x) +
3*3 ) % S
• ..................................................
• ..................................................
• Double Hashing
• In double hashing, a second hash function determines the gap between probes. This method uses a different hash function to generate the increments for the probing sequence, which greatly reduces clustering. In the i'th iteration we probe the slot i*hash2(x) positions past the original hash location, where hash2(x) is the second hash function.
• let hash(x) be the slot index computed using hash function.
• If slot hash(x) % S is full, then we try (hash(x) + 1*hash2(x))
%S
• If (hash(x) + 1*hash2(x)) % S is also full, then we try
(hash(x) + 2*hash2(x)) % S
• If (hash(x) + 2*hash2(x)) % S is also full, then we try
(hash(x) + 3*hash2(x)) % S
• ..................................................
• ..................................................
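The three probe sequences can be compared side by side. The table size S = 11, the key x = 25, and the second hash function 7 - (x mod 7) are assumptions chosen for illustration; the only requirement used for the second hash is that it never returns zero.

```python
def linear_probe(h: int, i: int, size: int) -> int:
    """Slot tried on the i'th attempt with linear probing."""
    return (h + i) % size


def quadratic_probe(h: int, i: int, size: int) -> int:
    """Slot tried on the i'th attempt with quadratic probing."""
    return (h + i * i) % size


def double_hash_probe(h: int, h2: int, i: int, size: int) -> int:
    """Slot tried on the i'th attempt with double hashing."""
    return (h + i * h2) % size


S = 11              # table size (assumed)
x = 25              # key (assumed)
h = x % S           # first hash: 3
h2 = 7 - (x % 7)    # second hash (assumed form): 3, never zero

print([linear_probe(h, i, S) for i in range(5)])           # [3, 4, 5, 6, 7]
print([quadratic_probe(h, i, S) for i in range(5)])        # [3, 4, 7, 1, 8]
print([double_hash_probe(h, h2, i, S) for i in range(5)])  # [3, 6, 9, 1, 4]
```

Linear probing visits neighbouring slots, which is why it clusters; the other two spread their probes across the table.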
• Comparing the three techniques:
• Linear probing provides the best cache performance, although clustering is a problem. Linear probing also has the benefit of being simple to compute.
• Quadratic probing lies between the other two in terms of clustering and cache performance.
• Double hashing has poor cache performance but no clustering. Because two hash functions must be computed, double hashing takes longer to compute.
S. No. | Separate Chaining | Open Addressing
1 | Chaining is simpler to implement. | Open addressing requires more computation.
2 | The hash table never fills up with chaining, since we can always add more elements to a chain. | The table may become full with open addressing.
3 | Chaining is less sensitive to the hash function and the load factor. | Open addressing requires extra care to avoid clustering and a high load factor.
4 | Chaining is typically used when it is unknown how many keys will be inserted or deleted, or how frequently. | Open addressing is used when the frequency and number of keys are known.
5 | Chaining's cache performance is poor since keys are stored in linked lists. | Open addressing gives better cache performance since everything is stored in the same table.
6 | Space is wasted (some parts of the hash table are never used in chaining). | In open addressing, a slot can be used even if no input maps to it.
7 | Chaining requires extra space for links. | Open addressing needs no links.
• Because we traverse a linked list by jumping from one node to the next throughout the computer's memory, chaining's cache efficiency is poor: the CPU cannot usefully cache nodes that haven't been visited yet. With open addressing, however, the data is not dispersed, so the CPU can cache neighbouring entries for speedy access.
• Performance of Open Addressing: Similar to Chaining, the
performance of hashing can be assessed assuming that each
key has an equal likelihood of being hashed to any slot of the
table (simple uniform hashing)
• m = Number of slots in the hash table
• n = Number of keys to be inserted in the hash table

• Load factor α = n/m ( < 1 )
• Expected time to search/insert/delete < 1 / ( 1 - α )
• So search, insert, and delete take O(1 / ( 1 - α )) expected time
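To see how quickly the 1 / (1 - α) bound grows as the table fills, it can be evaluated for a few load factors. The function name and the sample values of α are illustrative.

```python
def expected_probes_bound(alpha: float) -> float:
    """Upper bound 1 / (1 - alpha) on expected probes in open addressing."""
    if not 0 <= alpha < 1:
        raise ValueError("load factor must satisfy 0 <= alpha < 1")
    return 1.0 / (1.0 - alpha)


for alpha in (0.25, 0.5, 0.75, 0.9):
    print(alpha, round(expected_probes_bound(alpha), 2))
```

At α = 0.5 the bound is 2 probes, at 0.75 it is 4, and at 0.9 it is about 10, which is why open-addressing tables are usually resized before they get close to full.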
• Load Factor (α):
• Load factor (α) is defined as α = n / m, the number of keys stored in the hash table divided by the number of slots.
• The load factor value in open addressing is always between 0 and 1. This is because:
• In open addressing, the hash table contains all of the keys.
• As a result, the table's size is always greater than or at least equal to the number of keys it stores.
Conclusions:
• Linear probing achieves the best cache performance, although clustering is a problem.
• Quadratic probing lies between the other two in terms of clustering and cache performance.
• Double hashing has no clustering, but its cache performance is poor.
