Hashing Unit 1
BACKGROUND
Suppose we want to design a system for storing student records, and we want to perform the following
operations efficiently:
• Insert, Search and Delete operations based on student ID
HASHING
• Hashing is the process of indexing and retrieving an element in a data structure using a hash key,
providing a faster way of finding the element.
• With hashing we get O(1) search time on average (under reasonable assumptions) and O(n) in the
worst case.
REAL-WORLD APPLICATIONS OF HASH TABLES
DICTIONARY
• A dictionary is a collection of data elements uniquely identified by a field called the key.
• A dictionary supports the operations of search, insert and delete.
• The ADT of a dictionary is defined as a set of elements with distinct keys supporting the operations
of search, insert, delete and create (which creates an empty dictionary).
• While most dictionaries deal with distinct keyed elements, it is not uncommon to find applications
calling for dictionaries with duplicate or repeated keys
• Hence it is essential that the dictionary defines rules to resolve the ambiguity that may arise while
searching for or deleting data elements with duplicate keys.
• A dictionary supports both sequential and random access.
• A sequential access is one in which the data elements of the dictionary are ordered and accessed
according to the order of the keys (ascending or descending, for example).
• A random access is one in which the data elements of the dictionary are not accessed according to a
particular order.
• Hash tables are ideal data structures for dictionaries.
Database Indexing
• Hash tables are used to implement hash indexes in databases to allow for fast data retrieval.
• For example, searching for a record by a key (e.g., user ID or product SKU) can be done in constant
time on average.
Caching Systems
• Caches use hash tables to store data temporarily for quick access.
• Examples include web browsers caching recently visited websites or content delivery networks
(CDNs) caching popular web pages.
• Key-value pairs in a cache represent URLs (keys) and their corresponding web content (values).
Password Storage
• Hash tables, combined with cryptographic hashing algorithms, store hashed passwords securely.
• When a user enters a password, the system hashes the input and compares it with the stored hash.
DNS Resolution
• The Domain Name System (DNS) uses hash tables to map domain names (e.g., example.com) to IP
addresses.
• This enables quick resolution of domain names to their corresponding servers.
Blockchain
Hash tables are integral to storing and verifying transaction data in blockchains.
Hash functions ensure the immutability and integrity of transaction blocks.
Gaming
Hash tables store game state data, such as the mapping of game objects to their properties or locations.
They are also used in AI algorithms, such as pathfinding and move evaluation in board games like chess.
• When the data elements of the dictionary are to be stored in the hash table, each key Xi is mapped to
a position Pi in the hash table as determined by the value of H(Xi), i.e., Pi = H(Xi).
• To search for a key X in the hash table all that one does is to determine the position P by computing
P = H(X) and access the appropriate data element.
• In the case of insertion of a key X or its deletion, the position P in the hash table where the data
element needs to be inserted or from where it is to be deleted respectively, is determined by
computing P = H(X).
• If the hash table is implemented using a sequential data structure, for example an array, then the hash
function H(X) may be chosen to yield a value that corresponds to an index of the array.
• In such a case, the hash function is simply a mapping of the keys to the array indices.
• Bucket overflow: a bucket overflow occurs when the home bucket for a new (key, element) pair is
already full.
DIVISION METHOD
• Idea:
• Say we have a hash table of size 'Tablesize' and we want to store a (key, value) pair in it. The division
method is fast, as it requires only one computation: a modulus.
• Computes hash value from key using the % operator.
• Map a key k into one of the m slots by taking the remainder of k divided by m
h(k) = k mod m
• Works well when m is prime
Example: k = 1276, m = 10,
h(1276) = 1276 mod 10 = 6
Advantage: fast, requires only one operation
Disadvantage: certain values of m are not good choices,
e.g., powers of 2: table sizes that are powers of 2, like 32 and 1024, should be avoided, as they lead to
more collisions.
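A minimal Python sketch of the division method, using the key and table size from the example above:

```python
def division_hash(key, table_size):
    # Division method: take the remainder of the key divided by the table size.
    return key % table_size

print(division_hash(1276, 10))  # 1276 mod 10 = 6
```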
MULTIPLICATION METHOD
• Idea:
1. Choose a constant A between 0 and 1.
2. Multiply the key by A.
3. Take the fractional part of the product and multiply it by the table size.
4. The hash is the floor (integer part only) of the above result.
So, the hash function is:
H(x) = floor(size * ((key * A) mod 1))
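A minimal Python sketch of the multiplication method; the constant A = (sqrt(5) - 1) / 2 ≈ 0.618 is a commonly suggested choice used here only for illustration, since the slides do not fix a value of A:

```python
import math

def multiplication_hash(key, table_size, A=0.6180339887):
    # A is a constant in (0, 1); (sqrt(5) - 1) / 2 is used here purely as an example.
    fractional = (key * A) % 1              # fractional part of key * A
    return math.floor(table_size * fractional)

print(multiplication_hash(1234, 10))        # 6 for this choice of A
```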
FOLDING
• Idea:
• The key is broken down into segments of equal size, except possibly the last segment, which can have
fewer digits.
• The hash function is:
• H(x) = (sum of the equal-sized segments) mod (size of the hash table)
• Example:
• The {key: value} pairs: {1234: "Sudha", 5678: "Venkat"}
• Size of the table: 100 (0 - 99)
• sum = (k1k2) + (k3k4) + (k5k6) + (k7k8) + (k9k10), and H(x) = sum % 100
• For {1234: "Sudha"}: 1234 → 12 + 34 = 46, and 46 % 100 = 46
• For {5678: "Venkat"}: 5678 → 56 + 78 = 134, and 134 % 100 = 34
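A minimal Python sketch of the folding method with 2-digit segments, reproducing the two examples above:

```python
def folding_hash(key, table_size, segment_digits=2):
    # Break the key into fixed-size digit segments and sum them.
    digits = str(key)
    total = sum(int(digits[i:i + segment_digits])
                for i in range(0, len(digits), segment_digits))
    return total % table_size

print(folding_hash(1234, 100))  # 12 + 34 = 46  -> 46
print(folding_hash(5678, 100))  # 56 + 78 = 134 -> 34
```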
MID-SQUARE METHOD
• Idea
1. Square the key -> key * key
2. Choose some digits from the middle of the number to obtain the Hash value.
• Example: Suppose the size of the hash table is 10 and
• the key: value pairs are: {10: "Sudha", 11: "Venkat", 12: "Jeevani"}
• Number of digits to be selected: 1 (indices 0 - 9)
• H(10) = 10 * 10 = 100 → middle digit 0
• H(11) = 11 * 11 = 121 → middle digit 2
• H(12) = 12 * 12 = 144 → middle digit 4
• Advantage: works well if the keys do not contain many leading or trailing zeros.
• Disadvantages:
• Selection of the middle part
• Non-integer keys have to be pre-processed to obtain a corresponding integer
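A minimal Python sketch of the mid-square method; extracting one middle digit for a table of size 10 reproduces the example above (the exact digit-selection rule is an illustrative assumption):

```python
def mid_square_hash(key, table_size):
    # Square the key and pick the middle digit(s) as the hash value.
    squared = str(key * key)
    digits_needed = len(str(table_size - 1))        # 1 digit for a table of size 10
    start = len(squared) // 2 - digits_needed // 2
    return int(squared[start:start + digits_needed]) % table_size

for k in (10, 11, 12):
    print(k, mid_square_hash(k, 10))   # 10 -> 0, 11 -> 2, 12 -> 4
```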
UNIVERSAL HASHING
• For any fixed choice of hash function, there exists a bad set of identifiers
• A malicious adversary could choose keys to be hashed such that all of them go into the same slot (bucket)
• The average retrieval time then degrades to Θ(n)
• Solution
• use a random hash function
• choose the hash function independently of the keys!
• create a set of hash functions H, from which h can be randomly selected
• A collection H of hash functions is universal if, for a hash function f chosen uniformly at random from H
and any two distinct keys k and l,
Pr{f(k) = f(l)} ≤ 1/m
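A minimal Python sketch of picking a hash function at random from a universal family; the ((a*k + b) mod p) mod m construction is the standard Carter-Wegman family, shown here as one possible choice rather than something prescribed by the slides:

```python
import random

def make_universal_hash(m, p=2_147_483_647):
    # p is a prime larger than any key (2**31 - 1 here, as an assumption).
    a = random.randint(1, p - 1)
    b = random.randint(0, p - 1)
    return lambda k: ((a * k + b) % p) % m   # h(k) = ((a*k + b) mod p) mod m

h = make_universal_hash(m=10)   # a hash function chosen independently of the keys
print(h(1276), h(509))
```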
EXAMPLE
(Figure: hash table)
• In the example it was assumed that the hash function yields distinct values for the individual keys.
• If this were to be followed as a criterion, the situation may get out of control, since for dictionaries with
a very large set of data elements the hash table size can become too huge to be handled efficiently.
• Therefore it is convenient to choose hash functions that yield values lying within a limited range, so as
to restrict the length of the table.
COLLISION
• The phenomenon in which two or more keys yield the same hash value for a given hash function is
called a collision.
• This may lead to inconsistencies and issues in the storage and retrieval of keys.
• Hence, collision handling techniques are needed.
SEPARATE CHAINING
• Maintain an array of linked lists
• A separate list is maintained for all elements that map to the same value
SEPARATE CHAINING
• Separate chaining (hashing with chaining / open hashing)
To handle a collision,
1. This technique attaches a linked list to the slot for which the collision occurs.
2. The new key is then inserted into the linked list.
3. These linked lists attached to the slots look like chains.
4. That is why this technique is called separate chaining.
For searching:
• In the worst case, all the keys might map to the same bucket of the hash table.
• In such a case, all the keys will be present in a single linked list.
• A sequential search will have to be performed on the linked list to perform the search.
• So, the time taken for searching in the worst case is O(n).
SEPARATE CHAINING
For deletion:
• In the worst case, the key might have to be searched for first and then deleted.
• In the worst case, the time taken for searching is O(n).
• So, the time taken for deletion in the worst case is O(n).
Advantages:
1) Simple to implement.
2) The hash table never fills up; we can always add more elements to the chains.
3) Less sensitive to the hash function or load factor.
4) It is mostly used when it is unknown how many keys may be inserted or deleted, and how frequently.
SEPARATE CHAINING
• Disadvantages:
1) Cache performance of chaining is not good, as keys are stored in a linked list. Open addressing
provides better cache performance, as everything is stored in the same table.
2) Wastage of space (some parts of the hash table are never used)
3) If a chain becomes long, then search time can become O(n) in the worst case.
4) Uses extra space for links
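A minimal Python sketch of a hash table with separate chaining, using H(X) = X mod 10 on a few sample keys; the list-per-slot layout is one straightforward way to realize the chains described above:

```python
class ChainedHashTable:
    # Each slot holds a Python list acting as the chain for that slot.
    def __init__(self, size=10):
        self.size = size
        self.slots = [[] for _ in range(size)]

    def _hash(self, key):
        return key % self.size

    def insert(self, key):
        self.slots[self._hash(key)].append(key)

    def search(self, key):
        return key in self.slots[self._hash(key)]   # sequential scan of one chain

    def delete(self, key):
        chain = self.slots[self._hash(key)]
        if key in chain:
            chain.remove(key)

table = ChainedHashTable(10)
for k in (437, 325, 175, 199, 171, 189, 127, 509):   # sample keys
    table.insert(k)
print(table.slots)
```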
Q. Given the sequence of numbers {437, 325, 175, 199, 171, 189, 127, 509} and a hash
function H(X) = X mod 10, use the separate chaining collision resolution strategy
for the hash table.
Q. Consider a simple hash function "key mod 30" and the sequence of keys 3, 1,
63, 5, 11, 15, 18, 16, 46.
LINEAR PROBING: CHAINING WITH REPLACEMENT
Suppose we have to store the following elements: 131, 21, 31, 4, 5
(Figures: hash table contents after inserting 131, 21, 31, 4, 5)
QUADRATIC PROBING
• When a collision occurs, we probe the i²-th bucket in the i-th iteration.
• We keep probing until an empty bucket is found.
• Quadratic probing is an open addressing technique that uses a quadratic polynomial for searching until
an empty slot is found.
• It inserts key ki at the first free location among (position + i²) % m, where i = 0 to m - 1.
h'(x) = x mod m
h(x, i) = (h'(x) + i²) mod m
• The value of i = 0, 1, ..., m - 1. So, we start from i = 0 and increase it until we find a free slot.
• So initially, when i = 0, h(x, i) is the same as h'(x).
QUADRATIC PROBING
In short,
If the slot hash(x) % S is full, then we try (hash(x) + 1*1) % S.
If (hash(x) + 1*1) % S is also full, then we try (hash(x) + 2*2) %S.
If (hash(x) + 2*2) % S is also full, then we try (hash(x) + 3*3) %S.
This process is repeated for all the values of i until an empty slot is found.
QUADRATIC PROBING
Example: Let us consider table size = 7, the hash function Hash(x) = x % 7, and the collision resolution
strategy f(i) = i². Insert 22, 30, and 50.
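A minimal Python sketch of quadratic probing that reproduces this example (table size 7, Hash(x) = x % 7, keys 22, 30, 50):

```python
def quadratic_probe_insert(table, key):
    # Try slots (h'(key) + i*i) % m for i = 0, 1, 2, ... until one is empty.
    m = len(table)
    home = key % m
    for i in range(m):
        slot = (home + i * i) % m
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("no empty slot found")

table = [None] * 7                       # table size 7, Hash(x) = x % 7
for key in (22, 30, 50):
    print(key, "->", quadratic_probe_insert(table, key))   # 22->1, 30->2, 50->5
```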
Q. Using linear probing and quadratic probing, insert the following values into a hash table of size
10. Show how many collisions occur in each iteration.
• 28, 55, 71, 67, 11, 10, 90, 44
Q. We have a hash table of size 10 to store integer keys, with hash function h(x) = x mod 10. Construct
the hash table step by step using the linear probing without replacement strategy, inserting the elements
in the order 31, 3, 4, 21, 61, 6, 71, 8, 9, 25. Calculate the average number of comparisons required to
search the given data in the hash table using linear probing without replacement.
Average no. of comparisons = (Total no. of comparisons / Total no. of successful searches)
Q. Insert the following data into a hash table of size 10 using linear probing with chaining with
replacement: 11, 33, 20, 88, 79, 98, 68, 44, 66, 24. Calculate the average number of comparisons
required to search the given data in the hash table.
Average no. of comparisons = (Total no. of comparisons / Total no. of successful searches)
DOUBLE HASHING
• We use another hash function hash2(x) and probe the (i * hash2(x))-th bucket in the i-th iteration.
• It requires more computation time, as two hash functions need to be computed.
• Double hashing is a collision resolution technique for open addressed hash tables.
• Double hashing uses the idea of applying a second hash function to the key when a collision
occurs.
DOUBLE HASHING
• Double hashing can be done using:
(hash1(key) + i * hash2(key)) % TABLE_SIZE
• Here hash1() and hash2() are hash functions and TABLE_SIZE is the size of the hash table.
(We repeat with increasing i when a collision occurs.)
• The first hash function is typically hash1(key) = key % TABLE_SIZE
• A popular second hash function is: hash2(key) = PRIME - (key % PRIME), where PRIME is a prime
smaller than TABLE_SIZE.
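A minimal Python sketch of insertion with double hashing using hash1(key) = key % TABLE_SIZE and hash2(key) = PRIME - (key % PRIME); the table size 10, PRIME = 7, and the sample keys are illustrative assumptions:

```python
def double_hash_insert(table, key, prime=7):
    # hash1(key) = key % TABLE_SIZE, hash2(key) = PRIME - (key % PRIME)
    size = len(table)
    h1 = key % size
    h2 = prime - (key % prime)
    for i in range(size):
        slot = (h1 + i * h2) % size      # probe sequence for i = 0, 1, 2, ...
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("table is full")

table = [None] * 10                      # TABLE_SIZE = 10, PRIME = 7 (assumptions)
for key in (27, 17, 7):                  # sample keys chosen for illustration
    print(key, "->", double_hash_insert(table, key))
```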
DOUBLE HASHING
Q. Imagine you need to store some items inside a hash table of size 20. The values given are:
(16, 8, 63, 9, 27, 37, 48, 5, 69, 34, 1).
• h1(n)=n%20
• h2(n)=n%13
• h(n, i) = (h1 (n) + i*h2(n)) mod 20
REHASHING
• Rehashing is a collision resolution technique.
• Rehashing is a technique in which the table is resized, i.e., the size of the table
is approximately doubled by creating a new table.
• Situations in which rehashing is required:
• When the table is completely full
• With quadratic probing, when the table is half full
• When insertions fail due to overflow
• In such situations, we have to transfer the entries from the old table to the new
table by re-computing their positions using the hash function.
REHASHING
• Rehashing means hashing again.
• Basically, when the load factor increases beyond its predefined value (the default value of the load
factor is 0.75), the complexity increases.
• To overcome this, the size of the table is increased (doubled) and all the values are hashed again and
stored in the new, double-sized array, to maintain a low load factor and low complexity.
• How is rehashing done?
• For each addition of a new entry to the table, check the load factor.
• If it is greater than its predefined value (or the default value of 0.75 if not given), then rehash.
• To rehash, make a new array of double the previous size and make it the new bucket array.
• Then traverse each element in the old bucket array and call insert() for each, so as to insert it into the
new, larger bucket array.
REHASHING
a) Rehashing is the process of increasing the size of a hash table and redistributing the elements to new
buckets based on their new hash values.
b) It is done to improve the performance of the hash table and to prevent collisions caused by a high
load factor.
c) When a hash table becomes full, the load factor (i.e., the ratio of the number of elements to the
number of buckets) increases.
d) As the load factor increases, the number of collisions also increases, which can lead to poor
performance.
e) To avoid this, the hash table can be resized and the elements rehashed to new buckets, which
decreases the load factor and reduces the number of collisions.
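A minimal Python sketch of rehashing: the table doubles and re-inserts every key once the load factor exceeds 0.75; the use of chaining inside the table is an assumption made only to keep the sketch short:

```python
class RehashingTable:
    # Chained hash table that doubles in size when the load factor exceeds 0.75.
    def __init__(self, size=4, max_load=0.75):
        self.size, self.count, self.max_load = size, 0, max_load
        self.slots = [[] for _ in range(size)]

    def insert(self, key):
        self.slots[key % self.size].append(key)
        self.count += 1
        if self.count / self.size > self.max_load:
            self._rehash()

    def _rehash(self):
        old = self.slots
        self.size *= 2                           # double the table size
        self.slots = [[] for _ in range(self.size)]
        for chain in old:
            for key in chain:                    # re-insert every key at its new position
                self.slots[key % self.size].append(key)

t = RehashingTable()
for k in (3, 7, 11, 15, 19):
    t.insert(k)
print(t.size, t.slots)                           # the table has grown from 4 to 8 slots
```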
EXTENSIBLE / EXTENDIBLE HASHING
1. The dynamic hashing method is used to overcome the problems of static hashing,
such as bucket overflow.
2. In this method, data buckets grow or shrink as the number of records increases or decreases.
This method is also known as the extendible hashing method.
3. This method makes hashing dynamic, i.e., it allows insertion or deletion without
resulting in poor performance.
• How to insert a new record
1. First, follow the same procedure as for retrieval, ending up in some bucket.
2. If there is still space in that bucket, place the record in it.
3. If the bucket is full, split the bucket and redistribute the records.
EXTENSIBLE / EXTENDIBLE HASHING
How to search for a key
1. First, calculate the hash address of the key.
2. Check how many bits are used in the directory; this number of bits is called i.
3. Take the least significant i bits of the hash address. This gives an index into the directory.
4. Using this index, go to the directory and find the bucket address where the record might be.
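A minimal Python sketch of the directory lookup described above; the hash address, the value of i, and the directory contents are hypothetical:

```python
def extendible_lookup(directory, hash_address, i):
    # Use the least significant i bits of the hash address as the directory index.
    index = hash_address & ((1 << i) - 1)
    return directory[index]                      # bucket that may hold the record

# Hypothetical directory with i = 2 bits, mirroring the grouping below.
directory = {0b00: "B0", 0b01: "B1", 0b10: "B2", 0b11: "B3"}
print(extendible_lookup(directory, 0b10101, 2))  # last two bits 01 -> bucket B1
```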
EXTENSIBLE / EXTENDIBLE HASHING
Consider the following grouping of keys into buckets, depending on their hash
address:
EXTENSIBLE / EXTENDIBLE HASHING
The last two bits of 2 and 4 are 00, so they go into bucket B0. The last two bits of 5 and 6
are 01, so they go into bucket B1. The last two bits of 1 and 3 are 10, so they go into
bucket B2. The last two bits of 7 are 11, so it goes into B3.
EXTENSIBLE / EXTENDIBLE HASHING
Insert key 9 with hash address 10001 into the above structure:
1. Since key 9 has hash address 10001, it must go into B1. But bucket B1 is full, so it will be split.
2. The split separates 5 and 9 from 6: the last three bits of 5 and 9 are 001, so they go into
bucket B1, while the last three bits of 6 are 101, so it goes into the new bucket B4.
3. Keys 2 and 4 are still in B0. The records in B0 are pointed to by the 000 and 100 directory
entries, because the last two bits of these entries are 00.
4. Keys 1 and 3 are still in B2. The records in B2 are pointed to by the 010 and 110 directory
entries, because the last two bits of both entries are 10.
5. Key 7 is still in B3. The record in B3 is pointed to by the 111 and 011 directory entries,
because the last two bits of both entries are 11.
EXTENSIBLE / EXTENDIBLE HASHING
(Figure: directory and buckets after the split, including the new bucket B4)
ADVANTAGES OF EXTENSIBLE HASHING
1. In this method, the performance does not decrease as the data in the system grows. It simply
increases the size of memory to accommodate the data.
2. In this method, memory is well utilized, as it grows and shrinks with the data. There will not be any
memory lying unused.
3. This method is good for dynamic databases where data grows and shrinks frequently.
SKIP LIST
1. A skip list is a probabilistic data structure.
2. The skip list is used to store a sorted list of elements or data with a linked list.
3. It allows the elements or data to be processed efficiently.
4. In a single step, it skips several elements of the entire list, which is why it is known as a skip list.
SKIP LIST
5. The skip list is an extended version of the linked list.
6. It allows the user to search for, remove, and insert elements very quickly.
7. It consists of a base list that includes a set of elements which maintains the link hierarchy of the
subsequent elements.
Skip list structure
• It is built in two layers: the lowest layer and the top layer.
• The lowest layer of the skip list is a common sorted linked list, and the top layers of the skip list are
like an "express line" where elements are skipped.
SKIP LIST
• Let's take an example to understand the working of the skip list. In this example, we have 14 nodes,
divided into two layers, as shown in the diagram.
• The lower layer is a common line that links all nodes, and the top layer is an express line that links
only the main nodes, as you can see in the diagram.
• Suppose you want to find 47 in this example. You start the search from the first node of the express
line and keep running along the express line until you find a node that is equal to 47 or greater than 47.
• You can see in the example that 47 does not exist in the express line, so you stop at the largest node
less than 47, which is 40. Now you drop down to the normal line via 40 and search for 47 there, as
shown in the diagram.
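A minimal Python sketch of a skip list with insert and search, mirroring the express-line idea above; the promotion probability of 1/2 and MAX_LEVEL = 4 are illustrative assumptions:

```python
import random

class SkipNode:
    def __init__(self, value, levels):
        self.value = value
        self.forward = [None] * levels           # one forward link per level

class SkipList:
    MAX_LEVEL = 4                                # illustrative cap on the number of layers

    def __init__(self):
        self.head = SkipNode(None, self.MAX_LEVEL)

    def _random_level(self):
        level = 1
        while level < self.MAX_LEVEL and random.random() < 0.5:
            level += 1                           # promote a node with probability 1/2
        return level

    def insert(self, value):
        update = [self.head] * self.MAX_LEVEL
        node = self.head
        for lvl in range(self.MAX_LEVEL - 1, -1, -1):
            while node.forward[lvl] and node.forward[lvl].value < value:
                node = node.forward[lvl]
            update[lvl] = node                   # last node visited on this level
        new = SkipNode(value, self._random_level())
        for lvl in range(len(new.forward)):
            new.forward[lvl] = update[lvl].forward[lvl]
            update[lvl].forward[lvl] = new

    def search(self, value):
        node = self.head
        for lvl in range(self.MAX_LEVEL - 1, -1, -1):
            while node.forward[lvl] and node.forward[lvl].value < value:
                node = node.forward[lvl]         # run along the upper "express" lines
        node = node.forward[0]                   # drop to the bottom (normal) line
        return node is not None and node.value == value

sl = SkipList()
for v in (20, 40, 47, 60):
    sl.insert(v)
print(sl.search(47), sl.search(45))              # True False
```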
SKIP LIST
Skip list basic operations
The skip list supports the following operations:
1. Insertion operation: used to add a new node at a particular location in a specific situation.
2. Deletion operation: used to delete a node in a specific situation.
3. Search operation: used to search for a particular node in the skip list.