Hashing in Data Structures
Have you heard of hashing but aren't sure how it works or why it is
important? Hashing is a widely used technique for locating a required item
within a large collection of data. It is the process of mapping input data of
variable length to an output value of fixed size. This makes both storing data
and retrieving the desired result from a large data set more efficient.
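Hash key: It is the data you want to hash, and under which a value is stored
in the hash table. The hashing algorithm translates the key to the hash value.
The key can be a string or an integer. The word "key" is also used in
cryptography, where it refers to something different from a hash key:
1. Public key - The openly shared half of an asymmetric key pair, used to
encrypt data or to verify signatures.
2. Private key - The secret half of an asymmetric key pair, used to decrypt
data or to create signatures. A symmetric key, by contrast, is a single secret
key used for both encryption and decryption.
3. SSH key pair - SSH authentication uses a pair consisting of a public key
and a private key.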
Hash Function: It performs the mathematical operation of accepting the key
value as input and producing the hash code or hash value as the output.
Some of the characteristics of an ideal hash function are as follows:
o It must be deterministic: the same hash key must always produce the same
hash value.
o Ideally, every input has its own hash code; this feature is known as the
hash property. Because the output has a fixed size, it cannot hold for every
possible input, so in practice collisions should be rare.
o It must be collision-resistant.
o A small change in the input leads to a drastic change in the output.
o The calculation must be quick.
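Two of the characteristics above, determinism and the large output change caused by a small input change, can be observed directly. The following is a minimal sketch using SHA-256 from Python's standard `hashlib` module; the helper names `sha256_hex` and `differing_bits` are made up for this illustration.

```python
import hashlib

def sha256_hex(text: str) -> str:
    """Return the SHA-256 digest of text as a hexadecimal string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def differing_bits(hex_a: str, hex_b: str) -> int:
    """Count the bit positions in which two hex digests differ."""
    return bin(int(hex_a, 16) ^ int(hex_b, 16)).count("1")

first = sha256_hex("Hello, world!")
second = sha256_hex("Hello, world!")   # the same input again
changed = sha256_hex("Hello, world?")  # one character changed

print(first == second)                 # True: the function is deterministic
print(differing_bits(first, changed))  # typically close to half of the 256 bits
```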
Hash Table: It is a type of data structure that stores data in an array format.
The table maps keys to values using a hash function.
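The idea of an array indexed through a hash function can be sketched in a few lines. This is only an illustration under simplifying assumptions: keys are integers, the hash function is `key % TABLE_SIZE`, and collisions are ignored, so a later key simply overwrites an earlier one in the same slot. The names `slot_for`, `put`, and `get` are hypothetical.

```python
TABLE_SIZE = 7                 # assumed size for this sketch
table = [None] * TABLE_SIZE    # the underlying array

def slot_for(key: int) -> int:
    """Hash function: map an integer key to an array index."""
    return key % TABLE_SIZE

def put(key: int, value) -> None:
    """Store (key, value) in the slot chosen by the hash function."""
    table[slot_for(key)] = (key, value)

def get(key: int):
    """Return the value stored under key, or None if it is absent."""
    entry = table[slot_for(key)]
    if entry is not None and entry[0] == key:
        return entry[1]
    return None

put(15, "apple")    # 15 % 7 == 1
put(23, "banana")   # 23 % 7 == 2
print(get(15))      # apple
print(get(99))      # None (99 % 7 == 1, but that slot holds key 15)
```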
Use cases of Hashing
Password Storage: Hash functions are commonly used to securely store
passwords. Instead of storing the actual passwords, the system stores their
hash values. When a user enters a password, it is hashed and compared
with the stored hash value for authentication.
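The store-then-compare flow described above might look like the sketch below. It goes slightly beyond the text by assuming a random per-user salt and the PBKDF2 key-derivation function from Python's standard library, which is a common way to make stored password hashes harder to attack. The iteration count and the function names `hash_password`, `register`, and `verify` are assumptions made for the example.

```python
import hashlib
import hmac
import os

ITERATIONS = 100_000  # assumed work factor for this sketch

def hash_password(password: str, salt: bytes) -> bytes:
    """Derive a hash from the password and salt with PBKDF2-HMAC-SHA256."""
    return hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)

def register(password: str):
    """Return the (salt, hash) pair the system would store for a new user."""
    salt = os.urandom(16)
    return salt, hash_password(password, salt)

def verify(password: str, salt: bytes, stored_hash: bytes) -> bool:
    """Hash the entered password and compare it with the stored hash."""
    return hmac.compare_digest(hash_password(password, salt), stored_hash)

salt, stored = register("correct horse")
print(verify("correct horse", salt, stored))  # True
print(verify("wrong guess", salt, stored))    # False
```

Only the salt and the hash are stored; the password itself never is.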
Data Integrity: Hashing is used to ensure data integrity by generating hash
values for files or messages. By comparing the hash values before and after
transmission or storage, it's possible to detect if any changes or tampering
occurred.
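As a small illustration of this comparison, the sketch below hashes a message before and after a simulated transmission. The message contents and the helper name `digest` are invented for the example.

```python
import hashlib

def digest(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()

original = b"transfer 100 to account 42"
sent_digest = digest(original)          # computed before transmission

received_ok = b"transfer 100 to account 42"
received_tampered = b"transfer 900 to account 42"

print(digest(received_ok) == sent_digest)        # True: data is unchanged
print(digest(received_tampered) == sent_digest)  # False: data was modified
```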
Data Retrieval: Hashing is used in data structures like hash tables, which
provide efficient data retrieval based on key-value pairs. The hash value
serves as an index to store and retrieve data quickly.
Digital Signatures: Hash functions are an integral part of digital signatures.
They are used to generate a fixed-size hash value for a message, which is then
signed with the signer's private key. The authenticity and integrity of the
message can then be verified using the signer's public key.
Example of Hashing
Python
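import hashlib

def sha256(text):
    # Encode the text as UTF-8 bytes and hash it with SHA-256.
    hash_object = hashlib.sha256(text.encode('utf-8'))
    # Return the digest as a hexadecimal string.
    hex_digest = hash_object.hexdigest()
    return hex_digest

def main():
    message = "Hello, world!"
    hash_value = sha256(message)
    print("SHA-256 hash:", hash_value)

if __name__ == "__main__":
    main()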
Types of Hash Functions
Division Method
Formula:
h(k) = k mod M
(where k = key value and M = the size of the hash table)
Advantages:
This method works for any value of M, although some values distribute keys
better than others.
The division method requires only a single operation, so it is quite fast.
Disadvantages:
Consecutive keys are mapped to consecutive hash values, which can result in
poor performance.
The value of M must be chosen with care; a prime number that is not close to a
power of two is a common choice.
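A minimal sketch of the division method for integer keys follows. The table size 11 and the sample keys are arbitrary choices for illustration.

```python
def division_hash(key: int, table_size: int) -> int:
    """Division method: h(k) = k mod M."""
    return key % table_size

M = 11  # assumed table size (a prime)

# Consecutive keys land in consecutive slots.
print([division_hash(k, M) for k in (100, 101, 102)])  # [1, 2, 3]

# Keys that differ by a multiple of M collide.
print(division_hash(25, M), division_hash(36, M))      # 3 3
```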
Mid Square Method
The key is squared, and the middle digits of the squared result are taken as
the hash value.
Advantages:
This technique works well because most or all of the digits of the key value
contribute to the middle digits of the squared result.
The result is not dominated by the top or bottom digits of the original key
value.
Disadvantages:
The size of the key is a limitation: if the key is large, its square contains
about twice as many digits.
Collisions can still occur repeatedly.
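The mid square idea can be sketched as follows. How many digits to keep, and which digits count as the "middle" when the square has an odd number of digits, are conventions assumed here rather than taken from the text.

```python
def mid_square_hash(key: int, digits: int) -> int:
    """Square the key and return `digits` digits taken from the middle."""
    squared = str(key * key)
    # Left-pad with zeros so there are always at least `digits` characters.
    squared = squared.zfill(digits)
    start = (len(squared) - digits) // 2
    return int(squared[start:start + digits])

print(mid_square_hash(60, 2))    # 60 * 60 = 3600       -> "60" -> 60
print(mid_square_hash(1234, 2))  # 1234 * 1234 = 1522756 -> "22" -> 22
```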
Folding Method
Formula:
k = k1, k2, k3, k4, ..., kn
s = k1 + k2 + k3 + k4 + ... + kn
h(k) = s
(where k1, ..., kn are the parts into which the key k is split and s is their
sum)
Advantages:
Creates a simple hash value by splitting the key value into equal-sized
segments.
Works without regard to the distribution of keys in the hash table.
Disadvantages:
When there are many collisions, efficiency suffers.
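A sketch of the folding method for integer keys is shown below. It splits the decimal digits into parts of `part_size` digits (the last part may be shorter) and adds them. Reducing the sum modulo the table size is an extra step assumed here to keep the index within the table.

```python
def folding_hash(key: int, part_size: int, table_size: int) -> int:
    """Split the key's digits into parts, add the parts, reduce to the table."""
    digits = str(key)
    parts = [int(digits[i:i + part_size])
             for i in range(0, len(digits), part_size)]
    s = sum(parts)            # s = k1 + k2 + ... + kn
    return s % table_size     # assumed final step: keep the index in range

print(folding_hash(12345, 2, 100))  # parts 12, 34, 5 -> s = 51 -> 51
```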
Multiplication Method
Formula:
h(k) = floor(M (kA mod 1))
(where M = size of the hash table, k = key value and A = a constant with
0 < A < 1)
Advantages:
It works with any constant A between 0 and 1, although some values seem to
yield better results than others.
Disadvantages:
The multiplication method is mostly appropriate when the table size is a power
of two, because that is the case in which the index can be computed quickly
from the key.
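The multiplication formula translates almost directly into code. In this sketch the constant A defaults to (sqrt(5) - 1) / 2, roughly 0.618, a value often suggested for this method. That default and the table size 100 are assumptions for the example.

```python
import math

A = (math.sqrt(5) - 1) / 2  # about 0.618; an assumed, commonly suggested constant

def multiplication_hash(key: int, table_size: int, a: float = A) -> int:
    """Multiplication method: h(k) = floor(M * (k * A mod 1))."""
    fractional = (key * a) % 1.0   # fractional part of k * A
    return math.floor(table_size * fractional)

print(multiplication_hash(12345, 100))  # 12345 * A = 7629.6295... -> 62
```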
Collision Resolution Techniques
1. Open hashing (Separate chaining)
Each slot of the hash table holds a linked list (chain) of all the entries
whose keys hash to that index.
Advantages:
Implementation is simple.
More keys can always be added to the table, because each chain can grow as
needed.
Less sensitive to changes in the load factor.
Typically used when the number and frequency of keys to be stored in the hash
table are uncertain.
Disadvantages:
Space is wasted, because some slots of the table may never be used.
Long chains increase the search time.
Cache performance is worse than with closed hashing.
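A compact sketch of separate chaining follows. For brevity it uses Python lists as the chains rather than hand-written linked lists, and it relies on Python's built-in `hash()`. The class name `ChainedHashTable` and its methods are hypothetical.

```python
class ChainedHashTable:
    """Hash table that resolves collisions by chaining entries per slot."""

    def __init__(self, size: int = 8):
        self.buckets = [[] for _ in range(size)]  # one chain per slot

    def _index(self, key) -> int:
        return hash(key) % len(self.buckets)

    def put(self, key, value) -> None:
        bucket = self.buckets[self._index(key)]
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)  # key already present: update it
                return
        bucket.append((key, value))       # otherwise extend the chain

    def get(self, key, default=None):
        for existing_key, value in self.buckets[self._index(key)]:
            if existing_key == key:
                return value
        return default

table = ChainedHashTable(size=4)
table.put(1, "one")
table.put(5, "five")                 # 5 % 4 == 1, so it collides with key 1
print(table.get(1), table.get(5))    # one five
print(len(table.buckets[1]))         # 2: both entries share one chain
```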
2. Closed hashing (Open addressing)
Instead of using linked lists, open addressing stores each entry in the array itself.
The hash value gives only the starting position of the search, not necessarily
the slot where an entry ends up. To insert, the array is examined beginning at
the hashed index, and an empty slot is then sought by following a probe
sequence. The probe sequence is the order in which slots are visited, with a
fixed or changing gap between successive probes. There are three methods for
dealing with collisions in closed hashing:
1. Linear Probing
Linear probing inspects the hash table sequentially, starting from the hashed
index. If the requested slot is already occupied, the next one is tried. The
distance between probes in linear probing is fixed (often set to a value of 1).
Formula
index = key % hashTableSize
Sequence
index = hash(n) % T
(hash(n) + 1) % T
(hash(n) + 2) % T
(hash(n) + 3) % T … and so on.
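The linear probe sequence can be sketched as an insert routine. Integer keys, a step of 1, and `None` as the marker for an empty slot are assumptions of this example, as is the function name.

```python
def linear_probe_insert(table: list, key: int) -> int:
    """Insert key using linear probing and return the slot that was used."""
    size = len(table)
    start = key % size                 # hashed index
    for i in range(size):
        index = (start + i) % size     # (hash + 0), (hash + 1), (hash + 2), ...
        if table[index] is None:
            table[index] = key
            return index
    raise RuntimeError("hash table is full")

table = [None] * 7
# 10, 17, 24 and 3 all hash to index 3, so each one moves one slot further.
print([linear_probe_insert(table, k) for k in (10, 17, 24, 3)])  # [3, 4, 5, 6]
```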
2. Quadratic Probing
The distance between subsequent probes is the only difference between linear
and quadratic probing. If the slot at the hashed index is already taken,
traversal continues until an available slot is found. The distance from the
original hashed index is determined by successive values of a quadratic
polynomial (i x i for the i-th probe).
Formula
index = (hash(n) + i x i) % hashTableSize
Sequence
index = hash(n) % T
(hash(n) + 1 x 1) % T
(hash(n) + 2 x 2) % T
(hash(n) + 3 x 3) % T … and so on
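The same insert routine with quadratic offsets might look like this. As before, integer keys and `None` for empty slots are assumptions. Note that a quadratic sequence does not necessarily visit every slot, so failing to find a free slot does not prove the table is full.

```python
def quadratic_probe_insert(table: list, key: int) -> int:
    """Insert key using quadratic probing and return the slot that was used."""
    size = len(table)
    start = key % size                    # hashed index
    for i in range(size):
        index = (start + i * i) % size    # offsets 0, 1, 4, 9, ...
        if table[index] is None:
            table[index] = key
            return index
    raise RuntimeError("no free slot found on the probe sequence")

table = [None] * 7
# All four keys hash to index 3; the offsets 1, 4 and 9 spread them out.
print([quadratic_probe_insert(table, k) for k in (10, 17, 24, 3)])  # [3, 4, 0, 5]
```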
3. Double Hashing
The interval between probes is determined by a second hash function. Double
hashing is a technique for reducing clustering: the increments of the probing
sequence are computed using an extra hash function.
Formula
(firstHash(key) + i * secondHash(key)) % size of the table
Sequence
index = hash(x) % S
(hash(x) + 1*hash2(x)) % S
(hash(x) + 2*hash2(x)) % S
(hash(x) + 3*hash2(x)) % S … and so on
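Double hashing can be sketched in the same style. The second hash function used here, `1 + (key % (size - 1))`, is one common choice and is an assumption of this example. It never returns 0, and with a prime table size the probe sequence reaches every slot.

```python
def double_hash_insert(table: list, key: int) -> int:
    """Insert key using double hashing and return the slot that was used."""
    size = len(table)
    first = key % size               # first hash: starting index
    step = 1 + (key % (size - 1))    # second hash: probe increment, never 0
    for i in range(size):
        index = (first + i * step) % size
        if table[index] is None:
            table[index] = key
            return index
    raise RuntimeError("no free slot found on the probe sequence")

table = [None] * 7  # a prime size, assumed for this sketch
# All four keys start at index 3 but use different increments (6, 1 and 4).
print([double_hash_insert(table, k) for k in (10, 17, 24, 3)])  # [3, 2, 4, 0]
```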
Importance of Hashing
1. Required information can be retrieved efficiently from large data sets.
2. The hash code produced by the hash function serves as a compact identifier
for the data, which helps in checking data integrity.
3. The data is stored in a structured manner as there is an index for every
record in the hash table. This ensures efficient storage and retrieval.
Limitations of Hashing
1. Collisions often occur, where two or more inputs have the same hash value.
2. The performance of a hashing algorithm depends on the quality of the hash
function. A poorly designed hash function can lead to many collisions, reducing
the efficiency of the algorithm.
Summary
Hashing is used in data structures to organize data effectively by turning
data into fixed-length values. Separate chaining and open addressing are the two
primary methods for handling collisions. Hash functions like SHA-256 transform
data into fixed-length codes. Hashing is useful for password storage, data
integrity, and digital signatures, and it speeds up data retrieval. Division,
mid square, folding, and multiplication are a few of the several kinds of hash
function.