Hash-Data Structure
There are several searching techniques, such as linear search, binary search and search trees. In these
techniques, the time taken to search for a particular element depends on the total number of elements.
Example:
1. Linear search takes O(n) time to search an unsorted array of n elements.
2. Binary search takes O(log n) time to search a sorted array of n elements.
3. Searching a balanced binary search tree of n elements takes O(log n) time.
Drawback: In all of these techniques, the search time grows as the number of elements grows.
A hash table avoids this. It is a data structure in which insertion and search operations are, on average,
very fast irrespective of the size of the data. A hash table uses an array as its storage medium and uses
a hash function to generate the index at which an element is to be inserted or located.
Definition: Hashing is a method for storing and retrieving data, for example in a database, in O(1) time
on average. Sometimes it is called mapping. A hash function converts a large key, or a key of arbitrary
size, into a smaller value in a fixed range. That value is then used as an index that points to the place
where the record for the search key is stored.
Hans Peter Luhn of IBM is generally credited with the idea of hashing, which he described in an internal
memorandum in 1953. In November 1958, he presented his work on hashing at a six-day international
conference dedicated to scientific information.
In this technique, we give an input called a key to the hash function. The function uses this key to
generate an index into the hash table; the number it generates is known as the hash value. The table
then stores or returns the value held at that index.
Data can be hashed into a shorter, fixed-length value for quicker access using a key or set of characters.
This is how key-value pairs are stored in hash tables. The representation of the hash function looks like
this:
Hash key: It is the raw data you want to be hashed into the hash table. This identifier can be a string or
an integer. The hashing algorithm translates the hash key into the hash value; that is, the hash value is
the outcome of feeding the hash key through the hashing algorithm.
1. Public key:- A public key is one half of an 'asymmetric' key pair. It can be shared openly and is
used to encrypt data or to verify signatures. Asymmetric operations are relatively slow compared
with symmetric ones. The common uses of public keys include cryptography, the transfer of
bitcoins, and securing online sessions. Because a public key works together with a matching
private key that is kept secret, publishing it does not compromise the overall security.
2. Private key:- The private key is the secret half of an asymmetric key pair and is used to decrypt
data or to create signatures. In "symmetric" encryption, by contrast, a single secret key is used
for both encryption and decryption and is shared by each party that sends or receives the
encrypted information. A secret key is typically a long, hard-to-guess string of bits generated
randomly or pseudo-randomly.
3. SSH public key:- SSH uses both a public and a private key. An SSH key pair is used to
authenticate a user to a remote machine. The public key is placed on the remote servers, while
the private key stays with the user.
Hash Function: It performs the mathematical operation of accepting the key value as input and
producing the hash code or hash value as the output. Some of the characteristics of an ideal hash
function are as follows:
✔ It must be deterministic: the same hash key always produces the same hash value.
✔ Ideally, every input has its own hash code; in practice, distinct inputs should rarely share one.
✔ It must be collision-resistant: finding two inputs with the same hash value should be hard.
✔ A small change in the input leads to a drastic change in the output (the avalanche effect).
✔ The calculation must be quick.
Hash Table: It is a type of data structure that stores data in an array format. The table maps keys to
values using a hash function.
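As a minimal sketch of these ideas, the C functions below compute a hash code for a string and reduce it to a table index. The names hash_string and table_index and the multiplier 31 are illustrative assumptions, not something the text prescribes; production hash functions are usually more elaborate.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative polynomial string hash. The multiplier 31 is an assumption
   chosen for the example. The same key always yields the same result. */
unsigned long hash_string(const char *key) {
    assert(key != NULL);
    unsigned long h = 0;
    for (const char *p = key; *p != '\0'; ++p) {
        h = h * 31 + (unsigned char)*p;
    }
    return h;
}

/* Reduce the hash code to a slot index in a table with table_size slots. */
size_t table_index(const char *key, size_t table_size) {
    assert(table_size > 0);
    return (size_t)(hash_string(key) % table_size);
}
```

Calling table_index("apple", 10) always returns the same slot for "apple", which is what lets the table find the entry again later.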
Data Integrity: Hashing is used to ensure data integrity by generating hash values for files or
messages. By comparing the hash values before and after transmission or storage, it's possible
to detect if any changes or tampering occurred.
Data Retrieval: Hashing is used in data structures like hash tables, which provide efficient data
retrieval based on key-value pairs. The hash value serves as an index to store and retrieve data
quickly.
Digital Signatures: Hash functions are an integral part of digital signatures. They are used to
generate a unique hash value for a message, which is then encrypted with the signer's private
key. This allows for verification of the authenticity and integrity of the message using the
signer's public key.
1. Division Method
The easiest and quickest way to create a hash value is through division. In this hash function, the key k
is divided by M and the remainder is used as the hash value.
Formula:
h(k) = k mod M (in C: k % M)
(where k = key value and M = the size of the hash table)
Example of Division Method
k = 1987
M = 13
Therefore, h(1987) = 1987 mod 13 (1987 % 13)
h(1987) = 11 (since 1987 = 13 x 152 + 11)
Advantages:
✔ This method works for any table size M.
✔ The division strategy only requires one operation, thus it is quite quick.
Disadvantages:
✗ Consecutive keys are mapped to successive hash values, which could result in poor
performance.
✗ The value of M must be chosen with care; a prime number usually distributes the keys best.
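The division method can be sketched in C in a single line. The function name hash_division is an assumption made for this example.

```c
#include <assert.h>

/* Division method: the remainder of k divided by the table size m. */
unsigned int hash_division(unsigned int k, unsigned int m) {
    assert(m > 0);   /* a table must have at least one slot */
    return k % m;
}
```

With k = 1987 and m = 13 the function returns 11, because 1987 = 13 x 152 + 11. The next key, 1988, lands in slot 12, which illustrates how consecutive keys map to successive slots.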
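2. Mid-Square Method
The key k is squared, and a fixed number of digits taken from the middle of the square is used as the
hash value.
Formula:
h(k) = middle r digits of (k x k)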
Advantages:
✔ This technique works well because most or all of the digits in the key value affect the result. All
of the necessary digits participate in a process that results in the middle digits of the squared
result.
✔ The result is not dominated by the top or bottom digits of the initial key value.
Disadvantages:
✗ The size of the key is one of the limitations of this system; if the key is large, its square will
contain about twice as many digits and may not fit in a machine word.
✗ Collisions can still occur frequently.
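A possible C sketch of the mid-square idea follows. The function names and the rule for picking the "middle" digits (drop half of the surplus digits from the right, then keep r digits) are assumptions made for illustration; other conventions are equally valid.

```c
#include <assert.h>

/* Number of decimal digits in n (1 for n == 0). */
static int count_digits(unsigned long long n) {
    int d = 1;
    while (n >= 10) {
        n /= 10;
        ++d;
    }
    return d;
}

/* Mid-square hash: square k and keep r decimal digits from the middle. */
unsigned int hash_mid_square(unsigned int k, int r) {
    assert(r > 0 && r <= 9);
    unsigned long long sq = (unsigned long long)k * k;
    int digits = count_digits(sq);
    int drop = digits > r ? (digits - r) / 2 : 0;  /* digits cut from the right */
    unsigned long long pow_r = 1;
    for (int i = 0; i < drop; ++i) sq /= 10;
    for (int i = 0; i < r; ++i) pow_r *= 10;
    return (unsigned int)(sq % pow_r);
}
```

For k = 60 and r = 2, the square is 3600 and the middle two digits are 60, so hash_mid_square(60, 2) returns 60.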
3. Folding Method
1. Divide the key value k into a predetermined number of pieces k1, k2, k3, ..., kn, each having
the same number of digits, except for the last piece, which may have fewer digits than the
others.
2. Add the pieces together. The hash value is the sum, calculated without taking into account the
final carry, if any.
Formula:
k = k1, k2, k3, k4, ..., kn
s = k1 + k2 + k3 + k4 + ... + kn
h(k) = s
Advantages:
✔ Creates a simple hash value by splitting the key value into equal-sized segments.
✔ Does not depend on how the keys are distributed in the hash table.
Disadvantages:
✗ Efficiency suffers when there are too many collisions.
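The two folding steps can be sketched in C as follows. The function name hash_folding, the parameter part_digits and the choice to split the decimal digits from the left are assumptions for this example.

```c
#include <assert.h>
#include <stdio.h>

/* Folding method: split the decimal digits of k into pieces of
   part_digits digits (the last piece may be shorter), add the pieces,
   and discard any carry beyond part_digits digits. */
unsigned int hash_folding(unsigned int k, int part_digits) {
    assert(part_digits > 0 && part_digits <= 4);
    char buf[16];
    int len = snprintf(buf, sizeof buf, "%u", k);
    unsigned int limit = 1;
    for (int i = 0; i < part_digits; ++i) limit *= 10;
    unsigned int sum = 0;
    for (int start = 0; start < len; start += part_digits) {
        unsigned int part = 0;
        for (int j = start; j < start + part_digits && j < len; ++j) {
            part = part * 10 + (unsigned int)(buf[j] - '0');
        }
        sum += part;
    }
    return sum % limit;   /* drop the final carry */
}
```

For k = 12345 with two-digit pieces, the pieces are 12, 34 and 5, giving 51. For k = 5678 the sum is 56 + 78 = 134, and discarding the carry leaves 34.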
4. Multiplication Method
Formula:
h(k) = floor(M (kA mod 1))
(Where, M = size of the hash table, k = key value and A = a constant with 0 < A < 1)
Example of Multiplication Method
k = 5678
A = 0.6829
M = 200
Now, calculating the new value of h(5678):
h(5678) = floor(200((5678 x 0.6829) mod 1))
= floor(200(3877.5062 mod 1))
= floor(200 x 0.5062)
= floor(101.24)
= 101
Advantages:
✔ Any value of A between 0 and 1 can be used, although some values yield better outcomes than
others.
Disadvantages:
✗ The multiplication method is most appropriate when the table size is a power of two, because
only then can the index be computed quickly from the key.
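A small C sketch of the multiplication method, using floating-point arithmetic for clarity. The function name hash_multiplication is an assumption; truncation is used in place of floor(), which is equivalent here because every value involved is non-negative.

```c
#include <assert.h>

/* Multiplication method: h(k) = floor(m * ((k * a) mod 1)), 0 < a < 1. */
unsigned int hash_multiplication(unsigned int k, double a, unsigned int m) {
    assert(a > 0.0 && a < 1.0 && m > 0);
    double product = (double)k * a;
    /* (k * a) mod 1 is the fractional part of the product. */
    double frac = product - (double)(unsigned long long)product;
    return (unsigned int)((double)m * frac);
}
```

With k = 5678, a = 0.6829 and m = 200, the product is 3877.5062, its fractional part is 0.5062, and the function returns 101.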
A collision in a hash table occurs when the hash function produces the same hash value for two or
more different keys. However, it's crucial to note that collisions are not a defect; rather, they are an
unavoidable part of hashing. Collisions happen because hashing methods convert each input,
regardless of its length, into a fixed-length code. Since there are infinitely many possible inputs and
only a finite number of outputs, a hashing algorithm must eventually yield repeated hashes.
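To make the idea concrete, the short C sketch below tests whether two keys collide under a simple remainder hash. The helper names slot_of and keys_collide are assumptions used only for this illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Slot chosen by a simple remainder hash. */
static unsigned int slot_of(unsigned int key, unsigned int table_size) {
    assert(table_size > 0);
    return key % table_size;
}

/* Two different keys collide when they are assigned the same slot. */
bool keys_collide(unsigned int a, unsigned int b, unsigned int table_size) {
    return a != b && slot_of(a, table_size) == slot_of(b, table_size);
}
```

In a table of 10 slots, the keys 12 and 22 collide because both leave remainder 2, while 12 and 23 do not.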
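Separate chaining (open hashing) handles collisions by keeping, at each table slot, a linked list (chain)
of all the entries that hash to that slot.
Advantages:
1. Implementation is simple and easy.
2. We can always add more keys to the table, because a chain can grow and the table never fills up.
3. Less sensitive to the hash function and to changing load factors.
4. Typically used when the number and frequency of keys to be stored in the hash table is
uncertain.
Disadvantages:
1. Space is wasted: some slots are never used, and every entry needs extra memory for its link.
2. Search time grows with the length of the chain.
3. Worse cache performance compared with closed hashing.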
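A minimal separate-chaining sketch in C is given here for illustration. The names Node, buckets, chain_put and chain_get, the bucket count of 10 and the restriction to non-negative integer keys are all assumptions of this example.

```c
#include <assert.h>
#include <stdlib.h>

#define NUM_BUCKETS 10

struct Node {
    int key;
    int value;
    struct Node *next;   /* next entry in the same chain */
};

static struct Node *buckets[NUM_BUCKETS];   /* all chains start empty */

static unsigned int bucket_of(int key) {
    assert(key >= 0);   /* this sketch assumes non-negative keys */
    return (unsigned int)key % NUM_BUCKETS;
}

/* Insert a key or update its value. Returns 0 on success, -1 if out of memory. */
int chain_put(int key, int value) {
    unsigned int b = bucket_of(key);
    for (struct Node *n = buckets[b]; n != NULL; n = n->next) {
        if (n->key == key) {
            n->value = value;
            return 0;
        }
    }
    struct Node *node = malloc(sizeof *node);
    if (node == NULL) return -1;
    node->key = key;
    node->value = value;
    node->next = buckets[b];   /* push onto the front of the chain */
    buckets[b] = node;
    return 0;
}

/* Return a pointer to the stored value, or NULL if the key is absent. */
const int *chain_get(int key) {
    for (struct Node *n = buckets[bucket_of(key)]; n != NULL; n = n->next) {
        if (n->key == key) return &n->value;
    }
    return NULL;
}
```

Keys 5 and 15 both hash to bucket 5, so they end up in the same chain, and a lookup walks that chain comparing keys.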
Instead of using linked lists, open addressing (closed hashing) stores each entry in the array itself. The
hash value gives only the starting position. To insert, the table is examined beginning at the hashed
index, and a probe sequence is followed until an empty slot is found. The probe sequence is the order
in which slots are visited; the gap between successive probes depends on the method used. There are
three methods for dealing with collisions in closed hashing:
1. Linear Probing
Linear probing inspects the hash table sequentially, starting from the hashed index. If the requested
slot is already occupied, the next one is tried. The distance between probes in linear probing is fixed
(usually 1).
Formula:
index = (hash(key) + i) % hashTableSize, for i = 0, 1, 2, ...
Sequence:
index = hash(n) % T
(hash(n) + 1) % T
(hash(n) + 2) % T
(hash(n) + 3) % T … and so on.
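The probe sequence above might be coded as in the following sketch. It assumes non-negative integer keys, hash(key) = key % size, and the value -1 (EMPTY) as the marker for a free slot; the function name is hypothetical.

```c
#include <assert.h>

#define EMPTY (-1)   /* assumed marker for a free slot */

/* Insert key with linear probing. Returns the slot used, or -1 if the
   table is full. */
int linear_probe_insert(int table[], int size, int key) {
    assert(size > 0 && key >= 0);
    int home = key % size;
    for (int i = 0; i < size; ++i) {
        int idx = (home + i) % size;   /* step one slot at a time */
        if (table[idx] == EMPTY || table[idx] == key) {
            table[idx] = key;
            return idx;
        }
    }
    return -1;
}
```

In a table of 7 slots, the keys 10, 17 and 24 all hash to slot 3, so they are placed in slots 3, 4 and 5 respectively.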
2. Quadratic Probing
The only difference between linear and quadratic probing is the distance between successive probes.
If the hashed slot is already taken, you keep probing until you find an available slot for the record.
The distance from the original hashed index is given by successive values of a quadratic polynomial,
most simply i x i for the i-th probe.
Formula:
index = (hash(key) + i x i) % hashTableSize, for i = 0, 1, 2, ...
Sequence:
index = ( hash(n) % T)
(hash(n) + 1 x 1) % T
(hash(n) + 2 x 2) % T
(hash(n) + 3 x 3) % T … and so on
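A quadratic-probing insert could look like the sketch below, under the same assumptions as before: non-negative integer keys, hash(key) = key % size, and -1 (EMPTY) marking a free slot. The function name is hypothetical, and the table is assumed small enough that i * i does not overflow.

```c
#include <assert.h>

#define EMPTY (-1)   /* assumed marker for a free slot */

/* Insert key with quadratic probing. Returns the slot used, or -1 if no
   free slot was found within size probes (quadratic probing does not
   always visit every slot). */
int quadratic_probe_insert(int table[], int size, int key) {
    assert(size > 0 && key >= 0);
    int home = key % size;
    for (int i = 0; i < size; ++i) {
        int idx = (home + i * i) % size;   /* offsets 0, 1, 4, 9, ... */
        if (table[idx] == EMPTY || table[idx] == key) {
            table[idx] = key;
            return idx;
        }
    }
    return -1;
}
```

In a table of 7 slots, the keys 10, 17, 24 and 31 all hash to slot 3 and are placed in slots 3, 4, 0 and 5, so the colliding keys spread out faster than with linear probing.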
3. Double-Hashing
The distance between probes is determined by a second hash function. Double hashing is an effective
technique for reducing clustering. The increments for the probing sequence are computed using the
second hash function.
Formula:
index = (hash(key) + i * hash2(key)) % tableSize, for i = 0, 1, 2, ...
Sequence:
index = hash(x) % S
(hash(x) + 1*hash2(x)) % S
(hash(x) + 2*hash2(x)) % S
(hash(x) + 3*hash2(x)) % S … and so on
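The sketch below shows one way to code double hashing. The second hash function hash2(key) = 1 + key % (size - 1) is a common textbook choice used here as an assumption (it never returns 0, so the probe always moves); the function name, the EMPTY marker and the use of a prime table size are also assumptions.

```c
#include <assert.h>

#define EMPTY (-1)   /* assumed marker for a free slot */

/* Insert key with double hashing. Returns the slot used, or -1 if no free
   slot was found. A prime table size lets the probe reach every slot. */
int double_hash_insert(int table[], int size, int key) {
    assert(size > 1 && key >= 0);
    int h1 = key % size;
    int h2 = 1 + key % (size - 1);   /* step size, always at least 1 */
    for (int i = 0; i < size; ++i) {
        int idx = (h1 + i * h2) % size;
        if (table[idx] == EMPTY || table[idx] == key) {
            table[idx] = key;
            return idx;
        }
    }
    return -1;
}
```

In a table of 7 slots, the keys 10, 17, 24 and 31 share the home slot 3 but have step sizes 5, 6, 1 and 2, so they land in slots 3, 2, 4 and 5.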
Importance of Hashing
1. Easy retrieval of required information from large data sets in an efficient manner.
2. The hash code produced by the hash function serves as a compact identifier for a record, which
helps in checking data integrity.
3. The data is stored in a structured manner as there is an index for every record in the hash table.
This ensures efficient storage and retrieval.
Limitations of Hashing
1. Collisions often occur, where two or more inputs have the same hash value.
2. The performance of the hashing algorithm depends upon the quality of the hash function.
A poorly chosen hash function may lead to many collisions, thus reducing the efficiency of the
algorithm.
Definition: A hash table is a data structure used to insert, look up, and remove key-value pairs
quickly. It implements an associative array. Here, each key is translated by a hash function into an
index in an array. The index functions as a storage location for the matching value.
How Does the Hash Table Work?
1. Initialize the Hash Table: Start with an empty array of a fixed size (let's say 10). Each index in
the array represents a bucket in the hash table.
[empty, empty, empty, empty, empty, empty, empty, empty, empty, empty]
2. Insertion: When inserting a key-value pair into the hash table, the hash function is used to
calculate the index where the pair should be stored. The hash function takes the key as input and
returns an integer value.
Let's say we want to insert the key "apple" with the value "fruit" into the hash table. The hash
function calculates the index as follows: hash("apple") = 5
Since the hash value is 5, we store the key-value pair in index 5 of the array:
[empty, empty, empty, empty, empty, ("apple", "fruit"), empty, empty, empty, empty]
3. Retrieval: To retrieve a value associated with a given key, we use the same hash function to
calculate the index. We then look for the key-value pair in that index.
For example, if we want to retrieve the value for the key "apple", the hash function calculates
the index as 5. We check index 5 in the array and find the key-value pair ("apple", "fruit"):
[empty, empty, empty, empty, empty, ("apple", "fruit"), empty, empty, empty, empty]
4. Collision: Hash tables can encounter collisions when two different keys map to the same index.
This can happen if the hash function is not perfectly distributed or if the array size is relatively
small compared to the number of keys.
For example, if we try to insert the key "banana" and the hash function calculates the index as 5,
we encounter a collision. Using linear probing, we search for the next available slot and find
that index 6 is empty. So, we store the key-value pair ("banana", "fruit") in index 6:
[empty, empty, empty, empty, empty, ("apple", "fruit"), ("banana", "fruit"), empty, empty, empty]
1. Initialize an array of fixed size to serve as the underlying storage for the hash table. The size of
the array depends on the expected number of key-value pairs to be stored.
2. Define a hash function that takes a key as input and generates a hash code, which is then reduced
to an index in the array.
3. To insert a key-value pair into the hash table, apply the hash function to the key to determine
the index. If the calculated index is already occupied, handle collisions using a collision
resolution strategy.
4. To retrieve the value associated with a key, apply the hash function to the key to determine the
index. If the index contains an entry, compare the stored key with the given key to ensure a
match. If the keys match, return the corresponding value. If the keys don't match, follow the
collision resolution strategy to the next candidate slot. If an empty slot is reached, the key is
not present in the hash table.
5. To remove a key-value pair from the hash table, apply the hash function to the key to determine
the index. If the index contains an entry, compare the stored key with the given key to ensure a
match. If the keys match, remove the key-value pair from the table. If the keys don't match,
follow the collision resolution strategy to the next candidate slot. If an empty slot is reached,
the key is not present in the hash table.
Note that the choice of a hash function depends on the specific requirements of the application, and
different hash functions may be suitable for different use cases.
Hash Collision
1. A hash collision refers to a situation where two different inputs produce the same hash value or
hash code when processed by a hash function. In other words, it occurs when two distinct pieces
of data result in an identical hash output.
2. Hash collisions can have practical implications depending on the context, e.g. in hash tables,
collisions can lead to degraded performance if not handled efficiently.
3. Cryptographic hash functions, used in areas like data security and digital signatures, have
stricter requirements to prevent collisions. These functions aim to provide a high level of
collision resistance, making it computationally infeasible to find two inputs that produce the
same hash value.
Cryptographic hash collisions are of particular concern because they can undermine the
integrity and security of cryptographic systems. Researchers and cryptographers continually
evaluate and analyze hash functions to ensure their collision resistance and security properties,
as vulnerabilities or weaknesses discovered in hash functions could have significant
implications for various applications that rely on them.
There are primarily two types of hash collisions: accidental or random collisions and intentional or
malicious collisions. Let's explore each type in more detail:
1. Accidental Collisions: Accidental collisions occur when two different inputs produce the same
hash value due to the nature of hashing algorithms and the limited range of possible hash
values. These collisions are unintended and usually considered rare and coincidental. The
probability of accidental collisions depends on the quality of the hashing algorithm and the size
of the hash space. Good hashing algorithms strive to minimize the chances of accidental
collisions by distributing hash values uniformly across the output space.
2. Intentional Collisions: Intentional collisions occur when an attacker purposefully generates two
or more different inputs that produce the same hash value. These collisions are often the result
of exploiting vulnerabilities or weaknesses in the hashing algorithm. Intentional collisions can
be a significant security concern in various scenarios, such as digital signatures, certificate
authorities, password hashing, and data integrity checks.
(a) Preimage Attacks: In a preimage attack, an attacker tries to find any input that matches
a given hash value. This type of attack is aimed at breaking the one-way property of a hash
function, which means that it is computationally infeasible to determine the original input from
its hash value.
(b) Collision Attacks: In a collision attack, an attacker aims to find two different inputs that
produce the same hash value. The goal is to break the collision resistance property of a hash
function, which ensures that it is computationally difficult to find any two inputs with the same
hash value.
The load factor is the ratio of the number of stored items to the number of slots in the table. The higher
the load factor, the more space is conserved, but at the cost of placing more items into each slot (as in
separate chaining). This increases the time it takes for items to be accessed.
In general, when faced with the choice of whether time or space is more essential, designers prefer to
optimise time (raising speed) over memory space. This choice is known as the time-space tradeoff.
Analysis of Hashing
We stated earlier that in the best case hashing would provide an O(1), constant time search technique.
However, due to collisions, the number of comparisons is typically not so simple. Even though a
complete analysis of hashing is beyond the scope of this text, we can state some well-known results
that approximate the number of comparisons necessary to search for an item.
The most important piece of information we need to analyze the use of a hash table is the load factor, λ.
Conceptually, if λ is small, then there is a lower chance of collisions, meaning that items are more
likely to be in the slots where they belong. If λ is large, meaning that the table is filling up, then there
are more and more collisions. This means that collision resolution is more difficult, requiring more
comparisons to find an empty slot. With chaining, increased collisions means an increased number of
items on each chain.
As before, we will have a result for both a successful and an unsuccessful search. For a successful
search using open addressing with linear probing, the average number of comparisons is approximately
$$\frac{1}{2}\left(1 + \frac{1}{1-\lambda}\right)$$
and an unsuccessful search gives
$$\frac{1}{2}\left(1 + \left(\frac{1}{1-\lambda}\right)^{2}\right)$$
If we are using chaining, the average number of comparisons is
$$1 + \frac{\lambda}{2}$$
for the successful case, and simply λ comparisons if the search is unsuccessful.
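To get a feel for these estimates, the following C sketch evaluates them for a given load factor. The function names are assumptions made for this example, and the formulas are the approximations quoted above, not exact counts.

```c
#include <assert.h>
#include <stddef.h>

/* Load factor: stored items divided by the number of slots. */
double load_factor(size_t items, size_t slots) {
    assert(slots > 0);
    return (double)items / (double)slots;
}

/* Linear probing, successful search: (1/2)(1 + 1/(1 - lambda)). */
double linear_probe_hit(double lambda) {
    assert(lambda >= 0.0 && lambda < 1.0);
    return 0.5 * (1.0 + 1.0 / (1.0 - lambda));
}

/* Linear probing, unsuccessful search: (1/2)(1 + (1/(1 - lambda))^2). */
double linear_probe_miss(double lambda) {
    assert(lambda >= 0.0 && lambda < 1.0);
    double t = 1.0 / (1.0 - lambda);
    return 0.5 * (1.0 + t * t);
}

/* Chaining, successful search: 1 + lambda / 2. */
double chaining_hit(double lambda) {
    assert(lambda >= 0.0);
    return 1.0 + lambda / 2.0;
}

/* Chaining, unsuccessful search: lambda. */
double chaining_miss(double lambda) {
    assert(lambda >= 0.0);
    return lambda;
}
```

At a load factor of 0.5, linear probing needs about 1.5 comparisons for a successful search and 2.5 for an unsuccessful one. At 0.75 these rise to 2.5 and 8.5, which shows how quickly performance degrades as the table fills.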
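#include <stdio.h>
#include <stdlib.h>
#define SIZE 20
struct DataItem {
   int data;
   int key;
};
struct DataItem* hashArray[SIZE];   //global array, so every slot starts as NULL
struct DataItem* dummyItem;         //marks a deleted slot (key -1)
int hashCode(int key) { return key % SIZE; }

struct DataItem* search(int key) {
   int hashIndex = hashCode(key);
   for(int i = 0; i < SIZE && hashArray[hashIndex] != NULL; i++) {   //stop at an empty cell
      if(hashArray[hashIndex]->key == key)
         return hashArray[hashIndex];
      hashIndex = (hashIndex + 1) % SIZE;   //go to next cell, wrapping around
   }
   return NULL;
}

void insert(int key, int data) {
   struct DataItem* newItem = (struct DataItem*) malloc(sizeof(struct DataItem));
   if(newItem == NULL) return;
   newItem->data = data;
   newItem->key = key;
   int hashIndex = hashCode(key);
   for(int i = 0; i < SIZE; i++) {
      if(hashArray[hashIndex] == NULL || hashArray[hashIndex]->key == -1) {   //empty or deleted cell
         hashArray[hashIndex] = newItem;
         return;
      }
      hashIndex = (hashIndex + 1) % SIZE;
   }
   free(newItem);   //table is full
}

struct DataItem* delete(struct DataItem* target) {
   if(target == NULL) return NULL;
   int hashIndex = hashCode(target->key);
   for(int i = 0; i < SIZE && hashArray[hashIndex] != NULL; i++) {
      if(hashArray[hashIndex]->key == target->key) {
         struct DataItem* temp = hashArray[hashIndex];
         hashArray[hashIndex] = dummyItem;   //assign a dummy item at deleted position
         return temp;
      }
      hashIndex = (hashIndex + 1) % SIZE;   //go to next cell
   }
   return NULL;
}

void display() {
   for(int i = 0; i < SIZE; i++) {
      if(hashArray[i] != NULL)
         printf(" (%d,%d)", hashArray[i]->key, hashArray[i]->data);
      else
         printf(" ~~ ");
   }
   printf("\n");
}

int main() {
   dummyItem = (struct DataItem*) malloc(sizeof(struct DataItem));
   dummyItem->data = -1;
   dummyItem->key = -1;
   insert(1, 20);
   insert(2, 70);
   insert(42, 80);
   insert(4, 25);
   insert(12, 44);
   insert(14, 32);
   insert(17, 11);
   insert(13, 78);
   insert(37, 97);
   printf("Insertion done: \nContents of Hash Table: ");
   display();
   int ele = 37;
   printf("The element to be searched: %d", ele);
   struct DataItem* item = search(ele);
   if(item != NULL) {
      printf("\nElement found: %d\n", item->key);
   } else {
      printf("\nElement not found\n");
   }
   delete(item);
   printf("Hash Table contents after deletion: ");
   display();
}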