Unit 2: Hashing
1. Hash Functions
A hash function is a function that takes an input (or "key") and returns an integer value, known as a
hash value or hash code, which typically maps the key to a location (or index) in a hash table. The
purpose of hash functions is to efficiently distribute keys across the table to minimize collisions and
achieve fast access to data. A good hash function should have the following properties:
• Deterministic: The same input always produces the same hash value.
• Uniform Distribution: The hash values should be uniformly distributed across the range of
possible hash table indices to minimize collisions.
• Minimize Collisions: The function should minimize the likelihood that two different keys
produce the same hash value (a collision).
1. Division Method: The most common hash function is the division method, where the key k is divided by the table size m (typically chosen to be a prime number) and the remainder is taken as the hash value:
h(k) = k mod m
2. Multiplication Method: In this method, the key k is multiplied by a constant A, the fractional part of the product is taken, and the result is then scaled by the table size m:
h(k) = ⌊m ⋅ (kA mod 1)⌋
o A is a constant with 0 < A < 1; Knuth suggests A ≈ (√5 − 1)/2 ≈ 0.618, a choice that tends to give a good distribution.
3. Universal Hashing: In universal hashing, a family of hash functions is chosen randomly, and
one function from the family is used to hash a key. This reduces the likelihood of collisions,
especially when adversarial inputs are possible.
• It is useful in scenarios where the inputs are not known in advance or when there is a high risk
of collisions due to predictable input patterns.
4. String Hashing: For strings, one common approach is the polynomial hash function. The idea
is to treat the string as a sequence of coefficients for a polynomial, where each character in
the string contributes to the hash based on its position in the string.
h(s) = (s₁⋅p⁰ + s₂⋅p¹ + s₃⋅p² + … + sₙ⋅pⁿ⁻¹) mod m
Here, p is a constant (often chosen as a small prime number, such as 31), and m is the table size.
• This method helps spread out string keys effectively across the hash table; a short Python sketch of the four methods above follows this list.
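The sketch below illustrates the four methods in Python. The specific constants (table sizes 11 and 16, Knuth's A, the prime 2³¹ − 1 for universal hashing, and p = 31 with m = 10⁹ + 9 for strings) are illustrative assumptions, not values fixed by these notes.

import math
import random

def division_hash(k, m=11):
    # Division method: h(k) = k mod m (m = 11 is an arbitrary prime table size).
    return k % m

def multiplication_hash(k, m=16):
    # Multiplication method: h(k) = floor(m * (k*A mod 1)), using Knuth's A.
    A = (math.sqrt(5) - 1) / 2
    return math.floor(m * ((k * A) % 1))

def make_universal_hash(m, p=2_147_483_647):
    # Universal hashing: randomly pick one function from the family
    # h(k) = ((a*k + b) mod p) mod m, where p is a prime larger than any key.
    a = random.randint(1, p - 1)
    b = random.randint(0, p - 1)
    return lambda k: ((a * k + b) % p) % m

def poly_hash(s, p=31, m=10**9 + 9):
    # Polynomial string hash: (s1*p^0 + s2*p^1 + ... + sn*p^(n-1)) mod m.
    h, p_pow = 0, 1
    for ch in s:
        h = (h + ord(ch) * p_pow) % m
        p_pow = (p_pow * p) % m
    return h

print(division_hash(100))          # 100 mod 11 = 1
print(multiplication_hash(123456))
h = make_universal_hash(11)
print(h(100), h(101))              # results depend on the randomly chosen a and b
print(poly_hash("hash"))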
2. Collision Resolution
Collisions happen when two different keys hash to the same index. With a finite table size this is inevitable once there are more possible keys than slots, so collisions must be handled in a way that preserves the efficiency of the hash table. There are several techniques to resolve them:
a) Open Addressing
In open addressing, all elements are stored in the hash table itself. When a collision occurs, the algorithm examines ("probes") other slots of the table, following a fixed pattern, until an available slot is found. The order in which slots are examined is called the probe sequence.
1. Linear Probing:
o In linear probing, when a collision occurs at index i, the algorithm checks the next index, i+1. If that index is occupied, it continues checking i+2, i+3, and so on, wrapping around modulo m, until an empty slot is found.
o Problem: Linear probing can lead to primary clustering, where runs of adjacent occupied slots build up, increasing the search time for subsequent insertions and lookups.
2. Quadratic Probing:
• In quadratic probing, the probe sequence is adjusted so that the indices examined are i+1², i+2², i+3², … (taken modulo m).
• This method reduces primary clustering, but it still suffers from secondary clustering: keys that hash to the same index follow exactly the same probe sequence.
3. Double Hashing:
• In double hashing, two hash functions are used. When a collision occurs, a second hash function supplies the offset (step size) for the probe sequence, so the i-th probe examines index
h(k, i) = (h₁(k) + i⋅h₂(k)) mod m
where h₁(k) is the first hash function and h₂(k) is the second hash function.
• This method tends to reduce clustering better than linear or quadratic probing, but it requires computing two hash functions; a sketch of all three probe sequences follows this list.
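The sketch below contrasts the three probe sequences. The table size m = 11, the key k = 7, and the hash functions h₁(k) = k mod m and h₂(k) = 1 + (k mod (m − 1)) are assumptions chosen for illustration.

def linear_probe(h1, i, m):
    # i-th probe of linear probing: h1, h1+1, h1+2, ...
    return (h1 + i) % m

def quadratic_probe(h1, i, m):
    # i-th probe of quadratic probing: h1, h1+1^2, h1+2^2, ...
    return (h1 + i * i) % m

def double_hash_probe(h1, h2, i, m):
    # i-th probe of double hashing: the second hash value h2 is the step size.
    return (h1 + i * h2) % m

m, k = 11, 7
h1 = k % m                          # assumed first hash function
h2 = 1 + (k % (m - 1))              # assumed second hash function (never zero)
print([linear_probe(h1, i, m) for i in range(4)])          # [7, 8, 9, 10]
print([quadratic_probe(h1, i, m) for i in range(4)])       # [7, 8, 0, 5]
print([double_hash_probe(h1, h2, i, m) for i in range(4)]) # [7, 4, 1, 9]

Because h₂(k) varies from key to key, two keys that collide at the same initial index generally follow different probe sequences, which is why double hashing suffers less from secondary clustering.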
b) Chaining
In chaining, each table index stores a linked list (or another dynamic data structure) of all elements that hash to that index. Instead of looking for an empty slot, we simply add the new element to the list at the corresponding index, so any number of colliding keys can share the same slot (a minimal sketch follows the lists below).
• Advantages:
o Chaining allows the hash table to grow dynamically, as the list at each index can hold
any number of elements.
o The performance of insertions, deletions, and lookups depends on the length of the
list at each index, which can be minimized by resizing the table when the load factor
exceeds a threshold.
• Disadvantages:
o Additional memory is needed to store the lists, and performance can degrade if many
keys hash to the same index (i.e., the lists become long).
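The sketch below is a minimal chained hash table in Python, using a plain list per slot in place of a linked list and the division method on Python's built-in hash; the class and method names are my own.

class ChainedHashTable:
    def __init__(self, m=8):
        # One (initially empty) chain per slot.
        self.m = m
        self.slots = [[] for _ in range(m)]

    def _index(self, key):
        # Division method applied to Python's built-in hash of the key.
        return hash(key) % self.m

    def put(self, key, value):
        chain = self.slots[self._index(key)]
        for pos, (k, _) in enumerate(chain):
            if k == key:               # key already present: overwrite its value
                chain[pos] = (key, value)
                return
        chain.append((key, value))     # colliding keys simply share the chain

    def get(self, key):
        for k, v in self.slots[self._index(key)]:
            if k == key:
                return v
        raise KeyError(key)

table = ChainedHashTable()
table.put("apple", 1)
table.put("banana", 2)
print(table.get("banana"))             # 2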
3. Load Factor and Rehashing
The load factor λ is defined as the ratio of the number of elements stored in the hash table, n, to the number of slots in the table, m:
λ = n/m
• A higher load factor increases the likelihood of collisions, which may degrade performance.
• If the load factor exceeds a certain threshold (commonly 0.7 to 0.8), the hash table is rehashed (resized). Rehashing involves creating a new table with a larger size, typically double the size of the current table, and re-inserting all elements into it; a sketch of this check and the rehashing step follows below.
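The sketch below shows the load-factor check and the rehashing step, assuming a chaining-style table whose slots hold (key, value) pairs and an assumed threshold of 0.75.

def load_factor(n, m):
    # λ = n / m: number of stored elements divided by number of slots.
    return n / m

def needs_rehash(n, m, threshold=0.75):
    # Trigger rehashing once the load factor exceeds the threshold.
    return load_factor(n, m) > threshold

def rehash(old_slots):
    # Create a table with double the number of slots and re-insert every element,
    # recomputing each index with the new table size.
    new_m = 2 * len(old_slots)
    new_slots = [[] for _ in range(new_m)]
    for chain in old_slots:
        for key, value in chain:
            new_slots[hash(key) % new_m].append((key, value))
    return new_slots

print(load_factor(6, 8), needs_rehash(6, 8))   # 0.75 False (0.75 is not above 0.75)
print(needs_rehash(7, 8))                      # True, so the table would be doubled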