
Hashing

Hashing is a technique for implementing a dictionary data structure using a hash table. A hash table uses a hash function to map keys to indexes in an array of buckets or slots. Collisions occur when distinct keys map to the same slot, and must be resolved. Separate chaining resolves collisions by storing keys in linked lists at each slot. Open addressing resolves collisions by probing to find the next empty slot using techniques like linear probing. Linear probing searches sequentially from the initial slot location until an empty slot is found, but can cause clustering that degrades performance.

Uploaded by

Adhara Mukherjee
Copyright: © All Rights Reserved

Hashing

Objective
• To learn the Hash Table as a dictionary data structure
• To understand the difference between a direct address table and a hash table
• To understand the concept of constant time searching
• To learn about various hash functions
• To identify collision as an inevitable event in hashing
• To learn various collision resolution techniques
• To be able to differentiate between open addressing and chained hashing
Dictionary data structure
• Dictionary:
  • Dynamic-set data structure for storing items indexed using keys.
  • Supports operations Insert, Search, and Delete.
  • Applications:
    • Symbol table of a compiler.
    • Memory-management tables in operating systems.
    • Large-scale distributed systems.
• Hash Tables:
  • Effective way of implementing dictionaries.
  • Generalization of ordinary arrays.
Direct Address Table (DAT)
• Suppose each element in the data structure has a key drawn from {0, 1, …, 9}, and no two elements have the same key
• An array T[0…9] can be used
• Each key in the universe U = {0, 1, …, 9} corresponds to an index in the table
• The set K = {2, 3, 5, 8} of actual keys determines the slots in the table that contain pointers to elements
• The other slots (heavily shaded in the figure) contain NIL
• This structure is called a direct-address table. Search, Insert, and Delete all run in O(1) time, since each is a single array access
• The same logic applies equally to a universe of any range
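The direct-address idea above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the class name `DirectAddressTable` and the stored strings are made up for the example, and `None` stands in for NIL.

```python
class DirectAddressTable:
    """Direct-address table over the universe U = {0, 1, ..., size - 1}."""

    def __init__(self, size=10):
        # One slot per possible key; None plays the role of NIL.
        self.slots = [None] * size

    def insert(self, key, value):
        self.slots[key] = value      # O(1): the key itself is the index

    def search(self, key):
        return self.slots[key]       # O(1): None means "not present"

    def delete(self, key):
        self.slots[key] = None       # O(1)


table = DirectAddressTable(10)
for k in (2, 3, 5, 8):               # the set K of actual keys
    table.insert(k, f"element-{k}")

print(table.search(5))   # element-5
print(table.search(4))   # None
```

Six of the ten slots stay empty here, which is exactly the waste that grows when the universe is much larger than the set of stored keys.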
Problem with DAT
Direct-address tables are impractical when:
i. The number of possible keys is large: the storage space required grows in proportion to the size of the universe
ii. The number of possible keys far exceeds the number of keys that are actually stored: this results in a large number of empty slots
In both cases a hash table is a better option
Hash table – an introduction

Figure: a hash function h maps keys to hash-table slots. Keys k2 and k5 map to the same slot and collide (a hash collision).
Hashing – an example
Hash function
• The hash function must be simple to compute and must distribute the keys evenly among the cells
• Four popular methods: division, folding, middle-squaring, and truncation
• Division method:
  • N is the size of the table (better if it is prime)
  • convert keys, K, into integers
  • use the remainder h(K) = K mod N as the hash value of the key K
  • Example: h(013402122) = 013402122 mod 1013 = 132, assuming N = 1013
• Truncation:
  • delete part of the key, K
  • use the remaining digits (bits, characters) as h(K)
  • Example: K = 013402122, last 3 digits: h(K) = 122
  • Notice that truncation does not spread keys uniformly into the table; thus it is often used in conjunction with other methods
• Folding:
  • divide the integer key, K, into sections
  • add, subtract, and/or multiply the sections to combine them into the final value, h(K)
  • Example: K = 013402122, sections 013, 402, 122; h(K) = 013 + 402 + 122 = 537; therefore h(013402122) = 537
• Middle-squaring:
  • choose a middle section of the integer key, K
  • square the chosen section
  • use a middle section of the result as h(K)
  • Example: K = 013402122, middle section 402; 402² = 161604; middle four digits: h(013402122) = 6160
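The four methods can be checked with a short Python sketch. The function names, the choice of three-digit sections, and the rule "keep the middle four digits of the square" are assumptions made for this illustration; the key is kept as a string so that its leading zero survives. Note that 402 squared is 161604, so the middle four digits come out as 6160.

```python
def h_division(key, n=1013):
    """Division method: remainder of the integer key modulo the table size."""
    return key % n


def h_truncation(digits):
    """Truncation: keep only the last three digits of the key."""
    return int(digits[-3:])


def h_folding(digits, width=3):
    """Folding: split the key into fixed-width sections and add them."""
    sections = [digits[i:i + width] for i in range(0, len(digits), width)]
    return sum(int(s) for s in sections)


def h_mid_square(digits):
    """Middle-squaring: square the middle three digits of a nine-digit key
    and keep the middle four digits of the result."""
    middle = digits[3:6]
    squared = str(int(middle) ** 2)
    start = (len(squared) - 4) // 2
    return int(squared[start:start + 4])


key = "013402122"
print(h_division(int(key)))   # 132
print(h_truncation(key))      # 122
print(h_folding(key))         # 537
print(h_mid_square(key))      # 6160
```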
Collision – an unavoidable issue in hashing
• If a new element hashes to the same value as an already inserted element, a collision occurs and must be resolved
• Formally, a collision occurs when two distinct keys, K1 ≠ K2, map to the same table address: h(K1) = h(K2)
• There are several methods for dealing with this:
– Separate chaining (Open Hashing)
– Open addressing (Closed Hashing)
  • Linear Probing
  • Quadratic Probing
  • Double Hashing
Separate Chaining
• The idea is to keep a list of all elements that hash to the same value.
– The array elements are pointers to the first nodes of the lists.
– A new item is inserted at the front of the list.
• Advantages:
– Better space utilization for large items.
– Simple collision handling: searching a linked list.
– Overflow: we can store more items than the hash table size.
– Deletion is quick and easy: deletion from the linked list.
• Example: keys 0, 1, 4, 9, 16, 25, 36, 49, 64, 81 with hash(key) = key % 10.
  Slot  Chain
  0     0
  1     81 → 1
  2     (empty)
  3     (empty)
  4     64 → 4
  5     25
  6     36 → 16
  7     (empty)
  8     (empty)
  9     49 → 9
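The chaining scheme and the example with the squares 0 to 81 can be reproduced with the sketch below. It is an illustration under stated assumptions: the class name `ChainedHashTable` is hypothetical, and Python lists stand in for the linked lists (inserting at index 0 models insertion at the front of the list).

```python
class ChainedHashTable:
    """Separate chaining with hash(key) = key % size."""

    def __init__(self, size=10):
        self.size = size
        # One chain per slot; a Python list stands in for a linked list.
        self.buckets = [[] for _ in range(size)]

    def _hash(self, key):
        return key % self.size

    def insert(self, key):
        chain = self.buckets[self._hash(key)]
        if key not in chain:
            chain.insert(0, key)     # new items go to the front of the chain

    def search(self, key):
        return key in self.buckets[self._hash(key)]

    def delete(self, key):
        chain = self.buckets[self._hash(key)]
        if key in chain:
            chain.remove(key)
            return True
        return False


table = ChainedHashTable(10)
for k in (0, 1, 4, 9, 16, 25, 36, 49, 64, 81):
    table.insert(k)

for slot, chain in enumerate(table.buckets):
    print(slot, chain)
# Slot 1 holds [81, 1], slot 4 holds [64, 4], slot 6 holds [36, 16], slot 9 holds [49, 9]
```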
Analysis of Separate Chaining
• Collisions are very likely.
– How likely and what is the average length of lists?
• Load factor λ definition:
– Ratio of the number of elements (N) in a hash table to the hash TableSize, i.e. λ = N / TableSize
– In implementations that resize automatically, a maximum load factor sets how full the hash table is allowed to get before its capacity is increased
– When the number of entries in the hash table exceeds the product of that maximum load factor and the current capacity, the hash table is rehashed (that is, internal data structures are rebuilt) so that the hash table has approximately twice the number of buckets
• Table size is not really important, but the load factor is
• λ can be less than or greater than 1
– For chaining λ is not bounded by 1; it can be > 1.
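The rehashing rule described above can be sketched as follows. Everything specific here is an assumption chosen for the illustration: the class name `ResizingChainedTable`, the starting capacity of 4, the maximum load factor of 0.75, and plain doubling (which does not keep the table size prime).

```python
class ResizingChainedTable:
    """Chained table that doubles its bucket count when it gets too full."""

    def __init__(self, capacity=4, max_load=0.75):
        self.capacity = capacity
        self.max_load = max_load     # assumed threshold, not from the slides
        self.count = 0
        self.buckets = [[] for _ in range(capacity)]

    def load_factor(self):
        return self.count / self.capacity

    def insert(self, key):
        chain = self.buckets[key % self.capacity]
        if key in chain:
            return
        chain.append(key)
        self.count += 1
        # Rehash once the entry count exceeds max_load * capacity.
        if self.count > self.max_load * self.capacity:
            self._rehash(2 * self.capacity)

    def _rehash(self, new_capacity):
        # Rebuild the bucket array and re-insert every stored key.
        old_keys = [k for chain in self.buckets for k in chain]
        self.capacity = new_capacity
        self.buckets = [[] for _ in range(new_capacity)]
        for k in old_keys:
            self.buckets[k % new_capacity].append(k)


t = ResizingChainedTable()
for k in range(10):
    t.insert(k)
print(t.capacity, t.load_factor())   # 16 0.625
```

With these settings the table grows from 4 to 8 buckets on the fourth insertion and from 8 to 16 on the seventh, so the load factor never stays above the chosen threshold.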
Cost of searching
• Cost = Constant time to evaluate the hash function + time to
traverse the list.
• Unsuccessful search:
– We have to traverse the entire list, so we need to compare λ nodes on the
average.
• Successful search:
– List contains the one node that stores the searched item + 0 or more other
nodes.
– On the average, we need to check half of the other nodes while searching for a
certain element
– Thus average search cost = 1 + λ/2
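As a rough sanity check of these estimates, the sketch below measures the actual chain traversal costs for the separate chaining example (keys 0, 1, 4, …, 81 with key % 10). The variable names are made up for the illustration, and the cost counted is the number of nodes compared, excluding the hash evaluation.

```python
keys = [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
table_size = 10

# Build the chains, inserting each new key at the front.
chains = [[] for _ in range(table_size)]
for k in keys:
    chains[k % table_size].insert(0, k)

load = len(keys) / table_size

# Unsuccessful search: the whole chain at a uniformly chosen slot is traversed.
avg_unsuccessful = sum(len(c) for c in chains) / table_size

# Successful search: a key at position p in its chain costs p + 1 comparisons.
avg_successful = sum(c.index(k) + 1 for c in chains for k in c) / len(keys)

print(load)               # 1.0
print(avg_unsuccessful)   # 1.0, matching the estimate of λ nodes
print(avg_successful)     # 1.4, close to the estimate 1 + λ/2 = 1.5
```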
Open Addressing
Collision Resolution with Open Addressing
• Separate chaining uses linked lists, incurs maintenance cost
– Requires the implementation of a second data structure.
• In an open addressing hashing system, all the data go inside the table.
– Thus, a bigger table is needed.
• Generally the load factor should be below 0.5.
– If a collision occurs, alternative cells are tried until an empty cell is found.
• More formally:
– Cells h₀(x), h₁(x), h₂(x), … are tried in succession, where hᵢ(x) = (hash(x) + f(i)) mod TableSize, with f(0) = 0.
– The function f is the collision resolution strategy.
• There are three common collision resolution strategies:
– Linear Probing, Quadratic probing, Double hashing
Linear Probing
• In linear probing, collisions are
resolved by sequentially scanning an
array (with wraparound) until an
empty cell is found.
– i.e. f is a linear function of i,
typically f(i)= i.
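• Example:
– Insert items with keys 89, 18, 49, 58, 9 into an empty hash table.
– Table size is 10.
– Hash function is hash(x) = x mod 10, with f(i) = i.
– Result: 89 goes to slot 9 and 18 to slot 8; 49 collides at slot 9 and wraps around to slot 0; 58 probes slots 8, 9, 0 and lands in slot 1; 9 probes slots 9, 0, 1 and lands in slot 2.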
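The example can be replayed with a small Python sketch of linear probing. The function name `linear_probe_insert` is hypothetical, and `None` marks an empty cell.

```python
def linear_probe_insert(table, key):
    """Insert key using hash(x) = x mod TableSize and f(i) = i."""
    size = len(table)
    home = key % size
    for i in range(size):
        slot = (home + i) % size     # sequential scan with wraparound
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("hash table is full")


table = [None] * 10
for key in (89, 18, 49, 58, 9):
    linear_probe_insert(table, key)

print(table)
# [49, 58, 9, None, None, None, None, None, 18, 89]
```

Slots 8, 9, 0, 1 and 2 now form one contiguous block of occupied cells, which is the clustering effect discussed next.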
Clustering Problem
• As long as table is big enough, a free cell can always be
found, but the time to do so can get quite large.
• Worse, even if the table is relatively empty, blocks of
occupied cells start forming.
• This effect is known as primary clustering.
• Any key that hashes into the cluster will require several
attempts to resolve the collision, and then it will add to
the cluster.
Analysis of search, delete and insertion operations
• The search algorithm follows the same probe sequence as the insert algorithm. In the table from the linear probing example:
– A search for 58 would involve 4 probes.
– A search for 19 would involve 5 probes.
• We must use lazy deletion (i.e. marking items as deleted)
– What happens if we delete an object that previously caused a collision? Emptying its cell would break the probe sequence: a later search for an object that collided with the deleted one would stop at the now-empty cell and report the object as missing.
• The average number of cells that are examined in an insertion using linear probing is roughly (1 + 1/(1 − λ)²) / 2 [Proof is beyond the scope of the current discussion]
• For a half-full table (λ = 0.5) this gives (1 + 1/0.25) / 2 = 2.5 as the average number of cells examined during an insertion.
• Primary clustering is a problem at high load factors. For half-empty tables the effect is not disastrous.
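Lazy deletion can be sketched with a tombstone marker. This is a minimal illustration, not a complete implementation: the names `LinearProbingTable` and `DELETED` are hypothetical, keys are assumed to be distinct integers, and insertion simply reuses the first empty or deleted cell it meets.

```python
EMPTY = None
DELETED = object()   # tombstone: "something was here, keep probing"


class LinearProbingTable:
    def __init__(self, size=10):
        self.slots = [EMPTY] * size

    def _probe(self, key):
        size = len(self.slots)
        home = key % size
        for i in range(size):
            yield (home + i) % size

    def insert(self, key):
        for slot in self._probe(key):
            if self.slots[slot] is EMPTY or self.slots[slot] is DELETED:
                self.slots[slot] = key
                return slot
        raise RuntimeError("hash table is full")

    def search(self, key):
        for slot in self._probe(key):
            entry = self.slots[slot]
            if entry is EMPTY:
                return None          # a truly empty cell ends the search
            if entry is not DELETED and entry == key:
                return slot
        return None

    def delete(self, key):
        slot = self.search(key)
        if slot is None:
            return False
        self.slots[slot] = DELETED   # mark the cell, do not empty it
        return True


t = LinearProbingTable(10)
for key in (89, 18, 49, 58, 9):
    t.insert(key)

t.delete(49)              # 49 sat in slot 0, on the probe path of 58
print(t.search(58))       # 1: still found, because slot 0 is a tombstone
print(t.search(49))       # None
```

If `delete` wrote `EMPTY` instead of `DELETED`, the search for 58 would stop at slot 0 and wrongly report the key as absent.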
Linear Probing – Analysis -- Example
• What is the average number of probes for a successful search and an unsuccessful search for this hash table?
– Hash Function: h(x) = x mod 11
  Slot  Key
  0     9
  1     (empty)
  2     2
  3     13
  4     25
  5     24
  6     (empty)
  7     (empty)
  8     30
  9     20
  10    10
Successful Search (key: slots probed [number of probes]):
20: 9 [1]; 30: 8 [1]; 2: 2 [1]; 13: 2, 3 [2]; 25: 3, 4 [2]; 24: 2, 3, 4, 5 [4]; 10: 10 [1]; 9: 9, 10, 0 [3].
Avg. Probe for SS = (1+1+1+2+2+4+1+3)/8 = 15/8
Unsuccessful Search (home slot: slots probed [number of probes]):
– We assume that the hash function uniformly distributes the keys.
0: 0, 1 [2]; 1: 1 [1]; 2: 2, 3, 4, 5, 6 [5]; 3: 3, 4, 5, 6 [4]; 4: 4, 5, 6 [3]; 5: 5, 6 [2]; 6: 6 [1]; 7: 7 [1]; 8: 8, 9, 10, 0, 1 [5]; 9: 9, 10, 0, 1 [4]; 10: 10, 0, 1 [3]
Avg. Probe for US = (2+1+5+4+3+2+1+1+5+4+3)/11 = 31/11
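The two averages can be recomputed mechanically. The sketch below rebuilds the table from the slot contents given in the example and counts probes; the helper names are made up for the illustration, and exact fractions are used so the results can be compared directly with 15/8 and 31/11.

```python
from fractions import Fraction

SIZE = 11
table = [None] * SIZE
# Slot contents as given in the example (h(x) = x mod 11, linear probing).
for slot, key in {0: 9, 2: 2, 3: 13, 4: 25, 5: 24, 8: 30, 9: 20, 10: 10}.items():
    table[slot] = key


def probes_successful(key):
    """Cells examined until the stored key is found."""
    slot, count = key % SIZE, 1
    while table[slot] != key:
        slot = (slot + 1) % SIZE
        count += 1
    return count


def probes_unsuccessful(home):
    """Cells examined from a home slot until an empty cell is reached."""
    slot, count = home, 1
    while table[slot] is not None:
        slot = (slot + 1) % SIZE
        count += 1
    return count


keys = [k for k in table if k is not None]
avg_ss = Fraction(sum(probes_successful(k) for k in keys), len(keys))
avg_us = Fraction(sum(probes_unsuccessful(h) for h in range(SIZE)), SIZE)

print(avg_ss)   # 15/8
print(avg_us)   # 31/11
```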
Quadratic Probing

• Quadratic probing eliminates the primary clustering problem of linear probing.
• Collision function is quadratic.
– The popular choice is f(i) = i².
• If the hash function evaluates to h and a search in cell h is inconclusive, we try cells h + 1², h + 2², …, h + i².
– i.e. it examines cells 1, 4, 9 and so on away from the original probe.
• Remember that subsequent probe points are a quadratic number of positions from the original probe point.
An Example with quadratic probing

Figure: a quadratic probing hash table after each insertion (note that the table size was poorly chosen because it is not a prime number).
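Since the figure itself is not available here, the sketch below shows quadratic probing on an assumed input: it reuses the keys 89, 18, 49, 58, 9 and the table size 10 from the linear probing example. The function name `quadratic_probe_insert` is hypothetical.

```python
def quadratic_probe_insert(table, key):
    """Insert key using hash(x) = x mod TableSize and f(i) = i * i."""
    size = len(table)
    home = key % size
    for i in range(size):
        slot = (home + i * i) % size   # cells 0, 1, 4, 9, ... away from home
        if table[slot] is None:
            table[slot] = key
            return slot
    # Unlike linear probing, the sequence may miss empty cells entirely.
    raise RuntimeError("no empty cell found along the probe sequence")


table = [None] * 10
for key in (89, 18, 49, 58, 9):
    quadratic_probe_insert(table, key)

print(table)
# [49, None, 58, 9, None, None, None, None, 18, 89]
```

Compared with linear probing on the same keys, 58 and 9 jump past the occupied block instead of extending it: 58 lands in slot 2 and 9 in slot 3.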
Quadratic Probing (Continues…)
• Problem:
– We cannot be sure that we will probe all locations in the table (i.e. there is no guarantee of finding an empty cell if the table is more than half full).
– If the hash table size is not prime, this problem is much more severe.
• However, there is a theorem stating that:
– If the table size is prime and the load factor is not larger than 0.5, all probes will be to different locations and an item can always be inserted.
• Although quadratic probing eliminates primary clustering, elements that hash to the same location will probe the same alternative cells. This is known as secondary clustering.
• Techniques that eliminate secondary clustering are available.
– The most popular is double hashing.
Double Hashing
• To reduce secondary clustering, we can use a second hash function to
generate different probe sequences for different keys
hash(key)
( hash(key) + 1 * hash2 (key) ) % m
( hash(key) + 2 * hash2 (key) ) % m
( hash(key) + 3 * hash2 (key) ) % m

• hash2 is called the secondary hash function
• If hash2(k) = 1, then it is the same as linear probing
• Remember, hash2(key) must never be 0: every probe would then land on the original slot, and the collision would not be resolved
Double hashing example
• Given hash(k) = k mod 7 and hash2(k) = k mod 5. The hash table after key 21 has been placed is as follows:
  Slot  Key
  0     14
  1     21
  2     (empty)
  3     (empty)
  4     18
  5     (empty)
  6     (empty)
• 14 and 18 are already inserted. Suppose 21 is the next key to be inserted. Apply double hashing to find a slot in the table for key 21.
• hash(21) = 21 mod 7 = 0 -> collision with 14. Apply the second hash function to resolve the collision.
• hash2(21) = 21 mod 5 = 1. So 21 is placed in slot (0 + 1 × 1) mod 7 = 1.
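The worked example can be replayed with the following sketch. The function name `double_hash_insert` is hypothetical. Because hash2(k) = k mod 5 can evaluate to 0 (for k = 35, say), the sketch raises an error in that case rather than looping on the same slot; how to handle it properly is left open here.

```python
def double_hash_insert(table, key):
    """Insert key with hash(k) = k mod m and hash2(k) = k mod 5."""
    m = len(table)
    home = key % m
    step = key % 5
    if table[home] is None:
        table[home] = key
        return home
    if step == 0:
        # A zero step would probe the home slot forever.
        raise ValueError("hash2(key) is 0, so the collision cannot be resolved")
    for i in range(1, m):
        slot = (home + i * step) % m
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("no empty slot found along the probe sequence")


table = [None] * 7
for key in (14, 18, 21):
    double_hash_insert(table, key)

print(table)
# [14, 21, None, None, 18, None, None]
```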
The relative efficiency of four collision-resolution methods
Applications of hashing

• Compilers use hash tables to implement the symbol table (a data structure to keep track of declared variables).
• Game programs use hash tables to keep track of positions they have encountered (transposition table).
• Online spelling checkers.
Summary
• Hash tables can be used to implement the insert and find operations in constant average time.
– It depends on the load factor, not on the number of items in the table.
• It is important to have a prime TableSize and a correct choice of load factor and hash function.
• For separate chaining the load factor should be close to 1.
• For open addressing the load factor should not exceed 0.5 unless this is completely unavoidable.
– Rehashing can be implemented to grow (or shrink) the table.
Questions:
Q.1 A hash table of length 10 uses open addressing with hash function h(k) = k mod 10, and linear probing. After inserting 6 values into an empty hash table, the table shown in Table 1 is obtained. Which one of the following choices gives a possible order in which the key values could have been inserted into the table?
A. 46, 42, 34, 52, 23, 33 B. 34, 42, 23, 52, 33, 46
C. 46, 34, 42, 23, 52, 33 D. 42, 46, 33, 23, 34, 52
[Table 1: the resulting hash table; not reproduced here]
Q.2 Consider a 13 element hash table for which f(key) = key mod 13 is used with integer keys. Assuming linear probing is used for collision resolution, at which location would the key 103 be inserted, if the keys 661, 182, 24 and 103 are inserted in that order?
(A) 0 (B) 1 (C) 11 (D) 12
Q.3 The keys 12, 18, 13, 2, 3, 23, 5 and 15 are inserted into an initially empty hash
table of length 10 using open addressing with hash function h(k) = k mod 10 and
linear probing. What is the resultant hash table?

A. (i) B. (ii) C. (iii) D. (iv)


Q.4 Consider a hash table of size seven, with starting index zero, and a hash function (3x +
4) mod 7. Assuming the hash table is initially empty and linear probing collision resolution
technique is to be used, which of the following is the contents of the table when the
sequence 1, 3, 8, 10 is inserted into the table using closed hashing? Note that ‘_’ denotes
an empty location in the table.
(A) 8, _, _, _, _, _, 10 (B) 1, 8, 10, _, _, _, 3
(C) 1, _, _, _, _, _,3 (D) 1, 10, 8, _, _, _, 3

Q.5 Given the following input (4322, 1334, 1471, 9679, 1989, 6171, 6173, 4199) and the
hash function x mod 10, which of the following statements are true? (GATE CS 2004)
i. 9679, 1989, 4199 hash to the same value
ii. 1471, 6171 hash to the same value
iii. All elements hash to the same value
iv. Each element hashes to a different value
(A) i only (B) ii only (C) i and ii only (D) iii or iv
Q.6 Which one of the following hash functions on integers will distribute keys most uniformly over 10 buckets numbered 0 to 9 for i ranging from 0 to 2020?
(A) h(i) = i² mod 10
(B) h(i) = i³ mod 10
(C) h(i) = (11 ∗ i²) mod 10
(D) h(i) = (12 ∗ i) mod 10

Q.7 A hash function h defined h(key)=key mod 7, with linear probing, is used to
insert the keys 44, 45, 79, 55, 91, 18, 63 into a table indexed from 0 to 6. What will
be the location of key 18?
(A) 3 (B) 4 (C) 5 (D) 6
Q.8 An advantage of chained hash table (external hashing) over the open
addressing scheme is
(A) Worst case complexity of search operations is less
(B) Space used is less
(C) Deletion is easier
(D) None of the above

Q.9 Assume ord(A) = 1, ord(B) = 2, …, ord(E) = 5, etc. Insert the characters of the string K R P C S N Y T J M into a hash table of size 10. Use the hash function
h(x) = ( ord(x) – ord(A) + 1 ) mod 10
If linear probing is used to resolve collisions, which of the following insertions causes a collision?
(A) Y (B) C (C) M (D) P
Q.10 A hash table with ten buckets and one slot per bucket is shown in Table 2. The symbols S1 to S7 were initially entered using a hashing function with linear probing. What is the maximum number of comparisons needed in searching for an item that is not present?
0 S7
1 S1
2
3 S4
4 S2
5
6 S5
7
8 S6
9 S3
Table 2
(A) 4 (B) 5 (C) 6 (D) 3
Answers:
1–C
2–B
3–C
4–B
5–C
6–B
7–C
8–C
9–C
10 – B
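Several of these answers can be checked by simulation. The sketch below is an illustrative helper, not part of the original question set: `build_table` is a hypothetical name, it assumes no more keys than slots, and for Q.6 it only counts how many of the 10 buckets each candidate function ever uses.

```python
def build_table(keys, size, h):
    """Insert keys in order with linear probing; assumes len(keys) <= size."""
    table = [None] * size
    for key in keys:
        slot = h(key) % size
        while table[slot] is not None:
            slot = (slot + 1) % size
        table[slot] = key
    return table


q2 = build_table([661, 182, 24, 103], 13, lambda k: k % 13)
q4 = build_table([1, 3, 8, 10], 7, lambda x: (3 * x + 4) % 7)
q7 = build_table([44, 45, 79, 55, 91, 18, 63], 7, lambda k: k % 7)

print(q2.index(103))   # 1, option B
print(q4)              # [1, 8, 10, None, None, None, 3], option B
print(q7.index(18))    # 5, option C

# Q.6: number of distinct buckets reached for i in 0..2020.
candidates = {
    "A": lambda i: (i ** 2) % 10,
    "B": lambda i: (i ** 3) % 10,
    "C": lambda i: (11 * i ** 2) % 10,
    "D": lambda i: (12 * i) % 10,
}
buckets_used = {name: len({f(i) for i in range(2021)})
                for name, f in candidates.items()}
print(buckets_used)    # {'A': 6, 'B': 10, 'C': 6, 'D': 5}
```

Only option B reaches all ten buckets, which agrees with the listed answer for Q.6.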
Any Questions?
