05 Hashing
Algorithms
Introduction
• Searching methods studied so far …
• Linear searching (O(n) search time)
• Binary searching (O(lg n) search time)
How does it work?
• The table part is just an ordinary array; it is the hash function that we
are interested in.
• Hash function: h(k) = k % m, here with m = 8
• Where are the pairs stored?
• The hash function, when applied to equal objects, returns the same
number for each.
Hash Function Methods
Division Hash Method: Map a key k into one of the m slots by taking
the remainder of k divided by m.
h(k)=k mod m
Advantage:
• Fast: requires only one operation
Disadvantage:
• Certain values of m are bad, e.g.,
• Powers of 2
• Non-prime numbers
Generally, a prime number is the best choice for m to spread keys evenly.
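A minimal sketch of the division method, assuming non-negative integer keys. The function name divisionHash is hypothetical and only serves this illustration.

```cpp
#include <cassert>

// Division-method hash: slot index = k mod m.
// Assumption: k is non-negative and m > 0 (ideally m is prime).
unsigned divisionHash(unsigned k, unsigned m) {
    assert(m > 0);
    return k % m;
}
```

For example, divisionHash(57, 8) evaluates 57 mod 8, which is slot 1.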
Hash Function Methods
The Folding Method: The key K is partitioned into a number of parts,
each of which has the same length as the required address with the
possible exception of the last part.
• These parts are then added together, ignoring the final carry, to form
an address.
Example:
• If key = 356942781 is to be transformed into a three-digit address
• P1 = 356, P2 = 942, P3 = 781 are added to give 2079; dropping the final carry yields 079
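The folding steps can be sketched as follows. This is only an illustration under stated assumptions: the key is a non-negative decimal integer, parts are cut from left to right, and foldHash is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Folding hash: split the decimal digits of `key` into parts of `digits`
// digits (left to right, last part may be shorter), add the parts, and
// drop any carry beyond `digits` digits.
unsigned foldHash(std::uint64_t key, std::size_t digits) {
    assert(digits >= 1 && digits <= 9);
    const std::string s = std::to_string(key);
    std::uint64_t modulus = 1;
    for (std::size_t i = 0; i < digits; ++i) modulus *= 10;
    std::uint64_t sum = 0;
    for (std::size_t pos = 0; pos < s.size(); pos += digits)
        sum += std::stoull(s.substr(pos, digits));  // one part of the key
    return static_cast<unsigned>(sum % modulus);    // ignore the final carry
}
```

With the key from the example, foldHash(356942781, 3) adds 356 + 942 + 781 = 2079 and keeps 079, i.e. address 79.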
Hash Function Methods
The Mid-Square Method: The key K is multiplied by itself and the
address is obtained by selecting an appropriate number of digits
from the middle of the square.
Example:
• If key = 123456 is to be transformed:
• (123456)² = 15241383936
• If a 3-digit address is required, positions 5 to 7 are chosen, giving 138
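A small sketch of the mid-square computation, assuming positions are counted from the left starting at 1 (as in the example) and that the key is small enough for its square to fit in 64 bits. The name midSquareHash is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Mid-square hash: square the key, then take `count` decimal digits of the
// square starting at 1-based position `start` (counted from the left).
// Assumption: key < 2^32, so key * key does not overflow.
unsigned midSquareHash(std::uint64_t key, std::size_t start, std::size_t count) {
    const std::string sq = std::to_string(key * key);
    assert(start >= 1 && count >= 1 && count <= 9);
    assert(start - 1 + count <= sq.size());
    return static_cast<unsigned>(std::stoul(sq.substr(start - 1, count)));
}
```

For key 123456 the square is 15241383936, and midSquareHash(123456, 5, 3) selects digits 5 to 7, giving 138.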
Hashing a String Key
Table size [0..99]
A..Z ---> 1,2, ...26
0..9 ----> 27,...36
Example (Imperfect) Hash Function
• Pairs are: (22,a),(33,c),(3,d),(72,e),(85,f)
• (key, value) pairs
• Hash function h is k % m = k % 8
• Where would a new pair (57, g) be stored?
• 57 % 8 = 1, but slot 1 already holds (33,c)
Now what?
Table: [0] (72,e)  [1] (33,c)  [3] (3,d)  [5] (85,f)  [6] (22,a)
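The collision in this example can be checked with a short sketch. It assumes non-negative integer keys and the hash k % m from the slide; the helper name collides is hypothetical.

```cpp
#include <cassert>
#include <vector>

// Returns true if `key` hashes (k % m) to a slot that is already the home
// slot of one of the keys in `stored`.
bool collides(const std::vector<int>& stored, int key, int m) {
    assert(m > 0 && key >= 0);
    for (int k : stored)
        if (k % m == key % m) return true;
    return false;
}
```

With the stored keys 22, 33, 3, 72, 85 and m = 8, key 57 collides because 33 % 8 and 57 % 8 are both 1.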
• Advantages of separate chaining:
• Better space utilization for a large number of items.
• Simple collision handling: searching a linked list.
• Overflow: we can store more items than the hash table size.
• Deletion is quick and easy: deletion from the linked list.
Separate Chaining Example
Keys: 0, 1, 4, 9, 16, 25, 36, 49, 64, 81
hash(key) = key % 10.
Bucket   Chain
0        0
1        81 -> 1
2
3
4        64 -> 4
5        25
6        36 -> 16
7
8
9        49 -> 9
Separate Chaining Operations
• Initialization: all entries are set to NULL
• Find:
• locate the cell using hash function.
• sequential search on the linked list in that cell.
• Insertion:
• Locate the cell using hash function.
• (If the item does not exist) insert it as the first item in the list.
• Deletion:
• Locate the cell using hash function.
• Delete the item from the linked list.
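The operations above can be sketched as a small class. This is a minimal illustration, not a definitive implementation: it assumes non-negative int keys, uses hash(key) = key % size as in the example, and the names ChainedTable and bucket are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <forward_list>
#include <vector>

class ChainedTable {
public:
    // Initialization: every bucket starts as an empty list.
    explicit ChainedTable(std::size_t size) : buckets_(size) { assert(size > 0); }

    bool find(int key) const {
        for (int k : buckets_[slot(key)])      // sequential search in one chain
            if (k == key) return true;
        return false;
    }

    bool insert(int key) {
        if (find(key)) return false;           // item already exists
        buckets_[slot(key)].push_front(key);   // insert as the first item
        return true;
    }

    bool remove(int key) {
        auto& list = buckets_[slot(key)];
        for (auto prev = list.before_begin(), cur = list.begin();
             cur != list.end(); prev = cur, ++cur) {
            if (*cur == key) { list.erase_after(prev); return true; }
        }
        return false;
    }

    const std::forward_list<int>& bucket(std::size_t i) const { return buckets_[i]; }

private:
    std::size_t slot(int key) const {
        assert(key >= 0);
        return static_cast<std::size_t>(key) % buckets_.size();
    }
    std::vector<std::forward_list<int>> buckets_;
};
```

Inserting the keys 0, 1, 4, 9, 16, 25, 36, 49, 64, 81 into a table of size 10 leaves 81 in front of 1 in bucket 1, because each new item goes to the head of its list.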
Separate Chaining Analysis
• We hope that the number of elements per bucket is roughly equal,
so that the lists will be short
Separate Chaining Analysis
Average time per dictionary operation:
• With TableSize buckets and N elements in the dictionary, there are on
average N/TableSize elements per bucket
• How long does it take to search for an element with a given key?
• Worst case: Θ(n), when all keys land in the same bucket
• Basic idea (open addressing):
• Insertion: if a slot is full, try another one, until you find an empty one
• Search: follow the same sequence of probes
• Deletion: more difficult ... (we’ll see why)
• Example:
• Insert items with keys: 89, 18, 49, 58, 9 into an empty hash table.
• Table size is 10.
• Hash function is hash(x) = x mod 10.
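The insertion rule can be sketched as below, assuming non-negative int keys and a hypothetical EMPTY marker for free slots. The name linearInsert is also an assumption for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

const int EMPTY = -1;  // hypothetical marker for a free slot

// Inserts `key` using linear probing with hash(x) = x mod table size.
// Returns the slot used, or -1 if the table is full.
int linearInsert(std::vector<int>& table, int key) {
    const std::size_t m = table.size();
    assert(m > 0 && key >= 0);
    for (std::size_t i = 0; i < m; ++i) {
        const std::size_t slot = (static_cast<std::size_t>(key) + i) % m;  // wraps around
        if (table[slot] == EMPTY) {
            table[slot] = key;
            return static_cast<int>(slot);
        }
    }
    return -1;
}
```

Inserting 89, 18, 49, 58, 9 into a table of size 10 places them in slots 9, 8, 0, 1 and 2 respectively.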
Figure 20.4: Linear probing hash table after each insertion
Linear Probing: Searching a key
• The Search operation follows the same probe sequence as the
insert algorithm.
• A find for 58 would involve 4 probes.
• A find for 19 would involve 5 probes.
• Three cases:
• Case 1: the position in the table is occupied by an element with an equal key (found)
• Case 2: the position in the table is empty (not found)
• Case 3: the position in the table is occupied by a different element
• For Case 3, probe the next higher index until the element is found
or an empty position is found
• The process wraps around to the beginning of the table
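A sketch of the search with a probe counter, so the probe counts quoted above can be reproduced. It assumes non-negative int keys and a hypothetical EMPTY marker; SearchResult and linearFind are illustrative names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

const int EMPTY = -1;  // hypothetical marker for a free slot

struct SearchResult {
    bool found;
    int probes;
};

// Follows the same probe sequence as insertion: home slot, then the next
// higher index, wrapping around at the end of the table.
SearchResult linearFind(const std::vector<int>& table, int key) {
    const std::size_t m = table.size();
    assert(m > 0 && key >= 0);
    int probes = 0;
    for (std::size_t i = 0; i < m; ++i) {
        const std::size_t slot = (static_cast<std::size_t>(key) + i) % m;
        ++probes;
        if (table[slot] == key) return {true, probes};     // equal key: found
        if (table[slot] == EMPTY) return {false, probes};  // empty: not present
        // different element: keep probing
    }
    return {false, probes};  // table full and key absent
}
```

On the table built from 89, 18, 49, 58, 9 (size 10), a find for 58 takes 4 probes and a find for 19 takes 5 probes before reaching an empty slot.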
Linear Probing: Deleting a key
• Problems
• Cannot mark the slot as empty
• Impossible to retrieve keys inserted after that slot was occupied
• Solution
• Mark the slot with a sentinel value DELETED
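The reason for the DELETED sentinel can be seen in a short sketch. EMPTY and DELETED are hypothetical marker values (keys are assumed non-negative), and findSlot and lazyDelete are illustrative names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

const int EMPTY = -1;    // hypothetical marker: slot never used
const int DELETED = -2;  // hypothetical marker: slot held a key that was removed

// Returns the slot holding `key`, or -1. Probing continues past DELETED
// slots and stops only at an EMPTY slot.
int findSlot(const std::vector<int>& table, int key) {
    const std::size_t m = table.size();
    assert(m > 0 && key >= 0);
    for (std::size_t i = 0; i < m; ++i) {
        const std::size_t slot = (static_cast<std::size_t>(key) + i) % m;
        if (table[slot] == key) return static_cast<int>(slot);
        if (table[slot] == EMPTY) return -1;
    }
    return -1;
}

// Lazy deletion: mark the slot as DELETED instead of EMPTY.
bool lazyDelete(std::vector<int>& table, int key) {
    const int slot = findSlot(table, key);
    if (slot < 0) return false;
    table[static_cast<std::size_t>(slot)] = DELETED;
    return true;
}
```

After deleting 89 from the example table, key 49 (which was pushed from slot 9 to slot 0) is still reachable. Had slot 9 been reset to EMPTY, the search for 49 would stop there and fail.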
Primary Clustering Problem
• As long as the table is big enough, a free cell can always be found,
but the time to do so can get quite large.
• Any key that hashes into the cluster will require several attempts
to resolve the collision, and then it will add to the cluster.
Linear Probing Analysis
• Initialization: O(b), where b is the number of buckets
Linear Probing Analysis -- Example
• What is the average number of probes for a successful search and an
unsuccessful search for this hash table?
• Hash function: h(x) = x mod 11

Index:  0   1   2   3   4   5   6   7   8   9   10
Key:    9   -   2   13  25  24  -   -   30  20  10

Successful search (slots probed for each key):
• 20: 9 -- 30: 8 -- 2: 2 -- 13: 2,3 -- 25: 3,4
• 24: 2,3,4,5 -- 10: 10 -- 9: 9,10,0
Avg. probes for SS = (1+1+1+2+2+4+1+3)/8 = 15/8
Unsuccessful search (slots probed from each starting index):
• We assume that the hash function uniformly distributes the keys.
• 0: 0,1 -- 1: 1 -- 2: 2,3,4,5,6 -- 3: 3,4,5,6
• 4: 4,5,6 -- 5: 5,6 -- 6: 6 -- 7: 7 -- 8: 8,9,10,0,1
• 9: 9,10,0,1 -- 10: 10,0,1
Avg. probes for US = (2+1+5+4+3+2+1+1+5+4+3)/11 = 31/11
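The two totals in this example can be recomputed with a short sketch. It assumes non-negative int keys and a hypothetical EMPTY marker; the function names are illustrative only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

const int EMPTY = -1;  // hypothetical marker for a free slot

// Counts probes from `start` until `target` or an EMPTY slot is reached.
int probesUntil(const std::vector<int>& table, std::size_t start, int target) {
    assert(!table.empty());
    const std::size_t m = table.size();
    int probes = 0;
    for (std::size_t i = 0; i < m; ++i) {
        ++probes;
        const int v = table[(start + i) % m];
        if (v == target || v == EMPTY) break;
    }
    return probes;
}

// Sum of probes needed to find every key stored in the table.
int totalSuccessfulProbes(const std::vector<int>& table) {
    int total = 0;
    for (int key : table)
        if (key != EMPTY)
            total += probesUntil(table, static_cast<std::size_t>(key) % table.size(), key);
    return total;
}

// Sum of probes needed to reach an empty slot from every home index.
int totalUnsuccessfulProbes(const std::vector<int>& table) {
    int total = 0;
    for (std::size_t start = 0; start < table.size(); ++start)
        total += probesUntil(table, start, EMPTY);
    return total;
}
```

For the table above the totals are 15 and 31, which divided by 8 keys and 11 starting indices give the averages 15/8 and 31/11.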
Quadratic Probing
• Quadratic probing eliminates the primary clustering problem of
linear probing.
Quadratic Probing
• Problem:
• We cannot be sure that we will probe all locations in the table
(i.e., there is no guarantee of finding an empty cell if the table is
more than half full).
• If the hash table size is not prime, this problem is much more
severe.
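The effect of the table size can be illustrated by counting how many distinct slots a quadratic probe sequence reaches. The probe form (home + i*i) mod m is an assumption here, since the slides do not spell out the formula, and distinctQuadraticSlots is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Number of distinct slots visited by (home + i*i) % m for i = 0 .. m-1.
// Assumption: this is the quadratic probe sequence in use.
std::size_t distinctQuadraticSlots(std::size_t home, std::size_t m) {
    assert(m > 0);
    std::set<std::size_t> seen;
    for (std::size_t i = 0; i < m; ++i)
        seen.insert((home + i * i) % m);
    return seen.size();
}
```

With a prime size of 11 the sequence reaches 6 of the 11 slots (just over half), while with a size of 16 it reaches only 4 of the 16 slots.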
Some Considerations
• What happens if the load factor gets too high?
• Dynamically expand the table as soon as the load factor reaches 0.5;
this is called rehashing.
• Always roughly double the size, rounding up to a prime number.
• When expanding the hash table, reinsert all items into the new table
using the new hash function.
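A minimal sketch of rehashing under stated assumptions: a linear-probing table of non-negative int keys, a hypothetical EMPTY marker, and a new size chosen as the first prime at or above twice the old size. The names isPrime, nextPrime and rehash are illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

const int EMPTY = -1;  // hypothetical marker for a free slot

bool isPrime(std::size_t n) {
    if (n < 2) return false;
    for (std::size_t d = 2; d * d <= n; ++d)
        if (n % d == 0) return false;
    return true;
}

// Smallest prime greater than or equal to n.
std::size_t nextPrime(std::size_t n) {
    while (!isPrime(n)) ++n;
    return n;
}

// Builds a larger table and reinserts every key, hashing with the NEW size.
std::vector<int> rehash(const std::vector<int>& old) {
    std::vector<int> bigger(nextPrime(2 * old.size()), EMPTY);
    for (int key : old) {
        if (key == EMPTY) continue;
        assert(key >= 0);
        std::size_t slot = static_cast<std::size_t>(key) % bigger.size();
        while (bigger[slot] != EMPTY) slot = (slot + 1) % bigger.size();
        bigger[slot] = key;
    }
    return bigger;
}
```

Keys cannot simply be copied slot by slot, because their home positions change when the modulus changes from the old size to the new one.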
Analysis of Quadratic Probing
• Quadratic probing has not yet been mathematically analyzed
Double Hashing
• Use one hash function to determine the first slot
• Use a second hash function to determine the increment for
the probe sequence
• h(k,i) = (h1(k) + i h2(k) ) mod m, i=0,1,...
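Example (m = 13), inserting key 14:
h(14,0) = h1(14) = 14 mod 13 = 1
h(14,1) = (h1(14) + h2(14)) mod 13 = (1 + 4) mod 13 = 5
h(14,2) = (h1(14) + 2 h2(14)) mod 13 = (1 + 8) mod 13 = 9
Table, slots 6 to 12: [7] 72  [9] 14  [11] 50 (the other slots are empty)
Key 14 ends up in slot 9.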
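The probe sequence for key 14 can be reproduced with a short sketch. h1(k) = k mod 13 follows the example; h2(k) = 1 + (k mod 11) is an assumption chosen to match h2(14) = 4, since the slides do not define h2. The name doubleHashProbe is hypothetical.

```cpp
#include <cassert>

int h1(int k) { return k % 13; }        // first hash, as in the example
int h2(int k) { return 1 + (k % 11); }  // ASSUMED second hash (gives h2(14) = 4)

// i-th probe position: h(k, i) = (h1(k) + i * h2(k)) mod m, with m = 13.
int doubleHashProbe(int k, int i) {
    assert(k >= 0 && i >= 0);
    return (h1(k) + i * h2(k)) % 13;
}
```

For k = 14 the first three probes are slots 1, 5 and 9. Because h2 never returns 0, the sequence always moves to a different slot on each step.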
The relative efficiency of four collision-resolution methods
Hashing Applications
• Compilers use hash tables to implement the symbol table (a
data structure to keep track of declared variables).
• Game programs use hash tables to keep track of positions they
have encountered (transposition table)
• Online spelling checkers.
Summary
• Hash tables can be used to implement the insert and find
operations in constant average time.
• The running time depends on the load factor, not on the number of
items in the table.
Hash function
Problems:
• Keys may not be numeric.
• Number of possible keys is much larger than the space
available in table.
• Different keys may map into same location
• Hash function is not one-to-one => collision.
• If there are too many collisions, the performance of the hash
table will suffer dramatically.
    hashVal %= tableSize;
    if (hashVal < 0)   /* in case overflow occurs */
        hashVal += tableSize;
    return hashVal;
}
Hash function for strings:
key      a    l    i
key[i]   97   108  105
i        0    1    2
KeySize = 3;
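A complete string hash in the spirit of the fragment above might look like this. It is a sketch only: the multiplier 37 and the name stringHash are assumptions, and unsigned arithmetic is used so that overflow wraps around instead of producing a negative value.

```cpp
#include <cassert>
#include <string>

// Polynomial string hash: processes the characters left to right,
// multiplying the running value by 37 (an ASSUMED constant) at each step.
unsigned stringHash(const std::string& key, unsigned tableSize) {
    assert(tableSize > 0);
    unsigned hashVal = 0;
    for (unsigned char ch : key)
        hashVal = 37 * hashVal + ch;  // unsigned overflow wraps, never negative
    return hashVal % tableSize;
}
```

For the key "ali" the running value is 97, then 37*97 + 108 = 3697, then 37*3697 + 105 = 136894, so with a table size of 10007 the result is 136894 mod 10007 = 6803.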
• Load factor: the number of stored items divided by the table size, N/TableSize.