Analysis of Algorithms CS 477/677: Hashing Instructor: George Bebis
CS 477/677
Hashing
Instructor: George Bebis
(Chapter 11)
Applications
Keeping track of customer account information at
a bank
Search through records to check balances and perform
transactions
Search engine
Looks for all documents containing a given word
Direct Addressing
Assumptions:
Key values are distinct
Each key is drawn from a universe U = {0, 1, . . . , m - 1}
Idea:
Store the items in an array, indexed by keys
Operations
Alg.: DIRECT-ADDRESS-SEARCH(T, k)
return T[k]
Alg.: DIRECT-ADDRESS-INSERT(T, x)
T[key[x]] ← x
Alg.: DIRECT-ADDRESS-DELETE(T, x)
T[key[x]] ← NIL
Running time for these operations: O(1)
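The three operations above translate almost line for line into code. The following is a minimal Python sketch, assuming items are (key, value) pairs with integer keys in 0..m-1; the class name DirectAddressTable and the tuple representation are illustrative choices, not part of the original pseudocode.

```python
class DirectAddressTable:
    """Direct-address table for keys drawn from {0, 1, ..., m - 1}."""

    def __init__(self, m):
        # One slot per possible key; None plays the role of NIL.
        self.slots = [None] * m

    def search(self, k):
        # DIRECT-ADDRESS-SEARCH: return T[k]
        return self.slots[k]

    def insert(self, x):
        # DIRECT-ADDRESS-INSERT: T[key[x]] <- x, where x = (key, value)
        self.slots[x[0]] = x

    def delete(self, x):
        # DIRECT-ADDRESS-DELETE: T[key[x]] <- NIL
        self.slots[x[0]] = None


table = DirectAddressTable(m=10)
table.insert((3, "alice"))
table.insert((7, "bob"))
found = table.search(3)      # the item stored under key 3
table.delete((7, "bob"))
missing = table.search(7)    # None, since key 7 was deleted
```

Each operation touches exactly one array slot, which is why all three run in O(1) time; the price is an array as large as the whole universe U.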
Operation   direct addressing   ordered array   ordered list   unordered array   unordered list
Insert      O(1)                O(N)            O(N)           O(1)              O(1)
Search      O(1)                O(lg N)         O(N)           O(N)              O(N)
Example 2:
Hash Tables
When K is much smaller than U, a hash table
requires much less space than a direct-address
table
Can reduce storage requirements to |K|
Search still takes O(1) time, but only in the average
case, not in the worst case
Hash Tables
Idea:
Use a function h to compute the slot for each key
Store the element in slot h(k)
[Figure: h maps the keys k1, ..., k5 to slots of a table whose last index is m-1; here h(k2) = h(k5)]
Revisit Example 2
[Figure: keys k2 and k5 hash to the same slot, h(k2) = h(k5). Collisions!]
Collisions
Two or more keys hash to the same slot!!
For a given set K of keys
If |K| ≤ m, collisions may or may not happen,
depending on the hash function
If |K| > m, collisions will definitely happen (i.e., there
must be at least two keys that have the same hash
value)
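A small experiment makes the two cases concrete. The sketch below assumes the hash function h(k) = k mod m, chosen here only for illustration, and counts how many keys land in a slot that an earlier key already occupies.

```python
def count_collisions(keys, m):
    """Count keys that hash to a slot already taken by an earlier key.

    Assumes the illustrative hash function h(k) = k mod m.
    """
    occupied = set()
    collisions = 0
    for k in keys:
        slot = k % m
        if slot in occupied:
            collisions += 1
        else:
            occupied.add(slot)
    return collisions


m = 5
# |K| <= m: whether collisions occur depends on the keys and on h.
no_clash = count_collisions([0, 1, 2, 3, 4], m)   # every key gets its own slot
clash = count_collisions([0, 5, 10], m)           # all three keys map to slot 0
# |K| > m: by the pigeonhole principle at least one collision is unavoidable.
forced = count_collisions([1, 2, 3, 4, 5, 6], m)
```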
Handling Collisions
We will review the following methods:
Chaining
Open addressing
Linear probing
Quadratic probing
Double hashing
How long does it take to search for an element with a given key?
Worst case:
All n keys hash to the same slot
Worst-case time to search is $\Theta(n)$, plus the time to compute the
hash function
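The worst case is easy to reproduce. The following sketch assumes chaining with Python lists as chains and h(k) = k mod m as the hash function; both choices, and the name ChainedHashTable, are illustrative.

```python
class ChainedHashTable:
    """Hash table that resolves collisions by chaining."""

    def __init__(self, m):
        self.m = m
        self.table = [[] for _ in range(m)]   # one chain per slot

    def _h(self, k):
        return k % self.m                     # illustrative hash function

    def insert(self, k):
        self.table[self._h(k)].append(k)

    def search(self, k):
        """Return (found, comparisons) for key k."""
        comparisons = 0
        for item in self.table[self._h(k)]:
            comparisons += 1
            if item == k:
                return True, comparisons
        return False, comparisons


m, n = 8, 100
t = ChainedHashTable(m)
for i in range(n):
    t.insert(i * m)          # every key is a multiple of m, so all hash to slot 0

found, steps = t.search((n - 1) * m)     # last key in the single long chain
absent, steps_absent = t.search(n * m)   # not stored, but also hashes to slot 0
```

With every key in one chain, both the successful and the unsuccessful search walk all n entries, so the table behaves like an unordered list.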
Length of a list:
$T[j] = n_j$, $j = 0, 1, \ldots, m - 1$
Number of keys in the table:
$n = n_0 + n_1 + \cdots + n_{m-1}$
Average value of $n_j$:
$E[n_j] = \alpha = n/m$ (the load factor)
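As a quick numerical check, the sketch below hashes n keys into m chains and confirms that the chain lengths add up to n, so that the average chain length over all slots is n/m. The hash function k mod m and the random key set are assumptions made for the example.

```python
import random


def chain_lengths(keys, m):
    """Return [n_0, n_1, ..., n_{m-1}] for the hash function h(k) = k mod m."""
    lengths = [0] * m
    for k in keys:
        lengths[k % m] += 1
    return lengths


random.seed(0)                             # fixed seed so the run is repeatable
m = 10
keys = random.sample(range(10_000), 50)    # 50 distinct keys
lengths = chain_lengths(keys, m)

n = sum(lengths)                           # n = n_0 + n_1 + ... + n_{m-1}
alpha = n / m                              # load factor
average_length = sum(lengths) / len(lengths)
```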
[Figure: table T with chains of length $n_0 = 0$, $n_2$, $n_3$, ..., $n_j$, ..., $n_k$, ..., $n_{m-1} = 0$]
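Expected search time with chaining, under simple uniform hashing:
Unsuccessful search: $\Theta(1 + \alpha)$
Successful search: $\Theta(1 + \alpha)$
If $n = O(m)$, then $\alpha = n/m = O(1)$, so searching takes constant time on average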
Hash Functions
A hash function transforms a key into a table
address
What makes a good hash function?
(1) Easy to compute
(2) Approximates a random function: for every input,
every output is equally likely (simple uniform hashing)
Disadvantage of hashing by division, h(k) = k mod m:
Certain values of m are bad, e.g.,
powers of 2
non-prime numbers
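If m = 2^p, then h(k) is just the p least significant bits of k:
p = 1, m = 2: h(k) ∈ {0, 1}, the least significant 1 bit of k
p = 2, m = 4: h(k) ∈ {0, 1, 2, 3}, the least significant 2 bits of k
[Figure: sample keys hashed with k mod 97 and with k mod 100, i.e. m = 97 versus m = 100]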
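To see why powers of 2 are a poor choice, the sketch below applies h(k) = k mod m to keys that differ only in their high-order bits. The key set is made up for illustration.

```python
def h_div(k, m):
    """Division method: h(k) = k mod m."""
    return k % m


# Keys that share the same two low-order bits (binary ...01).
keys = [1, 5, 9, 13, 17, 21, 25, 29]

# m = 4 = 2^2: h(k) is just the 2 least significant bits, so every key collides.
slots_pow2 = [h_div(k, 4) for k in keys]

# m = 7 (a prime): the whole key influences the slot, and the keys spread out.
slots_prime = [h_div(k, 7) for k in keys]

# For m = 2^p the result equals a bit mask of the p low-order bits.
same_as_mask = all(h_div(k, 4) == (k & 0b11) for k in range(1000))
```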
$h(k) = \lfloor m \, (kA \bmod 1) \rfloor$
fractional part of $kA$: $kA \bmod 1 = kA - \lfloor kA \rfloor$
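A minimal sketch of this multiplication-based hash follows. The constant A = (sqrt(5) - 1)/2 is Knuth's commonly cited suggestion and is an assumption here, as are the sample key and the table size; any constant with 0 < A < 1 fits the formula.

```python
import math

A = (math.sqrt(5) - 1) / 2     # assumed constant, about 0.618 (Knuth's suggestion)


def h_mul(k, m, a=A):
    """Multiplication method: h(k) = floor(m * (k*a mod 1))."""
    frac = (k * a) % 1.0       # fractional part of k*a, i.e. k*a - floor(k*a)
    return math.floor(m * frac)


m = 16384                      # illustrative table size (2^14)
slot = h_mul(123456, m)
in_range = all(0 <= h_mul(k, m) < m for k in range(1, 10_000))
```

Because the fractional part always lies in [0, 1), the result always falls in 0..m-1, whatever value of m is chosen.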
Universal Hashing
In practice, keys are not randomly distributed
Any fixed hash function might yield $\Theta(n)$ time
Goal: hash functions that produce random
table indices irrespective of the keys
Idea:
Select a hash function at random, from a designed
class of functions at the beginning of the execution
$H = \{h : U \to \{0, 1, \ldots, m-1\}\}$
For distinct keys x ≠ y and h chosen at random from a universal class H: $\Pr(h(x) = h(y)) \le 1/m$
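One well-known universal class, which the slides do not spell out and which is used here as an assumption, is h(k) = ((a*k + b) mod p) mod m, with p a prime larger than every key, a drawn from {1, ..., p-1} and b from {0, ..., p-1}. The sketch draws a member of the class at random and estimates the collision probability for one fixed pair of keys empirically.

```python
import random

P = 10_007                      # assumed prime, larger than any key used below


def random_hash(m, rng):
    """Draw h(k) = ((a*k + b) mod P) mod m at random from the class."""
    a = rng.randrange(1, P)     # a in {1, ..., P-1}
    b = rng.randrange(0, P)     # b in {0, ..., P-1}
    return lambda k: ((a * k + b) % P) % m


rng = random.Random(42)         # fixed seed so the experiment is repeatable
m = 10
x, y = 123, 4567                # two distinct keys, both smaller than P

trials = 20_000
hits = 0
for _ in range(trials):
    h = random_hash(m, rng)     # a fresh function for every trial
    if h(x) == h(y):
        hits += 1

collision_rate = hits / trials  # an empirical estimate to compare against 1/m
```

In an actual table the function would be drawn once, at the beginning of the execution, and then used for every key; redrawing it per trial here only serves to measure the probability over the choice of h.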
Open Addressing
If we have enough contiguous memory to store all the keys
(m > N), store the keys in the table itself
e.g., insert 14
The hash function takes the probe number p as a second argument: h(k, p), p = 0, 1, ..., m-1
Probe sequences
<h(k,0), h(k,1), ..., h(k,m-1)>
Must be a permutation of <0,1,...,m-1>
There are m! possible permutations
Good hash functions should be able to
produce all m! probe sequences
Example
<1, 5, 9>
[Figure: the probe sequence wraps around from slot m-1 to slot 0]
Solution (deleting a key):
Mark the slot with the sentinel value DELETED rather than leaving it empty, so that later searches probe past it
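The sketch below shows why the sentinel is needed. It assumes linear probing, h(k, i) = (h'(k) + i) mod m with h'(k) = k mod m, as the probe sequence; the class and constant names are illustrative.

```python
EMPTY = None
DELETED = object()          # sentinel marking a slot whose key was removed


class LinearProbingTable:
    def __init__(self, m):
        self.m = m
        self.slots = [EMPTY] * m

    def _probe(self, k):
        # Linear probing: h(k, i) = (k mod m + i) mod m, wrapping around at m-1.
        for i in range(self.m):
            yield (k % self.m + i) % self.m

    def insert(self, k):
        for j in self._probe(k):
            if self.slots[j] is EMPTY or self.slots[j] is DELETED:
                self.slots[j] = k       # DELETED slots can be reused
                return j
        raise OverflowError("hash table is full")

    def search(self, k):
        for j in self._probe(k):
            if self.slots[j] is EMPTY:  # a truly empty slot ends the search
                return None
            if self.slots[j] == k:      # DELETED never equals a key, so it is skipped
                return j
        return None

    def delete(self, k):
        j = self.search(k)
        if j is not None:
            self.slots[j] = DELETED     # not EMPTY, so later keys stay reachable


t = LinearProbingTable(7)
for key in (10, 17, 24):    # all three hash to slot 3 and occupy slots 3, 4, 5
    t.insert(key)
t.delete(17)                # slot 4 becomes DELETED
where_24 = t.search(24)     # still found in slot 5, past the DELETED slot
```

Had delete stored EMPTY instead, the search for 24 would stop at slot 4 and wrongly report the key as absent.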
Quadratic probing
The offset from the initial slot is a quadratic function of the probe number i, i = 0, 1, 2, ...
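As an illustration, one common form of quadratic probing is h(k, i) = (h'(k) + c1*i + c2*i^2) mod m. The sketch assumes h'(k) = k mod m and c1 = c2 = 1/2 with m a power of 2, a choice for which the probe sequence is known to visit every slot; none of these constants come from the slides.

```python
def quadratic_probe_sequence(k, m):
    """Probe sequence h(k, i) = (k mod m + i*(i + 1)//2) mod m for i = 0..m-1.

    i*(i + 1)//2 equals c1*i + c2*i^2 with the assumed constants c1 = c2 = 1/2.
    """
    return [(k % m + i * (i + 1) // 2) % m for i in range(m)]


m = 8                                   # assumed table size, a power of 2
seq = quadratic_probe_sequence(3, m)
visits_every_slot = sorted(seq) == list(range(m))
```

For other constants or table sizes the sequence may revisit slots before covering the table, so the constants have to be chosen with m in mind.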
Double Hashing
(1) Use one hash function to determine the first slot
(2) Use a second hash function to determine the
increment for the probe sequence
h(k,i) = (h1(k) + i·h2(k)) mod m, i = 0, 1, ...
Initial probe: h1(k)
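A worked sketch of double hashing follows. It assumes m = 13, h1(k) = k mod 13 and h2(k) = 1 + (k mod 11), a common textbook choice that the slides do not state; the insertion order of the keys is likewise assumed.

```python
M = 13


def h1(k):
    return k % M                 # assumed primary hash function


def h2(k):
    return 1 + (k % 11)          # assumed secondary hash; never 0, so probing always advances


def probe(k, i):
    """Double hashing: h(k, i) = (h1(k) + i*h2(k)) mod M."""
    return (h1(k) + i * h2(k)) % M


def insert(table, k):
    """Place k in the first free slot of its probe sequence; return the probes made."""
    tried = []
    for i in range(M):
        j = probe(k, i)
        tried.append(j)
        if table[j] is None:
            table[j] = k
            return tried
    raise OverflowError("hash table is full")


table = [None] * M
for key in (79, 69, 72, 98, 50):
    insert(table, key)
probes_for_14 = insert(table, 14)   # slots examined while inserting 14
```

Because 13 is prime, every value of h2(k) is relatively prime to the table size, so each probe sequence is a permutation of the slots.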
[Figure: hash table with slots 0 to 12 containing the keys 79, 69, 98, 72, 14, 50]
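$\alpha$ (load factor)
Expected number of probes in an unsuccessful search, assuming uniform hashing, is at most
$\sum_{k=0}^{\infty} \alpha^k = \frac{1}{1 - \alpha}$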
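For a load factor α < 1, the geometric series of α^k sums to 1/(1 - α), which is the standard upper bound on the expected number of probes in an unsuccessful search under uniform hashing. The sketch below, with arbitrarily chosen load factors, checks this numerically.

```python
def probe_bound(alpha):
    """Upper bound 1 / (1 - alpha) on expected probes, valid for 0 <= alpha < 1."""
    if not 0 <= alpha < 1:
        raise ValueError("load factor must satisfy 0 <= alpha < 1")
    return 1.0 / (1.0 - alpha)


def geometric_partial_sum(alpha, terms):
    """Sum of alpha^k for k = 0 .. terms - 1."""
    return sum(alpha ** k for k in range(terms))


half_full = probe_bound(0.5)        # table 50% full
nearly_full = probe_bound(0.9)      # table 90% full
series = geometric_partial_sum(0.5, 60)
```

The bound grows quickly as the table fills up, which is why open-address tables are usually kept well below full.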
E(#steps) = 3.387
E(#steps) = 3.670