Lecture 8 Hashing
The Search Problem
• Find items with keys matching a given search key
– Given an array A containing n keys, and a search key x, find an index i such that x = A[i]
– As in the case of sorting, a key could be part of a large record.
Applications
• Keeping track of customer account information
at a bank
– Search through records to check balances and
perform transactions
• Keep track of reservations on flights
– Search to find empty seats, cancel/modify
reservations
• Search engine
– Looks for all documents containing a given word
Special Case: Dictionaries
• A dictionary is a dynamic set that supports only the operations INSERT, DELETE, and SEARCH
Operations
Alg.: DIRECT-ADDRESS-SEARCH(T, k)
return T[k]
Alg.: DIRECT-ADDRESS-INSERT(T, x)
T[key[x]] ← x
Alg.: DIRECT-ADDRESS-DELETE(T, x)
T[key[x]] ← NIL
• Running time for these operations: O(1)
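The three direct-address operations above can be sketched in Python. This is a minimal illustration; the `Record` type and its field names are assumptions, not part of the pseudocode:

```python
from collections import namedtuple

# Hypothetical record type: a key plus satellite data
Record = namedtuple("Record", ["key", "data"])

class DirectAddressTable:
    """Direct addressing: one slot per possible key in U = {0, ..., m-1}."""
    def __init__(self, m):
        self.slots = [None] * m

    def search(self, k):            # DIRECT-ADDRESS-SEARCH(T, k)
        return self.slots[k]

    def insert(self, x):            # DIRECT-ADDRESS-INSERT(T, x)
        self.slots[x.key] = x

    def delete(self, x):            # DIRECT-ADDRESS-DELETE(T, x)
        self.slots[x.key] = None
```

Each operation is a single array access, which is where the O(1) running time comes from.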
Comparing Different Implementations
• Implementing dictionaries using:
– Direct addressing
– Ordered/unordered arrays
– Ordered/unordered linked lists
                    Insert   Search
direct addressing   O(1)     O(1)
ordered array       O(N)     O(lg N)
ordered list        O(N)     O(N)
unordered array     O(1)     O(N)
unordered list      O(1)     O(N)
Hash Tables
• When the set K of actual keys is much smaller than the universe U of possible keys, a hash table requires much less space than a direct-address table
– Can reduce storage requirements to Θ(|K|)
– Can still get O(1) search time, but in the average case, not the worst case
Hash Tables
Idea:
– Use a hash function h to compute the slot for each key k
– Store the element in slot h(k)
Example: Hash Tables
[Figure: the universe of keys U, the set K = {k1, ..., k5} of actual keys, and a table with slots 0 to m–1; each key ki is stored at slot h(ki), and h(k2) = h(k5)]
Do you see any problems with this approach?
[Same figure: because h(k2) = h(k5), keys k2 and k5 map to the same slot — Collisions!]
Collisions
• Two or more keys hash to the same slot!!
• For a given set K of keys
– If |K| ≤ m, collisions may or may not happen,
depending on the hash function
– If |K| > m, collisions will definitely happen (i.e., there
must be at least two keys that have the same hash
value)
• Avoiding collisions completely is hard, even with
a good hash function
Handling Collisions
• We will review the following methods:
– Chaining
– Open addressing
• Linear probing
• Quadratic probing
• Double hashing
• We will discuss chaining first, and then ways to build “good” hash functions.
Handling Collisions Using Chaining
• Idea:
– Put all elements that hash to the same slot into a
linked list
Insertion in Hash Tables
Alg.: CHAINED-HASH-INSERT(T, x)
insert x at the head of list T[h(key[x])]
Deletion in Hash Tables
Alg.: CHAINED-HASH-DELETE(T, x)
delete x from the list T[h(key[x])]
Searching in Hash Tables
Alg.: CHAINED-HASH-SEARCH(T, k)
search for an element with key k in list T[h(k)]
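The three chained-hash operations can be sketched in Python. This is a minimal illustration, assuming integer keys and the simple hash h(k) = k mod m (choosing the hash function is discussed later):

```python
class ChainedHashTable:
    """Collision resolution by chaining: each slot holds a list of keys."""
    def __init__(self, m):
        self.m = m
        self.table = [[] for _ in range(m)]

    def _h(self, k):
        return k % self.m            # assumed hash function

    def insert(self, k):             # CHAINED-HASH-INSERT: prepend in O(1)
        self.table[self._h(k)].insert(0, k)

    def search(self, k):             # CHAINED-HASH-SEARCH: scan one chain
        return k in self.table[self._h(k)]

    def delete(self, k):             # CHAINED-HASH-DELETE
        self.table[self._h(k)].remove(k)
```

With m = 7, keys 5 and 12 collide (both hash to slot 5), yet both remain findable because they share one chain.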
Analysis of Hashing with Chaining: Worst Case
• How long does it take to search T for an element with a given key?
• Worst case:
– All n keys hash to the same slot, forming a single chain
– Worst-case time to search is Θ(n), plus time to compute the hash function
Analysis of Hashing with Chaining: Average Case
• Average case
– depends on how well the hash function distributes the n keys among the m slots
• Simple uniform hashing assumption:
– Any given element is equally likely to hash into any of the m slots (i.e., the probability of collision Pr(h(x) = h(y)) is 1/m)
• Length of a list:
– nj = length of list T[j], j = 0, 1, . . . , m – 1
• Number of keys in the table:
– n = n0 + n1 + · · · + nm–1
Case 1: Unsuccessful Search
(i.e., item not stored in the table)
Theorem
An unsuccessful search in a hash table takes expected time Θ(1 + α), where α = n/m is the load factor, under the assumption of simple uniform hashing (i.e., probability of collision Pr(h(x) = h(y)) is 1/m)
Proof
• Searching unsuccessfully for any key k
– need to search to the end of the list T[h(k)]
• Expected length of the list:
– E[nh(k)] = α = n/m
• Expected number of elements examined in an unsuccessful search is α
• Total time required is:
– O(1) (for computing the hash function) + α = Θ(1 + α)
Case 2: Successful Search
• Under the same simple uniform hashing assumption, a successful search also takes expected time Θ(1 + α): on average, about half of the chain containing the element is examined before it is found
Analysis of Search in Hash Tables
• If m (# of slots) is proportional to n (# of elements in the table):
– n = O(m), so α = n/m = O(1)
– Searching takes constant time on average
Hash Functions
• A hash function transforms a key into a table
address
• What makes a good hash function?
(1) Easy to compute
(2) Approximates a random function: for every input,
every output is equally likely (simple uniform hashing)
• In practice, it is very hard to satisfy the simple
uniform hashing property
– i.e., we don’t know in advance the probability
distribution that keys are drawn from
Good Approaches for Hash Functions
• The division method
• The multiplication method
• Universal hashing
The Division Method
• Idea:
– Map a key k into one of the m slots by taking the
remainder of k divided by m
h(k) = k mod m
• Advantage:
– fast, requires only one operation
• Disadvantage:
– Certain values of m are bad, e.g.,
• power of 2
• non-prime numbers
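The effect of a bad modulus can be seen directly. With m = 100, the hash keeps only the last two decimal digits of the key, while the prime m = 97 separates the same keys (the key values below are arbitrary illustrations):

```python
def h_div(k, m):
    """Division-method hash: h(k) = k mod m."""
    return k % m

keys = [1299, 2299, 3299]
# m = 100: all three keys collide in slot 99 (only the last two digits matter)
slots_100 = {h_div(k, 100) for k in keys}
# m = 97 (a prime, not close to a power of 2): the same keys spread out
slots_97 = {h_div(k, 97) for k in keys}
```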
Example - The Division Method
• If m = 2^p, then h(k) is just the least significant p bits of k
– p = 1, m = 2: h(k) ∈ {0, 1}, the least significant 1 bit of k
– p = 2, m = 4: h(k) ∈ {0, 1, 2, 3}, the least significant 2 bits of k
• Choose m to be a prime, not close to a power of 2
[Table: column 2 shows k mod 97, column 3 shows k mod 100]
The Multiplication Method
Idea:
• Multiply key k by a constant A, where 0 < A < 1
• Extract the fractional part of kA
• Multiply the fractional part by m
• Take the floor of the result
h(k) = ⌊m (k A mod 1)⌋
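The four steps above can be sketched in Python, using A = (√5 − 1)/2 ≈ 0.618, Knuth's suggested value for the constant (any constant with 0 < A < 1 works):

```python
import math

def h_mul(k, m, A=(math.sqrt(5) - 1) / 2):
    """Multiplication-method hash: h(k) = floor(m * (k*A mod 1))."""
    frac = (k * A) % 1.0             # fractional part of k*A
    return math.floor(m * frac)      # scale by m and take the floor
```

An advantage over the division method is that the value of m is not critical here; it can even be a power of 2.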
Universal Hashing
• In practice, keys are not randomly distributed
• Any fixed hash function might yield Θ(n) time
• Goal: hash functions that produce random
table indices irrespective of the keys
• Idea:
– Select a hash function at random, from a designed
class of functions at the beginning of the execution
Definition of Universal Hash Functions
• A class H of hash functions h: U → {0, 1, ..., m – 1} is universal if, for every pair of distinct keys x, y ∈ U, the number of hash functions h ∈ H for which h(x) = h(y) is at most |H|/m
Universal Hashing – Main Result
• If h is chosen at random from a universal class H, then for any two distinct keys the probability of collision is at most 1/m, and the expected length of the list T[h(k)] for any key k is at most 1 + α
Designing a Universal Class
of Hash Functions
• Choose a prime number p large enough so that every possible
key k is in the range [0 ... p – 1]
Zp = {0, 1, …, p – 1} and Zp* = {1, …, p – 1}
• Define the following hash function
ha,b(k) = ((ak + b) mod p) mod m,
a ∈ Zp* and b ∈ Zp
• The family of all such hash functions is
Hp,m = {ha,b : a ∈ Zp* and b ∈ Zp}
– The class Hp,m of hash functions is universal
• a, b: chosen randomly at the beginning of execution
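The family Hp,m can be sampled directly. This sketch fixes a and b once, at the beginning of execution, and returns the resulting hash function:

```python
import random

def random_universal_hash(p, m):
    """Pick h_{a,b}(k) = ((a*k + b) mod p) mod m uniformly from H_{p,m}.
    Assumes p is prime and every key lies in {0, ..., p-1}."""
    a = random.randint(1, p - 1)     # a drawn from Zp*
    b = random.randint(0, p - 1)     # b drawn from Zp
    return lambda k: ((a * k + b) % p) % m
```

Because a and b are fixed when the function is created, the same key always hashes to the same slot within one execution.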
Example: Universal Hash Functions
E.g.: p = 17, m = 6; choosing (for instance) a = 3, b = 4 and hashing the key k = 8:
h3,4(8) = ((3 · 8 + 4) mod 17) mod 6
= (28 mod 17) mod 6
= 11 mod 6
= 5
Advantages of Universal Hashing
• Good average-case behavior no matter which keys are actually stored
Open Addressing
• All elements are stored in the table itself; slots are examined (probed) in the order given by a probe sequence h(k, i), i = 0, 1, ..., m – 1
• Probe sequences
<h(k,0), h(k,1), ..., h(k,m-1)>
– Must be a permutation of <0, 1, ..., m-1>
– There are m! possible permutations
– Good hash functions should be able to produce all m! probe sequences
Common Open Addressing Methods
• Linear probing
• Quadratic probing
• Double hashing
Linear probing: Inserting a key
• Idea: when there is a collision, check the next available position in the table (i.e., probing)
h(k,i) = (h1(k) + i) mod m, i = 0, 1, 2, ...
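Insertion with linear probing can be sketched as below, assuming integer keys, h1(k) = k mod m, and None marking an empty slot:

```python
def linear_probe_insert(table, k):
    """Insert key k using linear probing: h(k, i) = (h1(k) + i) mod m."""
    m = len(table)
    for i in range(m):
        j = (k % m + i) % m          # assumed h1(k) = k mod m
        if table[j] is None:         # first empty slot on the probe path
            table[j] = k
            return j                 # slot where k was placed
    raise RuntimeError("hash table overflow")
```

For example, with m = 5, inserting 3 and then 8 (both hash to slot 3) places 8 in the next free slot, 4.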
Linear probing: Deleting a key
• Problems
– Cannot simply mark the slot as empty
– Otherwise it would be impossible to retrieve keys inserted after that slot was occupied
• Solution
– Mark the slot with a sentinel value DELETED
• The deleted slot can later be used for insertion
• Searching will be able to find all the keys
Primary Clustering Problem
• Some slots become more likely than others
• Long chunks of occupied slots are created
search time increases!!
[Figure: the empty slot following a run of occupied slots is disproportionately likely to be filled next, e.g., slot b with probability 2/m, slot d with probability 4/m, slot e with probability 5/m]
Quadratic probing
h(k,i) = (h1(k) + c1·i + c2·i²) mod m, i = 0, 1, 2, ...
(c1, c2 are auxiliary constants)
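The quadratic probe sequence can be sketched as below; the choice c1 = c2 = 1 is an arbitrary illustration (in practice the constants must be matched to m so the sequence covers the whole table):

```python
def quadratic_probes(k, m, c1=1, c2=1):
    """Quadratic probing: h(k, i) = (h1(k) + c1*i + c2*i**2) mod m,
    with the assumed h1(k) = k mod m."""
    return [(k % m + c1 * i + c2 * i * i) % m for i in range(m)]
```

Unlike linear probing, successive probes jump by growing offsets, which avoids primary clustering (though keys with the same initial slot still share one probe sequence: secondary clustering).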
Double Hashing
(1) Use one hash function to determine the first slot
(2) Use a second hash function to determine the
increment for the probe sequence
h(k,i) = (h1(k) + i h2(k) ) mod m, i=0,1,...
• Initial probe: h1(k)
• Second probe is offset by h2(k), the third by 2·h2(k) (mod m), and so on
• Advantage: avoids clustering
• Disadvantage: harder to delete an element
• Can generate at most m² probe sequences
Double Hashing: Example
h1(k) = k mod 13
h2(k) = 1 + (k mod 11)
h(k,i) = (h1(k) + i h2(k)) mod 13
[Table: slot 1 → 79, slot 4 → 69, slot 5 → 98, slot 7 → 72]
• Insert key 14:
h(14,0) = h1(14) = 14 mod 13 = 1 (occupied by 79)
h(14,1) = (h1(14) + h2(14)) mod 13
= (1 + 4) mod 13 = 5 (occupied by 98)
h(14,2) = (1 + 2 · 4) mod 13 = 9 (empty: insert 14 at slot 9)
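A sketch that reproduces the probe sequence for this example, using the two hash functions h1(k) = k mod 13 and h2(k) = 1 + (k mod 11):

```python
def double_hash_probe(k, i, m=13):
    """Double hashing: h(k, i) = (h1(k) + i*h2(k)) mod m,
    with h1(k) = k mod 13 and h2(k) = 1 + (k mod 11)."""
    h1 = k % 13
    h2 = 1 + (k % 11)
    return (h1 + i * h2) % m
```

For k = 14, h2(14) = 1 + 3 = 4, so the probe sequence begins 1, 5, 9, ...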