BBM408
Lecture 9
Hashing I
• Direct-access tables
• Resolving collisions by chaining
• Choosing hash functions
• Open addressing
Hash functions
[Figure: a hash function h maps the keys k1, k2, …, k5 of S ⊆ U into the slots 0, …, m−1 of table T; here h(k2) = h(k5), a collision.]
As each key is inserted, h maps it to a slot of T. When a record to be inserted maps to an already occupied slot in T, a collision occurs.
Resolving collisions by chaining
• Link records in the same slot into a list.
[Figure: keys 49, 86, and 52 all hash to slot i of T, i.e. h(49) = h(86) = h(52) = i, and are chained together in a list.]
Worst case:
• Every key hashes to the same slot.
• Access time = Θ(n) if |S| = n.
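To make chaining concrete, here is a minimal Python sketch (not from the slides) of a chained hash table with insert, search, and delete; the class name, the use of Python's built-in hash, and the default table size m = 8 are illustrative choices.

```python
class ChainedHashTable:
    """Hash table resolving collisions by chaining: one list per slot."""

    def __init__(self, m=8):
        self.m = m
        self.slots = [[] for _ in range(m)]          # slot i holds a list of (key, value)

    def _slot(self, key):
        return hash(key) % self.m                    # stand-in for the hash function h

    def insert(self, key, value):
        chain = self.slots[self._slot(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:                             # key already present: overwrite
                chain[i] = (key, value)
                return
        chain.append((key, value))                   # otherwise link into the slot's list

    def search(self, key):                           # expected Θ(1 + α) under simple uniform hashing
        for k, v in self.slots[self._slot(key)]:
            if k == key:
                return v
        return None                                  # unsuccessful search

    def delete(self, key):
        s = self._slot(key)
        self.slots[s] = [(k, v) for k, v in self.slots[s] if k != key]


# Keys such as 49, 86, 52 may land in the same slot; the chain keeps them all.
t = ChainedHashTable(m=8)
for k in (49, 86, 52):
    t.insert(k, str(k))
print(t.search(86))   # '86'
```

If every key landed in the same chain, search would degrade to Θ(n), matching the worst case above.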
Average-case analysis of chaining
We make the assumption of simple uniform
hashing:
• Each key k ∈ S is equally likely to be hashed
to any slot of table T, independent of where
other keys are hashed.
Let n be the number of keys in the table, and
let m be the number of slots.
Define the load factor of T to be
α = n/m
= average number of keys per slot.
Search cost
The expected time for an unsuccessful search for a record with a given key is Θ(1 + α): Θ(1) to apply the hash function and access the slot, plus Θ(α) to search the list.
Expected search time = Θ(1) if α = O(1), or equivalently, if n = O(m).
A successful search has the same asymptotic bound, but a rigorous argument is a little more complicated. (See textbook.)
Choosing a hash function
Multiplication method example
h(k) = (A·k mod 2^w) rsh (w − r)
Suppose that m = 8 = 2^3 and that our computer has w = 7-bit words:
    A   =        1011001₂
    k   =        1101011₂
    A·k = 10010100110011₂
h(k) = (A·k mod 2^7) rsh 4 = 011₂ = 3: keep the low-order w bits of the product, then take their top r bits.
[Figure: "modular wheel" with 2^w positions, showing A, 2A, 3A, … wrapping around modulo 2^w.]
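A short Python check of this arithmetic (my own sketch, reusing the slide's values of A, k, w, and r; rsh denotes a bitwise right shift):

```python
def mult_hash(k, A=0b1011001, w=7, r=3):
    """Multiplication method: h(k) = (A*k mod 2^w) rsh (w - r)."""
    return ((A * k) & ((1 << w) - 1)) >> (w - r)   # mask the low w bits, then shift

k = 0b1101011
print(bin(0b1011001 * k))   # 0b10010100110011, the full product A*k
print(mult_hash(k))         # 3 == 0b011, a slot in {0, ..., m-1} with m = 8
```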
Resolving collisions by open addressing
No storage is used outside of the hash table itself.
• Insertion systematically probes the table until an
empty slot is found.
• The hash function depends on both the key and
probe number:
h : U × {0, 1, …, m–1} → {0, 1, …, m–1}.
• The probe sequence 〈h(k,0), h(k,1), …, h(k,m–1)〉
should be a permutation of {0, 1, …, m–1}.
• The table may fill up, and deletion is difficult (but
not impossible).
Example of open addressing
[Figure: inserting a key into a table with slots 0, …, m−1 by probing successive slots, skipping occupied ones (e.g. a slot holding 481), until an empty slot is found.]
Probing strategies
Linear probing:
Given an ordinary hash function h′(k), linear
probing uses the hash function
h(k,i) = (h′(k) + i) mod m.
This method, though simple, suffers from primary
clustering, where long runs of occupied slots build
up, increasing the average search time. Moreover,
the long runs of occupied slots tend to get longer.
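As a small illustration (mine, with h′(k) = k mod m standing in for the ordinary hash function), the linear-probing probe sequence can be generated like this:

```python
def linear_probe_sequence(k, m):
    """Yield h(k, i) = (h'(k) + i) mod m for i = 0, 1, ..., m-1."""
    h_prime = k % m                 # stand-in for an ordinary hash function h'(k)
    for i in range(m):
        yield (h_prime + i) % m

print(list(linear_probe_sequence(52, 8)))   # [4, 5, 6, 7, 0, 1, 2, 3]
```

The probes walk through consecutive slots, which is exactly the behavior that produces primary clustering.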
Probing strategies
Double hashing
Given two ordinary hash functions h1(k) and h2(k),
double hashing uses the hash function
h(k,i) = (h1(k) + i⋅ h2(k)) mod m.
This method generally produces excellent results,
but h2(k) must be relatively prime to m. One way
is to make m a power of 2 and design h2(k) to
produce only odd numbers.
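The sketch below (my illustration, not code from the slides) uses double hashing in a small open-addressed table. Following the suggestion above, m is a power of 2 and h2 always returns an odd number, so every probe sequence is a permutation of the slots; the particular h1 and h2 are arbitrary stand-ins.

```python
class DoubleHashTable:
    """Open addressing with double hashing: h(k, i) = (h1(k) + i*h2(k)) mod m."""

    def __init__(self, m=8):                     # m is a power of 2
        self.m = m
        self.slots = [None] * m

    def _h1(self, k):
        return k % self.m                        # ordinary hash function (stand-in)

    def _h2(self, k):
        return ((k // self.m) << 1) | 1          # always odd, hence coprime to m = 2^t

    def insert(self, k):
        for i in range(self.m):                  # probe until an empty slot is found
            j = (self._h1(k) + i * self._h2(k)) % self.m
            if self.slots[j] is None or self.slots[j] == k:
                self.slots[j] = k
                return j
        raise RuntimeError("table is full")

    def search(self, k):
        for i in range(self.m):
            j = (self._h1(k) + i * self._h2(k)) % self.m
            if self.slots[j] is None:            # an empty slot ends the probe sequence
                return None
            if self.slots[j] == k:
                return j
        return None


t = DoubleHashTable(m=8)
for k in (14, 22, 30):                           # all three have h1(k) = 6
    t.insert(k)
print(t.search(22) is not None)                  # True
```

Deletion is the awkward case mentioned earlier: simply clearing a slot back to None could cut short the probe sequences of other keys, so deleted slots are usually marked with a special DELETED value instead.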
Analysis of open addressing
We make the assumption of uniform hashing: each key is equally likely to have any one of the m! permutations of {0, 1, …, m−1} as its probe sequence.
Theorem. Given an open-addressed hash table with load factor α = n/m < 1, the expected number of probes in an unsuccessful search is at most 1/(1 − α).
Proof of the theorem
Proof.
• At least one probe is always necessary.
• With probability n/m, the first probe hits an
occupied slot, and a second probe is necessary.
• With probability (n–1)/(m–1), the second probe
hits an occupied slot, and a third probe is
necessary.
• With probability (n–2)/(m–2), the third probe
hits an occupied slot, etc.
Observe that (n − i)/(m − i) < n/m = α for i = 1, 2, …, n.
Proof (continued)
Therefore, the expected number of probes is
1 + (n/m)·(1 + ((n−1)/(m−1))·(1 + ((n−2)/(m−2))·(⋯(1 + 1/(m−n+1))⋯)))
≤ 1 + α(1 + α(1 + α(⋯(1 + α)⋯)))
≤ 1 + α + α² + α³ + ⋯
= ∑_{i=0}^{∞} α^i
= 1/(1 − α).
The textbook has a more rigorous proof and an analysis of successful searches.
Implications of the theorem
• If α is constant, then accessing an open-addressed hash table takes constant expected time.
• For example, if the table is half full (α = 1/2), the expected number of probes in an unsuccessful search is at most 1/(1 − 1/2) = 2.
• If the table is 90% full (α = 0.9), the expected number of probes is at most 1/(1 − 0.9) = 10.
A weakness of hashing
Problem: For any hash function h, a set
of keys exists that can cause the average
access time of a hash table to skyrocket.
• An adversary can pick all keys from
{k ∈ U : h(k) = i} for some slot i.
IDEA: Choose the hash function at random,
independently of the keys.
• Even if an adversary can see your code,
he or she cannot find a bad set of keys,
since he or she doesn’t know exactly
which hash function will be chosen.
Universal hashing
Definition. Let U be a universe of keys, and
let H be a finite collection of hash functions,
each mapping U to {0, 1, …, m–1}. We say
H is universal if for all x, y ∈ U, where x ≠ y,
we have |{h ∈ H : h(x) = h(y)}| ≤ |H | / m.
That is, the chance of a collision between x and y is ≤ 1/m if we choose h randomly from H.
[Figure: the set H of hash functions, of size |H|; the subset {h : h(x) = h(y)} occupies at most a 1/m fraction of it.]
Universality is good
Theorem. Let h be a hash function chosen (uniformly) at random from a universal set H of hash functions, and suppose h is used to hash n arbitrary keys into the m slots of a table T. Then, for a given key x, the expected number of collisions with x is less than n/m.
Proof of theorem
Proof. Let Cx be the random variable denoting the total number of collisions of keys in T with x, and let
    cxy = 1 if h(x) = h(y), and cxy = 0 otherwise.
Note: E[cxy] = 1/m and Cx = ∑_{y∈T−{x}} cxy.
Proof (continued)
E[Cx] = E[ ∑_{y∈T−{x}} cxy ]      • Take expectation of both sides.
      = ∑_{y∈T−{x}} E[cxy]        • Linearity of expectation.
      = ∑_{y∈T−{x}} 1/m           • E[cxy] = 1/m.
      = (n − 1)/m.                • Algebra.
Constructing a set of universal hash functions
Let m be prime. Decompose key k into r + 1
digits, each with value in the set {0, 1, …, m–1}.
That is, let k = 〈k0, k1, …, kr〉, where 0 ≤ ki < m.
Randomized strategy:
Pick a = 〈a0, a1, …, ar〉 where each ai is chosen
randomly from {0, 1, …, m–1}.
Define ha(k) = ( ∑_{i=0}^{r} ai·ki ) mod m.   (Dot product, modulo m.)
How big is H = {ha}? |H| = m^{r+1}. REMEMBER THIS!
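A hedged Python sketch of this construction (mine): the key is split into r + 1 base-m digits and a random vector a is drawn once, yielding one function ha from H.

```python
import random

def make_dot_product_hash(m, r):
    """Return a random h_a from H: h_a(k) = (sum_i a_i * k_i) mod m, with m prime."""
    a = [random.randrange(m) for _ in range(r + 1)]       # a = <a_0, ..., a_r>

    def h_a(k):
        digits = [(k // m**i) % m for i in range(r + 1)]  # k = <k_0, ..., k_r>, 0 <= k_i < m
        return sum(ai * ki for ai, ki in zip(a, digits)) % m

    return h_a

h = make_dot_product_hash(m=7, r=2)     # handles keys 0 <= k < 7**3
print(h(100), h(101))                   # slots in {0, ..., 6}; values depend on the random a
```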
Universality of dot-product hash functions
Theorem. The set H = {ha} is universal.
Proof. Suppose that x = 〈x0, x1, …, xr〉 and y = 〈y0, y1, …, yr〉 are distinct keys. Thus, they differ in at least one digit position; without loss of generality, say position 0.
For how many ha ∈ H do x and y collide? They collide exactly when ha(x) = ha(y), that is, when ∑_{i=0}^{r} ai(xi − yi) ≡ 0 (mod m).
Fact from number theory: since m is prime, every nonzero z ∈ Zm has a unique multiplicative inverse z⁻¹ modulo m.
Example: m = 7.
  z  : 1 2 3 4 5 6
  z⁻¹: 1 4 5 2 3 6
Back to the proof
We have
    a0·(x0 − y0) ≡ −∑_{i=1}^{r} ai(xi − yi)   (mod m),
and since x0 ≠ y0, an inverse (x0 − y0)⁻¹ must exist, which implies that
    a0 ≡ ( −∑_{i=1}^{r} ai(xi − yi) ) · (x0 − y0)⁻¹   (mod m).
Thus, for any choices of a1, a2, …, ar, exactly one choice of a0 causes x and y to collide.
Proof (completed)
Q. How many ha’s cause x and y to collide?
A. There are m choices for each of a1, a2, …, ar ,
but once these are chosen, exactly one choice
for a0 causes x and y to collide, namely
    a0 = ( −∑_{i=1}^{r} ai(xi − yi) ) · (x0 − y0)⁻¹ mod m.
Thus, the number of ha's that cause x and y to collide is m^r · 1 = m^r = |H|/m.
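As a quick empirical check of this count (my sketch, enumerating all of H for a tiny prime m and small r), exactly m^r = |H|/m of the functions ha make a fixed pair of distinct keys collide:

```python
from itertools import product

m, r = 5, 2                                  # small prime m so H = {h_a} is enumerable
x, y = (1, 2, 3), (1, 2, 4)                  # distinct keys as digit vectors <k_0, k_1, k_2>

def h(a, k):
    return sum(ai * ki for ai, ki in zip(a, k)) % m

colliding = sum(1 for a in product(range(m), repeat=r + 1) if h(a, x) == h(a, y))
print(colliding, m**r, m**(r + 1) // m)      # 25 25 25
```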
Perfect hashing
Given a set of n keys, construct a static hash
table of size m = O(n) such that SEARCH takes
Θ(1) time in the worst case.
IDEA: Two-level scheme with universal hashing at both levels. No collisions at level 2!
[Figure: a level-1 table T with slots 0, …, 6; each nonempty slot i stores the size mi and hash parameter ai of a secondary table Si that holds, without collisions, the keys hashing to slot i (e.g., S1 holds 14 and 27, S4 holds 26, and S6 holds 40, 37, 22).]
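Below is a rough Python sketch of the two-level idea (mine, not the slides' exact construction): each level-2 table has size ni², and a random hash function is re-drawn until that slot's keys land without collisions, which the analysis on the next slides shows takes only a constant expected number of tries. The ((a·k + b) mod p) mod size hash and the prime p = 2^31 − 1 are illustrative stand-ins for universal hashing.

```python
import random

def build_perfect_table(keys):
    """Static two-level hash table with level-2 tables of size n_i**2 and no collisions."""
    m = max(1, len(keys))                          # level-1 table of size O(n)
    p = 2**31 - 1                                  # a prime exceeding every key (assumed)

    def random_hash(size):
        a, b = random.randrange(1, p), random.randrange(p)
        return lambda k: ((a * k + b) % p) % size  # stand-in universal-style hash

    h1 = random_hash(m)
    buckets = [[] for _ in range(m)]
    for k in keys:
        buckets[h1(k)].append(k)                   # level 1: group keys by slot

    level2 = []
    for bucket in buckets:
        size = len(bucket) ** 2                    # quadratic space per nonempty slot
        while True:                                # re-draw h2 until this slot is collision-free
            h2 = random_hash(size) if size else None
            slots = [None] * size
            ok = True
            for k in bucket:
                j = h2(k)
                if slots[j] is not None:           # a level-2 collision: try another h2
                    ok = False
                    break
                slots[j] = k
            if ok:
                level2.append((h2, slots))
                break

    def search(k):                                 # worst-case O(1): two hash evaluations
        h2, slots = level2[h1(k)]
        return bool(slots) and slots[h2(k)] == k

    return search

search = build_perfect_table([14, 27, 26, 40, 37, 22])
print(search(27), search(99))                      # True False
```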
Collisions at level 2
Theorem. Let H be a class of universal hash
functions for a table of size m = n². Then, if we
use a random h ∈ H to hash n keys into the table,
the expected number of collisions is at most 1/2.
Proof. By the definition of universality, the probability that 2 given keys in the table collide under h is 1/m = 1/n². Since there are (n choose 2) pairs