Lec8 PDF
6.046J/18.401J LECTURE 8
Hashing II
Universal hashing
Universality theorem
Constructing a set of universal hash functions
Perfect hashing
Prof. Charles E. Leiserson
October 5, 2005 Copyright 2001-5 by Erik D. Demaine and Charles E. Leiserson L7.1
A weakness of hashing
Universal hashing
Definition. Let $U$ be a universe of keys, and let $H$ be a finite collection of hash functions, each mapping $U$ to $\{0, 1, \ldots, m-1\}$. We say $H$ is universal if for all $x, y \in U$, where $x \ne y$, we have $|\{h \in H : h(x) = h(y)\}| = |H|/m$. That is, the chance of a collision between $x$ and $y$ is $1/m$ if we choose $h$ randomly from $H$.
Universality is good
Proof of theorem
Proof. Let $C_x$ be the random variable denoting the total number of collisions of keys in $T$ with $x$, and let

$$c_{xy} = \begin{cases} 1 & \text{if } h(x) = h(y), \\ 0 & \text{otherwise.} \end{cases}$$

Note: $E[c_{xy}] = 1/m$ and $C_x = \sum_{y \in T - \{x\}} c_{xy}$.
Proof (continued)
$$
\begin{aligned}
E[C_x] &= E\Big[\sum_{y \in T - \{x\}} c_{xy}\Big] && \text{Take expectation of both sides.} \\
&= \sum_{y \in T - \{x\}} E[c_{xy}] && \text{Linearity of expectation.} \\
&= \sum_{y \in T - \{x\}} 1/m && E[c_{xy}] = 1/m. \\
&= \frac{n-1}{m}.
\end{aligned}
$$
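The bound $E[C_x] = (n-1)/m$ can be checked exactly on a small instance. The following is a minimal sketch, not part of the lecture: it assumes the dot-product family $h_a(x) = \sum_i a_i x_i \bmod m$ with $m$ prime as the universal family, and the function name `expected_collisions` and the sample keys are hypothetical choices for illustration.

```python
from fractions import Fraction
from itertools import product


def expected_collisions(x, table_keys, m):
    """Exact E[C_x], averaging over every h_a in the dot-product family.

    Assumes m is prime and every key is a tuple of digits in 0..m-1.
    """
    others = [y for y in table_keys if y != x]
    total_collisions = 0
    family_size = 0
    for a in product(range(m), repeat=len(x)):  # one h_a per coefficient vector
        family_size += 1
        hx = sum(ai * xi for ai, xi in zip(a, x)) % m
        for y in others:
            hy = sum(ai * yi for ai, yi in zip(a, y)) % m
            if hx == hy:
                total_collisions += 1
    return Fraction(total_collisions, family_size)


m = 5
keys = [(0, 1), (2, 3), (4, 4), (1, 0)]  # n = 4 distinct two-digit keys
print(expected_collisions(keys[0], keys, m))  # equals (n - 1)/m
```

Enumerating the whole family is only feasible for tiny $m$ and key lengths; it is used here so the result is exact rather than a sampled estimate.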
Proof. Let $x = \langle x_0, x_1, \ldots, x_r \rangle$ and $y = \langle y_0, y_1, \ldots, y_r \rangle$ be distinct keys. Thus, they differ in at least one digit position, wlog position 0. For how many $h_a \in H$ do $x$ and $y$ collide? We must have $h_a(x) = h_a(y)$, which implies that

$$\sum_{i=0}^{r} a_i x_i \equiv \sum_{i=0}^{r} a_i y_i \pmod{m}.$$
Proof (continued)
Equivalently, we have

$$\sum_{i=0}^{r} a_i (x_i - y_i) \equiv 0 \pmod{m}$$

or

$$a_0 (x_0 - y_0) + \sum_{i=1}^{r} a_i (x_i - y_i) \equiv 0 \pmod{m},$$

which implies that

$$a_0 (x_0 - y_0) \equiv -\sum_{i=1}^{r} a_i (x_i - y_i) \pmod{m}.$$
Theorem. Let $m$ be prime. For any $z \in \mathbb{Z}_m$ such that $z \ne 0$, there exists a unique $z^{-1} \in \mathbb{Z}_m$ such that $z \cdot z^{-1} \equiv 1 \pmod{m}$.

Example: $m = 7$.

z        1  2  3  4  5  6
z^{-1}   1  4  5  2  3  6
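The inverse table for a prime modulus can be reproduced in a few lines. This is a minimal sketch: the function name `inverse_table` is hypothetical, and it computes inverses with Fermat's little theorem ($z^{m-2} \equiv z^{-1} \pmod{m}$ for prime $m$), which is one convenient method and not one named in the lecture.

```python
def inverse_table(m):
    """Map each nonzero z in Z_m to its multiplicative inverse.

    Assumes m is prime, so z**(m - 2) mod m is the inverse of z
    (Fermat's little theorem).
    """
    return {z: pow(z, m - 2, m) for z in range(1, m)}


table = inverse_table(7)
for z, z_inv in table.items():
    # Every product z * z^{-1} should be congruent to 1 modulo 7.
    print(z, z_inv, (z * z_inv) % 7)
```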
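We have

$$a_0 (x_0 - y_0) \equiv -\sum_{i=1}^{r} a_i (x_i - y_i) \pmod{m},$$

and since $x_0 \ne y_0$, an inverse $(x_0 - y_0)^{-1}$ must exist, which implies that

$$a_0 \equiv \left(-\sum_{i=1}^{r} a_i (x_i - y_i)\right) \cdot (x_0 - y_0)^{-1} \pmod{m}.$$

Thus, for any choices of $a_1, a_2, \ldots, a_r$, exactly one choice of $a_0$ causes $x$ and $y$ to collide.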
Proof (completed)
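Q. How many $h_a$'s cause $x$ and $y$ to collide?

A. There are $m$ choices for each of $a_1, a_2, \ldots, a_r$, but once these are chosen, exactly one choice for $a_0$ causes $x$ and $y$ to collide, namely

$$a_0 = \left(\left(-\sum_{i=1}^{r} a_i (x_i - y_i)\right) \cdot (x_0 - y_0)^{-1}\right) \bmod m.$$

Thus, the number of $h_a$'s that cause $x$ and $y$ to collide is $m^r \cdot 1 = m^r = |H|/m$, since each of the $r+1$ coefficients ranges over $m$ values and so $|H| = m^{r+1}$.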
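The counting argument can be checked by brute force on a small case. The sketch below is illustrative only: the names `h`, `count_colliding`, and `forced_a0` are hypothetical, the keys are assumed to be digit tuples over $\{0, \ldots, m-1\}$ with $m$ prime and $x_0 \ne y_0$, and the inverse is computed via Fermat's little theorem.

```python
from itertools import product


def h(a, key, m):
    """Dot-product hash h_a(key) = sum(a_i * key_i) mod m."""
    return sum(ai * ki for ai, ki in zip(a, key)) % m


def count_colliding(x, y, m):
    """Number of coefficient vectors a for which h_a(x) == h_a(y)."""
    return sum(
        1 for a in product(range(m), repeat=len(x)) if h(a, x, m) == h(a, y, m)
    )


def forced_a0(a_rest, x, y, m):
    """The unique a_0 that makes x and y collide, given a_1..a_r.

    Assumes m is prime and x[0] != y[0] (mod m).
    """
    s = sum(ai * (xi - yi) for ai, xi, yi in zip(a_rest, x[1:], y[1:]))
    inv = pow((x[0] - y[0]) % m, m - 2, m)  # (x_0 - y_0)^{-1} mod m
    return (-s * inv) % m


m = 7
x, y = (3, 1, 4), (5, 1, 2)        # r = 2, keys differ in position 0
print(count_colliding(x, y, m))    # m**r colliding functions out of m**(r+1)
print(forced_a0((2, 6), x, y, m))  # the a_0 forced by a_1 = 2, a_2 = 6
```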
Perfect hashing
IDEA: Two-level scheme with universal hashing at both levels. No collisions at level 2!
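[Figure: two-level hash table. The level-1 table T has slots 0 through 6; each occupied slot records the size m and hash parameter a of its level-2 table. One level-2 table has slots 0 through 8 and holds the keys 40, 37, and 22.]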
Collisions at level 2
Theorem. Let $H$ be a class of universal hash functions for a table of size $m = n^2$. Then, if we use a random $h \in H$ to hash $n$ keys into the table, the expected number of collisions is at most 1/2.

Proof. By the definition of universality, the probability that 2 given keys in the table collide under $h$ is $1/m = 1/n^2$. Since there are $\binom{n}{2}$ pairs of keys that can possibly collide, the expected number of collisions is

$$\binom{n}{2} \cdot \frac{1}{n^2} = \frac{n(n-1)}{2} \cdot \frac{1}{n^2} < \frac{1}{2}.$$
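The arithmetic in this bound is easy to tabulate. The following minimal sketch (the function name `expected_level2_collisions` is a hypothetical choice) evaluates $\binom{n}{2} \cdot 1/n^2$ exactly and shows that it stays below 1/2.

```python
from fractions import Fraction


def expected_level2_collisions(n):
    """Expected collisions when n keys are hashed into m = n**2 slots
    by a random function from a universal family (n >= 1)."""
    pairs = n * (n - 1) // 2           # number of key pairs, C(n, 2)
    return pairs * Fraction(1, n * n)  # each pair collides with probability 1/n**2


for n in (1, 2, 10, 1000):
    value = expected_level2_collisions(n)
    print(n, value, value < Fraction(1, 2))
```

The value approaches 1/2 from below as $n$ grows, which is why the theorem states the bound as "at most 1/2".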
No collisions at level 2
Proof. Markov's inequality says that for any nonnegative random variable $X$, we have $\Pr\{X \ge t\} \le E[X]/t$. Applying this inequality with $t = 1$, we find that the probability of 1 or more collisions is at most 1/2. Thus, just by testing random hash functions in $H$, we'll quickly find one that works.
Analysis of storage
For the level-1 hash table $T$, choose $m = n$, and let $n_i$ be the random variable for the number of keys that hash to slot $i$ in $T$. By using $n_i^2$ slots for the level-2 hash table $S_i$, the expected total storage required for the two-level scheme is therefore

$$E\left[\sum_{i=0}^{m-1} \Theta(n_i^2)\right] = \Theta(n),$$

since the analysis is identical to the analysis from recitation of the expected running time of bucket sort. (For a probability bound, apply Markov.)
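The two-level scheme can be sketched end to end as follows. This is an illustration under stated assumptions, not the lecture's construction: it swaps in the family $h(k) = ((ak + b) \bmod p) \bmod m$ with $p = 2^{31} - 1$ instead of the dot-product family, because that family works for any table size $m$; keys are assumed to be distinct integers in $[0, p)$ and the key set non-empty. The names `random_hash`, `build_two_level`, `lookup`, and `total_slots` are hypothetical.

```python
import random

P = 2_147_483_647  # the prime 2**31 - 1; keys are assumed to lie in [0, P)


def random_hash(m, rng):
    """Draw h(k) = ((a*k + b) mod P) mod m at random (assumed family)."""
    a = rng.randrange(1, P)
    b = rng.randrange(0, P)
    return lambda k: ((a * k + b) % P) % m


def build_two_level(keys, seed=0):
    """Build a static two-level table: level 1 has n slots, and bucket i
    gets a level-2 table of n_i**2 slots with no collisions."""
    rng = random.Random(seed)
    keys = list(dict.fromkeys(keys))  # drop duplicates, keep order
    n = len(keys)
    h1 = random_hash(n, rng)
    buckets = [[] for _ in range(n)]
    for k in keys:
        buckets[h1(k)].append(k)

    level2 = []
    for bucket in buckets:
        size = len(bucket) ** 2
        while True:  # retry until a collision-free function is found
            h2 = random_hash(size, rng) if size else None
            slots = [None] * size
            collision = False
            for k in bucket:
                j = h2(k)
                if slots[j] is not None:
                    collision = True
                    break
                slots[j] = k
            if not collision:
                break
        level2.append((h2, slots))
    return h1, level2


def lookup(table, key):
    """Membership test using one level-1 hash and one level-2 hash."""
    h1, level2 = table
    h2, slots = level2[h1(key)]
    return bool(slots) and slots[h2(key)] == key


def total_slots(table):
    """Total number of level-2 slots, i.e. the sum of n_i**2."""
    return sum(len(slots) for _, slots in table[1])


keys = [10, 22, 37, 40, 52, 60, 70, 72, 75]
table = build_two_level(keys)
print(all(lookup(table, k) for k in keys), lookup(table, 11))
print(total_slots(table))
```

Each level-2 retry succeeds with probability at least 1/2 by the Markov argument, so the expected number of retries per bucket is at most 2. The sketch draws the level-1 function only once; a fuller version might also redraw it when the total level-2 storage turns out to be large.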