01 Phone-Book-Problem 07 Hash Tables 2 Hashfunctions

Download as pdf or txt
Download as pdf or txt
You are on page 1of 119

Hash Tables:

Hash Functions

Michael Levin
Higher School of Economics

Data Structures
Data Structures and Algorithms
Outline
1 Good Hash Functions
2 Universal Family
3 Hashing Integers
4 Hashing Strings
Phone Book
Design a data structure to store your
contacts: names of people along with their
phone numbers. The data structure should
be able to do the following quickly:
Add and delete contacts,
Lookup the phone number by name,
Determine who is calling given their
phone number.
We need two Maps:
(phone number → name) and
(name → phone number)
We need two Maps:
(phone number → name) and
(name → phone number)
Implement these Maps as hash tables
We need two Maps:
(phone number → name) and
(name → phone number)
Implement these Maps as hash tables
First, we will focus on the Map from
phone numbers to names
Direct Addressing
int(123-45-67) = 1234567
Direct Addressing
int(123-45-67) = 1234567
Create array Name of size 10L
where L
is the maximum allowed phone number
length
Direct Addressing
int(123-45-67) = 1234567
Create array Name of size 10Lwhere L
is the maximum allowed phone number
length
Store the name corresponding to phone
number P in Name[int(P)]
Direct Addressing
int(123-45-67) = 1234567
Create array Name of size 10Lwhere L
is the maximum allowed phone number
length
Store the name corresponding to phone
number P in Name[int(P)]
If no contact with phone number P ,
Name[int(P)] = N/A
Direct Addressing
Name
...
Natalie
Natalie: 123-45-67 → 1234567 N/A
N/A
Steve: 223-23-23 → 2232323 ...
Steve
N/A
...
Direct Addressing
Operations run in O(1)
Direct Addressing
Operations run in O(1)
Memory usage: O(10L), where L is the
maximum length of a phone number
Direct Addressing
Operations run in O(1)
Memory usage: O(10L), where L is the
maximum length of a phone number
Problematic with international numbers
of length 12 and more: we will need
1012 bytes = 1TB to store one person's
phone book  this won't t in anyone's
phone!
Chaining

Select hash function h with cardinality m


Chaining

Select hash function h with cardinality m


Create array Name of size m
Chaining

Select hash function h with cardinality m


Create array Name of size m
Store chains in each cell of the array
Name
Chaining

Select hash function h with cardinality m


Create array Name of size m
Store chains in each cell of the array
Name
Chain Name[h(int(P))] contains the
name for phone number P
Chaining
0
1 Steve: 223-23-23 Sasha: 239-17-17
2
3
4
5
6 Natalie: 123-45-67
7
Parameters
n phone numbers stored
Parameters
n phone numbers stored
m  cardinality of the hash function
Parameters
n phone numbers stored
m  cardinality of the hash function
c  length of the longest chain
Parameters
n phone numbers stored
m  cardinality of the hash function
c  length of the longest chain
O(n + m) memory is used
Parameters
n phone numbers stored
m  cardinality of the hash function
c  length of the longest chain
O(n + m) memory is used
𝛼 = mn is called load factor
Parameters
n phone numbers stored
m  cardinality of the hash function
c  length of the longest chain
O(n + m) memory is used
𝛼 = mn is called load factor
Operations run in time O(c + 1)
Parameters
n phone numbers stored
m  cardinality of the hash function
c  length of the longest chain
O(n + m) memory is used
𝛼 = mn is called load factor
Operations run in time O(c + 1)
You want small m and c !
Good Example
0
1 O3 O7
2 O1 c=2
m=8 3 O8 O4
4
5 O2
6 O6
7 O5
Bad Example
0
1 O 3 O7 O1 O8 O 4 O2 O 6 O5
2 c=8
3 m=8
4
5
6
7
First Digits
For the map from phone numbers to
names, select m = 1000
First Digits
For the map from phone numbers to
names, select m = 1000
Hash function: take rst three digits
First Digits
For the map from phone numbers to
names, select m = 1000
Hash function: take rst three digits
h(800-123-45-67) = 800
First Digits
For the map from phone numbers to
names, select m = 1000
Hash function: take rst three digits
h(800-123-45-67) = 800
Problem: area code
First Digits
For the map from phone numbers to
names, select m = 1000
Hash function: take rst three digits
h(800-123-45-67) = 800
Problem: area code
h(425-234-55-67) =
h(425-123-45-67) =
h(425-223-23-23) = · · · = 425
Last Digits

Select m = 1000
Last Digits

Select m = 1000
Hash function: take last three digits
Last Digits

Select m = 1000
Hash function: take last three digits
h(800-123-45-67) = 567
Last Digits

Select m = 1000
Hash function: take last three digits
h(800-123-45-67) = 567
Problem if many phone numbers end
with three zeros
Random Value
Select m = 1000
Random Value
Select m = 1000
Hash function: random number between
0 and 999
Random Value
Select m = 1000
Hash function: random number between
0 and 999
Uniform distribution of hash values
Random Value
Select m = 1000
Hash function: random number between
0 and 999
Uniform distribution of hash values
Dierent value when hash function
called again  we won't be able to nd
anything!
Random Value
Select m = 1000
Hash function: random number between
0 and 999
Uniform distribution of hash values
Dierent value when hash function
called again  we won't be able to nd
anything!
Hash function must be deterministic
Good Hash Functions

Deterministic
Fast to compute
Distributes keys well into dierent cells
Few collisions
No Universal Hash Function

Lemma
If number of possible keys is big (|U| ≫ m),
for any hash function h there is a bad input
resulting in many collisions.
U
m=3 U

h(k) = 2
h(k) = 0 h(k) = 1
m=3 U

h(k) = 2
h(k) = 0 h(k) = 1
m=3 U

h(k) = 2
h(k) = 0 h(k) = 1
42%
Outline
1 Good Hash Functions
2 Universal Family
3 Hashing Integers
4 Hashing Strings
Idea

Remember QuickSort?
Idea

Remember QuickSort?
Choosing random pivot helped
Idea

Remember QuickSort?
Choosing random pivot helped
Use randomization!
Idea

Remember QuickSort?
Choosing random pivot helped
Use randomization!
Dene a family (set) of hash functions
Idea

Remember QuickSort?
Choosing random pivot helped
Use randomization!
Dene a family (set) of hash functions
Choose random function from the family
Universal Family
Definition
Let U be the universe  the set of all
possible keys.
Universal Family
Definition
Let U be the universe  the set of all
possible keys. A set of hash functions
ℋ = {h : U → {0, 1, 2, . . . , m − 1}}
Universal Family
Definition
Let U be the universe  the set of all
possible keys. A set of hash functions
ℋ = {h : U → {0, 1, 2, . . . , m − 1}}

is called a universal family if


Universal Family
Definition
Let U be the universe  the set of all
possible keys. A set of hash functions
ℋ = {h : U → {0, 1, 2, . . . , m − 1}}

is called a universal family if for any two keys


x, y ∈ U, x ̸= y the probability of collision

Pr [h(x) = h(y )] ≤
1
m
Universal Family

Pr [h(x) = h(y )] ≤
1
m
means that a collision h(x) = h(y ) on
selected keys x and y , x ̸= y happens for no
more than m1 of all hash functions h ∈ ℋ.
How Randomization Works

h(x) = random({0, 1, 2, . . . , m − 1})


gives probability of collision exactly m1 .
How Randomization Works

h(x) = random({0, 1, 2, . . . , m − 1})


gives probability of collision exactly m1 .
It is not deterministic  can't use it.
How Randomization Works

h(x) = random({0, 1, 2, . . . , m − 1})


gives probability of collision exactly m1 .
It is not deterministic  can't use it.
All hash functions in ℋ are deterministic
How Randomization Works

h(x) = random({0, 1, 2, . . . , m − 1})


gives probability of collision exactly m1 .
It is not deterministic  can't use it.
All hash functions in ℋ are deterministic
Select a random function h from ℋ
How Randomization Works

h(x) = random({0, 1, 2, . . . , m − 1})


gives probability of collision exactly m1 .
It is not deterministic  can't use it.
All hash functions in ℋ are deterministic
Select a random function h from ℋ
Fixed h is used throughout the algorithm
Running Time
Lemma
If h is chosen randomly from a universal
family, the average length of the longest
chain c is O(1 + 𝛼), where 𝛼 = mn is the
load factor of the hash table.
Corollary
If h is from universal family, operations with
hash table run on average in time O(1 + 𝛼).
Choosing Hash Table Size

Control amount of memory used with m


Choosing Hash Table Size

Control amount of memory used with m


Ideally, load factor 0.5 < 𝛼 < 1
Choosing Hash Table Size

Control amount of memory used with m


Ideally, load factor 0.5 < 𝛼 < 1
Use O(m) = O( 𝛼n ) = O(n) memory to
store n keys
Choosing Hash Table Size

Control amount of memory used with m


Ideally, load factor 0.5 < 𝛼 < 1
Use O(m) = O( 𝛼n ) = O(n) memory to
store n keys
Operations run in time
O(1 + 𝛼) = O(1) on average
Dynamic Hash Tables
What if number of keys n is unknown in
advance?
Dynamic Hash Tables
What if number of keys n is unknown in
advance?
Start with very big hash table?
Dynamic Hash Tables
What if number of keys n is unknown in
advance?
Start with very big hash table?
You will waste a lot of memory
Dynamic Hash Tables
What if number of keys n is unknown in
advance?
Start with very big hash table?
You will waste a lot of memory
Copy the idea of dynamic arrays!
Dynamic Hash Tables
What if number of keys n is unknown in
advance?
Start with very big hash table?
You will waste a lot of memory
Copy the idea of dynamic arrays!
Resize the hash table when 𝛼 becomes
too large
Dynamic Hash Tables
What if number of keys n is unknown in
advance?
Start with very big hash table?
You will waste a lot of memory
Copy the idea of dynamic arrays!
Resize the hash table when 𝛼 becomes
too large
Choose new hash function and rehash all
the objects
Keep load factor below 0.9:
Rehash(T )
loadFactor ← T .numberOfKeys
T .size
if loadFactor > 0.9:
Create Tnew of size 2 × T .size
Choose hnew with cardinality Tnew .size
For each object O in T :
Insert O in Tnew using hnew
T ← Tnew , h ← hnew
Rehash Running Time
You should call Rehash after each operation
with the hash table
Similarly to dynamic arrays, single rehashing
takes O(n) time, but amortized running time
of each operation with hash table is still O(1)
on average, because rehashing will be rare
Outline
1 Good Hash Functions
2 Universal Family
3 Hashing Integers
4 Hashing Strings
Take phone numbers up to length 7, for
example 148-25-67
Take phone numbers up to length 7, for
example 148-25-67
Convert phone numbers to integers from
0 to 107 − 1 = 9 999 999:
148-25-67 → 1 482 567
Take phone numbers up to length 7, for
example 148-25-67
Convert phone numbers to integers from
0 to 107 − 1 = 9 999 999:
148-25-67 → 1 482 567
Choose prime number bigger than 107,
e.g. p = 10 000 019
Take phone numbers up to length 7, for
example 148-25-67
Convert phone numbers to integers from
0 to 107 − 1 = 9 999 999:
148-25-67 → 1 482 567
Choose prime number bigger than 107,
e.g. p = 10 000 019
Choose hash table size, e.g. m = 1 000
Hashing Integers

Lemma
ℋp = hpa,b (x) = ((ax + b) mod p) mod m
{︀ }︀

for all a, b : 1 ≤ a ≤ p − 1, 0 ≤ b ≤ p − 1
is a universal family
Hashing Phone Numbers
Example
Select a = 34, b = 2, so h = hp34,2 and
consider x = 1 482 567 corresponding to
phone number 148-25-67. p = 10 000 019.
Hashing Phone Numbers
Example
Select a = 34, b = 2, so h = hp34,2 and
consider x = 1 482 567 corresponding to
phone number 148-25-67. p = 10 000 019.
(34 × 1482567 + 2) mod 10000019 = 407185
Hashing Phone Numbers
Example
Select a = 34, b = 2, so h = hp34,2 and
consider x = 1 482 567 corresponding to
phone number 148-25-67. p = 10 000 019.
(34 × 1482567 + 2) mod 10000019 = 407185
407185 mod 1000 = 185
Hashing Phone Numbers
Example
Select a = 34, b = 2, so h = hp34,2 and
consider x = 1 482 567 corresponding to
phone number 148-25-67. p = 10 000 019.
(34 × 1482567 + 2) mod 10000019 = 407185
407185 mod 1000 = 185
h(x) = 185
General Case
Dene maximum length L of a phone
number
General Case
Dene maximum length L of a phone
number
Convert phone numbers to integers from
0 to 10L − 1
General Case
Dene maximum length L of a phone
number
Convert phone numbers to integers from
0 to 10L − 1
Choose prime number p > 10L
General Case
Dene maximum length L of a phone
number
Convert phone numbers to integers from
0 to 10L − 1
Choose prime number p > 10L
Choose hash table size m
General Case
Dene maximum length L of a phone
number
Convert phone numbers to integers from
0 to 10L − 1
Choose prime number p > 10L
Choose hash table size m
Choose random hash function from
universal family ℋp (choose random
a ∈ [1, p − 1] and b ∈ [0, p − 1])
Outline
1 Good Hash Functions
2 Universal Family
3 Hashing Integers
4 Hashing Strings
Lookup Phone Numbers by Name
Now we need to implement the Map
from names to phone numbers
Lookup Phone Numbers by Name
Now we need to implement the Map
from names to phone numbers
Can also use chaining
Lookup Phone Numbers by Name
Now we need to implement the Map
from names to phone numbers
Can also use chaining
Need a hash function dened on names
Lookup Phone Numbers by Name
Now we need to implement the Map
from names to phone numbers
Can also use chaining
Need a hash function dened on names
Hash arbitrary strings of characters
Lookup Phone Numbers by Name
Now we need to implement the Map
from names to phone numbers
Can also use chaining
Need a hash function dened on names
Hash arbitrary strings of characters
You will learn how string hashing is
implemented in Java!
String Length Notation
Definition
Denote by |S| the length of string S .
Examples
|“a”| = 1
|“ab”| = 2
|“abcde”| = 5
Hashing Strings
Given a string S , compute its hash value
Hashing Strings
Given a string S , compute its hash value
S = S[0]S[1] . . . S[|S| − 1], where S[i]
 individual characters
Hashing Strings
Given a string S , compute its hash value
S = S[0]S[1] . . . S[|S| − 1], where S[i]
 individual characters
We should use all the characters in the
hash function
Hashing Strings
Given a string S , compute its hash value
S = S[0]S[1] . . . S[|S| − 1], where S[i]
 individual characters
We should use all the characters in the
hash function
Otherwise there will be many collisions:
Hashing Strings
Given a string S , compute its hash value
S = S[0]S[1] . . . S[|S| − 1], where S[i]
 individual characters
We should use all the characters in the
hash function
Otherwise there will be many collisions:
For example, if S[0] is not used,
h(“aa”) = h(“ba”) = · · · = h(“za”)
Preparation

Convert each character S[i] to integer


code
Preparation

Convert each character S[i] to integer


code
ASCII code, Unicode, etc.
Preparation

Convert each character S[i] to integer


code
ASCII code, Unicode, etc.
Choose big prime number p
Polynomial Hashing
Definition
Family of hash functions
⎧ ⎫
|S|−1
mod p⎭
⎨ ∑︁ ⎬
𝒫p = hpx (S) = S[i]x i
i=0

with a xed prime p and all 1 ≤ x ≤ p − 1 is


called polynomial.
PolyHash(S, p, x)
hash ← 0
for i from |S| − 1 down to 0:
hash ← (hash × x + S[i]) mod p
return hash
Example: |S| = 3
1 hash = 0
2 hash = S[2] mod p
3 hash = S[1] + S[2]x mod p
4 hash = S[0] + S[1]x + S[2]x 2 mod p
Java Implementation
The method hashCode of the built-in Java
class String is very similar to our
PolyHash, it just uses x = 31 and for
technical reasons avoids the (mod p)
operator.
Java Implementation
The method hashCode of the built-in Java
class String is very similar to our
PolyHash, it just uses x = 31 and for
technical reasons avoids the (mod p)
operator.
You now know how a function that is used
trillions of times a day in many thousands of
programs is implemented!
Lemma
For any two dierent strings s1 and s2 of
length at most L + 1, if you choose h from
𝒫p at random (by selecting a random
x ∈ [1, p − 1]), the probability of collision
Pr [h(s1) = h(s2)] is at most pL .

Proof idea
This follows from the fact that the equation
a0 + a1x + a2x 2 + · · · + aLx L = 0 (mod p) for
prime p has at most L dierent solutions x .
Cardinality Fix

For use in a hash table of size m, we need a


hash function of cardinality m.
First apply random h from 𝒫p and then hash
the resulting value again using integer
hashing. Denote the resulting function by hm.
Lemma
For any two dierent strings s1 and s2 of
length at most L + 1 and cardinality m, the
probability of collision Pr [hm(s1) = hm(s2)] is
at most m1 + pL .
Polynomial Hashing
Corollary
If p > mL, for any two different strings s1
and s2 of length at most L + 1 the probability
of collision Pr [hm (s1) = hm (s2)] is O( m1 ).

Proof
1 1 1
m + pL < m
L
+ mL = m + m1 = 2
m = O( m1 )
Running Time
For big enough p again have
c = O(1 + 𝛼)
Running Time
For big enough p again have
c = O(1 + 𝛼)
Computing PolyHash(S) runs in time
O(|S|)
Running Time
For big enough p again have
c = O(1 + 𝛼)
Computing PolyHash(S) runs in time
O(|S|)
If lengths of the names in the phone
book are bounded by constant L,
computing h(S) takes O(L) = O(1)
time
Conclusion
You learned how to hash integers and
strings
Phone book can be implemented as two
hash tables
Mapping phone numbers to names and
back
Search and modication run on average
in O(1)!

You might also like