Hash Tables
June 27, 2021
Outline
Introduction
Direct addressing
Hash tables
Hash functions
The division method
The multiplication method
Handling collisions
Chaining
Open addressing
Linear probing
Quadratic probing
Double hashing
Exercises
The last 4 digits of your student id can be used as a key that indexes into an array A
of pointers. Each pointer x is the address of a record storing the academic
information of a student : A[id] = x
The array has $10^4$ = 10,000 entries. On a machine with 64-bit addresses each pointer
takes 8 bytes, so the size of the array is 10,000 × 8 = 80,000 bytes,
which is a small data structure
Given a student id, one can find the address x of the matching record through A : x = A[id].
It costs O(1) to find the record.
This is called direct addressing, a very efficient way to retrieve information from computer
memory.
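A minimal sketch of the direct-address table described above, assuming records are plain dictionaries and that the key is obtained by taking the id modulo 10,000. The names `insert`, `lookup` and the sample record are hypothetical, chosen only for illustration.

```python
# Direct-address table: one slot per possible key (the last 4 digits of an id).
TABLE_SIZE = 10**4                 # keys 0000..9999

table = [None] * TABLE_SIZE        # None plays the role of a null pointer


def insert(student_id, record):
    """Store the record in the slot given by the last 4 digits of the id."""
    table[student_id % TABLE_SIZE] = record


def lookup(student_id):
    """Return the record in O(1) time, or None if the slot is empty."""
    return table[student_id % TABLE_SIZE]


insert(20214567, {"name": "A. Student", "gpa": 3.7})   # hypothetical record
print(lookup(20214567))   # the record stored above
print(lookup(20210001))   # None: nothing is stored under key 0001
```

Every lookup is a single array access, which is where the O(1) cost comes from.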
Problems with direct addressing
Some organizations may not assign an id to the entities with which they
interact (like phone companies or stores), while others use large
id numbers (a government id card may have 12 digits, a credit card
number has 16 digits)
For example, if your 10-digit phone number were used to index directly into your data,
the array would need $10^{10}$ = 10,000,000,000 (10 billion) entries. At 8 bytes per entry
that is 80 billion bytes, or 640 billion bits (640,000,000,000 bits). This
is still far less than what a 64-bit address space can reference
($2^{64}$ = 18,446,744,073,709,551,616 addresses), but it is certainly a very large array
Needless to say, this is not the way your information is indexed. Rather, indirect addressing
modes are used, one of which is called hashing (or hash tables)
Direct addressing formally defined
Instead of storing an element with key k at index k, use a function h and store the
element at index h(k)
The "hash function" h maps keys from the domain U = {0, 1, ..., n − 1} to the indices
{0, 1, ..., m − 1} of an array T called the hash table, where m is much smaller than n
We must make sure that h(k) is a legal index into the table T
Issues covered in this lecture
The dividend k is the key to be hashed while the divisor m is the size of
the hash table. The hash value (digest) is the value returned by the
modulo function
h(k) = k mod m
All keys whose binary representation ends with 101 index the
same entry in the table (all the keys in the above example index the same
entry of the hash table).
Solution to hash table size
Here the table size is 11, which is a prime number. The binary representation
of 11 is 1011.
As we can see, each subtraction impacts the 4 rightmost bits that define the
index in the hash table.
decimal  binary   subtract 1011  decimal remainder  binary remainder
45       101101   -1011          34                 100010
34       100010   -1011          23                 010111
23       010111   -1011          12                 001100
12       001100   -1011          1                  000001
The further the divisor is from a power of two, the more evenly the bits of
its binary representation are split between 0s and 1s.
As a result, more digits of the remainder are affected by each subtraction.
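The contrast between a power-of-two table size and a prime table size can be checked with a short sketch. The keys below are hypothetical, picked so that their binary representations all end in 101, and the power-of-two size m = 8 is an assumption made for the comparison.

```python
def h(k, m):
    """Division method: the hash value is the remainder of k divided by m."""
    return k % m


keys = [5, 13, 21, 29, 37, 45]       # hypothetical keys, all ending in binary 101
print([bin(k) for k in keys])
print([h(k, 8) for k in keys])       # [5, 5, 5, 5, 5, 5]
print([h(k, 11) for k in keys])      # [5, 2, 10, 7, 4, 1]
```

With m = 8 only the three rightmost bits of the key decide the index, so every key collides. With m = 11 the same keys land in six different entries.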
Division method applied on strings of characters
Most hashing functions assume that the key is a natural number. Here
we show how a hashing function like the division method can be
applied on keys that are strings of characters.
Recall : A number in base 10 is
$6789 = (6 \times 10^3) + (7 \times 10^2) + (8 \times 10^1) + (9 \times 10^0)$
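The same positional idea can turn a string into a natural number. The sketch below assumes each character is treated as one digit in base 128 (its ASCII code), which is one common choice and not necessarily the one used in the lecture. The names `string_to_int` and `hash_string` are hypothetical.

```python
def string_to_int(s, base=128):
    """Interpret the characters of s as the digits of a number written in `base`."""
    value = 0
    for ch in s:
        value = value * base + ord(ch)   # Horner's rule
    return value


def hash_string(s, m, base=128):
    """Division method on a string key, reducing mod m at every step
    so that intermediate values stay small."""
    h = 0
    for ch in s:
        h = (h * base + ord(ch)) % m
    return h


print(string_to_int("cat"))      # 1634548 = 99*128^2 + 97*128 + 116
print(hash_string("cat", 11))    # 3
```

Reducing modulo m inside the loop gives the same result as converting the whole string first and taking the remainder once, but it avoids building a very large intermediate number for long keys.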
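$h(52) = \lfloor 8\,(52 \times 0.6180 \bmod 1) \rfloor = \lfloor 8\,(32.136 \bmod 1) \rfloor = \lfloor 8 \times 0.136 \rfloor = \lfloor 1.088 \rfloor = 1$
$h(36) = \lfloor 8\,(36 \times 0.6180 \bmod 1) \rfloor = \lfloor 8\,(22.248 \bmod 1) \rfloor = \lfloor 8 \times 0.248 \rfloor = \lfloor 1.984 \rfloor = 1$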
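The two computations above can be reproduced with a few lines of code. This is a sketch using floating-point arithmetic, and the function name `hash_mul` is hypothetical.

```python
import math


def hash_mul(k, m, A):
    """Multiplication method: h(k) = floor(m * (k*A mod 1))."""
    frac = (k * A) % 1.0          # fractional part of k*A
    return math.floor(m * frac)


print(hash_mul(52, 8, 0.6180))    # 1
print(hash_mul(36, 8, 0.6180))    # 1
```

Both keys hash to entry 1 of a table of size 8, matching the hand computation.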
Multiplication method : exercises
Compute the hash value of the keys below using the multiplication
method :
1. key = 196, m = 8, A = 0.22
2. key = 353, m = 11, A = 0.75
Advantage, disadvantage, implementation considerations
Assume memory words have w bits and the key k fits in a single word
Choose A to be a fraction of the form $A = s / 2^w$, where s is an integer in the
range $0 < s < 2^w$
Example :
$m = 8 = 2^3$; $w = 6$; $k = 52$; $0 < s < 2^6$; $s = 40$; $A = 40/64 = 0.625$
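With A of this form and a table size that is a power of two, the multiplication method can be carried out with integer operations only. The sketch below assumes $m = 2^p$; the function name `hash_mul_int` and the parameter name `p` are hypothetical.

```python
def hash_mul_int(k, s, w, p):
    """Multiplication method with A = s / 2**w and table size m = 2**p.

    The low w bits of k*s hold the fractional part of k*A scaled by 2**w;
    the top p of those w bits are the hash value.
    """
    low_w_bits = (k * s) & ((1 << w) - 1)
    return low_w_bits >> (w - p)


# Example values from the slide: m = 8 = 2^3, w = 6, k = 52, s = 40
print(hash_mul_int(52, 40, 6, 3))   # 4
```

Checking by hand: 52 × 0.625 = 32.5, the fractional part is 0.5, and ⌊8 × 0.5⌋ = 4, which agrees with the integer computation (52 × 40 = 2080, 2080 mod 64 = 32, and 32 shifted right by 3 bits is 4).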
Collisions occur when two keys hash to the same index in the hash table
Chaining puts elements that hash to the same index in a linked list :
SEARCH(T, k)
search for an element with key k in the list T[h(k)]. Time is proportional to the length
of the list stored at T[h(k)]
INSERT(T, x)
insert x at the head of the list T[h(key[x])]. Worst-case O(1)
Chaining
DELETE(T, x)
delete x from the list T[h(key[x])]. If the lists are singly linked, deletion
takes as long as searching; if they are doubly linked, it takes O(1)
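The three operations can be sketched with singly linked chains. The class names, the division-method hash function and the sample keys are assumptions made for the illustration, and deletion here is by key, so it first has to walk the chain.

```python
class Node:
    def __init__(self, key, value, next_node=None):
        self.key, self.value, self.next = key, value, next_node


class ChainedHashTable:
    def __init__(self, m):
        self.m = m
        self.table = [None] * m            # T[0..m-1]; each entry heads a chain

    def _h(self, key):
        return key % self.m                # division method (assumed here)

    def insert(self, key, value):
        j = self._h(key)
        self.table[j] = Node(key, value, self.table[j])   # O(1): new head

    def search(self, key):
        node = self.table[self._h(key)]
        while node is not None and node.key != key:
            node = node.next               # time proportional to chain length
        return node                        # None if the key is absent

    def delete(self, key):
        j = self._h(key)
        prev, node = None, self.table[j]
        while node is not None and node.key != key:
            prev, node = node, node.next
        if node is None:
            return False
        if prev is None:
            self.table[j] = node.next      # removing the head of the chain
        else:
            prev.next = node.next
        return True


t = ChainedHashTable(9)
for key in (10, 19, 7):                    # 10 and 19 both hash to entry 1
    t.insert(key, f"record {key}")
print(t.search(19).value)                  # record 19
print(t.delete(10), t.search(10))          # True None
```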
Analysis of running time for chaining
Worst case : All keys hash in the same hash table entry, creating a
chain of length n. Searching for a key takes Θ(n) + the time to hash
the key. We don’t do hashing for the worst case !
Average case analysis
In other words, each key is equally likely to hash to any of the m slots
of the hash table
Under the assumption of uniform hashing, the load factor α, which is also the
average chain length of a hash table of size m with n elements, is
$\alpha = n / m$
Consequently, if $n = O(m)$, then $\alpha = n/m = O(m)/m = O(1)$. In other words, if the hash function
distributes keys uniformly, the average time for searching is constant
provided the number of elements in the hash table is at most a constant
multiple of the size of the table.
Open addressing
When collisions are handled through chaining, the keys that collide are
placed in a linked list at the entry where the collision occurred
Open addressing places the keys that collide in another entry of the
hash table by searching for an empty slot using a probe sequence
HASH-INSERT(T, k)
    i = 0
    repeat
        j = h(k, i)
        if T[j] == NIL
            T[j] = k
            return j
        else i = i + 1    // probing
    until i == m
    error "hash table overflow"
Open addressing : Searching
HASH-SEARCH(T, k)
    i = 0
    repeat
        j = h(k, i)
        if T[j] == k
            return j
        else i = i + 1
    until T[j] == NIL or i == m
    return NIL
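A runnable sketch of HASH-INSERT and HASH-SEARCH follows. The probe function is passed in as a parameter; the linear probe used in the demonstration, the table size and the keys are assumptions chosen only to make the example concrete.

```python
NIL = None


def hash_insert(T, k, h):
    """Place k in the first empty slot of its probe sequence; return the slot."""
    m = len(T)
    for i in range(m):
        j = h(k, i)
        if T[j] is NIL:
            T[j] = k
            return j
    raise OverflowError("hash table overflow")


def hash_search(T, k, h):
    """Follow the same probe sequence; stop at k, at an empty slot, or after m probes."""
    m = len(T)
    for i in range(m):
        j = h(k, i)
        if T[j] == k:
            return j
        if T[j] is NIL:
            break
    return NIL


m = 11
linear = lambda k, i: (k % m + i) % m      # assumed probe function for the demo
T = [NIL] * m
for key in (10, 21, 32):                   # all three keys hash to entry 10
    hash_insert(T, key, linear)
print(hash_search(T, 32, linear))          # 1
print(hash_search(T, 43, linear))          # None
```

Searching must follow exactly the probe sequence used at insertion time, which is why both functions take the same h.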
Computing probe sequences
Probe sequences determine how empty slots in the hash table are
searched. We describe three probe sequence techniques :
- Linear probing
- Quadratic probing
- Double hashing
Linear probing
h(28, 1) = 7
index   0    1    2    3    4    5    6    7    8    9    10
key     44   56   NIL  NIL  81   NIL  39   29   52   NIL  21
                                      ↑    ↑
h(28, 2) = 8
index   0    1    2    3    4    5    6    7    8    9    10
key     44   56   NIL  NIL  81   NIL  39   29   52   NIL  21
                                                ↑
h(28, 3) = 9
index   0    1    2    3    4    5    6    7    8    9    10
key     44   56   NIL  NIL  81   NIL  39   29   52   NIL  21
                                                     ↑
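The probes for key 28 can be traced in code. The sketch assumes the probe function h(k, i) = (k mod 11 + i) mod 11, which is consistent with the values h(28, 1) = 7, h(28, 2) = 8 and h(28, 3) = 9 shown above; the name `linear_probe` is hypothetical.

```python
def linear_probe(k, i, m):
    """Linear probing: start at k mod m and move one slot to the right per probe."""
    return (k % m + i) % m


T = [44, 56, None, None, 81, None, 39, 29, 52, None, 21]
m = len(T)

for i in range(m):
    j = linear_probe(28, i, m)
    state = "empty" if T[j] is None else "occupied"
    print(f"h(28, {i}) = {j} ({state})")
    if T[j] is None:
        T[j] = 28          # first empty slot of the probe sequence
        break
```

Slots 6, 7 and 8 are occupied, so key 28 ends up in slot 9 after four probes.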
Linear probing : exercise
Assume the hash table has size m = 7 and h(k) = k mod 7. Insert
keys 701, 145, 217, 19, 13, 749 in the table, using linear probing
Linear probing : primary clustering
A cluster that covers the hash of a key is called the primary cluster of
the key
Linear probing can create large primary clusters that will increase the
running time of search/insert/delete
Example : hash table has size m = 7 and h(k) = k mod 7. Insert keys
14, 15, 1, 35, 29
index   0    1    2    3    4    5    6
key     14   15   1    35   29   NIL  NIL
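Counting probes per insertion makes the cost of the cluster visible. This is a sketch for the example above; `insert_linear` is a hypothetical helper.

```python
def insert_linear(T, k):
    """Insert k with linear probing; return the number of probes used."""
    m = len(T)
    for i in range(m):
        j = (k % m + i) % m
        if T[j] is None:
            T[j] = k
            return i + 1
    raise OverflowError("hash table overflow")


T = [None] * 7
probes = [insert_linear(T, k) for k in (14, 15, 1, 35, 29)]
print(T)        # [14, 15, 1, 35, 29, None, None]
print(probes)   # [1, 1, 2, 4, 4]
```

Keys 35 and 29 hash to entries 0 and 1, but each needs four probes because it has to walk through the cluster built by the earlier insertions.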
Linear probing : clustering
As the hash table starts to fill, the number of clusters increases, then
they merge, becoming larger
[Figure : clustering in a linear probing table as it fills, from R. Sedgewick]
Quadratic probing
If the load factor $\alpha > 1/2$, then quadratic probing may fail to find an empty entry
0 1 2 3 4 5 6
14 15 35 1 5
Conclusion : The size of the table must be expanded each time the load factor
exceeds $1/2$ if no other technique is used
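The reason for the 1/2 threshold can be seen by listing the slots a quadratic probe sequence actually reaches. The sketch assumes the common form h(k, i) = (k mod m + c1·i + c2·i²) mod m with c1 = 0 and c2 = 1; these constants and the function name are assumptions.

```python
def quadratic_probe(k, i, m, c1=0, c2=1):
    """h(k, i) = (k mod m + c1*i + c2*i^2) mod m (constants are assumptions)."""
    return (k % m + c1 * i + c2 * i * i) % m


m = 7
reachable = sorted({quadratic_probe(0, i, m) for i in range(m)})
print(reachable)    # [0, 1, 2, 4]
```

A key hashing to entry 0 can only ever reach 4 of the 7 slots. If those four are taken, the insertion fails even though three slots are still empty, which can only happen once the table is more than half full.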
Quadratic probing issues : secondary clustering
Issue : The hash value $h_2(k)$ must be relatively prime to m (i.e. no factors in
common other than 1). To satisfy this requirement we
1. could choose m to be a power of 2 and $h_2$ to always produce an odd number
> 1
   - The factors of $m = 2^x$ are $2^{x-y}$, $y = 1, 2, \ldots, x - 1$. The factors of
     $m = 2^5$ are $2^4$, $2^3$, $2^2$ and $2^1$
   - If $h_2(k)$ is an odd number, it cannot be a factor of $m = 2^x$
2. or choose m prime and $h_2$ to always produce a positive integer less than m
   - If $h_2(k) < m$, since m is prime, none of the values greater than 1 and
     smaller than m can be a factor of m
Handling collisions : Double hashing
Assume the hash table has size m = 7, $h_1(k) = k \bmod 7$ and $h_2(k) = 5 - (k \bmod 5)$,
thus $h(k, i) = (h_1(k) + i\,h_2(k)) \bmod 7$.
1. Insert the following sequence of keys : 76, 93, 40, 47, 10, 55
2. Insert the following sequence of keys : 701, 145, 217, 19, 13, 749
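As a starting point for these exercises, the probe function can be written out directly. The key 20 below is hypothetical and deliberately not one of the exercise keys.

```python
from math import gcd

M = 7


def h1(k):
    return k % M


def h2(k):
    return 5 - (k % 5)          # always between 1 and 5, never 0


def h(k, i):
    """Double hashing: the step between probes depends on the key."""
    return (h1(k) + i * h2(k)) % M


k = 20                          # hypothetical key
print(h1(k), h2(k))             # 6 5
print([h(k, i) for i in range(M)])   # [6, 4, 2, 0, 5, 3, 1]
print(gcd(h2(k), M))            # 1
```

Because the step 5 is relatively prime to the table size 7, the probe sequence visits every slot exactly once before repeating.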
Recalibrating the size of the table
If the table is half full, probing costs $1/(1 - 0.5) = 2$ probes on average. If the table
is 90 percent full, the cost is $1/(1 - 0.9) = 10$.
When the table gets too full, inserting and searching for a key become
too costly. We should consider increasing the size of the table
Each time we increase the size of the table we should re-hash all the
keys as the new modulo m of the larger table is not the same as the
modulo of the original table
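A sketch of re-hashing into a larger table, assuming linear probing and the division method. The function names, the sample table and the new size 17 are assumptions made for the illustration.

```python
def expected_probes(alpha):
    """Estimated probing cost 1 / (1 - alpha) for load factor alpha < 1."""
    return 1 / (1 - alpha)


def rehash(old_table, new_m):
    """Re-insert every key of old_table into a new table of size new_m."""
    new_table = [None] * new_m
    for k in old_table:
        if k is None:
            continue
        for i in range(new_m):
            j = (k % new_m + i) % new_m     # the index depends on the new modulus
            if new_table[j] is None:
                new_table[j] = k
                break
    return new_table


old = [14, 15, 1, 35, 29, None, None]       # m = 7, load factor 5/7
new = rehash(old, 17)                       # load factor 5/17
print(f"{expected_probes(0.5):.1f} {expected_probes(0.9):.1f}")   # 2.0 10.0
print({k: new.index(k) for k in old if k is not None})
# {14: 14, 15: 15, 1: 1, 35: 2, 29: 12}
```

Simply copying the old array into the first seven entries of the new one would not work: key 29, for example, sits in slot 4 of the old table but belongs in slot 12 of the new one.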
Open addressing : Deleting
1. Given the values 2341, 4234, 2839, 430, 22, 397, 3920, a hash table of size 7,
and hash function h(x) = x mod 7, show the resulting tables after inserting
the values in the above order with each of these collision strategies :
1.1 Chaining
1.2 Linear probing
1.3 Quadratic probing where $c_1 = 1$ and $c_2 = 1$
1.4 Double hashing with second hash function $h_2(x) = (2x - 1) \bmod 7$
2. Suppose you have a hash table of size m = 9, using the division method as hashing
function h(x) = x mod 9 and chaining to handle collisions. The following
keys are inserted : 5, 28, 19, 15, 20, 33, 12, 17, 10. In which entries of the
table do collisions occur ?
Exercises
3. Now suppose you use the same hashing function as above with linear probing
to handle collisions, and the same keys as above are inserted. More collisions
occur than in the previous question. Where do the collisions occur and where
do the keys end up ?
4. Fill a hash table by inserting items with the keys D E M O C R A T in that
order into an initially empty table of size m = 5, using chaining to handle
collisions. Use the hash function 11k mod m to transform the kth letter of
the alphabet into a table index, e.g., hash(I ) = hash(9) = 99 mod 5 = 4
5. Fill a hash table when inserting items with the keys R E P U B L I C A N in
that order into an initially empty table of size m = 16 using linear probing.
Use the hash function 11k mod m to transform the kth letter of the alphabet
into a table index.
6. Suppose you use one of the open addressing techniques to handle collisions
and you have inserted so many keys/values into your hash table that all
entries are taken. Then collisions occur every time. What can you do ?
7. Suppose a hash table with capacity m = 31 gets to be over .75 full. We decide
to rehash. What is a good size choice for the new table to reduce the load
factor below .5 and also avoid collisions ?
Hash functions as cryptographic functions
The number after SHA indicates the length in bits (the number of bits)
of the digest produced by each of these cryptographic hash functions
For example, adding a period to the end of the sentence below changes
almost half (111 out of 224) of the bits in the digest :
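This effect can be measured with Python's standard `hashlib` module. The sentence used below is an assumption, a common pangram chosen for illustration, and `bit_difference` is a hypothetical helper.

```python
import hashlib


def bit_difference(a, b):
    """Number of bit positions in which two equal-length digests differ."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


sentence = b"The quick brown fox jumps over the lazy dog"   # assumed example text
d1 = hashlib.sha224(sentence).digest()
d2 = hashlib.sha224(sentence + b".").digest()

print(len(d1) * 8)               # 224: digest length in bits
print(bit_difference(d1, d2))    # number of bits changed by the added period
```

For a good cryptographic hash function, a one-character change in the input is expected to flip roughly half of the digest bits.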