
Hash Tables

The lecture notes cover hash tables and various methods for handling collisions, including chaining and open addressing. It discusses hash functions, specifically the division and multiplication methods, and their applications in indexing data efficiently. Additionally, it addresses the challenges of direct addressing and the importance of choosing appropriate hash table sizes to minimize collisions.


Algorithms and Data Structures

Lecture notes: Hash Tables, Cormen Chap. 11

Lecturer: Michel Toulouse

Hanoi University of Science & Technology



27 June 2021
Outline

Introduction
Direct addressing
Hash tables

Hash functions
The division method
The multiplication method

Handling collisions
Chaining
Open addressing
Linear probing
Quadratic probing
Double hashing

Exercises

Hash functions as encryption methods


Motivation

The last 4 digits of your student id can be used as a key that indexes into an array A
of pointers. Each pointer x is the address of a record storing the academic
information of a student: A[id] = x

The array has 10^4 = 10,000 entries. If the computer memory is 64-bit
addressable, each pointer takes 8 bytes, so the size of the array is 10,000 × 8 = 80,000 bytes,
which is a small data structure.

Given a student id, one can find the address x of the student's record through A: x = A[id].
It costs O(1) to find the record.

This is called direct addressing, a very efficient way to retrieve info in computer
memory.
Problems with direct addressing

Some organizations may not provide an id for the entities with which they
interact (like phone companies or stores), or may identify entities using large
id numbers (like government id cards with 12 digits, or credit card numbers with
16 digits).

For example, if your phone number were used to index directly into your data, there
are 10 digits, which means the array would have 10^10 = 10,000,000,000 entries (10
billion). At 8 bytes per pointer this is 80 billion bytes, or 640 billion bits. This
is still far less than what a 64-bit address can span
(2^64 = 18,446,744,073,709,551,616 addresses), but it is a huge table, and most of
its entries would be empty.

Needless to say, this is not the way your info is indexed. Rather, indirect addressing
modes are used; one of them is called hashing (or hash tables).
Direct addressing formally defined

In direct addressing, the keys are used to index entries in a data
structure. Suppose:
- each key is drawn from a set of n numbers U = {0, 1, ..., n − 1}
- keys are distinct
Implementation:
- We are given an array T[0..n − 1] of pointers (the size of T is the
  same as the size of the set U)
- Each entry k of T is a pointer to an object x with key k
- If there is no object x with key k then T[k] = NULL, i.e. it is a
  null pointer
T is called a direct-addressing table. It provides search time in O(1),
but the table can be huge and filled mostly with NULL entries.
Direct addressing example

Operations take O(1) time

SEARCH(T, k):   return T[k]
INSERT(T, x):   T[key[x]] = x
DELETE(T, x):   T[key[x]] = NIL
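
The three operations can be sketched in Python as follows. This is a minimal illustration, not part of the original notes: the class name DirectAddressTable and the sample record are hypothetical, and None plays the role of NIL.

```python
class DirectAddressTable:
    """Direct-addressing table over the key universe {0, ..., n - 1}."""

    def __init__(self, n):
        # One slot per possible key; None plays the role of NIL.
        self.slots = [None] * n

    def search(self, key):
        # SEARCH(T, k): a single array access, O(1).
        return self.slots[key]

    def insert(self, key, record):
        # INSERT(T, x): store the record at the index given by its key.
        self.slots[key] = record

    def delete(self, key):
        # DELETE(T, x): reset the slot to NIL.
        self.slots[key] = None


table = DirectAddressTable(10_000)  # keys are the last 4 digits of a student id
table.insert(4321, {"name": "example student"})
print(table.search(4321))
table.delete(4321)
print(table.search(4321))
```

The table allocates all 10,000 slots up front, which is exactly the space cost discussed above.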
Hashing : maps keys to smaller ranges

Instead of storing an element with key k in index k, use a function h and store the
element in index h(k)

The "hash function" h maps the keys from the domain U = {0, 1, ..., n − 1} to the indexes of an
array T[0..m − 1], called the hash table, which has a much smaller range of m values

Must make sure that h(k) is a legal index into the table T
Issues covered in this lecture

There are many hash functions. We describe two particular methods:
- hash functions based on the multiplication method
- hash functions using the division method
As the range of the hash table is much smaller than the domain of the keys, some
keys are mapped to the same index in the hash table. This is called a collision.

We study two classes of methods to handle collisions:
- chaining and
- open addressing.
Hashing keys using the division method

The hash function is the modulo operation a mod n.

The dividend a is the key to be hashed while the divisor n is the size of
the hash table. The hash value (digest) is the value returned by the
modulo function.

Writing k for the key, m for the size of the hash table and h for the
hashing function, this gives h(k) = k mod m.

Example: If k = 91, m = 20, then h(91) = 91 mod 20 = 11.

It is fast: it requires just one division operation.


Disadvantage of the division method

h(k) = k mod m

Disadvantage: Have to avoid certain values of m:
- Powers of 2. If m = 2^p for integer p, then h(k) is just the p least
  significant bits of k.
- Example: m = 8 = 2^3
  - h(52) = 4: 52 is 110100 in binary and its 3 low bits are 100
  - h(37) = 5: 37 is 100101 in binary and its 3 low bits are 101
- The implication is that all keys that end, for example, with 100
  map to the same index (the computation of h(k) depends only on the p
  least significant bits of the key).
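
A short Python check of this behavior, as an illustrative sketch. The function name division_hash and the extra sample keys are assumptions made for the example.

```python
def division_hash(key, table_size):
    """Division method: h(k) = k mod m."""
    return key % table_size


p = 3
m = 2 ** p             # table size 8, a power of 2
low_bits_mask = m - 1  # binary 111

for key in (52, 37, 45, 29, 13):
    # For m = 2^p the remainder equals the p least significant bits of the key.
    assert division_hash(key, m) == key & low_bits_mask
    print(key, format(key, "06b"), division_hash(key, m))

# Keys whose binary form ends in 100 all land on index 4.
print([division_hash(k, m) for k in (4, 12, 20, 52, 100)])  # [4, 4, 4, 4, 4]
```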
Explanation
Note: The modulo function is the remainder of a division. The remainder can
be computed by repeatedly subtracting the divisor until the remainder is
smaller than the divisor.

The binary representation of a power of 2 is a 1 followed by zeros. Therefore
subtracting a power of 2 from a binary number only changes the bits on the
left of the number, as shown in the table below (divisor 2^3 = 1000 in binary):

decimal   binary   subtract   decimal remainder   binary remainder
45        101101   -1000      37                  100101
37        100101   -1000      29                  011101
29        011101   -1000      21                  010101
21        010101   -1000      13                  001101
13        001101   -1000      5                   000101

All the keys whose binary representation ends with 101 index the
same entry in the table (all the keys in the above example index the same
entry of the hash table).
Solution to hash table size

- Solution: pick the table size m to be a prime number not too close to a
  power of 2 (or 10). Example: m = 5
- h(52) = 2: key 110100 in binary gives index 010
- h(36) = 1: key 100100 in binary gives index 001
(the computation of h(k) depends on all the bits of the key).
Table size is a prime number

Here the table size is 11, which is a prime number. The binary representation
of 11 is 1011.
As we can see, each subtraction changes the 4 rightmost bits, which define the
index in the hash table.

decimal   binary   subtract   decimal remainder   binary remainder
45        101101   -1011      34                  100010
34        100010   -1011      23                  010111
23        010111   -1011      12                  001100
12        001100   -1011      1                   000001

The further away the divisor is from a power of two, the more evenly its
binary representation mixes 0s and 1s, and the more bits of the remainder
are affected by each subtraction.
Division method applied on strings of characters

Most hashing functions assume that the key is a natural number. Here
we show how a hashing function like the division method can be
applied to keys that are strings of characters.
Recall: A number in base 10 is
6789 = (6 × 10^3) + (7 × 10^2) + (8 × 10^1) + (9 × 10^0)

The 7-bit ASCII code has 128 characters, so a string of characters can
be converted into an integer using base 128. For example, for
"CLRS" the ASCII values are: C = 67, L = 76, R = 82, S = 83

The string "CLRS" is interpreted as the integer
(67 × 128^3) + (76 × 128^2) + (82 × 128^1) + (83 × 128^0) = 141,764,947
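
The conversion can be sketched in Python with Horner's rule. The function name string_to_key is hypothetical, and the table size 11 in the last line is only an assumed example value.

```python
def string_to_key(text, radix=128):
    """Interpret a string as a number written in the given radix (7-bit ASCII by default)."""
    key = 0
    for ch in text:
        code = ord(ch)
        if code >= radix:
            raise ValueError(f"character {ch!r} does not fit in radix {radix}")
        # Horner's rule: shift the digits seen so far one position left, then add the new digit.
        key = key * radix + code
    return key


key = string_to_key("CLRS")
print(key)       # 141764947
print(key % 11)  # index of "CLRS" under the division method with m = 11
```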
Hash function: multiplication method
The multiplication method is a hashing function addressing the above
difficulties. It works as follows:
1. Choose a constant A in the range 0 < A < 1
2. Multiply key k by A
3. Extract the fractional part of kA
4. Multiply the fractional part by m
5. Take the floor of the result
h(k) = ⌊m(kA mod 1)⌋
Examples for m = 8 = 2^3 and A = 0.6180:

h(52) = ⌊8(52 × 0.6180 mod 1)⌋
      = ⌊8(32.136 mod 1)⌋
      = ⌊8(0.136)⌋
      = ⌊1.088⌋
      = 1

h(36) = ⌊8(36 × 0.6180 mod 1)⌋
      = ⌊8(22.248 mod 1)⌋
      = ⌊8(0.248)⌋
      = ⌊1.984⌋
      = 1
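
The five steps translate directly into Python. This is a floating-point sketch for illustration only; the function name multiplication_hash is an assumption.

```python
import math


def multiplication_hash(key, table_size, a):
    """Multiplication method: h(k) = floor(m * (k * A mod 1)), with 0 < A < 1."""
    fractional_part = (key * a) % 1.0           # steps 2 and 3
    return math.floor(table_size * fractional_part)  # steps 4 and 5


print(multiplication_hash(52, 8, 0.6180))  # 1
print(multiplication_hash(36, 8, 0.6180))  # 1
```

Floating-point rounding can matter when the product m(kA mod 1) lies very close to an integer, which is one reason to prefer an integer-only implementation.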
Multiplication method : exercises

Compute the hash value of the keys below using the multiplication
method :
1. key = 196, m = 8, A = 0.22
2. key = 353, m = 11, A = 0.75
Advantage, disadvantage, implementation considerations

Advantage: The value of m is not critical.

Disadvantage: Slower than the division method, but it can be adapted to
make efficient use of the computer architecture design.
Multiplication method : implementation

Assume memory words have w bits and the key k fits in a single word.

Choose A to be a fraction of the form s/2^w, where s is an integer in the
range 0 < s < 2^w.

- Multiplying s and k gives a result of 2w bits, r1 × 2^w + r0
- r1 is the high-order word of the product and r0 is the low-order word
- ⌊m(kA mod 1)⌋ is given by the p most significant bits of r0 (for m = 2^p)
Multiplication method : implementation

Example: m = 8 = 2^3 (p = 3); w = 5; k = 21; 0 < s < 2^5; s = 13 ⇒ A = 13/32 = 0.40625

Using the formula:
h(21) = ⌊8(21 × 0.40625 mod 1)⌋
      = ⌊8(8.53125 mod 1)⌋
      = ⌊8(0.53125)⌋
      = ⌊4.25⌋
      = 4

Using the word product:
ks = 21 × 13 = 273 = 8 × 2^5 + 17
r1 = 8, r0 = 17 = 10001 in binary
take the p = 3 most significant bits of r0: 100 = 4
Multiplication method : implementation

Example: m = 8 = 2^3 (p = 3); w = 6; k = 52; 0 < s < 2^6; A = 0.625 ⇒ s = 0.625 × 2^6 = 40

Using the formula:
h(52) = ⌊8(52 × 0.625 mod 1)⌋
      = ⌊8(32.5 mod 1)⌋
      = ⌊8(0.5)⌋
      = ⌊4⌋
      = 4

Using the word product:
ks = 52 × 40 = 2080 = 32 × 2^6 + 32
r1 = 32, r0 = 32 = 100000 in binary
take the p = 3 most significant bits of r0: 100 = 4
Multiplication method : implementation

Example:
m = 8 = 2^3; w = 6; k = 52; 0 < s < 2^6; s = 40; A = 0.625

We extract the p most significant bits of r0 because we have selected
the table to be of size m = 2^p; therefore we need p bits to generate a
number in the range 0..m − 1.
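
The word-level computation can be sketched with integer operations only. Python integers are unbounded, so the w-bit word is emulated here with a mask. The function name is hypothetical.

```python
def multiplication_hash_fixed_point(key, s, w, p):
    """Multiplication method with A = s / 2^w and table size m = 2^p (integer arithmetic only)."""
    product = key * s                # up to 2w bits: r1 * 2^w + r0
    r0 = product & ((1 << w) - 1)    # low-order w-bit word of the product
    return r0 >> (w - p)             # the p most significant bits of r0


print(multiplication_hash_fixed_point(21, 13, 5, 3))  # 4
print(multiplication_hash_fixed_point(52, 40, 6, 3))  # 4
```

Both calls reproduce the two worked examples, using a multiplication, a mask and a shift instead of floating-point arithmetic.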
Collisions

Collisions occur when two keys hash to the same index in the hash table.

Collisions certainly happen when the number of keys to store is larger
than the size m of the hash table T.
- If the number of keys to store is < m, collisions may or may not
  happen (i.e. collisions can occur even though the table is not full)
Indexing in hash tables always needs to handle collisions. Two methods
are commonly used: chaining and open addressing.
Chaining

Chaining puts elements that hash to the same index in a linked list:

SEARCH(T, k)
    search for an element with key k in list T[h(k)]. Time is proportional
    to the length of the list at index h(k)
INSERT(T, x)
    insert x at the head of the list T[h(key[x])]. Worst case O(1)
DELETE(T, x)
    delete x from the list T[h(key[x])]. If the lists are singly linked, deletion
    takes as long as searching; it is O(1) if they are doubly linked
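
A minimal Python sketch of chaining follows. It is an illustration under stated assumptions: the class name ChainedHashTable is hypothetical, h(k) = k mod m is assumed as the hash function, and collections.deque stands in for the linked list so that insertion at the head stays O(1). Deletion here is by key, so it costs as much as a search.

```python
from collections import deque


class ChainedHashTable:
    """Hash table with chaining; each slot holds a chain of (key, value) pairs."""

    def __init__(self, size):
        self.size = size
        self.chains = [deque() for _ in range(size)]

    def _index(self, key):
        return key % self.size  # division method (assumed)

    def insert(self, key, value):
        # Insert at the head of the chain, O(1).
        self.chains[self._index(key)].appendleft((key, value))

    def search(self, key):
        # Time proportional to the length of the chain at this index.
        for stored_key, value in self.chains[self._index(key)]:
            if stored_key == key:
                return value
        return None

    def delete(self, key):
        chain = self.chains[self._index(key)]
        for item in chain:
            if item[0] == key:
                break
        else:
            return False  # key not present
        chain.remove(item)
        return True


table = ChainedHashTable(7)
for key, value in ((14, "a"), (21, "b"), (8, "c")):
    table.insert(key, value)
print(list(table.chains[0]))  # [(21, 'b'), (14, 'a')]: keys 14 and 21 collide at index 0
print(table.search(21))       # b
```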
Analysis of running time for chaining

Load factor: Given n keys and m indexes in the table, the load factor is
α = n/m = average number of keys per index

Worst case: All keys hash to the same hash table entry, creating a
chain of length n. Searching for a key takes Θ(n) plus the time to hash
the key. We don't use hashing for its worst-case behavior!
Average case analysis

A good hash function satisfies (approximately) the assumption of
simple uniform hashing:

"given a hash function h, and a hash table of size m, the probability
that two non-equal keys a and b will hash to the same slot is
P(h(a) = h(b)) = 1/m"

In other words, each key is equally likely to hash to any of the m slots
of the hash table.

Under the assumption of simple uniform hashing, the load factor α, which is
also the average chain length of a hash table of size m with n elements, is
α = n/m
Average case analysis

Assumption: Assume simple uniform hashing: each key in the table is
equally likely to be hashed to any index.
- The average cost of an unsuccessful search for a key is Θ(1 + α)
- The average cost of a successful search is 1 + α/2 = Θ(1 + α)

If the size m of the hash table is proportional to the number n of
elements in the table, say m = n/2, then n = 2m, and we have n ∈ O(m).

Consequently, α = n/m = O(m)/m = O(1). In other words, if the hash function
distribution is uniform, the average time for searching is constant
provided the number of elements in the hash table is at most a constant
multiple of the size of the table.
Open addressing

When collisions are handled through chaining, the keys that collide are
placed in a linked list at the entry where the collision occurred.

Open addressing places the keys that collide in another entry of the
hash table, searching for an empty slot using a probe sequence.

The hash function is modified to include a probe number i: h(k, i).
The value of i increases by 1 from 0 to m − 1.
Open addressing : Insertion

HASH-INSERT(T, k)
    i = 0
    repeat
        j = h(k, i)
        if T[j] == NIL
            T[j] = k
            return j
        else i = i + 1    // probing
    until i == m
    error "hash table overflow"
Open addressing : Searching

HASH-SEARCH(T, k)
    i = 0
    repeat
        j = h(k, i)
        if T[j] == k
            return j
        else i = i + 1
    until T[j] == NIL or i == m
    return NIL
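
The two procedures can be written in Python as follows. This is a sketch, not the lecturer's code: the probe function is passed as a parameter, and the linear probe and the keys used in the demonstration are assumptions chosen so that all keys collide.

```python
def hash_insert(table, key, probe):
    """HASH-INSERT: follow the probe sequence until an empty slot (None) is found."""
    m = len(table)
    for i in range(m):
        j = probe(key, i, m)
        if table[j] is None:
            table[j] = key
            return j
    raise OverflowError("hash table overflow")


def hash_search(table, key, probe):
    """HASH-SEARCH: follow the same probe sequence; stop at the key or at an empty slot."""
    m = len(table)
    for i in range(m):
        j = probe(key, i, m)
        if table[j] == key:
            return j
        if table[j] is None:
            return None
    return None


def linear_probe(key, i, m):
    # Assumed probe function for the demonstration.
    return (key % m + i) % m


table = [None] * 7
for key in (10, 17, 24):  # all three keys hash to index 3
    print(key, "->", hash_insert(table, key, linear_probe))
print(hash_search(table, 24, linear_probe))  # 5
print(hash_search(table, 31, linear_probe))  # None
```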
Computing probe sequences

Probe sequences determine how empty slots in the hash table are
searched. We describe three probe sequence techniques :
- Linear probing
- Quadratic probing
- Double hashing
Linear probing

The probe sequence h(k, i) is constructed from an ordinary hash
function h(k).

Linear probing: h(k, i) = (h(k) + i) mod m = ((k mod m) + i) mod m

for i = 0, 1, ..., m − 1, thus h(k, 0) is the initial hashing. If
T[h(k, 0)] ≠ NIL, then there is a collision.

Linear probing resolves collisions by looking at the next entry in the
table.

Assume the state of the hash table is as below and h(k) = k mod 11:

index   0    1    2    3    4    5    6    7    8    9    10
key     44   56   NIL  NIL  81   NIL  39   29   52   NIL  21
Linear probing

A call to h(28, 0) = 6 yields a collision. Linear probing examines entries
7, 8 and finally 9.

h(28, 1) = 7: occupied by 29
h(28, 2) = 8: occupied by 52
h(28, 3) = 9: NIL, so this is the free slot

index   0    1    2    3    4    5    6    7    8    9    10
key     44   56   NIL  NIL  81   NIL  39   29   52   NIL  21
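
The probes for key 28 can be replayed in Python on the table above. This is a small sketch in which None stands for NIL and the variable names are illustrative.

```python
NIL = None
table = [44, 56, NIL, NIL, 81, NIL, 39, 29, 52, NIL, 21]
m = len(table)


def linear_probe(key, i):
    """h(k, i) = ((k mod m) + i) mod m."""
    return (key % m + i) % m


probes = []
for i in range(m):
    j = linear_probe(28, i)
    probes.append(j)
    if table[j] is NIL:
        table[j] = 28  # first free slot on the probe sequence
        break

print(probes)    # [6, 7, 8, 9]
print(table[9])  # 28
```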

Linear probing : exercise

Assume the hash table has size m = 7 and h(k) = k mod 7. Insert
keys 701, 145, 217, 19, 13, 749 in the table, using linear probing
Linear probing : primary clustering

A cluster is a collection of consecutive occupied slots

A cluster that covers the hash of a key is called the primary cluster of
the key

Linear probing can create large primary clusters that will increase the
running time of search/insert/delete

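Example: hash table has size m = 7 and h(k) = k mod 7. Insert keys
14, 15, 1, 35, 29

index   0    1    2    3    4    5    6
key     14   15   1    35   29   NIL  NIL

Keys 1, 35 and 29 each collide and are pushed to the end of the same cluster,
which grows to cover indexes 0 to 4.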
Linear probing : clustering
As the hash table starts to fill, the number of clusters increases; then
clusters merge and become larger.

[Figure: linear probing insertions showing no collision, a collision in a small cluster, and a collision in a large cluster. Source: R. Sedgewick]
Quadratic probing

Linear probing suffers from primary clustering: long runs of occupied
slots build up, because an empty slot that follows i full slots is the
next one filled with probability (i + 1)/m.

Quadratic probing jumps around in the table according to a quadratic
function of the probe number: h(k, i) = (h(k) + c1 i + c2 i^2) mod m,
where c1, c2 ≠ 0 are constants and i = 0, 1, ..., m − 1. Thus, the
initial position probed is T[h(k)].
Quadratic probing
Assuming c1 = c2 = 1, the probe sequence is
h(k, i) = (h(k) + i + i^2) mod m. The state of the table, with h(k) = k mod 11, is:

index   0    1    2    3    4    5    6    7    8    9    10
key     44   NIL  57   NIL  81   NIL  39   29   52   NIL  21

h(28, 0) = ((28 mod 11) + 0 + 0) mod 11 = 6: collision (occupied by 39)
h(28, 1) = ((28 mod 11) + 1 + 1) mod 11 = 8: collision (occupied by 52)
h(28, 2) = ((28 mod 11) + 2 + 4) mod 11 = 1: NIL, so this is the free slot
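
The same probes can be checked with a few lines of Python. This is a sketch assuming c1 = c2 = 1 as in the example, with None standing for NIL.

```python
NIL = None
table = [44, NIL, 57, NIL, 81, NIL, 39, 29, 52, NIL, 21]
m = len(table)


def quadratic_probe(key, i, c1=1, c2=1):
    """h(k, i) = ((k mod m) + c1*i + c2*i^2) mod m."""
    return (key % m + c1 * i + c2 * i * i) % m


probes = []
for i in range(m):
    j = quadratic_probe(28, i)
    probes.append(j)
    if table[j] is NIL:
        table[j] = 28  # first free slot on the probe sequence
        break

print(probes)  # [6, 8, 1]
```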

Quadratic probing : exercises

Assume the hash table has size m = 7 and h(k) = k mod 7,
c1 = c2 = 1, so the quadratic probing function is

h(k, i) = ((k mod m) + c1 i + c2 i^2) mod m.

1. Insert keys 76, 40, 48, 5, 55
2. Using the same quadratic probing function and the same hash table
   (empty), insert keys 701, 145, 217, 19, 13, 749
Quadratic probing issues : space inefficiency
If the table T is less than half full and m is a prime number, then quadratic probing
will find an empty entry in the table.

This is because for i in [0, (m − 1)/2], h(k, i) never returns the same index twice,
so when the load factor α ≤ 1/2 at least one of these probes reaches an empty entry
(you can find a proof of this statement in different places; there is one in the
section "Limitations" of https://en.wikipedia.org/wiki/Quadratic_probing)

If the load factor α > 1/2, then quadratic probing may fail to find an empty entry
even though one exists.

Example: m = 7, h(k) = k mod 7, c1 = c2 = 1. Insert keys 14, 15, 35, 1, 5, 3

index   0    1    2    3    4    5    6
key     14   15   35   1    NIL  5    NIL

h(3, 0) = 3, h(3, 1) = 5, h(3, 2) = 2, h(3, 3) = 1, h(3, 4) = 2, h(3, 5) = 5,
h(3, 6) = 3

All seven probes hit occupied entries, so key 3 cannot be inserted although
entries 4 and 6 are free.

Conclusion: The size of the table must be expanded each time the load factor
exceeds 1/2, if no other technique is used.
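
The failure in this example can be reproduced in Python. This is a sketch assuming m = 7, h(k) = k mod 7 and c1 = c2 = 1; the helper names are hypothetical.

```python
m = 7
table = [None] * m


def quadratic_probe(key, i):
    """h(k, i) = ((k mod m) + i + i^2) mod m, i.e. c1 = c2 = 1."""
    return (key % m + i + i * i) % m


def insert(key):
    """Return the index used, or None if all m probes hit occupied slots."""
    for i in range(m):
        j = quadratic_probe(key, i)
        if table[j] is None:
            table[j] = key
            return j
    return None


for key in (14, 15, 35, 1, 5):
    insert(key)

print(table)                                      # [14, 15, 35, 1, None, 5, None]
print([quadratic_probe(3, i) for i in range(m)])  # [3, 5, 2, 1, 2, 5, 3]
print(insert(3))                                  # None: slots 4 and 6 are never probed
```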
Quadratic probing issues : secondary clustering

In quadratic probing, clusters are formed along the path of probing,
instead of around the initial hash value of a key.

These clusters are called secondary clusters.

Secondary clusters are formed as a result of all keys using the same
probing pattern. If two keys have the same initial hash value, their
probe sequences are going to be the same.

But this is a milder form of clustering compared to primary clustering.


Double hashing

Use two auxiliary hash functions, h1 and h2.

h1 gives the initial probe, and h2 gives the remaining probes:

h(k, i) = (h1(k) + i h2(k)) mod m

Issue: The hash value h2(k) must be relatively prime to m (i.e. no factors in
common other than 1). To satisfy this requirement we
1. could choose m to be a power of 2 and h2 to always produce an odd number > 1
   - The factors of m = 2^x that are greater than 1 and smaller than m are
     2^(x−y), y = 1, 2, ..., x − 1. For example, the factors of m = 2^5 are
     2^4, 2^3, 2^2 and 2^1
   - If h2(k) is an odd number, it is not divisible by 2, so it has no factor
     in common with m = 2^x
2. or choose m prime and h2 to always produce a positive integer less than m
   - If 0 < h2(k) < m, since m is prime, none of the values greater than 1 and
     smaller than m can be a factor of m
Handling collisions: Double hashing

h(k, i) = (h1(k) + i h2(k)) mod m

Example: m = 13, k = 14, h1(k) = k mod 13, h2(k) = (k mod 11) + 1

For i = 0:
h(14, 0) = (h1(14) + 0 × h2(14)) mod 13
         = ((14 mod 13) + 0 × ((14 mod 11) + 1)) mod 13
         = (1 + 0 × (3 + 1)) mod 13
         = 1

For i = 1:
h(14, 1) = (h1(14) + 1 × h2(14)) mod 13
         = ((14 mod 13) + 1 × ((14 mod 11) + 1)) mod 13
         = (1 + 1 × (3 + 1)) mod 13
         = 5

For i = 2:
h(14, 2) = (h1(14) + 2 × h2(14)) mod 13
         = ((14 mod 13) + 2 × ((14 mod 11) + 1)) mod 13
         = (1 + 2 × (3 + 1)) mod 13
         = 9
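
A short Python sketch of this probe sequence, using the same h1, h2 and m as the example. The function names are illustrative.

```python
M = 13


def h1(key):
    return key % 13


def h2(key):
    return (key % 11) + 1  # always in 1..11, hence relatively prime to the prime 13


def double_hash(key, i, m=M):
    """h(k, i) = (h1(k) + i * h2(k)) mod m."""
    return (h1(key) + i * h2(key)) % m


sequence = [double_hash(14, i) for i in range(M)]
print(sequence[:3])                        # [1, 5, 9]
print(sorted(sequence) == list(range(M)))  # True: every slot is visited exactly once
```

Because h2(14) = 4 shares no factor with m = 13, the probe sequence is a permutation of all 13 indexes, which is the reason for the relative primality requirement.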
Double hashing : exercises

Assume the hash table has size m = 7, h1(k) = k mod 7 and h2(k) = 5 − (k mod 5),
thus h(k, i) = (h1(k) + i h2(k)) mod 7.
1. Insert the following sequence of keys : 76, 93, 40, 47, 10, 55
2. Insert the following sequence of keys : 701, 145, 217, 19, 13, 749
Recalibrating the size of the table

If the table is half full, probing costs 1/(1 − 0.5) = 2 probes on average. If the
table is 90 percent full, the cost is 1/(1 − 0.9) = 10 probes.

When the table gets too full, inserting and searching for a key become
too costly. We should then consider increasing the size of the table.

Each time we increase the size of the table we must re-hash all the
keys, as the modulus m of the larger table is not the same as the
modulus of the original table.
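
Both points can be sketched in Python. The assumptions here are linear probing with h(k) = k mod m, None for empty slots, and an arbitrarily chosen new size of 17; the function names are hypothetical.

```python
def expected_probes(load_factor):
    """Average probing cost 1 / (1 - alpha) quoted above."""
    return 1.0 / (1.0 - load_factor)


def rehash(old_table, new_size):
    """Re-insert every key into a larger table using the new modulus."""
    new_table = [None] * new_size
    for key in old_table:
        if key is None:
            continue
        for i in range(new_size):
            j = (key % new_size + i) % new_size  # linear probing with the new m
            if new_table[j] is None:
                new_table[j] = key
                break
    return new_table


print(expected_probes(0.5))            # 2.0
print(round(expected_probes(0.9), 2))  # 10.0

old_table = [14, 15, 1, 35, 29, None, None]  # m = 7, one cluster over indexes 0 to 4
new_table = rehash(old_table, 17)
print({index: key for index, key in enumerate(new_table) if key is not None})
```

After rehashing, the keys sit at indexes computed with the new modulus, so copying the old array into the larger one would not be enough.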
Open addressing : Deleting

Use a special value DELETED instead of NIL when marking an index
as empty during deletion. To see why, consider the following scenario:
- Suppose we want to delete key k at index j.
- And suppose that sometime after inserting key k, we inserted key
  k′, and during this insertion we probed index j (which contained key
  k).
- And suppose we then deleted key k by storing NIL into index j.
- And then we search for key k′.
- During the search, we would probe index j before probing the index into
  which key k′ was eventually stored.
- Thus, the search would be unsuccessful, even though key k′ is in the
  table.
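
The scenario can be sketched in Python with linear probing. The class name LinearProbingTable, the DELETED sentinel object and the keys 10 and 17 are assumptions made for the illustration.

```python
DELETED = object()  # sentinel distinct from None (NIL) and from every key


class LinearProbingTable:
    def __init__(self, size):
        self.slots = [None] * size

    def _probe(self, key, i):
        return (key % len(self.slots) + i) % len(self.slots)

    def insert(self, key):
        for i in range(len(self.slots)):
            j = self._probe(key, i)
            # A DELETED slot can be reused by a later insertion.
            if self.slots[j] is None or self.slots[j] is DELETED:
                self.slots[j] = key
                return j
        raise OverflowError("hash table overflow")

    def search(self, key):
        for i in range(len(self.slots)):
            j = self._probe(key, i)
            if self.slots[j] is None:
                return None  # a truly empty slot ends the search
            if self.slots[j] == key:
                return j     # DELETED never equals a key, so the search continues past it
        return None

    def delete(self, key):
        j = self.search(key)
        if j is not None:
            self.slots[j] = DELETED  # mark the slot instead of storing NIL
        return j


table = LinearProbingTable(7)
table.insert(10)         # k: stored at index 3
table.insert(17)         # k': collides at index 3, stored at index 4
table.delete(10)
print(table.search(17))  # 4: still found, because index 3 holds DELETED rather than NIL
```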
Exercises

1. Given the values 2341, 4234, 2839, 430, 22, 397, 3920, a hash table of size 7,
and hash function h(x) = x mod 7, show the resulting tables after inserting
the values in the above order with each of these collision strategies :
1.1 Chaining
1.2 Linear probing
1.3 Quadratic probing where c1 = 1 and c2 = 1
1.4 Double hashing with second hash function h2 (x) = (2x − 1) mod 7
2. Suppose you have a hash table of size m = 9, use the division method as hashing
function h(x) = x mod 9, and use chaining to handle collisions. The following
keys are inserted: 5, 28, 19, 15, 20, 33, 12, 17, 10. In which entries of the
table do collisions occur?
Exercises

3. Now suppose you use the same hashing function as above with linear probing
to handle collisions, and the same keys as above are inserted. More collisions
occur than in the previous question. Where do the collisions occur and where
do the keys end up ?
Exercises

4. Fill a hash table when inserting items with the keys D E M O C R A T in that
order into an initially empty table of size m = 5, using chaining to handle
collisions. Use the hash function 11k mod m to transform the kth letter of
the alphabet into a table index, e.g., hash(I) = hash(9) = 99 mod 5 = 4
5. Fill a hash table when inserting items with the keys R E P U B L I C A N in
that order into an initially empty table of size m = 16 using linear probing.
Use the hash function 11k mod m to transform the kth letter of the alphabet
into a table index.
Exercises

6. Suppose you use one of the open addressing techniques to handle collisions
and you have inserted so many keys/values into your hash table that all
entries are taken. Then collisions occur every time. What can you do?
7. Suppose a hash table with capacity m = 31 gets to be over .75 full. We decide
to rehash. What is a good size choice for the new table to reduce the load
factor below .5 and also avoid collisions ?
Hash functions as cryptographic functions

Hash functions used in cryptography have two interesting properties:
1. Even if we know the output (hash, digest) we cannot feasibly recover the
   input (the key). Such functions are non-invertible: given f(x) we
   cannot find x, i.e. we cannot compute the inverse f^(−1)
2. The output, the digest, always has the same length no matter
   what the length of the input (key) is

If furthermore a hashing function is collision resistant, i.e. it is
infeasible in practice to find two inputs with the same digest, then the
hash function can be used for cryptographic purposes, for example to
detect whether information has been altered.
Cryptographic hash functions

Cryptographic hash functions are hash functions designed specifically
for security applications such as integrity checks and digital signatures.
Strictly speaking they do not encrypt data, since a digest cannot be
decrypted back into the input.

Among them there are two specific families of cryptographic hash
functions known as Secure Hash Algorithms (SHA): SHA-2 and
SHA-3.

For example, SHA-2 is a family of hash functions that includes
SHA-224, SHA-256, SHA-384, SHA-512, SHA-512/224, and
SHA-512/256.

The last number in each name indicates the length in bits of the digest
produced by that function (for example, 224 bits for both SHA-224 and
SHA-512/224).

Note: the digest is usually represented in hexadecimal.


SHA-2 cryptographic functions

SHA-2 cryptographic functions have another property that makes them
useful to crypto-currency platforms: a single small change in the key
can make a huge difference in the digest produced.

For example, adding a period to the end of the sentence below changes
almost half (111 out of 224) of the bits in the digest:

SHA-224("The quick brown fox jumps over the lazy dog")
730e109bd7a8a32b1cb9d9a09aa2325d2430587ddbc0c38bad911525

SHA-224("The quick brown fox jumps over the lazy dog.")
619cba8e8e05826e9b8c519c0a5c68f4fb653e8a3d8aa04bb2c8cd4c
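
The two digests and the number of differing bits can be recomputed with Python's standard hashlib module. This is a small sketch; the helper name sha224_hex is hypothetical.

```python
import hashlib


def sha224_hex(text):
    """Return the SHA-224 digest of an ASCII string as a hexadecimal string."""
    return hashlib.sha224(text.encode("ascii")).hexdigest()


digest_a = sha224_hex("The quick brown fox jumps over the lazy dog")
digest_b = sha224_hex("The quick brown fox jumps over the lazy dog.")

# XOR the two digests and count the 1 bits to get the number of differing bits.
differing_bits = bin(int(digest_a, 16) ^ int(digest_b, 16)).count("1")

print(digest_a)
print(digest_b)
print(differing_bits, "of", len(digest_a) * 4, "bits differ")
```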
