Hash-Table Data Structures: Motivation
• Hash-table principles.
• Searching, insertion, deletion.
• Implementations of sets and maps using hash tables.
Motivation
• Assume we want a map from strings to ints (wordcount)
Or a set of strings
• We could use a big array, with one entry per possible word
26 entries for 1-letter words
26*26 = 676 entries for 2-letter words
26**4 = 456,976 entries for 4-letter words (OK)
26**8 = 208,827,064,576 entries for 8-letter words (not possible)
• For larger words, maybe we could just use first 4 letters?
E.g., put “aberdeen” in the array slot for “aber”
Array slot “aber” is a bucket for words that start with “aber”
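The first-4-letters idea can be sketched as follows. This is only an illustration: the class name PrefixBuckets, the method slotFor, and the choice to pad words shorter than 4 letters with 'a' are assumptions, and the sketch expects lowercase a-z input.

```java
class PrefixBuckets {
    static final int PREFIX_LEN = 4;                 // use the first 4 letters
    static final int NUM_SLOTS = 26 * 26 * 26 * 26;  // 456,976 slots

    // Map a lowercase a-z word to the array slot of its first 4 letters,
    // reading the prefix as a base-26 number.
    // Assumption: words shorter than 4 letters are padded with 'a'.
    static int slotFor(String word) {
        int slot = 0;
        for (int i = 0; i < PREFIX_LEN; i++) {
            int letter = (i < word.length()) ? word.charAt(i) - 'a' : 0;
            slot = slot * 26 + letter;
        }
        return slot;
    }
}
```

Here slotFor("aber") and slotFor("aberdeen") both give slot 797, so the two words share one bucket.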
Hash-table principles (2)
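• Illustration: hash table (array of buckets)
[Figure: entries (k1, v1), (k2, v2), …, (kn, vn) are stored in an array of m buckets, indexed 0 to m–1. Hashing translates keys to bucket indices. Where two keys are translated to the same bucket index, there is a collision.]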
Hash-table principles (4)
• The hash function must be consistent:
k1 = k2 implies hash(k1) = hash(k2).
• In general, the hash function is many-to-one.
• Therefore different keys may share the same home bucket:
k1 ≠ k2 but hash(k1) = hash(k2).
This is called a collision.
• Always prefer a hash function that makes collisions
relatively infrequent.
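A collision is easy to reproduce in Java. The sketch below assumes that a key's home bucket is its hashCode() reduced modulo the number of buckets m; the class and method names are hypothetical.

```java
class CollisionDemo {
    // Home bucket of a key in a table of m buckets.
    // Math.floorMod keeps the index in 0..m-1 even when hashCode() is negative.
    static int homeBucket(Object key, int m) {
        return Math.floorMod(key.hashCode(), m);
    }
}
```

The strings "Aa" and "BB" are different keys, yet both have hashCode() 2112, so homeBucket gives them the same bucket for every m.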
Hashing in Java (1)
• Instance method in class Object:
public int hashCode();
// Translate this object to an integer, such that x.equals(y)
// implies x.hashCode() == y.hashCode().
• Examples:
Class of k    Result of k.hashCode()
String        weighted sum of the characters of k
Integer       integer value of k
Date          (high 32 bits of k) exclusive-or (low 32 bits of k),
              where k is expressed in milliseconds since 1970-01-01
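The String and Date rows can be spelled out in code. The sketch below recomputes both values by hand, using the weight 31 from the String.hashCode() documentation; the class and method names are hypothetical.

```java
class HashCodeExamples {
    // Weighted sum of characters, as documented for String.hashCode():
    // s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1], with int overflow.
    static int stringHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    // Exclusive-or of the high and low 32 bits of a millisecond timestamp,
    // as documented for Date.hashCode().
    static int dateHash(long millis) {
        return (int) (millis ^ (millis >>> 32));
    }
}
```

For any string s, stringHash(s) should agree with s.hashCode(), and dateHash(t) with new java.util.Date(t).hashCode().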
Closed-bucket vs. open-bucket hash tables
• Closed-bucket hash table:
Each bucket may be occupied by several entries.
Buckets are completely separate.
Closed-bucket hash tables (2)
• Illustration (with no collisions): the set of entries

element  number
F        9
Ne       10
Cl       17
Ar       18
Br       35
Kr       36
I        53
Xe       54

is represented by a CBHT with buckets 0–25:

bucket   entry
0        Ar 18
1        Br 35
2        Cl 17
5        F 9
8        I 53
10       Kr 36
13       Ne 10
23       Xe 54
(all other buckets are empty)
Closed-bucket hash tables (4)
• Java class implementing CBHTs:
public class CBHT {
    // ...
}
CBHT search (1)
• CBHT search algorithm:
To find which if any node of a CBHT contains an entry whose key is
equal to target-key:
1. Set b to hash(target-key).
2. Find which if any node of the SLL of bucket b contains an entry
whose key is equal to target-key, and terminate with that node as
answer.
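The search steps above can be sketched in Java as follows. This is a minimal sketch, not the course's CBHT class: the names CbhtSearchSketch, Node, succ and prepend are hypothetical, keys are assumed to be Strings with int values, and hash is assumed to be hashCode() reduced modulo the number of buckets.

```java
class CbhtSearchSketch {
    // One node of a bucket's singly-linked list (SLL).
    static class Node {
        final String key;
        int val;
        Node succ;
        Node(String key, int val, Node succ) {
            this.key = key; this.val = val; this.succ = succ;
        }
    }

    final Node[] buckets;   // buckets[b] is the first node of bucket b's SLL

    CbhtSearchSketch(int m) { buckets = new Node[m]; }

    // Assumed hash function: hashCode() reduced to the range 0..m-1.
    int hash(String key) { return Math.floorMod(key.hashCode(), buckets.length); }

    // Demo helper only: add at the front of the home bucket, no duplicate check.
    void prepend(String key, int val) {
        int b = hash(key);
        buckets[b] = new Node(key, val, buckets[b]);
    }

    // Steps 1-2: return the node whose key equals targetKey, or null if none.
    Node search(String targetKey) {
        int b = hash(targetKey);                                  // step 1
        for (Node curr = buckets[b]; curr != null; curr = curr.succ) {
            if (targetKey.equals(curr.key)) return curr;          // step 2
        }
        return null;
    }
}
```

Only the SLL of one bucket is scanned, so the cost depends on the length of that bucket's list rather than on the total number of entries.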
CBHT insertion (1)
• CBHT insertion algorithm:
To insert the entry (key, val) into a CBHT:
1. Set b to hash(key).
2. Insert the entry (key, val) into the SLL of bucket b, replacing any
existing entry whose key is key.
3. Terminate.
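A possible Java rendering of the insertion steps is sketched below. The names CbhtInsertSketch, Node, succ and bucketSize are hypothetical, keys are assumed to be Strings with int values, and adding a new entry at the front of the SLL is one choice among several.

```java
class CbhtInsertSketch {
    // One node of a bucket's singly-linked list (SLL).
    static class Node {
        final String key;
        int val;
        Node succ;
        Node(String key, int val, Node succ) {
            this.key = key; this.val = val; this.succ = succ;
        }
    }

    final Node[] buckets;

    CbhtInsertSketch(int m) { buckets = new Node[m]; }

    // Assumed hash function: hashCode() reduced to the range 0..m-1.
    int hash(String key) { return Math.floorMod(key.hashCode(), buckets.length); }

    void insert(String key, int val) {
        int b = hash(key);                                        // step 1
        for (Node curr = buckets[b]; curr != null; curr = curr.succ) {
            if (key.equals(curr.key)) {                           // step 2: replace
                curr.val = val;
                return;
            }
        }
        buckets[b] = new Node(key, val, buckets[b]);              // step 2: add at front
    }

    // Helper for inspection: number of entries in bucket b.
    int bucketSize(int b) {
        int n = 0;
        for (Node curr = buckets[b]; curr != null; curr = curr.succ) n++;
        return n;
    }
}
```

Inserting a key that is already present changes its value but leaves the bucket's size unchanged.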
CBHT deletion (1)
• CBHT deletion algorithm:
To delete the entry (if any) whose key is equal to key from a CBHT:
1. Set b to hash(key).
2. Delete the entry (if any) whose key is equal to key from the SLL of
bucket b.
3. Terminate.
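The deletion steps could look like this in Java. As before, this is a sketch under assumptions: the names CbhtDeleteSketch, Node, succ, prepend and contains are hypothetical, and keys are Strings with int values.

```java
class CbhtDeleteSketch {
    // One node of a bucket's singly-linked list (SLL).
    static class Node {
        final String key;
        int val;
        Node succ;
        Node(String key, int val, Node succ) {
            this.key = key; this.val = val; this.succ = succ;
        }
    }

    final Node[] buckets;

    CbhtDeleteSketch(int m) { buckets = new Node[m]; }

    // Assumed hash function: hashCode() reduced to the range 0..m-1.
    int hash(String key) { return Math.floorMod(key.hashCode(), buckets.length); }

    // Demo helper only: add at the front of the home bucket, no duplicate check.
    void prepend(String key, int val) {
        int b = hash(key);
        buckets[b] = new Node(key, val, buckets[b]);
    }

    void delete(String key) {
        int b = hash(key);                                        // step 1
        Node pred = null;
        for (Node curr = buckets[b]; curr != null; pred = curr, curr = curr.succ) {
            if (key.equals(curr.key)) {                           // step 2: unlink
                if (pred == null) buckets[b] = curr.succ;
                else pred.succ = curr.succ;
                return;
            }
        }
        // Key not present: nothing to delete.
    }

    boolean contains(String key) {
        for (Node curr = buckets[hash(key)]; curr != null; curr = curr.succ) {
            if (key.equals(curr.key)) return true;
        }
        return false;
    }
}
```

The pred pointer is needed because an SLL node can only be unlinked through its predecessor, or through the bucket array when it is the first node.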
CBHTs: analysis
• Analysis of the CBHT search/insertion/deletion algorithms
(counting comparisons):
Let the number of entries be n.
• In the best case, no bucket contains more than (say) 2
entries:
Max. no. of comparisons = 2
Best-case time complexity is O(1).
• In the worst case, one bucket contains all n entries:
Max. no. of comparisons = n
Worst-case time complexity is O(n).
CBHTs: design
• CBHT design consists of:
choosing the number of buckets m
choosing the hash function hash.
• Design aims:
collisions should be infrequent
entries should be distributed evenly among the buckets, such that
few buckets contain more than about 2 entries.
CBHTs: choosing the number of buckets
• The load factor of a hash table is the average number of
entries per bucket, n/m.
• If n is (roughly) predictable, choose m such that the load
factor is likely to be between 0.5 and 0.75.
A low load factor wastes space.
A high load factor tends to cause some buckets to have many
entries.
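As a worked example of the load-factor rule, the sketch below computes n/m and the smallest m that keeps the load factor at or below a given bound. The class and method names are hypothetical, and n = 1000 is used only as an illustrative value.

```java
class LoadFactorSketch {
    // Load factor = average number of entries per bucket, n/m.
    static double loadFactor(int n, int m) {
        return (double) n / m;
    }

    // Smallest number of buckets m such that n/m <= maxLoad.
    static int minBuckets(int n, double maxLoad) {
        return (int) Math.ceil(n / maxLoad);
    }
}
```

For n = 1000, minBuckets(1000, 0.75) is 1334 and minBuckets(1000, 0.5) is 2000, so any m from 1334 to 2000 keeps the load factor between 0.5 and 0.75.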
Example: hash table for words (1)
• Suppose that a hash table will contain about 1000 common
English words.
• Known patterns in the keys:
Letters vary in frequency:
• A, E, I, N, S, T are common
• Q, X, Z are uncommon.
Word lengths vary in frequency:
• word lengths 4–8 are common
• other word lengths are less common.
Example: hash table for words (3)
• Consider m = 520, hash(w) = 26 × (length of w – 1) +
(initial letter of w – ‘A’).
– Too few buckets. Load factor = 1000/520 ≈ 1.9.
– Very uneven distribution. Since few words have length 1–2,
buckets 0–51 will be sparsely populated. Since initial letter Z is
uncommon, buckets 25, 51, 77, 103, … will be sparsely populated.
And so on.
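The hash function under discussion can be written out directly. The sketch assumes that w is non-empty, starts with an upper-case letter A-Z, and has at most 20 letters (so that the result stays below m = 520); the class name is hypothetical.

```java
class WordHashSketch {
    static final int M = 520;   // 26 initial letters x 20 word lengths

    // hash(w) = 26 * (length of w - 1) + (initial letter of w - 'A')
    // Assumption: w is non-empty, upper-case A-Z, with at most 20 letters.
    static int hash(String w) {
        return 26 * (w.length() - 1) + (w.charAt(0) - 'A');
    }
}
```

For example, hash("HASH") is 26 × 3 + 7 = 85, while every 2-letter word falls in buckets 26–51 and every word starting with Z falls in one of buckets 25, 51, 77, 103, …, which is why those buckets stay sparse.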
Implementation of maps using CBHTs
• Summary of algorithms:
Operation   Algorithm                            Time complexity
get         CBHT search                          O(1) best, O(n) worst
remove      CBHT deletion                        O(1) best, O(n) worst
put         CBHT insertion                       O(1) best, O(n) worst
putAll      merge on corresponding buckets       O(m) best, O(n1 n2) worst
            of both CBHTs
equals      equality test on corresponding       O(m) best, O(n1 n2) worst
            buckets of both CBHTs
Java classes
• HashSet for sets, HashMap for maps
They automatically add more buckets if the load factor gets too high
(typically > 0.75)
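The word-count map and the set of strings from the motivation can both be built with these classes. A minimal sketch, assuming words are separated by whitespace and compared case-insensitively; the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class WordCountSketch {
    // Map from word to number of occurrences, backed by a HashMap.
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String w : text.toLowerCase().split("\\s+")) {
            if (!w.isEmpty()) {
                counts.merge(w, 1, Integer::sum);   // insert 1, or add 1 to the old count
            }
        }
        return counts;
    }

    // Set of distinct words, backed by a HashSet.
    static Set<String> distinctWords(String text) {
        return new HashSet<>(countWords(text).keySet());
    }
}
```

Both classes resize themselves as entries are added, so no bucket count has to be chosen up front.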
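public class GUIState {
    String poundsText, eurosText;
    // Takes Object so that it overrides Object.equals.
    @Override
    public boolean equals(Object o) {
        if (!(o instanceof GUIState)) return false;
        GUIState g = (GUIState) o;
        return poundsText.equals(g.poundsText) && eurosText.equals(g.eurosText);
    }
}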
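A class that defines equals and is used as a HashSet element or HashMap key also needs a consistent hashCode, since equal objects must hash to the same bucket. The sketch below shows one way to do this; the class name GuiStateSketch and its constructor are hypothetical, both fields are assumed to be non-null, and Objects.hash is just one convenient way to combine them.

```java
import java.util.Objects;

class GuiStateSketch {
    final String poundsText, eurosText;

    GuiStateSketch(String poundsText, String eurosText) {
        this.poundsText = poundsText;
        this.eurosText = eurosText;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof GuiStateSketch)) return false;
        GuiStateSketch g = (GuiStateSketch) o;
        return poundsText.equals(g.poundsText) && eurosText.equals(g.eurosText);
    }

    // Built from the same fields as equals, so x.equals(y) implies
    // x.hashCode() == y.hashCode().
    @Override
    public int hashCode() {
        return Objects.hash(poundsText, eurosText);
    }
}
```

With both methods in place, adding two equal states to a HashSet leaves the set with a single element.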