Hash-Table Data Structures: Motivation

This document discusses hash table data structures. It begins by motivating the use of hash tables to store mappings from keys like strings to values more efficiently than using a simple array. It then provides an overview of hash table principles including hashing keys to indices in an array of buckets, handling collisions, and maintaining consistency. Examples are given of hash functions for different key types like strings. The document contrasts closed-bucket and open-bucket hash table implementations and provides code for a closed-bucket hash table class in Java.

Hash-Table Data Structures

• Hash-table principles.
• Searching, insertion, deletion.
• Implementations of sets and maps using hash tables.

• Reading: a bit in Malik (chap. 21); better is Watt and Brown,
Java Collections (several copies in QML)

Motivation
• Assume we want a map from strings to ints (e.g. a word count)
 – Or a set of strings

• We could use a big array, with one entry per possible word
 – 26 entries for 1-letter words
 – 26*26 = 676 entries for 2-letter words
 – 26**4 = 456,976 entries for 4-letter words (OK)
 – 26**8 = 208,827,064,576 entries for 8-letter words (not possible)

• This uses a function which converts words into ints
 – Known as a hash function

• For small words, this gives us very fast access/update
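
The big-array idea can be sketched as follows. This is a minimal illustration, assuming lowercase words of 1 to 4 letters over a..z; the class name `SmallWordCounts` and its methods are hypothetical, not part of the lecture code. Each word is read as a number in bijective base 26, so every word of length 1 to 4 gets its own slot.

```java
/** Direct-indexed word counts for lowercase words of 1 to 4 letters (a sketch). */
final class SmallWordCounts {
    // The largest index belongs to "zzzz": 26 + 26^2 + 26^3 + 26^4 = 475,254.
    private final int[] counts = new int[475_255];

    /** Converts a word into an int; distinct words give distinct ints. */
    static int index(String word) {
        if (word.isEmpty() || word.length() > 4) {
            throw new IllegalArgumentException("expected 1 to 4 letters: " + word);
        }
        int idx = 0;
        for (char c : word.toCharArray()) {
            if (c < 'a' || c > 'z') {
                throw new IllegalArgumentException("expected only a..z: " + word);
            }
            idx = idx * 26 + (c - 'a' + 1); // each letter is a digit in 1..26
        }
        return idx;
    }

    /** Adds one to the count of the given word: a single array update. */
    void increment(String word) {
        counts[index(word)]++;
    }

    /** Returns the count of the given word: a single array access. */
    int count(String word) {
        return counts[index(word)];
    }
}
```

Both operations touch exactly one array slot, which is where the very fast access/update comes from. The same scheme for 8-letter words would need an array of more than 200 billion entries.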


Motivation
• For larger words, maybe we could just use the first 4 letters?
 – E.g., put “aberdeen” in the array slot for “aber”
 – Array slot “aber” is a bucket for words that start with “aber”

• But what if several words have the same first 4 letters?
 – “aberdeen” and “abertay”

• Solution: put a list of <word,int> pairs in each array slot


• This is a hash table


Hash-table principles (1)


• If a map’s keys are small integers, we can represent the
map by a key-indexed array. Search, insertion, and deletion
then have time complexity O(1).
• Surprisingly, we can approach this performance with keys
of other types!
• Hashing: translate each key to a small integer, and use that
integer to index an array.
• A hash table is an array of m buckets, together with a
hash function hash(k) that translates each key k to a
bucket index (in the range 0…m–1).

Hash-table principles (2)
• Illustration: [Figure: hashing (translating keys to bucket indices)
maps the entries (k1, v1), (k2, v2), …, (kn, vn) into a hash table,
i.e. an array of buckets indexed 0 … m–1, with one collision marked.]

Hash-table principles (3)


• Each key k has a home bucket in the hash table, namely
the bucket with index hash(k).
• To insert a new entry with key k into the hash table, assign
that entry to k’s home bucket.
• To search for an entry with key k in the hash table, look in
k’s home bucket.
• To delete an entry with key k from the hash table, look in
k’s home bucket.

Hash-table principles (4)
• The hash function must be consistent:
k1 = k2 implies hash(k1) = hash(k2).
• In general, the hash function is many-to-one.
• Therefore different keys may share the same home bucket:
k1 ≠ k2 but hash(k1) = hash(k2).
This is called a collision.
• Always prefer a hash function that makes collisions
relatively infrequent.
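
Both properties can be observed directly in Java. The sketch below is only an illustration (the class name `CollisionDemo` is hypothetical); it relies on the documented definition of `String.hashCode()`, under which the distinct strings "Aa" and "BB" both hash to 2112.

```java
/** Consistency and collision, shown with String keys (a sketch). */
final class CollisionDemo {
    public static void main(String[] args) {
        // Consistency: k1 = k2 implies hash(k1) = hash(k2).
        String k1 = "hash";
        String k2 = new String("hash"); // a different object, but an equal key
        System.out.println(k1.equals(k2) && k1.hashCode() == k2.hashCode()); // true

        // Collision: k1 ≠ k2 but hash(k1) = hash(k2).
        System.out.println("Aa".equals("BB"));                  // false
        System.out.println("Aa".hashCode() == "BB".hashCode()); // true
    }
}
```

Since the two hash codes are equal, "Aa" and "BB" share a home bucket whatever the number of buckets m.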


Example: a hash function for words


• Suppose that the keys are English words.
• Possible hash function:
m = 26
hash(w) = (initial letter of w) – ‘A’
• All words with initial letter ‘A’ share bucket 0;
all words with initial letter ‘Z’ share bucket 25.
• This is a convenient choice for illustrative purposes.
• But it is a poor choice for practical purposes: collisions are
likely to be frequent in some buckets.
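
A minimal Java sketch of this hash function follows. The class name `InitialLetterHash` is hypothetical, and folding lowercase initials to uppercase is an added assumption so that "aberdeen" and "Aberdeen" land in the same bucket.

```java
/** hash(w) = (initial letter of w) - 'A', with m = 26 buckets (a sketch). */
final class InitialLetterHash {
    static final int M = 26; // number of buckets

    /** Returns the home bucket (0..25) of a word that starts with a letter A..Z. */
    static int hash(String w) {
        char initial = Character.toUpperCase(w.charAt(0));
        if (initial < 'A' || initial > 'Z') {
            throw new IllegalArgumentException("word must start with A..Z: " + w);
        }
        return initial - 'A';
    }
}
```

With this function "aberdeen" and "abertay" collide in bucket 0, as does every other word beginning with A.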
Hashing in Java (1)
• Instance method in class Object:
    public int hashCode ();
    // Translate this object to an integer, such that x.equals(y)
    // implies x.hashCode() == y.hashCode().

• Note that hashCode is consistent. We can use it to
implement a hash function for a hash table with m buckets:
    int hash (Object k) {
        return Math.abs(k.hashCode()) % m;
    }
Math.abs returns a nonnegative integer. Modulo-m arithmetic then gives
an integer in the range 0…m–1.
(One exception: Math.abs(Integer.MIN_VALUE) is itself negative, so a key
with that hash code would give a negative index.
Math.floorMod(k.hashCode(), m) avoids this.)
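
One edge case is worth seeing in running code: `Math.abs(Integer.MIN_VALUE)` overflows and stays negative. The sketch below compares the slide's formula with `Math.floorMod`; the class and method names are hypothetical, and m is passed as a parameter rather than being a field.

```java
/** Two ways to turn a hash code into a bucket index (a sketch). */
final class BucketIndex {
    /** The slide's formula: negative when hashCode() is Integer.MIN_VALUE. */
    static int hashWithAbs(Object k, int m) {
        return Math.abs(k.hashCode()) % m;
    }

    /** Alternative: floorMod returns a value in 0..m-1 for any int and m > 0. */
    static int hashWithFloorMod(Object k, int m) {
        return Math.floorMod(k.hashCode(), m);
    }

    public static void main(String[] args) {
        Object awkward = Integer.MIN_VALUE; // Integer.hashCode() is the int value itself
        System.out.println(hashWithAbs(awkward, 10));      // -8
        System.out.println(hashWithFloorMod(awkward, 10)); // 2
        System.out.println(hashWithAbs("word", 10) == hashWithFloorMod("word", 10)); // true
    }
}
```

For nonnegative hash codes the two methods agree, so the difference only matters for keys whose hash code is negative.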


Hashing in Java (2)


• Each subclass of Object should override hashCode.

• Examples:
    Class of k   Result of k.hashCode()
    String       weighted sum of characters of k
    Integer      integer value of k
    Date         (high 32 bits of k) exclusive-or (low 32 bits of k),
                 where k is expressed in milliseconds since 1970-01-01
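
The String and Date rows can be re-implemented by hand and checked against the library. This is a sketch with hypothetical names (`HashCodeSketches`, `stringHash`, `dateHash`); the multiplier 31 comes from the documented formula for `String.hashCode()`.

```java
import java.util.Date;

/** Hand-written versions of two library hashCode methods (a sketch). */
final class HashCodeSketches {
    /** Weighted sum s[0]*31^(n-1) + ... + s[n-1], with int overflow wrapping around. */
    static int stringHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i); // Horner's rule for the weighted sum
        }
        return h;
    }

    /** Exclusive-or of the high and low 32 bits of a millisecond count. */
    static int dateHash(long millis) {
        return (int) (millis >>> 32) ^ (int) millis;
    }

    public static void main(String[] args) {
        long t = 1_700_000_000_000L;
        System.out.println(stringHash("hash table") == "hash table".hashCode()); // true
        System.out.println(dateHash(t) == new Date(t).hashCode());               // true
        System.out.println(Integer.valueOf(42).hashCode());                      // 42
    }
}
```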

Closed-bucket vs. open-bucket hash tables
• Closed-bucket hash table:
 – Each bucket may be occupied by several entries.
 – Buckets are completely separate.

• Open-bucket hash table:
 – Each bucket may be occupied by at most one entry.
 – Whenever there is a collision, displace the new entry to another
bucket.
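
The slides do not say which other bucket a displaced entry goes to. The sketch below assumes the simplest policy, linear probing (try the next bucket, wrapping around at the end); the class `OpenBucketTable` and its methods are hypothetical. Deletion is left out because it needs extra care in an open-bucket table.

```java
/** Open-bucket hash table with linear probing (a sketch; no deletion). */
final class OpenBucketTable {
    private final Object[] keys;   // keys[b] == null means bucket b is empty
    private final Object[] values;

    OpenBucketTable(int m) {
        keys = new Object[m];
        values = new Object[m];
    }

    private int hash(Object key) {
        return Math.floorMod(key.hashCode(), keys.length);
    }

    /** Inserts or replaces; on a collision the entry moves to the next free bucket. */
    void insert(Object key, Object val) {
        int b = hash(key);
        for (int probes = 0; probes < keys.length; probes++) {
            if (keys[b] == null || keys[b].equals(key)) {
                keys[b] = key;
                values[b] = val;
                return;
            }
            b = (b + 1) % keys.length; // displace to the following bucket
        }
        throw new IllegalStateException("table is full");
    }

    /** Returns the index of the bucket holding key, or -1 if the key is absent. */
    int bucketOf(Object key) {
        int b = hash(key);
        for (int probes = 0; probes < keys.length; probes++) {
            if (keys[b] == null) return -1;      // an empty bucket ends the probe sequence
            if (keys[b].equals(key)) return b;
            b = (b + 1) % keys.length;
        }
        return -1;
    }

    /** Returns the value stored for key, or null if the key is absent. */
    Object search(Object key) {
        int b = bucketOf(key);
        return b < 0 ? null : values[b];
    }
}
```

With m = 7, the keys 3 and 10 share home bucket 3; inserting 3 first leaves 10 displaced to bucket 4.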


Closed-bucket hash tables (1)


• Closed-bucket hash table (CBHT):
 – Each bucket may be occupied by several entries.
 – Buckets are completely separate.

• Simplest implementation: each bucket is an SLL.


• In the following illustrations, keys are names of chemical
elements. Assume:
m = 26
hash(e) = (initial letter of e) – ‘A’

Closed-bucket hash tables (2)
• Illustration (with no collisions): the map

    element  number
    F         9
    Ne       10
    Cl       17
    Ar       18
    Br       35
    Kr       36
    I        53
    Xe       54

is represented by

    bucket  SLL of entries
    0       Ar 18
    1       Br 35
    2       Cl 17
    5       F 9
    8       I 53
    10      Kr 36
    13      Ne 10
    23      Xe 54
    (all other buckets in 0…25 are empty)

Closed-bucket hash tables (3)


• Illustration (with collisions): the map

    element  number
    H         1
    He        2
    Li        3
    Be        4
    Na       11
    Mg       12
    K        19
    Ca       20
    Rb       37
    Sr       38
    Cs       55
    Ba       56

is represented by

    bucket  SLL of entries
    1       Ba 56 → Be 4
    2       Cs 55 → Ca 20
    7       He 2 → H 1
    10      K 19
    11      Li 3
    12      Mg 12
    13      Na 11
    17      Rb 37
    18      Sr 38
    (all other buckets in 0…25 are empty)
Closed-bucket hash tables (4)
• Java class implementing CBHTs:
public class CBHT {

    private BucketNode[] buckets;

    public CBHT (int m) {
        buckets = new BucketNode[m];
    }

    … // CBHT methods (see below)

    private int hash (Object key) {
        return Math.abs(key.hashCode())
                % buckets.length;
    }

Closed-bucket hash tables (5)


• Java class (continued):
    //////// Inner class for CBHT nodes ////////

    private static class BucketNode {

        private Object key, value;
        private BucketNode succ;

        private BucketNode (Object key,
                Object val, BucketNode succ) {
            this.key = key;
            this.value = val;
            this.succ = succ;
        }
    }
}
CBHT search (1)
• CBHT search algorithm:
To find which if any node of a CBHT contains an entry whose key is
equal to target-key:
1. Set b to hash(target-key).
2. Find which if any node of the SLL of bucket b contains an entry
whose key is equal to target-key, and terminate with that node as
answer.


CBHT search (2)


• Implementation (in class CBHT):
    public BucketNode search (Object targetKey) {
        int b = hash(targetKey);
        for (BucketNode curr = buckets[b];
                curr != null; curr = curr.succ) {
            if (targetKey.equals(curr.key))
                return curr;
        }
        return null;
    }

CBHT insertion (1)
• CBHT insertion algorithm:
To insert the entry (key, val) into a CBHT:
1. Set b to hash(key).
2. Insert the entry (key, val) into the SLL of bucket b, replacing any
existing entry whose key is key.
3. Terminate.


CBHT insertion (2)


• Implementation (in class CBHT):
    public void insert (Object key,
            Object val) {
        int b = hash(key);
        for (BucketNode curr = buckets[b];
                curr != null; curr = curr.succ) {
            if (key.equals(curr.key)) {
                curr.value = val; return;
            }
        }
        buckets[b] =
            new BucketNode(key, val, buckets[b]);
    }

CBHT deletion (1)
• CBHT deletion algorithm:
To delete the entry (if any) whose key is equal to key from a CBHT:
1. Set b to hash(key).
2. Delete the entry (if any) whose key is equal to key from the SLL of
bucket b.
3. Terminate.


CBHT deletion (2)


• Implementation (in class CBHT):
    public void delete (Object key) {
        int b = hash(key);
        for (BucketNode pred = null,
                curr = buckets[b];
                curr != null;
                pred = curr, curr = curr.succ) {
            if (key.equals(curr.key)) {
                if (pred == null)
                    buckets[b] = curr.succ;
                else pred.succ = curr.succ;
                return;
            }
        }
    }
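
To try the three operations together, the sketch below assembles them into one small self-contained class. It is an illustration, not the lecture's CBHT class: the name `ElementTable`, the String/int entry types, and the helper `bucketKeys` are assumptions, and the hash function is the chemical-element one used in the earlier illustrations (m = 26, initial letter minus 'A').

```java
/** Closed-bucket hash table from element symbols to atomic numbers (a sketch). */
final class ElementTable {
    private static final class Node {
        final String key;
        int value;
        Node succ;
        Node(String key, int value, Node succ) {
            this.key = key; this.value = value; this.succ = succ;
        }
    }

    private final Node[] buckets = new Node[26]; // m = 26

    private static int hash(String e) {
        return e.charAt(0) - 'A'; // assumes a symbol starting with A..Z
    }

    /** Returns the number stored for key, or null if there is no such entry. */
    Integer search(String key) {
        for (Node curr = buckets[hash(key)]; curr != null; curr = curr.succ) {
            if (key.equals(curr.key)) return curr.value;
        }
        return null;
    }

    /** Replaces an existing entry, or adds a new node at the front of the SLL. */
    void insert(String key, int val) {
        int b = hash(key);
        for (Node curr = buckets[b]; curr != null; curr = curr.succ) {
            if (key.equals(curr.key)) { curr.value = val; return; }
        }
        buckets[b] = new Node(key, val, buckets[b]);
    }

    /** Unlinks the node with the given key, if there is one. */
    void delete(String key) {
        int b = hash(key);
        for (Node pred = null, curr = buckets[b]; curr != null;
                pred = curr, curr = curr.succ) {
            if (key.equals(curr.key)) {
                if (pred == null) buckets[b] = curr.succ;
                else pred.succ = curr.succ;
                return;
            }
        }
    }

    /** Keys of bucket b in SLL order, separated by spaces (for inspection only). */
    String bucketKeys(int b) {
        StringBuilder sb = new StringBuilder();
        for (Node curr = buckets[b]; curr != null; curr = curr.succ) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(curr.key);
        }
        return sb.toString();
    }
}
```

Inserting Be and then Ba leaves bucket 1 holding "Ba Be", because new nodes go at the front of the list; this matches the order shown in the collision illustration.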

CBHTs: analysis
• Analysis of the CBHT search/insertion/deletion algorithms
(counting comparisons):
Let the number of entries be n.
• In the best case, no bucket contains more than (say) 2
entries:
Max. no. of comparisons = 2
Best-case time complexity is O(1).
• In the worst case, one bucket contains all n entries:
Max. no. of comparisons = n
Worst-case time complexity is O(n).
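
The two cases can be made concrete by measuring the fullest bucket for a given set of keys, since that bounds the number of comparisons in a search. This is a sketch with hypothetical names (`BucketOccupancy`, `longestBucket`), using `hashCode()` modulo m as the hash function.

```java
import java.util.List;

/** Measures how unevenly a set of keys fills m buckets (a sketch). */
final class BucketOccupancy {
    /** Returns the number of entries in the fullest bucket. */
    static int longestBucket(List<?> keys, int m) {
        int[] counts = new int[m];
        int longest = 0;
        for (Object k : keys) {
            int b = Math.floorMod(k.hashCode(), m);
            counts[b]++;
            longest = Math.max(longest, counts[b]);
        }
        return longest;
    }
}
```

With m = 10, the Integer keys 0, 1, …, 9 give a longest bucket of 1 (best case), whereas the keys 0, 10, …, 90 all fall into bucket 0 and give a longest bucket of 10 = n (worst case).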

CBHTs: design
• CBHT design consists of:
 – choosing the number of buckets m
 – choosing the hash function hash.

• Design aims:
 – collisions should be infrequent
 – entries should be distributed evenly among the buckets, such that
few buckets contain more than about 2 entries.

CBHTs: choosing the number of buckets
• The load factor of a hash table is the average number of
entries per bucket, n/m.
• If n is (roughly) predictable, choose m such that the load
factor is likely to be between 0.5 and 0.75.
 – A low load factor wastes space.
 – A high load factor tends to cause some buckets to have many
entries.

• Choose m to be a prime number.
 – Typically the hash function performs modulo-m arithmetic. If m is
prime, the entries are more likely to be distributed evenly over the
buckets, regardless of any pattern in the keys.
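
Both pieces of advice can be sketched in a few lines. The names below (`BucketCount`, `nextPrime`, `loadFactor`, `distinctBuckets`) are hypothetical helpers; the prime search is plain trial division, which is adequate for table sizes of this order.

```java
/** Helpers for choosing the number of buckets m (a sketch). */
final class BucketCount {
    static boolean isPrime(int n) {
        if (n < 2) return false;
        for (int d = 2; (long) d * d <= n; d++) {
            if (n % d == 0) return false;
        }
        return true;
    }

    /** Smallest prime that is at least n. */
    static int nextPrime(int n) {
        int p = Math.max(n, 2);
        while (!isPrime(p)) p++;
        return p;
    }

    /** Average number of entries per bucket, n/m. */
    static double loadFactor(int n, int m) {
        return (double) n / m;
    }

    /** Number of different buckets hit by the given int keys, using key modulo m. */
    static int distinctBuckets(int[] keys, int m) {
        boolean[] used = new boolean[m];
        int distinct = 0;
        for (int k : keys) {
            int b = Math.floorMod(k, m);
            if (!used[b]) { used[b] = true; distinct++; }
        }
        return distinct;
    }
}
```

The effect of a prime m shows up with patterned keys: the ten keys 0, 10, 20, …, 90 all share one bucket when m = 10, but occupy ten different buckets when m = 11.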

CBHTs: choosing the hash function


• The hash function should be efficient (performing few
arithmetic operations).
• The hash function should distribute the entries evenly
among the buckets, regardless of any patterns in the keys.
• Possible trade-off:
 – Speed up the hash function by using only part of the key.
 – But beware of any patterns in that part of the key.

Example: hash table for words (1)
• Suppose that a hash table will contain about 1000 common
English words.
• Known patterns in the keys:
 – Letters vary in frequency:
   • A, E, I, N, S, T are common
   • Q, X, Z are uncommon.
 – Word lengths vary in frequency:
   • word lengths 4–8 are common
   • other word lengths are less common.


Example: hash table for words (2)


• hash(w) can depend on any of w’s letters and/or length.
• Consider m = 20, hash(w) = length of w – 1.
– Far too few buckets. Load factor = 1000/20 = 50.
– Very uneven distribution.

• Consider m = 26, hash(w) = initial letter of w – ‘A’.


– Far too few buckets.
– Very uneven distribution.

Example: hash table for words (3)
• Consider m = 520, hash(w) = 26 × (length of w – 1) +
(initial letter of w – ‘A’).
– Too few buckets. Load factor = 1000/520 ≈ 1.9.
– Very uneven distribution. Since few words have length 1–2,
buckets 0–51 will be sparsely populated. Since initial letter Z is
uncommon, buckets 25, 51, 77, 103, … will be sparsely populated.
And so on.

• Consider m = 1499, hash(w) = (weighted sum of letters of w) modulo m,
i.e., (c1 × 1st letter of w + c2 × 2nd letter of w + …) modulo m.
+ Good number of buckets. Load factor ≈ 0.67.
+ Reasonably even distribution.
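
The last two candidates can be written out as follows. This is a sketch with hypothetical names (`WordHashes`, `lengthAndInitial`, `weightedSum`). The slides leave the weights c1, c2, … unspecified; powers of 31 are assumed here, as in `String.hashCode()`, and letters are taken as their character codes.

```java
/** Two of the candidate hash functions for English words (a sketch). */
final class WordHashes {
    /** m = 520: 26 × (length of w − 1) + (initial letter of w − 'A'); assumes 1 to 20 letters. */
    static int lengthAndInitial(String w) {
        return 26 * (w.length() - 1) + (Character.toUpperCase(w.charAt(0)) - 'A');
    }

    /** m = 1499: weighted sum of the letters of w, modulo m. */
    static int weightedSum(String w) {
        final int m = 1499;
        int h = 0;
        for (int i = 0; i < w.length(); i++) {
            // Reducing modulo m at every step gives the same result as reducing
            // the whole weighted sum at the end, and h can never overflow.
            h = (31 * h + w.charAt(i)) % m;
        }
        return h;
    }
}
```

For example, `lengthAndInitial("zoo")` is 26 × 2 + 25 = 77, one of the sparsely populated Z buckets listed above.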

Implementation of sets using CBHTs


• Similar to implementation of maps (with set members
instead of map entries).
• Summary of algorithms:
    Operation   Algorithm        Time complexity
    contains    CBHT search      O(1) best, O(n) worst
    add         CBHT insertion   O(1) best, O(n) worst
    remove      CBHT deletion    O(1) best, O(n) worst

Implementation of maps using CBHTs
• Summary of algorithms:
    Operation   Algorithm                             Time complexity
    get         CBHT search                           O(1) best, O(n) worst
    remove      CBHT deletion                         O(1) best, O(n) worst
    put         CBHT insertion                        O(1) best, O(n) worst
    putAll      merge on corresponding buckets        O(m) best, O(n1 n2) worst
                of both CBHTs
    equals      equality test on corresponding        O(m) best, O(n1 n2) worst
                buckets of both CBHTs


Java classes
• HashSet for sets, HashMap for maps
 – Automatically add more buckets if the load factor gets too high
(typically > 0.75)
• Probably best all-round implementations for sets, maps


• Caveat: If you define an equals() (or compareTo()) function
for set members (map keys), you should also define
hashCode()
 – hashCode() of two equal objects must be the same
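
As a usage sketch of HashMap, here is the word count from the motivation slides. The class name `WordCount` and the choice to split on whitespace and ignore case are assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

/** Word count: a map from strings to ints, backed by HashMap (a sketch). */
final class WordCount {
    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : text.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum); // put 1, or add 1 to the old count
            }
        }
        return counts;
    }
}
```

The String keys already have suitable equals() and hashCode() methods, so nothing further needs to be defined here.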

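public class GUIState {
    String poundsText, eurosText;

    // Must override equals(Object): sets and maps never call an equals(GUIState) overload
    public boolean equals(Object o) {
        if (!(o instanceof GUIState)) return false;
        GUIState g = (GUIState) o;
        return poundsText.equals(g.poundsText) &&
               eurosText.equals(g.eurosText);
    }

    // need something like below if want set or map of GUIState
    public int hashCode() {
        String state = poundsText + eurosText;
        return state.hashCode();
    }
}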
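
The effect of the caveat can be checked with a small stand-in class. The name `CurrencyFields` is hypothetical; it mirrors the two text fields above, and uses `Objects.hash` as one common way to combine several fields into a hash code.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

/** A key class whose equals() and hashCode() agree (a sketch). */
final class CurrencyFields {
    private final String poundsText, eurosText;

    CurrencyFields(String poundsText, String eurosText) {
        this.poundsText = poundsText;
        this.eurosText = eurosText;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof CurrencyFields)) return false;
        CurrencyFields other = (CurrencyFields) o;
        return poundsText.equals(other.poundsText) && eurosText.equals(other.eurosText);
    }

    @Override
    public int hashCode() {
        return Objects.hash(poundsText, eurosText); // equal fields give equal hash codes
    }

    public static void main(String[] args) {
        Set<CurrencyFields> seen = new HashSet<>();
        seen.add(new CurrencyFields("10.00", "11.70"));
        // A separate but equal object is found, because it has the same hash code.
        System.out.println(seen.contains(new CurrencyFields("10.00", "11.70"))); // true
    }
}
```

Hashing the fields separately, rather than hashing their concatenation, also keeps pairs such as ("ab", "c") and ("a", "bc") from being forced into the same bucket.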
