Chapter 11: Hash Tables
Data Structures
1
Hash Tables: Overview
Provide very fast insertion and searching
◦ Both are O(1)
◦ Is this too good to be true?
Disadvantages
◦ Based on arrays, so the size must be known in advance
◦ Performance degrades when the table becomes full
◦ No convenient way to sort data
Summary
◦ Best structure if you have no need to visit items in order and
you can predict the size of your database in advance.
2
Motivation
Let’s suppose we want to insert a key into a data
structure, where the key can fall into a range from
0 to m, where m is very large
And we want to be able to find the key quickly
3
Motivation
Moreover, we may not be storing integers
For example, we might be storing the words of a dictionary
4
Hash Table
Idea
◦ Provide easy searching and insertion by mapping
keys to positions in an array
◦ This mapping is provided by a hash function
Takes the key as input
Produces an index as output
5
Hash Function: Example
The easiest hash function is the following:
◦ H(key) = key % tablesize
◦ H(key) now contains a value between 0 and tablesize-1
So if we inserted the following keys into a table of size 10:
13, 11456, 2001, 157
◦ You probably already see potential for collisions
◦ Patience, we’ll come to it!

Index  Value
0
1      2001
2
3      13
4
5
6      11456
7      157
8
9
6
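To make the arithmetic concrete, here is a minimal Java sketch of the example above; the class and method names are illustrative, not from the book.

    // Minimal sketch: hash the slide's four keys into a table of size 10.
    public class HashDemo {

        static int hashFunc(int key, int tableSize) {
            return key % tableSize;   // always in 0..tableSize-1
        }

        public static void main(String[] args) {
            int[] keys = {13, 11456, 2001, 157};
            for (int k : keys) {
                System.out.println(k + " -> index " + hashFunc(k, 10));
            }
            // Prints indices 3, 6, 1, and 7 -- matching the table above.
        }
    }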
What have we accomplished?
We have stored keys of an unpredictably large range
into a smaller data structure
And searching and inserting become easy!
To find a key k, just retrieve table[H(k)]
To insert a key k, just set table[H(k)] = k
Both are O(1)!

Index  Value
0
1      2001
2
3      13
4
5
6      11456
7      157
8
9
7
What’s our price?
Of course, you’ve probably already realized that multiple
values in our range could map to the same hash table index
For example, if we used:
◦ H(k) = k % 10
Then tried to insert 207:
◦ H(207) = 7
We have a collision at position 7

Index  Value
0
1      2001
2
3      13
4
5
6      11456
7      157
8
9
8
What have we learned?
If we use hash tables, we need the following:
◦ Some way of handling collisions. We’ll study a couple of ways:
Open addressing
Which has 3 kinds: linear probing, quadratic probing, and double
hashing
Separate chaining
9
Linear Probing
Presumably, you will have defined your hash
table size to be ‘safe’
◦ As in, larger than the maximum number of items you
expect to store
As a result, there should be some available cells
10
Linear Probing: Example
Again, say we insert element 207
H(207) = 207 % 10 = 7
This results in a collision with element 157
So we search linearly for the next available cell,
which is at position 8
◦ And put 207 there

Index  Value
0
1      2001
2
3      13
4
5
6      11456
7      157
8      207
9
11
Linear Probing
Note: This complicates insertion and searching a bit!
For example, if we then inserted element 426, we would
have to check three cells before finding a vacant one
at position 9
And searching is not simply a matter of applying H(k)
◦ You apply H(k), and probe!

Index  Value
0
1      2001
2
3      13
4
5
6      11456
7      157
8      207
9      426
12
Linear Probing: Clusters
As the table to the right illustrates, linear probing
also tends to result in the formation of clusters.

Index  Value
0
1      2001
2
3      13
4
5
6      11456
7      157
8      207
9      426

Linear Probing

When the table gets too full, one remedy is to create
a new table and double its size
Note this is not quite as simple as it seems
◦ For every item inside, you have to recompute
its hash value
◦ The hash function is necessarily different:
H(k) = k % 20
◦ But, less clustering

Index  Value
0
1      2001
2
3
4
5
6      426
7      207
8
9
10
11
12
13     13
14
15
16     11456
17     157
18
19
15
Linear Probing
Linear probing is the simplest way to handle collisions, and is thus
worthy of explanation
Let’s look at the Java implementation on page 533
◦ This assumes a class with member variables:
hashArray (the hash table)
arraySize (the size of the hash table)
◦ Assume an empty slot contains -1
We’ll construct (sketched below):
◦ hashFunc()
◦ find()
◦ insert()
◦ delete()
16
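A minimal sketch of those four methods under the slide’s assumptions (hashArray, arraySize, -1 for an empty slot); the book’s version on page 533 is the reference. One wrinkle the list hides: delete() should mark the slot with a special ‘deleted’ value (-2 here) rather than -1, so that later finds can probe past it.

    public class LinearProbingHash {
        private int[] hashArray;          // the hash table
        private int arraySize;            // its size
        private static final int EMPTY = -1;
        private static final int DELETED = -2;   // keeps probe chains intact

        public LinearProbingHash(int size) {
            arraySize = size;
            hashArray = new int[arraySize];
            java.util.Arrays.fill(hashArray, EMPTY);
        }

        public int hashFunc(int key) {
            return key % arraySize;
        }

        // Returns the index holding key, or -1 if it is absent.
        public int find(int key) {
            int index = hashFunc(key);
            int probes = 0;
            while (hashArray[index] != EMPTY && probes < arraySize) {
                if (hashArray[index] == key) return index;
                index = (index + 1) % arraySize;   // linear probe, wrapping around
                probes++;
            }
            return -1;
        }

        // Assumes the table is not full.
        public void insert(int key) {
            int index = hashFunc(key);
            while (hashArray[index] != EMPTY && hashArray[index] != DELETED) {
                index = (index + 1) % arraySize;
            }
            hashArray[index] = key;
        }

        public void delete(int key) {
            int index = find(key);
            if (index != -1) hashArray[index] = DELETED;
        }
    }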
Quadratic Probing
The main problem with linear probing was its potential
for clustering
Quadratic probing attempts to address this
◦ Instead of linearly searching for the next available cell
i.e. for hash x, search cell x+1, x+2, x+3, x+4….
◦ Search quadratically
i.e. for hash x, search cell x+1, x+4, x+9, x+16, x+25…
Idea
◦ On a collision, initially assume a small cluster and go to x+1
◦ If that’s occupied, assume a larger cluster and go to x+4
◦ If that’s occupied, assume an even larger cluster, and go to x+9
(see the sketch below)
17
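A minimal sketch of just the probe arithmetic (the method name is illustrative): on the step-th retry, the cell examined is the original hash plus step squared, with wraparound.

    // For hash x, successive probes land on x+1, x+4, x+9, ... (mod tableSize).
    static int quadraticProbe(int x, int step, int tableSize) {
        return (x + step * step) % tableSize;
    }
    // Example: a collision at index 7 in a table of size 10 probes
    // 8 (7+1), then 1 (7+4 = 11, wrapping), then 6 (7+9 = 16, wrapping), ...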
Quadratic Probing: Example
Returning to our old example with inserting 207
H(207) = 207 % 10 = 7
This results in a collision with element 157
In this case, slot 7 is occupied, so the first quadratic probe
checks 7+1 = 8, which is vacant, and 207 goes there

Index  Value
0
1      2001
2
3      13
4
5
6      11456
7      157
8      207
9
Second Hash Function
Characteristics of the hash function for the probe
◦ It cannot be the same as the first hash function
◦ It can NEVER hash to zero
Why not? (See the sketch below.)
24
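The form used in the examples that follow is P(k) = c - (k % c) for a constant c smaller than the table size. A one-method sketch shows why it answers the question above: k % c lies in 0..c-1, so c - (k % c) lies in 1..c and can never be zero.

    // Second hash: produces the probe step. Since key % c is in 0..c-1,
    // the result is in 1..c -- never zero, so the probe always advances.
    static int hashFunc2(int key, int c) {
        return c - (key % c);
    }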
Double Hashing: Example
Returning to our old example with inserting 207
H(207) = 207 % 10 = 7
This results in a collision with element 157
So we hash again, to get the probe
◦ Suppose we choose c = 5
◦ Then:
P(207) = 5 - (207 % 5)
P(207) = 5 - 2 = 3

Index  Value
0
1      2001
2
3      13
4
5
6      11456
7      157
8
9
25
Double Hashing: Example
So we insert 207 at position:
◦ 7+3 = 10
◦ Wrapping around, this will put 207 at position 0

Index  Value
0      207
1      2001
2
3      13
4
5
6      11456
7      157
8
9
26
Double Hashing: Example
Now, let’s again insert value 426
We run the initial hash:
◦ H(426) = 426 % 10 = 6
We get a collision, so we probe:
◦ P(426) = 5 - (426 % 5)
◦ = 5 - 1 = 4
And insert at location:
◦ H(426) + P(426) = 10
◦ Wrapping around, we get 0. Another collision!

Index  Value
0      207
1      2001
2
3      13
4
5
6      11456
7      157
8
9
27
Double Hashing: Example
So, we probe again
◦ P(426) = 4
So we insert at location 0+4 = 4, and this time
there is no collision
Double hashing will in general produce the fewest clusters

Index  Value
0      207
1      2001
2
3      13
4      426
5
6      11456
7      157
8
9

Let’s construct (sketched below):
◦ hashFunc()
◦ hashFunc2()
◦ find()
◦ insert()
◦ delete()
29
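A minimal sketch of the first four methods under the same assumptions as before (-1 for an empty slot, c = 5 as in the example); delete() would mirror the linear-probing version, marking slots with a special ‘deleted’ value.

    public class DoubleHash {
        private int[] hashArray;
        private int arraySize;
        private static final int EMPTY = -1;
        private static final int C = 5;    // constant for the second hash

        public DoubleHash(int size) {
            arraySize = size;
            hashArray = new int[arraySize];
            java.util.Arrays.fill(hashArray, EMPTY);
        }

        public int hashFunc(int key)  { return key % arraySize; }

        public int hashFunc2(int key) { return C - (key % C); }   // step in 1..C

        public int find(int key) {
            int index = hashFunc(key);
            int step = hashFunc2(key);      // probe by this fixed step
            int probes = 0;
            while (hashArray[index] != EMPTY && probes < arraySize) {
                if (hashArray[index] == key) return index;
                index = (index + step) % arraySize;
                probes++;
            }
            return -1;   // not found
        }

        // Assumes the table is not full; a prime table size guarantees
        // the probe sequence visits every cell (see the slides that follow).
        public void insert(int key) {
            int index = hashFunc(key);
            int step = hashFunc2(key);
            while (hashArray[index] != EMPTY) {
                index = (index + step) % arraySize;
            }
            hashArray[index] = key;
        }
    }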
Note…
What is a potential problem with choosing a
hash table of size 10 and a c of 5 for the
probe, as we just did?
30
Probe Sequence
The probe sequence may never find an open
cell!
Consider inserting the key 0: because H(0) = 0, we’ll start
at hash location 0
◦ If we have a collision, P(0) = 5 so we’ll next check
0+5=5
◦ If we have a collision there, we’ll next check
5+5=10, with wraparound we get 0
◦ We’ll infinitely check 0 and 5, and never find an
open cell!
31
Double Hashing Requirement
The root of the problem is that the table size is not
prime!
◦ For example, if the size were 11, the probe sequence visits:
◦ 0, 5, 10, 4, 9, 3, 8, 2, 7, 1, 6
◦ If there is even one open cell, the probing is guaranteed to find it
32
Separate Chaining
The alternative to
open addressing
Does not involve
probing to
different locations
in the hash table
Rather, every
location in the
hash table contains
a linked list of keys
33
Separate Chaining
Simple case, 7 element
hash table
H(k) = k % 7
So:
◦ 21, 77 each hash to
location 0
◦ 72 hashes to location 2
◦ 75, 5, 19 hash to location
5
Each is simply
appended to the
correct linked list
34
Separate Chaining
In separate chaining, trouble happens when a list gets too full
Generally, we want to keep the lists short, so that the
sequential search within each list stays fast
35
Java Implementation
Let’s look at pages 555-557
Note: We will need a linked list and the hash
table!
◦ Will take a little time
36
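Ahead of the book’s version, here is a minimal sketch of the idea, using Java’s built-in LinkedList in place of the book’s own list class (names are illustrative):

    import java.util.LinkedList;

    public class ChainHash {
        private LinkedList<Integer>[] table;   // one list per hash location
        private int arraySize;

        @SuppressWarnings("unchecked")
        public ChainHash(int size) {
            arraySize = size;
            table = new LinkedList[arraySize];
            for (int i = 0; i < arraySize; i++) {
                table[i] = new LinkedList<>();
            }
        }

        private int hashFunc(int key) { return key % arraySize; }

        // Append the key to the list at its hash location; no probing needed.
        public void insert(int key) {
            table[hashFunc(key)].add(key);
        }

        // Only the one list the key could be in is searched.
        public boolean find(int key) {
            return table[hashFunc(key)].contains(key);
        }

        public void delete(int key) {
            table[hashFunc(key)].remove(Integer.valueOf(key));
        }
    }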
A Good Hash Function
Has two properties:
◦ Is quick to compute, so as not to degrade the
performance of insertion and searching
◦ Can take a range of key values and transform them
into indices such that the key values are distributed
randomly across the hash table
37
For example…
Data can be highly non-random
For example, a car-part ID:
◦ 033-400-03-94-05-0-535
38
Rule #1: Don’t Use Non-Data
Compress the key fields down until every bit counts
For example: if a field of the key is the same for every
item, it carries no information and should be left out
39
Rule #2: Use All of the Data
Every part of the key should contribute to the hash function
The more data portions that contribute to the hash, the
more likely the keys are to spread evenly across the table
40
Rule #3: Use a Prime Number for
Modulo Base
This is a requirement for double hashing
Important for quadratic probing
Especially important if the keys may not be
randomly distributed
◦ The more keys that share a divisor with the array
size, the more collisions
◦ Example: non-random data which are multiples of 50
If the table size is 50, they all hash to the same spot
If the table size is 10, they all hash to the same spot
If the table size is 53, no key shares a divisor with the
table size. Better!
41
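A quick sketch verifying the multiples-of-50 example above:

    public class PrimeDemo {
        public static void main(String[] args) {
            int[] keys = {50, 100, 150, 200, 250};   // non-random: multiples of 50
            for (int size : new int[]{50, 10, 53}) {
                System.out.print("table size " + size + ":");
                for (int k : keys) System.out.print(" " + (k % size));
                System.out.println();
            }
            // size 50: 0 0 0 0 0         <- all collide
            // size 10: 0 0 0 0 0         <- all collide
            // size 53: 50 47 44 41 38    <- spread out
        }
    }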
Hashing Efficiency
Insertion and Searching are O(1) in the best case
◦ This implies no collisions
◦ If you minimize collisions, you can approach this runtime
If collisions occur:
◦ Access times depend on resulting probe lengths
◦ Every probe equals one more access
◦ So the worst-case insertion or search time is
proportional to:
The number of required probes if you use open addressing
The number of links in the longest list if you use separate
chaining
42