Hashing
Preview
A hash function is a function that:
When applied to an Object, returns a number
When applied to equal Objects, returns the same number for each
When applied to unequal Objects, is very unlikely to return the same number for each
Hash functions turn out to be very important for
searching, that is, looking things up fast
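As a minimal sketch of these three properties in Java, the hypothetical class `Fruit` below (not part of the original slides) overrides `equals` and `hashCode` together, so that equal objects always return the same number.

```java
import java.util.Objects;

/** Hypothetical value class, used only to illustrate the hash-function properties. */
final class Fruit {
    private final String name;

    Fruit(String name) {
        this.name = name;
    }

    @Override
    public boolean equals(Object other) {
        // Two Fruit objects are equal when their names are equal.
        return other instanceof Fruit && name.equals(((Fruit) other).name);
    }

    @Override
    public int hashCode() {
        // Derive the number only from the field that equals() compares,
        // so equal objects always return the same number.
        return Objects.hash(name);
    }
}
```

Two separately created `new Fruit("apple")` objects are equal and therefore share a hash code. Unequal objects usually, but not necessarily, get different ones.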
Searching
Consider the problem of searching an array for a given value
If the array is not sorted, the search requires O(n) time
If the value isn’t there, we need to search all n elements
If the value is there, we search n/2 elements on average
If the array is sorted, we can do a binary search
A binary search requires O(log n) time
About equally fast whether the element is found or not
It doesn’t seem like we could do much better
How about an O(1), that is, constant time search?
We can do it if the array is organized in a particular way
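For comparison, the two searches described above can be sketched as follows. The class and method names are illustrative assumptions, and both methods return -1 when the value is not found.

```java
final class SearchDemo {
    /** Linear search: examines elements one by one, O(n) in the worst case. */
    static int linearSearch(int[] values, int target) {
        for (int i = 0; i < values.length; i++) {
            if (values[i] == target) {
                return i;
            }
        }
        return -1; // searched all n elements without finding the target
    }

    /** Binary search on a sorted array: halves the range each step, O(log n). */
    static int binarySearch(int[] sorted, int target) {
        int low = 0;
        int high = sorted.length - 1;
        while (low <= high) {
            int mid = low + (high - low) / 2; // avoids overflow of (low + high)
            if (sorted[mid] == target) {
                return mid;
            }
            if (sorted[mid] < target) {
                low = mid + 1;
            } else {
                high = mid - 1;
            }
        }
        return -1;
    }
}
```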
Hashing
Suppose we were to come up with a “magic function” that, given a value to
search for, would tell us exactly where in the array to look
If it’s in that location, it’s in the array
If it’s not in that location, it’s not in the array
This function would have no other purpose
If we look at the function’s inputs and outputs, they probably won’t “make
sense”
This function is called a hash function because it “makes hash” of its inputs
Example (ideal) hash function
Suppose our hash function gave us the following values:
hashCode("apple") = 5
hashCode("watermelon") = 3
hashCode("grapes") = 8
hashCode("cantaloupe") = 7
hashCode("kiwi") = 0
hashCode("strawberry") = 9
hashCode("mango") = 6
hashCode("banana") = 2
The resulting array (index, value):
0 kiwi
1
2 banana
3 watermelon
4
5 apple
6 mango
7 cantaloupe
8 grapes
9 strawberry
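A rough sketch of how such a table would be searched: the `hash` method below simply hard-codes the example values, and its fallback for other strings is an assumption added so the method is total. With no collisions, a lookup inspects exactly one array location.

```java
final class IdealHashDemo {
    /** Stand-in "magic function" returning the example values from the slide. */
    static int hash(String s) {
        switch (s) {
            case "apple":      return 5;
            case "watermelon": return 3;
            case "grapes":     return 8;
            case "cantaloupe": return 7;
            case "kiwi":       return 0;
            case "strawberry": return 9;
            case "mango":      return 6;
            case "banana":     return 2;
            // Assumption: any other string is mapped to some slot in 0..9.
            default:           return Math.floorMod(s.hashCode(), 10);
        }
    }

    /** Places each example value at the location its hash value names. */
    static String[] buildTable() {
        String[] table = new String[10];
        String[] fruits = {"apple", "watermelon", "grapes", "cantaloupe",
                           "kiwi", "strawberry", "mango", "banana"};
        for (String fruit : fruits) {
            table[hash(fruit)] = fruit;
        }
        return table;
    }

    /** Constant-time search: look in one location only. */
    static boolean contains(String[] table, String value) {
        return value.equals(table[hash(value)]);
    }
}
```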
Sets and tables
Sometimes we just want a set of things: objects are either in it, or they are not in it
Sometimes we want a map, a way of looking up one thing based on the value of another
We use a key to find a place in the map
The associated value is the information we are trying to look up
Hashing works the same for both sets and maps
Most of our examples will be sets
Example map (index, key, value):
...
141
142 robin robin info
143 sparrow sparrow info
144 hawk hawk info
145 seagull seagull info
146
147 bluejay bluejay info
148 owl owl info
Example imperfect hash function
Suppose our hash function gave us the following values:
hash("apple") = 5
hash("watermelon") = 3
hash("grapes") = 8
hash("cantaloupe") = 7
hash("kiwi") = 0
hash("strawberry") = 9
hash("mango") = 6
hash("banana") = 2
hash("honeydew") = 6
The array (index, value):
0 kiwi
1
2 banana
3 watermelon
4
5 apple
6 mango
7 cantaloupe
8 grapes
9 strawberry
Hashing
[Figure: a hash function h maps the actual keys K (k1 to k5), drawn from the universe of keys, to slots 0 through m–1 of the table; h(k2) = h(k5) is a collision.]
(ii) Division Method
• Map a key k into one of the m slots by taking the remainder of k divided by m. That is,
h(k) = k mod m
• Example: m = 31 and k = 78 give h(k) = 78 mod 31 = 16.
• Advantage: Fast, since it requires just one division operation.
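The division method is short enough to sketch directly. The class name is illustrative, and the use of `Math.floorMod` instead of `%` is a choice made here so that negative keys still land in slots 0 through m–1.

```java
final class DivisionHash {
    /** Division method: h(k) = k mod m, for a table with m > 0 slots. */
    static int hash(int k, int m) {
        // Math.floorMod keeps the result in 0..m-1 even when k is negative,
        // whereas k % m would be negative for negative k.
        return Math.floorMod(k, m);
    }
}
```

For the example above, `DivisionHash.hash(78, 31)` evaluates to 16.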
(Mid Square Method)
Collisions
When two values hash to the same array location,
this is called a collision
Collisions are normally treated as “first come, first
served”—the first value that hashes to the location
gets it
We have to find something to do with the second and
subsequent values that hash to this same location
Methods of Resolution
• Chaining:
– Elements that hash to the same slot are kept in a list attached to that slot.
• Open Addressing:
– All elements are stored in the hash table itself.
– When collisions occur, use a systematic (consistent) procedure to store elements in free slots of the table.
(Linear Probing)
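As a hedged sketch of open addressing with linear probing: when the home slot is taken, try the next slot, wrapping around at the end of the array. The class below is an illustrative assumption, not taken from the slides. It stores Strings, uses `String.hashCode` reduced modulo the table size as the hash function, and omits removal and resizing.

```java
final class LinearProbingSet {
    private final String[] slots;
    private int size;

    LinearProbingSet(int capacity) {
        slots = new String[capacity];
    }

    /** Home slot: the location the hash function names for this value. */
    private int home(String value) {
        return Math.floorMod(value.hashCode(), slots.length);
    }

    /** Adds value; returns false if it was already present. */
    boolean add(String value) {
        int index = home(value);
        for (int probes = 0; probes < slots.length; probes++) {
            if (slots[index] == null) {      // free slot: first come, first served
                slots[index] = value;
                size++;
                return true;
            }
            if (slots[index].equals(value)) {
                return false;                // already in the table
            }
            index = (index + 1) % slots.length; // collision: probe the next slot
        }
        throw new IllegalStateException("hash table is full");
    }

    /** Follows the same probe sequence as add until the value or a free slot is found. */
    boolean contains(String value) {
        int index = home(value);
        for (int probes = 0; probes < slots.length; probes++) {
            if (slots[index] == null) {
                return false;
            }
            if (slots[index].equals(value)) {
                return true;
            }
            index = (index + 1) % slots.length;
        }
        return false; // every slot examined
    }

    int size() {
        return size;
    }
}
```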
Insertion, I
Suppose you want to add seagull to this hash table
[Figure: keys from the universe of keys hashed into slots 0 through m–1, with collisions h(k1) = h(k4), h(k2) = h(k5) = h(k6), and h(k3) = h(k7); k8 hashes to its own slot h(k8); unused slots are marked X.]
Collision Resolution by Chaining
[Figure: slots 0 through m–1, each holding a chain of the keys that hash to it: k1 → k4; k5 → k2 → k6; k7 → k3; k8.]
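A minimal sketch of chaining, assuming a set of Strings and using `String.hashCode` reduced modulo the number of slots as the hash function. The class and method names are illustrative. Each slot holds a linked list, and colliding values simply share a list.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

final class ChainedHashSet {
    // One chain per slot; an empty chain corresponds to an unused slot.
    private final List<List<String>> chains = new ArrayList<>();

    ChainedHashSet(int slotCount) {
        for (int i = 0; i < slotCount; i++) {
            chains.add(new LinkedList<>());
        }
    }

    /** Selects the chain for the slot that the value hashes to. */
    private List<String> chainFor(String value) {
        return chains.get(Math.floorMod(value.hashCode(), chains.size()));
    }

    /** Adds value to its chain; returns false if it was already present. */
    boolean add(String value) {
        List<String> chain = chainFor(value);
        if (chain.contains(value)) {
            return false;
        }
        chain.add(value);
        return true;
    }

    /** Searches only the one chain that the value hashes to. */
    boolean contains(String value) {
        return chainFor(value).contains(value);
    }
}
```

With a single slot every value collides, yet lookups remain correct. They just degrade to a linear search of one long chain.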
Rehashing
In the event of a collision, another approach is to rehash: compute another hash function
Since we may need to rehash many times, we need an easily computable sequence of functions
Simple example: in the case of hashing Strings, we might take the previous hash code
and add the length of the String to it
This probably works better if the length of the String is not already a component of the original hash function
Possibly better yet: add the length of the String plus the number of probes made so far
Problem: are we sure we will look at every location in the array?
Rehashing is a fairly uncommon approach, and we won’t pursue it any further here
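The simple example above (previous hash code plus the length of the String) can be sketched as follows, together with a helper that counts how many locations the sequence actually reaches. The names and the wrap-around to the table size are assumptions made for illustration.

```java
final class RehashProbe {
    /** Next location: previous location plus the String's length, wrapped to the table size. */
    static int rehash(int previous, String value, int tableSize) {
        return Math.floorMod(previous + value.length(), tableSize);
    }

    /** Counts the distinct locations visited before the sequence starts repeating. */
    static int locationsVisited(String value, int start, int tableSize) {
        boolean[] seen = new boolean[tableSize];
        int count = 0;
        int index = Math.floorMod(start, tableSize);
        while (!seen[index]) {
            seen[index] = true;
            count++;
            index = rehash(index, value, tableSize);
        }
        return count;
    }
}
```

This makes the problem noted above concrete: in a table of size 10, "kiwi" (length 4) starting at location 0 only ever reaches locations 0, 4, 8, 2 and 6, whereas a String of length 3 reaches all ten.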