Hashing
So, for example, if you were storing Strings, the hash function
would have to map an arbitrary String to an integer in the
range of the hash table array. Here is an example of such a
hash function (this is a very poor hash function):
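A minimal sketch of such a poor hash function (assuming Java, non-empty lowercase alphabetic strings, and that it looks only at the first letter; the details are an assumption, not the exact example) is:

    // A very poor hash function: map a string to the alphabet position of its
    // first letter. It can only produce the values 0..25, so it is only suited
    // to a table of size 26, and it ignores every character after the first.
    public static int badHash(String key) {
        return key.charAt(0) - 'a';   // assumes a non-empty, all-lowercase string
    }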
What should we do when two values hash to the same location (a collision)? Here is one thing NOT to do:
1) Don't: just keep the old value stored in the hash table and throw away the new value you are trying to insert (or overwrite the old value with the new one); either way, information is lost.
Since good hash functions are difficult to truly analyze, I won't get into much
detail here. (The book does not either.) Ideally, a hash function
should work well for any type of input it could receive. With
that in mind, the ideal hash function simply maps an element
to a random integer in the given range. Thus, the probability
that a randomly chosen element maps to any one of the
possible array locations should be equal.
1) It's designed for an array of only size 26. (Or maybe a bit
bigger if we allow non-alphabetic characters.) Usually hash
tables are larger than this.
Simply put, the mod operator can help us deal with both of
these issues: it keeps the hash value within the bounds of the
table, and (applied carefully) it avoids integer overflow. Change
the function as follows:
f(c0c1...cn) =
(ascii(c0)*128^0 + ascii(c1)*128^1 + ... + ascii(cn)*128^n) mod tablesize
If you apply the mod at each step, then all intermediate values
calculated remain low and no overflow occurs.
Also, using Horner's rule can aid in this computation. The rule states that

c(n)*x^n + c(n-1)*x^(n-1) + ... + c1*x + c0 = c0 + x*(c1 + x*(c2 + ... + x*(c(n-1) + x*c(n))...))

so the hash value can be built up one character at a time.
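A sketch of this computation in Java (assuming a String key and an int tableSize; the method name is illustrative):

    // Computes (ascii(c0)*128^0 + ascii(c1)*128^1 + ... + ascii(cn)*128^n) mod tableSize
    // using Horner's rule, applying the mod at each step so that no intermediate
    // value ever exceeds roughly 128*tableSize (avoiding overflow).
    public static int hash(String key, int tableSize) {
        int h = 0;
        // Work from the last character back to the first, since c0 carries the
        // smallest weight (128^0) and cn the largest (128^n).
        for (int i = key.length() - 1; i >= 0; i--)
            h = (h * 128 + key.charAt(i)) % tableSize;
        return h;
    }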
The problem with simply adding up the ASCII values of the characters
(without the powers of 128) is that if the table size is big, say even
just 10000, then the highest value an 8-letter string could possibly
hash to is 8*127 = 1016. In this situation, you would NOT be using
nearly 90% of the hash locations at all, which would most definitely
result in many values hashing to the same location.
One final note: I have just shown you how to use mod to get a
hash function into the desired range. Another technique that
can be used is the "MAD" method, or the multiply-add-and-divide
method. This uses mod as well. You can read about it on page
348 of the text.
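A rough sketch of a MAD-style compression function (the exact formulation is the one in the text; the constants and names here are made up for illustration):

    // MAD ("multiply-add-and-divide") compression: h(k) = ((a*k + b) mod p) mod N,
    // where N is the table size, p is a prime larger than N, and a, b are fixed
    // constants with a mod p != 0. Assumes k >= 0.
    public static int madCompress(int k, int tableSize) {
        final long p = 1000000007L;   // a prime larger than the table size (illustrative)
        final long a = 123457L;       // illustrative multiplier
        final long b = 98765L;        // illustrative additive constant
        return (int) (((a * k + b) % p) % tableSize);
    }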
Linear Probing
Let's say that our hash table can hold up to 10 integer values
and that it currently looks like this:
index   0    1    2    3    4    5    6    7    8    9
value   -    -    -    173  281  -    461  -    -    -

(A dash marks an empty location.) Now suppose we want to insert 352, and
that 352 hashes to index 3. Location 3 is already full, so with linear
probing we simply try the next location, 4. That is full as well, so we
try location 5, which is empty, and store 352 there:

index   0    1    2    3    4    5    6    7    8    9
value   -    -    -    173  281  352  461  -    -    -
Now, if we want to search for 352, we'd first figure out that its hash
value is 3. Since 352 is NOT stored there, we'd look at location 4.
Since it's not stored there either, we'd finally find 352 in location 5.
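A sketch of this search in Java (assuming the table is an Integer[] where null marks an empty cell, and some hash(int) helper; these names, and the choice of storing Integers, are assumptions, and the table is assumed never to be completely full):

    // Linear probing search: start at the value's hash location and keep moving
    // to the next cell (wrapping around) until we find the value or hit an empty cell.
    public static int find(Integer[] table, int value) {
        int index = hash(value) % table.length;     // e.g., 352 hashes to index 3
        while (table[index] != null) {
            if (table[index] == value)
                return index;                       // found it (352 turns up at index 5)
            index = (index + 1) % table.length;     // location full: try the next one
        }
        return -1;                                  // empty cell reached: value is not stored
    }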
How long does such a search take on average? Let λ be the fraction of the
locations that are filled in the hash table. This means that a (1-λ)
fraction of the locations in the hash table are free. If each location we
probe is equally likely to be filled, then with probability (1-λ) the very
first location we check is empty and the search stops after one probe.
But what if that's not the case? Then the probability that the
first location we check is full but the second is empty is λ(1-λ).
Similarly, the probability that the first two locations are full
but the third is empty is λ²(1-λ). In general, the probability that a
search takes exactly i probes is λ^(i-1)(1-λ), so the expected number
of probes is
\[
\sum_{i=1}^{\infty} i\,\lambda^{i-1}(1-\lambda)
  = (1-\lambda)\sum_{i=1}^{\infty} i\,\lambda^{i-1}
  = (1-\lambda)\sum_{i=1}^{\infty} \frac{d}{d\lambda}\,\lambda^{i}
  = (1-\lambda)\,\frac{d}{d\lambda}\left(\sum_{i=0}^{\infty} \lambda^{i}\right)
  = (1-\lambda)\,\frac{d}{d\lambda}\left(\frac{1}{1-\lambda}\right)
  = (1-\lambda)\cdot\frac{1}{(1-\lambda)^{2}}
  = \frac{1}{1-\lambda}
\]
This analysis assumes each probe is to an independent, random location.
Linear probing tends to create clusters of filled locations, which makes
matters worse: the expected number of probes for an insertion (an
unsuccessful search) is roughly (1/2)(1 + 1/(1-λ)^2). This isn't so bad
if λ is only .5, but as λ gets close to 1, say .9, the average number of
locations to search is (1/2)(1 + 100) = 50 or so, which certainly is not
efficient.
Quadratic Probing

Rather than trying locations hash, hash+1, hash+2, ... when a collision
occurs, quadratic probing tries hash+1^2, hash+2^2, hash+3^2, ... (all
taken mod the table size). Using the same initial table as before:

index   0    1    2    3    4    5    6    7    8    9
value   -    -    -    173  281  -    461  -    -    -

To insert 352, which hashes to 3, we check location 3 (full), then
3+1^2 = 4 (full), then 3+2^2 = 7, which is empty, so 352 is stored in
location 7:

index   0    1    2    3    4    5    6    7    8    9
value   -    -    -    173  281  -    461  352  -    -
The idea when searching is the same here: you keep looking for the
element until you find it or you hit an empty location in your probe
sequence. (In our situation, if we were searching for 863, and let's say
that 863 also hashed to location 3, then we would check location 3,
then location 4, then location 7, and finally location 2, since the
next place to check is (3+3^2) % 10 = 2.)
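A sketch of the corresponding search (same assumptions as the linear probing sketch earlier):

    // Quadratic probing search: the i-th probe looks at (hash + i^2) mod table.length,
    // i.e. locations hash, hash+1, hash+4, hash+9, ... (all reduced mod the table size).
    public static int findQuadratic(Integer[] table, int value) {
        int home = hash(value) % table.length;             // e.g., 863 hashes to index 3
        for (int i = 0; i < table.length; i++) {
            // the (long) cast avoids overflow of i*i for very large tables
            int index = (int) ((home + (long) i * i) % table.length);
            if (table[index] == null)
                return -1;                                 // empty cell: value is not stored
            if (table[index] == value)
                return index;                              // found it
        }
        return -1;                                         // give up after table.length probes
    }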
It can be proven that if we keep our hash table at least half empty
(and the table size is a prime number) and use quadratic probing, we
will ALWAYS be able to find a location for a value to hash to. (Notice
that with linear probing we are guaranteed to cycle through every
possible location when we probe; here it is more difficult to prove
such a property.)
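For reference, a sketch of the standard argument (my summary, not the book's exact proof): suppose the table size p is prime and consider the probe locations h + i^2 (mod p) for i = 0, 1, ..., (p-1)/2. If two of these were equal, say h + i^2 ≡ h + j^2 (mod p) with 0 ≤ i < j ≤ (p-1)/2, then (j-i)(j+i) ≡ 0 (mod p), so the prime p would have to divide either j-i or j+i; but both of those are strictly between 0 and p, which is impossible. So these first ⌈p/2⌉ probe locations are all distinct, and if the table is at least half empty, at least one of them must be free.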
Dynamic Table Expansion
The obvious fix when the hash table gets too full is to expand it:
allocate a new, larger array (typically at least double the old size).
But there are more issues here to think about. How will our hash
function change? And once we change it, what do we have to do with each
value already stored in the hash table?
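A sketch of the expansion step (assuming the same Integer[] representation, hash(int) helper, and linear probing as in the earlier sketches; the choice of new size is up to the caller):

    // Dynamic expansion: allocate a bigger table and re-insert every stored value,
    // since each value's hash location depends on the table size.
    public static Integer[] expand(Integer[] oldTable, int newSize) {
        Integer[] newTable = new Integer[newSize];   // every cell starts out empty (null)
        for (Integer value : oldTable) {
            if (value == null) continue;             // skip empty cells in the old table
            int index = hash(value) % newSize;       // hash with respect to the NEW size
            while (newTable[index] != null)          // resolve collisions (linear probing here)
                index = (index + 1) % newSize;
            newTable[index] = value;
        }
        return newTable;
    }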