Intro To Hashing
However, we won't assume that the phone numbers are distinct since, for example, we can
associate a fixed line house number with all the people who live in that house. This means
that different keys can map to the same value.
The funny thing about Hash Tables is that the high level concept is actually very simple and
familiar to all of us, and yet it can be quite far from its actual implementation which, as we
will see, is far from being simple and familiar.
To see how simple the high level idea is, it's easier for us if we implement it in some high level
language with proper built-in features, and, well, why not C++?
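To use the built-in associative containers available to us in C++, which are conveniently called
Maps (the term comes from mapping, as described above), we need to include the header <map> in
our code. Strictly speaking, std::map is typically implemented as a balanced search tree, and the
standard container that is actually backed by a Hash Table is std::unordered_map, from the header
<unordered_map>. Both are used in the same way, so the high level idea is identical.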
So to declare a map which will be called phonebook and which will map names (strings) to
numbers (long int), we only need to do:
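A minimal sketch of such a declaration, assuming the standard headers <map> and <string> (the
global variable is used only to keep the example short):

```cpp
#include <map>
#include <string>

// Maps a name (the key) to a phone number (the value).
std::map<std::string, long int> phonebook;
```

A freshly declared map starts out empty, so phonebook.empty() is true at this point.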
So, we already have a phone book created in fewer than 10 lines of code!! Seems pretty simple,
right?
Well, that's because it actually IS simple. If you devote some time to do some online research
on this matter, you will often find aliases for the name Hash Tables, which can be as suggestive
as Dictionaries, Associative Arrays and the most common, Maps.
Actually adding values to the Map, to fill it with our friends' numbers, is also extremely
easy. We simply use the notation:
phonebook[key] = value;
and we have just added a new entry to our phone book!!
Below you can see some entries which were added to the phonebook:
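As a sketch, with made-up names and numbers (none of them come from a real phone book, and the
helper make_phonebook is hypothetical), adding entries could look like this:

```cpp
#include <map>
#include <string>

// Builds a small phone book; every name and number here is invented.
std::map<std::string, long int> make_phonebook() {
    std::map<std::string, long int> phonebook;
    phonebook["Alice"] = 5550101;
    phonebook["Bob"] = 5550102;
    phonebook["Carol"] = 5550102;  // same number as Bob: different keys may share a value
    phonebook["Alice"] = 5550199;  // assigning to an existing key overwrites its value
    return phonebook;
}
```

Note that Bob and Carol share a number, which is exactly the "fixed line house number" situation:
the keys are distinct, the values need not be.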
So, as you can see, this is almost everything there is to it, with keys being mapped to values in
order to create and maintain a highly efficient data-structure which allows for fast (constant
time on average, when it is backed by a Hash Table) insertion and look-up of values.
Although this may sound quite limited and specific, it is a data-structure which is used in the
most varied real-world situations, as well as in the design of more complex data-structures like
Sets, which have applications in String Pattern Matching, Database querying and lots and lots
of other areas. So, as you can see, it's a powerful and easy to master data-structure that
every programmer should know about.
But, is this everything there is to it?
For the programmer who only uses Hash Tables through a high level language, yes: knowing these
basic concepts is enough to use them correctly and efficiently.
However, the beauty of a concept often lies underneath its outer shell, and this
statement couldn't be any more accurate to describe what Hashing is all about!!
As you will be able to see in the following section, it is an absolutely incredible concept, which
will definitely make you feel blessed about that moment when you decided to enroll in CS and
which will make you feel God-like or something like that...
The concept of hashing: where all the trouble (and fun) begins
The concept of hashing and more specifically the concept of hash function are the key concepts
that one needs to understand in order to fully understand Hash Tables, and they are at the heart
of this idea:
For the integer data type, the simplest way we have of compressing a range [0..N-1] to a smaller
range [0..M-1] is to use the mod operator to wrap the range around.
As such, we can have something like:
hash(k) = k % M;
where M will become the size of our Hash Table (which is completely unrelated to the table
of values we want to map, i.e., it's not related in any way to our Phone Book).
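The formula above can be sketched as a small function. The name hash_int and the choice of
unsigned types are assumptions made for illustration:

```cpp
#include <cstddef>

// Compresses a non-negative key into the range [0, table_size - 1].
// table_size plays the role of M in the formula and must not be zero.
std::size_t hash_int(unsigned long key, std::size_t table_size) {
    return static_cast<std::size_t>(key % table_size);
}
```

For example, with a table of 7 positions, hash_int(76, 7) and hash_int(20, 7) both give 6, so two
different keys can already land on the same position.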
It's also very common in the literature to find the claim that M should be a prime number in
order to avoid collisions (we still don't know what these are) as often as possible. However,
this is, by itself, no restriction at all, and we can have a Hash Table of any size we want.
The size being prime is just a clever idea to attempt to uniformly distribute the hashed values
along the table.
To compress a string using a good hash function (i.e. a function which manages to
distribute the strings along the hash table as uniformly as possible) we need to make
sure that two things happen:
1. We use all the characters of the string;
2. The Hash value we obtain must be highly string dependent, and we should again take it
modulo M, so that it fits inside the table.
A simple C-style function to hash strings in a good way is (yes, this is the function of our
project):
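A minimal sketch of such a function, written in C style and compiled here as C++. The value
chosen for HT_SIZE is an assumption: any non-prime size for which 433 % HT_SIZE is 13 would
behave the same way in the examples of this section.

```cpp
#include <cstddef>

// Assumed table size: 30 is not prime and gives 433 % 30 == 13.
const std::size_t HT_SIZE = 30;

// Adds up the character codes of a NUL-terminated string,
// then reduces the sum modulo the table size.
std::size_t hash_string(const char *key) {
    std::size_t sum = 0;
    for (const char *p = key; *p != '\0'; ++p) {
        sum += static_cast<unsigned char>(*p);  // every character contributes
    }
    return sum % HT_SIZE;  // make the result fit inside the table
}
```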
There are some important remarks to be made about this hash function.
The first one is that the size of our Hash Table, defined by the constant HT_SIZE, is not a prime
number, and obviously this is not a requirement for anything: we can have any size of table we
want... Also, and I'm mentioning it again, the size of the hash table being prime is only a
way that has been found to reduce collisions as much as possible, but it is
BY NO MEANS mandatory.
The second thing worth mentioning about this function is that it works by summing up the
ASCII values of all the characters in the string and then taking this result modulo M.
While this is definitely a simple way to go, it will also unravel a pattern which will
serve as motivation for our next and last section of this sort of tutorial.
For instance, let's look at the hash values of these two strings:
spam;
maps;
The ASCII values for the characters which make up these strings are:
a = 97, m = 109, s = 115 and p = 112.
If we sum these values up we get 433 and taking it modulo our HT_SIZE we obtain 13.
The main issue with this hash function is that it doesn't take into account the relative order of
the characters in the string: addition is commutative, so rearranging the characters never
changes the sum. As a consequence, the strings:
maps, spam, masp, mpas, mpsa, smap, sapm, samp, pasm, pmas, pams, pmsa, amps, amsp,
apms, apsm, aspm, asmp, msap, mspa, spma, smpa, psam, psma (all 24 permutations of the four
letters)...
all have the same hash value, despite being distinct!!
This is what a collision is:
When several different keys return the same hash value after being passed to the same
hashing function, we say that a collision occurred.
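The collisions among the orderings of the letters of spam can be checked mechanically. The sketch
below assumes a table size of 30 (consistent with 433 % HT_SIZE being 13) and uses a hypothetical
helper, hash_all_orderings, which hashes every permutation of a word with the character-sum
function:

```cpp
#include <algorithm>
#include <cstddef>
#include <set>
#include <string>

// Assumed table size: not prime, and 433 % 30 == 13.
const std::size_t HT_SIZE = 30;

// Sum of the character codes, modulo the table size.
std::size_t hash_string(const std::string &key) {
    std::size_t sum = 0;
    for (unsigned char c : key) {
        sum += c;
    }
    return sum % HT_SIZE;
}

// Hypothetical helper type: the result of hashing every ordering of a word.
struct PermutationReport {
    std::size_t orderings;              // how many orderings were hashed
    std::set<std::size_t> hash_values;  // distinct hash values that showed up
};

PermutationReport hash_all_orderings(std::string letters) {
    std::sort(letters.begin(), letters.end());  // start from the smallest ordering
    PermutationReport report{0, {}};
    do {
        ++report.orderings;
        report.hash_values.insert(hash_string(letters));
    } while (std::next_permutation(letters.begin(), letters.end()));
    return report;
}
```

For the word maps this visits 24 orderings and collects a single distinct hash value, so all of
them compete for the same position in the table.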
This is the exact moment where the magic can be unleashed and where we will actually see two
very clever algorithms which make the maintenance of this data-structure
possible, by handling these collisions while keeping the insertion and retrieval operations
as efficient as possible.
We will study two schemes in detail, but we shall only implement the last one!
Handling collisions via linear probing and external chaining: A full C implementation of a
phone book
I can't emphasize enough how the collisions we are about to discuss are NOT related in any
way to the concepts we want to map, but, instead, they are an intrinsic property of the hash
function we use, as well as of the keys which are being hashed...
In fact, writing this in C++ is perfectly legal and it all works perfectly:
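A sketch of what such code might look like, under a few assumptions: it uses std::unordered_map
(the hash-based map of the C++ standard library), the numbers are made up, and SumHash is a
hypothetical hasher plugged in on purpose so that spam and maps are guaranteed to collide:

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Hypothetical hasher: adds up the character codes, so anagrams always collide.
struct SumHash {
    std::size_t operator()(const std::string &key) const {
        std::size_t sum = 0;
        for (unsigned char c : key) {
            sum += c;
        }
        return sum;
    }
};

// Two keys with the same hash value, each mapped to its own (made-up) number.
std::unordered_map<std::string, long int, SumHash> make_colliding_phonebook() {
    std::unordered_map<std::string, long int, SumHash> phonebook;
    phonebook["spam"] = 5550101;
    phonebook["maps"] = 5550102;
    return phonebook;
}
```

Both entries are stored and can be retrieved independently, even though their hash values are
identical: the container resolves the collision without any help from us.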
Although we can't guess what sort of hash function C++ uses internally to handle strings (we
can always look it up), we can be sure that, in any extensive use of hash-based maps, collisions
WILL occur at some point and they need to be handled somehow.
Languages like C++ and Java already handle these collisions internally, so all we need to
do is enjoy the simple interface these high level implementations provide.
However, what if we needed to write our own hash table implementation, with collision
handling included?
Well, it turns out we can do it, and obviously someone else thought about this first,
so several schemes to handle collisions have been developed.
Some of those schemes are very simple, such as the linear probing scheme we will study next,
while other schemes are so clever and complex that discussing them is beyond the scope of this
text.
Besides the most naive scheme, called Linear probing, we will also focus our attention on a
cooler scheme called External chaining.
Additionally, collision handling via External Chaining will be fully implemented in C.
Here is how the linear probing collision handling scheme works:
When we try to insert an element into an occupied position, we scan the rest of the hash table
until a free position is found, so that the element can be inserted there.
For example, considering the elements highlighted in orange, we see that both 20 and 76 would
be inserted in the same position in the hash table.
76 is inserted correctly: as the position with index 6 is initially free, no collision occurs.
However, when we try to insert element 20, we find that element 76 is
already there.
Now, we need to scan the Hash Table until a free position is found, and the insertion attempts
would go like this:
There is an element at the position where 20 was to be inserted. As in linear probing each
position can only take a single element, we start scanning the array by wrapping around to
index 0. Index 0 is full, so we skip to index 1; index 1 is also full, so we skip to index 2;
index 2 is also full, and so on, until index 4 (the first free index) is reached. This is linear
probing.
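The walk-through above can be sketched as follows. Several things are assumptions: the table is
taken to have 7 positions (which is consistent with 76 and 20 both landing on index 6 and with
the wrap-around to index 0), keys are non-negative integers hashed with key % M, and the value -1
marks a free position:

```cpp
#include <cstddef>
#include <vector>

const int EMPTY = -1;  // marks a free position; keys are assumed to be non-negative

// Inserts key using linear probing.
// Returns the index where the key was stored, or -1 if the table is full.
int insert_linear_probing(std::vector<int> &table, int key) {
    const std::size_t m = table.size();
    if (m == 0) {
        return -1;
    }
    const std::size_t home = static_cast<std::size_t>(key) % m;  // hash(k) = k % M
    for (std::size_t step = 0; step < m; ++step) {
        const std::size_t index = (home + step) % m;  // wraps around to index 0
        if (table[index] == EMPTY) {
            table[index] = key;
            return static_cast<int>(index);
        }
    }
    return -1;  // every position was occupied
}
```

Starting from std::vector<int> table(7, EMPTY) and inserting the (made-up) keys 7, 8, 9 and 10
fills indices 0 to 3. Inserting 76 then takes index 6, and inserting 20 probes indices 6, 0, 1, 2
and 3 before settling on index 4.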
An important remark about the external chaining scheme is that each element of the Hash Table
contains a pointer to a Linked List.
Collisions are handled by adding the element which would collide at the beginning of the
linked list at the specified hash position.
We can also remove elements easily:
We simply remove the element we want from the respective linked list.
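A compact sketch of external chaining is shown next. It is written in C++ rather than C and is
not the implementation of the project: the class ChainedPhoneBook, the Entry struct and the
character-sum hash are assumptions made for illustration, and std::forward_list stands in for a
hand-written linked list.

```cpp
#include <cstddef>
#include <forward_list>
#include <string>
#include <vector>

struct Entry {
    std::string name;
    long int number;
};

class ChainedPhoneBook {
public:
    // size is the number of positions in the table and must be greater than zero.
    explicit ChainedPhoneBook(std::size_t size) : buckets_(size) {}

    // Adds or updates an entry; a new entry goes to the front of its list.
    void insert(const std::string &name, long int number) {
        std::forward_list<Entry> &chain = buckets_[index_of(name)];
        for (Entry &e : chain) {
            if (e.name == name) { e.number = number; return; }
        }
        chain.push_front(Entry{name, number});
    }

    // Returns a pointer to the stored number, or nullptr if the name is absent.
    const long int *find(const std::string &name) const {
        for (const Entry &e : buckets_[index_of(name)]) {
            if (e.name == name) { return &e.number; }
        }
        return nullptr;
    }

    // Removes the entry for name from its list; returns true if it was present.
    bool remove(const std::string &name) {
        std::forward_list<Entry> &chain = buckets_[index_of(name)];
        auto prev = chain.before_begin();
        for (auto it = chain.begin(); it != chain.end(); ++it, ++prev) {
            if (it->name == name) { chain.erase_after(prev); return true; }
        }
        return false;
    }

private:
    // Character-sum hash, reduced modulo the number of positions.
    std::size_t index_of(const std::string &name) const {
        std::size_t sum = 0;
        for (unsigned char c : name) { sum += c; }
        return sum % buckets_.size();
    }

    std::vector<std::forward_list<Entry>> buckets_;
};
```

With this layout the colliding keys spam and maps simply end up in the same list, and removing
one of them leaves the other untouched.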
Now, with both collision handling schemes explained, all that is left is to actually implement
this in the C programming language.
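Collisions are NOT possible to avoid in general and, as such, they really need to be handled.
Whenever there are more possible keys than positions in the table, at least two keys must share
a position, and a Perfect Hashing Function (one with no collisions at all) can only be built
when the full set of keys is known in advance.
The best we can hope for from a general-purpose function is that, if we had a Hash Table of
fixed size, say M, the probability of two distinct keys colliding would be about 1/M;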
There are, however, some tricks well known to Computer Scientists that help to
minimize collisions:
If the size of the Hash Table is a prime number, regular hashing functions tend to
perform better in the average case and, if the hashing function involves all bits of the
keys being hashed, a somewhat more uniform distribution can be achieved;
I hope you have enjoyed reading this as much as I've enjoyed writing it!
- Bruno Oliveira