
An introduction to Hashing and Hash Tables: Applications and relationship with the OS course project


ABSTRACT. Hashing is a very powerful technique which allows the programmer to implement insertion and look-up operations on a table in constant time in the average case. We will see how this can be achieved, and we shall study hash functions, collision handling schemes and also high-level implementations of this concept in depth.
Introduction: Starting off with a motivating problem and introducing high-level implementations of Hash Tables in C++ and Java
The obvious motivating problem to start this off with would be our OS course project, KOS.
The main reason we won't use it here is that the underlying data structure our project needs is what is known in Java as a ConcurrentHashMap, which can be described very naively as a regular HashMap that is suitable for being accessed concurrently. That would bring upon us harder OS concepts such as threads, semaphores and a plethora of other annoying system functions and primitives, with which I won't bother in this explanation.
With this point cleared up, the example I find most suitable for describing the main idea of hashing, and also its main purpose, is that of a phone book.
In our discussion, we shall consider a phone book to be a very long list of fixed size (to keep things simple), composed of records consisting of our friends' names and phone numbers.
So, we can have something like:
Record = <Name, Phone_Number>;
and for our Phone Book, we can now directly see it as:
Phone_Book = Array of Record
Obviously, the way in which we use this phone book couldn't be more intuitive: if I want to call my friend Ana, I simply look up the record with her name in my phone book and call her immediately; in other words, my phone book maps names to phone numbers.
In more technical language, it can be said that my phone book is actually a Map or HashMap, a structure that maps keys to values.
We shall assume, without any loss of generality, that all of our friends' names are distinct, as it would be a bit weird to have two people with the exact same name in our phone book.

However, we won't assume that the phone numbers are distinct: for example, we can associate a fixed landline number with all the people who live in the same house. This means that different keys can map to the same value.
The funny thing about Hash Tables is that the high-level concept is actually very simple and familiar to all of us, and yet it can be quite far from the actual implementation, which, as we will see, is far from simple and familiar.
To see how simple the high-level idea is, it's easier for us to implement it in some high-level language with proper built-in features, and, well, why not C++?
To use the built-in map containers available to us in C++, which are conveniently called Maps (the term comes from mapping, as described above), we need to include the header <map> in our code. (Strictly speaking, std::map is implemented as a balanced tree; the hash-table-based variant is std::unordered_map, available since C++11, but both expose the same bracket syntax, so either will do for this discussion.)
So, to declare a map which will be called phonebook and which will map names (strings) to numbers (long int), we only need to do:
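Something along these lines would do (a minimal sketch, not the original listing):

    #include <map>      // brings in std::map
    #include <string>

    int main() {
        // phonebook maps names (std::string) to numbers (long int)
        std::map<std::string, long int> phonebook;
        return 0;
    }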

So, we already have a phone book created in less than 10 lines of code!! Seems pretty simple, right?
Well, that's because it actually IS simple. If you devote some time to doing some online research on this matter, you will often find aliases for the name Hash Table which are just as suggestive: Dictionaries, Associative Arrays and, the most common, Maps.
To actually add values to the Map and fill it with our friends' numbers is also extremely easy; we simply use the notation:
phonebook[key] = value;
and we have just added a new entry to our phone book!!

Below you can see some entries which were added to the phonebook:
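(The names and numbers here are purely illustrative, not the original entries; a complete minimal program is shown for context.)

    #include <iostream>
    #include <map>
    #include <string>

    int main() {
        std::map<std::string, long int> phonebook;

        // Adding entries: key on the left, value on the right
        phonebook["Ana"]    = 912345678;
        phonebook["Bruno"]  = 913456789;
        phonebook["Carlos"] = 914567890;

        // Looking a friend up uses exactly the same bracket notation
        std::cout << "Ana: " << phonebook["Ana"] << std::endl;
        return 0;
    }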

So, as you can see, this is almost everything there is to it: keys are mapped to values in order to create and maintain a highly efficient data structure which allows for fast (constant time) insertion and look-up of values.
Although this may sound quite limited and specific, it is a data structure used in the most varied real-world situations, as well as in the design of more complex data structures like Sets, which have applications in string pattern matching, database querying and lots and lots of other areas. So, as you can see, it's a powerful and easy-to-master data structure that every programmer should know about.
But is this everything there is to it?
For the high-level programmer/user, yes: knowing these basic concepts will be enough for the data structure to be used correctly and efficiently.
However, the beauty of a concept usually lies underneath its outer shell, and this statement couldn't be more accurate as a description of what hashing is all about!!
As you will see in the following section, it is an absolutely incredible concept, which will definitely make you feel blessed about that moment when you decided to enroll in CS and which will make you feel God-like or something like that...

The concept of hashing: where all the trouble (and fun) begins
The concept of hashing, and more specifically the concept of a hash function, is the key one needs to understand in order to fully understand Hash Tables, and it is at the heart of the whole idea.

The best analogy for understanding hashing is to think of data compression.


The idea of data compression and/or encoding is fundamental to many applications in CS, from algorithms to file-system management, and the idea is almost always the same:
We have a very large amount of data, say N numbers.
However, we can only store them in an array whose size is M, where M < N.
It therefore becomes necessary to compress our N numbers so they fit in a table of smaller size.
This is exactly what a hash function does and, quoting Wikipedia:
A hash function is any algorithm that maps data of variable length to data of a fixed length. The values returned by a hash function are called hash values, hash codes, hash sums, checksums or simply hashes.
The topic of choosing and developing good hash functions for different data types would possibly be worth a thesis or two in its own right, and, as such, we will only discuss two of the most used hash functions, to compress integers and strings.
To learn more about hash functions and hashing methods, I'd say the Internet might be your best friend.
Now, let's move on to the most common hash functions for integers and strings.

A Hash Function for the integer (int) data type

For the integer data type, the best way we have of compressing a large range of keys into the smaller range [0..M-1] is to use the mod operator to wrap the range around.
As such, we can have something like:
hash(k) = k % M;
where M will be the size of our Hash Table (which is completely unrelated to the table of values we want to map, i.e. it is not related in any way to our Phone Book).
It's also very common in the literature to find the claim that M should be a prime number in order to avoid collisions (we still don't know what these are) as much as possible; however, this is, by itself, no restriction at all, and we can have a Hash Table of any size we want.
The size being prime is just a clever idea to attempt to distribute the hashed values uniformly along the table.
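As a tiny illustration of the idea (M = 11 here is an arbitrary choice for the sake of the example, not a value taken from the project):

    #include <stdio.h>

    #define M 11   /* arbitrary, illustrative table size */

    /* Compress an integer key into the range [0..M-1] */
    int hash_int(int k) {
        return k % M;
    }

    int main(void) {
        /* 7 % 11 = 7, 25 % 11 = 3, 1000 % 11 = 10 */
        printf("%d %d %d\n", hash_int(7), hash_int(25), hash_int(1000));
        return 0;
    }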

A Hash Function for the string data type

To compress a string using a good hash function (i.e. a function which manages to distribute the strings along the hash table as uniformly as possible) we need to make sure that two things happen:
1. We use all the characters of the string;
2. The hash value we obtain must be highly string dependent and, also, we should take it modulo M again, so it fits inside the table.
A simple C-style function to hash strings in a good way is (yes, this is the function of our project):
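The original listing is not reproduced here, but, based on the description that follows (summing the ASCII values of all the characters and taking the result modulo HT_SIZE), it would look roughly like this; HT_SIZE = 20 is an illustrative, non-prime value chosen to match the worked example below (433 % 20 = 13), not necessarily the project's actual size:

    #define HT_SIZE 20   /* illustrative, non-prime table size */

    /* Hash a null-terminated string by summing the ASCII values
       of all its characters and taking the result modulo HT_SIZE */
    unsigned int hash_string(const char *s) {
        unsigned int sum = 0;
        while (*s != '\0') {
            sum += (unsigned char)*s;
            s++;
        }
        return sum % HT_SIZE;
    }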

There are some important remarks to be made about this hash function.
The first is that the size of our Hash Table, defined by the constant HT_SIZE, is not a prime number, and obviously this is not a requisite for anything; we can have any table size we want... Also, and I'm mentioning it again, a prime table size is only a better way that we have found to minimize collisions as much as possible, but it is BY NO MEANS a mandatory thing.
The second thing worth mentioning about this function is that it works by summing up the ASCII values of all the characters in the string and then taking this result modulo M.
While this is definitely a good and simple way to go, it also unravels a pattern which will serve as motivation for the next and last section of this sort of tutorial.
For instance, let's look at the hash values of these two strings:
spam;
maps;
The ASCII values for the characters which make up these strings are:
a = 97, m = 109, s = 115 and p = 112.
If we sum these values up we get 433 and taking it modulo our HT_SIZE we obtain 13.
The main issue with this hash function is that it doesn't take into account the relative order of the characters in the string and, as a consequence, the strings
maps, spam, masp, mpas, mpsa, smap, sapm, samp, pasm, pmas, pams, pmsa, amps, amsp, apms, apsm, aspm, asmp, (possibly I've missed some permutations)...
all have the same hash value, despite being distinct!!
This is what a collision is:
When several different keys produce the same hash value under the same hash function, we say that a collision occurred.
This is the exact moment where the magic can be unleashed and where we will actually see two of the cleverest algorithms ever written, which make the maintenance of this data structure possible by handling these collisions while keeping the insertion and retrieval operations as efficient as possible.
We will study two schemes in detail, but we shall only implement the last one!

Handling collisions via linear probing and external chaining: A full C implementation of a phone book
I can't emphasize enough that the collisions we are about to discuss are NOT related in any way to the values we want to map; instead, they are an intrinsic property of the hash function we use, as well as of the keys which are being hashed...
In fact, writing this in C++ is perfectly legal and it all works perfectly:
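For instance, something along these lines (hypothetical entries, reusing the phonebook map declared earlier), where two distinct keys share the very same value:

    // Two different keys mapping to the very same value: perfectly legal,
    // and NOT what we mean by a collision.
    phonebook["Rui"]   = 218765432;   // the family's fixed line
    phonebook["Marta"] = 218765432;   // same house, same number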

Although we can't guess what sort of hash function C++ uses internally to handle strings (we can always look it up), we can be sure that, in any extensive use of maps, at some point collisions WILL occur and they need to be handled somehow.
Languages like C++ and Java already handle these collisions internally, so that all we need to do is use them and enjoy the simple interface these high-level implementations provide.
However, what if we needed to write our own hash table implementation, with collision handling included?
Well, it turns out we can, and obviously someone else thought about this first, so several schemes to handle collisions have been developed.
Some of those schemes are very simple, such as the linear probing scheme we will study next, while other schemes are so clever and complex that it is beyond the scope of this paper to discuss them.
Besides the most naive scheme, called Linear Probing, we will also focus our attention on a cooler scheme called External Chaining.
Additionally, collision handling via External Chaining will be fully implemented in C.

Collision handling via linear probing


Now that we have understood the difference between the Hash Table and the values to be mapped and retrieved, and now that we have identified the main problem with hash functions, it's time we do something about it and actually solve the problem, so that our Hash Table data structure can be assembled and work flawlessly.
The main idea of linear probing is that, if the k-th position of the Hash Table is occupied, then a linear probe (as in, a scan) is done until a free position is found, and the element is inserted in that new position.

Here is how this collision handling scheme, linear probing, works in practice:
When we try to insert an element into an occupied position, we scan the rest of the hash table until a free position is found, so that the element can be inserted there.
For example, consider a table of size 7 (indices 0 to 6) with hash(k) = k % 7, and the elements 20 and 76: both hash to index 6, so both would be inserted in the same position of the hash table.
76 is inserted correctly: the position with index 6 is initially free, so no collision occurs.
However, when we try to insert element 20, we find that element 76 is already there.
Now we need to scan the Hash Table until a free position is found, and the insertion attempts go like this:
There is an element at the position where 20 was to be inserted. Since in linear probing each position can only hold a single element, we start scanning the array, wrapping around to index 0. Index 0 is full, so we skip to index 1; index 1 is also full, so we skip to index 2; index 2 is also full, and so on, until index 4 (the first free index) is reached. This is linear probing.
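A minimal sketch of this insertion procedure, for integer keys and with -1 used as a sentinel for an empty slot (this is an illustration of the idea, not code from the project):

    #define TABLE_SIZE 7
    #define EMPTY      -1                        /* marks a free slot */

    int table[TABLE_SIZE] = { EMPTY, EMPTY, EMPTY, EMPTY, EMPTY, EMPTY, EMPTY };

    /* Insert a non-negative key using linear probing.
       Returns the index used, or -1 if the table is full. */
    int insert_linear_probing(int key) {
        int start = key % TABLE_SIZE;
        int i;
        for (i = 0; i < TABLE_SIZE; i++) {
            int idx = (start + i) % TABLE_SIZE;  /* wrap around the table */
            if (table[idx] == EMPTY) {
                table[idx] = key;
                return idx;
            }
        }
        return -1;                               /* no free position left */
    }

With TABLE_SIZE = 7, both 20 and 76 start probing at index 6, which reproduces the situation described above.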

Collision handling via external chaining


The next collision handling scheme we shall study is simultaneously more interesting, simpler to code and also very efficient.
Instead of having only a single element at each position, external chaining extends the idea of a Hash Table further: at each position of the Hash Table there is a pointer to a Linked List, so that whenever there is a collision, the elements which would collide are simply added at the same index of the Hash Table, and the Linked List with its head at that index is extended by one element.
This idea greatly improves upon the previous one, as the chances of having long linked lists are very small (they are measured by a parameter called the load factor of the Hash Table): in the average case, if we have N elements to insert into an M-sized Hash Table, the average length of the Linked List headed at each position is N/M. While it might require more memory overhead, it makes removing elements and searching a lot easier.

An important remark about this collision handling scheme is that each position of the Hash Table contains a pointer to a Linked List.
Collisions are handled by adding the element which would collide at the beginning of the linked list at the corresponding hash position.
We can also remove elements easily: we simply remove the element we want from the respective linked list.
Now, with both collision handling schemes explained, all that is left is to actually implement this in the C programming language.
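The full listing is not reproduced here, but a minimal sketch of an externally chained phone book in C, consistent with the scheme just described (insertion at the head of the list plus lookup; names, sizes and helper functions are illustrative, not the project's actual code), could look like this:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define HT_SIZE 20   /* illustrative, non-prime table size */

    /* One phone book record: a name, a number and a pointer to the
       next record that hashed to the same index */
    typedef struct node {
        char name[64];
        long number;
        struct node *next;
    } Node;

    /* The hash table itself: an array of linked-list heads */
    static Node *table[HT_SIZE];

    /* Sum the ASCII values of the characters, modulo HT_SIZE */
    static unsigned int hash_string(const char *s) {
        unsigned int sum = 0;
        while (*s != '\0') {
            sum += (unsigned char)*s;
            s++;
        }
        return sum % HT_SIZE;
    }

    /* Insert a new record at the head of the list at the hashed index */
    void insert(const char *name, long number) {
        unsigned int idx = hash_string(name);
        Node *n = (Node *)malloc(sizeof(Node));
        strncpy(n->name, name, sizeof(n->name) - 1);
        n->name[sizeof(n->name) - 1] = '\0';
        n->number = number;
        n->next = table[idx];   /* the old head becomes the second element */
        table[idx] = n;
    }

    /* Walk the list at the hashed index until the name matches */
    Node *lookup(const char *name) {
        Node *n = table[hash_string(name)];
        while (n != NULL && strcmp(n->name, name) != 0)
            n = n->next;
        return n;               /* NULL if the name is not in the phone book */
    }

    int main(void) {
        insert("Ana", 912345678L);
        insert("Bruno", 913456789L);
        Node *hit = lookup("Ana");
        if (hit != NULL)
            printf("%s -> %ld\n", hit->name, hit->number);
        return 0;
    }

Removal would follow the same pattern: hash the name, walk the corresponding list, unlink the matching node and free it.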

Conclusions and final remarks


After reading this paper, I hope the reader now has a broader and more intuitive perspective on what hashing is and on some of the small details which make it such an amazing subject to study.
The main ideas I tried to convey with this paper were:

- High-level implementations of Hash Tables are available in most programming languages, and they are very intuitive and easy to use;
- Such high-level implementations usually mask complexity that most of their users are not even aware of, namely the existence of collisions and the dependency of the hash function on the type of keys being hashed: a hash function for strings is completely different from a hash function for integers, for example;
- There are many possible ways of handling collisions, and we have studied, or at least exposed, two of them in some detail:
  - The Linear Probing handling scheme;
  - The External Chaining handling scheme;
- Collisions are NOT possible to avoid in general and, as such, they really need to be handled. This is because, for an arbitrary set of keys, there is no perfect hash function that avoids collisions altogether; the best a general-purpose hash function can do, for a Hash Table of fixed size M, is to spread the keys so uniformly that any two distinct keys collide with probability of only about 1/M;
- There are, however, some tricks well known to Computer Scientists by which collisions can be minimized: if the size of the Hash Table is a prime number, regular hash functions tend to perform better in the average case, and, if the hash function involves all bits of the keys being hashed, a somewhat more uniform distribution can be achieved;

I hope you have enjoyed reading this as much as I've enjoyed writing it!

- Bruno Oliveira
