
An introduction to Hashing and Hash Tables: Applications and relationship with the OS course project


ABSTRACT. Hashing is a very powerful technique which allows the programmer to implement insertion and look-up operations on a table in constant time in the average case. We will see how this can be achieved, and we shall study hash functions, collision handling schemes and also high-level implementations of this concept in depth.
Introduction: Starting off with a motivating problem and introducing high-level implementations of Hash Tables in C++ and Java
The obvious motivating problem to start this off with would be our OS course project, KOS.
The main reason we won't use it here is that the underlying data structure our project needs is what is known in Java as a ConcurrentHashMap, which can be described very naively as a regular HashMap that is suitable for being accessed concurrently. That would bring upon us harder OS concepts such as threads, semaphores and a plethora of other annoying system functions and primitives, with which I won't bother in this explanation.
With this point cleared up, the example I find most suitable for describing the main idea of hashing, and also its main purpose, is that of a phone book.
In our discussion, we shall consider a phone book to be a very long list of fixed size (to keep things simple), composed of records consisting of our friends' names and phone numbers.
So, we can have something like:
Record = <Name, Phone_Number>;
and for our Phone Book, we can now directly see it as:
Phone_Book = Array of Record
Obviously, the way in which we use this phone book couldn't be more intuitive: if I want to call my friend Ana, I simply look up the record with her name in my phone book and call her immediately; in other words, my phone book maps names to phone numbers.
In more technical language, it can be said that my phone book is actually a Map or HashMap, a structure that maps keys to values.
We shall assume, without any loss of generality, that all of our friends' names are distinct, as it would be a bit weird to have two people with the exact same name in our phone book.

However, we won't assume that the phone numbers are distinct: for example, we can associate a fixed landline number with all the people who live in the same house. This means that different keys can map to the same value.
The funny thing about Hash Tables is that the high-level concept is actually very simple and familiar to all of us, and yet it can be quite far from the actual implementation, which, as we will see, is far from simple and familiar.
To see how simple the high-level idea is, it's easier for us to implement it in some high-level language with proper built-in features, and, well, why not C++?
To use the built-in map containers available to us in C++, which are conveniently called Maps (the term comes from mapping, as described above), we need to include the header <map> in our code. (Strictly speaking, std::map is implemented as a balanced tree; the hash-table-based variant is std::unordered_map, available since C++11, but both expose the same bracket syntax, so either will do for this discussion.)
So, to declare a map which will be called phonebook and which will map names (strings) to numbers (long int), we only need to do:
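Something along these lines would do (a minimal sketch, not the original listing):

    #include <map>      // brings in std::map
    #include <string>

    int main() {
        // phonebook maps names (std::string) to numbers (long int)
        std::map<std::string, long int> phonebook;
        return 0;
    }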

So, we already have a phone book created in less than 10 lines of code!! Seems pretty simple, right?
Well, that's because it actually IS simple. If you devote some time to doing some online research on this matter, you will often find aliases for the name Hash Table which are just as suggestive: Dictionaries, Associative Arrays and, the most common, Maps.
To actually add values to the Map and fill it with our friends' numbers is also extremely easy; we simply use the notation:
phonebook[key] = value;
and we have just added a new entry to our phone book!!

Below you can see some entries which were added to the phonebook:
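(The names and numbers here are purely illustrative, not the original entries; a complete minimal program is shown for context.)

    #include <iostream>
    #include <map>
    #include <string>

    int main() {
        std::map<std::string, long int> phonebook;

        // Adding entries: key on the left, value on the right
        phonebook["Ana"]    = 912345678;
        phonebook["Bruno"]  = 913456789;
        phonebook["Carlos"] = 914567890;

        // Looking a friend up uses exactly the same bracket notation
        std::cout << "Ana: " << phonebook["Ana"] << std::endl;
        return 0;
    }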

So, as you can see, this is almost everything there is to it: keys are mapped to values in order to create and maintain a highly efficient data structure which allows for fast (constant time) insertion and look-up of values.
Although this may sound quite limited and specific, it is a data structure used in the most varied real-world situations, as well as in the design of more complex data structures like Sets, which have applications in string pattern matching, database querying and lots and lots of other areas. So, as you can see, it's a powerful and easy-to-master data structure that every programmer should know about.
But is this everything there is to it?
For the high-level programmer/user, yes: knowing these basic concepts will be enough for the data structure to be used correctly and efficiently.
However, the beauty of a concept usually lies underneath its outer shell, and this statement couldn't be more accurate as a description of what hashing is all about!!
As you will see in the following section, it is an absolutely incredible concept, which will definitely make you feel blessed about that moment when you decided to enroll in CS and which will make you feel God-like or something like that...

The concept of hashing: where all the trouble (and fun) begins
The concept of hashing, and more specifically the concept of a hash function, is the key one needs to understand in order to fully understand Hash Tables, and it is at the heart of the whole idea.

The best analogy for understanding hashing is to think of data compression.


The idea of data compression and/or encoding is fundamental to many applications in CS, from algorithms to file-system management, and the idea is almost always the same:
We have a very large amount of data, say N numbers.
However, we can only store them in an array whose size is M, where M < N.
It therefore becomes necessary to compress our N numbers so they fit in a table of smaller size.
This is exactly what a hash function does and, quoting Wikipedia:
A hash function is any algorithm that maps data of variable length to data of a fixed length. The values returned by a hash function are called hash values, hash codes, hash sums, checksums or simply hashes.
The topic of choosing and developing good hash functions for different data types would possibly be worth a thesis or two in its own right, and, as such, we will only discuss two of the most used hash functions, to compress integers and strings.
To learn more about hash functions and hashing methods, I'd say the Internet might be your best friend.
Now, let's move on to the most common hash functions for integers and strings.

A Hash Function for the integer (int) data type

For the integer data type, the best way we have of compressing a large range of keys into the smaller range [0..M-1] is to use the mod operator to wrap the range around.
As such, we can have something like:
hash(k) = k % M;
where M will be the size of our Hash Table (which is completely unrelated to the table of values we want to map, i.e. it is not related in any way to our Phone Book).
It's also very common in the literature to find the claim that M should be a prime number in order to avoid collisions (we still don't know what these are) as much as possible; however, this is, by itself, no restriction at all, and we can have a Hash Table of any size we want.
The size being prime is just a clever idea to attempt to distribute the hashed values uniformly along the table.
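As a tiny illustration of the idea (M = 11 here is an arbitrary choice for the sake of the example, not a value taken from the project):

    #include <stdio.h>

    #define M 11   /* arbitrary, illustrative table size */

    /* Compress an integer key into the range [0..M-1] */
    int hash_int(int k) {
        return k % M;
    }

    int main(void) {
        /* 7 % 11 = 7, 25 % 11 = 3, 1000 % 11 = 10 */
        printf("%d %d %d\n", hash_int(7), hash_int(25), hash_int(1000));
        return 0;
    }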

A Hash Function for the string data type

To compress a string using a good hash function (i.e. a function which manages to distribute the strings along the hash table as uniformly as possible) we need to make sure that two things happen:
1. We use all the characters of the string;
2. The hash value we obtain must be highly string dependent and, also, we should take it modulo M again, so it fits inside the table.
A simple C-style function to hash strings in a good way is (yes, this is the function of our project):
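The original listing is not reproduced here, but, based on the description that follows (summing the ASCII values of all the characters and taking the result modulo HT_SIZE), it would look roughly like this; HT_SIZE = 20 is an illustrative, non-prime value chosen to match the worked example below (433 % 20 = 13), not necessarily the project's actual size:

    #define HT_SIZE 20   /* illustrative, non-prime table size */

    /* Hash a null-terminated string by summing the ASCII values
       of all its characters and taking the result modulo HT_SIZE */
    unsigned int hash_string(const char *s) {
        unsigned int sum = 0;
        while (*s != '\0') {
            sum += (unsigned char)*s;
            s++;
        }
        return sum % HT_SIZE;
    }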

There are some important remarks to be made about this hash function.
The first is that the size of our Hash Table, defined by the constant HT_SIZE, is not a prime number, and obviously this is not a requisite for anything; we can have any table size we want... Also, and I'm mentioning it again, a prime table size is only a better way that we have found to minimize collisions as much as possible, but it is BY NO MEANS a mandatory thing.
The second thing worth mentioning about this function is that it works by summing up the ASCII values of all the characters in the string and then taking this result modulo M.
While this is definitely a good and simple way to go, it also unravels a pattern which will serve as motivation for the next and last section of this sort of tutorial.
For instance, let's look at the hash values of these two strings:
spam;
maps;
The ASCII values for the characters which make up these strings are:
a = 97, m = 109, s = 115 and p = 112.
If we sum these values up we get 433 and taking it modulo our HT_SIZE we obtain 13.
The main issue with this hash function is that it doesn't take into account the relative order of the characters in the string and, as a consequence, the strings
maps, spam, masp, mpas, mpsa, smap, sapm, samp, pasm, pmas, pams, pmsa, amps, amsp, apms, apsm, aspm, asmp, (possibly I've missed some permutations)...
all have the same hash value, despite being distinct!!
This is what a collision is:
When several different keys produce the same hash value under the same hash function, we say that a collision occurred.
This is the exact moment where the magic can be unleashed and where we will actually see two of the cleverest algorithms ever written, which make the maintenance of this data structure possible by handling these collisions while keeping the insertion and retrieval operations as efficient as possible.
We will study two schemes in detail, but we shall only implement the last one!

Handling collisions via linear probing and external chaining: A full C implementation of a phone book
I can't emphasize enough that the collisions we are about to discuss are NOT related in any way to the values we want to map; instead, they are an intrinsic property of the hash function we use, as well as of the keys which are being hashed...
In fact, writing this in C++ is perfectly legal and it all works perfectly:
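For instance, something along these lines (hypothetical entries, reusing the phonebook map declared earlier), where two distinct keys share the very same value:

    // Two different keys mapping to the very same value: perfectly legal,
    // and NOT what we mean by a collision.
    phonebook["Rui"]   = 218765432;   // the family's fixed line
    phonebook["Marta"] = 218765432;   // same house, same number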

Although we can't guess what sort of hash function C++ uses internally to handle strings (we can always look it up), we can be sure that, in any extensive use of maps, at some point collisions WILL occur and they need to be handled somehow.
Languages like C++ and Java already handle these collisions internally, so that all we need to do is use them and enjoy the simple interface these high-level implementations provide.
However, what if we needed to write our own hash table implementation, with collision handling included?
Well, it turns out we can, and obviously someone else thought about this first, so several schemes to handle collisions have been developed.
Some of those schemes are very simple, such as the linear probing scheme we will study next, while other schemes are so clever and complex that it is beyond the scope of this paper to discuss them.
Besides the most naive scheme, called Linear Probing, we will also focus our attention on a cooler scheme called External Chaining.
Additionally, collision handling via External Chaining will be fully implemented in C.

Collision handling via linear probing


Now that we have understood the difference between the Hash Table and the values to be mapped and retrieved, and now that we have identified the main problem with hash functions, it's time we do something about it and actually solve the problem, so that our Hash Table data structure can be assembled and work flawlessly.
The main idea of linear probing is that, if the k-th position of the Hash Table is occupied, then a linear probe (as in, a scan) is done until a free position is found, and the element is inserted in that new position.

Here is how this collision handling scheme, linear probing, works in practice:
When we try to insert an element into an occupied position, we scan the rest of the hash table until a free position is found, so that the element can be inserted there.
For example, consider a table of size 7 (indices 0 to 6) with hash(k) = k % 7, and the elements 20 and 76: both hash to index 6, so both would be inserted in the same position of the hash table.
76 is inserted correctly: the position with index 6 is initially free, so no collision occurs.
However, when we try to insert element 20, we find that element 76 is already there.
Now we need to scan the Hash Table until a free position is found, and the insertion attempts go like this:
There is an element at the position where 20 was to be inserted. Since in linear probing each position can only hold a single element, we start scanning the array, wrapping around to index 0. Index 0 is full, so we skip to index 1; index 1 is also full, so we skip to index 2; index 2 is also full, and so on, until index 4 (the first free index) is reached. This is linear probing.
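A minimal sketch of this insertion procedure, for integer keys and with -1 used as a sentinel for an empty slot (this is an illustration of the idea, not code from the project):

    #define TABLE_SIZE 7
    #define EMPTY      -1                        /* marks a free slot */

    int table[TABLE_SIZE] = { EMPTY, EMPTY, EMPTY, EMPTY, EMPTY, EMPTY, EMPTY };

    /* Insert a non-negative key using linear probing.
       Returns the index used, or -1 if the table is full. */
    int insert_linear_probing(int key) {
        int start = key % TABLE_SIZE;
        int i;
        for (i = 0; i < TABLE_SIZE; i++) {
            int idx = (start + i) % TABLE_SIZE;  /* wrap around the table */
            if (table[idx] == EMPTY) {
                table[idx] = key;
                return idx;
            }
        }
        return -1;                               /* no free position left */
    }

With TABLE_SIZE = 7, both 20 and 76 start probing at index 6, which reproduces the situation described above.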

Collision handling via external chaining


The next collision handling scheme we shall study is simultaneously more interesting, simpler to code and also very efficient.
Instead of having only a single element at each position, external chaining extends the idea of a Hash Table further: at each position of the Hash Table there is a pointer to a Linked List, so that whenever there is a collision, the elements which would collide are simply added at the same index of the Hash Table, and the Linked List with its head at that index is extended by one element.
This idea greatly improves upon the previous one, as the chances of having long linked lists are very small (they are measured by a parameter called the load factor of the Hash Table): in the average case, if we have N elements to insert into an M-sized Hash Table, the average length of the Linked List headed at each position is N/M. While it might require more memory overhead, it makes removing elements and searching a lot easier.

An important remark about this collision handling scheme is that each position of the Hash Table contains a pointer to a Linked List.
Collisions are handled by adding the element which would collide at the beginning of the linked list at the corresponding hash position.
We can also remove elements easily: we simply remove the element we want from the respective linked list.
Now, with both collision handling schemes explained, all that is left is to actually implement this in the C programming language.
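The full listing is not reproduced here, but a minimal sketch of an externally chained phone book in C, consistent with the scheme just described (insertion at the head of the list plus lookup; names, sizes and helper functions are illustrative, not the project's actual code), could look like this:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define HT_SIZE 20   /* illustrative, non-prime table size */

    /* One phone book record: a name, a number and a pointer to the
       next record that hashed to the same index */
    typedef struct node {
        char name[64];
        long number;
        struct node *next;
    } Node;

    /* The hash table itself: an array of linked-list heads */
    static Node *table[HT_SIZE];

    /* Sum the ASCII values of the characters, modulo HT_SIZE */
    static unsigned int hash_string(const char *s) {
        unsigned int sum = 0;
        while (*s != '\0') {
            sum += (unsigned char)*s;
            s++;
        }
        return sum % HT_SIZE;
    }

    /* Insert a new record at the head of the list at the hashed index */
    void insert(const char *name, long number) {
        unsigned int idx = hash_string(name);
        Node *n = (Node *)malloc(sizeof(Node));
        strncpy(n->name, name, sizeof(n->name) - 1);
        n->name[sizeof(n->name) - 1] = '\0';
        n->number = number;
        n->next = table[idx];   /* the old head becomes the second element */
        table[idx] = n;
    }

    /* Walk the list at the hashed index until the name matches */
    Node *lookup(const char *name) {
        Node *n = table[hash_string(name)];
        while (n != NULL && strcmp(n->name, name) != 0)
            n = n->next;
        return n;               /* NULL if the name is not in the phone book */
    }

    int main(void) {
        insert("Ana", 912345678L);
        insert("Bruno", 913456789L);
        Node *hit = lookup("Ana");
        if (hit != NULL)
            printf("%s -> %ld\n", hit->name, hit->number);
        return 0;
    }

Removal would follow the same pattern: hash the name, walk the corresponding list, unlink the matching node and free it.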

Conclusions and final remarks


After reading this paper, I hope the reader now has a broader and more intuitive perspective on what hashing is and on some of the small details which make it such an amazing subject to study.
The main ideas I tried to convey with this paper were:

- High-level implementations of Hash Tables are available in most programming languages, and they are very intuitive and easy to use;
- Such high-level implementations usually mask complexity that most of their users are not even aware of, namely the existence of collisions and the dependency of the hash function on the type of keys being hashed: a hash function for strings is completely different from a hash function for integers, for example;
- There are many possible ways of handling collisions, and we have studied, or at least exposed, two of them in some detail:
  - The Linear Probing handling scheme;
  - The External Chaining handling scheme;
- Collisions are NOT possible to avoid in general and, as such, they really need to be handled. This is because, for an arbitrary set of keys, there is no perfect hash function that avoids collisions altogether; the best a general-purpose hash function can do, for a Hash Table of fixed size M, is to spread the keys so uniformly that any two distinct keys collide with probability of only about 1/M;
- There are, however, some tricks well known to Computer Scientists by which collisions can be minimized: if the size of the Hash Table is a prime number, regular hash functions tend to perform better in the average case, and, if the hash function involves all bits of the keys being hashed, a somewhat more uniform distribution can be achieved;

I hope you have enjoyed reading this as much as I've enjoyed writing it!

- Bruno Oliveira
