Hashing
Introduction
This week you will learn about another data structure, the hash table, and about hashing
techniques. A hash table is a data structure that offers very fast insertion and searching.
When you first hear about them, hash tables sound almost too good to be true. No matter
how many data items there are, insertion and searching (and sometimes deletion) can take
close to constant time: O(1).
Learning outcome
After completing this lesson, you will be able to describe hashing and hash
tables. In particular, you should be able to:
Define a hash table
Describe collisions
Suppose we need to store records keyed by phone number and look them up quickly. With
arrays and linked lists, we need to search in a linear fashion, which can be costly in
practice. If we use an array and keep the data sorted, then a phone number can be found
using binary search, but insert and delete operations become costly because we have to
maintain sorted order.
With a balanced binary search tree, we get moderate search, insert and delete times. All of
these operations can be guaranteed to take O(log n) time.
Another solution is to use a direct access table, where we make a big array and use phone
numbers as indexes into the array. An entry in the array is NIL if the phone number is not
present; otherwise the entry stores a pointer to the record corresponding to that phone number.
In terms of time complexity this solution is the best of all: every operation takes O(1)
time. For example, to insert a phone number, we create a record with the details of the given
phone number, use the phone number as the index, and store the pointer to the created record
in the table.
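To make the idea concrete, here is a minimal sketch of a direct access table. The class name DirectAccessTable, the String records, and the 4-digit key range are assumptions chosen for illustration; they do not come from the lesson.

```java
// Minimal direct access table sketch: the key itself is the array index.
// Assumption: keys are short 4-digit numbers (0..9999) and records are plain Strings.
class DirectAccessTable {
    private final String[] records;   // a null entry plays the role of NIL

    DirectAccessTable(int keyRange) {
        records = new String[keyRange];   // one cell for every possible key
    }

    void insert(int phone, String record) { records[phone] = record; }  // O(1)

    String search(int phone) { return records[phone]; }                 // O(1), null if absent

    void delete(int phone) { records[phone] = null; }                   // O(1)

    public static void main(String[] args) {
        DirectAccessTable table = new DirectAccessTable(10_000);
        table.insert(4521, "Alice");
        System.out.println(table.search(4521));   // Alice
        System.out.println(table.search(1234));   // null
    }
}
```

Every operation is a single array access, but the array needs one cell for every possible key whether or not that key is ever stored.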
This solution has many practical limitations. The first problem is that the extra space
required is huge. For example, if a phone number has n digits, we need O(m * 10^n) space for
the table, where m is the size of a pointer to a record. Another problem is that an integer
type in a programming language may not be able to store n digits.
Due to the above limitations, a direct access table cannot always be used. Hashing is a solution
that can be used in almost all such situations and, in practice, performs extremely well
compared to the data structures above (array, linked list, and balanced BST). With hashing
we get O(1) search time on average (under reasonable assumptions) and O(n) in the worst case.
ITE 2142 – Data Structures & Algorithms Week 08
What is hashing?
Hashing is an improvement over the direct access table. The idea is to use a hash function that
converts a given phone number or any other key to a smaller number, and to use that smaller
number as an index into a table called a hash table. You will learn about this in detail in the
next section.
For a human user of a hash table, a lookup is essentially instantaneous. Hash tables are so fast
that computer programs typically use them when they need to look up tens of thousands of items
in less than a second (as in spelling checkers). Hash tables are significantly faster than trees
and relatively easy to program.
Hash tables have several disadvantages. They are based on arrays, but arrays are difficult to
expand once they have been created. For some kinds of hash tables, performance may
degrade catastrophically when the table becomes too full, so the programmer needs to have
a fairly accurate idea of how many data items will need to be stored.
8.2 Hashing
Think about a dictionary. If you want to put every word of an English-language dictionary
into your computer’s memory so that the words can be accessed quickly, a hash table is a good
choice. Let’s say we want to store a 50,000-word English-language dictionary in main
memory. You would like every word to occupy its own cell in a 50,000-cell array, so you
can access the word using an index number. This will make access very fast. But what is
the relationship of the index numbers to the words? Given the word persistent, for
example, how do we find its index number?
How big an array are we talking about for an English-language dictionary? If we only have
50,000 words, you might assume our array should have approximately this many elements.
However, it may need an array with about twice as many cells, that is, 100,000 elements. Thus
we look for a way to squeeze numeric keys in a range of 0 to more than 7,000,000,000,000 into
the range 0 to 100,000. A simple approach is to use the modulo operator (%), which finds the
remainder when one number is divided by another. This kind of mapping is done by a function
called a hash function. A hash function hashes (converts) a number in a large range into a
number in a small range. This small range corresponds to the index numbers in an
array. In the following example, the hashing function can be defined as:
small number = large number mod small range
i.e. 6 = 196 mod 10
An array into which data is inserted using a hash function is called a hash table. In the
above diagram, the array with the small range is the hash table. A hash function
should be simple and quick to compute. If the hash function is slow, the speed of the
hash table degrades.
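The modulo-based hash function described above can be sketched in a few lines of Java. The class name ModuloHash and the use of a long for the large number are assumptions made for this illustration, and the sketch assumes non-negative keys.

```java
// Sketch of a modulo hash function: small number = large number mod small range.
class ModuloHash {
    // Maps a key from a large range to an index in 0..tableSize-1.
    // Assumes largeNumber >= 0 (in Java, % yields a negative result for a negative operand).
    static int hash(long largeNumber, int tableSize) {
        return (int) (largeNumber % tableSize);
    }

    public static void main(String[] args) {
        System.out.println(hash(196, 10));                        // 6
        System.out.println(hash(7_000_000_012_345L, 100_000));    // 12345
    }
}
```

The function costs one division regardless of how large the key is, which is what keeps the hash table fast.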
8.3 Collision
If we can define a one-to-one mapping from the keys we actually store to the cells of the small range,
so that no two keys share an index, then the hash function is called a perfect hashing function.
Think about the array (hash table) that we are going to use for the English-language dictionary. Perhaps
you want to insert the word melioration into the array. You hash the word to obtain its index number, but
find that the cell at that number is already occupied by the word demystify, which happens to hash to the
exact same number. This situation is called a collision.
If we cannot define a perfect hashing function, i.e. the mapping from elements of the large range to
elements of the small range is many-to-one, we must deal with collisions. It can be depicted as follows.
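A collision is easy to reproduce with plain integer keys and the modulo hash function. In the sketch below, the class name CollisionDemo, the helper collide(), and the sample keys are assumptions used only for illustration; keys are assumed to be non-negative.

```java
// Sketch: two different keys collide when they hash to the same index.
class CollisionDemo {
    static int hash(int key, int tableSize) {
        return key % tableSize;            // assumes key >= 0
    }

    // Returns true when two distinct keys map to the same cell.
    static boolean collide(int a, int b, int tableSize) {
        return a != b && hash(a, tableSize) == hash(b, tableSize);
    }

    public static void main(String[] args) {
        System.out.println(collide(196, 26, 10));   // true: both keys hash to 6
        System.out.println(collide(196, 27, 10));   // false: 6 versus 7
    }
}
```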
Remember that we have specified an array with twice as many cells as data items. Thus perhaps half
the cells are empty. One approach, when a collision occurs, is to search the array in some
systematic way for an empty cell and insert the new item there, instead of at the index specified by
the hash function. This approach is called open addressing.
A second approach is to create an array that consists of linked lists of words instead of the words
themselves. Then when a collision occurs, the new item is simply inserted in the list at that
index. This is called separate chaining.
The delete() method finds an existing item. Once the item is found, delete() writes over it with the
special data item nonItem.
The find() method first calls hashFunc() to hash the search key to obtain the
index number.
The hashFunc() method applies the % operator to the search key and the array size.
As hashVal steps through the array, it eventually reaches the end. When this happens we want it to
wrap around to the beginning.
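The methods described above can be sketched as a small open-addressing table that probes one cell at a time (linear probing). This is an illustrative sketch, not the lesson's own listing: the class name LinearProbeTable, the int keys, and the marker values EMPTY and NON_ITEM are assumptions, and keys are assumed to be non-negative so they never clash with the markers.

```java
import java.util.Arrays;

// Sketch of open addressing with linear probing.
class LinearProbeTable {
    private static final int EMPTY = -1;      // assumed marker: cell never used
    private static final int NON_ITEM = -2;   // assumed marker: cell held a deleted item
    private final int[] cells;

    LinearProbeTable(int size) {
        cells = new int[size];
        Arrays.fill(cells, EMPTY);
    }

    int hashFunc(int key) { return key % cells.length; }   // assumes key >= 0

    // Stores the key in the first free cell at or after its hashed index.
    int insert(int key) {
        int hashVal = hashFunc(key);
        for (int probes = 0; probes < cells.length; probes++) {
            if (cells[hashVal] == EMPTY || cells[hashVal] == NON_ITEM) {
                cells[hashVal] = key;
                return hashVal;
            }
            hashVal = (hashVal + 1) % cells.length;         // step, wrapping to the start
        }
        throw new IllegalStateException("hash table is full");
    }

    // Returns the index holding the key, or -1 if it is not present.
    int find(int key) {
        int hashVal = hashFunc(key);
        for (int probes = 0; probes < cells.length && cells[hashVal] != EMPTY; probes++) {
            if (cells[hashVal] == key) return hashVal;
            hashVal = (hashVal + 1) % cells.length;
        }
        return -1;
    }

    // Overwrites the key with NON_ITEM so that later probe sequences are not cut short.
    boolean delete(int key) {
        int index = find(key);
        if (index < 0) return false;
        cells[index] = NON_ITEM;
        return true;
    }

    public static void main(String[] args) {
        LinearProbeTable table = new LinearProbeTable(10);
        table.insert(196);                       // hashed index 6
        table.insert(26);                        // 6 is taken, lands in 7
        table.insert(9);                         // hashed index 9
        table.insert(19);                        // 9 is taken, wraps around to 0
        table.delete(196);
        System.out.println(table.find(26));      // 7
        System.out.println(table.find(19));      // 0
    }
}
```

Marking a deleted cell instead of emptying it matters: find() stops at the first empty cell, so clearing cell 6 outright would hide the key 26 stored just after it.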
... and so on, i.e. subsequent probes go to ‘x+1^2’, ‘x+2^2’, ‘x+3^2’ and so on (that is, x+1, x+4,
x+9, ...). The following figure shows some quadratic probes.
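The quadratic probe sequence can be sketched as follows, taking x to be the index produced by the hash function. The class name QuadraticProbe, the Integer[] table, and the sample keys are assumptions for illustration, and keys are assumed to be non-negative.

```java
// Sketch of quadratic probing: attempt i examines cell (x + i*i) mod tableSize.
class QuadraticProbe {
    // Index examined on the given attempt; attempt 0 is the hashed index x itself.
    static int probeIndex(int key, int attempt, int tableSize) {
        int x = key % tableSize;                       // assumes key >= 0
        return (x + attempt * attempt) % tableSize;    // wraps around the end of the array
    }

    // Stores the key in the first free cell of its probe sequence.
    // Returns the index used, or -1 if no free cell was found in tableSize attempts.
    static int insert(Integer[] table, int key) {
        for (int attempt = 0; attempt < table.length; attempt++) {
            int index = probeIndex(key, attempt, table.length);
            if (table[index] == null) {
                table[index] = key;
                return index;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        Integer[] table = new Integer[10];
        System.out.println(insert(table, 23));   // 3
        System.out.println(insert(table, 33));   // 4  (3 is taken, 3 + 1 = 4)
        System.out.println(insert(table, 43));   // 7  (3 and 4 are taken, 3 + 4 = 7)
    }
}
```

Note that each probe is measured from the original index x, not from the previously probed cell.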
, where constant is prime and smaller than the array size. For example, stepsize = 5 - (key % 5).
In this function, for any given key all the steps will be the same size, but different keys
generate different step sizes. These two functions can be implemented in Java as follows.
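A minimal sketch of the two functions, and of a probe loop that uses them, is given below. This scheme is usually called double hashing. The class name DoubleHashTable, the Integer[] cell array, the table size of 11, and the sample keys are assumptions for illustration; the names hashFunc1 and hashFunc2 are also assumed. Keys are assumed to be non-negative.

```java
// Sketch of double hashing: a second hash function chooses the step size.
class DoubleHashTable {
    private final Integer[] cells;

    DoubleHashTable(int size) { cells = new Integer[size]; }   // a prime size is assumed

    int hashFunc1(int key) { return key % cells.length; }      // starting index

    int hashFunc2(int key) { return 5 - key % 5; }             // step size 1..5, never 0

    // Stores the key in the first free cell of its probe sequence.
    // Returns the index used, or -1 if no free cell was found.
    int insert(int key) {
        int hashVal = hashFunc1(key);
        int stepSize = hashFunc2(key);
        for (int probes = 0; probes < cells.length; probes++) {
            if (cells[hashVal] == null) {
                cells[hashVal] = key;
                return hashVal;
            }
            hashVal = (hashVal + stepSize) % cells.length;     // same step size on every probe
        }
        return -1;
    }

    public static void main(String[] args) {
        DoubleHashTable table = new DoubleHashTable(11);
        System.out.println(table.insert(12));   // 1
        System.out.println(table.insert(23));   // 3  (1 is taken, step size 2)
        System.out.println(table.insert(34));   // 2  (1 is taken, step size 1)
    }
}
```

The keys 23 and 34 both start at index 1, but they follow different probe sequences because their step sizes differ.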
In open addressing, collisions are resolved by looking for an empty cell in the hash table. A different
approach is to create a linked list at each array index in the hash table. A data item is hashed using a
hash function as before, and the item is stored in the linked list at that index. Other items that hash
to the same array index are added to the same linked list.
Separate chaining is conceptually somewhat simpler than the various probe schemes used in open
addressing. However, the code is longer because it must include the mechanism for the linked
lists, usually in the form of an additional class. A Java implementation of separate
chaining looks like this.
class Link
{                                   // (could be other items)
    public int iData;               // data item
    public Link next;               // next link in list
// -------------------------------------------------------------
    public Link(int it)             // constructor
    { iData = it; }
// -------------------------------------------------------------
} // end class Link
class SortedList
{
    private Link first;             // ref to first list item
// -------------------------------------------------------------
    public SortedList()             // constructor (no return type)
    { first = null; }
// -------------------------------------------------------------
    public void insert(Link theLink) // insert link, in order
    {
        int key = theLink.iData;
        Link previous = null;       // start at first
        Link current = first;
                                    // until end of list,
        while(current != null && key > current.iData)
        {                           // or current > key,
            previous = current;
            current = current.next; // go to next item
        }
        if(previous==null)          // if beginning of list,
            first = theLink;        // first --> new link
        else                        // not at beginning,
            previous.next = theLink; // prev --> new link
        theLink.next = current;     // new link --> current
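    } // end insert()
// -------------------------------------------------------------
    public void delete(int key)     // delete link
    {                               // (does nothing if key is absent)
        Link previous = null;       // start at first
        Link current = first;
                                    // until end of list,
        // NOTE: the listing breaks off at this point. The rest of delete(),
        // and the find() and displayList() methods that HashTable calls,
        // are an assumed reconstruction.
        while(current != null && key != current.iData)
        {                           // or key == current,
            previous = current;
            current = current.next; // go to next link
        }
        if(current == null)         // key not in list,
            return;                 // nothing to delete
        if(previous==null)          // if beginning of list,
            first = first.next;     // delete first link
        else                        // not at beginning,
            previous.next = current.next; // bypass current link
    } // end delete()
// -------------------------------------------------------------
    public Link find(int key)       // find link
    {
        Link current = first;       // start at first
                                    // until end of list,
        while(current != null && current.iData <= key)
        {                           // or key too small,
            if(current.iData == key) // is this the link?
                return current;     // found it, return link
            current = current.next; // go to next item
        }
        return null;                // didn't find it
    } // end find()
// -------------------------------------------------------------
    public void displayList()
    {
        Link current = first;       // start at beginning of list
        while(current != null)      // until end of list,
        {
            System.out.print(current.iData + " "); // display data
            current = current.next; // move to next link
        }
        System.out.println();
    }
} // end class SortedList
////////////////////////////////////////////////////////////////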
class HashTable
{
    private SortedList[] hashArray; // array of lists
    private int arraySize;
// -------------------------------------------------------------
    public HashTable(int size)      // constructor
    {
        arraySize = size;
        hashArray = new SortedList[arraySize]; // create array
        for(int j=0; j<arraySize; j++)         // fill array
            hashArray[j] = new SortedList();   // with lists
    }
// -------------------------------------------------------------
    public void displayTable()
    {
        for(int j=0; j<arraySize; j++)  // for each cell,
        {
            System.out.print(j + ". "); // display cell number
            hashArray[j].displayList(); // display list
        }
    }
// -------------------------------------------------------------
    public int hashFunc(int key)    // hash function
    {
        return key % arraySize;
    }
// -------------------------------------------------------------
    public void insert(Link theLink) // insert a link
    {
        int key = theLink.iData;
        int hashVal = hashFunc(key);        // hash the key
        hashArray[hashVal].insert(theLink); // insert at hashVal
    } // end insert()
// -------------------------------------------------------------
    public void delete(int key)     // delete a link
    {
        int hashVal = hashFunc(key);    // hash the key
        hashArray[hashVal].delete(key); // delete link
    } // end delete()
// -------------------------------------------------------------
    public Link find(int key)       // find link
    {
        int hashVal = hashFunc(key); // hash the key
        Link theLink = hashArray[hashVal].find(key); // get link
        return theLink;              // return link
    }
// -------------------------------------------------------------
} // end class HashTable
Activity 8.1
Suppose you have the data set “12, 24, 45, 99, 181, 101” to store in a hash table
of size 10. Assume that the hash function is h(x) = x mod 10.
Store the given values in the hash table, using the quadratic probing technique whenever
you have to deal with a collision.
Summary
In this lesson you have learnt about hashing and hash tables. You learned how
to create a hash table and how to deal with collisions. You got to know different
approaches for resolving collisions and how they differ from each other.
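1 + 1^2 = 2. Since index 2 is not available, next we try 1 + 2^2 = 5, which is also occupied (by 45),
and then 1 + 3^2 = 10, which wraps around to index 0 (10 mod 10). Since index 0 is available, we
store 101 at index 0.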