Week 11 Lec 01 Hash Table
(PROG20799)
Hash table
M Mohiuddin
Parts of this lecture are adapted from Simon Hood’s lecture with his explicit
and kind consent.
Course:PROG20799, week11 1
Lecture Overview
• Students’ queries
• Hash Tables
• The “Search and Insert” Problem
• The Hash Function
• Resolving Collisions
Hash table
• There is one search algorithm we’ve left for last
Search and Insert
• In programming we often find ourselves working
with some version of the “search and insert”
problem
• Given a list of items, search for an item in the list
as efficiently as possible
• If the item is not found, insert it into the list
• Items can be integers, chars or whatever
Five ways to do search and insert
• Store the list in an array and add new items at the end. This
implies a sequential search must be used to find items,
however it’s easy to add them.
2 3 6 1 8 7 9 34 … 5
• Store the list in an array in sorted order and add new items
where they belong. This allows for faster searching, but
inserting new items is cumbersome.
1 2 3 4 5 7 8 9 12 14 16 18 21 23
• Store the list in an unsorted linked list. Addition of items is
easy, but we must traverse the entire list sequentially to find if
a given item exists.
• And…
Search and insert methods (continued)
• Store the list in a sorted linked list. Addition of
items is easy, and we can use our fast
algorithms to search for a given item: a good
choice
• Store the list in a binary search tree. Searching
is built-in and quick, and insertion is as simple
as in a linked list: a better choice!
Hash table
• The best choice, though, is a hash table. Constant
time is tough to beat
• Search and insert in a hash table take O(1) time
on average
• The idea is roughly that we store our items in
an array and put each item in its appropriate
place as we add it
• If we’re talking about integers, we can store each
one at the index equal to its value
• But that can be very wasteful! How?
Hash functions
• A simple way to reduce a number to a more
manageable size (so we can insert it in a
reasonably sized array index) is to use the
modulus operator
• For example, if we mod all our integer values by
100, we will always have a value between 0 and
99
• Even if our data is 9999999, (9999999 % 100) is
still just 99
• In general, index = key % 100. For example,
56478876 % 100 = 76
Hash functions
• Adding 1 to the result gives locations 1 to 100 instead
of 0 to 99. For example, (751 % 100) + 1 is 52, so we
store 751 in a[52]
• Likewise, (95422 % 100) + 1 is 23, so we store
95422 in a[23]
• So we’ve inserted 751 at index 52 and 95422 at
index 23
• If we want to search for a given value (say 751 ),
we perform the hash function on our search
value and then go straight to that index (52) –
there it is!
Example problem
Company ABC has 52 employees, and their IDs vary from 5000 to 8000.
Queries: is there an employee with ID 5208, 7609 or 8000? What about 5260?
Size of dataset = 52, spread of dataset = 3000, minimum data value = 5000

1st approach: simplest and fastest, but the most wasteful
array size: largest value + 1
index    key (i.e. data)
0
1
...
5208     5208
...
7609     7609

2nd approach
array size = spread + 1
Hash function: index = data - 5000
index    key (i.e. data)
0
1
...
208      5208
...
2609     7609

3rd approach
array size = 52 + 1
Hash function: index = data % 52
index    key (i.e. data)
0
1
...
8        5208
...
17       7609
...
44       8000
Example problem Solution
Company ABC has 52 employees, and their IDs vary from 5000 to 8000.
Queries: is there an employee with ID 5008, 7609 or 8000?
Size of dataset = 52, spread of dataset = 3K, minimum key value = 5K

1st approach: simplest, yet the most wasteful
array size: largest value + 1
index    key (i.e. data)
0
1
...
5008     5008
...
7600     7600

2nd approach
array size = spread of the data + 1
Hash function: index = key - 5K
index    key (i.e. data)
0
1
...
8        5008
...

3rd approach
array size = size of data + 1 = 53
Hash function: index = key % 52
index    key (i.e. data)
0
1
...
16       5008
17       7609, 5009
Collisions
• What if we try to insert 751 and 2351?
• We have a collision: both values, transformed
through our hash function, should be stored in
index 52
• How can we overcome this limitation?
• We insert the number at the next empty space
Resolution of collisions!
• We can use a simple while loop to iterate through
our hash table until we find the given number –
just make sure to start at the proper place
index: 0  1  2  3  4  5  6  7  8  9  10 11 12
(_ marks an empty location in the tables below)
index = 64 % 12 = 4, collisions = 12
52, 33, 84, 43, 16, 59, 31, 23, 61, 64, 67, 80
key:   84 61 _  _  52 16 64 43 31 33 67 59 23 80 _  _  _  _  _  _
index: 0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19
key:   84 61 80 _  52 16 64 43 31 33 67 59 23
index: 0  1  2  3  4  5  6  7  8  9  10 11 12
index = 64 % 20 = 4, collisions = 4
52, 33, 84, 43, 16, 59, 31, 23, 61, 64, 67, 80
key:   80 61 _  43 84 23 64 67 _  _  _  31 52 33 _  _  16 _  _  59
index: 0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19
Exercise
• Ex. 1: Do the same problem, i.e. use the same
hash function and receive the same numbers in
the same order, but with an array of 20
elements.
• Ex. 2: Now change the hash function to:
H(key) = key % 20
Hash function: C implementation
#include <stdio.h>
#include <stdlib.h>
#define MaxNumbers 50 // maximum number of distinct keys stored
#define N 100 // number of table locations, num[1..N]
#define Empty 0 // marks a free location, so keys must be nonzero
int main() {
    FILE *in = fopen("numbers.in", "r");
    if (in == NULL) { // stop if the input file cannot be opened
        printf("Cannot open numbers.in\n");
        exit(1); }
    int key, loc, num[N + 1];
    for (int j = 1; j <= N; j++) num[j] = Empty;
    int distinct = 0;
    while (fscanf(in, "%d", &key) == 1) { // keys are assumed to be positive
        loc = key % N + 1; // initial hash location
        while (num[loc] != Empty && num[loc] != key)
            loc = loc % N + 1; // next location, wrapping from N back to 1
        if (num[loc] == Empty) { // key is not in the table
            if (distinct == MaxNumbers) {
                printf("Table full: %d not added\n", key);
                exit(1); }
            num[loc] = key; // if table not full then key is added and
            distinct++; // number of distinct entries incremented
        }
    }
    printf("There are %d distinct numbers\n", distinct);
    fclose(in);
    return 0;
}
Deleting an item from a hash table
• Let’s say we have the following table (_ marks an empty location):
key:   84 23 61 _  52 16 _  43 31 33 _  59
index: 0  1  2  3  4  5  6  7  8  9  10 11
• Recall that 43 and 31 hashed initially to the same location 7
(location = key % 12). Suppose 43 is to be deleted.
• With 43 removed and location 7 left empty, the table becomes:
key:   84 23 61 _  52 16 _  _  31 33 _  59
index: 0  1  2  3  4  5  6  7  8  9  10 11
• What happens if we delete 43 and then look for 31?
• Solution: for deleted entries, force a value that is different
from the one assigned for empty, for example -1:
key:   84 23 61 _  52 16 _  -1 31 33 _  59
index: 0  1  2  3  4  5  6  7  8  9  10 11
• We still search for the key or the first empty location, but
ignore locations marked as deleted!
• However, a new key can be inserted in a location marked
deleted.
Filling locations marked deleted
key:   84 23 61 _  52 16 _  43 31 33 _  59
index: 0  1  2  3  4  5  6  7  8  9  10 11
(_ marks an empty location; location 7, which held 43, is marked deleted)
• If we now search for 55, we will check locations
7, 8, 9 and 10 and then decide that 55 is not in the
table and must be inserted!
• We can set num[10] = 55, but should we not
insert it at the first location marked deleted?
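Hash function: index = key % 12
index = 55 % 12 = 7
Inserting 55 at the first empty location (10):
key:   84 23 61 _  52 16 _  -1 31 33 55 59
index: 0  1  2  3  4  5  6  7  8  9  10 11
Inserting 55 at the first location marked deleted (7):
key:   84 23 61 _  52 16 _  55 31 33 _  59
index: 0  1  2  3  4  5  6  7  8  9  10 11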
Find or insert in a hash table
// find or insert 'key' in the hash table num[1..N]
// assumes N, Empty and Deleted are defined, and H(key) = key % N + 1
void findOrInsert(int key, int num[]) {
    int loc = key % N + 1; // H(key)
    int deletedLoc = 0; // 0 means no deleted location seen yet
    while (num[loc] != Empty && num[loc] != key) {
        if (deletedLoc == 0 && num[loc] == Deleted)
            deletedLoc = loc; // storing the first location marked deleted
        loc = loc % N + 1; }
    if (num[loc] == Empty) { // key not found
        if (deletedLoc != 0)
            loc = deletedLoc; // reuse the deleted location
        num[loc] = key; }
    else printf("%d found in location %d\n", key, loc);
}
Delete a key from Hash table
// delete 'key' from the hash table num[1..N]; assumes Deleted is defined, e.g. as -1
void deleteKey(int key, int num[]) {
    int loc = key % N + 1;
    while (num[loc] != Empty && num[loc] != key) {
        loc = loc % N + 1; } // % N circles back to location 1 at the end of the array
    if (num[loc] == Empty) {
        printf("\nKey not found\n");
        return; }
    // the loop stopped on a non-empty location, so num[loc] == key here
    num[loc] = Deleted;
    printf("\nKey found at location %d and deleted\n", loc);
}
Hash function for strings
• What if we want to store words or chars? We
can’t mod a word, can we?
• A simple way to overcome this limitation is to
convert each char in the word to an int. Then
add the ints representing each letter
together, and mod plus 1 as normal
• Unfortunately, this means that anagrams collide:
mate, meat and team all have the same value
Hash function for strings
• We might consider assigning weights to each
letter based on the letter’s position in the word
• The main goal is to avoid collisions: if we have a
collision, our algorithm runs more slowly
• We might assign 3 to the first letter, 5 to the
second, 7 to the third, and so on
• Make sure your hash function is simple! Why?
• Mate = 3 * ASCII(M) + 5 * ASCII(a) + 7 * ASCII(t) + 9 * ASCII(e) =
• Team = 3 * ASCII(T) + 5 * ASCII(e) + 7 * ASCII(a) + 9 * ASCII(m) =
Hash function for strings
To keep anagrams from colliding, a simple snippet of code
to hash words might then be
int j = 0, wordNum = 0; // j must start at the first character
int weight = 3;
while (word[j] != '\0') {
    wordNum += weight * word[j++]; // multiply each character by its weight
    weight += 2;
}
location = wordNum % n + 1;
Linear probing for collision resolution
key:   84 23 61 _  52 16 _  43 31 33 _  59
index: 0  1  2  3  4  5  6  7  8  9  10 11
Drawbacks of linear probing with unity step size
• In a fuller table we may have to move quite far
before we find a free spot
• If we insert values at indices where they don’t
belong, we increase the probability of
future collisions
• If, however, we shift the index by some other
step size, we may shorten the runs of contiguously
filled indices
• Think of relatively prime numbers! A step size that shares no
factor with the table size still visits every location
Linear probing with double hashing
loc = num % n + 1 // this gives the initial hash location
For linear probing with single hashing, the increment k is a CONSTANT
For linear probing with double hashing, k is:
k = num % (n - 2) + 1 // this gives the increment for this key
It is strongly recommended that n and n - 2 be twin
primes such as 31/29, 103/101 or 1021/1019, and that
n be just less than the size of the array.
Double hashing implementation
// returns 1 if the key was added, or 0 if it was already in the table
// assumes N, Empty and Deleted are defined, plus a global counter: int collisions = 0;
int findOrInsertDouble(int key, int num[]) {
    int loc = key % N + 1; // initial hash location
    int k = key % (N - 2) + 1; // increment for this key
    int deletedLoc = 0;
    while (num[loc] != Empty && num[loc] != key) {
        collisions++;
        if (deletedLoc == 0 && num[loc] == Deleted)
            deletedLoc = loc; // storing the first location marked deleted
        loc = loc + k;
        if (loc > N)
            loc = loc - N; } // wrap around to the start of the table
    if (num[loc] == Empty) { // key not found
        if (deletedLoc != 0)
            loc = deletedLoc;
        num[loc] = key;
        return 1; // one key added
    }
    else {
        // printf("\n%d found on location %d\n", key, loc);
        return 0; // no key added
    }
}
Performance of hash table
Load factor
The concept of chaining
[Figure: keys k1 and k2 collide, H(k1) = H(k2), and are linked together in a chain at that location]
Exercise for hashtable with chaining
Insert 52, 33, 84, 43, 16, 59, 31, 23, 61, 64, 67, 80
index = key % 12
index: 0  1  2  3  4  5  6  7  8  9  10 11
Exercise Solution for hashtable with chaining
52, 33, 84, 43, 16, 59, 31, 23, 61, 64, 67, 80
index = key % 12
index: 0  1  2  3  4  5  6  7  8  9  10 11
chain: 84 61 _  _  52 _  _  43 80 33 _  59
                   16       31          23
                   64       67
(_ marks an empty chain)
Hash table with chaining implementation
struct node {
    int key, age;
    char name[100];
    struct node *next; }; // next node in the same chain
struct hash {
    struct node *head; // first node of this bucket's chain
    int count; }; // number of nodes in the chain
struct hash *hashTable = NULL; // hash table global declaration
……… // in the main function, hash table definition
hashTable = (struct hash *) calloc(n, sizeof (struct hash)); // n empty buckets
insertToHash function
void insertToHash(int key, char *name, int age) {
    int hashIndex = key % eleCount; // hashing function; eleCount is assumed to be the number of buckets
    struct node *newnode = createNode(key, name, age);
    /* head of list for the bucket with index "hashIndex" */
    if (!hashTable[hashIndex].head) { // empty bucket: new node becomes the head
        hashTable[hashIndex].head = newnode;
        hashTable[hashIndex].count = 1;
        return; }
    newnode->next = hashTable[hashIndex].head; // link in front of the old head
    // update head of the list and no of nodes in the current bucket
    hashTable[hashIndex].head = newnode;
    hashTable[hashIndex].count++; }
Extra stuff
• Data (keys) vary from 10M to 50M
• Spread of the data is 40M
• # of records = size of the data = 3K
• Example key = 15,003,760

No hash function (wasteful, but no collisions)
Array size = 50M + 1
For 3K data, an array size of 50M is very wasteful
Example: key = 15,003,760, index = 15,003,760

Simple hash function
Array size = 40M + 1
For 3K data, an array size of 40M is still very wasteful
Hash function: index = key - 10M
Example: key = 15,003,760, index = 5,003,760

Normal hash function (space efficient, but with collisions)
Minimum array size = 3K + 1, usually 6K
Hash function: index = key % 6K
Example: key = 15,003,760, index = 3760