Chapter 10: Hash Tables
Main Points
Introduction
Hash Table, Hash Function, Collisions
Handling Collisions
Exercise
Let’s Look at this
Hash Tables: An introduction
Goal: Is it possible to design a search of O(1), i.e. a constant search time no matter where the element is located in the list?
Example: Consider the list of employees in a small company. Each of the 100 employees has an ID number in the range 0-99. If we store the elements (employee records) in an array, then each employee's ID number is an index to the array element where the record is stored.
There is a one-to-one correspondence between the element key and the array index.
What if the company wants to use a 5-digit ID number as the primary key?
Compare the size of the array with the number of employees.
One solution to the previous problem
What if we keep the array size down to the size that we will actually be using (100 elements) and use just the last two digits of the key to identify each employee?
For example, the employee with key 45678 will be stored in the array element with index 78, and the employee with key 23456 will be stored at array index 56.
We are looking for a way to convert a 5-digit number into a two-digit array index: some function needs to do the transformation: a hash function that lets us use a hash table (an array).
Hash Table, Hash Function, Collisions
A Hash Table is a data structure in which keys are mapped to array positions by a hash function.
A Hash Function is a function which, when applied to the key, produces an integer which can be used as an address (index) in the hash table.
For our previous example:
int HashFunction(int id_number) {
    return id_number % table_size;
}
Hash Table, Hash Function, Collisions: continued…
Suppose that the hash table contains the records of two employees with IDs 45678 and 23456, respectively.
We need to add another employee whose ID is 34878!
The array element with index 78 already holds a value.
When more than one element tries to occupy the same array position, we have a collision.
A collision is the condition resulting when two or more keys produce the same hash location.
Issues surrounding Hashing
The big challenge: which HF to use? It should be:
Easy to compute. If the hash algorithm is too inefficient, it will overshadow the advantages of the technique.
Able to distribute entries uniformly through the HT slots.
Able to minimize collisions.
Other examples
Implementing a Dictionary:
HF: Sum the ASCII codes of the letters, then take mod n (n is the HT size); see the sketch after this list.
raw and war would then have the same hash value!
How to implement a spelling checker?
Create a HT for all the words in a dictionary.
When you encounter a word whose spelling you want to check, just hash it and see if it exists in the table.
If it does, then you've spelled it correctly.
If not, then you haven't.
This allows you to look up a word in O(1) time rather than O(n) time, which, in a dictionary on the order of 800,000 words, is a big time saver.
Cryptography: use a hash function to protect your passwords (store the hash rather than the password itself).
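A rough C++ sketch of the ASCII-sum hash above (the table size 101 and the name asciiSumHash are illustrative, not from the slides); it shows why anagrams such as raw and war collide:

#include <iostream>
#include <string>

// Illustrative table size; any n gives the same anagram collision.
const int TABLE_SIZE = 101;

// Sum the ASCII codes of the letters, then take the result mod n.
int asciiSumHash(const std::string& word) {
    int sum = 0;
    for (char c : word) sum += static_cast<int>(c);
    return sum % TABLE_SIZE;
}

int main() {
    // "raw" and "war" contain the same letters, so their ASCII sums
    // are equal and they hash to the same slot: a collision.
    std::cout << asciiSumHash("raw") << "\n";
    std::cout << asciiSumHash("war") << "\n";   // same value as above
}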
Hash Function: Examples
HFs generally take records whose key values come from a large range and store those records in a HT with a relatively small number of slots; how well this works depends a lot on the key set.
Some HF examples:
Division:
F(x) = x mod m; the best value for m is a prime number.
Mid-Square:
Take the middle K digits of x².
Folding (sketched below):
Given a key x1x2…xr, split into two-digit pieces:
F1(x1x2…xr) = x1x2 + x3x4 + … + xr-1xr
F2(x1x2…xr) = x2x1 + x4x3 + … + xrxr-1
Example: x = 251367
F1(x) = 25 + 13 + 67 = 105
F2(x) = 52 + 31 + 76 = 159
Truncating:
F(x) = the last K digits of x, or the first K digits.
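A minimal C++ sketch of the folding scheme F1 above, assuming the key is split into two-digit pieces from the right (the name foldHash is illustrative):

#include <iostream>

// Folding: split the decimal key into two-digit pieces and add them up.
// With x = 251367 the pieces are 25, 13 and 67.
int foldHash(int x) {
    int sum = 0;
    while (x > 0) {
        sum += x % 100;   // take the last two digits
        x /= 100;         // drop them
    }
    return sum;
}

int main() {
    std::cout << foldHash(251367) << "\n";  // prints 105 (= 25 + 13 + 67)
}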
Summarizing what we have learnt so far…
To add or retrieve an element from the hash table:
Algorithm to add:
Add(key, value) {
    index = hash(key);
    hashTable[index] = value;
}
Algorithm to get a value:
dataT getValue(key) {
    index = hash(key);
    return hashTable[index];
}
Will these algorithms always work? We can make them work if we know all possible search keys, an appropriate table size, and a perfect hash function.
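A minimal C++ sketch of these two algorithms, assuming 5-digit keys, a 100-slot table, and a key set whose last two digits never collide (the name hashKey and the stored string values are illustrative):

#include <iostream>
#include <string>

const int TABLE_SIZE = 100;
std::string hashTable[TABLE_SIZE];

int hashKey(int key) { return key % TABLE_SIZE; }

void add(int key, const std::string& value) {
    hashTable[hashKey(key)] = value;       // safe only because no two keys collide
}

std::string getValue(int key) {
    return hashTable[hashKey(key)];        // O(1): one hash, one array access
}

int main() {
    add(45678, "Alice");
    add(23456, "Bob");
    std::cout << getValue(45678) << "\n";  // prints Alice
}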
Collision, Handling Collisions
The two ways of dealing with collisions are:
Chaining: use linked lists.
Open Addressing: linear probing, quadratic probing, and double hashing.
Problem
Consider a hash table with 10 slots (indices 0-9) and the following hash function:
H(x) = x² % 10
What we want is to insert the keys 1 through 9.
Solution 1: Use Separate Chaining
Colliding records are chained together in separate linked lists.
HT slots don't hold data; rather, each slot stores a pointer to its synonyms' linked list.
If a collision happens, insert into the corresponding linked list: O(1) (always insert at the head).
Search/Delete?
Drawback: use of another data structure, and a linear search through the linked lists.
Advantages of Separate Chaining
Simple collision handling.
No overflow: we can store more elements than the hash table size.
Deletion is done from the linked list.
Example
Separate Chaining: an illustration
Assume that we want to add a list of students to a hash table using their IDs. The following program shows how collisions are resolved using separate chaining (a sketch is given below).
We are using a table of 10 cells.
We are using ID % Size as the hash function.
Implement this code.
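A minimal C++ sketch of the illustration described above, assuming a student record holds only an ID and a name (the sample IDs and names are made up):

#include <iostream>
#include <string>

const int SIZE = 10;

struct Student {
    int id;
    std::string name;
    Student* next;          // link to the next synonym in the chain
};

Student* table[SIZE] = {nullptr};   // each slot heads a linked list

int hashId(int id) { return id % SIZE; }   // the hash function ID % Size

// Insert at the head of the chain: O(1); collisions simply lengthen the list.
void insert(int id, const std::string& name) {
    int slot = hashId(id);
    table[slot] = new Student{id, name, table[slot]};
}

// Search walks the chain of the hashed slot.
Student* search(int id) {
    for (Student* p = table[hashId(id)]; p != nullptr; p = p->next)
        if (p->id == id) return p;
    return nullptr;
}

int main() {
    insert(23456, "Amal");
    insert(45676, "Badr");             // 45676 % 10 == 23456 % 10: a collision
    Student* s = search(45676);
    if (s) std::cout << s->name << " found in slot " << hashId(s->id) << "\n";
}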
Solution 2: Open addressing
Linear Probing
In this method, fi(k) is linear: fi(k) = i, so the i-th probe examines slot (h(k) + i) mod m.
Linear probing insert algorithm (a C++ sketch follows this slide):
If the table is full: error
probe = h(k)
While table[probe] is occupied:
    probe = (probe + 1) mod m
table[probe] = k
Search(k) algorithm:
Compute h(k) and look at HT[h(k)]:
If empty: the element does not exist.
If full: compare to k; if equal, return it, else:
    Loop (a 'circular' linear search) through successive slots.
    If found, return it.
    If an empty slot is found, the element does not exist.
Drawback: clustering
Elements tend to cluster around occupied slots, resulting in very long probe sequences. A solution: quadratic probing.
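A minimal C++ sketch of the insert and search algorithms above, assuming integer keys, a 10-slot table, and -1 as the empty marker (all illustrative choices):

#include <iostream>

const int M = 10;
const int EMPTY = -1;
int table[M] = {EMPTY, EMPTY, EMPTY, EMPTY, EMPTY,
                EMPTY, EMPTY, EMPTY, EMPTY, EMPTY};

int h(int k) { return k % M; }

// Probe h(k), h(k)+1, h(k)+2, ... (mod m) until a free slot is found.
bool insert(int k) {
    int probe = h(k);
    for (int i = 0; i < M; ++i) {               // at most m probes, else the table is full
        if (table[probe] == EMPTY) { table[probe] = k; return true; }
        probe = (probe + 1) % M;
    }
    return false;                               // table is full
}

// Circular search: stop at the key, at an empty slot, or after m probes.
int search(int k) {
    int probe = h(k);
    for (int i = 0; i < M; ++i) {
        if (table[probe] == EMPTY) return -1;   // key cannot be in the table
        if (table[probe] == k) return probe;    // found: return its slot
        probe = (probe + 1) % M;
    }
    return -1;
}

int main() {
    insert(18); insert(58);           // both hash to 8; 58 lands in slot 9
    insert(19);                       // hashes to 9 (taken), wraps to slot 0
    std::cout << search(19) << "\n";  // prints 0
}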
Consider the following example:
H(x) = x % 10
F(i) = i
Find 58 (# tries?)
Insert 19
Find 19 (# tries?)
Linear probing: drawbacks
As long as the table is big enough, an empty cell can always be found, but the time to do so can get quite large.
Moreover, even if the table is relatively empty, blocks of occupied cells start forming:
primary clustering.
Linear Probing: Search Analysis
We want to compute the average number of probes for a successful and an unsuccessful search in this hash table, where H(x) = x mod 11 and the keys 20, 30, 2, 13, 25, 24, 10, 9 were inserted with linear probing:
slot 0: 9, slot 1: empty, slot 2: 2, slot 3: 13, slot 4: 25, slot 5: 24, slot 6: empty, slot 7: empty, slot 8: 30, slot 9: 20, slot 10: 10
Case 1: Successful search for 20, 30, 2, 13, 25, 24, 10, 9
Avg = (1 + 1 + 1 + 2 + 2 + 4 + 1 + 3) / 8 = 15/8
(fewer than two probes per search)
Case 2: Unsuccessful search: we are searching for the keys 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
Avg = (2 + 1 + 1 + 4 + 3 + 2 + 1 + 1 + 5 + 3 + 1) / 11 = 24/11
Solution 3: Quadratic Probing
Eliminates primary clustering by probing more widely separated slots.
fi(k) = i² (the probe increment is quadratic rather than linear).
If a collision happens at HT[h(k)], look successively at h(k)+1², h(k)+2², … (mod m) till an empty cell is found.
Example
H(x) = x mod 10
Insert: 3, 5, 13, 24, 33, 45, 54 (a sketch of this insertion sequence follows).
Where does 54 go?
Resulting table before placing 54: slot 3: 3, slot 4: 13, slot 5: 5, slot 6: 45, slot 7: 33, slot 8: 24 (slots 0, 1, 2 and 9 are empty).
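A minimal C++ sketch of quadratic probing for this example (the empty marker -1 and the fixed limit of m probes are illustrative simplifications):

#include <iostream>

const int M = 10;
const int EMPTY = -1;
int table[M];

// Probe h(k), h(k)+1^2, h(k)+2^2, ... (mod m) until a free slot is found.
bool insert(int k) {
    int home = k % M;
    for (int i = 0; i < M; ++i) {
        int probe = (home + i * i) % M;
        if (table[probe] == EMPTY) { table[probe] = k; return true; }
    }
    return false;   // gave up after m probes
}

int main() {
    for (int i = 0; i < M; ++i) table[i] = EMPTY;
    int keys[] = {3, 5, 13, 24, 33, 45, 54};
    for (int k : keys) insert(k);
    for (int i = 0; i < M; ++i)
        std::cout << i << ": " << table[i] << "\n";   // 54 ends up in slot 0
}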
Solution 4: Double Hashing
Avoids both primary and secondary clustering.
Idea:
The probe step should depend on the key instead of being the same for all keys.
Use another hash function: the increment is defined by a second function.
The second HF should:
Depend on the key.
Be different from the first! Why?
Never return zero.
Double Hashing Insert Algorithm
If the table is full: error
probe = h1(k), offset = h2(k)
While table[probe] is occupied:
    probe = (probe + offset) mod m
table[probe] = k
The probe sequence goes to probe, probe + offset, probe + 2*offset, probe + 3*offset, …
Double Hashing (cont.)
Ideal second functions are of this form:
h2(key) = Const - (key % Const)
where Const is a prime number less than the HT size.
Example (Const = 5) is sketched below.
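A minimal C++ sketch of double hashing with h1(x) = x % 10 and h2(x) = 5 - (x % 5) (Const = 5 as above; the 10-slot table and empty marker are illustrative):

#include <iostream>

const int M = 10;        // in practice m itself should be prime so all slots can be reached
const int EMPTY = -1;
int table[M];

int h1(int k) { return k % M; }
int h2(int k) { return 5 - (k % 5); }   // never returns zero

// Probe h1(k), h1(k)+h2(k), h1(k)+2*h2(k), ... (mod m).
bool insert(int k) {
    int probe = h1(k), offset = h2(k);
    for (int i = 0; i < M; ++i) {
        if (table[probe] == EMPTY) { table[probe] = k; return true; }
        probe = (probe + offset) % M;
    }
    return false;   // no free slot reached
}

int main() {
    for (int i = 0; i < M; ++i) table[i] = EMPTY;
    insert(15); insert(25);   // both hash to slot 5; 25 then probes (5 + 5) % 10 = 0
    std::cout << table[5] << " " << table[0] << "\n";   // prints 15 25
}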
Illustration of linear probing and double hashing
Load Factor and Hash Tables
Analysis of Separate Chaining
Load factor λ definition:
the ratio of the number of elements (N) in a hash table to the hash table size,
i.e. λ = N / TableSize.
The average length of a chain is also λ.
For chaining, λ is not bounded by 1; it can be > 1 (e.g. the hash table size is 10 but N = 100, so λ = 10).
So, to search for or delete an element: the time to compute the hash function plus the length of the chain, i.e. λ, so search and delete are O(1 + λ) = O(λ).
Separate Chaining Performance
Search cost is proportional to the length of the chain.
Worst case: all keys hash to the same chain.
When the hash table is too large:
many empty slots.
When the hash table is too small:
you end up having long chains.
Linear Probing Performance
Insert and search costs depend on the length of the cluster.
The average length of a cluster is λ.
Worst case: all keys hash to the same cluster.
When the size of the array is too large:
many HT entries will be empty.
When the size is too small:
clusters!
Typical choice: size = 2 * N_elements.
Analysis of double hashing and quadratic probing
Remember, the load factor λ = (number of elements in the HT) / (HT size).
This means 1 - λ represents the fraction of locations in the HT that are empty.
So the expected number of probes to find an empty location (i.e. an unsuccessful search) is 1 / (1 - λ); for example, at λ = 0.5 that is 2 probes, and at λ = 0.9 it is 10 probes.
Even though double hashing avoids the clustering of linear probing and quadratic probing, its estimated efficiency was proved to be the same as that of quadratic probing.
Hash Tables: A Summary
A hash table is based on an array
The range of key values is usually greater than the
size of the array
A key value is hashed to an array index by a hash
function
The hashing of a key to an already filled cell is called a
collision
Collisions can be handled using open addressing or separate chaining.
In open addressing, data items that hash to a full array cell are placed in another cell in the array.
In separate chaining, each array element consists of a linked list.
Hash Tables: A Summary
In linear probing the step size is always 1
The number of tries required to find an item is
called the probe length
In linear probing, contiguous sequences of filled cells appear: primary clustering.
Quadratic probing eliminates primary
clustering but suffers from less severe
secondary clustering
In double hashing the step depends on the
key and is obtained from a second hash
function