Lecture 27 - Hashing
• Tables
• Hash tables
• Chaining
Introduction
• Suppose:
• The range of keys is 0..m-1
• Keys are distinct
• The idea:
• Set up an array T[0..m-1] in which
• T[i] = x if x ∈ T and key[x] = i
• T[i] = NULL otherwise
• This is called a direct-address table
• Operations take O(1) time!
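• A minimal C sketch (not from the slides) of a direct-address table; the record type item and the range bound M are assumptions for illustration.

#include <stdlib.h>

#define M 100                        /* assumed key range 0..M-1 */

typedef struct item {
    int key;                         /* key in 0..M-1, keys are distinct */
    /* satellite data would go here */
} item;

item *T[M];                          /* direct-address table; empty slots are NULL */

/* Each operation is O(1): the key itself is the array index. */
void direct_insert(item *x)  { T[x->key] = x; }
item *direct_search(int k)   { return T[k]; }
void direct_delete(item *x)  { T[x->key] = NULL; }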
Direct Addressing
Advantages with Direct Addressing
Hash Tables
Compared to direct addressing
• Advantage: requires storage proportional to the number of keys actually stored (rather than to the size of the key universe), and searching still takes O(1) average time.
• Comparison
Collision
Resolving Collisions
• Solution 1: Chaining
Collision by Chaining
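• A minimal C sketch of chaining (an illustration, not the slides' code): each table slot holds the head of a linked list of the keys that hash to it; the table size M and the hash function h are assumptions, and keys are assumed non-negative.

#include <stdlib.h>

#define M 13                           /* assumed table size */

typedef struct node {
    int key;
    struct node *next;
} node;

node *T[M];                            /* each slot is the head of a chain (NULL = empty) */

int h(int k) { return k % M; }         /* assumed hash function */

/* Insert at the head of the chain for slot h(k): O(1). */
void chain_insert(int k)
{
    node *x = malloc(sizeof *x);
    x->key = k;
    x->next = T[h(k)];
    T[h(k)] = x;
}

/* Search walks the chain for slot h(k): O(length of that chain). */
node *chain_search(int k)
{
    node *x;
    for (x = T[h(k)]; x != NULL; x = x->next)
        if (x->key == k) return x;
    return NULL;
}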
Analysis of Chaining
• Nature of keys
• Hash functions
• Division method
• Multiplication method
Hash function
• Rule 1: The hash value is fully determined by the data being hashed.
• Rule 3: The hash function uniformly distributes the data across the entire set of possible hash values.
• Rule 4: The hash function generates very different hash values for similar strings.
An example of a hash function
int hash(char *str, int table_size)
{
    int sum = 0;
    // sum up all the characters in the string
    for (; *str; str++) sum += *str;
    // return sum mod table_size
    return sum % table_size;
}
Analysis of example
• Rule 4: Breaks. Hash the string “CAT”, then hash the string “ACT”: the character sums are identical, so the hash values are the same. A slight variation in the string should result in a different hash value, but with this function it often does not.
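• A small driver (hypothetical) makes the Rule 4 failure concrete: “CAT” and “ACT” contain the same characters, so their character sums, and therefore their hash values, coincide.

#include <stdio.h>

int hash(char *str, int table_size);   /* the character-sum hash defined above */

int main(void)
{
    int m = 101;                       /* assumed table size */
    printf("%d\n", hash("CAT", m));    /* 'C'+'A'+'T' = 67+65+84 = 216, 216 % 101 = 14 */
    printf("%d\n", hash("ACT", m));    /* same characters, same sum: also 14 */
    return 0;
}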
Methods to create hash functions
• Division method
• Multiplication method
Division method
• Advantage?
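• A sketch of the division method, h(k) = k mod m. Its advantage is speed: a single modulo operation. The value m = 701 below is only an assumed example; m is usually chosen as a prime not too close to an exact power of 2.

/* Division method: interpret the key as an integer and reduce it mod m. */
unsigned int hash_division(unsigned int k, unsigned int m)
{
    return k % m;
}

/* Example: hash_division(123456, 701) maps key 123456 to slot 123456 % 701 = 80. */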
Restrictions on value of m
Open Addressing
• All elements are stored in the hash table itself (so no pointers are involved, unlike in chaining).
Insertion in hash table
HASH_INSERT(T,k)
1 i ← 0
2 repeat j ← h(k, i)
3 if T[j] = NIL
4 then T[j] ← k
5 return j
6 else i ← i + 1
7 until i = m
8 error “hash table overflow”
Searching from Hash table
HASH_SEARCH(T,k)
1 i ← 0
2 repeat j ← h(k, i)
3 if T[j] = k
4 then return j
5 i ← i + 1
6 until T[j] = NIL or i = m
7 return NIL
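• A C sketch of the two routines above for the linear-probing case, h(k, i) = (k + i) mod m; the sentinel NIL = -1, the table size M, and the identity auxiliary hash are assumptions for illustration.

#define M 11                               /* assumed table size */
#define NIL (-1)                           /* assumed marker for an empty slot */

int T[M];

void hash_init(void)
{
    for (int i = 0; i < M; i++) T[i] = NIL;
}

int h(int k, int i) { return (k + i) % M; }   /* linear probing with h'(k) = k */

/* HASH_INSERT: probe h(k,0), h(k,1), ... until an empty slot is found. */
int hash_insert(int k)
{
    for (int i = 0; i < M; i++) {
        int j = h(k, i);
        if (T[j] == NIL) { T[j] = k; return j; }
    }
    return -1;                             /* hash table overflow */
}

/* HASH_SEARCH: follow the same probe sequence; stop at an empty slot or after m probes. */
int hash_search(int k)
{
    for (int i = 0; i < M; i++) {
        int j = h(k, i);
        if (T[j] == k) return j;
        if (T[j] == NIL) return -1;        /* k cannot appear later in the sequence */
    }
    return -1;                             /* not found */
}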
• Worst case for inserting a key is Θ(n)
• Worst case for searching is Θ(n)
• The algorithm assumes that keys are not deleted once they are inserted
• Deleting a key from an open-addressing table is difficult; instead we can mark entries in the table as removed (introducing a new class of entries: full, empty, and removed)
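• A sketch of the full/empty/removed idea (names chosen for illustration): deletion only marks the slot, searches treat removed slots as occupied and keep probing, and insertions may reuse them.

enum slot_state { EMPTY, FULL, REMOVED };

struct slot {
    enum slot_state state;                 /* EMPTY by default (zero-initialized) */
    int key;
};

struct slot T[11];                         /* assumed table size */

/* Deleting only marks the slot, so probe sequences that passed
   through it during insertion are not broken for later searches. */
void hash_delete(int j)
{
    T[j].state = REMOVED;
}

/* Search: keep probing past REMOVED slots; only EMPTY ends the probe sequence.
   Insert: a REMOVED slot may be reused to store a new key. */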
Clustering
• Even with a good hash function, linear probing has its problems:
• The position of the initial mapping i_0 of key k is called the home position of k.
• When several insertions map to the same home position, they end
up placed contiguously in the table. This collection of keys with
the same home position is called a cluster.
• As clusters grow, the probability that a key will map to the middle
of a cluster increases, increasing the rate of the cluster’s growth.
This tendency of linear probing to place items together is known as
primary clustering.
• As these clusters grow, they merge with other clusters forming even
bigger clusters which grow even faster.
Quadratic probing
h(k, i) = (h′(k) + c1·i + c2·i²) mod m for i = 0, 1, …, m − 1.
• Leads to secondary clustering (a milder form of clustering)
• The clustering effect can be reduced by increasing the order of the probing function (e.g. cubic). However, the hash function then becomes more expensive to compute
• But again, for two keys k1 and k2, h(k1, 0) = h(k2, 0) implies h(k1, i) = h(k2, i) for all i
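• A sketch of a quadratic probe with c1 = c2 = 1 (the constants are an assumption; any choice that visits enough distinct slots works).

/* Quadratic probing: h(k, i) = (h'(k) + c1*i + c2*i*i) mod m, here with c1 = c2 = 1. */
int quadratic_probe(int k, int i, int m)
{
    int home = k % m;                      /* h'(k): assumed auxiliary hash */
    return (home + i + i * i) % m;
}

/* Two keys with the same home position still share the entire probe
   sequence, which is the secondary clustering described above. */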
Double Hashing
• Recall that in open addressing the sequence of probes follows
i_{j+1} = (i_j + c) mod m for j ≥ 0
• We can solve the problem of primary clustering in linear probing by having the keys that map to the same home position use different probe sequences. In other words, different values of c should be used for different keys.
• Double hashing refers to the scheme of using another hash function to compute c.
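• A sketch of double hashing in its common textbook form h(k, i) = (h1(k) + i·h2(k)) mod m; the second hash supplies the per-key step c, must never be 0, and should be relatively prime to m (below m is assumed prime, so any step in 1..m-1 works).

#define M 13                                   /* assumed prime table size */

int h1(int k) { return k % M; }                /* home position */
int h2(int k) { return 1 + (k % (M - 1)); }    /* per-key step size, never 0 */

/* Keys with the same home position but different h2 values follow
   different probe sequences, which avoids primary clustering. */
int double_hash(int k, int i)
{
    return (h1(k) + i * h2(k)) % M;
}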