0% found this document useful (0 votes)
19 views63 pages

L04 Hashing

The document provides an overview of hashing, including its definition, methods for hashing, collision resolution techniques, and performance metrics. It discusses various collision resolution strategies such as separate chaining, linear probing, quadratic probing, and double hashing, along with the concept of load factor and its impact on performance. Additionally, it explains the importance of good hash functions and the challenges associated with collisions in hash tables.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
19 views63 pages

L04 Hashing

The document provides an overview of hashing, including its definition, methods for hashing, collision resolution techniques, and performance metrics. It discusses various collision resolution strategies such as separate chaining, linear probing, quadratic probing, and double hashing, along with the concept of load factor and its impact on performance. Additionally, it explains the importance of good hash functions and the challenges associated with collisions in hash tables.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 63

L04: Hashing

For efficient look-up in a table


Outline
 What is hashing?
 How to hash?
 What is collision?
 How to resolve collision?
 Separate chaining
 Linear probing
 Quadratic probing
 Double hashing
 Load factor
 Primary clustering and secondary clustering
2
ADT Table Operations
ADT table supports 3 operations:
1. Insertion
2. Deletion
3. Retrieval
Sorted Balanced Hashing
Array BST
Insertion O(n) O(log n) O(1) avg
Deletion O(n) O(log n) O(1) avg
Retrieval O(log n) O(log n) O(1) avg

Note: Balanced Binary Search Tree (BST) will be covered


in the next lecture
3
Direct Addressing
Table
A simplified version of hash table
SBS bus Problem

5
Direct Addressing Table: Operations

insert (key, data)


a[key] = data // where a[] is an array - the table

delete (key)
a[key] = null
find (key)
return a[key]

6
Direct Addressing Table: Restrictions
 Keys must be integer
 What happens for key values 151A and NR10?
 Range of keys must be small
 Keys must be dense, i.e. not many gaps in
the key values.
 How to overcome these restrictions?

7
Hash Table

A generalization of direct addressing


table, to remove its restrictions.
9
10
Good Hash Functions
 Fast to compute
 Scatter keys evenly throughout the hash table
 Less collisions
 Need less slots (space)

11
Bad Hash Functions

 What happen when you hash Singapore’s


house phone numbers by selecting the first
three digits?

12
Perfect Hash Functions

13
How to Define a Hash Function?
 Uniform hash function
 Division method
 Multiplication method
 Hashing of strings

14
Uniform Hash Functions

15
3 Division method (mod operator)
 Map into a hash table of m slots.
 Use the modulo operator (% ) to map an integer
to a value between 0 and m-1.

hash(k ) = k % m
The most popular method.

16
3 How to pick m (table size)?

17
Multiplication method
1. Multiply by a fraction A between 0 and 1
2. Extract the fractional part
3. Multiply the fraction by m, the hash table size

hash (k ) = m(kA- kA )

18
Hashing of strings

hash1(s) // s is a string
sum = 0
foreach character c in s
sum += c // sum up the ASCII values of all characters
return sum % m // m is the hash table size

19
Hashing of strings
hash(“Tan Ah Teck”)

= (“T” + “a” + “n” + “ ” +


“A” + “h” + “ ” +
“T” + “e” + “c” + “k”) % 11 // hash table size is 11
= (84 + 97 + 110 + 32 +
65 + 104 + 32 +
84 + 101 + 99 + 107) % 11
= 825 % 11
=0

20
Hashing of strings:
 Lee Chin Tan
 Chen Le Tian
 Chan Tin Lee

All 3 strings above have the same hash value! Why?

 Problem: The hash code produced by hash1 does


not depend on positions of characters! – Bad

21
Improved Hashing of strings
A better way is to “shift” the sum before adding the next
character, so that its position will affect the hash code.
This kind of hash code is called a polynomial hash code

hash2(s)
sum = 0
foreach character c in s
sum = sum*37 + c
return sum % m // m is the hash table size

22
Collision Resolution
4 Probability of Collision (1/2)
 von Mises Paradox (The Birthday Paradox): “How
many people must be in a room before the probability
that some share a birthday, ignoring the year and leap
days, becomes at least 50 percent?”

Q(n) = Probability of unique birthday for n people

364 363 362 365 - n  1


=   ...
365 365 365 365
P(n) = Probability of collisions (same birthday) for n people
= 1 – Q(n)
P(23) = 0.507

24
4 Probability of Collision (2/2)
 This means that if there are 23 people in a room,
the probability that some people share a
birthday is 50.7%.
 So, if we insert 23 keys in to a table with 365
slots, more than half of the time we get
collisions. Such a result is counter-intuitive to
many.
 So, collision is very likely!

25
How to resolve collisions?

Conflict resolution schemes commonly used:


 Separate Chaining
 Linear Probing
 Quadratic Probing
 Double Hashing

26
27
28
Load Factor
 n: number of keys in the hash table
 m: size of the hash tables – number of slots
 Define the load factor 
 = n/m
a measure of how full the hash table is.
Note:  can be >= 1.
If table size is the number of linked lists, then 
is the average length of the linked lists.

29
Average Running Time
 Find O(1 + )
 Insert O(1)
 Delete O(1 + )

 Note that  affects the performance of find and


delete operations.
 If  is bounded by some constant, then all three
operations are O(1).

30
31
Linear Probing
0
hash(k) = k mod 7 In linear probing,
1 when there is a
Here the table size m=7
collision, we scan
2 forwards for the the
Note: 7 is a prime number.
3 next empty slot
(wrapping around
4 when we reach the
last slot).
5

32
Linear Probing: Insert 18
0
hash(k) = k mod 7
1
hash(18) 2
= 18 mod 7
3
=4
4 18
5

33
Linear Probing: Insert 14
0 14
hash(k) = k mod 7
1

2
hash(14)
3
= 14 mod 7
=0 4 18
5

34
Linear Probing: Insert 21
0 14
hash(k) = k mod 7
1 21
Collision occurs!
2 Look for next empty slot.
3
hash(21)
= 21 mod 7 4 18
=0 5

35
Linear Probing

4.2 Linear Probing: Insert 1


0 14
hash(k) = k mod 7 Collides with 21
1 (hash value 0). Look
21
for next empty slot.
hash(1) 2 1
= 1 mod 7 3
=1 4 18
5

36
Linear Probing: Insert 35
0 14
hash(k) = k mod 7 Collision, need to
1 check next 3 slots.
21
hash(35) 2 1
= 35 mod 7 3 35
=0
4 18
5

37
Linear Probing: Find 35
0 14
hash(k) = k mod 7 Found 35, after 4
1 probes.
21
hash(35) = 0
2 1
3 35
4 18
5

38
Linear Probing: Find 8

0 14
hash(k) = k mod 7 8 NOT found.
1 Need 5 probes!
21
hash(8) = 1
2 1
3 35
4 18
5

39
Linear Probing: Delete 21
0 14
hash(k) = k mod 7
1 21
hash(21) = 0
2 1
3 35
4 18
5

40
Linear Probing: Find 35
0 14
hash(k) = k mod 7 35 NOT found!
1 Incorrect!
hash(35) = 0
2 1
We cannot simply
3 35 remove a value,
4 because it can
18
affect find()!
5

41
How to delete?
 Lazy Deletion
 Use three different states of a slot
 Occupied
 deleted
 Empty
 When a value is removed from linear probed
hash table, we just mark the status of the slot
as “deleted”, Instead of emptying the slot.
 Need to use a state array the same size as
the hash table.

42
Linear Probing: Delete 21

0 14
hash(k) = k mod 7 Slot 1 is occupied
1 but now marked
X
21
as deleted.
hash(21) = 0
2 1
3 35
4 18
5

43
Linear Probing

Linear Probing: Find 35


0 14
hash(k) = k mod 7
1 X
21
hash(35) = 0
2 1
3 35
Found 35
4 18 Now we can find 35

44
Linear Probing: Insert 15
0 14
hash(k) = k mod 7
1 X
21
15 15 is inserted into
hash(15) = 1 the slot 1 which
2 1 was marked as
Note: We continue 3 35 deleted
to search for 15, 4 18
and found that 15 We can insert a
is not in the hash 5 new value into a
table (total 5 slot that has been
6
probes).
marked as
deleted.
45
46
Problem of Linear Probing

0 14
1 15
Consecutive
2 1 occupied slots.
3 35
4 18
5

6 This is called
Primary
Clustering
47
Linear Probing

48
Modified Linear Probing
Q: How to modify linear probing to avoid primary clustering?
We can modify the probe sequence as follows:
hash(key)
( hash(key) + 1 * d ) % m
( hash(key) + 2 * d ) % m
( hash(key) + 3 * d ) % m
:
where d is some constant integer >1 and is co-prime to m.
Note: Since d and m are co-primes, the probe sequence
covers all the slots in the hash table.

49
Quadratic Probing

50
Quadratic Probing: Insert 3, 18
0
hash(k) = k mod 7
1
hash(3) = 3
Hash(18)= 4 2
3 3
4 18
5

51
Quadratic Probing: Insert 38
0 38
hash(k) = k mod 7
1
hash(38) = 3
2
Collision
3 3
+1
4 18
5
+4
6

52
Can quadratic probing always find a free slot?
Insert 12 into the previous example, followed by 10.

what happen?

Theorem
If  < 0.5, and m is prime, then we can always
find an empty slot.
(m is the table size and  is the load factor)

When quadratic probing is used, in the worst


case, 50% of the hash table is wasted.

53
Secondary Clustering
 In quadratic probing, clusters are formed along
the path of probing, instead of around the home
location.
 These clusters are called secondary clusters.
 Secondary clusters are formed as a result of
using the same pattern in probing by all keys.

54
Double Hashing

55
After Inserting 14 and 18, Insert 21
hash(k) = k mod 7 0 14 hash(21)
hash2(k) = k mod 5
1 21 =21 mod 7
2
=0
3
hash2(21)
4 18 = 21 mod 5
5 =1
6

56
Double Hashing: Insert 4
hash(k) = k mod 7 0 14 If we insert 4, the
hash2(k) = k mod 5
1 probe sequence is
21
4 (home), 8, 12, …
2
3 hash(4) = 4
4
hash2(4) = 4
18
5 4
6

57
Double Hashing: Insert 35
hash(k) = k mod 7 0 14 But if we insert 35,
hash2(k) = k mod 5
1 the probe sequence
21
hash(35) = 0 is 0, 0, 0, …
hash2(35) = 0 2
What is wrong?
3 Since hash2(35)=0.
4 Not acceptable!
18
5 4
6

58
Warning:

59
Note:
 If hash2(k) =1, then it is the same as linear probing.

 If hash2(k) =d, where d is a constant integer > 1, then it


is the same as modified linear probing.

60
Good Collision Resolution Method

 Small cluster size


 Always find an empty slot if it exists
 Give different probe sequences when 2 keys
collide
 Fast

61
[CS1020 Lecture 10 AY2013/4 S1]
62
Summary
 How to hash? Criteria for good hash functions?
 How to resolve collision?
Collision resolution techniques:
 separate chaining
 linear probing
 quadratic probing
 double hashing
 Problem on deletions
 Primary clustering and secondary clustering.

63

You might also like