
Week 10: Hash Table: Readings

Hash tables provide constant-time operations on average by mapping keys to integer indices in an array. Collisions occur when two keys hash to the same index. Separate chaining resolves collisions by storing colliding keys in a linked list at each index. Linear probing resolves collisions by searching sequentially for an empty slot after the initial index. Linear probing requires lazy deletion, where a deleted item is marked as deleted instead of being removed, so that later searches are not cut short; separate chaining can simply remove the item from its list.

Uploaded by

tjm.stkr

Week 10: Hash Table

Readings
- Required: [Weiss] ch20
- Exercise: 20.5

nus.soc.cs1102b.week10

A hash table is a data structure that supports the most common dynamic set operations in constant time on average. It has many applications.

Recap

            Unsorted      Sorted
            Array/List    Array       BST         Hash Table
Insert      O(1)          O(N)        O(log N)    O(1) avg
Delete      O(N)          O(N)        O(log N)    O(1) avg
Find        O(N)          O(log N)    O(log N)    O(1) avg
findMin     O(N)          O(1)        O(log N)    O(N)
findMax     O(N)          O(1)        O(log N)    O(N)

Direct Addressing Table

9 October 2002

A direct addressing table is a simplified version of a hash table.

Consider the problem of maintaining information about SBS (and TIBS) bus services. We want to support three operations: find, insert, and delete.

SBS Bus Problem
- find(N): Does bus service no. N exist?
- insert(N): Introduce a new bus service no. N
- delete(N): Remove bus service no. N

Since bus numbers are integers between 0 and 999, we can create an array of 1000 booleans, initialized to false. If bus service N exists, just set position N to true. Find, delete, and insert can all be done in O(1) time.

SBS Bus Problem
  0    false
  1    false
  2    true
  :
  989  true

We can extend this idea if we want to maintain additional data about a bus: use an array of 1000 slots, each of which can reference an Object.

Direct Addressing Table
  0
  1
  2    -> 2, data
  :
  989  -> 989, data

Direct Addressing Table
insert (key, data)
    a[key] = data
delete (key)
    a[key] = null
search (key)
    return a[key]
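The direct addressing pseudocode could be written in Java roughly as follows. This is only a sketch: the class name DirectAddressTable, the Object payload, and the constructor parameter are assumptions made for illustration, not part of the lecture.

```java
// Direct addressing sketch: the key itself is the array index.
// Assumes every key is an integer in the range 0 .. range-1.
class DirectAddressTable {
    private final Object[] a;

    DirectAddressTable(int range) {
        a = new Object[range];      // all slots start as null (absent)
    }

    void insert(int key, Object data) { a[key] = data; }   // O(1)

    void delete(int key) { a[key] = null; }                // O(1)

    Object search(int key) { return a[key]; }              // O(1), null if absent
}
```

For the bus example, new DirectAddressTable(1000) would give one slot per possible service number.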

This works only if keys are integers (we cannot keep track of bus no. NR10 or 162M), and the range of the keys must be small (if keys are phone numbers, you need an array of size 10 million).

Restrictions
- Keys must be integers
- Range of keys must be small

A hash table is a generalization of the direct addressing table that removes these restrictions.

Hash Table

The idea is to map any key to a small integer. We call this hashing. The function that maps keys to integers is called a hash function.

Idea
- Map non-integer keys to integers
- Map large integers to smaller integers

HASHING

h is a hash function. This example shows how we map phone numbers to slot numbers between 0 and 999.

Hash Table
  66752378  --h-->  slot 17:   66752378, data
  68744483  --h-->  slot 974:  68744483, data

Here is the pseudocode; notice that we have replaced key with h(key). (This does not work! See next slide.)

Hash Table
insert (key, data)
    a[ h(key) ] = data
delete (key)
    a[ h(key) ] = null
search (key)
    return a[ h(key) ]

But a hash function does not guarantee that two different keys go into different slots! When two keys map to the same slot, we have a collision.

Hash Table
[Figure: h maps the new key 67774385 to a slot that is already occupied by another (key, data) entry.]

Problem
- Two keys can have the same hash value

COLLISION

To implement a hash table, we need to answer two questions: how to define a hash function, and how to resolve collisions. These are important issues that can affect the efficiency of the hash table.

Overview of This Lecture
- How to hash?
- How to resolve collision?

Hash Functions

Good Hash Functions
- appear random
- fast
- depends on all information in the key
- keys that are close have hash values that are far apart

It is possible to have a perfect hash function, where collisions are guaranteed not to occur.

Perfect Hashing Function
- One-to-one mapping between keys and hash values.
- May be possible if all keys are known

A uniform hashing function puts a key into each slot with equal probability.

Uniform Hashing Function
- Distributes keys evenly
- Example: if the keys k are integers uniformly distributed between 0 and X-1, i.e. k in [0, X), then

      hash(k) = floor(k * m / X)

There are many ways to hash an integer.

Hashing Integers

The most popular one is the division method, where we use the mod operator (% in Java) to map an integer to a value between 0 and m-1 (inclusive).

Division Method
- Mapped into a table of m slots

      hash(k) = k % m

mod operator
- n mod m = remainder of n divided by m
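A minimal Java sketch of the division method is shown below. The class name DivisionHash is hypothetical. One detail worth noting, which goes beyond the slide: in Java the % operator can return a negative remainder when k is negative, so this sketch uses Math.floorMod to keep the result in 0..m-1.

```java
class DivisionHash {
    // Division method: map an integer key into a table of m slots.
    // Math.floorMod(k, m) is always in 0 .. m-1 for m > 0,
    // whereas k % m is negative when k is negative.
    static int hash(int k, int m) {
        return Math.floorMod(k, m);
    }
}
```

For non-negative keys, hash(k, m) and k % m agree, so the examples in these slides are unaffected by the choice.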

The choice of m (the hash table size) is important. If m is a power of two, say 2^n, then the key modulo m is just the last n bits of the key. If m is 10^n, then the hash value is the last n decimal digits of the key. In both cases the rest of the key is ignored, so we usually pick m to be a prime number not too close to a power of two.

How to pick m?
- m = 16
- m = 10
- m = 13

Rule
- Pick m to be a prime number not too close to a power of two.
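The effect of the table size can be checked with a small Java sketch. The class and method names (TableSizeDemo, lowBitsOnly) are made up for this illustration, and keys are assumed to be non-negative.

```java
class TableSizeDemo {
    // Division-method hash for non-negative keys.
    static int hash(int k, int m) {
        return k % m;
    }

    // With m = 2^n, k % m is exactly the last n bits of k (for k >= 0),
    // so all higher bits of the key are ignored.
    static boolean lowBitsOnly(int k, int n) {
        int m = 1 << n;
        return hash(k, m) == (k & (m - 1));
    }
}
```

For example, the keys 16, 32 and 48 all fall into slot 0 when m = 16, but into slots 3, 6 and 9 when m = 13.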

Another method is the multiplication method. The reciprocal of the golden ratio, (sqrt(5) - 1)/2 ≈ 0.618, seems to be a good choice for A.

Multiplication Method
1. Multiply by a number 0 <= A < 1
2. Extract the fractional part
3. Multiply by m

      hash(k) = floor( m * (k*A - floor(k*A)) )
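The three steps of the multiplication method might look like this in Java. This is a sketch that uses double arithmetic and assumes non-negative keys; the class name MultiplicationHash is hypothetical.

```java
class MultiplicationHash {
    // A = (sqrt(5) - 1) / 2, roughly 0.618
    static final double A = (Math.sqrt(5) - 1) / 2;

    static int hash(int k, int m) {
        double product = k * A;                              // 1. multiply by A
        double fraction = product - Math.floor(product);     // 2. keep the fractional part
        return (int) (m * fraction);                         // 3. multiply by m, round down
    }
}
```

Unlike the division method, the quality of this hash does not depend on m being prime, because the mixing is done by the multiplication with A.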

Hashing Strings

To hash a string, we can just sum up the ASCII values of its characters.

Hashing of Strings
hash(s, m)
    sum = 0
    foreach character c in s
        sum += c
    return sum % m

hash("Tan Ah Teck", 11)
  = ('T' + 'a' + 'n' + ' ' +
     'A' + 'h' + ' ' +
     'T' + 'e' + 'c' + 'k') % 11
  = (84 + 97 + 110 + 32 +
     65 + 104 + 32 +
     84 + 101 + 99 + 107) % 11
  = 915 % 11
  = 2

This hash depends only on which characters are present in a string, not on their positions.

Hashing of Strings
- Lee Chin Tan
- Chen Le Tian
- Chan Tin Lee

Does not depend on position of characters!
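The character-sum hash and its weakness can be tried out with a short Java sketch. The class name SumHash and the helper charSum are assumptions made for illustration.

```java
class SumHash {
    // Sum of the character codes of s (ASCII values for plain English text).
    static int charSum(String s) {
        int sum = 0;
        for (int i = 0; i < s.length(); i++) {
            sum += s.charAt(i);
        }
        return sum;
    }

    // Character-sum hash into a table of m slots.
    static int hash(String s, int m) {
        return charSum(s) % m;
    }
}
```

Because addition is commutative, any rearrangement of the same characters gives the same sum, so the three names on the slide always land in the same slot, whatever m is.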

A better way is to shift (multiply) the running sum by a constant at every step, so that the position of each character affects the calculated hash value. (Note: Java's String.hashCode() uses 31 instead of 37.)

Hashing of Strings
hash(s, m)
    sum = 0
    foreach character c in s
        sum = sum*37 + c
    return sum % m
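A possible Java version of the position-sensitive string hash is sketched below. The names StringHash and polyHash are hypothetical. The sketch lets int arithmetic overflow and wrap around, and then uses Math.floorMod so that a negative intermediate value still maps to a valid slot; that overflow handling is an addition here, not something stated on the slide.

```java
class StringHash {
    // Polynomial hash: for each character, hash = hash * multiplier + c.
    // int overflow simply wraps around, which is acceptable for hashing.
    static int polyHash(String s, int multiplier) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = h * multiplier + s.charAt(i);
        }
        return h;
    }

    // Reduce to a slot index in 0 .. m-1, even if polyHash overflowed to a negative value.
    static int hash(String s, int m) {
        return Math.floorMod(polyHash(s, 37), m);
    }
}
```

With a multiplier of 31, polyHash computes the same value as Java's String.hashCode(). With 37, "ab" gives 97*37 + 98 = 3687 while "ba" gives 98*37 + 97 = 3723, so the order of characters now matters.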

Collision Resolution

Probability of Collision
- von Mises Paradox: "How many people must be in a room before the probability that some share a birthday, ignoring the year and leap days, becomes at least 50 percent?"

Probability of Collision
Q(n) = Probability of unique birthday for n people
     = (364/365) * (363/365) * (362/365) * ... * ((365 - n + 1)/365)

P(n) = Probability of collisions for n people
     = 1 - Q(n)

P(23) = 0.507
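The value of P(23) can be checked numerically. The Java sketch below evaluates the product for Q(n) directly; the class name Birthday and its method names are made up for this example, and the number of slots is a parameter so the same calculation applies to a hash table of any size under the assumption that keys are hashed uniformly.

```java
class Birthday {
    // Q(n): probability that n keys all fall into different slots.
    // The first key is always free; key i+1 must avoid the i slots already taken.
    static double allDistinct(int n, int slots) {
        double q = 1.0;
        for (int i = 1; i < n; i++) {
            q *= (double) (slots - i) / slots;
        }
        return q;
    }

    // P(n) = 1 - Q(n): probability of at least one collision.
    static double collision(int n, int slots) {
        return 1.0 - allDistinct(n, slots);
    }
}
```

Calling Birthday.collision(23, 365) gives a value just above one half, while Birthday.collision(22, 365) is still below it.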

If we insert 23 or more keys into a table with 365 slots, more than half of the time we get a collision.

Probability of Collision

Collision is very likely!

Collision Resolutions
- Separate Chaining
- Linear Probing
- Quadratic Probing
- Double Hashing

Separate Chaining

Separate chaining is the most straightforward method: each slot holds a linked list that stores the keys that collide there.

Idea
[Figure: an array of slots 0 to m-1; each slot points to a linked list of (key, data) entries, here holding (k1, data), (k2, data), (k3, data) and (k4, data).]

Insertion can be done in O(1) time, but deletion and search take O(n) time, where n is the length of the list.

Hash Table
insert (key, data)
    insert data into the list a[ h(key) ]
delete (key)
    delete data from the list a[ h(key) ]
search (key)
    find key in the list a[ h(key) ]
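One way to realise the separate chaining pseudocode in Java is sketched below. It assumes integer keys, String data, and the division method for h; the class name ChainedHashTable and the nested Entry class are hypothetical, and insert assumes the key is not already in the table.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

class ChainedHashTable {
    // One (key, data) pair stored in a chain.
    private static final class Entry {
        final int key;
        final String data;
        Entry(int key, String data) { this.key = key; this.data = data; }
    }

    private final List<LinkedList<Entry>> a = new ArrayList<>();
    private final int m;

    ChainedHashTable(int m) {
        this.m = m;
        for (int i = 0; i < m; i++) a.add(new LinkedList<>());
    }

    private int h(int key) { return Math.floorMod(key, m); }

    // O(1): add at the front of the chain.
    void insert(int key, String data) {
        a.get(h(key)).addFirst(new Entry(key, data));
    }

    // O(length of chain): scan the chain for the key; null if absent.
    String search(int key) {
        for (Entry e : a.get(h(key))) {
            if (e.key == key) return e.data;
        }
        return null;
    }

    // O(length of chain): unlink the entry from its chain.
    boolean delete(int key) {
        return a.get(h(key)).removeIf(e -> e.key == key);
    }
}
```

Deleting from a chain simply unlinks the entry, so the other keys in the same slot remain reachable.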

Analysis
- n: number of keys
- m: number of slots
- L: load factor
- L = n/m
- Average length of list = L

However, we can bound the average length of the chain by a constant.

Average Running Time
- Search O(1 + L)
- Insert O(1)
- Delete O(1 + L)

If L is bounded by some constant, then all three operations are O(1)

Whenever the load factor exceeds the bound, we need to rehash all keys into a bigger table (increase m to reduce L).

Rehashing
- To keep L bounded, we may need to reconstruct the whole table
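Rehashing can be sketched in Java as follows, for a chained table that stores only integer keys. The class name Rehash and both method names are assumptions for illustration; how the new size is chosen (for example, the next suitable prime) is left to the caller.

```java
import java.util.ArrayList;
import java.util.List;

class Rehash {
    // Build an empty table of m chains.
    static List<List<Integer>> newTable(int m) {
        List<List<Integer>> t = new ArrayList<>();
        for (int i = 0; i < m; i++) t.add(new ArrayList<>());
        return t;
    }

    // Move every key from the old table into a new table with newM slots.
    // Each key must be hashed again because k % m changes when m changes.
    static List<List<Integer>> rehash(List<List<Integer>> old, int newM) {
        List<List<Integer>> bigger = newTable(newM);
        for (List<Integer> chain : old) {
            for (int k : chain) {
                bigger.get(Math.floorMod(k, newM)).add(k);
            }
        }
        return bigger;
    }
}
```

Rehashing touches every key, so it costs O(n), but it is only needed occasionally, when the load factor crosses the chosen bound.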

Linear Probing

In linear probing, when we get a collision, we scan through the table looking for an empty slot (wrapping around when we reach the last slot).

Linear Probing (hash(k) = k mod 7)
slot:   0    1    2    3    4    5    6
value:  -    -    -    -    -    -    -

21 collides with 14 at slot 0. Look for the next empty slot.

Insert 21 (hash(k) = k mod 7)
slot:   0    1    2    3    4    5    6
value:  14   21   -    -    18   -    -

1 collides with 21 at slot 1. Look for the next empty slot.

Insert 1 (hash(k) = k mod 7)
slot:   0    1    2    3    4    5    6
value:  14   21   1    -    18   -    -

Insert 35 (hash(k) = k mod 7)
slot:   0    1    2    3    4    5    6
value:  14   21   1    35   18   -    -

Finding a value is similar to inserting one. We probe the array starting from the original hash position (in this case hash(35) = 0).

Find 35 (hash(k) = k mod 7)
slot:   0    1    2    3    4    5    6
value:  14   21   1    35   18   -    -
FOUND 35 at slot 3

When probing, if we reach an empty slot, we know that the value does not exist in the hash table.

Find 8 (hash(k) = k mod 7)
slot:   0    1    2    3    4    5    6
value:  14   21   1    35   18   -    -
Probing from slot 1 reaches the empty slot 5: 8 NOT FOUND

To delete, we first find the value, and then remove it from the table.

Delete 21 (hash(k) = k mod 7)
slot:   0    1    2    3    4    5    6
value:  14   21   1    35   18   -    -

We cannot simply remove a value, because it can affect find()!

Find 35 (hash(k) = k mod 7)
slot:   0    1    2    3    4    5    6
value:  14   -    1    35   18   -    -
Probing from slot 0 stops at the emptied slot 1: 35 NOT FOUND!

Problem

Cannot Delete!

When a value is removed from a linear-probed hash table, we just mark it as deleted instead of emptying the slot.

How to delete?
- Lazy Deletion
- Three different states
    - occupied
    - occupied but marked as deleted
    - empty

Delete 21 (hash(k) = k mod 7; X = marked as deleted)
slot:   0    1    2    3    4    5    6
value:  14   21X  1    35   18   -    -

Find 35 (hash(k) = k mod 7; X = marked as deleted)
slot:   0    1    2    3    4    5    6
value:  14   21X  1    35   18   -    -
FOUND 35 at slot 3

When we insert, we can put a value into either an empty slot or a slot that has been marked as deleted.

Insert 15 (hash(k) = k mod 7; X = marked as deleted), before:
slot:   0    1    2    3    4    5    6
value:  14   21X  1    35   18   -    -

Insert 15 (hash(k) = k mod 7), after:
slot:   0    1    2    3    4    5    6
value:  14   15   1    35   18   -    -
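The linear probing example with lazy deletion can be replayed with the Java sketch below. It stores only integer keys, uses hash(k) = k mod m, and assumes a key is never inserted twice. The class name LinearProbingTable and the State enum are hypothetical names chosen for this illustration.

```java
import java.util.Arrays;

class LinearProbingTable {
    // The three slot states used by lazy deletion.
    private enum State { EMPTY, OCCUPIED, DELETED }

    private final int[] keys;
    private final State[] states;
    private final int m;

    LinearProbingTable(int m) {
        this.m = m;
        keys = new int[m];
        states = new State[m];
        Arrays.fill(states, State.EMPTY);
    }

    private int hash(int k) { return Math.floorMod(k, m); }

    // Returns the slot holding k, or -1 if k is absent.
    // The search stops at the first EMPTY slot but continues past DELETED slots.
    int find(int k) {
        for (int i = 0; i < m; i++) {
            int slot = (hash(k) + i) % m;
            if (states[slot] == State.EMPTY) return -1;
            if (states[slot] == State.OCCUPIED && keys[slot] == k) return slot;
        }
        return -1;
    }

    // Puts k into the first EMPTY or DELETED slot on its probe sequence.
    // Returns the slot used, or -1 if every slot is occupied.
    int insert(int k) {
        for (int i = 0; i < m; i++) {
            int slot = (hash(k) + i) % m;
            if (states[slot] != State.OCCUPIED) {
                keys[slot] = k;
                states[slot] = State.OCCUPIED;
                return slot;
            }
        }
        return -1;
    }

    // Lazy deletion: mark the slot as DELETED instead of emptying it.
    boolean delete(int k) {
        int slot = find(k);
        if (slot < 0) return false;
        states[slot] = State.DELETED;
        return true;
    }
}
```

With m = 7, inserting 14, 18, 21, 1 and 35 and then deleting 21 leaves 35 findable, and a later insert of 15 reuses the slot that 21 occupied.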

The problem with linear probing is that it can create many consecutive occupied slots, increasing the running time of find/insert/delete. This is called primary clustering.

Problem

Primary Clustering


An improvement to linear probing is quadratic probing.

Quadratic Probing

The probe sequence for linear probing is as follows.

Linear Probing
    hash(key)
    ( hash(key) + 1 ) % m
    ( hash(key) + 2 ) % m
    ( hash(key) + 3 ) % m
    :

For quadratic probing, we use this probe sequence.

Quadratic Probing
    hash(key)
    ( hash(key) + 1 ) % m
    ( hash(key) + 4 ) % m
    ( hash(key) + 9 ) % m
    :

Insert 3 (hash(k) = k mod 7)
slot:   0    1    2    3    4    5    6
value:  -    -    -    3    18   -    -

Notice that the offsets +1, +4, +9, ... are measured from the original hash position. If we were to measure from the previous probe position instead, the increments would be +1, +3, +5, ..., +(2i - 1). (Q: Show mathematically that they are the same.)

Insert 38 (hash(k) = k mod 7)
slot:   0    1    2    3    4    5    6
value:  38   -    -    3    18   -    -
38 hashes to slot 3 (occupied); 3 + 1 = 4 (occupied); (3 + 4) mod 7 = 0 (empty).

How can we be sure that quadratic probing always terminates? Insert 12 into the previous example, followed by 10, and see what happens.

Theorem
- If L < 0.5, and m is prime, then we can always find an empty slot if the table is not full.
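The exercise of inserting 12 and then 10 can be tried with the following Java sketch of quadratic probing. The class name QuadraticProbing and the EMPTY marker are assumptions made for illustration, and keys are assumed to be non-negative. The loop tries only i = 0 .. m-1, because (i + m)^2 mod m equals i^2 mod m, so later probes just repeat earlier ones.

```java
class QuadraticProbing {
    static final int EMPTY = -1;   // marker for an unused slot (keys must be >= 0)

    // Try slots (hash(k) + i*i) % m for i = 0, 1, ..., m-1, with hash(k) = k % m.
    // Returns the slot used, or -1 if none of the probed slots is empty.
    static int insert(int[] table, int k) {
        int m = table.length;
        for (int i = 0; i < m; i++) {
            int slot = (k % m + i * i) % m;
            if (table[slot] == EMPTY) {
                table[slot] = k;
                return slot;
            }
        }
        return -1;
    }
}
```

Working through the slide's table (m = 7 holding 18, 3 and 38): 12 goes to slot 5, but 10 then probes only slots 3, 4, 0 and 5, all occupied, even though slots 1, 2 and 6 are free. At that point L = 4/7, which is above 0.5, so the theorem's guarantee no longer applies.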

Using quadratic probing requires more careful design of the hash table. It also suffers from a (less serious) problem: if two keys have the same initial position, they have the same probe sequence.

Problems
- If two keys have the same initial position, their probe sequence is the same.
- Secondary clustering.

Double hashing uses a second hash function to calculate the probe sequence, so unless two keys have the same hash values for both hash functions, they have different probe sequences.

Double Hashing

hash2 (key) is the secondary hash function.

Double Hashing
    hash(key)
    (hash(key) + hash2(key)) % m
    (hash(key) + 2*hash2(key)) % m
    (hash(key) + 3*hash2(key)) % m
    :

We use k % 5 as the secondary hash function in this example. Can you give two keys that have the same probe sequence in this example?

If we insert 21, the probe sequence is the same as linear probing, because hash2(21) = 21 % 5 = 1.

Insert 21 (hash(k) = k mod 7, hash2(k) = k mod 5)
slot:   0    1    2    3    4    5    6
value:  14   21   -    -    18   -    -

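If we insert 4, then hash2(4) = 4, so the probe offsets are +4, +8, +12, ... (measured from the first probe position) or +4, +4, +4, ... (measured from the previous probe position).

Insert 4 (hash(k) = k mod 7, hash2(k) = k mod 5)
slot:   0    1    2    3    4    5    6
value:  14   21   -    -    18   4    -
4 hashes to slot 4 (occupied); (4 + 4) mod 7 = 1 (occupied); (4 + 8) mod 7 = 5 (empty).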

But if we insert 35, then hash2(35) = 35 % 5 = 0, so the probe offsets are 0, 0, 0, ... and we keep probing slot 0. What is wrong?

Insert 35 (hash(k) = k mod 7, hash2(k) = k mod 5)
slot:   0    1    2    3    4    5    6
value:  14   21   -    -    18   4    -

Warning
- Secondary hash function must not evaluate to 0!

Change hash2(key) to
    hash2(key) = 5 - (key % 5)
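The corrected secondary hash function can be explored with a short Java sketch that lists the probe sequence of a key. The class name DoubleHashing and the method probeSequence are hypothetical, and keys are assumed to be non-negative.

```java
class DoubleHashing {
    // Primary hash: slot of the first probe.
    static int hash(int k, int m) {
        return k % m;
    }

    // Secondary hash: step size between probes, always in 1..5, never 0.
    static int hash2(int k) {
        return 5 - (k % 5);
    }

    // The first m slots probed for key k: (hash(k) + i * hash2(k)) % m.
    static int[] probeSequence(int k, int m) {
        int[] seq = new int[m];
        for (int i = 0; i < m; i++) {
            seq[i] = (hash(k, m) + i * hash2(k)) % m;
        }
        return seq;
    }
}
```

For key 35 and m = 7, the step is now 5 instead of 0, and the probes visit slots 0, 5, 3, 1, 6, 4, 2. Because 7 is prime and the step is never a multiple of 7, every slot is reached exactly once.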

Good Collision Resolution
- Minimize clustering
- Can find an empty slot if L is small
- Give different probe sequence when initial probe is the same
- Fast
