
Lecture 13.

Hash Tables

1
Hash Tables (1/4)
¢ Motivation: symbol tables
u A compiler uses a symbol table to relate symbols to associated
data
ê Symbols: variable names, procedure names, etc.
ê Associated data: memory location, call graph, etc.
u For a symbol table (also called a dictionary), we care about
search, insertion, and deletion
u We typically don’t care about sorted order

2
Hash Tables (2/4)

[Figure: a hash function ℎ maps keys from the universe 𝑈 (with the set 𝐾 of actual keys k1 … k4) into slots 0 … 𝑚 − 1 of a table 𝑇, e.g., k1 → ℎ(k1), k2 → ℎ(k2), k3 → ℎ(k3), k4 → ℎ(k4)]

3
Hash Tables (3/4)
¢ Idea
u Use a function h to compute the slot for each key

u Store the element in slot ℎ(𝑘)

¢ A hash function ℎ transforms a key into an index of the hash table
𝑇[0 … 𝑚 − 1]:

ℎ : 𝑈 → {0, 1, . . . , 𝑚 − 1}

¢ We say that 𝑘 hashes to slot ℎ(𝑘)

¢ Advantages:
u Reduce the range of array indices handled: 𝑚 instead of |𝑈|

u Storage is correspondingly reduced

4
Hash Tables (4/4)
¢ More formally:
u Given a table 𝑇 and a record 𝑥, with key (= symbol) and
satellite data, we need to support:
ê Insert (𝑇, 𝑥)
ê Delete (𝑇, 𝑥)
ê Search (𝑇, 𝑥)
u We want these to be fast, but don’t care about sorting the
records
¢ The structure we will use is a hash table
u Supports all the above in 𝑂(1) expected time!

5
Direct-Address Tables
¢ Suppose:
u The range of keys is 0. . 𝑚 − 1
u Keys are distinct
¢ The idea:
u Set up an array 𝑇[0. . 𝑚 − 1] in which
ê 𝑇[𝑖] = 𝑥 if 𝑥 ∈ 𝑇 and key[𝑥] = 𝑖
ê 𝑇[𝑖] = NULL otherwise
u This is called a direct-address table
ê Operations take 𝑂(1) time!
ê So what’s the problem?

6
Direct-Address Tables
¢ Dictionary operations are trivial and take 𝑂(1) time
each:
u DIRECT-ADDRESS-SEARCH(T, k)
return 𝑇[𝑘]

u DIRECT-ADDRESS-INSERT(T, x)
𝑇[key[𝑥]] ← 𝑥

u DIRECT-ADDRESS-DELETE(T, x)
𝑇[key[𝑥]] ← NIL

7
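As a concrete illustration (not part of the original slides), a direct-address table can be sketched in Python; keys are assumed to be distinct integers in the range 0 … m − 1:

```python
# A minimal direct-address table sketch: keys are assumed to be
# distinct integers in the range 0 .. m-1.
class DirectAddressTable:
    def __init__(self, m):
        self.slots = [None] * m          # T[0 .. m-1], all initially NIL

    def insert(self, key, value):
        self.slots[key] = (key, value)   # T[key[x]] <- x

    def search(self, key):
        return self.slots[key]           # return T[k] (None if absent)

    def delete(self, key):
        self.slots[key] = None           # T[key[x]] <- NIL
```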
The Problem With Direct Addressing
¢ Direct addressing works well when the range 𝑚 of keys
is relatively small
¢ But what if the keys are 32-bit integers?
u Problem 1: direct-address table will have
2³² entries, more than 4 billion
u Problem 2: even if memory is not an issue, the time to initialize
the elements to NULL may be …
¢ Solution: map keys to smaller range 0. . 𝑚 − 1
¢ This mapping is called a hash function

8
Hash Functions
¢ Next problem: collision

[Figure: keys k1 … k5 from 𝐾 ⊆ 𝑈 are mapped into slots 0 … 𝑚 − 1 of 𝑇; two distinct keys collide: ℎ(k2) = ℎ(k5)]
9
Resolving Collisions
¢ How can we solve the problem of collisions?

u Solution 1: chaining
u Solution 2: open addressing

10
Chaining (1/10)

[Figure: an array of linked-list heads, indexed 0 … n−1, all initially empty]
11
Chaining (2/10)

[Figure: key1:data1 arrives; the hash function maps it to index 3, so it is inserted into the list at T[3]]
12
Chaining (3/10)

[Figure: key2:data2 hashes to index 1 and is inserted into the list at T[1]]
13
Chaining (4/10)

[Figure: key3:data3 hashes to index 4 and is inserted into the list at T[4]]
14
Chaining (5/10)

[Figure: key4:data4 also hashes to index 3; it is inserted at the beginning of the list at T[3], in front of key1:data1 — note: insertion at the beginning of the list]
15
Chaining (6/10)
¢ Chaining puts elements that hash to the same slot in a
linked list:
[Figure: table 𝑇 with chains — k1 → k4 in one slot, k5 → k2 → k7 in another, k3 alone, k8 → k6 in another; unused slots hold NIL]
16
Chaining (7/10)
¢ How do we insert an element?

[Figure: the same chained hash table as on the previous slide]
17
Chaining (8/10)
¢ How do we delete an element?
u Do we need a doubly-linked list for efficient delete?

[Figure: the same chained hash table as on the previous slide]
18
Chaining (9/10)
¢ How do we search for an element with a given key?

[Figure: the same chained hash table as on the previous slide]
19
Chaining (10/10)

u CHAINED-HASH-INSERT(T, x)
insert 𝑥 at the head of list 𝑇[ℎ(key[𝑥])]

u CHAINED-HASH-SEARCH(T, k)
search for an element with key 𝑘 in list 𝑇[ℎ(𝑘)]

u CHAINED-HASH-DELETE(T, x)
delete 𝑥 from the list 𝑇[ℎ(key[𝑥])]

20
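A minimal Python sketch of the three chained operations (an illustration, not the slides' pseudocode); the division-method hash ℎ(𝑘) = 𝑘 mod 𝑚 is an assumption here:

```python
# A minimal sketch of chaining: each slot of T holds a Python list used as
# the chain; new elements go at the head, as on the earlier slides.
class ChainedHashTable:
    def __init__(self, m):
        self.m = m
        self.table = [[] for _ in range(m)]   # T[0 .. m-1], each an empty chain

    def _h(self, key):
        return key % self.m                   # division-method hash (assumption)

    def insert(self, key, value):
        self.table[self._h(key)].insert(0, (key, value))   # insert at head of list

    def search(self, key):
        for k, v in self.table[self._h(key)]:              # scan the chain at T[h(k)]
            if k == key:
                return v
        return None

    def delete(self, key):
        chain = self.table[self._h(key)]
        self.table[self._h(key)] = [(k, v) for k, v in chain if k != key]
```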
Practice Problems

¢ Draw a hash table after we insert the keys 5; 28;


19; 15; 20; 33; 12; 17; 10 with collisions resolved
by chaining. Let the table have 9 slots, 𝑇[0, . . , 8],
and the hash function be ℎ(𝑘) = 𝑘 mod 9.

𝑖      0    1             2     3     4    5    6         7    8
𝑇[𝑖]   []   [10, 19, 28]  [20]  [12]  []   [5]  [33, 15]  []   [17]

21
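Using the ChainedHashTable sketch above with m = 9 reproduces this table:

```python
t = ChainedHashTable(9)
for k in [5, 28, 19, 15, 20, 33, 12, 17, 10]:
    t.insert(k, None)
print([[k for k, _ in chain] for chain in t.table])
# [[], [10, 19, 28], [20], [12], [], [5], [33, 15], [], [17]]
```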
Analysis of Chaining

¢ Assume simple uniform hashing: each key in table is


equally likely to be hashed to any slot
¢ Given 𝑛 keys and 𝑚 slots in the table, the
load factor α = 𝑛 / 𝑚 = average # keys per slot
¢ What will be the average cost of an unsuccessful
search for a key?
u 𝑂(1 + α)
¢ What will be the average cost of a successful search?
u 𝑂(1 + α/2) = 𝑂(1 + α)

22
Analysis of Chaining

¢ So, the cost of searching

u 𝑂(1 + α)
¢ If the number of keys 𝑛 is proportional to the
number of slots 𝑚 in the table, what is α?
u α = 𝑂(1)
u In other words, we can make the expected cost of
searching constant if we make α a constant

23
Choosing A Hash Function

¢ Clearly choosing the hash function well is crucial

¢ What are desirable features of the hash function?


u Should distribute keys uniformly into slots
u Should not depend on patterns in the data

24
Hash Functions: The Division Method

¢ Idea
u Map a key 𝑘 into one of the 𝑚 slots by taking the
remainder of 𝑘 divided by 𝑚
ℎ(𝑘) = 𝑘 𝑚𝑜𝑑 𝑚
¢ Advantage
u fast, requires only one operation
¢ Disadvantage
u Certain values of 𝑚 are bad: powers of 2 and non-prime
numbers

25
The Division Method

¢ A good choice for 𝑚: a prime number, not too close


to an exact power of 2
¢ e.g., allocate a hash table, with collisions resolved
through chaining
u 𝑛 = 2000 character strings (8 bits/character)
u Choose m roughly 𝑛/3: 𝑚 = 701 (prime near 2000/3, not
near a power of 2)
u ℎ(𝑘) = 𝑘 mod 701

26
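A one-line sketch of the division method with the slide's parameters (m = 701):

```python
def h_division(k, m=701):
    # Division method: h(k) = k mod m, with m a prime not near a power of 2
    return k % m
```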
Hash Functions: The Multiplication Method

Idea:
¢ Multiply key 𝑘 by a constant 𝐴, 0 < 𝐴 < 1
¢ Extract the fractional part of 𝑘𝐴
¢ Multiply the fractional part by 𝑚
¢ Take the floor of the result
ℎ(𝑘) = ⌊𝑚 (𝑘𝐴 mod 1)⌋
where 𝑘𝐴 mod 1 = 𝑘𝐴 − ⌊𝑘𝐴⌋ is the fractional part of 𝑘𝐴

¢ Disadvantage: Slower than division method


¢ Advantage: Value of 𝑚 is not critical: typically 𝑚 = 2ᵖ

27
The Multiplication Method
¢ For a constant 𝐴, 0 < 𝐴 < 1:
¢ ℎ(𝑘) = ⌊𝑚 (𝑘𝐴 − ⌊𝑘𝐴⌋)⌋, where 𝑘𝐴 − ⌊𝑘𝐴⌋ is the fractional part of 𝑘𝐴
¢ Choose 𝑚 = 2ᵖ
¢ Choose 𝐴 not too close to 0 or 1
¢ Knuth: a good choice is 𝐴 = (√5 − 1)/2

28
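A Python sketch of the multiplication method (an illustration; m = 2¹⁴ and Knuth's constant 𝐴 are assumptions consistent with the slide):

```python
import math

A = (math.sqrt(5) - 1) / 2        # Knuth's suggested constant, about 0.618

def h_multiplication(k, m=2**14):
    frac = (k * A) % 1.0          # fractional part of k*A
    return math.floor(m * frac)   # floor(m * (kA mod 1))
```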
Truncation method

• Ignore part of the key and use the rest as the array
index (converting non-numeric parts)
• Example
If students have a 9-digit identification number, take
the last 3 digits as the table position
– e.g. 925371622 becomes 622

30
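A one-line sketch of truncation for the slide's example (keep the last 3 digits of a 9-digit ID):

```python
def h_truncation(key):
    # Keep only the last 3 digits, e.g., 925371622 -> 622
    return key % 1000
```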
Mid-square method

¢ Middle of square method


¢ This method squares the key value and then
extracts a number of bits from the middle
of the square
¢ The number of bits used to obtain the
address depends on the table size
u If 𝑟 bits are used, the range of values is 0 to 2ʳ − 1
¢ This works well because most or all bits of the
key value contribute to the result
31
Mid-square method

¢ Example
u consider items whose keys are 4-digit numbers in
base 10
u The goal is to hash these key values to a table of
size 100
u This range is equivalent to two digits in base 10,
so two middle digits are extracted
u For example, if the input is the number 4567,
squaring gives an 8-digit number, 20857489
u The middle two digits of this result are 57
32
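A sketch of the mid-square method for the slide's example (4-digit keys, table of size 100); taking exactly the middle two decimal digits is the assumption here:

```python
def h_midsquare(key):
    # Square the key and take the middle two digits of the 8-digit result,
    # e.g., 4567**2 = 20857489 -> middle digits "57"
    sq = str(key * key).zfill(8)
    mid = len(sq) // 2
    return int(sq[mid - 1:mid + 1])
```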
Folding method
¢ Partition the key into several parts of the same length
except for the last
¢ These parts are then added together to obtain the hash
address for the key
¢ Two ways of carrying out this addition
u Shift-folding
u Folding and reverse

33
Folding method

¢ Example

34
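A sketch of shift-folding (an illustration; 3-digit parts and a table size of 1000 are assumptions): partition the key into equal-length parts and add them.

```python
def h_shift_folding(key, part_len=3, m=1000):
    # Partition the decimal key into parts of part_len digits (the last part
    # may be shorter), add the parts, and reduce modulo the table size m.
    digits = str(key)
    parts = [int(digits[i:i + part_len]) for i in range(0, len(digits), part_len)]
    return sum(parts) % m

# e.g., 925371622 -> 925 + 371 + 622 = 1918 -> 918 (assuming m = 1000)
```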
Digit analysis method
¢ All the keys in the table are known in advance
¢ The index is formed by extracting, and then
manipulating specific digits from the key
¢ For example, if the key is 925371622 and we select
the digits in positions 5 through 7, the result is 537
¢ The manipulation can then take many forms
u Reversing the digits (735)
u Performing a circular shift to the right (753)
u Performing a circular shift to the left (375)
u Swapping each pair of digits (357)
35
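A small sketch of digit analysis; the digit positions selected below (the 3rd through 5th from the left, which reproduce the slide's 537) and the choice of manipulation are assumptions:

```python
def h_digit_analysis(key):
    # Extract three preselected digits (925371622 -> "537"),
    # then manipulate them, e.g., by a circular shift to the right -> 753.
    digits = str(key)
    selected = digits[2] + digits[3] + digits[4]   # assumed positions
    return int(selected[-1] + selected[:-1])       # circular right shift
```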
Open Addressing (1/9)
¢ If we have enough contiguous memory to store all the
keys (𝑚 > 𝑛)
⇒ store the keys in the table itself
¢ No need to use the linked lists anymore
¢ Collision resolution
u Put the elements that collide in the available empty places in
the table

39
Open Addressing (2/9)
¢ Basic idea:
u To insert: if slot is full, try another slot, …, until an open slot is
found (probing)
u To search, follow same sequence of probes as would be used
when inserting the element
ê If reach element with correct key, return it
ê If reach a NULL pointer, element is not in table

¢ Good for fixed sets (adding but no deletion)


u Example: spell checking
¢ Table needn’t be much bigger than 𝑛

40
Open Addressing (3/9)

[Figure: an empty table with slots indexed 0 … n−1; elements will be stored directly in the slots]
41
Open Addressing (4/9)

[Figure: key1:data1 hashes to index 3 and is stored directly in slot 3]
42
Open Addressing (5/9)

[Figure: key2:data2 hashes to index 1 and is stored in slot 1]
43
Open Addressing (6/9)

[Figure: key3:data3 hashes to index 4 and is stored in slot 4]
44
Open Addressing (7/9)

[Figure: key4:data4 also hashes to index 3, but slot 3 is already occupied by key1:data1 — where should it go?]
45
Open Addressing (8/9)

[Figure: key4:data4 is placed in the next available slot, slot 5 (slot 4 is also occupied)]
46
Open Addressing (9/9)

47
Linear Probing (1/3)

¢ When there is a collision, check the next available


position in the table (probing)

ℎ(𝑘, 𝑖) = (ℎ1(𝑘) + 𝑖) mod 𝑚

¢ First slot probed: ℎ1(𝑘), second: (ℎ1(𝑘) + 1) mod 𝑚, and so on

48
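A short sketch of the linear-probing sequence (illustration only; ℎ1(𝑘) = 𝑘 mod 𝑚 is an assumption):

```python
def probe_sequence(k, m):
    # Linear probing: h(k, i) = (h1(k) + i) mod m, for i = 0, 1, ..., m-1
    h1 = k % m
    return [(h1 + i) % m for i in range(m)]

# e.g., probe_sequence(15, 11) starts 4, 5, 6, ...
```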
Linear Probing (2/3)

¢ Searching for a key


¢ Three situations:
u Position in table is occupied with an element of equal key
u Position in table is empty
u Position in table is occupied with a different element
¢ Probe the next higher index until the element is found or an empty position is reached
¢ The process wraps around to the beginning of the table

49
Linear Probing (3/3)

¢ Deleting a key
u Cannot simply mark the slot as empty
u Otherwise it becomes impossible to retrieve keys inserted after that
slot was occupied
¢ Solution
u Mark the slot with a sentinel value DELETED
¢ The deleted slot can later be used for insertion
¢ Searching will still be able to find all the keys

50
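A minimal open-addressing sketch with linear probing and a DELETED sentinel (an illustration, not the slides' pseudocode; ℎ1(𝑘) = 𝑘 mod 𝑚 is an assumption):

```python
DELETED = object()   # sentinel marking a slot whose element was removed

class LinearProbingTable:
    def __init__(self, m):
        self.m = m
        self.slots = [None] * m

    def _probe(self, key):
        for i in range(self.m):
            yield (key % self.m + i) % self.m    # h(k, i) = (h1(k) + i) mod m

    def insert(self, key, value):
        for j in self._probe(key):
            if self.slots[j] is None or self.slots[j] is DELETED:
                self.slots[j] = (key, value)     # reuse empty or DELETED slots
                return j
        raise RuntimeError("hash table overflow")

    def search(self, key):
        for j in self._probe(key):
            if self.slots[j] is None:            # empty slot: key is not present
                return None
            if self.slots[j] is not DELETED and self.slots[j][0] == key:
                return self.slots[j][1]
        return None

    def delete(self, key):
        for j in self._probe(key):
            if self.slots[j] is None:
                return
            if self.slots[j] is not DELETED and self.slots[j][0] == key:
                self.slots[j] = DELETED          # mark, don't empty, the slot
                return
```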
Practice Problems
¢ Consider a hash table of length 11 using open addressing
with the primary hash function ℎ1(𝑘) = 𝑘. Illustrate the
result of inserting 31, 4, 15, 28, 59 using linear probing.
u Hash function ℎ(𝑘, 𝑖) = (ℎ1(𝑘) + 𝑖) mod 𝑚 = (𝑘 + 𝑖) mod 𝑚
ê ℎ(31,0) = 9 ⇒ 𝑇[9] = 31
ê ℎ(4,0) = 4 ⇒ 𝑇[4] = 4
ê ℎ(15,0) = 4 collision
o ℎ(15,1) = 5 ⇒ 𝑇[5] = 15
ê ℎ(28,0) = 6 ⇒ 𝑇[6] = 28
ê ℎ(59,0) = 4 collision
o ℎ(59,1) = 5 collision
o ℎ(59,2) = 6 collision
o ℎ(59,3) = 7 ⇒ 𝑇[7] = 59

51
Double Hashing (1/3)

¢ Use a first hash function to determine the first slot


¢ Use a second hash function to determine the increment
for the probe sequence
ℎ(𝑘, 𝑖) = (ℎ1(𝑘) + 𝑖 ℎ2(𝑘)) mod 𝑚
¢ Initial probe: ℎ1(𝑘); the second probe is offset by ℎ2(𝑘) mod 𝑚, and
so on
¢ Advantage: avoids clustering
¢ Disadvantage: harder to delete an element

52
Double Hashing (2/3)

ℎ1(𝑘) = 𝑘 mod 13
ℎ2(𝑘) = 1 + (𝑘 mod 11)
ℎ(𝑘, 𝑖) = (ℎ1(𝑘) + 𝑖 ℎ2(𝑘)) mod 13

[Figure: table of size 13 with 79 in slot 1, 69 in slot 4, 98 in slot 5, 72 in slot 7, and 50 in slot 11; key 14 ends up in slot 9]

¢ Insert key 14:
ℎ1(14) = 14 mod 13 = 1 (slot 1 is occupied)
ℎ2(14) = 1 + (14 mod 11) = 4
ℎ(14, 1) = (ℎ1(14) + ℎ2(14)) mod 13 = (1 + 4) mod 13 = 5 (slot 5 is occupied)
ℎ(14, 2) = (ℎ1(14) + 2 ℎ2(14)) mod 13 = (1 + 8) mod 13 = 9 ⇒ 𝑇[9] = 14
53
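A short snippet reproducing the probe sequence for key 14 from this slide:

```python
m = 13

def h1(k): return k % m
def h2(k): return 1 + (k % 11)

def double_hash_probes(k, tries):
    # h(k, i) = (h1(k) + i*h2(k)) mod m
    return [(h1(k) + i * h2(k)) % m for i in range(tries)]

print(double_hash_probes(14, 3))   # [1, 5, 9]: slots 1 and 5 are occupied, so 14 lands in slot 9
```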
Double Hashing (3/3)
Choosing the second hash function
ℎ(𝑘, 𝑖) = (ℎ1(𝑘) + 𝑖 ℎ2(𝑘)) mod 𝑚
¢ ℎ2 should never evaluate to 0 ⇒ otherwise the probe sequence never advances (infinite loop)
¢ ℎ2 should be relatively prime to the table size 𝑚

¢ Solution 1
u Choose m prime
u Design ℎ2 such that it produces an integer less than 𝑚
¢ Solution 2
u Choose 𝑚 = 2ᵖ
u Design ℎ2 such that it always produces an odd number

54
Practice Problems
¢ Consider a hash table of length 11 using open addressing.
Illustrate the result of inserting 31, 4, 15, 28, 59 using
double hashing with functions ℎ1(𝑘) = 𝑘 and ℎ2(𝑘) = 1 + (𝑘
mod (𝑚 − 1)).
u Hash function ℎ(𝑘, 𝑖) = (𝑘 + 𝑖(1 + (𝑘 mod (𝑚 − 1)))) mod 𝑚
ê ℎ(31,0) = 9 ⇒ 𝑇[9] = 31
ê ℎ(4,0) = 4 ⇒ 𝑇[4] = 4
ê ℎ(15,0) = 4 collision
o ℎ(15,1) = 10 ⇒ 𝑇[10] = 15
ê ℎ(28,0) = 6 ⇒ 𝑇[6] = 28
ê ℎ(59,0) = 4 collision
o ℎ(59,1) = 3 ⇒ 𝑇[3] = 59

55
