Hash Tables

Data Structures

Hash Tables: Overview
• Provide very fast insertion and searching
◦ Both are O(1)
◦ Is this too good to be true?

• Disadvantages
◦ Based on arrays, so the size must be known in advance
◦ Performance degrades when the table becomes full
◦ No convenient way to sort data

• Summary
◦ Best structure if you have no need to visit items in order and you can predict the size of your database in advance.

Motivation
• Let's suppose we want to insert a key into a data structure, where the key can fall into a range from 0 to m, where m is very large
• And we want to be able to find the key quickly

• Clearly, one option is to use an array of size m+1
◦ Put each key into its corresponding slot (0 to m)
◦ Searching is then constant time

• But we may only have n keys, where n << m
◦ So we waste a ton of space!

Motivation
• Moreover, we may not be storing integers
• For example, if we store words of the dictionary
◦ Our 'range' is 'a' to 'zyzzyva'
◦ And we're talking hundreds of thousands of words in between

• So the 'mapping' is not necessarily clear
• For example, we won't know immediately that the word 'frog' is the 85,467th word in the dictionary
◦ And by the way, I'm not claiming that number is right!

Hash Table
• Idea
◦ Provide easy searching and insertion by mapping keys to positions in an array
◦ This mapping is provided by a hash function
▪ Takes the key as input
▪ Produces an index as output

• The array is called a hash table.

Hash Function: Example
• The easiest hash function is the following:
◦ H(key) = key % tablesize
◦ H(key) now contains a value between 0 and tablesize-1
• So if we inserted the following keys into a table of size 10: 13, 11456, 2001, 157
◦ You probably already see potential for collisions
◦ Patience, we'll come to it!

Resulting table (size 10): [1]=2001, [3]=13, [6]=11456, [7]=157, all other slots empty

What have we accomplished?
• We have stored keys from an unpredictably large range into a smaller data structure
• And searching and inserting become easy!
• To find a key k, just retrieve table[H(k)]
• To insert a key k, just set table[H(k)] = k
• Both are O(1)!

Table (size 10): [1]=2001, [3]=13, [6]=11456, [7]=157

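Here is a minimal Java sketch of this idea (my own illustration, not code from the book). It assumes positive integer keys, uses -1 to mark an empty slot, and ignores collisions entirely, which the next slides address:

    // Naive hash table: direct mapping, no collision handling.
    // Assumes positive integer keys; -1 marks an empty slot.
    class NaiveHashTable {
        private final int[] table;

        NaiveHashTable(int size) {
            table = new int[size];
            java.util.Arrays.fill(table, -1);        // all slots start empty
        }

        private int h(int key) { return key % table.length; }    // H(k) = k % tablesize

        void insert(int key) { table[h(key)] = key; }             // O(1), but overwrites on a collision!

        boolean find(int key) { return table[h(key)] == key; }    // O(1)
    }
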
What’s our price?
 Of course, you’ve probably
Index Value
already realized, multiple
0
values in our range could
map to the same hash table 1 2001
index 2
 For example, if we used: 3 13
◦ H(k) = k % 10
4
5
 Then, tried to insert 207
6 11456
◦ H(207) = 7
 We have a collision at 7 157
position 7 8
9
8
What have we learned?
• If we use hash tables, we need the following:
◦ Some way of handling collisions. We'll study a couple of ways:
▪ Open addressing
▪ Which has 3 kinds: linear probing, quadratic probing, and double hashing
▪ Separate chaining

• Also, the choice of the hash function is delicate
◦ We can produce hash functions which are more or less likely to have high collision frequencies
◦ We'll look at potential options

Linear Probing
• Presumably, you will have defined your hash table size to be 'safe'
◦ As in, larger than the maximum number of items you expect to store
• As a result, there should be some available cells

• In linear probing, if an insertion results in a collision, search sequentially until a vacant cell is found
◦ Use wraparound if necessary

Linear Probing: Example
• Again, say we insert element 207
• H(207) = 207 % 10 = 7

• This results in a collision with element 157
• So we search linearly for the next available cell, which is at position 8
◦ And put 207 there

Table (size 10): [1]=2001, [3]=13, [6]=11456, [7]=157, [8]=207

Linear Probing
• Note: This complicates insertion and searching a bit!

• For example, if we then inserted element 426, we would have to check three cells before finding a vacant one at position 9

• And searching is not simply a matter of applying H(k)
◦ You apply H(k), and probe!

Table (size 10): [1]=2001, [3]=13, [6]=11456, [7]=157, [8]=207, [9]=426

Linear Probing: Clusters
• As the table below illustrates, linear probing also tends to result in the formation of clusters
◦ Where long runs of consecutive cells are populated
◦ And other large stretches are sparse

• This becomes worse as the table fills up
◦ Degrades performance

Table (size 10): [1]=2001, [3]=13, [6]=11456, [7]=157, [8]=207, [9]=426

Linear Probing: Clusters
• LaFore: A cluster is like a 'fainting scene' at a mall
• Initially, the first arrivals come
◦ Later arrivals come because they wonder why everyone was in one place
◦ As the crowd gets bigger, more are attracted

• Same thing with clusters!
◦ Items that hash to a value in the cluster will add to its size

Table (size 10): [1]=2001, [3]=13, [6]=11456, [7]=157, [8]=207, [9]=426

Linear Probing
• One option: If the table becomes full enough, double its size
• Note this is not quite as simple as it seems
◦ Because for every value inside, you have to recompute its hash value
◦ The hash function is necessarily different:
▪ H(k) = k % 20
◦ But, less clustering

Table after doubling (size 20): [1]=2001, [6]=426, [7]=207, [13]=13, [16]=11456, [17]=157

Linear Probing
• Linear probing is the simplest way to handle collisions, and is thus worthy of explanation
• Let's look at the Java implementation on page 533 (a rough sketch in the same spirit follows below)
◦ This assumes a class with member variables:
▪ hashArray (the hash table)
▪ arraySize (the size of the hash table)
◦ Assume an empty slot contains -1

• We'll construct:
◦ hashFunc()
◦ find()
◦ insert()
◦ delete()

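The page-533 listing isn't reproduced here, but a rough sketch in the same spirit might look like this. The -2 'deleted' marker is my assumption (some tombstone value is needed so searches can probe past removed items), not necessarily what the book uses, and the sketch assumes the table never becomes completely full:

    // Open addressing with linear probing.
    // -1 marks an empty slot (per the slide); -2 marks a deleted slot (my choice).
    class LinearProbeHash {
        private static final int EMPTY = -1, DELETED = -2;
        private final int[] hashArray;
        private final int arraySize;

        LinearProbeHash(int size) {
            arraySize = size;
            hashArray = new int[size];
            java.util.Arrays.fill(hashArray, EMPTY);
        }

        private int hashFunc(int key) { return key % arraySize; }

        void insert(int key) {                        // assumes the table is not full
            int i = hashFunc(key);
            while (hashArray[i] != EMPTY && hashArray[i] != DELETED)
                i = (i + 1) % arraySize;              // step to the next cell, wrapping around
            hashArray[i] = key;
        }

        int find(int key) {                           // returns the index, or -1 if absent
            int i = hashFunc(key);
            while (hashArray[i] != EMPTY) {           // a truly empty cell ends the search
                if (hashArray[i] == key) return i;
                i = (i + 1) % arraySize;
            }
            return -1;
        }

        void delete(int key) {
            int i = find(key);
            if (i != -1) hashArray[i] = DELETED;      // tombstone: later probes continue past it
        }
    }
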
Quadratic Probing
• The main problem with linear probing was its potential for clustering
• Quadratic probing attempts to address this
◦ Instead of linearly searching for the next available cell
▪ i.e. for hash x, search cells x+1, x+2, x+3, x+4…
◦ Search quadratically
▪ i.e. for hash x, search cells x+1, x+4, x+9, x+16, x+25…

• Idea
◦ On a collision, initially assume a small cluster and go to x+1
◦ If that's occupied, assume a larger cluster and go to x+4
◦ If that's occupied, assume an even larger cluster, and go to x+9

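Relative to linear probing, only the step pattern changes. Here is a sketch of the insertion loop, reusing the hashArray and arraySize members from the linear-probing sketch above:

    // Quadratic-probe insert: try offsets 1, 4, 9, 16, ... from the home slot.
    // Assumes an empty cell is reachable (quadratic probing does not guarantee
    // this in general, which is one more argument for double hashing, below).
    void insertQuadratic(int key) {
        int home = key % arraySize;
        int i = home;
        int step = 0;
        while (hashArray[i] != -1) {
            step++;
            i = (home + step * step) % arraySize;     // x+1, x+4, x+9, ...
        }
        hashArray[i] = key;
    }
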
Quadratic Probing: Example
• Returning to our old example with inserting 207
• H(207) = 207 % 10 = 7

• This results in a collision with element 157
• In this case, slot 7 is occupied but slot 7+1=8 is open, so we put it there

Table (size 10) after inserting 207: [1]=2001, [3]=13, [6]=11456, [7]=157, [8]=207

Quadratic Probing
• Now, if we insert 426
• H(426) = 426 % 10 = 6
◦ Which is occupied
• Slot 6+1=7 is also occupied

• So we check slot:
◦ 6+4=10
◦ This passes the end, so we wrap around to slot 0 and insert there

Table (size 10): [0]=426, [1]=2001, [3]=13, [6]=11456, [7]=157, [8]=207

Quadratic Probing
• We have achieved a decrease in the cluster count
• Clusters will tend to be smaller and more sparse
◦ Instead of having large clusters
◦ And largely sparse areas

• Thus quadratic probing got rid of what we call primary clustering.

Table (size 10): [0]=426, [1]=2001, [3]=13, [6]=11456, [7]=157, [8]=207

Quadratic Probing
• Quadratic probing does, however, suffer from secondary clustering

• Where, if you have several keys hashing to the same value:
◦ The first collision requires one probe
◦ The second requires four
◦ The third requires nine
◦ The fourth requires sixteen

Table (size 10): [0]=426, [1]=2001, [3]=13, [6]=11456, [7]=157, [8]=207

Quadratic Probing
• Secondary clustering would happen if we inserted, for example:
◦ 827, 10857, 707, 1117
◦ Because they all hash to 7

• Not as serious a problem as primary clustering

• But there is a better solution that avoids both.

Table (size 10): [0]=426, [1]=2001, [3]=13, [6]=11456, [7]=157, [8]=207

Double Hashing
• The problem thus far is that the probe sequences are always the same
◦ For example: linear probing always generates x+1, x+2, x+3...
◦ Quadratic probing always generates x+1, x+4, x+9…

• Solution: Make both the hash location and the probe dependent upon the key
◦ Hash the key once to get the location
◦ Hash the key a second time to get the probe

• This is called double hashing.

Second Hash Function
• Characteristics of the hash function for the probe:
◦ It cannot be the same as the first hash function
◦ It can NEVER hash to zero
▪ Why not?

• Experts have discovered that this type of hash function works well for the probe:
◦ probe = c – (key % c)
◦ Where c is a prime number that is smaller than the array size

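As a small sketch, with c = 5 to match the running example that follows:

    // Second hash function: computes the probe step; can never return zero.
    int hashFunc2(int key) {
        final int c = 5;            // a prime smaller than the array size
        return c - (key % c);       // always in the range 1..c
    }
    // e.g. hashFunc2(207) = 5 - (207 % 5) = 5 - 2 = 3
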
Double Hashing: Example
• Returning to our old example with inserting 207
• H(207) = 207 % 10 = 7

• This results in a collision with element 157
• So we hash again, to get the probe
◦ Suppose we choose c=5
◦ Then:
▪ P(207) = 5 – (207 % 5)
▪ P(207) = 5 – 2 = 3

Table (size 10): [1]=2001, [3]=13, [6]=11456, [7]=157

Double Hashing: Example
• So we insert 207 at position:
◦ H(207) + P(207) = 7 + 3 = 10

• Wrapping around, this will put 207 at position 0

Table (size 10): [0]=207, [1]=2001, [3]=13, [6]=11456, [7]=157

Double Hashing: Example
• Now, let's again insert value 426
• We run the initial hash:
◦ H(426) = 426 % 10 = 6
• We get a collision, so we probe:
◦ P(426) = 5 – (426 % 5)
◦ = 5 – 1 = 4
• And insert at location:
◦ H(426) + P(426) = 10
◦ Wrapping around, we get 0. Another collision!

Table (size 10): [0]=207, [1]=2001, [3]=13, [6]=11456, [7]=157

Double Hashing: Example
• So, we probe again
◦ P(426) = 4
• So we insert at location 0+4 = 4, and this time there is no collision

• Double hashing will in general produce the fewest clusters
◦ Because both the hash and probe are key-dependent

Table (size 10): [0]=207, [1]=2001, [3]=13, [4]=426, [6]=11456, [7]=157

Java Implementation, page 547
• Let's try this again (a rough sketch follows below):
◦ Again, we have our hash table stored in hashArray
◦ And arraySize as the size of the hash table
◦ Again, assume positive integers and all entries are initially -1

• Let's construct:
◦ hashFunc()
◦ hashFunc2()
◦ find()
◦ insert()
◦ delete()

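Rather than copying the page-547 listing, here is a rough equivalent sketch. The constant 5 in hashFunc2() mirrors the running example; as the next slides explain, a real table would also make arraySize prime. delete() is omitted: it would again need a tombstone marker, exactly as in the linear-probing sketch.

    // Open addressing with double hashing; -1 marks an empty slot.
    class DoubleHash {
        private final int[] hashArray;
        private final int arraySize;               // should be prime -- see below

        DoubleHash(int size) {
            arraySize = size;
            hashArray = new int[size];
            java.util.Arrays.fill(hashArray, -1);
        }

        private int hashFunc(int key)  { return key % arraySize; }
        private int hashFunc2(int key) { return 5 - (key % 5); }  // step in 1..5

        void insert(int key) {                     // assumes the table is not full
            int i = hashFunc(key);
            int step = hashFunc2(key);             // key-dependent probe step
            while (hashArray[i] != -1)
                i = (i + step) % arraySize;        // advance by 'step', wrapping around
            hashArray[i] = key;
        }

        int find(int key) {                        // returns the index, or -1 if absent
            int i = hashFunc(key);
            int step = hashFunc2(key);
            while (hashArray[i] != -1) {
                if (hashArray[i] == key) return i;
                i = (i + step) % arraySize;
            }
            return -1;
        }
    }
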
Note…
• What is a potential problem with choosing a hash table of size 10 and a c of 5 for the probe, as we just did?

• Suppose we had a value k where H(k) = 0 and P(k) = 5
◦ i.e., k = 0

• What would the probe sequence be?

• What's the problem?

Probe Sequence
• The probe sequence may never find an open cell!
• Because H(0) = 0, we'll start at hash location 0
◦ If we have a collision, P(0) = 5, so we'll next check 0+5=5
◦ If we have a collision there, we'll next check 5+5=10; with wraparound we get 0
◦ We'll infinitely check 0 and 5, and never find an open cell!

Double Hashing Requirement
• The root of the problem is that the table size is not prime!
◦ For example, if the size were 11, the probe sequence visits every cell:
◦ 0, 5, 10, 4, 9, 3, 8, 2, 7, 1, 6
◦ If there is even one open cell, the probing is guaranteed to find it

• Thus, very important – a requirement of double hashing is that the table size is prime.
◦ So our previous table size of 10 is not a good idea
◦ We would want 11, or 13, etc.

• Generally, for open addressing, double hashing is best

Separate Chaining
• The alternative to open addressing
• Does not involve probing to different locations in the hash table
• Rather, every location in the hash table contains a linked list of keys

Separate Chaining
• Simple case: a 7-element hash table
• H(k) = k % 7
• So:
◦ 21, 77 each hash to location 0
◦ 72 hashes to location 2
◦ 75, 5, 19 hash to location 5

• Each is simply appended to the correct linked list

Separate Chaining
• In separate chaining, trouble happens when a list gets too full
• Generally, we want to keep the size of the biggest list, call it M, much smaller than N
◦ Searching and insertion will then take O(M) time in the worst case

Java Implementation
• Let's look at pages 555-557 (a rough sketch follows below)
• Note: We will need a linked list and the hash table!
◦ Will take a little time

• I'm going to use an unsorted list for simplicity
◦ A sorted list will speed up searching, slow down insertion

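A compact sketch of the same idea, using java.util.LinkedList for the unsorted per-slot lists instead of the book's hand-written list class:

    import java.util.LinkedList;

    // Separate chaining: each slot holds an (unsorted) linked list of keys.
    class ChainHash {
        private final LinkedList<Integer>[] table;

        @SuppressWarnings("unchecked")
        ChainHash(int size) {
            table = new LinkedList[size];
            for (int i = 0; i < size; i++)
                table[i] = new LinkedList<>();     // one empty list per slot
        }

        private int hashFunc(int key) { return key % table.length; }

        void insert(int key) {
            table[hashFunc(key)].add(key);         // append to this slot's list
        }

        boolean find(int key) {                    // O(M) scan of one list
            return table[hashFunc(key)].contains(key);
        }

        void delete(int key) {
            table[hashFunc(key)].remove(Integer.valueOf(key));  // remove by value, not index
        }
    }
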
A Good Hash Function
• Has two properties:
◦ Is computable quickly, so as not to degrade performance of insertion and searching
◦ Can take a range of key values and transform them into indices such that the key values are distributed randomly across the hash table

• For random keys, the modulo (%) operator is good
• It is not always an easy task!

For example…
• Data can be highly non-random
• For example, a car-part ID:
◦ 033-400-03-94-05-0-535

• For each set of digits, there can be a unique range or set of values!
◦ i.e. Digits 3-5 could be a category code, where the only acceptable values are 100, 150, 200, 250, up to 850
◦ Digits 6-7 could be a month of introduction (0-12)
◦ Digit 12 could be “yes” or “no” (0 or 1)
◦ Digits 13-15 could be a checksum, a function of all the other digits in the code

Rule #1: Don’t Use Non-Data
 Compress the key fields down enough until
every bit counts
 For example:

◦ The category (bits 3-5, with restricted values 100,


150, 200, … , 850) counting by 50s needs to be
compressed down to run from 0 to 15
◦ The checksum is not necessary, and should be
removed. It is a function of the rest of the code
and thus redundant with respect to the hash table

39
Rule #2: Use All of the Data
• Every part of the key should contribute to the hash function
• The more data portions that contribute to the key, the more likely it is that the keys will hash evenly (a sketch combining both rules follows below)
◦ Avoiding collisions, which cause trouble no matter what algorithm you use

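As a hedged illustration of Rules #1 and #2 together (the field layout and the mixing multipliers are made up for this sketch, not taken from the book):

    // Fold every meaningful field of a multi-part key into the hash,
    // after compressing each field so that no bits are wasted.
    int partHash(int category, int month, int flag, int tableSize) {
        int cat = (category - 100) / 50;    // 100, 150, ..., 850  ->  0..15
        int combined = cat;
        combined = combined * 13 + month;   // mix in the month field (0-12: 13 values)
        combined = combined * 2 + flag;     // mix in the yes/no digit (0 or 1)
        return combined % tableSize;        // the checksum field is omitted: it's redundant
    }
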
Rule #3: Use a Prime Number for Modulo Base
• This is a requirement for double hashing
• Important for quadratic probing
• Especially important if the keys may not be randomly distributed
◦ The more keys that share a divisor with the array size, the more collisions
◦ Example: non-random data which are multiples of 50
▪ If the table size is 50, they all hash to the same spot
▪ If the table size is 10, they all hash to the same spot
▪ If the table size is 53, no keys divide evenly into the table size. Better!

Hashing Efficiency
• Insertion and searching are O(1) in the best case
◦ This implies no collisions
◦ If you minimize collisions, you can approach this runtime

• If collisions occur:
◦ Access times depend on the resulting probe lengths
◦ Every probe equals one more access
◦ So the worst-case insertion or search time is proportional to:
▪ The number of required probes, if you use open addressing
▪ The number of links in the longest list, if you use separate chaining
