
CSE 326: Data Structures
More Hashing Techniques

Hannah Tang and Brian Tjaden
Summer Quarter 2002

Remember This List?
• How should we resolve collisions?
• What should the table size be?
• What should the hash function be?
• How well does hashing work in the real world?
  – We'll see a case study today!

Hashing Dilemma

Suppose your WorstEnemy 1) knows your hash function, and 2) gets to decide which keys to send you.

Faced with this enticing possibility, WorstEnemy decides to:
a) Send you keys which maximize collisions for your hash function.
b) Take a nap.

Moral: No single hash function can protect you!

Faced with this dilemma, you:
a) Give up and use a linked list for your Dictionary.
b) Drop out of software, and choose a career in fast foods.
c) Run and hide.
d) Proceed to the next slide, in hope of a better alternative.

Universal Hashing¹

Suppose we have a set K of possible keys, and a finite set H of hash functions that map keys to entries in a hashtable of size m.
[Figure: functions hi, hj in H mapping keys k1, k2 in K to slots 0 … m−1.]

Definition:
H is a universal collection of hash functions if and only if, for any two keys k1, k2 in K, there are at most |H|/m functions in H for which h(k1) = h(k2).

• So … if we randomly choose a hash function from H, our chances of collision are no more than if we got to choose hash table entries at random!

¹ Motivation: see previous slide (or visit http://www.burgerking.com/jobs)
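To make the |H|/m bound concrete, here is a small Python check (not from the slides) using the classic ((a·k + b) mod p) mod m family, which is known to be universal; the values of p and m and all names are chosen purely for illustration:

# Brute-force check of the universal-hashing definition on a tiny family.
p, m = 11, 5                                           # p prime, table size m
H = [(a, b) for a in range(1, p) for b in range(p)]    # |H| = p*(p-1) = 110

def h(a, b, k):
    return ((a * k + b) % p) % m

for k1 in range(p):
    for k2 in range(k1 + 1, p):
        collisions = sum(1 for (a, b) in H if h(a, b, k1) == h(a, b, k2))
        assert collisions <= len(H) // m               # at most |H|/m = 22 of the 110 functions
print("universal: no key pair collides under more than", len(H) // m, "functions")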

Random Hashing – Not!

How can we "randomly choose a hash function"?
– Certainly we cannot randomly choose hash functions at runtime, interspersed amongst the inserts, finds, and deletes! Why not?
• We can, however, randomly choose a hash function each time we initialize a new hashtable.

Conclusions
– WorstEnemy never knows which hash function we will choose – neither do we!
– No single input (set of keys) can always evoke worst-case behavior.

Good Hashing:
Universal Hash Function A (UHFa)

Parameterized by a prime table size and a vector of r integers:
    a = <a1 … ar> where 0 <= ai < size

Represent each key as a vector k of r integers, where ki < size
– size = 11, key = 39752 ==> <3,9,7,5,2>
– size = 29, key = "hello world" ==> <8,5,12,12,15,23,15,18,12,4>

    ha(k) = ( Σi=1..r ai·ki ) mod size

UHFa: Example

• Context: hash strings of length 3 in a table of size 131, mapping each character to its ASCII code ('x' = 120, 'y' = 121, 'z' = 122)

Let a = <35, 100, 21>
    ha("xyz") = (35*120 + 100*121 + 21*122) % 131
              = 129

Let b = <25, 90, 83>
    hb("xyz") = (25*120 + 90*121 + 83*122) % 131
              = 43

Thinking about UHFa

Strengths:
– Works on any type, as long as you can map keys to vectors
– If we're building a static table, we can try many values of the hash vector <a>
– Random <a> has guaranteed good properties no matter what we're hashing

Weaknesses:
– Must choose a prime table size larger than any ki
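A minimal Python sketch of UHFa as defined above (the function names and the character-to-integer mapping are mine, chosen to reproduce the "xyz" example):

import random

def make_uhfa(size, r):
    """Randomly choose a UHFa: a vector a of r integers in [0, size)."""
    # size should be prime and larger than any k_i (see the weakness above)
    return [random.randrange(size) for _ in range(r)]

def uhfa(a, key_vector, size):
    """h_a(k) = (sum over i of a_i * k_i) mod size."""
    return sum(ai * ki for ai, ki in zip(a, key_vector)) % size

# Reproducing the slide's example: strings of length 3, table size 131,
# with each character mapped to its ASCII code (120, 121, 122 for "xyz").
xyz = [ord(c) for c in "xyz"]
print(uhfa([35, 100, 21], xyz, 131))   # 129
print(uhfa([25, 90, 83],  xyz, 131))   # 43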

Good Hashing:
Universal Hash Function B (UHFb)

Parameterized by j, a, and b:
– j * size should fit into an int
– a and b must be less than size

    hj,a,b(k) = ((a·k + b) mod (j·size)) / j     (integer division)

UHFb: Example

Context: hash integers in a table of size 160

Let j = 32, a = 13, b = 142
    hj,a,b(1000) = ((13*1000 + 142) % (32*160)) / 32
                 = (13142 % 5120) / 32
                 = 2902 / 32
                 = 90

Let j = 31, a = 82, b = 112
    hj,a,b(1000) = ((82*1000 + 112) % (31*160)) / 31
                 = (82112 % 4960) / 31
                 = 2752 / 31
                 = 88
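A minimal Python sketch of UHFb as defined above (names are mine). The second function illustrates why the next slide calls UHFb "very efficient if j and size are powers of 2": the mod and the division collapse into a mask and a shift.

def uhfb(j, a, b, size, k):
    """h_{j,a,b}(k) = ((a*k + b) mod (j*size)) / j, using integer division."""
    return ((a * k + b) % (j * size)) // j

# Reproducing the slide's examples (table size 160):
print(uhfb(32, 13, 142, 160, 1000))   # 90
print(uhfb(31, 82, 112, 160, 1000))   # 88

def uhfb_pow2(j, a, b, size, k):
    """Same function when j and size are powers of two: mask, then shift."""
    return ((a * k + b) & (j * size - 1)) >> (j.bit_length() - 1)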

Thinking about UHFb

Strengths:
– If we're building a static table, we can try many parameter values
– Random a, b has guaranteed good properties no matter what we're hashing
– Can choose any size table
– Very efficient if j and size are powers of 2 – why?

Weaknesses:
– Need to turn non-integer keys into integers

Perfect Hashing

When we know the entire key set in advance …
– Examples: programming language keywords, CD-ROM file list, spelling dictionary, etc.

… then perfect hashing lets us achieve:
– Worst-case O(1) time complexity!
– Worst-case O(n) space complexity!

Perfect Hashing Technique

• Static set of n known keys
• Separate chaining, two-level hash
• Primary hash table size = n
• jth secondary hash table size = nj² (where nj keys hash to slot j in the primary hash table)
• Universal hash functions in all hash tables
• Conduct (a few!) random trials, until we get collision-free hash functions
[Figure: primary hash table whose slots point to secondary hash tables.]

Perfect Hashing Theorems²

Theorem: If we store n keys in a hash table of size n² using a randomly chosen universal hash function, then the probability of any collision is < ½.

Theorem: If we store n keys in a hash table of size m = n using a randomly chosen universal hash function, then
    E[ Σj=0..m−1 nj² ] < 2n
where nj is the number of keys hashing to slot j.

Corollary: If we store n keys in a hash table of size m = n using a randomly chosen universal hash function, and we set the size of each secondary hash table to mj = nj², then:
a) The probability that the total storage used for all secondary hash tables exceeds 4n is less than ½.
b) The expected amount of storage required for all secondary hash tables is less than 2n.

² Intro to Algorithms, 2nd ed., Cormen, Leiserson, Rivest, Stein
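A minimal Python sketch of the two-level technique (names are mine; it uses the standard ((a·k + b) mod p) mod m universal family and retries exactly as the theorems suggest: rebuild the primary level while the squared bucket sizes sum to 4n or more, and rebuild each secondary table until it is collision-free):

import random

P = 2_000_003                      # a prime larger than any key below (assumption)

def random_uhf(m):
    """Randomly choose h(k) = ((a*k + b) mod P) mod m from a universal family."""
    a, b = random.randrange(1, P), random.randrange(P)
    return lambda k: ((a * k + b) % P) % m

def build_perfect(keys):
    n = len(keys)
    # Primary level: retry until total secondary storage will be < 4n
    # (by the corollary, each trial succeeds with probability > 1/2).
    while True:
        h = random_uhf(n)
        buckets = [[] for _ in range(n)]
        for k in keys:
            buckets[h(k)].append(k)
        if sum(len(b) ** 2 for b in buckets) < 4 * n:
            break
    # Secondary level: slot j gets a table of size nj**2; retry until collision-free.
    secondary = []
    for b in buckets:
        if not b:
            secondary.append((None, []))
            continue
        while True:
            h_j = random_uhf(len(b) ** 2)
            table = [None] * (len(b) ** 2)
            ok = True
            for k in b:
                if table[h_j(k)] is not None:
                    ok = False                 # collision: try another hash function
                    break
                table[h_j(k)] = k
            if ok:
                break
        secondary.append((h_j, table))
    return h, secondary

def contains(perfect, key):
    """Worst-case O(1): one primary hash, one secondary hash, one comparison."""
    h, secondary = perfect
    h_j, table = secondary[h(key)]
    return h_j is not None and table[h_j(key)] == key

keys = random.sample(range(1_000_000), 1000)
ph = build_perfect(keys)
assert all(contains(ph, k) for k in keys)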

Perfect Hashing Conclusions

Perfect hashing theorems set tight expected bounds on sizes and collision behavior of all the hash tables (primary and all secondaries).

→ Conduct a few random trials of universal hash functions, by simply varying UHF parameters, until we get a set of UHFs and associated table sizes which deliver …
– Worst-case O(1) time complexity!
– Worst-case O(n) space complexity!

Extendible Hashing:
Cost of a Database Query

I/O to CPU ratio is 300-to-1!

Extendible Hashing

Hashing technique for huge data sets
– Optimizes to reduce disk accesses
– Each hash bucket fits on one disk block
– Better than B-Trees if order is not important – why?

Table contains:
– Buckets, each fitting in one disk block, with the data
– A directory that fits in one disk block and is used to hash to the correct bucket

Extendible Hash Table

• Directory entry: a key prefix (first k bits) and a pointer to the bucket with all keys starting with that prefix
• Each bucket contains the keys matching on the first j ≤ k bits, plus the value associated with each key

Directory for k = 3:   000  001  010  011  100  101  110  111

Buckets:
  (j = 2)  00001  00011  00100  00110
  (j = 2)  01001  01011  01100
  (j = 3)  10001  10011
  (j = 3)  10101  10110  10111
  (j = 2)  11001  11011  11100  11110

Inserting (easy case)

Directory:  000  001  010  011  100  101  110  111

Before insert(11011):
  (2)  00001  00011  00100  00110
  (2)  01001  01011  01100
  (3)  10001  10011
  (3)  10101  10110  10111
  (2)  11001  11100  11110

After insert(11011) – the 11-prefix bucket has room, so it simply gains the new key:
  (2)  00001  00011  00100  00110
  (2)  01001  01011  01100
  (3)  10001  10011
  (3)  10101  10110  10111
  (2)  11001  11011  11100  11110

Splitting a Leaf

Directory:  000  001  010  011  100  101  110  111

Before insert(11000):
  (2)  00001  00011  00100  00110
  (2)  01001  01011  01100
  (3)  10001  10011
  (3)  10101  10110  10111
  (2)  11001  11011  11100  11110

After insert(11000) – the full 11-prefix bucket splits into 110- and 111-prefix buckets:
  (2)  00001  00011  00100  00110
  (2)  01001  01011  01100
  (3)  10001  10011
  (3)  10101  10110  10111
  (3)  11000  11001  11011
  (3)  11100  11110

Splitting the Directory

Directory:  00  01  10  11
  (2)  01101
  (2)  10000  10001  10011  10111
  (2)  11001  11110

1. insert(10010)
   But there is no room to insert, and no adoption!
2. Solution: expand the directory to  000  001  010  011  100  101  110  111
3. Then it's just a normal split.

If Extendible Hashing Doesn't Cut It

Store only pointers/references to the items: the (key, value) pairs stay on disk
+ (Potentially) much smaller M
+ Fewer items in the directory
– One extra disk access!

Rehash
+ Potentially better distribution over the buckets
+ Fewer unnecessary items in the directory
– Can't solve the problem if there's simply too much data

What if these don't work?
– Use a B-Tree to store the directory!
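A minimal Python sketch tying the last few slides together – directory lookup by the first k bits, the easy insert, the leaf split, and directory expansion. The key width, bucket capacity, and all names are my assumptions (in-memory lists stand in for disk blocks), and it assumes distinct keys:

KEY_BITS = 5        # keys are 5-bit values, matching the slides' examples (assumption)
CAPACITY = 4        # keys per bucket / "disk block" (assumption)

class Bucket:
    def __init__(self, depth):
        self.depth = depth                        # j: prefix bits this bucket matches
        self.keys = []

class ExtendibleHash:
    def __init__(self, k=1):
        self.k = k                                # global depth: 2**k directory entries
        self.dir = [Bucket(0)] * (2 ** k)         # all entries share one bucket at first

    def _index(self, key):
        return key >> (KEY_BITS - self.k)         # first k bits of the key

    def lookup(self, key):
        return key in self.dir[self._index(key)].keys

    def insert(self, key):
        b = self.dir[self._index(key)]
        if len(b.keys) < CAPACITY:                # easy case: room in the bucket
            b.keys.append(key)
            return
        if b.depth == self.k:                     # bucket already uses all k bits,
            self._double_directory()              # so the directory must expand first
        self._split(b)
        self.insert(key)                          # retry (may split again)

    def _double_directory(self):
        self.k += 1
        self.dir = [b for b in self.dir for _ in (0, 1)]   # duplicate every entry

    def _split(self, b):
        # Replace bucket b (prefix length j) by two buckets of prefix length j+1,
        # redistribute its keys on bit j, and repoint the directory entries.
        b0, b1 = Bucket(b.depth + 1), Bucket(b.depth + 1)
        for key in b.keys:
            bit = (key >> (KEY_BITS - b.depth - 1)) & 1
            (b1 if bit else b0).keys.append(key)
        for i, entry in enumerate(self.dir):
            if entry is b:
                bit = (i >> (self.k - b.depth - 1)) & 1
                self.dir[i] = b1 if bit else b0

# Reproducing "Splitting a Leaf": after the 16 keys shown on that slide,
# insert(11000) splits the full 11-prefix bucket into 110- and 111-buckets.
h = ExtendibleHash()
for key in [0b00001, 0b00011, 0b00100, 0b00110, 0b01001, 0b01011, 0b01100,
            0b10001, 0b10011, 0b10101, 0b10110, 0b10111,
            0b11001, 0b11011, 0b11100, 0b11110]:
    h.insert(key)
h.insert(0b11000)
print(h.lookup(0b11000), h.lookup(0b11111))       # True False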

Hash Wrap-up

Hash function: maps keys to integers; table size should be prime

Collision resolution
• Separate Chaining
  – Expand beyond the hashtable via secondary Dictionaries
  – Allows λ > 1
• Open Addressing
  – Expand within the hashtable
  – Secondary probing: {linear, quadratic, double hash}
  – λ ≤ 1 (by definition!)
  – λ ≤ ½ (by preference!)
• Rehashing
  – Tunes up the hashtable when λ crosses the line

Hash Wrap-up (part 2)

Choosing a Hash Function
• Universal hashing
  – Guarantees that no input is always bad
• Perfect hashing
  – Requires a known, fixed keyset
  – Achieves O(1) time, O(n) space – guaranteed!
• Also: Extendible hashing
  – For disk-based data
  – Combine with a B-tree directory if needed

Dictionary ADT Wrapup:
Case Study

• Your company, Procrastinators Inc., will release its highly hyped word-processing program, WordMaster2000 (yeah, they're a little behind the times), next month.
• Your highly successful alpha-test was marred by user requests for a spell-checker.
• Your mission: write and test a spell-checker module before WordMaster2000 is released.
• For now, you only need to worry about the English language, although if WordMaster2000 is successful, you may need to port your spell-checker to other languages/character sets.

Case Study: Assumptions

You will be given a spelling dictionary of English words
– 30,000 words
– Static (i.e., does not support adding user-supplied words yet)
– Arbitrary(ish) preprocessing time

Practical notes
– Almost all searches are successful – why?
– Words average about 8 characters in length
– 30,000 words at 8 bytes/word ≈ 0.25 MB
– There are many regularities in the structure of English words

Case Study:
Design Considerations

Issues:
– Which data structure should we use?
– What are our design goals?

Possible Solutions?
