10 More Hashing
Moral: No single hash function can protect you!
Faced with this dilemma, you:
a) Give up and use a linked list for your Dictionary.
b) Drop out of software, and choose a career in fast foods.
c) Run and hide.
d) Proceed to the next slide, in hope of a better alternative.

Definition:
H is a universal collection of hash functions if and only if …
For any two distinct keys k1, k2 in K, there are at most |H|/m functions h in H for which h(k1) = h(k2), where m is the table size.
• So … if we randomly choose a hash function from H, our chances of collision are no more than if we get to choose hash table entries at random!
Motivation: see previous slide (or visit http://www.burgerking.com/jobs)
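The definition can be checked by brute force on a tiny family. The sketch below is an illustration under assumptions, not part of the slides: it uses a small dot-product family modulo a prime (size 5, vectors of length 2), and the name check_universal is hypothetical. For every pair of distinct keys it counts how many functions in the family collide and compares the worst count against |H|/m.

```python
from itertools import product

def check_universal(size=5, r=2):
    """Return (worst collision count over all key pairs, the bound |H|/m).

    Assumption: `size` is prime; both the hash vectors a and the keys k
    are vectors of r integers in the range 0..size-1.
    """
    family = list(product(range(size), repeat=r))  # every hash vector a
    keys = list(product(range(size), repeat=r))    # every key vector k

    def h(a, k):
        # Dot product of a and k, reduced modulo the table size.
        return sum(ai * ki for ai, ki in zip(a, k)) % size

    worst = 0
    for i, k1 in enumerate(keys):
        for k2 in keys[i + 1:]:                    # distinct pairs only
            collisions = sum(1 for a in family if h(a, k1) == h(a, k2))
            worst = max(worst, collisions)
    return worst, len(family) // size

worst, bound = check_universal()
print(worst, bound)
```

If the family is universal, the first number printed never exceeds the second.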
Random Hashing – Not!
How can we “randomly choose a hash function”?
– Certainly we cannot randomly choose hash functions at runtime, interspersed amongst the inserts, finds, deletes! Why not?
• We can, however, randomly choose a hash function each time we initialize a new hashtable.
Conclusions
– WorstEnemy never knows which hash function we will choose – neither do we!
– No single input (set of keys) can always evoke worst-case behavior

Good Hashing: Universal Hash Function A (UHFa)
Parameterized by prime table size and vector of r integers:
a = <a1 … ar> where 0 <= ai < size
Represent each key as a vector k of r integers, where ki < size
– size = 11, key = 39752 ==> <3,9,7,5,2>
– size = 29, key = “hello world” ==> <8,5,12,12,15,23,15,18,12,4>
$h_a(k) = \left(\sum_{i=1}^{r} a_i k_i\right) \bmod \mathit{size}$
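A minimal sketch of "choose the hash function when the table is initialized", assuming string keys of at most r ASCII characters and a prime size above 127. The class name UHFaTable and its methods are hypothetical. The vector a is drawn once in the constructor and never changed, because a stored key can only be found again with the same function that placed it.

```python
import random

class UHFaTable:
    """Separate-chaining table that draws its hash vector once, at construction."""

    def __init__(self, size=131, r=3, rng=None):
        self.size = size                       # assumed prime, larger than any k_i
        self.r = r
        rng = rng or random.Random()
        # The random choice happens here and only here.
        self.a = [rng.randrange(size) for _ in range(r)]
        self.slots = [[] for _ in range(size)]

    def _hash(self, key):
        if len(key) > self.r:
            raise ValueError("key longer than the hash vector")
        k = [ord(c) for c in key]              # key -> vector of character codes
        if any(ki >= self.size for ki in k):
            raise ValueError("character code not below table size")
        return sum(ai * ki for ai, ki in zip(self.a, k)) % self.size

    def insert(self, key):
        chain = self.slots[self._hash(key)]
        if key not in chain:
            chain.append(key)

    def find(self, key):
        return key in self.slots[self._hash(key)]

table = UHFaTable()
table.insert("xyz")
print(table.find("xyz"), table.find("abc"))
```

Two tables built this way will usually place the same keys in different slots, which is the point: no fixed input is bad for every table.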
UHFa: Example
• Context: hash strings of length 3 in a table of size 131
Let a = <35, 100, 21>
h_a(“xyz”) = (35*120 + 100*121 + 21*122) % 131
= 129
Let b = <25, 90, 83>
h_b(“xyz”) = (25*120 + 90*121 + 83*122) % 131
= 43

Thinking about UHFa
Strengths:
– Works on any type as long as you can map keys to vectors
– If we’re building a static table, we can try many values of the hash vector <a>
– Random <a> has guaranteed good properties no matter what we’re hashing
Weaknesses:
– Must choose prime table size larger than any ki
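The two worked values can be reproduced directly, and the "try many values of <a>" strength can be sketched for a static key set. This is an illustration under assumptions: keys are ASCII strings whose length equals the vector length, and the helper names uhfa, longest_chain and best_vector are hypothetical.

```python
import random

def uhfa(a, key, size):
    """Dot product of vector a with the key's character codes, modulo size."""
    k = [ord(c) for c in key]
    if len(k) != len(a) or any(ki >= size for ki in k):
        raise ValueError("key must match len(a) and have codes below size")
    return sum(ai * ki for ai, ki in zip(a, k)) % size

def longest_chain(a, keys, size):
    """Length of the fullest slot if `keys` were stored using vector a."""
    counts = [0] * size
    for key in keys:
        counts[uhfa(a, key, size)] += 1
    return max(counts)

def best_vector(keys, size, r, trials=50, seed=0):
    """For a static key set: try several random vectors, keep the best one."""
    rng = random.Random(seed)
    candidates = [[rng.randrange(size) for _ in range(r)] for _ in range(trials)]
    return min(candidates, key=lambda a: longest_chain(a, keys, size))

print(uhfa([35, 100, 21], "xyz", 131))  # 129, as in the example
print(uhfa([25, 90, 83], "xyz", 131))   # 43, as in the example
```

Calling best_vector(["abc", "abd", "xyz"], 131, 3) returns the candidate with the shortest worst-case chain among the trials. This is only sensible when the key set is known in advance.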
Perfect Hashing Technique
• Static set of n known keys
• Separate chaining, two-level hash
• Primary hash table size = n
• jth secondary hash table size = nj² (where nj keys hash to slot j in primary hash table)
• Universal hash functions in all hash tables
• Conduct (a few!) random trials, until we get collision-free hash functions
[Figure: primary hash table with slots 0–6, each slot pointing to its secondary hash table]

Perfect Hashing Theorems [2]
Theorem: If we store n keys in a hash table of size n² using a randomly chosen universal hash function, then the probability of any collision is < ½.
Theorem: If we store n keys in a hash table of size m = n using a randomly chosen universal hash function, then
$E\left[\sum_{j=0}^{m-1} n_j^2\right] < 2n$
where nj is the number of keys hashing to slot j.
Corollary: If we store n keys in a hash table of size m = n using a randomly chosen universal hash function and we set the size of each secondary hash table to mj = nj², then:
a) The probability that the total storage used for all secondary hash tables exceeds 4n is less than ½.
b) The expected amount of storage required for all secondary hash tables is less than 2n.
[2] Intro to Algorithms, 2nd ed. Cormen, Leiserson, Rivest, Stein
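The two-level construction can be sketched as follows. Note the substitution: instead of UHFa, this sketch uses the universal family h(k) = ((a*k + b) mod P) mod m for integer keys, since it is shorter to write. The names P, random_hash, build_perfect and contains are hypothetical, and keys are assumed to be distinct non-negative integers below P.

```python
import random

P = 2_147_483_647  # the prime 2^31 - 1; every key is assumed to be below it

def random_hash(m, rng):
    """Draw h(k) = ((a*k + b) mod P) mod m at random from a universal family."""
    a = rng.randrange(1, P)
    b = rng.randrange(P)
    return lambda k: ((a * k + b) % P) % m

def build_perfect(keys, seed=0):
    keys = list(set(keys))                 # static, known, distinct keys
    if not keys:
        raise ValueError("need at least one key")
    rng = random.Random(seed)
    n = len(keys)
    # Primary table of size n: retry until secondary storage is at most 4n.
    while True:
        h = random_hash(n, rng)
        buckets = [[] for _ in range(n)]
        for k in keys:
            buckets[h(k)].append(k)
        if sum(len(b) ** 2 for b in buckets) <= 4 * n:
            break
    # Secondary table j has size n_j^2: retry until it is collision-free.
    secondary = []
    for bucket in buckets:
        m_j = len(bucket) ** 2
        while True:
            h_j = random_hash(m_j, rng) if m_j else None
            table = [None] * m_j
            collided = False
            for k in bucket:
                i = h_j(k)
                if table[i] is not None:
                    collided = True
                    break
                table[i] = k
            if not collided:
                break
        secondary.append((h_j, table))
    return h, secondary

def contains(structure, k):
    """Lookup with exactly two hash evaluations and one comparison."""
    h, secondary = structure
    h_j, table = secondary[h(k)]
    return bool(table) and table[h_j(k)] == k

s = build_perfect([10, 22, 37, 40, 52, 60, 70, 72, 75])
print(contains(s, 37), contains(s, 11))
```

By the theorems above, each retry loop succeeds with probability greater than ½, so only a few trials are expected at either level.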
Perfect Hashing Conclusions
Perfect hashing theorems set tight expected bounds on sizes and collision behavior of all the hash tables (primary and all secondaries).

Extendible Hashing: Cost of a Database Query
Table contains:
– Buckets, each fitting in one disk block, with the data
– A directory that fits in one disk block is used to hash to the correct bucket
[Figure: five buckets, each labeled with its depth j]
(j = 2)   (j = 2)   (j = 3)   (j = 3)   (j = 2)
00001     01001     10001     10101     11001
00011     01011     10011     10110     11011
00100     01100               10111     11100
00110                                   11110
Inserting (easy case)
Before:
Directory: 000 001 010 011 100 101 110 111
(2)       (2)       (3)       (3)       (2)
00001     01001     10001     10101     11001
00011     01011     10011     10110     11100
00100     01100               10111     11110
00110
insert(11011)
After:
Directory: 000 001 010 011 100 101 110 111
(2)       (2)       (3)       (3)       (2)
00001     01001     10001     10101     11001
00011     01011     10011     10110     11011
00100     01100               10111     11100
00110                                   11110

Splitting a Leaf
Before:
Directory: 000 001 010 011 100 101 110 111
(2)       (2)       (3)       (3)       (2)
00001     01001     10001     10101     11001
00011     01011     10011     10110     11011
00100     01100               10111     11100
00110                                   11110
insert(11000)
After:
Directory: 000 001 010 011 100 101 110 111
(2)       (2)       (3)       (3)       (3)       (3)
00001     01001     10001     10101     11000     11100
00011     01011     10011     10110     11001     11110
00100     01100               10111     11011
00110
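The easy insert and the leaf split can be sketched in memory. This is a simplified model under assumptions: keys are 5-bit integers standing for their own hash bits, the directory is indexed by leading bits, and each bucket holds 4 keys (the capacity is inferred from the figures). The names Bucket and ExtendibleHash are hypothetical, and no disk I/O is modeled.

```python
KEY_BITS = 5   # keys are 5-bit values, as in the figures
CAPACITY = 4   # keys per bucket (one "disk block"); inferred from the figures

class Bucket:
    def __init__(self, depth):
        self.depth = depth   # the (j) shown above each bucket
        self.keys = []

class ExtendibleHash:
    def __init__(self):
        self.global_depth = 0            # number of directory bits
        self.directory = [Bucket(0)]

    def _index(self, key):
        # The leading global_depth bits of the key select a directory entry.
        return key >> (KEY_BITS - self.global_depth)

    def insert(self, key):
        if not 0 <= key < (1 << KEY_BITS):
            raise ValueError("key out of range")
        while True:
            bucket = self.directory[self._index(key)]
            if key in bucket.keys:
                return
            if len(bucket.keys) < CAPACITY:
                bucket.keys.append(key)  # easy case: the bucket has room
                return
            self._split(bucket)          # full bucket: split it, then retry

    def _split(self, bucket):
        if bucket.depth == KEY_BITS:
            raise OverflowError("bucket cannot be split further")
        if bucket.depth == self.global_depth:
            # Double the directory: each old entry becomes two adjacent entries.
            self.directory = [b for b in self.directory for _ in (0, 1)]
            self.global_depth += 1
        d = bucket.depth + 1
        halves = (Bucket(d), Bucket(d))
        for k in bucket.keys:            # redistribute on the d-th leading bit
            halves[(k >> (KEY_BITS - d)) & 1].keys.append(k)
        shift = self.global_depth - d
        for i, b in enumerate(self.directory):
            if b is bucket:              # repoint entries that shared the leaf
                self.directory[i] = halves[(i >> shift) & 1]

table = ExtendibleHash()
for bits in ("00001 00011 00100 00110 01001 01011 01100 10001 "
             "10011 10101 10110 10111 11001 11100 11110").split():
    table.insert(int(bits, 2))
table.insert(0b11011)   # easy case: bucket 11 still has room
table.insert(0b11000)   # bucket 11 is full, so the leaf splits into 110 and 111
```

The split only doubles the directory when the full bucket's depth already equals the directory depth. In the slide's example the directory already has 3 bits, so only the pointers for 110 and 111 change.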
Hash Wrap-up
Hash function: maps keys to integers; table size should be prime
Collision resolution
• Separate Chaining
– Expand beyond hashtable via secondary Dictionaries
– Allows λ > 1
• Open Addressing
– Expand within hashtable
– Secondary probing: {linear, quadratic, double hash}
– λ ≤ 1 (by definition!)
– λ ≤ ½ (by preference!)
• Rehashing
– Tunes up hashtable when λ crosses the line

Hash Wrap-up (part 2)
Choosing a Hash Function
• Universal hashing
– Guarantees no (always) bad input
• Perfect hashing
– Requires known, fixed keyset
– Achieves O(1) time, O(n) space - guaranteed!
• Also: Extendible hashing
– For disk-based data
– Combine with B-tree directory if needed
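As a rough sketch of how open addressing, the λ ≤ ½ preference and rehashing fit together, the class below uses linear probing and moves to a larger prime-sized table whenever the load factor exceeds ½. The names LinearProbingSet and next_prime are hypothetical. It relies on Python's built-in hash() rather than a universal family, and it omits deletion.

```python
def next_prime(n):
    """Smallest prime >= n, by trial division (adequate for small tables)."""
    def is_prime(x):
        return x > 1 and all(x % d for d in range(2, int(x ** 0.5) + 1))
    while not is_prime(n):
        n += 1
    return n

class LinearProbingSet:
    def __init__(self, size=11):
        self.slots = [None] * size       # None marks an empty slot
        self.count = 0

    def load_factor(self):
        return self.count / len(self.slots)

    def _probe(self, key):
        # Linear probing: step forward until we find the key or an empty slot.
        i = hash(key) % len(self.slots)
        while self.slots[i] is not None and self.slots[i] != key:
            i = (i + 1) % len(self.slots)
        return i

    def insert(self, key):
        if key is None:
            raise ValueError("None is reserved for empty slots")
        i = self._probe(key)
        if self.slots[i] is None:
            self.slots[i] = key
            self.count += 1
            if self.load_factor() > 0.5:  # lambda crossed the line
                self._rehash()

    def find(self, key):
        return self.slots[self._probe(key)] is not None

    def _rehash(self):
        old = [k for k in self.slots if k is not None]
        self.slots = [None] * next_prime(2 * len(self.slots))
        self.count = 0
        for k in old:                    # every key moves to a new position
            self.insert(k)

s = LinearProbingSet()
for k in range(0, 100, 7):
    s.insert(k)
print(len(s.slots), s.load_factor())
```

Keeping λ at or below ½ guarantees that the probe loop always reaches an empty slot.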
Dictionary ADT Wrapup: Case Study
• Your company, Procrastinators Inc., will release its highly hyped word-processing program, WordMaster2000 (yeah, they’re a little behind the times), next month.
• Your highly successful alpha-test was marred by user requests for a spell-checker.
• Your mission: write and test a spell-checker module before WordMaster2000 is released.
• For now, you only need to worry about the English language, although if WordMaster2000 is successful, you may need to port your spell-checker to other languages/character sets.

Case Study: Assumptions
You will be given a spelling dictionary of English words
– 30,000 words
– Static (i.e., does not support adding user-supplied words yet)
– Arbitrary(ish) preprocessing time
Practical notes
– Almost all searches are successful – Why?
– Words average about 8 characters in length
– 30,000 words at 8 bytes/word ~ .25 MB
– There are many regularities in the structure of English words

Case Study: Design Considerations
Issues:
– Which data structure should we use?
– What are our design goals?
Possible Solutions?