BBM408
Lecture 9
Hashing I
• Direct-access tables
• Resolving collisions by chaining
• Choosing hash functions
• Open addressing
Hash functions
[Figure: a hash function h maps the keys k1, k2, …, k5 of S ⊆ U into the slots 0, …, m−1 of table T; here h(k2) = h(k5), a collision.]
As each key is inserted, h maps it to a slot of T. When a record to be inserted maps to an already occupied slot in T, a collision occurs.
Resolving collisions by chaining
• Link records in the same slot into a list.
[Figure: keys 49, 86, and 52 all hash to slot i of T, i.e. h(49) = h(86) = h(52) = i, and are chained together in a list.]
Worst case:
• Every key hashes to the same slot.
• Access time = Θ(n) if |S| = n.
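To make chaining concrete, here is a minimal Python sketch (not from the slides) of a chained hash table with insert, search, and delete; the class name, the use of Python's built-in hash, and the default table size m = 8 are illustrative choices.

```python
class ChainedHashTable:
    """Hash table resolving collisions by chaining: one list per slot."""

    def __init__(self, m=8):
        self.m = m
        self.slots = [[] for _ in range(m)]          # slot i holds a list of (key, value)

    def _slot(self, key):
        return hash(key) % self.m                    # stand-in for the hash function h

    def insert(self, key, value):
        chain = self.slots[self._slot(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:                             # key already present: overwrite
                chain[i] = (key, value)
                return
        chain.append((key, value))                   # otherwise link into the slot's list

    def search(self, key):                           # expected Θ(1 + α) under simple uniform hashing
        for k, v in self.slots[self._slot(key)]:
            if k == key:
                return v
        return None                                  # unsuccessful search

    def delete(self, key):
        s = self._slot(key)
        self.slots[s] = [(k, v) for k, v in self.slots[s] if k != key]


# Keys such as 49, 86, 52 may land in the same slot; the chain keeps them all.
t = ChainedHashTable(m=8)
for k in (49, 86, 52):
    t.insert(k, str(k))
print(t.search(86))   # '86'
```

If every key landed in the same chain, search would degrade to Θ(n), matching the worst case above.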
Average-case analysis of chaining
We make the assumption of simple uniform
hashing:
• Each key k ∈ S is equally likely to be hashed
to any slot of table T, independent of where
other keys are hashed.
Let n be the number of keys in the table, and
let m be the number of slots.
Define the load factor of T to be
α = n/m
= average number of keys per slot.
Search cost
The expected time for an unsuccessful search for a record with a given key is Θ(1 + α): Θ(1) to apply the hash function and access the slot, plus Θ(α) to search the list.
Expected search time = Θ(1) if α = O(1), or equivalently, if n = O(m).
A successful search has the same asymptotic bound, but a rigorous argument is a little more complicated. (See textbook.)
Choosing a hash function
Multiplication method example
h(k) = (A·k mod 2^w) rsh (w − r)
Suppose that m = 8 = 2^3 and that our computer has w = 7-bit words:
    A   =        1011001₂
    k   =        1101011₂
    A·k = 10010100110011₂
h(k) = (A·k mod 2^7) rsh 4 = 011₂ = 3: keep the low-order w bits of the product, then take their top r bits.
[Figure: "modular wheel" with 2^w positions, showing A, 2A, 3A, … wrapping around modulo 2^w.]
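A short Python check of this arithmetic (my own sketch, reusing the slide's values of A, k, w, and r; rsh denotes a bitwise right shift):

```python
def mult_hash(k, A=0b1011001, w=7, r=3):
    """Multiplication method: h(k) = (A*k mod 2^w) rsh (w - r)."""
    return ((A * k) & ((1 << w) - 1)) >> (w - r)   # mask the low w bits, then shift

k = 0b1101011
print(bin(0b1011001 * k))   # 0b10010100110011, the full product A*k
print(mult_hash(k))         # 3 == 0b011, a slot in {0, ..., m-1} with m = 8
```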
Resolving collisions by open addressing
No storage is used outside of the hash table itself.
• Insertion systematically probes the table until an
empty slot is found.
• The hash function depends on both the key and
probe number:
h : U × {0, 1, …, m–1} → {0, 1, …, m–1}.
• The probe sequence 〈h(k,0), h(k,1), …, h(k,m–1)〉
should be a permutation of {0, 1, …, m–1}.
• The table may fill up, and deletion is difficult (but
not impossible).
Example of open addressing
[Figure: inserting a key into a table with slots 0, …, m−1 by probing successive slots, skipping occupied ones (e.g. a slot holding 481), until an empty slot is found.]
Probing strategies
Linear probing:
Given an ordinary hash function h′(k), linear
probing uses the hash function
h(k,i) = (h′(k) + i) mod m.
This method, though simple, suffers from primary
clustering, where long runs of occupied slots build
up, increasing the average search time. Moreover,
the long runs of occupied slots tend to get longer.
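As a small illustration (mine, with h′(k) = k mod m standing in for the ordinary hash function), the linear-probing probe sequence can be generated like this:

```python
def linear_probe_sequence(k, m):
    """Yield h(k, i) = (h'(k) + i) mod m for i = 0, 1, ..., m-1."""
    h_prime = k % m                 # stand-in for an ordinary hash function h'(k)
    for i in range(m):
        yield (h_prime + i) % m

print(list(linear_probe_sequence(52, 8)))   # [4, 5, 6, 7, 0, 1, 2, 3]
```

The probes walk through consecutive slots, which is exactly the behavior that produces primary clustering.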
Probing strategies
Double hashing
Given two ordinary hash functions h1(k) and h2(k),
double hashing uses the hash function
h(k,i) = (h1(k) + i⋅ h2(k)) mod m.
This method generally produces excellent results,
but h2(k) must be relatively prime to m. One way
is to make m a power of 2 and design h2(k) to
produce only odd numbers.
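The sketch below (my illustration, not code from the slides) uses double hashing in a small open-addressed table. Following the suggestion above, m is a power of 2 and h2 always returns an odd number, so every probe sequence is a permutation of the slots; the particular h1 and h2 are arbitrary stand-ins.

```python
class DoubleHashTable:
    """Open addressing with double hashing: h(k, i) = (h1(k) + i*h2(k)) mod m."""

    def __init__(self, m=8):                     # m is a power of 2
        self.m = m
        self.slots = [None] * m

    def _h1(self, k):
        return k % self.m                        # ordinary hash function (stand-in)

    def _h2(self, k):
        return ((k // self.m) << 1) | 1          # always odd, hence coprime to m = 2^t

    def insert(self, k):
        for i in range(self.m):                  # probe until an empty slot is found
            j = (self._h1(k) + i * self._h2(k)) % self.m
            if self.slots[j] is None or self.slots[j] == k:
                self.slots[j] = k
                return j
        raise RuntimeError("table is full")

    def search(self, k):
        for i in range(self.m):
            j = (self._h1(k) + i * self._h2(k)) % self.m
            if self.slots[j] is None:            # an empty slot ends the probe sequence
                return None
            if self.slots[j] == k:
                return j
        return None


t = DoubleHashTable(m=8)
for k in (14, 22, 30):                           # all three have h1(k) = 6
    t.insert(k)
print(t.search(22) is not None)                  # True
```

Deletion is the awkward case mentioned earlier: simply clearing a slot back to None could cut short the probe sequences of other keys, so deleted slots are usually marked with a special DELETED value instead.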
Analysis of open addressing
We make the assumption of uniform hashing: each key is equally likely to have any one of the m! permutations of {0, 1, …, m−1} as its probe sequence.
Theorem. Given an open-addressed hash table with load factor α = n/m < 1, the expected number of probes in an unsuccessful search is at most 1/(1 − α).
Proof of the theorem
Proof.
• At least one probe is always necessary.
• With probability n/m, the first probe hits an
occupied slot, and a second probe is necessary.
• With probability (n–1)/(m–1), the second probe
hits an occupied slot, and a third probe is
necessary.
• With probability (n–2)/(m–2), the third probe
hits an occupied slot, etc.
Observe that (n − i)/(m − i) < n/m = α for i = 1, 2, …, n.
Proof (continued)
Therefore, the expected number of probes is
1 + (n/m)·(1 + ((n−1)/(m−1))·(1 + ((n−2)/(m−2))·(⋯(1 + 1/(m−n+1))⋯)))
≤ 1 + α(1 + α(1 + α(⋯(1 + α)⋯)))
≤ 1 + α + α² + α³ + ⋯
= ∑_{i=0}^{∞} α^i
= 1/(1 − α).
The textbook has a more rigorous proof and an analysis of successful searches.
Implications of the theorem
• If α is constant, then accessing an open-addressed hash table takes constant expected time.
• For example, if the table is half full (α = 1/2), the expected number of probes in an unsuccessful search is at most 1/(1 − 1/2) = 2.
• If the table is 90% full (α = 0.9), the expected number of probes is at most 1/(1 − 0.9) = 10.
A weakness of hashing
Problem: For any hash function h, a set
of keys exists that can cause the average
access time of a hash table to skyrocket.
• An adversary can pick all keys from
{k ∈ U : h(k) = i} for some slot i.
IDEA: Choose the hash function at random,
independently of the keys.
• Even if an adversary can see your code,
he or she cannot find a bad set of keys,
since he or she doesn’t know exactly
which hash function will be chosen.
Universal hashing
Definition. Let U be a universe of keys, and
let H be a finite collection of hash functions,
each mapping U to {0, 1, …, m–1}. We say
H is universal if for all x, y ∈ U, where x ≠ y,
we have |{h ∈ H : h(x) = h(y)}| ≤ |H | / m.
That is, the chance of a collision between x and y is ≤ 1/m if we choose h randomly from H.
[Figure: the set H of hash functions, of size |H|; the subset {h : h(x) = h(y)} occupies at most a 1/m fraction of it.]
Universality is good
Theorem. Let h be a hash function chosen (uniformly) at random from a universal set H of hash functions, and suppose h is used to hash n arbitrary keys into the m slots of a table T. Then, for a given key x, the expected number of collisions with x is less than n/m.
Proof of theorem
Proof. Let Cx be the random variable denoting the total number of collisions of keys in T with x, and let
    cxy = 1 if h(x) = h(y), and cxy = 0 otherwise.
Note: E[cxy] = 1/m and Cx = ∑_{y∈T−{x}} cxy.
Proof (continued)
E[Cx] = E[ ∑_{y∈T−{x}} cxy ]      • Take expectation of both sides.
      = ∑_{y∈T−{x}} E[cxy]        • Linearity of expectation.
      = ∑_{y∈T−{x}} 1/m           • E[cxy] = 1/m.
      = (n − 1)/m.                • Algebra.
Constructing a set of universal hash functions
Let m be prime. Decompose key k into r + 1
digits, each with value in the set {0, 1, …, m–1}.
That is, let k = 〈k0, k1, …, kr〉, where 0 ≤ ki < m.
Randomized strategy:
Pick a = 〈a0, a1, …, ar〉 where each ai is chosen
randomly from {0, 1, …, m–1}.
Define ha(k) = ( ∑_{i=0}^{r} ai·ki ) mod m.   (Dot product, modulo m.)
How big is H = {ha}? |H| = m^{r+1}. REMEMBER THIS!
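A hedged Python sketch of this construction (mine): the key is split into r + 1 base-m digits and a random vector a is drawn once, yielding one function ha from H.

```python
import random

def make_dot_product_hash(m, r):
    """Return a random h_a from H: h_a(k) = (sum_i a_i * k_i) mod m, with m prime."""
    a = [random.randrange(m) for _ in range(r + 1)]       # a = <a_0, ..., a_r>

    def h_a(k):
        digits = [(k // m**i) % m for i in range(r + 1)]  # k = <k_0, ..., k_r>, 0 <= k_i < m
        return sum(ai * ki for ai, ki in zip(a, digits)) % m

    return h_a

h = make_dot_product_hash(m=7, r=2)     # handles keys 0 <= k < 7**3
print(h(100), h(101))                   # slots in {0, ..., 6}; values depend on the random a
```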
Universality of dot-product hash functions
Theorem. The set H = {ha} is universal.
Proof. Suppose that x = 〈x0, x1, …, xr〉 and y = 〈y0, y1, …, yr〉 are distinct keys. Thus, they differ in at least one digit position; without loss of generality, say position 0.
For how many ha ∈ H do x and y collide? They collide exactly when ha(x) = ha(y), that is, when ∑_{i=0}^{r} ai(xi − yi) ≡ 0 (mod m).
Fact from number theory: since m is prime, every nonzero z ∈ Zm has a unique multiplicative inverse z⁻¹ modulo m.
Example: m = 7.
  z  : 1 2 3 4 5 6
  z⁻¹: 1 4 5 2 3 6
Back to the proof
We have
    a0·(x0 − y0) ≡ −∑_{i=1}^{r} ai(xi − yi)   (mod m),
and since x0 ≠ y0, an inverse (x0 − y0)⁻¹ must exist, which implies that
    a0 ≡ ( −∑_{i=1}^{r} ai(xi − yi) ) · (x0 − y0)⁻¹   (mod m).
Thus, for any choices of a1, a2, …, ar, exactly one choice of a0 causes x and y to collide.
Proof (completed)
Q. How many ha’s cause x and y to collide?
A. There are m choices for each of a1, a2, …, ar ,
but once these are chosen, exactly one choice
for a0 causes x and y to collide, namely
    a0 = ( −∑_{i=1}^{r} ai(xi − yi) ) · (x0 − y0)⁻¹ mod m.
Thus, the number of ha's that cause x and y to collide is m^r · 1 = m^r = |H|/m.
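As a quick empirical check of this count (my sketch, enumerating all of H for a tiny prime m and small r), exactly m^r = |H|/m of the functions ha make a fixed pair of distinct keys collide:

```python
from itertools import product

m, r = 5, 2                                  # small prime m so H = {h_a} is enumerable
x, y = (1, 2, 3), (1, 2, 4)                  # distinct keys as digit vectors <k_0, k_1, k_2>

def h(a, k):
    return sum(ai * ki for ai, ki in zip(a, k)) % m

colliding = sum(1 for a in product(range(m), repeat=r + 1) if h(a, x) == h(a, y))
print(colliding, m**r, m**(r + 1) // m)      # 25 25 25
```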
Perfect hashing
Given a set of n keys, construct a static hash
table of size m = O(n) such that SEARCH takes
Θ(1) time in the worst case.
IDEA: Two-level scheme with universal hashing at both levels. No collisions at level 2!
[Figure: a level-1 table T with slots 0, …, 6; each nonempty slot i stores the size mi and hash parameter ai of a secondary table Si that holds, without collisions, the keys hashing to slot i (e.g., S1 holds 14 and 27, S4 holds 26, and S6 holds 40, 37, 22).]
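Below is a rough Python sketch of the two-level idea (mine, not the slides' exact construction): each level-2 table has size ni², and a random hash function is re-drawn until that slot's keys land without collisions, which the analysis on the next slides shows takes only a constant expected number of tries. The ((a·k + b) mod p) mod size hash and the prime p = 2^31 − 1 are illustrative stand-ins for universal hashing.

```python
import random

def build_perfect_table(keys):
    """Static two-level hash table with level-2 tables of size n_i**2 and no collisions."""
    m = max(1, len(keys))                          # level-1 table of size O(n)
    p = 2**31 - 1                                  # a prime exceeding every key (assumed)

    def random_hash(size):
        a, b = random.randrange(1, p), random.randrange(p)
        return lambda k: ((a * k + b) % p) % size  # stand-in universal-style hash

    h1 = random_hash(m)
    buckets = [[] for _ in range(m)]
    for k in keys:
        buckets[h1(k)].append(k)                   # level 1: group keys by slot

    level2 = []
    for bucket in buckets:
        size = len(bucket) ** 2                    # quadratic space per nonempty slot
        while True:                                # re-draw h2 until this slot is collision-free
            h2 = random_hash(size) if size else None
            slots = [None] * size
            ok = True
            for k in bucket:
                j = h2(k)
                if slots[j] is not None:           # a level-2 collision: try another h2
                    ok = False
                    break
                slots[j] = k
            if ok:
                level2.append((h2, slots))
                break

    def search(k):                                 # worst-case O(1): two hash evaluations
        h2, slots = level2[h1(k)]
        return bool(slots) and slots[h2(k)] == k

    return search

search = build_perfect_table([14, 27, 26, 40, 37, 22])
print(search(27), search(99))                      # True False
```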
Collisions at level 2
Theorem. Let H be a class of universal hash
functions for a table of size m = n². Then, if we
use a random h ∈ H to hash n keys into the table,
the expected number of collisions is at most 1/2.
Proof. By the definition of universality, the probability that 2 given keys in the table collide under h is 1/m = 1/n². Since there are (n choose 2) pairs