ch03 LSH
material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify
them to fit your own needs. If you make use of a significant portion of these slides in your own
lecture, please include this message, or a link to our web site: http://www.mmds.org
[Figure: topic map. Locality sensitive hashing; PageRank, SimRank; Filtering data streams; SVM; Recommender systems; Dimensionality reduction; Spam Detection; Queries on streams; Perceptron, kNN; Duplicate document detection]
Items 1…K
Items 1…N
Basket 1: {1,2,3}
Pairs: {1,2} {1,3} {2,3}
[Figure: the processing pipeline]
Document → Shingling → Min Hashing → Locality-Sensitive Hashing → Candidate pairs
Shingling output: The set of strings of length k that appear in the document
Min Hashing output: Signatures: short integer vectors that represent the sets, and reflect their similarity
Locality-Sensitive Hashing output: Candidate pairs: those pairs of signatures that we need to test for similarity
Shingling
Step 1: Shingling: Convert documents to sets
Documents as High-Dim Data
Simple approaches:
Document = set of words appearing in document
Document = set of “important” words
Don’t work well for this application. Why?
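A minimal sketch of shingling, assuming character-level shingles (the set of strings of length k that appear in the document, as the pipeline figure puts it). The function name `shingles`, the sample sentences, and the choice k = 5 are illustrative and not from the slides. One likely reason the word-set approaches fall short is that they ignore word order, which the sketch makes visible.

```python
def shingles(text, k):
    """Return the set of all length-k substrings (k-shingles) of text."""
    if k <= 0:
        raise ValueError("k must be positive")
    return {text[i:i + k] for i in range(len(text) - k + 1)}


# Two illustrative documents with the same words in a different order.
doc_a = "the dog bit the man"
doc_b = "the man bit the dog"

# Word sets cannot tell the two documents apart ...
same_word_sets = set(doc_a.split()) == set(doc_b.split())

# ... while character 5-shingles (k = 5 is an arbitrary choice here) can.
same_shingle_sets = shingles(doc_a, 5) == shingles(doc_b, 5)

print(sorted(shingles("abcab", 2)))
print(same_word_sets, same_shingle_sets)
```

Sets are used on purpose: a shingle that occurs twice in a document still counts once.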
MinHashing
Step 2: Minhashing: Convert large sets to
short signatures, while preserving similarity
Encoding Sets as Bit Vectors
Many similarity problems can be
formalized as finding subsets that
have significant intersection
Encode sets using 0/1 (bit, boolean) vectors
One dimension per element in the universal set
Interpret set intersection as bitwise AND, and
set union as bitwise OR
Example: C1 = 10111; C2 = 10011
Size of intersection = 3; size of union = 4,
Jaccard similarity (not distance) = 3/4
Distance: d(C1,C2) = 1 – (Jaccard similarity) = 1/4
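The bit-vector encoding can be sketched with Python integers, where `&` plays the role of intersection and `|` the role of union. The helper name `jaccard_bits` is hypothetical, and returning 1.0 for two empty sets is a convention assumed here, not something the slides state.

```python
def jaccard_bits(c1, c2):
    """Jaccard similarity of two sets given as 0/1 strings of equal length."""
    if len(c1) != len(c2):
        raise ValueError("bit vectors must have the same length")
    a, b = int(c1, 2), int(c2, 2)
    intersection = bin(a & b).count("1")   # bitwise AND = set intersection
    union = bin(a | b).count("1")          # bitwise OR  = set union
    # Assumed convention: two empty sets are treated as identical.
    return intersection / union if union else 1.0


similarity = jaccard_bits("10111", "10011")
distance = 1 - similarity
print(similarity, distance)
```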
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org
From Sets to Boolean Matrices
Rows = elements (shingles)
Columns = sets (documents)
1 in row e and column s if and only if e is a member of s
Column similarity is the Jaccard similarity of the corresponding sets (rows with value 1)
Typical matrix is sparse!
Each document is a column:
            Documents
Shingles    1 1 1 0
            1 1 0 1
            0 1 0 1
            0 0 0 1
            1 0 0 1
            1 1 1 0
            1 0 1 0
Example: sim(C1, C2) = ?
Size of intersection = 3; size of union = 6,
Jaccard similarity (not distance) = 3/6
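Because the matrix is typically sparse, one way to hold it is as one set of row indexes per column. The sketch below applies that idea to the 7 x 4 example matrix; the names `MATRIX`, `column_sets`, and `jaccard` are illustrative, and rows are numbered from 0.

```python
# The example matrix: rows are shingles, columns are documents C1..C4.
MATRIX = [
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [0, 1, 0, 1],
    [0, 0, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 0],
    [1, 0, 1, 0],
]


def column_sets(matrix):
    """Sparse form: for each column, the set of row indexes holding a 1."""
    n_cols = len(matrix[0])
    return [{r for r, row in enumerate(matrix) if row[c]} for c in range(n_cols)]


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 for two empty sets, by assumption)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


cols = column_sets(MATRIX)
print(jaccard(cols[0], cols[1]))  # sim(C1, C2)
```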
Outline: Finding Similar Columns
So far:
Documents → Sets of shingles
Represent sets as boolean vectors in a matrix
Next goal: Find similar columns while
computing small signatures
Similarity of columns == similarity of signatures
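Min-Hashing Example
Permutations π (one per column)    Input matrix (shingles x documents)
2 4 3                              1 0 1 0
3 2 4                              1 0 0 1
7 1 7                              0 1 0 1
6 3 2                              0 1 0 1
1 6 6                              0 1 0 1
5 7 1                              1 0 1 0
4 5 5                              1 0 1 0
Signature matrix M (one row per permutation):
2 1 2 1
2 1 4 1
1 2 1 2
4th element of the permutation is the first to map to a 1 (the entry 4 in the second row of M)
Another way is to store row indexes:
1 5 1 5
2 3 1 3
6 4 6 4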
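The example can be reproduced with a short sketch that, for each permutation and each column, takes the smallest permuted position among the rows holding a 1. The names `PERMS`, `INPUT`, `minhash_signatures`, and `minhash_row_indexes` are illustrative; explicit permutations are used only because the example is tiny.

```python
# perm[r] is the position that row r receives under the permutation.
PERMS = [
    [2, 3, 7, 6, 1, 5, 4],
    [4, 2, 1, 3, 6, 7, 5],
    [3, 4, 7, 2, 6, 1, 5],
]
INPUT = [
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 1, 0, 1],
    [0, 1, 0, 1],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [1, 0, 1, 0],
]


def minhash_signatures(matrix, perms):
    """One signature row per permutation: min permuted position of a 1."""
    n_cols = len(matrix[0])
    # min() raises ValueError on an all-zero column; the example has none.
    return [
        [min(perm[r] for r, row in enumerate(matrix) if row[c])
         for c in range(n_cols)]
        for perm in perms
    ]


def minhash_row_indexes(matrix, perms):
    """Equivalent form: the 1-based index of the row that attains the min."""
    n_cols = len(matrix[0])
    return [
        [min((perm[r], r + 1) for r, row in enumerate(matrix) if row[c])[1]
         for c in range(n_cols)]
        for perm in perms
    ]


for sig_row in minhash_signatures(INPUT, PERMS):
    print(sig_row)
```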
The Min-Hash Property
Choose a random permutation π
Claim: Pr[h(C1) = h(C2)] = sim(C1, C2), where h(C) = min(π(C))
Why?
Let X be a doc (set of shingles), y ∈ X is a shingle
Then: Pr[π(y) = min(π(X))] = 1/|X|
It is equally likely that any y ∈ X is mapped to the min element
Let y be s.t. π(y) = min(π(C1 ∪ C2))
Then either: π(y) = min(π(C1)) if y ∈ C1, or
             π(y) = min(π(C2)) if y ∈ C2
(One of the two cols had to have 1 at position y)
Both hold exactly when y ∈ C1 ∩ C2, so
Pr[min(π(C1)) = min(π(C2))] = |C1 ∩ C2| / |C1 ∪ C2| = sim(C1, C2)
Example columns:
C1 C2
0  0
0  0
1  1
0  0
0  1
1  0
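The claim can be checked exactly on the two six-row columns from the figure by enumerating every permutation rather than sampling. This is only a sanity check under that small-universe assumption; the name `minhash_agreement` and the 0-based row numbering are illustrative.

```python
from fractions import Fraction
from itertools import permutations


def minhash_agreement(c1, c2, n_rows):
    """Exact Pr[h(C1) = h(C2)] over all permutations of n_rows rows."""
    agree = total = 0
    for perm in permutations(range(n_rows)):
        total += 1
        # perm[y] is the position of row y; the min-hash is the smallest one.
        if min(perm[y] for y in c1) == min(perm[y] for y in c2):
            agree += 1
    return Fraction(agree, total)


# Rows (0-based) holding a 1 in each of the two example columns.
C1 = {2, 5}
C2 = {2, 4}

probability = minhash_agreement(C1, C2, 6)
jaccard = Fraction(len(C1 & C2), len(C1 | C2))
print(probability, jaccard)
```

With 6 rows there are only 720 permutations, so the enumeration is instant; for realistic universes the same check would have to be done by sampling.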
Similarities (same permutations, input matrix, and signature matrix as in the Min-Hashing Example):
Columns    1-3    2-4    1-2    3-4
Col/Col    0.75   0.75   0      0
Sig/Sig    0.67   1.00   0      0
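The two rows of the similarities table can be recomputed as follows: Col/Col is the Jaccard similarity of input-matrix columns, and Sig/Sig is the fraction of signature rows on which two columns agree. The names `INPUT`, `SIG`, `col_similarity`, and `sig_similarity` are illustrative, and columns are indexed from 0 in the code.

```python
INPUT = [
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 1, 0, 1],
    [0, 1, 0, 1],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [1, 0, 1, 0],
]
SIG = [
    [2, 1, 2, 1],
    [2, 1, 4, 1],
    [1, 2, 1, 2],
]


def col_similarity(matrix, i, j):
    """Jaccard similarity of columns i and j of a 0/1 matrix."""
    both = sum(1 for row in matrix if row[i] and row[j])
    either = sum(1 for row in matrix if row[i] or row[j])
    return both / either if either else 1.0


def sig_similarity(sig, i, j):
    """Fraction of signature rows in which columns i and j agree."""
    return sum(1 for row in sig if row[i] == row[j]) / len(sig)


for i, j in [(0, 2), (1, 3), (0, 1), (2, 3)]:
    print(f"{i + 1}-{j + 1}",
          round(col_similarity(INPUT, i, j), 2),
          round(sig_similarity(SIG, i, j), 2))
```

With only three hash functions the estimate is coarse (multiples of 1/3), which is why 0.75 shows up as 0.67 and 1.00.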
[Figure: signature matrix M divided into b bands of r rows per band; each column is one signature]
Partition M into Bands
Divide matrix M into b bands of r rows
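A minimal sketch of banding, assuming that two columns become a candidate pair when they are identical in at least one band. Here the band's tuple of values is used directly as the bucket key, so unrelated bands never collide; a real system would hash into a bounded number of buckets. The name `candidate_pairs` is hypothetical, and `SIG` is the small signature matrix from the Min-Hashing example.

```python
from collections import defaultdict
from itertools import combinations

SIG = [
    [2, 1, 2, 1],
    [2, 1, 4, 1],
    [1, 2, 1, 2],
]


def candidate_pairs(sig, r):
    """Return column pairs (i, j), i < j, that agree on all r rows of some band."""
    n_rows, n_cols = len(sig), len(sig[0])
    if n_rows % r != 0:
        raise ValueError("number of signature rows must be a multiple of r")
    pairs = set()
    for start in range(0, n_rows, r):          # one iteration per band
        buckets = defaultdict(list)
        for c in range(n_cols):
            key = tuple(sig[row][c] for row in range(start, start + r))
            buckets[key].append(c)
        for cols in buckets.values():          # columns sharing a bucket
            pairs.update(combinations(cols, 2))
    return pairs


print(sorted(candidate_pairs(SIG, r=1)))  # b = 3 bands of 1 row
print(sorted(candidate_pairs(SIG, r=3)))  # b = 1 band of 3 rows
```

Smaller r makes it easier for two columns to match in some band, so more pairs become candidates; larger r is stricter.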
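[Figure: the ideal case, a step function at the similarity threshold s]
x-axis: Similarity t = sim(C1, C2) of two sets; y-axis: Probability of sharing a bucket
Probability = 1 if t > s
No chance if t < s

[Figure: probability of sharing a bucket vs. similarity]
Remember: Probability of equal hash-values = similarity

[Figure: the S-curve for b bands of r rows]
x-axis: Similarity t = sim(C1, C2) of two sets; y-axis: Probability of sharing a bucket
t^r: All rows of a band are equal
1 - t^r: Some row of a band unequal
(1 - t^r)^b: No bands identical
1 - (1 - t^r)^b: At least one band identical
Threshold: s ~ (1/b)^(1/r)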
s      1 - (1 - s^r)^b
.2     .006
.3     .047
.4     .186
.5     .470
.6     .802
.7     .975
.8     .9996
(The values are consistent with r = 5, b = 20.)
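The table can be regenerated from the formula 1 - (1 - s^r)^b. The sketch assumes r = 5 and b = 20, which is the setting that reproduces the listed values; the function name `prob_candidate` is illustrative.

```python
def prob_candidate(s, r, b):
    """Probability that two columns of similarity s agree in at least one band."""
    all_rows_equal = s ** r                    # one band fully identical
    no_band_identical = (1 - all_rows_equal) ** b
    return 1 - no_band_identical


R, B = 5, 20  # assumed parameters; they match the values in the table

table = {s: prob_candidate(s, R, B) for s in (0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8)}
for s, p in table.items():
    print(f"{s:.1f}  {p:.4f}")
```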
Picking r and b: The S-curve
Picking r and b to get the best S-curve
50 hash-functions (r=5, b=10)
[Figure: S-curve. x-axis: Similarity (0 to 1); y-axis: Prob. sharing a bucket (0 to 1). Blue area: False Negative rate. Green area: False Positive rate.]
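To compare choices of r and b, one can compute the approximate threshold (1/b)^(1/r) and estimate the two error areas numerically. In this sketch, pairs below a chosen similarity threshold s that still share a bucket count toward the false-positive area, and pairs at or above s that never share a bucket count toward the false-negative area. The names `threshold` and `error_areas`, the midpoint-rule integration, and the choice s = 0.6 are assumptions for illustration.

```python
def prob_candidate(t, r, b):
    """S-curve: probability that a pair of similarity t shares a bucket."""
    return 1 - (1 - t ** r) ** b


def threshold(r, b):
    """Approximate similarity at which the S-curve rises: (1/b)^(1/r)."""
    return (1 / b) ** (1 / r)


def error_areas(s, r, b, steps=10_000):
    """Midpoint-rule estimate of (false-positive area, false-negative area)."""
    h = 1.0 / steps
    fp = fn = 0.0
    for i in range(steps):
        t = (i + 0.5) * h
        p = prob_candidate(t, r, b)
        if t < s:
            fp += p * h          # dissimilar pair, but it shares a bucket
        else:
            fn += (1 - p) * h    # similar pair, but it never shares a bucket
    return fp, fn


# 50 hash functions split as r = 5, b = 10, with s = 0.6 chosen for illustration.
print(round(threshold(5, 10), 3))
print(error_areas(0.6, 5, 10))
```

Raising b at fixed r shifts the curve to the left, trading false negatives for false positives; raising r does the opposite.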