
Chapter 5: Finding Similar Items:
Locality Sensitive Hashing
New thread: High dim. data

Course map (topics by data type):
 High dim. data: Locality sensitive hashing, Clustering, Dimensionality reduction
 Graph data: PageRank, SimRank; Network Analysis; Spam Detection
 Infinite data: Filtering data streams, Web advertising, Queries on streams
 Machine learning: SVM, Decision Trees, Perceptron, kNN
 Apps: Recommender systems, Association Rules, Duplicate document detection
2
Pinterest Visual Search
Given a query image patch, find similar images

3
How does it work?
Image Q  Feature Vector (e.g., … 0 0 1 1 0 1 0 1 0 0 …)
Image B  Feature Vector (e.g., … 1 0 1 0 0 0 0 1 0 1 …)

Similarity(Q, B) is computed from the Hamming distance
between the two feature vectors

 Collect billions of images


 Determine feature vector for each image (4k dim)
 Given a query Q, find nearest neighbors FAST
4
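The comparison above can be sketched in a few lines. This is a minimal illustration: the vectors are made-up 8-bit stand-ins for the 4k-dimensional feature vectors, and a real system would use optimized bitwise operations rather than Python lists.

```python
def hamming_similarity(a, b):
    """Fraction of positions in which two equal-length binary vectors agree."""
    if len(a) != len(b):
        raise ValueError("vectors must have equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return matches / len(a)

# Toy 8-bit stand-ins for the 4k-dimensional feature vectors
q = [0, 0, 1, 1, 0, 1, 0, 1]
b = [1, 0, 1, 0, 0, 0, 0, 1]
print(hamming_similarity(q, b))  # 0.625 (the vectors agree in 5 of 8 positions)
```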
Application: Visual Search

5
A Common Metaphor
 Many problems can be expressed as
finding “similar” sets:
 Find near-neighbors in high-dimensional space
 Examples:
 Pages with similar words
 For duplicate detection, classification by topic
 Customers who purchased similar products
 Products with similar customer sets
 Images with similar features
 Image completion
 Recommendations and search

6
Problem for today’s lecture
 Given: High dimensional data points x1, x2, …
 For example: an image is a long vector of pixel colors
 And some distance function d(x1, x2)
 which quantifies the “distance” between x1 and x2
 Goal: Find all pairs of data points (xi, xj) that are
within distance threshold d(xi, xj) ≤ s
 Note: A naïve solution would take O(N²)
where N is the number of data points
 MAGIC: This can be done in O(N)!! How??

7
LSH: The Bigfoot of CS
 LSH is really a family of related techniques
 In general, one throws items into buckets using
several different “hash functions”
 You examine only those pairs of items that share
a bucket for at least one of these hashings
 Upside: Designed correctly, only a small fraction
of pairs are ever examined
 Downside: There are false negatives – pairs of
similar items that never even get considered

8
Motivating Application:
Finding Similar Documents
Motivation for Min-Hash/LSH
 Suppose we need to find near-duplicate
documents among N = 1 million documents
 Naïvely, we would have to compute pairwise
similarities for every pair of docs
 N(N-1)/2 ≈ 5·10^11 comparisons
 At 10^5 secs/day and 10^6 comparisons/sec,
it would take 5 days
 For N = 10 million, it would take more than a year…

 Similarly, given a dataset of 10 million images,
quickly find the images most similar to a query image Q
10
3 Essential Steps for Similar Docs
1. Shingling: Converts a document into a set
representation (Boolean vector)
2. Min-Hashing: Convert large sets to short
signatures, while preserving similarity
3. Locality-Sensitive Hashing: Focus on
pairs of signatures likely to be from
similar documents
 Candidate pairs!

11
The Big Picture

Document  Shingling  Min-Hashing  Locality-Sensitive Hashing  Candidate pairs

 Shingling yields the set of strings of length k that appear in the document
 Min-Hashing yields signatures: short integer vectors that represent the sets, and reflect their similarity
 Locality-Sensitive Hashing yields candidate pairs: those pairs of signatures that we need to test for similarity
12
Document  Shingling  The set of strings of length k that appear in the document
Shingling
Step 1: Shingling:
Convert a document into a set
Documents as High-Dim Data
Step 1: Shingling: Converts a document into a set
 A k-shingle (or k-gram) for a document is a
sequence of k tokens that appears in the doc
 Tokens can be characters, words or something else,
depending on the application
 Assume tokens = characters for examples
 To compress long shingles, we can hash them to
(say) 4 bytes
 Represent a document by the set of hash
values of its k-shingles
14
Compressing Shingles
 Example: k=2; document D1= abcab
Set of 2-shingles: S(D1) = {ab, bc, ca}
Hash the shingles: h(D1) = {1, 5, 7}
 k = 8, 9, or 10 is often used in practice

 Benefits of shingles:
 Documents that are intuitively similar will have
many shingles in common
 Changing a word only affects k-shingles within
distance k-1 from the word
15
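The shingling step above can be sketched as follows. The choice of md5 truncated to 4 bytes is one illustrative stable hash, not something the slides prescribe:

```python
import hashlib

def shingles(doc, k):
    """Set of all k-character shingles of a document."""
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}

def hash_shingle(s):
    """Compress a shingle to a 4-byte integer (stable across runs)."""
    return int.from_bytes(hashlib.md5(s.encode()).digest()[:4], "big")

def shingle_set(doc, k):
    """Represent a document by the set of hash values of its k-shingles."""
    return {hash_shingle(s) for s in shingles(doc, k)}

print(shingles("abcab", 2))  # {'ab', 'bc', 'ca'}
```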
Similarity Metric for Shingles
 Document D1 is represented by the set of its k-shingles C1 = S(D1)
 A natural similarity measure is the
Jaccard similarity:
sim(D1, D2) = |C1 ∩ C2| / |C1 ∪ C2|
 Jaccard distance: d(C1, C2) = 1 - |C1 ∩ C2| / |C1 ∪ C2|
 Example (two sets with 3 elements in the intersection
and 8 in the union): Jaccard similarity = 3/8
16
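As a sketch, with Python sets (the example sets below are made up to match the 3-in-the-intersection, 8-in-the-union picture):

```python
def jaccard_similarity(c1, c2):
    """|C1 ∩ C2| / |C1 ∪ C2| for two sets."""
    return len(c1 & c2) / len(c1 | c2)

def jaccard_distance(c1, c2):
    return 1 - jaccard_similarity(c1, c2)

# 3 elements in the intersection, 8 in the union
c1 = {1, 2, 3, 4, 5, 6}
c2 = {1, 2, 3, 7, 8}
print(jaccard_similarity(c1, c2))  # 0.375 = 3/8
```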
From Sets to Boolean Matrices
 Encode sets using 0/1 (bit, Boolean) vectors
 Rows = elements (shingles)
 Columns = sets (documents)
 1 in row e and column s if and only if e is a member of s
 Column similarity is the Jaccard similarity of the
corresponding sets (rows with value 1)
 Typical matrix is sparse!
 Each document is a column. Example (shingles x documents):

1 1 1 0
1 1 0 1
0 1 0 1
0 0 0 1
1 0 0 1
1 1 1 0
1 0 1 0

 Example: sim(C1, C2) = ?
 Size of intersection = 3; size of union = 6;
Jaccard similarity (not distance) = 3/6
 Note: We don’t really construct the matrix; just imagine it exists
17
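Column similarity on the (imagined) Boolean matrix can be sketched directly; the 7x4 matrix below is the example from the slide:

```python
# Rows = shingles, columns = documents
M = [
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [0, 1, 0, 1],
    [0, 0, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 0],
    [1, 0, 1, 0],
]

def column_jaccard(M, i, j):
    """Jaccard similarity of columns i and j (over rows with value 1)."""
    inter = sum(1 for row in M if row[i] and row[j])
    union = sum(1 for row in M if row[i] or row[j])
    return inter / union

print(column_jaccard(M, 0, 1))  # 0.5 (intersection 3, union 6)
```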
Outline: Finding Similar Columns
 So far:
 Documents  Sets of shingles
 Represent sets as boolean vectors in a matrix
 Next goal: Find similar columns while
computing small signatures
 Similarity of columns == similarity of signatures
 Warnings:
 Comparing all pairs takes too much time: Job for LSH
 These methods can produce false negatives (truly similar
pairs that we miss), and even false positives (candidate
pairs that turn out not to be similar, if the
optional check is not made)
18
Document  Shingling  Min-Hashing  Signatures: short integer vectors that represent the sets, and reflect their similarity
Min-Hashing
Step 2: Min-Hashing: Convert large sets to
short signatures, while preserving similarity
Hashing Columns (Signatures)
 Key idea: “hash” each column C to a small
signature h(C), such that:
 sim(C1, C2) is the same as the “similarity” of signatures
h(C1) and h(C2)

 Goal: Find a hash function h(·) such that:


 If sim(C1,C2) is high, then with high prob. h(C1) = h(C2)
 If sim(C1,C2) is low, then with high prob. h(C1) ≠ h(C2)

 Idea: Hash docs into buckets. Expect that “most”
pairs of near-duplicate docs hash into the same
bucket!
20
Min-Hashing: Goal
 Goal: Find a hash function h(·) such that:
 if sim(C1,C2) is high, then with high prob. h(C1) = h(C2)
 if sim(C1,C2) is low, then with high prob. h(C1) ≠ h(C2)

 Clearly, the hash function depends on


the similarity metric:
 Not all similarity metrics have a suitable
hash function
 There is a suitable hash function for
the Jaccard similarity: It is called Min-Hashing
21
Min-Hashing: Overview
 Permute the rows of the Boolean matrix using a
random permutation π
 Thought experiment – not physically realized
 Define the minhash function for this permutation,
hπ(C) = the number of the first (in the permuted order)
row in which column C has 1:
hπ(C) = min π(C)
 Apply, to all columns, several randomly chosen
permutations π to create a signature for each column
 Result is a signature matrix: columns = sets, rows =
minhash values, in order for that column
22
Min-Hashing Example

Permutations π (one per column), input matrix (shingles x documents), and the resulting signature matrix M:

π1 π2 π3    Input matrix    Signature matrix M
2  4  3     1 0 1 0         2 1 2 1   (from π1)
3  2  4     1 0 0 1         2 1 4 1   (from π2)
7  1  7     0 1 0 1         1 2 1 2   (from π3)
6  3  2     0 1 0 1
1  6  6     0 1 0 1
5  7  1     1 0 1 0
4  5  5     1 0 1 0

Example: M(1,1) = 2 because, under π1, the row in permuted position 1 (row 5) has a 0 in column 1, while the row in position 2 (row 1) has a 1: the 2nd element of the permutation is the first to map to a 1. Likewise M(2,3) = 4: under π2, the 4th element of the permutation is the first to map to a 1 in column 3.
24
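The example can be checked mechanically. A sketch, writing each permutation as a position vector (perm[r] = permuted position of row r, 1-based as on the slide):

```python
# Input matrix: rows = shingles, columns = documents (as on the slide)
M = [
    [1, 0, 1, 0],
    [1, 0, 0, 1],
    [0, 1, 0, 1],
    [0, 1, 0, 1],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [1, 0, 1, 0],
]
# perms[p][r] = permuted position of row r under permutation p
perms = [
    [2, 3, 7, 6, 1, 5, 4],
    [4, 2, 1, 3, 6, 7, 5],
    [3, 4, 7, 2, 6, 1, 5],
]

def minhash(M, perm, col):
    """Smallest permuted position of any row in which the column has a 1."""
    return min(perm[r] for r in range(len(M)) if M[r][col] == 1)

signature = [[minhash(M, p, c) for c in range(4)] for p in perms]
print(signature)  # [[2, 1, 2, 1], [2, 1, 4, 1], [1, 2, 1, 2]]
```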
A Subtle Point
 Students sometimes ask whether the minhash
value should be the original number of the
row, or the number in the permuted order (as
we did in our example).
 Answer: it doesn’t matter
 You only need to be consistent, and ensure that
two columns get the same value if and only if their
first 1’s in the permuted order are in the same row

25
The Min-Hash Property
 Choose a random permutation π
 Claim: Pr[hπ(C1) = hπ(C2)] = sim(C1, C2)
 Why?
 Let X be a doc (set of shingles), and let z ∈ X be a shingle
 Then: Pr[π(z) = min(π(X))] = 1/|X|
 It is equally likely that any z ∈ X is mapped to the min element
 Let y be such that π(y) = min(π(C1 ∪ C2))
 Then either: π(y) = min(π(C1)) if y ∈ C1, or
π(y) = min(π(C2)) if y ∈ C2
(one of the two columns has to have a 1 in row y)
 So the prob. that both are true is the prob. that y ∈ C1 ∩ C2
 Pr[min(π(C1)) = min(π(C2))] = |C1 ∩ C2| / |C1 ∪ C2| = sim(C1, C2)
26
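The claim can also be checked empirically by sampling random permutations. A small simulation sketch; the two sets, the universe, and the trial count are arbitrary illustrative choices:

```python
import random

def jaccard(c1, c2):
    return len(c1 & c2) / len(c1 | c2)

def minhash_agreement_rate(c1, c2, universe, trials=20000, seed=42):
    """Fraction of random permutations under which both sets get the same min-hash."""
    rng = random.Random(seed)
    rows = list(universe)
    agree = 0
    for _ in range(trials):
        rng.shuffle(rows)                          # one random permutation π
        pos = {row: i for i, row in enumerate(rows)}
        if min(pos[z] for z in c1) == min(pos[z] for z in c2):
            agree += 1
    return agree / trials

c1, c2 = {0, 1, 4, 5}, {0, 1, 2, 6}               # Jaccard similarity 2/6 = 1/3
rate = minhash_agreement_rate(c1, c2, range(8))
print(jaccard(c1, c2), rate)                      # 1/3 vs. an estimate close to 1/3
```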
Four Types of Rows
 Given cols C1 and C2, rows are classified as:

         C1  C2
Type A:   1   1
Type B:   1   0
Type C:   0   1
Type D:   0   0

 Define: a = # rows of type A, etc.
 Note: sim(C1, C2) = a / (a + b + c)
 Then: Pr[h(C1) = h(C2)] = sim(C1, C2)
 Look down the permuted cols C1 and C2 until we see a 1
 If it’s a type-A row, then h(C1) = h(C2);
if a type-B or type-C row, then not
27
Similarity for Signatures
 We know: Pr[h(C1) = h(C2)] = sim(C1, C2)
 Now generalize to multiple hash functions

 The similarity of two signatures is the fraction


of the hash functions in which they agree
 Thus, the expected similarity of two signatures
equals the Jaccard similarity of the columns or
sets that the signatures represent.
 And the longer the signatures, the smaller will be
the expected error.
28
Min-Hashing Example

Permutations π, input matrix (shingles x documents), and signature matrix M as before:

π1 π2 π3    Input matrix    Signature matrix M
2  4  3     1 0 1 0         2 1 2 1
3  2  4     1 0 0 1         2 1 4 1
7  1  7     0 1 0 1         1 2 1 2
6  3  2     0 1 0 1
1  6  6     0 1 0 1
5  7  1     1 0 1 0
4  5  5     1 0 1 0

Similarities:
          1-3    2-4    1-2    3-4
Col/Col   0.75   0.75   0      0
Sig/Sig   0.67   1.00   0      0
29
Implementation Trick
 Permuting rows even once is prohibitive
 Row hashing!
 Pick K = 100 hash functions hi
 Ordering under hi gives a random permutation of rows!
 One-pass implementation
 For each column c and hash function hi, keep a “slot” M(i, c)
for the min-hash value
 Initialize all M(i, c) = ∞
 Scan rows looking for 1s
 Suppose row j has 1 in column c
 Then for each hi:
 If hi(j) < M(i, c), then M(i, c)  hi(j)

How to pick a random hash function h(x)? Universal hashing:
ha,b(x) = ((a·x + b) mod p) mod N, where:
a, b … random integers
p … prime number (p > N)
30
Implementation
for each row r do begin
  for each hash function hi do
    compute hi(r);             -- important: hash r only once per hash
                               -- function, not once per 1 in row r
  for each column c do
    if c has 1 in row r then
      for each hash function hi do
        if hi(r) < M(i, c) then
          M(i, c) := hi(r);
end;
31
Example Implementation
Input matrix and hash functions:

Row  C1  C2        h(x) = x mod 5
1    1   0         g(x) = (2x + 1) mod 5
2    0   1
3    1   1
4    1   0
5    0   1

Scanning the rows, the slots (M(i, C1), M(i, C2)) evolve as:

Row 1: h(1)=1, g(1)=3   M = [1, ∞], [3, ∞]
Row 2: h(2)=2, g(2)=0   M = [1, 2], [3, 0]
Row 3: h(3)=3, g(3)=2   M = [1, 2], [2, 0]
Row 4: h(4)=4, g(4)=4   M = [1, 2], [2, 0]
Row 5: h(5)=0, g(5)=1   M = [1, 0], [2, 0]

Signature matrix M: C1 = (1, 2), C2 = (0, 0)
32
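The one-pass algorithm with the slide's worked example can be sketched as a direct transcription of the pseudocode (rows numbered 1 to 5 and the two hash functions h, g from the slide):

```python
INF = float("inf")

def h(x): return x % 5
def g(x): return (2 * x + 1) % 5

hash_funcs = [h, g]
# rows_with_one[r] = columns that have a 1 in row r (rows 1..5 as on the slide)
rows_with_one = {1: [0], 2: [1], 3: [0, 1], 4: [0], 5: [1]}

# M[i][c] = current min-hash slot for hash function i and column c
M = [[INF, INF] for _ in hash_funcs]
for r in sorted(rows_with_one):
    values = [f(r) for f in hash_funcs]   # hash r once per function
    for c in rows_with_one[r]:
        for i, v in enumerate(values):
            if v < M[i][c]:
                M[i][c] = v

print(M)  # [[1, 0], [2, 0]]
```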
Document  Shingling  Min-Hashing  Locality-Sensitive Hashing  Candidate pairs: those pairs of signatures that we need to test for similarity

Locality Sensitive Hashing


Step 3: Locality Sensitive Hashing:
Focus on pairs of signatures likely to be from
similar documents
LSH: Overview
 Goal: Find documents with Jaccard similarity at
least s (for some similarity threshold, e.g., s=0.8)
 LSH – General idea: Use a hash function that
tells whether x and y are a candidate pair: a pair
of elements whose similarity must be evaluated
 For Min-Hash matrices:
 Hash columns of signature matrix M to many buckets
 Each pair of documents that hashes into the
same bucket is a candidate pair
34
LSH: Overview

 Pick a similarity threshold s (0 < s < 1)


 Columns x and y of M are a candidate pair if
their signatures agree on at least fraction s of
their rows:
M (i, x) = M (i, y) for at least frac. s values of i
 We expect documents x and y to have the same
(Jaccard) similarity as their signatures

35
LSH for Min-Hash
Big idea: Hash columns of
signature matrix M several times
 Arrange that (only) similar columns are
likely to hash to the same bucket, with
high probability
 Candidate pairs are those that hash to the
same bucket

36
Partition M into b Bands

The signature matrix M is divided into b bands of r rows per band; one column of M is one signature.
37
Partition M into Bands
 Divide matrix M into b bands of r rows

 For each band, hash its portion of each


column to a hash table with k buckets
 Make k as large as possible

 Candidate column pairs are those that hash


to the same bucket for ≥ 1 band
 Tune b and r to catch most similar pairs,
but few non-similar pairs
38
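The banding step above can be sketched as follows, under the simplifying assumption that two columns land in the same bucket exactly when their band segments are identical (here the band segment itself serves as the bucket key):

```python
from collections import defaultdict
from itertools import combinations

def lsh_candidate_pairs(signatures, b, r):
    """signatures: dict doc_id -> list of b*r min-hash values.
    Returns pairs whose signatures collide in at least one band."""
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for doc, sig in signatures.items():
            chunk = tuple(sig[band * r:(band + 1) * r])  # this band's portion
            buckets[chunk].append(doc)
        for docs in buckets.values():
            for pair in combinations(sorted(docs), 2):
                candidates.add(pair)
    return candidates

# Toy signatures (4 min-hashes each), split into b=2 bands of r=2 rows
sigs = {
    "A": [2, 1, 2, 1],
    "B": [2, 1, 4, 1],
    "C": [1, 2, 1, 2],
}
print(lsh_candidate_pairs(sigs, b=2, r=2))  # {('A', 'B')}: A and B agree in band 1
```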
Hashing Bands

Each band of each column of matrix M (r rows per band, b bands) is hashed into buckets:
 Columns 2 and 6 fall into the same bucket, so they are probably identical in this band (candidate pair)
 Columns 6 and 7 fall into different buckets, so they are surely different in this band
39
Simplifying Assumption
 There are enough buckets that columns are
unlikely to hash to the same bucket unless
they are identical in a particular band
 Hereafter, we assume that “same bucket”
means “identical in that band”
 Assumption needed only to simplify analysis,
not for correctness of algorithm

42
Example of Bands

Assume the following case:


 Suppose 100,000 columns of M (100k docs)
 Signatures of 100 integers (rows)
 Therefore, signatures take 40MB
 Goal: Find pairs of documents that
are at least s = 0.8 similar
 Choose b = 20 bands of r = 5 integers/band

43
C1, C2 are 80% Similar
 Find pairs of ≥ s = 0.8 similarity; set b = 20, r = 5
 Assume: sim(C1, C2) = 0.8
 Since sim(C1, C2)  s, we want C1, C2 to be a candidate pair:
We want them to hash to at least 1 common bucket (at
least one band is identical)
 Probability C1, C2 identical in one particular
band: (0.8)5 = 0.328
 Probability C1, C2 are not identical in any of the 20 bands:
(1-0.328)^20 = 0.00035
 i.e., about 1/3000th of the 80%-similar column pairs
are false negatives (we miss them)
 We would find 99.965% pairs of truly similar documents
44
C1, C2 are 30% Similar
 Find pairs of ≥ s = 0.8 similarity; set b = 20, r = 5
 Assume: sim(C1, C2) = 0.3
 Since sim(C1, C2) < s we want C1, C2 to hash to NO
common buckets (all bands should be different)
 Probability C1, C2 identical in one particular band:
(0.3)5 = 0.00243
 Probability C1, C2 identical in at least 1 of 20 bands: 1
- (1 - 0.00243)20 = 0.0474
 In other words, approximately 4.74% pairs of docs with
similarity 0.3 end up becoming candidate pairs
 They are false positives: we will have to examine them (they
are candidate pairs), but it then turns out their similarity is
below threshold s
45
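Both worked examples can be checked in a few lines (a sketch; the thresholds and band parameters are those from the slides):

```python
r, b = 5, 20

# 80%-similar pair: probability it is missed entirely (a false negative)
p_band_08 = 0.8 ** r                 # identical in one particular band: 0.328
fn_rate = (1 - p_band_08) ** b       # not identical in any of the 20 bands
print(fn_rate)                       # ~0.00035

# 30%-similar pair: probability it still becomes a candidate (a false positive)
p_band_03 = 0.3 ** r                 # 0.00243
fp_rate = 1 - (1 - p_band_03) ** b   # identical in at least one band
print(fp_rate)                       # ~0.047
```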
LSH Involves a Tradeoff
 Pick:
 The number of Min-Hashes (rows of M)
 The number of bands b, and
 The number of rows r per band
to balance false positives/negatives
 Example: If we had only 10 bands of 10
rows, the number of false positives would
go down, but the number of false negatives
would go up
46
Analysis of LSH – What We Want

Ideally, the probability that two sets share a bucket, as a function of their similarity t = sim(C1, C2), would be a step function at the similarity threshold s: probability 1 if t > s, no chance if t < s. (Say “yes” exactly when t is above the threshold.)
47
What 1 Band of 1 Row Gives You

With a single min-hash, the probability of sharing a bucket equals the similarity t = sim(C1, C2) of the two sets (remember: probability of equal hash-values = similarity), so the curve is a straight line.
48
What 1 Band of 1 Row Gives You

Compared to the ideal step at threshold s, this straight line gives false negatives (pairs with t > s whose probability of sharing a bucket falls short of 1) and false positives (pairs with t < s that still share a bucket).
49
b bands, r rows/band
 Say columns C1 and C2 have similarity t
 Pick any band (r rows)
 Prob. that all rows in band equal = tr
 Prob. that some row in band unequal = 1 - tr

 Prob. that no band identical = (1 - tr)b


 Prob. that at least 1 band identical =
1 - (1 - tr)b

50
What b Bands of r Rows Gives You

The probability that two sets of similarity t = sim(C1, C2) share a bucket is

1 - (1 - t^r)^b

 t^r: all rows of a band are equal
 1 - t^r: some row of a band is unequal
 (1 - t^r)^b: no band is identical
 1 - (1 - t^r)^b: at least one band is identical
51
Example: b = 20; r = 5
 Similarity threshold s
 Prob. that at least 1 band is identical:

s     1 - (1 - s^r)^b
0.2   0.006
0.3   0.047
0.4   0.186
0.5   0.470
0.6   0.802
0.7   0.975
0.8   0.9996
52
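The table follows directly from evaluating 1 - (1 - s^r)^b; a quick check:

```python
def p_candidate(t, r=5, b=20):
    """Probability that two columns of similarity t agree in at least one band."""
    return 1 - (1 - t ** r) ** b

for s in (0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8):
    print(f"{s:.1f}  {p_candidate(s):.4f}")
```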
Picking r and b: The S-curve
 Picking r and b to get the best S-curve
 50 hash-functions (r = 5, b = 10)

[Plot: probability of sharing a bucket (0 to 1) against similarity (0 to 1); the curve 1 - (1 - t^r)^b forms an S-curve.
Blue area: False Negative rate; Green area: False Positive rate.]
53
LSH Summary
 Tune M, b, r to get almost all pairs with
similar signatures, but eliminate most pairs
that do not have similar signatures
 Check in main memory that candidate pairs
really do have similar signatures
 Optional: In another pass through data,
check that the remaining candidate pairs
really represent similar documents
For more explanation with examples, see:
https://www.pinecone.io/learn/locality-sensitiv
54
Summary: 3 Steps
 Shingling: Convert documents to set representation
 We used hashing to assign each shingle an ID
 Min-Hashing: Convert large sets to short signatures,
while preserving similarity
 We used similarity preserving hashing to generate
signatures with property Pr[h(C1) = h(C2)] = sim(C1, C2)
 We used hashing to get around generating random
permutations
 Locality-Sensitive Hashing: Focus on pairs of
signatures likely to be from similar documents
 We used hashing to find candidate pairs of similarity ≥ s
55
