
Minhash PDF

Set similarity can be measured with the Jaccard similarity coefficient. Min-hashing estimates this similarity from fixed-length signatures. It randomly permutes the rows of the set-membership matrix and records, for each column, the index of the first row that contains a 1; repeating this for several permutations gives the column's signature. The fraction of positions on which two signatures agree is an unbiased estimate of the Jaccard similarity of the original sets. An implementation can hash row indices instead of permuting rows, which simulates the random orderings in a single pass over the data. Locality-sensitive hashing can then limit the number of candidate signature pairs that have to be compared.

Uploaded by

Yibo

Finding Near Duplicates

(Adapted from slides and material from Rajeev Motwani and Jeff Ullman)

Set Similarity

- Set similarity (Jaccard measure):

  $$\mathrm{sim}_J(C_i, C_j) = \frac{|C_i \cap C_j|}{|C_i \cup C_j|}$$

- View sets as columns of a matrix, with one row for each element in the universe; $a_{ij} = 1$ indicates the presence of item $i$ in set $j$.
- Example:

      C1  C2
       0   1
       1   0
       1   1
       0   0
       1   1
       0   1

  simJ(C1, C2) = 2/5 = 0.4
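The Jaccard measure on 0/1 columns can be checked directly. The following is a minimal Python sketch, not part of the original slides; the function name `jaccard` and the choice to return 0 for two empty sets are assumptions made for illustration.

```python
def jaccard(col_i, col_j):
    """Jaccard similarity of two equal-length 0/1 columns."""
    both = sum(1 for a, b in zip(col_i, col_j) if a == 1 and b == 1)
    either = sum(1 for a, b in zip(col_i, col_j) if a == 1 or b == 1)
    # Assumed convention: two empty sets get similarity 0.
    return both / either if either else 0.0


# Columns C1 and C2 from the example, read top to bottom.
C1 = [0, 1, 1, 0, 1, 0]
C2 = [1, 0, 1, 0, 1, 1]

print(jaccard(C1, C2))  # 0.4
```

Two rows hold a 1 in both columns and five rows hold a 1 in at least one, which gives the 2/5 of the example.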

Identifying Similar Sets?

- Signature Idea
  - Hash columns Ci to signature sig(Ci)
  - simJ(Ci,Cj) approximated by simH(sig(Ci),sig(Cj))
- Naïve Approach
  - Sample P rows uniformly at random
  - Define sig(Ci) as P bits of Ci in sample
- Problem
  - sparsity → would miss interesting part of columns
  - sample would get only 0's in columns

Key Observation

- For columns Ci, Cj, four types of rows

           Ci  Cj
      A     1   1
      B     1   0
      C     0   1
      D     0   0

- Overload notation: A = # of rows of type A
- Claim:

  $$\mathrm{sim}_J(C_i, C_j) = \frac{A}{A + B + C}$$
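The claim can be verified by counting row types. This is an illustrative Python sketch under the assumption that columns are given as 0/1 lists; `row_type_counts` is a hypothetical helper name, and the two columns are the ones from the Set Similarity example.

```python
from collections import Counter


def row_type_counts(col_i, col_j):
    """Count rows of type A (1,1), B (1,0), C (0,1) and D (0,0)."""
    names = {(1, 1): "A", (1, 0): "B", (0, 1): "C", (0, 0): "D"}
    counts = Counter(names[(a, b)] for a, b in zip(col_i, col_j))
    return {t: counts.get(t, 0) for t in "ABCD"}


C1 = [0, 1, 1, 0, 1, 0]
C2 = [1, 0, 1, 0, 1, 1]

n = row_type_counts(C1, C2)
sim = n["A"] / (n["A"] + n["B"] + n["C"])
print(n)    # {'A': 2, 'B': 1, 'C': 2, 'D': 1}
print(sim)  # 0.4
```

Type D rows do not enter the ratio, so adding any number of all-zero rows leaves the similarity unchanged. This is also why uniform row sampling does poorly on sparse data: most sampled rows are of type D.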

Min Hashing

- Randomly permute rows
- Hash h(Ci) = index of first row with 1 in column Ci
- Surprising Property:

  $$\Pr[h(C_i) = h(C_j)] = \mathrm{sim}_J(C_i, C_j)$$

- Why?
  - Both are A/(A+B+C)
  - Look down columns Ci, Cj until first non-Type-D row
  - h(Ci) = h(Cj) ⟺ type A row

Min-Hash Signatures

- Pick P random row permutations
- MinHash Signature
  - sig(C) = list of P indexes of first rows with 1 in column C
- Similarity of signatures
  - Let simH(sig(Ci),sig(Cj)) = fraction of permutations where MinHash values agree
  - Observe E[simH(sig(Ci),sig(Cj))] = simJ(Ci,Cj)
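For a small matrix the property can be confirmed exactly by enumerating every row permutation instead of sampling. The sketch below is illustrative only: the function names are hypothetical, rows are indexed from 0, and the columns are the six-row C1 and C2 from the Set Similarity example.

```python
from fractions import Fraction
from itertools import permutations


def minhash(column, order):
    """Index of the first row, visited in the given order, that holds a 1."""
    for row in order:
        if column[row] == 1:
            return row
    return None  # a column with no 1s has no min-hash value


def agreement_probability(col_i, col_j):
    """Exact probability that the min-hash values agree, over all permutations."""
    agree = total = 0
    for order in permutations(range(len(col_i))):
        total += 1
        if minhash(col_i, order) == minhash(col_j, order):
            agree += 1
    return Fraction(agree, total)


C1 = [0, 1, 1, 0, 1, 0]
C2 = [1, 0, 1, 0, 1, 1]

print(agreement_probability(C1, C2))  # 2/5
```

Of the 720 permutations, 288 place a type A row ahead of every type B and type C row, which matches A/(A+B+C) = 2/5. Enumeration is only feasible for tiny inputs; a signature uses P sampled permutations and estimates the same quantity.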

Example

          C1  C2  C3
      R1   1   0   1
      R2   0   1   1
      R3   1   0   0
      R4   1   0   1
      R5   0   1   0

Signatures

                          S1  S2  S3
      Perm 1 = (12345)     1   2   1
      Perm 2 = (54321)     4   5   4
      Perm 3 = (34512)     3   5   4

Similarities

                  1-2    1-3    2-3
      Col-Col    0.00   0.50   0.25
      Sig-Sig    0.00   0.67   0.00

Implementation Trick

- Permuting rows even once is prohibitive
- Row Hashing
  - Pick P hash functions hk: {1,…,n} → {1,…,O(n)}
  - Ordering under hk gives random row permutation
- One-pass Implementation
  - For each Ci and hk, keep "slot" for min-hash value
  - Initialize all slot(Ci,hk) to infinity
  - Scan rows in arbitrary order looking for 1's
  - Suppose row Rj has 1 in column Ci
  - For each hk,
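The one-pass procedure can be sketched in Python as follows. The final bullet is cut off in the source, so the update step here is an assumption: the slot is replaced whenever the row's hash value is smaller, which is the usual min-hash update and agrees with the slot values in the document's one-pass example. The function name `one_pass_minhash` and the input layout are hypothetical; h(x) = x mod 5 and g(x) = (2x + 1) mod 5 are the hash functions of that example.

```python
import math


def one_pass_minhash(rows, n_cols, hash_funcs):
    """Compute min-hash slots in a single scan over the rows.

    rows: iterable of (row_number, bits), bits being one 0/1 entry per column.
    Returns slots[k][c], the smallest hash_funcs[k](row_number) over all
    rows that hold a 1 in column c (math.inf if the column has no 1s).
    """
    slots = [[math.inf] * n_cols for _ in hash_funcs]
    for row_number, bits in rows:
        hashes = [h(row_number) for h in hash_funcs]  # hash each row once
        for c, bit in enumerate(bits):
            if bit != 1:
                continue
            for k, value in enumerate(hashes):
                # Assumed update step: keep the smallest hash value seen so far.
                if value < slots[k][c]:
                    slots[k][c] = value
    return slots


def h(x):
    return x % 5


def g(x):
    return (2 * x + 1) % 5


# Rows R1..R5 of a two-column matrix (columns C1, C2).
rows = [(1, [1, 0]), (2, [0, 1]), (3, [1, 1]), (4, [1, 0]), (5, [0, 1])]

print(one_pass_minhash(rows, 2, [h, g]))  # [[1, 0], [2, 0]]
```

The result is laid out with one row per hash function and one entry per column, so C1 ends with slots (1, 2) and C2 with slots (0, 0). No permutation is ever materialized; each row is read once and hashed P times.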

Example

          C1  C2
      R1   1   0
      R2   0   1
      R3   1   1
      R4   1   0
      R5   0   1

      h(x) = x mod 5
      g(x) = (2x + 1) mod 5

      Row   Hash value   C1 slots   C2 slots
      R1    h(1) = 1        1          -
            g(1) = 3        3          -
      R2    h(2) = 2        1          2
            g(2) = 0        3          0
      R3    h(3) = 3        1          2
            g(3) = 2        2          0
      R4    h(4) = 4        1          2
            g(4) = 4        2          0
      R5    h(5) = 0        1          0
            g(5) = 1        2          0

Comparing Signatures

- Signature Matrix S
  - Rows = Hash Functions
  - Columns = Columns
  - Entries = Signatures
- Compute: pairwise similarity of signature columns
- Problem
  - MinHash fits column signatures in memory
  - But comparing signature pairs takes too much time
- Technique to limit candidate pairs?
  - A-Priori does not work
  - Locality Sensitive Hashing (LSH)
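The pairwise comparison of signature columns can be illustrated with a short Python sketch. It assumes the signature matrix is stored as a list of rows, one per permutation or hash function; the values are the S1, S2, S3 signatures from the three-column permutation example, and the helper names are hypothetical.

```python
from itertools import combinations

# Signature matrix S: one row per permutation, one column per set.
S = [
    [1, 2, 1],
    [4, 5, 4],
    [3, 5, 4],
]


def sim_h(S, i, j):
    """Fraction of signature rows on which columns i and j agree."""
    return sum(row[i] == row[j] for row in S) / len(S)


def all_pairs(S):
    """Similarity of every pair of signature columns (quadratic in the column count)."""
    n_cols = len(S[0])
    return {(i, j): sim_h(S, i, j) for i, j in combinations(range(n_cols), 2)}


sims = all_pairs(S)
for (i, j), s in sims.items():
    print(f"{i + 1}-{j + 1}: {s:.2f}")
# 1-2: 0.00
# 1-3: 0.67
# 2-3: 0.00
```

With n columns this loop performs n(n-1)/2 comparisons, which is the cost that motivates limiting the candidate pairs.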
