Data Mining: Sketching, Locality Sensitive Hashing
LECTURE 5
Sketching, Locality Sensitive Hashing
Jaccard Similarity
• The Jaccard similarity (Jaccard coefficient) of two sets S1,
S2 is the size of their intersection divided by the size of
their union.
• JSim(S1, S2) = |S1 ∩ S2| / |S1 ∪ S2|.
• Example (figure): 3 elements in the intersection, 8 in the union, so Jaccard similarity = 3/8.
• Extreme behavior:
• JSim(X,Y) = 1 iff X = Y
• JSim(X,Y) = 0 iff X and Y have no elements in common
• JSim is symmetric
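A minimal Python sketch of the definition above. The function name `jaccard` and the choice to return 1.0 for two empty sets are assumptions made for illustration; the slides do not define the empty case.

```python
def jaccard(s1, s2):
    """Jaccard similarity |S1 ∩ S2| / |S1 ∪ S2| of two sets."""
    s1, s2 = set(s1), set(s2)
    union = s1 | s2
    if not union:
        # Both sets are empty: the ratio is 0/0. Returning 1.0 is an
        # assumption of this sketch, not something the lecture specifies.
        return 1.0
    return len(s1 & s2) / len(union)


x = {"A", "B", "F", "G"}
y = {"A", "E", "F", "G"}
print(jaccard(x, y))                 # 3 common elements out of 5 in the union
print(jaccard(x, y) == jaccard(y, x))  # symmetry
```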
Cosine Similarity
• Sim(X,Y) = cos(X,Y)
• The cosine of the angle between X and Y
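As a rough illustration, the cosine of the angle can be computed from the dot product and the vector lengths. The sketch below assumes X and Y are equal-length lists of numbers; the name `cosine_similarity` is chosen here for illustration.

```python
import math


def cosine_similarity(x, y):
    """cos of the angle between x and y: (x . y) / (|x| |y|)."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)


print(cosine_similarity([1, 0], [0, 1]))  # orthogonal vectors
print(cosine_similarity([1, 2], [2, 4]))  # same direction, close to 1
```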
[Figure: pipeline from document to candidate pairs. Document → the set of strings of length k that appear in the document → Signatures: short integer vectors that represent the sets and reflect their similarity → Locality-sensitive hashing → Candidate pairs: those pairs of signatures that we need to test for similarity.]
Shingles
Working Assumption
• Documents that have lots of shingles in common
have similar text, even if the text appears in
different order.
• Careful: you must pick k large enough, or most
documents will have most shingles.
• Extreme case k = 1: almost every document contains the same set of characters, so all documents look the same
• k = 5 is OK for short documents; k = 10 is better for long
documents.
• Alternative ways to define shingles:
• Use words instead of characters
• Anchor on stop words (to avoid templates)
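A small Python sketch of character and word shingling, under the assumption that a k-shingle is any run of k consecutive characters (or words). The function names are illustrative, and the stop-word variant is not shown.

```python
def char_shingles(text, k):
    """Set of all length-k character substrings of text."""
    if k < 1:
        raise ValueError("k must be at least 1")
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def word_shingles(text, k):
    """Set of all runs of k consecutive whitespace-separated words."""
    if k < 1:
        raise ValueError("k must be at least 1")
    words = text.split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


a, b = "the cat sat", "sat the cat"
print(char_shingles(a, 1) == char_shingles(b, 1))  # k = 1: the sets coincide
print(char_shingles(a, 5) & char_shingles(b, 5))   # larger k: only shared phrases survive
```

With k = 1 the two example strings are indistinguishable, which is the extreme case the slide warns about.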
• Key idea: “hash” each set S to a small signature Sig (S), such
that:
1. Sig (S) is small enough that we can fit a signature in main memory
for each set.
2. Sim (S1, S2) is (almost) the same as the “similarity” of Sig (S1) and
Sig (S2). (signature preserves similarity).
Example
• Universe: U = {A,B,C,D,E,F,G}
• X = {A,B,F,G}
• Y = {A,E,F,G}
• Sim(X,Y) = 3/5

     X  Y
  A  1  1
  B  1  0
  C  0  0
  D  0  0
  E  0  1
  F  1  1
  G  1  1
Minhashing
• Pick a random permutation of the rows (the
universe U).
• Define “hash” function for set S
• h(S) = the index of the first row (in the permuted order)
in which column S has 1.
same as:
• h(S) = the index of the first element of S in the permuted
order.
• Use k (e.g., k = 100) independent random
permutations to create a signature.
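The definition above can be sketched directly with explicit permutations. This is only practical for a small universe, and the helper names (`random_permutations`, `minhash`, `signature`) are assumptions of this sketch. The same list of permutations must be reused for every set so that signatures are comparable.

```python
import random


def random_permutations(universe, k, seed=0):
    """k independent random orderings of the universe (seed fixed for repeatability)."""
    rng = random.Random(seed)
    perms = []
    for _ in range(k):
        perm = list(universe)
        rng.shuffle(perm)
        perms.append(perm)
    return perms


def minhash(s, perm):
    """1-based index of the first element of perm that belongs to s."""
    for index, element in enumerate(perm, start=1):
        if element in s:
            return index
    raise ValueError("minhash is undefined for an empty set")


def signature(s, perms):
    """Vector of minhash values of s, one per permutation."""
    return [minhash(s, perm) for perm in perms]


# Three hand-picked permutations of U = {A,...,G}, used here instead of random ones.
perms = [list("ACGFBED"), list("DBACFGE"), list("CDGFABE")]
print(signature({"C", "D", "E"}, perms))  # [2, 1, 1]
```

In practice `random_permutations` would supply the orderings; the hand-picked ones only make the output easy to check by hand.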
Example of minhash signatures
• Random permutation of the rows: A, C, G, F, B, E, D

  Input matrix               Permuted matrix
  element  S1 S2 S3 S4       index  element  S1 S2 S3 S4
  A         1  0  1  0       1      A         1  0  1  0
  B         1  0  0  1       2      C         0  1  0  1
  C         0  1  0  1       3      G         1  0  1  0
  D         0  1  0  1       4      F         1  0  1  0
  E         0  1  1  1       5      B         1  0  0  1
  F         1  0  1  0       6      E         0  1  1  1
  G         1  0  1  0       7      D         0  1  0  1

• Minhash values (index of the first row with a 1): S1 = 1, S2 = 2, S3 = 1, S4 = 2
Example of minhash signatures
• Random permutation of the rows: D, B, A, C, F, G, E

  Input matrix               Permuted matrix
  element  S1 S2 S3 S4       index  element  S1 S2 S3 S4
  A         1  0  1  0       1      D         0  1  0  1
  B         1  0  0  1       2      B         1  0  0  1
  C         0  1  0  1       3      A         1  0  1  0
  D         0  1  0  1       4      C         0  1  0  1
  E         0  1  1  1       5      F         1  0  1  0
  F         1  0  1  0       6      G         1  0  1  0
  G         1  0  1  0       7      E         0  1  1  1

• Minhash values (index of the first row with a 1): S1 = 2, S2 = 1, S3 = 3, S4 = 1
Example of minhash signatures
• Random permutation of the rows: C, D, G, F, A, B, E

  Input matrix               Permuted matrix
  element  S1 S2 S3 S4       index  element  S1 S2 S3 S4
  A         1  0  1  0       1      C         0  1  0  1
  B         1  0  0  1       2      D         0  1  0  1
  C         0  1  0  1       3      G         1  0  1  0
  D         0  1  0  1       4      F         1  0  1  0
  E         0  1  1  1       5      A         1  0  1  0
  F         1  0  1  0       6      B         1  0  0  1
  G         1  0  1  0       7      E         0  1  1  1

• Minhash values (index of the first row with a 1): S1 = 3, S2 = 1, S3 = 3, S4 = 1
Example of minhash signatures
• Input matrix and signature matrix:

  Input matrix            Signature matrix
     S1 S2 S3 S4              S1 S2 S3 S4
  A   1  0  1  0          h1   1  2  1  2
  B   1  0  0  1     ≈    h2   2  1  3  1
  C   0  1  0  1          h3   3  1  3  1
  D   0  1  0  1
  E   0  1  1  1
  F   1  0  1  0
  G   1  0  1  0

• We now have a smaller dataset with just k rows
• Sig(S) = vector of hash values
• e.g., Sig(S2) = [2,1,1]
• Sig(S,i) = value of the i-th hash
function for set S
• E.g., Sig(S2,3) = 1
A Subtle Point
• People sometimes ask whether the minhash
value should be the original number of the row, or
the number in the permuted order (as we did in
our example).
• Answer: it doesn’t matter.
• You only need to be consistent, and ensure that
two columns get the same value if and only if
their first 1’s in the permuted order are in the
same row.
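• Property: Pr(h(S1) = h(S2)) = Sim(S1, S2), where the probability is over the choice of the random permutation.
• Why?
• The first row (in the permuted order) where at least one of the two sets has value 1
belongs to the union.
• Recall that the union contains the rows with at least one 1.
• We have equality h(S1) = h(S2) exactly when both sets have value 1 in that row, i.e., when the row
belongs to the intersection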
Example
• Universe: U = {A,B,C,D,E,F,G}
• X = {A,B,F,G}
• Y = {A,E,F,G}
• Union = {A,B,E,F,G}
• Intersection = {A,F,G}

  Input matrix        Permuted matrix
     X  Y                X  Y
  A  1  1             D  0  0
  B  1  0             *
  C  0  0             *
  D  0  0             C  0  0
  E  0  1             *
  F  1  1             *
  G  1  1             *

• Rows C, D could be anywhere; they do not affect the probability.
• The * rows belong to the union.
• The question is what the value of the first * element is.
• If it belongs to the intersection, then h(X) = h(Y).
• Every element of the union is equally likely to be the first * element.
• Pr(h(X) = h(Y)) = |{A,F,G}| / |{A,B,E,F,G}| = 3/5 = Sim(X,Y)
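• Similarity of each pair of columns: actual (Jaccard, from the input matrix) vs. signature (fraction of agreeing rows h1, h2, h3)

  Pair       Actual   Sig
  (S1, S2)   0        0
  (S1, S3)   3/5      2/3
  (S1, S4)   1/7      0
  (S2, S3)   1/6      0
  (S2, S4)   3/4      1
  (S3, S4)   1/7      0

• Zero similarity is preserved
• High similarity is well approximated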
• With multiple signatures we get a good approximation
• Why? What is the expected value of the fraction of agreements?
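One way to see the answer: each signature row agrees with probability Sim(S1, S2), so by linearity of expectation the expected fraction of agreeing rows is also Sim(S1, S2). The sketch below computes that fraction for the three-row signature columns of the running example; the dictionary layout and function name are assumptions made for illustration.

```python
from itertools import combinations

# Signature columns of the running example (rows h1, h2, h3).
signatures = {
    "S1": [1, 2, 3],
    "S2": [2, 1, 1],
    "S3": [1, 3, 3],
    "S4": [2, 1, 1],
}


def signature_similarity(sig_a, sig_b):
    """Fraction of positions at which two signatures agree."""
    if len(sig_a) != len(sig_b) or not sig_a:
        raise ValueError("signatures must be non-empty and of equal length")
    agreements = sum(a == b for a, b in zip(sig_a, sig_b))
    return agreements / len(sig_a)


for a, b in combinations(signatures, 2):
    print(a, b, signature_similarity(signatures[a], signatures[b]))
```

With only three hash functions the estimate can take just the values 0, 1/3, 2/3 and 1, so it is coarse; more rows tighten it.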
Is it now feasible?
• Assume a billion rows
• Hard to pick a random permutation of 1…billion
• Even representing a random permutation
requires 1 billion entries!!!
• How about accessing rows in permuted order?
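• Instead, use a hash function hi on the row numbers to play the role of the i-th permutation:
Sig(S,i) becomes the smallest value of hi(r) among all rows
(shingles) r for which column S has value 1 (shingle belongs to S);
i.e., hi(r) gives the index of row r in the i-th permutation, and Sig(S,i) is the minimum such index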
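  x  Row  S1  S2  h(x)  g(x)
  0  A    1   0   1     3
  1  B    0   1   2     0
  2  C    1   1   3     2
  3  D    1   0   4     4
  4  E    0   1   0     1

• h(x) = x+1 mod 5, g(x) = 2x+3 mod 5
• Signature after processing each row (- means no value yet):

  Row     Hash value    Sig(S1)  Sig(S2)
  x = 0   h(0) = 1      1        -
          g(0) = 3      3        -
  x = 1   h(1) = 2      1        2
          g(1) = 0      3        0
  x = 2   h(2) = 3      1        2
          g(2) = 2      2        0
  x = 3   h(3) = 4      1        2
          g(3) = 4      2        0
  x = 4   h(4) = 0      1        0
          g(4) = 1      2        0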
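The row-by-row computation in this example can be sketched as follows. Function and variable names are illustrative, and g is written as (2x + 3) mod 5 because that is the formula that reproduces the g(x) column of the table (3, 0, 2, 4, 1).

```python
import math


def minhash_signatures(rows, num_sets, hash_funcs):
    """rows: list of (row_number, membership bits, one per set).
    Returns sig where sig[i][c] is the minimum of hash_funcs[i](r)
    over all rows r in which set c has a 1."""
    sig = [[math.inf] * num_sets for _ in hash_funcs]
    for r, membership in rows:
        hashes = [h(r) for h in hash_funcs]  # each hash computed once per row
        for c, bit in enumerate(membership):
            if bit:
                for i, value in enumerate(hashes):
                    sig[i][c] = min(sig[i][c], value)
    return sig


def h(x):
    return (x + 1) % 5


def g(x):
    return (2 * x + 3) % 5


# Rows A..E numbered 0..4, columns S1 and S2.
rows = [(0, [1, 0]), (1, [0, 1]), (2, [1, 1]), (3, [1, 0]), (4, [0, 1])]
print(minhash_signatures(rows, 2, [h, g]))  # [[1, 0], [2, 0]]
```

The matrix is scanned once, row by row, and no permutation is ever materialized; only the current minimum per (hash function, set) pair is stored.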
Implementation – (4)
• Often, data is given by column, not row.
• E.g., columns = documents, rows = shingles.
• If so, sort matrix once so it is by row.
• And always compute hi(r) only once for each
row.
Locality-Sensitive Hashing
• What we want: a function f(X,Y) that tells whether or not X
and Y are a candidate pair: a pair of elements whose
similarity must be evaluated.
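[Figure: signature matrix with n hash functions as rows. Entry Sig(S,i) is the value of hash function i for set S, and column Sig(S) is the signature for set S. The matrix is partitioned into b bands of r rows per band, so one signature is split into b mini-signatures.]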
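A hedged sketch of how b bands of r rows could produce candidate pairs: two sets become candidates if their mini-signatures agree on all r rows of at least one band. Using the band tuple itself as a dictionary key (rather than hashing into a fixed number of buckets) is a simplification assumed here, and `lsh_candidate_pairs` is an illustrative name.

```python
from collections import defaultdict
from itertools import combinations


def lsh_candidate_pairs(signatures, b, r):
    """signatures: dict mapping a set's name to its list of b*r minhash values.
    Returns the set of name pairs that share a bucket in at least one band."""
    for name, sig in signatures.items():
        if len(sig) != b * r:
            raise ValueError(f"signature of {name} must have b*r = {b * r} entries")
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for name, sig in signatures.items():
            mini_signature = tuple(sig[band * r:(band + 1) * r])
            buckets[mini_signature].append(name)
        for names in buckets.values():
            candidates.update(combinations(sorted(names), 2))
    return candidates


# Three-row signatures from the minhash example, split into b = 3 bands of r = 1 row.
sigs = {"S1": [1, 2, 3], "S2": [2, 1, 1], "S3": [1, 3, 3], "S4": [2, 1, 1]}
print(sorted(lsh_candidate_pairs(sigs, b=3, r=1)))  # [('S1', 'S3'), ('S2', 'S4')]
```

Only the returned pairs need their similarity evaluated; all other pairs are never compared.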
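[Figures: probability of sharing a bucket as a function of the similarity s]
• Ideal (step function at threshold t): probability = 1 if s > t; no chance if s < t
• A single hash value: remember that the probability of equal hash-values = similarity
• b bands of r rows: probability of sharing a bucket = 1 - (1 - s^r)^b, with threshold t ~ (1/b)^(1/r)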
Example: b = 20; r = 5
• t = 0.5

  s     1 - (1 - s^r)^b
  0.2   0.006
  0.3   0.047
  0.4   0.186
  0.5   0.470
  0.6   0.802
  0.7   0.975
  0.8   0.9996
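The table values follow directly from the formula 1 - (1 - s^r)^b; a short sketch that evaluates it (function names are illustrative):

```python
def bucket_probability(s, r, b):
    """Probability that two sets with similarity s agree on all r rows
    of at least one of b bands: 1 - (1 - s^r)^b."""
    return 1 - (1 - s ** r) ** b


def approximate_threshold(r, b):
    """Rule-of-thumb location of the steep rise: (1/b)^(1/r)."""
    return (1 / b) ** (1 / r)


for s in (0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8):
    print(f"{s:.1f}  {bucket_probability(s, r=5, b=20):.4f}")
```

For b = 20 and r = 5, `approximate_threshold` evaluates to roughly 0.55, close to the t = 0.5 used in the example.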
LSH Summary
Random Hyperplanes
• Claim:
• Prob[h(x)=h(y)] = 1 – (angle between x and y)/180
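[Figure: vectors x and y and a random hyperplane; in the case shown, hv(y) = -1.]
• P[hv(x) ≠ hv(y)] = 2θ/360 = θ/180, where θ is the angle (in degrees) between x and y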
Simplification
• We need not pick from among all possible vectors
v to form a component of a sketch.
• It suffices to consider only vectors v consisting of
+1 and –1 components.
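A minimal sketch of a random-hyperplane sketch built from +1/-1 vectors. It assumes the usual definition hv(x) = +1 if v·x ≥ 0 and -1 otherwise, which is not spelled out in the surviving text, and all function names are illustrative. The angle estimate comes from inverting P[hv(x) ≠ hv(y)] = θ/180; how accurate it is with +1/-1 vectors depends on the dimension and on the number of vectors, which this sketch does not analyze.

```python
import random


def random_sign_vectors(dim, k, seed=0):
    """k random vectors whose components are +1 or -1."""
    rng = random.Random(seed)
    return [[rng.choice((-1, 1)) for _ in range(dim)] for _ in range(k)]


def sketch(x, vectors):
    """One component per vector v: +1 if v . x >= 0, else -1 (assumed definition)."""
    return [1 if sum(vi * xi for vi, xi in zip(v, x)) >= 0 else -1
            for v in vectors]


def estimated_angle(sketch_x, sketch_y):
    """Estimate of the angle in degrees: 180 * fraction of disagreeing components."""
    if len(sketch_x) != len(sketch_y) or not sketch_x:
        raise ValueError("sketches must be non-empty and of equal length")
    disagreements = sum(a != b for a, b in zip(sketch_x, sketch_y))
    return 180.0 * disagreements / len(sketch_x)


vectors = random_sign_vectors(dim=3, k=200)
x = [1, 2, 4]
opposite = [-a for a in x]
print(estimated_angle(sketch(x, vectors), sketch(x, vectors)))         # 0.0
print(estimated_angle(sketch(x, vectors), sketch(opposite, vectors)))  # 180.0
```

The vector [1, 2, 4] is chosen so that no +1/-1 combination gives a zero dot product, which keeps the two printed cases exact.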