
Data Mining: Similarity and Distance

1) Similarity and distance measures are used to quantify how alike or close together two objects are. They are important for tasks like recommending similar items, grouping similar customers or documents, and detecting anomalies. 2) Common similarity measures include Jaccard similarity, which measures the overlap of elements between two sets, and cosine similarity, which measures the angle between two vectors representing documents. 3) Cosine similarity captures how aligned two document vectors are, with a value of 1 for completely aligned vectors and 0 for orthogonal vectors. It is commonly used for comparing documents represented as vectors of word counts.

Uploaded by

Joseph Conteh

DATA MINING

LECTURE 4
Similarity and Distance
Similarity and Distance
• For many different problems we need to quantify how close two objects are.
• Examples:
  • For an item bought by a customer, find other similar items.
  • Group together the customers of a site so that similar customers are shown the same ad.
  • Group together web documents so that you can separate the ones that talk about politics from the ones that talk about sports.
  • Find all the near-duplicate mirrored web documents.
  • Find credit card transactions that are very different from previous transactions.
• To solve these problems we need a definition of similarity, or distance.
• The definition depends on the type of data that we have.
Similarity
• Numerical measure of how alike two data objects are.
• A function that maps pairs of objects to real values.
• Higher when objects are more alike.
• Often falls in the range [0,1], sometimes in [-1,1].

• Desirable properties for similarity:
  1. s(p, q) = 1 (or maximum similarity) only if p = q. (Identity)
  2. s(p, q) = s(q, p) for all p and q. (Symmetry)
Similarity between sets
• Consider the following documents:

D1: apple releases new ipod
D2: apple releases new ipad
D3: new apple pie recipe

• Which ones are more similar?

• How would you quantify their similarity?

Similarity: Intersection
• Number of words in common

D1: apple releases new ipod
D2: apple releases new ipad
D3: new apple pie recipe

• Sim(D1,D2) = 3, Sim(D1,D3) = Sim(D2,D3) = 2

• What about this document?

D4: Vefa releases new book with apple pie recipes

• Sim(D4,D1) = Sim(D4,D2) = 3
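The counts above can be checked with a few lines of Python. This is only a sketch: the labels D1 to D4 and the helper name `intersection_sim` are assumptions made for illustration, and each document is treated as a set of lowercase words split on whitespace.

```python
# Documents from the slides; the labels D1..D4 are assumed for illustration.
docs = {
    "D1": "apple releases new ipod",
    "D2": "apple releases new ipad",
    "D3": "new apple pie recipe",
    "D4": "Vefa releases new book with apple pie recipes",
}

# Treat each document as a set of lowercase words.
word_sets = {name: set(text.lower().split()) for name, text in docs.items()}


def intersection_sim(a, b):
    """Number of distinct words shared by documents a and b (hypothetical helper)."""
    return len(word_sets[a] & word_sets[b])


for a, b in [("D1", "D2"), ("D1", "D3"), ("D2", "D3"), ("D4", "D1"), ("D4", "D2")]:
    print(f"Sim({a},{b}) = {intersection_sim(a, b)}")
```

The longer document D4 scores as high against D1 as D2 does, even though most of its words are unrelated, because the raw intersection count ignores the sizes of the sets.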

Jaccard Similarity
• The Jaccard similarity (Jaccard coefficient) of two sets S1, S2 is the size of their intersection divided by the size of their union.
  • JSim(S1, S2) = |S1 ∩ S2| / |S1 ∪ S2|
• Example: with 3 elements in the intersection and 8 in the union, the Jaccard similarity is 3/8.

• Extreme behavior:
  • JSim(X,Y) = 1 iff X = Y
  • JSim(X,Y) = 0 iff X and Y have no elements in common
• JSim is symmetric
Jaccard Similarity between sets
• The Jaccard similarity for the documents

D1: apple releases new ipod
D2: apple releases new ipad
D3: new apple pie recipe
D4: Vefa releases new book with apple pie recipes

• JSim(D1,D2) = 3/5
• JSim(D1,D3) = JSim(D2,D3) = 2/6
• JSim(D4,D1) = JSim(D4,D2) = 3/9
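A minimal sketch of the Jaccard computation for these documents follows. The function name `jaccard`, the variable names, and the choice to return 1.0 for two empty sets are assumptions, not part of the slides.

```python
def jaccard(s1, s2):
    """Size of the intersection divided by the size of the union of two sets."""
    s1, s2 = set(s1), set(s2)
    union = s1 | s2
    if not union:
        # Assumption: two empty sets are considered identical.
        return 1.0
    return len(s1 & s2) / len(union)


d1 = "apple releases new ipod".split()
d2 = "apple releases new ipad".split()
d3 = "new apple pie recipe".split()
d4 = "vefa releases new book with apple pie recipes".split()

print(jaccard(d1, d2))  # 3 shared words out of 5 distinct words
print(jaccard(d1, d3))  # 2 shared words out of 6 distinct words
print(jaccard(d4, d1))  # 3 shared words out of 9 distinct words
```

Unlike the raw intersection count, the division by the union size penalizes the long document: it shares three words with D1 but scores 3/9 rather than 3/5.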
Similarity between vectors
Documents (and sets in general) can also be represented as vectors

document  Apple  Microsoft  Obama  Election
D1        10     20         0      0
D2        30     60         0      0
D3        60     30         0      0
D4        0      0          10     20

How do we measure the similarity of two vectors?

• We could view them as sets of words. Jaccard similarity will show that D4 is different from the rest.
• But all pairs of the other three documents are equally similar.
We want to capture how well the two vectors are aligned
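To illustrate why the set view is too coarse here, the sketch below reduces each count vector to the set of words with a non-zero count and applies Jaccard similarity. The names `support` and `jaccard` are hypothetical helpers introduced for this example.

```python
from itertools import combinations

terms = ["apple", "microsoft", "obama", "election"]
vectors = {
    "D1": [10, 20, 0, 0],
    "D2": [30, 60, 0, 0],
    "D3": [60, 30, 0, 0],
    "D4": [0, 0, 10, 20],
}


def support(vec):
    """Set of terms that occur at least once (counts are discarded)."""
    return {term for term, count in zip(terms, vec) if count > 0}


def jaccard(s1, s2):
    """Intersection size over union size; assumes at least one set is non-empty."""
    return len(s1 & s2) / len(s1 | s2)


pairwise = {
    (a, b): jaccard(support(vectors[a]), support(vectors[b]))
    for a, b in combinations(vectors, 2)
}
for pair, value in pairwise.items():
    print(pair, value)
```

Every pair among D1, D2, D3 gets a Jaccard similarity of 1 and every pair involving D4 gets 0, so the word counts that distinguish D3 from D1 and D2 are lost.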
Example

document  Apple  Microsoft  Obama  Election
D1        10     20         0      0
D2        30     60         0      0
D3        60     30         0      0
D4        0      0          10     20

Documents D1, D2 are in the “same direction”

Document D3 is on the same plane as D1, D2

Document D4 is orthogonal to the rest

[Figure: the document vectors plotted on axes labeled apple, microsoft, and {Obama, election}]
Example

document  Apple  Microsoft  Obama  Election
D1        1/3    2/3        0      0
D2        1/3    2/3        0      0
D3        2/3    1/3        0      0
D4        0      0          1/3    2/3

Documents D1, D2 are in the “same direction”

Document D3 is on the same plane as D1, D2

Document D4 is orthogonal to the rest

[Figure: the normalized document vectors plotted on axes labeled apple, microsoft, and {Obama, election}]
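The fractions in this table are consistent with dividing each word count by the total number of words in the document. The sketch below assumes that reading; `normalize_by_length` is a hypothetical helper, and exact fractions are used so the output matches the table.

```python
from fractions import Fraction

counts = {
    "D1": [10, 20, 0, 0],
    "D2": [30, 60, 0, 0],
    "D3": [60, 30, 0, 0],
    "D4": [0, 0, 10, 20],
}


def normalize_by_length(vec):
    """Divide each count by the document length (the sum of all counts)."""
    total = sum(vec)
    if total == 0:
        raise ValueError("cannot normalize an empty document")
    return [Fraction(count, total) for count in vec]


normalized = {name: normalize_by_length(vec) for name, vec in counts.items()}
for name, vec in normalized.items():
    print(name, [str(x) for x in vec])
```

After normalization D1 and D2 become the same vector, which is what being in the "same direction" means for the raw counts.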
Cosine Similarity

• Sim(X,Y) = cos(X,Y)
  • The cosine of the angle between X and Y

• If the vectors are aligned (correlated), the angle is zero degrees and cos(X,Y) = 1
• If the vectors are orthogonal (no common coordinates), the angle is 90 degrees and cos(X,Y) = 0

• Cosine is commonly used for comparing documents, where we assume that the vectors are normalized by the document length.
Cosine Similarity - math
• If d1 and d2 are two vectors, then
    cos(d1, d2) = (d1 ∙ d2) / (||d1|| ||d2||),
  where ∙ indicates the vector dot product and ||d|| is the length of vector d.

• Example:

d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2

d1 ∙ d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5

||d1|| = (3*3+2*2+0*0+5*5+0*0+0*0+0*0+2*2+0*0+0*0)^0.5 = (42)^0.5 = 6.481

||d2|| = (1*1+0*0+0*0+0*0+0*0+0*0+0*0+1*1+0*0+2*2)^0.5 = (6)^0.5 = 2.449

cos(d1, d2) = 5 / (6.481 * 2.449) = 0.3150
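The worked example can be reproduced with a short function. This is a sketch using only the standard library; the name `cosine` and the decision to reject zero vectors are assumptions.

```python
import math


def cosine(x, y):
    """Dot product of x and y divided by the product of their Euclidean lengths."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        # Assumption: the cosine is left undefined for a zero vector.
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_x * norm_y)


d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]

print(round(cosine(d1, d2), 4))
```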


Example

document  Apple  Microsoft  Obama  Election
D1        10     20         0      0
D2        30     60         0      0
D3        60     30         0      0
D4        0      0          10     20

Cos(D1,D2) = 1

Cos(D3,D1) = Cos(D3,D2) = 4/5

Cos(D4,D1) = Cos(D4,D2) = Cos(D4,D3) = 0

[Figure: the document vectors plotted on axes labeled apple, microsoft, and {Obama, election}]
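These values can be verified directly from the count vectors. The sketch below is one possible check, with `cosine` as an assumed helper name; it also illustrates that scaling a vector (D2 is three times D1) leaves the cosine unchanged.

```python
import math
from itertools import combinations

vectors = {
    "D1": [10, 20, 0, 0],
    "D2": [30, 60, 0, 0],
    "D3": [60, 30, 0, 0],
    "D4": [0, 0, 10, 20],
}


def cosine(x, y):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.hypot(*x) * math.hypot(*y))


cosines = {
    (a, b): cosine(vectors[a], vectors[b]) for a, b in combinations(vectors, 2)
}
for (a, b), value in cosines.items():
    print(f"Cos({a},{b}) = {value:.2f}")
```

For D3 and D1 the dot product is 60*10 + 30*20 = 1200 and the product of the lengths is 1500, which gives 4/5.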
