Data Mining: Similarity and Distance

1) Similarity and distance measures are used to quantify how alike or close together two objects are. They are important for tasks like recommending similar items, grouping similar customers or documents, and detecting anomalies.
2) Common similarity measures include Jaccard similarity, which measures the overlap of elements between two sets, and cosine similarity, which measures the angle between two vectors representing documents.
3) Cosine similarity captures how aligned two document vectors are, with a value of 1 for completely aligned vectors and 0 for orthogonal vectors. It is commonly used for comparing documents represented as vectors of word counts.

Uploaded by

Joseph Conteh

DATA MINING

LECTURE 4
Similarity and Distance
Similarity and Distance
• For many different problems we need to quantify how
close two objects are.
• Examples:
• For an item bought by a customer, find other similar items
• Group together the customers of a site so that similar customers
are shown the same ad.
• Group together web documents so that you can separate the ones
that talk about politics and the ones that talk about sports.
• Find all the near-duplicate mirrored web documents.
• Find credit card transactions that are very different from previous
transactions.
• To solve these problems we need a definition of similarity,
or distance.
• The definition depends on the type of data that we have
Similarity
• Numerical measure of how alike two data objects
are.
• A function that maps pairs of objects to real values
• Higher when objects are more alike.
• Often falls in the range [0,1], sometimes in [-1,1]

• Desirable properties for similarity

1. s(p, q) = 1 (or maximum similarity) only if p = q. (Identity)
2. s(p, q) = s(q, p) for all p and q. (Symmetry)
Similarity between sets
• Consider the following documents

D1: apple releases new ipod
D2: apple releases new ipad
D3: new apple pie recipe

• Which ones are more similar?

• How would you quantify their similarity?

Similarity: Intersection
• Number of words in common

D1: apple releases new ipod
D2: apple releases new ipad
D3: new apple pie recipe

• Sim(D1,D2) = 3, Sim(D1,D3) = Sim(D2,D3) = 2

• What about this document?

D4: Vefa releases new book with apple pie recipes

• Sim(D4,D1) = Sim(D4,D2) = 3
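
The word-overlap counts above can be checked with a short sketch. It assumes the documents are labeled D1 to D4 in the order they appear, that words are split on whitespace and lowercased, and that no stemming is applied (so "recipe" and "recipes" count as different words). The names `DOCS`, `word_set` and `intersection_sim` are illustrative only.

```python
# Documents from the slides; the labels D1..D4 are assumed for illustration.
DOCS = {
    "D1": "apple releases new ipod",
    "D2": "apple releases new ipad",
    "D3": "new apple pie recipe",
    "D4": "Vefa releases new book with apple pie recipes",
}


def word_set(text):
    """Lowercase and split on whitespace; no stemming is applied."""
    return set(text.lower().split())


def intersection_sim(a, b):
    """Number of distinct words that the two documents share."""
    return len(word_set(a) & word_set(b))


pairs = [("D1", "D2"), ("D1", "D3"), ("D2", "D3"),
         ("D4", "D1"), ("D4", "D2"), ("D4", "D3")]
for x, y in pairs:
    print(x, y, intersection_sim(DOCS[x], DOCS[y]))
```

The long document D4 scores as high against D1 as D2 does, even though D1 and D2 are near-duplicates. The raw intersection count ignores document size, which is what motivates normalizing by the union.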

Jaccard Similarity
• The Jaccard similarity (Jaccard coefficient) of two sets S1,
S2 is the size of their intersection divided by the size of
their union.
• JSim(S1, S2) = |S1∩S2| / |S1∪S2|.

Example: 3 elements in the intersection and 8 in the union
give a Jaccard similarity of 3/8.

• Extreme behavior:
• JSim(X,Y) = 1 iff X = Y
• JSim(X,Y) = 0 iff X, Y have no elements in common
• JSim is symmetric
Jaccard Similarity between sets
• The Jaccard similarity for the documents

D1: apple releases new ipod
D2: apple releases new ipad
D3: new apple pie recipe
D4: Vefa releases new book with apple pie recipes

• JSim(D1,D2) = 3/5
• JSim(D1,D3) = JSim(D2,D3) = 2/6
• JSim(D4,D1) = JSim(D4,D2) = 3/9
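
A minimal sketch of the Jaccard computation for these documents follows. The function name `jaccard`, the variable names and the lowercase whitespace tokenization are assumptions made for illustration. Returning 1.0 for two empty sets is also an assumed convention, since the ratio is otherwise undefined.

```python
def jaccard(s1, s2):
    """Size of the intersection divided by the size of the union."""
    s1, s2 = set(s1), set(s2)
    union = s1 | s2
    if not union:
        # Assumed convention: two empty sets are treated as identical.
        return 1.0
    return len(s1 & s2) / len(union)


# Word sets of the four example documents (labels D1..D4 are assumed).
d1 = set("apple releases new ipod".split())
d2 = set("apple releases new ipad".split())
d3 = set("new apple pie recipe".split())
d4 = set("vefa releases new book with apple pie recipes".split())

print(jaccard(d1, d2))  # 3 shared words out of 5 distinct words
print(jaccard(d1, d3))  # 2 shared words out of 6 distinct words
print(jaccard(d4, d1))  # 3 shared words out of 9 distinct words
```

Dividing by the union penalizes the long document: it still shares 3 words with D1, but its score drops to 3/9, below the 3/5 of the near-duplicate pair.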
Similarity between vectors
Documents (and sets in general) can also be represented as vectors

document Apple Microsoft Obama Election

D1 10 20 0 0
D2 30 60 0 0
D3 60 30 0 0
D4 0 0 10 20

How do we measure the similarity of two vectors?

• We could view them as sets of words. Jaccard similarity will
show that D4 is different from the rest
• But all pairs of the other three documents are equally similar
We want to capture how well the two vectors are aligned
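
The limitation of the set view can be sketched as follows, assuming a document's word set is the set of vocabulary words with a nonzero count. The names `VOCAB`, `VECTORS`, `support` and `jaccard` are illustrative.

```python
VOCAB = ["apple", "microsoft", "obama", "election"]

# Word-count vectors from the table above.
VECTORS = {
    "D1": [10, 20, 0, 0],
    "D2": [30, 60, 0, 0],
    "D3": [60, 30, 0, 0],
    "D4": [0, 0, 10, 20],
}


def support(vec):
    """Set of words with a nonzero count (the set view of a vector)."""
    return {word for word, count in zip(VOCAB, vec) if count > 0}


def jaccard(s1, s2):
    """Jaccard similarity; assumes at least one of the sets is non-empty."""
    return len(s1 & s2) / len(s1 | s2)


for x, y in [("D1", "D2"), ("D1", "D3"), ("D2", "D3"), ("D1", "D4")]:
    print(x, y, jaccard(support(VECTORS[x]), support(VECTORS[y])))
```

Every pair among D1, D2 and D3 gets a Jaccard similarity of 1 because the counts are discarded, so the set view cannot tell that D3 weights the two words differently from D1 and D2.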
Example

document Apple Microsoft Obama Election

D1 10 20 0 0
D2 30 60 0 0
D3 60 30 0 0
D4 0 0 10 20

Documents D1, D2 are in the “same direction”

Document D3 is on the same plane as D1, D2

Document D4 is orthogonal to the rest

[Figure: the document vectors plotted on the axes apple, microsoft and {Obama, election}]
Example

document Apple Microsoft Obama Election

D1 1/3 2/3 0 0
D2 1/3 2/3 0 0
D3 2/3 1/3 0 0
D4 0 0 1/3 2/3

Documents D1, D2 are in the “same direction”

Document D3 is on the same plane as D1, D2

Document D4 is orthogonal to the rest

[Figure: the normalized document vectors plotted on the axes apple, microsoft and {Obama, election}]
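
The fractions in this second table are consistent with dividing each count by the document's total word count. The sketch below assumes that is how the normalization was done. The name `l1_normalize` is hypothetical, and `Fraction` is used only to keep the values exact.

```python
from fractions import Fraction


def l1_normalize(counts):
    """Divide each count by the total count (assumed normalization)."""
    total = sum(counts)
    if total == 0:
        raise ValueError("cannot normalize an all-zero vector")
    return [Fraction(c, total) for c in counts]


raw = {
    "D1": [10, 20, 0, 0],
    "D2": [30, 60, 0, 0],
    "D3": [60, 30, 0, 0],
    "D4": [0, 0, 10, 20],
}

for name, counts in raw.items():
    print(name, [str(v) for v in l1_normalize(counts)])
```

After normalization D1 and D2 become the same vector, which is what "same direction" means here: they differ only in length.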
Cosine Similarity

• Sim(X,Y) = cos(X,Y)
• The cosine of the angle between X and Y
• If the vectors are aligned (correlated), the angle is zero degrees and
cos(X,Y) = 1
• If the vectors are orthogonal (no common coordinates), the angle is 90
degrees and cos(X,Y) = 0
• Cosine is commonly used for comparing documents, where we
assume that the vectors are normalized by the document length.
Cosine Similarity - math
• If d1 and d2 are two vectors, then
cos(d1, d2) = (d1 ∙ d2) / (||d1|| ||d2||),
where ∙ indicates the vector dot product and ||d|| is the length of vector d.

• Example:

d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2

d1 ∙ d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5

||d1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481

||d2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = (6)^0.5 = 2.449

cos(d1, d2) = 5 / (6.481 * 2.449) = 0.3150
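
The arithmetic of this example can be reproduced with a short sketch that uses only the standard library. The helper names `dot`, `norm` and `cosine` are illustrative, and raising an error for a zero vector is an assumed choice, since the cosine is undefined in that case.

```python
import math


def dot(x, y):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def norm(x):
    """Euclidean length of a vector."""
    return math.sqrt(dot(x, x))


def cosine(x, y):
    """Cosine of the angle between x and y."""
    nx, ny = norm(x), norm(y)
    if nx == 0 or ny == 0:
        # Assumed behavior: the angle to a zero vector is undefined.
        raise ValueError("cosine is undefined for a zero vector")
    return dot(x, y) / (nx * ny)


d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]

print(dot(d1, d2))
print(round(norm(d1), 3), round(norm(d2), 3))
print(round(cosine(d1, d2), 4))
```

Running it gives a dot product of 5, lengths of about 6.481 (the square root of 42) and 2.449 (the square root of 6), and a cosine of about 0.3150.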

Example

document Apple Microsoft Obama Election

D1 10 20 0 0
D2 30 60 0 0
D3 60 30 0 0
D4 0 0 10 20

Cos(D1,D2) = 1

Cos(D3,D1) = Cos(D3,D2) = 4/5

Cos(D4,D1) = Cos(D4,D2) = Cos(D4,D3) = 0

[Figure: the document vectors plotted on the axes apple, microsoft and {Obama, election}]
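
As a check on these values, the sketch below computes the cosine for the raw count vectors and again after each vector is divided by its total count. The names are illustrative, and the division by the total count is an assumed form of length normalization. Cosine depends only on direction, so the results should agree up to floating-point rounding.

```python
import math


def cosine(x, y):
    """Cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def scale_by_total(x):
    """Divide each count by the total count (assumed normalization)."""
    total = sum(x)
    return [a / total for a in x]


D = {
    "D1": [10, 20, 0, 0],
    "D2": [30, 60, 0, 0],
    "D3": [60, 30, 0, 0],
    "D4": [0, 0, 10, 20],
}

for x, y in [("D1", "D2"), ("D3", "D1"), ("D3", "D2"), ("D4", "D1")]:
    raw = cosine(D[x], D[y])
    scaled = cosine(scale_by_total(D[x]), scale_by_total(D[y]))
    print(x, y, round(raw, 4), round(scaled, 4))
```

Unlike the set view, cosine separates D3 from D1 and D2 (4/5 rather than 1) while still treating D1 and D2 as identical in direction.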
