Lec-3. Datamining-Similarity-Distance-Ext
MS (Data Science)
Fall 2024 Semester
Course Teacher
Professor
Ex Dean & Chairman
Nazeer Hussain University, Karachi
Books
• "Introduction to Data Mining" by Tan, Steinbach, and Kumar.
• "Mining Massive Datasets" by Anand Rajaraman, Jeff Ullman, and Jure Leskovec. Free online book; includes slides from the course.
• "Data Mining: Concepts and Techniques" by Jiawei Han and Micheline Kamber.
DATA MINING
SIMILARITY & DISTANCE
Similarity and Distance
Recommender Systems
Similarity
• Numerical measure of how alike two data objects are.
• A function that maps pairs of objects to real values
• Higher when objects are more alike.
• Often falls in the range [0,1], sometimes in [-1,1]
Similarity: Intersection
• Number of words in common
Jaccard Similarity
• The Jaccard similarity (Jaccard coefficient) of two sets S1, S2 is the size of their intersection divided by the size of their union:
• JSim(S1, S2) = |S1 ∩ S2| / |S1 ∪ S2|
• (Venn-diagram example: 3 elements in the intersection, 8 in the union, so the Jaccard similarity = 3/8.)
• Extreme behavior:
• JSim(X, Y) = 1 iff X = Y
• JSim(X, Y) = 0 iff X, Y have no elements in common
• JSim is symmetric
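The definition maps directly onto Python's set operations; a minimal sketch (the function name and example sets are ours, chosen to reproduce the 3/8 case above):

```python
def jaccard(s1, s2):
    """Jaccard similarity: |S1 ∩ S2| / |S1 ∪ S2|."""
    if not s1 and not s2:
        return 1.0  # convention: two empty sets are considered identical
    return len(s1 & s2) / len(s1 | s2)

S1 = {1, 2, 3, 4, 5}
S2 = {3, 4, 5, 6, 7, 8}   # 3 common elements, 8 in the union
```

Here jaccard(S1, S2) returns 3/8 = 0.375, matching the slide's example.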
Example

document   Apple   Microsoft   Obama   Election
D1         10      20          0       0
D2         30      60          0       0
D3         60      30          0       0
D4         0       0           10      20

(Figure: the documents plotted on the apple and microsoft axes; D4 lies in the {Obama, election} subspace.)
• Documents D1 and D2 point in the "same direction" (D2 = 3·D1): they have the same topic mix and differ only in length.
Cosine Similarity
• Sim(X, Y) = cos(X, Y) = (X · Y) / (||X|| ||Y||)
• The cosine of the angle between X and Y
• Example:
d1 = (3, 2, 0, 5, 0, 0, 0, 2, 0, 0)
d2 = (1, 0, 0, 0, 0, 0, 0, 1, 0, 2)
d1 · d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
||d1|| = sqrt(9 + 4 + 25 + 4) = sqrt(42) ≈ 6.481
||d2|| = sqrt(1 + 1 + 4) = sqrt(6) ≈ 2.449
cos(d1, d2) = 5 / (6.481 · 2.449) ≈ 0.315
• Note: we only need to consider the non-zero entries of the vectors.
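The computation above can be sketched in a few lines (function name ours):

```python
import math

def cosine_similarity(x, y):
    # cos(X, Y) = (X . Y) / (||X|| * ||Y||)
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
```

cosine_similarity(d1, d2) reproduces the ≈ 0.315 worked out above.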
Example

document   Apple   Microsoft   Obama   Election
D1         10      20          0       0
D2         30      60          0       0
D3         60      30          0       0
D4         0       0           10      20

Cos(D1, D2) = 1, since D2 = 3·D1 and cosine similarity ignores vector length.
Correlation Coefficient
• The correlation coefficient measures the correlation between two random variables.
• Given observation vectors x = (x1, …, xn) and y = (y1, …, yn), it is defined as
CorrCoeff(x, y) = Σ (xi − x̄)(yi − ȳ) / ( sqrt(Σ (xi − x̄)²) · sqrt(Σ (yi − ȳ)²) )
• i.e., the cosine similarity of the mean-centered vectors.
Correlation Coefficient
Normalized (mean-centered) vectors:

document   Apple   Microsoft   Obama   Election
D1         -5      +5          0       0
D2         -15     +15         0       0
D3         +15     -15         0       0
D4         0       0           -5      +5

CorrCoeff(D1, D2) = 1
CorrCoeff(D1, D3) = CorrCoeff(D2, D3) = -1
CorrCoeff(D1, D4) = CorrCoeff(D2, D4) = CorrCoeff(D3, D4) = 0
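A sketch of this computation. Note that the slide's table centers only the nonzero entries of each document vector (so D1 = [10, 20, 0, 0] becomes [-5, +5, 0, 0]) before taking the cosine; the helper names below are ours:

```python
import math

def center_nonzero(v):
    # Subtract the mean of the nonzero entries, leaving zeros in place,
    # as in the slide's "normalized vectors" table.
    nz = [x for x in v if x != 0]
    m = sum(nz) / len(nz)
    return [x - m if x != 0 else 0.0 for x in v]

def corr_coeff(u, v):
    # Cosine similarity of the centered vectors.
    cu, cv = center_nonzero(u), center_nonzero(v)
    dot = sum(a * b for a, b in zip(cu, cv))
    nu = math.sqrt(sum(a * a for a in cu))
    nv = math.sqrt(sum(b * b for b in cv))
    return dot / (nu * nv)

D1, D2, D3, D4 = [10, 20, 0, 0], [30, 60, 0, 0], [60, 30, 0, 0], [0, 0, 10, 20]
```

This reproduces the 1, -1, and 0 values in the table.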
Distance
• Numerical measure of how different two data objects are
• A function that maps pairs of objects to real values
• Lower when objects are more alike
• Higher when two objects are different
• Minimum distance is 0, when comparing an object with itself.
• Upper limit varies
Distance Metric
• A distance function d is a distance metric if it is a function from pairs of objects to real numbers such that:
1. d(x, y) ≥ 0 (non-negativity)
2. d(x, y) = 0 iff x = y (identity)
3. d(x, y) = d(y, x) (symmetry)
4. d(x, z) ≤ d(x, y) + d(y, z) (triangle inequality)
Triangle Inequality
• The triangle inequality guarantees that the distance function is well-behaved:
• the direct connection is the shortest distance.
• L_p-norm: d_p(x, y) = ( Σ |xi − yi|^p )^(1/p)
• p = 1 gives the Manhattan (L1) distance, p = 2 the Euclidean (L2) distance, and p → ∞ the L∞ distance max |xi − yi|.
Example of Distances
x = (5, 5), y = (9, 8), so the coordinate differences are 4 and 3.
• L2-norm: d(x, y) = sqrt(4² + 3²) = 5
• L1-norm: d(x, y) = 4 + 3 = 7
• L∞-norm: d(x, y) = max(4, 3) = 4
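The three norms above, computed directly (variable names ours):

```python
import math

x, y = (5, 5), (9, 8)
dx, dy = abs(x[0] - y[0]), abs(x[1] - y[1])   # 4 and 3

l2 = math.hypot(dx, dy)   # Euclidean: sqrt(4^2 + 3^2) = 5
l1 = dx + dy              # Manhattan: 4 + 3 = 7
linf = max(dx, dy)        # L-infinity: max(4, 3) = 4
```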
Cosine Distance
• For vectors x = (x1, …, xn) and y = (y1, …, yn):
• Cosine distance: d(x, y) = 1 − cos(x, y)
Hamming Distance
• Hamming distance is the number of positions in which two bit-vectors (or equal-length strings) differ.
• Example:
• p1 = 10101
• p2 = 10011
• d(p1, p2) = 2, because the bit-vectors differ in the 3rd and 4th positions.
• String examples (edit distance when lengths differ): weird / wierd, intelligent / unintelligent, Athena / Athina
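A short sketch of Hamming distance on equal-length sequences (function name ours):

```python
def hamming(p1, p2):
    # Count positions at which two equal-length sequences differ.
    if len(p1) != len(p2):
        raise ValueError("Hamming distance requires equal-length inputs")
    return sum(a != b for a, b in zip(p1, p2))
```

hamming("10101", "10011") returns 2, matching the example above.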
Variational Distance
• Variational distance: the L1 distance between the term-distribution vectors.

document   Apple   Microsoft   Obama   Election
D1         0.35    0.5         0.1     0.05
D2         0.4     0.4         0.1     0.1
D3         0.05    0.05        0.6     0.3

Dist(D1, D2) = 0.05 + 0.1 + 0.05 = 0.2
Dist(D1, D3) = 0.3 + 0.45 + 0.5 + 0.25 = 1.5
Dist(D2, D3) = 0.35 + 0.35 + 0.5 + 0.2 = 1.4

(Bar chart: the four term frequencies plotted for D1, D2, D3.)
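The distances above reduce to a one-line sum (function name ours):

```python
def variational_distance(p, q):
    # L1 distance between two distribution vectors.
    return sum(abs(a - b) for a, b in zip(p, q))

D1 = [0.35, 0.5, 0.1, 0.05]
D2 = [0.4, 0.4, 0.1, 0.1]
D3 = [0.05, 0.05, 0.6, 0.3]
```

This reproduces Dist(D1, D2) = 0.2, Dist(D1, D3) = 1.5, and Dist(D2, D3) = 1.4.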
APPLICATIONS OF SIMILARITY:
RECOMMENDATION SYSTEMS
An important problem
• Recommendation systems
• When a user buys an item (initially books) we want to recommend other
items that the user may like
• When a user rates a movie, we want to recommend movies that the user
may like
• When a user likes a song, we want to recommend other songs that they
may like
Rows: Users
Columns: Movies (in general Items)
Values: The rating of the user for the movie
Recommendation Systems
• Content-based:
• Represent the items in a feature space and recommend to customer C items similar to those C rated highly in the past
• Movie recommendations: recommend movies with the same actor(s), director, genre, …
• Websites, blogs, news: recommend other sites with "similar" content
Content-based prediction
Someone who likes one of the Harry Potter (or Star Wars) movies is likely to like the rest:
• same actors, similar story, same genre
Intuition
(Figure: from the items a user likes, build item profiles; aggregate them into a user profile, then match the user profile against new items to produce recommendations.)
Approach
• Map items into a feature space:
• For movies:
• Actors, directors, genre, rating, year,…
• Challenge: make all features compatible.
• For documents?
• To compare items with users we need to map users to the same feature
space. How?
• Take all the movies that the user has seen and take the average vector
• Other aggregation functions are also possible.
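A minimal sketch of the averaging step described above (the feature names and movie vectors are made up for illustration):

```python
def user_profile(item_vectors):
    # Average the feature vectors of all items the user has seen.
    n = len(item_vectors)
    dim = len(item_vectors[0])
    return [sum(v[i] for v in item_vectors) / n for i in range(dim)]

# Hypothetical 3-feature movie vectors (e.g., action / comedy / drama scores)
seen = [[1.0, 0.0, 0.5],
        [0.8, 0.2, 0.1]]
```

user_profile(seen) maps the user into the same feature space as the items, so item-to-user similarity can be computed with any of the measures above. Other aggregations (e.g., a rating-weighted average) drop in the same way.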
Collaborative Filtering
• Two users are similar if they rate the same items in a similar way.
User Similarity
• Cosine similarity of the rating vectors, with missing ratings stored as zeros (which implicitly treats them as negative):
• Cos(A, B) = 0.38
• Cos(A, C) = 0.32
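The values 0.38 and 0.32 match the utility-matrix example in Mining Massive Datasets; the rating vectors below are an assumption consistent with those numbers (the original table is not shown on this slide):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical utility-matrix rows; unrated items stored as 0,
# which cosine implicitly treats as negative ratings.
A = [4, 0, 0, 5, 1, 0, 0]
B = [5, 5, 4, 0, 0, 0, 0]
C = [0, 0, 0, 2, 4, 5, 0]
```

With these vectors, cosine(A, B) ≈ 0.38 and cosine(A, C) ≈ 0.32, so A looks (slightly) more similar to B than to C.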
Euclidean Distance
Formula:
d(x, y) = sqrt( Σ (xi − yi)² )
• Example:
Consider two users in a movie recommendation system whose ratings of 5 movies are recorded. The ratings are as follows:
User A: [5, 3, 4, 4, 2]
User B: [4, 2, 5, 3, 1]
We want to calculate the Euclidean distance between these two users.
• Solution:
d(A, B) = sqrt((5−4)² + (3−2)² + (4−5)² + (4−3)² + (2−1)²) = sqrt(5) ≈ 2.24
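The same computation via the standard library, as a quick check of the arithmetic above:

```python
import math

A = [5, 3, 4, 4, 2]
B = [4, 2, 5, 3, 1]

# math.dist computes the Euclidean distance between two points.
d = math.dist(A, B)  # sqrt(5)
```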
Manhattan Distance
Formula:
d(x, y) = Σ |xi − yi|
• Example:
• Consider two users in a movie recommendation system whose ratings of 5 movies are recorded. The ratings are as follows:
• User A: [5, 3, 4, 4, 2]
• User B: [4, 2, 5, 3, 1]
• We want to calculate the Manhattan distance between these two users.
• Solution:
d(A, B) = |5−4| + |3−2| + |4−5| + |4−3| + |2−1| = 5
Minkowski Distance
Formula:
d(x, y) = ( Σ |xi − yi|^p )^(1/p)
• p = 1 gives the Manhattan distance; p = 2 gives the Euclidean distance.
• Example:
• Consider two users in a movie recommendation system whose ratings of 5 movies are recorded. The ratings are as follows:
• User A: [5, 3, 4, 4, 2]
• User B: [4, 2, 5, 3, 1]
• We want to calculate the Minkowski distance between these two users.
• Solution (with p = 3, for illustration): every |difference| is 1, so d(A, B) = (5 · 1³)^(1/3) = 5^(1/3) ≈ 1.71
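A general Minkowski sketch of the formula above (function name ours); p = 1 and p = 2 reproduce the Manhattan and Euclidean results from the previous slides:

```python
def minkowski(x, y, p):
    # Minkowski distance: (sum_i |x_i - y_i|^p)^(1/p)
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

A = [5, 3, 4, 4, 2]
B = [4, 2, 5, 3, 1]
```

minkowski(A, B, 1) gives 5, minkowski(A, B, 2) gives sqrt(5) ≈ 2.24, and minkowski(A, B, 3) gives 5^(1/3) ≈ 1.71.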
• Example:
• Let's calculate the Jaccard similarity between two sets of movie
genres watched by two users:
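The worked sets are not shown here; as a stand-in, hypothetical genre sets of our own:

```python
# Hypothetical genre sets for two users (illustrative only).
user1 = {"action", "comedy", "drama", "thriller"}
user2 = {"comedy", "drama", "romance"}

# Jaccard similarity: |intersection| / |union|
jsim = len(user1 & user2) / len(user1 | user2)  # 2 shared genres, 5 in the union
```

With these sets, jsim = 2/5 = 0.4.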
Cosine Similarity
Formula:
cos(x, y) = (x · y) / (||x|| ||y||) = Σ xi·yi / ( sqrt(Σ xi²) · sqrt(Σ yi²) )
• Example:
• Now, let's calculate the Pearson correlation for two users' movie ratings over 4 movies:
• User A's ratings: [4, 3, 5, 1]
• User B's ratings: [5, 2, 4, 1]
• Solution:
mean(A) = 13/4 = 3.25, mean(B) = 12/4 = 3
Centered A: [0.75, -0.25, 1.75, -2.25]; Centered B: [2, -1, 1, -2]
Numerator: 0.75·2 + (-0.25)(-1) + 1.75·1 + (-2.25)(-2) = 1.5 + 0.25 + 1.75 + 4.5 = 8
Denominator: sqrt(8.75) · sqrt(10) ≈ 2.958 · 3.162 ≈ 9.354
r = 8 / 9.354 ≈ 0.855
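The arithmetic above can be checked with a small Pearson implementation (function name ours):

```python
import math

def pearson(x, y):
    # Pearson correlation: covariance of x and y divided by the
    # product of their standard deviations (unnormalized form).
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

A = [4, 3, 5, 1]
B = [5, 2, 4, 1]
```

pearson(A, B) returns ≈ 0.855, agreeing with the worked solution.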
• Euclidean Space
• Properties:
• The norm (magnitude) of a vector in Euclidean space is the square root of the sum of the squares of its components: ||x|| = sqrt(x1² + … + xn²).
• Angles between vectors can be calculated using geometric concepts like the dot product.
• Vector Space
• Definition: A vector space is a more abstract algebraic structure
that consists of vectors and operations like vector addition and
scalar multiplication. Vector spaces do not inherently have the
notions of distance or angles; they rely on algebraic rules instead.
• Key Characteristics:
• Defined by vectors (elements that can be added and multiplied by scalars).
• No direct geometric interpretation unless a metric is defined (e.g., through
inner products or norms).
• Can exist in any dimension, even infinite.