Lecture 3

Chapter 3 discusses the concepts of similarity and distance in recommendation systems, emphasizing the importance of quantifying how close two objects are for various applications such as item recommendations and customer grouping. It introduces various methods for measuring similarity, including Jaccard and Cosine similarity, as well as distance metrics like Hamming and Edit distance. The chapter also outlines different types of recommendation systems, including collaborative filtering, content-based filtering, and hybrid systems, which aim to enhance user engagement by predicting user preferences.

CHAPTER 3

Similarity and Distance


Recommendation Systems
Sketching, Locality Sensitive Hashing
Similarity and Distance
• For many different problems we need to quantify how
close two objects are.
• Examples:
• For an item bought by a customer, find other similar items
• Group together the customers of a site so that similar customers
are shown the same ad.
• Group together web documents so that you can separate the ones
that talk about politics and the ones that talk about sports.
• Find all the near-duplicate mirrored web documents.
• Find credit card transactions that are very different from previous
transactions.
• To solve these problems we need a definition of similarity,
or distance.
• The definition depends on the type of data that we have
Similarity
• Numerical measure of how alike two data objects
are.
• A function that maps pairs of objects to real values
• Higher when objects are more alike.
• Often falls in the range [0,1], sometimes in [-1,1]

• Desirable properties for similarity


1. s(p, q) = 1 (or maximum similarity) only if p = q.
(Identity)
2. s(p, q) = s(q, p) for all p and q. (Symmetry)
Similarity between sets
• Consider the following documents

D1: apple releases new ipod
D2: apple releases new ipad
D3: new apple pie recipe

• Which ones are more similar?

• How would you quantify their similarity?


Similarity: Intersection
• Number of words in common

D1: apple releases new ipod
D2: apple releases new ipad
D3: new apple pie recipe

• Sim(D1,D2) = 3, Sim(D1,D3) = Sim(D2,D3) = 2


• What about this document?

D4: Vefa releases new book with apple pie recipes

• Sim(D1,D4) = Sim(D2,D4) = 3

Jaccard Similarity
• The Jaccard similarity (Jaccard coefficient) of two sets S1,
S2 is the size of their intersection divided by the size of
their union.
• JSim(C1, C2) = |C1 ∩ C2| / |C1 ∪ C2|.

Example: 3 in intersection, 8 in union.
Jaccard similarity = 3/8

• Extreme behavior:
• Jsim(X,Y) = 1, iff X = Y
• Jsim(X,Y) = 0 iff X,Y have no elements in common
• JSim is symmetric
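As a quick illustration, here is a minimal Python sketch (names are illustrative) of the Jaccard similarity between two documents viewed as sets of words, checked on the example documents D1, D2, D3 above.

```python
# Minimal sketch: Jaccard similarity of two documents viewed as sets of words.
def jaccard(s1, s2):
    s1, s2 = set(s1), set(s2)
    if not s1 and not s2:
        return 1.0  # convention: two empty sets are considered identical
    return len(s1 & s2) / len(s1 | s2)

d1 = "apple releases new ipod".split()
d2 = "apple releases new ipad".split()
d3 = "new apple pie recipe".split()

print(jaccard(d1, d2))  # 3/5 = 0.6
print(jaccard(d1, d3))  # 2/6 = 0.333...
```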
Jaccard Similarity between sets
• The Jaccard similarities for the documents

D1: apple releases new ipod
D2: apple releases new ipad
D3: new apple pie recipe
D4: Vefa releases new book with apple pie recipes

• JSim(D1,D2) = 3/5
• JSim(D1,D3) = JSim(D2,D3) = 2/6
• JSim(D1,D4) = JSim(D2,D4) = 3/9
Similarity between vectors
Documents (and sets in general) can also be represented as vectors

document   Apple   Microsoft   Obama   Election
D1         10      20          0       0
D2         30      60          0       0
D3         60      30          0       0
D4         0       0           10      20

How do we measure the similarity of two vectors?

• We could view them as sets of words. Jaccard Similarity will show that D4 is different from the rest
• But all pairs of the other three documents are equally similar
We want to capture how well the two vectors are aligned
Example

document   Apple   Microsoft   Obama   Election
D1         10      20          0       0
D2         30      60          0       0
D3         60      30          0       0
D4         0       0           10      20

[Figure: the documents plotted with “apple” and “microsoft” as axes; {Obama, election} is a third, orthogonal direction]

Documents D1, D2 are in the “same direction”
Document D3 is on the same plane as D1, D2
Document D4 is orthogonal to the rest
Example

document   Apple   Microsoft   Obama   Election
D1         1/3     2/3         0       0
D2         1/3     2/3         0       0
D3         2/3     1/3         0       0
D4         0       0           1/3     2/3

[Figure: the same documents after normalizing each vector, on the same “apple”/“microsoft” axes]

Documents D1, D2 are in the “same direction”
Document D3 is on the same plane as D1, D2
Document D4 is orthogonal to the rest
Cosine Similarity

• Sim(X,Y) = cos(X,Y)
• The cosine of the angle between X and Y

• If the vectors are aligned (correlated) angle is zero degrees and


cos(X,Y)=1
• If the vectors are orthogonal (no common coordinates) angle is 90
degrees and cos(X,Y) = 0

• Cosine is commonly used for comparing documents, where we


assume that the vectors are normalized by the document length.
Cosine Similarity - math
• If d1 and d2 are two vectors, then
cos( d1, d2 ) = (d1 • d2) / ||d1|| ||d2|| ,
where • indicates vector dot product and || d || is the length of vector d.

• Example:

d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2

d1 • d2= 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5

||d1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 ≈ 6.481

||d2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = (6)^0.5 ≈ 2.449

cos( d1, d2 ) = 5 / (6.481 * 2.449) ≈ 0.3150
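A small Python sketch of the same computation (illustrative only), reproducing the d1, d2 example above:

```python
import math

# Minimal sketch: cosine similarity between two count vectors.
def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
print(round(cosine(d1, d2), 4))  # 0.315
```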


Example

document   Apple   Microsoft   Obama   Election
D1         10      20          0       0
D2         30      60          0       0
D3         60      30          0       0
D4         0       0           10      20

[Figure: the documents on the “apple”/“microsoft” axes, with {Obama, election} orthogonal]

Cos(D1,D2) = 1

Cos(D3,D1) = Cos(D3,D2) = 4/5

Cos(D4,D1) = Cos(D4,D2) = Cos(D4,D3) = 0
Distance
• Numerical measure of how different two data
objects are
• A function that maps pairs of objects to real values
• Lower when objects are more alike
• Higher when two objects are different
• Minimum distance is 0, when comparing an
object with itself.
• Upper limit varies
Distance Metric
• A distance function d is a distance metric if it is a
function from pairs of objects to real numbers
such that:
1. d(x,y) ≥ 0. (non-negativity)
2. d(x,y) = 0 iff x = y. (identity)
3. d(x,y) = d(y,x). (symmetry)
4. d(x,y) ≤ d(x,z) + d(z,y). (triangle inequality)
Triangle Inequality
• Triangle inequality guarantees that the distance
function is well-behaved.
• The direct connection is the shortest distance

• It is also useful for proving properties of the data.
Distances for real vectors
• Vectors x = (x1, …, xd) and y = (y1, …, yd)

• Lp norms or Minkowski distance:

  Lp(x, y) = (|x1 − y1|^p + ⋯ + |xd − yd|^p)^(1/p)

• L2 norm: Euclidean distance:

  L2(x, y) = (|x1 − y1|^2 + ⋯ + |xd − yd|^2)^(1/2)

• L1 norm: Manhattan distance:

  L1(x, y) = |x1 − y1| + ⋯ + |xd − yd|

Lp norms are known to be distance metrics

• L∞ norm:

  L∞(x, y) = max{|x1 − y1|, …, |xd − yd|}

• The limit of Lp as p goes to infinity.

Example of Distances
x = (5,5), y = (9,8)

L2-norm:   dist(x, y) = (4^2 + 3^2)^(1/2) = 5

L1-norm:   dist(x, y) = 4 + 3 = 7

L∞-norm:   dist(x, y) = max{4, 3} = 4
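The distances above are easy to verify with a short Python sketch (illustrative, not a library implementation):

```python
# Minimal sketch: Lp and L-infinity distances, checked on x = (5,5), y = (9,8).
def lp_dist(x, y, p):
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def linf_dist(x, y):
    return max(abs(a - b) for a, b in zip(x, y))

x, y = (5, 5), (9, 8)
print(lp_dist(x, y, 2))   # 5.0  (Euclidean)
print(lp_dist(x, y, 1))   # 7.0  (Manhattan)
print(linf_dist(x, y))    # 4    (max coordinate difference)
```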
Example

x = (x1, …, xn)

Green: All points y at distance L1(x,y) = r from point x

Blue: All points y at distance L2(x,y) = r from point x

Red: All points y at distance L∞(x,y) = r from point x


Lp distances for sets
• We can apply all the Lp distances to the cases of
sets of attributes, with or without counts, if we
represent the sets as vectors
• E.g., a transaction is a 0/1 vector
• E.g., a document is a vector of counts.
Similarities into distances
• Jaccard distance:
𝐽𝐷𝑖𝑠𝑡(𝑋, 𝑌) = 1 – 𝐽𝑆𝑖𝑚(𝑋, 𝑌)

• Jaccard Distance is a metric

• Cosine distance:
𝐷𝑖𝑠𝑡(𝑋, 𝑌) = 1 − cos(𝑋, 𝑌)
• Cosine distance is not a metric: 1 − cos(X,Y) can violate the triangle inequality (the angle arccos(cos(X,Y)) is a metric)

Hamming Distance
• Hamming distance is the number of positions in which
bit-vectors differ.
• Example: p1 = 10101
p2 = 10011.

• d(p1, p2) = 2 because the bit-vectors differ in the 3rd and

4th positions.

• Hamming distance between two vectors of categorical


attributes is the number of positions in which they differ.
• Example: x = (married, low income, cheat ),
• y = (single , low income, not cheat)
• d(x,y) = 2
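Both variants (bit strings and tuples of categorical attributes) can be computed by the same simple count, as in this illustrative Python sketch:

```python
# Minimal sketch: Hamming distance between two equal-length sequences.
def hamming(x, y):
    assert len(x) == len(y), "Hamming distance needs equal-length sequences"
    return sum(1 for a, b in zip(x, y) if a != b)

print(hamming("10101", "10011"))                        # 2
print(hamming(("married", "low income", "cheat"),
              ("single", "low income", "not cheat")))   # 2
```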

Why Hamming Distance Is a Distance Metric

• d(x,x) = 0 since no positions differ.


• d(x,y) = d(y,x) by symmetry of “different from.”
• d(x,y) ≥ 0 since strings cannot differ in a negative
number of positions.
• Triangle inequality: changing x to z and then to y
is one way to change x to y.

• For binary vectors it follows from the fact that the L1 norm is a metric (the Hamming distance of two 0/1 vectors equals their L1 distance)
Distance between strings
• How do we define similarity between strings?

weird wierd
intelligent unintelligent
Athena Athina

• Important for recognizing and correcting typing


errors and analyzing DNA sequences.

Edit Distance for strings


• The edit distance of two strings is the number of
inserts and deletes of characters needed to turn one
into the other.
• Example: x = abcde ;
• y = bcduve.
• Turn x into y by deleting a, then inserting u and v after d.
• Edit distance = 3.
• Minimum number of operations can be computed
using dynamic programming
• Common distance measure for comparing DNA
sequences
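Below is a small Python sketch of the dynamic program for the insert/delete-only edit distance used on this slide (allowing substitutions, as in the variants discussed later, would just add a third case to the recurrence). Names are illustrative.

```python
# Minimal sketch: insert/delete-only edit distance via dynamic programming.
# D[i][j] = edit distance between the prefixes x[:i] and y[:j].
def edit_distance(x, y):
    m, n = len(x), len(y)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i                      # delete all of x[:i]
    for j in range(n + 1):
        D[0][j] = j                      # insert all of y[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                D[i][j] = D[i - 1][j - 1]          # characters match, no edit
            else:
                D[i][j] = 1 + min(D[i - 1][j],     # delete x[i-1]
                                  D[i][j - 1])     # insert y[j-1]
    return D[m][n]

print(edit_distance("abcde", "bcduve"))  # 3
```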

Why Edit Distance Is a Distance Metric

• d(x,x) = 0 because 0 edits suffice.


• d(x,y) = d(y,x) because insert/delete are
inverses of each other.
• d(x,y) ≥ 0: no notion of negative edits.
• Triangle inequality: changing x to z and then z to y is one way to change x to y; the minimum number of edits is no more than that.

Variant Edit Distances

• Allow insert, delete, and mutate.


• Change one character into another.
• Minimum number of inserts, deletes, and
mutates also forms a distance measure.

• Same for any set of operations on strings.


• Example: substring reversal or block transposition OK
for DNA sequences
• Example: character transposition is useful for spelling errors
APPLICATIONS OF SIMILARITY: RECOMMENDATION SYSTEMS
Recommendation Systems
Recommendation systems are a type of information filtering system that aims to predict the interests or preferences of a user and make personalized recommendations for products or content. They are used in a variety of applications, such as e-commerce, social media, and streaming services, to enhance user engagement and increase revenue.

• There are several types of recommendation systems, including:


1. Collaborative filtering: a technique that analyzes the behavior of a large number of users and identifies similar users based on their shared behavior. Once similar users are identified, the algorithm can use their past behavior to make recommendations to a user, based on the assumption that the user will have similar preferences to those similar users.

2. Content-based filtering: In this approach, recommendations are made based on the


characteristics of the items being recommended. For example, if a user has shown interest in
action movies in the past, the algorithm will recommend other action movies with similar
characteristics, such as the same director or lead actor.

3. Hybrid systems: These systems combine both collaborative filtering and content-based filtering
to improve the accuracy of recommendations. Hybrid systems can leverage the strengths of both
approaches to overcome their weaknesses.
Collaborative Filtering
• Collaborative filtering
• is a technique used in recommender systems to make personalized
recommendations to users based on their past behavior and the
behavior of similar users.
• The idea behind collaborative filtering is that people who have
similar preferences in the past are likely to have similar
preferences in the future.
• In collaborative filtering, the algorithm analyzes the past behavior of
a large number of users and identifies similar users based on their
shared behavior.
• For example, if two users have both watched and rated a similar
set of movies or products, the algorithm would consider them
similar users. Once similar users are identified, the algorithm can use
their past behavior to make recommendations to a user based on
the assumption that the user will have similar preferences to those
similar users.
Intuition

[Figure: content-based recommendation. From the items the user likes, build item profiles (e.g., red circles and triangles), aggregate them into a user profile, then match the user profile against the profiles of other items and recommend the best matches.]
An important problem
• Recommendation systems
• When a user buys an item (initially books) we want to
recommend other items that the user may like
• When a user rates a movie, we want to recommend
movies that the user may like
• When a user likes a song, we want to recommend other
songs that they may like

• A big success of data mining


• Exploits the long tail
• How Into Thin Air made Touching the Void popular
Utility (Preference) Matrix

     Harry     Harry     Harry     Twilight   Star      Star      Star
     Potter 1  Potter 2  Potter 3             Wars 1    Wars 2    Wars 3
A    4         -         -         5          1         -         -
B    5         5         4         -          -         -         -
C    -         -         -         2          4         5         -
D    -         3         -         -          -         -         3

(a dash marks a missing rating)

How can we fill the empty entries of the matrix?


Recommendation Systems
• Content-based:
• Represent the items into a feature space and
recommend items to customer C similar to previous
items rated highly by C
• Movie recommendations: recommend movies with same
actor(s), director, genre, …
• Websites, blogs, news: recommend other sites with “similar”
content
Content-based prediction

     Harry     Harry     Harry     Twilight   Star      Star      Star
     Potter 1  Potter 2  Potter 3             Wars 1    Wars 2    Wars 3
A    4         -         -         5          1         -         -
B    5         5         4         -          -         -         -
C    -         -         -         2          4         5         -
D    -         3         -         -          -         -         3

Someone who likes one of the Harry Potter (or Star Wars)
movies is likely to like the rest
• Same actors, similar story, same genre
Approach
• Map items into a feature space:
• For movies:
• Actors, directors, genre, rating, year,…
• Challenge: make all features compatible.
• For documents?

• To compare items with users we need to map users to the


same feature space. How?
• Take all the movies that the user has seen and take the average
vector
• Other aggregation functions are also possible.

• Recommend to user C the most similar item i, computing similarity in the common feature space
• Distributional distance measures also work well.
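To make the approach concrete, here is a small Python sketch with hypothetical item feature vectors (the feature names and values are made up for illustration): the user profile is the average of the vectors of the items the user liked, and we recommend the unseen item with the highest cosine similarity to that profile.

```python
import math

# Minimal sketch of content-based recommendation (hypothetical feature vectors).
def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

items = {                       # assumed feature space: (action, fantasy, romance)
    "HP1": (0.2, 0.9, 0.1),
    "SW1": (0.8, 0.6, 0.1),
    "SW2": (0.9, 0.5, 0.0),
    "TW":  (0.1, 0.5, 0.9),
}
liked = ["HP1", "SW1"]          # items the user rated highly

# User profile = average of the liked items' feature vectors.
profile = [sum(col) / len(liked) for col in zip(*(items[i] for i in liked))]
candidates = [i for i in items if i not in liked]
best = max(candidates, key=lambda i: cosine(profile, items[i]))
print(profile, best)            # [0.5, 0.75, 0.1] SW2
```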
Collaborative filtering

     Harry     Harry     Harry     Twilight   Star      Star      Star
     Potter 1  Potter 2  Potter 3             Wars 1    Wars 2    Wars 3
A    4         -         -         5          1         -         -
B    5         5         4         -          -         -         -
C    -         -         -         2          4         5         -
D    -         3         -         -          -         -         3

Two users are similar if they rate the same items in a similar way

Recommend to user C the items liked by many of the most similar users.
User Similarity

     Harry     Harry     Harry     Twilight   Star      Star      Star
     Potter 1  Potter 2  Potter 3             Wars 1    Wars 2    Wars 3
A    4         -         -         5          1         -         -
B    5         5         4         -          -         -         -
C    -         -         -         2          4         5         -
D    -         3         -         -          -         -         3

Which pair of users do you consider as the most similar?

What is the right definition of similarity?


User Similarity (Jaccard Similarity)

     Harry     Harry     Harry     Twilight   Star      Star      Star
     Potter 1  Potter 2  Potter 3             Wars 1    Wars 2    Wars 3
A    1         -         -         1          1         -         -
B    1         1         1         -          -         -         -
C    -         -         -         1          1         1         -
D    -         1         -         -          -         -         1

Jaccard Similarity: users are sets of movies

Disregards the ratings.


Jsim(A,B) = 1/5
Jsim(A,C) = 1/2
Jsim(B,D) = 1/4
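These values can be checked with a short Python sketch. The dictionary below encodes the utility matrix above (the placement of user D's two ratings follows the reconstruction used in the table; abbreviations are HP = Harry Potter, TW = Twilight, SW = Star Wars).

```python
# Minimal sketch: Jaccard similarity between users, where each user is
# treated as the set of movies they have rated (the ratings are ignored).
ratings = {
    "A": {"HP1": 4, "TW": 5, "SW1": 1},
    "B": {"HP1": 5, "HP2": 5, "HP3": 4},
    "C": {"TW": 2, "SW1": 4, "SW2": 5},
    "D": {"HP2": 3, "SW3": 3},
}

def jaccard_users(u, v):
    su, sv = set(ratings[u]), set(ratings[v])
    return len(su & sv) / len(su | sv)

print(jaccard_users("A", "B"))  # 1/5 = 0.2
print(jaccard_users("A", "C"))  # 2/4 = 0.5
print(jaccard_users("B", "D"))  # 1/4 = 0.25
```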
User Similarity (Cosine)
     Harry     Harry     Harry     Twilight   Star      Star      Star
     Potter 1  Potter 2  Potter 3             Wars 1    Wars 2    Wars 3
A    1         2         3         -          -         -         -
B    4         5         6         -          -         -         -

Cosine Similarity:
To calculate the cosine similarity between A and B, we first compute the dot product of the
two vectors:

A·B = (1 * 4) + (2 * 5) + (3 * 6) = 32
Next, we calculate the magnitudes of the two vectors:

|A| = sqrt((1 * 1) + (2 * 2) + (3 * 3)) = sqrt(14) ≈ 3.7417


|B| = sqrt((4 * 4) + (5 * 5) + (6 * 6)) = sqrt(77) ≈ 8.7749
Finally, we divide the dot product by the product of the magnitudes:

cosine_similarity(A, B) = A·B / (|A| * |B|) = 32 / (3.7417 * 8.7749) ≈ 0.9746


So the cosine similarity between A and B is approximately 0.97, indicating that the
two vectors are very similar.
User Similarity (Normalized Cosine Similarity vs. Cosine Similarity):
     Harry     Harry     Harry     Twilight   Star      Star      Star
     Potter 1  Potter 2  Potter 3             Wars 1    Wars 2    Wars 3
A    1         2         3         -          -         -         -
B    2         4         6         -          -         -         -
To calculate the cosine similarity between A and B, we first compute the dot product of the
two vectors:
A·B = (1 * 2) + (2 * 4) + (3 * 6) = 28
Next, we calculate the magnitudes of the two vectors:
|A| = sqrt((1 * 1) + (2 * 2) + (3 * 3)) = sqrt(14) ≈ 3.7417
|B| = sqrt((2 * 2) + (4 * 4) + (6 * 6)) = sqrt(56) ≈ 7.4833
Finally, we divide the dot product by the product of the magnitudes:
cosine_similarity(A, B) = A·B / (|A| * |B|) = 28 / (3.7417 * 7.4833) ≈ 1.0

Now let's calculate the normalized cosine similarity between A and B:


To normalize the vectors, we divide each element of the vectors by their respective magnitudes:
A_norm = [1/|A|, 2/|A|, 3/|A|] = [0.2673, 0.5345, 0.8018]
B_norm = [2/|B|, 4/|B|, 6/|B|] = [0.2673, 0.5345, 0.8018]
Now we can calculate the cosine similarity between the normalized vectors:
A_norm·B_norm = (0.2673 * 0.2673) + (0.5345 * 0.5345) + (0.8018 * 0.8018) ≈ 1.0
Since B = 2A, the two vectors point in the same direction, so both computations give 1:
normalizing the vectors by their length does not change the cosine similarity.
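A quick Python check of this point (illustrative snippet): scaling a vector does not change its cosine similarity, because cosine already normalizes by vector length.

```python
import math

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

A = [1, 2, 3]
B = [2, 4, 6]                                    # B = 2 * A
A_norm = [a / math.sqrt(14) for a in A]          # divide by |A|
B_norm = [b / math.sqrt(56) for b in B]          # divide by |B|

print(round(cosine(A, B), 4))                                # 1.0
print(round(sum(a * b for a, b in zip(A_norm, B_norm)), 4))  # 1.0
```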
correlation coefficient
The correlation coefficient is a statistical measure that is used to evaluate the

strength and direction of the linear relationship between two variables. It is

denoted by the symbol "r" and ranges from -1 to +1.


correlation coefficient
Let's say we have two variables: X and Y. We have collected data on these
variables and want to calculate their correlation coefficient.
Here's an example of some data we've collected:

     Harry     Harry     Harry     Twilight   Star      Star      Star
     Potter 1  Potter 2  Potter 3             Wars 1    Wars 2    Wars 3
A    1         2         3         4          5         -         -
B    2         4         6         8          10        -         -
correlation coefficient
     Harry     Harry     Harry     Twilight   Star      Star      Star
     Potter 1  Potter 2  Potter 3             Wars 1    Wars 2    Wars 3
A    1         2         3         4          5         -         -
B    2         4         6         8          10        -         -

r = (nΣXY - ΣXΣY) / sqrt((nΣX^2 - (ΣX)^2) * (nΣY^2 - (ΣY)^2))


= (5* 110 – 15*30) / Sqrt (( 5 * 55 – 225) * (5 * 220 – 900))
= ( 550 – 450) / Sqrt ((275 – 225) * ( 1100 – 900))
= 100 / Sqrt (50 * 200)
= 1
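The same calculation in a short Python sketch (illustrative; in practice one would use an existing statistics library):

```python
import math

# Minimal sketch of the correlation coefficient formula above.
def pearson(x, y):
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    return (n * sxy - sx * sy) / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

A = [1, 2, 3, 4, 5]
B = [2, 4, 6, 8, 10]
print(pearson(A, B))  # 1.0
```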
User-User Collaborative Filtering
• Consider user c
• Find set D of other users whose ratings are most “similar”
to c’s ratings
• Estimate user’s ratings based on ratings of users in D
using some aggregation function

• Advantage:
• for each user we have small amount of computation.
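The sketch below illustrates the user-user scheme just described on the small utility matrix from the earlier slides: the estimated rating is a similarity-weighted average of the ratings of the k most similar users who rated the item. The choice of cosine similarity, the value of k, and all names are illustrative assumptions.

```python
import math

# Minimal sketch of user-user collaborative filtering.
# HP = Harry Potter, TW = Twilight, SW = Star Wars.
ratings = {
    "A": {"HP1": 4, "TW": 5, "SW1": 1},
    "B": {"HP1": 5, "HP2": 5, "HP3": 4},
    "C": {"TW": 2, "SW1": 4, "SW2": 5},
    "D": {"HP2": 3, "SW3": 3},
}

def sim(u, v):
    # cosine similarity, treating missing ratings as 0
    common = set(ratings[u]) & set(ratings[v])
    if not common:
        return 0.0
    dot = sum(ratings[u][i] * ratings[v][i] for i in common)
    nu = math.sqrt(sum(r * r for r in ratings[u].values()))
    nv = math.sqrt(sum(r * r for r in ratings[v].values()))
    return dot / (nu * nv)

def predict(c, item, k=2):
    # users other than c who rated the item, ranked by similarity to c
    raters = [u for u in ratings if u != c and item in ratings[u]]
    top = sorted(raters, key=lambda u: sim(c, u), reverse=True)[:k]
    num = sum(sim(c, u) * ratings[u][item] for u in top)
    den = sum(sim(c, u) for u in top)
    return num / den if den else None

print(predict("A", "HP2"))  # estimate of A's rating for Harry Potter 2
```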
Item-Item Collaborative Filtering
• We can transpose (flip) the matrix and perform the same
computation as before to define similarity between items
• Intuition: Two items are similar if they are rated in the same way by
many users.
• Better defined similarity since it captures the notion of genre of an
item
• Users may have multiple interests.
• Algorithm: For each user c and item i
• Find the set D of most similar items to item i that have been rated by user
c.
• Aggregate their ratings to predict the rating for item i.
• Disadvantage: we need to consider each user-item pair separately
Collaborative Filtering example

[Figure: example rating matrix for Users 1-4 and Movies 1-4]

We can see that User 1 and User 2 give nearly similar ratings to Movie 1, so we can conclude
that Movie 3 is also going to be moderately liked by User 1, while Movie 4 will be a good
recommendation for User 2.
We can also see that there are users with different tastes: User 1 and User 3 are opposite to
each other. User 3 and User 4 have a common interest in a movie, so on that basis we can say
that Movie 4 is also going to be disliked by User 4. This is Collaborative Filtering: we
recommend to users the items which are liked by users with similar interests.

Cosine Similarity
We can also use the cosine similarity between the users to find the users with
similar interests; a larger cosine implies a smaller angle between two users,
and hence more similar interests.
SKETCHING AND LOCALITY SENSITIVE HASHING
SKETCHING AND LOCALITY SENSITIVE HASHING
Sketching and Locality Sensitive Hashing (LSH) are techniques used to
perform approximate nearest neighbor search in high-dimensional data.
The goal of these techniques is to find similar items or data points
efficiently, without having to compare each item or data point against all
others.
Sketching is a technique that involves representing high-dimensional data as low-
dimensional sketches or summaries, while still preserving the essential
characteristics of the data. For example, in the case of text data, one could represent
each document as a set of words, and then represent each document as a sketch
that only contains the most frequently occurring words. This allows for efficient
comparison and similarity calculation between documents.

Locality Sensitive Hashing (LSH) is another technique used to find approximate


nearest neighbors in high-dimensional data. LSH involves hashing data points into
buckets based on their similarity. Similar data points are more likely to be hashed
into the same bucket, allowing for efficient nearest neighbor search. One popular
implementation of LSH is the MinHash algorithm, which involves hashing sets of
items into signatures and comparing the signatures of two sets to estimate their
Jaccard similarity.
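The snippet below is a minimal MinHash sketch along the lines described above (the number of hash functions, the hash construction via Python's built-in hash, and all names are illustrative assumptions, not a production implementation).

```python
import random

# Minimal MinHash sketch: each of n random hash functions maps every element of
# a set to an integer; the signature keeps the minimum value per hash function.
# The fraction of positions where two signatures agree estimates Jaccard similarity.
def make_hashes(n, prime=2_147_483_647, seed=0):
    rng = random.Random(seed)
    coeffs = [(rng.randrange(1, prime), rng.randrange(0, prime)) for _ in range(n)]
    return [lambda x, a=a, b=b: (a * hash(x) + b) % prime for a, b in coeffs]

def signature(items, hashes):
    return [min(h(x) for x in items) for h in hashes]

def estimate_jaccard(sig1, sig2):
    return sum(1 for a, b in zip(sig1, sig2) if a == b) / len(sig1)

hashes = make_hashes(200)
s1 = {"apple", "releases", "new", "ipod"}
s2 = {"apple", "releases", "new", "ipad"}
print(estimate_jaccard(signature(s1, hashes), signature(s2, hashes)))  # close to 3/5
```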
Sketching Example
• Suppose you have a large collection of text documents and you want to
find similar documents quickly and efficiently. One way to approach this
problem is to use a technique called "sketching".
• First, you can represent each document as a set of words that occur in the
document. For example, the set of words for a document might include "cat",
"dog", "house", and "tree".
• Next, you can create a "sketch" or summary of the document by selecting
only the most frequent words in the set. For example, if you choose the
top 10 most frequent words, the sketch for the document might be:
• "cat", "dog", "house", "tree", "the", "and", "in", "is", "of", and "to".
• By representing each document as a sketch, you can perform similarity
search between documents more efficiently. Instead of comparing the
entire set of words for each document, you can compare only the
sketches. This reduces the computational complexity of the similarity search,
making it faster and more efficient.
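A toy Python version of this idea (the sketch size, the tokenization, and the example texts are all made up for illustration):

```python
from collections import Counter

# Minimal sketch: summarize each document by its k most frequent words
# and compare the summaries with Jaccard similarity.
def sketch(text, k=10):
    counts = Counter(text.lower().split())
    return {word for word, _ in counts.most_common(k)}

def jaccard(s1, s2):
    return len(s1 & s2) / len(s1 | s2)

doc_a = "the cat and the dog live in the house by the tree"
doc_b = "a cat and a dog play near the house and the tree"
print(jaccard(sketch(doc_a), sketch(doc_b)))
```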
Another important problem
• Find duplicate and near-duplicate documents
from a web crawl.
• Why is it important:
• Identify mirrored web pages, and avoid indexing them,
or serving them multiple times
• Find replicated news stories and cluster them under a
single story.
• Identify plagiarism

• What if we wanted exact duplicates?


Finding similar items
• Both the problems we described have a common
component
• We need a quick way to find highly similar items to a
query item
• OR, we need a method for finding all pairs of items that
are highly similar.
• Also known as the Nearest Neighbor problem, or
the All Nearest Neighbors problem

• We will examine it for the case of near-duplicate


web documents.
Main issues
• What is the right representation of the document
when we check for similarity?
• E.g., representing a document as a set of characters
will not do (why?)
• When we have billions of documents, keeping the
full text in memory is not an option.
• We need to find a shorter representation
• How do we do pairwise comparisons of billions of
documents?
• If we wanted exact match it would be ok, can we
replicate this idea?

Three Essential Techniques for Similar Documents

1. Shingling: convert documents, emails, etc., to sets.

2. Minhashing: convert large sets to short signatures, while preserving similarity.

3. Locality-Sensitive Hashing (LSH): focus on pairs of signatures likely to be similar.

The Big Picture

[Pipeline figure:]
Document
  → Shingling → the set of strings of length k that appear in the document
  → Minhashing → signatures: short integer vectors that represent the sets, and reflect their similarity
  → Locality-Sensitive Hashing → candidate pairs: those pairs of signatures that we need to test for similarity

Shingles

• A k-shingle (or k-gram) for a document is a sequence of k characters that appears in the document.
• Example: document = abcab. k=2
• Set of 2-shingles = {ab, bc, ca}.
• Option: regard shingles as a bag, and count ab twice.

• Represent a document by its set of k-shingles.
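A one-function Python sketch of character k-shingles, checked on the abcab example (treating the shingles as a set, not a bag):

```python
# Minimal sketch: the set of character k-shingles of a document.
def shingles(doc, k):
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}

print(shingles("abcab", 2))  # {'ab', 'bc', 'ca'}
```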


Shingling Example
• Shingle: a sequence of k contiguous words

Note that each shingle overlaps the previous one by two words (for k = 3 word shingles).
This is called a "sliding window" approach.

By representing the document as a set of shingles, you can


perform efficient similarity search using techniques like
MinHash and Locality Sensitive Hashing (LSH).

This allows you to find similar documents even if they are not
exact matches, making it useful for applications like
plagiarism detection and document retrieval.
