Lecture 3

Chapter 3 discusses the concepts of similarity and distance in recommendation systems, emphasizing the importance of quantifying how close two objects are for various applications such as item recommendations and customer grouping. It introduces various methods for measuring similarity, including Jaccard and Cosine similarity, as well as distance metrics like Hamming and Edit distance. The chapter also outlines different types of recommendation systems, including collaborative filtering, content-based filtering, and hybrid systems, which aim to enhance user engagement by predicting user preferences.

CHAPTER 3

Similarity and Distance


Recommendation Systems
Sketching, Locality Sensitive Hashing
Similarity and Distance
• For many different problems we need to quantify how
close two objects are.
• Examples:
• For an item bought by a customer, find other similar items
• Group together the customers of a site so that similar customers
are shown the same ad.
• Group together web documents so that you can separate the ones
that talk about politics and the ones that talk about sports.
• Find all the near-duplicate mirrored web documents.
• Find credit card transactions that are very different from previous
transactions.
• To solve these problems we need a definition of similarity,
or distance.
• The definition depends on the type of data that we have
Similarity
• Numerical measure of how alike two data objects
are.
• A function that maps pairs of objects to real values
• Higher when objects are more alike.
• Often falls in the range [0,1], sometimes in [-1,1]

• Desirable properties for similarity


1. s(p, q) = 1 (or maximum similarity) only if p = q.
(Identity)
2. s(p, q) = s(q, p) for all p and q. (Symmetry)
Similarity between sets
• Consider the following documents

D1: apple releases new ipod
D2: apple releases new ipad
D3: new apple pie recipe

• Which ones are more similar?

• How would you quantify their similarity?


Similarity: Intersection
• Number of words in common

D1: apple releases new ipod
D2: apple releases new ipad
D3: new apple pie recipe

• Sim(D1,D2) = 3, Sim(D1,D3) = Sim(D2,D3) = 2


• What about this document?

D4: Vefa releases new book with apple pie recipes

• Sim(D1,D4) = Sim(D2,D4) = 3

Jaccard Similarity
• The Jaccard similarity (Jaccard coefficient) of two sets S1,
S2 is the size of their intersection divided by the size of
their union.
• JSim(C1, C2) = |C1 ∩ C2| / |C1 ∪ C2|.

Example: 3 in intersection, 8 in union.
Jaccard similarity = 3/8

• Extreme behavior:
• Jsim(X,Y) = 1, iff X = Y
• Jsim(X,Y) = 0 iff X,Y have no elements in common
• JSim is symmetric
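As a quick illustration, here is a minimal Python sketch (names are illustrative) of the Jaccard similarity between two documents viewed as sets of words, checked on the example documents D1, D2, D3 above.

```python
# Minimal sketch: Jaccard similarity of two documents viewed as sets of words.
def jaccard(s1, s2):
    s1, s2 = set(s1), set(s2)
    if not s1 and not s2:
        return 1.0  # convention: two empty sets are considered identical
    return len(s1 & s2) / len(s1 | s2)

d1 = "apple releases new ipod".split()
d2 = "apple releases new ipad".split()
d3 = "new apple pie recipe".split()

print(jaccard(d1, d2))  # 3/5 = 0.6
print(jaccard(d1, d3))  # 2/6 = 0.333...
```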
Jaccard Similarity between sets
• The Jaccard similarities for the documents

D1: apple releases new ipod
D2: apple releases new ipad
D3: new apple pie recipe
D4: Vefa releases new book with apple pie recipes

• JSim(D1,D2) = 3/5
• JSim(D1,D3) = JSim(D2,D3) = 2/6
• JSim(D1,D4) = JSim(D2,D4) = 3/9
Similarity between vectors
Documents (and sets in general) can also be represented as vectors

document   Apple   Microsoft   Obama   Election
D1         10      20          0       0
D2         30      60          0       0
D3         60      30          0       0
D4         0       0           10      20

How do we measure the similarity of two vectors?

• We could view them as sets of words. Jaccard Similarity will show that D4 is different from the rest
• But all pairs of the other three documents are equally similar
We want to capture how well the two vectors are aligned
Example

document   Apple   Microsoft   Obama   Election
D1         10      20          0       0
D2         30      60          0       0
D3         60      30          0       0
D4         0       0           10      20

[Figure: the documents plotted with “apple” and “microsoft” as axes; {Obama, election} is a third, orthogonal direction]

Documents D1, D2 are in the “same direction”
Document D3 is on the same plane as D1, D2
Document D4 is orthogonal to the rest
Example

document   Apple   Microsoft   Obama   Election
D1         1/3     2/3         0       0
D2         1/3     2/3         0       0
D3         2/3     1/3         0       0
D4         0       0           1/3     2/3

[Figure: the same documents after normalizing each vector, on the same “apple”/“microsoft” axes]

Documents D1, D2 are in the “same direction”
Document D3 is on the same plane as D1, D2
Document D4 is orthogonal to the rest
Cosine Similarity

• Sim(X,Y) = cos(X,Y)
• The cosine of the angle between X and Y

• If the vectors are aligned (correlated) angle is zero degrees and


cos(X,Y)=1
• If the vectors are orthogonal (no common coordinates) angle is 90
degrees and cos(X,Y) = 0

• Cosine is commonly used for comparing documents, where we


assume that the vectors are normalized by the document length.
Cosine Similarity - math
• If d1 and d2 are two vectors, then
cos( d1, d2 ) = (d1 • d2) / ||d1|| ||d2|| ,
where • indicates vector dot product and || d || is the length of vector d.

• Example:

d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2

d1 • d2= 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5

||d1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 ≈ 6.481

||d2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = (6)^0.5 ≈ 2.449

cos( d1, d2 ) = 5 / (6.481 * 2.449) ≈ 0.3150
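A small Python sketch of the same computation (illustrative only), reproducing the d1, d2 example above:

```python
import math

# Minimal sketch: cosine similarity between two count vectors.
def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
print(round(cosine(d1, d2), 4))  # 0.315
```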


Example

document   Apple   Microsoft   Obama   Election
D1         10      20          0       0
D2         30      60          0       0
D3         60      30          0       0
D4         0       0           10      20

[Figure: the documents on the “apple”/“microsoft” axes, with {Obama, election} orthogonal]

Cos(D1,D2) = 1

Cos(D3,D1) = Cos(D3,D2) = 4/5

Cos(D4,D1) = Cos(D4,D2) = Cos(D4,D3) = 0
Distance
• Numerical measure of how different two data
objects are
• A function that maps pairs of objects to real values
• Lower when objects are more alike
• Higher when two objects are different
• Minimum distance is 0, when comparing an
object with itself.
• Upper limit varies
Distance Metric
• A distance function d is a distance metric if it is a
function from pairs of objects to real numbers
such that:
1. d(x,y) ≥ 0. (non-negativity)
2. d(x,y) = 0 iff x = y. (identity)
3. d(x,y) = d(y,x). (symmetry)
4. d(x,y) ≤ d(x,z) + d(z,y). (triangle inequality)
Triangle Inequality
• Triangle inequality guarantees that the distance
function is well-behaved.
• The direct connection is the shortest distance

• It is also useful for proving properties of the data.
Distances for real vectors
• Vectors x = (x1, …, xd) and y = (y1, …, yd)

• Lp norms or Minkowski distance:

  Lp(x, y) = (|x1 − y1|^p + ⋯ + |xd − yd|^p)^(1/p)

• L2 norm: Euclidean distance:

  L2(x, y) = (|x1 − y1|^2 + ⋯ + |xd − yd|^2)^(1/2)

• L1 norm: Manhattan distance:

  L1(x, y) = |x1 − y1| + ⋯ + |xd − yd|

Lp norms are known to be distance metrics

• L∞ norm:

  L∞(x, y) = max{|x1 − y1|, …, |xd − yd|}

• The limit of Lp as p goes to infinity.

Example of Distances
x = (5,5), y = (9,8)

L2-norm:   dist(x, y) = (4^2 + 3^2)^(1/2) = 5

L1-norm:   dist(x, y) = 4 + 3 = 7

L∞-norm:   dist(x, y) = max{4, 3} = 4
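The distances above are easy to verify with a short Python sketch (illustrative, not a library implementation):

```python
# Minimal sketch: Lp and L-infinity distances, checked on x = (5,5), y = (9,8).
def lp_dist(x, y, p):
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def linf_dist(x, y):
    return max(abs(a - b) for a, b in zip(x, y))

x, y = (5, 5), (9, 8)
print(lp_dist(x, y, 2))   # 5.0  (Euclidean)
print(lp_dist(x, y, 1))   # 7.0  (Manhattan)
print(linf_dist(x, y))    # 4    (max coordinate difference)
```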
Example

x = (x1, …, xn)

Green: All points y at distance L1(x,y) = r from point x

Blue: All points y at distance L2(x,y) = r from point x

Red: All points y at distance L∞(x,y) = r from point x


Lp distances for sets
• We can apply all the Lp distances to the cases of
sets of attributes, with or without counts, if we
represent the sets as vectors
• E.g., a transaction is a 0/1 vector
• E.g., a document is a vector of counts.
Similarities into distances
• Jaccard distance:
𝐽𝐷𝑖𝑠𝑡(𝑋, 𝑌) = 1 – 𝐽𝑆𝑖𝑚(𝑋, 𝑌)

• Jaccard Distance is a metric

• Cosine distance:
𝐷𝑖𝑠𝑡(𝑋, 𝑌) = 1 − cos(𝑋, 𝑌)
• Cosine distance is not a metric: 1 − cos(X,Y) can violate the triangle inequality (the angle arccos(cos(X,Y)) is a metric)

Hamming Distance
• Hamming distance is the number of positions in which
bit-vectors differ.
• Example: p1 = 10101
p2 = 10011.

• d(p1, p2) = 2 because the bit-vectors differ in the 3rd and

4th positions.

• Hamming distance between two vectors of categorical


attributes is the number of positions in which they differ.
• Example: x = (married, low income, cheat ),
• y = (single , low income, not cheat)
• d(x,y) = 2
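Both variants (bit strings and tuples of categorical attributes) can be computed by the same simple count, as in this illustrative Python sketch:

```python
# Minimal sketch: Hamming distance between two equal-length sequences.
def hamming(x, y):
    assert len(x) == len(y), "Hamming distance needs equal-length sequences"
    return sum(1 for a, b in zip(x, y) if a != b)

print(hamming("10101", "10011"))                        # 2
print(hamming(("married", "low income", "cheat"),
              ("single", "low income", "not cheat")))   # 2
```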

Why Hamming Distance Is a Distance Metric

• d(x,x) = 0 since no positions differ.


• d(x,y) = d(y,x) by symmetry of “different from.”
• d(x,y) ≥ 0 since strings cannot differ in a negative
number of positions.
• Triangle inequality: changing x to z and then to y
is one way to change x to y.

• For binary vectors it follows from the fact that the L1 norm is a metric (the Hamming distance of two 0/1 vectors equals their L1 distance)
Distance between strings
• How do we define similarity between strings?

weird wierd
intelligent unintelligent
Athena Athina

• Important for recognizing and correcting typing


errors and analyzing DNA sequences.

Edit Distance for strings


• The edit distance of two strings is the number of
inserts and deletes of characters needed to turn one
into the other.
• Example: x = abcde ;
• y = bcduve.
• Turn x into y by deleting a, then inserting u and v after d.
• Edit distance = 3.
• Minimum number of operations can be computed
using dynamic programming
• Common distance measure for comparing DNA
sequences
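Below is a small Python sketch of the dynamic program for the insert/delete-only edit distance used on this slide (allowing substitutions, as in the variants discussed later, would just add a third case to the recurrence). Names are illustrative.

```python
# Minimal sketch: insert/delete-only edit distance via dynamic programming.
# D[i][j] = edit distance between the prefixes x[:i] and y[:j].
def edit_distance(x, y):
    m, n = len(x), len(y)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i                      # delete all of x[:i]
    for j in range(n + 1):
        D[0][j] = j                      # insert all of y[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                D[i][j] = D[i - 1][j - 1]          # characters match, no edit
            else:
                D[i][j] = 1 + min(D[i - 1][j],     # delete x[i-1]
                                  D[i][j - 1])     # insert y[j-1]
    return D[m][n]

print(edit_distance("abcde", "bcduve"))  # 3
```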

Why Edit Distance Is a Distance Metric

• d(x,x) = 0 because 0 edits suffice.


• d(x,y) = d(y,x) because insert/delete are
inverses of each other.
• d(x,y) ≥ 0: no notion of negative edits.
• Triangle inequality: changing x to z and then z to y is one way to change x to y; the minimum number of edits is no more than that.

Variant Edit Distances

• Allow insert, delete, and mutate.


• Change one character into another.
• Minimum number of inserts, deletes, and
mutates also forms a distance measure.

• Same for any set of operations on strings.


• Example: substring reversal or block transposition OK
for DNA sequences
• Example: character transposition is useful for spelling errors
APPLICATIONS OF SIMILARITY: RECOMMENDATION SYSTEMS
Recommendation Systems
Recommendation systems are a type of information filtering system that aims to predict the interests or preferences of a user and make personalized recommendations for products or content. They are used in a variety of applications, such as e-commerce, social media, and streaming services, to enhance user engagement and increase revenue.

• There are several types of recommendation systems, including:


1. Collaborative filtering: a technique that analyzes the behavior of a large number of users and identifies similar users based on their shared behavior. Once similar users are identified, the algorithm can use their past behavior to make recommendations to a user, based on the assumption that the user will have similar preferences to those similar users.

2. Content-based filtering: In this approach, recommendations are made based on the


characteristics of the items being recommended. For example, if a user has shown interest in
action movies in the past, the algorithm will recommend other action movies with similar
characteristics, such as the same director or lead actor.

3. Hybrid systems: These systems combine both collaborative filtering and content-based filtering
to improve the accuracy of recommendations. Hybrid systems can leverage the strengths of both
approaches to overcome their weaknesses.
Collaborative Filtering
• Collaborative filtering
• is a technique used in recommender systems to make personalized
recommendations to users based on their past behavior and the
behavior of similar users.
• The idea behind collaborative filtering is that people who have
similar preferences in the past are likely to have similar
preferences in the future.
• In collaborative filtering, the algorithm analyzes the past behavior of
a large number of users and identifies similar users based on their
shared behavior.
• For example, if two users have both watched and rated a similar
set of movies or products, the algorithm would consider them
similar users. Once similar users are identified, the algorithm can use
their past behavior to make recommendations to a user based on
the assumption that the user will have similar preferences to those
similar users.
Intuition

[Figure: content-based recommendation. From the items the user likes, build item profiles (e.g., red circles and triangles), aggregate them into a user profile, then match the user profile against the profiles of other items and recommend the best matches.]
An important problem
• Recommendation systems
• When a user buys an item (initially books) we want to
recommend other items that the user may like
• When a user rates a movie, we want to recommend
movies that the user may like
• When a user likes a song, we want to recommend other
songs that they may like

• A big success of data mining


• Exploits the long tail
• How Into Thin Air made Touching the Void popular
Utility (Preference) Matrix

     Harry     Harry     Harry     Twilight   Star      Star      Star
     Potter 1  Potter 2  Potter 3             Wars 1    Wars 2    Wars 3
A    4         -         -         5          1         -         -
B    5         5         4         -          -         -         -
C    -         -         -         2          4         5         -
D    -         3         -         -          -         -         3

(a dash marks a missing rating)

How can we fill the empty entries of the matrix?


Recommendation Systems
• Content-based:
• Represent the items into a feature space and
recommend items to customer C similar to previous
items rated highly by C
• Movie recommendations: recommend movies with same
actor(s), director, genre, …
• Websites, blogs, news: recommend other sites with “similar”
content
Content-based prediction

     Harry     Harry     Harry     Twilight   Star      Star      Star
     Potter 1  Potter 2  Potter 3             Wars 1    Wars 2    Wars 3
A    4         -         -         5          1         -         -
B    5         5         4         -          -         -         -
C    -         -         -         2          4         5         -
D    -         3         -         -          -         -         3

Someone who likes one of the Harry Potter (or Star Wars)
movies is likely to like the rest
• Same actors, similar story, same genre
Approach
• Map items into a feature space:
• For movies:
• Actors, directors, genre, rating, year,…
• Challenge: make all features compatible.
• For documents?

• To compare items with users we need to map users to the


same feature space. How?
• Take all the movies that the user has seen and take the average
vector
• Other aggregation functions are also possible.

• Recommend to user C the most similar item i, computing similarity in the common feature space
• Distributional distance measures also work well.
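To make the approach concrete, here is a small Python sketch with hypothetical item feature vectors (the feature names and values are made up for illustration): the user profile is the average of the vectors of the items the user liked, and we recommend the unseen item with the highest cosine similarity to that profile.

```python
import math

# Minimal sketch of content-based recommendation (hypothetical feature vectors).
def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

items = {                       # assumed feature space: (action, fantasy, romance)
    "HP1": (0.2, 0.9, 0.1),
    "SW1": (0.8, 0.6, 0.1),
    "SW2": (0.9, 0.5, 0.0),
    "TW":  (0.1, 0.5, 0.9),
}
liked = ["HP1", "SW1"]          # items the user rated highly

# User profile = average of the liked items' feature vectors.
profile = [sum(col) / len(liked) for col in zip(*(items[i] for i in liked))]
candidates = [i for i in items if i not in liked]
best = max(candidates, key=lambda i: cosine(profile, items[i]))
print(profile, best)            # [0.5, 0.75, 0.1] SW2
```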
Collaborative filtering

     Harry     Harry     Harry     Twilight   Star      Star      Star
     Potter 1  Potter 2  Potter 3             Wars 1    Wars 2    Wars 3
A    4         -         -         5          1         -         -
B    5         5         4         -          -         -         -
C    -         -         -         2          4         5         -
D    -         3         -         -          -         -         3

Two users are similar if they rate the same items in a similar way

Recommend to user C the items liked by many of the most similar users.
User Similarity

     Harry     Harry     Harry     Twilight   Star      Star      Star
     Potter 1  Potter 2  Potter 3             Wars 1    Wars 2    Wars 3
A    4         -         -         5          1         -         -
B    5         5         4         -          -         -         -
C    -         -         -         2          4         5         -
D    -         3         -         -          -         -         3

Which pair of users do you consider as the most similar?

What is the right definition of similarity?


User Similarity (Jaccard Similarity)

     Harry     Harry     Harry     Twilight   Star      Star      Star
     Potter 1  Potter 2  Potter 3             Wars 1    Wars 2    Wars 3
A    1         -         -         1          1         -         -
B    1         1         1         -          -         -         -
C    -         -         -         1          1         1         -
D    -         1         -         -          -         -         1

Jaccard Similarity: users are sets of movies

Disregards the ratings.


Jsim(A,B) = 1/5
Jsim(A,C) = 1/2
Jsim(B,D) = 1/4
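These values can be checked with a short Python sketch. The dictionary below encodes the utility matrix above (the placement of user D's two ratings follows the reconstruction used in the table; abbreviations are HP = Harry Potter, TW = Twilight, SW = Star Wars).

```python
# Minimal sketch: Jaccard similarity between users, where each user is
# treated as the set of movies they have rated (the ratings are ignored).
ratings = {
    "A": {"HP1": 4, "TW": 5, "SW1": 1},
    "B": {"HP1": 5, "HP2": 5, "HP3": 4},
    "C": {"TW": 2, "SW1": 4, "SW2": 5},
    "D": {"HP2": 3, "SW3": 3},
}

def jaccard_users(u, v):
    su, sv = set(ratings[u]), set(ratings[v])
    return len(su & sv) / len(su | sv)

print(jaccard_users("A", "B"))  # 1/5 = 0.2
print(jaccard_users("A", "C"))  # 2/4 = 0.5
print(jaccard_users("B", "D"))  # 1/4 = 0.25
```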
User Similarity (Cosine)
     Harry     Harry     Harry     Twilight   Star      Star      Star
     Potter 1  Potter 2  Potter 3             Wars 1    Wars 2    Wars 3
A    1         2         3         -          -         -         -
B    4         5         6         -          -         -         -

Cosine Similarity:
To calculate the cosine similarity between A and B, we first compute the dot product of the
two vectors:

A·B = (1 * 4) + (2 * 5) + (3 * 6) = 32
Next, we calculate the magnitudes of the two vectors:

|A| = sqrt((1 * 1) + (2 * 2) + (3 * 3)) = sqrt(14) ≈ 3.7417


|B| = sqrt((4 * 4) + (5 * 5) + (6 * 6)) = sqrt(77) ≈ 8.7749
Finally, we divide the dot product by the product of the magnitudes:

cosine_similarity(A, B) = A·B / (|A| * |B|) = 32 / (3.7417 * 8.7749) ≈ 0.9746


So the cosine similarity between A and B is approximately 0.97, indicating that the
two vectors are very similar.
User Similarity (Normalized Cosine Similarity vs. Cosine Similarity):
     Harry     Harry     Harry     Twilight   Star      Star      Star
     Potter 1  Potter 2  Potter 3             Wars 1    Wars 2    Wars 3
A    1         2         3         -          -         -         -
B    2         4         6         -          -         -         -
To calculate the cosine similarity between A and B, we first compute the dot product of the
two vectors:
A·B = (1 * 2) + (2 * 4) + (3 * 6) = 28
Next, we calculate the magnitudes of the two vectors:
|A| = sqrt((1 * 1) + (2 * 2) + (3 * 3)) = sqrt(14) ≈ 3.7417
|B| = sqrt((2 * 2) + (4 * 4) + (6 * 6)) = sqrt(56) ≈ 7.4833
Finally, we divide the dot product by the product of the magnitudes:
cosine_similarity(A, B) = A·B / (|A| * |B|) = 28 / (3.7417 * 7.4833) ≈ 1.0

Now let's calculate the normalized cosine similarity between A and B:


To normalize the vectors, we divide each element of the vectors by their respective magnitudes:
A_norm = [1/|A|, 2/|A|, 3/|A|] = [0.2673, 0.5345, 0.8018]
B_norm = [2/|B|, 4/|B|, 6/|B|] = [0.2673, 0.5345, 0.8018]
Now we can calculate the cosine similarity between the normalized vectors:
A_norm·B_norm = (0.2673 * 0.2673) + (0.5345 * 0.5345) + (0.8018 * 0.8018) ≈ 1.0
Since B = 2A, the two vectors point in the same direction, so both computations give 1:
normalizing the vectors by their length does not change the cosine similarity.
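A quick Python check of this point (illustrative snippet): scaling a vector does not change its cosine similarity, because cosine already normalizes by vector length.

```python
import math

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

A = [1, 2, 3]
B = [2, 4, 6]                                    # B = 2 * A
A_norm = [a / math.sqrt(14) for a in A]          # divide by |A|
B_norm = [b / math.sqrt(56) for b in B]          # divide by |B|

print(round(cosine(A, B), 4))                                # 1.0
print(round(sum(a * b for a, b in zip(A_norm, B_norm)), 4))  # 1.0
```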
correlation coefficient
The correlation coefficient is a statistical measure that is used to evaluate the

strength and direction of the linear relationship between two variables. It is

denoted by the symbol "r" and ranges from -1 to +1.


correlation coefficient
Let's say we have two variables: X and Y. We have collected data on these
variables and want to calculate their correlation coefficient.
Here's an example of some data we've collected:

     Harry     Harry     Harry     Twilight   Star      Star      Star
     Potter 1  Potter 2  Potter 3             Wars 1    Wars 2    Wars 3
A    1         2         3         4          5         -         -
B    2         4         6         8          10        -         -
correlation coefficient
     Harry     Harry     Harry     Twilight   Star      Star      Star
     Potter 1  Potter 2  Potter 3             Wars 1    Wars 2    Wars 3
A    1         2         3         4          5         -         -
B    2         4         6         8          10        -         -

r = (nΣXY - ΣXΣY) / sqrt((nΣX^2 - (ΣX)^2) * (nΣY^2 - (ΣY)^2))


= (5* 110 – 15*30) / Sqrt (( 5 * 55 – 225) * (5 * 220 – 900))
= ( 550 – 450) / Sqrt ((275 – 225) * ( 1100 – 900))
= 100 / Sqrt (50 * 200)
= 1
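The same calculation in a short Python sketch (illustrative; in practice one would use an existing statistics library):

```python
import math

# Minimal sketch of the correlation coefficient formula above.
def pearson(x, y):
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    return (n * sxy - sx * sy) / math.sqrt((n * sxx - sx ** 2) * (n * syy - sy ** 2))

A = [1, 2, 3, 4, 5]
B = [2, 4, 6, 8, 10]
print(pearson(A, B))  # 1.0
```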
User-User Collaborative Filtering
• Consider user c
• Find set D of other users whose ratings are most “similar”
to c’s ratings
• Estimate user’s ratings based on ratings of users in D
using some aggregation function

• Advantage:
• for each user we have small amount of computation.
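The sketch below illustrates the user-user scheme just described on the small utility matrix from the earlier slides: the estimated rating is a similarity-weighted average of the ratings of the k most similar users who rated the item. The choice of cosine similarity, the value of k, and all names are illustrative assumptions.

```python
import math

# Minimal sketch of user-user collaborative filtering.
# HP = Harry Potter, TW = Twilight, SW = Star Wars.
ratings = {
    "A": {"HP1": 4, "TW": 5, "SW1": 1},
    "B": {"HP1": 5, "HP2": 5, "HP3": 4},
    "C": {"TW": 2, "SW1": 4, "SW2": 5},
    "D": {"HP2": 3, "SW3": 3},
}

def sim(u, v):
    # cosine similarity, treating missing ratings as 0
    common = set(ratings[u]) & set(ratings[v])
    if not common:
        return 0.0
    dot = sum(ratings[u][i] * ratings[v][i] for i in common)
    nu = math.sqrt(sum(r * r for r in ratings[u].values()))
    nv = math.sqrt(sum(r * r for r in ratings[v].values()))
    return dot / (nu * nv)

def predict(c, item, k=2):
    # users other than c who rated the item, ranked by similarity to c
    raters = [u for u in ratings if u != c and item in ratings[u]]
    top = sorted(raters, key=lambda u: sim(c, u), reverse=True)[:k]
    num = sum(sim(c, u) * ratings[u][item] for u in top)
    den = sum(sim(c, u) for u in top)
    return num / den if den else None

print(predict("A", "HP2"))  # estimate of A's rating for Harry Potter 2
```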
Item-Item Collaborative Filtering
• We can transpose (flip) the matrix and perform the same
computation as before to define similarity between items
• Intuition: Two items are similar if they are rated in the same way by
many users.
• Better defined similarity since it captures the notion of genre of an
item
• Users may have multiple interests.
• Algorithm: For each user c and item i
• Find the set D of most similar items to item i that have been rated by user
c.
• Aggregate their ratings to predict the rating for item i.
• Disadvantage: we need to consider each user-item pair separately
Collaborative Filtering example

[Figure: example rating matrix for Users 1-4 and Movies 1-4]

We can see that User 1 and User 2 give nearly similar ratings to Movie 1, so we can conclude
that Movie 3 is also going to be moderately liked by User 1, while Movie 4 will be a good
recommendation for User 2.
We can also see that there are users with different tastes: User 1 and User 3 are opposite to
each other. User 3 and User 4 have a common interest in a movie, so on that basis we can say
that Movie 4 is also going to be disliked by User 4. This is Collaborative Filtering: we
recommend to users the items which are liked by users with similar interests.

Cosine Similarity
We can also use the cosine similarity between the users to find the users with
similar interests; a larger cosine implies a smaller angle between two users,
and hence more similar interests.
SKETCHING AND LOCALITY SENSITIVE HASHING
SKETCHING AND LOCALITY SENSITIVE HASHING
Sketching and Locality Sensitive Hashing (LSH) are techniques used to
perform approximate nearest neighbor search in high-dimensional data.
The goal of these techniques is to find similar items or data points
efficiently, without having to compare each item or data point against all
others.
Sketching is a technique that involves representing high-dimensional data as low-
dimensional sketches or summaries, while still preserving the essential
characteristics of the data. For example, in the case of text data, one could represent
each document as a set of words, and then represent each document as a sketch
that only contains the most frequently occurring words. This allows for efficient
comparison and similarity calculation between documents.

Locality Sensitive Hashing (LSH) is another technique used to find approximate


nearest neighbors in high-dimensional data. LSH involves hashing data points into
buckets based on their similarity. Similar data points are more likely to be hashed
into the same bucket, allowing for efficient nearest neighbor search. One popular
implementation of LSH is the MinHash algorithm, which involves hashing sets of
items into signatures and comparing the signatures of two sets to estimate their
Jaccard similarity.
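The snippet below is a minimal MinHash sketch along the lines described above (the number of hash functions, the hash construction via Python's built-in hash, and all names are illustrative assumptions, not a production implementation).

```python
import random

# Minimal MinHash sketch: each of n random hash functions maps every element of
# a set to an integer; the signature keeps the minimum value per hash function.
# The fraction of positions where two signatures agree estimates Jaccard similarity.
def make_hashes(n, prime=2_147_483_647, seed=0):
    rng = random.Random(seed)
    coeffs = [(rng.randrange(1, prime), rng.randrange(0, prime)) for _ in range(n)]
    return [lambda x, a=a, b=b: (a * hash(x) + b) % prime for a, b in coeffs]

def signature(items, hashes):
    return [min(h(x) for x in items) for h in hashes]

def estimate_jaccard(sig1, sig2):
    return sum(1 for a, b in zip(sig1, sig2) if a == b) / len(sig1)

hashes = make_hashes(200)
s1 = {"apple", "releases", "new", "ipod"}
s2 = {"apple", "releases", "new", "ipad"}
print(estimate_jaccard(signature(s1, hashes), signature(s2, hashes)))  # close to 3/5
```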
Sketching Example
• Suppose you have a large collection of text documents and you want to
find similar documents quickly and efficiently. One way to approach this
problem is to use a technique called "sketching".
• First, you can represent each document as a set of words that occur in the
document. For example, the set of words for a document might include "cat",
"dog", "house", and "tree".
• Next, you can create a "sketch" or summary of the document by selecting
only the most frequent words in the set. For example, if you choose the
top 10 most frequent words, the sketch for the document might be:
• "cat", "dog", "house", "tree", "the", "and", "in", "is", "of", and "to".
• By representing each document as a sketch, you can perform similarity
search between documents more efficiently. Instead of comparing the
entire set of words for each document, you can compare only the
sketches. This reduces the computational complexity of the similarity search,
making it faster and more efficient.
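A toy Python version of this idea (the sketch size, the tokenization, and the example texts are all made up for illustration):

```python
from collections import Counter

# Minimal sketch: summarize each document by its k most frequent words
# and compare the summaries with Jaccard similarity.
def sketch(text, k=10):
    counts = Counter(text.lower().split())
    return {word for word, _ in counts.most_common(k)}

def jaccard(s1, s2):
    return len(s1 & s2) / len(s1 | s2)

doc_a = "the cat and the dog live in the house by the tree"
doc_b = "a cat and a dog play near the house and the tree"
print(jaccard(sketch(doc_a), sketch(doc_b)))
```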
Another important problem
• Find duplicate and near-duplicate documents
from a web crawl.
• Why is it important:
• Identify mirrored web pages, and avoid indexing them,
or serving them multiple times
• Find replicated news stories and cluster them under a
single story.
• Identify plagiarism

• What if we wanted exact duplicates?


Finding similar items
• Both the problems we described have a common
component
• We need a quick way to find highly similar items to a
query item
• OR, we need a method for finding all pairs of items that
are highly similar.
• Also known as the Nearest Neighbor problem, or
the All Nearest Neighbors problem

• We will examine it for the case of near-duplicate


web documents.
Main issues
• What is the right representation of the document
when we check for similarity?
• E.g., representing a document as a set of characters
will not do (why?)
• When we have billions of documents, keeping the
full text in memory is not an option.
• We need to find a shorter representation
• How do we do pairwise comparisons of billions of
documents?
• If we wanted exact match it would be ok, can we
replicate this idea?

Three Essential Techniques for Similar Documents

1. Shingling: convert documents, emails, etc., to sets.

2. Minhashing: convert large sets to short signatures, while preserving similarity.

3. Locality-Sensitive Hashing (LSH): focus on pairs of signatures likely to be similar.

The Big Picture

[Pipeline figure:]
Document
  → Shingling → the set of strings of length k that appear in the document
  → Minhashing → signatures: short integer vectors that represent the sets, and reflect their similarity
  → Locality-Sensitive Hashing → candidate pairs: those pairs of signatures that we need to test for similarity

Shingles

• A k-shingle (or k-gram) for a document is a sequence of k characters that appears in the document.
• Example: document = abcab. k=2
• Set of 2-shingles = {ab, bc, ca}.
• Option: regard shingles as a bag, and count ab twice.

• Represent a document by its set of k-shingles.
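A one-function Python sketch of character k-shingles, checked on the abcab example (treating the shingles as a set, not a bag):

```python
# Minimal sketch: the set of character k-shingles of a document.
def shingles(doc, k):
    return {doc[i:i + k] for i in range(len(doc) - k + 1)}

print(shingles("abcab", 2))  # {'ab', 'bc', 'ca'}
```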


Shingling Example
• Shingle: a sequence of k contiguous words

Note that each shingle overlaps the previous one by two words (for k = 3 word shingles).
This is called a "sliding window" approach.

By representing the document as a set of shingles, you can


perform efficient similarity search using techniques like
MinHash and Locality Sensitive Hashing (LSH).

This allows you to find similar documents even if they are not
exact matches, making it useful for applications like
plagiarism detection and document retrieval.
