Lecture 3
• Sim(D1,D2) = Sim(D2,D1) = 3
Jaccard Similarity
• The Jaccard similarity (Jaccard coefficient) of two sets S1,
S2 is the size of their intersection divided by the size of
their union.
• JSim(C1, C2) = |C1 ∩ C2| / |C1 ∪ C2|.
• Example (figure): 3 elements in the intersection, 8 in the union, so the Jaccard similarity = 3/8.
• Extreme behavior:
• JSim(X,Y) = 1 iff X = Y
• JSim(X,Y) = 0 iff X, Y have no elements in common
• JSim is symmetric
Jaccard Similarity between sets
• The Jaccard similarity for the documents:
• JSim(D1,D2) = 3/5
• JSim(D1,D3) = JSim(D3,D1) = 2/6
• JSim(D2,D3) = JSim(D3,D2) = 3/9
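The Jaccard computation can be sketched in a few lines of Python; the sets A and B below are made up so that the intersection has 3 elements and the union 8, matching the earlier example.

```python
def jaccard(s1, s2):
    """Jaccard similarity: |intersection| / |union|."""
    if not s1 and not s2:
        return 1.0  # convention: two empty sets are considered identical
    return len(s1 & s2) / len(s1 | s2)

# Made-up sets with 3 elements in the intersection and 8 in the union.
A = {"a", "b", "c", "d", "e"}
B = {"c", "d", "e", "f", "g", "h"}
print(jaccard(A, B))  # 3/8 = 0.375
```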
Similarity between vectors
Documents (and sets in general) can also be represented as vectors
Example (figure): documents represented as vectors over terms such as "apple" and "{Obama, election}"; documents D1, D2 are in the "same direction".
Cosine Similarity
• Sim(X,Y) = cos(X,Y)
• The cosine of the angle between X and Y
• Example:
d1 = 3 2 0 5 0 0 0 2 0 0
d2 = 1 0 0 0 0 0 0 1 0 2
d1 • d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5
||d1|| = √(3² + 2² + 5² + 2²) = √42 ≈ 6.48
||d2|| = √(1² + 1² + 2²) = √6 ≈ 2.45
cos(d1, d2) = 5 / (√42 · √6) ≈ 0.31
Figure: for documents D1, D2 in the "same direction", Cos(D1,D2) = 1.
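The worked dot-product example can be verified with a small sketch (plain Python, no libraries):

```python
import math

def cosine(x, y):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]
# dot product = 5, ||d1|| = sqrt(42), ||d2|| = sqrt(6)
print(round(cosine(d1, d2), 3))  # 0.315
```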
Distance
• Numerical measure of how different two data
objects are
• A function that maps pairs of objects to real values
• Lower when objects are more alike
• Higher when two objects are different
• Minimum distance is 0, when comparing an
object with itself.
• Upper limit varies
Distance Metric
• A distance function d is a distance metric if it is a
function from pairs of objects to real numbers
such that:
1. d(x,y) ≥ 0. (non-negativity)
2. d(x,y) = 0 iff x = y. (identity)
3. d(x,y) = d(y,x). (symmetry)
4. d(x,y) ≤ d(x,z) + d(z,y). (triangle inequality)
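The four properties can be spot-checked numerically, for example for the Euclidean distance on a handful of points (a brute-force sanity check over all triples, not a proof):

```python
import itertools
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def check_metric(d, points, tol=1e-9):
    """Check the four metric axioms on all triples drawn from a finite point set."""
    for x, y, z in itertools.product(points, repeat=3):
        assert d(x, y) >= 0                        # 1. non-negativity
        assert (d(x, y) < tol) == (x == y)         # 2. identity
        assert abs(d(x, y) - d(y, x)) < tol        # 3. symmetry
        assert d(x, y) <= d(x, z) + d(z, y) + tol  # 4. triangle inequality
    return True

print(check_metric(euclidean, [(0, 0), (5, 5), (9, 8), (1, 7)]))  # True
```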
Triangle Inequality
• Triangle inequality guarantees that the distance
function is well-behaved.
• The direct connection is the shortest distance
Example of Distances
x = (5,5), y = (9,8)
L2-norm: dist(x,y) = √(4² + 3²) = 5
L1-norm: dist(x,y) = 4 + 3 = 7
L∞-norm: dist(x,y) = max(4, 3) = 4
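The same x = (5,5), y = (9,8) example, computed directly:

```python
x, y = (5, 5), (9, 8)

l1 = sum(abs(a - b) for a, b in zip(x, y))             # Manhattan distance
l2 = sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5    # Euclidean distance
linf = max(abs(a - b) for a, b in zip(x, y))           # max-coordinate distance

print(l2, l1, linf)  # 5.0 7 4
```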
Example
• For vectors X = (x1, …, xn), Y = (y1, …, yn):
• Cosine distance: Dist(X, Y) = 1 − cos(X, Y)
• The angle arccos(cos(X, Y)) is a proper distance metric; note that the simpler 1 − cos(X, Y) form can violate the triangle inequality.
Hamming Distance
• Hamming distance is the number of positions in which
bit-vectors differ.
• Example: p1 = 10101
p2 = 10011
They differ in the 3rd and 4th positions, so d(p1, p2) = 2.
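A minimal sketch for bit strings:

```python
def hamming(p1, p2):
    """Number of positions at which two equal-length strings differ."""
    assert len(p1) == len(p2), "Hamming distance needs equal-length inputs"
    return sum(a != b for a, b in zip(p1, p2))

print(hamming("10101", "10011"))  # 2 (the 3rd and 4th positions differ)
```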
• Similar string pairs: weird / wierd, intelligent / unintelligent, Athena / Athina.
3. Hybrid systems: These systems combine both collaborative filtering and content-based filtering
to improve the accuracy of recommendations. Hybrid systems can leverage the strengths of both
approaches to overcome their weaknesses.
Collaborative Filtering
• Collaborative filtering
• is a technique used in recommender systems to make personalized
recommendations to users based on their past behavior and the
behavior of similar users.
• The idea behind collaborative filtering is that people who have
similar preferences in the past are likely to have similar
preferences in the future.
• In collaborative filtering, the algorithm analyzes the past behavior of
a large number of users and identifies similar users based on their
shared behavior.
• For example, if two users have watched and rated a similar set of
movies or products, the algorithm considers them similar users. Once
similar users are identified, their past behavior can be used to make
recommendations, on the assumption that the user will share the
preferences of those similar users.
Intuition
Figure: from the items a user likes, build item profiles and a user profile (e.g., red, circles, triangles), then match the user profile against item profiles to recommend new items.
An important problem
• Recommendation systems
• When a user buys an item (initially books) we want to
recommend other items that the user may like
• When a user rates a movie, we want to recommend
movies that the user may like
• When a user likes a song, we want to recommend other
songs that they may like
Someone who likes one of the Harry Potter (or Star Wars)
movies is likely to like the rest
• Same actors, similar story, same genre
Approach
• Map items into a feature space:
• For movies:
• Actors, directors, genre, rating, year,…
• Challenge: make all features compatible.
• For documents?
Two users are similar if they rate the same items in a similar way
Cosine Similarity:
To calculate the cosine similarity between A and B, we first compute the dot product of the
two vectors:
A·B = (1 * 4) + (2 * 5) + (3 * 6) = 32
Next, we calculate the magnitudes of the two vectors:
|A| = √(1² + 2² + 3²) = √14 ≈ 3.74
|B| = √(4² + 5² + 6²) = √77 ≈ 8.77
Cosine similarity = 32 / (√14 · √77) ≈ 0.97
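Finishing the computation in code; the vectors A = (1, 2, 3) and B = (4, 5, 6) are read off from the dot-product terms above:

```python
import math

A = [1, 2, 3]
B = [4, 5, 6]

dot = sum(a * b for a, b in zip(A, B))    # 1*4 + 2*5 + 3*6 = 32
mag_a = math.sqrt(sum(a * a for a in A))  # sqrt(14) ≈ 3.74
mag_b = math.sqrt(sum(b * b for b in B))  # sqrt(77) ≈ 8.77
print(dot, round(dot / (mag_a * mag_b), 3))  # 32 0.975
```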
• Advantage:
• for each user, only a small amount of computation is needed.
Item-Item Collaborative Filtering
• We can transpose (flip) the matrix and perform the same
computation as before to define similarity between items
• Intuition: Two items are similar if they are rated in the same way by
many users.
• Better defined similarity since it captures the notion of genre of an
item
• Users may have multiple interests.
• Algorithm: For each user c and item i
• Find the set D of most similar items to item i that have been rated by user
c.
• Aggregate their ratings to predict the rating for item i.
• Disadvantage: we need to consider each user-item pair separately
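A toy sketch of the item-item algorithm above. The ratings matrix and the convention that 0 marks an unrated item are made-up assumptions for illustration:

```python
import math

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny) if nx and ny else 0.0

# Rows are items, columns are users; 0 means "not rated" (made-up data).
ratings = {
    "item1": [4, 5, 0, 3],
    "item2": [5, 4, 1, 2],
    "item3": [1, 0, 5, 4],
}

def predict(user_idx, target, k=2):
    """Predict the user's rating for `target` as the similarity-weighted
    average over the k most similar items that the user has rated."""
    sims = [(cosine(ratings[target], row), row[user_idx])
            for item, row in ratings.items()
            if item != target and row[user_idx] > 0]
    top = sorted(sims, reverse=True)[:k]
    denom = sum(s for s, _ in top)
    return sum(s * r for s, r in top) / denom if denom else 0.0

print(round(predict(2, "item1"), 2))  # ≈ 2.07
```

A real system would typically also normalize ratings (e.g., subtract each item's mean) before computing similarities.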
Collaborative Filtering example
We can see that User 1 and User 2 give nearly similar ratings to Movie 1, so Movie 3 is also likely to get an average rating from User 1, while Movie 4 will be a good recommendation for User 2. We can also see users with different tastes: User 1 and User 3 are opposite to each other. Since User 3 and User 4 share a common interest in the movie, we can predict that Movie 4 will also be disliked by User 4. This is collaborative filtering: we recommend to users the items liked by users with similar interests.
Cosine Similarity
We can also use the cosine similarity between users to find users with similar interests: a larger cosine means a smaller angle between two users' rating vectors, and hence more similar interests.
SKETCHING AND LOCALITY SENSITIVE HASHING
Sketching and Locality Sensitive Hashing (LSH) are techniques used to
perform approximate nearest neighbor search in high-dimensional data.
The goal of these techniques is to find similar items or data points
efficiently, without having to compare each item or data point against all
others.
Sketching is a technique that involves representing high-dimensional data as low-dimensional sketches or summaries, while still preserving the essential
characteristics of the data. For example, in the case of text data, one could represent
each document as a set of words, and then represent each document as a sketch
that only contains the most frequently occurring words. This allows for efficient
comparison and similarity calculation between documents.
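A minimal MinHash sketch of this idea. The random linear hash functions over Python's built-in hash are an illustrative assumption, not a production scheme; the point is that the fraction of signature positions on which two sets agree estimates their Jaccard similarity.

```python
import random

def make_hashes(n, prime=2_147_483_647, seed=0):
    """n random hash functions of the form x -> (a*hash(x) + b) mod prime."""
    rng = random.Random(seed)
    params = [(rng.randrange(1, prime), rng.randrange(prime)) for _ in range(n)]
    return [lambda x, a=a, b=b: (a * hash(x) + b) % prime for a, b in params]

def minhash_signature(s, hash_funcs):
    """For each hash function, keep the minimum hash value over the set."""
    return [min(h(x) for x in s) for h in hash_funcs]

A = {"the", "cat", "sat", "on", "mat"}
B = {"the", "cat", "sat", "on", "a", "hat"}
hs = make_hashes(200)
sig_a = minhash_signature(A, hs)
sig_b = minhash_signature(B, hs)
estimate = sum(x == y for x, y in zip(sig_a, sig_b)) / len(hs)
print(round(estimate, 2))  # close to Jaccard(A, B) = 4/7 ≈ 0.57
```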
The LSH pipeline (figure):
• Document → the set of strings of length k that appear in the document.
• Sets → Signatures: short integer vectors that represent the sets, and reflect their similarity.
• Signatures → Locality-Sensitive Hashing → Candidate pairs: those pairs of signatures that we need to test for similarity.
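The step from signatures to candidate pairs is usually done by banding: split each signature into b bands of r rows and bucket each band; two documents become a candidate pair if they agree on any whole band. A sketch with made-up signatures:

```python
from collections import defaultdict

def lsh_candidates(signatures, bands, rows):
    """Pairs of documents whose signatures agree on at least one whole band."""
    candidates = set()
    for b in range(bands):
        buckets = defaultdict(list)
        for doc_id, sig in signatures.items():
            band = tuple(sig[b * rows:(b + 1) * rows])
            buckets[band].append(doc_id)
        for ids in buckets.values():
            for i in range(len(ids)):
                for j in range(i + 1, len(ids)):
                    candidates.add(tuple(sorted((ids[i], ids[j]))))
    return candidates

sigs = {
    "d1": [1, 2, 3, 4, 5, 6],
    "d2": [1, 2, 3, 9, 9, 9],  # agrees with d1 on the first band only
    "d3": [7, 7, 7, 7, 7, 7],
}
print(lsh_candidates(sigs, bands=2, rows=3))  # {('d1', 'd2')}
```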
Shingles
This allows you to find similar documents even when they are not
exact matches, making it useful for applications like
plagiarism detection and document retrieval.
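A minimal character-shingling sketch (the two example strings are made up), combined with Jaccard similarity over the resulting shingle sets:

```python
def shingles(text, k=3):
    """The set of all length-k substrings (character shingles) of a document."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def jaccard(s1, s2):
    return len(s1 & s2) / len(s1 | s2)

a = shingles("the quick brown fox")
b = shingles("the quick brown cat")
print(round(jaccard(a, b), 2))  # 0.7 — similar but not identical documents
```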