
User-Based Collaborative Filtering
Lecturer: ZK Abdurahman Baizal

Source: Dietmar Jannach et al., 2010, Recommender Systems: An Introduction
Collaborative Filtering (CF)
• The most prominent approach to generate recommendations
• used by large, commercial e-commerce sites
• well-understood, various algorithms and variations exist
• applicable in many domains (books, movies, DVDs, ...)
• Approach
• use the "wisdom of the crowd" to recommend items
• Basic assumption and idea
• Users give ratings to catalog items (implicitly or explicitly)
• Customers who had similar tastes in the past will have similar tastes in the future
User-based nearest-neighbor collaborative
filtering
• The basic technique:
• Given an "active user" (Alice) and an item “i” not yet seen by Alice
• The goal is to estimate Alice's rating for this item, e.g., by
• find a set of users (peers) who liked the same items as Alice in the past and who have
rated item i
• use, e.g. the average of their ratings to predict, if Alice will like item i
• do this for all items Alice has not seen and recommend the best-rated
Item1 Item2 Item3 Item4 Item5
Alice 5 3 4 4 ?
User1 3 1 2 3 3
User2 4 3 4 3 5
User3 3 3 1 5 4
User4 1 5 5 2 1
User-based nearest-neighbor collaborative
filtering
• Some first questions
• How do we measure similarity?
• How many neighbors should we consider?
• How do we generate a prediction from the neighbors' ratings?
Item1 Item2 Item3 Item4 Item5
Alice 5 3 4 4 ?
User1 3 1 2 3 3
User2 4 3 4 3 5
User3 3 3 1 5 4
User4 1 5 5 2 1
Measuring user similarity
• A popular similarity measure in user-based CF: Pearson correlation

\[
sim(a,b) = \frac{\sum_{p \in P} (r_{a,p} - \bar{r}_a)(r_{b,p} - \bar{r}_b)}{\sqrt{\sum_{p \in P} (r_{a,p} - \bar{r}_a)^2}\;\sqrt{\sum_{p \in P} (r_{b,p} - \bar{r}_b)^2}}
\]

a, b : users
r_{a,p} : rating of user a for item p
P : set of items rated by both a and b
\bar{r}_a, \bar{r}_b : the users' average ratings
Possible similarity values range between -1 and 1.

Item1 Item2 Item3 Item4 Item5
Alice 5 3 4 4 ?
User1 3 1 2 3 3    sim(a,1) = 0.85
User2 4 3 4 3 5    sim(a,2) = 0.70
User3 3 3 1 5 4    sim(a,3) = 0.00
User4 1 5 5 2 1    sim(a,4) = -0.79
Measuring user similarity

The similarity of Alice to User1 is thus (taking the averages over the co-rated items, \bar{r}_{Alice} = 4 and \bar{r}_{User1} = 2.25):

\[
sim(Alice, User1) = \frac{1 \cdot 0.75 + (-1) \cdot (-1.25) + 0 \cdot (-0.25) + 0 \cdot 0.75}{\sqrt{1^2 + (-1)^2}\;\sqrt{0.75^2 + 1.25^2 + 0.25^2 + 0.75^2}} = \frac{2}{\sqrt{2}\,\sqrt{2.75}} \approx 0.85
\]

Based on these calculations, we observe that User1 and User2 were rather similar to Alice in their rating behavior in the past.
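As a quick numerical check (a minimal sketch of my own, not part of the slides), standard Pearson correlation over the co-rated items Item1..Item4 reproduces the similarity values in the table:

```python
import numpy as np

# Ratings on the items Alice has rated (Item1..Item4), taken from the table
alice = np.array([5, 3, 4, 4])
others = {
    "User1": np.array([3, 1, 2, 3]),
    "User2": np.array([4, 3, 4, 3]),
    "User3": np.array([3, 3, 1, 5]),
    "User4": np.array([1, 5, 5, 2]),
}

for name, ratings in others.items():
    # np.corrcoef returns the 2x2 correlation matrix; the off-diagonal
    # entry is the Pearson correlation between the two rating vectors
    print(name, round(np.corrcoef(alice, ratings)[0, 1], 2))
# Output: 0.85, 0.71, 0.0, -0.79 (the table rounds the second value to 0.70)
```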
Pearson correlation
• Takes differences in rating behavior into account

[Figure: ratings of Alice, User1, and User4 plotted across Item1–Item4 (y-axis: Ratings, 0–6), illustrating their different rating behavior]

• Works well in usual domains, compared with alternative measures such as cosine similarity
Implementation in Python
Implementation in Python
Compute the average rating of each user and build a new matrix whose entries are the difference between the original rating and that user's average (mean-centered ratings).
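A minimal sketch of what this step might look like (the rating matrix layout and the use of 0 for missing ratings are my assumptions; the original code from the slides is not reproduced here):

```python
import numpy as np
import pandas as pd

# Toy rating matrix from the earlier table; 0 marks a missing rating
ratings = pd.DataFrame(
    [[5, 3, 4, 4, 0],
     [3, 1, 2, 3, 3],
     [4, 3, 4, 3, 5],
     [3, 3, 1, 5, 4],
     [1, 5, 5, 2, 1]],
    index=["Alice", "User1", "User2", "User3", "User4"],
    columns=["Item1", "Item2", "Item3", "Item4", "Item5"],
)

# Average rating per user, ignoring the missing (0) entries
user_mean = ratings.replace(0, np.nan).mean(axis=1)

# Mean-centered matrix: original rating minus the user's average,
# with missing entries kept at 0 so they do not affect similarities
centered = ratings.replace(0, np.nan).sub(user_mean, axis=0).fillna(0)
print(centered)
```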
Implementation in Python
To compute the similarity, we can use the cosine_similarity function from the sklearn library. Here we write a function whose parameters are the rating matrix, the active user (the one whose missing rating we want to predict), and k (the number of neighbors).
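A sketch of such a function, building on the `centered` DataFrame from the previous sketch (the function name and details are mine, not taken from the slides). On mean-centered ratings, cosine similarity behaves like a Pearson-style correlation, since missing entries stored as 0 contribute nothing to the dot product:

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

def top_k_neighbors(centered, active_user, k):
    """Return the k users most similar to `active_user`.

    centered    : mean-centered rating DataFrame (users x items)
    active_user : index label of the active user, e.g. "Alice"
    k           : number of neighbors to keep
    """
    sims = cosine_similarity(centered)                  # users x users
    sims = pd.DataFrame(sims, index=centered.index, columns=centered.index)
    # Drop the active user itself, then keep the k highest similarities
    return sims[active_user].drop(active_user).nlargest(k)

print(top_k_neighbors(centered, "Alice", k=2))
```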
Making predictions
• A common prediction function:

\[
pred(a,p) = \bar{r}_a + \frac{\sum_{b \in N} sim(a,b) \cdot (r_{b,p} - \bar{r}_b)}{\sum_{b \in N} sim(a,b)}
\]

• Calculate whether the neighbors' ratings for the unseen item i are higher or lower than their average
• Combine the rating differences, using the similarity as a weight
• Add or subtract the neighbors' bias from the active user's average and use this as the prediction
Making predictions

In the example, the prediction for Alice's rating for Item5, based on the ratings of the near neighbors User1 and User2, can be computed as shown below.
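With Alice's average rating of 4.0, the neighbor averages \bar{r}_{User1} = 2.4 and \bar{r}_{User2} = 3.8, and the similarities 0.85 and 0.70 computed earlier, the prediction formula gives:

\[
pred(\text{Alice}, \text{Item5}) = 4 + \frac{0.85 \cdot (3 - 2.4) + 0.70 \cdot (5 - 3.8)}{0.85 + 0.70} = 4 + \frac{1.35}{1.55} \approx 4.87
\]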

In real-world applications, rating databases are much larger and can comprise thousands or even
millions of users and items, which means that we must think about computational complexity.

In addition, the rating matrix is typically very sparse, meaning that every user will rate only a very
small subset of the available items.
Neighborhood selection
In the example above, we intuitively decided not to take all neighbors into account (neighborhood selection).

For the calculation of the predictions, we included only those that had a positive
correlation with the active user (and, of course, had rated the item for which we are
looking for a prediction).

If we included all users in the neighborhood, this would not only negatively
influence the performance with respect to the required calculation time, but it
would also have an effect on the accuracy of the recommendation, as the ratings of
other users who are not really comparable would be taken into account
Neighborhood selection

The common techniques for reducing the size of the neighborhood are to define a specific minimum
threshold of user similarity or to limit the size to a fixed number and to take only the k nearest
neighbors into account

if the similarity threshold is too high, the size of the neighborhood will be very small for many users,
which in turn means that for many items no predictions can be made (reduced coverage).

In contrast, when the threshold is too low, the neighborhood sizes are not significantly reduced.
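To illustrate the two reduction techniques just described, here is a small sketch of my own (the similarity values are the ones computed for Alice above):

```python
import pandas as pd

# Similarities of the other users to the active user (Alice), from the table
sims = pd.Series({"User1": 0.85, "User2": 0.70, "User3": 0.00, "User4": -0.79})

# Technique 1: minimum similarity threshold
threshold = 0.5
by_threshold = sims[sims > threshold]   # keeps User1, User2

# Technique 2: fixed neighborhood size, take only the k nearest neighbors
k = 2
by_top_k = sims.nlargest(k)             # keeps User1, User2

print(by_threshold.index.tolist(), by_top_k.index.tolist())
```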
Neighborhood selection

The value chosen for k – the size of the neighborhood – does not influence
coverage. However, the problem of finding a good value for k still exists:

When the number of neighbors k taken into account is too high, too many
neighbors with limited similarity bring additional “noise” into the predictions.

When k is too small (for example, below 10 in the experiments), the quality of the predictions may be negatively affected. An analysis of the MovieLens dataset indicates that "in most real-world situations, a neighborhood of 20 to 50 neighbors seems reasonable".
Implementation in Python

For homework !!!
Improving the metrics / prediction function
• Not all neighbor ratings might be equally "valuable"
• Agreement on commonly liked items is not as informative as agreement on controversial items
• Possible solution: Give more weight to items that have a higher variance
• Take the number of co-rated items into account
• Use "significance weighting", e.g., by linearly reducing the weight when the number of co-rated items is low
• Case amplification
• Intuition: give more weight to "very similar" neighbors, i.e., where the similarity value is close to 1 (both adjustments are sketched in code after this list)
• Neighborhood selection
• Use similarity threshold or fixed number of neighbors
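A rough sketch of my own of how a raw similarity could be adjusted along these lines; the constants 50 and 2.5 are commonly cited values from the literature, not from the slides:

```python
def adjusted_similarity(sim, n_corated, gamma=50, rho=2.5):
    """Adjust a raw Pearson similarity before using it as a weight.

    sim       : raw similarity between the active user and a neighbor
    n_corated : number of items rated by both users
    gamma     : significance-weighting cutoff; the weight is reduced linearly
                when fewer than gamma items are co-rated
    rho       : case-amplification exponent; values > 1 emphasize neighbors
                whose similarity is close to 1
    """
    sim = sim * min(n_corated, gamma) / gamma   # significance weighting
    sim = sim * abs(sim) ** (rho - 1)           # case amplification
    return sim

print(adjusted_similarity(0.85, n_corated=4))   # strongly discounted: only 4 co-rated items
print(adjusted_similarity(0.85, n_corated=60))  # only case amplification has an effect
```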
Memory-based and model-based approaches
• User-based CF is said to be "memory-based"
• the rating matrix is directly used to find neighbors / make predictions
• does not scale for most real-world scenarios
• large e-commerce sites have tens of millions of customers and millions of
items
• Model-based approaches
• based on an offline pre-processing or "model-learning" phase
• at run-time, only the learned model is used to make predictions
• models are updated / re-trained periodically
• large variety of techniques used
• model-building and updating can be computationally expensive
2001: Item-based collaborative filtering recommendation algorithms, B. Sarwar et al., WWW 2001

• Scalability issues arise with U2U CF if there are many more users than items (m >> n, m = |users|, n = |items|)
• e.g., amazon.com
• Space complexity O(m²) when pre-computed
• Time complexity for computing Pearson O(m²n)

• High sparsity leads to few common ratings between two users

• Basic idea: "Item-based CF exploits relationships between items first, instead of relationships between users" (see the sketch below)
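To make the idea concrete, here is a minimal sketch of my own (not the full Sarwar et al. algorithm), reusing the mean-centered `centered` DataFrame from the earlier sketches; cosine similarity between mean-centered item columns corresponds to the "adjusted cosine" measure commonly used in item-based CF:

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Compare items to each other instead of users: transpose the users x items
# matrix so that rows are items, then compute item-to-item similarities
item_sims = pd.DataFrame(
    cosine_similarity(centered.T),     # items x items similarity matrix
    index=centered.columns,
    columns=centered.columns,
)

# Items most similar to Item5, ranked by adjusted-cosine similarity
print(item_sims["Item5"].drop("Item5").sort_values(ascending=False))
```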
Collaborative Filtering Issues
• Pros:
• well-understood, works well in some domains, no knowledge engineering required

• Cons:
• requires user community, sparsity problems, no integration of other knowledge sources, no explanation of results

• What is the best CF method?


• In which situation and which domain? Inconsistent findings; always the same domains and data sets; differences
between methods are often very small (1/100)

• How to evaluate the prediction quality?


• MAE / RMSE: What does an MAE of 0.7 actually mean?
• Serendipity: Not yet fully understood

• What about multi-dimensional ratings?
