M02 User-Based Collaborative Filtering V02
Lecturer: ZK Abdurahman Baizal
a, b : users
r_{a,p} : rating of user a for item p
P : set of items rated by both a and b
r̄_a, r̄_b : the users' average ratings
sim(a, b) = Σ_{p∈P} (r_{a,p} − r̄_a)(r_{b,p} − r̄_b) / ( √(Σ_{p∈P} (r_{a,p} − r̄_a)²) · √(Σ_{p∈P} (r_{b,p} − r̄_b)²) )
Possible similarity values lie between -1 and 1.
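A minimal Python sketch of this similarity; the dict-of-ratings representation (item → rating) is an assumption for illustration, not from the slides:

```python
from math import sqrt

def pearson_sim(ratings_a, ratings_b):
    """Pearson correlation between two users over the set P of
    items rated by both; each argument maps item -> rating."""
    common = set(ratings_a) & set(ratings_b)  # the set P
    if not common:
        return 0.0
    # r̄_a and r̄_b: each user's average over their own ratings
    avg_a = sum(ratings_a.values()) / len(ratings_a)
    avg_b = sum(ratings_b.values()) / len(ratings_b)
    num = sum((ratings_a[p] - avg_a) * (ratings_b[p] - avg_b) for p in common)
    den = (sqrt(sum((ratings_a[p] - avg_a) ** 2 for p in common))
           * sqrt(sum((ratings_b[p] - avg_b) ** 2 for p in common)))
    return num / den if den else 0.0

# Two users whose ratings differ only by a constant offset correlate perfectly:
alice = {"Item1": 5, "Item2": 3, "Item3": 4, "Item4": 4}
user1 = {"Item1": 4, "Item2": 2, "Item3": 3, "Item4": 3}
print(round(pearson_sim(alice, user1), 4))  # → 1.0
```

Note that the averages are taken over each user's full rating profile, while the sums run only over the co-rated items P.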
Based on these calculations, we observe that User1 and User2 were similar to Alice in their past
rating behavior.
Pearson correlation
• Takes differences in rating behavior into account
[Figure: bar chart comparing the ratings of Alice, User1, and User4 for Item1–Item4]
• Calculate, whether the neighbors' ratings for the unseen item i are
higher or lower than their average
• Combine the rating differences – use the similarity as a weight
• Add/subtract the neighbors' bias from the active user's average and
use this as a prediction
Making predictions
In the example, the prediction for Alice's rating for Item5, based on the ratings of the near
neighbors User1 and User2, is
pred(a, p) = r̄_a + Σ_{b∈N} sim(a, b) · (r_{b,p} − r̄_b) / Σ_{b∈N} sim(a, b)
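The three prediction steps can be sketched as follows; the similarities (0.85, 0.70) and neighbor averages (2.4, 3.8) follow the well-known textbook example and may differ from the exact values on the slides:

```python
def predict(avg_active, neighbors):
    """pred(a, p) = r̄_a + Σ_b sim(a,b)·(r_{b,p} − r̄_b) / Σ_b sim(a,b)
    neighbors: list of (similarity, neighbor_rating_for_p, neighbor_avg)."""
    weighted = sum(sim * (r_bp - avg_b) for sim, r_bp, avg_b in neighbors)
    total_sim = sum(sim for sim, _, _ in neighbors)
    if total_sim == 0:
        return avg_active  # no usable neighbors: fall back to the user's mean
    return avg_active + weighted / total_sim

# Alice (average 4.0) and her two nearest neighbors' ratings for Item5:
neighbors = [(0.85, 3, 2.4), (0.70, 5, 3.8)]
print(round(predict(4.0, neighbors), 2))  # → 4.87
```

Each neighbor contributes the deviation of its Item5 rating from its own mean, weighted by its similarity to Alice; the result is then added to Alice's mean.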
In real-world applications, rating databases are much larger and can comprise thousands or even
millions of users and items, which means that we must think about computational complexity.
In addition, the rating matrix is typically very sparse, meaning that every user will rate only a very
small subset of the available items.
Neighborhood selection
In the example, we intuitively decided not to take all neighbors into account (neighborhood
selection).
For the calculation of the predictions, we included only those that had a positive
correlation with the active user (and, of course, had rated the item for which we are
looking for a prediction).
If we included all users in the neighborhood, this would not only negatively
influence the performance with respect to the required calculation time, but it
would also have an effect on the accuracy of the recommendation, as the ratings of
other users who are not really comparable would be taken into account.
Neighborhood selection
The common techniques for reducing the size of the neighborhood are to define a specific minimum
threshold of user similarity or to limit the size to a fixed number and to take only the k nearest
neighbors into account.
If the similarity threshold is too high, the size of the neighborhood will be very small for many users,
which in turn means that for many items no predictions can be made (reduced coverage).
In contrast, when the threshold is too low, the neighborhood sizes are not significantly reduced.
Neighborhood selection
The value chosen for k – the size of the neighborhood – does not influence
coverage. However, the problem of finding a good value for k still exists:
When the number of neighbors k taken into account is too high, too many
neighbors with limited similarity bring additional “noise” into the predictions.
When k is too small (for example, below 10 in the experiments), the quality of
the predictions may be negatively affected. An analysis of the MovieLens
dataset indicates that “in most real-world situations, a neighborhood of 20 to 50
neighbors seems reasonable”
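Both techniques can be combined in a few lines; the threshold of 0 and the value of k below are illustrative choices, not prescriptions from the slides:

```python
def select_neighbors(sims, k=30, min_sim=0.0):
    """Keep only users whose similarity exceeds min_sim, then
    return the k most similar ones (sims: user -> similarity)."""
    candidates = [(user, s) for user, s in sims.items() if s > min_sim]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates[:k]

sims = {"User1": 0.85, "User2": 0.70, "User3": -0.79, "User4": 0.00}
print(select_neighbors(sims, k=2))  # → [('User1', 0.85), ('User2', 0.7)]
```

With min_sim = 0, negatively correlated users such as User3 are excluded before the k nearest neighbors are taken.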
Implementation in Python
For homework!!!
Improving the metrics / prediction function
• Not all neighbor ratings might be equally "valuable"
• Agreement on commonly liked items is not as informative as agreement on
controversial items
• Possible solution: Give more weight to items that have a higher variance
• The number of co-rated items also matters
• Use "significance weighting", by e.g., linearly reducing the weight when the number
of co-rated items is low
• Case amplification
• Intuition: Give more weight to "very similar" neighbors, i.e., where the similarity
value is close to 1.
• Neighborhood selection
• Use similarity threshold or fixed number of neighbors
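The first two ideas above can be sketched as separate helpers; gamma = 50 and rho = 2.5 are illustrative parameter choices, not values from the slides:

```python
def significance_weight(sim, n_corated, gamma=50):
    """Significance weighting: linearly damp the similarity when
    fewer than gamma items were co-rated by the two users."""
    return sim * min(n_corated, gamma) / gamma

def case_amplify(sim, rho=2.5):
    """Case amplification: raise |sim| to the power rho (keeping the
    sign) so that similarities close to 1 dominate the prediction."""
    sign = 1.0 if sim >= 0 else -1.0
    return sign * abs(sim) ** rho

print(significance_weight(0.8, 25))        # → 0.4
print(round(case_amplify(0.5, rho=2), 2))  # → 0.25
```

A similarity of 0.8 based on only 25 co-rated items is halved, while a moderate similarity of 0.5 is pushed down by amplification, so near-1 neighbors stand out.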
Memory-based and model-based approaches
• User-based CF is said to be "memory-based"
• the rating matrix is directly used to find neighbors / make predictions
• does not scale for most real-world scenarios
• large e-commerce sites have tens of millions of customers and millions of
items
• Model-based approaches
• based on an offline pre-processing or "model-learning" phase
• at run-time, only the learned model is used to make predictions
• models are updated / re-trained periodically
• large variety of techniques used
• model-building and updating can be computationally expensive
2001: Item-based collaborative filtering recommendation algorithms, B.
Sarwar et al., WWW 2001
• Scalability issues arise with U2U if many more users than items
(m >> n , m = |users|, n = |items|)
• e.g. amazon.com
• Space complexity O(m²) when similarities are pre-computed
• Time complexity for computing Pearson correlations: O(m²·n)
• Cons:
• requires a user community
• sparsity problems
• no integration of other knowledge sources
• no explanation of results