UNIT III
COLLABORATIVE FILTERING
A systematic approach, nearest-neighbour collaborative filtering (CF), user-based and item-based
CF, components of neighbourhood methods (rating normalization, similarity weight computation, and
neighbourhood selection).
Item-based Collaborative Filtering: The ratings of a group of similar items are used to make
recommendations for an item I. To predict the rating that a user U would give to I, we calculate the
weighted average of U's ratings of the k items most similar to I (its neighbors), where the weights are
determined by the similarity between I and each neighbor.
During this first phase, it is usual to precompute the similarity matrix to obtain good performance at
inference time. In the case of item-based models, an item-item similarity matrix is built by applying
the similarity metric to all pairs of items. Since the rating matrix is sparse, we only consider the set of
mutually rated pairs (users who rated both items) during the similarity computation. For instance, the
similarity between the items in columns 1 and 4 of the example rating matrix (not reproduced here) is
computed as the similarity between the vectors [4, 3, 5] and [5, 3, 4]. Due to the sparsity of the matrix,
a pair of items may have no co-ratings at all, resulting in an empty set; in that case, a similarity of 0 is
assigned to that pair. To improve computational efficiency, it is common to consider only the k
nearest neighbors of an item at inference time.
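To make this precomputation concrete, here is a minimal sketch in plain Python. It assumes ratings are stored as a dict mapping each item to its {user: rating} entries; the function name pearson_corated and the toy data (taken from the columns 1 and 4 example above) are illustrative, not from the original text.

# Toy slice of the rating matrix: item -> {user: rating}.
ratings = {
    "item1": {"u1": 4, "u2": 3, "u3": 5},
    "item4": {"u1": 5, "u2": 3, "u3": 4},
}

def pearson_corated(a, b):
    # Pearson similarity restricted to users who rated BOTH items.
    common = set(a) & set(b)
    if not common:
        return 0.0  # no co-ratings: assign 0 similarity
    xs = [a[u] for u in common]
    ys = [b[u] for u in common]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = (sum((x - mx) ** 2 for x in xs) ** 0.5
           * sum((y - my) ** 2 for y in ys) ** 0.5)
    return num / den if den else 0.0

# Offline: similarity for every item pair (here just one pair).
sims = {(i, j): pearson_corated(ratings[i], ratings[j])
        for i in ratings for j in ratings if i < j}
print(sims)  # {('item1', 'item4'): ...} -> similarity of about 0.5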
Let's say we want to predict how Madison rated the book Animal Farm, and we set k = 2 as the
number of nearest neighbors to consider during the calculation. To simplify the example, we manually
calculate only the similarities between the target item and the items from columns 2 and 4, because
they are this item's nearest neighbors. When calculating the mean rating during the similarity
computation, we consider only the set of ratings that the two items have in common (their co-ratings).
The figure below (not reproduced here) shows how the neighborhood is formed: the red circle is the
value we are trying to predict; the green squares are Madison's ratings that will be used to infer the
rating for the target item; the two other ratings, marked with an X, are not considered because k = 2.
The orange rectangles mark the set of co-ratings between the target item and the item from column 2,
while the blue rectangles mark the co-ratings between the target item and the item from column 4.
These are the common ratings between the target item (item 3) and its first neighbor (item 2):
[4, 3, 3] and [4, 4, 3]. The first step is to calculate the mean of each set:

mean(item 3) = (4 + 3 + 3) / 3 ≈ 3.33
mean(item 2) = (4 + 4 + 3) / 3 ≈ 3.67

The Pearson similarity formula centers the ratings by their means, so we transform the two vectors
into [0.67, -0.33, -0.33] and [0.33, 0.33, -0.67] and plug the results into the equation:

sim(3, 2) = Σ (r3 - 3.33)(r2 - 3.67) / ( sqrt(Σ (r3 - 3.33)²) · sqrt(Σ (r2 - 3.67)²) )
          ≈ 0.33 / (0.82 × 0.82) ≈ 0.5

The same calculation is done for the similarity between items 3 and 4, over their own set of
co-ratings (the full rating matrix is not reproduced here).

Next, we calculate the mean of each item, this time considering all of the item's ratings. Then we
plug those means, the two similarities, and Madison's ratings for items 2 and 4 (1 and 3,
respectively) into the mean-centered prediction equation:

prediction(Madison, item 3) = mean(item 3)
    + [ sim(3, 2) × (1 - mean(item 2)) + sim(3, 4) × (3 - mean(item 4)) ]
      / ( |sim(3, 2)| + |sim(3, 4)| )

Evaluating this expression yields the predicted rating.
Since ratings are discrete numbers, we round this value to 2. It’s important to note that in a real-world
setting, it’s often recommended to use neighborhood methods only when k is above a certain
threshold because, when the number of neighbors is small, the predictions are usually not precise. An
alternative would be to use Content-based filtering when we do not have enough data about the user-
item relationship.
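Putting the whole prediction step into code makes the procedure easier to follow. Below is a small illustrative sketch of the mean-centered weighted average used above; the function and variable names (predict_item_based, item_means) are hypothetical, not from the original text.

def predict_item_based(user_ratings, item_means, target, neighbors, sims):
    # user_ratings: the target user's known ratings, item -> rating
    # item_means: each item's mean over ALL of its ratings
    # neighbors: the k items most similar to `target` that the user has rated
    # sims: (item_a, item_b) -> precomputed similarity
    def sim(i, j):
        return sims.get((min(i, j), max(i, j)), 0.0)
    num = sum(sim(target, j) * (user_ratings[j] - item_means[j]) for j in neighbors)
    den = sum(abs(sim(target, j)) for j in neighbors)
    # Fall back to the item's own mean when the neighborhood carries no weight.
    return item_means[target] + num / den if den else item_means[target]

# Shape of the Madison example (the means and sims come from the full matrix):
# round(predict_item_based({"item2": 1, "item4": 3}, means, "item3",
#                          ["item2", "item4"], sims))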
Online and Offline Phases
Neighborhood-based methods separate the computations into two phases: offline, where the model is
fitted; and online, where inferences are made. In the offline phase, the user-user (or item-item)
similarity values are precomputed, and the k most similar users or items are predetermined. These pre-
computed values are leveraged to make fast predictions during the online phase.
Although good online performance is a major benefit of these methods, there is also a known
disadvantage: the similarity matrix can become huge depending on the number of users/items in the
system, so the offline phase does not scale well. In addition, because the methods do not adapt to
change on their own, the precomputed similarities and nearest neighbors must be updated to account
for new users and items, which makes the retraining process even more challenging.
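The two-phase split can be sketched as follows; this is an illustrative toy, with made-up similarity values and a hypothetical helper table (neighbor_table).

import heapq

# Toy precomputed similarities from an offline run (illustrative values).
sims = {("item1", "item3"): 0.5, ("item1", "item4"): 0.5, ("item3", "item4"): 0.2}
items = ["item1", "item3", "item4"]

def sim(i, j):
    return sims.get((min(i, j), max(i, j)), 0.0)

# Offline phase: rank every other item by similarity once, keep the top k.
neighbor_table = {i: heapq.nlargest(2, [j for j in items if j != i],
                                    key=lambda j, i=i: sim(i, j))
                  for i in items}

# Online phase: answering a request is now a dictionary lookup plus O(k) work,
# instead of a scan over all items.
print(neighbor_table["item3"])  # ['item1', 'item4']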
The Cold-Start Problem
As previously stated, the effectiveness of Neighborhood-based CF algorithms depends on user-item
interaction data. However, the user-item matrix is often sparse, with many missing values, as most
users only interact with a small fraction of the items. This sparsity can lead to inaccurate
recommendations due to the small neighborhood size.
This challenge is known as the cold-start problem, and it is most apparent when new users or items
enter the system. In this setting, the algorithm does not have enough data to form the nearest
neighbors, so the system cannot make useful recommendations.
Another important property of the user-item matrix is that the distribution of ratings among items
follows a long-tail pattern: a small subset of items receives a significant number of ratings and is
considered popular, while most items receive few or no ratings at all. As a result, it is difficult to
make precise predictions for items in the long tail with these methods, which can be a problem
because less-rated items may carry large profit margins, a point explored by Chris Anderson in his
book "The Long Tail". The same skew can also reduce the diversity of recommendations, because the
algorithm will usually recommend only popular items.
To address these limitations to some extent, alternative algorithms can be used, such as matrix
factorization and hybrid algorithms, which combine CF with content-based filtering methods.
User-based and Item-based CF
Collaborative Filtering (CF) is a popular technique in recommendation systems that helps predict a
user's preferences by leveraging the preferences of other users or items. There are two main types of
collaborative filtering: user-based and item-based.
1. User-Based Collaborative Filtering:
Idea: This approach relies on the assumption that users who have agreed in the past
tend to agree again in the future. In other words, it recommends items to a user based
on the preferences of users with similar tastes.
Workflow:
1. Similarity Calculation: Measure the similarity between users based on their
historical interactions or preferences. Common similarity metrics include
cosine similarity, Pearson correlation, or Jaccard similarity.
2. Neighborhood Selection: Identify a subset of users (neighborhood) who are
most similar to the target user.
3. Prediction: Predict the target user's preference for a particular item by
aggregating the preferences of the selected neighborhood, for example by
taking a similarity-weighted average of their ratings (a sketch of this
workflow follows the challenges list below).
Advantages:
Intuitive approach based on the idea of finding like-minded users.
Easily interpretable.
Challenges:
Cold-start problem for new users.
Scalability issues with a large user base.
Sparsity of data can lead to unreliable predictions.
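A minimal sketch of the three steps above, assuming the user-user similarities were precomputed offline; predict_user_based and the data layout are illustrative, not a fixed API.

import heapq

def predict_user_based(target, item, ratings_by_user, user_sims, k=10):
    # ratings_by_user: user -> {item: rating}
    # user_sims: (user_a, user_b) -> precomputed similarity
    def sim(u, v):
        return user_sims.get((min(u, v), max(u, v)), 0.0)
    # Steps 1 and 2: neighborhood = the k most similar users who rated the item.
    raters = [v for v in ratings_by_user
              if v != target and item in ratings_by_user[v]]
    hood = heapq.nlargest(k, raters, key=lambda v: sim(target, v))
    # Step 3: similarity-weighted average of the neighbors' ratings.
    num = sum(sim(target, v) * ratings_by_user[v][item] for v in hood)
    den = sum(abs(sim(target, v)) for v in hood)
    return num / den if den else None  # empty neighborhood: no prediction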
2. Item-Based Collaborative Filtering:
Idea: This approach focuses on the similarity between items rather than users. It
recommends items to a user based on the similarity between the items the user has
liked or interacted with in the past.
Workflow:
1. Similarity Calculation: Measure the similarity between items based on the
users who have interacted with them. Similarity metrics are typically the
same as those used in user-based CF.
2. Neighborhood Selection: Identify a subset of items that are most similar to
the target item.
3. Prediction: Predict the target user's preference for a particular item based on
their historical preferences for similar items. This is done by aggregating the
ratings of the selected neighborhood.
Advantages:
Item-item similarities are typically more stable over time than user-user
similarities, so the precomputed model needs less frequent updating.
Often scales better when the system has many more users than items.
Challenges:
Cold-start problem for new items.
Can reduce diversity, since recommended items tend to resemble items the
user has already consumed.
Components of Neighborhood Methods
1. Rating Normalization:
Users interpret the rating scale differently: some rate generously, others harshly. Rating
normalization transforms the raw ratings onto a common footing before similarities and predictions
are computed.
Mean Centering: Subtract the user's (or item's) mean rating from each raw rating.
Formula: Normalized Rating (r') = Rating (r) - Mean Rating of the User
Benefits:
1. Removes Rating Bias: Accounts for users who consistently rate higher or lower than
the average.
2. Comparability: Puts the ratings of different users (or items) on a common scale so
they can be compared directly.
3. Improved Model Performance: Normalized ratings provide a more consistent basis for
measuring similarity between users or items, leading to more accurate predictions.
Z-Score Normalization: In addition to mean centering, another normalization technique is Z-score
normalization. This involves scaling the ratings by the standard deviation of a user's (or item's)
ratings.
Formula: Normalized Rating (r') = (Rating (r) - Mean Rating of the User) / Standard
Deviation of Ratings of the User
When to Use:
Z-score normalization is particularly useful when dealing with users who have a wide range
of rating scales or exhibit extreme rating behaviors.
Example:
If a user tends to give ratings that are consistently higher or lower than the average, Z-score
normalization will scale those ratings based on how much they deviate from the mean.
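Both normalizations follow directly from the formulas above; this is a small illustrative sketch (the function names are hypothetical).

def mean_center(user_ratings):
    # r' = r - mean of the user's ratings
    mu = sum(user_ratings.values()) / len(user_ratings)
    return {item: r - mu for item, r in user_ratings.items()}

def z_score(user_ratings):
    # r' = (r - mean) / standard deviation of the user's ratings
    mu = sum(user_ratings.values()) / len(user_ratings)
    sd = (sum((r - mu) ** 2 for r in user_ratings.values())
          / len(user_ratings)) ** 0.5
    # A user who rates everything identically has sd = 0; map those ratings to 0.
    return {item: (r - mu) / sd if sd else 0.0
            for item, r in user_ratings.items()}

print(z_score({"a": 5, "b": 1, "c": 3}))  # {'a': 1.22..., 'b': -1.22..., 'c': 0.0}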
2. Similarity Weight Computation:
Once ratings are normalized, a similarity weight is computed between every pair of users (or items).
Common metrics:
1. Cosine Similarity:
Formula: sim(x, y) = (x · y) / (||x|| · ||y||)
Measures the cosine of the angle between two vectors representing user or item
preferences.
Appropriate for scenarios where the orientation of the preference vectors matters
more than their magnitude, which cosine similarity normalizes away.
2. Pearson Correlation:
Formula: sim(x, y) = Σ (xi - x̄)(yi - ȳ) / ( sqrt(Σ (xi - x̄)²) · sqrt(Σ (yi - ȳ)²) )
Measures the linear correlation between two vectors, centering each by its mean
before comparing direction and relative spread.
Suitable for situations where users or items differ in their rating scales, since the
centering removes this bias.
3. Jaccard Similarity:
Formula: J(A, B) = |A ∩ B| / |A ∪ B|
Measures the overlap between two sets of interactions, ignoring rating values.
Suitable for implicit or binary feedback, where only the presence of an interaction
matters.
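The three metrics side by side, as a short illustrative sketch; note that Pearson correlation is just cosine similarity applied to mean-centered vectors.

def cosine(x, y):
    # cos(x, y) = (x . y) / (||x|| * ||y||)
    dot = sum(a * b for a, b in zip(x, y))
    nx = sum(a * a for a in x) ** 0.5
    ny = sum(b * b for b in y) ** 0.5
    return dot / (nx * ny) if nx and ny else 0.0

def pearson(x, y):
    # Pearson correlation = cosine of the mean-centered vectors.
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return cosine([a - mx for a in x], [b - my for b in y])

def jaccard(a, b):
    # Overlap of two interaction sets; rating values are ignored.
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

print(pearson([4, 3, 3], [4, 4, 3]))  # ~0.5, matching the worked example above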
3. Neighborhood Selection:
Neighborhood selection is a critical step in collaborative filtering algorithms, where the goal is to
identify a subset of users or items (the neighborhood) that are most similar to the target user or item.
This selected subset is then used to make predictions or recommendations for the target user. The
neighborhood selection process involves deciding which users or items to include in the neighborhood
and, in some cases, setting a limit on the number of neighbors to consider.
Methods of Neighborhood Selection:
1. Top-N Neighbors:
Objective: Select the N most similar users or items based on the computed similarity
weights.
Process: Rank all potential neighbors based on their similarity weights and select the
top N for inclusion in the neighborhood.
Example: If N is set to 10, the top 10 most similar users to the target user form the
neighborhood for making recommendations.
Purpose: Focuses on the most similar entities, ensuring a balance between accuracy
and computational efficiency.
2. Threshold-Based Selection:
Objective: Include only those users or items with a similarity score above a certain
threshold.
Process: Set a similarity threshold, and include users or items in the neighborhood
only if their similarity score surpasses this threshold.
Example: If the similarity threshold is set to 0.8, only users or items with a similarity
score of 0.8 or higher are included in the neighborhood.
Purpose: Provides more flexibility in the size of the neighborhood and can be used to
filter out less relevant or less similar entities.
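Both selection strategies can live behind one small helper; a sketch with hypothetical names, where sims maps each candidate to its similarity with the target.

def select_neighbors(candidates, sims, n=None, threshold=None):
    # Rank candidates by similarity to the target, most similar first.
    ranked = sorted(candidates, key=lambda c: sims[c], reverse=True)
    if threshold is not None:
        # Threshold-based selection: keep only sufficiently similar entities.
        ranked = [c for c in ranked if sims[c] >= threshold]
    if n is not None:
        # Top-N selection: cap the neighborhood size.
        ranked = ranked[:n]
    return ranked

# Top-10 neighbors:              select_neighbors(users, sims, n=10)
# Similarity of 0.8 or higher:   select_neighbors(users, sims, threshold=0.8)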
Considerations in Neighborhood Selection:
1. Computational Complexity:
Selecting too many neighbors can lead to increased computational complexity,
especially in large datasets.
Balancing the number of neighbors to include is crucial for achieving a trade-off
between accuracy and efficiency.
2. Sparsity of Data:
In sparse datasets, where users have only interacted with a small fraction of items, it
may be challenging to find a sufficient number of neighbors.
Threshold-based methods can be useful in such scenarios.
3. Impact on Cold Start:
Neighborhood selection methods should account for the "cold start" problem, where
new users or items have limited interaction history.
In such cases, alternative methods like content-based recommendations or hybrid
models may be employed.
Example:
Let's consider a user-based collaborative filtering scenario. If User A is the target user, the
neighborhood selection process may involve computing similarity weights with all other users and
selecting the top 10 most similar users as neighbors for User A.
Purpose of Neighborhood Selection:
1. Relevance: Ensures that the selected neighbors are the most relevant and similar entities to
the target user or item.
2. Computational Efficiency: Manages computational complexity by limiting the number of
neighbors, balancing accuracy with efficiency.
3. Personalization: Allows the recommendation system to tailor recommendations based on the
preferences of a select group of similar users or items.