16: Recommender Systems
If we have features like these, each film can be represented by a feature vector
Add an extra feature which is x0 = 1 for each film
So for each film we have a [3 x 1] vector; for film number 1 ("Love at Last") this would be something like x(1) = [1, 0.9, 0]T (the bias term, a high romance score, and no action)
Create some parameters which, when applied to the film's features, give predicted ratings close to those seen in the data
Sum over all values of i where r(i,j) = 1 (i.e. all the films that user j has rated)
This is just like linear regression with least-squares error
We can also add a regularization term to make our equation look as follows
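In standard notation (where mj is the number of movies rated by user j), this objective is:

$$
\min_{\theta^{(j)}} \; \frac{1}{2m^{(j)}} \sum_{i:r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2m^{(j)}} \sum_{k=1}^{n} \left( \theta_k^{(j)} \right)^2
$$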
The regularization term runs from k = 1 through to n, so (θj) ends up being an (n+1)-dimensional vector
Don't regularize over the bias term θ0
Minimizing this gives you a reasonable value of (θj) for user j
We're rushing through this a bit, but it's just a linear regression problem
To make this a little bit clearer you can get rid of the mj term (it's just a constant, so removing it shouldn't make any difference to the minimization)
So to learn (θj) we minimize:
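In standard notation (with the mj term dropped):

$$
\min_{\theta^{(j)}} \; \frac{1}{2} \sum_{i:r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{k=1}^{n} \left( \theta_k^{(j)} \right)^2
$$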
But for our recommender system we want to learn parameters for all users, so we add an extra summation over users, which means we determine the minimizing (θj) for every user
Minimizing this as a function of each (θj) parameter vector gives you the parameters for each user
So this is our optimization objective -> J(θ(1), ..., θ(nu))
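Written out in standard notation:

$$
J(\theta^{(1)}, \ldots, \theta^{(n_u)}) = \frac{1}{2} \sum_{j=1}^{n_u} \sum_{i:r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right)^2 + \frac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^{n} \left( \theta_k^{(j)} \right)^2
$$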
In order to do the minimization we have the following gradient descent
Slightly different to our previous gradient descent implementations
There are separate k = 0 and k != 0 versions of the update (no regularization on the bias term)
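In standard notation the updates are:

$$
\theta_k^{(j)} := \theta_k^{(j)} - \alpha \sum_{i:r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right) x_k^{(i)} \qquad (k = 0)
$$

$$
\theta_k^{(j)} := \theta_k^{(j)} - \alpha \left( \sum_{i:r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right) x_k^{(i)} + \lambda \theta_k^{(j)} \right) \qquad (k \neq 0)
$$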
We can define the middle term above (the summation) as the partial derivative of the cost function with respect to θk(j)
Using that same approach we should then be able to determine the remaining feature vectors for the other films
If we're given the users' preferences (their θ vectors) we can use them to work out the films' features
Algorithm Structure
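Following the standard collaborative filtering algorithm, the first two steps are:
1) Initialize x(1), ..., x(nm) and θ(1), ..., θ(nu) to small random values
2) Minimize J(x(1), ..., x(nm), θ(1), ..., θ(nu)) using gradient descent (or an advanced optimization algorithm); for every j = 1, ..., nu and i = 1, ..., nm the updates are

$$
x_k^{(i)} := x_k^{(i)} - \alpha \left( \sum_{j:r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right) \theta_k^{(j)} + \lambda x_k^{(i)} \right)
$$

$$
\theta_k^{(j)} := \theta_k^{(j)} - \alpha \left( \sum_{i:r(i,j)=1} \left( (\theta^{(j)})^\top x^{(i)} - y^{(i,j)} \right) x_k^{(i)} + \lambda \theta_k^{(j)} \right)
$$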
Where the top term is the partial derivative of the cost function with respect to xk(i), while the bottom is the partial derivative of the cost function with respect to θk(j)
So here we regularize EVERY parameter (there is no longer an x0 = 1 feature), so there's no special case update rule
3) Having minimized the values, given a user (user j) with parameters θ and a movie (movie i) with learned features x, we predict a star rating of (θj)T xi
This is the collaborative filtering algorithm, which should give pretty good predictions for how users like new movies
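As a minimal numpy sketch of the vectorized cost and gradients (the function name and matrix layout here are illustrative, not from the course):

```python
import numpy as np

def cofi_cost_and_grads(X, Theta, Y, R, lam):
    """Collaborative filtering cost and gradients.

    X     : (num_movies, n) matrix of movie features
    Theta : (num_users, n) matrix of per-user parameters
    Y     : (num_movies, num_users) matrix of ratings
    R     : (num_movies, num_users) binary matrix, R[i, j] = 1 iff
            movie i was rated by user j
    lam   : regularization parameter lambda
    """
    # Prediction error, zeroed out wherever no rating exists
    E = (X @ Theta.T - Y) * R

    # Squared-error cost plus regularization on ALL parameters
    J = 0.5 * np.sum(E ** 2) + (lam / 2) * (np.sum(X ** 2) + np.sum(Theta ** 2))

    # Gradients w.r.t. movie features and user parameters
    X_grad = E @ Theta + lam * X
    Theta_grad = E.T @ X + lam * Theta
    return J, X_grad, Theta_grad
```

A plain gradient descent step is then just X -= alpha * X_grad and Theta -= alpha * Theta_grad, repeated until convergence.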
Vectorization: Low rank matrix factorization
Having looked at the collaborative filtering algorithm, how can we improve on it?
Given one product, can we determine other relevant products?
We start by working out another way of writing out our predictions
So take all ratings by all users in our example above and group into a matrix Y
5 movies
4 users
Get a [5 x 4] matrix
Given [Y] there's another way of writing out all the predicted ratings
With this matrix of predicted ratings
We determine the (i,j) entry for EVERY movie-user pair
We can define another matrix X
Just like the matrix we had for linear regression
Take all the features for each movie and stack them in rows
Think of each movie as one example
Also define a matrix Θ
Take each per user parameter vector and stack in rows
Given our new matrices X and Θ
We can compute the full matrix of predicted ratings in one vectorized step as X * ΘT
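A quick numpy illustration, using the [5 x 4] example above (random values, purely to show the shapes):

```python
import numpy as np

num_movies, num_users, n = 5, 4, 3
X = np.random.randn(num_movies, n)      # movie features stacked in rows
Theta = np.random.randn(num_users, n)   # per-user parameters stacked in rows

Y_pred = X @ Theta.T                    # [5 x 4] matrix of predicted ratings
```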
We can give this algorithm another name - low rank matrix factorization
This comes from the property in linear algebra that the matrix produced by the X * ΘT calculation is a low rank matrix
Don't worry about what a low rank matrix is
Finally, having run the collaborative filtering algorithm, we can use the learned features to find related films
When you learn a set of features you don't know in advance what those features will be - the algorithm itself identifies the features which define a film
Say we learn the following features
x1 - romance
x2 - action
x3 - comedy
x4 - ...
So we have n features all together
After you've learned the features it's often very hard to come in and apply a human-understandable label to what those features are
Usually, though, we learn features which are very meaningful for understanding what users like
Say you have movie i
Find a movie j which is similar to i, which you can then recommend
Our features allow a good way to measure movie similarity
If we have two movies with feature vectors xi and xj
We want to find the movie j that minimizes ||xi - xj||
i.e. the distance between those two movies
Provides a good indicator of how similar two films are in the sense of user perception
NB - Maybe ONLY in terms of user perception
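A minimal sketch of finding the k movies closest to movie i under this distance (the function name is illustrative):

```python
import numpy as np

def most_similar(X, i, k=5):
    """Indices of the k movies whose learned feature vectors are
    closest (Euclidean distance) to those of movie i."""
    dists = np.linalg.norm(X - X[i], axis=1)  # ||x(i) - x(j)|| for every j
    dists[i] = np.inf                         # exclude movie i itself
    return np.argsort(dists)[:k]
```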
Implementation Detail: Mean Normalization
First we compute the average rating each movie obtained and store it in an nm-dimensional column vector μ
If we look at all the movie ratings in [Y] we can subtract off the mean rating
Means we normalize each film to have an average rating of 0
Now, we take the new set of ratings and use it with the collaborative filtering algorithm
Learn θj and xi from the mean normalized ratings
For our prediction of user j on movie i, predict
(θj)T xi + μi
Where θj and xi are learned from the mean normalized ratings
We have to add μi back because we subtracted it from the ratings during normalization
So for user 5 (who has rated nothing) the same argument applies: θ5 is learned as all zeros, so the prediction (θ5)T xi + μi is just μi - the movie's average rating
As an aside - we spoke here about mean normalization for users with no ratings
If you have some movies with no ratings you can also play with versions of the algorithm where you normalize the columns
BUT this is probably less relevant - probably shouldn't recommend an unrated movie
To summarize, this shows how you do mean normalization preprocessing to allow your system to deal with users who have not yet made any ratings
It means we recommend the best average-rated products to a user we know little about
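A minimal numpy sketch of the preprocessing and prediction step (same matrix layout as the earlier sketch; illustrative only):

```python
import numpy as np

def normalize_ratings(Y, R):
    """Subtract each movie's mean rating, computed over rated entries only."""
    mu = np.sum(Y * R, axis=1) / np.maximum(np.sum(R, axis=1), 1)
    Y_norm = (Y - mu[:, None]) * R   # unrated entries stay at 0
    return Y_norm, mu

# After learning X and Theta on Y_norm, predict for user j on movie i:
#   rating = X[i] @ Theta[j] + mu[i]
# A user with no ratings has Theta[j] ~ 0, so the prediction is just mu[i].
```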