
So let's take a look at how we can

develop a recommender system if we had features of each item, or


features of each movie. So here's the same data set that we
had previously with the four users having rated some but
not all of the five movies. What if we additionally have
features of the movies? So here I've added two features X1 and
X2, that tell us how much each of these is a romance movie, and
how much each of these is an action movie. So for example Love at Last
is a very romantic movie, so this feature takes on 0.9, but
it's not at all an action movie. So this feature takes on 0. But it turns out
Nonstop Car Chases has
just a little bit of romance in it. So it's 0.1, but it has a ton of action. So
that feature takes on the value of 1.0. So you recall that I had used the notation
nu to denote the number of users, which is 4 and m to denote
the number of movies which is 5. I'm going to also introduce n to denote
the number of features we have here. And so n=2, because we have two
features X1 and X2 for each movie. With these features we have for
example that the features for movie one, that is the movie Love at Last,
would be 0.9 and 0. And the features for the third movie Cute Puppies of Love would be
0.99 and 0. And let's start by taking a look at
how we might make predictions for Alice's movie ratings. So for user one, that is
Alice, let's say we predict the rating for movie i as w.X(i)+b. So this is just a
lot
like linear regression. For example if we end up choosing
the parameter w(1)=[5,0] and say b(1)=0, then the prediction for movie three where
the features are 0.99 and 0, which is just copied from here,
first feature 0.99, second feature 0. Our prediction would be w.X(3)+b=0.99 times 5
plus 0 times 0, which turns out to be equal to 4.95. And this rating seems
pretty plausible. It looks like Alice has given high ratings
to Love at Last and Romance Forever, to two highly romantic movies, but
given low ratings to the action movies, Nonstop Car Chases and Swords vs Karate. So
if we look at Cute Puppies of Love, well predicting that she might rate
that 4.95 seems quite plausible. And so these parameters w and b for Alice seem
like a reasonable model for predicting her movie ratings.
Let me just adjust the notation a little,
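To make that concrete, here is a minimal sketch of this prediction step in Python with NumPy, using the example's features for Cute Puppies of Love and the hypothetical parameters w(1)=[5,0] and b(1)=0 from above (the variable names are my own):

```python
import numpy as np

# Features [romance, action] for movie 3, Cute Puppies of Love
x3 = np.array([0.99, 0.0])

# Hypothetical parameters for user 1 (Alice), as in the example
w1 = np.array([5.0, 0.0])
b1 = 0.0

# Predicted rating: w . x + b, just like linear regression
rating = np.dot(w1, x3) + b1
print(rating)  # approximately 4.95
```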
because we have not just one user but multiple users, or
really nu equals 4 users. I'm going to add a superscript 1 here to
denote that this is the parameter w(1) for user 1 and
add a superscript 1 there as well. And similarly here and here as well,
so that we would actually have different parameters for
each of the 4 users in the dataset. And more generally in this
model we can for user j, not just user 1 now,
we can predict user j's rating for movie i as w(j).X(i)+b(j). So here the
parameters w(j) and b(j) are the parameters used
to predict user j's rating for movie i which is a function of X(i),
which is the features of movie i. And this is a lot like linear regression,
except that we're fitting a different linear regression model for
each of the 4 users in the dataset. So let's take a look at how we can
formulate the cost function for this algorithm. As a reminder, our notation
is that r(i,j)=1 if user j has rated movie i or
0 otherwise. And y(i,j)=rating given
by user j on movie i. And on the previous slide we defined w(j),
b(j) as the parameters for user j. And X(i) as the feature vector for
movie i. So the model we have is for user j and movie i predict the rating
to be w(j).X(i)+b(j). I'm going to introduce just
one new piece of notation, which is I'm going to use m(j) to denote
the number of movies rated by user j. So if the user has rated 4 movies,
then m(j) would be equal to 4. And if the user has rated 3 movies
then m(j) would be equal to 3. So what we'd like to do is to
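As an illustrative sketch of this notation in Python with NumPy (the ratings below are hypothetical stand-ins for the slide's table, not the exact values; np.nan marks movies a user hasn't rated):

```python
import numpy as np

# Hypothetical ratings matrix Y, movies x users; np.nan marks "not rated"
Y = np.array([
    [5.0,    5.0,    0.0,    0.0],     # Love at Last
    [5.0,    np.nan, np.nan, 0.0],     # Romance Forever
    [np.nan, 4.0,    0.0,    np.nan],  # Cute Puppies of Love
    [0.0,    0.0,    5.0,    4.0],     # Nonstop Car Chases
    [0.0,    0.0,    5.0,    np.nan],  # Swords vs Karate
])

# r(i,j) = 1 if user j has rated movie i, 0 otherwise
R = (~np.isnan(Y)).astype(int)

# m(j) = number of movies rated by user j: sum down each column of R
m = R.sum(axis=0)
print(m)  # one count per user
```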
learn the parameters w(j) and b(j), given the data that we have. That is given the
ratings a user
has given of a set of movies. So the algorithm we're going to use is
very similar to linear regression. So let's write out the cost function for
learning the parameters w(j) and b(j) for a given user j. And let's just focus on
one user, user j, for now. I'm going to use the mean
squared error criterion. So the cost will be the prediction,
which is w(j).X(i)+b(j) minus the actual rating
that the user had given. So minus y(i,j) squared. And we're trying to
choose parameters w and b to minimize the squared error
between the predicted rating and the actual rating that was observed. But the user
hasn't rated all the movies,
so if we're going to sum over this,
we're going to sum only over the values
of i where r(i,j)=1. So we're going to sum only over the movies
i that user j has actually rated. So that's what this denotes,
sum of all values of i where r(i,j)=1. Meaning that user j has
rated that movie i. And then finally we can take
the usual normalization 1 over 2m(j). And this is very much like
the cost function we have for linear regression with m or
really m(j) training examples. Where you're summing over the m(j) movies
for which you have a rating taking a squared error and
then normalizing by this 1 over 2m(j). And this is going to be a cost function J of
w(j), b(j). And if we minimize this as
a function of w(j) and b(j), then you should come up with a pretty
good choice of parameters w(j) and b(j) for making predictions for
user j's ratings. Let me add just one more
term to this cost function, which is the regularization
term to prevent overfitting. And so
here's our usual regularization parameter, lambda divided by 2m(j), and then times
the sum of the squared
values of the parameters w. And so
n is the number of numbers in X(i) and that's the same as the number
of numbers in w(j). If you were to minimize this cost
function J as a function of w and b, you should get a pretty
good set of parameters for predicting user j's ratings for
other movies. Now, before moving on, it turns out
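Here is one way this regularized per-user cost might be sketched in Python with NumPy; the function name and argument layout are my own, and lam stands in for the regularization parameter lambda:

```python
import numpy as np

def cost_for_user(w_j, b_j, X, y_j, r_j, lam):
    """Regularized cost J(w(j), b(j)) for a single user j.

    X   : (num_movies, n) matrix of movie features x(i)
    y_j : ratings user j gave (entries where r_j == 0 are ignored)
    r_j : indicator vector, r_j[i] = 1 if user j rated movie i
    lam : regularization parameter lambda
    """
    m_j = r_j.sum()                          # number of movies user j rated
    pred = X @ w_j + b_j                     # w(j) . x(i) + b(j) for every movie
    err = r_j * (pred - np.nan_to_num(y_j))  # zero out unrated movies
    squared_error = (err ** 2).sum() / (2 * m_j)
    reg = (lam / (2 * m_j)) * (w_j ** 2).sum()
    return squared_error + reg
```

Minimizing this over w_j and b_j gives the parameters for that one user; note that rescaling the cost by a constant such as 1 over 2m(j) does not change which w_j and b_j minimize it.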
that for recommender systems it would be convenient to actually eliminate
this division by m(j) term, since m(j) is just a constant
in this expression. And so, even if you take it out, you should end up with
the same value of w and b. Now let me take this cost function
down here to the bottom and copy it to the next slide. So we have that to learn
the parameters w(j), b(j) for user j. We would minimize this cost function
as a function of w(j) and b(j). But instead of focusing on a single user, let's
look at how we learn
the parameters for all of the users. To learn the parameters w(1), b(1), w(2),
b(2),...,w(nu), b(nu), we would take this cost function on
top and sum it over all the nu users. So we would have sum from
j=1 to nu of the same cost function that we
had written up above. And this becomes the cost for learning all the parameters for
all of the users. And if we use gradient descent or any other optimization
algorithm to
minimize this as a function of w(1), b(1) all the way through w(nu),
b(nu), then you have a pretty good set of parameters for predicting
movie ratings for all the users. And you may notice that this algorithm
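That training step might be sketched in Python with NumPy as follows; this is an illustrative implementation, not the course's code, with the 1 over 2m(j) normalization dropped as described above, and the function name, learning rate alpha, and iteration count are my own choices:

```python
import numpy as np

def fit_all_users(X, Y, R, lam=0.0, alpha=0.01, iters=5000):
    """Fit w(j), b(j) for every user by gradient descent on the
    cost summed over all nu users.

    X : (num_movies, n) movie features
    Y : (num_movies, num_users) ratings; unrated entries may be np.nan
    R : (num_movies, num_users) indicator, R[i, j] = 1 if rated
    """
    num_users = Y.shape[1]
    n = X.shape[1]
    W = np.zeros((num_users, n))   # row j holds w(j)
    b = np.zeros(num_users)        # entry j holds b(j)
    for _ in range(iters):
        pred = X @ W.T + b                   # (num_movies, num_users)
        err = R * (pred - np.nan_to_num(Y))  # only rated entries count
        grad_W = err.T @ X + lam * W         # d(cost)/d w(j), one row per user
        grad_b = err.sum(axis=0)             # d(cost)/d b(j), one entry per user
        W -= alpha * grad_W
        b -= alpha * grad_b
    return W, b
```

The learned parameters then predict user j's rating of movie i as W[j] @ X[i] + b[j], just as before.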
is a lot like linear regression, where w(j).X(i)+b(j) plays a role similar to
the output f(x) of linear regression. Only now we're training a different linear
regression model for each of the nu users. So that's how you can learn parameters
and
predict movie ratings, if you had access to these features X1 and
X2 that tell you how much each of the movies is a romance movie, and
how much each of the movies is an action movie.
But where do these features come from? And what if you
don't have access to such
features that give you enough detail about the movies with which
to make these predictions? In the next video, we'll look at
a modification of this algorithm that'll let you make predictions,
or really, make recommendations, even if you don't have, in advance,
features that describe the items or the movies in sufficient detail to
run the algorithm that we just saw. Let's go on and
take a look at that in the next video.
