Unit 02 - Nonlinear Classification, Linear Regression, Collaborative Filtering - MD
5.2. Objectives
At the end of this lecture, you will be able to:
write the training error as the least squares criterion for linear regression
use stochastic gradient descent for fitting linear regression models
derive the closed-form linear regression solution
identify the regularization term and how it changes the solution and the generalization
5.3. Introduction
Today we will see linear regression. In the last unit we saw linear classification, where
we were trying to learn the mapping between the feature vectors of our data ($x \in \mathbb{R}^d$) and
the corresponding (binary) labels ($y \in \{-1, +1\}$).
This relatively simple setup can already be used to answer pretty complex questions, like
making recommendations to buy or not some stocks, where the feature vector is given by
the stock prices in the last $d$ days.
We can extend this problem to return, instead of just a binary output (price will increase:
buy, vs. price will drop: sell), a more informative output on the extent of the expected
variation in price.
The setup is very similar to before, with $x \in \mathbb{R}^d$. The only difference is that now we
consider $y \in \mathbb{R}$.
The goal of our model will be to learn how to map feature vectors into these
continuous values.
Are we limiting ourselves to just linear relations? No: we can still use a linear classifier by
choosing an appropriate representation of the feature vector, mapping it into some kind of
more complex space, and then applying linear regression on that mapping (note that at this point
we will not yet talk about how we can construct this feature representation; we will
assume instead that somebody has already given us an appropriate feature vector).
Three questions to address:
1. What would be an appropriate objective that can quantify the extent of our mistakes
(see the loss written out after this list)? How do we assess the correctness of our output?
In classification we had a binary error term; here it must be a range, as our prediction
could be "almost there" or very far off.
2. How do we set up the learning algorithm? We will see two today: a numerical,
gradient-based one, and an analytical, closed-form algorithm where we do not need to
approximate.
3. How do we do regularisation? How do we achieve better generalization, to be more robust
when we don't have enough training data or when the data is noisy.
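For the first question, a standard way to write the objective (a reconstruction using the course's usual notation, with $(x^{(t)}, y^{(t)})$ denoting the $t$-th training pair) is the empirical risk with the squared error as loss:

$$R_n(\theta) = \frac{1}{n} \sum_{t=1}^{n} \mathrm{Loss}\big(y^{(t)} - \theta \cdot x^{(t)}\big) = \frac{1}{n} \sum_{t=1}^{n} \frac{\big(y^{(t)} - \theta \cdot x^{(t)}\big)^2}{2}$$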
Why square the deviation? Intuitively, since our training data and the values that we
record may be noisy, a small deviation between our prediction and the true value is OK.
However, if the deviation is large, we really want to penalize it. This is the behaviour
we get from the squared function: a bigger difference results in a much higher loss.
Note that the above equation uses the squared error as the loss function, but other loss
functions could also be used, e.g. the hinge loss we already saw in Unit 1,
$\mathrm{Loss}_h(z) = \max\{0,\, 1 - z\}$.
We will minimise this risk for the known data, but what we really want is for it to be
minimised on the unknown data we have not yet seen. The mistakes we make can be of two kinds:
1. Structural mistakes: maybe a linear function is not sufficient to model your
training data; maybe the mapping between your feature vectors and the $y$'s is actually
highly nonlinear. Instead of just considering linear mappings, you should then consider a
much broader set of functions. This is one class of mistakes.
2. Estimation mistakes: the mapping itself is indeed linear, but we don't have enough
training data to estimate the parameters correctly.
There is a trade-off between these two kinds of error: on one side, minimising the structural
mistakes asks for a broader set of functions with more parameters, but this, at equal training
set size, would increase the estimation mistakes. On the other side, minimising the
estimation mistakes calls for a simpler set of functions, with fewer parameters, which
however makes us susceptible to structural mistakes.
In this lesson we remain committed to linear regression, and we want to minimise the
empirical risk.
The advantage of the empirical risk function with the squared error as loss is that it is
differentiable everywhere. The gradient of the loss on a single sample with respect to the
parameter $\theta$ is:

$$\nabla_\theta \frac{\big(y^{(t)} - \theta \cdot x^{(t)}\big)^2}{2} = -\big(y^{(t)} - \theta \cdot x^{(t)}\big)\, x^{(t)}$$
We will implement its stochastic variant: we randomly select one sample from the
training set, look at its gradient, and update our parameter in the opposite direction (as we
want to minimise the empirical risk), using a learning rate $\eta$:

$$\theta \leftarrow \theta + \eta\,\big(y^{(t)} - \theta \cdot x^{(t)}\big)\, x^{(t)}$$

Note that the parameter updates at every step, not only on some "mistake" as in
classification (i.e. we treat all deviations as "mistakes"), and that the amount depends on the
deviation itself (i.e. it is not a fixed amount as in classification). Going against the gradient
ensures that the algorithm self-corrects, i.e. we obtain parameters that lead to
predictions closer and closer to the actual true $y$.
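To make the procedure concrete, here is a minimal Python sketch of this stochastic update (the learning rate `eta`, the epoch count, and the function name are illustrative assumptions, not part of the notes):

```python
import numpy as np

def sgd_linear_regression(X, y, eta=0.01, epochs=100):
    """Stochastic gradient descent on the squared-loss empirical risk.
    X: (n, d) matrix of feature vectors, y: (n,) vector of targets."""
    n, d = X.shape
    theta = np.zeros(d)                     # start from theta = 0
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        for t in rng.permutation(n):        # visit samples in random order
            residual = y[t] - theta @ X[t]  # the deviation drives the step size
            theta += eta * residual * X[t]  # step against the gradient
    return theta
```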
Let's now compute the gradient with respect to $\theta$ of the whole empirical risk, not
just the loss on a single sample:

$$\nabla_\theta R_n(\theta) = A\theta - b = 0 \;\Longrightarrow\; \theta = A^{-1} b, \qquad A = \frac{1}{n}\sum_{t=1}^{n} x^{(t)} \big(x^{(t)}\big)^T, \quad b = \frac{1}{n}\sum_{t=1}^{n} y^{(t)} x^{(t)}$$

We can invert $\mathbf{A}$ only if the feature vectors $x^{(1)}, \dots, x^{(n)}$ span $\mathbb{R}^d$, which
in particular requires $n \geq d$.
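A minimal numpy sketch of this closed-form solution (matrix names follow the formula above; the function name is an illustrative assumption):

```python
import numpy as np

def closed_form_regression(X, y):
    """Exact least-squares solution theta = A^{-1} b.
    Requires the feature vectors to span R^d, i.e. A must be invertible."""
    n = X.shape[0]
    A = (X.T @ X) / n             # A = (1/n) * sum_t x_t x_t^T
    b = (X.T @ y) / n             # b = (1/n) * sum_t y_t x_t
    return np.linalg.solve(A, b)  # more stable than explicitly inverting A
```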
This issue becomes more and more important as we have less data to train the algorithm on.
The way to address it is to use a mechanism called regularisation, which tries to push
us away from fitting the training data "perfectly" (where we would also "fit" the errors and
the noise embedded in our data) and instead tries to generalise to possibly unknown data.
The idea is to introduce something that pushes the thetas toward zero, so that it is only
worth moving our parameters if there is a really strong pattern that justifies the
move.
5.8. Regularization
The implementation of regularisation we see in this lesson is called ridge regression,
and it is the same we used in the classification problem:

$$J_{n,\lambda}(\theta) = \frac{1}{n}\sum_{t=1}^{n} \frac{\big(y^{(t)} - \theta \cdot x^{(t)}\big)^2}{2} + \frac{\lambda}{2}\,\|\theta\|^2$$
The first term is our empirical risk, and it captures how well we are fitting the data. The
second term, being the squared norm of the thetas, pushes the thetas to remain at zero, to not
move unless there is a significant advantage in doing so. And $\lambda$ is the parameter that
determines the trade-off, the relative contribution, between these two terms. Note that, being
a norm, it doesn't favour any specific dimension of theta. Its role is actually to
determine how much we care about fitting our training examples versus how much we care about
staying close to zero. In other words, we don't want any weak piece of evidence to pull our
thetas very strongly. We want to keep them grounded in some area and only pull them
when we have enough evidence that it would really, in a substantial way, impact the
empirical loss.
In other terms, the effect of regularization is to restrict the parameters of the model from
freely taking on large values. This makes the model function smoother, levelling the "hills" and
filling the "valleys". It also makes the model more stable: with a smaller $\|\theta\|$, a small
perturbation on $x$ will not change the prediction significantly.
What's very nice about using the squared norm as the regularisation term is that everything
we discussed before, both the gradient descent and the closed-form solution, can be
very easily adjusted to this new objective.
We can modify the gradient descent algorithm so that the update rule becomes:

$$\theta \leftarrow (1 - \eta\lambda)\,\theta + \eta\,\big(y^{(t)} - \theta \cdot x^{(t)}\big)\, x^{(t)}$$

The difference with respect to the update without the regularisation term is the factor $(1 - \eta\lambda)$
that multiplies $\theta$, shrinking it at each update.
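The closed form adjusts just as easily; a sketch under the same assumptions as the earlier snippet (how $\lambda$ is scaled relative to $n$ depends on how the objective is normalised, so the exact placement of `lam` below is an assumption):

```python
import numpy as np

def ridge_regression(X, y, lam=0.1):
    """Closed-form ridge solution theta = (lam*I + A)^{-1} b.
    The lam*I term makes the matrix invertible even when n < d."""
    n, d = X.shape
    A = (X.T @ X) / n
    b = (X.T @ y) / n
    return np.linalg.solve(lam * np.eye(d) + A, b)
```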
5.9. Closing Comment
By using regularisation, i.e. by requiring much more evidence to push the parameters in a
given direction, we will increase the error of our predictions within the training set, but
we will reduce the test error when the fitted thetas are used on data that has not
been used for the fitting step.
Our objective is to find the "sweet spot" of lambda where the test error is
minimised, and we can do that by using the validation set to calibrate the value that lambda
should have.
We saw regularization in the context of linear regression, but we will see regularization
across many different machine learning tasks. This is just one way to implement it; we will
see other mechanisms to implement regularisation, for example in neural networks.
6.1. Objectives
At the end of this lecture, you will be able to
This lesson deals with non-linear classification. The basic idea is to expand the feature
vector, mapping it to a higher dimensional space, and then feed this new vector to a linear
classifier.
The computational disadvantage of using a higher dimensional space can be avoided using
so-called kernel functions. We will then see linear models applied through these kernel
functions, and in particular, for simplicity, the perceptron linear model (which, when used
with a kernel function, becomes the "kernel perceptron").
Let's see an example in 1D: we have the real line on which we have the points we want to
classify. We can easily see that if we have the set of points {-3: positively labelled, 2:
negatively labelled, 5: positively labelled}, there is no linear classifier that can correctly
classify this data set.
We will always include in the new feature vector the original vector itself, so as to retain
the power that was available prior to the feature transformation, but we will also add
additional features to it. Note that, differently from statistics, in machine learning we
know nothing (assume nothing) about the distribution of our data, so removing the original
data to keep only the newly added features would risk removing information that is not
captured by the transformation.
In this case we can add for example $x^2$, so the mapping is
$\phi(x) = [x, x^2]^T$. As a result, the parameter vector of the classifier also becomes
two-dimensional.
Our dataset becomes {(-3,9)[+], (2,4)[-], (5,25)[+]}, which can be easily classified by a linear
classifier in 2D (i.e. a line).
Note that the linear classifier we found in the new feature space becomes, back in the original
space, a non-linear classifier, given by $\theta_1 x + \theta_2 x^2 + \theta_0 = 0$.
Once we have the new feature vector we can do non-linear classification or regression
on the original data by doing linear classification or regression in the new feature space:
Classification: $h(x; \theta, \theta_0) = \mathrm{sign}\big(\theta \cdot \phi(x) + \theta_0\big)$
Regression: $f(x; \theta, \theta_0) = \theta \cdot \phi(x) + \theta_0$
The more features we add (e.g. the more polynomial degrees we include), the better we fit the
data. The key question now is: when is it time to stop adding features? We can use the
validation set to test which polynomial form, trained on the training set, performs best on the
validation set.
At the extreme, you hold out each of the training examples in turn, in a procedure called
leave-one-out cross validation. You take a single training sample, remove it from
the training set, retrain the method, and then test how well you predict that
particular held-out example; you do that for each training example in turn, and then you
average the results.
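A minimal sketch of leave-one-out cross validation in Python (the `fit` and `predict` callables are illustrative placeholders for whatever training and prediction routine is being validated):

```python
import numpy as np

def leave_one_out_error(X, y, fit, predict):
    """Average squared prediction error over n single-example holdouts.
    fit(X, y) returns a trained model; predict(model, x) returns a scalar."""
    n = X.shape[0]
    errors = []
    for i in range(n):
        mask = np.arange(n) != i       # drop the i-th example
        model = fit(X[mask], y[mask])  # retrain on the remaining n-1 examples
        errors.append((predict(model, X[i]) - y[i]) ** 2)
    return np.mean(errors)
```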
While very powerful, this explicit mapping into higher-dimensional feature vectors has a clear
cost: the number of dimensions can quickly become very high when our original data is
already multidimensional. For example, for $x \in \mathbb{R}^2$ a quadratic feature map is
$\phi(x) = [x_1, x_2, x_1^2, \sqrt{2}\,x_1 x_2, x_2^2]^T$ (the meaning of the scalar associated
with the cross term will be discussed later).
So our feature vector becomes very high-dimensional very quickly, even if we started from
a moderately dimensional vector.
We would therefore want a more efficient way of doing this: operating with high
dimensional feature vectors without explicitly having to construct them. And that is what
kernel methods provide us.
6.4. Motivation for Kernels: Computational Efficiency
The idea is that you can take inner products between high dimensional feature vectors and
evaluate that inner product very cheaply. We can then turn our algorithms into ones
operating only in terms of these inner products.
We define the kernel function of two feature vectors $x$ and $x'$ (two different data points),
relative to a given transformation $\phi$, as the dot product of the transformed feature vectors
of the two points:

$$K(x, x') = \phi(x) \cdot \phi(x')$$

We can hence think of the kernel function as a kind of similarity measure: how similar the
example $x$ is to the example $x'$. Note also that, the dot product being symmetric and
non-negative on identical arguments, kernel functions are in turn symmetric and positive
semi-definite.
For example, let's take $x$ and $x'$ to be two-dimensional feature vectors and the feature
transformation defined as $\phi(x) = [x_1, x_2, x_1^2, \sqrt{2}\,x_1 x_2, x_2^2]^T$ (so that
$\phi(x) \in \mathbb{R}^5$).
This particular transformation allows us to compute the kernel function very cheaply, staying
in the original few dimensions:

$$K(x, x') = \phi(x) \cdot \phi(x') = x \cdot x' + (x \cdot x')^2$$

Note that even if the transformed feature vectors have 5 dimensions, the kernel function
returns a scalar. In general, for this kind of polynomial feature transformation $\phi$, the kernel
function evaluates as $K(x, x') = x \cdot x' + (x \cdot x')^2 + \dots + (x \cdot x')^p$, where $p$ is the order
of the polynomial transformation $\phi$.
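A quick numerical sanity check of this identity (the feature map follows the reconstruction above; the test vectors are arbitrary):

```python
import numpy as np

def phi(x):
    """Quadratic feature map for 2-d inputs (5 dimensions)."""
    return np.array([x[0], x[1], x[0]**2, np.sqrt(2)*x[0]*x[1], x[1]**2])

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
lhs = phi(x) @ phi(xp)        # explicit dot product in R^5
rhs = x @ xp + (x @ xp) ** 2  # cheap kernel evaluation in R^2
print(np.isclose(lhs, rhs))   # True
```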
However, it is only for some $\phi$ that the evaluation of the kernel function becomes so
nice! As long as we can prove that a particular kernel function can be expressed as the dot
product of two transformed feature vectors (for those interested, see Mercer's theorem as
stated in these notes), the kernel function is valid and we don't actually need to construct
the transformed feature vectors (the output of $\phi$).
Now our task will be to turn a linear method that previously operated on $\phi(x)$, like
$h(x) = \mathrm{sign}\big(\theta \cdot \phi(x)\big)$, into a classifier that depends only on those
inner products, i.e. that operates in terms of kernels.
We'll do that in the context of the kernel perceptron, just for simplicity, but it applies to
any linear method that we've already learned.
```python
import numpy as np

theta = np.zeros(d)                          # initialisation (d = len(phi(x)))
for t in range(T):                           # T passes over the data
    for i in range(n):
        if y[i] * (theta @ phi(x[i])) <= 0:  # mistake: prediction disagrees with label
            theta = theta + y[i] * phi(x[i]) # update theta on a mistake
```
What is the final value of the parameter $\theta$ resulting from such updates? We can write it as

$$\theta = \sum_{j=1}^{n} \alpha_j\, y^{(j)}\, \phi\big(x^{(j)}\big)$$

where $\alpha$ is the vector of the number of mistakes (and hence updates) incurred for each data
pair (so $\alpha_j$ is the (scalar) number of errors that occurred on the j-th data pair).
Note that we can interpret $\alpha_j$ in terms of the relative importance of the j-th training
example to the final predictor. Because we are doing perceptron, the importance is just in
terms of the number of mistakes that we make on that particular example.
When we want to make a prediction for a new point $x$ using the resulting parameter
value $\theta$ (that is, the "optimal" parameter the perceptron algorithm can give us), we take
an inner product with it:

$$\theta \cdot \phi(x) = \sum_{j=1}^{n} \alpha_j\, y^{(j)}\, \phi\big(x^{(j)}\big) \cdot \phi(x) = \sum_{j=1}^{n} \alpha_j\, y^{(j)}\, K\big(x^{(j)}, x\big)$$

But this means we can now express successes or errors in terms of the vector $\alpha$ and a valid
kernel function (typically something cheap to compute)!
Once we have run the algorithm and found the optimal $\alpha$, we may immediately retrieve
the optimal $\theta$ by the above equation, even if at this point we really do not need $\theta$ (or
sometimes cannot compute it, i.e. when $\phi(x)$ has infinite dimensions) to make predictions.
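Putting the pieces together, a minimal kernel perceptron sketch in Python (the function names and the Gram-matrix precomputation are illustrative choices, not from the notes):

```python
import numpy as np

def kernel_perceptron(X, y, K, T=10):
    """Learn the mistake counts alpha without ever constructing phi(x).
    X: (n, d) data, y: (n,) labels in {-1, +1}, K: kernel on two d-vectors."""
    n = X.shape[0]
    alpha = np.zeros(n)
    # Precompute the Gram matrix G[j, i] = K(x_j, x_i)
    G = np.array([[K(X[j], X[i]) for i in range(n)] for j in range(n)])
    for _ in range(T):
        for i in range(n):
            # theta . phi(x_i) expressed purely through the kernel
            if y[i] * np.sum(alpha * y * G[:, i]) <= 0:
                alpha[i] += 1  # one more mistake on example i
    return alpha

def kernel_predict(X, y, alpha, K, x_new):
    """Sign of sum_j alpha_j y_j K(x_j, x_new)."""
    return np.sign(sum(a * yj * K(xj, x_new) for a, yj, xj in zip(alpha, y, X)))
```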
Valid kernels can be composed into more complex valid kernels according to a few rules:
(1) the constant function $K(x, x') = 1$ is a valid kernel; (2) for any function
$f: \mathbb{R}^d \to \mathbb{R}$ and any valid kernel $K$, $f(x)\,K(x, x')\,f(x')$ is a valid
kernel; (3) the sum of two valid kernels is a valid kernel; (4) the product of two valid
kernels is a valid kernel.
Armed with these rules we can build up pretty complex kernels starting from simpler ones.
For example, let's start with the identity function as $\phi$, i.e. $\phi(x) = x$. Such a feature
function results in the kernel $K(x, x') = x \cdot x'$ (this is known as the linear kernel).
We can now add to it a squared term to form a new kernel that, by virtue of rules (3) and
(4) above, is still a valid kernel:

$$K(x, x') = x \cdot x' + (x \cdot x')^2$$
A particularly important kernel is the radial basis kernel:

$$K(x, x') = e^{-\frac{1}{2}\|x - x'\|^2}$$

It can be proved that such a kernel is indeed a valid kernel, and its corresponding feature
representation $\phi$ involves polynomial features up to infinite order.
Does the radial basis kernel look like a Gaussian (without the normalisation term)? Well,
that is because it indeed is one: as a function of $x$, it has exactly the shape of a Gaussian
density centered at $x'$, up to the missing normalisation constant.
The above picture shows the contour lines of the radial basis kernel when we keep $x'$ fixed
(in 2 dimensions) and we let $x$ move away from it: the value of the kernel then decreases,
in a shape that in 3D would resemble the classical bell shape of the Gaussian curve. We
could even parametrise the radial basis kernel by replacing the fixed $\frac{1}{2}$ term with a
parameter $\gamma$ that determines the width of the bell-shaped curve, $K(x, x') = e^{-\gamma \|x - x'\|^2}$
(the larger the value of $\gamma$, the narrower the bell; small values of $\gamma$ yield wide bells).
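This parametrised kernel plugs directly into the kernel perceptron sketch above; a minimal version (the name `gamma` for the width parameter is an assumption):

```python
import numpy as np

def rbf_kernel(x, xp, gamma=0.5):
    """Radial basis kernel exp(-gamma * ||x - xp||^2).
    gamma = 1/2 recovers the unparametrised version in the notes;
    larger gamma means a narrower bell."""
    return np.exp(-gamma * np.sum((np.asarray(x) - np.asarray(xp)) ** 2))
```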
Because the feature representation has infinite dimensions, the radial basis kernel has infinite
expressive power and can correctly classify any training set.
The linear decision boundary in the infinite-dimensional space is given by the set
$\big\{x : \sum_{j} \alpha_j\, y^{(j)}\, K(x^{(j)}, x) = 0\big\}$ and corresponds to a (possibly) non-linear
boundary in the original feature vector space.
The more difficult the task is, the more iterations the kernel perceptron (with the
radial basis kernel) will need before finding the separating solution, but it always will in a
finite number of steps. This is in contrast with the "normal" perceptron algorithm, which,
when the set is not separable, would continue to run forever, changing its parameters until
it is stopped at some arbitrary point.
Decision trees perform classification by operating sequentially on the various dimensions:
first making a separation on one dimension and then, in a subsequent step, on another
dimension, and so on. And you can "learn" these trees incrementally.
There is a way to make these decision trees more robust, called random forest classifiers,
that adds two types of randomness: the first is in randomly choosing the dimension on
which to operate the cut, the second is in randomly selecting the examples on which to
operate from the data set (with replacement); we then just average the predictions obtained
from these trees (as sketched below).
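As a hedged illustration (using scikit-learn, which these notes do not otherwise rely on; the toy dataset is invented for the example), a random forest combines exactly these two sources of randomness:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))             # toy 2-d dataset
y = (X[:, 0] ** 2 > X[:, 1]).astype(int)  # non-linearly separable labels

# 100 trees, each grown on a bootstrap resample of the data (second source of
# randomness), considering a random subset of features at each split (first
# source); predictions average the votes of the individual trees.
clf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
clf.fit(X, y)
print(clf.predict(X[:5]))
```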
Summary
We can get non-linear classification (or regression) methods by simply mapping our data
into new feature vectors that include non-linear components, and applying a linear
method to the resulting vectors;
These feature vectors can however be high dimensional;
We can turn linear methods into kernel methods by casting the computation in terms
of inner products;
A kernel function is advantageous when the inner products are faster to evaluate than
using the explicit feature vectors (e.g. when the vectors would be infinite dimensional!);
We saw the radial basis kernel, which is particularly powerful because it is both (a)
cheap to evaluate and (b) endowed with a corresponding infinite dimensional feature vector.
7.1. Objectives
At the end of this lecture, you will be able to
7.2. Introduction
This lesson deals with recommender systems, where the algorithm tries to guess preferences
based on choices the user has already made (like films to watch or products to buy).
We’ll see:
Problem definition
We keep the recommendation of movies as the running example across the lecture: the prior
choices of the users are collected in an $n \times m$ matrix $Y$ of scores, with one row per
user and one column per movie.
The goal is to base the prediction on the prior choices of the users, considering that this
matrix can be very sparse (e.g. out of 18000 films, each individual ranked very few of
them!), i.e. we want to fill in the "empty spaces" of the matrix.
Why not treat this as a regression problem based on movie features? There are two problems:
1. Deciding which features to use, or extracting them from data, can be hard or infeasible
(e.g. where can I get the info on whether a film has a happy or a sad ending?), or the
feature vector can become very, very large (think of generic "products" for Amazon:
they could be anything).
2. Often we have little data about a single user's preferences, while making a
recommendation based on the user's own previous choices alone would require a lot of data.
The "trick" is then to "borrow" preferences from the other users, measuring how close a
single user is to each of the other ones in our dataset.
In the K-nearest-neighbours approach, we look at the $K$ closest users that did score the
item we are interested in, look at their scores for it, and average them:

$$\hat{Y}_{ai} = \frac{\sum_{b \in \mathrm{KNN}(a,i)} Y_{bi}}{K}$$

where $\mathrm{KNN}(a,i)$ is the set of $K$ users closest to user $a$ that have a score for item $i$.
Now, the question is of course: how do we define this similarity? We can use any method to
define similarity between vectors, like the cosine similarity
($\mathrm{sim}(x_a, x_b) = \frac{x_a \cdot x_b}{\|x_a\|\,\|x_b\|}$) or the Euclidean
distance ($\|x_a - x_b\|$).
We can make the algorithm a bit more sophisticated by weighting the neighbours' scores by
the level of similarity, rather than just taking their unweighted average:

$$\hat{Y}_{ai} = \frac{\sum_{b \in \mathrm{KNN}(a,i)} \mathrm{sim}(a, b)\, Y_{bi}}{\sum_{b \in \mathrm{KNN}(a,i)} |\mathrm{sim}(a, b)|}$$
Many improvements have been added to this kind of algorithm, like adjusting for the
different "average" score that each user gives to the items (i.e. comparing the deviations
from each user's average rather than the raw scores themselves).
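A minimal sketch of the weighted neighbour average in Python (cosine similarity computed over commonly rated items; marking missing entries with `np.nan` and the function names are illustrative choices):

```python
import numpy as np

def knn_score(Y, a, i, K=3):
    """Predict user a's score for item i as the similarity-weighted average
    of the K most similar users who rated item i.
    Y: (n_users, n_items) array with np.nan marking missing scores."""
    def cosine(u, v):
        both = ~np.isnan(u) & ~np.isnan(v)  # items rated by both users
        if not both.any():
            return 0.0
        return (u[both] @ v[both]) / (
            np.linalg.norm(u[both]) * np.linalg.norm(v[both]) + 1e-12)

    raters = [b for b in range(Y.shape[0]) if b != a and not np.isnan(Y[b, i])]
    raters.sort(key=lambda b: abs(cosine(Y[a], Y[b])), reverse=True)
    knn = raters[:K]  # the K closest users with a score for item i
    sims = np.array([cosine(Y[a], Y[b]) for b in knn])
    scores = np.array([Y[b, i] for b in knn])
    return (sims @ scores) / (np.abs(sims).sum() + 1e-12)
```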
Still, these methods are very far from today's methods. The problem with KNN is that it doesn't
enable us to detect the hidden structure that is there in the data, namely that users may be
similar to some pool of other users along one dimension, but similar to some other set of users
along a different dimension. For example, someone loving machine learning books shares
some "similarity" with other readers of machine learning books on some hidden
characteristics (e.g. liking equation-rich books or more discursive ones), while for plant books
the similarity with other plant-reading users would be based on completely different
hidden features (e.g. loving photos, or liking nice tabular descriptions of plants).
Conversely, in collaborative filtering the algorithm is able to detect these hidden
groupings, both in terms of products and in terms of users. So we don't
have to explicitly engineer a very sophisticated similarity measure; the algorithm is
able to pick up these very complex dependencies that for us, as humans, would
definitely not be tractable to come up with.
For now, we treat each individual score independently... and this (as we will see) will be the
reason why this first method will not work.
So, we have our (sparse) $n \times m$ matrix $Y$ and we want to find a dense matrix $X$ of the
same dimensions that replicates as well as possible the observed entries of $Y$ where these are
available, and fills in the missing ones elsewhere.
Let's first define $D$ as the set of index pairs for which a score in $Y$ is given:
$D = \{(a, i) : Y_{ai} \text{ is observed}\}$.
The function $J$ then takes any candidate matrix $X$ and measures the distance from the
observed points in the set $D$, plus a regularisation term (we keep the individual scores at zero
unless we have a strong belief to move them from that state):

$$J(X) = \sum_{(a,i) \in D} \frac{(Y_{ai} - X_{ai})^2}{2} + \frac{\lambda}{2} \sum_{a=1}^{n} \sum_{i=1}^{m} X_{ai}^2$$

Minimising $J$ entry by entry (setting each derivative $\partial J / \partial X_{ai}$ to zero) gives:
For $(a,i) \in D$: $X_{ai} = \frac{Y_{ai}}{1 + \lambda}$
For $(a,i) \notin D$: $X_{ai} = 0$
Clearly this result doesn't make sense: for the data we already know we obtain a poor
estimate (increasingly worse as we increase lambda), and for the unknown scores we are left with
zeros.
The idea is then to constrain the matrix $X$ to have a lower rank, as the rank captures how much
dependence there is between the entries of the matrix.
At one extreme, constraining the matrix to be rank 1 means that we can factorise
it as the outer product of two single vectors, one ($u$, of length $n$) defining a sort of general
sentiment about the items for each user, and the other one ($v$, of length $m$) representing the
average sentiment for a given item, i.e. $X = u v^T$, so that $X_{ai} = u_a v_i$.
But representing users and items with just a single number takes us back to the KNN
problem of not being able to distinguish the possible multiple groups hidden within each user
or each item.
We could then decide to describe the users and/or the items with, respectively, $n \times 2$ and
$m \times 2$ matrices $U$ and $V$, and constrain our matrix to be the product of these two
matrices (hence with rank 2 in this case): $X = U V^T$.
The exact number of vectors to use in the user/item factorisation matrices (i.e. the
rank of $X$) is then a hyperparameter that can be selected using the validation set.
Still, for simplicity, in this lesson we will consider the simplest case of constraining the
matrix to be factorisable as a pair of single vectors (rank 1).
We minimise the resulting objective by alternating: fixing $v$ and minimising over $u$, then
fixing $u$ and minimising over $v$, and so on. Note also that when we minimise over the
individual components of one of the two vectors, the derivatives with respect to the individual
vector elements are independent of each other, so each first-order condition can be expressed
in terms of a single variable, as shown below.
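As a reconstruction of the rank-1 objective (assuming the standard form for this setting, with the notation introduced above) and of one such single-variable first-order condition:

$$J(u, v) = \sum_{(a,i) \in D} \frac{(Y_{ai} - u_a v_i)^2}{2} + \frac{\lambda}{2} \sum_{a=1}^{n} u_a^2 + \frac{\lambda}{2} \sum_{i=1}^{m} v_i^2$$

$$\frac{\partial J}{\partial u_a} = -\sum_{i : (a,i) \in D} (Y_{ai} - u_a v_i)\, v_i + \lambda u_a = 0 \;\Longrightarrow\; u_a = \frac{\sum_{i : (a,i) \in D} Y_{ai}\, v_i}{\lambda + \sum_{i : (a,i) \in D} v_i^2}$$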
Numerical example
Let's consider a given value of $\lambda$ and the following score dataset:
$J$ becomes:
Homework 3