Unit 02 - Nonlinear Classification, Linear Regression, Collaborative Filtering - MD
5.2. Objectives
At the end of this lecture, you will be able to:
write the training error as the least squares criterion for linear regression
use stochastic gradient descent for fitting linear regression models
derive the closed-form linear regression solution
identify the regularization term and how it changes the solution and the generalization
5.3. Introduction
Today we will see linear regression. In the last unit we saw linear classification, where
we were trying to learn the mapping between the feature vectors of our data ($x \in \mathbb{R}^d$) and
the corresponding (binary) labels ($y \in \{-1, +1\}$).
This relatively simple setup can already be used to answer pretty complex questions, like
making recommendations to buy or not some stocks, where the feature vector is given by
the stock prices in the last $d$ days.
We can extend this problem to return, instead of just a binary output (price will increase:
buy, vs. price will drop: sell), a more informative output on the extent of the expected
variation in price.
The setup is very similar to before, with $x \in \mathbb{R}^d$. The only difference is that now we
consider $y \in \mathbb{R}$.
The goal of our model will be to learn how to map feature vectors into these
continuous values.
Are we limiting ourselves to just linear relations? No: we can still use a linear classifier by
choosing an appropriate representation of the feature vector, mapping it into some kind of
more complex space, and then applying linear regression on that mapping (note that at this point
we will not yet talk about how we can construct this feature representation; we will
assume instead that somebody has already given us an appropriate feature vector).
Three questions to address:
1. What would be an appropriate objective that can quantify the extent of our mistakes
(see the loss written out after this list)? How do we assess the correctness of our output?
In classification we had a binary error term; here it must be a range, as our prediction
could be "almost there" or very far off.
2. How do we set up the learning algorithm? We will see two today: a numerical,
gradient-based one, and an analytical, closed-form algorithm where we do not need to
approximate.
3. How do we do regularisation? How do we achieve better generalization, to be more robust
when we don't have enough training data or when the data is noisy.
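For the first question, a standard way to write the objective (a reconstruction using the course's usual notation, with $(x^{(t)}, y^{(t)})$ denoting the $t$-th training pair) is the empirical risk with the squared error as loss:

$$R_n(\theta) = \frac{1}{n} \sum_{t=1}^{n} \mathrm{Loss}\big(y^{(t)} - \theta \cdot x^{(t)}\big) = \frac{1}{n} \sum_{t=1}^{n} \frac{\big(y^{(t)} - \theta \cdot x^{(t)}\big)^2}{2}$$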
Why square the deviation? Intuitively, since our training data and the values that we
record may be noisy, a small deviation between our prediction and the true value is OK.
However, if the deviation is large, we really want to penalize it. This is the behaviour
we get from the squared function: a bigger difference results in a much higher loss.
Note that the above equation uses the squared error as the loss function, but other loss
functions could also be used, e.g. the hinge loss we already saw in Unit 1,
$\mathrm{Loss}_h(z) = \max\{0,\, 1 - z\}$.
We will minimise this risk for the known data, but what we really want is for it to be
minimised on the unknown data we have not yet seen. The mistakes we make can be of two kinds:
1. Structural mistakes: maybe a linear function is not sufficient to model your
training data; maybe the mapping between your feature vectors and the $y$'s is actually
highly nonlinear. Instead of just considering linear mappings, you should then consider a
much broader set of functions. This is one class of mistakes.
2. Estimation mistakes: the mapping itself is indeed linear, but we don't have enough
training data to estimate the parameters correctly.
There is a trade-off between these two kinds of error: on one side, minimising the structural
mistakes asks for a broader set of functions with more parameters, but this, at equal training
set size, would increase the estimation mistakes. On the other side, minimising the
estimation mistakes calls for a simpler set of functions, with fewer parameters, which
however makes us susceptible to structural mistakes.
In this lesson we remain committed to linear regression, and we want to minimise the
empirical risk.
The advantage of the empirical risk function with the squared error as loss is that it is
differentiable everywhere. The gradient of the loss on a single sample with respect to the
parameter $\theta$ is:

$$\nabla_\theta \frac{\big(y^{(t)} - \theta \cdot x^{(t)}\big)^2}{2} = -\big(y^{(t)} - \theta \cdot x^{(t)}\big)\, x^{(t)}$$
We will implement its stochastic variant: we randomly select one sample from the
training set, look at its gradient, and update our parameter in the opposite direction (as we
want to minimise the empirical risk), using a learning rate $\eta$:

$$\theta \leftarrow \theta + \eta\,\big(y^{(t)} - \theta \cdot x^{(t)}\big)\, x^{(t)}$$

Note that the parameter updates at every step, not only on some "mistake" as in
classification (i.e. we treat all deviations as "mistakes"), and that the amount depends on the
deviation itself (i.e. it is not a fixed amount as in classification). Going against the gradient
ensures that the algorithm self-corrects, i.e. we obtain parameters that lead to
predictions closer and closer to the actual true $y$.
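To make the procedure concrete, here is a minimal Python sketch of this stochastic update (the learning rate `eta`, the epoch count, and the function name are illustrative assumptions, not part of the notes):

```python
import numpy as np

def sgd_linear_regression(X, y, eta=0.01, epochs=100):
    """Stochastic gradient descent on the squared-loss empirical risk.
    X: (n, d) matrix of feature vectors, y: (n,) vector of targets."""
    n, d = X.shape
    theta = np.zeros(d)                     # start from theta = 0
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        for t in rng.permutation(n):        # visit samples in random order
            residual = y[t] - theta @ X[t]  # the deviation drives the step size
            theta += eta * residual * X[t]  # step against the gradient
    return theta
```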
Let's now compute the gradient with respect to $\theta$ of the whole empirical risk, not
just the loss on a single sample:

$$\nabla_\theta R_n(\theta) = A\theta - b = 0 \;\Longrightarrow\; \theta = A^{-1} b, \qquad A = \frac{1}{n}\sum_{t=1}^{n} x^{(t)} \big(x^{(t)}\big)^T, \quad b = \frac{1}{n}\sum_{t=1}^{n} y^{(t)} x^{(t)}$$

We can invert $\mathbf{A}$ only if the feature vectors $x^{(1)}, \dots, x^{(n)}$ span $\mathbb{R}^d$, which
in particular requires $n \geq d$.
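A minimal numpy sketch of this closed-form solution (matrix names follow the formula above; the function name is an illustrative assumption):

```python
import numpy as np

def closed_form_regression(X, y):
    """Exact least-squares solution theta = A^{-1} b.
    Requires the feature vectors to span R^d, i.e. A must be invertible."""
    n = X.shape[0]
    A = (X.T @ X) / n             # A = (1/n) * sum_t x_t x_t^T
    b = (X.T @ y) / n             # b = (1/n) * sum_t y_t x_t
    return np.linalg.solve(A, b)  # more stable than explicitly inverting A
```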
This issue becomes more and more important as we have less data to train the algorithm on.
The way to address it is to use a mechanism called regularisation, which tries to push
us away from fitting the training data "perfectly" (where we would also "fit" the errors and
the noise embedded in our data) and instead tries to generalise to possibly unknown data.
The idea is to introduce something that pushes the thetas toward zero, so that it is only
worth moving our parameters if there is a really strong pattern that justifies the
move.
5.8. Regularization
The implementation of regularisation we see in this lesson is called ridge regression,
and it is the same we used in the classification problem:

$$J_{n,\lambda}(\theta) = \frac{1}{n}\sum_{t=1}^{n} \frac{\big(y^{(t)} - \theta \cdot x^{(t)}\big)^2}{2} + \frac{\lambda}{2}\,\|\theta\|^2$$
The first term is our empirical risk, and it captures how well we are fitting the data. The
second term, being the squared norm of the thetas, pushes the thetas to remain at zero, to not
move unless there is a significant advantage in doing so. And $\lambda$ is the parameter that
determines the trade-off, the relative contribution, between these two terms. Note that, being
a norm, it doesn't favour any specific dimension of theta. Its role is actually to
determine how much we care about fitting our training examples versus how much we care about
staying close to zero. In other words, we don't want any weak piece of evidence to pull our
thetas very strongly. We want to keep them grounded in some area and only pull them
when we have enough evidence that it would really, in a substantial way, impact the
empirical loss.
In other terms, the effect of regularization is to restrict the parameters of the model from
freely taking on large values. This makes the model function smoother, levelling the "hills" and
filling the "valleys". It also makes the model more stable: with a smaller $\|\theta\|$, a small
perturbation on $x$ will not change the prediction significantly.
What's very nice about using the squared norm as the regularisation term is that everything
we discussed before, both the gradient descent and the closed-form solution, can be
very easily adjusted to this new objective.
We can modify the gradient descent algorithm so that the update rule becomes:

$$\theta \leftarrow (1 - \eta\lambda)\,\theta + \eta\,\big(y^{(t)} - \theta \cdot x^{(t)}\big)\, x^{(t)}$$

The difference with respect to the update without the regularisation term is the factor $(1 - \eta\lambda)$
that multiplies $\theta$, shrinking it at each update.
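The closed form adjusts just as easily; a sketch under the same assumptions as the earlier snippet (how $\lambda$ is scaled relative to $n$ depends on how the objective is normalised, so the exact placement of `lam` below is an assumption):

```python
import numpy as np

def ridge_regression(X, y, lam=0.1):
    """Closed-form ridge solution theta = (lam*I + A)^{-1} b.
    The lam*I term makes the matrix invertible even when n < d."""
    n, d = X.shape
    A = (X.T @ X) / n
    b = (X.T @ y) / n
    return np.linalg.solve(lam * np.eye(d) + A, b)
```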
5.9. Closing Comment
By using regularisation, i.e. by requiring much more evidence to push the parameters in a
given direction, we will increase the error of our predictions within the training set, but
we will reduce the test error when the fitted thetas are used on data that has not
been used for the fitting step.
Our objective is to find the "sweet spot" of lambda where the test error is
minimised, and we can do that by using the validation set to calibrate the value that lambda
should have.
We saw regularization in the context of linear regression, but we will see regularization
across many different machine learning tasks. This is just one way to implement it; we will
see other mechanisms to implement regularisation, for example in neural networks.
6.1. Objectives
At the end of this lecture, you will be able to
This lesson deals with non-linear classification. The basic idea is to expand the feature
vector, mapping it to a higher dimensional space, and then feed this new vector to a linear
classifier.
The computational disadvantage of using a higher dimensional space can be avoided using
so-called kernel functions. We will then see linear models applied through these kernel
functions, and in particular, for simplicity, the perceptron linear model (which, when used
with a kernel function, becomes the "kernel perceptron").
Let's see an example in 1D: we have the real line on which we have the points we want to
classify. We can easily see that if we have the set of points {-3: positively labelled, 2:
negatively labelled, 5: positively labelled}, there is no linear classifier that can correctly
classify this data set.
We will always include in the new feature vector the original vector itself, so as to retain
the power that was available prior to the feature transformation, but we will also add
additional features to it. Note that, differently from statistics, in machine learning we
know nothing (assume nothing) about the distribution of our data, so removing the original
data to keep only the newly added features would risk removing information that is not
captured by the transformation.
In this case we can add for example $x^2$, so the mapping is
$\phi(x) = [x, x^2]^T$. As a result, the parameter vector of the classifier also becomes
two-dimensional.
Our dataset becomes {(-3,9)[+], (2,4)[-], (5,25)[+]}, which can be easily classified by a linear
classifier in 2D (i.e. a line).
Note that the linear classifier we found in the new feature space becomes, back in the original
space, a non-linear classifier, given by $\theta_1 x + \theta_2 x^2 + \theta_0 = 0$.
Once we have the new feature vector we can do non-linear classification or regression
on the original data by doing linear classification or regression in the new feature space:
Classification: $h(x; \theta, \theta_0) = \mathrm{sign}\big(\theta \cdot \phi(x) + \theta_0\big)$
Regression: $f(x; \theta, \theta_0) = \theta \cdot \phi(x) + \theta_0$
The more features we add (e.g. the more polynomial degrees we include), the better we fit the
data. The key question now is: when is it time to stop adding features? We can use the
validation set to test which polynomial form, trained on the training set, performs best on the
validation set.
At the extreme, you hold out each of the training examples in turn, in a procedure called
leave-one-out cross validation. You take a single training sample, remove it from
the training set, retrain the method, and then test how well you predict that
particular held-out example; you do that for each training example in turn, and then you
average the results.
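A minimal sketch of leave-one-out cross validation in Python (the `fit` and `predict` callables are illustrative placeholders for whatever training and prediction routine is being validated):

```python
import numpy as np

def leave_one_out_error(X, y, fit, predict):
    """Average squared prediction error over n single-example holdouts.
    fit(X, y) returns a trained model; predict(model, x) returns a scalar."""
    n = X.shape[0]
    errors = []
    for i in range(n):
        mask = np.arange(n) != i       # drop the i-th example
        model = fit(X[mask], y[mask])  # retrain on the remaining n-1 examples
        errors.append((predict(model, X[i]) - y[i]) ** 2)
    return np.mean(errors)
```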
While very powerful, this explicit mapping into higher-dimensional feature vectors has a clear
cost: the number of dimensions can quickly become very high when our original data is
already multidimensional. For example, for $x \in \mathbb{R}^2$ a quadratic feature map is
$\phi(x) = [x_1, x_2, x_1^2, \sqrt{2}\,x_1 x_2, x_2^2]^T$ (the meaning of the scalar associated
with the cross term will be discussed later).
So our feature vector becomes very high-dimensional very quickly, even if we started from
a moderately dimensional vector.
We would therefore want a more efficient way of doing this: operating with high
dimensional feature vectors without explicitly having to construct them. And that is what
kernel methods provide us.
6.4. Motivation for Kernels: Computational Efficiency
The idea is that you can take inner products between high dimensional feature vectors and
evaluate that inner product very cheaply. We can then turn our algorithms into ones
operating only in terms of these inner products.
We define the kernel function of two feature vectors $x$ and $x'$ (two different data points),
relative to a given transformation $\phi$, as the dot product of the transformed feature vectors
of the two points:

$$K(x, x') = \phi(x) \cdot \phi(x')$$

We can hence think of the kernel function as a kind of similarity measure: how similar the
example $x$ is to the example $x'$. Note also that, the dot product being symmetric and
non-negative on identical arguments, kernel functions are in turn symmetric and positive
semi-definite.
For example, let's take $x$ and $x'$ to be two-dimensional feature vectors and the feature
transformation defined as $\phi(x) = [x_1, x_2, x_1^2, \sqrt{2}\,x_1 x_2, x_2^2]^T$ (so that
$\phi(x) \in \mathbb{R}^5$).
This particular transformation allows us to compute the kernel function very cheaply, staying
in the original few dimensions:

$$K(x, x') = \phi(x) \cdot \phi(x') = x \cdot x' + (x \cdot x')^2$$

Note that even if the transformed feature vectors have 5 dimensions, the kernel function
returns a scalar. In general, for this kind of polynomial feature transformation $\phi$, the kernel
function evaluates as $K(x, x') = x \cdot x' + (x \cdot x')^2 + \dots + (x \cdot x')^p$, where $p$ is the order
of the polynomial transformation $\phi$.
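A quick numerical sanity check of this identity (the feature map follows the reconstruction above; the test vectors are arbitrary):

```python
import numpy as np

def phi(x):
    """Quadratic feature map for 2-d inputs (5 dimensions)."""
    return np.array([x[0], x[1], x[0]**2, np.sqrt(2)*x[0]*x[1], x[1]**2])

x, xp = np.array([1.0, 2.0]), np.array([3.0, -1.0])
lhs = phi(x) @ phi(xp)        # explicit dot product in R^5
rhs = x @ xp + (x @ xp) ** 2  # cheap kernel evaluation in R^2
print(np.isclose(lhs, rhs))   # True
```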
However, it is only for some $\phi$ that the evaluation of the kernel function becomes so
nice! As long as we can prove that a particular kernel function can be expressed as the dot
product of two transformed feature vectors (for those interested, see Mercer's theorem as
stated in these notes), the kernel function is valid and we don't actually need to construct
the transformed feature vectors (the output of $\phi$).
Now our task will be to turn a linear method that previously operated on $\phi(x)$, like
$h(x) = \mathrm{sign}\big(\theta \cdot \phi(x)\big)$, into a classifier that depends only on those
inner products, i.e. that operates in terms of kernels.
We'll do that in the context of the kernel perceptron, just for simplicity, but it applies to
any linear method that we've already learned.
```python
import numpy as np

theta = np.zeros(d)                          # initialisation (d = len(phi(x)))
for t in range(T):                           # T passes over the data
    for i in range(n):
        if y[i] * (theta @ phi(x[i])) <= 0:  # mistake: prediction disagrees with label
            theta = theta + y[i] * phi(x[i]) # update theta on a mistake
```
What is the final value of the parameter $\theta$ resulting from such updates? We can write it as

$$\theta = \sum_{j=1}^{n} \alpha_j\, y^{(j)}\, \phi\big(x^{(j)}\big)$$

where $\alpha$ is the vector of the number of mistakes (and hence updates) incurred for each data
pair (so $\alpha_j$ is the (scalar) number of errors that occurred on the j-th data pair).
Note that we can interpret $\alpha_j$ in terms of the relative importance of the j-th training
example to the final predictor. Because we are doing perceptron, the importance is just in
terms of the number of mistakes that we make on that particular example.
When we want to make a prediction for a new point $x$ using the resulting parameter
value $\theta$ (that is, the "optimal" parameter the perceptron algorithm can give us), we take
an inner product with it:

$$\theta \cdot \phi(x) = \sum_{j=1}^{n} \alpha_j\, y^{(j)}\, \phi\big(x^{(j)}\big) \cdot \phi(x) = \sum_{j=1}^{n} \alpha_j\, y^{(j)}\, K\big(x^{(j)}, x\big)$$

But this means we can now express successes or errors in terms of the vector $\alpha$ and a valid
kernel function (typically something cheap to compute)!
Once we have run the algorithm and found the optimal $\alpha$, we may immediately retrieve
the optimal $\theta$ by the above equation, even if at this point we really do not need $\theta$ (or
sometimes cannot compute it, i.e. when $\phi(x)$ has infinite dimensions) to make predictions.
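Putting the pieces together, a minimal kernel perceptron sketch in Python (the function names and the Gram-matrix precomputation are illustrative choices, not from the notes):

```python
import numpy as np

def kernel_perceptron(X, y, K, T=10):
    """Learn the mistake counts alpha without ever constructing phi(x).
    X: (n, d) data, y: (n,) labels in {-1, +1}, K: kernel on two d-vectors."""
    n = X.shape[0]
    alpha = np.zeros(n)
    # Precompute the Gram matrix G[j, i] = K(x_j, x_i)
    G = np.array([[K(X[j], X[i]) for i in range(n)] for j in range(n)])
    for _ in range(T):
        for i in range(n):
            # theta . phi(x_i) expressed purely through the kernel
            if y[i] * np.sum(alpha * y * G[:, i]) <= 0:
                alpha[i] += 1  # one more mistake on example i
    return alpha

def kernel_predict(X, y, alpha, K, x_new):
    """Sign of sum_j alpha_j y_j K(x_j, x_new)."""
    return np.sign(sum(a * yj * K(xj, x_new) for a, yj, xj in zip(alpha, y, X)))
```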
Valid kernels can be composed into more complex valid kernels according to a few rules:
(1) the constant function $K(x, x') = 1$ is a valid kernel; (2) for any function
$f: \mathbb{R}^d \to \mathbb{R}$ and any valid kernel $K$, $f(x)\,K(x, x')\,f(x')$ is a valid
kernel; (3) the sum of two valid kernels is a valid kernel; (4) the product of two valid
kernels is a valid kernel.
Armed with these rules we can build up pretty complex kernels starting from simpler ones.
For example, let's start with the identity function as $\phi$, i.e. $\phi(x) = x$. Such a feature
function results in the kernel $K(x, x') = x \cdot x'$ (this is known as the linear kernel).
We can now add to it a squared term to form a new kernel that, by virtue of rules (3) and
(4) above, is still a valid kernel:

$$K(x, x') = x \cdot x' + (x \cdot x')^2$$
A particularly important kernel is the radial basis kernel:

$$K(x, x') = e^{-\frac{1}{2}\|x - x'\|^2}$$

It can be proved that such a kernel is indeed a valid kernel, and its corresponding feature
representation $\phi$ involves polynomial features up to infinite order.
Does the radial basis kernel look like a Gaussian (without the normalisation term)? Well,
that is because it indeed is one: as a function of $x$, it has exactly the shape of a Gaussian
density centered at $x'$, up to the missing normalisation constant.
The above picture shows the contour lines of the radial basis kernel when we keep $x'$ fixed
(in 2 dimensions) and we let $x$ move away from it: the value of the kernel then decreases,
in a shape that in 3D would resemble the classical bell shape of the Gaussian curve. We
could even parametrise the radial basis kernel by replacing the fixed $\frac{1}{2}$ term with a
parameter $\gamma$ that determines the width of the bell-shaped curve, $K(x, x') = e^{-\gamma \|x - x'\|^2}$
(the larger the value of $\gamma$, the narrower the bell; small values of $\gamma$ yield wide bells).
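This parametrised kernel plugs directly into the kernel perceptron sketch above; a minimal version (the name `gamma` for the width parameter is an assumption):

```python
import numpy as np

def rbf_kernel(x, xp, gamma=0.5):
    """Radial basis kernel exp(-gamma * ||x - xp||^2).
    gamma = 1/2 recovers the unparametrised version in the notes;
    larger gamma means a narrower bell."""
    return np.exp(-gamma * np.sum((np.asarray(x) - np.asarray(xp)) ** 2))
```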
Because the feature representation has infinite dimensions, the radial basis kernel has infinite
expressive power and can correctly classify any training set.
The linear decision boundary in the infinite-dimensional space is given by the set
$\big\{x : \sum_{j} \alpha_j\, y^{(j)}\, K(x^{(j)}, x) = 0\big\}$ and corresponds to a (possibly) non-linear
boundary in the original feature vector space.
The more difficult the task is, the more iterations the kernel perceptron (with the
radial basis kernel) will need before finding the separating solution, but it always will in a
finite number of steps. This is in contrast with the "normal" perceptron algorithm, which,
when the set is not separable, would continue to run forever, changing its parameters until
it is stopped at some arbitrary point.
Decision trees perform classification by operating sequentially on the various dimensions:
first making a separation on one dimension and then, in a subsequent step, on another
dimension, and so on. And you can "learn" these trees incrementally.
There is a way to make these decision trees more robust, called random forest classifiers,
that adds two types of randomness: the first is in randomly choosing the dimension on
which to operate the cut, the second is in randomly selecting the examples on which to
operate from the data set (with replacement); we then just average the predictions obtained
from these trees (as sketched below).
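As a hedged illustration (using scikit-learn, which these notes do not otherwise rely on; the toy dataset is invented for the example), a random forest combines exactly these two sources of randomness:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))             # toy 2-d dataset
y = (X[:, 0] ** 2 > X[:, 1]).astype(int)  # non-linearly separable labels

# 100 trees, each grown on a bootstrap resample of the data (second source of
# randomness), considering a random subset of features at each split (first
# source); predictions average the votes of the individual trees.
clf = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
clf.fit(X, y)
print(clf.predict(X[:5]))
```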
Summary
We can get non-linear classification (or regression) methods by simply mapping our data
into new feature vectors that include non-linear components, and applying a linear
method to the resulting vectors;
These feature vectors can however be high dimensional;
We can turn linear methods into kernel methods by casting the computation in terms
of inner products;
A kernel function is advantageous when the inner products are faster to evaluate than
using the explicit feature vectors (e.g. when the vectors would be infinite dimensional!);
We saw the radial basis kernel, which is particularly powerful because it is both (a)
cheap to evaluate and (b) endowed with a corresponding infinite dimensional feature vector.
7.1. Objectives
At the end of this lecture, you will be able to
7.2. Introduction
This lesson deals with recommender systems, where the algorithm tries to guess preferences
based on choices the user has already made (like films to watch or products to buy).
We’ll see:
Problem definition
We keep the recommendation of movies as the running example across the lecture: the prior
choices of the users are collected in an $n \times m$ matrix $Y$ of scores, with one row per
user and one column per movie.
The goal is to base the prediction on the prior choices of the users, considering that this
matrix can be very sparse (e.g. out of 18000 films, each individual ranked very few of
them!), i.e. we want to fill in the "empty spaces" of the matrix.
Why not treat this as a regression problem based on movie features? There are two problems:
1. Deciding which features to use, or extracting them from data, can be hard or infeasible
(e.g. where can I get the info on whether a film has a happy or a sad ending?), or the
feature vector can become very, very large (think of generic "products" for Amazon:
they could be anything).
2. Often we have little data about a single user's preferences, while making a
recommendation based on the user's own previous choices alone would require a lot of data.
The "trick" is then to "borrow" preferences from the other users, measuring how close a
single user is to each of the other ones in our dataset.
In the K-nearest-neighbours approach, we look at the $K$ closest users that did score the
item we are interested in, look at their scores for it, and average them:

$$\hat{Y}_{ai} = \frac{\sum_{b \in \mathrm{KNN}(a,i)} Y_{bi}}{K}$$

where $\mathrm{KNN}(a,i)$ is the set of $K$ users closest to user $a$ that have a score for item $i$.
Now, the question is of course: how do we define this similarity? We can use any method to
define similarity between vectors, like the cosine similarity
($\mathrm{sim}(x_a, x_b) = \frac{x_a \cdot x_b}{\|x_a\|\,\|x_b\|}$) or the Euclidean
distance ($\|x_a - x_b\|$).
We can make the algorithm a bit more sophisticated by weighting the neighbours' scores by
the level of similarity, rather than just taking their unweighted average:

$$\hat{Y}_{ai} = \frac{\sum_{b \in \mathrm{KNN}(a,i)} \mathrm{sim}(a, b)\, Y_{bi}}{\sum_{b \in \mathrm{KNN}(a,i)} |\mathrm{sim}(a, b)|}$$
Many improvements have been added to this kind of algorithm, like adjusting for the
different "average" score that each user gives to the items (i.e. comparing the deviations
from each user's average rather than the raw scores themselves).
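A minimal sketch of the weighted neighbour average in Python (cosine similarity computed over commonly rated items; marking missing entries with `np.nan` and the function names are illustrative choices):

```python
import numpy as np

def knn_score(Y, a, i, K=3):
    """Predict user a's score for item i as the similarity-weighted average
    of the K most similar users who rated item i.
    Y: (n_users, n_items) array with np.nan marking missing scores."""
    def cosine(u, v):
        both = ~np.isnan(u) & ~np.isnan(v)  # items rated by both users
        if not both.any():
            return 0.0
        return (u[both] @ v[both]) / (
            np.linalg.norm(u[both]) * np.linalg.norm(v[both]) + 1e-12)

    raters = [b for b in range(Y.shape[0]) if b != a and not np.isnan(Y[b, i])]
    raters.sort(key=lambda b: abs(cosine(Y[a], Y[b])), reverse=True)
    knn = raters[:K]  # the K closest users with a score for item i
    sims = np.array([cosine(Y[a], Y[b]) for b in knn])
    scores = np.array([Y[b, i] for b in knn])
    return (sims @ scores) / (np.abs(sims).sum() + 1e-12)
```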
Still, these methods are very far from today's methods. The problem with KNN is that it doesn't
enable us to detect the hidden structure that is there in the data, namely that users may be
similar to some pool of other users along one dimension, but similar to some other set of users
along a different dimension. For example, someone loving machine learning books shares
some "similarity" with other readers of machine learning books on some hidden
characteristics (e.g. liking equation-rich books or more discursive ones), while for plant books
the similarity with other plant-reading users would be based on completely different
hidden features (e.g. loving photos, or liking nice tabular descriptions of plants).
Conversely, in collaborative filtering the algorithm is able to detect these hidden
groupings, both in terms of products and in terms of users. So we don't
have to explicitly engineer a very sophisticated similarity measure; the algorithm is
able to pick up these very complex dependencies that for us, as humans, would
definitely not be tractable to come up with.
For now, we treat each individual score independently... and this (as we will see) will be the
reason why this first method will not work.
So, we have our (sparse) $n \times m$ matrix $Y$ and we want to find a dense matrix $X$ of the
same dimensions that replicates as well as possible the observed entries of $Y$ where these are
available, and fills in the missing ones elsewhere.
Let's first define $D$ as the set of index pairs for which a score in $Y$ is given:
$D = \{(a, i) : Y_{ai} \text{ is observed}\}$.
The function $J$ then takes any candidate matrix $X$ and measures the distance from the
observed points in the set $D$, plus a regularisation term (we keep the individual scores at zero
unless we have a strong belief to move them from that state):

$$J(X) = \sum_{(a,i) \in D} \frac{(Y_{ai} - X_{ai})^2}{2} + \frac{\lambda}{2} \sum_{a=1}^{n} \sum_{i=1}^{m} X_{ai}^2$$

Minimising $J$ entry by entry (setting each derivative $\partial J / \partial X_{ai}$ to zero) gives:
For $(a,i) \in D$: $X_{ai} = \frac{Y_{ai}}{1 + \lambda}$
For $(a,i) \notin D$: $X_{ai} = 0$
Clearly this result doesn't make sense: for the data we already know we obtain a poor
estimate (increasingly worse as we increase lambda), and for the unknown scores we are left with
zeros.
The idea is then to constrain the matrix $X$ to have a lower rank, as the rank captures how much
dependence there is between the entries of the matrix.
At one extreme, constraining the matrix to be rank 1 means that we can factorise
it as the outer product of two single vectors, one ($u$, of length $n$) defining a sort of general
sentiment about the items for each user, and the other one ($v$, of length $m$) representing the
average sentiment for a given item, i.e. $X = u v^T$, so that $X_{ai} = u_a v_i$.
But representing users and items with just a single number takes us back to the KNN
problem of not being able to distinguish the possible multiple groups hidden within each user
or each item.
We could then decide to describe the users and/or the items with, respectively, $n \times 2$ and
$m \times 2$ matrices $U$ and $V$, and constrain our matrix to be the product of these two
matrices (hence with rank 2 in this case): $X = U V^T$.
The exact number of vectors to use in the user/item factorisation matrices (i.e. the
rank of $X$) is then a hyperparameter that can be selected using the validation set.
Still, for simplicity, in this lesson we will consider the simplest case of constraining the
matrix to be factorisable as a pair of single vectors (rank 1).
We minimise the resulting objective by alternating: fixing $v$ and minimising over $u$, then
fixing $u$ and minimising over $v$, and so on. Note also that when we minimise over the
individual components of one of the two vectors, the derivatives with respect to the individual
vector elements are independent of each other, so each first-order condition can be expressed
in terms of a single variable, as shown below.
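As a reconstruction of the rank-1 objective (assuming the standard form for this setting, with the notation introduced above) and of one such single-variable first-order condition:

$$J(u, v) = \sum_{(a,i) \in D} \frac{(Y_{ai} - u_a v_i)^2}{2} + \frac{\lambda}{2} \sum_{a=1}^{n} u_a^2 + \frac{\lambda}{2} \sum_{i=1}^{m} v_i^2$$

$$\frac{\partial J}{\partial u_a} = -\sum_{i : (a,i) \in D} (Y_{ai} - u_a v_i)\, v_i + \lambda u_a = 0 \;\Longrightarrow\; u_a = \frac{\sum_{i : (a,i) \in D} Y_{ai}\, v_i}{\lambda + \sum_{i : (a,i) \in D} v_i^2}$$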
Numerical example
Let's consider a given value of $\lambda$ and the following score dataset:
$J$ becomes:
Homework 3