
Linear regression, overfitting, and regularization
The fitted function f(x) doesn't usually match the training data exactly. Each training item
has a residual, (y^{(n)} − f(x^{(n)})), which is normally non-zero. Why don't we get perfect fits?

• Data is usually inherently noisy or stochastic, in which case it’s impossible to exactly
predict y from x. For example, if a builder mixes several batches of concrete with the
same quantities specified by x, we wouldn’t expect their observed strengths y to be
exactly the same.
• Even if the outputs are noiseless, N > D data-points are unlikely to lie exactly on any
function represented by a linear combination of our D basis functions.

If we don’t include enough basis functions, we will underfit our data. For example, if some
points lie exactly along a cubic curve:
import numpy as np

N = 100; D = 1
X = np.random.rand(N, D) - 0.5   # inputs spread over (-0.5, 0.5)
yy = X**3                        # exact cubic targets, no noise
We would not be able to fit this data accurately if we only put linear and quadratic basis
functions in our augmented design matrix Φ.
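As a quick numerical check (a sketch, reusing N, X and yy from above and using np.linalg.lstsq), a design matrix with only constant, linear and quadratic columns leaves a clearly non-zero training error, while adding a cubic column fits the data essentially exactly:

Phi2 = np.hstack([np.ones((N, 1)), X, X**2])   # bias, linear, quadratic features
Phi3 = np.hstack([Phi2, X**3])                 # ...plus a cubic feature
w2 = np.linalg.lstsq(Phi2, yy, rcond=None)[0]
w3 = np.linalg.lstsq(Phi3, yy, rcond=None)[0]
print(np.mean((yy - Phi2 @ w2)**2))   # non-zero residual: underfitting
print(np.mean((yy - Phi3 @ w3)**2))   # numerically zero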
You could fit the cubic data above fairly accurately with a few RBF functions. The fit
wouldn’t extrapolate well outside the x ∈ (−0.5, 0.5) range of observations, but you can get
an accurate fit close to where there is data. To avoid underfitting, we need a model with
enough representational power to closely approximate the underlying function.
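For instance, a rough sketch of such a fit (the RBF centres, bandwidth and test inputs below are illustrative choices, not specified in these notes):

def rbf_design(Xin, centres, h=0.2):
    # N x K matrix of Gaussian bumps exp(-(x - c)^2 / h^2)
    return np.exp(-(Xin - centres[None, :])**2 / h**2)

centres = np.linspace(-0.5, 0.5, 5)
Phi_rbf = rbf_design(X, centres)
w_rbf = np.linalg.lstsq(Phi_rbf, yy, rcond=None)[0]
X_in = np.linspace(-0.5, 0.5, 7)[:, None]      # inside the data range
X_out = np.array([[1.5]])                      # well outside the data range
print(np.hstack([X_in**3, rbf_design(X_in, centres) @ w_rbf]))   # roughly matches x**3
print(X_out**3, rbf_design(X_out, centres) @ w_rbf)              # poor extrapolation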
When the number of training points N is small, it’s easy to fit the observations with low
square error. In fact, usually if we have N or more basis functions, such as N RBFs with
different centres, the residuals will all be zero![1] However, we should not trust this fit. It's
hard for us as intelligent humans to guess what an arbitrary function is doing in between
only a few observations, so we shouldn’t believe the result of fitting an arbitrary model
either. Moreover, if the observations are noisy, it seems unlikely that a good fit should match
the observed data exactly anyway.
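A small demonstration of the zero-residual effect (a sketch; the RBF bandwidth and the choice of placing centres at the training inputs are illustrative):

X_tr = np.linspace(0, 1, 10)[:, None]            # 10 training inputs
y_tr = np.random.randn(10)                       # noisy targets
Phi_tr = np.exp(-(X_tr - X_tr.T)**2 / 0.1**2)    # 10 RBFs centred at the inputs
w_tr = np.linalg.lstsq(Phi_tr, y_tr, rcond=None)[0]
print(np.max(np.abs(y_tr - Phi_tr @ w_tr)))      # ~0: an exact but untrustworthy fit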
As advocated by Acton’s rant in note w1a, one possible approach to modelling is as follows.
Start with a simple model with only a few parameters. This model may underfit, that is, not
represent all of the structure evident in the training data. We could then consider a series of
more complicated models while we feel that fitting these models can still be justified by the
amount of data we have.
However, limiting the number of parameters in a model isn’t always easy or the right
approach. If our inputs have many features, even simple linear regression (without additional
basis functions) has many parameters. An example is predicting some characteristic of an
organism (a phenotype) from DNA data, which is often in the form of > 10^5 features.[2]
We could consider removing features from high-dimensional inputs to make a smaller
model, but filtering is not always the correct approach either. If some features are noisy
measurements of the same underlying property, it is better to average all of them rather
than to select one of them. However, we may not know in advance which groups of features
should be averaged, and which selected.
Another approach to modelling is to use large models (models with many free parameters),
but to discourage unreasonable fits that match our noisy training data too closely.

[1] The basis functions need to produce N linearly-independent columns in the feature or design
matrix Φ. Most basis functions do have this property, but the technical details are too involved
to get into here. There's a reference in the previous note.
[2] Such as Single-Nucleotide Polymorphisms (SNPs), pronounced "snips".



1 Examples of what overfitting can look like
Example 1: Fitting many features. We will consider a synthetic situation where each datapoint
relates to a different underlying quantity µ^{(n)}. We will sample each of these quantities from
a uniform distribution between zero and one, which we write as

µ^{(n)} ∼ Uniform(0, 1).        (1)

In our example, the features and output will both be noisy measurements of the underlying
quantity:
x_d^{(n)} ∼ N(µ^{(n)}, 0.01²),    d = 1, ..., D        (2)
y^{(n)} ∼ N(µ^{(n)}, 0.1²).                            (3)

The notation N(µ, σ²) means the values are drawn from a Gaussian or Normal distribution
centred on µ, with variance σ².
In this situation, averaging the x_d measurements would be a reasonable estimate of the
underlying quantity µ, and hence the output y. Thus a regression model with w_d = 1/D would
make reasonable predictions of y from x. Do you recover something like this model if you fit
linear regression? Try fitting randomly-generated datasets with various N and D. You can
generate data from the above model as follows:
# assumes numpy imported as np, and N and D set as desired
mu = np.random.rand(N)
X = np.tile(mu[:, None], (1, D)) + 0.01*np.random.randn(N, D)
yy = 0.1*np.random.randn(N) + mu
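One way to do the fit is with a standard least-squares routine (a sketch using np.linalg.lstsq), comparing the fitted weights with the "ideal" value w_d = 1/D:

w_fit = np.linalg.lstsq(X, yy, rcond=None)[0]
print(np.mean(w_fit), 1/D)         # the average weight is often close to 1/D
print(w_fit.min(), w_fit.max())    # ...but individual weights can be far from it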
By making D large and N not much larger than D, the weights that give the best least-
squares fit to the training data are much larger in magnitude than w_d = 1/D. By using
weights both much smaller and much larger than the ideal values, the model can use small
differences between input features to fit the noise in the observations. If we tried to interpret
the value of a weight as meaning something about the corresponding feature, we would
embarrass ourselves. However, the average of the weights is often close to 1/D, and predictions
on more data generated in the same way as above might be OK. The model will generalize badly
though — it will make wild predictions — if we test on inputs generated from x_d ∼ N(µ, 1).

Example 2: Explaining noise with many basis functions. Consider data drawn as follows:
x_d^{(n)} ∼ Uniform[0, 1],    d = 1, ..., D        (4)
y^{(n)} ∼ N(0, 1).                                 (5)

The outputs have no relationship to the inputs. The predictor with the smallest average
square error on future test cases is f(x) = 0; its average square error will be one. We now
consider what happens if we fit a model with many basis functions. If we use a high-degree
polynomial, or many RBF basis functions, we can get a lower square training error than one.
However, the error on new data would be larger. The fits usually have extreme weights with
large magnitudes (for example ∼ 10³). There is a danger that for some inputs, the predictions
could be extreme.
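A one-dimensional sketch of this experiment (the number of RBFs and the bandwidth are illustrative choices):

N, K = 20, 15
X = np.random.rand(N, 1)
yy = np.random.randn(N)                              # outputs unrelated to the inputs
centres = np.linspace(0, 1, K)
Phi = np.exp(-(X - centres[None, :])**2 / 0.1**2)    # many RBF basis functions
w = np.linalg.lstsq(Phi, yy, rcond=None)[0]
print(np.mean((yy - Phi @ w)**2))   # training error typically well below 1
print(np.max(np.abs(w)))            # weights often much larger than the O(1) targets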

It’s possible to represent a large range of interesting functions using weights with small
magnitude. Yet least-squares fits often combine extremely large positive and negative weights,
giving fits that pass unreasonably close to noisy observations. We
would like to avoid fitting these extreme weights.



2 Regularization
Penalizing extreme solutions can be achieved in various ways, and is called regularization.
One form of regularization is to penalize the sum of the square weights in our cost function.
This method has various names, including Tikhonov regularization, ridge regression, or L2
regularization. For K basis functions, the regularized cost function is then:
E_λ(w; y, Φ) = ∑_{n=1}^{N} [ y^{(n)} − f(x^{(n)}; w) ]² + λ ∑_{k=1}^{K} w_k²        (6)
             = (y − Φw)^⊤ (y − Φw) + λ w^⊤ w.                                       (7)

For λ = 0 we only care about fitting the data, but for larger values of λ we trade off the
accuracy of the fit so that we can make the weights smaller in magnitude.
We can fit the regularized cost function with the same linear least-squares fitting routine as
before. This time, instead of adding new features, we add new data items. If our original
matrix of input features Φ is N × K, for N data items and K basis functions, we add K rows
to both the vector of labels and matrix of input features:

    ỹ = [  y  ],    Φ̃ = [    Φ    ],        (8)
        [ 0_K ]          [ √λ I_K ]

where 0_K is a vector of K zeros, and I_K is the K × K identity matrix. Then

E(w; ỹ, Φ̃) = (ỹ − Φ̃w)^⊤ (ỹ − Φ̃w)                                        (9)
            = (y − Φw)^⊤ (y − Φw) + λ w^⊤ w = E_λ(w; y, Φ).               (10)

Thus we can fit training data (ỹ, Φ̃) using least-squares code that knows nothing about
regularization, and fit the regularized cost function.
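A minimal sketch of this trick in code (assuming Phi is an existing N × K design matrix, yy a length-N vector of targets, and using np.linalg.lstsq as the least-squares routine that "knows nothing about regularization"):

def fit_ridge(Phi, yy, lam):
    # Append K extra "data points": rows sqrt(lam)*I_K with target value 0.
    K = Phi.shape[1]
    Phi_tilde = np.vstack([Phi, np.sqrt(lam)*np.eye(K)])
    yy_tilde = np.concatenate([yy, np.zeros(K)])
    # Plain least squares on the augmented data minimizes the regularized cost.
    return np.linalg.lstsq(Phi_tilde, yy_tilde, rcond=None)[0]

w_reg = fit_ridge(Phi, yy, lam=10.0)   # lam=10 is an arbitrary illustrative value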
Below we see a situation where using least-squares with a dozen RBF basis functions leads
to overfitting.

[Figure: y plotted against x over x ∈ (−0.5, 0.5), showing the training data alongside the
least-squares ("least sq.") fit and the regularized fit.]

One could argue for changing the basis functions. However, as illustrated above, regularizing
the same linear regression model can give less extreme predictions, at the expense of giving
a fit further from the training points. The regularized fit depends strongly on λ. For λ = 0
we obtain the least squares fit.
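With the fit_ridge sketch above, lam=0.0 reproduces the ordinary least-squares weights, while larger values shrink the weights, and hence the fit, towards zero:

w_ls = fit_ridge(Phi, yy, lam=0.0)        # matches plain least squares
w_shrunk = fit_ridge(Phi, yy, lam=100.0)  # much smaller weights; lam=100 is arbitrary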

3 Check your understanding


• [The website version of this note has a question here.]
• [The website version of this note has a question here.]



3.1 Optional questions
• Try to generate a figure like the one above, demonstrating the effect of regularization.
However, there will be more opportunities to implement regularized regression later.
• If we have K radial basis functions, give a simple upper bound on the largest function
value that could be obtained for a given weight vector w. Also give a simple upper
bound on the largest derivative (you could consider one-dimensional regression). From
such bounds we can see that limiting the size of the weights stops the function taking
on extreme values, or changing extremely quickly.

MLPR:w1d Iain Murray and Arno Onken, http://www.inf.ed.ac.uk/teaching/courses/mlpr/2020/
