w1d Linear Regression Regularization
The fitted function $f(\mathbf{x})$ doesn’t usually match the training data exactly. Each training item
has a residual, $y^{(n)} - f(\mathbf{x}^{(n)})$, which is normally non-zero. Why don’t we get perfect fits?
• Data is usually inherently noisy or stochastic, in which case it’s impossible to exactly
predict y from x. For example, if a builder mixes several batches of concrete with the
same quantities specified by x, we wouldn’t expect their observed strengths y to be
exactly the same.
• Even if the outputs are noiseless, N > D data-points are unlikely to lie exactly on any
function represented by a linear combination of our D basis functions.
If we don’t include enough basis functions, we will underfit our data. For example, if some
points lie exactly along a cubic curve:
import numpy as np

N = 100; D = 1
X = np.random.rand(N, D) - 0.5   # inputs spread over (-0.5, 0.5)
yy = X[:, 0]**3                  # noiseless cubic targets, shape (N,)
We would not be able to fit this data accurately if we only put linear and quadratic basis
functions in our augmented design matrix Φ.
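As a minimal sketch of this (reusing X and yy from the snippet above), we can build Φ with only constant, linear and quadratic columns and check the residuals of the least-squares fit:
# Sketch: constant, linear and quadratic columns only; they cannot represent the cubic.
Phi = np.concatenate([np.ones((N, 1)), X, X**2], axis=1)
w = np.linalg.lstsq(Phi, yy, rcond=None)[0]
print(np.max(np.abs(yy - Phi @ w)))   # residuals are clearly non-zero: an underfit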
You could fit the cubic data above fairly accurately with a few RBF functions. The fit
wouldn’t extrapolate well outside the x ∈ (−0.5, 0.5) range of observations, but you can get
an accurate fit close to where there is data. To avoid underfitting, we need a model with
enough representational power to closely approximate the underlying function.
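A sketch of such an RBF fit (the five centres and the bandwidth of 0.25 are illustrative choices):
# Sketch: a handful of Gaussian RBFs; centres and bandwidth are illustrative.
cc = np.linspace(-0.5, 0.5, 5)                     # RBF centres
Phi_rbf = np.exp(-(X - cc[None, :])**2 / 0.25**2)  # N x 5 design matrix
w_rbf = np.linalg.lstsq(Phi_rbf, yy, rcond=None)[0]
print(np.max(np.abs(yy - Phi_rbf @ w_rbf)))        # small where we have data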
When the number of training points N is small, it’s easy to fit the observations with low
square error. In fact, usually if we have N or more basis functions, such as N RBFs with
different centres, the residuals will all be zero!¹ However, we should not trust this fit. It’s
hard for us as intelligent humans to guess what an arbitrary function is doing in between
only a few observations, so we shouldn’t believe the result of fitting an arbitrary model
either. Moreover, if the observations are noisy, it seems unlikely that a good fit should match
the observed data exactly anyway.
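To see the zero-residual claim concretely, here is a small sketch with one RBF centred on each of N = 5 training points (the bandwidth and noise level are illustrative choices):
# Sketch: one RBF per training point gives a square, usually invertible, design matrix.
N_small = 5
X_small = np.random.rand(N_small, 1) - 0.5
yy_small = X_small[:, 0]**3 + 0.1*np.random.randn(N_small)      # noisy targets
Phi_sq = np.exp(-(X_small - X_small[:, 0][None, :])**2 / 0.25**2)
w_sq = np.linalg.solve(Phi_sq, yy_small)
print(np.max(np.abs(yy_small - Phi_sq @ w_sq)))                 # ~0 (machine precision)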
As advocated by Acton’s rant in note w1a, one possible approach to modelling is as follows.
Start with a simple model with only a few parameters. This model may underfit, that is, not
represent all of the structure evident in the training data. We could then consider a series of
more complicated models while we feel that fitting these models can still be justified by the
amount of data we have.
However, limiting the number of parameters in a model isn’t always easy or the right
approach. If our inputs have many features, even simple linear regression (without additional
basis functions) has many parameters. An example is predicting some characteristic of an
organism (a phenotype) from DNA data, which is often in the form of $>10^5$ features².
We could consider removing features from high-dimensional inputs to make a smaller
model, but filtering is not always the correct approach either. If some features are noisy
measurements of the same underlying property, it is better to average all of them rather
than to select one of them. However, we may not know in advance which groups of features
should be averaged, and which selected.
Another approach to modelling is to use large models (models with many free parameters),
but to discourage unreasonable fits that match our noisy training data too closely.
1. The basis functions need to produce N linearly-independent columns in the feature or design matrix Φ. Most
basis functions do have this property, but the technical details are too involved to get into here. There’s a reference
in the previous note.
2. Such as Single-Nucleotide Polymorphisms (SNPs), pronounced “snips”.
Example 1: Many noisy measurements of one quantity. In this example, the features and output will both be noisy measurements of the underlying quantity:
$$x_d^{(n)} \sim \mathcal{N}\big(\mu^{(n)},\, 0.01^2\big), \qquad d = 1\ldots D, \tag{2}$$
$$y^{(n)} \sim \mathcal{N}\big(\mu^{(n)},\, 0.1^2\big). \tag{3}$$
The notation $\mathcal{N}(\mu, \sigma^2)$ means the values are drawn from a Gaussian or Normal distribution
centred on $\mu$, with variance $\sigma^2$.
In this situation, averaging the $x_d$ measurements would be a reasonable estimate of the
underlying feature $\mu$, and hence the output $y$. Thus a regression model with $w_d = 1/D$ would
make reasonable predictions of $y$ from $\mathbf{x}$. Do you recover something like this model if you fit
linear regression? Try fitting randomly-generated datasets with various N and D. You can
generate data from the above model as follows:
mu = np.random.rand(N)                                          # underlying quantities mu^(n)
X = np.tile(mu[:,None], (1, D)) + 0.01*np.random.randn(N, D)    # D noisy copies of each mu^(n)
yy = 0.1*np.random.randn(N) + mu                                # noisier observations of mu^(n)
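A minimal sketch of that experiment (the sizes N = 60 and D = 50 are illustrative choices):
# Sketch: D large and N not much larger; compare the fitted weights to 1/D.
N, D = 60, 50
mu = np.random.rand(N)
X = np.tile(mu[:, None], (1, D)) + 0.01*np.random.randn(N, D)
yy = 0.1*np.random.randn(N) + mu
w_fit = np.linalg.lstsq(X, yy, rcond=None)[0]
print(w_fit.min(), w_fit.max())   # usually far from the "ideal" 1/D = 0.02
print(w_fit.mean())               # yet the average weight is often close to 1/D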
By making D large and N not much larger than D, the weights that give the best least-squares
fit to the training data are much larger in magnitude than $w_d = 1/D$. By using some weights much
smaller and others much larger than the ideal values, the model can exploit small differences
between input features to fit the noise in the observations. If we tried to interpret the value
of a weight as meaning something about the corresponding feature, we would embarrass
ourselves. However, the average of the weights is often close to $1/D$, and predictions on more
data generated in the same way as above might be ok. The model will generalize badly,
making wild predictions, if we test on inputs generated from $x_d \sim \mathcal{N}(\mu, 1)$.
Example 2: Explaining noise with many basis functions. Consider data drawn as follows:
$$x_d^{(n)} \sim \text{Uniform}[0, 1], \qquad d = 1\ldots D, \tag{4}$$
$$y^{(n)} \sim \mathcal{N}(0, 1). \tag{5}$$
The outputs have no relationship to the inputs. The predictor with the smallest average
square error on future test cases is $f(\mathbf{x}) = 0$, and its average square error will be one. We now
consider what happens if we fit a model with many basis functions. If we use a high-degree
polynomial, or many RBF basis functions, we can get a square training error lower than one.
However, the error on new data would be larger. The fits usually have extreme weights with
large magnitudes (for example $\sim 10^3$). There is a danger that for some inputs the predictions
could be extreme.
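As a sketch of this effect with a high-degree polynomial in one input dimension (the degree and data sizes are illustrative choices):
# Sketch: fit a degree-15 polynomial to outputs that are pure noise.
N, degree = 30, 15
X = np.random.rand(N, 1)
yy = np.random.randn(N)                         # outputs unrelated to inputs
Phi = X ** np.arange(degree + 1)                # N x (degree+1) monomial features
w = np.linalg.lstsq(Phi, yy, rcond=None)[0]
X_test = np.random.rand(1000, 1)
y_test = np.random.randn(1000)
Phi_test = X_test ** np.arange(degree + 1)
print(np.mean((yy - Phi @ w)**2))               # training error below one
print(np.mean((y_test - Phi_test @ w)**2))      # test error usually above one
print(np.max(np.abs(w)))                        # weights can be extreme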
It’s possible to represent a large range of interesting functions using weights with small
magnitude. Yet least-squares fits often combine extremely large positive and negative weights,
giving fits that pass unreasonably close to noisy observations. We would like to avoid fitting
these extreme weights.
A standard approach is to add a penalty on the size of the weights to the least-squares cost,
giving the L2-regularized cost function (also known as ridge regression):
$$E_\lambda(\mathbf{w}) = (\mathbf{y} - \Phi\mathbf{w})^\top(\mathbf{y} - \Phi\mathbf{w}) + \lambda\, \mathbf{w}^\top\mathbf{w}.$$
For λ = 0 we only care about fitting the data, but for larger values of λ we trade off the
accuracy of the fit so that we can make the weights smaller in magnitude.
We can fit the regularized cost function with the same linear least-squares fitting routine as
before. This time, instead of adding new features, we add new data items. If our original
matrix of input features Φ is N × K, for N data items and K basis functions, we add K rows
to both the vector of labels and matrix of input features:
$$\tilde{\mathbf{y}} = \begin{bmatrix} \mathbf{y} \\ \mathbf{0}_K \end{bmatrix}, \qquad \tilde{\Phi} = \begin{bmatrix} \Phi \\ \sqrt{\lambda}\,\mathbb{I}_K \end{bmatrix}. \tag{8}$$
Thus we can fit training data (ỹ, Φ̃) using least-squares code that knows nothing about
regularization, and fit the regularized cost function.
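A minimal sketch of this trick in NumPy (the helper name fit_regularized is just for illustration):
# Sketch of the data-augmentation trick; the helper name is illustrative.
def fit_regularized(Phi, yy, lam):
    """Minimize ||yy - Phi w||^2 + lam * ||w||^2 using plain least squares."""
    K = Phi.shape[1]
    Phi_tilde = np.vstack([Phi, np.sqrt(lam)*np.eye(K)])   # K extra rows
    yy_tilde = np.concatenate([yy, np.zeros(K)])
    return np.linalg.lstsq(Phi_tilde, yy_tilde, rcond=None)[0]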
Below we see a situation where using least-squares with a dozen RBF basis functions leads
to overfitting.
[Figure: training data on x ∈ (−0.5, 0.5), with the unregularized least-squares RBF fit (“least sq.”) and the regularized fit plotted over it.]
One could argue for changing the basis functions. However, as illustrated above, regularizing
the same linear regression model can give less extreme predictions, at the expense of giving
a fit further from the training points. The regularized fit depends strongly on λ. For λ = 0
we obtain the least squares fit.
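A sketch of this comparison with a dozen RBFs, reusing the fit_regularized helper sketched earlier (the data, centres, bandwidth and value of λ are all illustrative choices):
# Sketch: a dozen RBFs fit with and without regularization; settings are illustrative.
N = 15
X = np.random.rand(N, 1) - 0.5
yy = X[:, 0]**3 + 0.1*np.random.randn(N)
cc = np.linspace(-0.5, 0.5, 12)                      # a dozen RBF centres
Phi = np.exp(-(X - cc[None, :])**2 / 0.1**2)
w_lsq = fit_regularized(Phi, yy, 0.0)                # lambda = 0: ordinary least squares
w_reg = fit_regularized(Phi, yy, 0.1)                # regularized fit
print(np.max(np.abs(w_lsq)), np.max(np.abs(w_reg)))  # regularized weights are smaller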