Machine Learning Lecture 4: A Bayesian View of Regression
Iain Styles
22 October 2019
We assume that the data are generated by a model of the form

y = f(x, w) + e \qquad (1)
where e ∼ N(0, σ²) is zero-mean Gaussian noise with variance σ², such
that σ is a measure of the uncertainty in the sampling. That is,
when the value of the dependent variable y is sampled for some
value of the independent variable x, it will be drawn from a normal
distribution with mean f(x, w) and variance σ². Under this model,
given N observations x = (x_1, ..., x_N) and y = (y_1, ..., y_N), we can
write the joint distribution of y as
p(y \mid x, w, \sigma^2) = \prod_{i=1}^{N} \mathcal{N}\!\left( y_i \mid f(x_i, w), \sigma^2 \right) \qquad (3)
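To make the noise model concrete, here is a small Python sketch that draws data according to (1) and evaluates the likelihood (3) for two candidate weight vectors. The cubic f, its weights w_true, the noise level, and the sample size are arbitrary choices for illustration, not part of the lecture's derivation.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Illustrative model: f(x, w) is a cubic polynomial with "true" weights w_true.
w_true = np.array([0.5, -1.0, 2.0, 0.3])

def f(x, w):
    # Evaluate the polynomial sum_j w_j x^j at the points x.
    return np.polynomial.polynomial.polyval(x, w)

sigma = 0.2                                   # noise standard deviation
x = rng.uniform(0.0, 1.0, size=20)
y = f(x, w_true) + rng.normal(0.0, sigma, size=x.shape)   # eq. (1): y = f(x, w) + e

def likelihood(y, x, w, sigma):
    # Eq. (3): product over data points of N(y_i | f(x_i, w), sigma^2).
    return np.prod(norm.pdf(y, loc=f(x, w), scale=sigma))

print(likelihood(y, x, w_true, sigma))             # relatively large for the true weights
print(likelihood(y, x, np.zeros_like(w_true), sigma))
# The second value is tiny (it may underflow to 0.0), which is one practical
# reason for working with the log-likelihood below.
```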
Taking the logarithm of the likelihood and substituting the explicit form of the Gaussian density gives

\ln p(y \mid x, w, \sigma^2) = \ln\!\left[ (2\pi\sigma^2)^{-N/2} \right] + \ln \prod_{i=1}^{N} \exp\!\left( -\frac{(y_i - f(x_i, w))^2}{2\sigma^2} \right) \qquad (5)
where we have used ln(ab) = ln a + ln b. We now use the generalisation of this, ln ∏_i a_i = ∑_i ln a_i, and the identity ln(a^b) = b ln a to obtain the following expression for the log-likelihood:
\ln p(y \mid x, w, \sigma^2) = -\frac{N}{2} \ln(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{N} \left( y_i - f(x_i, w) \right)^2 \qquad (6)
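Before unpacking the two terms of (6), a quick numerical sanity check that it agrees with summing per-point Gaussian log-densities. This continues the sketch above (reusing x, y, f, and sigma); the helper name log_likelihood is introduced here for illustration.

```python
def log_likelihood(y, x, w, sigma):
    # Eq. (6): -(N/2) ln(2*pi*sigma^2) - (1/(2*sigma^2)) * sum of squared residuals.
    n = len(y)
    sse = np.sum((y - f(x, w)) ** 2)
    return -0.5 * n * np.log(2.0 * np.pi * sigma**2) - sse / (2.0 * sigma**2)

# Summing Gaussian log-densities directly should give the same number.
direct = np.sum(norm.logpdf(y, loc=f(x, w_true), scale=sigma))
print(np.isclose(log_likelihood(y, x, w_true, sigma), direct))   # True
```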
This has two terms. The first term is maximised by minimising
the number of data points or the variance of the noise. This is
intuitively obvious: more data and/or more noise means less
certainty. The second term is exactly the familiar least-squares
error term (negated). Maximising the log-likelihood is therefore
equivalent to minimising the least-squares error: the most likely set
of data is the one with the lowest error.
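This equivalence can also be checked numerically. The sketch below continues the example above; the cubic basis, the optimiser, and the tolerance are arbitrary illustrative choices. It fits the weights once by ordinary least squares and once by numerically maximising the log-likelihood (6), and the two solutions coincide.

```python
from scipy.optimize import minimize

# Design matrix for the cubic basis: columns 1, x, x^2, x^3.
Phi = np.vander(x, N=4, increasing=True)

# Least-squares solution: minimise sum_i (y_i - f(x_i, w))^2.
w_ls, *_ = np.linalg.lstsq(Phi, y, rcond=None)

# Maximum-likelihood solution: maximise eq. (6), i.e. minimise its negative.
res = minimize(lambda w: -log_likelihood(y, x, w, sigma), np.zeros(4))
w_ml = res.x

print(np.allclose(w_ls, w_ml, atol=1e-3))   # True, up to optimiser tolerance
```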
We have written down an expression for the likelihood assum-
ing Gaussian noise on the data. We can now use this to perform
some rather more sophisticated types of regression. In particular, it
allows us to incorporate prior information about the problem using
Bayes Rule:
p(w \mid y, x, \sigma^2) \propto p(y \mid x, w, \sigma^2)\, p(w)

where p(w) is a prior distribution over the model weights. If the
prior is uniform, the maximum of the posterior is the same as be-
fore – it is the least-squares solution. This solution assumes that all
model weights – large or small – are equally likely.
Is this desirable? Sometimes, but not necessarily so. One char-
acteristic of overfitting is that the model weights of the high-order
terms can be very large. We have seen this previously in our earlier
examples, reproduced in Figure 1 and Table 1. Our previous studies
have focussed on removing these high-order terms from the basis
set, but could we control their contribution to the model fitting in a
different way?

Figure 1: Fitting y = sin(2πx) with a polynomial fit of degree M = 9 to data with added noise.
Table 1: Coefficients of a high-order polynomial fit to noisy data show characteristic large values of high-order coefficients.

M   w0      w1      w2      w3        w4        w5       w6       w7        w8        w9
9   -0.66   10.98   25.62   -117.80   -143.29   405.10   246.74   -561.32   -127.91   263.129

Let us consider another form of prior distribution for the model
weights. We assume that they are drawn from a normal distribu-
tion with zero mean, and, for convenience, variance σ² = 1/(2λ).
We ignore normalisation constants for simplicity as they will all
be absorbed into a single constant of proportionality. The distri-
bution is conditioned on λ and, assuming each of the components is
independent, the joint distribution can be written
p(w \mid \lambda) \propto \prod_{i=1}^{M} \exp(-\lambda w_i^2) \qquad (12)
                 \propto \exp\!\left( -\lambda \sum_i w_i^2 \right) \qquad (13)
                 \propto \exp(-\lambda\, w^T w) \qquad (14)
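A short numerical check of (12)–(14): the product of per-component factors equals the single factor exp(−λ wᵀw), and each factor is, up to normalisation, a zero-mean Gaussian with variance 1/(2λ). The value of λ and the test vector are arbitrary choices for illustration.

```python
import numpy as np
from scipy.stats import norm

lam = 0.5
w = np.array([0.3, -1.2, 0.7, 2.0])          # arbitrary test weights

# Eqs. (12)-(14): product of per-component factors equals exp(-lambda * w^T w).
print(np.isclose(np.prod(np.exp(-lam * w**2)), np.exp(-lam * (w @ w))))   # True

# Each factor exp(-lambda * w_i^2) is proportional to N(0, 1/(2*lambda)).
scale = np.sqrt(1.0 / (2.0 * lam))
print(np.allclose(norm.pdf(w, scale=scale),
                  np.exp(-lam * w**2) / np.sqrt(np.pi / lam)))            # True
```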
Combining this prior with the Gaussian likelihood through Bayes Rule, taking the negative logarithm of the posterior, and dropping terms that do not depend on w leads to an objective of the form

L = \sum_{i=1}^{N} \left( y_i - f(x_i, w) \right)^2 + \lambda\, w^T w. \qquad (16)
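For a model that is linear in the weights, f(x, w) = φ(x)ᵀw, minimising (16) has a closed form: setting the gradient to zero gives the normal equations (ΦᵀΦ + λI)w = Φᵀy, where Φ is the design matrix of basis functions. The sketch below implements this directly; the helper name ridge_fit, the degree-9 basis, and the value of λ are our illustrative choices, and it reuses the x and y generated in the earlier sketches.

```python
def ridge_fit(Phi, y, lam):
    # Minimise eq. (16) for a linear-in-the-weights model y ~ Phi @ w:
    # the gradient condition is (Phi^T Phi + lam * I) w = Phi^T y.
    m = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(m), Phi.T @ y)

# Degree-9 polynomial basis for the data generated earlier (illustrative choice).
Phi9 = np.vander(x, N=10, increasing=True)

w_unreg, *_ = np.linalg.lstsq(Phi9, y, rcond=None)   # lam = 0: ordinary least squares
w_ridge = ridge_fit(Phi9, y, lam=1e-3)

# The penalised weights are typically far smaller in magnitude.
print(np.abs(w_unreg).max(), np.abs(w_ridge).max())
```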
That is, a Gaussian prior with zero mean and variance σ² = 1/(2λ)
is equivalent to adding a “penalty” term to the least-squares error
function. This penalty is proportional to the square of the length of
the weight vector and so when we try to minimise L it will prefer-
entially select solutions with small values for its components. This
is consistent with the Bayesian prior, which is normally distributed
around zero. The most likely values of the weights are those near
to zero, and the least likely are those that are large. The parame-
ter λ controls the width of the Gaussian prior: large λ means low
variance and therefore a narrow distribution, and so the larger λ
is, the less likely the weights are to take large values. Because this
prior distribution results in the model coefficients being kept small,
it is known as a shrinkage method, and since the penalty term is
the squared L2 norm (i.e. the squared length of the weight vector),
this is often referred to as L2 regularisation, or sometimes as Tikhonov
regularisation (although the latter is a more general class of methods).
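The shrinkage behaviour can be seen by sweeping λ with the ridge_fit sketch above: as λ grows, the largest coefficient magnitude falls towards zero (the exact values depend on the randomly generated data).

```python
for lam in (1e-6, 1e-3, 1.0, 1e3):
    w_lam = ridge_fit(Phi9, y, lam)
    print(f"lambda = {lam:g}  ->  max |w_i| = {np.abs(w_lam).max():.3g}")
```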
L2 regularisation is very widely used in regression tasks. In the
next section of the module, we will study how to use it effectively.
Reading
Section 1.2.5 of Bishop, Pattern Recognition and Machine Learning.