
Lecture 4: A Bayesian View of Regression

Iain Styles
22 October 2019

Bayesian View of Regression

So far, we have adopted quite an informal approach to regression: we wrote down an error function (least squares) that made sense from an intuitive viewpoint, but we had no formal basis for claiming that the "least-squares fitting" method was a correct and valid way to approach the regression problem. Studying the problem from a Bayesian perspective will give us the formal rigour that we need in order to justify the choices we have made.
Our starting point will be to construct a model of the underlying
data-generating process. We assume that each data point is the
result of some process that has a deterministic component, and
some associated sampling uncertainty.

y = f(x, w) + e    (1)

where e ∼ N(0, σ²) is a normally distributed error term with variance σ², so that σ is a measure of the sampling uncertainty. That is, when the value of the dependent variable y is sampled for some value of the independent variable x, it is drawn from a normal distribution with mean f(x, w) and variance σ². Under this model, we can write the distribution of y as

p(y | x, w, σ²) = N(y | f(x, w), σ²)    (2)

that is, y is normally distributed with mean f(x, w) and variance σ².
Now consider that we have a dataset D = {(x_i, y_i)}_{i=1}^{N}, which we will write as (x, y). We assume that the dependent variables y_i are sampled independently from normal distributions with the same variance σ². The independence of the sampling means that the joint probability distribution over all data points can be written as the product of the distributions for each point:

p(y | x, w, σ²) = ∏_{i=1}^{N} N(y_i | f(x_i, w), σ²)    (3)

This is known as the likelihood of y: it is the probability density function of the dependent variables y conditioned on the set of parameters that describe the data-generating function (i.e. given some set of parameters, what is the probability of the measurements?).
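To make the data-generating model of Equation (1) and the likelihood of Equation (3) concrete, here is a minimal sketch in Python/NumPy (not part of the original notes). It assumes, purely for illustration, that f(x, w) is a polynomial with coefficient vector w, as in earlier lectures; all names and numbers are illustrative.

import numpy as np

rng = np.random.default_rng(0)

def f(x, w):
    # Deterministic component: a polynomial in x with coefficients w
    # (an illustrative choice; the derivation holds for any f(x, w)).
    return np.polyval(w[::-1], x)  # w[0] + w[1]*x + w[2]*x**2 + ...

# Data-generating process of Eq. (1): y = f(x, w) + e, with e ~ N(0, sigma^2)
w_true = np.array([0.0, 2.0, -1.0])
sigma = 0.3
x = np.linspace(-1.0, 1.0, 20)
y = f(x, w_true) + rng.normal(0.0, sigma, size=x.shape)

def likelihood(y, x, w, sigma):
    # Eq. (3): product of independent Gaussian densities N(y_i | f(x_i, w), sigma^2)
    resid = y - f(x, w)
    dens = np.exp(-resid**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
    return np.prod(dens)

print(likelihood(y, x, w_true, sigma))        # likelihood at the true weights
print(likelihood(y, x, w_true + 0.5, sigma))  # typically much smaller for perturbed weights

In practice one works with the log-likelihood, as derived next, since a product of many small densities quickly underflows.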
With the likelihood, we can now approach regression in a different way. Since the likelihood is a proper probability density function, we can ask "what parameters w maximise it?" In other words, what is the most likely set of measurements, and what are the parameters that give rise to the most likely measurements? This is known as maximum likelihood inference.
First, we substitute in the full form of the normal distribution

N(x | µ, σ²) = (2πσ²)^(−1/2) exp(−(x − µ)²/(2σ²))

to obtain

p(y | x, w, σ²) = (2πσ²)^(−N/2) ∏_{i=1}^{N} exp(−(y_i − f(x_i, w))²/(2σ²))    (4)
We now take the logarithm of this to get rid of the exponential terms. Since the logarithm is a monotonically increasing function (it introduces no maxima or minima of its own), the maximum of the log-likelihood will be at the same value of w as the maximum of the likelihood.

ln p(y | x, w, σ²) = ln (2πσ²)^(−N/2) + ln ( ∏_{i=1}^{N} exp(−(y_i − f(x_i, w))²/(2σ²)) )    (5)

where we have used ln(ab) = ln a + ln b. We now use the generalisation of this, ln ∏_i a_i = ∑_i ln a_i, and the identity ln a^b = b ln a, to obtain the following expression for the log-likelihood:

ln p(y | x, w, σ²) = −(N/2) ln(2πσ²) − (1/(2σ²)) ∑_{i=1}^{N} (y_i − f(x_i, w))²    (6)
This has two terms. The first term (which is negative) is maximised by minimising the number of data points or the variance of the measurement noise. This is intuitively obvious: more data and/or more noise means less certainty. The second term is exactly the familiar least-squares error term (negated). Maximising the log-likelihood is therefore equivalent to minimising the least-squares error: the most likely set of data is the one with the lowest error.
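As a quick numerical check of this equivalence (a sketch under assumptions, not part of the notes): for a model that is linear in w, minimising the negative log-likelihood of Equation (6) and solving the least-squares problem directly give the same weights. The polynomial basis, the use of scipy.optimize, and all values below are illustrative.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)

# Synthetic data from Eq. (1) with f(x, w) = w0 + w1*x + w2*x^2
x = np.linspace(-1.0, 1.0, 30)
w_true = np.array([0.5, -1.0, 2.0])
sigma = 0.2
Phi = np.vander(x, N=3, increasing=True)           # design matrix with columns 1, x, x^2
y = Phi @ w_true + rng.normal(0.0, sigma, size=x.shape)

def neg_log_likelihood(w):
    # Negative of Eq. (6), dropping the first term, which does not depend on w
    return np.sum((y - Phi @ w) ** 2) / (2 * sigma**2)

w_ml = minimize(neg_log_likelihood, x0=np.zeros(3)).x   # maximum likelihood estimate
w_ls = np.linalg.lstsq(Phi, y, rcond=None)[0]           # direct least-squares solution

print(np.allclose(w_ml, w_ls, atol=1e-4))               # True: the two coincide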
We have written down an expression for the likelihood assum-
ing Gaussian noise on the data. We can now use this to perform
some rather more sophisticated types of regression. In particular, it
allows us to incorporate prior information about the problem using
Bayes' rule:

p(a | b) = p(b | a) p(a) / p(b)    (7)


where p(a | b) is the posterior distribution of a given b, p(b | a) is the likelihood of b given a, and p(a) is the prior distribution of a.
Given the likelihood p(y | x, w, σ²), we can use Bayes' rule to compute the probability density function of the model weights:

p(w | x, y, σ²) = p(y | x, w, σ²) × p(w) / p(y)    (8)
That is, the probability density function of the model weights depends on the likelihood of the measurements conditioned on the weights, multiplied by the prior distribution of the weights, and then normalised by the distribution of the measurements. We will ignore the normalising factor p(y) for simplicity and consider

p(w | x, y, σ²) ∝ p(y | x, w, σ²) × p(w)    (9)
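In code, Equation (9) simply says that an unnormalised log-posterior is the sum of the log-likelihood and the log-prior. A brief sketch (the function names are illustrative, and the prior is left as a pluggable callable):

import numpy as np

def log_likelihood(w, x, y, sigma, f):
    # Gaussian log-likelihood of Eq. (6), up to its w-independent constant
    return -np.sum((y - f(x, w)) ** 2) / (2 * sigma**2)

def log_posterior(w, x, y, sigma, f, log_prior):
    # Eq. (9): log p(w | x, y, sigma^2) = log-likelihood(w) + log-prior(w) + const,
    # where the constant absorbs the ignored normalising factor p(y)
    return log_likelihood(w, x, y, sigma, f) + log_prior(w)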



The simplest case to consider is p(w) = c, a constant: a uniform distribution over the weights. In this case we have

p(w | x, y, σ²) ∝ p(y | x, w, σ²) × c    (10)
              ∝ p(y | x, w, σ²)    (11)

and the maximum of this posterior is the same as the maximum likelihood solution from before: it is the least-squares solution. This solution assumes that all model weights, large or small, are equally likely.
Is this desirable? Sometimes, but not necessarily so. One characteristic of overfitting is that the model weights of the high-order terms can be very large. We have seen this previously in our earlier examples, reproduced in Figure 1 and Table 1. Our previous studies have focussed on removing these high-order terms from the basis set, but could we control their contribution to the model fitting in a different way?

[Figure 1: Fitting y = sin(2πx) with a polynomial fit of degree M = 9 to data with added noise.]

M      w0     w1     w2       w3        w4      w5      w6       w7        w8      w9
9    -0.66  10.98  25.62  -117.80  -143.29  405.10  246.74  -561.32  -127.91  263.129

Table 1: Coefficients of a high-order polynomial fit to noisy data show the characteristic large values of the high-order coefficients.

Let us consider another form of prior distribution for the model weights. We assume that they are drawn from a normal distribution with zero mean and, for convenience, variance σ² = 1/(2λ). We ignore normalisation constants for simplicity as they will all be absorbed into a single constant of proportionality. The distribution is conditioned on λ and, assuming each of the components is independent, the joint distribution can be written

p(w | λ) ∝ ∏_{i=1}^{M} exp(−λ w_i²)    (12)
         ∝ exp(−λ ∑_i w_i²)    (13)
         ∝ exp(−λ wᵀw)    (14)
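As a small illustration (not from the notes) of what this prior expresses: weights drawn from a zero-mean Gaussian with variance σ² = 1/(2λ) shrink towards zero as λ grows. The values of λ and M below are arbitrary.

import numpy as np

rng = np.random.default_rng(2)
M = 10  # number of weights, e.g. for a degree-9 polynomial

for lam in (0.01, 1.0, 100.0):
    sigma_w = np.sqrt(1.0 / (2.0 * lam))   # prior standard deviation, sigma^2 = 1/(2*lambda)
    w = rng.normal(0.0, sigma_w, size=M)   # one draw from p(w | lambda)
    print(lam, np.round(w, 3))             # larger lambda -> smaller typical weights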

Using Bayes' theorem we have

p(w | x, y, σ², λ) ∝ p(y | x, w, σ²) × p(w | λ)    (15)

and noting that ln(ab) = ln a + ln b, we follow the same process as before and find that this is maximised by the minimum of

L = ∑_{i=1}^{N} (y_i − f(x_i, w))² + λ wᵀw.    (16)

That is, a Gaussian prior with zero mean and variance σ² = 1/(2λ) is equivalent to adding a "penalty" term to the least-squares error function. This penalty is proportional to the squared length of the weight vector, and so when we try to minimise L it will preferentially select solutions with small values for its components. This is consistent with the Bayesian prior, which is normally distributed around zero. The most likely values of the weights are those near

to zero, and the least likely are those that are large. The parameter λ controls the width of the Gaussian prior: large λ means low variance and therefore a narrow distribution, so the larger λ is, the less likely the weights are to take large values. Because this prior distribution results in the model coefficients being kept small, it is known as a shrinkage method, and since the penalty term is the L2 norm (i.e. the squared length of the weight vector), this is often referred to as L2 regularisation, or sometimes as Tikhonov regularisation (although this is a more general class of methods).
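For a model that is linear in the weights with design matrix Φ (as in the polynomial fits used above), the minimiser of L in Equation (16) has the standard closed form w = (ΦᵀΦ + λI)⁻¹ Φᵀy. The sketch below is an illustration, not the notes' own code; the data, λ and random seed are arbitrary. It contrasts the unregularised and L2-regularised coefficients of a degree-9 fit.

import numpy as np

rng = np.random.default_rng(3)

# Noisy samples of y = sin(2*pi*x), in the spirit of the earlier overfitting example
x = np.linspace(0.0, 1.0, 10)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, size=x.shape)

M = 9
Phi = np.vander(x, N=M + 1, increasing=True)    # design matrix for a degree-9 polynomial

# Unregularised least squares: high-order coefficients can become very large
w_ls = np.linalg.lstsq(Phi, y, rcond=None)[0]

# L2-regularised (MAP) solution minimising Eq. (16): w = (Phi^T Phi + lambda*I)^(-1) Phi^T y
lam = 1e-3
w_map = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M + 1), Phi.T @ y)

print(np.round(w_ls, 2))    # typically very large coefficients
print(np.round(w_map, 2))   # shrunk towards zero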
L2 regularisation is very widely used in regression tasks. In the next section of the module, we will study how to use it effectively.

Reading
Section 1.2.5 of Bishop, Pattern Recognition and Machine Learning.
