ML Lecture Linear Regression 3

Learning Objectives

1. How do we perform linear regression using basis functions?
2. What are the relationships between maximum likelihood and least squares, between maximum a posteriori and regularization, and among expected loss, bias, variance, and noise?
3. What are the common regularization methods for regression?
4. How do we perform Bayesian linear regression?
5. What is the kernel for regression?
6. How do we choose the model complexity?
7. What are the evidence approximation and evidence maximization?
Outline

• Linear Basis Function Models
• Maximum Likelihood and Least Squares
• Bias-Variance Decomposition
• Bayesian Linear Regression
• Predictive Distribution
• Bayesian Model Comparison
• Evidence Approximation and Maximization
Bayesian Model Comparison (1)
• How do we choose the ‘right’ model?
• Assume we want to compare models M_i, i = 1, …, L, using data D; this requires computing the posterior

  p(M_i|D) ∝ p(M_i) p(D|M_i)

  i.e. posterior ∝ prior × model evidence (marginal likelihood).

• Bayes factor: the ratio of the evidences for two models, p(D|M_i) / p(D|M_j).
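A minimal numerical sketch (not part of the lecture) of how these quantities combine: given hypothetical log evidences ln p(D|M_i) and a prior over models, compute the posterior model probabilities and a Bayes factor. All numbers below are made up.

import numpy as np

# Hypothetical log evidences ln p(D|M_i) for three candidate models
log_evidence = np.array([-105.2, -98.7, -101.3])
log_prior = np.log(np.ones(3) / 3)               # uniform prior p(M_i)

# p(M_i|D) ∝ p(M_i) p(D|M_i); normalize in log space for numerical stability
log_post = log_prior + log_evidence
log_post -= np.logaddexp.reduce(log_post)
posterior = np.exp(log_post)

# Bayes factor comparing model 1 against model 0
bayes_factor = np.exp(log_evidence[1] - log_evidence[0])
print(posterior, bayes_factor)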


Bayesian Model Comparison (2)
• Having computed the posteriors p(M_i|D), we can compute the predictive (mixture) distribution

  p(t|x, D) = Σ_i p(t|x, M_i, D) p(M_i|D)

• A simpler approximation, known as model selection, is to use only the single model with the highest evidence.
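A brief sketch contrasting the two options, assuming each model object exposes a (hypothetical) predictive_density(t, x) method and that the posterior probabilities p(M_i|D) have already been computed as above.

def predictive_mixture(t, x, models, model_posterior):
    # p(t|x, D) = sum_i p(t|x, M_i, D) p(M_i|D)
    return sum(p_Mi * m.predictive_density(t, x)
               for m, p_Mi in zip(models, model_posterior))

def select_model(models, model_posterior):
    # Model selection: keep only the single most probable model
    best = max(range(len(models)), key=lambda i: model_posterior[i])
    return models[best]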
Bayesian Model Comparison (3)
• For a model with parameters w, we get the model evidence by marginalizing over w:

  p(D|M_i) = ∫ p(D|w, M_i) p(w|M_i) dw

• Note that the evidence is precisely the normalizing constant in Bayes’ theorem for the parameter posterior:

  p(w|D, M_i) = p(D|w, M_i) p(w|M_i) / p(D|M_i)
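One way to make the marginalization concrete is a plain Monte Carlo estimate that averages the likelihood over samples from the prior; this is an illustrative sketch (not the lecture's method) and is only practical when w is low-dimensional.

import numpy as np

def log_evidence_mc(log_likelihood, sample_prior, n_samples=100_000, seed=0):
    # p(D|M) = E_{w ~ p(w|M)}[ p(D|w, M) ], estimated from prior samples
    rng = np.random.default_rng(seed)
    ws = sample_prior(rng, n_samples)            # caller-supplied prior sampler
    log_liks = np.array([log_likelihood(w) for w in ws])
    # log of the sample mean, computed stably in log space
    return np.logaddexp.reduce(log_liks) - np.log(n_samples)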
Bayesian Model Comparison (4)
For a given model with a single parameter w, consider the approximation

  p(D) = ∫ p(D|w) p(w) dw ≈ p(D|w_MAP) Δw_posterior / Δw_prior

where the posterior is assumed to be sharply peaked with width Δw_posterior, and the prior is taken to be flat with width Δw_prior, so that

  p(w) = 1 / Δw_prior
Bayesian Model Comparison (5)
• Taking logarithms, we obtain

  ln p(D) ≈ ln p(D|w_MAP) + ln(Δw_posterior / Δw_prior)

  The second term is negative, since the posterior is narrower than the prior.

• With M parameters, all assumed to have the same ratio Δw_posterior / Δw_prior, we get

  ln p(D) ≈ ln p(D|w_MAP) + M ln(Δw_posterior / Δw_prior)

  The penalty term is negative and linear in M.
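A small numeric illustration of the trade-off (all widths and likelihood values are made up): the fit term improves with M while the Occam penalty M ln(Δw_posterior/Δw_prior) falls linearly, so the approximate log evidence peaks at an intermediate M.

import numpy as np

def approx_log_evidence(log_lik_at_map, M, dw_posterior, dw_prior):
    # ln p(D) ≈ ln p(D|w_MAP) + M ln(Δw_posterior / Δw_prior)
    return log_lik_at_map + M * np.log(dw_posterior / dw_prior)

for M in (1, 3, 6, 9):
    fit = -80.0 + 15.0 * np.log1p(M)   # made-up fit quality, diminishing returns
    print(M, approx_log_evidence(fit, M, dw_posterior=0.05, dw_prior=1.0))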


Bayesian Model Comparison (6)
Matching data and model complexity
[Figure: model evidence for models of different complexity; models that are too simple under-fit, models that are too complex over-fit, and the evidence favours intermediate complexity.]
Outline

• Linear Basis Function Models
• Maximum Likelihood and Least Squares
• Bias-Variance Decomposition
• Bayesian Linear Regression
• Predictive Distribution
• Bayesian Model Comparison
• Evidence Approximation and Maximization*
The Evidence Approximation (1)*
The fully Bayesian predictive distribution is given by

  p(t|𝐭) = ∫∫∫ p(t|w, β) p(w|𝐭, α, β) p(α, β|𝐭) dw dα dβ

but this integral is intractable. Approximate it with

  p(t|𝐭) ≈ p(t|𝐭, α̂, β̂) = ∫ p(t|w, β̂) p(w|𝐭, α̂, β̂) dw

where (α̂, β̂) is the mode of p(α, β|𝐭), which is assumed to be sharply peaked. This approach is also known as empirical Bayes, type II (or generalized) maximum likelihood, or the evidence approximation.
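A minimal sketch of the approximate predictive distribution for this linear-Gaussian model, using the standard Bayesian linear regression posterior N(w | m_N, S_N); the design matrix Phi, target vector t, basis vector phi_x, and the point estimates of α and β are assumed to be given.

import numpy as np

def posterior_w(Phi, t, alpha, beta):
    # p(w|t, α, β) = N(w | m_N, S_N) with S_N⁻¹ = α I + β Φᵀ Φ
    M = Phi.shape[1]
    S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)
    m_N = beta * S_N @ Phi.T @ t
    return m_N, S_N

def predictive(phi_x, m_N, S_N, beta):
    # Gaussian predictive: mean m_Nᵀ φ(x), variance 1/β + φ(x)ᵀ S_N φ(x)
    mean = phi_x @ m_N
    var = 1.0 / beta + phi_x @ S_N @ phi_x
    return mean, var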
The Evidence Approximation (2)*
From Bayes’ theorem we have

  p(α, β|𝐭) ∝ p(𝐭|α, β) p(α, β)

and if we assume p(α, β) to be flat, we see that maximizing the posterior over α and β amounts to maximizing the evidence

  p(𝐭|α, β) = ∫ p(𝐭|w, β) p(w|α) dw

General results for Gaussian integrals give

  ln p(𝐭|α, β) = (M/2) ln α + (N/2) ln β − E(m_N) − (1/2) ln|A| − (N/2) ln 2π
The Evidence Approximation (3)*

Here

  E(m_N) = (β/2) ‖𝐭 − Φ m_N‖² + (α/2) m_Nᵀ m_N

with m_N = β A⁻¹ Φᵀ 𝐭, and A is the posterior precision:

  A = α I + β Φᵀ Φ
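A sketch of this log evidence in code, reusing m_N and A exactly as defined above (Phi is the N×M design matrix, t the target vector). Evaluating it for several candidate models, e.g. different basis sets or polynomial orders, and keeping the maximum reproduces the model-comparison procedure described earlier.

import numpy as np

def log_evidence(Phi, t, alpha, beta):
    N, M = Phi.shape
    A = alpha * np.eye(M) + beta * Phi.T @ Phi           # posterior precision
    m_N = beta * np.linalg.solve(A, Phi.T @ t)           # posterior mean
    E_mN = (0.5 * beta * np.sum((t - Phi @ m_N) ** 2)
            + 0.5 * alpha * (m_N @ m_N))
    _, logdet_A = np.linalg.slogdet(A)
    return (0.5 * M * np.log(alpha) + 0.5 * N * np.log(beta)
            - E_mN - 0.5 * logdet_A - 0.5 * N * np.log(2.0 * np.pi))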
The Evidence Approximation (4)*
• Example: sinusoidal data, Mth-degree polynomial.
[Figure: model evidence plotted against the polynomial order M.]
Maximizing the Evidence Function (1)*
• To maximize the evidence ln p(𝐭|α, β) w.r.t. α and β, we define the eigenvector equation

  (β Φᵀ Φ) u_i = λ_i u_i

• Thus the posterior precision

  A = α I + β Φᵀ Φ

  has eigenvalues λ_i + α.
Maximizing the Evidence Function (2)*

Differentiating the log evidence gives

  ∂ ln p(𝐭|α, β)/∂α = M/(2α) − (1/2) m_Nᵀ m_N − (1/2) Σ_i 1/(λ_i + α)

  ∂ ln p(𝐭|α, β)/∂β = N/(2β) − (1/2) Σ_n {t_n − m_Nᵀ φ(x_n)}² − (1/(2β)) Σ_i λ_i/(λ_i + α)

using d ln|A|/dα = Σ_i 1/(λ_i + α) and d ln|A|/dβ = (1/β) Σ_i λ_i/(λ_i + α).
Maximizing the Evidence Function (3)*
• Setting these derivatives to zero, we get the re-estimation equations

  α = γ / (m_Nᵀ m_N)

  1/β = (1/(N − γ)) Σ_{n=1}^{N} {t_n − m_Nᵀ φ(x_n)}²

  where γ = Σ_i λ_i / (λ_i + α).

• γ depends on both α and β (recall that the λ_i are the eigenvalues of β Φᵀ Φ), so the two equations are solved by iterating to convergence.
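A sketch of that iteration, alternating the two re-estimation equations with γ recomputed from the current α and β; the initial values and iteration count are arbitrary choices, not part of the lecture.

import numpy as np

def maximize_evidence(Phi, t, alpha=1.0, beta=1.0, n_iter=200):
    N, M = Phi.shape
    eig0 = np.linalg.eigvalsh(Phi.T @ Phi)        # eigenvalues of Φᵀ Φ
    for _ in range(n_iter):
        lam = beta * eig0                          # eigenvalues of β Φᵀ Φ
        A = alpha * np.eye(M) + beta * Phi.T @ Phi
        m_N = beta * np.linalg.solve(A, Phi.T @ t)
        gamma = np.sum(lam / (lam + alpha))        # effective number of parameters
        alpha = gamma / (m_N @ m_N)
        beta = (N - gamma) / np.sum((t - Phi @ m_N) ** 2)
    return alpha, beta, gamma, m_N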
Effective Number of Parameters (1)*

[Figure: contours of the likelihood and the prior in w-space, with axes aligned to the eigenvectors u_i of β Φᵀ Φ.]

• w₁ lies along a direction with small eigenvalue (λ₁ ≪ α): it is not well determined by the likelihood, and its posterior value stays close to the prior.
• w₂ lies along a direction with large eigenvalue (λ₂ ≫ α): it is well determined by the likelihood and moves little from the maximum-likelihood value.
• γ = Σ_i λ_i / (λ_i + α) is the number of well-determined parameters: each term is close to 1 when λ_i ≫ α and close to 0 when λ_i ≪ α.
Effective Number of Parameters (2)*
• Example: sinusoidal data, 9 Gaussian basis functions, β fixed at its true value 11.1.
[Figure: quantities used to choose the optimal α plotted against ln α.]
Effective Number of Parameters (3)*
• Example: sinusoidal data, 9 Gaussian basis functions, β fixed at its true value 11.1.
[Figure: test set error plotted against ln α; the evidence-optimal α lies close to the minimum of the test error.]
Effective Number of Parameters (4)*
• Example: sinusoidal data, 9 Gaussian basis functions, β fixed at its true value 11.1.
[Figure: parameter values plotted against the effective number of parameters γ, over the range 0 ≤ γ ≤ 10.]
Effective Number of Parameters (5)*
• In the limit N ≫ M, we have γ = M, and we can consider using the easy-to-compute approximations

  α = M / (2 E_W(m_N)),  β = N / (2 E_D(m_N))

  where E_W(m_N) = (1/2) m_Nᵀ m_N and E_D(m_N) = (1/2) Σ_n {t_n − m_Nᵀ φ(x_n)}².
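A sketch of this large-N shortcut, with E_W and E_D as defined above and m_N taken from a previous fit.

import numpy as np

def alpha_beta_large_N(Phi, t, m_N):
    # Valid when N >> M, so that γ ≈ M
    N, M = Phi.shape
    E_W = 0.5 * (m_N @ m_N)
    E_D = 0.5 * np.sum((t - Phi @ m_N) ** 2)
    return M / (2.0 * E_W), N / (2.0 * E_D)        # α, β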
Limitations of Fixed Basis Functions
• Using M basis functions along each dimension of a D-dimensional input space requires M^D basis functions in total: the curse of dimensionality (see the small count sketched after this list).
• In later chapters, we shall see how we can get away with far fewer basis functions by choosing them using the training data.
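The count below (numbers arbitrary) shows how quickly a full grid of basis functions grows with the input dimension.

M = 10                                  # basis functions per input dimension
for D in (1, 2, 3, 5, 10):
    print(f"D = {D:2d}: {M ** D} basis functions")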
