
PATTERN RECOGNITION AND MACHINE LEARNING

CHAPTER 3: LINEAR MODELS FOR REGRESSION

Outline

Discuss tutorial.
Regression Examples.
The Gaussian distribution.
Linear Regression.
Maximum Likelihood estimation.

Polynomial Curve Fitting

Academia Example
Predict: final percentage mark for a student.
Features: 6 assignment grades, midterm exam, final exam, project, age.
Questions we could ask:
I forgot the weights of the components. Can you recover them from a spreadsheet of the final grades?
I lost the final exam grades. How well can I still predict the final mark?
How important is each component, actually?
How well could I guess someone's final mark given only their assignments? Given only their exams?

The Gaussian Distribution

Central Limit Theorem


The distribution of the sum of N i.i.d. random variables becomes increasingly Gaussian as N grows.
Example: the sum of N uniform [0,1] random variables (see the quick sketch below).
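As a quick empirical check (my own sketch, not from the slides), we can draw many sums of N uniform variables and compare the sample mean and variance with the theoretical values N/2 and N/12 of the Gaussian limit:

```python
import numpy as np

def clt_demo(N, n_samples=100_000, rng=np.random.default_rng(0)):
    """Return n_samples sums of N i.i.d. Uniform[0,1] variables."""
    draws = rng.uniform(0.0, 1.0, size=(n_samples, N))
    return draws.sum(axis=1)

for N in (1, 2, 10):
    s = clt_demo(N)
    # For a sum of N Uniform[0,1] variables: mean = N/2, variance = N/12.
    print(f"N={N:2d}  sample mean={s.mean():.3f} (theory {N/2:.3f})  "
          f"sample var={s.var():.3f} (theory {N/12:.3f})")
```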

Reading exponential probability formulas

On an infinite space we cannot simply give every outcome the same probability: the sum Σ_x p(x) would grow to infinity.
Instead, use an exponentially decaying form, e.g. p(n) = (1/2)^n.
Suppose there is a relevant feature f(x) and I want to express that the greater f(x) is, the less probable x is.
Use p(x) = exp(-f(x)) (up to a normalizing constant).
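A hedged illustration of this construction (mine, not from the slides): on a finite grid, exponentiating the negative feature and normalizing gives a valid distribution in which larger f(x) means smaller p(x).

```python
import numpy as np

# Unnormalized weights p(x) proportional to exp(-f(x)) on a finite grid of x values.
x = np.linspace(-3.0, 3.0, 7)
f = x**2                      # example feature: larger f(x) => less probable x
unnorm = np.exp(-f)
p = unnorm / unnorm.sum()     # normalize so the probabilities sum to 1

for xi, pi in zip(x, p):
    print(f"x={xi:+.1f}  p(x)={pi:.4f}")
```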

Example: exponential form and sample size

Fair coin: the longer the sampled sequence, the less likely any particular outcome is.
p(n) = 2^(-n), so ln p(n) = -n ln 2 decreases linearly with the sample size n.
[Figure: ln p(n) plotted against sample size n.]

Exponential Form: Gaussian mean

The further x is from the mean, the less likely it is:
ln p(x) = -(x - μ)²/(2σ²) + const.
[Figure: ln p(x) plotted against (x - μ)².]

Smaller variance decreases probability

The smaller the variance σ², the less likely x is (away from the mean). Equivalently: the greater the precision β = 1/σ², the less likely x is.
[Figure: ln p(x) for several values of the precision β = 1/σ².]

Minimal energy = maximum probability

The greater the energy E(x) of the joint state, the less probable the state is:
ln p(x) = -E(x) + const, i.e. p(x) ∝ exp(-E(x)).
[Figure: ln p(x) plotted against the energy E(x).]

Linear Basis Function Models (1)

Generally,
y(x, w) = Σ_{j=0}^{M-1} w_j φ_j(x) = wᵀφ(x),
where the φ_j(x) are known as basis functions.
Typically, φ_0(x) = 1, so that w_0 acts as a bias.
In the simplest case, we use linear basis functions: φ_d(x) = x_d.

Linear Basis Function Models (2)

Polynomial basis functions:
φ_j(x) = x^j.
These are global; a small change in x affects all basis functions.

Linear Basis Function Models (3)

Gaussian basis functions:
φ_j(x) = exp(-(x - μ_j)²/(2s²)).
These are local; a small change in x only affects nearby basis functions. μ_j and s control location and scale (width).
Related to kernel methods.

Linear Basis Function Models (4)

Sigmoidal basis functions:
φ_j(x) = σ((x - μ_j)/s), where σ(a) = 1/(1 + exp(-a)).
These too are local; a small change in x only affects nearby basis functions. μ_j and s control location and scale (slope).
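All three families plug into the same design-matrix machinery. A minimal sketch in Python (helper names such as design_matrix are my own, not from the slides):

```python
import numpy as np

def poly_basis(x, j):
    return x ** j                                   # global: phi_j(x) = x^j

def gauss_basis(x, mu_j, s):
    return np.exp(-(x - mu_j) ** 2 / (2 * s ** 2))  # local, width s

def sigmoid_basis(x, mu_j, s):
    return 1.0 / (1.0 + np.exp(-(x - mu_j) / s))    # local, slope controlled by s

def design_matrix(x, basis_fns):
    """Phi[n, j] = phi_j(x_n); the first column is the constant bias phi_0 = 1."""
    cols = [np.ones_like(x)] + [fn(x) for fn in basis_fns]
    return np.stack(cols, axis=1)

x = np.linspace(0, 1, 5)
centres = np.linspace(0, 1, 3)
Phi = design_matrix(x, [lambda x, m=m: gauss_basis(x, m, 0.2) for m in centres])
print(Phi.shape)   # (5, 4): N=5 points, M=4 basis functions including the bias
```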

Curve Fitting With Noise

Maximum Likelihood and Least Squares (1)

Assume observations from a deterministic function with added Gaussian noise:
t = y(x, w) + ε, where p(ε | β) = N(ε | 0, β⁻¹),
which is the same as saying
p(t | x, w, β) = N(t | y(x, w), β⁻¹).
Given observed inputs X = {x_1, ..., x_N} and targets t = (t_1, ..., t_N)ᵀ, we obtain the likelihood function
p(t | X, w, β) = Π_{n=1}^{N} N(t_n | wᵀφ(x_n), β⁻¹).

Maximum Likelihood and Least Squares (2)

Taking the logarithm, we get
ln p(t | w, β) = (N/2) ln β - (N/2) ln(2π) - β E_D(w),
where
E_D(w) = (1/2) Σ_{n=1}^{N} {t_n - wᵀφ(x_n)}²
is the sum-of-squares error.

Maximum Likelihood and Least Squares (3)

Computing the gradient and setting it to zero yields
0 = Σ_{n=1}^{N} {t_n - wᵀφ(x_n)} φ(x_n)ᵀ.
Solving for w, we get
w_ML = (ΦᵀΦ)⁻¹Φᵀt = Φ†t,
where Φ is the N×M design matrix with elements Φ_nj = φ_j(x_n), and
Φ† = (ΦᵀΦ)⁻¹Φᵀ
is the Moore-Penrose pseudoinverse.
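In code, w_ML is just the least-squares solution of Φw ≈ t. A sketch (my own, assuming NumPy; np.linalg.lstsq is used instead of forming the pseudoinverse explicitly, which is numerically better behaved):

```python
import numpy as np

def fit_ml(Phi, t):
    """Maximum-likelihood weights: the least-squares solution of Phi w = t."""
    w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)
    return w_ml

# Tiny synthetic example: noisy sine, polynomial basis of degree 3.
rng = np.random.default_rng(1)
x = np.linspace(0, 1, 20)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)
Phi = np.stack([x ** j for j in range(4)], axis=1)   # columns 1, x, x^2, x^3

w_ml = fit_ml(Phi, t)
beta_ml = 1.0 / np.mean((t - Phi @ w_ml) ** 2)       # ML estimate of the noise precision
print(w_ml, beta_ml)
```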

Linear Algebra/Geometry of Least Squares

Consider the N-dimensional space whose axes are given by the target values t_n, so that t is a vector in this space.
Each basis function, evaluated at the N data points, is also a vector in this space; together the M of them span an M-dimensional subspace S.
w_ML minimizes the distance between t and its orthogonal projection y onto S.
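A quick numerical check of this picture (my own sketch): the least-squares prediction Φw_ML equals the orthogonal projection of t onto the column space of Φ, and the residual is orthogonal to S.

```python
import numpy as np

rng = np.random.default_rng(0)
Phi = rng.normal(size=(10, 3))          # N=10 targets, M=3 basis vectors
t = rng.normal(size=10)

w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)
y = Phi @ w_ml                          # least-squares prediction

P = Phi @ np.linalg.inv(Phi.T @ Phi) @ Phi.T   # projection matrix onto span(Phi)
print(np.allclose(y, P @ t))            # True: y is the orthogonal projection of t
print(np.allclose(Phi.T @ (t - y), 0))  # residual is orthogonal to the subspace S
```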

Maximum Likelihood and Least Squares (4)

Maximizing with respect to the bias, w_0, alone, we see that
w_0 = t̄ - Σ_{j=1}^{M-1} w_j φ̄_j,
where t̄ is the mean of the targets and φ̄_j is the mean of the j-th basis function over the training inputs: the bias compensates for the difference between the two.
We can also maximize with respect to β, giving
1/β_ML = (1/N) Σ_{n=1}^{N} {t_n - w_MLᵀφ(x_n)}².

0th Order Polynomial

3rd Order Polynomial

9th Order Polynomial

Over-fitting

Root-Mean-Square (RMS) Error:
E_RMS = √(2 E(w*)/N).

Polynomial Coefficients

Data Set Size:


9th Order Polynomial

1st Order Polynomial

Data Set Size:


9th Order Polynomial

Quadratic Regularization

Penalize large coefficient values:
Ẽ(w) = (1/2) Σ_{n=1}^{N} {y(x_n, w) - t_n}² + (λ/2) ‖w‖².

[Figures: regularized fits for different values of the regularization coefficient λ, and error vs. λ.]

Regularized Least Squares (1)

Consider the error function:
E_D(w) + λ E_W(w)   (data term + regularization term).
With the sum-of-squares error function and a quadratic regularizer, we get
(1/2) Σ_{n=1}^{N} {t_n - wᵀφ(x_n)}² + (λ/2) wᵀw,
which is minimized by
w = (λI + ΦᵀΦ)⁻¹Φᵀt.
λ is called the regularization coefficient.
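A minimal sketch of this closed-form solution (my own code; lam plays the role of λ):

```python
import numpy as np

def fit_ridge(Phi, t, lam):
    """Regularized least squares: w = (lam*I + Phi^T Phi)^{-1} Phi^T t."""
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

# Larger lam shrinks the weights toward zero.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)
Phi = np.stack([x ** j for j in range(10)], axis=1)    # 9th-order polynomial basis

for lam in (0.0, 1e-3, 1.0):
    w = fit_ridge(Phi, t, lam)
    print(f"lam={lam:g}  max|w|={np.abs(w).max():.2f}")
```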

Regularized Least Squares (2)

With a more general regularizer, we have
(1/2) Σ_{n=1}^{N} {t_n - wᵀφ(x_n)}² + (λ/2) Σ_{j=1}^{M} |w_j|^q,
with q = 1 giving the lasso and q = 2 the quadratic regularizer.

Regularized Least Squares (3)

The lasso tends to generate sparser solutions than a quadratic regularizer: for sufficiently large λ, some of the coefficients w_j are driven exactly to zero.
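This is easy to see with off-the-shelf solvers; a hedged sketch using scikit-learn (a library not mentioned on the slides), counting how many coefficients each regularizer drives exactly to zero:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 20))
true_w = np.zeros(20)
true_w[:3] = [2.0, -1.0, 0.5]               # only 3 of 20 features matter
y = X @ true_w + rng.normal(scale=0.1, size=50)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=0.1).fit(X, y)
print("lasso zeros:", np.sum(lasso.coef_ == 0))   # typically most coefficients exactly 0
print("ridge zeros:", np.sum(ridge.coef_ == 0))   # typically none exactly 0
```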

Cross-Validation for Regularization

Bayesian Linear Regression (1)

Define a conjugate shrinkage prior over the weight vector w:
p(w | α) = N(w | 0, α⁻¹I).
Combining this with the likelihood function and using the results for marginal and conditional Gaussian distributions gives a posterior distribution over w.
The log of the posterior = sum of squared errors + quadratic regularization (up to additive constants), with λ = α/β.
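Concretely, the standard result for this conjugate prior is that the posterior is N(w | m_N, S_N) with S_N = (αI + βΦᵀΦ)⁻¹ and m_N = βS_NΦᵀt. A sketch of the computation (my own code):

```python
import numpy as np

def posterior(Phi, t, alpha, beta):
    """Posterior N(w | m_N, S_N) for the zero-mean isotropic prior N(w | 0, alpha^-1 I)."""
    M = Phi.shape[1]
    S_N_inv = alpha * np.eye(M) + beta * Phi.T @ Phi
    S_N = np.linalg.inv(S_N_inv)
    m_N = beta * S_N @ Phi.T @ t
    return m_N, S_N

# Note: the posterior mean m_N coincides with the ridge solution when lam = alpha / beta.
```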

Bayesian Linear Regression (3)

0 data points observed. [Figure: prior over w and corresponding functions in data space.]

Bayesian Linear Regression (4)

1 data point observed. [Figure: likelihood, posterior, and data space.]

Bayesian Linear Regression (5)

2 data points observed. [Figure: likelihood, posterior, and data space.]

Bayesian Linear Regression (6)

20 data points observed. [Figure: likelihood, posterior, and data space.]

Predictive Distribution (1)

Predict t for new values of x by integrating over w:
p(t | x, t, α, β) = ∫ p(t | x, w, β) p(w | t, α, β) dw.
This integral can be solved analytically, giving a Gaussian predictive distribution.
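The result is Gaussian with mean m_Nᵀφ(x) and variance 1/β + φ(x)ᵀS_Nφ(x); a sketch (my own code, assuming the posterior helper from the previous sketch supplies m_N and S_N):

```python
import numpy as np

def predictive(phi_x, m_N, S_N, beta):
    """Predictive mean and variance at a new input with feature vector phi_x."""
    mean = m_N @ phi_x
    var = 1.0 / beta + phi_x @ S_N @ phi_x   # noise term + weight-uncertainty term
    return mean, var
```

The second variance term, which reflects uncertainty in w, shrinks as more data are observed, which is the behaviour illustrated in the following figures.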

Predictive Distribution (2)

Example: sinusoidal data, 9 Gaussian basis functions, 1 data point.

Predictive Distribution (3)

Example: sinusoidal data, 9 Gaussian basis functions, 2 data points.

Predictive Distribution (4)

Example: sinusoidal data, 9 Gaussian basis functions, 4 data points.

Predictive Distribution (5)

Example: sinusoidal data, 9 Gaussian basis functions, 25 data points.

Limitations of Fixed Basis Functions

Placing M basis functions along each dimension of a D-dimensional input space requires M^D basis functions: the curse of dimensionality (see the quick calculation below).
In later chapters, we shall see how we can get away with fewer basis functions by choosing them using the training data.
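For instance (my own numbers, not from the slides), the grid of basis functions grows very quickly:

```python
# Number of tensor-product basis functions: M per dimension, D dimensions.
for M, D in [(10, 2), (10, 5), (10, 10)]:
    print(f"M={M}, D={D}: {M ** D:,} basis functions")   # 100; 100,000; 10,000,000,000
```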
