Lec19: Introduction to Linear Regression
Linear Regression
Prof. Nicholas Zabaras
Email: [email protected]
URL: https://www.zabaras.com/
Learn about basis functions and feature space, and setting up the data design matrix
Feature extraction
We are given target values $\mathbf{t} = (t_1, \ldots, t_N)^T$ and fit the polynomial
$$y(x, \mathbf{w}) = w_0 + w_1 x + \cdots + w_M x^M = \sum_{j=0}^{M} w_j x^j.$$
[Figure: fitted polynomial curve, the original curve sin(2πx), and the random data points used for fitting (MatLab code).]
This can be done by minimizing an error function that measures the misfit
between the function 𝑦(𝑥, 𝒘), for any given value of 𝒘, and the training set data
points.
$$\min_{\mathbf{w}} E(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\left[ y(x_n, \mathbf{w}) - t_n \right]^2$$
Setting the derivatives of $E(\mathbf{w})$ with respect to the coefficients to zero gives a linear system for the polynomial coefficients:
$$\sum_{j=0}^{M} A_{ij}\, w_j = T_i, \qquad A_{ij} = \sum_{n=1}^{N} x_n^{\,i+j}, \qquad T_i = \sum_{n=1}^{N} x_n^{\,i}\, t_n.$$
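As a brief illustration, here is a minimal MATLAB sketch of this direct solution of the linear system. The data generation, the order M and all variable names are assumptions for illustration; this is not the MatLab code referenced in the figures.

% Minimal sketch: polynomial least squares via the normal equations.
N = 10;                            % number of training points (assumed)
x = linspace(0, 1, N)';            % inputs in [0,1]
t = sin(2*pi*x) + 0.1*randn(N,1);  % noisy targets from sin(2*pi*x)
M = 3;                             % polynomial order (assumed)
A = zeros(M+1); T = zeros(M+1, 1);
for i = 0:M
    for j = 0:M
        A(i+1, j+1) = sum(x.^(i+j));   % A_ij = sum_n x_n^(i+j)
    end
    T(i+1) = sum((x.^i).*t);           % T_i = sum_n x_n^i * t_n
end
w = A \ T;                             % coefficients w_0, ..., w_M
y = @(xq) polyval(flipud(w)', xq);     % y(x,w) = sum_j w_j x^j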
[Figures: low-order polynomial fits compared with the original curve sin(2πx) and the random data points used for fitting (MatLab code).]
1st order (linear) polynomials give a rather poor fit to the data and to sin(2𝜋𝑥).
[Figure: 3rd order polynomial fit compared with the original curve sin(2πx) and the random data points used for fitting (MatLab code).]
The 3rd order polynomial seems to give the best fit to the function sin(2𝜋𝑥).
[Figure: 9th order polynomial fit compared with the original curve sin(2πx) and the random data points used for fitting (MatLab code).]
For 𝑀 = 9 we obtain a perfect fit to the training data. However, the fitted
curve oscillates wildly and gives a very poor representation of sin(2𝜋𝑥). This is
known as overfitting.
Training and Test Errors
We often use the root-mean-square (RMS) error. The division by 𝑁 allows us to compare data sets of different sizes, and the square root ensures that 𝐸𝑅𝑀𝑆 is measured in the same units as 𝑡.
$$E_{RMS} = \left\{ 2\, E(\mathbf{w}^*) / N \right\}^{1/2}$$
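A small sketch of how these errors can be computed, assuming w holds the coefficients w_0, ..., w_M from the previous sketch; the test-set construction is an illustrative assumption.

% E_RMS = sqrt( 2*E(w*)/N ) on a data set (x, t); illustrative sketch.
rmsErr = @(w, x, t) sqrt(sum((polyval(flipud(w)', x) - t).^2) / numel(t));
E_train = rmsErr(w, x, t);                        % training error
xTest = rand(100, 1);                             % fresh test inputs (assumed)
tTest = sin(2*pi*xTest) + 0.1*randn(100, 1);      % test targets
E_test = rmsErr(w, xTest, tTest);                 % test (generalization) error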
The test set error measures how well we do in predicting the values of 𝑡 for new data observations of 𝒙.
[Figure: root-mean-square error evaluated on the training and test data for various M (MatLab code).]
Training and Test Errors
Small values of 𝑀 give relatively large values of the test set error. The
corresponding polynomials are incapable of capturing the oscillations in
sin(2𝜋𝑥).
Values in the range 3 < 𝑀 ≤ 8 give small test set errors, and these polynomials also give a reasonable representation of sin(2𝜋𝑥).
[Figure: root-mean-square error evaluated on the training and test data for various M (MatLab code).]
Overfitting
We can gain insight into the problem by examining the values of the coefficients 𝒘 obtained from polynomials of various order.
As 𝑀 increases, the magnitude of the coefficients typically gets larger.
[Figures: polynomial fits to the data, compared with the original curve sin(2πx) and the random data points used for fitting (MatLab code).]
It makes more sense to choose the complexity of the model according to the
complexity of the problem being solved.
Adding a quadratic penalty on the coefficients gives the regularized error function
$$\widetilde{E}(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\left[ y(x_n,\mathbf{w}) - t_n \right]^2 + \frac{\lambda}{2}\lVert\mathbf{w}\rVert^2,$$
and the corresponding linear system becomes
$$\sum_{j=0}^{M} A_{ij}\, w_j = T_i, \qquad A_{ij} = \sum_{n=1}^{N} x_n^{\,i+j} + \lambda\,\delta_{ij}, \qquad T_i = \sum_{n=1}^{N} x_n^{\,i}\, t_n.$$
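A hedged sketch of the regularized fit, reusing the matrices A and T from the earlier sketch; the value of λ is an assumption matching the example in the slides.

% Regularized polynomial least squares: (A + lambda*I) w = T; sketch only.
lambda = exp(-18);                      % e.g. ln(lambda) = -18
wReg = (A + lambda*eye(M+1)) \ T;       % regularized coefficients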
[Figure: 9th order polynomial fit with regularization, ln λ = −18, compared with sin(2πx) and the random data points (MatLab code).]
We now fit the polynomial of order 𝑀 = 9 to the same data set as before but
now using the regularized error function.
For ln 𝜆 = −18, the over-fitting has been suppressed and we obtain a much closer representation of sin(2𝜋𝑥).
Regularization Controls Model Complexity
Solution using regularized least squares, ln 𝜆 = 0
[Figure: 9th order polynomial fit with regularization, ln λ = 0, compared with sin(2πx) and the random data points (MatLab code).]
If, however, we use too large a value for 𝜆 then we again obtain a poor fit, as
shown for ln𝜆 = 0.
Regularization Controls Model Complexity
[Figure: root-mean-square error versus ln λ for M = 9, evaluated on the training and test sets (MatLab code).]
𝜆 controls the effective complexity of the model and hence determines the
degree of over-fitting.
We will soon re-examine this problem with a Bayesian approach that avoids
the over-fitting problem.
MLE, Regularization and Model Complexity
In MLE, the effective model complexity is governed by the number of basis functions and needs to be controlled according to the size of the data set.
With regularization, the effective model complexity is controlled mainly by 𝜆 and still
by the number and form of the basis functions.
Thus the model complexity for a particular problem cannot be decided simply by
maximizing the likelihood function as this leads to excessively complex models and
over-fitting.
Independent hold-out data can be used to determine model complexity, but this wastes data and is computationally expensive.
The quantity 𝑝(𝒟|𝒘) on the right-hand side of Bayes’ theorem is evaluated for
the observed data set 𝒟 and can be viewed as a function of the parameter
vector 𝒘 (likelihood function).
As an example, consider data 𝑥1, . . . , 𝑥𝑁 drawn independently from a Gaussian 𝒩(𝑥|𝜇, 𝜎²). The log-likelihood is
$$\ln p(\mathbf{x}\,|\,\mu,\sigma^2) = -\frac{1}{2\sigma^2}\sum_{n=1}^{N}(x_n-\mu)^2 - \frac{N}{2}\ln\sigma^2 - \frac{N}{2}\ln(2\pi),$$
and maximizing it gives the point estimates
$$\mu_{ML} = \frac{1}{N}\sum_{n=1}^{N}x_n, \qquad \sigma_{ML}^2 = \frac{1}{N}\sum_{n=1}^{N}\left(x_n-\mu_{ML}\right)^2.$$
* We often work with the log-likelihood to avoid underflow (from taking products of small probabilities) and to simplify the algebra.
The MLE approach underestimates the variance (it is biased); this lies at the root of the over-fitting problem, e.g. in polynomial curve fitting.
The maximum likelihood solutions $\mu_{ML}$ and $\sigma^2_{ML}$ are functions of the data set values 𝑥1, . . . , 𝑥𝑁. Consider the expectations of these quantities with respect to the data set values, which come from a Gaussian.
Using the point estimates above, you can show that
$$\mathbb{E}[\mu_{ML}] = \mu, \qquad \mathbb{E}[\sigma_{ML}^2] = \frac{N-1}{N}\,\sigma^2,$$
i.e. on average the ML estimate underestimates the true variance. In this derivation, the expectations are taken over data sets drawn from the underlying Gaussian.
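A small Monte Carlo sketch illustrating these estimates and the downward bias of the ML variance; the values of μ, σ², N and the number of replications are arbitrary assumptions.

% Bias of the ML variance estimate for a Gaussian; illustrative sketch.
mu = 1; sigma2 = 4; N = 5; R = 1e5;
s2ML = zeros(R, 1);
for r = 1:R
    xr = mu + sqrt(sigma2)*randn(N, 1);   % a data set from N(mu, sigma2)
    muML = mean(xr);                      % mu_ML = (1/N) sum_n x_n
    s2ML(r) = mean((xr - muML).^2);       % sigma2_ML = (1/N) sum_n (x_n - mu_ML)^2
end
mean(s2ML)   % approx ((N-1)/N)*sigma2 = 3.2, which underestimates sigma2 = 4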
For a quadratic loss function, the optimal point estimate is the conditional mean
𝑦(𝒙, 𝒘) = 𝔼[𝑡|𝒙].
$$y(\mathbf{x},\mathbf{w}) = w_0 + w_1 x_1 + \cdots + w_D x_D, \qquad \mathbf{x} = (x_0 = 1,\, x_1,\, \ldots,\, x_D)^T,$$
or, more generally, $y(\mathbf{x},\mathbf{w}) = \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x})$, where
$$\boldsymbol{\phi}(\mathbf{x}) = \left( \phi_0(\mathbf{x}),\, \phi_1(\mathbf{x}),\, \phi_2(\mathbf{x}),\, \ldots,\, \phi_{M-1}(\mathbf{x}) \right)^T, \qquad \phi_0(\mathbf{x}) = 1, \quad w_0 \text{ is the bias}.$$
Gaussian basis functions:
$$\phi_j(x) = \exp\left( -\frac{(x-\mu_j)^2}{2s^2} \right), \qquad \mu_j = -1:0.2:1, \quad s = 0.1.$$
[Figure: the resulting basis functions plotted on [−1, 1] (MatLab code).]
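A minimal sketch of evaluating such Gaussian basis functions on a grid; the centers and width follow the figure, and everything else is an illustrative assumption.

% Gaussian basis functions phi_j(x) = exp(-(x - mu_j)^2/(2 s^2)); sketch.
muC = -1:0.2:1;                     % basis-function centers
s = 0.1;                            % width
xg = linspace(-1, 1, 200)';         % evaluation grid
Phi = exp(-(xg - muC).^2/(2*s^2));  % 200 x 11 matrix (implicit expansion, R2016b+)
plot(xg, Phi)                       % reproduces the basis-function plot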
The logistic sigmoid and the tanh function are related by
$$\tanh(a) = \frac{e^{a} - e^{-a}}{e^{a} + e^{-a}} = 2\sigma(2a) - 1, \qquad \text{where } \sigma(a) = \frac{1}{1 + e^{-a}}.$$
Wavelet-type basis functions are, however, useful only when the input is defined on a regular lattice (e.g. a time series or an image).
𝑦 𝒙, 𝒘 = 𝝓 𝒙 𝑇 𝒘
t = 𝑦 𝒙, 𝒘 + 𝜀
𝜀~𝒩(0, 𝛽−1 )
Maximizing the likelihood with respect to the noise precision gives
$$\frac{1}{\beta_{ML}} = \frac{1}{N}\sum_{n=1}^{N}\left( t_n - \mathbf{w}_{ML}^T\boldsymbol{\phi}(\mathbf{x}_n) \right)^2,$$
and the corresponding predictive distribution is
$$p(t\,|\,\mathbf{x}, \mathbf{w}_{ML}, \beta_{ML}) = \mathcal{N}\left( t \,|\, y(\mathbf{x}, \mathbf{w}_{ML}),\, \beta_{ML}^{-1} \right).$$
Maximum Likelihood and Least Squares
Assume observations from a deterministic function with added Gaussian noise:
$$t = y(\mathbf{x},\mathbf{w}) + \varepsilon, \qquad \text{where } p(\varepsilon\,|\,\beta) = \mathcal{N}(\varepsilon\,|\,0, \beta^{-1}).$$
Here 𝛽 is the noise precision. Under a squared loss function, the optimal prediction is the conditional mean 𝔼[𝑡|𝒙] = 𝑦(𝒙, 𝒘). A 2D example is shown below.
[Figure: linear fit $\mathbb{E}[t|\mathbf{x}] = w_0 + w_1 x_1 + w_2 x_2$ (left) and quadratic fit $\mathbb{E}[t|\mathbf{x}] = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_1^2 + w_4 x_2^2$ (right) fitted to 2D data. Run surfaceFitDemo from PMTK3.]
Given observed inputs $\mathbf{X} = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$ and targets $\mathbf{t} = (t_1, \ldots, t_N)^T$, we obtain the likelihood function
$$p(\mathbf{t}\,|\,\mathbf{X}, \mathbf{w}, \beta) = \prod_{n=1}^{N} \mathcal{N}\left( t_n \,|\, \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n),\, \beta^{-1} \right).$$
Instead of maximizing the log-likelihood, one can equivalently minimize the negative log-likelihood (NLL):
$$NLL(\mathbf{w},\beta) = -\sum_{n=1}^{N} \log \mathcal{N}\left( t_n \,|\, \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n),\, \beta^{-1} \right), \qquad RSS(\mathbf{w}) = \sum_{n=1}^{N}\left( t_n - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n) \right)^2, \qquad MSE = \frac{RSS}{N}.$$
𝑅𝑆𝑆 is often known as the residual sum of squares or sum of squared errors (𝑆𝑆𝐸)
and 𝑀𝑆𝐸 is the mean squared error.
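A short sketch of computing these quantities, assuming Phi is an N x M design matrix of basis-function values, t the targets, w the weights and beta the noise precision (all names are illustrative assumptions):

% RSS (= SSE), MSE and the Gaussian NLL for given w and beta; sketch only.
res = t - Phi*w;                 % residuals t_n - w'*phi(x_n)
RSS = sum(res.^2);               % residual sum of squares
MSE = RSS/numel(t);              % mean squared error
NLL = 0.5*beta*RSS - 0.5*numel(t)*log(beta) + 0.5*numel(t)*log(2*pi);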
Computing 𝒘 via MLE is the same as least squares. The NLL is a quadratic bowl with a unique minimum (the MLE estimate).
[Figure: data, truth and least-squares prediction (run residualsDemo from PMTK3); sum-of-squares error contours for linear regression in the (w0, w1) plane (run contoursSSEdemo from PMTK3).]
Maximum Likelihood and Least Squares
$$\ln p(\mathbf{t}\,|\,\mathbf{w},\beta) = \frac{N}{2}\ln\beta - \frac{N}{2}\ln(2\pi) - \beta E_D(\mathbf{w}), \qquad E_D(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\left( t_n - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n) \right)^2.$$
Setting the gradient of the log-likelihood (written here as a row vector) wrt 𝒘 equal to
zero:
$$\nabla_{\mathbf{w}} \ln p(\mathbf{t}\,|\,\mathbf{w},\beta) = \beta\sum_{n=1}^{N}\left( t_n - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n) \right)\boldsymbol{\phi}(\mathbf{x}_n)^T = \beta\left( \sum_{n=1}^{N} t_n \boldsymbol{\phi}(\mathbf{x}_n)^T - \mathbf{w}^T \sum_{n=1}^{N} \boldsymbol{\phi}(\mathbf{x}_n)\boldsymbol{\phi}(\mathbf{x}_n)^T \right) = 0$$
Solving for 𝒘:
$$\mathbf{w}^T \sum_{n=1}^{N} \boldsymbol{\phi}(\mathbf{x}_n)\boldsymbol{\phi}(\mathbf{x}_n)^T = \sum_{n=1}^{N} t_n \boldsymbol{\phi}(\mathbf{x}_n)^T \;\Rightarrow\; \mathbf{w}^T\boldsymbol{\Phi}^T\boldsymbol{\Phi} = \mathbf{t}^T\boldsymbol{\Phi} \;\Rightarrow\; \mathbf{w}_{ML} = \left( \boldsymbol{\Phi}^T\boldsymbol{\Phi} \right)^{-1}\boldsymbol{\Phi}^T\mathbf{t},$$
where $\boldsymbol{\Phi}$ is the $N \times M$ design matrix,
$$\boldsymbol{\Phi} = \begin{pmatrix}
\phi_0(\mathbf{x}_1) & \phi_1(\mathbf{x}_1) & \cdots & \phi_{M-1}(\mathbf{x}_1) \\
\phi_0(\mathbf{x}_2) & \phi_1(\mathbf{x}_2) & \cdots & \phi_{M-1}(\mathbf{x}_2) \\
\vdots & \vdots & & \vdots \\
\phi_0(\mathbf{x}_N) & \phi_1(\mathbf{x}_N) & \cdots & \phi_{M-1}(\mathbf{x}_N)
\end{pmatrix},$$
with row $i$ equal to $\boldsymbol{\phi}(\mathbf{x}_i)^T = \left( \phi_0(\mathbf{x}_i),\, \phi_1(\mathbf{x}_i),\, \ldots,\, \phi_{M-1}(\mathbf{x}_i) \right)$ and column $j$ equal to $\boldsymbol{\varphi}_j = \left( \phi_j(\mathbf{x}_1),\, \phi_j(\mathbf{x}_2),\, \ldots,\, \phi_j(\mathbf{x}_N) \right)^T$.
We can also maximize the log-likelihood with respect to the noise precision 𝛽, which gives
$$\frac{1}{\beta_{ML}} = \frac{2}{N}\, E_D(\mathbf{w}_{ML}) = \frac{1}{N}\sum_{n=1}^{N}\left( t_n - \mathbf{w}_{ML}^T\boldsymbol{\phi}(\mathbf{x}_n) \right)^2.$$
So the MLE noise variance $1/\beta_{ML}$ is equal to the residual variance of the target values around the regression function.
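A minimal sketch of the full maximum-likelihood fit given a design matrix Phi and targets t (illustrative names; in practice the backslash operator Phi\t is numerically preferable to forming the normal equations explicitly):

% Maximum-likelihood (least-squares) weights and noise precision; sketch.
wML = (Phi'*Phi) \ (Phi'*t);     % w_ML = (Phi'*Phi)^(-1) * Phi' * t
% wML = Phi \ t;                 % equivalent, numerically preferable
res = t - Phi*wML;
betaML = 1/mean(res.^2);         % 1/beta_ML = (1/N) sum_n (t_n - w_ML'*phi(x_n))^2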
Some models of interest do not have concave likelihoods, and for these only locally optimal MLE estimates can be obtained.
[Figure: examples of a convex function, a concave function, and a function that is neither convex nor concave, for which A and B are local minima (run convexFnHand from PMTK3). For instance, $a^2$ and $e^{a}$ are convex, while $\log a$ ($a > 0$) is concave.]
If the error function comprises a sum over the data points,
$$E_D(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\left( t_n - \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n) \right)^2 = \sum_{n=1}^{N} E_n,$$
then after presentation of pattern 𝑛, the stochastic gradient descent algorithm
updates the parameter vector 𝒘 using
$$\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta\,\nabla E_n = \mathbf{w}^{(\tau)} + \eta\left( t_n - \mathbf{w}^{(\tau)T}\boldsymbol{\phi}(\mathbf{x}_n) \right)\boldsymbol{\phi}(\mathbf{x}_n),$$
where 𝜏 denotes the iteration number and 𝜂 is the learning rate.
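A hedged sketch of this sequential (LMS-type) scheme; the initialization, learning rate and number of passes are illustrative assumptions.

% Stochastic gradient descent for the sum-of-squares error; sketch only.
[N, M] = size(Phi);
w = zeros(M, 1);                 % initial weights
eta = 0.01;                      % learning rate (assumed)
for epoch = 1:100
    for n = randperm(N)          % present the patterns in random order
        phin = Phi(n, :)';                        % phi(x_n)
        w = w + eta*(t(n) - w'*phin)*phin;        % w <- w + eta (t_n - w'phi_n) phi_n
    end
end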
An alternative to the Laplace likelihood is the Huber loss,
$$L_H(r,\delta) = \begin{cases} r^2/2, & |r| \le \delta, \\ \delta |r| - \delta^2/2, & |r| > \delta, \end{cases}$$
which behaves like the ℓ2 loss for small residuals and like the ℓ1 loss for large ones. It is everywhere differentiable, using the fact that 𝑑/𝑑𝑟 |𝑟| = sign(𝑟) if 𝑟 ≠ 0. The function is also 𝐶1 continuous, since the gradients of the two parts of the function match at 𝑟 = ±𝛿.
[Figure: comparison of the squared, absolute and Huber loss functions. Run huberLossDemo from PMTK3.]
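A minimal sketch of the Huber loss, its derivative, and a robust fit obtained by handing the resulting smooth objective to a generic optimizer; delta, the optimizer choice and the names X, t are assumptions.

% Huber loss L_H(r, delta) and its C^1-continuous derivative; sketch only.
delta = 1.0;
huber = @(r) (abs(r) <= delta).*(0.5*r.^2) + ...
        (abs(r) > delta).*(delta*abs(r) - 0.5*delta^2);
dhuber = @(r) (abs(r) <= delta).*r + (abs(r) > delta).*(delta*sign(r));
% Robust linear regression: minimize sum_i L_H(t_i - x_i'*w, delta).
obj = @(w) sum(huber(t - X*w));
wRobust = fminsearch(obj, zeros(size(X, 2), 1));   % derivative-free option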
Optimizing the Huber loss is much faster than using the Laplace likelihood, since we can use
standard optimization methods (quasi-Newton) rather than linear programming.
The Huber method does have a probabilistic interpretation, although it is rather unnatural (Pontil et al. 1998).
Pontil, M., S. Mukherjee, and F. Girosi (1998). On the Noise Model of Support Vector Machine Regression. Technical
report, MIT AI Lab.
Huber, P. J. (1964). Robust Estimation of a Location Parameter. Annals of Mathematical Statistics 35(1), 73-101.
Robust Linear Regression
Using the Laplace distribution leads to 𝐿1 error norm (non-linear objective function) that is
difficult to optimize.
Using the split variable trick, define $r_i \triangleq r_i^+ - r_i^-$; the ℓ1 objective can then be minimized by solving the linear program
$$\min_{\mathbf{w},\, r_i^+,\, r_i^-} \; \sum_i \left( r_i^+ + r_i^- \right) \qquad \text{s.t.} \quad r_i^+ \ge 0, \quad r_i^- \ge 0, \quad \mathbf{w}^T\mathbf{x}_i + r_i^+ - r_i^- = t_i.$$
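A hedged sketch of solving this linear program with MATLAB's linprog from the Optimization Toolbox; the construction of the matrices and the names X, t are illustrative assumptions.

% L1 (robust) regression as a linear program over z = [w; r+; r-]; sketch.
[N, D] = size(X);
f = [zeros(D, 1); ones(N, 1); ones(N, 1)];     % minimize sum_i (r_i^+ + r_i^-)
Aeq = [X, eye(N), -eye(N)];                    % X*w + r+ - r- = t
beq = t;
lb = [-inf(D, 1); zeros(N, 1); zeros(N, 1)];   % w free, r+ >= 0, r- >= 0
z = linprog(f, [], [], Aeq, beq, lb, []);
wL1 = z(1:D);                                  % robust weight estimate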