
Introduction to Linear Regression

Prof. Nicholas Zabaras
Email: [email protected]
URL: https://www.zabaras.com/

September 23, 2020

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras


Contents
 Bayesian Computing and Machine Learning, Motivation to Bayesian inference via a regression example, Overfitting, Effect of Data Size, Training and Test Errors, Overfitting and MLE, Regularization and Model Complexity
 Prior Modeling, Bayesian Inference and Prediction, Frequentist vs Bayesian Paradigm, Bias in MLE (Gaussian Example)
 Linear basis function models, MLE and Least Squares, Convexity of the NLL, Sequential Learning of the MLE, Robust Linear Regression, Huber Loss Function

 Chris Bishop's PRML book, Chapters 1 and 2
 Kevin Murphy's Machine Learning: A Probabilistic Perspective, Chapter 5
 C. P. Robert, The Bayesian Choice: From Decision-Theoretic Motivations to Computational Implementation, Springer-Verlag, NY, 2001 (online resource)
 A. Gelman, J. B. Carlin, H. S. Stern and D. B. Rubin, Bayesian Data Analysis, Chapman and Hall/CRC Press, 2nd Edition, 2003.
 J.-M. Marin and C. P. Robert, The Bayesian Core, Springer-Verlag, 2007 (online resource)
 Bayesian Statistics for Engineering, Online Course at Georgia Tech, B. Vidakovic.
Goals
 The goals for today's lecture are the following:

 Understand the fundamentals of regression problems: overfitting, the effect of data size, and the behavior of training and test errors
 Understand how the least squares solution arises as the MLE under a Gaussian likelihood model
 Learn about basis functions and feature space, and setting up the data design matrix


Typical Problems in Machine Learning
 Pattern Recognition: automatically classifying data into different categories and using these categories to take actions.

 Example: handwritten digit recognition.
   Input: a vector 𝒙 of pixel values.
   Output: a digit from 0 to 9.


Typical Problems in Machine Learning
 In the digits example, a large set of input vectors 𝒙1, . . . , 𝒙𝑁, or a training set, is used to tune the parameters of an adaptive model.

 The category of an input vector is expressed using a target vector 𝒕 (identifying the corresponding digit).

 The result of a machine learning algorithm is a function 𝑦(𝒙) whose output 𝑦 is encoded in the same way as the target vectors, for any input 𝒙.


Terminology
 Training or learning phase: determine 𝑦(𝒙) on the basis of the training data.

 Test set, generalization

 Feature extraction

 Data pre-processing (rotation, scaling, etc.)

 Using lower-dimensional representation of the input and test data

 Supervised learning (input & target vectors in the training data)

 Classification (discrete categories) or regression (continuous variables)



Terminology
 Unsupervised learning (no target vectors in the training data), e.g. clustering or density estimation.
 Reinforcement learning (Richard S. Sutton and Andrew G. Barto)
   credit assignment (rewards attributed to different moves at the end of a game)
   exploration (of new actions)
   exploitation (of high reward actions).


Motivation Example: Polynomial Curve Fitting
 Problem definition: we are implicitly trying to discover the underlying (generating) function sin(2πx) from a set of data.
 Some data points are known, 𝐱 = (x_1, ..., x_N)^T, as well as the corresponding target values 𝐭 = (t_1, ..., t_N)^T.
 We fit the data using a polynomial function of the form

   y(x, \mathbf{w}) = w_0 + w_1 x + \ldots + w_M x^M = \sum_{j=0}^{M} w_j x^j

[Figure: polynomial of order M = 0 fitted to the training data; legend: fitting curve, original curve sin(2πx), random data points for fitting. MatLab Code]


Motivation Example: Polynomial Curve Fitting
 The values of the coefficients are determined by fitting the polynomial

   y(x, \mathbf{w}) = \sum_{j=0}^{M} w_j x^j

 to the training data. This can be done by minimizing an error function that measures the misfit between the function y(x, 𝒘), for any given value of 𝒘, and the training set data points:

   \min_{\mathbf{w}} E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \left[ y(x_n, \mathbf{w}) - t_n \right]^2

 Setting the derivatives of E(𝒘) with respect to the coefficients to zero gives a linear system for 𝒘:

   \sum_{j=0}^{M} A_{ij} w_j = T_i, \qquad A_{ij} = \sum_{n=1}^{N} x_n^{\,i+j}, \qquad T_i = \sum_{n=1}^{N} x_n^{\,i}\, t_n
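As a concrete illustration of assembling and solving this linear system, here is a minimal NumPy sketch (the lecture's demos are in MatLab; the function name, data set and noise level below are my own choices for the example):

```python
import numpy as np

def fit_polynomial(x, t, M):
    """Least-squares polynomial fit by solving sum_j A_ij w_j = T_i."""
    powers = np.arange(M + 1)
    A = np.array([[np.sum(x ** (i + j)) for j in powers] for i in powers])
    T = np.array([np.sum((x ** i) * t) for i in powers])
    return np.linalg.solve(A, T)          # coefficients w_0, ..., w_M

# Example: noisy samples of sin(2*pi*x)
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.shape)
print(fit_polynomial(x, t, M=3))
```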


Polynomial Curve Fitting
[Figure: polynomial of order M = 0 fitted to the training data (fitting curve, original curve sin(2πx), random data points for fitting). MatLab Code]


Polynomial Curve Fitting
[Figure: polynomial of order M = 1 fitted to the training data. MatLab Code]

 1st order (linear) polynomials give a rather poor fit to the data and to sin(2πx).


Polynomial Curve Fitting
[Figure: polynomial of order M = 3 fitted to the training data. MatLab Code]

 The 3rd order polynomial seems to give the best fit to the function sin(2πx).


Overfitting
[Figure: polynomial of order M = 9 fitted to the training data. MatLab Code]

 For M = 9 we obtain a perfect fit to the training data. However, the fitted curve oscillates wildly and gives a very poor representation of sin(2πx). This is known as overfitting.
Training and Test Errors
 We often use the root-mean-square (RMS) error. The division by N allows us to compare different data sizes, and the square root makes E_RMS have the same units as t:

   E_{RMS} = \left[ 2 E(\mathbf{w}^*) / N \right]^{1/2}

 The test set error is a measure of how well we are doing in predicting the values of t for new data observations of 𝒙.

[Figure: root-mean-square error evaluated on the training and test data for various M. MatLab Code]
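For illustration, a small NumPy sketch (not the lecture's MatLab code) that reproduces this kind of train/test E_RMS comparison; np.polyfit may warn about conditioning at the largest M, and the data set sizes and noise level are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(n):
    x = rng.uniform(0.0, 1.0, size=n)
    return x, np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)

x_train, t_train = make_data(10)
x_test, t_test = make_data(100)

for M in range(10):
    coeffs = np.polyfit(x_train, t_train, deg=M)   # least-squares polynomial fit
    def e_rms(x, t):
        return np.sqrt(np.mean((np.polyval(coeffs, x) - t) ** 2))
    print(f"M={M}  train={e_rms(x_train, t_train):.3f}  test={e_rms(x_test, t_test):.3f}")
```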
Training and Test Errors
 Small values of M give relatively large values of the test set error. The corresponding polynomials are incapable of capturing the oscillations in sin(2πx).
 Values 3 < M ≤ 8 give small values for the test set error, and these also give a reasonable representation of sin(2πx).

[Figure: root-mean-square error evaluated on the training and test data for various M. MatLab Code]
Overfitting
 Obtain insight into the problem by examining the values of 𝒘 obtained from polynomials of various order.
 As M increases, the magnitude of the coefficients typically gets larger.
 For M = 9, the coefficients have become finely tuned to the data by developing large positive and negative values.
 The more flexible polynomials with larger values of M are becoming increasingly tuned to the random noise on the target values.

Fitted coefficients (scaled by 1.0e+006) for polynomials of various order (MatLab Code):

          M = 0     M = 1     M = 6     M = 9
  w0*    0.0000    0.0000    0.0000    0.0000
  w1*         0   -0.0000    0.0000    0.0002
  w2*         0         0   -0.0000   -0.0053
  w3*         0         0    0.0000    0.0486
  w4*         0         0         0   -0.2316
  w5*         0         0         0    0.6399
  w6*         0         0         0   -1.0616
  w7*         0         0         0    1.0422
  w8*         0         0         0   -0.5576
  w9*         0         0         0    0.1252


Varying the Data Size
 It is also interesting to examine the behavior of a given model as the size of
the data set is varied.
[Figure: M = 9 polynomial fitted to N = 15 data points; increasing N reduces over-fitting. MatLab Code]


Varying the Data Size
 For a given model complexity, the over-fitting problem becomes less severe as the size of the data set increases, i.e. the larger the data set, the more complex (in other words, more flexible) the model that we can afford to fit to the data.

[Figure: M = 9 polynomial fitted to N = 100 data points; increasing N reduces over-fitting. MatLab Code]


Overfitting and Maximum Likelihood
 It is not reasonable to limit the number of parameters in a model (the model complexity) according to the size of the available training set.

 It makes more sense to choose the complexity of the model according to the complexity of the problem being solved.

 The least squares approach to finding the model parameters is a specific case of maximum likelihood.

 The over-fitting problem is a general property of maximum likelihood.


Overfitting and Bayesian Approach
 By adopting a Bayesian approach, the over-fitting problem can be avoided.

 There is no difficulty in employing models for which the number of parameters greatly exceeds the number of data points.

 In a Bayesian model, "the effective number" of parameters adapts automatically to the size of the data set.


Regularization Technique
 To control the over-fitting we use regularization, e.g. adding "a penalty term" to the error function in order to discourage the coefficients from reaching large values.
 The simplest such penalty term takes the form of a sum of squares of all of the coefficients, leading to a modified error function of the form

   \tilde{E}(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \left[ y(x_n, \mathbf{w}) - t_n \right]^2 + \frac{\lambda}{2} \lVert \mathbf{w} \rVert^2

 The regularization parameter λ governs the relative importance of the regularization term compared with the sum-of-squares error term.
 The minimizer is similar to that given earlier but with

   \sum_{j=0}^{M} A_{ij} w_j = T_i, \qquad A_{ij} = \sum_{n=1}^{N} x_n^{\,i+j} + \lambda \delta_{ij}, \qquad T_i = \sum_{n=1}^{N} x_n^{\,i}\, t_n
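A minimal NumPy sketch of this regularized linear system (an illustration only, not the lecture's MatLab code; the data set is synthetic and λ = e^(-18) simply mirrors the example on the next slides):

```python
import numpy as np

def fit_polynomial_regularized(x, t, M, lam):
    """Regularized least squares: A_ij = sum_n x_n^(i+j) + lam * delta_ij."""
    powers = np.arange(M + 1)
    A = np.array([[np.sum(x ** (i + j)) for j in powers] for i in powers])
    A += lam * np.eye(M + 1)
    T = np.array([np.sum((x ** i) * t) for i in powers])
    return np.linalg.solve(A, T)

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.shape)

w_reg = fit_polynomial_regularized(x, t, M=9, lam=np.exp(-18))
print(np.round(w_reg, 2))   # with ln(lambda) = -18 the coefficients are tamer than the unregularized M = 9 fit
```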


Regularization Controls Model Complexity
[Figure: M = 9 fit using regularized least squares, ln λ = -18. MatLab Code]

 We now fit the polynomial of order M = 9 to the same data set as before, but now using the regularized error function.
 For ln λ = -18, the over-fitting has been suppressed and we obtain a much closer representation of sin(2πx).
Regularization Controls Model Complexity
[Figure: M = 9 fit using regularized least squares, ln λ = 0. MatLab Code]

 If, however, we use too large a value for λ then we again obtain a poor fit, as shown for ln λ = 0.
Regularization Controls Model Complexity
[Figure: root-mean-square error on the training and test data versus ln λ for M = 9. MatLab Code]

 λ controls the effective complexity of the model and hence determines the degree of over-fitting.
 We will soon re-examine this problem with a Bayesian approach that avoids the over-fitting problem.
MLE, Regularization and Model Complexity
 Effective model complexity in MLE is governed by the number of basis functions and is controlled by the size of the data set.

 With regularization, the effective model complexity is controlled mainly by λ, though it still depends on the number and form of the basis functions.

 Thus the model complexity for a particular problem cannot be decided simply by maximizing the likelihood function, as this leads to excessively complex models and over-fitting.

 Independent hold-out data can be used to determine model complexity, but this wastes data and is computationally expensive.


Frequentist Versus Bayesian Paradigms
 The likelihood p(𝒟|𝒘) is essential in both the Bayesian and frequentist approaches, but it is used in different roles.
 In a frequentist approach:
   𝒘 is a fixed parameter computed by an estimator (e.g. the maximum likelihood estimator).
   Error bars on this point estimate are computed by considering the distribution of all possible data sets 𝒟 (e.g. the variability of predictions between different bootstrap data sets).
 In a Bayesian approach:
   there is only one set of data 𝒟, and
   the uncertainty in 𝒘 is expressed through an appropriate prior and by computing posterior probabilities over 𝒘.


Prior Knowledge is Essential
 We cannot do everything simply based on data – prior knowledge is essential
to inference and prediction



Bayesian Probabilities
 For example, in the regression problem with observed data 𝒟 = {t_1, ..., t_N}, we can obtain the conditional probability p(𝒘|𝒟) by Bayes' theorem:

   p(\mathbf{w} \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \mathbf{w})\, p(\mathbf{w})}{p(\mathcal{D})}, \qquad p(\mathcal{D}) = \int p(\mathcal{D} \mid \mathbf{w})\, p(\mathbf{w})\, d\mathbf{w}

 The quantity p(𝒟|𝒘) on the right-hand side of Bayes' theorem is evaluated for the observed data set 𝒟 and can be viewed as a function of the parameter vector 𝒘 (the likelihood function).

 Given this definition of the likelihood, we can state Bayes' theorem in words:

   Posterior ∝ Likelihood × Prior


Gaussian Distribution
 Consider the Gaussian distribution

   \mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left\{ -\frac{1}{2\sigma^2}(x-\mu)^2 \right\}

 The likelihood function for the Gaussian distribution is

   p(\mathcal{D} \mid \mu, \sigma^2) = \prod_{n=1}^{N} \mathcal{N}(x_n \mid \mu, \sigma^2)

 The log-likelihood takes the form*

   \ln p(\mathcal{D} \mid \mu, \sigma^2) = -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (x_n-\mu)^2 - \frac{N}{2} \ln \sigma^2 - \frac{N}{2} \ln(2\pi)

 Maximum likelihood solution:

   \mu_{ML} = \frac{1}{N} \sum_{n=1}^{N} x_n, \qquad \sigma^2_{ML} = \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu_{ML})^2

* We often work with the log-likelihood to avoid underflow (from taking products of small probabilities) and to simplify the algebra.


MLE for a Gaussian Distribution

   \mu_{ML} = \frac{1}{N}\sum_{i=1}^{N} x_i \;\; \text{(sample mean)}, \qquad \sigma^2_{ML} = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu_{ML})^2 \;\; \text{(sample variance wrt the ML mean, not the exact mean)}

 The MLE approach underestimates the variance (bias) – this is at the root of the over-fitting problem, e.g. in polynomial curve fitting.
 The maximum likelihood solutions μ_ML, σ²_ML are functions of the data set values x_1, ..., x_N. Consider the expectations of these quantities with respect to the data set values, which come from a Gaussian.
 Using the point estimates above, you can show that

   \mathbb{E}[\mu_{ML}] = \mu, \qquad \mathbb{E}[\sigma^2_{ML}] = \frac{N-1}{N}\, \sigma^2

 In this derivation you need to use \mathbb{E}[x_i x_j] = \mu^2 for i ≠ j and \mathbb{E}[x_i^2] = \mu^2 + \sigma^2.
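A quick Monte Carlo check of these two expectations (an illustrative sketch, not from the slides; the true μ, σ², the sample size N = 2 and the number of trials are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma2, N, trials = 0.0, 1.0, 2, 100_000

samples = rng.normal(mu, np.sqrt(sigma2), size=(trials, N))
mu_ml = samples.mean(axis=1)                                # sample mean per data set
var_ml = ((samples - mu_ml[:, None]) ** 2).mean(axis=1)     # variance about the sample mean

print(np.mean(mu_ml))    # close to mu = 0
print(np.mean(var_ml))   # close to (N-1)/N * sigma2 = 0.5, not 1.0
```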


MLE: Underestimating the Variance
 In the schematic from Bishop's PRML, we consider 3 cases, each with 2 data points drawn from the true Gaussian.

 The mean of the 3 distributions predicted via MLE (i.e. averaged over the data) is correct.

 However, the variance is underestimated since it is a variance with respect to the sample mean and NOT the true mean.

[Figure: the true Gaussian and the MLE Gaussian estimated from each pair of data points.]
Probabilistic Regression Model
 Before we proceed in a follow-up lecture with Bayesian regression, let us first introduce a probabilistic setting for regression.
 Supervised learning: N observations {𝒙_n} with corresponding target values {t_n} are provided.
 The goal is to predict t for a new value 𝒙. We would eventually like to compute the predictive distribution p(t|𝒙).
 We will start by constructing a function y(𝒙) that is a prediction of t.
 We are interested in a linear combination – a regression – over a "fixed set" of nonlinear basis functions.
 In the remainder of the lecture we will focus on MLE models, and in the follow-up we discuss different forms of regularization.


Linear Regression
 From the conditional distribution 𝑝(𝑡|𝒙), we can make point estimates of 𝑡 for a given
𝒙 by minimizing a `loss function’.

 For a quadratic loss function, the optimal point estimate is the conditional mean
𝑦(𝒙, 𝒘) = 𝔼[𝑡|𝒙].



Linear Regression
 The simplest linear model for regression is one that involves a linear combination of the input variables,

   y(\mathbf{x}, \mathbf{w}) = w_0 + w_1 x_1 + \ldots + w_D x_D

 where \mathbf{x} = (x_0 = 1, x_1, \ldots, x_D)^T.

 This is often simply known as linear regression.

 D is the input dimensionality.

 This model is a linear function of the parameters.


Linear Basis Functions Models
 More generally, the regression function is

   y(\mathbf{x}, \mathbf{w}) = w_0 + w_1 \phi_1(\mathbf{x}) + \ldots + w_{M-1}\phi_{M-1}(\mathbf{x}) = \sum_{i=0}^{M-1} w_i \phi_i(\mathbf{x}) = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x})

 where the φ_i(𝒙) are known as basis functions and

   \boldsymbol{\phi}(\mathbf{x}) = \left(\phi_0(\mathbf{x}), \phi_1(\mathbf{x}), \phi_2(\mathbf{x}), \ldots, \phi_{M-1}(\mathbf{x})\right)^T, \qquad \phi_0(\mathbf{x}) = 1, \qquad w_0 = \text{bias}

 Often the φ_i(𝒙) represent features extracted from the data.


Polynomial and Gaussian Basis Functions
[Figure: polynomial basis functions x^j (left) and Gaussian basis functions with μ_j = -1:0.2:1 and s = 0.2 (right). MatLab code]

 Polynomial basis functions (scalar input, global support):

   \phi_j(x) = x^j

 Gaussian basis functions:

   \phi_j(x) = \exp\left( -\frac{(x-\mu_j)^2}{2 s^2} \right)
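For illustration, a NumPy sketch of these two families of basis functions (not the lecture's MatLab code; the grid of centers and the width s follow the values quoted in the figure):

```python
import numpy as np

def polynomial_basis(x, M):
    """phi_j(x) = x^j for j = 0, ..., M-1 (global support)."""
    return np.stack([x ** j for j in range(M)], axis=-1)

def gaussian_basis(x, centers, s):
    """phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)) (local support)."""
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2.0 * s ** 2))

x = np.linspace(-1.0, 1.0, 5)
centers = np.linspace(-1.0, 1.0, 11)              # mu_j = -1:0.2:1
print(polynomial_basis(x, M=4).shape)             # (5, 4)
print(gaussian_basis(x, centers, s=0.2).shape)    # (5, 11)
```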


Logistic Sigmoidal Basis Functions
[Figure: sigmoidal basis functions with μ_j = -1:0.2:1 and s = 0.1. MatLab code]

 Sigmoidal basis functions:

   \phi_j(x) = \sigma\!\left(\frac{x-\mu_j}{s}\right), \qquad \sigma(a) = \frac{1}{1+e^{-a}} \;\; \text{(logistic sigmoid function)}, \qquad \tanh(a) = \frac{e^{a}-e^{-a}}{e^{a}+e^{-a}} = 2\sigma(2a)-1


Sigmoidal and Tanh Basis Functions
 The sigmoidal and tanh basis functions are related:

   \tanh(a) = \frac{e^{a}-e^{-a}}{e^{a}+e^{-a}} = 2\sigma(2a)-1, \qquad \text{where } \sigma(a) = \frac{1}{1+e^{-a}}

 A general linear combination of logistic sigmoidal functions is equivalent to a linear combination of tanh functions:

   y(x, \mathbf{w}) = w_0 + \sum_{j=1}^{M-1} w_j\,\sigma\!\left(\frac{x-\mu_j}{s}\right)
                    = w_0 + \sum_{j=1}^{M-1} w_j\,\frac{1+\tanh\!\left(\frac{x-\mu_j}{2s}\right)}{2}
                    = u_0 + \sum_{j=1}^{M-1} u_j \tanh\!\left(\frac{x-\mu_j}{2s}\right)

 where

   u_0 = w_0 + \frac{1}{2}\sum_{j=1}^{M-1} w_j, \qquad u_j = \frac{w_j}{2}
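A small numerical check of this equivalence (an illustration only; the random inputs, centers and weights are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

x = rng.normal(size=200)
mu, s = rng.normal(size=5), 0.1
w0, w = rng.normal(), rng.normal(size=5)

y_sigmoid = w0 + np.sum(w * sigmoid((x[:, None] - mu) / s), axis=1)

u0 = w0 + 0.5 * np.sum(w)      # u_0 = w_0 + (1/2) sum_j w_j
u = 0.5 * w                    # u_j = w_j / 2
y_tanh = u0 + np.sum(u * np.tanh((x[:, None] - mu) / (2 * s)), axis=1)

print(np.allclose(y_sigmoid, y_tanh))   # True
```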


Choice of Basis Functions
 We are interested in functions of local support to explore adaptivity.

 Local support functions comprise a spectrum of different spatial frequencies.

 An example is wavelets that are local both spatially and in frequency.

 They are however useful only when the input is defined on a lattice.



Likelihood Model
 Consider a set of training data comprising N inputs 𝐱 = (x_1, ..., x_N)^T and the corresponding target values 𝐭 = (t_1, ..., t_N)^T.
 We assume that, given the value of x, the corresponding value of t has a Gaussian distribution with mean equal to the value y(x, 𝒘) and precision (inverse of the variance) β:

   p(t \mid x, \mathbf{w}, \beta) = \mathcal{N}\!\left(t \mid y(x, \mathbf{w}), \beta^{-1}\right), \qquad y(\mathbf{x}, \mathbf{w}) = \boldsymbol{\phi}(\mathbf{x})^T\mathbf{w}, \qquad t = y(\mathbf{x}, \mathbf{w}) + \varepsilon, \;\; \varepsilon \sim \mathcal{N}(0, \beta^{-1})


Likelihood Function
 The likelihood function is

   p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) = \prod_{n=1}^{N}\mathcal{N}\!\left(t_n \mid y(x_n, \mathbf{w}), \beta^{-1}\right)

 From this, the log-likelihood takes the form

   \ln p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) = -\frac{\beta}{2}\sum_{n=1}^{N}\left[y(x_n, \mathbf{w}) - t_n\right]^2 + \frac{N}{2}\ln\beta - \frac{N}{2}\ln(2\pi)


Maximum Likelihood and Least Squares
 Consider first the MLE estimate for 𝒘. Note that maximizing the log-likelihood to obtain 𝒘_ML is the same as minimizing the sum-of-squares error function (residual sum of squares, RSS(𝒘)):

   \max_{\mathbf{w}} \ln p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) = \max_{\mathbf{w}}\left\{-\frac{\beta}{2}\sum_{n=1}^{N}\left[y(x_n, \mathbf{w}) - t_n\right]^2 + \frac{N}{2}\ln\beta - \frac{N}{2}\ln(2\pi)\right\}
   \;\Rightarrow\; \mathbf{w}_{ML} = \operatorname*{argmin}_{\mathbf{w}} \frac{1}{2}\sum_{n=1}^{N}\left[y(x_n, \mathbf{w}) - t_n\right]^2

 We can also determine the MLE estimate of β:

   \frac{1}{\beta_{ML}} = \frac{1}{N}\sum_{n=1}^{N}\left[y(x_n, \mathbf{w}_{ML}) - t_n\right]^2

 We can now make a (frequentist, plug-in approximation) prediction as follows:

   p(t \mid x, \mathbf{w}_{ML}, \beta_{ML}) = \mathcal{N}\!\left(t \mid y(x, \mathbf{w}_{ML}), \beta_{ML}^{-1}\right)
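As an illustration of these plug-in estimates (a sketch, not the lecture's MatLab/PMTK3 demos; the polynomial feature map, data set and noise level are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=30)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

Phi = np.stack([x ** j for j in range(4)], axis=-1)    # simple polynomial features
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)         # minimizes the sum-of-squares error
residuals = Phi @ w_ml - t
beta_ml = 1.0 / np.mean(residuals ** 2)                # 1/beta_ML = (1/N) sum_n [y(x_n, w_ML) - t_n]^2

# Plug-in predictive distribution N(t | y(x*, w_ML), 1/beta_ML) at a new input x*
x_star = 0.25
phi_star = np.array([x_star ** j for j in range(4)])
print(phi_star @ w_ml, 1.0 / beta_ml)                  # predictive mean and variance
```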
Maximum Likelihood and Least Squares
 Assume observations from a deterministic function with added Gaussian noise:

   t = y(\mathbf{x}, \mathbf{w}) + \varepsilon, \qquad p(\varepsilon \mid \beta) = \mathcal{N}(\varepsilon \mid 0, \beta^{-1})

 which is the same as saying

   p(t \mid \mathbf{x}, \mathbf{w}, \beta) = \mathcal{N}\!\left(t \mid y(\mathbf{x}, \mathbf{w}), \beta^{-1}\right)

 Here β is the precision. This is based on a squared loss function for which 𝔼[t|𝒙] = y(𝒙, 𝒘). A 2D example is shown below.

[Figure: surface fits 𝔼[t|𝒙] = w_0 + w_1 x_1 + w_2 x_2 (left) and 𝔼[t|𝒙] = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_1² + w_4 x_2² (right). Run surfaceFitDemo from PMTK3.]


Maximum Likelihood and Least Squares
 Let us now introduce a regression function of the form y = 𝒘^T φ(𝒙).

 Given observed inputs X = {x_1, ..., x_N} and targets 𝐭 = (t_1, ..., t_N)^T, we obtain the likelihood function

   p(\mathbf{t} \mid X, \mathbf{w}, \beta) = \prod_{n=1}^{N}\mathcal{N}\!\left(t_n \mid \mathbf{w}^T\boldsymbol{\phi}(x_n), \beta^{-1}\right)

 We often use the log-likelihood

   \ell(\mathbf{w}, \beta) \equiv \log p(\mathcal{D} \mid \boldsymbol{\theta}) = \sum_{n=1}^{N}\log\mathcal{N}\!\left(t_n \mid \mathbf{w}^T\boldsymbol{\phi}(x_n), \beta^{-1}\right), \qquad \boldsymbol{\theta} = (\mathbf{w}, \beta)

 Instead of maximizing the log-likelihood, one can equivalently minimize the negative log-likelihood (NLL)

   \mathrm{NLL}(\mathbf{w}, \beta) = -\sum_{n=1}^{N}\log\mathcal{N}\!\left(t_n \mid \mathbf{w}^T\boldsymbol{\phi}(x_n), \beta^{-1}\right)


Maximum Likelihood and Least Squares
 Taking the log of the likelihood, we obtain

   \ln p(\mathbf{t} \mid \mathbf{w}, \beta) = \sum_{n=1}^{N}\ln\mathcal{N}\!\left(t_n \mid \mathbf{w}^T\boldsymbol{\phi}(x_n), \beta^{-1}\right) = \frac{N}{2}\ln\beta - \frac{N}{2}\ln(2\pi) - \beta E_D(\mathbf{w})

 where we have defined

   E_D(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\left[t_n - \mathbf{w}^T\boldsymbol{\phi}(x_n)\right]^2, \qquad RSS = \sum_{n=1}^{N}\left[t_n - \mathbf{w}^T\boldsymbol{\phi}(x_n)\right]^2, \qquad MSE = RSS/N

 RSS is often known as the residual sum of squares or sum of squared errors (SSE), and MSE is the mean squared error.
 Computing 𝒘 via MLE is the same as least squares. The NLL is a quadratic bowl with a unique minimum (the MLE estimate).

[Figures: data, predictions and residuals for a linear regression fit (run residualsDemo from PMTK3); sum-of-squares error contours in the (w_0, w_1) plane with the truth and the prediction marked (run contoursSSEdemo from PMTK3).]
Maximum Likelihood and Least Squares

   \ln p(\mathbf{t} \mid \mathbf{w}, \beta) = \frac{N}{2}\ln\beta - \frac{N}{2}\ln(2\pi) - \beta E_D(\mathbf{w}), \qquad E_D(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\left[t_n - \mathbf{w}^T\boldsymbol{\phi}(x_n)\right]^2

 Setting the gradient of the log-likelihood (written here as a row vector) with respect to 𝒘 equal to zero:

   \nabla_{\mathbf{w}} \ln p(\mathbf{t} \mid \mathbf{w}, \beta) = \beta\sum_{n=1}^{N}\left[t_n - \mathbf{w}^T\boldsymbol{\phi}(x_n)\right]\boldsymbol{\phi}(x_n)^T = \beta\left[\sum_{n=1}^{N} t_n\boldsymbol{\phi}(x_n)^T - \mathbf{w}^T\sum_{n=1}^{N}\boldsymbol{\phi}(x_n)\boldsymbol{\phi}(x_n)^T\right] = 0

 This equation can be solved for 𝒘:

   \mathbf{w}^T\sum_{n=1}^{N}\boldsymbol{\phi}(x_n)\boldsymbol{\phi}(x_n)^T = \sum_{n=1}^{N} t_n\boldsymbol{\phi}(x_n)^T \;\Rightarrow\; \mathbf{w}^T\boldsymbol{\Phi}^T\boldsymbol{\Phi} = \mathbf{t}^T\boldsymbol{\Phi} \;\Rightarrow\; \boldsymbol{\Phi}^T\boldsymbol{\Phi}\,\mathbf{w} = \boldsymbol{\Phi}^T\mathbf{t}


Maximum Likelihood and Least Squares

 We obtain the normal equation (ordinary least squares solution):

   \mathbf{w}_{ML} = \left(\boldsymbol{\Phi}^T\boldsymbol{\Phi}\right)^{-1}\boldsymbol{\Phi}^T\mathbf{t} = \boldsymbol{\Phi}^{\dagger}\mathbf{t}, \qquad \boldsymbol{\Phi}^{\dagger}: \text{Moore-Penrose pseudo-inverse}

 where we have defined the N × M design matrix

   \boldsymbol{\Phi} = \begin{pmatrix} \boldsymbol{\phi}(x_1)^T \\ \boldsymbol{\phi}(x_2)^T \\ \vdots \\ \boldsymbol{\phi}(x_N)^T \end{pmatrix}
   = \begin{pmatrix} \phi_0(x_1) & \phi_1(x_1) & \cdots & \phi_{M-1}(x_1) \\ \phi_0(x_2) & \phi_1(x_2) & \cdots & \phi_{M-1}(x_2) \\ \vdots & \vdots & & \vdots \\ \phi_0(x_N) & \phi_1(x_N) & \cdots & \phi_{M-1}(x_N) \end{pmatrix}
   = \begin{pmatrix} \boldsymbol{\varphi}_0 & \boldsymbol{\varphi}_1 & \cdots & \boldsymbol{\varphi}_{M-1} \end{pmatrix}

 with rows \boldsymbol{\phi}(x_i) = (\phi_0(x_i), \phi_1(x_i), \ldots, \phi_{M-1}(x_i))^T and columns \boldsymbol{\varphi}_i = (\phi_i(x_1), \phi_i(x_2), \ldots, \phi_i(x_N))^T.

 Note that indeed

   \boldsymbol{\Phi}^T\boldsymbol{\Phi} = \sum_{n=1}^{N}\boldsymbol{\phi}(x_n)\boldsymbol{\phi}(x_n)^T, \qquad \boldsymbol{\Phi}^T\mathbf{t} = \sum_{n=1}^{N} t_n\,\boldsymbol{\phi}(x_n)
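A minimal NumPy sketch of building a design matrix and applying the pseudo-inverse (an illustration only, not the lecture's MatLab code; the Gaussian basis centers, width and synthetic data are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, size=50)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

centers, s = np.linspace(0.0, 1.0, 9), 0.15

def design_matrix(x):
    """Rows phi(x_n)^T: a bias term phi_0 = 1 plus Gaussian basis functions."""
    xs = np.atleast_1d(x)
    gauss = np.exp(-(xs[:, None] - centers[None, :]) ** 2 / (2 * s ** 2))
    return np.hstack([np.ones((xs.shape[0], 1)), gauss])

Phi = design_matrix(x)              # N x M design matrix
w_ml = np.linalg.pinv(Phi) @ t      # Moore-Penrose pseudo-inverse solution
print(design_matrix(0.5) @ w_ml)    # prediction y(0.5, w_ML)
```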


Maximum Likelihood and Least Squares

 Recall the log-likelihood

   \ln p(\mathbf{t} \mid \mathbf{w}, \beta) = \sum_{n=1}^{N}\ln\mathcal{N}\!\left(t_n \mid \mathbf{w}^T\boldsymbol{\phi}(x_n), \beta^{-1}\right) = \frac{N}{2}\ln\beta - \frac{N}{2}\ln(2\pi) - \beta E_D(\mathbf{w})

 Maximizing it now with respect to β gives

   \frac{1}{\beta_{ML}} = \frac{2}{N} E_D(\mathbf{w}_{ML}) = \frac{1}{N}\sum_{n=1}^{N}\left[t_n - \mathbf{w}_{ML}^T\boldsymbol{\phi}(x_n)\right]^2

 So the MLE noise variance 1/β_ML is equal to the residual variance of the target values around the regression function.


Convexity of the NLL
 Convexity of the NLL (positive definite Hessian) leads to a unique, globally optimal MLE.

 Some models of interest do not have concave likelihoods, and then only locally optimal MLE estimates can be found.

[Figure: a convex function (e.g. θ², e^θ, -log θ for θ > 0), a concave function (e.g. log θ, √θ for θ > 0), and a function that is neither convex nor concave, with local minima A and B. Run convexFnHand from PMTK3.]


Sequential Learning: LMS Algorithm
 If the data set is large, we use sequential (on-line) algorithms.
 We apply the technique of stochastic (sequential) gradient descent.
 If the error function comprises a sum over data points, E = \sum_n E_n, e.g.

   E_D(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\left[t_n - \mathbf{w}^T\boldsymbol{\phi}(x_n)\right]^2

 then after presentation of pattern n, the stochastic gradient descent algorithm updates the parameter vector 𝒘 using

   \mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta\,\nabla E_n = \mathbf{w}^{(\tau)} + \eta\left[t_n - \mathbf{w}^{(\tau)T}\boldsymbol{\phi}(x_n)\right]\boldsymbol{\phi}(x_n)

 where τ is the iteration number and η is the learning rate parameter.

 This is known as the least-mean-squares or LMS algorithm.
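An illustrative LMS sketch in NumPy (not the lecture's code; the polynomial features, learning rate η and number of passes are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0.0, 1.0, size=200)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

M, eta = 4, 0.1                       # number of features and learning rate
w = np.zeros(M)
for sweep in range(50):               # repeated passes over the data
    for xn, tn in zip(x, t):
        phi_n = np.array([xn ** j for j in range(M)])
        w = w + eta * (tn - w @ phi_n) * phi_n      # LMS update after pattern n
print(w)                              # roughly approaches the batch least-squares solution
```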


Robust Linear Regression
 Using a Gaussian distribution for the noise,

   t = y(\mathbf{x}, \mathbf{w}) + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(\varepsilon \mid 0, \beta^{-1})

 can result in a poor fit, especially if we have outliers in the data.
 Squared error penalizes deviations quadratically, so points far from the line have more effect on the fit than points near the line.
 To achieve robustness to outliers one can replace the Gaussian with a distribution that has heavy tails (e.g. the Laplace distribution). Such a distribution assigns higher likelihood to outliers, without having to perturb the regression line to "explain" them:

   p(t \mid \mathbf{x}, \mathbf{w}, b) = \mathrm{Lap}\!\left(t \mid y(\mathbf{x}, \mathbf{w}), b\right) \propto \exp\!\left(-\frac{1}{b}\left|t - y(\mathbf{x}, \mathbf{w})\right|\right), \qquad \mathrm{NLL}(\mathbf{w}) = \sum_i \left|r_i(\mathbf{w})\right|, \;\; r_i(\mathbf{w}) \triangleq t_i - y(\mathbf{x}_i, \mathbf{w})

 An alternative, discussed next, is the Huber loss:

   L_H(r, \delta) = \begin{cases} r^2/2 & \text{if } |r| \le \delta \\ \delta|r| - \delta^2/2 & \text{if } |r| > \delta \end{cases}

[Figure: linear fits to data with outliers using least squares, the Laplace likelihood, a Student-t likelihood (dof = 0.630), and the Huber loss with δ = 1.0 and 5.0. Run linregRobustDemoCombined from PMTK3.]


Huber Loss Function
 The Huber loss and its derivative are

   L_H(r, \delta) = \begin{cases} r^2/2 & \text{if } |r| \le \delta \\ \delta|r| - \delta^2/2 & \text{if } |r| > \delta \end{cases}, \qquad
   \frac{d}{dr} L_H(r, \delta) = \begin{cases} r & \text{if } |r| \le \delta \\ \delta\,\mathrm{sign}(r) & \text{if } |r| > \delta \end{cases}

 This is equivalent to L2 for errors that are smaller than δ, and is equivalent to L1 for larger errors.
 This loss function is everywhere differentiable, using the fact that d/dr |r| = sign(r) if r ≠ 0.
 The function is also C¹ continuous, since the gradients of the two parts of the function match at r = ±δ.

[Figure: L2, L1 and Huber loss functions versus r. Run huberLossDemo from PMTK3.]

 Optimizing the Huber loss is much faster than using the Laplace likelihood, since we can use standard optimization methods (quasi-Newton) rather than linear programming.
 The Huber method has an unnatural probabilistic interpretation (Pontil et al. 1998).

 Pontil, M., S. Mukherjee, and F. Girosi (1998). On the Noise Model of Support Vector Machine Regression. Technical report, MIT AI Lab.
 Huber, P. (1964). Robust estimation of a location parameter. Annals of Mathematical Statistics 35(1), 73-101.
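A direct NumPy transcription of the Huber loss and its derivative as defined above (an illustrative sketch; the evaluation grid and δ are arbitrary):

```python
import numpy as np

def huber_loss(r, delta):
    """L_H(r, delta): quadratic for |r| <= delta, linear (L1-like) beyond."""
    small = np.abs(r) <= delta
    return np.where(small, 0.5 * r ** 2, delta * np.abs(r) - 0.5 * delta ** 2)

def huber_grad(r, delta):
    """d/dr L_H(r, delta): r in the quadratic region, delta * sign(r) outside."""
    return np.where(np.abs(r) <= delta, r, delta * np.sign(r))

r = np.linspace(-3.0, 3.0, 7)
print(huber_loss(r, delta=1.0))
print(huber_grad(r, delta=1.0))
```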
Robust Linear Regression
 Using the Laplace distribution leads to an L1 error norm (a non-differentiable objective function) that is difficult to optimize.

 A solution is to transform the problem (by increasing its dimension to 2N + M) into a linear programming problem. Splitting each residual as r_i ≜ r_i⁺ − r_i⁻,

   \min_{\mathbf{w},\, r_i^{+},\, r_i^{-}} \sum_i \left(r_i^{+} + r_i^{-}\right) \quad \text{s.t.} \quad r_i^{+} \ge 0, \;\; r_i^{-} \ge 0, \;\; \mathbf{w}^T\mathbf{x}_i + r_i^{+} - r_i^{-} = t_i \;\; \forall i

 Note that with this definition,

   r_i^{+} = \frac{1}{2}\left(|r_i| + r_i\right) = \begin{cases} r_i & \text{if } r_i \ge 0 \\ 0 & \text{if } r_i < 0 \end{cases}, \qquad
   r_i^{-} = \frac{1}{2}\left(|r_i| - r_i\right) = \begin{cases} 0 & \text{if } r_i \ge 0 \\ -r_i & \text{if } r_i < 0 \end{cases}, \qquad
   r_i^{+} + r_i^{-} = |r_i|

 Boyd, S. and L. Vandenberghe (2004). Convex Optimization. Cambridge University Press.
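A sketch of this linear-programming formulation using scipy.optimize.linprog (an illustration only; the synthetic data, the injected outliers and the feature map [1, x] are my own choices):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(3)
N, M = 40, 2
x = rng.uniform(0.0, 1.0, size=N)
t = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=N)
t[:3] += 4.0                                   # a few outliers

Phi = np.stack([np.ones(N), x], axis=-1)       # design matrix with columns [1, x]

# Decision variables: [w (M), r_plus (N), r_minus (N)]
c = np.concatenate([np.zeros(M), np.ones(N), np.ones(N)])   # minimize sum(r+ + r-)
A_eq = np.hstack([Phi, np.eye(N), -np.eye(N)])              # Phi w + r+ - r- = t
b_eq = t
bounds = [(None, None)] * M + [(0, None)] * (2 * N)         # w free, r+ and r- nonnegative

res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=bounds, method="highs")
print(res.x[:M])    # L1 fit; should stay close to [1, 2] despite the outliers
```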
