
Introduction to Linear Regression

Prof. Nicholas Zabaras
Email: [email protected]
URL: https://www.zabaras.com/

September 23, 2020

Statistical Computing and Machine Learning, Fall 2020, N. Zabaras


Contents
 Bayesian Computing and Machine Learning, Motivation to Bayesian inference via a regression example, Overfitting, Effect of Data Size, Training and Test Errors, Overfitting and MLE, Regularization and Model Complexity
 Prior Modeling, Bayesian Inference and Prediction, Frequentist vs Bayesian Paradigm, Bias in MLE (Gaussian Example)
 Linear basis function models, MLE and Least Squares, Convexity of the NLL, Sequential Learning of the MLE, Robust Linear Regression, Huber Loss Function

 Chris Bishop's PRML book, Chapters 1 and 2
 Kevin Murphy's Machine Learning: A Probabilistic Perspective, Chapter 5
 C. P. Robert, The Bayesian Choice: From Decision-Theoretic Motivations to Computational Implementation, Springer-Verlag, NY, 2001 (online resource)
 A. Gelman, J. B. Carlin, H. S. Stern and D. B. Rubin, Bayesian Data Analysis, Chapman and Hall/CRC Press, 2nd Edition, 2003.
 J.-M. Marin and C. P. Robert, The Bayesian Core, Springer-Verlag, 2007 (online resource)
 Bayesian Statistics for Engineering, Online Course at Georgia Tech, B. Vidakovic.
Goals
 The goals for today's lecture are the following:

 Understand the fundamentals of regression problems: overfitting, the effect of data size, and the behavior of training and test errors
 Understand how the least squares solution arises as the MLE under a Gaussian likelihood model
 Learn about basis functions and feature space, and setting up the data design matrix


Typical Problems in Machine Learning
 Pattern Recognition: automatically classifying data into different categories and using these categories to take actions.

 Example: handwritten digit recognition.
   Input: a vector 𝒙 of pixel values.
   Output: a digit from 0 to 9.


Typical Problems in Machine Learning
 In the digits example, a large set of input vectors 𝒙1, . . . , 𝒙𝑁, or a training set, is used to tune the parameters of an adaptive model.

 The category of an input vector is expressed using a target vector 𝒕 (identifying the corresponding digit).

 The result of a machine learning algorithm is a function 𝑦(𝒙) whose output 𝑦 is encoded in the same way as the target vectors, for any input 𝒙.


Terminology
 Training or learning phase: determine 𝑦(𝒙) on the basis of the training data.

 Test set, generalization

 Feature extraction

 Data pre-processing (rotation, scaling, etc.)

 Using lower-dimensional representation of the input and test data

 Supervised learning (input & target vectors in the training data)

 Classification (discrete categories) or regression (continuous variables)



Terminology
 Unsupervised learning (no target vectors in the training data), e.g. clustering or density estimation.
 Reinforcement learning (Richard S. Sutton and Andrew G. Barto)
   credit assignment (rewards attributed to different moves at the end of a game)
   exploration (of new actions)
   exploitation (of high reward actions).


Motivation Example: Polynomial Curve Fitting
 Problem definition: we are implicitly trying to discover the underlying (generating) function sin(2πx) from a set of data.
 Some data points are known, 𝐱 = (x_1, ..., x_N)^T, as well as the corresponding target values 𝐭 = (t_1, ..., t_N)^T.
 We fit the data using a polynomial function of the form

   y(x, \mathbf{w}) = w_0 + w_1 x + \ldots + w_M x^M = \sum_{j=0}^{M} w_j x^j

[Figure: polynomial of order M = 0 fitted to the training data; legend: fitting curve, original curve sin(2πx), random data points for fitting. MatLab Code]


Motivation Example: Polynomial Curve Fitting
 The values of the coefficients are determined by fitting the polynomial

   y(x, \mathbf{w}) = \sum_{j=0}^{M} w_j x^j

 to the training data. This can be done by minimizing an error function that measures the misfit between the function y(x, 𝒘), for any given value of 𝒘, and the training set data points:

   \min_{\mathbf{w}} E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \left[ y(x_n, \mathbf{w}) - t_n \right]^2

 Setting the derivatives of E(𝒘) with respect to the coefficients to zero gives a linear system for 𝒘:

   \sum_{j=0}^{M} A_{ij} w_j = T_i, \qquad A_{ij} = \sum_{n=1}^{N} x_n^{\,i+j}, \qquad T_i = \sum_{n=1}^{N} x_n^{\,i}\, t_n
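As a concrete illustration of assembling and solving this linear system, here is a minimal NumPy sketch (the lecture's demos are in MatLab; the function name, data set and noise level below are my own choices for the example):

```python
import numpy as np

def fit_polynomial(x, t, M):
    """Least-squares polynomial fit by solving sum_j A_ij w_j = T_i."""
    powers = np.arange(M + 1)
    A = np.array([[np.sum(x ** (i + j)) for j in powers] for i in powers])
    T = np.array([np.sum((x ** i) * t) for i in powers])
    return np.linalg.solve(A, T)          # coefficients w_0, ..., w_M

# Example: noisy samples of sin(2*pi*x)
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.shape)
print(fit_polynomial(x, t, M=3))
```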


Polynomial Curve Fitting
[Figure: polynomial of order M = 0 fitted to the training data (fitting curve, original curve sin(2πx), random data points for fitting). MatLab Code]


Polynomial Curve Fitting
[Figure: polynomial of order M = 1 fitted to the training data. MatLab Code]

 1st order (linear) polynomials give a rather poor fit to the data and to sin(2πx).


Polynomial Curve Fitting
[Figure: polynomial of order M = 3 fitted to the training data. MatLab Code]

 The 3rd order polynomial seems to give the best fit to the function sin(2πx).


Overfitting
[Figure: polynomial of order M = 9 fitted to the training data. MatLab Code]

 For M = 9 we obtain a perfect fit to the training data. However, the fitted curve oscillates wildly and gives a very poor representation of sin(2πx). This is known as overfitting.
Training and Test Errors
 We often use the root-mean-square (RMS) error. The division by N allows us to compare different data sizes, and the square root makes E_RMS have the same units as t:

   E_{RMS} = \left[ 2 E(\mathbf{w}^*) / N \right]^{1/2}

 The test set error is a measure of how well we are doing in predicting the values of t for new data observations of 𝒙.

[Figure: root-mean-square error evaluated on the training and test data for various M. MatLab Code]
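For illustration, a small NumPy sketch (not the lecture's MatLab code) that reproduces this kind of train/test E_RMS comparison; np.polyfit may warn about conditioning at the largest M, and the data set sizes and noise level are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(n):
    x = rng.uniform(0.0, 1.0, size=n)
    return x, np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=n)

x_train, t_train = make_data(10)
x_test, t_test = make_data(100)

for M in range(10):
    coeffs = np.polyfit(x_train, t_train, deg=M)   # least-squares polynomial fit
    def e_rms(x, t):
        return np.sqrt(np.mean((np.polyval(coeffs, x) - t) ** 2))
    print(f"M={M}  train={e_rms(x_train, t_train):.3f}  test={e_rms(x_test, t_test):.3f}")
```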
Training and Test Errors
 Small values of M give relatively large values of the test set error. The corresponding polynomials are incapable of capturing the oscillations in sin(2πx).
 Values 3 < M ≤ 8 give small values for the test set error, and these also give a reasonable representation of sin(2πx).

[Figure: root-mean-square error evaluated on the training and test data for various M. MatLab Code]
Overfitting
 Obtain insight into the problem by examining the values of 𝒘 obtained from polynomials of various order.
 As M increases, the magnitude of the coefficients typically gets larger.
 For M = 9, the coefficients have become finely tuned to the data by developing large positive and negative values.
 The more flexible polynomials with larger values of M are becoming increasingly tuned to the random noise on the target values.

Fitted coefficients (scaled by 1.0e+006) for polynomials of various order (MatLab Code):

          M = 0     M = 1     M = 6     M = 9
  w0*    0.0000    0.0000    0.0000    0.0000
  w1*         0   -0.0000    0.0000    0.0002
  w2*         0         0   -0.0000   -0.0053
  w3*         0         0    0.0000    0.0486
  w4*         0         0         0   -0.2316
  w5*         0         0         0    0.6399
  w6*         0         0         0   -1.0616
  w7*         0         0         0    1.0422
  w8*         0         0         0   -0.5576
  w9*         0         0         0    0.1252


Varying the Data Size
 It is also interesting to examine the behavior of a given model as the size of
the data set is varied.
[Figure: M = 9 polynomial fitted to N = 15 data points; increasing N reduces over-fitting. MatLab Code]


Varying the Data Size
 For a given model complexity, the over-fitting problem becomes less severe as the size of the data set increases, i.e. the larger the data set, the more complex (in other words, more flexible) the model that we can afford to fit to the data.

[Figure: M = 9 polynomial fitted to N = 100 data points; increasing N reduces over-fitting. MatLab Code]


Overfitting and Maximum Likelihood
 It is not reasonable to limit the number of parameters in a model (the model complexity) according to the size of the available training set.

 It makes more sense to choose the complexity of the model according to the complexity of the problem being solved.

 The least squares approach to finding the model parameters is a specific case of maximum likelihood.

 The over-fitting problem is a general property of maximum likelihood.


Overfitting and Bayesian Approach
 By adopting a Bayesian approach, the over-fitting problem can be avoided.

 There is no difficulty in employing models for which the number of parameters greatly exceeds the number of data points.

 In a Bayesian model, "the effective number" of parameters adapts automatically to the size of the data set.


Regularization Technique
 To control the over-fitting we use regularization, e.g. adding "a penalty term" to the error function in order to discourage the coefficients from reaching large values.
 The simplest such penalty term takes the form of a sum of squares of all of the coefficients, leading to a modified error function of the form

   \tilde{E}(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \left[ y(x_n, \mathbf{w}) - t_n \right]^2 + \frac{\lambda}{2} \lVert \mathbf{w} \rVert^2

 The regularization parameter λ governs the relative importance of the regularization term compared with the sum-of-squares error term.
 The minimizer is similar to that given earlier but with

   \sum_{j=0}^{M} A_{ij} w_j = T_i, \qquad A_{ij} = \sum_{n=1}^{N} x_n^{\,i+j} + \lambda \delta_{ij}, \qquad T_i = \sum_{n=1}^{N} x_n^{\,i}\, t_n
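A minimal NumPy sketch of this regularized linear system (an illustration only, not the lecture's MatLab code; the data set is synthetic and λ = e^(-18) simply mirrors the example on the next slides):

```python
import numpy as np

def fit_polynomial_regularized(x, t, M, lam):
    """Regularized least squares: A_ij = sum_n x_n^(i+j) + lam * delta_ij."""
    powers = np.arange(M + 1)
    A = np.array([[np.sum(x ** (i + j)) for j in powers] for i in powers])
    A += lam * np.eye(M + 1)
    T = np.array([np.sum((x ** i) * t) for i in powers])
    return np.linalg.solve(A, T)

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.shape)

w_reg = fit_polynomial_regularized(x, t, M=9, lam=np.exp(-18))
print(np.round(w_reg, 2))   # with ln(lambda) = -18 the coefficients are tamer than the unregularized M = 9 fit
```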


Regularization Controls Model Complexity
[Figure: M = 9 fit using regularized least squares, ln λ = -18. MatLab Code]

 We now fit the polynomial of order M = 9 to the same data set as before, but now using the regularized error function.
 For ln λ = -18, the over-fitting has been suppressed and we obtain a much closer representation of sin(2πx).
Regularization Controls Model Complexity
[Figure: M = 9 fit using regularized least squares, ln λ = 0. MatLab Code]

 If, however, we use too large a value for λ then we again obtain a poor fit, as shown for ln λ = 0.
Regularization Controls Model Complexity
[Figure: root-mean-square error on the training and test data versus ln λ for M = 9. MatLab Code]

 λ controls the effective complexity of the model and hence determines the degree of over-fitting.
 We will soon re-examine this problem with a Bayesian approach that avoids the over-fitting problem.
MLE, Regularization and Model Complexity
 Effective model complexity in MLE is governed by the number of basis functions and is controlled by the size of the data set.

 With regularization, the effective model complexity is controlled mainly by λ, though it still depends on the number and form of the basis functions.

 Thus the model complexity for a particular problem cannot be decided simply by maximizing the likelihood function, as this leads to excessively complex models and over-fitting.

 Independent hold-out data can be used to determine model complexity, but this wastes data and is computationally expensive.


Frequentist Versus Bayesian Paradigms
 The likelihood p(𝒟|𝒘) is essential in both the Bayesian and frequentist approaches, but it is used in different roles.
 In a frequentist approach:
   𝒘 is a fixed parameter computed by an estimator (e.g. the maximum likelihood estimator).
   Error bars on this point estimate are computed by considering the distribution of all possible data sets 𝒟 (e.g. the variability of predictions between different bootstrap data sets).
 In a Bayesian approach:
   there is only one set of data 𝒟, and
   the uncertainty in 𝒘 is expressed through an appropriate prior and by computing posterior probabilities over 𝒘.


Prior Knowledge is Essential
 We cannot do everything simply based on data – prior knowledge is essential
to inference and prediction



Bayesian Probabilities
 For example, in the regression problem with observed data 𝒟 = {t_1, ..., t_N}, we can obtain the conditional probability p(𝒘|𝒟) by Bayes' theorem:

   p(\mathbf{w} \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \mathbf{w})\, p(\mathbf{w})}{p(\mathcal{D})}, \qquad p(\mathcal{D}) = \int p(\mathcal{D} \mid \mathbf{w})\, p(\mathbf{w})\, d\mathbf{w}

 The quantity p(𝒟|𝒘) on the right-hand side of Bayes' theorem is evaluated for the observed data set 𝒟 and can be viewed as a function of the parameter vector 𝒘 (the likelihood function).

 Given this definition of the likelihood, we can state Bayes' theorem in words:

   Posterior ∝ Likelihood × Prior


Gaussian Distribution
 Consider the Gaussian distribution

   \mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left\{ -\frac{1}{2\sigma^2}(x-\mu)^2 \right\}

 The likelihood function for the Gaussian distribution is

   p(\mathcal{D} \mid \mu, \sigma^2) = \prod_{n=1}^{N} \mathcal{N}(x_n \mid \mu, \sigma^2)

 The log-likelihood takes the form*

   \ln p(\mathcal{D} \mid \mu, \sigma^2) = -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (x_n-\mu)^2 - \frac{N}{2} \ln \sigma^2 - \frac{N}{2} \ln(2\pi)

 Maximum likelihood solution:

   \mu_{ML} = \frac{1}{N} \sum_{n=1}^{N} x_n, \qquad \sigma^2_{ML} = \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu_{ML})^2

* We often work with the log-likelihood to avoid underflow (from taking products of small probabilities) and to simplify the algebra.


MLE for a Gaussian Distribution

   \mu_{ML} = \frac{1}{N}\sum_{i=1}^{N} x_i \;\; \text{(sample mean)}, \qquad \sigma^2_{ML} = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu_{ML})^2 \;\; \text{(sample variance wrt the ML mean, not the exact mean)}

 The MLE approach underestimates the variance (bias) – this is at the root of the over-fitting problem, e.g. in polynomial curve fitting.
 The maximum likelihood solutions μ_ML, σ²_ML are functions of the data set values x_1, ..., x_N. Consider the expectations of these quantities with respect to the data set values, which come from a Gaussian.
 Using the point estimates above, you can show that

   \mathbb{E}[\mu_{ML}] = \mu, \qquad \mathbb{E}[\sigma^2_{ML}] = \frac{N-1}{N}\, \sigma^2

 In this derivation you need to use \mathbb{E}[x_i x_j] = \mu^2 for i ≠ j and \mathbb{E}[x_i^2] = \mu^2 + \sigma^2.
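A quick Monte Carlo check of these two expectations (an illustrative sketch, not from the slides; the true μ, σ², the sample size N = 2 and the number of trials are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma2, N, trials = 0.0, 1.0, 2, 100_000

samples = rng.normal(mu, np.sqrt(sigma2), size=(trials, N))
mu_ml = samples.mean(axis=1)                                # sample mean per data set
var_ml = ((samples - mu_ml[:, None]) ** 2).mean(axis=1)     # variance about the sample mean

print(np.mean(mu_ml))    # close to mu = 0
print(np.mean(var_ml))   # close to (N-1)/N * sigma2 = 0.5, not 1.0
```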


MLE: Underestimating the Variance
 In the schematic from Bishop's PRML, we consider 3 cases, each with 2 data points drawn from the true Gaussian.

 The mean of the 3 distributions predicted via MLE (i.e. averaged over the data) is correct.

 However, the variance is underestimated since it is a variance with respect to the sample mean and NOT the true mean.

[Figure: the true Gaussian and the MLE Gaussian estimated from each pair of data points.]
Probabilistic Regression Model
 Before we proceed in a follow-up lecture with Bayesian regression, let us first introduce a probabilistic setting for regression.
 Supervised learning: N observations {𝒙_n} with corresponding target values {t_n} are provided.
 The goal is to predict t for a new value 𝒙. We would eventually like to compute the predictive distribution p(t|𝒙).
 We will start by constructing a function y(𝒙) that is a prediction of t.
 We are interested in a linear combination – a regression – over a "fixed set" of nonlinear basis functions.
 In the remainder of the lecture we will focus on MLE models, and in the follow-up we discuss different forms of regularization.


Linear Regression
 From the conditional distribution 𝑝(𝑡|𝒙), we can make point estimates of 𝑡 for a given
𝒙 by minimizing a `loss function’.

 For a quadratic loss function, the optimal point estimate is the conditional mean
𝑦(𝒙, 𝒘) = 𝔼[𝑡|𝒙].



Linear Regression
 The simplest linear model for regression is one that involves a linear combination of the input variables,

   y(\mathbf{x}, \mathbf{w}) = w_0 + w_1 x_1 + \ldots + w_D x_D

 where \mathbf{x} = (x_0 = 1, x_1, \ldots, x_D)^T.

 This is often simply known as linear regression.

 D is the input dimensionality.

 This model is a linear function of the parameters.


Linear Basis Functions Models
 More generally, the regression function is

   y(\mathbf{x}, \mathbf{w}) = w_0 + w_1 \phi_1(\mathbf{x}) + \ldots + w_{M-1}\phi_{M-1}(\mathbf{x}) = \sum_{i=0}^{M-1} w_i \phi_i(\mathbf{x}) = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x})

 where the φ_i(𝒙) are known as basis functions and

   \boldsymbol{\phi}(\mathbf{x}) = \left(\phi_0(\mathbf{x}), \phi_1(\mathbf{x}), \phi_2(\mathbf{x}), \ldots, \phi_{M-1}(\mathbf{x})\right)^T, \qquad \phi_0(\mathbf{x}) = 1, \qquad w_0 = \text{bias}

 Often the φ_i(𝒙) represent features extracted from the data.


Polynomial and Gaussian Basis Functions
[Figure: polynomial basis functions x^j (left) and Gaussian basis functions with μ_j = -1:0.2:1 and s = 0.2 (right). MatLab code]

 Polynomial basis functions (scalar input, global support):

   \phi_j(x) = x^j

 Gaussian basis functions:

   \phi_j(x) = \exp\left( -\frac{(x-\mu_j)^2}{2 s^2} \right)
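For illustration, a NumPy sketch of these two families of basis functions (not the lecture's MatLab code; the grid of centers and the width s follow the values quoted in the figure):

```python
import numpy as np

def polynomial_basis(x, M):
    """phi_j(x) = x^j for j = 0, ..., M-1 (global support)."""
    return np.stack([x ** j for j in range(M)], axis=-1)

def gaussian_basis(x, centers, s):
    """phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)) (local support)."""
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2.0 * s ** 2))

x = np.linspace(-1.0, 1.0, 5)
centers = np.linspace(-1.0, 1.0, 11)              # mu_j = -1:0.2:1
print(polynomial_basis(x, M=4).shape)             # (5, 4)
print(gaussian_basis(x, centers, s=0.2).shape)    # (5, 11)
```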


Logistic Sigmoidal Basis Functions
[Figure: sigmoidal basis functions with μ_j = -1:0.2:1 and s = 0.1. MatLab code]

 Sigmoidal basis functions:

   \phi_j(x) = \sigma\!\left(\frac{x-\mu_j}{s}\right), \qquad \sigma(a) = \frac{1}{1+e^{-a}} \;\; \text{(logistic sigmoid function)}, \qquad \tanh(a) = \frac{e^{a}-e^{-a}}{e^{a}+e^{-a}} = 2\sigma(2a)-1


Sigmoidal and Tanh Basis Functions
 The sigmoidal and tanh basis functions are related:

   \tanh(a) = \frac{e^{a}-e^{-a}}{e^{a}+e^{-a}} = 2\sigma(2a)-1, \qquad \text{where } \sigma(a) = \frac{1}{1+e^{-a}}

 A general linear combination of logistic sigmoidal functions is equivalent to a linear combination of tanh functions:

   y(x, \mathbf{w}) = w_0 + \sum_{j=1}^{M-1} w_j\,\sigma\!\left(\frac{x-\mu_j}{s}\right)
                    = w_0 + \sum_{j=1}^{M-1} w_j\,\frac{1+\tanh\!\left(\frac{x-\mu_j}{2s}\right)}{2}
                    = u_0 + \sum_{j=1}^{M-1} u_j \tanh\!\left(\frac{x-\mu_j}{2s}\right)

 where

   u_0 = w_0 + \frac{1}{2}\sum_{j=1}^{M-1} w_j, \qquad u_j = \frac{w_j}{2}
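A small numerical check of this equivalence (an illustration only; the random inputs, centers and weights are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

x = rng.normal(size=200)
mu, s = rng.normal(size=5), 0.1
w0, w = rng.normal(), rng.normal(size=5)

y_sigmoid = w0 + np.sum(w * sigmoid((x[:, None] - mu) / s), axis=1)

u0 = w0 + 0.5 * np.sum(w)      # u_0 = w_0 + (1/2) sum_j w_j
u = 0.5 * w                    # u_j = w_j / 2
y_tanh = u0 + np.sum(u * np.tanh((x[:, None] - mu) / (2 * s)), axis=1)

print(np.allclose(y_sigmoid, y_tanh))   # True
```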


Choice of Basis Functions
 We are interested in functions of local support to explore adaptivity.

 Local support functions comprise a spectrum of different spatial frequencies.

 An example is wavelets that are local both spatially and in frequency.

 They are however useful only when the input is defined on a lattice.



Likelihood Model
 Consider a set of training data comprising N inputs 𝐱 = (x_1, ..., x_N)^T and the corresponding target values 𝐭 = (t_1, ..., t_N)^T.
 We assume that, given the value of x, the corresponding value of t has a Gaussian distribution with mean equal to the value y(x, 𝒘) and precision (inverse of the variance) β:

   p(t \mid x, \mathbf{w}, \beta) = \mathcal{N}\!\left(t \mid y(x, \mathbf{w}), \beta^{-1}\right), \qquad y(\mathbf{x}, \mathbf{w}) = \boldsymbol{\phi}(\mathbf{x})^T\mathbf{w}, \qquad t = y(\mathbf{x}, \mathbf{w}) + \varepsilon, \;\; \varepsilon \sim \mathcal{N}(0, \beta^{-1})


Likelihood Function
 The likelihood function is

   p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) = \prod_{n=1}^{N}\mathcal{N}\!\left(t_n \mid y(x_n, \mathbf{w}), \beta^{-1}\right)

 From this, the log-likelihood takes the form

   \ln p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) = -\frac{\beta}{2}\sum_{n=1}^{N}\left[y(x_n, \mathbf{w}) - t_n\right]^2 + \frac{N}{2}\ln\beta - \frac{N}{2}\ln(2\pi)


Maximum Likelihood and Least Squares
 Consider first the MLE estimate for 𝒘. Note that maximizing the log-likelihood to obtain 𝒘_ML is the same as minimizing the sum-of-squares error function (residual sum of squares, RSS(𝒘)):

   \max_{\mathbf{w}} \ln p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) = \max_{\mathbf{w}}\left\{-\frac{\beta}{2}\sum_{n=1}^{N}\left[y(x_n, \mathbf{w}) - t_n\right]^2 + \frac{N}{2}\ln\beta - \frac{N}{2}\ln(2\pi)\right\}
   \;\Rightarrow\; \mathbf{w}_{ML} = \operatorname*{argmin}_{\mathbf{w}} \frac{1}{2}\sum_{n=1}^{N}\left[y(x_n, \mathbf{w}) - t_n\right]^2

 We can also determine the MLE estimate of β:

   \frac{1}{\beta_{ML}} = \frac{1}{N}\sum_{n=1}^{N}\left[y(x_n, \mathbf{w}_{ML}) - t_n\right]^2

 We can now make a (frequentist, plug-in approximation) prediction as follows:

   p(t \mid x, \mathbf{w}_{ML}, \beta_{ML}) = \mathcal{N}\!\left(t \mid y(x, \mathbf{w}_{ML}), \beta_{ML}^{-1}\right)
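As an illustration of these plug-in estimates (a sketch, not the lecture's MatLab/PMTK3 demos; the polynomial feature map, data set and noise level are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=30)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

Phi = np.stack([x ** j for j in range(4)], axis=-1)    # simple polynomial features
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)         # minimizes the sum-of-squares error
residuals = Phi @ w_ml - t
beta_ml = 1.0 / np.mean(residuals ** 2)                # 1/beta_ML = (1/N) sum_n [y(x_n, w_ML) - t_n]^2

# Plug-in predictive distribution N(t | y(x*, w_ML), 1/beta_ML) at a new input x*
x_star = 0.25
phi_star = np.array([x_star ** j for j in range(4)])
print(phi_star @ w_ml, 1.0 / beta_ml)                  # predictive mean and variance
```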
Maximum Likelihood and Least Squares
 Assume observations from a deterministic function with added Gaussian noise:

   t = y(\mathbf{x}, \mathbf{w}) + \varepsilon, \qquad p(\varepsilon \mid \beta) = \mathcal{N}(\varepsilon \mid 0, \beta^{-1})

 which is the same as saying

   p(t \mid \mathbf{x}, \mathbf{w}, \beta) = \mathcal{N}\!\left(t \mid y(\mathbf{x}, \mathbf{w}), \beta^{-1}\right)

 Here β is the precision. This is based on a squared loss function for which 𝔼[t|𝒙] = y(𝒙, 𝒘). A 2D example is shown below.

[Figure: surface fits 𝔼[t|𝒙] = w_0 + w_1 x_1 + w_2 x_2 (left) and 𝔼[t|𝒙] = w_0 + w_1 x_1 + w_2 x_2 + w_3 x_1² + w_4 x_2² (right). Run surfaceFitDemo from PMTK3.]


Maximum Likelihood and Least Squares
 Let us now introduce a regression function of the form y = 𝒘^T φ(𝒙).

 Given observed inputs X = {x_1, ..., x_N} and targets 𝐭 = (t_1, ..., t_N)^T, we obtain the likelihood function

   p(\mathbf{t} \mid X, \mathbf{w}, \beta) = \prod_{n=1}^{N}\mathcal{N}\!\left(t_n \mid \mathbf{w}^T\boldsymbol{\phi}(x_n), \beta^{-1}\right)

 We often use the log-likelihood

   \ell(\mathbf{w}, \beta) \equiv \log p(\mathcal{D} \mid \boldsymbol{\theta}) = \sum_{n=1}^{N}\log\mathcal{N}\!\left(t_n \mid \mathbf{w}^T\boldsymbol{\phi}(x_n), \beta^{-1}\right), \qquad \boldsymbol{\theta} = (\mathbf{w}, \beta)

 Instead of maximizing the log-likelihood, one can equivalently minimize the negative log-likelihood (NLL)

   \mathrm{NLL}(\mathbf{w}, \beta) = -\sum_{n=1}^{N}\log\mathcal{N}\!\left(t_n \mid \mathbf{w}^T\boldsymbol{\phi}(x_n), \beta^{-1}\right)


Maximum Likelihood and Least Squares
 Taking the log of the likelihood, we obtain

   \ln p(\mathbf{t} \mid \mathbf{w}, \beta) = \sum_{n=1}^{N}\ln\mathcal{N}\!\left(t_n \mid \mathbf{w}^T\boldsymbol{\phi}(x_n), \beta^{-1}\right) = \frac{N}{2}\ln\beta - \frac{N}{2}\ln(2\pi) - \beta E_D(\mathbf{w})

 where we have defined

   E_D(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\left[t_n - \mathbf{w}^T\boldsymbol{\phi}(x_n)\right]^2, \qquad RSS = \sum_{n=1}^{N}\left[t_n - \mathbf{w}^T\boldsymbol{\phi}(x_n)\right]^2, \qquad MSE = RSS/N

 RSS is often known as the residual sum of squares or sum of squared errors (SSE), and MSE is the mean squared error.
 Computing 𝒘 via MLE is the same as least squares. The NLL is a quadratic bowl with a unique minimum (the MLE estimate).

[Figures: data, predictions and residuals for a linear regression fit (run residualsDemo from PMTK3); sum-of-squares error contours in the (w_0, w_1) plane with the truth and the prediction marked (run contoursSSEdemo from PMTK3).]
Maximum Likelihood and Least Squares

   \ln p(\mathbf{t} \mid \mathbf{w}, \beta) = \frac{N}{2}\ln\beta - \frac{N}{2}\ln(2\pi) - \beta E_D(\mathbf{w}), \qquad E_D(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\left[t_n - \mathbf{w}^T\boldsymbol{\phi}(x_n)\right]^2

 Setting the gradient of the log-likelihood (written here as a row vector) with respect to 𝒘 equal to zero:

   \nabla_{\mathbf{w}} \ln p(\mathbf{t} \mid \mathbf{w}, \beta) = \beta\sum_{n=1}^{N}\left[t_n - \mathbf{w}^T\boldsymbol{\phi}(x_n)\right]\boldsymbol{\phi}(x_n)^T = \beta\left[\sum_{n=1}^{N} t_n\boldsymbol{\phi}(x_n)^T - \mathbf{w}^T\sum_{n=1}^{N}\boldsymbol{\phi}(x_n)\boldsymbol{\phi}(x_n)^T\right] = 0

 This equation can be solved for 𝒘:

   \mathbf{w}^T\sum_{n=1}^{N}\boldsymbol{\phi}(x_n)\boldsymbol{\phi}(x_n)^T = \sum_{n=1}^{N} t_n\boldsymbol{\phi}(x_n)^T \;\Rightarrow\; \mathbf{w}^T\boldsymbol{\Phi}^T\boldsymbol{\Phi} = \mathbf{t}^T\boldsymbol{\Phi} \;\Rightarrow\; \boldsymbol{\Phi}^T\boldsymbol{\Phi}\,\mathbf{w} = \boldsymbol{\Phi}^T\mathbf{t}


Maximum Likelihood and Least Squares

 We obtain the normal equation (ordinary least squares solution):

   \mathbf{w}_{ML} = \left(\boldsymbol{\Phi}^T\boldsymbol{\Phi}\right)^{-1}\boldsymbol{\Phi}^T\mathbf{t} = \boldsymbol{\Phi}^{\dagger}\mathbf{t}, \qquad \boldsymbol{\Phi}^{\dagger}: \text{Moore-Penrose pseudo-inverse}

 where we have defined the N × M design matrix

   \boldsymbol{\Phi} = \begin{pmatrix} \boldsymbol{\phi}(x_1)^T \\ \boldsymbol{\phi}(x_2)^T \\ \vdots \\ \boldsymbol{\phi}(x_N)^T \end{pmatrix}
   = \begin{pmatrix} \phi_0(x_1) & \phi_1(x_1) & \cdots & \phi_{M-1}(x_1) \\ \phi_0(x_2) & \phi_1(x_2) & \cdots & \phi_{M-1}(x_2) \\ \vdots & \vdots & & \vdots \\ \phi_0(x_N) & \phi_1(x_N) & \cdots & \phi_{M-1}(x_N) \end{pmatrix}
   = \begin{pmatrix} \boldsymbol{\varphi}_0 & \boldsymbol{\varphi}_1 & \cdots & \boldsymbol{\varphi}_{M-1} \end{pmatrix}

 with rows \boldsymbol{\phi}(x_i) = (\phi_0(x_i), \phi_1(x_i), \ldots, \phi_{M-1}(x_i))^T and columns \boldsymbol{\varphi}_i = (\phi_i(x_1), \phi_i(x_2), \ldots, \phi_i(x_N))^T.

 Note that indeed

   \boldsymbol{\Phi}^T\boldsymbol{\Phi} = \sum_{n=1}^{N}\boldsymbol{\phi}(x_n)\boldsymbol{\phi}(x_n)^T, \qquad \boldsymbol{\Phi}^T\mathbf{t} = \sum_{n=1}^{N} t_n\,\boldsymbol{\phi}(x_n)
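A minimal NumPy sketch of building a design matrix and applying the pseudo-inverse (an illustration only, not the lecture's MatLab code; the Gaussian basis centers, width and synthetic data are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, size=50)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

centers, s = np.linspace(0.0, 1.0, 9), 0.15

def design_matrix(x):
    """Rows phi(x_n)^T: a bias term phi_0 = 1 plus Gaussian basis functions."""
    xs = np.atleast_1d(x)
    gauss = np.exp(-(xs[:, None] - centers[None, :]) ** 2 / (2 * s ** 2))
    return np.hstack([np.ones((xs.shape[0], 1)), gauss])

Phi = design_matrix(x)              # N x M design matrix
w_ml = np.linalg.pinv(Phi) @ t      # Moore-Penrose pseudo-inverse solution
print(design_matrix(0.5) @ w_ml)    # prediction y(0.5, w_ML)
```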


Maximum Likelihood and Least Squares

 Recall the log-likelihood

   \ln p(\mathbf{t} \mid \mathbf{w}, \beta) = \sum_{n=1}^{N}\ln\mathcal{N}\!\left(t_n \mid \mathbf{w}^T\boldsymbol{\phi}(x_n), \beta^{-1}\right) = \frac{N}{2}\ln\beta - \frac{N}{2}\ln(2\pi) - \beta E_D(\mathbf{w})

 Maximizing it now with respect to β gives

   \frac{1}{\beta_{ML}} = \frac{2}{N} E_D(\mathbf{w}_{ML}) = \frac{1}{N}\sum_{n=1}^{N}\left[t_n - \mathbf{w}_{ML}^T\boldsymbol{\phi}(x_n)\right]^2

 So the MLE noise variance 1/β_ML is equal to the residual variance of the target values around the regression function.


Convexity of the NLL
 Convexity of the NLL (positive definite Hessian) leads to a unique, globally optimal MLE.

 Some models of interest do not have concave likelihoods, and then only locally optimal MLE estimates can be found.

[Figure: a convex function (e.g. θ², e^θ, -log θ for θ > 0), a concave function (e.g. log θ, √θ for θ > 0), and a function that is neither convex nor concave, with local minima A and B. Run convexFnHand from PMTK3.]


Sequential Learning: LMS Algorithm
 If the data set is large, we use sequential (on-line) algorithms.
 We apply the technique of stochastic (sequential) gradient descent.
 If the error function comprises a sum over data points, E = \sum_n E_n, e.g.

   E_D(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\left[t_n - \mathbf{w}^T\boldsymbol{\phi}(x_n)\right]^2

 then after presentation of pattern n, the stochastic gradient descent algorithm updates the parameter vector 𝒘 using

   \mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta\,\nabla E_n = \mathbf{w}^{(\tau)} + \eta\left[t_n - \mathbf{w}^{(\tau)T}\boldsymbol{\phi}(x_n)\right]\boldsymbol{\phi}(x_n)

 where τ is the iteration number and η is the learning rate parameter.

 This is known as the least-mean-squares or LMS algorithm.
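An illustrative LMS sketch in NumPy (not the lecture's code; the polynomial features, learning rate η and number of passes are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0.0, 1.0, size=200)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

M, eta = 4, 0.1                       # number of features and learning rate
w = np.zeros(M)
for sweep in range(50):               # repeated passes over the data
    for xn, tn in zip(x, t):
        phi_n = np.array([xn ** j for j in range(M)])
        w = w + eta * (tn - w @ phi_n) * phi_n      # LMS update after pattern n
print(w)                              # roughly approaches the batch least-squares solution
```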


Robust Linear Regression
 Using a Gaussian distribution for the noise,

   t = y(\mathbf{x}, \mathbf{w}) + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(\varepsilon \mid 0, \beta^{-1})

 can result in a poor fit, especially if we have outliers in the data.
 Squared error penalizes deviations quadratically, so points far from the line have more effect on the fit than points near the line.
 To achieve robustness to outliers one can replace the Gaussian with a distribution that has heavy tails (e.g. the Laplace distribution). Such a distribution assigns higher likelihood to outliers, without having to perturb the regression line to "explain" them:

   p(t \mid \mathbf{x}, \mathbf{w}, b) = \mathrm{Lap}\!\left(t \mid y(\mathbf{x}, \mathbf{w}), b\right) \propto \exp\!\left(-\frac{1}{b}\left|t - y(\mathbf{x}, \mathbf{w})\right|\right), \qquad \mathrm{NLL}(\mathbf{w}) = \sum_i \left|r_i(\mathbf{w})\right|, \;\; r_i(\mathbf{w}) \triangleq t_i - y(\mathbf{x}_i, \mathbf{w})

 An alternative, discussed next, is the Huber loss:

   L_H(r, \delta) = \begin{cases} r^2/2 & \text{if } |r| \le \delta \\ \delta|r| - \delta^2/2 & \text{if } |r| > \delta \end{cases}

[Figure: linear fits to data with outliers using least squares, the Laplace likelihood, a Student-t likelihood (dof = 0.630), and the Huber loss with δ = 1.0 and 5.0. Run linregRobustDemoCombined from PMTK3.]


Huber Loss Function
 The Huber loss and its derivative are

   L_H(r, \delta) = \begin{cases} r^2/2 & \text{if } |r| \le \delta \\ \delta|r| - \delta^2/2 & \text{if } |r| > \delta \end{cases}, \qquad
   \frac{d}{dr} L_H(r, \delta) = \begin{cases} r & \text{if } |r| \le \delta \\ \delta\,\mathrm{sign}(r) & \text{if } |r| > \delta \end{cases}

 This is equivalent to L2 for errors that are smaller than δ, and is equivalent to L1 for larger errors.
 This loss function is everywhere differentiable, using the fact that d/dr |r| = sign(r) if r ≠ 0.
 The function is also C¹ continuous, since the gradients of the two parts of the function match at r = ±δ.

[Figure: L2, L1 and Huber loss functions versus r. Run huberLossDemo from PMTK3.]

 Optimizing the Huber loss is much faster than using the Laplace likelihood, since we can use standard optimization methods (quasi-Newton) rather than linear programming.
 The Huber method has an unnatural probabilistic interpretation (Pontil et al. 1998).

 Pontil, M., S. Mukherjee, and F. Girosi (1998). On the Noise Model of Support Vector Machine Regression. Technical report, MIT AI Lab.
 Huber, P. (1964). Robust estimation of a location parameter. Annals of Mathematical Statistics 35(1), 73-101.
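A direct NumPy transcription of the Huber loss and its derivative as defined above (an illustrative sketch; the evaluation grid and δ are arbitrary):

```python
import numpy as np

def huber_loss(r, delta):
    """L_H(r, delta): quadratic for |r| <= delta, linear (L1-like) beyond."""
    small = np.abs(r) <= delta
    return np.where(small, 0.5 * r ** 2, delta * np.abs(r) - 0.5 * delta ** 2)

def huber_grad(r, delta):
    """d/dr L_H(r, delta): r in the quadratic region, delta * sign(r) outside."""
    return np.where(np.abs(r) <= delta, r, delta * np.sign(r))

r = np.linspace(-3.0, 3.0, 7)
print(huber_loss(r, delta=1.0))
print(huber_grad(r, delta=1.0))
```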
Robust Linear Regression
 Using the Laplace distribution leads to an L1 error norm (a non-differentiable objective function) that is difficult to optimize.

 A solution is to transform the problem (by increasing its dimension to 2N + M) into a linear programming problem. Splitting each residual as r_i ≜ r_i⁺ − r_i⁻,

   \min_{\mathbf{w},\, r_i^{+},\, r_i^{-}} \sum_i \left(r_i^{+} + r_i^{-}\right) \quad \text{s.t.} \quad r_i^{+} \ge 0, \;\; r_i^{-} \ge 0, \;\; \mathbf{w}^T\mathbf{x}_i + r_i^{+} - r_i^{-} = t_i \;\; \forall i

 Note that with this definition,

   r_i^{+} = \frac{1}{2}\left(|r_i| + r_i\right) = \begin{cases} r_i & \text{if } r_i \ge 0 \\ 0 & \text{if } r_i < 0 \end{cases}, \qquad
   r_i^{-} = \frac{1}{2}\left(|r_i| - r_i\right) = \begin{cases} 0 & \text{if } r_i \ge 0 \\ -r_i & \text{if } r_i < 0 \end{cases}, \qquad
   r_i^{+} + r_i^{-} = |r_i|

 Boyd, S. and L. Vandenberghe (2004). Convex Optimization. Cambridge University Press.
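A sketch of this linear-programming formulation using scipy.optimize.linprog (an illustration only; the synthetic data, the injected outliers and the feature map [1, x] are my own choices):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(3)
N, M = 40, 2
x = rng.uniform(0.0, 1.0, size=N)
t = 1.0 + 2.0 * x + rng.normal(scale=0.1, size=N)
t[:3] += 4.0                                   # a few outliers

Phi = np.stack([np.ones(N), x], axis=-1)       # design matrix with columns [1, x]

# Decision variables: [w (M), r_plus (N), r_minus (N)]
c = np.concatenate([np.zeros(M), np.ones(N), np.ones(N)])   # minimize sum(r+ + r-)
A_eq = np.hstack([Phi, np.eye(N), -np.eye(N)])              # Phi w + r+ - r- = t
b_eq = t
bounds = [(None, None)] * M + [(0, None)] * (2 * N)         # w free, r+ and r- nonnegative

res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=bounds, method="highs")
print(res.x[:M])    # L1 fit; should stay close to [1, 2] despite the outliers
```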
