Machine Learning: Linear Models For Regression
Marcello Restelli
March 9, 2017
Outline
1 Linear Regression
2 Minimizing Least Squares
3 Regularization
4 Bayesian Linear Regression
Regression Problems
Estimate a mapping from an input x to a continuous output t
We assume there exists a model that goes from input x to output t
Examples:
Predict stock market price
Predict age of a web user
Predict the effect of an actuation in robotics
Predict the value of a house
Predict the temperature in a building
Linear Models
Linear regression often appears as a module of larger systems
Linear problems can be solved analytically
Linear prediction provides an introduction to many of the core concepts of machine learning
Augmented with kernels, it can model non-linear relationships
Linear Function
y(x, w) = w0 + Σ_{j=1}^{D−1} wj xj = wᵀ x, with x = (1, x1, . . . , xD−1)ᵀ
w0 is the offset (bias) term
Each point in the weight space (w0, w1, . . .) describes a different model
Linear Regression
Basis Functions
To model non-linear relationships, we apply fixed non-linear basis functions φⱼ to the input and keep the model linear in the weights: y(x, w) = wᵀ φ(x) = Σⱼ wⱼ φⱼ(x)
Example
(Figure: plots of t against the raw input x and against the transformed feature x².)
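As a rough illustrative sketch of this idea (the function names, centers, and width below are arbitrary choices, not from the lecture), a design matrix of basis-function features can be built from a 1D input as follows:

```python
import numpy as np

def polynomial_features(x, degree):
    # Phi[n, j] = x_n ** j, including the constant column for j = 0
    return np.vander(x, degree + 1, increasing=True)

def gaussian_features(x, centers, width):
    # Phi[n, j] = exp(-(x_n - mu_j)^2 / (2 * width^2)), plus a bias column
    phi = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2.0 * width ** 2))
    return np.hstack([np.ones((x.shape[0], 1)), phi])

x = np.linspace(0.0, 1.0, 20)                    # toy 1D inputs
Phi_poly = polynomial_features(x, degree=3)      # shape (20, 4)
Phi_gauss = gaussian_features(x, centers=np.linspace(0.0, 1.0, 9), width=0.1)  # shape (20, 10)
```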
Linear Regression
Discriminative vs Generative
Generative approach:
Model the joint density: p(x, t) = p(x|t)p(t)
Infer the conditional density: p(t|x) = p(x, t)/p(x)
Marginalize to find the conditional mean: E[t|x] = ∫ t p(t|x) dt
Discriminative approach:
Model the conditional density p(t|x)
Marginalize to find the conditional mean: E[t|x] = ∫ t p(t|x) dt
Direct approach:
Find a regression function y(x) directly from the training data
Minimizing Least Squares
We choose w by minimizing the sum-of-squares error over the N training samples:
L(w) = ½ Σ_{n=1}^{N} (y(xn, w) − tn)²
(Figure: contours of L(w) in weight space, with loss values L = 5.4·10⁴, L = 1.0·10⁵, and L = 1.0·10⁶.)
Let’s write RSS in matrix form, with Φ = (φ(x1), . . . , φ(xN))ᵀ and t = (t1, . . . , tN)ᵀ:
L(w) = ½ RSS(w) = ½ (t − Φw)ᵀ(t − Φw)
Compute the first and second derivatives:
∂L(w)/∂w = −Φᵀ(t − Φw),    ∂²L(w)/(∂w ∂wᵀ) = ΦᵀΦ
Assuming ΦᵀΦ is nonsingular:
ŵOLS = (ΦᵀΦ)⁻¹ Φᵀ t
Complexity: O(NM² + M³)
Cholesky: M³ + NM²/2
QR: NM²
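A minimal numpy sketch of the closed-form solution above (synthetic data, arbitrary sizes); in practice a QR-based solver such as np.linalg.lstsq is preferred over inverting ΦᵀΦ explicitly:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 100, 4
Phi = rng.normal(size=(N, M))                 # design matrix (after basis expansion)
w_true = np.array([1.0, -2.0, 0.5, 3.0])
t = Phi @ w_true + 0.1 * rng.normal(size=N)   # noisy targets

# Normal equations: w_OLS = (Phi^T Phi)^{-1} Phi^T t
w_ols = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

# Numerically more stable alternative (QR-based least squares)
w_qr, *_ = np.linalg.lstsq(Phi, t, rcond=None)
```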
Minimizing Least Squares
Geometric Interpretation
t is an N-dimensional vector
Let’s denote with ϕj the j-th column of Φ
Define t̂ as the N-dimensional vector whose n-th element is y(xn, w)
t̂ is a linear combination of ϕ1, . . . , ϕM, so t̂ lies in an M-dimensional subspace S
Since t̂ minimizes the SSE with respect to t, it is the orthogonal projection of t onto the subspace S
t̂ = Φŵ = Φ(ΦᵀΦ)⁻¹Φᵀ t
H = Φ(ΦᵀΦ)⁻¹Φᵀ is called the hat matrix
Geometric Example
Assume N = 3 and M = D = 2:
Φ = X = [[1, 2], [1, −2], [1, 2]],    t = (5, 1, 2)ᵀ,    t̂ = (3.5, 1, 3.5)ᵀ
(Figure: the target vector t and its projection t̂ onto the plane spanned by the columns of Φ, drawn in the three-dimensional data space with axes x1, x2, x3.)
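The example can be checked numerically; the short sketch below builds the hat matrix H for the Φ and t above and recovers t̂ = (3.5, 1, 3.5)ᵀ:

```python
import numpy as np

Phi = np.array([[1.0,  2.0],
                [1.0, -2.0],
                [1.0,  2.0]])
t = np.array([5.0, 1.0, 2.0])

# H = Phi (Phi^T Phi)^{-1} Phi^T projects t onto the column space of Phi
H = Phi @ np.linalg.inv(Phi.T @ Phi) @ Phi.T
t_hat = H @ t
print(t_hat)   # [3.5 1.  3.5]
```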
Minimizing Least Squares
Gradient Optimization
Closed-form solution is not practical with big data
We can use sequential (online) updates
Stochastic gradient descent
If the loss function can be expressed as a sum over samples (L(x) = Σₙ L(xₙ)), we can update the weights one sample at a time:
w^(k+1) = w^(k) − α^(k) ∇L(xn)
For the squared error this gives:
w^(k+1) = w^(k) − α^(k) (w^(k)ᵀ φ(xn) − tn) φ(xn)
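A minimal sketch of this update rule (sometimes called least mean squares); the step size and number of passes below are arbitrary illustrative choices:

```python
import numpy as np

def sgd_linear_regression(Phi, t, alpha=0.01, n_epochs=50, seed=0):
    # w^(k+1) = w^(k) - alpha * (w^(k)^T phi(x_n) - t_n) * phi(x_n)
    rng = np.random.default_rng(seed)
    w = np.zeros(Phi.shape[1])
    for _ in range(n_epochs):
        for n in rng.permutation(len(t)):    # visit the samples in random order
            error = w @ Phi[n] - t[n]
            w -= alpha * error * Phi[n]
    return w
```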
Assume the targets are generated as t = f(x) + ε, where ε is a zero-mean noise term.
Gauss-Markov Theorem
Theorem (Gauss-Markov)
The least squares estimate of w has the smallest variance among all linear
unbiased estimates.
It follows that the least squares estimator has the lowest MSE among all linear unbiased estimators
However, there may exist a biased estimator with a smaller MSE
Multiple Outputs
(Figure: fits of polynomials of degree 1, 2, 10, 15, and 19 to the training data, t against x.)
Under-fitting vs Over-fitting
This is the problem of model selection (we will see this later)
What happens when the number of training samples increases?
What happens to the parameters when the model gets more complex?
Ridge Regression
One way to reduce the MSE is to change the loss function as follows:
L(w) = LD(w) + λ LW(w)
LD(w): error on data (e.g., RSS)
LW(w): model complexity
By taking LW(w) = ½ wᵀw = ½ ‖w‖₂², we get ridge regression:
L(w) = ½ Σ_{i=1}^{N} (ti − wᵀφ(xi))² + (λ/2) ‖w‖₂²
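This regularized loss still has a closed-form minimizer, ŵridge = (λI + ΦᵀΦ)⁻¹ Φᵀ t (a standard result, stated here for completeness). A minimal sketch:

```python
import numpy as np

def ridge_fit(Phi, t, lam):
    # w_ridge = (lambda * I + Phi^T Phi)^{-1} Phi^T t
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)
```

Note that for λ > 0 the matrix λI + ΦᵀΦ is always invertible, even when ΦᵀΦ is singular.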
Regularization
Ridge Regression
Quadratic Example
Polynomial degree 15. As λ increases, the squared norm of the fitted weights shrinks:
λ = 0:      ‖w‖² = 2.74e+6
λ = 1e-10:  ‖w‖² = 1.75e+3
λ = 1e-9:   ‖w‖² = 9.32e+2
λ = 1e-8:   ‖w‖² = 3.92e+2
λ = 1e-7:   ‖w‖² = 1.54e+2
λ = 1e-6:   ‖w‖² = 6.22e+1
(Figures: the corresponding fits of t against x, plotted over the data for each value of λ.)
Regularization
Ridge Regression
Quadratic Example: Weights
Polynomial degree 15
(Figure: the individual weight values w, on the order of 10⁵ for the smallest λ, plotted as λ increases from 10⁻¹⁰ to 10⁻⁶; the weights shrink as λ grows.)
Regularization
Ridge Regression
Sinusoidal Example
Lasso
Lasso replaces the squared L2 penalty with an L1 penalty, λ‖w‖₁, which tends to drive some weights exactly to zero.
(Figure: contours of the unregularized error together with the L1 and L2 constraint regions in (w1, w2) space; the lasso optimum w∗ tends to lie on a corner of the constraint region, producing sparse solutions.)
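As a hedged illustration of this sparsity effect (scikit-learn is used for convenience; the data and penalty strengths are arbitrary choices), one can compare ridge and lasso on features where only a few weights matter:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
Phi = rng.normal(size=(100, 15))                 # e.g. 15 basis functions
w_true = np.zeros(15)
w_true[:3] = [4.0, -2.0, 1.0]                    # only 3 features are relevant
t = Phi @ w_true + 0.1 * rng.normal(size=100)

ridge = Ridge(alpha=1.0).fit(Phi, t)
lasso = Lasso(alpha=0.1).fit(Phi, t)
print((np.abs(ridge.coef_) < 1e-3).sum())        # ridge shrinks weights but rarely zeroes them
print((np.abs(lasso.coef_) < 1e-3).sum())        # lasso typically sets many weights exactly to zero
```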
Bayesian Approach
Posterior Distribution
The posterior distribution for the model parameters can be found by
combining the prior with the likelihood for the parameters given data
This is accomplished using Bayes’ Rule:
P(parameters|data) = P(data|parameters) P(parameters) / P(data)
p(w|D) = p(D|w) P(w) / P(D)
where
p(w|D) is the posterior probability of parameters w given training data D
p(D|w) is the probability (likelihood) of observing D given w
P(w) is the prior probability over the parameters
P(D) is the marginal likelihood (normalizing constant): P(D) = ∫ p(D|w) P(w) dw
We want the most probable value of w given the data: maximum a
posteriori (MAP). It is the mode of the posterior.
Bayesian Linear Regression
Posterior Distribution
With a Gaussian prior p(w) = N(w | w0, S0) and Gaussian noise of variance σ², the posterior over w is Gaussian, p(w | t) = N(w | wN, SN), with
wN = SN (S0⁻¹ w0 + Φᵀ t / σ²)
SN⁻¹ = S0⁻¹ + Φᵀ Φ / σ²
For sequential data, the posterior acts as prior for the next iteration
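A minimal sketch of the posterior update above; the prior parameters w0, S0 and the noise variance σ² are illustrative names chosen by the modeller:

```python
import numpy as np

def posterior_update(Phi, t, w0, S0, sigma2):
    # S_N^{-1} = S_0^{-1} + Phi^T Phi / sigma^2
    # w_N      = S_N (S_0^{-1} w_0 + Phi^T t / sigma^2)
    S0_inv = np.linalg.inv(S0)
    SN = np.linalg.inv(S0_inv + Phi.T @ Phi / sigma2)
    wN = SN @ (S0_inv @ w0 + Phi.T @ t / sigma2)
    return wN, SN
```

Since the posterior is again Gaussian, (wN, SN) can be plugged back in as the prior (w0, S0) when the next batch of data arrives, matching the sequential use noted above.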
Bayesian Linear Regression
1D Example
(Figure: 1D example, t plotted against x.)
Predictive Distribution
The predictive distribution is p(t | x, D) = N(t | wNᵀ φ(x), σN²(x)), where
σN²(x) = σ² + φ(x)ᵀ SN φ(x)
The first term is the noise in the target values; the second is the uncertainty associated with the parameter values.
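A sketch of the predictive mean and variance for a new input, using (wN, SN) from the posterior update above; phi_x stands for the basis expansion φ(x) of the new input:

```python
import numpy as np

def predictive(phi_x, wN, SN, sigma2):
    # Predictive distribution: N(t | wN^T phi(x), sigma2 + phi(x)^T S_N phi(x))
    mean = wN @ phi_x
    var = sigma2 + phi_x @ SN @ phi_x   # noise variance + parameter uncertainty
    return mean, var
```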
Example
Sinusoidal dataset, 9 Gaussian basis functions
Modeling Challenges
Computational Challenges
The other big challenge is computing the posterior distribution. There are
several approaches:
Analytical integration: If we use “conjugate priors”, the posterior
distribution can be computed analytically. Only works for simple
models
Gaussian (Laplace) approximation: Approximate the posterior distribution with a Gaussian. Works well when there is a lot of data compared to the model complexity
Monte Carlo integration: Once we have a sample from the posterior
distribution, we can do many things. Currently, the common approach is
Markov Chain Monte Carlo (MCMC), which consists in simulating a
Markov chain that converges to the posterior distribution
Variational approximation: A cleverer way of approximating the
posterior. It is usually faster than MCMC, but it is less general
Advantages
Closed-form solution
Tractable Bayesian treatment
Arbitrary non-linearity with the proper basis functions
Limitations
Basis functions are chosen independently from the training set
Curse of dimensionality