Machine Learning: Linear Models For Regression
Marcello Restelli
March 9, 2017
Outline
1 Linear Regression
2 Minimizing Least Squares
3 Regularization
4 Bayesian Linear Regression
Regression Problems
Estimate a mapping from an input x to a continuous output t
We assume there exists a model that goes from input x to output t
Examples:
Predict stock market price
Predict age of a web user
Predict the effect of an actuation in robotics
Predict the value of a house
Predict the temperature in a building
Linear Models
Linear regression often appears as a module of larger systems
Linear problems can be solved analytically
Linear prediction provides an introduction to many of the core concepts of machine learning
Augmented with kernels, it can model non-linear relationships
Linear Function
y(x, w) = w0 + Σ_{j=1}^{D−1} wj xj = wᵀ x, with x = (1, x1, . . . , xD−1)ᵀ
w0 is the offset (bias) term
Each point in the weight space (w0, w1, . . .) describes a different model
Linear Regression
Basis Functions
To model non-linear relationships, we apply fixed non-linear basis functions φⱼ to the input and keep the model linear in the weights: y(x, w) = wᵀ φ(x) = Σⱼ wⱼ φⱼ(x)
Example
(Figure: plots of t against the raw input x and against the transformed feature x².)
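As a rough illustrative sketch of this idea (the function names, centers, and width below are arbitrary choices, not from the lecture), a design matrix of basis-function features can be built from a 1D input as follows:

```python
import numpy as np

def polynomial_features(x, degree):
    # Phi[n, j] = x_n ** j, including the constant column for j = 0
    return np.vander(x, degree + 1, increasing=True)

def gaussian_features(x, centers, width):
    # Phi[n, j] = exp(-(x_n - mu_j)^2 / (2 * width^2)), plus a bias column
    phi = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2.0 * width ** 2))
    return np.hstack([np.ones((x.shape[0], 1)), phi])

x = np.linspace(0.0, 1.0, 20)                    # toy 1D inputs
Phi_poly = polynomial_features(x, degree=3)      # shape (20, 4)
Phi_gauss = gaussian_features(x, centers=np.linspace(0.0, 1.0, 9), width=0.1)  # shape (20, 10)
```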
Linear Regression
Discriminative vs Generative
Generative approach:
Model the joint density: p(x, t) = p(x|t)p(t)
Infer the conditional density: p(t|x) = p(x, t)/p(x)
Marginalize to find the conditional mean: E[t|x] = ∫ t p(t|x) dt
Discriminative approach:
Model the conditional density p(t|x)
Marginalize to find the conditional mean: E[t|x] = ∫ t p(t|x) dt
Direct approach:
Find a regression function y(x) directly from the training data
Minimizing Least Squares
We choose w by minimizing the sum-of-squares error over the N training samples:
L(w) = ½ Σ_{n=1}^{N} (y(xn, w) − tn)²
(Figure: contours of L(w) in weight space, with loss values L = 5.4·10⁴, L = 1.0·10⁵, and L = 1.0·10⁶.)
Let’s write RSS in matrix form, with Φ = (φ(x1), . . . , φ(xN))ᵀ and t = (t1, . . . , tN)ᵀ:
L(w) = ½ RSS(w) = ½ (t − Φw)ᵀ(t − Φw)
Compute the first and second derivatives:
∂L(w)/∂w = −Φᵀ(t − Φw),    ∂²L(w)/(∂w ∂wᵀ) = ΦᵀΦ
Assuming ΦᵀΦ is nonsingular:
ŵOLS = (ΦᵀΦ)⁻¹ Φᵀ t
Complexity: O(NM² + M³)
Cholesky: M³ + NM²/2
QR: NM²
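A minimal numpy sketch of the closed-form solution above (synthetic data, arbitrary sizes); in practice a QR-based solver such as np.linalg.lstsq is preferred over inverting ΦᵀΦ explicitly:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 100, 4
Phi = rng.normal(size=(N, M))                 # design matrix (after basis expansion)
w_true = np.array([1.0, -2.0, 0.5, 3.0])
t = Phi @ w_true + 0.1 * rng.normal(size=N)   # noisy targets

# Normal equations: w_OLS = (Phi^T Phi)^{-1} Phi^T t
w_ols = np.linalg.solve(Phi.T @ Phi, Phi.T @ t)

# Numerically more stable alternative (QR-based least squares)
w_qr, *_ = np.linalg.lstsq(Phi, t, rcond=None)
```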
Minimizing Least Squares
Geometric Interpretation
t is an N-dimensional vector
Let’s denote with ϕj the j-th column of Φ
Define t̂ as the N-dimensional vector whose n-th element is y(xn, w)
t̂ is a linear combination of ϕ1, . . . , ϕM, so t̂ lies in an M-dimensional subspace S
Since t̂ minimizes the SSE with respect to t, it is the orthogonal projection of t onto the subspace S
t̂ = Φŵ = Φ(ΦᵀΦ)⁻¹Φᵀ t
H = Φ(ΦᵀΦ)⁻¹Φᵀ is called the hat matrix
Geometric Example
Assume N = 3 and M = D = 2:
Φ = X = [[1, 2], [1, −2], [1, 2]],    t = (5, 1, 2)ᵀ,    t̂ = (3.5, 1, 3.5)ᵀ
(Figure: the target vector t and its projection t̂ onto the plane spanned by the columns of Φ, drawn in the three-dimensional data space with axes x1, x2, x3.)
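The example can be checked numerically; the short sketch below builds the hat matrix H for the Φ and t above and recovers t̂ = (3.5, 1, 3.5)ᵀ:

```python
import numpy as np

Phi = np.array([[1.0,  2.0],
                [1.0, -2.0],
                [1.0,  2.0]])
t = np.array([5.0, 1.0, 2.0])

# H = Phi (Phi^T Phi)^{-1} Phi^T projects t onto the column space of Phi
H = Phi @ np.linalg.inv(Phi.T @ Phi) @ Phi.T
t_hat = H @ t
print(t_hat)   # [3.5 1.  3.5]
```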
Minimizing Least Squares
Gradient Optimization
Closed-form solution is not practical with big data
We can use sequential (online) updates
Stochastic gradient descent
If the loss function can be expressed as a sum over samples (L(x) = Σₙ L(xₙ)), we can update the weights one sample at a time:
w^(k+1) = w^(k) − α^(k) ∇L(xn)
For the squared error this gives:
w^(k+1) = w^(k) − α^(k) (w^(k)ᵀ φ(xn) − tn) φ(xn)
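A minimal sketch of this update rule (sometimes called least mean squares); the step size and number of passes below are arbitrary illustrative choices:

```python
import numpy as np

def sgd_linear_regression(Phi, t, alpha=0.01, n_epochs=50, seed=0):
    # w^(k+1) = w^(k) - alpha * (w^(k)^T phi(x_n) - t_n) * phi(x_n)
    rng = np.random.default_rng(seed)
    w = np.zeros(Phi.shape[1])
    for _ in range(n_epochs):
        for n in rng.permutation(len(t)):    # visit the samples in random order
            error = w @ Phi[n] - t[n]
            w -= alpha * error * Phi[n]
    return w
```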
Assume the targets are generated as t = f(x) + ε, where ε is a zero-mean noise term.
Gauss-Markov Theorem
Theorem (Gauss-Markov)
The least squares estimate of w has the smallest variance among all linear
unbiased estimates.
It follows that the least squares estimator has the lowest MSE among all linear unbiased estimators
However, there may exist a biased estimator with a smaller MSE
Multiple Outputs
(Figure: fits of polynomials of degree 1, 2, 10, 15, and 19 to the training data, t against x.)
Under-fitting vs Over-fitting
This is the problem of model selection (we will see this later)
What happens when the number of training samples increases?
What happens to the parameters when the model gets more complex?
Ridge Regression
One way to reduce the MSE is to change the loss function as follows:
L(w) = LD(w) + λ LW(w)
LD(w): error on data (e.g., RSS)
LW(w): model complexity
By taking LW(w) = ½ wᵀw = ½ ‖w‖₂², we get ridge regression:
L(w) = ½ Σ_{i=1}^{N} (ti − wᵀφ(xi))² + (λ/2) ‖w‖₂²
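This regularized loss still has a closed-form minimizer, ŵridge = (λI + ΦᵀΦ)⁻¹ Φᵀ t (a standard result, stated here for completeness). A minimal sketch:

```python
import numpy as np

def ridge_fit(Phi, t, lam):
    # w_ridge = (lambda * I + Phi^T Phi)^{-1} Phi^T t
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)
```

Note that for λ > 0 the matrix λI + ΦᵀΦ is always invertible, even when ΦᵀΦ is singular.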
Regularization
Ridge Regression
Quadratic Example
Polynomial degree 15. As λ increases, the squared norm of the fitted weights shrinks:
λ = 0:      ‖w‖² = 2.74e+6
λ = 1e-10:  ‖w‖² = 1.75e+3
λ = 1e-9:   ‖w‖² = 9.32e+2
λ = 1e-8:   ‖w‖² = 3.92e+2
λ = 1e-7:   ‖w‖² = 1.54e+2
λ = 1e-6:   ‖w‖² = 6.22e+1
(Figures: the corresponding fits of t against x, plotted over the data for each value of λ.)
Regularization
Ridge Regression
Quadratic Example: Weights
Polynomial degree 15
(Figure: the individual weight values w, on the order of 10⁵ for the smallest λ, plotted as λ increases from 10⁻¹⁰ to 10⁻⁶; the weights shrink as λ grows.)
Regularization
Ridge Regression
Sinusoidal Example
Lasso
Lasso replaces the squared L2 penalty with an L1 penalty, λ‖w‖₁, which tends to drive some weights exactly to zero.
(Figure: contours of the unregularized error together with the L1 and L2 constraint regions in (w1, w2) space; the lasso optimum w∗ tends to lie on a corner of the constraint region, producing sparse solutions.)
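As a hedged illustration of this sparsity effect (scikit-learn is used for convenience; the data and penalty strengths are arbitrary choices), one can compare ridge and lasso on features where only a few weights matter:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
Phi = rng.normal(size=(100, 15))                 # e.g. 15 basis functions
w_true = np.zeros(15)
w_true[:3] = [4.0, -2.0, 1.0]                    # only 3 features are relevant
t = Phi @ w_true + 0.1 * rng.normal(size=100)

ridge = Ridge(alpha=1.0).fit(Phi, t)
lasso = Lasso(alpha=0.1).fit(Phi, t)
print((np.abs(ridge.coef_) < 1e-3).sum())        # ridge shrinks weights but rarely zeroes them
print((np.abs(lasso.coef_) < 1e-3).sum())        # lasso typically sets many weights exactly to zero
```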
Bayesian Approach
Posterior Distribution
The posterior distribution for the model parameters can be found by
combining the prior with the likelihood for the parameters given data
This is accomplished using Bayes’ Rule:
P(parameters|data) = P(data|parameters) P(parameters) / P(data)
p(w|D) = p(D|w) P(w) / P(D)
where
p(w|D) is the posterior probability of parameters w given training data D
p(D|w) is the probability (likelihood) of observing D given w
P(w) is the prior probability over the parameters
P(D) is the marginal likelihood (normalizing constant): P(D) = ∫ p(D|w) P(w) dw
We want the most probable value of w given the data: maximum a
posteriori (MAP). It is the mode of the posterior.
Bayesian Linear Regression
Posterior Distribution
With a Gaussian prior p(w) = N(w | w0, S0) and Gaussian noise of variance σ², the posterior over w is Gaussian, p(w | t) = N(w | wN, SN), with
wN = SN (S0⁻¹ w0 + Φᵀ t / σ²)
SN⁻¹ = S0⁻¹ + Φᵀ Φ / σ²
For sequential data, the posterior acts as prior for the next iteration
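A minimal sketch of the posterior update above; the prior parameters w0, S0 and the noise variance σ² are illustrative names chosen by the modeller:

```python
import numpy as np

def posterior_update(Phi, t, w0, S0, sigma2):
    # S_N^{-1} = S_0^{-1} + Phi^T Phi / sigma^2
    # w_N      = S_N (S_0^{-1} w_0 + Phi^T t / sigma^2)
    S0_inv = np.linalg.inv(S0)
    SN = np.linalg.inv(S0_inv + Phi.T @ Phi / sigma2)
    wN = SN @ (S0_inv @ w0 + Phi.T @ t / sigma2)
    return wN, SN
```

Since the posterior is again Gaussian, (wN, SN) can be plugged back in as the prior (w0, S0) when the next batch of data arrives, matching the sequential use noted above.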
Bayesian Linear Regression
1D Example
(Figure: 1D example, t plotted against x.)
Predictive Distribution
The predictive distribution is p(t | x, D) = N(t | wNᵀ φ(x), σN²(x)), where
σN²(x) = σ² + φ(x)ᵀ SN φ(x)
The first term is the noise in the target values; the second is the uncertainty associated with the parameter values.
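A sketch of the predictive mean and variance for a new input, using (wN, SN) from the posterior update above; phi_x stands for the basis expansion φ(x) of the new input:

```python
import numpy as np

def predictive(phi_x, wN, SN, sigma2):
    # Predictive distribution: N(t | wN^T phi(x), sigma2 + phi(x)^T S_N phi(x))
    mean = wN @ phi_x
    var = sigma2 + phi_x @ SN @ phi_x   # noise variance + parameter uncertainty
    return mean, var
```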
Example
Sinusoidal dataset, 9 Gaussian basis functions
Modeling Challenges
Computational Challenges
The other big challenge is computing the posterior distribution. There are
several approaches:
Analytical integration: If we use “conjugate priors”, the posterior
distribution can be computed analytically. Only works for simple
models
Gaussian (Laplace) approximation: Approximate the posterior distribution with a Gaussian. Works well when there is a lot of data compared to the model complexity
Monte Carlo integration: Once we have a sample from the posterior
distribution, we can do many things. Currently, the common approach is
Markov Chain Monte Carlo (MCMC), which consists in simulating a
Markov chain that converges to the posterior distribution
Variational approximation: A cleverer way of approximating the
posterior. It is usually faster than MCMC, but it is less general
Advantages
Closed-form solution
Tractable Bayesian treatment
Arbitrary non-linearity with the proper basis functions
Limitations
Basis functions are chosen independently from the training set
Curse of dimensionality