09 Regularization

The document discusses regularization, which is a technique used in regression and classification problems to counteract overfitting. It involves adding a penalty term to the loss function, with a tuning parameter λ controlling the strength of regularization. Ridge regression is an example where an L2 norm penalty is added, shrinking parameter estimates towards zero and improving the conditioning of the optimization problem. The tuning parameter λ manages the bias-variance tradeoff, and its best value is chosen using a validation set.


Regularization
Regularization consists in adding a penalty term to the loss function:

V(θ) = J(θ) + λ g(θ)

where λ > 0 is a tuning parameter. The solution of the learning-from-data problem becomes

θ̂ = arg min_{θ ∈ M_p(θ)} V(θ).

This technique is used in both regression and classification problems. One of the most common forms is quadratic regularization:

V(θ) = J(θ) + λ ‖θ‖₂² = J(θ) + λ θᵀθ.

Why regularization?
To counteract the effects of overfitting.
To make the optimization problem better conditioned.

Note: the tuning parameter λ must be determined separately, just as the model complexity is.
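As an illustration (not part of the slides), a minimal Python sketch of such a quadratically regularized loss, assuming a least-squares data-fit term J(θ) built from a regressor matrix Phi and an output vector Y (all names here are illustrative):

```python
import numpy as np

def regularized_loss(theta, Phi, Y, lam):
    """V(theta) = J(theta) + lam * ||theta||_2^2, with a quadratic data-fit term J."""
    residual = Y - Phi @ theta          # prediction errors
    J = residual @ residual             # J(theta): sum of squared errors
    penalty = lam * (theta @ theta)     # lam * theta^T theta
    return J + penalty
```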
Regularization

Regularized least squares: Ridge regression


Consider the static model

y(t) = ϕᵀ(t) θ + ε(t),   t = 1, 2, …, N

and the associated quadratic regularized LS problem

min_θ V(θ),   V(θ) = Σ_{t=1}^{N} ε²(t) + λ θᵀθ = ‖Y − Φ θ‖² + λ ‖θ‖₂²

This estimation method is called Ridge regression. By setting the gradient of V (θ) to zero
it is easy to find the normal equations
" T
Φ Φ + λ Ip θ = ΦT Y
#

where I_p is the p × p identity matrix and p is the complexity of the model. The
regularized LS estimator is thus given by

θ̂ = (ΦᵀΦ + λ I_p)⁻¹ Φᵀ Y
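A minimal numerical sketch of this closed-form estimator, assuming Phi is the N × p regressor matrix and Y the output vector (the names are illustrative; solving the linear system is preferable to forming the inverse explicitly):

```python
import numpy as np

def ridge_estimate(Phi, Y, lam):
    """Solve (Phi^T Phi + lam * I_p) theta = Phi^T Y for theta."""
    p = Phi.shape[1]
    A = Phi.T @ Phi + lam * np.eye(p)   # regularized normal-equation matrix
    b = Phi.T @ Y
    return np.linalg.solve(A, b)        # numerically safer than inv(A) @ b
```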



Regularization

The penalty term λ ‖θ‖₂² is also called a shrinkage penalty in statistics because it has the effect of shrinking the estimated parameters towards zero.

Assume now the existence of the true model

y(t) = ϕᵀ(t) θ* + w(t),   t = 1, 2, …, N

By analyzing the statistical properties of the Ridge estimator we get


" T #−1 T ∗ ∗
" T #−1 ∗
E[θ̂] = Φ Φ + λ Ip Φ Φ θ = θ − λ Φ Φ + λ Ip θ $= θ ∗

2 T
#−1 T T
#−1 2
(ΦT Φ)−1
" "
cov(θ̂) = σw Φ Φ + λ Ip Φ Φ Φ Φ + λ Ip < σw

The Ridge estimator θ̂ is biased and cov(θ̂) < cov(θ̂_LS).


As λ increases, the shrinkage of the estimated coefficients leads to a reduction in the variance of the estimates at the expense of an increase in bias ⇒ λ is a hyperparameter that can manage the bias-variance tradeoff.
If ΦᵀΦ is ill-conditioned (or even rank deficient), the use of λ leads to the better-conditioned matrix ΦᵀΦ + λ I_p.
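A rough Monte Carlo illustration of these two facts; the true parameter vector, noise level, sample size and regressors below are arbitrary assumptions chosen only to make the trend visible:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 30, 5
theta_true = np.ones(p)
Phi = rng.normal(size=(N, p))       # fixed regressor matrix
sigma_w = 0.5

def ridge(Phi, Y, lam):
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ Y)

for lam in [0.0, 1.0, 10.0]:
    estimates = []
    for _ in range(2000):                       # repeat the experiment many times
        Y = Phi @ theta_true + sigma_w * rng.normal(size=N)
        estimates.append(ridge(Phi, Y, lam))
    estimates = np.array(estimates)
    bias = np.linalg.norm(estimates.mean(axis=0) - theta_true)   # norm of the bias vector
    variance = estimates.var(axis=0).sum()      # trace of the sample covariance
    print(f"lambda={lam:5.1f}  bias={bias:.4f}  total variance={variance:.4f}")
```

As λ grows, the printed bias increases while the total variance of the estimates decreases.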
Regularization

Example: polynomial fitting of the model y(t) = sin u(t) + w(t).


[Figure: four panels of fitted curves versus the sample index (0–600). Top row: LS with a 3rd-order polynomial; LS with a 12th-order polynomial. Bottom row: Ridge with a 12th-order polynomial, λ = 0.1; Ridge with a 12th-order polynomial, λ = 3. Each panel shows the true curve, the validation set and the estimated curve.]
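A sketch of an experiment of this kind; the sample size, noise level, input grid and input standardization below are assumptions, not the settings used to produce the figure:

```python
import numpy as np

rng = np.random.default_rng(1)
N, order = 600, 12
u = np.linspace(0, 2 * np.pi, N)
y = np.sin(u) + 0.2 * rng.normal(size=N)        # y(t) = sin u(t) + w(t)

x = (u - u.mean()) / u.std()                    # standardize u before building the polynomial
Phi = np.vander(x, order + 1, increasing=True)  # columns 1, x, x^2, ..., x^12

def ridge(Phi, Y, lam):
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ Y)

for lam in [0.0, 0.1, 3.0]:                     # lam = 0 recovers plain LS
    theta_hat = ridge(Phi, y, lam)
    fit_error = np.mean((Phi @ theta_hat - np.sin(u)) ** 2)
    print(f"lambda={lam:4.1f}  mean squared error vs. true curve = {fit_error:.4f}")
```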



Regularization

Regularization can also be exploited for the identification of dynamic models and for classification problems. In general:

If the model complexity p is high (many parameters), it may not be possible to estimate several of them accurately ⇒ it is advantageous to pull them towards zero, as the ones having the smallest influence on J(θ) will be affected most by the shrinkage property ⇒ regularization allows complex models to be trained on small data sets without severe overfitting.
The problem of minimizing J(θ) may be ill-conditioned, especially when the complexity p is high, in the sense that the Hessian J″(θ) may be ill-conditioned ⇒ adding the norm penalty adds λ I_p to this matrix so that it becomes better conditioned.
The choice of λ is a crucial issue, as we may think of λ as a knob to control the bias-variance tradeoff (the larger λ, the larger the number of parameters that will be close to zero). The best way consists in choosing the value of λ leading to the smallest value of the loss function evaluated on the validation set, as in the sketch below.
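A minimal sketch of this validation-based selection; the 70/30 split and the candidate grid of λ values are assumptions, and ridge_estimate is the closed-form estimator sketched earlier:

```python
import numpy as np

def ridge_estimate(Phi, Y, lam):
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ Y)

def select_lambda(Phi, Y, candidates, train_fraction=0.7):
    """Pick the lambda with the smallest loss on a held-out validation set."""
    n_train = int(train_fraction * len(Y))
    Phi_tr, Y_tr = Phi[:n_train], Y[:n_train]
    Phi_val, Y_val = Phi[n_train:], Y[n_train:]
    best_lam, best_loss = None, np.inf
    for lam in candidates:
        theta_hat = ridge_estimate(Phi_tr, Y_tr, lam)         # fit on the training set only
        val_loss = np.sum((Y_val - Phi_val @ theta_hat) ** 2) # unregularized validation loss
        if val_loss < best_loss:
            best_lam, best_loss = lam, val_loss
    return best_lam, best_loss
```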



Regularization

Alternative formulation for Ridge regression

It is possible to show that the problem

arg min_{θ ∈ M_p(θ)} V(θ),   V(θ) = J(θ) + λ θᵀθ

is equivalent to the problem

arg min_{θ ∈ M_p(θ)} J(θ)   subject to   θᵀθ ≤ K

For every λ there is some K such that the solutions θ̂ of the two optimization
problems are the same. The two approaches can be related using Lagrange
multipliers.
The parameter K can be seen as a budget for how large the norm of θ can be.
Note that K plays the role of 1/λ. In fact, decreasing K has the effect of shrinking
the estimated parameters towards zero.
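A compact sketch of the Lagrange-multiplier connection, written in LaTeX (the multiplier is denoted μ to avoid confusion with the λ of the penalized problem):

```latex
% Constrained form: minimize J(theta) subject to theta^T theta <= K.
% Its Lagrangian, with multiplier mu >= 0, reads
\mathcal{L}(\theta, \mu) = J(\theta) + \mu \left( \theta^{T}\theta - K \right)
% For a fixed mu, the term -mu K does not depend on theta, so minimizing
% L(theta, mu) over theta is the same as minimizing J(theta) + mu * theta^T theta,
% i.e. the penalized problem with lambda = mu. Conversely, a solution \hat\theta
% of the penalized problem with a given lambda solves the constrained problem
% with budget K = \hat\theta^{T}\hat\theta.
```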

