09 Regularization

The document discusses regularization, which is a technique used in regression and classification problems to counteract overfitting. It involves adding a penalty term to the loss function, with a tuning parameter λ controlling the strength of regularization. Ridge regression is an example where an L2 norm penalty is added, shrinking parameter estimates towards zero and improving the conditioning of the optimization problem. The tuning parameter λ manages the bias-variance tradeoff, and its best value is chosen using a validation set.


Regularization
Regularization consists in adding a penalty term to the loss function:

V(θ) = J(θ) + λ g(θ)

where λ > 0 is a tuning parameter. The solution of the learning-from-data problem becomes

θ̂ = arg min_{θ ∈ M_p(θ)} V(θ).

This technique is used in both regression and classification problems. One of the most common forms is quadratic regularization:

V(θ) = J(θ) + λ ‖θ‖₂² = J(θ) + λ θᵀθ.

Why regularization?
To counteract the effects of overfitting.
To make the optimization problem better conditioned.

Note: the tuning parameter λ must be determined separately, just as the model complexity is.
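As an illustration (not part of the slides), a minimal Python sketch of such a quadratically regularized loss, assuming a least-squares data-fit term J(θ) built from a regressor matrix Phi and an output vector Y (all names here are illustrative):

```python
import numpy as np

def regularized_loss(theta, Phi, Y, lam):
    """V(theta) = J(theta) + lam * ||theta||_2^2, with a quadratic data-fit term J."""
    residual = Y - Phi @ theta          # prediction errors
    J = residual @ residual             # J(theta): sum of squared errors
    penalty = lam * (theta @ theta)     # lam * theta^T theta
    return J + penalty
```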
Regularization

Regularized least squares: Ridge regression


Consider the static model

y(t) = ϕᵀ(t) θ + ε(t),   t = 1, 2, …, N

and the associated quadratic regularized LS problem

min_θ V(θ),   V(θ) = Σ_{t=1}^{N} ε²(t) + λ θᵀθ = ‖Y − Φ θ‖² + λ ‖θ‖₂²

This estimation method is called Ridge regression. By setting the gradient of V (θ) to zero
it is easy to find the normal equations
" T
Φ Φ + λ Ip θ = ΦT Y
#

where I_p is the p × p identity matrix and p is the complexity of the model. The
regularized LS estimator is thus given by

θ̂ = (ΦᵀΦ + λ I_p)⁻¹ Φᵀ Y
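A minimal numerical sketch of this closed-form estimator, assuming Phi is the N × p regressor matrix and Y the output vector (the names are illustrative; solving the linear system is preferable to forming the inverse explicitly):

```python
import numpy as np

def ridge_estimate(Phi, Y, lam):
    """Solve (Phi^T Phi + lam * I_p) theta = Phi^T Y for theta."""
    p = Phi.shape[1]
    A = Phi.T @ Phi + lam * np.eye(p)   # regularized normal-equation matrix
    b = Phi.T @ Y
    return np.linalg.solve(A, b)        # numerically safer than inv(A) @ b
```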



Regularization

The penalty term λ ‖θ‖₂² is also called a shrinkage penalty in statistics because it has the effect of shrinking the estimated parameters towards zero.

Assume now the existence of the true model

y(t) = ϕᵀ(t) θ* + w(t),   t = 1, 2, …, N

By analyzing the statistical properties of the Ridge estimator we get


" T #−1 T ∗ ∗
" T #−1 ∗
E[θ̂] = Φ Φ + λ Ip Φ Φ θ = θ − λ Φ Φ + λ Ip θ $= θ ∗

2 T
#−1 T T
#−1 2
(ΦT Φ)−1
" "
cov(θ̂) = σw Φ Φ + λ Ip Φ Φ Φ Φ + λ Ip < σw

The Ridge estimator θ̂ is biased and cov(θ̂) < cov(θ̂_LS).


As λ increases, the shrinkage of the estimated coefficients leads to a reduction in the variance of the estimates at the expense of an increase in bias ⇒ λ is a hyperparameter that can manage the bias-variance tradeoff.
If ΦᵀΦ is ill-conditioned (or even rank deficient), the use of λ leads to the better-conditioned matrix ΦᵀΦ + λ I_p.
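A rough Monte Carlo illustration of these two facts; the true parameter vector, noise level, sample size and regressors below are arbitrary assumptions chosen only to make the trend visible:

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 30, 5
theta_true = np.ones(p)
Phi = rng.normal(size=(N, p))       # fixed regressor matrix
sigma_w = 0.5

def ridge(Phi, Y, lam):
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ Y)

for lam in [0.0, 1.0, 10.0]:
    estimates = []
    for _ in range(2000):                       # repeat the experiment many times
        Y = Phi @ theta_true + sigma_w * rng.normal(size=N)
        estimates.append(ridge(Phi, Y, lam))
    estimates = np.array(estimates)
    bias = np.linalg.norm(estimates.mean(axis=0) - theta_true)   # norm of the bias vector
    variance = estimates.var(axis=0).sum()      # trace of the sample covariance
    print(f"lambda={lam:5.1f}  bias={bias:.4f}  total variance={variance:.4f}")
```

As λ grows, the printed bias increases while the total variance of the estimates decreases.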
Regularization

Example: polynomial fitting of the model y(t) = sin u(t) + w(t).


[Figure: four panels of fitted curves versus the sample index (0–600). Top row: LS with a 3rd-order polynomial; LS with a 12th-order polynomial. Bottom row: Ridge with a 12th-order polynomial, λ = 0.1; Ridge with a 12th-order polynomial, λ = 3. Each panel shows the true curve, the validation set and the estimated curve.]
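A sketch of an experiment of this kind; the sample size, noise level, input grid and input standardization below are assumptions, not the settings used to produce the figure:

```python
import numpy as np

rng = np.random.default_rng(1)
N, order = 600, 12
u = np.linspace(0, 2 * np.pi, N)
y = np.sin(u) + 0.2 * rng.normal(size=N)        # y(t) = sin u(t) + w(t)

x = (u - u.mean()) / u.std()                    # standardize u before building the polynomial
Phi = np.vander(x, order + 1, increasing=True)  # columns 1, x, x^2, ..., x^12

def ridge(Phi, Y, lam):
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ Y)

for lam in [0.0, 0.1, 3.0]:                     # lam = 0 recovers plain LS
    theta_hat = ridge(Phi, y, lam)
    fit_error = np.mean((Phi @ theta_hat - np.sin(u)) ** 2)
    print(f"lambda={lam:4.1f}  mean squared error vs. true curve = {fit_error:.4f}")
```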



Regularization

Regularization can also be exploited for the identification of dynamic models and for classification problems. In general:

If the model complexity p is high (many parameters), it may not be possible to estimate several of them accurately ⇒ it is advantageous to pull them towards zero, as the ones having the smallest influence on J(θ) will be affected most by the shrinkage property ⇒ regularization allows complex models to be trained on small data sets without severe overfitting.
The problem of minimizing J(θ) may be ill-conditioned, especially when the complexity p is high, in the sense that the Hessian J″(θ) may be ill-conditioned ⇒ adding the norm penalty adds λ I_p to this matrix so that it becomes better conditioned.
The choice of λ is a crucial issue, as we may think of λ as a knob to control the bias-variance tradeoff (the larger λ, the larger the number of parameters that will be close to zero). The best way consists in choosing the value of λ leading to the smallest value of the loss function evaluated on the validation set, as in the sketch below.
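A minimal sketch of this validation-based selection; the 70/30 split and the candidate grid of λ values are assumptions, and ridge_estimate is the closed-form estimator sketched earlier:

```python
import numpy as np

def ridge_estimate(Phi, Y, lam):
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ Y)

def select_lambda(Phi, Y, candidates, train_fraction=0.7):
    """Pick the lambda with the smallest loss on a held-out validation set."""
    n_train = int(train_fraction * len(Y))
    Phi_tr, Y_tr = Phi[:n_train], Y[:n_train]
    Phi_val, Y_val = Phi[n_train:], Y[n_train:]
    best_lam, best_loss = None, np.inf
    for lam in candidates:
        theta_hat = ridge_estimate(Phi_tr, Y_tr, lam)         # fit on the training set only
        val_loss = np.sum((Y_val - Phi_val @ theta_hat) ** 2) # unregularized validation loss
        if val_loss < best_loss:
            best_lam, best_loss = lam, val_loss
    return best_lam, best_loss
```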



Regularization

Alternative formulation for Ridge regression

It is possible to show that the problem

arg min_{θ ∈ M_p(θ)} V(θ),   V(θ) = J(θ) + λ θᵀθ

is equivalent to the problem

arg min_{θ ∈ M_p(θ)} J(θ)   subject to   θᵀθ ≤ K

For every λ there is some K such that the solutions θ̂ of the two optimization
problems are the same. The two approaches can be related using Lagrange
multipliers.
The parameter K can be seen as a budget for how large the norm of θ can be.
Note that K plays the role of 1/λ. In fact, decreasing K has the effect of shrinking
the estimated parameters towards zero.
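A compact sketch of the Lagrange-multiplier connection, written in LaTeX (the multiplier is denoted μ to avoid confusion with the λ of the penalized problem):

```latex
% Constrained form: minimize J(theta) subject to theta^T theta <= K.
% Its Lagrangian, with multiplier mu >= 0, reads
\mathcal{L}(\theta, \mu) = J(\theta) + \mu \left( \theta^{T}\theta - K \right)
% For a fixed mu, the term -mu K does not depend on theta, so minimizing
% L(theta, mu) over theta is the same as minimizing J(theta) + mu * theta^T theta,
% i.e. the penalized problem with lambda = mu. Conversely, a solution \hat\theta
% of the penalized problem with a given lambda solves the constrained problem
% with budget K = \hat\theta^{T}\hat\theta.
```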

