w1d Linear Regression Regularization
The fitted function $f(\mathbf{x})$ doesn’t usually match the training data exactly. Each training item
has a residual, $y^{(n)} - f(\mathbf{x}^{(n)})$, which is normally non-zero. Why don’t we get perfect fits?
• Data is usually inherently noisy or stochastic, in which case it’s impossible to exactly
predict y from x. For example, if a builder mixes several batches of concrete with the
same quantities specified by x, we wouldn’t expect their observed strengths y to be
exactly the same.
• Even if the outputs are noiseless, N > D data-points are unlikely to lie exactly on any
function represented by a linear combination of our D basis functions.
If we don’t include enough basis functions, we will underfit our data. For example, if some
points lie exactly along a cubic curve:
import numpy as np

N = 100; D = 1
X = np.random.rand(N, D) - 0.5   # inputs spread over (-0.5, 0.5)
yy = X[:, 0]**3                  # noiseless cubic targets, shape (N,)
We would not be able to fit this data accurately if we only put linear and quadratic basis
functions in our augmented design matrix Φ.
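As a minimal sketch of this (reusing X and yy from the snippet above), we can build Φ with only constant, linear and quadratic columns and check the residuals of the least-squares fit:
# Sketch: constant, linear and quadratic columns only; they cannot represent the cubic.
Phi = np.concatenate([np.ones((N, 1)), X, X**2], axis=1)
w = np.linalg.lstsq(Phi, yy, rcond=None)[0]
print(np.max(np.abs(yy - Phi @ w)))   # residuals are clearly non-zero: an underfit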
You could fit the cubic data above fairly accurately with a few RBF functions. The fit
wouldn’t extrapolate well outside the x ∈ (−0.5, 0.5) range of observations, but you can get
an accurate fit close to where there is data. To avoid underfitting, we need a model with
enough representational power to closely approximate the underlying function.
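A sketch of such an RBF fit (the five centres and the bandwidth of 0.25 are illustrative choices):
# Sketch: a handful of Gaussian RBFs; centres and bandwidth are illustrative.
cc = np.linspace(-0.5, 0.5, 5)                     # RBF centres
Phi_rbf = np.exp(-(X - cc[None, :])**2 / 0.25**2)  # N x 5 design matrix
w_rbf = np.linalg.lstsq(Phi_rbf, yy, rcond=None)[0]
print(np.max(np.abs(yy - Phi_rbf @ w_rbf)))        # small where we have data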
When the number of training points N is small, it’s easy to fit the observations with low
square error. In fact, usually if we have N or more basis functions, such as N RBFs with
different centres, the residuals will all be zero!¹ However, we should not trust this fit. It’s
hard for us as intelligent humans to guess what an arbitrary function is doing in between
only a few observations, so we shouldn’t believe the result of fitting an arbitrary model
either. Moreover, if the observations are noisy, it seems unlikely that a good fit should match
the observed data exactly anyway.
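To see the zero-residual claim concretely, here is a small sketch with one RBF centred on each of N = 5 training points (the bandwidth and noise level are illustrative choices):
# Sketch: one RBF per training point gives a square, usually invertible, design matrix.
N_small = 5
X_small = np.random.rand(N_small, 1) - 0.5
yy_small = X_small[:, 0]**3 + 0.1*np.random.randn(N_small)      # noisy targets
Phi_sq = np.exp(-(X_small - X_small[:, 0][None, :])**2 / 0.25**2)
w_sq = np.linalg.solve(Phi_sq, yy_small)
print(np.max(np.abs(yy_small - Phi_sq @ w_sq)))                 # ~0 (machine precision)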
As advocated by Acton’s rant in note w1a, one possible approach to modelling is as follows.
Start with a simple model with only a few parameters. This model may underfit, that is, not
represent all of the structure evident in the training data. We could then consider a series of
more complicated models while we feel that fitting these models can still be justified by the
amount of data we have.
However, limiting the number of parameters in a model isn’t always easy or the right
approach. If our inputs have many features, even simple linear regression (without additional
basis functions) has many parameters. An example is predicting some characteristic of an
organism (a phenotype) from DNA data, which is often in the form of $>10^5$ features².
We could consider removing features from high-dimensional inputs to make a smaller
model, but filtering is not always the correct approach either. If some features are noisy
measurements of the same underlying property, it is better to average all of them rather
than to select one of them. However, we may not know in advance which groups of features
should be averaged, and which selected.
Another approach to modelling is to use large models (models with many free parameters),
but to discourage unreasonable fits that match our noisy training data too closely.
1. The basis functions need to produce N linearly-independent columns in the feature or design matrix Φ. Most
basis functions do have this property, but the technical details are too involved to get into here. There’s a reference
in the previous note.
2. Such as Single-Nucleotide Polymorphisms (SNPs), pronounced “snips”.
Example 1: Many noisy measurements of one quantity. In this example, the features and output will both be noisy measurements of the underlying quantity:
$$x_d^{(n)} \sim \mathcal{N}\big(\mu^{(n)},\, 0.01^2\big), \qquad d = 1\ldots D, \tag{2}$$
$$y^{(n)} \sim \mathcal{N}\big(\mu^{(n)},\, 0.1^2\big). \tag{3}$$
The notation $\mathcal{N}(\mu, \sigma^2)$ means the values are drawn from a Gaussian or Normal distribution
centred on $\mu$, with variance $\sigma^2$.
In this situation, averaging the $x_d$ measurements would be a reasonable estimate of the
underlying feature $\mu$, and hence the output $y$. Thus a regression model with $w_d = 1/D$ would
make reasonable predictions of $y$ from $\mathbf{x}$. Do you recover something like this model if you fit
linear regression? Try fitting randomly-generated datasets with various N and D. You can
generate data from the above model as follows:
mu = np.random.rand(N)                                          # underlying quantities mu^(n)
X = np.tile(mu[:,None], (1, D)) + 0.01*np.random.randn(N, D)    # D noisy copies of each mu^(n)
yy = 0.1*np.random.randn(N) + mu                                # noisier observations of mu^(n)
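A minimal sketch of that experiment (the sizes N = 60 and D = 50 are illustrative choices):
# Sketch: D large and N not much larger; compare the fitted weights to 1/D.
N, D = 60, 50
mu = np.random.rand(N)
X = np.tile(mu[:, None], (1, D)) + 0.01*np.random.randn(N, D)
yy = 0.1*np.random.randn(N) + mu
w_fit = np.linalg.lstsq(X, yy, rcond=None)[0]
print(w_fit.min(), w_fit.max())   # usually far from the "ideal" 1/D = 0.02
print(w_fit.mean())               # yet the average weight is often close to 1/D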
By making D large and N not much larger than D, the weights that give the best least-squares
fit to the training data are much larger in magnitude than $w_d = 1/D$. By using some weights much
smaller and others much larger than the ideal values, the model can exploit small differences
between input features to fit the noise in the observations. If we tried to interpret the value
of a weight as meaning something about the corresponding feature, we would embarrass
ourselves. However, the average of the weights is often close to $1/D$, and predictions on more
data generated in the same way as above might be ok. The model will generalize badly,
making wild predictions, if we test on inputs generated from $x_d \sim \mathcal{N}(\mu, 1)$.
Example 2: Explaining noise with many basis functions. Consider data drawn as follows:
$$x_d^{(n)} \sim \text{Uniform}[0, 1], \qquad d = 1\ldots D, \tag{4}$$
$$y^{(n)} \sim \mathcal{N}(0, 1). \tag{5}$$
The outputs have no relationship to the inputs. The predictor with the smallest average
square error on future test cases is $f(\mathbf{x}) = 0$, and its average square error will be one. We now
consider what happens if we fit a model with many basis functions. If we use a high-degree
polynomial, or many RBF basis functions, we can get a square training error lower than one.
However, the error on new data would be larger. The fits usually have extreme weights with
large magnitudes (for example $\sim 10^3$). There is a danger that for some inputs the predictions
could be extreme.
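As a sketch of this effect with a high-degree polynomial in one input dimension (the degree and data sizes are illustrative choices):
# Sketch: fit a degree-15 polynomial to outputs that are pure noise.
N, degree = 30, 15
X = np.random.rand(N, 1)
yy = np.random.randn(N)                         # outputs unrelated to inputs
Phi = X ** np.arange(degree + 1)                # N x (degree+1) monomial features
w = np.linalg.lstsq(Phi, yy, rcond=None)[0]
X_test = np.random.rand(1000, 1)
y_test = np.random.randn(1000)
Phi_test = X_test ** np.arange(degree + 1)
print(np.mean((yy - Phi @ w)**2))               # training error below one
print(np.mean((y_test - Phi_test @ w)**2))      # test error usually above one
print(np.max(np.abs(w)))                        # weights can be extreme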
It’s possible to represent a large range of interesting functions using weights with small
magnitude. Yet least-squares fits often combine extremely large positive and negative weights,
giving fits that pass unreasonably close to noisy observations. We would like to avoid fitting
these extreme weights.
A standard approach is to add a penalty on the size of the weights to the least-squares cost,
giving the L2-regularized cost function (also known as ridge regression):
$$E_\lambda(\mathbf{w}) = (\mathbf{y} - \Phi\mathbf{w})^\top(\mathbf{y} - \Phi\mathbf{w}) + \lambda\, \mathbf{w}^\top\mathbf{w}.$$
For λ = 0 we only care about fitting the data, but for larger values of λ we trade off the
accuracy of the fit so that we can make the weights smaller in magnitude.
We can fit the regularized cost function with the same linear least-squares fitting routine as
before. This time, instead of adding new features, we add new data items. If our original
matrix of input features Φ is N × K, for N data items and K basis functions, we add K rows
to both the vector of labels and matrix of input features:
$$\tilde{\mathbf{y}} = \begin{bmatrix} \mathbf{y} \\ \mathbf{0}_K \end{bmatrix}, \qquad \tilde{\Phi} = \begin{bmatrix} \Phi \\ \sqrt{\lambda}\,\mathbb{I}_K \end{bmatrix}. \tag{8}$$
Thus we can fit training data (ỹ, Φ̃) using least-squares code that knows nothing about
regularization, and fit the regularized cost function.
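A minimal sketch of this trick in NumPy (the helper name fit_regularized is just for illustration):
# Sketch of the data-augmentation trick; the helper name is illustrative.
def fit_regularized(Phi, yy, lam):
    """Minimize ||yy - Phi w||^2 + lam * ||w||^2 using plain least squares."""
    K = Phi.shape[1]
    Phi_tilde = np.vstack([Phi, np.sqrt(lam)*np.eye(K)])   # K extra rows
    yy_tilde = np.concatenate([yy, np.zeros(K)])
    return np.linalg.lstsq(Phi_tilde, yy_tilde, rcond=None)[0]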
Below we see a situation where using least-squares with a dozen RBF basis functions leads
to overfitting.
[Figure: training data on x ∈ (−0.5, 0.5), with the unregularized least-squares RBF fit (“least sq.”) and the regularized fit plotted over it.]
One could argue for changing the basis functions. However, as illustrated above, regularizing
the same linear regression model can give less extreme predictions, at the expense of giving
a fit further from the training points. The regularized fit depends strongly on λ. For λ = 0
we obtain the least squares fit.
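A sketch of this comparison with a dozen RBFs, reusing the fit_regularized helper sketched earlier (the data, centres, bandwidth and value of λ are all illustrative choices):
# Sketch: a dozen RBFs fit with and without regularization; settings are illustrative.
N = 15
X = np.random.rand(N, 1) - 0.5
yy = X[:, 0]**3 + 0.1*np.random.randn(N)
cc = np.linspace(-0.5, 0.5, 12)                      # a dozen RBF centres
Phi = np.exp(-(X - cc[None, :])**2 / 0.1**2)
w_lsq = fit_regularized(Phi, yy, 0.0)                # lambda = 0: ordinary least squares
w_reg = fit_regularized(Phi, yy, 0.1)                # regularized fit
print(np.max(np.abs(w_lsq)), np.max(np.abs(w_reg)))  # regularized weights are smaller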