
02 - Linear Models - C - Regularization - Logistic - Regression

1. Regularization helps prevent overfitting by constraining the model to reduce its complexity, e.g. by limiting the number of parameters or restricting their range of values; providing more training data also helps.
2. Ridge regression (L2 regularization) adds a penalty term, weighted by λ, to the loss function that shrinks large weights and keeps them from growing too large.
3. The regularization parameter λ controls the effective model complexity and hence the degree of overfitting; λ is selected using a validation set to optimize generalization.


Regularization

Preventing overfitting

1. Reduce the number of model parameters


2. Constrain the range of model parameter values
3. Provide more data
4. Any other method that prevents over-optimizing the training error
Ridge Regression (=L2 Regularization)
λ is a hyperparameter

• Mean Squared Error (MSE) loss with L2 regularization

$$L(\mathbf{w}) = \frac{1}{N}\sum_{i=1}^{N} \frac{1}{2}\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right)^2 + \frac{\lambda}{2}\lVert\mathbf{w}\rVert^2$$

where $\lVert\mathbf{w}\rVert^2 = w_1^2 + \cdots + w_d^2$


• That is, it prevents the scale of w from growing too large → note that w acts as a
kind of slope → the function cannot have too steep a slope
• For neural networks, this is called weight decay
• What about regularizing the bias term? Not necessary: the bias does not affect overfitting
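
To make this concrete, here is a minimal NumPy sketch of the L2-regularized MSE above together with a closed-form ridge fit; the helper names (ridge_loss, ridge_fit), the choice to leave the bias unpenalized, and the toy data are illustrative assumptions.

```python
import numpy as np

def ridge_loss(w, b, X, y, lam):
    """(1/N) * sum of 0.5*(w.x + b - y)^2 plus (lam/2)*||w||^2."""
    residual = X @ w + b - y
    return np.mean(0.5 * residual ** 2) + 0.5 * lam * np.sum(w ** 2)

def ridge_fit(X, y, lam):
    """Closed-form minimizer of ridge_loss; the bias column is left unpenalized."""
    N, d = X.shape
    Xb = np.hstack([X, np.ones((N, 1))])      # append a constant column for the bias
    reg = lam * np.eye(d + 1)
    reg[d, d] = 0.0                           # do not regularize the bias term
    theta = np.linalg.solve(Xb.T @ Xb / N + reg, Xb.T @ y / N)
    return theta[:d], theta[d]                # (w, b)

# toy usage
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
w, b = ridge_fit(X, y, lam=0.1)
print(ridge_loss(w, b, X, y, lam=0.1))
```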
Ridge Regression

[Figure (Bishop, PRML, Fig. 1.7): plots of the M = 9 polynomial fitted using the regularized error function (1.4) for ln λ = −18 and ln λ = 0; the unregularized case (λ = 0, i.e. ln λ = −∞) is the bottom-right plot of Figure 1.4.]

[Figure (Bishop, PRML, Fig. 1.8): root-mean-square error E_RMS = sqrt(2 E(w*) / N) on the training and test sets plotted against ln λ for the M = 9 polynomial.]

From Bishop, PRML, Section 1.1: One technique that is often used to control the over-fitting phenomenon is regularization, which involves adding a penalty term to the error function in order to discourage the coefficients from reaching large values. The simplest such penalty term takes the form of a sum of squares of all of the coefficients, leading to a modified error function

$$\tilde{E}(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\left\{y(x_n, \mathbf{w}) - t_n\right\}^2 + \frac{\lambda}{2}\lVert\mathbf{w}\rVert^2 \qquad (1.4)$$

where $\lVert\mathbf{w}\rVert^2 \equiv \mathbf{w}^\top\mathbf{w} = w_0^2 + w_1^2 + \cdots + w_M^2$, and the coefficient λ governs the relative importance of the regularization term compared with the sum-of-squares error term. Note that often the coefficient w0 is omitted from the regularizer because its inclusion causes the results to depend on the choice of origin for the target variable (Hastie et al., 2001), or it may be included but with its own regularization coefficient. The error function in (1.4) can be minimized exactly in closed form.

The impact of the regularization term on the generalization error can be seen by plotting the RMS error for both the training and test sets against ln λ, as shown in Figure 1.8. We see that in effect λ now controls the effective complexity of the model and hence determines the degree of over-fitting. To find a suitable value of the model complexity (either M or λ), the available data can be partitioned into a training set, used to determine the coefficients w, and a separate validation set, also called a hold-out set, used to optimize the model complexity. In many cases, however, this proves too wasteful of valuable training data, and more sophisticated approaches are needed.

Table 1.2 (Bishop): coefficients w* of the M = 9 polynomial for various values of the regularization parameter λ. Here ln λ = −∞ corresponds to no regularization. As the value of λ increases, the typical magnitude of the coefficients gets smaller.

        ln λ = −∞    ln λ = −18    ln λ = 0
w0*          0.35          0.35        0.13
w1*        232.37          4.74       -0.05
w2*      -5321.83         -0.77       -0.06
w3*      48568.31        -31.97       -0.05
w4*    -231639.30         -3.89       -0.03
w5*     640042.26         55.28       -0.02
w6*   -1061800.52         41.32       -0.01
w7*    1042400.18        -45.95       -0.00
w8*    -557682.99        -91.53        0.00
w9*     125201.43         72.68        0.01
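
To illustrate the hold-out procedure above, here is a rough NumPy sketch of selecting λ on a validation set; the λ grid, the helper names, and the synthetic data are assumptions for illustration only.

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Closed-form ridge fit with an unpenalized bias column (illustrative helper)."""
    N, d = X.shape
    Xb = np.hstack([X, np.ones((N, 1))])
    reg = lam * np.eye(d + 1)
    reg[d, d] = 0.0
    return np.linalg.solve(Xb.T @ Xb / N + reg, Xb.T @ y / N)

def rmse(theta, X, y):
    """Root-mean-square error of the fitted model on (X, y)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return np.sqrt(np.mean((Xb @ theta - y) ** 2))

rng = np.random.default_rng(1)
X = rng.uniform(size=(40, 9))                       # stand-in for polynomial features
y = np.sin(2 * np.pi * X[:, 0]) + 0.2 * rng.normal(size=40)

# hold out part of the data as a validation set to choose the model complexity (lambda)
X_tr, y_tr, X_val, y_val = X[:30], y[:30], X[30:], y[30:]
grid = [np.exp(k) for k in range(-18, 1, 3)]        # a grid over ln(lambda)
best_lam = min(grid, key=lambda lam: rmse(fit_ridge(X_tr, y_tr, lam), X_val, y_val))
print("chosen lambda:", best_lam)
```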
Lasso Regression (=L1 Regularization)
The weights w1, w2, w3, ..., wn may end up as, e.g., 0, a, 0, 0, ..., a: rather than shrinking
every value a little (a/10, a/9, ..., a/10), L1 removes some weight parameters entirely.
• Mean Squared Error (MSE) loss with L1 regularization

$$L(\mathbf{w}) = \frac{1}{N}\sum_{i=1}^{N} \frac{1}{2}\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right)^2 + \lambda\sum_{j=1}^{d}\left|w_j\right|$$

• L1 regularization encourages sparsity → it drives some weights to exactly zero
• This can be seen as a form of automatic feature selection
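
As a sketch of how L1 regularization yields exact zeros, the following NumPy code runs proximal gradient descent (ISTA) with soft-thresholding on a toy problem; the algorithm choice, names, and data are illustrative, and the bias term is omitted for brevity.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t*||.||_1: shrinks entries toward 0 and zeroes the small ones."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, steps=500):
    """Proximal gradient descent for (1/N)*sum(0.5*(w.x - y)^2) + lam*||w||_1 (no bias)."""
    N, d = X.shape
    w = np.zeros(d)
    lr = N / (np.linalg.norm(X, 2) ** 2)     # 1 / Lipschitz constant of the smooth part
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / N         # gradient of the squared-error part
        w = soft_threshold(w - lr * grad, lr * lam)
    return w

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 10))
true_w = np.array([3.0, 0, 0, -2.0, 0, 0, 0, 0, 1.5, 0])   # sparse ground truth
y = X @ true_w + 0.1 * rng.normal(size=100)
print(np.round(lasso_ista(X, y, lam=0.1), 2))               # several weights come out exactly 0.0
```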
L1 & L2 Regularization and Sparsity

[Figure: comparison of solutions without regularization, with L2 regularization, and with L1 regularization.]
Linear Classification
Logistic Regression

• A thought experiment
§ Can we use the linear regression model for binary classification?
Binary Classification as a Regression

• We represent the target values by either 0 or 1, depending on the class
• But in regression, the prediction range is usually (−∞, +∞)
• What if we could limit the prediction range to [0, 1]?
Sigmoid Function

$$\sigma(z) = \frac{1}{1 + \exp(-z)}$$

• A squashing function that maps the input z into the range [0, 1]
• A key idea: we can interpret this output as a probability between 0% and 100%
• We can then set a decision rule
  § If the output is > 0.5 → class 1, otherwise class 0

• Now let’s parameterize this model. How?


Logistic Regression Model

$$p^{(i)} = \sigma\!\left(\mathbf{w}^\top\mathbf{x}^{(i)} + b\right) = \frac{1}{1 + \exp\!\left(-\mathbf{w}^\top\mathbf{x}^{(i)} - b\right)}$$

• Now the output p^(i) is always between 0 and 1
• We can interpret this as the probability of belonging to class 1
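
A minimal sketch of this model in NumPy, assuming made-up parameter values and the 0.5 threshold rule from the sigmoid slide; names such as predict_proba are illustrative.

```python
import numpy as np

def sigmoid(z):
    """Squashes any real input into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(w, b, X):
    """p(i) = sigmoid(w . x(i) + b), read as the probability of class 1."""
    return sigmoid(X @ w + b)

def predict_class(w, b, X, threshold=0.5):
    """Decision rule: class 1 if the probability exceeds the threshold, else class 0."""
    return (predict_proba(w, b, X) > threshold).astype(int)

# toy usage with made-up parameters
w, b = np.array([1.5, -2.0]), 0.3
X = np.array([[0.2, 0.1], [2.0, -1.0]])
print(predict_proba(w, b, X))   # probabilities in (0, 1)
print(predict_class(w, b, X))   # hard 0/1 labels
```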
Cost Function: Binary Cross Entropy (BCE)

• We label one class by y=1, and the other class by y=0


• Maximize the probability of having the correct label

• For datapoint (i) whose class y=1, maximize p(i)


• For datapoint (j) whose class y=0, maximize (1 - p(j))

• Combining both, we can write it as minimizing the following


$$L(\mathbf{w}) = -\frac{1}{N}\sum_{i=1}^{N}\left[\, y^{(i)} \log p^{(i)} + \left(1 - y^{(i)}\right)\log\!\left(1 - p^{(i)}\right)\right]$$
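
A small NumPy sketch of the BCE loss above; the clipping constant eps is an illustrative safeguard against log(0), not something specified in the slides.

```python
import numpy as np

def bce_loss(w, b, X, y, eps=1e-12):
    """Binary cross-entropy: -(1/N) * sum(y*log(p) + (1-y)*log(1-p))."""
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    p = np.clip(p, eps, 1.0 - eps)           # keep log() away from 0 (illustrative safeguard)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

# toy usage: with w = 0 and b = 0 every p = 0.5, so the loss is log(2) ≈ 0.693
X = np.array([[0.5, 1.0], [-1.0, 2.0], [1.5, -0.5]])
y = np.array([1.0, 0.0, 1.0])
print(bce_loss(np.zeros(2), 0.0, X, y))
```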
Training

• Can we derive a closed-form solution for logistic regression?
• If not, can we compute the gradient?
• To compute the gradient of the objective function, you can use the
  following fact about the derivative of the sigmoid function:

$$\sigma'(z) = \sigma(z)\left(1 - \sigma(z)\right)$$

§ (Can you derive this?)
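
One way the derivation can go, using the chain rule on $\sigma(z) = (1 + e^{-z})^{-1}$:

$$\sigma'(z) = \frac{d}{dz}\left(1 + e^{-z}\right)^{-1} = \frac{e^{-z}}{\left(1 + e^{-z}\right)^{2}} = \frac{1}{1 + e^{-z}}\cdot\frac{e^{-z}}{1 + e^{-z}} = \sigma(z)\left(1 - \sigma(z)\right),$$

since $1 - \sigma(z) = \dfrac{e^{-z}}{1 + e^{-z}}$.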


The Gradient of the BCE

$$\frac{\partial}{\partial w_j} L(\mathbf{w}) = \frac{1}{N}\sum_{i=1}^{N}\left(\sigma\!\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b\right) - y^{(i)}\right) x_j^{(i)}$$

Note that the error term σ(w^⊤x^(i) + b) − y^(i) is signed rather than squared, so it carries the direction of the error.
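
Putting the gradient to use, here is a rough batch gradient-descent training loop in NumPy; the learning rate, epoch count, and toy data are illustrative assumptions rather than recommended settings.

```python
import numpy as np

def train_logistic_regression(X, y, lr=0.1, epochs=1000):
    """Batch gradient descent on the BCE loss (there is no closed-form solution)."""
    N, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted probabilities
        err = p - y                              # the signed error term from the gradient above
        w -= lr * (X.T @ err) / N                # dL/dw_j = mean((p - y) * x_j)
        b -= lr * np.mean(err)                   # dL/db   = mean(p - y)
    return w, b

# toy usage on a roughly linearly separable problem
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))
y = (X[:, 0] - X[:, 1] > 0).astype(float)
w, b = train_logistic_regression(X, y)
print(w, b)
```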


Decision Boundaries

[Figure: plot of the sigmoid output versus x, illustrating the resulting decision boundary.]
Regularization

• Like other models, logistic regression models can also be regularized by
  using L1 and L2 regularization (or using other methods)
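
For instance, a sketch of L2-regularized logistic regression: the same gradient-descent loop as before with an extra λw term in the weight gradient; leaving the bias unregularized mirrors the earlier ridge slide and is an assumption here.

```python
import numpy as np

def train_logistic_l2(X, y, lam=0.1, lr=0.1, epochs=1000):
    """Gradient descent on BCE + (lam/2)*||w||^2, leaving the bias unregularized."""
    N, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        err = p - y
        w -= lr * ((X.T @ err) / N + lam * w)    # extra lam*w term comes from the L2 penalty
        b -= lr * np.mean(err)                   # the bias term carries no penalty
    return w, b

# toy usage
rng = np.random.default_rng(4)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)
print(train_logistic_l2(X, y))
```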
