
02 - Linear Models - C - Regularization - Logistic - Regression

1. Regularization helps prevent overfitting by constraining the model to reduce its complexity, e.g. by limiting the number of parameters or restricting their range of values; providing more training data also helps.
2. Ridge regression (L2 regularization) adds a penalty term, weighted by λ, to the loss function that shrinks large weights and keeps them from growing too large.
3. The regularization parameter λ controls the effective model complexity and hence the degree of overfitting; λ is selected using a validation set to optimize generalization.


Regularization

Preventing overfitting

1. Reduce the number of model parameters


2. Constrain the range of model parameter values
3. Provide more data
4. Any other method that prevents over-optimizing the training error
Ridge Regression (=L2 Regularization)
λ is a hyperparameter

• Mean Squared Error (MSE) loss with L2 regularization

$$L(\mathbf{w}) = \frac{1}{N}\sum_{i=1}^{N} \frac{1}{2}\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right)^2 + \frac{\lambda}{2}\lVert\mathbf{w}\rVert^2$$

where $\lVert\mathbf{w}\rVert^2 = w_1^2 + \cdots + w_d^2$


• That is, it prevents the scale of w from growing too large → note that w acts as a
kind of slope → the function cannot have too steep a slope
• For neural networks, this is called weight decay
• What about regularizing the bias term? Not necessary: the bias does not affect overfitting
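
To make this concrete, here is a minimal NumPy sketch of the L2-regularized MSE above together with a closed-form ridge fit; the helper names (ridge_loss, ridge_fit), the choice to leave the bias unpenalized, and the toy data are illustrative assumptions.

```python
import numpy as np

def ridge_loss(w, b, X, y, lam):
    """(1/N) * sum of 0.5*(w.x + b - y)^2 plus (lam/2)*||w||^2."""
    residual = X @ w + b - y
    return np.mean(0.5 * residual ** 2) + 0.5 * lam * np.sum(w ** 2)

def ridge_fit(X, y, lam):
    """Closed-form minimizer of ridge_loss; the bias column is left unpenalized."""
    N, d = X.shape
    Xb = np.hstack([X, np.ones((N, 1))])      # append a constant column for the bias
    reg = lam * np.eye(d + 1)
    reg[d, d] = 0.0                           # do not regularize the bias term
    theta = np.linalg.solve(Xb.T @ Xb / N + reg, Xb.T @ y / N)
    return theta[:d], theta[d]                # (w, b)

# toy usage
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
w, b = ridge_fit(X, y, lam=0.1)
print(ridge_loss(w, b, X, y, lam=0.1))
```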
Ridge Regression

[Figure (Bishop, PRML, Fig. 1.7): plots of the M = 9 polynomial fitted using the regularized error function (1.4) for ln λ = −18 and ln λ = 0; the unregularized case (λ = 0, i.e. ln λ = −∞) is the bottom-right plot of Figure 1.4.]

[Figure (Bishop, PRML, Fig. 1.8): root-mean-square error E_RMS = sqrt(2 E(w*) / N) on the training and test sets plotted against ln λ for the M = 9 polynomial.]

From Bishop, PRML, Section 1.1: One technique that is often used to control the over-fitting phenomenon is regularization, which involves adding a penalty term to the error function in order to discourage the coefficients from reaching large values. The simplest such penalty term takes the form of a sum of squares of all of the coefficients, leading to a modified error function

$$\tilde{E}(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\left\{y(x_n, \mathbf{w}) - t_n\right\}^2 + \frac{\lambda}{2}\lVert\mathbf{w}\rVert^2 \qquad (1.4)$$

where $\lVert\mathbf{w}\rVert^2 \equiv \mathbf{w}^\top\mathbf{w} = w_0^2 + w_1^2 + \cdots + w_M^2$, and the coefficient λ governs the relative importance of the regularization term compared with the sum-of-squares error term. Note that often the coefficient w0 is omitted from the regularizer because its inclusion causes the results to depend on the choice of origin for the target variable (Hastie et al., 2001), or it may be included but with its own regularization coefficient. The error function in (1.4) can be minimized exactly in closed form.

The impact of the regularization term on the generalization error can be seen by plotting the RMS error for both the training and test sets against ln λ, as shown in Figure 1.8. We see that in effect λ now controls the effective complexity of the model and hence determines the degree of over-fitting. To find a suitable value of the model complexity (either M or λ), the available data can be partitioned into a training set, used to determine the coefficients w, and a separate validation set, also called a hold-out set, used to optimize the model complexity. In many cases, however, this proves too wasteful of valuable training data, and more sophisticated approaches are needed.

Table 1.2 (Bishop): coefficients w* of the M = 9 polynomial for various values of the regularization parameter λ. Here ln λ = −∞ corresponds to no regularization. As the value of λ increases, the typical magnitude of the coefficients gets smaller.

        ln λ = −∞    ln λ = −18    ln λ = 0
w0*          0.35          0.35        0.13
w1*        232.37          4.74       -0.05
w2*      -5321.83         -0.77       -0.06
w3*      48568.31        -31.97       -0.05
w4*    -231639.30         -3.89       -0.03
w5*     640042.26         55.28       -0.02
w6*   -1061800.52         41.32       -0.01
w7*    1042400.18        -45.95       -0.00
w8*    -557682.99        -91.53        0.00
w9*     125201.43         72.68        0.01
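
To illustrate the hold-out procedure above, here is a rough NumPy sketch of selecting λ on a validation set; the λ grid, the helper names, and the synthetic data are assumptions for illustration only.

```python
import numpy as np

def fit_ridge(X, y, lam):
    """Closed-form ridge fit with an unpenalized bias column (illustrative helper)."""
    N, d = X.shape
    Xb = np.hstack([X, np.ones((N, 1))])
    reg = lam * np.eye(d + 1)
    reg[d, d] = 0.0
    return np.linalg.solve(Xb.T @ Xb / N + reg, Xb.T @ y / N)

def rmse(theta, X, y):
    """Root-mean-square error of the fitted model on (X, y)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return np.sqrt(np.mean((Xb @ theta - y) ** 2))

rng = np.random.default_rng(1)
X = rng.uniform(size=(40, 9))                       # stand-in for polynomial features
y = np.sin(2 * np.pi * X[:, 0]) + 0.2 * rng.normal(size=40)

# hold out part of the data as a validation set to choose the model complexity (lambda)
X_tr, y_tr, X_val, y_val = X[:30], y[:30], X[30:], y[30:]
grid = [np.exp(k) for k in range(-18, 1, 3)]        # a grid over ln(lambda)
best_lam = min(grid, key=lambda lam: rmse(fit_ridge(X_tr, y_tr, lam), X_val, y_val))
print("chosen lambda:", best_lam)
```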
Lasso Regression (=L1 Regularization)
The weights w1, w2, w3, ..., wn may end up as, e.g., 0, a, 0, 0, ..., a: rather than shrinking
every value a little (a/10, a/9, ..., a/10), L1 removes some weight parameters entirely.
• Mean Squared Error (MSE) loss with L1 regularization

$$L(\mathbf{w}) = \frac{1}{N}\sum_{i=1}^{N} \frac{1}{2}\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right)^2 + \lambda\sum_{j=1}^{d}\left|w_j\right|$$

• L1 regularization encourages sparsity → it drives some weights to exactly zero
• This can be seen as a form of automatic feature selection
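
As a sketch of how L1 regularization yields exact zeros, the following NumPy code runs proximal gradient descent (ISTA) with soft-thresholding on a toy problem; the algorithm choice, names, and data are illustrative, and the bias term is omitted for brevity.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t*||.||_1: shrinks entries toward 0 and zeroes the small ones."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, steps=500):
    """Proximal gradient descent for (1/N)*sum(0.5*(w.x - y)^2) + lam*||w||_1 (no bias)."""
    N, d = X.shape
    w = np.zeros(d)
    lr = N / (np.linalg.norm(X, 2) ** 2)     # 1 / Lipschitz constant of the smooth part
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / N         # gradient of the squared-error part
        w = soft_threshold(w - lr * grad, lr * lam)
    return w

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 10))
true_w = np.array([3.0, 0, 0, -2.0, 0, 0, 0, 0, 1.5, 0])   # sparse ground truth
y = X @ true_w + 0.1 * rng.normal(size=100)
print(np.round(lasso_ista(X, y, lam=0.1), 2))               # several weights come out exactly 0.0
```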
L1 & L2 Regularization and Sparsity

[Figure: comparison of solutions without regularization, with L2 regularization, and with L1 regularization.]
Linear Classification
Logistic Regression

• A thought experiment
§ Can we use the linear regression model for binary classification?
Binary Classification as a Regression

• We represent the target values by either 0 or 1, depending on the class
• But in regression, the prediction range is usually (−∞, +∞)
• What if we could limit the prediction range to [0, 1]?
Sigmoid Function

$$\sigma(z) = \frac{1}{1 + \exp(-z)}$$

• A squashing function that maps the input z into the range [0, 1]
• A key idea: we can interpret this output as a probability between 0% and 100%
• We can then set a decision rule
  § If the output is > 0.5 → class 1, otherwise class 0

• Now let’s parameterize this model. How?


Logistic Regression Model

$$p^{(i)} = \sigma\!\left(\mathbf{w}^\top\mathbf{x}^{(i)} + b\right) = \frac{1}{1 + \exp\!\left(-\mathbf{w}^\top\mathbf{x}^{(i)} - b\right)}$$

• Now the output p^(i) is always between 0 and 1
• We can interpret this as the probability of belonging to class 1
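
A minimal sketch of this model in NumPy, assuming made-up parameter values and the 0.5 threshold rule from the sigmoid slide; names such as predict_proba are illustrative.

```python
import numpy as np

def sigmoid(z):
    """Squashes any real input into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(w, b, X):
    """p(i) = sigmoid(w . x(i) + b), read as the probability of class 1."""
    return sigmoid(X @ w + b)

def predict_class(w, b, X, threshold=0.5):
    """Decision rule: class 1 if the probability exceeds the threshold, else class 0."""
    return (predict_proba(w, b, X) > threshold).astype(int)

# toy usage with made-up parameters
w, b = np.array([1.5, -2.0]), 0.3
X = np.array([[0.2, 0.1], [2.0, -1.0]])
print(predict_proba(w, b, X))   # probabilities in (0, 1)
print(predict_class(w, b, X))   # hard 0/1 labels
```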
Cost Function: Binary Cross Entropy (BCE)

• We label one class by y=1, and the other class by y=0


• Maximize the probability of having the correct label

• For datapoint (i) whose class y=1, maximize p(i)


• For datapoint (j) whose class y=0, maximize (1 - p(j))

• Combining both, we can write it as minimizing the following


$$L(\mathbf{w}) = -\frac{1}{N}\sum_{i=1}^{N}\left[\, y^{(i)} \log p^{(i)} + \left(1 - y^{(i)}\right)\log\!\left(1 - p^{(i)}\right)\right]$$
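
A small NumPy sketch of the BCE loss above; the clipping constant eps is an illustrative safeguard against log(0), not something specified in the slides.

```python
import numpy as np

def bce_loss(w, b, X, y, eps=1e-12):
    """Binary cross-entropy: -(1/N) * sum(y*log(p) + (1-y)*log(1-p))."""
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    p = np.clip(p, eps, 1.0 - eps)           # keep log() away from 0 (illustrative safeguard)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

# toy usage: with w = 0 and b = 0 every p = 0.5, so the loss is log(2) ≈ 0.693
X = np.array([[0.5, 1.0], [-1.0, 2.0], [1.5, -0.5]])
y = np.array([1.0, 0.0, 1.0])
print(bce_loss(np.zeros(2), 0.0, X, y))
```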
Training

• Can we derive a closed-form solution for logistic regression?
• If not, can we compute the gradient?
• To compute the gradient of the objective function, you can use the
  following fact about the derivative of the sigmoid function:

$$\sigma'(z) = \sigma(z)\left(1 - \sigma(z)\right)$$

§ (Can you derive this?)
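
One way the derivation can go, using the chain rule on $\sigma(z) = (1 + e^{-z})^{-1}$:

$$\sigma'(z) = \frac{d}{dz}\left(1 + e^{-z}\right)^{-1} = \frac{e^{-z}}{\left(1 + e^{-z}\right)^{2}} = \frac{1}{1 + e^{-z}}\cdot\frac{e^{-z}}{1 + e^{-z}} = \sigma(z)\left(1 - \sigma(z)\right),$$

since $1 - \sigma(z) = \dfrac{e^{-z}}{1 + e^{-z}}$.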


The Gradient of the BCE

$$\frac{\partial}{\partial w_j} L(\mathbf{w}) = \frac{1}{N}\sum_{i=1}^{N}\left(\sigma\!\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b\right) - y^{(i)}\right) x_j^{(i)}$$

Note that the error term σ(w^⊤x^(i) + b) − y^(i) is signed rather than squared, so it carries the direction of the error.
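
Putting the gradient to use, here is a rough batch gradient-descent training loop in NumPy; the learning rate, epoch count, and toy data are illustrative assumptions rather than recommended settings.

```python
import numpy as np

def train_logistic_regression(X, y, lr=0.1, epochs=1000):
    """Batch gradient descent on the BCE loss (there is no closed-form solution)."""
    N, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted probabilities
        err = p - y                              # the signed error term from the gradient above
        w -= lr * (X.T @ err) / N                # dL/dw_j = mean((p - y) * x_j)
        b -= lr * np.mean(err)                   # dL/db   = mean(p - y)
    return w, b

# toy usage on a roughly linearly separable problem
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 2))
y = (X[:, 0] - X[:, 1] > 0).astype(float)
w, b = train_logistic_regression(X, y)
print(w, b)
```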


Decision Boundaries

[Figure: plot of the sigmoid output versus x, illustrating the resulting decision boundary.]
Regularization

• Like other models, logistic regression models can also be regularized by
  using L1 and L2 regularization (or using other methods)
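
For instance, a sketch of L2-regularized logistic regression: the same gradient-descent loop as before with an extra λw term in the weight gradient; leaving the bias unregularized mirrors the earlier ridge slide and is an assumption here.

```python
import numpy as np

def train_logistic_l2(X, y, lam=0.1, lr=0.1, epochs=1000):
    """Gradient descent on BCE + (lam/2)*||w||^2, leaving the bias unregularized."""
    N, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        err = p - y
        w -= lr * ((X.T @ err) / N + lam * w)    # extra lam*w term comes from the L2 penalty
        b -= lr * np.mean(err)                   # the bias term carries no penalty
    return w, b

# toy usage
rng = np.random.default_rng(4)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)
print(train_logistic_l2(X, y))
```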
