02 - Linear Models - C - Regularization - Logistic Regression
Preventing overfitting
• Mean Squared Error (MSE) loss with L2 regularization:

$$L(\mathbf{w}) = \frac{1}{N}\sum_{i=1}^{N}\frac{1}{2}\left(\mathbf{w}^{\top}\mathbf{x}^{(i)} + b - y^{(i)}\right)^{2} + \frac{\lambda}{2}\lVert\mathbf{w}\rVert^{2}$$
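As a quick illustration of this objective, a minimal NumPy sketch is shown below (the function and variable names are ours, not the lecture's; `lam` stands for λ, and the bias b is left out of the penalty):

```python
import numpy as np

def ridge_mse_loss(w, b, X, y, lam):
    """L2-regularized MSE: (1/N) * sum_i 0.5*(w^T x_i + b - y_i)^2 + (lam/2)*||w||^2."""
    residuals = X @ w + b - y                  # shape (N,)
    data_term = 0.5 * np.mean(residuals ** 2)  # (1/N) * sum_i 0.5 * residual_i^2
    reg_term = 0.5 * lam * np.dot(w, w)        # bias b is not regularized
    return data_term + reg_term

# Tiny usage example on synthetic data
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
print(ridge_mse_loss(np.zeros(3), 0.0, X, y, lam=0.1))
```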
Ridge Regression (= L2 Regularization)

Excerpt shown on the slide (from Bishop, Pattern Recognition and Machine Learning, Section 1.1; the accompanying figures and coefficient table are summarized at the end):

"… test set RMS errors are shown, for various values of M, in Figure 1.5. The test set error is a measure of how well we are doing in predicting the values of t for new data observations of x. We note from Figure 1.5 that small values of M give relatively large values of the test set error, and this can be attributed to the fact that the corresponding polynomials are rather inflexible and are incapable of capturing the oscillations in the function sin(2πx). Values of M in the range 3 ≤ M ≤ 8 give small values for the test set error, and these also give reasonable representations of the generating function sin(2πx), as can be seen, for the case of M = 3, from Figure 1.4.

… a modified error function of the form

$$\tilde{E}(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\left\{y(x_n, \mathbf{w}) - t_n\right\}^{2} + \frac{\lambda}{2}\lVert\mathbf{w}\rVert^{2} \qquad (1.4)$$

where $\lVert\mathbf{w}\rVert^{2} \equiv \mathbf{w}^{\top}\mathbf{w} = w_0^2 + w_1^2 + \ldots + w_M^2$, and the coefficient λ governs the relative importance of the regularization term compared with the sum-of-squares error term. Note that often the coefficient w0 is omitted from the regularizer because its inclusion causes the results to depend on the choice of origin for the target variable (Hastie et al., 2001), or it may be included but with its own regularization coefficient (we shall discuss this topic in more detail in Section 5.5.1). Again, the error function in (1.4) can be minimized exactly in closed form (Exercise 1.2). Techniques such as this are known …

… as λ increases, the typical magnitude of the coefficients gets smaller.

The impact of the regularization term on the generalization error can be seen by plotting the value of the RMS error (1.3) for both training and test sets against ln λ, as shown in Figure 1.8. We see that in effect λ now controls the effective complexity of the model and hence determines the degree of over-fitting.

The issue of model complexity is an important one and will be discussed at length in Section 1.3. Here we simply note that, if we were trying to solve a practical application using this approach of minimizing an error function, we would have to find a way to determine a suitable value for the model complexity. The results above suggest a simple way of achieving this, namely by taking the available data and …"

[Figure/table residue from the excerpt, not reproduced: fits of the M = 9 polynomial for ln λ = −18 and ln λ = 0; training/test RMS error plotted against ln λ; and a table of the M = 9 coefficients for several values of λ, whose magnitudes shrink as λ grows.]
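The excerpt notes that the L2-regularized error function (1.4) can be minimized exactly in closed form. Below is a minimal sketch of that closed-form solution via the normal equations, assuming the textbook form of the objective (no 1/N factor, bias ignored for simplicity; all names are illustrative):

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Minimize 0.5*||X w - y||^2 + 0.5*lam*||w||^2 by solving
    (X^T X + lam * I) w = X^T y."""
    n_features = X.shape[1]
    A = X.T @ X + lam * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y)

# As lam increases, the fitted coefficients shrink toward zero.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([3.0, -2.0, 0.0, 1.0, 0.5]) + 0.1 * rng.normal(size=50)
for lam in (0.0, 1.0, 100.0):
    print(lam, np.round(ridge_closed_form(X, y, lam), 2))
```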
Lasso Regression (=L1 Regularization)
• With L1 regularization the learned weights w1, w2, w3, ..., wn can become sparse, e.g. 0, a, 0, 0, ..., a
• L1 does not just shrink every value a little (a/10, a/9, ..., a/10); it drives some weight parameters exactly to zero, effectively removing them (see the sketch after the comparison figure below).
• Mean Squared Error (MSE) loss with L1 regularization:

$$L(\mathbf{w}) = \frac{1}{N}\sum_{i=1}^{N}\frac{1}{2}\left(\mathbf{w}^{\top}\mathbf{x}^{(i)} + b - y^{(i)}\right)^{2} + \lambda\sum_{j}\lvert w_{j}\rvert$$
[Figure: L2 regularization vs. L1 regularization]
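To make the sparsity contrast concrete, here is a small comparison using scikit-learn's Ridge and Lasso estimators (assuming scikit-learn is installed; the feature count and alpha values are arbitrary illustration choices, not from the lecture):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

# Synthetic data in which only 2 of the 10 features actually matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
true_w = np.zeros(10)
true_w[[1, 7]] = [2.0, -3.0]
y = X @ true_w + 0.1 * rng.normal(size=200)

ridge = Ridge(alpha=1.0).fit(X, y)  # L2 penalty: shrinks every coefficient a little
lasso = Lasso(alpha=0.1).fit(X, y)  # L1 penalty: sets most irrelevant coefficients to exactly 0

print("ridge:", np.round(ridge.coef_, 3))
print("lasso:", np.round(lasso.coef_, 3))
```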
Linear Classification
Logistic Regression
• A thought experiment
§ Can we use the linear regression model for binary classification?
Binary Classification as a Regression
$$\sigma(z) = \frac{1}{1 + \exp(-z)}$$

$$p^{(i)} = \sigma\!\left(\mathbf{w}^{\top}\mathbf{x}^{(i)} + b\right) = \frac{1}{1 + \exp\!\left(-\mathbf{w}^{\top}\mathbf{x}^{(i)} - b\right)}$$
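A minimal NumPy sketch of the sigmoid and the resulting class-probability prediction (function names are ours, chosen for illustration):

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(w, b, X):
    """p^(i) = sigmoid(w^T x^(i) + b), read as P(y = 1 | x^(i))."""
    return sigmoid(X @ w + b)

# A probability above 0.5 is classified as class 1, otherwise class 0.
w = np.array([1.5, -0.5])
b = 0.2
X = np.array([[1.0, 2.0], [-1.0, 0.5]])
print(predict_proba(w, b, X))
```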
Derivative of the sigmoid:

$$\sigma'(z) = \sigma(z)\left(1 - \sigma(z)\right)$$

Gradient of the logistic-regression loss with respect to each weight:

$$\frac{\partial}{\partial w_j} L(\mathbf{w}) = \frac{1}{N}\sum_{i=1}^{N}\left(\sigma\!\left(\mathbf{w}^{\top}\mathbf{x}^{(i)}\right) - y^{(i)}\right) x_j^{(i)}$$
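As a sketch of how this gradient is used in practice, below is a small batch gradient-descent loop for logistic regression (learning rate, step count, and all names are illustrative assumptions; the bias is kept as a separate scalar, as in the prediction formula above):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, lr=0.1, n_steps=1000):
    """Batch gradient descent on the logistic loss.
    Per-weight gradient: (1/N) * sum_i (sigmoid(w^T x_i + b) - y_i) * x_ij."""
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_steps):
        p = sigmoid(X @ w + b)  # predicted probabilities, shape (N,)
        error = p - y           # (sigma(w^T x_i + b) - y_i) for each sample
        w -= lr * (X.T @ error) / n_samples
        b -= lr * error.mean()
        # An L2 penalty could be added here as an extra lr * lam * w term.
    return w, b

# Toy usage: two Gaussian blobs labelled 0 and 1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, size=(50, 2)), rng.normal(1.0, 1.0, size=(50, 2))])
y = np.array([0] * 50 + [1] * 50)
w, b = train_logistic_regression(X, y)
print(np.round(w, 2), round(b, 2))
```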
[Figure: plot of the sigmoid function σ(x)]
Regularization