
Regression and generalization

CE-717: Machine Learning


Sharif University of Technology

M. Soleymani
Fall 2016
Topics
 Beyond linear regression models
 Evaluation & model selection
 Regularization
 Probabilistic perspective for the regression problem

2
Recall: Linear regression (squared loss)
 Linear regression functions
$f: \mathbb{R} \to \mathbb{R}$,  $f(x; \boldsymbol{w}) = w_0 + w_1 x$
$f: \mathbb{R}^d \to \mathbb{R}$,  $f(\boldsymbol{x}; \boldsymbol{w}) = w_0 + w_1 x_1 + \dots + w_d x_d$
$\boldsymbol{w} = (w_0, w_1, \dots, w_d)^T$ are the parameters we need to set.

 Minimizing the squared loss for linear regression


$J(\boldsymbol{w}) = \lVert \boldsymbol{y} - \boldsymbol{X}\boldsymbol{w} \rVert_2^2$

 We obtain $\boldsymbol{w} = (\boldsymbol{X}^T \boldsymbol{X})^{-1} \boldsymbol{X}^T \boldsymbol{y}$

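As a minimal sketch (not from the slides), the closed-form solution above can be computed with NumPy; the helper name and the synthetic data are illustrative:

import numpy as np

def fit_linear_regression(X, y):
    # Ordinary least squares via the normal equations.
    # X: (n, d) feature matrix, y: (n,) targets; a column of ones handles w0.
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    # Solve (Xb^T Xb) w = Xb^T y instead of forming the inverse explicitly.
    return np.linalg.solve(Xb.T @ Xb, Xb.T @ y)

# Tiny synthetic check: should recover roughly [2.0, 1.0, -0.5, 0.3].
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = 2.0 + X @ np.array([1.0, -0.5, 0.3]) + 0.1 * rng.normal(size=50)
print(fit_linear_regression(X, y))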
3
Beyond linear regression
 How to extend the linear regression to non-linear
functions?
 Transform the data using basis functions
 Learn a linear regression on the new feature vectors (obtained
by basis functions)

4
Beyond linear regression
 𝑚𝑡ℎ order polynomial regression (univariate 𝑓 ∶ ℝ ⟶ ℝ)
$f(x; \boldsymbol{w}) = w_0 + w_1 x + \dots + w_{m-1} x^{m-1} + w_m x^m$

 Solution: $\boldsymbol{w} = (\boldsymbol{X}'^T \boldsymbol{X}')^{-1} \boldsymbol{X}'^T \boldsymbol{y}$

$\boldsymbol{y} = \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(n)} \end{bmatrix}$,  $\boldsymbol{X}' = \begin{bmatrix} 1 & x^{(1)} & (x^{(1)})^2 & \cdots & (x^{(1)})^m \\ 1 & x^{(2)} & (x^{(2)})^2 & \cdots & (x^{(2)})^m \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x^{(n)} & (x^{(n)})^2 & \cdots & (x^{(n)})^m \end{bmatrix}$,  $\boldsymbol{w} = \begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_m \end{bmatrix}$
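A hedged sketch of this fit (not from the slides; NumPy assumed, helper names illustrative): the design matrix X′ above is a Vandermonde matrix, so it can be built and solved directly.

import numpy as np

def fit_polynomial(x, y, m):
    # Least-squares fit of an m-th order polynomial; returns [w0, ..., wm].
    Xp = np.vander(x, m + 1, increasing=True)   # columns: 1, x, x^2, ..., x^m
    w, *_ = np.linalg.lstsq(Xp, y, rcond=None)  # numerically safer than the explicit inverse
    return w

def predict_polynomial(x, w):
    return np.vander(x, len(w), increasing=True) @ w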

5
Polynomial regression: example

[Figure: polynomial fits for m = 1, 3, 5, 7]

6
Generalized linear
 Linear combination of fixed non-linear function of the
input vector

$f(\boldsymbol{x}; \boldsymbol{w}) = w_0 + w_1 \phi_1(\boldsymbol{x}) + \dots + w_m \phi_m(\boldsymbol{x})$

$\{\phi_1(\boldsymbol{x}), \dots, \phi_m(\boldsymbol{x})\}$: set of basis functions (or features)

$\phi_i(\boldsymbol{x}): \mathbb{R}^d \to \mathbb{R}$

7
Basis functions: examples
 Linear

 Polynomial (univariate)

8
Basis functions: examples
 Gaussian: $\phi_j(\boldsymbol{x}) = \exp\!\left(-\frac{\lVert \boldsymbol{x} - \boldsymbol{c}_j \rVert^2}{2\sigma_j^2}\right)$

 Sigmoid: $\phi_j(\boldsymbol{x}) = \sigma\!\left(\frac{\boldsymbol{x} - \boldsymbol{c}_j}{\sigma_j}\right)$, where $\sigma(a) = \frac{1}{1 + \exp(-a)}$

9
Radial Basis Functions: prototypes
 Predictions based on similarity to “prototypes”:
$\phi_j(\boldsymbol{x}) = \exp\!\left(-\frac{1}{2\sigma_j^2}\,\lVert \boldsymbol{x} - \boldsymbol{c}_j \rVert^2\right)$

 Measuring the similarity to the prototypes 𝒄1 , … , 𝒄𝑚


 σ2 controls how quickly it vanishes as a function of the
distance to the prototype.
 Training examples themselves could serve as prototypes

10
Generalized linear: optimization
$J(\boldsymbol{w}) = \sum_{i=1}^{n} \left( y^{(i)} - f(\boldsymbol{x}^{(i)}; \boldsymbol{w}) \right)^2 = \sum_{i=1}^{n} \left( y^{(i)} - \boldsymbol{w}^T \boldsymbol{\phi}(\boldsymbol{x}^{(i)}) \right)^2$

$\boldsymbol{y} = \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(n)} \end{bmatrix}$,  $\boldsymbol{\Phi} = \begin{bmatrix} 1 & \phi_1(\boldsymbol{x}^{(1)}) & \cdots & \phi_m(\boldsymbol{x}^{(1)}) \\ 1 & \phi_1(\boldsymbol{x}^{(2)}) & \cdots & \phi_m(\boldsymbol{x}^{(2)}) \\ \vdots & \vdots & \ddots & \vdots \\ 1 & \phi_1(\boldsymbol{x}^{(n)}) & \cdots & \phi_m(\boldsymbol{x}^{(n)}) \end{bmatrix}$,  $\boldsymbol{w} = \begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_m \end{bmatrix}$

$\boldsymbol{w} = (\boldsymbol{\Phi}^T \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^T \boldsymbol{y}$
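A small sketch of generalized linear regression with Gaussian basis functions (assumptions beyond the slides: NumPy, a shared width sigma, and prototype centers supplied by the user, e.g. a subset of the training points):

import numpy as np

def gaussian_basis(X, centers, sigma):
    # Map X (n, d) to Gaussian RBF features and prepend a bias column of ones.
    sq_dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)  # (n, m)
    Phi = np.exp(-sq_dist / (2.0 * sigma ** 2))
    return np.hstack([np.ones((X.shape[0], 1)), Phi])

def fit_generalized_linear(X, y, centers, sigma):
    Phi = gaussian_basis(X, centers, sigma)
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)  # least-squares solution of Phi w ≈ y
    return w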

11
Model complexity and overfitting
 With limited training data, models may achieve zero
training error but a large test error.

Training (empirical) loss: $\frac{1}{n}\sum_{i=1}^{n}\left( y^{(i)} - f(\boldsymbol{x}^{(i)}; \boldsymbol{\theta}) \right)^2 \approx 0$

Expected (test) loss: $\mathbb{E}_{\boldsymbol{x},y}\!\left[\left( y - f(\boldsymbol{x}; \boldsymbol{\theta}) \right)^2\right] \gg 0$

 Over-fitting: when the training loss no longer bears any relation to the test (generalization) loss.
 Fails to generalize to unseen examples.

12
Polynomial regression
[Figure: polynomial fits of degree m = 0, 1, 3, 9 to the training data]

13 [Bishop]
Polynomial regression: training and test error

$RMSE = \sqrt{\frac{\sum_{i=1}^{n} \left( y^{(i)} - f(\boldsymbol{x}^{(i)}; \boldsymbol{\theta}) \right)^2}{n}}$

[Bishop]

14
Over-fitting causes
 Model complexity
 E.g., Model with a large number of parameters (degrees of
freedom)

 Low number of training data


 Small data size compared to the complexity of the model

15
Model complexity
 Example:
 Polynomials with larger 𝑚 are becoming increasingly tuned to
the random noise on the target values.

[Figure: polynomial fits for m = 0, 1, 3, 9]

16
[Bishop]
Number of training data & overfitting
 Over-fitting problem becomes less severe as the size of
training data increases.

[Figure: degree-9 polynomial fits with n = 15 and n = 100 training points]

[Bishop]

17
How to evaluate the learner’s performance?
 Generalization error: true (or expected) error that we
would like to optimize

 Two ways to assess the generalization error:


 Practical: Use a separate data set to test the model
 Theoretical: Law of Large numbers
 statistical bounds on the difference between training and expected
errors

18
Evaluation and model selection
 Evaluation:
 We need to measure how well the learned function predicts the target for unseen examples

 Model selection:
 Most of the time we need to select among a set of models
 Example: polynomials with different degree 𝑚
 and thus we need to evaluate these models first

19
Avoiding over-fitting
 Determine a suitable value for model complexity
 Simple hold-out method
 Cross-validation

 Regularization (Occam’s Razor)


 Explicit preference towards simple models
 Penalize for the model complexity in the objective function

 Bayesian approach

20
Simple hold-out: model selection
 Steps:
 Divide training data into training and validation set 𝑣_𝑠𝑒𝑡
 Use only the training set to train a set of models
 Evaluate each learned model on the validation set
 $J_v(\boldsymbol{w}) = \frac{1}{|v\_set|} \sum_{i \in v\_set} \left( y^{(i)} - f(\boldsymbol{x}^{(i)}; \boldsymbol{w}) \right)^2$

 Choose the best model based on the validation set error

 Usually, too wasteful of valuable training data
 Training data may be limited.
 On the other hand, a small validation set gives a relatively noisy estimate of performance.
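A possible hold-out sketch for the degree-selection example described above (illustrative helper names, not from the slides; assumes NumPy arrays x, y and a 70/30 train/validation split):

import numpy as np

def holdout_select_degree(x, y, degrees, val_frac=0.3, seed=0):
    # Split once into training and validation indices.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    n_val = int(val_frac * len(x))
    val, train = idx[:n_val], idx[n_val:]
    best_m, best_err = None, np.inf
    for m in degrees:
        w, *_ = np.linalg.lstsq(np.vander(x[train], m + 1, increasing=True),
                                y[train], rcond=None)
        pred = np.vander(x[val], m + 1, increasing=True) @ w
        err = np.mean((y[val] - pred) ** 2)       # J_v(w) for this model
        if err < best_err:
            best_m, best_err = m, err
    return best_m, best_err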

21
Simple hold out:
training, validation, and test sets
 Simple hold-out chooses the model that minimizes error on
validation set.

 $J_v(\boldsymbol{w})$ is likely to be an optimistic estimate of the generalization error.
 An extra parameter (e.g., the degree of the polynomial) has been fit to this set.

 Estimate the generalization error using the test set:
 the performance of the selected model is finally evaluated on the test set.

[Figure: the data is split into Training, Validation, and Test sets]
22
Cross-Validation (CV): Evaluation
 𝑘-fold cross-validation steps:
 Shuffle the dataset and randomly partition training data into 𝑘 groups of
approximately equal size
 for 𝑖 = 1 to 𝑘
 Choose the 𝑖-th group as the held-out validation group
 Train the model on all but the 𝑖-th group of data
 Evaluate the model on the held-out group
 Performance scores of the model from 𝑘 runs are averaged.
 The average error rate can be considered as an estimation of the true
performance.
[Figure: the k folds — on each of the k runs a different fold is held out for validation]
23
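One way the k-fold procedure above could look in code (a sketch, with NumPy and polynomial regression standing in for the model being evaluated):

import numpy as np

def k_fold_cv_mse(x, y, degree, k=5, seed=0):
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), k)   # k groups of roughly equal size
    errors = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        w, *_ = np.linalg.lstsq(np.vander(x[train], degree + 1, increasing=True),
                                y[train], rcond=None)
        pred = np.vander(x[val], degree + 1, increasing=True) @ w
        errors.append(np.mean((y[val] - pred) ** 2))      # error on the held-out fold
    return float(np.mean(errors))                         # average over the k runs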
Cross-Validation (CV): Model Selection
 For each model, we first find the average error estimated by CV.

 The model with the best average performance is selected.

24
Cross-validation: polynomial regression example
 5-fold CV
 100 runs
 average

[Figure: 5-fold CV error estimates for polynomial fits of degree m = 1, 3, 5, 7; the reported CV MSE values are 1.45, 0.30, 45.44, and 31759]
25
Leave-One-Out Cross Validation (LOOCV)
 When data is particularly scarce: cross-validation with k = N.
 Leave-one-out treats each training sample in turn as a test example and all other samples as the training set.

 Used for small datasets, when training data is valuable.
 LOOCV can be computationally expensive, as N training runs are required.
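A sketch of LOOCV as the k = N special case (NumPy; a polynomial model is assumed only for illustration):

import numpy as np

def loocv_mse(x, y, degree):
    n = len(x)
    errors = []
    for i in range(n):                                    # each sample is the validation set once
        train = np.delete(np.arange(n), i)
        w, *_ = np.linalg.lstsq(np.vander(x[train], degree + 1, increasing=True),
                                y[train], rcond=None)
        pred = np.vander(x[i:i + 1], degree + 1, increasing=True) @ w
        errors.append((y[i] - pred[0]) ** 2)
    return float(np.mean(errors))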

26
Regularization
 Adding a penalty term in the cost function to discourage
the coefficients from reaching large values.

 Ridge regression (weight decay):

$J(\boldsymbol{w}) = \sum_{i=1}^{n}\left( y^{(i)} - \boldsymbol{w}^T \boldsymbol{\phi}(\boldsymbol{x}^{(i)}) \right)^2 + \lambda\, \boldsymbol{w}^T \boldsymbol{w}$

$\boldsymbol{w} = (\boldsymbol{\Phi}^T \boldsymbol{\Phi} + \lambda \boldsymbol{I})^{-1} \boldsymbol{\Phi}^T \boldsymbol{y}$
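A sketch of this closed form (NumPy assumed; note that in practice the bias weight w0 is often left out of the penalty, which this simple version does not do):

import numpy as np

def fit_ridge(Phi, y, lam):
    # Ridge regression / weight decay: w = (Phi^T Phi + lam * I)^(-1) Phi^T y
    m = Phi.shape[1]
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(m), Phi.T @ y)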

27
Polynomial order
 Polynomials with larger m become increasingly tuned to the random noise on the target values.
 The magnitude of the coefficients typically grows as m increases.

[Bishop]

28
Regularization parameter
m = 9

[Table (Bishop): coefficient values $w_0, \dots, w_9$ of the degree-9 fit for $\ln\lambda = -\infty$ and $\ln\lambda = -18$]

29
Regularization parameter
 Generalization
 𝜆 now controls the effective complexity of the model and
hence determines the degree of over-fitting

30
[Bishop]
Choosing the regularization parameter
 A set of models with different values of 𝜆.

 Find 𝒘 for each model based on training data

 Find 𝐽𝑣 (𝒘) (or 𝐽𝑐𝑣 (𝒘)) for each model


 $J_v(\boldsymbol{w}) = \frac{1}{n_v} \sum_{i \in v\_set} \left( y^{(i)} - f(\boldsymbol{x}^{(i)}; \boldsymbol{w}) \right)^2$

 Select the model with the best 𝐽𝑣 (𝒘) (or 𝐽𝑐𝑣 (𝒘))

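A sketch of this selection loop for ridge regression (illustrative helper, not from the slides; Phi_train and Phi_val are design matrices built from the basis functions, and the candidate grid of λ values is an assumption):

import numpy as np

def select_lambda(Phi_train, y_train, Phi_val, y_val, lambdas):
    m = Phi_train.shape[1]
    best_lam, best_w, best_jv = None, None, np.inf
    for lam in lambdas:
        # Ridge solution on the training set for this lambda.
        w = np.linalg.solve(Phi_train.T @ Phi_train + lam * np.eye(m),
                            Phi_train.T @ y_train)
        j_v = np.mean((y_val - Phi_val @ w) ** 2)   # validation error J_v(w)
        if j_v < best_jv:
            best_lam, best_w, best_jv = lam, w, j_v
    return best_lam, best_w, best_jv

# A typical logarithmic grid could be: lambdas = 10.0 ** np.arange(-6, 3)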
31
The approximation-generalization trade-off

 Small true error shows good approximation of f out of sample.

 More complex ℋ ⇒ better chance of approximating f
 Less complex ℋ ⇒ better chance of generalizing out of sample

32
Complexity of Hypothesis Space: Example

[Figure: house price vs. size, fit with hypotheses of increasing complexity]
$w_0 + w_1 x$        $w_0 + w_1 x + w_2 x^2$        $w_0 + w_1 x + w_2 x^2 + w_3 x^3 + w_4 x^4$

Less complex ℋ                                More complex ℋ

33 This example has been adapted from: Prof. Andrew Ng’s slides
Complexity of Hypothesis Space: Example
$J_v(\boldsymbol{w}) = \frac{1}{n_v} \sum_{i \in val\_set} \left( y^{(i)} - f(\boldsymbol{x}^{(i)}; \boldsymbol{w}) \right)^2$

$J_{train}(\boldsymbol{w}) = \frac{1}{n_{train}} \sum_{i \in train\_set} \left( y^{(i)} - f(\boldsymbol{x}^{(i)}; \boldsymbol{w}) \right)^2$

[Figure: training error $J_{train}$ and validation error $J_v$ versus the degree of polynomial m]

35
Complexity of Hypothesis Space
 Less complex ℋ:
 𝐽𝑡𝑟𝑎𝑖𝑛 (𝒘) ≈ 𝐽𝑣 (𝒘) and 𝐽𝑡𝑟𝑎𝑖𝑛 (𝒘) is very high

 More complex ℋ:
 𝐽𝑡𝑟𝑎𝑖𝑛 (𝒘) ≪ 𝐽𝑣 (𝒘) and 𝐽𝑡𝑟𝑎𝑖𝑛 (𝒘) is low
[Figure: $J_{train}(\boldsymbol{w})$ and $J_v(\boldsymbol{w})$ versus the degree of polynomial m]
36
Size of training set
$f(x; \boldsymbol{w}) = w_0 + w_1 x + w_2 x^2$

$J_v(\boldsymbol{w}) = \frac{1}{n_v} \sum_{i \in val\_set} \left( y^{(i)} - f(x^{(i)}; \boldsymbol{w}) \right)^2$

$J_{train}(\boldsymbol{w}) = \frac{1}{n_{train}} \sum_{i \in train\_set} \left( y^{(i)} - f(x^{(i)}; \boldsymbol{w}) \right)^2$

[Figure: $J_{train}$ and $J_v$ versus the training set size n]

37 This slide has been adapted from: Prof. Andrew Ng’s slides
Less complex ℋ
$f(x; \boldsymbol{w}) = w_0 + w_1 x$

[Figure: learning curves for a simple model — as the training set size n grows, $J_{train}$ and $J_v$ converge to a high error; a linear fit to price vs. size is shown alongside]

If the model is very simple, getting more training data will not (by itself) help much.

38 This slide has been adapted from: Prof. Andrew Ng’s slides
More complex ℋ
$f(x; \boldsymbol{w}) = w_0 + w_1 x + \dots + w_{10} x^{10}$

[Figure: learning curves for a complex model — a gap remains between $J_v$ and $J_{train}$, and both keep improving as the training set size n grows; a degree-10 fit to price vs. size is shown alongside]

For more complex models, getting more training data usually helps.
39 This slide has been adapted from: Prof. Andrew Ng’s slides
Regularization: Example
$f(x; \boldsymbol{w}) = w_0 + w_1 x + w_2 x^2 + w_3 x^3 + w_4 x^4$

$J(\boldsymbol{w}) = \frac{1}{n}\sum_{i=1}^{n}\left( y^{(i)} - f(x^{(i)}; \boldsymbol{w}) \right)^2 + \lambda\, \boldsymbol{w}^T \boldsymbol{w}$

[Figure: price vs. size fits for three settings of the regularization parameter]
Large λ (prefers simpler models; $w_1 = w_2 \approx 0$)      Intermediate λ      Small λ, i.e. λ = 0 (prefers more complex models)
40 This example has been adapted from: Prof. Andrew Ng’s slides
Model complexity: Bias-variance trade-off
 Least squares can lead to severe over-fitting if complex models are trained using data sets of limited size.

 A frequentist viewpoint of the model complexity issue is known as the bias-variance trade-off.

41
Formal discussion on bias, variance, and noise

 Best unrestricted regression function

 Noise

 Bias and variance

42
The learning diagram: deterministic target

[Diagram (adapted from Abu-Mostafa): $h: \mathcal{X} \to \mathcal{Y}$; training examples $(x^{(1)}, y^{(1)}), \dots, (x^{(N)}, y^{(N)})$; $f: \mathcal{X} \to \mathcal{Y}$]

43
[Y.S. Abou Mostafa, et. al]
The learning diagram including noisy target

[Diagram (adapted from Abu-Mostafa): $h: \mathcal{X} \to \mathcal{Y}$; training examples $(x^{(1)}, y^{(1)}), \dots, (x^{(N)}, y^{(N)})$; $f(\boldsymbol{x}) = h(\boldsymbol{x})$; $f: \mathcal{X} \to \mathcal{Y}$]

$P(x, y) = P(x)\,P(y|x)$
(distribution on features × target distribution)
44
[Y.S. Abou Mostafa, et. al]
Best unrestricted regression function
 What if we know the joint distribution $P(\boldsymbol{x}, y)$ and place no constraints on the regression function?
 Cost function: mean squared error

$h^* = \underset{h:\,\mathbb{R}^d \to \mathbb{R}}{\operatorname{argmin}}\ \mathbb{E}_{\boldsymbol{x},y}\!\left[ \left( y - h(\boldsymbol{x}) \right)^2 \right]$

$h^*(\boldsymbol{x}) = \mathbb{E}_{y|\boldsymbol{x}}[y]$

45
Best unrestricted regression function: Proof
$\mathbb{E}_{\boldsymbol{x},y}\!\left[\left(y - h(\boldsymbol{x})\right)^2\right] = \iint \left(y - h(\boldsymbol{x})\right)^2 p(\boldsymbol{x}, y)\, d\boldsymbol{x}\, dy$

 For each $\boldsymbol{x}$, minimize the loss separately, since $h(\boldsymbol{x})$ can be chosen independently for each different $\boldsymbol{x}$:

$\frac{\delta\, \mathbb{E}_{\boldsymbol{x},y}\!\left[\left(y - h(\boldsymbol{x})\right)^2\right]}{\delta h(\boldsymbol{x})} = -2\int \left(y - h(\boldsymbol{x})\right) p(\boldsymbol{x}, y)\, dy = 0$

$\Rightarrow h(\boldsymbol{x}) = \frac{\int y\, p(\boldsymbol{x}, y)\, dy}{\int p(\boldsymbol{x}, y)\, dy} = \frac{\int y\, p(\boldsymbol{x}, y)\, dy}{p(\boldsymbol{x})} = \int y\, p(y|\boldsymbol{x})\, dy = \mathbb{E}_{y|\boldsymbol{x}}[y]$

$\Longrightarrow h^*(\boldsymbol{x}) = \mathbb{E}_{y|\boldsymbol{x}}[y]$

46
Error decomposition ($(\boldsymbol{x}, y) \sim P$; $h(\boldsymbol{x})$ minimizes the expected loss)

Expected loss:
$E_{true}\!\left[f_{\mathcal{D}}(\boldsymbol{x})\right] = \mathbb{E}_{\boldsymbol{x},y}\!\left[\left(f_{\mathcal{D}}(\boldsymbol{x}) - y\right)^2\right]$
$= \mathbb{E}_{\boldsymbol{x},y}\!\left[\left(f_{\mathcal{D}}(\boldsymbol{x}) - h(\boldsymbol{x}) + h(\boldsymbol{x}) - y\right)^2\right]$
$= \mathbb{E}_{\boldsymbol{x}}\!\left[\left(f_{\mathcal{D}}(\boldsymbol{x}) - h(\boldsymbol{x})\right)^2\right] + \mathbb{E}_{\boldsymbol{x},y}\!\left[\left(h(\boldsymbol{x}) - y\right)^2\right] + 2\,\mathbb{E}_{\boldsymbol{x},y}\!\left[\left(f_{\mathcal{D}}(\boldsymbol{x}) - h(\boldsymbol{x})\right)\left(h(\boldsymbol{x}) - y\right)\right]$

The cross term vanishes: $\mathbb{E}_{\boldsymbol{x},y}\!\left[\left(f_{\mathcal{D}}(\boldsymbol{x}) - h(\boldsymbol{x})\right)\left(h(\boldsymbol{x}) - y\right)\right] = \mathbb{E}_{\boldsymbol{x}}\!\left[\left(f_{\mathcal{D}}(\boldsymbol{x}) - h(\boldsymbol{x})\right)\,\mathbb{E}_{y|\boldsymbol{x}}\!\left[h(\boldsymbol{x}) - y\right]\right] = 0$
47
Error decomposition ($(\boldsymbol{x}, y) \sim P$; $h(\boldsymbol{x})$ minimizes the expected loss)

$E_{true}\!\left[f_{\mathcal{D}}(\boldsymbol{x})\right] = \mathbb{E}_{\boldsymbol{x},y}\!\left[\left(f_{\mathcal{D}}(\boldsymbol{x}) - y\right)^2\right]$
$= \mathbb{E}_{\boldsymbol{x},y}\!\left[\left(f_{\mathcal{D}}(\boldsymbol{x}) - h(\boldsymbol{x}) + h(\boldsymbol{x}) - y\right)^2\right]$
$= \mathbb{E}_{\boldsymbol{x}}\!\left[\left(f_{\mathcal{D}}(\boldsymbol{x}) - h(\boldsymbol{x})\right)^2\right] + \underbrace{\mathbb{E}_{\boldsymbol{x},y}\!\left[\left(h(\boldsymbol{x}) - y\right)^2\right]}_{\text{noise}} + \underbrace{2\,\mathbb{E}_{\boldsymbol{x},y}\!\left[\left(f_{\mathcal{D}}(\boldsymbol{x}) - h(\boldsymbol{x})\right)\left(h(\boldsymbol{x}) - y\right)\right]}_{=\,0}$

 The noise term is the irreducible minimum value of the loss function.

48
Expectation of true error
$E_{true}\!\left[f_{\mathcal{D}}(\boldsymbol{x})\right] = \mathbb{E}_{\boldsymbol{x},y}\!\left[\left(f_{\mathcal{D}}(\boldsymbol{x}) - y\right)^2\right] = \mathbb{E}_{\boldsymbol{x}}\!\left[\left(f_{\mathcal{D}}(\boldsymbol{x}) - h(\boldsymbol{x})\right)^2\right] + noise$

Taking the expectation over training sets $\mathcal{D}$:
$\mathbb{E}_{\mathcal{D}}\!\left[\mathbb{E}_{\boldsymbol{x}}\!\left[\left(f_{\mathcal{D}}(\boldsymbol{x}) - h(\boldsymbol{x})\right)^2\right]\right] = \mathbb{E}_{\boldsymbol{x}}\!\left[\mathbb{E}_{\mathcal{D}}\!\left[\left(f_{\mathcal{D}}(\boldsymbol{x}) - h(\boldsymbol{x})\right)^2\right]\right]$

We now want to focus on $\mathbb{E}_{\mathcal{D}}\!\left[\left(f_{\mathcal{D}}(\boldsymbol{x}) - h(\boldsymbol{x})\right)^2\right]$.

49
The average hypothesis

$\bar{f}(\boldsymbol{x}) \equiv \mathbb{E}_{\mathcal{D}}\!\left[f_{\mathcal{D}}(\boldsymbol{x})\right]$

$\bar{f}(\boldsymbol{x}) \approx \frac{1}{K} \sum_{k=1}^{K} f_{\mathcal{D}^{(k)}}(\boldsymbol{x})$

 $K$ training sets (each of size $N$) sampled from $P(\boldsymbol{x}, y)$: $\mathcal{D}^{(1)}, \mathcal{D}^{(2)}, \dots, \mathcal{D}^{(K)}$

50
Using the average hypothesis
$\mathbb{E}_{\mathcal{D}}\!\left[\left(f_{\mathcal{D}}(\boldsymbol{x}) - h(\boldsymbol{x})\right)^2\right] = \mathbb{E}_{\mathcal{D}}\!\left[\left(f_{\mathcal{D}}(\boldsymbol{x}) - \bar{f}(\boldsymbol{x}) + \bar{f}(\boldsymbol{x}) - h(\boldsymbol{x})\right)^2\right]$

$= \mathbb{E}_{\mathcal{D}}\!\left[\left(f_{\mathcal{D}}(\boldsymbol{x}) - \bar{f}(\boldsymbol{x})\right)^2\right] + \left(\bar{f}(\boldsymbol{x}) - h(\boldsymbol{x})\right)^2$

51
Bias and variance
$\mathbb{E}_{\mathcal{D}}\!\left[\left(f_{\mathcal{D}}(\boldsymbol{x}) - h(\boldsymbol{x})\right)^2\right] = \underbrace{\mathbb{E}_{\mathcal{D}}\!\left[\left(f_{\mathcal{D}}(\boldsymbol{x}) - \bar{f}(\boldsymbol{x})\right)^2\right]}_{\text{var}(\boldsymbol{x})} + \underbrace{\left(\bar{f}(\boldsymbol{x}) - h(\boldsymbol{x})\right)^2}_{\text{bias}(\boldsymbol{x})}$

$\mathbb{E}_{\boldsymbol{x}}\!\left[\mathbb{E}_{\mathcal{D}}\!\left[\left(f_{\mathcal{D}}(\boldsymbol{x}) - h(\boldsymbol{x})\right)^2\right]\right] = \mathbb{E}_{\boldsymbol{x}}\!\left[\text{var}(\boldsymbol{x}) + \text{bias}(\boldsymbol{x})\right] = \text{var} + \text{bias}$

52
Bias-variance trade-off
$\text{var} = \mathbb{E}_{\boldsymbol{x}}\!\left[\mathbb{E}_{\mathcal{D}}\!\left[\left(f_{\mathcal{D}}(\boldsymbol{x}) - \bar{f}(\boldsymbol{x})\right)^2\right]\right]$

$\text{bias} = \mathbb{E}_{\boldsymbol{x}}\!\left[\left(\bar{f}(\boldsymbol{x}) - h(\boldsymbol{x})\right)^2\right]$


More complex ℋ ⇒ lower bias but higher variance

53
[Y.S. Abou Mostafa, et. al]
Example: sin target
 Only two training examples: $N = 2$

 Two models used for learning:
 $\mathcal{H}_0$: $f(x) = b$
 $\mathcal{H}_1$: $f(x) = ax + b$

 Which is better ℋ0 or ℋ1 ?

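A small simulation sketch of this experiment (assumptions not stated on the slide: the target is sin(πx) on [-1, 1], as in the Learning from Data example, and many datasets of N = 2 points are drawn to estimate bias and variance):

import numpy as np

rng = np.random.default_rng(0)
K, N = 10000, 2                              # K training sets, two points each
xs = rng.uniform(-1.0, 1.0, size=(K, N))
ys = np.sin(np.pi * xs)                      # noiseless sin target (assumed)

# H0: constant f(x) = b; the least-squares fit is the mean of the two targets.
b0 = ys.mean(axis=1)
# H1: line f(x) = a x + b through the two points.
a1 = (ys[:, 1] - ys[:, 0]) / (xs[:, 1] - xs[:, 0])
b1 = ys[:, 0] - a1 * xs[:, 0]

xt = np.linspace(-1.0, 1.0, 200)             # test grid
h = np.sin(np.pi * xt)
pred0 = b0[:, None] * np.ones_like(xt)       # (K, 200) predictions of H0
pred1 = a1[:, None] * xt[None, :] + b1[:, None]
for name, pred in [("H0", pred0), ("H1", pred1)]:
    fbar = pred.mean(axis=0)                 # average hypothesis over the K datasets
    bias = np.mean((fbar - h) ** 2)
    var = np.mean(pred.var(axis=0))
    print(name, "bias =", round(bias, 2), "var =", round(var, 2),
          "total =", round(bias + var, 2))
# With only two points, H0 typically comes out ahead in this sketch:
# its higher bias is more than offset by its much lower variance.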
54
Learning from a training set
[Figure: fitting ℋ0 and ℋ1 to one training set of two points]

55
[Y.S. Abou Mostafa, et. al]
Variance ℋ0

[Figure: ℋ0 fits over many training sets and their average $\bar{f}(x)$]

56
[Y.S. Abou Mostafa, et. al]
Variance ℋ1

[Figure: ℋ1 fits over many training sets and their average $\bar{f}(x)$]

57 [Y.S. Abou Mostafa, et. al]


Which is better?

[Figure: the average hypothesis $\bar{f}(x)$ and the spread of fits for ℋ0 and ℋ1]

58
[Y.S. Abou Mostafa, et. al]
Lesson

Match the model complexity
to the data resources,
not to the complexity of the target function.

59
Expected training and true error curves
 Errors vary with the number of training samples

[Figure: expected training and true error ($E_{train}$, $E_{true}$) versus the number of training examples, for a simpler and a more complex model]

expected true error: $\mathbb{E}_{\mathcal{D}}\!\left[E_{true}\!\left[f_{\mathcal{D}}(\boldsymbol{x})\right]\right]$
expected training error: $\mathbb{E}_{\mathcal{D}}\!\left[E_{train}\!\left[f_{\mathcal{D}}(\boldsymbol{x})\right]\right]$
60
[Y.S. Abou Mostafa, et. al]
Regularization

61
[Y.S. Abou Mostafa, et. al]
Regularization: bias and variance

[Figure: regularized ℋ1 fits over many training sets and their average $\bar{f}(x)$]

62 [Y.S. Abou Mostafa, et. al]


Winner of ℋ0 , ℋ1 , and ℋ1 with regularization

[Figure: the average hypothesis and spread of fits for ℋ0, ℋ1, and ℋ1 with regularization]

63

[Y.S. Abou Mostafa, et. al]


Regularization and bias/variance

[Figure: fits from L = 100 data sets (n = 25 points and m = 25 Gaussian basis functions each) for large, intermediate, and small values of λ]
64
[Bishop]
Learning curves of bias, variance, and noise

[Bishop]

65
Bias-variance decomposition: summary
 The noise term is unavoidable.
 The terms we are interested in are bias and variance.
 The approximation-generalization trade-off is seen in the
bias-variance decomposition.

66
Resources
 C. Bishop, “Pattern Recognition and Machine Learning”, Chapters 1.1, 1.3, 3.1, 3.2.
 Yaser S. Abu-Mostafa, Malik Magdon-Ismail, and Hsuan-Tien Lin, “Learning from Data”, Chapters 2.3, 3.2, 3.4.

67
