Regression and Generalization
M. Soleymani
Fall 2016
Topics
Beyond linear regression models
Evaluation & model selection
Regularization
Probabilistic perspective for the regression problem
Recall: Linear regression (squared loss)
Linear regression functions
$f: \mathbb{R} \to \mathbb{R}$: $\; f(x; \mathbf{w}) = w_0 + w_1 x$
$f: \mathbb{R}^d \to \mathbb{R}$: $\; f(\mathbf{x}; \mathbf{w}) = w_0 + w_1 x_1 + \ldots + w_d x_d$
$\mathbf{w} = [w_0, w_1, \ldots, w_d]^T$ are the parameters we need to set.
We obtain $\mathbf{w} = \left( \mathbf{X}^T \mathbf{X} \right)^{-1} \mathbf{X}^T \mathbf{y}$.
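As a rough illustration (not part of the original slides), the closed-form solution above can be computed directly with NumPy; the small dataset below is a made-up placeholder.

```python
import numpy as np

# Toy data (made up for illustration): n = 5 samples with d = 2 features.
X_raw = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.5], [4.0, 3.0], [5.0, 2.5]])
y = np.array([3.1, 3.9, 6.2, 9.8, 11.0])

# Prepend a column of ones so that w[0] plays the role of the intercept w_0.
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])

# Normal equations: w = (X^T X)^{-1} X^T y.
# Solving the linear system is preferred over forming the inverse explicitly.
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)      # [w_0, w_1, w_2]
print(X @ w)  # fitted values f(x^(i); w)
```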
Beyond linear regression
How can we extend linear regression to non-linear functions?
Transform the data using basis functions.
Learn a linear regression model on the new feature vectors (obtained from the basis functions).
Beyond linear regression
$m$-th order polynomial regression (univariate, $f: \mathbb{R} \to \mathbb{R}$):
$$f(x; \mathbf{w}) = w_0 + w_1 x + \ldots + w_{m-1} x^{m-1} + w_m x^m$$
Solution: $\mathbf{w} = \left( \mathbf{X}'^T \mathbf{X}' \right)^{-1} \mathbf{X}'^T \mathbf{y}$
$$\mathbf{y} = \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(n)} \end{bmatrix}, \qquad \mathbf{X}' = \begin{bmatrix} 1 & x^{(1)} & (x^{(1)})^2 & \cdots & (x^{(1)})^m \\ 1 & x^{(2)} & (x^{(2)})^2 & \cdots & (x^{(2)})^m \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x^{(n)} & (x^{(n)})^2 & \cdots & (x^{(n)})^m \end{bmatrix}, \qquad \mathbf{w} = \begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_m \end{bmatrix}$$
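A minimal sketch (assumed toy data, not the slide's dataset) of the $m$-th order polynomial fit using the matrix $\mathbf{X}'$ above:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 10, 3                                                 # sample size, polynomial order
x = rng.uniform(0.0, 1.0, size=n)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(n)     # noisy toy targets

# Build X' with rows [1, x, x^2, ..., x^m].
X_prime = np.vander(x, N=m + 1, increasing=True)

# w = (X'^T X')^{-1} X'^T y via the normal equations.
w = np.linalg.solve(X_prime.T @ X_prime, X_prime.T @ y)

def f(x_new, w):
    """Evaluate the fitted polynomial at new inputs."""
    return np.vander(np.atleast_1d(x_new), N=len(w), increasing=True) @ w

print(w)
print(f(0.25, w))
```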
Polynomial regression: example
[Figure: polynomial fits of degree $m = 1, 3, 5, 7$ to the same dataset]
Generalized linear
A linear combination of fixed non-linear functions of the input vector:
$$f(\mathbf{x}; \mathbf{w}) = w_0 + \sum_{j=1}^{m} w_j \phi_j(\mathbf{x}) = \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}), \qquad \boldsymbol{\phi}(\mathbf{x}) = [1, \phi_1(\mathbf{x}), \ldots, \phi_m(\mathbf{x})]^T$$
Basis functions: examples
Linear: $\phi_j(\mathbf{x}) = x_j$
Polynomial (univariate): $\phi_j(x) = x^j$
Basis functions: examples
Gaussian: $\phi_j(\mathbf{x}) = \exp\left( -\dfrac{\|\mathbf{x} - \mathbf{c}_j\|^2}{2\sigma_j^2} \right)$
Sigmoid: $\phi_j(\mathbf{x}) = \sigma\left( \dfrac{\|\mathbf{x} - \mathbf{c}_j\|}{\sigma_j} \right)$, where $\sigma(a) = \dfrac{1}{1 + \exp(-a)}$
Radial Basis Functions: prototypes
Predictions based on similarity to “prototypes”:
$$\phi_j(\mathbf{x}) = \exp\left( -\frac{1}{2\sigma_j^2} \|\mathbf{x} - \mathbf{c}_j\|^2 \right)$$
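A small sketch (my own, not from the slides) of these Gaussian basis functions; each $\phi_j$ measures the similarity of $\mathbf{x}$ to a prototype $\mathbf{c}_j$ (a single shared width $\sigma$ is used here instead of per-basis $\sigma_j$):

```python
import numpy as np

def rbf_features(X, centers, sigma):
    """phi_j(x) = exp(-||x - c_j||^2 / (2 sigma^2)) for each prototype c_j.

    X: (n, d) inputs, centers: (m, d) prototypes, sigma: scalar width.
    Returns an (n, m) matrix of basis-function values.
    """
    sq_dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)  # (n, m)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

X = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.5]])    # toy inputs
centers = np.array([[0.0, 0.0], [2.0, 1.0]])          # toy prototypes
print(rbf_features(X, centers, sigma=1.0))
```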
Generalized linear: optimization
$$J(\mathbf{w}) = \sum_{i=1}^{n} \left( y^{(i)} - f(\mathbf{x}^{(i)}; \mathbf{w}) \right)^2 = \sum_{i=1}^{n} \left( y^{(i)} - \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}^{(i)}) \right)^2$$
$$\mathbf{y} = \begin{bmatrix} y^{(1)} \\ \vdots \\ y^{(n)} \end{bmatrix}, \qquad \boldsymbol{\Phi} = \begin{bmatrix} 1 & \phi_1(\mathbf{x}^{(1)}) & \cdots & \phi_m(\mathbf{x}^{(1)}) \\ 1 & \phi_1(\mathbf{x}^{(2)}) & \cdots & \phi_m(\mathbf{x}^{(2)}) \\ \vdots & \vdots & \ddots & \vdots \\ 1 & \phi_1(\mathbf{x}^{(n)}) & \cdots & \phi_m(\mathbf{x}^{(n)}) \end{bmatrix}, \qquad \mathbf{w} = \begin{bmatrix} w_0 \\ w_1 \\ \vdots \\ w_m \end{bmatrix}$$
$$\mathbf{w} = \left( \boldsymbol{\Phi}^T \boldsymbol{\Phi} \right)^{-1} \boldsymbol{\Phi}^T \mathbf{y}$$
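Putting the pieces together, a hedged sketch of fitting $\mathbf{w} = (\boldsymbol{\Phi}^T \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^T \mathbf{y}$ with Gaussian basis functions on assumed 1-D toy data:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 20)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(x.size)   # toy targets

centers = np.linspace(0.0, 1.0, 5)   # prototypes c_j (assumed)
sigma = 0.2

# Design matrix Phi with a leading column of ones (for w_0).
Phi = np.hstack([np.ones((x.size, 1)),
                 np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * sigma ** 2))])

# Least squares; lstsq is numerically safer than forming the inverse explicitly.
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(np.mean((y - Phi @ w) ** 2))   # training MSE
```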
Model complexity and overfitting
With limited training data, models may achieve zero
training error but a large test error.
Training (empirical) loss: $\dfrac{1}{n} \sum_{i=1}^{n} \left( y^{(i)} - f(\mathbf{x}^{(i)}; \boldsymbol{\theta}) \right)^2 \approx 0$
Expected (test) loss: $\mathbb{E}_{\mathbf{x}, y}\left[ \left( y - f(\mathbf{x}; \boldsymbol{\theta}) \right)^2 \right] \gg 0$
Polynomial regression
[Figure (Bishop): polynomial fits with $m = 0, 1, 3, 9$ to the same data]
Polynomial regression: training and test error
$$RMSE = \sqrt{\frac{\sum_{i=1}^{n} \left( y^{(i)} - f(\mathbf{x}^{(i)}; \boldsymbol{\theta}) \right)^2}{n}}$$
[Figure (Bishop): training and test RMSE as functions of the polynomial order $m$]
Over-fitting causes
Model complexity
E.g., a model with a large number of parameters (degrees of freedom)
Model complexity
Example:
Polynomials with larger $m$ become increasingly tuned to the random noise on the target values.
[Figure (Bishop): polynomial fits with $m = 0, 1, 3, 9$]
Number of training data & overfitting
The over-fitting problem becomes less severe as the size of the training data increases.
[Figure (Bishop): $m = 9$ polynomial fits with $n = 15$ and $n = 100$ training points]
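To make this concrete, a toy simulation (my own, assuming a noisy sinusoidal target) fits an $m = 9$ polynomial with $n = 15$ and $n = 100$ samples and compares the test RMSE:

```python
import numpy as np

rng = np.random.default_rng(2)

def make_data(n):
    """Sample n noisy points from an assumed sinusoidal target."""
    x = rng.uniform(0.0, 1.0, n)
    y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(n)
    return x, y

x_test, y_test = make_data(1000)
for n in (15, 100):
    x_tr, y_tr = make_data(n)
    coeffs = np.polyfit(x_tr, y_tr, deg=9)          # m = 9 polynomial fit
    rmse = np.sqrt(np.mean((np.polyval(coeffs, x_test) - y_test) ** 2))
    print(f"n = {n:4d}  test RMSE = {rmse:.3f}")
```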
How to evaluate the learner’s performance?
Generalization error: true (or expected) error that we
would like to optimize
Evaluation and model selection
Evaluation:
We need to measure how well the learned function can predict the target for unseen examples
Model selection:
Most of the time we need to select among a set of models
Example: polynomials with different degree 𝑚
and thus we need to evaluate these models first
Avoiding over-fitting
Determine a suitable value for model complexity
Simple hold-out method
Cross-validation
Bayesian approach
Simple hold-out: model selection
Steps:
Divide the available training data into a training set and a validation set $v\_set$
Use only the training set to train a set of models
Evaluate each learned model on the validation set
$$J_v(\mathbf{w}) = \frac{1}{|v\_set|} \sum_{i \in v\_set} \left( y^{(i)} - f(\mathbf{x}^{(i)}; \mathbf{w}) \right)^2$$
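A rough sketch of the hold-out procedure (toy polynomial models and data assumed): each candidate is trained on the training split and scored with $J_v$ on the validation split.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0.0, 1.0, 60)
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(x.size)   # toy data

# Hold out the last 20 points as the validation set v_set.
x_tr, y_tr = x[:40], y[:40]
x_v, y_v = x[40:], y[40:]

for m in (1, 3, 5, 7):
    coeffs = np.polyfit(x_tr, y_tr, deg=m)                  # train on the training split only
    J_v = np.mean((y_v - np.polyval(coeffs, x_v)) ** 2)     # validation error
    print(f"m = {m}  J_v = {J_v:.3f}")
```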
Simple hold-out:
training, validation, and test sets
Simple hold-out chooses the model that minimizes the error on the validation set.
[Figure: the data split into training, validation, and test sets]
Cross-Validation (CV): Evaluation
𝑘-fold cross-validation steps:
Shuffle the dataset and randomly partition training data into 𝑘 groups of
approximately equal size
for 𝑖 = 1 to 𝑘
Choose the 𝑖-th group as the held-out validation group
Train the model on all but the 𝑖-th group of data
Evaluate the model on the held-out group
The performance scores of the model from the $k$ runs are averaged.
The average error can be considered an estimate of the true performance.
[Figure: the $k$ runs of $k$-fold cross-validation, each holding out a different group]
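A minimal sketch of the steps above (toy polynomial models and data assumed; no libraries beyond NumPy):

```python
import numpy as np

def kfold_mse(x, y, degree, k=5, seed=0):
    """Average validation MSE of a degree-`degree` polynomial over k folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))                  # shuffle the dataset
    folds = np.array_split(idx, k)                 # k groups of roughly equal size
    scores = []
    for i in range(k):
        val = folds[i]                             # i-th group held out
        tr = np.concatenate([folds[j] for j in range(k) if j != i])
        coeffs = np.polyfit(x[tr], y[tr], deg=degree)
        scores.append(np.mean((y[val] - np.polyval(coeffs, x[val])) ** 2))
    return np.mean(scores)                         # average over the k runs

rng = np.random.default_rng(4)
x = rng.uniform(0.0, 1.0, 50)
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(x.size)   # toy data
for m in (1, 3, 5, 7):
    print(f"m = {m}  CV MSE = {kfold_mse(x, y, m):.3f}")
```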
Cross-Validation (CV): Model Selection
For each model, we first compute the average error found by CV.
Cross-validation: polynomial regression example
[Figure: 5-fold CV (averaged over 100 runs) for polynomial fits with $m = 1, 3, 5, 7$; the reported CV MSE values per panel are 1.45, 0.30, 45.44, and 31759]
Leave-One-Out Cross Validation (LOOCV)
When data is particularly scarce, we can use cross-validation with $k = N$.
Leave-one-out treats each training sample in turn as a test example and all other samples as the training set.
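LOOCV is simply $k$-fold CV with $k = N$; a short sketch (toy polynomial model and data assumed), leaving out one sample per run:

```python
import numpy as np

def loocv_mse(x, y, degree):
    """Leave-one-out CV: each sample serves as the held-out example exactly once."""
    errors = []
    for i in range(len(x)):
        mask = np.arange(len(x)) != i              # all samples except the i-th
        coeffs = np.polyfit(x[mask], y[mask], deg=degree)
        errors.append((y[i] - np.polyval(coeffs, x[i])) ** 2)
    return np.mean(errors)

rng = np.random.default_rng(5)
x = rng.uniform(0.0, 1.0, 20)
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(x.size)   # toy data
print(loocv_mse(x, y, degree=3))
```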
Regularization
Add a penalty term to the cost function to discourage the coefficients from reaching large values.
$$J(\mathbf{w}) = \sum_{i=1}^{n} \left( y^{(i)} - \mathbf{w}^T \boldsymbol{\phi}(\mathbf{x}^{(i)}) \right)^2 + \lambda \mathbf{w}^T \mathbf{w}$$
$$\mathbf{w} = \left( \boldsymbol{\Phi}^T \boldsymbol{\Phi} + \lambda \mathbf{I} \right)^{-1} \boldsymbol{\Phi}^T \mathbf{y}$$
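A hedged sketch of this regularized closed-form solution with a polynomial $\boldsymbol{\Phi}$ (toy data assumed); note that, as on the slide, the penalty here also shrinks $w_0$:

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(0.0, 1.0, 15)
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(x.size)   # toy data

m, lam = 9, 1e-3
Phi = np.vander(x, N=m + 1, increasing=True)       # rows [1, x, ..., x^m]

# w = (Phi^T Phi + lambda I)^{-1} Phi^T y
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(m + 1), Phi.T @ y)
print(np.round(w, 3))
```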
Polynomial order
Polynomials with larger $m$ become increasingly tuned to the random noise on the target values.
The magnitude of the coefficients typically gets larger as $m$ increases.
[Bishop]
Regularization parameter
[Table (Bishop): the coefficients $w_0, \ldots, w_9$ of the $m = 9$ fit for different values of the regularization parameter $\lambda$]
Regularization parameter
Generalization: $\lambda$ now controls the effective complexity of the model and hence determines the degree of over-fitting.
[Bishop]
Choosing the regularization parameter
Train a set of models with different values of $\lambda$.
Select the model with the best (lowest) $J_v(\mathbf{w})$ (or $J_{cv}(\mathbf{w})$).
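A sketch of this selection loop (assumed toy data and a hypothetical grid of $\lambda$ values): fit each candidate on the training split and keep the one with the smallest validation error $J_v$.

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(0.0, 1.0, 60)
y = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(x.size)   # toy data
x_tr, y_tr, x_v, y_v = x[:40], y[:40], x[40:], y[40:]

m = 9
Phi_tr = np.vander(x_tr, N=m + 1, increasing=True)
Phi_v = np.vander(x_v, N=m + 1, increasing=True)

best_lam, best_Jv = None, np.inf
for lam in (0.0, 1e-6, 1e-4, 1e-2, 1.0):
    w = np.linalg.solve(Phi_tr.T @ Phi_tr + lam * np.eye(m + 1), Phi_tr.T @ y_tr)
    J_v = np.mean((y_v - Phi_v @ w) ** 2)
    print(f"lambda = {lam:g}  J_v = {J_v:.3f}")
    if J_v < best_Jv:
        best_lam, best_Jv = lam, J_v
print("selected lambda:", best_lam)
```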
The approximation-generalization trade-off
Complexity of Hypothesis Space: Example
[Figure: house price vs. size data fitted with $w_0 + w_1 x$, $w_0 + w_1 x + w_2 x^2$, and $w_0 + w_1 x + w_2 x^2 + w_3 x^3 + w_4 x^4$]
This example has been adapted from Prof. Andrew Ng's slides.
Complexity of Hypothesis Space: Example
$$J_v(\mathbf{w}) = \frac{1}{n_v} \sum_{i \in val\_set} \left( y^{(i)} - f(\mathbf{x}^{(i)}; \mathbf{w}) \right)^2$$
$$J_{train}(\mathbf{w}) = \frac{1}{n_{train}} \sum_{i \in train\_set} \left( y^{(i)} - f(\mathbf{x}^{(i)}; \mathbf{w}) \right)^2$$
[Figure: $J_{train}$ and $J_v$ (error) as functions of the degree of the polynomial $m$]
Complexity of Hypothesis Space
Less complex $\mathcal{H}$:
$J_{train}(\mathbf{w}) \approx J_v(\mathbf{w})$ and $J_{train}(\mathbf{w})$ is very high
More complex $\mathcal{H}$:
$J_{train}(\mathbf{w}) \ll J_v(\mathbf{w})$ and $J_{train}(\mathbf{w})$ is low
[Figure: $J_{train}(\mathbf{w})$ and $J_v(\mathbf{w})$ (error) as functions of the degree of the polynomial $m$]
Size of training set
Model: $f(x; \mathbf{w}) = w_0 + w_1 x + w_2 x^2$
$$J_v(\mathbf{w}) = \frac{1}{n_v} \sum_{i \in val\_set} \left( y^{(i)} - f(x^{(i)}; \mathbf{w}) \right)^2$$
$$J_{train}(\mathbf{w}) = \frac{1}{n_{train}} \sum_{i \in train\_set} \left( y^{(i)} - f(x^{(i)}; \mathbf{w}) \right)^2$$
[Figure: $J_{train}$ and $J_v$ as functions of the training set size $n$]
This slide has been adapted from Prof. Andrew Ng's slides.
Less complex ℋ
$f(x; \mathbf{w}) = w_0 + w_1 x$
[Figure: price vs. size fit with this simple model, and learning curves of $J_{train}$ and $J_v$ vs. training set size $n$; both converge to a high error]
If the model is very simple, getting more training data will not (by itself) help much.
This slide has been adapted from Prof. Andrew Ng's slides.
More complex $\mathcal{H}$
$f(x; \mathbf{w}) = w_0 + w_1 x + \cdots + w_{10} x^{10}$
[Figure: price vs. size fit with this complex model, and learning curves of $J_{train}$ and $J_v$ vs. training set size, with a gap between $J_v$ and $J_{train}$]
For more complex models, getting more training data usually helps.
This slide has been adapted from Prof. Andrew Ng's slides.
Regularization: Example
$$f(x; \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + w_3 x^3 + w_4 x^4$$
$$J(\mathbf{w}) = \frac{1}{n} \sum_{i=1}^{n} \left( y^{(i)} - f(x^{(i)}; \mathbf{w}) \right)^2 + \lambda \mathbf{w}^T \mathbf{w}$$
[Figure: price vs. size fits for large $\lambda$ (prefers simpler models; $w_1 = w_2 \approx 0$), intermediate $\lambda$, and small $\lambda$ ($\lambda = 0$; prefers more complex models)]
This example has been adapted from Prof. Andrew Ng's slides.
Model complexity: Bias-variance trade-off
Least squares can lead to severe over-fitting if complex models are trained using data sets of limited size.
Formal discussion on bias, variance, and noise
Noise
The learning diagram: deterministic target
[Figure (Abu-Mostafa et al.): the learning diagram with a deterministic target, showing $h: \mathcal{X} \to \mathcal{Y}$, $f: \mathcal{X} \to \mathcal{Y}$, the inputs $x^{(1)}, \ldots, x^{(N)}$, and the training examples $(x^{(1)}, y^{(1)}), \ldots, (x^{(N)}, y^{(N)})$]
The learning diagram including noisy target
[Figure (Abu-Mostafa et al.): the learning diagram including a noisy target; examples are drawn from $P(\mathbf{x}, y) = P(\mathbf{x})\,P(y|\mathbf{x})$, the distribution on features times the target distribution]
Best unrestricted regression function
What if we know the joint distribution $P(\mathbf{x}, y)$ and place no constraints on the regression function?
Cost function: mean squared error
$$h^* = \underset{h: \mathbb{R}^d \to \mathbb{R}}{\arg\min}\; \mathbb{E}_{\mathbf{x}, y}\left[ \left( y - h(\mathbf{x}) \right)^2 \right]$$
$$h^*(\mathbf{x}) = \mathbb{E}_{y|\mathbf{x}}[y]$$
Best unrestricted regression function: Proof
$$\mathbb{E}_{\mathbf{x}, y}\left[ \left( y - h(\mathbf{x}) \right)^2 \right] = \int \int \left( y - h(\mathbf{x}) \right)^2 p(\mathbf{x}, y)\, d\mathbf{x}\, dy$$
$$\frac{\delta\, \mathbb{E}_{\mathbf{x}, y}\left[ \left( y - h(\mathbf{x}) \right)^2 \right]}{\delta h(\mathbf{x})} = -2 \int \left( y - h(\mathbf{x}) \right) p(\mathbf{x}, y)\, dy = 0$$
$$\Rightarrow h(\mathbf{x}) = \frac{\int y\, p(\mathbf{x}, y)\, dy}{\int p(\mathbf{x}, y)\, dy} = \frac{\int y\, p(\mathbf{x}, y)\, dy}{p(\mathbf{x})} = \int y\, p(y|\mathbf{x})\, dy = \mathbb{E}_{y|\mathbf{x}}[y]$$
$$\Rightarrow h^*(\mathbf{x}) = \mathbb{E}_{y|\mathbf{x}}[y]$$
Error decomposition
$(\mathbf{x}, y) \sim P$; $h(\mathbf{x})$ minimizes the expected loss.
Expected loss:
$$E_{true}(f_\mathcal{D}) = \mathbb{E}_{\mathbf{x}, y}\left[ \left( f_\mathcal{D}(\mathbf{x}) - y \right)^2 \right] = \mathbb{E}_{\mathbf{x}, y}\left[ \left( f_\mathcal{D}(\mathbf{x}) - h(\mathbf{x}) + h(\mathbf{x}) - y \right)^2 \right]$$
$$= \mathbb{E}_{\mathbf{x}}\left[ \left( f_\mathcal{D}(\mathbf{x}) - h(\mathbf{x}) \right)^2 \right] + \mathbb{E}_{\mathbf{x}, y}\left[ \left( h(\mathbf{x}) - y \right)^2 \right] + 2\, \mathbb{E}_{\mathbf{x}, y}\left[ \left( f_\mathcal{D}(\mathbf{x}) - h(\mathbf{x}) \right) \left( h(\mathbf{x}) - y \right) \right]$$
The cross term vanishes: $\mathbb{E}_{\mathbf{x}, y}\left[ \left( f_\mathcal{D}(\mathbf{x}) - h(\mathbf{x}) \right) \left( h(\mathbf{x}) - y \right) \right] = \mathbb{E}_{\mathbf{x}}\left[ \left( f_\mathcal{D}(\mathbf{x}) - h(\mathbf{x}) \right) \mathbb{E}_{y|\mathbf{x}}\left[ h(\mathbf{x}) - y \right] \right] = 0$.
Error decomposition (cont.)
Since the cross term is zero,
$$E_{true}(f_\mathcal{D}) = \mathbb{E}_{\mathbf{x}}\left[ \left( f_\mathcal{D}(\mathbf{x}) - h(\mathbf{x}) \right)^2 \right] + \underbrace{\mathbb{E}_{\mathbf{x}, y}\left[ \left( h(\mathbf{x}) - y \right)^2 \right]}_{\text{noise}}$$
Expectation of true error
$$E_{true}(f_\mathcal{D}) = \mathbb{E}_{\mathbf{x}, y}\left[ \left( f_\mathcal{D}(\mathbf{x}) - y \right)^2 \right] = \mathbb{E}_{\mathbf{x}}\left[ \left( f_\mathcal{D}(\mathbf{x}) - h(\mathbf{x}) \right)^2 \right] + \text{noise}$$
Taking the expectation over training sets $\mathcal{D}$:
$$\mathbb{E}_\mathcal{D}\left[ \mathbb{E}_{\mathbf{x}}\left[ \left( f_\mathcal{D}(\mathbf{x}) - h(\mathbf{x}) \right)^2 \right] \right] = \mathbb{E}_{\mathbf{x}}\left[ \mathbb{E}_\mathcal{D}\left[ \left( f_\mathcal{D}(\mathbf{x}) - h(\mathbf{x}) \right)^2 \right] \right]$$
We now focus on $\mathbb{E}_\mathcal{D}\left[ \left( f_\mathcal{D}(\mathbf{x}) - h(\mathbf{x}) \right)^2 \right]$.
The average hypothesis
$$\bar{f}(\mathbf{x}) \equiv \mathbb{E}_\mathcal{D}\left[ f_\mathcal{D}(\mathbf{x}) \right]$$
In practice it can be approximated by training on $K$ different datasets:
$$\bar{f}(\mathbf{x}) \approx \frac{1}{K} \sum_{k=1}^{K} f_{\mathcal{D}^{(k)}}(\mathbf{x})$$
Using the average hypothesis
$$\mathbb{E}_\mathcal{D}\left[ \left( f_\mathcal{D}(\mathbf{x}) - h(\mathbf{x}) \right)^2 \right] = \mathbb{E}_\mathcal{D}\left[ \left( f_\mathcal{D}(\mathbf{x}) - \bar{f}(\mathbf{x}) + \bar{f}(\mathbf{x}) - h(\mathbf{x}) \right)^2 \right]$$
$$= \mathbb{E}_\mathcal{D}\left[ \left( f_\mathcal{D}(\mathbf{x}) - \bar{f}(\mathbf{x}) \right)^2 \right] + \left( \bar{f}(\mathbf{x}) - h(\mathbf{x}) \right)^2$$
Bias and variance
$$\mathbb{E}_\mathcal{D}\left[ \left( f_\mathcal{D}(\mathbf{x}) - h(\mathbf{x}) \right)^2 \right] = \underbrace{\mathbb{E}_\mathcal{D}\left[ \left( f_\mathcal{D}(\mathbf{x}) - \bar{f}(\mathbf{x}) \right)^2 \right]}_{\text{var}(\mathbf{x})} + \underbrace{\left( \bar{f}(\mathbf{x}) - h(\mathbf{x}) \right)^2}_{\text{bias}(\mathbf{x})}$$
$$\mathbb{E}_{\mathbf{x}}\left[ \mathbb{E}_\mathcal{D}\left[ \left( f_\mathcal{D}(\mathbf{x}) - h(\mathbf{x}) \right)^2 \right] \right] = \mathbb{E}_{\mathbf{x}}\left[ \text{var}(\mathbf{x}) + \text{bias}(\mathbf{x}) \right] = \text{var} + \text{bias}$$
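A simulation sketch (my own, with an assumed noise-free sinusoidal target $h$) that estimates $\bar{f}$, var, and bias by repeatedly drawing datasets $\mathcal{D}$ and refitting; as on the slides, "bias" here denotes the squared bias:

```python
import numpy as np

rng = np.random.default_rng(8)

def h(x):
    return np.sin(2 * np.pi * x)               # assumed target h(x)

x_grid = np.linspace(0.0, 1.0, 200)            # grid for the E_x average
K, n, m = 500, 10, 3                           # number of datasets, samples per dataset, degree

fits = np.empty((K, x_grid.size))
for k in range(K):
    x = rng.uniform(0.0, 1.0, n)               # a fresh dataset D^(k)
    y = h(x) + 0.3 * rng.standard_normal(n)
    coeffs = np.polyfit(x, y, deg=m)
    fits[k] = np.polyval(coeffs, x_grid)       # f_D(x) on the grid

f_bar = fits.mean(axis=0)                      # average hypothesis f_bar(x)
var = np.mean((fits - f_bar) ** 2)             # E_x[ E_D[(f_D - f_bar)^2] ]
bias = np.mean((f_bar - h(x_grid)) ** 2)       # E_x[ (f_bar - h)^2 ]
print(f"var = {var:.3f}  bias = {bias:.3f}")
```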
Bias-variance trade-off
$$\text{var} = \mathbb{E}_{\mathbf{x}}\left[ \mathbb{E}_\mathcal{D}\left[ \left( f_\mathcal{D}(\mathbf{x}) - \bar{f}(\mathbf{x}) \right)^2 \right] \right], \qquad \text{bias} = \mathbb{E}_{\mathbf{x}}\left[ \left( \bar{f}(\mathbf{x}) - h(\mathbf{x}) \right)^2 \right]$$
[Figure (Abu-Mostafa et al.): illustration of bias and variance around the target $h$]
Example: sin target
Only two training examples: $N = 2$
Which is better, $\mathcal{H}_0$ (constant functions) or $\mathcal{H}_1$ (lines)?
Learning from a training set
[Figure (Abu-Mostafa et al.): example fits from $\mathcal{H}_0$ and $\mathcal{H}_1$ to a training set of two points]
Variance ℋ0
[Figure (Abu-Mostafa et al.): fits from many datasets and their average $\bar{f}(x)$, illustrating the variance of $\mathcal{H}_0$]
Variance ℋ1
[Figure (Abu-Mostafa et al.): fits from many datasets and their average $\bar{f}(x)$, illustrating the variance of $\mathcal{H}_1$]
Lesson
Expected training and true error curves
Errors vary with the number of training samples
[Figure (Abu-Mostafa et al.): expected $E_{train}$ and $E_{true}$ as functions of the number of training samples]
Regularization: bias and variance
[Figures (Bishop): individual fits $f(x)$ and their average for $\mathcal{H}_1$ when $\lambda$ is large and when $\lambda$ is small]
Learning curves of bias, variance, and noise
[Bishop]
Bias-variance decomposition: summary
The noise term is unavoidable.
The terms we are interested in are bias and variance.
The approximation-generalization trade-off is seen in the
bias-variance decomposition.
Resources
C. Bishop, "Pattern Recognition and Machine Learning", Chapters 1.1, 1.3, 3.1, 3.2.
Yaser S. Abu-Mostafa, Malik Magdon-Ismail, and Hsuan-Tien Lin, "Learning from Data", Chapters 2.3, 3.2, 3.4.