EE2211 Introduction To Machine Learning
Lecture 7
Thomas Yeo
[email protected]
Fundamental ML Algorithms:
Overfitting, Bias-Variance Tradeoff
Regression Review
• Goal: Given feature(s) x, we want to predict target y
– x can be 1-D or more than 1-D
– y is 1-D
• Two types of input data
– Training set {(x_i, y_i)}, i = 1, …, N
– Test set {(x_j, y_j)}, j = 1, …, M, held out from learning
• Learning/Training
– Training set used to estimate regression coefficients w
• Prediction/Testing/Evaluation
– Prediction performed on test set to evaluate performance
Regression Review: Linear Case
• x is 1-D & y is 1-D
• Linear relationship between x & y: y = w0 + w1·x
• Illustration (4 training samples): [figure: straight line fitted through 4 data points]
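A minimal sketch of this fit (not from the slides; the four sample values below are made up for illustration), using ordinary least squares on a design matrix with a bias column:

```python
import numpy as np

# Four made-up training samples for illustration
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.9, 4.1, 6.2, 7.8])

# Design matrix X: each row is [1, x_i] (bias column + feature)
X = np.column_stack([np.ones_like(x), x])

# Least-squares estimate of w = [w0, w1]
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w)  # intercept w0 and slope w1
```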
Regression Review: Polynomial
• x is 1-D (or more than 1-D) & y is 1-D
• Polynomial relationship between x & y, e.g., y = w0 + w1·x + w2·x²
• Quadratic illustration (4 training samples, x is 1-D): [figure: parabola fitted through 4 data points]
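The same least-squares machinery handles the quadratic case once the design matrix carries polynomial terms; a small sketch with made-up values:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])    # made-up samples
y = np.array([2.2, 5.1, 10.3, 16.8])

# Quadratic design matrix P: each row is [1, x_i, x_i^2]
P = np.column_stack([np.ones_like(x), x, x**2])
w, *_ = np.linalg.lstsq(P, y, rcond=None)
print(w)  # [w0, w1, w2]
```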
Note on Training & Test Sets
• Linear is a special case of polynomial => use “P” instead of “X” from now on
• Training/Learning (primal) on training set: w = (PᵀP)⁻¹Pᵀy
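A hedged sketch of the primal solution and test-set evaluation (the helper names are my own; P_train, y_train, etc. are assumed built as in the earlier sketches):

```python
import numpy as np

def train_primal(P, y):
    # Primal least-squares solution: w = (P^T P)^{-1} P^T y
    # (assumes P^T P is invertible, i.e., enough samples for the unknowns)
    return np.linalg.solve(P.T @ P, P.T @ y)

def mse(P, y, w):
    # Mean squared prediction error on a given set
    return np.mean((P @ w - y) ** 2)

# Usage: w = train_primal(P_train, y_train); mse(P_test, y_test, w)
```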
Questions?
Overfitting Example
[Figure: order-9 polynomial fit passes through all training points exactly, but shows a big prediction error on new test points]
Underfitting Example
[Figure: order-1 polynomial fit to data from an order-2 polynomial; the straight line misses the curvature, fitting both training and test points poorly]
“Just Nice”
[Figure: order-2 polynomial fit follows the underlying trend, fitting both training and test points well]
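A small simulation (my own sketch; the quadratic ground truth and noise level are assumed) that reproduces all three regimes above: underfitting at order 1, “just nice” at order 2, overfitting at order 9:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    # Data from an order-2 polynomial plus noise (assumed setup)
    x = rng.uniform(-1, 1, n)
    return x, 1 + 2 * x + 3 * x**2 + rng.normal(0, 0.3, n)

def poly_design(x, order):
    return np.column_stack([x**k for k in range(order + 1)])

x_tr, y_tr = sample(10)    # 10 training points, as in the slides
x_te, y_te = sample(1000)  # large test set

for order in (1, 2, 9):
    w, *_ = np.linalg.lstsq(poly_design(x_tr, order), y_tr, rcond=None)
    err = lambda x, y: np.mean((poly_design(x, order) @ w - y) ** 2)
    # order 1: both errors high; order 2: both low; order 9: train ~0, test large
    print(order, err(x_tr, y_tr), err(x_te, y_te))
```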
Overfitting & Underfitting
• Overfitting occurs when a model predicts the training data well, but predicts new data (e.g., from the test set) poorly
• Reason 1
– Model is too complex for the data
– Previous example: Fit order 9 polynomial to 10 data points
• Reason 2
– Too many features but number of training examples too small
– Even a linear model can overfit, e.g., a linear model with 9 input features (i.e., with the bias term, x is 10-D) and 10 data points in the training set => data might not be enough to estimate the 10 unknowns well
• Solutions
– Use simpler models (e.g., lower order polynomial)
– Use regularization (see next part of lecture)
Overfitting & Underfitting
• Underfitting is the inability of a trained model to predict the targets in the training set
• Reason 1
– Model is too simple for the data
– Previous example: Fit order 1 polynomial to 10 data points that came from an order 2 polynomial
– Solution: Try a more complex model
• Reason 2
– Features are not informative enough
– Solution: Try to develop more informative features
Overfitting / Underfitting Schematic
[Figure: training and test error versus model complexity; underfitting regime on the left (both errors high), overfitting regime on the right (training error keeps falling while test error rises)]
Questions?
Regularization
• Regularization is an umbrella term for methods that force the learning algorithm to build less complex models.
• Motivation 1: Solve an ill-posed problem
– For example, estimate a 10th order polynomial with just 5 data points
• Motivation 2: Reduce overfitting
• For example, in the previous lecture, we added λI: w = (PᵀP + λI)⁻¹Pᵀy
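A minimal sketch of this ridge (L2-regularized) solution, assuming the same P and y as in the earlier sketches:

```python
import numpy as np

def ridge_fit(P, y, lam):
    # Regularized least squares: w = (P^T P + lam * I)^{-1} P^T y
    # lam > 0 makes the matrix invertible even when P^T P is singular
    d = P.shape[1]
    return np.linalg.solve(P.T @ P + lam * np.eye(d), P.T @ y)
```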
Regularization
• Consider the minimization from the previous slide:
L2-Regularization: w = argmin ‖Pw − y‖² + λ‖w‖²
• Encourage w to be small (also called shrinkage or weight-decay) => constrain model complexity
• More generally, most machine learning algorithms can be formulated as the following optimization problem:
minimize over parameters: Loss(parameters; training data) + λ · Regularizer(parameters)
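To connect the closed-form solution with this objective view, a quick sketch (assumptions as before) evaluating the L2-regularized objective; the ridge solution from the earlier sketch should score no higher than any perturbed weights:

```python
import numpy as np

def l2_objective(P, y, w, lam):
    # ||Pw - y||^2 + lam * ||w||^2
    return np.sum((P @ w - y) ** 2) + lam * np.sum(w ** 2)

# Usage sketch: w = ridge_fit(P, y, lam)
# l2_objective(P, y, w, lam) <= l2_objective(P, y, w + delta, lam) for any delta
```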
Regularization Example

                 Training Set Fit   Test Set Fit
Order 9          Good               Bad
Order 9, λ = 1   Good               Good
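A sketch reproducing this table qualitatively (simulated data, since the slide's dataset is not available): order 9 unregularized fits the training set near perfectly but tests badly, while λ = 1 recovers a good test fit:

```python
import numpy as np

rng = np.random.default_rng(1)

def make(n):
    x = rng.uniform(-1, 1, n)
    y = 1 + 2 * x + 3 * x**2 + rng.normal(0, 0.3, n)  # assumed ground truth
    P = np.column_stack([x**k for k in range(10)])     # order-9 features
    return P, y

P_tr, y_tr = make(10)
P_te, y_te = make(1000)

for lam in (0.0, 1.0):
    w = np.linalg.solve(P_tr.T @ P_tr + lam * np.eye(10), P_tr.T @ y_tr)
    print(lam, np.mean((P_tr @ w - y_tr) ** 2), np.mean((P_te @ w - y_te) ** 2))
```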
Questions?
Bias versus Variance
• Suppose we are trying to predict the red target below:
[Figure: dartboard diagrams, blue predictions scattered around a red target, one panel per bias/variance combination]
– Low Bias, Low Variance: blue predictions on average close to red target; low variability among predictions
– Low Bias, High Variance: blue predictions on average close to red target; large variability among predictions
– High Bias, Low Variance: blue predictions on average not close to red target; low variability among predictions
– High Bias, High Variance: blue predictions on average not close to red target; large variability among predictions
Bias + Variance Tradeoff
• Test error = Bias² + Variance + Irreducible Noise
Bias + Variance Example
• Simulate data from an order 2 polynomial (+ noise)
• Randomly sample 10 training samples each time
• Fit with order 2 polynomial: low variance, low bias => order 2 achieves lower test error
• Fit with order 4 polynomial: high variance, low bias
[Figure: left panel, 4th order polynomial fits from repeated training sets (widely spread curves); right panel, 2nd order polynomial fits (tightly clustered curves)]
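A sketch (same assumed simulation setup as earlier) that estimates bias² and variance directly by refitting over many random training sets:

```python
import numpy as np

rng = np.random.default_rng(2)
x_grid = np.linspace(-1, 1, 50)
truth = 1 + 2 * x_grid + 3 * x_grid**2   # assumed order-2 ground truth

def poly_design(x, order):
    return np.column_stack([x**k for k in range(order + 1)])

def fit_once(order):
    # One random training set of 10 samples, as in the slides
    x = rng.uniform(-1, 1, 10)
    y = 1 + 2 * x + 3 * x**2 + rng.normal(0, 0.3, 10)
    w, *_ = np.linalg.lstsq(poly_design(x, order), y, rcond=None)
    return poly_design(x_grid, order) @ w   # predictions on a fixed grid

for order in (2, 4):
    preds = np.array([fit_once(order) for _ in range(500)])
    bias2 = np.mean((preds.mean(axis=0) - truth) ** 2)
    variance = np.mean(preds.var(axis=0))
    print(order, bias2, variance)  # both low bias; order 4 has higher variance
```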
Questions?
Bias-Variance Decomposition Theorem
• Setup: targets generated as y = f(x) + ε, with E[ε] = 0 and Var(ε) = σ²; estimator f̂ trained on a randomly drawn training set
• Theorem: the expected test error at x decomposes as
E[(y − f̂(x))²] = Bias[f̂(x)]² + Var[f̂(x)] + σ²
where Bias[f̂(x)] = E[f̂(x)] − f(x)
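The slide equations did not survive extraction; a standard derivation consistent with the summary below is sketched here, assuming ε is zero-mean and independent of f̂:

```latex
\begin{aligned}
\mathbb{E}\big[(y - \hat f(x))^2\big]
 &= \mathbb{E}\big[(f(x) + \varepsilon - \hat f(x))^2\big] \\
 % cross term vanishes: eps is zero-mean and independent of f-hat
 &= \mathbb{E}\big[(f(x) - \hat f(x))^2\big] + \sigma^2 \\
 % add and subtract E[f-hat(x)]; the cross term again has zero mean
 &= \big(f(x) - \mathbb{E}[\hat f(x)]\big)^2
    + \mathbb{E}\big[(\hat f(x) - \mathbb{E}[\hat f(x)])^2\big] + \sigma^2 \\
 &= \underbrace{\mathrm{Bias}[\hat f(x)]^2}_{\text{bias squared}}
    + \underbrace{\mathrm{Var}[\hat f(x)]}_{\text{variance}}
    + \underbrace{\sigma^2}_{\text{irreducible noise}}
\end{aligned}
```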
Summary
• Overfitting, underfitting & model complexity
– Overfitting: low error in training set, high error in test set
– Underfitting: high error in both training & test sets
– Overly complex models can overfit; overly simple models can underfit
• Regularization (e.g., L2 regularization)
– Solve “ill-posed” problems (e.g., more unknowns than data points)
– Reduce overfitting
• Bias-Variance Decomposition Theorem
– Test error = Bias² + Variance + Irreducible Noise
– Can be interpreted as trading off bias & variance:
• Overly complex models can have high variance, low bias
• Overly simple models can have low variance, high bias
• Interpretation of the Bias-Variance tradeoff is not always true (see tutorial)
© Copyright EE, NUS. All Rights Reserved.