
Overfitting, Generalization, Cross-Validation

The document discusses the concepts of overfitting, generalization, and model evaluation in machine learning, emphasizing the importance of selecting the right polynomial order for data modeling. It explains the process of dividing data into training and test sets, and introduces k-fold cross-validation as a method to reduce statistical variance in model evaluation. The advantages and disadvantages of different validation methods, including leave-one-out cross-validation, are also highlighted.

Uploaded by

AHSAN FAREED
Copyright
© All Rights Reserved

Overfitting, Generalization, Cross-Validation

Video content can be watched here:
https://www.youtube.com/channel/UCov-vAEaVfqEtlHkdiRL3Tg
Overfitting vs Generalization
• Which order of polynomial is best for the data?
• The model with the least error

[Figure: 0th-order polynomial regression vs. 1st-order polynomial regression]

The 1st-order fit is better because it has less error.
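The comparison above can be sketched in a few lines. This is only an illustration: the toy data, the noise level, and the helper name `training_mse` are assumptions, not taken from the slides.

```python
import numpy as np

# Hypothetical toy data (not from the slides): a noisy linear trend.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)
y = 2.0 * x + 0.5 + rng.normal(scale=0.1, size=x.size)

def training_mse(order):
    """Least-squares polynomial fit of the given order, scored on the same data."""
    coeffs = np.polyfit(x, y, deg=order)
    residuals = y - np.polyval(coeffs, x)
    return float(np.mean(residuals ** 2))

mse_order0 = training_mse(0)  # best constant, i.e. the mean of y
mse_order1 = training_mse(1)  # best straight line

print(f"0th order training MSE: {mse_order0:.4f}")
print(f"1st order training MSE: {mse_order1:.4f}")
```

On the data it was fitted to, a least-squares line can never do worse than the best constant, because the constant is a special case of the line.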
Overfitting vs Generalization
• Which order of polynomial is best for the data?
• What about this?

[Figure: 9th-order polynomial regression]

• This may be the BEST because the error is ZERO!!

Do you agree with this?
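One way to probe the question is to score the 9th-order fit on points it has not seen. The sketch below assumes a sine-shaped target with Gaussian noise and 10 training points; none of these choices come from the slides.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(1)

def true_function(x):
    # Assumed ground truth, chosen for illustration only.
    return np.sin(2.0 * np.pi * x)

# 10 given (training) points: a 9th-order polynomial can pass through all of them.
x_train = np.linspace(0.0, 1.0, 10)
y_train = true_function(x_train) + rng.normal(scale=0.2, size=x_train.size)

# Unknown data drawn from the same process.
x_new = rng.uniform(0.0, 1.0, size=200)
y_new = true_function(x_new) + rng.normal(scale=0.2, size=x_new.size)

model = Polynomial.fit(x_train, y_train, deg=9)

train_mse = float(np.mean((model(x_train) - y_train) ** 2))
new_mse = float(np.mean((model(x_new) - y_new) ** 2))

print(f"error on the given data:   {train_mse:.2e}")
print(f"error on the unknown data: {new_mse:.2e}")
```

The error on the given data is zero up to rounding, but the error on the unknown data is not, so zero training error alone does not make this the best model.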


Overfitting vs Generalization
• What is the purpose of Machine Learning?
• Learning the given data as exactly as possible
vs
• Predicting the unknown data as exactly as possible, based on the given data

Overfitting vs Generalization
• Then, which one looks best?
• As the order (M) increases, the complexity of the model increases
• As the complexity of the model increases, the model can learn the given data more exactly
• However, the prediction accuracy does not necessarily increase
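The effect of the order M can be sketched by sweeping M from 0 to 9 and recording the error on the given data and on unseen data. The data-generating function, noise level, and sample sizes below are assumptions made for illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(2)

def noisy_target(x):
    # Assumed data-generating process: a sine curve plus Gaussian noise.
    return np.sin(2.0 * np.pi * x) + rng.normal(scale=0.2, size=x.size)

x_train = np.linspace(0.0, 1.0, 10)
y_train = noisy_target(x_train)
x_new = rng.uniform(0.0, 1.0, size=200)
y_new = noisy_target(x_new)

train_errors, new_errors = [], []
for M in range(10):
    model = Polynomial.fit(x_train, y_train, deg=M)
    train_errors.append(float(np.mean((model(x_train) - y_train) ** 2)))
    new_errors.append(float(np.mean((model(x_new) - y_new) ** 2)))

for M, (e_train, e_new) in enumerate(zip(train_errors, new_errors)):
    print(f"M={M}: given data {e_train:.4f} | unseen data {e_new:.4f}")
```

The error on the given data can only go down as M grows, since each higher-order model contains the lower-order ones. The error on unseen data follows no such rule, which is the point of the last bullet.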

Model Evaluation
• Which model is best?
• You may try several approaches and need to choose one
• You may try several different parameters of a model and need to choose one
• Model evaluation methods
  • Based on training and test data sets
  • Cross-validation

Training Set and Test Set
• How to choose a good model
• Divide the given data into a TRAINING set and a TEST set
• The training set and the test set should NOT overlap!!
• They need to be as independent of each other as possible
• With the training set, build various models
• With the test set, evaluate each model
• Choose the model that shows the best performance on the test set

[Figure: Data divided into Training Set and Test Set]

Training Set and Test Set
• How to choose a good model
1. Randomly split the data into a training set and a test set
2. Build the models (Model 1, Model 2, …, Model n) on the training set
3. Test each model on the test set
4. Choose the best model
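The four steps can be sketched as follows. The candidate models here are polynomials of different orders, and the data, the 30% test fraction, and the use of mean squared error are all assumptions made for the example.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(3)

# Hypothetical data set (not from the slides).
x = rng.uniform(0.0, 1.0, size=40)
y = np.sin(2.0 * np.pi * x) + rng.normal(scale=0.2, size=x.size)

# 1. Randomly split: shuffle the indices so the two sets cannot overlap.
indices = rng.permutation(x.size)
n_test = round(0.3 * x.size)
test_idx, train_idx = indices[:n_test], indices[n_test:]

# 2. Model build: one polynomial per candidate order, training set only.
candidate_orders = [0, 1, 3, 9]
models = {M: Polynomial.fit(x[train_idx], y[train_idx], deg=M)
          for M in candidate_orders}

# 3. Test: score every model on the test set only.
test_mse = {M: float(np.mean((models[M](x[test_idx]) - y[test_idx]) ** 2))
            for M in candidate_orders}

# 4. Choose the best model (lowest test error).
best_order = min(test_mse, key=test_mse.get)
print(test_mse, "-> best order:", best_order)
```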
Training Set and Test Set
• Performance Graph

Training Set and Test Set

Size of test set
• 30~50% of given data

Advantage
• Simple & easy

Disadvantage
• Test set is not used for model building. Waste of data
• Data is randomly split
• Evaluation can be significantly different depending on the data split
• => Any good idea?

Training Set and Test Set
• Data Waste & Random Split

[Figure: the same data split two ways. Split 1: Training Set / Test Set → Model → Bad performance. Split 2: Training Set / Test Set → Model → Good performance]
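The dependence on the split can be seen by repeating the same hold-out evaluation with different random splits. In this sketch the data, the 3rd-order model, and the helper name `holdout_mse` are assumptions chosen for illustration.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Hypothetical data set, fixed once so that only the split changes.
data_rng = np.random.default_rng(4)
x = data_rng.uniform(0.0, 1.0, size=30)
y = np.sin(2.0 * np.pi * x) + data_rng.normal(scale=0.2, size=x.size)

def holdout_mse(split_seed, order=3, test_fraction=0.3):
    """Train on one random split and return the test-set error."""
    split_rng = np.random.default_rng(split_seed)
    indices = split_rng.permutation(x.size)
    n_test = round(test_fraction * x.size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    model = Polynomial.fit(x[train_idx], y[train_idx], deg=order)
    return float(np.mean((model(x[test_idx]) - y[test_idx]) ** 2))

# Same data, same model type, five different random splits.
scores = [holdout_mse(seed) for seed in range(5)]
for seed, score in enumerate(scores):
    print(f"split {seed}: test MSE = {score:.4f}")
```

The spread between the smallest and largest score is caused by the split alone, which is the weakness of a single hold-out evaluation.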

Cross Validation
• Used in order to reduce statistical variance
• Usually, k-fold cross validation is used

• K-fold Cross Validation
• Split the given data into k folds
• Folds should not overlap with each other

[Figure: Data divided into Fold 1, Fold 2, Fold 3, …, Fold k]

• From the k folds, use k-1 folds as the training set and the remaining 1 fold as the test set
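Splitting into non-overlapping folds can be sketched by shuffling the sample indices once and cutting them into k pieces. The function name `k_fold_indices` and the sizes used below are hypothetical.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the sample indices and cut them into k non-overlapping folds."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    # array_split allows fold sizes to differ by one when k does not divide n_samples.
    return np.array_split(shuffled, k)

folds = k_fold_indices(n_samples=10, k=4)
for number, fold in enumerate(folds, start=1):
    print(f"Fold {number}: {fold.tolist()}")
```

Because every index appears in the shuffled array exactly once, no data point can land in two folds.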

Cross Validation
• Example: 4-fold cross validation

Training/Test Set   Training set    Test set
Set I               Fold 1, 2, 3    Fold 4
Set II              Fold 4, 1, 2    Fold 3
Set III             Fold 3, 4, 1    Fold 2
Set IV              Fold 2, 3, 4    Fold 1

Choose a model by the average performance over the 4 sets
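The rotation in this example can be generated programmatically. The sketch below uses fold labels instead of real data, and the generator name `rotations` is hypothetical. It holds out the last fold first to match the order Set I to Set IV, and lists the training folds in numeric order.

```python
fold_names = ["Fold 1", "Fold 2", "Fold 3", "Fold 4"]

def rotations(folds):
    """Yield (training folds, test fold), holding out the last fold first."""
    k = len(folds)
    for held_out in reversed(range(k)):
        training = [folds[j] for j in range(k) if j != held_out]
        yield training, folds[held_out]

sets = list(rotations(fold_names))
for label, (training, test) in zip(["I", "II", "III", "IV"], sets):
    print(f"Set {label}: train on {training}, test on {test}")
```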

Cross Validation
• Example: 4-fold cross validation

Training/Test Set I
Training Set: Fold 1, 2, 3
Test Set: Fold 4

Performance   Model 1   Model 2   …   Model n
Set I         3         4         …   5
Set II        …
Set III       …
Set IV        …
Cross Validation
• Example: 4-fold cross validation

Training/Test Set II
Training Set: Fold 4, 1, 2
Test Set: Fold 3

Performance   Model 1   Model 2   …   Model n
Set I         3         4         …   5
Set II        4         1         …   2
Set III       …
Set IV        …
Cross Validation
• Example: 4-fold cross validation

Training/Test Set III
Training Set: Fold 3, 4, 1
Test Set: Fold 2

Performance   Model 1   Model 2   …   Model n
Set I         3         4         …   5
Set II        4         1         …   2
Set III       5         2         …   3
Set IV        …
Cross Validation
• Example: 4-fold cross validation

Training/Test Set IV
Training Set: Fold 2, 3, 4
Test Set: Fold 1

Performance   Model 1   Model 2   …   Model n
Set I         3         4         …   5
Set II        4         1         …   2
Set III       5         2         …   3
Set IV        4         3         …   2
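Averaging each model's column over the four sets gives the number used to choose a model. The sketch below uses only the three columns shown in the table and leaves out the models hidden behind "…". The slides do not say whether a larger performance value is better, so both readings are computed.

```python
import numpy as np

# Rows: Training/Test Set I to IV. Columns: Model 1, Model 2, Model n.
# The models elided with "…" in the table are not included.
model_names = ["Model 1", "Model 2", "Model n"]
performance = np.array([
    [3.0, 4.0, 5.0],  # Set I
    [4.0, 1.0, 2.0],  # Set II
    [5.0, 2.0, 3.0],  # Set III
    [4.0, 3.0, 2.0],  # Set IV
])

average = performance.mean(axis=0)  # one average per model

# Assumption A: the numbers are scores, so higher is better.
best_if_higher = model_names[int(np.argmax(average))]
# Assumption B: the numbers are errors, so lower is better.
best_if_lower = model_names[int(np.argmin(average))]

for name, value in zip(model_names, average):
    print(f"{name}: average performance = {value:.2f}")
print("best if higher is better:", best_if_higher)
print("best if lower is better: ", best_if_lower)
```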
Cross Validation

Summary
• The data set is divided into k subsets, and the holdout method is repeated k times.
• Each time, one of the k subsets is used as the test set and the other k-1 subsets are put together to form a training set.
• Then the average error across all k trials is computed.
• The variance is reduced as k is increased.

Advantage
• Less dependent on how the data gets divided.
• Every data point gets to be in a test set exactly once, and gets to be in a training set k-1 times.

Disadvantage
• Time!! The training has to be repeated k times.
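The summary can be turned into a short routine that returns the average error across the k trials. This sketch assumes polynomial models, mean squared error, k of at least 2, and a hypothetical data set. The function name `k_fold_cv_mse` is not from the slides.

```python
import numpy as np
from numpy.polynomial import Polynomial

def k_fold_cv_mse(x, y, order, k, seed=0):
    """Average test error over k folds for a polynomial of the given order (k >= 2)."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(x.size), k)  # list of index arrays
    fold_errors = []
    for i in range(k):
        test_idx = folds[i]                                    # one fold for testing
        train_idx = np.concatenate(folds[:i] + folds[i + 1:])  # other k-1 folds
        model = Polynomial.fit(x[train_idx], y[train_idx], deg=order)
        fold_errors.append(np.mean((model(x[test_idx]) - y[test_idx]) ** 2))
    return float(np.mean(fold_errors))

# Hypothetical data set (not from the slides).
data_rng = np.random.default_rng(5)
x = data_rng.uniform(0.0, 1.0, size=40)
y = np.sin(2.0 * np.pi * x) + data_rng.normal(scale=0.2, size=x.size)

cv_errors = {M: k_fold_cv_mse(x, y, M, k=4) for M in [0, 1, 3, 9]}
best_order = min(cv_errors, key=cv_errors.get)
print(cv_errors, "-> best order:", best_order)
```

Each candidate model is fitted k times here, which is the time cost listed under Disadvantage.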
Variation

No fixed folds and random division
• Regular CV: split into folds and shuffle the folds
• No split into folds: randomly divide the data into a test and training set k different times

LOOCV (Leave-one-out Cross Validation)
• Extreme case of k-fold cross validation
• If data size is n, set k = n
• Every data point except one is used for training and the remaining one is used for testing
• Repeat this n times
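Both variations can be sketched with small helper functions. The names `loocv_mse` and `random_split_mse`, the 1st-order model, the 30% test fraction, and the data are assumptions made for the example.

```python
import numpy as np
from numpy.polynomial import Polynomial

def loocv_mse(x, y, order):
    """Leave-one-out: n fits, each tested on the single point left out."""
    n = x.size
    errors = []
    for i in range(n):
        keep = np.arange(n) != i  # every data point except point i
        model = Polynomial.fit(x[keep], y[keep], deg=order)
        errors.append((model(x[i]) - y[i]) ** 2)
    return float(np.mean(errors)), len(errors)

def random_split_mse(x, y, order, repeats, test_fraction=0.3, seed=0):
    """No fixed folds: draw a fresh random train/test division each time."""
    rng = np.random.default_rng(seed)
    n_test = round(test_fraction * x.size)
    errors = []
    for _ in range(repeats):
        indices = rng.permutation(x.size)
        test_idx, train_idx = indices[:n_test], indices[n_test:]
        model = Polynomial.fit(x[train_idx], y[train_idx], deg=order)
        errors.append(np.mean((model(x[test_idx]) - y[test_idx]) ** 2))
    return float(np.mean(errors))

# Hypothetical data set (not from the slides).
data_rng = np.random.default_rng(6)
x = data_rng.uniform(0.0, 1.0, size=20)
y = 2.0 * x + data_rng.normal(scale=0.1, size=x.size)

loo_error, n_fits = loocv_mse(x, y, order=1)
split_error = random_split_mse(x, y, order=1, repeats=5)
print(f"LOOCV: {loo_error:.4f} from {n_fits} fits")
print(f"random division, 5 repeats: {split_error:.4f}")
```

With random division a data point may appear in several test sets or in none, unlike k-fold cross validation, where every point is tested exactly once.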