5 DL
• Biased data:
– Data that does not accurately reflect our target population.
– Example: we only asked people with glasses whom they will vote for, instead of
asking a random sample of the population.
– Training on biased data can lead to inaccuracies.
– There are methods to correct bias.
The three-way split
• Training set
A set of examples used for learning
• Validation set
A set of examples used to tune the parameters of a classifier
• Test set
A set of examples used only to assess the performance of a fully trained classifier. After
assessing the model with the test set, YOU MUST NOT further tune your model (that’s the
theory, anyway), in order to prevent ‘learning the test set’ and overfitting. A split sketch follows below.
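A minimal three-way-split sketch in Python, assuming scikit-learn is available; the 60/20/20 proportions and the load_data() helper are illustrative assumptions, not something the slides prescribe:

```python
from sklearn.model_selection import train_test_split

X, y = load_data()  # hypothetical loader; substitute your own dataset

# Carve off the test set first (20%), then split the rest into train/validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.20, random_state=42)
# 0.25 of the remaining 80% gives a 20% validation share overall.
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=42)
```

Once carved off, the test set should be touched exactly once, for the final evaluation.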
The three-way split
[Figure: training data flows into model tuning on the validation data, then into performance evaluation on the test data]
How to perform the split?
Can you use the July 2011 stock price to try and predict the April 2011 stock
price?
How to perform the split?
• Some datasets are affected by seasonality.
[Figure: monthly sales ($0–90,000) plotted over months 1–24, showing a repeating seasonal pattern]
– Realistic evaluation: the split plan should reflect the real-life scenarios your model
will have to work on (a chronological-split sketch follows below).
– External evaluation by a third party is the gold standard (but not always realistic).
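For seasonal or otherwise time-ordered data, a common remedy is to split chronologically rather than randomly, so the model never trains on the future to predict the past. A minimal sketch, assuming a pandas DataFrame with a 'date' column (both the DataFrame name and the column are assumptions for illustration):

```python
import pandas as pd

def chronological_split(df: pd.DataFrame, train=0.70, val=0.15):
    """Split a time-ordered DataFrame into train/validation/test by date."""
    df = df.sort_values("date")                    # assumed 'date' column
    n = len(df)
    i, j = int(train * n), int((train + val) * n)
    return df.iloc[:i], df.iloc[i:j], df.iloc[j:]  # oldest -> newest

# train_df, val_df, test_df = chronological_split(sales_df)  # sales_df is hypothetical
```

This guarantees the test set is strictly later in time than anything the model saw during training.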
Imbalanced classes
Dealing with imbalanced classes
Can we build a classifier that can find the needles in the haystack?
Imbalanced Class Sizes
• Sometimes, classes have very unequal frequency
– Cancer diagnosis: 99% healthy, 1% disease
– eCommerce: 90% don’t buy, 10% buy
– Security: >99.99% of Americans are not terrorists
• Up-sample:
– Repeat minority points (data duplication).
– Synthetic Minority Oversampling Technique (SMOTE): it works based on
the k-nearest-neighbours algorithm, synthetically generating data points that fall in the proximity of
the already existing minority-class examples (see the sketch after the links below).
https://www.analyticsvidhya.com/blog/2020/11/handling-imbalanced-data-machine-learning-computer-vision-and-nlp/
https://developers.google.com/machine-learning/data-prep/construct/sampling-splitting/imbalanced-data
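A minimal SMOTE sketch using the imbalanced-learn library (pip install imbalanced-learn); X and y are assumed to be an already loaded imbalanced dataset:

```python
from collections import Counter
from imblearn.over_sampling import SMOTE

smote = SMOTE(k_neighbors=5, random_state=0)   # synthesizes points near minority-class neighbours
X_resampled, y_resampled = smote.fit_resample(X, y)

print(Counter(y), "->", Counter(y_resampled))  # class counts are now balanced
```

Note that SMOTE should be applied to the training split only, never to the validation or test sets.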
Bootstrapping
• Sample m sets from the majority class so that each set’s size is equal to the
minority set’s.
• The down-sampling is done with replacement (repeats allowed).
• Train a model of minority vs. bootstrapped sample for each bootstrap
iteration.
• This gives us m different models; this is the basis of ensemble models
like random forests (see the sketch below).
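A minimal sketch of this down-sampled bootstrap ensemble; the choice of logistic regression as the base model and the probability-averaging rule are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def bootstrap_ensemble(X_min, X_maj, m=10, seed=0):
    """Train m models, each on the full minority set plus an equally sized
    with-replacement sample of the majority set."""
    rng = np.random.default_rng(seed)
    n_min, models = len(X_min), []
    for _ in range(m):
        idx = rng.choice(len(X_maj), size=n_min, replace=True)  # bootstrap the majority
        X = np.vstack([X_min, X_maj[idx]])
        y = np.concatenate([np.ones(n_min), np.zeros(n_min)])   # minority = 1, majority = 0
        models.append(LogisticRegression().fit(X, y))
    return models

def predict_proba(models, X):
    # Average the m models' positive-class probabilities.
    return np.mean([mdl.predict_proba(X)[:, 1] for mdl in models], axis=0)
```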
Reweighting
Training data: 5,000 faces, 10⁸ non-faces
• P. Viola and M. Jones, “Rapid object detection using a boosted cascade of simple features”, CVPR, 2001
• P. Viola and M. Jones, “Robust real-time face detection”, IJCV 57(2), 2004
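The cited Viola–Jones detector handles this imbalance with boosting, which reweights training examples between rounds. As a simpler stand-in for the reweighting idea (an assumption for illustration, not the authors' method), per-class weights can make the 5,000 faces count as much in the loss as the vastly larger negative set:

```python
from sklearn.linear_model import LogisticRegression

# class_weight="balanced" weights each class inversely to its frequency,
# so the tiny face class is not drowned out by the non-face class.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_train, y_train)   # X_train, y_train assumed already loaded
```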
Distribution of data for improving the accuracy of
the model
Training set: which you run your learning algorithm on.
Dev (development) set: which you use to tune parameters, select features, and
make other decisions regarding the learning algorithm.
Sometimes also called the hold-out cross-validation set.
But in the era of deep learning the split may even go down to 99 / 0.5 / 0.5 (in %).
If the data size is 1,000,000, then 5,000 examples will still land
in each of the dev and test sets (worked out below).
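A quick arithmetic check of that split:

```python
n = 1_000_000
train_size = int(0.99  * n)   # 990,000
dev_size   = int(0.005 * n)   #   5,000
test_size  = int(0.005 * n)   #   5,000
```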
Importance of Choosing dev and test sets wisely
Data Distribution Mismatch
It is naturally good to have the data in all the sets come from the same distribution.
For example: training on housing data coming from Mumbai while trying to predict
house prices in Chandigarh is a mismatch.
Otherwise you may waste a lot of time improving performance on the dev set, only to
find that the model does not work well on the test set.
Sometimes we have only two partitions of the data; in that case they are
called train/dev or train/test sets.
Underfitting vs Overfitting
Lecture 13
Bias/Variance
Train set error (%): 1, 12, 8, 1
A multi-dimensional system can have high bias in some regions and high variance in some
other regions of the input space, resulting in a simultaneous high-bias and high-variance issue.
High Bias
• Increase the model size (such as the number of neurons/layers): it allows you to fit
the training set better. If you find that this increases variance, then use
regularization, which will usually eliminate the increase in variance.
Often compare with human-level performance
• Example tasks: image recognition, spam classification.
• Ease of obtaining data from human labelers.
• Error analysis can draw on human intuition.
Tasks where we don’t compare with human-level performance
• Example: picking a book to recommend to you.
• Reason: it is harder to obtain labels.
Classification example for animals
Type                    Scenario 1        Scenario 2
Humans (Bayes error)    1                 7.5
Training error          8                 8
Dev error               10                10
Avoidable bias          7                 0.5
Diagnosis               Focus on bias     Focus on variance

Avoidable bias = training error − human (Bayes) error; variance = dev error − training error
(see the sketch below).
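A minimal sketch of the diagnosis rule behind both tables; the helper name diagnose is an illustrative assumption:

```python
def diagnose(human_error, train_error, dev_error):
    """Focus on whichever gap is larger: avoidable bias or variance."""
    avoidable_bias = train_error - human_error
    variance = dev_error - train_error
    return "focus on bias" if avoidable_bias > variance else "focus on variance"

print(diagnose(1.0, 8.0, 10.0))   # Scenario 1 -> focus on bias      (7 > 2)
print(diagnose(7.5, 8.0, 10.0))   # Scenario 2 -> focus on variance  (0.5 < 2)
```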
New scenario
Type                   Scenario 1       Scenario 2       Scenario 3
Human (Bayes) error    1 / 0.7 / 0.5    1 / 0.7 / 0.5    1 / 0.7 / 0.5
Training error         5                1                0.7
Dev error              6                5                0.8
Diagnosis              Bias issue       Variance issue   Difficult to tell
Thank You
For more information, please visit the
following links:
[email protected]
[email protected]
https://www.linkedin.com/in/gauravsingal789/
http://www.gauravsingal.in