5 DL

The document discusses the importance of proper dataset partitioning into training, validation, and test sets to avoid bias and overfitting in machine learning models. It highlights the challenges posed by imbalanced classes and offers strategies for handling such data, including down-sampling, up-sampling, and reweighting. Additionally, it emphasizes the significance of realistic evaluation plans and the need for data distribution consistency across sets to ensure effective model performance.


Dataset – Train, Test & Validation

The basic rule in model validation

Don’t test your model on data it’s been trained with!
Random selection of data vs bias
• Unbiased data:
– Data that represents the diversity of the target distribution on which we want to make predictions, i.e., a random, representative sample. This is the good scenario.

• Biased data:
– Data that does not accurately reflect the target population.
– Example: asking only people with glasses whom they will vote for, instead of asking a random sample of the population.
– Training on biased data can lead to inaccurate predictions.
– There are methods to correct for bias.
The three-way split
• Training set
A set of examples used for learning.

• Validation set
A set of examples used to tune the hyperparameters of a classifier.

• Test set
A set of examples used only to assess the performance of a fully-trained classifier. After assessing the model with the test set, YOU MUST NOT further tune your model (that’s the theory, anyway – this prevents ‘learning the test set’ and overfitting to it).
The three-way split

[Diagram: the entire available dataset is split into Training data (learning), Validation data (model tuning), and Test data (performance evaluation).]
How to perform the split?

• How many examples in each data set?
– Training: typically 60-80% of the data
– Test set: typically 20-30% of the data
– Validation set: often 20% of the data
• Examples
– 3-way: Training 60%, CV 20%, Test 20%
– 2-way: Training 70%, Test 30%
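As an illustration (not from the original slides), here is a minimal sketch of a 60/20/20 three-way split, assuming scikit-learn is available; X and y are dummy placeholders standing in for your own features and labels.

# Minimal sketch of a 60/20/20 split using scikit-learn (assumed available).
# X and y below are dummy placeholders for your own features and labels.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 5)
y = np.random.randint(0, 2, size=1000)

# First hold out 20% as the test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y)

# Then split the remaining 80% into 60% train / 20% validation (0.25 of 80% = 20%).
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42, stratify=y_rest)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200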
How to perform the split?

• Time-based split – if time is important.

Can you use the Jul 2011 stock price to try to predict the Apr 2011 stock price? (Training on data from the future to predict the past leaks information.)
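As a sketch of what a time-based split looks like in practice (my own illustration using pandas; the DataFrame and its columns are made up), sort by date and use a cutoff so that the test period strictly follows the training period.

# Time-based split: train on data before a cutoff date, test on data after it,
# so the model never uses "future" information to predict the past.
import pandas as pd

df = pd.DataFrame({
    "date": pd.date_range("2011-01-01", periods=12, freq="MS"),  # monthly data
    "price": range(12),
})

df = df.sort_values("date")
cutoff = pd.Timestamp("2011-07-01")

train = df[df["date"] < cutoff]    # Jan-Jun 2011
test = df[df["date"] >= cutoff]    # Jul-Dec 2011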
How to perform the split?
• Some datasets are affected by seasonality.
[Chart: monthly sales ($0–90,000) over 24 months, illustrating a seasonal pattern.]
Is it a good test set?


How to perform the split?
• Many more possible scenarios…
• Key messages:
– Evaluation plan: very important to define at the beginning of the project.

– Realistic evaluation: such a plan should reflect the real-life scenarios your model will have to work on.

– External evaluation by a third party is the gold standard (but not always realistic).
Imbalanced classes
Dealing with imbalanced classes

Class 1: Short people


Class 2: Tall people

Can we build a classifier that can find the needles in the haystack?
Imbalanced Class Sizes
• Sometimes, classes have very unequal frequency
– Cancer diagnosis: 99% healthy, 1% disease
– eCommerce: 90% don’t buy, 10% buy
– Security: >99.99% of Americans are not terrorists

• Similar situation with multiple classes.

• This creates problems for training and evaluating a model


Evaluating a model with imbalanced data

• Example: 99% of people do not have cancer.

• If we simply build a ‘trivial classifier’ that predicts nobody has cancer, then 99% of our predictions are correct – yet the classifier is useless, because it misses every cancer patient.
• If we misclassify only 1% of healthy patients and accurately find all cancer patients,
– then ~50% of the people we tell they have cancer are actually healthy!
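To make the arithmetic behind that “~50%” explicit (my own worked calculation, assuming 1% of the population has cancer and a 1% false-positive rate on the 99% who are healthy):

\text{Precision} = \frac{TP}{TP + FP} = \frac{0.01}{0.01 + 0.99 \times 0.01} \approx \frac{0.01}{0.0199} \approx 0.50

So roughly half of the people flagged as having cancer are in fact healthy, even though the overall accuracy looks excellent.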
Learning with imbalanced datasets is often done by
balancing the classes

• Important: even if you train on balanced data, evaluate the results on an imbalanced held-out (test) set that reflects the real class distribution.

• How to create a “balanced” set
– Treat the majority class
• Down-sample
• Bootstrap (repeat down-sampling with replacement)
– Treat the minority class
• Up-sample the minority class
• Assign larger weights to the minority class samples
Down-sample and Up-sample
• Down-sample:
– Sample (randomly choose) a subset of points from the majority class.

• Up-sample:
– Repeat minority points (data duplication).
– Synthetic Minority Oversampling Technique (SMOTE): based on the k-nearest-neighbours algorithm, it synthetically generates new data points in the proximity of existing minority-class samples.

https://www.analyticsvidhya.com/blog/2020/11/handling-imbalanced-data-machine-learning-computer-vision-and-nlp/
https://developers.google.com/machine-learning/data-prep/construct/sampling-splitting/imbalanced-data
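A minimal sketch of random down-sampling and up-sampling with scikit-learn’s resample utility (my own illustration; the arrays are made up, and SMOTE itself, available in the separate imbalanced-learn package, is not shown):

# Random down-sampling of the majority class and up-sampling of the minority
# class using sklearn.utils.resample; X_major and X_minor are dummy arrays.
import numpy as np
from sklearn.utils import resample

X_major = np.random.rand(990, 4)   # 990 majority-class samples
X_minor = np.random.rand(10, 4)    # 10 minority-class samples

# Down-sample: keep a random subset of the majority, matching the minority size.
X_major_down = resample(X_major, replace=False,
                        n_samples=len(X_minor), random_state=0)

# Up-sample: repeat minority points (duplication with replacement).
X_minor_up = resample(X_minor, replace=True,
                      n_samples=len(X_major), random_state=0)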
Bootstrapping
• Sample m sets from the majority class so that each set’s size is equal to the minority set.
• The down-sampling is done with replacement (repeats allowed).
• Train a model of the minority class vs. the bootstrapped sample for each bootstrap iteration.
• This gives us m different models; this idea is the basis of ensemble methods such as Random Forest.
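A sketch of that idea in code (my own illustration; the data, and the choice of logistic regression as the base model, are assumptions rather than part of the slides):

# Train m models, each on (minority class) vs. (one bootstrapped majority
# sample of the same size), then average their predicted probabilities.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

X_major = np.random.rand(990, 4)   # majority class (label 0)
X_minor = np.random.rand(10, 4)    # minority class (label 1)

m = 5
models = []
for i in range(m):
    # One bootstrapped majority sample, the same size as the minority set.
    X_boot = resample(X_major, replace=True,
                      n_samples=len(X_minor), random_state=i)
    X_i = np.vstack([X_boot, X_minor])
    y_i = np.concatenate([np.zeros(len(X_boot)), np.ones(len(X_minor))])
    models.append(LogisticRegression().fit(X_i, y_i))

# Ensemble prediction: average the m models' probabilities for the minority class.
X_new = np.random.rand(3, 4)
p_minority = np.mean([clf.predict_proba(X_new)[:, 1] for clf in models], axis=0)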
Reweighting

• Idea: Assign larger weights to samples from the smaller class

• A commonly used weighting scheme is linearly by class size:


$w_c = \frac{n}{n_c}$

– where $n_c$ is the size of class $c$ and $n = \sum_c n_c$ is the total sample size.
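A short sketch of this weighting scheme (my own example with made-up class counts). For reference, scikit-learn’s class_weight='balanced' option uses the closely related formula n / (K * n_c), where K is the number of classes.

# Compute w_c = n / n_c for each class from its label counts.
import numpy as np

y = np.array([0] * 990 + [1] * 10)            # imbalanced labels
classes, counts = np.unique(y, return_counts=True)
n = counts.sum()

weights = {c: n / n_c for c, n_c in zip(classes, counts)}
# -> {0: ~1.01, 1: 100.0}: the minority class gets a much larger weight.

# Per-sample weights, e.g. to pass as a sample_weight argument during training.
sample_weights = np.array([weights[label] for label in y])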
Imbalanced data can be harnessed
• The Viola/Jones Face Detector
Faces vs. non-faces

Training data: 5000 faces, ~10^8 non-faces

• P. Viola and M. Jones, “Rapid object detection using a boosted cascade of simple features”, CVPR, 2001
• P. Viola and M. Jones, “Robust real-time face detection”, IJCV 57(2), 2004
Distribution of data for improving the accuracy of
the model

Training set: the set which you run your learning algorithm on.

Dev (development) set: the set which you use to tune parameters, select features, and make other decisions regarding the learning algorithm. Sometimes also called the hold-out cross-validation set.

Test set: the set which you use to evaluate the performance of the algorithm, but not to make any decisions regarding what learning algorithm or parameters to use.
Partition your data in different categories
Training Set | Dev (Hold-Out Cross Validation) Set | Test Set

Traditional-style partitioning: 70/30 or 60/20/20.

But in the era of deep learning the split may even go down to 99/0.5/0.5.

If the data size is 1,000,000, then 5,000 examples will still be available in each of the dev and test sets.
Importance of Choosing dev and test sets wisely

• The purpose of the dev and test sets is to direct your team toward the most important changes to make to the machine learning system.

• It is very important that the dev and test sets reflect data you expect to get in the future and want to do well on.

• A bad distribution will severely restrict your analysis when trying to work out why the model is not giving good results on the test data.
Data Distribution Mismatch
It is naturally best to have the data in all the sets come from the same distribution.

For example: housing data coming from Mumbai while we are trying to predict house prices in Chandigarh.

Otherwise we risk wasting a lot of time improving performance on the dev set, only to find out that the model does not work well on the test set.

Sometimes we have only a two-way partition of the data; in that case the sets are called train/dev or train/test sets.
Underfitting vs Overfitting
Lecture 13

Bias/Variance

Bias/Variance

• The algorithm’s error rate on the training set is the algorithm’s bias.

• How much worse the algorithm does on the dev (or test) set than on the training set is the algorithm’s variance.

Bias/Variance
Train set error 1, Dev set error 9 → High variance (overfitting)
Train set error 12, Dev set error 13 → High bias (underfitting)
Train set error 8, Dev set error 16 → High bias, high variance (underfitting)
Train set error 1, Dev set error 1.5 → Low bias, low variance (good fit)

A multi-dimensional system can have high bias in some areas and high variance in other areas of the system, resulting in a combined high-bias and high-variance issue.
High Bias
• Increase the model size (such as the number of neurons/layers): allows you to fit the training set better. If you find that this increases variance, then use regularization, which will usually eliminate the increase in variance.

• Modify input features based on insights from error analysis: create additional features that help the algorithm eliminate a particular category of errors. These new features could help with both bias and variance.

• Reduce or eliminate regularization (L2, L1 regularization, dropout): reduces avoidable bias but increases variance.

• Modify the model architecture (such as the neural network architecture) so that it is more suitable for your problem: this can affect both bias and variance.
High Variance
• Add more training data: the simplest and most reliable way to address variance, so long as you have access to significantly more data and enough computational power to process it.

• Add regularization (L2, L1 regularization, dropout): this technique reduces variance but increases bias.

• Add early stopping (stop gradient descent early, based on dev set error): reduces variance but increases bias.

• Modify the model architecture (such as the neural network architecture) so that it is more suitable for your problem: this affects both bias and variance.
Often compare with human level performance
Examples: image recognition, spam classification.

Reasons:
• Ease of obtaining labels from human labelers.
• Error analysis can draw on human intuition.
• Human-level performance can be used to estimate the optimal error rate and to set a “desired error rate”.
Tasks Where we don’t compare with human level
performance
Examples: picking a book to recommend to you; picking an ad to show a user on a website; predicting the stock market.

Reasons:
• It is harder to obtain labels.
• Human intuition is harder to count on.
• It is hard to know what the optimal error rate and a reasonable desired error rate are.

Classification example for animals
Scenario 1: Human (Bayes) error 1, Training error 8, Dev error 10 → Avoidable bias 7 → focus on bias.
Scenario 2: Human (Bayes) error 7.5, Training error 8, Dev error 10 → Avoidable bias 0.5 → focus on variance.
New scenario
Human (Bayes) error estimates: 1 / 0.7 / 0.5 (the same three human-level estimates in all scenarios).
Scenario 1: Training error 5, Dev error 6 → bias issue.
Scenario 2: Training error 1, Dev error 5 → variance issue.
Scenario 3: Training error 0.7, Dev error 0.8 → difficult to tell.
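As a small sketch of how these tables are read (my own illustration, taking 0.5, the lowest of the three human-level estimates, as the proxy for Bayes error): avoidable bias is training error minus human-level error, and variance is dev error minus training error.

# Decompose the error of each scenario into avoidable bias and variance.
scenarios = {
    "Scenario 1": {"human": 0.5, "train": 5.0, "dev": 6.0},
    "Scenario 2": {"human": 0.5, "train": 1.0, "dev": 5.0},
    "Scenario 3": {"human": 0.5, "train": 0.7, "dev": 0.8},
}

for name, e in scenarios.items():
    avoidable_bias = e["train"] - e["human"]   # gap between training and human error
    variance = e["dev"] - e["train"]           # gap between dev and training error
    print(f"{name}: avoidable bias = {avoidable_bias:.1f}, variance = {variance:.1f}")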

Two fundamental assumptions:
• You can fit the training set well.
• Training set performance should generalize to the dev/test set.

Thank You
For more information, please visit the
following links:

[email protected]
[email protected]
https://www.linkedin.com/in/gauravsingal789/
http://www.gauravsingal.in

