Handling Imbalanced Datasets
Introduction
Suppose that you are working in a given company and you are asked to
create a model that, based on various measurements at your disposal,
predicts whether a product is defective or not. You decide to use your
favourite classifier, train it on the data and voilà: you get a 96.2%
accuracy!
Your boss is astonished and decides to use your model without any
further tests. A few weeks later he enters your office and underlines the
uselessness of your model. Indeed, the model you created has not
found any defective products since it was put into production.
After some investigation, you find out that only around 3.8% of the
products made by your company are defective, and your model just
always answers “not defective”, leading to a 96.2% accuracy.
This kind of “naive” result is due to the imbalanced dataset you are
working with. The goal of this article is to review the different
methods that can be used to tackle classification problems with
imbalanced classes.
Outline
First we will give an overview of different evaluation metrics that can
help to detect “naive behaviours”. We will then discuss a whole bunch
of methods that consist in reworking the dataset and show that these
methods can be misleading. Finally, we will show that reworking the
problem is, most of the time, the best way to proceed.
. . .
A good and yet simple metric that should always be used when dealing
with classification problems is the confusion matrix. This metric gives
an interesting overview of how well a model is doing. Thus, it is a great
starting point for any classification model evaluation. We summarise
most of the metrics that can be derived from the confusion matrix in
the following graphic.
The confusion matrix and the metrics that can be derived from it.
• low recall + high precision: the model can’t detect the class well
but is highly trustworthy when it does
• high recall + low precision: the class is well detected but the
model also includes points of other classes in it
The confusion matrix of our introductory example. Notice that the “defective” precision can’t be
computed.
The accuracy is 96.2% as said earlier. The non defective class precision
is 96.2% and the defective class precision is not computable. The recall
of the non defective class is 1.0, which is perfect (all the non defective
products have been labelled as such). But the recall of the defective
class is 0.0, which is the worst case (no defective products were
detected). Thus, we can conclude our model is not doing well for this
class. The F1 score is not computable for the defective products and is
0.981 for the non defective products. In this example, looking at the
confusion matrix could have led us to rethink our model or our goal (as
we will see in the following sections) and could have prevented us from
using a useless model.
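To make all this concrete, here is a minimal sketch of the introductory
example with scikit-learn. The labels are synthetic (we only assume that
around 3.8% of the products are defective) and the “model” always answers
“not defective”:

```python
# Minimal sketch of the introductory example: synthetic labels with
# ~3.8% defective products (label 1) and a naive model that always
# answers "not defective" (label 0).
import numpy as np
from sklearn.metrics import confusion_matrix, classification_report

rng = np.random.default_rng(0)
y_true = (rng.random(1000) < 0.038).astype(int)  # ~3.8% defective
y_pred = np.zeros_like(y_true)                   # always "not defective"

print(confusion_matrix(y_true, y_pred))
# zero_division=0 silences the warning caused by the "defective"
# precision being not computable (that class is never predicted).
print(classification_report(y_true, y_pred, zero_division=0))
```

The report shows exactly the pathology described above: a high accuracy, a
perfect recall for the majority class and a 0.0 recall for the minority class.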
Suppose that for a given point x, we have a model that outputs the
probability that this point belongs to C: P(C | x). Based on this
probability, we can define a decision rule that consists in saying that x
belongs to class C if and only if P(C | x) ≥ T, where T is a given threshold
defining our decision rule. If T = 1, a point is labelled as belonging to C
only if the model is 100% confident it does. If T = 0, every point is
labelled as belonging to C. Letting T vary between 0 and 1, we obtain one
classifier per threshold; plotting their true positive rates against their
false positive rates draws the ROC curve.
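As an illustrative sketch (assuming any scikit-learn-style classifier
exposing a predict_proba method, which is our assumption here, not
something imposed by the article), this decision rule can be written as:

```python
# Turn the probabilities P(C | x) of a fitted scikit-learn-style
# classifier into labels for a chosen threshold T.
import numpy as np

def predict_with_threshold(model, X, T=0.5):
    """Label a point as belonging to C (encoded 1) iff P(C | x) >= T."""
    proba_c = model.predict_proba(X)[:, 1]  # P(C | x) for each point
    return (proba_c >= T).astype(int)
```

With T = 1 the rule only answers C for points the model is fully confident
about; with T = 0 it answers C everywhere, as described above.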
Illustration of possible ROC curves depending on the effectiveness of the model. On the left, the model has to sacrifice a lot of precision to get a
high recall. On the right, the model is highly effective: it can reach a high recall while keeping a high precision.
Based on the ROC curve, we can build another metric, easier to use, to
evaluate the model: the AUROC, which is the Area Under the ROC
curve. AUROC acts a little bit like a scalar value that summarises the
entire ROC curve. As can be seen, the AUROC tends towards 1.0 in the
best case and towards 0.5 in the worst case.
Here again, a good AUROC score means that the model we are
evaluating does not sacrifice a lot of precision to get a good recall on
the observed class (often the minority class).
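As a hedged sketch on a toy imbalanced dataset (synthetic data and a plain
logistic regression, not the article's own experiment), the ROC curve and
the AUROC can be computed with scikit-learn:

```python
# Compute the ROC curve and the AUROC of a simple classifier on a
# synthetic dataset with ~3.8% positives (toy setup, for illustration).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, weights=[0.962, 0.038],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
scores = model.predict_proba(X_te)[:, 1]        # P(C1 | x) on the test set

fpr, tpr, thresholds = roc_curve(y_te, scores)  # one point per threshold T
print("AUROC:", roc_auc_score(y_te, scores))
```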
. . .
An imbalanced example
Let’s suppose that we have two classes: C0 and C1. Points from the class
C0 follow a one dimensional Gaussian distribution of mean 0, points from
the class C1 follow another one dimensional Gaussian distribution, and
the C0 class is heavily over-represented in the dataset.
Illustration of our imbalanced example. Dotted lines represent the probability densities of each class independently. Solid lines also take into
account the proportions.
In this example we can see that the curve of the C0 class is always
above the curve of the C1 class and, so, for any given point the
probability that this point was drawn from class C0 is always greater
than the probability it was drawn from class C1. Mathematically, using
basic Bayes rule, we can write

$$\frac{P(C0 \mid x)}{P(C1 \mid x)} = \frac{P(x \mid C0)\, P(C0)}{P(x \mid C1)\, P(C1)}$$

where we can clearly see the effect of the priors and how they can lead to
a situation where a class is always more likely than the other.
All this implies that, even from a perfect theoretical point of view, we
know that if we had to train a classifier on these data, the accuracy of
the classifier would be maximal when always answering C0. So, if
the goal is to train a classifier to get the best possible accuracy, then it
should not be seen as a problem but just as a fact: with these features,
the best we can do (in terms of accuracy) is to always answer C0. We
have to accept it.
In our Gaussian example, if the means are different enough with respect to the variances, even imbalanced classes can be well separable.
Here we see that, contrary to the previous case, the C0 curve is not
always above the C1 curve and, so, there are points that are more likely
to be drawn from class C1 than from class C0. In this case, the two
classes are separated enough to compensate for the imbalance: a classifier
will not necessarily answer C0 all the time.
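A small numerical sketch of these two regimes, with hypothetical Gaussian
parameters (the exact means, variances and proportions behind the figures
are not specified here):

```python
# Check, on a grid, whether the "joint" C0 curve P(x|C0)P(C0) stays
# above the C1 curve P(x|C1)P(C1) everywhere (hypothetical parameters).
import numpy as np
from scipy.stats import norm

def c0_always_dominates(mu1, prior_c1=0.1, sigma0=2.0, sigma1=1.0):
    x = np.linspace(-15, 15, 4001)
    joint_c0 = (1 - prior_c1) * norm.pdf(x, loc=0.0, scale=sigma0)
    joint_c1 = prior_c1 * norm.pdf(x, loc=mu1, scale=sigma1)
    return bool(np.all(joint_c0 >= joint_c1))

print(c0_always_dominates(mu1=2.0))  # close means: C0 more likely everywhere
print(c0_always_dominates(mu1=8.0))  # separated means: C1 wins near its mean
```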
Illustration of the theoretical minimal error for different degrees of separability of two classes.
No classifier can then do better, in terms of accuracy, than the theoretical
minimal error

$$E_{\min} = \int \min\big(P(x \mid C0)\, P(C0),\; P(x \mid C1)\, P(C1)\big)\, dx$$

which is the area under the min of the two curves represented above.
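Under the same hypothetical parameters as in the sketch above, this area
can be approximated numerically:

```python
# Approximate the theoretical minimal error as the area under the
# pointwise min of the two joint curves (hypothetical parameters).
import numpy as np
from scipy.stats import norm

x = np.linspace(-15, 15, 20001)
joint_c0 = 0.9 * norm.pdf(x, loc=0.0, scale=2.0)  # P(x | C0) P(C0)
joint_c1 = 0.1 * norm.pdf(x, loc=2.0, scale=1.0)  # P(x | C1) P(C1)

dx = x[1] - x[0]
min_error = np.minimum(joint_c0, joint_c1).sum() * dx
print(f"theoretical minimal error: {min_error:.4f}")
```

When one curve dominates everywhere, as in the first regime, the min is
simply the minority curve and the minimal error equals the minority
proportion: always answering C0 is then optimal.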
. . .
Illustration of the effect that different degrees of majority class undersampling have on the model
decisions.
A classifier trained on the resampled dataset this way will then have a
lower accuracy on the future real test data than a classifier trained on
the unchanged dataset. Indeed, the true proportions of classes are
important for classifying a new point, and that information has been lost
when resampling the dataset.
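For illustration, here is a minimal majority class undersampling sketch in
plain NumPy (libraries such as imbalanced-learn provide ready-made
samplers). Note that it changes the class proportions by construction,
which is precisely why a classifier trained on its output sees wrong priors:

```python
# Undersample the majority class (label 0) so that it only keeps
# ratio * n_minority points; all minority points (label 1) are kept.
import numpy as np

def undersample_majority(X, y, ratio=1.0, seed=0):
    rng = np.random.default_rng(seed)
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    kept = rng.choice(majority, size=int(ratio * len(minority)),
                      replace=False)
    idx = np.concatenate([minority, kept])
    return X[idx], y[idx]
```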
Looking for additional features can help separate two classes that were not initially separable.
. . .
So what if we are still unhappy with these results? In this case, it means
that, in one way or another, our problem is not well stated (otherwise
we should accept the results as they are) and that we should rework it in
order to get more satisfying results. Let’s see an example.
Cost-based classification
The feeling that the obtained results are not good can come from the fact
that the objective function was not well defined. Up to now, we have
assumed that we target a classifier with high accuracy, assuming at the
same time that both kinds of errors (“false positive” and “false
negative”) have the same cost. In our example it means we assumed
that predicting C0 when the true label is C1 is as bad as predicting C1
when the true label is C0. Errors are then symmetric.
Let’s consider our introductory example with defective (C1) and not
defective (C0) products. In this case, we can imagine that not detecting
a defective product will cost more to the company (customer service
costs, possible legal costs if defects are dangerous, …) than wrongly
labelling a not defective product as defective (lost production cost).
Now, predicting C0 when the true label is C1 is far worse than predicting
C1 when the true label is C0. Errors are no longer symmetric. Let’s say,
more concretely, that:
• predicting C0 when the true label is C1 costs P01
• predicting C1 when the true label is C0 costs P10 (with 0 < P10 << P01)
Then, we can redefine our objective function: we don’t target the best
accuracy anymore but look for the lowest prediction cost instead.
So, with this objective function, the best classifier from a theoretical
point of view will then be such that it predicts C0 if and only if the
expected cost of answering C0 is lower than the expected cost of
answering C1:

$$P_{01}\, P(C1 \mid x) < P_{10}\, P(C0 \mid x)$$
Probability threshold
One first possible way to take the cost into account in our classifier is to
do it after the training. The idea is, first, to train a classifier the basic
way to output the probabilities

$$P(C0 \mid x) \quad \text{and} \quad P(C1 \mid x)$$

without taking any cost into account. These probabilities can then be
reweighted by the costs in the final decision rule: a point is labelled C0 if

$$P_{10}\, P(C0 \mid x) > P_{01}\, P(C1 \mid x)$$

and C1 otherwise.
Illustration of the probability threshold approach: the outputted probabilities are reweighted such that costs are taken into account in the final
decision rule.
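A hedged sketch of this post-training rule, with hypothetical costs P01 and
P10 and any fitted classifier exposing predict_proba:

```python
# Reweight the outputted probabilities by the costs in the final
# decision rule (costs are hypothetical, for illustration only).
import numpy as np

P01 = 100.0  # cost of predicting C0 when the true label is C1
P10 = 1.0    # cost of predicting C1 when the true label is C0

def cost_aware_predict(model, X):
    """Answer C1 (label 1) iff the expected cost of answering C0,
    P01 * P(C1 | x), is at least the expected cost of answering C1,
    P10 * P(C0 | x)."""
    proba = model.predict_proba(X)  # columns: [P(C0 | x), P(C1 | x)]
    return (P01 * proba[:, 1] >= P10 * proba[:, 0]).astype(int)
```

With P01 >> P10, the rule answers C1 as soon as there is even a small
probability that the point is defective, which is exactly the asymmetry
we wanted.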
Classes reweight
The idea of class reweight is to take into account the asymmetry of cost
errors directly during the classifier training. Doing so, the outputted
probabilities for each class will already embed the cost error information.
For some models (for example Neural Network classifiers), taking the
cost into account during the training can consist in adjusting the
objective function. We still want our classifier to output the probabilities
P(C0 | x) and P(C1 | x), but they are now learned by minimising a loss
where an error on a class C1 point is weighted by P01 and an error on a
class C0 point is weighted by P10.
Illustration of the class reweight approach: the majority class is undersampled with a proportion that is chosen carefully to introduce the cost
information directly inside the class proportions.
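As one concrete realisation of class reweighting (a scikit-learn facility,
not a method prescribed by the article), many estimators accept a
class_weight argument that scales each class's contribution to the training
loss; the costs below are hypothetical:

```python
# Train a classifier whose loss penalises errors on the rare class C1
# (here) 100x more than errors on C0 (toy data, hypothetical costs).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=5000, weights=[0.962, 0.038],
                           random_state=0)

model = LogisticRegression(class_weight={0: 1.0, 1: 100.0},
                           max_iter=1000).fit(X, y)
print("share of C1 answers:", (model.predict(X) == 1).mean())
```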
. . .
Takeaways
The main takeaways of this article are:
• the evaluation metrics and the objective function have to be set with
respect to a well chosen goal that can be, for example, minimising a cost
Finally, let’s say that the main keyword of this article is “goal”. Knowing
exactly what you want to obtain will help overcome imbalanced dataset
problems and will ensure having the best possible results. Defining the
goal perfectly should always be the first thing to do, and it is the starting
point of every choice that has to be made in order to create a machine
learning model.
. . .