10 Techniques To Deal With Class Imbalance in Machine Learning
Overview
You can check the implementation of the code in my GitHub repository here
Introduction
When one class has far more observations than the other classes, the dataset has a class imbalance.
Example: detecting fraudulent credit card transactions, where there are only around 400 fraudulent
transactions compared with around 90,000 non-fraudulent ones. Class imbalance shows up in many domains, including:
Fraud detection
Spam filtering
Disease screening
SaaS subscription churn
Advertising click-throughs
The Problem with Class Imbalance
Most machine learning algorithms work best when the number of samples in each class is about equal.
This is because most algorithms are designed to maximize accuracy and reduce errors.
However, if the data set is imbalanced, you can get a pretty high accuracy just by predicting
the majority class while failing to capture the minority class, which is most often the point of creating the
model in the first place.
Let’s say we have a dataset from a credit card company, and we have to find out whether each credit card
transaction was fraudulent or not.
But here’s the catch: fraudulent transactions are relatively rare. Only 6% of the transactions are fraudulent.
Now, before you even start, do you see how the problem might break? Imagine if you didn’t bother training
a model at all. Instead, what if you just wrote a single line of code that always predicts ‘no fraudulent
transaction’? That single line would be right about 94% of the time while never catching a single fraud.
This is clearly a problem because many machine learning algorithms are designed to maximize overall
accuracy. In this article, we will see different techniques to handle the imbalanced data.
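The "always predict no fraud" baseline can be sketched in a few lines. This is only an illustration under stated assumptions: the labels are synthetic (10,000 transactions with a 6% fraud rate, matching the figure above), and scikit-learn's `DummyClassifier` stands in for the single line of code.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic labels (assumption): 10,000 transactions, 6% fraudulent (class 1)
y = np.array([1] * 600 + [0] * 9400)
X = np.zeros((len(y), 1))  # features are irrelevant to this baseline

# A "model" that always predicts the most frequent class (no fraud)
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)
y_pred = baseline.predict(X)

baseline_accuracy = accuracy_score(y, y_pred)
baseline_recall = recall_score(y, y_pred)  # share of fraud cases caught

print("Accuracy:", baseline_accuracy)
print("Recall on fraud:", baseline_recall)
```

The accuracy looks respectable while the recall on the fraud class is zero, which is why accuracy alone is a poor guide on imbalanced data.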
Data
We will use a credit card fraud detection dataset for this article. You can find the dataset here.
After loading the data, display the first five rows of the data set.
# check the target variable: fraudulent vs. non-fraudulent transactions
data['Class'].value_counts()
# 0 ->
plt.show()
You can clearly see that there is a huge difference between the two classes: 9000 non-fraudulent transactions
and 492 fraudulent ones.
One of the major pitfalls that new users fall into when dealing with unbalanced datasets relates
to the metrics used to evaluate their model. Using simpler metrics like accuracy score can be misleading.
In a dataset with highly unbalanced classes, a classifier can “predict” the most common class almost every time
without performing any real analysis of the features and still have a high accuracy rate, which is obviously not
a meaningful one.
We can see 99% accuracy. The accuracy is very high because the model is predicting mostly the majority
class, 0 (non-fraudulent).
Resampling Technique
A widely adopted technique for dealing with highly unbalanced datasets is called resampling. It consists of
removing samples from the majority class (under-sampling) and/or adding more examples from the
minority class (over-sampling).
Despite the advantage of balancing classes, these techniques also have their weaknesses (there is no free
lunch).
The simplest implementation of over-sampling is to duplicate random records from the minority class,
which can cause overfitting.
In under-sampling, the simplest technique involves removing random records from the majority class,
which can cause loss of information.
Let’s implement this with the credit card fraud detection example.
class_0 = data[data['Class'] == 0]
class_1 = data[data['Class'] == 1]
# print the shape of each class
print('class 0:', class_0.shape)
print('class 1:', class_1.shape)
1. Random Under-Sampling
Undersampling can be defined as removing some observations of the majority class. This is done until the
majority and minority classes are balanced out.
Undersampling can be a good choice when you have a ton of data, think millions of rows. But a drawback
to undersampling is that we are removing information that may be valuable.
2. Random Over-Sampling
Oversampling can be defined as adding more copies of the minority class. Oversampling can be a good
choice when you don’t have a ton of data to work with.
A con to consider when oversampling is that it can cause overfitting and poor generalization to your test
set.
axis=0)
print("total class of 1 and 0:", test_under['Class'].value_counts())
# plot the count after under-sampling
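Both random under-sampling and random over-sampling can be sketched with plain pandas. The snippet below is a minimal illustration, not the article's exact code: the toy DataFrame and its `Amount` column are assumptions, and `class_0_under`, `class_1_over`, and `test_over` are hypothetical names introduced here.

```python
import pandas as pd

# Toy stand-in for the fraud data (assumption): 95 class-0 rows, 5 class-1 rows
data = pd.DataFrame({"Amount": range(100), "Class": [0] * 95 + [1] * 5})
class_0 = data[data["Class"] == 0]
class_1 = data[data["Class"] == 1]

# Random under-sampling: keep only as many majority rows as there are minority rows
class_0_under = class_0.sample(n=len(class_1), random_state=42)
test_under = pd.concat([class_0_under, class_1], axis=0)

# Random over-sampling: draw minority rows with replacement up to the majority size
class_1_over = class_1.sample(n=len(class_0), replace=True, random_state=42)
test_over = pd.concat([class_0, class_1_over], axis=0)

print(test_under["Class"].value_counts())
print(test_over["Class"].value_counts())
```

Under-sampling shrinks the data to twice the minority size, while over-sampling grows it to twice the majority size by repeating minority rows.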
A number of more sophisticated resampling techniques have been proposed in the scientific literature.
For example, we can cluster the records of the majority class, and do the under-sampling by removing
records from each cluster, thus seeking to preserve information. In over-sampling, instead of creating
exact copies of the minority class records, we can introduce small variations into those copies, creating
more diverse synthetic samples.
Let’s apply some of these resampling techniques, using the Python library imbalanced-learn. It is
compatible with scikit-learn and is part of scikit-learn-contrib projects.
import imblearn
3. Random under-sampling with imblearn
RandomUnderSampler is a fast and easy way to balance the data by randomly selecting a subset of data
for the targeted classes. Under-sample the majority class(es) by randomly picking samples with or without
replacement.
Counter(y_rus))
4. Random over-sampling with imblearn
One way to fight imbalanced data is to generate new samples in the minority classes. The most naive
strategy is to generate new samples by randomly sampling with replacement from the currently available
samples. The RandomOverSampler offers such a scheme.
5. Under-sampling: Tomek links
Tomek links are pairs of very close instances that belong to opposite classes. Removing the instances of the
majority class of each pair increases the space between the two classes, facilitating the classification
process.
A Tomek link exists if two samples from different classes are the nearest neighbors of each other.
In the code below, we’ll use sampling_strategy='majority' (called ratio='majority' in older imblearn releases) to
resample only the majority class.
6. Synthetic Minority Oversampling Technique (SMOTE)
SMOTE (Synthetic Minority Oversampling Technique) works by randomly picking a point from the minority
class and computing the k-nearest neighbors for this point. The synthetic points are added between the
chosen point and its neighbors.
# import library
from imblearn.over_sampling import SMOTE
smote = SMOTE()
# fit predictor and target variable
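To make the interpolation step concrete, here is a hand-rolled sketch of the idea behind SMOTE using NumPy and scikit-learn's `NearestNeighbors`. It is a simplified illustration, not imblearn's implementation: the toy minority points, `k = 5`, and the helper `make_synthetic` are all assumptions introduced here.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# Toy minority-class points (assumption): 20 samples with 2 features
x_min = rng.normal(size=(20, 2))

k = 5
# Ask for k + 1 neighbors because each point is returned as its own nearest neighbor
nn = NearestNeighbors(n_neighbors=k + 1).fit(x_min)
_, neighbor_idx = nn.kneighbors(x_min)


def make_synthetic(n_new):
    """Interpolate between random minority points and one of their k neighbors."""
    base = rng.integers(0, len(x_min), size=n_new)
    # Column 0 is the point itself, so pick from columns 1..k
    picked = neighbor_idx[base, rng.integers(1, k + 1, size=n_new)]
    gap = rng.random((n_new, 1))  # position along the segment, in [0, 1)
    return x_min[base] + gap * (x_min[picked] - x_min[base])


x_new = make_synthetic(30)
print(x_new.shape)
```

Each synthetic point lies on the line segment between a real minority sample and one of its neighbors, which is why SMOTE produces more varied samples than plain duplication.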
7. NearMiss
NearMiss is an under-sampling technique. Instead of resampling the minority class, it uses distances to
choose which majority class samples to keep, so that the majority class ends up the same size as the minority class.
from collections import Counter
from imblearn.under_sampling import NearMiss
nm = NearMiss()
x_nm, y_nm = nm.fit_resample(x, y)
print('Original dataset shape:', Counter(y))
print('Resample dataset shape:', Counter(y_nm))
8. Change the performance metric
Accuracy is not the best metric to use when evaluating imbalanced datasets as it can be misleading.
Metrics that can provide better insight are:
Confusion Matrix: a table showing correct predictions and types of incorrect predictions.
Precision: the number of true positives divided by all positive predictions. Precision is also called
Positive Predictive Value. It is a measure of a classifier’s exactness. Low precision indicates a high
number of false positives.
Recall: the number of true positives divided by the number of positive values in the test data. Recall
is also called Sensitivity or the True Positive Rate. It is a measure of a classifier’s completeness.
Low recall indicates a high number of false negatives.
F1 Score: the harmonic mean of precision and recall.
Area Under ROC Curve (AUROC): AUROC represents the likelihood of your model distinguishing
observations from two classes.
In other words, if you randomly select one observation from each class, what’s the probability that your
model will be able to “rank” them correctly?
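The metrics above can be computed with scikit-learn on a small hand-made example. The labels and scores below are invented for illustration only, and the 0.5 decision threshold is an assumption.

```python
import numpy as np
from sklearn.metrics import (
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Invented example (assumption): 6 negatives followed by 4 positives
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.1, 0.2, 0.3, 0.4, 0.45, 0.6, 0.9, 0.8, 0.35, 0.48])

# Turn scores into hard labels with a 0.5 threshold
y_pred = (y_score >= 0.5).astype(int)

cm = confusion_matrix(y_true, y_pred)  # rows: actual class, columns: predicted class
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)  # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)  # harmonic mean of precision and recall
auc = roc_auc_score(y_true, y_score)  # ranking quality, computed from the scores

print(cm)
print("Precision:", precision)
print("Recall:", recall)
print("F1:", f1)
print("AUROC:", auc)
```

Here the confusion matrix holds 5 true negatives, 1 false positive, 2 false negatives, and 2 true positives, so precision is 2/3 and recall is 1/2. The AUROC is 20/24 because 20 of the 24 positive-negative pairs are ranked correctly.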
9. Penalize Algorithms (Cost-Sensitive Training)
The next tactic is to use penalized learning algorithms that increase the cost of classification mistakes on
the minority class.
During training, we can use the argument class_weight='balanced' to penalize mistakes on the
minority class by an amount proportional to how under-represented it is.
We also want to include the argument probability=True if we want to enable probability estimates for
SVM algorithms.
# load library
from sklearn.svm import SVC
# class_weight='balanced' penalizes mistakes on the minority class
svc_model = SVC(class_weight='balanced', probability=True)
svc_model.fit(x_train, y_train)
svc_predict = svc_model.predict(x_test)
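To see what "balanced" means numerically, the weights can be computed directly. In scikit-learn the balanced weight of a class is `n_samples / (n_classes * count_of_that_class)`. The 90/10 label split below is an assumption chosen to keep the arithmetic easy to follow.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy training labels (assumption): 90 majority samples and 10 minority samples
y_train = np.array([0] * 90 + [1] * 10)
classes = np.unique(y_train)

# Weights scikit-learn derives for class_weight='balanced'
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_train)

# The same formula written out by hand
manual = len(y_train) / (len(classes) * np.bincount(y_train))

print(dict(zip(classes.tolist(), weights.tolist())))
```

With these labels a mistake on the minority class costs nine times as much as a mistake on the majority class, which matches the 9:1 class ratio.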
10. Change the algorithm
While in every machine learning problem, it’s a good rule of thumb to try a variety of algorithms, it can be
especially beneficial with imbalanced datasets.
Decision trees frequently perform well on imbalanced data. In modern machine learning, tree ensembles
(Random Forests, Gradient Boosted Trees, etc.) almost always outperform single decision trees, so we’ll
jump right into those:
Tree-based algorithms work by learning a hierarchy of if/else questions. This can force both classes to be
addressed.
# load libraries
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
rfc = RandomForestClassifier()
# fit the predictor and target
rfc.fit(x_train, y_train)
# predict
rfc_predict = rfc.predict(x_test)
# check performance
print('ROCAUC score:', roc_auc_score(y_test, rfc_predict))
print('Accuracy score:', accuracy_score(y_test, rfc_predict))
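One design choice worth considering: ROC AUC measures ranking quality, so it is usually computed from predicted probabilities rather than hard class labels. The sketch below compares the two on synthetic data. The dataset, the stratified split, and the name `rfc_proba` are assumptions introduced for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (assumption): roughly 95% class 0 and 5% class 1
x, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=42
)

# stratify=y keeps the class ratio the same in the train and test splits
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, stratify=y, random_state=42
)

rfc = RandomForestClassifier(n_estimators=100, random_state=42)
rfc.fit(x_train, y_train)

rfc_predict = rfc.predict(x_test)  # hard 0/1 labels
rfc_proba = rfc.predict_proba(x_test)[:, 1]  # estimated probability of class 1

auc_from_labels = roc_auc_score(y_test, rfc_predict)
auc_from_scores = roc_auc_score(y_test, rfc_proba)

print("ROCAUC from labels:", auc_from_labels)
print("ROCAUC from probabilities:", auc_from_scores)
```

Passing hard labels reduces the ROC curve to a single operating point, so the probability-based score generally gives a fuller picture of how well the model separates the classes.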
Advantages of random under-sampling
It can reduce run time and storage requirements by reducing the number of training data samples
when the training data set is huge.
Disadvantages of random under-sampling
It can discard potentially useful information which could be important for building the classifier.
The sample chosen by random under-sampling may be biased and not an accurate
representation of the population, which leads to inaccurate results on the actual test data set.
Disadvantages of random over-sampling
It increases the likelihood of overfitting since it replicates the minority class events.
Conclusion
To summarize, in this article, we have seen various techniques to handle class imbalance in a dataset.
There are many methods to try when dealing with imbalanced data. I hope this article was useful. If
so, please share and like it.
Benai Kumar, Aspiring Data Scientist by heart | Keen to learn and share knowledge