A Systematic Review On Imbalanced Data Challenges in Machine Learning: Applications and Solutions
In machine learning, data imbalance poses challenges for data analytics in almost all areas of
real-world research. Raw primary data often suffers from a skewed distribution of one class over
the other, as in computer vision, information security, marketing, and medical science. The goal
of this article is to present a comparative analysis of contemporary imbalanced-data analysis
techniques across the data pre-processing, algorithmic, and hybrid paradigms, and to compare
them with respect to different data distributions and their application areas.
CCS Concepts: • Computer systems organization → Embedded systems; Redundancy; Robotics; • Net-
works → Network reliability;
Additional Key Words and Phrases: Data imbalance, machine learning, data analysis, sampling
ACM Reference format:
Harsurinder Kaur, Husanbir Singh Pannu, and Avleen Kaur Malhi. 2019. A Systematic Review on Imbalanced
Data Challenges in Machine Learning: Applications and Solutions. ACM Comput. Surv. 52, 4, Article 79 (Au-
gust 2019), 36 pages.
https://doi.org/10.1145/3343440
Authors’ addresses: H. Kaur, H. S. Pannu, and A. K. Malhi, CSED, Thapar Institute of Engineering and Technology, Patiala,
India 147004; emails: [email protected], {hspannu, avleen}@thapar.edu.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee
provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and
the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored.
Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires
prior specific permission and/or a fee. Request permissions from [email protected].
© 2019 Association for Computing Machinery.
0360-0300/2019/08-ART79 $15.00
https://doi.org/10.1145/3343440
pre-processing stage, methods involve a number of re-sampling techniques, such as random over-/
under-sampling, or a combination of the two, with the aim of obtaining an approximately equal
count of samples in the classes (Abouelenien et al. 2013); a minimal sketch of both appears after
the list below. However, this technique only balances the training data set, while the learning
algorithm stays the same (Singh and Purohit 2015). Algorithm-centered approaches instead bias
the learner toward the minority class, for example by adjusting misclassification costs to balance
the classes. In this article, the following key points are analyzed:
• Prominent issues in imbalanced data classification
• Comparative study of methods used to approach imbalance data distributions
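As a minimal sketch of the two basic re-sampling operations (pure NumPy; the helper names are illustrative assumptions, not from any referenced library):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_oversample(X, y, minority_label):
    """Duplicate randomly chosen minority samples until class counts match."""
    minority = np.where(y == minority_label)[0]
    majority = np.where(y != minority_label)[0]
    extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
    idx = np.concatenate([majority, minority, extra])
    return X[idx], y[idx]

def random_undersample(X, y, minority_label):
    """Keep all minority samples and an equally sized random majority subset."""
    minority = np.where(y == minority_label)[0]
    majority = np.where(y != minority_label)[0]
    keep = rng.choice(majority, size=len(minority), replace=False)
    idx = np.concatenate([minority, keep])
    return X[idx], y[idx]
```

Both helpers return a class-balanced copy of the training set; the learning algorithm itself is unchanged.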
domains, challenges that arise, and evaluation metrics based on benchmark datasets. They have
also presented a comparative analysis of the various methods and learning algorithms used to
tackle the issue of imbalanced data distributions. Haixiang et al. (2017) presented a category-wise
study of all applicable areas of the imbalanced data problem. Sun et al. (2009) reviewed imbalanced
data classification with a taxonomy of all application areas and discussed the open issues.
Longadge and Dongre (2013) compared the available methods and suggested that hybrid methods
can give better results. State-of-the-art solutions have been discussed in He and Garcia (2009).
Issues related to multi-class classification for imbalanced data, together with methods to tackle
and minimize them, are discussed in Sahare and Gupta (2012). Kotsiantis et al. (2006) suggested
that resampling techniques and hybrid methods can do quite well compared to other methods.
Galar et al. (2012) compared various ensemble learning algorithms and concluded that the
RUSBoost algorithm offers good performance and is the least complex among them. Anwar
et al. (2014) proposed a complexity measure relating a classifier's performance to data imbalance
on the basis of the k-nearest-neighbour learning algorithm. Chawla (2009) presented SMOTE and
its combination with other learning algorithms. Bekkar and Alitouche (2013) offered a relative
study of methods in terms of their advantages and disadvantages. Table 1 presents a comparison
of existing surveys on imbalanced data classification.
2 REVIEW TECHNIQUE
The categorization used in this article follows the literature review methodology of Yanmin
Sun et al. (Sun et al. 2009). The phases used in this survey are: generating a review technique,
designing a comprehensive and detailed study, comparing techniques, analyzing the comparative
results, and investigating open issues. The review technique employed in this survey is presented
in Figure 4.
Table 1. Comparison of Existing Surveys on Imbalanced Data Classification. Each survey is
compared on whether it defines the classification problem and covers the nature of the problem,
the challenges faced, the learning algorithms, the evaluation metrics, and the domain areas.

• A review on imbalanced classification (Sun et al. 2009)
• A review of class imbalance problem in data mining (Longadge and Dongre 2013)
• A review on handling imbalanced datasets (Kotsiantis et al. 2006)
• A survey on learning from imbalanced data (He and Garcia 2009)
• A review on ensembles for class imbalance (Galar et al. 2012)
• Overview of imbalanced data (Chawla 2009)
• A review on imbalanced data learning approaches (Bekkar and Alitouche 2013)
• A review of multi-class classification for imbalanced data (Sahare and Gupta 2012)
• A review of methods and applications for imbalanced data (Haixiang et al. 2017)
• Classification problems with unbalanced data (Anwar et al. 2014)
• Our Survey
3 APPLICATION DOMAINS
The problem of imbalanced data classification emerges as a major issue in many real-world appli-
cations, thus reducing the predictive performance of the model.
application areas, like prediction of emergency events, prediction of pollution, and so on. Figure 7
represents the application areas of imbalanced data classification. Table 3 describes the various
application areas along with detailed applications and related articles.
(1) Computer Vision: A computer vision system retrieves useful information from images; one
tries to associate high-dimensional image features with a structured labeling of objects in
the image. Imbalance occurs when a large proportion of the training images does not
contain the object of interest (negative images) compared to those that do (positive
images), which can result in misclassification of positive objects. In Gao et al. (2014), an
Enhanced and Hierarchical Structure (EHS) method is proposed for the imbalance of
positive and negative classes in a massive video dataset. It outperforms the most common
approaches, such as under-sampling and over-sampling.
(2) Information Security: Over the past few years, machine-learning algorithms have been
used successfully to protect sensitive information. Data imbalance emerges when deciding
the decision boundary for anomaly classification. In Nepal and Pathan (2014), a
comparative study of the techniques involved in security, trust, and privacy issues in
cloud systems has been presented.
(3) Fraud Detection: Frauds such as credit card fraud, cheque fraud, and so on can be an
expensive problem for an individual or an organization. Every year, billions are lost
due to the limitations of machine-learning algorithms when dealing with massively
imbalanced class distributions, among other reasons. This poses a challenge in
Fig. 5. Imbalanced classification publishing trend. Fig. 6. Top journal publications count in the
area of imbalanced data learning during the past decade.
distinguishing fraudulent actors from genuine users. The number of legitimate users far
exceeds the number of fraudulent ones. The ContrastMiner algorithm effectively
differentiates between genuine and fraudulent users in online banking (Wei et al.
2013a).
(4) Medical Science: In hospitals, a huge amount of information about patients and their
medical history is stored in large databases. In medical diagnosis, it is critical to
differentiate between positive (unhealthy) and negative (healthy) patients, because
patients with a disease are usually rare compared to the normal, healthy population.
This poses a critical challenge for classification between the unhealthy and healthy
populations.
(5) Network Intrusion Detection: With growing demand, network-based computer systems
play a significant role in reducing human effort. However, attacks on networks and
computer machines have been growing simultaneously. For example, Elbasiony et al.
(2013) and Thomas (2013) used different techniques to tackle the problem of imbalance
in network traffic.
(6) Image Processing: Image processing is a method of conducting operations on images
or extracting useful information from them. Here, the issue of data imbalance occurs
when distinguishing the class distributions between novel and misclassified features
(Hodge and Austin 2004).
(7) Cloud Computing: The open and distributed cloud environment is particularly attractive
to intruders. Detecting anomalies in the cloud environment is a difficult task due to
imbalanced classification; the same issue arises in failure detection.
(8) Data Mining: In Longadge and Dongre (2013), a comparative study of techniques to tackle
imbalanced class distribution in data mining has been discussed.
(9) Text Classification: Text classification is an efficient supervised machine-learning
method that enables various applications such as spam filtering, sentiment analysis, and
so on. For instance, in Sarker and Gonzalez (2015), NLP and ML techniques have been
used to automatically detect adverse drug reactions through multi-corpus training on
social media text containing medical information.
(10) Direct Marketing: Direct marketing in today's digital world has taken over the
traditional methods. However, when dealing with imbalanced classes, predictive models
of consumer behaviour and responses often suffer. A study of the advantage of using
the SVM algorithm to overcome the issue of imbalanced classes in consumer responses
is presented in Kim et al. (2013).
(11) Bioinformatics: Imbalanced data is a major issue in bioinformatics, including protein
sub-cellular prediction (Wan et al. 2017), gene prediction (Wang et al. 2015; Batuwita
and Palade 2009), protein classification (Song et al. 2014; Zhao et al. 2008), promoter
prediction (Zeng et al. 2009), and so on. Classification algorithms are sensitive to
imbalanced data, which produces sub-optimal classification results. The goal of
employing imbalanced learning methods is to raise sensitivity with as little loss of
specificity as possible.
Wan et al. (2017) have explored an efficient prediction algorithm for sub-cellular localization
of proteins. It performs multi-label classification using an ensemble called HPSLPred and is an
ideal choice for handling imbalanced data in multi-label classification. It outperforms other
state-of-the-art algorithms while yielding a 75.89% average precision value (APValue). Major
contributions include the design of a 350D comprehensive feature model, a self-dependent
dimension selection, and the HPSLPred ensemble model for optimal performance.
Song et al. (2014) have classified DNA-binding proteins in imbalanced data using an ensemble
approach, nDNA-Prot. To describe protein structure, 188-dimensional features are extracted and
fed into an ensemble called imDC for DNA-binding protein classification. The selected features
possess maximum relevance and minimum redundancy, yielding an accuracy of 95.8% and an
AUC of 0.986 under cross-validation. Testing accuracy for the proposed nDNA-Prot was found
to be 86%, outperforming both DNA-Prot (68%) and iDNA-Prot (76%). Another technique, by
Wang et al. (2015), uses ensemble learning with microRNA data and UCI datasets. The proposed
method has been compared against LibID, BalanceCascade, AsymBoost, UnderSampl, HSampl,
and AdaBoost. The miRNA data has a 193:8494 ratio of positive to negative examples, and 30
positives with 1,000 negatives were used for testing. Sensitivity and specificity for ImDC, LibID,
and Triplet-SVM were found to be {0.86, 0.83, 0.93} and {0.93, 0.92, 0.88}, respectively.
For the application area of computer vision, Shyu et al. (2008) and Gao et al. (2014) depict class
imbalance in semantic analysis of massive video datasets. Celebi et al. (2007) addressed the issue
of imbalanced data when classifying dermoscopy images. In fraud detection, Zhang et al. (2008)
and Wei et al. (2013b) discussed the detection of online banking fraud, where imbalanced data
arises as a major issue. In credit card fraud detection as well, where identifying the anomalous
person responsible for fraud is an important task, imbalanced data diminishes the accuracy of
classifiers (Fu et al. 2016; Kulkarni and Ade 2016). Zakaryazad and Duman (2016), Mardani and
Shahriari (2013), and Sahin et al. (2013) proposed various approaches for the detection of credit
card fraud. In financial applications, where predicting one's eligibility to pay a loan back in full
is a most valuable task for banks and private organizations, imbalanced data poses a threat
(Abeysinghe et al. 2016; Sanz et al. 2015). Fuzzy rules have been employed to achieve a better
understanding of anomaly prediction in the imbalanced data classification model of Sanz et al.
(2015). It does not use any pre-processing or sampling and thus avoids introducing noise. For
testing, 11 real-world financial datasets were analyzed, outperforming SMOTE-based
oversampling, the C4.5 decision tree, and type-1 fuzzy models. Abeysinghe et al. (2016) proposed
techniques for detecting insurance fraud while dealing with imbalanced data. Improving the
performance of fraud detection in retail surveillance is also addressed by Pan et al. (2011).
Imbalanced data in insurance fraud has been explored in Hassan and Abraham (2016), which
introduces an ensemble of ANN, SVM, and decision trees used with and without replacement
techniques for the under-sampled class. As a claim of originality, it chooses the best among the
undersampling partitionings; DT is the winner in the empirical analysis.
In medical science and diabetes, Yu and Ni (2014) propose an improved random subspace
method and a bagging ensemble to derive feature subspaces, which keeps a balance between
diversity and
accuracy of the classifiers. SVM is the base classifier, and the proposed ensemble approach has
been evaluated on all types of performance metrics. Vo and Won (2007) discuss the classification
bias induced by imbalanced data and propose an extended regularized least-squares method, in
which errors are penalized with different weights determined by specified rules for each sample.
For cancer-related science, Krawczyk et al. (2016) study the EUSBoost ensemble, which applies
the boosting idea to under-sample evolution for each base classifier; a level-set active-contours
method yields effective feature extraction for better classification of breast cancer symptoms
in a clinical decision support system. In Yang et al. (2016), tumor tissue analysis has been
performed on various gene samples. A one-versus-all multi-classification model is used, which
divides the problem into multiple binary classification problems; to deal with unbalanced data,
it uses balanced sampling and feature selection. A GentleBoost ensemble with a cost-sensitive
classifier (Can-CSC-GBE) has been studied in Ali et al. (2016) to detect breast cancer using protein
amino acid features, with a comparative empirical analysis using cost-sensitive learning and an
ensemble of AdaBoostM1 and bagging. For hepatitis virus (HBV and HCV) prediction, one feature
selection method and three balancing methods have been studied in Richardson and Lidbury
(2017) using SVMs on pathological cases; random forests are used for feature selection, and data
from ACT Pathology, Canberra, Australia, comprising records of 18,645 patients, were used in
the experiments. In Yap et al. (2014), cardiac surgery classification is discussed using
undersampling, oversampling, boosting, and bagging techniques for binary classification.
CHAID, C5, and CART classifiers have been used, and sensitivity and precision were reported
to work well with decision trees; moreover, in this study bagging and boosting had no
improvement effect on DT performance. Cytogenetic domain classification has been studied in
Lerner et al. (2007) using hierarchical decomposition followed by up-sampling of minority classes
and dimensionality reduction, so that each hierarchical level tackles a smaller problem with
approximately balanced data classes. A multilayer perceptron NN and naive Bayes were
employed, showing that the small size of the data was a greater obstacle than its imbalance.
Another, semi-supervised, technique is studied in Herndon and Caragea (2016), using both labeled
and unlabeled data in a domain-adaptation setting. Two logistic-regression-based classifiers have
been used for splice-site forecasting in gene prediction, achieving precision-recall between
50.83% and 82.61%.
In Peiravian and Zhu (2013), malicious Android apps are classified using API calls and
permissions. Shabtai et al. (2012) discuss malicious code classification based on opcode patterns,
used as anti-virus signatures over more than 30,000 files. Nepal and Pathan (2014) have discussed
cloud systems from QoS premises while covering security fundamentals and contemporary
technology. Security involves the detection of anomalies and intrusions within a
majority-dominated negative class. In Song et al. (2010), skewed, concept-drifting data streams
have been studied in the context of cloud security using a one-class classifier in an ensemble
setting together with k-means. Genetic programming and an incremental ensemble have been
used to detect cyber security drifts in Folino et al. (2016).
Taneja et al. (2015) have studied advertisement fraud in Internet mobile services. Feature
selection was performed using recursive feature elimination, and a Hellinger Distance Decision
Tree was used for classification, achieving 64.04% accuracy. Fake escrow websites have been
studied in Abbasi and Chen (2009) through fraud cues extracted from webpages; SVM, ANN,
DT, naive Bayes, and PCA were compared in an empirical analysis on a test bed of 90,000 pages
across 410 websites. In Zhong et al. (2013), a concept-adapting very fast decision tree (CVFDT)
has been used
for P2P application identification on large-volume data streams with communities leaving and
joining. Non-P2P data outnumbers P2P data by a big margin, so CVFDT is combined with a
re-sampling technique to monitor the Internet traffic. Imbalanced data learning on User-to-Root
intrusions has been studied in Engen et al. (2008) using MLO and DT along with an evolutionary
ANN. Kasai and Oike (2010) discuss an image pickup apparatus, which receives a larger dataset
with more light and a smaller dataset with less light, to perform image processing of various
types. Another visual information retrieval approach for imbalanced datasets has been studied
in Chang et al. (2003), combining active learning, quasi-bagging, class-boundary alignment,
adaptive dimensionality reduction, and recursive subspace co-training.
Cloud computing infrastructure anomaly detection has been studied in Pannu et al. (2012) and
Fu et al. (2012) using an adaptive one-class SVM and both one- and two-class SVMs, respectively.
Credit card fraud mining has been studied in Bhattacharyya et al. (2011) using SVM, RF, and
logistic regression. Longadge and Dongre (2013) have surveyed imbalanced data mining from
the perspective of feature selection, data pre-processing, and algorithmic approaches. Dua and
Du (2016) is a cyber-security-oriented data-mining reference for imbalanced data and anomaly
detection.
In text classification, imbalanced data has been studied in Liu et al. (2009) using a simple
probability-based term-weighting scheme and information ratios for the minor class. Wang et al.
(2013) explained text sentiment classification using BRC, under-sampling the higher-density
regions; SVM was used, and precision and recall were reported. Duman et al. (2012) reviewed
the database-marketing classification problem of a bank using approaches such as CHAID, ANN,
and logistic regression. Zakaryazad and Duman (2016) have discussed direct marketing and fraud
detection using a profit-driven, penalty-function-based ANN. Spam review detection has been
studied in Al Najada and Zhu (2014) for imbalanced data using a bagging-based approach called
iSRD. A good survey of review spam detection can be found in Crawford et al. (2015). Burez and
Van den Poel (2009) have studied customer churn prediction under class imbalance using
sampling and modeling techniques such as weighted RF and gradient boosting.
The literature survey is summarized in Table 3, organized by application area.
4.1.1 Sampling Methods. Sampling is an easy and popular approach to balance the class
distributions of the training data. The original data space is balanced by employing over-sampling
or under-sampling; the process is repeated until balanced classes are obtained. It works by adding
or removing samples from the data space to diminish the bias of imbalanced data, thus changing
the size of the training data space. Cao and Zhai (2015) observed that sampling methods reduce
learning time and yield faster execution once balanced classes are obtained, and showed that
sampling methods are an effective alternative for supervised learning. Yap et al. (2014) showed
that sampling methods outperform bagging and boosting. The summary of sampling methods is
given in Table 3. There are various options for performing the sampling:
• Over-sampling: The basic idea of over-sampling is to increase the size of the minority
class to obtain balanced classes. In random over-sampling, randomly selected samples are
duplicated, so the class size increases through duplication; over-fitting is the main issue
that arises (Ganganwar 2012). In Chawla et al. (2002), Chawla proposed the Synthetic
Minority Over-sampling TEchnique (SMOTE), in which synthetic samples are produced
with the help of minority-class samples. That work also suggests that combining sampling
methods can be a good option for improving classifier performance when dealing with
imbalanced class distributions. Bach et al. (2017) present a comparative study of
under-sampling and over-sampling for osteoporosis diagnosis, where only 7.14% of cases
belong to the minority class; their focus was to identify the best outcome from
under-sampling and over-sampling combined with different classifiers, and they concluded
that SMOTE along with a Random Forest classifier achieved the highest performance. Yap
et al. (2014) compared sampling techniques, bagging, and boosting on an imbalanced
dataset and concluded that sampling techniques are better for imbalanced data using a
decision tree, whereas bagging and boosting did not improve decision tree performance
on their dataset. Zhang and Li (2014) proposed a random-walk over-sampling approach
that generates samples to increase the number of samples in the minority class.
Ramyachitra and Manikandan (2014) reviewed imbalanced dataset classification and
solutions to mitigate it. Moreo et al. (2016) proposed an over-sampling method for
imbalanced text classification in which a probabilistic function is assigned to each
minority-class document in the training set. Zheng et al. (2016) proposed SNOCC, a
technique that overcomes limitations of SMOTE by creating new samples and ensuring
that the generated samples find new nearest neighbours; it outperforms SMOTE and other
methods. An ensemble-based method combining SMOTE with boosting has been used for
handling imbalanced PubChem BioAssay data, outperforming the combination of Random
Forest with SMOTE on the basis of sensitivity and G-mean (Hao et al. 2014). Figure 11
depicts the general idea of over-sampling the minority class to balance the data; a NumPy
sketch of SMOTE-style interpolation follows this list.
• Under-sampling: This pre-processing method draws a random set of samples from the
majority class to balance the classes, ignoring the rest of the samples (Nguyen et al. 2012).
The size of the data space is measured to draw the desired class distribution ratio.
Under-sampling thus helps attain an equal number of class samples and makes the training
phase faster. However, the main issue is the possibility of losing informative instances
from the majority class when deleting instances. Galar et al. (2013) proposed random
undersampling together with a boosting algorithm to handle imbalanced data
distributions. Yu et al. (2013) proposed a heuristic under-sampling method, ACOSampling,
to address imbalanced data distributions in DNA microarray datasets; while
under-sampling, this method does not delete samples carrying useful information but
automatically extracts and saves them. In Krawczyk et al. (2016), the EUSBoost ensemble
is proposed, which applies the boosting idea to under-sample evolution for each base
classifier; a level-set active-contours method yields effective feature extraction for better
classification of breast cancer symptoms in a clinical decision support system. Another
method, in Kang et al. (2016), focuses on eliminating noisy minority-class samples, which
hinder performance in imbalanced data classification. Dai and Hua (2016) presented a
comparative analysis of various random under-sampling ensemble methods on medical
imbalanced datasets, where correctly predicting the true positive rate is difficult. In Peng
et al. (2016), a data gravitation classification model is used, which is efficient for supervised
learning methods handling imbalanced data through under-sampling and shows an
effective margin in sensitivity and specificity. Elhassan et al. (2016) combined Tomek
Links with random under-sampling and, after a comparative analysis of other methods
and algorithms, concluded that using Tomek Links in combination with SMOTE is the
better idea. Figure 12 depicts the general idea of under-sampling the majority class.
• Hybrid sampling: Hybrid sampling methods apply both re-sampling techniques to attain
balance in the data. Hybrid sampling techniques are proposed in Qian et al. (2014) and
Charte et al. (2015). Wang (2014) proposed combining under-sampling and over-sampling
to handle imbalanced data: to obtain a balanced training data space, under-sampling first
deletes instances that carry no useful information, and over-sampling then replicates
existing instances, reducing the chances of losing informative instances. Adding a large
number of synthetic samples to the training space increased classification performance.
SMOTE, which generates synthetic examples in the minority class as an alternative to
duplicating minority-class samples, is a good method for tackling imbalanced data; as
originally proposed it falls into the hybrid category, because it combines under-sampling
with over-sampling rather than relying on under-sampling alone. Figure 13 illustrates the
SMOTE method.
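As a minimal sketch of SMOTE-style interpolation (pure NumPy plus scikit-learn nearest neighbours; the function name and defaults are illustrative assumptions, not the reference implementation):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote(X_min, n_synthetic, k=5, seed=0):
    """Create synthetic minority samples by interpolating each randomly
    chosen minority sample with one of its k nearest minority neighbours.
    Assumes at least k + 1 minority samples are available."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    # column 0 is each point itself, so keep only the k true neighbours
    neighbours = nn.kneighbors(X_min, return_distance=False)[:, 1:]
    base = rng.integers(0, len(X_min), size=n_synthetic)
    picks = neighbours[base, rng.integers(0, k, size=n_synthetic)]
    gap = rng.random((n_synthetic, 1))  # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[picks] - X_min[base])
```

Concatenating these synthetic points with the original data, optionally followed by random under-sampling of the majority class, reproduces the hybrid over-/under-sampling scheme described above.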
Kubat et al. (1997) proposed the one-sided selection method, in which Tomek Links are used to
reject noisy and unreliable examples from the majority class, under-sampling it efficiently; CNN
is then used to delete samples that are distant from the decision boundary. The method thus
preserves informative samples while under-sampling the majority class.
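A minimal sketch of detecting Tomek links (my own illustration for binary labels, not Kubat et al.'s original code):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def tomek_majority_indices(X, y, majority_label):
    """Indices of majority samples that form a Tomek link: a pair of mutual
    nearest neighbours with opposite class labels. One-sided selection
    discards the majority member of each such pair."""
    nearest = (NearestNeighbors(n_neighbors=2).fit(X)
               .kneighbors(X, return_distance=False)[:, 1])  # skip self
    return np.array([i for i, j in enumerate(nearest)
                     if nearest[j] == i            # mutual nearest neighbours
                     and y[i] != y[j]              # opposite classes
                     and y[i] == majority_label], dtype=int)
```

Removing the returned indices deletes exactly the borderline majority samples that Tomek links identify as noisy or unreliable.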
In He et al. (2008), the Adaptive Synthetic sampling approach (ADASYN) is proposed, which
assigns different weights to minority samples according to their level of learning difficulty and
generates synthetic data accordingly (see Table 4); a sketch of the weighting step follows the
list below. It proves to be an efficient way of handling imbalanced data in the following ways:
Fig. 13. Illustration of synthetic examples in SMOTE. Fig. 14. Example of Feature Selection
Method.
(1) It reduces the bias of imbalanced data, where the hyperplane otherwise leans toward
the majority class.
(2) It generates the classification hyperplane efficiently, automatically leaning toward the
instances that are difficult to learn.
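A minimal sketch of the ADASYN density weighting (an illustrative reconstruction from the description above, not He et al.'s reference code):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def adasyn_weights(X, y, minority_label, k=5):
    """Weight each minority sample by the fraction of majority samples among
    its k nearest neighbours: harder-to-learn samples get larger weights
    and therefore more synthetic offspring."""
    X_min = X[y == minority_label]
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    neighbours = nn.kneighbors(X_min, return_distance=False)[:, 1:]  # drop self
    r = (y[neighbours] != minority_label).mean(axis=1)
    return r / max(r.sum(), 1e-12)  # normalized weights, sum to 1
```

Multiplying these weights by the total number of synthetic samples required gives the per-sample generation counts; the interpolation itself can reuse the SMOTE sketch above.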
Over-sampling tends to increase the size of the data space, which can cause over-fitting and
lengthens the training phase. Under-sampling, in contrast, removes a random set of samples from
the majority class, with the attendant possibility of losing informative samples.
4.1.2 Feature Selection and Extraction. Selecting a subset of relevant features or attributes from
high-dimensional data sets helps to improve classifier performance. This approach has been
applied to imbalanced data classification (Jamali et al. 2012; Maldonado et al. 2014; Van Hulse
et al. 2009). Feature selection is generally achieved by three kinds of methods: filter, wrapper,
and embedded; a comparative analysis of their advantages and disadvantages is given in Saeys
et al. (2007). The ultimate aim of feature selection is to select the best feature set from the whole
dataset to obtain better classifier performance. Feature extraction is also a dimensionality-
reduction technique, but it differs from feature selection in that it generates new features from
the primary ones; feature extraction methods include Principal Component Analysis (PCA) and
Singular Value Decomposition (SVD). Feature selection and extraction are applicable in various
real-world applications such as network traffic analysis (Liu et al. 2015), software defect
prediction (Khoshgoftaar et al. 2014), text categorization (Yang et al. 2014), microarray data
classification (Bolón-Canedo et al. 2015), and medical diagnosis (Zhang et al. 2014). An example
of selecting the best feature subset from an imbalanced dataset is shown in Figure 14.
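A minimal scikit-learn sketch contrasting the two ideas on a hypothetical imbalanced toy set (mutual information as the filter criterion and PCA for extraction are illustrative choices):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# toy imbalanced data: ~95% negatives, 30 features, few of them informative
X, y = make_classification(n_samples=2000, n_features=30, n_informative=5,
                           weights=[0.95], random_state=0)

# filter-style selection: keep the 5 original features with the highest
# mutual information with the class label
X_selected = SelectKBest(mutual_info_classif, k=5).fit_transform(X, y)

# extraction: derive 5 brand-new features as principal components
X_extracted = PCA(n_components=5).fit_transform(X)

print(X_selected.shape, X_extracted.shape)  # (2000, 5) (2000, 5)
```

Both reduce 30 inputs to 5, but only extraction creates genuinely new features.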
Fig. 15. Classification for supervised and semi-supervised techniques. Fig. 16. General idea of
bagging.
Yap et al. (2014) presented a comparative study of sampling techniques, bagging, and boosting
on imbalanced datasets.
• Cost-Sensitive SVM: The basic goal of an SVM is to classify data by building a linear
decision boundary (hyper-plane) that splits the data points according to their class. An
SVM performs best on balanced data, but on highly imbalanced data its hyperplane
becomes biased toward the majority class. To handle this problem, a method has been
proposed in which each class's classification errors are assigned a different penalty cost,
yielding a class-specific error-favoring SVM algorithm (Lee et al. 2016). The model
formulation is as follows:
follows:
1
arд min w 2 + P λi + N λj (1)
(w,b ) 2
i:yi >0 j:y j <0
s.t . w T x i + b .yi ≥ 1 − λi , i ∈ N ,
where λi is the soft margin slack variable, P, N are penalties associated with positive and
negative mis-classifications, (x i , yi ) is sample and its associated label.
• One-Class SVM: Here the SVM is trained on a data space containing only one (normal)
class, hence the name. It was first presented by Scholkopf et al. for estimating a
high-dimensional distribution. In Raskutti and Kowalczyk (2004), the authors proposed
solutions for handling highly unequal distributions and suggested that one-class learning,
from the class containing only positive samples, yields excellent performance on highly
imbalanced data.
• Weighted SVM: Huang and Du (2005) discussed the disadvantages of classification using
the standard support vector machine: when dealing with imbalanced data, the standard
SVM generates a linear decision boundary (hyper-plane) biased toward the majority class.
In a weighted SVM, the classification error on the minority class improves; moreover,
weighting helps decrease the influence of outliers in binary data distributions compared
to the standard SVM, giving better classification performance. The following equation
states the weighted SVM objective function with weights $\lambda_i$:
weights λi :
1
min w 2 + P λi ξ i (2)
w,b, ξ 2
i
s.t. yi (w.ϕ (x i ) + b) ≥ 1 − ξ i
ξ i ≥ 0, i ∈ N .
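A minimal scikit-learn sketch of the three variants above (class weights stand in for the penalties P and N of Eq. (1), and per-sample weights for the $\lambda_i$ of Eq. (2); the data and weight values are illustrative assumptions):

```python
import numpy as np
from sklearn.svm import SVC, OneClassSVM

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = (np.linalg.norm(X, axis=1) > 2.2).astype(int)  # rare positive class

# cost-sensitive SVM: heavier penalty on minority (positive) errors, Eq. (1)
cs_svm = SVC(kernel="linear", class_weight={0: 1.0, 1: 10.0}).fit(X, y)

# one-class SVM: trained on the normal (majority) class only
oc_svm = OneClassSVM(nu=0.05).fit(X[y == 0])

# weighted SVM: per-sample weights scale each slack penalty, Eq. (2)
sample_weight = np.where(y == 1, 10.0, 1.0)
w_svm = SVC(kernel="rbf").fit(X, y, sample_weight=sample_weight)
```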
4.3.1 Challenges in Algorithm-Centered Approaches. A common problem in all classification
settings is dataset shift, in which training and testing data follow different distributions.
(1) The minority class is quite sensitive to classification errors, owing to the small number
of examples available in highly imbalanced domains. In the extreme case, a single
misclassified example can cause a significant performance drop, so potential approaches
are needed to handle such misclassified examples.
(2) The foremost challenge is natural dataset shift (Cieslak and Chawla 2009), where the
data of interest exhibits a relevant degree of shift, resulting in a drop in performance.
Techniques might be developed for discovering and measuring the presence of dataset
shift, but adapting them to focus on the minority class is really challenging; a heuristic
check is sketched after this list.
(3) Furthermore, no articles in the literature have focused on designing imbalanced
classification algorithms that work under dataset shift conditions; this is currently only
possible either by employing a pre-processing technique (Moreno-Torres et al. 2013) or
by using an ad hoc algorithm (Bickel et al. 2009).
(4) Another problem is induced dataset shift in imbalanced data classification algorithms.
State-of-the-art techniques mostly induce a potential source of shift artificially in the
machine-learning process by using stratified cross-validation techniques. Hence, more
subtle validation techniques are required so that artificially induced dataset shift can be
avoided.
(5) Output adjustment may be done by overdriving the classifier toward the minority class,
which increases the error on the majority class. Output compensation therefore does
not prove useful, as the majority of the objects will originate from the majority class.
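As a heuristic for item (2), one common check (my illustration; not a method proposed in the surveyed papers) is a domain-classifier two-sample test: if a classifier can tell training from testing samples apart, the two sets likely follow different distributions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def shift_score(X_train, X_test, seed=0):
    """Cross-validated AUC of a classifier separating train from test rows.
    A score near 0.5 suggests matching distributions; well above 0.5
    suggests dataset shift."""
    X = np.vstack([X_train, X_test])
    domain = np.r_[np.zeros(len(X_train)), np.ones(len(X_test))]
    clf = RandomForestClassifier(n_estimators=100, random_state=seed)
    return cross_val_score(clf, X, domain, cv=5, scoring="roc_auc").mean()
```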
Table 6 explains the challenges in the application of data level and algorithmic level approaches.
Table 11 lays out the relative comparison among various data and algorithmic techniques along
with data sets used, performance metrics and prominent findings.
specifically produced for highly imbalanced datasets. The main idea is to delete noisy and
unreliable samples and to extract useful, consistent samples to achieve good classification
accuracy; the proposed technique outperforms other methods in comparison. Although
re-sampling techniques achieve success in imbalanced data classification, a hybrid method
combining sampling techniques with bagging is proposed by Lu et al. (2016); experiments on 26
benchmark datasets showed that the sampling methods outperform bagging. A hybrid
re-sampling method is introduced in Cao and Zhai (2015) that deals with binary-class imbalanced
data: first, SMOTE is used to raise the count of minority-class instances, and then One-Sided
Selection (OSS) rejects instances without useful information, achieving feasible results.
Abouelenien et al. (2013) introduce a cluster-based sampling and ensemble method: to tackle
large imbalanced datasets, clusters are generated from the whole training space, and
representative data is selected from the generated clusters as training instances, yielding
improvements in evaluation measures, i.e., accuracy and sensitivity (a sketch of cluster-based
under-sampling follows below). An SVM modeling algorithm in which the training space is
enlarged by generating synthetic instances with SMOTE, and the SVM is then built on the
over-sampled data set, is proposed by Tang et al. (2009). The Majority Weighted Minority
Oversampling Technique (MWMOTE) tackles imbalanced data by filtering the useful instances
from the minority class and then assigning weights with respect to their Euclidean distance
(Barua et al. 2014). Figure 17 represents possible combinations of sampling methods with learning
algorithms for handling imbalanced datasets; in this scenario, the sampling/learning-algorithm
combination can be chosen according to the classification problem.
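A minimal sketch of the cluster-based under-sampling idea (an illustration of the general approach, assuming k-means; not Abouelenien et al.'s exact procedure):

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_undersample(X_maj, n_keep, seed=0):
    """Cluster the majority class and keep the real sample closest to each
    centroid, so the reduced set still covers the whole data space."""
    km = KMeans(n_clusters=n_keep, n_init=10, random_state=seed).fit(X_maj)
    dist = np.linalg.norm(X_maj - km.cluster_centers_[km.labels_], axis=1)
    keep = [np.where(km.labels_ == c)[0][np.argmin(dist[km.labels_ == c])]
            for c in range(n_keep)]
    return X_maj[np.array(keep)]
```

Training then proceeds on the kept representatives together with all minority samples.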
5 LEARNING ALGORITHMS
In this section, some of the classifier learning algorithms are discussed. Table 7 describes six learn-
ing algorithms with their learning strategy and limitations while dealing with data imbalance and
the related literature.
Lagrangian twin support vector machine for imbalanced data classification using different train-
ing points. It is tested and compared with other models on real as well as synthetic datasets.
et al. (2015), who adduced an evolutionary fuzzy classification method for modelling and
prediction in real-world financial applications, which achieves good prediction accuracies on 11
financial application datasets. Park and Ghosh (2014) introduced two decision-tree-based
ensemble approaches for imbalanced data classification that exploit the characteristics of
α-divergence; the effectiveness of the proposed ensembles is shown by experiments on
multi-class imbalanced datasets.
6 PERFORMANCE METRICS
This section describes the various performance metrics that can be used in evaluation of the im-
balanced data techniques.
Metric | Formula
Accuracy | (TP + TN) / (TP + FP + FN + TN)
Sensitivity | TP / (TP + FN)
Specificity | TN / (TN + FP)
Precision | TP / (TP + FP)
Recall | TP / (TP + FN)
F-measure | (2 × Precision × Recall) / (Precision + Recall)
G-mean | √(Sensitivity × Specificity)
AUC | (1 + TPR − FPR) / 2
For two-class, i.e., binary, classification, each predicted example belongs to one of the four
possible outcomes described in Table 8:
• TP (True Positive): the actually positive examples that are correctly predicted as positive.
• TN (True Negative): the actually negative examples that are correctly predicted as negative.
• FP (False Positive): the actually negative examples that are incorrectly predicted as positive.
• FN (False Negative): the actually positive examples that are incorrectly predicted as
negative.
• Accuracy: It acts as the main evaluation parameter when dealing with binary decision
problems for any prediction model. For any classification model, it measures the number
of correctly predicted examples among all examples: simply, the ratio of correctly predicted
examples to the total number of examples. Du et al. (2017) proposed that the geometric
mean of the accuracies on the minority and majority classes is a more effective measure
than plain accuracy in the case of highly imbalanced data.
• Sensitivity: the proportion of positive examples that are correctly predicted by a model. It
is also called the True Positive Rate (TPR) and is equivalent to another evaluation metric,
Recall. In medical diagnosis, sensitivity is the fraction of patients with the disease, i.e.,
positive examples, whose test result is also classified positive.
• Specificity: the proportion of negative examples that are correctly predicted by a model.
It is also called the True Negative Rate (TNR). In medical diagnosis, specificity is the
fraction of people without the disease, i.e., negative examples, whose test result is also
classified negative. Sensitivity and specificity are defined with the help of the confusion
matrix, and in some domains the two are used in combination to assess the predictive
performance of a classification model (Sammut and Webb 2011).
• Precision: the ratio of true positives (TP) to the total number of examples predicted
positive.
• Recall: the fraction of true positive examples among all examples that are actually
positive.
• F-measure: used to evaluate the accuracy of predictions in binary decision problems; it is
the harmonic mean of Precision and Recall.
• G-measure: evaluated as the square root of the product of sensitivity and specificity.
• AUC-ROC: The Receiver Operating Characteristic (ROC) curve visually depicts the
trade-off between accuracy on positive examples and error on negative examples, where
AUC stands for Area Under the Curve. The curve evaluates trade-offs between true
positives and false positives as the threshold of a predictive model varies. A detailed
introduction to the ROC curve is given in Fawcett (2006). AUC values range over [0, 1],
where 0 represents the worst performance and 1 the best; a model is classified as best if it
obtains a True Positive Rate of 1 and a False Positive Rate of 0. A comparative study of
Precision-Recall and ROC curves is given in Davis and Goadrich (2006). An example of
ROC curves for a given classification model is illustrated in Figure 19, and various AUC
curves based upon the classification of the data are depicted in Figure 20.
Fig. 19. Showcase of ROC/AUC curves for the given classification model.
Let TPR and FPR be the true positive and false positive rates. In contrast to accuracy, precision,
recall, and the F-measure are better suited to imbalanced data. If the cost of an FP is high, then
precision is an ideal measure: for example, in spam email detection, it is costly to lose an
important email marked as spam. Similarly, if the cost associated with an FN is higher, then
recall is a good measure: for example, an actual positive (patient diagnosis, fraud detection)
predicted as negative (normal) has dangerous consequences. The F1 score measures the balance
between precision and recall. The formulas are summarized in Table 8.
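A minimal sketch computing the metrics of Table 8 on a hypothetical toy prediction (the counts are invented to show how accuracy misleads on a 5% positive class):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0] * 95 + [1] * 5)                      # 5% positive class
y_pred = np.array([0] * 93 + [1] * 2 + [0] * 3 + [1] * 2)  # a weak classifier
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy    = (tp + tn) / (tp + tn + fp + fn)  # 0.95, yet misleading
sensitivity = tp / (tp + fn)                   # recall / TPR: 0.40
specificity = tn / (tn + fp)                   # TNR: ~0.979
precision   = tp / (tp + fp)                   # 0.50
f_measure   = 2 * precision * sensitivity / (precision + sensitivity)
g_mean      = (sensitivity * specificity) ** 0.5
print(accuracy, sensitivity, specificity, precision, f_measure, g_mean)
```

Accuracy stays at 0.95 even though the classifier finds only 2 of the 5 positives, while G-mean (≈0.63) and F-measure (≈0.44) expose the weakness.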
Zou et al. (2016) have exposed a limitation of ROC for comparative analysis by arguing for a
better threshold than 0.5 on the testing set. This AUC limitation was found in detecting protein
remote homology, and experiments were performed using an established single benchmark as
the threshold. The literature discusses an interesting example in which precision and recall below
0.5 may yield an AUC above 0.9. Figure 21 shows the positive association between F1 and AUC,
making the proposed model more credible and efficient.
Fig. 21. Positive association between F1 and AUC (Zou et al. 2016).
7 COMPARATIVE ANALYSIS
Table 10 illustrates the advantages and disadvantages of the data pre-processing, algorithmic,
and hybrid approaches. Table 11 gives a detailed comparison of the various classification
techniques proposed so far for both algorithm- and data-oriented approaches, using several
parameters: the approach used, the learning method, the data type, the performance metrics,
and the key findings of each proposal. The research articles are divided
Table 11. Comparison of imbalanced data classification techniques.

Article | Approach | Learning method | Data | Metrics used | Findings
(Du et al. 2017) | ALGO | ANN | High imbalance ratio | G-mean 98.41±2.81, Sensitivity 96.56, Specificity 97.59 | Consumes 1/10 of the original time required for training.
(Tang et al. 2009) | ALGO | GSVM-RU | 7 highly imbalanced data sets | G-mean 85.2, AUC-ROC 91.4, F-measure 66.5, AUC-PR 65.2 | Minimizes the chances of information loss in under-sampling; faster SVM prediction.
(Sowah et al. 2016) | ALGO | CUST | 16 datasets | AUC-ROC 0.805, G-mean 0.766 | Outperforms CBU, SMOTE, RUS, ROS, OSS.
(He et al. 2008) | DATA | ADASYN | 5 public datasets | Overall accuracy 0.9257, Precision 0.8067, Recall 0.9015, F-measure 0.8505, G-mean 0.9168 | Dynamic weight adjustment with respect to the data distribution.
(Raskutti and Kowalczyk 2004) | ALGO | Extreme rebalancing for SVM | AHR-DATA, Reuters Newswire | AUC-ROC 0.99 | When dealing with highly imbalanced data, one-class learning is better.
(Zhang et al. 2015) | DATA | Ensemble method | Songbo Tan's hotel reviews | F-measure 0.69, G-mean 0.764, Weighted Accuracy (WA) 0.8257 | Generates diverse base classifiers.
(Cao and Zhai 2015) | DATA | Hybrid resampling SVM | 5 UCI datasets | AUC-ROC 0.7968 | Improvement in AUC.
(Chawla et al. 2002) | DATA | SMOTE | 9 datasets | AUC-ROC 0.9560 | Combination of over- and under-sampling enhances performance.
(Lee et al. 2016) | Both | ML algorithms for imbalanced-data fault detection | Etching process data and chemical vapor deposition process data | AUC 0.91, G-mean 0.92, F-measure 0.69 | Excellent performance for any imbalance factor.
(Wang and Yao 2013) | Both | 5 learning algorithms | 10 SDP datasets from the public PROMISE repository | AUC 0.649, G-mean 0.762, Balance 0.711, Recall, PF 0.823 | Outperforms the original AdaBoost.NC on overall performance measures.
(Gao et al. 2014) | ALGO | Enhanced and Hierarchical Structure (EHS) | TRECVID video dataset | Features: GCM, Texture, SIFT and MoSIFT | Robust and stable while dealing with different features.
(Kim et al. 2013) | ALGO | SVM-RFM | 4 datasets from the Direct Marketing Education Foundation (DMEF) | Accuracy 0.976, Specificity 0.974, Sensitivity 0.238, Gain value 0.52 | SVM is efficient for high-dimensional datasets.
(Arar and Ayan 2015) | ALGO | Neural network with ABC algorithm | NASA MDP dataset | Accuracy 68.4, probability of false alarm 33.0, balance 71.8, AUC 0.79, Normalized Expected Cost of Misclassification (NECM) | Efficient combination of algorithms.
(Barua et al. 2014) | ALGO | MWMOTE | 20 real-world datasets | G-mean 0.7232, ROC 0.98834 | Does not perform well on Recall.
(Wan et al. 2017) | ALGO | BRkNN, HOMER, MLkNN, IBLR_ML, DMLkNN | UniProtKB | APValue = 85.89% | Multi-thread technique for dimension reduction; performs well only on human protein source.
(Song et al. 2014) | ALGO | Ensemble of 16 sorting algorithms | 44,996 total with 9,676 positives | Accuracy = 86%, validation AUC = 0.986 | Predicts DNA-binding protein sequences among all in the UniProtKB/Swiss-Prot database; 188d attributes selected using max-relevance min-redundancy.
(Wang et al. 2015) | ALGO | Ensemble of DT, RandFor, SVM, NB, k-NN | UCI data and miRNA (193:8494) | Sen = {0.86, 0.83, 0.93}, Spc = {0.93, 0.92, 0.88} for ImDC, LibID and Triplet-SVM | Time complexity and parameter tuning not considered.
based on the data-oriented or algorithm-oriented approach or any hybrid approach that employs
the features of both approaches.
8 OPEN ISSUES
This article has discussed the spectrum of data imbalance across eleven prominent
machine-learning application areas. Imbalanced data classification is the learning of a skewed
binary target class distribution in this big-data era. In addition to algorithm- and data-oriented
approaches, hybrid approaches along with ensembles are gaining popularity, and these must
consider the nature of the empirical dataset. There are often trade-offs between real-life
applicability, adaptivity, and computational efficiency. New directions can be found by exploring
distribution fractures between testing and training data, normalizing the dataset size of each
class, incorporating attribute-specific weight adjustment for the under-sampled class, and
handling overlap among classes and small disjuncts. The current status of the open problems in
algorithm- and data-level approaches is shown in Table 12.
For imbalanced data operations, however, scaling issues arise that hamper traditional
approaches. In Fernández et al. (2017), a MapReduce-based de facto technique is suggested for
big data, to recursively divide and solve the distribution. But only a few studies have been able
to contribute to big data imbalanced classification, owing to adaptation difficulties related to the
MapReduce programming paradigm. Data scarcity and the disjunctive nature of the data also
shape the programming solutions for imbalanced data classification (Krawczyk 2016; Pannu and
Kaur 2017). Below are some potential measures for developing solutions against data imbalance
in machine learning:
9 CONCLUSION
This article presents a comprehensive analysis of the learning challenges caused by imbalanced
data distributions, along with their characteristics, problems, and solutions. As imbalanced data
restricts the performance and accuracy of classifiers, various methods and techniques have been
proposed to overcome its negative effects. Modern data pre-processing and algorithmic
approaches have been compared and discussed along with their application domains; in various
proposed articles, these approaches have been tuned to obtain a generalized learning model with
validation of the desired results. Although sampling methods are the most popular and the
simplest to implement, real-world applications involve skewed data distributions, so hybrid
algorithmic approaches are also desirable. A detailed comparative study of the various
state-of-the-art methods has been given, discussing the various parameters of these studies. The
challenges in algorithm- as well as data-oriented approaches, along with open issues, have also
been discussed.
REFERENCES
Ahmed Abbasi and Hsinchun Chen. 2009. A comparison of fraud cues and classification methods for fake escrow website
detection. Info. Technol. Manage. 10, 2–3 (2009), 83–101.
Chirath Abeysinghe, Jianguo Li, and Jing He. 2016. A classifier hub for imbalanced financial data. In Proceedings of the
Australasian Database Conference. Springer, 476–479.
Mohamed Abouelenien, Xiaohui Yuan, Balathasan Giritharan, Jianguo Liu, and Shoujiang Tang. 2013. Cluster-based sam-
pling and ensemble for bleeding detection in capsule endoscopy videos. Amer. J. Sci. Eng. 2, 1 (2013), 24–32.
Hamzah Al Najada and Xingquan Zhu. 2014. iSRD: Spam review detection with imbalanced data distributions. In Proceed-
ings of the IEEE 15th International Conference on Information Reuse and Integration (IRI’14). IEEE, 553–560.
Safdar Ali, Abdul Majid, Syed Gibran Javed, and Mohsin Sattar. 2016. Can-CSC-GBE: Developing cost-sensitive classifier
with gentleboost ensemble for breast cancer classification using protein amino acids and imbalanced data. Comput. Biol.
Med. 73 (2016), 38–46.
Nafees Anwar, Geoff Jones, and Siva Ganesh. 2014. Measurement of data complexity for classification problems with
unbalanced data. Stat. Anal. Data Min.: ASA Data Sci. J. 7, 3 (2014), 194–211.
Ömer Faruk Arar and Kürşat Ayan. 2015. Software defect prediction using cost-sensitive neural network. Appl. Soft Comput.
33 (2015), 263–277.
Malgorzata Bach, Aleksandra Werner, J. Żywiec, and W. Pluskiewicz. 2017. The study of under-and over-sampling methods’
utility in analysis of highly imbalanced data on osteoporosis. Info. Sci. 384 (2017), 174–190.
Sukarna Barua, Md Monirul Islam, Xin Yao, and Kazuyuki Murase. 2014. MWMOTE–majority weighted minority oversam-
pling technique for imbalanced data set learning. IEEE Trans. Knowl. Data Eng. 26, 2 (2014), 405–425.
Rukshan Batuwita and Vasile Palade. 2009. microPred: Effective classification of pre-miRNAs for human miRNA gene
prediction. Bioinformatics 25, 8 (2009), 989–995.
Oscar Beijbom, Mohammad Saberian, David Kriegman, and Nuno Vasconcelos. 2014. Guess-averse loss functions for cost-
sensitive multiclass boosting. In Proceedings of the International Conference on Machine Learning. 586–594.
Mohamed Bekkar and Taklit Akrouf Alitouche. 2013. Imbalanced data learning approaches review. Int. J. Data Min. Knowl.
Manage. Process 3, 4 (2013), 15.
Sanket M. Bhandari and Krunal Patel. 2015. A review on using clustering and classification techniques to predict student
failure with high dimensional and imbalanced data.
Siddhartha Bhattacharyya, Sanjeev Jha, Kurian Tharakunnel, and J. Christopher Westland. 2011. Data mining for credit
card fraud: A comparative study. Decis. Supp. Syst. 50, 3 (2011), 602–613.
Steffen Bickel, Michael Brückner, and Tobias Scheffer. 2009. Discriminative learning under covariate shift. J. Mach. Learn.
Res. 10 (Sep. 2009), 2137–2155.
Verónica Bolón-Canedo, Noelia Sánchez-Maroño, and Amparo Alonso-Betanzos. 2015. Distributed feature selection: An
application to microarray data classification. Appl. Soft Comput. 30 (2015), 136–150.
Jonathan Burez and Dirk Van den Poel. 2009. Handling class imbalance in customer churn prediction. Expert Syst. Appl. 36,
3 (2009), 4626–4636.
Lu Cao and Yikui Zhai. 2015. Imbalanced data classification based on a hybrid resampling SVM method. In Proceedings of
the Ubiquitous Intelligence and Computing and IEEE 12th International Conference on Autonomic and Trusted Computing
and IEEE 15th International Conference on Scalable Computing and Communications and Its Associated Workshops (UIC-
ATC-ScalCom’15). IEEE, 1533–1536.
Peng Cao, Dazhe Zhao, and Osmar Zaiane. 2013. An optimized cost-sensitive SVM for imbalanced data learning. In Pro-
ceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 280–292.
M. Emre Celebi, Hassan A. Kingravi, Bakhtiyar Uddin, Hitoshi Iyatomi, Y. Alp Aslandogan, William V. Stoecker, and Randy
H. Moss. 2007. A methodological approach to the classification of dermoscopy images. Comput. Med. Imag. Graph. 31,
6 (2007), 362–373.
Xiaoyong Chai, Lin Deng, Qiang Yang, and Charles X. Ling. 2004. Test-cost sensitive naive Bayes classification. In Proceed-
ings of the 4th IEEE International Conference on Data Mining (ICDM’04). IEEE, 51–58.
Edward Y. Chang, Beitao Li, Gang Wu, and Kingshy Goh. 2003. Statistical learning for effective visual information retrieval.
In Proceedings of the International Conference on Image Processing (ICIP’03), Vol. 3. IEEE, III–609.
Francisco Charte, Antonio J. Rivera, María J. del Jesus, and Francisco Herrera. 2015. Addressing imbalance in multilabel
classification: Measures and random resampling algorithms. Neurocomputing 163 (2015), 3–16.
Nitesh V. Chawla. 2009. Data mining for imbalanced datasets: An overview. In Data Mining and Knowledge Discovery
Handbook. Springer, 875–886.
Nitesh V. Chawla, Kevin W. Bowyer, Lawrence O. Hall, and W. Philip Kegelmeyer. 2002. SMOTE: Synthetic minority over-
sampling technique. J. Artific. Intell. Res. 16 (2002), 321–357.
David A. Cieslak and Nitesh V. Chawla. 2009. A framework for monitoring classifiers’ performance: When and why failure
occurs? Knowl. Info. Syst. 18, 1 (2009), 83–108.
Michael Crawford, Taghi M. Khoshgoftaar, Joseph D. Prusa, Aaron N. Richter, and Hamzah Al Najada. 2015. Survey of
review spam detection using machine-learning techniques. J. Big Data 2, 1 (2015), 23.
Dong Dai and Shaowen Hua. 2016. Random under-sampling ensemble methods for highly imbalanced rare disease classifi-
cation. In Proceedings of the International Conference on Data Mining (DMIN’16). The Steering Committee of The World
Congress in Computer Science, Computer Engineering and Applied Computing (WorldComp), 54.
Jesse Davis and Mark Goadrich. 2006. The relationship between precision-recall and ROC curves. In Proceedings of the 23rd
International Conference on Machine Learning. ACM, 233–240.
Sauptik Dhar and Vladimir Cherkassky. 2015. Development and evaluation of cost-sensitive universum-SVM. IEEE Trans.
Cybernet. 45, 4 (2015), 806–818.
Jie Du and C. M. Vong. 2018. Online multi-label learning under dynamic changes in data distribution with labels. IEEE
Trans. Cybernet. (2018). In press.
Jie Du, Chi-Man Vong, Chi-Man Pun, Pak-Kin Wong, and Weng-Fai Ip. 2017. Post-boosting of classification boundary for
imbalanced data using geometric mean. Neural Netw. 96 (2017), 101–114.
Shihong Du, Fangli Zhang, and Xiuyuan Zhang. 2015. Semantic classification of urban buildings combining VHR image
and GIS data: An improved random forest approach. ISPRS J. Photogram. Remote Sens. 105 (2015), 107–119.
Sumeet Dua and Xian Du. 2016. Data Mining and Machine Learning in Cybersecurity. CRC Press.
Ekrem Duman, Yeliz Ekinci, and Aydın Tanrıverdi. 2012. Comparing alternative classifiers for database marketing: The case
of imbalanced datasets. Expert Syst. Appl. 39, 1 (2012), 48–53.
Reda M. Elbasiony, Elsayed A. Sallam, Tarek E. Eltobely, and Mahmoud M. Fahmy. 2013. A hybrid network intrusion
detection framework based on random forests and weighted k-means. Ain Shams Eng. J. 4, 4 (2013), 753–762.
T. Elhassan, M. Aljurf, F. Al-Mohanna, and M. Shoukri. 2016. Classification of imbalance data using Tomek link (T-link)
combined with random under-sampling (RUS) as a data reduction method. J. Info. Data Min. (2016).
Vegard Engen, Jonathan Vincent, and Keith Phalp. 2008. Enhancing network-based intrusion detection for imbalanced data.
Int. J. Knowl.-Based Intell. Eng. Syst. 12, 5–6 (2008), 357–367.
Tom Fawcett. 2006. An introduction to ROC analysis. Pattern Recogn. Lett. 27, 8 (2006), 861–874.
Alberto Fernández, Sara del Río, Nitesh V. Chawla, and Francisco Herrera. 2017. An insight into imbalanced big data
classification: Outcomes and challenges. Complex Intell. Syst. (2017), 1–16.
Gianluigi Folino, Francesco Sergio Pisani, and Pietro Sabatino. 2016. An incremental ensemble evolved by using genetic
programming to efficiently detect drifts in cyber security datasets. In Proceedings of the Conference on Genetic and
Evolutionary Computation Conference Companion. ACM, 1103–1110.
Kang Fu, Dawei Cheng, Yi Tu, and Liqing Zhang. 2016. Credit card fraud detection using convolutional neural networks.
In Proceedings of the International Conference on Neural Information Processing. Springer, 483–490.
Song Fu, Jianguo Liu, and Husanbir Pannu. 2012. A hybrid anomaly detection framework in cloud computing using one-
class and two-class support vector machines. In Proceedings of the International Conference on Advanced Data Mining
and Applications. Springer, 726–738.
Mikel Galar, Alberto Fernandez, Edurne Barrenechea, Humberto Bustince, and Francisco Herrera. 2012. A review on en-
sembles for the class imbalance problem: Bagging-, boosting-, and hybrid-based approaches. IEEE Trans. Syst., Man,
Cybernet., Part C (Appl. Rev.) 42, 4 (2012), 463–484.
Mikel Galar, Alberto Fernández, Edurne Barrenechea, and Francisco Herrera. 2013. EUSBoost: Enhancing ensembles for
highly imbalanced data-sets by evolutionary undersampling. Pattern Recogn. 46, 12 (2013), 3460–3471.
Vaishali Ganganwar. 2012. An overview of classification algorithms for imbalanced datasets. Int. J. Emerg. Technol. Adv.
Eng. 2, 4 (2012), 42–47.
Zan Gao, Longfei Zhang, Ming-yu Chen, Alexander G. Hauptmann, Hua Zhang, and An-Ni Cai. 2014. Enhanced
and hierarchical structure algorithm for data imbalance problem in semantic extraction under massive video dataset.
Multimedia Tools Appl. 68, 3 (2014), 641–657.
Nicolás García-Pedrajas, Juan A. Romero del Castillo, and Gonzalo Cerruela-García. 2017. A proposal for local k values for
k-nearest neighbor rule. IEEE Trans. Neural Netw. Learn. Syst. 28, 2 (2017), 470–475.
Adel Ghazikhani, Reza Monsefi, and Hadi Sadoghi Yazdi. 2013. Ensemble of online neural networks for non-stationary and
imbalanced data streams. Neurocomputing 122 (2013), 535–544.
Adel Ghazikhani, Reza Monsefi, and Hadi Sadoghi Yazdi. 2014. Online neural network model for non-stationary and im-
balanced data stream classification. Int. J. Mach. Learn. Cybernet. 5, 1 (2014), 51–62.
Guo Haixiang, Li Yijing, Jennifer Shang, Gu Mingyun, Huang Yuanyue, and Gong Bing. 2017. Learning from class-
imbalanced data: Review of methods and applications. Expert Syst. Appl. 73 (2017), 220–239.
Ming Hao, Yanli Wang, and Stephen H. Bryant. 2014. An efficient algorithm coupled with synthetic minority over-sampling
technique to classify imbalanced PubChem BioAssay data. Analyt. Chim. Acta 806 (2014), 117–127.
Amira Kamil Ibrahim Hassan and Ajith Abraham. 2016. Modeling insurance fraud detection using imbalanced data classi-
fication. In Advances in Nature and Biologically Inspired Computing. Springer, 117–127.
Haibo He, Yang Bai, Edwardo A. Garcia, and Shutao Li. 2008. ADASYN: Adaptive synthetic sampling approach for im-
balanced learning. In Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN’08) (IEEE World
Congress on Computational Intelligence). IEEE, 1322–1328.
Haibo He and Edwardo A. Garcia. 2009. Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 21, 9 (2009), 1263–
1284.
Nic Herndon and Doina Caragea. 2016. A study of domain adaptation classifiers derived from logistic regression for the
task of splice site prediction. IEEE Trans. Nanobiosci. 15, 2 (2016), 75–83.
Victoria Hodge and Jim Austin. 2004. A survey of outlier detection methodologies. Artific. Intell. Rev. 22, 2 (2004), 85–126.
Yi-Min Huang and Shu-Xin Du. 2005. Weighted support vector machine for classification with uneven training class sizes.
In Proceedings of 2005 International Conference on Machine Learning and Cybernetics, Vol. 7. IEEE, 4365–4369.
Jae Pil Hwang, Seongkeun Park, and Euntai Kim. 2011. A new weighted approach to imbalanced data classification problem
via support vector machine with quadratic cost function. Expert Syst. Appl. 38, 7 (2011), 8580–8585.
Ilnaz Jamali, Mohammad Bazmara, and Shahram Jafari. 2012. Feature selection in imbalance data sets. Int. J. Comput. Sci.
Iss. 9, 3 (2012), 42–45.
Piyasak Jeatrakul, Kok Wai Wong, and Chun Che Fung. 2010. Classification of imbalanced data by combining the comple-
mentary neural network and SMOTE algorithm. In Proceedings of the International Conference on Neural Information
Processing. Springer, 152–159.
Qi Kang, XiaoShuang Chen, SiSi Li, and MengChu Zhou. 2016. A noise-filtered under-sampling scheme for imbalanced
classification. IEEE Trans. Cybernet. (2016).
Masanori Kasai and Yusuke Oike. 2010. Image pickup apparatus, image processing method, and computer program capable
of obtaining high-quality image data by controlling imbalance among sensitivities of light-receiving devices. U.S. Patent
7,839,437.
Madian Khabsa, Ahmed Elmagarmid, Ihab Ilyas, Hossam Hammady, and Mourad Ouzzani. 2016. Learning to identify rele-
vant studies for systematic reviews using random forest and external information. Mach. Learn. 102, 3 (2016), 465–482.
Salman H. Khan, Munawar Hayat, Mohammed Bennamoun, Ferdous A. Sohel, and Roberto Togneri. 2017. Cost-sensitive
learning of deep feature representations from imbalanced data. IEEE Trans. Neural Netw. Learn. Syst. (2017).
Taghi M. Khoshgoftaar, Kehan Gao, Amri Napolitano, and Randall Wald. 2014. A comparative study of iterative and non-
iterative feature selection techniques for software defect prediction. Info. Syst. Front. 16, 5 (2014), 801–822.
Taghi M. Khoshgoftaar, Jason Van Hulse, and Amri Napolitano. 2010. Supervised neural network modeling: An empirical
investigation into learning from imbalanced data with labeling errors. IEEE Trans. Neural Netw. 21, 5 (2010), 813–830.
Gitae Kim, Bongsug Kevin Chae, and David L. Olson. 2013. A support vector machine (SVM) approach to imbalanced
datasets of customer responses: Comparison with other customer response models. Service Bus. 7, 1 (2013), 167–182.
Sotiris Kotsiantis, Dimitris Kanellopoulos, Panayiotis Pintelas, et al. 2006. Handling imbalanced datasets: A review. GESTS
Int. Trans. Comput. Sci. Eng. 30, 1 (2006), 25–36.
Bartosz Krawczyk. 2016. Learning from imbalanced data: Open challenges and future directions. Progr. Artific. Intell. 5, 4
(2016), 221–232.
Bartosz Krawczyk, Mikel Galar, Łukasz Jeleń, and Francisco Herrera. 2016. Evolutionary undersampling boosting for im-
balanced classification of breast cancer malignancy. Appl. Soft Comput. 38 (2016), 714–726.
Bartosz Krawczyk, Michał Woźniak, and Gerald Schaefer. 2014. Cost-sensitive decision tree ensembles for effective imbal-
anced classification. Appl. Soft Comput. 14 (2014), 554–562.
Miroslav Kubat, Stan Matwin, et al. 1997. Addressing the curse of imbalanced training sets: One-sided selection. In Pro-
ceedings of the International Conference on Machine Learning (ICML’97), Vol. 97. 179–186.
Pallavi Kulkarni and Roshani Ade. 2016. Logistic regression learning model for handling concept drift with unbalanced data
in credit card fraud detection system. In Proceedings of the 2nd International Conference on Computer and Communication
Technologies. Springer, 681–689.
Taehyung Lee, Ki Bum Lee, and Chang Ouk Kim. 2016. Performance of machine-learning algorithms for class-imbalanced
process fault detection problems. IEEE Trans. Semicond. Manufact. 29, 4 (2016), 436–445.
Boaz Lerner, Josepha Yeshaya, and Lev Koushnir. 2007. On the classification of a small imbalanced cytogenetic image
database. IEEE/ACM Trans. Comput. Biol. Bioinform. 4, 2 (2007).
Miao Liu, Mingjun Wang, Jun Wang, and Duo Li. 2013. Comparison of random forest, support vector machine and back
propagation neural network for electronic tongue data classification: Application to the recognition of orange beverage
and Chinese vinegar. Sens. Actuat. B: Chem. 177 (2013), 970–980.
Ying Liu, Han Tong Loh, and Aixin Sun. 2009. Imbalanced text classification: A term weighting approach. Expert Syst. Appl.
36, 1 (2009), 690–701.
Zhen Liu, Ruoyu Wang, Ming Tao, and Xianfa Cai. 2015. A class-oriented feature selection approach for multi-class imbal-
anced network traffic datasets based on local and global metrics fusion. Neurocomputing 168 (2015), 365–381.
Rushi Longadge and Snehalata Dongre. 2013. Class imbalance problem in data mining: Review. arXiv preprint arXiv:1305.1707
(2013).
Victoria López, Sara del Río, José Manuel Benítez, and Francisco Herrera. 2015. Cost-sensitive linguistic fuzzy rule-based
classification systems under the MapReduce framework for imbalanced big data. Fuzzy Sets Syst. 258 (2015), 5–38.
Yang Lu, Yiu-ming Cheung, and Yuan Yan Tang. 2016. Hybrid sampling with bagging for class imbalance learning. In
Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining. Springer, 14–26.
Abdul Majid, Safdar Ali, Mubashar Iqbal, and Nabeela Kausar. 2014. Prediction of human breast and colon cancers from
imbalanced data using nearest neighbor and support vector machines. Comput. Methods Programs Biomed. 113, 3 (2014),
792–808.
Sebastián Maldonado, Richard Weber, and Fazel Famili. 2014. Feature selection for high-dimensional class-imbalanced data
sets using support vector machines. Info. Sci. 286 (2014), 228–246.
Shahla Mardani and Hamid Reza Shahriari. 2013. A new method for occupational fraud detection in process aware informa-
tion systems. In Proceedings of the 10th International ISC Conference on Information Security and Cryptology (ISCISC’13).
IEEE, 1–5.
Stephen O. Moepya, Sharat S. Akhoury, and Fulufhelo V. Nelwamondo. 2014. Applying cost-sensitive classification for
financial fraud detection under high class-imbalance. In Proceedings of the IEEE International Conference on Data Mining
Workshop (ICDMW’14). IEEE, 183–192.
Jose G. Moreno-Torres, Xavier Llorà, David E. Goldberg, and Rohit Bhargava. 2013. Repairing fractures between data using
genetic programming-based feature extraction: A case study in cancer diagnosis. Info. Sci. 222 (2013), 805–823.
Alejandro Moreo, Andrea Esuli, and Fabrizio Sebastiani. 2016. Distributional random oversampling for imbalanced text
classification. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information
Retrieval. ACM, 805–808.
Surya Nepal and Mukaddim Pathan. 2014. Security, Privacy and Trust in Cloud Systems. Springer.
Hien M. Nguyen, Eric W. Cooper, and Katsuari Kamei. 2012. A comparative study on sampling techniques for handling class
imbalance in streaming data. In Proceedings of the Joint 6th International Conference on Soft Computing and Intelligent
Systems (SCIS’12), 13th International Symposium on Advanced Intelligent Systems (ISIS’12). IEEE, 1762–1767.
Ana Palacios, Krzysztof Trawiński, Oscar Cordón, and Luciano Sánchez. 2014. Cost-sensitive learning of fuzzy rules for
imbalanced classification problems using FURIA. Int. J. Uncertain. Fuzz. Knowl.-based Syst. 22, 5 (2014), 643–675.
Jiyan Pan, Quanfu Fan, Sharath Pankanti, Hoang Trinh, Prasad Gabbur, and Sachiko Miyazawa. 2011. Soft margin keyframe
comparison: Enhancing precision of fraud detection in retail surveillance. In Proceedings of the IEEE Workshop on Ap-
plications of Computer Vision (WACV’11). IEEE, 549–556.
Husanbir Singh Pannu and Harsurinder Kaur. 2017. Anomaly detection survey for information security. In Proceedings of
the 10th International Conference on Security of Information and Networks. ACM, 251–258.
Husanbir S. Pannu, Jianguo Liu, Qiang Guan, and Song Fu. 2012. AFD: Adaptive failure detection system for cloud comput-
ing infrastructures. In Proceedings of the IEEE 31st International Performance Computing and Communications Conference
(IPCCC’12). IEEE, 71–80.
Yubin Park and Joydeep Ghosh. 2014. Ensembles of alpha-trees for imbalanced classification problems. IEEE Trans. Knowl.
Data Eng. 26, 1 (2014), 131–143.
Harshita Patel and G. S. Thakur. 2016. A hybrid weighted nearest neighbor approach to mine imbalanced data. In Pro-
ceedings of the International Conference on Data Mining (DMIN’16). The Steering Committee of The World Congress in
Computer Science, Computer Engineering and Applied Computing (WorldComp), 106.
Naser Peiravian and Xingquan Zhu. 2013. Machine learning for Android malware detection using permission and API calls.
In Proceedings of the IEEE 25th International Conference on Tools with Artificial Intelligence (ICTAI’13). IEEE, 300–305.
Lizhi Peng, Bo Yang, Yuehui Chen, and Xiaoqing Zhou. 2016. An under-sampling imbalanced learning of data gravitation-
based classification. In Proceedings of the 12th International Conference on Natural Computation, Fuzzy Systems and
Knowledge Discovery (ICNC-FSKD’16). IEEE, 419–425.
Yun Qian, Yanchun Liang, Mu Li, Guoxiang Feng, and Xiaohu Shi. 2014. A resampling ensemble algorithm for classification
of imbalance problems. Neurocomputing 143 (2014), 57–67.
Chen Qiu, Liangxiao Jiang, and Chaoqun Li. 2017. Randomly selected decision tree for test-cost sensitive learning. Appl.
Soft Comput. 53 (2017), 27–33.
D. Ramyachitra and P. Manikandan. 2014. Imbalanced dataset classification and solutions: A review. Int. J. Comput. Bus.
Res. 5, 4 (2014).
K. Usha Rani, G. Naga Ramadevi, and D. Lavanya. 2016. Performance of synthetic minority oversampling technique on
imbalanced breast cancer data. In Proceedings of the 3rd International Conference on Computing for Sustainable Global
Development (INDIACom’16). IEEE, 1623–1627.
Bhavani Raskutti and Adam Kowalczyk. 2004. Extreme re-balancing for SVMs: A case study. ACM SIGKDD Explor. Newslett.
6, 1 (2004), 60–69.
Alice M. Richardson and Brett A. Lidbury. 2017. Enhancement of hepatitis virus immunoassay outcome predictions in
imbalanced routine pathology data by data balancing and feature selection before the application of support vector
machines. BMC Med. Info. Decis. Mak. 17, 1 (2017), 121.
Yvan Saeys, Iñaki Inza, and Pedro Larrañaga. 2007. A review of feature selection techniques in bioinformatics. Bioinformatics
23, 19 (2007), 2507–2517.
Mahendra Sahare and Hitesh Gupta. 2012. A review of multi-class classification for imbalanced data. Int. J. Adv. Comput.
Res. 2, 3 (2012), 160–164.
Yusuf Sahin, Serol Bulkan, and Ekrem Duman. 2013. A cost-sensitive decision tree approach for fraud detection. Expert
Syst. Appl. 40, 15 (2013), 5916–5923.
Claude Sammut and Geoffrey I. Webb. 2011. Encyclopedia of Machine Learning. Springer Science & Business Media.
José Antonio Sanz, Dario Bernardo, Francisco Herrera, Humberto Bustince, and Hani Hagras. 2015. A compact evolu-
tionary interval-valued fuzzy rule-based classification system for the modeling and prediction of real-world financial
applications with imbalanced data. IEEE Trans. Fuzzy Syst. 23, 4 (2015), 973–990.
Abeed Sarker and Graciela Gonzalez. 2015. Portable automatic text classification for adverse drug reaction detection via
multi-corpus training. J. Biomed. Info. 53 (2015), 196–207.
Asaf Shabtai, Robert Moskovitch, Clint Feher, Shlomi Dolev, and Yuval Elovici. 2012. Detecting unknown malicious code
by applying classification techniques on opcode patterns. Secur. Info. 1, 1 (2012), 1.
Yuan-Hai Shao, Wei-Jie Chen, Jing-Jing Zhang, Zhen Wang, and Nai-Yang Deng. 2014. An efficient weighted Lagrangian
twin support vector machine for imbalanced data classification. Pattern Recogn. 47, 9 (2014), 3158–3167.
Mei-Ling Shyu, Zongxing Xie, Min Chen, and Shu-Ching Chen. 2008. Video semantic event/concept detection using a
subspace-based multimedia data mining framework. IEEE Trans. Multimedia 10, 2 (2008), 252–259.
Arpit Singh and Anuradha Purohit. 2015. A survey on methods for solving data imbalance problem for classification. Work
127, 15 (2015).
Li Song, Dapeng Li, Xiangxiang Zeng, Yunfeng Wu, Li Guo, and Quan Zou. 2014. nDNA-prot: Identification of DNA-binding
proteins based on unbalanced classification. BMC Bioinformat. 15, 1 (2014), 298.
Qun Song, Jun Zhang, and Qian Chi. 2010. Assistant detection of skewed data streams classification in cloud security. In
Proceedings of the IEEE International Conference on Intelligent Computing and Intelligent Systems (ICIS’10), Vol. 1. IEEE,
60–64.
Robert A. Sowah, Moses A. Agebure, Godfrey A. Mills, Koudjo M. Koumadi, and Seth Y. Fiawoo. 2016. New cluster under-
sampling technique for class imbalance learning. Int. J. Mach. Learn. Comput. 6, 3 (2016), 205.
Yanmin Sun, Andrew K. C. Wong, and Mohamed S. Kamel. 2009. Classification of imbalanced data: A review. Int. J. Pattern
Recogn. Artific. Intell. 23, 4 (2009), 687–719.
Zhongbin Sun, Qinbao Song, Xiaoyan Zhu, Heli Sun, Baowen Xu, and Yuming Zhou. 2015. A novel ensemble method for
classifying imbalanced data. Pattern Recogn. 48, 5 (2015), 1623–1637.
Mayank Taneja, Kavyanshi Garg, Archana Purwar, and Samarth Sharma. 2015. Prediction of click frauds in mobile adver-
tising. In Proceedings of the 8th International Conference on Contemporary Computing (IC3’15). IEEE, 162–166.
Bo Tang, Haibo He, Paul M. Baggenstoss, and Steven Kay. 2016. A Bayesian classification approach using class-specific
features for text categorization. IEEE Trans. Knowl. Data Eng. 28, 6 (2016), 1602–1606.
Yuchun Tang, Yan-Qing Zhang, Nitesh V. Chawla, and Sven Krasser. 2009. SVMs modeling for highly imbalanced classifi-
cation. IEEE Trans. Syst., Man, Cybernet., Part B (Cybernet.) 39, 1 (2009), 281–288.
Ciza Thomas. 2013. Improving intrusion detection for imbalanced network traffic. Secur. Commun. Netw. 6, 3 (2013), 309–
324.
Jason Van Hulse, Taghi M. Khoshgoftaar, Amri Napolitano, and Randall Wald. 2009. Feature selection with high-
dimensional imbalanced data. In Proceedings of the IEEE International Conference on Data Mining Workshops (ICDMW’09).
IEEE, 507–514.
Nguyen Ha Vo and Yonggwan Won. 2007. Classification of unbalanced medical data with weighted regularized least squares.
In Proceedings of the Conference on Frontiers in the Convergence of Bioscience and Information Technologies (FBIT’07). IEEE,
347–352.
Chi-Man Vong, Jie Du, Chi-Man Wong, and Jiu-Wen Cao. 2018. Postboosting using extended G-Mean for online sequential
multiclass imbalance learning. IEEE Trans. Neural Netw. Learn. Syst. (2018).
Shixiang Wan, Yucong Duan, and Quan Zou. 2017. HPSLPred: An ensemble multi-label classifier for human protein sub-
cellular location prediction with imbalanced source. Proteomics 17, 17–18 (2017), 1700262.
C. Wang, L. Hu, M. Guo, X. Liu, and Q. Zou. 2015. imDC: An ensemble learning method for imbalanced classification with
miRNA data. Genet. Mol. Res. 14, 1 (2015), 123–133.
Qiang Wang. 2014. A hybrid sampling SVM approach to imbalanced data classification. In Abstract and Applied Analysis,
Vol. 2014. Hindawi Publishing Corporation.
Suge Wang, Deyu Li, Lidong Zhao, and Jiahao Zhang. 2013. Sample cutting method for imbalanced text sentiment classifi-
cation based on BRC. Knowl.-Based Syst. 37 (2013), 451–461.
Shuo Wang and Xin Yao. 2013. Using class imbalance learning for software defect prediction. IEEE Trans. Reliabil. 62, 2
(2013), 434–443.
Wei Wei, Jinjiu Li, Longbing Cao, Yuming Ou, and Jiahang Chen. 2013. Effective detection of sophisticated online banking
fraud on extremely imbalanced data. World Wide Web 16, 4 (2013), 449–475.
Qingyao Wu, Yunming Ye, Haijun Zhang, Michael K. Ng, and Shen-Shyang Ho. 2014. ForesTexter: An efficient random
forest algorithm for imbalanced text categorization. Knowl.-Based Syst. 67 (2014), 105–116.
Yufei Xia, Chuanzhe Liu, and Nana Liu. 2017. Cost-sensitive boosted tree for loan evaluation in peer-to-peer lending.
Electron. Comm. Res. Appl. 24 (2017), 30–49.
Jieming Yang, Zhaoyang Qu, and Zhiying Liu. 2014. Improved feature-selection method considering the imbalance problem
in text categorization. Sci. World J. 2014 (2014).
Junshan Yang, Jiarui Zhou, Zexuan Zhu, Xiaoliang Ma, and Zhen Ji. 2016. Iterative ensemble feature selection for multiclass
classification of imbalanced microarray data. J. Biol. Res. Thessaloniki 23, 1 (2016), 13.
Bee Wah Yap, Khatijahhusna Abd Rani, Hezlin Aryani Abd Rahman, Simon Fong, Zuraida Khairudin, and Nik Nik Ab-
dullah. 2014. An application of oversampling, undersampling, bagging and boosting in handling imbalanced datasets.
In Proceedings of the 1st International Conference on Advanced Data and Information Engineering (DaEng’13). Springer,
13–22.
Hualong Yu and Jun Ni. 2014. An improved ensemble learning method for classifying high-dimensional and imbalanced
biomedicine data. IEEE/ACM Trans. Comput. Biol. Bioinformat. 11, 4 (2014), 657–666.
Hualong Yu, Jun Ni, and Jing Zhao. 2013. ACOSampling: An ant colony optimization-based undersampling method for
classifying imbalanced DNA microarray data. Neurocomputing 101 (2013), 309–318.
Ashkan Zakaryazad and Ekrem Duman. 2016. A profit-driven Artificial Neural Network (ANN) with applications to fraud
detection and direct marketing. Neurocomputing 175 (2016), 121–131.
Jia Zeng, Shanfeng Zhu, and Hong Yan. 2009. Towards accurate human promoter recognition: A review of currently used
sequence features and classification methods. Brief. Bioinformat. 10, 5 (2009), 498–508.
Bin Zhang, Yi Zhou, and Christos Faloutsos. 2008. Toward a comprehensive model in internet auction fraud detection. In
Proceedings of the 41st Annual Hawaii International Conference on System Sciences. IEEE, 79–79.
Dongmei Zhang, Jun Ma, Jing Yi, Xiaofei Niu, and Xiaojing Xu. 2015. An ensemble method for unbalanced sentiment
classification. In Proceedings of the 11th International Conference on Natural Computation (ICNC’15). IEEE, 440–445.
Huaxiang Zhang and Mingfang Li. 2014. RWO-Sampling: A random walk over-sampling approach to imbalanced data
classification. Info. Fusion 20 (2014), 99–116.
Yan-Ping Zhang, Li-Na Zhang, and Yong-Cheng Wang. 2010. Cluster-based majority under-sampling approaches for class
imbalance learning. In Proceedings of the 2nd IEEE International Conference on Information and Financial Engineering
(ICIFE’10). IEEE, 400–404.
Zhancheng Zhang, Jun Dong, Xiaoqing Luo, Kup-Sze Choi, and Xiaojun Wu. 2014. Heartbeat classification using disease-
specific feature selection. Comput. Biol. Med. 46 (2014), 79–89.
Xing-Ming Zhao, Xin Li, Luonan Chen, and Kazuyuki Aihara. 2008. Protein classification with imbalanced data. Proteins:
Struct. Funct. Bioinformat. 70, 4 (2008), 1125–1132.
Zhuoyuan Zheng, Yunpeng Cai, and Ye Li. 2016. Oversampling method for imbalanced classification. Comput. Info. 34, 5
(2016), 1017–1037.
Weicai Zhong, Bijan Raahemi, and Jing Liu. 2013. Classifying peer-to-peer applications using imbalanced concept-adapting
very fast decision tree on IP data stream. Peer-to-Peer Netw. Appl. 6, 3 (2013), 233–246.
Maciej Zięba, Jakub M. Tomczak, Marek Lubicz, and Jerzy Świątek. 2014. Boosted SVM for extracting rules from imbalanced
data in application to prediction of the post-operative life expectancy in the lung cancer patients. Appl. Soft Comput. 14
(2014), 99–108.
Quan Zou, Sifa Xie, Ziyu Lin, Meihong Wu, and Ying Ju. 2016. Finding the best classification threshold in imbalanced
classification. Big Data Res. 5 (2016), 2–8.