A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data
Gustavo E. A. P. A. Batista
Ronaldo C. Prati
Maria Carolina Monard
Instituto de Ciências Matemáticas e de Computação
Caixa Postal 668, 13560-970
São Carlos - SP, Brazil
{gbatista, prati, mcmonard}@icmc.usp.br
ABSTRACT

There are several aspects that might influence the performance achieved by existing learning systems. It has been reported that one of these aspects is related to class imbalance, in which examples in training data belonging to one class heavily outnumber the examples in the other class. In this situation, which is found in real world data describing an infrequent but important event, the learning system may have difficulties to learn the concept related to the minority class. In this work we perform a broad experimental evaluation involving ten methods, three of them proposed by the authors, to deal with the class imbalance problem in thirteen UCI data sets. Our experiments provide evidence that class imbalance does not systematically hinder the performance of learning systems. In fact, the problem seems to be related to learning with too few minority class examples in the presence of other complicating factors, such as class overlapping. Two of our proposed methods, Smote + Tomek and Smote + ENN, deal with these conditions directly, allying a known over-sampling method with data cleaning methods in order to produce better-defined class clusters. Our comparative experiments show that, in general, over-sampling methods provide more accurate results than under-sampling methods considering the area under the ROC curve (AUC). This result seems to contradict results previously published in the literature. Smote + Tomek and Smote + ENN presented very good results for data sets with a small number of positive examples. Moreover, Random over-sampling, a very simple over-sampling method, is very competitive with more complex over-sampling methods. Since the over-sampling methods provided very good performance results, we also measured the syntactic complexity of the decision trees induced from over-sampled data. Our results show that these trees are usually more complex than the ones induced from the original data. Random over-sampling usually produced the smallest increase in the mean number of induced rules, and Smote + ENN the smallest increase in the mean number of conditions per rule, among the investigated over-sampling methods.

1. INTRODUCTION

Most learning systems usually assume that the training sets used for learning are balanced. However, this is not always the case in real world data, where one class might be represented by a large number of examples while the other is represented by only a few. This is known as the class imbalance problem, and it is often reported as an obstacle to the induction of good classifiers by Machine Learning (ML) algorithms. Generally, the problem of imbalanced data sets occurs when one class represents a circumscribed concept while the other class represents the counterpart of that concept, so that examples from the counterpart class heavily outnumber examples from the class of interest. This sort of data is found, for example, in medical record databases regarding a rare disease, where there is a large number of patients who do not have that disease, and in continuous fault-monitoring tasks, where non-faulty examples heavily outnumber faulty examples.

In recent years, there have been several attempts at dealing with the class imbalance problem in the field of Data Mining and Knowledge Discovery in Databases, to which ML is a substantial contributor. Related papers have been published in the ML literature aiming to overcome this problem. The ML community seems to agree on the hypothesis that the imbalance between classes is the major obstacle to inducing classifiers in imbalanced domains. However, it has also been observed that in some domains, for instance the Sick data set [3], standard ML algorithms are capable of inducing good classifiers even using highly imbalanced training sets. This shows that class imbalance is not the only problem responsible for the decrease in performance of learning algorithms.

In [18] we developed a systematic study aiming to question whether class imbalances hinder classifier induction or whether these deficiencies might be explained in other ways. Our study was developed on a series of artificial data sets in order to fully control all the variables we wanted to analyze. The results of our experiments, using a discrimination-based inductive scheme, suggested that the problem is not solely caused by class imbalance, but is also related to the degree of data overlapping among the classes.

The results obtained in this previous work motivated the proposition of two new methods to deal with the problem of learning in the presence of class imbalance. These methods ally a known over-sampling method, namely Smote [5], with two data cleaning methods: Tomek links [22] and Wilson's Edited Nearest Neighbor Rule [24]. The main motivation behind these methods is not only to balance the training data, but also to remove noisy examples lying on the wrong side of the decision border.
The removal of noisy examples might aid in finding better-defined class clusters, therefore allowing the creation of simpler models with better generalization capabilities.

In addition, in this work we perform a broad experimental evaluation involving ten methods, three of them proposed by the authors, to deal with the class imbalance problem in thirteen UCI data sets. We concluded that over-sampling methods are able to aid in the induction of classifiers that are more accurate than those induced from under-sampled data sets. This result seems to contradict results previously published in the literature. Two of our proposed methods performed well in practice, in particular for data sets with a small number of positive examples. It is worth noting that Random over-sampling, a very simple over-sampling method, is very competitive with more complex over-sampling methods.

The remainder of the paper is organized as follows: Section 2 discusses why learning from imbalanced data sets might be a difficult task. Section 3 describes the drawbacks of using accuracy (or error rate) to measure the performance of classifiers and discusses alternative metrics. Section 4 presents the methods employed in the experimental evaluation, including the three methods proposed by the authors. Section 5 discusses the methodology used in the experiments, as well as the results achieved. Finally, Section 6 presents the conclusions and outlines future research.

2. WHY LEARNING FROM IMBALANCED DATA SETS MIGHT BE DIFFICULT

Learning from imbalanced data sets is often reported as being a difficult task. In order to better understand this problem, imagine the situation illustrated in Figure 1. In Fig. 1(a) there is a large imbalance between the majority class (-) and the minority class (+), and the data set presents some degree of class overlapping. A much more comfortable situation for learning is represented in Fig. 1(b), where the classes are balanced with well-defined clusters.

In a situation similar to the one illustrated in Fig. 1(a), spare cases from the minority class may confuse a classifier like k-Nearest Neighbor (k-NN). For instance, 1-NN may incorrectly classify many cases from the minority class because the nearest neighbors of these cases are examples belonging to the majority class. In a situation where the imbalance is very high, the probability that the nearest neighbor of a minority class case belongs to the majority class is likely to be high, and the minority class error rate will tend to have high values, which is unacceptable.

Figure 1: Many negative cases against some spare positive cases (a); balanced data set with well-defined clusters (b).

Decision trees also experience a similar problem. In the presence of class overlapping, decision trees may need to create many tests to distinguish the minority class cases from majority class cases. Pruning the decision tree might not necessarily alleviate the problem. This is due to the fact that pruning removes some branches considered too specialized, labelling new leaf nodes with the dominant class of the corresponding node. Thus, there is a high probability that the majority class will also be the dominant class of those leaf nodes.
3. ON EVALUATING CLASSIFIERS IN IMBALANCED DOMAINS

The most straightforward way to evaluate the performance of classifiers is based on confusion matrix analysis. Table 1 illustrates a confusion matrix for a two-class problem having positive and negative class values. From such a matrix it is possible to extract a number of widely used metrics for measuring the performance of learning systems, such as Error Rate, defined as Err = (FP + FN)/(TP + FN + FP + TN), and Accuracy, defined as Acc = (TP + TN)/(TP + FN + FP + TN) = 1 - Err.

Table 1: Confusion matrix for a two-class problem.

                    Positive Prediction     Negative Prediction
    Positive Class  True Positive (TP)      False Negative (FN)
    Negative Class  False Positive (FP)     True Negative (TN)

However, when the prior class probabilities are very different, the use of such measures might lead to misleading conclusions. Error rate and accuracy are particularly suspicious performance measures when studying the effect of class distribution on learning, since they are strongly biased in favor of the majority class. For instance, it is straightforward to create a classifier having an accuracy of 99% (or an error rate of 1%) in a domain where the majority class proportion corresponds to 99% of the examples, simply by forecasting every new example as belonging to the majority class.

Another argument against the use of accuracy (or error rate) is that these metrics consider different classification errors to be equally important. However, highly imbalanced problems generally have highly non-uniform error costs that favor the minority class, which is often the class of primary interest.

Finally, another point that should be considered when studying the effect of class distribution on learning systems is that the class distribution may change. Consider the confusion matrix shown in Table 1. Note that the class distribution (the proportion of positive to negative examples) is the relationship between the first and second rows. Any performance metric that uses values from both rows will be inherently sensitive to class skews. Metrics such as accuracy and error rate use values from both rows of the confusion matrix. As the class distribution changes, these measures will change as well, even if the fundamental classifier performance does not. All things considered, it would be more interesting to use a performance metric that disassociates the errors (or hits) that occur in each class. From Table 1 it is possible to derive four performance metrics that directly measure the classification performance on the positive and negative classes independently:
False negative rate: FNrate = FN/(TP + FN) is the percentage of positive cases misclassified as belonging to the negative class;

False positive rate: FPrate = FP/(FP + TN) is the percentage of negative cases misclassified as belonging to the positive class;

True negative rate: TNrate = TN/(FP + TN) is the percentage of negative cases correctly classified as belonging to the negative class;

True positive rate: TPrate = TP/(TP + FN) is the percentage of positive cases correctly classified as belonging to the positive class.
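As a minimal illustration, the metrics above can be computed directly from the four counts of Table 1. This sketch is ours (the function name and example figures are not from the paper):

    def confusion_metrics(tp, fn, fp, tn):
        """Compute the Section 3 metrics from the counts of Table 1."""
        total = tp + fn + fp + tn
        err = (fp + fn) / total          # Error rate
        acc = (tp + tn) / total          # Accuracy = 1 - Err
        fn_rate = fn / (tp + fn)         # positive cases predicted as negative
        fp_rate = fp / (fp + tn)         # negative cases predicted as positive
        tn_rate = tn / (fp + tn)         # negative cases predicted as negative
        tp_rate = tp / (tp + fn)         # positive cases predicted as positive
        return {"Err": err, "Acc": acc, "FNrate": fn_rate,
                "FPrate": fp_rate, "TNrate": tn_rate, "TPrate": tp_rate}

    # A degenerate classifier that always predicts the majority class of a
    # 99%/1% problem reaches 99% accuracy while its TPrate is 0.
    print(confusion_metrics(tp=0, fn=10, fp=0, tn=990))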
These four performance measures have the advantage of being independent of class costs and prior probabilities. The aim of a classifier is to minimize the false positive and false negative rates or, equivalently, to maximize the true negative and true positive rates. Unfortunately, for most real world applications there is a tradeoff between FNrate and FPrate and, similarly, between TNrate and TPrate. ROC (Receiver Operating Characteristic) graphs [19] can be used to analyze the relationship between FNrate and FPrate (or TNrate and TPrate) for a classifier.

Some classifiers, such as the Naïve Bayes classifier or some Neural Networks, yield a score that represents the degree to which an example is a member of a class. Such a ranking can be used to produce several classifiers by varying the threshold above which an example is assigned to a class. Each threshold value produces a different point in ROC space. These points are linked by tracing straight lines through consecutive points to produce a ROC curve. For decision trees, we could use the class distributions at each leaf as a score or, as proposed in [9], order the leaves by their positive class accuracy and produce several trees by re-labelling the leaves, one at a time, from all forecasting the negative class to all forecasting the positive class, in positive accuracy order.

A ROC graph characterizes the performance of a binary classification model across all possible trade-offs between the classifier sensitivity (TPrate) and false alarm rate (FPrate). ROC graphs are consistent for a given problem, even if the distribution of positive and negative examples is highly skewed. ROC analysis also allows the performance of multiple classification functions to be visualized and compared simultaneously. The area under the ROC curve (AUC) represents the expected performance as a single scalar. The AUC has a known statistical meaning: it is equivalent to the Wilcoxon test of ranks, and is equivalent to several other statistical measures for evaluating classification and ranking models [10]. In this work, we use the method proposed in [9], with the Laplace correction for measuring leaf accuracy, to produce ROC curves. We also use the AUC as the main method for assessing our experiments.
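The threshold-sweeping construction described above can be sketched as follows. This is our illustration of the generic score-based ROC curve, not the leaf-relabelling procedure of [9]; it assumes distinct scores and integrates the curve with the trapezoidal rule:

    def roc_points(scores, labels):
        """(FPrate, TPrate) points obtained by varying the decision threshold.

        scores: higher means 'more positive'; labels: 1 = positive, 0 = negative.
        Assumes distinct scores (ties would need to be grouped).
        """
        pos = sum(labels)
        neg = len(labels) - pos
        ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
        points, tp, fp = [(0.0, 0.0)], 0, 0
        for _, y in ranked:               # each prefix of the ranking is one threshold
            if y == 1:
                tp += 1
            else:
                fp += 1
            points.append((fp / neg, tp / pos))
        return points

    def auc(points):
        """Area under the ROC curve by the trapezoidal rule."""
        area = 0.0
        for (x0, y0), (x1, y1) in zip(points, points[1:]):
            area += (x1 - x0) * (y0 + y1) / 2.0
        return area

    pts = roc_points(scores=[0.9, 0.8, 0.7, 0.3, 0.2], labels=[1, 1, 0, 1, 0])
    print(auc(pts))   # 0.833... for this small example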
4. METHODS

This section describes the notation used, as well as our implementation of the k-NN algorithm, since this algorithm plays an important role in the behavior of the methods considered. Finally, an explanation of each balancing method is given.

4.1 Notation

In supervised learning, the inducer is fed with a data set E = {E1, E2, ... EN}, in which each example Ei ∈ E has an associated label. This label defines the class the example belongs to. Each example Ei ∈ E is a tuple Ei = (xi, yi) in which xi is a vector of feature (or attribute) values of the example Ei, and yi is its class value. The objective of a supervised learning algorithm is to induce a general mapping of vectors x to values y. Thus, the learning system aims to construct a model y = f(x) of an unknown function f, also known as the concept function, that enables one to predict the y values for previously unseen examples. In practice, learning systems are able to induce a function h that approximates f, i.e., h(x) ≈ f(x). In this case, h is called the hypothesis of the concept function f.

In Table 2 a data set with N examples and M attributes is presented. Columns (A1, ... AM) represent the attributes and rows (E1, ... EN) represent the examples. For instance, row i in Table 2 refers to the ith example, and the entry xij refers to the value of the jth attribute, Aj, of example i. For classification problems, the class attribute Y is a qualitative attribute that may assume a set of NCl discrete values C = {C1, C2, ... CNCl}.

Table 2: Data set in attribute-value form.

          A1     A2     ...    AM     Y
    E1    x11    x12    ...    x1M    y1
    E2    x21    x22    ...    x2M    y2
    ...   ...    ...    ...    ...    ...
    EN    xN1    xN2    ...    xNM    yN

As stated earlier, in this work we consider two-class problems where C1 = + represents the circumscribed concept class and C2 = - represents the counterpart of that concept. Furthermore, the examples from the negative class outnumber the examples from the positive class.

4.2 Our Implementation of the k-NN Algorithm

Several research papers use the Euclidean distance as the distance metric for the k-NN algorithm. However, this distance function might not be appropriate when the domain presents qualitative attributes. For those domains, the distance for qualitative attributes is usually calculated using the overlap function, which assigns the value 0 if two examples have the same value for a given attribute and the value 1 if these values differ. Our implementation of the k-NN algorithm uses the Heterogeneous Value Difference Metric (HVDM) distance function [25]. This distance function uses the Euclidean distance for quantitative attributes and the VDM distance [21] for qualitative attributes. The VDM metric provides a more appropriate distance function for qualitative attributes than the overlap metric, since the VDM metric considers the classification similarity of each possible value of a qualitative attribute when calculating the distances between these values.
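A simplified sketch of an HVDM-style distance is given below. It is our own illustration: the 4-sigma scaling for quantitative attributes and the squared-difference form of VDM follow the usual presentation of [25], but the exact normalization used in the paper's implementation may differ.

    import math
    from collections import Counter, defaultdict

    def hvdm_builder(X, y, numeric):
        """Build a simplified HVDM-style distance from training data.

        X: list of attribute-value tuples, y: class labels,
        numeric: set of attribute indexes treated as quantitative.
        """
        n_attrs = len(X[0])
        classes = sorted(set(y))
        # Standard deviation per quantitative attribute (for the 4*sigma scaling).
        sigma = {}
        for a in numeric:
            vals = [row[a] for row in X]
            mean = sum(vals) / len(vals)
            sigma[a] = math.sqrt(sum((v - mean) ** 2 for v in vals) / len(vals)) or 1.0
        # Value/class counts per qualitative attribute (for the VDM part).
        counts = defaultdict(Counter)          # (attr, value) -> Counter over classes
        for row, cls in zip(X, y):
            for a in range(n_attrs):
                if a not in numeric:
                    counts[(a, row[a])][cls] += 1

        def vdm(a, u, v):
            cu, cv = counts[(a, u)], counts[(a, v)]
            nu, nv = sum(cu.values()) or 1, sum(cv.values()) or 1
            return math.sqrt(sum((cu[c] / nu - cv[c] / nv) ** 2 for c in classes))

        def distance(p, q):
            total = 0.0
            for a in range(n_attrs):
                if a in numeric:
                    d = abs(p[a] - q[a]) / (4.0 * sigma[a])
                else:
                    d = vdm(a, p[a], q[a])
                total += d * d
            return math.sqrt(total)

        return distance

    # Usage: dist = hvdm_builder(X_train, y_train, numeric={0, 2}); dist(x1, x2)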
Another refinement to the basic k-NN algorithm is to weigh the contribution of each of the k neighbors according to their distance to the query example Eq, giving greater weight to closer neighbors. The vote of each neighbor is weighed according to the inverse square of its distance from Eq [17]. Given Ê = {Ê1, Ê2, ... Êk}, the set of k nearest neighbors of Eq according to the distance function d, the final classification is given by Equation 1:

    h(Eq) = argmax_{c ∈ C} Σ_{i=1}^{k} ωi δ(c, f(Êi)),   where   ωi = 1 / d(Eq, Êi)²        (1)

and δ(a, b) = 1 if a = b, and δ(a, b) = 0 otherwise.

As the balancing methods make heavy use of distance computations, we implemented an indexing structure, namely the M-tree [6], to speed up the execution of k-NN queries. The M-tree only considers relative distances between examples, rather than their absolute positions in a multi-dimensional space, to organize and partition the metric space. In a metric space, example proximity is defined only by a distance function that satisfies the positivity, symmetry and triangle inequality postulates.
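A minimal sketch of the weighted vote of Equation 1 follows. The names are ours, and the linear scan over the data stands in for the M-tree index used in the actual implementation:

    import heapq
    from collections import defaultdict

    def knn_classify(query, data, labels, dist, k=3):
        """Distance-weighted k-NN vote of Equation 1 (our sketch).

        dist is any distance function, e.g. the HVDM sketch above.
        """
        neighbours = heapq.nsmallest(
            k, zip(data, labels), key=lambda pair: dist(query, pair[0]))
        votes = defaultdict(float)
        for example, cls in neighbours:
            d = dist(query, example)
            if d == 0.0:                 # exact match decides immediately
                return cls
            votes[cls] += 1.0 / (d * d)  # weight omega_i = 1 / d^2
        return max(votes, key=votes.get)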
4.3 Methods

In this work, we evaluate ten different methods of under- and over-sampling to balance the class distribution in the training data. Two of these methods, Random over-sampling and Random under-sampling, are non-heuristic methods that were initially included in this evaluation as baseline methods. The evaluated methods are described next.

Random over-sampling: a non-heuristic method that aims to balance the class distribution through the random replication of minority class examples.

Random under-sampling: also a non-heuristic method, which aims to balance the class distribution through the random elimination of majority class examples.

Several authors agree that Random over-sampling can increase the likelihood of overfitting, since it makes exact copies of the minority class examples. In this way, a symbolic classifier, for instance, might construct rules that are apparently accurate but actually cover a single replicated example. On the other hand, the major drawback of Random under-sampling is that this method can discard potentially useful data that could be important for the induction process. The remaining balancing methods use heuristics in order to overcome the limitations of the non-heuristic methods.
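A minimal sketch of the two non-heuristic baselines (our own illustration; the seed and function names are ours):

    import random

    def random_over_sample(majority, minority, seed=0):
        """Replicate minority class examples at random until the classes are balanced."""
        rng = random.Random(seed)
        extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
        return majority, minority + extra

    def random_under_sample(majority, minority, seed=0):
        """Randomly discard majority class examples until the classes are balanced."""
        rng = random.Random(seed)
        return rng.sample(majority, len(minority)), minority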
Tomek links: Tomek links [22] can be defined as follows: given two examples Ei and Ej belonging to different classes, let d(Ei, Ej) be the distance between Ei and Ej. The pair (Ei, Ej) is called a Tomek link if there is no example El such that d(Ei, El) < d(Ei, Ej) or d(Ej, El) < d(Ei, Ej). If two examples form a Tomek link, then either one of these examples is noise or both examples are borderline. Tomek links can be used as an under-sampling method or as a data cleaning method. As an under-sampling method, only examples belonging to the majority class are eliminated; as a data cleaning method, examples of both classes are removed.
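The identification of Tomek links can be sketched as follows (our illustration; a quadratic scan, with ties between equal distances resolved arbitrarily):

    def tomek_links(data, labels, dist):
        """Return the index pairs (i, j) that form Tomek links.

        A pair of examples from different classes is a Tomek link when each is
        the other's nearest neighbour, i.e. no third example is closer to either.
        """
        n = len(data)
        nn = [min((j for j in range(n) if j != i),
                  key=lambda j: dist(data[i], data[j])) for i in range(n)]
        links = []
        for i in range(n):
            j = nn[i]
            if labels[i] != labels[j] and nn[j] == i and i < j:
                links.append((i, j))
        return links

    # Under-sampling: drop only the majority-class member of each link.
    # Data cleaning: drop both members of each link.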
Condensed Nearest Neighbor Rule: Hart's Condensed Nearest Neighbor Rule (CNN) [11] is used to find a consistent subset of examples. A subset Ê ⊆ E is consistent with E if, using a 1-nearest neighbor classifier, Ê correctly classifies the examples in E. An algorithm to create a subset Ê from E as an under-sampling method is the following [14]: first, randomly draw one majority class example and all examples from the minority class and put these examples in Ê. Afterwards, use a 1-NN over the examples in Ê to classify the examples in E. Every misclassified example from E is moved to Ê. It is important to note that this procedure does not find the smallest consistent subset of E. The idea behind this implementation of a consistent subset is to eliminate the examples from the majority class that are distant from the decision border, since these sorts of examples might be considered less relevant for learning.
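The single-pass construction just described can be sketched as follows (our illustration, under the assumption that the majority list is larger than the minority list):

    import random

    def cnn_subset(majority, minority, dist, seed=0):
        """Single-pass CNN under-sampling as described above [11, 14] (our sketch).

        Start with all minority examples plus one random majority example; every
        majority example that the current subset misclassifies with 1-NN is added.
        """
        rng = random.Random(seed)
        subset = [(x, "min") for x in minority] + [(rng.choice(majority), "maj")]

        def one_nn_label(x):
            return min(subset, key=lambda pair: dist(x, pair[0]))[1]

        for x in majority:
            if one_nn_label(x) != "maj":   # misclassified: it lies near the border, keep it
                subset.append((x, "maj"))
        return subset                       # (example, class) pairs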
One-sided selection: One-sided selection (OSS) [14] is an under-sampling method resulting from the application of Tomek links followed by the application of CNN. Tomek links are used as an under-sampling method and remove noisy and borderline majority class examples. Borderline examples can be considered "unsafe", since a small amount of noise can make them fall on the wrong side of the decision border. CNN aims to remove examples from the majority class that are distant from the decision border. The remaining examples, i.e., "safe" majority class examples and all minority class examples, are used for learning.

CNN + Tomek links: This is one of the methods proposed in this work. It is similar to one-sided selection, but the method to find the consistent subset is applied before the Tomek links. Our objective is to verify its competitiveness with OSS. As finding Tomek links is computationally demanding, it would be computationally cheaper to perform it on a reduced data set.

Neighborhood Cleaning Rule: The Neighborhood Cleaning Rule (NCL) [15] uses Wilson's Edited Nearest Neighbor Rule (ENN) [24] to remove majority class examples. ENN removes any example whose class label differs from the class of at least two of its three nearest neighbors. NCL modifies ENN in order to intensify the data cleaning. For a two-class problem the algorithm can be described in the following way: for each example Ei in the training set, its three nearest neighbors are found. If Ei belongs to the majority class and the classification given by its three nearest neighbors contradicts the original class of Ei, then Ei is removed. If Ei belongs to the minority class and its three nearest neighbors misclassify Ei, then the nearest neighbors that belong to the majority class are removed.
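The NCL rule just described can be sketched as follows (our illustration; the function name and the k = 3 default are ours):

    def ncl_to_remove(data, labels, dist, minority_label, k=3):
        """Neighborhood Cleaning Rule [15] as described above (our sketch).

        Returns the set of indexes of majority-class examples to discard.
        """
        n = len(data)
        drop = set()
        for i in range(n):
            neigh = sorted((j for j in range(n) if j != i),
                           key=lambda j: dist(data[i], data[j]))[:k]
            agree = sum(labels[j] == labels[i] for j in neigh)
            misclassified = agree < (k + 1) // 2    # at least 2 of 3 neighbours disagree
            if labels[i] != minority_label and misclassified:
                drop.add(i)                         # noisy majority example
            elif labels[i] == minority_label and misclassified:
                drop.update(j for j in neigh if labels[j] != minority_label)
        return drop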
Smote: The Synthetic Minority Over-sampling Technique (Smote) [5] is an over-sampling method. Its main idea is to form new minority class examples by interpolating between several minority class examples that lie together. Because new examples are synthesized rather than replicated, the overfitting problem is avoided, and the decision boundaries for the minority class spread further into the majority class space.
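A minimal sketch of the interpolation idea (our illustration, for purely quantitative attributes and assuming at least two minority examples; qualitative attributes would need the VDM-style treatment of Section 4.2):

    import random

    def smote(minority, n_new, k=5, seed=0):
        """Generate n_new synthetic minority examples by interpolation (Smote [5], our sketch).

        minority: list of numeric feature vectors. Each synthetic example lies on
        the segment between a minority example and one of its k nearest minority
        neighbours (Euclidean distance).
        """
        rng = random.Random(seed)

        def dist(a, b):
            return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5

        synthetic = []
        for _ in range(n_new):
            base = rng.choice(minority)
            neighbours = sorted((x for x in minority if x is not base),
                                key=lambda x: dist(base, x))[:k]
            other = rng.choice(neighbours)
            gap = rng.random()                      # random position along the segment
            synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
        return synthetic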
Smote + Tomek links: Although over-sampling minority class examples can balance class distributions, some other problems usually present in data sets with skewed class distributions are not solved. Frequently, class clusters are not well defined, since some majority class examples might be invading the minority class space. The opposite can also be true, since interpolating minority class examples can expand the minority class clusters, introducing artificial minority class examples too deeply into the majority class space. Inducing a classifier under such a situation can lead to overfitting. In order to create better-defined class clusters, we propose applying Tomek links to the over-sampled training set as a data cleaning method. Thus, instead of removing only the majority class examples that form Tomek links, examples from both classes are removed. The application of this method is illustrated in Figure 2. First, the original data set (a) is over-sampled with Smote (b), and then Tomek links are identified (c) and removed, producing a balanced data set with well-defined class clusters (d). The Smote + Tomek links method was first used to improve the classification of examples for the problem of annotation of proteins in Bioinformatics [1].

Figure 2: Balancing a data set: original data set (a); over-sampled data set (b); Tomek links identification (c); and borderline and noise examples removal (d).

Smote + ENN: The motivation behind this method is similar to that of Smote + Tomek links. ENN tends to remove more examples than Tomek links do, so it is expected to provide a more in-depth data cleaning. Differently from NCL, which is an under-sampling method, ENN is used to remove examples from both classes. Thus, any example that is misclassified by its three nearest neighbors is removed from the training set.
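A sketch of the proposed combination, relying on the smote() and tomek_links() sketches above (our illustration; it assumes the majority class is the larger one):

    def smote_tomek(majority, minority, dist):
        """Smote + Tomek links (our sketch of the proposed method).

        1. Over-sample the minority class with Smote until the classes are balanced.
        2. Find Tomek links in the enlarged data set and drop BOTH members of each
           link, cleaning examples of either class that lie on the wrong side of
           the decision border.
        """
        new_minority = minority + smote(minority, len(majority) - len(minority))
        data = majority + new_minority
        labels = [0] * len(majority) + [1] * len(new_minority)
        linked = {i for pair in tomek_links(data, labels, dist) for i in pair}
        return [(x, y) for i, (x, y) in enumerate(zip(data, labels)) if i not in linked]

    # Smote + ENN is analogous: replace the Tomek-link step with ENN-style cleaning
    # (remove every example misclassified by its three nearest neighbours),
    # applied to both classes.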
5. EXPERIMENTAL EVALUATION

The main objective of our research is to compare several balancing methods published in the literature, as well as the three proposed methods, in order to verify whether those methods can effectively deal in practice with the problem of class imbalance. To make this comparison, we have selected thirteen data sets from UCI [3] which have different degrees of imbalance. Table 3 summarizes the data employed in this study. For each data set, it shows the number of examples (#Examples), the number of attributes (#Attributes), the number of quantitative and qualitative attributes, the class attribute distribution and the majority class error. For data sets having more than two classes, we chose the class with fewer examples as the positive class and collapsed the remainder into the negative class. As the Letter and Splice data sets have a similar number of examples in the minority classes, we created two data sets from each of them: Letter-a and Letter-vowel, and Splice-ie and Splice-ei.

In our experiments, we used release 8 of the C4.5 symbolic learning algorithm to induce decision trees [20]. Firstly, we ran C4.5 over the original (imbalanced) data sets and calculated the AUC for each data set using 10-fold cross-validation. The results obtained in this initial experiment are shown in Figure 3.

Figure 3 plots the proportion of negative/positive examples versus the mean AUC values for the original data sets. If class imbalance systematically hindered the performance of learning systems, the AUC would be expected to decrease for highly imbalanced data sets. However, in spite of a large degree of imbalance, the data sets Letter-a and Nursery obtained almost 100% AUC.

Figure 3: Proportion of negative/positive examples versus AUC.
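As an illustration only of this measurement protocol: the experiments in the paper use C4.5 release 8 on the UCI data sets, whereas the sketch below uses scikit-learn's CART-style DecisionTreeClassifier on synthetic imbalanced data, so the numbers will differ.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    # Synthetic imbalanced data standing in for a UCI data set (95% negative class).
    X, y = make_classification(n_samples=2000, n_features=10, weights=[0.95],
                               random_state=42)
    tree = DecisionTreeClassifier(random_state=42)
    auc_scores = cross_val_score(tree, X, y, cv=10, scoring="roc_auc")
    print(f"mean AUC = {auc_scores.mean():.3f} +/- {auc_scores.std():.3f}")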
The results obtained on the UCI data sets seem to be compatible with previous work of the authors [18], conducted as a series of experiments with artificial domains in which we varied the degree of overlapping between the classes. It was concluded that class imbalance, by itself, does not seem to be a problem, but when allied to highly overlapped classes it can significantly decrease the number of minority class examples correctly classified. Domains with non-overlapping classes do not seem to be problematic for learning, no matter the degree of imbalance. Moreover, in [12] Japkowicz performed several experiments on artificial data sets and concluded that class imbalances do not seem to systematically cause performance degradation. She concludes that the imbalance problem is a relative problem depending both on the complexity of the concept (where the "concept complexity" corresponds to the number of subclusters into which the classes are subdivided) and on the overall size of the training set.

The relationship between training set size and improper classification performance for imbalanced data sets seems to be that, on small imbalanced data sets, the minority class is poorly represented by an excessively reduced number of examples, which might not be sufficient for learning, especially when a large degree of class overlapping exists and the class is further divided into subclusters. For larger data sets, the effect of these complicating factors seems to be reduced, as the minority class is better represented by a larger number of examples. This trend is confirmed by the graph shown in Figure 4, which shows how the AUC is affected by the number of positive training examples in the data sets.

In a second stage, the over- and under-sampling methods described in Section 4 were applied to the original data sets. Smote, Random over-sampling, Random under-sampling and CNN have internal parameters that allow the user to set up the class distribution resulting from the application of these methods. We decided to add/remove examples until a balanced distribution was reached. This decision is motivated by the results presented in [23], in which it is shown that when AUC is used as the performance measure, the best class distribution for learning tends to be near the balanced class distribution.
Table 3: Data sets summary descriptions.
Data set #Examples #Attributes Class Class % Majority
(quanti., quali.) (min., maj.) (min., maj.) Error
Pima 768 8 (8,0) (1, 0) (34.77%, 65.23%) 65.23%
German 1000 20 (7,13) (Bad, Good) (30.00%, 70.00%) 70.00%
Post-operative 90 8 (1,7) (S, remainder) (26.67%, 73.33%) 73.33%
Haberman 306 3 (3,0) (Die, Survive) (26.47%, 73.53%) 73.53%
Splice-ie 3176 60 (0,60) (ie, remainder) (24.09%, 75.91%) 75.91%
Splice-ei 3176 60 (0,60) (ei, remainder) (23.99%, 76.01%) 76.01%
Vehicle 846 18 (18,0) (van, remainder) (23.52%, 76.48%) 76.48%
Letter-vowel 20000 16 (16,0) (all vowels, remainder) (19.39%, 80.61%) 80.61%
New-thyroid 215 5 (5,0) (hypo, remainder) (16.28%, 83.72%) 83.72%
E.Coli 336 7 (7,0) (iMU, remainder) (10.42%, 89.58%) 89.58%
Satimage 6435 36 (36,0) (4, remainder) (9.73%, 90.27%) 90.27%
Flag 194 28 (10,18) (white, remainder) (8.76%, 91.24%) 91.24%
Glass 214 9 (9,0) (Ve-win-float-proc, remainder) (7.94%, 92.06%) 92.06%
Letter-a 20000 16 (16,0) (a, remainder) (3.95%, 96.05%) 96.05%
Nursery 12960 8 (8,0) (not recom, remainder) (2.55%, 97.45%) 97.45%
Figure 4: Number of positive examples versus AUC.

Figure 5 shows a comparison of the effect of pruning decision trees on the original and balanced data sets. The line x = y represents the case in which the pruned and unpruned decision trees obtain the same AUC. Points above this line indicate that unpruned decision trees obtained better results, and points under this line the opposite. Figure 5 clearly shows that pruning rarely leads to an improvement in AUC for the original and balanced data sets.

Figure 5: AUC of pruned versus unpruned decision trees for the original and balanced data sets.
Statistical tests were used to compare the best ranked method with the remaining methods. The results are also summarized in Tables 6 and 7, where the methods marked with an asterisk obtained statistically inferior results when compared to the top ranked method.
Conversely, the over-sampling methods in general, and Random over-sampling in particular, are well ranked among the remaining methods. This result seems to diverge from several papers previously published in the literature. Drummond and Holte [8] report that, when using C4.5's default settings, over-sampling is surprisingly ineffective, often producing little or no change in performance in response to modifications of misclassification costs and class distribution. Moreover, they note that over-sampling prunes less and therefore generalizes less than under-sampling, and that a modification of C4.5's parameter settings to increase the influence of pruning and other overfitting avoidance factors can re-establish the performance of over-sampling. In our experiments, Random over-sampling did not produce overfitted decision trees even when these trees were left unpruned, as confirmed by the higher AUC values obtained by this method for unpruned trees. In addition, under-sampling methods did not perform as well as over-sampling methods, even when heuristics to remove cases were considered in under-sampling.

Moreover, Domingos [7] reports that, for concept learning problems, C4.5Rules produces lower cost classifiers using under-sampling than over-sampling. Ling and Li [16] compare over- and under-sampling for boosted C4.5 and report that under-sampling produces a better lift index, although extreme over-sampling performs almost as well. On the other hand, Japkowicz and Stephen [13] compare several methods of over- and under-sampling on a series of artificial data sets and conclude that over-sampling is more effective than under-sampling at reducing error rate.

In our opinion, the good results obtained by over-sampling are not completely unexpected. As stated before, it seems that the loss of performance is directly related to the lack of minority class examples in conjunction with other complicating factors. Over-sampling is the class of methods that most directly attacks the problem of the lack of minority class examples.

It is worth mentioning that two of our proposed methods, Smote + Tomek and Smote + ENN, are generally ranked among the best for data sets with a small number of positive examples. Considering only the data sets with less than 100 positive examples (in our experiments there are 6 of them: Flag, Glass, Post-operative, New-thyroid, E.Coli and Haberman), at least one of the proposed methods provided meaningful results for all 6 data sets for pruned trees (Table 6), and for 5 of the 6 data sets for unpruned trees (Table 7). This seems to indicate that these methods could be appropriate in domains having such conditions.

Since the over-sampling methods, as well as the unpruned decision trees, obtained very good performance results, further analysis will focus on these results. In addition to classifier performance, we also attempted to measure the syntactic complexity of the induced models. Syntactic complexity is given by two main parameters: the mean number of induced rules (branches) and the mean number of conditions per rule. Tables 8 and 9 respectively show the mean number of induced rules and the mean number of conditions per rule for the over-sampling methods and the original data sets with unpruned decision trees. The best results are shown in bold, and the best results obtained by an over-sampling method, not considering the results obtained on the original data sets, are highlighted in light gray.

Figure 6 shows the results in Table 8 in graphical form, where it can be observed that over-sampled data sets usually lead to an increase in the number of induced rules compared to the trees induced from the original data sets.
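The two syntactic complexity measures correspond to the number of leaves of the tree (rules) and the average depth of its leaves (conditions per rule). The sketch below shows how such a measurement could be taken; it is our illustration only, using a scikit-learn CART-style tree and synthetic data rather than C4.5 release 8 and the UCI data sets.

    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    def rule_complexity(fitted_tree):
        """Number of rules (leaves) and mean conditions per rule (mean leaf depth)."""
        t = fitted_tree.tree_
        depths, stack = [], [(0, 0)]              # (node id, depth)
        while stack:
            node, depth = stack.pop()
            left, right = t.children_left[node], t.children_right[node]
            if left == right:                     # leaf node
                depths.append(depth)
            else:
                stack.append((left, depth + 1))
                stack.append((right, depth + 1))
        return len(depths), sum(depths) / len(depths)

    X, y = make_classification(n_samples=500, weights=[0.9], random_state=42)
    clf = DecisionTreeClassifier(random_state=42).fit(X, y)   # unpruned, as in Tables 8 and 9
    n_rules, conds_per_rule = rule_complexity(clf)
    print(n_rules, round(conds_per_rule, 2))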
Table 5: AUC results for the under-sampled data sets.
Data set Pruning Rand Under CNN CNN+Tomek Tomek OSS NCL
yes 81.17(3.87) 79.60(6.22) 80.30(3.86) 82.56(5.11) 77.89(5.37) 81.61(4.48)
Pima
no 81.49(4.29) 80.08(5.82) 81.71(3.69) 83.11(4.65) 79.23(4.81) 82.55(3.53)
yes 79.85(3.05) 79.85(5.56) 79.48(5.01) 78.87(4.27) 79.20(3.15) 77.89(3.85)
German
no 84.54(3.32) 82.25(5.59) 81.70(4.00) 85.90(3.99) 82.96(3.22) 85.07(3.54)
yes 49.11(14.07) 49.20(8.91) 49.02(11.34) 46.16(5.89) 46.31(18.77) 42.34(28.12)
Post-operative
no 55.52(24.47) 65.69(21.64) 75.79(16.86) 66.45(23.29) 64.44(20.88) 45.62(32.71)
yes 66.07(10.26) 58.36(10.26) 55.73(14.31) 64.46(10.95) 62.70(11.50) 68.01(13.99)
Haberman
no 68.40(10.17) 58.36(10.26) 55.73(14.31) 69.59(13.30) 62.03(11.82) 69.29(14.13)
yes 97.46(1.10) 98.39(0.64) 97.55(0.46) 98.69(0.51) 97.37(0.84) 98.38(0.57)
Splice-ie
no 98.80(0.40) 99.17(0.36) 98.82(0.32) 99.18(0.43) 98.93(0.30) 99.15(0.36)
yes 98.74(0.46) 98.78(0.46) 98.85(0.42) 98.78(0.46) 98.83(0.45) 98.77(0.47)
Splice-ei
no 99.25(0.48) 99.27(0.77) 99.47(0.27) 99.44(0.60) 99.33(0.66) 99.40(0.66)
yes 97.25(1.95) 98.62(0.67) 98.34(1.32) 98.26(0.90) 98.79(0.67) 97.94(1.05)
Vehicle
no 97.80(0.94) 98.64(0.63) 98.42(1.02) 98.41(0.90) 98.71(0.97) 98.17(1.12)
yes 97.69(0.43) 98.03(0.37) 97.97(0.46) 98.18(0.53) 97.66(0.30) 98.17(0.30)
Letter-vowel
no 98.26(0.28) 98.49(0.31) 98.39(0.22) 98.90(0.18) 98.27(0.19) 98.81(0.17)
yes 94.87(5.00) 94.79(10.14) 94.54(10.10) 94.73(9.24) 92.72(10.55) 93.44(9.74)
New-thyroid
no 94.87(5.00) 94.79(10.14) 94.54(10.10) 94.98(9.38) 92.72(10.55) 93.69(9.90)
yes 88.75(12.45) 80.32(19.96) 80.34(19.85) 91.57(7.81) 83.97(21.27) 91.73(8.00)
E.Coli
no 88.64(12.46) 81.13(20.00) 81.95(19.90) 94.03(5.56) 83.76(21.17) 92.04(8.15)
yes 92.34(1.27) 92.25(1.45) 92.73(1.38) 94.21(1.76) 92.85(1.19) 94.42(1.53)
Satimage
no 92.86(1.29) 92.35(1.35) 92.90(1.38) 95.11(1.29) 92.84(1.22) 95.06(1.27)
yes 71.13(28.95) 49.12(21.57) 75.85(30.26) 45.00(15.81) 45.00(15.81) 44.47(15.71)
Flag
no 78.35(29.98) 78.90(28.63) 75.64(29.37) 78.59(28.75) 81.73(29.51) 76.13(27.80)
yes 82.44(8.99) 58.44(13.15) 72.69(14.07) 87.15(16.47) 72.16(16.84) 91.67(12.76)
Glass
no 80.47(13.25) 64.31(14.21) 75.44(11.61) 87.00(16.75) 78.76(12.52) 91.67(12.76)
yes 99.35(0.48) 99.60(0.37) 99.61(0.37) 99.61(0.40) 99.66(0.46) 99.60(0.40)
Letter-a
no 99.46(0.42) 99.66(0.37) 99.65(0.38) 99.67(0.37) 99.67(0.45) 99.67(0.37)
yes 97.52(0.82) 99.55(0.21) 98.77(0.35) 99.80(0.08) 99.47(0.19) 99.79(0.12)
Nursery
no 98.76(0.22) 99.84(0.13) 99.57(0.21) 99.89(0.08) 99.83(0.08) 99.89(0.09)
Table 6: Performance ranking for original and balanced data sets for pruned decision trees.
Data set 1o 2o 3o 4o 5o 6o 7o 8o 9o 10o 11o
Pima Smt RdOvr Smt+Tmk Smt+ENN Tmk NCL Original RdUdr CNN+Tmk CNN* OSS*
German RdOvr Smt+Tmk Smt+ENNSmt RdUdr CNN CNN+Tmk*OSS* Original* Tmk* NCL*
Post-operativeRdOvr Smt+ENNSmt Original CNN RdUdr CNN+Tmk OSS* Tmk* NCL* Smt+Tmk*
Haberman Smt+ENNSmt+Tmk Smt RdOvr NCL RdUdr Tmk OSS* CNN* Original* CNN+Tmk*
Splice-ie RdOvr Original Tmk Smt CNN NCL Smt+Tmk Smt+ENN*CNN+Tmk*RdUdr* OSS*
Splice-ei Smt Smt+Tmk Smt+ENNCNN+TmkOSS RdOvr Tmk CNN NCL Original RdUdr
Vehicle RdOvr Smt Smt+Tmk OSS CNN OriginalCNN+Tmk Tmk NCL* Smt+ENN* RdUdr*
Letter-vowel Smt+ENNSmt+Tmk Smt RdOvr Tmk* NCL* Original* CNN* CNN+Tmk*RdUdr* OSS*
New-thyroid Smt+ENNSmt+Tmk Smt RdOvr RdUdr CNN Original Tmk CNN+Tmk NCL OSS
E.Coli Smt+Tmk Smt Smt+ENNRdOvr NCL Tmk RdUdr Original OSS CNN+Tmk*CNN*
Satimage Smt+ENNSmt Smt+Tmk RdOvr NCL Tmk Original* OSS* CNN+Tmk*RdUdr* CNN*
Flag RdOvr Smt+ENNSmt+Tmk CNN+TmkSmt RdUdr CNN* OSS* Tmk* Original* NCL*
Glass Smt+ENNRdOvr NCL Smt Smt+TmkOriginalTmk RdUdr CNN+Tmk*OSS* CNN*
Letter-a Smt+Tmk Smt+ENNSmt RdOvr OSS OriginalTmk CNN+Tmk NCL CNN RdUdr*
Nursery RdOvr Tmk Original NCL CNN* OSS* Smt+Tmk* Smt* CNN+Tmk*Smt+ENN* RdUdr*
Table 7: Performance ranking for original and balanced data sets for unpruned decision trees.
Data set 1o 2o 3o 4o 5o 6o 7o 8o 9o 10o 11o
Pima RdOvr Smt Smt+Tmk Smt+ENNTmk NCL Original CNN+Tmk RdUdr CNN* OSS*
German Original Tmk RdOvr NCL RdUdr Smt Smt+Tmk Smt+ENN OSS CNN CNN+Tmk
Post-operativeOriginal CNN+TmkRdOvr Smt Tmk CNN OSS Smt+ENN RdUdr Smt+Tmk* NCL*
Haberman Smt+Tmk Smt+ENN Smt RdOvr Tmk NCL RdUdr Original OSS* CNN* CNN+Tmk*
Splice-ie Original Smt Tmk CNN NCL Smt+Tmk RdOvr OSS* Smt+ENN* CNN+Tmk*RdUdr*
Splice-ei RdOvr Smt Smt+Tmk Smt+ENNOriginal CNN+TmkTmk NCL OSS CNN RdUdr
Vehicle RdOvr Smt Smt+Tmk OSS CNN Original CNN+TmkTmk Smt+ENN NCL RdUdr*
Letter-vowel Smt+ENNSmt Smt+Tmk Tmk* RdOvr* NCL* Original* CNN* CNN+Tmk*OSS* RdUdr*
New-thyroid Smt+ENNSmt Smt+Tmk RdOvr Original Tmk RdUdr CNN CNN+Tmk NCL OSS
E.Coli Smt+Tmk Smt Smt+ENNTmk RdOvr Original NCL RdUdr OSS CNN+Tmk*CNN*
Satimage Smt+ENNSmt Smt+Tmk RdOvr Tmk NCL Original CNN+Tmk*RdUdr* OSS* CNN*
Flag Smt+Tmk OSS RdOvr CNN Tmk Smt+ENN RdUdr Original NCL CNN+Tmk Smt
Glass Smt+ENNRdOvr NCL Smt Smt+TmkOriginal Tmk RdUdr OSS* CNN+Tmk*CNN*
Letter-a Smt Smt+Tmk Smt+ENNRdOvr Tmk OSS NCL Original CNN CNN+Tmk RdUdr*
Nursery RdOvr Original NCL Tmk CNN OSS* Smt* CNN+Tmk*Smt+Tmk* Smt+ENN* RdUdr*
Comparing the mean number of rules obtained with the over-sampled data sets, Random over-sampling and Smote + ENN are the methods that usually provide a smaller increase in the mean number of rules. It was expected that the application of over-sampling would result in an increase in the mean number of rules, since over-sampling increases the total number of training examples, which usually generates larger decision trees. It can be considered unexpected that Random over-sampling is competitive with Smote + Tomek and Smote + ENN in the number of induced rules, since Tomek links and ENN were applied as data cleaning methods with the objective of eliminating noisy examples and thus simplifying the induced decision trees.
Table 8: Number of rules (branches) for the original and over-sampled data sets and unpruned decision trees.
Data set Original Rand Over Smote Smote+Tomek Smote+ENN
Pima 29.90(6.06) 63.80(13.15) 57.70(11.52) 54.20(12.91) 47.50(8.76)
German 315.50(21.41) 410.60(28.64) 367.30(20.85) 355.10(24.20) 261.00(28.08)
Post-operative 20.40(3.86) 36.80(3.05) 38.60(4.35) 32.70(5.87) 25.90(4.09)
Haberman 7.80(3.79) 25.20(10.94) 23.20(9.61) 25.00(7.70) 30.30(4.92)
Splice-ie 203.50(7.78) 258.70(13.07) 443.20(16.69) 340.60(21.34) 307.90(17.21)
Splice-ei 167.80(9.40) 193.30(7.41) 374.50(20.41) 283.90(14.90) 248.80(12.90)
Vehicle 26.20(3.29) 28.90(2.60) 34.90(3.38) 34.90(3.38) 29.20(2.82)
Letter-vowel 534.50(11.92) 678.80(19.07) 1084.50(19.61) 1083.20(20.12) 1022.00(26.34)
New-thyroid 5.40(0.84) 5.10(0.32) 6.90(1.29) 6.90(1.29) 6.90(0.99)
E-coli 11.60(3.03) 17.70(2.91) 16.70(3.20) 16.50(3.84) 12.70(3.23)
Satimage 198.80(11.04) 252.70(9.33) 404.60(12.97) 404.60(12.97) 339.40(13.80)
Flag 28.60(6.52) 46.30(7.72) 52.50(12.47) 46.50(13.36) 40.30(9.09)
Glass 9.40(2.22) 13.00(1.33) 17.70(1.77) 17.70(1.77) 15.50(1.58)
Letter-a 59.10(3.45) 88.00(5.56) 257.60(15.42) 257.60(15.42) 252.60(18.23)
Nursery 229.40(4.65) 282.50(5.34) 1238.30(28.91) 1204.70(27.94) 766.30(77.24)
Table 9: Mean number of conditions per rule for the original and over-sampled data sets and unpruned decision trees.
Data set Original Rand Over Smote Smote+Tomek Smote+ENN
Pima 6.21(0.61) 7.92(0.64) 7.74(0.44) 7.59(0.54) 7.27(0.67)
German 6.10(0.17) 6.89(0.25) 10.27(0.51) 9.68(0.32) 7.35(0.58)
Post-operative 3.61(0.41) 4.86(0.26) 5.36(0.37) 4.75(0.52) 4.46(0.50)
Haberman 3.45(1.36) 5.71(1.43) 5.61(1.27) 5.81(1.02) 6.45(0.60)
Splice-ie 6.04(0.09) 6.15(0.04) 6.08(0.08) 6.00(0.09) 5.58(0.11)
Splice-ei 5.46(0.14) 5.70(0.08) 5.51(0.07) 5.41(0.09) 4.91(0.09)
Vehicle 7.21(0.70) 7.03(0.44) 7.09(0.50) 7.09(0.50) 6.63(0.38)
Letter-vowel 20.96(1.19) 19.32(0.82) 18.78(0.40) 18.78(0.40) 18.32(0.43)
New-thyroid 2.76(0.39) 2.85(0.17) 3.12(0.26) 3.12(0.26) 3.08(0.20)
E-coli 4.43(0.79) 5.48(0.41) 4.98(0.60) 4.92(0.65) 4.15(0.49)
Satimage 12.13(0.46) 15.93(0.42) 13.89(0.64) 13.89(0.64) 12.54(0.36)
Flag 3.92(0.70) 5.42(0.55) 9.43(1.04) 8.75(1.53) 6.71(1.23)
Glass 4.20(0.61) 5.80(0.51) 5.92(0.50) 5.92(0.50) 5.51(0.32)
Letter-a 7.30(0.22) 10.35(0.64) 10.97(0.38) 10.97(0.38) 10.86(0.36)
Nursery 6.51(0.01) 6.84(0.03) 6.87(0.03) 6.84(0.03) 6.41(0.12)
Figure 6: Mean number of rules for the original and balanced data sets and unpruned decision trees.

Figure 7: Mean number of conditions per rule for original and balanced data sets and unpruned decision trees.
The results presented in Table 9 are shown graphically in Figure 7, allowing a clearer comparison of the mean number of conditions per rule for the over-sampled data sets. The Smote + ENN method provided very good results. In fact, it was the best ranked in 10 data sets. Furthermore, this method was even able to obtain smaller values than those achieved by the decision trees induced from the original data sets in 6 data sets. Moreover, considering only the over-sampled data sets, this method was the best ranked for another 4 data sets.

6. CONCLUSION AND LIMITATIONS

In this work we analyze the behavior of several over- and under-sampling methods to deal with the problem of learning from imbalanced data sets. Our results show that the over-sampling methods in general, and Smote + Tomek and Smote + ENN (two of the methods proposed in this work) in particular for data sets with few positive (minority) examples, provided very good results in practice. Moreover, Random over-sampling, frequently considered an unpromising method, provided results competitive with the more complex methods. As a general recommendation, Smote + Tomek or Smote + ENN might be applied to data sets with a small number of positive instances, a condition that is likely to lead to classification performance problems for imbalanced data sets. For data sets with a larger number of positive examples, the Random over-sampling method, which is computationally less expensive than the other methods, would produce meaningful results.

It should be noted that allocating half of the training examples to the minority class does not always provide optimal results [23]. We plan to address this issue in future research. Furthermore, some under-sampling methods, such as Tomek links and NCL, that do not originally allow the user to specify the resulting class distribution, must be improved to include this feature. Another natural extension to this work is to analyze the ROC curves obtained from the classifiers. This might provide us with a more in-depth understanding of the behavior of the balancing and cleaning methods.

Acknowledgements. We wish to thank the anonymous reviewers and Dorival Leão Pinto Júnior for their helpful comments. This research was partially supported by the Brazilian Research Councils CAPES and FAPESP.
REFERENCES

[8] Drummond, C., and Holte, R. C. C4.5, Class Imbalance, and Cost Sensitivity: Why Under-sampling beats Over-sampling. In Workshop on Learning from Imbalanced Data Sets II (2003).

[9] Ferri, C., Flach, P., and Hernández-Orallo, J. Learning Decision Trees Using the Area Under the ROC Curve. In ICML (2002), pp. 139-146.

[10] Hand, D. J. Construction and Assessment of Classification Rules. John Wiley and Sons, 1997.

[11] Hart, P. E. The Condensed Nearest Neighbor Rule. IEEE Transactions on Information Theory IT-14 (1968), 515-516.

[12] Japkowicz, N. Class Imbalances: Are We Focusing on the Right Issue? In Workshop on Learning from Imbalanced Data Sets II (2003).

[13] Japkowicz, N., and Stephen, S. The Class Imbalance Problem: A Systematic Study. IDA Journal 6, 5 (2002), 429-449.

[14] Kubat, M., and Matwin, S. Addressing the Curse of Imbalanced Training Sets: One-sided Selection. In ICML (1997), pp. 179-186.

[15] Laurikkala, J. Improving Identification of Difficult Small Classes by Balancing Class Distribution. Tech. Rep. A-2001-2, University of Tampere, 2001.

[16] Ling, C. X., and Li, C. Data Mining for Direct Marketing: Problems and Solutions. In KDD (1998), pp. 73-79.

[25] Wilson, D. R., and Martinez, T. R. Reduction Techniques for Exemplar-Based Learning Algorithms. Machine Learning 38, 3 (2000), 257-286.

[26] Zadrozny, B., and Elkan, C. Learning and Making Decisions When Costs and Probabilities are Both Unknown. In KDD (2001), pp. 204-213.
Curve. In ICML (2002), pp. 139–146.