A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data
Gustavo E. A. P. A. Batista
Ronaldo C. Prati
Maria Carolina Monard
Instituto de Ciências Matemáticas e de Computação
Caixa Postal 668, 13560-970
São Carlos - SP, Brazil
{gbatista, prati, mcmonard}@icmc.usp.br
ABSTRACT

There are several aspects that might influence the performance achieved by existing learning systems. It has been reported that one of these aspects is related to class imbalance, in which examples in training data belonging to one class heavily outnumber the examples in the other class. In this situation, which is found in real world data describing an infrequent but important event, the learning system may have difficulties to learn the concept related to the minority class. In this work we perform a broad experimental evaluation involving ten methods, three of them proposed by the authors, to deal with the class imbalance problem in thirteen UCI data sets. Our experiments provide evidence that class imbalance does not systematically hinder the performance of learning systems. In fact, the problem seems to be related to learning with too few minority class examples in the presence of other complicating factors, such as class overlapping. Two of our proposed methods, Smote + Tomek and Smote + ENN, deal with these conditions directly, allying a known over-sampling method with data cleaning methods in order to produce better-defined class clusters. Our comparative experiments show that, in general, over-sampling methods provide more accurate results than under-sampling methods considering the area under the ROC curve (AUC). This result seems to contradict results previously published in the literature. Smote + Tomek and Smote + ENN presented very good results for data sets with a small number of positive examples. Moreover, Random over-sampling, a very simple over-sampling method, is very competitive with more complex over-sampling methods. Since the over-sampling methods provided very good performance results, we also measured the syntactic complexity of the decision trees induced from over-sampled data. Our results show that these trees are usually more complex than the ones induced from the original data. Random over-sampling usually produced the smallest increase in the mean number of induced rules, and Smote + ENN the smallest increase in the mean number of conditions per rule, among the investigated over-sampling methods.

1. INTRODUCTION

Most learning systems usually assume that the training sets used for learning are balanced. However, this is not always the case in real world data, where one class might be represented by a large number of examples while the other is represented by only a few. This is known as the class imbalance problem, and it is often reported as an obstacle to the induction of good classifiers by Machine Learning (ML) algorithms. Generally, the problem of imbalanced data sets occurs when one class represents a circumscribed concept while the other class represents the counterpart of that concept, so that examples from the counterpart class heavily outnumber examples from the class of interest. This sort of data is found, for example, in medical record databases regarding a rare disease, where there is a large number of patients who do not have that disease, and in continuous fault-monitoring tasks, where non-faulty examples heavily outnumber faulty examples.

In recent years, there have been several attempts at dealing with the class imbalance problem in the field of Data Mining and Knowledge Discovery in Databases, to which ML is a substantial contributor. Related papers have been published in the ML literature aiming to overcome this problem. The ML community seems to agree on the hypothesis that the imbalance between classes is the major obstacle to inducing classifiers in imbalanced domains. However, it has also been observed that in some domains, for instance the Sick data set [3], standard ML algorithms are capable of inducing good classifiers even using highly imbalanced training sets. This shows that class imbalance is not the only problem responsible for the decrease in performance of learning algorithms.

In [18] we developed a systematic study aiming to question whether class imbalances hinder classifier induction or whether these deficiencies might be explained in other ways. Our study was developed on a series of artificial data sets in order to fully control all the variables we wanted to analyze. The results of our experiments, using a discrimination-based inductive scheme, suggested that the problem is not solely caused by class imbalance, but is also related to the degree of data overlapping among the classes.

The results obtained in this previous work motivated the proposition of two new methods to deal with the problem of learning in the presence of class imbalance. These methods ally a known over-sampling method, namely Smote [5], with two data cleaning methods: Tomek links [22] and Wilson's Edited Nearest Neighbor Rule [24]. The main motivation behind these methods is not only to balance the training data, but also to remove noisy examples lying on the wrong side of the decision border.
The removal of noisy examples might aid in finding better-defined class clusters, therefore allowing the creation of simpler models with better generalization capabilities.

In addition, in this work we perform a broad experimental evaluation involving ten methods, three of them proposed by the authors, to deal with the class imbalance problem in thirteen UCI data sets. We concluded that over-sampling methods are able to aid in the induction of classifiers that are more accurate than those induced from under-sampled data sets. This result seems to contradict results previously published in the literature. Two of our proposed methods performed well in practice, in particular for data sets with a small number of positive examples. It is worth noting that Random over-sampling, a very simple over-sampling method, is very competitive with more complex over-sampling methods.

The remainder of the paper is organized as follows: Section 2 discusses why learning from imbalanced data sets might be a difficult task. Section 3 describes the drawbacks of using accuracy (or error rate) to measure the performance of classifiers and discusses alternative metrics. Section 4 presents the methods employed in the experimental evaluation, including the three methods proposed by the authors. Section 5 discusses the methodology used in the experiments, as well as the results achieved. Finally, Section 6 presents the conclusions and outlines future research.

2. WHY LEARNING FROM IMBALANCED DATA SETS MIGHT BE DIFFICULT

Learning from imbalanced data sets is often reported as being a difficult task. In order to better understand this problem, imagine the situation illustrated in Figure 1. In Fig. 1(a) there is a large imbalance between the majority class (-) and the minority class (+), and the data set presents some degree of class overlapping. A much more comfortable situation for learning is represented in Fig. 1(b), where the classes are balanced with well-defined clusters.

In a situation similar to the one illustrated in Fig. 1(a), spare cases from the minority class may confuse a classifier like k-Nearest Neighbor (k-NN). For instance, 1-NN may incorrectly classify many cases from the minority class because the nearest neighbors of these cases are examples belonging to the majority class. In a situation where the imbalance is very high, the probability that the nearest neighbor of a minority class case belongs to the majority class is likely to be high, and the minority class error rate will tend to have high values, which is unacceptable.

Figure 1: Many negative cases against some spare positive cases (a); balanced data set with well-defined clusters (b).

Decision trees also experience a similar problem. In the presence of class overlapping, decision trees may need to create many tests to distinguish the minority class cases from majority class cases. Pruning the decision tree might not necessarily alleviate the problem. This is due to the fact that pruning removes some branches considered too specialized, labelling new leaf nodes with the dominant class of the corresponding node. Thus, there is a high probability that the majority class will also be the dominant class of those leaf nodes.
3. ON EVALUATING CLASSIFIERS IN IMBALANCED DOMAINS

The most straightforward way to evaluate the performance of classifiers is based on confusion matrix analysis. Table 1 illustrates a confusion matrix for a two-class problem having positive and negative class values. From such a matrix it is possible to extract a number of widely used metrics for measuring the performance of learning systems, such as Error Rate, defined as Err = (FP + FN)/(TP + FN + FP + TN), and Accuracy, defined as Acc = (TP + TN)/(TP + FN + FP + TN) = 1 - Err.

Table 1: Confusion matrix for a two-class problem.

                    Positive Prediction     Negative Prediction
    Positive Class  True Positive (TP)      False Negative (FN)
    Negative Class  False Positive (FP)     True Negative (TN)

However, when the prior class probabilities are very different, the use of such measures might lead to misleading conclusions. Error rate and accuracy are particularly suspicious performance measures when studying the effect of class distribution on learning, since they are strongly biased in favor of the majority class. For instance, it is straightforward to create a classifier having an accuracy of 99% (or an error rate of 1%) in a domain where the majority class proportion corresponds to 99% of the examples, simply by forecasting every new example as belonging to the majority class.

Another argument against the use of accuracy (or error rate) is that these metrics consider different classification errors to be equally important. However, highly imbalanced problems generally have highly non-uniform error costs that favor the minority class, which is often the class of primary interest.

Finally, another point that should be considered when studying the effect of class distribution on learning systems is that the class distribution may change. Consider the confusion matrix shown in Table 1. Note that the class distribution (the proportion of positive to negative examples) is the relationship between the first and second rows. Any performance metric that uses values from both rows will be inherently sensitive to class skews. Metrics such as accuracy and error rate use values from both rows of the confusion matrix. As the class distribution changes, these measures will change as well, even if the fundamental classifier performance does not. All things considered, it would be more interesting to use a performance metric that disassociates the errors (or hits) that occur in each class. From Table 1 it is possible to derive four performance metrics that directly measure the classification performance on the positive and negative classes independently:
False negative rate: FNrate = FN/(TP + FN) is the percentage of positive cases misclassified as belonging to the negative class;

False positive rate: FPrate = FP/(FP + TN) is the percentage of negative cases misclassified as belonging to the positive class;

True negative rate: TNrate = TN/(FP + TN) is the percentage of negative cases correctly classified as belonging to the negative class;

True positive rate: TPrate = TP/(TP + FN) is the percentage of positive cases correctly classified as belonging to the positive class.
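As a minimal illustration, the metrics above can be computed directly from the four counts of Table 1. This sketch is ours (the function name and example figures are not from the paper):

    def confusion_metrics(tp, fn, fp, tn):
        """Compute the Section 3 metrics from the counts of Table 1."""
        total = tp + fn + fp + tn
        err = (fp + fn) / total          # Error rate
        acc = (tp + tn) / total          # Accuracy = 1 - Err
        fn_rate = fn / (tp + fn)         # positive cases predicted as negative
        fp_rate = fp / (fp + tn)         # negative cases predicted as positive
        tn_rate = tn / (fp + tn)         # negative cases predicted as negative
        tp_rate = tp / (tp + fn)         # positive cases predicted as positive
        return {"Err": err, "Acc": acc, "FNrate": fn_rate,
                "FPrate": fp_rate, "TNrate": tn_rate, "TPrate": tp_rate}

    # A degenerate classifier that always predicts the majority class of a
    # 99%/1% problem reaches 99% accuracy while its TPrate is 0.
    print(confusion_metrics(tp=0, fn=10, fp=0, tn=990))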
These four performance measures have the advantage of being independent of class costs and prior probabilities. The aim of a classifier is to minimize the false positive and false negative rates or, equivalently, to maximize the true negative and true positive rates. Unfortunately, for most real world applications there is a tradeoff between FNrate and FPrate and, similarly, between TNrate and TPrate. ROC (Receiver Operating Characteristic) graphs [19] can be used to analyze the relationship between FNrate and FPrate (or TNrate and TPrate) for a classifier.

Some classifiers, such as the Naïve Bayes classifier or some Neural Networks, yield a score that represents the degree to which an example is a member of a class. Such a ranking can be used to produce several classifiers by varying the threshold above which an example is assigned to a class. Each threshold value produces a different point in ROC space. These points are linked by tracing straight lines through consecutive points to produce a ROC curve. For decision trees, we could use the class distributions at each leaf as a score or, as proposed in [9], order the leaves by their positive class accuracy and produce several trees by re-labelling the leaves, one at a time, from all forecasting the negative class to all forecasting the positive class, in positive accuracy order.

A ROC graph characterizes the performance of a binary classification model across all possible trade-offs between the classifier sensitivity (TPrate) and false alarm rate (FPrate). ROC graphs are consistent for a given problem, even if the distribution of positive and negative examples is highly skewed. ROC analysis also allows the performance of multiple classification functions to be visualized and compared simultaneously. The area under the ROC curve (AUC) represents the expected performance as a single scalar. The AUC has a known statistical meaning: it is equivalent to the Wilcoxon test of ranks, and is equivalent to several other statistical measures for evaluating classification and ranking models [10]. In this work, we use the method proposed in [9], with the Laplace correction for measuring leaf accuracy, to produce ROC curves. We also use the AUC as the main method for assessing our experiments.
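The threshold-sweeping construction described above can be sketched as follows. This is our illustration of the generic score-based ROC curve, not the leaf-relabelling procedure of [9]; it assumes distinct scores and integrates the curve with the trapezoidal rule:

    def roc_points(scores, labels):
        """(FPrate, TPrate) points obtained by varying the decision threshold.

        scores: higher means 'more positive'; labels: 1 = positive, 0 = negative.
        Assumes distinct scores (ties would need to be grouped).
        """
        pos = sum(labels)
        neg = len(labels) - pos
        ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
        points, tp, fp = [(0.0, 0.0)], 0, 0
        for _, y in ranked:               # each prefix of the ranking is one threshold
            if y == 1:
                tp += 1
            else:
                fp += 1
            points.append((fp / neg, tp / pos))
        return points

    def auc(points):
        """Area under the ROC curve by the trapezoidal rule."""
        area = 0.0
        for (x0, y0), (x1, y1) in zip(points, points[1:]):
            area += (x1 - x0) * (y0 + y1) / 2.0
        return area

    pts = roc_points(scores=[0.9, 0.8, 0.7, 0.3, 0.2], labels=[1, 1, 0, 1, 0])
    print(auc(pts))   # 0.833... for this small example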
4. METHODS

This section describes the notation used, as well as our implementation of the k-NN algorithm, since this algorithm plays an important role in the behavior of the methods considered. Finally, an explanation of each balancing method is given.

4.1 Notation

In supervised learning, the inducer is fed with a data set E = {E1, E2, ... EN}, in which each example Ei ∈ E has an associated label. This label defines the class the example belongs to. Each example Ei ∈ E is a tuple Ei = (xi, yi) in which xi is a vector of feature (or attribute) values of the example Ei, and yi is its class value. The objective of a supervised learning algorithm is to induce a general mapping of vectors x to values y. Thus, the learning system aims to construct a model y = f(x) of an unknown function f, also known as the concept function, that enables one to predict the y values for previously unseen examples. In practice, learning systems are able to induce a function h that approximates f, i.e., h(x) ≈ f(x). In this case, h is called the hypothesis of the concept function f.

In Table 2 a data set with N examples and M attributes is presented. Columns (A1, ... AM) represent the attributes and rows (E1, ... EN) represent the examples. For instance, row i in Table 2 refers to the ith example, and the entry xij refers to the value of the jth attribute, Aj, of example i. For classification problems, the class attribute Y is a qualitative attribute that may assume a set of NCl discrete values C = {C1, C2, ... CNCl}.

Table 2: Data set in attribute-value form.

          A1     A2     ...    AM     Y
    E1    x11    x12    ...    x1M    y1
    E2    x21    x22    ...    x2M    y2
    ...   ...    ...    ...    ...    ...
    EN    xN1    xN2    ...    xNM    yN

As stated earlier, in this work we consider two-class problems where C1 = + represents the circumscribed concept class and C2 = - represents the counterpart of that concept. Furthermore, the examples from the negative class outnumber the examples from the positive class.

4.2 Our Implementation of the k-NN Algorithm

Several research papers use the Euclidean distance as the distance metric for the k-NN algorithm. However, this distance function might not be appropriate when the domain presents qualitative attributes. For those domains, the distance for qualitative attributes is usually calculated using the overlap function, which assigns the value 0 if two examples have the same value for a given attribute and the value 1 if these values differ. Our implementation of the k-NN algorithm uses the Heterogeneous Value Difference Metric (HVDM) distance function [25]. This distance function uses the Euclidean distance for quantitative attributes and the VDM distance [21] for qualitative attributes. The VDM metric provides a more appropriate distance function for qualitative attributes than the overlap metric, since the VDM metric considers the classification similarity of each possible value of a qualitative attribute when calculating the distances between these values.
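A simplified sketch of an HVDM-style distance is given below. It is our own illustration: the 4-sigma scaling for quantitative attributes and the squared-difference form of VDM follow the usual presentation of [25], but the exact normalization used in the paper's implementation may differ.

    import math
    from collections import Counter, defaultdict

    def hvdm_builder(X, y, numeric):
        """Build a simplified HVDM-style distance from training data.

        X: list of attribute-value tuples, y: class labels,
        numeric: set of attribute indexes treated as quantitative.
        """
        n_attrs = len(X[0])
        classes = sorted(set(y))
        # Standard deviation per quantitative attribute (for the 4*sigma scaling).
        sigma = {}
        for a in numeric:
            vals = [row[a] for row in X]
            mean = sum(vals) / len(vals)
            sigma[a] = math.sqrt(sum((v - mean) ** 2 for v in vals) / len(vals)) or 1.0
        # Value/class counts per qualitative attribute (for the VDM part).
        counts = defaultdict(Counter)          # (attr, value) -> Counter over classes
        for row, cls in zip(X, y):
            for a in range(n_attrs):
                if a not in numeric:
                    counts[(a, row[a])][cls] += 1

        def vdm(a, u, v):
            cu, cv = counts[(a, u)], counts[(a, v)]
            nu, nv = sum(cu.values()) or 1, sum(cv.values()) or 1
            return math.sqrt(sum((cu[c] / nu - cv[c] / nv) ** 2 for c in classes))

        def distance(p, q):
            total = 0.0
            for a in range(n_attrs):
                if a in numeric:
                    d = abs(p[a] - q[a]) / (4.0 * sigma[a])
                else:
                    d = vdm(a, p[a], q[a])
                total += d * d
            return math.sqrt(total)

        return distance

    # Usage: dist = hvdm_builder(X_train, y_train, numeric={0, 2}); dist(x1, x2)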
Another refinement to the basic k-NN algorithm is to weigh the contribution of each of the k neighbors according to their distance to the query example Eq, giving greater weight to closer neighbors. The vote of each neighbor is weighed according to the inverse square of its distance from Eq [17]. Given Ê = {Ê1, Ê2, ... Êk}, the set of k nearest neighbors of Eq according to the distance function d, the final classification is given by Equation 1:

    h(Eq) = argmax_{c ∈ C} Σ_{i=1}^{k} ωi δ(c, f(Êi)),   where   ωi = 1 / d(Eq, Êi)²        (1)

and δ(a, b) = 1 if a = b, and δ(a, b) = 0 otherwise.

As the balancing methods make heavy use of distance computations, we implemented an indexing structure, namely the M-tree [6], to speed up the execution of k-NN queries. The M-tree only considers relative distances between examples, rather than their absolute positions in a multi-dimensional space, to organize and partition the metric space. In a metric space, example proximity is defined only by a distance function that satisfies the positivity, symmetry and triangle inequality postulates.
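A minimal sketch of the weighted vote of Equation 1 follows. The names are ours, and the linear scan over the data stands in for the M-tree index used in the actual implementation:

    import heapq
    from collections import defaultdict

    def knn_classify(query, data, labels, dist, k=3):
        """Distance-weighted k-NN vote of Equation 1 (our sketch).

        dist is any distance function, e.g. the HVDM sketch above.
        """
        neighbours = heapq.nsmallest(
            k, zip(data, labels), key=lambda pair: dist(query, pair[0]))
        votes = defaultdict(float)
        for example, cls in neighbours:
            d = dist(query, example)
            if d == 0.0:                 # exact match decides immediately
                return cls
            votes[cls] += 1.0 / (d * d)  # weight omega_i = 1 / d^2
        return max(votes, key=votes.get)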
4.3 Methods

In this work, we evaluate ten different methods of under- and over-sampling to balance the class distribution in the training data. Two of these methods, Random over-sampling and Random under-sampling, are non-heuristic methods that were initially included in this evaluation as baseline methods. The evaluated methods are described next.

Random over-sampling: a non-heuristic method that aims to balance the class distribution through the random replication of minority class examples.

Random under-sampling: also a non-heuristic method, which aims to balance the class distribution through the random elimination of majority class examples.

Several authors agree that Random over-sampling can increase the likelihood of overfitting, since it makes exact copies of the minority class examples. In this way, a symbolic classifier, for instance, might construct rules that are apparently accurate but actually cover a single replicated example. On the other hand, the major drawback of Random under-sampling is that this method can discard potentially useful data that could be important for the induction process. The remaining balancing methods use heuristics in order to overcome the limitations of the non-heuristic methods.
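A minimal sketch of the two non-heuristic baselines (our own illustration; the seed and function names are ours):

    import random

    def random_over_sample(majority, minority, seed=0):
        """Replicate minority class examples at random until the classes are balanced."""
        rng = random.Random(seed)
        extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
        return majority, minority + extra

    def random_under_sample(majority, minority, seed=0):
        """Randomly discard majority class examples until the classes are balanced."""
        rng = random.Random(seed)
        return rng.sample(majority, len(minority)), minority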
Tomek links: Tomek links [22] can be defined as follows: given two examples Ei and Ej belonging to different classes, let d(Ei, Ej) be the distance between Ei and Ej. The pair (Ei, Ej) is called a Tomek link if there is no example El such that d(Ei, El) < d(Ei, Ej) or d(Ej, El) < d(Ei, Ej). If two examples form a Tomek link, then either one of these examples is noise or both examples are borderline. Tomek links can be used as an under-sampling method or as a data cleaning method. As an under-sampling method, only examples belonging to the majority class are eliminated; as a data cleaning method, examples of both classes are removed.
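The identification of Tomek links can be sketched as follows (our illustration; a quadratic scan, with ties between equal distances resolved arbitrarily):

    def tomek_links(data, labels, dist):
        """Return the index pairs (i, j) that form Tomek links.

        A pair of examples from different classes is a Tomek link when each is
        the other's nearest neighbour, i.e. no third example is closer to either.
        """
        n = len(data)
        nn = [min((j for j in range(n) if j != i),
                  key=lambda j: dist(data[i], data[j])) for i in range(n)]
        links = []
        for i in range(n):
            j = nn[i]
            if labels[i] != labels[j] and nn[j] == i and i < j:
                links.append((i, j))
        return links

    # Under-sampling: drop only the majority-class member of each link.
    # Data cleaning: drop both members of each link.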
Condensed Nearest Neighbor Rule: Hart's Condensed Nearest Neighbor Rule (CNN) [11] is used to find a consistent subset of examples. A subset Ê ⊆ E is consistent with E if, using a 1-nearest neighbor classifier, Ê correctly classifies the examples in E. An algorithm to create a subset Ê from E as an under-sampling method is the following [14]: first, randomly draw one majority class example and all examples from the minority class and put these examples in Ê. Afterwards, use a 1-NN over the examples in Ê to classify the examples in E. Every misclassified example from E is moved to Ê. It is important to note that this procedure does not find the smallest consistent subset of E. The idea behind this implementation of a consistent subset is to eliminate the examples from the majority class that are distant from the decision border, since these sorts of examples might be considered less relevant for learning.
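The single-pass construction just described can be sketched as follows (our illustration, under the assumption that the majority list is larger than the minority list):

    import random

    def cnn_subset(majority, minority, dist, seed=0):
        """Single-pass CNN under-sampling as described above [11, 14] (our sketch).

        Start with all minority examples plus one random majority example; every
        majority example that the current subset misclassifies with 1-NN is added.
        """
        rng = random.Random(seed)
        subset = [(x, "min") for x in minority] + [(rng.choice(majority), "maj")]

        def one_nn_label(x):
            return min(subset, key=lambda pair: dist(x, pair[0]))[1]

        for x in majority:
            if one_nn_label(x) != "maj":   # misclassified: it lies near the border, keep it
                subset.append((x, "maj"))
        return subset                       # (example, class) pairs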
One-sided selection: One-sided selection (OSS) [14] is an under-sampling method resulting from the application of Tomek links followed by the application of CNN. Tomek links are used as an under-sampling method and remove noisy and borderline majority class examples. Borderline examples can be considered "unsafe", since a small amount of noise can make them fall on the wrong side of the decision border. CNN aims to remove examples from the majority class that are distant from the decision border. The remaining examples, i.e., "safe" majority class examples and all minority class examples, are used for learning.

CNN + Tomek links: This is one of the methods proposed in this work. It is similar to one-sided selection, but the method to find the consistent subset is applied before the Tomek links. Our objective is to verify its competitiveness with OSS. As finding Tomek links is computationally demanding, it would be computationally cheaper to perform it on a reduced data set.

Neighborhood Cleaning Rule: The Neighborhood Cleaning Rule (NCL) [15] uses Wilson's Edited Nearest Neighbor Rule (ENN) [24] to remove majority class examples. ENN removes any example whose class label differs from the class of at least two of its three nearest neighbors. NCL modifies ENN in order to intensify the data cleaning. For a two-class problem the algorithm can be described in the following way: for each example Ei in the training set, its three nearest neighbors are found. If Ei belongs to the majority class and the classification given by its three nearest neighbors contradicts the original class of Ei, then Ei is removed. If Ei belongs to the minority class and its three nearest neighbors misclassify Ei, then the nearest neighbors that belong to the majority class are removed.
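The NCL rule just described can be sketched as follows (our illustration; the function name and the k = 3 default are ours):

    def ncl_to_remove(data, labels, dist, minority_label, k=3):
        """Neighborhood Cleaning Rule [15] as described above (our sketch).

        Returns the set of indexes of majority-class examples to discard.
        """
        n = len(data)
        drop = set()
        for i in range(n):
            neigh = sorted((j for j in range(n) if j != i),
                           key=lambda j: dist(data[i], data[j]))[:k]
            agree = sum(labels[j] == labels[i] for j in neigh)
            misclassified = agree < (k + 1) // 2    # at least 2 of 3 neighbours disagree
            if labels[i] != minority_label and misclassified:
                drop.add(i)                         # noisy majority example
            elif labels[i] == minority_label and misclassified:
                drop.update(j for j in neigh if labels[j] != minority_label)
        return drop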
Smote: The Synthetic Minority Over-sampling Technique (Smote) [5] is an over-sampling method. Its main idea is to form new minority class examples by interpolating between several minority class examples that lie together. Because new examples are synthesized rather than replicated, the overfitting problem is avoided, and the decision boundaries for the minority class spread further into the majority class space.
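A minimal sketch of the interpolation idea (our illustration, for purely quantitative attributes and assuming at least two minority examples; qualitative attributes would need the VDM-style treatment of Section 4.2):

    import random

    def smote(minority, n_new, k=5, seed=0):
        """Generate n_new synthetic minority examples by interpolation (Smote [5], our sketch).

        minority: list of numeric feature vectors. Each synthetic example lies on
        the segment between a minority example and one of its k nearest minority
        neighbours (Euclidean distance).
        """
        rng = random.Random(seed)

        def dist(a, b):
            return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5

        synthetic = []
        for _ in range(n_new):
            base = rng.choice(minority)
            neighbours = sorted((x for x in minority if x is not base),
                                key=lambda x: dist(base, x))[:k]
            other = rng.choice(neighbours)
            gap = rng.random()                      # random position along the segment
            synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
        return synthetic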
Smote + Tomek links: Although over-sampling minority class examples can balance class distributions, some other problems usually present in data sets with skewed class distributions are not solved. Frequently, class clusters are not well defined, since some majority class examples might be invading the minority class space. The opposite can also be true, since interpolating minority class examples can expand the minority class clusters, introducing artificial minority class examples too deeply into the majority class space. Inducing a classifier under such a situation can lead to overfitting. In order to create better-defined class clusters, we propose applying Tomek links to the over-sampled training set as a data cleaning method. Thus, instead of removing only the majority class examples that form Tomek links, examples from both classes are removed. The application of this method is illustrated in Figure 2. First, the original data set (a) is over-sampled with Smote (b), and then Tomek links are identified (c) and removed, producing a balanced data set with well-defined class clusters (d). The Smote + Tomek links method was first used to improve the classification of examples for the problem of annotation of proteins in Bioinformatics [1].

Figure 2: Balancing a data set: original data set (a); over-sampled data set (b); Tomek links identification (c); and borderline and noise examples removal (d).

Smote + ENN: The motivation behind this method is similar to that of Smote + Tomek links. ENN tends to remove more examples than Tomek links do, so it is expected to provide a more in-depth data cleaning. Differently from NCL, which is an under-sampling method, ENN is used to remove examples from both classes. Thus, any example that is misclassified by its three nearest neighbors is removed from the training set.
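A sketch of the proposed combination, relying on the smote() and tomek_links() sketches above (our illustration; it assumes the majority class is the larger one):

    def smote_tomek(majority, minority, dist):
        """Smote + Tomek links (our sketch of the proposed method).

        1. Over-sample the minority class with Smote until the classes are balanced.
        2. Find Tomek links in the enlarged data set and drop BOTH members of each
           link, cleaning examples of either class that lie on the wrong side of
           the decision border.
        """
        new_minority = minority + smote(minority, len(majority) - len(minority))
        data = majority + new_minority
        labels = [0] * len(majority) + [1] * len(new_minority)
        linked = {i for pair in tomek_links(data, labels, dist) for i in pair}
        return [(x, y) for i, (x, y) in enumerate(zip(data, labels)) if i not in linked]

    # Smote + ENN is analogous: replace the Tomek-link step with ENN-style cleaning
    # (remove every example misclassified by its three nearest neighbours),
    # applied to both classes.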
5. EXPERIMENTAL EVALUATION

The main objective of our research is to compare several balancing methods published in the literature, as well as the three proposed methods, in order to verify whether those methods can effectively deal in practice with the problem of class imbalance. To make this comparison, we have selected thirteen data sets from UCI [3] which have different degrees of imbalance. Table 3 summarizes the data employed in this study. For each data set, it shows the number of examples (#Examples), the number of attributes (#Attributes), the number of quantitative and qualitative attributes, the class attribute distribution and the majority class error. For data sets having more than two classes, we chose the class with fewer examples as the positive class and collapsed the remainder into the negative class. As the Letter and Splice data sets have a similar number of examples in the minority classes, we created two data sets from each of them: Letter-a and Letter-vowel, and Splice-ie and Splice-ei.

In our experiments, we used release 8 of the C4.5 symbolic learning algorithm to induce decision trees [20]. Firstly, we ran C4.5 over the original (imbalanced) data sets and calculated the AUC for each data set using 10-fold cross-validation. The results obtained in this initial experiment are shown in Figure 3.

Figure 3 plots the proportion of negative/positive examples versus the mean AUC values for the original data sets. If class imbalance systematically hindered the performance of learning systems, the AUC would be expected to decrease for highly imbalanced data sets. However, in spite of a large degree of imbalance, the data sets Letter-a and Nursery obtained almost 100% AUC.

Figure 3: Proportion of negative/positive examples versus AUC.
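As an illustration only of this measurement protocol: the experiments in the paper use C4.5 release 8 on the UCI data sets, whereas the sketch below uses scikit-learn's CART-style DecisionTreeClassifier on synthetic imbalanced data, so the numbers will differ.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    # Synthetic imbalanced data standing in for a UCI data set (95% negative class).
    X, y = make_classification(n_samples=2000, n_features=10, weights=[0.95],
                               random_state=42)
    tree = DecisionTreeClassifier(random_state=42)
    auc_scores = cross_val_score(tree, X, y, cv=10, scoring="roc_auc")
    print(f"mean AUC = {auc_scores.mean():.3f} +/- {auc_scores.std():.3f}")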
The results obtained on the UCI data sets seem to be compatible with previous work of the authors [18], conducted as a series of experiments with artificial domains in which we varied the degree of overlapping between the classes. It was concluded that class imbalance, by itself, does not seem to be a problem, but when allied to highly overlapped classes it can significantly decrease the number of minority class examples correctly classified. Domains with non-overlapping classes do not seem to be problematic for learning, no matter the degree of imbalance. Moreover, in [12] Japkowicz performed several experiments on artificial data sets and concluded that class imbalances do not seem to systematically cause performance degradation. She concludes that the imbalance problem is a relative problem depending both on the complexity of the concept (where the "concept complexity" corresponds to the number of subclusters into which the classes are subdivided) and on the overall size of the training set.

The relationship between training set size and improper classification performance for imbalanced data sets seems to be that, on small imbalanced data sets, the minority class is poorly represented by an excessively reduced number of examples, which might not be sufficient for learning, especially when a large degree of class overlapping exists and the class is further divided into subclusters. For larger data sets, the effect of these complicating factors seems to be reduced, as the minority class is better represented by a larger number of examples. This trend is confirmed by the graph shown in Figure 4, which shows how the AUC is affected by the number of positive training examples in the data sets.

In a second stage, the over- and under-sampling methods described in Section 4 were applied to the original data sets. Smote, Random over-sampling, Random under-sampling and CNN have internal parameters that allow the user to set up the class distribution resulting from the application of these methods. We decided to add/remove examples until a balanced distribution was reached. This decision is motivated by the results presented in [23], in which it is shown that when AUC is used as the performance measure, the best class distribution for learning tends to be near the balanced class distribution.
Table 3: Data sets summary descriptions.
Data set #Examples #Attributes Class Class % Majority
(quanti., quali.) (min., maj.) (min., maj.) Error
Pima 768 8 (8,0) (1, 0) (34.77%, 65.23%) 65.23%
German 1000 20 (7,13) (Bad, Good) (30.00%, 70.00%) 70.00%
Post-operative 90 8 (1,7) (S, remainder) (26.67%, 73.33%) 73.33%
Haberman 306 3 (3,0) (Die, Survive) (26.47%, 73.53%) 73.53%
Splice-ie 3176 60 (0,60) (ie, remainder) (24.09%, 75.91%) 75.91%
Splice-ei 3176 60 (0,60) (ei, remainder) (23.99%, 76.01%) 76.01%
Vehicle 846 18 (18,0) (van, remainder) (23.52%, 76.48%) 76.48%
Letter-vowel 20000 16 (16,0) (all vowels, remainder) (19.39%, 80.61%) 80.61%
New-thyroid 215 5 (5,0) (hypo, remainder) (16.28%, 83.72%) 83.72%
E.Coli 336 7 (7,0) (iMU, remainder) (10.42%, 89.58%) 89.58%
Satimage 6435 36 (36,0) (4, remainder) (9.73%, 90.27%) 90.27%
Flag 194 28 (10,18) (white, remainder) (8.76%, 91.24%) 91.24%
Glass 214 9 (9,0) (Ve-win-float-proc, remainder) (7.94%, 92.06%) 92.06%
Letter-a 20000 16 (16,0) (a, remainder) (3.95%, 96.05%) 96.05%
Nursery 12960 8 (8,0) (not recom, remainder) (2.55%, 97.45%) 97.45%
Figure 4: Number of positive examples versus AUC.

Figure 5 shows a comparison of the effect of pruning decision trees on the original and balanced data sets. The line x = y represents the case in which the pruned and unpruned decision trees obtain the same AUC. Points above this line indicate that unpruned decision trees obtained better results, and points under this line the opposite. Figure 5 clearly shows that pruning rarely leads to an improvement in AUC for the original and balanced data sets.

Figure 5: AUC of pruned versus unpruned decision trees for the original and balanced data sets.
Statistical tests were used to compare the best ranked method with the remaining methods. The results are also summarized in Tables 6 and 7, where the methods marked with an asterisk obtained statistically inferior results when compared to the top ranked method.
Conversely, the over-sampling methods in general, and Random over-sampling in particular, are well ranked among the remaining methods. This result seems to diverge from several papers previously published in the literature. Drummond and Holte [8] report that, when using C4.5's default settings, over-sampling is surprisingly ineffective, often producing little or no change in performance in response to modifications of misclassification costs and class distribution. Moreover, they note that over-sampling prunes less and therefore generalizes less than under-sampling, and that a modification of C4.5's parameter settings to increase the influence of pruning and other overfitting avoidance factors can re-establish the performance of over-sampling. In our experiments, Random over-sampling did not produce overfitted decision trees even when these trees were left unpruned, as confirmed by the higher AUC values obtained by this method for unpruned trees. In addition, under-sampling methods did not perform as well as over-sampling methods, even when heuristics to remove cases were considered in under-sampling.

Moreover, Domingos [7] reports that, for concept learning problems, C4.5Rules produces lower cost classifiers using under-sampling than over-sampling. Ling and Li [16] compare over- and under-sampling for boosted C4.5 and report that under-sampling produces a better lift index, although extreme over-sampling performs almost as well. On the other hand, Japkowicz and Stephen [13] compare several methods of over- and under-sampling on a series of artificial data sets and conclude that over-sampling is more effective than under-sampling at reducing error rate.

In our opinion, the good results obtained by over-sampling are not completely unexpected. As stated before, it seems that the loss of performance is directly related to the lack of minority class examples in conjunction with other complicating factors. Over-sampling is the class of methods that most directly attacks the problem of the lack of minority class examples.

It is worth mentioning that two of our proposed methods, Smote + Tomek and Smote + ENN, are generally ranked among the best for data sets with a small number of positive examples. Considering only the data sets with less than 100 positive examples (in our experiments there are 6 of them: Flag, Glass, Post-operative, New-thyroid, E.Coli and Haberman), at least one of the proposed methods provided meaningful results for all 6 data sets for pruned trees (Table 6), and for 5 of the 6 data sets for unpruned trees (Table 7). This seems to indicate that these methods could be appropriate in domains having such conditions.

Since the over-sampling methods, as well as the unpruned decision trees, obtained very good performance results, further analysis will focus on these results. In addition to classifier performance, we also attempted to measure the syntactic complexity of the induced models. Syntactic complexity is given by two main parameters: the mean number of induced rules (branches) and the mean number of conditions per rule. Tables 8 and 9 respectively show the mean number of induced rules and the mean number of conditions per rule for the over-sampling methods and the original data sets with unpruned decision trees. The best results are shown in bold, and the best results obtained by an over-sampling method, not considering the results obtained on the original data sets, are highlighted in light gray.

Figure 6 shows the results in Table 8 in graphical form, where it can be observed that over-sampled data sets usually lead to an increase in the number of induced rules compared to the trees induced from the original data sets.
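The two syntactic complexity measures correspond to the number of leaves of the tree (rules) and the average depth of its leaves (conditions per rule). The sketch below shows how such a measurement could be taken; it is our illustration only, using a scikit-learn CART-style tree and synthetic data rather than C4.5 release 8 and the UCI data sets.

    from sklearn.datasets import make_classification
    from sklearn.tree import DecisionTreeClassifier

    def rule_complexity(fitted_tree):
        """Number of rules (leaves) and mean conditions per rule (mean leaf depth)."""
        t = fitted_tree.tree_
        depths, stack = [], [(0, 0)]              # (node id, depth)
        while stack:
            node, depth = stack.pop()
            left, right = t.children_left[node], t.children_right[node]
            if left == right:                     # leaf node
                depths.append(depth)
            else:
                stack.append((left, depth + 1))
                stack.append((right, depth + 1))
        return len(depths), sum(depths) / len(depths)

    X, y = make_classification(n_samples=500, weights=[0.9], random_state=42)
    clf = DecisionTreeClassifier(random_state=42).fit(X, y)   # unpruned, as in Tables 8 and 9
    n_rules, conds_per_rule = rule_complexity(clf)
    print(n_rules, round(conds_per_rule, 2))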
Table 5: AUC results for the under-sampled data sets.
Data set Pruning Rand Under CNN CNN+Tomek Tomek OSS NCL
yes 81.17(3.87) 79.60(6.22) 80.30(3.86) 82.56(5.11) 77.89(5.37) 81.61(4.48)
Pima
no 81.49(4.29) 80.08(5.82) 81.71(3.69) 83.11(4.65) 79.23(4.81) 82.55(3.53)
yes 79.85(3.05) 79.85(5.56) 79.48(5.01) 78.87(4.27) 79.20(3.15) 77.89(3.85)
German
no 84.54(3.32) 82.25(5.59) 81.70(4.00) 85.90(3.99) 82.96(3.22) 85.07(3.54)
yes 49.11(14.07) 49.20(8.91) 49.02(11.34) 46.16(5.89) 46.31(18.77) 42.34(28.12)
Post-operative
no 55.52(24.47) 65.69(21.64) 75.79(16.86) 66.45(23.29) 64.44(20.88) 45.62(32.71)
yes 66.07(10.26) 58.36(10.26) 55.73(14.31) 64.46(10.95) 62.70(11.50) 68.01(13.99)
Haberman
no 68.40(10.17) 58.36(10.26) 55.73(14.31) 69.59(13.30) 62.03(11.82) 69.29(14.13)
yes 97.46(1.10) 98.39(0.64) 97.55(0.46) 98.69(0.51) 97.37(0.84) 98.38(0.57)
Splice-ie
no 98.80(0.40) 99.17(0.36) 98.82(0.32) 99.18(0.43) 98.93(0.30) 99.15(0.36)
yes 98.74(0.46) 98.78(0.46) 98.85(0.42) 98.78(0.46) 98.83(0.45) 98.77(0.47)
Splice-ei
no 99.25(0.48) 99.27(0.77) 99.47(0.27) 99.44(0.60) 99.33(0.66) 99.40(0.66)
yes 97.25(1.95) 98.62(0.67) 98.34(1.32) 98.26(0.90) 98.79(0.67) 97.94(1.05)
Vehicle
no 97.80(0.94) 98.64(0.63) 98.42(1.02) 98.41(0.90) 98.71(0.97) 98.17(1.12)
yes 97.69(0.43) 98.03(0.37) 97.97(0.46) 98.18(0.53) 97.66(0.30) 98.17(0.30)
Letter-vowel
no 98.26(0.28) 98.49(0.31) 98.39(0.22) 98.90(0.18) 98.27(0.19) 98.81(0.17)
yes 94.87(5.00) 94.79(10.14) 94.54(10.10) 94.73(9.24) 92.72(10.55) 93.44(9.74)
New-thyroid
no 94.87(5.00) 94.79(10.14) 94.54(10.10) 94.98(9.38) 92.72(10.55) 93.69(9.90)
yes 88.75(12.45) 80.32(19.96) 80.34(19.85) 91.57(7.81) 83.97(21.27) 91.73(8.00)
E.Coli
no 88.64(12.46) 81.13(20.00) 81.95(19.90) 94.03(5.56) 83.76(21.17) 92.04(8.15)
yes 92.34(1.27) 92.25(1.45) 92.73(1.38) 94.21(1.76) 92.85(1.19) 94.42(1.53)
Satimage
no 92.86(1.29) 92.35(1.35) 92.90(1.38) 95.11(1.29) 92.84(1.22) 95.06(1.27)
yes 71.13(28.95) 49.12(21.57) 75.85(30.26) 45.00(15.81) 45.00(15.81) 44.47(15.71)
Flag
no 78.35(29.98) 78.90(28.63) 75.64(29.37) 78.59(28.75) 81.73(29.51) 76.13(27.80)
yes 82.44(8.99) 58.44(13.15) 72.69(14.07) 87.15(16.47) 72.16(16.84) 91.67(12.76)
Glass
no 80.47(13.25) 64.31(14.21) 75.44(11.61) 87.00(16.75) 78.76(12.52) 91.67(12.76)
yes 99.35(0.48) 99.60(0.37) 99.61(0.37) 99.61(0.40) 99.66(0.46) 99.60(0.40)
Letter-a
no 99.46(0.42) 99.66(0.37) 99.65(0.38) 99.67(0.37) 99.67(0.45) 99.67(0.37)
yes 97.52(0.82) 99.55(0.21) 98.77(0.35) 99.80(0.08) 99.47(0.19) 99.79(0.12)
Nursery
no 98.76(0.22) 99.84(0.13) 99.57(0.21) 99.89(0.08) 99.83(0.08) 99.89(0.09)
Table 6: Performance ranking for original and balanced data sets for pruned decision trees.
Data set 1o 2o 3o 4o 5o 6o 7o 8o 9o 10o 11o
Pima Smt RdOvr Smt+Tmk Smt+ENN Tmk NCL Original RdUdr CNN+Tmk CNN* OSS*
German RdOvr Smt+Tmk Smt+ENNSmt RdUdr CNN CNN+Tmk*OSS* Original* Tmk* NCL*
Post-operativeRdOvr Smt+ENNSmt Original CNN RdUdr CNN+Tmk OSS* Tmk* NCL* Smt+Tmk*
Haberman Smt+ENNSmt+Tmk Smt RdOvr NCL RdUdr Tmk OSS* CNN* Original* CNN+Tmk*
Splice-ie RdOvr Original Tmk Smt CNN NCL Smt+Tmk Smt+ENN*CNN+Tmk*RdUdr* OSS*
Splice-ei Smt Smt+Tmk Smt+ENNCNN+TmkOSS RdOvr Tmk CNN NCL Original RdUdr
Vehicle RdOvr Smt Smt+Tmk OSS CNN OriginalCNN+Tmk Tmk NCL* Smt+ENN* RdUdr*
Letter-vowel Smt+ENNSmt+Tmk Smt RdOvr Tmk* NCL* Original* CNN* CNN+Tmk*RdUdr* OSS*
New-thyroid Smt+ENNSmt+Tmk Smt RdOvr RdUdr CNN Original Tmk CNN+Tmk NCL OSS
E.Coli Smt+Tmk Smt Smt+ENNRdOvr NCL Tmk RdUdr Original OSS CNN+Tmk*CNN*
Satimage Smt+ENNSmt Smt+Tmk RdOvr NCL Tmk Original* OSS* CNN+Tmk*RdUdr* CNN*
Flag RdOvr Smt+ENNSmt+Tmk CNN+TmkSmt RdUdr CNN* OSS* Tmk* Original* NCL*
Glass Smt+ENNRdOvr NCL Smt Smt+TmkOriginalTmk RdUdr CNN+Tmk*OSS* CNN*
Letter-a Smt+Tmk Smt+ENNSmt RdOvr OSS OriginalTmk CNN+Tmk NCL CNN RdUdr*
Nursery RdOvr Tmk Original NCL CNN* OSS* Smt+Tmk* Smt* CNN+Tmk*Smt+ENN* RdUdr*
Table 7: Performance ranking for original and balanced data sets for unpruned decision trees.
Data set 1o 2o 3o 4o 5o 6o 7o 8o 9o 10o 11o
Pima RdOvr Smt Smt+Tmk Smt+ENNTmk NCL Original CNN+Tmk RdUdr CNN* OSS*
German Original Tmk RdOvr NCL RdUdr Smt Smt+Tmk Smt+ENN OSS CNN CNN+Tmk
Post-operativeOriginal CNN+TmkRdOvr Smt Tmk CNN OSS Smt+ENN RdUdr Smt+Tmk* NCL*
Haberman Smt+Tmk Smt+ENN Smt RdOvr Tmk NCL RdUdr Original OSS* CNN* CNN+Tmk*
Splice-ie Original Smt Tmk CNN NCL Smt+Tmk RdOvr OSS* Smt+ENN* CNN+Tmk*RdUdr*
Splice-ei RdOvr Smt Smt+Tmk Smt+ENNOriginal CNN+TmkTmk NCL OSS CNN RdUdr
Vehicle RdOvr Smt Smt+Tmk OSS CNN Original CNN+TmkTmk Smt+ENN NCL RdUdr*
Letter-vowel Smt+ENNSmt Smt+Tmk Tmk* RdOvr* NCL* Original* CNN* CNN+Tmk*OSS* RdUdr*
New-thyroid Smt+ENNSmt Smt+Tmk RdOvr Original Tmk RdUdr CNN CNN+Tmk NCL OSS
E.Coli Smt+Tmk Smt Smt+ENNTmk RdOvr Original NCL RdUdr OSS CNN+Tmk*CNN*
Satimage Smt+ENNSmt Smt+Tmk RdOvr Tmk NCL Original CNN+Tmk*RdUdr* OSS* CNN*
Flag Smt+Tmk OSS RdOvr CNN Tmk Smt+ENN RdUdr Original NCL CNN+Tmk Smt
Glass Smt+ENNRdOvr NCL Smt Smt+TmkOriginal Tmk RdUdr OSS* CNN+Tmk*CNN*
Letter-a Smt Smt+Tmk Smt+ENNRdOvr Tmk OSS NCL Original CNN CNN+Tmk RdUdr*
Nursery RdOvr Original NCL Tmk CNN OSS* Smt* CNN+Tmk*Smt+Tmk* Smt+ENN* RdUdr*
Comparing the mean number of rules obtained with the over-sampled data sets, Random over-sampling and Smote + ENN are the methods that usually provide a smaller increase in the mean number of rules. It was expected that the application of over-sampling would result in an increase in the mean number of rules, since over-sampling increases the total number of training examples, which usually generates larger decision trees. It can be considered unexpected that Random over-sampling is competitive with Smote + Tomek and Smote + ENN in the number of induced rules, since Tomek links and ENN were applied as data cleaning methods with the objective of eliminating noisy examples and thus simplifying the induced decision trees.
Table 8: Number of rules (branches) for the original and over-sampled data sets and unpruned decision trees.
Data set Original Rand Over Smote Smote+Tomek Smote+ENN
Pima 29.90(6.06) 63.80(13.15) 57.70(11.52) 54.20(12.91) 47.50(8.76)
German 315.50(21.41) 410.60(28.64) 367.30(20.85) 355.10(24.20) 261.00(28.08)
Post-operative 20.40(3.86) 36.80(3.05) 38.60(4.35) 32.70(5.87) 25.90(4.09)
Haberman 7.80(3.79) 25.20(10.94) 23.20(9.61) 25.00(7.70) 30.30(4.92)
Splice-ie 203.50(7.78) 258.70(13.07) 443.20(16.69) 340.60(21.34) 307.90(17.21)
Splice-ei 167.80(9.40) 193.30(7.41) 374.50(20.41) 283.90(14.90) 248.80(12.90)
Vehicle 26.20(3.29) 28.90(2.60) 34.90(3.38) 34.90(3.38) 29.20(2.82)
Letter-vowel 534.50(11.92) 678.80(19.07) 1084.50(19.61) 1083.20(20.12) 1022.00(26.34)
New-thyroid 5.40(0.84) 5.10(0.32) 6.90(1.29) 6.90(1.29) 6.90(0.99)
E-coli 11.60(3.03) 17.70(2.91) 16.70(3.20) 16.50(3.84) 12.70(3.23)
Satimage 198.80(11.04) 252.70(9.33) 404.60(12.97) 404.60(12.97) 339.40(13.80)
Flag 28.60(6.52) 46.30(7.72) 52.50(12.47) 46.50(13.36) 40.30(9.09)
Glass 9.40(2.22) 13.00(1.33) 17.70(1.77) 17.70(1.77) 15.50(1.58)
Letter-a 59.10(3.45) 88.00(5.56) 257.60(15.42) 257.60(15.42) 252.60(18.23)
Nursery 229.40(4.65) 282.50(5.34) 1238.30(28.91) 1204.70(27.94) 766.30(77.24)
Table 9: Mean number of conditions per rule for the original and over-sampled data sets and unpruned decision trees.
Data set Original Rand Over Smote Smote+Tomek Smote+ENN
Pima 6.21(0.61) 7.92(0.64) 7.74(0.44) 7.59(0.54) 7.27(0.67)
German 6.10(0.17) 6.89(0.25) 10.27(0.51) 9.68(0.32) 7.35(0.58)
Post-operative 3.61(0.41) 4.86(0.26) 5.36(0.37) 4.75(0.52) 4.46(0.50)
Haberman 3.45(1.36) 5.71(1.43) 5.61(1.27) 5.81(1.02) 6.45(0.60)
Splice-ie 6.04(0.09) 6.15(0.04) 6.08(0.08) 6.00(0.09) 5.58(0.11)
Splice-ei 5.46(0.14) 5.70(0.08) 5.51(0.07) 5.41(0.09) 4.91(0.09)
Vehicle 7.21(0.70) 7.03(0.44) 7.09(0.50) 7.09(0.50) 6.63(0.38)
Letter-vowel 20.96(1.19) 19.32(0.82) 18.78(0.40) 18.78(0.40) 18.32(0.43)
New-thyroid 2.76(0.39) 2.85(0.17) 3.12(0.26) 3.12(0.26) 3.08(0.20)
E-coli 4.43(0.79) 5.48(0.41) 4.98(0.60) 4.92(0.65) 4.15(0.49)
Satimage 12.13(0.46) 15.93(0.42) 13.89(0.64) 13.89(0.64) 12.54(0.36)
Flag 3.92(0.70) 5.42(0.55) 9.43(1.04) 8.75(1.53) 6.71(1.23)
Glass 4.20(0.61) 5.80(0.51) 5.92(0.50) 5.92(0.50) 5.51(0.32)
Letter-a 7.30(0.22) 10.35(0.64) 10.97(0.38) 10.97(0.38) 10.86(0.36)
Nursery 6.51(0.01) 6.84(0.03) 6.87(0.03) 6.84(0.03) 6.41(0.12)
Figure 6: Mean number of rules for the original and balanced data sets and unpruned decision trees.

Figure 7: Mean number of conditions per rule for original and balanced data sets and unpruned decision trees.
The results presented in Table 9 are shown graphically in Figure 7, allowing a clearer comparison of the mean number of conditions per rule for the over-sampled data sets. The Smote + ENN method provided very good results. In fact, it was the best ranked in 10 data sets. Furthermore, this method was even able to obtain smaller values than those achieved by the decision trees induced from the original data sets in 6 data sets. Moreover, considering only the over-sampled data sets, this method was the best ranked for another 4 data sets.

6. CONCLUSION AND LIMITATIONS

In this work we analyze the behavior of several over- and under-sampling methods to deal with the problem of learning from imbalanced data sets. Our results show that the over-sampling methods in general, and Smote + Tomek and Smote + ENN (two of the methods proposed in this work) in particular for data sets with few positive (minority) examples, provided very good results in practice. Moreover, Random over-sampling, frequently considered an unpromising method, provided results competitive with the more complex methods. As a general recommendation, Smote + Tomek or Smote + ENN might be applied to data sets with a small number of positive instances, a condition that is likely to lead to classification performance problems for imbalanced data sets. For data sets with a larger number of positive examples, the Random over-sampling method, which is computationally less expensive than the other methods, would produce meaningful results.

It should be noted that allocating half of the training examples to the minority class does not always provide optimal results [23]. We plan to address this issue in future research. Furthermore, some under-sampling methods, such as Tomek links and NCL, that do not originally allow the user to specify the resulting class distribution, must be improved to include this feature. Another natural extension to this work is to analyze the ROC curves obtained from the classifiers. This might provide us with a more in-depth understanding of the behavior of the balancing and cleaning methods.

Acknowledgements. We wish to thank the anonymous reviewers and Dorival Leão Pinto Júnior for their helpful comments. This research was partially supported by the Brazilian Research Councils CAPES and FAPESP.
REFERENCES

[8] Drummond, C., and Holte, R. C. C4.5, Class Imbalance, and Cost Sensitivity: Why Under-sampling beats Over-sampling. In Workshop on Learning from Imbalanced Data Sets II (2003).

[9] Ferri, C., Flach, P., and Hernández-Orallo, J. Learning Decision Trees Using the Area Under the ROC Curve. In ICML (2002), pp. 139-146.

[10] Hand, D. J. Construction and Assessment of Classification Rules. John Wiley and Sons, 1997.

[11] Hart, P. E. The Condensed Nearest Neighbor Rule. IEEE Transactions on Information Theory IT-14 (1968), 515-516.

[12] Japkowicz, N. Class Imbalances: Are We Focusing on the Right Issue? In Workshop on Learning from Imbalanced Data Sets II (2003).

[13] Japkowicz, N., and Stephen, S. The Class Imbalance Problem: A Systematic Study. IDA Journal 6, 5 (2002), 429-449.

[14] Kubat, M., and Matwin, S. Addressing the Curse of Imbalanced Training Sets: One-sided Selection. In ICML (1997), pp. 179-186.

[15] Laurikkala, J. Improving Identification of Difficult Small Classes by Balancing Class Distribution. Tech. Rep. A-2001-2, University of Tampere, 2001.

[16] Ling, C. X., and Li, C. Data Mining for Direct Marketing: Problems and Solutions. In KDD (1998), pp. 73-79.

[25] Wilson, D. R., and Martinez, T. R. Reduction Techniques for Exemplar-Based Learning Algorithms. Machine Learning 38, 3 (2000), 257-286.

[26] Zadrozny, B., and Elkan, C. Learning and Making Decisions When Costs and Probabilities are Both Unknown. In KDD (2001), pp. 204-213.
Curve. In ICML (2002), pp. 139–146.