OligoIS: Scalable Instance Selection for Class-Imbalanced Data Sets

IEEE TRANSACTIONS ON CYBERNETICS, VOL. 43, NO. 1, FEBRUARY 2013

Abstract—In current research, an enormous amount of information is constantly being produced, which poses a challenge for data mining algorithms. Many of the problems in extremely active research areas, such as bioinformatics, security and intrusion detection, or text mining, share the following two features: large data sets and class-imbalanced distribution of samples. Although many methods have been proposed for dealing with class-imbalanced data sets, most of these methods are not scalable to the very large data sets common to those research fields. In this paper, we propose a new approach to dealing with the class-imbalance problem that is scalable to data sets with many millions of instances and hundreds of features. This proposal is based on the divide-and-conquer principle combined with application of the selection process to balanced subsets of the whole data set. This divide-and-conquer principle allows the execution of the algorithm in linear time. Furthermore, the proposed method is easy to implement using a parallel environment and can work without loading the whole data set into memory. Using 40 class-imbalanced medium-sized data sets, we will demonstrate our method's ability to improve the results of state-of-the-art instance selection methods for class-imbalanced data sets. Using three very large data sets, we will show the scalability of our proposal to millions of instances and hundreds of features.

Index Terms—Class-imbalance problem, instance selection, instance-based learning, very large problems.

Manuscript received October 13, 2011; revised April 9, 2012; accepted June 12, 2012. Date of publication July 30, 2012; date of current version January 11, 2013. This work was supported in part by the Spanish Ministry of Science and Innovation under Project TIN2008-03151 and in part by the Junta de Andalucía under Project P09-TIC-4623. This paper was recommended by Associate Editor N. Chawla.
The authors are with the Department of Computing and Numerical Analysis, University of Córdoba, 14071 Córdoba, Spain (e-mail: [email protected]; [email protected]; [email protected]).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TSMCB.2012.2206381
2168-2267/$31.00 © 2012 IEEE

I. INTRODUCTION

THE overwhelming amount of data that is currently available in any field of research poses new problems for data mining and knowledge discovery methods. This large amount of data makes most existing algorithms inapplicable to many real-world problems. Furthermore, one of the distinctive features of many common problems in data mining applications is the uneven distribution of the instances of the different classes. In extremely active research areas, such as artificial intelligence in medicine, bioinformatics, or intrusion detection, two classes are usually involved: a class of interest, or positive class, and a negative class that is overrepresented in the data sets. This is usually referred to as the class-imbalance problem [1]. In highly imbalanced problems, the ratio between the positive and negative classes can be as high as 1:1000 or 1:10 000.

It has been repeatedly shown that most classification methods suffer from an imbalanced distribution of training instances among classes [2]. Most learning algorithms expect an approximately even distribution of instances among the different classes and suffer, to different degrees, when that is not the case. Dealing with the class-imbalance problem is a difficult but relevant task, as many of the most interesting and challenging real-world problems have a very uneven class distribution.

Many algorithms and methods have been proposed to ameliorate the effect of class imbalance on the performance of learning algorithms. There are three main approaches to these methods.

- Internal approaches acting on the algorithm. These approaches modify the learning algorithm to deal with the imbalance problem. They can adapt the decision threshold to create a bias toward the minority class or introduce costs in the learning process to compensate for the minority class.
- External approaches acting on the data. These algorithms act on the data instead of the learning method. They have the advantage of being independent from the classifier used. There are two basic approaches: oversampling the minority class and undersampling the majority class.
- Combined approaches based on boosting, accounting for the imbalance in the training set. These methods modify the basic boosting method to account for minority class underrepresentation in the data set.

There are two principal advantages of choosing sampling over cost-sensitive methods. First, sampling is more general, as it does not depend on the possibility of adapting a certain algorithm to work with classification costs. Second, the learning algorithm is not modified; modifying it can cause difficulties and add additional parameters to be tuned.

Data-driven algorithms can be broadly classified into two groups: those that undersample the majority class and those that oversample the minority class. There are also algorithms that combine both processes. Both undersampling and oversampling can be achieved randomly or through a more complicated process of searching for the least or most useful instances. Previous works have shown that undersampling the majority class usually leads to better results than oversampling the minority class [2] when oversampling is performed using sampling with replacement from the minority class. Furthermore, combining undersampling of the majority class with oversampling of the
GARCÍA-PEDRAJAS et al.: OLIGOIS 333
minority class has not yielded better results than undersampling of the majority class alone [3]. One of the possible sources of the problematic performance of oversampling is the fact that no new information is introduced in the training set, as oversampling must rely on adding new copies of minority class instances already in the data set. Sampling has proven a very efficient method of dealing with class-imbalanced data sets [4], [5].

Removing instances only from the majority class, usually referred to as one-sided selection (OSS) [6], has two major problems. First, reduction is limited by the number of instances of the minority class. Second, instances from the minority class are never removed, even when their contribution to the model's performance is harmful.

However, few attempts have been made to cope with class-imbalanced data sets using instance selection algorithms, which can remove instances from both the minority and majority classes. Standard widely used methods can be applied, but they do not achieve good results because their design bias is not focused on these kinds of problems. Evolutionary computation [7] has been used with more success, but scalability is important [8], and those methods cannot be applied to large and very large data sets.

In this paper, we propose a new framework called oligarchic instance selection,¹ which is specifically designed for class-imbalanced data sets. The method has two major objectives: 1) improving the performance of previous approaches based on instance selection for class-imbalanced data sets; and 2) developing a method that is able to scale up to very large, and even huge, problems.

¹We refer to our method as instance selection rather than undersampling because, although the method is specifically designed for class-imbalanced data sets, selection of the minority class is also allowed.

The class-imbalanced nature of the problem is dealt with by means of two mechanisms. First, the selection of instances from the majority and minority classes is performed separately. Second, selection is driven by a fitness function that takes accuracy in both classes into account. Furthermore, at its inner level, the selection process is always performed on balanced sets.

Its divide-and-conquer philosophy addresses the problem of scalability without compromising performance. The method is based on applying an instance selection algorithm to balanced subsets of the whole training set and on combining the results obtained from those subsets by means of a voting scheme. As an additional and very useful feature, the method has linear time complexity and can be easily implemented on a shared- or distributed-memory parallel machine.

When dealing with class-imbalanced data sets, our main aim is improving accuracy. However, if a method achieves the same accuracy using fewer instances, that method would be preferable. Moreover, many of the most relevant class-imbalanced problems appear in very large data sets where data reduction is a must. Thus, in the remainder of this paper, we will focus on both accuracy and reduction.

This paper is organized as follows. Section II presents our proposal, Section III describes our experimental setup, Section IV shows the experimental results, and Section V shows the conclusions of this paper.

II. OLIGOIS

Although random undersampling works in many cases, the random deletion of examples from the majority class may cause the classifier to miss important concepts pertaining to the majority class [1]. The problem with traditional approaches to class-imbalanced data sets is that they have serious scalability problems, particularly in methods based on evolutionary computation, which have been shown to achieve the best performance [9]. Furthermore, when careful experimental comparison is made, many sophisticated methods do not achieve significant improvements over undersampling [9].

As we stated earlier, our goal is to obtain a method that is both scalable and able to sample the most relevant instances to deal with class-imbalanced data sets. Scalability will be achieved using a divide-and-conquer approach. The ability to sample instances to deal with class-imbalanced data sets will be achieved by means of the combination of several rounds of instance selection in balanced subsets of the whole data set.

Thus, our methodology is primarily based on the divide-and-conquer approach. Instead of applying the instance selection method to the whole data set, we first perform a random partition of the instances and apply the selection to each one of the subsets obtained. This partition is repeated for several rounds, and the results are combined through a voting process. To account for the class-imbalanced nature of the data sets, the subsets used always contain the same number of instances from both classes. Any instance selection method can be used in the subset, in the same way that any classifier can be used in an ensemble. Because our method treats majority class instances unfairly, favoring minority class instances, we refer to it as oligarchic instance selection (OligoIS). On its own, each round would not be able to achieve good performance. However, the combination of several rounds using a voting scheme is able to improve the performance of an instance selection algorithm applied to the whole data set, with a large reduction in the execution time of the algorithm.

Assume that we have a training set T, with n instances, n+ from the minority or positive class and n− from the majority or negative class. First, the training data T is divided into t disjoint subsets Dj of approximately equal size s as follows:

    T = ⋃_{j=1}^{t} Dj.    (1)

The partition is carried out using a random algorithm. The result of this first step is t subsets where the distribution of classes is roughly as imbalanced as the distribution of the whole data set, due to the use of a random algorithm. To avoid effects derived from these uneven distributions, we balance all subsets by adding randomly selected instances of the minority class. These instances are randomly sampled without replacement to avoid repeated instances in any subset. It may happen in heavily imbalanced data sets that there are not enough instances from the minority class to construct balanced data sets. To prevent this from happening, the subset size must fulfill s ≤ 2n+.
The instance selection algorithm, or a method of our choice, is then applied to each subset separately,² and the results of all of the subsets are recorded. After applying the instance selection algorithm, we record the number of times that each instance has been selected to be kept. We call this record the number of votes due to its similarity with the combination of classifiers in an ensemble by voting [10]. This process is repeated for r rounds with different random partitions of the data set.

The final step is the combination of the different rounds. To obtain a meaningful combination, we use the philosophy of the ensembles of classifiers. In an ensemble, several weak learners are combined to form a strong classifier; in our method, several weak (in the sense that they are applied to subsets of the data) instance selection algorithms are combined to produce a strong and fast instance selection method. Each round can be considered similar to a classifier in an ensemble, and the combination process, by voting, is similar to the combination of base learners in bagging or boosting [11].

Once we have performed these r rounds, we have recorded the number of votes received by each instance. The number of votes received by an instance of the majority class is in the interval [0, r], as each majority class instance is only in one subset in every round. Minority class instances may be in more than one subset due to the balance process. Thus, their number of votes is in the interval [0, t·r]. The first decision is whether to perform the selection on both classes or only on the majority class. Performing selection only on the majority class constrains the maximum reduction that can be achieved. Furthermore, previous results [9] showed better performance when both classes were subject to selection. Thus, we opted to allow instances to be removed from both classes.

The final combination of votes must set a threshold to decide whether an instance must be selected as the final output of the process. A first natural choice would be majority voting: instances are kept if they receive at least half of the possible votes. However, the performance of this fixed threshold depends heavily on the problem. Therefore, we developed a method of automatically selecting an optimum vote threshold. To that end, we define a function f that evaluates the goodness of a certain subset of instances S. Using the threshold t, we obtain the set of selected instances S(t), and then f(S(t)) is calculated as follows:

    f(S(t)) = α r(S(t)) + (1 − α) a(S(t))    (2)

where r(S(t)) is the reduction achieved using threshold t to select S(t), and a(S(t)) is the accuracy achieved with the instances in S(t) using a 1-nearest neighbor (1-NN) classifier.³ The accuracy may be calculated using any of the class-imbalanced measures defined in the following. To obtain the best threshold, all of the possible values are evaluated, and the optimum is chosen.

However, this approach does not pay attention to the class-imbalanced nature of the data sets. To account for that, a further improvement is needed. This improvement is achieved using separate thresholds for the minority and majority classes. Two thresholds are then used: t+ is used for selecting minority class instances and t− for majority class instances. The evaluation of a pair of thresholds, i.e., t+ and t−, is made using the same equation (2), using the subset S(t+, t−) selected with these two thresholds. The key difference is that we must evaluate a larger set of values: [0, r] × [0, t·r]. Thus, for each pair of thresholds, we evaluate the following:

    f(S(t+, t−)) = α r(S(t+, t−)) + (1 − α) a(S(t+, t−))    (3)

and select the best pair of thresholds. The evaluation of this number of thresholds might preclude the scalability achieved by the divide-and-conquer approach. To avoid this negative effect, the evaluation of a pair of thresholds is also approached using a divide-and-conquer method. Instead of evaluating the accuracy of S(t+, t−) with the whole data set, which is of complexity O(n²), we apply the same partition philosophy used in the previous step. The training set is divided into random disjoint subsets, and accuracy is estimated separately in each subset, using the average evaluation of all the subsets as the fitness of each pair of thresholds.

This procedure obtains a selected set of instances that may be imbalanced. To obtain a balanced data set, we perform a last step. The class with more selected instances is undersampled, removing first the instances with fewer votes. This balanced data set is evaluated using (3). If it achieves a better evaluation, the balanced selected data set is used as the final result of the algorithm; otherwise, the selection obtained using the best thresholds is kept. The complete method is shown in Algorithm 1. As OligoIS uses a base instance selection algorithm to be applied to each subset (see line 2 in Algorithm 1), we will use the term OligoIS.X when OligoIS is applied using X as the instance selection algorithm.

Algorithm 1: OligoIS algorithm.
Data: A training set T = {(x1, y1), . . . , (xn, yn)}, subset size s, and number of rounds r.
Result: The set of selected instances S ⊆ T.
for i = 1 to r do
    1  Divide instances into ns disjoint subsets Di: ⋃i Di = T, of size s
    for j = 1 to ns do
        2  Apply instance selection algorithm to Dj
        3  Store votes of selected instances from Dj
    end
end
4  Obtain thresholds of votes to keep an instance from the minority, t+, and the majority, t−, classes
5  S = {xi ∈ T | (votes(xi) ≥ t+ and xi ∈ C+) or (votes(xi) ≥ t− and xi ∈ C−)}
6  Undersample the class with more instances in S to obtain S_balanced, removing first the instances with fewer votes
if f(S_balanced) ≥ f(S) then
    S = S_balanced
end
7  return S

²We can use instance selection methods that are specifically designed for class-imbalanced data sets, or, as the subsets are balanced, we can use any standard instance selection method.
³Any other classifier can be used in this function to obtain the accuracy of the subset selected.
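The exhaustive threshold-pair search of (2) and (3) can be sketched as follows. This is an illustration only: the weighting symbol (written alpha here, lost in extraction) and the accuracy callback are assumptions; in the paper, accuracy is a class-imbalance-aware measure estimated over random disjoint subsets rather than evaluated exactly.

```python
def best_thresholds(votes, y, accuracy_fn, r, t, alpha=0.5):
    """Select the vote thresholds (t+, t-) maximizing
    f(S) = alpha * reduction(S) + (1 - alpha) * accuracy(S).
    votes: index -> vote count; y: index -> +1 minority / -1 majority;
    accuracy_fn: set of selected indices -> accuracy estimate."""
    n = len(votes)
    best, best_f = None, float("-inf")
    for t_plus in range(0, t * r + 1):        # minority votes lie in [0, t*r]
        for t_minus in range(0, r + 1):       # majority votes lie in [0, r]
            S = {i for i, v in votes.items()
                 if (y[i] == +1 and v >= t_plus)
                 or (y[i] == -1 and v >= t_minus)}
            f = alpha * (1.0 - len(S) / n) + (1.0 - alpha) * accuracy_fn(S)
            if f > best_f:
                best_f, best = f, (t_plus, t_minus, S)
    return best
```

Note that evaluating `accuracy_fn` against the whole data set would reintroduce the O(n²) cost the paper avoids; the divide-and-conquer estimate described above keeps each evaluation cheap.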
algorithm, a steady-state genetic algorithm, a cross-generational elitist selection, heterogeneous recombination, and cataclysmic mutation (CHC) genetic algorithm, and a population-based incremental learning algorithm. They found that evolutionary-based methods were able to outperform classical algorithms in both classification accuracy and data reduction. Among the evolutionary algorithms, CHC was able to achieve the best overall performance. Therefore, we have included the CHC algorithm in our comparison. To account for the class-imbalanced nature of our data sets, we used G-mean as the accuracy measure instead of a testing error (see Section III-C).

5) García and Herrera [23] performed a comparison of different methods for sampling with class-imbalanced data sets and found evolutionary algorithms to outperform earlier nonevolutionary methods, although the computational cost is clearly higher. They proposed a family of different evolutionary methods based on the CHC algorithm in which two different accuracy measures can be used, i.e., G-mean and area under the curve (AUC), and the selection process can be applied to both the minority and majority classes or only to the majority class. The four different combinations were tested with similar results. To allow a fair comparison with the remaining algorithms, we have chosen the G-mean as the accuracy measure and selection of both classes. The authors referred to this method as evolutionary balancing undersampling (EBUS). It uses the following fitness function for an individual S:

    FitnessBal(S) = G − |1 − n+/n−| · P,  if n− > 0
    FitnessBal(S) = G − P,                if n− = 0    (4)

where G is the G-mean, n+ is the number of instances selected from the minority class, and n− is the number of instances selected from the majority class. This model aims to remove instances of both classes, identifying minority class examples that have a negative influence over the classification task and achieving maximal reduction in positive instances. A penalty factor P, which is used for preserving the same number of instances belonging to each class, helps to maintain a generalization capability in the reduction task. The authors suggested an empirically obtained value of P = 0.2.

6) Condensed Nearest Neighbor (CNN) Rule [24]. The standard CNN rule is adapted to class-imbalanced problems. First, we randomly draw one majority class example and all examples from the minority class and put these examples in S. We then test all of the instances in T not included in S using a 1-NN rule. All misclassified instances are added to S.

7) OSS [6]. OSS is an undersampling method resulting from the application of Tomek links (TLs) followed by the application of CNN. TLs [25] are defined as follows: Given a pair of instances xi and xj, belonging to classes yi and yj, respectively, where yi ≠ yj and d(xi, xj) is the distance between them, a pair (xi, xj) is called a TL if there is no example xl such that d(xi, xl) < d(xi, xj) or d(xj, xl) < d(xi, xj). If two instances form a TL, either one of them is noise or both are borderline samples. TLs can be used as an undersampling method, removing for every TL found the instance belonging to the majority class.

8) CNN + TL [26]. This method is similar to OSS, but the CNN rule is applied before the TLs. The method is faster than OSS because TLs are obtained for a smaller data set.

9) Modified Selective Subset (MSS) [27] method. This method is a modification of the algorithm of Ritter et al. [28] for finding a selective subset. A subset is selective if it is consistent and all prototypes in the original training set are nearer to a selective neighbor of the same class than to any member of the training set from a different class. The MSS algorithm is aimed at obtaining a minimal consistent subset using the selective property. In this way, the authors define the MSS as that subset of the training set that contains, for every instance x in the training set, that element of its neighborhood that is the nearest to a class other than that of x. The authors propose an iterative procedure to find this modified selective subset. Each instance is ordered regarding the distance to its nearest enemy.

10) Synthetic minority oversampling technique (SMOTE) [2]. One of the problems with oversampling is that merely making copies of the minority class samples does not add new information to the data set. To overcome this problem, Chawla et al. [2] proposed a method called SMOTE, which combines undersampling of the majority class with oversampling of the minority class. However, instead of oversampling the minority class by just making copies of the minority class samples, SMOTE generates synthetic instances from actual instances of the minority class. Synthetic samples are generated in the following way: take the difference between the feature vector (sample) under consideration and its nearest neighbor, multiply this difference by a random number between 0 and 1, and add it to the feature vector under consideration.

The source code used for all methods is in C and is licensed under the GNU General Public License. The code, the partitions of the data sets, and the detailed numerical results of all the experiments are available from http://cib.uco.es/index.php/supplementary-material-for-oligois.

All of the experiments were carried out in a cluster of 32 blades. Each blade is a biprocessor DELL PowerEdge M600 with four cores per processor; thus, we have 256 cores in total. The blades are interconnected with a master node and among them with a 1-Gb network. In the parallel implementation, we use a master/slave model in which all information processed by the slaves is sent by the master. The processors run at 2.5 GHz, and each blade has 16 GB of memory.

B. Statistical Tests

We used the Wilcoxon test as the main statistical test for comparing pairs of algorithms. This test was chosen because it assumes limited commensurability and is safer than parametric tests because it does not assume normal distributions or
homogeneity of variance. Therefore, it can be applied to accuracy and reduction. Furthermore, empirical results [29] show that it is also stronger than other tests.

For groups of methods, we first carry out an Iman–Davenport test to ascertain whether there are significant differences among methods. The Iman–Davenport test is based on the Friedman test, which compares the average ranks of k algorithms but is more powerful than the Friedman test. Let r_i^j be the rank of the jth algorithm on the ith data set, where, in case of ties, average ranks are assigned, and let R_j = (1/N) Σ_i r_i^j be the average rank for N data sets. Under the null hypothesis, all algorithms are equivalent; the statistic

    χ²_F = [12N / (k(k + 1))] [Σ_j R_j² − k(k + 1)²/4]    (5)

is distributed following a χ² distribution with k − 1 degrees of freedom for k and N sufficiently large. In general, N > 10 and k > 5 is enough. Iman and Davenport found this statistic to be too conservative and developed a better one as follows:

    F_F = (N − 1) χ²_F / [N(k − 1) − χ²_F]    (6)

which is distributed following an F distribution with k − 1 and (k − 1)(N − 1) degrees of freedom. After carrying out the Iman–Davenport test, we can perform pairwise comparisons with the Wilcoxon test. However, when all algorithms are compared with a control method, it is not advisable to perform many Wilcoxon tests against the control method. We can instead use one of the general procedures for controlling the familywise error in multiple hypothesis testing. The test statistic for comparing the ith and jth classifiers using these methods is as follows:

    z = (R_i − R_j) / √(k(k + 1)/(6N))    (7)

The z-value is used to find the corresponding probability from the table of the normal distribution, which is then compared with an appropriate α. Step-up and step-down procedures sequentially test the hypotheses ordered by their significance. We will denote the ordered p-values by p1, p2, . . ., so that p1 ≤ p2 ≤ · · · ≤ p_{k−1}. One of the simplest such methods was developed by Holm. It compares each p_i with α/(k − i). Holm's step-down procedure starts with the most significant p-value. If p1 is below α/(k − 1), the corresponding hypothesis is rejected, and we are allowed to compare p2 with α/(k − 2). If the second hypothesis is rejected, the test proceeds with the third, and so on. As soon as a certain null hypothesis cannot be rejected, all remaining hypotheses are retained as well. We will use for all statistical tests a significance level of 0.05.

C. Evaluation Measures

Accuracy is not a useful measure for imbalanced data, particularly when the number of instances of the minority class is very small compared with the majority class. If we have a ratio of 1:100, a classifier that assigns all instances to the majority class will have 99% accuracy. Several measures have been developed to take the imbalanced nature of the problems into account. Given the number of true positives (TPs), false positives (FPs), true negatives (TNs), and false negatives (FNs), we can define several measures. Perhaps the most common measures are the TP rate TPrate, recall R, or sensitivity Sn, i.e.,

    TPrate = R = Sn = TP / (TP + FN)    (8)

which is relevant if we are only interested in the performance on the positive class, and the TN rate TNrate, or specificity Sp, as follows:

    TNrate = Sp = TN / (TN + FP)    (9)

From these basic measures, others have been proposed, such as the F-measure or, if we are concerned about the performance on both negative and positive classes, the G-mean measure: G-mean = √(Sp · Sn). This last measure is used as the accuracy measure in all of our experiments. Thus, in the tables reporting our results, we will show the G-mean measure and the reduction after applying the instance selection algorithm. Reduction is measured as the percentage of instances removed after applying the selection process.

Many classifiers are subject to some kind of threshold that can be varied to achieve different values of the earlier measures. For that kind of classifier, receiver operating characteristic (ROC) curves can be constructed. An ROC curve is a graphical plot of the TPrate (sensitivity) against the FPrate (1 − specificity, or FPrate = FP/(TN + FP)) for a binary classifier system as its discrimination threshold is varied. The perfect model would achieve a TP rate of 1 and an FP rate of 0. A random guess will be represented by a line connecting the points (0, 0) and (1, 1). ROC curves are a good measure of the performance of classifiers. Furthermore, from this curve, a new measure, i.e., the AUC, can be obtained, which is a very good overall measure for comparing algorithms. AUC is a useful metric for classifier performance, as it is independent of the decision criterion selected and prior probabilities.

However, for the nearest neighbor rule, AUC is less commonly used due to the difficulty of obtaining the threshold needed to construct the ROC curve. We will present AUC results for the nearest neighbor using as threshold the ratio between the distances to the nearest neighbor for each one of the two classes.

IV. EXPERIMENTAL RESULTS

As stated, OligoIS is a method that can use any of the many available base instance selection algorithms, in the same way that boosting can use the classifier of choice. Thus, the first set of experiments focused on determining which instance selection algorithm was most appropriate for OligoIS. We performed experiments using OligoIS with eight of the nine methods described earlier. Random undersampling was not used because the subsets used in our method are always balanced.

Tables II and III show a pairwise comparison of all the methods for accuracy and reduction, respectively. Fig. 1 shows
TABLE II
PAIRWISE COMPARISON OF OLIGOIS FOR ACCURACY, MEASURED USING G-MEAN, USING THE EIGHT METHODS. WIN/LOSS RECORD AND THE p-VALUE OF THE WILCOXON TEST ARE SHOWN
TABLE III
PAIRWISE COMPARISON OF OLIGOIS FOR REDUCTION, MEASURED AS THE PERCENTAGE OF INSTANCES REMOVED, USING THE EIGHT METHODS. WIN/LOSS RECORD AND THE p-VALUE OF THE WILCOXON TEST ARE SHOWN
Fig. 1. Average Friedman's ranks for OligoIS using the eight instance selection algorithms.
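The win/loss records and signed-rank statistics behind these pairwise tables can be reproduced in outline as follows. This is a self-contained sketch of the Wilcoxon signed-rank computation over paired per-data-set results (the function name is our own); a real analysis would additionally convert the rank sums to a p-value.

```python
def wilcoxon_record(a, b):
    """Win/loss record and signed-rank sums for the paired results of
    two methods across data sets. Zero differences are discarded and
    tied absolute differences receive average ranks, as is usual for
    the Wilcoxon signed-rank test."""
    diffs = [x - y for x, y in zip(a, b) if x != y]
    wins = sum(1 for d in diffs if d > 0)
    losses = len(diffs) - wins
    ranked = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(ranked):
        j = i
        # group indices whose absolute differences are tied
        while j + 1 < len(ranked) and abs(diffs[ranked[j + 1]]) == abs(diffs[ranked[i]]):
            j += 1
        avg = (i + j) / 2 + 1          # average rank of the tied group
        for k in range(i, j + 1):
            ranks[ranked[k]] = avg
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return wins, losses, w_plus, w_minus
```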
Fig. 2. Results of the Holm test for OligoIS with seven different methods and OligoIS.CHC as the control method for (top) accuracy and (bottom) reduction. Numerical p-values are shown.
TABLE V
ACCURACY AND REDUCTION OF THE TEN STANDARD METHODS AND OLIGOIS.CHC
Fig. 3. Results of the Holm test for the standard methods and OligoIS.CHC as the control method for (top) G-mean, (middle) reduction, and (bottom) AUC. Numerical p-values are shown.
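The Holm step-down procedure used in these tests (described in Section III-B) can be sketched directly. This is a generic illustration for k − 1 hypotheses compared against a control method; the function name is our own.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm's step-down procedure: order the p-values, compare the ith
    smallest with alpha / (k - i), and stop at the first non-rejection;
    all remaining hypotheses are then retained."""
    k = len(p_values) + 1              # k classifiers -> k - 1 comparisons
    order = sorted(range(len(p_values)), key=lambda i: p_values[i])
    rejected = [False] * len(p_values)
    for step, idx in enumerate(order, start=1):
        if p_values[idx] < alpha / (k - step):
            rejected[idx] = True
        else:
            break                      # retain this and all larger p-values
    return rejected
```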
methods. A plot of average rank values is shown in Fig. 4 together with the average values of AUC, G-mean, and reduction for all methods. This plot shows that average ranks are clearly better for OligoIS.CHC. The figure also shows that OligoIS.CHC not only achieved superior accuracy to the remaining methods but that reduction was also clearly better than with the other nine methods.

The results of the nine methods against OligoIS.CHC are illustrated in Fig. 5. The figure shows results for accuracy and reduction. This graphic representation is based on the κ-error relative movement diagrams [30]. However, here, we use the reduction difference instead of the κ difference value. These diagrams use an arrow to represent the results of two methods applied to the same data set. The arrow starts at the
A. Execution Time
In the description of our method, we claimed that applying
the philosophy of divide-and-conquer would produce a method
that was more scalable than previous approaches. To verify
this claim, Fig. 7 shows the average time of each method
together with the time needed for the longest execution. The
execution time shown in the tables is the wall-clock time spent
by each part of the algorithm. We measure the time elapsed
from the beginning of the algorithm until it completes its final
output. That means that the time needed to read the data files,
perform the partition, send the data to the slaves, receive the
results, and obtain the best threshold of votes is included in
the reported time. In this way, our proposal establishes no
artificial advantage. The figure shows that OligoIS.CHC is significantly faster than the other methods. The difference is particularly remarkable for the largest problems and the parallel implementation of our method. Although we are comparing a parallel implementation of OligoIS.CHC against the other methods, we do not think that it is an unfair comparison, as natural parallelization is an inherent feature of OligoIS.

Fig. 4. Average Friedman's ranks and average values for the standard algorithms and OligoIS.CHC.
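The wall-clock measurement described above, which covers every phase from reading the data files to selecting the best vote threshold, can be sketched as follows. The phase names and `time.sleep` bodies are placeholders, not the paper's implementation:

```python
import time

def run_with_timing(phases):
    """Run every phase of a (hypothetical) master process and return the
    total wall-clock time, so no part of the pipeline is excluded."""
    start = time.perf_counter()
    for name, fn in phases:
        fn()  # each phase runs to completion before the next starts
    return time.perf_counter() - start

# Placeholder phases mirroring the steps listed in the text
phases = [
    ("read data files", lambda: time.sleep(0.01)),
    ("partition into balanced subsets", lambda: time.sleep(0.01)),
    ("send subsets to slaves / receive votes", lambda: time.sleep(0.01)),
    ("select best vote threshold", lambda: time.sleep(0.01)),
]
elapsed = run_with_timing(phases)
print(f"wall-clock time: {elapsed:.3f} s")
```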
Fig. 5. Reduction/accuracy using relative movement diagrams for OligoIS.CHC and the nine standard methods. Positive values on both axes show better performance by OligoIS.CHC.
experiments. This size achieved good performance and fast execution. In the first experiment, we tested whether a larger size would achieve better results. Fig. 8 shows the average accuracy, reduction, and execution time for subset sizes of 100, 250, 500, 750, and 1000 instances.

The results showed that the method is quite stable with respect to this parameter. An Iman–Davenport test failed to find significant differences among the different values of the parameter in terms of accuracy, with a p-value of 0.4883. For reduction, the Iman–Davenport test obtained a p-value of 0.0015, showing significant differences. We performed a Holm test using the value of 100 instances as the control experiment because this parameter value obtained the best average rank. The results of the Holm test are shown in Fig. 9. These results show that the best subset size in terms of reduction is 100 instances.

The second relevant parameter of our method is the number of rounds. We chose ten rounds, guided by the results of other approaches that combine different methods, such as ensembles of classifiers, which have shown that most of the performance gain is obtained by the first few classifiers added. To test the validity of this value, we performed experiments with 5, 10, 15, 20, and 25 rounds. Fig. 10 shows the average accuracy, reduction, and execution time of the experiments conducted for each number of rounds. As in the previous experiment, we performed an Iman–Davenport test for accuracy and reduction.
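The Friedman ranks and the Iman–Davenport correction used above can be computed as follows. The score matrix is invented for illustration; the Iman–Davenport statistic is F_ID = (N−1)χ²_F / (N(k−1) − χ²_F), which follows an F-distribution with (k−1) and (k−1)(N−1) degrees of freedom.

```python
def friedman_iman_davenport(scores):
    """Friedman chi-square and Iman–Davenport F statistic.

    scores[i][j] is the score of method j on data set i (higher is better).
    Methods are ranked within each data set; ties receive the average rank.
    """
    n, k = len(scores), len(scores[0])
    avg_ranks = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda j: row[j], reverse=True)
        ranks = [0.0] * k
        pos = 0
        while pos < k:
            tie_end = pos
            while tie_end + 1 < k and row[order[tie_end + 1]] == row[order[pos]]:
                tie_end += 1
            avg = (pos + tie_end) / 2 + 1      # average rank over the tie group
            for t in range(pos, tie_end + 1):
                ranks[order[t]] = avg
            pos = tie_end + 1
        for j in range(k):
            avg_ranks[j] += ranks[j] / n
    chi2 = 12 * n / (k * (k + 1)) * (sum(r * r for r in avg_ranks) - k * (k + 1) ** 2 / 4)
    f_id = (n - 1) * chi2 / (n * (k - 1) - chi2)
    return avg_ranks, chi2, f_id

# Hypothetical G-mean of 5 parameter settings (columns) on 6 data sets (rows)
scores = [
    [0.90, 0.88, 0.87, 0.86, 0.85],
    [0.80, 0.82, 0.79, 0.78, 0.77],
    [0.85, 0.84, 0.83, 0.85, 0.82],
    [0.70, 0.69, 0.71, 0.68, 0.67],
    [0.92, 0.91, 0.90, 0.89, 0.88],
    [0.75, 0.74, 0.73, 0.72, 0.71],
]
ranks, chi2, f_id = friedman_iman_davenport(scores)
print([round(r, 2) for r in ranks], round(chi2, 3), round(f_id, 3))
```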
Fig. 7. Average time and longest time in logarithmic scale for the standard algorithms and OligoIS.CHC.

Fig. 8. Average G-mean, reduction, and execution time for OligoIS.CHC for different subset sizes.

Fig. 6. (Top) Reduction and (bottom) AUC for OligoIS.CHC and CHC for the ten data sets with the highest IR.

Fig. 10. Average G-mean, reduction, and execution time for OligoIS.CHC for different numbers of rounds.

Fig. 13. Average G-mean, reduction, and classifier size for OligoIS.CHC and CHC using a C4.5 decision tree and an SVM.

TABLE VI
SUMMARY OF VERY LARGE DATA SETS

Fig. 11. Average G-mean, reduction, and execution time for OligoIS.CHC for different values of the parameter.
C. Using Other Classifiers

As a final experiment, we tested whether OligoIS was able to improve results when using other classifiers. We repeated the experiments using the best version of our algorithm, i.e., OligoIS.CHC, and the best standard method, i.e., CHC. We used as classifiers a decision tree built with the C4.5 [31] algorithm and a support vector machine (SVM) [32] with a Gaussian kernel, whose parameters γ and C were obtained by tenfold cross-validation. Fig. 13 shows the results for these two classifiers. As was the case with 1-NN, OligoIS.CHC improves the results of CHC in both accuracy and reduction. Furthermore, the classifiers induced in both cases are simpler. All of the differences were found significant at a 0.05 significance level using a Wilcoxon test.

Fig. 12. Results of the Holm test for OligoIS.CHC as the control method, with the parameter set to 0.5 as the control experiment for (top) accuracy and to 0.3 for (bottom) reduction, against OligoIS.CHC with the remaining parameter values of 0.3, 0.4, 0.5, 0.6, and 0.7 in each case. Numerical p-values are shown.

D. Scalability

In previous experiments, we showed the performance of OligoIS on data sets of medium size. However, one of the aims of this paper is the scalability of the presented method to very large data sets. Thus, we applied our algorithm to three very large problems. These data sets are shown in Table VI. ccds and dna represent the same problems as the corresponding data sets in Table I but with a larger set of instances. chrom21 is the bioinformatics problem of predicting the translation initiation site of a gene in the sequence of human chromosome 21. With a largest set of five million instances and 800 features, these data sets represent a serious challenge for any method. Furthermore, chrom21 and dna also show a large IR. To clarify the difficulty of the problem, we show the memory size of these three data sets using standard floating-point precision stored with 4 B.
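The storage cost implied by 4-B floating-point precision is easy to verify for the largest data set mentioned above (five million instances, 800 features); the arithmetic below shows why memory-resident instance selection methods cannot handle it on ordinary hardware:

```python
# Back-of-the-envelope memory footprint of the largest data set
instances = 5_000_000
features = 800
bytes_per_float = 4          # standard single-precision storage

total_bytes = instances * features * bytes_per_float
print(total_bytes / 10**9)   # 16.0 (decimal gigabytes)
```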
TABLE VII
SUMMARY OF RESULTS FOR LARGE DATA SETS. SIGNIFICANT
DIFFERENCES FOR A RESAMPLED t-TEST ARE MARKED IN BOLDFACE

We applied OligoIS.CHC with a subset size of 1000 instances. As the control experiment, we used random undersampling because none of the other standard methods is scalable to data sets of this size. Table VII shows the results of both methods. The first interesting result is the good scalability of our proposal. For the largest data set of five million instances and 800 features, OligoIS.CHC took 1.5 h to complete the whole process. If we estimate the time that standard CHC would need for such a data set, considering that it needed more than 3 h for arabidopsis, which has 33 971 instances, we can conclude that OligoIS reduces the time by several orders of magnitude.

Furthermore, the performance of OligoIS.CHC is competitive. For ccds, there are no significant differences in terms of accuracy, but OligoIS.CHC achieves a significantly better reduction. For chrom21 and dna, both methods achieved the same reduction, but OligoIS.CHC obtained significantly better accuracy.

V. CONCLUSION AND FUTURE WORK

In this paper, we have presented a new method of instance selection for class-imbalanced data sets that is applicable to very large data sets. The method consists of concurrently applying instance selection to small class-balanced subsets of the original data set and combining the results by means of a voting method, setting a different threshold for minority and majority class samples. Using a CHC algorithm as the base algorithm, we have shown that our method exceeds the performance of state-of-the-art selection methods for class-imbalanced data in both accuracy and reduction. Its advantage over the state-of-the-art methods was larger when the IR of the problem was higher.

The results showed that our approach scaled up to problems of very large size. Two features of our method guarantee that scalability: First, instance selection is always performed over small data sets, keeping the time spent by the process low; and second, only small subsets of instances must be kept in memory, removing any scalability constraint due to memory limits. This scalability has been experimentally proven with the largest data set of five million instances and 800 features.

REFERENCES

[1] H. He and E. A. Garcia, "Learning from imbalanced data," IEEE Trans. Knowl. Data Eng., vol. 21, no. 9, pp. 1263–1284, Sep. 2009.
[2] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, "SMOTE: Synthetic minority over-sampling technique," J. Artif. Intell. Res., vol. 16, no. 1, pp. 321–357, Jan. 2002.
[3] C. Ling and G. Li, "Data mining for direct marketing: Problems and solutions," in Proc. 4th Int. Conf. KDD, New York, 1998, pp. 73–79.
[4] G. M. Weiss and F. Provost, "The effect of class distribution on classifier learning: An empirical study," Dept. Comput. Sci., Rutgers Univ., Newark, NJ, Tech. Rep. TR-43, 2001.
[5] A. Estabrooks, T. Jo, and N. Japkowicz, "A multiple resampling method for learning from imbalanced data sets," Comput. Intell., vol. 20, no. 1, pp. 18–36, Feb. 2004.
[6] M. Kubat and S. Matwin, "Addressing the curse of imbalanced training sets: One-sided selection," in Proc. 14th Int. Conf. Mach. Learn., 1997, pp. 179–186.
[7] S. García, J. Derrac, I. Triguero, C. J. Carmona, and F. Herrera, "Evolutionary-based selection of generalized instances for imbalanced classification," Knowl.-Based Syst., vol. 25, no. 1, pp. 3–12, Feb. 2012.
[8] N. García-Pedrajas, J. A. Romero del Castillo, and D. Ortiz-Boyer, "A cooperative coevolutionary algorithm for instance selection for instance-based learning," Mach. Learn., vol. 78, no. 3, pp. 381–420, Mar. 2010.
[9] S. García and F. Herrera, "Evolutionary undersampling for classification with imbalanced datasets: Proposals and taxonomy," Evol. Comput., vol. 17, no. 3, pp. 275–306, Fall 2009.
[10] L. Kuncheva and C. J. Whitaker, "Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy," Mach. Learn., vol. 51, no. 2, pp. 181–207, May 2003.
[11] N. García-Pedrajas, C. García-Osorio, and C. Fyfe, "Nonlinear boosting projections for ensemble construction," J. Mach. Learn. Res., vol. 8, pp. 1–33, May 2007.
[12] R. Barandela, J. S. Sánchez, V. García, and E. Rangel, "Strategies for learning in class imbalance problems," Pattern Recognit., vol. 36, no. 3, pp. 849–851, Mar. 2003.
[13] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," J. Mach. Learn. Res., vol. 3, pp. 1157–1182, Mar. 2003.
[14] A. Frank and A. Asuncion, UCI Machine Learning Repository, 2010.
[15] N. García-Pedrajas, "Constructing ensembles of classifiers by means of weighted instance selection," IEEE Trans. Neural Netw., vol. 20, no. 2, pp. 258–277, Feb. 2009.
[16] D. R. Wilson and T. R. Martinez, "Reduction techniques for instance-based learning algorithms," Mach. Learn., vol. 38, no. 3, pp. 257–286, Mar. 2000.
[17] D. L. Wilson, "Asymptotic properties of nearest neighbor rules using edited data," IEEE Trans. Syst., Man, Cybern., vol. SMC-2, no. 3, pp. 408–421, Jul. 1972.
[18] H. Brighton and C. Mellish, "Advances in instance selection for instance-based learning algorithms," Data Mining Knowl. Discov., vol. 6, no. 2, pp. 153–172, Apr. 2002.
[19] L. Kuncheva, "Editing for the k-nearest neighbors rule by a genetic algorithm," Pattern Recognit. Lett., vol. 16, no. 8, pp. 809–814, Aug. 1995.
[20] H. Ishibuchi and T. Nakashima, "Pattern and feature selection by genetic algorithms in nearest neighbor classification," J. Adv. Comput. Intell. Intell. Inf., vol. 4, no. 2, pp. 138–145, 2000.
[21] C. R. Reeves and D. R. Bush, "Using genetic algorithms for training data selection in RBF networks," in Instance Selection and Construction for Data Mining, H. Liu and H. Motoda, Eds. Norwell, MA: Kluwer, 2001, pp. 339–356.
[22] J. R. Cano, F. Herrera, and M. Lozano, "Using evolutionary algorithms as instance selection for data reduction in KDD: An experimental study," IEEE Trans. Evol. Comput., vol. 7, no. 6, pp. 561–575, Dec. 2003.
[23] S. García, A. Fernández, and F. Herrera, "Enhancing the effectiveness and interpretability of decision tree and rule induction classifiers with evolutionary training set selection over imbalanced problems," Appl. Soft Comput., vol. 9, no. 4, pp. 1304–1314, Sep. 2009.
[24] P. E. Hart, "The condensed nearest neighbor rule," IEEE Trans. Inf. Theory, vol. IT-14, no. 3, pp. 515–516, May 1968.
[25] I. Tomek, "Two modifications of CNN," IEEE Trans. Syst., Man, Cybern., vol. SMC-6, no. 11, pp. 769–772, Nov. 1976.
[26] G. E. A. P. A. Batista, R. C. Prati, and M. C. Monard, "A study of the behavior of several methods for balancing machine learning training data," SIGKDD Explor. Newslett., vol. 6, no. 1, pp. 20–29, Jun. 2004.
[27] R. Barandela, F. J. Ferri, and J. S. Sánchez, "Decision boundary preserving prototype selection for nearest neighbor classification," Int. J. Pattern Recognit. Artif. Intell., vol. 19, no. 6, pp. 787–806, Sep. 2005.
[28] G. L. Ritter, H. B. Woodruff, S. R. Lowry, and T. L. Isenhour, "An algorithm for selective nearest neighbor decision rule," IEEE Trans. Inf. Theory, vol. IT-21, no. 6, pp. 665–669, Nov. 1975.
[29] J. Demšar, "Statistical comparisons of classifiers over multiple data sets," J. Mach. Learn. Res., vol. 7, pp. 1–30, Dec. 2006.
[30] J. Maudes-Raedo, J. J. Rodríguez-Díez, and C. García-Osorio, "Disturbing neighbors diversity for decision forest," in Proc. Workshop SUEMA, G. Valentini and O. Okun, Eds., Patras, Greece, Jul. 2008, pp. 67–71.
[31] J. R. Quinlan, C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann, 1993.
[32] V. Vapnik, The Nature of Statistical Learning Theory. New York: Springer-Verlag, 1999.
Nicolás García-Pedrajas was born in Córdoba, Spain, in 1970. He received the B.S. degree in computing and the Ph.D. degree from the University of Málaga, Málaga, Spain, in 1993 and 2001, respectively.
He is currently a Professor in the area of computer science and artificial intelligence with the Department of Computing and Numerical Analysis, University of Córdoba, Córdoba, and the Head of the Computational Intelligence and Bioinformatics Research Group, University of Córdoba. His current research interests include classification, neural networks, evolutionary computation, and bioinformatics.

Aida de Haro-García was born in Córdoba, Spain, in 1984. She received the B.S. degree in computing from the University of Córdoba, Córdoba, in 2007 and the Ph.D. degree from the University of Granada, Granada, Spain, in 2011.
She is currently an Assistant Professor in the area of computer science and artificial intelligence with the Department of Computing and Numerical Analysis, University of Córdoba. She is also a member of the Computational Intelligence and Bioinformatics Research Group, University of Córdoba. Her current research interests include classification, evolutionary computation, and feature and instance selection.