OligoIS: Scalable Instance Selection for Class-Imbalanced Data Sets

IEEE TRANSACTIONS ON CYBERNETICS, VOL. 43, NO. 1, FEBRUARY 2013

Abstract—In current research, an enormous amount of information is constantly being produced, which poses a challenge for data mining algorithms. Many of the problems in extremely active research areas, such as bioinformatics, security and intrusion detection, or text mining, share the following two features: large data sets and class-imbalanced distribution of samples. Although many methods have been proposed for dealing with class-imbalanced data sets, most of these methods are not scalable to the very large data sets common to those research fields. In this paper, we propose a new approach to dealing with the class-imbalance problem that is scalable to data sets with many millions of instances and hundreds of features. This proposal is based on the divide-and-conquer principle combined with application of the selection process to balanced subsets of the whole data set. This divide-and-conquer principle allows the execution of the algorithm in linear time. Furthermore, the proposed method is easy to implement using a parallel environment and can work without loading the whole data set into memory. Using 40 class-imbalanced medium-sized data sets, we will demonstrate our method's ability to improve the results of state-of-the-art instance selection methods for class-imbalanced data sets. Using three very large data sets, we will show the scalability of our proposal to millions of instances and hundreds of features.

Index Terms—Class-imbalance problem, instance selection, instance-based learning, very large problems.

Manuscript received October 13, 2011; revised April 9, 2012; accepted June 12, 2012. Date of publication July 30, 2012; date of current version January 11, 2013. This work was supported in part by the Spanish Ministry of Science and Innovation under Project TIN2008-03151 and in part by the Junta de Andalucía under Project P09-TIC-4623. This paper was recommended by Associate Editor N. Chawla.
The authors are with the Department of Computing and Numerical Analysis, University of Córdoba, 14071 Córdoba, Spain (e-mail: [email protected]; [email protected]; [email protected]).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TSMCB.2012.2206381
2168-2267/$31.00 © 2012 IEEE

I. INTRODUCTION

THE overwhelming amount of data that is currently available in any field of research poses new problems for data mining and knowledge discovery methods. This large amount of data makes most existing algorithms inapplicable to many real-world problems. Furthermore, one of the distinctive features of many common problems in data mining applications is the uneven distribution of the instances of the different classes. In extremely active research areas, such as artificial intelligence in medicine, bioinformatics, or intrusion detection, two classes are usually involved: a class of interest, or positive class, and a negative class that is overrepresented in the data sets. This is usually referred to as the class-imbalance problem [1]. In highly imbalanced problems, the ratio between the positive and negative classes can be as high as 1:1000 or 1:10 000.

It has been repeatedly shown that most classification methods suffer from an imbalanced distribution of training instances among classes [2]. Most learning algorithms expect an approximately even distribution of instances among the different classes and suffer, to different degrees, when that is not the case. Dealing with the class-imbalance problem is a difficult but relevant task, as many of the most interesting and challenging real-world problems have a very uneven class distribution.

Many algorithms and methods have been proposed to ameliorate the effect of class imbalance on the performance of learning algorithms. There are three main approaches to these methods.

- Internal approaches acting on the algorithm. These approaches modify the learning algorithm to deal with the imbalance problem. They can adapt the decision threshold to create a bias toward the minority class or introduce costs in the learning process to compensate for the minority class.
- External approaches acting on the data. These algorithms act on the data instead of the learning method. They have the advantage of being independent from the classifier used. There are two basic approaches: oversampling the minority class and undersampling the majority class.
- Combined approaches based on boosting, accounting for the imbalance in the training set. These methods modify the basic boosting method to account for minority class underrepresentation in the data set.

There are two principal advantages of choosing sampling over cost-sensitive methods. First, sampling is more general, as it does not depend on the possibility of adapting a certain algorithm to work with classification costs. Second, the learning algorithm is not modified; modifying it can cause difficulties and add additional parameters to be tuned.

Data-driven algorithms can be broadly classified into two groups: those that undersample the majority class and those that oversample the minority class. There are also algorithms that combine both processes. Both undersampling and oversampling can be achieved randomly or through a more complicated process of searching for the least or most useful instances. Previous works have shown that undersampling the majority class usually leads to better results than oversampling the minority class [2] when oversampling is performed using sampling with replacement from the minority class. Furthermore, combining undersampling of the majority class with oversampling of the
GARCÍA-PEDRAJAS et al.: OLIGOIS 333
minority class has not yielded better results than undersampling of the majority class alone [3]. One of the possible sources of the problematic performance of oversampling is the fact that no new information is introduced in the training set, as oversampling must rely on adding new copies of minority class instances already in the data set. Sampling has proven a very efficient method of dealing with class-imbalanced data sets [4], [5].

Removing instances only from the majority class, usually referred to as one-sided selection (OSS) [6], has two major problems. First, reduction is limited by the number of instances of the minority class. Second, instances from the minority class are never removed, even when their contribution to the model's performance is harmful.

However, few attempts have been made to cope with class-imbalanced data sets using instance selection algorithms, which can remove instances from both the minority and majority classes. Standard widely used methods can be applied, but they do not achieve good results because their design bias is not focused on these kinds of problems. Evolutionary computation [7] has been used with more success, but scalability is important [8], and those methods cannot be applied to large and very large data sets.

In this paper, we propose a new framework called oligarchic instance selection,¹ which is specifically designed for class-imbalanced data sets. The method has two major objectives: 1) improving the performance of previous approaches based on instance selection for class-imbalanced data sets; and 2) developing a method that is able to scale up to very large, and even huge, problems.

¹We refer to our method as instance selection rather than undersampling because, although the method is specifically designed for class-imbalanced data sets, selection of the minority class is also allowed.

The class-imbalanced nature of the problem is dealt with by means of two mechanisms. First, the selection of instances from the majority and minority classes is performed separately. Second, selection is driven by a fitness function that takes accuracy in both classes into account. Furthermore, at its inner level, the selection process is always performed on balanced sets.

Its divide-and-conquer philosophy addresses the problem of scalability without compromising performance. The method is based on applying an instance selection algorithm to balanced subsets of the whole training set and on combining the results obtained from those subsets by means of a voting scheme. As an additional and very useful feature, the method has linear time complexity and can be easily implemented on a shared- or distributed-memory parallel machine.

When dealing with class-imbalanced data sets, our main aim is improving accuracy. However, if a method achieves the same accuracy using fewer instances, that method would be preferable. Moreover, many of the most relevant class-imbalanced problems appear in very large data sets where data reduction is a must. Thus, in the remainder of this paper, we will focus on both accuracy and reduction.

This paper is organized as follows. Section II presents our proposal, Section III describes our experimental setup, Section IV shows the experimental results, and Section V shows the conclusions of this paper.

II. OLIGOIS

Although random undersampling works in many cases, the random deletion of examples from the majority class may cause the classifier to miss important concepts pertaining to the majority class [1]. The problem with traditional approaches to class-imbalanced data sets is that they have serious scalability problems, particularly in methods based on evolutionary computation, which have been shown to achieve the best performance [9]. Furthermore, when careful experimental comparison is made, many sophisticated methods do not achieve significant improvements over undersampling [9].

As we stated earlier, our goal is to obtain a method that is both scalable and able to sample the most relevant instances to deal with class-imbalanced data sets. Scalability will be achieved using a divide-and-conquer approach. The ability to sample instances to deal with class-imbalanced data sets will be achieved by means of the combination of several rounds of instance selection in balanced subsets of the whole data set.

Thus, our methodology is primarily based on the divide-and-conquer approach. Instead of applying the instance selection method to the whole data set, we first perform a random partition of the instances and apply the selection to each one of the subsets obtained. This partition is repeated for several rounds, and the results are combined through a voting process. To account for the class-imbalanced nature of the data sets, the subsets used always contain the same number of instances from both classes. Any instance selection method can be used in the subset, in the same way that any classifier can be used in an ensemble. Because our method treats majority class instances unfairly, favoring minority class instances, we refer to it as oligarchic instance selection (OligoIS). On its own, each round would not be able to achieve good performance. However, the combination of several rounds using a voting scheme is able to improve the performance of an instance selection algorithm applied to the whole data set, with a large reduction in the execution time of the algorithm.

Assume that we have a training set T, with n instances, n+ from the minority or positive class and n− from the majority or negative class. First, the training data T is divided into t disjoint subsets Dj of approximately equal size s as follows:

    T = ⋃_{j=1}^{t} Dj.    (1)

The partition is carried out using a random algorithm. The result of this first step is t subsets where the distribution of classes is roughly as imbalanced as the distribution of the whole data set, due to the use of a random algorithm. To avoid effects derived from these uneven distributions, we balance all subsets by adding randomly selected instances of the minority class. These instances are randomly sampled without replacement to avoid repeated instances in any subset. It may happen in heavily imbalanced data sets that there are not enough instances from the minority class to construct balanced data sets. To prevent this from happening, the subset size must fulfill s ≤ 2n+.
The instance selection algorithm, or a method of our choice, is then applied to each subset separately,² and the results of all of the subsets are recorded. After applying the instance selection algorithm, we record the number of times that each instance has been selected to be kept. We call this record the number of votes due to its similarity with the combination of classifiers in an ensemble by voting [10]. This process is repeated for r rounds with different random partitions of the data set.

The final step is the combination of the different rounds. To obtain a meaningful combination, we use the philosophy of the ensembles of classifiers. In an ensemble, several weak learners are combined to form a strong classifier; in our method, several weak (in the sense that they are applied to subsets of the data) instance selection algorithms are combined to produce a strong and fast instance selection method. Each round can be considered similar to a classifier in an ensemble, and the combination process, by voting, is similar to the combination of base learners in bagging or boosting [11].

Once we have performed these r rounds, we have recorded the number of votes received by each instance. The number of votes received by an instance of the majority class is in the interval [0, r], as each majority class instance is only in one subset in every round. Minority class instances may be in more than one subset due to the balance process. Thus, their number of votes is in the interval [0, t·r]. The first decision is whether to perform the selection on both classes or only on the majority class. Performing selection only on the majority class constrains the maximum reduction that can be achieved. Furthermore, previous results [9] showed better performance when both classes were subject to selection. Thus, we opted to allow instances to be removed from both classes.

The final combination of votes must set a threshold to decide whether an instance must be selected as the final output of the process. A first natural choice would be majority voting: instances are kept if they receive at least half of the possible votes. However, the performance of this fixed threshold depends heavily on the problem. Therefore, we developed a method of automatically selecting an optimum vote threshold. To that end, we define a function f that evaluates the goodness of a certain subset of instances S. Using the threshold t, we obtain the set of selected instances S(t), and then f(S(t)) is calculated as follows:

    f(S(t)) = α r(S(t)) + (1 − α) a(S(t))    (2)

where r(S(t)) is the reduction achieved using threshold t to select S(t), and a(S(t)) is the accuracy achieved with the instances in S(t) using a 1-nearest neighbor (1-NN) classifier.³ The accuracy may be calculated using any of the class-imbalanced measures defined in the following. To obtain the best threshold, all of the possible values are evaluated, and the optimum is chosen.

However, this approach does not pay attention to the class-imbalanced nature of the data sets. To account for that, a further improvement is needed. This improvement is achieved using separate thresholds for the minority and majority classes. Two thresholds are then used: t+ is used for selecting minority class instances and t− for majority class instances. The evaluation of a pair of thresholds, i.e., t+ and t−, is made using the same equation (2), using the subset S(t+, t−) selected with these two thresholds. The key difference is that we must evaluate a larger set of values: [0, r] × [0, t·r]. Thus, for each pair of thresholds, we evaluate the following:

    f(S(t+, t−)) = α r(S(t+, t−)) + (1 − α) a(S(t+, t−))    (3)

and select the best pair of thresholds. The evaluation of this number of thresholds might preclude the scalability achieved by the divide-and-conquer approach. To avoid this negative effect, the evaluation of a pair of thresholds is also approached using a divide-and-conquer method. Instead of evaluating the accuracy of S(t+, t−) with the whole data set, which is of complexity O(n²), we apply the same partition philosophy used in the previous step. The training set is divided into random disjoint subsets, and accuracy is estimated separately in each subset, using the average evaluation of all the subsets as the fitness of each pair of thresholds.

This procedure obtains a selected set of instances that may be imbalanced. To obtain a balanced data set, we perform a last step. The class with more selected instances is undersampled, removing first the instances with fewer votes. This balanced data set is evaluated using (3). If it achieves a better evaluation, the balanced selected data set is used as the final result of the algorithm; otherwise, the selection obtained using the best thresholds is kept. The complete method is shown in Algorithm 1. As OligoIS uses a base instance selection algorithm to be applied to each subset (see line 2 in Algorithm 1), we will use the term OligoIS.X when OligoIS is applied using X as the instance selection algorithm.

Algorithm 1: OligoIS algorithm.
Data: A training set T = {(x1, y1), . . . , (xn, yn)}, subset size s, and number of rounds r.
Result: The set of selected instances S ⊆ T.
for i = 1 to r do
    1  Divide instances into ns disjoint subsets Di: ⋃i Di = T, of size s
    for j = 1 to ns do
        2  Apply instance selection algorithm to Dj
        3  Store votes of selected instances from Dj
    end
end
4  Obtain thresholds of votes to keep an instance from the minority, t+, and the majority, t−, classes
5  S = {xi ∈ T | (votes(xi) ≥ t+ and xi ∈ C+) or (votes(xi) ≥ t− and xi ∈ C−)}
6  Undersample the class with more instances in S to obtain S_balanced, removing first the instances with fewer votes
if f(S_balanced) ≥ f(S) then
    S = S_balanced
end
7  return S

²We can use instance selection methods that are specifically designed for class-imbalanced data sets, or, as the subsets are balanced, we can use any standard instance selection method.
³Any other classifier can be used in this function to obtain the accuracy of the subset selected.
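The exhaustive threshold-pair search of (2) and (3) can be sketched as follows. This is an illustration only: the weighting symbol (written alpha here, lost in extraction) and the accuracy callback are assumptions; in the paper, accuracy is a class-imbalance-aware measure estimated over random disjoint subsets rather than evaluated exactly.

```python
def best_thresholds(votes, y, accuracy_fn, r, t, alpha=0.5):
    """Select the vote thresholds (t+, t-) maximizing
    f(S) = alpha * reduction(S) + (1 - alpha) * accuracy(S).
    votes: index -> vote count; y: index -> +1 minority / -1 majority;
    accuracy_fn: set of selected indices -> accuracy estimate."""
    n = len(votes)
    best, best_f = None, float("-inf")
    for t_plus in range(0, t * r + 1):        # minority votes lie in [0, t*r]
        for t_minus in range(0, r + 1):       # majority votes lie in [0, r]
            S = {i for i, v in votes.items()
                 if (y[i] == +1 and v >= t_plus)
                 or (y[i] == -1 and v >= t_minus)}
            f = alpha * (1.0 - len(S) / n) + (1.0 - alpha) * accuracy_fn(S)
            if f > best_f:
                best_f, best = f, (t_plus, t_minus, S)
    return best
```

Note that evaluating `accuracy_fn` against the whole data set would reintroduce the O(n²) cost the paper avoids; the divide-and-conquer estimate described above keeps each evaluation cheap.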
algorithm, a steady-state genetic algorithm, a cross-generational elitist selection, heterogeneous recombination, and cataclysmic mutation (CHC) genetic algorithm, and a population-based incremental learning algorithm. They found that evolutionary-based methods were able to outperform classical algorithms in both classification accuracy and data reduction. Among the evolutionary algorithms, CHC was able to achieve the best overall performance. Therefore, we have included the CHC algorithm in our comparison. To account for the class-imbalanced nature of our data sets, we used G-mean as the accuracy measure instead of a testing error (see Section III-C).

5) García and Herrera [23] performed a comparison of different methods for sampling with class-imbalanced data sets and found evolutionary algorithms to outperform earlier nonevolutionary methods, although the computational cost is clearly higher. They proposed a family of different evolutionary methods based on the CHC algorithm in which two different accuracy measures can be used, i.e., G-mean and area under the curve (AUC), and the selection process can be applied to both the minority and majority classes or only to the majority class. The four different combinations were tested with similar results. To allow a fair comparison with the remaining algorithms, we have chosen the G-mean as the accuracy measure and selection of both classes. The authors referred to this method as evolutionary balancing undersampling (EBUS). It uses the following fitness function for an individual S:

    FitnessBal(S) = G − |1 − n+/n−| · P,  if n− > 0
    FitnessBal(S) = G − P,                if n− = 0    (4)

where G is the G-mean, n+ is the number of instances selected from the minority class, and n− is the number of instances selected from the majority class. This model aims to remove instances of both classes, identifying minority class examples that have a negative influence over the classification task and achieving maximal reduction in positive instances. A penalty factor P, which is used for preserving the same number of instances belonging to each class, helps to maintain a generalization capability in the reduction task. The authors suggested an empirically obtained value of P = 0.2.

6) Condensed Nearest Neighbor (CNN) Rule [24]. The standard CNN rule is adapted to class-imbalanced problems. First, we randomly draw one majority class example and all examples from the minority class and put these examples in S. We then test all of the instances in T not included in S using a 1-NN rule. All misclassified instances are added to S.

7) OSS [6]. OSS is an undersampling method resulting from the application of Tomek links (TLs) followed by the application of CNN. TLs [25] are defined as follows: Given a pair of instances xi and xj, belonging to classes yi and yj, respectively, where yi ≠ yj and d(xi, xj) is the distance between them, a pair (xi, xj) is called a TL if there is no example xl such that d(xi, xl) < d(xi, xj) or d(xj, xl) < d(xi, xj). If two instances form a TL, either one of them is noise or both are borderline samples. TLs can be used as an undersampling method, removing for every TL found the instance belonging to the majority class.

8) CNN + TL [26]. This method is similar to OSS, but the CNN rule is applied before the TLs. The method is faster than OSS because TLs are obtained for a smaller data set.

9) Modified Selective Subset (MSS) [27] method. This method is a modification of the algorithm of Ritter et al. [28] for finding a selective subset. A subset is selective if it is consistent and all prototypes in the original training set are nearer to a selective neighbor of the same class than to any member of the training set from a different class. The MSS algorithm is aimed at obtaining a minimal consistent subset using the selective property. In this way, the authors define the MSS as that subset of the training set that contains, for every instance x in the training set, that element of its neighborhood that is the nearest to a class other than that of x. The authors propose an iterative procedure to find this modified selective subset. Each instance is ordered regarding the distance to its nearest enemy.

10) Synthetic minority oversampling technique (SMOTE) [2]. One of the problems with oversampling is that merely making copies of the minority class samples does not add new information to the data set. To overcome this problem, Chawla et al. [2] proposed a method called SMOTE, which combines undersampling of the majority class with oversampling of the minority class. However, instead of oversampling the minority class by just making copies of the minority class samples, SMOTE generates synthetic instances from actual instances of the minority class. Synthetic samples are generated in the following way: take the difference between the feature vector (sample) under consideration and its nearest neighbor, multiply this difference by a random number between 0 and 1, and add it to the feature vector under consideration.

The source code used for all methods is in C and is licensed under the GNU General Public License. The code, the partitions of the data sets, and the detailed numerical results of all the experiments are available from http://cib.uco.es/index.php/supplementary-material-for-oligois.

All of the experiments were carried out in a cluster of 32 blades. Each blade is a biprocessor DELL PowerEdge M600 with four cores per processor; thus, we have 256 cores in total. The blades are interconnected with a master node and among them with a 1-Gb network. In the parallel implementation, we use a master/slave model in which all information processed by the slaves is sent by the master. The processors run at 2.5 GHz, and each blade has 16 GB of memory.

B. Statistical Tests

We used the Wilcoxon test as the main statistical test for comparing pairs of algorithms. This test was chosen because it assumes limited commensurability and is safer than parametric tests because it does not assume normal distributions or
homogeneity of variance. Therefore, it can be applied to accuracy and reduction. Furthermore, empirical results [29] show that it is also stronger than other tests.

For groups of methods, we first carry out an Iman–Davenport test to ascertain whether there are significant differences among methods. The Iman–Davenport test is based on the Friedman test, which compares the average ranks of k algorithms but is more powerful than the Friedman test. Let r_i^j be the rank of the jth algorithm on the ith data set, where, in case of ties, average ranks are assigned, and let R_j = (1/N) Σ_i r_i^j be the average rank for N data sets. Under the null hypothesis, all algorithms are equivalent; the statistic

    χ²_F = [12N / (k(k + 1))] [Σ_j R_j² − k(k + 1)²/4]    (5)

is distributed following a χ² distribution with k − 1 degrees of freedom for k and N sufficiently large. In general, N > 10 and k > 5 is enough. Iman and Davenport found this statistic to be too conservative and developed a better one as follows:

    F_F = (N − 1) χ²_F / [N(k − 1) − χ²_F]    (6)

which is distributed following an F distribution with k − 1 and (k − 1)(N − 1) degrees of freedom. After carrying out the Iman–Davenport test, we can perform pairwise comparisons with the Wilcoxon test. However, when all algorithms are compared with a control method, it is not advisable to perform many Wilcoxon tests against the control method. We can instead use one of the general procedures for controlling the familywise error in multiple hypothesis testing. The test statistic for comparing the ith and jth classifiers using these methods is as follows:

    z = (R_i − R_j) / √(k(k + 1)/(6N))    (7)

The z-value is used to find the corresponding probability from the table of the normal distribution, which is then compared with an appropriate α. Step-up and step-down procedures sequentially test the hypotheses ordered by their significance. We will denote the ordered p-values by p1, p2, . . ., so that p1 ≤ p2 ≤ · · · ≤ p_{k−1}. One of the simplest such methods was developed by Holm. It compares each p_i with α/(k − i). Holm's step-down procedure starts with the most significant p-value. If p1 is below α/(k − 1), the corresponding hypothesis is rejected, and we are allowed to compare p2 with α/(k − 2). If the second hypothesis is rejected, the test proceeds with the third, and so on. As soon as a certain null hypothesis cannot be rejected, all remaining hypotheses are retained as well. We will use for all statistical tests a significance level of 0.05.

C. Evaluation Measures

Accuracy is not a useful measure for imbalanced data, particularly when the number of instances of the minority class is very small compared with the majority class. If we have a ratio of 1:100, a classifier that assigns all instances to the majority class will have 99% accuracy. Several measures have been developed to take the imbalanced nature of the problems into account. Given the number of true positives (TPs), false positives (FPs), true negatives (TNs), and false negatives (FNs), we can define several measures. Perhaps the most common measures are the TP rate TPrate, recall R, or sensitivity Sn, i.e.,

    TPrate = R = Sn = TP / (TP + FN)    (8)

which is relevant if we are only interested in the performance on the positive class, and the TN rate TNrate, or specificity Sp, as follows:

    TNrate = Sp = TN / (TN + FP)    (9)

From these basic measures, others have been proposed, such as the F-measure or, if we are concerned about the performance on both negative and positive classes, the G-mean measure: G-mean = √(Sp · Sn). This last measure is used as the accuracy measure in all of our experiments. Thus, in the tables reporting our results, we will show the G-mean measure and the reduction after applying the instance selection algorithm. Reduction is measured as the percentage of instances removed after applying the selection process.

Many classifiers are subject to some kind of threshold that can be varied to achieve different values of the earlier measures. For that kind of classifier, receiver operating characteristic (ROC) curves can be constructed. An ROC curve is a graphical plot of the TPrate (sensitivity) against the FPrate (1 − specificity, or FPrate = FP/(TN + FP)) for a binary classifier system as its discrimination threshold is varied. The perfect model would achieve a TP rate of 1 and an FP rate of 0. A random guess will be represented by a line connecting the points (0, 0) and (1, 1). ROC curves are a good measure of the performance of classifiers. Furthermore, from this curve, a new measure, i.e., the AUC, can be obtained, which is a very good overall measure for comparing algorithms. AUC is a useful metric for classifier performance, as it is independent of the decision criterion selected and prior probabilities.

However, for the nearest neighbor rule, AUC is less commonly used due to the difficulty of obtaining the threshold needed to construct the ROC curve. We will present AUC results for the nearest neighbor using as threshold the ratio between the distances to the nearest neighbor for each one of the two classes.

IV. EXPERIMENTAL RESULTS

As stated, OligoIS is a method that can use any of the many available base instance selection algorithms, in the same way that boosting can use the classifier of choice. Thus, the first set of experiments focused on determining which instance selection algorithm was most appropriate for OligoIS. We performed experiments using OligoIS with eight of the nine methods described earlier. Random undersampling was not used because the subsets used in our method are always balanced.

Tables II and III show a pairwise comparison of all the methods for accuracy and reduction, respectively. Fig. 1 shows
TABLE II
PAIRWISE COMPARISON OF OLIGOIS FOR ACCURACY, MEASURED USING G-MEAN, USING THE EIGHT METHODS. WIN/LOSS RECORD AND THE p-VALUE OF THE WILCOXON TEST ARE SHOWN
TABLE III
PAIRWISE COMPARISON OF OLIGOIS FOR REDUCTION, MEASURED AS THE PERCENTAGE OF INSTANCES REMOVED, USING THE EIGHT METHODS. WIN/LOSS RECORD AND THE p-VALUE OF THE WILCOXON TEST ARE SHOWN
Fig. 1. Average Friedman's ranks for OligoIS using the eight instance selection algorithms.
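The win/loss records and signed-rank statistics behind these pairwise tables can be reproduced in outline as follows. This is a self-contained sketch of the Wilcoxon signed-rank computation over paired per-data-set results (the function name is our own); a real analysis would additionally convert the rank sums to a p-value.

```python
def wilcoxon_record(a, b):
    """Win/loss record and signed-rank sums for the paired results of
    two methods across data sets. Zero differences are discarded and
    tied absolute differences receive average ranks, as is usual for
    the Wilcoxon signed-rank test."""
    diffs = [x - y for x, y in zip(a, b) if x != y]
    wins = sum(1 for d in diffs if d > 0)
    losses = len(diffs) - wins
    ranked = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(ranked):
        j = i
        # group indices whose absolute differences are tied
        while j + 1 < len(ranked) and abs(diffs[ranked[j + 1]]) == abs(diffs[ranked[i]]):
            j += 1
        avg = (i + j) / 2 + 1          # average rank of the tied group
        for k in range(i, j + 1):
            ranks[ranked[k]] = avg
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return wins, losses, w_plus, w_minus
```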
Fig. 2. Results of the Holm test for OligoIS with seven different methods and OligoIS.CHC as the control method for (top) accuracy and (bottom) reduction. Numerical p-values are shown.
TABLE V
ACCURACY AND REDUCTION OF THE TEN STANDARD METHODS AND OLIGOIS.CHC
Fig. 3. Results of the Holm test for the standard methods and OligoIS.CHC as the control method for (top) G-mean, (middle) reduction, and (bottom) AUC. Numerical p-values are shown.
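The Holm step-down procedure used in these tests (described in Section III-B) can be sketched directly. This is a generic illustration for k − 1 hypotheses compared against a control method; the function name is our own.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm's step-down procedure: order the p-values, compare the ith
    smallest with alpha / (k - i), and stop at the first non-rejection;
    all remaining hypotheses are then retained."""
    k = len(p_values) + 1              # k classifiers -> k - 1 comparisons
    order = sorted(range(len(p_values)), key=lambda i: p_values[i])
    rejected = [False] * len(p_values)
    for step, idx in enumerate(order, start=1):
        if p_values[idx] < alpha / (k - step):
            rejected[idx] = True
        else:
            break                      # retain this and all larger p-values
    return rejected
```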
methods. A plot of average rank values is shown in Fig. 4 together with the average values of AUC, G-mean, and reduction for all methods. This plot shows that average ranks are clearly better for OligoIS.CHC. The figure also shows that OligoIS.CHC not only achieved superior accuracy to the remaining methods but that reduction was also clearly better than with the other nine methods.

The results of the nine methods against OligoIS.CHC are illustrated in Fig. 5. The figure shows results for accuracy and reduction. This graphic representation is based on the κ-error relative movement diagrams [30]. However, here, we use the reduction difference instead of the κ difference value. These diagrams use an arrow to represent the results of two methods applied to the same data set. The arrow starts at the
A. Execution Time
In the description of our method, we claimed that applying
the philosophy of divide-and-conquer would produce a method
that was more scalable than previous approaches. To verify
this claim, Fig. 7 shows the average time of each method
together with the time needed for the longest execution. The
execution time shown in the tables is the wall-clock time spent
by each part of the algorithm. We measure the time elapsed
from the beginning of the algorithm until it completes its final
output. That means that the time needed to read the data files,
perform the partition, send the data to the slaves, receive the
results, and obtain the best threshold of votes is included in
the reported time. In this way, our proposal establishes no
artificial advantage. The figure shows that OligoIS.CHC is significantly faster than the other methods. The difference is particularly remarkable for the largest problems and the parallel implementation of our method. Although we are comparing a parallel implementation of OligoIS.CHC against the other methods, we do not think that it is an unfair comparison, as natural parallelization is an inherent feature of OligoIS.

Fig. 4. Average Friedman's ranks and average values for the standard algorithms and OligoIS.CHC.
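The wall-clock measurement described above, which covers every phase from reading the data files to selecting the best vote threshold, can be sketched as follows. The phase names and `time.sleep` bodies are placeholders, not the paper's implementation:

```python
import time

def run_with_timing(phases):
    """Run every phase of a (hypothetical) master process and return the
    total wall-clock time, so no part of the pipeline is excluded."""
    start = time.perf_counter()
    for name, fn in phases:
        fn()  # each phase runs to completion before the next starts
    return time.perf_counter() - start

# Placeholder phases mirroring the steps listed in the text
phases = [
    ("read data files", lambda: time.sleep(0.01)),
    ("partition into balanced subsets", lambda: time.sleep(0.01)),
    ("send subsets to slaves / receive votes", lambda: time.sleep(0.01)),
    ("select best vote threshold", lambda: time.sleep(0.01)),
]
elapsed = run_with_timing(phases)
print(f"wall-clock time: {elapsed:.3f} s")
```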
Fig. 5. Reduction/accuracy using relative movement diagrams for OligoIS.CHC and the nine standard methods. Positive values on both axes show better performance by OligoIS.CHC.
experiments. This size achieved good performance and fast execution. In the first experiment, we tested whether a larger size would achieve better results. Fig. 8 shows the average accuracy, reduction, and execution time for subset sizes of 100, 250, 500, 750, and 1000 instances.

The results showed that the method is quite stable with respect to this parameter. An Iman–Davenport test failed to find significant differences among the different values of the parameter in terms of accuracy, with a p-value of 0.4883. For reduction, the Iman–Davenport test obtained a p-value of 0.0015, showing significant differences. We performed a Holm test using the value of 100 instances as the control experiment because this parameter value obtained the best average rank. The results of the Holm test are shown in Fig. 9. These results show that the best subset size in terms of reduction is 100 instances.

The second relevant parameter of our method is the number of rounds. We chose ten rounds, guided by the results of other approaches that combine different methods, such as ensembles of classifiers, which have shown that most of the performance gain is obtained by the first few classifiers added. To test the validity of this value, we performed experiments with 5, 10, 15, 20, and 25 rounds. Fig. 10 shows the average accuracy, reduction, and execution time of the experiments conducted for each number of rounds. As in the previous experiment, we performed an Iman–Davenport test for accuracy and reduction.
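The Friedman ranks and the Iman–Davenport correction used above can be computed as follows. The score matrix is invented for illustration; the Iman–Davenport statistic is F_ID = (N−1)χ²_F / (N(k−1) − χ²_F), which follows an F-distribution with (k−1) and (k−1)(N−1) degrees of freedom.

```python
def friedman_iman_davenport(scores):
    """Friedman chi-square and Iman–Davenport F statistic.

    scores[i][j] is the score of method j on data set i (higher is better).
    Methods are ranked within each data set; ties receive the average rank.
    """
    n, k = len(scores), len(scores[0])
    avg_ranks = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda j: row[j], reverse=True)
        ranks = [0.0] * k
        pos = 0
        while pos < k:
            tie_end = pos
            while tie_end + 1 < k and row[order[tie_end + 1]] == row[order[pos]]:
                tie_end += 1
            avg = (pos + tie_end) / 2 + 1      # average rank over the tie group
            for t in range(pos, tie_end + 1):
                ranks[order[t]] = avg
            pos = tie_end + 1
        for j in range(k):
            avg_ranks[j] += ranks[j] / n
    chi2 = 12 * n / (k * (k + 1)) * (sum(r * r for r in avg_ranks) - k * (k + 1) ** 2 / 4)
    f_id = (n - 1) * chi2 / (n * (k - 1) - chi2)
    return avg_ranks, chi2, f_id

# Hypothetical G-mean of 5 parameter settings (columns) on 6 data sets (rows)
scores = [
    [0.90, 0.88, 0.87, 0.86, 0.85],
    [0.80, 0.82, 0.79, 0.78, 0.77],
    [0.85, 0.84, 0.83, 0.85, 0.82],
    [0.70, 0.69, 0.71, 0.68, 0.67],
    [0.92, 0.91, 0.90, 0.89, 0.88],
    [0.75, 0.74, 0.73, 0.72, 0.71],
]
ranks, chi2, f_id = friedman_iman_davenport(scores)
print([round(r, 2) for r in ranks], round(chi2, 3), round(f_id, 3))
```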
Fig. 7. Average time and longest time in logarithmic scale for the standard algorithms and OligoIS.CHC.

Fig. 8. Average G-mean, reduction, and execution time for OligoIS.CHC for different subset sizes.

Fig. 6. (Top) Reduction and (bottom) AUC for OligoIS.CHC and CHC for the ten data sets with the highest IR.

Fig. 10. Average G-mean, reduction, and execution time for OligoIS.CHC for different numbers of rounds.

Fig. 13. Average G-mean, reduction, and classifier size for OligoIS.CHC and CHC using a C4.5 decision tree and an SVM.

TABLE VI
SUMMARY OF VERY LARGE DATA SETS

Fig. 11. Average G-mean, reduction, and execution time for OligoIS.CHC for different values of the parameter.
C. Using Other Classifiers

As a final experiment, we tested whether OligoIS was able to improve results when using other classifiers. We repeated the experiments using the best version of our algorithm, i.e., OligoIS.CHC, and the best standard method, i.e., CHC. We used as classifiers a decision tree built with the C4.5 [31] algorithm and a support vector machine (SVM) [32] with a Gaussian kernel, whose parameters γ and C were obtained by tenfold cross-validation. Fig. 13 shows the results for these two classifiers. As was the case with 1-NN, OligoIS.CHC improves the results of CHC in both accuracy and reduction. Furthermore, the classifiers induced in both cases are simpler. All of the differences were found significant at a 0.05 significance level using a Wilcoxon test.

Fig. 12. Results of the Holm test for OligoIS.CHC as the control method, with the parameter set to 0.5 as the control experiment for (top) accuracy and to 0.3 for (bottom) reduction, against OligoIS.CHC with the remaining parameter values of 0.3, 0.4, 0.5, 0.6, and 0.7 in each case. Numerical p-values are shown.

D. Scalability

In previous experiments, we showed the performance of OligoIS on data sets of medium size. However, one of the aims of this paper is the scalability of the presented method to very large data sets. Thus, we applied our algorithm to three very large problems. These data sets are shown in Table VI. ccds and dna represent the same problems as the corresponding data sets in Table I but with a larger set of instances. chrom21 is the bioinformatics problem of predicting the translation initiation site of a gene in the sequence of human chromosome 21. With a largest set of five million instances and 800 features, these data sets represent a serious challenge for any method. Furthermore, chrom21 and dna also show a large IR. To clarify the difficulty of the problem, we show the memory size of these three data sets using standard floating-point precision stored with 4 B.
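The storage cost implied by 4-B floating-point precision is easy to verify for the largest data set mentioned above (five million instances, 800 features); the arithmetic below shows why memory-resident instance selection methods cannot handle it on ordinary hardware:

```python
# Back-of-the-envelope memory footprint of the largest data set
instances = 5_000_000
features = 800
bytes_per_float = 4          # standard single-precision storage

total_bytes = instances * features * bytes_per_float
print(total_bytes / 10**9)   # 16.0 (decimal gigabytes)
```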
TABLE VII
SUMMARY OF RESULTS FOR LARGE DATA SETS. SIGNIFICANT
DIFFERENCES FOR A RESAMPLED t-TEST ARE MARKED IN BOLDFACE

We applied OligoIS.CHC with a subset size of 1000 instances. As the control experiment, we used random undersampling because none of the other standard methods is scalable to data sets of this size. Table VII shows the results of both methods. The first interesting result is the good scalability of our proposal. For the largest data set of five million instances and 800 features, OligoIS.CHC took 1.5 h to complete the whole process. If we estimate the time that standard CHC would need for such a data set, considering that it needed more than 3 h for arabidopsis, which has 33 971 instances, we can conclude that OligoIS reduces the time by several orders of magnitude.

Furthermore, the performance of OligoIS.CHC is competitive. For ccds, there are no significant differences in terms of accuracy, but OligoIS.CHC achieves a significantly better reduction. For chrom21 and dna, both methods achieved the same reduction, but OligoIS.CHC obtained significantly better accuracy.

V. CONCLUSION AND FUTURE WORK

In this paper, we have presented a new method of instance selection for class-imbalanced data sets that is applicable to very large data sets. The method consists of concurrently applying instance selection to small class-balanced subsets of the original data set and combining the results by means of a voting method, setting a different threshold for minority and majority class samples. Using a CHC algorithm as the base algorithm, we have shown that our method exceeds the performance of state-of-the-art selection methods for class-imbalanced data in both accuracy and reduction. Its advantage over the state-of-the-art methods was larger when the IR of the problem was higher.

The results showed that our approach scaled up to problems of very large size. Two features of our method guarantee that scalability: First, instance selection is always performed over small data sets, keeping the time spent by the process low; and second, only small subsets of instances must be kept in memory, removing any scalability constraint due to memory limits. This scalability has been experimentally proven with the largest data set of five million instances and 800 features.

REFERENCES

[1] H. He and E. A. Garcia, "Learning from imbalanced data," IEEE Trans. Knowl. Data Eng., vol. 21, no. 9, pp. 1263–1284, Sep. 2009.
[2] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, "SMOTE: Synthetic minority over-sampling technique," J. Artif. Intell. Res., vol. 16, no. 1, pp. 321–357, Jan. 2002.
[3] C. Ling and G. Li, "Data mining for direct marketing: Problems and solutions," in Proc. 4th Int. Conf. KDD, New York, 1998, pp. 73–79.
[4] G. M. Weiss and F. Provost, "The effect of class distribution on classifier learning: An empirical study," Dept. Comput. Sci., Rutgers Univ., Newark, NJ, Tech. Rep. TR-43, 2001.
[5] A. Estabrooks, T. Jo, and N. Japkowicz, "A multiple resampling method for learning from imbalanced data sets," Comput. Intell., vol. 20, no. 1, pp. 18–36, Feb. 2004.
[6] M. Kubat and S. Matwin, "Addressing the curse of imbalanced training sets: One-sided selection," in Proc. 14th Int. Conf. Mach. Learn., 1997, pp. 179–186.
[7] S. García, J. Derrac, I. Triguero, C. J. Carmona, and F. Herrera, "Evolutionary-based selection of generalized instances for imbalanced classification," Knowl.-Based Syst., vol. 25, no. 1, pp. 3–12, Feb. 2012.
[8] N. García-Pedrajas, J. A. Romero del Castillo, and D. Ortiz-Boyer, "A cooperative coevolutionary algorithm for instance selection for instance-based learning," Mach. Learn., vol. 78, no. 3, pp. 381–420, Mar. 2010.
[9] S. García and F. Herrera, "Evolutionary undersampling for classification with imbalanced datasets: Proposals and taxonomy," Evol. Comput., vol. 17, no. 3, pp. 275–306, Fall 2009.
[10] L. Kuncheva and C. J. Whitaker, "Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy," Mach. Learn., vol. 51, no. 2, pp. 181–207, May 2003.
[11] N. García-Pedrajas, C. García-Osorio, and C. Fyfe, "Nonlinear boosting projections for ensemble construction," J. Mach. Learn. Res., vol. 8, pp. 1–33, May 2007.
[12] R. Barandela, J. S. Sánchez, V. García, and E. Rangel, "Strategies for learning in class imbalance problems," Pattern Recognit., vol. 36, no. 3, pp. 849–851, Mar. 2003.
[13] I. Guyon and A. Elisseeff, "An introduction to variable and feature selection," J. Mach. Learn. Res., vol. 3, pp. 1157–1182, Mar. 2003.
[14] A. Frank and A. Asuncion, UCI Machine Learning Repository, 2010.
[15] N. García-Pedrajas, "Constructing ensembles of classifiers by means of weighted instance selection," IEEE Trans. Neural Netw., vol. 20, no. 2, pp. 258–277, Feb. 2009.
[16] D. R. Wilson and T. R. Martinez, "Reduction techniques for instance-based learning algorithms," Mach. Learn., vol. 38, no. 3, pp. 257–286, Mar. 2000.
[17] D. L. Wilson, "Asymptotic properties of nearest neighbor rules using edited data," IEEE Trans. Syst., Man, Cybern., vol. SMC-2, no. 3, pp. 408–421, Jul. 1972.
[18] H. Brighton and C. Mellish, "Advances in instance selection for instance-based learning algorithms," Data Mining Knowl. Discov., vol. 6, no. 2, pp. 153–172, Apr. 2002.
[19] L. Kuncheva, "Editing for the k-nearest neighbors rule by a genetic algorithm," Pattern Recognit. Lett., vol. 16, no. 8, pp. 809–814, Aug. 1995.
[20] H. Ishibuchi and T. Nakashima, "Pattern and feature selection by genetic algorithms in nearest neighbor classification," J. Adv. Comput. Intell. Intell. Inf., vol. 4, no. 2, pp. 138–145, 2000.
[21] C. R. Reeves and D. R. Bush, "Using genetic algorithms for training data selection in RBF networks," in Instance Selection and Construction for Data Mining, H. Liu and H. Motoda, Eds. Norwell, MA: Kluwer, 2001, pp. 339–356.
[22] J. R. Cano, F. Herrera, and M. Lozano, "Using evolutionary algorithms as instance selection for data reduction in KDD: An experimental study," IEEE Trans. Evol. Comput., vol. 7, no. 6, pp. 561–575, Dec. 2003.
[23] S. García, A. Fernández, and F. Herrera, "Enhancing the effectiveness and interpretability of decision tree and rule induction classifiers with evolutionary training set selection over imbalanced problems," Appl. Soft Comput., vol. 9, no. 4, pp. 1304–1314, Sep. 2009.
[24] P. E. Hart, "The condensed nearest neighbor rule," IEEE Trans. Inf. Theory, vol. IT-14, no. 3, pp. 515–516, May 1968.
[25] I. Tomek, "Two modifications of CNN," IEEE Trans. Syst., Man, Cybern., vol. SMC-6, no. 11, pp. 769–772, Nov. 1976.
[26] G. E. A. P. A. Batista, R. C. Prati, and M. C. Monard, "A study of the behavior of several methods for balancing machine learning training data," SIGKDD Explor. Newslett., vol. 6, no. 1, pp. 20–29, Jun. 2004.
[27] R. Barandela, F. J. Ferri, and J. S. Sánchez, "Decision boundary preserving prototype selection for nearest neighbor classification," Int. J. Pattern Recognit. Artif. Intell., vol. 19, no. 6, pp. 787–806, Sep. 2005.
[28] G. L. Ritter, H. B. Woodruff, S. R. Lowry, and T. L. Isenhour, "An algorithm for selective nearest neighbor decision rule," IEEE Trans. Inf. Theory, vol. IT-21, no. 6, pp. 665–669, Nov. 1975.
[29] J. Demšar, "Statistical comparisons of classifiers over multiple data sets," J. Mach. Learn. Res., vol. 7, pp. 1–30, Dec. 2006.
[30] J. Maudes-Raedo, J. J. Rodríguez-Díez, and C. García-Osorio, "Disturbing neighbors diversity for decision forest," in Proc. Workshop SUEMA, G. Valentini and O. Okun, Eds., Patras, Greece, Jul. 2008, pp. 67–71.
[31] J. R. Quinlan, C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann, 1993.
[32] V. Vapnik, The Nature of Statistical Learning Theory. New York: Springer-Verlag, 1999.
Nicolás García-Pedrajas was born in Córdoba, Spain, in 1970. He received the B.S. degree in computing and the Ph.D. degree from the University of Málaga, Málaga, Spain, in 1993 and 2001, respectively.
He is currently a Professor in the area of computer science and artificial intelligence with the Department of Computing and Numerical Analysis, University of Córdoba, Córdoba, and the Head of the Computational Intelligence and Bioinformatics Research Group, University of Córdoba. His current research interests include classification, neural networks, evolutionary computation, and bioinformatics.

Aida de Haro-García was born in Córdoba, Spain, in 1984. She received the B.S. degree in computing from the University of Córdoba, Córdoba, in 2007 and the Ph.D. degree from the University of Granada, Granada, Spain, in 2011.
She is currently an Assistant Professor in the area of computer science and artificial intelligence with the Department of Computing and Numerical Analysis, University of Córdoba. She is also a member of the Computational Intelligence and Bioinformatics Research Group, University of Córdoba. Her current research interests include classification, evolutionary computation, and feature and instance selection.