A New Feature Selection Algorithm Based On Binary Ant Colony Optimization
Neurocomputing
journal homepage: www.elsevier.com/locate/neucom
Article info

Article history:
Received 25 December 2011
Received in revised form 25 August 2012
Accepted 26 August 2012
Available online 19 September 2012
Communicated by T. Heskes

Keywords: DNA microarray; Ant colony optimization; Class imbalance; Undersampling; Support vector machine

Abstract

In DNA microarray data, the class imbalance problem occurs frequently, causing poor prediction performance for minority classes. Moreover, its other characteristics, such as high dimensionality, small sample size and high noise, intensify this damage. In this study, we propose ACOSampling, a novel undersampling method based on the idea of ant colony optimization (ACO), to address this problem. The algorithm starts with feature selection to eliminate noisy genes from the data. Then the original training set is randomly and repeatedly divided into two groups: a training set and a validation set. In each division, a modified ACO algorithm, a variant of our previous work, is conducted to filter less informative majority samples and to search for the corresponding optimal training sample subset. At last, the statistical results from all local optimal training sample subsets are given in the form of a frequency list, where each frequency indicates the importance of the corresponding majority sample. We extract only the high-frequency samples and combine them with all minority samples to construct the final balanced training set. We evaluated the method on four benchmark skewed DNA microarray datasets with a support vector machine (SVM) classifier, showing that the proposed method outperforms many other sampling approaches, which indicates its superiority.

© 2012 Elsevier B.V. All rights reserved.
http://dx.doi.org/10.1016/j.neucom.2012.08.018
In each partition, ACOSampling is conducted to find the corresponding optimal majority class sample subset. Different from the traditional ACO algorithm, ACOSampling impels ants to leave the nest, pass all majority class samples one by one by either pathway 0 or pathway 1, and at last reach the food source, where pathway 0 indicates that the corresponding sample is useless and should be filtered, while pathway 1 indicates that it is important and should be selected. Considering the particularity of the classification tasks in this study, the overall accuracy is not an adequate fitness function, so we construct the fitness from three weighted indicative metrics, namely F-measure, G-mean and AUC. After that, many local optimal majority class sample subsets are generated by iterative partitions, so the significance of each majority sample can be estimated according to its selection frequency, i.e., the higher the selection frequency, the more information the corresponding sample can provide. Next, a globally optimal balanced sample set is created by combining the highly ranked samples of the majority class with all examples of the minority class. At last, we construct an SVM classifier on the balanced training set for recognizing future unlabeled samples.

The remainder of this paper is organized as follows. Section 2 reviews previous work related to the class imbalance problem. In Section 3, the idea and procedure of the ACOSampling method are described in detail. Experimental results and discussions are presented in Section 4. Finally, we conclude the paper in Section 5.

2. Previous work
As mentioned in Section 1, the existing class imbalance learning methods can be roughly categorized into two major groups: sampling strategies and cost sensitive learning. Here, we pay special attention to sampling strategies because they are more closely related to our study.

Sampling is essentially a re-balancing process for the given imbalanced data set. It can be divided into oversampling and undersampling. Oversampling, as its name indicates, adds samples to the minority class, while undersampling removes examples from the majority class. The simplest sampling methods are Random Over Sampling (ROS) and Random Under Sampling (RUS) [15]. The former tends to make the learner overfit by simply duplicating some minority class samples, while the latter may lose valuable classification information because many majority examples are randomly removed [11]. To overcome their drawbacks, some more sophisticated sampling methods were developed. The Synthetic Minority Over-sampling TEchnique (SMOTE), proposed by Chawla et al. [16], creates artificial data based on the feature space similarities between existing minority examples. Specifically, it randomly selects one sample xi in the minority class and finds its K nearest neighbors belonging to the same class by Euclidean distance. To create a synthetic sample, it randomly selects one of the K nearest neighbors, multiplies the corresponding feature vector difference by a random number in [0, 1], and finally adds this vector to xi.
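As a rough sketch of this interpolation step (not the implementation used in the paper; the array name minority_X and the neighbor count k are hypothetical), the core of SMOTE can be written as:

```python
import numpy as np

def smote_sample(minority_X, k=5, rng=np.random.default_rng(0)):
    """Create one synthetic minority sample by SMOTE-style interpolation."""
    i = rng.integers(len(minority_X))            # randomly pick a minority sample x_i
    x_i = minority_X[i]
    d = np.linalg.norm(minority_X - x_i, axis=1) # Euclidean distance to every minority sample
    neighbors = np.argsort(d)[1:k + 1]           # its k nearest minority-class neighbors
    x_nn = minority_X[rng.choice(neighbors)]     # pick one neighbor at random
    gap = rng.random()                           # random number in [0, 1]
    return x_i + gap * (x_nn - x_i)              # interpolate along the difference vector
```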
Han et al. [17] observed that most misclassified samples scatter around the borderline between the two categories, and presented two improved versions of SMOTE, Borderline-SMOTE1 (BSO1) and Borderline-SMOTE2 (BSO2). In BSO1, SMOTE only runs on those minority class samples near the borderline, while BSO2 generates synthetic minority class samples between each frontier minority example and one of its K nearest neighbors belonging to the majority class, thus mildly enlarging the decision region of the minority class. One Side Selection (OSS) has a very similar idea to BSO2. It shrinks the decision area of the majority class by cleaning noisy samples, redundant samples and boundary examples in the majority category [18]. As another improved oversampling method, Adaptive Synthetic Sampling (ADA-SYN) uses a density distribution as the criterion to automatically decide the number of synthetic samples that need to be generated for each minority example, adaptively changing the weights of different minority class examples to compensate for the skewed distribution [19]. Another sampling method using a density distribution is under-sampling based on clustering (SBC), presented recently by Yen and Lee [20]. SBC automatically decides how many majority class samples to remove in each cluster according to the corresponding density distribution. García et al. [37] compared the two kinds of sampling strategies and found that oversampling generally produces better classification performance when the dataset is highly skewed, while undersampling is more effective when the imbalance ratio is very low. All in all, sampling possesses many advantages, such as being simple, intuitive, and having low time complexity and low storage cost, so it is convenient to apply in real-world imbalanced classification tasks. In Section 4, we investigate the performance of the proposed ACOSampling method compared with the original data without sampling (ORI) and several benchmark sampling strategies described above, such as ROS, RUS, SMOTE, BSO1, BSO2, OSS, ADA-SYN and SBC.

Cost sensitive learning methods consider the costs associated with misclassifying samples [21]. Instead of creating balanced data distributions through sampling, cost-sensitive learning assigns different costs to the samples belonging to different classes by creating a cost matrix. Based on the cost matrix, misclassifications of the minority class are more expensive than those of the majority class. Moreover, cost sensitive learning pursues minimizing the total cost rather than the error rate, so the significance of the minority class is highlighted. There are generally three kinds of cost sensitive learning methods. The first is based on the translation theorem [22]; it applies misclassification costs to the data set in terms of data-space weighting. The second class, building on the MetaCost framework [23], applies cost-minimizing techniques to the combination schemes of ensemble methods. Some existing research has combined these two strategies, such as the AdaCX series algorithms [24] and AdaCost [25]. The last class of cost sensitive learning methods directly designs appropriate cost functions for a specific classifier, including cost-sensitive decision trees [26], cost-sensitive neural networks [27] and cost-sensitive support vector machines [28], etc. In some application fields, it has been demonstrated that cost sensitive learning is superior to sampling approaches [29]. However, it is difficult to pre-design an appropriate cost function when the class imbalance problem occurs [27].

In recent years, ensemble learning has also become a popular way to solve class imbalance problems. Generally speaking, in this technology, an ensemble learning framework is incorporated with a sampling approach or a weighting strategy to acquire better classification performance and generalization capability. Chawla et al. introduced SMOTE into the Boosting framework to develop the SMOTEBoost learning method [30]. Unlike the base classifier generation strategy in traditional Boosting, SMOTEBoost promotes weak classifiers by altering the distributions of the samples of different classes with SMOTE. Liu et al. combined RUS and the AdaBoost classifier to overcome the information-loss deficiency of the traditional RUS method and presented two ensemble strategies: EasyEnsemble and BalanceCascade [31]. In contrast with the Boosting framework, Bagging seems to leave less room to be modified for the class imbalance problem. However, there are still some improved versions of Bagging, including Asymmetric Bagging (asBagging), which has been used for image retrieval [32] and prediction of drug molecules [33], and Roughly Balanced Bagging (RB Bagging) based on negative binomial distributions [34], etc. In literature [35], Khoshgoftaar et al. compared the existing Boosting and Bagging technologies on noisy and imbalanced data and found that Bagging series algorithms generally outperform Boosting. However, ensemble learning is more time-consuming than both of the former methods, which restricts it in practical applications [35].
3. Methods

3.1. Undersampling based on ant colony optimization

The ant colony optimization (ACO) algorithm, developed by Colorni et al. [38], is an important member of the swarm intelligence family. ACO simulates the foraging behavior of a real ant colony and, in recent years, it has been successfully applied to various practical optimization problems, including the TSP [39], parameter optimization [40], path planning [41], protein folding [42], etc.

In previous work, we designed an ACO algorithm to select tumor-related marker genes in DNA microarray data [36]. In this study, we transfer it from the feature space to the sample space to search for an undersampling set, which is regarded as the optimal subset estimated on the given validation set. However, this optimal set is not necessarily absolutely balanced. In addition, in the optimization process, to justly evaluate the performance of each ant, several indicative metrics are jointly used to construct the fitness function.

Fig. 1 describes the sample selection procedure using our ACO algorithm. As indicated in Fig. 1, the process of sample selection may be regarded as the procedure of one ant seeking food. Between the nest and the food, sites are built one by one, and each of them represents one alternative sample of the majority class in the original training set. In the process of moving from nest to food, the ant passes each site by either pathway 0 or pathway 1, where pathway 0 denotes that the next sample will be filtered and pathway 1 denotes that the next sample will be selected. At last, when the ant arrives at the food, some majority samples are extracted and combined with all minority examples to constitute the corresponding training set. For example, a binary set {1, 0, 1, 0, 0, 1} means that the 1st, 3rd and 6th majority samples have been picked out. The newly created training set is then evaluated according to its fitness on the validation set. Ants cooperate with each other through the intensity of the pheromone left on every pathway to search for the optimal route.

In our ACO algorithm, many ants synchronously search pathways from nest to food. They select pathways according to the quantities of pheromone left on these pathways: the more pheromone is left, the higher the probability that the corresponding pathway is selected. We compute the probability of selecting a pathway by:

p_ij = τ_ij / Σ_{j=0}^{k} τ_ij    (1)

where i represents the ith site, i.e., the ith majority sample in the original training set; j denotes the pathway, which may be assigned 1 or 0 to denote whether the corresponding sample is selected or not; τ_ij is the pheromone intensity of the ith site on the jth pathway; and p_ij and k are the probability of selecting the jth pathway of the ith site and the possible value of pathway j (0 or 1), respectively. When an ant arrives at the food source, the corresponding sample subset is evaluated by the fitness function. It is worth noting that overall classification accuracy is not an indicative measure for imbalanced classification tasks [16]. Therefore, we use a special metric designed by Yang et al. [43] to evaluate classification performance, which is given in formula (2):

fitness = α·F-measure + β·G-mean + γ·AUC,  s.t. α + β + γ = 1    (2)

The fitness function is composed of three weighted metrics: F-measure, G-mean and AUC. We will introduce these metrics in Section 4.2. When one cycle finishes, the pheromone of all pathways is updated; the update function is inherited from the literature [38] and is described as follows:

τ_ij(t+1) = ρ·τ_ij(t) + Δτ_ij    (3)

where ρ is the evaporation coefficient, which controls the decrement of pheromone, and Δτ_ij is the pheromone added to some excellent pathways. In this paper, we add pheromone on the pathways of the best 10% of ants after each cycle and store these pathways in a set E. Δτ_ij is defined as follows:

Δτ_ij = { fitness / (0.1·ant_n),  if pathway_ij ∈ E
        { 0,                      if pathway_ij ∉ E    (4)

In formula (4), ant_n is the size of the ant colony, i.e., the number of ants. When one cycle finishes, the pheromone of some pathways is intensified and that of the others is weakened, which guarantees that excellent pathways are given more chances in the next cycle. When the ACO algorithm converges, all ants are inclined to select the same pathway. At last, the optimal solution is returned.
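A minimal sketch of one search cycle under formulas (1)-(4) is given below. It is not the authors' implementation: evaluate_subset, rho, ph_min, ph_max and ant_n are placeholder names, and evaluate_subset is assumed to train and validate a classifier on the subset and return the weighted fitness of formula (2).

```python
import numpy as np

def aco_cycle(tau, ant_n, evaluate_subset, rho=0.9, ph_min=0.1, ph_max=10.0,
              rng=np.random.default_rng(0)):
    """One cycle of the binary ACO search over majority class samples.

    tau: (n_majority, 2) pheromone matrix, column 0 = "filter", column 1 = "select".
    evaluate_subset(mask) must return the fitness of formula (2), e.g.
    alpha*F-measure + beta*G-mean + gamma*AUC measured on the validation set.
    """
    n = tau.shape[0]
    solutions, fitnesses = [], []
    for _ in range(ant_n):
        p = tau / tau.sum(axis=1, keepdims=True)              # formula (1)
        mask = np.array([rng.choice(2, p=p[i]) for i in range(n)])
        solutions.append(mask)
        fitnesses.append(evaluate_subset(mask))
    # evaporation plus deposit by the best 10% of ants, formulas (3)-(4)
    elite = np.argsort(fitnesses)[-max(1, int(0.1 * ant_n)):]
    tau *= rho
    for e in elite:
        for i in range(n):
            tau[i, solutions[e][i]] += fitnesses[e] / (0.1 * ant_n)
    np.clip(tau, ph_min, ph_max, out=tau)                     # bounded pheromone (Section 4)
    return tau, solutions[int(np.argmax(fitnesses))]
```

Repeating such cycles until the pheromone concentrates on one pathway per site corresponds to the convergence behavior described above.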
Fig. 1. Sample selection procedure based on the ACO algorithm.
Fig. 2. Pseudo-code description of the undersampling algorithm based on ACO.
In contrast with our previous work, we make a few changes in this study, such as the fitness function and the pheromone update function. On the other hand, the algorithm also inherits some advantages from the previous method; for example, we impose upper and lower boundaries on the pheromone of each pathway to prevent the algorithm from sinking into a local optimum prematurely. A pseudo-code description of the algorithm is summarized in Fig. 2.

3.2. ACOSampling strategy

With the ACO algorithm described above, an excellent undersampling subset can be extracted as the final training set to construct a classifier and recognize future testing samples. However, to guide the optimization procedure, we have to divide the original training set into two parts, a training set and a validation set, before the ACO algorithm works. Generally, this can cause two severe problems for the constructed classifier: information loss and overfitting due to the employment of the validation set. In particular, when classification tasks are based on a small sample set, these problems become more serious. To solve this problem, we present a novel strategy named ACOSampling, which produces a more robust classifier by combining repeated partitioning of the original sample set with the ACO algorithm. The frame diagram of the ACOSampling strategy is presented in Fig. 3.

As can be observed from Fig. 3, in our design, ACOSampling applies 3-fold cross validation to evaluate classification performance, i.e., two-thirds of the samples are extracted into the original training set and the rest are used for testing each time. In practical applications, it may be easily modified to fit various validation methods. Meanwhile, to impartially estimate the amount of information carried by each majority example and to avoid overfitting of the final generated classifier, the original training set is randomly and repeatedly divided into two groups, a training set (2/3) and a validation set (1/3), 100 times. It is clear that in these 100 repeated partitions, each sample has an equal chance to be picked into the training set. We then conduct the ACO algorithm in each partition to find the corresponding optimal undersampling set and, based on these local optimal sets, compute the ranking frequency list of the majority class samples (take Fig. 4 as an example: the more times one sample emerges, the more information it can provide for classification). Next, a balanced dataset is created by combining the highly ranked samples of the majority class with all examples of the minority class. At last, we train a classifier on the balanced sample set and evaluate its performance on the testing set. According to the descriptions above, the main loop of the ACOSampling method can be summarized by the pseudo-code in Fig. 5.
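A condensed sketch of this main loop follows. It is an illustration only: run_aco is a hypothetical helper standing for the ACO search of Section 3.1, and keeping exactly as many high-frequency majority samples as there are minority samples is an assumption made to yield a balanced set.

```python
import numpy as np

def aco_sampling(X_train, y_train, run_aco, n_partitions=100, majority_label=0,
                 rng=np.random.default_rng(0)):
    """Sketch of the ACOSampling main loop.

    run_aco(train_idx, valid_idx) is assumed to perform the ACO search on one
    random 2/3-1/3 partition and return the indices (into the original training
    set) of the majority samples kept in that partition's local optimal subset.
    """
    maj_idx = np.where(y_train == majority_label)[0]
    min_idx = np.where(y_train != majority_label)[0]
    freq = {int(i): 0 for i in maj_idx}                 # selection frequency list
    n = len(y_train)
    for _ in range(n_partitions):
        perm = rng.permutation(n)
        train_idx, valid_idx = perm[: 2 * n // 3], perm[2 * n // 3:]
        for i in run_aco(train_idx, valid_idx):         # accumulate local optima
            freq[int(i)] += 1
    ranked = sorted(freq, key=freq.get, reverse=True)   # highest frequency first
    keep = np.concatenate([np.array(ranked[: len(min_idx)]), min_idx])
    return X_train[keep], y_train[keep]                 # balanced training set
```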
3.3. Support vector machine

The support vector machine (SVM), introduced by Vapnik [44], is a valuable tool for solving pattern classification problems. In contrast with traditional classification methods, SVM possesses several prominent advantages: (1) high generalization capability; (2) absence of local minima; (3) being suitable for small-sample
Fig. 4. Example of the ranking frequency list of the majority class samples (selection frequency plotted against sample index, together with the ranked list of Rank/Index/Frequency entries, e.g., rank 1: sample 15, frequency 49; rank 2: sample 7, frequency 42; ...; rank 27: sample 24, frequency 1).
where x_i is a d-dimensional sample, y_i is the corresponding class label, and N is the number of samples. The discriminant function of the SVM can be described as follows:

g(x) = sgn( Σ_{i=1}^{sv} a_i·y_i·K(x, x_i) + b )    (5)
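A minimal sketch of this discriminant function, assuming an RBF kernel and hypothetical arrays alpha, sv_X, sv_y and a bias b holding already-trained support-vector coefficients, could look like:

```python
import numpy as np

def rbf_kernel(x, x_i, gamma=0.5):
    """K(x, x_i) = exp(-gamma * ||x - x_i||^2), one common kernel choice."""
    return np.exp(-gamma * np.sum((x - x_i) ** 2))

def svm_decision(x, sv_X, sv_y, alpha, b, gamma=0.5):
    """g(x) = sgn( sum_i a_i * y_i * K(x, x_i) + b ), as in formula (5)."""
    s = sum(a * y * rbf_kernel(x, x_i, gamma) for a, y, x_i in zip(alpha, sv_y, sv_X))
    return 1 if s + b >= 0 else -1
```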
g_ij = (g⁰_ij − m_i) / s_i    (7)

... by all samples belonging to the nth class. Take the colon dataset [5] as an example: we compute the SNR value for all 2000 genes and rank them in ascending sequence (see Fig. 7). Fig. 7 shows that only quite a few genes have high SNR values, and these could be regarded as feature genes closely related to the classification task. In this study, we select the top-100 ranked genes to conduct the experiments.
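A sketch of such a ranking step is shown below. The exact SNR definition of the paper's Section 4.1 is not fully reproduced in this extraction, so the common signal-to-noise form |μ1 − μ2| / (σ1 + σ2) is assumed here; the function and argument names are hypothetical.

```python
import numpy as np

def snr_ranking(X, y, top_k=100):
    """Rank genes by a signal-to-noise ratio and keep the top_k feature genes.

    X: (n_samples, n_genes) expression matrix, y: binary class labels.
    Assumes SNR_j = |mu1_j - mu2_j| / (s1_j + s2_j) per gene j.
    """
    pos, neg = X[y == 1], X[y == 0]
    snr = np.abs(pos.mean(axis=0) - neg.mean(axis=0)) / \
          (pos.std(axis=0) + neg.std(axis=0) + 1e-12)
    ranked = np.argsort(snr)[::-1]      # genes sorted from highest to lowest SNR
    return ranked[:top_k]               # indices of the selected feature genes
```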
Fig. 7. SNR value distribution for gene index (left) and gene ranking (right).
Table 1
Datasets used in this study.

Dataset      Size   Genes   Maj:Min   Imbalance ratio
Colon [5]    62     2000    40:22     1.82
CNS [53]     60     7129    39:21     1.86
Lung [8]     39     2880    24:15     1.60
Glioma [9]   50     10367   36:14     2.57

Table 2
Confusion matrix.

                        Predicted positive class   Predicted negative class
Actual positive class   TP (True positive)         FN (False negative)
Actual negative class   FP (False positive)        TN (True negative)
Table 3
Initial parameters settings.

4. Experiments
It is well known that in skewed recognition tasks the overall accuracy (Acc) generally gives a biased evaluation, so some other specific evaluation metrics, such as F-measure, G-mean and the area under the receiver operating characteristic curve (AUC), are needed to estimate the classification performance of a learner. F-measure and G-mean may be regarded as functions of the confusion matrix shown in Table 2. They are calculated as follows:

F-measure = 2·Precision·Recall / (Precision + Recall)    (9)

G-mean = √(TPR·TNR)    (10)

where Precision, Recall, TPR and TNR are further defined as follows:

Precision = TP / (TP + FP)    (11)

Recall = TPR = TP / (TP + FN)    (12)

AUC is the area below the ROC curve, which depicts the performance of a method using the (FPR, TPR) pairs. It has been proved to be a reliable performance measure for the class imbalance problem [11].
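The following fragment is a direct transcription of formulas (9)-(12) from the confusion-matrix counts of Table 2 (TNR is taken as TN/(TN+FP); AUC is omitted, since it is computed from the full ROC curve rather than a single confusion matrix):

```python
import math

def imbalance_metrics(tp, fn, fp, tn):
    """F-measure and G-mean from the confusion matrix of Table 2."""
    precision = tp / (tp + fp)                                    # formula (11)
    recall = tpr = tp / (tp + fn)                                 # formula (12), Recall = TPR
    tnr = tn / (tn + fp)
    f_measure = 2 * precision * recall / (precision + recall)     # formula (9)
    g_mean = math.sqrt(tpr * tnr)                                 # formula (10)
    return f_measure, g_mean
```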
In this study, some initial parameters used in ACOSampling and SVM have been set empirically according to our previous work [36] (see Table 3). For several parameters, such as phmin and phmax, we made small adjustments based on extensive experimental results.

4.3. Results and discussions

First, we conduct experiments on the four imbalanced microarray datasets (see Section 4.1) with the top-100 feature genes extracted by the SNR strategy. To present the superiority of the proposed method, the performance of some typical sampling approaches, namely the original data without sampling (ORI), ROS, RUS, SMOTE, BSO1, BSO2, OSS, ADA-SYN and SBC, is tested synchronously.
Table 4
Performance comparison for various sampling methods on four datasets.

            ORI          ROS          RUS          SMOTE        BSO1         BSO2         OSS          ADA-SYN      SBC          ACOSampling

Colon dataset
Acc         83.23±2.72   84.19±2.48   84.52±3.47   85.48±1.61   83.07±1.94   84.03±2.10   85.65±2.84   85.65±2.10   83.23±2.41   85.63±1.83
F-measure   75.24±4.49   76.78±4.78   79.31±4.07   79.37±2.85   74.99±3.29   76.91±3.08   81.13±2.96   79.76±3.29   78.95±2.84   81.13±2.63
G-mean      80.23±3.76   81.54±4.15   84.17±3.22   83.83±2.54   80.01±2.78   81.68±2.45   85.76±2.29   84.21±2.80   84.25±2.47   85.92±2.41
AUC         87.23±2.32   87.76±3.20   89.16±1.88   89.13±1.69   88.20±1.51   88.61±2.72   91.33±1.89   88.82±1.50   90.19±2.19   94.18±1.56

CNS dataset
Acc         82.83±1.98   83.33±2.36   82.00±1.00   84.33±2.49   84.50±2.48   84.33±2.49   83.17±2.29   84.67±2.21   79.50±1.98   83.83±3.42
F-measure   75.50±2.99   76.08±3.29   77.45±1.44   77.56±3.66   78.13±3.23   78.55±3.75   77.58±2.29   78.44±3.32   76.10±2.37   79.75±3.83
G-mean      80.96±2.44   81.32±2.55   83.21±1.31   82.51±3.09   83.08±2.56   83.79±3.24   83.03±1.69   83.43±2.86   81.94±2.12   85.17±3.23
AUC         92.21±1.94   92.26±0.89   92.91±1.47   93.05±1.09   93.36±1.44   93.22±1.74   92.94±1.24   92.81±1.22   92.43±1.08   93.33±1.47

Lung dataset
Acc         65.38±3.29   64.62±2.99   67.44±3.25   65.90±2.82   65.13±2.86   67.44±2.00   68.21±3.08   65.38±2.63   67.18±2.99   71.79±4.59
F-measure   56.10±5.31   53.30±4.10   60.56±4.66   55.58±5.35   55.40±4.38   59.39±2.30   62.50±5.52   54.79±3.53   60.43±4.14   67.86±4.50
G-mean      63.46±4.24   61.48±3.37   66.89±3.82   63.35±4.20   62.93±3.52   66.16±1.90   68.06±4.18   62.67±2.90   66.74±3.45   72.32±4.19
AUC         67.75±2.52   67.36±2.49   71.22±2.33   67.92±2.94   68.14±2.90   69.78±2.74   73.53±3.56   68.00±3.34   73.22±2.06   77.42±4.16

Glioma dataset
Acc         92.80±1.83   94.00±0.89   92.20±3.03   93.60±1.20   94.00±1.79   93.40±1.28   93.20±1.60   94.20±1.40   93.40±1.80   94.40±1.96
F-measure   87.08±3.54   89.35±1.63   87.54±4.01   88.56±2.16   89.32±3.08   88.80±1.59   88.57±2.39   89.77±2.42   88.87±2.73   90.54±3.00
G-mean      90.94±2.96   92.71±1.53   93.16±1.96   91.97±1.76   92.47±2.03   93.19±0.78   93.26±1.50   93.08±1.69   93.44±1.66   94.32±1.30
AUC         98.71±0.34   98.93±0.18   98.75±0.56   98.87±0.47   99.15±0.27   98.73±0.37   98.75±0.90   98.97±0.36   98.77±0.47   99.13±0.16
Fig. 8. Performance comparison for various sampling methods on four datasets. 1st column: Colon dataset; 2nd column: CNS dataset; 3rd column: Lung dataset;
4th column: Glioma dataset.
The average classification results based on 10 runs of 3-fold cross validation are presented in Table 4 and Fig. 8.

As shown in Table 4 and Fig. 8, almost all sampling methods outperform the method using only the original data (i.e., ORI), indicating that sampling is effective for improving classification performance on imbalanced, high-dimensional and small-sample classification tasks. This improvement is reflected not only in the assessment criteria particularly designed for imbalanced classification, such as F-measure, G-mean and AUC, but also in the overall accuracy Acc.
Compared with those traditional sampling strategies, we are more interested in the performance of the ACOSampling method. From Table 4 and Fig. 8, we observe that ACOSampling acquires the highest F-measure and G-mean on all datasets. For the AUC metric, ACOSampling attains the highest value on the Colon and Lung datasets, and ranks only second to BSO1 on the two other datasets. The results indicate that the proposed ACOSampling strategy is more effective and can extract majority examples carrying more classification information than those typical sampling approaches. At the same time, we find an interesting phenomenon: the proposed method obviously improves classification performance on the Lung dataset, but brings only a slight improvement on the other datasets. This can be explained by the viewpoint of Ref. [31], which partitions class imbalance tasks into two groups, harmful and unharmful, according to whether the classification performance suffers from serious degeneration or not. Undoubtedly, the Lung dataset may be regarded as a clearly harmful class imbalance task, and thus ACOSampling performs better on this dataset. However, we have to admit that ACOSampling is more time-consuming because it runs iteratively to estimate the significance of each majority sample. This could be well explained by the "No Free Lunch Theorems" of Wolpert et al. [55], which demonstrate that there is no optimization method that outperforms all others in all circumstances.

Moreover, Fig. 8 shows that undersampling generally produces better results than oversampling on our low imbalance ratio datasets, which is similar to the finding of Ref. [37].
Table 5
Performance comparison for the ACOSampling method based on different numbers of feature genes on four datasets.

Number of feature genes:
            10           20           50           100          200          500          1000

Colon dataset
Acc         82.74±3.95   83.71±2.10   84.84±2.41   85.63±1.83   84.20±2.68   82.10±3.91   75.32±6.08
F-measure   77.27±4.92   78.74±2.66   79.50±3.54   81.13±2.63   78.61±3.47   76.08±4.15   69.42±6.68
G-mean      82.59±4.24   83.94±2.16   84.35±3.05   85.92±2.41   83.58±2.81   81.43±3.30   75.80±5.72
AUC         90.01±3.30   90.31±2.03   91.78±2.15   94.18±1.56   87.70±1.41   87.35±1.45   83.05±3.52

CNS dataset
Acc         69.33±3.18   78.17±4.44   81.67±4.01   83.83±3.42   79.50±3.42   78.67±2.96   73.83±2.59
F-measure   63.47±3.72   74.36±4.06   77.23±4.18   79.75±3.83   73.17±3.70   70.61±2.53   61.43±2.90
G-mean      70.65±3.33   80.16±3.77   82.91±3.49   85.17±3.23   79.35±2.93   77.08±1.99   69.55±2.31
AUC         78.18±3.93   89.27±3.38   92.47±1.66   93.33±1.47   90.31±1.69   84.58±4.08   76.04±1.58

Lung dataset
Acc         68.21±3.28   71.79±3.80   68.20±2.85   71.79±4.59   74.10±1.80   70.26±4.47   68.20±4.89
F-measure   63.10±2.94   66.62±4.13   65.26±3.23   67.86±4.50   69.72±3.14   65.99±5.42   63.85±6.82
G-mean      68.43±2.91   71.77±3.53   69.29±2.75   72.32±4.19   74.59±2.43   70.93±4.71   68.70±5.69
AUC         75.22±4.63   79.78±2.44   74.89±4.32   77.42±4.16   80.22±3.31   73.56±4.18   72.55±3.74

Glioma dataset
Acc         93.40±2.01   96.80±1.83   96.60±2.69   94.40±1.96   89.60±2.50   88.40±2.15   82.40±1.96
F-measure   89.19±3.06   94.68±2.93   94.22±4.45   90.54±3.00   83.69±3.28   82.80±2.85   75.85±2.41
G-mean      94.26±1.83   97.74±1.30   96.95±3.04   94.32±1.30   90.92±1.69   91.40±1.89   86.60±2.01
AUC         99.20±0.53   99.86±0.25   99.68±0.43   99.13±0.16   98.08±1.17   98.06±1.27   91.03±3.50
Fig. 9. Performance comparison for ACOSampling method based on different number of feature genes on four datasets. 1st column: Colon dataset; 2nd column: CNS
dataset; 3rd column: Lung dataset; 4th column: Glioma dataset.
This implies that using all majority class samples is not necessary when the dataset is only slightly skewed. Among the five oversampling methods, ROS performs worst in most cases, while the other four strategies show comparable performance with each other. Among the undersampling methods other than ACOSampling, OSS performs best on most datasets, while SBC does not reveal enough competitive power compared with RUS. Therefore, in practical applications, if the time complexity is strictly limited, OSS should be a considerable alternative.

We then investigate the relationship between the number of selected feature genes and the classification performance of the ACOSampling method. The number of feature genes is assigned as 10, 20, 50, 100, 200, 500 and 1000, respectively. We conduct 10 runs of 3-fold cross validation for each group, then present the results in Table 5 and Fig. 9.

Though there are some fluctuations, a trend can still be observed from the curves in Fig. 9, i.e., high in the middle and low on both sides. That means that selecting either too few or too many feature genes degenerates classification performance. We believe the reason is that the former causes a deficiency of useful information and the latter adds much noise and redundant information. Table 5 shows that for the Colon and CNS datasets, the best performance is obtained with 100 feature genes, while for the Lung and Glioma datasets, the optimal numbers are 200 and 20, respectively. Therefore, for high-dimensional and small-sample classification tasks, for example, DNA microarray data, it is necessary to extract a few class-related features in advance, which is also verified by Wasikowski et al. [56].

In contrast with the previous work by Yang et al. [43], which is similar to this study, our ACOSampling owns one specific merit: a stronger generalization ability derived from the 100 random partitions. Using these random partitions, we can give a fairer evaluation of the significance of each majority class sample. Ref. [43], by contrast, tries to avoid overfitting by integrating the results from multiple different kinds of classifiers, which would cause more bias in the final classification results than our method. However, we have to admit that the proposed method in this study is more time-consuming than their work. Therefore, we consider the proposed ACOSampling method more suitable for imbalanced classification tasks that simultaneously have the characteristic of a small sample size.

5. Conclusions

In this paper, we present a novel heuristic undersampling method named ACOSampling to address the imbalanced DNA microarray data classification problem. Extensive experiments have demonstrated that the proposed method is effective and can automatically extract the so-called "information samples" from the majority class. However, since its sampling procedure is more time-consuming than those of typical sampling approaches, it will be more efficient on small-sample classification tasks.

Considering the excessive computational and storage cost of the proposed algorithm, we intend to improve its efficiency by modifying its formation rule in future work. We also expect that our ACOSampling can be applied to other real-world data mining applications that suffer from class imbalance. Moreover, considering the ubiquitous multiclass imbalanced classification tasks in practical applications, we will also investigate the possibility of extending the current ACOSampling to multiclass tasks in future work.

Acknowledgements

This work is partially supported by the National Natural Science Foundation of China under Grant No. 60873036, the Ph.D Programs Foundation for young researchers of the Ministry of Education of China under Grant No. 20070217051 and the Ph.D Foundation of Jiangsu University of Science and Technology under Grant No. 35301002.

References

[1] X. Zhou, M.C. Kao, W.H. Wong, From the cover: transitive functional annotation by shortest-path analysis of gene expression data, Proc. Nat. Acad. Sci. U.S.A. 99 (20) (2002) 12783–12788.
[2] D. Husmeier, Sensitivity and specificity of inferring genetic regulatory interactions from microarray experiments with dynamic Bayesian networks, Bioinformatics 19 (17) (2003) 2271–2282.
[3] E. Segal, M. Shapira, A. Regev, et al., Module networks: discovering regulatory modules and their condition specific regulators from gene expression data, Nat. Genet. 34 (2) (2003) 166–176.
[4] W.E. Evans, R.K. Guy, Gene expression as a drug discovery tool, Nat. Genet. 36 (3) (2004) 214–215.
[5] U. Alon, N. Barkai, D.A. Notterman, et al., Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide array, Proc. Nat. Acad. Sci. U.S.A. 96 (12) (1999) 6745–6750.
[6] T.P. Conrads, M. Zhou, E.F. Petricoin, et al., Cancer diagnosis using proteomic patterns, Expert Rev. Mol. Diagn. 3 (4) (2003) 411–420.
[7] T.R. Golub, D.K. Slonim, P. Tamayo, et al., Molecular classification of cancer: class discovery and class prediction by gene expression monitoring, Science 286 (5439) (1999) 531–537.
[8] D.A. Wigle, I. Jurisica, N. Radulovich, et al., Molecular profiling of non-small cell lung cancer and correlation with disease-free survival, Cancer Res. 62 (11) (2002) 3005–3008.
[9] C.L. Nutt, D.R. Mani, R.A. Betensky, et al., Gene expression-based classification of malignant gliomas correlates better with survival than histological classification, Cancer Res. 63 (7) (2003) 1602–1607.
[10] R. Blagus, L. Lusa, Class prediction for high-dimensional class-imbalanced data, BMC Bioinf. 11 (523) (2010).
[11] H. He, E.A. Garcia, Learning from imbalanced data, IEEE Trans. Knowl. Data Eng. 21 (9) (2009) 1263–1284.
[12] N. Japkowicz, Workshop on learning from imbalanced data sets, in: Proceedings of the 17th American Association for Artificial Intelligence, Austin, Texas, USA, 2000.
[13] N.V. Chawla, N. Japkowicz, A. Kolcz, Workshop on learning from imbalanced data sets II, in: Proceedings of the 20th International Conference of Machine Learning, Washington, USA, 2003.
[14] N.V. Chawla, N. Japkowicz, A. Kolcz, Editorial: special issue on learning from imbalanced data sets, ACM SIGKDD Explor. Newsl. 6 (1) (2004) 1–6.
[15] C. Ling, C. Li, Data mining for direct marketing problems and solutions, in: Proceedings of the 4th ACM SIGKDD International Conference of Knowledge Discovery and Data Mining, New York, USA, 1998, pp. 73–79.
[16] N.V. Chawla, K.W. Bowyer, L.O. Hall, et al., SMOTE: synthetic minority over-sampling technique, J. Artif. Intell. Res. 16 (1) (2002) 321–357.
[17] H. Han, W.Y. Wang, B.H. Mao, Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning, in: Proceedings of the 2005 International Conference of Intelligent Computing, Hefei, China, 2005, pp. 878–887.
[18] M. Kubat, S. Matwin, Addressing the curse of imbalanced training sets: one-sided selection, in: Proceedings of the 14th International Conference of Machine Learning, Nashville, Tennessee, USA, 1997, pp. 179–186.
[19] H. He, Y. Bai, E.A. Garcia, et al., ADASYN: adaptive synthetic sampling approach for imbalanced learning, in: Proceedings of the 2008 International Joint Conference of Neural Networks, Hong Kong, China, 2008, pp. 1322–1328.
[20] S.J. Yen, Y.S. Lee, Cluster-based under-sampling approaches for imbalanced data distributions, Expert Syst. Appl. 36 (3) (2009) 5718–5727.
[21] C. Elkan, The foundations of cost-sensitive learning, in: Proceedings of the 17th International Joint Conference of Artificial Intelligence, Seattle, Washington, USA, 2001, pp. 973–978.
[22] B. Zadrozny, J. Langford, N. Abe, Cost-sensitive learning by cost-proportionate example weighting, in: Proceedings of the 3rd International Conference of Data Mining, Melbourne, Florida, USA, 2003, pp. 435–442.
[23] P. Domingos, MetaCost: a general method for making classifiers cost-sensitive, in: Proceedings of the 5th ACM SIGKDD International Conference of Knowledge Discovery and Data Mining, San Diego, CA, USA, 1999, pp. 155–164.
[24] Y. Sun, M.S. Kamel, A.K.C. Wong, et al., Cost-sensitive boosting for classification of imbalanced data, Pattern Recognit. 40 (12) (2007) 3358–3378.
[25] W. Fan, S.J. Stolfo, J. Zhang, et al., AdaCost: misclassification cost-sensitive boosting, in: Proceedings of the 16th International Conference of Machine Learning, Bled, Slovenia, 1999, pp. 97–105.
[26] C. Drummond, R.C. Holte, Exploiting the cost (in)sensitivity of decision tree splitting criteria, in: Proceedings of the 17th International Conference of Machine Learning, Stanford, CA, USA, 2000, pp. 239–246.
[27] Z.H. Zhou, X.Y. Liu, Training cost-sensitive neural networks with methods addressing the class imbalance problem, IEEE Trans. Knowl. Data Eng. 18 (1) (2006) 63–77.
[28] R. Akbani, S. Kwek, N. Japkowicz, Applying support vector machines to imbalanced data sets, in: Proceedings of the 15th European Conference on Machine Learning, Pisa, Italy, 2004, pp. 39–50.
[29] X.Y. Liu, Z.H. Zhou, The influence of class imbalance on cost-sensitive learning: an empirical study, in: Proceedings of the 6th IEEE International Conference on Data Mining, Hong Kong, China, 2006, pp. 970–974.
[30] N.V. Chawla, A. Lazarevic, L.O. Hall, et al., SMOTEBoost: improving prediction of the minority class in boosting, in: Proceedings of the 7th European Conference on Principles of Data Mining and Knowledge Discovery, Cavtat–Dubrovnik, Croatia, 2003, pp. 107–119.
[31] X.Y. Liu, J. Wu, Z.H. Zhou, Exploratory undersampling for class-imbalance learning, IEEE Trans. Syst. Man Cybern. Part B Cybern. 39 (2) (2009) 539–550.
[32] D. Tao, X. Tang, X. Li, et al., Asymmetric bagging and random subspace for support vector machines-based relevance feedback in image retrieval, IEEE Trans. Pattern Anal. Mach. Intell. 28 (7) (2006) 1088–1099.
[33] G.Z. Li, H.H. Meng, W.C. Lu, et al., Asymmetric bagging and feature selection for activities prediction of drug molecules, BMC Bioinf. 9 (S6) (2008) S7.
[34] S. Hido, H. Kashima, Y. Takahashi, Roughly balanced bagging for imbalanced data, Stat. Anal. Data Min. 2 (5-6) (2009) 412–426.
[35] T.M. Khoshgoftaar, J.V. Hulse, A. Napolitano, Comparing boosting and bagging techniques with noisy and imbalanced data, IEEE Trans. Syst. Man Cybern. Part B Cybern. 41 (3) (2011) 552–568.
[36] H.L. Yu, G.C. Gu, H.B. Liu, et al., A modified ant colony optimization algorithm for tumor marker gene selection, Genomics Proteomics Bioinf. 7 (4) (2009) 200–208.
[37] V. García, J.S. Sánchez, R.A. Mollineda, et al., The class imbalance problem in pattern classification and learning, in: II Congreso Español de Informática, 2007, pp. 283–291.
[38] A. Colorni, M. Dorigo, V. Maniezzo, Distributed optimization by ant colonies, in: Proceedings of the 1st European Conference on Artificial Life, Paris, France, 1991, pp. 134–142.
[39] A. Uğur, D. Aydin, An interactive simulation and analysis software for solving TSP using ant colony optimization algorithms, Adv. Eng. Software 40 (5) (2009) 341–349.
[40] X. Zhang, X. Chen, Z. He, An ACO-based algorithm for parameter optimization of support vector machines, Expert Syst. Appl. 37 (9) (2010) 6618–6628.
[41] H. Duan, Y. Yu, X. Zhang, et al., Three-dimension path planning for UCAV using hybrid meta-heuristic ACO-DE algorithm, Simul. Modell. Pract. Theory 18 (8) (2010) 1104–1115.
[42] A. Shmygelska, H.H. Hoos, An ant colony optimization algorithm for the 2D and 3D hydrophobic polar protein folding problem, BMC Bioinf. 6 (30) (2005).
[43] P. Yang, L. Xu, B. Zhou, et al., A particle swarm based hybrid system for imbalanced medical data sampling, BMC Genomics 10 (S3) (2009) S34.
[44] V. Vapnik, Statistical Learning Theory, Wiley Publishers, New York, USA, 1998.
[45] J.A.K. Suykens, J. Vandewalle, Least squares support vector machine classifiers, Neural Process. Lett. 9 (3) (1999) 293–300.
[46] Y.H. Wang, F.S. Makedon, J.C. Ford, et al., HykGene: a hybrid approach for selecting feature genes for phenotype classification using microarray gene expression data, Bioinformatics 21 (8) (2005) 1530–1537.
[47] E. Xing, M. Jordan, R. Karp, Feature selection for high-dimensional genomic microarray data, in: Proceedings of the 18th International Conference of Machine Learning, Williamstown, MA, USA, 2001, pp. 601–608.
[48] I. Inza, P. Larranaga, R. Blanco, Filter versus wrapper gene selection approaches in DNA microarray domains, Artif. Intell. Med. 31 (2) (2004) 91–104.
[49] J.H. Chiang, S.H. Ho, A combination of rough-based feature selection and RBF neural network for classification using gene expression data, IEEE Trans. Nanobiosci. 7 (1) (2008) 91–99.
[50] K. Yang, Z. Cai, J. Li, et al., A stable gene selection in microarray data analysis, BMC Bioinf. 7 (228) (2006).
[51] G.Z. Li, H.H. Meng, J. Ni, Embedded gene selection for imbalanced microarray data analysis, in: Proceedings of the 3rd International Multi-symposiums on Computer and Computational Sciences, Shanghai, China, 2008, pp. 17–24.
[52] A.H.M. Kamal, X.Q. Zhu, R. Narayanan, Gene selection for microarray expression data with imbalanced sample distributions, in: Proceedings of the 2009 International Conference of Bioinformatics, Systems Biology and Intelligent Computing, Shanghai, China, 2009, pp. 3–9.
[53] Q. Shen, Z. Mei, B.X. Ye, Simultaneous genes and training samples selection by modified particle swarm optimization for gene expression data classification, Comput. Biol. Med. 39 (7) (2009) 646–649.
[54] S.L. Pomeroy, P. Tamayo, M. Gaasenbeek, et al., Prediction of central nervous system embryonal tumour outcome based on gene expression, Nature 415 (6870) (2002) 436–442.
[55] D.H. Wolpert, W.G. Macready, No free lunch theorems for optimization, IEEE Trans. Evol. Comput. 1 (1) (1997) 67–82.
[56] M. Wasikowski, X.W. Chen, Combating the small sample class imbalance problem using feature selection, IEEE Trans. Knowl. Data Eng. 22 (10) (2010) 1388–1400.

Hualong Yu received the B.S. degree from Heilongjiang University, Harbin, China, in 2005. He then received the M.S. and Ph.D degrees from the College of Computer Science and Technology, Harbin Engineering University, Harbin, China, in 2008 and 2010, respectively. Since 2010, he has been a lecturer and master supervisor at Jiangsu University of Science and Technology, Zhenjiang, China. He is the author or co-author of over 20 research papers and 3 books, and a program committee member for ICICSE2012. His research interests mainly include pattern recognition, machine learning and bioinformatics.

Jun Ni received the B.S. degree from Harbin Engineering University, Harbin, China, the M.S. degree from Shanghai Jiaotong University, Shanghai, China, and the Ph.D degree from the University of Iowa, IA, USA. He is currently an associate professor and director of the Medical Imaging HPC and Informatics Lab, Department of Radiology, Carver College of Medicine, the University of Iowa, Iowa City, IA, USA. He has also been a visiting professor at Harbin Engineering University and Shanghai University, China, since 2006 and 2009, respectively. He edited or co-edited 34 books or proceedings and authored or co-authored 115 peer-reviewed journal and conference papers. In addition, he is editor-in-chief of the International Journal of Computational Medicine and Healthcare, associate editor of the IEEE Systems Journal and an editorial board member for 15 other professional journals. Since 2003, he has also served as General/Program Chair for over 50 international conferences. Currently, his research interests include distributed computation, parallel computing, medical imaging informatics, computational biology and bioinformatics.

Jing Zhao received the Ph.D degree from Harbin Institute of Technology, Harbin, China, in 2005. She is currently a professor and Ph.D supervisor in the College of Computer Science and Technology, Harbin Engineering University, Harbin, China, and a senior visiting scholar at Duke University, USA. She is the author or co-author of over 20 research papers. Her research interests include software reliability, mobile computing and image processing.