
Applied Soft Computing 36 (2015) 334–348


A binary ABC algorithm based on advanced similarity scheme for feature selection

Emrah Hancer a,b,∗, Bing Xue b,∗∗, Dervis Karaboga a, Mengjie Zhang b

a Computer Engineering, Erciyes University, Kayseri 38039, Turkey
b Evolutionary Computation Research Group, School of Engineering and Computer Science, Victoria University of Wellington, Wellington 6140, New Zealand

∗ Corresponding author at: Computer Engineering, Erciyes University, Kayseri 38039, Turkey. Tel.: +90 3522076666/32583.
∗∗ Corresponding author at: Evolutionary Computation Research Group, School of Engineering and Computer Science, Victoria University of Wellington, Wellington 6140, New Zealand. Tel.: +64 4 463 5542.
E-mail addresses: [email protected] (E. Hancer), [email protected] (B. Xue).

ARTICLE INFO

Article history:
Received 6 January 2015
Received in revised form 15 June 2015
Accepted 23 July 2015
Available online 31 July 2015

Keywords:
Feature selection
Artificial bee colony
Particle swarm optimization
Classification

ABSTRACT

Feature selection is the basic pre-processing task of eliminating irrelevant or redundant features through investigating complicated interactions among features in a feature set. Due to its critical role in classification and computational time, it has attracted researchers' attention for the last five decades. However, it still remains a challenge. This paper proposes a binary artificial bee colony (ABC) algorithm for the feature selection problems, which is developed by integrating evolutionary based similarity search mechanisms into an existing binary ABC variant. The performance analysis of the proposed algorithm is demonstrated by comparing it with some well-known variants of the particle swarm optimization (PSO) and ABC algorithms, including standard binary PSO, new velocity based binary PSO, quantum inspired binary PSO, discrete ABC, modification rate based ABC, angle modulated ABC, and genetic algorithms on 10 benchmark datasets. The results show that the proposed algorithm can obtain higher classification performance in both training and test sets, and can eliminate irrelevant and redundant features more effectively than the other approaches. Note that all the algorithms used in this paper except for standard binary PSO and GA are employed for the first time in feature selection.

© 2015 Elsevier B.V. All rights reserved.

1. Introduction

Thanks to the rapid development in computer hardware and software, a huge amount of information can be collected and included in datasets through a large number of features (attributes). However, not all features are relevant to the target concept. In other words, datasets may include irrelevant and redundant features besides relevant ones. Unfortunately, these features may adversely affect the classification performance due to the large search space, known as "the curse of dimensionality" [1,2]. Furthermore, more features may introduce more noise into the dataset, which can also be detrimental to the classification performance. Thus, it is important to select an appropriate feature subset from the available features to achieve similar or even better classification performance than using all features [3]. This task is terminologically known as "feature selection". It not only achieves better classification accuracy, but also improves efficiency, reduces data complexity and simplifies the structure of the learnt classifiers [2].

Feature selection is one of the most difficult tasks in data mining and classification due to feature interaction and the large search space [4,5]. Feature interaction may appear as two-way, three-way or may involve even more features. For instance, a feature by itself may not have a noticeable effect on the target, but its effect can be increased when it is used together with other features. Also, a feature which is individually relevant may become redundant when interconnected with others. The other challenging issue is the large search space, 2^n, where n is the total number of features. In other words, it is not possible to thoroughly search all possible solutions in most cases. Although a variety of search methods such as sequential forward and backward feature selection (SFS, SBS) [6,7] have been proposed, they may converge to local minima or cost high computational time.

To address these problems, evolutionary computation (EC) techniques have been used as a strong alternative to the classical search


methods due to their global search potential. Particle swarm optimization (PSO) [8,9], genetic algorithms (GAs) [10,11], genetic programming (GP) [12,13], and ant colony optimization (ACO) [14,15] have been widely applied to feature selection. In this study, artificial bee colony (ABC) [16], which is based on the foraging behaviours of honey bees, is chosen as the main motivation to address feature selection problems on account of the following advantages over other well-known EC techniques [17]: (1) it can converge more quickly to the target, (2) it is computationally less expensive, and (3) it is one of the most recent EC techniques. The idea of applying ABC to feature selection is not a novel subject, i.e., there exist some studies concerning ABC based feature selection [18–20]. However, the existing studies have unfortunately not presented a comprehensive experimental study, including comparisons with recent EC variants on a variety of datasets or a thorough performance evaluation and analysis. Therefore, the potential of ABC for feature selection has not been fully demonstrated, and there is still a need for further ABC based studies.

1.1. Goals

The overall goal of this paper is to propose an improved binary version of the artificial bee colony (ABC) algorithm to address feature selection problems. To achieve this goal, the discrete binary ABC (DisABC) algorithm [21], which is based on the Jaccard coefficient similarity among individuals, is further improved by introducing the neighborhood selection mechanism of the differential evolution (DE) strategy. In other words, the similarity based search approach is re-designed according to the DE mutation, recombination and selection strategies. The other goal is to put forward a comprehensive comparative study of some variants of the ABC, PSO and GA algorithms on wrapper feature selection in terms of the classification performance and the feature subset size for future studies. To achieve the second goal, seven algorithms, namely binary PSO (BPSO) [22], new velocity based binary PSO (NBPSO) [23], quantum inspired binary PSO (QBPSO) [24], discrete ABC (DisABC) [21], angle modulated ABC (AMABC) [25], modification rate based ABC (MRABC) [26] and genetic algorithms (GA) [27], are employed, and 10 benchmark datasets with various numbers of classes, instances and features are chosen from the UCI machine learning repository [28]. Further, two recently published ACO studies are considered to evaluate the performance of the proposed ABC variant. To our knowledge, the employed algorithms except for BPSO and GA are used for feature selection for the first time, and such a comprehensive comparative analysis on feature selection is not common in the literature. Specifically, the following points are investigated:

• whether integrating a differential evolution search mechanism into the DisABC algorithm improves its global search ability in feature selection tasks,
• whether the proposed algorithm is able to perform well in both training and test sets in terms of the classification rate when compared with the seven existing algorithms,
• whether the proposed algorithm can more effectively remove redundant or irrelevant features and can obtain better feature subsets than the seven existing algorithms, and
• whether the proposed algorithm performs better than conventional deterministic feature selection approaches.

1.2. The organization of the paper

The rest of the paper is organized as follows. Section 2 gives an outline of the basic ABC algorithm and provides a background on recent studies related to feature selection. Section 3 presents the proposed algorithm and Section 4 describes the experimental design. Section 5 presents the experimental results and discussions. Section 6 concludes the study and provides an insight into future trends.

2. Background

In this section, background on the artificial bee colony algorithm and recent trends in feature selection are presented.

2.1. Artificial bee colony

Artificial bee colony (ABC), which mimics the foraging behaviours of a honey bee colony, was proposed by Karaboga in 2005 [29]. From the perspective of an optimization problem, the food sources and their nectar amounts represent probable solutions and their corresponding fitness values, respectively. The ABC for a minimization problem can be explained as follows. Employed bees exploit the food sources they have explored before and share information concerning the quality and position of food sources with onlooker bees via the waggle dance. Onlooker bees waiting in the hive decide which food source to exploit with the help of the information gained from employed bees. Scout bees are responsible for searching for a new food source depending on an internal rule or possible external clues [30,31]. The basic implementation of ABC comprises four phases:

(1) Initialization phase: Accepting the search space as the environment of food sources available for the exploration and exploitation processes, the algorithm first randomly produces food sources. Each food source, defined as Xi = {x_{i1}, x_{i2}, x_{i3}, ..., x_{ij}, ..., x_{iD}}, is generated by:

x_{ij} = x_j^{min} + U(0, 1)(x_j^{max} − x_j^{min})    (1)

where i = {1, 2, ..., SN} and SN is the number of food sources; j = {1, 2, ..., D}; U(0, 1) is a random variable uniformly distributed between (0, 1); D is the dimensionality of the search space; and x_j^{min} and x_j^{max} are the predefined minimum and maximum values of parameter j.

(2) Employed bee phase: A one-to-one and onto relation is established between employed bees and food sources, i.e., each employed bee is associated with only one food source. An employed bee modifies the position of its food source to find a new, richer food source:

v_{ij} = x_{ij} + φ_{ij}(x_{ij} − x_{kj})    (2)

where i represents the index of the current food source (Xi); k represents the index of the neighbor food source (Xk), which is randomly chosen among all sources except for i; j is the randomly selected parameter for modification; Vi is the generated food source determined by modifying one parameter of Xi; and φ_{ij} is a random number uniformly distributed within [−1, 1]. After Vi is generated, its fitness value is evaluated. If the fitness value of Vi is better than the fitness value of Xi, the employed bee memorizes the new food source position and leaves the old one, and its counter holding the number of trials is reset to 0. Otherwise, the current food source is kept in memory and its counter holding the number of trials is increased by 1.

(3) Onlooker bee phase: After getting the information concerning the nectar amount (fitness value) and positions of food sources from the employed bees via the waggle dance, each onlooker bee selects a food source depending on a probability derived from the fitness values through a roulette-wheel scheme, where richer food sources have

a higher probability than others. The selection scheme based on fitness values is given by:

p_i = fitness_i / Σ_{i=1}^{SN} fitness_i    (3)

where fitness_i is the fitness value of source Xi. After calculation of the probability value p_i, a random number rand(0, 1) in the range of 0 and 1 is generated for each food source i. If p_i > rand(0, 1), Xi is chosen and the searching-exploiting process on Xi is carried out as in the employed bee phase.

(4) Scout bee phase: It is known that, after being exploited repeatedly, food sources should be left by bees to avoid a waste of energy. In basic ABC, a food source is assumed to be abandoned when its counter holding the number of trials exceeds a predefined value, known as the "limit" parameter. Then, a new food source is generated by Eq. (1) to replace the abandoned one.
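As an illustration of the four phases above, the following minimal Python sketch implements a basic continuous ABC for a minimization problem using Eqs. (1)–(3). It is only a sketch: the sphere objective, the parameter names and the simplified single-pass onlooker selection are assumptions of this example, not details taken from the paper.

```python
import random

def abc_minimize(f, dim, lo, hi, sn=20, limit=50, mcn=200):
    """Minimal ABC sketch: initialization, employed, onlooker and scout phases."""
    foods = [[lo + random.random() * (hi - lo) for _ in range(dim)] for _ in range(sn)]  # Eq. (1)
    costs = [f(x) for x in foods]
    trials = [0] * sn
    best = min(foods, key=f)

    def try_neighbour(i):
        # Eq. (2): perturb one randomly chosen parameter using a random neighbour k != i
        k = random.choice([s for s in range(sn) if s != i])
        j = random.randrange(dim)
        v = foods[i][:]
        v[j] = foods[i][j] + random.uniform(-1, 1) * (foods[i][j] - foods[k][j])
        v[j] = min(max(v[j], lo), hi)
        if f(v) < costs[i]:                        # greedy replacement, trial counter reset
            foods[i], costs[i], trials[i] = v, f(v), 0
        else:
            trials[i] += 1

    for _ in range(mcn):
        for i in range(sn):                        # employed bee phase
            try_neighbour(i)
        fits = [1.0 / (1.0 + c) for c in costs]    # fitness of a minimization problem
        total = sum(fits)
        for i in range(sn):                        # onlooker bee phase, probability as in Eq. (3)
            if random.random() < fits[i] / total:
                try_neighbour(i)
        worn = max(range(sn), key=lambda i: trials[i])
        if trials[worn] > limit:                   # scout bee phase: re-initialize by Eq. (1)
            foods[worn] = [lo + random.random() * (hi - lo) for _ in range(dim)]
            costs[worn], trials[worn] = f(foods[worn]), 0
        best = min(best, min(foods, key=f), key=f)
    return best

if __name__ == "__main__":
    sphere = lambda x: sum(v * v for v in x)
    print(abc_minimize(sphere, dim=5, lo=-5.0, hi=5.0))
```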

2.2. Jaccard similarity coefficient

Binary similarity/dissimilarity measures play a critical role in many applications, such as classification, clustering and image retrieval [32]. Over the past century, there has been a significant effort on consistently measuring the similarity among binary vectors, resulting in numerous similarity and dissimilarity measures in various fields. One of the most well-known similarity measures is the Jaccard coefficient [33], defined by:

Similarity(Xi, Xk) = M_{11} / (M_{11} + M_{10} + M_{01})    (4)

where Xi and Xk are D dimensional binary vectors and the dth bit of Xi is represented by x_{id} (x_{id} ∈ {0, 1}); M_{11} is the number of bits where x_{id} = x_{kd} = 1; M_{10} is the number of bits where x_{id} = 1 and x_{kd} = 0; and M_{01} is the number of bits where x_{id} = 0 and x_{kd} = 1.

The dissimilarity measure between Xi and Xk is defined by:

Dissimilarity(Xi, Xk) = 1 − Similarity(Xi, Xk) = 1 − M_{11} / (M_{11} + M_{10} + M_{01})    (5)
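Eqs. (4) and (5) translate directly into a few lines of Python; only the function names below are our own.

```python
def jaccard_counts(x, k):
    """Count the M11, M10 and M01 bits of two equal-length binary vectors."""
    m11 = sum(1 for a, b in zip(x, k) if a == 1 and b == 1)
    m10 = sum(1 for a, b in zip(x, k) if a == 1 and b == 0)
    m01 = sum(1 for a, b in zip(x, k) if a == 0 and b == 1)
    return m11, m10, m01

def similarity(x, k):
    m11, m10, m01 = jaccard_counts(x, k)           # Eq. (4)
    return m11 / (m11 + m10 + m01) if (m11 + m10 + m01) else 1.0

def dissimilarity(x, k):
    return 1.0 - similarity(x, k)                  # Eq. (5)

# e.g. dissimilarity([1, 0, 0, 1], [1, 1, 0, 0]) == 1 - 1/3
```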
2.3. Discrete binary ABC

Kashan et al. [21] introduced a discrete ABC (DisABC) based on the concept of dissimilarity between binary vectors. Specifically, the subtraction operator '−' in Eq. (2), which measures the magnitude of the differences between two sources (Xi and Xk) to generate a new neighbor source (Vi), is first rewritten in the form of Eq. (6). Eq. (6) is then turned into Eq. (7) using the Jaccard coefficient based similarity/dissimilarity between vector pairs (Eq. (5)).

v_{ij} − x_{ij} = φ_{ij}(x_{ij} − x_{kj})    (6)

Dissimilarity(Vi, Xi) ≈ φ × Dissimilarity(Xi, Xk)    (7)

where Xi, Xk and Vi are binary sources, and φ is a positive random scaling factor defined by Eq. (8):

φ = φ_{max} − ((φ_{max} − φ_{min}) / MCN) × iter    (8)

where φ_{max} and φ_{min} are the upper and lower levels of φ, MCN is the maximum number of cycles, and iter is the current cycle.

Eq. (7) states that the dissimilarity between Vi and Xi, Dissimilarity(Vi, Xi), should be as close as possible to φ × Dissimilarity(Xi, Xk). Accordingly, the number of bits with value 1 in both Vi and Xi (M_{11}), the number of bits with value 1 in Vi and value 0 in Xi (M_{10}), and the number of bits with value 0 in Vi and value 1 in Xi (M_{01}) are determined using integer programming through the following model:

min |1 − M_{11} / (M_{11} + M_{10} + M_{01}) − φ × Dissimilarity(Xi, Xk)|    (9a)

M_{11} + M_{01} = m_1    (9b)

M_{10} ≤ m_0    (9c)

M_{11}, M_{10}, M_{01} ≥ 0 and integer    (9d)

where m_1 is the total number of ones and m_0 is the total number of zeros in Xi. Eqs. (9b)–(9d) define a feasible set of (M_{11}, M_{10}, M_{01}) combinations whose size is (m_1 + 1)(m_0 + 1). After the determination of the M values between Vi and Xi, the generation of the source Vi is carried out. Vi is first defined as a 1 × D zero vector and then one of the following selection mechanisms is applied in a probabilistic manner:

1. Random selection. Choose M_{11} zero bits of Vi whose corresponding values are equal to 1 in Xi and change their values in Vi from 0 to 1. Then, choose M_{10} zero bits of Vi whose corresponding values are equal to 0 in Xi and change their values in Vi from 0 to 1.
2. Greedy selection. Choose M_{11} zero bits of Vi whose corresponding values are equal to 1 in both Xi and the global best source in the population (GbestParams), and change their values in Vi from 0 to 1. In some cases, it is not possible to modify M_{11} bits due to the interaction between Xi and GbestParams. If the number of changed bits (index_0) is less than M_{11}, choose (M_{11} − index_0) zero bits whose corresponding values are equal to 1 in Xi and change their values in Vi from 0 to 1. After that, choose M_{10} zero bits whose corresponding values are equal to 0 and 1 in Xi and GbestParams, respectively, and change their values in Vi from 0 to 1. If the number of changed bits (index_1) is less than M_{10}, choose (M_{10} − index_1) zero bits whose corresponding values are equal to 0 in Xi and change their values in Vi from 0 to 1.
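To make the integer program of Eqs. (9a)–(9d) and the selection step concrete, the following Python sketch enumerates the (m_1 + 1)(m_0 + 1) feasible triples and builds Vi by the random-selection rule. Brute-force enumeration, the function names and the example vectors are choices made for this sketch, not taken from [21].

```python
import random

def dissimilarity(x, k):
    m11 = sum(a & b for a, b in zip(x, k))
    m10 = sum(a & (1 - b) for a, b in zip(x, k))
    m01 = sum((1 - a) & b for a, b in zip(x, k))
    return 1.0 - (m11 / (m11 + m10 + m01) if (m11 + m10 + m01) else 1.0)

def best_m_values(x_i, target):
    """Solve Eqs. (9a)-(9d) by enumerating all feasible (M11, M10, M01) triples."""
    m1, m0 = sum(x_i), len(x_i) - sum(x_i)
    best, best_gap = None, float("inf")
    for m11 in range(m1 + 1):            # M01 = m1 - M11 follows from Eq. (9b)
        m01 = m1 - m11
        for m10 in range(m0 + 1):        # Eq. (9c)
            total = m11 + m10 + m01
            dis_vi_xi = 1.0 - (m11 / total if total else 1.0)
            gap = abs(dis_vi_xi - target)
            if gap < best_gap:
                best, best_gap = (m11, m10, m01), gap
    return best

def random_selection(x_i, m11, m10):
    """Random selection: start from a zero vector, switch on M11 bits that are 1 in Xi
    and M10 bits that are 0 in Xi."""
    v = [0] * len(x_i)
    ones = [d for d, b in enumerate(x_i) if b == 1]
    zeros = [d for d, b in enumerate(x_i) if b == 0]
    for d in random.sample(ones, m11):
        v[d] = 1
    for d in random.sample(zeros, m10):
        v[d] = 1
    return v

# One DisABC-style move for an illustrative Xi, neighbour Xk and phi = 0.7
x_i = [1, 0, 1, 1, 0, 0, 1, 0]
x_k = [0, 1, 1, 0, 0, 1, 1, 0]
target = 0.7 * dissimilarity(x_i, x_k)
m11, m10, m01 = best_m_values(x_i, target)
print(random_selection(x_i, m11, m10))
```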

2.4. Feature selection

How to determine whether a feature is "relevant" is a difficult problem due to the complicated (two-way, three-way or multi-way) interactions between features. A feature may become relevant or irrelevant when used together with other features; thus, an optimal feature subset should comprise complementary features which provide diverse properties of the classes [34]. Meanwhile, a large number of available features leads to a complex search space, i.e., it is impossible to exhaustively search the whole space in most cases. Although various algorithms have been proposed to address feature selection problems, it still remains a challenge. The factors affecting the performance of a feature selection algorithm are as follows [35]:

1. Initialization: The starting feature subset in the space should be determined first, which directly influences the search direction and operators. For instance, one may start with an empty set and then iteratively add features to this subset (known as forward selection), or one may start with all features and then remove them iteratively (known as backward selection). A feature set can also be initialized according to some predefined rules [36].
2. Search strategy: An exhaustive search is not practically possible due to the large search space in most cases. Therefore, a more realistic approach can be applied. For instance, searching can be made by both forward and backward strategies (known as floating search) or by more heuristic global search techniques. In this respect, evolutionary computation based algorithms, including particle swarm optimization (PSO) [37], genetic algorithms (GAs) [27], and ant colony optimization (ACO) [38], get attention due to their search abilities. In recent years, researchers have also started to work on artificial bee colony (ABC) to solve feature selection problems [18,19].
3. Evaluation criterion: An evaluation criterion is expected to measure the quality of a feature subset accurately and inexpensively. Fundamentally, all evaluation criteria are based on either the classification performance or the characteristics of the data itself [39,40].
4. Stopping criterion: Whereas some algorithms complete their processes without any restriction or rules, some of them need to confirm that the reached feature subset is a good one [34]. In addition, the size of the obtained feature subset can also be used as a stopping criterion.

Given a dataset Z comprising N patterns/examples/instances and a feature set S, the following issues need to be considered when selecting a feature subset Sk via a feature selection algorithm [35]:

1. The size of Sk should be smaller than the current S, s.t. Sk ⊂ S.
2. The dataset Z restricted to the feature subset Sk should achieve the best classification performance.
3. While providing the best classification performance, the size of Sk should be as small as possible.
4. The most appropriate evaluation criterion for the dataset should be selected.

2.4.1. Existing feature selection methods

Feature selection methods can be categorized into two main groups: filter methods and wrapper methods [41]. Filter methods eliminate irrelevant or noisy features from the dataset without applying any classification algorithm. In contrast to filter methods, a wrapper method employs a learning (classification) algorithm to evaluate the goodness of the selected features [42]. Filter methods are argued to be more general than wrapper methods, but wrapper methods are more successful than filter methods in terms of obtaining better classification performance.

(1) Deterministic approaches: One of the simplest filter methods, Relief [43], selects the k features most highly correlated with the target class. However, Relief does not deal with redundant features, since it tries to get all relevant features without considering their redundancy. Another basic filter method, FOCUS [44], searches for the smallest possible feature subset. It starts with a single feature and then adds other features to the subset until it finds the subset that splits the training data according to their outputs. However, it is computationally intensive.

Two well-known wrapper methods, sequential forward selection (SFS) [6] and sequential backward selection (SBS) [7], use forward and backward strategies in finding candidate feature subsets. SFS starts with an empty feature subset and then adds features to the subset until a further addition cannot improve the classification performance. On the other hand, SBS starts with a subset including all features and then removes features from this subset until a further removal cannot improve the classification performance. However, the features added or removed cannot be handled or modified later. It can be inferred that these methods are similar to the mechanisms of agglomerative and divisive hierarchical methods in terms of their non-dynamic structures. Thus, these methods may converge to local minima. To minimize the drawbacks of SFS and SBS, floating forward and backward search mechanisms (SFFS and SFBS) [45] were developed. SFFS (SFBS) performs a forward (backward) step after each backward (forward) step. They perform better than SFS and SBS, but they may also converge to local minima.

(2) EC (non-ABC) based approaches: To overcome the drawbacks of deterministic feature selection methods, researchers have concentrated on solving the feature selection problem using evolutionary and swarm intelligence based algorithms, including GAs, GP, ACO and PSO. Yang and Honavar [10] proposed an objective function based on the feature measurement cost and the classification accuracy to improve the classification performance. Raymer et al. [11] proposed a feature selection method using GA, where feature selection and extraction were achieved simultaneously. The effectiveness of the proposed algorithm was tested by comparing it with the SFFS [45] and linear discriminant analysis methods. However, evaluating the feature subsets obtained by GA through tuning sets is not a widely preferred approach. Zhu et al. [46] introduced a wrapper-filter feature selection algorithm (WFFSA) based on a combined version of local search and GA. In the local search mechanism of WFFSA, two operators were defined: (1) select a feature from the activated feature set (the positions of which are 1 in the chromosome) using linear ranking selection and move it to the inactivated features (the positions of which are 0 in the chromosome), and (2) select a feature from the inactivated feature set using linear ranking selection and move it to the activated feature set. Muni et al. [12] introduced a multi-tree GP based feature selection method, in which each classifier has c trees comprising feature subsets for a problem with c classes. Ahmed et al. [47] used GP to combine the top-ranked features obtained by two feature ranking techniques, information gain and Relief, to address feature selection problems in mass spectrometry. Unler and Murat [8] addressed feature selection through an adaptive discrete PSO, where the feature subset was selected by the relevance and predictive contribution of each feature. The superiority of the method was demonstrated by comparing it with tabu search, SFS and SBS. Liu et al. [9] proposed the improved feature selection (IFS) method, the aim of which was to reach higher generalization capability. IFS was built on multi swarm PSO (MSPSO), SVM and an improved fitness function based on the F-score. The performance analysis of IFS was conducted by comparing it with PSO and GA. However, MSPSO was computationally more intensive than PSO and GA due to its complex structure and large population size. Xue et al. [3] investigated feature selection using PSO with a two-stage fitness function comprising the error rate and the number of features. This approach improved the feature subset when compared to PSO using only the error rate as a fitness function. Xue et al. [3] also considered the feature selection problem as a multi-objective problem (i.e. maximizing the classification performance and minimizing the number of features simultaneously). In another study, Xue et al. [36] integrated three new initialisation strategies and updating mechanisms into the PSO algorithm, motivated by forward selection, backward selection and a combination of them. A detailed survey of PSO on feature selection can be found in [48,49]. ACO has also been used to solve FS problems, where nodes represent features and the edges between nodes define the choice of the next feature in graphs [50]. Ming [51] proposed an ACO and rough set theory based feature selection method, which starts with the features in the core of a rough set and uses forward selection. Ding [52] combined ACO and SVM for feature selection problems. In this model, grid based ACO was implemented to optimize the parameters of SVM, and feature subset selection was performed through F-statistics. Nemati et al. [14] introduced a parallel combined version of ACO and GA for feature selection in protein function prediction. Sarac and Ozel [53] proposed an ACO based feature selection approach for web page classification, in which feature extraction is applied before feature selection to group similar HTML tags together, i.e., to reduce the feature space. More information on ACO based feature selection can be found in [50,54].

(3) ABC based approaches: In recent years, researchers have been trying to develop feature selection approaches using the ABC algorithm after observing numerous efficient ABC applications in various fields [55,56,26,57]. Uzer et al. [19] used a combination of ABC and SVM in medical dataset classification. Although the performance analysis of the ABC based feature selection approach was conducted by comparing it with the results obtained from existing studies, the feature subset size and the standard deviations of the obtained accuracy values were not considered. Subanya and Rajalaxmi [58] applied a combination of the proposed ABC and Naive Bayes to the Cleveland Heart disease dataset. However, no comparative study was reported to show the effectiveness of the proposed approach. Shunmugapriya et al. [59] improved the searching mechanism of ABC with abandoned food sources for feature selection. As in [58], no comprehensive comparative study was presented to show the effectiveness of the proposed method. Schiezaro and Pedrini [18] presented a study on feature selection comparing the proposed ABC algorithm with standard PSO, ABC and GA, where a simple modification rate (MR) perturbation is used to select a feature subset. The obtained results indicated that the ABC algorithm was superior to the others. However, the number of evaluations was not set to the same value for all algorithms. In [60], a hybrid version of rough set-based attribute reduction (RSAR) and ABC algorithms was applied to feature selection, and the effectiveness of the proposed approach was shown by comparing it with six well-known approaches, including RSAR, PSO and GAs. Although the proposed and some of the employed algorithms reached the same feature subset size for 3 out of 5 datasets, the obtained features were not clearly evaluated to reflect how the proposed algorithm outperformed the others in terms of accuracy. Consequently, the need and demand for studies based on ABC and other EC techniques is still an open issue.

3. The proposed approach

Discrete binary ABC (DisABC) [21] is one of the first binary variants of the ABC algorithm based on the similarity between binary vectors measured by the Jaccard coefficient. It has gained popularity among researchers and has been used in comparative studies of binary problems [61] on account of its simplicity, novelty, and performance. This leads us to choose DisABC as the main motivation for the feature selection problems in this study. Furthermore, the drawbacks, including the complicated (NP-hard) structure of feature selection and the weakness of DisABC in high-dimensional problems, motivate us to improve the search ability of the DisABC algorithm for feature selection problems.

How can we improve the search ability of the DisABC algorithm? For two decades, population-based metaheuristics composed of an evolutionary framework and a set of local algorithms activated within the generation cycle of the external framework [62] have attracted attention, since they are inspired by the transmission of ideas and the combination of multiple operators. In this way, it is expected to achieve good performance in problem solving. Nowadays, it is not difficult to find numerous successful metaheuristics and evolutionary algorithms comprising some form of lifetime learning [63]. The most common way to use a learning scheme in an EC based algorithm is hybridization (or modification), which refers to combining two or more different methods effectively. In other words, one of the simplest ways to increase the search ability of an algorithm is to modify some parts of its mechanism via internal or external forces, or to mix at least two heterogeneous individuals by conscious manipulation or as a natural progressive manipulation [63].

Keeping the above mentioned remarks in mind, we propose a modified DisABC algorithm (MDisABC) using the ideas of mutation, recombination and selection in evolutionary computation, such as DE [45,64]. The major components of MDisABC are given as follows. Firstly, instead of using one neighbor solution, three neighbors are used to create a mutant solution in the search space. In this way, bees can share information obtained from neighbors more effectively in order to improve the search ability of the whole algorithm. In standard DE, a mutant solution (ω_i = {ω_{i1}, ω_{i2}, ..., ω_{id}, ..., ω_{iD}}) of a current solution Xi is generated by Eq. (10). However, this cannot be directly used in binary problem solving. To express Eq. (10) in the dissimilarity form that is suitable for the binary space, Eq. (10) first needs to be rewritten in the form of Eq. (11). Then, the subtraction operator '−' in Eq. (11), measuring the magnitude of the differences between binary vectors, is reformulated into Eq. (12) using the Jaccard coefficient dissimilarity between vector pairs (Eq. (5)).

ω_i = X_{r1} + φ(X_{r2} − X_{r3})    (10)

ω_i − X_{r1} = φ(X_{r2} − X_{r3})    (11)

Dissimilarity(ω_i, X_{r1}) ≈ φ × Dissimilarity(X_{r2}, X_{r3})    (12)

where r1, r2 and r3 are the indices of randomly selected neighbors and φ (also known as F) is the positive scaling factor.

According to Eq. (12), Dissimilarity(ω_i, X_{r1}) should be as close as possible to φ × Dissimilarity(X_{r2}, X_{r3}) (see Eq. (13)). Eq. (13) is then reformulated into Eq. (14a) using mathematical programming as in Section 2.3, and the M values between ω_i and X_{r1} that are required to generate ω_i are found by solving Eq. (14a).

min |Dissimilarity(ω_i, X_{r1}) − φ × Dissimilarity(X_{r2}, X_{r3})|    (13)

min |1 − M_{11} / (M_{11} + M_{10} + M_{01}) − φ × Dissimilarity(X_{r2}, X_{r3})|    (14a)

M_{11} + M_{01} = m_1    (14b)

M_{10} ≤ m_0    (14c)

M_{11}, M_{10}, M_{01} ≥ 0 and integer    (14d)

where m_1 is the number of ones in X_{r1} and m_0 is the number of zeros in X_{r1}.

The second important component is the recombination operator between the current solution Xi and the mutant solution ω_i, shown by Eq. (15). In this way, the exchange of information between the mutant and current solutions is fully provided to find better solutions, which is known as multiple interaction [29].

u_{id} = ω_{id} if rand(d) ≤ CR; otherwise u_{id} = x_{id}    (15)

Table 1
Datasets.

Dataset Number of features Number of classes Number of examples

Wine 13 3 178
Vehicle 18 4 846
German 24 2 1000
WBCD 30 2 569
Ionosphere 34 2 351
Lung 56 3 32
Hill-Valley 100 2 606
Musk 1 166 2 476
Madelon 500 2 2600
Isolet5 617 26 1559

where CR is the crossover rate and ω_{id} represents the dth dimension of ω_i. A new solution is then formed as Ui = {u_{i1}, u_{i2}, ..., u_{id}, ..., u_{iD}}, where u_{id} represents the dth dimension of Ui.

3.1. Major steps to generating a new solution

To further explain the process of generating a new solution in MDisABC, we summarise the major steps as follows:

• Step 1: randomly select three neighbors (i.e. food sources), X_{r1}, X_{r2} and X_{r3}, for the current food source Xi;
• Step 2: calculate φ × Dissimilarity(X_{r2}, X_{r3}) by Eq. (5);
• Step 3: solve Eq. (14a) to get the M values between ω_i and X_{r1};
• Step 4: using the obtained M values, apply random selection or greedy selection in a probabilistic manner to generate ω_i;
• Step 5: apply recombination between Xi and ω_i by Eq. (15) to generate a new solution Ui;
• Step 6: choose the better solution between Xi and Ui.

An example: We also include the following example, which illustrates the first four steps in detail to show how to generate a mutant solution ω_i. The fifth and sixth steps are not included in this example since they are easy to understand.

• Step 1: randomly pick three neighbors, say X_{r1} = {1, 0, 0, 0, 0, 1, 1, 0, 1, 0}, X_{r2} = {0, 0, 1, 1, 0, 0, 0, 1, 1, 1} and X_{r3} = {0, 1, 1, 0, 0, 0, 1, 0, 1, 0}, and assume that φ is assigned as 0.8.
• Step 2: calculate φ × Dissimilarity(X_{r2}, X_{r3}) based on the M values between X_{r2} and X_{r3}, which are M_{11} = 2, M_{10} = 3 and M_{01} = 2, using Eq. (5):

φ × Dissimilarity(X_{r2}, X_{r3}) = 0.8 × (1 − 2 / (2 + 3 + 2)) = 0.5714    (16)

• Step 3: get the optimal M values between ω_i and X_{r1} as M_{11} = 3, M_{10} = 3 and M_{01} = 1 by solving Eq. (14a):

min |1 − M_{11} / (M_{11} + M_{10} + M_{01}) − 0.5714|    (17a)

M_{11} + M_{01} = 4    (17b)

M_{10} ≤ 6    (17c)

M_{11}, M_{10}, M_{01} ≥ 0    (17d)

where m_1 = 4 (the number of ones in X_{r1}) and m_0 = 6 (the number of zeros in X_{r1}).
• Step 4: set ω_i = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0}. Select M_{11} = 3 bits from S1_{xr1} = {1, 6, 7, 9}, where S1_{xr1} is the set of positions whose value is 1 in X_{r1}. Assume that the sixth, seventh and ninth elements are randomly selected; therefore, ω_i = {0, 0, 0, 0, 0, 1, 1, 0, 1, 0}. After that, select M_{10} = 3 bits from S0_{xr1} = {2, 3, 4, 5, 8, 10}, where S0_{xr1} is the set of positions whose value is 0 in X_{r1}. Assume that the third, fifth and last elements are selected; therefore, ω_i = {0, 0, 1, 0, 1, 1, 1, 0, 1, 1}.
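The worked example can be checked with a short Python sketch of Steps 1–4; the random choices are fixed to the ones assumed in the example, so the printed mutant matches ω_i = {0, 0, 1, 0, 1, 1, 1, 0, 1, 1}. The helper names and the brute-force enumeration are assumptions of this sketch, not the authors' implementation.

```python
import random

def dissimilarity(x, k):
    m11 = sum(a & b for a, b in zip(x, k))
    m10 = sum(a & (1 - b) for a, b in zip(x, k))
    m01 = sum((1 - a) & b for a, b in zip(x, k))
    return 1.0 - m11 / (m11 + m10 + m01)

def best_m_values(x_r1, target):
    """Eqs. (14a)-(14d): enumerate feasible (M11, M10, M01) w.r.t. Xr1, keep the closest triple."""
    m1, m0 = sum(x_r1), len(x_r1) - sum(x_r1)
    best, best_gap = None, float("inf")
    for m11 in range(m1 + 1):
        m01 = m1 - m11                              # Eq. (14b)
        for m10 in range(m0 + 1):                   # Eq. (14c)
            total = m11 + m10 + m01
            gap = abs(1.0 - m11 / total - target) if total else abs(1.0 - target)
            if gap < best_gap:
                best, best_gap = (m11, m10, m01), gap
    return best

# Step 1: the three neighbours and phi of the example
x_r1 = [1, 0, 0, 0, 0, 1, 1, 0, 1, 0]
x_r2 = [0, 0, 1, 1, 0, 0, 0, 1, 1, 1]
x_r3 = [0, 1, 1, 0, 0, 0, 1, 0, 1, 0]
phi = 0.8

# Step 2: target dissimilarity, phi * Dissimilarity(Xr2, Xr3) = 0.5714
target = phi * dissimilarity(x_r2, x_r3)

# Step 3: optimal M values (here (3, 3, 1))
m11, m10, m01 = best_m_values(x_r1, target)

# Step 4: build the mutant; the example picks positions 6, 7, 9 among the ones of Xr1
# and positions 3, 5, 10 among its zeros (1-based indices)
omega = [0] * len(x_r1)
for d in (6, 7, 9, 3, 5, 10):
    omega[d - 1] = 1

# Step 5 (Eq. (15)) would recombine omega with the current solution Xi,
# which the example leaves unspecified:
def recombine(x_i, omega, cr=0.25):
    return [w if random.random() <= cr else x for w, x in zip(omega, x_i)]

print(round(target, 4), (m11, m10, m01), omega)
```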
3.2. Pseudocode of MDisABC

In MDisABC, the initial binary food sources are generated by Eq. (18) instead of Eq. (1) [21]:

x_{id} = 1 if rand_{id} ≥ 0.75; otherwise x_{id} = 0    (18)

where rand_{id} is a uniformly generated number within the range [0, 1] for the dth dimension of source i.

MDisABC follows the basic steps of ABC except for the initialisation and new solution generation mechanisms. The detailed pseudocode of the MDisABC algorithm can be seen in Algorithm 1.

Algorithm 1. The pseudocode of the MDisABC
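The pseudocode figure for Algorithm 1 is not reproduced in this copy. As a stand-in, the following Python skeleton sketches the control flow described above: initialization by Eq. (18), then employed, onlooker and scout phases with greedy replacement. The candidate generation by the six steps of Section 3.1 is abstracted behind the generate_candidate argument, and all names and defaults are assumptions of this sketch rather than the authors' code.

```python
import random

def mdisabc_skeleton(fitness, generate_candidate, dim, sn=30, limit=50, mcn=50):
    """High-level MDisABC control flow; generate_candidate(i, foods) must return a new
    binary vector built by the mutation/recombination steps of Section 3.1."""
    # Initialization, Eq. (18): bit d of source i is 1 when rand_id >= 0.75
    foods = [[1 if random.random() >= 0.75 else 0 for _ in range(dim)] for _ in range(sn)]
    errors = [fitness(x) for x in foods]
    trials = [0] * sn

    def greedy_update(i, candidate):
        e = fitness(candidate)
        if e < errors[i]:                                     # keep the better solution (Step 6)
            foods[i], errors[i], trials[i] = candidate, e, 0
        else:
            trials[i] += 1

    for _ in range(mcn):
        for i in range(sn):                                   # employed bee phase
            greedy_update(i, generate_candidate(i, foods))
        fits = [1.0 / (1.0 + e) for e in errors]
        total = sum(fits)
        for i in range(sn):                                   # onlooker bee phase
            if random.random() < fits[i] / total:
                greedy_update(i, generate_candidate(i, foods))
        worn = max(range(sn), key=lambda i: trials[i])        # scout bee phase
        if trials[worn] > limit:
            foods[worn] = [1 if random.random() >= 0.75 else 0 for _ in range(dim)]
            errors[worn], trials[worn] = fitness(foods[worn]), 0
    best = min(range(sn), key=lambda i: errors[i])
    return foods[best], errors[best]
```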

4. Experiment design

4.1. Datasets

Ten benchmark datasets from the UCI machine learning repository [28] are used in the experiments, as shown in Table 1. The datasets comprise various numbers of features, classes and samples, providing a comprehensive analysis of the proposed and employed algorithms. For each dataset, examples are randomly divided into two sets: 70% as the training set and 30% as the test set. As a classifier, k nearest neighbours (KNN) is used to evaluate the classification performance, where k is chosen as 5 (5NN) as in [3].

4.2. Employed algorithms in comparative study

To show the effectiveness of the proposed algorithm, the following EC based algorithms are employed:

1. Genetic algorithms (GA): GA [27] is one of the most popular EC techniques for feature selection, where each chromosome

represents a feature subset. Selection, crossover and mutation are the main operators in GA. A sub-population is first selected from the initial population for crossover. After crossover is performed, mutation is applied in a probabilistic manner. Then, the generated offspring and the current population are sorted, and a selection scheme (e.g. elitist or roulette-wheel) is applied among them to generate the new population for the next iteration.
2. Binary particle swarm optimization (BPSO): BPSO was proposed by Kennedy and Eberhart for discrete problems in 1997 [22]. In BPSO, velocities are continuous values and are updated as in continuous PSO, but, in contrast to basic PSO, the velocity values represent the probability of the corresponding positions in a particle taking value 1 or 0.
3. New velocity based binary particle swarm optimization (NBPSO): Khanesar et al. [23] determined that the inertia weight may have negative effects on the velocity and particle update mechanism. Thus, they introduced an effective velocity mechanism to update particles, named new velocity based binary PSO (NBPSO).
4. Quantum inspired binary particle swarm optimization (QBPSO): Yun-Won et al. [24] integrated the concepts and principles of quantum computing, including the quantum bit and the superposition of states, into basic BPSO, namely quantum inspired BPSO (QBPSO). In QBPSO, a Q-bit individual is used as the probability of particles taking value 1 or 0 instead of velocities. Accordingly, some parameters such as the inertia weight and balance coefficients are not required to be determined.
5. Angle modulated artificial bee colony (AMABC): Pampara and Engelbrecht [25] proposed a binary artificial bee colony algorithm inspired by angle modulation (AMABC). In AMABC, each candidate is represented via a four dimensional continuous vector (a, b, c, d) within [−1, 1] and is transformed into a D-dimensional binary space by a sinusoid function.
6. Modification rate based artificial bee colony (MRABC): Akay and Karaboga [26] introduced a modification rate (MR) perturbation into the basic search equation (Eq. (2)) to decrease the convergence problems of standard ABC in high dimensional problems. For each parameter of an individual, a random number rand(0, 1) is generated. If rand(0, 1) < MR, that parameter is evolved by Eq. (2); otherwise, it is not changed. Note that [18] used a similar mechanism to MRABC in feature selection problems.
7. Discrete binary artificial bee colony (DisABC): DisABC [21] is based on the dissimilarity between the current and neighborhood candidates. Detailed information concerning DisABC can be found in Section 2.3.

Among the employed algorithms, BPSO and GA are the most widely used algorithms in feature selection, resulting in many publications [54,65]. The other algorithms have been proposed more recently and are strongly considered as fundamental approaches for the different variants of ABC and PSO by researchers. However, they have not been applied to feature selection problems yet. According to these points, they are chosen for the comparative studies to show the effectiveness of the proposed approach.

As for the comparison of the proposed algorithm with deterministic approaches, the linear forward selection (LFS) [66], linear floating forward selection (LFFS) [66] and greedy stepwise based selection (GSBS) [67] approaches, derived from the sequential forward (SFS), sequential floating forward (SFFS) and sequential backward (SBS) feature selection approaches, are chosen. While LFS (GSBS) performs forward (backward) search by adding (removing) features, LFFS applies both forward and backward searches in a sequential order. The experiments of LFS, LFFS and GSBS are processed using the Waikato environment for knowledge analysis (WEKA) [68], and the results of the deterministic and proposed approaches are presented together in Table 4.

Fig. 1. An illustrative representation of how the features are selected.

4.3. Parameter settings

In the experiments, the following parameters are used: the number of individuals in the population is set to 30 as in [4]; the maximum number of iterations is empirically set to 50; the parameters of BPSO are selected as in [69] such that c1 = 2, c2 = 2, initial weight = 0.9 and Vmax = 6; the crossover and mutation rates of GA are selected as 0.8 and 0.2 as in [70]; the parameters of NBPSO are selected as in [23] such that w1 = 0.5 and Vmax = 6; the parameters of QBPSO are selected as in [24] such that Qmax = 0.05 and Qmin = 0.01; the limit parameter of AMABC, MRABC, DisABC and MDisABC is experimentally chosen as 50; φmax and φmin are set to 0.9 and 0.5 in MDisABC and DisABC as in [21]; the MR parameter of MRABC is set to 0.5; and the CR parameter of MDisABC is experimentally chosen as 0.25.

An individual of an EC based algorithm represents a probable feature subset of S in a feature selection problem. If any dimension of an individual is 1 (or >0.5), its corresponding feature is selected, as illustrated in Fig. 1. The fitness value (error rate) of the corresponding individual (feature subset) is calculated through 10-fold cross-validation on the training set [71]. The training set is divided into 10 folds. A single fold is used as the sub-test data, and the remaining 9 folds are used as the sub-training data. This process is repeated 10 times with each of the 10 folds as the sub-test data. Then the averaged error rate over the 10 runs is used as the fitness value of the corresponding feature subset (individual) during the evolutionary feature selection process. How and why 10-fold cross validation is used in this way is explained in [71] in detail. After the evolutionary process, the best feature subset is evaluated on the test set, using the training set and 5NN, to obtain the classification error rate by Eq. (19):

ErrorRate = (FP + FN) / (FP + FN + TP + TN)    (19)

where TP and TN are true positives and negatives, and FP and FN are false positives and negatives.
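A sketch of this wrapper evaluation using scikit-learn (our own choice of library; the paper's experiments were run in MATLAB): the fitness of a candidate bit-string is the mean 10-fold cross-validation error of a 5NN classifier on the selected columns of the training set, and the final error rate of Eq. (19) is computed as the misclassification rate on the test set.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

def fitness(bits, X_train, y_train):
    """10-fold CV error of 5NN on the features selected by the binary individual (Fig. 1)."""
    cols = [d for d, b in enumerate(bits) if b == 1]   # a feature is selected when its bit is 1
    if not cols:
        return 1.0                                      # treat an empty subset as the worst error
    knn = KNeighborsClassifier(n_neighbors=5)
    acc = cross_val_score(knn, X_train[:, cols], y_train, cv=10, scoring="accuracy")
    return 1.0 - acc.mean()

def test_error(bits, X_train, y_train, X_test, y_test):
    """Eq. (19): error rate of the final subset, trained on the training set, evaluated on the test set."""
    cols = [d for d, b in enumerate(bits) if b == 1]
    knn = KNeighborsClassifier(n_neighbors=5).fit(X_train[:, cols], y_train)
    return 1.0 - knn.score(X_test[:, cols], y_test)
```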
5. Results and discussions

The experimental results of the classification error percentages over the 30 independent runs are presented in Tables 2–4 in terms of mean values, standard deviations and symbols, where 'CER' represents the classification error rate, '#NOF' represents the number of selected features and 'T-Sig' shows whether there exists any statistically significant difference, obtained by the Wilcoxon Rank Sum Test, between the proposed and the other algorithms. In addition, the best results are denoted in bold.

The symbols used to denote the significance of the difference between the proposed and the other algorithms on the classification error rate (Wilcoxon Rank Sum Test) have the following meanings:

• '+' & '−': the results of the MDisABC algorithm are significantly better or worse than those of the corresponding algorithm.
• '≈': the results of the MDisABC algorithm are similar to those of the corresponding algorithm.

Table 2
The obtained error percentages of the algorithms on training sets.

Dataset GA BPSO NBPSO QBPSO AMABC MRABC DisABC MDisABC

Wine CER 3.95 ± 0.50 4.38 ± 0.09 4.38 ± 0.09 4.41 ± 0.18 4.82 ± 0.19 4.45 ± 0.12 4.51 ± 0.16 4.36 ± 0.08
T-Sig − ≈ ≈ ≈ + + +
Vehicle CER 29.71 ± 0.70 29.12 ± 0.32 29.15 ± 0.56 29.99 ± 1.14 30.39 ± 0.55 29.24 ± 0.50 29.29 ± 0.53 28.94 ± 0.15
T-Sig + + + + + + +
German CER 24.93 ± 0.96 24.80 ± 0.49 24.43 ± 0.78 24.34 ± 0.55 25.49 ± 0.55 24.45 ± 0.61 24.27 ± 0.75 23.91 ± 0.39
T-Sig + + + + + + ≈
WBCD CER 4.71 ± 0.41 3.99 ± 0.43 4.58 ± 0.57 4.44 ± 0.44 4.12 ± 0.37 4.53 ± 0.39 4.19 ± 0.51 3.79 ± 0.33
T-Sig + ≈ + + + + +
Ionosphere CER 8.72 ± 1.18 8.89 ± 0.80 7.19 ± 1.12 6.61 ± 0.91 7.44 ± 0.68 8.76 ± 1.05 6.04 ± 0.41 5.74 ± 0.41
T-Sig + + + + + + +
Lung CER 25.10 ± 5.58 23.82 ± 3.95 19.63 ± 5.50 18.66 ± 4.55 26.32 ± 3.88 23.35 ± 4.36 19.84 ± 3.98 18.83 ± 3.05
T-Sig + + ≈ ≈ + + ≈
Hill-Valley CER 40.83 ± 1.48 43.01 ± 0.96 40.18 ± 1.15 40.48 ± 0.84 43.11 ± 0.84 41.79 ± 1.31 40.15 ± 0.80 40.14 ± 0.77
T-Sig + + ≈ + + + ≈
Musk 1 CER 7.42 ± 1.02 9.68 ± 0.55 7.35 ± 0.94 7.94 ± 0.60 10.67 ± 0.69 8.98 ± 0.85 8.14 ± 1.10 7.85 ± 0.63
T-Sig − + − ≈ + + ≈
Madelon CER 18.53 ± 0.93 21.87 ± 0.49 18.82 ± 1.01 19.42 ± 0.54 22.99 ± 0.85 20.59 ± 0.73 19.63 ± 2.11 18.71 ± 1.03
T-Sig ≈ + ≈ + + + +
Isolet5 CER 11.65 ± 0.75 14.33 ± 0.53 11.90 ± 0.66 12.25 ± 0.49 15.51 ± 0.49 13.19 ± 0.58 13.63 ± 1.12 12.73 ± 0.61
T-Sig − + − − + + +

In addition, the number of times that each feature is selected over the 30 independent runs for the Wine, Vehicle, German, WBCD and Ionosphere datasets is reported in Tables 5–9. Due to their large dimensionality, the other datasets could not be used for this analysis.

5.1. Results on the training sets

The results of the error rate values on the training sets obtained by the EC based algorithms are presented in Table 2. Table 2 shows that the proposed algorithm gets the best performance for five datasets in terms of the mean error rates. Table 2 also shows that MDisABC produces better mean values than DisABC in almost all cases, which shows that the improvement on DisABC is well-designed and well-established. Further, between the results of MDisABC and the other algorithms, there are mostly significant differences. For instance, MDisABC mostly gets significantly better results than BPSO, which is one of the most widely-used algorithms in feature selection. Only in three cases of Isolet5 are the results of MDisABC statistically worse than the results of the GA, NBPSO and QBPSO algorithms; in two cases of Musk1, NBPSO and GA statistically perform better than MDisABC, and in one case of Wine, GA statistically achieves better results than MDisABC. Except for these six cases, in 64 out of 70 (7 algorithms × 10 datasets) cases, the existing algorithms cannot obtain significantly better performance than the MDisABC algorithm. Therefore, it can be

Table 3
The obtained error percentages of the algorithms on test sets.

Dataset GA BPSO NBPSO QBPSO AMABC MRABC DisABC MDisABC 5-NN All

Wine CER 9.19 ± 1.64 0 0.37 ± 1.02 0 0 0.12 ± 0.46 0 0 22.17


#NOF 6.1 5.8 6.13 5.83 6.63 5.76 5.83 5.76 13
T-Sig + ≈ ≈ ≈ ≈ ≈ ≈
Vehicle CER 20.82 ± 1.71 20.74 ± 1.70 21.32 ± 1.98 22.48 ± 2.19 21.31 ± 1.96 20.73 ± 1.67 20.99 ± 1.90 21.22 ± 2.12 23.9
#NOF 10 9.93 9.76 9.90 11.93 9.73 9.73 9.30 18
T-Sig ≈ ≈ ≈ ≈ ≈ ≈ ≈
German CER 29.06 ± 1.84 28.97 ± 1.79 29.17 ± 1.91 29.12 ± 1.84 28.62 ± 1.71 29.90 ± 1.69 29.43 ± 2.61 29.85 ± 1.87 32
#NOF 10.70 11.43 11.30 10.83 13.56 10.40 8.73 8.03 25
T-Sig ≈ ≈ ≈ ≈ − ≈ ≈
WBCD CER 7.27 ± 1.46 7.05 ± 1.03 7.54 ± 1.17 7.45 ± 0.91 7.39 ± 1.18 7.84 ± 0.80 7.37 ± 1.06 6.72 ± 1.08 7.06
#NOF 14.96 13.36 13.96 14.30 12.06 14.50 12.23 11.86 30
T-Sig + ≈ + + + + +
Ionosphere CER 8.47 ± 1.75 8.09 ± 1.45 7.68 ± 1.96 7.96 ± 1.86 7.11 ± 1.90 8.95 ± 1.53 6.69 ± 1.72 6.38 ± 1.64 10.48
#NOF 10.93 10.1 9.63 8.23 4.9 10.13 6.03 5.76 34
T-Sig + + + + ≈ + ≈
Lung CER 40.00 ± 12.58 34.07 ± 13.35 38.88 ± 15.64 35.18 ± 11.70 44.44 ± 15.98 42.96 ± 13.91 39.62 ± 11.91 32.96 ± 8.49 44.44
#NOF 28.03 27.33 27.30 26.70 22.03 26.50 16.66 24.36 56
T-Sig + ≈ ≈ ≈ + + +
Hill-Valley CER 45.95 ± 2.05 45.73 ± 1.84 45.78 ± 1.48 45.53 ± 1.79 45.6 ± 2.68 45.67 ± 2.34 44.57 ± 2.23 44.92 ± 2.13 47.25
#NOF 45.73 46.16 44.86 44.7 31.5 44.73 24.60 30.53 100
T-Sig + + + ≈ ≈ ≈ ≈
Musk 1 CER 17.19 ± 2.36 16.64 ± 2.78 16.04 ± 2.40 15.40 ± 2.20 18.33 ± 3.17 15.71 ± 2.73 15.88 ± 2.91 14.71 ± 2.07 20
#NOF 81.83 81.2 82.23 81.4 75.96 82.26 86.16 75.76 166
T-Sig + + + ≈ + ≈ ≈
Madelon CER 23.79 ± 1.28 24.03 ± 1.45 23.96 ± 1.73 22.55 ± 1.47 24.58 ± 1.60 23.95 ± 1.31 23.02 ± 3.84 21.14 ± 2.02 28.21
#NOF 250.03 248.1 246.26 240.56 238.43 248.43 223.86 195.96 500
T-Sig + + + + + + +
Isolet5 CER 14.90 ± 1.19 15.33 ± 1.19 15.09 ± 1.04 14.47 ± 1.10 16.63 ± 1.14 15.20 ± 1.12 16.02 ± 1.38 14.51 ± 1.17 19.02
#NOF 306.06 306.6 303.93 305.73 347.4 303.96 378.33 300.10 617
T-Sig ≈ + + ≈ + + +

Table 4
The obtained error rates of the deterministic approaches with the proposed approach on test sets.

Dataset LFS LFFS GSBS MDisABC

Wine CER 25.93 25.93 24.07 0


#NOF 7 8 10 5.8
T-Sig + + + N/A
Vehicle CER 27.89 24.30 24.7 21.69 ± 1.85
#NOF 9 5 16 9.23
T-Sig + + + N/A
German CER 31.67 31.67 30.67 29.85 ± 1.87
#NOF 5 5 20 8.03
T-Sig + + + N/A
WBCD CER 7.65 8.82 7.06 6.72 ± 1.08
#NOF 12 11 29 11.86
T-Sig + + + N/A
Ionosphere CER 9.52 9.52 10.48 6.38 ± 1.64
#NOF 6 6 29 5.76
T-Sig + + + N/A
Lung CER 33.33 33.33 33.33 32.96 ± 8.49
#NOF 6 5 36 24.36
T-Sig + + + N/A
Hill-Valley CER 44.51 46.15 45.60 44.92 ± 2.13
#NOF 9 8 95 30.53
T-Sig ≈ + ≈ N/A
Musk 1 CER 19.29 20.71 17.14 14.71 ± 2.07
#NOF 12 12 124 75.76
T-Sig + + + N/A
Madelon CER 28.97 32.31 25.12 21.14 ± 2.02
#NOF 7 6 250 195.96
T-Sig + + + N/A
Isolet5 CER 23.72 25.43 19.23 14.51 ± 1.17
#NOF 27 23 585 300.10
T-Sig + + + N/A

Table 5
Times of appearance of each feature over 30 runs for Wine.

Wine F1 F2 F3 F4 F5 F6 F7 F8 F9 F10 F11 F12 F13

MDisABC 30 12 18 0 0 12 30 1 18 30 18 0 0
DisABC 30 19 16 0 0 17 29 9 12 29 12 2 0
MRABC 30 13 18 0 0 13 28 2 17 30 18 2 0
AMABC 30 27 13 0 0 13 30 23 19 30 13 1 0
QBPSO 30 8 23 0 0 8 30 1 22 30 23 0 0
NBPSO 30 13 17 3 0 13 30 8 15 25 21 8 0
BPSO 30 7 23 0 0 7 30 1 23 30 23 0 0
GA 30 20 17 0 0 6 23 14 15 24 11 23 0

Table 6
Times of appearance of each feature over 30 runs for Vehicle.

Vehicle F1 F2 F3 F4 F5 F6 F7 F8 F9

MDisABC 30 23 30 1 16 30 1 15 15
DisABC 29 22 30 9 21 28 6 19 9
MRABC 28 23 28 2 19 29 5 15 9
AMABC 21 28 30 13 23 23 23 20 18
QBPSO 26 17 21 2 25 27 13 20 10
NBPSO 24 21 30 0 22 30 8 15 17
BPSO 27 26 30 1 20 27 4 11 17
GA 26 19 30 10 24 24 13 12 10

Vehicle F10 F11 F12 F13 F14 F15 F16 F17 F18

MDisABC 29 25 0 0 18 10 0 19 17
DisABC 25 27 0 0 20 12 0 21 17
MRABC 29 25 0 0 24 15 1 20 20
AMABC 28 27 0 1 23 15 12 23 30
QBPSO 24 24 0 0 19 20 11 22 16
NBPSO 29 24 0 1 17 10 2 22 21
BPSO 30 27 0 0 14 17 0 20 27
GA 30 27 0 2 16 15 2 16 24

Table 7
Times of appearance of each feature over 30 runs for German.

German F1 F2 F3 F4 F5 F6 F7 F8 F9 F10 F11 F12

MDisABC 30 0 28 1 11 14 20 11 4 0 3 3
DisABC 30 4 25 5 15 15 24 11 13 0 9 6
MRABC 30 3 28 0 13 19 15 9 9 2 10 5
AMABC 30 17 25 2 14 26 22 17 14 2 17 17
QBPSO 30 5 27 4 13 18 20 15 17 2 12 4
NBPSO 30 4 23 3 15 19 21 21 13 0 9 7
BPSO 30 8 29 2 17 21 23 15 12 1 14 9
GA 30 4 21 6 16 17 18 13 18 3 11 9

German F13 F14 F15 F16 F17 F18 F19 F20 F21 F22 F23 F24

MDisABC 8 3 9 11 7 26 16 7 11 8 6 3
DisABC 11 6 8 11 10 14 16 3 5 9 6 6
MRABC 12 10 9 19 15 20 15 11 16 15 16 11
AMABC 18 16 18 22 10 23 13 14 19 16 19 16
QBPSO 10 8 9 13 17 15 16 15 15 16 15 9
NBPSO 10 9 13 15 8 22 19 17 18 18 10 15
BPSO 14 9 9 17 12 23 17 9 14 16 6 16
GA 12 4 8 13 17 16 18 17 16 13 10 11

Table 8
Times of appearance of each feature over 30 runs for WBCD.

WBCD F1 F2 F3 F4 F5 F6 F7 F8 F9 F10 F11 F12 F13 F14 F15

MDisABC 20 0 7 0 14 14 18 12 13 14 13 5 18 7 16
DisABC 18 0 14 0 17 15 11 12 16 15 17 7 16 14 19
MRABC 21 0 23 0 12 20 17 15 15 20 11 20 10 22 15
AMABC 20 2 6 0 11 14 18 14 10 15 14 8 8 6 16
QBPSO 16 1 22 0 19 19 12 14 19 15 19 10 12 22 18
NBPSO 19 3 22 2 11 13 19 14 14 19 13 15 15 22 12
BPSO 21 0 8 0 15 16 15 12 13 13 10 9 16 8 14
GA 15 9 26 2 20 15 15 20 14 16 13 16 14 19 10

WBCD F16 F17 F18 F19 F20 F21 F22 F23 F24 F25 F26 F27 F28 F29 F30

MDisABC 9 14 8 10 10 24 0 7 0 13 21 21 17 22 9
DisABC 10 16 12 9 12 19 0 14 0 13 15 18 16 14 8
MRABC 16 12 21 14 15 21 0 23 0 15 19 16 12 16 14
AMABC 18 18 11 12 14 24 0 6 0 11 17 18 17 15 16
QBPSO 13 17 13 19 17 17 1 22 0 13 15 17 18 13 16
NBPSO 17 17 18 13 13 19 3 22 2 13 16 17 15 12 15
BPSO 20 14 19 14 16 28 0 8 0 17 21 26 22 19 7
GA 18 16 18 19 14 15 2 26 2 17 14 21 19 11 13

Table 9
Times of appearance of each feature over 30 runs for Ionosphere.

Ionosphere F1 F2 F3 F4 F5 F6 F7 F8 F9 F10 F11 F12 F13 F14 F15 F16 F17

MDisABC 13 13 15 1 28 0 2 7 6 1 0 0 1 14 13 14 4
DisABC 10 11 17 9 24 1 7 6 7 1 5 0 4 10 12 12 2
MRABC 17 14 19 0 27 9 3 7 16 6 12 2 7 19 16 15 6
AMABC 12 11 13 14 23 2 4 3 6 1 2 0 7 8 4 4 2
QBPSO 14 13 14 3 30 1 2 6 8 2 6 0 6 18 23 13 6
NBPSO 14 17 13 4 30 5 4 9 11 5 8 0 6 15 20 13 7
BPSO 12 11 14 2 29 7 4 9 14 1 12 3 12 15 18 14 8
GA 17 13 12 4 28 4 1 11 13 9 9 2 11 16 15 13 8

Ionosphere F18 F19 F20 F21 F22 F23 F24 F25 F26 F27 F28 F29 F30 F31 F32 F33 F34

MDisABC 2 2 1 0 0 8 0 0 0 20 1 0 0 0 0 5 2
DisABC 0 3 1 1 0 5 2 2 0 18 0 3 0 1 0 5 2
MRABC 2 5 4 7 7 15 3 17 1 11 3 2 3 6 2 11 10
AMABC 2 1 1 1 1 4 2 0 0 6 3 0 2 0 0 4 1
QBPSO 2 5 0 3 2 16 0 10 0 20 2 2 0 4 0 7 5
NBPSO 4 5 2 3 6 19 6 11 0 15 4 7 2 5 2 7 9
BPSO 3 11 2 3 4 15 4 11 0 17 2 6 2 8 5 12 13
GA 8 4 5 4 8 14 12 16 1 15 4 10 2 10 8 11 10

suggested that the proposed algorithm outperforms the others in terms of minimizing the training error rate.

5.2. Results on the test sets

On the test sets, the classification error rate and the feature subset size should be considered together to make fair comparisons of the algorithms. For instance, two algorithms may achieve similar classification performance, but the one selecting a smaller number of features is the better algorithm. Summarizing the average error rates on the test sets, Table 3 shows that in almost all cases the EC based algorithms obtain lower classification errors using smaller feature subsets than when all features are used with the 5NN classifier. For instance, while 5NN obtains a 22.17% error using all features in Wine, nearly all EC based algorithms except GA obtain no classification error (0%) using only around half of the available features. Thus, it can be inferred that feature selection plays a vital role in classification performance.

Table 3 also indicates that MDisABC achieves the best performance in terms of the mean error rate in all but four cases, and it generally obtains the smallest feature subset. In Musk and Madelon especially, the reduction in subset size is at least nearly 10% compared with the other algorithms. In Hill-Valley, Vehicle and German, DisABC, MRABC and AMABC achieve the best mean values, respectively. According to the significance tests, the proposed MDisABC algorithm performs significantly worse than another algorithm in only 1 out of 70 cases (10 datasets × 7 algorithms); its statistical performance is particularly strong on Ionosphere, Madelon, Isolet5 and WBCD. Although the error rates of the algorithms are mostly similar to each other, especially in Wine, Vehicle and German, MDisABC reduces the feature subset size more effectively than the other algorithms, so it can also be regarded as successful on those datasets. Consequently, the proposed MDisABC also outperforms the others in terms of minimizing the error rate and the feature subset size on the test sets.

5.3. CPU time analysis

The computational time results are presented in Fig. 2. The experiments are implemented in MATLAB 2013a and executed on a computer with an Intel Core i7-4700HQ 2.40 GHz CPU and 8 GB RAM. The implementation codes of the employed algorithms except for QBPSO are either provided by the authors of the corresponding studies [69,21,23] or coded according to the authors' suggestions [25] to ensure a reliable analysis. According to Fig. 2, there is no great difference between the algorithms in terms of CPU computational time on almost all datasets. For instance, the CPU time difference between the algorithms is at most 1 or 2 s from Wine to Musk1, which is not large compared with the runtime of the algorithms on these datasets, between 18 and 35 s. In proportion to the dimensionality and the number of patterns, the computational time increases on the Madelon and Isolet5 datasets, where the AMABC and DisABC algorithms do not scale as well as they do on the other datasets.

When comparing the proposed MDisABC with the deterministic methods, there is no clear pattern, since the computational time used by the different methods heavily depends on the dataset. In general, for datasets with a small number of features (e.g. Wine), the deterministic methods take less time than MDisABC. For datasets with a large number of features (e.g. Madelon), MDisABC is often slower than LFS and LFFS, but faster than GSBS. The main reason is that most of the computational time in a wrapper feature selection method is spent on the evaluation procedure. On large datasets (e.g. Madelon), GSBS may require a larger number of evaluations than MDisABC, and each evaluation in GSBS may take longer than in MDisABC because GSBS starts its evaluations with a large number of features. Thus, it can be concluded that the proposed MDisABC algorithm performs well not only in training and test classification performance but also in CPU computational time.
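The dependence of the wrapper cost on the subset size can be made explicit with a rough cost model (a back-of-the-envelope estimate of our own, not reported in the original experiments; the symbols below are introduced only for illustration). For a 5NN wrapper, evaluating a candidate subset $S$ over $n_{tr}$ training instances and $n_{te}$ evaluation instances requires on the order of

$T_{eval}(S) \approx c \cdot n_{tr} \cdot n_{te} \cdot |S|$

distance operations for some constant $c$, because each pairwise distance is computed over the $|S|$ selected features only. The near-complete subsets examined in the early steps of GSBS are therefore roughly $D/|S|$ times more expensive per evaluation than a small subset of size $|S|$ carried by MDisABC, where $D$ is the total number of features in the dataset.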
5.4. Comparisons with recent ACO papers

To further test the performance of MDisABC, we compare it with two ACO based algorithms [72,73] published in 2015, which use a feature selection methodology similar to that of this paper. The first study [72] shares four datasets with this paper (Wine, Vehicle, German and Ionosphere), on which the classification results of the ACO based feature selection are 4.51%, 28.25%, 30.40% and 14.82%, respectively. The results show that MDisABC performs better than ACO [72]. The other study [73] proposes an improved ACO algorithm and shares the Wine and Vehicle datasets with this paper, on which the classification results of the improved ACO are 3.10% and 24.70%, respectively. Therefore, MDisABC is also superior to the improved ACO.

5.5. Comparisons with deterministic approaches

According to Table 4, LFS finds the smallest feature subsets in most cases. However, it cannot provide the same success in terms of the classification error rate. LFFS performs similarly to or slightly worse than LFS. On the other hand, GSBS finds the largest feature subsets in nearly all cases, and it performs worse than LFS and LFFS in terms of the classification error rate since it is a greedy backward search starting with the entire set of features. When comparing the MDisABC algorithm with the deterministic approaches, it is seen that MDisABC provides the best performance on 9 out of the 10 datasets, and its improvements are statistically significant in almost all cases. Only on Hill-Valley does the MDisABC algorithm fail to achieve significantly better performance than the deterministic approaches, although their performances are very similar. In terms of time complexity, the forward approaches (LFS and LFFS) are the cheapest, whereas the backward approach (GSBS) costs more than the others; especially on the large-scale datasets (Madelon and Isolet5), it may take nearly a week. In conclusion, the MDisABC algorithm is also superior to the deterministic approaches and can be used as an alternative for feature selection problems.

5.6. Analysis of selected features

To show the stability/consistency of the features selected by the proposed algorithm over different independent runs, we analyse the number of times each feature is selected by the algorithms over the 30 independent runs. Due to space limits, five datasets (Wine, Vehicle, German, WBCD and Ionosphere) with relatively small numbers of features are used here as examples for the analysis of the selected features.
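The per-feature selection counts reported in Tables 5–9 can be tallied with a few lines of code. The following is a minimal sketch (in Python with NumPy), assuming the best binary subset vector of each independent run has been saved; the array and function names are illustrative, not taken from the original MATLAB implementation:

    import numpy as np

    def selection_counts(run_subsets):
        # run_subsets: (n_runs, n_features) array of 0/1 entries, where row r
        # holds the best subset found in run r (1 = feature selected).
        subsets = np.asarray(run_subsets, dtype=int)
        return subsets.sum(axis=0)  # number of runs that selected each feature

    # Example with 30 hypothetical runs on a 13-feature dataset such as Wine:
    rng = np.random.default_rng(0)
    runs = rng.integers(0, 2, size=(30, 13))  # stand-in for the saved run results
    for f, count in enumerate(selection_counts(runs), start=1):
        print("F%d: selected in %d of 30 runs" % (f, count))

Features whose counts are close to 30 correspond to the dominant features discussed below, while features with very low counts correspond to the rarely preferred ones.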
Presenting the selection times of each feature in Wine, Table 5 shows that F1, F7 and F10 are the most dominant features in classification, frequently selected by all algorithms, and that F4, F5, F12 and F13 are mostly not preferred by any algorithm, except for GA and NBPSO in the case of F12. For the other features, whereas F2 is mostly preferred by DisABC and AMABC, it is not preferred frequently by the BPSO variants or the proposed MDisABC algorithm. F8 is mostly preferred by AMABC and is sometimes preferred by DisABC and NBPSO, but it is selected only once by the others. Therefore, MDisABC is as stable as the other algorithms. Furthermore, the best feature subset obtained by MDisABC comprises all the dominant features and two occasionally preferred complementary features (Sbest = {F1, F2, F6, F7, F10}).
Fig. 2. Average CPU computational times of the algorithms.
Considering the count of selection times for each feature in Vehicle, Table 6 shows that F1, F3, F6, F10 and F11 are the dominant features, mostly preferred by the algorithms except for the case of AMABC in F1 and the case of QBPSO in F3. As for the least preferred or unselected features, F4, F7, F12, F13 and F16 can be given as examples. Although F4 is not preferred by MDisABC or the binary variants of PSO, it is occasionally chosen by AMABC, DisABC and GA. Also, although F7 is preferred by AMABC, it is not preferred by the other algorithms. It can be inferred that the stability of MDisABC carries over to Vehicle, and that AMABC is not as consistent as the other algorithms. Besides, the best feature subset obtained by MDisABC is Sbest = {F1, F3, F5, F6, F8, F9, F10, F14}, comprising all the dominant features (except F11) and four occasionally selected features.

According to Table 7, F1, F3 and F7 are the most frequently selected features among all of the algorithms in German. Although F18 is also one of the most preferred features for MDisABC, AMABC, NBPSO and BPSO, it is not much preferred by GA, DisABC and QBPSO. Among the least preferred features, F2 (except for the case of AMABC), F4 and F10 can be given as examples. Not only F2 but also F14 and F15 are picked much more frequently by AMABC. These features may lead AMABC to larger feature subsets than the others (see Table 3). Therefore, it may be suggested that AMABC is not good at eliminating redundant or irrelevant features despite its classification performance in German, while the stability of MDisABC is again illustrated. The best feature subset obtained by MDisABC is {F1, F3, F6, F8, F12, F17, F21}, which includes the dominant features except F7.

In WBCD, Table 8 shows that, unlike the previous datasets, there is no dominant feature chosen by all algorithms, although F1 and F21 may be given as examples. On the other hand, F2, F4, F22 and F24 are the least preferred features of the algorithms. It is difficult to analyse all cases due to the different combinations of feature subsets, but the stability of MDisABC can be illustrated in WBCD. The best feature subset obtained by MDisABC is {F1, F7, F8, F13, F15, F16, F21, F26, F27, F28, F29}, which comprises dominant features and does not include any of the least preferred features.

According to Table 9, F5 is the most preferred feature, and F12, F20, F24 (except for GA), F26, F28, F30 and F32 (except for GA) are the least preferred features among all the algorithms in Ionosphere. It is also seen in Table 9 that there exist features such as F15, F23 and F25 which are selected more frequently by the PSO algorithms, GA and MRABC than by the binary variants of ABC. This might be the reason why the size of the feature subsets obtained by the binary variants of ABC is about half of that obtained by the other algorithms. The best feature subset obtained by MDisABC is {F3, F4, F5, F16, F23, F25, F27}, a combination of the one dominant feature and six occasionally selected features. In conclusion, the proposed MDisABC algorithm is the most stable and robust of the algorithms.

6. Conclusions

The main goal of this study was to propose a new variant of the DisABC algorithm for feature selection. This goal was successfully achieved by introducing a DE based neighborhood mechanism into the similarity based search of DisABC. The second goal of this study was to provide a comprehensive comparative study for future research. This goal was achieved by comparing the proposed algorithm with seven different EC based algorithms, namely BPSO, NBPSO, QBPSO, DisABC, AMABC, MRABC and GA, and three classical approaches, namely LFS, LFFS and GSBS. It should be noted that while BPSO and GA are the algorithms most widely applied to feature selection, the other EC based algorithms are applied to feature selection here for the first time.
The obtained results show that the integration of the DE based similarity search mechanism into the DisABC algorithm effectively improved the global search ability of the algorithm in feature selection, and the proposed MDisABC algorithm achieved the best classification performance on both the training and test sets in almost all cases. Further, the proposed MDisABC algorithm is able to remove redundant features effectively while obtaining the highest classification results. The results also show that the proposed MDisABC algorithm outperformed the deterministic non-EC methods, LFS, LFFS and GSBS, in terms of classification accuracy, and selected a much smaller number of features than GSBS. The analysis of the features selected by the different algorithms reveals that the proposed MDisABC algorithm is the most stable and consistent algorithm of all. Moreover, the analysis of the CPU times of the different methods shows that the proposed MDisABC algorithm achieved better accuracy than the existing methods without taking a longer computational time. In the future, studies of feature selection based on ABC are expected to increase and more ABC based approaches will be developed. There are also some new evolutionary algorithms [74–76] which have not yet been used for feature selection. We will test the proposed method on large dimensional datasets and consider the feature selection problem in filter approaches [77] using ABC.

Acknowledgements

This work is supported in part by the Marsden Funds of New Zealand (VUW1209), the University Research Funds of Victoria University of Wellington (203936/3337), and the National Science Foundation of China (NSFC No. 61170180). The authors are also grateful for financial support from the Scientific and Technological Research Council of Turkey (TUBITAK-BIDEB) and the Turkish Council of Higher Education.

References

[1] I. Guyon, A. Elisseeff, An introduction to variable and feature selection, J. Mach. Learn. Res. 3 (2003) 1157–1182.
[2] M. Dash, H. Liu, Feature selection for classification, Intell. Data Anal. 1 (1-4) (1997) 131–156.
[3] B. Xue, M. Zhang, W.N. Browne, Particle swarm optimization for feature selection in classification: a multi-objective approach, IEEE Trans. Cybern. 43 (6) (2013) 1656–1671.
[4] B. Xue, Particle Swarm Optimisation for Feature Selection in Classification, Victoria University of Wellington, School of Engineering and Computer Science, 2013, Ph.D. thesis.
[5] M. Lane, B. Xue, I. Liu, M. Zhang, Particle swarm optimisation and statistical clustering for feature selection, in: S. Cranefield, A. Nayak (Eds.), AI 2013: Advances in Artificial Intelligence, vol. 8272 of Lecture Notes in Computer Science, Springer International Publishing, 2013, pp. 214–220.
[6] A.W. Whitney, A direct method of nonparametric measurement selection, IEEE Trans. Comput. C-20 (9) (1971) 1100–1103.
[7] T. Marill, D. Green, On the effectiveness of receptors in recognition systems, IEEE Trans. Inf. Theory 9 (1) (2006) 11–17.
[8] A. Unler, A. Murat, A discrete particle swarm optimization method for feature selection in binary classification problems, Eur. J. Oper. Res. 206 (3) (2010) 528–539.
[9] Y. Liu, G. Wang, H. Chen, H. Dong, X. Zhu, S. Wang, An improved particle swarm optimization for feature selection, J. Bionic Eng. 8 (2) (2011) 191–200.
[10] J. Yang, V.G. Honavar, Feature subset selection using a genetic algorithm, IEEE Intell. Syst. 13 (2) (1998) 44–49.
[11] M.L. Raymer, W.F. Punch, E.D. Goodman, L.A. Kuhn, A.K. Jain, Dimensionality reduction using genetic algorithms, IEEE Trans. Evol. Comput. 4 (2) (2000) 164–171.
[12] D. Muni, N. Pal, J. Das, Genetic programming for simultaneous feature selection and classifier design, IEEE Trans. Syst. Man Cybern. B 36 (1) (2006) 106–117.
[13] R. Ramirez, M. Puiggros, An evolutionary computation approach to cognitive states classification, in: IEEE Congress on Evolutionary Computation (CEC'07), 2007, pp. 1793–1799.
[14] S. Nemati, M.E. Basiri, N. Ghasem-Aghaee, M.H. Aghdam, A novel ACO-GA hybrid algorithm for feature selection in protein function prediction, Expert Syst. Appl. 36 (10) (2009) 12086–12094.
[15] L. Wen, Q. Yin, P. Guo, Ant colony optimization algorithm for feature selection and classification of multispectral remote sensing image, in: IEEE International Geoscience and Remote Sensing Symposium (IGARSS 2008), vol. 2, 2008, pp. II-923–II-926.
[16] D. Karaboga, B. Basturk, On the performance of artificial bee colony (ABC) algorithm, Appl. Soft Comput. 8 (1) (2008) 687–697.
[17] D. Karaboga, B. Gorkemli, C. Ozturk, N. Karaboga, A comprehensive survey: artificial bee colony (ABC) algorithm and applications, Artif. Intell. Rev. 42 (1) (2014) 21–57.
[18] M. Schiezaro, H. Pedrini, Data feature selection based on artificial bee colony algorithm, EURASIP J. Image Video Process. 2013 (1) (2013) 1–8.
[19] M.S. Uzer, Y. Nihat, O. Inan, Feature selection method based on artificial bee colony algorithm and support vector machines for medical datasets classification, Sci. World J. 2013 (2013) 1–10.
[20] M. Akila, S.S. Kumar, Performance of classification using a hybrid distance measure with artificial bee colony algorithm for feature selection in keystroke dynamics, Int. J. Comput. Intell. Stud. 2 (2) (2013) 187–197.
[21] M.H. Kashan, N. Nahavandi, A.H. Kashan, DisABC: a new artificial bee colony algorithm for binary optimization, Appl. Soft Comput. 12 (1) (2012) 342–352.
[22] J. Kennedy, R. Eberhart, A discrete binary version of the particle swarm algorithm, in: IEEE International Conference on Systems, Man, and Cybernetics, Computational Cybernetics and Simulation, vol. 5, 1997, pp. 4104–4108.
[23] M.A. Khanesar, M. Teshnehlab, M.A. Shoorehdeli, A novel binary particle swarm optimization, in: Mediterranean Conference on Control & Automation (MED'07), 2007, pp. 1–6.
[24] J. Yun-Won, P. Jong-Bae, J. Se-Hwan, K.Y. Lee, A new quantum-inspired binary PSO: application to unit commitment problems for power systems, IEEE Trans. Power Syst. 25 (3) (2010) 1486–1495.
[25] G. Pampara, A.P. Engelbrecht, Binary artificial bee colony optimization, in: IEEE Symp. Swarm Intell., 2011, pp. 1–8.
[26] B. Akay, D. Karaboga, A modified artificial bee colony algorithm for real-parameter optimization, Inf. Sci. 192 (2012) 120–142.
[27] J.H. Holland, Genetic algorithms, Scholarpedia 7 (12) (2012) 1482.
[28] K. Bache, M. Lichman, UCI machine learning repository, 2013. Available from: http://archive.ics.uci.edu/ml
[29] D. Karaboga, An idea based on honey bee swarm for numerical optimization, Technical Report-TR06, Erciyes University, Engineering Faculty, Computer Engineering Department, 2005.
[30] S. Das, S. Biswas, S. Kundu, Synergizing fitness learning with proximity-based food source selection in artificial bee colony algorithm for numerical optimization, Appl. Soft Comput. 13 (12) (2013) 4676–4694.
[31] D. Karaboga, B. Basturk, A powerful and efficient algorithm for numerical function optimization: artificial bee colony (ABC) algorithm, J. Glob. Optim. 39 (3) (2007) 459–471.
[32] S. Seok Choi, S. Hyuk Cha, A survey of binary similarity and distance measures, J. Syst. Cybern. Inf. (2010) 43–48.
[33] P. Jaccard, The distribution of the flora in the alpine zone, New Phytol. 11 (1912) 37–50.
[34] B. Bonev, Feature Selection Based on Information Theory, University of Alicante, 2010, Ph.D. thesis.
[35] A. Blum, P. Langley, Selection of relevant features and examples in machine learning, Artif. Intell. 97 (1-2) (1997) 245–271.
[36] B. Xue, M. Zhang, W.N. Browne, Particle swarm optimisation for feature selection in classification: novel initialisation and updating mechanisms, Appl. Soft Comput. 18 (2014) 261–276.
[37] R. Eberhart, J. Kennedy, A New Optimizer Using Particle Swarm Theory, 1995.
[38] E. Bonabeau, M. Dorigo, G. Theraulaz, Swarm Intelligence: From Natural to Artificial Systems, Oxford University Press Inc., New York, NY, USA, 1999.
[39] L. Cervante, B. Xue, L. Shang, M. Zhang, A dimension reduction approach to classification based on particle swarm optimisation and rough set theory, in: AI 2012: Advances in Artificial Intelligence, Springer, 2012, pp. 313–325.
[40] L. Cervante, B. Xue, L. Shang, M. Zhang, A multi-objective feature selection approach based on binary PSO and rough set theory, in: Evolutionary Computation in Combinatorial Optimization, vol. 7832 of Lecture Notes in Computer Science, Springer, Berlin, Heidelberg, 2013, pp. 25–36.
[41] B. Xue, M. Zhang, W.N. Browne, New fitness functions in binary particle swarm optimisation for feature selection, in: IEEE Congress on Evolutionary Computation (CEC'2012), 2012, pp. 1–8.
[42] H. Bommaganti, Feature Boosting: A Novel Feature Subset Selection Approach, University of Minnesota, 2001, Master thesis.
[43] K. Kira, L.A. Rendell, A practical approach to feature selection, in: Proceedings of the Ninth International Workshop on Machine Learning, ML92, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1992, pp. 249–256.
[44] H. Almuallim, T.G. Dietterich, Learning Boolean concepts in the presence of many irrelevant features, Artif. Intell. 69 (1994) 279–305.
[45] P. Pudil, J. Novovicova, J. Kittler, Floating search methods in feature selection, Pattern Recognit. Lett. 15 (11) (1994) 1119–1125.
[46] Z. Zhu, Y.-S. Ong, M. Dash, Wrapper-filter feature selection algorithm using a memetic framework, IEEE Trans. Syst. Man Cybern. B 37 (1) (2007) 70–76.
[47] S. Ahmed, M. Zhang, L. Peng, Improving feature ranking for biomarker discovery in proteomics mass spectrometry data using genetic programming, Connect. Sci. 26 (3) (2014) 215–243.
[48] V. Kothari, J. Anuradha, S. Shah, P. Mittal, A survey on particle swarm optimization in feature selection, in: P.V. Krishna, M.R. Babu, E. Ariwa (Eds.), Global Trends in Information Systems and Software Applications, vol. 270 of Communications in Computer and Information Science, Springer, Berlin, Heidelberg, 2012, pp. 192–201, chapter 22.
[49] B. Tran, B. Xue, M. Zhang, Overview of particle swarm optimisation for feature selection in classification, in: Simulated Evolution and Learning, vol. 8886 of Lecture Notes in Computer Science, Springer International Publishing, 2014, pp. 605–617.
[50] R. Jensen, Performing feature selection with ACO, in: A. Abraham, C. Grosan, V. Ramos (Eds.), Swarm Intelligence in Data Mining, vol. 34 of Studies in Computational Intelligence, Springer, Berlin, Heidelberg, 2006, pp. 45–73, http://dx.doi.org/10.1007/978-3-540-34956-3_3.
[51] H. Ming, A rough set based hybrid method to feature selection, in: International Symposium on Knowledge Acquisition and Modeling (KAM'08), 2008, pp. 585–588.
[52] S. Ding, Feature selection based F-score and ACO algorithm in support vector machine, in: Second International Symposium on Knowledge Acquisition and Modeling (KAM'09), vol. 1, 2009, pp. 19–23.
[53] E. Sarac, S.A. Ozel, An ant colony optimization based feature selection for web page classification, Sci. World J. (2014) 1–16.
[54] B. de la Iglesia, Evolutionary computation for feature selection in classification problems, Wiley Interdiscip. Rev. 3 (6) (2013) 381–407.
[55] C. Ozturk, E. Hancer, D. Karaboga, Color quantization: a short review and an application with artificial bee colony algorithm, Informatica 25 (3) (2014) 485–503.
[56] C. Ozturk, E. Hancer, D. Karaboga, Improved clustering criterion for image clustering with artificial bee colony algorithm, Pattern Anal. Appl. 18 (2014) 587–599.
[57] C. Ozturk, E. Hancer, D. Karaboga, Automatic clustering with global best artificial bee colony algorithm, J. Fac. Eng. Arch. Gazi Univ. 29 (4) (2014) 677–687.
[58] B. Subanya, R. Rajalaxmi, Artificial bee colony based feature selection for effective cardiovascular disease diagnosis, Int. J. Sci./Eng. Res. 5 (5) (2014) 606–612.
[59] P. Shunmugapriya, S. Kanmani, R. Supraja, K. Saranya, Hemalatha, Feature selection optimization through enhanced artificial bee colony algorithm, in: International Conference on Recent Trends in Information Technology (ICRTIT), 2013, pp. 56–61.
[60] N. Suguna, K.G. Thanushkodi, An independent rough set approach hybrid with artificial bee colony algorithm for dimensionality reduction, Am. J. Appl. Sci. 8 (3) (2011) 261–266.
[61] M.S. Kiran, M. Gunduz, XOR-based artificial bee colony algorithm for binary optimization, Turk. J. Electr. Eng. Comput. Sci. 21 (2013) 2307–2328.
[62] F. Neri, C. Cotta, Memetic algorithms and memetic computing optimization: a literature review, Swarm Evol. Comput. 2 (2012) 1–14.
[63] O. Yew-Soon, L. Meng-Hiot, C. Xianshun, Memetic computation-past, present & future [research frontier], IEEE Comput. Intell. Mag. 5 (2) (2010) 24–31.
[64] M.H. Kashan, A.H. Kashan, N. Nahavandi, A novel differential evolution algorithm for binary optimization, Comput. Optim. Appl. 55 (2) (2013) 481–513.
[65] B. Xue, M. Zhang, W.N. Browne, Single feature ranking and binary particle swarm optimisation based feature subset ranking for feature selection, in: Proceedings of the Thirty-Fifth Australasian Computer Science Conference (ACSC 2012), Melbourne, Australia, 2012, pp. 27–36.
[66] M. Gutlein, E. Frank, M. Hall, A. Karwath, Large-scale attribute selection using wrappers, in: IEEE Symposium on Computational Intelligence and Data Mining (CIDM'09), 2009, pp. 332–339.
[67] R. Caruana, D. Freitag, Greedy attribute selection, in: Proceedings of the Eleventh International Conference on Machine Learning, Morgan Kaufmann, 1994, pp. 28–36.
[68] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, I.H. Witten, The WEKA data mining software: an update, SIGKDD Explor. Newsl. 11 (1) (2009) 10–18.
[69] S. Mirjalili, A. Lewis, S-shaped versus V-shaped transfer functions for binary particle swarm optimization, Swarm Evol. Comput. 9 (2013) 1–14.
[70] S. Sivanandam, S. Deepa, Genetic algorithm implementation using MATLAB, in: Introduction to Genetic Algorithms, Springer, Berlin, Heidelberg, 2008, pp. 211–262.
[71] R. Kohavi, G.H. John, Wrappers for feature subset selection, Artif. Intell. 97 (1-2) (1997) 273–324.
[72] N. Sreeja, A. Sankar, Pattern matching based classification using ant colony optimization based feature selection, Appl. Soft Comput. 31 (2015) 91–102.
[73] S. Kashef, H. Nezamabadi-pour, An advanced ACO algorithm for feature subset selection, Neurocomputing 147 (2015) 271–279.
[74] B. Javidy, A. Hatamlou, S. Mirjalili, Ions motion algorithm for solving optimization problems, Appl. Soft Comput. 32 (2015) 72–79.
[75] A. Hatamlou, Heart: a novel optimization algorithm for cluster analysis, Progress in Artificial Intelligence, vol. 2, Springer, Berlin, Heidelberg, 2014, pp. 167–173.
[76] A. Hatamlou, Black hole: a new heuristic optimization approach for data clustering, Inf. Sci. 222 (2013) 175–184.
[77] B. Xue, L. Cervante, L. Shang, W. Browne, M. Zhang, Binary PSO and rough set theory for feature selection: a multi-objective filter based approach, Int. J. Comput. Intell. Appl. (IJCIA) 13 (2) (2014) 1450009-1–1450009-34, http://dx.doi.org/10.1142/S1469026814500096.