
Neurocomputing 171 (2016) 586–597

Contents lists available at ScienceDirect

Neurocomputing
journal homepage: www.elsevier.com/locate/neucom

Feature selection using multimodal optimization techniques


Shima Kamyab*, Mahdi Eftekhari
Department of Computer Engineering, Shahid Bahonar University of Kerman, Kerman, Iran

Article info

Article history: Received 7 July 2014; Received in revised form 6 June 2015; Accepted 30 June 2015; Available online 16 July 2015. Communicated by Feiping Nie.

Keywords: Multimodal optimization; Feature selection; Evolutionary feature selection

Abstract

This paper investigates the effect of using Multimodal Optimization (MO) techniques on solving the Feature Selection (FSel) problem. The FSel problem is a high-dimensional optimization problem in nature and thus needs a solver with high exploration power. On the other hand, if alternative optimal solutions could be provided for a problem, the implementation phase may become more selective depending on the cost and limitations of the domain of the problem. The high exploration power and solution conservation capability of MO methods make them able to find multiple suitable solutions in a single run. Therefore, MO methods can be considered as a powerful tool for finding suitable feature subsets for the FSel problem. In this paper, we made a special study on the use of MO methods in the feature selection problem. The binary versions of some existing Evolutionary Algorithm (EA) based MO methods, like Dynamic Fitness Sharing (DFS), local best PSO variants and GA_SN_CM, are proposed and used for selection of suitable features from several benchmark datasets. The results obtained by the MO methods are compared to some well-known heuristic approaches for the FSel problem from the literature. The obtained results and their statistical analyses indicate the effectiveness of MO methods in finding multiple accurate feature subsets compared to existing powerful methods.

© 2015 Elsevier B.V. All rights reserved.

1. Introduction

Reduction of pattern dimensionality, via feature selection and feature extraction, is one of the most fundamental steps in data preprocessing. The aim of Feature Selection (FSel) is to choose a necessary and sufficient subset of features which is capable of describing the target concept while retaining the accuracy of classification in a dataset. Additional features may induce some disadvantageous effects on the classification process. First, they significantly slow down the learning process. Second, they deteriorate the classification accuracy by causing the classifier to over-fit the training data, as irrelevant or redundant features may confound the learning algorithm [1]. In the design of pattern classifiers, careful feature selection may improve both the quality and the computation time of inducing subsequent models. Using fewer features often leads to simpler and easier-to-interpret models and to important insights into the application, such as noise reduction [2].

The feature selection methods proposed in the literature have been categorized into four classes depending on how they evaluate feature subsets: 1) Wrapper, 2) Filter, 3) Hybrid and 4) Embedded based methods.

Wrapper based methods aim at selecting those subsets of features which improve the performance of a predetermined learning model [3–7]. A wrapper model consists of two phases. Phase 1: feature subset selection, which selects the best subset using the accuracy of the classifier (on the training data) as a criterion. Phase 2: learning and testing, where a classifier is learned from the training data with the best feature subset and is tested on the test data. Therefore, the wrapper approaches use the prediction performance of a model to assess the relative usefulness of subsets of features [8]. In [9], the feature selection problem is reformulated as a least-squares regression problem equivalent to ℓ2,1-norm minimization on both the loss function and the regularization. This method uses the labels of the training data in its optimization procedure and therefore has a learning phase.

Filter approaches work based on the intrinsic properties of the data, rather than being biased toward a particular classifier [10–18]. The essence of filter methods is to seek the relevant features and eliminate irrelevant ones. They also consist of two phases. Phase 1: feature selection using measures such as information, distance, dependence, or consistency; no classifier is engaged in this phase. Phase 2: this is the same as in the wrapper's model, where a classifier is learned from the training data with the selected features and is tested on the test data. In [19] a graph-based feature selection framework under the Trace Ratio criterion is proposed, in which the feature selection problem is viewed as a special subspace learning task where the projection matrix is constrained to be a selection matrix. In order to encode the relationships among the data, a graph structure based on the proximity of data samples is used.

* Corresponding author. Tel./fax: +91 343 3235901.
E-mail addresses: [email protected] (S. Kamyab), [email protected] (M. Eftekhari).

http://dx.doi.org/10.1016/j.neucom.2015.06.068
0925-2312/© 2015 Elsevier B.V. All rights reserved.
Hybrid methods attempt to take advantage of both wrapper and filter methods to obtain better feature sets [20–26]. In particular, hybrid methods are based on a sequential (e.g. two-step) approach where the first step is usually based on filter methods to reduce the number of features. Using this reduced set, in the second step, a wrapper method is then employed to select the desired number of features [1]. Embedded methods embed the feature selection procedure into the learning algorithm, i.e., feature selection occurs naturally as a part of the learner [3]. From the literature, it seems more reasonable to use a wrapper (or embedded method) with a linear predictor as a filter and then train a more complex nonlinear predictor on the resulting variables [5]. Therefore, in this paper we used a wrapper approach with a 1-NN classifier as predictor in the first phase of a wrapper based approach, and the resulting subsets were then evaluated using more complex classifiers.

From a computational perspective, the feature selection problem is generally difficult to solve. It is inherently a combinatorial optimization problem [27]. Hence, optimization techniques are suitable methods to resolve this problem. Recently, several optimization based methods have been used and proposed in order to find relevant feature subsets among the possible combinations of features [25,28–30]. Evolutionary Algorithms (EAs) are capable of exploring the search space comprehensively, in a reasonable time, due to their population based structures. However, these methods suffer from the premature convergence problem because of the tendency of their population to converge to its best found solution. This problem could be solved by controlling the diversity of the population.

Finding multiple optimum solutions is the objective of a class of optimization methods called Multimodal Optimization (MO) techniques. In such EA based approaches, some additional mechanisms are involved in the EAs in order to help the main search process avoid converging toward a local optimum. These mechanisms, called niching techniques, empower the search process to find multiple global and local optima by controlling the diversity of the population and performing some local search behavior to fine-tune the obtained solutions. These attributes of MO methods encourage them to be considered as appropriate tools to find suitable solutions for the feature selection problem.

In this paper, a special study on the use of MO methods for the feature selection problem is made, which was not performed before. To do this, the modification of some well-known EA based MO methods and also our proposed niching method called GA_SN_CM are used for the feature selection task and are compared with some well-known EA based methods for feature selection, to study the power of MO methods in improving the feature selection results.

The remainder of this paper is organized as follows. Section 2 provides a brief description of using EA based optimization methods for feature selection. In Section 3, the MO methods which are used for feature selection in this paper are discussed. In Section 4, the described MO methods are applied to several test problems and the results are presented and discussed. Finally, Section 5 is dedicated to conclusions.

2. Feature selection using EA-based optimization methods

Searching for an appropriate feature subset, which can represent the target concept in a large feature set, is an NP-hard problem. If the feature subset is modeled as a binary string where each feature is either selected or not for inclusion, the number of possible non-empty feature subsets is 2^n − 1, where n is the number of all features. Generally, for such problems, the optimal solution cannot be guaranteed to be acquired except by performing an exhaustive search in the solution space. The use of meta-heuristic techniques allows us to obtain reasonably good solutions without being forced to explore the whole solution space. The quality of "heuristic" solutions depends on the characteristics of the applied method. Recent studies show that meta-heuristic techniques are among the superior methodologies. In real-world applications, people are more interested in obtaining good solutions in a reasonable amount of time rather than being obsessed with optimal solutions. Therefore, we favor meta-heuristic methods that are efficient for dealing with real-world applications [8]. The most frequently used meta-heuristic strategies applied to the feature selection problem are EA based methods. Such methods do not suffer from the so-called nesting effect caused by traditional sequential feature selection methods like SBS (Sequential Backward Selection) [31] and SFS (Sequential Forward Selection) [32]. The nesting effect corresponds to the inability to reselect the discarded features (SBS) and to discard the selected features (SFS). Using such sequential schemes in feature selection may result in finding locally optimal solutions for the FSel problem. EA-based methods overcome this drawback by having no restriction on selecting features during their search process. Moreover, when the number of variables is large, due to the power of parallel selection of features, the computational time of EA based methods seems to be considerably less than that of methods based on forward selection and backward elimination.

There are several studies which investigate the effectiveness of EA based methods in the feature selection process. They adopted different kinds of EAs such as the Genetic Algorithm (GA) [22,25,28,29], the PSO algorithm [2,33–35], the Differential Evolution (DE) algorithm [36], the ACO algorithm [37] and so on. These approaches also act as filter, wrapper, hybrid or embedded methods depending on the way they evaluate the fitness of their population.

The improvements of the exploration and local search powers of EAs proposed for feature selection in several available studies show that these powers play crucial roles in obtaining better results for the FSel problem. For instance, GRASP [29] is an iterative process, in which each of the iterations has two phases: construction and local search. In the construction phase, a feasible solution is built. Then its neighborhood is explored by the local search. The final result is the best solution found over all the iterations. Chuang et al. in [34] used a new modification of the PSO algorithm called CatfishBPSO for the feature selection process. In CatfishBPSO, the so-called "catfish" effect introduces a competition function into a group of individuals. Catfish particles are introduced into the search space if the fitness of gbest cannot be improved over a number of consecutive iterations. These catfish particles are introduced at extreme positions of the search space and will initialize a new search from these extreme positions. The catfish particles open up new opportunities for finding better solutions, and guide the entire swarm to promising new regions of the search space [34]. Introducing catfish particles in the CatfishBPSO algorithm helps it to avoid converging toward a local optimum solution by increasing the exploration power and diversity of its population. The high classification accuracy of this heuristic optimization method for feature selection indicates the effectiveness of its mechanism. These methods inspired us to use Multimodal Optimization techniques, which have high exploration power and good local search behavior, for feature selection in order to find more desirable feature subsets.

In [8] the Memetic Algorithm (MA), which is a population based approach with a local search mechanism to improve the solutions, is used for feature selection. Like Multimodal Optimization techniques, MA takes advantage of local search as well as exploration of the search space and is therefore able to find more effective and accurate feature subsets.
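To make the nesting effect just described concrete, the following minimal sketch of SFS grows a subset greedily and never revisits an earlier choice; evaluate_subset is an assumed placeholder for any subset scoring criterion (for instance, the wrapper accuracy used later in this paper).

def sequential_forward_selection(n_features, k, evaluate_subset):
    """Greedy SFS: grow a subset one feature at a time up to size k.

    evaluate_subset -- assumed callable scoring a set of feature indices.
    Features added in earlier iterations are never removed again, which is
    exactly the nesting effect discussed above.
    """
    selected = set()
    while len(selected) < k:
        remaining = [f for f in range(n_features) if f not in selected]
        # Pick the single feature whose addition scores best right now.
        best = max(remaining, key=lambda f: evaluate_subset(selected | {f}))
        selected.add(best)
    return selected

A population based EA, in contrast, keeps many candidate masks alive at once and can both add and remove features in a single variation step.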
3. Multimodal optimization methods for feature selection

When a single-objective optimization problem has more than one optimal solution, it can be considered as a multimodal optimization problem. The objective of locating different optima in a single run makes it more complicated than single global optimization. Niching methods, devised for extending EAs to multimodal optimization, address this issue by maintaining the diversity of the population. In this way, they allow parallel convergence into multiple good solutions in a multimodal domain. The concept of niching is inspired by the way organisms evolve in nature. The process involves the formation of sub-populations within a population. Each sub-population aims at locating one optimal solution of the fitness landscape and the whole population, together, is expected to locate multiple optima in a single run. Several niching methods have been proposed in the literature [38].

The aim of the feature selection approach is to obtain some representations of the dataset that are more adequate for learning a concept from that dataset [10]. Such representations may not be unique. Therefore, the FSel problem can be seen as a multimodal optimization problem. On the other hand, even if the aim of FSel is to find a single solution, MO methods can find more desirable solutions because of their high exploration and exploitation power. These characteristics of MO methods reduce the probability of getting trapped in a local optimum. The optimal feature subset should be searched for in a large space containing all possible subsets of features. Therefore, the FSel problem is a hard problem when it encounters a large number of features. MO methods not only can explore wide regions of the search space and perform appropriate local search to fine-tune the obtained solutions, they can also provide alternative solutions for the problem at hand. In the real world, sometimes, even when an optimal solution (feature subset) is found, the high cost of implementing a model using that subset makes it impossible to use. If alternative optimum solutions could be provided to the user, the available solutions can be checked and the most appropriate one selected considering the user's preferences. The nature of providing multiple solutions by MO methods helps to achieve this end.

Here, we shall briefly review the niching methods which we used for FSel in this paper, after describing the fitness function used by all methods in our experiments.

3.1. Fitness evaluation

Using an evaluation criterion in the search algorithm is crucial to guide the search process toward finding appropriate feature subsets.

Sebban and Nock [39] categorize the feature selection mechanisms into three classes according to what is optimized:

- Algorithms that find the feature subsets of a specified dimensionality in which the classes of data are most discriminable.
- Algorithms that find the smallest feature dimensionality for which the discriminability exceeds a specified value.
- Algorithms that find a compromise between a small subset of features and the class discriminability.

In this study, we considered the third mechanism, with classification accuracy as the main discriminability measure, to achieve the greatest possible precision. Therefore, the fitness function for assessing the "goodness" of a feature subset is computed as (1):

$\mathit{fitness}_i(\mathit{selectedFeatures}) = \alpha \,\bigl(\mathit{classificationAccuracy}(\mathit{selectedFeatures})\bigr) + (1-\alpha)\,\dfrac{N_t - N_s}{N_t} \qquad (1)$

where Nt and Ns denote the total number of features and the number of selected features of a candidate solution, respectively. α is a constant coefficient that balances the effect of classification accuracy and the number of features in calculating the fitness value, and it is set to 0.7 in our experiments.

Note that Eq. (1) guides the algorithm to make a suitable tradeoff between classification accuracy and the number of selected features. It leads the population to find accurate solutions with fewer features in the search space.

3.2. Description of MO algorithms

3.2.1. Genetic algorithm using self-adaptive neighborhood scheme with crowding replacement memory (GA_SN_CM)

In our recent work [40] we proposed a niching scheme using a self-adaptive neighborhood structure, which leads to the formation of suitable species within the population, and a memory structure with a crowding replacement scheme in order to conserve found solutions during the search process of the GA. Using such mechanisms obviates the need for tuning problem dependent niche parameters and causes the formation of species consistent with the optimum for which they are searching. Moreover, using the memory structure helps the search process preserve found niches without interrupting the exploration power of the niching method. The obtained results indicate the high exploration and exploitation power of the proposed method. These characteristics motivated us to adopt this method for solving the feature selection problem.

Note that we used a binary version of GA associated with our proposed niching scheme for the feature selection problem. The pseudo-code of GA_SN_CM is shown in Fig. 1. More details about this method are discussed in [40].

3.2.2. Local best PSO variants

In [41] Li stated that a PSO algorithm using the ring topology can operate as a niching algorithm. It does this by using individual particles' local memories to form a stable network retaining the best positions found so far, while these particles explore the search space more broadly. Given a reasonably large population uniformly distributed in the search space, PSO algorithms using the ring topology are able to form stable niches across different local neighborhoods and eventually can locate multiple global/local optima. Such a fast and efficient niching method can form and maintain the species around optima with no need to set any additional parameters apart from the PSO parameters. Hence, we believe this method, by establishing a fast and efficient search in the feature space, can perform well in finding accurate solutions for the feature selection problem in a reasonable time.

The pseudo code of lbestPSO using a ring topology is illustrated in Fig. 2, where the function neighborhoodBest(.) returns the best-fit personal best in the ith neighborhood, which is stored in p_{n,i}, representing the neighborhood best for the ith particle. Conveniently, the population indices can be used to identify the left and right neighbors of each particle and their corresponding memories (p-bests).
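As a minimal sketch of this ring-topology bookkeeping, neighborhoodBest(.) can be written as follows, assuming the personal bests and their fitness values are stored in two parallel lists indexed by particle number.

def neighborhood_best(i, pbest, pbest_fitness):
    # Ring topology: the neighborhood of particle i consists of its left
    # neighbor, itself and its right neighbor (indices wrap around).
    members = [(i - 1) % len(pbest), i, (i + 1) % len(pbest)]
    best = max(members, key=lambda j: pbest_fitness[j])
    return pbest[best]   # this is p_{n,i}, the neighborhood best used in Eq. (2)

Because each neighborhood overlaps only with its immediate ring neighbors, good positions spread slowly through the swarm, which is what lets separate niches survive.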

GA_SN_CM Procedure
Begin
N = number of neighbors for each individual;
Ps = population size;
t =0;
Initialize the main population P(t) and memory M(t) randomly;
While (termination condition is not met) do
t = t + 1;
Check if the number of individuals in each neighborhood is equal to N;
Eliminate redundant chromosomes in each neighborhood in P(t);
Generate new individuals instead of eliminated chromosomes;
Fitnesses = Evaluate (P(t)) according to Eq(1) ;
Update memory M(t) using raw Fitness values;
For i = 1 to Ps
p1 = select the first parent from the P(t) using Fitness values;
if rand < Pn
p2 = select second parent from p1’s neighbors using Fitness values;
else
p2 = select second parent except for itself from P(t) using Fitness values;
end if
{o1 , o2} = crossover(p1 , p2);
{o1’sNeighbors , o2’sNeighbors} = binaryCrossover(p1’sNeighbors, p2’sNeighbors);
mutate(o1 , o2);
binaryMutate(o1’sNeighbors , o2’sNeighbors);
{t1 , t2 } = survive(p1, p2, o1, o2);
Add t1 and t2 to P(t+1) and their corresponding neighbors to M(t+1);
end for
End while
Return M(t) as the final solution;
End
Fig. 1. The pseudo code for main algorithm of GA_SN_CM.
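The pseudo code above relies on generic binary variation operators; as a minimal sketch, assuming the one point binary crossover and the bit-flip mutation listed with the parameter settings in Section 4.2 (Pm = 0.01), they can be implemented as follows.

import random

def one_point_crossover(parent1, parent2):
    # Binary one point crossover on two equal-length 0/1 chromosomes.
    point = random.randint(1, len(parent1) - 1)   # cut point inside the string
    child1 = parent1[:point] + parent2[point:]
    child2 = parent2[:point] + parent1[point:]
    return child1, child2

def bit_flip_mutation(chromosome, pm=0.01):
    # Flip each bit independently with probability pm.
    return [1 - bit if random.random() < pm else bit for bit in chromosome]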

The velocity and position of each particle are updated in this niching method according to Eqs. (2) and (3), respectively:

$v_i \leftarrow w\, v_i + c_1\, R_1[0,\varphi_1] \otimes (p_i - x_i) + c_2\, R_2[0,\varphi_2] \otimes (p_{n,i} - x_i) \qquad (2)$

$x_i \leftarrow x_i + v_i \qquad (3)$

where $R_1[0,\varphi_1]$ and $R_2[0,\varphi_2]$ are two separate functions, each returning a vector of random values uniformly generated in the ranges $[0,\varphi_1]$ and $[0,\varphi_2]$, respectively. $\varphi_1$ and $\varphi_2$ are commonly set to $\varphi/2$ (where $\varphi$ is a positive constant). The symbol $\otimes$ denotes point-wise vector multiplication.

In the case of the feature selection problem, feature subsets are usually encoded as n-bit binary strings, where n denotes the cardinality of the original feature set. Consequently, the value one or zero in a particular bit determines whether its corresponding feature is selected from the original feature set or not. By using such a structure for feature subsets, a binary version of the search algorithms should be used to deal with feature subsets in the search space. Hence, in this paper we used Eq. (4) instead of (3) to update the position of each particle after computing its velocity from Eq. (2):

$x_i = \begin{cases} 1 & \text{if } \mathit{rand} < S(v_i) \\ 0 & \text{otherwise} \end{cases} \qquad (4)$

where

$S(v_i) = \mathrm{sigmoid}(v_i) = \dfrac{1}{1 + e^{-v_i}} \qquad (5)$

Fig. 2. The pseudo code of lbest PSO using a ring topology [41].

The neighborhood structure used by lbestPSO contains two or three particles of the swarm, assuming either existing overlap (shared members) between neighborhoods or not. Based on this, four different niching methods were introduced in [41], such as r2PSO and r3PSO, which are the lbestPSOs with 2 and 3 members of overlapping neighborhoods, respectively. The other versions are r2PSO-lhc and r3PSO-lhc, which are like the other two versions but without the assumption of overlapping neighborhoods. In fact, the overlapping neighborhood assumption increases the tendency of the algorithm toward locating the global solutions rather than the local ones. By using these methods in FSel, we can also analyze the behavior of different niching methods for this problem.
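A minimal sketch of the binary particle update of Eqs. (2), (4) and (5) for a single particle is given below. Parameter defaults follow the settings listed later in Section 4.2 (w = 1, c1 = c2 = 2, φ = 4.1, velocity clamped to [−6, 6]); p_i and p_ni stand for the particle's personal best and its ring-neighborhood best (for instance as returned by the neighborhood_best sketch above).

import numpy as np

rng = np.random.default_rng()

def binary_pso_update(x_i, v_i, p_i, p_ni,
                      w=1.0, c1=2.0, c2=2.0, phi=4.1, vmax=6.0):
    """One binary lbest PSO update of a single particle.

    x_i, v_i -- current 0/1 position vector and velocity vector of the particle
    p_i      -- the particle's personal best position
    p_ni     -- the neighborhood best position p_{n,i} (ring topology)
    """
    phi1 = phi2 = phi / 2.0
    r1 = rng.uniform(0.0, phi1, size=x_i.size)   # R1[0, phi1]
    r2 = rng.uniform(0.0, phi2, size=x_i.size)   # R2[0, phi2]
    # Eq. (2): velocity pulled toward the personal and neighborhood bests.
    v_new = w * v_i + c1 * r1 * (p_i - x_i) + c2 * r2 * (p_ni - x_i)
    v_new = np.clip(v_new, -vmax, vmax)          # vmin,d = -6, vmax,d = 6
    # Eq. (5): the sigmoid of the velocity gives the probability of a 1 bit.
    s = 1.0 / (1.0 + np.exp(-v_new))
    # Eq. (4): sample the new binary position bit by bit.
    x_new = (rng.uniform(size=x_i.size) < s).astype(int)
    return x_new, v_new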
3.2.3. Dynamic fitness sharing (DFS)

Fitness Sharing (FS) is one of the best-known and most widely used GA based niching methods for MO. During the FS procedure, sub-populations are formed and maintained within the main population by reducing the probability of growing the number of individuals populating a peak. This scheme causes the disappearance of individuals on one peak, and therefore they will populate the other peaks. This is achieved by reducing the fitness of similar individuals as their number grows, with the implicit assumption that similar individuals belong to the same species and populate the same niche. For this purpose, the definition of a similarity metric on the search space as well as a threshold (niche radius) is required, representing the maximal distance between individuals that are considered similar (in the same niche). It has been proved that, when the number of individuals within the population is large enough and the niche radius is properly set, FS provides as many species in the population as the number of peaks in the fitness landscape, thus populating all the niches [42].

In the FS method the fitness of individuals populating a peak (niche) is modified to avoid the selection pressure phenomenon. The modification of the raw fitness of an agent is accomplished according to the presence of similar individuals around it. The shared fitness of an individual τ at generation t is computed as (6):

$\mathit{fitness}_{sh,t}(\tau) = \dfrac{\mathit{rawFit}_t(\tau)}{m_t(\tau)} \qquad (6)$

where $\mathit{rawFit}_t(\tau)$ is the raw fitness of the individual τ and $m_t(\tau)$ is the niche count, which depends on the number and the relative positions of the individuals of the population among which the raw fitness is shared. The niche count is given by (7):

$m_t(\tau) = \sum_{w \in P_t} Sh(\tau, w) \qquad (7)$

where $P_t$ denotes the population at time t, and $Sh(\tau, w)$ is the sharing function which measures the similarity between two individuals. The most commonly used form of Sh is the following:

$Sh(\tau, w) = \begin{cases} 1 - \left(\dfrac{d(\tau, w)}{\sigma_{sh}}\right)^{\alpha_{sh}} & \text{if } d(\tau, w) < \sigma_{sh} \\ 0 & \text{otherwise} \end{cases} \qquad (8)$

In [42] DFS is proposed based on: (1) explicitly finding the species at each generation; (2) using FS to ensure that each niche will be populated proportionally to the fitness of its peak; and (3) preserving species with fewer individuals from extinction by elitism. For this purpose, DFS embodies an explicit, dynamic identification and localization on the fitness landscape of the species discovered at each generation. This mechanism is outlined in Fig. 3. The skeleton of a GA with DFS is shown in Fig. 4.

In this paper, we used the binary version of GA with binary operators and Hamming distance as the distance metric.

Fig. 3. Dynamic species identification algorithm [42].

Fig. 4. GA with DFS [42].

4. Experiments

The effectiveness of considering MO methods was tested using 12 datasets of varying size retrieved from the UCI repository, which are frequently used in the literature. The obtained results are compared with two well-known population based heuristic methods used for feature selection from the literature.

4.1. Databases

As mentioned above, 12 datasets having different characteristics were considered in our simulation experiments in order to validate the performance of MO methods [43]. The dimensionality, size and number of classes of each dataset are given in Table 1.

Table 1
Description of used data sets.

No.  Data set                         Features  Classes  Samples
1    Wisconsin Breast Cancer (WBCO)   9         2        699
2    Glass                            10        6        214
3    Wine                             13        3        178
4    Zoo                              16        7        101
5    Vehicle                          19        4        846
6    Protein                          20        6        116
7    Segment                          20        7        2309
8    Wdbc                             30        2        569
9    Soybean                          35        4        47
10   Lung cancer                      56        3        32
11   Sonar                            60        2        208
12   Ovarian Cancer                   4000      2        216

4.2. Evaluation mechanism

In order to compare the existing methods in our experiments, we considered the classification accuracy (CA) and the number of selected features (Feature No) of each method as evaluation criteria.

CatfishBPSO [24] and the Memetic Algorithm [8] are selected as powerful, well-known heuristic methods for feature selection from the literature to be compared with the MO methods. Moreover, binary GA (BGA) and binary PSO (BPSO) were also used to compare the power of MO methods with their basic single modal forms in solving the feature selection problem.

As mentioned before, we considered the wrapper based scheme for feature selection in our experiments. In each run, 60% of the data of each dataset are selected at random as training data and the remaining 40% are chosen as test data. We believe that using 60% of the data as the training set results in more general models while retaining the overall characteristics of a particular model, as was also done in [44].
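As a minimal sketch of this evaluation mechanism, assuming scikit-learn's KNeighborsClassifier as the 1-NN model, the wrapper accuracy of a candidate binary mask and the corresponding fitness of Eq. (1) (α = 0.7) can be computed as follows; the figures reported in Table 2 are means and standard deviations of such accuracies over 20 independent runs.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def wrapper_accuracy(X, y, mask, seed=0):
    # 60% of the samples are drawn at random as training data and the
    # remaining 40% are kept as unseen test data (Section 4.2).
    cols = np.asarray(mask, dtype=bool)
    X_train, X_test, y_train, y_test = train_test_split(
        X[:, cols], y, train_size=0.6, random_state=seed)
    model = KNeighborsClassifier(n_neighbors=1)   # the 1-NN wrapper classifier
    model.fit(X_train, y_train)
    return model.score(X_test, y_test)            # correct classification rate

def fitness(X, y, mask, alpha=0.7, seed=0):
    # Eq. (1): trade-off between accuracy and the size of the feature subset.
    n_total = len(mask)
    n_selected = int(np.sum(mask))
    if n_selected == 0:
        return 0.0                                # empty subsets are not valid
    acc = wrapper_accuracy(X, y, mask, seed=seed)
    return alpha * acc + (1.0 - alpha) * (n_total - n_selected) / n_total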
The model (a 1-NN classifier) is built using the training set and then the performance is tested using the test set (unseen data). The experimental results are reported in the form of the mean and standard deviation of the correct classification rate obtained on the test data. Some statistical analyses, using the results obtained in the experiments, are taken into account in order to determine the relative rank of the existing methods. Moreover, the power of finding multiple solutions by MO methods is compared with that of the other methods in the case of 6 datasets. Finally, 6 additional classifiers, including SVM, PART, JRip, BFTree, Bayes Net and J48, using the Weka software, were used in order to evaluate the classifier dependence of the solutions found by the existing algorithms.

The parameters used in the algorithms were set as follows:

- GA_SN_CM: population size = 70, generations = 200, selection type = roulette wheel, crossover operator = one point binary, survival type = generational, Pc = 0.9, Pm = 0.01, number of neighbors for each individual = population size/4.
- Local best PSO variants: population size = 70, iterations = 200, vmin,d = −6, vmax,d = 6, c1 = c2 = 2, w = 1, φ = 4.1, φ1 = φ2 = φ/2, neighborhood variants = r2PSO, r3PSO, r2PSO-lhc and r3PSO-lhc as used in [16].
- GA and DFS: population size = 70, generations = 200, selection type = roulette wheel, crossover operator = one point binary, survival type = generational, Pc = 0.9, Pm = 0.01, σsh = depends on the dataset.
- BPSO and CatfishBPSO: population size = 70, iterations = 200, vmin,d = −6, vmax,d = 6, c1 = c2 = 2, w = 1, φ = 4.1, φ1 = φ2 = φ/2, max age = 3, amount of catfish particles = 10% of the population size, other settings the same as in [24].
- Memetic algorithm: population size = 70, generations = 200, nSel = 0.3 (i.e. the worst 0.3 fraction of the population is replaced with the generated children in each generation), Pc = 0.9, Pm = 0.01, parent selection type = random (after sorting the population), local search operation described as follows: if X is a solution, its neighborhood N(X) is defined as
  $N(X) = \{\,X' \mid X' = X \cup \{v_{j'}\} \setminus \{v_j\},\ \forall v_j \in X,\ v_{j'} \notin X \,\} \qquad (9)$
  and the local search procedure can be described as in Fig. 5 (a sketch of this neighborhood move is given after this list).

Fig. 5. The local search procedure used in the Memetic algorithm described in [8].
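As a minimal sketch of the neighborhood of Eq. (9) referenced in the last item above, each neighbor of a subset X is obtained by swapping one selected feature for one currently unselected feature; this illustrates only the move set, not the full local search procedure of Fig. 5.

def neighborhood(selected, all_features):
    """All neighbors of a feature subset according to Eq. (9).

    A neighbor is obtained by removing one selected feature v_j and adding
    one feature v_j' that is not yet selected.
    """
    selected = set(selected)
    unselected = set(all_features) - selected
    neighbors = []
    for v_out in selected:
        for v_in in unselected:
            neighbors.append((selected - {v_out}) | {v_in})
    return neighbors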

Table 2
Obtained results of feature number and classification accuracy by each algorithm for 11 datasets.

Dataset Measure Without FS DFS r2PSO r3PSO r2PSO-lhc GA_SN_CM

WBCO Feature No 9 (0) 2.05 (0.6048) 1.2 (0.41) 2.75 (0.4443) 1.45 (0.51) 1.67(0.5501)
CA 0.96 (0.004) 0.9337(0.019) 0.9340 (0.012) 0.8741 (0.046) 0.939 (0.01) 0.9387(0.023)
Glass Feature No 9 3.4 (0.695) 3.4 (0.502) 3.35 (0.5871) 3.5 (0.6882) 2.88(0.527)
CA 0.671 (0.05) 0.7157 (0.044) 0.7186 (0.043) 0.7104 (0.05) 0.786 (0.03) 0.7535(0.0861)
Wine Feature No 13 (0) 2.65 (0.67) 2.7 (0.6569) 2.55 (0.6048) 2.6 (0.5982) 2.47(0.3305)
CA 0.732 (0.03) 0.9514 (0.021) 0.9563 (0.019) 0.95 (0.017) 0.9563(0.02) 0.9537(0.018)
Zoo Feature No 16 (0) 4.6 (1.04) 3.9 (0.8335) 4.5 (0.6882) 4.75 (1.019) 3.8(0.4612)
CA 0.932 (0.06) 0.9687 (0.025) 0.9775 (0.031) 0.965 (0.0347) 0.9775(0.01) 0.961 (0.0351)
Vehicle Feature No 19 5.6 (1.142) 4.8 (0.6958) 4.8 (0.6958) 5.15 (0.875) 5.01 (0.41)
CA 0.69 (0.011) 0.7336 (0.021) 0.7270 (0.013) 0.7302 (0.02) 0.7303 (0.01) 0.7337(0.0101)
Protein Feature No 20 (0) 7.75 (1.2926) 7.25 (1.5517) 7.95 (1.468) 7.95 (1.6375) 6.98(0.89)
CA 0.691(0.068) 0.7880 (0.045) 0.8173 (0.042) 0.8348 (0.047) 0.8195 (0.04) 0.7826(0.0377)
Segment Feature No 20 (0) 3.3 (0.4701) 3 (0) 3 (0) 3 (0) 3 (0)
CA 0.96 (0.007) 0.9674 (0.006) 0.9665 (0.004) 0.9693 (0.005) 0.966 (0.004) 0.9751 (0.006)
Wdbc Feature No 30 (0) 4.95 (0.9986) 2.55 (0.6863) 2.9 (0.7181) 2.6 (0.6805) 2.58(0.43)
CA 0.916 (0.01) 0.9342 (0.008) 0.9418 (0.01) 0.9432 (0.008) 0.9381 (0.01) 0.9342(0.02)
Soybean Feature No 35(0) 4.75 (1.5174) 2 (0) 1.9 (0.307) 2 (0) 2.55(0.5104)
CA 0.984 (0.03) 1 (0) 1 (0) 1 (0) 1 (0) 1 (0)
Lung cancer Feature No 56 (0) 19.4 (2.5214) 9.35 (2.996) 7.9 (2.1496) 13.1 (2.3373) 18.1(3.2472)
CA 0.461 (0.13) 0.9269 (0.046) 0.9846 (0.031) 0.9846 (0.031) 0.95 (0.0516) 0.9538(0.0397)
Sonar Feature No 60 (0) 18.75 (2.6532) 12.2 (2.1908) 11.85 (1.7252) 14.4 (2.0365) 14.1(3.3149)
CA 0.816 (0.02) 0.9265 (0.023) 0.9668 (0.018) 0.9674 (0.017) 0.951 (0.02) 0.9324(0.0183)
Ovarian Cancer Feature No 4000(0) 1900(50.05) 1650.4(70.38) 1550(85.42) 1741.3(38.23) 1806.9(34.42)
CA 0.8895(0.03) 0.9603(0.05) 0.9849(0.014) 0.986(0.00917) 0.986(0.0091) 0.9813(0.0206)

Dataset Measure r3PSO-lhc CatfishBPSO MA GA BPSO

WBCO Feature No 1.55 (0.510) 1.6 (0.502) 1.5 (0.41) 3 (0) 2.8 (0.4104)
CA 0.941 (0.015) 0.9306 (0.016) 0.9282 (0.018) 0.9247 (0.013) 0.9311 (0.017)
Glass Feature No 3.4 (0.7539) 3.8 (0.7451) 3.65 (0.5871) 4.6 (0.54) 4.6 (0.54)
CA 0.7462 (0.044) 0.6973 (0.041) 0.6773 (0.037) 0.6359 (0.03) 0.6767 (0.04)
Wine Feature No 2.75 (0.7163) 4.1 (0.6386) 2.8 (0.6882) 2 (0) 2.4 (0.5477)
CA 0.9605 (0.022) 0.9114 (0.025) 0.9394 (0.025) 0.9239 (0.0161) 0.9380 (0.0126)
Zoo Feature No 4.7 (0.8013) 5.3 (0.9233) 5.9 (0.4442) 4.8 (0) 6.6 (0.8944)
CA 0.9787 (0.026) 0.9725 (0.029) 0.9662 (0.023) 0.9650 (0.028) 0.97 (0.02)
Vehicle Feature No 5.05 (0.9986) 5.4 (0.8335) 5.5 (0.5130) 5.6 (0.5477) 5 (0.7071)
CA 0.7363 (0.021) 0.7225 (0.020) 0.7312 (0.0184) 0.7219 (0.005) 0.697 (0.018)
Protein Feature No 8.3 (2.028) 8.7 (1.8353) 8.2 (1.7137) 7.4 (1.5166) 7.4 (1.5166)
CA 0.8304 (0.0396) 0.7923 (0.036) 0.8141 (0.0432) 0.8087 (0.0677) 0.8174 (0.0451)
Segment Feature No 3 (0) 3.85 (0.2236) 3 (0) 3.4 (0.5312) 3 (0)
CA 0.9706 (0.049) 0.9301 (0.005) 0.9662 (0.0061) 0.9507 (0.0045) 0.969 (0.0093)
Wdbc Feature No 2.75 (0.6386) 3.15 (0.6386) 2.5 (0.6882) 4.8 (1.303) 2.4 (0.547)
CA 0.9394 (0.013) 0.9399 (0.018) 0.9373 (0.0151) 0.9298 (0.02) 0.9281 (0.01)
Soybean Feature No 1.9 (0.3077) 2.25 (0.36635) 2 (0) 7.05 (1.099) 4.95 (0.366)
CA 1 (0) 1 (0) 1 (0) 1 (0) 1 (0)
Lung cancer Feature No 12 (2.6754) 13.7 (3.4043) 7.9 (2.6832) 18.8 (2.38) 15.6 (2.79)
CA 0.980 (0.0423) 0.95 (0.0451) 0.9961 (0.017) 0.9692 (0.04) 0.9385 (0.03)
Sonar Feature No 12.6 (2.1126) 16.4 (2.6832) 12.15 (1.4609) 18.8 (2.7749) 12.8 (2.3875)
CA 0.946 (0.0229) 0.9502 (0.022) 0.9583 (0.0133) 0.9181 (0.0261) 0.9398 (0.019)
Ovarian Cancer Feature No 1661.7(40.81) 815.4(177.47) 1726.07(38.05) 1923(31.04) 1693(41.07)
CA 0.9872(0.011) 0.9814(0.0125) 0.9802(0.0314) 0.9418(0.021) 0.9757(0.035)
4.3. Computational complexity

In terms of the number of fitness evaluations as the elementary operation for computational complexity, since we set the termination condition of each algorithm as reaching a fixed number of generations (equal to 200), the computational complexity of all the used methods is O(NG), where N and G stand for the number of individuals and the number of generations, respectively. However, if "computing the distance" is considered as the elementary operation, then in the case of the MO methods, DFS has complexity O(N^2) due to its dynamic species identification technique, GA_SN_CM has complexity O(N^2) due to its memory mechanism, and the complexity of the local best PSO variants is O(N). In the case of the single modal methods, the complexity of CatfishBPSO, basic binary GA and basic binary PSO is O(N), and the Memetic Algorithm has complexity O(Nn), where n ≪ N denotes the number of assumed neighbors for each individual.

4.4. Results

Table 2 shows the results obtained by the MO and other heuristic methods over 20 independent runs on the existing datasets. The CA in Table 2 refers to the average classification accuracy achieved by the 1-NN classifier on the test set of each classification dataset by each algorithm. The best values are highlighted in bold-face.

As can be seen from Table 2, the local best PSO variant MO methods could achieve better results for several data sets. The Memetic algorithm could obtain results close to the local best variants. This may be caused by the similarity between the nature of MA and MO methods, in which there is a suitable local search operation in order to exploit better solutions. On the other hand, although DFS is an MO method, apart from its better results compared with GA, it could not achieve good results compared to the other existing MO methods. We believe this may also be because of the local search operation which it uses, i.e. the fitness sharing mechanism between individuals in a specific species and the similarity measure which is used to decide whether individuals belong to the same species, which seems not to be appropriate in the feature space. Therefore, the selection of a suitable local search operation mechanism that considers the attributes of the problem at hand plays an important role in finding desirable solutions by MO methods.

Generally, the results show the power of MO methods in solving the feature selection problem compared with the single modal versions of the algorithms, as they could achieve better results than existing well-known heuristic feature selection methods in many cases. Therefore, they can be used as suitable techniques for solving the feature selection problem. In the case of the ovarian cancer dataset, which is a high dimensional dataset, we can see that the results obtained by the EAs associated with niching techniques also appeared better than those of their single modal versions, indicating the power of MO methods in solving high dimensional FSel problems. In this case CatfishBPSO could find the subset with the fewest features. This can be due to its suitable exploration power, which is the result of incorporating catfish particles in this algorithm.

In the next section, in order to analyze the obtained results and evaluate our viewpoints, we used some statistical analyses to compare the number of features and classification accuracy obtained by the existing methods.

4.5. Statistical analysis

In order to show that the reported results have statistical significance, we have carried out some statistics on the results in Table 2. We used non-parametric 1×N Friedman and Quade tests with 9 and 90 degrees of freedom according to the F-distribution. The ranks obtained using these statistics are reported in Table 3.

Table 3
Statistical analyses of the results in Table 2.

Statistical test  Method        CA          Feature no.
Friedman 1×N      DFS           6.5833      7.3182
                  r2PSO         3.9167      3.2273
                  r3PSO         4.5         3.6364
                  r2PSO-lhc     4.125       4.7727
                  r3PSO-lhc     2.625       4.9545
                  GA_SN_CM      8.5         4.1818
                  CatfishBPSO   6.9167      7.8182
                  MA            5.7083      5.3182
                  BPSO          7.375       5.8182
                  GA            8.5         7.9545
                  Without FS    9.9167      11
                  P-value       6e−06       0
Quade 1×N         DFS           7.0449      8.197
                  r2PSO         3.4679      3.2273
                  r3PSO         4.2885      3.1364
                  r2PSO-lhc     3.8205      4.7955
                  r3PSO-lhc     2.5705      4.6136
                  GA_SN_CM      6.0064      5.0303
                  CatfishBPSO   7.1346      7.697
                  MA            5.5321      4.3788
                  BPSO          7.6603      5.4394
                  GA            8.4359      8.4848
                  Without FS    10.0385     11
                  P-value       3.4878e−08  3.038e−09

As we can see from Table 3, the statistical analyses also confirm the better performance of the local best variants, even in finding a single good global solution, compared to the other existing methods. In the case of classification accuracy, the r3PSO-lhc version of local best PSO could achieve the best performance. Based on this result, it can be said that considering a larger neighborhood than r2PSO, with less overlap with other neighborhoods, may lead the MO method to achieve more accuracy in the feature selection problem. On the other hand, in the case of the number of features, r2PSO and r3PSO could achieve better ranks. It means that the overlap between neighborhoods may lead the algorithm to reach solutions with fewer features. The Memetic algorithm is the next algorithm with a high rank among the existing algorithms. We believe its local search behavior is suitable for the feature selection problem, and if the algorithm could be associated with a solution conservation mechanism over its iterations, it could turn into a powerful MO technique for solving different problems, especially feature selection. After the local best variants and MA, the proposed GA_SN_CM method achieved the next rank. We can see that its performance is the best among the GA based methods and its obtained rank is close to MA. We believe its advantage over the GA based methods is because of the tradeoff which it establishes between exploration and exploitation powers, through its suitable local search behavior of improving found solutions along with the conservation of found optima which it establishes in the GA.
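For reference, average ranks and p-values of the kind reported in Table 3 can be reproduced with SciPy's implementation of the Friedman test, as in the minimal sketch below (the Quade test is computed analogously and is not shown); the scores argument is a hypothetical matrix with one row per dataset and one column per method.

import numpy as np
from scipy.stats import friedmanchisquare, rankdata

def friedman_ranks(scores):
    """scores: (n_datasets, n_methods) array of classification accuracies.

    Returns the average rank of each method (rank 1 = best accuracy on a
    dataset) and the p-value of the Friedman test.
    """
    scores = np.asarray(scores, dtype=float)
    # Rank the methods within each dataset; negate the scores so that a
    # higher accuracy receives the better (smaller) rank.
    ranks = np.apply_along_axis(rankdata, 1, -scores)
    avg_ranks = ranks.mean(axis=0)
    # friedmanchisquare expects one sequence of measurements per method.
    _, p_value = friedmanchisquare(*scores.T)
    return avg_ranks, p_value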
We can also see from Table 3 that DFS could achieve better ranks compared with GA.

It is worth noting that the obtained results also show that performing classification of a dataset without any feature selection has the worst accuracy among all the existing techniques. This means that the feature selection process is a necessary operation to improve the classification quality by eliminating noisy and irrelevant features.

4.6. Algorithms' behavior

In order to observe the manner in which different algorithms select salient features over their iterations, the fitness values over the number of iterations are shown in Fig. 6 for 11 datasets (averaged over 20 runs). Such information shows the convergence behavior together with the power of the considered methods in both selecting fewer features and achieving better classification accuracies.

We can see in Fig. 6 that, for most of the datasets, the MO methods, especially GA_SN_CM, converge more slowly than the single modal methods. This may be due to their tendency to explore the whole search space, which results in finding fitter solutions. Methods like MA and CatfishBPSO achieved good results in early iterations but cannot improve their solutions later. We wish to note that the exploration power of MO methods can lead the feature selection process to find more desirable solutions. We believe that if MO methods used a local search that is more consistent with the feature selection process, like the one used by MA, they could find even fitter solutions in their search process.

4.7. Finding multiple solutions

The main objective of MO methods is to provide several optimal solutions for the user to choose from. In this section, we investigate the variety of found solutions in the final population of a sample run for each method. Table 4 includes the number of distinct solutions (feature subsets) with high fitness value encountered in the final population of each algorithm. Six data sets with different characteristics, including Sonar, Glass, Wdbc, Protein, Segment and Vehicle, were considered in this section. Moreover, the classification accuracy of the obtained subsets is also reported using the 1-NN classifier (averaged over 20 rounds of data permuting and classifying). The aim of this section is just to check the diversity of the final population of the considered methods and their ability to provide multiple solutions.

In Table 4, we can see that, compared with GA, PSO and MA, the MO methods, especially the local best PSO variants, could locate more distinct solutions in their final population. For instance, in the case of the Segment dataset (Table 4), the population of the single modal methods totally converges to one or two solutions, while the MO methods could find different feature subsets. For the other datasets the solutions found by MO methods are also more diverse than those of the single modal methods.

4.8. Obtained accuracy using other classifiers

As mentioned earlier, in order to investigate how much the accuracy of the obtained solutions (features) depends on the classifier type, we used some other classifiers on 6 data sets of different characteristics in this section (using the Weka software). To do this, 6 classifiers, including SVM as a powerful classifier, PART and JRip as rule based classifiers, Bayes Net as a Bayesian classifier, and J48 and BFTree as tree based classifiers, were used to compute the accuracy of the solutions found by each method. The best solution obtained by each algorithm during its search over 20 runs was considered to compute the classification accuracy of the utilized classifiers. For each database, 10-fold cross-validation was used to compute the classification accuracy for each classifier. Table 5 shows the classification accuracy using the best feature subset selected by each algorithm for different classifiers and datasets. In the table, F No stands for the number of selected features and CA corresponds to the classification accuracy of the best solution found by each method.

Table 5 in fact provides information indicating how suitable for classification the best features found by the wrapper heuristic methods are when different classifiers are utilized. Looking at Table 5, we observe that in many cases the best features obtained by the wrapper based approaches could give a reasonable accuracy compared with using all the features of that dataset for classification. In some cases the best selected features can even improve the classification accuracy (e.g. see the Sonar dataset using the BFTree classifier). Among the existing methods, the results show that MO methods could find more suitable features compared to the other heuristic algorithms in terms of the obtained classification accuracy for many datasets.

5. Conclusion

In this study, MO techniques are investigated for solving the feature selection problem, which is an NP-hard problem requiring high exploration and exploitation powers to be solved. MO methods, due to their slow convergence in order to form subpopulations in the main population, can explore more regions of the search space. Moreover, they use some local search operations to exploit better solutions in the formed subpopulations. The additional advantage of using MO methods is that they have the ability to find multiple optimal feature subsets in terms of classification accuracy. We believe that such characteristics of MO methods make them desirable for solving the feature selection problem. To verify this, we used 12 UCI datasets and five well-known MO methods, namely DFS, r2PSO, r3PSO, r2PSO-lhc and r3PSO-lhc, together with our proposed method GA_SN_CM, in our experiments. These methods were compared with the single modal forms of EAs and also with two well-known heuristic methods introduced in the literature for feature selection, namely CatfishBPSO and the Memetic Algorithm. The obtained results and their statistical analyses show that MO methods can significantly improve the feature selection process compared with the single modal forms of EAs, and they can also compete with powerful existing heuristic feature selection methods. We believe that if a suitable local search consistent with feature selection is used in MO methods, their results can be significantly improved. In order to check the suitability of the selected features (by different methods) in terms of the yielded classification accuracy, some other classifiers apart from the one used in the search process were utilized. The obtained results show acceptable accuracies for the classifiers using the found features.

Acknowledgements

Special thanks to Mr. Javad Rahimipour Anaraki for his valuable co-operation in our experiments.

Fig. 6. The fitness value obtained by each algorithm over its iteration for each dataset (averaged over 20 runs): (a) WBCO, (b) Glass, (c) Wine, (d) Zoo, (e) Vehicle, (f) Protein,
(g) Segment, (h) Wdbc, (i) Soybean, (j) Lung Cancer, (k) Sonar. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of
this article.)
Table 4
Obtained feature subsets in the final population with their corresponding fitness and accuracy for the existing methods. NoFSub denotes the number of distinct solutions with high fitness value in the final population of each algorithm; avgFit and avgCA represent the average fitness and average classification accuracy of the distinct solutions, respectively.

Data set: Sonar / Glass / Wdbc

Method       NoFSub  avgFit             avgCA              NoFSub  avgFit               avgCA              NoFSub  avgFit               avgCA
DFS-FS       3       0.8894 (±0.0073)   0.9634 (±0.0135)   5       0.6765 (±0.0551)     0.5621 (±0.0803)   12      0.8456 (±0.0384)     0.8472 (±0.0347)
r2PSO        9       0.9182 (±0.0085)   0.9752 (±0.0125)   3       0.7071 (±0.0324)     0.6078 (±0.0486)   5       0.9391 (±0.0062)     0.9286 (±0.0117)
r3PSO        15      0.9252 (±0.0085)   0.9845 (±0.0178)   2       0.7253 (±0.0282)     0.6255 (±0.0453)   2       0.9493 (±0)          0.9454 (±0.0051)
r2PSO-lhc    4       0.9038 (±0.0117)   0.9697 (±0.0205)   7       0.7448 (±0.0037)     0.6588 (±0.0043)   7       0.9319 (±0.0065)     0.9241 (±0.0101)
r3PSO-lhc    8       0.9111 (±0.0095)   0.9658 (±0.0322)   6       0.7135 (±3.26e−04)   0.6120 (±0.0064)   10      0.9462 (±1.17e−16)   0.9438 (±0.0053)
GA_SN_CM     8       0.8617 (±0.0327)   0.9196 (±0.0368)   6       0.6909 (±0.0085)     0.7093 (±0.0360)   7       0.9069 (±0.0189)     0.9302 (±0.0098)
CatfishBPSO  4       0.9179 (±0.0102)   0.9505 (±0.0182)   3       0.7071 (±0.0324)     0.6078 (±0.0486)   6       0.9426 (±0.0063)     0.9369 (±0.0085)
MA           4       0.9243 (±0.0150)   0.9668 (±0.0152)   1       0.7217 (±0)          0.6167 (±0)        4       0.9356 (±0.0050)     0.9312 (±0.0036)
GA           2       0.8836 (±0.0011)   0.9480 (±0.0117)   3       0.7198 (±0.0211)     0.6235 (±0.0261)   1       0.9224 (±0)          0.9248 (±0)
BPSO         4       0.9045 (±0.0134)   0.9457 (±0.0134)   3       0.7128 (±0.0282)     0.6135 (±0.0426)   7       0.9310 (±0.0026)     0.9259 (±0.0034)

Data set: Segment / Vehicle / Protein

Method       NoFSub  avgFit             avgCA              NoFSub  avgFit               avgCA              NoFSub  avgFit               avgCA
DFS-FS       4       0.9153 (±0.0262)   0.9092 (±0.0269)   6       0.6978 (±0.0289)     0.6063 (±0.0344)   5       0.7750 (±0.0660)     0.7363 (±0.0979)
r2PSO        4       0.9335 (±0.0090)   0.9264 (±0.0128)   8       0.7342 (±0.0128)     0.6507 (±0.0175)   6       0.7859 (±0.0062)     0.7405 (±0.0121)
r3PSO        3       0.9314 (±0.0066)   0.9234 (±0.0094)   6       0.7317 (±0.0132)     0.6487 (±0.0218)   12      0.7506 (±0.0191)     0.7043 (±0.0343)
r2PSO-lhc    11      0.9275 (±0.0077)   0.9223 (±0.0119)   8       0.7373 (±0.0079)     0.6586 (±0.0139)   8       0.7793 (±0.0159)     0.7481 (±0.0187)
r3PSO-lhc    11      0.9289 (±0.0046)   0.9249 (±0.0072)   6       0.7338 (±0.0109)     0.6506 (±0.0149)   8       0.7753 (±0.0150)     0.7433 (±0.0196)
GA_SN_CM     11      0.9080 (±0.0048)   0.9630 (±0.0070)   8       0.7108 (±0.0175)     0.6261 (±0.0219)   5       0.6885 (±0.0511)     0.7279 (±0.0243)
CatfishBPSO  8       0.9313 (±0.0016)   0.9285 (±0.0060)   5       0.7428 (±4.47e−05)   0.6682 (±0.0051)   6       0.7983 (±0.0122)     0.7535 (±0.0113)
MA           2       0.9348 (±0.0070)   0.9283 (±0.0100)   2       0.7417 (±0.0102)     0.6562 (±0.0098)   3       0.7780 (±0.0232)     0.7209 (±0.0302)
GA           1       0.9390 (±0)        0.9342 (±0)        3       0.7172 (±0.0259)     0.6270 (±0.0373)   2       0.7732 (±0.0862)     0.7224 (±0.1282)
BPSO         1       0.9390 (±0)        0.9342 (±0)        7       0.7345 (±1.2e−16)    0.6503 (±0.0064)   6       0.7781 (±0.0125)     0.7258 (±0.0230)

Table 5
Obtained accuracy using different classifiers.

Classifier Dataset Glass Vehicle Protein Segment Wdbc Sonar

Method F No CA F No CA F No CA F No CA F No CA F No CA

SVM DFS 4 0.5633 3 0.5443 9 0.6956 3 0.8171 3 0.9154 20 0.8019


r2PSO 3 0.4835 4 0.5704 8 0.5826 3 0.8171 3 0.9454 10 0.7681
r3PSO 3 0.4929 5 0.5869 10 0.6608 3 0.8171 2 0.9348 10 0.7391
r2PSO-lhc 2 0.4835 5 0.5869 7 0.6434 3 0.8171 3 0.9348 16 0.8067
r3PSO-lhc 3 0.4882 4 0.6071 8 0.6434 3 0.8193 2 0.9471 13 0.7584
GA_SN_CM 3 0.4835 4 0.5704 7 0.6347 4 0.8206 2 0.9471 17 0.8260
CatfishBPSO 4 0.4882 5 0.5704 8 0.6956 3 0.8206 2 0.8802 10 0.7922
Memetic Alg 2 0.4084 5 0.5633 5 0.5478 3 0.8145 2 0.9225 9 0.7584
BPSO 3 0.5586 4 0.4970 7 0.6521 3 0.8171 2 0.9066 10 0.7777
GA 3 0.4882 4 0.5017 7 0.7043 3 0.8171 5 0.9190 15 0.7391
Without FS 9 0.5633 19 0.7360 20 0.7304 20 0.9293 30 0.9735 60 0.7874
PART DFS 4 0.6572 3 0.671 9 0.6 3 0.9579 3 0.9278 20 0.7032
r2PSO 3 0.6901 4 0.6852 8 0.5913 3 0.9579 3 0.9295 10 0.7971
r3PSO 3 0.6150 5 0.6568 10 0.6086 3 0.9579 2 0.9366 10 0.7149
r2PSO-lhc 2 0.6713 4 0.6958 7 0.6695 3 0.9579 3 0.9295 16 0.7391
r3PSO-lhc 3 0.6525 4 0.6840 8 0.6086 3 0.9636 2 0.9190 13 0.7729
GA_SN_CM 3 0.6384 4 0.6852 7 0.6086 4 0.9631 2 0.9190 17 0.7874
CatfishBPSO 4 0.6244 5 0.7017 8 0.6173 3 0.9627 2 0.9295 10 0.7487
Memetic Alg 2 0.6244 5 0.7053 5 0.5826 3 0.9571 2 0.9278 9 0.7198
BPSO 3 0.6525 4 0.6863 7 0.6086 3 0.9579 2 0.9207 10 0.7101
GA 3 0.6901 4 0.6710 7 0.6 3 0.9627 5 0.9348 15 0.7101
Without FS 9 0.7230 19 0.7431 20 0.6608 20 0.9649 30 0.9436 60 0.7343
JRip DFS 4 0.6009 3 0.6721 9 0.55 3 0.9636 3 0.9260 20 0.7777
r2PSO 3 0.6478 5 0.6863 8 0.5217 3 0.9636 3 0.9313 10 0.7681
r3PSO 3 0.5915 5 0.6442 10 0.513 3 0.9636 2 0.9348 10 0.7874
r2PSO-lhc 2 0.6150 4 0.6662 7 0.5304 3 0.9636 3 0.9295 16 0.7198
r3PSO-lhc 3 0.6056 4 0.6520 8 0.5391 3 0.9584 2 0.9260 13 0.7246
GA_SN_CM 3 0.5821 4 0.6863 7 0.5391 4 0.9545 2 0.9260 17 0.7874
CatfishBPSO 4 0.5727 5 0.6721 8 0.5217 3 0.9558 2 0.9260 10 0.7487
Memetic Alg 2 0.6291 5 0.6412 5 0.5130 3 0.9510 2 0.9278 9 0.7294
BPSO 3 0.6056 4 0.6603 7 0.5043 3 0.9636 2 0.9348 10 0.7632
GA 3 0.6478 4 0.5917 7 0.5130 3 0.9558 5 0.9471 15 0.7246
Without FS 9 0.6666 19 0.684 20 0.6 20 0.9571 30 0.9366 60 0.7729
Bayes Net DFS 4 0.6478 3 0.5834 9 0.4956 3 0.9324 3 0.9190 20 0.7391
r2PSO 3 0.6384 4 0.6284 8 0.4695 3 0.9324 3 0.9313 10 0.7343
r3PSO 3 0.6338 5 0.6260 10 0.5130 3 0.9324 2 0.9295 10 0.7391
r2PSO-lhc 2 0.5915 4 0.6142 7 0.4347 3 0.9324 3 0.9190 16 0.7439
r3PSO-lhc 3 0.5962 4 0.6378 8 0.5043 3 0.9315 2 0.9313 13 0.7198
GA_SN_CM 3 0.5915 4 0.6284 7 0.5043 4 0.9263 2 0.9313 17 0.7536
CatfishBPSO 4 0.6338 5 0.6023 8 0.5217 3 0.9254 2 0.9119 10 0.7681
Memetic Alg 2 0.6197 5 0.6047 5 0.4782 2 0.9285 2 0.9313 9 0.7101
BPSO 3 0.5962 4 0.6082 7 0.4695 3 0.9324 2 0.8873 10 0.7198
GA 3 0.6384 4 0.5727 7 0.4608 3 0.9254 5 0.9154 15 0.7294
Without FS 9 0.7230 19 0.6189 20 0.6608 20 0.9155 30 0.9401 60 0.7487
J48 DFS 4 0.6995 3 0.6994 9 0.6086 3 0.9649 3 0.9242 20 0.7777
r2PSO 3 0.6572 4 0.6863 8 0.5652 3 0.9649 3 0.9401 10 0.7198
r3PSO 3 0.6056 5 0.6473 10 0.5826 3 0.9649 2 0.9383 10 0.7439
r2PSO-lhc 2 0.6713 4 0.6757 7 0.5739 3 0.9649 3 0.9330 16 0.7487
r3PSO-lhc 3 0.6525 4 0.6982 8 0.6173 3 0.9644 2 0.9295 13 0.7246
GA_SN_CM 3 0.6431 4 0.6863 7 0.6173 4 0.9623 2 0.9295 17 0.8212
CatfishBPSO 4 0.6338 5 0.6804 8 0.5652 3 0.9601 2 0.9295 10 0.7294
Memetic Alg 2 0.6572 5 0.7076 5 0.6782 3 0.9601 2 0.9295 9 0.7198
BPSO 3 0.6525 4 0.6650 7 0.5826 3 0.9649 2 0.9172 10 0.7081
GA 3 0.6572 4 0.6568 7 0.6260 3 0.9601 5 0.9295 15 0.6859
Without FS 9 0.676 19 0.7195 20 0.6173 20 0.9709 30 0.9401 60 0.7439
BFTree DFS 4 0.6713 3 0.6899 9 0.5565 3 0.9592 3 0.9190 20 0.7681
r2PSO 3 0.6713 4 0.7065 8 0.5478 3 0.9592 3 0.9348 10 0.7343
r3PSO 3 0.6009 5 0.6943 10 0.5565 3 0.9592 2 0.9313 10 0.7584
r2PSO-lhc 2 0.6478 4 0.671 7 0.5478 3 0.9592 3 0.9225 16 0.7101
r3PSO-lhc 3 0.6713 4 0.6769 8 0.6086 3 0.9618 2 0.9295 13 0.7487
GA_SN_CM 3 0.6197 4 0.7065 7 0.6086 4 0.9527 2 0.9295 17 0.7874
CatfishBPSO 4 0.615 5 0.6934 8 0.5652 3 0.9601 2 0.9190 10 0.7294
Memetic Alg 2 0.6666 5 0.6852 5 0.5478 3 0.9614 2 0.9154 9 0.7053
BPSO 3 0.6713 4 0.6674 7 0.5304 3 0.9592 2 0.9207 10 0.7294
GA 3 0.6713 4 0.6898 7 0.5478 3 0.9601 5 0.9278 15 0.6859
Without FS 9 0.6948 19 0.7124 20 0.5913 20 0.9536 30 0.9225 60 0.7198

Shima Kamyab was born in Kerman, Iran, in 1987. She received her B.Sc. in computer engineering from the Department of Computer Engineering, Shahid Bahonar University, Kerman, Iran, in September 2009. She obtained her M.Sc. degree in Artificial Intelligence from the same department in 2011. Shima has been a Ph.D. student of Artificial Intelligence at Shiraz University, Shiraz, Iran, since 2013. Her research interests include metaheuristics, evolutionary algorithms, and image processing.
Mahdi Eftekhari was born in Kerman, Iran, in 1978. He received his B.Sc. in computer engineering from the Department of Computer Science and Engineering, Shiraz University, Shiraz, Iran, in September 2001. He obtained his M.Sc. and Ph.D. degrees in Artificial Intelligence from the same department in 2004 and 2008, respectively. Mahdi has been a faculty member of Shahid Bahonar University of Kerman, Kerman, Iran, since 2008 and is currently an associate professor in the Department of Computer Engineering. His research interests include fuzzy systems and modeling, evolutionary algorithms, data mining, machine learning, and the application of intelligent methods in bioinformatics. He is the author and co-author of about 90 papers in cited journals and conferences. Dr. Eftekhari is a member of the Iranian Society of Fuzzy Systems.
