Feature selection using multimodal optimization techniques

S. Kamyab, M. Eftekhari

Neurocomputing 171 (2016) 586–597
journal homepage: www.elsevier.com/locate/neucom
http://dx.doi.org/10.1016/j.neucom.2015.06.068

Article history: Received 7 July 2014; Received in revised form 6 June 2015; Accepted 30 June 2015; Available online 16 July 2015. Communicated by Feiping Nie.

Keywords: Multimodal optimization; Feature selection; Evolutionary feature selection

Abstract

This paper investigates the effect of using Multimodal Optimization (MO) techniques on solving the Feature Selection (FSel) problem. The FSel problem is a high-dimensional optimization problem by nature and thus needs a solver with high exploration power. On the other hand, if alternative optimal solutions can be provided for a problem, the implementation phase may become more selective depending on the cost and limitations of the problem domain. The high exploration power and solution conservation capability of MO methods enable them to find multiple suitable solutions in a single run. Therefore, MO methods can be considered a powerful tool for finding suitable feature subsets for the FSel problem. In this paper, we make a dedicated study of the use of MO methods in the feature selection problem. Binary versions of some existing Evolutionary Algorithm (EA) based MO methods, such as Dynamic Fitness Sharing (DFS), local best PSO variants and GA_SN_CM, are proposed and used for selecting suitable features from several benchmark datasets. The results obtained by the MO methods are compared to some well-known heuristic approaches for the FSel problem from the literature. The obtained results and their statistical analyses indicate the effectiveness of MO methods in finding multiple accurate feature subsets compared to existing powerful methods.
…among them, a graph structure based on the proximity of the data samples is used.

Hybrid methods attempt to take advantage of both wrapper and filter methods to obtain better feature sets [20–26]. In particular, hybrid methods are based on a sequential (e.g. two-step) approach in which the first step is usually based on filter methods to reduce the number of features. Using this reduced set, a wrapper method is then employed in the second step to select the desired number of features [1]. Embedded methods embed the feature selection procedure into the learning algorithm, i.e., feature selection occurs naturally as a part of the learner [3]. From the literature, it seems more reasonable to use a wrapper (or embedded method) with a linear predictor as a filter and then train a more complex nonlinear predictor on the resulting variables [5]. Therefore, in this paper we use a wrapper approach with a 1-NN classifier as the predictor in the first phase, and the resulting subsets are then evaluated using more complex classifiers.

From a computational perspective, the feature selection problem is generally difficult to solve: it is inherently a combinatorial optimization problem [27]. Hence, optimization techniques are suitable methods for tackling it. Recently, several optimization-based methods have been proposed to find relevant feature subsets among the possible combinations of features [25,28–30]. Evolutionary Algorithms (EAs) are capable of exploring the search space comprehensively in a reasonable time thanks to their population-based structure. However, these methods suffer from premature convergence because of the tendency of their population to converge to its best found solution. This problem can be alleviated by controlling the diversity of the population.

Finding multiple optimal solutions is the objective of a class of optimization methods called Multimodal Optimization (MO) techniques. In such EA-based approaches, additional mechanisms are incorporated into the EAs to help the main search process avoid converging toward a local optimum. These mechanisms, called niching techniques, empower the search process to find multiple global and local optima by controlling the diversity of the population and performing local search to fine-tune the obtained solutions. These attributes make MO methods appropriate tools for finding suitable solutions to the feature selection problem.

In this paper, a dedicated study of the use of MO methods for the feature selection problem is made, which has not been performed before. To do this, modifications of some well-known EA-based MO methods, together with our proposed niching method GA_SN_CM, are used for the feature selection task and are compared with some well-known EA-based feature selection methods in order to study the power of MO methods in improving feature selection results.

The remainder of this paper is organized as follows: Section 2 provides a brief description of using EA-based optimization methods for feature selection. In Section 3, the MO methods used for feature selection in this paper are discussed. In Section 4, the described MO methods are applied to several test problems and the results are presented and discussed. Finally, Section 5 is dedicated to conclusions.

2. Feature selection using EA-based optimization methods

Searching for an appropriate feature subset, which can represent the target concept in a large feature set, is an NP-hard problem. If a feature subset is modeled as a binary string in which each feature is either selected or not, the number of possible non-empty feature subsets is 2^n − 1, where n is the number of all features. Generally, for such problems, the optimal solution cannot be guaranteed to be found except by performing an exhaustive search of the solution space. The use of meta-heuristic techniques allows us to obtain reasonably good solutions without being forced to explore the whole solution space; the quality of "heuristic" solutions depends on the characteristics of the applied method. Recent studies show that meta-heuristic techniques are among the superior methodologies. In real-world applications, people are more interested in obtaining good solutions in a reasonable amount of time rather than being obsessed with optimal solutions. Therefore, we favor meta-heuristic methods that are efficient for dealing with real-world applications [8].
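As a minimal illustration of this binary encoding and of how quickly the search space grows (our own sketch, not code from the paper), a candidate solution can be represented as a 0/1 mask over the n features:

```python
import numpy as np

def decode_subset(mask):
    """Return the indices of the selected features for a binary mask."""
    return [i for i, bit in enumerate(mask) if bit == 1]

n = 20                                   # number of available features
mask = np.random.randint(0, 2, size=n)   # one candidate solution (binary string)
print("selected features:", decode_subset(mask))
print("non-empty subsets for n =", n, ":", 2 ** n - 1)   # 1,048,575
```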
The most frequently used meta-heuristic strategies applied to the feature selection problem are EA-based methods. Such methods do not suffer from the so-called nesting effect caused by traditional sequential feature selection methods such as SBS (Sequential Backward Selection) [31] and SFS (Sequential Forward Selection) [32]. The nesting effect corresponds to the inability to reselect discarded features (SBS) or to discard already selected features (SFS). Using such sequential schemes for feature selection may therefore result in locally optimal solutions to the FSel problem. EA-based methods overcome this drawback by placing no restriction on which features can be selected during their search process. Moreover, when the number of variables is large, the computational time of EA-based methods, thanks to their parallel selection of features, tends to be considerably lower than that of methods based on forward selection and backward elimination.
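To make the nesting effect concrete, here is a small, self-contained sketch of greedy SFS (our own illustration; `score` stands for any subset-evaluation function): once a feature is added it is never reconsidered, which is exactly the restriction that EA-based methods avoid.

```python
def sfs(features, score, k):
    """Greedy Sequential Forward Selection: grow a subset one feature at a time.

    `score(subset)` is any evaluation function (e.g. cross-validated accuracy).
    Once a feature is added it is never removed -- this is the nesting effect.
    """
    selected = []
    remaining = list(features)
    while len(selected) < k and remaining:
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)          # permanent decision, never reconsidered
        remaining.remove(best)
    return selected

# toy usage: prefer features whose index sum is large
print(sfs(range(6), score=lambda s: sum(s), k=3))   # [5, 4, 3]
```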
There are several studies that investigate the effectiveness of EA-based methods in the feature selection process. They adopt different kinds of EAs such as the Genetic Algorithm (GA) [22,25,28,29], the PSO algorithm [2,33–35], the Differential Evolution (DE) algorithm [36], the ACO algorithm [37] and so on. These approaches can act as filter, wrapper, hybrid or embedded methods depending on the way they evaluate the fitness of their population.

Several available studies that propose improvements to the exploration and local search powers of EAs for feature selection show that these powers play crucial roles in obtaining better results for the FSel problem. For instance, GRASP [29] is an iterative process in which each iteration has two phases: construction and local search. In the construction phase, a feasible solution is built; its neighborhood is then explored by the local search. The final result is the best solution found over all the iterations. Chuang et al. [34] used a new modification of the PSO algorithm called CatfishBPSO for the feature selection process. In CatfishBPSO, the so-called "catfish" effect introduces competition into a group of individuals. Catfish particles are introduced into the search space if the fitness of gbest cannot be improved over a number of consecutive iterations. These catfish particles are placed at extreme positions of the search space and initialize a new search from those positions. They open up new opportunities for finding better solutions and guide the entire swarm to promising new regions of the search space [34]. Introducing catfish particles helps CatfishBPSO avoid converging toward a local optimum by increasing the exploration power and diversity of its population. The high classification accuracy of this heuristic optimization method for feature selection indicates the effectiveness of its mechanism. These methods inspired us to use Multimodal Optimization techniques, which have high exploration power and good local search behavior, for feature selection in order to find more desirable feature subsets.
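A rough sketch of the catfish mechanism described above (our own simplification, not the authors' implementation; the stall threshold and the 10% fraction follow the settings reported later in Section 4):

```python
import numpy as np

def inject_catfish(positions, fitness, stall_counter, max_stall=3, frac=0.1):
    """Re-initialize the worst `frac` of particles at extreme positions
    (all-zeros or all-ones strings) when gbest has stalled for `max_stall` iterations."""
    if stall_counter < max_stall:
        return positions
    n_particles, n_features = positions.shape
    n_catfish = max(1, int(frac * n_particles))
    worst = np.argsort(fitness)[:n_catfish]          # lowest-fitness particles
    for idx in worst:
        positions[idx] = np.zeros(n_features) if np.random.rand() < 0.5 \
                         else np.ones(n_features)
    return positions
```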
In [8] a Memetic Algorithm (MA), which is a population-based approach with a local search mechanism for improving solutions,
is used for feature selection. Like Multimodal Optimization techniques, MA takes advantage of local search as well as exploration of the search space and is therefore able to find more effective and accurate feature subsets.

3. Multimodal optimization methods for feature selection

When a single-objective optimization problem has more than one optimal solution, it can be considered a multimodal optimization problem. The objective of locating different optima in a single run makes it more complicated than single global optimization. Niching methods, devised to extend EAs to multimodal optimization, address this issue by maintaining the diversity of the population. In this way, they allow parallel convergence to multiple good solutions in a multimodal domain. The concept of niching is inspired by the way organisms evolve in nature. The process involves the formation of sub-populations within a population; each sub-population aims at locating one optimal solution of the fitness landscape, and the whole population together is expected to locate multiple optima in a single run. Several niching methods have been proposed in the literature [38].

The aim of the feature selection approach is to obtain representations of the dataset that are more adequate for learning a concept from that dataset [10]. Such representations may not be unique; therefore, the FSel problem can be seen as a multimodal optimization problem. On the other hand, even if the aim of FSel is to find a single solution, MO methods can find more desirable solutions because of their high exploration and exploitation power, which reduces the probability of getting trapped in a local optimum. The optimal feature subset must be searched for in a large space containing all possible subsets of features; therefore, the FSel problem is hard when a large number of features is involved. MO methods not only can explore wide regions of the search space and perform appropriate local search to fine-tune the obtained solutions, they can also provide alternative solutions to the problem at hand. In the real world, sometimes, even when an optimal solution (feature subset) is found, the high cost of implementing a model using that subset makes it impossible to use. If alternative optimal solutions can be provided to the user, the available solutions can be checked and the most appropriate one selected according to the user's preferences. The ability of MO methods to provide multiple solutions helps to achieve this end.

Here, we briefly review the niching methods that we used for FSel in this paper, after describing the fitness function used by all methods in our experiments.

3.1. Fitness evaluation

Using an evaluation criterion in the search algorithm is crucial to guide the search process toward finding appropriate feature subsets.

Sebban and Nock [39] categorize feature selection mechanisms into three classes according to what is optimized:

- Algorithms that find the feature subsets of a specified dimensionality in which the classes of data are most discriminable.
- Algorithms that find the smallest feature dimensionality for which the discriminability exceeds a specified value, and
- Algorithms that find a compromise between a small subset of features and the class discriminability.

In this study, we considered the third mechanism, taking classification accuracy as the main discriminability measure in order to achieve the greatest possible precision. Therefore, the fitness function for assessing the "goodness" of a feature subset is computed as (1):

$\mathrm{fitness}_i(\mathrm{selectedFeatures}) = \alpha \cdot \mathrm{classificationAccuracy}(\mathrm{selectedFeatures}) + (1-\alpha)\cdot\dfrac{N_t - N_s}{N_t}$    (1)

where $N_t$ and $N_s$ denote the total number of features and the number of selected features of a candidate solution, respectively. $\alpha$ is a constant coefficient that balances the effect of classification accuracy and the number of features in the fitness value; it is set to 0.7 in our experiments.

Note that Eq. (1) guides the algorithm to make a suitable tradeoff between classification accuracy and the number of selected features. It leads the population toward accurate solutions with fewer features in the search space.
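A minimal sketch of the fitness function of Eq. (1) (our own illustration; `evaluate_accuracy` stands in for the wrapper evaluation, e.g. 1-NN accuracy on held-out data, and α = 0.7 as in our experiments):

```python
def fitness(mask, evaluate_accuracy, alpha=0.7):
    """Eq. (1): weighted sum of classification accuracy and feature reduction.

    mask               -- binary sequence, 1 = feature selected
    evaluate_accuracy  -- callable returning accuracy for the selected features
    """
    n_total = len(mask)
    n_selected = sum(mask)
    if n_selected == 0:                    # empty subsets are invalid
        return 0.0
    accuracy = evaluate_accuracy(mask)
    return alpha * accuracy + (1 - alpha) * (n_total - n_selected) / n_total

# toy usage with a dummy evaluator
print(fitness([1, 0, 1, 0, 0], evaluate_accuracy=lambda m: 0.9))   # 0.81
```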
3.2. Description of MO algorithms

3.2.1. Genetic algorithm using a self-adaptive neighborhood scheme with crowding replacement memory (GA_SN_CM)

In our recent work [40] we proposed a niching scheme that uses a self-adaptive neighborhood structure, which leads to the formation of suitable species within the population, together with a memory structure with a crowding replacement scheme that conserves the found solutions during the search process of the GA. Using such mechanisms obviates the need for tuning problem-dependent niche parameters and causes the formation of species consistent with the optimum for which they are searching. Moreover, the memory structure helps the search process preserve found niches without interrupting the exploration power of the niching method. The obtained results indicate the high exploration and exploitation power of the proposed method. These characteristics motivated us to adopt it for solving the feature selection problem.

Note that we used a binary version of the GA together with our proposed niching scheme for the feature selection problem. The pseudo-code of GA_SN_CM is shown in Fig. 1. More details about this method are discussed in [40].

3.2.2. Local best PSO variants

In [41] Li stated that a PSO algorithm using the ring topology can operate as a niching algorithm. It does this by using individual particles' local memories to form a stable network retaining the best positions found so far, while these particles explore the search space more broadly. Given a reasonably large population uniformly distributed in the search space, PSO algorithms using the ring topology are able to form stable niches across different local neighborhoods and eventually can locate multiple global/local optima. Such a fast and efficient niching method can form and maintain species around optima with no need to set any additional parameters beyond the PSO parameters. Hence, we believe this method, by establishing a fast and efficient search in the feature space, can perform well in finding accurate solutions for the feature selection problem in a reasonable time.

The pseudo code of lbest PSO using a ring topology is illustrated in Fig. 2, where the function neighborhoodBest(.) returns the best-fit personal best in the ith neighborhood, which is stored in $p_{n,i}$, representing the neighborhood best for the ith particle. Conveniently, the population indices can be used to identify the left and right neighbors of each particle and their corresponding memories (p-bests).

The velocity and position of each particle are updated in this niching method according to Eqs. (2) and (3), respectively:

$\vec{v}_i \leftarrow w\,\vec{v}_i + c_1\,\vec{R}_1[0,\varphi_1] \otimes (\vec{p}_i - \vec{x}_i) + c_2\,\vec{R}_2[0,\varphi_2] \otimes (\vec{p}_{n,i} - \vec{x}_i)$    (2)
GA_SN_CM Procedure
Begin
N = number of neighbors for each individual;
Ps = population size;
t =0;
Initialize the main population P(t) and memory M(t) randomly;
While (termination condition is not met) do
t = t + 1;
Check if the number of individuals in each neighborhood is equal to N;
Eliminate redundant chromosomes in each neighborhood in P(t);
Generate new individuals instead of eliminated chromosomes;
Fitnesses = Evaluate(P(t)) according to Eq. (1);
Update memory M(t) using raw Fitness values;
For i = 1 to Ps
p1 = select the first parent from the P(t) using Fitness values;
if rand < Pn
p2 = select second parent from p1’s neighbors using Fitness values;
else
p2 = select second parent except for itself from P(t) using Fitness values;
end if
{o1 , o2} = crossover(p1 , p2);
{o1’sNeighbors , o2’sNeighbors} = binaryCrossover(p1’sNeighbors, p2’sNeighbors);
mutate(o1 , o2);
binaryMutate(o1’sNeighbors , o2’sNeighbors);
{t1 , t2 } = survive(p1, p2, o1, o2);
Add t1 and t2 to P(t+1) and their corresponding neighbors to M(t+1);
end for
End while
Return M(t) as the final solution;
End
Fig. 1. The pseudo code for main algorithm of GA_SN_CM.
$\vec{x}_i \leftarrow \vec{x}_i + \vec{v}_i$    (3)

where $\vec{R}_1[0,\varphi_1]$ and $\vec{R}_2[0,\varphi_2]$ are two separate functions, each returning a vector of random values uniformly generated in the ranges $[0,\varphi_1]$ and $[0,\varphi_2]$, respectively. $\varphi_1$ and $\varphi_2$ are commonly set to $\varphi/2$ (where $\varphi$ is a positive constant). The symbol $\otimes$ denotes point-wise vector multiplication.

In the case of the feature selection problem, feature subsets are usually encoded as n-bit binary strings, where n denotes the cardinality of the original feature set. Consequently, the value one or zero in a particular bit determines whether the corresponding feature is selected from the original feature set or not. With such a structure for feature subsets, a binary version of the search algorithm must be used to deal with feature subsets in the search space. Hence, in this paper we used Eq. (4) instead of (3) to update the position of each particle after computing its velocity, setting each bit according to the sigmoid of the corresponding velocity component:

$x_{i,d} = \begin{cases} 1 & \text{if } \mathrm{rand}() < S(v_{i,d}) \\ 0 & \text{otherwise} \end{cases}$    (4)

where

$S(\vec{v}_i) = \mathrm{sigmoid}(\vec{v}_i) = \dfrac{1}{1+e^{-\vec{v}_i}}$    (5)
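The following sketch puts Eqs. (2), (4) and (5) together for one iteration of a binary lbest PSO with a ring neighborhood (our own illustration, not the authors' code; the default parameter values follow the settings reported in Section 4, and the 3-member neighborhood corresponds to the r3PSO-style variant described next):

```python
import numpy as np

def lbest_bpso_step(X, V, pbest, pbest_fit, fitness,
                    w=1.0, c1=2.0, c2=2.0, phi1=2.05, phi2=2.05, vmax=6.0):
    """One iteration of binary lbest PSO with a ring topology (Eqs. (2), (4), (5)).

    X, V         -- (n, d) arrays of binary positions and real-valued velocities
    pbest        -- (n, d) personal best positions, pbest_fit -- (n,) their fitness
    fitness      -- callable scoring a binary vector (e.g. Eq. (1))
    """
    n, d = X.shape
    for i in range(n):
        # ring topology: neighborhood of particle i is {i-1, i, i+1} (indices wrap)
        neigh = [(i - 1) % n, i, (i + 1) % n]
        nbest = pbest[max(neigh, key=lambda j: pbest_fit[j])]
        r1 = np.random.uniform(0, phi1, d)                    # R1[0, phi1]
        r2 = np.random.uniform(0, phi2, d)                    # R2[0, phi2]
        V[i] = w * V[i] + c1 * r1 * (pbest[i] - X[i]) + c2 * r2 * (nbest - X[i])
        V[i] = np.clip(V[i], -vmax, vmax)
        S = 1.0 / (1.0 + np.exp(-V[i]))                       # Eq. (5)
        X[i] = (np.random.rand(d) < S).astype(int)            # Eq. (4)
        f = fitness(X[i])
        if f > pbest_fit[i]:                                  # update personal best
            pbest[i], pbest_fit[i] = X[i].copy(), f
    return X, V, pbest, pbest_fit
```

Here the neighborhood contains the particle itself and its two immediate ring neighbors; changing the neighborhood definition (size two, or no shared members) gives the other variants discussed below.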
Fig. 2. The pseudo code of lbest PSO using a ring topology [41].

The neighborhood structure used by lbest PSO contains two or three particles of the swarm, assuming either existing overlap (shared members) between neighborhoods or not. Based on this, four different niching methods were introduced in [41]: r2PSO and r3PSO, which are lbest PSOs with 2- and 3-member overlapped neighborhoods, respectively. The other versions, r2PSO-lhc and r3PSO-lhc, are like the first two but without the assumption of overlapped neighborhoods. In fact, the overlapped-neighborhood assumption increases the tendency of the algorithm toward locating the global solutions rather than the local ones. By using these methods in FSel, we can also analyze the behavior of different niching methods for this problem.

3.2.3. Dynamic fitness sharing (DFS)

Fitness Sharing (FS) is one of the best-known and most widely used GA-based niching methods for MO. During the FS procedure, sub-populations are formed and maintained within the main population by reducing the probability of growing the number of individuals populating a peak. This scheme causes the disappearance of some individuals on a crowded peak, so that they instead populate other peaks. It is achieved by reducing the fitness of similar individuals as their number grows, with the implicit assumption that similar individuals belong to the same species and populate the same niche.
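As a small illustration of the sharing idea (a sketch of standard fitness sharing under our own assumptions — a triangular sharing function and Hamming distance between binary subsets; the dynamic species identification step of DFS is not reproduced here):

```python
import numpy as np

def shared_fitness(population, raw_fitness, sigma_sh=3.0, alpha=1.0):
    """Standard fitness sharing: divide each raw fitness by its niche count.

    population  -- (N, n) binary matrix of candidate feature subsets
    raw_fitness -- (N,) raw fitness values (e.g. Eq. (1))
    sigma_sh    -- niche radius, here expressed as a Hamming distance
    """
    N = len(population)
    shared = np.empty(N)
    for i in range(N):
        d = np.sum(population != population[i], axis=1)        # Hamming distances
        sh = np.where(d < sigma_sh, 1.0 - (d / sigma_sh) ** alpha, 0.0)
        shared[i] = raw_fitness[i] / np.sum(sh)                # niche count >= 1
    return shared
```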
Table 1. Description of used data sets.

…classifier) is built using the training set and then the performance is tested using the test set (unseen data). The experimental results are reported in the form of the mean and standard deviation of the correct classification rate obtained on the test data. Some statistical analyses of the results obtained in the experiments are carried out in order to determine the relative rank of the existing methods. Moreover, the power of finding multiple solutions by MO methods is compared with other methods in the case of 6 datasets. Finally, 6 additional classifiers, including SVM, PART, JRip, BFTree, Bayes Net and J48 from the Weka software, were used in order to evaluate the classifier-dependence trend of the solutions found by the existing algorithms.

The parameters used in the algorithms were set as follows:

- GA_SN_CM: population size = 70, generations = 200, selection type = roulette wheel, crossover operator = one-point binary, survive type = generational, Pc = 0.9, Pm = 0.01, number of neighbors for each individual = population size/4.
- Local best PSO variants: population size = 70, iterations = 200, vmin,d = −6, vmax,d = 6, c1 = c2 = 2, w = 1, φ = 4.1, φ1 = φ2 = φ/2, neighborhood variants = r2PSO, r3PSO, r2PSO-lhc and r3PSO-lhc as used in [16].
- GA and DFS: population size = 70, generations = 200, selection type = roulette wheel, crossover operator = one-point binary, survive type = generational, Pc = 0.9, Pm = 0.01, σ_sh depends on the dataset.
- BPSO and CatfishBPSO: population size = 70, iterations = 200, vmin,d = −6, vmax,d = 6, c1 = c2 = 2, w = 1, φ = 4.1, φ1 = φ2 = φ/2, max age = 3, amount of catfish particles = 10% of population size, other settings the same as in [24].
- Memetic algorithm: population size = 70, generations = 200, nSel = 0.3 (i.e. the worst 0.3 of the population is replaced with generated children in each generation), Pc = 0.9, Pm = 0.01, parent selection type = random (after sorting the population), local search operation described as follows: if X is a solution, its neighborhood N(X) is defined as

$N(X) = \{X' \mid X' = X \cup \{v_{j'}\} \setminus \{v_j\},\ \forall v_j \in X,\ v_{j'} \notin X\}$    (9)

The local search procedure is described in Fig. 5.

Fig. 5. The local search procedure used in the Memetic algorithm described in [8].
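A brief sketch of the swap neighborhood of Eq. (9) and a first-improvement local search over it (our own illustration; the exact acceptance rule and stopping criterion of the procedure in Fig. 5 may differ):

```python
def swap_neighborhood(selected, all_features):
    """Eq. (9): neighbors of a subset X are obtained by swapping one selected
    feature v_j for one currently unselected feature v_j'."""
    unselected = [f for f in all_features if f not in selected]
    for v_out in selected:
        for v_in in unselected:
            yield (selected - {v_out}) | {v_in}

def local_search(selected, all_features, score):
    """First-improvement hill climbing over the swap neighborhood."""
    improved = True
    while improved:
        improved = False
        current = score(selected)
        for neighbor in swap_neighborhood(selected, all_features):
            if score(neighbor) > current:
                selected, improved = neighbor, True
                break
    return selected
```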
Table 2. Obtained results of feature number and classification accuracy by each algorithm for 11 datasets.
WBCO Feature No 9 (0) 2.05 (0.6048) 1.2 (0.41) 2.75 (0.4443) 1.45 (0.51) 1.67(0.5501)
CA 0.96 (0.004) 0.9337(0.019) 0.9340 (0.012) 0.8741 (0.046) 0.939 (0.01) 0.9387(0.023)
Glass Feature No 9 3.4 (0.695) 3.4 (0.502) 3.35 (0.5871) 3.5 (0.6882) 2.88(0.527)
CA 0.671 (0.05) 0.7157 (0.044) 0.7186 (0.043) 0.7104 (0.05) 0.786 (0.03) 0.7535(0.0861)
Wine Feature No 13 (0) 2.65 (0.67) 2.7 (0.6569) 2.55 (0.6048) 2.6 (0.5982) 2.47(0.3305)
CA 0.732 (0.03) 0.9514 (0.021) 0.9563 (0.019) 0.95 (0.017) 0.9563(0.02) 0.9537(0.018)
Zoo Feature No 16 (0) 4.6 (1.04) 3.9 (0.8335) 4.5 (0.6882) 4.75 (1.019) 3.8(0.4612)
CA 0.932 (0.06) 0.9687 (0.025) 0.9775 (0.031) 0.965 (0.0347) 0.9775(0.01) 0.961 (0.0351)
Vehicle Feature No 19 5.6 (1.142) 4.8 (0.6958) 4.8 (0.6958) 5.15 (0.875) 5.01 (0.41)
CA 0.69 (0.011) 0.7336 (0.021) 0.7270 (0.013) 0.7302 (0.02) 0.7303 (0.01) 0.7337(0.0101)
Protein Feature No 20 (0) 7.75 (1.2926) 7.25 (1.5517) 7.95 (1.468) 7.95 (1.6375) 6.98(0.89)
CA 0.691(0.068) 0.7880 (0.045) 0.8173 (0.042) 0.8348 (0.047) 0.8195 (0.04) 0.7826(0.0377)
Segment Feature No 20 (0) 3.3 (0.4701) 3 (0) 3 (0) 3 (0) 3 (0)
CA 0.96 (0.007) 0.9674 (0.006) 0.9665 (0.004) 0.9693 (0.005) 0.966 (0.004) 0.9751 (0.006)
Wdbc Feature No 30 (0) 4.95 (0.9986) 2.55 (0.6863) 2.9 (0.7181) 2.6 (0.6805) 2.58(0.43)
CA 0.916 (0.01) 0.9342 (0.008) 0.9418 (0.01) 0.9432 (0.008) 0.9381 (0.01) 0.9342(0.02)
Soybean Feature No 35(0) 4.75 (1.5174) 2 (0) 1.9 (0.307) 2 (0) 2.55(0.5104)
CA 0.984 (0.03) 1 (0) 1 (0) 1 (0) 1 (0) 1 (0)
Lung cancer Feature No 56 (0) 19.4 (2.5214) 9.35 (2.996) 7.9 (2.1496) 13.1 (2.3373) 18.1(3.2472)
CA 0.461 (0.13) 0.9269 (0.046) 0.9846 (0.031) 0.9846 (0.031) 0.95 (0.0516) 0.9538(0.0397)
Sonar Feature No 60 (0) 18.75 (2.6532) 12.2 (2.1908) 11.85 (1.7252) 14.4 (2.0365) 14.1(3.3149)
CA 0.816 (0.02) 0.9265 (0.023) 0.9668 (0.018) 0.9674 (0.017) 0.951 (0.02) 0.9324(0.0183)
Ovarian Cancer Feature No 4000(0) 1900(50.05) 1650.4(70.38) 1550(85.42) 1741.3(38.23) 1806.9(34.42)
CA 0.8895(0.03) 0.9603(0.05) 0.9849(0.014) 0.986(0.00917) 0.986(0.0091) 0.9813(0.0206)
WBCO Feature No 1.55 (0.510) 1.6 (0.502) 1.5 (0.41) 3 (0) 2.8 (0.4104)
CA 0.941 (0.015) 0.9306 (0.016) 0.9282 (0.018) 0.9247 (0.013) 0.9311 (0.017)
Glass Feature No 3.4 (0.7539) 3.8 (0.7451) 3.65 (0.5871) 4.6 (0.54) 4.6 (0.54)
CA 0.7462 (0.044) 0.6973 (0.041) 0.6773 (0.037) 0.6359 (0.03) 0.6767 (0.04)
Wine Feature No 2.75 (0.7163) 4.1 (0.6386) 2.8 (0.6882) 2 (0) 2.4 (0.5477)
CA 0.9605 (0.022) 0.9114 (0.025) 0.9394 (0.025) 0.9239 (0.0161) 0.9380 (0.0126)
Zoo Feature No 4.7 (0.8013) 5.3 (0.9233) 5.9 (0.4442) 4.8 (0) 6.6 (0.8944)
CA 0.9787 (0.026) 0.9725 (0.029) 0.9662 (0.023) 0.9650 (0.028) 0.97 (0.02)
Vehicle Feature No 5.05 (0.9986) 5.4 (0.8335) 5.5 (0.5130) 5.6 (0.5477) 5 (0.7071)
CA 0.7363 (0.021) 0.7225 (0.020) 0.7312 (0.0184) 0.7219 (0.005) 0.697 (0.018)
Protein Feature No 8.3 (2.028) 8.7 (1.8353) 8.2 (1.7137) 7.4 (1.5166) 7.4 (1.5166)
CA 0.8304 (0.0396) 0.7923 (0.036) 0.8141 (0.0432) 0.8087 (0.0677) 0.8174 (0.0451)
Segment Feature No 3 (0) 3.85 (0.2236) 3 (0) 3.4 (0.5312) 3 (0)
CA 0.9706 (0.049) 0.9301 (0.005) 0.9662 (0.0061) 0.9507 (0.0045) 0.969 (0.0093)
Wdbc Feature No 2.75 (0.6386) 3.15 (0.6386) 2.5 (0.6882) 4.8 (1.303) 2.4 (0.547)
CA 0.9394 (0.013) 0.9399 (0.018) 0.9373 (0.0151) 0.9298 (0.02) 0.9281 (0.01)
Soybean Feature No 1.9 (0.3077) 2.25 (0.36635) 2 (0) 7.05 (1.099) 4.95 (0.366)
CA 1 (0) 1 (0) 1 (0) 1 (0) 1 (0)
Lung cancer Feature No 12 (2.6754) 13.7 (3.4043) 7.9 (2.6832) 18.8 (2.38) 15.6 (2.79)
CA 0.980 (0.0423) 0.95 (0.0451) 0.9961 (0.017) 0.9692 (0.04) 0.9385 (0.03)
Sonar Feature No 12.6 (2.1126) 16.4 (2.6832) 12.15 (1.4609) 18.8 (2.7749) 12.8 (2.3875)
CA 0.946 (0.0229) 0.9502 (0.022) 0.9583 (0.0133) 0.9181 (0.0261) 0.9398 (0.019)
Ovarian Cancer Feature No 1661.7(40.81) 815.4(177.47) 1726.07(38.05) 1923(31.04) 1693(41.07)
CA 0.9872(0.011) 0.9814(0.0125) 0.9802(0.0314) 0.9418(0.021) 0.9757(0.035)
4.3. Computational complexity

Taking the number of fitness evaluations as the elementary operation for computational complexity, and since the termination condition of each algorithm is set to reaching a fixed number of generations (200), the computational complexity of all the used methods is O(NG), where N and G stand for the number of individuals and the number of generations, respectively. However, if "computing the distance" is considered as the elementary operation, then among the MO methods DFS has complexity O(N²) due to its dynamic species identification technique, GA_SN_CM has complexity O(N²) due to its memory mechanism, and the complexity of the local best PSO variants is O(N). Among the single-modal methods, the complexity of CatfishBPSO, basic binary GA and basic binary PSO is O(N), and the Memetic Algorithm has complexity O(Nn), where n ≪ N denotes the number of assumed neighbors for each individual.
4.4. Results

Table 2 shows the results obtained by the MO and other heuristic methods over 20 independent runs on the existing datasets. The CA in Table 2 refers to the average classification accuracy achieved on the test set by the 1-NN classifier for each algorithm. The best values are highlighted in bold face.

As can be seen from Table 2, the local best PSO variant MO methods could achieve better results for several data sets. The Memetic algorithm could obtain results close to the local best variants. This may be caused by the similarity between the nature of MA and MO methods, in which a suitable local search operation exploits better solutions. On the other hand, although DFS is an MO method, apart from its better results compared with GA, it could not achieve good results compared to the other existing MO methods. We believe this may also be because of the local search operation which it uses, i.e. the fitness sharing mechanism between individuals within a species, and because the similarity measure used to decide whether individuals belong to the same species seems not to be appropriate in the feature space. Therefore, the selection of a suitable local search mechanism, considering the attributes of the problem at hand, plays an important role in finding desirable solutions with MO methods.

Generally, the results show the power of MO methods in solving the feature selection problem compared with the single-modal versions of the algorithms, as they could achieve better results than existing well-known heuristic feature selection methods in many cases. Therefore, they can be used as suitable techniques for solving the feature selection problem. In the case of the ovarian cancer dataset, which is high dimensional, we can see that the results obtained by EAs associated with niching techniques also appear better than their single-modal versions, indicating the power of MO methods in solving high-dimensional FSel problems. In this case CatfishBPSO could find the subset with the fewest features, which can be attributed to the suitable exploration power resulting from the incorporation of catfish particles in this algorithm.

In the next section, in order to analyze the obtained results and evaluate our viewpoints, we use some statistical analyses to compare the number of features and the classification accuracy obtained by the existing methods.
4.5. Statistical analysis

In order to show that the reported results have statistical significance, we have carried out some statistics on the results reported in Table 2. We used non-parametric 1×N Friedman and Quade tests with 9 and 90 degrees of freedom according to the F-distribution. The ranks obtained using these statistics are reported in Table 3.
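For illustration, a non-parametric comparison of this kind can be set up as below (a sketch under our own assumptions, not the authors' analysis script): the rows of the hypothetical accuracy matrix are datasets, the columns are the compared methods, and the per-dataset ranks averaged over rows are the kind of values listed in Table 3.

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

# hypothetical accuracy matrix: rows = datasets, columns = compared methods
acc = np.array([[0.93, 0.95, 0.92],
                [0.71, 0.75, 0.70],
                [0.95, 0.96, 0.94],
                [0.97, 0.98, 0.96],
                [0.92, 0.95, 0.93]])

stat, p_value = friedmanchisquare(*acc.T)         # one sample per method
mean_ranks = rankdata(-acc, axis=1).mean(axis=0)  # rank 1 = best accuracy
print("Friedman statistic:", stat, "p-value:", p_value)
print("mean ranks per method:", mean_ranks)
```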
Table 3. Statistical analyses of the results in Table 2.

Statistical test | Method | CA | Feature no.
Friedman 1×N | DFS | 6.5833 | 7.3182
 | r2PSO | 3.9167 | 3.2273
 | r3PSO | 4.5 | 3.6364
 | r2PSO-lhc | 4.125 | 4.7727
 | r3PSO-lhc | 2.625 | 4.9545
 | GA_SN_CM | 8.5 | 4.1818
 | CatfishBPSO | 6.9167 | 7.8182
 | MA | 5.7083 | 5.3182
 | BPSO | 7.375 | 5.8182
 | GA | 8.5 | 7.9545
 | Without FS | 9.9167 | 11
 | P-value | 6e-06 | 0
Quade 1×N | DFS | 7.0449 | 8.197
 | r2PSO | 3.4679 | 3.2273
 | r3PSO | 4.2885 | 3.1364
 | r2PSO-lhc | 3.8205 | 4.7955
 | r3PSO-lhc | 2.5705 | 4.6136
 | GA_SN_CM | 6.0064 | 5.0303
 | CatfishBPSO | 7.1346 | 7.697
 | MA | 5.5321 | 4.3788
 | BPSO | 7.6603 | 5.4394
 | GA | 8.4359 | 8.4848
 | Without FS | 10.0385 | 11
 | P-value | 3.4878e-08 | 3.038e-09

As we can see from Table 3, the statistical analyses also confirm the better performance of the local best variants, even in finding a good global solution, compared to the other existing methods. In the case of classification accuracy, the r3PSO-lhc version of local best PSO achieved the best performance. Based on this result, it can be said that considering a larger neighborhood than r2PSO, with less overlap with other neighborhoods, may lead the MO method to achieve higher accuracy in the feature selection problem. On the other hand, in the case of the number of features, r2PSO and r3PSO achieved better ranks. This means that the overlap between neighborhoods may lead the algorithm to reach solutions with fewer features. The Memetic algorithm is the next algorithm with a high rank among the existing algorithms. We believe its local search behavior is suitable for the feature selection problem and, if it could be associated with a solution conservation mechanism over its iterations, it could turn into a powerful MO technique for solving different problems, especially feature selection. After the local best variants and MA, the proposed GA_SN_CM method achieved the next rank. We can see that its performance is the best among the GA-based methods and its rank is close to that of MA. We believe its advantage over the GA-based methods is due to the tradeoff it establishes between exploration and exploitation: its suitable local search behavior improves the found solutions, while the found optima are conserved within the GA.
We can also see from Table 3 that DFS achieved better ranks than GA.

It is worth noting that the obtained results also show that classifying a dataset without any feature selection gives the worst accuracy among all the existing techniques. This means that the feature selection process is a necessary operation for improving classification quality by eliminating noisy and irrelevant features.

4.6. Algorithms' behavior

In order to observe the manner in which the different algorithms select salient features over their iterations, the fitness values over the number of iterations are shown in Fig. 6 for 11 datasets (averaged over 20 runs). Such information shows the convergence behavior together with the power of the considered methods, both in selecting fewer features and in achieving better classification accuracies.

We can see in Fig. 6 that, for most of the datasets, the MO methods, especially GA_SN_CM, converge more slowly than the single-modal methods. This may be due to their tendency to explore the whole search space, which results in finding fitter solutions. Methods like MA and CatfishBPSO achieve good results in early iterations but cannot improve their solutions later. We wish to note that the exploration power of MO methods can lead the feature selection process to find more desirable solutions. We believe that if MO methods used a local search that is more consistent with the feature selection process, like the one used by MA, they could find even fitter solutions during their search.

4.7. Finding multiple solutions

The main objective of MO methods is to provide several optimal solutions for the user to choose from. In this section, we investigate the variety of the solutions found in the final population of a sample run of each method. Table 4 gives the number of distinct solutions (feature subsets) with high fitness value encountered in the final population of each algorithm. Six data sets with different characteristics, including Sonar, Glass, Wdbc, Protein, Segment and Vehicle, were considered in this section. Moreover, the classification accuracy of the obtained subsets is also reported using the 1-NN classifier (averaged over 20 repetitions of data permuting and classifying). The aim of this section is simply to check the diversity of the final population of the considered methods and their ability to provide multiple solutions.

In Table 4, we can see that, compared with GA, PSO and MA, the MO methods, especially the local best PSO variants, could locate more distinct solutions in their final population. For instance, in the case of the Segment dataset (Table 4), the population of the single-modal methods converges entirely to one or two solutions, while the MO methods could find several different feature subsets. For the other datasets, the solutions found by the MO methods are also more diverse than those of the single-modal methods.
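To make the practical use of multiple solutions concrete, the following sketch (our own illustration, not part of the paper) selects, from a final population of (subset, accuracy) candidates, the smallest subset whose accuracy is within a tolerance of the best one — the kind of user-driven choice motivated in Section 3:

```python
def pick_preferred(candidates, tolerance=0.01):
    """candidates: list of (feature_subset, accuracy) pairs from the final population.
    Return the smallest subset whose accuracy is within `tolerance` of the best."""
    best_acc = max(acc for _, acc in candidates)
    eligible = [(s, acc) for s, acc in candidates if acc >= best_acc - tolerance]
    return min(eligible, key=lambda pair: len(pair[0]))

final_population = [({1, 4, 7, 9}, 0.962), ({1, 4, 9}, 0.958), ({0, 2, 4, 7, 9}, 0.963)]
print(pick_preferred(final_population))   # ({1, 4, 9}, 0.958)
```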
4.8. Obtained accuracy using other classifiers

…rule-based classifiers, Bayes Net as a Bayesian classifier, and J48 and BFTree as tree-based classifiers were used to compute the accuracy of the solutions found by each method. The best solution obtained by each algorithm during its search over 20 runs was used to compute the classification accuracy of the utilized classifiers. For each database, 10-fold cross-validation was used to compute the classification accuracy of each classifier. Table 5 shows the classification accuracy using the best feature subset selected by each algorithm, for the different classifiers and datasets. In the table, F No stands for the number of selected features and CA corresponds to the classification accuracy of the best solution found by each method.

Table 5 in fact indicates how suitable the best features found by the wrapper heuristic methods are for classification when different classifiers are utilized. Looking at Table 5, we observed that in many cases the best features obtained by the wrapper-based approaches give a reasonable accuracy compared with using all the features of the dataset for classification. In some cases the best selected features can even improve the classification accuracy (e.g. see the Sonar dataset with the BFTree classifier). Among the existing methods, the results show that the MO methods could find more suitable features than the other heuristic algorithms in terms of the obtained classification accuracy for many datasets.
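A short sketch of this evaluation step (our own illustration using scikit-learn; the paper's classifiers are Weka implementations, so the estimators and data below are stand-ins): the best subset found by a search method is scored with 10-fold cross-validation under several classifiers.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

def evaluate_subset(X, y, subset, classifiers):
    """10-fold CV accuracy of each classifier on the selected feature columns."""
    X_sel = X[:, sorted(subset)]
    return {name: cross_val_score(clf, X_sel, y, cv=10).mean()
            for name, clf in classifiers.items()}

# hypothetical data and a best subset found by one of the search methods
rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 30)), rng.integers(0, 2, size=200)
best_subset = {0, 3, 7, 12}
clfs = {"1-NN": KNeighborsClassifier(1), "SVM": SVC(), "Tree": DecisionTreeClassifier()}
print(evaluate_subset(X, y, best_subset, clfs))
```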
5. Conclusion

In this study, MO techniques are investigated for solving the feature selection problem, which is an NP-hard problem requiring high exploration and exploitation powers. MO methods, due to their slow convergence while forming subpopulations within the main population, can explore more regions of the search space. Moreover, they use local search operations to exploit better solutions within the formed subpopulations. An additional advantage of using MO methods is their ability to find multiple optimal feature subsets in terms of classification accuracy. We believe that such characteristics make MO methods desirable for solving the feature selection problem. To verify this, we used 12 UCI datasets and five well-known MO methods, namely DFS, r2PSO, r3PSO, r2PSO-lhc and r3PSO-lhc, as well as our proposed method GA_SN_CM, in our experiments. These methods were compared with single-modal forms of EAs and also with two well-known heuristic feature selection methods from the literature, CatfishBPSO and the Memetic Algorithm. The obtained results and their statistical analyses show that MO methods can significantly improve the feature selection process compared with the single-modal forms of EAs, and that they can also compete with powerful existing heuristic feature selection methods. We believe that if a local search consistent with feature selection is used in MO methods, their results can be improved significantly. In order to check the suitability of the selected features (from the different methods) in terms of the resulting classification accuracy, some other classifiers, apart from the one used in the search process, were utilized. The obtained results show acceptable accuracies for the classifiers using the found features.
Fig. 6. The fitness value obtained by each algorithm over its iterations for each dataset (averaged over 20 runs): (a) WBCO, (b) Glass, (c) Wine, (d) Zoo, (e) Vehicle, (f) Protein, (g) Segment, (h) Wdbc, (i) Soybean, (j) Lung Cancer, (k) Sonar. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Table 4. Obtained feature subsets in the final population, with their corresponding fitness and accuracy, for the existing methods. NoFSub denotes the number of distinct solutions with high fitness value in the final population of each algorithm; avgFit and avgCA represent the average fitness and average classification accuracy of the distinct solutions, respectively.
Method | NoFSub | avgFit | avgCA | NoFSub | avgFit | avgCA | NoFSub | avgFit | avgCA
DFS-FS | 3 | 0.8894 (±0.0073) | 0.9634 (±0.0135) | 5 | 0.6765 (±0.0551) | 0.5621 (±0.0803) | 12 | 0.8456 (±0.0384) | 0.8472 (±0.0347)
r2PSO | 9 | 0.9182 (±0.0085) | 0.9752 (±0.0125) | 3 | 0.7071 (±0.0324) | 0.6078 (±0.0486) | 5 | 0.9391 (±0.0062) | 0.9286 (±0.0117)
r3PSO | 15 | 0.9252 (±0.0085) | 0.9845 (±0.0178) | 2 | 0.7253 (±0.0282) | 0.6255 (±0.0453) | 2 | 0.9493 (±0) | 0.9454 (±0.0051)
r2PSO-lhc | 4 | 0.9038 (±0.0117) | 0.9697 (±0.0205) | 7 | 0.7448 (±0.0037) | 0.6588 (±0.0043) | 7 | 0.9319 (±0.0065) | 0.9241 (±0.0101)
r3PSO-lhc | 8 | 0.9111 (±0.0095) | 0.9658 (±0.0322) | 6 | 0.7135 (±3.26e-04) | 0.6120 (±0.0064) | 10 | 0.9462 (±1.17e-16) | 0.9438 (±0.0053)
GA_SN_CM | 8 | 0.8617 (±0.0327) | 0.9196 (±0.0368) | 6 | 0.6909 (±0.0085) | 0.7093 (±0.0360) | 7 | 0.9069 (±0.0189) | 0.9302 (±0.0098)
CatfishBPSO | 4 | 0.9179 (±0.0102) | 0.9505 (±0.0182) | 3 | 0.7071 (±0.0324) | 0.6078 (±0.0486) | 6 | 0.9426 (±0.0063) | 0.9369 (±0.0085)
MA | 4 | 0.9243 (±0.0150) | 0.9668 (±0.0152) | 1 | 0.7217 (±0) | 0.6167 (±0) | 4 | 0.9356 (±0.0050) | 0.9312 (±0.0036)
GA | 2 | 0.8836 (±0.0011) | 0.9480 (±0.0117) | 3 | 0.7198 (±0.0211) | 0.6235 (±0.0261) | 1 | 0.9224 (±0) | 0.9248 (±0)
BPSO | 4 | 0.9045 (±0.0134) | 0.9457 (±0.0134) | 3 | 0.7128 (±0.0282) | 0.6135 (±0.0426) | 7 | 0.9310 (±0.0026) | 0.9259 (±0.0034)

DFS-FS | 4 | 0.9153 (±0.0262) | 0.9092 (±0.0269) | 6 | 0.6978 (±0.0289) | 0.6063 (±0.0344) | 5 | 0.7750 (±0.0660) | 0.7363 (±0.0979)
r2PSO | 4 | 0.9335 (±0.0090) | 0.9264 (±0.0128) | 8 | 0.7342 (±0.0128) | 0.6507 (±0.0175) | 6 | 0.7859 (±0.0062) | 0.7405 (±0.0121)
r3PSO | 3 | 0.9314 (±0.0066) | 0.9234 (±0.0094) | 6 | 0.7317 (±0.0132) | 0.6487 (±0.0218) | 12 | 0.7506 (±0.0191) | 0.7043 (±0.0343)
r2PSO-lhc | 11 | 0.9275 (±0.0077) | 0.9223 (±0.0119) | 8 | 0.7373 (±0.0079) | 0.6586 (±0.0139) | 8 | 0.7793 (±0.0159) | 0.7481 (±0.0187)
r3PSO-lhc | 11 | 0.9289 (±0.0046) | 0.9249 (±0.0072) | 6 | 0.7338 (±0.0109) | 0.6506 (±0.0149) | 8 | 0.7753 (±0.0150) | 0.7433 (±0.0196)
GA_SN_CM | 11 | 0.9080 (±0.0048) | 0.9630 (±0.0070) | 8 | 0.7108 (±0.0175) | 0.6261 (±0.0219) | 5 | 0.6885 (±0.0511) | 0.7279 (±0.0243)
CatfishBPSO | 8 | 0.9313 (±0.0016) | 0.9285 (±0.0060) | 5 | 0.7428 (±4.47e-05) | 0.6682 (±0.0051) | 6 | 0.7983 (±0.0122) | 0.7535 (±0.0113)
MA | 2 | 0.9348 (±0.0070) | 0.9283 (±0.0100) | 2 | 0.7417 (±0.0102) | 0.6562 (±0.0098) | 3 | 0.7780 (±0.0232) | 0.7209 (±0.0302)
GA | 1 | 0.9390 (±0) | 0.9342 (±0) | 3 | 0.7172 (±0.0259) | 0.6270 (±0.0373) | 2 | 0.7732 (±0.0862) | 0.7224 (±0.1282)
BPSO | 1 | 0.9390 (±0) | 0.9342 (±0) | 7 | 0.7345 (±1.2e-16) | 0.6503 (±0.0064) | 6 | 0.7781 (±0.0125) | 0.7258 (±0.0230)
Table 5. Obtained accuracy using different classifiers.

Method | F No | CA | F No | CA | F No | CA | F No | CA | F No | CA | F No | CA