
Connection Science

Journal homepage: www.tandfonline.com/journals/ccos20

A feature selection method with feature ranking using genetic programming

Guopeng Liu, Jianbin Ma, Tongle Hu & Xiaoying Gao

To cite this article: Guopeng Liu, Jianbin Ma, Tongle Hu & Xiaoying Gao (2022) A feature
selection method with feature ranking using genetic programming, Connection Science, 34:1,
1146-1168, DOI: 10.1080/09540091.2022.2049702

To link to this article: https://doi.org/10.1080/09540091.2022.2049702

© 2022 The Author(s). Published by Informa UK Limited, trading as Taylor & Francis Group.

Published online: 17 Mar 2022.

CONNECTION SCIENCE
2022, VOL. 34, NO. 1, 1146–1168
https://doi.org/10.1080/09540091.2022.2049702

A feature selection method with feature ranking using genetic programming

Guopeng Liu a, Jianbin Ma a, Tongle Hu b and Xiaoying Gao c

a College of Information Science and Technology, Hebei Agricultural University, Baoding, People's Republic of China; b College of Plant Protection, Hebei Agricultural University, Baoding, People's Republic of China; c School of Engineering and Computer Science, Victoria University of Wellington, Wellington, New Zealand

ABSTRACT
Feature selection is a data processing method which aims to select effective feature subsets from original features. Feature selection based on evolutionary computation (EC) algorithms can often achieve better classification performance because of their global search ability. However, feature selection methods using EC cannot get rid of invalid features effectively: a small number of invalid features still exist until the termination of the algorithms. In this paper, a feature selection method using genetic programming (GP) combined with feature ranking (FRFS) is proposed. It is assumed that the more often the original features appear in the GP individuals' terminal nodes, the more valuable these features are. To further decrease the number of selected features, FRFS using a multi-criteria fitness function, named MFRFS, is investigated. Experiments on 15 datasets show that FRFS can obtain higher classification performance with a smaller number of features compared with the feature selection method without feature ranking. MFRFS further reduces the number of features while maintaining the classification performance compared with FRFS. Comparisons with five benchmark techniques show that MFRFS can achieve better classification performance.

ARTICLE HISTORY
Received 9 September 2021
Accepted 1 March 2022

KEYWORDS
Feature selection; feature ranking; genetic programming; multi-criteria

1. Introduction
Feature selection (Sreeja, 2019; Too & Abdullah, 2020) selects effective feature subsets from high-dimensional original features and is one of the key issues in machine learning.
High-quality features play a key role in building an efficient model, and irrelevant or redun-
dant features may cause difficulties (Xue et al., 2013). Therefore, feature selection methods
have been extensively applied to solve practical classification tasks (Espejo et al., 2010; Liang
et al., 2017; Loughran et al., 2017; Mei et al., 2017; Patel & Upadhyay, 2020). Recently, Gaining
Sharing Knowledge based Algorithm (GSK) (Mohamed et al., 2020) was proposed for feature
selection and achieved good performance (Agrawal, Ganesh, & Mohamed, 2021a, 2021b;
Agrawal, Ganesh, Oliva, et al., 2022). Evolutionary computation (EC) algorithms (Al-Sahaf
et al., 2019; Hancer, 2019) have been widely used for feature selection due to their global
search ability (Ma & Gao, 2020b; Xue et al., 2014). EC algorithms are population based,

CONTACT Jianbin Ma [email protected]


This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

which need to initialise individuals randomly. Sometimes, the genetic operators of genetic
algorithm (GA) and genetic programming (GP), or the updating strategy of particle swarm
optimisation (PSO) (Nagra et al., 2020; Too et al., 2021), are not sufficient to delete invalid
features. So, a small number of invalid features still exist until the termination of the algorithms.
GP can be used for feature selection due to its global search ability and is proved to
achieve good classification performance. However, with the evolution of GP, a large number
of individuals with the same best fitness are generated. GP only outputs the best individual,
which may lose some good individuals. Moreover, the generated features in the output
individual may still contain redundant features.
Feature ranking (Ahmed et al., 2014; Friedlander et al., 2011; Neshatian, 2010) is the rank-
ing of original features based on specific evaluation criteria, which is usually a step of feature
selection. It is employed to find out which features or feature sets are more important. So,
this paper proposes a GP-based feature selection method combined with feature ranking
(FRFS), which considers the large number of best individuals generated during the evolu-
tion of GP instead of only one best individual. A correlation-based evaluation criterion is
used as the fitness function. It is assumed that the more the original features appear in
the GP individuals’ terminal nodes, the more valuable these features are. Therefore, the
individuals with the best fitness are stored. The occurrence times of features appearing
in the individuals’ terminal nodes are counted and the top β features are selected as the
candidate feature subset. To decrease the number of selected features while maintain-
ing the classification performance, another feature selection method that combines FRFS
with a multi-criteria fitness function is investigated. The motivation of this algorithm is to
set higher fitness values for the individuals with the same correlation value in the case of
smaller feature number.
The overall objective of this paper is to propose a feature selection algorithm that com-
bines feature ranking with a multi-criteria fitness function and verify the improvement of
the algorithm by feature ranking and the dimensionality reduction effect of multi-criteria
fitness function. To achieve the overall objective, the following three objectives will be
investigated.
Objective 1. Propose a feature selection method combined with feature ranking using
GP (FRFS) and verify whether it can achieve better classification performance with smaller
number of features than using feature selection only (FS).
Objective 2. Investigate a feature selection method that combines FRFS with a multi-
criteria fitness function which is named as MFRFS, and verify whether the multi-criteria
fitness function can reduce the number of features of FRFS while maintaining the classi-
fication performance.
Objective 3. Investigate whether our proposed MFRFS can achieve better classification
performance than five benchmark techniques, including ReliefF feature selection method
(Kira & Rendell, 1992), linear forward feature selection method (LFS) (Gutlein et al., 2009)
and three GSK related methods (Agrawal, Ganesh, & Mohamed, 2021b; Agrawal, Ganesh,
Oliva, et al., 2022).
The rest of this paper is arranged as follows. Section 2 outlines the background infor-
mation of this paper. Section 3 describes in detail the three GP-based feature selection
methods. Section 4 shows the experimental design. Section 5 presents the experimental
results and discussion. Section 6 is the conclusion and future work.

2. Related works
2.1. Genetic programming (GP)
GP is one of the evolutionary computation algorithms (Koza, 1992; Koza et al., 1999). It
is very similar to genetic algorithm (GA). The main difference between GP and GA is the
representation of individuals. Due to GP’s flexible representation methods, GP can be
used to construct high-level features (Ma & Gao, 2020a, 2020b), construct classifiers (Muni
et al., 2006; Neshatian, 2010; Neshatian & Zhang, 2011) and solve practical industrial prob-
lems (Bi et al., 2021a, 2021b; Peng et al., 2020). The commonly used representation method
of GP is based on a tree structure. The terminal nodes (constants and variables) are randomly
selected from a terminal set, and the internal nodes are functions (mathematical and logical
operators) randomly selected from a function set. GP randomly initialises the
first generation of individuals and evaluates each individual using a fitness function. Genetic
operators including selection, crossover and mutation are then performed to produce the
next generation’s population. This step is iterated until the termination criterion is reached.
Then GP outputs the optimal individual. For feature selection, the terminal nodes contain
the features selected by GP.
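To make the tree representation concrete, the following is a minimal sketch (not the authors' implementation) of a GP individual whose terminals are feature indices; `random_tree` and `terminal_features` are hypothetical helper names, and the 0.3 early-termination probability is an arbitrary choice.

```python
import random

FUNCTIONS = ['+', '-', '*', '%']  # '%' denotes protected division, as in Table 2

def random_tree(n_features, depth, rng):
    """Grow a random expression tree; terminals are feature indices."""
    if depth == 0 or rng.random() < 0.3:
        return ('X', rng.randrange(n_features))      # terminal node: a feature
    op = rng.choice(FUNCTIONS)                       # internal node: a function
    return (op,
            random_tree(n_features, depth - 1, rng),
            random_tree(n_features, depth - 1, rng))

def terminal_features(tree):
    """Collect the feature indices in a tree's terminal nodes
    (i.e. the feature subset this individual encodes)."""
    if tree[0] == 'X':
        return {tree[1]}
    return terminal_features(tree[1]) | terminal_features(tree[2])

rng = random.Random(42)
individual = random_tree(n_features=10, depth=4, rng=rng)
print(sorted(terminal_features(individual)))
```

For feature selection, only `terminal_features` matters: whatever expression the tree computes, its terminal set is the selected feature subset.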

2.2. Correlation
The correlation-based feature evaluation criterion is proposed by Hall (1999), which takes
into account both the correlation between features and classes and the correlation
between features. This evaluation criterion is proved to be effective to select low redun-
dancy and high discrimination feature subsets (Hall & Smith, 1999; Ma & Gao, 2020b), and is
adopted as the evaluation criterion in this paper. The following is the formula for calculating
the correlation of a feature subset:

    Correlation_S = (k · C_fc) / sqrt(k + k(k − 1) · C_ff)    (1)

where Correlation_S is the correlation value of a set S containing k features, C_fc denotes the average correlation between each feature and the class, and C_ff denotes the average correlation between each feature pair. Both are calculated by Formula (2), in which H(X) and H(Y) are the information entropies of X and Y respectively, and H(X | Y) is the conditional entropy:

    C_XY = 2.0 × (H(X) − H(X | Y)) / (H(X) + H(Y))    (2)
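A small illustrative sketch of Formulas (1) and (2), assuming discrete-valued features (the paper does not specify its discretisation); the function names are ours, not the authors':

```python
import math
from collections import Counter

def entropy(xs):
    """Shannon entropy H(X) of a discrete sequence."""
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())

def conditional_entropy(xs, ys):
    """H(X | Y): entropy of X within each Y-group, weighted by group size."""
    n = len(ys)
    return sum((cy / n) * entropy([x for x, yy in zip(xs, ys) if yy == y])
               for y, cy in Counter(ys).items())

def symmetric_uncertainty(xs, ys):
    """Formula (2): C_XY = 2 * (H(X) - H(X|Y)) / (H(X) + H(Y))."""
    hx, hy = entropy(xs), entropy(ys)
    if hx + hy == 0:
        return 0.0
    return 2.0 * (hx - conditional_entropy(xs, ys)) / (hx + hy)

def merit(features, labels):
    """Formula (1): k*Cfc / sqrt(k + k(k-1)*Cff) for a list of feature columns."""
    k = len(features)
    cfc = sum(symmetric_uncertainty(f, labels) for f in features) / k
    pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]
    cff = (sum(symmetric_uncertainty(features[i], features[j]) for i, j in pairs)
           / len(pairs)) if pairs else 0.0
    return k * cfc / math.sqrt(k + k * (k - 1) * cff)
```

A feature identical to the class labels yields a merit of 1.0, while a feature independent of the labels contributes nothing, which matches the intuition behind Hall's criterion.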

2.3. Feature selection methods based on GP


Due to GP’s flexible representation methods, GP can be used for feature selection. Recent
studies have focused on feature selection methods using evolutionary algorithms (Canuto
& Nascimento, 2012; Papa et al., 2017; Ribeiro et al., 2012; Tan et al., 2008) due to their global
search ability. According to whether the classification algorithm is involved in the fitness
function, feature selection is divided into filter-based (Davis et al., 2006; Lin et al., 2008;

Neshatian & Zhang, 2009b, 2012; Purohit et al., 2010; Ribeiro et al., 2012) and wrapper-
based (Hunt et al., 2012). The filter-based feature selection methods use information mea-
sures such as mutual information (Vergara & Estévez, 2014), information gain (Neshatian
& Zhang, 2012), consistency and correlation (Hall, 1999; Neshatian & Zhang, 2009c) as the
evaluation criterion, and need less running time than wrapper-based methods. Moreover,
the models produced by filter-based methods are more general. So, filter-based feature
selection is investigated further in this paper. The fundamental research goal of feature
selection is to find the feature subset with the best classification performance and fewer
features.
Lin et al. (2008) constructed a classifier using layered genetic programming, which had
the characteristic of feature selection and feature extraction. Neshatian and Zhang (2012)
combined information entropy and conditional entropy to evaluate the correlation
between feature sets and classes for feature selection.
Purohit et al. (2010) proposed a GP classifier construction method with feature selec-
tion and investigated a new crossover operator to discover the best crossover site. Davis
et al. (2006) proposed a two-stage feature selection method. In the first stage, a feature sub-
set is selected using GP and GP classifiers are evolved using the selected features. Neshatian
and Zhang (2009b) proposed a filter-based multiple-objective feature selection method for
binary classification task.
GP has flexible representation ability. In general, the GP-based feature selection method
takes original features as the terminal set and terminal nodes as the selected features. How-
ever, some researchers have researched other representations. Hunt et al. (2012) developed
GP-based hyper-heuristics for adding and removing features. Each individual generated by
GP is a series of operations on the feature set. Ribeiro et al. (2012) used four feature selec-
tion criteria (information gain, chi-square, correlation and odds ratio) as the terminal set of
GP and three set operations (intersection, union and difference) as the function set. Viegas
et al. (2018) proposed a GP-based feature selection method for skewed datasets and validated it on four high-dimensional datasets. Papa et al. (2017) used binary strings as individuals' terminal nodes, where 1 means a feature is selected and 0 means it is not; AND, OR and XOR are regarded as non-terminal nodes. The output of the individual is also a binary string, which encodes the selected features.

2.4. Feature ranking methods based on GP


The goal of feature ranking is to show the importance of the original features under certain
evaluation criteria and is often used as a basis of feature selection.
Neshatian (2010) proposed a feature ranking method based on GP, which used GP to
evolve many weak classifiers and preserved the optimal weak classifiers. The score of each
feature is proportional to the fitness value of the weak classifier that contains it. This method
needs to run a large number of GPs, which is time-consuming.
Friedlander et al. (2011) proposed a feature ranking method based on weight vectors.
GP updates the weights for each feature at the end of each generation of evolution. After
10 generations of evolution, terminal nodes are no longer randomly selected during the
generation of new subtrees in mutation operation, but are selected according to different
probabilities based on feature weights.

Ahmed et al. (2014) proposed a two-stage GP-based classification method which first
ranked the original features then constructed classifiers. Neshatian and Zhang (2009a) con-
structed GP programs to measure the goodness of feature subsets using a virtual program
structure and an evaluation function.
At present, there has been much research on GP-based feature selection methods, but little of it focuses on the large amount of evolutionary information generated during the evolution process. Some researchers have begun to use GP to rank features (Friedlander et al., 2011; Neshatian, 2010), which indicates that the frequency with which features appear in terminal nodes can show their importance. Based on this motivation, two issues should be investigated: whether feature ranking can help quantify the importance of features and improve the classification performance, and whether a multi-criteria fitness function can help select more effective features.

3. Method
In this paper, our proposed methods are based on standard GP, and correlation is chosen
as the fitness function. The fundamental purpose of this paper is to investigate whether the
feature selection method based on correlation evaluation criteria using GP can effectively
get rid of invalid features, and whether feature ranking and a multi-criteria fitness function
can improve the classification performance. We have different variants of algorithms to ver-
ify the research objectives. FS denotes the feature selection method using GP. FRFS denotes
the feature selection method combined with feature ranking. MFRFS denotes the algorithm
that combines FRFS with a multi-criteria fitness function. The details of the algorithms are
described below.

3.1. Feature selection using GP (FS)


This algorithm employs GP as the search algorithm and correlation as the evaluation criterion. Because GP minimises fitness (the smaller, the better), the fitness function of FS is given in Formula (3):

    fitFS = (1 − Correlation)    (3)
where Correlation is the correlation value of the selected features and classes.
The output of FS is an individual that contains the optimal feature subset. The terminal
nodes are the selected features. The parameter settings of GP are shown in Section 4.1.

3.2. FS combined with feature ranking (FRFS)


FS only considers the best individual generated by GP and does not make full use of the
information during GP’s evolutionary process. With the convergence of GP, there are a large
number of individuals with the same best fitness. Our next step, feature ranking, is based on
the assumption that the more frequently features occur in these individuals, the more valuable they are. All the individuals with the same best fitness during the evolution process after the jth generation are preserved. The occurrences of features in these individuals' terminal nodes are counted, and the features are ranked accordingly. Our experiments found that GP starts to converge after 30 iterations, so the parameter j is set to 30.

Algorithm 1: Pseudo-code of the FRFS algorithm

Input: D, a training set
Output: Ds, the feature set selected by FRFS
 1  Randomly initialise the first generation population of GP
 2  while the maximum number of generations is not reached do
 3      for i = 1; i <= s; i++ do               // s is the population size
 4          Calculate fit_i of individual i using Correlation
 5          if current generation > 30 then
 6              if fit_i < fit_b then           // fit_b is the current optimal fitness value
 7                  fit_b = fit_i
 8                  Empty the storage pool and store individual i in it
 9              else if fit_i == fit_b then
10                  Store individual i in the storage pool
11              end
12          end
13      end
14      Perform evolutionary operations: reproduction, crossover and mutation
15  end
16  Get the number of features k of the optimal individual
17  Count the occurrences of features contained by all individuals in the storage
    pool and obtain the sequence R in descending order
18  for j = k/3; j <= k; j++ do
19      Extract the first j features from R and obtain the classification accuracy
        acc_j on the training set with the classifier
20      if acc_j > acc_b then                   // acc_b is the largest accuracy so far,
21          acc_b = acc_j; f_best = j           // f_best the matching number of features
22      end
23  end
24  Return the first f_best features of sequence R
Suppose the number of features selected by FS is k. Our goal is to select the top f_best features among the top k features according to the ranking. The pseudo-code of FRFS is shown in Algorithm 1. To determine how many of the k features should be kept, first f_best = k/n and these features are tested on the classification algorithms. Then the remaining k − k/n features are added to the subset and tested one by one according to their ranking until the classification accuracy begins to decline; the top f_best features are then the selected features. In this paper, the parameter n is set to 3, that is, the top one-third of the k features form the initial subset. We start with the top k/n features instead of the single top feature to improve search efficiency, and set n to 3 based on our experiments.
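The ranking-and-growing step of FRFS can be sketched as follows. This is an illustrative simplification: `accuracy_fn` stands in for the wrapper classifier evaluation on the training set, and we stop at the first accuracy decline rather than scanning all j up to k.

```python
from collections import Counter

def frfs_select(best_individuals_terminals, k, accuracy_fn, n=3):
    """Rank features by how often they occur in the stored best individuals'
    terminal nodes, then grow the subset from the top k//n features until the
    training accuracy stops improving.

    best_individuals_terminals: list of feature-index lists, one per stored individual.
    accuracy_fn(feature_ids) -> float: stand-in for the classifier's training accuracy.
    """
    counts = Counter(f for terms in best_individuals_terminals for f in terms)
    ranked = [f for f, _ in counts.most_common()][:k]    # sequence R
    best_acc, f_best = -1.0, max(1, k // n)
    for j in range(max(1, k // n), len(ranked) + 1):
        acc = accuracy_fn(ranked[:j])
        if acc > best_acc:
            best_acc, f_best = acc, j
        else:
            break            # accuracy begins to decline: stop adding features
    return ranked[:f_best]
```

For example, if feature 0 appears in three stored individuals and accuracy peaks with two features, the function returns the two most frequent features.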

3.3. FRFS combined with a multi-criteria fitness function (MFRFS)


In the process of GP evolution, we found that two individuals with the same fitness value
are not necessarily identical. To give priority to individuals with the same fitness value but fewer features, we use the multi-criteria fitness function in Formula (4). FRFS using this multi-criteria fitness function is named MFRFS:

    fitMFRFS = (1 − Correlation) + α · fnum    (4)

where fnum is the number of features in an individual and α is the penalty coefficient balancing Correlation and fnum. The parameter α is discussed in Section 5.4.
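A minimal sketch of Formula (4); the function name is ours, and the default α = 0.01 follows the setting reported in Section 4.1:

```python
def fit_mfrfs(correlation, n_features, alpha=0.01):
    """Formula (4): penalise individuals that use more features, so that of two
    individuals with equal correlation the smaller one gets the better (lower)
    fitness under GP's minimisation."""
    return (1.0 - correlation) + alpha * n_features

# Two individuals with the same correlation value but different sizes:
small = fit_mfrfs(0.9, n_features=3)   # preferred (lower fitness)
large = fit_mfrfs(0.9, n_features=7)
```

The penalty only breaks ties meaningfully when α is small relative to typical correlation differences, which is why the choice of α is examined separately in Section 5.4.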

4. Experimental design
4.1. Datasets and parameter settings
To verify the effectiveness of our proposed feature selection methods, 15 datasets are col-
lected from UCI machine learning repository (Dheeru & Karra Taniskidou, 2017). The details
of the datasets are shown in Table 1. In the table, #features indicates the number of original
features, #instances indicates the number of instances and #classes indicates the num-
ber of class labels. The datasets have different number of classes and features, and are
representative classification problems.
A random 7:3 split is used: 70% of each dataset serves as the training set and 30% as the testing set (Ma & Gao, 2020b; Xue et al., 2013). The training set is used to select and rank the original features. The testing set is used to evaluate the classification performance of the selected and ranked features by 10-fold cross-validation. To account for the stochastic nature of GP, 30 experiments are done independently (Ma & Teng, 2019), and the classification accuracy and number of features are averaged over the 30 runs.
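The evaluation protocol (7:3 split, 30 independent runs, averaged results) can be sketched as follows; `run_fn` is a hypothetical stand-in for one complete GP feature-selection plus classifier-evaluation run:

```python
import random

def split_70_30(n_instances, seed):
    """Random 7:3 split into training and testing index lists."""
    idx = list(range(n_instances))
    random.Random(seed).shuffle(idx)
    cut = int(0.7 * n_instances)
    return idx[:cut], idx[cut:]

def repeated_runs(n_instances, run_fn, n_runs=30):
    """Average a metric over independent runs to smooth GP's stochasticity.
    run_fn(train_idx, test_idx, seed) stands in for one full experiment."""
    scores = [run_fn(*split_70_30(n_instances, seed), seed)
              for seed in range(n_runs)]
    return sum(scores) / len(scores)
```

Seeding each run separately keeps the 30 experiments independent while remaining reproducible.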

Table 1. Description of the fifteen datasets.


Dataset #features #instances #classes
arrhythmia 262 452 13
audiology 69 226 24
climate-simulation-craches 20 540 2
hepatitis 19 155 2
house-vote 16 435 2
Lymphoma 4026 51 3
planning-relax 12 182 2
robot-failure-lp4 90 117 3
robot-failure-lp5 90 164 5
seismic-bumps 18 2584 2
spambase 57 4601 2
spect-heart 22 267 2
statlog-heart 13 270 2
statlog-image-segmentation 19 2310 7
thoracic-surgery 16 470 2

Table 2. Parameter setting of GP.


Parameters Parameter value
Function set +, −, ∗, % (protected division)
Terminal set Original features F0
Population size 500
Number of generations 50
Initialisation Ramped half-and-half
Mutation method Random subtree creation
Selection method Tournament method
Maximum tree depth 17
Mutation probability 10%
Cross-over probability 90%
Reproduction probability 10%

K-nearest Neighbours (KNN, K = 5), C4.5 decision tree and Naive Bayes (NB) are used to
evaluate the performance of the proposed methods. The ECJ platform (Luke, 2017) is used
to run GP. The parameter settings of GP are shown in Table 2, which are the commonly used
parameters of GP (Ma & Gao, 2020b; Ma & Teng, 2019). The function set is composed of four
arithmetic operators, and the terminal set is the original features. The population size is set to 500; initial GP experiments indicate that this population size can evolve good solutions. The number of generations is set to 50 and the maximum tree depth to 17, which restricts irrelevant and redundant features and also limits the depth of GP programs. The mutation and crossover probabilities are 10% and 90% respectively, which balance exploration and exploitation during the evolution. The parameter α in Formula (4) is set to 0.01, as discussed in Section 5.4.

4.2. Benchmark techniques


To verify the effectiveness of the proposed MFRFS methods, we use five existing feature
selection methods as benchmarks to compare, including ReliefF feature selection method
(Kira & Rendell, 1992), linear forward feature selection method (LFS) (Gutlein et al., 2009)
and three GSK-related methods (Agrawal, Ganesh, & Mohamed, 2021b; Agrawal, Ganesh,
Oliva, et al., 2022).

4.2.1. ReliefF
ReliefF (Kira & Rendell, 1992) is an extension of Relief algorithm. Relief can only be used
to solve binary classification problems. Both Relief and ReliefF are feature weighting algo-
rithms, which assign different weights to features according to the correlation between
each feature and class, and features with a weight less than a certain threshold will be
deleted. The weight of a feature is proportional to its classification ability. ReliefF can be
applied to solve multi-class classification problems. First, one sample is randomly selected
from the dataset, and then k nearest neighbour samples are selected from all classes. Then,
the weight of all features is calculated, and the operation is repeated n times. Finally, the
final weight of each feature is obtained and the features are ranked according to their
weights. To facilitate comparisons with the proposed MFRFS, the same number of features
as MFRFS is selected on all datasets by ReliefF.
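A simplified sketch of the underlying Relief weight update for the binary case (ReliefF generalises this to k neighbours and multiple classes); this is an illustration, not the benchmark implementation used in the paper, and it assumes numeric features scaled to [0, 1] with at least two samples per class:

```python
import random

def relief_weights(X, y, n_iters=100, seed=0):
    """Binary Relief: reward features that differ on the nearest miss
    (different class) and penalise those that differ on the nearest hit
    (same class). Returns one weight per feature."""
    rng = random.Random(seed)
    n_feat = len(X[0])
    w = [0.0] * n_feat

    def dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    for _ in range(n_iters):
        i = rng.randrange(len(X))
        hits = [j for j in range(len(X)) if j != i and y[j] == y[i]]
        misses = [j for j in range(len(X)) if y[j] != y[i]]
        h = min(hits, key=lambda j: dist(X[i], X[j]))    # nearest hit
        m = min(misses, key=lambda j: dist(X[i], X[j]))  # nearest miss
        for f in range(n_feat):
            w[f] += (abs(X[i][f] - X[m][f]) - abs(X[i][f] - X[h][f])) / n_iters
    return w
```

A feature that separates the classes accumulates a positive weight, while an uninformative constant feature stays at zero, which is the basis for the threshold-based deletion described above.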

4.2.2. LFS
LFS (Gutlein et al., 2009) starts from an empty set and adds features one by one. It restricts
the number of features in each step of the forward selection. When the evaluation criteria
show that its performance remains unchanged or decreases after adding a feature, LFS will
be terminated. LFS is proved to be faster than standard forward selection and can obtain
good results (Gutlein et al., 2009). To compare fairly with the proposed MFRFS method, Correlation is also selected as the evaluation criterion.
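A greedy forward-selection sketch in the spirit of LFS, assuming a generic `merit_fn` such as the Correlation criterion of Section 2.2; the way LFS restricts its candidate list is simplified here to taking a fixed number of remaining features per step:

```python
def forward_selection(feature_ids, merit_fn, max_per_step=5):
    """Start from the empty set; at each step score a limited number of
    candidate features and stop as soon as adding a feature no longer
    improves the merit. merit_fn(feature_list) -> float is assumed to be
    positive for useful subsets."""
    selected, best = [], 0.0
    remaining = list(feature_ids)
    while remaining:
        candidates = remaining[:max_per_step]      # restricted candidate list
        score, f = max((merit_fn(selected + [f]), f) for f in candidates)
        if score <= best:
            break                                  # no improvement: terminate
        selected.append(f)
        remaining.remove(f)
        best = score
    return selected
```

The early-termination rule is what makes this family of methods faster than exhaustive forward selection: most steps evaluate only a handful of subsets.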

4.2.3. GSK related methods


Gaining-sharing knowledge-based optimisation algorithm (GSK) (Agrawal, Ganesh, &
Mohamed, 2021b) is a human-related metaheuristic algorithm. The basic principle of GSK is
to simulate the process of human gaining and sharing knowledge in the life span. Recently,
GSK has been used for feature selection and proved to be effective, so three GSK-related
methods including BGSK (Agrawal, Ganesh, & Mohamed, 2021b), pBGSK (Agrawal, Ganesh,
& Mohamed, 2021b) and PbGSK-V4 (Agrawal, Ganesh, Oliva, et al., 2022) were selected as
benchmarks for comparison.

5. Experimental results and discussion


To verify the effectiveness of our proposed methods, four experiments are done. (1) FRFS
is compared with two baselines, i.e. original features (F0) and FS, to verify whether FRFS
can get higher classification accuracy with fewer features. (2) MFRFS is compared with FRFS
to verify whether the multi-criteria fitness function used by MFRFS can maintain the clas-
sification performance and reduce the number of features. (3) MFRFS is compared with
five benchmarks, i.e. ReliefF, LFS, BGSK, pBGSK and PbGSK-V4, to verify whether MFRFS can
achieve better classification performance in terms of accuracy and number of features. (4)
The parameter α used in MFRFS is discussed to determine an appropriate parameter setting.
To justify the significance of various feature selection methods, the performance differ-
ences between paired feature selection methods are shown on the boxplots in Figures 1
and 2. The post-hoc analysis for the Friedman test (Hollander & Wolfe, 1999), abbreviated as the Post-Friedman test, is also marked on the boxplots using different colours. If the Post-Friedman test achieves p-value ≤ 0.05 or 0.05 < p-value ≤ 0.4, the box plots are marked green or yellow respectively, which shows that the performance differences between the paired feature processing methods are significantly different or borderline significantly different. Figures 1 and 2 show whether one of the paired methods is significantly better than the other.

Figure 1. The boxplots of performance differences between FRFS and two baselines marked by Post-Friedman test: (a) KNN, (b) C4.5, (c) NB.

Figure 2. The boxplots of performance differences between MFRFS and three benchmarks marked by Post-Friedman test: (a) KNN, (b) C4.5, (c) NB.
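For reference, the Friedman statistic underlying this test can be computed as follows. This is a sketch over an accuracy matrix of n datasets (rows) by k methods (columns), with higher scores better and ties sharing the average rank; the post-hoc comparison step is omitted.

```python
def friedman_statistic(scores):
    """Friedman chi-square over n datasets and k methods.
    Returns (chi-square value, average rank per method)."""
    n, k = len(scores), len(scores[0])
    rank_sums = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda j: -row[j])   # best method first
        ranks = [0.0] * k
        i = 0
        while i < k:
            j = i
            while j + 1 < k and row[order[j + 1]] == row[order[i]]:
                j += 1                    # extend the tie group
            avg = (i + j) / 2 + 1         # average rank of positions i..j
            for t in range(i, j + 1):
                ranks[order[t]] = avg
            i = j + 1
        for m in range(k):
            rank_sums[m] += ranks[m]
    avg_ranks = [r / n for r in rank_sums]
    chi2 = 12 * n / (k * (k + 1)) * sum(r * r for r in avg_ranks) - 3 * n * (k + 1)
    return chi2, avg_ranks
```

With k = 2 methods this reduces to the sign test; for k = 2 and one method winning on all n datasets, the statistic equals n.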

5.1. Comparison between FRFS and F0, FS


In this section, experiments comparing FRFS with the two baselines F0 and FS were conducted, and the results are shown in Table 3. In the table, A-KNN, A-C4.5 and A-NB represent the average classification accuracy of 30 independent experiments using KNN, C4.5 and NB respectively; #F-KNN, #F-C4.5 and #F-NB denote the corresponding average numbers of selected features.

Table 3. The experimental results of FRFS, F0 and FS.

Dataset                      Methods  A-KNN  #F-KNN  A-C4.5  #F-C4.5  A-NB   #F-NB
arrhythmia                   F0       61.33  262     65.67   262      61.67  262
                             FS       64.17  14.6    65.92   14.6     65.92  14.6
                             FRFS     64.42  10.5    66.17   10.5     66.75  11.6
audiology                    F0       60.89  69      78.67   69       61.56  69
                             FS       66.89  5.8     80      5.8      78.67  5.8
                             FRFS     70.22  4.9     80      4.9      80     4.9
climate-simulation-craches   F0       92.01  20      88.93   20       88.87  20
                             FS       91.19  2.9     91.76   2.9      86.86  2.9
                             FRFS     91.51  2.4     92.2    2        89.12  2.3
hepatitis                    F0       85.56  19      80.67   19       85.33  19
                             FS       82     6.9     81.56   6.9      82.67  6.9
                             FRFS     82.22  4.6     82      3.3      83.78  5.3
house-vote                   F0       91.59  16      95.13   16       89.97  16
                             FS       95.74  1       95.63   1        95.63  1
                             FRFS     95.74  1       95.63   1        95.63  1
Lymphoma                     F0       80     4026    86      4026     80     4026
                             FS       80     19.7    90.67   19.7     82     19.7
                             FRFS     80     11.1    90      9.8      84     10.8
planning-relax               F0       66.85  12      66.3    12       60.93  12
                             FS       67.78  4.5     69.07   4.5      67.59  4.5
                             FRFS     68.52  2.7     70.93   2.1      68.89  2.4
robot-failure-lp4            F0       50.3   90      66.36   90       80     90
                             FS       56.36  27.6    75.15   27.6     80.61  27.6
                             FRFS     61.82  11.7    76.97   11.9     83.03  16.2
robot-failure-lp5            F0       41.43  90      44.29   90       59.05  90
                             FS       42.14  26.6    45      26.6     61.9   26.6
                             FRFS     46.9   15.4    46.43   19.7     62.86  21.8
seismic-bumps                F0       92.97  18      93      18       86.61  18
                             FS       92.83  5       93.26   5        88.15  5
                             FRFS     93.06  3.5     93.41   2.3      90.43  2.3
spambase                     F0       87.36  57      90.47   57       79.6   57
                             FS       89.49  10.8    90.32   10.8     87.21  10.8
                             FRFS     89.44  10.2    90.27   10.4     86.49  9.8
spect-heart                  F0       66.15  22      66.41   22       67.31  22
                             FS       71.67  3.2     71.03   3.2      70.77  3.2
                             FRFS     72.18  2.9     71.03   2.7      71.92  2.6
statlog-heart                F0       81.85  13      74.81   13       83.33  13
                             FS       80.37  5.9     77.04   5.9      83.33  5.9
                             FRFS     80.62  4.2     77.16   4        82.22  5.3
statlog-image-segmentation   F0       92.19  19      93.32   19       80.66  19
                             FS       93.22  5.7     93.36   5.7      82.74  5.7
                             FRFS     93.28  4.5     93.45   4.9      84.43  3.9
thoracic-surgery             F0       83.19  16      84.61   16       64.75  16
                             FS       84.75  4.4     85.11   4.4      83.83  4.4
                             FRFS     84.96  2.7     85.11   2.2      84.4   2.9
The box plots of performance differences between FRFS and two baselines marked by
Post-Friedman test are shown in Figure 1. From the Friedman significance test, we can see
that there are significant differences between FRFS and two baselines. FRFS is significantly
better than F0 in all learning algorithms and is borderline significantly better than FS in all
learning algorithms.
As shown in Table 3, FRFS can achieve smaller number of features than F0 and FS in all
learning algorithms. FRFS can achieve better classification accuracy than F0 and FS in C4.5

on 12 out of 15 datasets, can achieve better classification accuracy than F0 and FS in NB on


11 out of 15 datasets, and can achieve better classification accuracy than F0 and FS in KNN
on 10 out of 15 datasets.
FS can select several features with stronger discrimination ability than the original features on datasets such as audiology, planning-relax, seismic-bumps, statlog-image-segmentation and thoracic-surgery. However, compared with FS, FRFS can further reduce the number of features while maintaining the classification accuracy, which shows that the feature subset selected by FS is not optimal. For example, on the high-dimensional dataset Lymphoma, FS not only selects about 20 features from the original 4026 but also improves the classification accuracy; FRFS then selects about 10 of those 20 features and improves the accuracy further. Similar results are obtained on the robot-failure-lp4 dataset, where the number of features selected by FRFS is reduced by nearly 50% compared with FS.
The experiments in this section confirm our assumption: the more often an original feature appears in the GP individuals' terminal nodes, the more valuable that feature is. Based on this assumption, feature ranking helps the feature selection method further reduce the number of features while maintaining the classification accuracy.
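The ranking step itself is straightforward to sketch. Assuming each preserved best-fitness individual is reduced to the list of feature indices at its terminal nodes (the real system walks GP trees), counting occurrences gives the ranking:

```python
# Sketch of the feature-ranking idea: count how often each original
# feature appears in the terminal nodes of the best-fitness GP
# individuals, then rank features by that count. Individuals are
# represented here as flat lists of terminal feature indices.
from collections import Counter

def rank_features(best_individuals):
    counts = Counter()
    for terminals in best_individuals:
        counts.update(terminals)
    # most_common returns (feature_index, occurrences), best first
    return counts.most_common()

# Hypothetical best individuals preserved after the 30th generation
best = [[14, 250, 109], [14, 250, 182], [14, 109, 212], [250, 14, 7]]
print(rank_features(best))
# Feature 14 occurs most often, so it ranks first
```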

5.2. Comparison between FRFS and MFRFS


In this section, MFRFS is compared with FRFS. The experimental results are shown in Table 4. As shown in Table 4, MFRFS further reduces the number of features compared with FRFS, and maintains or even improves the classification accuracy under the C4.5, KNN and NB learning algorithms on 12, 9 and 8 datasets respectively.
MFRFS further reduces the number of features by more than 50% on the arrhythmia dataset. On Lymphoma, MFRFS not only selects one-third as many features as FRFS but also improves the classification accuracy, especially under the NB learning algorithm. As the experimental results show, MFRFS selects a single feature on the climate-simulation-craches, planning-relax and thoracic-surgery datasets, which is enough to obtain equal or higher classification accuracy than FRFS.
The experimental results show that the multi-criteria fitness function can restrict redundant features as GP's terminal nodes, and thus decrease the number of selected features while maintaining the classification performance. However, an appropriate value of the parameter α in Formula 4 must be set; the effect of α is discussed in Section 5.4.
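A minimal sketch of such a multi-criteria fitness, assuming Formula 4 combines a correlation-based merit with a penalty of α per distinct terminal feature (the exact form is defined earlier in the paper; `merit` here is simply a given number):

```python
# Hedged sketch of a multi-criteria fitness in the spirit of Formula 4:
# a merit term minus a penalty proportional to the number of distinct
# features used as terminal nodes. The merit value and the penalty
# shape are illustrative assumptions, not the paper's exact formula.

def multi_criteria_fitness(merit, terminal_features, alpha=0.01):
    n_distinct = len(set(terminal_features))
    return merit - alpha * n_distinct

# Two individuals with equal merit: the one using fewer distinct
# features receives the higher fitness, discouraging redundancy.
lean    = multi_criteria_fitness(0.80, [14, 250, 14])        # 2 distinct features
bloated = multi_criteria_fitness(0.80, [14, 250, 109, 182])  # 4 distinct features
print(lean > bloated)  # → True
```

With equal merit, the penalty alone decides: individuals that reuse a small set of features out-rank those that spread over many, which is exactly the pressure that shrinks the selected subsets in Table 4.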

5.3. Comparison between MFRFS and benchmarks


In this section, MFRFS is compared with five benchmark techniques, i.e. ReliefF, LFS, BGSK, pBGSK and PbGSK-V4. To make a fair comparison with our proposed method, in the GSK-related methods the parameter MAXNFE is set to 25,000 and the number of generations is set to 500 when the dimension ≤ 20 and to 250 when the dimension > 20. The experimental results are shown in Table 5. The box plots of performance differences between MFRFS and the benchmarks are shown in Figure 2.
1158 G. LIU ET AL.

Table 4. The experimental results of FRFS and MFRFS.


Dataset Methods A-KNN #F-KNN A-C4.5 #F-C4.5 A-NB #F-NB
arrhythmia FRFS 64.42 10.5 66.17 10.5 66.75 11.6
MFRFS 64.25 4.8 64 4.3 67.25 4.7
audiology FRFS 70.22 4.9 80 4.9 80 4.9
MFRFS 72 3.5 81.33 3.3 81.11 3.2
climate-simulation-craches FRFS 91.51 2.4 92.2 2 89.12 2.3
MFRFS 91.89 1 92.2 1 92.45 1
hepatitis FRFS 82.22 4.6 82 3.3 83.78 5.3
MFRFS 82 3.2 83.56 2.1 83.56 2.8
house-vote FRFS 95.74 1 95.63 1 95.63 1
MFRFS 95.74 1 95.63 1 95.63 1
Lymphoma FRFS 80 11.1 90 9.8 84 10.8
MFRFS 79.33 3.6 92 3.3 92.67 3.3
planning-relax FRFS 68.52 2.7 70.93 2.1 68.89 2.4
MFRFS 67.04 1 71.67 1 71.67 1
robot-failure-lp4 FRFS 61.82 11.7 76.97 11.9 83.03 16.2
MFRFS 66.06 3.9 78.18 5 77.27 6
robot-failure-lp5 FRFS 46.9 15.4 46.43 19.7 62.86 21.8
MFRFS 53.33 4.3 46.9 5.5 60.48 5.3
seismic-bumps FRFS 93.06 3.5 93.41 2.3 90.43 2.3
MFRFS 93.32 1.1 93.41 1 91.82 1.1
spambase FRFS 89.44 10.2 90.27 10.4 86.49 9.8
MFRFS 89.33 6.5 89.43 6.4 84.61 5.8
spect-heart FRFS 72.18 2.9 71.03 2.7 71.92 2.6
MFRFS 72.31 1.9 68.72 1.8 71.41 1.8
statlog-heart FRFS 80.62 4.2 77.16 4 82.22 5.3
MFRFS 80.99 3.2 78.4 3.2 81.11 4.3
statlog-image-segmentation FRFS 93.28 4.5 93.45 4.9 84.43 3.9
MFRFS 93.85 3 93.78 3 84.39 3
thoracic-surgery FRFS 84.96 2.7 85.11 2.2 84.4 2.9
MFRFS 84.75 1 85.11 1 85.04 1

5.3.1. Comparison between MFRFS and ReliefF


Like MFRFS, ReliefF also ranks the features. In the ReliefF experiments, we select the same number of features as MFRFS, to verify whether MFRFS achieves higher classification accuracy when both methods use the same number of features.
On the statlog-image-segmentation dataset, both MFRFS and ReliefF select three features, but the accuracy of MFRFS is 10% higher than that of ReliefF. MFRFS and ReliefF achieve comparable classification accuracy with only one feature on the climate-simulation-craches, seismic-bumps, thoracic-surgery and planning-relax datasets. On the arrhythmia, hepatitis, robot-failure-lp4, robot-failure-lp5, spambase and statlog-heart datasets, MFRFS achieves better classification accuracy than ReliefF with the same number of features. The experimental results in Table 5 indicate that MFRFS is more robust than ReliefF. The box plots marked by the Post-Friedman test in Figure 2 show that MFRFS is significantly better than ReliefF under the KNN and NB learning algorithms, and borderline significantly better under C4.5. In general, the experimental results show that MFRFS obtains better classification results than ReliefF.
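For reference, the Relief idea underlying ReliefF can be sketched in a few lines. This simplified version uses a single nearest hit and a single nearest miss per sample on toy data; the experiments above use the full ReliefF algorithm on the UCI datasets.

```python
# Minimal Relief-style feature weighting (one nearest hit and one
# nearest miss per sample). Features that differ on the nearest miss
# are rewarded; features that differ on the nearest hit are penalised.
# The data below is a toy two-class set, purely illustrative.

def relief_weights(X, y):
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for i in range(n):
        hit = miss = None
        hit_d = miss_d = float("inf")
        for j in range(n):
            if j == i:
                continue
            dist = sum((X[i][f] - X[j][f]) ** 2 for f in range(d))
            if y[j] == y[i] and dist < hit_d:
                hit, hit_d = X[j], dist
            elif y[j] != y[i] and dist < miss_d:
                miss, miss_d = X[j], dist
        for f in range(d):
            w[f] += abs(X[i][f] - miss[f]) - abs(X[i][f] - hit[f])
    return w

# Feature 0 separates the two classes; feature 1 is noise
X = [[0.0, 0.3], [0.1, 0.9], [1.0, 0.8], [0.9, 0.2]]
y = [0, 0, 1, 1]
print(relief_weights(X, y))  # feature 0 gets the larger weight
```

Ranking features by these weights and taking the top k gives the ReliefF baseline configuration used in the comparison, with k set to the number of features MFRFS selects.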

5.3.2. Comparison between MFRFS and LFS


The comparison results between MFRFS and LFS are shown in Table 5. In general, LFS achieves comparable or better classification accuracy than MFRFS. The

Table 5. Experimental results of different feature selection methods.


Dataset Methods A-KNN #F-KNN A-C4.5 #F-C4.5 A-NB #F-NB
MFRFS 64.25 4.8 64 4.3 67.25 4.7
ReliefF 59.83 4.8 62.08 4.3 59.75 4.7
arrhythmia LFS 64.58 18.8 68.25 18.8 66.42 18.8
BGSK 61.64 15.9 59.72 15.9 56.61 15.9
pBGSK 60.5 20 58.83 20 54.53 20
PbGSK-V4 61.89 1.2 61.72 1.2 63.06 1.2
MFRFS 72 3.5 81.33 3.3 81.11 3.2
ReliefF 75.78 3.5 80 3.3 79.11 3.2
audiology LFS 67.33 15.3 79.33 15.3 76.89 15.3
BGSK 67.93 5.5 80.96 5.5 77.04 5.5
pBGSK 67.26 7 80.67 7 75.93 7
PbGSK-V4 39.56 1 39.56 1 39.56 1
MFRFS 91.89 1 92.2 1 92.45 1
ReliefF 91.76 1 92.2 1 92.45 1
climate-simulation-craches LFS 91.19 2.9 91.76 2.9 86.86 2.9
BGSK 91.51 2.1 92.26 2.1 86.42 2.1
pBGSK 91.4 1.9 92.24 1.9 87.86 1.9
PbGSK-V4 92.05 1 92.43 1 92.45 1
MFRFS 82 3.2 83.56 2.1 83.56 2.8
ReliefF 80.67 3.2 79.11 2.1 78.89 2.8
hepatitis LFS 83.78 7.4 82 7.4 84 7.4
BGSK 80.37 4.2 80.22 4.2 83.48 4.2
pBGSK 80.74 4.1 80.22 4.1 83.56 4.1
PbGSK-V4 80 1.4 80.52 1.4 80.3 1.4
MFRFS 95.74 1 95.63 1 95.63 1
ReliefF 95.74 1 95.63 1 95.63 1
house-vote LFS 94.47 3 95.61 3 94.92 3
BGSK 95.34 1.5 95.63 1.5 95.21 1.5
pBGSK 95.45 1.3 95.53 1.3 95.4 1.3
PbGSK-V4 94.74 2.2 95.5 2.2 94.89 2.2
MFRFS 79.33 3.6 92 3.3 92.67 3.3
ReliefF 78 3.6 92.67 3.3 92.67 3.3
Lymphoma LFS 80 73.1 91.33 73.1 80 73.1
BGSK 80 1319.9 84.44 1319.9 80 1319.9
pBGSK 80 1395.5 85.11 1395.5 80 1395.5
PbGSK-V4 78 1.2 80.89 1.2 77.33 1.2
MFRFS 67.04 1 71.67 1 71.67 1
ReliefF 65.39 1 71.48 1 70.56 1
planning-relax LFS 63.52 1 72.22 1 72.04 1
BGSK 66.36 1 71.98 1 71.05 1
pBGSK 66.36 1 72.16 1 71.48 1
PbGSK-V4 68.09 1 72.16 1 71.17 1
MFRFS 66.06 3.9 78.18 5 77.27 6
ReliefF 75.76 3.9 71.21 5 48.48 6
robot-failure-lp4 LFS 57.58 29.3 75.15 29.3 81.52 29.3
BGSK 65.05 8.1 71.01 8.1 76.77 8.1
pBGSK 63.13 9.4 68.89 9.4 76.77 9.4
PbGSK-V4 68.18 1.1 67.17 1.1 66.26 1.1
MFRFS 53.33 4.3 46.9 5.5 60.48 5.3
ReliefF 46.9 4.3 42.62 5.5 39.05 5.3
robot-failure-lp5 LFS 43.81 23.4 46.43 23.4 58.33 23.4
BGSK 49.68 7.9 49.44 7.9 60.87 7.9
pBGSK 48.97 10.3 47.46 10.3 60.08 10.3
PbGSK-V4 38.41 1.2 39.92 1.2 35.32 1.2
MFRFS 93.32 1.1 93.41 1 91.82 1.1
ReliefF 93.41 1.1 93.41 1 93.14 1.1
seismic-bumps LFS 92.79 5.2 93.26 5.2 88.14 5.2
BGSK 92.86 1.8 93.35 1.8 90.7 1.8
pBGSK 92.89 1.9 93.35 1.9 90.53 1.9
(continued)

Table 5. Continued.
Dataset Methods A-KNN #F-KNN A-C4.5 #F-C4.5 A-NB #F-NB
PbGSK-V4 93.16 1.4 93.41 1.4 92.76 1.4
MFRFS 89.33 6.5 89.43 6.4 84.61 5.8
ReliefF 80.3 6.5 82.42 6.4 67.46 5.8
spambase LFS 90.07 16.7 90.81 16.7 80.68 16.7
BGSK 89.3 7.2 89.41 7.2 82.08 7.2
pBGSK 89.08 8.1 89.38 8.1 83.95 8.1
PbGSK-V4 74.44 1 74.67 1 70.19 1
MFRFS 72.31 1.9 68.72 1.8 71.41 1.8
ReliefF 71.03 1.9 68.97 1.8 70.64 1.8
spect-heart LFS 71.41 3.6 70.9 3.6 70.64 3.6
BGSK 73.21 2.3 71.32 2.3 71.75 2.3
pBGSK 72.91 2.5 71.32 2.5 71.92 2.5
PbGSK-V4 71.32 1 71.11 1 71.28 1
MFRFS 80.99 3.2 78.4 3.2 81.11 4.3
ReliefF 73.09 3.2 75.56 3.2 78.64 4.3
statlog-heart LFS 79.26 6.7 76.91 6.7 82.96 6.7
BGSK 79.96 4.3 78.48 4.3 80.82 4.3
pBGSK 80.12 4.3 78.48 4.3 80.82 4.3
PbGSK-V4 63.87 1.5 63.62 1.5 63.79 1.5
MFRFS 93.85 3 93.78 3 84.39 3
ReliefF 82.6 3 83.2 3 73.51 3
statlog-image-segmentation LFS 92.51 7.6 93.9 7.6 84.24 7.6
BGSK 93.51 3.1 94 3.1 84.05 3.1
pBGSK 93.72 3 94.01 3 84.08 3
PbGSK-V4 74.56 1.6 75.04 1.6 65.65 1.6
MFRFS 84.75 1 85.11 1 85.04 1
ReliefF 84.4 1 85.11 1 84.61 1
thoracic-surgery LFS 84.68 4.9 85.11 4.9 83.97 4.9
BGSK 85.06 1.3 85.11 1.3 85.06 1.3
pBGSK 85.11 1.3 85.11 1.3 85.01 1.3
PbGSK-V4 85.11 1 85.11 1 85.15 1

box plots marked by the Post-Friedman test in Figure 2 show that MFRFS is borderline significantly better than LFS under the KNN and NB learning algorithms; however, LFS selects more features than MFRFS.
On the arrhythmia and audiology datasets, the number of features selected by LFS is nearly four times that of MFRFS, yet on audiology LFS does not obtain better classification accuracy despite the additional features. On the robot-failure-lp4 and robot-failure-lp5 datasets, LFS selects nearly six times as many features as MFRFS. On the Lymphoma dataset, LFS uses more than 20 times as many features as MFRFS, and when NB is used as the learning algorithm, the accuracy of LFS is 10% lower than that of MFRFS. On the other datasets, MFRFS selects fewer features and achieves comparable or better classification accuracy than LFS. These results show that the dimensionality-reduction ability of LFS is weak: the features it selects still contain irrelevant and redundant features, confirming that such invalid features do not help classification. Our proposed MFRFS selects more effective features while maintaining the classification accuracy.

5.3.3. Comparison between MFRFS and GSK related methods (BGSK, pBGSK,
PbGSK-V4 )
Figure 3. Convergence graph on two datasets: (a) spambase and (b) arrhythmia.

Compared with BGSK, MFRFS selects a smaller feature subset and achieves better classification performance in most cases under all learning algorithms. On the higher-dimensional dataset Lymphoma, BGSK cannot effectively reduce the feature dimensionality, selecting more
than 1300 features. We obtain similar experimental results on pBGSK, and in most cases,
a larger feature subset is selected, especially on the Lymphoma dataset. The box plots
marked by Post-Friedman test in Figure 2 show that MFRFS is borderline significantly better
than pBGSK in KNN learning algorithm and significantly better than pBGSK in NB learning
algorithm.
As shown in Table 5, PbGSK-V4 selects the smallest number of features among the six methods. However, it cannot achieve good classification performance, especially on the audiology, robot-failure-lp5, statlog-image-segmentation and spambase datasets, where it selects fewer than two features on average but the classification accuracy is very low. Compared with MFRFS, PbGSK-V4 reduces the number of features, but in most cases the overall classification accuracy of MFRFS is better.
Figure 3 shows the convergence behaviour of the four EC-based methods on the arrhythmia and spambase datasets. As shown in Figure 3, MFRFS starts to converge in about 10 generations and obtains good solutions, whereas the GSK-related methods need more generations to converge.

5.4. The parameter setting of α in MFRFS


The parameter α in Formula 4 is used to restrict redundant and irrelevant features as GP's terminal nodes and to further reduce the number of selected features. Section 5.2 showed that MFRFS selects fewer features than FRFS. To show the impact of α on the results and to choose an appropriate value, different values of α are evaluated experimentally; the results are shown in Table 6. As Table 6 shows, the number of features selected by MFRFS decreases as α increases: the smaller the value of α, the more features are selected.
When α = 0.001, the classification performance of MFRFS is similar to that of FRFS, and α has little effect. When α = 0.005, a certain degree of dimensionality reduction is achieved. When α = 0.01, Table 6 shows that the overall performance is best. When α = 0.025, the number of features further

Table 6. Experimental results of different α values.


Dataset α A-KNN #F-KNN A-C4.5 #F-C4.5 A-NB #F-NB
0.1 59.17 1 60.92 1 61.83 1
0.05 60.33 1.7 62.5 1.7 61.42 1.7
arrhythmia 0.025 62.08 2.9 62.17 2.7 65 3.1
0.01 64.25 4.8 64 4.3 67.25 4.7
0.005 63.83 5.9 61.25 6.1 65.58 6.4
0.001 64.58 9.7 66.67 9.3 64.83 11.3
0.1 60 2 60 2 60 2
0.05 60 2 60 2 60 2
audiology 0.025 80.22 2.9 80.22 2.9 79.56 2.9
0.01 72 3.5 81.33 3.3 81.11 3.2
0.005 64.89 4 80.44 4.2 79.33 4
0.001 69.11 5.3 82.44 4.9 78.67 5.1
0.1 91.76 1 92.2 1 92.45 1
0.05 91.7 1 92.2 1 92.45 1
climate-simulation-craches 0.025 91.7 1 92.2 1 92.45 1
0.01 91.89 1 92.2 1 92.45 1
0.005 91.38 2 92.2 1.6 91.45 1.6
0.001 91.26 2.8 92.2 2 89.31 2.4
0.1 79.56 1 80.44 1 80 1
0.05 80.89 1.2 80 1.5 80.44 1.2
hepatitis 0.025 79.56 2.5 79.78 2.1 81.78 1.8
0.01 82 3.2 83.56 2.1 83.56 2.8
0.005 79.78 3 79.11 2.4 81.33 3.6
0.001 81.78 4.5 81.56 2.8 82.89 5
0.1 95.74 1 95.63 1 95.63 1
0.05 95.74 1 95.63 1 95.63 1
house-vote 0.025 95.74 1 95.63 1 95.63 1
0.01 95.74 1 95.63 1 95.63 1
0.005 95.74 1 95.63 1 95.63 1
0.001 95.74 1 95.63 1 95.63 1
0.1 79.33 1.2 89.33 1.2 91.33 1.2
0.05 77.33 1.4 90.67 1.4 90.67 1.5
Lymphoma 0.025 79.33 2.8 90.67 2.2 86 2.7
0.01 79.33 3.6 92 3.3 92.67 3.3
0.005 79.33 4.3 92 4.2 87.33 4.1
0.001 80 6.5 87.33 6.7 88.67 6.1
0.1 65.19 1 70.93 1 71.3 1
0.05 65.93 1 71.85 1 71.48 1
planning-relax 0.025 65 1 71.85 1 71.11 1
0.01 67.04 1 71.67 1 71.67 1
0.005 66.67 1 71.85 1 70.93 1
0.001 64.63 1 71.85 1 71.11 1
0.1 73.03 1.5 69.7 1.3 66.06 1.4
0.05 73.33 2.3 73.33 2.2 69.7 2.1
robot-failure-lp4 0.025 70.3 3.2 75.45 3 73.64 3.7
0.01 66.06 3.9 78.18 5 77.27 6
0.005 68.79 4.8 77.27 4.8 78.48 6.9
0.001 67.58 9.2 76.97 9.6 78.18 15
0.1 40.95 1 40.24 1 40.48 1
0.05 51.9 2.4 50.24 2.3 53.1 2.4
robot-failure-lp5 0.025 50.95 3.2 45.48 3.3 58.57 3.7
0.01 53.33 4.3 46.9 5.5 60.48 5.3
0.005 47.62 7.3 43.1 8.9 57.86 8.2
0.001 46.9 10.2 46.9 13.9 57.86 13.7
0.1 93.27 1 93.41 1 91.95 1
0.05 93.27 1 93.41 1 91.95 1
(continued)

Table 6. Continued.
Dataset α A-KNN #F-KNN A-C4.5 #F-C4.5 A-NB #F-NB
seismic-bumps 0.025 93.27 1 93.41 1 91.95 1
0.01 93.32 1.1 93.41 1 91.82 1.1
0.005 93.06 1.7 93.37 1.4 91.02 1.5
0.001 93.09 2.8 93.41 2.1 90.4 2.1
0.1 76.31 1 76.44 1 72.19 1
0.05 84.52 2 85.04 2 78.39 2
spambase 0.025 87.72 3.97 87.8 3.9 80.07 3.83
0.01 89.33 6.53 89.43 6.43 87.4 5.83
0.005 89.27 8 89.51 7.97 85.36 7.43
0.001 89.57 9.37 89.99 9.43 84.61 9
0.1 69.62 1 69.62 1 69.62 1
0.05 69.62 1 69.62 1 69.62 1
spect-heart 0.025 71.92 1.5 70.38 1.4 71.41 1.5
0.01 72.31 1.9 68.72 1.8 71.41 1.8
0.005 72.56 1.9 69.49 1.8 71.28 1.8
0.001 72.31 2.4 69.87 2 71.15 2.1
0.1 71.73 1 70.99 1 71.11 1
0.05 71.73 1.5 72.35 1.1 74.2 1.8
statlog-heart 0.025 77.41 2.8 77.16 2.6 77.16 3
0.01 80.99 3.2 78.4 3.2 81.11 4.3
0.005 79.75 4.2 77.53 4.3 81.36 4.9
0.001 78.64 3.8 76.67 4.3 81.98 5.3
0.1 82.24 1.9 82.34 1.9 72.21 1.9
0.05 84.49 2 84.47 2 73.59 2
statlog-image-segmentation 0.025 89.13 2.5 89.08 2.5 78.96 2.5
0.01 93.85 3 93.78 3 84.39 3
0.005 93.85 3 93.78 3.4 84.43 3.1
0.001 93.75 3.9 93.49 4.1 84.5 3.7
0.1 84.75 1 85.11 1 85.04 1
0.05 84.75 1 85.11 1 85.04 1
thoracic-surgery 0.025 84.75 1 85.11 1 85.04 1
0.01 84.75 1 85.11 1 85.04 1
0.005 85.11 1.5 85.11 1.2 84.89 1.6
0.001 84.96 2.6 85.11 1.9 84.26 2.5

decreases, but the average classification accuracy starts to decline to some extent. When α = 0.05 and 0.1, the number of features drops markedly, but so does the accuracy.
Therefore, the parameter α has a great impact on the classification accuracy. If α is too small, no dimensionality reduction is achieved compared with FRFS; if α is too large, effective features are removed and the classification accuracy drops. Based on these experiments, α = 0.01 best meets the goal of decreasing the number of selected features while maintaining the classification performance.
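The inverse relation between α and subset size can be illustrated with a toy stand-in for the penalized search: a feature is worth keeping only while its merit gain exceeds α. The per-feature gains below are invented values, not the paper's correlation measure.

```python
# Illustrative stand-in for the alpha/subset-size trade-off: keep a
# feature only if its merit gain exceeds the penalty alpha. The gains
# are toy values chosen to mirror the trend in Table 6.

def select_with_penalty(gains, alpha):
    return [f for f, g in enumerate(gains) if g - alpha > 0]

gains = [0.30, 0.12, 0.04, 0.008, 0.002]
for alpha in (0.001, 0.01, 0.1):
    print(alpha, len(select_with_penalty(gains, alpha)))
# Larger alpha keeps fewer features, mirroring the trend in Table 6
```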

5.5. Further discussion


In this paper, FRFS, a feature selection method combined with feature ranking using GP, is proposed to select fewer features than feature selection alone (FS). Based on a multi-criteria fitness function, MFRFS is proposed to obtain even fewer features than FRFS. Some further discussion follows.
(1) Evolutionary computation algorithms produce a large number of individuals with the best fitness. If only one of the best individuals is used for feature selection, some good

Table 7. Classification accuracy of different number of top features.


Training set Testing set
fbest accKNN accC4.5 accNB accKNN accC4.5 accNB
12 62.35 68.98 70.19 65.83 61.67 61.67
11 63.56 69.58 67.47 66.67 64.17 65.00
10 64.16 71.69 68.37 65.83 68.33 57.50
9 68.07 71.39 70.18 63.33 66.67 66.67
8 66.27 65.66 67.17 59.17 60.83 62.5
7 65.96 65.96 67.77 60.00 60.00 65.00
6 66.27 68.37 68.07 63.33 60.83 65.00
5 65.06 66.57 66.57 62.50 58.33 65.83

individuals may be lost. In this paper, the GP individuals with the best fitness after the 30th generation are preserved, and the features are ranked according to their occurrence counts in these individuals. Experiments show that feature ranking helps select more effective features. To further demonstrate how feature ranking works, we take the arrhythmia dataset as an example. The results of feature ranking are shown below. A total of 12 best features were selected; the value before each “–” is the index of the feature, and the value after it is the number of times the feature occurs in all the best individuals.
F14-60244 F250-59799 F109-58235 F182-42527 F212-38822 F195-38505
F241-37687 F7-31955 F208-26438 F38-24976 F97-23970 F10-21086
We can see that feature F14 is probably the best feature, feature F250 the second best, and so on. The purpose of feature ranking is to select the best feature set from these top 12 features.
According to the method described in Section 3.2, the classification accuracies for different numbers of top-ranked features from the fbest feature subset, on both the training and testing sets, are shown in Table 7. For the training set, the classification accuracy first increases and then decreases as the number of top features grows; it is highest when the number of top features is 9, 10 and 9 for KNN, C4.5 and NB respectively.
For the testing set, Table 7 reports the corresponding accuracies for the KNN, C4.5 and NB learning algorithms; the numbers of top-ranked features chosen on the training set also achieve the highest testing accuracy for C4.5 and NB.
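The prefix-evaluation procedure behind Table 7 can be sketched as follows: evaluate the top-k ranked features for each k and keep the best k. The leave-one-out 1-NN evaluator and the toy data below are stand-ins for the paper's KNN/C4.5/NB setup.

```python
# Sketch of choosing the best number of top-ranked features: evaluate
# each prefix of the ranking with a classifier and keep the best one.
# A leave-one-out 1-NN evaluator on toy data stands in for the paper's
# KNN/C4.5/NB evaluation.

def loo_1nn_accuracy(X, y, feats):
    correct = 0
    for i in range(len(X)):
        best_j, best_d = None, float("inf")
        for j in range(len(X)):
            if j == i:
                continue
            d = sum((X[i][f] - X[j][f]) ** 2 for f in feats)
            if d < best_d:
                best_j, best_d = j, d
        correct += y[best_j] == y[i]
    return correct / len(X)

def best_prefix(X, y, ranking):
    best_k, best_acc = 1, -1.0
    for k in range(1, len(ranking) + 1):
        acc = loo_1nn_accuracy(X, y, ranking[:k])
        if acc > best_acc:   # ties keep the smaller k (fewer features)
            best_k, best_acc = k, acc
    return ranking[:best_k], best_acc

# ranking: feature indices ordered by occurrence count (best first);
# feature 0 is discriminative, features 1 and 2 are noise
X = [[0.0, 0.2, 5.0], [0.1, 0.8, 4.0], [1.0, 0.1, 5.0], [0.9, 0.9, 4.0]]
y = [0, 0, 1, 1]
subset, acc = best_prefix(X, y, ranking=[0, 2, 1])
print(subset, acc)  # → [0] 1.0
```

Here the top-1 prefix already classifies the toy data perfectly, so the longer prefixes, which only add noise, are discarded.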
(2) Feature ranking is based on the assumption that the more often the original features appear in the GP individuals' terminal nodes, the more valuable these features are. The experiments support this assumption: GP can be used to quantify the importance of features and to select more important feature sets.
(3) MFRFS can restrict the number of irrelevant and redundant features in the GP's terminal nodes through a multi-criteria fitness function. By adjusting the parameter α, MFRFS achieves a smaller number of features than FRFS while maintaining the classification accuracy.
(4) In general, MFRFS achieves better classification performance than the five benchmarks. However, MFRFS needs to store the individuals with the best fitness, rank the features according to their occurrence counts in the GP individuals' terminal nodes, and then search for a smaller feature subset with higher classification performance from the ranked features. Therefore MFRFS has no advantage in time consumption and

computational complexity, especially on datasets with more features. In practical applications, we need to find a balance between classification performance and computational complexity.
(5) Feature selection is a key data preprocessing step that can mitigate “the curse of dimensionality”. Our proposed MFRFS greatly reduces the dimensionality of the features and obtains better classification performance. Most importantly, the feature ranking method in this paper identifies which features are more important. In industrial applications, our methods may greatly reduce a model's training time and improve its classification performance.

6. Conclusion and future work


This paper proposes a feature selection method with feature ranking (FRFS) using GP, which ranks the features according to how often they occur in the individuals with the best fitness. Building on FRFS, another feature selection method using a multi-criteria fitness function (MFRFS) is proposed to further reduce the number of selected features. Experiments on 15 datasets show that FRFS achieves better classification performance and selects fewer features than FS. Compared with FRFS, MFRFS further removes redundant and irrelevant features and selects fewer features while maintaining the classification performance. Compared with five other existing feature selection methods, MFRFS achieves better classification performance.
Feature selection is important for building an efficient classification model, and feature ranking has proved effective for feature selection. The fitness functions of the methods in this paper are based on correlation; it remains to be investigated whether the methods are effective with other evaluation criteria. Moreover, the dimensionality of the selected datasets is relatively low, so feature ranking methods on high-dimensional datasets will be investigated in future work.

Disclosure statement
No potential conflict of interest was reported by the author(s).

Funding
This work is supported by the Key R&D Program of Hebei Province, China [grant number 20327405D],
Hebei Provincial Department of Human Resources and Social Security, China [grant number
20190344], and Hebei Key Laboratory of Agricultural Big Data, China.

References
Agrawal, P., Ganesh, T., & Mohamed, A. W. (2021a). Chaotic gaining sharing knowledge-based opti-
mization algorithm: An improved metaheuristic algorithm for feature selection. Soft Computing,
25(14), 9505–9528. https://doi.org/10.1007/s00500-021-05874-3
Agrawal, P., Ganesh, T., & Mohamed, A. W. (2021b). A novel binary gaining–sharing knowledge-
based optimization algorithm for feature selection. Neural Computing and Applications, 33(11),
5989–6008. https://doi.org/10.1007/s00521-020-05375-8
Agrawal, P., Ganesh, T., Oliva, D., & Mohamed, A. W. (2022). S-shaped and V-shaped gaining-
sharing knowledge-based algorithm for feature selection. Applied Intelligence, 52 (1), 81–112.
https://doi.org/10.1007/s10489-021-02233-5

Ahmed, S., Zhang, M., & Peng, L. (2014). Improving feature ranking for biomarker discovery in pro-
teomics mass spectrometry data using genetic programming. Connection Science, 26(3), 215–243.
https://doi.org/10.1080/09540091.2014.906388
Al-Sahaf, H., Bi, Y., Chen, Q., Lensen, A., Mei, Y., Sun, Y., & Zhang, M. (2019). A survey on
evolutionary machine learning. Journal of the Royal Society of New Zealand, 49(2), 205–228.
https://doi.org/10.1080/03036758.2019.1609052
Bi, Y., Xue, B., & Zhang, M. (2021a). Genetic programming for image classification: An automated
approach to feature learning. Springer Nature.
Bi, Y., Xue, B., & Zhang, M. (2021b). Multi-objective genetic programming for feature learning
in face recognition. Applied Soft Computing, 103(4), 107152. https://doi.org/10.1016/j.asoc.2021.
107152
Canuto, A. M. P., & Nascimento, D. S. C. (2012). A genetic-based approach to features selection for ensem-
bles using a hybrid and adaptive fitness function. The 2012 International Joint Conference on Neural
Networks (IJCNN), (pp.1-8), Brisbane, QLD, Australia, June 2012.
Davis, R. A., Charlton, A. J., Oehlschlager, S., & Wilson, J. C. (2006). Novel feature selection method for
genetic programming using metabolomic 1H NMR data. Chemometrics and Intelligent Laboratory
Systems, 81(1), 50–59. https://doi.org/10.1016/j.chemolab.2005.09.006
Dheeru, D., & Karra Taniskidou, E. (2017). UCI machine learning repository. University of California,
Irvine, School of Information and Computer Sciences.
Espejo, P. G., Ventura, S., & Herrera, F. (2010). A survey on the application of genetic programming to
classification. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews),
40(2), 121–144. https://doi.org/10.1109/TSMCC.2009.2033566
Friedlander, A., Neshatian, K., & Zhang, M. (2011). Meta-learning and feature ranking using genetic
programming for classification: Variable terminal weighting. IEEE Congress on Evolutionary Compu-
tation, (pp. 5–8), New Orleans, LA, USA, June 2011.
Gutlein, M., Frank, E., Hall, M., & Karwath, A. (2009). Large-scale attribute selection using wrappers. IEEE
symposium on computational intelligence and data mining, (pp. 332–339), Nashville, TN, USA, May
2009.
Hall, M. (1999). Correlation-based feature selection for machine learning (Unpublished doctoral
dissertation). The University of Waikato.
Hall, M., & Smith, L. (1999). Feature selection for machine learning: Comparing a correlation-based fil-
ter approach to the wrapper. Twelfth International Florida Artificial Intelligence Research Society
Conference, (pp. 235–239), Orlando, Florida, USA, May 1999.
Hancer, E. (2019). Fuzzy kernel feature selection with multi-objective differential evolution algorithm.
Connection Science, 31(4), 323–341. https://doi.org/10.1080/09540091.2019.1639624
Hollander, M., & Wolfe, D. A. (1999). Nonparametric statistical methods (2nd ed.). Wiley-Interscience.
Hunt, R., Neshatian, K., & Zhang, M. (2012). A genetic programming approach to hyper-heuristic feature
selection. The 9th international conference on Simulated Evolution and Learning, (pp. 320–330),
Berlin, Heidelberg, December 2012.
Kira, K., & Rendell, L. A. (1992). The feature selection problem: Traditional methods and a new algorithm.
The tenth national conference on artificial intelligence, (pp. 129–134), San Jose California, July
1992.
Koza, J. R. (1992). Genetic programming: On the programming of computers by means of natural selection.
MIT Press.
Koza, J. R., Andre, D., Bennett, F. H., & Keane, M. A. (1999). Genetic programming III: Darwinian
invention and problem solving. IEEE Transactions on Evolutionary Computation, 7(4), 451–453.
https://doi.org/10.1162/evco.1999.7.4.451
Liang, Y., Zhang, M., & Browne, W. N. (2017). Image feature selection using genetic programming
for figure-ground segmentation. Engineering Applications of Artificial Intelligence, 62(12), 96–108.
https://doi.org/10.1016/j.engappai.2017.03.009
Lin, J. Y., Ke, H. R., Chien, B. C., & Yang, W. P. (2008). Classifier design with feature selection and feature
extraction using layered genetic programming. Expert Systems with Applications, 34(2), 1384–1393.
https://doi.org/10.1016/j.eswa.2007.01.006

Loughran, R., Agapitos, A., Kattan, A., Brabazon, A., & O’Neill, M. (2017). Feature selection
for speaker verification using genetic programming. Evolutionary Intelligence, 10(1–2), 1–21.
https://doi.org/10.1007/s12065-016-0150-5
Luke, S.. (2017). ECJ then and now. Genetic and Evolutionary Computation Conference Companion
Pages, (pp.1223–1230), Berlin Germany, July 2017.
Ma, J., & Gao, X. (2020a). Designing genetic programming classifiers with feature selection and feature
construction. Applied Soft Computing, 97(4), 106826. https://doi.org/10.1016/j.asoc.2020.106826
Ma, J., & Gao, X. (2020b). A filter-based feature construction and feature selection approach for classi-
fication using genetic programming. Knowledge-Based Systems, 196(6), 105806. https://doi.org/10.
1016/j.knosys.2020.105806
Ma, J., & Teng, G. (2019). A hybrid multiple feature construction approach using genetic programming.
Applied Soft Computing, 80(2), 687–699. https://doi.org/10.1016/j.asoc.2019.04.039
Mei, Y., Nguyen, S., Xue, B., & Zhang, M. (2017). An efficient feature selection algorithm for evolv-
ing job shop scheduling rules with genetic programming. IEEE Transactions on Emerging Topics in
Computational Intelligence, 1(5), 339–353. https://doi.org/10.1109/TETCI.2017.2743758
Mohamed, A. W., Hadi, A. A., & Mohamed, A. K. (2020). Gaining-sharing knowledge based algorithm for
solving optimization problems: A novel nature-inspired algorithm. International Journal of Machine
Learning and Cybernetics, 11(7), 1501–1529. https://doi.org/10.1007/s13042-019-01053-x
Muni, D. P., Pal, N. R., & Das, J. (2006). Genetic programming for simultaneous feature selection
and classifier design. IEEE Transactions on Systems, Man and Cybernetics, Part B (Cybernetics), 36(1),
106–117. https://doi.org/10.1109/TSMCB.2005.854499
Nagra, A. A., Han, F., Ling, Q. H., Abubaker, M., Ahmad, F., Mehta, S., & Apasiba, A. T.
(2020). Hybrid self-inertia weight adaptive particle swarm optimisation with local search using
C4.5 decision tree classifier for feature selection problems. Connection Science, 32(1), 16–36.
https://doi.org/10.1080/09540091.2019.1609419
Neshatian, K. (2010). Feature manipulation with genetic programming (Unpublished doctoral disser-
tation). Victoria University of Wellington.
Neshatian, K., & Zhang, M. (2009a). Genetic programming for feature subset ranking in binary classifi-
cation problems. The 12th European Conference on Genetic Programming, (pp. 121–132), Berlin,
Heidelberg, April 2009.
Neshatian, K., & Zhang, M. (2009b). Pareto front feature selection: Using genetic programming to explore
feature space. The 11th annual genetic and evolutionary computation conference, (pp. 1027–1034),
Montreal, Québec, Canada, July 2009.
Neshatian, K., & Zhang, M. (2009c). Unsupervised elimination of redundant features using genetic pro-
gramming. Australasian joint conference on artificial intelligence, (pp. 432–442), Berlin, Heidelberg,
November 2009.
Neshatian, K., & Zhang, M. (2011). Using genetic programming for context-sensitive feature scoring
in classification problems. Connection Science, 23(3), 183–207. https://doi.org/10.1080/09540091.
2011.630065
Neshatian, K., & Zhang, M. (2012). Improving relevance measures using genetic programming. European
conference on genetic programming, (pp. 97–108), Berlin, Heidelberg, April 2012.
Papa, J. P., Rosa, G. H., & Papa, L. P. (2017). A binary-constrained geometric semantic genetic program-
ming for feature selection purposes. Pattern Recognition Letters, 100(24), 59–66. https://doi.org/10.
1016/j.patrec.2017.10.002
Patel, S. P., & Upadhyay, S. H. (2020). Euclidean distance based feature ranking and sub-
set selection for bearing fault diagnosis. Expert Systems with Applications, 154(8), 113400.
https://doi.org/10.1016/j.eswa.2020.113400
Peng, B., Wan, S., Bi, Y., Xue, B., & Zhang, M. (2020). Automatic feature extraction and construction
using genetic programming for rotating machinery fault diagnosis. IEEE Transactions on Cybernet-
ics, (99), 1–15. https://doi.org/10.1109/TCYB.2020.3032945
Purohit, A., Chaudhari, N., & Tiwari, A. (2010). Construction of classifier with feature selection based on
genetic programming. Proceedings of the IEEE congress on evolutionary computation.
Ribeiro, I., Andrade, G., Viegas, F., Madeira, D., Rocha, L., Salles, T., & Gonçalves, M. (2012). Aggres-
sive and effective feature selection using genetic programming. Proceedings of the IEEE congress on
evolutionary computation.
Sreeja, N. K. (2019). A weighted pattern matching approach for classification of imbalanced
data with a fireworks-based algorithm for feature selection. Connection Science, 31(2), 143–168.
https://doi.org/10.1080/09540091.2018.1512558
Tan, F., Fu, X., Zhang, Y., & Bourgeois, A. G. (2008). A genetic algorithm-based method for feature
subset selection. Soft Computing, 12(2), 111–120. https://doi.org/10.1007/s00500-007-0193-8
Too, J., & Abdullah, A. R. (2020). Binary atom search optimisation approaches for feature selection.
Connection Science, 32(4), 406–430. https://doi.org/10.1080/09540091.2020.1741515
Too, J., Sadiq, A. S., & Mirjalili, S. M. (2021). A conditional opposition-based particle swarm optimisa-
tion for feature selection. Connection Science, 77(3), 1–23. https://doi.org/10.1080/09540091.2021.
2002266
Vergara, J. R., & Estévez, P. A. (2014). A review of feature selection methods based on mutual infor-
mation. Neural Computing and Applications, 24(1), 175–186. https://doi.org/10.1007/s00521-013-
1368-0
Viegas, F., Rocha, L., Gonçalves, M., Mourão, F., Sá, G., Salles, T., & Sandin, I. (2018). A genetic program-
ming approach for feature selection in highly dimensional skewed data. Neurocomputing, 273(3),
554–569. https://doi.org/10.1016/j.neucom.2017.08.050
Xue, B., Zhang, M., & Browne, W. (2013). Particle swarm optimization for feature selection in
classification: A multi-objective approach. IEEE Transactions on Cybernetics, 43(6), 1656–1671.
https://doi.org/10.1109/TSMCB.2012.2227469
Xue, B., Zhang, M., & Browne, W. N. (2014). Particle swarm optimisation for feature selection in clas-
sification: Novel initialisation and updating mechanisms. Applied Soft Computing, 18(C), 261–276.
https://doi.org/10.1016/j.asoc.2013.09.018