A Feature Selection Method With Feature Ranking Using Genetic Programming
To cite this article: Guopeng Liu, Jianbin Ma, Tongle Hu & Xiaoying Gao (2022) A feature
selection method with feature ranking using genetic programming, Connection Science, 34:1,
1146-1168, DOI: 10.1080/09540091.2022.2049702
China; b College of Plant Protection, Hebei Agricultural University, Baoding, People’s Republic of China;
c School of Engineering and Computer Science, Victoria University of Wellington, Wellington, New Zealand
1. Introduction
Feature selection (Sreeja, 2019; Too & Abdullah, 2020) selects effective feature subsets
from high-dimensional original features and is one of the key issues in machine learning.
High-quality features play a key role in building an efficient model, while irrelevant or
redundant features may cause difficulties (Xue et al., 2013). Therefore, feature selection
methods have been extensively applied to practical classification tasks (Espejo et al., 2010;
Liang et al., 2017; Loughran et al., 2017; Mei et al., 2017; Patel & Upadhyay, 2020). Recently,
the Gaining Sharing Knowledge based Algorithm (GSK) (Mohamed et al., 2020) was proposed
for feature selection and achieved good performance (Agrawal, Ganesh, & Mohamed, 2021a,
2021b; Agrawal, Ganesh, Oliva, et al., 2022). Evolutionary computation (EC) algorithms
(Al-Sahaf et al., 2019; Hancer, 2019) have been widely used for feature selection due to
their global search ability (Ma & Gao, 2020b; Xue et al., 2014). EC algorithms are population
based and need to initialise individuals randomly. Sometimes, the genetic operators of the
genetic algorithm (GA) and genetic programming (GP), or the updating strategy of particle
swarm optimisation (PSO) (Nagra et al., 2020; Too et al., 2021), are not sufficient to delete
invalid features, so a small number of invalid features still remain when the algorithms
terminate.
GP can be used for feature selection due to its global search ability and has been shown to
achieve good classification performance. However, during the evolution of GP, a large number
of individuals with the same best fitness are generated. GP outputs only the best individual,
which may lose some good individuals. Moreover, the generated features in the output
individual may still contain redundant features.
Feature ranking (Ahmed et al., 2014; Friedlander et al., 2011; Neshatian, 2010) ranks the
original features based on specific evaluation criteria and is usually a step of feature
selection. It is employed to find out which features or feature sets are more important.
This paper therefore proposes a GP-based feature selection method combined with feature
ranking (FRFS), which considers the large number of best individuals generated during the
evolution of GP instead of only one best individual. A correlation-based evaluation criterion
is used as the fitness function. It is assumed that the more often the original features
appear in the GP individuals' terminal nodes, the more valuable these features are. Therefore,
the individuals with the best fitness are stored. The occurrences of features in the
individuals' terminal nodes are counted and the top β features are selected as the candidate
feature subset. To decrease the number of selected features while maintaining the
classification performance, another feature selection method that combines FRFS with a
multi-criteria fitness function is investigated. The motivation of this algorithm is that,
among individuals with the same correlation value, those with fewer features are assigned
higher fitness values.
The overall objective of this paper is to propose a feature selection algorithm that
combines feature ranking with a multi-criteria fitness function, and to verify the
improvement brought by feature ranking and the dimensionality reduction effect of the
multi-criteria fitness function. To achieve the overall objective, the following three
objectives will be investigated.
Objective 1. Propose a feature selection method combined with feature ranking using
GP (FRFS) and verify whether it can achieve better classification performance with a smaller
number of features than feature selection alone (FS).
Objective 2. Investigate a feature selection method that combines FRFS with a multi-criteria
fitness function, named MFRFS, and verify whether the multi-criteria fitness function can
reduce the number of features of FRFS while maintaining the classification performance.
Objective 3. Investigate whether our proposed MFRFS can achieve better classification
performance than five benchmark techniques: the ReliefF feature selection method
(Kira & Rendell, 1992), the linear forward feature selection method (LFS) (Gutlein et al., 2009)
and three GSK-related methods (Agrawal, Ganesh, & Mohamed, 2021b; Agrawal, Ganesh,
Oliva, et al., 2022).
The rest of this paper is arranged as follows. Section 2 outlines the background infor-
mation of this paper. Section 3 describes in detail the three GP-based feature selection
methods. Section 4 shows the experimental design. Section 5 presents the experimental
results and discussion. Section 6 is the conclusion and future work.
2. Related works
2.1. Genetic programming (GP)
GP is one of the evolutionary computation algorithms (Koza, 1992; Koza et al., 1999). It
is very similar to the genetic algorithm (GA); the main difference between GP and GA is the
representation of individuals. Due to its flexible representation, GP can be used to
construct high-level features (Ma & Gao, 2020a, 2020b), construct classifiers (Muni
et al., 2006; Neshatian, 2010; Neshatian & Zhang, 2011) and solve practical industrial
problems (Bi et al., 2021a, 2021b; Peng et al., 2020). The commonly used representation of
GP is based on a tree structure. The terminal nodes (constants and variables) are randomly
selected from a terminal set, and the functions (mathematical and logical operators) are
randomly selected from a function set to constitute internal nodes. GP randomly initialises
the first generation of individuals and evaluates each individual using a fitness function.
Genetic operators including selection, crossover and mutation are then performed to produce
the next generation's population. This step is iterated until the termination criterion is
reached, and GP outputs the optimal individual. For feature selection, the terminal nodes
contain the features selected by GP.
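The tree representation above can be sketched directly. The following is an illustrative toy (not the paper's ECJ setup), using nested tuples for a GP individual whose internal nodes are arithmetic operators and whose terminals are original features; the selected features are simply the distinct terminals:

```python
# Individual for (F3 + F7) * (F3 - F12), written as nested tuples:
# internal nodes are operators, leaves are original features.
individual = ("*", ("+", "F3", "F7"), ("-", "F3", "F12"))

def terminal_features(node):
    """Collect the distinct feature terminals of a GP tree."""
    if isinstance(node, str):          # terminal node: an original feature
        return {node}
    features = set()
    for child in node[1:]:             # node[0] is the operator
        features |= terminal_features(child)
    return features

print(sorted(terminal_features(individual)))  # ['F12', 'F3', 'F7']
```

The duplicated terminal F3 is counted once: the individual encodes the feature subset {F3, F7, F12}.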
2.2. Correlation
The correlation-based feature evaluation criterion was proposed by Hall (1999); it takes
into account both the correlation between features and classes and the correlation
between features. This criterion has been shown to select feature subsets with low
redundancy and high discrimination (Hall & Smith, 1999; Ma & Gao, 2020b), and is
adopted as the evaluation criterion in this paper. The correlation of a feature subset is
calculated as follows.
CorrelationS = k·Cfc / √(k + k(k − 1)·Cff)    (1)
where CorrelationS is the correlation value of a set S containing k features, Cfc denotes the
average correlation between each feature and the class, and Cff denotes the average
correlation between each pair of features. These pairwise correlations are calculated by
formula (2), in which H(X) and H(Y) are the information entropies of X and Y respectively
and H(X | Y) is the conditional entropy.
CXY = 2.0 × (H(X) − H(X | Y)) / (H(X) + H(Y))    (2)
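The two formulas can be sketched in a few lines. This is a minimal sketch, assuming discrete (or discretised) feature values and non-constant variables; Cfc and Cff would be averages of the pairwise CXY values, and the denominator of formula (1) is taken with the square root of the standard CFS merit (Hall, 1999):

```python
import math
from collections import Counter

def entropy(xs):
    """Shannon entropy of a discrete sequence, in bits."""
    n = len(xs)
    return -sum(c / n * math.log2(c / n) for c in Counter(xs).values())

def symmetrical_uncertainty(x, y):
    """Formula (2): CXY = 2 * (H(X) - H(X|Y)) / (H(X) + H(Y))."""
    h_x, h_y = entropy(x), entropy(y)
    h_xy = entropy(list(zip(x, y)))   # joint entropy H(X, Y)
    h_x_given_y = h_xy - h_y          # H(X|Y) = H(X,Y) - H(Y)
    return 2.0 * (h_x - h_x_given_y) / (h_x + h_y)

def correlation_merit(k, c_fc, c_ff):
    """Formula (1): merit of a k-feature subset from the average
    feature-class correlation c_fc and feature-feature correlation c_ff."""
    return k * c_fc / math.sqrt(k + k * (k - 1) * c_ff)
```

For example, a feature identical to the class label gives symmetrical uncertainty 1.0, and a single feature (k = 1) has merit equal to its feature-class correlation.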
Feature selection methods are generally divided into filter-based (Neshatian & Zhang,
2009b, 2012; Purohit et al., 2010; Ribeiro et al., 2012) and wrapper-based (Hunt et al.,
2012) methods. The filter-based feature selection methods use information measures such as
mutual information (Vergara & Estévez, 2014), information gain (Neshatian
& Zhang, 2012), consistency and correlation (Hall, 1999; Neshatian & Zhang, 2009c) as the
evaluation criterion, and need less running time than wrapper-based methods. Moreover,
the models produced by filter-based methods are more general. So, filter-based feature
selection is investigated further in this paper. The fundamental research goal of feature
selection is to find the feature subset with the best classification performance and fewer
features.
Lin et al. (2008) constructed a classifier using layered genetic programming, which had
the characteristic of feature selection and feature extraction. Neshatian and Zhang (2012)
combined information entropy and conditional entropy to evaluate the correlation
between feature sets and classes for feature selection.
Purohit et al. (2010) proposed a GP classifier construction method with feature selection
and investigated a new crossover operator to discover the best crossover site. Davis
et al. (2006) proposed a two-stage feature selection method in which a feature subset is
first selected using GP and GP classifiers are then evolved using the selected features.
Neshatian and Zhang (2009b) proposed a filter-based multiple-objective feature selection
method for binary classification tasks.
GP has a flexible representation ability. In general, GP-based feature selection methods
take the original features as the terminal set and the terminal nodes as the selected
features. However, some researchers have investigated other representations. Hunt et al.
(2012) developed GP-based hyper-heuristics for adding and removing features; each individual
generated by GP is a series of operations on the feature set. Ribeiro et al. (2012) used
four feature selection criteria (information gain, chi-square, correlation and odds ratio)
as the terminal set of GP and three set operations (intersection, union and difference) as
the function set. Viegas et al. (2018) proposed a GP-based feature selection method for
skewed datasets and validated it on four high-dimensional datasets. Papa et al. (2017) used
binary strings as individuals' terminal nodes, where 1 means selecting a certain feature and
0 means not selecting it; AND, OR and XOR are regarded as non-terminal nodes. The output of
an individual is also a binary string, which encodes the selected features.
Ahmed et al. (2014) proposed a two-stage GP-based classification method which first
ranked the original features then constructed classifiers. Neshatian and Zhang (2009a) con-
structed GP programs to measure the goodness of feature subsets using a virtual program
structure and an evaluation function.
At present, there has been much research on GP-based feature selection methods, but little
of it focuses on the large amount of evolutionary information generated during the evolution
process. Some researchers have begun to use GP to rank features (Friedlander et al., 2011;
Neshatian, 2010), which indicates that the frequency of features appearing in terminal nodes
can show the importance of features. Based on this motivation, it should be investigated
whether feature ranking can help quantify the importance of features and improve the
classification performance, and whether a multi-criteria fitness function can help select
more effective features.
3. Method
In this paper, our proposed methods are based on standard GP, and correlation is chosen
as the fitness function. The fundamental purpose of this paper is to investigate whether the
feature selection method based on correlation evaluation criteria using GP can effectively
get rid of invalid features, and whether feature ranking and a multi-criteria fitness function
can improve the classification performance. We have different variants of algorithms to ver-
ify the research objectives. FS denotes the feature selection method using GP. FRFS denotes
the feature selection method combined with feature ranking. MFRFS denotes the algorithm
that combines FRFS with a multi-criteria fitness function. The details of the algorithms are
described below.
experiments found that GP starts to converge after 30 iterations, so the parameter j is set
to 30.
Suppose the number of features selected by FS is k. Our goal is to select the top fbest
features among the top k features according to the ranking. The pseudo-code of FRFS is shown
in Algorithm 1. To determine how many of the k features are best, fbest is first set to k/n
and the top fbest features are tested on the classification algorithms. Then the remaining
k − k/n features are added into the fbest feature subset and tested one by one according to
their ranking until the classification accuracy begins to decline. The top fbest features
are then the selected features. In this paper, the parameter n is set to 3, that is, the top
one-third of the k features are initially selected into the fbest feature subset. We start
with the top k/n features instead of the single top feature to improve the search efficiency;
the value n = 3 was chosen based on our experiments.
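The selection step described above can be sketched as follows. This is a sketch under stated assumptions, not the paper's implementation: `best_individuals` is the stored set of best-fitness GP trees, `count_terminals` and `evaluate_accuracy` are hypothetical helpers that list a tree's terminal features and score a candidate subset on the training set:

```python
from collections import Counter

def frfs_select(best_individuals, count_terminals, evaluate_accuracy, n=3):
    """Rank features by occurrence in best individuals' terminal nodes,
    start from the top k/n features, then add the rest one by one until
    classification accuracy begins to decline."""
    counts = Counter()
    for tree in best_individuals:
        counts.update(count_terminals(tree))       # terminal-feature occurrences
    ranked = [f for f, _ in counts.most_common()]  # rank by occurrence count
    k = len(ranked)
    fbest = max(1, k // n)                         # start from the top k/n features
    best_acc = evaluate_accuracy(ranked[:fbest])
    for i in range(fbest, k):                      # add remaining ranked features
        acc = evaluate_accuracy(ranked[:i + 1])
        if acc < best_acc:                         # accuracy begins to decline
            break
        best_acc, fbest = acc, i + 1
    return ranked[:fbest]
```

Starting from k/n rather than a single feature trades a little thoroughness for search efficiency, matching the description above.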
where fnum is the number of features in an individual and α is the penalty coefficient
between Correlation and fnum . The parameter α is discussed in Section 5.4.
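Formula (3) itself is not reproduced in this excerpt; based on the surrounding description (α as a penalty coefficient trading Correlation off against fnum), a plausible penalised form can be sketched, labelled here explicitly as an assumption:

```python
def multi_criteria_fitness(correlation, fnum, alpha=0.01):
    # Assumed shape of Formula (3): the exact expression is not shown in this
    # excerpt. The description (alpha as a penalty coefficient between
    # Correlation and fnum) suggests a penalty on the feature count, so that
    # among individuals with equal correlation, the one with fewer features
    # receives the higher fitness.
    return correlation - alpha * fnum
```

Under this form, two individuals with the same correlation value of 0.6 but 4 versus 9 features get fitness 0.56 versus 0.51, which matches the stated motivation of favouring smaller subsets.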
4. Experimental design
4.1. Datasets and parameter settings
To verify the effectiveness of our proposed feature selection methods, 15 datasets are col-
lected from UCI machine learning repository (Dheeru & Karra Taniskidou, 2017). The details
of the datasets are shown in Table 1. In the table, #features indicates the number of
original features, #instances the number of instances and #classes the number of class
labels. The datasets have different numbers of classes and features and are representative
classification problems.
A random split of 7:3 is used and 70% of the dataset is used as the training set and 30%
as the testing set (Ma & Gao, 2020b; Xue et al., 2013). The training set is used to select and
rank the original features. The testing set is used to evaluate the classification performance
of selected and ranked features by 10-fold cross-validation. To account for the stochastic
nature of GP, 30 independent runs are performed (Ma & Teng, 2019), and the average
classification accuracy and number of features over the 30 runs are reported.
K-nearest Neighbours (KNN, K = 5), C4.5 decision tree and Naive Bayes (NB) are used to
evaluate the performance of the proposed methods. The ECJ platform (Luke, 2017) is used
to run GP. The parameter settings of GP are shown in Table 2, which are the commonly used
parameters of GP (Ma & Gao, 2020b; Ma & Teng, 2019). The function set is composed of four
arithmetic operators, and the terminal set is the original features. The population size is
set to 500; initial GP experiments indicate that this population size can evolve good
solutions. The number of generations is set to 50 and the maximum tree depth is 17;
restricting irrelevant and redundant features can also reduce the depth of GP programs.
Mutation probability and crossover probability are 10% and 90% respectively, which balance
exploration and exploitation during the evolution. The parameter α in Formula 3 is set to
0.01, as discussed in Section 5.4.
4.2.1. ReliefF
ReliefF (Kira & Rendell, 1992) is an extension of the Relief algorithm. Relief can only be
used to solve binary classification problems, whereas ReliefF can be applied to multi-class
classification problems. Both are feature weighting algorithms, which assign different
weights to features according to the correlation between each feature and the class;
features with a weight below a certain threshold are deleted, and the weight of a feature is
proportional to its classification ability. First, one sample is randomly selected from the
dataset, and k nearest neighbour samples are selected from each class. Then, the weights of
all features are updated, and this operation is repeated n times. Finally, the final weight
of each feature is obtained and the features are ranked according to their weights. To
facilitate comparisons with the proposed MFRFS, ReliefF selects the same number of features
as MFRFS on all datasets.
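The weighting scheme described above can be sketched as follows. This is a simplified sketch of ReliefF for numeric features following Kononenko's formulation, not the implementation used in the paper; `relieff_weights`, `m` (number of sampled instances) and `k` (neighbours per class) are illustrative names:

```python
import random
from collections import Counter

def relieff_weights(X, y, m=50, k=3, seed=0):
    """Sample m instances; for each, find k nearest hits (same class) and,
    per other class, k nearest misses; decrease weights for features that
    differ from hits and increase them for features that differ from misses
    (misses weighted by the class prior)."""
    rng = random.Random(seed)
    n_feat = len(X[0])
    priors = {c: cnt / len(y) for c, cnt in Counter(y).items()}
    # per-feature value ranges, for normalising differences
    ranges = [max(col) - min(col) or 1.0 for col in zip(*X)]
    diff = lambda a, r1, r2: abs(r1[a] - r2[a]) / ranges[a]
    dist = lambda r1, r2: sum(diff(a, r1, r2) for a in range(n_feat))
    w = [0.0] * n_feat
    for _ in range(m):
        i = rng.randrange(len(X))
        r, c = X[i], y[i]
        for cls in priors:
            # k nearest neighbours of r within class cls (excluding r itself)
            nbrs = sorted((j for j in range(len(X)) if j != i and y[j] == cls),
                          key=lambda j: dist(r, X[j]))[:k]
            for a in range(n_feat):
                d = sum(diff(a, r, X[j]) for j in nbrs) / (m * k)
                if cls == c:
                    w[a] -= d                                  # nearest hits
                else:
                    w[a] += priors[cls] / (1 - priors[c]) * d  # nearest misses
    return w
```

A feature that perfectly separates the classes ends up with a high positive weight, while a noise feature stays near zero, so ranking by weight recovers the discriminative features.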
4.2.2. LFS
LFS (Gutlein et al., 2009) starts from an empty set and adds features one by one, restricting
the number of features considered in each step of the forward selection. When the evaluation
criterion shows that performance remains unchanged or decreases after adding a feature, LFS
terminates. LFS has been shown to be faster than standard forward selection while obtaining
good results (Gutlein et al., 2009). For a fair comparison with the proposed MFRFS method,
Correlation is also used as the evaluation criterion.
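The forward-selection idea can be sketched briefly. This is a minimal sketch, not the Weka implementation: the candidate restriction loosely mirrors LFS's fixed-width limit on each step, and `forward_select`, `merit` (e.g. the Correlation criterion) and `limit` are illustrative names:

```python
def forward_select(features, merit, limit=5):
    """Greedy forward selection: at each step consider only the first `limit`
    unselected features (LFS's restriction on standard forward selection) and
    stop when no candidate improves the merit of the current subset."""
    selected, best = [], merit([])
    remaining = list(features)
    while remaining:
        candidates = remaining[:limit]          # restricted candidate window
        scored = [(merit(selected + [f]), f) for f in candidates]
        score, f = max(scored)
        if score <= best:                       # unchanged or decreased: stop
            break
        best = score
        selected.append(f)
        remaining.remove(f)
    return selected
```

Limiting the candidate window is what makes LFS cheaper than standard forward selection, which re-evaluates every remaining feature at every step.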
Figure 1. The boxplots of performance differences between FRFS and two baselines marked by Post-
Friedman-test: (a) KNN, (b) C4.5, (c) NB.
Figure 2. The box plots of performance differences between MFRFS and three benchmarks marked by
Post-Friedman test: (a) KNN, (b) C4.5, (c) NB.
The result of the Post-Friedman test is also marked on the boxplots using different colours.
If the Post-Friedman test yields a p-value ≤ 0.05 or 0.05 < p-value ≤ 0.4, the box plot is
marked green or yellow respectively, indicating that the performance difference between the
paired feature processing methods is significant or borderline significant. Figures 1 and 2
show whether one of the paired methods is significantly better than the other.
Table 5. Continued.

| Dataset | Method | A-KNN | #F-KNN | A-C4.5 | #F-C4.5 | A-NB | #F-NB |
|---|---|---|---|---|---|---|---|
| | PbGSK-V4 | 93.16 | 1.4 | 93.41 | 1.4 | 92.76 | 1.4 |
| spambase | MFRFS | 89.33 | 6.5 | 89.43 | 6.4 | 84.61 | 5.8 |
| | ReliefF | 80.3 | 6.5 | 82.42 | 6.4 | 67.46 | 5.8 |
| | LFS | 90.07 | 16.7 | 90.81 | 16.7 | 80.68 | 16.7 |
| | BGSK | 89.3 | 7.2 | 89.41 | 7.2 | 82.08 | 7.2 |
| | pBGSK | 89.08 | 8.1 | 89.38 | 8.1 | 83.95 | 8.1 |
| | PbGSK-V4 | 74.44 | 1 | 74.67 | 1 | 70.19 | 1 |
| spect-heart | MFRFS | 72.31 | 1.9 | 68.72 | 1.8 | 71.41 | 1.8 |
| | ReliefF | 71.03 | 1.9 | 68.97 | 1.8 | 70.64 | 1.8 |
| | LFS | 71.41 | 3.6 | 70.9 | 3.6 | 70.64 | 3.6 |
| | BGSK | 73.21 | 2.3 | 71.32 | 2.3 | 71.75 | 2.3 |
| | pBGSK | 72.91 | 2.5 | 71.32 | 2.5 | 71.92 | 2.5 |
| | PbGSK-V4 | 71.32 | 1 | 71.11 | 1 | 71.28 | 1 |
| statlog-heart | MFRFS | 80.99 | 3.2 | 78.4 | 3.2 | 81.11 | 4.3 |
| | ReliefF | 73.09 | 3.2 | 75.56 | 3.2 | 78.64 | 4.3 |
| | LFS | 79.26 | 6.7 | 76.91 | 6.7 | 82.96 | 6.7 |
| | BGSK | 79.96 | 4.3 | 78.48 | 4.3 | 80.82 | 4.3 |
| | pBGSK | 80.12 | 4.3 | 78.48 | 4.3 | 80.82 | 4.3 |
| | PbGSK-V4 | 63.87 | 1.5 | 63.62 | 1.5 | 63.79 | 1.5 |
| statlog-image-segmentation | MFRFS | 93.85 | 3 | 93.78 | 3 | 84.39 | 3 |
| | ReliefF | 82.6 | 3 | 83.2 | 3 | 73.51 | 3 |
| | LFS | 92.51 | 7.6 | 93.9 | 7.6 | 84.24 | 7.6 |
| | BGSK | 93.51 | 3.1 | 94 | 3.1 | 84.05 | 3.1 |
| | pBGSK | 93.72 | 3 | 94.01 | 3 | 84.08 | 3 |
| | PbGSK-V4 | 74.56 | 1.6 | 75.04 | 1.6 | 65.65 | 1.6 |
| thoracic-surgery | MFRFS | 84.75 | 1 | 85.11 | 1 | 85.04 | 1 |
| | ReliefF | 84.4 | 1 | 85.11 | 1 | 84.61 | 1 |
| | LFS | 84.68 | 4.9 | 85.11 | 4.9 | 83.97 | 4.9 |
| | BGSK | 85.06 | 1.3 | 85.11 | 1.3 | 85.06 | 1.3 |
| | pBGSK | 85.11 | 1.3 | 85.11 | 1.3 | 85.01 | 1.3 |
| | PbGSK-V4 | 85.11 | 1 | 85.11 | 1 | 85.15 | 1 |
The box plots marked by the Post-Friedman test in Figure 2 show that MFRFS is borderline
significantly better than LFS with the KNN and NB learning algorithms; moreover, LFS selects
more features than MFRFS.
On the arrhythmia and audiology datasets, the number of features selected by LFS is nearly
four times that of MFRFS. However, on the audiology dataset, LFS does not obtain better
classification accuracy than MFRFS despite using more features. On the robot-failure-lp4 and
robot-failure-lp5 datasets, LFS selects nearly six times as many features as MFRFS. On the
Lymphoma dataset, LFS even uses more than 20 times as many features as MFRFS, and when NB is
used as the learning algorithm, the accuracy of LFS is 10% lower than that of MFRFS. On the
other datasets, MFRFS selects a smaller number of features and achieves comparable or better
classification accuracy than LFS. The experimental results show that LFS's dimensionality
reduction performance is weak: the features selected by LFS still contain irrelevant and
redundant features, which confirms that some invalid features are not helpful for
classification. Our proposed MFRFS can select more effective features while maintaining the
classification accuracy.
5.3.3. Comparison between MFRFS and GSK related methods (BGSK, pBGSK,
PbGSK-V4 )
Compared with BGSK, MFRFS selects a smaller feature subset and achieves better classification
performance in most cases across all learning algorithms. On the higher-dimensional Lymphoma
dataset, BGSK cannot reduce the feature dimensionality well and selects more than 1300
features. We obtain similar experimental results for pBGSK: in most cases a larger feature
subset is selected, especially on the Lymphoma dataset. The box plots marked by the
Post-Friedman test in Figure 2 show that MFRFS is borderline significantly better than pBGSK
with the KNN learning algorithm and significantly better than pBGSK with the NB learning
algorithm.

Figure 3. Convergence graph on two datasets: (a) spambase and (b) arrhythmia.
As shown in Table 5, PbGSK-V4 selects the minimum number of features among the six methods.
However, it cannot achieve good classification performance, especially on the audiology,
robot-failure-lp5, statlog-image-segmentation and spambase datasets: fewer than 2 features
are selected on average in those datasets, but the classification accuracy is very low.
Compared with MFRFS, PbGSK-V4 reduces the number of features, but in most cases the overall
classification accuracy of MFRFS is better than that of PbGSK-V4.
Figure 3 shows the convergence behaviours of the four EC-based methods on the arrhythmia and
spambase datasets. MFRFS starts to converge in about 10 generations and obtains good
solutions, while the GSK-related methods need more generations to converge.
Table 6. Continued.

| Dataset | α | A-KNN | #F-KNN | A-C4.5 | #F-C4.5 | A-NB | #F-NB |
|---|---|---|---|---|---|---|---|
| seismic-bumps | 0.025 | 93.27 | 1 | 93.41 | 1 | 91.95 | 1 |
| | 0.01 | 93.32 | 1.1 | 93.41 | 1 | 91.82 | 1.1 |
| | 0.005 | 93.06 | 1.7 | 93.37 | 1.4 | 91.02 | 1.5 |
| | 0.001 | 93.09 | 2.8 | 93.41 | 2.1 | 90.4 | 2.1 |
| spambase | 0.1 | 76.31 | 1 | 76.44 | 1 | 72.19 | 1 |
| | 0.05 | 84.52 | 2 | 85.04 | 2 | 78.39 | 2 |
| | 0.025 | 87.72 | 3.97 | 87.8 | 3.9 | 80.07 | 3.83 |
| | 0.01 | 89.33 | 6.53 | 89.43 | 6.43 | 87.4 | 5.83 |
| | 0.005 | 89.27 | 8 | 89.51 | 7.97 | 85.36 | 7.43 |
| | 0.001 | 89.57 | 9.37 | 89.99 | 9.43 | 84.61 | 9 |
| spect-heart | 0.1 | 69.62 | 1 | 69.62 | 1 | 69.62 | 1 |
| | 0.05 | 69.62 | 1 | 69.62 | 1 | 69.62 | 1 |
| | 0.025 | 71.92 | 1.5 | 70.38 | 1.4 | 71.41 | 1.5 |
| | 0.01 | 72.31 | 1.9 | 68.72 | 1.8 | 71.41 | 1.8 |
| | 0.005 | 72.56 | 1.9 | 69.49 | 1.8 | 71.28 | 1.8 |
| | 0.001 | 72.31 | 2.4 | 69.87 | 2 | 71.15 | 2.1 |
| statlog-heart | 0.1 | 71.73 | 1 | 70.99 | 1 | 71.11 | 1 |
| | 0.05 | 71.73 | 1.5 | 72.35 | 1.1 | 74.2 | 1.8 |
| | 0.025 | 77.41 | 2.8 | 77.16 | 2.6 | 77.16 | 3 |
| | 0.01 | 80.99 | 3.2 | 78.4 | 3.2 | 81.11 | 4.3 |
| | 0.005 | 79.75 | 4.2 | 77.53 | 4.3 | 81.36 | 4.9 |
| | 0.001 | 78.64 | 3.8 | 76.67 | 4.3 | 81.98 | 5.3 |
| statlog-image-segmentation | 0.1 | 82.24 | 1.9 | 82.34 | 1.9 | 72.21 | 1.9 |
| | 0.05 | 84.49 | 2 | 84.47 | 2 | 73.59 | 2 |
| | 0.025 | 89.13 | 2.5 | 89.08 | 2.5 | 78.96 | 2.5 |
| | 0.01 | 93.85 | 3 | 93.78 | 3 | 84.39 | 3 |
| | 0.005 | 93.85 | 3 | 93.78 | 3.4 | 84.43 | 3.1 |
| | 0.001 | 93.75 | 3.9 | 93.49 | 4.1 | 84.5 | 3.7 |
| thoracic-surgery | 0.1 | 84.75 | 1 | 85.11 | 1 | 85.04 | 1 |
| | 0.05 | 84.75 | 1 | 85.11 | 1 | 85.04 | 1 |
| | 0.025 | 84.75 | 1 | 85.11 | 1 | 85.04 | 1 |
| | 0.01 | 84.75 | 1 | 85.11 | 1 | 85.04 | 1 |
| | 0.005 | 85.11 | 1.5 | 85.11 | 1.2 | 84.89 | 1.6 |
| | 0.001 | 84.96 | 2.6 | 85.11 | 1.9 | 84.26 | 2.5 |
decreases, but the average classification accuracy starts to decline to some extent. When
α = 0.05 and 0.1, the number of features is reduced considerably, but the accuracy also
declines considerably.
Therefore, the parameter α has a great impact on the classification accuracy. If α is too
small, dimensionality reduction compared with FRFS is not achieved. If α is too large,
effective features are removed and the classification accuracy is reduced. In our
experiments, α = 0.01 meets the purpose of decreasing the number of selected features while
maintaining the classification performance.
individuals may be lost. In this paper, GP individuals after the 30th generation with the best
fitness are preserved. The features are ranked according to their occurrence times in these
individuals. Experiments show that feature ranking can help select more effective features.
To further demonstrate how feature ranking works, we take the arrhythmia dataset as an
example. The results of feature ranking are shown below. A total of 12 best features were
selected. The values before the “–” represent the index of the feature, and the values after
the “–” represent the number of times the feature occurs in all the best individuals.
F14-60244 F250-59799 F109-58235 F182-42527 F212-38822 F195-38505
F241-37687 F7-31955 F208-26438 F38-24976 F97-23970 F10-21086
We can see that feature F14 is probably the best feature, and feature F250 is the second
best, and so on. The purpose of feature ranking is to select the best feature set from the top
12 features.
According to the method described in Section 3.2, the classification accuracies of the fbest
feature subset with different numbers of top-ranked features, on both the training and
testing sets, are shown in Table 7. For the training set, the classification accuracy first
increases and then decreases as the number of top features grows; it is highest when the
number of top features is 9, 10 and 9 for KNN, C4.5 and NB respectively.
For the testing set, Table 7 also gives the classification accuracies of the corresponding
ranked features for the KNN, C4.5 and NB learning algorithms; feature ranking achieves the
highest classification accuracy on the corresponding ranked features with the C4.5 and NB
learning algorithms.
(2) Feature ranking is based on the assumption that the more often the original features
appear in the GP individuals' terminal nodes, the more valuable these features are. The
experiments show that this assumption holds: GP can be used to quantify the importance of
features and select more important feature sets.
(3) MFRFS can restrict the number of irrelevant and redundant features in the GP’s
terminal nodes by a multi-criteria fitness function. By adjusting the parameter α, MFRFS
can achieve a smaller number of features while maintaining the classification accuracy
compared with FRFS.
(4) In general, MFRFS achieves better classification performance than the five benchmarks.
However, MFRFS needs to store the individuals with the best fitness, rank the features
according to their occurrence counts in the GP individuals' terminal nodes, and search for a
smaller number of features with higher classification performance from the ranked features.
Therefore, MFRFS has no advantage in time consumption and
Disclosure statement
No potential conflict of interest was reported by the author(s).
Funding
This work is supported by the Key R&D Program of Hebei Province, China [grant number 20327405D],
Hebei Provincial Department of Human Resources and Social Security, China [grant number
20190344], and Hebei Key Laboratory of Agricultural Big Data, China.
References
Agrawal, P., Ganesh, T., & Mohamed, A. W. (2021a). Chaotic gaining sharing knowledge-based opti-
mization algorithm: An improved metaheuristic algorithm for feature selection. Soft Computing,
25(14), 9505–9528. https://doi.org/10.1007/s00500-021-05874-3
Agrawal, P., Ganesh, T., & Mohamed, A. W. (2021b). A novel binary gaining–sharing knowledge-
based optimization algorithm for feature selection. Neural Computing and Applications, 33(11),
5989–6008. https://doi.org/10.1007/s00521-020-05375-8
Agrawal, P., Ganesh, T., Oliva, D., & Mohamed, A. W. (2022). S-shaped and V-shaped gaining-
sharing knowledge-based algorithm for feature selection. Applied Intelligence, 52 (1), 81–112.
https://doi.org/10.1007/s10489-021-02233-5
Ahmed, S., Zhang, M., & Peng, L. (2014). Improving feature ranking for biomarker discovery in pro-
teomics mass spectrometry data using genetic programming. Connection Science, 26(3), 215–243.
https://doi.org/10.1080/09540091.2014.906388
Al-Sahaf, H., Bi, Y., Chen, Q., Lensen, A., Mei, Y., Sun, Y., & Zhang, M. (2019). A survey on
evolutionary machine learning. Journal of the Royal Society of New Zealand, 49(2), 205–228.
https://doi.org/10.1080/03036758.2019.1609052
Bi, Y., Xue, B., & Zhang, M. (2021a). Genetic programming for image classification: An automated
approach to feature learning. Springer Nature.
Bi, Y., Xue, B., & Zhang, M. (2021b). Multi-objective genetic programming for feature learning
in face recognition. Applied Soft Computing, 103(4), 107152. https://doi.org/10.1016/j.asoc.2021.
107152
Canuto, A. M. P., & Nascimento, D. S. C. (2012). A genetic-based approach to features selection for ensem-
bles using a hybrid and adaptive fitness function. The 2012 International Joint Conference on Neural
Networks (IJCNN), (pp.1-8), Brisbane, QLD, Australia, June 2012.
Davis, R. A., Charlton, A. J., Oehlschlager, S., & Wilson, J. C. (2006). Novel feature selection method for
genetic programming using metabolomic 1H NMR data. Chemometrics and Intelligent Laboratory
Systems, 81(1), 50–59. https://doi.org/10.1016/j.chemolab.2005.09.006
Dheeru, D., & Karra Taniskidou, E. (2017). UCI machine learning repository. University of California,
Irvine, School of Information and Computer Sciences.
Espejo, P. G., Ventura, S., & Herrera, F. (2010). A survey on the application of genetic programming to
classification. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews),
40(2), 121–144. https://doi.org/10.1109/TSMCC.2009.2033566
Friedlander, A., Neshatian, K., & Zhang, M. (2011). Meta-learning and feature ranking using genetic
programming for classification: Variable terminal weighting. IEEE Congress on Evolutionary Compu-
tation, (pp. 5–8), New Orleans, LA, USA, June 2011.
Gutlein, M., Frank, E., Hall, M., & Karwath, A. (2009). Large-scale attribute selection using wrappers. IEEE
symposium on computational intelligence and data mining, (pp. 332–339), Nashville, TN, USA, May
2009.
Hall, M. (1999). Correlation-based feature selection for machine learning (Unpublished doctoral
dissertation). The University of Waikato.
Hall, M., & Smith, L. (1999). Feature selection for machine learning: Comparing a correlation-based fil-
ter approach to the wrapper. Twelfth International Florida Artificial Intelligence Research Society
Conference, (pp. 235–239), Orlando, Florida, USA, May 1999.
Hancer, E. (2019). Fuzzy kernel feature selection with multi-objective differential evolution algorithm.
Connection Science, 31(4), 323–341. https://doi.org/10.1080/09540091.2019.1639624
Hollander, M., & Wolfe, D. A. (1999). Nonparametric statistical methods (2nd ed.). Wiley-Interscience.
Hunt, R., Neshatian, K., & Zhang, M. (2012). A genetic programming approach to hyper-heuristic feature
selection. The 9th international conference on Simulated Evolution and Learning, (pp. 320–330),
Berlin, Heidelberg, December 2012.
Kira, K., & Rendell, L. A. (1992). The feature selection problem: Traditional methods and a new algorithm.
The tenth national conference on artificial intelligence, (pp. 129–134), San Jose California, July
1992.
Koza, J. R. (1992). Genetic programming: On the programming of computers by means of natural selection.
MIT Press.
Koza, J. R., Andre, D., Bennett, F. H., & Keane, M. A. (1999). Genetic programming III: Darwinian
invention and problem solving. Morgan Kaufmann.
Liang, Y., Zhang, M., & Browne, W. N. (2017). Image feature selection using genetic programming
for figure-ground segmentation. Engineering Applications of Artificial Intelligence, 62(12), 96–108.
https://doi.org/10.1016/j.engappai.2017.03.009
Lin, J. Y., Ke, H. R., Chien, B. C., & Yang, W. P. (2008). Classifier design with feature selection and feature
extraction using layered genetic programming. Expert Systems with Applications, 34(2), 1384–1393.
https://doi.org/10.1016/j.eswa.2007.01.006
Loughran, R., Agapitos, A., Kattan, A., Brabazon, A., & O’Neill, M. (2017). Feature selection
for speaker verification using genetic programming. Evolutionary Intelligence, 10(1–2), 1–21.
https://doi.org/10.1007/s12065-016-0150-5
Luke, S. (2017). ECJ then and now. Proceedings of the Genetic and Evolutionary Computation
Conference Companion, (pp. 1223–1230), Berlin, Germany, July 2017.
Ma, J., & Gao, X. (2020a). Designing genetic programming classifiers with feature selection and feature
construction. Applied Soft Computing, 97(4), 106826. https://doi.org/10.1016/j.asoc.2020.106826
Ma, J., & Gao, X. (2020b). A filter-based feature construction and feature selection approach for classi-
fication using genetic programming. Knowledge-Based Systems, 196(6), 105806. https://doi.org/10.
1016/j.knosys.2020.105806
Ma, J., & Teng, G. (2019). A hybrid multiple feature construction approach using genetic programming.
Applied Soft Computing, 80(2), 687–699. https://doi.org/10.1016/j.asoc.2019.04.039
Mei, Y., Nguyen, S., Xue, B., & Zhang, M. (2017). An efficient feature selection algorithm for evolv-
ing job shop scheduling rules with genetic programming. IEEE Transactions on Emerging Topics in
Computational Intelligence, 1(5), 339–353. https://doi.org/10.1109/TETCI.2017.2743758
Mohamed, A. W., Hadi, A. A., & Mohamed, A. K. (2020). Gaining-sharing knowledge based algorithm for
solving optimization problems: A novel nature-inspired algorithm. International Journal of Machine
Learning and Cybernetics, 11(7), 1501–1529. https://doi.org/10.1007/s13042-019-01053-x
Muni, D. P., Pal, N. R., & Das, J. (2006). Genetic programming for simultaneous feature selection
and classifier design. IEEE Transactions on Systems, Man and Cybernetics, Part B (Cybernetics), 36(1),
106–117. https://doi.org/10.1109/TSMCB.2005.854499
Nagra, A. A., Han, F., Ling, Q. H., Abubaker, M., Ahmad, F., Mehta, S., & Apasiba, A. T.
(2020). Hybrid self-inertia weight adaptive particle swarm optimisation with local search using
C4.5 decision tree classifier for feature selection problems. Connection Science, 32(1), 16–36.
https://doi.org/10.1080/09540091.2019.1609419
Neshatian, K. (2010). Feature manipulation with genetic programming (Unpublished doctoral disser-
tation). Victoria University of Wellington.
Neshatian, K., & Zhang, M. (2009a). Genetic programming for feature subset ranking in binary classifi-
cation problems. The 12th European Conference on Genetic Programming, (pp. 121–132), Berlin,
Heidelberg, April 2009.
Neshatian, K., & Zhang, M. (2009b). Pareto front feature selection: Using genetic programming to explore
feature space. The 11th annual genetic and evolutionary computation conference, (pp. 1027–1034),
Montreal, Québec, Canada, July 2009.
Neshatian, K., & Zhang, M. (2009c). Unsupervised elimination of redundant features using genetic pro-
gramming. Australasian joint conference on artificial intelligence, (pp. 432–442), Berlin, Heidelberg,
November 2009.
Neshatian, K., & Zhang, M. (2011). Using genetic programming for context-sensitive feature scoring
in classification problems. Connection Science, 23(3), 183–207. https://doi.org/10.1080/09540091.
2011.630065
Neshatian, K., & Zhang, M. (2012). Improving relevance measures using genetic programming. European
conference on genetic programming, (pp. 97–108), Berlin, Heidelberg, April 2012.
Papa, J. P., Rosa, G. H., & Papa, L. P. (2017). A binary-constrained geometric semantic genetic program-
ming for feature selection purposes. Pattern Recognition Letters, 100(24), 59–66. https://doi.org/10.
1016/j.patrec.2017.10.002
Patel, S. P., & Upadhyay, S. H. (2020). Euclidean distance based feature ranking and sub-
set selection for bearing fault diagnosis. Expert Systems with Applications, 154(8), 113400.
https://doi.org/10.1016/j.eswa.2020.113400
Peng, B., Wan, S., Bi, Y., Xue, B., & Zhang, M. (2020). Automatic feature extraction and construction
using genetic programming for rotating machinery fault diagnosis. IEEE Transactions on Cybernet-
ics, (99), 1–15. https://doi.org/10.1109/TCYB.2020.3032945
Purohit, A., Chaudhari, N., & Tiwari, A. (2010). Construction of classifier with feature selection based on
genetic programming. In Proceedings of the IEEE congress on evolutionary computation.
Ribeiro, I., Andrade, G., Viegas, F., Madeira, D., Rocha, L., Salles, T., & Gonçalves, M. (2012). Aggres-
sive and effective feature selection using genetic programming. In Proceedings of IEEE congress on
evolutionary computation.
Sreeja, N. K. (2019). A weighted pattern matching approach for classification of imbalanced
data with a fireworks-based algorithm for feature selection. Connection Science, 31(2), 143–168.
https://doi.org/10.1080/09540091.2018.1512558
Tan, F., Fu, X., Zhang, Y., & Bourgeois, A. G. (2008). A genetic algorithm-based method for feature
subset selection. Soft Computing, 12(2), 111–120. https://doi.org/10.1007/s00500-007-0193-8
Too, J., & Abdullah, A. R. (2020). Binary atom search optimisation approaches for feature selection.
Connection Science, 32(4), 406–430. https://doi.org/10.1080/09540091.2020.1741515
Too, J., Sadiq, A. S., & Mirjalili, S. M. (2021). A conditional opposition-based particle swarm optimisa-
tion for feature selection. Connection Science, 77(3), 1–23. https://doi.org/10.1080/09540091.2021.
2002266
Vergara, J. R., & Estévez, P. A. (2014). A review of feature selection methods based on mutual infor-
mation. Neural Computing and Applications, 24(1), 175–186. https://doi.org/10.1007/s00521-013-
1368-0
Viegas, F., Rocha, L., Gonçalves, M., Mourão, F., Sá, G., Salles, T., & Sandin, I. (2018). A genetic program-
ming approach for feature selection in highly dimensional skewed data. Neurocomputing, 273(3),
554–569. https://doi.org/10.1016/j.neucom.2017.08.050
Xue, B., Zhang, M., & Browne, W. (2013). Particle swarm optimization for feature selection in
classification: A multi-objective approach. IEEE Transactions on Cybernetics, 43(6), 1656–1671.
https://doi.org/10.1109/TSMCB.2012.2227469
Xue, B., Zhang, M., & Browne, W. N. (2014). Particle swarm optimisation for feature selection in clas-
sification: Novel initialisation and updating mechanisms. Applied Soft Computing, 18(C), 261–276.
https://doi.org/10.1016/j.asoc.2013.09.018