
Genetic Programming for Classification

and Feature Selection

Kaustuv Nag and Nikhil R. Pal

Abstract Our objective is to provide a comprehensive introduction to Genetic Programming (GP), primarily keeping in view the problem of classifier design along
with feature selection. We begin with a brief account of how genetic programming
has emerged as a major computational intelligence technique. Then, we analyse
classification and feature selection problems in brief. We provide a naive model of a GP-based binary classification strategy with illustrative examples. We then briefly discuss a few existing methodologies and describe three somewhat related but different strategies in reasonable detail. Before concluding, we make a few important remarks
related to GP when it is used for classification and feature selection. In this context,
we show some experimental results with a recent GP-based approach.

Keywords Classification · Feature selection · Genetic programming

1 Introduction

1.1 The Emergence of Genetic Programming

Computational Intelligence (CI) deals with biologically and linguistically inspired computing paradigms. Evolutionary computation (EC) is one of the major compo-
nents of CI. Evolutionary algorithms (EAs), which are concerned with EC, exploit
Darwinian principles to find solutions to a problem. These are, indeed, trial and
error based optimization schemes that use metaheuristics. Moreover, EAs use a pop-
ulation consisting of a set of candidate solutions instead of iterating over a single
solution in the search space. Usually, the following four techniques are categorized as

K. Nag (B)
Department of IEE, Jadavpur University, Kolkata, India
e-mail: [email protected]
N. R. Pal
ECS Unit, Indian Statistical Institute, Calcutta, India
e-mail: [email protected]
© Springer International Publishing AG, part of Springer Nature 2019
J. C. Bansal et al. (eds.), Evolutionary and Swarm Intelligence Algorithms, Studies in Computational Intelligence 779, https://doi.org/10.1007/978-3-319-91341-4_7

EC: (i) evolutionary programming (EP), (ii) evolutionary strategy (ES), (iii) genetic
algorithm (GA), and (iv) genetic programming (GP).
EC usually initializes a population with a set of randomly generated candidate
solutions. However, if domain knowledge is available, it can be used to generate the
initial population. Then, the population is evolved. This evolutionary process incorpo-
rates natural selection and other evolutionary operators. From an algorithmic point of
view, it is a guided random search that uses parallel processing to achieve the desired
solutions. Note that, the natural selection must be incorporated in an EA, otherwise
the approach cannot be categorized as an EC technique. For example, though several
metaheuristic algorithms, such as, particle swarm optimization (PSO) [12] and ant
colony optimization (ACO) [6, 9] are nature inspired algorithms (NIAs), they are
not EAs. Note that, sometimes they are still loosely referred to as EC techniques.
In 1948, in a technical report [1] titled “Intelligent Machinery”, written for the National Physical Laboratory, Alan M. Turing wrote, “There is the genetical or evo-
lutionary search by which a combination of genes is looked for, the criterion being
survival value. The remarkable success of this search confirms to some extent the
idea that intellectual activity consists mainly of various kinds of search.” To the best of our knowledge, this is the first technical article in which the concept of evolutionary computation appears. However, it took a few more decades to develop the following three distinct interpretations of this philosophy: (i) EP, (ii) ES, and (iii) GA. For the next one and a half decades, these three areas grew separately. Later, in the early nineties, they were unified as subfields of the same technology, namely EC. Each of EP, ES, and GA is an algorithm for finding solutions to an optimization
problem - it finds a parameter vector that optimizes an objective function. Unlike
these three branches, GP finds a program to solve a problem. The concept of modern
tree-based GP was proposed by Cramer in 1985 [7]. Later, Koza, a student of Cramer,
popularized it with his many eminent works [15–18]. A large number of GP-based
inventions have been made after 2000, i.e., after the emergence of sufficiently well
performing hardware.

1.2 Genetic Programming: The Special Encoding Scheme

GP finds computer programs to perform a given task. GP consists of a set of instructions and a fitness function to evaluate the performance of a candidate computer pro-
gram. It can be considered a special case of GA, where each solution is a computer
program. Traditionally, for GP, computer programs are represented in memory
as tree structures [7]. The internal nodes of the trees must be from a set of prede-
fined functions (operators), F. Moreover, every leaf node of a tree must be from a
set of predefined terminals (operands), T . The subtrees of a function node f ∈ F
are the arguments to f . Note that, a very important property of tree-based GP is
that F and T need to satisfy both the closure property and the sufficiency property
[15]. To satisfy the closure property, F needs to be well defined and closed for any
combination of probable arguments that it may encounter [15]. Moreover, F and T need to be able to encode any possible valid solution of the problem to satisfy
the sufficiency property. Thus, Lisp or any other functional programming language that naturally embodies tree structures can be used to represent a candidate solution in GP. The use of non-tree representations to encode solutions in GP is comparatively less popular. An example of this is linear genetic programming, which is suitable for
more traditional imperative languages. In this chapter, however, we concentrate only
on tree-based GP. The most frequently used representations of tree-based GPs are
decision trees, classification rules, and discriminant functions. Here, we confine our
discussion primarily to discriminant function based GPs.
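To make the tree encoding concrete, here is a minimal sketch (ours, not code from the chapter or from any of the cited systems) of a node class over the function set {+, −, ×, ÷}; the protected division that returns zero instead of an undefined value is one simple way to enforce the closure property, and it matches the zero-on-undefined convention used in the worked example later in this chapter.

```python
import operator

# A minimal sketch: one node of a GP expression tree. Internal nodes hold a
# symbol from F = {+, -, *, /}; leaves hold either a feature index or a constant.
class Node:
    def __init__(self, symbol, children=()):
        self.symbol = symbol            # '+', '-', '*', '/', ('x', i), or a float
        self.children = list(children)

    def evaluate(self, x):
        """Recursively evaluate the subtree on a feature vector x."""
        if isinstance(self.symbol, float):        # constant leaf
            return self.symbol
        if isinstance(self.symbol, tuple):        # feature leaf ('x', i)
            return x[self.symbol[1]]
        a = self.children[0].evaluate(x)
        b = self.children[1].evaluate(x)
        if self.symbol == '/':                    # protected division: closure property
            return a / b if b != 0 else 0.0
        return {'+': operator.add, '-': operator.sub, '*': operator.mul}[self.symbol](a, b)

# An example tree encoding the discriminant function (x1 - x2) - 0.1
tree_a = Node('-', [Node('-', [Node(('x', 0)), Node(('x', 1))]), Node(0.1)])
print(tree_a.evaluate([1.0, -1.0]))   # 1.9 > 0, i.e., a positive response
```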

1.3 Classification and Feature Selection

In machine learning and data mining, classification is an important and frequently encountered problem. Besides, it is possible to restate a wide range of real world
problems as classification problems. Examples of such real world problems include
medical diagnosis, text categorization, and software quality assurance. In a classifica-
tion task, we have a set $X$ with $N_p$ training data points, where each element of the set is denoted by a pair $(\mathbf{x}_i, y_i)$, $i = 1, 2, \ldots, N_p$. Here, $\mathbf{x}_i = (x_{1i}, x_{2i}, \ldots, x_{ni})^T \in \mathbb{R}^n$ and $y_i \in Y$, where $Y$ is the set of all class labels. Now, based on the information hidden in $X$, we need to model a set of criteria in a solver, called a classifier, such that, given a point $\mathbf{x} \in \mathbb{R}^n$, the classifier would predict an appropriate class label $y \in Y$.
To encode the set of criteria, a classifier relies on a set of features. Thus, feature
selection (FS) becomes an important part in classification. FS can be defined as a
process of selecting a subset of relevant features that can solve the problem at hand
satisfactorily. FS is primarily used for the following three reasons: (i) to simplify
models to enhance their interpretability, (ii) to attain shorter training time, and (iii)
to enhance the generalization capability of the classifier by reducing its degrees of
freedom. Note that, a genetic program may not (mostly will not) use all features
of a given data set. Hence, a GP-based system performs FS implicitly, at least to
some extent even if it is not specially designed for FS. Moreover, a discriminant
function based genetic program also implicitly performs feature extraction (FE)
from an initial set of measured features. The derived features are expected to be
less-redundant, informative, and should facilitate subsequent learning and enhance
generalization. Sometimes, they may lead to better human interpretation.
For a given classification task, there may be at least four types of features: (i)
essential, (ii) bad, (iii) redundant, and (iv) indifferent [5]. The objective of a FS
scheme should be to (i) select the essential features, (ii) reject the bad features,
(iii) judiciously select some of the redundant features, and (iv) reject the indifferent
features. Let us consider a small example to illustrate these four types of features [5].
Consider a data set on humans with five features: (i) sex, (ii) eye color, (iii) height,
(iv) weight, and (v) number of legs. Suppose the given classification task has the
following four classes:

(i) male AND (heavy weight OR long height),
(ii) male AND (low weight OR short height),
(iii) female AND (heavy weight OR long height), and
(iv) female AND (low weight OR short height).
In this particular case, (i) the feature sex is essential, (ii) the feature eye color is bad as
it may confuse the learning, (iii) the feature height is redundant with feature weight
as a long person is usually heavy. Thus, weight and height constitute a redundant set
for the given classification task because usually any one of the two will be enough for
the classification task. Note that, we have emphasized the word usually because
height and weight are statistically strongly correlated, but there could be a heavy
person with a short height. (iv) Finally, number of legs is an indifferent feature as
under normal circumstances, it is going to be two for every individual. Note that,
keeping some level of redundancy in the selected set of features may sometimes be
helpful. Therefore, in the given context, one may argue that to account for some
measurement error, it may not be a very bad idea to use both height and weight to
design a classifier because that would be more robust than a classifier designed using
only one of these two features. Again, if we want to employ FE in the above described
scenario, it would be good to construct a feature height ‘OR’ weight. Note that, this
‘OR’ operator is not exactly the Boolean OR operator because the attributes heavy, long, etc. are not Boolean but fuzzy concepts, and hence, such a combined feature
has to be designed judiciously. The beauty of a GP-based system is that, during the
evolution, it may compute the intuitive ‘OR’ operator using the members of F.

1.4 Genetic Programming for Classification and Feature Selection: A Simple Example

Consider the tiny, artificial, two-dimensional, binary classification data set shown in Fig. 1. It consists of uniformly distributed points inside two circles, $C_1: (x_1 - 1)^2 + (x_2 + 1)^2 = 0.98$ and $C_2: (x_1 + 1)^2 + (x_2 - 1)^2 = 0.98$, which are represented by ‘◦’s and ‘+’s, respectively. Points inside each circle represent a class: the points corresponding to $C_1$ belong to class 1 and the points corresponding to $C_2$ belong to class 2. Let us try to learn a GP-based binary classifier for this data set. To achieve this, let us consider that every candidate solution in the GP-based system consists of a tree that encodes a discriminant function. When we recursively evaluate a tree $T(\cdot)$ using a data point $\mathbf{x}$, it returns a real value $r_{\mathbf{x}}^T = T(\mathbf{x})$. If $r_{\mathbf{x}}^T > 0$, the binary classifier predicts that $\mathbf{x}$ belongs to class 1; otherwise, it predicts that $\mathbf{x}$ belongs to class 2. Let the set of operators (internal nodes) be $F = \{+, -, \times, \div\}$, and the set of operands (leaf nodes) be $T = \mathbb{R} \cup \{x_1, x_2\}$, where $\{x_1, x_2\}$ is the set of features. We also consider that every operator $f \in F$ is defined in such a way that if it returns an undefined or infinite value (a value that is beyond storing due to the precision limit), the returned value is converted to zero. This conversion may not be the best way to handle the issue. However, every subtree constructed using $F$ and $T$ must return a real value.

Consequently, each f ∈ F is well defined and closed under all possible operands that
it may encounter. Hence, this scheme meets the closure property. Moreover, for the
given problem, we assume that F satisfies the sufficiency property, although strictly speaking this is not true. For example, there is an infinite number of equations that can solve the classification problem, and using F and T, we can design infinitely many trees that solve the given problem; yet it may not be possible to generate all possible functional forms of solutions even for this simple data set. This, of course, does not cause any problem from a practical point of view as long as F and T are able to generate useful solutions, at least approximately.
To illustrate further, we show three binary classifiers in Fig. 1, which are denoted
by a : (x1 − x2 ) − 0.1, b : (x1 × x2 ) − 0.1, and c : (x1 × x1 ) − x2 . Moreover, we
show the tree structures corresponding to these three trees respectively in Figs. 2, 3,
and 4. As mentioned earlier, infinitely many “correct” solutions to this problem are
possible. However, we have purposefully chosen these three particular trees, which
have visually similar structures. The solution with tree a would predict all the points accurately; the solution with tree b would predict the points of both classes as belonging to class 2; and the solution with tree c would predict all the points of class 1 accurately but only some of the points of class 2 accurately.

Fig. 1 An artificial binary classification data set and some binary classifiers

Fig. 2 Tree a: (x1 − x2) − 0.1

Fig. 3 Tree b: (x1 × x2) − 0.1

Fig. 4 Tree c: (x1 × x1) − x2
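The following few lines (our own illustration, not the authors' code) sample points uniformly inside C1 and C2 and check the three discriminant functions under the rule that a positive output means class 1.

```python
import random
random.seed(0)

def sample_in_circle(cx, cy, r2, n):
    """Uniformly sample n points inside the circle of squared radius r2 centred at (cx, cy)."""
    points = []
    while len(points) < n:
        x1, x2 = cx + random.uniform(-1, 1), cy + random.uniform(-1, 1)
        if (x1 - cx) ** 2 + (x2 - cy) ** 2 <= r2:
            points.append((x1, x2))
    return points

# Class 1 lives inside C1 (centre (1, -1)), class 2 inside C2 (centre (-1, 1)).
data = [(p, 1) for p in sample_in_circle(1, -1, 0.98, 200)] + \
       [(p, 2) for p in sample_in_circle(-1, 1, 0.98, 200)]

trees = {
    'a': lambda x1, x2: (x1 - x2) - 0.1,
    'b': lambda x1, x2: (x1 * x2) - 0.1,
    'c': lambda x1, x2: (x1 * x1) - x2,
}
for name, tree in trees.items():
    accuracy = sum((1 if tree(x1, x2) > 0 else 2) == y for (x1, x2), y in data) / len(data)
    # a classifies everything, b only class 2, c all of class 1 and part of class 2
    print(name, round(accuracy, 3))
```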
In any GP-based approach, after the encoding scheme, the second most important
issue is the formulation of the objective function that would be used to evaluate
the solutions. Here, to evaluate binary classifiers, we use the prediction accuracy on the training data set as the evaluation/objective function; this is a simple, straightforward, yet effective strategy. Note that this objective function is to be maximized.
The third most important issue in the design of a GP-based system is the choice of
operators. As we shall see later, there can be many issues, such as FS, bloating control, fitness, and unfitness, that can be kept in mind while developing these operators. However,
here we discuss a primitive crossover and a primitive mutation technique, which are
adequate to solve this problem.
The crossover operator requires two parents S1 and S2 to generate a new offspring
O. To generate the tree of O, it selects two random nodes (may be leaf or non-
leaf), one from the tree of $S_1$ and the other from the tree of $S_2$. Let $n^{S_1}$ and $n^{S_2}$ respectively denote those nodes. Then, it replaces the subtree of $S_1$ rooted at $n^{S_1}$, called the crossover point, by the subtree of $S_2$ rooted at $n^{S_2}$. To
illustrate this with an example, we assume that the trees associated with S1 and S2 are
respectively c and a. The randomly selected crossover points, and their respective
subtrees are also shown in Figs. 2 and 4. After replacing the selected subtree of c
(see Fig. 4) by the selected subtree of a (see Fig. 2), the crossover operator generates
an offspring O with tree d: x1 − x2, which is illustrated in Fig. 5. Though we do not show tree d in Fig. 1, it is a straight line parallel to tree a that goes through the origin. Moreover, it can also accurately classify the given data set. Thus, though the classifier with tree d yields the same accuracy as that obtained by the classifier with tree a, it is a better choice due to its simplicity (smaller tree size).

Fig. 5 Tree d: x1 − x2
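A primitive subtree crossover of this kind can be sketched as follows (our illustration, reusing the Node class from the earlier sketch and adding an all_nodes helper): it copies the first parent and overwrites a randomly chosen node of the copy with a randomly chosen subtree of the second parent.

```python
import copy
import random

def all_nodes(node):
    """Collect every node of a tree (root included) in a list."""
    nodes = [node]
    for child in node.children:
        nodes.extend(all_nodes(child))
    return nodes

def subtree_crossover(parent1, parent2):
    """Return an offspring: a copy of parent1 with one random subtree
    replaced by a copy of a random subtree of parent2."""
    offspring = copy.deepcopy(parent1)
    target = random.choice(all_nodes(offspring))
    donor = copy.deepcopy(random.choice(all_nodes(parent2)))
    # Overwrite the chosen node in place so its parent keeps pointing at it.
    target.symbol, target.children = donor.symbol, donor.children
    return offspring
```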
The mutation operator needs only one solution S. A node from the tree associated with S is randomly selected. If the node is a terminal node, a random number $r_n$ is drawn from [0, 1]. If $r_n < p_{variable}$ ($p_{variable}$ is a pre-decided probability of selecting a variable), the selected terminal node is replaced by a randomly selected variable (feature). Otherwise, the node is replaced by a randomly generated constant $r_c \in R_c$, where $R_c \subseteq \mathbb{R}$. $R_c$ should be chosen judiciously; in this particular example, a good choice of $R_c$ might be [−2, 2]. To illustrate the mutation
process with an example, let us consider the solution with tree d shown in Fig. 5.
Suppose the randomly selected mutation point is the node with feature x1 . Moreover,
let us consider that we randomly select to replace this node with a constant node and
the constant value (rc ) be 0.01. Then, the mutant tree will be e : 0.01 − x2 , which
is nothing but a line parallel to the x1 axis that can also classify the given data set
correctly. To illustrate the mutation process when an internal node is involved, let us consider the solution with tree b (see Fig. 3), in which we have shown the randomly selected mutation point. If this node is replaced with −, the result is tree a, which can classify the problem accurately. However, if it were replaced with ÷, the result would be a new tree f: (x1 ÷ x2) − 0.1, which would predict that all the points of both classes belong to class 2.
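The point mutation just described can be sketched like this (again our illustration, reusing the all_nodes helper from the crossover sketch); p_variable and the constant range R_c = [−2, 2] are the quantities mentioned in the text.

```python
import random

FEATURES = [('x', 0), ('x', 1)]   # x1 and x2
FUNCTIONS = ['+', '-', '*', '/']

def point_mutation(tree, p_variable=0.5, const_range=(-2.0, 2.0)):
    """Mutate one randomly chosen node of the tree in place and return the tree."""
    node = random.choice(all_nodes(tree))
    if node.children:                       # internal node: swap the operator
        node.symbol = random.choice(FUNCTIONS)
    elif random.random() < p_variable:      # terminal node: a new feature ...
        node.symbol = random.choice(FEATURES)
    else:                                   # ... or a new constant drawn from R_c
        node.symbol = random.uniform(*const_range)
    return tree
```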
After defining all necessary components, we can now obtain a GP-based clas-
sifier following the steps shown in Algorithm 1. Step 22 of Algorithm 1 performs the environmental selection, which is a necessary component of any GP-based approach. The environmental selection strategy that we have adopted here is naive. Several
criteria and strategies can be adopted for environmental selection. Note that, we also
select some solutions to perform crossover and mutation respectively in Steps 7 and
15 of Algorithm 1. This selection strategy is called mating selection. Instead of ran-
domly selecting these solutions, often some specific criteria are used. For example,
since we are maximizing the classification accuracy, a straightforward scheme could
be to select solutions using Roulette wheel selection. Another important step, where
it is possible to employ a different strategy, is the initialization of the trees. The ramped half-and-half method [15] is one of the frequently used methods to initialize GP trees.
We have already mentioned that this algorithm does not explicitly perform any FS and concentrates only on classification accuracy. However, it may implicitly perform FS. To illustrate this, assume that the evolution of this system generates tree e: 0.01 − x2 (we have already mentioned how it can be generated while illustrating the mutation operation). Tree e will have the highest accuracy on the training data, though it uses only one feature (x2). In this example, the scheme selected only one feature (x2), which has sufficient discriminating power.

Algorithm 1: A Simple Generational Genetic Programming

1  Initialize population P with N solutions.
2  Generation_Current = 0.
3  while Generation_Current < Generation_Maximum do
4      c_crossover = 0.
5      O = ∅.
6      while c_crossover < N_crossover do
7          Randomly select two distinct solutions s1 and s2 from P.
8          Generate a new offspring o by performing crossover using s1 and s2.
9          Evaluate the new offspring o.
10         O = O ∪ {o}.
11         c_crossover = c_crossover + 1.
12     end
13     c_mutation = 0.
14     while c_mutation < N − N_crossover do
15         Randomly select a solution s from P.
16         Generate a new offspring o by performing mutation on s.
17         Evaluate the new offspring o.
18         O = O ∪ {o}.
19         c_mutation = c_mutation + 1.
20     end
21     U = P ∪ O.
22     Select the best N solutions from U and store them in P (natural or environmental selection).
23     Generation_Current = Generation_Current + 1.
24 end
25 return the best candidate solution s_best from P
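Algorithm 1 maps almost line by line onto the sketch below (ours, reusing the Node class, the crossover and mutation helpers, and the FEATURES/FUNCTIONS lists from the earlier sketches); random_tree is a simple stand-in for a ramped half-and-half initializer.

```python
import copy
import random

def random_tree(depth=3):
    """Grow a small random tree (a crude stand-in for ramped half-and-half)."""
    if depth == 0 or random.random() < 0.3:
        return Node(random.choice(FEATURES) if random.random() < 0.8
                    else random.uniform(-2.0, 2.0))
    return Node(random.choice(FUNCTIONS), [random_tree(depth - 1), random_tree(depth - 1)])

def generational_gp(data, pop_size=50, n_crossover=35, max_generations=50):
    """A simple generational GP loop following Algorithm 1; `data` is a list of
    ((x1, x2), label) pairs and the fitness is the training accuracy."""
    def fitness(tree):
        return sum((1 if tree.evaluate(x) > 0 else 2) == y for x, y in data) / len(data)

    population = [random_tree() for _ in range(pop_size)]              # step 1
    for _ in range(max_generations):                                   # step 3
        offspring = []
        for _ in range(n_crossover):                                   # steps 6-12
            s1, s2 = random.sample(population, 2)
            offspring.append(subtree_crossover(s1, s2))
        for _ in range(pop_size - n_crossover):                        # steps 14-20
            offspring.append(point_mutation(copy.deepcopy(random.choice(population))))
        union = population + offspring                                 # step 21
        union.sort(key=fitness, reverse=True)                          # step 22:
        population = union[:pop_size]                                  # environmental selection
    return max(population, key=fitness)                                # step 25
```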

2 Genetic Programming for Classification and Feature Selection

There have been several attempts to design classifiers using GP [4, 13, 14, 19, 21–23]. A detailed survey on this topic can be found in [10]. Some of these methods do
not explicitly pay attention to FS [13, 21], while others explicitly try to find useful
features to design the classifiers [22, 23]. Some of these GP-based approaches use
ensemble concept [19, 23]. In this section, we discuss three existing GP approaches.
The first approach [21] introduces a classification strategy employing a single-objective GP-based search technique. The second approach [22] performs both the classification and FS tasks in an integrated manner. These two schemes use the same
multi-tree representation of solutions. On the contrary, the third method [23] decom-
poses a c-class classification problem into c binary classification problems and then performs simultaneous classification and FS. Nevertheless, unlike the first two meth-
ods [21, 22], it [23] uses ensembles of genetic programs (binary classifiers) and a
negative voting scheme.

2.1 GP-Based Schemes for Feature Selection and Classification

To the best of our knowledge, in [13], the applicability of GP to solve a multi-class classification problem was explored for the first time, where the authors used dis-
criminant function based GP. In [13], the authors decomposed a c-class classification problem into c binary classification problems, and then evolved c genetic programming classifier expressions (GPCEs), one corresponding to each binary classification problem. The GPCE corresponding to the ith class learns to determine whether a point belongs to the ith class; for a given point, a positive response from that GPCE indicates that the point belongs to the ith class. If multiple GPCEs show positive results for a point, then a special conflict resolution
scheme is used. This scheme uses a new measure called strength of association. The
authors in [13] also introduced an incremental learning scheme. This work showed
the capability of GP to automatically discover the features with a better discrim-
inating capability. Later, this work was extended in [14] introducing feature space
partitioning, where the feature space was divided into sub-spaces, and then, for every
sub-space a GPCE was evolved.
In [19], researchers have proposed a distinct GP based approach for classification
of multi-class microarray data sets. Its distinctive nature primarily lies in the struc-
ture of the candidate solutions. In [19], every solution that deals with a c-class problem consists of c sub-ensembles (SEs), where each SE possesses k trees. In this fashion, every individual comprises c × k trees. The outputs of the SEs are
decided by a weighted voting scheme that uses the outputs of the k trees of the corre-
sponding SE as arguments and the classification accuracies of the trees (on training
data) as the weights. Assuming equal number of training points in each class, they
noted that every SE learns with positive to negative ratio 1 : (c − 1). To address this
data imbalance problem in the fusion of the outputs obtained from the SEs, they
furthermore designed a covering score to measure the generalization capability of
each SE. For FS, they introduced a measure, called diversity in features (DIF), which
estimates the diversity of features in a tree. This measure is used to keep the trees of
the same SE to be diverse in terms of their features.
There are two prominent works [27, 28] in the recent literature, in which
researchers have reformulated the receiver operating characteristic convex hull (ROCCH)
maximization problem using multi-objective GP to attain binary classification. In
both of these works, true positive rate was maximized and false positive rate
was minimized simultaneously using an evolutionary multi-objective optimization
framework. In [28], they investigated the performances of different evolutionary

multi-objective optimization algorithms. In [27], on the other hand, the disadvantage of nondominated sorting [8] in EMOAs for ROCCH maximization has been dis-
cussed, and to overcome the issues, a new convex hull-based sorting scheme without
redundancy has been proposed. This work also introduces an area-based selection
scheme, which maximizes the area under the convex hull.
In the past few years, GP based systems [3, 4] have been designed to solve
imbalanced binary classification problems. Four new fitness functions have been
proposed in [3], which are especially designed for imbalanced binary classification
problems. Two of these four intend to enhance the traditional weighted average
accuracy measure, whereas, the remaining two are threshold-independent measures,
which aim to evolve solutions with good class separability but with faster training
times than the area under curve-based functions. To address the data imbalance
issue in binary classification, in [4] a multi-objective GP-based scheme has been
proposed. This strategy evolves diverse ensembles of genetic programs, which are
indeed classifiers with good performance on both the majority and the minority
classes. The final ensembles are collections of the evolved nondominated solutions
in the population. To ensure diversity in the population, two methods, namely negative
correlation learning and pairwise failure crediting, have been proposed.
A GP-based learning technique to evolve compact and accurate fuzzy rule-based
classification systems, which is especially suitable for high dimensional problems,
has been proposed in [2]. This genetic cooperative-competitive learning approach,
where the population constitutes the rule base, learns disjunctive normal form rules.
Moreover, a special token competition has been employed to maintain the diversity
of the population. It causes rules to compete and cooperate among themselves so
that a compact set of fuzzy rules can be obtained. Next, we discuss three methods
with reasonable details to explain how GP can be used to design classifiers with or
without explicit FS.

2.2 Multi-tree Classifiers Using Genetic Programming [21]

In this approach [21], to solve a c-class classification problem, a GP-based system evolves solutions considering an integrated view of all classes. It uses a multi-tree representa-
tion of the solutions. Some interesting attributes of this approach are: (i) use of a new
concept of unfitness, (ii) a new crossover operator, (iii) a new mutation operator, (iv)
OR-ing chromosomes of the terminal population, and (v) a weight-based scheme
and heuristic rules, which make the classifier capable of saying “don’t know” in
ambiguous situations. Below we discuss the relevant aspects in details.

Multi-tree Representation of Solutions and Fitness Function: For a c-class problem there are c trees in every solution, where each tree corresponds to a class and vice versa. When a data point x is passed through the ith tree $T_i$ of a given solution, if it produces a positive real value, the tree predicts that x belongs to the ith class. If more than one tree demonstrates a positive response for x, additional methodologies are required to assign a class to x. Moreover, the trees are initialized using the ramped half-and-half method with F = {+, −, ×, ÷} and T = {feature variables} ∪ R, where R is the set of all possible values in [0, 10].
If a tree predicts x accurately, the system considers that the accuracy of the tree for x is one; otherwise, it is zero. The normalized accuracy of all trees of a particular solution
for a given set of data points is considered the fitness of the solution.

Unfitness: Unlike the concept of fitness, unfitness is a less frequently used strategy. However, several works [21–23] have successfully used it to attain better performance. When fitness is used in the selection process, more fit solutions, i.e., solutions with higher fitness values, get selected. On the contrary, if unfitness is used in the selection process, then more unfit solutions, i.e., solutions with higher unfitness values, get selected. Unfitness-based selection helps unfit solutions to become more fit. For the given problem, a simple choice for the unfitness of a tree $T_i$ is the number of training data points that are wrongly classified by $T_i$. Let $u_j$ be the unfitness of the jth tree; then this can easily be used to select trees for genetic operations using Roulette wheel selection with probability $u_i / \sum_{j=1}^{c} u_j$.
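A Roulette wheel selection over such tree unfitness values might be implemented as below (our sketch, not the code of [21]); trees that misclassify more training points are picked more often.

```python
import random

def roulette_select_by_unfitness(unfitness):
    """Pick an index i with probability u_i / sum_j u_j."""
    total = sum(unfitness)
    if total == 0:                    # every tree already classifies all points correctly
        return random.randrange(len(unfitness))
    r = random.uniform(0, total)
    acc = 0.0
    for i, u in enumerate(unfitness):
        acc += u
        if r <= acc:
            return i
    return len(unfitness) - 1

# Example: the third tree misclassifies the most points, so it is selected most often.
print(roulette_select_by_unfitness([3, 1, 10, 0]))
```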

Crossover: At first, $\tau$ (the tournament size) solutions are randomly selected from the population for tournament selection. Then, the best two solutions $S_1$ and $S_2$ of the tournament are selected for the crossover operation. Suppose these solutions have c trees $T_i^{S_1}$ and $T_i^{S_2}$ ($i = 1, 2, \ldots, c$), respectively. Now, using the unfitness values of the trees of $S_1$, suppose the kth tree of $S_1$, i.e., $T_k^{S_1}$, is selected using Roulette wheel selection. Next, a node from each of $T_k^{S_1}$ and $T_k^{S_2}$ is randomly selected, where the probability of selecting a function node is $p_c^f$ and that of a terminal node is $(1 - p_c^f)$. After this, the subtrees rooted at the selected nodes are swapped. In addition, the trees $T_j^{S_1}$ of $S_1$ are swapped with the trees $T_j^{S_2}$ of $S_2$ for all $j > k$.

Mutation: For mutation, a random solution S is selected from the population. Then, with Roulette wheel selection on the unfitness values of its trees, a tree $T_k^S$ is selected. Next, a random node $n^{T_k^S}$ of $T_k^S$ is selected, where the probability of selecting a function node is $p_m^f$ and that of a terminal node is $(1 - p_m^f)$. If $n^{T_k^S}$ is a function node, it is replaced with a randomly chosen function node. Otherwise, it is replaced with a randomly chosen terminal node. After that, both the mutated and the original trees are evaluated with 50% of the samples of the kth class. Let the fitness values of the two trees be $f_m$ and $f_o$, respectively. If $f_m \geq f_o$, the mutated tree is accepted; otherwise, it is retained with a probability of 0.5. If $f_m = f_o$, then both trees are evaluated on the remaining 50% of the samples of the kth class to select one of the two.

Improving Performance with OR-ing: At the end of the evolution process, the best classifier is selected using accuracy. If more than one classifier has the same accuracy, the solution with the smallest number of nodes is selected. However, it may happen that there are two solutions (classifiers) $S_1 = \{T_i^{S_1}\}$ and $S_2 = \{T_i^{S_2}\}$, $(i = 1, 2, \ldots, c)$, such that $T_k^{S_1}$ models a particular area of the feature space well, whereas $T_k^{S_2}$ performs well in another segment of the feature space. To improve the performance, this attribute is exploited as follows. The best solution is OR-ed with all other solutions in the population and the best OR-ed pair is chosen as the final classifier. Note that here the essence of an ensemble of classifiers is introduced, i.e., more than one (two in this case) classifier is used to make the final decision.

2.3 Genetic Programming for Simultaneous Feature Selection and Classifier Design [22]

The GP-based classifier discussed in the last section does not explicitly do FS while
designing the classifier. The methodology that we discuss now integrates FS into the
classifier design task. This approach [22], simultaneously selects a good subset of
features and constructs a classifier using the selected features. The authors propose
two new crossover operators, which carry out the FS process. Like the previously
discussed work, here also a multi-tree representation of a classifier is used.

Selection of a Feature Subset Corresponding to Every Solution: Let there be $n_f$ features in total. To generate every solution of the initial population, a subset of features with cardinality $r_f$ ($r_f < n_f$) is randomly selected from the entire feature set, where $r_f$ is chosen using Roulette wheel selection with the probability $p_{r_f}$ defined as

$$p_{r_f} = \frac{n_f - r_f}{\sum_{j=1}^{n_f} (n_f - j)}. \qquad (1)$$

Note that $p_{r_f}$ decreases linearly with an increase in $r_f$ and $\sum_{r_f=1}^{n_f} p_{r_f} = 1$. After this, $r_f$ features are randomly selected from the entire set of features, and a single solution is initialized with this selected subset of features.
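Under the reconstruction of Eq. (1) above, the initialization of a feature subset can be sketched as follows (our illustration; n_f is the total number of features and r_f is restricted to be smaller than n_f).

```python
import random

def sample_subset_size(n_f):
    """Sample r_f in {1, ..., n_f - 1} with probability proportional to (n_f - r_f),
    as in Eq. (1); smaller subsets are therefore preferred."""
    sizes = range(1, n_f)
    weights = [n_f - r for r in sizes]
    return random.choices(sizes, weights=weights, k=1)[0]

def sample_feature_subset(n_f):
    """Draw a subset of feature indices for one solution of the initial population."""
    r_f = sample_subset_size(n_f)
    return random.sample(range(n_f), r_f)

print(sample_feature_subset(10))   # e.g. [3, 7, 0]
```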

Fitness: Let $f_{raw}$ be the raw fitness of a solution, which is indeed the accuracy achieved by the solution on the entire training data set. Then, the modified fitness function $f_s$ that incorporates the FS task is defined as

$$f_s = f_{raw}\left(1 + a\, e^{-r_f/n_f}\right), \qquad (2)$$

where

$$a = 2\beta\left(1 - \frac{g_{current}}{g_{maximum}}\right). \qquad (3)$$

Here $\beta$ is a constant, $g_{current}$ is the current generation number, and $g_{maximum}$ is the maximum number of generations of the evolution of GP. Note that the factor $e^{-r_f/n_f}$ decreases exponentially with an increase in $r_f$. Consequently, if two classifiers have the same value of $f_{raw}$, the one using fewer features attains a higher fitness value. Moreover, the penalty for using a larger number of features is dynamically increased with generations, which keeps emphasizing the FS task as the generations progress.
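A direct transcription of Eqs. (2) and (3) might look like this (our sketch; the value of the constant β used below is arbitrary, since the text does not fix it).

```python
import math

def fs_fitness(f_raw, r_f, n_f, g_current, g_maximum, beta=0.5):
    """Feature-selection-aware fitness of Eqs. (2)-(3): classifiers with the same raw
    accuracy f_raw score higher when they use fewer of the n_f available features."""
    a = 2.0 * beta * (1.0 - g_current / g_maximum)      # Eq. (3)
    return f_raw * (1.0 + a * math.exp(-r_f / n_f))     # Eq. (2)

# Two classifiers with the same accuracy but different feature counts:
print(fs_fitness(0.9, r_f=2, n_f=20, g_current=10, g_maximum=100))    # fewer features
print(fs_fitness(0.9, r_f=15, n_f=20, g_current=10, g_maximum=100))   # more features
```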

Crossover: There are two different crossover strategies: homogeneous crossover (crossover_hg) and heterogeneous crossover (crossover_ht). Crossover_hg restricts crossover to solutions that use the same feature subset. Crossover_ht prefers two solutions for crossover that use more common features. Moreover, the probability of using crossover_hg is $p_{hg} = g_{current}/g_{maximum}$, whereas the probability of using crossover_ht is $(1 - p_{hg})$.
To perform crossover_ht, at first $\tau$ solutions are randomly chosen as the tournament set. Then, the solution $S_1$, which has the best $f_s$ among this set, is selected as the first parent. After that, another set with cardinality $\tau$, denoted $T_{S_2}$, is randomly selected to choose the second parent $S_2$. Let the chosen feature subset of a given solution be denoted by a vector $v = (v_1, v_2, \ldots, v_{n_f})$, such that $v_i = 1$ if the ith feature is present and $v_i = 0$ otherwise. The similarity measure $s_j$ between the jth solution of $T_{S_2}$ and $S_1$ is computed as

$$s_j = \frac{\sum_{k=1}^{n_f} v_{S_1 k}\, v_{jk}}{\max\left\{\sum_{k=1}^{n_f} v_{S_1 k}, \sum_{k=1}^{n_f} v_{jk}\right\}}, \qquad (4)$$

where $v_{S_1 k} = 1$ if $S_1$ uses the kth feature and $v_{S_1 k} = 0$ otherwise, and $v_{jk} = 1$ if the jth solution of $T_{S_2}$ uses the kth feature and $v_{jk} = 0$ otherwise. After that, the jth classifier of the second set is selected with probability

$$p_j^{second} = \frac{f_{s_j} + \beta s_j}{\sum_{k=1}^{\tau} (f_{s_k} + \beta s_k)}, \qquad (5)$$

where $f_{s_k}$ denotes the fitness ($f_s$) of the kth solution of $T_{S_2}$ and $\beta$ is a constant.
In this approach, FS is usually accomplished in the first few generations. Note that, unlike the method that we have discussed in the previous section, this scheme does not use step-wise learning, because, at the beginning of the evolutionary process, step-wise learning uses only a small subset of training samples, which may not be very helpful for the FS process.
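The similarity measure of Eq. (4) and the second-parent selection of Eq. (5) can be sketched as follows (our illustration; feature subsets are encoded as 0/1 vectors and the β value is again arbitrary).

```python
import random

def similarity(v1, v2):
    """Eq. (4): overlap of two 0/1 feature-usage vectors."""
    common = sum(a * b for a, b in zip(v1, v2))
    return common / max(sum(v1), sum(v2))

def select_second_parent(tournament, v_s1, beta=0.5):
    """Eq. (5): pick the jth member of the tournament with probability proportional
    to f_s_j + beta * s_j. `tournament` is a list of (fitness, feature_vector) pairs."""
    scores = [f + beta * similarity(v_s1, v) for f, v in tournament]
    return random.choices(range(len(tournament)), weights=scores, k=1)[0]

tournament = [(0.80, [1, 0, 1, 0]), (0.78, [1, 1, 1, 0]), (0.85, [0, 0, 0, 1])]
print(select_second_parent(tournament, v_s1=[1, 0, 1, 0]))
```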

2.4 A Multiobjective GP-Based Ensemble for Feature Selection and Classification [23]

Unlike the previous two approaches [21, 22], this approach [23] divides a c-class
classification problem into c binary classification problems. Then, it evolves c popu-
lations of genetic programs to find out c sets of ensembles, which respectively solve
these c binary classification problems. To solve each binary classification problem, a multi-objective archive-based steady-state micro-genetic programming, abbreviated as ASMiGP, is employed. Here, to facilitate FS, it exploits the fitness values as well as the unfitness values of the features during the mutation operation. Both the fitness
and the unfitness of features are dynamically altered with generations with a view
to obtaining a set of relevant features with low redundancy. A new negative voting
strategy is introduced in this context.

The ASMiGP-based Learning Strategy: Each solution in the c populations is a binary classifier, which is encoded by a single tree. When a data point x is passed through the tree of a binary classifier of the ith population, a positive output and a negative output respectively indicate that x belongs to the ith class and does not belong to the ith class.
Typically with the evolution of a GP-based system, the average tree size of its
population is likely to increase without sufficient or any enhancement in performance
of the solutions. This phenomenon is called bloating [29]. Though there are several
ways to control bloating [20, 26, 29], one of the prominent ways is to add the
tree size as an additional objective. Moreover, if c is sufficiently high, even if the
data set is balanced, it may lead to c highly imbalanced binary classification data
sets, one for each of the c classification problems. In this case, instead of maximizing
classification accuracy or minimizing classification error, simultaneous minimization
of false positive (FP) and false negative (FN) might be more suitable. To address these
issues, this scheme [23] uses MOGP with the following three objectives: (i) FP, (ii)
FN, and (iii) the number of leaf nodes of the tree. The third objective is used to reduce
bloating, whereas, the first two objectives incorporate performance parameters of the
learning task which help to deal with the imbalance issue. Moreover, to reduce the
size of the trees, after generation of any tree throughout the learning phase (using
mutation, crossover, or random initialization), the subtrees consisting of only constant
leaf nodes are replaced by equivalent constant leaf nodes. This also helps to reduce
bloating.
This approach [23] uses a special environmental and a special mating selection
strategy. It maintains an archive (population) for every binary classification problem.
In every generation, it produces a single offspring using either the crossover operator or the mutation operator. These operations are selected randomly with probability $p_c$ and $(1 - p_c)$, respectively. The female parent for crossover and the only parent for mutation are selected by performing Roulette wheel selection on the accuracies of the solutions present in the population. For crossover, the additional (male) parent is selected randomly. After the generation of the new offspring, it is added to the archive with a
special multiobjective archive truncation strategy, which has been adopted from [24,
25]. This environmental selection scheme uses a Pareto-based fitness function, where
the Pareto dominance rank is used as the fitness function. This scheme, maintaining
a dynamic archive with a hard minimum and a hard maximum size, also ensures
diversity in the objective space.

Incorporation of Feature Selection: To assess the discriminating power of a feature $f$, the Pearson's correlation $C_f^i$ is computed between a vector containing the values of the feature and an ideal feature vector that holds the corresponding ideal values for the ith binary classification problem. Specifically, this ideal feature value is unity if the associated training point belongs to the ith class, and zero otherwise [11]. Clearly, a higher value of $|C_f^i|$ designates a stronger discriminative power of feature $f$ for the ith binary classification problem.
For the ith binary classification problem, we intend to incorporate only the features with high discriminating capability from the set of all features $F_{all}$. To attain this, we assign a fitness and an unfitness value to each feature, and these values change during the evolution. Throughout the process, features are selected to be added to a tree with probabilities proportional to their current fitness values. On the other hand, features are selected to be replaced (in case of mutation) with probabilities proportional to their current unfitness values. The fitness and unfitness of features corresponding to the ith binary classification problem, during the first 50% of the evaluations, are defined respectively as in Eqs. (6) and (7):

$$F_{fitness}^{0\%,i}(f, i) = \begin{cases} \left(\dfrac{C_f^i}{C_{max}^i}\right)^2, & \text{if } \dfrac{|C_f^i|}{C_{max}^i} > 0.3 \\ 0, & \text{otherwise,} \end{cases} \qquad (6)$$

$$F_{unfitness}^{0\%,i}(f, i) = 1.0 - F_{fitness}^{0\%,i}(f, i), \qquad (7)$$

where $C_{max}^i = \max_{f \in F_{all}} |C_f^i|$. Basically, to eliminate the impact of features with poor discriminating ability during the initial evolution process, Eq. (6) sets their fitness values to zero. Let us assume that $F_{eval=0\%,i} \subseteq F_{all}$ is the set of features with nonzero fitness values.
After completion of 50% of the evaluations, let the feature subset present in the population of the ith binary classification task be $F_{eval=50\%,i}$. At this point, the fitness of all features in $F_{all} - F_{eval=50\%,i}$ is set to zero. The assumption behind this is that, after 50% of the evaluations, the features which could help the ith binary classification task would already be used by the collection of trees of the corresponding population. Now the fitness and unfitness values of all features in $F_{eval=50\%,i}$ are changed respectively according to Eqs. (8) and (9):

$$F_{fitness}^{50\%,i}(f, i) = \begin{cases} \dfrac{|C_f^i|}{\sum_{g \in F_{eval=50\%,i},\, g \neq f} |\rho_{fg}|}, & \text{if } f \in F_{eval=50\%,i} \\ 0, & \text{otherwise,} \end{cases} \qquad (8)$$

$$F_{unfitness}^{50\%,i}(f, i) = e^{-\dfrac{F_{fitness}^{50\%,i}(f, i) - \min_f \{F_{fitness}^{50\%,i}\}}{\max_f \{F_{fitness}^{50\%,i}\} - \min_f \{F_{fitness}^{50\%,i}\}}}, \qquad (9)$$

where $\rho_{fg}$ denotes the Pearson's correlation between feature $f$ and feature $g$. This helps to select features with high relevance but with a reduced level of redundancy, i.e., to achieve maximum relevance and minimum redundancy (MRMR).
After 75% of the function evaluations, another snapshot of the population is taken. Let the set of features present in the ith population be $F_{eval=75\%,i} \subseteq F_{eval=50\%,i}$. Then, the fitness and unfitness values of the features in $F_{eval=75\%,i}$ are altered respectively as in Eqs. (10) and (11):

$$F_{fitness}^{75\%,i}(f, i) = \begin{cases} F_{fitness}^{0\%,i}(f, i), & \text{if } f \in F_{eval=75\%,i} \\ 0, & \text{otherwise,} \end{cases} \qquad (10)$$

$$F_{unfitness}^{75\%,i}(f, i) = 1.0 - F_{fitness}^{75\%,i}(f, i). \qquad (11)$$
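The staged feature fitness/unfitness schedule of Eqs. (6)-(9) can be sketched as below (our illustration, not the authors' implementation; `corr` maps each feature to its correlation with the ideal class-membership vector, `rho` holds the feature-feature correlations, and the normalization in Eq. (9) is assumed to be taken over the surviving features).

```python
import math

def stage0_feature_fitness(corr):
    """Eqs. (6)-(7): correlation-based feature fitness before 50% of the evaluations."""
    c_max = max(abs(c) for c in corr.values())
    fitness = {f: (c / c_max) ** 2 if abs(c) / c_max > 0.3 else 0.0
               for f, c in corr.items()}
    unfitness = {f: 1.0 - v for f, v in fitness.items()}
    return fitness, unfitness

def stage50_feature_fitness(corr, rho, surviving):
    """Eqs. (8)-(9): for the features still present in the population ('surviving'),
    fitness is relevance divided by redundancy; unfitness decays exponentially with
    the normalized fitness. Features outside 'surviving' get zero fitness."""
    fitness = {f: 0.0 for f in corr}
    for f in surviving:
        redundancy = sum(abs(rho[f][g]) for g in surviving if g != f)
        fitness[f] = abs(corr[f]) / redundancy if redundancy > 0 else abs(corr[f])
    values = [fitness[f] for f in surviving]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0
    unfitness = {f: math.exp(-(fitness[f] - lo) / span) for f in surviving}
    return fitness, unfitness
```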

Crossover: This scheme uses a crossover with male and female differentiation and
tries to generate an offspring near the female (acceptor) parent. For this, a part of the male (donor) parent replaces a part of the female parent. At first, from each parent a random point (node) is chosen. The probabilities of selecting a terminal node and a non-terminal node are respectively $p_t^c$ and $(1 - p_t^c)$. After that, the subtree rooted
at the node selected from the female tree is replaced by the subtree rooted at the
selected node of the male tree.

Mutation: To mutate a tree, the following operations are performed on the tree: (i)
Each constant node of the tree is replaced by a randomly generated constant node with
probability $p_c^m$. (ii) Each function node is visited, and the function there is replaced by a randomly selected function with probability $p_f^m$. (iii) Only one feature node of the tree is replaced by a randomly selected feature node. Among the feature nodes, the mutation point is selected with a probability proportional to the unfitness of the features present in the selected tree. Moreover, the feature which replaces the old feature is also selected with a probability proportional to the fitness values of the features.

Voting Strategy: After the learning is over, a special negative voting strategy is used. At the end of the evolution, c ensembles of genetic programs are obtained: $A = \{A_1, A_2, \ldots, A_c\}$; $\forall i$, $1 \leq |A_i| \leq N_{max}$, where c is the number of classes and $A_i$ is the ensemble corresponding to the ith class. To determine whether a point $\mathbf{p}$ belongs to the mth class or not, a measure called net belongingness, $B_m^{net}(\mathbf{p})$, corresponding to the mth class for $\mathbf{p}$ is calculated as follows:

$$B_m^{net}(\mathbf{p}) = \frac{1}{2} \left( \frac{1}{|A_m|} \sum_{i=1}^{|A_m|} B_m^i(\mathbf{p}) + 1.0 \right). \qquad (12)$$

Here, $B_m^i(\mathbf{p})$ is defined as

$$B_m^i(\mathbf{p}) = \begin{cases} +\left(1.0 - \dfrac{FP_m^i}{FP_m^{max}}\right), & \text{if } A_m^i(\mathbf{p}) > 0 \\ -\left(1.0 - \dfrac{FN_m^i}{FN_m^{max}}\right), & \text{otherwise,} \end{cases} \qquad (13)$$

where $FP_m^i$ and $FN_m^i$ respectively denote the number of FPs and FNs made by the ith individual of $A_m$ on the training data set; $FP_m^{max}$ and $FN_m^{max}$ respectively denote the maximum possible FP and the maximum possible FN for the mth class (determined using the training data set); and $A_m^i(\mathbf{p})$ is the output of the ith individual of $A_m$ for $\mathbf{p}$. Finally, $\mathbf{p}$ is assigned the class $k$ if $B_k^{net} = \max_{m=1}^{c} \{B_m^{net}\}$. Note that net belongingness lies in $[0, 1]$, and a large value of this measure indicates a higher chance of belonging to the corresponding class.
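The negative voting of Eqs. (12) and (13) can be sketched as follows (our illustration); each member of an ensemble is represented by its output on the point together with its FP and FN counts on the training data.

```python
def net_belongingness(members, fp_max, fn_max):
    """Eq. (12): average the signed votes of Eq. (13) and rescale into [0, 1].
    `members` is a list of (output, fp, fn) tuples for one ensemble A_m."""
    votes = []
    for output, fp, fn in members:
        if output > 0:
            votes.append(+(1.0 - fp / fp_max))   # positive vote, weaker for many FPs
        else:
            votes.append(-(1.0 - fn / fn_max))   # negative vote, weaker for many FNs
    return 0.5 * (sum(votes) / len(members) + 1.0)

def predict_class(per_class_members, fp_max, fn_max):
    """Assign the class whose ensemble attains the largest net belongingness (1-based)."""
    scores = [net_belongingness(m, fp_max[i], fn_max[i])
              for i, m in enumerate(per_class_members)]
    return 1 + max(range(len(scores)), key=scores.__getitem__)
```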

3 Some Remarks

3.1 GP for Classification and Feature Selection

While designing a GP-based system for classification and FS, several important
issues should be kept in mind. In this section, we discuss some of these salient issues
with details and some ways to address them. Note that, most of these issues depend
not only on the GP-based strategy, but also on the data set, i.e., on the number of
features, the number of classes, the distribution of the data in different (majority
and minority) classes, the feature-to-sample ratio, the distribution of the data in the
feature space, etc. Hence, there is no universal rule to handle all of them.
While designing a GP-based system for multi-class classification, the first problem
is how to handle the multi-class nature of the data. Primarily there are two ways. The
more frequently used scheme is to decompose a c-class problem into c binary classi-
fication problems and then develop a separate GP-based system for each of the binary classification problems. Every candidate solution of the ith binary classification sys-
tem may have one tree, where a positive output from the tree for a given point would
indicate that the point belongs to the ith class. To solve every binary classification
problem, one may choose to use the best binary classifier (genetic program) found in
every binary system, or she may choose to use a set of binary classifiers (ensemble)
obtained from these binary systems. If an ensemble-based strategy is used, there needs to be a voting (aggregation) scheme. The voting scheme may become more effec-
tive if weighted and/or negative voting approach is judiciously designed. Again, to
decide the final class label, the outputs (decisions) from every binary system need to
be accumulated and processed. Here also, some (weighted, negative) voting scheme,
or some especially designed decision making system can be developed. Another
comparatively less frequently used strategy is as follows. Every solution consists of
a tree, where the output of the tree is divided into c windows using (c − 1) thresholds.

Each window would have a one-to-one correspondence with a class.
If the output from a tree falls inside the ith window, the tree would predict that the
point belongs to the ith class. The second approach is not recommended as it is less
likely to generate satisfactory outcome. Besides, even if it does, the interpretability
of the GP-based rules (classifiers) would be low.
If a c-class data set (c > 2) is decomposed into c binary classification data sets,
even if the class-wise data distribution is near uniform, after decomposition, every
binary data set would be imbalanced. The degree of imbalance, in this case, increases
with the increase in c. If special care is not taken and only accuracy (or classification
error) is used as the objective, each binary module may produce nearly (c − 1)/c ×
100% accuracy predicting “NO” to every data point. A possible solution to this
problem may be to use a multi-objective approach or a weighted sum approach
considering both the accuracies on the majority class and the minority class. In our
view, multi-objective approach is a better alternative because the choice of weights
for the majority and the minority class accuracies is hard to decide, and their choices
may have a large impact on the performance of the system.
GP has some interesting issues with generalization of the hidden patterns stored
in a particular data set. For example, consider the artificial “XOR” type binary classi-
fication data set shown in Fig. 6. Here, uniformly distributed points inside the circles
$C_3: (x_1 - 1)^2 + (x_2 - 1)^2 = 0.75$ and $C_4: (x_1 + 1)^2 + (x_2 + 1)^2 = 0.75$ belong to class 1, and uniformly distributed points inside the circles $C_5: (x_1 - 1)^2 + (x_2 + 1)^2 = 0.75$ and $C_6: (x_1 + 1)^2 + (x_2 - 1)^2 = 0.75$ belong to class 2. The
points of class 1 and class 2 have been respectively denoted by the symbols “+”
and “◦”. A solution (binary classifier) with tree b : x1 × x2 − 0.1 (see Fig. 3) can
accurately classify this data set. Tree b is a small equation and hence a simple model
in GP is capable of modeling the hidden pattern stored in this complicated “XOR”-type data set.

Fig. 6 An artificial “XOR”-type binary classification data set and the binary classifier with tree b performing accurately

Fig. 7 An artificial binary classification data set and the binary classifier with tree b showing poor generalization

However, if we try to learn this data set with a multi-layer perceptron
type classifier, due to the “XOR”-type pattern of the data, it may not be very easy to learn. Though this example may appear to illustrate GP as a powerful tool, this specialization capability of GP may sometimes lead to poor generalization. To illus-
trate a case of poor generalization with an example, let us consider another artificial
binary classification data set shown in Fig. 7. There, the class 1 points are denoted
using “+” and class 2 points are denoted using “◦”. The class 1 points are uniformly
distributed inside the circle $C_3: (x_1 - 1)^2 + (x_2 - 1)^2 = 0.75$ and the class 2 points are uniformly distributed inside the circle $C_6: (x_1 + 1)^2 + (x_2 - 1)^2 = 0.75$. For
this data set also, the binary classifier with tree b can accurately classify all the data
points. But, for the points denoted by “∗” in Fig. 7, the classifier would predict class
1. For these points the classifier should not make any decision. In this example, GP
ends up with a poor generalization.
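The claim about tree b on the “XOR”-type data set of Fig. 6 can be checked in a few lines (our sketch, reusing the sample_in_circle helper from the earlier example).

```python
# Class 1 lives inside C3 and C4, class 2 inside C5 and C6; tree b is (x1 * x2) - 0.1.
xor_data  = [(p, 1) for p in sample_in_circle( 1,  1, 0.75, 100)]
xor_data += [(p, 1) for p in sample_in_circle(-1, -1, 0.75, 100)]
xor_data += [(p, 2) for p in sample_in_circle( 1, -1, 0.75, 100)]
xor_data += [(p, 2) for p in sample_in_circle(-1,  1, 0.75, 100)]

tree_b = lambda x1, x2: (x1 * x2) - 0.1
accuracy = sum((1 if tree_b(x1, x2) > 0 else 2) == y
               for (x1, x2), y in xor_data) / len(xor_data)
print(accuracy)   # 1.0: tree b separates the "XOR" pattern
```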
Bloating is another important issue that needs to be addressed to develop a well
performing GP-based system. Though there are efficient methods in the literature
[20, 26, 29], the following two bloat control strategies are quite straightforward.
First, if a single objective approach is used, the size of the tree can be incorporated
with the objective function as a penalty factor such that the penalty is minimized.
Second, if a multi-objective approach is used, the size of the tree can be added as an
additional objective.
As we have already discussed, GP has an intrinsic FS capability. To enforce
FS further, whenever a feature is selected to be included in any tree, the model
should try to select the best possible feature for that scenario. In a similar fashion,
whenever a feature node is removed from any classifier, it should not be an important
feature under that scenario. Note that, if an ensemble based strategy is used, for
enhanced performance the members of the ensemble should be diverse but accurate.

To exploit this attribute, the model should try to develop member classifiers, i.e.,
genetic programs, diverse in terms of the features they use. Again, for a single binary
classifier, the features used by it should have a controlled level of redundancy but
the features should possess maximum relevance for the intended task.

3.2 Parameter Dependency

Any GP-based system requires a set of parameters and the performance of the system
is dependent on their chosen values. To discuss the parameter dependency, we select
the work proposed in [23]. Note that, this is one of the recent works that performs
simultaneous FS and classification. We have chosen this method because it has been
empirically shown to be effective on a wide range of data sets with a large number
of classes, a large number of features, and a large feature-to-sample ratio.
Table 1 shows the parameters and their values used in [23]. We consider the CLL-SUB-111 data set used in [23], which is a three-class data set with 111 data points and 11,340 features, i.e., the feature-to-sample ratio is 102.16. We have repeated the ten-fold cross validation of the method proposed in [23] ten times. Before training, we
have normalized the training data using Z-score, and based on the means and the
standard deviations of features in the training data, we also do Z-score normalization
of the test data.
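The normalization protocol described here is standard: the Z-score parameters are estimated on the training fold only and then reused on the test fold. A sketch with NumPy (ours):

```python
import numpy as np

def zscore_fit_transform(X_train, X_test):
    """Normalize each feature to zero mean and unit variance using only the training
    statistics, then apply the same transformation to the test data."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0)
    std[std == 0] = 1.0            # guard against constant features
    return (X_train - mean) / std, (X_test - mean) / std
```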
Except Nmax and Nmin, every other parameter used in [23] is a common parameter required for any GP-based method. While most GP-based approaches require a single parameter called the population size, the method in [23] requires two special parameters, Nmax and Nmin, to bound the dynamic population size. The performance of any GP-based system is largely dependent on two parameters: the number of function evaluations (Feval) and the population size.

Table 1 Parameter settings for the ASMiGP-based method proposed in [23]

Set of functions (F): {+, −, ×, ÷}
Range of initial values of constants (C): [0, 2]
Maximum depth of tree during initialization: 6
Maximum allowable depth of tree: 10
Maximum archive (population) size (Nmax): 50
Minimum archive (population) size (Nmin): 30
Initial probability of feature nodes (pvar): 0.8
Probability of crossover (pc): 0.8
Probability of crossover for terminal nodes (ptc): 0.2
Probability of mutation for constants (pcm): 0.3
Probability of mutation for function nodes (pfm): 0.1
Function evaluations for each binary classifier (Feval): 400,000
Therefore, we choose to show the impact of these parameters on the performance of this method. To attain this, we have repeated our experiment with the CLL-SUB-111 data set seven times, each with
a different parameter setting. The parameter settings used and their corresponding
results are shown in Table 2. These results demonstrate that with an increase in Feval ,
the accuracy increases, and with an increase in the population size (bounded by Nmax
and Nmin ), the number of selected features increases. As indicated by the average tree
size, in every case the method could find somewhat small trees (binary classifiers).
To illustrate this with an example, in Table 3, we have provided six equations, which
are generated with the parameter setting S-I. They are the first two (as appeared
in the ensemble) binary classifiers of the final populations corresponding to the
three binary classification problems associated with the first fold of the first 10-fold
cross validation. We have also provided their objective values in that table, which
indicate that all of the binary classifiers could produce 100% training accuracy for
the corresponding binary classification problem. It is noteworthy that these six rules
are simple, concise, and human interpretable.

Table 2 Experimental settings and results

ID | Alteration from Table 1 | %TA (a) | FS (b) | TS (c) | F/T (d) | %F (e) | %F/T (f)
S-I | Unchanged | 80.22 | 342.9 | 7.33 | 3.22 | 3.02 | 0.03
S-II | Feval = 400 | 67.19 | 547.8 | 6.91 | 2.53 | 4.83 | 0.02
S-III | Feval = 4000 | 76.48 | 334.0 | 8.57 | 3.12 | 2.95 | 0.03
S-IV | Feval = 40,000 | 77.33 | 305.9 | 9.24 | 3.69 | 2.70 | 0.03
S-V | Nmin = 90, Nmax = 150 | 80.02 | 398.4 | 7.29 | 3.40 | 3.51 | 0.03
S-VI | Nmin = 150, Nmax = 250 | 79.03 | 422.7 | 7.75 | 3.68 | 3.73 | 0.03
S-VII | Nmin = 250, Nmax = 300 | 78.17 | 436.5 | 8.39 | 4.03 | 3.85 | 0.04

(a) Test accuracy, (b) number of features selected per classifier, (c) tree size, (d) number of features per tree, (e) percentage of features selected, (f) percentage of features selected per tree

Table 3 The first two binary classifiers obtained corresponding to the three binary classification problems associated with the first fold of the first 10-fold cross validation

Class | Equation | Objective values
Class 1 | (x5261 − 1.3828) | (0.0, 0.0, 2.0)
Class 1 | (0.8411 + x6911) | (0.0, 0.0, 2.0)
Class 2 | (x8962 + x9153) | (0.0, 0.0, 2.0)
Class 2 | (−1.4391 + x5261) | (0.0, 0.0, 2.0)
Class 3 | (x8373 − 0.7756) | (0.0, 0.0, 2.0)
Class 3 | (x8373 − 0.5976) | (0.0, 0.0, 2.0)

4 Conclusion

We have briefly reviewed some of the GP-based approaches to classifier design, some
of which do FS. In this context, three approaches are discussed with reasonable details
with a view to providing a comprehensive understanding of various aspects related to
fitness, unfitness, selection, and genetic operations. We have also briefly discussed the
issues related to the choice of parameters and protocols, which can significantly alter
the performance of GP-based classifiers. However, there are important issues that
we have not discussed. For example, how to efficiently design GP-based classifiers
along with feature selection in a big data environment? How to enhance readability
of GP classifiers? How to deal with non-numeric data along with numeric ones
for designing GP-based classifiers? These are important issues that need extensive
investigation.

References

1. http://www.alanturing.net/turing_archive/archive/l/l32/L32-019.html. Accessed 17 Jan 2018
2. Berlanga, F.J., Rivera, A., del Jesús, M.J., Herrera, F.: GP-COACH: genetic programming-based
learning of compact and accurate fuzzy rule-based classification systems for high-dimensional
problems. Inf. Sci. 180(8), 1183–1200 (2010)
3. Bhowan, U., Johnston, M., Zhang, M.: Developing new fitness functions in genetic program-
ming for classification with unbalanced data. IEEE Trans. Syst. Man Cybern. Part B: Cybern.
42(2), 406–421 (2012)
4. Bhowan, U., Johnston, M., Zhang, M., Yao, X.: Evolving diverse ensembles using genetic
programming for classification with unbalanced data. IEEE Trans. Evol. Comput. 17(3), 368–
386 (2013)
5. Chakraborty, D., Pal, N.R.: Selecting useful groups of features in a connectionist framework.
IEEE Trans. Neural Netw. 19(3), 381–396 (2008)
6. Colorni, A., Dorigo, M., Maniezzo, V., et al.: Distributed optimization by ant colonies. In:
Proceedings of the First European Conference on Artificial Life, vol. 142, pp. 134–142. Paris,
France (1991)
7. Cramer, N.L.: A representation for the adaptive generation of simple sequential programs. In:
Proceedings of the First International Conference on Genetic Algorithms, pp. 183–187 (1985)
8. Deb, K., Pratap, A., Agarwal, S., Meyarivan, T.: A fast and elitist multiobjective genetic algo-
rithm: NSGA-II. IEEE Trans. Evol. Comput. 6(2), 182–197 (2002)
9. Dorigo, M.: Optimization, learning and natural algorithms. Ph. D. thesis, Politecnico di Milano,
Italy (1992)
10. Espejo, P.G., Ventura, S., Herrera, F.: A survey on the application of genetic programming to
classification. IEEE Trans. Syst. Man Cybern. Part C: Appl. Rev. 40(2), 121–144 (2010)
11. Hong, J.H., Cho, S.B.: Gene boosting for cancer classification based on gene expression profiles.
Pattern Recogn. 42(9), 1761–1767 (2009)
12. Kennedy, J., Eberhart, R.: Particle swarm optimization. In: Proceedings of International Con-
ference on Neural Networks, vol. 4., pp. 1942–1948. IEEE (Nov 1995)
13. Kishore, J., Patnaik, L.M., Mani, V., Agrawal, V.: Application of genetic programming for
multicategory pattern classification. IEEE Trans. Evol. Comput. 4(3), 242–258 (2000)
14. Kishore, J., Patnaik, L.M., Mani, V., Agrawal, V.: Genetic programming based pattern classi-
fication with feature space partitioning. Inf. Sci. 131(1), 65–86 (2001)

15. Koza, J.R.: Genetic Programming: On the Programming of Computers by Means of Natural
Selection. MIT Press, Cambridge (1992)
16. Koza, J.R.: Genetic Programming II: Automatic Discovery of Reusable Programs. MIT press,
Cambridge (1994)
17. Koza, J.R., Bennett III, F.H., Stiffelman, O.: Genetic Programming as a Darwinian Invention
Machine. Springer, Berlin (1999)
18. Koza, J.R., Keane, M.A., Streeter, M.J., Mydlowec, W., Lanza, G., Yu, J.: Genetic Program-
ming IV: Routine Human-Competitive Machine Intelligence, vol. 5. Springer Science+Business
Media (2007)
19. Liu, K.H., Xu, C.G.: A genetic programming-based approach to the classification of multiclass
microarray datasets. Bioinformatics 25(3), 331–337 (2009)
20. Luke, S., Panait, L.: A comparison of bloat control methods for genetic programming. Evol.
Comput. 14(3), 309–344 (2006)
21. Muni, D.P., Pal, N.R., Das, J.: A novel approach to design classifiers using genetic program-
ming. IEEE Trans. Evol. Comput. 8(2), 183–196 (2004)
22. Muni, D.P., Pal, N.R., Das, J.: Genetic programming for simultaneous feature selection and
classifier design. IEEE Trans. Syst. Man Cybern. Part B: Cybern. 36(1), 106–117 (2006)
23. Nag, K., Pal, N.: A multiobjective genetic programming-based ensemble for simultaneous
feature selection and classification. IEEE Trans. Cybern. 99, 1–1 (2015)
24. Nag, K., Pal, T., Pal, N.: ASMiGA: an archive-based steady-state micro genetic algorithm.
IEEE Trans. Cybern. 45(1), 40–52 (2015)
25. Nag, K., Pal, T.: A new archive based steady state genetic algorithm. In: 2012 IEEE Congress
on Evolutionary Computation (CEC), pp. 1–7. IEEE (2012)
26. Poli, R.: A simple but theoretically-motivated method to control bloat in genetic programming.
In: Genetic Programming, pp. 204–217. Springer, Berlin (2003)
27. Wang, P., Emmerich, M., Li, R., Tang, K., Baeck, T., Yao, X.: Convex hull-based multi-
objective genetic programming for maximizing receiver operating characteristic performance.
IEEE Trans. Evol. Comput. 99, 1–1 (2014)
28. Wang, P., Tang, K., Weise, T., Tsang, E., Yao, X.: Multiobjective genetic programming for
maximizing ROC performance. Neurocomputing 125, 102–118 (2014)
29. Whigham, P.A., Dick, G.: Implicitly controlling bloat in genetic programming. IEEE Trans.
Evol. Comput. 14(2), 173–190 (2010)
