Genetic Programming for Classification and Feature Selection
1 Introduction
K. Nag (B)
Department of IEE, Jadavpur University, Kolkata, India
e-mail: [email protected]
N. R. Pal
ECS Unit, Indian Statistical Institute, Calcutta, India
e-mail: [email protected]
© Springer International Publishing AG, part of Springer Nature 2019
J. C. Bansal et al. (eds.), Evolutionary and Swarm Intelligence
Algorithms, Studies in Computational Intelligence 779,
https://doi.org/10.1007/978-3-319-91341-4_7
EC: (i) evolutionary programming (EP), (ii) evolutionary strategy (ES), (iii) genetic
algorithm (GA), and (iv) genetic programming (GP).
EC usually initializes a population with a set of randomly generated candidate
solutions. However, if domain knowledge is available, it can be used to generate the
initial population. Then, the population is evolved. This evolutionary process incorpo-
rates natural selection and other evolutionary operators. From an algorithmic point of
view, it is a guided random search that uses parallel processing to achieve the desired
solutions. Note that natural selection must be incorporated in an EA; otherwise, the
approach cannot be categorized as an EC technique. For example, though several
metaheuristic algorithms, such as particle swarm optimization (PSO) [12] and ant
colony optimization (ACO) [6, 9], are nature-inspired algorithms (NIAs), they are
not EAs. Note that they are sometimes still loosely referred to as EC techniques.
In 1948, in a technical report [1], titled “Intelligent Machinery”, written for
the National Physical Laboratory, Alan M. Turing wrote, “There is the genetical or evo-
lutionary search by which a combination of genes is looked for, the criterion being
survival value. The remarkable success of this search confirms to some extent the
idea that intellectual activity consists mainly of various kinds of search.” To the
best of our knowledge, this is the first technical article where the concept of
evolutionary computation appears. However, it took a few more decades to develop
the following three distinct interpretations of this philosophy: (i) EP, (ii) ES, and (iii)
GA. For the next one and a half decades, these three areas grew separately. Later, in
the early nineties, they were unified as a subfield of the same technology, namely
EC. Each of EP, ES, and GA is an algorithm for finding solutions to an optimization
problem: it finds a parameter vector that optimizes an objective function. Unlike
these three branches, GP finds a program to solve a problem. The concept of modern
tree-based GP was proposed by Cramer in 1985 [7]. Later, Koza popularized it with
his many eminent works [15–18]. A large number of GP-based inventions have been
made after 2000, i.e., after the emergence of sufficiently powerful hardware.
Trees need to be able to encode any possible valid solution of the problem to satisfy
the sufficiency property. Thus, Lisp or any other functional programming language
that naturally embodies tree structures can be used to represent a candidate solution
in GP. The use of non-tree representations to encode solutions in GP is comparatively
less popular. An example of this is linear genetic programming, which is suitable for
more traditional imperative languages. In this chapter, however, we concentrate only
on tree-based GP. The most frequently used representations of tree-based GPs are
decision trees, classification rules, and discriminant functions. Here, we confine our
discussion primarily to discriminant function based GPs.
Consider the tiny, artificial, two-dimensional, binary classification data set shown in
Fig. 1. It consists of uniformly distributed points inside two circles: C1 : (x1 − 1)2 +
(x2 + 1)2 = 0.98 and C2 : (x1 + 1)2 + (x2 − 1)2 = 0.98, which are represented by
‘◦’s and ‘+’s, respectively. Points inside each circle represent a class. The points
corresponding to C1 belong to class 1 and the points corresponding to C2 belong to
class 2. Let us try to learn a GP-based binary classifier for this data set. To achieve
this, let us consider that every candidate solution in the GP-based system consists
of a tree that encodes a discriminant function. When we recursively evaluate a tree
T (·) using a data point x, it returns a real value r_x^T = T (x). If r_x^T > 0, the binary
classifier predicts that x belongs to class 1; otherwise, it predicts that x belongs to
class 2. Let the set of operators (internal nodes) be F = {+, −, ×, ÷}, and the set of
operands (leaf nodes) be T = R ∪ X, where X = {x1, x2} is the set of features. We also consider
that every operator f ∈ F is defined in such a way that if it returns an undefined
or infinite value (a value that is beyond storing due to precision limit), the returned
value is converted to zero. This conversion may not be the best way to handle this
issue; however, with it, every subtree constructed using F and T returns a real value.
Consequently, each f ∈ F is well defined and closed under all possible operands that
it may encounter. Hence, this scheme meets the closure property. Moreover, for the
given problem, we assume that F satisfies the sufficiency property, although strictly
speaking this is not true. For example, an infinite number of equations can solve
the classification problem, and using F and T we can design infinitely many trees
that solve it; yet it may not be possible to generate all possible functional forms
of solutions even for this simple data set. This of course
does not cause any problem from a practical point of view as long as F and T are
able to generate useful solutions at least approximately.
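To make the closure scheme above concrete, here is a minimal Python sketch (our own illustrative encoding, not from the chapter) in which every operator in F is wrapped so that an undefined or non-finite result becomes zero, and a tree built from F and T is evaluated recursively:

```python
import math
import operator

def protected(op):
    """Wrap a binary operator so undefined/non-finite results become 0.0."""
    def wrapped(a, b):
        try:
            r = op(a, b)
        except (ZeroDivisionError, OverflowError):
            return 0.0
        return r if math.isfinite(r) else 0.0
    return wrapped

OPS = {
    '+': protected(operator.add),
    '-': protected(operator.sub),
    '*': protected(operator.mul),
    '/': protected(operator.truediv),
}

def evaluate(tree, x):
    """Recursively evaluate a tree T(.) on a data point x.

    A tree is a constant, a feature name like 'x1', or a tuple
    (op, left_subtree, right_subtree)."""
    if isinstance(tree, tuple):
        op, left, right = tree
        return OPS[op](evaluate(left, x), evaluate(right, x))
    if isinstance(tree, str):          # leaf: feature variable
        return x[tree]
    return float(tree)                 # leaf: constant

# Tree a : (x1 - x2) - 0.1 from the running example
tree_a = ('-', ('-', 'x1', 'x2'), 0.1)
print(evaluate(tree_a, {'x1': 1.0, 'x2': -1.0}))            # 1.9
print(evaluate(('/', 'x1', 'x2'), {'x1': 1.0, 'x2': 0.0}))  # 0.0 (closure)
```

Because every wrapped operator returns a real number for any operands, every tree built from these primitives satisfies the closure property described above.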
To illustrate further, we show three binary classifiers in Fig. 1, which are denoted
by a : (x1 − x2) − 0.1, b : (x1 × x2) − 0.1, and c : (x1 × x1) − x2. Moreover, we
show the tree structures corresponding to these three classifiers respectively in Figs. 2, 3,
and 4. As mentioned earlier, infinitely many “correct” solutions to this problem are
possible. However, we have purposefully chosen these three particular trees, which
have visually similar structures. The solution with tree a would predict all the
points accurately; the solution with tree b would predict the points of both classes
as belonging to class 2; and the solution with tree c would predict all points of
class 1 accurately and some points of class 2 accurately.
Fig. 2 Tree a : (x1 − x2) − 0.1
Fig. 3 Tree b : (x1 × x2) − 0.1
Fig. 4 Tree c : (x1 × x1) − x2
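The predictions of the three discriminant functions can be checked directly at the two class centres (1, −1) and (−1, 1); the following plain-Python sketch (names are ours) applies the sign convention T(x) > 0 → class 1:

```python
# Quick check of trees a, b, c at the class centres of C1 and C2.

def classify(T, x1, x2):
    """Sign convention from the text: T(x) > 0 -> class 1, else class 2."""
    return 1 if T(x1, x2) > 0 else 2

a = lambda x1, x2: (x1 - x2) - 0.1
b = lambda x1, x2: (x1 * x2) - 0.1
c = lambda x1, x2: (x1 * x1) - x2

for name, T in (('a', a), ('b', b), ('c', c)):
    # first argument pair: centre of C1 (class 1); second: centre of C2 (class 2)
    print(name, classify(T, 1, -1), classify(T, -1, 1))
# a 1 2   -> both centres correct
# b 2 2   -> everything labelled class 2
# c 1 2   -> both centres correct (though c fails on part of class 2)
```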
In any GP-based approach, after the encoding scheme, the second most important
issue is the formulation of the objective function that would be used to evaluate
the solutions. Here, to evaluate binary classifiers, we use prediction accuracy on the
training data set as the evaluation/objective function, this is a simple, straightforward
yet an effective strategy. Note that, this objective function is to be maximized.
The third most important issue in the design of a GP-based system is the choice of
operators. As we shall see later, there are many issues, such as FS, bloat control,
fitness, and unfitness, that should be kept in mind while developing these operators. However,
here we discuss a primitive crossover and a primitive mutation technique, which are
adequate to solve this problem.
The crossover operator requires two parents S1 and S2 to generate a new offspring
O. To generate the tree of O, it selects two random nodes (leaf or non-leaf), one
from the tree of S1 and the other from the tree of S2. Let n_{S1} and n_{S2}
respectively denote those nodes. Then, it replaces the subtree of S1 rooted
at n_{S1}, called the crossover point, by the subtree of S2 rooted at n_{S2}. To
illustrate this with an example, we assume that the trees associated with S1 and S2 are
respectively c and a. The randomly selected crossover points, and their respective
subtrees are also shown in Figs. 2 and 4. After replacing the selected subtree of c
(see Fig. 4) by the selected subtree of a (see Fig. 2), the crossover operator generates
an offspring O with tree d : x1 − x2, which is illustrated in Fig. 5. Though we do
not show tree d in Fig. 1, it is a straight line parallel to tree a that passes through
the origin. Moreover, it can also accurately classify the given data set. Thus, though
Fig. 5 Tree d : x1 − x2
the classifier with tree d yields the same accuracy as that by the classifier with tree
a, it is a better choice due to its simplicity (smaller size of the tree).
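The subtree crossover just described can be sketched as follows; the nested-list tree encoding, path-based node addressing, and helper names are our own illustrative choices:

```python
import random

def all_paths(tree, path=()):
    """Enumerate the paths (tuples of 0/1 child indices) of all nodes."""
    yield path
    if isinstance(tree, list):
        yield from all_paths(tree[1], path + (0,))
        yield from all_paths(tree[2], path + (1,))

def get_subtree(tree, path):
    for i in path:
        tree = tree[i + 1]
    return tree

def replace_subtree(tree, path, new):
    if not path:
        return new
    out = list(tree)                       # copy this node
    out[path[0] + 1] = replace_subtree(out[path[0] + 1], path[1:], new)
    return out

def crossover(s1, s2, rng=random):
    """Replace a random subtree of s1 (the crossover point) by a random subtree of s2."""
    p1 = rng.choice(list(all_paths(s1)))
    p2 = rng.choice(list(all_paths(s2)))
    return replace_subtree(s1, p1, get_subtree(s2, p2))

# Reproducing the example from the text deterministically: replacing the
# subtree (x1 x x1) of c by the leaf x1 taken from a yields d : x1 - x2.
tree_c = ['-', ['*', 'x1', 'x1'], 'x2']
tree_a = ['-', ['-', 'x1', 'x2'], 0.1]
tree_d = replace_subtree(tree_c, (0,), get_subtree(tree_a, (0, 0)))
print(tree_d)   # ['-', 'x1', 'x2']
```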
The mutation operator needs only one solution S. A node from the tree associated
with S is randomly selected. If the node is a terminal node, a random number r_n is
drawn from [0, 1]. If r_n < p_variable (p_variable is a pre-decided probability of selecting
a variable), the selected terminal node is replaced by a randomly selected
variable (feature). Otherwise, the node is replaced by a randomly generated constant
r_c ∈ R_c, where R_c ⊆ R. R_c should be chosen judiciously; for this par-
ticular example, a good choice of R_c might be [−2, 2]. To illustrate the mutation
process with an example, let us consider the solution with tree d shown in Fig. 5.
Suppose the randomly selected mutation point is the node with feature x1 . Moreover,
let us consider that we randomly select to replace this node with a constant node and
the constant value (rc ) be 0.01. Then, the mutant tree will be e : 0.01 − x2 , which
is nothing but a line parallel to the x1 axis that can also classify the given data set
correctly. To illustrate the mutation process when an internal node is involved, let us
consider the solution with tree b (see Fig. 3). In Fig. 3, we have shown the randomly
selected mutation point. Let it be replaced with −; then it would result in tree a, which
can classify the problem accurately. However, if this node were replaced
with ÷, it would result in a new tree f : (x1 ÷ x2) − 0.1, which would predict all the
points of both the classes as belonging to class 2.
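A sketch of this primitive mutation, under the same illustrative nested-list encoding (the helper names and parameter defaults are ours, not the chapter's):

```python
import random

FEATURES = ['x1', 'x2']
OPERATORS = ['+', '-', '*', '/']

def mutate_node(node, p_variable=0.5, rng=random):
    """Mutate one node: an internal node gets a new operator; a terminal node
    becomes a feature with probability p_variable, otherwise a random
    constant drawn from Rc = [-2, 2]."""
    if isinstance(node, list):                 # internal node: swap the operator
        return [rng.choice(OPERATORS), node[1], node[2]]
    if rng.random() < p_variable:              # terminal -> random feature
        return rng.choice(FEATURES)
    return rng.uniform(-2.0, 2.0)              # terminal -> random constant in Rc

# Analogous to the example in the text: mutating the x1 node of d : x1 - x2
# into a constant gives a tree of the same shape as e : 0.01 - x2.
tree_d = ['-', 'x1', 'x2']
tree_e = ['-', mutate_node('x1', p_variable=0.0, rng=random.Random(0)), 'x2']
print(tree_e[0], tree_e[2])   # the middle entry is a random constant in [-2, 2]
```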
After defining all necessary components, we can now obtain a GP-based clas-
sifier following the steps shown in Algorithm 1. In step 22 of Algorithm 1, it per-
forms the environmental selection, which is a necessary component of any GP-based
approach. The environmental selection strategy that we have adopted here is naive;
several criteria and strategies can be adopted for environmental selection. Note that we also
select some solutions to perform crossover and mutation, respectively, in Steps 7 and
15 of Algorithm 1. This selection strategy is called mating selection. Instead of
randomly selecting these solutions, often some specific criteria are used. For example,
since we are maximizing the classification accuracy, a straightforward scheme could
be to select solutions using Roulette wheel selection. Another important step, where
it is possible to employ a different strategy, is the initialization of the trees. The ramped
half-and-half method [15] is one of the most frequently used methods to initialize GP trees.
We have already mentioned that this algorithm does not explicitly perform any
FS and concentrates only on classification accuracy. However, it may implicitly
perform FS. To illustrate this, assume that the evolution of this system generates tree
e : 0.01 − x2 (we have already mentioned how it can be generated while illustrating
the mutation operation). Tree e will have the highest accuracy on the training data,
though it uses only one feature (x2 ). In this example, the scheme selected only one
feature (x2 ), which has sufficient discriminating power.
There have been several attempts to design classifiers using GP [4, 13, 14, 19, 21–
23]. A detailed survey on this topic is presented in [10]. Some of these methods do
not explicitly pay attention to FS [13, 21], while others explicitly try to find useful
features to design the classifiers [22, 23]. Some of these GP-based approaches use
the ensemble concept [19, 23]. In this section, we discuss three existing GP approaches.
The first approach [21] introduces a classification strategy employing a single-objective
GP-based search technique. The second approach [22] performs both
the classification and FS tasks in an integrated manner. These two schemes use the same
multi-tree representation of solutions. On the contrary, the third method [23] decom-
poses a c-class classification problem into c binary classification problems, and then,
performs simultaneous classification and FS. Nevertheless, unlike the first two meth-
ods [21, 22], it [23] uses ensembles of genetic programs (binary classifiers) and a
negative voting scheme.
are required to assign a class to x. Moreover, the trees are initialized using the ramped
half-and-half method with F = {+, −, ×, ÷} and T = {feature variables, R}. Here, R
is the set of all possible values in [0, 10].
If a tree predicts x accurately, the system considers that the accuracy of the tree for x
is one, otherwise zero. The normalized accuracy of all trees of a particular solution
for a given set of data points is considered the fitness of the solution.
Unfitness: Unlike the concept of fitness, unfitness is a less frequently used strategy.
However, several works [21–23] have successfully used it to attain better perfor-
mance. When fitness is used in the selection process, more fit solutions, i.e., solu-
tions with higher fitness values get selected. On the contrary, if unfitness is used in
the selection process, then more unfit solutions, i.e., solutions with higher unfitness
values get selected. Unfitness-based selection helps unfit solutions to become more
fit. For a given problem, a simple choice for the unfitness of a tree T_i is the number of
training data points that are wrongly classified by T_i. Let u_j be the unfitness of
the jth tree; then this can easily be used to select solutions for genetic operations
using roulette wheel selection with probability u_i / Σ_{j=1}^{c} u_j.
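Unfitness-proportional selection can be sketched with a standard roulette wheel; the code below (illustrative names and numbers) picks tree j with probability u_j / Σ u_j, so a tree with zero unfitness is never picked:

```python
import random

def roulette(unfitness, rng=random):
    """Pick index j with probability unfitness[j] / sum(unfitness)."""
    total = sum(unfitness)
    r = rng.uniform(0.0, total)
    acc = 0.0
    for j, u in enumerate(unfitness):
        acc += u
        if r <= acc:
            return j
    return len(unfitness) - 1

u = [5, 1, 0, 4]               # misclassification counts of four trees
counts = [0] * len(u)
rng = random.Random(42)
for _ in range(10000):
    counts[roulette(u, rng)] += 1
# The tree with unfitness 5 is chosen about half the time; the tree with
# unfitness 0 is never chosen, so selection pressure goes to the unfit trees.
print(counts[2])   # 0
```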
Mutation: For mutation, a random solution S is selected from the population. Then,
with roulette wheel selection on the unfitness values of its trees, a tree T_k^S is selected.
Next, a random node n_{T_k^S} of T_k^S is selected, where the probability of selecting a
function node is p_m^f and that of a terminal node is (1 − p_m^f). If n_{T_k^S} is a function
node, it is replaced with a randomly chosen function node. Otherwise, it is replaced
with a randomly chosen terminal node. After that, both the mutated and the original
trees are evaluated with 50% of the samples of the kth class. Let the fitness values
of the two trees be f_m and f_o respectively. If f_m ≥ f_o, the mutated tree is accepted;
otherwise, it is retained with a probability of 0.5. If f_m = f_o, then both trees are
evaluated on the remaining 50% of the samples of the kth class to select one of the two.
Improving Performance with OR-ing: At the end of the evolution process, the
best classifier is selected using accuracy. If more than one classifier has the same
accuracy, the solution with the smallest number of nodes is selected. However, it
may happen that there are two solutions (classifiers) S1 = {T_i^{S1}} and S2 = {T_i^{S2}}, (i =
1, 2, . . . , c), such that T_k^{S1} models well a particular area of the feature space,
The GP-based classifier discussed in the last section does not explicitly do FS while
designing the classifier. The methodology that we discuss now integrates FS into the
classifier design task. This approach [22] simultaneously selects a good subset of
features and constructs a classifier using the selected features. The authors propose
two new crossover operators, which carry out the FS process. Like the previously
discussed work, here also a multi-tree representation of a classifier is used.
Note that p_{r_f} decreases linearly with an increase of r_f, and Σ_{r_f=1}^{n_f} p_{r_f} = 1. After
this, r_f features are randomly selected from the entire set of features, and then a
single solution is initialized with this selected subset of features.
Fitness: Let f_raw be the raw fitness of a solution, which is the accuracy
achieved by the solution on the entire training data set. Then, the modified fitness
function (f_s) that incorporates the FS task in the fitness is defined as follows.

f_s = f_raw (1 + a e^{−r_f / n_f}),   (2)

where

a = 2β (1 − g_current / g_maximum).   (3)

Here β is a constant, g_current is the current generation number, and g_maximum is the
maximum number of generations of the evolution of GP. Note that the factor e^{−r_f/n_f}
decreases exponentially with an increase in r_f. Consequently, if two classifiers
have the same value of f_raw, the one using fewer features attains a higher fitness
value. Moreover, since a decreases linearly with the generations, the extra reward for
using fewer features is largest early in the evolution, which is when the FS task is
emphasized most.
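Eqs. (2) and (3) can be transcribed directly; in the sketch below the parameter values (β = 0.5, n_f = 10, and so on) are illustrative, not taken from [22]:

```python
import math

def modified_fitness(f_raw, r_f, n_f, g_current, g_maximum, beta=0.5):
    """Eqs. (2) and (3): raw accuracy boosted by a reward for using
    fewer features; the reward shrinks linearly over the generations."""
    a = 2.0 * beta * (1.0 - g_current / g_maximum)   # Eq. (3)
    return f_raw * (1.0 + a * math.exp(-r_f / n_f))  # Eq. (2)

# Two classifiers with the same raw accuracy (0.9): the one using fewer
# features gets the higher fitness, and the gap narrows in later generations.
early_few  = modified_fitness(0.9, r_f=2, n_f=10, g_current=0,  g_maximum=100)
early_many = modified_fitness(0.9, r_f=9, n_f=10, g_current=0,  g_maximum=100)
late_few   = modified_fitness(0.9, r_f=2, n_f=10, g_current=90, g_maximum=100)
late_many  = modified_fitness(0.9, r_f=9, n_f=10, g_current=90, g_maximum=100)
print(early_few > early_many)                             # True
print((early_few - early_many) > (late_few - late_many))  # True
```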
where, if S1 uses the kth feature, then v_{S1,k} = 1, otherwise v_{S1,k} = 0; and if the jth
solution of T_{S2} uses the kth feature, then v_{jk} = 1, otherwise v_{jk} = 0. After that, the
jth classifier of the second set is selected with probability p_j^second as follows.

p_j^second = (f_{s_j} + β s_j) / Σ_{k=1}^{τ} (f_{s_k} + β s_k),   (5)

where f_{s_k} denotes the fitness (f_s) of the kth solution of T_{S2} and β is a constant.
In this approach, FS is usually accomplished in the first few generations. Note that,
unlike the method that we have discussed in the previous section, this scheme does
not use step-wise learning, because, at the beginning of the evolutionary process,
step-wise learning would use a small subset of training samples, and being small in
size, it may not be very helpful for the FS process.
Unlike the previous two approaches [21, 22], this approach [23] divides a c-class
classification problem into c binary classification problems. Then, it evolves c popu-
lations of genetic programs to find out c sets of ensembles, which respectively solve
these c binary classification problems. To solve each binary classification problem,
and an ideal feature vector which possesses the corresponding ideal values for the ith binary
classification problem. Specifically, this ideal feature value is unity if the associated
training point belongs to the ith class, and zero otherwise [11]. Clearly, a higher
value of |C_f^i| designates a stronger discriminative power of feature f for the ith
binary classification problem.
For the ith binary classification problem, we intend to incorporate only the features
with high discriminating capability from the set of all features F_all. To attain
this, we assign a fitness and an unfitness value to each of the features, which change
with the evolution. Throughout the process, the features are selected to be added
in a tree with probabilities proportional to the current fitness values of the features.
On the other hand, features are selected to be replaced (in case of mutation) with
probabilities proportional to their current unfitness values. The fitness and unfitness
of features corresponding to the ith binary classification problem, during the first
50% evaluations, are defined respectively as in Eqs. (6) and (7).
F_fitness^{0%,i}(f, i) = (|C_f^i| / C_max^i)^2, if |C_f^i| / C_max^i > 0.3; and 0, otherwise.   (6)

F_unfitness^{0%,i}(f, i) = 1.0 − F_fitness^{0%,i}(f, i),   (7)

where C_max^i = max_{f ∈ F_all} |C_f^i|. Basically, to eliminate the impact of features with poor
discriminating ability during the initial evolution process, Eq. (6) sets their fitness
values to zero. Let us assume that F_eval=0%,i ⊆ F_all is the set of features with nonzero
fitness values.
After completion of 50% evaluations, let the feature subset that is present in the
population of the ith binary classification task be Feval=50%,i . After 50% evaluations,
the fitness of all features in Fall − Feval=50%,i are set to zero. The assumption behind it
is that after 50% evaluations the features which could help the ith binary classification
task, would be used by the collection of trees of the corresponding population. Now
the fitness and unfitness values of all features in Feval=50%,i are changed respectively
according to Eqs. (8) and (9).
F_fitness^{50%,i}(f, i) = |C_f^i| / Σ_{g ∈ F_eval=50%,i, g ≠ f} |ρ_fg|, if f ∈ F_eval=50%,i; and 0, otherwise.   (8)

F_unfitness^{50%,i}(f, i) = exp( −(F_fitness^{50%,i}(f, i) − min_f {F_fitness^{50%,i}}) / (max_f {F_fitness^{50%,i}} − min_f {F_fitness^{50%,i}}) ),   (9)
where ρ_fg denotes the Pearson correlation between feature f and feature g. This
helps to select features with high relevance but a reduced level of redundancy,
i.e., to achieve maximum relevance and minimum redundancy (MRMR).
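The relevance-to-redundancy ratio of Eq. (8) can be sketched as follows; the toy data and the helper names are ours, and the sketch assumes the denominator (the summed absolute correlations) is nonzero:

```python
import math

def pearson(u, v):
    """Pearson correlation of two equal-length value lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

def mrmr_fitness(relevance, columns):
    """relevance: {feature: |C_f^i|}; columns: {feature: sample values}.
    Fitness = relevance / summed |correlation| with the other features."""
    fitness = {}
    for f in relevance:
        redundancy = sum(abs(pearson(columns[f], columns[g]))
                         for g in relevance if g != f)
        fitness[f] = relevance[f] / redundancy   # assumes redundancy > 0
    return fitness

cols = {
    'f1': [1.0, 2.0, 3.0, 4.0],
    'f2': [2.1, 3.9, 6.0, 8.2],    # nearly a rescaled copy of f1 (redundant)
    'f3': [1.0, -1.0, -1.0, 1.0],  # uncorrelated with f1
}
fit = mrmr_fitness({'f1': 0.8, 'f2': 0.8, 'f3': 0.8}, cols)
# With equal relevance, the feature least correlated with the rest wins.
print(fit['f3'] > fit['f1'] and fit['f3'] > fit['f2'])   # True
```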
After 75% function evaluations, another snapshot of the population is taken. Let
the existing features for the ith population be Feval=75%,i ⊆ Feval=50%,i . Then, the
fitness and unfitness values of the features in Feval=75%,i are altered respectively as
in Eqs. (10) and (11).
F_fitness^{75%,i}(f, i) = F_fitness^{0%,i}(f, i), if f ∈ F_eval=75%,i; and 0, otherwise.   (10)

F_unfitness^{75%,i}(f, i) = 1.0 − F_fitness^{75%,i}(f, i)   (11)
Crossover: This scheme uses a crossover with male and female differentiation and
tries to generate an offspring near the female (acceptor) parent. For this, a part of the
male (donor) parent replaces a part of the female parent. At first, from each parent a
random point (node) is chosen. The probabilities of selecting a terminal node and a
non-terminal node are respectively p_t^c and (1 − p_t^c). After that, the subtree rooted
at the node selected from the female tree is replaced by the subtree rooted at the
selected node of the male tree.
Mutation: To mutate a tree, the following operations are performed on the tree: (i)
Each constant node of the tree is replaced by a randomly generated constant node with
probability p_c^m. (ii) Each function node is visited and the function there is replaced by
a randomly selected function with probability p_f^m. (iii) Only one feature node of the
tree is replaced by a randomly selected feature node. Among the feature nodes, the
mutation point is selected with a probability proportional to the unfitness of the features
present in the selected tree. Moreover, the feature that is used to replace the old
feature is also selected with a probability proportional to the fitness values of the features.
Voting Strategy: After learning is over, a special negative voting strategy is used.
At the end of the evolution, c ensembles of genetic programs are obtained: A =
{A1 , A2 , . . . , Ac }; ∀i, 1 ≤ |Ai | ≤ Nmax , where c is the number of classes, and Ai is
the ensemble corresponding to the ith class. To determine whether a point p belongs
to the mth class or not, a measure called net belongingness, B_m^net(p), corresponding
to the mth class for p is calculated as follows.
B_m^net(p) = (1/2) [ (1/|A_m|) Σ_{i=1}^{|A_m|} B_m^i(p) + 1.0 ].   (12)
where, in Eq. (13), FP_m^i and FN_m^i respectively denote the number of FPs and FNs
made by the ith individual of A_m on the training data set; FP_m^max and FN_m^max
respectively denote the maximum possible FP and the maximum possible FN for the
mth class (determined using the training data set); and A_m^i(p) is the output from the ith
individual of A_m for p. Finally, p is assigned the class k if B_k^net = max_{m=1,...,c} {B_m^net}. Note
that net belongingness lies in [0, 1], and a large value of this measure indicates a
higher chance of belonging to the corresponding class.
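Eq. (12) and the final class assignment can be sketched as below; since Eq. (13) is not reproduced here, the member beliefs B_m^i(p) are made-up numbers assumed to lie in [−1, 1]:

```python
def net_belongingness(member_beliefs):
    """Eq. (12): ensemble-average belief, rescaled from [-1, 1] to [0, 1]."""
    avg = sum(member_beliefs) / len(member_beliefs)
    return 0.5 * (avg + 1.0)

def assign_class(beliefs_per_class):
    """beliefs_per_class[m] holds the member beliefs of ensemble A_m for p.
    Returns (winning class index, list of net belongingness values)."""
    b_net = [net_belongingness(b) for b in beliefs_per_class]
    return max(range(len(b_net)), key=lambda m: b_net[m]), b_net

# Three classes; ensembles of different sizes (1 <= |A_m| <= N_max).
cls, b = assign_class([[0.9, 0.7, 0.8], [-0.2, 0.1], [-0.9, -0.6, -0.7, -0.8]])
print(cls)                                # 0
print(all(0.0 <= v <= 1.0 for v in b))    # True: net belongingness is in [0, 1]
```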
3 Some Remarks
While designing a GP-based system for classification and FS, several important
issues should be kept in mind. In this section, we discuss some of these salient issues
in detail, along with some ways to address them. Note that most of these issues depend
not only on the GP-based strategy, but also on the data set, i.e., on the number of
features, the number of classes, the distribution of the data in different (majority
and minority) classes, the feature-to-sample ratio, the distribution of the data in the
feature space, etc. Hence, there is no universal rule to handle all of them.
While designing a GP-based system for multi-class classification, the first problem
is how to handle the multi-class nature of the data. Primarily there are two ways. The
more frequently used scheme is to decompose a c-class problem into c binary classification
problems and then develop separate GP-based systems for each of the binary
classification problems. Every candidate solution of the ith binary classification system
may have one tree, where a positive output from the tree for a given point would
indicate that the point belongs to the ith class. To solve every binary classification
problem, one may choose to use the best binary classifier (genetic program) found in
every binary system, or one may choose to use a set of binary classifiers (ensemble)
obtained from these binary systems. If an ensemble-based strategy is used, there needs
to be a voting (aggregation) scheme. The voting scheme may become more effective
if a weighted and/or negative voting approach is judiciously designed. Again, to
decide the final class label, the outputs (decisions) from every binary system need to
be accumulated and processed. Here also, some (weighted, negative) voting scheme,
or some especially designed decision making system can be developed. Another
comparatively less frequently used strategy is as follows. Every solution consists of
a tree, where the output of the tree is divided into c windows using (c − 1) thresholds.
type data set. However, if we try to learn this data set with a multi-layer perceptron
type classifier, due to the “XOR” type pattern of the data, it may not be very easy
to learn. Though this example may portray GP as a powerful tool, this specialization
capability of GP may sometimes lead to poor generalization. To illustrate a case of
poor generalization with an example, let us consider another artificial
binary classification data set shown in Fig. 7. There, the class 1 points are denoted
using “+” and class 2 points are denoted using “◦”. The class 1 points are uniformly
distributed inside the circle C3 : (x1 − 1)2 + (x2 − 1)2 = 0.75 and the class 2 points
are uniformly distributed inside the circle C6 : (x1 + 1)2 + (x2 − 1)2 = 0.75. For
this data set also, the binary classifier with tree b can accurately classify all the data
points. But, for the points denoted by “∗” in Fig. 7, the classifier would predict class
1. For these points the classifier should not make any decision. In this example, GP
ends up with a poor generalization.
Bloating is another important issue that needs to be addressed to develop a well-performing
GP-based system. Though there are efficient methods in the literature
[20, 26, 29], the following two bloat control strategies are quite straightforward.
First, if a single objective approach is used, the size of the tree can be incorporated
with the objective function as a penalty factor such that the penalty is minimized.
Second, if a multi-objective approach is used, the size of the tree can be added as an
additional objective.
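The first strategy, adding a size penalty to a maximized objective, can be sketched as follows; the penalty weight and the nested-list tree encoding are our own illustrative choices:

```python
def tree_size(tree):
    """Count the nodes of a nested-list tree [op, left, right]."""
    if isinstance(tree, list):
        return 1 + tree_size(tree[1]) + tree_size(tree[2])
    return 1

def penalized_fitness(accuracy, tree, alpha=0.001):
    """Maximized objective minus a small multiple of the tree size."""
    return accuracy - alpha * tree_size(tree)

tree_d = ['-', 'x1', 'x2']              # 3 nodes
tree_a = ['-', ['-', 'x1', 'x2'], 0.1]  # 5 nodes
# With equal accuracy, the smaller tree d wins, as argued for tree d earlier.
print(penalized_fitness(1.0, tree_d) > penalized_fitness(1.0, tree_a))   # True
```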
As we have already discussed, GP has an intrinsic FS capability. To enforce
FS further, whenever a feature is selected to be included in any tree, the model
should try to select the best possible feature for that scenario. In a similar fashion,
whenever a feature node is removed from any classifier, it should not be an important
feature under that scenario. Note that, if an ensemble-based strategy is used, for
enhanced performance the members of the ensemble should be diverse but accurate.
To exploit this attribute, the model should try to develop member classifiers, i.e.,
genetic programs, diverse in terms of the features they use. Again, for a single binary
classifier, the features used by it should have a controlled level of redundancy but
the features should possess maximum relevance for the intended task.
Any GP-based system requires a set of parameters, and the performance of the system
is dependent on their chosen values. To discuss the parameter dependency, we select
the work proposed in [23]. Note that this is one of the recent works that performs
simultaneous FS and classification. We have chosen this method because it has been
empirically shown to be effective on a wide range of data sets with a large number
of classes, a large number of features, and a large feature-to-sample ratio.
Table 1 shows the parameters and their values used in [23]. We consider the CLL-SUB-
111 data set used in [23], which is a three-class data set with 111 data points and
11,340 features, i.e., the feature-to-sample ratio is 102.16. We have repeated ten-fold
cross validation of the method proposed in [23] ten times. Before training, we
have normalized the training data using Z-score normalization, and based on the means and the
standard deviations of the features in the training data, we also apply Z-score normalization
to the test data.
Except for Nmax and Nmin, every other parameter used in [23] is a common parameter
required for any GP-based method. While most GP-based approaches require
a single parameter called population size, the method in [23] requires two special
parameters, Nmax and Nmin, to bound the dynamic population size. The performance of
any GP-based system is largely dependent on two parameters: the number of function
evaluations (Feval) and the population size. Therefore, we choose to show the impact
of these parameters on the performance of this method. To attain this, we have
repeated our experiment with the CLL-SUB-111 data set seven times, each with
a different parameter setting. The parameter settings used and their corresponding
results are shown in Table 2. These results demonstrate that with an increase in Feval,
the accuracy increases, and with an increase in the population size (bounded by Nmax
and Nmin), the number of selected features increases. As indicated by the average tree
size, in every case the method could find somewhat small trees (binary classifiers).
To illustrate this with an example, in Table 3 we have provided six equations, which
were generated with the parameter setting S-I. They are the first two (as they appear
in the ensemble) binary classifiers of the final populations corresponding to the
three binary classification problems associated with the first fold of the first 10-fold
cross validation. We have also provided their objective values in that table, which
indicate that all of the binary classifiers could produce 100% training accuracy for
the corresponding binary classification problem. It is noteworthy that these six rules
are simple, concise, and human interpretable.
Table 3 The first two binary classifiers obtained corresponding to three binary classification prob-
lems associated with the first fold of the first 10-fold cross validation
Class Equation Objective values
Class 1 (x5261 − 1.3828) (0.0, 0.0, 2.0)
(0.8411 + x6911 ) (0.0, 0.0, 2.0)
Class 2 (x8962 + x9153 ) (0.0, 0.0, 2.0)
(−1.4391 + x5261 ) (0.0, 0.0, 2.0)
Class 3 (x8373 − 0.7756) (0.0, 0.0, 2.0)
(x8373 − 0.5976) (0.0, 0.0, 2.0)
4 Conclusion
We have briefly reviewed some of the GP-based approaches to classifier design, some
of which do FS. In this context, three approaches are discussed in reasonable detail
with a view to providing a comprehensive understanding of various aspects related to
fitness, unfitness, selection, and genetic operations. We have also briefly discussed the
issues related to the choice of parameters and protocols, which can significantly alter
the performance of GP-based classifiers. However, there are important issues that
we have not discussed. For example, how to efficiently design GP-based classifiers
along with feature selection in a big data environment? How to enhance the readability
of GP classifiers? How to deal with non-numeric data along with numeric ones
for designing GP-based classifiers? These are important issues that need extensive
investigation.
References
15. Koza, J.R.: Genetic Programming: On the Programming of Computers by Means of Natural
Selection. MIT Press, Cambridge (1992)
16. Koza, J.R.: Genetic Programming II: Automatic Discovery of Reusable Programs. MIT Press,
Cambridge (1994)
17. Koza, J.R., Bennett III, F.H., Stiffelman, O.: Genetic Programming as a Darwinian Invention
Machine. Springer, Berlin (1999)
18. Koza, J.R., Keane, M.A., Streeter, M.J., Mydlowec, W., Lanza, G., Yu, J.: Genetic Program-
ming IV: Routine Human-Competitive Machine Intelligence, vol. 5. Springer Science+Business
Media (2007)
19. Liu, K.H., Xu, C.G.: A genetic programming-based approach to the classification of multiclass
microarray datasets. Bioinformatics 25(3), 331–337 (2009)
20. Luke, S., Panait, L.: A comparison of bloat control methods for genetic programming. Evol.
Comput. 14(3), 309–344 (2006)
21. Muni, D.P., Pal, N.R., Das, J.: A novel approach to design classifiers using genetic program-
ming. IEEE Trans. Evol. Comput. 8(2), 183–196 (2004)
22. Muni, D.P., Pal, N.R., Das, J.: Genetic programming for simultaneous feature selection and
classifier design. IEEE Trans. Syst. Man Cybern. Part B: Cybern. 36(1), 106–117 (2006)
23. Nag, K., Pal, N.: A multiobjective genetic programming-based ensemble for simultaneous
feature selection and classification. IEEE Trans. Cybern. 99, 1–1 (2015)
24. Nag, K., Pal, T., Pal, N.: ASMiGA: an archive-based steady-state micro genetic algorithm.
IEEE Trans. Cybern. 45(1), 40–52 (2015)
25. Nag, K., Pal, T.: A new archive based steady state genetic algorithm. In: 2012 IEEE Congress
on Evolutionary Computation (CEC), pp. 1–7. IEEE (2012)
26. Poli, R.: A simple but theoretically-motivated method to control bloat in genetic programming.
In: Genetic Programming, pp. 204–217. Springer, Berlin (2003)
27. Wang, P., Emmerich, M., Li, R., Tang, K., Baeck, T., Yao, X.: Convex hull-based multi-
objective genetic programming for maximizing receiver operating characteristic performance.
IEEE Trans. Evol. Comput. 99, 1–1 (2014)
28. Wang, P., Tang, K., Weise, T., Tsang, E., Yao, X.: Multiobjective genetic programming for
maximizing ROC performance. Neurocomputing 125, 102–118 (2014)
29. Whigham, P.A., Dick, G.: Implicitly controlling bloat in genetic programming. IEEE Trans.
Evol. Comput. 14(2), 173–190 (2010)