Computational Intelligence in Optimization
Adaptation, Learning, and Optimization, Volume 7
Series Editors-in-Chief
Meng-Hiot Lim
Nanyang Technological University, Singapore
E-mail: [email protected]
Yew-Soon Ong
Nanyang Technological University, Singapore
E-mail: [email protected]
Computational Intelligence in Optimization
Applications and Implementations
Dr. Yoel Tenne
Department of Mechanical Engineering and Science, Faculty of Engineering,
Kyoto University, Yoshida-honmachi,
Sakyo-ku, Kyoto 606-8501, Japan
E-mail: [email protected]
Formerly: School of Aerospace Mechanical and Mechatronic Engineering,
Sydney University, NSW 2006, Australia
DOI 10.1007/978-3-642-12775-5
© 2010 Springer-Verlag Berlin Heidelberg
This work is subject to copyright. All rights are reserved, whether the whole or part
of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilm or in any other
way, and storage in data banks. Duplication of this publication or parts thereof is
permitted only under the provisions of the German Copyright Law of September 9,
1965, in its current version, and permission for use must always be obtained from
Springer. Violations are liable to prosecution under the German Copyright Law.
The use of general descriptive names, registered names, trademarks, etc. in this
publication does not imply, even in the absence of a specific statement, that such
names are exempt from the relevant protective laws and regulations and therefore
free for general use.
Typeset & Cover Design: Scientific Publishing Services Pvt. Ltd., Chennai, India.
To our families for their love and support.
Preface
Classical optimizers may perform poorly or even fail to produce any
improvement over the starting vector in the face of such challenges. This
has motivated researchers to explore the use of computational intelligence (CI)
to augment classical methods in tackling such challenging problems. Such
methods include: a) population-based search methods such as evolutionary
algorithms and particle swarm optimization and b) non-linear mapping and
knowledge-embedding approaches such as artificial neural networks and fuzzy
logic, to name a few. Such approaches have been shown to perform well in
challenging settings. Specifically, CI methods are powerful tools which offer several
potential benefits, such as: a) robustness (they impose little or no requirements
on the objective function); b) versatility (they handle highly non-linear mappings);
c) self-adaptation to improve performance; and d) operation in parallel (making
it easy to decompose complex tasks). However, the successful application of
CI methods to real-world problems is not straightforward and requires both
expert knowledge and trial-and-error experiments. As such, the goal of this
volume is to survey a wide range of studies where CI has been successfully ap-
plied to challenging real-world optimization problems, while highlighting the
insights researchers have obtained. Broadly, the studies in this volume focus
on four main disciplines: continuous optimization, classification, scheduling
and hardware implementations.
For continuous optimization, Neto et al. study the use of artificial neural
networks (ANNs) and Heuristic Rules for solving large scale optimization
problems. They focus on a recurrent ANN to solve a quadratic program-
ming problem and propose several techniques to accelerate convergence of
the algorithm. Their method is more efficient than one using an ANN only.
Starzyk et al. propose a direct-search optimization algorithm which uses re-
inforcement learning, resulting in an algorithm which ‘learns’ the best path
during the search. The algorithm weights past steps based on their success
to yield a new candidate search step. They benchmark their algorithm with
several mathematical test functions and apply it to the training of a multi-layer
perceptron neural network for image recognition. Ventresca et al. use the op-
position sampling approach to decrease the number of function evaluations.
The approach attempts to sample the function in a subspace generated by the
‘opposites’ of an existing population of candidates. They apply their method
to differential evolution and incremental learning and show that the opposi-
tion method improves performance over baseline variants. Bazan studies an
optimization algorithm for problems where the objective function requires
large computational resources. His proposed algorithm uses locally regular-
ized approximations of the objective function using radial basis functions. He
provides convergence proofs and formulates a framework which can be ap-
plied to other algorithms such as Gauss-Seidel or the Conjugate Directions.
Ruiz-Torrubiano et al. study hybrid methods for solving large scale opti-
mization problems with cardinality constraints, a class of problems arising in
diverse areas such as finance, machine learning and statistical data analysis.
While existing methods (such as branch-and-bound) can provide exact solutions,
they require large resources. As such, the study focuses on methods which can
efficiently identify approximate solutions while requiring far fewer
computational resources. For problems where it is expensive to evaluate the ob-
jective function, Jayadeva et al. propose using a support-vector machine to
predict the location of yet undiscovered optima. Their framework can be
applied to problems where little or no a-priori information is available on
the objective function, as the algorithm ‘learns’ during the search process.
Benchmarks show their method can outperform existing methods such as par-
ticle swarm optimization or genetic algorithms. Vouchkov and Keane study
multi-objective optimization problems using surrogate-models. They inves-
tigate how to efficiently update the surrogates under a small optimization
‘budget’ and compare different updating strategies. They also show that us-
ing a number of surrogate models can improve the optimization search and
that the size of the ensemble should increase with the problem dimension.
Others study agent-based algorithms, that is, algorithms where the optimization is
performed by agents which co-operate during the search. Dreżewski and Siwik
review agent-based co-evolutionary algorithms for multi-objective problems.
Such algorithms combine co-evolution (multiple species) with the agent ap-
proach (interaction). They review and compare existing methods and bench-
mark them over a range of test problems. Results show the agent-based co-
evolutionary algorithms can perform equally well and even surpass some of
the best existing multi-objective evolutionary algorithms. Salhi and Töreyen
propose a multi-agent algorithm based on game theory. Their framework uses
multiple solvers (agents) which compete over available resources and their
algorithm identifies the most successful solver. In the spirit of game theory,
successful solvers are rewarded by increasing their computing resources and
vice versa. Test results show the framework provides a better final solution
when compared to using a single solver.
For applications in classification, Arana-Daniel et al. use Clifford algebra
to generalize support vector machines (SVMs) for classification (with an ex-
tension to regression). They represent input data as a multivector and use
a single Clifford kernel for multi-class problems. This approach significantly
reduces the computational complexity involved in training the SVM. Tests
using real-world applications of signal processing and computer vision show
the merit of their approach. Luukka and Lampinen propose a classification
method which combines principal component analysis to pre-process the data
followed by optimization of the classifier parameters using a differential evo-
lution algorithm. Specifically, they optimize the class vectors used by the
classifier and the power of the distance metric. Test results using real-world
data sets show the proposed approach performs as well as or better than some of
the best existing classifiers. Lastly in this category, Zhang et al. study the
problem of feature selection in high-dimensional problems. They focus on the
GA-SVM approach, where a genetic algorithm (GA) optimizes the param-
eters of the SVM (the GA uses the SVM output as the objective values).
The problem requires large computational resources which make it difficult
to apply to large or high-dimensional sets. As such they propose several mea-
sures such as parallelization, neighbour search and caching to accelerate the
search. Test results show their approach can reduce the computational cost
of training an SVM classifier.
Two studies focus on difficult scheduling problems. First, Pieters studies
the problem of railway timetable design, an NP-hard scheduling
problem with additional challenging features, such as being reactive and dy-
namic. He studies solving the problem with Symbiotic Networks, a class of
neural networks inspired by the symbiosis phenomenon in nature, in which the
network uses ‘agents’ to adapt itself to the problem. Test results show the
Symbiotic Network can successfully handle this complex scheduling problem.
Next, Srivastava et al. propose an approach combining evolutionary algo-
rithms, neural networks and fuzzy logic to solve multiobjective
time-cost trade-off problems. They consider a range of such problems including non-
linear time-cost relationships, constrained resources and project uncertain-
ties. They show the merit of their approach by testing it on a real-world
test case.
Chapter 1
New HIS to Solve LP and QP Optimization and Increase Convergence Speed
Otoni Nóbrega Neto, Ronaldo R.B. de Aquino, and Milde M.S. Lira
Electrical Engineering Department, Federal University of Pernambuco, Brazil
e-mail: [email protected], [email protected]
Abstract. This chapter deals with the study of artificial neural networks (ANNs) and
Heuristic Rules (HR) to solve optimization problems. ANNs were studied as optimiza-
tion tools for solving large-scale problems because this technique has great potential
for hardware VLSI implementation, in which it may be more efficient than traditional
optimization techniques. However, the computational implementation showed that,
although effective, the technique is slow compared with traditional mathematical
methods. In order to make it a fast method, we will show two ways to increase the
speed of convergence of the computational algorithm. For analysis and comparison,
we solved three test cases. This chapter considers recurrent ANNs to solve linear and
quadratic programming problems. These networks are based on the solution of a set
of differential equations that are obtained from a transformation of an augmented
Lagrange energy function. The proposed hybrid systems, combining recurrent ANNs
and HR, presented a reduced computational effort in relation to the one using only
the recurrent ANN.
1.1 Introduction
The early 1980’s were marked by a resurgence of interest in artificial neural net-
works (ANNs). At that time, the development of ANNs had the important charac-
teristic of temporal processing. Many researchers have attributed the resurgence of
research on ANNs in the eighties to the Hopfield model presented in 1982 [1].
This recurrent Hopfield model represented a major advance in the state of
knowledge in the area of neural networks at the time.
Nowadays, it is known that there are two ways of incorporating temporal com-
putation into a neural network: the first is to use a static neural network
to accomplish a dynamical mapping with a short-term memory structure; and the
second is to use internal feedback connections, which may be single- or
multi-loop, in which case the neural network can be fully connected. Artifi-
cial neural networks that have feedback connections in their topology are known as
recurrent neural networks [2]. The theoretical study and applications of recurrent
neural nets were developed in several subsequent works [3, 4, 5, 6, 7, 8, 9]. In fact,
the progress provided by Hopfield's works showed that a value of energy
can be associated with each state of the net and that this energy decreases monoton-
ically as the trajectory is traced within the state-space towards a fixed point. These
fixed points are therefore stable points of energy [10], i.e., the described energy
function behaves as a Lyapunov function for the model described in detail in Hop-
field's works. This brought the subject of stability in recurrent neural nets to
attention. When considering the stability of a non-linear dynamical system, we usually
think of stability in the sense of Lyapunov. The Direct Method of Lyapunov is
broadly used for the stability analysis of linear and non-linear systems, which may be
either time-variant or time-invariant. Therefore, it is directly applicable to the
stability analysis of ANNs [2].
In 1985, Hopfield solved the traveling salesman problem [7], a problem
in combinatorial optimization, using a continuous model of the recurrent neural net-
work as an optimization tool. In 1986, Hopfield proposed a specialized ANN to
solve specific problems of linear programming (LP) [9] based on analog circuits,
studied since 1956 by Insley B. Pyne and presented in [11]. On that occasion, Hop-
field demonstrated that the dynamics of recurrent artificial neural nets are
described by a Lyapunov function and, for this reason, that
this network is stable and that its point of stability is the solution of the problem
for which the ANN was modeled.
In 1987, Kennedy and Chua demonstrated that the ANN proposed
by Hopfield in 1986, although it searches for the minimum of the energy
function, had not been modeled to provide a lower bound, which was attained
only when an operational amplifier of the circuit saturated [12]. Due to this
deficiency, Kennedy and Chua proposed a new circuit for LP problems that also
proved able to solve quadratic programming (QP) problems. This circuit was named
the "canonical non-linear programming circuit" and is based on
the Kuhn-Tucker (KT) conditions [12]. In this kind of ANN-based optimization,
the problem has to be “hard-wired” in the network and the convergence behavior of
the ANN depends greatly on how the cost function is modeled.
Later on, further studies [13, 14] confirmed that, for non-linear programming prob-
lems, the model proposed by Kennedy and Chua [15] completely satisfies the
KT optimality conditions and the penalty method. Besides, under appropri-
ate conditions this net is stable. In spite of the important progress presented in
Kennedy and Chua's studies, a deficiency was observed in the model: the
equilibrium point of the net lies only in the neighborhood of the optimal
point of the original problem, although the distance between the optimal point and the
equilibrium point of the network can be reduced by increasing the penalty parame-
ter s, as in [14] and [16]. Even so, Kennedy and Chua's network is able to solve a
great class of optimization problems with and without constraints. However, when
applicable to many problems as shown in [27, 28, 29, 30]. Here the basis of the
proposed heuristic rules is the dynamical behavior of neural networks. From the
convergence analysis, we identified the parameters and their relationships, which
are then transformed into a set of heuristic rules. We developed an algorithm based
on the heuristic rules and carried out some experiments to evaluate and support the
proposed technique.
In this work, two possible implementations were developed, tested and com-
pared; and a high reduction in computational effort was observed by using the pro-
posed heuristic rules. This reduction is related to the decrease in the number of ODEs
computed during the convergence process. Other possible implementations are also
indicated.
This work is organized as follows: first, we review the two-phase method of
Maa and Shanblatt; next, we present the proposed heuristic rules and show the
solutions of test cases using the previously discussed techniques; then, the simula-
tion results are presented and analyzed; and finally, we draw conclusions about the
proposed work.
where $s$ is a large positive real number and $g_i^{+}(x(t)) = \max\{0,\, g_i(x(t))\}$,
with the simplified notation $g^{+} = [g_1^{+}, \ldots, g_m^{+}]^T$, according to [14].
As the system converges, $x(t) \to \hat{x}$, $s\,g_i^{+}(x(t)) \to \lambda_i$ and $s\,h_j(x(t)) \to \mu_j$,
which are the Lagrange multipliers associated with each corresponding constraint.
Thus an approximation of the Lagrange multipliers is already obtained in the first phase.
The block diagram of a two-phase optimization network is shown in Fig. 1.1.
The dynamics that happen in the first phase are in the time range 0 ≤ t ≤ t1 (t1 is the
time instant when the switch is closed connecting the first phase to the second one).
The network operates according to the following dynamics:
$$\frac{dx}{dt} = -\nabla f(x) - s\left[\sum_{i=1}^{p} \nabla g_i(x)\, g_i^{+}(x) + \sum_{j=1}^{q} \nabla h_j(x)\, h_j(x)\right] \qquad (1.3)$$
In the second phase ($t \ge t_1$) the network begins to shift the directional vector $s\,g_i^{+}(x)$
gradually to $\lambda_i$, and $s\,h_j(x)$ to $\mu_j$. By imposing a small positive real value $\varepsilon$, the update
rates $d\lambda_i/dt$ and $d\mu_j/dt$, given in (1.6) and (1.7) respectively,
are comparatively much slower than that of $dx/dt$ in (1.5). These dynamics can be
approximated by considering $\lambda$ and $\mu$ to be fixed. It can then be seen that (1.5)
seeks a minimum point of the augmented Lagrangian function $L_a(s, x)$:

$$L_a(s, x) = f(x) + \lambda^T g(x) + \mu^T h(x) + \frac{s}{2}\left(\|g^{+}(x)\|^2 + \|h(x)\|^2\right) \qquad (1.4)$$
In the block diagram of Fig. 1.1, the subsystems within the two large rectangles
do not contribute during the first phase ($t \le t_1$); in the second phase, when $t > t_1$,
the dynamics of the network become:

$$\frac{dx}{dt} = -\nabla f(x) - \sum_{i=1}^{p} \nabla g_i(x)\left(s\, g_i^{+}(x) + \lambda_i\right) - \sum_{j=1}^{q} \nabla h_j(x)\left(s\, h_j(x) + \mu_j\right) \qquad (1.5)$$
Fig. 1.1 Block diagram of the dynamical system of the Maa and Shanblatt network
$$\frac{d\lambda_i(t + \Delta t)}{dt} = \varepsilon\, s\, g_i^{+}(x(t)), \quad i = 1, \ldots, p, \qquad (1.6)$$

$$\frac{d\mu_j(t + \Delta t)}{dt} = \varepsilon\, s\, h_j(x(t)), \quad j = 1, \ldots, q. \qquad (1.7)$$
A practical value is $\varepsilon = 1/s$ according to [14], which leaves the network with just one
adjustment parameter. However, using $\varepsilon$ independently of $s$ gives more freedom to
control the dynamics of the network. During the first phase, the Lagrange multipliers
are null, so there is no restriction on the initial value of $x(t)$.
According to the penalty function theorem, the solution achieved in the first
phase is not equivalent to the minimum of the function $f(x)$ unless the penalty
parameter $s$ is infinite. Hence, the second optimization phase is
necessary for any finite value of $s$. The system reaches equilibrium when:
$$g_i^{+} = 0, \qquad h_j = 0, \qquad \text{and} \qquad \nabla f(x) + \sum_{i=1}^{p} \nabla g_i(x)\,\lambda_i(x) + \sum_{j=1}^{q} \nabla h_j(x)\,\mu_j(x) = 0, \qquad (1.8)$$
which is identical to the optimality condition of the KT theorem; thus the equilibrium
point of the two-phase network is precisely a global minimum point of a convex
problem (P).
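As an illustration, the Python sketch below integrates the two-phase dynamics (1.3) and (1.5)-(1.7) with a simple forward-Euler scheme. The problem callables (`grad_f`, `g`, `grad_g`, `h`, `grad_h`), the penalty parameter `s`, the step size `dt` and the switching time `t1` are placeholders to be supplied for a concrete problem; this is a minimal sketch under those assumptions, not the authors' implementation.

```python
import numpy as np

def two_phase_network(grad_f, g, grad_g, h, grad_h, x0,
                      s=100.0, eps=None, dt=1e-3, t1=5.0, t_end=10.0):
    """Forward-Euler sketch of the two-phase network dynamics.

    grad_f(x) -> (n,); g(x) -> (p,); grad_g(x) -> (n, p);
    h(x) -> (q,); grad_h(x) -> (n, q). All callables are problem-specific.
    """
    eps = 1.0 / s if eps is None else eps     # practical choice eps = 1/s [14]
    x = np.asarray(x0, dtype=float).copy()
    lam = np.zeros(g(x).shape)                # multipliers are null in phase 1
    mu = np.zeros(h(x).shape)
    for t in np.arange(0.0, t_end, dt):
        gp = np.maximum(0.0, g(x))            # g_i^+(x) = max{0, g_i(x)}
        if t < t1:                            # first phase: penalty dynamics (1.3)
            dx = -grad_f(x) - s * (grad_g(x) @ gp + grad_h(x) @ h(x))
        else:                                 # second phase: dynamics (1.5)
            dx = (-grad_f(x) - grad_g(x) @ (s * gp + lam)
                             - grad_h(x) @ (s * h(x) + mu))
            lam += dt * eps * s * gp          # slow multiplier update (1.6)
            mu += dt * eps * s * h(x)         # slow multiplier update (1.7)
        x += dt * dx
    return x, lam, mu
```

In the second phase the multipliers evolve at the much slower rate set by ε, reproducing the behavior described around (1.6) and (1.7).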
In [12] it is demonstrated that the Kennedy and Chua network for linear and
quadratic programming problems completely satisfies the KT optimality conditions
and the penalty function method. It is also shown that, under appropriate conditions,
this network is completely stable. Moreover, it is shown that the equilibrium point
lies in the neighborhood of the optimal point of the original problem and that
the distance between them can be made arbitrarily small by selecting a sufficiently
large value of the penalty parameter $s$.
For problems that cannot tolerate a solution in the infeasible region, due to the physi-
cal limits of operational amplifiers, a two-phase optimization network model is pro-
posed. In the second phase, we can obtain both the exact solution of these problems
and the corresponding Lagrange multipliers associated with each constraint.
Trajectories of the state variables for the same system are exemplified graphically
in Fig. 1.3. These trajectories are distinct because the state variables have
different initial states. The dynamics of recurrent ANNs have the same properties and
are therefore similar to the dynamics shown in Fig. 1.3.
Although the Maa and Shanblatt model deals with a continuous-time recur-
rent network, in a computational algorithm the iterations are
calculated in discrete time, since the integration of the differential equations
demands a small, but non-null, step size. Therefore, we have total
control over the course of the iterations of the algorithm in the network.
Detailed observations carried out while testing the algorithm of the Maa and
Shanblatt model showed that the computational convergence is slow and that the
convergence trajectories of recurrent networks in the state-space are smooth and
possibly predictable. We then observed that, under certain conditions, it is not only
possible to estimate a point closer to the minimum point of the energy function of
the network, but also to estimate a point that leaves the initial convergence orbit
and becomes the initial point of a new convergence orbit. This new orbit
has a shorter curvature and, consequently, a smaller Euclidean distance to
the optimal point. In this way, the number of steps needed to compute the convergence
of the algorithm can be reduced and, consequently, so can the time to
compute the equilibrium point of the network.
Fig. 1.3 An illustration of a two-dimensional state (phase) portrait of a dynamical system and the associated vector field
To reach the equilibrium point, we use two methods. In the first one, the point
is calculated from the evolution of the dynamics in the time-space plane (in this
work, we consider only autonomous systems). In the second method, the calcu-
lation is performed by observing the evolution of the variables in the state-space. The
mechanism of these two methods and the way they operate in the proposed HIS are
described as follows.
Fig. 1.4 Dynamical convergences of first order (single-variable systems): graphs of the evolution of the state variable x1(t) in time t, panels (a)-(d)
$$z_N = \frac{b(z - m) - a(z - M)}{M - m} \qquad (1.9)$$
Fig. 1.5 Action of the HIS to calculate a better point in dynamics evolving over time (original dynamic, advanced dynamic, and the point predicted by the heuristic rule (HIS-1))
Region 4 (S4) is similar to the beginning of the convergence shown in Fig. 1.4(a),
while region 5 (S5) describes a behavior close to the curve formed by the end of
the convergence shown in Fig. 1.4(a) and the beginning of the convergence shown
in Fig. 1.4(b). Region 6 (S6) has a convergence similar to that shown in Fig. 1.4(b).
Region 7 (S7) describes dynamics of the type shown in Fig. 1.4(d), and region
8 (S8) describes a behavior close to the curve formed by the end of the convergence
shown in Fig. 1.4(c) and the beginning of the convergence shown in Fig. 1.4(d).
Region 9 (S9) represents a behavior close to that shown in Fig. 1.4(c).
The straightforward regime can be of three types: increasing - the derivative of
the curve is positive and not close to zero, corresponding to region 1 (S1); constant
- the derivative of the curve is approximately zero, corresponding to region 2 (S2);
and decreasing - derivative is negative and not close to zero, corresponding to region
3 (S3). We can note that the regimes described by regions S5 and S8 can be consid-
ered close to the constant straightforward regime. Thus, we modeled the following
heuristic rules:
The actions shown in the rules lead to sub-functions that return a better value for
the next initialization point of the network. The straight-line condition implies
that either the system is converging very slowly or the step size of the integration
algorithm is very small. In this case, the linear function shown in (1.10) can be
applied.
Table 1.1 Description of the actions to be taken due to the heuristic rules according to each
decision region
Having the normalized point P3N estimated by the heuristic rules, we need to
unnormalize it to obtain the P3 value. This value will be used to start the recurrent
network.
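A minimal Python sketch of this normalize-predict-unnormalize step (HIS-1 / TDST) is given below. The linear extrapolation stands in for the rule in (1.10), which is not reproduced in this excerpt; the function names and the target range [a, b] = [0, 1] are illustrative assumptions.

```python
import numpy as np

def normalize(z, m, M, a=0.0, b=1.0):
    """Map z from [m, M] to [a, b], as in (1.9)."""
    return (b * (z - m) - a * (z - M)) / (M - m)

def unnormalize(zn, m, M, a=0.0, b=1.0):
    """Inverse of (1.9): recover z in [m, M] from zn in [a, b]."""
    return (zn * (M - m) + b * m - a * M) / (b - a)

def predict_point_tdst(ts, xs, t_ahead):
    """Estimate a point further along the time dynamics: normalize the
    recent samples of one state variable, extrapolate the trend, and
    unnormalize the predicted point P3N to obtain P3."""
    m, M = xs.min(), xs.max()
    xn = normalize(xs, m, M)                  # normalized trajectory samples
    slope, intercept = np.polyfit(ts, xn, 1)  # linear trend stands in for (1.10)
    p3n = slope * t_ahead + intercept         # normalized predicted point P3N
    return unnormalize(p3n, m, M)             # P3, used to restart the network
```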
$$\vec{v}_2(t) = \bigl(x_1(t-\Delta t) + i\,x_2(t-\Delta t)\bigr) - \bigl(x_1(t-2\Delta t) + i\,x_2(t-2\Delta t)\bigr). \qquad (1.12)$$
From these vectors, we carried out a rotation transformation of the axes using (1.13)
and (1.14), according to Fig. 1.7.
$$\vec{v}_1\,' = \vec{v}_1 \exp(-i\theta_1) \qquad (1.13)$$

$$\vec{v}_2\,' = \vec{v}_2 \exp(-i\theta_1) \qquad (1.14)$$

where $\theta_1$ is the angle of $\vec{v}_1(t)$; and also a translation transformation using:

$$\begin{aligned}
x_1'(t - 2\Delta t) &= -|\vec{v}_1| & x_2'(t - 2\Delta t) &= 0 \\
x_1'(t - \Delta t) &= 0 & x_2'(t - \Delta t) &= 0 \\
x_1'(t) &= x_1(t) - x_1(t - 2\Delta t) & x_2'(t) &= x_2(t) - x_2(t - 2\Delta t)
\end{aligned} \qquad (1.15)$$
The rotation and translation transformations facilitate the analysis of the behavior
of the vector $\vec{v}_2$ in relation to the vector $\vec{v}_1$. Heuristic rules can therefore be applied to
adjust the modulus and angle of the state vectors, yielding a vector $\vec{v}_3\,'$ in
each of the $n-1$ complex planes. Finally, a strategy is created to determine
the final value of the reference variable. An effective strategy is to add to the value of
the reference variable the average of the increments of this variable calculated
in the complex planes.
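The Python sketch below illustrates the rotation and translation of (1.12)-(1.15) using complex arithmetic, together with a toy rule for producing v3'. The gains `gain_mod` and `gain_ang` are illustrative placeholders, not the chapter's actual rule set.

```python
import numpy as np

def rotate_translate(p0, p1, p2):
    """Rotation (1.13)-(1.14) and translation placing p1 at the origin, so
    that v1 = p1 - p0 lies along the real axis. Points are complex numbers
    x1 + 1j*x2, as in (1.12); returns v1', v2' and the rotation angle."""
    v1, v2 = p1 - p0, p2 - p1                 # state vectors v1(t), v2(t)
    theta1 = np.angle(v1)                     # angle of v1(t)
    rot = np.exp(-1j * theta1)                # rotation by -theta1
    return v1 * rot, v2 * rot, theta1

def estimate_p3(p0, p1, p2, gain_mod=2.0, gain_ang=0.5):
    """Apply a toy gain rule in module and angle to obtain v3', then map it
    back by the inverse rotation and translation, yielding the point P3."""
    _, v2p, theta1 = rotate_translate(p0, p1, p2)
    v3p = gain_mod * abs(v2p) * np.exp(1j * gain_ang * np.angle(v2p))
    return p1 + v3p * np.exp(1j * theta1)     # inverse transform back to P3
```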
Fig. 1.8 shows two pictures associated with two examples of sets of heuris-
tic rules that can be used to produce the vector $\vec{v}_3\,'$. In each picture, the point
closest to the left of the circumference is the point $P_0'$, the point $P_1'$ is fixed at the
center of the circumference, and the points marked with tiny circles on the
circumference symbolize several possibilities for the point $P_2'$. Finally, the results
of the heuristic rules, the points $P_3'$, are marked with green circles. In order to obtain the
final point $P_3$, we apply the inverse translation and rotation transformations to the
point $P_3'$, thus generating the appropriate value to initialize the recurrent network.
Fig. 1.9 shows an example of the application of the heuristic rules to estimate a better
point through the dynamics in the state-space. The external curve represents the
dynamics of the recurrent network without the heuristic rules, and the internal one
represents the dynamics using the HIS (ANN and heuristic rules based on TDSS).
We point out that, in the internal curve, the points marked with circles are the
iteration points computed by the network and the points marked with a plus
sign are the points estimated by the heuristic rules ($P_3$).
Fig. 1.8 Variations of the rules applied to points P0, P1, P2 to calculate the estimated point P3 (two panels in the transformed plane x1'(t) × x2'(t))
Fig. 1.9 Graph of the convergence orbit of the state variables x1 and x2: external curve - ANN (original orbit); internal curve - HIS (advanced orbit, showing points computed by the recurrent NN and the point predicted by the heuristic rule (HIS-2))
By iteratively applying the ANN and the HR, the system reduces
the curvature of the orbit in the state-space, jumping from one orbit to another until it
reaches the solution of the problem (the equilibrium point of the recurrent network).
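Schematically, this hybrid loop can be sketched as follows in Python; `network_step` (one integration step of the recurrent network) and `predict_jump` (the TDST or TDSS rules) are placeholders, and the stopping test is simplified.

```python
import numpy as np

def his_minimize(network_step, predict_jump, x0, n_net_steps=50,
                 tol=1e-6, max_outer=1000):
    """Outer loop of the hybrid system: integrate the recurrent network for
    a few steps, then let the heuristic rules jump to a point on a tighter
    convergence orbit and restart the network from there."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_outer):
        trajectory = [x]
        for _ in range(n_net_steps):          # ordinary network iterations
            x = network_step(x)
            trajectory.append(x)
        x_new = predict_jump(trajectory)      # heuristic estimate (TDST/TDSS)
        if np.linalg.norm(x_new - x) < tol:   # equilibrium: orbits coincide
            return x_new
        x = x_new                             # jump to the new orbit
    return x
```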
and a set of elements called arcs, each arc ei j being an ordered pair (i, j) of distinct
nodes i and j. If ei j is an arc, then the node i is called the tail of ei j and the node j
is called the head of ei j . The directed graph shown in Fig. 1.10 is formed of 6 nodes
and 11 arcs.
Fig. 1.10 A directed graph formed of 6 nodes and 11 arcs: e12, e13, e15, e23, e42, e43, e53, e54, e56, e62, e64
The cost of each arc is represented by the vector c, its maximum capacity flow
by the vector b, and the demands by the vector w. If wi ≤ 0, the node is a supplier (source)
and, if wi > 0, the node is a consumer (sink). Suppose, for instance, that we have
wT = [−9 4 17 1 −5 −8]. The matrix H of the network is called the incidence
matrix. More generally, the incidence matrix of a network with n
nodes and m arcs has n rows and m columns. Thus, our matrix H has size 6 × 11 and
is formed as follows:
$$H = \begin{bmatrix}
-1 & -1 & -1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & -1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 1 & 0 & 0 & 1 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & -1 & 0 & 0 & -1 & 1 & 1 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & -1 & 0 & -1 & 0 & -1 \\
0 & 0 & 0 & 0 & 0 & -1 & 0 & 0 & 0 & -1 & 1
\end{bmatrix} \qquad (1.17)$$
Considering that there are no losses in the network, i.e., everything that is produced
is consumed, the sum of all the elements wi of the graph is zero. This condition
makes the rows of the matrix H linearly dependent (LD); in other words, any row
can be obtained as a linear combination of the other rows. To overcome this
problem, we remove one row of the matrix H and one element of the column vector w.
Here, the last row of the matrix H was removed, turning this matrix and the vector
w into a truncated incidence matrix and a truncated vector, according to [32]:
$$H = \begin{bmatrix}
-1 & -1 & -1 & 0 & 0 & 0 & 0 & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & -1 & 1 & 1 & 0 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 1 & 0 & 0 & 1 & 1 & 0 & 0 & 0 \\
0 & 0 & 0 & 0 & -1 & 0 & 0 & -1 & 1 & 1 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & -1 & 0 & -1 & 0 & -1
\end{bmatrix} \qquad (1.18)$$
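For illustration, the incidence matrix (1.17) and its truncated form (1.18) can be built programmatically from the arc list of Fig. 1.10, as in the following Python sketch.

```python
import numpy as np

# Arcs of Fig. 1.10 as (tail, head) pairs, in the column order of (1.17):
# e12, e13, e15, e23, e42, e62, e53, e43, e54, e64, e56.
arcs = [(1, 2), (1, 3), (1, 5), (2, 3), (4, 2), (6, 2),
        (5, 3), (4, 3), (5, 4), (6, 4), (5, 6)]
n_nodes = 6

H = np.zeros((n_nodes, len(arcs)), dtype=int)
for k, (i, j) in enumerate(arcs):
    H[i - 1, k] = -1          # -1 at the tail of arc e_ij
    H[j - 1, k] = 1           # +1 at its head

w = np.array([-9, 4, 17, 1, -5, -8])   # demands; negative entries are sources
assert w.sum() == 0                    # no losses: rows of H are dependent

# Truncated system, as in (1.18): drop the last row of H and of w.
H_t, w_t = H[:-1, :], w[:-1]
```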
Data were obtained from [16]: xmin = [150 100 50]T in MW, xmax = [600 400 200]T
in MW, w = 850 MW and the following costs for the generator units:
1.5 Simulations
The parameters chosen to simulate the problems LP1 and LP2 were: an integration step
size of $10^{-3}$; $s = 100$ for the neural network in both the first and second
phases; and $\varepsilon = 1.1$ for the network in the second phase. The main
results of the LP1 and LP2 simulations are presented in Table 1.2.
Table 1.2 Main results of the LP1 and LP2 simulations

                                                 LP1                       LP2
                                        ANN    HIS-1a   HIS-2b    ANN    HIS-1a   HIS-2b
 No. points by the ANN (Phase 1)       8316      1000     2391    8670     2965     3099
 No. points by the HR (Phase 1)           -       331      795       -      986     1031
 Total no. points (Phase 1)            8316      1331     3186    8670     3951     4130
 Normalized processing time (Phase 1)  1.00      0.12     0.29    1.00     0.33     0.35
 Switch closing time t1 (s)            8.32      1.33     3.19    8.67     3.95     4.13
 No. points by the ANN (Phase 2)       8607      8277     8316    7692     7251     7263
 No. points in both phases            16923      9608    11502   16362    11202    11393
 Initial cost (Phase 1) = f(x(0))   -260.00   -260.00  -260.00    0.00     0.00     0.00
 Final cost (Phase 1) = f(x(t1))    -741.77   -741.82  -741.78   55.67    55.65    55.63
 Final cost (Phase 2) = f(x(tend))  -740.00   -740.00  -740.00   56.00    55.99    56.00

a HIS-1 = ANN and method of Tendency based on the Dynamics in Space-Time (TDST).
b HIS-2 = ANN and method of Tendency based on the Dynamics in State-Space (TDSS).
The results shown in row 5 of Table 1.2 point out that both proposed hybrid
systems (HIS-1 and HIS-2) were able to advance the dynamics of the simulated
linear problems efficiently. This greatly reduces the time needed to process the network
algorithm, since at each integration step n ODEs are solved, where n is the
number of variables in the problem. For instance, for problem LP1, the total
number of points calculated at the end of the first phase by the ANN was
8316, so 33264 ODEs were solved; using the HIS-1, only 1000 points, yielding
4000 ODEs, were necessary to reach the end of the first phase.
In other words, the HIS-1 reduced the computational effort by approximately 88%
compared to the ANN. For the LP2 problem, this rate was approximately
66%. The computational effort reductions for problems LP1 and LP2 when
comparing the HIS-2 to the ANN were 71% and 64%, respectively.
Figs. 1.11-1.13 present the simulation results for the LP1 problem and
Figs. 1.14-1.16 for the LP2 problem.
The parameters chosen to simulate the QP problem were: an integration step size of
$10^{-2}$ and $s = 50$ for the neural network in the first phase. The initial
condition used was $x(0) = [400\ 300\ 150]^T$.
Fig. 1.11 Dynamics of the problem LP1 obtained by the ANN with the initial condition x(0) = [10 10 10 10]T: (a) dynamics in the time-state plane; (b) dynamics in the state-space plane, taking the variable x1(t) as reference
Fig. 1.12 Dynamics of the problem LP1 obtained by the HIS-1 with the initial condition x(0) = [10 10 10 10]T: (a) dynamics in the time-space plane; (b) dynamics in the state-space plane, taking the variable x1(t) as reference
The results shown in row 5 of Table 1.3 point out that both proposed hybrid
systems were able to advance the dynamics of the simulated quadratic problem
efficiently. This greatly reduces the time needed to process the network algorithm, since at
each integration step n ODEs are solved, where n is the number of variables
in the problem. For instance, for the QP problem, the total number of points cal-
culated at the end of the first phase by the ANN was 172629, so 517887
ODEs were solved; using the HIS-1, only 8257 points, yielding
24771 ODEs, were necessary to reach the end of the first phase. In other words,
the HIS-1 reduced the computational effort by approximately 95% compared to the
Fig. 1.13 Dynamics of the problem LP1 obtained by the HIS-2 with the initial condition x(0) = [10 10 10 10]T: (a) dynamics in the time-space plane; (b) dynamics in the state-space plane, taking the variable x1(t) as reference
Fig. 1.14 Dynamics of the problem LP2 obtained by the ANN with the initial condition x(0) = [0 0 0 0 0 0 0 0 0 0 0]T: (a) dynamics in the time-space plane; (b) dynamics in the state-space plane, taking the variable x1(t) as reference
ANN. For the HIS-2 and the QP problem, this rate was approximately 70%.
Figs. 1.17-1.19 present the simulation results for the QP problem.
All case studies were carried out on the same computer; thus we take the process-
ing time of the ANN in phase 1 as the base for normalizing the hybrid cases in the same
phase. We point out that the hybrid systems were not used in phase 2. As a
result, we observed that, for the LP1 case, the HIS-1 took 12% of the
base processing time, while the HIS-2 took 29%; for the LP2 case, the HIS-1 took
33% and the HIS-2 35%; and for the QP case, the HIS-1 took 6% and the
HIS-2 45%. It is important
Fig. 1.15 Dynamics of the problem LP2 obtained by the HIS-1 with the initial condition x(0) = [0 0 0 0 0 0 0 0 0 0 0]T: (a) dynamics in the time-space plane; (b) dynamics in the state-space plane, taking the variable x1(t) as reference
Fig. 1.16 Dynamics of the problem LP2 obtained by the HIS-2 with the initial condition x(0) = [0 0 0 0 0 0 0 0 0 0 0]T: (a) dynamics in the time-space plane; (b) dynamics in the state-space plane, taking the variable x1(t) as reference
to note that the efficiency of the heuristic rules varies according to the type of problem.
In this work, the results showed that the HIS-1 yielded the better performance,
specifically in the LP1 and QP problems. We could thus observe a decrease in
the processing time yielded by the implemented heuristic rules, together with a reduction
in the number of ODEs computed. These rules estimate the next value of each variable
of the problem throughout the convergence. We highlight that, since the ODEs
are still solved between applications of the HIS, the proposed systems can correct
themselves in case of an incorrect estimate, which makes them resilient.
Table 1.3 Main results of the QP simulation

                                           ANN     HIS-1a    HIS-2b
 No. points by the ANN (Phase 1)        172629      8257     51900
 No. points by the HR (Phase 1)              -      2750     17299
 Total no. points (Phase 1)             172629     11007     69199
 Normalized processing time (Phase 1)     1.00      0.06      0.45
 Switch closing time t1 (s)            1726.29    110.07    691.99
 Final cost (Phase 1) = f(x(t1))      22680.05  22680.05  22680.05

a HIS-1 = ANN and method of Tendency based on the Dynamics in Space-Time (TDST).
b HIS-2 = ANN and method of Tendency based on the Dynamics in State-Space (TDSS).
Fig. 1.17 Dynamics of the problem QP obtained by the ANN with the initial condition x(0) = [400 300 150]T: (a) dynamics in the time-space plane; (b) dynamics in the state-space plane, taking the variable x1(t) as reference
1.6 Conclusion
In this chapter, two Hybrid Intelligent Systems have been proposed. These systems
combine the Maa and Shanblatt network with heuristic rules. The Maa and Shanblatt
network is a two-phase recurrent neural network that provides the exact solution
of linear and quadratic programming problems. Compared to conventional
linear and nonlinear optimization techniques, the two-phase network formulation
is advantageous because no matrix inversion is required. The main aim of
the proposed HIS is to increase the speed of convergence towards the optimal point,
which is guaranteed by the ANN. In the cases presented, the optimal convergence
Fig. 1.18 Dynamics of the problem QP obtained by the HIS-1 with the initial condition x(0) = [400 300 150]T: (a) dynamics in the time-space plane; (b) dynamics in the state-space plane, taking the variable x1(t) as reference
Fig. 1.19 Dynamics of the problem QP obtained by the HIS-2 with the initial condition x(0) = [400 300 150]T: (a) dynamics in the time-space plane; (b) dynamics in the state-space plane, taking the variable x1(t) as reference
was reached. The proposed systems therefore retain both advantages: in the QP case
solved in this chapter, the simulation analyses show a reduction of approximately
95% in computational effort compared to the ANN, with guaranteed optimal con-
vergence and without inverting matrices. The implementation of the proposed HIS has
been developed with a view to solving large-scale operational planning problems in
future works, namely large-scale economic power dispatch with the
scheduling of hydro, thermal and wind power plants, minimizing the overall pro-
duction cost while satisfying the load demand in the mid-term operation planning
of hydrothermal generation systems. In future works, we will also propose the combina-
tion of these heuristic rules and/or their application in the second phase of the Maa and
Shanblatt method.
References
1. Hopfield, J.J.: Neural networks and physical systems with emergent collective computa-
tional abilities. Proc. Natl. Acad. Sci. USA 79, 2552–2558 (1982)
2. Haykin, S.: Neural networks: a comprehensive foundation, 2nd edn. Prentice Hall, USA
(1999)
3. Hopfield, J.J.: Neurons with graded response have collective computational properties
like those of two-state neurons. Proc. Natl. Acad. Sci. USA 81, 3088–3092 (1984)
4. Hopfield, J.J.: Learning algorithms and probability distributions in feed-forward and
feed-back networks. Proc. Natl. Acad. Sci. USA 84, 8429–8433 (1987)
5. Hopfield, J.J.: The effectiveness of analogue neural network hardware. Network: Com-
putation in Neural Systems 1(1), 27–40 (1990)
6. Hopfield, J.J., Feinstein, D.I., Palmer, R.G.: Unlearning has a stabilizing effect in collec-
tive memories. Nature 304, 158–159 (1983)
7. Hopfield, J.J., Tank, D.W.: Neural Computation of Decisions in Optimization Problem.
Biological Cybernetics 52, 141–152 (1985)
8. Hopfield, J.J., Tank, D.W.: Computing with Neural Circuits: A Model. Science 233(8),
625–633 (1986)
9. Tank, D.W., Hopfield, J.J.: Simple Neural Optimization Networks: An A/D Converter,
Signal Decision Circuit, and a Linear Programming Circuit. IEEE Trans. on Circuits and
Systems 33(5), 533–541 (1986)
10. Ludemir, T.B., Braga, A.P., Carvalho, A.C.P.L.F.: Redes Neurais Artificiais: Teoria e
Aplicações, 1st edn. LTC - Livros Técnicos e Científicos Editora S.A., Rio de Janeiro
(2000)
11. Pyne, I.B.: Linear Programming on an electronic analogue computer. Trans. AIEE. Part
I (Comm. & Elect.) 75, 139–143 (1956)
12. Kennedy, M.P., Chua, L.O.: Unifying Tank and Hopfield Linear Programming Circuit
and the Canonical Nonlinear Programming Circuit of Chua and Lin. IEEE Trans. on
Circuits and Systems 34(2), 210–214 (1987)
13. Chiu, C., Maa, C.Y., Shanblatt, M.A.: An artificial neural network algorithm for dynamic
programming. Int. J. Neural Syst. 1(3), 211–220 (1990)
14. Maa, C.Y., Shanblatt, M.A.: A Two-Phase Optimization Neural Network. IEEE Trans-
actions on Neural Networks 3(6), 1003–1009 (1992)
15. Kennedy, M.P., Chua, L.O.: Neural Networks for Nonlinear Programming. IEEE Trans.
on Circuits and Systems 35(5), 210–220 (1988)
16. Maa, C.Y., Shanblatt, M.A.: Linear and Quadratic Programming Neural Network Anal-
ysis. IEEE Transactions on Neural Networks 3(4), 580–594 (1992)
17. Chiu, C., Maa, C.Y., Shanblatt, M.A.: Energy Function Analysis of Dynamic Program-
ming Neural Networks. IEEE Transactions on Neural Networks 2(4) (July 1991)
18. Xia, Y.S.: A New Neural Network for Solving Linear and Quadratic Programming Prob-
lems. IEEE Transactions on Neural Networks 7(6), 1544–1547 (1996)
19. Tao, Q., Cao, J.D., Xue, M.S., Qiao, H.: A High Performance Neural Network for Solv-
ing Nonlinear Programming Problems with Hybrid Constraints. Phys. Lett. A 288(2),
88–94 (2001)
20. Xia, Y.S., Wang, J.: A General Methodology for Designing Globally Convergent Op-
timization Neural Networks. IEEE Transactions on Neural Networks 9(6), 1331–1343
(1998)
21. Xia, Y.S., Wang, J.: A Recurrent Neural Network for Solving Nonlinear Convex Pro-
grams Subject to Linear Constraints. IEEE Transactions on Neural Networks 16(2),
379–386 (2005)
22. Dieu, V.N., Ongsakul, W.: Enhanced Merit Order and Augmented Lagrange Hop-
field Network for Hydrothermal Scheduling. Electrical Power and Energy Systems 30,
93–101 (2008)
23. Naresh, R., Dubey, J., Sharma, J.: Two-phase Neural Network Based Modeling Frame-
work of Constrained Economic Load Dispatch. IEE Proc. Gener. Transm. Distrib. 151(3)
(May 2004)
24. Aquino, R.R.B.: Recurrent Artificial Neural Networks: an application to optimization of
hydro thermal power systems (in Portuguese), Ph.D. Thesis, COPELE/UFPE, Campina
Grande, Brazil (January 2001)
25. Rosas, P., Aquino, R.R.B., et al.: Study of Impacts of a Large Penetration of Wind Power
and Distributed Power Generation as a Whole on the Brazilian Power System. In: Euro-
pean Wind Energy Conference (EWEC), London (November 2004)
26. Witten, I.H., Frank, E.: Data Mining Practical Machine Learning Tools and Techniques,
2nd edn. Morgan Kaufmann, San Francisco (2005)
27. Mitra, S., Mitra, M., Chaudhuri, B.B.: Pattern Defined Heuristic Rules and Directional
Histogram Based Online ECG Parameter Extraction. Measurement 42, 150–156 (2009)
28. Tuncel, G.: A Heuristic Rule-Based Approach for Dynamic Scheduling of Flexible Man-
ufacturing Systems. In: Levner, E. (ed.) Multiprocessor Scheduling: Theory and Appli-
cations, December 2007, p. 436. Itech Education and Publishing, Vienna (2007)
29. Baykasoglu, A., Ozbakir, L., Dereli, T.: Multiple Dispatching Rule Based Heuristic for
Multi-Objective Scheduling of Job Shops Using Tabu Search. In: Proceedings of MIM
2002: 5th International Conference on Managing Innovations in Manufacturing (MIM)
Milwaukee, Wisconsin, USA, September 9-11, pp. 1–6 (2002)
30. Idris, N., Baba, S., Abdullah, R.: Using Heuristic Rules from Sentence Decomposition
of Experts Summaries to Detect Students Summarizing Strategies. International Journal
of Human and Social Sciences 2, 1 (Winter 2008), www.waset.org
31. Zak, S.H., Upatising, V., Hui, S.: Solving Linear Programming Problems with Neural
Networks: A Comparative Study. IEEE Transactions on Neural Networks 6(1), 94–104
(1995)
32. Chvátal, V.: Linear Programming. W.H. Freeman and Company, New York (1983)
33. Lastman, G.J., Sinha, N.K.: Microcomputer-Based Numerical Methods for Science and
Engineering. Saunders College Publishing, USA (1988)
Chapter 2
A Novel Optimization Algorithm Based on
Reinforcement Learning
2.1 Introduction
Optimization is the process of finding the maximum or the minimum function value
within given constraints by changing the values of its multiple variables. It can be
essential for solving complex engineering problems in areas such as computer sci-
ence, aerospace, machine intelligence applications, etc. When the analytical relation
Janusz A. Starzyk · Yinyin Liu
Ohio University, School of Electrical Engineering and Computer Science, U.S.A.
e-mail: [email protected],[email protected]
Sebastian Batog
Silesian University of Technology, Institute Of Computer Science, Poland
e-mail: [email protected]
between the variables and the objective function value is explicitly known, analyti-
cal methods, such as Lagrange multiplier methods [1], interior point methods [18],
Newton methods [30], gradient descent methods [25], etc., can be applied. How-
ever, in many practical applications, analytical methods do not apply. This happens
when the objective functions are unknown, when the relations between variables and
function value are not given or are difficult to find, when the functions are known
but their derivatives are not applicable, or when the optimum value of the function
cannot be verified. In these cases, iterative search processes are required to find the
function optimum.
Direct search algorithms [10] contain a set of optimization methods that do not
require derivatives and do not approximate either the objective functions or their
derivatives. These algorithms find locations with better function values following a
search strategy. They only need to compare the objective function values in succes-
sive iterative steps to make the move decision. Within the category of direct search,
distinctions can be made among three classes including pattern search methods [28],
simplex methods [6], and adaptive sets of search directions [23]. In pattern search
methods, the variables of the function are varied by either steps of predetermined
magnitude or the steps sizes are reduced at the same degree [15]. Simplex meth-
ods construct a simplex in ℜN using N+1 points and use the simplex to drive the
search for optimum. The methods with adaptive sets of search directions, proposed
by Rosenbrock [23] and Powell [21], construct conjugate directions using the infor-
mation about the curvature of the objective function during the search.
In order to avoid local minima, random search methods are developed utilizing
randomness in setting the initial search points and other search parameters like the
search direction or the step size. In Optimized Step-Size Random Search (OSSRS)
[24], the step size is determined by fitting a quadratic function for the optimized
function values in each of the random directions. The random direction is generated
with a normal distribution of a given mean and standard deviation. Monte-Carlo op-
timizations adopt randomness in the search process to create possibilities
of escaping from local minima. Simulated Annealing (SA) [13] is one typical kind
of Monte-Carlo algorithm. It exploits the analogy between the search for a mini-
mum in the optimization problem and the annealing process in which a metal cools
and stabilizes into a minimum-energy crystalline structure. It accepts a move to a
new position with a worse function value with a probability controlled by
the "temperature" parameter, and this probability decreases along the "cooling pro-
cess". SA can deal with highly nonlinear, chaotic problems provided that the cooling
schedule and other parameters are carefully tuned.
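As a reference point, the acceptance step just described can be written in a few lines of Python; the exponential (Metropolis) form below is the standard choice, used here only to illustrate how the temperature controls the probability of accepting a worse move.

```python
import math
import random

def sa_accept(f_current, f_new, temperature):
    """Metropolis-style acceptance: improving moves are always taken, while
    worsening moves are accepted with probability exp(-(df)/T), which
    decreases as the temperature is lowered along the cooling schedule."""
    if f_new <= f_current:
        return True
    return random.random() < math.exp(-(f_new - f_current) / temperature)
```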
Particle Swarm Optimization (PSO) [11] is a population-based evolutionary com-
putational algorithm. It exploits the cooperation within the solution population in-
stead of the competition among them. At each iteration in PSO, a group of search
particles make moves in a mutually coordinated fashion. The step size of a particle is
a function of both the best solution found by that particle and the best solution found
so far by all the particles in the group. The use of a population of search particles
and the cooperation among them enable the algorithm to evaluate function values in
a wide range of variables in the input space and to find the optimum position. Each
particle only remembers its best solution and the global best solution of the group
to determine its step sizes.
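A canonical PSO move is sketched below for reference; the inertia and acceleration coefficients are conventional textbook values, not parameters taken from [11].

```python
import numpy as np

def pso_step(x, v, p_best, g_best, w=0.7, c1=1.5, c2=1.5):
    """One canonical PSO move: the new step combines the particle's own best
    solution (p_best) and the best solution found so far by the whole group
    (g_best), each weighted by a random factor."""
    r1, r2 = np.random.rand(x.size), np.random.rand(x.size)
    v = w * v + c1 * r1 * (p_best - x) + c2 * r2 * (g_best - x)
    return x + v, v
```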
Generally, during the course of search, a sequence of decisions on the step sizes is
made and a number of function values are obtained in these optimization methods.
In order to implement an efficient search for the optimum point, it is desired that
such historical information can be utilized in the optimization process.
Reinforcement Learning (RL) [27] is a type of learning process to maximize cer-
tain numerical values by combining exploration and exploitation and using rewards
as learning stimuli. In the reinforcement learning problem, the learning agent per-
forms experiments to interact with the unknown environment and accumulates
knowledge during this process. It is a trial-and-error exploratory process with
the objective to find the optimum action. During this process, an agent can learn
to build the model of the environment to instruct its search, so that the agent can
predict the environment’s response to its actions and choose the most useful actions
for its objectives based on its past exploring experience.
Surrogate-based optimization refers to the idea of speeding up the optimization process
by using surrogates for the objective and constraint functions. The surrogates also
allow for the optimization of problems with non-smooth or noisy responses, and
can provide insight into the nature of the design space. The max-min SAGA ap-
proach [20] searches for designs that have the best worst-case performance in
the presence of parameter uncertainty. By leveraging a trust-region approach which
uses computationally cheap surrogate models, it allows for the
possibility of achieving robust design solutions on a limited computational budget.
Another example of surrogate-based optimization is the surrogate-assisted
Hooke-Jeeves algorithm (SAHJA) [8], which can be used as a local component of a
global optimization algorithm. This local searcher uses the Hooke-Jeeves method,
which performs its exploration of the input space intelligently, employing both the
real fitness and an approximated function.
The idea of building knowledge about an unknown problem through exploration
can be applied in the optimization problems. To find the optimum of an unknown
multivariable function, an efficient search procedure can be performed using only
historical information from conducted experiments to expedite the search. In this
chapter, a novel and efficient optimization algorithm based on reinforcement learn-
ing is presented. This algorithm uses simple search operators and will be called
reinforcement learning optimization (RLO) in the later sections. It does not require
any prior knowledge of the objective function or function’s gradient information,
nor does it require any characteristics of the objective function. In addition, it is
conceptually very simple and easy to implement. This approach to optimization
is compatible with the neural networks and learning through interaction, thus it is
useful for systems of embodied intelligence and motivated learning as presented in
[26]. The following section presents the RLO method and illustrates it within several
machine learning applications.
could have several local minima and several global minima Vopt1 , ...,VoptN . It is
desired that the search process, initiated from a random point, finds a path to the
global optimum point. Unlike particle swarm optimization [11], this process can be
performed with a single search particle that learns how to find its way to the opti-
mum point. It does not require the cooperation among a group of particles, although
implementing the cooperation among several search particles may further enhance
the search process in this method.
At each point of the search, the search particle intends to find a new location
with a better value within a searching range around it and then determines the di-
rection and the step size for the next move. It tries to reach the optimum through a
weighted random search of each variable (coordinate). The step size of the search
in each variable is randomly generated with its own probability density function.
These functions are gradually learned during the search process. It is expected that
at the later stage of search, the probability density functions are approximated for
each variable. Then the stochastically randomized path to the minimum point of the
function from the start point is learned.
The step sizes of all the coordinates determine the center of the new searching
area and the standard deviations of the probability functions determine the size of
the new searching area around the center. In the new searching area, several loca-
tions PS are randomly generated. If there is a location p’ with better value than the
current one, the search operator moves to it. From this new location, new step sizes
and new searching range are determined, so that the search for optimum continues.
If, in the current searching area, there is no point with a better value that the search
particle can move to, further sets of random points are generated, until no improve-
ment is obtained after several, say M, trials. Then the searching area size and step
sizes are modified in order to find a better function value. If no better value is found
after K trials of generating different searching areas, or the proposed stopping crite-
rion is met, we can claim that the optimum point has been found. The algorithm for
searching for the minimum point is schematically shown in Fig. 2.1.
Fig. 2.1 The algorithm of RLO searching for the minimum point
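The Python skeleton below sketches this loop. The learned per-coordinate step-size distributions are reduced here to a single Gaussian whose center (the last successful step) and spread are adapted; it is a simplified sketch of the scheme in Fig. 2.1, not the full RLO method.

```python
import numpy as np

def rlo_minimize(f, x0, ps=10, m_trials=5, k_trials=20, sigma0=1.0):
    """Skeleton of the RLO search loop described above: sample PS candidate
    points around a predicted center, move when a better point appears, and
    modify the searching area after repeated failures."""
    x = np.asarray(x0, dtype=float)
    fx = f(x)
    step = np.zeros_like(x)                     # remembered successful step
    sigma = sigma0                              # size of the searching area
    for _ in range(k_trials):
        improved = False
        for _ in range(m_trials):
            center = x + step                   # center of the new searching area
            cand = center + sigma * np.random.randn(ps, x.size)
            vals = np.apply_along_axis(f, 1, cand)
            k = int(vals.argmin())
            if vals[k] < fx:                    # move to the better location p'
                step = cand[k] - x              # reinforce the successful step
                x, fx = cand[k], vals[k]
                improved = True
                break
        if not improved:                        # modify searching area and steps
            sigma *= 0.5
            step *= 0.5
    return x, fx
```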
successful actions during the trial. It is proposed that the successful actions which
result in a positive reinforcement (such as the step sizes of each coordinate) follow a
function of the iterative steps t, as in (2.1), where $dp^i$ represents the step size on the
ith coordinate and $f^i(t)$ is the function for coordinate i.
These unknown functions $f^i(t)$ can be approximated, for example, using polynomials through the least-squared fit (LSF) process.
$$
\begin{bmatrix}
1 & t_1 & t_1^2 & \cdots & t_1^B \\
\cdots & \cdots & \cdots & \cdots & \cdots \\
1 & t_n & t_n^2 & \cdots & t_n^B
\end{bmatrix}
\begin{Bmatrix} a_0 \\ a_1 \\ a_2 \\ \vdots \\ a_B \end{Bmatrix}
=
\begin{Bmatrix} dp_1^i \\ \vdots \\ dp_n^i \end{Bmatrix}
\qquad (2.2)
$$
In (2.2), the step sizes $dp_1^i$ to $dp_n^i$ are the step sizes on a certain coordinate
during n steps and are fitted as unknown function values using polynomials of order
B. The polynomial coefficients $a_0$ to $a_B$ can be obtained and will represent the
function $f^i(t)$ used to estimate $dp^i$,
$$dp^i = \sum_{j=0}^{B} a_j t^j. \qquad (2.3)$$
Using polynomials for function approximation is easy and efficient. However,
considering the characteristics of optimization problems, we have two concerns.
First, in order to generate a good approximation while avoiding overfitting, a proper
order of polynomials must be selected. In the optimized approximation algorithm
(OAA) presented in [17], the goodness of fit is determined by the so-called signal-
to-noise ratio figure (SNRF). Based on SNRF, an approximation stopping criterion
was developed. Using a certain set of basis functions for approximation, the error
signal, computed as the difference between the approximated function and the sam-
pled data, can be examined by SNRF to determine how much useful information it
contains. The SNRF for the error signal, denoted as $\mathrm{SNRF}_e$, is compared to the pre-calculated
SNRF for white Gaussian noise (WGN), denoted as $\mathrm{SNRF}_{WGN}$. If $\mathrm{SNRF}_e$
is higher than $\mathrm{SNRF}_{WGN}$, more basis functions should be used to improve the learn-
ing. Otherwise, the error signal shows the characteristic of WGN and should not
be reduced any further, to avoid fitting the noise; the obtained approximation is then
the optimum function. Such a process can be applied to determine the proper order
of the polynomial.
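The following Python sketch shows how such an order-selection loop could be wired up. The circular-shift correlation is one plausible realization of the SNRF defined in (2.6) below, and the threshold uses the μ + 1.5σ rule quoted later in this section; the exact estimator of [17] may differ in detail.

```python
import numpy as np

def snrf(e):
    """A plausible realization of the signal-to-noise ratio figure of an
    error signal e (cf. eq. 2.6): correlation of e with its circularly
    shifted copy, relative to the remaining noise-like energy."""
    e = np.asarray(e, dtype=float) - np.mean(e)
    c_cross = np.dot(e, np.roll(e, 1))   # C(e_j, e_{j-1})
    c_auto = np.dot(e, e)                # C(e_j, e_j)
    return c_cross / (c_auto - c_cross)

def select_order(t, y, max_order=10):
    """Raise the polynomial order until the residual's SNRF falls below
    the WGN threshold mu + 1.5*sigma = 1.5/sqrt(n) (cf. eq. 2.8)."""
    t, y = np.asarray(t, float), np.asarray(y, float)
    threshold = 1.5 / np.sqrt(len(t))
    for b in range(max_order + 1):
        coeffs = np.polyfit(t, y, b)     # LSF with a polynomial of order b
        resid = y - np.polyval(coeffs, t)
        if snrf(resid) <= threshold:     # residual looks like WGN: stop
            return b, coeffs
    return max_order, coeffs
```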
The second concern is that, in the case of reinforcement learning, the knowledge
about the originally unknown environment is gradually accumulated throughout
the learning process. The information that the learning system obtains at the be-
ginning of the process is mostly based on initially random exploration. During the
process of interaction, the learning system collects the historical information and
builds the model of the environment. The model can be updated after each step of
interaction. The decisions made at the later stages of the interaction are based more
on the built model than on random exploration. This means that the recent results
are more important and should be weighted more heavily than the old ones.
For example, the weights applied can be exponentially increasing from the initial
trials to the recent ones, as
$$w_t = \frac{\alpha^t}{n} \quad (t = 1, 2, \dots, n), \qquad (2.4)$$
where we define $\alpha^n = n$. As a result, the weights lie in the half-open interval (0, 1],
and the weight is 1 for the most recent sample. Applying the weights in the LSF, we
have the weighted least-squared fit (WLSF), expressed as follows:
$$
\begin{bmatrix}
1 \cdot w_1 & t_1 w_1 & t_1^2 w_1 & \cdots & t_1^B w_1 \\
\cdots & \cdots & \cdots & \cdots & \cdots \\
1 \cdot w_n & t_n w_n & t_n^2 w_n & \cdots & t_n^B w_n
\end{bmatrix}
\begin{Bmatrix} a_0 \\ a_1 \\ a_2 \\ \vdots \\ a_B \end{Bmatrix}
=
\begin{Bmatrix} dp_1 w_1 \\ \vdots \\ dp_n w_n \end{Bmatrix}
\qquad (2.5)
$$
Due to the weights applied to the given samples, the approximated function will fit
to the recent data better than to the old ones.
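A minimal sketch of the WLSF of (2.4)–(2.5), assuming the weights $w_t = \alpha^t/n$ with $\alpha^n = n$; the helper name weighted_lsf is ours, not from the chapter.

```python
import numpy as np

def weighted_lsf(t, dp, order):
    """Weighted least-squared fit of the historical step sizes dp over the
    iteration indices t (eq. 2.5): each row of the Vandermonde system is
    scaled by w_t = alpha**t / n with alpha**n = n (eq. 2.4)."""
    t, dp = np.asarray(t, float), np.asarray(dp, float)
    n = len(t)
    alpha = n ** (1.0 / n)                        # alpha^n = n
    w = alpha ** np.arange(1, n + 1) / n          # weights in (0, 1], w_n = 1
    V = np.vander(t, order + 1, increasing=True)  # columns 1, t, t^2, ..., t^B
    a, *_ = np.linalg.lstsq(V * w[:, None], dp * w, rcond=None)
    return a                                      # coefficients a_0, ..., a_B
```

The returned coefficients can be evaluated at the next iteration index, as in (2.3), to predict the step size; because the recent rows carry the largest weights, the fit favors the most recent samples.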
Utilizing the concept of OAA to obtain an optimized WLSF, the SNRF for the error
signal or WGN has to be estimated considering the sample weights. In the original
OAA for the one-dimensional problem [17], the SNRF for the error signal was calculated
as
$$\mathrm{SNRF}_e = \frac{C(e_j, e_{j-1})}{C(e_j, e_j) - C(e_j, e_{j-1})} \qquad (2.6)$$
where C represents the correlation calculation, $e_j$ represents the error signal
($j = 1, 2, \dots, n$), and $e_{j-1}$ represents the circularly shifted version of $e_j$. The characteristics
of SNRF for WGN, expressed through the average value and the standard deviation,
can be estimated from a Monte-Carlo simulation as (see the derivation in [17])
$$\sigma_{\mathrm{SNRF}_{WGN}}(n) = \frac{1}{\sqrt{n}}. \qquad (2.8)$$
Then the threshold, which determines whether $\mathrm{SNRF}_e$ shows the characteristic of
$\mathrm{SNRF}_{WGN}$ and the fitting error should not be further reduced, is set at the 5%
significance level discussed below.
For the weighted approximation, the SNRF for the error signal is calculated analogously,
and the standard deviation of $\mathrm{SNRF}_{WGN}$ becomes
$$\sigma_{\mathrm{SNRF}_{WGN}}(n) = \frac{2}{\sqrt{n}}. \qquad (2.11)$$
It is found that the 5% significance level can be approximated by the average value
plus 1.5 times the standard deviation for an arbitrary n. Fig. 2.2(b) illustrates the histogram
of $\mathrm{SNRF}_{WGN}$ with $2^{16}$ samples, as an example. The threshold in this case of
a dataset with $2^{16}$ samples can be calculated as $\mu + 1.5\sigma = 0 + 1.5 \times 0.0078 = 0.0117$.
Therefore, to obtain an optimized weighted approximation in the one-dimensional
case, the following algorithm is performed.
Step (2.3). Take a set of basis functions, for example, polynomials of order from 0
up to order B.
Step (2.4). Use these B+1 basis functions to obtain the approximated function,
$$\hat{dp}_t = \sum_{l=1}^{B+1} f_l(x_t) \quad (t = 1, 2, \dots, n). \qquad (2.12)$$
Example
The function $V(p_1, p_2) = p_2^2 \sin(1.5 p_2) + 2 p_1^2 \sin(2 p_1) + p_1 \sin(2 p_2)$ has several
local minima, but only one global minimum, as shown in Fig. 2.3. In the process of
interaction, the historical information after each iteration is collected. The historical
step sizes of the two coordinates are separately approximated, as shown in Fig. 2.4 (a)
and 2.4 (b). The step sizes of the two coordinates are approximated by quadratic
polynomials, which are determined by the OWAA, and the coefficients of the polynomials
are obtained using WLSF. In Fig. 2.4, the approximated functions are compared with
the quadratic polynomials whose coefficients are obtained from LSF. Again, it is
observed that the function obtained using WLSF fits the data in later iterations more
closely than the function obtained using LSF.
The level of the approximation error signal $e_t$ for the step sizes of a certain coordinate
$dp^i$, which is the difference between the observed sampled data and the approximated
function, can be measured by its standard deviation, as shown in (2.14).
$$\sigma_{p_i} = \sqrt{\frac{1}{n} \sum_{t=1}^{n} (e_t - \bar{e})^2} \qquad (2.14)$$
This standard deviation will be called the approximation deviation in the follow-
ing discussion. It represents the maximum deviation of the location of the search
particle from the prediction by the approximated function in the unknown function
optimization problem.
Fig. 2.5 Prediction of the step sizes for the next iteration
The step size functions are the model of environment that the learning system builds
during the process of interaction based on historical information. The future step
size determined by such a model can be employed as exploitation of the existing
model. However, a model built during the learning process cannot be treated
as exact. Besides exploitation, which best utilizes the obtained model, exploration
is desired to a certain degree in order to improve the model and discover better
solutions. The exploration can be implemented using Gaussian random generator
(GRG). As a good trade-off between exploitation and exploration is needed, we pro-
pose to use the step sizes for the next iteration determined by the step size functions
as the mean value and the approximation deviation as the standard deviation of the
random generator. Gaussian random generators give several random choices of the
step sizes. Effectively, the determined step sizes of multiple coordinates generate
the center of the searching area, and the size of the searching range is determined
by the standard deviations of GRG for the coordinates. The multiple random values
generated by GRG for each coordinate effectively create multiple locations within
the searching area. The objective function values of these locations will be com-
pared and the location with the best value, called current best location, will be
chosen as the place from which the search particle will continue searching in the
next iteration. Therefore, the actual step sizes are calculated using the distance from
the “previous best location” to the “current best location”. The actual step sizes will
be added in the historical step sizes and used to update the model of the unknown
environment.
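The exploitation/exploration blend just described might be sketched as follows; propose_locations is a hypothetical helper that takes the fitted polynomial coefficients per coordinate (e.g. from the weighted_lsf sketch above) and the approximation deviations, and returns PS candidate locations.

```python
import numpy as np

def propose_locations(x_best, coeff_per_coord, sigma_approx, t_next,
                      ps=10, rng=None):
    """Exploitation: the step-size functions predict the mean step for
    iteration t_next, giving the center of the new searching area.
    Exploration: the approximation deviation is the standard deviation
    of the Gaussian random generator producing PS candidate locations."""
    rng = rng or np.random.default_rng()
    # predicted step per coordinate from eq. (2.3); coefficients are in
    # increasing order, as returned by the weighted_lsf sketch above
    mean_step = np.array([np.polyval(a[::-1], t_next)
                          for a in coeff_per_coord])
    steps = rng.normal(loc=mean_step, scale=sigma_approx,
                       size=(ps, len(mean_step)))
    return np.asarray(x_best, dtype=float) + steps
```

The objective is then evaluated at the returned locations, the best one becomes the "current best location", and the realized step is appended to the history used to refit the step-size functions.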
$$\sigma_{p_i} = \alpha\, \sigma_{p_i}, \qquad dp_i = \varepsilon\, dp_i \qquad (i = 1, 2, \dots, N), \qquad (2.16)$$
where α > 1 and ε < 1. If this new search is still not successful, the searching range
and the step size continue changing until some points with better function values
are found. If, at a certain step of the search process, the current step size is reduced
so much that the search particle cannot move anywhere in search of a new location
with a better function value, it indicates that the optimum point has been reached.
The stopping criterion can be defined by the current step size being β times
smaller than the previous step size, as,
(i). Approximate the function of the step sizes as a function of iterative steps us-
ing weighted least-square fit as in (2.5). The proper maximum order of the basis
functions is determined using SNRF described in section 2.2.2 to avoid overfitting.
(j). Use the modeled function to determine the step sizes $dp_i$ ($i = 1, 2, \dots, N$) for the
next iteration step. The difference between the approximated step sizes and the
actual step sizes gives the approximation deviation $\sigma_{p_i}$ ($i = 1, 2, \dots, N$).
Repeat Steps (c) to (j).
In general, the optimization algorithm based on the reinforcement learning builds
the model of successful moves for a given objective function. The model is built
based on historical successful actions and it is used to determine new actions. The
algorithm combines the exploitation and exploration of searching using random gen-
erators. The optimization algorithm does not require any prior knowledge of the objective
function or of its derivatives, nor does it put any special requirements on the
objective function. The search operator is conceptually very simple and intuitive.
In the following section, the algorithm is verified using several experiments.
The function $V(p_1, p_2)$ used previously in the example in section 2.2.2 is used as the
objective function. This function has several local minima and one global minimum equal to −112.2586.
The optimization algorithm starts at a random point and performs the search process
looking for the optimum point (minimum in this example). The number of random
points Ps generated in the searching area in each step is 10. The scaling factors α
and ε in (2.16) are 1.1 and 0.9. The β in (2.17) is 0.005.
One possible search path is shown in Fig. 2.7 from the start location to the final
optimum location as found by RLO algorithm. The global optimum is found in
13 iterative steps. The historical locations are shown in the figure as well. During
the search process, the historical step sizes taken are shown in Fig. 2.8 with their
approximation by WLSF.
Another search process, starting from a different random point, is shown in
Fig. 2.9. The global optimum is found in 10 iterative steps.
Table 2.1 shows changes in the numerical function values and adjustment of the step
sizes $dp_1$ and $dp_2$ for $p_1$ and $p_2$ in the successive search steps. Notice how the step
size was initially reduced and then increased again once the algorithm started to follow
a correct path towards the optimum.
Such search process was performed for 300 random trials. The success rate of
finding the global optimum is 93.78%. On average, it takes 5.9 steps and 4299 func-
tion evaluations to find the optimum in this problem.
The same problems are tested on several other direct search based optimization
algorithms, including SA [29], PSO [14] and OSSRS [2]. The success rate of find-
ing global optimum and the average number of function evaluations are compared
in Tables 2.2, 2.3, 2.4. All the simulations were performed using an Intel Core Duo
2.2GHz based PC, with 2GB of RAM.
The Rosenbrock function [23] has one global minimum, equal to 0, lying inside a narrow,
curved valley. The optimization performances of these algorithms over 300 random
trials are compared in Table 2.4.
In the Iris classification test, the output layer of the MLP contains 6 elements. Overall,
there are 36 weight elements (parameters) to be optimized. In a typical trial, the optimization algorithm finds the optimal set
of weights after only 3 iterations. In the testing stage, the outputs of the MLP are
rounded to be the nearest integers to indicate predicted class IDs. Comparing the
given class IDs and the predicted class IDs from the MLP in Fig. 2.10, it is obtained
that 146 out of 150 iris samples can be correctly classified by this set of weights,
and the percentage of correct classification is 97.3%. A single support vector machine
(SVM) achieved a 96.73% classification rate [12]. In addition, an MLP with the
same structure, trained by back-propagation (BP), achieved 96% on the Iris test case.
The MLP and BP are implemented using the MATLAB neural network toolbox.
In the active vision experiment, image features are obtained through retina sampling.
The set of images, as the sampled features, is fed to the self-organizing
winner-take-all classifier (SOWTAC) network for recognition. To find interesting
features of the input image and to direct the movements of saccade, image segmen-
tation, edge detection and basic morphology tools [4] are utilized.
Fig. 2.11 Face image and its interesting features in active vision [16]
Fig. 2.11 (a) shows a face image from [7] with 320×240 pixels. The interesting
features found are shown in Fig. 2.11 (b). The stars represent the center of the four
interesting features found on a face image and the rectangles represent the feature
boundaries. Then, the retina sampling model [16] places its fovea at the center of
each interesting feature, so that these features will be extracted.
Practically, the centers of the interesting features found by image processing tools
[4] are not guaranteed to be the accurate centers, which will affect the accuracy
of the feature extraction and pattern recognition process. To help find the
optimum sampling position, the RLO algorithm can be used to direct the movement of the
fovea of the retina and find the closest match between the obtained sample features
and pre-stored reference sample features. These slight moves during fixation to find
the optimum sampling positions can be called microsaccades in the active vision
process, although the actual role of microsaccades has been an unsolved topic of debate
for several decades [19].
Fig. 2.12 (a) shows a group of ideal samples of important features in face recog-
nition. Fig. 2.12 (b) shows the group of sampled features with initial sampling posi-
tions. In the optimization process, the x-y coordinates need to be optimized so that
the sampled images have the optimum similarity to the ideal images. The level of
similarity can be measured by the sum of squared intensity difference [9]. In this
metric, increased similarity will have decreased intensity difference. Such problem
can be also perceived as an image registration problem. The two-variables objec-
tive function V(x, y), the sum of squared intensity difference, needs to be minimized
through the RLO algorithm. It is noted that the only available information is that V
is a function of the x and y coordinates; its explicit form and its characteristics are
totally unknown. The minimum value of the objective function is not known either.
RLO is thus a suitable algorithm for such an optimization problem. Fig. 2.12 (c)
shows the optimized sampled images us-
ing RLO-directed microsaccades. The optimized feature samples are closer to the
ideal feature samples, which will help the processing of the face image.
After the featured images are obtained through RLO-directed microsaccades,
these low-resolution images, instead of the entire high-resolution face image, are
sent to the SOWTAC network for further processing or recognition.
2.4 Conclusions
In this chapter, a novel and efficient optimization algorithm is presented for the
problems in which the objective functions are unknown. The search particle is able
to build the model of successful actions and choose its future action based on the
past exploring experience. The decisions on the step sizes (and directions) are made
based on a trade-off between exploitation of the known search path and exploration
for the improved search direction. In this sense, this algorithm falls into a category
of reinforcement learning based optimization (RLO) methods. The algorithm does
not require any prior knowledge of the objective function, nor does it require any
characteristics of such function. It is conceptually very simple and intuitive as well
as very easy to implement and tune.
The optimization algorithm was tested and verified using several multi-variable
functions and compared with several other widely used random search optimization methods.
References
1. Arfken, G.: Lagrange Multipliers, 3rd edn. §17.6 in Mathematical Methods for Physi-
cists, pp. 945–950. Academic Press, Orlando (1985)
2. Belur, S.: A random search method for the optimization of a function of
n variables. MATLAB central file exchange, http://www.mathworks.com/
matlabcentral/fileexchange/loadFile.do?objectId=100
3. Cassin, B., Solomon, S.: Dictionary of Eye Terminology. Triad Publishing Company,
Gainesville (1990)
4. Detecting a Cell Using Image Segmentation. Image Processing Toolbox, the Mathworks,
http://www.mathworks.com/products/image/demos.html
5. Dixon, L.C.W., Szego, G.P.: The optimization problem: An introduction. Towards Global
Optimization II. North Holland, New York (1978)
6. Nelder, J.A., Mead, R.: A simplex method for function minimization. The Computer
Journal 7, 308–313 (1965)
7. Facegen Modeller. Singular Inversions,
http://www.facegen.com/products.htm
8. del Toro Garcia, X., Neri, F., Cascella, G.L., Salvatore, N.: A surrogate associated
Hooke-Jeeves algorithm to optimize the control system of a PMSM drive. IEEE ISIE,
347–352 (July 2006)
9. Hill, D.L.G., Batchelor, P.: Registration methodology: concepts and algorithms. In: Hajnal,
J.V., Hill, D.L.G., Hawkes, D.J. (eds.) Medical Image Registration. CRC, Boca Raton (2001)
10. Hooke, R., Jeeves, T.A.: Direct search solution of numerical and statistical problems.
Journal of the Association for Computing Machinery 8, 212–229 (1961)
11. Kennedy, J., Eberhart, R.C.: Particle swarm optimization. In: Proc. IEEE Int. Conf. Neu-
ral Networks, Perth, Australia, December 1995, vol. 4, pp. 1942–1948 (1995)
12. Kim, H., Pang, S., Je, H.: Support vector machine ensemble with bagging. In: Lee, S.-W.,
Verri, A. (eds.) SVM 2002. LNCS, vol. 2388, p. 397. Springer, Heidelberg (2002)
13. Kirkpatrick, S., Gelatt Jr., C.D., Vecchi, M.P.: Optimization by simulated annealing. Sci-
ence 220(4598), 671–680 (1983)
14. Leontitsis, A.: Hybrid Particle Swarm Optimization, MATLAB central file ex-
change, http://www.mathworks.com/matlabcentral/fileexchange/
loadFile.do?objectId=6497
15. Lewis, R.M., Torczon, V., Trosset, M.W.: Direct search methods: Then and now. Journal
of Computational and Applied Mathematics 124(1), 191–207 (2000)
16. Li, Y.: Active Vision through Invariant Representations and Saccade Movements. Master
thesis, School of Electrical Engineering and Computer Science, Ohio University (2006)
17. Liu, Y., Starzyk, J.A., Zhu, Z.: Optimized Approximation Algorithm in Neural Networks
without overfitting. IEEE Trans. on Neural Networks 19(4), 983–995 (2008)
18. Lustig, I.J., Marsten, R.E., Shanno, D.F.: Computational Experience with a Primal-Dual
Interior Point Method for Linear Programming. Linear Algebra and its Application 152,
191–222 (1991)
19. Martinez-Conde, S., Macknik, S.L., Hubel, D.H.: The role of fixational eye movements
in visual perception. Nature Reviews Neuroscience 5(3), 229–240 (2004)
20. Ong, Y.-S.: Max-min surrogate-assisted evolutionary algorithm for robust design. IEEE
Trans. on Evolutionary Computation 10(4), 392–404 (2006)
21. Powell, M.J.D.: An efficient method for finding the minimum of a function of several
variables without calculating derivatives. The Computer Journal 7, 155–162 (1964)
22. Fisher, R.A.: Iris Plants Database (July 1988),
http://faculty.cs.byu.edu/˜cgc/Teaching/CS_478/iris.arff
23. Rosenbrock, H.H.: An automatic method for finding the greatest or least value of a func-
tion. The Computer Journal 3, 175–184 (1960)
24. Sheela, B.V.: An optimized step-size random search. Computer Methods in Applied Me-
chanics and Engineering 19(1), 99–106 (1979)
25. Snyman, J.A.: Practical Mathematical Optimization: An Introduction to Basic Optimiza-
tion Theory and Classical and New Gradient-Based Algorithms. Springer, Heidelberg
(2005)
26. Starzyk, J.A.: Motivation in Embodied Intelligence. In: Frontiers in Robotics, Automa-
tion and Control, October 2008, pp. 83–110. I-Tech Education and Publishing (2008),
http://www.intechweb.org/
book.php?%20id=78&content=subject&sid=11
27. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press, Cam-
bridge (1998)
28. Torczon, V.: On the Convergence of Pattern Search Algorithms. SIAM Journal on Opti-
mization 17(1), 1–25 (1997)
29. Vandekerckhove, J.: General simulated annealing algorithm, MATLAB central file ex-
change, http://www.mathworks.com/matlabcentral/fileexchange/
loadFile.do?objectId=10548
30. Ypma, T.J.: Historical development of the Newton-Raphson method. SIAM Re-
view 37(4), 531–551 (1995)
Chapter 3
The Use of Opposition for Decreasing Function
Evaluations in Population-Based Search
3.1 Introduction
Global optimization is concerned with discovering an optimal (minimum or maxi-
mum) solution to a given problem generally within a large search space. In some in-
stances the search space may be simple (i.e. concave or convex optimization can be
used). However, most real-world problems are multi-modal and deceptive [5], which
often causes traditional optimization algorithms to become trapped at local optima.
Many strategies have been developed to overcome this for global optimization in-
cluding, but not limited to simulated annealing [9], tabu search [4], evolutionary
algorithms [7] and swarm intelligence [3].
Some of these methods employ a single solution per iteration methodology
whereby only one solution is generated and successively perturbed towards more
Mario Ventresca · Hamid Reza Tizhoosh
Department of Systems Design Engineering, The University of Waterloo, Ontario, Canada
e-mail: [email protected],[email protected]
Shahryar Rahnamayan
Faculty of Engineering and Applied Science,
The University of Ontario Institute of Technology, Ontario, Canada
e-mail: [email protected]
[Figure: evaluation plotted against the solution space.]
Fig. 3.1 Example of a transformed evaluation function to a monotonic function. The values in X and X̆ are negatively correlated. The original function (not shown) could have been any nonlinear, continuous function
[Figure: evaluation plotted against the solution space.]
Fig. 3.2 Taking f(x) = max(f(x), f(−x)), we see that the possible evaluations in the search space have been halved in the optimal situation of full knowledge. In the general case, the transformed function will have a more desirable mean and lower variance
While not investigated in this chapter we make the observation that successive
applications of different opposite maps will lead to further smoothing of f . For
example,
$$f^2(\theta) = \max\big(f(z = \arg\max_{\theta} f(\theta)),\ f(\breve{z})\big) \qquad (3.2)$$
$$\lim_{i \to \infty} f^i(\theta) = \max\big(f^i(z_i = \arg\max_{\theta_i} f^{i-1}(\theta_i)),\ f(\breve{z}_i)\big) = f(\theta^*) \qquad (3.3)$$
for i > 0 and global optima $f(\theta^*)$. Effectively, this flattens the entire error surface
of f, except at the global optimum (or optima). A more feasible alternative is to use k > 0
transformations, which give reasonable results where $0 \le |f^{k-1} - f^k| < \varepsilon$ does not
diminish greatly.
where x, y, x̆ ∈ A . That is, the distribution max(x, x̆) must be less desirable than the
distribution of i.i.d. random guesses. If this condition is met, then the probability
that the optimal solution (or higher quality) is discovered is higher using opposite
guesses. The goal in developing Φ is to maximize ε and δ. A similar goal is to determine
Φ such that E[g(x, x̆)] is maximized for some distance function g. Satisfying
this condition implies (3.4).
Thus, probabilistically we expect a lower number of function calls to find a solu-
tion of a given threshold quality. If employing this strategy within a guided search
method, the dynamics of the algorithm must be considered to ensure the guarantee
(i.e. the algorithm adds bias to the search, which affects the convergence rate).
Practically, the simplest manner to decide a satisfactory Φ is through intuition
or prior knowledge about f . A possibility is to utilize modified low-discrepancy
sequences (see below), which aim to distribute guesses evenly throughout the search
space.
If $Y_1, Y_2$ are i.i.d. then $\mathrm{var}(\hat{\xi}) = \mathrm{var}(Y_1)/2$. However, if $\mathrm{cov}(Y_1, Y_2) < 0$ then the
variance can be further reduced.
One method to accomplish this is through the use of a monotonic function h.
Then, generate $Y_1$ as an i.i.d. value as before; utilizing h, our two variables are
$h(Y_1)$ and $h(1 - Y_1)$, which are monotonic over the interval [0, 1]. Then
$$\hat{\xi} = \frac{h(Y_1) + h(Y_2)}{2} \qquad (3.6)$$
will be an unbiased estimator of E[ f ].
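For illustration, a small Python sketch of the antithetic-variates estimator of (3.6), assuming U(0, 1) samples and a user-supplied monotonic function h:

```python
import numpy as np

def antithetic_mean(h, n_pairs=10_000, rng=None):
    """Antithetic-variates estimate of E[h(U)], U ~ Uniform(0, 1): pair
    each sample Y1 with 1 - Y1; for monotonic h the pair (h(Y1), h(1-Y1))
    is negatively correlated, reducing the estimator variance (eq. 3.6)."""
    rng = rng or np.random.default_rng()
    y = rng.random(n_pairs)
    return np.mean((h(y) + h(1.0 - y)) / 2.0)

# e.g. estimating E[exp(U)] = e - 1 with negatively correlated pairs
estimate = antithetic_mean(np.exp)
```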
Opposition is similar in its selection of negatively correlated samples. However,
in the antithetic approach there is no guideline to construct such a monotonic func-
tion, although such a function has been proven to exist [17]. Opposition provides
various means to accomplish this, as well as to incorporate the idea into optimiza-
tion while guaranteeing more desirable expected values and lower variance in the
target function.
for 0 ≤ ai < bi ≤ 1.
That is, the actual number of points within each interval for a given sample is
close to the expected number. Such sequences have been widely studied in quasi-
Monte Carlo methods [10].
Opposition may utilize low-discrepancy sequences in some situations, though
in general low-discrepancy sequences are simply a means of attaining a uniform
distribution, without regard to the correlation between the evaluations at these points.
Further, opposition-based techniques simultaneously consider two points in order to
smooth the evaluation function and improve the performance of the sampling algorithm,
whereas quasi-random sequences are often concerned with many more points.
These methods have been applied to evolutionary algorithms, where a performance
study of different sampling methods such as Uniform, Normal, Halton, Sobol, Faure,
and Low-Discrepancy found them to be valuable only for low-dimensional
(d < 10, and so non-highly-sparse) populations [11].
3.3 Algorithms
In this section we describe Differential Evolution (DE) and Population-Based Incremental
Learning (PBIL), which are the
parent algorithms for this study. We also describe the oppositional variants of these
methods.
where d is the problem dimension, and $a_j$ and $b_j$ are the lower and upper boundaries
of variable j, respectively; rand(0, 1) is a uniformly generated random number
in [0, 1].
Assume $X_{i,t}$ ($i = 1, 2, \dots, N_p$) are candidate solution vectors in generation t,
where $N_p$ is the population size. Successive populations are generated by adding the
weighted difference of two randomly selected vectors to a third randomly selected
vector. For classical DE (DE/rand/1/bin), the mutation, crossover, and selection
operators are straightforwardly defined as follows:
Mutation: For each vector Xi,t in generation t a mutant vector Vi,t is defined by
where $i \in \{1, 2, \dots, N_p\}$ and a, b, and c are mutually different random integer indices
selected from $\{1, 2, \dots, N_p\}$. Further, i, a, b, and c are mutually distinct, so
$N_p \ge 4$ is necessary. $F \in [0, 2]$ is a real constant which determines the amplification of
the added differential variation of Xc,t − Xb,t . Larger values for F result in higher
diversity in the generated population and lower values lead to faster convergence.
Crossover: By shuffling competing solution vectors DE utilizes the crossover op-
eration to generate new solutions and also to increase the population diversity. For
the classical DE (DE/rand/1/bin), the binary crossover (shown by ‘bin’ in the no-
tation) is utilized. It defines the following trial vector:
where
$$U_{ji,t} = \begin{cases} V_{ji,t} & \text{if } \mathrm{rand}_j(0,1) \le C_r \ \vee\ j = k, \\ X_{ji,t} & \text{otherwise,} \end{cases} \qquad (3.12)$$
for Cr ∈ (0, 1) the predefined crossover rate, and rand j (0, 1) is the jth evaluation of
a uniform random number generator. k ∈ {1, 2, ..., d} is a random parameter index,
chosen once for each i to make sure that at least one parameter is always selected
from the mutated vector, V ji,t . Most popular values for Cr are in the range of (0.4, 1)
[14].
Selection: This decides which vector (Ui,t or Xi,t ) should be a member of next (new)
generation, t + 1. For a minimization problem, the vector with the lower value of
objective function is chosen (greedy selection).
This evolutionary cycle (i.e., mutation, crossover, and selection) is repeated N p (pop-
ulation size) times to generate a new population. These successive generations are
produced until meeting the predefined termination criteria.
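A compact Python sketch of one DE/rand/1/bin generation, following the mutation, crossover and selection operators described above (boundary handling is omitted for clarity):

```python
import numpy as np

def de_generation(pop, fitness, f, F=0.9, Cr=0.9, rng=None):
    """One generation of classical DE/rand/1/bin: mutant V = X_a + F*(X_c
    - X_b), binary crossover with forced index k (eq. 3.12), and greedy
    selection between the trial vector and its parent."""
    rng = rng or np.random.default_rng()
    n_p, d = pop.shape
    new_pop, new_fit = pop.copy(), fitness.copy()
    for i in range(n_p):
        a, b, c = rng.choice([j for j in range(n_p) if j != i],
                             size=3, replace=False)
        v = pop[a] + F * (pop[c] - pop[b])    # mutant vector V_i
        k = rng.integers(d)                   # ensures one mutated parameter
        mask = rng.random(d) <= Cr
        mask[k] = True
        u = np.where(mask, v, pop[i])         # trial vector U_i
        fu = f(u)
        if fu < fitness[i]:                   # greedy selection (minimization)
            new_pop[i], new_fit[i] = u, fu
    return new_pop, new_fit
```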
$$OP_{i,j} = a_j + b_j - P_{i,j}, \qquad i = 1, 2, \dots, N_p;\ j = 1, 2, \dots, D,$$
where $P_{i,j}$ and $OP_{i,j}$ denote the jth variable of the ith vector of the population and
the opposite population, respectively.
3. Selecting the N p fittest individuals from {P ∪ OP} as the initial population.
The general ODE scheme also employs generation jumping, but it has not been used
in this work; only population initialization and sample generation are employed.
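A minimal sketch of this opposition-based initialization (the helper name opposition_init is ours):

```python
import numpy as np

def opposition_init(f, a, b, n_p, rng=None):
    """Opposition-based population initialization: random population P,
    opposite population OP[i, j] = a[j] + b[j] - P[i, j], and the Np
    fittest individuals of the union P ∪ OP."""
    rng = rng or np.random.default_rng()
    a, b = np.asarray(a, float), np.asarray(b, float)
    P = a + (b - a) * rng.random((n_p, a.size))
    OP = a + b - P                            # opposite population
    union = np.vstack([P, OP])
    vals = np.apply_along_axis(f, 1, union)
    return union[np.argsort(vals)[:n_p]]      # keep the Np fittest
```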
$$M_t = (1 - \alpha) M_{t-1} + \alpha B^* \qquad (3.14)$$
where $0 < \alpha < 1$ is the learning rate and $t \ge 1$ is the iteration. Initially,
$m_{i,j} = 0.5$ to reflect the lack of prior information.
To abstract the crossover and mutation operators of evolutionary computation,
PBIL employs a randomization of M. Let 0 < β , γ < 1 be the probability of mutation
and degree of mutation, respectively. Then with probability β
8: {Update M}
9: Mt = (1 − α )Mt−1 + α B∗
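A sketch of the probability-matrix update of (3.14) with the mutation abstraction is given below. Since the mutation step is only partially stated in the text, the form used here (with probability β, shift an entry by degree γ toward a random 0/1 target) is an assumption modeled on Baluja's original PBIL [2].

```python
import numpy as np

def pbil_update(M, best, alpha=0.35, beta=0.15, gamma=0.25, rng=None):
    """PBIL update: pull M toward the best sample B* (eq. 3.14), then
    apply the mutation abstraction entry-wise (assumed form, after [2])."""
    rng = rng or np.random.default_rng()
    M = (1.0 - alpha) * M + alpha * best        # eq. (3.14)
    mutate = rng.random(M.shape) < beta         # mutation probability beta
    target = rng.integers(0, 2, M.shape)        # random 0/1 direction
    M[mutate] = (1.0 - gamma) * M[mutate] + gamma * target[mutate]
    return M
```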
degree of opposition with respect to the number of iterations. That is, as the number
of iterations t → ∞, the amount by which two opposite solutions differ approaches 1 bit
(w.r.t. Hamming distance).
Sampling is accomplished using an opposite guessing strategy whereby half of
the population $R_1$ is generated using the probability matrix M and the other half is
generated via a change in Hamming distance to a given element of $R_1$. The distance
is calculated using an exponentially decaying function, where l is the maximum
number of bits in a guess and c < 0 is a user-defined constant.
Updating of M is performed in lines 14-28, either if a new global best solution has been
discovered (i.e. η = B*), or, with probability $p_{amp}$, using the sample best solution
to focus the search. When no new optima have been discovered this
strategy tends away from B*. The actual update is performed in line 16 and is based
on a reinforcement learning update using the sample best solution. The degree to
which M is updated is controlled by the user-defined parameter $0 < \rho < 1$.
Should the above criteria for an update fail, a decay of M with probability $p_{decay}$
is attempted in line 17. The decay, performed in lines 21-27, slowly tends M away
from B*. This portion of the algorithm is intended to prevent premature convergence
and to aid exploration through small, smooth updates. The parameter $0 < \tau < 1$
is user defined, where often $\tau \ll \rho$.
The equations in lines 11 and 12 were determined experimentally and no argument
regarding their optimality is provided. Indeed, there likely exist many functions
which would yield more desirable results. These were chosen because they tend
to lead to good behavior and outcomes of PBIL.
3: for t = 1 to ω do
4:   {Generate samples}
5:   R1 = generate_samples(k/2, M)
6:   R̆1 = generate_opposites(R1)
Many general-purpose segmentation algorithms are histogram based: they aim
to discover a deep valley between two peaks and set ω equal to that value.
However, many real-world problems have multimodal histograms, and deciding
which value (i.e. valley) corresponds to the best thresholding is not obvious. The
difficulty is compounded by the fact that the relative size of the peaks may be large
(so the valley becomes hard to distinguish) or the valleys could be very broad. Several
algorithms have been proposed to overcome this [33]. Other methods based on
information theory and other statistical methods have been proposed as well [13].
Typically, the problem of segmentation involves a high degree of uncertainty
which makes solving the problem difficult. Stochastic searches such as evolution-
ary algorithms and population-based incremental learning often cope well with un-
certainty in optimization, hence they provide an interesting alternative approach to
traditional methods.
The main difficulty associated with the use of population-based methods is that
they are computationally expensive due to the large amount of function calls re-
quired during the optimization process. One approach to minimizing uncertainty
is by splitting the image into subimages which (hopefully) have characteristics al-
lowing for an easy segmentation. Combining the subimages together then forms
the entire segmented image, although this requires extra function calls to analyze
each subimage. An important caveat is that the local image may represent a good
segmentation, but may not be useful with respect to the image as a whole.
In this chapter we investigate thresholding with population-based techniques. Using
ODE we do not perform any splitting into subimages, while for OPBIL we split I
into four equal-sized subregions, each having its own threshold value. In both cases
we require a single evaluation to perform the segmentation and we show that the
opposition-based techniques reduce the required number of function calls.
As stated above, there exist many different segmentation algorithms. Further, numerous
methods for evaluating the quality of a segmentation have also been put
forth [34]. In this paper we use a simple method which aims to minimize the dis-
crepancy between the original M × N gray-level image I and its thresholded image
T [31]:
$$\sum_{i=1}^{M} \sum_{j=1}^{N} |I_{i,j} - T_{i,j}| \qquad (3.17)$$
where | · | represents the absolute value operation. Using different evaluations will
change the outcome of the algorithm; however, segmentation in this manner
nonetheless remains computationally expensive.
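For a single global threshold ω, the evaluation (3.17) might be computed as in the sketch below; the 0/255 coding of the thresholded image T is an assumption, as the chapter does not state the coding explicitly.

```python
import numpy as np

def threshold_discrepancy(image, omega):
    """Discrepancy between a gray-level image I and its thresholded
    version T (eq. 3.17); pixels below omega map to 0, others to 255
    (assumed coding)."""
    T = np.where(image < omega, 0, 255)
    return int(np.abs(image.astype(int) - T).sum())
```

Minimizing this single-call evaluation over ω (or over one ω per subregion, in the OPBIL setting) is the optimization problem handed to the population-based algorithms.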
We use the images shown in Figure 3.3 to evaluate the algorithms. The first column
represents the original image, the second the gold standard, and the third column corresponds
to the approximate target image for ODE and OPBIL (i.e. the value-to-reach targets,
discussed below). We show the gold image for completeness; it is not required in
the experiments.
Both experiments employ a value-to-reach (VTR) stopping criterion which measures
the time or function calls required to reach a specific value. The VTR values
have been experimentally determined and are given in the following table. Due to
Fig. 3.3 The images used to benchmark the algorithms. The first column is the original gray-level image, the second is the gold standard and the third column is the target image of the optimization within the required function calls
the respective algorithms' abilities to solve this problem, and given the representation and
convergence behavior, these values differ for ODE and OPBIL.
ODE Settings
Parameter Value
Population size Np = 5
Amplification factor F = 0.9
Crossover probability Cr = 0.9
Mutation strategy DE/rand/1/bin
Maximum function calls MAXNFC = 200
Jumping rate (no jumping) Jr = −1
In order to maintain a reliable and fair comparison, these settings are kept un-
changed for all conducted experiments for both DE and ODE algorithms.
OPBIL Settings
Parameter Value
Maximum iterations t = 150
Sample size k = 24
PBIL Only
Learning rate α = 0.35
Mutation probability β = 0.15
Mutation degree γ = 0.25
OPBIL Only
Update frequency control b = 0.1
Learning rate ρ = 0.25
Probability decay τ = 0.0005
3.5.1 ODE
Table 3.4 presents a summary of the results obtained regarding function calls for
ODE versus DE. Except for image 2, we show a decrease in function calls for
all images. Images 4 and 5 have statistically significant improvements with respect
to the decreased number of function calls, using a t-test at the 0.9 confidence level.
Further, except for image 6, we show a lower standard deviation, which indicates
higher reliability of the results.
Table 3.4 Summary results for DE vs. ODE with respect to required function calls. μ and σ
correspond to the mean and standard deviation of the subscripted algorithm, respectively
3.5.2 OPBIL
Table 3.5 shows the expected number of iterations (each iteration has 24 function
calls) to attain the value-to-reach given in Table 3.1. In all cases OPBIL reaches its
goal in fewer iterations than PBIL, where the results for images 2, 5 and 6 are found
to be statistically significant using a t-test at the 0.9 confidence level. Additionally, in
all cases we find a lower standard deviation, indicating a more reliable behavior for
OPBIL.
Overall we save 444 − 347 = 97 iterations using OPBIL, an average of 16 iterations
(16 × 24 = 384 function calls) per image. The approximate savings factor is 444/347 ≈ 1.28,
which is about a 28% improvement in required iterations.
Table 3.5 Summary results for PBIL vs. OPBIL with respect to required iterations.
μ and σ correspond to the mean and standard deviation of the subscripted algorithm,
respectively
In the following we analyze the correlation and distance for each sample per
iteration. This is to examine whether the negative correlation and larger distance
properties between a guess and its opposite are found in the sample. If true, we have
supported (although not confirmed) the hypothesis that the observed improvements
can be due to these characteristics.
Figure 3.4 shows the averaged correlation $\mathrm{cor}(R_1^t, \breve{R}_1^t)$ for randomly generated
$R_1^t$ and respective opposites $\breve{R}_1^t$ at iteration t. The solid line corresponds to OPBIL
and the dotted line to PBIL. These plots show that OPBIL indeed has a
lower correlation (with respect to the evaluation function) than PBIL (where we generate
$R_1$ as above, and let $\breve{R}_1$ also be randomly generated). In all cases the correlation
is much stronger for PBIL (noting that if the algorithm reaches the VTR then we set
the correlation to 0).
[Six panels, Image 1 through Image 6, plotting correlation against iterations.]
Fig. 3.4 Sample mean correlation over 30 trials for PBIL (dotted) versus OPBIL (solid). We find OPBIL indeed yields a lower correlation than PBIL
$$\bar{g} = \frac{2}{k} \sum_{i=1}^{k/2} g(R_{1,i}^t, \breve{R}_{1,i}^t) \qquad (3.18)$$
which computes the fitness-distance between the ith guess $R_{1,i}$ and its opposite $\breve{R}_{1,i}$
at iteration t, shown in Figure 3.5. The distance for PBIL is relatively low
throughout the 150 iterations, gently decreasing as the algorithm converges. However,
as a consequence of OPBIL's ability to maintain diversity, the distance between
samples increases during the early stages of the search and then rapidly decreases.
Indeed, this implies the lower correlation shown above.
[Six panels, Image 1 through Image 6, plotting distance against iterations.]
Fig. 3.5 Sample mean distance over 30 trials for samples of PBIL (dotted) versus OPBIL (solid). We find OPBIL indeed yields a higher distance between paired samples than PBIL
The final test is to examine the standard deviation (w.r.t. evaluation function) of
the distance between samples, given in Figure 3.6. Both algorithms have similarly
formed plots with respect to this measure, reflecting the convergence rate of the
respective algorithms. It seems as though the use of opposition aids by infusing
diversity early during the search and quickly focusing once a high-quality optimum is
found. Conversely, the basic PBIL does not include this bias, and therefore its convergence
is less rapid.
[Six panels, Image 1 through Image 6, plotting the standard deviation against iterations.]
Fig. 3.6 Sample mean standard deviations over 30 trials for samples of PBIL (dotted) versus OPBIL (solid)
3.6 Conclusion
In this chapter we have discussed the application of opposition-based computing
techniques to reducing the required number of function calls. Firstly, a brief
introduction to the underlying concepts of opposition was given, along with the conditions
under which opposition-based methods should be successful. A comparison to the
similar concepts of antithetic variates and quasi-random/low-discrepancy sequences
made the uniqueness of our method apparent.
Two recently proposed algorithms, ODE and OPBIL, were briefly introduced, and
the manner in which opposition is used to improve their parent algorithms, DE and
PBIL, was described. The manner in which opposites are used differs between the two,
but the underlying concepts are the same.
Using the expensive optimization problem of image thresholding as a benchmark,
we examined the ability of ODE and OPBIL to lower the number of function calls
required to reach the pre-specified target value. It was found that both algorithms reduce the
expected number of function calls, ODE by approximately 16% (function calls)
and OPBIL by 28% (iterations), respectively. Further, concentrating on OPBIL, we
showed the hypothesized lower correlation and higher fitness-distance measures for a
quality opposite mapping.
Our results are very promising; however, future work is required in various regards.
Firstly, a stronger theoretical basis for opposition and for choosing opposite mappings
is needed. This could possibly lead to general implementation strategies
when no prior knowledge is available. Further application to different real-world
problems is also desired.
Acknowledgements
This work has been partially supported by Natural Sciences and Engineering Research Coun-
cil of Canada (NSERC).
References
1. Bai, F., Wu, Z.: A novel monotonization transformation for some classes of global opti-
mization problems. Asia-Pacific Journal of Operational Research 23(3), 371–392 (2006)
2. Baluja, S.: Population Based Incremental Learning - A Method for Integrating Genetic
Search Based Function Optimization and Competitive Learning. Tech. rep., Carnegie
Mellon University, CMU-CS-94-163 (1994)
3. Engelbrecht, A.: Fundamentals of Computational Swarm Intelligence. Wiley, Chichester
(2005)
4. Glover, F., Laguna, M.: Tabu Search. Kluwer, Dordrecht (1997)
5. Goldberg, D.E., Horn, J., Deb, K.: What makes a problem hard for a classifier system?
Tech. rep. In: Collected Abstracts for the First International Workshop on Learning Clas-
sifier Systems (IWLCS 1992), NASA Johnson Space (1992)
6. Niederreiter, H.: Random Number Generation and Quasi-Monte Carlo Methods. Society
for Industrial and Applied Mathematics (1992)
7. Holland, J.H.: Adaptation in Natural and Artificial Systems. The University of Michigan
Press (1975)
8. Price, K., Storn, R., Lampinen, J.A.: Differential Evolution: A Practical Approach to
Global Optimization. Springer, Heidelberg (2005)
9. Kirkpatrick, S., Gelatt, C.D., Vecchi, M.P.: Optimization by Simulated Annealing. Sci-
ence 220(4598), 671–680 (1983)
10. Lemieux, C.: Monte Carlo and Quasi-Monte Carlo Sampling. Springer, Heidelberg
(2009)
11. Maaranen, H., Miettinen, K., Penttinen, A.: On initial populations of genetic algorithms
for continuous optimization problems. Journal of Global Optimization 37(3), 405–436
(2007)
12. Montgomery, J., Randall, M.: Anti-pheromone as a tool for better exploration of search
space. In: Third International Workshop, ANTS, pp. 1–3 (2002)
13. O'Gorman, L., Sammon, M., Seul, M. (eds.): Practical Algorithms for Image Analysis.
Cambridge University Press, Cambridge (2008)
14. Rahnamayan, S., Tizhoosh, H.R., Salama, M.M.A.: Opposition-based differential evolu-
tion. IEEE Transactions on Evolutionary Computation 12(1), 64–79 (2008)
15. Rahnamayan, S., Tizhoosh, H.R., Salama, S.: Opposition-based Differential Evolution
Algorithms. In: IEEE Congress on Evolutionary Computation, pp. 7363–7370 (2006)
16. Rahnamayan, S., Tizhoosh, H.R., Salama, S.: Opposition-based Differential Evolution
Algorithms for Optimization of Noisy Problems. In: IEEE Congress on Evolutionary
Computation, pp. 6756–6763 (2006)
17. Rubinstein, R.: Monte Carlo Optimization, Simulation and Sensitivity of Queueing Net-
works. Wiley, Chichester (1986)
18. Sebag, M., Ducoulombier, A.: Extending Population-Based Incremental Learning to
Continuous Search Spaces. In: Eiben, A.E., Bäck, T., Schoenauer, M., Schwefel, H.-P.
(eds.) PPSN 1998. LNCS, vol. 1498, pp. 418–427. Springer, Heidelberg (1998)
19. Shapiro, J.: Diversity loss in general estimation of distribution algorithms. In: Parallel
Problem Solving in Nature IX, pp. 92–101 (2006)
20. Shokri, M., Tizhoosh, H.R., Kamel, M.: Opposition-based Q(lambda) Algorithm. In:
IEEE International Joint Conference on Neural Networks, pp. 646–653 (2006)
21. Storn, R., Price, K.: Differential evolution- a simple and efficient heuristic for global
optimization over continuous spaces. Journal of Global Optimization 11, 341–359 (1997)
22. Tizhoosh, H.R.: Reinforcement Learning Based on Actions and Opposite Actions. In:
International Conference on Artificial Intelligence and Machine Learning (2005)
23. Tizhoosh, H.R.: Opposition-based Reinforcement Learning. Journal of Advanced Com-
putational Intelligence and Intelligent Informatics 10(4), 578–585 (2006)
24. Tizhoosh, H.R., Ventresca, M. (eds.): Oppositional Concepts in Computational Intelli-
gence. Springer, Heidelberg (2008)
25. Toh, K.: Global optimization by monotonic transformation. Computational Optimization
and Applications 23(1), 77–99 (2002)
26. Ventresca, M., Tizhoosh, H.R.: Improving the Convergence of Backpropagation by Op-
posite Transfer Functions. In: IEEE International Joint Conference on Neural Networks,
pp. 9527–9534 (2006)
27. Ventresca, M., Tizhoosh, H.R.: Opposite Transfer Functions and Backpropagation
Through Time. In: IEEE Symposium on Foundations of Computational Intelligence, pp.
570–577 (2007)
28. Ventresca, M., Tizhoosh, H.R.: Simulated Annealing with Opposite Neighbors. In: IEEE
Symposium on Foundations of Computational Intelligence, pp. 186–192 (2007)
29. Ventresca, M., Tizhoosh, H.R.: A diversity maintaining population-based incremental
learning algorithm. Information Sciences 178(21), 4038–4056 (2008)
30. Ventresca, M., Tizhoosh, H.R.: Numerical condition of feedforward networks with op-
posite transfer functions. In: IEEE International Joint Conference on Neural Networks,
pp. 3232–3239 (2008)
31. Weszka, J., Rosenfeld, A.: Threshold evaluation techniques. IEEE Transactions on Sys-
tems, Man and Cybernetics 8(8), 622–629 (1978)
32. Wu, Z., Bai, F., Zhang, L.: Convexification and concavification for a general class of
global optimization problems. Journal of Global Optimization 31(1), 45–60 (2005)
33. Yoo, T. (ed.): Insight into Images: Principles and Practice for Segmentation, Registration,
and Image Analysis. AK Peters (2004)
34. Zhang, H., Fritts, J., Goldman, S.: Image segmentation evaluation: A survey of unsuper-
vised methods. Computer Vision and Image Understanding 110, 260–280 (2008)
Chapter 4
Search Procedure Exploiting Locally
Regularized Objective Approximation:
A Convergence Theorem for Direct Search
Algorithms
Marek Bazan
4.1 Introduction
Optimization processes with objective functions that are expensive to evaluate –
since their evaluation usually requires solving a large system of linear equations
or simulating some physical process – occur in many fields of modern design. The
main strategy for speeding up such processes by constructing a model that approximates
the objective function is trust region methods [4]. The application of radial
basis function approximation as an approximation model in trust region methods
was discussed in [13]. The standard method to prove the convergence of a trust region
method is the method of sufficient decrease.
In [1] and [2] we presented the search procedure which can be viewed as an
alternative to trust region methods. It relies on combining the direct search algo-
rithm EXTREM [6] with the locally regularized radial basis approximation. The
Marek Bazan
Institute of Informatics, Automatics and Robotics, Department of Electronics,
Wrocław University of Technology, ul. Janiszewskiego 11/17, 50-372 Wrocław, Poland
e-mail: [email protected]
Procedure: The Search Procedure Exploiting Locally Regularized Objective Approximation

Input:
f : R^d → R – the objective function,
x_0 ∈ R^d – a starting point,
ε > 0 – prescribed accuracy of the approximation,
f̃(·) – a radial basis function approximation of f(·),
I_s – number of initial steps performed by the direct optimization algorithm A,
N < I_s – size of the dataset used to construct the approximation f̃(·),
ε-check – a procedure to check the conditions required by the convergence theorem to hold.

0. Perform I_s initial steps of the algorithm A.
1. In the k-th step, generate the point x_k for which the function value is to be evaluated using algorithm A.
2. Generate a set Z from the N nearest points for which the function was evaluated directly.
3. Judge whether a reliable approximation of f(x_k) can be constructed for the point x_k.
   a. If the point x_k is located in a reliable region of the search domain, construct the approximation f̃ and evaluate f̃(x_k).
   b. If the approximation f̃(x_k) was correctly constructed, perform an ε-check.
   c. If the ε-check is positive (i.e. the procedure returns true), substitute f(x_k) ← f̃(x_k) into the algorithm A.
   d. Else, evaluate f(x_k) directly.
4. If the stopping criterion of algorithm A is satisfied, stop. Else set k := k + 1 and go to 1.
The reliability judgment of step 3 is discussed in section 4.5, whereas the ε-check procedure from step 3.c is associated with the
convergence theorem and will be discussed in section 4.4.
$$x_{k+1} = x_k + \tau_k d_k$$
where $d_k$ is a search direction and $\tau_k$ is a search step. The k-th iteration of such
algorithms can be viewed as a composite algorithmic transformation $A = M^1 D$, where
$D : \mathbb{R}^d \to \mathbb{R}^d \times \mathbb{R}^d$ is the search-direction-generating transform
$D(x) = (x, d).$
b. if point x is a solution, then either the algorithm finishes or, for any y ∈ A(x),
Z(y) ≤ Z(x).
Now we additionally need two lemmas from [21] concerning the closedness of a
composition of closed transforms.
$$A = R\, M^* D_d\, M^* D_{d-1} \cdots M^* D_2\, M^* D_1$$
where $D_i$ chooses the i-th direction from the orthogonal direction base of the k-th
iteration, and R is an orthogonalization step producing a new base along the direction
$\overline{x_0^{k-1} x_d^{k-1}}$ from the (k−1)-th iteration.
$$f(x_k + \tau_k d_k) \le f(x_k) - \Delta, \qquad (4.2)$$
or
$$f(y_k) - f(x_k + \tau d_k) \le \Delta. \qquad (4.3)$$
Note that if (4.3) is not satisfied then (4.2) is satisfied.
Since f is continuous, in the limit we get
the following. Because, for any τ, (4.4) or (4.5) is fulfilled, for any point
$y^* \in M^*(x^\infty, d^\infty)$ we have
$$f(y^\infty) < f(y^*), \qquad (4.6)$$
or
$$f(y^\infty) - f(y^*) \le \Delta. \qquad (4.7)$$
On the other hand, at a point $y^* \in M^*(x^\infty, d^\infty)$ the function f attains its least
value for $\tau \in J$:
$$y^\infty = x^\infty + \tau^\infty d^\infty, \qquad \tau^\infty \in J,$$
or,
$$f(y^\infty) < f(x^\infty) - \Delta, \qquad (4.8)$$
therefore
$$f(y^*) - f(y^\infty) \le \Delta. \qquad (4.9)$$
Comparing (4.8) and (4.9) with (4.6) and (4.7) we get the result
$y^\infty \in M^*(x^\infty, d^\infty)$.
where the sequence of new orthogonal vectors is uniquely defined by the orthogonalization
of the vectors $w_0, w_1, \dots, w_{d-1}$:
$$\begin{aligned}
w_0 &= s_0 d_0 + s_1 d_1 + \dots + s_{d-1} d_{d-1} \\
w_1 &= \phantom{s_0 d_0 + {}} s_1 d_1 + \dots + s_{d-1} d_{d-1} \\
&\;\;\vdots \\
w_{d-1} &= \phantom{s_0 d_0 + s_1 d_1 + \dots + {}} s_{d-1} d_{d-1}
\end{aligned}$$
where the scalars $s_0, s_1, \dots, s_{d-1}$ correspond to the step sizes in all directions in
step k − 1. The transformation R is uniquely defined, without any conditions on the
scalars $s_0, s_1, \dots, s_{d-1}$, as long as the orthogonalization is performed using the
algorithm presented in [15]; in this case it is also a continuous function. Finally,
since the transformation A is a composition of the closed transformation $M^*$ with the
continuous functions $D_i$ ($i = 0, \dots, d-1$) and R, the assumptions of Lemma 1 are
satisfied and the transformation A is closed. This proves that assumption 3 of
Theorem 1 holds for the unperturbed $M^*$. In the following subsection we show that the
transformation $M^*$ can be realized by the perturbed transformation $M^1$.
For ζ ∈ T with $\zeta_1 < \zeta_2 < \zeta_3$, the minimum of the quadratic interpolating the points
$(\zeta_1, f(\zeta_1))$, $(\zeta_2, f(\zeta_2))$, $(\zeta_3, f(\zeta_3))$ equals $\lambda^*(\zeta)$.
Then the set of admissible replacement triplets A(ζ) is the set of candidate triplets
that may replace ζ ∈ T, defining a smaller interval containing $\hat\lambda$ in the next iteration
of the algorithm. $A_0(\zeta)$ is defined as
$$A_0(\zeta) := T \cap \{u_1(\zeta), u_2(\zeta), u_3(\zeta), u_4(\zeta)\}, \quad \text{where} \quad
\begin{aligned}
u_1(\zeta) &= (\zeta_1, \lambda^*(\zeta), \zeta_2), \\
u_2(\zeta) &= (\zeta_2, \lambda^*(\zeta), \zeta_3), \\
u_3(\zeta) &= (\lambda^*(\zeta), \zeta_2, \zeta_3), \\
u_4(\zeta) &= (\zeta_1, \zeta_2, \lambda^*(\zeta)).
\end{aligned} \qquad (4.10)$$
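For reference, the unperturbed λ*(ζ) is the standard minimizer of the quadratic through the three points, which the perturbed expressions below modify by a factor (1 + |ε|) on one of the function values. A direct Python transcription of the standard formula:

```python
def quad_min(z1, z2, z3, f1, f2, f3):
    """Minimizer of the quadratic interpolating (z1, f1), (z2, f2),
    (z3, f3); this is the unperturbed lambda*(zeta) used to form the
    candidate triplets u_k."""
    num = (z3**2 - z2**2) * f1 - (z3**2 - z1**2) * f2 + (z2**2 - z1**2) * f3
    den = (z3 - z2) * f1 - (z3 - z1) * f2 + (z2 - z1) * f3
    return 0.5 * num / den
```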
The crucial assumption on the perturbation we introduce into the triplet used
to construct the quadratic is that we allow a perturbation in only one of the
three points. Without loss of generality, assume that the minimum of the
perturbed quadratic is constructed only for ζ such that $\zeta_1 < \zeta_2 < \zeta_3$.
For a perturbation of the value of the function f at level ε > 0 we define
three sets of triplets:
$$\tilde\lambda_1^*(\varepsilon; \zeta) = \frac{1}{2}\,\frac{(\zeta_3^2 - \zeta_2^2) f(\zeta_1)(1 + |\varepsilon|) - (\zeta_3^2 - \zeta_1^2) f(\zeta_2) + (\zeta_2^2 - \zeta_1^2) f(\zeta_3)}{(\zeta_3 - \zeta_2) f(\zeta_1)(1 + |\varepsilon|) - (\zeta_3 - \zeta_1) f(\zeta_2) + (\zeta_2 - \zeta_1) f(\zeta_3)},$$
$$\tilde\lambda_2^*(\varepsilon; \zeta) = \frac{1}{2}\,\frac{(\zeta_3^2 - \zeta_2^2) f(\zeta_1) - (\zeta_3^2 - \zeta_1^2) f(\zeta_2)(1 + |\varepsilon|) + (\zeta_2^2 - \zeta_1^2) f(\zeta_3)}{(\zeta_3 - \zeta_2) f(\zeta_1) - (\zeta_3 - \zeta_1) f(\zeta_2)(1 + |\varepsilon|) + (\zeta_2 - \zeta_1) f(\zeta_3)},$$
$$\tilde\lambda_3^*(\varepsilon; \zeta) = \frac{1}{2}\,\frac{(\zeta_3^2 - \zeta_2^2) f(\zeta_1) - (\zeta_3^2 - \zeta_1^2) f(\zeta_2) + (\zeta_2^2 - \zeta_1^2) f(\zeta_3)(1 + |\varepsilon|)}{(\zeta_3 - \zeta_2) f(\zeta_1) - (\zeta_3 - \zeta_1) f(\zeta_2) + (\zeta_2 - \zeta_1) f(\zeta_3)(1 + |\varepsilon|)}$$
where
$$\begin{aligned}
\tilde u_{l1}(\varepsilon; \zeta) &= (\zeta_1, \tilde\lambda_l^*(\varepsilon; \zeta), \zeta_2), \\
\tilde u_{l2}(\varepsilon; \zeta) &= (\zeta_2, \tilde\lambda_l^*(\varepsilon; \zeta), \zeta_3), \\
\tilde u_{l3}(\varepsilon; \zeta) &= (\tilde\lambda_l^*(\varepsilon; \zeta), \zeta_2, \zeta_3), \\
\tilde u_{l4}(\varepsilon; \zeta) &= (\zeta_1, \zeta_2, \tilde\lambda_l^*(\varepsilon; \zeta)),
\end{aligned} \qquad (4.14)$$
with $l \in \{1, 2, 3\}$.
Lemma 4. Let S denote the set of stationary points of the function f,
where $\tilde f_l(\cdot) = f(\cdot)$ or $\tilde f_l(\cdot) = f(\cdot)(1 + |\varepsilon|)$, depending on whether the value at the point $\zeta_l$
was exact or perturbed.
Proof. Let us introduce the following notation: $\tilde f_l(x) = f(x)$ when evaluated at points
$x \ne \zeta_l$, and $\tilde f_l(x) = f(x)(1 + \varepsilon)$ when evaluated at the point $x = \zeta_l$. First note
that $T_l(\varepsilon) \subset T$ for $l \in \{1, 2, 3\}$ and $\varepsilon > 0$.
1. Let ζ = (ζ1 , ζ2 , ζ3 ) ∈ T be fixed. If f (ζ1 ), f (ζ2 ) and f (ζ3 ) are computed without perturbation, then A(ζ ) is not empty by the proof in [14]. Suppose now that a function value was approximated with relative error equal to ε . The minimum of the quadratic constructed in this case will be λ̃l∗ (ε ; ζ ), where l ∈ {1, 2, 3} depends on the point at which the function value is approximated. Let us consider the case λ̃l∗ (ε ; ζ ) ∈ [ζ1 , ζ2 ], providing moreover that the minimum λ ∗ (ζ ), obtained as if the function were evaluated without any perturbation, also belongs to [ζ1 , ζ2 ]. Then A(ζ ) is empty if and only if both ũ1 (ε ; ζ ) and ũ3 (ε ; ζ ) are not in A(ζ ), i.e. if and only if
and
f (ζ2 ) > min{ f (λ̃1∗ (ε ; ζ )), f (ζ3 )} ≥ min{ f (λ̃1∗ (ε ; ζ )), f (ζ2 )} (4.16)
continuity of uk (·) with respect to ζ and the closedness of the set T it follows that uk (ζ (i) ) → uk (ζ∗ ) = ζ∗∗ ∈ A(ζ∗ ). This proves the closedness of the transformation A if no approximation is used in any sequence ζ (i) .
Now we consider sequences containing approximated triplets. Introducing the approximated triplets ζ (i+1) (i.e. ζ (i+1) ∈ Ãl (ε ; ζ (i) )) introduces discontinuities of the first kind into the functions uk (ζ ), so the argument based on the continuity of the functions uk (ζ ) cannot be applied directly. We will show that algorithm A introduces a finite number of isolated-point discontinuities, which guarantees that from any sequence {ζ (i) }∞ i=1 , after removing some finite number of initial elements, we can apply the proof from [14].
We have to consider two cases
Let us choose i1 for which inequality (4.17) holds. Since the approximation in subsequent iterations is used infinitely many times, there exists a subsequence K = {i : i > i1 } ⊂ N such that if i ∈ K then
where

ε1 = ε / f (ζl(i) ), Ka = 2δ /(ζ3 − ζ1 ) − [1 − ζ(r) ]².
c(ũl1 (ζ )) = f˜l (ζ1 ) + f (λ̃l∗ (ζ )) + f˜l (ζ3 ) < f˜l (ζ1 ) + f˜l (ζ2 ) + f˜l (ζ3 ) = c(ζ ),
where f˜l (·) = f (·) or f˜l (·) = f (·)(1 + ε ) depending on which value was per-
turbed.
b. A(ζ ) = {u3 (ζ ), ũ13 (ζ ), ũ23 (ζ ), ũ33 (ζ )}. Then, since u1 (ζ ) ∉ A(ζ ) as well as ũl1 (ε ; ζ ) ∉ A(ζ ) (l ∈ {1, 2, 3}), we must have f (ζ2 ) ≤ f (λ ∗ (ζ )) as well as f̃ (ζ2 ) ≤ f (λ̃l∗ (ζ )) (l ∈ {1, 2, 3}), depending on which value was perturbed. Also f (λ ∗ (ζ )) < f (ζ1 ) as well as fl (λ̃l∗ (ζ )) < f̃l (ζ1 ) (l ∈ {1, 2, 3}), since otherwise we would have a local maximum in [ζ1 , ζ2 ], contradicting unimodality.
Therefore in this case we must have
c(ũl3 (ζ )) = f (λ̃l∗ (ζ )) + f˜2 (ζ2 ) + f˜3 (ζ3 ) < f˜l (ζ1 ) + f˜l (ζ2 ) + f˜l (ζ3 ) = c(ζ )
c. Finally, we can have the case A(ζ ) = {u1 (ζ ), u3 (ζ )}. In this case we are not able to include in A(ζ ) any of the approximated triplets ũl (ε ; ζ ) (l = 1, 2, 3). This is because of the following properties:
i. f (ζ2 ) < f (ζ1 ) by assumption,
ii. λ ∗ (ζ ) ≤ ζ2 ,
iii. f (λ ∗ (ζ )) = f (ζ2 ), which implies λ ∗ (ζ ) = ζ2 .
These equalities hold since otherwise we would have a contradiction with the unimodality of f (·). Approximating any of the values would mean that we could not guarantee property iii. Therefore, since f (ζ2 ) < min{ f (ζ1 ), f (ζ3 )}, we get c(u1 (ζ )) < c(ζ ) and c(u3 (ζ )) < c(ζ ). From a practical point of view, for a given ζi where one of the coordinates is approximated or all are exact, we can see whether we have to use exact values, i.e. remove the approximation, by checking whether
This exhausts all the possibilities and finishes the proof of the third point.
To prove the above theorem we show how the proof given in [14] can be applied, making use of Lemma 4.
Proof. The main difficulty in applying the method of proof from [14] is that using the approximation at certain points of the domain causes a discontinuity of the cost function c, as well as a discontinuity of the functional calculating the minimum of the quadratics with respect to a parameter triplet ζ (i) .
Let us observe first that the proof of the third point of Lemma 4 shows that allowing perturbation according to (4.23) ensures that {ζ1(i) }∞ i=0 is monotone increasing and {ζ3(i) }∞ i=0 is monotone decreasing. Since both of these sequences are bounded, they are both convergent. Moreover, keeping Δ (ε ; ζ ) at a level such that the above
The Sequential Quadratic Interpolation Algorithm with the Objective Function Perturbation
Input: ζ0 ∈ T – a starting point,
ε > 0 – the relative error of the approximation available in certain points of the function evaluation.
0. Set i = 0.
1. Compute λ ∗ = λ ∗ (ζ (i) ) or λ ∗ = λ̃l∗ (ε ; ζ (i) ), depending on whether the function value was exact at all points ζ1(i) , ζ2(i) , ζ3(i) or was perturbed at the point ζl(i) (l ∈ {1, 2, 3}), respectively.
2. If λ ∗ = ζ1(i) or λ ∗ = ζ3(i) then STOP; else construct the set A(ζ (i) ):
   a. If the approximation of no value of the triplet ζ (i) is available, then A(ζ (i) ) = A0 according to (4.10).
   b. If the approximation at the point ζl(i) (l ∈ {1, 2, 3}) is available:
      i. Compute the transformation q̂ as described in Appendix A.
      ii. Compute Δl (ε ; ζ (i) ).
      iii. If |λ̃l∗ (ε ; ζ (i) ) − ζ2(i) | < Δl (ε ; ζ (i) ), then A(ζ (i) ) = A0 and go to 3.
      iv. If |λ̃l∗ (ε ; ζ (i) ) − ζ2(i) | ≥ Δl (ε ; ζ (i) ), then A(ζ (i) ) = A0 ∪ Ãl .
3. Compute
   ζ (i+1) ∈ arg min{c(ζ ) : ζ ∈ A(ζ (i) )}.
4. Replace i := i + 1 and go to step 1.
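For concreteness, here is a Python sketch of the unperturbed core of this procedure: the quadratic-interpolation minimum λ ∗ (ζ ) and the selection of the cheapest ordered replacement triplet by the cost c(ζ ) = f (ζ1 ) + f (ζ2 ) + f (ζ3 ). The perturbation handling of steps 2.b.i-iv is omitted, and the helper names are ours.

def quad_min(zeta, fz):
    # Vertex of the parabola through (z1,f1), (z2,f2), (z3,f3)
    z1, z2, z3 = zeta
    f1, f2, f3 = fz
    num = (z2**2 - z3**2) * f1 + (z3**2 - z1**2) * f2 + (z1**2 - z2**2) * f3
    den = (z2 - z3) * f1 + (z3 - z1) * f2 + (z1 - z2) * f3
    return 0.5 * num / den

def sqi_step(zeta, f):
    # One iteration: mirror of steps 1-3 without the perturbed branch
    z1, z2, z3 = zeta
    lam = quad_min(zeta, [f(z) for z in zeta])
    if lam in (z1, z3):
        return None                      # STOP condition of step 2
    candidates = [t for t in ((z1, lam, z2), (z2, lam, z3),
                              (lam, z2, z3), (z1, z2, lam))
                  if t[0] < t[1] < t[2]]  # keep only ordered triplets
    return min(candidates, key=lambda t: sum(f(z) for z in t))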
γ (x, Z ) := [ ∑i<j ai j Wi j ] / [ ∑i<j Wi j ], (4.24)

where the sums run over all pairs 1 ≤ i < j ≤ N, ai j = di j /(ri + r j ), Wi j = 1/(ri + r j ), di j = ||x j − xi ||2 and r j = ||x − x j ||2 . The measure γ (x, Z ) quantifies how well the data points from Z surround the evaluation point x. The numbers ai j measure how far the point x lies from the segment xi x j ; ai j attains its maximal value 1 when x lies on this segment. The weights Wi j emphasize in γ (x, Z ) the impact of segments xi x j that are close to the point x. The additional division by the sum of the weights Wi j ensures that for any x ∈ Rd the range of values of γ (x, Z ) is (0, 1]. If the value of γ (x, Z ) is greater than a certain threshold, then an approximation with good local quality can be expected at the evaluation point x.
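A direct Python transcription of (4.24) might read as follows (the function name and array conventions are ours):

import numpy as np

def gamma_measure(x, Z):
    # gamma(x, Z) of Eq. (4.24); Z is an (N, d) array of data points,
    # x a point in R^d; the sums run over all pairs i < j
    x = np.asarray(x, dtype=float)
    Z = np.asarray(Z, dtype=float)
    r = np.linalg.norm(Z - x, axis=1)            # r_j = ||x - x_j||_2
    num = den = 0.0
    N = len(Z)
    for i in range(N):
        for j in range(i + 1, N):
            d_ij = np.linalg.norm(Z[j] - Z[i])   # d_ij = ||x_j - x_i||_2
            w_ij = 1.0 / (r[i] + r[j])           # weight W_ij
            num += (d_ij * w_ij) * w_ij          # a_ij * W_ij
            den += w_ij
    return num / den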
Fig. 4.1 γ (x, Z ) defined by (4.24). Here the set Z is a data set constructed on the 50-th
optimization step of a 2-parametric Rosenbrock function optimization with the EXTREM
algorithm
where
s(xi ) = f (xi ) = fi , i = 1, . . . , N.
Φ w = f, (4.26)
(Φ T Φ + λ ² I)w = Φ T t (4.28)
for λ > 0 governing a trade-off between data reproduction and the desired smoothness of the solution fλ (x). Due to the ill-conditioning of the matrix Φ , a direct inversion of the matrix (Φ T Φ + λ ² I) in (4.28) is not numerically stable.
To solve (4.28) in a numerically stable way we use the singular value decompo-
sition of the matrix Φ defined as
Φ = USVT , (4.29)
where

Ωλ = diag( σ1 /(σ1² + λ ²), σ2 /(σ2² + λ ²), . . . , σN /(σN² + λ ²) ).
Using the above decomposition, the weight vector wλ is expressed (see [5]) by the expansion with respect to the singular vectors of the matrix Φ :

wλ = V Ωλ U T t = ∑_{i=1}^{N} σi /(σi² + λ ²) · (uiT t) vi . (4.31)
Comparing the expansion (4.31) for λ > 0 with the expansion for λ = 0, i.e. for the interpolation problem, which reads

w = V S−1 U T t = ∑_{i=1}^{N} (1/σi ) (uiT t) vi , (4.32)

one can see the role of the regularization parameter λ . For σ p ≥ λ ≥ σ p+1 we have

1/σk ≫ σk /(σk² + λ ²) (k > p),
and hence the impact of the singular vectors corresponding to singular values σk < λ is damped in the expansion (4.31). This enables us to avoid oscillations of the solution that are introduced by inverting small singular values in the expansion of the weight vector. Another approach to solving the problem of ill-conditioning of the interpolation matrix generated by multiquadric functions was presented in [7].
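A minimal sketch of computing wλ through the SVD filter factors of (4.31), assuming the matrix Phi and the target vector t are given as NumPy arrays:

import numpy as np

def regularized_weights(Phi, t, lam):
    # SVD of the design matrix, Eq. (4.29)
    U, s, Vt = np.linalg.svd(Phi, full_matrices=False)
    # Filter factors sigma_i / (sigma_i^2 + lambda^2) from Eq. (4.31)
    filt = s / (s**2 + lam**2)
    return Vt.T @ (filt * (U.T @ t))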
r j = ||x − x j ||2 , j = 1, . . . , N.
For a sequence of λ s that covers the singular value spectrum (σN , σ1 ) of the matrix
S from the decomposition (4.29) we calculate the Normalized Local Mean Square
Error
NLMSEλ ,Z (x) = [ ∑_{j=1}^{N} ( [sλ (x j ) − f j ]² / f j² ) · (1/r j²) ] / [ ∑_{j=1}^{N} 1/r j² ], (4.33)
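In code, (4.33) is a weighted relative error; the sketch below assumes s_lambda is a callable approximant (name ours) and reuses the distances r j defined above:

import numpy as np

def nlmse(s_lambda, Z, f, x):
    # Weights 1/r_j^2 concentrate the error measure near the point x
    r2 = np.sum((np.asarray(Z) - np.asarray(x)) ** 2, axis=1)
    w = 1.0 / r2
    rel = (np.array([s_lambda(z) for z in Z]) - f) ** 2 / np.asarray(f) ** 2
    return float(np.sum(rel * w) / np.sum(w))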
Fig. 4.2 a) Data set consisting of 30 points from the optimization path of the EXTREM algorithm optimizing the 2-parametric Rosenbrock function. b) Local reproduction of the data near points A and B, measured by NLMSEλ ,Z (A) and NLMSEλ ,Z (B) respectively; NLMSEthr = 10−5 is depicted by a straight line. c) A measure of the oscillation of the solution, WGVλ ,Z (x), at points A and B; the optimal λ for points A and B are marked by dots
(see, e.g., [17], [19] and references therein). The convergence rate considered is with respect to the dataset density, i.e. the global fill distance
Fig. 4.3 a) The approximation error for λ chosen by the measure WGVλ ,Z (x) with NLMSEthr = 5.0 · 10−6 . b) The chosen value of the λ parameter – it can be noticed that in regions where the data is sparse the method suggests a greater value of λ
h(Z , Ω ) := sup_{x∈Ω} min_{x j ∈Z } ||x − x j ||2 , where Z = {xi }N i=1 ⊆ Ω ; this mesh norm measures the radius of the biggest ball contained in the domain Ω that does not contain any data point, and the domain Ω is assumed to satisfy the interior cone condition with radius r and angle θ .
A summary of error bounds for various radial basis functions was given in [16]. A very precise derivation of the error bounds for Gaussian radial basis function interpolation was given in [20]; in the latter paper one can find a derivation of all constants involved in the bound. The analogous error estimates (with all constants involved) for approximation with positive definite radial basis functions constructed with Tikhonov regularization, for functions from the Sobolev space Wpτ , were presented in [19]. Here we show that the error estimates contained in [19] cannot be used in our scheme, due to the small number of points available in the optimization process, and that using the heuristic methods from the previous sections is therefore justified.
All of the error bounds rely on a common property of local polynomial reproduction that has to be guaranteed by an approximation procedure (cf. [19]). The error bounds can be formulated in the form of the following theorem.
Let us consider the unit ball B(0, 1) as the domain of the approximation. It satisfies the interior cone condition with r = 1 and θ = π /3. The scheme can be seen to reproduce only linear polynomials, i.e. m = 1, and therefore for the above bounds to be satisfied there has to be
h(Z , Ω ) ≤ Q(1, π /3) < 0.012613.
It means that the distance from one data point to another must be no greater than 1.3% of the radius of the ball containing the whole data set. Such a number of points cannot be generated by any local optimization algorithm.
The above consideration shows that the existing accurate error bounds for regularized approximation with radial basis functions cannot be applied to estimate the approximation in the SPELROA method, due to the sparseness of the data sets constructed by local optimization algorithms.
and Ti is the i-th Chebyshev polynomial shifted to the interval [0, 1]. The standard starting point is x0 = (ξ j ) where ξ j = j/(d + 1), and the minimum for d = 8 equals f ∗ = 3.51687 · 10−3 .
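A compact Python transcription of the Chebyquad function of [12] (our sketch: it uses the three-term recurrence for the shifted Chebyshev polynomials, whose integrals over [0, 1] are 0 for odd degree and −1/(i² − 1) for even degree):

import numpy as np

def chebyquad(x):
    # x is a vector in [0,1]^d; map the points to [-1,1] for the recurrence
    x = np.asarray(x, dtype=float)
    n = len(x)
    t = 2.0 * x - 1.0
    T_prev, T_cur = np.ones(n), t          # shifted T_0 and T_1 at the x_j
    total = 0.0
    for i in range(1, n + 1):
        integral = 0.0 if i % 2 == 1 else -1.0 / (i * i - 1)
        total += (T_cur.mean() - integral) ** 2
        T_prev, T_cur = T_cur, 2.0 * t * T_cur - T_prev
    return total

# Standard starting point for d = 8
x0 = np.arange(1, 9) / 9.0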
3. Eleven-variable problem
As an eleven-variable problem we chose the Osborne 2 function (problem no. 19 in [12]). It is defined as

f (x1 , . . . , x11 ) = ∑_{i=1}^{65} fi (x)² ,
fi (x) = yi − ( x1 e^{−ti x5 } + x2 e^{−(ti −x9 )² x6 } + x3 e^{−(ti −x10 )² x7 } + x4 e^{−(ti −x11 )² x8 } ),

where ti = (i − 1)/10 and the data values yi are given in [12].
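A sketch of the corresponding residual sum, assuming the 65 measured values yi from [12] are supplied as an array (they are not reproduced here):

import numpy as np

def osborne2(x, y):
    # x: 11 parameters; y: the 65 data values tabulated in [12]
    t = np.arange(len(y)) / 10.0           # t_i = (i - 1)/10
    model = (x[0] * np.exp(-t * x[4])
             + x[1] * np.exp(-(t - x[8]) ** 2 * x[5])
             + x[2] * np.exp(-(t - x[9]) ** 2 * x[6])
             + x[3] * np.exp(-(t - x[10]) ** 2 * x[7]))
    return float(np.sum((y - model) ** 2))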
4.6.2 Results
For each problem we ran Algorithm 1 combined with the EXTREM algorithm, with the three following objective function approximation methods:
1. Radial basis function approximation without regularization,
2. Radial basis function approximation with regularization using Generalized Cross Validation [18] to choose the λ parameter,
3. Radial basis function approximation with regularization using the Weighted Gradient Variance to choose the λ parameter.
To construct the approximation of the objective function with one of these methods we used 30 Gaussian radial basis functions with an equal shape parameter, set to half of the distance between the most distant centers. No additional parameters were required for the first and second methods. The single user-defined threshold NLMSEthr for the measure (4.33) was used in the third method. Apart from the parameters concerning the construction of the radial basis approximation, we had to set up three parameters related directly to Algorithm 1 itself: the number of initial steps Is = 50, ε = 10−3 for the ε -check procedure, and γthr – the threshold value for the measure (4.24) used to detect the reliable region.
In the tables below we show the performance of Algorithm 1, with the above objective function approximation strategies, on the problems defined in the previous subsection, compared to the EXTREM algorithm. The first column shows the number
Table 4.1 6-variable Rosenbrock function optimization using: Left) pure EXTREM, Right) Algorithm 1 without regularization and with γthr = 0.65

EXTREM                               Alg. 1 no regularization
step num.  ||x − x∗||  f             step num.  ||x − x∗||  f          approx.
250        2.267853   2.885768      250        2.459117   3.711824    22
500        0.650821   0.109817      500        1.318398   0.559350    45
750        0.418187   0.031570      750        0.675244   0.089372    35
1000       0.246522   0.012666      933        0.130808   0.005606    10
1250       0.029985   0.001228
1523       0.002014   0.000001
Table 4.2 6-variable Rosenbrock function optimization using: Left) Algorithm 1 with regu-
larization using GCV, Right) Algorithm 1 with regularization using WGV with NLMSEthr =
5 · 10−6 . In both cases γthr = 0.65
Table 4.3 8-variable Rosenbrock function optimization using: Left) pure EXTREM, Right) Algorithm 1 without regularization and γthr = 0.65

EXTREM                               Alg. 1 no regularization
step num.  ||x − x∗||  f             step num.  ||x − x∗||  f          approx.
250        3.923525   10.438226     250        3.030179   6.376938    22
500        2.731706    5.606541     500        2.153959   1.855113    23
750        1.771076    1.032648     750        1.380208   0.471618    30
1000       1.009230    0.251387     1000       0.910852   0.215283    46
1250       0.484663    0.050215     1250       0.361361   0.028169    43
1500       0.274941    0.015386     1500       0.186309   0.006926    55
1750       0.262434    0.012950     1537       0.186309   0.006926     7
2000       0.208392    0.007465
3471       0.000553    0.000000
Table 4.4 8-variable Rosenbrock function optimization using: Left) Algorithm 1 with regu-
larization using GCV, Right) Algorithm 1 with regularization using WGV with NLMSEthr =
10−6 . In both cases γthr = 0.65
Table 4.5 8-variable Chebquad function optimization using: Left) pure EXTREM, Right)
Algorithm 1 without regularization and with γthr = 0.65
of steps of the algorithm, i.e. the sum of the number of direct function evaluations and the number of steps in which the radial basis function approximation was used. The second column shows the distance from the minimum and the third column shows the objective function value. The last column shows the number of steps in which the objective function approximation was used within the previous 250 steps.
As we can see, in all cases SPELROA required considerably fewer steps than the pure EXTREM algorithm to stop. Using the WGV method to build the radial basis approximation gave the best convergence results, i.e. the stopping point of SPELROA with WGV was closer to the minimum than the stopping points given by the other methods. For all problems NLMSEthr was chosen intuitively as ∼ 10−6 , meaning that the reconstruction of the training set in the vicinity of the evaluation point was at the level of 10−6 . For detecting the reliable region, setting γthr to 0.65 was sufficient to preserve convergence of the method. In the optimization of the 8-variable Chebquad function it turned out to be possible to reduce γthr to 0.6, which gave 16 points in which the objective function was approximated in the first 250 steps
Table 4.6 8-variable Chebquad function optimization using: Left) Algorithm 1 with regular-
ization using GCV and γthr = 0.65, Right) Algorithm 1 with regularization using WGV with
NLMSEthr = 10−6 and with γthr = 0.60
Table 4.7 11-variable Osborne 2 function optimization using: Left) pure EXTREM, Right)
Algorithm 1 without regularization and with γthr = 0.65
EXTREM                               Alg. 1 no regularization
step num.  ||x − x∗||  f             step num.  ||x − x∗||  f          approx.
1          4.755269   2.093420      250        1.379902   0.977272    29
250        1.004450   0.081672      500        0.568322   0.059211    31
500        0.099411   0.041034      710        0.335485   0.041512    43
750        0.073530   0.041772
1000       0.007715   0.040138
1250       0.001092   0.040138
1434       0.000000   0.040138
Table 4.8 11-variable Osborne 2 function optimization using: Left) Algorithm 1 with regularization using GCV, Right) Algorithm 1 with regularization using WGV with NLMSEthr = 10−6 . In both cases γthr = 0.65
instead of 4 such points when γthr = 0.65. An interesting result was also obtained for the Osborne 2 function: the EXTREM algorithm found a better solution than that suggested in [12]. Algorithm 1 did not converge to this minimum with any approximation method. The method without regularization did not converge at all, whereas GCV
and WGV converged to the minimum suggested in [12] rather than to the one found by EXTREM.
4.7 Summary
The Search Procedure Exploiting Locally Regularized Objective Approximation is a method to speed up local optimization processes. The method combines a non-gradient optimization algorithm with regularized local radial basis function approximation: in a certain number of function evaluation steps of the optimization algorithm, a local regularized radial basis function approximation is used instead of a direct objective function evaluation. In this chapter we presented the proof of convergence of the Search Procedure Exploiting Regularized Objective Approximation, which applies to any Gauss-Seidel and conjugate direction search algorithm that uses sequential quadratic interpolation as a line search procedure. The convergence is proven under the assumption that the approximation of the objective function, with a prescribed relative approximation error, is exploited only in the sequential quadratic interpolation. The performance of the method was presented on the 6- and 8-parametric Rosenbrock functions, the 8-parametric Chebquad function and the 11-parametric Osborne 2 function. Further studies will compare the method with trust region methods.
Acknowledgements. I would like to thank all referees for very valuable comments.
Appendix A
The quadratic q(ζ ) built on the three points (ζ1 , f (ζ1 )), (ζ2 , f (ζ2 )) and (ζ3 , f (ζ3 )), where {ζ1 , ζ2 , ζ3 } ⊂ R, equals, in the normalized coordinate x (with ζ1 mapped to 0, ζ2 to ζ(r) and ζ3 to 1),

L2 (x) = a x² + b x + c,

where

a = f (ζ1 )/ζ(r) + f (ζ2 )/(ζ(r) (ζ(r) − 1)) + f (ζ3 )/(1 − ζ(r) ),
b = − [ f (ζ1 )(ζ(r) + 1)/ζ(r) + f (ζ2 )/(ζ(r) (ζ(r) − 1)) + f (ζ3 ) ζ(r) /(1 − ζ(r) ) ],
c = f (ζ1 ).
Appendix B
λ̃l∗ (ε ; ζ ) = λ ∗ (ζ ) ± Δl (ε ; ζ ).
Here we assume that both λ ∗ (ζ ) and λ̃l∗ (ε ; ζ ) are minima of the quadratics obtained
by the transformation q̂(·) from Appendix A with the assumption that f (ζ1 ) < f (ζ3 )
i.e. if the opposite is true then we have to rotate the quadratics with respect to the
center of the interval [ζ1 , ζ3 ].
Let us introduce a notation to simplify derivations. Let us denote A = 2(ζ(r) −
1), B = q̂(ζ(r) ),C = ζ(r) . Then for the unperturbed quadratic we have that
λ ∗ = (1/2) · [ A(ζ(r) + 1) + B − ζ(r)C ] / [ A + B − C ],

whereas for the perturbed ones we have

λ1∗ (ε ; ζ ) = (1/2) · [ A(ζ(r) + 1)(1 + ε ) + B − ζ(r)C ] / [ A(1 + ε ) + B − C ],
λ2∗ (ε ; ζ ) = (1/2) · [ A(ζ(r) + 1) + B(1 + ε ) − ζ(r)C ] / [ A + B(1 + ε ) − C ],
λ3∗ (ε ; ζ ) = (1/2) · [ A(ζ(r) + 1) + B − ζ(r)C(1 + ε ) ] / [ A + B − C(1 + ε ) ],

for the perturbation at 0, ζ(r) and 1 respectively, where ζ(r) = (ζ2 − ζ1 )/(ζ3 − ζ1 ). Simplifying the expression |λ ∗ − λ̃ ∗ (ε ; ζ )| we get

Δ1 (ε ; ζ ) = [ (C − ζ(r) B) / (2(A + B − C)) ] · [ Aε / (Aε + A + B − C) ],
which again splits into two cases depending on the sign of the coefficient of ε :
a. When A(1 − ζ(r)) > 0 then
Only the upper bounds for ε , i.e. cases 1.a) and 2.b), are interpretable as a solution to our problem. So finally we get two regions for ε and ζ(r) , valid for

A(1 + ε ) + B − C > 0 and A(1 − ζ(r) ) > 0, (4.36)

or

A(1 + ε ) + B − C < 0 and A(1 − ζ(r) ) < 0. (4.37)
In the same way we can obtain the conditions for l = 2:

ε < [ −B(1 − ζ(r) ) − A ] / [ B(1 − ζ(r) ) ], (4.38)

for

A + B(1 + ε ) − C > 0 and B(1 − ζ(r) ) < 0, (4.39)

or

A + B(1 + ε ) − C < 0 and B(1 − ζ(r) ) > 0. (4.40)
References
1. Bazan, M., Russenschuck, S.: Using neural networks to speed up optimization algo-
rithms. Eur. Phys. J. AP 12, 109–115 (2000)
2. Bazan, M., Aleksa, M., Russenschuck, S.: An improved method using radial basis func-
tion neural networks to speed up optimization algorithms. IEEE Trans. on Magnetics 38,
1081–1084 (2002)
3. Bazan, M., Aleksa, M., Lucas, J., Russenschuck, S., Ramberger, S., Völlinger, C.: Integrated design of superconducting magnets with the CERN field computation program ROXIE. In: Proc. 6th International Computational Accelerator Physics Conference, Darmstadt, Germany (September 2000)
4. Conn, A.R., Gould, N.I.M., Toint, P.L.: Trust region methods. SIAM, Philadelphia (2005)
5. Hansen, P.C.: Rank-deficient and Discrete Ill-posed Problems. SIAM, Philadelphia
(1998)
6. Jacob, H.G.: Rechnergestützte Optimierung statischer und dynamischer Systeme.
Springer, Heidelberg (1982)
7. Kansa, E.J., Hon, Y.C.: Circumventing the ill-conditioning problem with multiquadric radial basis functions: Applications to elliptic partial differential equations. Comp. Math. with App. 39(7-8), 123–137 (2000)
8. Luenberger, D.G.: Introduction to linear and nonlinear programming, 2nd edn. Addison-
Wesley, New York (1984)
9. Micchelli, C.A.: Interpolation of Scattered Data: Distance Matrices and Conditionally
Positive Definite Functions. Constructive Approximation 2, 11–22 (1986)
10. Madych, W.R., Nelson, S.A.: Multivariate interpolation and conditionally positive defi-
nite functions II. Math. Comp. 4(189), 211–230 (1990)
11. Madych, W.R.: Miscellaneous error bounds for multiquadric and related interpolators.
Comp. Math. with Appl. 24(12), 121–138 (1992)
12. Moré, J.J., Garbow, B.S., Hillstrom, K.E.: Testing unconstrained optimization software.
ACM Trans. Math. Software 7(1), 17–41 (1981)
13. Oeuvray, R.: Trust region method based on radial basis functions with application on
biomedical imaging, Ecole Polytechnique Federale de Lausanne (2005)
14. Polak, E.: Optimization. Algorithms and Consistent Approximations. Applied Mathe-
matical Sciences, vol. 124. Springer, Heidelberg (1997)
15. Powell, M.J.D.: On calculation of orthogonal vectors. The Computer Journal 11(3),
302–304 (1968)
16. Schaback, R.: Error estimates and condition number for radial basis function interpola-
tion. Adv. Comput. Math. 3, 251–264 (1995)
17. Schaback, R.: Native Hilbert Spaces for Radial Basis Functions I. The new development
in Approximation Theory. Birkhäuser, Basel (1999)
18. Wahba, G.: Spline models for observational data. SIAM, Philadelphia (1990)
19. Wendland, H., Rieger, C.: Approximate interpolation with applications to selecting
smoothing parameters. Numerische Mathematik 101, 643–662 (2005)
20. Wendland, H.: Gaussian Interpolation Revisited. In: Kopotun, K., Lyche, T., Neamtu,
M. (eds.) Trends in Approximation Theory, pp. 427–436. Vanderbilt University Press,
Nashville (2001)
21. Zangwill, W.I.: Nonlinear Programming; a Unified Approach. Prentice-Hall Interna-
tional Series. Prentice-Hall, Englewood Cliffs (1969)
Chapter 5
Optimization Problems with Cardinality
Constraints
Abstract. In this article we review several hybrid techniques that can be used to
accurately and efficiently solve large optimization problems with cardinality con-
straints. Exact methods, such as branch-and-bound, require lengthy computations
and are, for this reason, infeasible in practice. As an alternative, this study focuses
on approximate techniques that can identify near-optimal solutions at a reduced
computational cost. Most of the methods considered encode the candidate solutions
as sets. This representation, when used in conjunction with specially devised search
operators, is specially suited to problems whose solution involves the selection of
optimal subsets of specified cardinality. The performance of these techniques is il-
lustrated in optimization problems of practical interest that arise in the fields of
machine learning (pruning of ensembles of classifiers), quantitative finance (port-
folio selection), time-series modeling (index tracking) and statistical data analysis
(sparse principal component analysis).
5.1 Introduction
Many practical optimization problems involve the selection of subsets of specified
cardinality from a collection of items. These problems can be solved by exhaustive
enumeration of all the candidate solutions of the specified cardinality. In practice,
only small problems of this type can be exactly solved within a reasonable amount of
Rubén Ruiz-Torrubiano
Computer Science Department, Universidad Autónoma de Madrid, Spain
e-mail: [email protected]
Sergio Garcı́a-Moratilla
Computer Science Department, Universidad Autónoma de Madrid, Spain
e-mail: [email protected]
Alberto Suárez
Computer Science Department, Universidad Autónoma de Madrid, Spain
e-mail: [email protected]
time. The number of steps required to find the optimal solution can be reduced using
branch-and-bound techniques. Nevertheless, the computational complexity of the
search remains exponential, which means that large problems cannot be handled by
these exact methods. It is therefore important to design algorithms that can identify
near-optimal solutions at a reduced computational cost. In this article we present a
unified framework for handling optimization problems with cardinality constraints.
A number of approximate methods within this framework are analyzed and their
performance is tested in extensive benchmark experiments.
In its general form, an optimization problem with cardinality constraints can be
formulated in terms of a vector of binary variables z = {z1 , z2 , . . . , zD }, zi ∈ {0, 1}.
The goal is to minimize a cost function that depends on z, subject to a constraint
that specifies the number of non-zero bits in z:

min_z F(z), s.t. ∑_{i=1}^{D} zi = k. (5.1)
where I(·) is an indicator function (I(true) = 1, I(false) = 0). This hybrid prob-
lem can be transformed into a purely combinatorial one of the type (5.1) by intro-
ducing a D-dimensional binary vector z whose i-th component indicates whether
variable i is allowed to take a non-zero value (zi = 1) or is set to zero (zi = 0).
min_z F ∗ (z), s.t. ∑_i zi = k, (5.3)
where θ [z] denotes the k-dimensional vector formed by the components of θ for which
the value of the corresponding component of z is 1. The remaining components of θ
are set to zero in the auxiliary problem.
This decomposition makes it clear how hybrid methods that combine techniques
for combinatorial and continuous optimization can be applied to identify the so-
lution of the subset selection problem with a continuous objective function: For a
given value of z, the optimal θ [z] is calculated by solving the surrogate problem
defined by (5.4), where z determines which components of θ are allowed to take a
non-zero value. The final solution is obtained by searching in the purely combinato-
rial space of possible values of z, using the optimal function value that is a solution
of (5.4) to guide the exploration.
The success of this hybrid approach depends on the availability of a continuous
optimization algorithm that can efficiently identify the globally optimal solution
of the auxiliary optimization problem defined in (5.4) and on the efficiency of the
algorithm used to address the combinatorial part of the search. For simple forms
of the continuous objective function and of the remaining restrictions (other than
the cardinality constraint), the auxiliary problem can be efficiently solved by exact
optimization techniques. For instance, efficient linear and quadratic programming
algorithms are available if the function is linear or quadratic, respectively [1]. For
more complex objective functions, general non-linear optimization techniques (such
as quasi-Newton [2] or interior-point methods [3]) may be necessary. In these cases,
there is no guarantee that the solution of the auxiliary problem be globally optimal.
As a consequence, if the solutions found are far from the global optimum, the com-
binatorial search that is used to solve the original problem (5.3) can be seriously
misled.
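As a minimal sketch of this decomposition (with a hypothetical inner_solver standing in for the continuous solver of (5.4), and brute-force enumeration standing in for the combinatorial search in small problems):

import itertools
import numpy as np

def best_subset(D, k, inner_solver):
    # Enumerate all binary masks z with exactly k ones; a GA, SA or EDA
    # would replace this loop in large problems
    best_cost, best = np.inf, None
    for idx in itertools.combinations(range(D), k):
        z = np.zeros(D, dtype=bool)
        z[list(idx)] = True
        cost, theta = inner_solver(z)   # optimal theta[z] for this mask, Eq. (5.4)
        if cost < best_cost:
            best_cost, best = cost, (z.copy(), theta)
    return best_cost, best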
In this work, we assume that the continuous optimization task defined by (5.4)
can be solved exactly and focus on the solution of the combinatorial part of the orig-
inal problem. Section 5.2 describes how standard combinatorial optimization tech-
niques can be adapted to handle the cardinality constraints considered. Emphasis is
placed on the use of an appropriate encoding for the search states in terms of sets.
This set-based encoding is particularly well-suited for the definition of search oper-
ators that preserve the cardinality of the candidate solutions. With this adaptation,
the approximate methods described provide a practicable alternative to identifying
the exact solution by exhaustive search, which becomes computationally infeasible
in large problems, or to computationally inexpensive optimization methods, such as
greedy search, which tend to find suboptimal solutions. The experiments presented
in Section 5.3 illustrate how the techniques reviewed find near-optimal solutions
with limited computational resources and can therefore be used to address optimal
subset selection problems of practical interest. Novel results regarding the applica-
tion of these techniques to some of these problems (ensemble pruning and sparse
PCA) are also provided. Finally, Section 5.4 summarizes the conclusions of this
work.
operator. The stochastic search proceeds by considering transitions from the cur-
rent state z(cur) to a neighboring configuration zl ∈ N (z(cur) ) generated at random.
The proposed transition is accepted if the value of the objective function decreases.
Otherwise, if the candidate configuration is of higher cost, the transition is accepted
only with a certain probability. This probability is expressed as a Boltzmann factor
Paccept (zl , z(cur) ; Tk ) = exp( −[F(zl ) − F(z(cur) )] / Tk ), (5.5)
where the parameter Tk plays the role of a temperature. A general version of this technique is given as Algorithm 1. In this pseudocode, the function annealingSchedule returns the temperature Tk for the following epoch. It is common to use a geometric schedule Tk = γ Tk−1 , where γ , smaller than but usually close to one, regulates how fast the temperature is decreased.
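The following Python sketch combines the acceptance rule (5.5), the geometric schedule and a cardinality-preserving swap neighborhood on the set encoding (all function names and default parameter values here are our own choices):

import math
import random

def swap_neighbor(selected, universe):
    # Swap one selected item for one unselected item: cardinality is preserved
    added = random.choice(sorted(set(universe) - selected))
    removed = random.choice(sorted(selected))
    return (selected - {removed}) | {added}

def simulated_annealing(z0, F, universe, T0=1.0, gamma=0.9,
                        epochs=100, moves_per_epoch=50):
    z, fz, T = z0, F(z0), T0
    for _ in range(epochs):
        for _ in range(moves_per_epoch):
            z_new = swap_neighbor(z, universe)
            f_new = F(z_new)
            # Accept downhill moves always; uphill moves with the
            # Boltzmann probability of Eq. (5.5)
            if f_new < fz or random.random() < math.exp(-(f_new - fz) / T):
                z, fz = z_new, f_new
        T *= gamma                      # geometric annealingSchedule
    return z, fz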
For problems with cardinality constraints, two alternative encodings for the can-
didate solutions are considered. A first possibility is a standard binary representa-
tion, where the chromosomes are bit-strings. The difficulty with this encoding is
that standard mutation and crossover operators do not preserve the number of non-
zero bits of the parents. A possible solution to this problem is to assign a lower
fitness value to individuals in the population that violate the cardinality constraint.
Assuming that a problem with an inequality cardinality constraint is considered, a
penalized fitness function can be built by subtracting from the standard fitness func-
tion a penalty term that depends on the magnitude of the violation of the cardinality
constraint
Δk (z) = |Card(z) − k| (5.6)
The penalized fitness function is
can be repaired by randomly setting some bits to 0 or 1, as needed, until the cardi-
nality constraint is satisfied (random repair). Another alternative is to use a heuristic
to determine which bits must be set to 0 or to 1 (heuristic repair). The results of a
greedy optimization or the solutions of a relaxed version of the problem can also be
used to achieve this objective [10].
An alternative to binary encoding is to use the set representation introduced in
simulated annealing. The use of this representation simplifies the design of crossover
and mutation operators that preserve the cardinality of the individuals. The neigh-
boring operator defined in SA can be used to construct mutated individuals. Since
this operator swaps a variable in the set of selected variables with another variable in
the complement of this set, the cardinality of the original chromosome is preserved
by the mutation.
Some crossover operators on sets were introduced in [5]. They are defined taking
into account the properties of respect and assortment [11]. Respect ensures that the
offspring inherit the common genetic material of the parents. Assortment guarantees
that every combination of the alleles of the two parents is possible in the child, pro-
vided that these alleles are compatible. When cardinality constraints are considered,
it is no longer possible to design crossover operators that guarantee both respect and
assortment.
A crossover operation that provides a good balance of these properties and en-
sures that the cardinality of the parents is preserved in the offspring is random as-
sorting recombination (RAR). RAR crossover is described in Algorithm 3. In this
algorithm, the integer parameter w ≥ 0 determines the amount of common informa-
tion from both parents that is retained by the offspring. For w = 0, elements that are
present in the chromosomes of both parents are not allowed in the child. Higher val-
ues of w assign more importance to the elements in the intersection of the parents’
sets (chromosomes). In the limit w → ∞, the child contains every element that is in
both of the parents’ chromosomes with a probability that approaches 1.
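A simplified Python sketch of a RAR-style, cardinality-preserving crossover follows; this is our loose transcription of the idea, not the exact operator of [5]. The urn holds w copies of each element common to both parents, so larger w makes common material more likely to survive:

import random

def rar_crossover(parent1, parent2, universe, k, w=1):
    # parent1, parent2: sets of k selected items each
    common = parent1 & parent2            # material respected by both parents
    exclusive = parent1 ^ parent2         # items in exactly one parent
    urn = [e for e in common for _ in range(w)] + list(exclusive)
    random.shuffle(urn)
    child = set()
    for e in urn:
        if len(child) == k:
            break
        child.add(e)
    # Top up from outside if the urn did not suffice; for w = 0 the
    # common elements stay barred from the child
    pool = set(universe) - child
    if w == 0:
        pool -= common
    rest = list(pool)
    while len(child) < k:
        child.add(rest.pop(random.randrange(len(rest))))
    return child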
p(g+1) = α (1/M) ∑_{m=1}^{M} z(g im ) + (1 − α ) p(g) , (5.8)

where Dg = {z(g1) , . . . , z(gP) } denotes the population of generation g, z(g im ) represents the individual at the im -th position in generation g (individuals are sorted by decreasing fitness values), and α ∈ (0, 1] is a smoothing parameter included to avoid strong fluctuations in the estimates of the probability distribution. The Univariate Marginal Distribution Algorithm (UMDA [14]) from the EDA family is recovered when α = 1. Even though the encoding is binary,
the cardinality constraints can be enforced in the sampling of individuals. Algorithm
5 describes a sampling method that generates individuals of a specified cardinality k
from a distribution of bits characterized by the probability vector p. The application
of this method to sample new individuals guarantees that the algorithm is closed
with respect to the cardinality constraint.
so that p̂ can be interpreted as a probability vector, ∑_{i=1}^{D} p̂i = 1.
– Return the generated individual x.
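A sketch of such a cardinality-closed sampling step (our transcription of the idea behind Algorithm 5: draw k distinct bits, renormalizing the remaining probabilities after each draw):

import numpy as np

def sample_with_cardinality(p, k, rng=None):
    # p: vector of bit probabilities; returns a 0/1 vector with exactly k ones
    rng = rng or np.random.default_rng()
    p_hat = np.array(p, dtype=float)
    z = np.zeros(len(p_hat), dtype=int)
    for _ in range(k):
        p_hat_norm = p_hat / p_hat.sum()   # renormalize to a probability vector
        i = rng.choice(len(p_hat), p=p_hat_norm)
        z[i] = 1
        p_hat[i] = 0.0                     # a chosen bit cannot be drawn again
    return z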
problem. While it is possible to address the combinatorial and the continuous op-
timization problems simultaneously, we concentrate on strategies that handle these
aspects separately. Therefore, the outcome of the continuous optimization algorithm
is used to guide the combinatorial optimization search, as in (5.4). For the hybrid
problems considered, the secondary optimization task can be efficiently solved in an
exact manner by quadratic programming. Nonetheless, the scheme can be directly
generalized when the evaluation of F(z) requires a more complex programming so-
lution, possibly without guarantee of convergence to the global solution of the sur-
rogate optimization problem. Under these conditions the algorithm used to address
the combinatorial part can actually be misled by the suboptimal solutions found in
the auxiliary problem.
Both exact and approximate methods have been used to address the 0/1 knapsack
problem. Exact algorithms based on branch-and-bound approaches and dynamic
programming are reviewed in [17]. Genetic algorithms [18, 19] and EDAs [12] have
also been used to address this problem.
Cardinality constraints are generally not considered in the standard 0/1 knapsack problem. Nevertheless, the optimum of the unconstrained problem can be obtained by solving D cardinality-constrained knapsack problems with ∑_{i=1}^{D} zi = k, for k = 1, 2, . . . , D.
The k-th element in this sequence is a knapsack problem with the restriction that
only k items can be included in the knapsack. To compare the performance of the
different optimization methods analyzed in this work, we use the testing protocol
proposed in [20, 18]. Three types of problems, defined in terms of two parameters v, r ∈ R+ , v > 1, are considered (a generator sketch follows the list):
(1) Uncorrelated: Weights and profits are generated randomly in [1, v].
(2) Weakly correlated: Weights wi are generated randomly in [1, v] and profits are generated in the interval [wi − r, wi + r].
(3) Strongly correlated: Weights wi are generated randomly in [1, v] and the profit is pi = wi + r.
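A possible NumPy generator for these three instance types (function and label names are ours):

import numpy as np

def knapsack_instance(D, kind, v=10.0, r=5.0, rng=None):
    rng = rng or np.random.default_rng()
    w = rng.uniform(1.0, v, size=D)               # weights in [1, v]
    if kind == "uncorrelated":
        p = rng.uniform(1.0, v, size=D)           # independent profits
    elif kind == "weakly":
        p = rng.uniform(w - r, w + r)             # profits close to the weights
    elif kind == "strongly":
        p = w + r                                 # p_i = w_i + r
    else:
        raise ValueError(kind)
    return w, p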
In general, knapsack problems with correlations between weights and profits are more
difficult to solve than problems in which the weights and profits are independent. We
use v = 10, r = 5 and a capacity W = 2v, which tends to include very few items in
the solution. The results reported are averages over 25 realizations of each problem,
which are solved using the different approximate methods: SA, a standard GA with
linear penalty, a GA using set encoding and the RAR operator (w = 1), and PBIL.
The conditions under which the search is conducted are determined in exploratory
experiments. A geometric annealing schedule Tk = γ Tk−1 with γ = 0.9 is used in
SA. The GAs evolve populations composed of 100 individuals. The probabilities of
crossover and mutation are pc = 1, pm = 10−2 , respectively. In PBIL, a population
composed of 1000 individuals is used. The probability distribution is updated using
10% of the individuals. The smoothing parameter α is 0.1. Exact results obtained with
the solver SYMPHONY from the COIN-OR project [21] implementing a branch-and-
cut (B&C) approach [22], are also reported for reference. In the strongly correlated
problems it was not possible to find the exact solutions within a reasonable amount
of time.
Table 5.1 Results for the 0-1 Knapsack problem with restrictive capacity
Table 5.1 displays the average profit obtained and the time (in seconds) to reach
a solution for each method. The experiments were performed on an AMD Turion
computer with 1.79 Ghz processor speed and 1 Gb RAM. None of the approximate
methods reaches the optimal profit, which is calculated using an exact branch-and-
cut method. The highest profit obtained by an approximate optimization is high-
lighted in boldface. In all cases, the algorithms that use a set encoding (GA with
RAR crossover and SA) exhibit the best performance. They also require longer
times to reach a solution, especially SA. PBIL obtains good results only in small
uncorrelated knapsack problems. This is explained by the fact that the sampling and
estimation of probability distributions becomes progressively more difficult as the
dimensionality of the problem increases. Furthermore, PBIL assumes statistical in-
dependence between the variables, which makes the algorithm perform worse on
problems in which correlations are present. The standard GA with linear penalty
has a very poor performance in all the knapsack problems analyzed.
argmin_z zT G̃ z, s.t. ∑_{i=1}^{D} zi = k, zi ∈ {0, 1}. (5.11)
The binary variable zi indicates whether classifier i should be selected. The size of
the pruned ensemble, k, is specified beforehand. The selection process is a combi-
natorial optimization problem whose exact solution requires evaluating the performance of each of the (D choose k) subensembles of size k that can be extracted from an ensemble of size D; this number grows exponentially with D. In [31] the solution is approximated in poly-
nomial time by applying semi-definite programming (SDP) to a convex relaxation
of the original problem.
To investigate the performance in the ensemble pruning problem of the opti-
mization methods described in Section 5.2, we generate bagging ensembles for five
representative benchmark problems from the UCI repository: heart, pima, satellite,
waveform and wdbc (Breast Cancer Wisconsin) [32]. The individual classifiers in
the ensemble are trained on different bootstrap samples of the original data [33].
If the classifiers used as base learners are unstable the fluctuations in the bootstrap
sample lead to the induction of different predictors. Assuming that the errors of
these classifiers are uncorrelated, pooling their decisions by majority voting should
improve the accuracy of the predictions. In the experiments performed, bagging en-
sembles of 101 CART trees are built [34]. The original ensemble is pruned to k = 21
decision trees. The strength-diversity measure G, the time consumed in seconds and
the number of evaluations are averaged over 5 ten-fold cross-validations for heart,
pima, satellite and wdbc, and over 50 independent partitions for waveform. The suc-
cess rate is the average over 50 repetitions of the optimization for a given partition
of the data into training and testing sets.
The parameters for the metaheuristic optimization methods are determined in
exploratory experiments using the results of SDP as a gauge. For the GAs, popu-
lations with 100 individuals are evolved using a steady state generational substitu-
tion scheme. The crossover probability is set to 1. The mutation probability is 10−2
for GAs with binary representation and 10−3 for GAs with set representation. The
strength of the penalty term in the GA with linear penalties is β = 400. If the best
individual of the final population does not satisfy the cardinality constraint, a greedy
search is performed to fulfill the restriction. The value w = 1 is used in RAR-GA. A geometric annealing schedule with γ = 0.9 is used in SA. In these experiments, the
best solution in 10 independent executions of the SA algorithm is chosen. For PBIL,
a population of 1000 individuals is generated, where 10% of the individuals are used
to update the probability distribution. The smoothing constant is set to α = 0.1.
The results of the ensemble pruning experiments performed are summarized in
Table 5.2. Most of the optimization methods analyzed reach similar solutions in
all the classification problems considered, with the exception of the standard GA
with linear penalty, which obtains the worst values of the objective function. In
terms of this quantity, the best overall results correspond to SA and SDP. In terms
of efficiency, SDP should be preferred. In machine learning, the relevant measure
of performance is the generalization capacity of the classifiers generated. The test
Table 5.2 Results for the GA, SA and EDA approaches in the ensemble pruning problem
error displayed in the last column of the table provides an estimate of the error rate
in examples that have not been used to train the classifiers. Lower test errors indicate
better generalization capacity. According to this measure the ranking of methods is
rather different: classifiers that were optimal according to the objective function are
suboptimal in terms of their generalization capacity. This indicates that the learning
process is affected by overfitting, because the objective function is estimated on the
training data. Nevertheless, the generalization performance of the pruned ensembles
is very similar for all the optimization methods considered. Table 5.3 shows the test
error of a single CART tree, of a complete bagging ensemble, and the range of values of the test error obtained by pruned bagging ensembles of size k = 21. In all the classification problems considered, pruned ensembles have a lower test error than CART and complete bagging.

Table 5.3 Test errors for CART, standard bagging and pruned bagging
The inputs of the algorithm are r̄, the vector of expected asset returns and Σ, the
covariance matrix of the asset returns. The goal is to determine the optimal weights
of the assets in the portfolio; i.e. the value of w that minimizes the variance of the
portfolio returns (5.12), for a given value of the expected return of the portfolio, R∗
(5.13). The elements of the binary vector z specify whether asset i is included in the
final portfolio (zi = 1) or not (zi = 0). Column vectors x[z] are obtained by remov-
ing from the corresponding vector x those components i for which zi = 0. Similarly,
the matrix A[z] is obtained by eliminating the i-th column of A whenever zi = 0. Fi-
nally, Σ[z,z] is obtained by removing from Σ the rows and columns for which the
corresponding indicator is zero (zi = 0). The symbols 0 and 1 denote vectors of the
appropriate size whose entries are all equal to 0 or to 1, respectively. Minimum and
maximum investment constraints, which set a lower and an upper bound on the in-
vestment of each asset in the portfolio are captured by (5.14). Vectors a and b are
D × 1 column vectors with the lower and upper bounds on the portfolio weights, re-
spectively. Inequality (5.15) summarizes the M concentration of capital constraints.
The m-th row of the M × D matrix A is the vector of coefficients of the linear combi-
nation that defines the constraint. The M × 1 column vectors l and u correspond to the
lower and upper bounds of the M linear restrictions, respectively. Concentration of
capital constraints can be used, for instance, to control the amount of capital invested
in a group of assets, so that investor preferences or limits for investment in certain
asset classes can be formally expressed. Since these constraints are linear, they do
not increase the difficulty of the problem, which can still be solved efficiently by
quadratic programming. Expression (5.16) corresponds to the cardinality constraint,
which limits the number of assets that can be included in the final portfolio. Finally,
equation (5.17) ensures that all the capital is invested in the portfolio.
The cardinality-constrained problem is difficult to solve by standard optimization
techniques. Branch-and-Bound methods can be used to find exact solutions [36]. De-
spite the improvements in efficiency, the complexity of the search is still exponential.
Genetic algorithms have also been used to address this problem: In [37], the perfor-
mance of GAs is compared to SA and to tabu search (TS) [38]. According to this in-
vestigation, the best-performing portfolios are obtained by pooling the results of the
different heuristics. In [39] SA is used to search directly in the space of real-valued
asset weights. Tabu search is employed in [40]. This work focuses on the design
of appropriate neighborhood operators to improve the efficiency of the search. In
[7, 41] Multi-Objective Evolutionary Algorithms (MOEAs) are used to address the problem. These algorithms employ a hybrid encoding instead of a purely continuous one, and heuristic repair mechanisms to handle infeasible individuals. The impact of local search improvements is also investigated in this work. The authors conclude
that the hybrid encoding improves the overall performance of the algorithm.
In the experiments carried out in this investigation, we address the problem of
optimal portfolio selection with lower bounds and cardinality constraints. The pa-
rameters of the constraints considered are li = 0.1, ui = 0.1, i = 1, . . . , D and K = 10.
The performance of the different optimization methods is compared by calculating
the efficient frontier for the problem with and without these constraints. Points on
the efficient frontier correspond to minimum-risk portfolios for a given expected
return, or, alternatively, to portfolios that have the largest expected return from a
family of portfolios with equal risk. As a measure of the quality of the solution ob-
tained, the average relative distance to the unconstrained efficient frontier (without
cardinality and lower bound constraints) is calculated
D = (1/NF ) ∑_{i=1}^{NF} (σic − σi∗ )/σi∗ , (5.18)

where NF = 100 is the number of frontier points considered, σic is the solution of the constrained problem at the i-th point of the frontier, and σi∗ is the solution of the corresponding unconstrained problem.
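In code, (5.18) reduces to a mean of relative excesses over the two frontiers (array names are ours):

import numpy as np

def frontier_distance(sigma_c, sigma_star):
    # Mean relative excess risk of the constrained frontier over the
    # unconstrained one, Eq. (5.18)
    sigma_c = np.asarray(sigma_c, dtype=float)
    sigma_star = np.asarray(sigma_star, dtype=float)
    return float(np.mean((sigma_c - sigma_star) / sigma_star))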
Table 5.4 Results for the GA, SA and EDA approaches in the portfolio selection problem
The expected returns and the covariance matrix of the components of five major
world markets included in the OR-Library [42] are used as inputs for the optimiza-
tion: Hang Seng (Hong-Kong, 31 assets), DAX (Germany, 85 assets), FTSE (UK,
89 assets), Standard and Poor’s (U.S.A., 98 assets) and Nikkei (Japan, 225 assets).
The methods compared are SA, standard GA with linear penalty, standard GA with
heuristic repair, GA with a set representation and RAR (w = 1) crossover, and PBIL.
The SA heuristic is used with a geometric annealing scheme with constant γ = 0.9.
Populations of 100 individuals are used for the GAs. The mutation and crossover
probabilities are pm = 10−2 and pc = 1, respectively. PBIL samples populations
of 400 individuals, 10% of which are used to update the probability distribution.
The heuristic repair scheme performs an unconstrained optimization without the
cardinality constraint, and then either includes in the chromosome those products
with the highest weights or eliminates the products with the smallest weights in the
unconstrained solution, as needed.
Table 5.4 summarizes the results of the experiments. The value of D (5.18) displayed in the third column is the best out of 5 executions of each of the methods considered. The proportion of attempts in which the corresponding optimization algorithm obtains the best known solution is given in the column labeled success rate. The last
two columns report the time employed (in seconds) and the number of quadratic
optimizations performed, respectively. In terms of the quality of the obtained solu-
tions, using a binary encoding with linear penalties performs worse than all the other
approximate methods. By contrast, the heuristic repair scheme identifies the best of
the known solutions in all the problems investigated. GA with a set representation
and RAR (w = 1) crossover has also an excellent performance and is slightly more
efficient on average. High quality solutions are also obtained by SA, albeit at higher
computational cost. PBIL performs well only in problems in which the number of
assets considered for investment is small. As the dimensionality of the problem in-
creases, sampling and estimation of the probability distribution in algorithms of the
EDA family become less effective.
where T is the length of the time series considered, D is the number of constituents
of the index, r j (t) is the return of asset j at time t and rt is the return of the index
at time t. Restriction (5.20) is a budget constraint, which ensures that all the cap-
ital is invested in the portfolio. Investment concentration constraints are captured
by (5.21). Expression (5.22) reflects lower and upper bound constraints. The binary
variables {z1 , z2 , . . . , zD } indicate whether an asset is included or excluded from the
tracking portfolio. Note that when zi = 0, the lower and upper bounds for the weight
of asset i are both equal to zero, which effectively excludes this asset from the
investment. The cardinality constraint is expressed by Eq.(5.23).
Index tracking has been extensively investigated in the literature. The hybrid GA
with set encoding and RAR crossover described in Section 5.2 is used in [4]. In-
stead of the tracking error, this work minimizes the variance of the difference be-
tween the returns of the index and of the tracking portfolio. Optimal impulse control
techniques are used in [43]. In [44] the problem is solved by using the threshold ac-
cepting (TA) heuristic, which is a deterministic analogue of simulated annealing, in
which transitions are rejected only when they lead to a deterioration in performance
that is above a specified threshold. Evolutionary algorithms with real-valued chro-
mosome representations are used in [45]. This investigation focuses on the influence
of transaction costs and portfolio rebalancing. In [46] the portfolio optimization and
index tracking problems are addressed by means of a heuristic relaxation method
that consists in solving a small number of convex optimization problems with fixed
transaction costs. Hybrid optimization approaches to minimizing the tracking error by partial replication are also investigated in [47, 48, 49].
In the current investigation, publicly available benchmark data from the OR-
Library [42] is used to compare the optimization techniques described in Section
5.2. Five major world market indices are used in the experiments: Hang-Seng, DAX,
FTSE, S&P and Nikkei. For each index, the time series of 290 weekly returns for
the index and for its constituents are given. From these data, the first 145 values
are used to create a tracking portfolio that includes a maximum of K = 10 assets.
The last 145 values are used to measure the out-of-sample tracking error. The pop-
ulation sizes are 350 for the GAs and 1000 for PBIL. The values of the remaining
parameters coincide with those used in the portfolio selection problem.
Table 5.5 presents a summary of the experiments performed. The best out of 5 executions of each optimization method is reported. GA with random repair
obtains the best overall results. GA with set encoding and RAR (w = 1) crossover
matches these results except in Nikkei, which is the index with the largest number of
constituents. PBIL also has a good performance, but the computational cost is higher
than for the other algorithms. In fact, the algorithm reached the maximum number
of optimizations established without converging. The results of SA and GA with
binary encoding and linear penalty are suboptimal in all but the simplest problems.
They also exhibit low success rates. In all problems investigated, the out-of-sample
error is typically larger than the in-sample error, but of the same order of magnitude.
Table 5.5 Results for the GA, SA and EDA approaches in the index tracking problem
where Σ is the data covariance matrix. As in the previous problems, the elements of
the binary vector z encode whether the principal component has a non-zero projec-
tion along the corresponding direction. Once the first principal component has been
found, if more principal components are to be calculated, the covariance matrix Σ
is deflated as follows:

Σ ← Σ − (wT Σ w) w wT , (5.27)
and a new problem of the form given by (5.24), defined now in terms of this de-
flated covariance matrix is solved. The decomposition stops after a maximum of
Rank(Σ) iterations. In practice, the number of principal components is either spec-
ified beforehand or determined by the percentage of the total variance of the data
explained.
The problem of finding sparse principal components has also received a fair
amount of attention in the recent literature. Greedy search is used in [50]. In [51]
SPCA is formulated as a regression problem, so that LASSO techniques [52] can be
used to favor sparse solutions. In LASSO, an L1-norm penalty on non-zero values of the factor loadings is used. A higher weight of the penalty term in the objective function induces sparser models. However, it is not possible to exert direct control over the number of non-zero coefficients in the solution. The cardinality
constraint is explicitly considered in [53], which uses a method based on solving a
relaxation of the problem by semidefinite programming (SDP).
To compare the performance of the different methods analyzed, we use the bench-
mark problem introduced in [54]. Consider the sparse vector v, whose components
are
$$v_i = \begin{cases} 1, & \text{if } i \le 50,\\ 1/(i-50), & \text{if } 50 < i \le 100,\\ 0, & \text{otherwise.} \end{cases} \qquad (5.28)$$
A covariance matrix is built from this vector and from U, a square matrix of dimensions 150 × 150 whose elements are U[0, 1] random variables:
$$\Sigma = \sigma\, \mathbf{v}\mathbf{v}^{T} + U^{T}U, \qquad (5.29)$$
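The benchmark construction of (5.28)–(5.29) is easy to reproduce; in the sketch below the value of σ is an assumption, since it is not specified in this excerpt.

```python
import numpy as np

def benchmark_covariance(sigma, n=150, seed=0):
    """Sigma = sigma * v v' + U' U of eq. (5.29), with v from eq. (5.28)
    and U an n x n matrix of U[0, 1] random variables."""
    v = np.zeros(n)
    v[:50] = 1.0                                  # i <= 50
    i = np.arange(51, 101, dtype=float)
    v[50:100] = 1.0 / (i - 50.0)                  # 50 < i <= 100
    U = np.random.default_rng(seed).uniform(0.0, 1.0, size=(n, n))
    return sigma * np.outer(v, v) + U.T @ U, v

Sigma, v_true = benchmark_covariance(sigma=10.0)  # sigma value assumed
```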
best 10% of the individuals are used to update the probability distribution. The first
sparse principal component is then calculated. For each of the methods that involve
stochastic search (all except DSPCA), the best out of 5 independent executions of
the algorithm is taken. The figure below displays the variance explained by the first sparse
principal component as a function of its cardinality K = 1, 2, . . . , 140, for all the
methods considered. GA using a linear penalty does not obtain good solutions in
this high-dimensional problem. PBIL performs slightly better, but is clearly infe-
rior to SA, GAs with random repair, GA with set encoding and DSPCA. Table 5.6
shows the detailed results for cardinality K = 50, which is the cardinality of the true
hidden pattern. In this table, the largest value of the variance achieved is highlighted
in bold. The success rates, the computation times on an AMD Turion machine with a 1.79 GHz processor and 1 GB of RAM, and the total number of optimizations are also given. Times for the DSPCA algorithm are not given because a
MATLAB implementation was used [54], which cannot be directly compared with
the other results, obtained with code written in C. The GA with set encoding and
RAR (w = 1) crossover and the GA with binary encoding and random repair obtain
the best results and explain more variance than the solution obtained by DSPCA.
The first of these methods is slightly faster. SA is very fast and achieves a result
that is only slightly worse with a success rate of 100%. PBIL and GA with binary
encoding and linear penalty obtain solutions that are clearly inferior.
[Figure: Variance explained by the first sparse principal component as a function of cardinality (25–150), for GA with linear penalty, GA with random repair, GA with RAR crossover, SA, PBIL and DSPCA.]
Table 5.6 Results for the GA, SA, EDA and SDP approaches in the synthetic problem for
K = 50
5.4 Conclusions
Many tasks of practical interest can be formulated as optimization problems with
cardinality constraints. The examples analyzed in this article arise in various fields
of application: ensemble pruning, optimal portfolio selection, financial index track-
ing and sparse principal component analysis. They are large optimization problems
whose solution by standard optimization methods is computationally expensive. In
practice, using exact methods like branch-and-bound is feasible only for small prob-
lem instances. A practicable alternative is to use approximate optimization methods
that can identify near-optimal solutions at a lower computational cost: Genetic al-
gorithms, simulated annealing and estimation of distribution algorithms. However,
the search operators used in the standard formulations of these techniques are ill-
suited to the problem because they do not preserve the cardinality of the candi-
date solutions. This means that either ad-hoc penalization or repair mechanisms are
needed to enforce the constraints. Including penalty terms in the objective func-
tion distorts the search and generally leads to suboptimal solutions. Applying repair
mechanisms to infeasible configurations provides a more elegant and effective ap-
proach to the problem. Nonetheless, the best option is to use a set representation,
in conjunction with specially designed search operators that preserve the cardinality
of the candidate solutions. Some of the problems considered, such as the knapsack problem and ensemble pruning, are purely combinatorial optimization tasks. In problems like portfolio selection, index tracking and sparse PCA, both combinatorial and
continuous aspects are present. For these we advocate the use of hybrid methods
that separately handle the combinatorial and the continuous aspects of cardinality-
constrained optimization problems. Among the approximate methods considered,
a genetic algorithm with set encoding and RAR crossover obtains the best overall
performance. In problems where the comparison was possible, the solutions ob-
tained are close to the exact ones and to those identified by approximate methods
that use semidefinite programming. Using the same encoding, simulated annealing
also obtains fairly good solutions, generally at a higher computational cost. This
indicates that the RAR crossover operator seems to enhance the search by introducing in the population individuals that effectively combine advantageous features of their parents.
Acknowledgments
This research has been supported by Dirección General de Investigación (Spain),
project TIN2007-66862-C02-02.
References
1. Gill, P.E., Murray, W., Saunders, M.A., Wright, M.H.: Inertia-controlling methods for
general quadratic programming. SIAM Review 33, 1–36 (1991)
2. Gill, P., Murray, W.: Quasi-Newton methods for unconstrained optimization. IMA Journal
of Applied Mathematics 9 (1), 91–108 (1972)
3. Adler, I., Karmarkar, N., Resende, M.G.C., Veiga, G.: An implementation of Kar-
markar’s algorithm for linear programming. Mathematical Programming 44, 297–335
(1989)
4. Shapcott, J.: Index tracking: genetic algorithms for investment portfolio selection. Tech-
nical report, EPCC-SS92-24, Edinburgh, Parallel Computing Centre (1992)
5. Radcliffe, N.J.: Genetic set recombination. Foundations of Genetic Algorithms. Morgan
Kaufmann Publishers, San Francisco (1993)
6. Coello, C.: Theoretical and numerical constraint-handling techniques used with evolu-
tionary algorithms: a survey of the state of the art. Computer Methods in Applied Me-
chanics and Engineering 191, 1245–1287 (2002)
7. Streichert, F., Ulmer, H., Zell, A.: Evaluating a hybrid encoding and three crossover
operators on the constrained portfolio selection problem. In: Proceedings of the Congress
on Evolutionary Computation (CEC 2004), vol. 1, pp. 932–939 (2004)
8. Kirkpatrick, S., Gelatt Jr., C.D., Vecchi, M.P.: Optimization by simulated annealing. Science 220(4598), 671–680 (1983)
9. Goldberg, D.: Genetic Algorithms in Search, Optimization and Machine Learning.
Addison-Wesley, Reading (1989)
10. Moral-Escudero, R., Ruiz-Torrubiano, R., Suarez, A.: Selection of optimal investment
portfolios with cardinality constraints. In: Proceedings of the IEEE World Congress on
Evolutionary Computation, pp. 2382–2388 (2006)
11. Radcliffe, N.J.: Equivalence class analysis of genetic algorithms. Complex Systems 5,
183–205 (1991)
12. Larrañaga, P., Lozano, J.A. (eds.): Estimation of Distribution Algorithms: A New Tool
for Evolutionary Computation. Kluwer Academic Publishers, Dordrecht (2002)
13. Baluja, S.: Population-based incremental learning: A method for integrating genetic
search based function optimization and competitive learning. Technical Report CMU-
CS-94-163, Carnegie Mellon University (1994)
14. Muehlenbein, H.: The equation for response to selection and its use for prediction. Evo-
lutionary Computation 5, 303–346 (1998)
15. Kellerer, H., Pferschy, U., Pisinger, D.: Knapsack Problems. Springer, Heidelberg (2004)
16. Miller, R.E., Thatcher, J.W. (eds.): Reducibility among combinatorial problems, pp. 85–
103. Plenum Press (1972)
17. Pisinger, D.: Where are the hard knapsack problems? Computers & Operations Research 32, 2271–2284 (2005)
18. Simões, A., Costa, E.: An evolutionary approach to the zero/one knapsack problem: Test-
ing ideas from biology. In: Proceedings of the Fifth International Conference on Artificial
Neural Networks and Genetic Algorithms, ICANNGA (2001)
19. Ku, S., Lee, B.: A set-oriented genetic algorithm and the knapsack problem. In: Proceed-
ings of the IEEE World Congress on Evolutionary Computation, CEC 2001 (2001)
20. Michalewicz, Z.: Genetic Algorithms + Data Structures = Evolution Programs. Springer,
Heidelberg (1996)
21. Ladanyi, L., Ralphs, T., Guzelsoy, M., Mahajan, A.: SYMPHONY (2009),
https://projects.coin-or.org/SYMPHONY
22. Padberg, M.W., Rinaldi, G.: A branch-and-cut algorithm for the solution of large scale
traveling salesman problems. SIAM Review 33, 60–100 (1991)
23. Dietterich, T.G.: An experimental comparison of three methods for constructing ensem-
bles of decision trees: Bagging, boosting, and randomization. Machine Learning 40,
139–157 (2000)
24. Margineantu, D.D., Dietterich, T.G.: Pruning adaptive boosting. In: Proc. of the 14th
International Conference on Machine Learning, pp. 211–218. Morgan Kaufmann, San
Francisco (1997)
25. Caruana, R., Niculescu-Mizil, A., Crew, G., Ksikes, A.: Ensemble selection from li-
braries of models. In: Proc. of the 21st International Conference on Machine Learning,
p. 18. ACM Press, New York (2004)
26. Banfield, R.E., Hall, L.O., Bowyer, K.W., Kegelmeyer, W.P.: Ensemble diversity mea-
sures and their application to thinning. Information Fusion 6, 49–62 (2005)
27. Martı́nez-Muñoz, G., Lobato, D.H., Suárez, A.: An analysis of ensemble pruning tech-
niques based on ordered aggregation. IEEE Transactions on Pattern Analysis and Ma-
chine Intelligence 31, 245–259 (2009)
28. Zhou, Z.H., Wu, J., Tang, W.: Ensembling neural networks: Many could be better than
all. Artificial Intelligence 137, 239–263 (2002)
29. Zhou, Z.H., Tang, W.: Selective ensemble of decision trees. In: Liu, Q., Yao, Y., Skowron,
A. (eds.) RSFDGrC 2003. LNCS (LNAI), vol. 2639, pp. 476–483. Springer, Heidelberg
(2003)
30. Hernández-Lobato, D., Hernández-Lobato, J.M., Ruiz-Torrubiano, R., Valle, Á.: Pruning
adaptive boosting ensembles by means of a genetic algorithm. In: Corchado, E., Yin,
H., Botti, V., Fyfe, C. (eds.) IDEAL 2006. LNCS, vol. 4224, pp. 322–329. Springer,
Heidelberg (2006)
31. Zhang, Y., Burer, S., Street, W.N.: Ensemble pruning via semi-definite programming.
Journal of Machine Learning Research 7, 1315–1338 (2006)
32. Asuncion, A., Newman, D.: UCI machine learning repository (2007)
33. Breiman, L.: Bagging predictors. Machine Learning 24, 123–140 (1996)
34. Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J.: Classification and Regression
Trees. Chapman & Hall, New York (1984)
35. Markowitz, H.: Portfolio selection. Journal of Finance 7, 77–91 (1952)
36. Bienstock, D.: Computational study of a family of mixed-integer quadratic program-
ming problems. In: Balas, E., Clausen, J. (eds.) IPCO 1995. LNCS, vol. 920. Springer,
Heidelberg (1995)
37. Chang, T.J., Meade, N., Beasley, J.E., Sharaiha, Y.M.: Heuristics for cardinality con-
strained portfolio optimisation. Computers and Operations Research 27, 1271–1302
(2000)
38. Glover, F.: Future paths for integer programming and links to artificial intelligence. Com-
puters and Operations Research 13, 533–549 (1986)
39. Crama, Y., Schyns, M.: Simulated annealing for complex portfolio selection problems.
Technical report, Groupe d’Etude des Mathematiques du Management et de l’Economie
9911, Université de Liège (1999)
40. Schaerf, A.: Local search techniques for constrained portfolio selection problems. Com-
putational Economics 20, 177–190 (2002)
41. Streichert, F., Tamaka-Tamawaki, M.: The effect of local search on the constrained port-
folio selection problem. In: Proceedings of the IEEE World Congress on Evolutionary
Computation (CEC 2006), Vancouver, Canada, pp. 2368–2374 (2006)
42. Beasley, J.E.: OR-Library: Distributing test problems by electronic mail. Journal of the
Operational Research Society 41(11), 1069–1072 (1990)
43. Buckley, I., Korn, R.: Optimal index tracking under transaction costs and impulse con-
trol. International Journal of Theoretical and Applied Finance 1(3), 315–330 (1998)
44. Gilli, M., Këllezi, E.: Threshold accepting for index tracking. Computing in Economics
and Finance 72 (2001)
45. Beasley, J.E., Meade, N., Chang, T.: An evolutionary heuristic for the index tracking
problem. European Journal of Operational Research 148(3), 621–643 (2003)
46. Lobo, M., Fazel, M., Boyd, S.: Portfolio optimization with linear and fixed transaction
costs. Annals of Operations Research, special issue on financial optimization 152(1),
376–394 (2007)
47. Jeurissen, R., van den Berg, J.: Index tracking using a hybrid genetic algorithm. In: ICSC
Congress on Computational Intelligence Methods and Applications 2005 (2005)
48. Jeurissen, R., van den Berg, J.: Optimized index tracking using a hybrid genetic algo-
rithm. In: Proceedings of the IEEE World Congress on Evolutionary Computation (CEC
2008), pp. 2327–2334 (2008)
49. Ruiz-Torrubiano, R., Suárez, A.: A hybrid optimization approach to index tracking. Ac-
cepted for publication in Annals of Operations Research (2007)
50. Moghaddam, B., Weiss, Y., Avidan, S.: Spectral bounds for sparse PCA. In: Advances in
Neural Information Processing Systems, NIPS 2005 (2005)
51. Zou, H., Hastie, T., Tibshirani, R.: Sparse principal component analysis. Journal of Com-
putational and Graphical Statistics 15(2), 265–286 (2006)
52. Tibshirani, R.: Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society B 58, 267–288 (1996)
53. d’Aspremont, A., Ghaoui, L.E., Jordan, M., Lanckriet, G.: A direct formulation for
sparse PCA using semidefinite programming. SIAM Review 49(3), 434–448 (2007)
54. d’Aspremont, A., Bach, F., Ghaoui, L.E.: Optimal solutions for sparse principal compo-
nent analysis. Journal of Machine Learning Research 9, 1269–1294 (2008)
55. d’Aspremont, A., Ghaoui, L.E., Jordan, M., Lanckriet, G.: MATLAB code for DSPCA
(2008), http://www.princeton.edu/˜aspremon/DSPCA.htm
Chapter 6
Learning Global Optimization through a
Support Vector Machine Based Adaptive
Multistart Strategy
Jayadeva, S. Shah, and S. Chandra
does not assume any specific properties. We also discuss some real world applica-
tions of GOSAM involving constrained and design optimization problems.
collected about the function, it is possible to generate a start state that is better than
a random one. If the answer is in the affirmative, then successive iterations will lead
us closer to the global minimum.
Evolutionary algorithms like Particle Swarm optimization (PSO), Genetic Al-
gorithms (GA) and Ant Colony optimization (ACO), are distributed iterative search
algorithms, which indirectly use some form of information about the space explored
so far, to direct search. Initially, there is a finite number of “agents” that search for
the global optimum. The paths of these agents are dynamically and independently
updated during the search based on the results obtained till the current update.
PSO, developed by Kennedy and Eberhart [21], is inspired by the flocking behav-
ior of birds. In PSO, particles start search in different regions of the solution space,
and every particle is made aware of the best local optimum amongst those found
by its neighbors, as well as the global optimum obtained up to the current iteration.
Each particle then iteratively adapts its path and velocity accordingly. The algorithm
converges when a particle finds a highly desirable path to proceed on, and the other
particles effectively follow its lead.
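For reference, one particle update in the scheme just described might look as follows; this is the standard inertia-weight PSO form with conventional default coefficients, not values taken from this chapter.

```python
import numpy as np

def pso_step(x, v, pbest, gbest, w=0.7, c1=2.0, c2=2.0, rng=None):
    """One PSO velocity/position update: the particle is pulled toward
    its own best position (pbest) and the best position found by the
    swarm so far (gbest), with random scaling of each pull."""
    rng = rng or np.random.default_rng()
    r1 = rng.uniform(size=x.shape)
    r2 = rng.uniform(size=x.shape)
    v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x)
    return x + v, v
```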
Genetic Algorithms [12] are motivated by the biological evolutionary opera-
tions of selection, mutation and crossover. In real life, the fittest individuals tend
to survive, reproduce and improve over generations. Based on this, “chromosomes”
that yield better optima are considered to correspond to fitter individuals, and are
used for creating the next generation of chromosomes that hopefully lead us to bet-
ter optima. The population of chromosomes is updated till convergence, or until a
specified number of updates is completed.
Ant colony optimization [10] mimics the behavior of a group of ants following
the shortest path to a food source. Ants (agents) exchange information indirectly,
through a mechanism called “stigmergy”, by leaving a trail of pheromone on the
paths traversed. States believed to be good are marked by heavier concentrations
of pheromone to guide the ants that arrive later. Therefore, decisions that are taken
subsequently get biased by previous decisions and their outcomes.
Some heuristic techniques take an alternative approach, guiding further search by applying machine learning techniques to past search results. Machine learning
techniques help in discovering relationships by analyzing the search data that other
techniques may ignore. If any relationship exists, then it could be exploited to re-
duce search time or improve the quality of optima. For this task, some papers try
to understand the structure of the search space, while others try to tune algorithms
accordingly (cf. [4] for a survey of these algorithms).
Boyan used information about the complete trajectories to local minima, and the corresponding values of the local minima reached, to construct evaluation functions [8, 9]. The minimum of the evaluation function determined a new starting point.
Optimal solutions were obtained for many combinatorial optimization problems like
bin packing, channel routing, etc.
Agakov et al. [3] gave a compiler optimization algorithm that trains on a set of
computer programs and predicts which parts of the optimization space are likely
to give large performance improvements for programs. Boese et al. [6] explored the
use of local minima to adapt the optimization algorithm. For graph bisection and the
TSP, they found a “big valley” structure to the set of minima. Using this information
they were able to hand-code a strategy to find good starting states for these problems. Is this possible for other problems as well? The proposed work is motivated by the question: for any general global optimization problem, is there a structure to the set of local optima? If so, can it be learnt automatically through the use of machine learning?
We propose a new algorithm for the general global optimization problem, termed
as Global Optimization using Support vector regression based Adaptive Multistart
(GOSAM). GOSAM attempts to learn the structure of local minima based on local
minima discovered during earlier local searches. GOSAM uses Support Vector Ma-
chine based learning to learn a Fit function (regressor) that passes through all the
local minima, thereby learning the structure of the locations of local minima. Since
the regressor can only learn the structure of the local minima encountered till the
present iteration, the idea is that the regressor for local minima will generalize well
to the local minima not obtained so far in the search. Consequently, its minimum
would be a ‘crude approximation’ to the global minimum of the objective function.
In the next iteration the search for the local minimum is started from the minimum of
this regressor. The new local minimum obtained is added as a new training point and
the Fit function is re-constructed. Over time, this approximation gets better, leading
the search towards regions that yield better local minima and eventually the global
minimum. Surprisingly, for most problems this algorithm tends to direct search to
the region containing the global minimum in just a few iterations and is significantly
faster than other methods. The results reinforce our belief that many problems have
some ‘structure’ to the location of local minima, which can be exploited in direct-
ing further search. It is important to emphasize that GOSAM’s approach is differ-
ent from approximating a fitness landscape; GOSAM attempts to predict how local
minima are distributed, and where the best one might lie. This turns out to be very
efficient in practice.
In this chapter, we wish to demonstrate the same by testing on many benchmark
global optimization problems against established evolutionary methods. The rest of
the chapter is organized as follows. Section 6.2 discusses the proposed algorithm.
Section 6.3 is devoted to GOSAM’s performance on benchmark optimization prob-
lems, as well as a comparison with GA and PSO. Section 6.4 extends the algorithm
for constrained optimization problems. Section 6.5 demonstrates how the algorithm
may be applied to design optimization problems. Section 6.6 is devoted to a general
discussion on the convergence of GOSAM to the global optimum, while Section 6.7
contains concluding remarks.
feasible region is the complete search space that lies within the lower and upper
bounds of all variables.
We now summarize the flow of the GOSAM algorithm. At each iteration, the algorithm performs a local search, starting from a location termed the start-state. The algorithm then iteratively determines the start-state for the next iteration.
obtained till the current iteration are treated as training samples, and their corre-
sponding function values as the target values. The regressor obtained in Step 4,
termed as the current Fit function, approximates local minima of the objective func-
tion. In the limit that all local minima are known, SVR will construct a regressor
that passes through all local minima of the objective function. The global minimum
of this function would then correspond to the global minimum of the original objec-
tive function. If we knew all the local minima, regression would not be required and one could easily determine the best local minimum. We utilize only the information
of the few local minima obtained through local search till the current iteration. We
then rely on the excellent generalization properties of SVRs to predict how the local
minima are distributed. Search is redirected to the region containing the minimum
of the regressor or the Fit function. Because of the limited size of the training set,
this regressor will not be an exact approximation of the local minima of the objective
function. However, over successive iterations, the Fit function tends to better local-
ize the global minimum of the function being optimized. This is demonstrated by
the experiments presented in Section 6.3, which show that the ‘predictor’ turns out to
be so good that search terminates in the global minimum within very few iterations.
Apart from the generalization ability of SVRs, which is imperative in predicting
better starting points and finding the global optimum quickly, the choice of using
SVR for function approximation is also motivated by the fact that the regressors ob-
tained using SVR are generally very simple and can be constructed by using only a
few support vectors. Since minimization of the Fit function requires evaluating it at
several points, the use of only the support vectors contributes to computational ef-
ficiency. Regardless of the complexity of the kernel used, the optimization problem
that needs to be solved remains a convex quadratic one, because only a kernel matrix
that contains an inner product between the points is required. The meagre amounts
of data to be fit, i.e. the small number of local minima and their corresponding func-
tion values, also contribute to making the process fast and efficient.
In step 5, we minimize the Fit obtained and reset the start-state to its minimum.
If the local minimum obtained from this start-state is the same as the one obtained
in the previous r iterations, or out of bounds, then we conclude that the search has
become too localized, and needs to explore other regions to discover new minima.
In such a case, we reset the start-state to a random state.
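Putting these steps together, a minimal sketch of the GOSAM loop is given below. It uses an off-the-shelf SVR and Nelder-Mead local search in place of the authors' implementation; the SVR settings echo those reported later in this chapter, while the multistart minimization of the Fit function, the tolerances and the exact restart test are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.svm import SVR

def gosam(f, bounds, n_iter=30, r=2, seed=0):
    """Minimal sketch of the GOSAM loop (not the authors' code)."""
    rng = np.random.default_rng(seed)
    lo, hi = np.array(bounds).T
    dim = len(lo)
    minima, values = [], []
    start = rng.uniform(lo, hi)
    for _ in range(n_iter):
        # Local search from the current start-state.
        x_loc = np.clip(minimize(f, start, method="Nelder-Mead").x, lo, hi)
        minima.append(x_loc)
        values.append(f(x_loc))
        # Fit the SVR "Fit function" through all local minima found so far.
        fit = SVR(kernel="poly", degree=2, coef0=1.0, C=1000.0, epsilon=1e-3)
        fit.fit(np.array(minima), np.array(values))
        # The minimum of the Fit function becomes the next start-state
        # (here located by a simple multistart over random guesses).
        guesses = rng.uniform(lo, hi, size=(32, dim))
        cand = min((minimize(lambda x: fit.predict(x.reshape(1, -1))[0],
                             g, method="Nelder-Mead") for g in guesses),
                   key=lambda res: res.fun).x
        # Restart rule: if the proposed start is out of bounds or repeats
        # the last r minima, fall back to a random start-state.
        repeats = [np.allclose(cand, m, atol=1e-3) for m in minima[-r:]]
        out = np.any(cand < lo) or np.any(cand > hi)
        start = rng.uniform(lo, hi) if (out or any(repeats)) else cand
    best = int(np.argmin(values))
    return minima[best], values[best]

# Usage on the 1-D example of Fig. 6.1: f(x) = (|x| - 10) cos(2 pi x).
x_star, f_star = gosam(lambda x: (abs(x[0]) - 10) * np.cos(2 * np.pi * x[0]),
                       bounds=[(-10, 10)])
```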
Fig. 6.1 Iteration 1 for the global minimization of the objective function f (x) = (|x| −
10)cos(2π x). Local search started from a random start state given by x = -1.3444 (indicated
by the circle) and terminated at the local minimum x = -1.0 where f (x) = -9.0. Using only
this one local minimum in the training set, the regressor obtained till the end of iteration 1 is
shown by the solid line
The initial randomly chosen starting state is x = -1.3444. This is shown as the
circled point in Fig. 6.1. Local search from this point led to the local minimum at
x = -1.0, indicated by a square in Fig. 6.1. At this point, the objective function has a
value of f (−1.0) = -9.0. Using only one local minimum in the training set, the SVR
regressor that was obtained is shown by the solid line parallel to the x-axis. Since
this regressor has a constant function value, its minimum is the same everywhere;
therefore, any random point can be selected as the minimum. In our simulation,
the random point returned was x = -6.3. Local search from this point terminated at
the local minimum x = -6.0. The regressor obtained using these two points led to
a minimum at the boundary. In cases when the minimum is at a boundary, we find
that one can start the next local search from either the boundary point, or from
a random new starting point. The search for the global optimum was not ham-
pered by either choice. However, the results reported here are based on a random
restart in such cases. In this simulation, search was restarted from a random point at
x = 4.483.
Fig. 6.2 Iteration 3 for the global minimization of the objective function f (x) = (|x| −
10)cos(2π x). Local search started from a random start state given by x = 4.483 (indicated
by the circle) and terminated at the local minimum at x = 4.0. The regressor obtained using
the three points, depicted by squares, is shown as the solid concave curve. The minimum of
this curve lies at x = -0.8422
In the third iteration, shown in Fig. 6.2, local search is started from x = 4.483, de-
picted by a circle. The local minimum was found to be at x = 4.0, and is depicted by
a square in the figure. When the information of these three local minima was used,
the SVR regressor shown as the solid concave curve was obtained. The minimum of
this curve lies at x = -0.8422.
The start state for iteration 4 was given by the minimum of the regressor obtained
in the previous iteration, given by x = -0.8422. This point is depicted as a circle in
Fig. 6.3. The local minimum obtained from this starting state is again depicted as the
square at the end of the slope. The regressor obtained using these four local minima
is shown as a bowl shaped curve, the minimum of which is located at x = −0.1130.
In the next iteration, depicted in Fig 6.4, local search from x = −0.1130, depicted
by a circle, led us to the global minimum at x = 0.0, depicted by a square in the
figure.
Fig. 6.3 Iteration 4 for the global minimization of the objective function f (x) = (|x| −
10)cos(2π x). Local search started from the minimum of the regressor obtained in the
previous iteration, given by x = -0.8422 (indicated by the circle). It terminated at the local
minimum at x = -1.0. The regressor obtained using the four local minima obtained till the
current iteration, depicted by squares, is shown as the solid convex shaped curve. The mini-
mum of this curve lies at x = −0.1130
Fig. 6.4 Iteration 5 for the global minimization of the objective function f (x) = (|x| −
10)cos(2π x). Local search started from the start state given by the minimum of the regressor
obtained at the end of the previous iteration, given by x = -0.1130 (indicated by the circle)
and terminated at the local minimum at x = 0.0 where f (x) = -10.0. The regressor obtained
using all the local minima obtained, depicted by squares, is shown as the solid convex curve
Fig. 6.5 Ackley’s function. A huge number of local minima are seen that obstruct the search
for the global minimum at (0, 0)
Figures 6.6 through 6.8 show the plots of the regressor function obtained after
iterations 2, 3, and 4 respectively. Note that though both figures 6.7 and 6.8 look
similar, there is a difference in the locations of their minima. The minimum of the
bowl shaped Fit function of Fig. 6.8, when used as the start state for the next local minimization procedure, led to the global minimum of Ackley’s function.
Fig. 6.6 Regressor obtained after iteration 2, while optimizing Ackley’s function
Fig. 6.7 Regressor obtained after iteration 3, while optimizing Ackley’s function
Fig. 6.8 Regressor obtained after iteration 4, while optimizing Ackley’s function. Local
search starting from the minimum of this Fit function led to the global minimum
2 to 100. The Particle Swarm optimization toolbox was obtained from [30], while
the Genetic Algorithm optimization toolbox (GAOT) is the one available at [16].
The next start-state in PSO and GA is obtained by simple mathematical or logical
operations, whereas for GOSAM it is generated after determining the SVR followed
by minimization of a quadratic problem. Therefore, an iteration of GOSAM takes
more time than an iteration of either of these algorithms. Moreover, GA and PSO
run a number of agents in parallel, whereas the current implementation of GOSAM
is a sequential one. However, the difference in the number of function evaluations
required is so dramatic that GOSAM always found the global minimum significantly
faster.
In all our experiments, we evaluated the three algorithms on three different per-
formance criteria. The first criterion is the number of function evaluations required
to reach the global optimum. The second criterion is the number of times the global
optimum is reached in 20 runs, each from a randomly chosen starting point. The
third measure is the average CPU time taken to reach the global optimum.
Table 6.1 presents the results obtained. Each value indicated in the table is the av-
erage over 20 runs of the corresponding algorithm. For each run, the initial start
state of all the algorithms was the same randomly chosen point. The reported re-
sults have been obtained on a PC with a 1.6 GHz processor and 512 MB RAM. The
first row in the evaluation parameter for each benchmark function (Fn. Evals.) gives
the average number of function evaluations required by each algorithm to find the
global optimum. The number of times that the global optimum was obtained out of
the 20 runs is given in the second row (GO. Obtd.). If the global minimum was not
obtained in all runs, then the average value and the standard deviation of the best
optima obtained over all the runs are reported within parentheses. The third row (T (s)) indicates the average time taken in seconds by each algorithm in a run.
Though any number of local minima may be used for building a predictor, we
used a maximum of 100 local minima. The 101st local minimum overwrote the first one obtained, and so on. In each case, the search was confined to lie
within a box [−10, 10]n where n is the dimension. In all our experiments, we used
the conventional SVR framework [14]. Techniques such as SMO [28] or online SVMs [7] could be used to speed up the training process further. Our focus
in this work is to show the use of machine learning techniques to help predict the
location of better local minima.
The parameters for GA and PSO (c1 = 2, c2 = 2, c3 = 1, chi = 1, and swarm
size = 20) were kept the same as the default ones. For GOSAM, the SVR parameters
were taken to be ε = 10−3, C = 1000, and the kernel a degree-two polynomial with t = 1.
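These settings map onto a common SVR implementation roughly as below; reading t as the polynomial kernel's constant offset (coef0) is our assumption.

```python
from sklearn.svm import SVR

# Reported GOSAM settings: epsilon = 1e-3, C = 1000, degree-two
# polynomial kernel with t = 1 (t interpreted here as the kernel
# offset coef0 -- an assumption).
svr = SVR(kernel="poly", degree=2, coef0=1.0, C=1000.0, epsilon=1e-3)
```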
Table 6.1 shows that GOSAM consistently outperforms both PSO and GA by a large margin. This difference is dramatically highlighted in higher dimensions.
Finding the global minimum becomes increasingly difficult as the dimension n in-
creases; PSO and GA fail to find the global optimum in many cases, despite a large
number of function evaluations. However, GOSAM always found the global mini-
mum after a relatively small number of function evaluations (the count for function
evaluation for GOSAM also includes the number of times the objective function
was evaluated during local search). We believe that this result is significant, because
it shows that GOSAM scales very effectively to large dimensional problems. The
experimental results strikingly demonstrate that GOSAM not only finds the global
optimum consistently, but also does so with significantly fewer function evaluations.
Table 6.1 Comparison of GOSAM with PSO and GA on Difficult Benchmark Problems
‡ The global optimum was not obtained in all the 20 runs. The value in the corresponding
parentheses indicates the mean and the standard deviation of the quality of global minima
obtained in the 20 runs.
† The global optimum obtained was not within the specified bounds.
where a(x) is the objective function, and $g_i(x)$, for i = 1, . . . , M, are the M constraints.
One kind of SUMT, the quadratic penalty function method, minimizes a sequence of functions (p = 1, 2, . . .) of the form
$$F_p(x) = a(x) + \sum_{i=1}^{M} \alpha_{pi}\, \max\left(0,\, g_i(x)\right)^2, \qquad (6.3)$$
where α pi is a scalar weight, and p is the problem number. The minimizer for the pth
problem in the sequence forms the guess or starting point for the (p + 1)th problem.
The scalars change from one problem to the next based on the rule that $\alpha_{pi} \ge \alpha_{(p-1)i}$; they are typically increased geometrically, by say 10%. These weights
indicate the relative emphasis of the constraints and the objective function.
In the limit, as the penalty weights become overwhelmingly large, the sequence of minima of the unconstrained problems converges to a solution of the original constrained optimization problem. We now illustrate the use of SUMT through the
application of GOSAM to the graph coloring problem.
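A minimal sketch of the quadratic-penalty SUMT loop of (6.3) is given below, with a generic local minimizer standing in for GOSAM; the example problem, the initial weights and the 10% growth schedule are illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def sumt(a, constraints, x0, alpha0=1.0, growth=1.1, n_problems=20):
    """Quadratic-penalty SUMT, eq. (6.3): minimize
    F_p(x) = a(x) + sum_i alpha_pi * max(0, g_i(x))**2,
    increasing the weights geometrically (10% here) and warm-starting
    each problem from the previous minimizer."""
    x = np.asarray(x0, dtype=float)
    alphas = np.full(len(constraints), alpha0)
    for _ in range(n_problems):
        def F(z, alphas=alphas):
            penalty = sum(a_i * max(0.0, g(z)) ** 2
                          for a_i, g in zip(alphas, constraints))
            return a(z) + penalty
        x = minimize(F, x, method="Nelder-Mead").x
        alphas = alphas * growth
    return x

# Example: minimize (x - 2)^2 subject to g(x) = x - 1 <= 0; the
# SUMT minimizers approach the constrained optimum x = 1.
x_opt = sumt(lambda z: (z[0] - 2.0) ** 2, [lambda z: z[0] - 1.0], x0=[0.0])
```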
Given a graph with a set of nodes or vertices, and an adjacency matrix D, the
Graph Coloring Problem (GCP) requires coloring each node or vertex so that no
two adjacent nodes have the same color. The adjacency matrix entry di j is a 1 if
nodes i and j are adjacent, and is 0 otherwise.
A minimal coloring requires finding a valid coloring that uses the least number of
colors. The GCP can be solved through an energy minimization approach. We used
an approach based on the Compact Analogue Neural Network (CANN) formulation
[17]. In this approach, a N-vertex GCP is solved by considering a network of N
neurons, whose outputs denote the node colors. The outputs are represented by a set
of real numbers Xi , i = 1, 2, . . . , N. The color is not assumed to be an integer as is
done conventionally.
The GCP is solved by minimizing a sequence (p = 1, 2, . . .) of functions of the form
$$E = \frac{A}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} (1 - d_{ij})\, V_m \ln\cosh\beta(X_i - X_j) \;+\; \frac{B_p}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} d_{ij}\, V_m \ln\frac{\cosh\beta(X_i - X_j + \delta)\,\cosh\beta(X_i - X_j - \delta)}{\cosh\beta(X_i - X_j)^2} \qquad (6.4)$$
In keeping with the earlier literature on neural network approaches to the GCP, we
term E in (6.4) as an energy function.
The first term of equation (6.4) is present only for di j = 0, i.e. for non-adjacent
nodes. The term is minimized when Xi = X j . The term therefore minimizes the
number of distinct colors used. The second term is minimized if the values of Xi and
X j corresponding to adjacent nodes differ by at least δ . This term corresponds to the
adjacency constraint in the GCP, and becomes large as the problem sequence index
p increases. Nodes colored by colors that differ by less than δ correspond to nodes
with identical colors.
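For concreteness, the energy (6.4) can be evaluated as in the sketch below; the parameter values are placeholders, and in the actual SUMT sequence the weight Bp grows with the problem index p.

```python
import numpy as np

def gcp_energy(X, D, A=1.0, Bp=1.0, Vm=1.0, beta=1.0, delta=1.0):
    """Energy of eq. (6.4) for real-valued node colors X (length N)
    and 0/1 adjacency matrix D. The first term pulls non-adjacent
    nodes toward equal colors; the second pushes adjacent nodes at
    least delta apart and is scaled up (via Bp) along the SUMT sequence."""
    diff = X[:, None] - X[None, :]
    ch = lambda z: np.cosh(beta * z)
    t1 = 0.5 * A * np.sum((1 - D) * Vm * np.log(ch(diff)))
    t2 = 0.5 * Bp * np.sum(D * Vm * np.log(
        ch(diff + delta) * ch(diff - delta) / ch(diff) ** 2))
    return t1 + t2
```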
Instance Nodes Edges Optimal coloring Best Solution Obtained Iterations required
Myciel3 11 20 4 4 3
Myciel4 23 71 5 5 5
Huck 74 301 11 11 8
Games120 120 638 9 9 10
nearly impossible. For instance, VLSI design engineers carry out time-consuming
function evaluations by using circuit or other simulation tools, e.g. Spectre [2], and
choose a circuit with optimal component values. Since there are still many possi-
ble design parameter settings and computer simulations are time consuming, it is
crucial to find the best possible design with a minimum number of simulations. We
used GOSAM to solve several circuit optimization problems. The interface between
the optimizer and the circuit simulators is shown in Fig. 6.9. Preliminary details of
this work were reported in [18].
[Fig. 6.9: The interface between GOSAM and the circuit simulator. GOSAM sends updated design-variable values to the netlist; the simulator writes the resulting output value to a file that is returned to GOSAM.]
We initially start with values for the design variables that are provided by a de-
signer, or choose them randomly. Since there are no analytical formulae to compute
the output for the input design parameters, the function values are calculated by using
a circuit simulator such as Spectre. The simulator writes the output value to a file,
which is read by the interface and returned to GOSAM. GOSAM then uses SVR
on the set of values obtained so far, to determine the Fit function. The SVR yields a
smooth and differentiable regressor. GOSAM then computes the minimum of the Fit function and sends it, as the vector of new design parameters, to the interface. A key
feature of this approach is that we can apply it even to scenarios where the objective
function is not available in analytical form or is not differentiable. A major bonus
is that examination of the Fit function yields information about what constitutes a
good design. We now briefly discuss a few interesting circuit optimization examples.
value is maintained well during the hold period. To date, numerous complex
VLSI circuits have been designed using GOSAM interfaced with the circuit simu-
lator Spectre. The chosen circuits include Phase Locked Loops (PLLs), a variety of
operational amplifiers, and filters. In these examples, transistor sizes and other com-
ponent values have been selected to optimize specified objectives such as jitter, gain,
phase margin, and power, while meeting specified constraints on other performance
measures as well as on transistor sizes.
Fig. 6.10 Response of the optimized Sample-and-Hold Circuit, showing output voltage
versus time. The goal was to keep the output constant during the hold period
6.6 Discussion
An important question relates to assumptions that may be implicitly or explicitly
made regarding the function to be optimized. We mentioned previously that any
local search mechanism could be used in conjunction with GOSAM. Figure 6.12
illustrates this with the help of a toy example. For the objective function shown
by the dashed curve in Fig. 6.12, the gradient cannot be computed to reach two of
the minima. A line search method is used in the outer triangular regions, while for
the parabolic region in the middle the gradient is available and a simple gradient
descent leads us to the local minimum. These three local minima, when used by
SVR to construct the regressor, yield the parabolic shaped solid curve of Fig. 6.12.
Local search starting from the minimum of this curve led to the global minimum.
Fig. 6.11 Phase margin versus iteration count for a folded cascode amplifier
Fig. 6.12 A toy example illustrating that any local minimizing procedure can be used with
GOSAM. The function is depicted as the dotted curve. For the outer triangular regions, the
gradient information cannot be used, so the local minima are found by a line search method.
However for the inner parabolic region, the local minimum can be found using gradient de-
scent. The regressor obtained is shown by the solid curve that passes through the local minima
obtained
In the worst case, GOSAM performs similar to a random multistart. This is be-
cause whenever it is not possible to use the minimum of the Fit function (for exam-
ple when it is out of bounds or almost the same minimum is given by the previous
two iterations), GOSAM restarts the search from a random state. Therefore in the
worst case it will randomly explore the search space for new starting points. How-
ever, real applications never involve functions that are discontinuous everywhere,
and we have not encountered this worst case.
Fig. 6.13 A toy example to illustrate that the regressor for the objective function f (x), de-
noted as Fit of f (x) is smoother than f (x). Recursively, the Fit for the Fit of f (x) is smoother
than the Fit of f (x), and in the limit leads to a convex underestimate of f (x)
[Figure: GOSAM provided as a web service. The client sends a request and returns function values at requested points; the web server invokes an instance of the GOSAM optimizer, which requests function evaluations and sends back optimized points.]
local minima would be even smoother. This is depicted pictorially in Fig. 6.13,
which uses a hypothetical example to illustrate what the application of GOSAM
to f (x) and recursively to Fit functions, might achieve. The original function f (x)
has a number of minima. As can be seen, the number of minima is reduced at each step, and the sequence of recursively computed Fit functions becomes increasingly smooth, terminating at a convex function that is related to the
double conjugate of the original function. However, local minimization of the Fit
function seems to be more than adequate, as is done in the present implementation.
It is possible to construct functions where GOSAM’s strategy will fail. For ex-
ample, it would be impossible to learn any structure from a function with a uniform
distribution of randomly located minima, or a function that is discontinuous almost
everywhere. However, on most problems of any practical interest, small perturbations
from a local minimum will lead us to another locally minimal configuration. This im-
plies that a learning tool can be used to predict locations of other minima from the
knowledge of only a few.
only vectors and corresponding cost values are exchanged between the GOSAM
server and a client running a simulator or emulator. This allows GOSAM to be pro-
vided as a service across the web while protecting proprietary information about the
optimizer and the objective function.
Other aspects worthy of investigation include the use of different approaches to SVR, such as online learning techniques, and parallelizing operations in GOSAM
to speed up search. Ongoing efforts include extending GOSAM to multi-objective
optimization tasks. GOSAM may be obtained from the authors for non-commercial
academic use on a trial basis.
Acknowledgements. The authors would like to thank Dr. R. Kothari of IBM India Re-
search Laboratory, Prof. R. Newcomb, University of Maryland, College Park, USA, and Prof.
S.C. Dutta Roy of the Department of Electrical Engineering, IIT Delhi, for their valuable
comments and a critical appraisal of the manuscript.
References
1. http://mat.gsia.cmu.edu/COLOR02/
2. http://www.cadence.com/products/custom_ic/spectre/index.aspx
3. Agakov, F., Bonilla, E., Cavazos, J., Franke, B., Fursin, G., O’Boyle, M., Thomson,
J., Toussaint, M., Williams, C.: Using machine learning to focus iterative optimisation.
In: Proceedings of the 4th Annual International Symposium on Code Generation and
Optimization (CGO), New York, NY, USA, pp. 295–305 (2006)
4. Baluja, S., Barto, A., Boese, K., Boyan, J., Buntine, W., Carson, T., Caruana, R., Davies,
S., Dean, T., Dietterich, T., Hazlehurst, S., Impagliazzo, R., Jagota, A., Kim, K., Mc-
Govern, A., Moll, R., Moss, E., Perkins, T., Sanchis, L., Su, L., Wang, X., Wolpert, D.:
Statistical machine learning for large-scale optimization. Neural Computing Surveys 3,
1–58 (2000)
5. Black, F., Litterman, R.: Global portfolio optimization. Financial Analysts Journal 48(5),
28–43 (1992)
6. Boese, K., Kahng, A.B., Muddu, S.: A new adaptive multi-start technique for combina-
torial global optimizations. Operations Research Letters 16(2), 101–113 (1994)
7. Bordes, A., Bottou, L.: The huller: A simple and efficient online SVM. In: Gama, J.,
Camacho, R., Brazdil, P.B., Jorge, A.M., Torgo, L. (eds.) ECML 2005. LNCS (LNAI),
vol. 3720, pp. 505–512. Springer, Heidelberg (2005),
http://leon.bottou.org/papers/bordes-bottou-2005
8. Boyan, J.: Learning evaluation functions for global optimization. Phd dissertation, CMU
(1998)
9. Boyan, J., Moore, A.: Learning evaluation functions for global optimization and
boolean satisfiability. In: Proceedings of the Fifteenth National Conference on Arti-
ficial Intelligence, vol. 15, pp. 3–10. John Wiley and Sons Ltd., Chichester (1998),
http://www.cs.cmu.edu/˜jab/cv/pubs/boyan.stage2.ps.gz
10. Dorigo, M., Maniezzo, V., Colorni, A.: Ant system: Optimization by a colony of cooper-
ating agents. IEEE Transactions on Systems, Man, and Cybernetics-Part B 26(1), 29–41
(1996),
http://iridia.ulb.ac.be/˜mdorigo/ACO/publications.html
Chapter 7
Multi-objective Optimization Using Surrogates

Ivan Voutchkov and Andy Keane
University of Southampton, Southampton SO17 1BJ, United Kingdom
e-mail: [email protected], [email protected]
7.1 Introduction
In the world of real engineering design, there are often multiple targets which man-
ufacturers are trying to achieve. For instance in the aerospace industry, a general
problem is to minimize weight, cost and fuel consumption while keeping perfor-
mance and safety at a maximum. Each of these targets might be easy to achieve
individually. An airplane made of balsa wood would be very light and would have low fuel consumption; however, it would not be structurally strong enough to perform at high speeds or carry a useful payload. Such an airplane might also not be very safe,
i.e., robust to various weather and operational conditions. On the other hand, a solid
body and a very powerful engine will make the aircraft structurally sound and able
to fly at high speeds, but its cost and fuel consumption will increase enormously. So
engineers are continuously making trade-offs and producing designs that will sat-
isfy as many requirements as possible, while industrial, commercial and ecological
standards are at the same time getting ever tighter.
Multiobjective optimization (MO) is a tool that aids engineers in choosing the
best design in a world where many targets need to be satisfied. Unlike conventional
optimization, MO will not produce a single solution, but rather a set of solutions,
most commonly referred to as Pareto front (PF) [12]. By definition it will contain
only non-dominated solutions1 . It is up to the engineer to select a final design by
examining this front.
Over the past few decades with the rapid growth of computational power, the fo-
cus in optimization algorithms in general has shifted from local approaches that find
the optimal value with the minimal number of function evaluations to more global
strategies which are not necessarily as efficient as local searches but (some more
than the others) promise to converge to global solutions, the main players being
various strands of genetic and evolutionary algorithms. At the same time, comput-
ing power has essentially stopped growing in terms of flops per CPU core. Instead
parallel processing is an integral part of any modern computer system. Computing
clusters are ever more accessible through various techniques and interfaces such as
multi-threading, multi-core, Windows HPC, Condor, Globus, etc.
Parallel processing means that several function evaluations can be obtained at
the same time, which perfectly suits the ideology behind genetic and evolutionary
algorithms. For example, genetic algorithms are based on an idea borrowed from biological reproduction, where the offspring of two parents copy the best genes of
their parents but also introduce some mutation to allow diversity. The entire gener-
ation of offspring produced by parents in a generation represent designs that can be
evaluated in parallel. The fittest individuals survive and are copied into the next gen-
eration, whilst weak designs are given some random chance with low probability to
survive. Such parallel search methods are conveniently applicable to multiobjective
optimization problems, where the fitness of an individual is measured by how close
to the Pareto front this designs is. All individuals are ranked, those that are part of
the Pareto front get the lowest (best) rank, the next best have higher rank and so on.
Thus the multiobjective optimization is reduced to single objective minimization of
the rank of the individuals. This idea was developed by Deb and implemented in NSGA2 [5].
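A minimal sketch of this rank assignment is given below (plain repeated non-dominated filtering; NSGA2 itself uses a faster sorting procedure [5]):

```python
import numpy as np

def pareto_ranks(F):
    """Rank designs by successive non-dominated fronts (minimization).

    F: (n, m) array of m objective values for n designs. Front-0
    members are non-dominated; removing them exposes front 1, etc."""
    n = len(F)
    ranks = np.full(n, -1)
    remaining = np.arange(n)
    rank = 0
    while remaining.size:
        front = []
        for i in remaining:
            dominated = any(
                np.all(F[j] <= F[i]) and np.any(F[j] < F[i])
                for j in remaining if j != i)
            if not dominated:
                front.append(i)
        ranks[np.array(front)] = rank
        chosen = set(front)
        remaining = np.array([i for i in remaining if i not in chosen])
        rank += 1
    return ranks
```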
In the context of this paper, the aim of MO is to produce a well spread out set
of optimal designs, with as few function evaluations as possible. There are a number of methods published and widely used to do this – MOGA, SPEA, PAES, VEGA,
NSGA2, etc. Some are better than others - generally the most popular in the litera-
ture are NSGA2 (Deb) and SPEA2 (Zitzler), because they are found to achieve good
results for most problems [2, 3, 4, 5, 6]. The first is based on genetic algorithms and
1 Non-dominated designs are those where to improve performance in any particular goal
performance in at least one other goal must be made worse.
the second on an evolutionary algorithm, both of which are known to need many
function evaluations. In real engineering problems the cost of evaluating a design
is probably the biggest obstacle that prevents extensive use of optimization proce-
dures. In the multiobjective world, this cost is multiplied, because there are multi-
ple expensive results to obtain. Evaluating directly a finite element model can take
several days, which makes it very expensive to try hundreds or thousands of designs.
expense (see Fig. 7.1) [22]. Since their introduction, more and more companies
have adopted surrogate assisted optimization techniques and some are making steps
to incorporate this approach in their design cycle as standard. The reason for this is
that instead of using the expensive computational models during the optimization
step, they are substituted with a much cheaper but still accurate replica. This makes
optimization not only useful, but usable and affordable. The key idea that makes
surrogate models efficient is that they should become more accurate in the region of
interest as the search progresses, rather than being equally accurate over the entire
design space, as an FE representation will tend to be. This is achieved by adding to
the surrogate knowledge base only at points of interest. The procedure is referred to
as surrogate update.
Various publications address the idea of surrogate models and multiobjective optimisation [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]. As one would expect, no approximation method is universal. Factors such as function modality, number of variables,
number of objectives, constraints, computation time, etc., all have to be taken into
account when choosing an approximation method. The work presented here aims to
demonstrate this diversity and hints at some possible strategies to make best use of
surrogates for multi-objective problems.
Number of variables                                   2      5      10
Number of function evaluations without surrogates   2500   5000   10000
Number of function evaluations with surrogates        40     40      60
On the other hand 2500 evaluations without surrogates were required to obtain a
similar quality of Pareto front to the case with surrogates and 40 evaluations. The
difference is even more significant if more variables are added – see Table 14.1.
Here we have chosen objective functions with simple shapes to demonstrate the
effectiveness of using surrogates. Both functions would be readily approximated
using most available methods. It is not uncommon to have relationships of simi-
lar simplicity in real problems, although external noise factors could make them
Fig. 7.2 A (left) – Function ZDT2; B (right) – ZDT2 – Pareto front achieved in 40 evalua-
tions: Diamonds – Pareto front with surrogates; Circles – solution without surrogates
any optimization based only on such surrogates will lead us to the local solution.
Therefore conventional optimization approaches based on surrogate models rely
on constant updating of the surrogate. A widely accepted technique in single
objective optimization is to update the surrogate with its current optimal solution.
In multiobjective terms this will translate to updating the surrogate with one or
more points belonging to its Pareto front. If the surrogate Pareto front is local
and not global, then the next update will also be around the local Pareto front.
Continuing with this procedure the surrogate model will become more and more
accurate in the area of the local optimal solution, but will never know about the
existence of the global solution.
6. Robust convergence from any start design with any random number sequence. It
turns out that the success of a conventional multiobjective optimization based on
surrogates, using updates at previously found optimal locations strongly depends
on the initial data used to train the first surrogate before any updates are added.
If this data happens to contain points around the global Pareto front, then the
algorithm will be able to quickly converge and find a nice global Pareto front.
However, the odds are that the local Pareto fronts are smoother, easier-to-find shapes, and in most cases this is where the procedure will converge unless suitable global exploration steps are taken.
7. Efficiency and convergence – better search techniques will converge using fewer function evaluations.
Fig. 7.3 Pareto front potential problems - (a) clustering; (b) too few points; (c) lack of diver-
sity; (d) non-optimality
Pros:
• can always predict with no error at sample points,
• the error in close proximity to sample points is minimal,
• requires small number of sample points in comparison to other response surface
methods,
• reasonably good behaviour with high dimensional problems.
Cons:
• for a large number of data points and variables, training of the hyper-parameters and prediction may become computationally expensive.
Researchers should make a conscious decision when choosing Kriging for their
RSMs. Such a decision should take into account the cost of a direct function eval-
uation including constraints (if any), available computational power, and dimen-
sionality of the problem. Sometimes it might be possible to use kriging for one
of the objectives while another is evaluated directly, or a different RSM is used to
minimize the cost.
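As a concrete illustration of these trade-offs, the sketch below fits a Gaussian-process (kriging) surrogate with an off-the-shelf library and then performs one surrogate update; it is not the implementation used in this chapter, and the test function is an arbitrary stand-in.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, RBF

# Toy stand-in for an expensive simulation.
def expensive(x):
    return np.sin(3 * x[:, 0]) + x[:, 1] ** 2

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 3.0, size=(30, 2))            # initial sample plan
y = expensive(X)

gp = GaussianProcessRegressor(kernel=ConstantKernel() * RBF(),
                              normalize_y=True)
gp.fit(X, y)                                       # trains hyper-parameters
y_hat, y_std = gp.predict(X[:5], return_std=True)  # ~exact at sample points

# Surrogate update: refit with a new point of interest appended, so the
# model becomes more accurate in the region the search is exploring.
x_new = np.array([[1.5, 1.5]])
gp.fit(np.vstack([X, x_new]), np.append(y, expensive(x_new)))
```

Each update refits on the enlarged knowledge base, which is where the training cost noted in the cons list eventually bites for large sample sets.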
As this paper aims to demonstrate various approaches in making a better use of
surrogate models, we will use kriging throughout, but most conclusions could be
generalised for other RS methods as well. The chosen multiobjective algorithm is
NSGA2. Other multiobjective optimizers might show slightly different behaviour.
The basic procedure is as follows:
objectives and the points are evenly distributed. Metrics for assessing the quality
of the Pareto front are discussed by Deb [3].
We have used the last of these criteria for our studies.
• UPDMOD = 6; (Nkmean) – The RSMs are searched using GA or DHC and points
are extracted using a k-mean cluster detection algorithm.
All these update strategies have their own strengths and weaknesses, and therefore
a suitable combination should be carefully considered. The results section of this
chapter provides some insights on the effects of each of these strategies when used
in various combinations.
variables whilst keeping the product of the variables greater than 0.75. There are 25
variables, each varying between 0 and 3.
7.8.2 Spacing
Standard deviation of the absolute differences between the solution (i) and the near-
est member of Q,
$$sp = \sqrt{\frac{1}{|Q|}\sum_{i=1}^{|Q|}\left(d_i - \bar{d}\right)^2}, \qquad d_i = \min_{k=1,\ldots,|p|} \sum_{j=1}^{M} \left| f_j^{(i)} - p_j^{(k)} \right|.$$
7.8.3 Spread
$$\Delta = 1 - \frac{\sum_{m=1}^{M} d_m^{e} - \sum_{i=1}^{|Q|} \left| d_i - \bar{d} \right|}{\sum_{m=1}^{M} d_m^{e} + |Q|\,\bar{d}},$$
where $d_i$ is the absolute difference between neighbouring solutions. For compatibility
with the above metrics, the value of the spread is subtracted from 1, so that a
wider spread will produce a smaller value.
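For readers who wish to reproduce these metrics, a minimal sketch follows, assuming minimization, a front F given as a |Q| x M array, and the extreme points of the true front supplied separately; the treatment of the neighbour distances follows the definitions above.

import numpy as np

def spacing(F):
    # Std of L1 distances from each front member to its nearest neighbour.
    D = np.abs(F[:, None, :] - F[None, :, :]).sum(axis=2)
    np.fill_diagonal(D, np.inf)
    d = D.min(axis=1)
    return float(np.sqrt(np.mean((d - d.mean()) ** 2)))

def spread(F, extremes):
    # Spread metric, subtracted from 1 as in the text; d_m^e is taken as
    # the distance from each true-front extreme to its nearest front member.
    F = F[np.argsort(F[:, 0])]                      # order along the front
    d = np.linalg.norm(np.diff(F, axis=0), axis=1)  # neighbour differences
    d_bar = d.mean()
    d_e = sum(np.linalg.norm(F - e, axis=1).min() for e in extremes)
    return float(1.0 - (d_e + np.abs(d - d_bar).sum()) / (d_e + len(F) * d_bar))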
7.9 Results
The study carried out aims to show the effect of applying various update strategies,
number of training and evaluation points, etc. The performance of each particular
approach is measured using the metrics described in the previous section.
An overall summary is given at the end of this section, but the best recipe appears
to be highly problem dependent. It is also not possible to show all results for
all functions due to limited space; we have therefore chosen several that best
represent the ideas discussed.
To appreciate the results correctly, please bear in mind that they are meant to
show diversity rather than a magic recipe that works in all situations.
7.9.2.3 What Is the Best Value for EPREWARD during the RSM Search?
The EPREWARD value is strictly individual for each function. Taking into account
the specifics of the test function, it can improve the diversity of the Pareto front. The
default value is 0.65, which works well for most of the functions, but we have also
conducted studies where this parameter is varied between -1 and 1 in steps of 0.1,
and an individual value for each function is selected based on the best Pareto front metrics.
Fig. 7.9 and Fig. 7.10 show that the ‘bump’ function is particularly difficult for
all strategies, which makes it a good test problem. This function has an extremely
tight constraint and is highly multimodal. It is not yet clear which combination of
strategies should be recommended, as the ‘ideal’ Pareto front has not been reached;
however, it seems that a decoupled secondary NSGA2 layer is showing good
advancement. We are continuing studies on this function and will give results in
future publications.
To summarize the performance of each strategy, an average statistic is computed.
It is derived as follows. The actual performance in most cases is a trade-off
between a given metric and the number of function evaluations needed for
convergence. Therefore the four metrics can be ranked against the number of runs,
in the same way as ranks are obtained during NSGA2 operation. The obtained ranks
are then averaged across all test functions. A low average rank means that the strategy
has been optimal for more metrics and functions. These results are summarized in
the table below.
Random  RSM PF  SL   RMSE  EI   KMEAN   Av. Rank  Min. Rank  Max. Rank  Note
  0       30     0     0    30    0       1.53       1          2       EI const.feas
  0       30    30     0     0    0       1.83       1          3.33    SL coupled
  0       30     0    30     0    0       2          1.33       3.33    RMSE
  0       30     0     0    30    0       2.2        1.33       3       EI normal
  0       30    30     0     0    0       2.8        1.33       4       SL decoupled
 30       30     0     0     0    0       2.84       2          4       Random
  0       60     0     0     0    0       2.85       2          3.33    RSM PF
The summary shows that all strategies are generally better than using only the
conventional RSM based updates, which is expected, as the conventional method is
almost always bound to converge to local solutions. However, it must be underlined
that the correct selection is problem dependent and must be made with care and
understanding.
Fig. 7.11 Generational distance for zdt1 starting from different initial DOEs
Fig. 7.12 Generational distance for F5 starting from different initial DOEs
Fig. 7.13 Pareto fronts for ‘bump’ starting from different initial DOEs
used 10 updates for each of the techniques (60 updates per iteration in total) for all
functions, the only difference being the starting set of designs.
Fig. 7.11 and Fig. 7.12 illustrate the generational distance for the zdt1 and f5 functions,
both with two variables. They both demonstrate consistent average behaviour,
Fig. 7.14 Generational distance for ‘bump’ starting from different initial DOEs
Fig. 7.15 Pareto fronts for ‘zdt1cons’ starting from different initial DOEs
confirming once again that the surrogate updates are fairly robust for functions with
a low number of variables.
Figures 7.13, 7.14 and 7.15 illustrate much greater variance and show that high
dimensionality is a difficult challenge for surrogate strategies; however, one should
also consider the low number of function evaluations used here.
7.10 Summary
In this chapter we have aimed to share our experience in tackling expensive
multiobjective problems. We have shown that as soon as we decide to use surrogate
models to substitute for expensive objective functions, we need to consider a number
of other specifics in order to produce a useful Pareto front. We have discussed
the challenges that one might face when using surrogates and have proposed six update
strategies that one might wish to use. Given an understanding of these strategies,
the researcher should decide on the budget of updates they can afford and then
spread this budget over several update strategies. We have shown that it is best to
use at least two different strategies, ideally a mixture of RSM and non-RSM based
techniques. When solving problems with few variables, we have shown that a combination
of two or three techniques is sufficient; with higher dimensional problems,
one should consider using more techniques.
It is also beneficial to constrain the number of designs that are used for RSM
training and for RSM evaluation, to limit the cost. The method of selecting these
designs remains open to further research. In this material we have used
selection based on Pareto front ranking.
Our research also included parameters that reward the search for exploring the
end points of the Pareto front. Although not explicitly discussed in this material,
our studies use features such as improved crossover, mutation and selection
strategies, and a declustering algorithm applied in both the variable and objective spaces
to avoid data clustering. Data is also automatically conditioned and filtered,
and advanced kriging tuning techniques are used. These features are part of the
OPTIONS [1], OptionsMATLAB and OptionsNSGA2 RSM suites [24].
Acknowledgements. This work was funded by Rolls-Royce Plc, whose support is
gratefully acknowledged.
References
1. Keane, A.J.: OPTIONS manual,
http://www.soton.ac.uk/˜ajk/options.ps
2. Obayashi, S., Jeong, S., Chiba, K.: Multi-Objective Design Exploration for Aerodynamic
Configurations, AIAA-2005-4666
3. Deb, K.: Multi-objective optimization using evolutionary algorithms. John Wiley &
Sons, Ltd., New York (2003)
4. Zitzler, E., Deb, K., Thiele, L.: Comparison of multiobjective evolutionary algorithms: Empirical results.
Evolutionary Computation 8(2), 173–195 (2000)
5. Knowles, J., Corne, D.: The Pareto archived evolution strategy: A new baseline algorithm
for multiobjective optimisation. In: Proceedings of the 1999 Congress on Evolutionary
Computation, pp. 98–105. IEEE Service Center, Piscataway (1999)
6. Fonseca, C.M., Fleming, P.J.: Multiobjective optimization and multiple constraint han-
dling with evolutionary algorithms - Part II: Application example. IEEE Transactions on
Systems, Man, and Cybernetics: Part A: Systems and Humans, 38–47 (1998)
7. Jones, D.R., Schonlau, M., Welch, W.J.: Efficient global optimization of expensive black-
box functions. Journal of Global Optimization 13, 455–492 (1998)
8. Sobol’, I.M., Turchaninov, V.I., Levitan, Y.L., Shukhman, B.V.: Quasi-Random Se-
quence Generators, Keldysh Institute of Applied Mathematics, Russian Academy of Sci-
ences, Moscow (1992)
9. Nowacki, H.: Modelling of Design Decisions for CAD. In: Goos, G., Hartmanis, J.
(eds.) Computer Aided Design Modelling, Systems Engineering, CAD-Systems. LNCS,
vol. 89. Springer, Heidelberg (1980)
10. Kumano, T., et al.: Multidisciplinary Design Optimization of Wing Shape for a Small Jet
Aircraft Using Kriging Model. In: 44th AIAA Aerospace Sciences Meeting and Exhibit,
January 2006, pp. 1–13 (2006)
11. Nain, P.K.S., Deb, K.: A multi-objective optimization procedure with successive approx-
imate models. KanGAL Report No. 2005002 (March 2005)
12. Keane, A., Nair, P.: Computational Approaches for Aerospace Design: The Pursuit of
Excellence (2005) ISBN: 0-470-85540-1
13. Leary, S., Bhaskar, A., Keane, A.J.: A derivative based surrogate model for approximat-
ing and optimizing the output of an expensive computer simulation. J. Global Optimiza-
tion 30, 39–58 (2004)
14. Leary, S., Bhaskar, A., Keane, A.J.: A Constraint Mapping Approach to the Structural
Optimization of an Expensive Model using Surrogates. Optimization and Engineering 2,
385–398 (2001)
15. Emmerich, M., Naujoks, B.: Metamodel-assisted multiobjective optimization strategies
and their application in airfoil design. In: Parmee, I. (ed.) Proc. of the Fifth Int’l Conf.
on Adaptive Design and Manufacture (ACDM), Bristol, UK, April 2004, pp. 249–260.
Springer, Berlin (2004)
16. Giotis, A.P., Giannakoglou, K.C.: Single- and Multi-Objective Airfoil Design Using Ge-
netic Algorithms and Artificial Intelligence. In: EUROGEN 1999, Evolutionary Algo-
rithms in Engineering and Computer Science (May 1999)
17. Knowles, J., Hughes, E.J.: Multiobjective optimization on a budget of 250 evaluations.
In: Coello Coello, C.A., Hernández Aguirre, A., Zitzler, E. (eds.) EMO 2005. LNCS,
vol. 3410, pp. 176–190. Springer, Heidelberg (2005)
18. Chafekar, D., et al.: Multi-objective GA optimization using reduced models. IEEE
SMCC 35(2), 261–265 (2005)
19. Nain, P.: A computationally efficient multi-objective optimization procedure using suc-
cessive function landscape models. Ph.D. dissertation, Department of Mechanical Engi-
neering, Indian Institute of Technology (July 2005)
20. Voutchkov, I.I., Keane, A.J.: Multiobjective optimization using surrogates. In: Proc. 7th
Int. Conf. Adaptive Computing in Design and Manufacture (ACDM 2006), Bristol, pp.
167–175 (2006) ISBN 0-9552885-0-9
21. Keane, A.J.: Bump: A Hard (?) Problem (1994),
http://www.soton.ac.uk/˜ajk/bump.html
22. Forrester, A., Sobester, A., Keane, A.: Engineering design via Surrogate Modelling. Wi-
ley, Chichester (2008)
23. Yuret, D., Maza, M.: Dynamic hill climbing: Overcoming the limitations of optimization
techniques. In: The Second Turkish Symposium on Artificial Intelligence and Neural
Networks, pp. 208–212 (1993)
24. OptionsMatlab & OptionsNSGA2 RSM,
http://argos.e-science.soton.ac.uk/blogs/OptionsMatlab/
Chapter 8
A Review of Agent-Based Co-Evolutionary
Algorithms for Multi-Objective Optimization
8.1 Introduction
In spite of the huge potential of evolutionary algorithms and the many successful
applications of such algorithms to difficult optimization and search problems,
such methods frequently prove unable to deal with the problem at hand, and the
obtained results are not satisfying. Among the reasons for this situation, the
following can be mentioned:
• centralization of the evolutionary process, in which both selection and the
process of creating new generations are controlled by one single algorithm;
Rafał Dreżewski · Leszek Siwik
Department of Computer Science
AGH University of Science and Technology, Kraków, Poland
e-mail: {drezew,siwik}@agh.edu.pl
During further research on realizing advanced, complex social and biological mechanisms
within the confines of EMAS, a general model of so-called co-evolutionary
multi-agent systems (CoEMAS) [8] was proposed. It turned out
that with the use of such a model almost any kind of interaction, cooperation or
competition among many species or sexes of co-evolving agents can be realized, which
allows for improving the quality of the obtained results. Such improvement results mainly
from better maintenance of population diversity, which is especially important in
the case of applying such systems to multi-modal or multi-objective optimization
tasks.
In this chapter we focus on applying co-evolutionary
multi-agent systems to solving multi-objective optimization tasks.
Following [5], the multi-objective optimization problem (MOOP) in its general
form is defined as follows:

$$\text{MOOP} \equiv \begin{cases} \text{Minimize/Maximize } f_m(\bar{x}), & m = 1, 2, \dots, M \\ \text{subject to } g_j(\bar{x}) \ge 0, & j = 1, 2, \dots, J \\ \phantom{\text{subject to }} h_k(\bar{x}) = 0, & k = 1, 2, \dots, K \\ \phantom{\text{subject to }} x_i^{(L)} \le x_i \le x_i^{(U)}, & i = 1, 2, \dots, N \end{cases}$$
The authors of this chapter assume that readers are familiar with at least the fundamental
concepts and notions regarding multi-objective optimization in the Pareto sense (the
relation of domination, the Pareto frontier and Pareto set, etc.); their explanation is
omitted here (interested readers can find definitions and a deep analysis of all
necessary concepts and notions of Pareto multi-objective optimization, for instance,
in [3, 5]).
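As a concrete reference point, a minimal sketch of the domination relation and of non-dominated filtering (minimization assumed) is given below; it is illustrative only and not part of the formal model that follows.

import numpy as np

def dominates(a, b):
    # a dominates b: no worse in every objective, strictly better in one.
    return bool(np.all(a <= b) and np.any(a < b))

def non_dominated(F):
    # Keep the points of F that no other point dominates.
    keep = [i for i, fi in enumerate(F)
            if not any(dominates(fj, fi) for j, fj in enumerate(F) if j != i)]
    return F[keep]

F = np.array([[1.0, 4.0], [2.0, 2.0], [3.0, 3.0], [4.0, 1.0]])
print(non_dominated(F))   # [3, 3] is dominated by [2, 2] and drops out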
This chapter is organized as follows:
• in Section 8.2 a formal model as well as a detailed description of the co-evolutionary
multi-agent system (CoEMAS) is presented;
• in Section 8.3 detailed descriptions and formal models of two realizations of CoEMAS
applied to solving MOOPs are given: the co-evolutionary
multi-agent system with predator-prey interactions (PPCoEMAS) and the
co-evolutionary multi-agent system with cooperation (CCoEMAS);
• in Section 8.4 we briefly discuss the test suite and performance metrics used during the
experiments, and then glance at the results obtained by both systems presented in
the course of this chapter (PPCoEMAS and CCoEMAS);
• in Section 8.5 the most important remarks, conclusions and comments are given.
computational node to another, observe the environment and other agents, and can
communicate with other agents and change the environment.
The basic model of an agent-based evolutionary algorithm (the so-called evolutionary multi-agent
system, EMAS) was proposed in [2]. The EMAS model included all the
features mentioned above. However, in the case of some problems, for
example multi-modal or multi-objective optimization, it turned out that
these mechanisms are not sufficient. Such types of problems require mechanisms
for maintaining population diversity, speciation mechanisms, and the possibility of
introducing additional biologically and socially inspired mechanisms in order to solve the
problem and obtain satisfying results.
The above-mentioned limitations of the basic EMAS model, and research aimed at
applying agent-based evolutionary algorithms to multi-modal and multi-objective
problems, led to the formulation of the model of the co-evolutionary multi-agent system
(CoEMAS) [8]. This model included the possibility of different species and
sexes existing in the system and allowed for defining co-evolutionary interactions between
them. Below we present the basic ideas and notions of the CoEMAS model, which we will
use in Section 8.3 when the systems used in the experiments are described.
8.2.2 Environment
The environment of CoEMAS may be described as a 3-tuple:

$$E = \langle T^E, \Gamma^E, \Omega^E \rangle \qquad (8.2)$$

$$T^E = \langle H, l \rangle \qquad (8.3)$$

where $H$ is a directed graph with the cost function $c$ defined: $H = \langle V, B, c \rangle$, $V$ is the
set of vertices and $B$ is the set of arcs. The distance between two nodes is defined as
the length of the shortest path between them in graph $H$.
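A minimal sketch of such an environment is given below: a hypothetical four-node directed graph with arc costs, and the distance computed as the shortest-path length with Dijkstra's algorithm; the concrete graph and costs are invented for illustration.

import heapq

arcs = {0: {1: 1.0, 2: 4.0},   # H = <V, B, c>: arcs with their costs
        1: {2: 1.0, 3: 5.0},
        2: {3: 1.0},
        3: {0: 1.0}}

def distance(src, dst):
    # Shortest-path distance in H under the cost function c.
    dist = {src: 0.0}
    pq = [(0.0, src)]
    while pq:
        d, v = heapq.heappop(pq)
        if v == dst:
            return d
        if d > dist.get(v, float("inf")):
            continue
        for w, c in arcs.get(v, {}).items():
            if d + c < dist.get(w, float("inf")):
                dist[w] = d + c
                heapq.heappush(pq, (d + c, w))
    return float("inf")

print(distance(0, 3))   # 3.0, via 0 -> 1 -> 2 -> 3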
8.2.3 Species
Species s ∈ S is defined as follows:
where:
• $A^s$ is the set of agents of species $s$ (by $a^s$ we denote an agent of species
$s$, $a^s \in A^s$);
• $SX^s$ is the set of sexes within the species $s$;
$$\xrightarrow{s_i,\,z+} \;=\; \Big\{ \langle s_i, s_j \rangle \in S \times S : \text{agents of species } s_i \text{ can increase the fitness of agents of species } s_j \text{ by performing the action } z \in Z^{s_i} \Big\} \qquad (8.9)$$

If $s_i \xrightarrow{s_i,\,z-} s_i$ then we are dealing with intra-species competition, for example
the competition for limited resources, and if $s_i \xrightarrow{s_i,\,z+} s_i$ then there is some form of
co-operation within the species $s_i$.
With the use of the above relations we can define many different co-evolutionary
interactions, e.g., mutualism, predator-prey, host-parasite, etc. For example, mutualism
between two species $s_i$ and $s_j$ ($i \ne j$) takes place if and only if $\exists z_k \in Z^{s_i}\ \exists z_l \in Z^{s_j}$
such that $s_i \xrightarrow{s_i,\,z_k+} s_j$ and $s_j \xrightarrow{s_j,\,z_l+} s_i$, and these two species live in tight co-operation.
Predator-prey interactions between two species, $s_i$ (predators) and $s_j$ (preys) ($i \ne j$),
take place if and only if $\exists z_k \in Z^{s_i}\ \exists z_l \in Z^{s_j}$ such that $s_i \xrightarrow{s_i,\,z_k-} s_j$ and $s_j \xrightarrow{s_j,\,z_l+} s_i$,
where $z_k$ is the action of killing the prey (kill), and $z_l$ is the action of death (die).
8.2.4 Sex
The sex $sx \in SX^s$ within the species $s$ is defined as follows:
With $a^{sx}$ we will denote an agent of sex $sx$ ($a^{sx} \in A^{sx}$). $Z^{sx}$ is the set of actions
which can be performed by the agents of sex $sx$, $Z^{sx} = \bigcup_{a \in A^{sx}} Z^a$, where $Z^a$ is the
set of actions which can be performed by the agent $a$. Finally, $C^{sx}$ is the set of
relations between $sx$ and the other sexes of the species $s$.
Analogously to the case of species, we can define the relations between the
sexes of the same species. The set of all relations of the sex $sx_i \in SX^s$ with other
sexes of species $s$ ($C^{sx_i}$) is the union of the following sets of relations:

$$C^{sx_i} = \left\{ \xrightarrow{sx_i,\,z-} : z \in Z^{sx_i} \right\} \cup \left\{ \xrightarrow{sx_i,\,z+} : z \in Z^{sx_i} \right\} \qquad (8.12)$$

where $\xrightarrow{sx_i,\,z-}$ and $\xrightarrow{sx_i,\,z+}$ are the relations between sexes in which some actions
$z \in Z^{sx_i}$ are used:

$$\xrightarrow{sx_i,\,z-} \;=\; \Big\{ \langle sx_i, sx_j \rangle \in SX^s \times SX^s : \text{agents of sex } sx_i \text{ can decrease the fitness of agents of sex } sx_j \text{ by performing the action } z \in Z^{sx_i} \Big\} \qquad (8.13)$$

$$\xrightarrow{sx_i,\,z+} \;=\; \Big\{ \langle sx_i, sx_j \rangle \in SX^s \times SX^s : \text{agents of sex } sx_i \text{ can increase the fitness of agents of sex } sx_j \text{ by performing the action } z \in Z^{sx_i} \Big\} \qquad (8.14)$$
With the use of the presented relations between sexes we can model, for example, sexual
selection interactions, in which agents of one sex choose partners for reproduction
from agents of the other sex within the same species, taking into account some
preferred features (see [10]).
8.2.5 Agent
Agent $a$ (see Fig. 8.2) of sex $sx$ and species $s$ (in order to simplify the notation we
assume that $a \equiv a^{sx,s}$) is defined as follows:
where:
• $gn^a$ is the genotype of agent $a$, which may be composed of any number of chromosomes
(for example: $gn^a = (x_1, x_2, \dots, x_k)$, where $x_i \in \mathbb{R}$, $gn^a \in \mathbb{R}^k$);
• $Z^a$ is the set of actions which agent $a$ can perform;
• $\Gamma^a$ is the set of resource types which are used by agent $a$ ($\Gamma^a \subseteq \Gamma$);
• $\Omega^a$ is the set of information which agent $a$ can possess and use ($\Omega^a \subseteq \Omega$);
• $PR^a$ is a partially ordered set of profiles of agent $a$ ($PR^a \equiv \langle PR^a, \preceq \rangle$) with a defined
partial order relation $\preceq$.
The active goal (denoted $gl^*$) is the goal $gl$ which should be realized at the
given time. The relation $\preceq$ is reflexive, transitive and antisymmetric and partially
orders the set $PR^a$.
Profile $pr_1$ is the basic profile, which means that the realization of its goals has the
highest priority and they will be realized before the goals of other profiles.
Profile $pr$ of agent $a$ ($pr \in PR^a$) can be a profile in which only resources are
used:

$$pr = \langle \Gamma^{pr}, ST^{pr}, RST^{pr}, GL^{pr} \rangle \qquad (8.19)$$

a profile in which only information is used:

$$pr = \langle \Omega^{pr}, M^{pr}, ST^{pr}, RST^{pr}, GL^{pr} \rangle \qquad (8.20)$$

or a profile in which both resources and information are used:

$$pr = \langle \Gamma^{pr}, \Omega^{pr}, M^{pr}, ST^{pr}, RST^{pr}, GL^{pr} \rangle \qquad (8.21)$$
where:
• $\Gamma^{pr}$ is the set of resource types which are used within the profile $pr$ ($\Gamma^{pr} \subseteq \Gamma^a$);
• $\Omega^{pr}$ is the set of information types which are used within the profile $pr$ ($\Omega^{pr} \subseteq \Omega^a$);
$$\preceq \;=\; \Big\{ \langle st_i, st_j \rangle \in ST^{pr} \times ST^{pr} : \text{strategy } st_i \text{ has equal or higher priority than strategy } st_j \Big\} \qquad (8.22)$$
This relation is reflexive, transitive and antisymmetric and partially orders the set
$ST^{pr}$. Every single strategy $st \in ST^{pr}$ consists of actions whose ordered performance
leads to the realization of some active goal of the profile $pr$:

$$st = \langle z_1, z_2, \dots, z_k \rangle, \quad st \in ST^{pr}, \; z_i \in Z^a \qquad (8.23)$$

An analogous priority relation is defined for goals; it is reflexive, transitive and
antisymmetric and partially orders the set $GL^{pr}$.
The partially ordered sets of profiles $PR^a$, goals $GL^{pr}$ and strategies $ST^{pr}$ are
used by the agent in order to make decisions about the goal to realize and to choose
the appropriate strategy to realize that goal. The basic activities of agent
$a$ are shown in Algorithm 6.
In CoEMAS systems the set of profiles is usually composed of the resource profile
($pr_1$), the reproduction profile ($pr_2$), and the migration profile ($pr_3$).
The resource profile has the highest priority, followed by the reproduction profile and
finally the migration profile.
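A minimal sketch of this priority-driven choice is given below; the profile names follow the text, while the activation predicates, thresholds and strategies are invented for illustration.

class Profile:
    def __init__(self, name, goal_active, strategy):
        self.name = name
        self.goal_active = goal_active   # predicate over the agent's state
        self.strategy = strategy         # ordered actions realizing the goal

def choose_strategy(state, profiles):
    # Profiles are scanned in priority order; the first active goal wins.
    for pr in profiles:
        if pr.goal_active(state):
            return pr.name, pr.strategy
    return None, ()

profiles = [
    Profile("resource", lambda s: s["resource"] < s["min_level"],
            ("seek", "get")),
    Profile("reproduction", lambda s: s["resource"] >= s["rep_level"],
            ("seekPartner", "clone", "rec", "mut")),
    Profile("migration", lambda s: s["crowded"], ("migr",)),
]
state = {"resource": 3, "min_level": 5, "rep_level": 10, "crowded": False}
print(choose_strategy(state, profiles))   # the resource profile fires first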
8.3.1.1 Species
The species s is defined as follows:
where $SX^s$ is the set of sexes which exist within the species $s$, $Z^s$ is the set of actions
that agents of species $s$ can perform, and $C^s$ is the set of relations of species $s$ with
other species that exist in the CCoEMAS.
Actions
The set of actions $Z^s$ is defined as follows:

$$Z^s = \{die, seek, get, give, accept, seekPartner, clone, rec, mut, migr\} \qquad (8.28)$$
where:
• die is the action of death (agent dies when it is out of resources);
• seek is the action of finding a dominated agent from the same species in order to
take some resources from it;
• get action gets some resource from another agent located within the same node,
which is dominated by the agent that performs get action;
• give action gives some resources to the agent that performs get action;
• accept action accepts partner for reproduction when the amount of resource pos-
sessed by the agent is above the given level;
• seekPartner action seeks for partner for reproduction, such that it comes from
another species and has the amount of resource above the minimal level needed
for reproduction;
• clone is the action of producing offspring (parents give some of their resources
to the offspring during this action);
• rec is the recombination operator (intermediate recombination is used [1]);
• mut is the mutation operator (mutation with self-adaptation is used [1]);
• migr is the action of migrating from one node to another; during this action the agent
loses some of its resource (an illustrative sketch of this action set follows below).
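The sketch below is one illustrative reading of this action set as an activation routine, with invented thresholds and resource amounts; it is not the system's actual implementation.

import numpy as np

def pareto_dominates(a, b):
    # Domination in objective space, all objectives minimized.
    return bool(np.all(a <= b) and np.any(a < b))

def activate(agent, neighbours):
    # One activation: die, take resources from a dominated agent, reproduce,
    # or migrate, in that order of preference.
    if agent["resource"] <= 0.0:
        return "die"
    victims = [n for n in neighbours if pareto_dominates(agent["obj"], n["obj"])]
    if victims:                                  # seek, then the get/give pair
        take = min(2.0, victims[0]["resource"])
        victims[0]["resource"] -= take           # give
        agent["resource"] += take                # get
        return "get"
    if agent["resource"] >= 10.0:                # enough resource to reproduce
        return "reproduce"                       # seekPartner, clone, rec, mut
    return "migr"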
Relations
The set of relations of species $s_i$ with other species that exist within the system is
defined as follows:

$$C^{s_i} = \left\{ \xrightarrow{s_i,\,get-}, \; \xrightarrow{s_i,\,accept+} \right\} \qquad (8.29)$$

The first relation models intra-species competition for limited resources:

$$\xrightarrow{s_i,\,get-} \;=\; \{ \langle s_i, s_i \rangle \} \qquad (8.30)$$
8.3.1.2 Agent
Agent $a$ of species $s$ ($a \equiv a^s$) is defined as follows:
Profiles
The partially ordered set of profiles includes resource profile (pr1 ), reproduction
profile (pr2 ), interaction profile (pr3 ), and migration profile (pr4 ):
The goal of the $pr_1$ profile is to keep the amount of resources above the minimal
level or to die when the amount of resources falls to zero. This profile uses the
model $M^{pr_1} = \{i_{\omega_3}\}$.
The reproduction profile is defined as follows:
$$pr_2 = \big\langle \Gamma^{pr_2} = \Gamma, \; \Omega^{pr_2} = \{\omega_2\}, \; M^{pr_2} = \{i_{\omega_2}\}, \; ST^{pr_2}, \; RST^{pr_2} = ST^{pr_2}, \; GL^{pr_2} \big\rangle \qquad (8.36)$$
The only goal of the $pr_2$ profile is to reproduce. In order to realize this goal the agent can
use the strategy of reproduction: $\langle seekPartner, clone, rec, mut \rangle$. During reproduction
the agent transfers the amount $r^{rep,\gamma}_{give}$ of resources to the offspring.
The interaction profile is defined as follows:
$$pr_3 = \big\langle \Gamma^{pr_3} = \Gamma, \; \Omega^{pr_3} = \{\omega_2, \omega_3\}, \; M^{pr_3} = \{i_{\omega_2}, i_{\omega_3}\}, \; ST^{pr_3} = \{accept, give\}, \; RST^{pr_3} = ST^{pr_3}, \; GL^{pr_3} \big\rangle \qquad (8.38)$$
The goal of the pr3 profile is to interact with agents from another species with the
use of accept and give strategies.
The migration profile is defined as follows:
$$pr_4 = \big\langle \Gamma^{pr_4} = \Gamma, \; \Omega^{pr_4} = \{\omega_1\}, \; M^{pr_4} = \{i_{\omega_1}\}, \; ST^{pr_4} = \{\langle migr \rangle\}, \; RST^{pr_4} = ST^{pr_4}, \; GL^{pr_4} \big\rangle \qquad (8.39)$$
The set of species includes two species, preys and predators: $S = \{prey, pred\}$.
Two information types ($\Omega = \{\omega_1, \omega_2\}$) and one resource type ($\Gamma = \{\gamma\}$) are used.
Information of type $\omega_1$ denotes the nodes to which an agent can migrate. Information of
type $\omega_2$ denotes the prey that are located within a particular node at time $t$.
where $SX^{prey}$ is the set of sexes which exist within the prey species, $Z^{prey}$ is the set
of actions that agents of the prey species can perform, and $C^{prey}$ is the set of relations
of the prey species with other species that exist in the PPCoEMAS.
Actions
where:
• die is the action of death (prey dies when it is out of resources);
• seek action seeks another prey agent that is dominated by the prey performing
this action or is too close to it in criteria space;
• get action gets some resource from another prey agent located within the same
node, which is dominated by the agent that performs the get action or is too close to
it in the criteria space;
• give action gives some resource to another agent (which performs get action);
• accept action accepts partner for reproduction when the amount of resource pos-
sessed by the prey agent is above the given level;
• seekPartner action is used in order to find the partner for reproduction when the
amount of resource is above the given level and agent can reproduce;
• clone is the action of producing offspring (parents give some of their resources
to the offspring during this action);
• rec is the recombination operator (intermediate recombination is used [1]);
• mut is the mutation operator (mutation with self-adaptation is used [1]);
• migr is the action of migrating from one node to another. During this action agent
loses some of its resource.
Relations
The set of relations of the prey species with other species that exist within the system is
defined as follows:

$$C^{prey} = \left\{ \xrightarrow{prey,\,get-}, \; \xrightarrow{prey,\,give+} \right\} \qquad (8.43)$$

The first relation models intra-species competition for limited resources:

$$\xrightarrow{prey,\,get-} \;=\; \{ \langle prey, prey \rangle \} \qquad (8.44)$$
Actions
where:
• The seek action allows finding the “worst” (according to the criteria associated
with the given predator) prey located within the same node as the predator;
• getFromPrey action gets all resources from the chosen prey,
• migr action allows predator to migrate between nodes of the graph H—this re-
sults in losing some of the resources.
Relations
The set of relations of the pred species with other species that exist within the system
is defined as follows:

$$C^{pred} = \left\{ \xrightarrow{pred,\,getFromPrey-} \right\} \qquad (8.48)$$
As a result of the getFromPrey action, all resources are taken from the selected
prey and the prey dies.
Profiles
The partially ordered set of profiles includes resource profile (pr1 ), reproduction
profile (pr2 ), interaction profile (pr3 ), and migration profile (pr4 ):
The goal of the $pr_1$ profile is to keep the amount of resources above the minimal
level or to die when the amount of resources falls to zero. This profile uses the
model $M^{pr_1} = \{i_{\omega_2}\}$.
The reproduction profile is defined as follows:
$$pr_2 = \big\langle \Gamma^{pr_2} = \Gamma, \; \Omega^{pr_2} = \{\omega_2\}, \; M^{pr_2} = \{i_{\omega_2}\}, \; ST^{pr_2}, \; RST^{pr_2} = ST^{pr_2}, \; GL^{pr_2} \big\rangle \qquad (8.54)$$
The only goal of the $pr_2$ profile is to reproduce. In order to realize this goal the agent can
use the strategy of reproduction $\langle seekPartner, clone, rec, mut \rangle$ or can accept partners for
reproduction (accept).
The interaction profile is defined as follows:
$$pr_3 = \big\langle \Gamma^{pr_3} = \Gamma, \; \Omega^{pr_3} = \emptyset, \; M^{pr_3} = \emptyset, \; ST^{pr_3} = \{\langle give \rangle\}, \; RST^{pr_3} = ST^{pr_3}, \; GL^{pr_3} \big\rangle \qquad (8.56)$$
The goal of the pr3 profile is to interact with predators and preys with the use of
strategy give.
The migration profile is defined as follows:
$$pr_4 = \big\langle \Gamma^{pr_4} = \Gamma, \; \Omega^{pr_4} = \{\omega_1\}, \; M^{pr_4} = \{i_{\omega_1}\}, \; ST^{pr_4} = \{\langle migr \rangle\}, \; RST^{pr_4} = ST^{pr_4}, \; GL^{pr_4} \big\rangle \qquad (8.57)$$
The goal of the $pr_4$ profile is to migrate within the environment. In order to realize
such a goal the migration strategy is used, which first chooses the node and
then realizes the migration. As a result of migrating, the prey loses some amount of
resource.
Profiles
The goal of the $pr_1$ profile is to keep the amount of resource above the minimal level
with the use of the strategy $\langle seek, getFromPrey \rangle$.
The migration profile is defined as follows:
$$pr_2 = \big\langle \Gamma^{pr_2} = \Gamma, \; \Omega^{pr_2} = \{\omega_1\}, \; M^{pr_2} = \{i_{\omega_1}\}, \; ST^{pr_2} = \{\langle migr \rangle\}, \; RST^{pr_2} = ST^{pr_2}, \; GL^{pr_2} \big\rangle \qquad (8.59)$$
Second, the so-called Kursawe problem was used. Its definition is as follows [18]:

$$\text{Kursawe} \equiv \begin{cases} f_1(x) = \sum_{i=0}^{n-1} \left( -10 \exp\left( -0.2 \sqrt{x_i^2 + x_{i+1}^2} \right) \right) \\ f_2(x) = \sum_{i=1}^{n} \left( |x_i|^{0.8} + 5 \sin x_i^3 \right) \\ n = 3, \quad -5 \le x_1, x_2, x_3 \le 5 \end{cases} \qquad (8.61)$$
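A direct transcription of Eq. (8.61) for experimentation (n = 3, bounds as above):

import numpy as np

def kursawe(x):
    # Kursawe test problem, Eq. (8.61); expects a vector of length n = 3.
    x = np.asarray(x, dtype=float)
    f1 = np.sum(-10.0 * np.exp(-0.2 * np.sqrt(x[:-1] ** 2 + x[1:] ** 2)))
    f2 = np.sum(np.abs(x) ** 0.8 + 5.0 * np.sin(x ** 3))
    return f1, f2

print(kursawe([0.0, 0.0, 0.0]))   # (-20.0, 0.0)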
In one of the experiments discussed in this chapter, the problem of building an effective
portfolio was used. The assumed definition as well as the true Pareto frontier for this
problem can be found in [16].
Obviously, during our experiments well-known and commonly used test
suites were used as well, inter alia the ZDT test suite ([19, pp. 57–63],
[21], [5, pp. 356–362], [4, pp. 194–199]).
Fig. 8.4 HVR values obtained by CCoEMAS, SPEA2, and NSGA-II run against Zitzler’s
problems ZDT1 (a) and ZDT2 (b) [11]
As one may see from the analysis of the results presented in Figures 8.4 and 8.5,
CCoEMAS, not being as complex an algorithm as NSGA-II or SPEA2, initially allows for
obtaining better solutions, but with time the classical algorithms, especially NSGA-II,
become the better alternatives. It is however worth mentioning that in the case of
Fig. 8.5 HVR values obtained by CCoEMAS, SPEA2, and NSGA-II run against Zitzler’s
problems ZDT3 (a) ZDT4 (b) and ZDT6 (c) [11]
Fig. 8.6 Pareto frontier approximations obtained by PPCoEMAS (a) and PPES (b) algo-
rithms for Laumanns problem after 6000 steps [9]
Fig. 8.7 The value of HV (a) and HVR (b) measure for Laumanns problem obtained by
PPCoEMAS, PPES and NPGA after 6000 steps
In the very first experiments with PPCoEMAS the relatively simple Laumanns test
problem was used. Figure 8.6 presents the Pareto frontier approximations
obtained by the PPCoEMAS and PPES algorithms, and Figure 8.7 presents
the values of the HV and HVR metrics for all three algorithms being compared
(PPCoEMAS, PPES and NPGA). As can be seen, the differences between the analyzed
algorithms are not very distinct; however, the proposed PPCoEMAS system
seems to be the best alternative.
The second problem used was the more demanding multi-objective Kursawe problem,
with both the Pareto set and the Pareto frontier disconnected. Figure 8.9
presents the final approximations of the Pareto frontier obtained by PPCoEMAS and by the
reference algorithms after 6000 time steps. As one may notice, there is no doubt that
PPCoEMAS is definitely the best alternative, since it is able to obtain a Pareto frontier
that is located very close to the model solution, that is very well dispersed, and, what
Fig. 8.8 The value of HV (a) and HVR (b) measure for Kursawe problem obtained by PP-
CoEMAS, PPES and NPGA after 6000 steps
is also very important, is more numerous than the PPES- and NPGA-based solutions.
The above observations are fully confirmed by the values of the HV and HVR metrics
presented in Figure 8.8.
The proposed co-evolutionary multi-agent system with predator-prey interactions was
also assessed with the use of the effective portfolio building problem. In this case, each
individual in the prey population is represented as a p-dimensional vector; each
dimension represents the percentage participation of the i-th (i ∈ 1 . . . p) share in the
whole portfolio.
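A minimal sketch of how such a genotype can be evaluated is given below, assuming a T x p matrix of historical returns, taking the mean portfolio return as profit and its standard deviation as risk; the exact objective definitions used in [16] may differ.

import numpy as np

def evaluate_portfolio(w, returns):
    # w is the prey genotype: one weight per share, renormalized to sum to 1.
    w = np.asarray(w, dtype=float)
    w = w / w.sum()
    port = returns @ w               # portfolio return in each period
    return port.mean(), port.std()   # (profit, risk)

rng = np.random.default_rng(0)
hist = rng.normal(0.001, 0.02, size=(250, 3))   # toy data for 3 shares
print(evaluate_portfolio([0.2, 0.5, 0.3], hist))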
During the presented experiments, Warsaw Stock Exchange quotations from 2003-01-01
until 2005-12-31 were taken into consideration. Simultaneously, the portfolio
consists of the following three (experiment I) or seventeen (experiment II) stocks
quoted on the Warsaw Stock Exchange: in experiment I: RAFAKO, PONARFEH,
PKOBP; in experiment II: KREDYTB, COMPLAND, BETACOM, GRAJEWO,
KRUK, COMARCH, ATM, HANDLOWY, BZWBK, HYDROBUD, BORYSZEW,
Fig. 8.9 Pareto frontier approximations for Kursawe problem obtained by PPCoEMAS (a),
PPES (b) and NPGA (c) after 6000 steps [9]
Fig. 8.10 Pareto frontier approximations after 1000 steps obtained by PPCoEMAS (a), PPES
(b), and NPGA (c) for building effective portfolio consisting of 3 stocks [16]
A similar situation can be observed in Figure 8.11, which presents the Pareto frontiers
obtained by PPCoEMAS, NPGA and PPES, but this time the portfolio being
optimized consists of 17 shares. Also this time the PPCoEMAS-based frontier
is quite numerous and quite close to the true Pareto frontier, but the tendency to
focus solutions around only selected part(s) of the whole frontier is very distinct.
The explanation of the observed tendency can be found in [9, 16]; on the very
Fig. 8.11 Pareto frontier approximations after 1000 steps obtained by PPCoEMAS (a), PPES
(b), and NPGA (c) for building effective portfolio consisting of 17 stocks [16]
general level it can be said that it is caused by the stagnation of the evolution process
in PPCoEMAS. Hypothetical non-dominated average portfolios for experiments I
and II are presented in Figures 8.12 and 8.13 respectively (in Figure 8.13
shares are presented from left to right in the order in which they were mentioned
above).
Fig. 8.12 Effective portfolio consisting of three stocks proposed by PPCoEMAS [16]
Fig. 8.13 Effective portfolio consisting of seventeen stocks proposed by PPCoEMAS [16]
References
1. Bäck, T., Fogel, D., Michalewicz, Z. (eds.): Handbook of Evolutionary Computation.
IOP Publishing and Oxford University Press (1997)
2. Cetnarowicz, K., Kisiel-Dorohinicki, M., Nawarecki, E.: The application of evolution
process in multi-agent world to the prediction system. In: Tokoro, M. (ed.) Proceedings
of the 2nd International Conference on Multi-Agent Systems (ICMAS 1996). AAAI
Press, Menlo Park (1996)
3. Coello, C., Lamont, G., Van Veldhuizen, D.: Evolutionary Algorithms for Solving Multi-
Objective Problems, 2nd edn. Springer, New York (2007)
4. Coello Coello, C., Van Veldhuizen, D., Lamont, G.: Evolutionary algorithms for solv-
ing multi-objective problems, 2nd edn. Genetic and evolutionary computation. Springer,
Heidelberg (2007)
5. Deb, K.: Multi-Objective Optimization using Evolutionary Algorithms. John Wiley &
Sons, Chichester (2001)
6. Deb, K., Agrawal, S., Pratab, A., Meyarivan, T.: A Fast Elitist Non-Dominated Sorting
Genetic Algorithm for Multi-Objective Optimization: NSGA-II. In: Deb, K., Rudolph,
G., Lutton, E., Merelo, J.J., Schoenauer, M., Schwefel, H.-P., Yao, X. (eds.) PPSN 2000.
LNCS, vol. 1917, pp. 849–858. Springer, Heidelberg (2000),
citeseer.ist.psu.edu/article/deb00fast.html
7. Deb, K., Pratab, A., Agarwal, S., Meyarivan, T.: A fast and elitist multi-objective ge-
netic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation 6(2), 181–197
(2002)
8. Dreżewski, R.: A model of co-evolution in multi-agent system. In: Mařı́k, V., Müller, J.P.,
Pěchouček, M. (eds.) CEEMAS 2003. LNCS (LNAI), vol. 2691, pp. 314–323. Springer,
Heidelberg (2003)
9. Dreżewski, R., Siwik, L.: The application of agent-based co-evolutionary system with
predator-prey interactions to solving multi-objective optimization problems. In: Proceed-
ings of the 2007 IEEE Symposium Series on Computational Intelligence. IEEE, Los
Alamitos (2007)
10. Dreżewski, R., Siwik, L.: Agent-based co-evolutionary techniques for solving multi-
objective optimization problems. In: Kosiński, W. (ed.) Advances in Evolutionary Al-
gorithms. IN-TECH, Vienna (2008)
11. Dreżewski, R., Siwik, L.: Agent-based co-operative co-evolutionary algorithm for multi-
objective optimization. In: Rutkowski, L., Tadeusiewicz, R., Zadeh, L.A., Zurada, J.M.
(eds.) ICAISC 2008. LNCS (LNAI), vol. 5097, pp. 388–397. Springer, Heidelberg
(2008)
12. Zitzler, E., Laumanns, M., Thiele, L.: Spea2: Improving the strength pareto evolutionary
algorithm for multiobjective optimization. In: Giannakoglou, K., et al. (eds.) Evolution-
ary Methods for Design, Optimisation and Control with Application to Industrial Prob-
lems (EUROGEN 2001). International Center for Numerical Methods in Engineering
(CIMNE), pp. 95–100 (2002)
13. Horn, J., Nafpliotis, N., Goldberg, D.E.: A niched pareto genetic algorithm for multi-
objective optimization. In: Proceedings of the First IEEE Conference on Evolutionary
Computation. IEEE World Congress on Computational Intelligence, vol. 1, pp. 82–87.
IEEE Service Center, Piscataway (1994),
citeseer.ist.psu.edu/horn94niched.html
14. Kursawe, F.: A variant of evolution strategies for vector optimization. In: Schwefel, H.-
P., Männer, R. (eds.) PPSN 1990. LNCS, vol. 496, pp. 193–197. Springer, Heidelberg
(1991), citeseer.ist.psu.edu/kursawe91variant.html
15. Laumanns, M., Rudolph, G., Schwefel, H.P.: A spatial predator-prey approach to multi-
objective optimization: A preliminary study. In: Eiben, A.E., Bäck, T., Schoenauer, M.,
Schwefel, H.-P. (eds.) PPSN 1998. LNCS, vol. 1498, p. 241. Springer, Heidelberg (1998)
16. Siwik, L., Dreżewski, R.: Co-evolutionary multi-agent system for portfolio optimization.
In: Brabazon, A., O’Neill, M. (eds.) Natural Computation in Computational Finance, pp.
273–303. Springer, Heidelberg (2008)
17. Spears, W.: Crossover or mutation? In: Proceedings of the Second Workshop on Foundations of Genetic
Algorithms, pp. 221–237. Morgan Kaufmann, San Francisco (1992)
18. Van Veldhuizen, D.A.: Multiobjective evolutionary algorithms: Classifications, analyses
and new innovations. PhD thesis, Graduate School of Engineering of the Air Force Insti-
tute of Technology Air University (1999)
19. Zitzler, E.: Evolutionary algorithms for multiobjective optimization: methods and appli-
cations. PhD thesis, Swiss Federal Institute of Technology, Zurich (1999)
20. Zitzler, E., Thiele, L.: An evolutionary algorithm for multiobjective optimization: The
strength pareto approach. Tech. Rep. 43, Swiss Federal Institute of Technology, Zurich,
Gloriastrasse 35, CH-8092 Zurich, Switzerland (1998),
citeseer.ist.psu.edu/article/zitzler98evolutionary.html
21. Zitzler, E., Deb, K., Thiele, L.: Comparison of Multiobjective Evolutionary Algorithms:
Empirical Results. Evolutionary Computation 8(2), 173–195 (2000)
22. Zitzler, E., Laumanns, M., Thiele, L.: Spea2: Improving the strength pareto evolutionary
algorithm. Tech. Rep. TIK-Report 103, Computer Engineering and Networks Labora-
tory (TIK), Department of Electrical Engineering, Swiss Federal Institute of Technology
(ETH) Zurich, ETH Zentrum, Gloriastrasse 35, CH-8092 Zurich, Switzerland (2001)
23. Zitzler, E., Thiele, L., Laumanns, M., Fonseca, C.M., da Fonseca, V.G.: Performance
assessment of multiobjective optimizers: An analysis and review. IEEE Transactions on
Evolutionary Computation 7(2), 117–132 (2003)
Chapter 9
A Game Theory-Based Multi-Agent System for
Expensive Optimisation Problems
9.1 Introduction
Modelling problems arising in real-world applications, taking into account the non-
linearity and the combinatorial aspects of solution sets, often leads to optimisation
problems that are expensive to solve; they are inherently intractable. Indeed, even checking
a given solution for optimality is NP-hard [10, 17, 32]. It is, therefore, not reasonable,
in general, to expect the optimum solution to be found in acceptable times.
What one can almost always expect is only an approximate solution, the quality of
which is crucial to its potential use.
Abdellah Salhi · Özgun Töreyen
Department of Mathematical Sciences, The University of Essex, Colchester CO4 3SQ, UK
e-mail: [email protected],[email protected]
It is well known that, at least in the case of stochastic algorithms, the quality of
the approximate solution (or the confidence the user may have in it) is proportional
to the time spent in the search for it [22, 23, 24]. As in many applications there is a
time constraint, a deadline beyond which a better approximate solution is of no use,
it is essential that all available resources (software and hardware) be used as well
as possible, to ensure that the best approximate solution under the circumstances is
obtained. This is what the novel approach suggested here attempts to achieve.
To do so, it must:
1. find which algorithm(s), in the suite of algorithms, is the most appropriate for the
given instance of the expensive optimisation problem;
2. replicate this algorithm(s) on all available processor nodes in a parallel environment,
or allocate to it all of the remaining CPU time if a single-processor, or
sequential, environment is used.
Point (1) above is dealt with through measuring the performance of the algorithms
used. Point (2) is dealt with via the implementation of a cooperative/competitive
game of the Iterated Prisoners' Dilemma (IPD) type [4, 7, 16, 20, 25, 27]. Although
other paradigms of cooperative/competitive behaviour, such as the Stag Hunt game
[7], could be used, the IPD seems appropriate. Note that implementing cooperation is
fairly straightforward, whereas implementing competition is not. We believe that
competition is at least as important as cooperation between agents for an effective search. To the best
of our knowledge, this is the first time implementing competition for optimisation
purposes has been attempted. We use payoff matrices as a handle to manipulate it. Two
algorithms (agents) cooperate by exchanging their current solutions; they compete by
not exchanging their solutions. Note that, intuitively, cooperation may lead to early
convergence to a local optimum, by virtue of propagating a given solution potentially
to all algorithms and having all of them search the same area. Competition,
on the other hand, may lead to good coverage of the search space by virtue of not
sharing solutions, i.e. helping algorithms “stay away” from each other and therefore,
potentially, explore different areas of the search space.
Although the study presents the prototype of a generic solver that can involve
any number of solver algorithms and run on any computing platform, here, a system
with only two search algorithms, implemented sequentially, is investigated. This
simplified model, however, has the inherent complexities of a system with many
more agents and should show how good or otherwise the general system can be for
expensive optimisation.
Note that the generic nature of this approach makes it applicable in any discipline
where problem solving is involved and more than one solution method is available.
This document is organised as follows. In Section 9.2, a brief literature review
is given. In Sections 9.3 and 9.4, the design and implementation of the system is
explained. Section 9.5 explains how the system is applied to solve the Travelling
Salesman Problem. The results are presented in Section 9.6. Finally, conclusions
are drawn and future research prospects are outlined in Section 9.7.
9.2 Background
In the following a brief review of the three main topics involved, i.e. optimisation,
the IPD and agents systems will be given.
9.2.1 Optimisation
The general optimisation problem is of the global type, constrained, nonlinear and
involves mixed variables, i.e. both discrete and continuous variables. However, a
lot of optimisation problems that are encountered in real applications do not have
all of these characteristics, but are still intractable. The 0-1-Knapsack problem, for
example, involves only binary variables and has one single constraint, but is still
NP-Hard. The general optimisation problem can be cast in the following form.
Let $f$ be a function from $\mathbb{R}^n$ to $\mathbb{R}$ and $A \subset \mathbb{R}^n$; then find $x^* \in A$ such that $\forall x \in A$,
$f(x^*) \le f(x)$.
                  Player 2
                  C            D
Player 1   C    R=3, R=3     S=0, T=5
           D    T=5, S=0     P=1, P=1
In the payoff matrix of Table 1, actions C and D stand for ‘Cooperate’ and ‘Defect’,
and payoffs R, P, T, and S stand for the ‘Reward’, ‘Punishment’, ‘Temptation’, and
‘Sucker’s’ payoffs respectively. This payoff matrix shows that defecting is beneficial
to both players for two reasons. First, it leads to a greater payoff (T = 5) in
case the other player cooperates (S = 0). Second, it is a safe move because neither
knows what the other’s move will be. So, to rational players, defecting is the best
choice. But if both players choose to defect, then it leads to a worse payoff (P = 1)
compared to cooperating (R = 3). That is the dilemma.
The special setting of the one-shot PD is seen by many to be contrary to the idea of
cooperation. This is because the only equilibrium point is the outcome [P, P], which
is a Nash equilibrium [7]. Also, [P, P] is at the intersection of the minimax strategy
choices of both players. These minimax strategies are dominant for both players,
hence the exclusion in principle of cooperation (by virtue of the dominance of the
chosen strategies). Moreover, even if cooperative strategies were chosen, the resulting
cooperative ‘solution’ is not an equilibrium point. This means that it is not stable,
due to the fact that both players are tempted to defect from it. It should also be noted
that cooperative problems in real life are likely to be faced repeatedly. This makes
the IPD a more appropriate model for the study of cooperation than the one-shot
version of the game.
The PD game is characterised by the strict inequality relations between the payoffs:
T > R > P > S. To avoid coordination or total agreement getting a ‘helping
hand’, most experimental PD games have a payoff matrix satisfying 2R > S + T, as
in Table 1.
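These conditions are easy to check mechanically; the sketch below encodes the Table 1 payoffs and verifies both the dominance of defection and the PD inequalities.

R, S, T, P = 3, 0, 5, 1
payoff = {("C", "C"): (R, R), ("C", "D"): (S, T),
          ("D", "C"): (T, S), ("D", "D"): (P, P)}

# Whatever the opponent does, D pays the row player more than C ...
assert payoff[("D", "C")][0] > payoff[("C", "C")][0]   # T > R
assert payoff[("D", "D")][0] > payoff[("C", "D")][0]   # P > S
# ... yet mutual defection is worse for both than mutual cooperation.
assert payoff[("C", "C")][0] > payoff[("D", "D")][0]   # R > P
assert T > R > P > S and 2 * R > S + T                 # PD conditions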
Close analysis of the IPD reveals that, unlike the one-shot PD, it has a large
number of Nash equilibria, these being inside the convex hull of the outcomes
(0,5), (3,3), (5,0), (1,1) of the pure strategies in the one-shot PD (see Figure 9.1).
Note that (1,1), corresponding to [P, P], is a Nash equilibrium for the IPD also. For a
comprehensive investigation of the IPD, please refer to [4, 5, 15].
Fig. 9.1 Convex hull of the one-shot PD pure-strategy outcomes (0,5), (3,3), (5,0), (1,1)
Solver-agents cooperate (C) or compete (D) with each other by sharing or not sharing
their solutions. If an agent can take an opponent’s possibly better solution when it
is stuck in a local optimum, say, and use it, then it can improve its own search.
The decision to cooperate or to compete is made autonomously by the agents using
their beliefs (no notable change in the objective value over the last few iterations,
for instance, may mean convergence to a local optimum), the history of previous
encounters with their opponents (the number of times they cooperated and
competed), and certain rules which follow from observations of the behaviour of agents.
Some are explained below.
The rules are set to prevent the game from converging too soon to a near
pure-competition game (which is equivalent to playing the GRIM strategy [2]).
A go-it-alone type strategy cannot contribute to the solution quality more than running
an algorithm on its own. These rules are:
• If the number of times SA1 knows the solution of SA2 increases, then the likelihood
that SA2 finds the solution to ℘ first decreases. Therefore, SA2 is unlikely
to cooperate. Since all solver-agents are aware of this, they would cooperate less
and take their opponent’s solution more often, given the chance.
• If SA1 does not cooperate when SA2 cooperates, then SA2 would retaliate. This
leads to TIT-FOR-TAT and go-it-alone type strategies.
• If a solver-agent cooperates in the first encounter with another solver-agent, then
it can be perceived as in need of help, i.e. it is stuck at a local optimum. Agents,
therefore, perceive the first cooperation of their opponent as a “forceful invitation
to cooperate or else...” from bullet point 2 above.
There are all sorts of rules which are implicit in the IPD. Agents, however, do not
have to apply them systematically.
of SA1. It can cooperate and end up in Node 2, or compete and end up in Node 3. Nodes 2 and 3 are the
decision nodes of SA2, which has the same two alternative decisions as SA1. Node 4
follows the cooperation of both agents, which may result in a solution exchange.
Node 5 shows the situation where SA1 cooperates and SA2 competes, which means
SA2 may take the solution of SA1 while SA1 takes nothing. Node 6 depicts the same
situation as that leading to Node 5 but with the agents' actions swapped. In Node
7, neither gives its solution; they continue without any exchange.
The decision tree is expanded further with branching from nodes 4-7, but with
the alternatives now being: “Take the opponent’s solution” and “Do not take the opponent’s
solution”. This branching determines which agent has a better solution
and is essential for setting up the payoff matrices that drive the system. 8 new
nodes (leaves) arise. Each pair of sibling nodes yields a different payoff matrix. The
labels (G) (for good) and (B) (for bad) refer to agents having a better solution than the
opponent, or otherwise, respectively. The cells that are crossed out refer to impossible
outcomes.
Managing the resources is based on the outcomes of the decisions of the solver-agents.
When an agent cooperates, it gains one unit (of CPU time or its equivalent in
terms of iterations it is allowed to do) and loses double that. When it competes, it
gains two units and loses one (or half of the initial gain). This means the GTMAS
payoff matrix rewards competition. The idea behind supporting competition is to
counter the “helping hand” that cooperation gets from the rules underpinning the
construction of GTMAS (see above). It can also be argued that, intuitively at least,
too frequent exchanging of solutions will lead to early convergence to local optima.
So, competition gives solver-agents the chance to cover the search space better.
The 4 payoff matrices in Figure 9.3 can be combined in one payoff matrix
(Table 9.2).
Table 9.2 Combined Payoff Table for Evaluating and Rewarding Agents
              B
              C          D
G    C    (1, -2)    (1, -1)
     D    (2, -2)    (2, -1)
The equilibrium point for the payoff matrix is (D, D), with payoffs 2 and -1. It
is also a regret-free point. The payoff matrix at the core of GTMAS is different
from those commonly found in the literature. Those matrices would be drawn immediately
after decisions have been taken, i.e. at nodes 4 to 7 in Figure 9.3. Here,
they are drawn after other decisions are taken. In fact, one can highlight three main
differences:
(i) The return of a player does not depend on the opponent’s choice directly. Whether
the opponent cooperates or competes becomes relevant only after the exchange of
solutions has been decided;
(ii) The payoff is affected by what has been achieved in terms of the quality of the
solution after the exchange (or otherwise) of solutions.
Unlike traditional games, here, after the players (solver-agents) have made their
choices, they are given a chance to progress with the consequences of the choices.
Only after that are they rewarded/punished. This was made explicit in the above
paragraph where the reasons for rewarding competition/penalising cooperation were
given; for instance, when we said that a cooperating agent “gains one unit and loses
double that”, we meant that the solver-agent runs first for a unit of CPU time (or its
equivalent in iterations) and only after that is it penalised by taking 2 units of CPU
time from its account. Basically, the quality of the solution following decisions has
to be measured first before the payoffs are allocated. Time is an important factor in
the IPD;
(iii) The third difference is that the players are not “Solver-Agent 1” and “Solver-Agent
2”, but instead “Solver-Agent with the better solution” and “Solver-Agent
with the worse solution”. The configuration of the table may change at each stage
according to the solution qualities of the solver-agents. The one with the better solution
is always placed as the row player.
Coordinator-Agent Pseudocode
1. Initialise belief. Initialise resources,n.
2. For Nstage stages, play the game and update belief
where Nstage limits the number of stages the game is played.
2.1. Start decision phase: Run the solver-agents to decide.
2.2. Manage the solution exchange.
2.3. Start competition phase: Run the solver-agents to compete.
2.4. Evaluate and reward/punish the solver-agents.
     Update resources: agent i receives $m_i = n + \sum_{j=1}^{\text{currentstage}-1} r_{ij}$ iterations,
     where $r_{ij}$ is the reward of agent i at stage j.
2.5. Increment stage.
3. End the game. Select the best algorithm. Report the results.
Solver-Agents Pseudocode
1. Initialise belief.
2. If it is a decision phase, do:
2.1. If it is the first stage, do:
2.1.1. Initialise memory and algorithm specific parameters.
2.1.2. Run own algorithm for n iterations.
2.1.3. Cooperate.
2.1.4. End run. Send the results to the Coordinator-Agent.
2.2. If it is the second stage, do:
2.2.1. Update belief.
2.2.2. Run own algorithm for mi iterations.
2.2.3. Compete.
2.2.4. End run. Send the results to the Coordinator-Agent.
2.3. If stage > 2, do:
2.3.1. Update belief.
2.3.2. Run own algorithm for mi iterations.
2.3.3. Decide to cooperate/compete.
2.3.4. End run. Send the results to the Coordinator-Agent.
3. If it is a competition phase, do:
3.1. Update belief.
3.2. Run own algorithm for n iterations.
3.3. End run. Send the results to the Coordinator-Agent.
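A compressed, runnable reading of the coordinator/solver-agent interplay is sketched below. The run interface, the randomized stand-in for belief-based decisions, the budget floor, and the net reward of +1 for competing and -1 for cooperating (mirroring the gross gains and losses described earlier) are all our assumptions.

import random

def play_gtmas(solvers, n=100, n_stages=10):
    # solvers: name -> run(iterations, start) returning (solution, value).
    budget = {name: n for name in solvers}
    best = {name: (None, float("inf")) for name in solvers}
    for stage in range(n_stages):
        moves = {}
        for name, run in solvers.items():            # decision phase
            sol, val = run(budget[name], best[name][0])
            if val < best[name][1]:
                best[name] = (sol, val)
            moves[name] = "C" if stage == 0 else random.choice("CD")
        if all(m == "C" for m in moves.values()):    # mutual cooperation:
            overall = min(best.values(), key=lambda t: t[1])
            best = dict.fromkeys(best, overall)      # exchange solutions
        for name, m in moves.items():                # reward/punish
            budget[name] = max(budget[name] + (1 if m == "D" else -1), 1)
    return min(best.items(), key=lambda kv: kv[1][1])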
In the decision-making process, P(IC|OC) is the probability that SA1 will cooperate
in the next iteration given that SA2 cooperates in this iteration. It is equal to the
ratio between the number of times SA1 cooperates in the (n + 1)st iteration given
that SA2 cooperated in the nth iteration and the total number of encounters.
P(OC|IC) is the probability that SA2 will cooperate in the next iteration given that
SA1 cooperates in this iteration. It is equal to the ratio of the number of times SA2
cooperates in the (n + 1)st iteration given that SA1 cooperated in the nth iteration to the total
number of encounters.
P(OC|ID) is the probability that SA2 will cooperate in the next iteration given that
SA1 competes in this iteration. It is equal to the ratio of the number of times SA2
cooperates in the (n + 1)st iteration given that SA1 competed in the nth iteration to the
total number of encounters.
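One natural reading of these belief estimates is the empirical conditional frequency over the encounter history, as sketched below; the 0.5 prior returned for an empty history is our assumption.

def cond_prob(history, my_prev, opp_next="C"):
    # history: list of (my_move, opp_move) pairs, one per encounter.
    # Estimates P(opponent plays opp_next next | I played my_prev now).
    pairs = [(mine, history[i + 1][1])
             for i, (mine, _) in enumerate(history[:-1])]
    rel = [nxt for prev, nxt in pairs if prev == my_prev]
    return rel.count(opp_next) / len(rel) if rel else 0.5

h = [("C", "C"), ("C", "D"), ("D", "D"), ("C", "C"), ("C", "C")]
print(cond_prob(h, "C"))   # an estimate of P(OC|IC)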
              B
              C          D
G    C    (4, -8)    (4, -4)
     D    (8, -8)    (8, -4)

              B                              B
              C          D                   C           D
G    C    (4, -4)    (4, -4)      G    C    (4, -8)    (4, -4)
     D    (4, -4)    (4, -4)           D    (8, -8)    (8, -4)

              B                              B
              C          D                   C             D
G    C    (8, -4)    (8, -8)      G    C    (4/t, -4t)   (4/t, -4/t)
     D    (4, -4)    (4, -8)           D    (4t, -4t)    (4t, -4/t)
Agent characteristics    α     β     γ      η     σ
cooperative              0.1   0.9   0.01   0.2   0.6
competitive              0.5   0.1   0.01   0.8   0.2
Twenty combinations of parameters were used, and for each, five runs were carried
out on the problems of Table 9.13, from TSPLIB ([21]). GA is found to be the
better algorithm in 98 runs and SA is found to be better only in the remaining 2. The
final solutions, the elapsed times, the winning solver algorithm and the series of
cooperation/competition bouts and exchanges of solutions are recorded. The results are
entered into SPSS 14 for analysis of significant factors. The cooperation/competition
series and the exchange series are categorised prior to analysis. The categorisation
is summarised in Table 9.7.
These are added to the factors of the experiment. The final factors of the question
are summarised in Table 9.8 with their corresponding values and the number of
occurrences.
Table 9.8 Factors of the experiment with their values and numbers of observations

Factors                            Values                        Number of Observations
PAYOFF                             simple                  1     25
                                   cooperation-rewarded    2     25
                                   competition-rewarded    3     25
                                   time-dependent          4     25
DECISION PROCESS                   coop GA vs coop SA      1     20
                                   coop GA vs comp SA      2     20
                                   comp GA vs coop SA      3     20
                                   comp GA vs comp SA      4     20
                                   random                  5     20
CATEGORY GA COOPERATES             less than twice         1     64
                                   more than twice         2     36
CATEGORY SA COOPERATES             less than twice         1     43
                                   more than twice         2     57
CATEGORY GA TAKES SA'S SOLUTION    never takes             1     25
                                   takes in first stage    2     23
                                   takes after second stage 3    26
                                   takes in both           4     26
CATEGORY SA TAKES GA'S SOLUTION    never takes             1     22
                                   takes in first stage    2     33
                                   takes after second stage 3    33
                                   takes in both           4     12
Table 9.9 shows ANOVA results for the dependent variable deviation. Here, deviation is the difference between the true solution objective value and the objective value of the solution found. All factors and reasonable multiple interactions are included in the model. Most of them are highly insignificant due to the high random variability. However, the interaction of the solution-taking sequences of the agents is significant at the 11% level; therefore, the solution-taking sequence factors themselves are significant. Even though this is not a very reliable significance level, these are the factors most expected to be significant in explaining the data, since the solution quality is expected to depend on the times at which solution exchanges occur.
Table 9.10 shows ANOVA results for the dependent variable time. The only significant factor, at the 12% significance level, is the solution exchange sequence of the SA solver-agent. This matches the expectations exactly, since SA varies a lot both between iterations within problems and between problems. When it takes a solution in any stage, the average elapsed time is about 100 seconds. When it does not take the GA solution, the average elapsed time is about 60 seconds.
Table 9.9 ANOVA - Significant Factors Affecting Deviation From True Solution Value
In Table 9.11, the best deviation is found when GA takes SA's solution both in the first stage and after the second stage, and SA takes GA's solution only in the first stage. The average deviation is 3.36%. However, the average time elapsed to obtain this average deviation is quite high, at 141 seconds. The second best deviation is observed when GA takes SA's solution after the second stage and SA never takes GA's solution. The average deviation is 3.79% with an average elapsed time of 58 seconds. From these results, it can be said that the best performance is obtained in this setting, i.e. when GA competes and takes the solution of SA, and SA cooperates by offering its own solution and never taking that of GA. Whether obtaining this solution exchange setting is random is not clear. What is clear is that it occurs quite often. Table 9.12 records some of its occurrences. Amongst these 11 occurrences, the best average deviation is obtained with a competition-rewarded payoff matrix and both agents being competitive. The deviation is 0.65%, which actually comes from only one occurrence, and the time is 58 seconds. This analysis does not show that if competitive agents play against each other in a competition-rewarded environment, then this is the best environment; rather, it shows that if competitive agents play against each other in a competition-rewarded environment and their solution exchange happens to be a one-way benefit to one of the solver-agents, then this might be the best setting.
In some runs, the system even returns a better solution than the best algorithm would obtain on its own. This is because of the synergistic effect of the algorithms working together.
GTMAS implements an interesting resource allocation process that uses a purpose-built payoff matrix to encourage competition for the available computing resources. Solver-agents are rewarded for good performance by increasing their access to the computing facilities; they are punished for bad performance by reducing it. This simple rule guarantees that the computing platform is increasingly dedicated to the most suited algorithm. In other words, the bulk of the computing platform will eventually be used by the best performing algorithm, which is synonymous with the computing resources being used efficiently.
GTMAS as implemented here involves only two players. The study would benefit from a more extensive investigation with a larger number of algorithms. The results obtained here can be used to extend it to n players. The game can be designed such that, given the players A1, A2, ..., An, pair-wise games are considered and each game is evaluated separately according to the same 2-by-2 payoff matrix introduced here. The solvers that fail in the simultaneous 2-by-2 competitions are eliminated, and the tournament continues with the ones that survive.
Another approach to playing the n-by-n game could be to play it simultaneously, using notions of Nash's poker game [13], with a specially created n-by-n payoff matrix that would evaluate all agents at once but select the best iteratively. Current and future research directions concern extending the ideas of the GTMAS prototype to a general n-by-n environment which deals with n algorithms, running in parallel, according to one of the two proposed payoff matrices.
References
1. Aldea, A., Alcantra, R.B., Jimenez, L., Moreno, A., Martinez, J., Riano, D.: The scope
of application of multi-agent systems in the process industry: Three case studies. Expert
Systems with Applications 26, 39–47 (2004)
2. Axelrod, R.: Effective choice in the prisoner’s dilemma. Journal of Conflict Resolu-
tion 24(1), 3–25 (1980)
3. Axelrod, R.: More effective choice in the prisoner’s dilemma. Journal of Conflict Reso-
lution 24(3), 379–403 (1980)
4. Axelrod, R.: The Evolution of Cooperation. Basic Books, New York (1984)
5. Axelrod, R.: The evolution of strategies in the iterated prisoners’ dilemma. In: Davis, L.
(ed.) Genetic Algorithms and Simulated Annealing, pp. 32–42. Morgan Kaufmann, Los
Altos (1987)
6. Axelrod, R., Hamilton, W.D.: The evolution of cooperation. Science 211, 1390–1396
(1981)
7. Binmore, K.: Fun and Games. D.C.Heath, Lexington (1991)
8. Binmore, K.: Playing fair: Game theory and the social contract. MIT Press, Cambridge
(1994)
9. Bratman, M.E.: Shared cooperative activity. The Philosophical Review 101(2), 327–341
(1992)
10. Byrd, R.H., Dert, C.L., Rinnooy Kan, A.H.G., Schnabel, R.B.: Concurrent stochastic
methods for global optimization. Mathematical Programming 46, 1–30 (1990)
11. Colman, A.M.: Game Theory and Experimental Games. Pergamon Press Ltd., Oxford
(1982)
12. Doran, J.E., Franklin, S., Jennings, N.R., Norman, T.J.: On cooperation in multi-agent
systems. The Knowledge Engineering Review 12(3), 309–314 (1997)
13. Nash, J.F.: Non-cooperative games. Annals of Mathematics 54(2), 286–295 (1951)
14. Holland, J.H.: Adaptation in Natural and Artificial Systems. University of Michigan
Press, Ann Arbor (1975)
15. Linster, B.: Essays on Cooperation and Competition. PhD thesis, University of Michigan,
Michigan (1990)
16. Luce, R., Raiffa, H.: Games and Decisions. Wiley, New York (1957)
17. Murty, K.G., Kabadi, S.N.: Some NP-complete problems in quadratic and nonlinear pro-
gramming. Mathematical Programming 39, 117–130 (1987)
18. Töreyen, Ö.: A game-theory based multi-agent system for solving complex optimisation problems and a clustering application related to the integration of Turkey into the EU community. M.Sc. thesis, Department of Mathematical Sciences, University of Essex, UK (2008)
19. Park, S., Sugumaran, V.: Designing multi-agent systems: A framework and application.
Expert Systems with Applications 28, 259–271 (2005)
20. Rapoport, A., Chammah, A.M.: Prisoner’s Dilemma: A Study in Conflict and Coopera-
tion. University of Michigan Press, Ann Arbor (1965)
21. Reinelt, G.: TSPLIB, http://www.iwr.uni-heidelberg.de/groups/comopt/software/TSPLIB95
22. Rinnooy Kan, A.H.G., Timmer, G.T.: Stochastic global optimization methods Part I:
Clustering methods. Mathematical Programming 39, 27–56 (1987)
23. Rinnooy Kan, A.H.G., Timmer, G.T.: Stochastic global optimization methods Part II:
Multi-level methods. Mathematical Programming 39, 57–78 (1987)
24. Rinnooy Kan, A.H.G., Timmer, G.T.: Global optimization. In: Nemhauser, G.L., Rin-
nooy Kan, A.H.G., Todd, M.J. (eds.) Optimization. Handbooks in Operations Research
and Management Science, ch. IX, vol. 1, pp. 631–662. North Holland, Amsterdam
(1989)
25. Salhi, A., Glaser, H., De Roure, D.: A genetic approach to understanding cooperative
behaviour. In: Osmera, P. (ed.) Proceedings of the 2nd International Mendel Conference
on Genetic Algorithms, MENDEL 1996, pp. 129–136 (1996)
26. Salhi, A., Glaser, H., De Roure, D.: Parallel implementation of a genetic-programming
based tool for symbolic regression. Information Processing Letters 66(6), 299–307
(1998)
27. Salhi, A., Glaser, H., De Roure, D., Putney, J.: The prisoners’ dilemma revisited. Techni-
cal Report DSSE-TR-96-2, Department of Electronics and Computer Science, The Uni-
versity of Southampton, U.K. (February 1996)
28. Salhi, A., Proll, L.G., Rios Insua, D., Martin, J.: Experiences with stochastic algorithms
for a class of global optimisation problems. RAIRO Operations Research 34(22), 183–
197 (2000)
29. Seshadri, A.: Simulated annealing for travelling salesman problem,
http://www.mathworks.com/matlabcentral/fileexchange
30. Tweedale, J., Ichalkaranje, H., Sioutis, C., Jarvis, B., Consoli, A., Phillips-Wren, G.:
Innovations in multi-agent systems. Journal of Network and Computer Applications 30,
1089–1115 (2007)
31. Wooldridge, M., Jennings, N.R.: Intelligent agents: Theory and practice. Knowledge En-
gineering Review 10(2), 115–152 (1995)
32. Zhigljavsky, A.A.: Theory of Global Search. Mathematics and its applications, Soviet
Series, vol. 65. Kluwer Academic Publishers, Dordrecht (1991)
Chapter 10
Optimization with Clifford Support Vector
Machines and Applications
10.1 Introduction
The Support Vector Machine (SVM) [1, 2, 3, 4] is a powerful optimization algorithm for solving classification and regression problems, but it was originally designed for binary (two-class) problems.
N. Arana-Daniel · C. López-Franco
Computer Science Department, Exact Sciences and Engineering Campus, CUCEI,
University of Guadalajara, Av. Revolucion 1500, Col. Olı́mpica, C.P. 44430,
Guadalajara, Jalisco, México
e-mail: {nancy.arana,carlos.lopez}@cucei.udg.mx
E. Bayro-Corrochano
Cinvestav del IPN, Department of Electrical Engineering and Computer Science,
Zapopan, Jalisco, México
$$a \cdot b = \tfrac{1}{2}(ab + ba), \qquad a \wedge b = \tfrac{1}{2}(ab - ba). \qquad (10.2)$$
The inner product of two vectors is the standard scalar or dot product and produces a scalar. The outer or wedge product of two vectors is a new quantity which we call a bivector. We think of a bivector as an oriented area in the plane containing a and b, formed by sweeping a along b. Thus, b ∧ a has the opposite orientation, making the wedge product anti-commutative, as given in (10.2). The outer product is immediately generalizable to higher dimensions; for example, (a ∧ b) ∧ c, a trivector, is interpreted as the oriented volume formed by sweeping the area a ∧ b along vector c. The outer product of k vectors is a k-vector or k-blade, and such a quantity
is said to have grade k. A multivector A ∈ Gn is the sum of k-blades of different grades; the set of all basis blades spans the entire geometric algebra Gn. Here I is the hypervolume, called the pseudoscalar, which commutes with all the multivectors and is also used as a dualization operator. Note that the basis vectors are not represented by bold symbols. Any multivector can be expressed in terms of this basis. Because the addition of k-vectors (homogeneous vectors of grade k) and their multiplication by scalars are closed, the k-vectors form a vector space, denoted $\bigwedge^k V^n$. Each of these spaces is spanned by $\binom{n}{k}$ k-vectors, where $\binom{n}{k} := \frac{n!}{(n-k)!\,k!}$.
Thus, our geometric algebra Gn, which is spanned by $\sum_{k=0}^{n} \binom{n}{k} = 2^n$ elements, is a direct sum of its homogeneous subspaces of grades 0, 1, 2, ..., n, that is,

$$\mathcal{G}_n = \bigwedge^0 V^n \oplus \bigwedge^1 V^n \oplus \bigwedge^2 V^n \oplus \cdots \oplus \bigwedge^n V^n, \qquad (10.4)$$

where $\bigwedge^0 V^n = \mathbb{R}$ is the set of real numbers and $\bigwedge^1 V^n = V^n$ corresponds to the linear n-dimensional vector space. Thus, any multivector of Gn can be expressed in terms of the bases of these subspaces.
In this chapter we will specify a geometric algebra Gn of the n-dimensional space by $\mathcal{G}_{p,q,r}$, where p, q and r stand for the number of basis vectors which square to 1, -1 and 0 respectively, and fulfill n = p + q + r. Its even subalgebra will be denoted by $\mathcal{G}^{+}_{p,q,r}$.
In the n-D space there are multivectors of grade 0 (scalars), grade 1 (vectors), grade 2 (bivectors), grade 3 (trivectors), etc., up to grade n. Any two such multivectors can be multiplied using the geometric product. Consider two multivectors Ar and Bs of grades r and s respectively. The geometric product of Ar and Bs can be written as

$$A_r B_s = \langle AB \rangle_{r+s} + \langle AB \rangle_{r+s-2} + \dots + \langle AB \rangle_{|r-s|}, \qquad (10.5)$$

where $\langle M \rangle_t$ is used to denote the t-grade part of multivector M; e.g., consider the geometric product of two vectors $ab = \langle ab \rangle_0 + \langle ab \rangle_2 = a \cdot b + a \wedge b$. Another simple illustration is the geometric product of $A = 4\sigma_3 + 2\sigma_1\sigma_2$ and $b = 8\sigma_2 + 6\sigma_3$.
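As a worked example, assuming an orthonormal Euclidean basis with $\sigma_i^2 = 1$ and $\sigma_i \sigma_j = -\sigma_j \sigma_i$ for $i \neq j$:

$$Ab = (4\sigma_3 + 2\sigma_1\sigma_2)(8\sigma_2 + 6\sigma_3) = 24 + 16\,\sigma_1 - 32\,\sigma_2\sigma_3 + 12\,\sigma_1\sigma_2\sigma_3,$$

a multivector with scalar, vector, bivector and trivector parts.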
Note here that $\sigma_i \sigma_i = (\sigma_i)^2 = \sigma_i \cdot \sigma_i = 1$, because the wedge product $\sigma_i \wedge \sigma_i = 0$, and that $\sigma_i \sigma_j = \sigma_i \wedge \sigma_j$ for $i \neq j$; i.e., the geometric product of different unit basis vectors is equal to their wedge product, which for simplicity of notation can be omitted. Using equation (10.5), the inner and outer products for multivectors can be expressed accordingly.
In order to deal with more general multivectors, we define the scalar product

$$A * B = \langle AB \rangle_0. \qquad (10.9)$$
For an r-grade multivector $A_r = \sum_{i=0}^{r} \langle A_r \rangle_i$, the following operations are defined:

Grade involution: $\hat{A}_r = \sum_{i=0}^{r} (-1)^{i} \langle A_r \rangle_i$ \hfill (10.10)

Reversion: $A_r^{\dagger} = \sum_{i=0}^{r} (-1)^{\frac{i(i-1)}{2}} \langle A_r \rangle_i$ \hfill (10.11)

Clifford conjugation: $\tilde{A}_r = \hat{A}_r^{\dagger} = \sum_{i=0}^{r} (-1)^{\frac{i(i+1)}{2}} \langle A_r \rangle_i$ \hfill (10.12, 10.13)
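For instance, applying the reversion (10.11) to a vector and a bivector gives

$$\sigma_1^{\dagger} = \sigma_1, \qquad (\sigma_1\sigma_2)^{\dagger} = \sigma_2\sigma_1 = -\sigma_1\sigma_2,$$

consistent with the sign pattern $(-1)^{i(i-1)/2} = +, +, -, -$ for grades $i = 0, 1, 2, 3$.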
The grade involution simply negates the odd-grade blades of a multivector. The
reversion can also be obtained by reversing the order of basis vectors making up the
blades in a multivector and then rearranging them to their original order using the
anti-commutativity of the Clifford product. The scalar product ∗ is positive definite, i.e. one can associate with any multivector $A = \langle A \rangle_0 + \langle A \rangle_1 + \dots + \langle A \rangle_n$ a unique positive scalar magnitude |A| defined by

$$|A|^2 = A^{\dagger} * A = \sum_{r=0}^{n} |\langle A \rangle_r|^2 \geq 0. \qquad (10.14)$$
If $|A| \neq 0$, the multivector A has an inverse, which can be written as

$$A^{-1} = (-1)^q \frac{A^{\dagger}}{|A|^2}. \qquad (10.15)$$
$$R = a_0 + a_1 (I\sigma_1) - a_2 (I\sigma_2) + a_3 (I\sigma_3) \qquad (10.17)$$
$$\;\;\;\; = a_0 + a_1 \sigma_2\sigma_3 + a_2 \sigma_3\sigma_1 + a_3 \sigma_1\sigma_2, \qquad (10.18)$$

where the first term is the scalar part and the remaining terms are bivectors.
Each ith data vector $x_i \in \mathcal{G}_n^D$ will be associated with one output of the $2^n$ possibilities given by the multivector output $y = y_s + y_{\sigma_1}\sigma_1 + y_{\sigma_2}\sigma_2 + \dots + y_I I$, where the first subindex s stands for the scalar part. For the classification, the CSVM separates these multivector-valued samples into $2^n$ groups by selecting a good enough function from the set of functions

$$\{ f(x) = w^{\dagger T} x + b \}, \qquad (10.20)$$

where $w^{\dagger T} x$ corresponds to the Clifford product of two multivectors and $w^{\dagger}$ is the reversion of the multivector w.
Next, we introduce a structural risk functional similar to the real-valued one of the SVM for classification. By using a loss function similar to Vapnik's ξ-insensitive one, we utilize the following linear-constraint quadratic program for the primal equation:

$$\min L(w, b, \xi) = \tfrac{1}{2} w^{\dagger T} w + C \sum_{i,j} \xi_{ij}$$
subject to
$$y_{ij} \big(f(x_i)\big)_j = y_{ij} \big(w^{\dagger T} x_i + b\big)_j \geq 1 - \xi_{ij}, \qquad \xi_{ij} \geq 0 \text{ for all } i, j, \qquad (10.22)$$
where $\xi_{ij}$ stands for the slack variables, i indexes the ith data vector, and j indexes the multivector component, i.e. j = 1 for the coefficient of the scalar part, j = 2 for the coefficient of $\sigma_1$, ..., $j = 2^n$ for the coefficient of I. The dual expression of this problem can be derived straightforwardly. First, let us consider the expression for the orientation of the optimal hyperplane.
are given by

$$a_s^T = [(\alpha_s)_1 (y_s)_1, (\alpha_s)_2 (y_s)_2, \dots, (\alpha_s)_l (y_s)_l]$$
$$a_{\sigma_1}^T = [(\alpha_{\sigma_1})_1 (y_{\sigma_1})_1, (\alpha_{\sigma_1})_2 (y_{\sigma_1})_2, \dots, (\alpha_{\sigma_1})_l (y_{\sigma_1})_l]$$
$$\vdots$$
$$a_I^T = [(\alpha_I)_1 (y_I)_1, (\alpha_I)_2 (y_I)_2, \dots, (\alpha_I)_l (y_I)_l]. \qquad (10.28)$$
Note that the vector $a^T$ has dimension $(l \times 2^n) \times 1$. We require a compact and easy representation of the resulting Gram matrix of the multivector components; this will help with the programming of the algorithm. To that end, let us first consider the Clifford product $w^{\dagger T} w$, which can be expressed as follows:

$$w^{\dagger T} w = (w^{\dagger T} w)_s + (w^{\dagger T} w)_{\sigma_1} + (w^{\dagger T} w)_{\sigma_2} + \dots + (w^{\dagger T} w)_I. \qquad (10.29)$$

Since w has the components presented in (10.25), equation (10.29) can be rewritten as follows:

$$w^{\dagger T} w = a_s^T x^{\dagger T} x_s\, a_s + \dots + a_s^T x^{\dagger T} x_{\sigma_1 \sigma_2}\, a_{\sigma_1 \sigma_2} + \dots + a_s^T x^{\dagger T} x_I\, a_I + a_{\sigma_1}^T x^{\dagger T} x_s\, a_s + \dots + a_{\sigma_1}^T x^{\dagger T} x_{\sigma_1 \sigma_2}\, a_{\sigma_1 \sigma_2} + \dots + a_{\sigma_1}^T x^{\dagger T} x_I\, a_I + \dots + a_I^T x^{\dagger T} x_s\, a_s + a_I^T x^{\dagger T} x_{\sigma_1}\, a_{\sigma_1} + \dots + a_I^T x^{\dagger T} x_{\sigma_1 \sigma_2}\, a_{\sigma_1 \sigma_2} + \dots + a_I^T x^{\dagger T} x_I\, a_I. \qquad (10.30)$$
Renaming the matrices of the t-grade parts of $x^{\dagger T} x_t$, we rewrite the previous equation as (10.31). Taking into consideration the previous equations and definitions, the primal equation (10.22) now reads as follows:

$$\min L(w, b, \xi) = \tfrac{1}{2} a^T H a + C \cdot \sum_{i,j} \xi_{ij}. \qquad (10.32)$$
Using the previous definitions and equations, we can define the dual optimization problem as follows:

$$\max \; a^T \mathbf{1} - \tfrac{1}{2} a^T H a$$
subject to
$$0 \leq (\alpha_s)_j \leq C, \; 0 \leq (\alpha_{\sigma_1})_j \leq C, \dots, 0 \leq (\alpha_{\sigma_1 \sigma_2})_j \leq C, \dots, 0 \leq (\alpha_I)_j \leq C \quad \text{for } j = 1, \dots, l. \qquad (10.33)$$
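As a numerical sketch (not the chapter's implementation), this box-constrained dual can be handed to any standard QP solver; here cvxopt is used, with illustrative names and a small ridge added to keep H numerically positive definite:

```python
import numpy as np
from cvxopt import matrix, solvers

def solve_csvm_dual(H, C):
    """Maximise a^T 1 - (1/2) a^T H a  subject to  0 <= a_k <= C,
    i.e. minimise (1/2) a^T H a - 1^T a in standard QP form."""
    m = H.shape[0]
    P = matrix(H + 1e-8 * np.eye(m))                 # ridge for stability
    q = matrix(-np.ones(m))
    G = matrix(np.vstack([-np.eye(m), np.eye(m)]))   # -a <= 0 and a <= C
    h = matrix(np.hstack([np.zeros(m), C * np.ones(m)]))
    solvers.options['show_progress'] = False
    sol = solvers.qp(P, q, G, h)
    return np.array(sol['x']).ravel()

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
a = solve_csvm_dual(X @ X.T, C=1.0)                  # toy Gram matrix
print(a)
```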
Note that the diagonal block entries of H equal $H_s$, and since H is a symmetric matrix, the lower blocks are the transposes of the upper ones. The optimal weight vector w is as given by (10.23). The threshold $b \in \mathcal{G}_n^D$ can be computed by using the KKT conditions with the Clifford support vectors as follows.
The decision function can be seen as sectors reserved for each involved class; i.e., in the case of complex numbers ($\mathcal{G}_{1,0,0}$) or quaternions ($\mathcal{G}_{0,2,0}$), we can see that the circle or the sphere is divided by means of spherical vectors. Thus, the decision function can be envisaged as

$$y = \mathrm{csign}_m\big(f(x)\big) = \mathrm{csign}_m\big(w^{\dagger T} x + b\big) = \mathrm{csign}_m\Big(\sum_{j=1}^{l} (\alpha_j \circ y_j)(x_j^{\dagger T} x) + b\Big), \qquad (10.36)$$
where $\mathrm{csign}_m(f(x))$ is the function for detecting the sign of f(x), m stands for the different values which indicate the state valency (e.g. bivalent, tetravalent), and the operation “◦” is defined as follows: one simply takes, as the coefficients of the multivector basis, the products of the coefficients of blades of the same grade. For clarity we introduce this operation “◦”, which takes place implicitly in the previous equation (10.25).
Note that the cases of complex numbers with 2 states (outputs 1 for $-\frac{\pi}{2} \leq \arg(f(x)) < \frac{\pi}{2}$ and $-1$ for $\frac{\pi}{2} \leq \arg(f(x)) < \frac{3\pi}{2}$) and 4 states (outputs $1+i$ for $0 \leq \arg(f(x)) < \frac{\pi}{2}$, $-1+i$ for $\frac{\pi}{2} \leq \arg(f(x)) < \pi$, $-1-i$ for $\pi \leq \arg(f(x)) < \frac{3\pi}{2}$ and $1-i$ for $\frac{3\pi}{2} \leq \arg(f(x)) < 2\pi$) can be solved by the multi-class real-valued SVM; however, in the case of higher representations, like the 16-state one using quaternions, it would be awkward to resort to multi-class real-valued SVMs.
The major advantage of our approach is that we redefine the optimization variables as multivectors. This allows us to utilize the components of the multivector output to represent different classes. The number of achievable class outputs is directly proportional to the dimension of the involved geometric algebra. The key idea for solving multi-class classification in the geometric algebra is to prevent the multivector elements of different grades from being collapsed into a scalar; this is achieved thanks to the redefinition of the primal problem involving the Clifford product instead of the inner product (10.22). The reader should bear in mind that the Clifford product performs the direct product between the spaces of different grades and that its result is represented by a multivector; thus the outputs of the CSVM are represented by $y = y_s + y_{\sigma_1} + y_{\sigma_2} + \dots + y_I \in \{\pm 1 \pm \sigma_1 \pm \sigma_2 \dots \pm I\}$.
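To make the encoding concrete (a sketch of the idea only, with illustrative names): with quaternion-valued outputs, each of the four components carries one sign, so up to $2^4 = 16$ class labels can be encoded and decoded componentwise:

```python
import numpy as np

# Encode a class label as the componentwise signs of a quaternion-valued
# output y = y_s + y_i i + y_j j + y_k k in {+-1}^4, and decode by taking
# the sign of each component of f(x) -- the csign idea.

def encode_class(label, n_components=4):
    bits = [(label >> b) & 1 for b in range(n_components)]
    return np.array([1.0 if bit else -1.0 for bit in bits])

def decode_class(f_x):
    signs = np.asarray(f_x) > 0
    return sum(int(bit) << b for b, bit in enumerate(signs))

y = encode_class(11)          # class 11 -> [ 1.,  1., -1.,  1.]
print(decode_class(y + 0.2))  # small perturbation; still decodes to 11
```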
In general we build a Clifford kernel $K(x_m, x_j)$ by taking the Clifford product between the reversion of $x_m$ and $x_j$; note that the kind of reversion operation $(\cdot)^{\dagger}$ of a multivector depends on the signature of the involved geometric algebra $\mathcal{G}_{p,q,r}$. Next, as an illustration, we present kernels using different geometric algebras. According to the Mercer theorem, there exists a mapping $u : \mathcal{G} \rightarrow \mathcal{F}$ which maps the multivectors $x \in \mathcal{G}_n$ into the complex Euclidean space: $x \mapsto u(x) = u_r(x) + I\, u_I(x)$.
Complex-valued linear kernel function in $\mathcal{G}_{1,0,0}$ (the center of this geometric algebra, spanned by the scalar part and $I = \sigma_1\sigma_2$, is isomorphic with $\mathbb{C}$): here $(x_s)_m$, $(x_s)_n$, $(x_I)_m$, $(x_I)_n$ are vectors of the individual components of the complex numbers $(x)_m = (x_s)_m + I(x_I)_m \in \mathcal{G}_{1,0,0}$ and $(x)_n = (x_s)_n + I(x_I)_n \in \mathcal{G}_{1,0,0}$, respectively.
For the quaternion-valued Gabor kernel function, we use $i = \sigma_2\sigma_3$, $j = -\sigma_3\sigma_1$, $k = \sigma_1\sigma_2$; in the Gaussian window Gabor kernel function, the variables $w_0$ and $x_m - x_n$ stand for the frequency and space domains respectively. Unlike the Hartley transform or the 2D complex Fourier transform, this kernel function nicely separates the even and odd components of the involved signal.
In the same way, we use the kernel functions to replace the dot product of the input data in (10.36). In general, the output function of the nonlinear Clifford SVM reads

$$y = \mathrm{csign}_m\big(f(x)\big) = \mathrm{csign}_m\big(w^{\dagger T} \Phi(x) + b\big). \qquad (10.43)$$
We can now redefine the entries of the vector in (10.27); rewriting the Clifford product as we did in (10.29)-(10.31), we obtain the primal problem for regression:

$$\min \tfrac{1}{2} a^T H a + C \cdot \sum_{i=1}^{l} (\xi + \tilde{\xi})$$
subject to
$$(w^{\dagger} x + b - y)_j \leq (\varepsilon + \xi)_j,$$
$$(y - w^{\dagger} x - b)_j \leq (\varepsilon + \tilde{\xi})_j,$$
$$\xi_{ij} \geq 0, \; \tilde{\xi}_{ij} \geq 0 \text{ for all } i, j.$$

Thereafter, we write straightforwardly the dual of this problem for solving the regression task:
$$\max \; -\tilde{\alpha}^T (\varepsilon - y) - \alpha^T (\varepsilon + y) - \tfrac{1}{2} a^T H a$$
subject to
$$\sum_{j=1}^{l} (\alpha_{s,j} - \tilde{\alpha}_{s,j}) = 0, \quad \sum_{j=1}^{l} (\alpha_{\sigma_1,j} - \tilde{\alpha}_{\sigma_1,j}) = 0, \; \dots, \; \sum_{j=1}^{l} (\alpha_{I,j} - \tilde{\alpha}_{I,j}) = 0,$$
$$0 \leq (\alpha_s)_j \leq C, \; 0 \leq (\alpha_{\sigma_1})_j \leq C, \dots, 0 \leq (\alpha_{\sigma_1\sigma_2})_j \leq C, \dots, 0 \leq (\alpha_I)_j \leq C,$$
$$0 \leq (\tilde{\alpha}_s)_j \leq C, \; 0 \leq (\tilde{\alpha}_{\sigma_1})_j \leq C, \dots, 0 \leq (\tilde{\alpha}_{\sigma_1\sigma_2})_j \leq C, \dots, 0 \leq (\tilde{\alpha}_I)_j \leq C, \quad j = 1, \dots, l. \qquad (10.46)$$
LSTM-CSVM is an Evolino- and Evoke-based system [18, 19]: the underlying idea of these systems is that two cascaded modules are needed: a robust module to process short- and long-time dependencies (LSTM), and an optimization module to produce precise outputs (CSVM, the Moore-Penrose pseudo-inverse method, and SVM respectively). The LSTM module addresses the disadvantage of having relevant pieces of information outside the history window and also avoids the “vanishing error” problem exhibited by algorithms like Back-Propagation Through Time (BPTT, e.g., Williams and Zipser 1992) or Real-Time Recurrent Learning (RTRL, e.g., Robinson and Fallside 1987)1. Meanwhile, the CSVM maps the internal activations of the first module to a set of precise outputs; again, the multivector output representation is exploited to implement a system with fewer processing units and therefore less computational complexity.
LSTM-CSVM works as follows: a sequence of input vectors (u(0) ... u(t)) is given to the LSTM, which in turn feeds the CSVM with the outputs of each of its memory cells; see Fig. 10.1. The CSVM aims at finding the expected nonlinear mapping of the training data. In the input and output equations of Figure 10.1, $\phi(t) = [\psi_1, \psi_2, \dots, \psi_n]^T \in \mathbb{R}^n$ is the activation at time t of the n units of the LSTM; this serves as input to the CSVM, given the input vectors (u(0) ... u(t)) and the weight matrix W. Since the LSTM is a recurrent net, the argument of the function f(·) represents the history of the input vectors.
1 The reader can get more information about BPTT and RTRL-vanishing error versus
LSTM-constant error flow in [20].
First, the LSTM-CSVM system was trained using the conventional algorithm for the LSTM. Although the system learns, it unfortunately takes too long to find a suitable matrix W. Instead, propagating the training data through the LSTM-CSVM system, we evolved the rows of the matrix using the evolutionary algorithm known as Enforced Sub-Populations (ESP) [21]. This approach differs from the standard methods because, instead of evolving the complete set of net parameters, it evolves subpopulations of the LSTM memory cells. For the mutation of the chromosomes, ESP uses a Cauchy density function.
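A brief sketch of such a Cauchy mutation (an assumed form; parameter names are illustrative):

```python
import numpy as np

def cauchy_mutate(chromosome, alpha=1e-4, rng=None):
    """Perturb each weight with heavy-tailed Cauchy noise of scale alpha;
    the heavy tails allow occasional large jumps out of local optima."""
    rng = rng or np.random.default_rng()
    return chromosome + alpha * rng.standard_cauchy(size=chromosome.shape)

w = np.zeros(8)                       # e.g. one memory cell's input weights
print(cauchy_mutate(w, alpha=1e-4))
```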
10.7 Applications
In this section we present five interesting experiments. The first one shows multi-class classification using the CSVM on a simulated example. Here, we also report the number of variables computed per approach and a time comparison between the CSVM and three approaches for multi-class classification using real-valued SVMs. The second concerns object multi-class classification with two types of training data: Phase a) artificial data, and Phase b) real data obtained from a stereo vision system. We also compared the CSVM against MLPs (for multi-class classification). The third experiment presents a multi-class interpolation. The fourth and fifth include the experimental analysis of the recurrent CSVM.
To depict these vectors, they were normalized by 10. In Fig. 10.2 one can see that the problem is highly nonlinear and not linearly separable. The CSVM uses for training 50 input quaternions from each of the five functions; since these have three coordinates, we simply use2 the bivector components of a quaternion for the coordinates, as in (10.48).
2 The dimension of this geometric algebra is 2^2 = 4.
Fig. 10.2 3D spiral with five classes. The marks represent the support multivectors found by the CSVM
$$X_i = \delta_i + x_i \sigma_2\sigma_3 + y_i \sigma_3\sigma_1 + z_i \sigma_1\sigma_2 \equiv [\delta_i, (x_i, y_i, z_i)]^T \qquad (10.48)$$
For each object we trained the CSVM using a set of several feature quaternions obtained from different level curves; that means that each object is represented by several feature quaternions and not only one. Because of this way of training the CSVM, the order in which the feature quaternions are shown to the CSVM is important: we begin by sampling data from the bottom to the top of the objects, and we show the training and test data in this order to the CSVM. We processed the outputs using a counter that computes which class fires the most for each training or test set
Fig. 10.3 Geometric characteristics of one training object: the magnitude δi and the 3D coordinates (xi, yi, zi) used to build the feature vector [δi, (xi, yi, zi)]
in order to decide which class the object belongs to; see Fig. 10.4. Note carefully that this experiment is in any case a challenge for any recognition algorithm, because the feature signature is sparse. We will show later that, using this kind of feature vector, the CSVM's performance is superior to the MLP's. Of course, if more time is spent improving the quality of the feature signature, the CSVM's performance will increase accordingly.
Fig. 10.4 After we get the outputs, they are accumulated using a counter (inputs → CSVM → outputs → counter → winner class) to calculate which class the object belongs to
It is important to say that all the objects (synthetic and real) were preprocessed in order to have a common center and the same scale, so our learning process can be seen as centering- and scale-invariant.
The 3D coordinates were assigned to the σ2σ3, σ3σ1, σ1σ2 basis of the feature quaternion, and the magnitude was packed in the scalar part of the quaternion. Figure 10.6 shows the 3D points sampled from the objects. We compared the performance of the following approaches: the CSVM, a 4-7-6 MLP, and the real-valued SVM approaches one-against-one, one-against-all and DAGSVM. The results in Tables 10.4 and 10.5 show that the CSVM has better generalization and fewer training errors than the MLP approach and the real-valued SVM approaches. Note that all methods were sped up using the acceleration techniques of [28, 29]. The authors think that the MLP presents more training and generalization errors because the way we represent the objects (as feature quaternion sets) makes the MLP get stuck in local minima very often during the learning phase, whereas the CSVM is guaranteed to find the optimal solution to the classification problem because it solves a convex quadratic problem with a global minimum. With respect to the real-valued SVM based approaches, the CSVM takes advantage of the Clifford product, which enhances the discriminatory power of the classifier itself, unlike the other approaches, which are based solely on inner products.
Table 10.4 Object-recognition performance in percent (%) during training using synthetic
data
Table 10.5 Object-recognition performance in percent (%) during test using synthetic data
Fig. 10.6 Sampling of the training synthetic object set
The sampling of a real object is illustrated in Fig. 10.7, and the whole training object set is shown in Fig. 10.8. We take the non-normalized 3D point for the bivector basis σ23, σ31, σ12 of the feature quaternion in (10.49).
Fig. 10.7 Left: sampling view of a real object; a big white cross is used for the depiction. Right: stereo vision system in the experiment environment
After the training, we tested with a set of feature quaternions that the machine did not see during its training, and we used the 'winner takes all' approach to decide which class the object belongs to. The results of the training and test are shown in Table 10.6. We trained the CSVM with an equal number of training data for each object, that is, 90 feature quaternions per object, but we tested with different numbers of data per object. Note that we have two pairs of objects that are very similar to each other; the first pair is composed of the half sphere shown in Fig. 10.8.c) and the rock in Fig. 10.8.d). In spite of these similarities, we got very good accuracy percentages in
Fig. 10.8 Training real object set, stereo pair images. We include only the frontal views
the test phase for both objects: 65.9% for the half sphere and 84% for the rock. We think we got better results for the rock because this object has a lot of texture, which produces many corners that in turn capture the irregularities better; therefore we have more test feature quaternions for the rock than for the half sphere (75 against 44, respectively). The second pair of similar objects is shown in Fig. 10.8.e) and Fig. 10.8.f); these are two identical plastic juice bottles, but one of them (Fig. 10.8.f)) is burned. That makes the difference between them and gives the CSVM enough distinguishing features to form two object classes, as shown in Table 10.6: we got 60% of correctly classified test samples for the bottle in Fig. 10.8.e) against 61% for the burned bottle in Fig. 10.8.f). The lower recognition rates for the last objects (Fig. 10.8 c), e) and f)) arise because the CSVM confuses the classes somewhat, due to the fact that the feature vectors are not large and rich enough.
Fig. 10.9 a) and b) Continuous curves of training output data for axes x and y (50 points).
c) 2D result by testing with 400 input data. d) Experiment environment
Fig. 10.10 a), b), c) Image sequence while the robot is drawing. d) The robot's drawing; result from testing with 400 input data
Fig. 10.11 a) Venice Lagoon time series used for training. b) Recall data: thick line (in red), real data; thin line, values predicted by the LSTM-CSVM
3 A. Tomasin, CNR-ISDMG Universita Ca’Foscari, Venice.
In the next test, we employed the Mackey-Glass time series, which is commonly used for testing the generalization and prediction ability of an algorithm. The series is generated by the following differential equation:

$$\dot{y}(t) = \frac{\alpha\, y(t - \tau)}{1 + y(t - \tau)^{\beta}} - \gamma\, y(t), \qquad (10.49)$$
where the parameters are usually set to α = 0.2, β = 10 and γ = 0.1. This equation is chaotic when the delay is τ > 16.8. We select as delay the most commonly utilized value of τ = 17. The task is to predict the series value y[t + P] after the delay by using the previous points y[t], y[t − 6], y[t − 12], y[t − 18]. For P = 50 sampled values, the algorithm is expected to learn the four-dimensional function y(t) = f(y[t − 50], y[t − 50 − 6], y[t − 50 − 12], y[t − 50 − 18]).
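A compact sketch for generating the series by simple Euler integration (the step size and initial history are our assumptions, not given in the chapter):

```python
import numpy as np

def mackey_glass(n_samples, tau=17, alpha=0.2, beta=10.0, gamma=0.1,
                 dt=1.0, y0=1.2):
    """Euler discretization of
    dy/dt = alpha * y(t - tau) / (1 + y(t - tau)**beta) - gamma * y(t)."""
    history = int(tau / dt)
    y = np.full(n_samples + history, y0)
    for t in range(history, n_samples + history - 1):
        y_tau = y[t - history]
        y[t + 1] = y[t] + dt * (alpha * y_tau / (1.0 + y_tau**beta)
                                - gamma * y[t])
    return y[history:]

series = mackey_glass(3000)   # e.g. the 3000 training values used below
print(series[:5])
```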
The LSTM-CSVM was trained with the first 3000 values of the series using P = 100. The LSTM module with 4 memory cells was evolved with a Cauchy parameter α = 10^-5 over 150 generations. The echo state approach was trained with 1000 neurons and reached a mean square error of 10^-4, while the Evolino system achieved an error of 1.9 × 10^-3 with 30 cells evolved over 200 generations [19, 30]. It has been reported [31] that, using an LSTM alone, the minimum error achieved was 0.1184 with the same number of 4 neurons as in our LSTM-CSVM.
Table 10.7 shows a summary of the comparison results. Here we note that the LSTM alone performs more poorly than the LSTM-CSVM, showing that the CSVM clearly improves the prediction precision. Note that for this complex time series, as opposed to the other two approaches (echo state approach, Evolino), the LSTM-CSVM uses a lower number of neurons and requires fewer generations during training for an acceptable error of 0.011.
red; this facilitated the segmentation of the blocks by the stereoscopic cameras. The stereoscopic system took images from an angle of 45 degrees; for that, we needed to calibrate the cameras and correct the position of the camera views so that they were oriented from the top, perpendicular to the labyrinth. With this information we had a complete 3D view from above. Using a color segmentation algorithm, we obtained the vector coordinates of the block corners. These observation vectors were then fed to the LSTM-CSVM.
The architecture of the LSTM-CSVM with reinforcement learning and the training were the same as in the previous application. The differences from the simulated experiment were: i) the 3D vectors of the block corners were obtained by the stereoscopic camera (the blocks build a 2D labyrinth), ii) the robot actions were hand movements through the 2D labyrinth, and iii) the length of this real labyrinth was smaller than the previous simulated one. We had 4 different labyrinths, each 10 blocks long.
The evolution of the system consisted of 50 generations using a Cauchy noise parameter of α = 10^-4. The CSVM module is fed with a vector of the outputs of the last 4 memory cells of the LSTM. The 4 outputs of the CSVM represent the 4 different actions to be carried out during navigation through the labyrinth. After each generation, the best net was kept, and the task was considered fulfilled by a perfect reward of 4.0. The four possible actions of the system are robot hand movements of 10 cm length towards left, right, back and forth. The initial position of the robot arm was located at the entry of a labyrinth. In all the labyrinths we exploited the internal state (support state), i.e. the coordinates of the exit, which was the same in all cases.
Fig. 10.12 a) Training labyrinths 1 and 2; recall labyrinths 3 and 4. (Third column) The robot hand is at the entry of the labyrinth holding a plastic object. (Fourth column) Position of the hand, marked with a cross
Figure 10.12 shows the four labyrinths. The images in the first and third columns were obtained by the stereoscopic system. The images in the second and fourth columns were obtained after perspective correction and color segmentation. Labyrinths 1 and 2 were used for training, whereas 3 and 4 were used for recall. The third and fourth columns in Figure 10.12 show the agent at the beginning of the labyrinths. In each labyrinth, only one trajectory was successful (reward 4.0). The training and the test were done offline; the robot then had to follow the action vectors computed by the LSTM-CSVM system.
10.8 Conclusions
This chapter generalizes the real-valued SVM to the Clifford-valued SVM, which is used for classification, regression and recurrence. The CSVM accepts multiple multivector inputs and multivector outputs, like a MIMO architecture, which allows us to build multi-class applications. We can use the CSVM over complex, quaternion or hypercomplex numbers according to our needs. The applications section shows experiments in pattern recognition and visually guided robotics which illustrate the power of the algorithms and help the reader to understand the Clifford SVM and to use it in various tasks of complex and quaternion signal and image processing, pattern recognition, and computer vision using high-dimensional geometric primitives. This generalization appears promising, particularly in geometric computing and its applications, like graphics, augmented reality and robot vision.
References
1. Vapnik, V.: Statistical Learning Theory. Wiley, New York (1998)
2. Burges, C.J.C.: A tutorial on Support Vector Machines for Pattern Recognition. In:
Knowledge Discovery and Data Mining, vol. 2, pp. 1–43. Kluwer Academic Publish-
ers, Dordrecht (1998)
3. Müller, K.-R., Mika, S., Rätsch, G., Tsuda, K., Schölkopf, B.: An Introduction to Kernel-
Based Learning Algorithms. IEEE Trans. on Neural Networks 12(2), 181–202 (2001)
4. Cristianini, N., Shawe-Taylor, J.: Support Vector Machines and other kernel-based learn-
ing methods. Cambridge University Press, Cambridge (2000)
5. Hestenes, D., Li, H., Rockwood, A.: New algebraic tools for classical geometry. In: Som-
mer, G. (ed.) Geometric Computing with Clifford Algebras. Springer, Heidelberg (2001)
6. Lee, Y., Lin, Y., Wahba, G.: Multicategory Support Vector Machines, Technical Report
No. 1043, University of Wisconsin, Departament of Statistics, pp. 10–35 (2001)
7. Weston, J., Watkins, C.: Support vector machines for multi-class pattern recognition. In:
Proceedings of the 6th European Symposium on Artificial Neural Networks (ESANN),
pp. 185–201 (1999)
8. Bayro-Corrochano, E., Arana-Daniel, N., Vallejo-Gutierrez, R.: Geometric Preprocess-
ing, geometric feedforward neural networks and Clifford support vector machines for
visual learning. Journal Neurocomputing 67, 54–105 (2005)
9. Bayro-Corrochano, E., Arana-Daniel, N., Vallejo-Gutierrez, R.: Recurrent Clifford Sup-
port Machines. In: Proceedings IEEE World Congress on Computational Intelligence,
Hong-Kong (2008)
10. Mukherjee, S., Osuna, E., Girosi, F.: Nonlinear prediction of chaotic time series using a support vector machine. In: Principe, J., Gile, L., Morgan, N., Wilson, E. (eds.) Neural Networks for Signal Processing VII - Proceedings of the 1997 IEEE Workshop, New York, pp. 511–520 (1997)
11. Müller, K.-R., Smola, A.J., Rätsch, G., Schölkopf, B., Kohlmorgen, J., Vapnik, V.N.:
Predicting time series with support vector machines. In: Gerstner, W., Hasler, M., Ger-
mond, A., Nicoud, J.-D. (eds.) ICANN 1997. LNCS, vol. 1327, pp. 999–1004. Springer,
Heidelberg (1997)
12. Salomon, J., King, S., Osborne, M.: Framewise phone classification using support vector machines. In: Proc. Int. Conference on Spoken Language Processing, Denver (2002)
13. Altun, Y., Tsochantaris, I., Hofmann, T.: Hidden markov support vector machines. In:
Proc. Int. Conference on Machine Learning (2003)
14. Jaakkola, T.S., Haussler, D.: Exploiting generative models in discriminative classifiers.
In: Proc. of the Conference on Advances in Neural Information Systems II, Cambridge,
pp. 487–493 (1998)
15. Bengio, Y., Frasconi, P.: Diffusion of credit in Markovian models. In: Tesauro, G.,
Touretzky, D.S., Leen, T.K. (eds.) Advances in Neural Information Systems 14. MIT
Press, Cambridge (2002)
16. Hochreiter, S., Mozer, M.: A discrete probabilistic memory for discovering dependencies
in time. In: Int. Conference on Neural Networks, pp. 661–668 (2001)
17. Suykens, J.A.K., Vanderwalle, J.: Recurrent least squares support vector machines. IEEE
Transactions on Circuits and Systems-I 47, 1109–1114 (2000)
18. Schmidhuber, J., Gagliolo, M., Wierstra, D., Gomez, F.: Recurrent Support Vector Ma-
chines, Technical Report, no. IDSIA 19-05 (2005)
19. Schmidhuber, J., Wierstra, D., Gómez, F.J.: Hybrid neuroevolution optimal linear search
for sequence prediction. In: Kaufman, M. (ed.) Proceedings of the 19th International
Joint Conference on Artificial Intelligence, IJCAI, pp. 853–858 (2005)
20. Hochreiter, S., Schmidhuber, J.: Long Short-Term Memory, Technical Report FKI-207-
95 (1996)
21. Gómez, F.J., Miikkulainen, R.: Active guidance for a finless rocket using neuroevolution.
In: Proc. GECCO, pp. 2084–2095 (2003)
22. Hsu, C.W., Lin, C.J.: A comparison of methods for multi-class Support Vector Machines.
Technical report, National Taiwan University, Taiwan (2001)
23. Bottou, L., Cortes, C., Denker, J., Drucker, H., Guyon, I., Jackel, L.Y., Muller, U.,
Sackinger, E., Simard, P., Vapnik, V.: Comparison of classifier methods: a case study
in handwriting digit recognition. In: International Conference on Pattern Recognition,
pp. 77–87. IEEE Computer Society Press, Los Alamitos (1994)
24. Knerr, S., Personnaz, L., Dreyfus, G.: Single-layer learning revisited: a stepwise proce-
dure for building and training a neural network. In: Fogelman, J. (ed.) Neurocomputing:
Algorithms, Architectures and Applications. Springer, Heidelberg (1990)
25. Kreßel, U.: Pairwise classification and support vector machines. In: Schölkopf, B., Burges,
C.J.J., Smola, A.J. (eds.) Advances in Kernel Methods - Support Vector Learning, pp.
255–268. MIT Press, Cambridge (1999)
26. Platt, J.C., Cristianini, N., Shawe-Taylor, J.: Large margin DAGs for multiclass classifi-
cation. In: Advances in Neural Information Processing Systems, vol. 12, pp. 547–553.
MIT Press, Cambridge (2000)
27. Weston, J., Watkins, C.: Multi-class support vector machines. Technical Report CSD-
TR-98-04, Royal Holloway, University of London, Egham (1998)
28. Hsu, C.W., Lin, C.J.: A simple decomposition method for Support Vector Machines.
Machine Learning 46, 291–314 (2002)
29. Joachims, T.: Making large-scale SVM learning practical. In: Schölkopf, B., Burges, C.J.C., Smola, A.J. (eds.) Advances in Kernel Methods - Support Vector Learning. MIT Press, Cambridge (1998)
30. Jaeger, H.: Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science 304, 78–80 (2004)
31. Gers, F.A., Eck, D., Schmidhuber, J.: Applying LSTM to time series predictable through
time-window approaches. In: Dorffner, G., Bischof, H., Hornik, K. (eds.) ICANN 2001.
LNCS, vol. 2130, pp. 669–685. Springer, Heidelberg (2001)
Chapter 11
A Classification Method Based on Principal
Component Analysis and Differential Evolution
Algorithm Applied for Prediction Diagnosis
from Clinical EMR Heart Data Sets
Abstract. In this article we study the usage of a classification method based on first preprocessing the data using principal component analysis and then using the compressed data in the actual classification process, which is based on the differential evolution algorithm, an evolutionary optimization algorithm. This method is applied here for prediction diagnosis from clinical data sets with a chief complaint of chest pain, using classical Electronic Medical Record (EMR) heart data sets. For experimentation we used a set of five frequently applied benchmark data sets, including the Cleveland, Hungarian, Long Beach, Switzerland and Statlog data sets. These data sets contain demographic properties, clinical symptoms, clinical findings, laboratory test results, specific electrocardiography (ECG) results, results pertaining to angina and coronary infarction, etc.; in other words, classical EMR data pertaining to the evaluation of a chest pain patient and ruling out angina and/or Coronary Artery Disease (CAD). The prediction diagnosis results with the proposed classification approach were found promisingly accurate. For example, the Switzerland data set was classified with 94.5% ± 0.4% accuracy. Combining all these data sets resulted in a classification accuracy of 82.0% ± 0.5%. We compared the results of the proposed method with the corresponding results of other methods reported in the literature that have demonstrated relatively high classification performance in solving this problem. Depending on the case, the results of the proposed method were on a par with the best compared methods, or outperformed their
Pasi Luukka
Laboratory of Applied Mathematics, Lappeenranta University of Technology,
P.O. Box 20, FIN-53851 Lappeenranta, Finland
e-mail: [email protected]
Jouni Lampinen
Department of Computer Science, University of Vaasa, P.O. Box 700,
FI-65101 Vaasa, Finland
e-mail: [email protected]
classification accuracy clearly. In general, the results suggest that the proposed method has potential in this task.
11.1 Introduction
Many data sets that come from the real world are inevitably coupled with noise. Noise can be defined as random error or variance in a measured variable [13]. Data analysis is almost always burdened with uncertainty of different kinds, and there are several different techniques to deal with noisy data [7].
A major problem in mining scientific data sets is that the data is often high-dimensional: in many cases a large number of features represents the object. One problem is that the computational time of pattern recognition algorithms can become prohibitive when the number of dimensions grows high. This can be a severe problem, especially as some of the features are not discriminatory. Besides the computational cost, irrelevant features may also cause a reduction in the accuracy of some algorithms.
To address this problem of high dimensionality, a common approach is to identify the most important features associated with an object, so that further processing can be simplified without compromising the quality of the final results. There are several different ways in which the dimension of a problem can be reduced. The simplest approach is to identify important attributes based on input from domain experts. Another commonly used approach is Principal Component Analysis (PCA) [19], which defines new attributes (principal components, or PCs) as mutually orthogonal linear combinations of the original attributes. For many data sets it is sufficient to consider only the first few PCs, thus reducing the dimension. However, for some data sets, PCA does not provide a satisfactory representation: mutually orthogonal linear combinations are not always the best way to define new attributes, and, e.g., nonlinear combinations sometimes need to be considered. The problem of dealing with high-dimensional data is both difficult and subtle, and the information loss caused by these methods is also sometimes a problem.
One of the latest methods in evolutionary computation is the differential evolution (DE) algorithm [30]. In this paper we examine the applicability, to the diagnosis of heart disease, of a classification method in which the data is first preprocessed with PCA and the resulting data is then classified with a DE classifier. In the literature, there are several papers where evolutionary computation research has concerned the theory and practice of classifier systems [4], [16], [17], [18], [31], [35], [10]. The differential evolution algorithm has been studied in unsupervised learning problems, which can in a sense be recast as classification problems, in [26], [11]. DE was also used in combination with artificial neural networks in [1] for the diagnosis of breast cancer. It has also been used to tune classifier parameter values in [12] and, in the similarity classifier [23], to tune the parameters of similarity measures.
Here we propose a method which first preprocesses the data using PCA and then classifies the processed data using a differential evolution based classification method. The differential evolution algorithm is applied for finding an optimal class vector to represent each class; a sample is then classified by comparing it with the class vectors. In addition, DE is also applied for determining the value of a distance parameter that we use in making the final classification decision.
The advantages of this procedure are that we are able to reduce the dimensionality, and hence the computational cost, which would otherwise be intolerably high, especially for high-dimensional data sets. Another advantage is that we are able to filter out noise, which enhances the creation of the class vectors in the classifier. The class vectors are optimized using the DE algorithm. Using this procedure we also find the optimal reduced dimension for these data sets. The combination of finding the best reduced dimension, filtering out noise from the data, and optimizing the class vectors and the required parameters for the problem at hand yields a more accurate solution to the problem.
The data sets for the empirical evaluation of the proposed approach were taken from the UCI Machine Learning Repository [25]. The classifier and preprocessing methods were implemented with MATLAB(TM) software.
From the optimization and modelling point of view, the classification problem subject to our investigations can be divided into two parts: the classification model, and the optimization approach applied for fitting (or learning) the model. Generally, a multipurpose classifier can be viewed as a scalable and learnable model that can be fitted to a particular dataset by scaling it to the data dimensionality and by optimizing a set of model parameters to maximize the classification accuracy. For the optimization, simply the classification accuracy over the learning set may serve as the objective function value to be maximized. Alternatively, the optimization problem can be formulated as a minimization task, as we did here, where the number of misclassified samples is to be minimized. In the literature, mostly linear or nonlinear local optimization approaches have been applied for solving the actual classifier model optimization problem, or approaches that can be viewed as such. This is the most common approach despite the fact that the underlying optimization problem is a global optimization problem. For example, the weight set of a feed-forward neural network classifier is typically optimized with a gradient-descent based local optimizer, or alternatively by some other local optimizer like the Levenberg-Marquardt algorithm. This kind of usage of limited-capacity optimizers for fitting the classification model limits the achievable classification accuracy in two ways. First, the model must be limited so that local optimizers can be applied to fit it. This means that only very moderately multimodal classification models can be applied, and due to such a modelling limitation, the classification capability will be limited correspondingly. Secondly, if a local optimizer is applied to optimize (to fit or to learn) even a moderately multimodal classification model, it is likely to get trapped in a local optimum, a suboptimal solution. Thereby, the only way to get classifier models with a higher modelling capacity at one's disposal, and also to get full capacity out of the current multimodal classification models, is to apply global
optimization for fitting the classification models to the data to be classified. For example, in the case of a nonlinear feed-forward neural network classifier, the model is clearly multimodal, but it is practically always fitted by applying a local optimizer that is capable of providing only locally optimal solutions. Thus, we consider applying global optimization instead of local optimization to be an important fundamental issue, one that is currently severely constraining the further development of classifiers. The capabilities of the currently used local optimizers limit the selection of applicable classifier models, and the capabilities of the currently used models that include multimodal properties are limited by the capabilities of the optimizers applied to fit them to the data.
Based on the above considerations, our basic motivation for applying a global optimizer for learning the applied classifier model comes from the fact that typically local (nonlinear) optimizers have been applied for this purpose, despite the fact that the underlying optimization problem is actually a multimodal global optimization problem, and a local optimizer should be expected to become trapped in a locally suboptimal solution. The advantage of our proposed method is that, since DE does not get trapped in a local minimum, we can expect it to find better solutions than those found in the nearest local minimum.
Another motivation was that we would like to optimize also the parameter p of the Minkowsky distance metric (see Section 3). In practice, that means increased nonlinearity and increased multimodality of the classification model, resulting in more locally optimal points in the search space, where a local optimizer would be even more likely to get trapped. Practically, optimizing p successfully requires the usage of an effective global optimizer, since local optimizers are unlikely to provide even an acceptably good suboptimal solution anymore. In contrast, by using a global optimizer, optimization of p becomes possible. Two advantages were expected from this. First, by optimizing the value of p systematically, instead of selecting it a priori by trial and error as earlier, a higher classification accuracy may be reached. Secondly, the selection of the value of p can be done automatically this way, and laborious trial-and-error experimentation by the user is not needed at all. Furthermore, the potential for further developments is increased. The local optimization approaches severely limit the selection of classifier models to be used, and the problem formulations for the classifier model optimization task become limited, too. Simply put, local optimizers are limited to fitting, or learning, only classifier models where trapping in a locally suboptimal solution is not a major problem, while global optimizers do not have such fundamental limitations. For example, the range of possible class membership functions can be extended to those requiring global optimization (due to increased nonlinearity and multimodality), which cannot be handled anymore by simple local optimizers, even nonlinear ones. In addition, we would like to remark that we have not yet fully utilized the further development capabilities provided by our global optimization approach. For example, even more difficult optimization problem settings are now within reach, and differential evolution has good capabilities for multi-objective and multi-constrained nonlinear optimization, which provides further possibilities for our future developments.
No. Attribute
1. Age
2. Sex
3. Chest pain type (4 values)
4. Resting blood pressure
5. Serum cholesterol in mg/dl
6. Fasting blood sugar > 120 mg/dl
7. Resting electrocardiographic results (values 0, 1, 2)
8. Maximum heart rate achieved
9. Exercise induced angina
10. Oldpeak = ST depression induced by exercise relative to rest
11. The slope of the peak exercise ST segment
12. Number of major vessels (0-3) colored by fluoroscopy
13. Thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
$P_L x = (u'x)\,u \qquad (11.2)$
where the prime denotes transposition. The variance of the data in the direction of $L$ is therefore
$\frac{1}{n}\sum_{p=1}^{n}(u'x_p)^2 = \frac{1}{n}\sum_{p=1}^{n} u'x_p\, x_p' u = u'\Big(\frac{1}{n}\sum_{p=1}^{n} x_p x_p'\Big)u = u'Su \qquad (11.3)$
where $S$ is the sample covariance matrix of the data. PCA thus looks for the vector
$u^*$ which maximizes $u'Su$ under the constraint $\|u\| = 1$. It is easy to show that
the solution is the normalized eigenvector $u_1$ of $S$ associated with its largest
eigenvalue $\lambda_1$, and
$u_1' S u_1 = \lambda_1 u_1' u_1 = \lambda_1 \qquad (11.4)$
This is then extended to find the $k$-dimensional subspace $L$ on which the projected
points $P_L x_p$ have maximal variance. The lines spanned by the eigenvectors $u_j$ are
called the principal axes of the data, and the new features $y_j = u_j' x$, defined by
the coordinates of $x$ along the principal axes, are called principal components. The
vector $y_p$ of principal components for each initial pattern vector $x_p$ is easily
computed in matrix form as $y_p = U_r' x_p$, where $U_r = [u_1, \ldots, u_r]$ is the
matrix having the $r$ normalized eigenvectors of $S$ as its columns.
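As a concrete illustration of this projection (a minimal sketch of our own, not code from the original study), the principal axes can be obtained from the eigen-decomposition of the sample covariance matrix and the data projected onto the first r of them:

```python
import numpy as np

def pca_project(X, r):
    """Project the rows of X (n patterns x d features) onto the first
    r principal axes, following Eqs. (11.2)-(11.4)."""
    Xc = X - X.mean(axis=0)               # center the data
    S = np.cov(Xc, rowvar=False)          # sample covariance matrix S
    eigvals, eigvecs = np.linalg.eigh(S)  # eigh, since S is symmetric
    order = np.argsort(eigvals)[::-1]     # sort by decreasing eigenvalue
    U_r = eigvecs[:, order[:r]]           # columns u_1, ..., u_r
    return Xc @ U_r                       # y_p = U_r' x_p for every pattern
```

Note that np.cov uses the unbiased 1/(n-1) normalization rather than the 1/n of Eq. (11.3); the principal axes themselves are unaffected.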
PCA can be used in classification problems to display data in the form of infor-
mative plots. The score values have the same properties as the weighted averages,
i.e., they are not sensitive to random noise but show the processes that affect several
variables simultaneously in a systematic way. This makes them suitable for detect-
ing multivariate trends, such as the clustering of objects or variables in multivariate
data sets. PCA can be seen as a data compression method which can be used to (1)
display multivariate data sets, (2) filter noise and (3) study and interpret multivariate
processes. One clear limitation of PCA is that it can only handle linear relations
between variables [9]. We acknowledge that this linear mapping may not be the best
choice for our approach, but in our procedure it appears to work well.
We used the Minkowsky metric because it is more general than the Euclidean metric,
which is included as the special case p = 2. We also found that when the value of p
was optimized using DE, the optimum was not even near p = 2, the Euclidean case.
After computing the distances between the samples and the class vectors, we make the
classification decision according to the shortest distance. The Minkowsky distance is
$d(x, y) = \Big(\sum_{i=1}^{n} |x_i - y_i|^p\Big)^{1/p} \qquad (11.6)$
for $x, y \in \mathbb{R}^n$. We decide that $x \in C_m$ if $d(x, y_m) \le d(x, y_i)$ for all $i = 1, \ldots, N$.
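A minimal sketch of this decision rule (our own illustration; variable names are not from the chapter):

```python
import numpy as np

def minkowsky(x, y, p):
    """Minkowsky distance of Eq. (11.6); p = 2 gives the Euclidean metric."""
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def classify(x, class_vectors, p):
    """Assign x to the class whose class vector lies nearest."""
    distances = [minkowsky(x, y_i, p) for y_i in class_vectors]
    return int(np.argmin(distances))
```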
Before doing the actual classification, all the parameters of the classifier must be
decided. These parameters are
1. The class vectors yi = (yi (1), . . . , yi (T )) for each class i = 1, . . . , N
2. The power value p in (11.6).
In this study we used the differential evolution algorithm [30] to optimize both the
class vectors and the value of p. For this purpose we split the data into a learning
set, learn, and a testing set, test, so that half of the data was used for learning
and half for testing. The data in the learning set was used to find the optimal class
vectors $y_i$, and the data in the testing set was used to assess the classification
performance of the proposed classifier. A brief description of the differential
evolution algorithm is presented in the following section. The number of parameters
that the differential evolution algorithm needs to optimize here is
classes × dimension + 1, where the additional parameter is the power p of the
Minkowsky distance. As the results will later show, PCA can be used to lower the
data's dimensionality, and with low dimensions we can still find results which are
clearly better than those found using the plain DE classifier. If we are not satisfied
with simply lowering the data's dimensionality and the enhancement achieved this way,
but want to find the best reduced dimension, we have to repeat the optimization for
every dimension below the maximum, giving
$\sum_{i=1}^{dimension} \big(classes \cdot (dimension - i) + 1\big)$
parameters to be optimized in total.
In short, the procedure of our algorithm is as follows (a code sketch is given after the list):
1. Divide data into learning set and testing set
2. Create initial class vectors for each class (here we used simply random numbers)
3. Compute distance between samples in the learning set and class vectors
4. Classify samples according to their minimum distance
5. Compute classification accuracy (no. of correctly classified samples/total number
of samples in learning set)
6. Compute the objective function value to be minimized as cost = 1 − accuracy
7. Create new class vectors for each class for the next population using the selection,
mutation and crossover operations of the differential evolution algorithm, and go to
step 3 until a stopping criterion is met (e.g. the maximum number of iterations
is reached)
8. Classify data in testing set according to the minimum distance between class
vectors and samples.
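The procedure can be prototyped with an off-the-shelf DE implementation. The sketch below uses scipy.optimize.differential_evolution purely for illustration (the chapter's own DE settings are given later in the text); the toy data, the bounds and the range of p are our assumptions:

```python
import numpy as np
from scipy.optimize import differential_evolution

def minkowsky(x, y, p):
    return np.sum(np.abs(x - y) ** p) ** (1.0 / p)

def make_cost(X_learn, labels, n_classes, dim):
    """Objective of step 6: cost = 1 - learning-set accuracy."""
    def cost(v):
        Y = v[:n_classes * dim].reshape(n_classes, dim)   # class vectors
        p = v[-1]                                         # Minkowsky power
        pred = [int(np.argmin([minkowsky(x, y, p) for y in Y]))
                for x in X_learn]
        return 1.0 - np.mean(np.array(pred) == labels)
    return cost

rng = np.random.default_rng(0)                 # toy learning set
n_classes, dim = 2, 4
X_learn = rng.random((40, dim))
labels = rng.integers(0, n_classes, size=40)
bounds = [(0.0, 1.0)] * (n_classes * dim) + [(1.0, 100.0)]  # last entry: p
result = differential_evolution(make_cost(X_learn, labels, n_classes, dim),
                                bounds, mutation=0.5, recombination=0.9,
                                maxiter=50, seed=1)
```

The optimized class vectors and p are unpacked from result.x, after which the testing set is classified as in step 8.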
In this DE version, NP must be at least four, and it remains fixed, along with CR and
F, during the whole execution of the algorithm. The parameter CR ∈ [0, 1], which
controls the crossover operation, represents the probability that an element of the
trial vector is chosen from a linear combination of three randomly chosen vectors
rather than from the old vector $v_{i,G}$. The condition "j = j_rand" ensures that at
least one element differs from the elements of the old vector. The parameter F is a
scaling factor for mutation, and its value is typically in (0, 1+]^1. In practice, CR
controls the rotational invariance of the search: a small value (e.g., 0.1) is
practicable with separable problems, while larger values (e.g., 0.9) suit
non-separable problems. The control parameter F controls the speed and robustness of
the search, i.e., a lower value of F increases the convergence rate but also adds the
risk of getting stuck in a local optimum. The parameters CR and NP have the same kind
of effect on the convergence rate as F has.
1 The notation means that the practical upper limit is about 1 but is not strictly defined.
After the mutation and crossover operations, the trial vector ui,G is compared to
the old vector vi,G . If the trial vector has an equal or better objective value, then it
replaces the old vector in the next generation. This can be presented as follows (in
this paper minimization of objectives is assumed) [29]:
$v_{i,G+1} = \begin{cases} u_{i,G} & \text{if } f(u_{i,G}) \le f(v_{i,G}) \\ v_{i,G} & \text{otherwise} \end{cases}$
DE is an elitist method since the best population member is always preserved and
the average objective value of the population will never get worse.
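For readers who prefer the operators spelled out, one generation of the DE/rand/1/bin scheme described above can be sketched as follows (our own minimal implementation, not the authors' code):

```python
import numpy as np

def de_generation(pop, f_obj, F=0.5, CR=0.9, rng=np.random.default_rng(0)):
    """One DE generation: mutation, binomial crossover, greedy selection."""
    NP, D = pop.shape                       # NP must be at least four
    new_pop = pop.copy()
    for i in range(NP):
        # three mutually distinct vectors, all different from i
        r1, r2, r3 = rng.choice([j for j in range(NP) if j != i],
                                size=3, replace=False)
        mutant = pop[r1] + F * (pop[r2] - pop[r3])
        j_rand = rng.integers(D)            # guarantees one changed element
        cross = rng.random(D) < CR
        cross[j_rand] = True
        trial = np.where(cross, mutant, pop[i])
        # elitist selection: keep the trial only if it is not worse
        if f_obj(trial) <= f_obj(pop[i]):
            new_pop[i] = trial
    return new_pop
```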
As the objective function, f , to be minimized we applied the number of incor-
rectly classified learning set samples. Each population member, vi,G , as well as each
new trial solution, ui,G , contains the class vectors for all classes and the power value
p. In other words, DE is seeking the vector (y(1), ..., y(T), p) that minimizes the
objective function f. After the optimization process, the final solution defining the
optimized classifier is the best member of the last generation's (Gmax) population,
the individual vi,Gmax, i.e. the one providing the lowest objective function value and
therefore the best classification performance for the learning set.
The control parameters of the DE algorithm were set here as follows: CR = 0.9 and
F = 0.5 were applied to all classification problems. NP was chosen to be six times
the number of optimized parameters.
However, these selections were mainly based on general recommendations and practical
experience with the usage of DE; no systematic investigation was performed to find
optimal control parameter values, so further classification performance improvements
through better control parameter settings remain possible.
Table 11.3 Classification results for the heart data sets. Comparison of classification
results with the original data and with data preprocessed by dimension reduction. Best
mean accuracy is in boldface
Data Best result (in %) Mean result (in %) Variance dim Minkowsky p
Heart-Cleveland 89.44 % 82.86 % 7.71 13 19.3
Heart-Cleveland(PCA) 91.55 % 86.48 % 2.82 12 82.8
Heart-Hungarian 88.44 % 83.42 % 5.95 13 88.1
Heart-Hungarian(PCA) 93.20 % 87.48 % 3.34 11 96.7
Heart-Switzerland 95.16 % 94.35 % 0.67 13 70.8
Heart-Switzerland(PCA) 95.16 % 94.46 % 0.66 5 82.1
Heart-Long Beach 80.20 % 78.32 % 1.31 13 54.4
Heart-Long Beach(PCA) 85.15 % 79.93 % 2.70 12 67.9
Heart-All 78.22 % 76.98 % 0.94 13 1.8
Heart-All(PCA) 84.22 % 82.01 % 1.05 13 49.2
Heart-statlog 88.89 % 83.21 % 10.80 13 81.1
Heart-statlog(PCA) 91.86 % 87.63 % 4.01 13 90.9
Cleveland data set: From Table 11.3 we can observe that the best mean classification
accuracy for the Cleveland data set is 86.5%, and when the 99% confidence interval is
computed for the results (using the Student's t distribution, $\mu \pm t_{1-\alpha/2}\, S_\mu/\sqrt{n}$),
we get a confidence interval of 86.5% ± 0.8%. This result was obtained when the data
was first preprocessed with PCA. Preprocessing by PCA enhanced the results by over
3%. The best mean accuracy was found with a target dimensionality of 12.
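The interval construction used throughout this section can be reproduced in a few lines (a sketch of ours, assuming the accuracies of the repeated runs are available):

```python
import numpy as np
from scipy import stats

def t_confidence_interval(accuracies, alpha=0.01):
    """Mean and half-width of the (1 - alpha) CI via Student's t."""
    a = np.asarray(accuracies, dtype=float)
    half = stats.t.ppf(1 - alpha / 2, df=a.size - 1) * stats.sem(a)
    return a.mean(), half          # report as mean +/- half
```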
The results achieved with the Cleveland data set are compared to other results in
Tables 11.4–11.6. In Table 11.4 the classification results obtained by our DE-based
approach are compared to the corresponding results reported in [32], where a method
called Classification by Feature Partitioning (CFP) was introduced; it is an inductive,
incremental and supervised learning method. There the data set was divided in two sets,
as here, but the training and testing set sizes were slightly different. When comparing
our results with the results of Sirin and Güvenir [32], we observed that the DE
classifier classified the Cleveland data set with a higher accuracy (83.4%) than the
IB classifiers and C4, but yielded a slightly lower accuracy than CFP (84.0%).
When the data was first preprocessed with PCA, the DE classifier reached a
classification accuracy of 87.5%. In Table 11.5 the classification results obtained by
the DE-based approach are compared to the classifier results reported in [21]. They
used a decision tree classifier and also preprocessed the data with the wavelet
transform. They likewise used a two-fold technique, but the division between training
and testing sets was 80–20. For their decision tree classifier an accuracy of 76% was
reported; in comparison, DE yielded an accuracy of 83%. Li et al. [21] managed to
enhance the results by first preprocessing the data with the wavelet transform,
gaining about 4 percentage units and a mean accuracy of 80%. We reached about 3
percentage units of enhancement using PCA, corresponding to an 86% classification
accuracy.
In Table 11.6 we have compared our results with the results reported in [3], where
ten-fold cross-validation was used instead of the two-fold division of our experiment.
As can be seen there, the smart crossover operator with multiple parents for a
Pittsburgh Learning Classifier seems to give around 10% better performance with this
data set. Generally, the results obtained here by the DE classifier with PCA
preprocessing appear rather promising.
Table 11.4 Comparison of the DE classifier's results to the results Sirin and Güvenir
reported in [32] for the Cleveland and Hungarian data sets
Table 11.5 Comparison of the DE classifier's results to the results of Li et al. [21]
for the Cleveland, Hungarian and Switzerland data sets
Table 11.6 Comparison of the DE classifier's results to the results of Bacardit and
Krasnogor [3] for the Cleveland, Hungarian and Statlog data sets
Hungarian data set: With the Hungarian data set the same situation was observed: the
best results were found when the data was first preprocessed with PCA. The best mean
accuracy with a 99% confidence interval was 87.5% ± 0.9%. Preprocessing with PCA
enhanced the results by over 3%. The best accuracy was found with a target
dimensionality of 11.
The results obtained with the Hungarian data set are compared to the results of the
other classifiers in Tables 11.4–11.7. When compared with the results reported by
Sirin and Güvenir [32] (Table 11.4), the DE classifier yielded a slightly higher mean
accuracy, 83.4%, than the second best, CFP, with an accuracy of 82.3%. When the
Hungarian data set was preprocessed with PCA, the accuracy of the DE classifier
increased to 87.5%, which can be considered a remarkably good result. In Table 11.5
our results are compared with the corresponding ones by Li et al. [21]. They reported
an accuracy of 76% with their decision tree classifier, while our DE classifier
reached 83% accuracy. Li et al. also preprocessed the
data, and their wavelet transform preprocessing gained about 4 percentage units of
accuracy (80%). We obtained a 4 percentage unit enhancement in accuracy when we
preprocessed the data using PCA and then performed the classification using the DE
classifier, reaching an accuracy of 87%. In Table 11.6 our results are compared with
the results of [3]. With their method an accuracy of 86% was reported for the
Hungarian data set. This accuracy outperforms plain DE, but when the data is first
preprocessed with PCA the DE classifier manages to get better results. In Table 11.7
our results are compared with the results of Detrano et al. [6]. They reported 77%
accuracy with CDF, while our corresponding result was 83% accuracy using the DE
classifier.
Table 11.7 Comparison of the DE classifier's results to the results Detrano et al. [6]
reported for the Hungarian, Long Beach and Switzerland data sets
Switzerland data set: With the Switzerland data set an even higher classification
accuracy was reached than in the previous two cases. The best mean accuracy with a
99% confidence interval was 94.5% ± 0.4%. The variances were also considerably lower
than with the previous two data sets. The results with the original data and the
PCA-preprocessed data were very close, and there were no statistically significant
differences between classifying the original data or the preprocessed data.
A comparison to the other results with the Switzerland data set is provided in Tables
11.5 and 11.7. Findings rather similar to those of [21] and [6] were made: in both
[21] and [6] the Switzerland data set was found possible to classify with a higher
accuracy than the other data sets. Furthermore, [21] noticed, as we did, that
preprocessing the Switzerland data set did not considerably enhance the results; they
were similar to those with the original data. Li et al. [21] reported an accuracy of
88% with a decision tree classifier, and Detrano et al. [6] an accuracy of 81% with
both CDF and CADENZA. We managed to classify the data with 94% accuracy using the
preprocessed data and the DE classifier; this 94% accuracy is the highest among these
results.
Long Beach data set: The Long Beach data set appeared to be the most difficult to
classify with the algorithm presented in this paper. The best mean accuracy with a
99% confidence interval was 79.9% ± 0.9%. This result was achieved when the data was
first preprocessed using the principal component analysis algorithm and then
classified with the DE classifier; the best results were achieved at dimension 12.
The Long Beach data set has the highest number of missing values among these four
data sets, and for this reason it is often left out of studies [21]. Here all missing
values were replaced by the dummy value −9, and the large number of −9 values is
probably the reason why the accuracies are somewhat lower with this data set than
with the others.
For the Long Beach data set the results are compared in Table 11.7. The results
without preprocessing are rather similar to those of the classifiers CDF and CADENZA:
they reported accuracies of 79% with CDF and 77% with CADENZA, while we obtained an
accuracy of 78% with the DE classifier. When the data was preprocessed with PCA and
then classified with DE, an accuracy of 80% was obtained for this data set.
Heart-All: In Heart-All, all four previous data sets are combined to obtain a larger
amount of data. When all four data sets were combined, the best mean accuracy with a
99% confidence interval was 82.0% ± 0.5%. This was achieved when the data was first
preprocessed with PCA, using 13 dimensions. Here the variance (1.05) was lower than
with the other data sets, with the exception of the Switzerland data set. Compared
with the results without PCA preprocessing, we managed to enhance the results by
about 5% using PCA.
In Table 11.8 the classification results are compared to the results reported by
Łeski in [22] and Pedreira in [27]. In all these results the Cleveland, Hungarian,
Switzerland and Long Beach data sets are combined. Łeski used an ε-margin nonlinear
classifier based on fuzzy if-then rules and divided the data in two folds as in our
procedure, but with a testing set much larger than the training set. Pedreira used
Kohonen's LVQ2.1 with training data selection and ten-fold cross-validation.
Table 11.8 Comparison of the DE classifier's results to the results reported by Łeski
in [22] and Pedreira in [27]. In all the results the Cleveland, Hungarian, Switzerland
and Long Beach data sets are combined
Łeski's ε-margin nonlinear classifier based on fuzzy if-then rules (79.96%) classified
with slightly higher accuracy than the plain DE classifier. When the data was
preprocessed with PCA and then classified with the DE classifier, a higher mean
accuracy of 82.01% was gained. As reported in Table 11.8, the result of the DE
classifier with PCA preprocessing outperformed the results of these seven classifiers.
Heart-statlog: When the heart-statlog data set was classified, the best mean accuracy
with a 99% confidence interval was 87.6% ± 1.0%. Compared to the original Cleveland
data set, the accuracy is only a little higher, so removing in advance the samples
with missing values increased the accuracy by only about one percentage unit. The
variances were actually higher with the heart-statlog data set than with the original
Cleveland data set. When the results are compared to those of [20], where this data
set was classified with 19 different classifiers, the DE classifier is more accurate
than the best compared result, an accuracy of 84.4% with the NewId classifier.
The results with the heart-statlog data set are compared in Tables 11.9–11.11. A mean
accuracy similar to that of the DE classifier (83.2%) was observed with LMT (83.2%),
SLogistic (83.3%) and MLogistic (83.7%) in Table 11.9. In this experiment
Hervas-Martinez & Martinez-Estudillo [15] used two folds as we did, but the division
between training and testing sets was 75–25. When the data was preprocessed with PCA
and then classified with DE, we observed an 87.6% mean accuracy. In Table 11.10 the
results are compared with the results reported by Abdel-Aal [2] with GMDH, where
optimal feature selection was also carried out. They also used a two-fold division of
the data set, with a 70–30 split between training and testing sets. Comparing the
results of GMDH with all features (82.5%) to the results of the DE classifier (83.2%),
the DE classifier provided a slightly higher accuracy. When optimal features were
selected with GMDH, an accuracy of 85% was reported, which is a bit lower than what we
reached by applying the DE classifier to the PCA-preprocessed data (87.6%). In Table
11.11 the results are compared to those reported by Polat et al. [28], who used a
ten-fold cross-validation scheme. They reported results for their main classification
system, an artificial immune recognition system (AIRS), that reached an accuracy of
84.50%.
Table 11.9 Comparison of the DE classifier's heart-statlog results to the results
reported by Hervas-Martinez & Martinez-Estudillo [15]
Table 11.10 Comparison of the DE classifier's heart-statlog results to the results
reported by Abdel-Aal [2], using all features and the optimal dimension
Table 11.11 Comparison of the DE classifier's results to the results Polat et al. [28]
reported for the heart-statlog data set
This is slightly higher than the 83.21% accuracy that we reached with the DE
classifier. They also used a preprocessing method, a weighting scheme based on the
k-nearest neighbour (k-NN) method, as a preprocessing step before classifying with
their main classifier, AIRS. Using this method they gained an accuracy of 87.00%,
while we reached an accuracy of 87.63% by preprocessing the data with PCA and then
classifying with the DE classifier. Thus, the compared results appear to have a
rather similar level of accuracy in these cases. When we compare the results with the
enhanced PLCS [3], their method gives about 7% higher classification accuracy than
the presented method.
[Figure: two panels over reduced dimensions 0–14, with curves for the Switzerland,
Statlog, Hungarian, Cleveland, All and Long Beach data sets]
Fig. 11.1 Classification results with respect to the reduced dimension: a) mean
classification accuracies when the data is first preprocessed with PCA and then
classified with the DE classifier; b) variances
In Fig. 11.1, mean classification accuracies and variances are plotted for every
dimension to show how the classification accuracy changes with respect to the reduced
dimension. As can be seen from Fig. 11.1, accurate results are found already with few
dimensions: after preprocessing the data with PCA, good results are still found with
all data sets at dimensions as low as 4. Variances are also low at all dimensions if
the first three are disregarded. The figure suggests that rather accurate results can
be achieved when the reduced dimensionality is only about half of the original. This
information can be very useful when dealing with large amounts of high-dimensional
data, since computations take considerably less time when the dimensionality is
reduced; computations with only six dimensions were about 3.5 times faster than
computations with the full 13 dimensions.
Table 11.12 Results with SVM and with the combination PCA+SVM: best and mean
classification results and variances
for the case when all data sets were combined together. For the diagnosis of heart
disease we found that preprocessing the data first with PCA yields a higher
classification accuracy than classifying without preprocessing; this was observed in
all studied cases except the Switzerland data set. Another aspect is the reduced
overall computing time: through dimensionality reduction, high-dimensional data sets
can be classified considerably faster than with the original data. This procedure
also made it possible for DE to find more robust and accurate class vectors, which
improved the classification accuracy, and in this way we also managed to filter out
noise and improve the results.
We consider that the main factor behind the good classification accuracy in the
studied cases was the application of an effective global optimizer, differential
evolution, for fitting the classification model instead of local optimization based
approaches. The results indicate that preprocessing the data before classification
may, in successful cases, not only help with the curse of increasing data
dimensionality, but also provide a further improvement in classification accuracy.
Still, the main contributor to the accuracy was the global optimization method in the
classifier, which made it possible, at least to some extent, to avoid getting trapped
in locally optimal (and thereby suboptimal) solutions and to improve the solution
further in comparison with the compared approaches. Another important point
contributing to the classification accuracy was the systematic optimization of the
parameter p, instead of keeping it fixed or setting it manually by trial and error.
It should be noted that including p among the optimized parameters was possible only
because the applied global optimizer could handle the extra parameter and the extra
nonlinearity and multimodality of the classifier model optimization problem that this
inclusion brings.
Generally, the classification accuracy yielded by the proposed approach compared well
with the corresponding results of several classifiers reported in the literature. We
managed to classify the Switzerland data set with a mean accuracy of 94.5% ± 0.4%,
and when all heart data sets were combined, we achieved a mean accuracy of
82.0% ± 0.5%. The results suggest that the proposed classification approach has
potential in the diagnosis of heart disease.
A further advantage of the approach is that when the dimension of the data is
reduced, the overall computational time is reduced, allowing classification of even
larger data sets.
References
1. Abbass, H.A.: An evolutionary artificial neural networks approach for breast cancer di-
agnosis. Artificial Intelligence in Medicine 25, 265–281 (2002)
2. Abdel-Aal, R.E.: GMDH-based Feature Ranking and Selection for Improved Classifica-
tion of Medical Data. Journal of Biomedical Informatics 38, 456–468 (2005)
3. Bacardit, J., Krasnogor, N.: Smart Crossover Operator with Multiple Parents for a Pitts-
burgh Learning Classifier System. In: Proceedings of the 8th Annual Conference on Ge-
netic and Evolutionary Computation (GECCO 2006), pp. 1441–1448. ACM Press, New
York (2006)
4. Booker, L.: Improving the performance of genetic algorithms in classifier systems. In:
Grefenstette, J.J. (ed.) Proc. 1st Int. Conf. on Genetic Algorithms, Pittsburgh, PA, July
1985, pp. 80–92 (1985)
5. Cauwenberghs, G., Poggio, T.: Incremental and decremental support vector machine
learning. In: Advanced Neural Information Processing Systems, vol. 13. MIT Press,
Cambridge (2001)
6. Detrano, R., Janosi, A., Steinbrunn, W., Pfisterer, M., Sandhu, S., Guppy, K., Lee, S.,
Froelicher, V.: International application of a new probability algorithm for the diagnosis
of coronary artery disease. American Journal of Cardiology 64, 304–310 (1989)
7. Donoho, D.: High-dimensional data analysis: The curses and blessings of dimensional-
ity. In: Lecture at the “Mathematical Challenges of the 21st Century” conference of the
American Math. Society, Los Angeles, August 6-11 (2000)
8. Duda, R., Hart, P.: Pattern Classification and Scene Analysis. John Wiley & Sons, Chich-
ester (1973)
9. Fodor, I.K.: A Survey of Dimension Reduction Techniques, LLNL technical report (June
2002)
10. Fogarty, T.C.: Co-evolving co-operative populations of rules in learning control systems.
In: Fogarty, T.C. (ed.) AISB-WS 1994. LNCS, vol. 865, pp. 195–209. Springer, Heidel-
berg (1994)
11. Giacobini, M., Brabazon, A., Cagnoni, S., Gianni, A.D., Drechsler, R.: Automatic
Recognition of Hand Gestures with Differential Evolution - Applications of Evolutionary
Computing: Evoworkshops (2008)
12. Gomes-Skarmeta, A.F., Valdes, M., Jimenez, F., Marin-Blazquez, J.G.: Approximative
fuzzy rules approaches for classification with hybrid-GA techniques. Information Sciences
136, 193–214 (2001)
13. Han, J., Kamber, M.: Data Mining: Concepts and Techniques. Morgan Kaufmann Pub-
lisher, San Francisco (2000)
14. Herbrich, R., Graepel, T., Campbell, C.: Bayes point machines. J. Machine Learning
Res. 1, 245–279 (2001)
15. Hervas-Martinez, C., Martinez-Estudillo, F.: Logistic Regression Using Covariates Ob-
tained by Product-unit Neural Network Models. Pattern Recognition 40, 52–64 (2007)
16. Holland, J.H.: Properties of the bucket-brigade algorithm. In: Grefenstette, J.J. (ed.) Proc.
1st Int. Conf. on Genetic Algorithms, Pittsburgh, PA, July 1985, pp. 1–7 (1985)
17. Holland, J.H.: Genetic algorithms and classifier systems: foundations and future direc-
tions. In: Proc. 2nd Int. Conf. on Genetic Algorithms, pp. 82–89 (1987)
18. Holland, J.H., Holyoak, K.J., Nisbett, R.E., Thagard, P.R.: Classifier systems, Q-
morphisms and induction. In: Davis, L. (ed.) Genetic algorithms and Simulated Anneal-
ing, ch. 9, pp. 116–128 (1987)
19. Jolliffe, I.: Principal Component Analysis. Springer, Heidelberg (1986)
20. King, R.D., Feng, C., Sutherland, A.: Statlog: Comparison of Classification Algorithms
on Large Real-World Problems. Applied Artificial Intelligence 9(3), 256–287 (1995)
21. Li, Q., Li, T., Zhu, S., Kambhamettu, C.: Improving Medical/Biological Data Classi-
fication Performance by Wavelet Preprocessing. In: Proceedings of IEEE International
Conference on Data mining (ICDM), pp. 657–660 (2002)
22. Łeski, J.M.: An ε − Margin Nonlinear Classifier Based on Fuzzy If-Then Rules. IEEE
Transactions on Systems, Man and Cybernetics-Part B: Cybernetics 34(1), 68–76 (2004)
23. Luukka, P., Sampo, J.: Similarity Classifier Using Differential Evolution and Genetic
Algorithm in Weight Optimization. Journal of Advanced Computational Intelligence and
Intelligent Informatics 8(6), 591–598 (2004)
24. Martens, H., Naes, T.: Multivariate Calibration. John Wiley, Chichester (1989)
25. Newman, D.J., Hettich, S., Blake, C.L., Merz, C.J.: UCI Repository of machine learning
databases. University of California, Department of Information and Computer Science,
Irvine, CA, http://www.ics.uci.edu/˜mlearn/MLRepository.html
(Cited 30 November 2008)
26. Omran, M., Engelbrecht, A.P., Salman, A.: Differential Evolution Methods for Unsu-
pervised Image Classification. In: Proceedings of the Seventh Congress on Evolutionary
Computation (CEC 2005), Edinburgh, Scotland. IEEE Press, Los Alamitos (2005)
27. Pedreira, C.E.: Learning Vector Quantization with Training Data Selection. IEEE Trans-
actions on Pattern Analysis and Machine Intelligence 28(1), 157–162 (2006)
28. Polat, K., Sahan, S., Günes, S.: Automatic detection of heart disease using an artifi-
cial immune recognition system (AIRS) with fuzzy resource allocation mechanism and
k-nn (nearest neighbour) based weighting preprocessing. Expert Systems with Applica-
tions 32, 625–631 (2007)
29. Price, K.V.: New Ideas in Optimization. In: An Introduction to Differential Evolution,
ch. 6, pp. 79–108. McGraw-Hill, London (1999)
30. Price, K., Storn, R., Lampinen, J.: Differential Evolution - A Practical Approach to
Global Optimization. Springer, Heidelberg (2005)
31. Robertson, G.: Parallel implementation of genetic algorithms in a classifier system. In:
Davis, L. (ed.) Genetic algorithms and Simulated Annealing, ch. 10, pp. 129–140 (1987)
32. Sirin, I., Güvenir, H.A.: An Algorithm for Classification by Feature Partitioning Techni-
cal Report CIS-9301, Bilkent University, Dept. of Computer Engineering and Informa-
tion Science, Ankara (1993)
33. Storn, R., Price, K.V.: Differential Evolution - a Simple and Efficient Heuristic for Global
Optimization over Continuous Space. Journal of Global Optimization 11(4), 341–359
(1997)
34. Vapnik, V.: The Nature of Statistical Learning Theory. Springer, Heidelberg (1995)
35. Wilson, S.W.: Hierarchical credit allocation in a classifier system. In: Davis, L. (ed.)
Genetic algorithms and Simulated Annealing, ch. 8, pp. 104–115 (1987)
Chapter 12
An Integrated Approach to Speed Up GA-SVM
Feature Selection Model
Tianyou Zhang, Xiuju Fu, Rick Siow Mong Goh, Chee Keong Kwoh,
and Gary Kee Khoon Lee
12.1 Introduction
Booming information technologies have promoted the production of data in all sorts of
domains. Significant information, or features, is often mixed up with noise inside the
data, which poses a challenging machine learning task: filtering out the irrelevant
features and selecting the truly important ones. Given data samples with class labels,
supervised classification models are usually used together with optimization
algorithms for feature selection, with classification accuracy serving as the fitness
evaluation of the selected feature subsets.
Tianyou Zhang · Xiuju Fu · Rick Siow Mong Goh · Gary Kee Khoon Lee
Institute of High Performance Computing, 1 Fusionopolis Way, #16-16 Connexis,
Singapore 138632
e-mail: [email protected]
Chee Keong Kwoh
Nanyang Technological University, Singapore 637457
In this work, we develop a feature selection model which combines the merits of the
support vector machine (SVM), the genetic algorithm (GA) and high performance
computing techniques.
A supervised learning model is capable of learning a set of functions (classifiers)
from prior knowledge. It has been widely applied in many domains, including
bioinformatics, cheminformatics and financial forecasting. A typical example of
supervised learning is that, given a set of training data with multiple input features
and labeled outputs, a classifier is learnt from the "known" examples and generalized
to label the "unknown" ones. The rationale for applying supervised learning is that
labeling can be expensive. For example, in most bioinformatics problems, laboratory
approaches are the most reliable and trustworthy but are time-consuming,
labor-intensive and costly. A more cost-effective and efficient alternative is to
conduct laboratory experiments to collect sufficient labeled data, then train a
classifier to label the rest of the input examples, and finally verify the highlighted
examples in the laboratory.
The support vector machine [1] is a set of supervised learning tools based on the
structural risk minimization principle, and it has been popular in both classification
and regression tasks. The principle of SVM is to construct an optimal separating
hyperplane that maximizes the margin between two classes of data. The concept can be
visualized as two boundary planes parallel to the hyperplane, constructed at each of
its sides and pushed maximally towards the data points as long as no data points fall
between them. In this picture, the "margin" is the distance between the boundary
planes, and the "support vectors" are the data points sitting on those planes. In many
real-world problems, however, the unavoidable presence of noise makes it infeasible to
construct such a hard-margin classifier with zero errors. The soft margin [2] was
therefore introduced to allow data points to fall between the boundary planes, or even
across the hyperplane, with a penalty cost C, which controls the tradeoff between
maximal margin and minimal error. For linearly separable data, constructing the linear
classifier is straightforward; for non-linearly separable data, the kernel trick [3]
is needed to map the data into a high-dimensional space in which the transformed data
become linearly separable. The performance of an SVM classifier is often estimated by
k-fold cross validation: the data is divided into k subsets, and in each round a
different subset is used for validation while the remaining subsets are used for
training. The generalization error or accuracy values computed over all k rounds are
finally aggregated to measure the overall performance of the SVM.
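A sketch of this evaluation protocol, using scikit-learn as an assumed stand-in for the chapter's own SVM implementation (the toy data is ours):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=14, random_state=0)

# 10-fold cross validation of an RBF-kernel SVM; the aggregated
# accuracy over the k rounds estimates generalization performance
clf = SVC(kernel="rbf", C=1.0, gamma="scale")
scores = cross_val_score(clf, X, y, cv=10)
print(scores.mean(), scores.std())
```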
When SVM is employed for data classification, the choice of the margin cost C and the
kernel parameters is a very important step for obtaining high performance [4]. The
optimal parameters, which lead to the minimal generalization error, are
data-dependent; presently no rule or formula can compute such values analytically, so
parameter tuning is often required. An intuitive realization of parameter tuning is
grid search [5]: the parameters are varied by a step size within a preset range of
values (in a "grid" structure), and the optimal values are found by measuring every
combination (every node in the grid). Due to its complexity, usually a two-dimensional
grid is used to tune a pair of parameters such as C and γ (the Gaussian function width
in the RBF kernel). Even after parameter tuning, an SVM classifier might deliver poor
accuracy on some particular datasets. One possible reason is noise interference: an
overwhelming number of irrelevant features is included in the inputs, so that a truly
representative classifier cannot be learnt. If prior knowledge is insufficient to
distinguish which features are truly relevant to the output, so that all possible
features are included in the training data, the learning accuracy deteriorates. In
those cases, the key to improving learning performance is feature selection, the
technique of searching for the significant candidates in the feature space by
optimization methods such as the genetic algorithm.
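Grid search as described above can be sketched as follows (again with scikit-learn as an assumed stand-in; the logarithmic ranges mirror the settings reported later in the chapter):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=14, random_state=0)

# a 10x10 grid over C and gamma; every node costs one cross-validated
# SVM training, which is exactly the expense discussed in the text
param_grid = {"C": np.logspace(-2, 3, 10),
              "gamma": np.logspace(-2, 3, 10)}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=10)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```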
The Genetic Algorithm (GA) [6] is a search technique inspired by natural evolution. In
evolution, individuals with better genetic merit (chromosomes) are more likely to
survive under natural selection and reproduce offspring, while the unfit ones are
filtered out. Through this constant filtering, generation after generation, the
population tends to carry fitter and fitter chromosomes. To mimic this process, each
candidate solution to a feature-selection problem is encoded as a "chromosome" (a
feature subset representation) in the form of a bit array (e.g. 10010101), where 1s
and 0s denote the presence or absence of each feature. A group of such candidate
solutions is sampled randomly to form the initial population of chromosomes. The
chromosomes are then evaluated by the objective function to compute their fitness
scores. Multiple chromosomes are stochastically selected based on their fitness,
recombined (crossover) and mutated, and finally form the next generation. By means of
random mutations and crossovers, variety is introduced and evaluated in every
generation, gradually evolving the solutions towards the optimum. The process is
iterated until convergence, i.e. until there is no more improvement to the best
fitness score in the population. At the end, the chromosome with the best-ever fitness
is the final solution, and all the features denoted by 0s are filtered out.
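The encoding and variation operators just described can be sketched in a few lines (our illustration; the rates are placeholders, and the chapter's actual settings appear in the experiment section):

```python
import numpy as np

rng = np.random.default_rng(0)

def init_population(pop_size, n_features):
    """Chromosomes are bit arrays: 1 keeps a feature, 0 drops it."""
    return rng.integers(0, 2, size=(pop_size, n_features))

def crossover(parent_a, parent_b):
    """Single-point recombination of two parent chromosomes."""
    cut = rng.integers(1, parent_a.size)
    return np.concatenate([parent_a[:cut], parent_b[cut:]])

def mutate(chromosome, rate=0.05):
    """Flip each bit independently with probability `rate`."""
    flips = rng.random(chromosome.size) < rate
    return np.where(flips, 1 - chromosome, chromosome)
```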
By taking SVM as the objective function in GA, GA-SVM has been widely used to filter
out irrelevant features and improve learning accuracy in noisy settings. However,
GA-SVM has a practical problem of high computational cost. Assuming a population of m
and g generations in the GA, a w×w grid in parameter tuning and t seconds for SVM
training plus 10-fold cross validation, the overall runtime will be mgw²t seconds. It
is a time-consuming process: even a small-scale problem may need nearly a day to
complete (demonstrated in the Results section). That strongly discourages the
application of GA-SVM to larger and more complex data. In this paper, we introduce
high performance computing (HPC) techniques and heuristic methods to speed up the
traditional GA-SVM feature selection model. In our HPC-enabled GA-SVM (HGA-SVM), we
employ data parallelization, multithreading, repeated-evaluation reduction and
heuristic optimization, with the ultimate goal of trimming down the computational cost
and making large-scale feature selection more feasible. HGA-SVM comprises four
improvement strategies: 1) GA parallelization, 2) SVM parallelization, 3) neighbor
search, and 4) evaluation caching. All four strategies work collectively towards
higher computational efficiency.
12.2 Methodology
The GA-SVM feature selection model operates a GA to search the feature space for a
subset of features that produces the best learning performance through SVM. An
implementation of such a model comprises three operators: crossover, mutation, and
SVM evaluation. With respect to the population size, the first two are of linear
complexity and the last is of quadratic complexity, so reducing the population size
lowers the computational cost effectively. Moreover, as the input data grows larger
and more complex, most of the time lag arises in the SVM evaluation, since all other
operators work only at the chromosome level. If a single SVM training is slowed down
by t seconds on a larger dataset, then with 10-fold cross validation and a 10×10 grid
search, the time lag per chromosome in every generation is amplified to 1000t seconds.
SVM evaluation is clearly the biggest obstacle in GA-SVM, discouraging its application
to large-scale datasets, and a speedup specific to SVM is therefore desirable.
The high computational cost of GA-SVM can also arise from parameter tuning in the SVM.
Exhaustive grid search is an intuitive and straightforward technique for tuning the
SVM parameters; however, it is slow even though the grid dimension is small. A simple
10×10 grid search requires 100 rounds of SVM learning (with 10-fold cross validation)
for each chromosome in every generation, incurring a huge computational cost. Yet
parameter tuning cannot be omitted despite its cost; otherwise the SVM learning would
be biased and the purpose of improving learning performance undermined.
An additional waste of computational power is redundant evaluation. It happens when
identical chromosomes re-emerge in different GA generations due to random mutations
and crossovers. Since a standard GA is memory-less, i.e. keeps no historical records
of past-evaluated chromosomes, it has to evaluate every appearance of those identical
chromosomes, spending computation power on unnecessary evaluations and increasing the
runtime.
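Evaluation caching, introduced below as one of the four strategies, amounts to memoizing the expensive SVM fitness on the chromosome bit pattern; a minimal sketch of our own:

```python
fitness_cache = {}

def cached_fitness(chromosome, evaluate):
    """Return the SVM-based fitness, evaluating each distinct
    chromosome only once across all generations."""
    key = tuple(int(b) for b in chromosome)   # bit array -> hashable key
    if key not in fitness_cache:
        fitness_cache[key] = evaluate(chromosome)
    return fitness_cache[key]
```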
We have enumerated three causes of the slow execution of GA-SVM on large-scale
datasets. Targeting each of them, four improvement strategies were designed to
alleviate the computational cost: parallel GA and parallel SVM speed up the GA and the
SVM respectively; neighbor search replaces grid search to reduce the number of
parameter combinations to be measured; and evaluation caching avoids repeated
unnecessary evaluations. Figure 12.1 shows the workflow of all four improvement
strategies in HGA-SVM.
12.2.1 Parallel/Distributed GA
The design of the GA parallelization follows the parallel island model [7] in a
coarse-grained architecture. The entire population of chromosomes is divided into n
subpopulations, and each subpopulation is assigned to a different parallel node. Every
parallel node evolves its local subpopulation by a serial GA. At the end of every
generation, multiple chromosomes are selected randomly at each node and exchanged
among the peers, which is called "migration". Migration brings new variety to the
local population and helps to build up a common trend of evolution across all
subpopulations.
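A ring-style migration step in this island model might look as follows; this is a sketch under our own assumptions (mpi4py as the MPI binding, which the chapter does not name, and a ring topology for the exchange):

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
rng = np.random.default_rng(rank)

# local subpopulation of bit-array chromosomes (sizes are illustrative)
subpop = rng.integers(0, 2, size=(80 // size, 14))

def migrate(subpop, n_migrants):
    """Send randomly chosen chromosomes to the next node in a ring
    and replace them with those received from the previous node."""
    idx = rng.choice(len(subpop), size=n_migrants, replace=False)
    incoming = comm.sendrecv(subpop[idx].copy(),
                             dest=(rank + 1) % size,
                             source=(rank - 1) % size)
    subpop[idx] = incoming
    return subpop
```

Run with, e.g., mpirun -n 4 to place four subpopulations on four nodes.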
By distributing the chromosomes to n parallel nodes, the local population size is
reduced by a factor of 1/n (the population size is kept an even integer before and
after the reduction). The execution time of a single GA generation is sped up
approximately n times, because selection, crossover and mutation are of linear
complexity and SVM evaluation is of quadratic complexity with respect to the
population size [8]. There are, however, drawbacks of parallelization: the parallel
overheads, which include start-up/termination overhead, synchronization overhead and
communication overhead. The first two are unavoidable in order to coordinate parallel
computing on multiple nodes, so we focus on reducing the communication overhead.
12.2.2 Parallel SVM
The SVM parallelization starts by resolving the data dependencies inside the training
loop, followed by inserting the OpenMP directives, without any modification to the
structure of the algorithm. The parallel SVM can then utilize multiple CPU cores
concurrently in the form of multithreading (refer to Figure 12.1), effectively
reducing the computation time. Since OpenMP also introduces overheads, the parallel
SVM performs more efficiently if the training dataset is sufficiently large.
In our HGA-SVM, the GA and SVM operations are parallelized using MPI and OpenMP
respectively. The two techniques work together as a hybrid parallelization to speed up
the workflow concurrently.
Intuitively, if two chromosomes differ in only a few bits, their optimal locations in
the parameter grid are likely to be close to each other. This applies to the mutated
chromosomes of a new generation when the Hamming distance between parent and child
chromosomes is small. Since a neighbor search has already been done for the parent
chromosome and found its optimal node, the same node can be used as the initial
centroid for the child chromosome, which aids faster convergence.
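The chapter gives no pseudocode for neighbor search; our reading of it is a hill climb on the parameter grid, warm-started at the parent's optimum, as sketched below (grid size from the text, everything else assumed):

```python
def neighbor_search(grid_score, start, rows=10, cols=10, max_steps=20):
    """Hill-climb on the (C, gamma) grid: score the up-to-8 neighbors
    of the current centroid and move while an improvement is found.
    `grid_score(i, j)` is the cross-validated accuracy at node (i, j)."""
    best, best_score = start, grid_score(*start)
    for _ in range(max_steps):
        i, j = best
        neighbors = [(i + di, j + dj)
                     for di in (-1, 0, 1) for dj in (-1, 0, 1)
                     if (di or dj)
                     and 0 <= i + di < rows and 0 <= j + dj < cols]
        top_score, top = max((grid_score(a, b), (a, b))
                             for a, b in neighbors)
        if top_score <= best_score:
            break                         # local optimum on the grid
        best, best_score = top, top_score
    return best, best_score
```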
GA parameters:
  number of parallel nodes: n
  population size: 80
  subpopulation size: 80/n
  crossover rate: 60%
  mutation rate: 5%
  migration rate: 50%
  max-generation: 100
  max-convergence: n
  fitness epsilon: 0.01%
SVM parameters:
  SVM kernel: RBF
  cross-validation: 10-fold stratified
  C and γ range: 10^-2 to 10^3
  grid size: 10×10
  neighbor sampling size: 8
These parameter values were set based on experience. Max-generation refers to the
maximum number of generations to be evolved, and max-convergence denotes the number of
consecutive generations to wait before termination if there is no further improvement
to the fitness score, which is measured by the average accuracy in stratified 10-fold
cross validation of the SVM classifier with the tuned parameters. The minimal update
level of the fitness score is 0.01%. The RBF kernel was used in the SVM, and C and γ
were tuned in the range 10^-2 to 10^3 on a 10×10 grid with a neighbor sampling size
of 8.
Two datasets were used in our study to evaluate the performance of the algorithm;
their details are shown in Table 12.3. Both are preprocessed datasets from LibSVM
(http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). The performance of the
improvement strategies was measured by the computation reduction (including search
reduction and evaluation reduction) and the runtime reduction per generation. Search
reduction refers to the percentage of search tasks saved compared to grid search, and
evaluation reduction to the percentage of evaluation tasks saved compared to a GA
without caching. As GA is a stochastic process, the number of generations until
convergence may vary between runs, so the overall runtime is not appropriate for
benchmarking; instead, the runtime per generation is used to illustrate the
effectiveness of the improvement strategies.
GA-SVM is time-consuming even for a small-scale dataset like Austrian: for 690
examples with 14 features, it took 1048 minutes (about 17.5 hours) to complete 16
generations of GA-SVM. This slowness affirmed our determination to speed up GA-SVM
for feasible application in practice.
Fig. 12.3 Parallel-GA average runtime per generation w.r.t. MPI nodes
[Figure: runtime (sec) vs. # of threads, 1-4]
better performance could be expected in dealing with larger datasets, since the cost
of the overheads would be amortized.
Evaluation caching avoids unnecessary re-evaluations of identical chromosomes in
different generations. If the chance of a cache hit (i.e. a chromosome has been
evaluated earlier and stored in the cache) is significant, the overall speedup will be
remarkable. Fig. 12.5 shows the percentage of evaluations saved by caching in the
experiment. Frequent re-emergence of identical chromosomes was observed, as a result
of the low feature dimension and mutation rate: a 76.25% cache hit rate was observed,
leading to a 75.62% reduction in the average runtime per generation (from 61.42 to
14.97 minutes, a 4-times speedup). The performance of evaluation caching depends
strongly on the re-emergence probability of identical chromosomes, which is mostly
determined by the feature dimension. With the binary encoding of GA chromosomes, the
total number of feature combinations is 2^n, where n is the feature dimension, so as
the feature dimension increases, the chance of hitting a previously evaluated
chromosome drops rapidly. This was confirmed in the experiment with the Adult dataset:
as the feature dimension rose from 14 to 123 with the same population size, there were
only 3 cache hits during 40
Fig. 12.5 Evaluation Reduction Distribution for Evaluation Caching (left: Austrian, right:
Adult)
Fig. 12.6 Search Reduction Distribution for Neighbor Search (left: Austrian, right: Adult)
Fig. 12.7 Integration of Four Improvement Strategies (Austrian) (left: standard GA-SVM,
right: HGA-SVM)
conducted with grid search and neighbor search. Using the same 10×10 grid and
parameter range, grid search required 8000 SVM measurements per generation (80
chromosomes × 10 × 10 grid), while neighbor search measured only 1410 to 1524 (a
reduction of 80.95% to 82.37%, 81.77% on average). Both runs of HGA-SVM found the same
best fitness of 87.97% classification accuracy, but the one using neighbor search was
5.79 times faster per generation (61.42 min vs. 10.60 min). A similar observation was
made with the Adult dataset: 79.10% to 84.70% search reduction by neighbor search,
i.e. on average 5.76 times faster per generation.
Finally, the collective speedup of all four improvement strategies was evaluated, and
a remarkable reduction in computational cost was observed. Fig. 12.7 shows the
distributions of runtime per generation with the Austrian dataset: the average runtime
per generation is reduced from 61.42 min to 0.46 min, roughly 133 times.
In all the above experiments, an improvement in SVM learning accuracy over 10-fold
cross validation was also observed (Fig. 12.8): the learning accuracy was enhanced by
3.74%-8.10% as a result of the feature selection.
12.4 Conclusion
Our HGA-SVM illustrates an integrated approach combining parallelization and heuristic
techniques to lower the computational cost effectively. We have demonstrated the
individual speedup gains from the parallel GA, the parallel SVM, neighbor search and
evaluation caching, as well as their collective gain. Through the feature selection,
the learning accuracy of the SVM was enhanced as well. Overall, we show that HGA-SVM
alleviates the computational cost while improving learning performance, making
application to larger and more complex data feasible. In our future work, caching,
cross validation, and more efficient heuristic techniques will be explored to further
improve the current algorithm.
References
1. Boser, B.E., Guyon, I.M., Vapnik, V.N.: A training algorithm for optimal margin clas-
sifiers. In: Proceedings of the 5th Annual ACM Workshop on Computational Learning
Theory, pp. 144–152 (1992)
2. Cortes, C., Vapnik, V.: Support Vector Networks. Machine Learning 20, 273–297 (1995)
3. Aronszajn, N.: Theory of reproducing kernels. Transactions of the American Mathemat-
ical Society 68, 337–404 (1950)
4. Chapelle, O., Vapnik, V., Bousquet, O., Mukherjee, S.: Choosing Multiple Parameters
for Support Vector Machines. Machine Learning 46, 131–159 (2002)
5. Hsu, C.W., Chang, C.C., Lin, C.J.: A practical guide to support vector classification
(2003)
6. Mitchell, M.: An Introduction to Genetic Algorithms (1998)
7. Tanese, R.: Distributed genetic algorithms. In: Proceedings of the third international con-
ference on Genetic algorithms, George Mason University, United States, pp. 434–439.
Morgan Kaufmann Publishers Inc., San Francisco (1989)
8. Vapnik, V.N.: Statistical Learning Theory. Wiley Interscience, Hoboken (1998)
9. Foster, I.: Designing and Building Parallel Programs: Concepts and Tools for Parallel
Software Engineering. Addison-Wesley, Reading (1995)
10. Platt, J.: Sequential minimal optimization: A fast algorithm for training support vector
machines. Advances in Kernel Methods-Support Vector Learning, 185–208 (1999)
11. Chang, C.C., Lin, C.J.: LIBSVM: a library for support vector machines 80, 604–611
(2001), Software available at http://www.csie.ntu.edu.tw/cjlin/libsvm
12. Dagum, L., Menon, R.: OpenMP: An Industry-Standard API for Shared-Memory Pro-
gramming. IEEE Computational Science & Engineering, 46–55 (1998)
13. Momma, M., Bennett, K.P.: A pattern search method for model selection of support
vector regression. In: Proceedings of the SIAM International Conference on Data Mining
(2002)
14. Houck, C.R., Joines, J., Kay, M.: A Genetic Algorithm for Function Optimization: A
Matlab Implementation, NCSU-IE TR, vol. 95 (1995)
15. Eaton, J.W.: Octave, http://www.gnu.org/software/octave/
16. Fernández, J., Anguita, M., Ros, E., Bernier, J.: SCE Toolboxes for the Development of
High-Level Parallel Applications. In: Alexandrov, V.N., van Albada, G.D., Sloot, P.M.A.,
Dongarra, J. (eds.) ICCS 2006. LNCS, vol. 3992, pp. 518–525. Springer, Heidelberg
(2006)
Chapter 13
Computation in Complex Environments;
Optimizing Railway Timetable Problems with
Symbiotic Networks
Kees Pieters
13.1 Introduction
The title of this contribution balances on two concepts. The first is ‘complex envi-
ronments’.‘Complex’ should not be read in the colloquial sense of the word; com-
plexity addresses, amongst others, non-linear, contingent and ‘chaotic’ phenomena
([10],[11]). Many thinkers on complexity consider such characteristics — some-
times called organized complexity— to demarcate a transition point where analytical
approaches are no longer feasible ([26]:18).
Put in another way, organized complexity moves away from traditional machines,
which have supreme performance for their intended tasks, but also require very sta-
ble and predictable environments. Rather a line is drawn towards living organisms,
which are very robust and are better adjusted for contingent environments than ma-
chines are. Along this gradient, ‘robust machines’ form an interesting field of in-
quiry for optimization problems. Railway Timetable Problems (RTP) can be seen as
a benchmark for such robust machines (or algorithms).
The second concept, ‘symbiotic networks’, is introduced as an optimization strat-
egy that can, to some extent, optimize in such complex environments. RTP has been
a benchmark problem for symbiotic networks, and so this contribution does not fo-
cus on RTP in itself, but rather uses RTP to analyze the behaviour of symbiotic
networks in a practical setting.
This paper is outlined as follows: first, a ‘meta-perspective’ on optimization
processes is drawn, and the RTP is introduced as a complex environment. The theory of
symbiotic networks is then discussed, along with how this approach was implemented to
optimize the RTP. Finally, the various tests carried out with the simulation
environment are discussed.
Cornelis P. Pieters
Condast, Omloop 82, 3552 AZ, Utrecht, the Netherlands
e-mail: [email protected]
The ‘Traveling Salesman Problem’ and certain job-shop problems are examples of
familiar, NP-hard problem domains, while agent systems typically operate in reactive
and often highly nonlinear environments [28]. Strictly speaking, many neural networks
can be considered to operate in ‘familiar’ environments, as the patterns they learn
during the training phase (design) are assumed to be present in the problem domain
once the network is in operation. In practical settings, most solutions require a mix
of these, in which many constraints and heuristics specific to the problem domain are
also designed. Note that the categories are ‘actor-centric’: the designer of the
process may, and usually will, have more knowledge about the environment than the
actor itself.
Along the scales introduced by these characteristics, it becomes clear that
computational intelligence often combines intelligence ‘by design’ with ‘true’ forms
of computational intelligence. ‘Designed’ intelligence includes the implicit
assumptions behind a chosen solution strategy, the type of solution that is chosen
(GAs, neural networks, etc.) and the ‘tweaking’ of the algorithms by the designer.
Problem domains that are predominantly uncertain, NP-hard, non-linear, reactive and
contingent are amongst the most difficult to tackle. These form the characteristics
of a complex environment. A convergence-inducing process operating in
these domains usually cannot be tailored for a specific environment, but will need
to find a near-optimal solution in finite time in a problem domain that continuously
changes, and which may react to the process itself. A robust algorithm will try to op-
timize, regardless of the conditions the environment imposes on it, and a good robust
optimizer may even use these conditions to its advantage.
There is one specific class of problems that provides an interesting point of refer-
ence for optimization in these severely complex problem domains: railway timetabling and train-pathing problems [3, 8, 12, 13, 17].
Train scheduling in practice therefore still relies on human experts [29], although
some supportive tools have been developed to assist in generating timetables for a
full service of trains on a full infrastructure ([13]:15-16).
Another approach can be taken where optimization is performed on a full in-
frastructure, and where the algorithm has to optimize without prior knowledge of
the problem domain. In this case, the number of conflicts will be measured as a
function of a given timetable (see Figure 13.1). This approach makes the problem
domain predominantly uncertain, NP-hard, non-linear and reactive. If the optimiza-
tion has to account for delays of trains, or other sources of contingencies, then RTP
fulfills all the criteria for a complex environment. In this case, efficiency is not the
only criterion for the optimization strategy, but robustness also becomes an impor-
tant aspect.
In effect, the trains just follow their intended trajectories (in a simulated envi-
ronment), while the optimization strategy measures the conflicts and adjusts the
timetables ‘on-the-fly’. The departure times are the only degrees of freedom that the
problem solver has, but changing them also changes the conflicts (reactivity).
Intuitively it is clear that train timetable problems cannot be resolved entirely
through optimization of the departure times. The railroad infrastructure may be too
restrictive, for instance when two trains are heading towards each other on a single
track. Most existing railroad infrastructures will be sufficiently extensive to avoid
these infrastructural limitations.
A second source of conflicts is related to the number of trains in service at a given time. The potential for conflicts increases with the number of trains that use the infrastructure. At a certain critical threshold, the traffic on the railroad infrastructure becomes so dense that conflicts can no longer be resolved. This is another boundary of the RTP.
A practical description of RTP therefore is as follows:
Given a certain rail infrastructure, and given a certain required service of trains, can
the infrastructure provide this service without conflicts?
One attempt to automatically generate an optimal rail timetable for a full railroad
infrastructure (a model of the Dutch railroad infrastructure) used so-called symbiotic
networks to resolve this [16, 17, 19].
Pattern: Actor/Co-actor
Description: An actor can interact with others according to a number of strategies. The notion of perceived 'benefit' is assigned to these interactions between the actor, the entity initiating the interaction, and the co-actor, the entity that is subject to the interaction. This could, for instance, be according to the following table:

Name          Benefit actor   Benefit co-actor   a.k.a.
Co-existence  0               0
Competition   +               0/-                Adversarialism
Parasitism    +               -/0/+
              +               +                  Mutualism
Altruism      -               +
Symbiosis     0/+             0/+
Synnecrosis   -               -                  (e.g. spite)
Practical tests showed that symbiotic networks are particularly interesting in dynamic environments. When a number of n agents collaborated in solving a certain task, the complexity of O(n³) proved to be relatively poor in static environments, such as when the Traveling Salesman Problem was taken as a benchmark [15]. A chance article in a Dutch newspaper on railway timetable problems provided the ideal experimental environment for symbiotic networks. A model of the Dutch railway infrastructure was made, in which various optimization strategies were implemented and compared. It was experimentally demonstrated that the complexity of the solution remained approximately O(n³) for RTP.
First, the theory behind symbiotic networks will be given some attention. As was
mentioned earlier, the initial research aimed to find the minimal requirements that
agents would need to engage in a symbiotic interaction. Symbiosis is seen here as a mutually beneficial pattern of interaction. According to the actor/co-actor pattern, mutualism will also conform to this pattern, but mutualism is generally associated with a
parasitic interaction that turns out to be mutually beneficial [9]. Symbiosis in the
way it is used here rather starts from co-existence, which develops into symbio-
sis under certain conditions. This applies for more co-operative strategies, such as
CCGAs [23, 24], swarming algorithms and ANT algorithms [2], although these so-
lutions are usually designed to be co-operative. Symbiotic networks have to learn
this, given a certain overall (predetermined) goal.
As a first crude description, one could say that the agents in a symbiotic network
provide some sort of service that benefits the others. This is usually the implicit
assumption of symbiosis or mutualism, which comes down to a form of ‘I’ll scratch
your back if you’ll scratch mine’. However, the model that was developed shows
that this elementary form of feedback is insufficient to create a stable relationship.
A specific kind of communication is required to negotiate the need for the services
through the network.
Symbiosis assumes that the benefit, which is mathematically represented by a
goal function, is provided by the other participants in the network and so the agents
are encouraged to maintain and optimize the relationship once this benefit is ‘de-
tected’. In symbiotic networks, this optimization is considered to be the result of an
unusual feedback loop that is established once the agents are in each other’s sphere
of influence. This loop is achieved by the agents' ability to communicate their needs (through so-called stress signals) and to change their behavior based on these stress signals. It is a bit like a parent and her baby; when the baby
starts crying, the parent stops doing whatever she was engaged in and addresses the
baby’s needs.
In a way, some agents are sensitive to certain contingencies in, or require resources from, the environment, while other agents have the means to address them.
It is clear that competitive or altruistic approaches will not be a preferred form of in-
teraction between these agents in such situations. Parasitic or (other) invasive strate-
gies might work if agents know beforehand which others to target. However, it is
not always known which teams should be formed. Symbiotic networks, to some
extent, are able to figure this out by learning patterns in the stress signals that are
communicated.
Fig. 13.2 In a symbiotic network, the problem domain is ‘folded in’ the network
The model presented in this paper considers the environment of the system to be
an integral part of the network (Figure 13.2) [4]. This approach is similar to that of
Pnuelian reactive systems and agent-based systems [28].
- A goal gi
- A dimensionless stress function si(Ii, gi) (13.3)
- Symbiotic behavior μi(s0, s1, . . . , sn) (13.4)
This results in the following agent, a symbiot (Figure 13.4):
Fig. 13.4 A symbiot emits a stress signal and is able to change its behavior as a function of
the stress signals in the network
The goal is associated with the input of the entity, which reflects for instance
a biological organism’s need for food. When these symbiots are connected, a very
simple network is formed (Figure 13.5).
In this figure, symbiot1 is the successor of symbiot0 through the environment.
At convergence, symbiot0 should be able to support symbiot1 in reaching its goal
I1 = g1 . This means that the following will apply:
I1 = g1 = O0 · E1 ⇒ O0 = g1 / E1 (13.5)

O0 = μ0 · I0 ⇒ μ0 = g1 / (E1 · I0) (13.6)
The behavior μ0 is determined by the stress signals and should converge to a situa-
tion where I = g, or I0 = g0 and I1 = g1 . In such a situation (13.6) becomes
μ0 = g1 / (g0 · E1) (13.7)
It is clear that if this applies for all symbiots in the system, one could say that the system has mapped its behavior μ = [μ0, μ1] against its environment and its goals.
This is interesting when the environment is unknown, for a converged system
can provide information about the environmental relationships between the various
probes that the system has put in the environment. Note that in this situation the
inputs (i.e. the goals) should never be allowed to become zero as this would impair
convergence.
If the symbiots are connected in such a way that E2 = E0 , then a situation of
mutual benefit has been formed. This results in symbiosis according to the pattern
of actor/co-actor.
Convergence of the symbiotic system takes time. A system has converged when:
lim t→tc Δμi = 0 (13.8)
Δμi is the change over a certain amount of time, and the convergence time tc is the time when the system has converged. Suppose the behavior μi changes over time according to a certain algorithm fμi(s0, s1, . . . , sn):

μi(t + 1) = μi(t) + fμi(s0, s1, . . . , sn) (13.9)

Here, t stands for the iteration step in time.
fμi(s0, s1, . . . , sn) = 0 (13.10)

Adding this so-called symbiotic algorithm to the symbiot results in Figure 13.6.
The stress signal si(Ii, gi) reflects the aim that one has when applying the network to a specific problem, and should be zero once the goals have been achieved. The network aims to achieve I = g. Take, for instance, the following stress function:

si(Ii, gi) = ρ(gi − Ii) (13.11)

In this equation, the stress increases the further the input diverges from the goal. The factor ρ is used to make the stress signal dimensionless and to normalize its value, for instance between ⟨−1, 1⟩. In this case, at convergence si = 0.
δ and ρ are included in (13.14) and (13.15) to scale the dimensionless factors f(s) and si to the dimensions of Ii and μi. For now, each neighborhood is defined by a transfer function Ei that relates a symbiot's output to its successor's input (13.16).
Now suppose an initial situation where I ≠ g and the following relations apply:
f(s) ≥ 0, if ∑(i=0..n) si ≥ 0;  f(s) < 0, if ∑(i=0..n) si < 0 (13.17)
Because of (13.14), μi will increase or decrease due to (13.17). The same will also happen to Oi due to (13.13), (13.14) and (13.16), but the opposite will occur due to (13.15). This process will therefore repeat itself until ∑(i=0..n) si = 0. The time tc that this convergence takes depends on the values of the elements of s, and therefore on the values of I (13.15) and E (13.16). This means that the environment affects the convergence time of the network.
The convergence criterion shows that any symbiotic algorithm that complies with the constraints of (13.17) will cause convergence of the system, provided that every symbiot in the network is connected to at least one other through the environment.
The convergence criterion also shows that I = g is only one of the possible solutions.
Depending on the symbiotic algorithm, averages of the various (gi − Ii ) functions
can also cause convergence. This will result in premature convergence with (gi − Ii) = ei, where ei is not equal to zero.
The problem of premature convergence is similar to limitations in pattern match-
ing that are recognized in neural networks [5]. The n symbiots as a whole form an n×n matrix (n symbiots that are 'listening' to n stress signals, including their own) that is dynamically altered by the stress vector s. This results in a number of
eigenvectors, which stand for solutions of s where the symbiotic algorithm is zero.
The feedback loop that is constructed through the environment causes the system
to converge to one of them. However, the eigenvector still represents a whole range
of solutions of s, of which only (s = 0) is desired. The challenge of the system is
to approximate this ideal convergence. This will be discussed further in the next
section.
The neighborhood influences the convergence process through its sign. Ei could
be a negative function in (13.16), in which case convergence is still possible, pro-
vided that the signs of the appropriate si+1 are inverted in the symbiotic algorithm
that is used.
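To make the preceding mechanism concrete, the following minimal sketch (ours, not the author's implementation) simulates a two-symbiot ring with the stress function si = ρ(gi − Ii) used above and one simple targeted symbiotic algorithm: each symbiot adjusts its behavior in response to the stress of the symbiot it feeds through the environment. All parameter values are illustrative assumptions.

def simulate(E=(0.8, 1.25), g=(2.0, 3.0), rho=0.1, delta=0.05, steps=5000):
    n = len(g)
    mu = [1.0] * n              # behaviors, adapted over time
    I = list(g)                 # inputs, initialized at the goals (never zero)
    for _ in range(steps):
        s = [rho * (g[i] - I[i]) for i in range(n)]    # stress signals
        for i in range(n):                             # symbiot i feeds symbiot i+1,
            mu[i] += delta * s[(i + 1) % n]            # so positive stress there -> produce more
        O = [mu[i] * I[i] for i in range(n)]           # outputs O_i = mu_i * I_i
        I = [O[(i - 1) % n] * E[i] for i in range(n)]  # coupling via the environment: I_i = O_{i-1} * E_i
    return mu, I

mu, I = simulate()
print([round(m, 3) for m in mu])  # should approach [g1/(g0*E1), g0/(g1*E0)] = [1.2, 0.833], cf. (13.7)
print([round(x, 3) for x in I])   # should approach the goals (2.0, 3.0), i.e. s -> 0, cf. (13.8)

Under these assumed settings the loop reproduces the qualitative behavior described above: the stress-driven updates drive I toward g, and the converged behaviors encode the environmental couplings as in (13.7).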
Up to now a very rigid goal criterion has been used, namely one where Ii = gi . In
nature, the survival goals are often less strict and could be, for instance, Ii ≥ gi (e.g.
food). This would translate to the stress signals as follows:
si(Ii, gi) = ρ(gi − Ii), if Ii < gi;  si(Ii, gi) = 0, if Ii ≥ gi (13.18)
This choice increases the solution space of the network significantly, as a whole range of input vectors I lead to a situation where s = 0. This allows a much larger portion of the solution space to give adequate results, leading to efficient systems that use very simple symbiotic algorithms.
This does not mean that specific heuristics or optimization algorithms cannot
improve the overall convergence when applied to a specific problem. A number of
these 'designed' interventions will return in the specific case of RTP, which will be discussed next.
Globally, RTP configured as a symbiotic network means simulating the railway
infrastructure and the intended services of the trains. The simulation lets the trains
travel their trajectory, and every time a conflict between trains occurs, a stress signal
is generated. The actual optimizing layer collects the stress signals and generates a
new timetable, which defines the 'behaviour' of the symbiot. The symbiotic algorithms that are used determine how the stress signals modify the timetable. Most of the strategies implemented here change the departure time of a train a minute earlier or later per optimization cycle, and keep track of whether a previous change resulted in an improvement or not. Thus, local optimization (of individual trains) should result
in global optimization. The overall flow chart is given in Figure 13.8. This approach
will be discussed in greater detail next.
The Dutch railway operator (NS) operates a (daily) cyclic rail timetable, so trains depart at fixed intervals (for instance two or three times an hour) [12, 13].
Before continuing, it may be helpful to give a few definitions:
• A timetable is a set of departure times for all the trains that are required to travel
on a certain rail infrastructure during a certain time frame, for instance daily.
• A trajectory is a service that one or more trains are required to travel and runs
from a certain station (origin) to a destination. In between, the trains usually stop at one or more stopover stations.
• A (train) schedule is the schedule that is assigned to a trajectory. This includes
the number of trips during a certain period (e.g. four times per hour) and the
departure times.
• A trip is the service that a certain train is actually carrying out at some point.
RTP has been used as an environment to investigate the behaviour of a symbiotic
network in a practical setting. The focus was not on generating actual timetables.
The research did confirm that the network does optimize, and therefore could, in
principle, be used to generate real timetables.
The experiments were conducted by running a computer simulation of the Dutch
railway system with the different symbiotic algorithms and under different condi-
tions. Each run consisted of 30,000 iterations, where an iteration step represents a
‘real time’ of 15 seconds. All the experiments initially select a random departure
time within the range of a trip.
For one, the different types of trains define a hierarchical structure (Table 13.4):
The table shows the hierarchy from top to bottom. The type of train not only de-
termines the maximum speed vmax and the number of stops, it also defines how the
trains are influenced by stress signals. The departure time will only be influenced
by stress signals of trains of equal type or higher. An international train will only
respond to stress signals of other international trains, while local trains respond to
stress signals of all other train types. This approach minimizes possible goal con-
flicts of trains that are considered more important, but imposes a great deal of stress
on express trains and local trains. On the other hand, a lot of local trains operate in
rural areas, sparsely populated parts of the railway infrastructure, where they mainly
encounter stress around the origin and destination stations of their trajectory. In such
cases, the hierarchy in train types contributes to a situation where stress is distributed
from ‘hot spots’ to the periphery.
If the trajectory is a train's general travel plan, a trip is its instantiation. If a trajectory is traveled, say, four times an hour, the application creates eight trips for that trajectory, four in each direction. A train may 'collect' stress signals based
on its encounters with other trains, but it is the trip that uses the result to change
the departure time of the next train. When a train has reached its destination, the
collected stress is made available to the optimizing layer. This approach makes the
optimization process less sensitive to short-term fluctuations of the stress signals
during a trip. Strictly speaking, the trip is therefore the active optimizing element
in the network, while the train merely causes and collects stress signals. This way a
relationship between departure time and stress signal is established.
The stress signals reflect the conflicts that trains can encounter:
• Two trains traveling in opposite directions pass each other on a single track.
• A train tries to enter a neighborhood that has no free tracks.
• A train encounters another train of lower rank (speed) less than 1500 meters in
front of it, heading in the same direction.
The model does not adjust the behavior of the trains; instead they pass each other as if the other train is not there. Therefore, only the stress signal is a reminder of these encounters. When two trains find themselves in conflict, the stress that is calculated is determined by the location of the trains and the nearest neighborhood that has sufficient tracks to resolve the problem.
If the nearest free neighborhood is in front of a train, it outputs a negative stress
signal, which is translated to a request to leave earlier at the next trip. In the other
situation a positive stress signal is given, which is a request to leave later. The sys-
tem enforces that both trains give a stress signal that leads them to the same free
neighborhood. This prevents stalemates from occurring, especially for trains coming from opposite directions. If, for instance, both trains gave a stress signal to leave earlier, exactly the same conflict would occur a bit earlier on the next trip.
An advantage of this particular environment is that every agent in the system knows exactly which other agents it is interacting with, and so its response is tar-
geted to service those that actually profit from it. This improves the behavior of the
system as a whole, as the chance of premature convergence due to goal conflicts
and neutralizing stress signals becomes smaller. Of course, the stress collected by a
train is still the superposition of the collected stress signals, but due to the dynamic
character of the network there is a good chance that the system ‘pulls’ itself out of
temporary premature convergence due to fluctuation of the stress signals. The dy-
namic character of the network also prevents stable branches with amplifications from occurring in the network. This means that the dynamics of the environment can, to some
extent, actually be used by the system to improve its overall behavior.
Most learning algorithms included a weight vector for every stress signal, in a fashion similar to (13.19) and (13.20).
A series of tests has been carried out to analyze the optimizing behaviour of the network using three algorithms, namely a hill-climbing algorithm, Hebbian learning and a third that made the system behave somewhat like a Kohonen network [17].
13.5.6 Results
A typical result of a number of runs has been depicted in Figure 13.10.
Figure 13.11 shows the results when random delays of trains were introduced to
test the robustness of the solutions. The most striking result is the fact that on aver-
age the network hardly seems affected by the delays. Normally a conflict is resolved
the moment when conflicting trains find a free track. Delays push the trains further
away from the conflict area into the free neighborhoods. As most of these ‘sink-
holes’ are formed by stations with a three minute stopover time, they are sufficient
to resolve the majority of the delays that are generated, while delays larger than
three minutes will only cause incidental stress and not lead to structural changes in
the system.
Delays with a higher impact will at some point deteriorate convergence, although
the variance seems fairly constant (Figure 13.12). Similar results were obtained
when the maximum delay time is increased.
free tracks much better. On average, CCGAs manage to let 85±3% of the trains run
without conflicts.
The configuration is hardly affected by varying mutation rates up to 5%. Above
that, the system gives significantly poorer results.
13.5.8 Discussion
When problem domains and solution strategies are considered from a wider per-
spective, such as provided by the pattern of a convergence inducing process, var-
ious classes of solution strategies can be compared in relation to the specific problems they aim to address. Besides the more intuitive distinctions between in-
telligence ‘by design’ and ‘true’ computational intelligence, it also allows a more
comprehensive assessment of various solution strategies provided by interactions of
multiple agents, such as for instance depicted in the actor/co-actor pattern. Most of
all, it introduces the environment as an integral part of the optimization process.
The specific problem domain provided by railway timetable problems has demonstrated the relationships between environmental conditions, their constraints, heuristics and possibilities, and the interplay between designed and computational intelligence.
In this contribution, symbiotic networks were introduced as an approach to optimize in complex, dynamic environments. So far, the research has pursued modest goals and concentrated on understanding how agents can learn to collaborate in complex environments in order to achieve an overall goal. Railway timetable problems offered a means to analyze this, and the results demonstrate that symbiotic networks have the potential to be applied in real-world applications as robust problem solvers.
Acknowledgments
I would like to thank prof. dr. Harry Hunneman from the University for Humanistics
in Utrecht in the Netherlands, and prof. dr. Paul Cilliers from the Centre for Studies
in Complexity of Stellenbosch University in South Africa for their valuable support
in developing a ‘helicopter view’ on the issues related to complexity thinking.
I am also greatly indebted to dr. ir. Schil de Vos and dr. Jack Gerissen for their
support during my research in symbiotic algorithms at the Open University in the
Netherlands, and Schil also for his feedback for the draft version of this chapter.
References
1. Alexander, C.: A Pattern Language: Towns, Buildings, Construction. Oxford University
Press, USA (1977)
2. Blum, C.: Ant colony optimization: Introduction and recent trends. Physics of Life Reviews 2(4), 353–373 (2005),
http://www.dx.doi.org/10.1016/j.plrev.2005.10.001
3. Caprara, A., Fischetti, M., Toth, P.: Modeling and solving the train timetabling problem.
Operations Research 50(5), 851–861 (2002),
http://www.jstor.org/stable/3088485
4. Cilliers, P.: Boundaries, hierarchies and networks in complex systems. International Jour-
nal of Innovation Management 5(2), 135–147 (2001)
5. Hassoun, M.H.: Fundamentals of Artificial Neural Networks. The MIT Press, Cambridge
(1995)
6. Holland, J.H.: Adaptation in Natural and Artificial Systems: An Introductory Analysis
with Applications to Biology, Control, and Artificial Intelligence. The MIT Press, Cam-
bridge (1992)
7. Jong, K.A.D.: An analysis of the behavior of a class of genetic adaptive systems. PhD
thesis, University of Michigan (1975),
http://portal.acm.org/citation.cfm?id=907087
8. Lee, Y., Chen, C.: Modeling and solving the train pathing problem. In: Twelfth World
Multi-Conference on Systemics, Cybernetics and Informatics, IIIS, Orlando (2008)
9. Margulis, L.: Symbiotic Planet: A New Look At Evolution, 1st edn. Basic Books (1999)
10. Mitchell, M.: Complexity: A Guided Tour. Oxford University Press, USA (2009)
11. Morin, E.: On Complexity. Hampton Press (2008)
12. Odijk, M.A.: Railway timetable generation (1998)
13. Peeters, L.: Cyclic Railway Timetable Optimization. PhD Thesis, ERIM PhD Series Research in Management, Erasmus Universiteit, Rotterdam (2003)
14. Picek, S., Golub, M.: Dealing with problem hardness in genetic algorithms. WSEAS Transactions on Computers 8(5) (2009)
15. Pieters, C.P.: Symbiotic algorithms. Master’s thesis, Open University (2003)
16. Pieters, C.P.: Symbiotic networks. In: The 2003 Congress on Evolutionary Computation, CEC 2003, vol. 2, pp. 921–927 (2003)
17. Pieters, C.P.: Trains in symbiosis. In: IASTED 2004 Congress on Artificial Intelligence
and Soft Computing 2004, pp. 481–487 (2005)
18. Pieters, C.P.: Effective Adaptive Plans, pp. 277–282. Springer, Heidelberg (2006),
http://dx.doi.org/10.1007/1-4020-5263-4_44
19. Pieters, C.P.: Reflections on the geno- and the phenotype. In: CEC 2006 IEEE Congress on Evolutionary Computation, pp. 1632–1638 (2006)
20. Pieters, C.P.: Complex systems and patterns. In: Twelfth World Multi-Conference on
Systemics, Cybernetics and Informatics, vol. VII, pp. 268–275 (2008)
21. Pieters, C.P.: A pattern-oriented approach to health; using pac in a discourse of health.
International Journal of Education and Information Technologies 3(2), 126–134 (2009),
http://www.naun.org/journals/educationinformation/
eit-90.pdf
22. Pieters, C.P.: Patterns, complexity and the lingua democratica. In: Proceedings of the
10th WSEAS International Conference on Automation and Information, ICAI 2009.
Recent Advances in Automation & Information. WSEAS Press, Prague (2009)
23. Potter, M.A., de Jong, K.A.: A cooperative coevolutionary approach to function op-
timization. In: Davidor, Y., Männer, R., Schwefel, H.-P. (eds.) PPSN 1994. LNCS,
vol. 866, pp. 249–257. Springer, Heidelberg (1994),
http://citeseerx.ist.psu.edu/viewdoc/
summary?doi=10.1.1.119.2706
24. Potter, M.A., de Jong, K.A.: Cooperative coevolution: An architecture for evolving coad-
apted subcomponents. Evolutionary Computation 8(1), 1–29 (2000),
http://dx.doi.org/10.1162/106365600568086
25. Rosenblatt, F.: The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review 65(6), 386–408 (1958)
26. Weinberg, G.M.: An Introduction to General Systems Thinking, 25th edn. Dorset House
Publishing Company, Incorporated (2001)
27. Wolpert, D., Macready, W.: No free lunch theorems for optimization. IEEE Transactions
on Evolutionary Computation 1(1), 67–82 (1997)
28. Wooldridge, M.: Reasoning about Rational Agents, 1st edn. The MIT Press, Cambridge
(2000)
29. Zwaneveld, P.J., Kroon, L.G., van Hoesel, S.P.M.: Routing trains through a railway sta-
tion based on a node packing model. European Journal of Operational Research 128(1),
14–33 (2001)
Chapter 14
Project Scheduling: Time-Cost Tradeoff
Problems
14.1 Introduction
The project manager handles conflicting objectives when optimizing the various parameters of the project scheduling process. Minimizing project completion time and project cost continue to be universally sought objectives; being conflicting in nature, this is known as the time-cost tradeoff (TCT) in project scheduling. TCT belongs to the class of multi-objective optimization (MOO) problems, wherein there is no single optimum solution; rather, there exists a number of solutions which are all optimal (Pareto-optimal solutions, called the optimal TCT profile in the project scheduling literature). The tradeoff
between project time and cost gives project managers both challenges and opportu-
nities to work out the best schedule to complete a project, and is of considerable eco-
nomic importance. Projects are usually represented using networks, having nodes
Sanjay Srivastava
Department of Mechanical Engineering, Dayalbagh Educational Institute,
Dayalbagh, Agra, India
e-mail: [email protected]
Bhupendra Pathak · Kamal Srivastava
Department of Mathematics, Dayalbagh Educational Institute, Dayalbagh, Agra, India
e-mail: [email protected],[email protected]
and directed arcs (Figure 14.1). These diagrams provide a powerful visualization of
the relationships among the various project activities, which form the precedence
constraints in TCT analysis. There are two kinds of networks:
i Activity-On-Arc (AOA): The arcs represent activities and the nodes represent
events. An event represents a stage of accomplishment; either the start or the
completion of an activity.
ii Activity-On-Node (AON): The nodes represent activities and the directed arcs
represent precedence relations. This representation is easier to construct.
The nonincreasing time-cost relationship of a project activity can be either continuous or discrete (cost refers to direct cost throughout this work). The continuous relationship can be linear or nonlinear. Accordingly, TCT problems may be categorized
as: (i) linear TCT problem for linear continuous time-cost relationship; (ii) nonlin-
ear TCT problem for nonlinear continuous time-cost relationship, and (iii) discrete
TCT problem for discrete time-cost relationship.
There are a large number of activities within real-life projects; therefore, it is
almost impossible to enumerate all possible combinations to identify the best deci-
sions for completing a project in the shortest time and at the minimum cost. Several
researchers have suggested various methods [1], including mathematical techniques
and heuristics, for obtaining the TCT profile, but there still remain many serious impediments that restrict a wider use of TCT profiles as management tools.
The first concerns the form of the time-cost relationship of project activities and
the size of project networks. Most of the existing methodologies for determining
optimal TCT profile need to rely on unrealistic assumptions about the time-cost re-
lationship of activities such as linear, convex, continuous etc. in order to manage
computational costs. This, however, renders the derived profile inaccurate. Most of
the reported methodologies that attempt to deal with realistic time-cost relation-
ships are accompanied by the only-for-small-networks admonition. Moreover, the
discrete version of TCT problem is known to be NP-hard, and it has been proved that
any exact solution algorithm would very likely exhibit an exponential worst-case
complexity [2].
The complexity of the TCT problem increases further if resource constraints are also present, which is not uncommon in realistic projects. In addition, to solve the TCT problem in a generalized way, the scheduler must consider the presence of project uncertainties such as weather conditions, space congestion, labor performance, etc., which
dynamically affect both the project duration and cost during its implementation. In view of the foregoing research issues, we developed comprehensive and intelligent methods to solve a variety of realistic TCT problems.
Some related research efforts follow. Richard et al. [3] developed nonlinear time-cost tradeoff models with quadratic cost relations. Vanhoucke [4] applied a branch and bound method to solve the discrete TCT problem with time switch constraints. Vanhoucke and Debels [5] used an analytical method as well as tabu search to solve the discrete TCT problem.
As mentioned, TCT is a MOO problem with two conflicting objectives. MOO is a field that researchers have explored extensively since the 1990s; as a result, diverse techniques have been developed over the years [6]. Most of these techniques avoid the complexities involved in MOO and usually transform the multi-objective problem into a single-objective problem by employing some user-defined function. Since MOO involves determining Pareto-optimal solutions, it is hard to compare the results of the various solution techniques of MOO, as it is the decision-maker who decides the best solution out of all optimal solutions pertaining to a specific scenario [7]. Evolutionary algorithms (EAs) are metaheuristics that are able to search large regions of the solution space without being trapped in local optima [8]. Some well-known metaheuristics are genetic algorithms (GA), simulated annealing (SA), and tabu search. Genetic algorithms are search algorithms
[9], which are based on the mechanics of natural selection and genetics to search
through decision space for optimal solutions [10]. In GA, a string represents a set of
decisions (chromosome combination), a potential solution to a problem. Each string
is evaluated on its performance with respect to the fitness function (objective func-
tion). The ones with better performance (fitness value) are more likely to survive
than the ones with worse performance. Then the genetic information is exchanged
between strings by crossover and perturbed by mutation. The result is a new gen-
eration with (usually) better survival abilities. This process is repeated until certain
termination condition is met. A genetic algorithm uses a population of solutions in
each iteration of its search procedure, instead of a single solution. Since a population
of solutions is processed in each iteration, the outcome of a GA is also a population
of solutions. This unique feature of GA makes it a true multiobjective optimization
technique and that is how GAs transcend classical search and optimization tech-
niques [11]. The robustness of GAs is greatly enhanced when they are hybridized
with other metaheuristics such as simulated annealing and tabu search [8]. Differ-
ent versions of multiobjective GAs have been successfully employed to solve many
MOO problems in science and engineering [7]. GA based MOO techniques have
also been used to solve TCT problem [12], [13]. More recently, Azaron et al. [14]
proposed models using genetic algorithms and the Pareto front approach to solve
nonlinear TCT problem in PERT networks. However, these models did not consider
the presence of a constrained resource and/or nonlinear time-cost relationships of
project activities.
where si(m) and sk(m) are the starting times of activities i and k, respectively, in mode m; S is the set of the succeeding activities of the ith activity; rti(m) is the resource requirement of the ith activity at processing time ti(m); and Osi(m) is the set of activities being performed at si(m).
Fig. 14.2 Artificial neural network architecture for nonlinear TCT problem
A. Structure of a Solution
A solution here is a string which basically represents an instance θ of the project
schedule (Figure 14.3); each element ti of an n-tuple string, T , can assume any value,
a natural number, from [CTi , NTi ] . As already mentioned, ci = fi (ti ) for ith activity,
and fi : [CTi, NTi] → R is a nonlinear map for nonlinear TCT. The associated tθ and cθ of each individual string are determined by computing the maximum path time and by summing up the corresponding cost of each activity, respectively.
B. Initial Population
The initial population consists of n p solutions, where (n p − 2) strings are selected
randomly from the feasible search space, i.e., each ti of a string is chosen randomly
from [CTi , NTi ] . The remaining two strings are formed such that for the first string
probu = fitu / ∑u fitu (14.2)

where fitu = fitness value of parent u; dmax = maximum du in the generation; du = minimal distance between the parent u and each of the segments v of the convex hull, du = min(duv, for all v); and probu = probability of selection of parent u.
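As one concrete reading of (14.2), the following sketch (our illustration, not the chapter's code) implements fitness-proportionate (roulette-wheel) selection of a parent:

import random

def select(parents, fits):
    # Draw a parent with probability prob_u = fit_u / sum(fit_u), as in (14.2).
    total = sum(fits)
    r = random.uniform(0.0, total)
    acc = 0.0
    for parent, fit in zip(parents, fits):
        acc += fit
        if acc >= r:
            return parent
    return parents[-1]   # guard against floating-point round-off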
14 Project Scheduling: Time-Cost Tradeoff Problems 333
E. Crossover
We consider one-point crossover, wherein a parent P1 produces a child by crossing over with another parent P2 selected randomly. A random integer q with 1 ≤ q ≤ n is chosen, where q represents the crossover site. The first q positions of the child are taken from the first q positions of P1, while the remaining (n − q) positions are defined by the last (n − q) positions of P2.
F. Mutation
The mutation operator modifies a randomly selected activity of a string with a prob-
ability mr ; that is (mr × |F|) strings will undergo mutation. The mutation operator
works on a given string in the following manner. Let the string be represented by
str(i), i = 1, . . . , n. A random number q, 1 ≤ q ≤ n is generated for the location of
gene to be mutated. Another random natural number r, r ∈ [CTq , NTq ], is generated
and str[q] is replaced by r.
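The two operators just described translate directly into code; the sketch below (ours, with illustrative names) assumes a string is a list of activity durations and that CT and NT are the lists of crash and normal times:

import random

def crossover(p1, p2):
    q = random.randint(1, len(p1))            # crossover site, 1 <= q <= n
    return p1[:q] + p2[q:]                    # first q genes from P1, the rest from P2

def mutate(s, CT, NT):
    q = random.randrange(len(s))              # location of the gene to be mutated
    child = list(s)
    child[q] = random.randint(CT[q], NT[q])   # new duration r from [CT_q, NT_q]
    return child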
G. Heuristic Procedure
The float value of an activity is defined as the amount of time by which the activity can be delayed without affecting the project deadline. Obviously the float value of a critical activity is zero. Each string, representing a unique network schedule, is tested beforehand against the resource constraint in this module, and the early start times of non-critical activities are modified if necessary, exploiting their float values.
The heuristic checks the resource requirement (RR) period by period for each string against the resource availability (RA) in the given project. If RR > RA in any time interval Δt of a network schedule, the start time of a non-critical activity falling in Δt is shifted period by period, exploiting its float value, so as to adjust RR within RA throughout the network schedule. Further, if more than one non-critical activity falls
in Δt, each one would be processed one by one (ties may be broken arbitrarily) in a similar manner as mentioned above till RR is adjusted within RA. If, even after shifting the corresponding non-critical activity (or activities), RR is not adjusted within RA for a string, the string is rejected altogether and hence does not participate in the evolutionary process of ANNHEGA. In case of rejection of strings due to violation of resource constraints, ANNHEGA keeps generating other strings and checking them against the resource constraints till the population size is met. That is how ANNHEGA maintains the population size in the evolutionary process.
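A simplified sketch of this feasibility heuristic follows (our reconstruction under stated assumptions; the Activity structure, the per-period demand model and the shifting policy are illustrative, not the chapter's code):

from dataclasses import dataclass

@dataclass
class Activity:
    start: int      # scheduled start period
    duration: int   # processing time in periods
    demand: int     # per-period resource requirement
    float_: int     # remaining float (0 for critical activities)

def repair(acts, R_A, horizon):
    # Shift non-critical activities within their float until R_R <= R_A in
    # every period; return None (reject the string) if that is impossible.
    def overload_period():
        for t in range(horizon):
            load = sum(a.demand for a in acts if a.start <= t < a.start + a.duration)
            if load > R_A:
                return t
        return None
    while (t := overload_period()) is not None:
        movable = [a for a in acts if a.float_ > 0 and a.start <= t < a.start + a.duration]
        if not movable:
            return None            # R_R cannot be adjusted within R_A: reject
        a = movable[0]             # ties broken arbitrarily
        a.start += 1               # shift one period, consuming float
        a.float_ -= 1
    return acts                    # feasible: R_R <= R_A throughout the schedule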
It is a well-known fact that, in general, the resource requirements of a project over all periods are never constant, even after applying the best resource leveling procedure. The proposed heuristic procedure makes use of this fact while fixing the upper limit of RA. This is detailed below. The peak resource requirement Rmax
based on activities’ normal time is computed for two extreme cases viz. (1) all the
non-critical activities are scheduled to start at their earliest start time (ES), and (2)
all the non-critical activities are scheduled to start at their latest start time (LS). The
averaging of peak resource requirements of these two cases is considered to be the
peak resource requirement of the project network:
Rmax = (Rmax of ES + Rmax of LS) / 2
The initial value of RA to run ANNHEGA is taken as equal to Rmax; this may be termed the upper limit of the constrained resource of the project. Now, in generating the TCT profile, the project duration is progressively crashed, and obviously more resources are required for each subsequent crashing. In order to decide the lowest possible limit of RA (below this limit project expediting is not possible), RA is successively reduced and ANNHEGA is run each time, until a point occurs when the project time starts increasing instead of decreasing in order to satisfy the constrained resource.
More formally, the ANNHEGA scheme can be summarized in pseudo code as follows. Let CHP be the set of children, I be the current generation number, and GEN be the maximum number of iterations.
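The original pseudo code is not reproduced here; the following sketch is our hedged reconstruction of the loop from the description above. All function arguments (random_string, resource_feasible, evaluate, select, crossover, mutate) are assumed helpers: resource_feasible stands for the heuristic module of Subsection G, evaluate is taken to use the trained ANNs to obtain each activity's cost ci = fi(ti) and the fitness of (14.2), and mutate is assumed to encapsulate the mutation probability mr.

def annhega(n_p, GEN, random_string, resource_feasible, evaluate, select, crossover, mutate):
    P = []
    while len(P) < n_p:                    # build a resource-feasible initial population
        s = random_string()
        if resource_feasible(s):
            P.append(s)
    for I in range(GEN):                   # I: current generation number
        fits = evaluate(P)                 # ANN-assisted time/cost and fitness values
        CHP = []                           # CHP: the set of children
        while len(CHP) < n_p:              # regenerate on rejection to keep the size
            child = mutate(crossover(select(P, fits), select(P, fits)))
            if resource_feasible(child):
                CHP.append(child)
        P = CHP
    return P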
In real-life projects the duration and cost of each activity could change dynamically as a result of many uncertain variables, such as management experience (ME), labor skill (LS), weather conditions (WC), etc. Project managers must take these uncer-
tainties into account and provide an optimal balance of time and cost based on their
own experience and knowledge. The uncertainty features can be well represented by
the fuzzy set concepts. Time analysis of a project under uncertainties has been stud-
ied using fuzzy set theoretic approach [22]. Daisy and Thomas [23] applied fuzzy
set theory to model the managers’ behavior in predicting project network parame-
ters within an activity. Leu et al. [24] used fuzzy set theory to model the variations
in the duration of activities due to changing environmental factors. Other types of
uncertainties such as budget uncertainty have also been incorporated into project
time-cost tradeoff [25]. Existing methods for sensitivity analysis of TCT profiles
with regard to project uncertainties ignore the cost parameter of project activities
[26], and do not include provision for nonlinear time-cost relationship of project
activities. To comprise these problems we devised and executed a novel method – it
examines the effects of project uncertainties on both, the duration as well as the cost
of the activities, and incorporates nonlinear time-cost relationship of project activ-
ities. The method integrates three key fields of computational intelligence – Fuzzy
Logic, ANNs and multiobjective Genetic Algorithm – the method is referred to as
Integrated Fuzzy-ANN-GA (IFAG).
A rule-based fuzzy logic framework is developed which derives the changes in the duration and the cost of each activity for the given uncertainties; ANNs are then trained with these time-cost data (one ANN per activity) to model the time-cost relationships. It has already been shown in Section 2 that the integration of ANNs with GA facilitates the evaluation of the fitness function of the GA. The GA is employed to search for the Pareto-optimal front for a given set of time-cost pairs of each project activity. That is how the integration of the fuzzy logic framework and ANNs with GA is implemented to capture the responsiveness of the nonlinear TCT profile with respect to project uncertainties. A test case of the TCT problem is solved using IFAG. Fuzzy sets and fuzzy inference systems are briefly described below.
A. Fuzzy Sets
Fuzzy set theory is an efficient tool for modelling uncertainties associated with vagueness, imprecision, and/or lack of information regarding variables of the decision space. The underlying power of fuzzy set theory is that it uses linguistic variables,
Fuzzy inference is the process of formulating the mapping from a given input to
an output using fuzzy logic. The mapping then provides a basis from which deci-
sions can be made. The process of fuzzy inference involves membership functions,
fuzzy logic operators, and if-then rules. Fuzzy inference systems (FIS) have been
successfully applied in fields such as automatic control, data classification, decision
analysis, expert systems and computer vision. We have employed a Mamdani-type fuzzy inference system using MATLAB's fuzzy logic toolbox. Mamdani's fuzzy
inference method [27] is the most commonly seen fuzzy methodology. Mamdani’s
effort was based on Lotfi Zadeh’s work on fuzzy algorithms for complex systems
and decision processes [28].
The FIS to capture the effect of the linguistic variables on activity duration and cost is designed with three input variables (ME, LS and WC) and two output variables (activity duration and activity cost). Triangular membership functions are used to model the linguistic variables, input as well as output. The FIS editor interfaces the inputs and outputs.
The linguistic variables, namely ME, LS, and WC, are each modeled using five membership functions, such as the ones shown in Figure 14.9 for weather condition. The linguistic variables are defined in the range 0–1.
The output variables (activity duration and activity cost) are modeled by 7 membership functions (Figure 14.10) over the universe of discourse (UOD). The range of the UOD for activity duration has been assumed to be from (D − 0.2 × D) to (D + 0.2 × D), where D represents an initial estimate of the activity duration by the project experts. Similarly, the range of the UOD for activity cost has been assumed to be from (C − 0.2 × C) to (C + 0.2 × C), where C represents an initial estimate of the activity cost by the project experts.
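For concreteness, a triangular membership function of the kind used here can be sketched as follows (our illustration; the placement of the five weather-condition sets over [0, 1] is an assumption, not taken from the chapter):

def trimf(x, a, m, b):
    # Degree of membership of x in a triangle with feet a, b and peak m.
    if x < a or x > b:
        return 0.0
    if x == m:
        return 1.0
    return (x - a) / (m - a) if x < m else (b - x) / (b - m)

# Assumed placement of the five fuzzy sets for weather condition over [0, 1]:
weather_sets = {
    "VeryBad":  (0.0, 0.0, 0.25),
    "Bad":      (0.0, 0.25, 0.5),
    "Medium":   (0.25, 0.5, 0.75),
    "Good":     (0.5, 0.75, 1.0),
    "VeryGood": (0.75, 1.0, 1.0),
}
print({name: round(trimf(0.6, *p), 2) for name, p in weather_sets.items()})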
Fig. 14.6 Initial population (project cost, ×10^5, versus project time)
Fig. 14.7 Intermediate improvements in the tradeoff points and convex hull
Fig. 14.8 Tradeoff points and convex hull of the final population
Fig. 14.9 Membership curves for weather condition (fuzzy sets: VeryBad, Bad, Medium,
Good, & VeryGood)
Fig. 14.10 Membership curves for activity duration (fuzzy sets: VerySmall, Small, Small-
Medium, Medium, LongMedium, Long, VeryLong)
Project Time 169   162   159   152   133    121    116    108    104
Project Cost 98040 98370 98520 99630 101670 104360 107120 122540 137700

Project Time 185    152    146    145    138    117    104    103    102
Project Cost 110650 111450 112350 112630 114060 118610 154250 158750 166440
Fig. 14.11 TCT profiles for the scenarios (ME, LS, WC) = (0.5, 0.5, 0.5), (0.9, 0.9, 0.9), (0.8, 0.6, 0.9), (0.3, 0.3, 0.3) and (0.2, 0.3, 0.4)
The responsiveness of the TCT profile for scenarios such as (ME = 0.2, LS = 0.3, WC = 0.4) and (ME = 0.8, LS = 0.6, WC = 0.9) behaves consistently (Figure 14.11).
Further, the values of the linguistic variables are taken in different ways, i.e., some have values above the normal conditions and some below the normal conditions. The TCT profile under normal conditions (i.e. ME, LS, and WC as 0.5, 0.5, and 0.5 respectively) is shown in Figure 14.12 for a run different from the earlier one. For (ME, LS, WC) as (0.8, 0.4, 0.7), the project duration and cost are obtained as shown in Table 14.6 / Figure 14.12. Table 14.7 depicts the case when (ME = 0.3, LS = 0.8, and WC = 0.3). The results are shown in Figure 14.12.
Table 14.6:
Project Time 154   145    137    129    119    110    106    105    103
Project Cost 99720 100410 101970 103880 108190 126740 143980 148630 158120

Table 14.7:
Project Time 187    178    176    159    157    122    119    117    115
Project Cost 108280 109100 109280 112440 113280 117640 131100 141300 148830
Fig. 14.12 TCT profiles for (ME, LS, WC) = (0.5, 0.5, 0.5), (0.8, 0.4, 0.7) and (0.3, 0.8, 0.2)
A. Structure of a Solution
A solution here is a string which represents an instance θ of the project schedule (Figure 14.3); each element ti of an n-tuple string, T, can assume any value from the set {tij}, j = 1, . . . , pi. The associated cθ and tθ of each individual string are determined in the usual manner.
B. Initial Population
The initial population, consisting of np solutions, is generated by randomly selecting (np − 2) individual strings from the feasible search space, i.e., each ti of a string is chosen randomly from the set {tij}, j = 1, . . . , pi. The remaining two strings, Tmax and Tmin, are added non-randomly, such that all activities have the durations tmaxi = max∀j{tij} and tmini = min∀j{tij}, respectively.
See Subsections 2.2C and 2.2E for definitions of TCT profile/convex hull and
crossover respectively.
C. Distance Measurement
The distance dw of an individual solution point in a population is determined by
calculating the minimal Euclidean distance (dwv ) between the wth solution point
and each of the segment v of the convex hull, i.e., dw = min∀v (dwv ) (Figure 14.4).
The solutions with a lower value of the distance are considered to be fitter than those with a larger value.
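The following sketch (ours, for illustration) computes this minimal Euclidean distance from a solution point to the segments of the convex hull:

import math

def dist_point_segment(p, a, b):
    # Euclidean distance d_wv from point p to the segment with endpoints a, b.
    ax, ay = b[0] - a[0], b[1] - a[1]
    px, py = p[0] - a[0], p[1] - a[1]
    denom = ax * ax + ay * ay
    t = 0.0 if denom == 0.0 else max(0.0, min(1.0, (px * ax + py * ay) / denom))
    cx, cy = a[0] + t * ax, a[1] + t * ay   # closest point on the segment
    return math.hypot(p[0] - cx, p[1] - cy)

def d_w(p, hull_segments):
    # d_w = min over all segments v of d_wv, as described above.
    return min(dist_point_segment(p, a, b) for a, b in hull_segments)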
D. Mutation
The mutation operator modifies a randomly selected activity of a string with a prob-
ability mr ; that is (mr × |F|) strings will undergo mutation. The mutation operator
works on a given string in the following manner. For a string T represented by ti, i = 1, . . . , n, a random number q, 1 ≤ q ≤ n, is generated for the location of the gene to be mutated. Another random value t′q ∈ {tqj}, with tminq ≤ t′q ≤ tmaxq and t′q ≠ tq, is generated, and t′q replaces tq.
E. Simulated Annealing
Simulated annealing (SA) is a popular search technique which imitates the cooling
process of material in a heat bath. SA as stochastic optimization was introduced in
the context of minimization problems by Kirkpatrick et al. [30]. It is a global optimization method that distinguishes between different local optima. Starting from an initial configuration, the SA algorithm generates at random new configurations from the neighborhood of the original configuration. If the change results in a better configuration, it is accepted; otherwise, the transition to the new configuration is accepted with a Boltzmann probability factor.
Temperature (temp) and provides a mechanism for accepting a bad move. In the
initial iterations (temp = temp0 ) this probability is high (almost one) and when the
temperature is subsequently lowered using a cooling ratio (cool r) it comes down
to almost zero in the final stage of iterations (temp = temp f ).
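A minimal sketch of this acceptance rule and cooling schedule follows (ours; the numeric values of temp0, temp_f and cool_r are illustrative assumptions):

import math, random

def accept(delta_cost, temp):
    # Improving moves are always accepted; worsening moves are accepted with
    # the Boltzmann factor exp(-delta_cost / temp), enabling 'bad' moves.
    if delta_cost <= 0:
        return True
    return random.random() < math.exp(-delta_cost / temp)

temp, temp_f, cool_r = 10.0, 0.01, 0.95   # temp0, final temperature, cooling ratio
while temp > temp_f:
    # ... generate a neighbor, compute delta_cost, call accept(delta_cost, temp)
    temp *= cool_r                        # lower the temperature each stage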
The Pareto front (or TCT profile) of the generation, i.e., all the parents together
with their children, is determined, which represents the nondominated set of a gen-
eration. Thereafter, a convex hull that encloses all members of the population from below is drawn. The basic idea is that the smaller the distance of an individual within a generation from the convex hull, the better is its fitness with respect to either or both of the objectives (Figure 14.4).
For each family u, its members on the Pareto front are counted (par num(u), u =
1, . . . , n p ). These par num(u) members become the parents for the uth family. How-
ever, it is important to note that if for a family u, no member appears on the Pareto
front, the family is not rejected altogether; in the hope of its improvement in the future,
a member of that family which is ’nearest’ to the Pareto front is selected to be the
parent for the next generation. This ’nearness’ is measured by a distance function
described in Section 4. The importance of the number par num(u) is twofold. Firstly
it determines the parents for the next generation chosen from each family. This is
how elitism is incorporated in the algorithm, which helps it in converging closer to
Pareto-optimal front. Elites of a current population are given an opportunity to be
directly carried over to the next generation. Therefore, a ’good’ solution found in
a current generation will never be lost unless a better solution is discovered. The
absence of elitism does not guarantee this feature. Importantly the presence of elites
enhances the probability of creating better offspring. Secondly it helps in keeping a
’good’ distribution of solutions over the Pareto front.
The next step is to decide the number of children, child num(u), u = 1, . . . , n p
allocated to each family in the next generation. This number actually provides the
information of how good the region is. To accomplish this, a distance measure as
defined in Section 4 is used, which measures the nearness of each member of a
family to the Pareto front. To find child num(u), the process of simulated anneal-
ing has been incorporated into the selection process as mentioned in the procedure
find num() given in Subsection 4.1.1. It first counts the members of a family which satisfy the Boltzmann criterion. Clearly the number child num(u) is proportional
to the number of members of the uth family which are closer to the convex-hull.
Further, par num(u) also plays a direct role in measuring the fitness of each fam-
ily u, that is, number of children to be produced in the next generation by family u
is determined by par num(u) plus the number of family members who qualify the
Boltzmann criterion. This is obvious as these par num(u) members are on the TCT
profile.
The next step is the generation of children by each family. As mentioned earlier,
initially each family has a single parent, but in subsequent generations the number
of parents per family may be more than one (as par num(u) ≥ 1 for the families
whose members are on the Pareto front). In such a case, the number child num(u) is
almost equally divided among these par num(u) parents for producing the children.
The method of producing children by any parent is same as explained for the initial
generation. Now mutation is applied on randomly selected strings of the population
and the temperature is cooled down. The process is repeated until no improvement
is observed in the TCT profile for a specified number of generations. The algorithm
is able to search for the best family in the evolution process.
np                              20     40     60    80     100
Average time (sec) for 10 runs  20.95  17.85  9.33  17.44  18.74
The search is set to terminate when the TCT profile does not change in five consecutive iterations (found to be a good enough number). We use these parameters for all the experiments with HMH on the test problems.
An initial generation of n p strings is randomly selected and nc children are pro-
duced. Results of a typical run of HMH for this test problem follow. It can be seen
(Figure 14.13) that the initial generation is well distributed over the solution space.
Figure 14.14 illustrates the intermediate improvements. In succeeding iterations
HMH searches for the optimal TCT profile. Figure 14.15 depicts the tradeoff points of the final generation population. Since the tradeoff points do not improve further, they are concluded to be the best points obtained. It takes on average 6 iterations for HMH to find the best possible TCT profile for this test problem.
Interestingly, HMH is quite efficient, as it finds a Pareto-optimal front after examining an extremely small fraction of the possible solutions. For the project network of Figure 14.5, the total number of possible schedules is 4.72 × 10^9, whereas HMH (on an average over 50 runs) examined only 3600 (180 × 20) possible schedules to converge to the best possible TCT profile, an extremely small fraction (0.00007627%) of the solution space. The results of the TCT profile of the final generation obtained by HMH are compared with analytical results obtained from exhaustive enumeration. HMH performs very well in terms of accuracy, as it is able to find 95% of the optimal solutions on the TCT profile (on an average over 50 runs of HMH). Further, on visually comparing the HMH results with the GA-based MOO results [12] for the same test problem (Figure 14.5), HMH turns out to be better in terms of both degree of convergence to the true Pareto front and diversity of solutions.
Fig. 14.13 Initial population with tradeoff points and convex hull (Project Time vs. Project Cost, ×10^5)
Fig. 14.14 Population with tradeoff points and convex hull in an intermediate generation (Project Time vs. Project Cost, ×10^5)
Fig. 14.15 Tradeoff points and convex hull of final generation population (Project Time vs. Project Cost, ×10^5)
Fig. 14.17 Initial population (Project Time vs. Project Cost)
Fig. 14.18 Tradeoff points and convex hull of final generation population (Project Time vs. Project Cost)
The problem has 30 variables, which lie in the range [0, 1]. It has a convex Pareto-optimal region corresponding to 0 ≤ x*_1 ≤ 1 and x*_i = 0 for i = 2, 3, ..., 30. In this problem, the Pareto-optimal front is formed with g(x) = 1. Figure 14.21 shows the first-generation tradeoff points and the convex hull along with the population. The final-generation tradeoff points and convex hull are shown in Figure 14.22; importantly, the obtained non-dominated front matches fairly well with the known Pareto-optimal front. Further, it has a good distribution of non-dominated solutions across the front. HMH is, therefore, efficient and accurate in tackling a large number of decision variables.
Fig. 14.19 First generation tradeoff points (NDS) and convex hull (f1 vs. f2)
Fig. 14.20 Final generation tradeoff points and convex hull (f1 vs. f2)
HMH has performed extremely well on the above standard test problems. The non-dominated solutions have converged very close to the known Pareto-optimal front, and visual inspection shows that the obtained non-dominated solutions maintain good diversity. All the test problems presented in this work were run on an HP machine with an Intel(R) Pentium(R) 4 CPU (3.2 GHz) and 1 GB RAM. The procedures are coded in MATLAB 7.0 and were tested under Microsoft Windows XP Professional version 2002.
Fig. 14.21 First generation tradeoff points (NDS) and convex hull (f1 vs. f2)
14.5 Conclusions
ANNHEGA amalgamates ANN models and a heuristic technique with a GA in a unique way to solve the resource-constrained nonlinear TCT problem, and becomes a powerful multiobjective optimization method without losing its simplicity. The method succeeds in making TCT analysis more realistic by adding two important dimensions to it. First, any existing arbitrarily shaped time-cost relationship can be dealt with using its ANN module. Second, the heuristic module takes care of a constrained resource in the TCT analysis. The feasibility of ANNHEGA is shown through an illustrative test case. An additional outcome of this work is that it delivers the lowest limit of the constrained resource beyond which project expediting is not feasible; this information is important for the schedule planner. An ANNHEGA-based system can help to monitor and control the project in the most cost-effective way in real time, and one can choose the best alternative over the RCNTCT profile to execute the projects. There are interesting future extensions of this work: more than one constrained resource can be incorporated in the system, and other precedence relationships may be considered.
IFAG is presented to carry out the sensitivity analysis of nonlinear TCT profiles with respect to real-life project uncertainties. The fuzzy logic framework facilitates (1) the representation of imprecise activity durations as well as activity costs; (2) the estimation of a new time-cost pair for each activity based on the input uncertainties; and (3) the interpretation of the fuzzy results in crisp form. A case study is solved using IFAG to demonstrate its working. The method provides a comprehensive tool to project managers for analyzing their time-cost optimization decisions in a more flexible and realistic manner. In future we intend to investigate the responsiveness of the RCNTCT profile to project uncertainties.
HMH is a new MOO method that combines a genetic algorithm with simulated annealing to solve the TCT problem, incorporating the concept of Pareto optimality to evolve a family of nondominated solutions distributed well along the TCT profile. Two case studies of discrete TCT are solved using HMH to illustrate its performance. HMH can discover near-optimal solutions after examining an extremely small fraction of the possible solutions. HMH is also tested on two standard MOO test problems to validate its performance. HMH suits our problems well; however, from an algorithmic viewpoint we intend, as part of future work, to (1) incorporate a mechanism to preserve diversity in the algorithm, (2) compare it with standard MOEAs such as NSGA-II, SPEA2 and PAES, using metrics to evaluate diversity and convergence properties, and (3) enhance it to incorporate more than two objectives. An obvious future extension of our work is to employ HMH in place of the GA in ANNHEGA for solving RCNTCT. Similarly, the sensitivity analysis of TCT profiles can be investigated using HMH along with fuzzy logic and ANNs. HMH may be further explored for solving other complex MOO problems.
References
1. De, P., Dunne, E.J., Ghosh, J.B., Wells, C.E.: The discrete time-cost tradeoff problem
revisited. European Journal of Operational Research 81, 225–238 (1995)
2. De, P., Dunne, E.J., Ghosh, J.B., Wells, C.E.: Complexity of the discrete time/cost trade-
off problem for project networks. Operations Research 45, 302–306 (1997)
3. Deckro, R.F., Hebert, J.E., Verdini, W.A., Grimsrud, P.H., Venkateshwar, S.: Nonlinear time/cost tradeoff models in project management. Computers & Industrial Engineering 28(2), 219–229 (1995)
4. Vanhoucke, M.: New computational results for the discrete time/cost trade-off problem
with time-switch constraints. European Journal of Operational Research 165, 359–374
(2005)
5. Vanhoucke, M., Debels, D.: The discrete time/cost trade-off problem: extensions and
heuristic procedures. Journal of Scheduling 10(4-5), 311–326 (2007)
6. Ehrgott, M., Gandibleux, X.: A survey and annotated bibliography of multiobjective
combinatorial optimization. OR Spektrum 22, 425–460 (2000)
7. Coello, C.A.C.: An updated survey of GA-based multiobjective optimization techniques.
ACM Computing Surveys 32(2), 109–142 (2000)
8. Dimopoulos, C., Zalzala, A.M.S.: Recent developments in evolutionary computation for
manufacturing optimization: problems, solutions and comparisons. IEEE Transactions
on Evolutionary Computation 4, 93–113 (2000)
9. Holland, J.H.: Adaptation in natural selection and artificial systems. Univ. of Michigan
Press, Ann Arbor (1975)
10. Goldberg, D.E.: Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley, Reading (1989)
11. Deb, K.: Multi-objective optimization using evolutionary algorithms. Wiley, Chichester
(2001)
12. Feng, C.W., Liu, L., Burns, A.: Using genetic algorithms to solve construction time-cost
trade-off problems. Journal of Computer in Civil Engineering 11, 184–189 (1997)
13. Leu, S.S., Yang, C.H.: GA-based multicriteria optimal model for construction schedul-
ing. Journal of Construction Engineering and Management 125(6), 420–427 (1999)
14. Azaron, A., Perkgoz, C., Sakawa, M.: A genetic algorithm approach for the time cost
trade-off in PERT networks. Applied Mathematics and Computation 168, 1317–1339
(2005)
15. Pathak, B.K., Singh, H.K., Srivastava, S.: Multi-resource-constrained discrete time-cost
tradeoff with MOGA based hybrid method. In: Proc. 2007 IEEE Congress on Evolution-
ary Computation, pp. 4425–4432 (2007)
16. Demeulemeester, E., Herroelen, W.: Project scheduling – A research handbook. Kluwer
Academic Publishers, Boston (2002)
17. Ozdamar, L., Ulusoy, G.: A survey on the resource-constrained project scheduling problem. IIE Transactions 27, 574–586 (1995)
18. Kolish, R., Hartmann, S.: Experimental investigation of heuristics for resource-
constrained project scheduling: An update. European Journal of Operational Re-
search 174, 23–37 (2006)
19. Yang, B., Geunes, J., O’Brien, W.J.: Resource-Constrained Project Scheduling: Past
Work and New Directions. Research Report, Department of Industrial and Systems En-
gineering, University of Florida, Gainesville, FL (2001)
20. Erenguc, S.S., Ahn, T.D., Conway, G.: The resource constrained project scheduling prob-
lem with multiple crashable modes: An exact solution method. Naval Research Logis-
tics 48(2), 107–127 (2001)
21. Arias, M.V., Coello, C.A.C.: Asymptotic convergence of metaheuristics for multiobjective optimization problems. Soft Computing 10, 1001–1005 (2005)
22. Mares, M.: Network analysis of fuzzy set methodology in industrial engineering. In:
Evans, G., Karwowski, W., Wilhelm, M.R. (eds.), pp. 115–125. Elsevier Science Pub-
lishers, B. V., Amsterdam (1989)
23. Zheng, D.X.M., Ng, S.T.: Stochastic time-cost optimization model incorporating fuzzy sets theory and nonreplaceable front. Journal of Construction Engineering and Management 131(2), 176–186 (2005)
24. Leu, S.S., Chen, A.T., Yang, C.H.: A GA-based fuzzy optimal model for construction
time-cost trade-off. International Journal of Project Management 19, 47–58 (2001)
25. Yang, T.: Impact of budget uncertainty on project time-cost tradeoff. IEEE Transactions
on Engineering Management 52(2), 167–174 (2005)
26. Pathak, B.K., Srivastava, S.: MOGA-based time-cost tradeoffs: responsiveness for
project uncertainties. In: Proc. 2007 IEEE Congress on Evolutionary Computation, pp.
3085–3092 (2007)
27. Mamdani, E.H.: Application of fuzzy logic to approximate reasoning using linguistic
synthesis. IEEE Transactions on Computers 26(12), 1182–1191 (1977)
28. Zadeh, L.A.: Outline of a new approach to the analysis of a complex system and decision
processes. IEEE Transactions on Systems, Man and Cybernetics SMC-3, 28–44 (1973)
29. Yip, P., Pao, Y.H.: Combinatorial optimization with use of guided evolutionary simulated
annealing. IEEE Transactions on Neural Networks 6(2), 290–295 (1995)
30. Kirkpatrick, S., Gelatt, C.D., Vecchi, M.P.: Optimization by simulated annealing. Science 220(4598), 671–680 (1983)
Chapter 15
Systolic VLSI and FPGA Realization of
Artificial Neural Networks
15.1 Introduction
Over the years, the artificial neural network (ANN) has not only become more and more popular due to its adaptive and non-linear capabilities, but has also been established as a potential intelligent tool in every imaginable area of technology for solving ill-posed problems where conventional techniques fail to be effective. The ANN algorithms are, however, computation-intensive, and their computational complexity increases with the number of inputs in a training pattern, the number of layers of neurons in the case of a multilayer network, and the number of neurons in the different layers. Apart from that, the ANN algorithms for the training phase are iterative by nature, and require the execution of several iterations to train the network.
General-purpose computers based on the sequential von Neumann architecture are found to be slow at implementing the iterative training, particularly when the network consists of a large number of neurons and multiple hidden layers. On the other hand, the ANN algorithms are inherently parallel, and each layer of a multilayer network could easily be implemented by a separate pipeline stage. Attempts have, therefore, been made to exploit these features of ANN algorithms to implement them on single instruction-stream multiple data-stream (SIMD) machines and array processors [1, 2]. The SIMD configuration has been considered a good choice for the implementation of these algorithms, as it provides a large number of processing cells using a shared controller with minimal programming and a low burden on the operating system. The real-time and embedded systems, however, impose stringent
limitations on the cost, size, power-consumption, throughput rate and computational
latency of the neural algorithms. To fit into the embedding environment, the size of the computing structure very often must be small, while at the same time meeting the speed requirements of time-critical and hard-real-time applications. Although general-purpose computers can execute the ANN algorithms of a small-sized network in software, it is essential to realize these algorithms on dedicated VLSI or field programmable gate array (FPGA) devices to meet the cost, size and timing requirements of embedded and real-time applications. The conventional general-purpose machines and SIMD machines fall far short of matching the requirements and specifications of many such application environments. Several attempts have therefore been made in the last two decades at the realization of ANNs in analog as well as digital VLSI. There are two kinds of approaches to hardware implementation of ANN algorithms, namely the direct-design approach and the indirect-design approach [3, 4]. In the direct-design approach, the neural algorithms are directly mapped into dedicated hardware, while the indirect approach makes use of the matrix-processing behaviour of the neural models.
The ANN algorithms are found to be well suited for systolic implementation due to their repetitive and recursive behaviour. Several variations of one-dimensional and two-dimensional systolic arrays have, therefore, been reported for the implementation of ANNs.
(Figure: generic FPGA architecture comprising configurable logic blocks, I/O blocks and programmable interconnects)
4. Analog processors offer only limited precision (usually 8-bit), because the chip area increases with the precision.
By contrast, digital designs offer better tolerance of variations in power supply, ambient temperature, noise and crosstalk than analog processors. Besides, they provide accurate storage of weights in digital memory cells, which facilitates the updating of weights during learning. Compared with analog circuits, digital designs also allow a flexible choice of word-length depending on the precision requirement. Along with these, digital designs offer fast turn-around owing to the availability of advanced CAD software and better support for silicon fabrication. Digital implementations, on the other hand, are relatively slow and involve more chip area than their analog counterparts.
W_ij is the weight value for the connection running from the j-th neuron to the i-th neuron, and θ_i(k) is the bias value associated with the i-th neuron. {W_ij} constitutes the weight matrix W of size N × N, where N is the number of neurons in the network. f is a non-linear function, usually called the threshold function or the activation function, which may be a sigmoid function, a step function, a squashing function or a stochastic function. In the learning phase, the neurons adaptively update their weights by either supervised or unsupervised learning. In this article, we focus on the VLSI implementation of neural networks based on the most basic supervised learning rule [5], namely the Widrow-Hoff rule (also popularly known as the delta rule), where the weights of the neurons are updated in every iteration of learning according to
W_ij(k + 1) = W_ij(k) + η · (d_i − x_i(k)) · x_j(k)   (15.3)

where d_i is the desired response of the i-th neuron and η is the learning rate. The output y_i of the i-th neuron is computed as

y_i = ∑_{j=1}^{N} W_ij · x_j + θ_i.   (15.4)
Fig. 15.4 DG for the computation of matrix-vector product of equation (15.4). (a) The DG. (b) Function of each node of the DG
Fig. 15.5 The linear systolic array for the computation of matrix-vector product. (a) The systolic array. (b) Function of each PE of the structure
The appropriate weight values are fed to the i-th PE in each clock cycle. The elements of the input vector x are fed to the N different PEs as shown in Fig. 15.5(a), such that the input to a PE is staggered by one clock cycle relative to the adjacent PE on its left. The elements of the input vector x (once loaded into the PEs) stay in their respective PEs throughout the computation, while the computed output of each PE is transferred to its neighbouring PE on the right. The first output value of the array is obtained after N cycles from the right-most PE, while the remaining N − 1 output values are obtained in the next N − 1 cycles, where the duration of a clock period is T = T_M + T_A, with T_M and T_A being, respectively, the times required to perform one multiplication and one addition in a PE. The latency of the structure is thus N cycles, with a throughput of one output per cycle. The structure has all the advantages of systolic design, but the outputs must be demultiplexed and stored in separate registers to be used in the next iteration. For a large ANN, the time required for demultiplexing is large, and the demultiplexer involves considerably high area complexity and requires many additional interconnections.
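The staggered schedule is easiest to see in a small cycle-by-cycle simulation. The Python sketch below mimics the array of Fig. 15.5 for y = Wx + θ; the tuple-based bookkeeping and the injection schedule are illustrative assumptions consistent with the description above, not a register-level model.

import numpy as np

def linear_systolic_mv(W, x, theta):
    N = len(x)
    pes = [None] * N                      # (row, partial sum) held by each PE
    outputs = []
    for cycle in range(2 * N):
        if pes[-1] is not None:           # right-most PE emits one result per cycle
            outputs.append(pes[-1][1])
        # Shift partial sums one PE to the right; the partial sum for
        # row i enters at the left as theta_i on cycle i.
        entering = (cycle, theta[cycle]) if cycle < N else None
        pes = [entering] + pes[:-1]
        # Each PE performs one multiply-accumulate with its resident x_j
        # per clock period T = T_M + T_A.
        pes = [None if pe is None else (pe[0], pe[1] + W[pe[0], j] * x[j])
               for j, pe in enumerate(pes)]
    return np.array(outputs)

For a random instance (with W a NumPy array), np.allclose(linear_systolic_mv(W, x, theta), W @ x + theta) holds, with the first output appearing after N cycles and the remaining N − 1 following one per cycle, as stated above.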
To avoid this difficulty of the pure systolic realization, we discuss here a semi-systolic implementation of the matrix-vector product of (15.4). For the semi-systolic realization, the dependence graph of Fig. 15.4 can be modified to the form shown in Fig. 15.6, where the nodes are flipped about the diagonal (so that the weight values appear in transposed form) and the i-th column of the resulting DG is circularly shifted up by (i − 1) places. As in the case of the original DG of Fig. 15.4, the modified DG also consists of N² nodes arranged in N rows and N columns. In this case also, the input values are loaded to the nodes on the first row of the DG, but they move diagonally down to the adjacent nodes on the left of the lower row, while the outputs computed by the nodes of each row are transferred vertically down to the adjacent nodes.
Fig. 15.6 The modified DG for semi-systolic computation of the matrix-vector product of (15.4). (a) The dependence graph. (b) Function of each node of the DG. Note: N′ = N − 1
Fig. 15.7 A semi-systolic array for the computation of equation (15.4). (a) The semi-systolic array. (b) Function of each PE of the structure. Note: N′ = N − 1
In this semi-systolic array, the partial results do not move, and get accumulated in the respective PEs. The sum of products computed in each PE is finally released simultaneously after N cycles as output. The semi-systolic structure also has a latency of N cycles, and the duration of each clock cycle equals the time required to perform one multiply-accumulate operation, T = T_M + T_A, as in the case of the pure systolic structure of Fig. 15.5. Unlike the pure systolic array, since all the outputs in this case are obtained after N cycles from the N PEs of the structure, the output of each PE can be reused as input in the same PE for the next iteration.
Fig. 15.8 Modified DG for the computation of (15.4). (a) The DG. (b) Function of each node. (c) Function of each PE
In the DG of Fig. 15.8, the input values remain in their respective nodes while the computed results move diagonally down to the adjacent nodes on the next lower row, one column to the left. The output computed by the leftmost node of a row of the DG is transferred to the rightmost node of the next row. The DG of Fig. 15.8 can be projected vertically to obtain an array similar to the semi-systolic array of Fig. 15.7, where the function of the PEs is depicted in Fig. 15.8(c).
The vector u(k) = {u_i(k), 1 ≤ i ≤ N} represents the states of all the N neurons at the k-th iteration, and θ = {θ_i, 1 ≤ i ≤ N} is the bias vector.
The systolic and semi-systolic architectures for the matrix-vector product discussed in the last section can be utilized for the VLSI realization of Hopfield nets as shown in Fig. 15.9. The structure consists of N PEs, where N is the length of the input activation. Each of these PEs consists of two sub-cells, PE-1 and PE-2, and a circular-shift register. The set of circular-shift registers R_i, for 1 ≤ i ≤ N, is used to feed the appropriate weight values to the PEs as shown in Fig. 15.9. The function of PE-1 is the same as that of the PEs of the semi-systolic structure of Fig. 15.7. The function of PE-2 is described in Fig. 15.9(b); it performs the desired non-linear function given by equation (15.6). Several techniques have been reported for the efficient computation of the tanh function performed by these cells, which can be implemented in many different ways [31]-[34]. For a low-complexity implementation of this function one may use a CORDIC circuit or a look-up table consisting of 2^L words, where L is the word-length [31].
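As a software analogue of the look-up-table option, the sketch below builds a 2^L-entry tanh table and applies the PE-2 output mapping [1 + tanh(X)]/2. The input range and the uniform nearest-entry quantization are illustrative choices, not prescribed by the text.

import numpy as np

def make_tanh_lut(L=8, x_range=4.0):
    # 2**L table entries over [-x_range, x_range]; L is the word-length.
    xs = np.linspace(-x_range, x_range, 2 ** L)
    return xs, np.tanh(xs)

def lut_activation(x, xs, table):
    # Nearest-entry lookup, then map to (0, 1) as PE-2 does:
    # X_out = [1 + tanh(X)] / 2.
    idx = np.clip(np.searchsorted(xs, x), 0, len(xs) - 1)
    return (1.0 + table[idx]) / 2.0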
Fig. 15.9 The linear array architecture for implementation of the search phase of the Hopfield net. (a) The array architecture: R_i of the i-th PE is a circular-shift register that contains the i-th column of the weight matrix W, as shown in Fig. 15.5. (b) Function of the non-linear processing cell PE-2: X ← η · (X_in + θ_in); X ← (A_in + X)/u_0; X_out ← [1 + tanh(X)]/2, where u_0 is the initial activation value stored in the PE and η is a constant

In the search phase, each PE may be treated as a neuron whose synaptic weights are given by the weight vector W_j = {W_jj, W_j(j+1), ..., W_jN, W_j1, W_j2, ..., W_j(j−2), W_j(j−1)} held in the shift-register R_j of the j-th PE (for 1 ≤ j ≤ N). During the (k + 1)-th iteration of the search phase, the activation output x_i(k) of the i-th PE (for 1 ≤ i ≤ N) from the k-th iteration is reloaded to the same PE, and then moves
across the array from one PE to its adjacent PE in every computational cycle, such that each input activation visits each PE once every N cycles. When x_j(k) arrives at the i-th PE, it is multiplied by W_ij, and the product is accumulated in the same PE. N such product values are accumulated in N consecutive cycles, and the accumulated sum is then transferred to the non-linear processing cell PE-2. The output activation x_i(k + 1) obtained from the non-linear processing cell of the i-th PE is reloaded to itself for the processing of the next iteration. The iterative process continues until convergence is reached; once convergence is reached after a certain number of iterations, the learning phase starts for the adjustment of weights.
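Functionally, one search-phase iteration reduces to a matrix-vector accumulation followed by the PE-2 non-linearity. The behavioural sketch below (not a cycle-accurate model of the array) assumes the PE-2 steps quoted in the caption of Fig. 15.9; the convergence test is an illustrative choice.

import numpy as np

def hopfield_search(W, theta, x0, eta=0.5, u0=1.0, max_iter=100, tol=1e-6):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        a = W @ x                        # N MAC cycles per PE-1
        # PE-2: X = (A_in + eta*(X_in + theta_in)) / u0;
        #       X_out = (1 + tanh(X)) / 2
        x_new = (1.0 + np.tanh((a + eta * (x + theta)) / u0)) / 2.0
        if np.max(np.abs(x_new - x)) < tol:
            return x_new                 # converged; learning phase may begin
        x = x_new
    return x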
The systolic architecture derived for the search phase can be reused for the learning
phase as well. The architecture for the search phase (Fig. 15.9) can be used for the
learning phase as follows:
1. Calculate the product of the error value (d_i − x_i) and the learning rate η as S_i = η · (d_i − x_i), and store it in the i-th PE.
2. To calculate the weight-increment terms according to (15.3), the converged activation values x_j for 1 ≤ j ≤ N move across the PEs as in the case of the search phase.
where

u_i(l) = ∑_{j=1}^{N_{l−1}} W_ij · x_j(l − 1) + θ_i(l)   for 1 ≤ i ≤ N_l and 1 ≤ l ≤ L.   (15.8)
N_l is the number of nodes in the l-th layer. For simplicity of presentation (and without loss of generality), we have assumed that each layer consists of an equal number of nodes (i.e., N_l = N); θ_i(l) is the bias input of the i-th neuron in the l-th layer.
The computations of each neuron can be performed by a systolic array of the kind shown in Fig. 15.9 (discussed in Subsection 15.4.1), such that the computation of (15.7) and (15.8) for all the neurons of a layer can be realized in fully parallel form in L systolic arrays. The resulting mesh architecture, consisting of LN PEs arranged in L rows and N columns, is shown in Fig. 15.10. The bias values (not shown explicitly in the structure) are used to initialize the accumulation registers in the PEs. For a reduced-hardware implementation, the computation of the different layers may be performed by a single array structure by time-multiplexing the computation of the different layers with a simple control unit and external storage elements to store the outputs of the neurons.
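In software, the layer-wise computation of (15.7) and (15.8) reduces to the following sketch, where each loop iteration corresponds to the work of one row of the mesh (one systolic array acting as a pipeline stage); the sigmoid is an illustrative stand-in for the activation f of (15.7).

import numpy as np

def feed_forward(weights, biases, x0):
    # weights[l] and biases[l] hold W and theta of layer l+1;
    # equation (15.8): u(l) = W x(l-1) + theta(l), then x(l) = f(u(l)).
    f = lambda u: 1.0 / (1.0 + np.exp(-u))
    activations = [np.asarray(x0, dtype=float)]
    for W, theta in zip(weights, biases):
        activations.append(f(W @ activations[-1] + theta))
    return activations                   # x(0), x(1), ..., x(L)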
The error signals (the differences between the desired and the actual activation values) are estimated for all the neurons at the output layer, and propagated back for weight adjustment of the preceding layers, progressing backward layer by layer. The weight adaptation of the l-th layer according to the BP algorithm pertaining to the m-th input pattern is given by

W_ij^m(l) = W_ij^{m−1}(l) + η · δ_i^m(l) · x_j^m(l − 1)   (15.9)
where the error signal δ_i^m(l) for updating the weights of the l-th layer is recursively computed as

δ_i^m(L) = (d_i^m − x_i^m(L)) · f′(u_i^m(L))   for l = L, and   (15.10)

δ_i^m(l) = [ ∑_j δ_j^m(l + 1) · W_ij^{m−1}(l + 1) ] · f′(u_i^m(l))   for l < L.   (15.11)
The feed-forward step is the same as that of the search phase and can be implemented by the structure of Fig. 15.10. For the weight adjustment by the back-propagation algorithm, the structure of the search phase can be reused with simple modification. The formula for weight updating of the L-th layer is given by (15.10), which is similar to (15.3) except that η is replaced by the derivative f′ of the non-linear function, and it can be implemented by the structure discussed in Subsection 15.4.1. For all other values of l (i.e., for l < L in (15.11)), the error signal used in (15.10) is replaced by an inner product of the N-point error vector and a row of the weight matrix. The inner products of (15.11) can be easily realized by the semi-systolic structure of Fig. 15.7 using the PEs whose function is described in Fig. 15.8(c).
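The recursion of (15.10)-(15.11) can be summarized functionally as below. This is a behavioural sketch: a sigmoid activation is assumed, so that f′(u(l)) can be evaluated from the stored activations as x(l)(1 − x(l)); the transposed matrix-vector products are exactly the inner products of (15.11) that the semi-systolic array would compute.

import numpy as np

def backprop_deltas(weights, activations, d):
    # weights[l] maps layer l to layer l+1; activations = [x(0), ..., x(L)].
    fprime = lambda x: x * (1.0 - x)     # sigmoid derivative from outputs
    L = len(weights)
    deltas = {L: (np.asarray(d) - activations[L]) * fprime(activations[L])}  # (15.10)
    for l in range(L - 1, 0, -1):        # (15.11), propagated backward
        deltas[l] = (weights[l].T @ deltas[l + 1]) * fprime(activations[l])
    # Per-pattern weight update: dW(l) = eta * outer(deltas[l+1], x(l)).
    return deltas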
3. δ_i^m(l) is then used by the i-th PE of the l-th layer for updating the weights {W_ji^{m−1}(l), 1 ≤ j ≤ N}.
15.5 Conclusion
The systolic array architectures, owing to their several advantageous features, are considered attractive for the implementation of computation-intensive ANN algorithms in custom VLSI and FPGA devices for real-time applications. The key techniques used for mapping ANN algorithms onto systolic computing structures are discussed, and a brief overview of systolic architectures for different ANN applications is presented in this chapter. Along with the design of basic systolic building blocks for various ANN algorithms, the mapping of fully-connected unconstrained ANNs, as well as multilayer ANN algorithms, onto fully-pipelined systolic architectures is described by a generalized dependence-graph formulation. Readers may refer to the cited references for detailed discussions on hardware implementation of advanced ANN algorithms and extended forms of ANN for different applications. Interested readers will also find several variations and optimizations of systolic architectures for the ANN models in the references. Most of the VLSI structures suggested in the literature are meant for a particular network topology and a specific training algorithm. Only a few of the architectures offer the flexibility of adapting to different learning processes, topologies and network constitutions [48, 49]. Self-configuring and adaptive architectures could also be designed for complex multi-modal learning applications and for applications subject to different constraints and environmental influences [50, 51]. It is observed that both analog and digital implementations have their inherent advantages and disadvantages. Therefore, it is expected that mixed analog-digital circuits might deliver the best of the two for the VLSI implementation of different ANN models. Mixed analog-digital neural networks have significant potential to be deployed directly and more efficiently in various applications in signal processing, communication and instrumentation, where real-world interaction in the analog domain is prevalent during different phases of network operation.
References
1. Brown, J.R., Garber, M.M., Venable, S.F.: Artificial neural network on a SIMD architec-
ture. In: Proceedings Frontiers of Massively Parallel Computation, pp. 43–47 (1988)
2. Shams, S., Przytula, K.W.: Mapping of neural networks onto programmable parallel ma-
chines. In: Proceedings IEEE International Symposium on Circuits and Systems, vol. 4,
pp. 2613–2617 (1990)
3. Kung, S.Y.: Digital Neurocomputing. Prentice Hall, Englewood Cliffs (1992)
4. Kung, S.Y.: Tutorial: digital neurocomputing for signal/image processing. In: Proceed-
ings of IEEE Workshop Neural Networks for Signal Processing, pp. 616–644 (1991)
5. Kung, S.Y., Hwang, J.N.: Parallel architectures for artificial neural nets. In: Proceedings
of IEEE International Conference on Neural Networks, vol. 2, pp. 165–172 (1988)
6. Kung, S.Y., Hwang, J.N.: A unifying algorithm/architecture for artificial neural networks.
In: Proceedings of International Conference on Acoustics, Speech, and Signal Process-
ing, vol. 4, pp. 2505–2508 (1989)
7. Amin, H., Curtis, K.M., Hayes Gill, B.R.: Efficient two-dimensional systolic array ar-
chitecture for multilayer neural network. Electronics Letters 33(24), 2055–2056 (1997)
8. Amin, H., Curtis, K.M., Hayes Gill, B.R.: Two-ring systolic array network for artificial neural networks. IEE Proceedings - Circuits, Devices and Systems 146(5), 225–230 (1999)
9. Myoupo, J.F., Seme, D.: A single-layer systolic architecture for back propagation learn-
ing. In: Proceedings of IEEE International Conference on Neural Networks, vol. 2, pp.
1329–1333 (1996)
10. Khan, E.R., Ling, N.: Systolic architectures for artificial neural nets. In: Proceedings of
IEEE International Joint Conference on Neural Networks, vol. 1, pp. 620–627 (1991)
11. Zubair, M., Madan, B.B.: Systolic implementation of neural networks. In: Proceedings
of IEEE International Conference on Computer Design: VLSI in Computers and Proces-
sors, pp. 479–482 (1989)
12. Pazienti, F.: Systolic array for neural network implementation. In: Proceedings 6th
Mediterranean Electrotechnical Conference, vol. 2, pp. 981–984 (1991)
13. Girones, R.G., Salcedo, A.M.: Systolic implementation of a pipelined on-line back prop-
agation. In: Proceedings Seventh International Conference on Microelectronics for Neu-
ral, Fuzzy and Bio-Inspired Systems, pp. 387–394 (1999)
14. Naylor, D., Jones, S.: A performance model for multilayer neural networks in linear
arrays. IEEE Transactions on Parallel and Distributed Systems 5(12), 1322–1328 (1994)
15. Naylor, D., Jones, S., Myers, D.: Back propagation in linear arrays-a performance anal-
ysis and optimization. IEEE Transactions on Neural Networks 6(3), 583–595 (1995)
16. Kung, H.T.: Why systolic architectures? Computer 15(1), 37–46 (1982)
17. Kung, S.Y.: VLSI Array Processors. Prentice Hall, Englewood Cliffs (1988)
18. Parhi, K.K.: VLSI Digital Signal Processing Systems: Design and Implementation.
Wiley-Interscience Publication, John Wiley & Sons, New York (1999)
19. Zhang, D., Pal, S.K. (eds.): Neural Networks and Systolic Array Design. World Scien-
tific, River Edge (2002)
20. Ben Salem, A.K., Ben Othman, S., Ben Saoud, S.: Design and implementation of a neural
command rule on a FPGA circuit. In: Proceedings 12th IEEE International Conference
on Electronics, Circuits and Systems, pp. 1–4 (2005)
21. Liu, J., Liang, D.: A Survey of FPGA-Based Hardware Implementation of ANNs. In:
Proceedings 1st International Conference on Neural Networks and Brain, pp. 915–918
(2005)
22. Mohan, A.R., Sudha, N., Meher, P.K.: An embedded face recognition system on A VLSI
array architecture and its FPGA implementation. In: Proceedings 34th Annual Confer-
ence of IEEE Industrial Electronics, pp. 2432–2437 (2008)
23. Farhat, N.H., Psaltis, D., Prata, A., Paek, E.: Optical implementation of the Hopfield model. Applied Optics 24, 1469–1475 (1985)
24. Wagner, K., Psaltis, D.: Multilayer optical learning networks. Applied Optics 26, 5061–5076 (1987)
25. Mead, C.: Analog VLSI and neural systems. Addison Wesley, Reading (1989)
26. Sivilotti, M.A., Mahowald, M.A., Mead, C.A.: Real-time visual computations using ana-
log CMOS processing arrays. In: Losleben, P. (ed.) Advanced Research in VLSI, pp.
295–312. MIT Press, Cambridge (1987)
27. Sheu, B.J., Choi, J.: Neural Information Processing and VLSI. Kluwer Academic Pub-
lishers, Dordrecht (1995)
28. Hopfield, J.J., Tank, D.W.: Neural computation of decisions in optimization problems.
Biological cybernetics 52, 141–154 (1985)
29. Rumelhart, D.E., McClelland, J.L.: Parallel Distributed Processing: Explorations in the Microstructure of Cognition. MIT Press, Cambridge (1986)
30. Werbos, P.: Beyond regression: New tools for prediction and analysis in the behavioral
sciences. Ph.D. thesis, Harvard University, Cambridge, Mass. (1974)
31. Gisutham, B., Srikanthan, T., Asari, K.V.: A high speed flat CORDIC based neuron with
multi-level activation function for robust pattern recognition. In: Proceedings Fifth IEEE
International Workshop on Computer Architectures for Machine Perception, pp. 87–94
(2000)
32. Anna Durai, S., Siva Prasad, P.V., Balasubramaniam, A., Ganapathy, V.: A learning strat-
egy for multilayer neural network using discretized sigmoidal function. In: Proceedings
Fifth IEEE International Conference on Neural Networks, pp. 2107–2110 (1995)
33. Zhang, M., Vassiliadis, S., Delgado-Frias, J.G.: Sigmoid generators for neural comput-
ing using piecewise approximations. IEEE Transactions on Computers 45, 1045–1049
(1996)
34. Saichand, V., Nirmala, D.M., Arumugam, S., Mohankumar, N.: FPGA realization of
activation function for artificial neural networks. In: Proceedings Eighth International
Conference on Intelligent Systems Design and Applications, vol. 3, pp. 159–164 (2008)
35. Williams, R.J., Zipser, D.: Experimental analysis of the real-time recurrent learning al-
gorithm. Connection Science 1, 87–111 (1989)
36. Williams, R.J., Zipser, D.: Gradient-based learning algorithms for recurrent networks
and their computational complexity. In: Back-propagation: Theory, Architectures and
Applications. Erlbaum, Hillsdale (1992)
37. Kechriotis, G., Manolakos, E.S.: A VLSI array architecture for the on-line training of re-
current neural networks. In: Conference Record of Asilomar Conference on the Twenty-
Fifth Signals, Systems and Computers, vol. 1, pp. 506–510 (1991)
38. Shaikh-Husin, N., Hani, M.K., Teoh, G.S.: Implementation of recurrent neural network
algorithm for shortest path calculation in network routing. In: Proceedings of Interna-
tional Symposium on Parallel Architectures, Algorithms and Networks, I-SPAN 2002,
pp. 313–317 (2002)
39. Ramacher, U., Beichter, J., Bruls, N., Sicheneder, E.: Architecture and VLSI design of
a VLSI neural signal processor. In: Proceedings IEEE International Symposium on Cir-
cuits and Systems, vol. 3, pp. 1975–1978 (1993)
40. Vidal, M., Massicotte, D.: A VLSI parallel architecture of a piecewise linear neural net-
work for nonlinear channel equalization. In: Proceedings the 16th IEEE Conference on
Instrumentation and Measurement Technology, vol. 3, pp. 1629–1634 (1999)
41. Broomhead, D.S., Jones, R., McWhirter, J.G., Shepherd, T.J.: A systolic array for nonlin-
ear adaptive filtering and pattern recognition. In: Proceedings IEEE International Sym-
posium on Circuits and Systems, vol. 2, pp. 962–965 (1990)
42. Cavaiuolo, M., Yakovleff, A.J.S., Watson, C.R., Kershaw, J.A.: A systolic neural net-
work image processing architecture. In: Proceedings Computer Systems and Software
Engineering, pp. 695–700 (1992)
43. Bermak, A., Martinez, D.: Digital VLSI implementation of a multi-precision neural net-
work classifier. In: Proceedings 6th International Conference on Neural Information Pro-
cessing, vol. 2, pp. 560–565 (1999)
44. Shadafan, R.S., Niranjan, M.: A systolic array implementation of a dynamic sequential
neural network for pattern recognition. In: Proceedings IEEE World Congress on Com-
putational Intelligence and IEEE International Conference on Neural Networks, vol. 4,
pp. 2034–2039 (1994)
45. Sudha, N., Mohan, A.R., Meher, P.K.: Systolic array realization of a neural network-
based face recognition system. In: Proceedings 3rd IEEE Conference on Industrial Elec-
tronics and Applications, pp. 1864–1869 (2008)
46. Sheu, B.J., Chang, C.F., Chen, T.H., Chen, O.T.C.: Neural-based analog trainable vector
quantizer and digital systolic processors. In: Proceedings IEEE International Symposium
on Circuits and Systems, vol. 3, pp. 1380–1383 (1991)
47. Moreno, J.M., Castillo, F., Cabestany, J., Madrenas, J., Napieralski, A.: An analog sys-
tolic neural processing architecture. IEEE Micro. 14(3), 51–59 (1994)
48. Madraswala, T.H., Mohd, B.J., Ali, M., Premi, R., Bayoumi, M.A.: A reconfigurable
‘ANN’ architecture. In: Proceedings IEEE International Symposium on Circuits and Sys-
tems, vol. 3, pp. 1569–1572 (1992)
49. Jang, Y.-J., Park, C.-H., Lee, H.-S.: A programmable digital neuro-processor design with
dynamically reconfigurable pipeline/parallel architecture. In: Proceedings International
Conference on Parallel and Distributed Systems, pp. 18–24 (1998)
50. Patra, J.C., Lee, H.Y., Meher, P.K., Ang, E.L.: Field Programmable Gate Array Imple-
mentation of a Neural Network-Based Intelligent Sensor System. In: Proceeding Inter-
national Conference on Control Automation Robotics and Vision, December 2006, pp.
333–337 (2006)
51. Patra, J.C., Chakraborty, G., Meher, P.K.: Neural Network-Based Robust Linearization
and Compensation Technique for Sensors under Nonlinear Environmental Influences.
IEEE Transactions on Circuits and Systems-I: Regular Papers 55(5), 1316–1327 (2008)
Pramod Kumar Meher received the B.Sc. and M.Sc. degrees in physics and the Ph.D.
in science from Sambalpur University, Sambalpur, India, in 1976, 1978, and 1996,
respectively. He has a wide scientific and technical background covering physics,
electronics, and computer engineering. Currently, he is a Senior Scientist with the
Institute for Infocomm Research, Singapore. Prior to this assignment he was a visiting faculty member with the School of Computer Engineering, Nanyang Technological University, Singapore. Previously, he was a Professor of computer applications with
Utkal University, Bhubaneswar, India, from 1997 to 2002, a Reader in electron-
ics with Berhampur University, Berhampur, India, from 1993 to 1997, and a Lec-
turer in physics with various Government Colleges in India from 1981 to 1993. His
research interest includes design of dedicated and reconfigurable architectures for
computation-intensive algorithms pertaining to signal processing, image processing,
communication, and intelligent computing. He has published more than 140 tech-
nical papers in various reputed journals and conference proceedings. Dr. Meher is a
Fellow of the Institution of Electronics and Telecommunication Engineers (IETE),
India and a Fellow of the Institution of Engineering and Technology (IET), UK.
He is currently serving as Associate Editor for the IEEE Transactions on Circuits
and Systems-II: Express Briefs, IEEE Transactions on Very Large Scale Integration
(VLSI) Systems, and Journal of Circuits, Systems, and Signal Processing. He was
the recipient of the Samanta Chandrasekhar Award for excellence in research in
engineering and technology for the year 1999.
Chapter 16
Application of Coarse-Coding Techniques for
Evolvable Multirobot Controllers
Abstract. Robots, in their most general embodiment, can be complex systems trying
to negotiate and manipulate an unstructured environment. They ideally require an
‘intelligence’ that reflects our own. Artificial evolutionary algorithms are often used
to generate a high-level controller for single- and multi-robot scenarios. But evolutionary algorithms, for all their advantages, can be very computationally intensive. It is therefore very desirable to minimize the number of generations required for
a solution. In this chapter, we incorporate the Artificial Neural Tissue (ANT) ap-
proach for robot control from previous work with a novel Sensory Coarse Coding
(SCC) model. This model is able to exploit regularity in the sensor data of the en-
vironment. Determining how the sensor suite of a robot should be configured and
utilized is critical for the robot’s operation. Much as nature evolves body and brain
simultaneously, we should expect improved performance resulting from artificially
evolving the controller and sensor configuration in unison. Simulation results on
an example task, resource gathering, show that the ANT+SCC system is capable of
finding fitter solutions in fewer generations. We also report on hardware experiments
for the same task that show complex behaviors emerging through self-organized task
decomposition.
Jekanthan Thangavelautham
Mechanical Engineering Department, Massachusetts Institute of Technology,
77 Massachusetts Ave., Cambridge, MA, USA, 02139
e-mail: [email protected]
Paul Grouchy · Gabriele M.T. D’Eleuterio
Institute for Aerospace Studies, University of Toronto,
4925 Dufferin St., Toronto, Canada, M3H5T6
e-mail: {paul.grouchy,gabriele.deleuterio}@utoronto.ca
16.1 Introduction
Our motivation for evolutionary-based control approaches for multirobot systems originates in the use of robots for space exploration and habitat construction on
alien planets and planetoids, such as Mars and the Moon. Potential scenarios include
establishing a distributed antenna for communications, deploying a mobile array of
actuators and sensors for geological measurements, or constructing elements of an
outpost in preparation for the arrival of humans.
These kinds of projects call for not just a single monolithic robotic system but teams of elemental robots working in collaboration and coordination.
applications may require the use of a multiagent strategy, they are by no means the
only ones. Consider, for example, terrestrial applications such as search and rescue,
mapping, manufacturing and construction.
A number of factors make the team approach viable and attractive. Among them is increased reliability: one can afford to lose a member of the team without destroying the team's integrity.
parallelization of operations. As such, multiagent systems are more readily scalable.
Most important, however, a team can facilitate task decomposition. A complex task
can be parsed into manageable subtasks which can be delegated to multiple elemen-
tal units.
Robotic systems can themselves be complex and their environments are generally
unstructured. Sound control strategies are therefore not easy to develop. A method-
ical approach is not only desired but arguably required. It would be ideal if the
controller could be automatically generated starting from a ‘blank slate,’ where the
designer is largely relieved of the design process and detailed models of the systems
or environment can be avoided. It is by such a road that we have come to the use of
evolutionary algorithms for generating controllers that are based on neural-network
architectures.
We have developed and tested, both in simulation and hardware, a neuroevo-
lutionary approach called the Artificial Neural Tissue (ANT). This neural-network-
based controller employs a variable-length genome consisting of a regulatory
system that dictates the rate of morphological growth and can selectively activate
and inhibit neuron ensembles through a coarse-coding scheme [31]. The approach
requires an experimenter to specify a goal function, a sensory input layout for the
robots and a repertoire of allowable basis behaviors. The control topology and its
contents emerge through the evolutionary process.
But is it possible to evolve the necessary sensor configurations and the selection of behavior primitives (motor-actuator commands) concurrently with the evolution of the controller itself? It is this question that we address in this work.
In tackling this challenge, we turn to a key theme in our ANT concept, namely,
coarse coding. Coarse coding is an efficient, distributed means of representation that
makes use of multiple coarse receptive fields to represent a higher-resolution field.
As is well known, nature exploits coarse coding in the brain and sensory systems.
In artificial systems, coarse coding is used to interpret data. In ANT, however, it
is the program (the artificial neural structure responsible for computation) that is
coarse coded. This allows the development of spatially modular functionality in the
architecture that mimics the modularity in natural brains.
We present in this work a Sensory Coarse Coding (SCC) model that extends
capabilities of ANT and allows for the evolution of sensor configuration and coupled
motor primitives. The evolution of sensor configuration and behavior primitives can
be used to take advantage of regularities in the task space and can help to guide and
speed up the evolution of the controller.
Training neural network controllers for robotics using an evolutionary-algorithm approach like ANT is computationally expensive, and tends to be used when other standard search algorithms, such as gradient descent, are unsuitable or require substantial supervision for the given task space. The controllers are developed using a biologically motivated development process and replicated on one or more robotic platforms for evaluation. Training on hardware is often logistically difficult, requiring a long-term power source and a means of automating the controller evaluation process. The alternative is to simulate the robotic evaluation process on one or more computers. The bulk of the training time is consumed by the evaluation process; genetic operations, including selection, mutation and crossover, tend to take less than one percent of the computational time. Therefore, any method that reduces the number of evaluations will have a substantial impact on the training process. Furthermore, a significant reduction in the number of generations required can also make the training process feasible on hardware. Robotic simulations often take into account the dynamics and kinematics of robotic vehicle interactions. However, care has to be taken to ensure the simulation environment resembles, or is compatible with, the actual hardware. In other circumstances, it may be beneficial to prototype and demonstrate capabilities and concepts in simulation before proceeding towards expensive hardware demonstrations. With robotics applications on the lunar surface, for instance, the low-gravity environment cannot be easily replicated on earth, and hence high-fidelity dynamics simulations may be needed to demonstrate aspects of system capability.
For multirobotic tasks, the global effect of local interactions between robots is
often difficult to gauge, and the specific interactions required to achieve coordinated
behavior may even be counterintuitive. Furthermore, it is not at all straightforward to
determine the best sensor configuration. Often detailed analysis of the task needs to
be performed to figure out the necessary coordination rules and sensory configura-
tions. The alternative is to use optimization techniques to in effect shrink the search
space sufficiently to enable evolutionary search algorithms to find suitable solutions.
Evolving this configuration may also give useful insight into the sensors necessary
for a task. This may help guide a robotic designer in their design processes—we do
not presume to make the designer completely redundant—by helping them deter-
mine which sensors and actuators are best to achieve a given objective. In addition,
the evolution of the sensor configuration in conjunction with the controller would
allow us to mimic nature more closely. Nature perforce evolves body and brain
together.
The remainder of this chapter is organized as follows. First, we provide background to our problem by reviewing past work on the use of evolutionary algorithms for the development of multirobot controllers and on 'body and brain' evolution. We then present the workings of the Artificial Neural Tissue approach, followed by the Sensory Coarse Coding model; we refer to the integration of the latter into the former as ANT+SCC.
16.2 Background
Coordination and control of multirobot systems are often inspired by biology. In
nature, multiagent systems such as social insects use a number of mechanisms for
control and coordination. These include the use of templates, stigmergy, and self-
organization. Templates are environmental features perceptible to the individuals
within the collective [3]. In insect colonies, templates may be a natural phenomenon
or they may be created by the colonies themselves. They may include temperature,
humidity, chemicals, or light gradients. Stigmergy is a form of indirect communi-
cation mediated through the environment [12]. One way in which ants and termites
exploit stigmergy is through the use of pheromone trails. Self-organization describes
how local or microscopic behaviors give rise to a macroscopic structure in systems
[2]. However, many existing approaches suffer from another emergent feature called
antagonism [5]. This is the effect that arises when multiple agents trying to perform
the same task interfere with each other and reduce the overall efficiency of the group.
Within the field of robotics, many have sought to develop multirobot control and
coordination behaviors based on one or more of the prescribed mechanisms used
in nature. These solutions have been developed using user-defined deterministic ‘if-
then’ rules or preprogrammed stochastic behaviors. Such techniques in robotics in-
clude template-based approaches that exploit light fields to direct the creation of
walls [33] and planar annulus structures [34]. Stigmergy has been used extensively
in collective-robotic construction tasks, including blind bulldozing [24], box pushing [21] and heap formation [1].
Inspired by insect societies, the robots are equipped with the necessary sensors re-
quired to demonstrate multirobot control and coordination behaviors. Furthermore,
the robot controllers are often designed by hand to be reactive and have access only
to local information. They are nevertheless able to self-organize through coopera-
tion to achieve an overall objective. This is difficult to do by hand, since the global
effect of these local interactions is often very difficult to predict. The simplest hand-
coded techniques have been to design a controller for a single robot and scaling
to multiple units by treating other units as obstacles to be avoided [1], [24]. Other
more sophisticated techniques make use of explicit communication or designing an
extra set of coordination rules to gracefully handle agent-to-agent interactions [33].
These approaches are largely heuristic, rely on ad hoc assumptions that often re-
quire knowledge of the task domain and are implemented with a specified robot
configuration in mind.
With an increasing number of hidden neurons, one must contend with the effects of spatial crosstalk, where noisy neurons interfere with and drown out signals from feature-detecting neurons [16]. Crosstalk, in combination with limited supervision (through use of a global fitness function), can lead to the 'bootstrap problem' [23], where evolutionary algorithms are unable to pick out incrementally fitter solutions, resulting in premature stagnation of the evolutionary run. Thus, choosing the wrong network topology may lead to a situation that is either unable to solve the task or difficult to train [31].
16.3.1 Computation
We imagine the motor neurons of our network to be spheres arranged in a regular rectangular lattice in which the neuron N_λ occupies the position λ = (l, m, n) ∈ I³ (sphere centered within cube). The state s_λ of the neuron is binary, i.e., s_λ ∈ S = {0, 1}. Each neuron N_λ nominally receives inputs from neurons N_κ, where κ ∈ ⇑(λ), the nominal input set. Here we shall assume that these nominal inputs are the 3 × 3 neurons centered one layer below N_λ; in other terms, ⇑(λ) = {(i, j, k) | i = l − 1, l, l + 1; j = m − 1, m, m + 1; k = n − 1}. (As will be explained presently, however, we shall not assume that all the neurons are active all the time.) The activation function of each neuron is taken from among four possible threshold functions of the weighted input σ:
ψ_down(σ, θ1) = 0 if σ ≥ θ1, and 1 otherwise

ψ_up(σ, θ2) = 0 if σ ≤ θ2, and 1 otherwise
                                                                          (16.1)
ψ_ditch(σ, θ1, θ2) = 0 if min(θ1, θ2) ≤ σ < max(θ1, θ2), and 1 otherwise

ψ_mound(σ, θ1, θ2) = 0 if σ ≤ min(θ1, θ2) or σ > max(θ1, θ2), and 1 otherwise

where

σ_λ = ( ∑_{κ∈⇑(λ)} w_{κλ} s_κ ) / ( ∑_{κ∈⇑(λ)} s_κ )                      (16.2)
with the proviso that σ = 0 if the numerator and denominator are both zero. Also, w_{κλ} ∈ R is the weight connecting N_κ to N_λ. These four threshold functions may be summarized in a single analytical expression selected by two parameters k1 and k2, each of which can take on the value 0 or 1. The activation function is thus encoded in the genome by k1, k2 and the threshold parameters θ1, θ2 ∈ R.
It may appear that ψ_down and ψ_up are mutually redundant, as one type can be obtained from the other by reversing the signs of all the weights. However, retaining both increases diversity in the evolution, because only a single 2-bit 'gene' is required to encode the threshold function and a single mutation suffices to convert ψ_down into ψ_up or vice versa, as opposed to changing the sign of every weight.
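A direct transcription of (16.1) and (16.2) in Python follows. The mapping of the 2-bit gene (k1, k2) onto the four functions is an assumed encoding, since the combined analytical expression is not reproduced here.

def psi(sigma, k1, k2, theta1, theta2):
    lo, hi = min(theta1, theta2), max(theta1, theta2)
    if (k1, k2) == (0, 0):                          # psi_down
        return 0 if sigma >= theta1 else 1
    if (k1, k2) == (1, 0):                          # psi_up
        return 0 if sigma <= theta2 else 1
    if (k1, k2) == (0, 1):                          # psi_ditch
        return 0 if lo <= sigma < hi else 1
    return 0 if (sigma <= lo or sigma > hi) else 1  # psi_mound

def weighted_input(weights, states):
    # Equation (16.2), with sigma = 0 when no input neuron is active.
    active = sum(states)
    if active == 0:
        return 0.0
    return sum(w * s for w, s in zip(weights, states)) / active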
The sensor data are represented by the activation of the sensor input neurons N_αi, i = 1...m, summarized as A = {s_α1, s_α2, ..., s_αm}. Similarly, the output of the network is represented by the activation of the output neurons N_ωj, j = 1...n, summarized as Ω = {s_ω1^1, s_ω2^1, ..., s_ωn^b}, where k = 1...b specifies the output behavior. Each output neuron commands one behavior of the agent. (In the case of a robot, a typical behavior may be to move forward a given distance. This may result in the coordinated action of several actuators. Alternatively, the behavior may be more primitive, such as augmenting the current of a given actuator.) If s_ωj^k = 1, output neuron ω_j votes to activate behavior k; if s_ωj^k = 0, it does not. Since multiple neurons can have access to a behavior pathway, an arbitration scheme is imposed to ensure the controller is deterministic: p(k) = ∑_{j=1}^{n_k} s_ωj^k / n_k, where n_k is the number of output neurons connected to output behavior k, and behavior k is activated if p(k) ≥ 0.5.
As implied by the set notation of Ω , the outputs are not ordered. In this embodi-
ment, the order of activation is selected randomly. We are primarily interested here
in the statistical characteristics of relatively large populations but such an approach
would likely not be desirable in a practical robotic application. However, this can be remedied by simply assigning a sequence a priori to the activations (as shown in Table 16.2 for the resource gathering task).
We moreover note that the output neurons can be redundant; that is, more than
one neuron can command the same behavior, in which case for a given time step
one behavior may be “emphasized” by being voted multiple times. Neurons may
also cancel each other out.
Fig. 16.1 Synaptic connections between motor neurons and operation of the neurotransmitter field: (a) synaptic connections and (b) coarse-coding
Each gene carries a type flag used to distinguish its functionality (between motor control, decision and tissue).
A constructor protein (an autonomous program) interprets the information encoded
in the gene and translates this into a cell descriptor protein (see Fig. 16.2). The
gene ‘activation’ parameter is a binary flag resident in all the cell genes and is used
to either express or repress the contents of the gene. When repressed, a descriptor
protein of the gene content is not created. Otherwise, the constructor protein ‘grows’
the tissue in which each cell is located relative to a specified seed-parent address.
A cell death flag determines whether the cell commits suicide after being grown. Once again, this feature in the genome aids the evolutionary process: a cell that commits suicide still occupies a volume in the lattice although it is dormant. Because the cell otherwise retains its characteristics, evolution can later reinstate it by merely toggling a single bit.
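A minimal sketch of the expression and cell-death flags might look like the following; the field names are illustrative and not the chapter's exact genome layout.

```python
from dataclasses import dataclass, field

@dataclass
class CellGene:
    activated: bool     # expression flag: repressed genes yield no descriptor protein
    dead: bool          # cell-death flag: the cell is grown but stays dormant
    params: dict = field(default_factory=dict)  # weights, thresholds, addresses, ...

def grow_tissue(genome):
    """Constructor-protein pass: only activated genes are grown. Dead cells
    keep their lattice volume but contribute no computation; flipping the
    single `dead` bit back reinstates them with characteristics intact."""
    grown = [g for g in genome if g.activated]
    computing = [g for g in grown if not g.dead]
    return grown, computing
```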
Fig. 16.3 Genes are ‘read’ by constructor proteins that transcribe the information into a de-
scriptor protein which is used to construct a cell. When a gene is repressed, the constructor
protein is prevented from reading the gene contents
Therefore, the net concentration of chemical diffused due to the $v$ sensory neurons at $\varphi$ is:

$$
c_\varphi = \sum_{i=1}^{v} c_{\varphi, \tau_i}
\tag{16.5}
$$
Fig. 16.4 (Left) Three sensor neurons and their respective receptive fields shown shaded. With S = 1, only $\Phi_2$ is selected since $\sum_{\varphi \in {\Uparrow}(\tau)} c_\varphi$ is the highest among the three. (Right) Once $\Phi_2$ is activated, only locations with the highest chemical concentrations (shaded in dark gray) are fed as inputs to the evolved priority filter. The result is a single output from the neuron, indicating red
To determine which grid squares a sensory neuron $\Phi_\tau$ will receive, the chemical concentration at each location $\varphi \in {\Uparrow}(\tau)$ is calculated. The states of the locations $L_\varphi$ that have a maximum chemical concentration in the grid are fed to the sensory neuron inputs. If $\sum_{\varphi \in {\Uparrow}(\tau)} c_\varphi = 0$, sensory neuron $\Phi_\tau$ is deactivated. Furthermore, the $S$ sensor neurons with the highest $\sum_{\varphi \in {\Uparrow}(\tau)} c_\varphi$ are activated. $S \in \mathbb{I}$ and can be evolved within the tissue gene or be specified for a task.

Therefore, we define $I_\tau = \{L_\varphi \mid c_\varphi = \max_{\varphi \in {\Uparrow}(\tau)} c_\varphi\}$ as the input set after coarse coding to sensory neuron $\Phi_\tau$. For sensor neurons that are active, we calculate $s_\tau$:

$$
s_\tau = \min_{i \in [1, \ldots, q]} (p_i), \qquad p_j \cap I_\tau = \emptyset\ \ \forall j < i
\tag{16.6}
$$
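The selection and filtering steps of Eqs. (16.5) and (16.6) might be sketched as follows. Representing receptive fields as index arrays into a flattened spatial map is an assumption of this sketch, not the chapter's data structure.

```python
import numpy as np

def select_sensor_neurons(fields, concentration, S):
    """Activate the S sensor neurons whose summed chemical concentration
    over their receptive field is highest; zero-total neurons stay off."""
    totals = np.array([concentration[f].sum() for f in fields])
    ranked = np.argsort(-totals)
    return [int(i) for i in ranked[:S] if totals[i] > 0]

def sensor_state(field, concentration, grid_states, priority):
    """Eq. (16.6), sketched: feed only the max-concentration locations of
    the field to the neuron and return the highest-priority state present
    among them (priority[0] = p1 outranks all later states)."""
    c = concentration[field]
    present = set(grid_states[field[c == c.max()]])
    for p in priority:
        if p in present:
            return p
    return None
```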
Thus, to tally the votes for input state $s_{\alpha_3}$, the filter units receive the input vector $[0\ 0\ 1\ 0 \ldots 0]$ of size $q$ and their outputs are summed as given below:

$$
V_{s_k} = \sum_{j=1}^{n} \psi_{\mathrm{up}}\!\left(w_{jk}, \theta_j\right)
\tag{16.7}
$$
This process is repeated for all states in A , and the priority list is generated by
assigning the state with the highest number of votes to p1 , assigning the state that
garnered the second highest number of votes to p2 , etc. In case of a tie, the tie-
breaker is the sum of the raw outputs of the filter networks, i.e., before the ψup
activation function is applied.
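A sketch of the vote tally of Eq. (16.7) and the resulting priority list follows; the weight-matrix layout W[j, k] is an assumption of this sketch.

```python
import numpy as np

def priority_list(W, thetas):
    """W[j, k] is the weight of filter unit j for input state k (n units,
    q states); `thetas` holds theta_j per unit. A one-hot input for state k
    reduces Eq. (16.7) to V_k = sum_j psi_up(W[j, k], theta_j). Ties are
    broken on the raw (pre-threshold) column sums, as in the text."""
    W, thetas = np.asarray(W, float), np.asarray(thetas, float)
    votes = (W > thetas[:, None]).sum(axis=0)   # psi_up = 1 iff sigma > theta
    raw = W.sum(axis=0)                         # tie-breaker
    q = W.shape[1]
    # States sorted by descending votes, then descending raw sum:
    return sorted(range(q), key=lambda k: (-votes[k], -raw[k]))
```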
Fig. 16.5 Gene map for the Sensory Coarse Coding Model
Though solutions can be found without the light beacon, its presence improves
the efficiency of the solutions found, as it allows the robots to track the target loca-
tion from a distance instead of randomly searching the workspace for the perimeter.
The global fitness function for the task measures the amount of resource material
accumulated in the designated location within a finite number of time steps, in this
case T = 300. Darwinian selection is performed based on the fitness value of each
controller averaged over 100 different initial conditions.
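In pseudocode form, this evaluation loop might read as follows; `simulate` stands in for the chapter's multirobot simulation and is hypothetical.

```python
import numpy as np

def evaluate(controller, simulate, n_trials=100, T=300):
    """Global fitness: resource material accumulated in the designated
    dumping area after T time steps, averaged over n_trials random
    initial conditions (seeds are illustrative)."""
    scores = [simulate(controller, steps=T, seed=s) for s in range(n_trials)]
    return float(np.mean(scores))
```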
Fig. 16.7 Predefined input sensor mapping, with simulation model inset
Fig. 16.8 Motor primitives composed of discretized voltage signals shown for a simulated
robot
Fig. 16.9 Modified tissue gene that includes order of execution of motor primitive sequences
voltage value of 0. The actual value of V , the voltage constant, is dependent on the
actuator.
The ANT controller also needs to determine the order of execution of these motor
primitive sequences. The modified tissue gene is shown in Fig. 16.9. The order
of the output coupled motor primitive (CMP) sequences is evolved as additional parameters in the tissue gene and is read starting from the left. The elements of the table, $o_1, \ldots, o_\varepsilon$, contain the Neuron ID values. The order is randomly initialized when starting the evolutionary process, with each Neuron ID occupying one spot on the gene.
on the gene. Point mutations to this section of the tissue gene result in swapping
Neuron ID values between sites. Table 16.3 shows the repertoire of coupled motor
primitives provided for the ANT controllers and thus ε = 15 for this particular setup
(Fig. 16.9). The motor primitives are coupled, where for example the left drive motor
and the right drive motor are executed in parallel (indicated using ||). Under this
setup, it is still possible for the controller to execute a sequence of motor primitives
in a serial fashion.
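The swap-style point mutation on the ordering section of the tissue gene could be sketched as follows; the mutation rate is illustrative.

```python
import random

def mutate_cmp_order(order, rate=0.05):
    """Point mutations swap Neuron ID values between sites, so the gene
    always remains a valid permutation of the CMP sequences."""
    order = list(order)
    for i in range(len(order)):
        if random.random() < rate:
            j = random.randrange(len(order))
            order[i], order[j] = order[j], order[i]
    return order

# e.g. mutate_cmp_order(range(1, 16)) permutes the epsilon = 15 ID slots
```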
16.5 Results
We compare evolutionary performance of various evolvable control system models
in Fig. 16.10. Included is a Cellular Automata (CA) lookup table consisting of a table of reactive rules that spans 12 × 16384 entries for this task, which was evolved using the population, selection and mutation parameters from Section 16.4.2. The genome is binary and is merely the contents of the lookup table. For this approach, we also assumed that the light beacon is turned off. Hence there exist $2^4 \times 4^4 \times 2^2 = 16384$ possible combinations of sensory input states, accounting for the resource detection, template detection and obstacle detection sensors respectively (Table 16.1). For each combi-
nation of sensory input, the 12 allowable behaviors outlined in Table 16.2 could be
executed. As can be seen, the population quickly stagnates at a very low fitness due
to the ‘bootstrap problem’ [23]. With limited supervision, the fitness function makes
it difficult to distinguish between incrementally fitter solutions. Instead the system
depends on ‘bigger leaps’ in fitness space (through sequences of mutations) for it
to be distinguishable during selection. However, bigger leaps become more improb-
able as evolution progresses, resulting in search stagnation. The performance of a
population of randomly initialized fixed-topology, fully-connected networks, con-
sisting of between 2 and 3 layers, with up to 40 hidden and output neurons is also
shown in Fig. 16.10.
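To make the size of the lookup-table search space discussed above concrete, a mixed-radix index into such a table could be computed as below; the digit ordering is an assumption of this sketch.

```python
def sensor_state_index(resource, template, obstacle):
    """Mixed-radix index over 2**4 * 4**4 * 2**2 = 16384 sensory
    combinations: four binary resource bits, four quaternary template
    values, two binary obstacle bits. Each row of the table then votes
    on the 12 allowable behaviors of Table 16.2."""
    idx = 0
    for b in resource:      # 4 values in {0, 1}
        idx = idx * 2 + b
    for t in template:      # 4 values in {0, 1, 2, 3}
        idx = idx * 4 + t
    for b in obstacle:      # 2 values in {0, 1}
        idx = idx * 2 + b
    return idx              # 0 <= idx < 16384
```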
In a fixed-topology network there tend to be more ‘active’ synaptic connections present (since all neurons are active), and thus it takes longer for each neuron
to tune these connections for all sensory inputs. In this regard ANT is advanta-
geous, since the topology is evolved and decision neurons learn to inhibit noisy
neurons through a masking process. The net result is that ANT requires fewer ge-
netic evaluations to evolve desired solutions in comparison to standard neural net-
works. The standard ANT model using sensory inputs and basis behaviors outlined
Fig. 16.10 Evolutionary performance comparison, showing population best averaged over
30 evolutionary algorithm runs of various evolvable control architectures. Error bars indicate
standard deviation. As shown, ANT combined with Sensory Coarse Coding (SCC) and Cou-
pled Motor Primitives (CMP) ordered through evolution obtains desired solutions with fewer
genetic evaluations. The CA lookup table approach as shown remains stagnant and is unable
to solve the task while fixed-topology neural nets converge at a much slower rate
These runs were performed with only one core being used for the evolutionary run. With ANT+SCC+CMP, one can obtain a comparably suitable solution in less than one hour and thirty minutes. Furthermore, since this fivefold improvement in performance is due to enhancements in the search process, the improvement is expected to carry over with faster processors.
Using regular neural networks, comparable solutions were not obtained even after
approximately 30 hours (10,000 generations) of evolution.
The solutions obtained in Table 16.4 can accumulate at least 88.5% of the dispersed resources in the designated dumping area within T = 300 time steps and have been determined to be of sufficient quality to complete the task (see Fig. 16.18 for a hardware demonstration). Given more time steps, it is expected that the robots would accumulate the remaining resources. One would ideally like to provide raw sensor data as input to the robot controller. However, this results in an exponential increase in search space for a linear increase in sensor states. The alternative would be to filter out and guess which subset of the sensory states may be useful for solving a prespecified task. A wrong guess or poor design choice may make the process of finding a suitable controller difficult or impossible. Hence, ANT+SCC allows for additional flexibility by helping to filter out and determine suitable sensory input states.
Fig. 16.13 shows the evolved population of sensory neurons on the body-centric
spatial map. Fig. 16.11 (left) shows the average area used by selected sensor neu-
rons and the average number of sensor neurons that participated in the selection pro-
cess during evolution. The average area remains largely constant indicating there is
strong selective pressure towards particular geometric shapes and area. This makes
sense for the resource gathering task, as controllers need to detect a sufficiently
large area in front to identify color cues indicating whether the robot is inside or
outside the dumping area. What is interesting is that with S = 4, for template detec-
tion sensor neurons, we still see a steady increase in the number of sensor neurons
competing to get selected. The increased number of neurons can potentially act in a
cooperative manner, reinforcing and serving as redundant receptive fields covering
key locations on the spatial map. Redundancy is beneficial in limiting the impact of
deleterious mutations. Fig. 16.11 (right) shows that the individuals in the evolution-
ary process start off by sensing a smaller area and that this area is steadily increased
as the solutions converge. If each sensor neuron senses just one grid square area,
then filtering is effectively disabled. At the beginning of the evolutionary process,
individuals take on reduced risk by sensing a smaller effective area, but as the fil-
tering capability evolves concurrently (correctly prioritizing the sensory cues), it
allows for the individual controllers to sense and filter a larger area. The number of
active filter units continues to be pruned until it reaches a steady-state value. This
trend is consistent with experiments using ANT [31], where noisy neurons are shut
off as the controllers converge towards a solution.
In order to measure the contribution of coarse-coding and filtering to the ANT+SCC performance improvement, we performed control experiment 1, where
the maximum size of the sensor cells was restricted to one grid square and where the
net concentration of each grid square within the spatial map was set to 1 (Table 16.4).
These two modifications effectively prevent coarse sensor cells from forming and
interacting to form fine representations. Instead, what is left is a group of fine sen-
sor neurons that are always active. With the sensor cell area being restricted to one
Fig. 16.11 (Left) Average area occupied by selected sensor neurons and number of sensor
neurons that participated in the selection process during evolution. (Right) Number of active
filter units and number of grid squares accessible by the sensor neurons during evolution.
Both plots show parameters from population best averaged over 30 evolutionary algorithm
runs
Fig. 16.12 Evolved coupled motor primitives ordering scheme and sensor priority list for
template detection shown for an ANT+SCC+CMP controller with a fitness of 0.98. See Tables 16.3 and 16.1 for reference
Fig. 16.13 Example of an evolved sensor layout (fitness of 0.98) using the ANT+SCC+CMP
model. (Left) Participating sensor neurons and receptive fields (template detection) shown.
(Right) Selected sensor neurons shown. Shaded area indicates resultant regions sensed by the
controller
grid square, the priority filter has no effect, since it requires at least two grid squares
with differing sensory input states. The fitness performance of this model is com-
parable to the baseline ANT model. However, since this model also uses coupled
motor primitives and it performed worse than ANT+CMP alone, the net impact of
these imposed constraints is actually a decrease in performance. Furthermore, we
performed a second control experiment (control experiment 2), where we imposed
the receptive field sizes to 3 × 3 grid squares and set the net concentration at each
grid square to 1 (Table 16.4). These two modifications ensure the receptive fields remain coarse and prevent coarse-coding interactions from occurring, while leaving the filter functionality within SCC turned on. The net effect is that we see a no-
ticeable drop in performance due to SCC. Both of these experiments indicate that the coarse-coding interaction between sensor neurons helps to find desired solutions within fewer genetic evaluations.
Fig. 16.14 Evolutionary performance comparison of ANT-based solutions for one to five
robots. Error bars indicate standard deviation
target area markings, etc.). Localized regions within the tissue do not exclusively
handle these specific user-defined, distal behaviors. Instead, the activity of the decision neurons indicates a distribution of specialized ‘feature detectors’ among independent networks.
Some of the emergent solutions evolved indicate that the individual robots all
figure out how to dump nearby resources into the designated berm area, but that not
all robots deliver resource all the way to the dumping area every time. Instead, the
robots learn to pass the resource material from one individual to another during an
encounter, forming a ‘bucket brigade’ (see Fig. 16.15, 16.18). This technique im-
proves the overall efficiency of the system as less time is spent traveling to and from
the dumping area. Since the robots cannot explicitly communicate with one another,
these encounters happen by chance rather than through preplanning. As with other
multiagent systems, communication between robots occurs through the manipula-
tion of the environment in the form of stigmergy. The task in [33] is similar in that
distributed objects must be delivered to a confined area; however, the hand-designed
controller does not scale as well as the ‘bucket brigade’ solution that the ANT con-
trollers discovered here. We also noticed that the robot controllers make use of the light beacon, located next to the dumping area, to home in on it; however, there is no noticeable difference in fitness performance when the robot controllers are evolved with the light turned off [32]. In these simulation experiments,
the robots have no way to measure the remaining time available; hence, the sys-
tem cannot greedily accumulate resource materials without periodically dumping
the material at the designated area.
Fig. 16.16 Tissue topology and neuronal activity of a select number of decision neurons. Decision neurons in turn ‘select’ (excite into operation) motor control neurons within their diffusion fields
Fig. 16.18 Snapshots of two rovers performing the resource gathering task using an ANT
controller. Frames 2 and 3 show the ‘bucket brigade’ behavior, while frames 4 and 5 show
the boundary avoidance behavior
The scalability of the evolved solution depends in large part on the number of
robots used during the training runs. The single-robot controller expectedly lacks
the cooperative behavior necessary to function well within a multiagent setting.
For example, such controllers fail to develop ‘robot collision avoidance’ or ‘bucket
brigade’ behaviors. Similarly, the robot controllers evolved with two or more robots
perform demonstrably worse when scaled down to a single robot, showing that the
solutions are dependent on cooperation among the robots.
16.6 Discussion
In this chapter, we use a global fitness function to train multirobot controllers with
limited supervision to perform self-organized task decomposition. Techniques that
perform well for the task make use of modularity and generalization. Modularity
is the use and reuse of components, while generalization is the process of finding
patterns or making inferences from many particulars. With a multirobot setup, mod-
ularity together with parallelism is exploited by evolved controllers to accomplish
the task. Rather than have one centralized individual attempting to solve a task using
global information, the individuals within the group are decentralized, make use of
local information and take on different roles through a process of self-organization.
This process of having different agents solve different subcomponents of the task in
order to complete the overall task is a form of task decomposition.
In this multirobot setup, there are both advantages and disadvantages to consider.
Multiple robots working independently exploit parallelism, helping to reduce the
time and effort required to complete a task. Furthermore, we also see that solutions
show improved overall system performance when evolved with groups of robots. It
should be noted that the density of robots is critical to solving the task. Higher
densities of robots result in antagonism, with robots spending more time getting out
of the way of one another rather than progressing on the task, leading to reduced
system performance.
It was shown that a CA look-up table architecture, lacking both modularity and generalization, is intractable due to the ‘bootstrap problem,’ resulting in premature search stagnation. This is because EAs are unable to
find an incrementally better solution during the early phase of evolution. Use of
neural networks is a form of functional modularization, where each neuron per-
forms sensory-information processing and makes solving the task more tractable.
However with increased numbers of hidden neurons, one is faced with the effects of
spatial crosstalk where noisy neurons interfere and drown out signals from feature-
detecting neurons [16]. Crosstalk in combination with limited supervision (through
use of a global fitness function) can again lead to the ‘bootstrap problem’ [23]. Thus,
choosing the wrong network topology may lead to a situation that is either unable
to solve the problem or difficult to train [31].
With the use of Artificial Neural Tissues (ANT), we introduce hierarchical func-
tional modularity into the picture. The tissue consists of modular neurons that can
form dynamic, modular networks of neurons. These groups of neurons handle
408 J. Thangavelautham, P. Grouchy, and G.M.T. D’Eleuterio
specialized functionality as we have shown and can be reused repeatedly for this
purpose. In contrast, with a standard fixed topology neural network, similar func-
tionality may need to evolve independently multiple times in different parts of the
network. In these various neural network architectures, modularity is functional,
with behaviors and capabilities existing in individual neurons or in groups and trig-
gered when necessary. ANT facilitates evolution of this capability by allowing for
regulatory functionality that enables dynamic activation and inhibition of neurons
within the tissue. Groups of neurons can be easily activated or shut off through a
coarse-coding process. Furthermore, with the ANT+SCC model, we allow for evo-
lution of both spatial and functional modularity. Spatial modularity is possible with
the SCC model, since we may get specialized sensory neurons that find spatial sen-
sory patterns. The output from these sensory neurons is used as input by various
groups of neurons active within ANT. These sensor neurons act as specialized fea-
ture detectors looking for either color cues or resources.
Comparison of the various evolvable control system models indicates that con-
trollers with an increased ability to generalize evolve desired solutions with far fewer
genetic evaluations. The CA lookup table architecture lacks generalization capabil-
ity and performed the worst. For the CA lookup table, evolved functionality needs to
be tuned for every unique combination of sensory inputs. A regular fixed topology
network performed better, but since the topology had no capacity to increase in size
or selectively activate/inhibit neurons within the network, it needed to tune most of the neurons in the network both to help perform input identification or actions and to prevent those same neurons from generating spurious outputs. Thus the
same capabilities may have to be acquired by different neurons located in different
parts of the network requiring an increased number of genetic evaluations to reach
a desired solution.
The standard ANT architecture can quickly shut off (mask out) neurons generat-
ing spurious output and thus does not require sequences of mutations to occur,
tuning each neuron within the tissue to acquire compatible (or similar) capabilities
or remain dormant. Thus certain networks of neurons within the tissue can acquire
and apply a certain specialized capability (Fig. 16.16), while most others remain
dormant through the regulatory process. Hence within ANT, increased functional
generalization is achieved through specialization. With the fixed topology neural
network, the net effect of all the neurons having to be active all the time implies
that the controllers have to evolve to individually silence each of the spurious neu-
rons or acquire the same capabilities repeatedly, thus implying reduced functional
generalization.
ANT+SCC can generalize even further. Apart from being able to selectively ac-
tivate/inhibit neurons, it can also choose to receive a coarse or fine representation
of the sensory input. In other words, it can perform further sensor generalization.
A coarse representation of the sensory input in effect implies some degree of gen-
eralization. The priority filtering functionality prioritizes certain sensor states over
others, while the coarse coding representation selects a subset of the inputs to send
to the filter. The resultant input preprocessing facilitates finding and exploiting un-
derlying patterns in the input set. The net effect is that the controller does not have
16 Coarse-Coding Techniques for Evolvable Multirobot Controllers 409
to deal with as many unique conditions since the number of unique sensory input
combinations seen by the ANT controller is reduced by SCC. This in turn facili-
tates evolution of controllers that require fewer generations to reach a desired solu-
tion. At the same time, over-generalization of the sensory inputs is problematic (see
ANT+SCC+CMP control experiment 2). By imposing coarse receptive fields and
preventing coarse-coded interactions, the controllers may miss key (fine) features
through prioritized filtering. Hence, although the sensory input space may effectively shrink, valuable information is lost through over-generalization. These
results justify the need for representations that selectively increase or decrease gen-
eralization of sensory input through coarse-coding.
This increased ability to generalize by the ANT+SCC model also seems to offset
the increased number of parameters (an increased search space) that need to be evolved.
Herein lies a tradeoff, as a larger search space alone may require a greater number
of genetic evaluations to reach a desired solution, but this may also provide some
unexpected benefits. In particular, a larger space may help in finding more feasible
or desirable solutions than those already present and may even reduce the necessary
number of genetic evaluations by guiding evolution (as in the ANT+SCC case). As
pointed out, ANT+SCC with its ability to further generalize sensory input appears to
provide a net benefit, even though it needs to be evolved with additional parameters
(in comparison to the standard ANT model).
This benefit is also apparent when comparing the baseline ANT controller with
ANT-ordered coupled motor primitives. The additional genomic parameters appear
to be beneficial once again, since the search process has access to more potential
solutions. Furthermore, it should be noted that these additional degrees of freedom
within the ANT+SCC controller do not appear to introduce deceptive sensory inputs
or capabilities. Deceptive inputs and capabilities can slow down the evolutionary
process, since the evolving system may retain these capabilities when they initially
provide a slight fitness advantage. However, these functionalities can in turn limit
or prevent the controllers from reaching the desired solution. Thus in effect, the
evolving population can get stuck in a local minimum, unable to transcend towards
a better fitness peak.
16.7 Conclusion
This chapter has reported on a number of experiments used to automatically gen-
erate neural network based controllers for multirobot systems. We have shown
that with judicious selection of a fitness function, it is possible to encourage self-
organized task decomposition using evolutionary algorithms. We have also shown
that by exploiting hierarchical modularity, regulatory functionality and the ability
to generalize, controllers can overcome tractability concerns. Controllers with in-
creased modularity and generalization abilities are found to evolve desired solutions
with fewer training evaluations by effectively reducing the size of the search space.
These techniques are also able to find novel multirobot coordination and control
strategies. To facilitate this process of evolution, coarse-coding techniques are used.
References
1. Beckers, R., Holland, O.E., Deneubourg, J.L.: From local actions to global tasks: Stig-
mergy and collective robots. In: Fourth International Workshop on the Syntheses and
Simulation of Living Systems, pp. 181–189. MIT Press, Cambridge (1994)
2. Bonabeau, E., Theraulaz, G., Deneubourg, J.-L., Aron, S., Camazine, S.: Self-
organization in social insects. Trends in Ecology and Evolution 12, 188–193 (1997)
3. Bonabeau, E., Dorigo, M., Theraulaz, G.: Swarm Intelligence: From Natural to Artificial
Systems. Oxford Univ. Press, New York (1999)
4. Bongard, J., Pfeifer, R.: Repeated structure and dissociation of genotypic and pheno-
typic complexity in artificial ontogeny. In: Proceedings of the Genetic and Evolutionary
Computation Conference 2001, San Francisco, CA, pp. 829–836 (2001)
5. Chantemargue, F., Dagaeff, T., Schumacher, M., Hirsbrunner, B.: Implicit cooperation
and antagonism in multi-agent systems, University of Fribourg, Technical Report (1996)
6. Chellapilla, K., Fogel, D.B.: Evolving an expert checkers playing program without using
human expertise. IEEE Transactions on Evolutionary Computation 5(4), 422–428 (2001)
7. Das, R., Crutchfield, J.P., Mitchell, M., Hanson, J.: Evolving globally synchronized cel-
lular automata. In: Proceedings of the Sixth International Conference on Genetic Algo-
rithms 1995, pp. 336–343. Morgan Kaufmann, San Francisco (1995)
8. Dellaert, F., Beer, R.: Towards an evolvable model of development for autonomous agent
synthesis. In: Artificial Life IV: Proceedings of the 4th International Workshop on the
Synthesis and Simulation of Living Systems, pp. 246–257. MIT Press, Cambridge (1994)
9. Demeris, J., Matarić, M.J.: Perceptuo-Motor Primitives in Imitation. In: Autonomous
Agents 1998 Workshop on Agents in Interaction Acquiring Competence (1998)
10. Federici, D., Downing, K.: Evolution and Development of a Multicellular Organism:
Scalability, Resilience, and Neutral Complexification. Artificial Life 12, 381–409 (2006)
11. Gauci, J., Stanley, K.: A Case Study on the Critical Role of Geometric Regularity in
Machine Learning. In: Proceedings of the 23rd AAAI Conference on AI. AAAI Press,
Menlo Park (2008)
12. Grassé, P.: La reconstruction du nid et les coordinations interindividuelles; la théorie de la stigmergie. Insectes Sociaux 35, 41–84 (1959)
13. Groß, R., Dorigo, M.: Evolving a Cooperative Transport Behavior for Two Simple
Robots. In: Liardet, P., Collet, P., Fonlupt, C., Lutton, E., Schoenauer, M. (eds.) EA
2003. LNCS, vol. 2936, pp. 305–316. Springer, Heidelberg (2004)
14. Gruau, F., Whitley, D., Pyeatt, L.: A comparison between cellular encoding and direct
encoding for genetic neural networks. In: Genetic Programming 1996, pp. 81–89. MIT
Press, Cambridge (1996)
15. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning. Springer, New York (2001)
16. Jacobs, R., Jordan, M., Barto, A.: Task decomposition through competition in a modular
connectionist architecture. Cognitive Science (15), 219–250 (1991)
17. Komosinski, M., Ulatowski, S.: Framsticks: towards a simulation of a nature-like world,
creatures and evolution. In: Proceedings of the 5th European Conference on Artificial
Life. Springer, Berlin (1998)
18. Leffler, B.R., Littman, M.L., Edmunds, T.: Efficient reinforcement learning with relocatable action models. In: Proceedings of the 22nd AAAI Conference on Artificial Intelligence, pp. 572–577 (2007)
19. Lindenmayer, A.: Mathematical models for cellular interaction in development I. Fila-
ments with one-sided inputs. Journal of Theoretical Biology 18, 280–289 (1968)
20. Lipson, H., Pollack, J.: Automatic design and manufacture of artificial lifeforms. Na-
ture 406, 974–978 (2000)
21. Matarić, M.J., Nilsson, M., Simsarian, K.T.: Cooperative multi-robot box-pushing. In:
IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 556–561
(1995)
22. Mautner, C., Belew, R.K.: Evolving Robot Morphology and Control. In: Sugisaka, M.
(ed.) Proceedings of Artificial Life and Robotics 1999 (AROB 1999), Oita, ISAROB
(1999)
23. Nolfi, S., Floreano, D.: Evolutionary Robotics: The Biology, Intelligence, and Technol-
ogy of Self-Organizing Machines. MIT Press, Cambridge (2000)
24. Parker, C.A., Zhang, H., Kube, C.R.: Blind bulldozing: Multiple robot nest construction.
In: IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 2010–
2015 (2003)
25. Pfeifer, R., Scheier, C.: Understanding Intelligence. MIT Press, Cambridge (1999)
26. Roggen, D., Federici, D.: Multi-cellular Development: Is There Scalability and Robustness to Gain? In: Yao, X., Burke, E.K., Lozano, J.A., Smith, J., Merelo-Guervós,
J.J., Bullinaria, J.A., Rowe, J.E., Tiňo, P., Kabán, A., Schwefel, H.-P. (eds.) PPSN 2004.
LNCS, vol. 3242, pp. 391–400. Springer, Heidelberg (2004)
27. Sims, K.: Evolving 3D Morphology and Behavior by Competition. In: Proceedings of
Artificial Life IV, pp. 28–39. MIT Press, Cambridge (1994)
28. Stanley, K., Miikkulainen, R.: Continual Coevolution through Complexification. In: Pro-
ceedings of the Genetic and Evolutionary Computation Conference 2002. Morgan Kauf-
mann, San Francisco (2002)
29. Thangavelautham, J., Barfoot, T., D’Eleuterio, G.M.T.: Coevolving communication and
cooperation for lattice formation tasks (updated). In: Advances In Artificial Life: Pro-
ceedings of the 7th European Conference on ALife, pp. 857–864 (2003)
30. Thangavelautham, J., D’Eleuterio, G.M.T.: A neuroevolutionary approach to emergent
task decomposition. In: Yao, X., Burke, E.K., Lozano, J.A., Smith, J., Merelo-Guervós,
J.J., Bullinaria, J.A., Rowe, J.E., Tiňo, P., Kabán, A., Schwefel, H.-P. (eds.) PPSN 2004.
LNCS, vol. 3242, pp. 991–1000. Springer, Heidelberg (2004)
31. Thangavelautham, J., D’Eleuterio, G.M.T.: A coarse-coding framework for a gene-
regulatory-based artificial neural tissue. In: Advances In Artificial Life: Proceedings of
the 8th European Conference on ALife, pp. 67–77 (2005)
32. Thangavelautham, J., Alexander, S., Boucher, D., Richard, J., D’Eleuterio, G.M.T.:
Evolving a Scalable Multirobot Controller Using an Artificial Neural Tissue Paradigm.
In: IEEE International Conference on Robotics and Automation, Washington, D.C. (2007)
33. Wawerla, J., Sukhatme, G., Mataric, M.: Collective construction with multiple robots. In:
IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 2696–2701
(2002)
34. Wilson, M., Melhuish, C., Sendova-Franks, A.B., Scholes, S.: Algorithms for building
annular structures with minimalist robots inspired by brood sorting in ant colonies. Au-
tonomous Robots 17, 115–136 (2004)
35. Zykov, V., Mytilinaios, E., Adams, B., Lipson, H.: Self-reproducing machines. Na-
ture 435(7038), 163–164 (2005)