Adaptation, Learning, and Optimization, Volume 15

Serkan Kiranyaz
Turker Ince
Moncef Gabbouj

Multidimensional Particle Swarm Optimization for Machine Learning and Pattern Recognition
Editors-in-Chief
Meng-Hiot Lim
Division of Circuits and Systems, School of Electrical and Electronic Engineering,
Nanyang Technological University, Nanyang 639798, Singapore
Yew-Soon Ong
School of Computer Engineering, Nanyang Technological University, Block N4,
2b-39 Nanyang Avenue, Nanyang, 639798, Singapore
Serkan Kiranyaz, Moncef Gabbouj
Department of Signal Processing
Tampere University of Technology
Tampere, Finland

Turker Ince
Department of Electrical and Electronics Engineering
Izmir University of Economics
Balcova, Izmir, Turkey
The definition of success—To laugh much; to win respect of intelligent persons and the
affections of children; to earn the approbation of honest critics and endure the betrayal of
false friends; to appreciate beauty; to find the best in others; to give one’s self; to leave the
world a little better, whether by a healthy child, a garden patch, or a redeemed social
condition; to have played and laughed with enthusiasm, and sung with exultation; to know
even one life has breathed easier because you have lived—this is to have succeeded.
The research work presented in this book has been carried out at the Department of Signal Processing of Tampere University of Technology, Finland, as a part of the MUVIS project. This book contains a rich software compilation of C/C++ projects with open source code, which can be requested from the authors via the email address: [email protected].
Over the years the authors have had the privilege to work with a wonderful
group of researchers, students, and colleagues, many of whom are our friends. Our collective achievements amount to much more than any individual achievement, and we strongly believe that together we have built something truly significant. We thank them all deeply. Our special thanks and acknowledgment go to Jenni Raitoharju and Stefan Uhlmann for their essential contributions.
Last but not least, the authors wish to express their love and gratitude to their beloved families for their understanding, their endless love, and the vital role they played in our lives and throughout the completion of this book. We would like to dedicate this book to our children: the newborn baby girl, Alya Nickole Kiranyaz; the baby boy, Doruk Ince; and Selma and Sami Gabbouj.
Contents

1 Introduction
1.1 Optimization Era
1.2 Key Issues
1.3 Synopsis of the Book
References
Acronyms

2D Two Dimensional
AAMI Association for the Advancement of Medical Instrumentation
AC Agglomerative Clustering
ACO Ant Colony Optimization
aGB Artificial Global Best
AI Artificial Intelligence
ANN Artificial Neural Network
API Application Programming Interface
AS Architecture Space
AV Audio-Visual
BbNN Block-based Neural Networks
BP Back Propagation
bPSO Basic PSO
CBIR Content-Based Image Retrieval
CGP Co-evolutionary Genetic Programming
CLD Color Layout Descriptor
CM Color Moments
CNBC Collective Network of Binary Classifiers
CPU Central Processing Unit
CV Class Vector
CVI Clustering Validity Index
DC Dominant Color
DE Differential Evolution
DFT Discrete Fourier Transform
DLL Dynamic Link Library
DP Dynamic Programming
EA Evolutionary Algorithm
ECG Electrocardiogram
ECOC Error Correcting Output Code
EFS Evolutionary Feature Synthesis
EHD Edge Histogram Descriptor
EM Expectation-Maximization
ENN Evolutionary Neural Networks
EP Evolutionary Programming
ES Evolution Strategies
FCM Fuzzy C-means
FDSA Finite Difference Stochastic Approximation
FeX Feature Extraction
FF Fundamental Frequency
FFT Fast Fourier Transform
FGBF Fractional Global Best Formation
FT Fourier Transform
FV Feature Vector
GA Genetic Algorithm
GB Global Best
GLCM Gray Level Co-occurrence Matrix
GMM Gaussian Mixture Model
GP Genetic Programming
GTD Ground Truth Data
GUI Graphical User Interface
HMM Hidden Markov Model
HSV Hue, Saturation and (Luminance) Value
HVS Human Visual System
KF Key-Frame
KHM K-Harmonic Means
KKT Karush–Kuhn–Tucker
KLT Karhunen–Loève Transform
kNN k Nearest Neighbours
LBP Local Binary Pattern
LP Linear Programming
MAP Maximum a Posteriori
MDA Multiple Discriminant Analysis
MD PSO Multi-dimensional Particle Swarm Optimization
ML Maximum-Likelihood
MLP Multilayer Perceptron
MPB Moving Peaks Benchmark
MRF Markov Random Field
MSE Mean-Square Error
MST Minimum Spanning Tree
MUVIS Multimedia Video Indexing and Retrieval System
NLP Nonlinear Programming
P Precision
PCA Principal Component Analysis
PNR Positive to Negative Ratio
PSO Particle Swarm Optimization
R Recall
RBF Radial Basis Function
RF Random Forest
Tables
Table 5.9 Pseudo-code for the first SA-driven PSO approach
Table 5.10 PSO plug-in for the second approach
Table 5.11 MD PSO plug-in for the second approach
Table 5.12 Benchmark functions without dimensional bias
Table 5.13 Statistical results from 100 runs over seven benchmark functions
Table 5.14 Statistical results between full-cost and low-cost modes from 100 runs over seven benchmark functions
Table 5.15 t-test results for statistical significance analysis for both SPSA approaches, A1 and A2
Table 5.16 t-table presenting degrees of freedom vs. probability
Table 5.17 Implementation of FGBF pseudo-code given in Table 5.2
Table 5.18 The environmental change signaling from the main MD PSO function
Table 5.19 MD PSO with FGBF implementation for MPB
Table 5.20 The fitness function MPB()
Table 6.1 Processing time (in ms) per iteration for MD PSO with FGBF clustering using four different swarm sizes; the number of data items is given in parentheses with the sample data space
Table 6.2 Statistical results from 20 runs over eight 2D data spaces
Table 6.3 MD PSO initialization in the function CPSOcluster::PSOThread()
Table 6.4 MST formation in the function CPSO_MD<T,X>::FGBF_CLFn()
Table 6.5 MST formation in the function CPSO_MD<T,X>::FGBF_CLFn()
Table 6.6 Formation of the centroid groups by breaking the MST iteratively
Table 6.7 Formation of the aGB particle
Table 6.8 The CVI function, CPSOcluster::ValidityIndex2
Table 6.9 Initialization of DC extraction in the CPSOcolorQ::PSOThread() function
Table 6.10 The plug-in function SADFn() for the second SA-driven approach, A2
Table 6.11 The plug-in for the first SA-driven approach, A1
Table 7.1 Mean (μ) and standard deviation (σ) of classification error rates (%) over test datasets
Table 7.2 A sample architecture space with range arrays, R2min = {13, 6, 6, 3, 2} and R2max = {13, 12, 10, 5, 2}
Table 7.3 Classification error rate (%) statistics of MD PSO when applied to two architecture spaces
1.1 Optimization Era

The optimization era started in the early days of Newton, Lagrange, and Cauchy. In particular, the development of mathematical foundations such as differential calculus methods capable of moving toward an optimum of a function was possible thanks to the contributions of Newton, Gauss, and Leibniz to calculus. Cauchy proposed the first steepest descent method to solve unconstrained optimization problems. Furthermore, Bernoulli, Euler, Lagrange, Fermat, and Weierstrass developed the foundations of function minimization in calculus, while Lagrange invented the method of optimization for constrained problems using the unknown multipliers later named after him, i.e., Lagrange multipliers. In the second half of the twentieth century, with the invention of digital computers, a massive number of new techniques and algorithms were developed to solve complex optimization problems, and such ongoing efforts stimulated further research in different and entirely new areas of optimization. A major breakthrough was "linear programming", invented by George Dantzig. To name a few other milestones in this area:
• Kuhn and Tucker in 1951 carried out studies that later led to research on nonlinear programming, the general case in which the objective function, the constraints, or both contain nonlinear parts.
• Bellman in 1957 presented the principle of optimality for dynamic programming, an optimization strategy based on splitting the problem into smaller sub-problems. The equation named after him describes the relationship between these sub-problems.
Optimization problems are often multi-modal, meaning that there exist some deceiving local optima, as seen in the examples illustrated in Fig. 1.1. This is one of the most challenging types of optimization problem, since the optimum solution can be hard, if not impossible, to find (e.g., see the last example in the figure). Recall that deterministic optimization methods are based on the calculation of derivatives or their approximations. They converge to the position
where the function gradient is null following the direction of the gradient vector.
When the solution space of the problem is uni-modal, they are reliable, robust, and
fast in finding the global optimum; however, due to their iterative approach they
usually fail to find the global optimum in multi-modal problems. Instead, they get
trapped into a local minimum. Random initialization of multiple runs may help
find better local optima; however, finding the global solution is never guaranteed.
For multi-modal problems with a large number of local optima, random initialization may result in random convergence; therefore, the results are unrepeatable and sub-optimal. Furthermore, the assumptions required for their application may seldom hold in practice, e.g., the derivative of the function may not be defined.
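This basin-of-attraction behavior is easy to demonstrate. The following Python sketch (an illustration under simplified assumptions, not taken from the book's software) runs plain gradient descent on a one-dimensional multi-modal function; each run ends in whichever local minimum's basin contains its starting point:

```python
def gradient_descent(df, x0, lr=0.01, steps=1000):
    """Follow the negative gradient from x0; the run converges to the
    stationary point whose basin of attraction contains x0."""
    x = x0
    for _ in range(steps):
        x -= lr * df(x)
    return x

# A 1-D multi-modal objective: f(x) = x^4 - 3x^2 + x has two minima,
# a global one near x = -1.30 and a local one near x = 1.13.
df = lambda x: 4 * x ** 3 - 6 * x + 1   # its derivative f'(x)

left = gradient_descent(df, x0=-2.0)    # lands in the global minimum
right = gradient_descent(df, x0=+2.0)   # gets trapped in the local minimum
```

Both runs drive the gradient to zero, yet only the first finds the global optimum; without knowing the basins in advance, only restarting from many random points can help, and even that gives no guarantee.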
Besides function minimization in calculus, deterministic optimization methods
are commonly used in several important application areas. For example in artificial
neural networks, the well-known training method, back propagation (BP) [3], is
indeed a gradient descent learning algorithm. The parameters (weights and biases)
of a feed-forward neural network are randomly initialized and thus each individual
BP run performs a gradient descent in the solution (error) space to converge to a
(new) set of parameters. Another typical example is the expectation-maximization
(EM) method, which finds the maximum likelihood or maximum a posteriori
(MAP) estimates of the parameters in some statistical models. The most common
models are Gaussian mixture models (GMMs) that are parametric probability
density functions represented as a weighted sum of Gaussian probability density
functions. GMMs are commonly used as a parametric model of the probability
distribution of the properties or features in several areas in signal processing,
pattern recognition, and many other related engineering fields. In order to deter-
mine the parameters of a particular GMM, EM is used as an iterative method
which alternates between performing an expectation (E) step, which computes the expectation of the log-likelihood evaluated using the current estimate of the parameters, and a maximization (M) step, which computes the parameters maximizing the expected log-likelihood found in the E step. These estimates are then used to determine the distribution of the latent variables in the next E step, and so on.

[Fig. 1.1 Sample objective functions f(x): a uni-modal function, a multi-modal function with few local optima, and a multi-modal function with no useful gradient information and an isolated global optimum]
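The E/M alternation just described can be sketched for the simplest case of a two-component, one-dimensional Gaussian mixture (an illustrative Python sketch, not the book's C/C++ software; the initialization at the data extremes and the fixed iteration count are simplifying assumptions):

```python
import math
import random

def em_gmm_1d(data, iters=50):
    """EM for a two-component 1-D Gaussian mixture model.
    Returns (weights, means, variances, final log-likelihood)."""
    # Crude initialization: put the two means at the data extremes.
    mu = [min(data), max(data)]
    var = [1.0, 1.0]
    w = [0.5, 0.5]
    ll = float("-inf")
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        resp, ll = [], 0.0
        for x in data:
            p = [w[k] / math.sqrt(2 * math.pi * var[k]) *
                 math.exp(-(x - mu[k]) ** 2 / (2 * var[k])) for k in (0, 1)]
            s = p[0] + p[1]
            ll += math.log(s)
            resp.append((p[0] / s, p[1] / s))
        # M-step: re-estimate parameters from the responsibility-weighted data.
        for k in (0, 1):
            nk = sum(r[k] for r in resp)
            w[k] = nk / len(data)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var[k] = max(sum(r[k] * (x - mu[k]) ** 2
                             for r, x in zip(resp, data)) / nk, 1e-6)
    return w, mu, var, ll

# Two well-separated Gaussian clusters centered at 0 and 5.
rng = random.Random(1)
data = ([rng.gauss(0.0, 1.0) for _ in range(200)] +
        [rng.gauss(5.0, 1.0) for _ in range(200)])
w, mu, var, ll = em_gmm_1d(data)
```

On this easy, well-separated mixture EM recovers both means; on a highly multi-modal likelihood surface the same iteration can just as well settle on a poor local optimum, which is exactly the deficiency discussed here.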
EM is a typical example of deterministic optimization methods, which performs
greedy descent in the error space and if the step sizes are chosen properly, it can be
easily shown that EM becomes identical to the gradient descent method. GMMs
are especially useful as a data mining tool and frequently used to model data
distributions and to cluster them. K-means [4] is another example of a determin-
istic optimization technique, and perhaps one of the most popular data clustering
algorithm ever proposed. Similarly, when applied to complex clustering problems
in high-dimensional data spaces, as a natural consequence of the highly multi-
modal nature of the solution space, K-means cannot converge to the global opti-
mum—meaning that the true clusters cannot be revealed and either over- or
usually under-clustering occurs. See, for instance, the clustering examples in Fig.
4.1 where clusters are colored into red, green, and blue for a better visualization
and the data points are represented by colored pixels with the cluster centroids
shown by a white ‘+’. In the simulations, it took 4, 13, and 127 K-means runs,
respectively, to extract the true clusters in the first three examples as shown in the
figure. Even though we performed more than 10,000 runs, no individual run
successfully clusters the last example with 42 clusters. This is an expected out-
come considering the extremely multi-modal error space of the last example with a
massive number of local optima.
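The multi-run behavior described above is easy to reproduce. The following Python sketch (a minimal Lloyd's-algorithm illustration, not the book's software; the data, restart count, and tolerances are assumptions) runs K-means from several random initializations and keeps the run with the lowest clustering cost:

```python
import random

def kmeans(points, k, iters=50, seed=0):
    """Lloyd's algorithm on 2-D points. The outcome depends on the
    randomly sampled initial centroids, which is why multi-modal
    data may need many independent runs."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda j: (p[0] - centroids[j][0]) ** 2 +
                                            (p[1] - centroids[j][1]) ** 2)
            clusters[j].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new = [(sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
               if c else centroids[j] for j, c in enumerate(clusters)]
        if new == centroids:       # converged to a (possibly local) optimum
            break
        centroids = new
    return centroids

def sse(points, centroids):
    """Sum of squared distances to the nearest centroid (the K-means cost)."""
    return sum(min((p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids)
               for p in points)

# Three well-separated 2-D clusters; keep the best of 25 random restarts.
rng = random.Random(42)
centers = [(0.0, 0.0), (10.0, 0.0), (5.0, 8.0)]
points = [(cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5))
          for cx, cy in centers for _ in range(50)]
best = min((kmeans(points, 3, seed=s) for s in range(25)),
           key=lambda c: sse(points, c))
```

Even on this toy problem some individual seeds place two initial centroids in the same blob and converge to a poor local optimum; only the multi-run selection recovers the true centroids, mirroring the 4, 13, and 127 runs reported above. Note also that k = 3 must be supplied in advance.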
Such a deficiency turned attention toward stochastic optimization methods and particularly toward evolutionary algorithms (EAs) such as the genetic algorithm (GA) [5], genetic programming (GP) [6], evolution strategies (ES) [7], and evolutionary programming (EP) [8]. All EAs are population-based techniques which can often avoid being trapped in a local optimum; however, finding the optimum solution is never guaranteed. Moreover, another major drawback remains unaddressed by all of them: the inability to find the true dimension of the solution space in which the global optimum resides. Many problems, such as data clustering, require this information in advance, without which convergence to the global optimum is simply not possible; see, e.g., the simple 2D data clustering examples in Fig. 1.2, where the true number of clusters (K) must be set in advance.
[Fig. 1.2 Sample clustering operations in 2D data space using the K-means method, where the true K value must be set in advance]

In many clustering problems, especially complex ones with many clusters, this may not be feasible, if not impossible, to determine in advance, and thus the optimization method should find it along with the optimum solution in that dimension. This is also a major problem in many of the optimization methods mentioned earlier. For instance, BP can only train a feed-forward ANN without searching for the optimum configuration for the learning problem at hand. Therefore, what a typical BP run can accomplish is in fact a sub-optimal parameter setting of a sub-optimal ANN configuration.
All optimization methods so far mentioned and many more are applicable only
to static problems. Many real-world problems are dynamic and thus require sys-
tematic re-optimizations due to system and/or environmental changes. Even
though it is possible to handle such dynamic problems as a series of individual
processes via restarting the optimization algorithm after each change, this may
lead to a significant loss of useful information, especially when the change is not
too drastic, but rather incremental in nature. Since most such problems have a multi-modal nature, which further complicates dynamic optimization, the need for powerful and efficient optimization techniques is evident. For the reasons mentioned earlier, in the last decade efforts have focused on EAs and particularly on particle swarm optimization (PSO) [9–11], which has obvious ties with the EA family and lies somewhere between GA and EP. Yet unlike GA, PSO has no complicated evolutionary operators such as crossover, selection, and mutation, and it depends heavily on stochastic processes. However, PSO can exhibit major drawbacks such as parameter dependency [12] and loss of diversity [13]. In particular, the latter phenomenon increases the probability of being trapped in local optima, and it is the main source of the premature convergence problem, especially when the dimensionality of the search space is large [14] and the problem to be optimized is multi-modal [13, 15].
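For reference, the basic (gbest) PSO iteration that these drawbacks concern can be sketched as follows (a minimal Python illustration with commonly used inertia and acceleration coefficients; the coefficient values, bounds, and test function are assumptions, not the book's implementation):

```python
import random

def pso(f, dim, n_particles=30, iters=200, bounds=(-10.0, 10.0),
        w=0.72, c1=1.49, c2=1.49, seed=0):
    """Basic (gbest) PSO: each particle's velocity is pulled toward its
    personal best (pbest) and the swarm's global best (gbest)."""
    rng = random.Random(seed)
    lo, hi = bounds
    pos = [[rng.uniform(lo, hi) for _ in range(dim)]
           for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_f = [f(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_f[i])
    gbest, gbest_f = pbest[g][:], pbest_f[g]
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            fi = f(pos[i])
            if fi < pbest_f[i]:           # update personal best
                pbest[i], pbest_f[i] = pos[i][:], fi
                if fi < gbest_f:          # update global best
                    gbest, gbest_f = pos[i][:], fi
    return gbest, gbest_f

# Minimize a 5-D sphere function shifted so the optimum is at (2, ..., 2).
sphere = lambda p: sum((x - 2.0) ** 2 for x in p)
best, val = pso(sphere, dim=5)
```

Because every particle is attracted to the single gbest position, the swarm tends to collapse around it; this is the loss-of-diversity mechanism behind premature convergence, and the swarm operates at a fixed dimension that must be chosen in advance.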
Low-level features (also called descriptors in some application domains) play a
central role in many computer vision, pattern recognition, and signal processing
applications. Features are various types of information extracted from the raw data
and represent some of its characteristics or signatures. However, the (low-level) features that can be extracted automatically usually lack the discrimination power needed for accurate processing, especially in the case of large and varied media content reserves. In the content-based image indexing and retrieval (CBIR) area in particular, this is referred to as the "Semantic Gap" problem, which denotes the rather large gap between the low-level features and their limited ability to represent the "content." Therefore, it is crucial to optimize these features, particularly for achieving a reasonable performance in multimedia classification, indexing, and retrieval applications, and perhaps in many other related fields. Since features are generally high dimensional, the optimization method should naturally tackle multi-modality and the so-called "curse of dimensionality." Furthermore, in such application domains the features may not be static but rather dynamically changing (new features can be extracted or existing features modified). This brings the scalability issue along with the need for instantaneous adaptability to whatever (incremental) change may occur in time. All in all, these issues are beyond the capability of the basic PSO or any other EA method alone and will thus be the major subject of this book.
This book, first of all, is not about PSO or any traditional optimization method, since there are many brilliant books and publications on them. As the key issues were highlighted in the previous section, we shall basically start where they left off. In this book, after a proper introduction to the general field and related work, we shall first present a novel optimization technique, the so-called multi-dimensional particle swarm optimization (MD PSO), which re-forms the native structure of swarm particles in such a way that they can make inter-dimensional passes with a dedicated dimensional PSO process. Therefore, in a multi-dimensional search space where the optimum dimension is unknown, swarm particles can seek both positional and dimensional optima. This eventually negates the necessity of setting a fixed dimension a priori, which is a common drawback for the family of swarm optimizers. Therefore, instead of operating at a fixed dimension N, the MD PSO algorithm is designed to seek both positional and dimensional optima within a dimension range (Dmin ≤ N ≤ Dmax). Nevertheless, MD PSO is still susceptible to
premature convergence, as inherited from the basic PSO. To address this problem we shall then introduce an enhancement procedure called the fractional global best formation (FGBF) technique, which collects all promising dimensional components and fractionally creates an artificial global-best particle (aGB) that has the potential to be a better "guide" than the PSO's native gbest particle. We shall further enrich this scope by introducing the application of the stochastic approximation (SA) technique to further "guide the guide." As an alternative and generic approach to FGBF, SA-driven PSO and its multi-dimensional extension also address the premature convergence problem in a more generic way. Both SA-driven PSO and MD PSO with FGBF can then be applied to solve many practical problems in an efficient way. The first application is nonlinear function minimization: we shall demonstrate the capability of both techniques for accurate root finding in challenging high-dimensional nonlinear functions. The second application domain is dynamic data clustering, which presents a highly complex and multi-modal error surface where traditional optimization or clustering techniques usually fail. As discussed earlier, the dynamic clustering problem requires the determination of the solution space dimension (i.e., the number of clusters) and an effective mechanism to avoid local optima traps (both dimensionally and spatially), particularly in complex clustering schemes in high dimensions. The former requirement justifies the use of the MD PSO technique, while the latter calls for FGBF (or the SA-driven approach). We shall show that, if properly applied, the true number of clusters with accurate center localization can be achieved. Several practical applications shall be demonstrated based on this dynamic clustering technique; to start with, 2D synthetic data spaces with ground truth clusters are first examined to show the accuracy and
ECG patterns by evolving the optimal network structure and thus achieves a high
accuracy over large datasets. One major advantage of this system is that due to its
parameter invariance, it is highly generic and thus applicable to any ECG dataset.
We then focus on the multimedia classification, indexing, and retrieval problem
and provide detailed solutions within two chapters. In this particular area, the
following key questions still remain unanswered: (1) how to select relevant features so as to achieve the highest discrimination over certain classes, (2) how to
combine them in the most effective way, (3) which distance metric to apply, (4)
how to find the optimal classifier configuration for the classification problem at
hand, (5) how to scale/adapt the classifier if a large number of classes/features are
present, and finally, (6) how to train the classifier efficiently to maximize the
classification accuracy. Traditional classifiers, such as support vector machines (SVMs), random forests (RF), and artificial neural networks (ANNs), cannot cope with such requirements, since a single classifier, no matter how powerful and well trained, cannot efficiently discriminate a vast number of classes over an indefinitely large set of features where both classes and features are not static, e.g., in the case of multimedia repositories. Therefore, in order to address these problems and hence maximize the classification accuracy, which will in turn boost the retrieval performance, we shall present a novel and generic framework design that embodies a collective network of evolutionary classifiers.
At a given time, this allows the creation and design of a dedicated classifier for
discriminating a certain class from the others. Each evolution session will ‘‘learn’’
from the current best classifier and can improve it further, possibly as a result of
the (incremental) optimization, which may find another configuration in the
architecture space as the ‘‘optimal.’’ Moreover, with each evolution, new classes or
features can also be introduced which signals the collective classifier network to
create new corresponding networks and classifiers within to adapt dynamically to
the change. In this way, the collective classifier network will be able to dynami-
cally scale itself to the indexing requirements of the multimedia database while
striving to maximize the classification and retrieval accuracies thanks to the
dedicated classifiers within.
Finally, the multimedia indexing and retrieval problem further requires a
decisive solution for the well-known "Semantic Gap" problem. Initially, the focus
of the research in this field was on content analysis and retrieval techniques linked
to a specific medium. More recently, researchers have started to combine features
from various media. They have also started to study the benefit of knowledge discovery in accurate content descriptions, refining relevance feedback, and, more generally, improving indexing. All in all, narrowing the semantic gap basically
medium’s content: the features. Features are the basis of a content-based indexing
and retrieval system. They represent the information extracted from a medium in a
suitable way, they are stored in an index, and used during the query processing.
They basically characterize the medium’s signature. However, especially the low-
level features, which can be extracted automatically, currently lack the
References
1. A. Kruger, Median-cut color quantization. Dr. Dobb's J., 46–54 and 91–92, Sept. (1994)
2. S. Smale, On the average number of steps of the simplex method of linear programming. Math. Program., 241–262 (1983)
3. Y. Chauvin, D.E. Rumelhart, Back Propagation: Theory, Architectures, and Applications (Lawrence Erlbaum Associates, UK, 1995)
4. G. Hamerly, C. Elkan, Alternatives to the k-means algorithm that find better clusterings, in Proceedings of the 11th ACM CIKM (2002), pp. 600–607
5. D. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning (Addison-Wesley, MA, 1989), pp. 1–25
6. S. Kirkpatrick, C.D. Gelatt, M.P. Vecchi, Optimization by simulated annealing. Science 220, 671–680 (1983)
7. T. Bäck, F. Kursawe, Evolutionary algorithms for fuzzy logic: a brief overview, in Fuzzy Logic and Soft Computing (World Scientific, Singapore, 1995), pp. 3–10
8. U.M. Fayyad, G.P. Shapire, P. Smyth, R. Uthurusamy, Advances in Knowledge Discovery and Data Mining (MIT Press, Cambridge, 1996)
9. A.P. Engelbrecht, Fundamentals of Computational Swarm Intelligence (Wiley, New York, 2005)
10. J. Kennedy, R. Eberhart, Particle swarm optimization, in Proceedings of the IEEE International Conference on Neural Networks, vol. 4 (Perth, Australia, 1995), pp. 1942–1948
11. M.G. Omran, A. Salman, A.P. Engelbrecht, Particle Swarm Optimization for Pattern Recognition and Image Processing (Springer, Berlin, 2006)
12. M. Løvberg, T. Krink, Extending particle swarm optimizers with self-organized criticality. Proc. IEEE Congr. Evol. Comput. 2, 1588–1593 (2002)
13. M. Riedmiller, H. Braun, A direct adaptive method for faster backpropagation learning: the RPROP algorithm, in Proceedings of the IEEE International Conference on Neural Networks (1993), pp. 586–591
14. G.-J. Qi, X.-S. Hua, Y. Rui, J. Tang, H.-J. Zhang, Image classification with kernelized spatial-context. IEEE Trans. Multimedia 12(4), 278–287 (2010). doi:10.1109/TMM.2010.2046270
15. K. Ersahin, B. Scheuchl, I. Cumming, Incorporating texture information into polarimetric radar classification using neural networks, in Proceedings of the IEEE International Geoscience and Remote Sensing Symposium (Anchorage, USA, 2004), pp. 560–563
Chapter 2
Optimization Techniques: An Overview
Since the fabric of the universe is most perfect, and is the work
of a most wise Creator, nothing whatsoever takes place in the
universe in which some form of maximum or minimum does not
appear.
Leonhard Euler
It is an undeniable fact that all of us are optimizers, as we all make decisions for the sole purpose of maximizing our quality of life, our productivity in time, and our welfare in one way or another. Since this is an ongoing struggle to create the best possible among many inferior designs, optimization was, is, and will always be a core requirement of human life, and this fact has driven the development of a massive number of techniques in this area, from the early ages of civilization until now. The efforts and lives that many brilliant philosophers, mathematicians, scientists, and engineers have dedicated to this aim have brought about the high level of civilization we enjoy today. Therefore, we find it imperative to first review the major optimization techniques, along with the philosophy and long history behind them, before going into the details of the methods presented in this book. This chapter begins with a detailed history of optimization, covering the major achievements over time along with the people behind them. The rest of the chapter then focuses on the major optimization techniques, briefly explaining their mathematical theory and foundations over some sample problems.
2.1 History of Optimization

In its most basic terms, optimization is a mathematical discipline that concerns finding the extrema (minima and maxima) of numbers, functions, or systems. The great ancient philosophers and mathematicians created its foundations by defining the optimum (as an extreme, a maximum, or a minimum) over several fundamental domains such as numbers, geometrical shapes, optics, physics, astronomy, the quality of human life and state government, and several others. This era started with Pythagoras of Samos (569 BC to 475 BC), a Greek philosopher who made important developments in mathematics, astronomy, and the theory of music. He is often described as the first pure mathematician. His most important philosophical foundation is [1]: "that at its deepest level, reality is mathematical in nature."
Zeno of Elea (490 BC to 425 BC), a Greek philosopher famous for the paradoxes he posed, was the first to conceptualize the notion of extremes in numbers, that is, infinitely small or large quantities. He took a controversial point of view in mathematical philosophy, arguing that any motion is impossible through the infinite subdivisions described by Zeno's Dichotomy; accordingly, one cannot even start moving at all. Probably, Zeno was enjoying the challenging concept of "infinity" with his contemporaries, without the proper formulation of limit theory and calculus at the time.
Later, Plato (427 BC to 347 BC), one of the most important Greek philosophers and mathematicians, learned from the disciples of Pythagoras and formed his idea [2]: … "that the reality which scientific thought is seeking must be expressible in mathematical terms, mathematics being the most precise and definite kind of thinking of which we are capable. The significance of this idea for the development of science from the first beginnings to the present day has been immense." About 75 years before Euclid wrote The Elements, Plato wrote The Republic around 375 BC, where he set out his ideas on education: one must study the five mathematical disciplines, namely arithmetic, plane geometry, solid geometry, astronomy, and harmonics. After mastering mathematics, one can proceed to the study of philosophy. The following dialog is part of the argument he made:
"…But when it is combined with the perception of its opposite, and seems to involve the conception of plurality as much as unity, then thought begins to be aroused within us, and the soul perplexed and wanting to arrive at a decision asks 'What is absolute unity?' This is the way in which the study of the one has a power of drawing and converting the mind to the contemplation of reality."
"And surely," he said, "this characteristic occurs in the case of one; for we see the same thing to be both one and infinite in multitude?"
"Yes," I said, "and this being true of one, it must be equally true of all number?"
"Certainly"
Aristotle (384 BC to 322 BC), who was one of the most influential Greek
philosophers and thinkers of all times, made important contributions by system-
atizing deductive logic. He is perhaps best described by the authors of [3] as,
‘‘Aristotle, more than any other thinker, determined the orientation and the content
of Western intellectual history. He was the author of a philosophical and scientific
system that through the centuries became the support and vehicle for both medi-
eval Christian and Islamic scholastic thought: until the end of the seventeenth
century, Western culture was Aristotelian. And, even after the intellectual revo-
lutions of centuries to follow, Aristotelian concepts and ideas remained embedded
in Western thinking.’’ He introduced the well-known principle, ‘‘The whole is
more than the sum of its parts.’’ Both Greek philosophers, Plato and Aristotle,
used their ‘‘powers of reasoning’’ to determine the best style of human life. Their
goal was to develop the systematic knowledge of how the behavior of both
individuals and society could be optimized. They particularly focused on questions
of ethics (for optimizing the lifestyle of an individual) and politics (for optimizing
the functioning of the state). At the end, both Plato and Aristotle recognized that
2.1 History of Optimization 15
the knowledge of how the members of society could optimize their lives was
crucial. Both believed that the proper development of an individual’s character
traits was the key to living an optimal lifestyle.
As a follower of Plato’s philosophy, Euclid of Alexandria (325 BC to 265 BC)
was the most prominent ancient Greek mathematician, best known for his work on
geometry, The Elements, which not only makes him a leading mathematician of
all time but also one who influenced the development of Western mathematics for
more than 2,000 years [4]. It is probable that no results in The Elements were first
proved by Euclid but the organization of the material and its exposition are cer-
tainly due to him. He solved some of the earliest optimization problems in
Geometry, e.g., in the third book, there is a proof that the greatest and least straight
lines can be drawn from a point to the circumference of a circle; in the sixth book
it is proven that a square has the maximum area among all rectangles with given
total length of the edges.
Archimedes of Syracuse (287 BC to 212 BC) is considered by most historians of
mathematics as one of the greatest mathematicians of all time [5]. He was the
inventor of the water pump, the so-called Archimedes’ screw that consists of a pipe
in the shape of a helix with its lower end dipped in the water. As the device is
rotated the water rises up the pipe. This device is still in use in many places in the
world. Although he achieved great fame due to his mechanical inventions, he
believed that pure mathematics was the only worthy pursuit. His achievements in
calculus were outstanding. He perfected a method of integration which allowed
him to find areas, volumes, and surface areas of many bodies by using the method
of exhaustion, i.e., one can calculate the area under a curve by approximating it by
the areas of a sequence of polygons. In Heath [6], it is stated that ‘‘Archimedes
gave birth to the calculus of the infinite conceived and brought to perfection by
Kepler, Cavalieri, Fermat, Leibniz and Newton.’’ Unlike Zeno and other Greek
philosophers, he and Euclid were the first mathematicians who were not troubled
by the apparent contradictions of the concept of infinity. For instance, they contrived
the method of exhaustion to find the area of a circle without knowing the
exact value of π.
Heron of Alexandria (c. 10 AD to c. 75 AD), an important geometer and
worker in mechanics, wrote several books on mathematics, mechanics, and
even optics. His book Catoptrica is attributed by some historians
to Ptolemy, although most now believe that it was indeed Heron’s genuine
work. In this book, Heron states that vision occurs as a result of light emitted
by the eyes with infinite velocity. He also showed that light travels between two
points along the path of the shortest length.
Pappus of Alexandria (c. 290 AD to c. 350 AD) is the last of the great Greek
geometers and made substantial contributions on many geometrical optimization
problems. He proved what is known as the ‘‘honeycomb conjecture’’ that the
familiar honeycomb shape, which is a repeating hexagonal pattern (volumetric
hexagonal-shaped cylinders, stacked one against the other in an endless array) was
the optimal way of storing honey. Pappus introduces this problem with one of the
most charming essays in the history of mathematics, one that has frequently been
16 2 Optimization Techniques: An Overview
excerpted under the title: On the Sagacity of Bees. In that, he speaks poetically of the
divine mission of bees to bring from heaven the wonderful nectar known as honey,
and says that in keeping with this mission they must make their honeycombs without
any cracks through which honey could be lost. Having also a divine sense of
symmetry, the bees had to choose among the regular shapes that could fulfill this
condition (e.g., triangles, squares, and hexagons). In the end, they naturally chose the
hexagon because a hexagonal prism required the minimum amount of material to
enclose a given volume. He collected these ideas in his Book V and states his aim
[7]: ‘‘Bees, then, know just this fact which is useful to them, that the hexagon is
greater than the square and the triangle and will hold more honey for the same
expenditure of material in constructing each. But we, claiming a greater share in
wisdom than the bees, will investigate a somewhat wider problem, namely that, of
all equilateral and equiangular plane figures having an equal perimeter, that which
has the greater number of angles is always the greater, and the greatest of them all is
the circle having its perimeter equal to them.’’
Also in Book V, Pappus discusses the 13 semi-regular solids discovered by
Archimedes and solves other isoperimetric problems which were apparently dis-
cussed by the Athenian mathematician Zenodorus (200 BC to 140 BC). He
compares the areas of figures with equal perimeters and volumes of solids with
equal surface areas, proving that the sphere has the maximum volume among
regular solids with equal surface area. He also proves that, for two regular solids
with equal surface area, the one with the greater number of faces has the greater
volume. In Book VII, Pappus defines the two basic elements of analytical problem
solving, the analysis and synthesis [7] as, … ‘‘in analysis we suppose that which is
sought to be already done, and inquire what it is from which this comes about, and
again what is the antecedent cause of the latter, and so on until, by retracing our
steps, we light upon something already known or ranking as a first principle… But
in synthesis, proceeding in the opposite way, we suppose to be already done that
which was last reached in analysis, and arranging in their natural order as con-
sequents what were formerly antecedents and linking them one with another, we
finally arrive at the construction of what was sought…’’
During the time of the ancient great Greek philosophers and thinkers, arithmetic
and geometry were the two branches of mathematics. There were some early
attempts to do algebra in those days; however, they lacked the formalization of
algebra, namely the arithmetic operators that we take for granted today, such as
‘‘+, −, ×, ÷’’ and of course, ‘‘=’’. Much of the world, including Europe, also
lacked an efficient numeric system like the one developed in the Hindu and Arabic
cultures. Al’Khwarizmi (790–850) was a Muslim Persian mathematician who
wrote on Hindu–Arabic numerals and was among the first to use the number zero
as a place holder in positional base notation. Algebra as a branch of mathematics
can be said to date to around the year 825 when Al’Khwarizmi wrote the earliest
known algebra treatise, Hisab al-jabr w’al-muqabala. The word ‘‘algebra’’ comes
from the Arabic word al-jabr (meaning ‘‘to restore’’) in the title. Moreover, the
English term ‘‘algorithm’’ was derived from Al’Khwarizmi’s name through its
Latin transliteration: Algoritmi.
Ibn Sahl (940–1000) was a Persian mathematician, physicist, and optics
engineer credited with first discovering the law of refraction, later known as
Snell’s law. By means of this law, he computed the optimum shapes for lenses and
curved mirrors. This was probably the first application of optimization in an
engineering problem.
Further developments in algebra were made by the Arabic mathematician
Al-Karaji (953–1029) in his treatise Al-Fakhri, where he extends the methodology
to incorporate integer powers and integer roots of unknown quantities. Something
close to a proof by mathematical induction appears in a book written by Al-Karaji
who used it to prove the binomial theorem, Pascal’s triangle, and the sum of
integral cubes. The historian of mathematics Woepcke, in [8], credits him as the
first to introduce the theory of algebraic calculus. This was truly one of the
cornerstone developments for the area of optimization as it is one of the uses of
calculus in the real world.
René Descartes (1596–1650) was a French mathematician and philosopher and
his major work, La Géométrie, includes his linkage of algebra to geometry from
which we now have Cartesian geometry. He had a profound breakthrough
when he realized he could describe any position on a 2D plane using a pair of
numbers associated with a horizontal axis and a vertical axis—what we today
call ‘‘coordinates.’’ By denoting the horizontal measurement with x’s and the
vertical measurement with y’s, Descartes was the first to define any geometric
object such as a line or circle in terms of algebraic equations. Scott in [9] praises
his work for four crucial contributions:
1. He makes the first step toward a theory of invariants, which at later stages
derelativises the system of reference and removes arbitrariness.
2. Algebra makes it possible to recognize the typical problems in geometry and to
bring together problems which in geometrical dress would not appear to be
related at all.
3. Algebra imports into geometry the most natural principles of division and the
most natural hierarchy of method.
4. Not only can questions of solvability and geometrical possibility be decided
elegantly, quickly, and fully from the parallel algebra, without it they cannot be
decided at all.
The seminal construction of what we call graphs was obviously the cornerstone
achievement, without which no formulation of optimization would be possible.
In that, Descartes united the analytical power of algebra with the descriptive
power of geometry into a new branch of mathematics he named analytic
geometry, sometimes referred to as Calculus with Analytic Geometry.
He was one of the first to solve the tangent line problem (i.e., the slope or the
derivative) for certain functions. This was the first step toward finding the maxima
or minima of any function or surface, the foundation of all analytical optimization
solutions. On the other hand, when Descartes published La Géométrie in
1637, his contemporary Pierre de Fermat (1601–1665) had already been working on
analytic geometry for about six years, and he too solved the tangent line problem
occasional discussion letters between the two, but despite all his correspondence
with Newton, Leibniz had already come to his own conclusions about calculus.
In 1686 Leibniz published a paper, in Acta Eruditorum, dealing with the integral
calculus with the first appearance of the integral notation in print. Newton’s
famous work, Philosophiae Naturalis Principia Mathematica, surely the greatest
scientific book ever written, appeared in the following year. The notion that the
Earth rotates around the Sun was already known to ancient Greek philosophers,
but it was Newton who explained why, and with his explanation the great
scientific revolution began.
During his last years, Leibniz published Théodicée claiming that the universe is
in the best possible form but imperfect; otherwise, it would not be distinct from
God. He invented more mathematical terms than anyone else, including ‘‘function,’’
‘‘analysis situs,’’ ‘‘variable,’’ ‘‘abscissa,’’ ‘‘parameter,’’ and ‘‘coordinate.’’
His childhood IQ has been estimated as the second-highest in all of history, behind
only Goethe [5]. Descriptions that have been applied to Leibniz include ‘‘one of the
two greatest universal geniuses’’ (da Vinci was the other) and the ‘‘Father of the
Applied Science.’’ On the other hand, Newton is the genius who began revolu-
tionary advances on calculus, optics, dynamics, thermodynamics, acoustics, and
physics; it is easy to overlook that he too was one of the greatest geometers, for he
calculated the optimum shape of the bullet even before his invention of calculus.
Among many brilliant works in mathematics and especially in calculus, he also
discovered the Binomial Theorem, the polar coordinates, and power series for
exponential and trigonometric functions. For instance, his equation
$e^{x} = \sum_{k=0}^{\infty} x^{k}/k!$ has been called ‘‘the most important series in
mathematics.’’ Another optimization problem he solved is the brachistochrone,
the curve of fastest descent between two points for a point-like body that starts
at zero velocity, with gravity the only acting force (no friction). This problem
had defeated the best mathematicians in Europe, but it took Newton only a few
hours to solve it. He published the solution anonymously, yet upon seeing it,
Johann Bernoulli immediately stated, ‘‘I recognize the lion by his footprint.’’
After the era of Newton and Leibniz, the development of the calculus was
continued by the Swiss mathematicians, Bernoulli brothers, Jacob Bernoulli
(1654–1705), and Johann Bernoulli (1667–1748). Jacob was the first mathemati-
cian who applied separation of variables in the solution of a first-order nonlinear
differential equation. His paper of 1690 was indeed a milestone in the history of
calculus since the term integral appears for the first time with its integration
meaning. Jacob liked to pose and solve physical optimization problems such as the
catenary (which is the curve that an idealized hanging chain assumes under its own
weight when supported only at its ends) problem. He was a pioneer in the field of
calculus of variations, and particularly differential equations, with which he
developed new techniques to many optimization problems. In 1697, he posed and
partially solved the isoperimetric problem, which is a class of problems of the
calculus of variations. The simplest of them is the following: among all curves of
given length, find the curve for which some quantity (e.g., area) dependent on the
setting x = π yields $e^{i\pi} + 1 = 0$, which fits the three most important constants in a
single equation.)
Euler was partially, and later totally, blind during a long period of his life.
In 1735, he received from the French Academy a problem in celestial mechanics
that had taken other mathematicians several months to solve. Euler, using
his improved methods, solved it in 3 days (later, with his superior methods, Gauss
solved the same problem within an hour!). However, the strain of the effort
induced a fever that cost him the sight in his right eye. Stoically
accepting the misfortune, he said, ‘‘Now I will have less distraction.’’ For a period
of 17 years, he was almost totally blind after a cataract developed in his left eye in
1766. Yet he possessed a phenomenal memory, which served him well during his
blind years as he calculated long and difficult problems on the blackboard of his
mind, sometimes carrying out arithmetical operations to over 50 decimal places.
The calculus of variations was created and named by Euler, and in it he made
several fundamental discoveries. His work of 1740, Methodus inveniendi
lineas curvas, initiated the first studies in the field, although his
contributions had already begun in 1733; his treatise of 1766, Elementa Calculi
Variationum, gave the field its name. The idea was already born with the brachistochrone curve
problem raised by Johann Bernoulli in 1696. This problem basically deals with the
following: let a point particle of mass m move along a string whose endpoints are at
a = (0, 0) and b = (x, y), where y < 0. If gravity acts on the particle with force
F = mg, what path of string minimizes its travel time from a to b, assuming no
friction? The solution of this problem was one of the first accomplishments of the
calculus of variations using which many optimization problems can be solved.
Besides Euler, by the end of 1754, Joseph-Louis Lagrange (1736–1813) had also
made crucial discoveries on the tautochrone, the curve along which a weighted
particle always arrives at a fixed point in a fixed amount of time, independent of
its initial position. This problem too contributed substantially to the calculus of
variations. Lagrange sent Euler his results on the tautochrone containing his
method of maxima and minima in a letter dated 12 August 1755, and Euler replied
on 6 September saying how impressed he was with Lagrange’s new ideas (he was
19 years old at the time). In 1756, Lagrange sent Euler results that he had obtained
on applying the calculus of variations to mechanics. These results generalized
results which Euler had himself obtained. This work led to the famous Euler–
Lagrange equation, the solution of which is applied on many optimization prob-
lems to date. For example, using this equation one can easily show that the closed
curve of a given perimeter for which the area is a maximum is a circle, that the
shortest distance between two fixed points is a straight line, and so on. Moreover, Lagrange considered
optimizing a functional with an added constraint and, using the method of
Lagrange multipliers, turned the problem into a single optimization equation that
can then be solved by the Euler–Lagrange equation.
Euler made substantial contributions to differential geometry, investigating the
theory of surfaces and curvature of surfaces. Many unpublished results by Euler in
this area were rediscovered by Carl Friedrich Gauss (1777–1855). In 1737, Euler
the curve at that point. The method stops when it reaches a local minimum
(maximum), where ∇f = 0 and thus no move is possible. Therefore, the update
equation is as follows:

$$x_{k+1} = x_k - \lambda_k \nabla f(x_k), \qquad k = 0, 1, 2, \ldots \qquad (2.3)$$
where x_0 is the initial starting point in the N-D space. An advantage of gradient
descent compared to the Newton–Raphson method is that it utilizes only first-order
derivative information about the function when determining the direction of
movement. However, it is usually slower than Newton–Raphson to converge and it
tends to suffer from very slow convergence especially as a stationary point is
approached.
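As an illustration of update (2.3), here is a minimal fixed-step gradient descent sketch; the quadratic objective, step size, and stopping tolerance are illustrative assumptions, not taken from the book:

```python
# Fixed-step gradient descent, x_{k+1} = x_k - lam * grad f(x_k), per (2.3).
# The objective f(x, y) = (x - 3)^2 + (y + 1)^2 is an illustrative choice.

def grad_f(x, y):
    return 2.0 * (x - 3.0), 2.0 * (y + 1.0)

def gradient_descent(x0, y0, lam=0.1, tol=1e-8, max_iter=10000):
    x, y = x0, y0
    for _ in range(max_iter):
        gx, gy = grad_f(x, y)
        if gx * gx + gy * gy < tol:   # stop when grad f is (numerically) zero
            break
        x, y = x - lam * gx, y - lam * gy
    return x, y

x_min, y_min = gradient_descent(0.0, 0.0)   # converges toward the minimum (3, -1)
```

On this well-conditioned quadratic the fixed step works; as the text notes, near a stationary point of a harder function (e.g., Rosenbrock in Fig. 2.1) convergence becomes very slow.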
A crucial optimization application is the least-squares approximation, which
finds the approximate solution of sets of equations in which there are more
equations than unknowns. At the age of 18, Carl Friedrich Gauss, widely
agreed to be the most brilliant and productive mathematician who ever lived, invented a
solution to this problem in 1795, although it was first published by Legendre in
1806. This method basically minimizes the sum of the squares of the residual
errors, that is, the overall solution minimizes the sum of the squares of the errors
made in the results of every single equation. The most important application is in
data fitting and the first powerful demonstration of it was made by Gauss, at the
age of 24 when he used it to predict the future location of the newly discovered
asteroid, Ceres. In June 1801, Zach, an astronomer whom Gauss had come to
know two or three years earlier, published the orbital positions of Ceres, which
had been discovered by the Italian astronomer Giuseppe Piazzi in January 1801. Piazzi
had been able to track its path for 40 days before it was lost in the glare of the sun.
Based on this data, astronomers attempted to determine the location of Ceres after
it emerged from behind the sun without solving Kepler’s complicated nonlinear
equations of planetary motion. Zach published several predictions of its
position, including one by Gauss which differed greatly from the others. When
Ceres was rediscovered by Zach on 7 December 1801, it was almost exactly where
Gauss had predicted using the least-squares method, which was not published at
the time.
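The least-squares principle can be sketched with the simplest data-fitting case, a straight line: minimizing the sum of squared residuals for y ≈ mx + b has a closed form via the normal equations (the data below are invented for the example):

```python
# Fit y = m*x + b by minimizing the sum of squared residuals;
# the two-parameter normal equations have this closed form.

def least_squares_line(xs, ys):
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    m = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - m * sx) / n
    return m, b

# Noiseless data on y = 2x + 1: the fit recovers the parameters exactly.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]
m, b = least_squares_line(xs, ys)
print(m, b)  # 2.0 1.0
```

With noisy data the same formulas return the line minimizing the total squared error, which is what made the method powerful for orbit prediction.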
The twentieth century brought the proliferation of several optimization tech-
niques. Calculus of variations was further developed by several mathematicians
including Oskar Bolza (1857–1942) and Gilbert Bliss (1876–1951). Harris Han-
cock (1867–1944) in 1917 published the first book on optimization, Theory of
Maxima and Minima. One of the crucial optimization techniques, Linear
Programming (LP), was developed in 1939 by the Russian mathematician Leonid
Vitaliyevich Kantorovich (1912–1986); however, the method was kept secret until
the American scientist George Bernard Dantzig (1914–2005) published the
Simplex method in 1947. LP, sometimes called linear optimization, is a
mathematical method for determining a way to achieve the optimal outcome in a
given mathematical model according to a list of requirements that are predefined
by some linear relationships. More formally, LP is a technique for the optimization
in 1961 by Robert Hooke and T. A. Jeeves. This was a pattern search method,
which improves on a random search by choosing its search directions through
exploration of the search space. After this key accomplishment, in 1962 the first simplex-based
direct search method was proposed by W. Spendley, G. R. Hext, and F. R. Hims-
worth in their paper, Sequential Application of Simplex Designs in Optimisation
and Evolutionary Operation. Note that this is an entirely different algorithm than
the Simplex method for LP as discussed earlier. It uses only two types of trans-
formations to form a new simplex (e.g., vertices of a triangle in 2D) in each step:
reflection away from the worst vertex (the one with the highest function value), or
shrinking toward the best vertex (the one with the lowest function value). For each
iteration, the angles between simplex edges remain constant during both opera-
tions, so the working simplex can change in size, but not in shape. In 1965, this
method was modified by John Ashworth Nelder (1924–2010) and Roger Mead
who added two more operators: expansion and contraction (in and out), which
allow the simplex to change not only its size, but also its shape [14]. Their
modified simplex method, known as the Nelder–Mead (or simplex) method, became
immediately famous due to its simplicity and low storage requirements, which
makes it an ideal optimization technique especially for the primitive computers at
that time. During the 1970s and 1980s, it was used by several software packages
while its popularity grew even more. It is now a standard method in MATLAB
where it can be applied by the command fminsearch. Nowadays, despite its long
history, the simplex method is still one of the most popular heuristic
optimization techniques in use.
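The two original simplex operations, reflecting the worst vertex and shrinking toward the best, can be sketched in a few lines; the quadratic test function, starting simplex, and iteration count are illustrative choices, not taken from the book:

```python
# Simplex direct search using only the two operations described above:
# reflection of the worst vertex and shrinking toward the best vertex.

def simplex_search(f, simplex, iters=200):
    for _ in range(iters):
        simplex.sort(key=f)                       # best vertex first, worst last
        best, worst = simplex[0], simplex[-1]
        others = simplex[:-1]
        n = len(best)
        # Centroid of all vertices except the worst.
        cent = [sum(v[i] for v in others) / len(others) for i in range(n)]
        # Reflect the worst vertex through the centroid.
        refl = [2.0 * cent[i] - worst[i] for i in range(n)]
        if f(refl) < f(worst):
            simplex[-1] = refl                    # accept the reflection
        else:
            # Shrink every vertex halfway toward the best one.
            simplex = [[(v[i] + best[i]) / 2.0 for i in range(n)]
                       for v in simplex]
    return min(simplex, key=f)

f = lambda p: (p[0] - 1.0) ** 2 + (p[1] - 2.0) ** 2   # illustrative objective
best = simplex_search(f, [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
```

Reflection keeps the simplex size constant while shrinking halves it, matching the text's remark that the working simplex can change in size but not in shape; Nelder and Mead's expansion and contraction operators are omitted here.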
During the 1950s and 1960s, the concept of artificial intelligence (AI) was also
born. Along with AI, a new family of metaheuristic optimization algorithms of a
stochastic nature was created: evolutionary algorithms (EAs). An EA uses
mechanisms inspired by biological evolution such as reproduction, mutation,
recombination, and selection. It is also a stochastic method, as is simulated
annealing; however, it is based on the collective behavior of a population. A
potential solution of the optimization problem plays the role of a member in the
population, and the fitness function determines the search space within which the
solutions lie. The earliest instances of EAs appeared during the 1950s and early
1960s, simulated on computers by evolutionary biologists who were explicitly
seeking to model aspects of natural evolution. At first, it did not occur to any of
them that this approach might be generally applicable to optimization problems.
The EAs were first used by the Norwegian-Italian mathematician Nils Aall
Barricelli (1912–1993), who applied them to evolutionary simulations. By 1962, several
researchers developed evolution-inspired algorithms for function optimization and
machine learning, but at the time their work attracted little attention. The first
development in this field for optimization came in 1965, when the German
scientist Ingo Rechenberg (born in 1934) developed a technique called evolution
strategy (ES), which uses natural problem-dependent representations, and primarily
mutation and selection as search operators, in a loop where each iteration is
called generation. The sequence of generations is continued until a termination
criterion is met.
When the search direction is chosen as the gradient descent direction, −∇f(x),
the corresponding iterative search is called the method of gradient descent (also
known as steepest descent or Cauchy’s method). The direction of the negative
gradient along which the objective function decreases fastest is the most natural
choice. This simple algorithm for continuous optimization uses gradient of the
objective function in addition to the function value itself; hence f must be a
differentiable function.
Fig. 2.1 Iterations of the fixed (left) and optimum α(k) (right) line search versions of the gradient
descent algorithm plotted over the Rosenbrock objective function, x(0) = [−0.5, 1.0]
$$f(x + d) \approx f(x) + \nabla f(x)^{T} d + \frac{1}{2}\, d^{T} H_f(x)\, d \qquad (2.5)$$

where the Hessian matrix $H_f(x) = \nabla^2 f(x)$ is assumed to be positive definite near the
local minimum x*. Therefore, the Newton–Raphson method utilizes both the first and
the second partial derivatives of the objective function to find its minimum.
Similar to the gradient descent method, it can be implemented as an iterative line
search algorithm using $d^{(k)} = -H_f(x)^{-1} \nabla f(x)$ as the search direction (the Newton
direction, or the direction of the curvature) in (2.1), yielding the position update
$x_{k+1} = x_k - H_f(x_k)^{-1} \nabla f(x_k)$.
Fig. 2.2 (Left) Iterations of the Quasi-Newton method plotted over the Rosenbrock function,
x(0) = [−0.5, 1.0]. (Right) Iterations of the gradient descent (red) versus Quasi-Newton (black)
methods plotted over a quadratic objective function, x(0) = [10, 5]
Fig. 2.3 (Left) Iterations of the Nelder–Mead method plotted over the Rosenbrock function with
x(0) = [−0.5, 1.0]; only the vertices with the minimum function values are plotted. (Right)
Consecutive simplex operations during iterations 3–30
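A Newton–Raphson iteration of this kind can be sketched with a full Newton step on a simple two-variable function; the objective, its analytic (here diagonal) Hessian, and the starting point are illustrative assumptions:

```python
# Newton-Raphson iteration x <- x - H^{-1} grad f, on the illustrative
# objective f(x, y) = x^4 + y^2, whose gradient and Hessian are analytic.

def newton_step(x, y):
    gx, gy = 4.0 * x ** 3, 2.0 * y            # gradient components
    hxx, hyy = 12.0 * x ** 2, 2.0             # Hessian is diagonal for this f
    return x - gx / hxx, y - gy / hyy         # full Newton step

x, y = 1.0, 3.0
for _ in range(20):
    x, y = newton_step(x, y)                  # approaches the minimum at (0, 0)
```

Note the contrast with gradient descent: the quadratic y-direction is solved in a single step, which is the payoff of using second-derivative information.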
In the liquid phase, all particles are distributed randomly, whereas in the ground
state of the solid the particles are arranged in a highly structured lattice, for which
the corresponding energy is minimal. The ground state of the solid is obtained only
if the maximum value of the temperature is sufficiently high and the cooling is
performed slowly. Otherwise, the solid will be obtained in a meta-stable state
rather than in the true ground state. This is the key for achieving the optimal
ground state, which is the basis of the annealing as an optimization method. The
simulated annealing method is a Monte Carlo-based technique and generates a
sequence of states of the solid. Let i and j be the current and the subsequent state of
the solid and ei and ej their energy levels. The state j is generated by applying a
perturbation mechanism, which transforms the state i into j by a little distortion,
such as a mere displacement of a particle. If e_j ≤ e_i, the state j is accepted as the
current state; otherwise, the state j may still be accepted as the current state with a
probability
$$P(i \Rightarrow j) = \exp\left(\frac{e_i - e_j}{k_B T}\right) \qquad (2.7)$$
where kB is a physical constant called the Boltzmann constant; recall from the
earlier discussion that T is the temperature of the state. This rule of acceptance is
known as the Metropolis criterion. According to this rule, the Metropolis–Hastings
algorithm generates a sequence of solutions to an optimization problem by
assuming: (1) solutions of the optimization problem are equivalent to the state of
this physical system, and (2) the cost (fitness) of the solution is equivalent to the
energy of a state. Recall from the earlier discussion that the temperature is used as
the control parameter that is gradually (iteratively) decreased during the process
(annealing). Simulated annealing can thus be viewed as an iterative Metropolis–
Hastings algorithm, executed with gradually decreasing values of T. With a
given cost (fitness) function f, let ω be the continuously decreasing temperature
function, T_0 the initial temperature, N the neighborhood function, which
changes the state (candidate solution) with respect to the previous state in an
appropriate way, ε_C the minimum fitness score aimed at, and x the variable to be
optimized in the N-D search space. Accordingly, the pseudo-code of the simulated
annealing algorithm is given in Table 2.2.
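The acceptance rule (2.7) and cooling loop can be sketched as follows; the geometric cooling schedule, uniform neighborhood step, and one-variable fitness function are illustrative assumptions, not the book's Table 2.2:

```python
import math
import random

# Minimal simulated annealing with the Metropolis acceptance rule of (2.7),
# with k_B folded into T. All schedule constants are illustrative choices.

def simulated_annealing(f, x0, T0=10.0, cooling=0.95, steps=2000, seed=1):
    rng = random.Random(seed)
    x, T = x0, T0
    best = x
    for _ in range(steps):
        cand = x + rng.uniform(-0.5, 0.5)     # neighborhood function N
        delta = f(cand) - f(x)
        # Always accept improvements; accept deteriorations with
        # probability exp(-delta / T), which shrinks as T decreases.
        if delta <= 0 or rng.random() < math.exp(-delta / T):
            x = cand
        if f(x) < f(best):
            best = x
        T *= cooling                          # gradual cooling
    return best

f = lambda x: (x - 2.0) ** 2                  # illustrative fitness function
x_best = simulated_annealing(f, x0=-5.0)
```

At high T the loop behaves like a random walk; as T → 0 only improvements survive, mirroring the text's remark that the method degenerates toward steepest descent.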
Note that a typical characteristic of the simulated annealing is that it accepts
deteriorations to a limited extent. Initially, at large values of temperature, T, large
deteriorations may be accepted; as T gradually decreases, the amount of deterio-
rations possibly accepted goes down and finally, when the temperature reaches
absolute zero, deteriorations cannot happen at all—only improvements. This is
why it mimics the family of steepest descent methods as T goes to zero. On the
other hand, recall that simulated annealing and the family of Evolutionary
not readily available. This makes the stochastic approximation (SA) algorithms
quite popular. The general SA takes the following form:
$$\hat{\theta}_{k+1} = \hat{\theta}_k - a_k\,\hat{g}_k(\hat{\theta}_k) \qquad (2.9)$$

where $\hat{g}_k(\hat{\theta}_k)$ is the estimate of the gradient g(θ) at iteration k and a_k is a scalar
gain sequence satisfying certain conditions. Unlike any steepest (gradient) descent
method, SA assumes no direct knowledge of the gradient. To estimate the gradient,
there are two common SA methods: finite difference stochastic approximation
(FDSA) and simultaneous perturbation SA (SPSA) [17]. FDSA adopts the tradi-
tional Kiefer-Wolfowitz approach to approximate gradient vectors as a vector of
p partial derivatives where p is the dimension of the loss (fitness) function.
Accordingly, the estimate of the gradient can be expressed as follows:
$$\hat{g}_k(\hat{\theta}_k) = \begin{bmatrix} \dfrac{L(\hat{\theta}_k + c_k \Delta_1) - L(\hat{\theta}_k - c_k \Delta_1)}{2c_k} \\[6pt] \dfrac{L(\hat{\theta}_k + c_k \Delta_2) - L(\hat{\theta}_k - c_k \Delta_2)}{2c_k} \\[6pt] \vdots \\[6pt] \dfrac{L(\hat{\theta}_k + c_k \Delta_p) - L(\hat{\theta}_k - c_k \Delta_p)}{2c_k} \end{bmatrix} \qquad (2.10)$$
where Δ_k is the unit vector with a 1 in the kth place and c_k is a small positive
number that gradually decreases with k. Note that separate estimates are computed
for each component of the gradient, which means that a p-dimensional problem
requires at least 2p evaluations of the loss function per iteration. The convergence
theory for the FDSA algorithm is similar to that for the root-finding SA algorithm
of Robbins and Monro. These are: a_k > 0, c_k > 0, lim_{k→∞} a_k = 0, lim_{k→∞} c_k = 0,
Σ_{k=0}^∞ a_k = ∞ and Σ_{k=0}^∞ (a_k/c_k)² < ∞. The selection of these gain sequences is
critical to the performance of the FDSA. The common choice is the following:
a_k = \frac{a}{(k + A + 1)^t} \quad \text{and} \quad c_k = \frac{c}{(k + 1)^s}, \qquad (2.11)
where a, c, t, and s are strictly positive and A ≥ 0. They are usually selected based
on a combination of the theoretical restrictions above, trial-and-error numerical
experimentation, and basic problem knowledge.
Figure 2.5 shows the FDSA iterations plotted over the Rosenbrock function
with the following parameters: a = 20, A = 250, c = 1, t = 1, and s = 0.75. Note
that during the early iterations it performs a random search, due to its stochastic
nature with large a_k, c_k values, but it then mimics the gradient descent algorithm.
The large number of iterations needed for convergence is another commonality
with the gradient descent.
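The FDSA recursion of Eqs. (2.9)-(2.11) can be sketched as follows (an illustrative version with toy gain parameters, not those of the Rosenbrock experiment; the loss here is noiseless for simplicity, whereas in practice only noisy evaluations may be available):

```python
def fdsa_minimize(L, theta0, a=0.5, c=1.0, A=50, t=0.602, s=0.101, n_iter=500):
    """Finite difference stochastic approximation (Kiefer-Wolfowitz form).

    Estimates each of the p gradient components with a central difference,
    so every iteration costs 2p evaluations of the loss L.
    """
    theta = list(theta0)
    p = len(theta)
    for k in range(n_iter):
        ak = a / (k + A + 1) ** t   # gain sequence, Eq. (2.11)
        ck = c / (k + 1) ** s       # perturbation size, Eq. (2.11)
        g = []
        for i in range(p):
            # Perturb only the i-th coordinate (the unit vector Delta_i).
            plus = theta[:];  plus[i] += ck
            minus = theta[:]; minus[i] -= ck
            g.append((L(plus) - L(minus)) / (2 * ck))  # Eq. (2.10)
        theta = [th - ak * gi for th, gi in zip(theta, g)]  # Eq. (2.9)
    return theta

# Usage: minimize a smooth quadratic bowl with minimum at (1, -2).
theta = fdsa_minimize(lambda x: (x[0] - 1) ** 2 + (x[1] + 2) ** 2, [0.0, 0.0])
```

Note how the gains a_k and c_k both decay with k, as the convergence conditions above require.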
2.4 Evolutionary Algorithms
Among the family members of EAs, this section will particularly focus on GAs
and DEs, leaving out the details of both EP and ES, while the next chapter will
cover the details of PSO.
In order to accomplish this, the first step is the encoding of the problem
variables into genes in the form of strings. This can be a string of real numbers but
more typically a binary bit string (a series of 0s and 1s). This is the genetic
representation of a potential solution. For instance, consider the problem with two
variables, a and b, with 0 ≤ a, b < 256. A sample chromosome representation for the
ith chromosome, g_i(a, b), is shown in Fig. 2.6 where both a and b are encoded with
8 bits; therefore, the chromosome contains 16 bits. Note that examining the
chromosome string alone yields no information about the optimization problem.
It is only with the decoding of the chromosome into its phenotypic (real) values
that any meaning can be extracted for the representation. In this case, as described
below, the GA search process will operate on these bits (chromosomes), rather
than the real-valued variables themselves, except, of course, where real-valued
chromosomes are used.
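For the two-variable example above, the mapping between genotype and phenotype can be sketched as follows (a minimal illustration; the exact bit layout of Fig. 2.6 is assumed here to be a simple concatenation of the two 8-bit fields):

```python
def encode(a, b):
    """Pack two 8-bit variables (0 <= a, b < 256) into a 16-bit chromosome string."""
    return format(a, "08b") + format(b, "08b")

def decode(chrom):
    """Recover the phenotypic (integer) values from the 16-bit chromosome."""
    return int(chrom[:8], 2), int(chrom[8:], 2)

# Usage: the chromosome string alone reveals nothing about the problem;
# only decoding restores the phenotypic values.
chrom = encode(170, 3)
```

As the text notes, the GA operators act on the bit string `chrom`, while the fitness function acts on the decoded pair.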
The second requirement is a proper fitness function which calculates the fitness
score of any potential solution (the one encoded in the chromosome). This is
indeed the function to be optimized by finding the optimum set of parameters of
the system or the problem in hand. The fitness function is always problem
dependent. In nature, this corresponds to the organism’s ability to operate and to
survive in its present environment. Thus, the objective function establishes the
basis for the proper selection of certain organism pairs for mating during the
reproduction phase. In other words, the probability of selection is proportional to
the chromosome’s fitness. The GA process will then operate according to the
following steps:
1. Initialization: The initial population is created with all chromosomes
(usually) randomly generated so as to cover the entire range of possible solutions
(the search space). Occasionally, the solutions may be ‘‘seeded’’ in areas where
optimal solutions are likely to be found. The population size depends on the
nature of the problem, but typically contains several hundreds of potential
solutions encoded into chromosomes.
2. Selection: For each successive generation, first the selection of a certain pro-
portion of the existing population is performed to breed a new generation. As
mentioned earlier, the selection process is random; however, it favors the
chromosomes with higher fitness scores. Certain selection methods rate the fitness
of each solution and preferentially select the best solutions.
3. Reproduction: For each successive generation, the second step is to generate
the next generation chromosomes from those selected through genetic operators
such as crossover and mutation. These genetic operators ultimately result in the
child (next generation) population of chromosomes that is different from the
initial generation but typically shares many of the characteristics of its parents.
4. Evaluation: The child chromosomes are first decoded and then evaluated using
the fitness function and they replace the least-fit individuals in the population so
as to keep the population size unchanged. This is the only link between the GA
process and the problem domain.
5. Termination: During the Evaluation step, if any of the chromosomes achieves
the objective fitness score or the maximum number of generations is reached,
then the GA process is terminated. Otherwise, steps 2 to 4 are repeated to
produce the next generation.
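Steps 1-5 can be sketched as a compact loop (an illustrative sketch, not the book's implementation; roulette-wheel selection, single-point crossover, bit-flip mutation, and the OneMax toy fitness are assumed choices):

```python
import random

def binary_ga(fitness, n_bits, pop_size=30, pc=0.8, pm=0.02, max_gen=200, seed=2):
    """Minimal binary GA following steps 1-5 above (maximization)."""
    rng = random.Random(seed)
    # Step 1: random initial population of bit-string chromosomes.
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(max_gen):
        scores = [fitness(c) for c in pop]
        # Step 2: fitness-proportional (roulette) selection, shifted positive.
        lo = min(scores)
        w = [s - lo + 1e-9 for s in scores]
        children = []
        while len(children) < pop_size:
            p1, p2 = rng.choices(pop, weights=w, k=2)
            c1, c2 = p1[:], p2[:]
            if rng.random() < pc:            # Step 3a: single-point crossover
                cut = rng.randrange(1, n_bits)
                c1 = p1[:cut] + p2[cut:]
                c2 = p2[:cut] + p1[cut:]
            for c in (c1, c2):               # Step 3b: bit-flip mutation
                for i in range(n_bits):
                    if rng.random() < pm:
                        c[i] = 1 - c[i]
                children.append(c)
        # Step 4: the child population replaces the parents (size unchanged).
        pop = children[:pop_size]
    # Step 5 here is simply the generation budget running out.
    return max(pop, key=fitness)

# Usage: maximize the number of 1-bits ("OneMax").
best = binary_ga(lambda c: sum(c), n_bits=16)
```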
Fig. 2.8 The distributions of the real-valued GA population for generations g = 1, 160, 518, and
913 over the Rosenbrock function with S = 10, Px = 0.8, and σ linearly decreasing from x_range to 0
be different and there are numerous variants performing different approaches for
each. A similar crossover operation as in binary GA can still be used (flipping with a
probability, Px, over the pairs chosen for breeding). Another crossover operation
common for real-valued GA is to exploit the idea of creating the child chromosome
between parents via arithmetic recombination (linear interpolation), i.e.,
z_i = α x_i + (1 - α) y_i, where x_i and y_i are the ith parent chromosomes, z_i is the
ith child chromosome, and 0 ≤ α ≤ 1. The parameter α can be a constant, a variable
changing according to some function, or a random number. On the other hand, the
most common mutation method is to shift each chromosome separately by a random
deviate taken from the Gaussian distribution N(0, σ) and then curtail it to the
problem range. Note that the standard deviation, σ, controls the amount of shift.
Figure 2.8 shows the distributions of a real-valued GA population for generations
g = 1, 160, 518, and 913 over the Rosenbrock function with the parameter settings
S = 10, Px = 0.8, and an adaptive σ linearly decreasing from x_range to 0, where x_range
is the dimensional range, i.e., x_range = 2 - (-2) = 4 for the problem shown in the
figure. It took 160 generations for a GA chromosome to converge to the close
vicinity of the optimum point, (1, 1).
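The two real-valued operators described above, arithmetic recombination and Gaussian mutation with curtailing, can be sketched as follows (a minimal illustration; the parameter values are placeholders):

```python
import random

def arithmetic_crossover(x, y, alpha=0.5):
    """Child z_i = alpha*x_i + (1 - alpha)*y_i (linear interpolation of parents)."""
    return [alpha * xi + (1 - alpha) * yi for xi, yi in zip(x, y)]

def gaussian_mutation(z, sigma, lo, hi, rng=random):
    """Shift each gene by a N(0, sigma) deviate, then curtail to [lo, hi]."""
    return [min(hi, max(lo, zi + rng.gauss(0.0, sigma))) for zi in z]

# Usage: breed one child from two parent chromosomes in the range [-2, 2].
random.seed(3)
child = gaussian_mutation(arithmetic_crossover([1.0, -1.0], [-1.0, 1.0]), 0.1, -2.0, 2.0)
```

An adaptive schedule such as the one in Fig. 2.8 would simply shrink `sigma` from the dimensional range toward 0 over the generations.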
y_a^{g+1} = x_b^g + F \, r \, (x_c^g - x_d^g) \qquad (2.12)
where r ~ U(0, 1) is a random variable with a uniform distribution and F is a
constant, usually assigned to 2. This is the mutation operation, which adds the
weighted difference of two of the vectors to the third, hence the name
‘‘differential’’ evolution. The following crossover operation then forms a trial
vector from the elements of the agent vector, x_a^g, and the elements of the donor
vector, y_a^{g+1}, each of which enters the trial vector with probability R.
u_{a,j}^{g+1} = \begin{cases} y_{a,j}^{g+1} & \text{if } r \le R \text{ or } j = d \\ x_{a,j}^{g} & \text{if } r > R \text{ and } j \ne d \end{cases} \qquad (2.13)
3. Selection (with Evaluation): For each successive generation, once the trial
vector is generated, the agent vector, x_a^g, is compared with the trial vector, u_a^{g+1},
and the one with the better fitness is admitted to the next generation.
x_a^{g+1} = \begin{cases} u_a^{g+1} & \text{if } f(u_a^{g+1}) \le f(x_a^g) \\ x_a^{g} & \text{else} \end{cases} \qquad (2.14)
4. Termination: During the previous step, if any agent achieves the objective
fitness score or the maximum number of generations is reached, then the DE
process is terminated. Otherwise, steps 2 and 3 are repeated to produce the next
generation.
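The mutation, crossover, and selection steps of Eqs. (2.12)-(2.14) can be sketched as follows (an illustrative DE/rand/1/bin sketch; the random factor r of Eq. (2.12) is folded into the constant F here, as in standard DE, and donor vectors are not clipped to the bounds):

```python
import random

def differential_evolution(f, bounds, S=10, F=0.8, R=0.1, n_gen=100, seed=7):
    """Minimal DE sketch following Eqs. (2.12)-(2.14), minimizing f."""
    rng = random.Random(seed)
    d = len(bounds)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(S)]
    fit = [f(x) for x in pop]
    for _ in range(n_gen):
        for a in range(S):
            # Mutation, Eq. (2.12): donor from three distinct other agents.
            b, c, e = rng.sample([i for i in range(S) if i != a], 3)
            y = [pop[b][j] + F * (pop[c][j] - pop[e][j]) for j in range(d)]
            # Crossover, Eq. (2.13): binomial mixing with one forced dimension jd.
            jd = rng.randrange(d)
            u = [y[j] if (rng.random() <= R or j == jd) else pop[a][j]
                 for j in range(d)]
            # Selection, Eq. (2.14): greedy replacement of the agent vector.
            fu = f(u)
            if fu <= fit[a]:
                pop[a], fit[a] = u, fu
    best = min(range(S), key=lambda i: fit[i])
    return pop[best], fit[best]

# Usage: minimize the 2-D sphere function over [-2, 2]^2.
x, fx = differential_evolution(lambda v: sum(t * t for t in v), [(-2, 2), (-2, 2)])
```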
Figure 2.9 illustrates the generation of the trial vector on a sample 2-D
function.
Note that the trial vector, u_a^{g+1}, gathers the first-dimensional element u_{a,1}^{g+1} from
the agent vector, x_{a,1}^{g}, and the second-dimensional element u_{a,2}^{g+1} from the donor
vector, y_{a,2}^{g+1}.
The choice of DE parameters F, S, and R can have a large impact on the
optimization performance and how to select good parameters that yield good
performance has therefore been subject to much research, e.g., see Price et al. [18]
and Storn [19]. Figure 2.10 shows the distributions of the DE population for
generations g = 1, 20, 60, and 86 over the Rosenbrock function with the parameter
settings S = 10, F = 0.8, and R = 0.1. Note that as early as the 20th generation, a
member of the DE population had already converged to the close vicinity of the
optimum point, (1, 1).
Fig. 2.10 The distributions of the DE population for generations g = 1, 20, 60, and 86 over the
Rosenbrock function with S = 10, F = 0.8, and R = 0.1
References
13. S. Kirkpatrick, C.D. Gelatt, M.P. Vecchi, Optimization by simulated annealing. Science 220,
671–680 (1983)
14. J.A. Nelder, R. Mead, A simplex method for function minimization. Comput. J. 7, 308–313
(1965)
15. S. Boyd, L. Vandenberghe, Convex Optimization (Cambridge University Press, UK, 2004)
16. A. Antoniou, W.-S. Lu, Practical Optimization, Algorithms and Engineering Applications
(Springer, USA, 2007)
17. R. Silipo, et al., ST-T segment change recognition using artificial neural networks and
principal component analysis. Comput. Cardiol., 213–216, (1995)
18. K. Price, R.M. Storn, J.A. Lampinen, Differential Evolution: A Practical Approach to
Global Optimization (Springer, 2005). ISBN 978-3-540-20950-8
19. R. Storn, On the usage of differential evolution for function optimization, in Biennial
Conference of the North American Fuzzy Information Processing Society (NAFIPS),
pp. 519–523 (1996)
Chapter 3
Particle Swarm Optimization
3.1 Introduction
PSO in most basic terms belongs to the swarm intelligence paradigm, which
studies the collective behavior and social characteristics of organized, decentral-
ized, and complex systems known as ‘‘swarms.’’ A swarm is an apparently dis-
organized collection (population) of moving individuals that tend to cluster
together while each individual seems to be moving in a random direction. Each
individual in the swarm has the capability of interacting with the other individuals,
the so-called ‘‘agents’’ (or ‘‘particles’’ in PSO), although the capabilities of each
agent are rather limited by a certain set of rules. Therefore, the behavior of an agent
in a swarm is often insignificant, but the collective and social behavior is of
paramount importance, in that the swarm intelligence comes both from the
collective adaptation and the stochastic nature of the swarm. The main motivation stems
directly from the organic swarms in nature such as bird flocks, fish schools, ant
colonies, and other animal herds and packs, which exhibit an amazing self-
(Figure: illustration of the velocity update, showing the new velocity v_a(t + 1) and the gbest
position ŷ(t) relative to the global optimum.)
v_{a,j}(t+1) = w \, v_{a,j}(t) + c_1 r_{1,j}(t) \left( y_{a,j}(t) - x_{a,j}(t) \right) + c_2 r_{2,j}(t) \left( \hat{y}_j(t) - x_{a,j}(t) \right)
x_{a,j}(t+1) = x_{a,j}(t) + v_{a,j}(t+1) \qquad (3.2)

where w is the inertia weight [8], and c_1, c_2 are the acceleration constants.
r_{1,j} ~ U(0, 1) and r_{2,j} ~ U(0, 1) are random variables with a uniform distribution.
Recall from the earlier discussion that the first term in the summation is the
memory term, which represents the contribution of previous velocity, the second
term is the cognitive component, which represents the particle’s own experience
and the third term is the social component through which the particle is ‘‘guided’’
by the gbest particle toward the GB solution so far obtained. Note that the gbest
particle is the common guide for all swarm particles since the third term exists in
each velocity update equation. Although the inertia weight, w, was only later
added to the velocity update equation by Shi and Eberhart [8], this form is widely
accepted as the basic PSO algorithm. Each PSO run updates the positions
of the particles using Eq. (3.2). Depending on the problem to be optimized, PSO
iterations can be repeated until a specified number of iterations, say IterNo, is
exceeded, velocity updates become zero, or the desired fitness score is achieved
(i.e., f < ε_C, where f is the fitness function and ε_C is the cut-off error).
the general pseudo-code of the bPSO is presented in Table 3.1.
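The bPSO loop summarized above, the velocity update of Eq. (3.2), the position update, and velocity clamping, can be sketched as follows (a minimal illustration, not the pseudo-code of Table 3.1; the sphere test function and run length are placeholder choices):

```python
import random

def bpso_minimize(f, d, xmax, S=50, w0=0.9, w1=0.4, c1=2.0, c2=2.0,
                  n_iter=300, seed=5):
    """Basic PSO (bPSO) with linearly decreasing inertia and velocity clamping."""
    rng = random.Random(seed)
    vmax = xmax / 5.0
    x = [[rng.uniform(-xmax, xmax) for _ in range(d)] for _ in range(S)]
    v = [[0.0] * d for _ in range(S)]
    y = [xi[:] for xi in x]                 # personal best positions
    fy = [f(xi) for xi in x]
    g = min(range(S), key=lambda i: fy[i])  # index of the gbest particle
    for t in range(n_iter):
        w = w0 + (w1 - w0) * t / (n_iter - 1)  # inertia 0.9 -> 0.4
        for a in range(S):
            for j in range(d):
                r1, r2 = rng.random(), rng.random()
                # Memory + cognitive + social terms of Eq. (3.2).
                v[a][j] = (w * v[a][j]
                           + c1 * r1 * (y[a][j] - x[a][j])
                           + c2 * r2 * (y[g][j] - x[a][j]))
                v[a][j] = max(-vmax, min(vmax, v[a][j]))  # clamping
                x[a][j] += v[a][j]
            fa = f(x[a])
            if fa < fy[a]:                  # update personal best
                y[a], fy[a] = x[a][:], fa
                if fa < fy[g]:              # update gbest
                    g = a
    return y[g], fy[g]

# Usage: minimize the 2-D sphere function.
pos, score = bpso_minimize(lambda z: sum(t * t for t in z), d=2, xmax=10.0)
```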
Velocity clamping in step 3.4.1.2, also called ‘‘dampening,’’ with the
user-defined maximum range Vmax (and -Vmax for the minimum), is one of the
earliest attempts to control or prevent oscillations [9]. Figure 3.2 illustrates a typical
The first set of improvements has been proposed for the problem-dependent per-
formance of PSO due to its strong parameter dependency. There are mainly two
types of approaches: The first one is through self-adaptation, which has been
applied to PSO by Clerc [11], Yasuda et al. [12], Zhang et al. [13], and Shi and
Eberhart [14]. The other approach is via performing hybrid techniques, which are
employed along with PSO by Angeline [15], Reynolds et al. [16], Higashi and Iba
[17], Esquivel and Coello Coello [18], and many others. Finally, Van den Bergh
[19] showed that the following inequality should be satisfied in order to guarantee
the convergence to (local) optima:
w > \frac{c_1 + c_2}{2} - 1 \qquad (3.4)
where w is the inertia weight, and c1 ; c2 are the acceleration constants used in Eq.
(3.2). Mendes et al. [20] derived Fully Informed PSO (FIPS) from the constriction
PSO and present its general form as,
v_{a,j}(t+1) = \chi \left( v_{a,j}(t) + \frac{1}{|N_a|} \sum_{n \in N_a} c_1 r_{1,j}(t) \left( y_{n,j}(t) - x_{a,j}(t) \right) \right)
x_{a,j}(t+1) = x_{a,j}(t) + v_{a,j}(t+1) \qquad (3.5)
where Na defines a neighborhood of the particle a, and jNa j is the number of
particles in it. In the FIPS, a particle is attracted by every other particle in its
neighborhood. Therefore, the performance of FIPS is generally more dependent on
the neighborhood topology (global best neighborhood topology is recommended).
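The FIPS update of Eq. (3.5) can be sketched as follows (an illustrative sketch; chi = 0.729 and c = 2.05 are typical constriction-PSO values, assumed here rather than taken from the text, with c playing the role of c_1):

```python
import random

def fips_velocity(v_a, x_a, neighbors_best, chi=0.729, c=2.05, rng=random):
    """One FIPS velocity update (Eq. 3.5): the particle is attracted by the
    personal best y_n of every neighbor, averaged over the neighborhood N_a."""
    n = len(neighbors_best)
    new_v = []
    for j in range(len(x_a)):
        # Average attraction toward every neighbor's personal best.
        pull = sum(c * rng.random() * (y[j] - x_a[j]) for y in neighbors_best) / n
        new_v.append(chi * (v_a[j] + pull))  # constriction factor chi
    return new_v
```

When every neighbor's personal best coincides with the particle's position, the pull vanishes and the velocity is simply scaled by chi, which is what damps oscillations in constriction-type PSO.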
The rest of the PSO variants presented in this section contains some
improvements trying to avoid the premature convergence problem via introducing
diversity to swarm particles. An earlier improvement to avoid premature con-
vergence is the craziness operator, which has been first proposed by Kennedy and
Eberhart [1]. At each iteration, a set of particles from the center of the swarm is
selected and randomized within the search space. However, they concluded that it
may not be a necessary operation since it does not contribute much to the per-
formance of PSO.
Attractive and Repulsive PSO (ARPSO), proposed in [21], alternates between
attraction and repulsion phases. During attraction, ARPSO allows fast information
flow between particles causing a low diversity but a better convergence to the
solution. It is reported that 95 % fitness improvements can be obtained within this
phase. In the repulsion phase, the particles are pushed away from the GB solution
so far achieved to increase diversity. ARPSO exhibits a higher performance
compared to both PSO and GA.
Note that according to the velocity update equation in Eq. (3.2), the velocity of
the gbest particle will only depend on the memory term, since x_gbest = y_gbest = ŷ.
To address this problem Van den Bergh introduced a new PSO variant, the PSO
with guaranteed convergence, (GCPSO) [22]. In GCPSO, a different velocity
update equation is used for the gbest particle based on two threshold values that
can be adaptively set during the process. It is claimed that GCPSO usually per-
forms better than the bPSO when applied to unimodal functions and comparable
for multimodal problems; however, due to its fast rate of convergence, GCPSO is
more likely to get trapped in a local optimum (with a guaranteed convergence to it),
whereas the bPSO may not. Based on GCPSO, Van den Bergh proposed the Multi-start
PSO (MPSO) [22], which repeatedly runs GCPSO over randomized particles and
stores the (local) optimum at each iteration. Yet, similar to bPSO and many of its
variants, the performance still degrades significantly as the dimension of the search
space increases [22].
Another attempt to improve the overall performance is to use multiple swarms
instead of one. Lovberg et al. [23] proposed an approach, which divides the main
swarm into several swarms where each swarm has its own gbest particle. The
particles between different swarms can mate by using an arithmetic crossover
operator with a certain probability, the so-called breeding operation. The results,
however, show that this approach did not improve the overall performance, since
swarms with fewer particles do not have enough exploration power, and no
prevention mechanism is designed against such small swarms becoming too
similar to each other over time.
In another approach, Lovberg and Krink presented Self Organized Criticality
(SOC) PSO [24, 25]. The criticality measures the proximity of particles, so that the
particles that are too close to each other can be relocated in the search space to
improve the diversity of the swarm. They propose two types of relocation: The first
one is random initialization and the second one is random displacement of parti-
cles further in the search space. SOC PSO outperformed bPSO only in one out of
four cases.
3.3.1 Tribes
In all PSO variants presented earlier including the basic (canonical) version, the
description of the problem typically provides the following: the definition of the
solution (or search) space; the fitness function to be optimized (the objective
function) on each point of the search space; and finally, a stopping criterion (e.g.,
the maximum number of iterations or admissible error). The swarm, on the other
hand, can be defined by the population size and other intrinsic parameters such as
inertia factor w, acceleration constants c1 ; c2 , or perhaps some other depending on
the variant. The particular PSO variant, ‘‘Tribes’’, is a parameter-free PSO algo-
rithm. Its major properties are:
• The swarm is divided into ‘‘tribes’’.
• At the beginning, the swarm is composed of only one particle.
• According to tribes’ behaviors, particles are added or removed.
• According to the performances of the particles, their strategies of displacement
are adapted.
• Adaptation of the swarm according to the performances of the particles.
In this process, some subgroups are defined in such a way that, inside each
group, every particle informs all other particles, including itself. Therefore, these
subgroups are called tribes, a metaphor for different-sized groups of particles
moving about the search space, looking for the global solution of the problem in
hand. In practice, this process is similar to nesting in GAs with the same purpose:
As in other canonical PSO types, each particle has current and personal best
positions. A particle is said to be good if it has just improved its personal best
performance, otherwise it is neutral. Note that this is a binary definition because
improvement is not measured. We check only if it is strictly positive (real
improvement) or null (no improvement). By definition, the best performance of a
particle cannot deteriorate, and that is why there is no ‘‘bad’’ particle in the
absolute, but only by comparison. The particle having the worst personal best
position within a tribe is called the bad one. Similarly, the best particle is assigned
relative to a tribe. Moreover, compared to canonical PSO, the particle memory is
slightly improved, so that it remembers its last two performance variations, thus
maintaining a short history of its moves. On the other hand, to measure the global
performance of a tribe, two status assignments, good and bad, are used. It is
determined by a simple rule: the higher the number of good particles in a tribe, the
more the tribe is itself good and vice versa. More precisely, consider tribe T with
size N, which can be assigned either good or bad according to:

T = \begin{cases} \text{good} & \text{if } N_{Good} > \text{rand}(1, N) \\ \text{bad} & \text{else} \end{cases} \qquad (3.6)

where N_{Good} \le N is the number of good particles in T. Such a probabilistic
approach will then lead to the construction of new tribes using the adaptation rules
summarized earlier. For instance, only the worst particle in the best tribe is
removed. Moreover, for each bad tribe, a free particle is created and initialized in
such a way that the probability of a new region discovery can be higher.
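The probabilistic status rule of Eq. (3.6) can be sketched in a few lines (a minimal illustration; reading rand(1, N) as a uniform draw on [1, N] is an assumption):

```python
import random

def tribe_status(n_good, N, rng=random):
    """Eq. (3.6): a tribe of size N with n_good good particles is 'good'
    when n_good exceeds a uniform random draw from [1, N]."""
    return "good" if n_good > rng.uniform(1, N) else "bad"

# Usage: the more good particles a tribe has, the more often it is itself good.
rng = random.Random(0)
p_hi = sum(tribe_status(8, 10, rng) == "good" for _ in range(1000)) / 1000
p_lo = sum(tribe_status(2, 10, rng) == "good" for _ in range(1000)) / 1000
```

The estimated frequencies `p_hi` and `p_lo` illustrate the stated rule: the status is random, yet monotonically biased by the fraction of good particles.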
The tribe creation process starts with the randomly generated initial particle,
which also constitutes the initial tribe. It will then undergo the same PSO
velocity updates and if there is no improvement observed in the first iteration, then
a second tribe is generated and initialized with its first particle, and so on.
Therefore, the number of tribes along with their particles will be increased, in
order to improve the search ability with the increasing population as long as no
improvement is observed. As soon as a certain level of improvement is achieved,
the excess population will be regulated by removing the worst particles in the best
tribes.
3.3.2 Multiswarms
peak. A proper limit, r_rep, closer than which the swarms are not allowed to move, is
attained by using the average radius of the peak basin, r_bas. If p peaks are evenly
distributed in X^N, then r_rep = r_bas = X / p^{1/N}.
Multiswarms, or the so-called subswarms, also exist in local PSO topologies.
Recall that in the (standard) global PSO, each particle is a neighbor of all other
particles (i.e., a fully connected topology). Therefore, the personal best position
of the gbest particle guides the whole swarm as it affects all the velocity updates.
Yet, if the current global best is not close to the global optimum solution, it
may become hard, if not impossible, for the swarm to explore other areas of the
search space. Generally speaking, global PSOs usually converge faster and may
get trapped in a local optimum more easily. There are other local PSO variants
where the particles are grouped within neighborhoods according to a certain
strategy, to create subswarms. In this case, only the gbest particle in the subswarm
can influence the velocity update of a given particle in that subswarm. Conse-
quently, such local PSO variants (with certain subswarm topologies) converge
slower than the global PSO, but they have higher chance of avoiding local minima
due to greater population diversity [29]. Such a neighborhood approach actually
models the social networks [30]. Four sample topologies are shown in Fig. 3.3,
where the ‘‘Fully Connected’’ topology (also called the ‘‘Star’’ topology) corresponds
to the global PSO. All the others are examples of local PSO topologies,
where a local best (lbest) particle guides the subswarm with a neighborhood size,
K. Henceforth, the same velocity update equation as given in Eq. (3.2) will be used
while ŷ_j(t) now represents the jth dimensional component of the lbest position of a
subswarm, at time t.
Consider for instance the ‘‘Ring’’ topology; as the simplest example of local
PSO, it connects each particle with the two immediate neighbors, e.g., K = 2 (left
and right particles). The flow of information in this topology is drastically reduced
compared to the global PSO (the star topology). Using the ring topology will slow
down the convergence rate, because the best solution found has to propagate
through several neighborhoods before affecting all particles in the swarm. This
slow propagation will allow the particles to explore more areas in the search space
and thus may decrease the chance of premature convergence. On the other hand,
the choice of topology and thus the size of the neighborhood might be critical, and
moreover it will induce new parameters to the PSO.
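Finding the lbest particle that guides a given particle under a ring topology is a simple neighborhood scan, which can be sketched as follows (a minimal illustration for a minimization fitness, with K = 2 immediate neighbors as in the text):

```python
def ring_lbest(fitness, a, K=2):
    """Index of the local best for particle a in a ring topology:
    the fittest among a itself and its K/2 immediate neighbors on each side."""
    S = len(fitness)
    half = K // 2
    # Wrap around the ends of the swarm so the topology forms a ring.
    hood = [(a + off) % S for off in range(-half, half + 1)]
    return min(hood, key=lambda i: fitness[i])
```

In the velocity update of Eq. (3.2), ŷ_j(t) would then be taken from the particle returned by `ring_lbest` instead of the single global gbest, which is precisely why good solutions propagate more slowly but diversity survives longer.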
3.4 Applications
The first problem domain is nonlinear function minimization and several bench-
mark functions exist in high dimensions. Figure 3.4 presents six of these bench-
mark functions that are shown in 2D for illustration purposes. The general,
unconstrained function minimization problem in d-dimensional space has the
form:
x^* = [x_1^*, x_2^*, \ldots, x_d^*] = \arg\min_{x \in \mathbb{R}^d} f(x) \iff f(x^*) = \min_{x \in \mathbb{R}^d} f(x) \qquad (3.7)
where f(x) is the d-dimensional nonlinear function and suppose that it has a global
minimum within a practical range of xmax , which corresponds to a d-dimensional
cube representing the boundaries of the search space. All the benchmark functions
in the figure have their global minimum at the origin, i.e., x^* = [0, 0, …, 0]. This is
the most natural type of problem where the position of the PSO particle a can
directly correspond to the points in the data space, i.e., the jth component of a
d-dimensional point (x_j, j ∈ [1, d]) is stored in its positional component, x_{a,j}(t).
The PSO process can then start with a random initialization of each particle
position within the search range, x_max (i.e., as in step 1.1: Randomize x_a(1)).
Figure 3.5 presents the four plots of the personal best score of the gbest particle
versus iteration (epoch) number obtained from the individual PSO runs for the four
sample functions in dimensions d = 20 and d = 80. In all runs the following
parameters are used: c_1 = c_2 = 2, x_max = 500, V_max = x_max/5 = 100, iterNo =
5000, ε_C = 10^{-4}, and S = 50. The inertia factor, w, is linearly decreased from 0.9
to 0.4. Note that for unimodal functions in both dimensions, Sphere and Griewank,
PSO runs successfully converged (within a vicinity of ε_C) to the global minimum
within iterNo iterations. However, in the case of multimodal functions, i.e.,
Fig. 3.5 The plots of the gbest particle’s personal best scores for the four sample non-linear
functions given in Fig. 3.4. a Sphere. b Giunta. c Rastrigin. d Griewank
D_{Kmeans} = \sum_{k=1}^{K} \sum_{x_p \in c_k} \| c_k - x_p \|^2 \qquad (3.8)
where c_k is the kth cluster center, x_p is the pth data point in cluster c_k, and ‖·‖ is
the Euclidean distance metric. As a hard clustering method, K-means
suffers from the following drawbacks:
• The number of clusters K, needs to be set in advance.
• The performance of the method depends on the initial (random) centroid posi-
tions as the method converges to the closest local optima.
• The method is also dependent on the distribution of the data.
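The objective of Eq. (3.8) is locally minimized by the familiar two-step K-means iteration, which can be sketched as follows (a minimal illustration with random initial centroids, reflecting the initialization-sensitivity drawback listed above; the toy data are placeholders):

```python
import random

def kmeans(Z, K, n_iter=50, seed=0):
    """Plain K-means: assign each point to its nearest centroid, then move
    each centroid to the mean of its points (locally minimizes Eq. 3.8)."""
    rng = random.Random(seed)
    centers = [list(z) for z in rng.sample(Z, K)]  # random initial centroids
    for _ in range(n_iter):
        clusters = [[] for _ in range(K)]
        for z in Z:
            # Hard assignment: each point belongs to exactly one cluster.
            k = min(range(K),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(centers[i], z)))
            clusters[k].append(z)
        for k, pts in enumerate(clusters):
            if pts:  # keep the old centroid if the cluster went empty
                centers[k] = [sum(col) / len(pts) for col in zip(*pts)]
    return centers

# Usage: two well-separated 1-D blobs; K must be fixed in advance.
data = [(0.0,), (0.1,), (0.2,), (5.0,), (5.1,), (5.2,)]
centers = kmeans(data, 2)
```

Rerunning with different seeds illustrates the second drawback: the result depends on where the initial centroids happen to fall.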
The fuzzy version of K-means, the so-called fuzzy C-means (FCM) (sometimes
also called fuzzy K-means) was proposed by Bezdek [34], and has become the
most popular fuzzy clustering method so far. It is a fuzzy extension of the K-means
with a similar objective function as follows:
D_{FCM} = \sum_{k=1}^{K} \sum_{p=1}^{N} u_{kp}^{m} \, \| c_k - z_p \|^2 \qquad (3.9)

where u_{kp} \le 1 is a positive membership value of the data point z_p to the cluster c_k
and m > 1 is the fuzziness exponent. FCM usually achieves a better performance
compared to the K-means [35] and is less data dependent; however, it still suffers
from the same drawbacks, i.e., the number of clusters should be fixed a priori and
unfortunately it may also converge to local optima [32]. Zhang and Hsu [36]
proposed a novel fuzzy clustering technique, the so-called K-harmonic means
(KHM), which is less sensitive to initial conditions and promises further
improvements. The experimental results demonstrate that KHM outperforms both
K-means and FCM [36, 37]. An extensive survey over various types of clustering
techniques can be found in [32] and [38].
A hard clustering technique based on the bPSO was first introduced by Omran
et al. in [39] and this work showed that the bPSO can outperform K-means, FCM,
KHM, and some other state-of-the-art clustering methods in any (evaluation)
criteria. This is indeed an expected outcome due to the PSO’s aforementioned
ability to cope with local optima by maintaining a guided random search
operation through the swarm particles. In clustering, similar to other PSO appli-
cations, each particle represents a potential solution at a particular time t, i.e., the
particle a in the swarm, ξ = {x_1, …, x_a, …, x_S}, is formed as x_a(t) =
{c_{a,1}, …, c_{a,j}, …, c_{a,K}} ⇒ x_{a,j}(t) = c_{a,j}, where c_{a,j} is the jth (potential) cluster centroid
in N-dimensional data space and K is the number of clusters fixed in advance. Note
that contrary to nonlinear function minimization in the earlier section, the data
space dimension, N, is now different from the solution space dimension, K.
Furthermore, the fitness function, f, that is to be optimized is formed with respect to
two widely used criteria in clustering:
• Compactness: Data items in one cluster should be similar or close to each other
in N-dimensional space and different or far away from the others belonging to
other clusters.
• Separation: Clusters and their respective centroids should be distinct and well-
separated from each other.
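The particle encoding described above, together with a compactness measure, can be sketched as follows. Note that the exact form of Qe in Eq. (3.10) is not reproduced in this excerpt; the common definition, the average over non-empty clusters of the mean point-to-centroid distance, is used here as an assumption:

```python
import math

def decode_particle(x, K, N):
    """Split a particle position of length K*N into K centroids in N-D data space."""
    return [x[k * N:(k + 1) * N] for k in range(K)]

def quantization_error(centroids, Z):
    """Qe (assumed form): average over non-empty clusters of the mean
    distance between a centroid and the points assigned to it (compactness)."""
    dist = lambda c, z: math.sqrt(sum((a - b) ** 2 for a, b in zip(c, z)))
    clusters = [[] for _ in centroids]
    for z in Z:
        k = min(range(len(centroids)), key=lambda i: dist(centroids[i], z))
        clusters[k].append(z)
    means = [sum(dist(c, z) for z in pts) / len(pts)
             for c, pts in zip(centroids, clusters) if pts]
    return sum(means) / len(means)

# Usage: a 2-cluster particle in a 2-D data space (K = 2, N = 2, so K*N = 4).
Z = [(0, 0), (0, 1), (10, 10), (10, 11)]
qe = quantization_error(decode_particle([0.0, 0.5, 10.0, 10.5], 2, 2), Z)
```

A full validity index would combine such a compactness term with a separation term (e.g., the minimum intercluster distance), weighted as described next.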
The fitness functions for clustering are then formed as a regularization function
fusing both Compactness and Separation criteria and in this problem domain they
are known as clustering validity indices. Omran et al. used the following validity
index in their work [39]
distance, and dmin is the minimum centroid (intercluster) distance in the cluster
centroid set xa . The weights, w1 ; w2 ; w3 are user defined regularization coeffi-
cients. So the minimization of the validity index f ðxa ; ZÞ will simultaneously try to
minimize the intracluster distances (better Compactness) and maximize the
intercluster distance (better Separation). In such a regularization approach, dif-
ferent priorities (weights) can be assigned to both objectives via proper setting of
weight coefficients. Another traditional and well-known validity index is Dunn’s
index [40], which suffers from two drawbacks: It is computationally expensive and
sensitive to noise [41]. Several variants of Dunn’s index were proposed in [38],
where robustness against noise is improved. There are many other validity indices,
e.g., those proposed by Turi [42], Davies and Bouldin [43], Halkidi and Vazirganis [44],
etc. A thorough survey can be found in [41]. Most of them presented promising
results; however, none of them can guarantee the ‘‘optimum’’ number of clusters in
every clustering scheme. Especially for the aforementioned PSO-based clustering
in [39], the clustering scheme further depends on weight coefficients and may
therefore result in over- and under-clustering particularly in complex data
distributions.
In order to test the clustering performance of the bPSO, we created 15 synthetic
data spaces as shown in Fig. 3.6, and to make the evaluation independent from the
choice of the parameters, we simply used Qe in Eq. (3.10) as the clustering validity
index function. For each clustering experiment, we manually set K as the true
number of clusters existing in 2D synthetic data space. For illustration purposes
each data space is formed in 2D; however, clusters are formed with different
shapes, densities, sizes, and intercluster distances to test the robustness of
clustering methods against such variations. Furthermore, recall that the number of
clusters determines the (true) dimension of the solution space in a PSO application,
and hence data spaces with different numbers of clusters are used to test the
convergence accuracy to the true (solution space) dimension. As a result, signif-
icantly varying complexity levels are achieved among the 11 data spaces to
establish a general purpose evaluation of each technique. We set: iterNo = 2,000;
however, the use of cut-off error as a termination criterion is avoided since it is not
feasible to set a unique eC value for all clustering schemes. Therefore, each PSO
run is executed with 2,000 iterations. The swarm size is S = 80, and the rest of the
PSO parameters are set to the default values given earlier in all experiments,
except the positional range, x_max, since it can now be set simply as
the natural boundaries of the 2D data space.
The first set of clustering operations is performed over the simple data spaces
where they can yield accurate results, e.g., the results of clustering over the four
data spaces at the top row in Fig. 3.6 are shown in Fig. 3.7 where each cluster is
represented in one of the three color codes (red, green, and blue) for illustration
purposes and each cluster centroid (each dimensional component of the gbest
particle) is shown with a white ‘+’.
The convergence accuracy of PSO tends to degrade with the increasing
dimensionality and complexity due to the well-known ‘‘curse of dimensionality’’
phenomenon from which no optimization technique can be entirely immune.
Figure 3.8 presents typical clustering results for K 10 and while running each
PSO operation till iteration number reaches to 20,000 (i.e., stagnation). K = 10 is
indeed not a too high dimension for PSO but it particularly suffers from the highly
complex clustering schemes as in C4 and C5 (i.e., varying sizes, shapes, and
Fig. 3.7 PSO clustering for 2D data spaces C1-C4 shown in Fig. 3.6
Fig. 3.8 Erroneous PSO clustering over data spaces C4, C5, C8, and C9 shown in Fig. 3.6
3.4 Applications 61
structures among clusters). Over a simpler data space, e.g., C6 with 13 clusters, we noticed that PSO occasionally yields accurate clustering; however, for data spaces with 20–25 clusters or more, clustering errors become inevitable regardless of the level of complexity, and the errors tend to increase significantly in higher dimensions as a natural consequence of earlier local traps. A typical example is C9, which has 42 clusters in the simplest form (uniform size, shape, and density); the clustering result presents many over- and under-clustering artifacts with many occasionally mislocated centroids. Much worse performance can be expected over C10 and C11.
As a result, although PSO-based clustering outperforms many well-known clustering methods, it still suffers from two major drawbacks. The number of clusters, K (which is also the solution space dimension), must still be specified in advance, and, as in other PSO applications, the method tends to get trapped in local optima, particularly when the complexity of the clustering scheme increases. This also involves the dimension of the solution space, i.e., convergence to the "optimum" number of "true" clusters can only be guaranteed for low dimensions. Recall from the earlier discussion that this is also true for the dynamic clustering schemes DCPSO [45] and MEPSO [46], both of which eventually present results only in low dimensions (K ≤ 10 in [45] and K ≤ 6 in [46]) and for simple data distributions. The degradation is likely to be more severe for DCPSO in particular, since it entirely relies on K-means for the actual clustering.
3.4.3.1 An Overview
Another application domain is the training of artificial neural networks (ANNs) for
supervised data classification. After the introduction of simplified neurons by
McCulloch and Pitts in 1943 [20], ANNs have been widely applied to many
application areas, most of which used feedforward ANNs with the back propa-
gation (BP) training algorithm. An ANN consists of a set of connected processing
units, usually called neurons or nodes. ANNs can be described as directed graphs,
where each node performs some activation function on its inputs and then passes
the result forward to be the input of some other neurons, until the output neurons
are reached.
ANNs can be divided into feedforward and recurrent networks according to their connectivity. In a recurrent ANN there can be backward loops in the network structure, while in feedforward ANNs there are no loops. Furthermore, feedforward ANNs are usually organized into layers of parallel neurons, and only connections between adjacent layers are possible. All layers besides the input and the output layers are called hidden layers. Commonly, the input layer is just a passive layer, where no computations are carried out, and it is thus not counted in the total number of layers. The active neurons perform an activation function f of the form,
$$ y_k^{p,l} = f\!\left( \sum_{j=1}^{N^{l-1}} w_{jk}^{l}\, y_j^{p,l-1} + \theta_k^{l} \right) \qquad (3.11) $$

where $y_k^{p,l}$ is the output of neuron k at layer l when pattern p is fed to the ANN, $N^{l-1}$ is the total number of neurons in layer l−1, $w_{jk}^{l}$ is the connection weight between neuron j at layer l−1 and neuron k at layer l, and $\theta_k^{l}$ is the bias of neuron k. For the first processing layer (the layer after the input layer), $y_j^{p,l-1} = y_j^{p,0}$ is naturally the jth dimension of the input $x_j^{p}$. The numbers of input neurons $N_I$ and output neurons $N_O$ are defined by the problem, while the number of hidden layers and the number of neurons in each hidden layer are usually decided by an expert rule of thumb with respect to the problem.
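The layer computation of Eq. (3.11) translates directly into code. The choice of the logistic sigmoid for f and all identifiers below are assumptions for illustration only:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Logistic sigmoid, assumed here as the activation function f.
static double sigmoid(double s) { return 1.0 / (1.0 + std::exp(-s)); }

// One feedforward layer per Eq. (3.11):
//   y_k = f( sum_j w[j][k] * yPrev[j] + theta[k] )
std::vector<double> forwardLayer(const std::vector<double>& yPrev,
                                 const std::vector<std::vector<double> >& w, // w[j][k]
                                 const std::vector<double>& theta) {
    std::vector<double> y(theta.size(), 0.0);
    for (std::size_t k = 0; k < theta.size(); ++k) {
        double s = theta[k];                         // bias of neuron k
        for (std::size_t j = 0; j < yPrev.size(); ++j)
            s += w[j][k] * yPrev[j];                 // weighted inputs
        y[k] = sigmoid(s);
    }
    return y;
}
```

Chaining this function layer by layer, starting from the input pattern, yields the full forward pass of the network.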
A sample feedforward ANN is illustrated in Fig. 3.9. It has three layers (two hidden layers and the output layer). Figure 3.9 also shows the connection weights $w_{j1}^{2}$ and the bias $\theta_1^{2}$ for the first neuron in layer 2.
As an example illustrated in Fig. 3.9, the most common ANN type is the
multilayer perceptron (MLP) [47]. It is a feedforward network, which contains one
or more hidden layers, each with a given number of neurons. The degree of neuron
connectivity is usually high and the neurons have smooth nonlinear activation
functions. The use of nonlinear activation functions is essential, because otherwise
the MLP could always be reduced to a single-layer perceptron (SLP) without
changing its capabilities. Another popular type of feedforward ANN is the radial basis function (RBF) network [48], which always has two layers in addition to the passive input layer: a hidden layer of RBF units and a linear output layer. Only the output layer has connection weights and biases. The activation function of the kth RBF unit is defined as,

$$ y_k = \varphi\!\left( \frac{\lVert X - \mu_k \rVert}{\sigma_k^2} \right) \qquad (3.12) $$
where $\varphi$ is a radial basis function or, in other words, a strictly positive radially symmetric function, which has a unique maximum at the N-dimensional center $\mu_k$ and whose value drops rapidly to zero away from the center; $\sigma_k$ is the width of the peak around the center $\mu_k$. The activation function takes noteworthy values only when the distance between the N-dimensional input $X$ and the center $\mu_k$, $\lVert X - \mu_k \rVert$, is smaller than the width $\sigma_k$. The most commonly used activation function in RBF networks is the Gaussian basis function defined as,

$$ y_k = \exp\!\left( -\frac{\lVert X - \mu_k \rVert^2}{2\sigma_k^2} \right) \qquad (3.13) $$

where $\mu_k$ and $\sigma_k$ are the mean and standard deviation of the Gaussian function and $\lVert \cdot \rVert$ is the Euclidean norm. While MLPs construct global approximations to nonlinear input–output mappings, RBF networks are built from local approximations centered on clusters of input training samples; both have a universal function approximation capability. For more detailed information about MLPs and RBF networks, the reader may consult [47].
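Eq. (3.13) translates almost verbatim into code; the function name below is illustrative:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Gaussian RBF unit output per Eq. (3.13):
//   y_k = exp( -||x - mu_k||^2 / (2 * sigma_k^2) )
double rbfGaussian(const std::vector<double>& x, const std::vector<double>& mu, double sigma) {
    double d2 = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i)
        d2 += (x[i] - mu[i]) * (x[i] - mu[i]);   // squared Euclidean distance
    return std::exp(-d2 / (2.0 * sigma * sigma));
}
```

The unit responds with 1 exactly at its center and decays toward 0 as the input moves more than about one width $\sigma_k$ away, which is the "local approximation" behavior described above.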
Back propagation (BP) is the most commonly used training technique for feedforward ANNs. BP has the advantage of a directed search, in which the weights are updated so as to minimize the error. However, several aspects make the algorithm far from universally useful. Most troublesome is its strict dependency on the learning rate parameter, which, if not set properly, can lead either to oscillation or to an indefinitely long training time. Network paralysis may also occur, i.e., as the ANN trains, the weights tend to assume quite large values and the training process can come to a virtual standstill. Furthermore, BP slows down by an order of magnitude for every extra (hidden) layer added to the ANN. After all, BP is simply a gradient descent algorithm over the error space, which can be complicated and may contain many deceiving local minima (multimodal). Therefore, BP is likely to get trapped in a local minimum, making it entirely dependent on the initial (weight) settings.
Let $N_h^l$ be the number of hidden neurons in layer l of an MLP with input and output layer sizes $N_I$ and $N_O$, respectively. The input neurons are merely fan-out units, since no processing takes place there. Let F be the activation function applied over the weighted inputs plus a bias, as follows:

$$ y_k^{p,l} = F\!\left(s_k^{p,l}\right) \quad \text{where} \quad s_k^{p,l} = \sum_{j} w_{jk}^{l-1}\, y_j^{p,l-1} + \theta_k^{l} \qquad (3.14) $$

where $y_k^{p,l}$ is the output of the kth neuron of the lth hidden/output layer when pattern p is fed at the input, $w_{jk}^{l-1}$ is the weight from the jth neuron in layer l−1 to the kth neuron in layer l, and $\theta_k^{l}$ is the bias value of the kth neuron of the lth hidden/output layer. The training mean square error, MSE, at the output layer is formulated as,

$$ \mathrm{MSE} = \frac{1}{2 P N_O} \sum_{p \in T} \sum_{k=1}^{N_O} \left( t_k^p - y_k^{p,O} \right)^2 \qquad (3.15) $$

where $t_k^p$ is the target output for pattern p and P is the number of patterns in the training set T.
One complete run over the training dataset is called an epoch. Usually many epochs are required to obtain the best training results; on the other hand, too many training epochs can lead to overfitting. In the above realization of the BP algorithm, the network parameters are updated after every training sample. This is called the online or sequential mode. The other possibility is the batch mode, where all the training samples are first presented to the network and then the parameters are adjusted so that the total training error is minimized. The sequential mode is often favored over the batch mode as it requires less storage space. Moreover, the sequential mode is less likely to get trapped in a local minimum, as updates after every training sample make the search stochastic in nature. Hence, the sequential BP mode is used for MLP training here.
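The MSE of Eq. (3.15), which both BP and PSO minimize during training, can be sketched as follows (identifiers are illustrative; the set T is represented as an index range over the patterns):

```cpp
#include <cstddef>
#include <vector>

// Training MSE per Eq. (3.15): squared output errors are summed over all
// P training patterns and N_O output neurons, then scaled by 1 / (2 * P * N_O).
double trainMSE(const std::vector<std::vector<double> >& target, // target[p][k]
                const std::vector<std::vector<double> >& out) {  // out[p][k]
    const std::size_t P  = target.size();
    const std::size_t NO = target[0].size();
    double sum = 0.0;
    for (std::size_t p = 0; p < P; ++p)
        for (std::size_t k = 0; k < NO; ++k) {
            double e = target[p][k] - out[p][k];
            sum += e * e;
        }
    return sum / (2.0 * P * NO);
}
```

In the sequential mode described above, the per-sample error (a single p-term of this sum) drives each weight update, while the full MSE is what is reported per epoch.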
PSO has been successfully applied to training feedforward [50–53] and recurrent ANNs [54, 55], and several works in this field have shown that it can achieve a learning ability superior to the traditional BP method in terms of accuracy and speed. At time t, suppose that particle a in the swarm, $\xi = \{x_1, \ldots, x_a, \ldots, x_S\}$, has the positional component

$$ x_a(t) = \left\{ \{w_{jk}^0\}, \{w_{jk}^1\}, \{\theta_k^1\}, \{w_{jk}^2\}, \{\theta_k^2\}, \ldots, \{w_{jk}^{O-1}\}, \{\theta_k^{O-1}\}, \{\theta_k^O\} \right\} $$

where $\{w_{jk}^l\}$ and $\{\theta_k^l\}$ represent the sets of weights and biases of layer l. Note that the input layer (l = 0) contains only weights, whereas the output layer (l = O) has only biases. With such a direct encoding scheme, particle a represents all potential network parameters of the MLP architecture. In the next section, we shall present a direct comparison of PSO versus BP training of a collection of MLPs for supervised classification over several medical datasets.
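Under this direct encoding, the particle (and hence the solution space) dimension is simply the total parameter count of the MLP. A small sketch, with an illustrative name, counting all inter-layer weights plus one bias per non-input neuron:

```cpp
#include <cstddef>
#include <vector>

// Dimension of the PSO particle that directly encodes all MLP parameters:
// every weight between consecutive layers plus one bias per non-input neuron.
// layers = {N_I, hidden sizes..., N_O}.
std::size_t particleDimension(const std::vector<std::size_t>& layers) {
    std::size_t dim = 0;
    for (std::size_t l = 1; l < layers.size(); ++l)
        dim += layers[l - 1] * layers[l]   // weights into layer l
             + layers[l];                  // biases of layer l
    return dim;
}
```

For example, the 9 × 8 × 1 × 2 MLP used later in this section carries 93 parameters, so each particle moves in a 93-dimensional space.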
and $R_{max}$. Finally, the most complex MLP with the largest number of possible layers and the highest number of neurons is associated with the highest index, d = 40. Therefore, all 41 entries in the hash table span the architecture space with respect to the configuration complexity.
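The hash table described above can be sketched by enumeration. Using $R_{min} = \{N_I, 1, 1, N_O\}$ and $R_{max} = \{N_I, 8, 4, N_O\}$ given later in this section, the 1 + 8 + 8 × 4 = 41 configurations arise from the SLP, the one-hidden-layer MLPs, and the two-hidden-layer MLPs; the exact complexity ordering used by the book is an assumption here.

```cpp
#include <cstddef>
#include <vector>

// Enumerate a sample architecture space between Rmin = {NI, 1, 1, NO} and
// Rmax = {NI, 8, 4, NO}: the SLP, the 1-hidden-layer MLPs (1..8 neurons), and
// the 2-hidden-layer MLPs (1..8 x 1..4 neurons). The table has 41 entries;
// the ordering by "configuration complexity" below is illustrative.
std::vector<std::vector<std::size_t> > buildArchitectureSpace(std::size_t NI, std::size_t NO) {
    std::vector<std::vector<std::size_t> > table;
    table.push_back(std::vector<std::size_t>{NI, NO});             // d = 0: SLP
    for (std::size_t h1 = 1; h1 <= 8; ++h1)                        // d = 1..8
        table.push_back(std::vector<std::size_t>{NI, h1, NO});
    for (std::size_t h2 = 1; h2 <= 4; ++h2)                        // d = 9..40
        for (std::size_t h1 = 1; h1 <= 8; ++h1)
            table.push_back(std::vector<std::size_t>{NI, h1, h2, NO});
    return table;
}
```

The hash index d is then just the position of a configuration in this table, so d = 0 maps to the simplest network and d = 40 to the most complex one.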
The comparative evaluations of both training algorithms were performed using
a medical diagnosis benchmark dataset from the UCI Machine Learning repository
[56], which is partitioned into three sets: training, validation, and testing. There are
several techniques [57] to use training and validation sets individually to prevent
overfitting and thus to improve the classification performance in the test data.
However, there is no universally effective technique and there are several research
articles reporting against the use of the cross validation technique in the design and
training of MLP networks [57], [58]. In this book, for simplicity and to obtain an
unbiased performance measure under equal training conditions, the validation and
training sets are simply combined and used for training. From the Proben1 repository [56], three benchmark classification problems, breast cancer, heart disease, and diabetes, are selected, as they were commonly used by previous studies. These are medical diagnosis problems with the following attributes:
• All of them are real-world problems based on medical data from human patients.
• The input and output attributes are similar to those used by a medical doctor.
• Since medical samples and data are expensive to obtain, the training sets are
quite limited.
The initial dataset consists of 920 exemplars with 35 input attributes, some of which are severely incomplete. Hence, a second dataset was composed using the cleanest part of this set, which was created at the Cleveland Clinic Foundation by Dr. Robert Detrano. This Cleveland dataset is called "heartc" in the Proben1 repository and contains 303 exemplars, but 6 of them still contain missing data and are hence discarded. The remaining exemplars are partitioned as follows: 149 for training, 74 for validation, and 74 for testing. There are 13 input and 2 output attributes. The purpose is to predict the presence of heart disease according to the input attributes.
3. Diabetes
This dataset is used to predict diabetes diagnosis among Pima Indians. The data were collected from female patients aged 21 years or older. There is a total of 768 exemplars, of which 500 are classified as diabetes negative and 268 as diabetes positive. The dataset is originally partitioned as 384 exemplars for training, 192 for validation, and 192 for testing. It consists of eight input attributes and two output attributes.
The input attributes of all datasets are scaled within the range [0, 1] by a linear function. Note that the output attributes are encoded using a 1-of-c representation for c = 2 classes. The winner-takes-all methodology is applied, so that the output with the highest activation designates the class. Overall, the experimental setup becomes identical to those used in the previous studies, and thus fair comparative evaluations can now be made over the classification error rate of the test data. In all experiments in this section, we use the sample architecture space given in Table 3.2, which has the generalized form $R_{min} = \{N_I, 1, 1, N_O\}$ and $R_{max} = \{N_I, 8, 4, N_O\}$, containing the compact 1-, 2-, and 3-layer MLPs, where $N_I$ and $N_O$ are determined by the numbers of input and output attributes of the classification problem. For BP, all networks were trained with 500 iterations (shallow training) and with 5,000 iterations (deep training), with a low learning rate of 0.02 to prevent oscillations. For PSO training, in addition to the default settings for the standard algorithm parameters defined in Sect. 2.3, the number of particles was set to 40 (S = 40)
and the number of training iterations was set to 200 for the shallow and 2,000 for
the deep training cases. For all experiments in this section, unless stated otherwise,
100 independent runs are performed for each configuration to compute the error
statistics plots for each dataset. We mainly consider two major criteria for the
performance assessment: (1) training MSE, which indicates the error minimization
achieved by each method; (2) test CE, which is the primary objective of the
classifier as it shows the classification accuracy level achieved as well as the
generalization capability of each method. Using the corresponding error statistics plots, both criteria shall then be statistically evaluated by considering the average (i.e., mean MSE and CE) and the best (i.e., minimum MSE and CE) performances achieved by each method, BP and PSO.
In order to perform a comprehensive and systematic assessment of the performance of ANN classifiers in medical diagnosis, we apply exhaustive BP and PSO training for each network configuration in the architecture space, which is defined over MLPs with sigmoid activation functions. In this way, we avoid the bias or possible effect of a particular network on the performance, which was the case in many of the aforementioned studies, most of which were performed using only one or a few fixed network architecture(s). Furthermore, to assess the effect of the training depth on both BP and PSO, both shallow and deep training are applied to every network configuration in the architecture space by setting the number of iterations appropriately.
Fig. 3.10 Train (top) and test (bottom) error statistics vs. hash index plots from shallow BP- and PSO-training over the breast cancer dataset
Figure 3.10 presents the corresponding error statistics plots from the shallow training over the breast cancer dataset. BP in general achieves the lowest average
training MSEs within a narrow variance, except for a few network configurations with indices $d \in [13, 16]$, where PSO slightly surpasses BP. On the other hand, PSO achieves the best training performances (i.e., minimum MSEs) over the majority of network configurations except for the compact ones ($d \in [0, 9]$), where BP is consistently more successful. The lowest overall training MSE (both average and minimum) is also achieved by PSO, using the configurations with the hash indices d = 14 and d = 16 (MLPs: 9 × 6 × 1 × 2 and 9 × 8 × 1 × 2), respectively. In terms of the classification performance over the test set, the results are consistently in favor of PSO, which performs better than BP with respect to both performance criteria. In particular, PSO achieved the optimal 0 % CE (i.e., 100 % classification accuracy) as its best performance over all networks except two (d = 9 and 10), whereas BP managed to achieve this only over the compact networks (i.e., d = 0 and $d \in [2, 8]$), plus the complex MLP with d = 39. Overall, PSO usually demonstrates a better classification performance for the breast cancer dataset with shallow training.
The error statistics plots obtained from deep training of all networks in the architecture space by both methods are shown in Fig. 3.11. In this case, both BP and PSO achieve lower training MSEs as a natural consequence of the deep (over)training. BP in general achieves the lowest average training MSEs, particularly on complex networks with two hidden layers, and it also surpasses PSO in terms of the minimum training MSEs except for only a few networks. Due to such overtraining, the classification performance of both methods is expected to degrade, which is indeed the case, as shown by the bottom plots of Fig. 3.11.
Fig. 3.11 Train (top) and test (bottom) error statistics vs. hash index plots from deep BP- and PSO-training over the breast cancer dataset
Fig. 3.12 Train (top) and test (bottom) error statistics vs. hash index plots from shallow BP- and PSO-training over the heart disease dataset
Fig. 3.13 Train (top) and test (bottom) error statistics vs. hash index plots from deep BP- and PSO-training over the heart disease dataset
Contrary to the shallow training results, the top plot in Fig. 3.13 indicates that whenever deep training is performed over this dataset, BP surpasses PSO with respect to the training MSEs (both average and minimum) for all networks. Hence, due to overfitting of the training data, the classification performance of BP over the test set is quite degraded, while no significant performance degradation occurs for PSO. PSO once again exhibits its relative immunity to overtraining, owing to its global search ability, and yields the best classification performance over the test set (i.e., good generalization) regardless of the training depth. This is true for both the average and the best performance criteria considered (see the red and blue curves in the bottom plot of Fig. 3.13). PSO achieves the overall best classification performance, ~13 % CE, from three different networks with the hash indices d = 14, 30, and 39, although no network configuration makes much difference as far as the average classification performance is concerned.
Figure 3.14 presents the corresponding error statistics plots from the shallow
training over the diabetes dataset. Similar comments can be made about the
training performance of PSO and BP as in the shallow training experiments over
the heart disease dataset. That is, although BP is consistently better than PSO for
compact networks, their training performances (minimum and average MSEs) are quite comparable and vary with the network configuration. In terms of the classification performance over the test set, PSO usually achieves slightly lower CEs, but the results are again quite comparable. When minimum CEs are
concerned, PSO achieved a minimum CE of 17.1 % from the network with the hash index d = 16 (MLP: 8 × 8 × 1 × 2), slightly lower than the 18.8 % minimum CE achieved by BP from the network with the hash index d = 4 (MLP: 8 × 4 × 2).
Fig. 3.14 Train (top) and test (bottom) error statistics vs. hash index plots from shallow BP- and PSO-training over the diabetes dataset
Finally, according to the error statistics plots in Fig. 3.15 obtained by
deep training over the same dataset, similar conclusions can be drawn about both
training (MSE) and generalization (test CE) performances of PSO and BP as in the
deep training experiments over the heart disease dataset, i.e., PSO yields almost
the same average training MSE levels and BP significantly reduces both average
and minimum MSE levels, as expected. One observation worth mentioning here is that the training and test performances of BP, in both deep and shallow training, exhibit a large variation with respect to the network configuration used (compare, for instance, the mean BP training MSE or test CE for d = 8 and d = 9), whereas the corresponding performance levels of PSO are more stable, usually with a smaller variance, regardless of the network configuration.
The overall test CE statistics of both training techniques (BP and PSO) computed over all configurations in the selected MLP architecture spaces for each dataset are listed in Table 3.3. We use the following three statistics: minimum (min), mean (μ), and standard deviation (σ), each computed per training depth (deep and shallow). The results in the table clearly indicate that PSO training on average achieves better classification performance than BP over the three benchmark medical diagnosis problems.
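The three statistics can be sketched as below; the population form of the standard deviation and all names are assumptions for illustration:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct RunStats { double minv, mean, stdev; };

// Error statistics over independent runs, as used for Table 3.3:
// minimum (min), mean (mu), and (population) standard deviation (sigma)
// of the classification errors.
RunStats computeStats(const std::vector<double>& ce) {
    RunStats s;
    s.minv = ce[0];
    double sum = 0.0;
    for (double v : ce) { if (v < s.minv) s.minv = v; sum += v; }
    s.mean = sum / ce.size();
    double var = 0.0;
    for (double v : ce) var += (v - s.mean) * (v - s.mean);
    s.stdev = std::sqrt(var / ce.size());
    return s;
}
```

Applied to the 100 CE values of each configuration, this yields one (min, μ, σ) triple per method, dataset, and training depth.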
Fig. 3.15 Train (top) and test (bottom) error statistics vs. hash index plots from deep BP- and PSO-training over the diabetes dataset
Finally, in order to complete the comparative performance evaluations of each method with respect to variations in the training depth, we have selected a particular network configuration, with the hash index d = 16, which is a relatively compact and one of the best performing classifier configurations within the sample
architecture space, and we have performed exhaustive training (with 100 runs) for
each of the 10 intermediate (training) depths, between the corresponding shallow
and deep training, i.e., [500, 5,000] for BP and [200, 2,000] for PSO.
Fig. 3.16 Error statistics (for the network configuration with the hash index d = 16) vs. training depth plots using BP and PSO over the breast cancer (top), heart disease (middle), and diabetes (bottom) datasets
Figure 3.16 shows the training MSE and test CE plots versus the training depth for all three datasets. From the figure, it is clear that both methods reduce the training MSE with increasing training depth, as a natural consequence of overfitting the training data. On the other hand, PSO achieves lower average and minimum
training MSEs for the breast cancer dataset, higher ones for the diabetes dataset, and quite similar values for the heart disease dataset. The classification performance of PSO shows a strong immunity to variations in the training depth, and it generally achieves the lowest minimum CEs. For this particular network, either BP or PSO can achieve a better average classification performance depending on the training depth. Hence, this clearly leads to the conclusion that the training depth, too, should be considered when comparing and/or analyzing the individual performance of each method.
3.5 Programming Remarks and Software Packages
This is the first chapter in which we begin to briefly explain the software packages supplied along with this book. All software programs are developed in the C and C++ languages under Microsoft Visual Studio 6.5 (VS6.5: version 6 with Service Pack 5). Several applications have been developed; in this chapter, we start by describing the PSO test-bed application for nonlinear function minimization, namely PSO_MDlib. It is a simple console application in a VS6.5 workspace with three projects: MTL, PSO_MDlib, and PSO_MDmain. The first one, MTL, stands for MUVIS Template Library, which contains basic data structures such as linked lists (queues), registers, threads, etc. (we shall return to MUVIS later). PSO_MDlib is a static library, where the basic (canonical and global) PSO has been implemented along with its multidimensional extension, MD PSO (to be discussed in Chap. 4). Finally, PSO_MDmain.cpp is the main console application, which offers three different entry point functions, main(), two of which are enabled with a compiler flag: BATCH_RUN or MOVING_PEAKS_BENCHMARK. The latter enables a special MD PSO application over a benchmark dynamic environment, which will be presented as the application of Sect. 5.2 in Chap. 5. The compiler flag BATCH_RUN can be enabled so as to test PSO and/or MD PSO for both PSO types (BASIC vs. FGBF), over all test functions, and with several runs (FGBF will also be discussed in Chap. 5).
All intrinsic PSO (and MD PSO) parameters are stored in the PSOparam structure (see the header file PSOparam.h) (Table 3.4). The first three parameters, S (_noAgent), iterNo (_maxNoIter), and eC (_eCutOff), are common to PSO and MD PSO. The next three parameters are specific to MD PSO and will be covered in Chap. 4. Here, we shall perform a bPSO application; therefore, _mode is set to BASIC. Since MD PSO is simply the multidimensional extension of PSO, when _dMin = _dMax − 1, MD PSO reduces to a regular PSO process at the solution space dimension _dMin.
PSO and its extension MD PSO are jointly implemented in an object-oriented, template-based, and morphological structure with four class implementations:
1. template <class T, class X> class CPSO_MD {…}
2. template <class T, class X> class CParticle {…}
3. template <class X> class CSolSpace {…}
4. class COneClass {…}
The main class, template <class T, class X> class CPSO_MD, is a template-based class with two template class arguments, class T and class X. The template class T represents a potential solution of the problem, and we assign class T to the following template class: template <class X> class CSolSpace<X>. It simply contains the position of the solution, its dimension, and the boundaries, i.e.,
X* m_pPos; // Current position in the N-dimensional solution space.
int m_nDim; // Dimension of the solution space.
X m_min, m_max; // Minimum and maximum ranges (boundaries of the solution).
Finally, the template class X represents the data space, where the real values of the PSO particle elements are stored. For instance, it can simply be set to a standard data type such as float or double for nonlinear function minimization; yet for generic usage, we shall assign it to class COneClass, which contains nothing but a single floating point data element along with its individual score, i.e.,
float m_x; // Data per dimension.
float m_bScore; // The individual score of each dimension.
The standard arithmetic and comparison operators (=, +, −, /, <, >, etc.) are implemented (over m_x) accordingly. We shall clarify the use of the member variable m_bScore in Chap. 5.
All nonlinear functions are implemented within the MYFUN.cpp source file. To perform a PSO operation for nonlinear function minimization, a CPSO_MD object should be created with the proper template classes <class T, class X> and initialized with: (1) the default PSO parameters stored in PSOparam _psoDef, and (2) the fitness function, i.e., any nonlinear function within MYFUN.cpp. In short, the entire MD PSO initialization can be summarized as:
1. Create the object:
CPSO_MD<CSolSpace<COneClass>, COneClass>* pPSO = new CPSO_MD<CSolSpace<COneClass>, COneClass>(_psoDef._noAgent, _psoDef._maxNoIter, _psoDef._eCutOff, _psoDef._mode);
2. Initialize the object:
pPSO->Init(_psoDef._dMin, _psoDef._dMax, _psoDef._vdMin, _psoDef._vdMax, -500, 500, _psoDef._xvMin, _psoDef._xvMax);
Table 3.6 The main loop for (MD) PSO in Perform() function
Table 3.7 Implementation of Step 3.1 of the PSO pseudo-code given in Table 3.1
Table 3.9 Implementation of Step 3.4 of the PSO pseudo-code given in Table 3.1
composed of the previous velocity weighted by the inertia factor, weight_up. The velocity is clamped as in step 3.4.1.2 of the PSO pseudo-code (vx->CheckLimit();) and, finally, the particle's current position is updated to a new position with the composed and clamped velocity term (i.e., *xx += vx). As the last step, the new position is also verified to fall within the problem boundaries specified in advance (xmax). Note that this practical verification step is omitted in the pseudo-code of the PSO given in Table 3.1.
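Step 3.4 as described can be sketched for a single dimension as follows (parameter names are illustrative; the text's weight_up corresponds to w here, and std::rand stands in for the book's random number source):

```cpp
#include <algorithm>
#include <cstdlib>

// One PSO dimension update as in Step 3.4: compose the new velocity from the
// inertia-weighted previous velocity plus the cognitive and social terms,
// clamp it (the CheckLimit() call in the text), move the particle (*xx += vx),
// then verify the new position against the problem boundaries.
float updateDimension(float& x, float& v, float pbest, float gbest,
                      float w, float c1, float c2,
                      float vmax, float xmin, float xmax) {
    float r1 = std::rand() / static_cast<float>(RAND_MAX);
    float r2 = std::rand() / static_cast<float>(RAND_MAX);
    v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x);
    v = std::max(-vmax, std::min(vmax, v));   // velocity clamping
    x += v;                                    // position update
    x = std::max(xmin, std::min(xmax, x));     // keep within problem boundaries
    return x;
}
```

Repeating this update for every dimension of every particle, once per iteration, constitutes the main loop of Perform() outlined in Table 3.6.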
As mentioned earlier, the multidimensional extension of bPSO, the MD PSO, will be explained in Chap. 4, and the FGBF technique will be explained in Chap. 5. Accordingly, the programming details that have so far been skipped in this section shall be discussed at the end of those chapters.
Chapter 4
Multi-dimensional Particle Swarm
Optimization
Imagine now that each PSO particle can also change its dimension, i.e., that it has the ability to jump to another (solution space) dimension as it sees fit. In that dimension it simply performs regular PSO moves, but at any iteration it can still jump to any other dimension. In this chapter we shall show how the design of PSO particles is extended into Multi-dimensional PSO (MD PSO) particles so as to perform interdimensional jumps without altering or breaking the natural PSO concept.
The major drawback of the basic PSO algorithm and of many PSO variants is that they can only be applied to a search space with a fixed number of dimensions. However, in many optimization problems (e.g., clustering, spatial segmentation, optimization of multi-dimensional functions, evolutionary artificial neural network design, etc.), the optimum dimension where the optimum solution lies is also unknown and should thus be determined within the PSO process. Take, for instance, the optimization of multi-dimensional functions. Let us start with the d-dimensional sphere function,
$$F(\mathbf{x}) = \sum_{i=1}^{d} x_i^2 \qquad (4.1)$$
This is a unimodal function with a single minimum point at the origin. When d is fixed and thus known a priori, there are powerful optimization techniques, including PSO, which can easily find the exact minimum or converge to an ε-neighborhood of the minimum. We can easily extend this to a family of functions, i.e.,
$$F(\mathbf{x}, d) = \sum_{i=1}^{d} x_i^2, \quad \forall d \in \{D_{\min}, D_{\max}\} \qquad (4.2)$$
In this case, it is obvious that there is only one true optimum point: at the origin of dimension d0. In other words, in all other dimensions, the function F(x, d) has only suboptimal points at the origin of each dimension. One straightforward alternative for optimizing this type of multi-dimensional function is to run the method separately for every dimension in the range. However, this might be too costly, if not infeasible, for many problems, especially depending on the dimensional range.
Another typical example of a multi-dimensional optimization problem is data clustering, where the true number of clusters is usually unknown. Some typical 2D synthetic data spaces with ground truth clusters were shown in Fig. 3.17 in the previous chapter, some of which are also shown in Fig. 4.1. Recall that for illustration purposes each data space is formed in 2D; however, the clusters are formed with different shapes, densities, sizes, and inter-cluster distances. Such clustering complexity makes their error surfaces highly multimodal, and the optimization method for clustering now has to find the true number of clusters as well as the accurate cluster centroids around which the clusters are formed. Only a few PSO studies have so far focused on this problem, e.g., [1] and the work of Omran et al. [2].
In its simplest form, MD PSO re-forms the native structure of the swarm particles in such a way that they can make interdimensional passes with a dedicated dimensional PSO process. Therefore, in a multi-dimensional search space where the optimum dimension is unknown, swarm particles can seek both positional and dimensional optima. This eventually removes the necessity of setting a fixed dimension a priori, which is a common drawback for the family of swarm optimizers. Instead of operating at a fixed dimension d, the MD PSO algorithm is designed to seek both positional and dimensional optima within a dimension range, $D_{\min} \le d \le D_{\max}$.
In order to accomplish this, each particle has two sets of components, each of which is subjected to one of two independent and consecutive processes. The first is a regular positional PSO, i.e., the traditional velocity updates and corresponding positional shifts in an N-dimensional search (solution) space. The second is a dimensional PSO, which allows the particle to navigate through dimensions. Accordingly, each particle keeps track of its last position, velocity, and personal best position (pbest) in each dimension it has visited, so that when it revisits the same dimension at a later time, it can perform its regular "positional" fly using this information. The dimensional PSO process of each particle may then move the particle to another dimension, where it will remember its positional status and keep "flying" within the positional PSO process in that dimension, and so on. The swarm, on the other hand, keeps track of the gbest particles in all dimensions, each of which respectively indicates the best (global) position so far achieved and can
thus be used in the regular velocity update equation for that dimension. Similarly, the dimensional PSO process of each particle uses its personal best dimension, in which the personal best fitness score has so far been achieved. Finally, the swarm keeps track of the global best dimension, dbest, among all the personal best dimensions.

Fig. 4.2 An illustrative MD PSO process during which particles 7 and 9 have just moved to the 2D and 3D solution spaces at time t, whereas particle a is sent to the 23rd dimension
Figure 4.2 illustrates a typical MD PSO operation at a time instant t where three particles have just been moved to the dimensions d = 2 (particle 7), d = 3 (particle 9), and d = 23 (particle a), respectively, by the dimensional PSO process under the guidance of dbest. The figure also shows illustrative 2D and 3D solution (search) spaces in which the particles currently in these dimensions (including 7 and 9) are making regular positional PSO moves under the guidance of the gbest particles (gbest(2) and gbest(3)). Afterwards, each particle in these dimensions is free to leave for another dimension; similarly, new particles may arrive and perform the positional PSO operation, and so on.
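The per-particle bookkeeping described above can be sketched as a data structure; the names below are illustrative, not the identifiers of the authors' PSO_MDlib code:

```cpp
#include <cassert>
#include <vector>

// Hypothetical sketch of the per-particle state of MD PSO. For every
// dimension d in [Dmin, Dmax] the particle stores a position, a velocity,
// and a personal best, so a revisit to d can resume the positional PSO
// there. The dimensional PSO state (xd, vd, xd_p) is kept alongside.
struct MDParticle {
    int Dmin, Dmax;
    // One positional state per candidate dimension; entry k corresponds
    // to dimension d = Dmin + k and holds d components.
    std::vector<std::vector<double>> xx;  // current positions
    std::vector<std::vector<double>> vx;  // current velocities
    std::vector<std::vector<double>> xy;  // personal best positions
    // Dimensional PSO state.
    int    xd;   // current dimension
    double vd;   // dimensional velocity
    int    xd_p; // personal best dimension (xd~ in the text)

    MDParticle(int dmin, int dmax)
        : Dmin(dmin), Dmax(dmax), xd(dmin), vd(0.0), xd_p(dmin) {
        for (int d = dmin; d <= dmax; ++d) {
            xx.emplace_back(d, 0.0);
            vx.emplace_back(d, 0.0);
            xy.emplace_back(d, 0.0);
        }
    }
    std::vector<double>& position(int d) { return xx[d - Dmin]; }
};
```

For the range {2, 10} used in Fig. 4.3 below, such a particle carries nine sets of positional components, one per candidate dimension.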
$xx_{a,j}^{xd_a(t)}(t)$ : jth component of the position vector of particle a, in dimension $xd_a(t)$
$vx_{a,j}^{xd_a(t)}(t)$ : jth component of the velocity vector of particle a, in dimension $xd_a(t)$
$xy_{a,j}^{xd_a(t)}(t)$ : jth component of the personal best (pbest) position vector of particle a, in dimension $xd_a(t)$
gbest(d) : global best particle index in dimension d
$\hat{xy}_j^{d}(t)$ : jth component of the global best position vector of the swarm, in dimension d
$xd_a(t)$ : dimension of particle a
$vd_a(t)$ : dimensional velocity of particle a
$\widetilde{xd}_a(t)$ : personal best dimension of particle a
dbest : global best dimension ever achieved
Note that a simple two-letter naming convention is applied to these parameters. Each parameter name has two letters (e.g., xx, vx, etc.). The first letter is either "x" or "v", representing either the positional member or the velocity member of a particle; since there are two interleaved PSO processes involved, the first letter then refers to position in the former process and to dimension in the latter. The second letter indicates the type of the PSO process or the type of the parameter (e.g., current or personal best position). Thus, all parameters in the dimensional PSO have "d" or "d̃" as their second letter; similarly, all parameters in the positional PSO have "x" or "y" as their second letter. There are only two exceptions to this naming convention, gbest(d) and dbest, as we intend to keep the native PSO name for the gbest particle, along with the fact that there are now distinct gbest particles for each dimension, thus yielding gbest(d). The same analogy also applies to the parameter dbest, the global best dimension.
Let f denote the dimensional fitness function that is to be optimized within a certain dimension range, $\{D_{\min}, D_{\max}\}$. Without loss of generality, assume that the objective is to find the minimum (position) of f at the optimum dimension within a multi-dimensional search space. Assume that particle a visits (back) the same dimension after T iterations (i.e., $xd_a(t) = xd_a(t+T)$); then the personal best position can be updated at iteration t + T as follows:
$$xy_{a,j}^{xd_a(t+T)}(t+T) = \begin{cases} xy_{a,j}^{xd_a(t)}(t) & \text{if } f\big(xx_a^{xd_a(t+T)}(t+T)\big) > f\big(xy_a^{xd_a(t)}(t)\big) \\ xx_{a,j}^{xd_a(t+T)}(t+T) & \text{else} \end{cases}, \quad j = 1, 2, \ldots, xd_a(t) \qquad (4.4)$$
Furthermore, the personal best dimension of particle a can be updated at iteration t + 1 as follows:

$$\widetilde{xd}_a(t+1) = \begin{cases} \widetilde{xd}_a(t) & \text{if } f\big(xx_a^{xd_a(t+1)}(t+1)\big) > f\big(xy_a^{\widetilde{xd}_a(t)}(t)\big) \\ xd_a(t+1) & \text{else} \end{cases} \qquad (4.5)$$
Note that both Eqs. (4.4) and (4.5) are analogous to Eq. (3.1), which updates the personal best position/dimension in the basic (global) PSO whenever a better current position/dimension is reached.
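For a minimization problem, the personal best updates of Eqs. (4.4) and (4.5) reduce to simple comparisons; the following is a minimal sketch with illustrative names, not the authors' code:

```cpp
#include <cassert>
#include <vector>

// Minimal sketch of the personal best updates in Eqs. (4.4) and (4.5)
// for a minimization problem; `fx` stands in for f evaluated at the
// particle's current position in its current dimension.
struct PBestState {
    std::vector<double> xy;  // pbest position in the current dimension
    double best_score;       // f(xy) in the current dimension
    int xd_p;                // personal best dimension (xd~ in the text)
    double best_dim_score;   // best fitness achieved in xd_p
};

// Eq. (4.4): on (re)visiting a dimension, keep the stored pbest unless
// the current position scores better.
void update_pbest_position(PBestState& s, const std::vector<double>& xx,
                           double fx) {
    if (fx < s.best_score) {  // "else" branch of Eq. (4.4)
        s.xy = xx;
        s.best_score = fx;
    }
}

// Eq. (4.5): the personal best dimension moves to the current dimension
// only when the current fitness beats the best score seen in xd_p.
void update_pbest_dimension(PBestState& s, int xd, double fx) {
    if (fx < s.best_dim_score) {
        s.xd_p = xd;
        s.best_dim_score = fx;
    }
}
```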
Figure 4.3 shows sample MD PSO and bPSO particles with index a. A bPSO particle at a (fixed) dimension, N = 5, contains only positional components, whereas the MD PSO particle contains both positional and dimensional components. In the figure, the dimension range for the MD PSO is between 2 and 10; therefore, the particle contains nine sets of positional components. In this example, as indicated by the arrows, the current dimension in which particle a resides is 2 ($xd_a(t) = 2$) whereas its personal best dimension is 3 ($\widetilde{xd}_a(t) = 3$). Therefore, at time t a positional PSO update is first performed over the positional elements, $xx_a^2(t)$, and then the particle may move to another dimension according to the dimensional PSO. Recall that each positional element, $xx_{a,j}^2(t)$, $j \in \{0, 1\}$, represents a potential solution in the data space of the problem.

Fig. 4.3 Sample MD PSO (right) vs. bPSO (left) particle structures. For MD PSO, $\{D_{\min} = 2, D_{\max} = 10\}$ and at time t, $xd_a(t) = 2$ and $\widetilde{xd}_a(t) = 3$
Recall that gbest(d) is the index of the global best particle in dimension d, and let S(d) be the total number of particles in dimension d; then

$$\hat{xy}^{dbest}(t) = xy_{gbest(dbest)}^{dbest}(t) = \arg\min_{\forall i \in [1, S]} f\big(xy_i^{dbest}(t)\big)$$

For a particular iteration t, and for a particle $a \in \{1, S\}$, first the positional components are updated in its current dimension, $xd_a(t)$, and then the dimensional update is performed to determine its next, (t+1)st, dimension, $xd_a(t+1)$. The positional update is performed for each dimensional component, $j \in \{1, xd_a(t)\}$, as follows:
$$\begin{aligned} vx_{a,j}^{xd_a(t)}(t+1) &= w(t)\, vx_{a,j}^{xd_a(t)}(t) + c_1 r_{1,j}(t)\big(xy_{a,j}^{xd_a(t)}(t) - xx_{a,j}^{xd_a(t)}(t)\big) + c_2 r_{2,j}(t)\big(\hat{xy}_j^{xd_a(t)}(t) - xx_{a,j}^{xd_a(t)}(t)\big) \\ xx_{a,j}^{xd_a(t)}(t+1) &= xx_{a,j}^{xd_a(t)}(t) + C_{vx}\big[vx_{a,j}^{xd_a(t)}(t+1); \{V_{\min}, V_{\max}\}\big] \\ xx_{a,j}^{xd_a(t)}(t+1) &\leftarrow C_{xx}\big[xx_{a,j}^{xd_a(t)}(t+1); \{X_{\min}, X_{\max}\}\big] \end{aligned} \qquad (4.6)$$
where $C_{xx}[\cdot,\cdot]$ and $C_{vx}[\cdot,\cdot]$ are the clamping operators applied over each positional component, $xx_{a,j}^d$ and $vx_{a,j}^d$, respectively. $C_{xx}[\cdot,\cdot]$ may or may not be applied, depending on the optimization problem, but $C_{vx}[\cdot,\cdot]$ is needed to avoid velocity explosion. Each operator can be applied in two different ways,
$$C_{xx}\big[xx_{a,j}^{d}(t); \{X_{\min}, X_{\max}\}\big] = \begin{cases} xx_{a,j}^{d}(t) & \text{if } X_{\min} \le xx_{a,j}^{d}(t) \le X_{\max} \\ X_{\min} & \text{if } xx_{a,j}^{d}(t) < X_{\min} \\ X_{\max} & \text{if } xx_{a,j}^{d}(t) > X_{\max} \end{cases} \quad (a)$$

$$C_{xx}\big[xx_{a,j}^{d}(t); \{X_{\min}, X_{\max}\}\big] = \begin{cases} xx_{a,j}^{d}(t) & \text{if } X_{\min} \le xx_{a,j}^{d}(t) \le X_{\max} \\ U(X_{\min}, X_{\max}) & \text{else} \end{cases} \quad (b) \qquad (4.7)$$
where option (a) is a simple thresholding to the range limits and option (b) randomly reinitializes the positional component in the jth dimension (j < d).
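The two modes of Eq. (4.7) can be sketched as follows (illustrative names; $U(X_{\min}, X_{\max})$ is taken to be a uniform random draw within the range):

```cpp
#include <algorithm>
#include <cassert>
#include <random>

// Mode (a) of Eq. (4.7): simple thresholding to the range limits.
double cxx_threshold(double x, double Xmin, double Xmax) {
    return std::min(std::max(x, Xmin), Xmax);
}

// Mode (b) of Eq. (4.7): re-initialize the component uniformly at random
// within [Xmin, Xmax] whenever it escapes the range.
double cxx_reinit(double x, double Xmin, double Xmax, std::mt19937& rng) {
    if (x >= Xmin && x <= Xmax) return x;
    std::uniform_real_distribution<double> U(Xmin, Xmax);
    return U(rng);
}
```

Mode (a) preserves the direction of the move but can pile particles on the boundary; mode (b) trades that for extra exploration inside the range.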
Note that the particle's new position, $xx_a^{xd_a(t)}(t+1)$, will still be in the same dimension, $xd_a(t)$; however, the particle may jump to another dimension afterwards with the following dimensional update equations:

$$\begin{aligned} vd_a(t+1) &= vd_a(t) + c_1 r_1(t)\big(\widetilde{xd}_a(t) - xd_a(t)\big) + c_2 r_2(t)\big(dbest - xd_a(t)\big) \\ xd_a(t+1) &= \big\lfloor xd_a(t) + C_{vd}\big[vd_a(t+1); \{VD_{\min}, VD_{\max}\}\big] \big\rfloor \\ xd_a(t+1) &\leftarrow C_{xd}\big[xd_a(t+1); \{D_{\min}, D_{\max}\}\big] \end{aligned} \qquad (4.8)$$
where $\lfloor \cdot \rfloor$ is the floor operator, and $C_{xd}[\cdot,\cdot]$ and $C_{vd}[\cdot,\cdot]$ are the clamping operators applied over the dimensional components, $xd_a(t)$ and $vd_a(t)$, respectively. Though we employed the inertia weight for the positional velocity update in Eq. (4.6), we have witnessed no benefit of using it for the dimensional PSO, and hence we left it out of Eq. (4.8) for the sake of simplicity. Note that both velocity update Eqs. (4.6) and (4.8) are similar to those of the basic PSO given in Eq. (3.2). $C_{vd}[\cdot,\cdot]$ is similar to $C_{vx}[\cdot,\cdot]$, which is used to avoid explosion. This is accomplished by basic thresholding, expressed as follows:
$$C_{vd}\big[vd_a(t); \{VD_{\min}, VD_{\max}\}\big] = \begin{cases} vd_a(t) & \text{if } VD_{\min} \le vd_a(t) \le VD_{\max} \\ VD_{\min} & \text{if } vd_a(t) < VD_{\min} \\ VD_{\max} & \text{if } vd_a(t) > VD_{\max} \end{cases} \qquad (4.9)$$
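The dimensional move of Eqs. (4.8) and (4.9) can be sketched as a 1D PSO step over the integers; the function and parameter names are illustrative, not those of the authors' library:

```cpp
#include <cassert>
#include <cmath>

// Eq. (4.9): basic thresholding of the dimensional velocity (C_vd).
double cvd(double vd, double VDmin, double VDmax) {
    if (vd < VDmin) return VDmin;
    if (vd > VDmax) return VDmax;
    return vd;
}

// Eq. (4.8): dimensional velocity update (note: no inertia weight here)
// followed by the floored positional step over integer dimensions.
// Range clamping/buffering (C_xd) would be applied to the result.
int dimensional_move(int xd, double& vd, int pbest_dim, int dbest,
                     double c1, double r1, double c2, double r2,
                     double VDmin, double VDmax) {
    vd = vd + c1 * r1 * (pbest_dim - xd) + c2 * r2 * (dbest - xd);
    return (int)std::floor(xd + cvd(vd, VDmin, VDmax));
}
```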
$C_{xd}[\cdot,\cdot]$, on the other hand, is a mandatory clamping operator, which keeps the dimensional jumps within the dimension range of the problem, $\{D_{\min}, D_{\max}\}$. Furthermore, within $C_{xd}[\cdot,\cdot]$, an optional in-flow buffering mechanism can also be implemented. This can be a desirable property, as it avoids an excess number of particles in a certain dimension. In particular, dbest and the dimensions in its close vicinity exert a natural attraction, and without such a buffering mechanism the majority of swarm particles may be hosted within this local neighborhood, so that other dimensions might encounter a severe depletion. To prevent this, the buffering mechanism controls the in-flow of particles (via the dimensional velocity updates) to a particular dimension. In some early bPSO implementations over problems with low (and fixed) dimensions, 15–20 particles were usually sufficient for a successful operation. However, in high dimensions this may no longer hold, since more particles are usually needed as the dimension increases. Therefore, we empirically set the number of particles to be proportional to the solution space dimension and not less than 15. At time t, let $P_d(t)$ be the number of particles in dimension d. $C_{xd}[\cdot,\cdot]$ can then be expressed with the (optional) buffering mechanism as follows:
$$C_{xd}\big[xd_a(t); \{D_{\min}, D_{\max}\}\big] = \begin{cases} xd_a(t-1) & \text{if } P_{xd_a(t)}(t) \ge \max\big(15, xd_a(t)\big) \\ xd_a(t-1) & \text{if } xd_a(t) < D_{\min} \\ xd_a(t-1) & \text{if } xd_a(t) > D_{\max} \\ xd_a(t) & \text{else} \end{cases} \qquad (4.10)$$
In short, the clamping and buffering operator, $C_{xd}[\cdot,\cdot]$, allows a dimensional jump only if the target dimension is within the dimensional range and there is room for a newcomer.
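A sketch of Eq. (4.10) follows; the buffering condition (reject a jump when $P_d(t) \ge \max(15, d)$) is our reading of the empirical not-less-than-15 rule stated above, and the particle count is passed in directly for illustration:

```cpp
#include <algorithm>
#include <cassert>

// Sketch of the dimensional clamping/buffering operator of Eq. (4.10).
// The jump to `target_dim` is rejected -- the particle stays at its
// previous dimension -- if the target is out of range or already "full".
int cxd_with_buffer(int target_dim, int prev_dim,
                    int Dmin, int Dmax, int particles_in_target) {
    if (target_dim < Dmin || target_dim > Dmax)
        return prev_dim;                        // out of dimensional range
    if (particles_in_target >= std::max(15, target_dim))
        return prev_dim;                        // no room for a newcomer
    return target_dim;                          // jump accepted
}
```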
Accordingly, the general pseudo-code of the MD PSO technique is given in
Table 4.1.
It is easy to see that the random initialization of the swarm particles performed in step 1 is similar to the initialization of the bPSO (between the same steps) given in Table 3.1. The dimensional PSO is initialized in the same way; however, there is a difference in the initialization of the positional PSO: particle positions are randomized for all solution space dimensions ($\forall d \in [D_{\min}, D_{\max}]$) instead of a single dimension. After the initialization phase, step 3 first evaluates each particle a, residing in its current dimension, $xd_a(t)$: (1) to validate its personal best position in that dimension (step 3.1.1.1), (2) to validate (and update if improved) the gbest particle in that dimension (step 3.1.1.2), (3) to validate (and update if improved) its personal best dimension (step 3.1.1.3), and finally, (4) to validate (and update if improved) the global best dimension, dbest (step 3.1.1.4). Step 3.1 in fact compares the current position and dimension of each particle with its personal best values, which are updated if any improvement is observed. The (new) personal best position/dimension updates will then lead to the computation of the (new) global best elements such as dbest and gbest(d) ($\forall d \in [D_{\min}, D_{\max}]$).
At any time t, the optimum solution will be $\hat{xy}^{dbest}$ at the optimum dimension, dbest, achieved by the particle gbest(dbest), and the best (fitness) score achieved will naturally be $f(\hat{xy}^{dbest})$. This (best) fitness score so far achieved can then be used to determine whether the termination criterion is met in step 3.3. If not, step 3.4 performs first the positional PSO (step 3.4.1) and then the dimensional PSO (step 3.4.3) to perform positional and dimensional updates for each particle, respectively. Once each particle moves to a new position and (jumps to a new) dimension, its personal best updates will be performed in the next iteration, and so on, until the termination criterion is met.
The test bed application, PSO_MDlib, is initially designed for regular MD PSO operations for the sole purpose of multi-dimensional nonlinear function minimization. Recall that in the previous chapter, we fixed the dimension of any function as _dMin = _dMax − 1. Recall further that both _dMin and _dMax correspond to the dimensional range $\{D_{\min}, D_{\max}\}$, and thus any logical range values can be assigned to them within the PSOparam structure (default: {2, 101}). The target dimension can be set at the beginning of the main() function, i.e.,

int tar_dim[3] = {20, 50, 80};
…
pDim = &tar_dim[2]; // The target dimension.
which points to the third entry of the tar_dim array (*pDim = 80). The target dimension stored in the pointer pDim will then be used as the dimensional bias in all nonlinear functions implemented in MYFUN.cpp, which makes them biased in all dimensions except the target dimension (i.e., *pDim = 80). In other words, each nonlinear function has a unique global minimum (i.e., 0 for all functions) only in the target dimension. MD PSO can then be tested accordingly, to find out whether it can converge to the global optimum that resides in the target (true) dimension. Once the MD PSO object has been initialized with a proper dimensional range and dimensional PSO parameters (i.e., _psoDef._dMin, _psoDef._dMax, _psoDef._vdMin, _psoDef._vdMax), the rest of the code is identical for both operations. After all, recall that MD PSO is just the multi-dimensional extension of the basic PSO.
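As an illustration of such a dimensional bias (this is not the actual MYFUN.cpp code, and the bias term is hypothetical), a sphere function can be biased so that its global minimum of 0 survives only in the target dimension:

```cpp
#include <cassert>
#include <cmath>
#include <cstdlib>
#include <vector>

// Illustrative sketch of a dimensionally biased sphere: the function
// keeps its minimum of 0 only at the target dimension, while every other
// dimension receives a positive bias, so a correct MD PSO run must
// converge to dbest == target_dim.
double biased_sphere(const std::vector<double>& x, int d, int target_dim) {
    double sum = 0.0;
    for (int i = 0; i < d; ++i)
        sum += x[i] * x[i];
    // Hypothetical bias: zero at the target dimension, positive elsewhere,
    // growing with the distance from the target dimension.
    double bias = (d == target_dim) ? 0.0 : 1.0 + std::abs(d - target_dim);
    return sum + bias;
}
```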
As initially explained, the main MD PSO process loop, i.e., step 3 of the MD PSO pseudo-code given in Table 4.1, is performed within Table 3.7 given in Sect. 3.5. In that section we explained it in parallel with the basic PSO run; more details can now be given with respect to the parameters of the MD PSO swarm particles. First of all, note that there is an individual gbest particle for each dimension, and whenever particle a resides in the current dimension, cur_dim_a, it can become the new gbest particle in that dimension if its current position surpasses gbest's. Note that cur_dim_a can now be any integer between m_xdMin and m_xdMax-1. Another MD PSO parameter is dbest (m_dbest in the code), which can be updated only if the particle becomes the new gbest in cur_dim_a and achieves a fitness score even better than the best fitness score achieved in the dimension dbest.
The termination criteria (by either IterNo or $\varepsilon_C$) were explained in Sect. 3.5 with the code given in Table 3.9. The best performance achieved by the MD PSO swarm can be gathered from the personal best position of the gbest particle in the dbest dimension, i.e., m_gbest[m_dbest - m_xdMin], and the call in the if() statement, which compares it with m_eCutOff, i.e., m_pPA[m_gbest[m_dbest - m_xdMin]]->GetPBScore(m_dbest).
If neither of the termination criteria is reached, then step 3.4 is executed to
update the position and the dimension of each swarm particle as given in
Table 4.2. As shown in Table 3.10 in Sect. 3.5, first in the positional update, each
particle’s cognitive (cogn) and social (social) components are computed and the
velocity update is composed over the previous velocity weighted with the inertia factor. The velocity is clamped (vx->CheckLimit()) and finally, the particle's current position is updated to a new position with the composed and clamped velocity term. As the last step, it is also verified whether this new position falls within the problem boundaries specified in advance (xmax, by xx->CheckLimit()), all of which is performed within step 3.4.1 of the MD PSO pseudo-code.

Table 4.2 Implementation of Step 3.4 of the MD PSO pseudo-code given in Table 4.1

The dimensional PSO within step 3.4.3 of the MD PSO pseudo-code is then performed; this is simply a 1D PSO over the integers, which finds the optimal dimension using the same PSO mechanism. Within the loop, each particle's cognitive (dbestx-xdim) and social (m_dbest-xdim) components are computed and the velocity update is composed over the previous dimensional velocity component, vdim. The velocity is clamped and the current dimension of the particle is updated. As the last step, it is also verified whether this new dimension falls within the problem boundaries specified in advance. Recall
that $C_{xd}[\cdot,\cdot]$ and $C_{vd}[\cdot,\cdot]$ are the clamping operators applied over the dimensional components, $xd_a(t)$ and $vd_a(t)$, respectively. Note that in the code both clamping operators are simply implemented via four member variables of the class: m_xdMin, m_xdMax and m_vdMin, m_vdMax.
Fig. 4.4 GUI of PSOTestApp with several MD PSO applications (top) and the MD PSO Parameters dialog, activated when the "Run" button is pressed (bottom)
overview and data structures for the PSOTestApp and PSOCluster projects, and then focus on the first application (1. 2D Clustering over binary (B&W) images). The rest of the clustering applications (except the eighth application, Feature Synthesis) will be covered in Chaps. 5 and 6. The eighth application, Evolutionary Feature Synthesis, will be covered in Chap. 10.
The entry point of a Windows Dialog workspace created by Visual Studio 6.0 is [nameofApp]Dlg.cpp; for PSOTestApp it is, therefore, PSOtestAppDlg.cpp, where all Windows dialog-based control and callback functions are implemented. In this source file, a separate thread handles all the user's actions via the callback functions of the class CPSOtestAppDlg, such as:
When the "Run" button is pressed, the MD PSO application (selected in the combo list) is initiated within the OnDopso() function and the request is routed to the proper interface class implemented within the PSOCluster library (DLL). For this, in the header file PSOtestAppDlg.h, the class CPSOtestAppDlg contains six interface classes, each of which has a (one and only) member object to accomplish the initiated task. These member objects are as follows:
• CPSOcluster m_PSOt; //The generic 2-D PSO clustering obj..
• CPSOclusterND m_PSOn; //The generic N-D PSO clustering obj..
• CPSO_RBFnet m_PSOr; //The generic PSO RBF Net. training obj..
• CPSOcolorQ m_PSOc; //The generic PSO colorQ obj..
• CPSOclusterGMM m_PSOgmmt; //The GMM PSO clustering obj..
• CPSOFeatureSynthesis m_PSOfs; //The Feat. Synth. obj..
Table 4.3 The callback function OnDopso(), activated when the "Run" button is pressed on the PSOtestApp GUI
The Stop() function can abruptly stop an ongoing MD PSO application at any time. The ShowResults() function shows the 2D clustering results in a separate dialog, which is created and controlled within the CPSOtestAppDlg class. Finally, as presented in Table 4.5, the API function ApplyPSO() is called within the callback function CPSOtestAppDlg::OnDopso() and creates a dedicated thread in which the MD PSO application (2D clustering) is executed. This thread function is called PSOThread() and is created as:

CMThread<CPSOcluster>(PSOThread, this, THREAD_PRIORITY_NORMAL).Begin();
There are three sets of parameters for this API function: the list of input filenames (pFL), the MD PSO parameters (psoParam), and the SPSA parameters (saParam), which will be explained in the next chapter. The input files for the 2D clustering application are black-and-white images similar to the ones shown in Figs. 3.7 and 4.1, in which the 2D data points are represented by white pixels. As the MD PSO process terminates, the resultant output images contain the clusters extracted from the image, each of which is rendered with a three-color representation; typical examples are shown in Figs. 3.8 and 3.9. The structure psoParam contains all MD PSO parameters, which can be edited by the user via the MD PSO Parameters dialog. Both psoParam and saParam are copied into the member structures (m_psoParam and m_saParam) and, as a result, the MD PSO process can be executed by a separate thread within the function PSOThread().
References
1. A. Abraham, S. Das, S. Roy, Swarm intelligence algorithms for data clustering. In Soft Computing for Knowledge Discovery and Data Mining, Part IV, pp. 279–313, Oct 25, 2007
2. M.G. Omran, A. Salman, A.P. Engelbrecht, Dynamic clustering using particle swarm optimization with application in image segmentation. Pattern Anal. Appl. 8, 332–344 (2006)
3. G.-J. Qi, X.-S. Hua, Y. Rui, J. Tang, H.-J. Zhang, Image classification with kernelized spatial-context. IEEE Trans. Multimedia 12(4), 278–287 (2010). doi:10.1109/TMM.2010.2046270
4. M.G. Omran, A. Salman, A.P. Engelbrecht, Particle Swarm Optimization for Pattern Recognition and Image Processing (Springer, Berlin, 2006)
Chapter 5
Improving Global Convergence
Like they say, you can learn more from a guide in one day than
you can in three months fishing alone.
Mario Lopez
S. Kiranyaz et al., Multidimensional Particle Swarm Optimization for Machine Learning 101
and Pattern Recognition, Adaptation, Learning, and Optimization 15,
DOI: 10.1007/978-3-642-37846-1_5, Ó Springer-Verlag Berlin Heidelberg 2014
generic technique, which should be specifically adapted to the problem at hand (we shall return to this issue later). In order to address this drawback efficiently, we shall further present two generic approaches, one of which moves gbest efficiently or, simply put, "guides" it with respect to the function (or error surface) it resides on. The idea behind this is quite simple: since the velocity update equation of gbest is quite poor, we shall replace it with a simple yet powerful stochastic search technique to guide gbest instead. We shall henceforth show that, due to the stochastic nature of the search technique, the likelihood of getting trapped in a local optimum can be significantly decreased.
(Figure: a 2D illustration of FGBF. The aGB particle, aGB: (x3, y8), is fractionally composed from the best dimensional components of the swarm, Δxbest taken from particle 3 at (x3, y3) and Δybest taken from particle 8 at (x8, y8), and thereby approaches the target (xT, yT) more closely than gbest at (x1, y1).)
position, $y_{a,j}(t)$, $j \in [1, N]$. The aGB particle, obtained through the FGBF process, is fractionally formed from the components of some swarm particles, and therefore it does not use any velocity term. Consequently, $y_{aGB}(t)$ is set to the better of $x_{aGB}(t)$ and $y_{aGB}(t-1)$. As a result, the FGBF process creates one aGB particle providing a potential GB solution, $y_{aGB}(t)$. Let $f(a, j)$ be the dimensional fitness score of the jth dimensional component of the position vector of particle a. Suppose that all dimensional fitness scores ($f(a, j)$, $\forall a \in [1, S]$, $\forall j \in [1, N]$) can be computed in step 3.1; the FGBF pseudo-code as given in Table 5.1 can then be plugged in between steps 3.3 and 3.4 of bPSO's pseudo-code.
Step 2 in the FGBF pseudo-code, along with the computation of f(a, j), depends entirely on the optimization problem. It keeps track of the partial fitness contribution of each individual component of each particle's position. For problems without constraints (e.g., nonlinear function minimization), the best dimensional components can simply be selected, whereas in others (e.g., clustering), some promising components that satisfy the problem constraints or certain required criteria are first selected and grouped, and the most suitable one in each group is then used for FGBF. Here, the internal nature of the problem determines the "suitability" of the selection.
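For the unconstrained, separable case, the selection step above can be sketched in a few lines. This is a minimal illustration in Python (the book's accompanying software is in C/C++); the function names and the NumPy representation are ours, and the separable Sphere function stands in for the dimensional fitness f(a, j):

```python
import numpy as np

def fgbf_step(positions, dim_fitness, fitness):
    """Form the artificial global best (aGB) particle by picking, for each
    dimension j, the component x_{a,j} of the particle a with the best
    (lowest) dimensional fitness score f(a, j)."""
    S, N = positions.shape
    scores = dim_fitness(positions)          # shape (S, N): f(a, j) for all a, j
    winners = scores.argmin(axis=0)          # best particle index per dimension
    agb = positions[winners, np.arange(N)]   # fractionally formed aGB position
    return agb, fitness(agb)

# Example on the separable Sphere function, f(x) = sum_j x_j^2,
# where the dimensional fitness is simply f(a, j) = x_{a,j}^2.
rng = np.random.default_rng(7)
swarm = rng.uniform(-10, 10, size=(8, 5))
agb, score = fgbf_step(swarm,
                       dim_fitness=lambda x: x**2,
                       fitness=lambda x: float((x**2).sum()))
```

On a separable function, the aGB particle formed this way is by construction at least as fit as every individual particle, which is why it provides a strong GB candidate at each iteration.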
The previous section introduced the principles of FGBF when applied in a bPSO
process on a single dimension. In this section, we present its generalized form with
the proposed MD PSO where there is one gbest particle per (potential) dimension
of the solution space. For this purpose, recall that at a particular iteration, t, each MD PSO particle, a, has the following components: position (xx^{xd_a(t)}_{a,j}(t)), velocity (vx^{xd_a(t)}_{a,j}(t)), and personal best position (xy^{xd_a(t)}_{a,j}(t)) for each potential dimension in the solution space (i.e., xd_a(t) ∈ [D_min, D_max] and j ∈ [1, xd_a(t)]), and their respective counterparts in the dimensional PSO process (i.e., xd_a(t), vd_a(t), and the personal best dimension xd̃_a(t)). The aGB particle does not need dimensional components; a single positional component with the maximum dimension D_max is created to cover all dimensions in the range, ∀d ∈ [D_min, D_max], and, as explained earlier, there is no need for the velocity term either, since the aGB particle is fractionally (re-)formed from the dimensions of some swarm particles at each iteration.
Furthermore, the aforementioned competitive selection ensures that xy^d_aGB(t), ∀d ∈ [D_min, D_max], is set to the better of xx^d_aGB(t) and xy^d_aGB(t − 1). As a result, the FGBF process creates one aGB particle providing (potential) GB solutions (xy^d_aGB(t)) for all dimensions in the given range (i.e., ∀d ∈ [D_min, D_max]). Let f(a, j) be the dimensional fitness score of the jth component of particle a, which has the current dimension xd_a(t), with j ∈ [1, xd_a(t)]. At a particular time t, all dimensional fitness scores (f(a, j), ∀a ∈ [1, S]) can be computed in step 3.1, and the FGBF pseudo-code for MD PSO, given in Table 5.2, can then be plugged in
between steps 3.2 and 3.3 of the MD PSO’s pseudo-code. Next, we will present the
application of MD PSO with FGBF to nonlinear function minimization and other
applications will be presented in detail in the following chapters.
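As a sketch of how a single aGB position of dimension D_max can be formed in MD PSO when particles reside in different dimensions, the following assumes an unconstrained separable problem; all names are illustrative rather than taken from the book's software:

```python
import numpy as np

def fgbf_md(positions, dims, dim_fitness, D_max):
    """For each coordinate j of the aGB particle (dimension D_max), take the
    component x_{a,j} with the best (lowest) dimensional fitness f(a, j)
    among the particles whose current dimension xd_a(t) covers j."""
    agb = np.empty(D_max)
    for j in range(D_max):
        candidates = [a for a in range(len(positions)) if dims[a] > j]
        best = min(candidates, key=lambda a: dim_fitness(positions[a][j]))
        agb[j] = positions[best][j]
    return agb

# Two particles currently in dimensions 2 and 3; Sphere-style f(a, j) = x_j^2.
positions = [np.array([3.0, -1.0]), np.array([0.5, 2.0, -0.2])]
agb = fgbf_md(positions, dims=[2, 3], dim_fitness=lambda x: x * x, D_max=3)
```

The aGB solution at any dimension d ∈ [D_min, D_max] is then simply the first d coordinates of this position.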
test the performance of MD PSO. The functions given in Table 5.3 provide a good
mixture of complexity and modality and have been widely studied by several
researchers, e.g., see [10, 16, 22, 29–31]. The dimensional bias term, W(d), has the form W(d) = K·|d − d0|^a, where the constants K and a are properly set with respect to the dynamic range of the function to be minimized. Note that the variable d0, D_min ≤ d0 ≤ D_max, is the target dimension in which the global minimum resides, and hence all functions have the global minimum F_n(x, d0) = 0 when d = d0. Sphere, De Jong, and Rosenbrock are unimodal functions and the rest are multimodal, meaning that they have many deceiving local minima. On a macroscopic level, Griewank demonstrates certain similarities with unimodal functions, especially when the dimensionality is above 20; in low dimensions, however, it bears significant noise, which creates many local minima due to the second, multiplicative term with cosine components. Yet with the addition of the dimensional bias term, W(d), even unimodal functions eventually become multimodal, since they now have a local minimum at every dimension (which is their global minimum at that dimension without W(d)) but only one global minimum, at dimension d0.
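As an illustration of how the bias term turns a unimodal function into a multidimensional benchmark, the biased Sphere can be sketched as follows; the constants K, a, and d0 here are arbitrary, chosen only to make the behavior visible:

```python
import numpy as np

# Dimensionally biased benchmark F_n(x, d) = f_n(x) + W(d),
# with W(d) = K * |d - d0|**a. Constants below are illustrative only.
K, a, d0 = 2.0, 1.0, 20

def biased_sphere(x):
    d = len(x)
    return float(np.sum(np.asarray(x) ** 2)) + K * abs(d - d0) ** a

# At any dimension d, x = 0 is a local minimum with residual bias W(d);
# only at d = d0 does the global minimum reach zero.
```

Evaluating `biased_sphere` at the origin in dimensions 19, 20, and 21 gives W(19), 0, and W(21), respectively, reproducing the "one local minimum per dimension" structure described above.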
Recall from the earlier remarks that an MD PSO particle a represents a potential solution at a certain dimension, and therefore the jth component of a d-dimensional point (x_j, j ∈ [1, d]) is stored in its positional component, xx^d_{a,j}(t), at time t. Step 3.1 in MD PSO's pseudo-code computes the (dimensional) fitness score, f(a, j), of the jth component (x_j), and at step 2 in the FGBF process the index of the particle whose x_j yields the minimum f(a, j) is stored in the array a[j]. Except for the nonseparable functions, Rosenbrock and Griewank, the assignment of f(a, j) for particle a is straightforward, simply using the term with the jth component of the summation (e.g., f(a, j) = x_j² for Sphere, f(a, j) = −x_j·sin(√|x_j|) for Schwefel, etc.). For Rosenbrock, we can set f(a, j) = (x_{j+1} − x_j²)² + (x_j − 1)², since the aGB particle, which is fractionally formed by those x_j minimizing the jth summation term, eventually minimizes the function. Finally, for Griewank one can approximate f(a, j) ≈ x_j² for particle a; the FGBF operation then finds and uses those x_j that come into a close vicinity of the global minimum at dimension j on a macroscopic scale, so that the native PSO process has a higher chance of avoiding the noise-like local optima and thus eventually converges to the global optimum.
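The per-component scores just described can be written down directly; this is a hedged sketch with names of our choosing, not code from the book's software:

```python
import numpy as np

def f_sphere(x, j):
    """Dimensional fitness for Sphere: the jth summation term x_j^2."""
    return x[j] ** 2

def f_schwefel(x, j):
    """Dimensional fitness for Schwefel: -x_j * sin(sqrt(|x_j|))."""
    return -x[j] * np.sin(np.sqrt(abs(x[j])))

def f_rosenbrock(x, j):
    """jth Rosenbrock summation term; valid for j in [0, len(x) - 2]."""
    return (x[j + 1] - x[j] ** 2) ** 2 + (x[j] - 1) ** 2

x_star = np.array([1.0, 1.0, 1.0])  # Rosenbrock's global minimum (1, ..., 1)
```

Every Rosenbrock term vanishes at x = (1, ..., 1), which is why components minimizing the individual terms also minimize the whole function.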
We use as termination criteria the combination of the maximum number of iterations allowed (iterNo = 5,000) and the cut-off error (ε_C = 10^−4). Table 5.3 also presents both the positional, x_max, and dimensional, [D_min, D_max], range values, whereas the other parameters are empirically set as V_max = x_max/2 and VD_max = 18. Unless stated otherwise, these range values are used in all experiments presented in this section. The first set of experiments was performed for a comparative evaluation of the standalone MD PSO versus bPSO over both uni- and multimodal functions. Figure 5.2 presents typical plots where both techniques are applied over the unimodal De Jong function using the swarm size S = 160. The red curves of both plots in Fig. 5.2, and of all the remaining figures in this section, represent the behavior of the GB particle (whether it is a new gbest or the aGB particle created by FGBF), and the corresponding blue curves represent the behavior of the gbest particle when the termination criteria are met (e.g., gbest = 74 for bPSO and f[y_74(158)] = 9.21 × 10^−5 < ε_C). Naturally, the true dimension (d0 = 20) is set in advance for the bPSO process, and it converges to the global optimum within 158 iterations, as shown in the right plot, whereas MD PSO takes 700 iterations to finally place the GB particle at the target dimension (d0 = 20) and then only 80 more iterations to satisfy the termination criteria. Recall that its objective is to find the global minimum of the function at the true dimension. Overall, the standalone MD PSO is slower than bPSO, but over an extensive set of experiments its convergence behavior toward the global optimum is found to be similar to that of bPSO. For instance, their performance is degraded in
higher dimensions, e.g., for the same function but at d0 = 50, both require, on average, five times more iterations to find the global minimum.

Fig. 5.2 Fitness score (top, in log scale) and dimension (bottom) plots vs. iteration number for MD PSO (left) and bPSO (right) operations, both of which run over the De Jong function

Fig. 5.3 Fitness score (top, in log scale) and dimension (bottom) plots vs. iteration number for an MD PSO run over the Sphere function with (left) and without (right) FGBF

Fig. 5.4 Particle index plot for the MD PSO with FGBF operation shown in Fig. 5.3
A significant speed improvement can be achieved when MD PSO is performed with FGBF. A typical MD PSO run using the swarm size S = 320 over another unimodal function, Sphere, but at a higher (target) dimension, is shown in Fig. 5.3. Note that the run with FGBF (left) took only 160 iterations, whereas the standalone MD PSO (right) completed within 3,740 iterations. Note also that within a few iterations, the process with FGBF already found the true dimension, d0 = 40, and after only 10 iterations the aGB particle already came into a close vicinity of the global minimum (i.e., f(xy^40_aGB(10)) ≅ 4 × 10^−2). As shown in Fig. 5.4, the particle index plot for this operation clearly shows the time instances where aGB (with index number 320) becomes the GB particle, e.g., during the first 14 iterations and then occasionally in the rest of the process.
Besides the significant speed improvement for unimodal functions, the primary contribution of the FGBF technique becomes most visible when applied over multimodal functions, where the bPSO (and the standalone MD PSO) are generally not able to converge to the global optimum even at low dimensions. Figures 5.5 and 5.6 present two applications (standalone MD PSO vs. MD PSO with FGBF, using a swarm size of 320) over the Schwefel and Giunta functions at d0 = 20.

Fig. 5.5 Fitness score (top, in log scale) and dimension (bottom) plots vs. iteration number for an MD PSO run over the Schwefel function with (left) and without (right) FGBF

Fig. 5.6 Fitness score (top, in log scale) and dimension (bottom) plots versus iteration number for an MD PSO run over the Giunta function with (left) and without (right) FGBF

Note that
when FGBF is used, MD PSO can directly have the aGB particle in the target dimension (d0 = 20) at the beginning of the operation. Furthermore, the PSO process benefits from having an aGB particle that is indeed in a close vicinity of the global minimum. This eventually helps the swarm to move in the right direction thereafter. Without this mechanism, both standalone PSO applications are eventually trapped in local minima due to the highly multimodal nature of these functions. This is quite evident in the right-hand plots of both figures and, except for a few minority cases, it is also true for the other multimodal functions. In higher dimensions, standalone MD PSO applications over multimodal functions
yield even worse results, such as earlier traps in local minima, possibly at the wrong dimension. For example, in standalone MD PSO operations over Schwefel and Giunta with d0 = 80, the GB scores at t = 4,999 (f(xŷ^80)) are 8,955.39 and 1.83, respectively.

Fig. 5.7 MD PSO with FGBF operation over Griewank (top) and Rastrigin (bottom) functions with d0 = 20 (red) and d0 = 80 (blue) using the swarm size S = 80 (left) and S = 320 (right)
An observation worth mentioning here is that MD PSO with FGBF is also affected by higher dimensions, but its performance degradation usually entails a slower convergence rather than an entrapment at a local minimum. For instance, when applied over Schwefel and Giunta at d0 = 80, convergence to the global optimum is still achieved after only a slight increase in the number of iterations, i.e., 119 and 484 iterations, respectively. Moreover, Fig. 5.7 presents the fitness plots for MD PSO with FGBF using two different swarm sizes over two more multimodal functions, Griewank and Rastrigin. Similar to the earlier results, the global minimum at the true dimension is reached for both functions; however, for d0 = 20 (red curves) the process takes a few hundred iterations less than at d0 = 80 (blue curves).
The swarm size has a direct effect on the performance of MD PSO with FGBF: a larger swarm size increases the speed of convergence, which is quite evident in Fig. 5.7 between the corresponding plots on the left and right sides. This is due to the fact that with a larger swarm size, the probability of having better swarm components (closer to the global optimum) for the aGB particle increases, thus yielding a better aGB particle formation in general. Note that this is also clear in the plots on both sides, i.e., at the beginning (within the first 10–15 iterations, when aGB is usually the GB particle) the drop in the fitness score is much steeper in the right-hand plots than in those on the left.
For an overall performance evaluation, both proposed methods are tested over seven benchmark functions using three different swarm sizes (160, 320, and 640) and target dimensions (20, 50, and 80). For each setting, 100 runs are performed, and the first- and second-order statistics (mean, μ, and standard deviation, σ) of the operation time (total number of iterations) and the two components of the solution, the fitness score achieved and the resulting dimension (dbest), are presented in Table 5.4. During each run, the operation terminates when the fitness score drops below the cut-off error (ε_C = 10^−4); it is then assumed that the global minimum of the function in the target dimension is reached, hence the score is set to 0 and, obviously, dbest = d0. Therefore, for a particular function, target dimension d0, and swarm size S, obtaining μ = 0 as the average score means that the method converges to the global minimum in the target dimension at every run. On the other hand, an average iteration number of 5,000 indicates that the method cannot converge to the global minimum at all; instead, it gets trapped in a local minimum. The statistical results listed in Table 5.4 confirm the earlier observations and remarks about the effects of modality, swarm size, and dimension on the performance (both speed and accuracy). Particularly for the standalone MD PSO application, increasing the swarm size improves the speed of convergence whenever the global minimum is reached for the unimodal functions, Sphere and De Jong, whilst reducing the score significantly for the others.
The score reduction is particularly visible in higher dimensions; e.g., for d0 = 80, compare the highlighted average scores of the top five functions. Note that, especially for De Jong at d0 = 50, none of the standalone MD PSO runs with S = 160 converges to the global minimum, whilst they all converge with a higher swarm population (i.e., 320 or 640).

Both dimension and modality have a direct effect on the performance of the standalone MD PSO. For unimodal functions, its convergence speed decreases with increasing dimension; e.g., see the highlighted average values of the iteration numbers for Sphere at d0 = 20 versus d0 = 50. For multimodal functions, regardless of the dimension and the swarm size, all standalone MD PSO runs get trapped in local minima (except perhaps a few runs on Rosenbrock at d0 = 20); however, the fitness performance still depends on the dimension, that is, the final score tends to increase in higher dimensions, indicating an earlier entrapment at a local minimum. Regardless of the swarm size, this can easily be seen in all multimodal functions except Griewank and Giunta, both of which show higher modalities in lower dimensions. In particular, Griewank becomes a plain Sphere function when the dimensionality exceeds 20. This is the reason for the performance improvement (or score reduction) from d0 = 20 to d0 = 50, but note that the worst performance (highest average score) is still encountered at d0 = 80.
As the entire statistics on the right side of Table 5.4 indicate, MD PSO with FGBF finds the global minimum at the target dimension in all runs, over all functions, regardless of the dimension, swarm size, and modality, and without any exception. Moreover, the combined application of MD PSO and FGBF significantly improves the convergence speed, e.g., compare the highlighted average iteration numbers with the results of the standalone MD PSO. Dimension, modality, and swarm size might still be important factors for the speed and have the same effects as mentioned earlier, i.e., the speed degrades with modality and dimensionality, whereas it improves with increasing swarm size. Their effects, however, vary significantly among the functions; e.g., as highlighted in Table 5.4, the swarm size can enhance the speed radically for Giunta but only marginally for Griewank. The same statement can be made concerning the dimension for De Jong and Sphere.

Table 5.4 Statistical results from 100 runs over seven benchmark functions
Based on the results in Table 5.4, we can perform comparative evaluations with
some of the promising PSO variants such as [1, 30–32] where similar experiments
were conducted over some or all of these benchmark functions. They have,
however, the advantage of fixed dimension, whereas MD PSO with FGBF finds the
true dimension as part of the optimization process. Furthermore, it is rather difficult to make speed comparisons, since none of them really finds the global minimum for most functions; instead, they have demonstrated some incremental performance improvements in terms of score reduction with respect to some other competing technique(s). For example, in Angeline [32], a tournament selection mechanism is formed among particles and the method is applied over four functions (Sphere, Rosenbrock, Rastrigin, and Griewank). Although the method was performed over a reduced positional range, ±15, and at low dimensions (10, 20, and 30), varying average scores in the range {0.3, 1,194} were obtained. As a result, both better and worse performances than the bPSO were reported, depending on the function. In Esquivel and Coello Coello [30], bPSO and two PSO variants, GCPSO and mutation-extended PSO over three neighborhood topologies, were applied to some common multimodal functions, Rastrigin, Schwefel, and Griewank. Although the dimension is rather low (30), none of the topologies over any PSO variant converged to the global minimum, and average scores varying in the range of {0.0014, 4,762} were reported. In Riget and Vesterstrom [1], a diversity-guided PSO variant, ARPSO, along with two competing methods, bPSO and GA, were applied over the multimodal functions Rastrigin, Rosenbrock, and Griewank,
at three different dimensions (20, 50, and 100). The range was kept quite low for Rosenbrock and Rastrigin, ±100 and ±5.12, respectively, and for each run the number of evaluations (the product of iterations and the swarm size) was kept in the range of 400,000–2,000,000, depending on the dimension. The experimental results showed that none of the three methods converged to the global minimum, except ARPSO over (only) Rastrigin at dimension 20. Only when ARPSO runs until stagnation is reached, after 200,000 evaluations, can it find the global minimum over Rastrigin at the higher dimensions (50 and 100). In a practical sense, however, this indicates that the total number of iterations might be on the order of 10^5 or even higher. Recall that the number of iterations required for MD PSO with FGBF to converge to the global minimum is less than 400 for any dimension. ARPSO performed better than bPSO and GA over Rastrigin and Rosenbrock but worse over Griewank. The CPSO proposed in Bergh and Engelbrecht
[24] was applied over five functions of which four are common (Sphere, Rastrigin,
Rosenbrock, and Griewank). The dimension of all functions was fixed to 30 and in
this dimension, CPSO performed better than bPSO in 80 % of the experiments.
Finally, in Richards and Ventura [31], dynamic sociometries via ring and star were introduced among the swarm particles, and the performance of various combinations of swarm size and sociometry over six functions (the ones used in this section except Schwefel) was reported. Although the tests were performed over comparatively reduced positional ranges and at a low dimension (30), the experimental results indicate that none of the sociometry and swarm size combinations converged to the global minimum for multimodal functions, except for some dimensions of the Griewank function.
5.2 Optimization in Dynamic Environments

F(~x, t) = max( B(~x), max_{p ∈ [1, m]} P(~x, h_p(t), w_p(t), ~c_p(t)) )        (5.1)

where B(~x) is a time-invariant basis landscape, whose utilization is optional, and P is the function defining the height of the pth peak at location ~x, where each of the m peaks can have its own dynamic parameters, such as height, h_p(t), width, w_p(t), and location vector of the peak center, ~c_p(t). Each peak parameter can be initialized randomly or set to a certain value, and after a time period (number of evaluations), Te, it is updated as follows:

h_p(t) = h_p(t − Te) + r1·Δh
w_p(t) = w_p(t − Te) + r2·Δw        (5.2)
~c_p(t) = ~c_p(t − Te) + ~v_p(t)

where r1, r2 ∈ N(0, 1), Δh and Δw are the height and width severities, and ~v_p(t) is the normalized shift vector, which is a linear combination of a random vector and the previous shift vector, ~v_p(t − Te).
their initial heights and widths, environment (search space) dimension and size,
change severity, level of change randomness, and change frequency can be defined
[33]. To allow comparative evaluations among different algorithms, three standard
settings of such MPB parameters, the so-called Scenarios, have been defined.
Scenario 2 is the most widely used. Each scenario allows a range of values, among which the following are commonly used: number of peaks = 10, change severity vlength = 1.0, correlation lambda = 0.0, and peak change frequency = 5,000. In Scenario 2, no basis landscape is used, and the peak type is a simple cone with the
following expression:

P(~x, h_p(t), w_p(t), ~c_p(t)) = h_p(t) − s_p(t)·||~x − ~c_p(t)||        (5.3)

where s_p(t) is the slope and || · || is the Euclidean distance. More detailed information on MPB and the rest of the parameters used in this benchmark can be obtained from Branke [33].
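Under these definitions, a cone peak and its periodic update can be sketched as follows; this is an illustrative Python rendering of Eqs. (5.2) and (5.3), with severity and scenario values chosen by us rather than prescribed:

```python
import numpy as np

def cone_peak(x, h, w, c):
    """Cone peak of Eq. (5.3): P = h_p(t) - s_p(t) * ||x - c_p(t)||,
    with slope s_p(t) = h_p(t) / w_p(t)."""
    return h - (h / w) * np.linalg.norm(np.asarray(x) - np.asarray(c))

def update_peak(h, w, c, v_prev, dh=7.0, dw=1.0, vlength=1.0, lam=0.0, rng=None):
    """Eq. (5.2) update: Gaussian drifts on height and width scaled by the
    severities, and a center shift by a normalized vector blending a random
    direction with the previous shift (correlation lambda)."""
    rng = rng or np.random.default_rng()
    r = rng.normal(size=np.asarray(c).shape)
    v = (1.0 - lam) * r + lam * np.asarray(v_prev)
    v = vlength * v / np.linalg.norm(v)          # normalized shift vector v_p(t)
    return h + rng.normal() * dh, w + rng.normal() * dw, np.asarray(c) + v, v

peak_center = np.zeros(5)
height_at_center = cone_peak(peak_center, 50.0, 10.0, peak_center)
h2, w2, c2, v2 = update_peak(50.0, 10.0, peak_center, np.zeros(5))
```

At the peak center the cone attains its full height h_p(t), and the shift vector produced by the update always has length vlength, matching the Scenario 2 setting above.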
The main problem of using the basic PSO algorithm in a dynamic environment is
that eventually the swarm will converge to a single peak—whether global or local.
When another peak becomes the global maximum as a result of an environmental
change, it is likely that the particles keep circulating close to the peak to which the
swarm has converged and thus they cannot find the new global maximum.
Blackwell and Branke have addressed this problem in [34] and [35] by introducing
multiswarms that are actually separate PSO processes. Recall from Sect. 3.3.2 that
each particle is now a member of one of the swarms only and it is unaware of other
swarms. Hence in this problem domain, the main idea is that each swarm can
converge to a separate peak. Swarms interact only by mutual repulsion that keeps
them from converging to the same peak. For a single swarm it is essential to maintain enough diversity, so that the swarm can track small location changes of the peak to which it is converging. For this purpose, Blackwell and Branke introduced charged and quantum swarms, which are analogous to an atom having a nucleus and charged particles randomly orbiting it. The particles in the nucleus take care of the fine-tuning of the result, while the charged particles are responsible for detecting position changes. It is clear, however, that instead of charged or quantum swarms, some other method can also be used to ensure sufficient diversity among the particles of a single swarm, so that the peak can be tracked despite small location changes.
As one might expect, the best results are achieved when the number of swarms is set equal to the number of peaks. However, this requires that the number of peaks be known beforehand. In [5], Blackwell presents self-adapting multiswarms, which can be created or removed during the PSO process, so that it is not necessary to fix the number of swarms beforehand.

The repulsion between swarms is realized by simply reinitializing the worse of two swarms if they move within a certain range of each other. Using physical repulsion could instead lead to an equilibrium where swarm repulsion prevents both swarms from getting close to a peak. A proper proximity threshold, r_rep, can be obtained using the average radius of the peak basin, r_bas. If p peaks are evenly distributed in an N-dimensional cube, X^N, then r_rep = r_bas = X/p^(1/N).
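A minimal sketch of this threshold and the reinitialization test (the helper names are ours):

```python
import math

def proximity_threshold(X, p, N):
    """r_rep = r_bas = X / p**(1/N) for p peaks evenly spread over the cube X^N."""
    return X / p ** (1.0 / N)

def should_reinitialize(gbest_a, gbest_b, r_rep):
    """Reinitialize the worse of two swarms when their gbests are closer than r_rep."""
    return math.dist(gbest_a, gbest_b) < r_rep

# Scenario-2-like setting: range X = 100, p = 10 peaks, N = 5 dimensions.
r_rep = proximity_threshold(100.0, 10, 5)
```

The distance test is applied between the swarms' global best positions, matching the swarm-distance measure described in the next section.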
The previous section introduced how the FGBF process works within a bPSO at a fixed dimension and referred to some applications in other domains representing static environments. In dynamic environments, however, this approach eventually leads the swarm to converge to a single peak (whether global or local), and therefore it may lose its ability to track other peaks. As any of the peaks can become the optimum peak as a result of environmental changes, bPSO equipped with FGBF is likely to lead to suboptimal solutions. This is the basic motivation for using multiswarms along with the FGBF operation. As described earlier, mutual repulsion between swarms is applied, where the distance between the swarms' global best locations is used to measure the distance between two swarms. Instead of using charged or quantum swarms, FGBF is sufficient to provide the necessary diversity and thus enable peak tracking when the peaks' locations change. Particle velocities are also reinitialized after each environment change to enhance diversity.

Each particle with index a in a swarm n represents a potential solution, and therefore the jth component of an N-dimensional point (x_j, j ∈ [1, N]) is stored in its positional component, x_{a,j}(t), at time t. The aim of the PSO process is to search for the global optimum point that maximizes P(~x, h_p(t), w_p(t), ~c_p(t)), in other words, to find the global (highest) peak in the MPB environment. Recall that in Scenario 2 of MPB the peaks used are all cone-shaped, as given in Eq. (5.3).
Since in Eq. (5.3) h_p(t) and s_p(t) are both set by MPB, finding the highest peak is equivalent to minimizing the ||~x − ~c_p(t)||² term, yielding f(a, j) = −(x_j − c_pj)².

Step 3.1 in bPSO's pseudo-code computes the (dimensional) fitness scores (f(a, j), f(gbest, j)) of the jth components (x_{a,j}, y_{gbest,j}), and in step 1 of the FGBF process the dimensional component yielding the maximum f(a, j) is placed in aGB. In step 3, these dimensional components are replaced by the dimensional components of the personal best position of the gbest particle if those yield higher dimensional fitness scores. We do not expect that dimensional fitness scores can be evaluated with respect to the optimum peak, since this would require a priori knowledge of the global optimum; instead, we use either the current peak on which the particle resides or the peak to which the swarm is converging (the swarm peak). We shall thus consider and evaluate both modes separately.
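The swarm-peak mode of this dimensional fitness and the resulting aGB formation can be sketched as follows (illustrative names; note that f(a, j) is maximized here, unlike in the minimization examples earlier):

```python
import numpy as np

def fgbf_mpb(positions, peak_center):
    """FGBF for MPB, swarm-peak mode: f(a, j) = -(x_{a,j} - c_j)^2 against the
    center c of the peak the swarm is converging to. The aGB particle takes,
    per coordinate, the component with the maximum f(a, j)."""
    scores = -(positions - peak_center) ** 2        # f(a, j), shape (S, N)
    winners = scores.argmax(axis=0)                 # best particle per coordinate
    return positions[winners, np.arange(positions.shape[1])]

swarm = np.array([[1.0, 9.0],
                  [6.0, 4.0]])
agb = fgbf_mpb(swarm, peak_center=np.array([5.0, 5.0]))
# coordinate 0: |6 - 5| < |1 - 5|, so 6.0 wins; coordinate 1: |4 - 5| < |9 - 5|, so 4.0 wins
```

Per coordinate, the component closest to the peak center wins, so the aGB particle is never farther from the swarm peak than any single particle.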
where d ∈ [D_min, D_max] is the dimension of position ~x^d and ~c^d_p(t) refers to the first d coordinates (dimensions) of the peak center location. A cone peak is now expressed as follows:

P(~x^d, h_p(t), w_p(t), ~c^d_p(t)) = h_p(t) − s_p(t)·||~x^d − ~c^d_p(t)||/d − (D_opt − d)²        (5.5)

where

s_p(t) = h_p(t)/w_p(t)  and  ||~x^d − ~c^d_p(t)|| = sqrt( Σ_{i=1..d} (x^d_i − c^d_pi)² ),  ∀x^d_i ∈ ~x^d, ∀c^d_pi ∈ ~c^d_p(t)

and D_opt is the current optimal dimension. Compared with Eq. (5.3), a penalty term, (D_opt − d)², is now subtracted from the whole environment for all nonoptimal dimensions. In addition, the peak slopes are scaled by the term 1/d. The purpose of this scaling is to prevent the benchmark from favoring lower dimensions. Otherwise, a solution whose coordinates each differ from the optimum by 1.0 would score better in a lower dimension, as the Euclidean distance is used.
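A sketch of this extended cone peak, with illustrative parameter values:

```python
import numpy as np

def cone_peak_md(x_d, h, w, c, D_opt):
    """Dimension-extended cone peak of Eq. (5.5): the slope h/w is scaled by
    1/d and a penalty (D_opt - d)^2 punishes nonoptimal dimensions."""
    d = len(x_d)
    s = h / w
    dist = np.linalg.norm(np.asarray(x_d) - np.asarray(c)[:d])
    return h - s * dist / d - (D_opt - d) ** 2

c = np.array([0.0, 0.0, 0.0])
# At the peak center in the optimal dimension, the full height is attained;
# one dimension short, the penalty (3 - 2)^2 = 1 is subtracted.
at_opt = cone_peak_md(np.zeros(3), 50.0, 10.0, c, D_opt=3)
one_short = cone_peak_md(np.zeros(2), 50.0, 10.0, c, D_opt=3)
```

Because the first d coordinates of the peak center are shared by all dimensions, a solution can only beat the optimal-dimension solution by the slope term, which the 1/d scaling and the quadratic penalty prevent.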
Similar to the unidimensional (PSO) case, each positional component xx^d_a(t) of an MD PSO particle represents a potential solution in dimension d. The only difference is that now the dimension of the optimal solution is not known beforehand, but can vary within the defined range. Even a single particle can provide potential solutions in different dimensions as it makes interdimensional passes during the MD PSO process. This dynamic multidimensional optimization algorithm combines multiswarms and FGBF with MD PSO. Since the common coordinates of the peak locations are the same across dimensions, it does not seem purposeful for two swarms to converge to the same peak in different dimensions. Therefore, the mutual repulsion between swarms is extended to affect swarms that are in different dimensions. Obviously, only the common coordinates are considered when the swarm distance is computed.

FGBF naturally exploits information gathered in other dimensions. When the aGB particle is created, the FGBF algorithm is not limited to using dimensional components only from those particles that are in a certain dimension; it can combine dimensional coordinates of particles in different dimensions. Note that since we still use the dimensional fitness score f(a, j) = −(x_j − c_pj)², the common coordinates of the positional components of the aGB particle created in different search space dimensions, d ∈ [1, D_max], will be the same. In other words, it is not necessary to create the positional components of the aGB particle from scratch in every search space dimension d ∈ [1, D_max]; instead, one (new) coordinate (dimension) is created and added to the aGB particle. Note also that it is still possible that in some search space dimensions aGB beats the native gbest particle, while in others it does not. In the multidimensional version, the dimension and dimensional velocity of each particle are also reinitialized after an environmental change, in addition to the particle velocities in each dimension.
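The incremental construction described above can be sketched as follows; the helper name and the dictionary return are ours, assuming the swarm-peak fitness f(a, j) = −(x_j − c_j)²:

```python
import numpy as np

def agb_all_dimensions(positions, dims, center, D_max):
    """Because f(a, j) = -(x_j - c_j)^2 does not depend on d, the aGB position
    in dimension d + 1 shares its first d coordinates with the aGB position in
    dimension d, so only one new coordinate is selected per added dimension."""
    coords = []
    for j in range(D_max):
        parts = [p[j] for p, dd in zip(positions, dims) if dd > j]
        coords.append(max(parts, key=lambda x: -(x - center[j]) ** 2))
    # aGB position in dimension d is simply the first d coordinates
    return {d: np.array(coords[:d]) for d in range(1, D_max + 1)}

positions = [np.array([1.0, 9.0]), np.array([6.0, 4.0, 7.0])]
agbs = agb_all_dimensions(positions, [2, 3],
                          center=np.array([5.0, 5.0, 5.0]), D_max=3)
```

Note how particles in different dimensions contribute coordinates to the same aGB position, and how each lower-dimensional aGB is a prefix of the higher-dimensional one.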
We conducted an exhaustive set of experiments over the MPB Scenario 2 using the
settings given earlier. In order to investigate the effect of multiswarm settings,
different numbers of swarms and particles in a swarm are used. Both FGBF modes
are applied using the current and swarm peaks. In order to investigate how FGBF
and multiswarms individually contribute to the results, experiments with each of
them performed separately will also be presented.
[Fig. 5.8: current error vs. number of evaluations (×10^4) during the first 80,000 evaluations]
Figure 5.8 presents the current error plot, which shows the difference between
the global maximum and the current best result during the first 80,000 function
evaluations, when 10 swarms, each with four particles, are used and the swarm
peak mode is applied for the FGBF operation. It can be seen from the figure that each environment change, occurring every 5,000 evaluations, causes the results to deteriorate temporarily. However, it is clear that after these environment changes the results improve (i.e., the error decreases quite rapidly), which shows the benefit of tracking the peaks instead of randomizing the swarm when a change occurs. The
figure also reveals other typical behavior of the algorithm. First of all, after the first
few environmental changes, the algorithm does not behave as well as at later
stages. This is because the swarms have not yet converged to a peak. Generally, it
is more difficult to initially converge to a narrow or low peak than to keep tracking
a peak that becomes narrow and/or low. It can also be seen that typically the
algorithm gets close to the optimal solution before the environment is changed
again. In the few cases where the optimal solution is not found, the algorithm has
been unable to keep a swarm tracking the corresponding peak, which is too narrow.
In Figs. 5.9 and 5.10, the contributions of multiswarms with FGBF are dem-
onstrated. The algorithm is run on the MPB applying the same environment changes,
first using both multiswarms and FGBF, then without multiswarms, and finally
without FGBF. The same settings are used as before. Without multiswarms, the
number of particles is set to 40 to keep the total number of particles unchanged.
122 5 Improving Global Convergence
[Fig. 5.9 Current error vs. number of evaluations (×10^4), with and without multiswarms]
[Fig. 5.10 Current error vs. number of evaluations (×10^4), with and without FGBF]
Of the two FGBF modes, better results are obtained when the swarm peak is used as
the peak p̃_cp(t).
The extended multidimensional MPB uses similar settings as in the case of con-
ventional MPB (at fixed dimension) except that the change frequency is set to
15,000. The search space dimension range used is d ∈ [5, 15]. Figure 5.11 shows
how the global optimal dimension changes over time and how MD PSO is tracking
these changes. The current best dimension represents the dimension, where the
best solution is achieved among all swarms’ dbest dimensions. Ten multiswarms
are used with seven particles in each. FGBF is used with the swarm peak mode. It
can be seen that the algorithm always finds the optimal dimension, even though the
difference in peak heights between the optimal dimension and its neighboring
dimensions is quite insignificant (=1) compared to the peak heights (30–70).
Figure 5.12 shows how the current error behaves during the first 250,000 evalu-
ations, when the same settings are used. It can be seen that the algorithm behavior
is similar to the unidimensional case, but now the initial phase, during which the
algorithm has not yet been behaving at its best, is longer. Similarly, it takes a longer time to
[Fig. 5.11 Optimal dimension and current best dimension (search space dimension 5–15) vs. number of evaluations (×10^5)]
[Fig. 5.12 Current error vs. number of evaluations (×10^5)]
regain the optimal behavior if the follow up of some peaks is lost for some reason
(it is, for example, possible that higher peaks hide other lower peaks under them).
Figures 5.13 and 5.14 illustrate the effect of using multiswarms on the per-
formance of the algorithm. Without multiswarms the number of particles is set to
70. Figure 5.13 shows that a single swarm can also find the optimal dimension
easily; however, as in the unidimensional case, without use of multiswarms, the
optimal peak can be found only if it happens to be the peak to which the swarm is
converging. This can be seen in Fig. 5.14. During the initial phase, the results
with and without multiswarms are similar. This indicates that both algorithms
initially converge to the same (highest) peak and that, as a result of the first few
environmental changes, some peaks that are not yet discovered by the multiswarms
become the highest.
Figures 5.15 and 5.16 illustrate the effect of FGBF on the performance of the
algorithm. In Fig. 5.15, it can be seen that without FGBF the algorithm has severe
problems in tracking the optimal dimension. In this case, it loses the benefit of
exploiting the natural diversity among the dimensional components and it is not
able to exploit information gathered from other dimensions. Therefore, even if
some particles visit the optimal dimension, they cannot track the global peak fast
enough to surpass the best results found in other dimensions. Therefore, the
algorithm gets trapped in some suboptimal dimension where it happens to
find the best solution in an early phase. Such reasons also cause the current error to
be generally higher without FGBF, as can be seen in Fig. 5.16.
[Fig. 5.13 Optimal dimension and current best dimension vs. number of evaluations (×10^5)]
[Fig. 5.14 Current error vs. number of evaluations (×10^5), with and without multiswarms]
[Fig. 5.15 Optimal dimension and current best dimension vs. number of evaluations (×10^5), without FGBF]
[Fig. 5.16 Current error vs. number of evaluations (×10^5), with and without FGBF]
Table 5.7 Offline error on extended MPB
No. of swarms   No. of particles   Swarm peak    Current peak
10              4                  2.01 ± 0.98   3.29 ± 1.44
10              5                  1.77 ± 0.83   3.41 ± 1.69
10              6                  1.79 ± 0.98   3.64 ± 1.60
10              7                  1.69 ± 0.75   3.71 ± 1.74
10              8                  1.84 ± 0.97   4.21 ± 1.83
10              10                 1.96 ± 0.94   4.20 ± 2.03
8               7                  1.79 ± 0.91   3.72 ± 1.86
9               7                  1.83 ± 0.84   4.30 ± 2.15
11              7                  1.75 ± 0.91   3.52 ± 1.40
12              7                  2.03 ± 0.97   4.01 ± 1.97
The numerical results in terms of offline errors are given in Table 5.7. Each
result given is the average of 50 runs, where each run consists of 500,000 function
evaluations. As in the unidimensional case, the best results are achieved when the
number of swarms is equal to the number of peaks, which is 10. Interestingly,
when the swarm peak mode is used the optimal number of particles becomes seven
while with current peak mode it is still four. Note that these results cannot be
directly compared with the results on conventional MPB since the objective
function of multidimensional MPB is somewhat different.
Overall, MD PSO operation with FGBF and with multiswarms fundamentally
upgrades the particle structure and the swarm guidance, both of which accomplish
substantial improvements in terms of speed and accuracy for dynamic, multidi-
mensional and multimodal environments.
Let us recall the Merriam-Webster definition of optimization: ‘‘the mathematical
procedures (as finding the maximum of a function) involved in this’’. More
specifically, consider now the problem of finding a root θ* (either a minimum or a
maximum point) of the gradient equation, g(θ) ≡ ∂L(θ)/∂θ = 0, for some
differentiable function L : R^p → R^1. As discussed in Chap. 2, when g is defined and L is a
unimodal function, there are powerful deterministic methods for finding the global
θ* such as the traditional steepest descent and Newton–Raphson methods. However, in
many real problems g cannot be observed directly and/or L is multimodal, in which
case the aforementioned approaches may be trapped into some deceiving local
optima. This brought the era of stochastic optimization algorithms, which can
estimate the gradient and may avoid being trapped into a local optimum due to
their stochastic nature. One of the most popular stochastic optimization techniques
is stochastic approximation (SA), in particular the form that is called ‘‘gradient
free’’ SA. Among many SA variants proposed by several researchers such as
5.3 Who Will Guide the Guide? 129
Styblinski and Tang [39], Kushner [40], Gelfand and Mitter [41], and Chin [42],
one notable and somewhat different variant is simultaneous perturbation SA
(SPSA), proposed by Spall in 1992 [43]. The main advantage of SPSA is that it
often achieves a much more economical operation in terms of loss function
evaluations, which are usually the most computationally intensive part of an
optimization process.
As discussed earlier, PSO has a severe drawback in the update of its global best
(gbest) particle, which has a crucial role of guiding the rest of the swarm. At any
iteration of a PSO process, gbest is the most important particle; however, it has the
poorest update equation, i.e., when a particle becomes gbest, it resides on its
personal best position (pbest) and thus both social and cognitive components are
nullified in the velocity update equation. Although it guides the swarm during the
following iterations, ironically it lacks the necessary guidance to do so effectively.
As a result, if gbest is (or is likely to get) trapped in a local optimum, so is the rest of the
swarm, due to the aforementioned direct link of information flow. We have shown
that an enhanced guidance achieved by FGBF alone is indeed sufficient in most
cases to achieve global convergence performance on multimodal functions and
even in high dimensions. However, the underlying mechanism for creating the
aGB particle, the so-called FGBF, is not generic in the sense that it is rather
problem dependent, which requires (the estimate of) individual dimensional fitness
scores. This may be quite hard or even not possible for certain problems.
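The nullified update can be seen directly from the bPSO velocity equation. The following minimal sketch (a hypothetical helper, not the book's code) evaluates one velocity component; when a particle is gbest and resides on its own pbest, both difference terms vanish and only the inertia term survives.

```cpp
#include <cassert>

// One component of the bPSO velocity update:
//   v <- w*v + c1*r1*(pbest - x) + c2*r2*(gbest - x)
// When the particle IS gbest and resides on its personal best
// (x == pbest == gbest), both the cognitive and the social terms are
// nullified and only the decaying inertia term w*v remains.
double velocity_update(double v, double x, double pbest, double gbest,
                       double w, double c1, double r1, double c2, double r2) {
    return w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x);
}
```

With x = pbest = gbest the returned velocity is just w·v, so gbest can only drift with its decaying momentum, which is exactly the poverty that the SPSA-driven update presented below addresses.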
In order to address this drawback efficiently, in this section we shall present two
approaches, one of which moves gbest efficiently or simply put, ‘‘guides’’ it with
respect to the function (or error surface) it resides on. The idea behind this is quite
simple; since the velocity update equation of gbest is quite poor, SPSA as a simple
yet powerful search technique is used to drive it instead. Due to its stochastic
nature, the likelihood of getting trapped into a local optimum is further decreased, and
with the SA, gbest is driven according to (an approximation of) the gradient of the
function. The second approach has a similar idea with the FGBF, i.e., an aGB
particle is created by SPSA this time, which is applied over the personal best
(pbest) position of the gbest particle. The aGB particle will then guide the swarm
instead of gbest if it achieves a better fitness score than the (personal best position
of) gbest. Note that both approaches only deal with the gbest particle and hence the
internal PSO process remains as is. That is, neither of them is a PSO variant by
itself; rather a solution for the problem of the original PSO caused by the poor
gbest update. Furthermore, we shall demonstrate that both approaches have a
negligible computational cost overhead, e.g., only a few percent increase in the
computational complexity, which can be easily compensated with a slight
reduction either in the swarm size or in the maximum iteration number allowed.
Both approaches of SA-driven PSO (SAD PSO) will be tested and evaluated
against the basic PSO (bPSO) over several benchmark uni- and multimodal
functions in high dimensions. Moreover, they are also applied to the multidi-
mensional extension of PSO, the MD PSO technique.
Recall that there are two common SA methods: finite difference SA (FDSA) and
simultaneous perturbation SA (SPSA). As covered in Sect. 2.3.2, FDSA adopts the
traditional Kiefer–Wolfowitz approach to approximate gradient vectors as a vector
of p partial derivatives where p is the dimension of the loss function. On the other
hand, SPSA perturbs all elements of θ̂_k simultaneously, using only two
measurements of the loss (fitness) function:

  ĝ_k(θ̂_k) = [L(θ̂_k + c_k Δ_k) − L(θ̂_k − c_k Δ_k)] / (2 c_k) · [Δ_k1^(−1), Δ_k2^(−1), …, Δ_kp^(−1)]^T    (5.6)
where Δ_k is a p-dimensional random perturbation vector. In practice, only the
gains a and c are tuned with respect to the problem whilst keeping the other three
parameters (A, α, and γ) as recommended in Spall [44].
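To make the operation in Eq. (5.6) concrete, the following sketch runs plain gradient-free SPSA on a given loss function. It is illustrative only: the gain-sequence forms a_k = a/(A + k + 1)^α and c_k = c/(k + 1)^γ follow Spall's recommended shape, but the default parameter values and function names here are demo assumptions, not the book's settings.

```cpp
#include <cmath>
#include <random>
#include <vector>
#include <functional>

// Plain SPSA per Eq. (5.6): each iteration perturbs ALL p components
// simultaneously with a Bernoulli (+/-1) vector Delta_k and spends only
// TWO loss evaluations, regardless of the dimension p.
std::vector<double> spsa_minimize(
        const std::function<double(const std::vector<double>&)>& L,
        std::vector<double> theta, int iterations,
        double a = 0.5, double c = 0.1, double A = 60.0,
        double alpha = 0.602, double gamma = 0.101) {
    std::mt19937 rng(1);                      // fixed seed for repeatability
    std::bernoulli_distribution coin(0.5);
    const std::size_t p = theta.size();
    for (int k = 0; k < iterations; ++k) {
        double ak = a / std::pow(A + k + 1.0, alpha);  // step-size gain
        double ck = c / std::pow(k + 1.0, gamma);      // perturbation gain
        std::vector<double> delta(p), plus(theta), minus(theta);
        for (std::size_t i = 0; i < p; ++i) {
            delta[i] = coin(rng) ? 1.0 : -1.0;
            plus[i]  += ck * delta[i];
            minus[i] -= ck * delta[i];
        }
        double diff = (L(plus) - L(minus)) / (2.0 * ck);
        for (std::size_t i = 0; i < p; ++i)
            theta[i] -= ak * diff / delta[i];  // g_hat_i = diff * Delta_ki^(-1)
    }
    return theta;
}
```

Note that the two-sided difference is shared by all p components; only the per-component factor Δ_ki^(−1) differs, which is what makes SPSA so economical compared to FDSA's 2p evaluations.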
Maeda and Kuratani in [46] used SPSA with the bPSO in a hybrid algorithm
called Simultaneous Perturbation PSO (SP-PSO) over a limited set of problems
and reported some slight improvements over the bPSO. Both proposed SP-PSO
variants involved inserting the ĝ_k(θ̂_k) term directly into the velocity update
equations of all swarm particles, with the intention of improving their local search
capability. This may, however, present some drawbacks. First of all, performing
SPSA at each iteration and for all particles will double the computational cost of
the PSO, since SPSA will require an additional function evaluation at each
iteration.¹ Secondly, such an abrupt insertion of SPSA's ĝ_k(θ̂_k) term directly
into the bPSO may degrade the original PSO dynamics, i.e., the collective swarm
updates and interactions, and require an accurate scaling between the parameters
of the two methods, PSO and SPSA. Otherwise, it is possible for one technique to
dominate the other, and hence their combination would not necessarily gain the
advantages of both. This is perhaps the reason for the limited success, if any,
achieved by SP-PSO. As we discuss next and demonstrate experimentally, SPSA
should not be blended into bPSO in this manner. A better alternative is to use
SPSA to guide only PSO's native guide, gbest.
In this section, two distinct SAD PSO approaches are presented and applied only to
gbest whilst keeping the internal PSO and MD PSO processes unchanged. Since
SPSA and PSO are iterative processes, in both approaches to be introduced next,
SPSA can easily be integrated into PSO and MD PSO by using the same iteration
count (i.e., t ≡ k). In other words, at a particular iteration t in the PSO process,
only the SPSA steps 2.1–2.5 in Table 5.8 are inserted accordingly into the PSO
and MD PSO processes. The following subsections will detail each approach.
In this approach, at each iteration, the gbest particle is updated using SPSA. This
requires the adaptation of the SPSA elements (parameters and variables) and
integration of the internal SPSA part (within the loop) appropriately into the PSO
¹ In Maeda and Kuratani [46], the function evaluations are given with respect to the iteration
number; however, it should have been noted that SP-PSO performs twice as many evaluations as
bPSO per iteration. Considering this fact, the plots therein show little or no performance
improvement at all.
pseudo-code, as shown in Table 5.9. Note that such a ‘‘plug-in’’ approach will not
change the internal PSO structure and only affects the gbest particle’s updates. It
only costs two extra function evaluations, and hence at each iteration the total
number of evaluations is increased from S to S + 2, where S is the swarm size.
Since the fitness of each particle's current position is computed within the PSO
process, it is possible to further decrease this cost to only one extra fitness
evaluation per iteration. Let θ̂_k + c_k Δ_k = x_a(t) in step 3.4.1.1, so that
L(θ̂_k + c_k Δ_k) is known a priori. Then, naturally, θ̂_k − c_k Δ_k = x_a(t) − 2 c_k Δ_k,
which is the only (new) location where the extra fitness evaluation,
L(θ̂_k − c_k Δ_k), has to be computed. Once the gradient ĝ_k(θ̂_k) is estimated in
step 3.4.1.4, the next (updated) location of gbest will be x_a(t + 1) = θ̂_{k+1}. Note
that the difference of this ‘‘low-cost’’ SPSA update is that x_a(t + 1) is updated
(estimated) not from x_a(t), but instead from x_a(t) − c_k Δ_k.
This approach can easily be extended for MD PSO, which is a natural extension
of PSO for multidimensional search within a given dimensional range,
d ∈ [D_min, D_max]. The main difference is that in each dimension there is a distinct
gbest particle, gbest(d). So SPSA is applied individually over the position of each
gbest(d) if it (re-)visits the dimension d (i.e., d = xd_gbest(t)). Therefore, there can
be a maximum of 2(D_max − D_min + 1) extra function evaluations per iteration,
which can significantly increase the computational cost.
The second approach replaces the native FGBF operation with the SPSA to create
an aGB particle. SPSA is basically applied over the pbest position of the gbest
particle. The aGB particle will then guide the swarm instead of gbest if it achieves
a better fitness score than the (personal best position of) gbest. SAD PSO pseudo-
code as given in Table 5.10 can then be plugged in between steps 3.3 and 3.4 of
bPSO pseudo-code.
The extension of the second approach to MD PSO is also quite straightforward.
In order to create an aGB particle, SPSA is applied individually over the personal
best position of each gbest(d) particle for all dimensions in the given range (i.e.,
∀d ∈ [D_min, D_max]) and, furthermore, the aforementioned competitive selection
ensures that xy_aGB^d(t), ∀d ∈ [D_min, D_max], is set to the better of xx_aGB^d(t + 1)
and xy_aGB^d(t). As a result, SPSA creates one aGB particle providing (potential) GB
solutions (xy_aGB^d(t + 1), ∀d ∈ [D_min, D_max]) for all dimensions in the given
dimensional range. The pseudo-code of the second approach, as given in
Table 5.11, can then be plugged in between steps 3.2 and 3.3 of the MD PSO
pseudo-code, given in Table 5.12.
Note that in the second SAD PSO approach, there are three extra fitness
evaluations (as opposed to two in the first one) at each iteration. Yet as in the first
approach, it is possible to further decrease the cost of SAD PSO by one (from three
to two fitness evaluations per iteration). Let θ̂_k + c_k Δ_k = ŷ(t) in step 2, so that
L(θ̂_k + c_k Δ_k) is known a priori. Then the aGB formation follows the same analogy
as before and the only difference is that the aGB particle is formed not from
θ̂_k = ŷ(t) but from θ̂_k = ŷ(t) − c_k Δ_k. However, in this approach a major
difference in the computational cost may occur, since in each iteration there are
inevitably 3(D_max − D_min) (or 2(D_max − D_min) for the low-cost application)
fitness evaluations, which can be significant.
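The per-dimension bookkeeping of this second approach can be sketched as below (the names and interfaces are illustrative assumptions, not the book's API): SPSA proposes a candidate from the pbest of gbest(d) in each dimension, and the competitive selection keeps whichever of the candidate and the aGB particle's previous personal best is fitter.

```cpp
#include <vector>
#include <functional>
#include <map>

using Vec = std::vector<double>;

// The aGB particle's per-dimension personal bests (minimization assumed).
struct AGB {
    std::map<int, Vec>    xy;   // personal best position in each dimension
    std::map<int, double> fit;  // its fitness in each dimension
};

// For every d in [Dmin, Dmax]: apply one SPSA move over the pbest position
// of gbest(d); competitive selection keeps the better of the candidate and
// the previously stored aGB best for that dimension.
void update_aGB(AGB& agb, int Dmin, int Dmax,
                const std::function<Vec(int)>& pbest_of_gbest,
                const std::function<Vec(const Vec&)>& spsa_step,
                const std::function<double(const Vec&)>& L) {
    for (int d = Dmin; d <= Dmax; ++d) {
        Vec cand = spsa_step(pbest_of_gbest(d));
        double f = L(cand);
        auto it = agb.fit.find(d);
        if (it == agb.fit.end() || f < it->second) {  // competitive selection
            agb.xy[d]  = cand;
            agb.fit[d] = f;
        }
    }
}
```

One call costs one extra fitness evaluation per dimension, which is the source of the D_max − D_min dependent overhead discussed above.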
The same seven benchmark functions given in Table 5.3 are used in this section
but without the dimensional terms (see their original form in Table 5.12). Recall
that Sphere, De Jong, and Rosenbrock are unimodal functions and the rest are
multimodal, meaning that they have many local minima. Recall further that on the
macroscopic level Griewank demonstrates certain similarities with unimodal
functions, especially when the dimensionality is above 20; however, in low
dimensions it carries significant noise.
Table 5.12 Benchmark functions without dimensional bias
Function     Formula                                                              Initial range   Dimension set {d}
Sphere       F1(x, d) = Σ_{i=1..d} x_i^2                                          [−150, 75]      20, 50, 80
De Jong      F2(x, d) = Σ_{i=1..d} i·x_i^4                                        [−50, 25]       20, 50, 80
Rosenbrock   F3(x, d) = Σ_{i=1..d} [100(x_{i+1} − x_i^2)^2 + (x_i − 1)^2]         [−50, 25]       20, 50, 80
Rastrigin    F4(x, d) = Σ_{i=1..d} [10 + x_i^2 − 10 cos(2πx_i)]                   [−500, 250]     20, 50, 80
Griewank     F5(x, d) = (1/4000) Σ_{i=1..d} x_i^2 − Π_{i=1..d} cos(x_i/√i) + 1    [−500, 250]     20, 50, 80
Schwefel     F6(x, d) = 418.9829·d + Σ_{i=1..d} x_i sin(√|x_i|)                   [−500, 250]     20, 50, 80
Giunta       F7(x, d) = Σ_{i=1..d} [sin(16/15·x_i − 1) + sin^2(16/15·x_i − 1)
                        + (1/50) sin(4(16/15·x_i − 1))] + 268/1000                [−500, 250]     20, 50, 80
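As a sanity check on the formulas in Table 5.12, three of the benchmarks can be written out directly (a sketch, not the book's test-bed code); all three evaluate to their global minimum value 0 at the origin:

```cpp
#include <cmath>
#include <vector>

// Sphere, Rastrigin, and Griewank as listed in Table 5.12.
double sphere(const std::vector<double>& x) {
    double s = 0.0;
    for (double xi : x) s += xi * xi;
    return s;
}
double rastrigin(const std::vector<double>& x) {
    const double pi = std::acos(-1.0);
    double s = 0.0;
    for (double xi : x) s += 10.0 + xi * xi - 10.0 * std::cos(2.0 * pi * xi);
    return s;
}
double griewank(const std::vector<double>& x) {
    double s = 0.0, prod = 1.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        s += x[i] * x[i] / 4000.0;
        prod *= std::cos(x[i] / std::sqrt(double(i + 1)));
    }
    return s - prod + 1.0;  // 0 at the origin: the product of cosines is 1
}
```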
Table 5.13 Statistical results from 100 runs over seven benchmark functions
Functions    d    SPSA (μ / σ)                  bPSO (μ / σ)             SAD PSO A2 (μ / σ)    SAD PSO A1 (μ / σ)
Sphere       20   0 / 0                         0 / 0                    0 / 0                 0 / 0
             50   0 / 0                         0 / 0                    0 / 0                 0 / 0
             80   0 / 0                         135.272 / 276.185        0 / 0                 0 / 0
De Jong      20   0.013 / 0.0275                0 / 0                    0 / 0                 0 / 0
             50   0.0218 / 0.03                 9.0445 / 26.9962         0.0075 / 0.0091       0.2189 / 0.6491
             80   0.418 / 0.267                 998.737 / 832.1993       0.2584 / 0.4706       13,546.02 / 4,305.04
Rosenbrock   20   1.14422 / 0.2692              1.26462 / 0.4382         1.29941 / 0.4658      0.4089 / 0.2130
             50   3.5942 / 0.7485               15.9053 / 5.21491        12.35141 / 2.67731    2.5472 / 0.3696
             80   5.3928 / 0.7961               170.9547 / 231.9113      28.1527 / 5.1699      5.2919 / 0.8177
Rastrigin    20   204.9169 / 51.2863            0.0429 / 0.0383          0.0383 / 0.0369       0.0326 / 0.0300
             50   513.3888 / 75.7015            0.0528 / 0.0688          0.0381 / 0.0436       0.0353 / 0.0503
             80   832.9218 / 102.1792           0.7943 / 0.9517          0.2363 / 0.6552       0.1240 / 0.1694
Griewank     20   0 / 0                         0 / 0                    0 / 0                 0 / 0
             50   1.0631e+007 / 3.3726e+006     50.7317 / 191.1558       0 / 0                 3,074.02 / 13,989
             80   2.8251e+007 / 5.7896e+006     24,978 / 23,257          20,733 / 24,160       378,210 / 137,410
Schwefel     20   0.3584 / 0.0794               1.7474 / 0.3915          0.3076 / 0.0758       0.3991 / 0.0796
             50   0.8906 / 0.1006               10.2027 / 2.2145         0.8278 / 0.1093       0.9791 / 0.1232
             80   1.4352 / 0.1465               21.8269 / 5.1809         1.3633 / 0.1402       1.5528 / 0.1544
Giunta       20   42,743 / 667.2494             495.0777 / 245.1220      445.1360 / 264.1160   445.1356 / 249.5412
             50   10,724 / 1,027.6              4,257 / 713.1723         3,938.9 / 626.9194    3,916.2 / 758.3290
             80   17,283 / 1,247.9              9,873.6 / 1,313          8,838.2 / 1,357       8,454.2 / 1,285.3
Both approaches of the proposed SAD PSO along with the ‘‘low cost’’ appli-
cation are tested over seven benchmark functions and compared with the bPSO
and standalone SPSA application. The results are shown in Table 5.13. The same
termination criteria were used: the combination of the maximum number of iterations
allowed (iterNo = 10,000) and the cut-off error (ε_C = 10^−5). Three
dimensions (20, 50, and 80) for the sample functions are used in order to test the
performance of each technique. PSO (bPSO and SAD PSO) used a swarm size of
S = 40, and w was linearly decreased from 0.9 to 0.2. The SPSA parameters A, α, γ,
a, and c were set as recommended to 60, 0.602, 0.101, 1, and 1 for all functions. No
parameter tuning was done for SPSA on purpose, since tuning may not be feasible
for many practical applications, particularly the ones where the underlying fitness
surface is unknown. In order to make a fair comparison among SPSA, bPSO, and
SAD PSO, the number of evaluations is kept equal (so S = 38 and S = 37 are
used for the first and second SAD PSO approaches, respectively, and the number of
evaluations is set to 40 × 10,000 = 4 × 10^5 for SPSA). For each function and
dimension, 100 runs are
performed and the first- and second-order statistics (mean μ and standard
deviation σ) of the fitness scores are reported in Table 5.13, with the best statistics
highlighted. During each run, the operation terminates when the fitness score drops
below the cut-off error; it is then assumed that the global minimum of the function
is reached, and the score is set to 0. Therefore, an average score μ = 0 means that
the method converges to the global minimum in every run.
As the statistics on the right side of Table 5.13 indicate, either SAD PSO
approach achieves equal or superior average performance over all functions,
regardless of dimension and modality, without exception. In other words, SAD
PSO performs at least as well as the better of bPSO and SPSA,
even though either of them might have quite poor performance for a particular
function. Note especially that if SPSA performs well enough (meaning that the
setting of the critical parameters, e.g., a and c is appropriate), then a significant
performance improvement can be achieved by SAD PSO, i.e., see for instance De
Jong, Rosenbrock, and Schwefel. On the other hand, if SPSA does not perform
well, even much worse than any other technique, SAD PSO still outperforms
bPSO to a certain degree, e.g., see Giunta and particularly Griewank for d = 50
where SAD PSO can still converge to the global optimum (l = 0) although SPSA
performance is rather low. This supports the aforementioned claim, i.e., the PSO
update for gbest is so poor that even an underperforming SPSA implementation
can still improve the overall performance significantly. Note that the opposite is
also true, that is, SAD PSO, which internally runs SPSA for gbest achieves better
performance than SPSA alone.
Based on the results in Table 5.13, we can perform comparative evaluations
with some of the promising PSO variants such as [1, 30–32] where similar
experiments are performed over some or all of these benchmark functions. For
example in Angeline [32], a tournament selection mechanism is formed among
particles and the method is applied over four functions (Sphere, Rosenbrock,
Rastrigin, and Griewank). Although the method is applied over a reduced
positional range, ±15, and at low dimensions (10, 20, and 30), the mean scores
were in the range {0.3, 1,194}. As a result, both better and worse performances
than bPSO, depending on the function, were reported. In Esquivel and Coello
Coello [30], bPSO and two PSO variants, GCPSO and mutation-extended PSO
over three neighborhood topologies are applied to multimodal functions, Rastrigin,
Schwefel, and Griewank. Although the dimension is rather low (30), none of the
topologies over any PSO variant converged to the global minimum and the mean
scores were reported in the range of {0.0014, 4,762}. In Riget and Vesterstrom [1],
a diversity guided PSO variant, ARPSO, along with two competing methods, bPSO
and GA, were applied over the multimodal functions Rastrigin, Rosenbrock, and
Griewank at dimensions 20, 50, and 100. The experiments showed that none of
the three methods converged to the global minimum except ARPSO for Rastrigin
at dimension 20. ARPSO performed better than bPSO and GA for Rastrigin and
Rosenbrock but worse for Griewank. The CPSO proposed in Bergh and
Engelbrecht [24] was applied to five functions including Sphere, Rastrigin, Rosenbrock,
and Griewank. The dimension of all functions is fixed to 30 and in this dimension,
CPSO performed better than bPSO in 80 % of the experiments. Finally in Richards
and Ventura [31], dynamic sociometries via ring and star were introduced among
the swarm particles and the performance of various combinations of swarm size
and sociometry over six functions (the ones used in this section except Schwefel)
was reported. Although the tests were performed over comparatively reduced
positional ranges and at a low dimension (30), the experimental results indicate
that none of the sociometry and swarm size combinations converged to the global
minimum of multimodal functions except for some cases of the Griewank
function.
The statistical comparison between low-cost mode and the original (full cost) is
reported in Table 5.14. The statistics in the table indicate that both modes within
both approaches usually obtain a similar performance, but occasionally a significant
gap is visible. For instance, the low-cost mode achieves a significantly better
performance within the second SAD PSO approach for De Jong and Griewank
functions at d = 80. The opposite is true for Schwefel particularly at d = 20.
In order to verify whether the results are statistically significant, a statistical
significance test is next applied between each SAD PSO approach and each technique
(bPSO and SPSA) using the statistical data given in Table 5.13. Let H0 be the null
hypothesis, which states that there is no difference between the proposed and
competing techniques (i.e., the statistical results occur by chance). We shall then
define two common threshold values for P, 5 % and 1 %. If the P value, which is
the probability of observing such a large difference (or larger) between the sta-
tistics, is less than either threshold, then we can reject H0 with the corresponding
confidence level. To accomplish this, the standard t test was performed and the t
values were computed between the pair of competing methods. Recall that the
formula for the t test is as follows:
Table 5.14 Statistical results between full-cost and low-cost modes from 100 runs over seven benchmark functions
Functions   d    Full-cost mode                              Low-cost mode
                 SAD PSO A2 (μ / σ)   SAD PSO A1 (μ / σ)     SAD PSO A2 (μ / σ)   SAD PSO A1 (μ / σ)
Sphere      20   0 / 0                0 / 0                  0 / 0                0 / 0
            50   0 / 0                0 / 0                  0 / 0                0 / 0
            80   0 / 0                0 / 0                  0 / 0                0 / 0
De Jong     20   0 / 0                0 / 0                  0 / 0                0 / 0
Table 5.15 t test results for statistical significance analysis for both SPSA approaches, A1 and
A2
Functions d Pair of competing methods
bPSO (A2) bPSO (A1) SPSA (A2) SPSA (A1)
Sphere 20 0 0 0 0
50 0 0 0 0
80 4.90 4.90 0 0
De Jong 20 0 0 4.73 4.73
50 3.35 3.27 4.56 3.03
80 12.00 28.62 2.95 31.46
Rosenbrock 20 0.54 17.56 2.88 21.42
50 6.06 25.55 31.50 12.54
80 6.16 7.14 43.51 0.88 (*)
Rastrigin 20 0.86 2.12 39.95 39.95
50 1.80 2.05 67.81 67.81
80 4.83 6.93 81.49 81.50
Griewank 20 0 0 0 0
50 2.65 2.16 31.52 31.51
80 1.27 (*) 25.35 48.76 48.13
Schwefel 20 36.11 33.75 4.63 3.62
50 42.28 41.59 4.23 5.56
80 39.48 39.12 3.55 5.53
Giunta 20 1.39 1.43 589.42 593.75
50 3.35 3.27 56.37 53.31
80 5.48 7.73 45.81 49.28
  t = (μ1 − μ2) / √( [((n1 − 1)σ1² + (n2 − 1)σ2²) / (n1 + n2 − 2)] · [(n1 + n2) / (n1·n2)] )    (5.7)
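Equation (5.7) can be checked numerically against the tables: plugging in the Schwefel (d = 20) statistics of bPSO (μ = 1.7474, σ = 0.3915) and SAD PSO A2 (μ = 0.3076, σ = 0.0758) from Table 5.13 with n1 = n2 = 100 reproduces the t value of about 36.1 reported in Table 5.15 (the helper below is a sketch for this check, not the book's code):

```cpp
#include <cmath>

// Two-sample t value per Eq. (5.7), using the pooled variance estimate.
double t_value(double mu1, double s1, double mu2, double s2, int n1, int n2) {
    double pooled = ((n1 - 1) * s1 * s1 + (n2 - 1) * s2 * s2)
                    / double(n1 + n2 - 2);
    double scale  = double(n1 + n2) / (double(n1) * double(n2));
    return (mu1 - mu2) / std::sqrt(pooled * scale);
}
```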
where n1 ¼ n2 ¼ 100 is the number of runs. Using the first- and second-order
statistics presented in Table 5.13, the overall t test values are computed and
listed in Table 5.15. In those entries with a 0 value, both methods have a zero
mean and zero variance, indicating convergence to the global optimum. In such
cases, H0 cannot be rejected. In those nonzero entries, t test values corresponding
to the best approach are highlighted. In the t tests the degrees of freedom is
n1 þ n2 2 ¼ 198. Table 5.16 presents two corresponding entries of t test values
required to reject H0 at several levels of confidence (one-tailed test). Accordingly,
H0 can be rejected and hence all results are statistically significant beyond the
confidence level of 0.01 except the two entries shown with a (*) in Table 5.15.
Note that the majority of the results are statistically significant beyond the 0.001
level of confidence (e.g., the likelihood to occur by chance is less than 1 in 1,000
times).
This chapter focused on a major drawback of the PSO algorithm: the poor gbest
update. This can be a severe problem, which may cause premature convergence to
local optima since gbest as the common term in the update equation of all parti-
cles, is the primary guide of the swarm. Therefore, a solution for the social
problem in PSO is the main goal of this chapter, i.e., ‘‘Who will guide the guide?’’,
which resembles the rhetorical question posed by Plato in his famous work on
government: ‘‘Who will guard the guards?’’ (Quis custodiet ipsos custodes?). At
first the focus is drawn on improving the global convergence of PSO and in turn,
MD PSO, as its convergence performance is still limited to the same level as PSO,
which suffers from the lack of diversity among particles. This leads to a premature
convergence to local optima especially when multimodal problems are optimized
in high dimensions. Realizing that the main problem lies in fact in the inability of
using the available diversity among the vector components of swarm particles’
positions, the FGBF technique adapted in this section addresses this problem by
collecting the best components and fractionally creating an artificial global best,
aGB, particle that has the potential to be a better ‘‘guide’’ than the swarm's native
gbest particle. When used with FGBF, MD PSO exhibits such an impressive speed
gain that their mutual performance surpasses bPSO by several orders of magnitude.
Experimental results over nonlinear function minimization show that, except in a few
minority cases, convergence to the global minimum at the target dimension is
achieved within fewer than 1,000 iterations on average, mostly within only a few
hundred or even fewer. Yet the major improvement occurs in the convergence
accuracy. MD PSO with FGBF finds the global minimum at the target dimension
for all runs over all functions without any exception. This is a substantial
achievement in the area of PSO-based nonlinear function minimization.
FGBF was then tested in another challenging domain, namely optimization in
dynamic environments. In order to make comparative evaluations with other
techniques in the literature, FGBF with multiswarms is then applied over a con-
ventional benchmark system, the Moving Peak Benchmark, MPB. The results over
MPB with common settings (i.e., Scenario 2) clearly indicate the superior per-
formance of FGBF with multiswarms over other PSO-based methods. To make the
benchmark more generic for real-world applications where the optimum dimen-
sion may be unknown too, MPB is extended to a multidimensional system in
which there is a certain amount of dependency among dimensions. Note that
without such dependency embedded, the benchmark would be just a bunch of
independent MPB problems, one per dimension.
5.5 Programming Remarks and Software Packages
As described in the related sections of the previous chapters, the test-bed
application, PSO_MDlib, is designed to implement MD PSO operations for the
purpose of multidimensional nonlinear function minimization and dynamic system
(MPB) optimization. The skipped operations are plugged into the template when
PSO_MDlib is configured for MPB optimization at compile time by simply
defining ‘‘MOVING_PEAKS_BENCHMARK’’. When defined, another
entry-point main() function (at the bottom) will be compiled instead, for MD PSO
with FGBF optimization over the MPB environment, which is implemented in the
source files movpeaks.cpp and movpeaks.h. The MD PSO with FGBF operation is
almost identical as in the nonlinear function minimization; except the fitness
function (MPB()) and the setting of the change period by calling pPSO-
Table 5.18 The environmental change signaling from the main MD PSO function
5.5 Programming Remarks and Software Packages
145
146 5 Improving Global Convergence
change period, the rest of the MD PSO with FGBF operation is identical to the nonlinear function minimization case.
Finally, the fitness function MPB() is given in Table 5.20. Note that the first if() statement checks for the environmental change signal that is sent from the native MD PSO function, Perform(). If the signal is sent (e.g., by assigning dim = -1), then the function change_peaks() changes the MPB environment and the function returns immediately without a fitness evaluation. Otherwise, the position stored
in pPos proposed by a particle will be evaluated by the function eval_movpeaks(),
which returns the difference of the current height from the global peak height. The
other function calls are for debug purposes.
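The control flow described above can be sketched as follows. This is a minimal, hypothetical stand-in: the single static peak, the fixed height shift, and the global-maximum bookkeeping are illustrative assumptions, not the actual movpeaks.cpp implementation.

```cpp
#include <cassert>

// Illustrative single-peak "environment" standing in for Branke's movpeaks
// routines; the real code maintains many moving peaks.
static double g_peak_height = 50.0;   // current height of the (only) peak
static double g_global_max  = 50.0;   // height of the global peak

void change_peaks() { g_peak_height -= 5.0; }   // environmental change

double eval_movpeaks(const double* pos) {       // landscape value at pos
    return g_peak_height - pos[0] * pos[0];     // peak sits at the origin
}

// Fitness callback in the style of PSO_MDlib's MPB():
// dim == -1 is the environmental change signal sent from Perform().
double MPB(const double* pPos, int dim) {
    if (dim == -1) {        // signal received: update the environment only,
        change_peaks();     // return without any fitness evaluation
        return 0.0;
    }
    // offset error: difference of the current height from the global peak height
    return g_global_max - eval_movpeaks(pPos);
}
```

In this toy landscape, after one change_peaks() call the error of the point at the origin grows from 0 to 5.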
Chapter 6
Dynamic Data Clustering
Data are a precious thing and will last longer than the systems
themselves.
Tim Berners-Lee
Clustering, in the most basic terms, is the collection of patterns, which are usually
represented as vectors or points in a multi-dimensional data space, into groups
(clusters) based on similarity or proximity. Such an organization is useful in
pattern analysis, classification, machine learning, information retrieval, spatial
segmentation, and many other application domains. Cluster validity analysis is the
assessment of the clustering method’s output using a specific criterion for opti-
mality, i.e., the so-called clustering validity index (CVI). Therefore, the optimality
of any clustering method can only be assessed with respect to the CVI, which is
defined over a specific data (feature) representation with a proper distance (sim-
ilarity) metric. What characterizes a clustering method further depends on its
scalability over the dimensions of the data and solution spaces, i.e., whether or not it can perform well enough on a large dataset (say, with a million patterns) and with a large number of clusters (e.g., >10). In the former case, the complexity of the method may raise an infeasibility problem, while the latter case shows the degree
of its immunity against the well-known phenomenon, "the curse of dimensionality". Even humans, who perform quite well in 2D and perhaps in 3D, have difficulty interpreting data in higher dimensions. Nevertheless, most real problems involve clustering in high dimensions, where the data distribution can hardly be modeled by some ideal structures such as hyperspheres.
Given a CVI, clustering is a multi-modal problem, especially in high dimensions, with many sub-optimal solutions such as over- and under-clustering. Therefore, well-known deterministic methods such as K-means, Max–Min [1, 2], FCM [2, 3], SOM [2], etc., are susceptible to getting trapped in the closest local optimum, since they are all greedy descent methods that start from a random point in the solution space and perform a localized search. This fact eventually turns
the focus on stochastic Evolutionary Algorithms (EAs) [4] such as Genetic Algo-
rithms (GAs) [5], Genetic Programming (GP) [6], Evolution Strategies (ES), [7],
and Evolutionary Programming (EP), [8], all of which are motivated by the natural
evolution process and thus make use of evolutionary operators. The common point
of all these methods is that EAs are population-based in nature and can perform a globalized search, so they may avoid becoming trapped in a local optimum and find the
optimum solution; however, this is never guaranteed. Many works in the literature have therefore applied EAs to the clustering problem.
S. Kiranyaz et al., Multidimensional Particle Swarm Optimization for Machine Learning and Pattern Recognition, Adaptation, Learning, and Optimization 15, DOI: 10.1007/978-3-642-37846-1_6, © Springer-Verlag Berlin Heidelberg 2014
6.1 Dynamic Data Clustering via MD PSO with FGBF
Based on the discussion in Sect. 3.4.2, it is obvious that the clustering problem requires the determination of the solution space dimension (i.e., the number of clusters, K) and an effective mechanism to avoid local optima traps (both dimensionally and spatially), particularly in complex clustering schemes in high dimensions (e.g., K > 10). The former requirement justifies the use of the MD
PSO technique while the latter calls for FGBF. At time t, the particle a in the swarm, $\xi = \{x_1, \ldots, x_a, \ldots, x_S\}$, has the positional component formed as $xx_a^{xd_a(t)}(t) = \{c_{a,1}, \ldots, c_{a,j}, \ldots, c_{a,xd_a(t)}\} \Rightarrow xx_{a,j}^{xd_a(t)}(t) = c_{a,j}$, meaning that it represents a potential solution (i.e., the cluster centroids) for $xd_a(t)$ clusters, with the jth component being the jth cluster centroid. Apart from the regular limits such as the (spatial) velocity $V_{max}$, the dimensional velocity $VD_{max}$, and the dimension range $D_{min} \le xd_a(t) \le D_{max}$, the N-dimensional data space is also limited to some practical spatial range, i.e., $X_{min} < xx_a^{xd_a(t)}(t) < X_{max}$. In case this range is
exceeded even for a single dimension j, $xx_{a,j}^{xd_a(t)}(t)$, then all positional components of the particle for the respective dimension $xd_a(t)$ are initialized randomly within the range (i.e., refer to step 1.3.1 in the MD PSO pseudo-code), and this further contributes to the overall diversity. The following validity index is used to obtain computational simplicity with minimal or no parameter dependency:

$$f\left(xx_a^{xd_a(t)}, Z\right) = Q_e\left(xx_a^{xd_a(t)}\right)\left(xd_a(t)\right)^{\alpha}, \quad
Q_e\left(xx_a^{xd_a(t)}\right) = \frac{1}{xd_a(t)} \sum_{j=1}^{xd_a(t)} \frac{\sum_{\forall z_p \in xx_{a,j}^{xd_a(t)}} \left\| xx_{a,j}^{xd_a(t)} - z_p \right\|}{\left\| xx_{a,j}^{xd_a(t)} \right\|} \qquad (6.1)$$

where $Q_e$ is the quantization error (or the average intra-cluster distance) serving as the Compactness term, and $(xd_a(t))^{\alpha}$ is the Separation term, which simply penalizes higher cluster numbers with an exponential $\alpha > 0$. Using $\alpha = 1$, the validity index takes its simplest form (i.e., only the numerator of $Q_e$) and becomes entirely parameter-free.
On the other hand, (hard) clustering has some constraints. Let $C_j = \{xx_{a,j}^{xd_a(t)}(t)\} = \{c_{a,j}\}$ be the set of data points assigned to a (potential) cluster centroid $xx_{a,j}^{xd_a(t)}(t)$ of a particle a at time t. The clusters $C_j,\ \forall j \in [1, xd_a(t)]$, should maintain the following constraints:

1. Each data point should be assigned to one cluster set, i.e., $\bigcup_{j=1}^{xd_a(t)} C_j = Z$.
2. Each cluster should contain at least one data point, i.e., $C_j \neq \{\emptyset\},\ \forall j \in [1, xd_a(t)]$.
3. Two clusters should have no common data points, i.e., $C_i \cap C_j = \{\emptyset\},\ i \neq j,\ \forall i, j \in [1, xd_a(t)]$.
In order to satisfy the 1st and 3rd hard clustering constraints, before computing the clustering fitness score via the validity index function in (6.1), all data points are first assigned to the closest centroid. Yet there is no guarantee for the fulfillment of the 2nd constraint, since $xx_a^{xd_a(t)}(t)$ is set (updated) by the internal dynamics of the MD PSO process, and hence any dimensional component (i.e., a potential cluster candidate), $xx_{a,j}^{xd_a(t)}(t)$, can be in an abundant position (i.e., one to which no data point is closest). To avoid this, a high penalty is set for the fitness score of the particle, i.e., $f(xx_a^{xd_a(t)}, Z) \to \infty$ if $\{xx_{a,j}^{xd_a(t)}\} = \{\emptyset\}$ for any j.
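In code, the nearest-centroid assignment, the empty-cluster penalty, and the validity index of Eq. (6.1) can be sketched as below. The 2D point type and the use of a finite max-double standing in for the "infinite" penalty are assumptions of this sketch, not the book's PSO_MDlib implementation.

```cpp
#include <cmath>
#include <limits>
#include <vector>

struct Point { double x, y; };

static double dist(const Point& a, const Point& b) {
    return std::hypot(a.x - b.x, a.y - b.y);
}

// Validity index of Eq. (6.1): f = Qe * K^alpha, where Qe is the quantization
// error (average intra-cluster distance). Returns a huge penalty when any
// cluster ends up empty, i.e., the 2nd hard clustering constraint is violated.
double clusterFitness(const std::vector<Point>& centroids,
                      const std::vector<Point>& data, double alpha = 1.0) {
    const std::size_t K = centroids.size();
    std::vector<double> sumDist(K, 0.0);
    std::vector<std::size_t> count(K, 0);
    // Constraints 1 and 3: each point goes to exactly one (closest) centroid.
    for (const Point& z : data) {
        std::size_t best = 0;
        for (std::size_t j = 1; j < K; ++j)
            if (dist(centroids[j], z) < dist(centroids[best], z)) best = j;
        sumDist[best] += dist(centroids[best], z);
        ++count[best];
    }
    double Qe = 0.0;
    for (std::size_t j = 0; j < K; ++j) {
        if (count[j] == 0)                  // empty cluster: hard penalty
            return std::numeric_limits<double>::max();
        Qe += sumDist[j] / count[j];        // mean intra-cluster distance
    }
    Qe /= static_cast<double>(K);
    return Qe * std::pow(static_cast<double>(K), alpha);  // penalize higher K
}
```

With alpha = 1 the 1/K factor cancels against the Separation term, giving the "numerator-only", parameter-free form mentioned above.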
The outlines given so far are sufficient for the standalone application of the MD PSO technique to a dynamic clustering application; however, the FGBF operation presents further difficulties, since for the aGB creation the selection of the best or the most promising dimensions (i.e., the cluster centroids) among all dimensions of the swarm particles is not straightforward. Recall that in step 2 of the
FGBF pseudo-code, the index array of such particles yielding the minimum $f(a, j)$ for the jth dimension can be found as $a[j] = \arg\min_{a \in [1, S]}(f(a, j)),\ \forall j \in [1, D_{max}]$. This was
straightforward for the nonlinear function minimization where each dimension of
the solution space is distinct and corresponds to an individual dimension of the
data space. However, in the clustering application, any (potential) cluster centroid of a particle, $xx_{a,j}^{xd_a(t)}(t)$, is updated independently and can be any arbitrary point in the N-dimensional data space. Furthermore, the data points assigned to the jth dimension of a particle a, $\forall z_p \in xx_{a,j}^{xd_a(t)}(t)$, also depend on the distribution of the other dimensions (centroids), i.e., the "closest" data points are assigned to the jth centroid only because the other centroids happen to be at a farther location.
Inserting this particular dimension (centroid) into another particle (say aGB, in
case selected), might create an entirely different assignment (or cluster) including
the possibility of having no data points assigned to it and thus violating the 2nd
clustering constraint. To avoid this problem, a new approach is adopted for step 2 to obtain a[j]: at each iteration, a subset among all dimensions of the swarm particles is first formed by verifying the following: a dimension of any particle is selected into this subset if and only if there is at least one data point that is closest to it.
Henceforth, the creation of the aGB particle within this verified subset ensures that
the 2nd clustering constraint will (always) be satisfied. Figure 6.1 illustrates the
formation of the subset on a sample data distribution with 4 clusters. Note that in
the figure, all dimensions of the entire swarm particles are shown as ‘+’ but the red
ones belonging to the subset have at least one (or more) data points closest
whereas the blue ones have none and hence they are discarded.
Once the subset centroids are selected, the objective is to compose a[j] with the most promising $D_{max}$ centroids selected from the subset. To this end, a minimum spanning tree (MST) is formed over the subset centroids and its longest branches are broken, so that each resulting group contributes one centroid.
Fig. 6.1 The formation of the centroid subset in a sample clustering example. The black dots
represent data points over 2D space and each colored ‘+’ represents one centroid (dimension) of a
swarm particle
In Fig. 6.1, a sample MST is formed using 14 subset centroids as the nodes and
13 branches are shown as the red lines connecting the closest nodes (in a minimum
span). Breaking the 3 longest branches (shown as the dashed lines) thus reveals the
4 groups (G1, …, G4), from each of which the centroid yielding the minimum $f(a, j)$ can then be selected as an individual dimension of the aGB particle with 4 dimensional components (i.e., $d = K = 4$; $xx_{aGB,j}^{K}(t),\ \forall j \in [1, K]$).
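This grouping step can be sketched as follows: build an MST over the subset centroids with Prim's algorithm, break the K − 1 longest branches, and label the resulting components. The 2D point type and the component-labeling scheme are assumptions of this illustration, not the book's implementation; picking the f(a, j)-minimal centroid per group would follow it.

```cpp
#include <algorithm>
#include <cmath>
#include <functional>
#include <limits>
#include <utility>
#include <vector>

struct Pt { double x, y; };

static double ptDist(const Pt& a, const Pt& b) {
    return std::hypot(a.x - b.x, a.y - b.y);
}

// Builds an MST over the subset centroids (Prim's algorithm), breaks the
// K-1 longest branches, and returns a group label per centroid (K groups).
std::vector<int> mstGroups(const std::vector<Pt>& c, int K) {
    const int n = static_cast<int>(c.size());
    std::vector<int> parent(n, -1);                 // MST parent links
    std::vector<double> best(n, std::numeric_limits<double>::max());
    std::vector<bool> inTree(n, false);
    best[0] = 0.0;
    for (int it = 0; it < n; ++it) {
        int u = -1;                                 // cheapest node not yet in tree
        for (int v = 0; v < n; ++v)
            if (!inTree[v] && (u < 0 || best[v] < best[u])) u = v;
        inTree[u] = true;
        for (int v = 0; v < n; ++v)
            if (!inTree[v] && ptDist(c[u], c[v]) < best[v]) {
                best[v] = ptDist(c[u], c[v]);
                parent[v] = u;
            }
    }
    // Each non-root node v owns the MST edge (parent[v], v) of length best[v];
    // cut the K-1 longest of these edges.
    std::vector<std::pair<double, int>> edges;
    for (int v = 1; v < n; ++v) edges.push_back({best[v], v});
    std::sort(edges.begin(), edges.end(), std::greater<>());
    std::vector<bool> cut(n, false);
    for (int i = 0; i < K - 1 && i < static_cast<int>(edges.size()); ++i)
        cut[edges[i].second] = true;
    // Label components: walk uncut parent links up to each component's root.
    std::vector<int> label(n, -1);
    int next = 0;
    for (int v = 0; v < n; ++v) {
        int r = v;
        while (parent[r] != -1 && !cut[r]) r = parent[r];
        if (label[r] == -1) label[r] = next++;
        label[v] = label[r];
    }
    return label;
}
```

The O(n²) Prim loop mirrors the cost discussion later in this section: the grouping is quadratic in the subset size.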
In order to test the clustering performance of the standalone MD PSO, we used the
same 15 synthetic data spaces as shown in Fig. 3.6, and to make the evaluation
independent from the choice of the parameters, we simply used Qe in Eq. (6.1) as
the CVI function. Note that this is just a naïve selection and any other suitable CVI
function can also be selected since it is a black box implementation for MD PSO.
We also use the same PSO parameters and settings given in Sect. 3.4.2, and recall
that the clustering performance degraded significantly for complex datasets or datasets with a large number of clusters (e.g., >10).
As stated earlier, MD PSO with FGBF, besides its speed improvement, makes its primary contribution to the accuracy of the clustering, i.e., converging to the true number of clusters, K, and correctly localizing the centroids. As the typical results in Fig. 6.2 show, MD PSO with FGBF meets the expectations on clustering accuracy, but occasionally results in a slightly higher number of clusters
Fig. 6.2 Typical clustering results via MD PSO with FGBF over C9 (42 clusters), C10 (49 clusters*), and C11 (62 clusters*); over-clustered samples are indicated with *
(over-clustering). This is due to the use of a simple but quite impure validity index
in (6.1) as the fitness function and for some complex clustering schemes it may,
therefore, yield its minimum score at a slightly higher number of clusters. A
sample clustering operation validating this fact is shown in Fig. 6.3. Note that the
true number of clusters is 10, which is eventually reached at the beginning of the operation, yet the minimum score achieved with K = 10 (≈750) remains higher than the one with K = 11 (≈610) and even the final outcome, K = 12 (≈570).
The main reason for this is that the validity index in (6.1) over long (and loose)
clusters such as ‘C’ and ‘S’ in the figure, yields a much higher fitness score with
one centroid than two or perhaps more and therefore, over all data spaces with
such long and loose clusters (e.g., C4, C8, C10, and C11), the proposed method
yields a slight over-clustering but never under-clustering. Improving the validity
index or adapting a more sophisticated one such as Dunn’s index [16] or many
others, might improve the clustering accuracy.
An important observation worth mentioning is that clustering complexity (more
specifically modality) affects the proposed methods’ mutual performance much
more than the total cluster number (dimension). For instance, MD PSO with FGBF
clustering (with S = 640) over data space C9 can immediately determine the true
cluster number and the accurate location of the centroids with a slight offset (see
Fig. 6.4), whereas this takes around 1,900 iterations for C8. Figure 6.5 shows time
instances where aGB (with index number 640) becomes the GB particle. It
Fig. 6.3 Fitness score (top) and dimension (bottom) plots vs. iteration number for a MD PSO
with FGBF clustering operation over C4. 3 clustering snapshots at iterations 105, 1,050, and
1,850, are presented below
immediately (at the 1st iteration) provides a ‘‘near optimum’’ GB solution with 43
clusters and then the MD PSO process (at the 38th iteration) eventually finds the
global optimum with 42 clusters (i.e., see the 1st snapshot in Fig. 6.4). Afterward
the ongoing MD PSO process corrects the slight positional offset of the cluster
centroids (e.g., compare 1st and 2nd snapshots in Fig. 6.4). So when the clusters
are compact, uniformly distributed, and have similar shape, density, and size, thus
yielding the simplest form, it becomes quite straightforward for FGBF to select the
‘most promising’ dimensions with a greater accuracy. As the complexity
(modality) increases, different centroid assignments and clustering combinations
have to be assessed to converge toward the global optimum, which eventually
becomes a slow and tedious process.
Recall from the earlier discussion (Sect. 5.1.4) on the application of the proposed methods to nonlinear function minimization (both standalone MD PSO and MD PSO with FGBF) that a certain speed improvement occurs, in terms of a reduction in the iteration number, and a better fitness score is achieved when using a
larger swarm. However, the computational complexity (per iteration) also
increases since the number of evaluations (fitness computations) is proportional to
the number of particles. The same trade-off also exists for clustering application
and a significantly higher computational complexity of the mutual application of
the proposed methods can occur due to the spatial MST grouping for the selection
of the well-separated centroids. As explained in the previous section, MST is the
essence of choosing the ‘‘most promising’’ dimensions (centroids) so as to form
Fig. 6.4 Fitness score (top) and dimension (bottom) plots vs. iteration number for a MD PSO
with FGBF clustering operation over C9. 3 clustering snapshots at iterations 40, 950, and 1,999,
are presented below
Fig. 6.5 Particle index plot (particle no. vs. iteration no.) for the MD PSO with FGBF clustering operation shown in Fig. 6.4; the marked points (GB = aGB) indicate iterations where the aGB particle (index 640) becomes the GB particle
the best possible aGB particle. However, it is a costly operation, $O(N_{SS}^2)$, where $N_{SS}$ is the subset size, which is formed by those dimensions (potential centroids)
having at least one data item closest to it. Therefore, NSS tends to increase if a
larger swarm size is used and/or MD PSO with FGBF clustering is performed over
large and highly complex data spaces.
Table 6.1 presents average processing times per iteration over all sample data
spaces and using 4 different swarm sizes. All experiments are performed on a
computer with P-IV 3 GHz CPU and 1 GB RAM. Note that the processing times
tend to increase in general when data spaces get larger but the real factor is the
complexity. The processing for a highly complex data structure, such as C10, may
require several times more computations than a simpler but comparable-size data
Table 6.1 Processing time (in ms) per iteration for MD PSO with FGBF clustering using 4 different swarm sizes. The number of data items is presented in parentheses with each sample data space
S C1 (238) C2 (408) C3 (1,441) C4 (1,268) C5 (3,241) C6 (1,314) C7 (3,071) C8 (5,907) C9 (2,192) C10 (3,257) C11 (12,486)
80 19.7 57 140 688 864.1 231.5 690.8 1,734.5 847.7 4,418.2 8,405.1
160 31.7 104.9 357.3 1,641 1,351.3 465.5 2,716.3 3,842.8 1,699.4 13,693.7 26,608.6
320 62.3 222.9 1,748.8 4,542.5 3,463.9 1,007 3,845.7 7,372.7 4,444.5 55,280.6 62,641.9
640 153.4 512 3,389.4 17,046.4 8,210.1 4,004.5 11,398.6 23,669.8 14,828.2 159,642.3 212,884.6
space, such as C5. Therefore, on such highly complex data spaces, the swarm size should be kept low, e.g., 80 ≤ S ≤ 160, for the sake of a reasonable processing time.
6.2 Dominant Color Extraction

6.2.1 Motivation
This is used to determine which clusters to split until either a maximum number of clusters (DCs), $N_{DC}^{max}$, is achieved or a maximum allowed distortion criterion, $\varepsilon_D$, is reached. Hence, pixels with smaller weights (detailed regions) are assigned
fewer clusters so that the number of color clusters in the detailed regions where the
likelihood of outliers’ presence is high, is therefore suppressed. As the final step,
an agglomerative clustering (AC) is performed on the cluster centroids to further
merge similar color clusters so that there is only one cluster (DC) hosting all
similar color components in the image. A similarity threshold Ts is assigned to the
maximum color distance possible between two similar colors in a certain color
domain (CIE-Luv, CIE_Lab, etc.). Another merging criterion is the color area, that
is, any cluster should have a minimum amount of coverage area, TA, so as to be
assigned as a DC; otherwise, it will be merged with the closest color cluster since it
is just an outlier. Another important issue is the choice of the color space since a
proper color clustering scheme for DC extraction tightly relies on the metric.
Therefore, a perceptually uniform color space should be used and the most
common ones are CIE-Luv and CIE-Lab, which are designed in such a way that
color distances perceived by HVS are also equal in L2 (Euclidean) distance in these
spaces. For CIE-Luv, a typical value for $T_S$ is between 10 and 25, $T_A$ is between 1 and 5 %, and $\varepsilon_D < 0.05$ [23].
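The two merging criteria can be sketched as follows; the cluster struct, the plain Euclidean distance in CIE-Luv, the area-weighted merge, and the two-phase ordering are simplifying assumptions of this illustration, not the book's implementation.

```cpp
#include <cmath>
#include <vector>

struct ColorCluster {
    double L, u, v;      // centroid in CIE-Luv
    double area;         // fractional coverage area in [0, 1]
};

static double colorDist(const ColorCluster& a, const ColorCluster& b) {
    return std::sqrt((a.L - b.L) * (a.L - b.L) +
                     (a.u - b.u) * (a.u - b.u) +
                     (a.v - b.v) * (a.v - b.v));
}

// Merge b into a, weighting the centroid by coverage area.
static void absorb(ColorCluster& a, const ColorCluster& b) {
    double w = a.area + b.area;
    a.L = (a.L * a.area + b.L * b.area) / w;
    a.u = (a.u * a.area + b.u * b.area) / w;
    a.v = (a.v * a.area + b.v * b.area) / w;
    a.area = w;
}

// Agglomerative post-processing: merge clusters closer than Ts, then absorb
// clusters whose area is below TA (outliers) into their closest neighbor.
std::vector<ColorCluster> mergeDCs(std::vector<ColorCluster> cs,
                                   double Ts, double TA) {
    // Phase 1: merge perceptually similar clusters (distance below Ts).
    for (bool merged = true; merged && cs.size() > 1; ) {
        merged = false;
        for (std::size_t i = 0; i < cs.size() && !merged; ++i)
            for (std::size_t j = i + 1; j < cs.size() && !merged; ++j)
                if (colorDist(cs[i], cs[j]) < Ts) {
                    absorb(cs[i], cs[j]);
                    cs.erase(cs.begin() + j);
                    merged = true;
                }
    }
    // Phase 2: absorb small-area outliers into the closest remaining cluster.
    for (bool merged = true; merged && cs.size() > 1; ) {
        merged = false;
        for (std::size_t i = 0; i < cs.size(); ++i)
            if (cs[i].area < TA) {
                std::size_t k = (i == 0) ? 1 : 0;
                for (std::size_t j = 0; j < cs.size(); ++j)
                    if (j != i && colorDist(cs[i], cs[j]) < colorDist(cs[i], cs[k]))
                        k = j;
                absorb(cs[k], cs[i]);
                cs.erase(cs.begin() + i);
                merged = true;
                break;
            }
    }
    return cs;
}
```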
Particularly for dominant color extraction, the optimal (true) number of DCs in
an image is also unknown and should thus be determined within the clustering
process, in an optimized way and without critical parameter dependency. MPEG-7 DCD, as a modified K-means algorithm, does not address these requirements at all.
Therefore, the reason for the use of MD PSO with FGBF clustering is obvious.
Furthermore, humans tend to think and describe color the way they perceive it.
Therefore, in order to achieve a color (dis-) similarity metric taking HVS into
account, HSV (or HSL), which is a perceptual color space and provides means of
modeling color in a way HVS does, is used in the presented technique for
extracting dominant colors. Note that in a typical image with 24-bit RGB repre-
sentation, there can be several thousands of distinct colors, most of which cannot
be perceived by HVS. Therefore, to reduce the computational complexity of RGB
to HSV color transformation and particularly to speed up the dynamic clustering
process via MD PSO and FGBF, a pre-processing step, which creates a limited
color palette in RGB color domain, is first performed. In this way such a massive,
yet unperceivable, amount of colors in the RGB domain can be reduced to a reasonable number, e.g., 256 < n < 512. To this end, we used the Median Cut method [24]
because it is fast (i.e., O(n)) and for such a value of n, it yields an image which can
hardly be (color-wise) distinguished from the original. Only the RGB color
components in the color palette are then transformed into HSV (or HSL) color
space over which the dynamic clustering technique is applied to extract the
dominant colors, as explained next.
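The per-palette-entry color transformation uses the standard RGB-to-HSV formula; the struct layout and the [0, 1] component normalization are assumptions of this sketch.

```cpp
#include <algorithm>
#include <cmath>

struct HSV { double h, s, v; };   // h in degrees [0, 360), s and v in [0, 1]

// Standard RGB-to-HSV conversion, applied to each palette entry
// (r, g, b assumed normalized to [0, 1]) after Median-Cut reduction.
HSV rgb2hsv(double r, double g, double b) {
    double mx = std::max({r, g, b});
    double mn = std::min({r, g, b});
    double d = mx - mn;
    HSV out{0.0, 0.0, mx};                 // value = max component
    if (mx > 0.0) out.s = d / mx;          // saturation = chroma / value
    if (d > 0.0) {                          // hue from the dominant channel
        if (mx == r)      out.h = 60.0 * std::fmod((g - b) / d, 6.0);
        else if (mx == g) out.h = 60.0 * ((b - r) / d + 2.0);
        else              out.h = 60.0 * ((r - g) / d + 4.0);
        if (out.h < 0.0) out.h += 360.0;
    }
    return out;
}
```

For example, pure red maps to h = 0, green to h = 120, and blue to h = 240, while any gray has zero saturation and an undefined (here zero) hue.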
Fig. 6.6 Fuzzy model for distance computation in HSV and HSL color domains (best viewed in color)
500, respectively. Their effects on the DC extraction are then examined. The dimension (search) range for DC extraction is set as $D_{min} = 2$, $D_{max} = 25$. This setting is in harmony with the maximum number of DCs set by the MPEG-7 DCD, i.e., $N_{DC}^{max} = 25$. Finally, the size of the initial color palette created by the Median Cut method is set to 256.
Fig. 6.7 Number of DCs (vs. image number) from three MPEG-7 DCDs with different parameter sets (Ts = 15, TA = 1 %; Ts = 25, TA = 1 %; Ts = 25, TA = 5 %) over the sample database
Fig. 6.8 The DC extraction results over 5 images (a)–(e) from the sample database (best viewed in color)
According to the results, one straightforward conclusion is that not only does the number of DCs vary significantly, but the DC centroids also change drastically depending on the parameter values used. On the other hand, it is obvious that the best
DC extraction performance is achieved by the presented technique, where none of
the prominent colors are missed or mislocated while the ‘‘true’’ number of DCs is
extracted. However, we do not in any way claim that the presented technique
achieves the minimum quantization error (or the mean square error, MSE), due to
two reasons. First, the optimization technique is applied over a regularization
(fitness) function where the quantization error minimization (i.e., minimum
Compactness) is only one part of it. The other part, implying maximum Separation,
presents a constraint so that minimum MSE has to be achieved using the least number
of clusters (DCs). The second and the main reason is that the computation of MSE is
typically performed in RGB color space, using the Euclidean metric. Recall that the
presented DC extraction is performed over HSV (or HSL) color domain, which is
discontinuous and requires nonlinear transformations, and using a fuzzy distance
metric with respect to the HVS perceptual rules for enhancing the discrimination
power. Therefore, the optimization in this domain using such a fuzzy metric obvi-
ously cannot ensure a minimum MSE in RGB domain. Besides that, several studies
show that MSE is not an appropriate metric for visual (or perceptual) quality (e.g.,
[26]) and thus we hereby avoid using it as a performance measure.
Fig. 6.9 The DC extraction results over 5 images (a)–(e) from the sample database (best viewed in color)
Due to its stochastic nature, there is a concern about the robustness (defined here as repeatability) of the results. In this section, we perform several experiments to examine whether or not the results are consistent with regard to the accuracy of the DC
Fig. 6.10 DC number histograms of 2 sample images (Image 1, Image 2) using 3 parameter sets (P1, P2, P3). Some typical back-projected images with their DC numbers pointed are shown within the histogram plots (best viewed in color)
to HSV color transformation. Recall that the Median Cut is a fast method (i.e.,
O(n)), which has the same computational complexity as K-means. The following
color transformation has an insignificant processing time since it is only applied to
a reduced number of colors. As the dynamic clustering technique based on MD
PSO with FGBF is stochastic in nature, a precise computational complexity
analysis is not feasible; however, there are certain attributes, which proportionally
affect the complexity such as swarm size (S), the total number of iteration (IterNo)
and the dimension of the data space, (n). Moreover, the complexity of the validity
index used has a direct impact over the total computational cost since for each
particle (and at each iteration) it is used to compute the fitness of that particle. This is the main reason for using such a simple (and parameter-independent) validity index as in Eq. (6.1). Moreover, the presented fuzzy color model makes the computational cost primarily dependent on the color structure of the image, because the normalized Euclidean distance given in Eq. (6.5) and used within the validity index function is obviously quite costly; recall, however, that it may not be used at all for color pairs that do not show any perceptual color similarity.
This further contributes to the infeasibility of performing an accurate computa-
tional complexity analysis for the presented technique. For instance, it takes on the
average of 4.3 and 17.2 s to extract the DCs for the images 1 and 2 shown in
Fig. 6.10, respectively. Nevertheless, as any other EA the DC extraction based on
MD PSO with FGBF is slow in nature and may require indefinite amount of
iterations to converge to the global solution.
In this section, the dynamic clustering technique based on MD PSO with FGBF is applied to extract the "true" number of dominant colors in an image. To improve the discrimination among different colors, a fuzzy model over the HSV (or HSL) color space is presented so as to obtain a distance metric that reflects HVS perception of color (dis-)similarity. The DC extraction experiments using the MPEG-7 DCD have shown that the method, although part of the MPEG-7 standard, is highly dependent on its parameters. Moreover, since it is entirely based on the K-means clustering method, it can create artificial colors and/or miss some important DCs due to convergence to local optima, yielding critical over- and under-clustering. Consequently, a mixture of different colors, and hence artificial DCs or DCs with shifted centroids, may eventually occur. This can also cause severe degradation over color textures, since the regular textural pattern cannot be preserved if the true DC centroids are missed or shifted. Using a simple CVI, we have successfully addressed these problems, and a superior DC extraction is achieved with ground-truth DCs. The optimum number of DCs can vary slightly on some images, but the number of DCs on such images is hardly definitive, rather subjective, and in such cases dynamic clustering based on a stochastic optimization technique can converge to near-optimal solutions. The technique presented in this section shows a high level of robustness to its parameters; hence, instead of struggling to fine-tune several parameters to improve performance, which is not straightforward, if possible at all, the focus can now be drawn to designing better validity index functions, or improving existing ones, for higher DC extraction performance in terms of perceptual quality.
6.3 Dynamic Data Clustering via SA-Driven MD PSO
The theory behind SA-driven PSO and its multi-dimensional extension, SA-driven MD PSO, is explained in the previous chapter. Recall that, unlike the FGBF method, the main advantage of this global convergence technique is its generic nature: it is applicable to any problem without any need for adaptation or tuning. Therefore, the clustering application of SA-driven (MD) PSO requires no changes for either of the two Simultaneous Perturbation Stochastic Approximation (SPSA) based approaches, and the (MD) PSO particles are encoded in the same way as explained in Sect. 6.1. In this section we shall focus on the application of SA-driven MD PSO in dynamic clustering.
it is not feasible to set a unique εC value for all clustering schemes. The positional range can now be set simply as the natural boundaries of the 2D data space. For MD PSO, we used a swarm size of S = 200, and for both SA-driven approaches a reduced number is used in order to ensure the same number of evaluations among all competing techniques. w is linearly decreased from 0.75 to 0.2, and we again used the recommended values for A, α, and γ, namely 60, 0.602, and 0.101, whereas a and c are set to 0.4 and 10, respectively. For each dataset, 20 clustering runs are performed, and the first- and second-order statistics (mean μ and standard deviation σ) of the fitness scores and of the dbest values converged to are presented in Table 6.2.
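The SPSA gain sequences implied by these parameter values can be sketched as follows. This assumes Spall's standard formulation, a_k = a/(A + k + 1)^α and c_k = c/(k + 1)^γ, which matches the roles of A, α, γ, a, and c quoted above; it is an illustrative sketch, not the book's code.

```cpp
#include <cassert>
#include <cmath>

// SPSA gain sequences in Spall's standard form; the parameter values are
// the ones quoted in the text: A = 60, alpha = 0.602, gamma = 0.101,
// a = 0.4, c = 10. Assumed formulation, for illustration only.
double spsaStepGain(int k, double a = 0.4, double A = 60.0,
                    double alpha = 0.602) {
    return a / std::pow(A + k + 1.0, alpha);   // a_k: step-size gain
}

double spsaPerturbGain(int k, double c = 10.0, double gamma = 0.101) {
    return c / std::pow(k + 1.0, gamma);       // c_k: perturbation magnitude
}
```

With these exponents the gains decay slowly, the usual compromise between early exploration and late convergence of the SPSA estimate.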
According to the statistics in Table 6.2, similar comments can be made as for the PSO application to nonlinear function minimization: either SA-driven approach achieves superior performance over all data spaces, regardless of the number of clusters and cluster complexity (modality), without exception. The superiority is visible in the average fitness scores achieved as well as in the proximity of the average dbest statistics to the optimal dimension. Note that d in the table is the optimal dimension, which may differ from the true number of clusters due to the validity index function used.
Some further important conclusions can be drawn from the statistical results in Table 6.2. First of all, the performance gap tends to widen as the cluster number (the dimension of the solution space) rises. For instance, all methods have fitness scores in close vicinity for the data space C1, while both SA-driven MD PSO approaches perform significantly better for C7. Note, however, that the performance gap for C8 is not as high as for C7, indicating that the SPSA parameters are not appropriate for C8 (a consequence of the fixed SPSA parameter setting). On the other hand, in some particular clustering runs, the difference in the average fitness scores in Table 6.2 does not correspond to the actual improvement in clustering quality. Take for instance the two clustering runs over C1 and C2 in Fig. 6.12, where some clustering instances with the corresponding fitness scores are shown. The first (left-most) instances in both rows come from severely erroneous clustering operations, although only a small difference in fitness scores separates them from the instances in the second column, which have significantly fewer clustering errors. On the other hand, the proximity of the average dbest statistics to the optimal dimension may be another alternative for evaluating clustering performance; however, two runs, one with severe under- and another with severe over-clustering, may still have an average dbest quite close to the optimal dimension. Therefore, the standard deviation should play an important role in the evaluation, and in this respect one can see from the statistical results in Table 6.2 that the second SA-driven MD PSO approach (A2) in particular achieves the best performance (i.e., converging to the true number of clusters and correct localization of the centroids), while the performance of standalone MD PSO is the poorest.
For visual evaluation, Fig. 6.13 presents the worst and the best clustering
results of the two competing techniques, standalone versus SA-driven MD PSO,
based on the highest (worst) and lowest (best) fitness scores achieved among the
20 runs. The clustering results of the best-performing SA-driven MD PSO approach, as highlighted in Table 6.2, are shown, excluding C1 since the results
Table 6.2 Statistical results from 20 runs over 8 2D data spaces. Entries are mean/standard deviation (μ/σ) of the fitness score and of dbest

                      MD PSO                          SA-driven (A2)                  SA-driven (A1)
     No.  d     Score (μ/σ)      dbest (μ/σ)    Score (μ/σ)      dbest (μ/σ)   Score (μ/σ)      dbest (μ/σ)
C1    6   6     1,456.5/108.07   6.4/0.78       1,455.2/103.43   6.3/0.67      1,473.8/109      6.2/1.15
C2   10  12     1,243.2/72.12    10.95/2.28     1,158.3/44.13    12.65/2.08    1,170.8/64.88    11.65/1.56
C3   10  11     3,833.7/215.48   10.4/3.23      3,799.7/163.5    11.3/2.57     3,884.8/194.03   11.55/2.66
C4   13  14     1,894.5/321.3    20.2/3.55      1,649.8/243.38   19.75/2.88    1,676.2/295.8    19.6/2.32
C5   16  17     5,756/1,439.8    19/7.96        5,120.4/1,076.3  22.85/4.17    4,118.3/330.31   21.8/2.87
C6   19  28     21,533/4,220.8   19.95/10.16    18,323/1,687.6   26.45/2.41    20,016/3,382     22.3/6.97
C7   22  22     3,243/1,133.3    21.95/2.8      2,748.2/871.1    23/2.51       2,380.5/1,059.2  22.55/2.8
C8   22  25     6,508.85/1,014   17.25/10.44    6,045.1/412.78   26.45/3.01    5,870.25/788.6   23.5/5.55
Fig. 6.12 Some clustering runs with the corresponding fitness scores (f)
of all techniques are quite close for this data space due to its simplicity. Note first
of all that the results of the (standalone) MD PSO deteriorate severely as the
complexity and/or the number of clusters increases. Particularly in the worst
results, the critical errors such as under-clustering often occur with dislocated
cluster centroids.
For instance, 4 out of 20 runs for C6 result in severe under-clustering with 3 clusters, similar to the one shown in the figure, whereas this rises to 10 out of 20 runs for C8. Although the clusters are the simplest in shape and density for C7, due to the high solution space dimension (number of clusters = 22), even the best MD PSO run is not immune to under-clustering errors. In some of the worst SA-driven MD PSO runs too, a few under-clusterings do occur; however, these are minority cases and definitely not as severe as in the MD PSO runs. It is
quite evident from the worst and the best results in the figure that SA-driven MD
PSO achieves a significantly superior clustering quality and usually converges to a
close vicinity of the global optimum solution.
Fig. 6.13 The worst and the best clustering results for data spaces C2–C8 using standalone (left) and SA-driven (right) MD PSO
6.4 Programming Remarks and Software Packages

As described in Sect. 4.4.2, the major MD PSO test-bed application is PSOTestApp, in which several MD PSO applications, mostly based on data (or feature) clustering, are implemented. In this section, we shall describe the programming details of three MD PSO-based dynamic clustering applications: (1) 2D data clustering with FGBF, (2) 3D dynamic color quantization, and (3) 2D data clustering with SA-driven MD PSO. These operations are plugged into the template <class T, class X> bool CPSO_MD<T,X>::Perform() function. The first clustering application, performed in the CPSOcluster class, is explained in the following section. We shall then explain the implementation details of the second application, performed in the CPSOcolorQ class. Finally, the third application, which is also performed in the CPSOcluster class, will be explained. Note that the interface of the CPSOcluster class was described in Sect. 4.4.2; therefore, the focus is mainly drawn on the plug-in functions for the two MD PSO modes, FGBF and SA-driven.
In this section, we shall focus on the FGBF function, FGBF_CLFn(), for any (2D, 3D, or N-D) clustering operation. The other function, FGBF_FSFn(), for feature synthesis will be covered in Chap. 10.
Recall from the FGBF application to clustering that, in order to achieve well-separated clusters and to avoid the selection of more than one centroid representing the same cluster, spatially close centroids are first grouped using an MST, and then a certain number of centroid groups, say d ∈ [Dmin, Dmax], can be obtained simply by breaking the (d - 1) longest MST branches. From each group, the one centroid that provides the highest Compactness score [i.e., the minimum dimensional fitness score, f(a, j), as given in Eq. (6.2)] is then selected and inserted into a[j] as the jth dimensional component. Table 6.4 presents the actual code accomplishing the first two steps: (1) by calling m_fpFindGBDim(m_pPA, m_noP), a subset of all potential centroids (dimensions of swarm particles) is first formed by verifying the following: a dimension of any particle is selected into this subset if and only if there is at least one data point that is closest to it; and (2) formation of the MST object by grouping the spatially close centroids (dimensions).
Table 6.5 presents how the first step is performed. The first loop resets the Boolean array present in each swarm particle, which actually holds the a[j] array. The code then determines which particle has a dimensional component (a potential centroid) in its current dimension that is closest to one of the data points (a white pixel in the CPixel2D *pPix structure). The closest centroid is then selected as the candidate dimension (one of the red '+' marks in Fig. 6.1), all of which are then used to form the MST within the FGBF_CLFn() function. Note that in the for-loop over all swarm particles, the dimensions selected within the fpFindGBDim() function of each particle are appended into the MST object by pMSTp->AppendItem(&pCC[c]). In this way the MST is formed from the selected dimensions (centroids), as illustrated in Fig. 6.1.
Once the MST is formed, its longest branches are iteratively broken to form the groups of centroids. Table 6.6 presents the code breaking the longest MST branches within a loop running from 1 to Dmax. Recall that the best centroid candidate selected from each group will be used to form the corresponding dimensional component of the aGB particle. Each (broken) group is saved within the MST array, pMSTarr[], and in the second for-loop at the bottom there is a search for the longest branch among all MSTs stored in this array. Once found, the MST with the (next) longest branch is broken into two siblings, each of which is added to the array.
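The grouping step can be sketched independently of the book's MST class: given the MST edge list, removing the (d - 1) longest edges leaves d connected components, which a small union-find can label. This is illustrative code; the actual implementation iterates via pMSTarr[] as described above.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

struct Edge { int u, v; double len; };

static int findRoot(std::vector<int>& parent, int i) {
    while (parent[i] != i) { parent[i] = parent[parent[i]]; i = parent[i]; }
    return i;
}

// Returns a group label per node after breaking the (d - 1) longest edges
// of the MST, yielding d connected components (centroid groups).
std::vector<int> breakMST(int nNodes, std::vector<Edge> mst, int d) {
    std::sort(mst.begin(), mst.end(),
              [](const Edge& a, const Edge& b) { return a.len < b.len; });
    mst.resize(mst.size() - (d - 1));          // drop the d - 1 longest edges
    std::vector<int> parent(nNodes);
    std::iota(parent.begin(), parent.end(), 0);
    for (const Edge& e : mst)                  // union the remaining edges
        parent[findRoot(parent, e.u)] = findRoot(parent, e.v);
    std::vector<int> label(nNodes);
    for (int i = 0; i < nNodes; ++i) label[i] = findRoot(parent, i);
    return label;
}
```

Breaking the longest edges first is what guarantees that the resulting groups are the most spatially separated ones.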
Finally, as presented in Table 6.7, the aGB particle (xyb[]) is formed by choosing the best dimensional components (the potential centroids with the minimum f(a, j)), where the individual dimensional scores are computed within the CVI function for all particles and stored in the m_bScore member of the CPixel2D class. Note that the if() statement checks whether the aGB formation is within the dimensional range, d ∈ [Dmin, Dmax]. If so, then within a for-loop over all MST groups, the best dimensional component is found and assigned to the corresponding dimension of the aGB particle. Once the aGB particle is formed (for that dimension), recall that it has to compete with the best of the previous aGB and gbest particles; only if it surpasses them does it become the GB particle of the swarm for that dimension. This is repeated for each dimension in the range d ∈ [Dmin, Dmax], and the GB particle is formed according to the competition result.
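The formation-and-competition logic can be condensed into a sketch: from each centroid group the component with the minimum dimensional fitness is chosen to form aGB, and aGB then replaces the stored GB of that dimension only with a strictly better (lower) score. This is a schematic rendering, not the Table 6.7 code.

```cpp
#include <algorithm>
#include <cassert>
#include <limits>
#include <vector>

// Schematic rendering of the aGB logic: each inner vector holds the
// dimensional fitness f(a, j) of the candidate centroids in one MST group;
// the best (minimum) component per group is selected and the scores summed.
double formAGBScore(const std::vector<std::vector<double>>& groupScores) {
    double total = 0.0;
    for (const std::vector<double>& g : groupScores) {
        double best = std::numeric_limits<double>::max();
        for (double s : g) best = std::min(best, s);  // best component per group
        total += best;
    }
    return total;
}

// The surviving GB score for one dimension after the competition:
// the new aGB wins only if it is strictly better (lower).
double competeWithGB(double gbScore, double aGBScore) {
    return (aGBScore < gbScore) ? aGBScore : gbScore;
}
```

Running this per dimension in [Dmin, Dmax] mirrors the loop structure described above, with the positional bookkeeping abstracted away.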
Table 6.8 shows the implementation of the CVI function given in Eq. (6.1). Recall that the entire 2D dataset (white pixels) is stored in the linked list s_pPixelQ. In the first for-loop, each pixel is assigned to the closest potential cluster centroid (stored in the current position of the particle in dimension _nDim) and stored in an array of linked lists, pClusterQ[cm]. In this way we know the group (cluster) of pixels represented by the cm-th cluster centroid. The second for-loop then evaluates each cluster centroid by computing the quantization error, Qe, if and only if it has at least one data point assigned to it. Otherwise, the entire set of potential centroids proposed by the particle in that dimension is discarded due to the violation of the second clustering constraint, by assigning its fitness a very large value (i.e., 1e+9) imitating practical infinity. As mentioned earlier, the individual dimensional scores are also computed and stored in the m_bScore member, which will then be used to perform FGBF. Finally, the Qe computed for a particle position is multiplied by the Separation term raised to the power α, where α = 3/2 in this CVI function, and the result is returned as the fitness (CVI) score of the current position of the particle.
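The structure of such a CVI evaluation can be sketched as follows: nearest-centroid assignment, mean quantization error Qe, the practical-infinity penalty for empty clusters, and a multiplicative factor. The separation factor is abstracted here as K^1.5 (K being the number of centroids); the book's exact Separation term differs, so this is an assumption for illustration only.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct Pt { double x, y; };

// Sketch in the spirit of Eq. (6.1): each point is assigned to its nearest
// centroid; Qe is the mean distance to the assigned centroid; any empty
// cluster invalidates the whole proposal (practical infinity, 1e9). The
// multiplicative separation factor pow(K, 1.5) is an assumed stand-in.
double validityIndex(const std::vector<Pt>& data,
                     const std::vector<Pt>& centroids) {
    const size_t K = centroids.size();
    std::vector<int> count(K, 0);
    double sumD = 0.0;
    for (const Pt& p : data) {
        size_t best = 0;
        double bestD = 1e30;
        for (size_t c = 0; c < K; ++c) {
            double d = std::hypot(p.x - centroids[c].x, p.y - centroids[c].y);
            if (d < bestD) { bestD = d; best = c; }
        }
        ++count[best];
        sumD += bestD;
    }
    for (int n : count)
        if (n == 0) return 1e9;            // empty cluster: constraint violated
    double Qe = sumD / data.size();        // mean quantization error
    return Qe * std::pow(double(K), 1.5);  // separation factor (assumed form)
}
```

Lower scores are better, so the penalty guarantees that particles proposing unused centroids can never win a dimension.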
In order to perform DC extraction over natural images, the second action, namely "2. 3D Color Quantization over color images", should be selected in the GUI of the
Table 6.6 Formation of the centroid groups by breaking the MST iteratively
PSOTestApp application, as shown in Fig. 4.4, and then one or more images should be selected as input. The main dialog object of the CPSOtestAppDlg class will then use the object of the CPSOcolorQ class in the CPSOtestAppDlg::OnDopso() function to extract the DCs (by the command m_PSOc.ApplyPSO(m_pXImList, m_psoParam, m_saParam)). The CPSOcolorQ class is quite similar to CPSOcluster, having the same CVI function and identical code for the MD PSO and FGBF operations. Recall that DC extraction is nothing but a dynamic clustering operation in a 3D color space; therefore, the few differences lie in the initialization and the formation of the color dataset. Moreover, the template class <X> is now implemented by the CColor class, which has three color components, m_c1, m_c2, m_c3, and the weight of the color, m_weight. As shown in Table 6.9, during the initialization of the DC extraction operation at the beginning of the CPSOcolorQ::PSOThread() function, for the RGB frame buffer of each input image there is a pre-processing step, which performs the median-cut method to create a color palette of MAX_COLORS colors, a constant initially set
to 256. Over this color palette, the dynamic clustering operation by MD PSO with FGBF extracts the optimum number of DCs with respect to the CVI function, CPSOcolorQ::ValidityIndex2(). It can be performed in one of four color spaces: RGB, LUV, HSV, and HSL, by setting m_usedColorSpace to one of the CS_RGB, CS_LUV, CS_HSV, CS_HSL variables in the constructor of the CPSOcolorQ class. If the selected color space is not RGB, the color palette is also converted to the selected color space and stored in the CColor array s_ppColorA. As explained in Sect. 1.2, each color space has, first of all, a distinct color distance function: HSV and HSL use the distance metric given in Eq. (6.5), while RGB and LUV use a plain Euclidean function. Each color distance metric is implemented in a distinct static function, which is stored in the function pointer s_fpDistance. In this way, both the CPSOcolorQ::ValidityIndex2() and CPSOcolorQ::FindGBDim() functions can use this static function pointer to
compute the distance between two colors regardless of the choice of color space. Furthermore, the positional range setting for the MD PSO swarm particles varies with the choice of color space; for example, {Xmin, Xmax} = {{0, 0, 0}, {255, 255, 255}} for the RGB color space. As mentioned earlier, apart from such initialization details, the rest of the operation is identical to the 2D dynamic clustering application explained in the previous section. Once the DCs are extracted, they are back-projected to an output image in the CPSOcolorQ::GetResults() function; some resultant images are shown in Figs. 6.8 and 6.9.
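The function-pointer dispatch described above can be sketched as follows: a single static pointer is set once from the color-space choice, so callers compute color distances without branching on the space. The HSV metric below is a plain placeholder, not the fuzzy metric of Eq. (6.5), and the names mirror the book's only loosely.

```cpp
#include <cassert>
#include <cmath>

struct Color { double c1, c2, c3; };

// Plain Euclidean distance, as used for RGB and LUV.
static double rgbDistance(const Color& a, const Color& b) {
    return std::sqrt((a.c1 - b.c1) * (a.c1 - b.c1) +
                     (a.c2 - b.c2) * (a.c2 - b.c2) +
                     (a.c3 - b.c3) * (a.c3 - b.c3));
}

// Placeholder HSV metric: circular hue difference plus absolute
// saturation/value differences. NOT the fuzzy metric of Eq. (6.5).
static double hsvDistance(const Color& a, const Color& b) {
    double dh = std::fabs(a.c1 - b.c1);        // hue is circular (degrees)
    if (dh > 180.0) dh = 360.0 - dh;
    return dh / 180.0 + std::fabs(a.c2 - b.c2) + std::fabs(a.c3 - b.c3);
}

enum ColorSpace { CS_RGB, CS_HSV };
static double (*s_fpDistance)(const Color&, const Color&) = nullptr;

// Set once at initialization; all later distance calls go through the pointer.
void setColorSpace(ColorSpace cs) {
    s_fpDistance = (cs == CS_RGB) ? rgbDistance : hsvDistance;
}
```

The design choice keeps the validity index and FindGBDim code paths completely unaware of the active color space.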
Both SA-driven approaches, A1 and A2, are implemented within the template
<class T, class X> bool CPSO_MD<T,X>::Perform() function as two separate
plug-ins, and the code for 2D data clustering (or any other MD PSO application) is
identical, owing to the fact that both approaches are generic. The first SA-driven plug-in, which implements the second SA-driven approach (A2), is called in the following if() statement:
if(m_mode == SAD)
{//A2: Create aGB particle and compete with gbest at each dimension…
SADFn(xyb, score_min, iter);
}//if SAD…
and the plug-in function, SADFn(), is given in Table 6.10.
Recall that this approach is also an aGB formation, similar to the one given in Table 6.7, but the underlying method is now SPSA rather than FGBF, and thus it can be used for any MD PSO application. The new aGB particle is formed in each dimension by applying SPSA over the personal best position of the gbest particle
Table 6.10 The plug-in function SADFn() for the second SA-driven approach, A2
pseudo-code given in the table, SPSA only updates the position of the native gbest particle, m_gbest[cur_dim - m_xdMin], in its current dimension, cur_dim. As in the second SA-driven approach, a low-cost mode is applied with the same definition, "LOWCOST" (Table 6.11).
References
1. J.T. Tou, R.C. Gonzalez, Pattern Recognition Principles (Addison-Wesley, London, 1974)
2. A.K. Jain, M.N. Murty, P.J. Flynn, Data clustering: a review. ACM Comput. Surv. 31(3),
264–323 (1999)
3. J.C. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms (Plenum, New
York, 1981)
4. A. Antoniou, W.-S. Lu, Practical Optimization, Algorithms and Engineering Applications
(Springer, USA, 2007)
5. D. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning (Addison-
Wesley, Reading, MA, 1989), pp. 1–25
6. J. Koza, Genetic Programming: On the Programming of Computers by Means of Natural
Selection (MIT Press, Cambridge, MA, 1992)
S. Kiranyaz et al., Multidimensional Particle Swarm Optimization for Machine Learning 187
and Pattern Recognition, Adaptation, Learning, and Optimization 15,
DOI: 10.1007/978-3-642-37846-1_7, Springer-Verlag Berlin Heidelberg 2014
7 Evolutionary Artificial Neural Networks
native tendency eventually steers the evolution process toward the compact net-
work configurations in the architecture space instead of more complex ones, as
long as optimality prevails.
In the fields of machine learning and artificial intelligence, evolutionary search mimics the process of natural evolution to find the optimal solution to complex, high-dimensional, multimodal problems. In other words, it is basically a search process using an evolutionary algorithm (EA) to determine the best possible (optimal or near-optimal) design among a large collection of potential solutions according to a given cost function. To date, designing a (near-) optimal network architecture has been a task for a human expert, requiring a tedious trial-and-error process. Specifically, determining the optimal number of hidden layers and the optimal number of neurons in each hidden layer is the most critical task. For instance, an ANN with no or too few hidden layers may not differentiate among complex patterns, instead yielding only a linear estimate of a possibly nonlinear problem. In contrast, if an ANN has too many nodes/layers, it might be affected severely by noise in the data due to over-fitting, which eventually leads to poor generalization. Furthermore, proper training of complex networks is often a time-consuming task. The optimum number of hidden nodes/layers may depend on the input/output vector sizes, the amount of training and test data, and, more importantly, the characteristics of the problem, e.g., its dimensionality, nonlinearity, dynamic nature, etc.
The era of ANNs started with the simplified neurons proposed by McCulloch and Pitts in 1943 [1], and particularly after the 1980s ANNs have been widely applied in many areas, most using feed-forward ANNs trained with the back-propagation (BP) algorithm. As detailed in Sect. 3.4.3, BP has the advantage of performing a directed search, that is, weights are always updated so as to minimize the error. However, several aspects prevent the algorithm from being universally useful. The most troublesome is its strict dependency on the learning rate parameter, which, if not set properly, can lead either to oscillation or to an indefinitely long training time. Network paralysis [2] may also occur, i.e., as the ANN trains, the weights tend toward very large values and the training process can come to a virtual standstill. Furthermore, BP slows down by an order of magnitude for every extra (hidden) layer added to the ANN [2]. Above all, BP is only a gradient descent algorithm applied to the error space, which can be complex and may contain many deceiving local minima (multimodality). Therefore, BP most likely gets trapped in a local minimum, making it entirely dependent on the initial settings. Many BP variants and extensions try to address some or all of these problems, see, e.g., [3–5],
yet all share one major drawback, that is, the ANN architecture has to be fixed in
advance and the question of which specific ANN structure should be used for a
particular problem still remains unsolved.
Several remedies can be found in the literature, some of which are briefly described next. Let NI, NH, and NO be the number of neurons in the input, hidden, and output layers of a two-layer feed-forward ANN. Jadid and Fairbairn [6] proposed an upper bound on NH such that NH ≤ NTR/(R + NI + NO), where NTR is the number of training patterns. Masters [7] suggested that the ANN architecture should resemble a pyramid with NH ≈ √(NI · NO). Hecht-Nielsen [8] proved that NH ≤ 2NI + 1 by using the Kolmogorov theorem. Such boundaries may only give an idea about the architecture range that should be applied in general; many problems with high nonlinearity and a dynamic nature may require models that far exceed these bounds. Therefore, these limits can only serve as a rule of thumb which should be analyzed further.
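The three rules of thumb can be written directly; since the printed formulas are partly garbled, the exact forms below are reconstructions and should be treated as assumptions.

```cpp
#include <cassert>
#include <cmath>

// Rule-of-thumb bounds on the hidden layer size N_H (reconstructed forms):
// Jadid-Fairbairn upper bound, Masters' pyramid rule, and the
// Kolmogorov-based Hecht-Nielsen bound.
double jadidFairbairnBound(double nTrain, double R, double nIn, double nOut) {
    return nTrain / (R + nIn + nOut);       // N_H <= N_TR / (R + N_I + N_O)
}

double mastersRule(double nIn, double nOut) {
    return std::sqrt(nIn * nOut);           // N_H ~ sqrt(N_I * N_O)
}

double hechtNielsenBound(double nIn) {
    return 2.0 * nIn + 1.0;                 // N_H <= 2 * N_I + 1
}
```

As the text notes, highly nonlinear or dynamic problems may require networks that far exceed these numbers, so they are starting points rather than constraints.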
Designing the optimal network architecture can be thought of as a search process within the AS containing all potential and feasible architectures. Some attempts can be found in the literature, such as the research on constructive and pruning algorithms [9–12]. The former methods initially assume a minimal ANN and insert nodes and links as warranted, while the latter proceed in the opposite way, i.e., starting with a large network, superfluous components are pruned. However, Angeline et al. [13] pointed out that "such structural hill climbing methods are susceptible to becoming trapped at structural local minima." The reasoning behind this is clarified by Miller et al. [14], stating that the AS is non-differentiable, complex, deceptive, and multimodal. Therefore, those constructive and pruning algorithms eventually face similar problems in the AS as BP does in the error (weight) space. This makes EAs [15] such as genetic algorithms (GA) [16], genetic programming (GP) [17], evolutionary strategies (ES) [18], and evolutionary programming (EP) [19] more promising candidates for both training and evolving ANNs. Furthermore, they have the advantage of being applicable to any type of ANN, including feed-forward ANNs, with any type of activation function.
GAs are a popular form of EA that rely mainly on reproduction heuristics: crossover and random mutations. When used for training ANNs (with a fixed architecture), many researchers [14, 20–23] reported that GAs can outperform BP in terms of both accuracy and speed, especially for large networks. However, as stated by Angeline et al. [13] and Yao and Liu [24], GAs are not well suited for evolving networks. For instance, the evolution process by GA suffers from the permutation problem [25], meaning that two identical ANNs may have different representations. This makes the evolution process quite inefficient at producing fit offspring. Most GAs use binary string representations for connection weights and architectures. This creates many problems, one of which is the representation precision of quantized weights. If weights are coarsely quantized, training might be infeasible since the accuracy required for a proper weight representation cannot be obtained. On the other hand, if too many bits are used (fine quantization), binary strings may become infeasibly long, especially for large ANNs, and this makes the
evolution process too slow or even impractical. Another problem with this binary representation is that network components belonging to the same or neighboring neurons may be placed far apart in the binary string. Due to the crossover operation, the interactions among such components might be lost, and hence the evolution speed is drastically reduced.
EP-based evolutionary ANNs (ENNs) [13, 24] have been proposed to address the aforementioned problems of GAs. The main distinction of EP from GA is that it does not use the problematic crossover operation, instead committing to mutation as the sole operator for searching over the weight and architecture spaces. For instance, the so-called EPNet proposed in [24] uses five different mutations: hybrid training, node and connection deletion, and node and connection addition. It starts with an initial population of M random networks, partially trains each network for some epochs, and selects the network with the best-rank performance as the parent network; if it improves beyond some threshold, further training is performed to obtain an offspring network, which replaces its parent, and the process continues until a desired performance criterion is achieved. Over several benchmark problems, EPNet was shown to discover compact ANNs which exhibit performance comparable to other GA-based evolutionary techniques. However, it was not shown that EPNet creates optimal or near-optimal architectures. One potential drawback is the fact that the best network is selected based only on partial training, since the winner network resulting from such initial (limited) BP training may not lead to the optimal network in the end. Therefore, this process may eliminate networks which would have been optimal or near-optimal had proper training been performed. Another major drawback is the algorithm's dependence on BP as the primary training method, which suffers from the aforementioned problems. Its complex structure may only be suitable, or perhaps feasible, for applications where computational complexity is not a crucial factor and/or the training problem is not too complex; as stated in [24], "However, EPNet might take a long time to find a solution to a large parity problem. Some of the runs did not finish within the user-specified maximum number of generations." Finally, as a hybrid algorithm it uses around 15 user-defined parameters/thresholds, some of which are set with respect to the problem. This obviously creates a limitation in a generic and modular application domain.
Recall that PSO, which has obvious ties with the EA family, lies somewhere between GA and EP. Yet unlike GA, PSO has no complicated evolutionary operators such as crossover, selection, and mutation, and it is highly dependent on
stochastic processes. PSO has been successfully applied to training feed-forward [26–29] and recurrent ANNs [30, 31], and several works in this field have shown that it can achieve a learning ability superior to the traditional BP method in terms of accuracy and speed. Only a few researchers have investigated the use of PSO for the evolutionary design of ANNs, or, to be precise, of fully connected feed-forward ANNs: multilayer perceptrons (MLPs) with a single hidden layer only. In [26, 29], the PSO–PSO algorithm and its slightly modified variant, PSO–PSO: weight decay (PSO–PSO:WD), have been proposed. Both techniques use an inner PSO to train the weights and an outer one to determine the (optimal) number of hidden nodes. Both methods yield worse classification performance than EP- and GA-based evolutionary neural networks (ENNs) on three benchmark problems from the Proben1 dataset [32]. More recently, Yu et al. proposed an improved PSO technique, the so-called IPSONet [28], which achieved comparable performance in terms of average classification error rate over the same Proben1 dataset. All potential network architectures are encoded into the particles of a single PSO operation, which evaluates their weights and architectures simultaneously. However, such an all-in-one encoding scheme makes the dimension of the particles too high, and thus the method can be applied only to single-hidden-layer MLPs with a limited number of (hidden) nodes (a maximum of 7 was used in [28]). Furthermore, it turns out to be a hybrid technique, using GA operators such as mutation and crossover to alleviate the stagnation problem of PSO in such high dimensions.
The major drawback of many PSO variants, including the basic method, is that they can only be applied to a search space with a fixed dimension. However, in the field of ANNs, as well as in many other optimization problems (e.g., clustering, spatial segmentation, function optimization, etc.), the dimension in which the optimum solution lies is also unknown and should thus be determined. By a
proper adaptation, MD PSO can be utilized for designing (near-) optimal ANNs. In
this section, the focus is particularly drawn on automatic design of feed-forward
ANNs and the search is carried out over all possible network configurations within
the specified AS. Therefore, no assumption is made about the number of (hidden)
layers and in fact none of the network properties (e.g., feed-forward, differentiable
activation function, etc.) is an inherent constraint of the proposed scheme. All
network configurations in the AS are enumerated into a dimensional hash table
with a proper hash function, which ranks the networks with respect to their
complexity. That is, it associates a higher hash index to a network with a higher
complexity. MD PSO can then use each index as a unique dimension of the search
space where particles can make inter-dimensional navigations to seek an optimum
dimension (dbest) and the optimum solution in that dimension, $x\hat{y}^{dbest}$. An example enumeration is given in Table 4. Note that in this example, the input and output layer sizes are 9 and 2, which are eventually fixed for all MLP configurations. The hash function associates the 1st dimension with the simplest possible architecture, i.e., an SLP with only the input and the output layers (9 × 2). From dimensions 2–9, all configurations
are 2-layer MLPs with a hidden layer size varying between 1 and 8 (as specified in
the 2nd entries of Rmin and Rmax ). Similarly, for dimensions 10 and higher, 3-layer
MLPs are enumerated where the 1st and the 2nd hidden layer sizes are varied
according to the corresponding entries in Rmin and Rmax . Finally, the most complex
MLP with the maximum number of layers and neurons is associated with the
highest dimension, 41. Therefore, all 41 entries in the hash table span the AS with respect to configuration complexity, and this eventually determines the dimensional range of the solution space as $D_{min} = 1$ and $D_{max} = 41$.
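The enumeration just described can be sketched in a few lines. This is an illustrative reconstruction, not the book's code: the function name and the exact traversal order of the hash function are our assumptions, chosen so that the example AS ($N_I = 9$, $N_O = 2$, hidden sizes 1–8 and 1–4) yields exactly the 41 dimensions discussed.

```python
def enumerate_architectures(r_min, r_max):
    # Build the dimensional hash table: dimension (hash index) -> MLP
    # configuration, ranked by complexity. Illustrative sketch only; the
    # traversal order of the book's hash function is an assumption here.
    n_i, n_o = r_min[0], r_min[-1]
    table = {1: (n_i, n_o)}  # dimension 1: the SLP (input + output layers only)
    d = 2
    for h1 in range(r_min[1], r_max[1] + 1):      # 2-layer MLPs (one hidden layer)
        table[d] = (n_i, h1, n_o)
        d += 1
    for h1 in range(r_min[1], r_max[1] + 1):      # 3-layer MLPs (two hidden layers)
        for h2 in range(r_min[2], r_max[2] + 1):
            table[d] = (n_i, h1, h2, n_o)
            d += 1
    return table

# The example AS of the text: N_I = 9, N_O = 2, hidden sizes 1..8 and 1..4
table = enumerate_architectures([9, 1, 1, 2], [9, 8, 4, 2])
```

With these range arrays the table contains 1 + 8 + 8·4 = 41 configurations, from the 9 × 2 SLP at dimension 1 up to the 9 × 8 × 4 × 2 MLP at dimension 41.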
At time t, suppose that particle a in the swarm, $\xi = \{x_1, \ldots, x_a, \ldots, x_S\}$, has the positional component formed as,

$$xx_a^{xd_a(t)}(t) = \left\{ \{w_{jk}^0\}, \{w_{jk}^1\}, \{\theta_k^1\}, \{w_{jk}^2\}, \{\theta_k^2\}, \ldots, \{w_{jk}^{O-1}\}, \{\theta_k^{O-1}\}, \{\theta_k^O\} \right\}$$

where $\{w_{jk}^l\}$ and $\{\theta_k^l\}$ represent the sets of weights and biases of layer l, respectively. Note that the input layer (l = 0) contains only weights, whereas the output layer (l = O)
has only biases. By means of such a direct encoding scheme, the particle a represents all potential network parameters of the MLP architecture at the dimension (hash index) $xd_a(t)$. As mentioned earlier, the dimensional range, $D_{min} \le xd_a(t) \le D_{max}$, within which MD PSO particles can make inter-dimensional jumps, is determined by the AS defined. Apart from the regular limits, such as the positional velocity range $\{V_{min}, V_{max}\}$ and the dimensional velocity range $\{VD_{min}, VD_{max}\}$, the data space can also be limited with a practical range, i.e., $X_{min} \le xx_a^{xd_a(t)}(t) < X_{max}$. In short, only a few boundary parameters need to be
defined in advance for the MD PSO process, as opposed to other GA-based
methods, or EPNet, which use several parameters and external techniques (e.g.,
Simulated Annealing, BP, etc.) in a complex process. Setting MSE in Eq. (3.15) as
the fitness function enables MD PSO to perform evolutions of both network
parameters and architectures within its native process.
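The direct encoding and the MSE fitness can be sketched as follows. This is a minimal illustration, not the book's implementation: the flat layout (each layer transition's weights followed by the next layer's biases) and the tanh activation are our assumptions.

```python
import numpy as np

def decode_particle(x, cfg):
    # Unpack a flat particle position into per-layer (weights, biases).
    # Grouping each transition's weights with the next layer's biases is an
    # assumption; the text lists the weight and bias sets layer by layer.
    params, i = [], 0
    for l in range(len(cfg) - 1):
        n_in, n_out = cfg[l], cfg[l + 1]
        W = np.asarray(x[i:i + n_in * n_out]).reshape(n_in, n_out)
        i += n_in * n_out
        b = np.asarray(x[i:i + n_out])
        i += n_out
        params.append((W, b))
    return params

def mse_fitness(x, cfg, X, t):
    # Fitness of a particle at dimension cfg: forward-propagate the training
    # inputs and return the MSE at the output layer (cf. Eq. (3.15)).
    a = X
    for W, b in decode_particle(x, cfg):
        a = np.tanh(a @ W + b)  # differentiable activation (tanh assumed)
    return float(np.mean((a - t) ** 2))

cfg = (1, 5, 3, 1)  # a 3-layer MLP: 1 x 5 x 3 x 1
n_params = sum(cfg[l] * cfg[l + 1] + cfg[l + 1] for l in range(len(cfg) - 1))
```

For the 1 × 5 × 3 × 1 configuration the particle carries 32 parameters; the fitness is simply the training MSE after forward propagation.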
In order to test and evaluate the performance of MD PSO for evolving ANNs, experiments are performed over a synthetic dataset and a real benchmark dataset. The aim is to test the "optimality" of the networks found with the former dataset, and the "generalization" ability, via comparative evaluations against several popular techniques, with the latter dataset. In order to determine which
network architectures are (near-) optimal for a given problem, we apply exhaustive
BP training over every network configuration in the given AS, and perform 100 MD PSO runs, each of which terminates at the end of 1,000 epochs (iterations).

Fig. 7.1 The function y = cos(x/2) sin(8x) plotted over the interval [−π, π] with 100 samples
As shown in Fig. 7.1, the function y = cos(x/2) sin(8x) has a highly dynamic nature within the interval [−π, π]. 100 samples are taken for training, where the x coordinate of each sample is fed as the input and y is used as the desired (target) output, so all networks are formed with $N_I = N_O = 1$. At the end of each run, the best fitness score (minimum error) achieved, $f(x\hat{y}^{dbest})$, by the particle with the index gbest(dbest) at the optimum dimension dbest, is stored. The histogram of dbest, which is a hash index indicating a particular network configuration in R1, eventually provides the crucial information about the (near-) optimal configuration(s).
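The synthetic training set for this experiment can be generated as follows; uniform sampling over the interval is our assumption, since the text does not specify the sampling scheme.

```python
import numpy as np

# 100 training samples of y = cos(x/2) * sin(8x) over [-pi, pi]; x is the
# single network input and y the desired (target) output, so all candidate
# networks have N_I = N_O = 1. Uniform spacing is assumed.
x_train = np.linspace(-np.pi, np.pi, 100)
y_train = np.cos(x_train / 2) * np.sin(8 * x_train)
```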
Figure 7.2 shows the dbest histogram and the error statistics plot from the exhaustive BP training for the function approximation problem. Note that BP can at best perform a linear approximation for most of the simple configurations (hash indices), resulting in MSE ≈ 0.125, and only some 3-layer MLPs (in the form 1 × M × N × 1, where M = 5, 6, 7, 8 and N = 2, 3, 4) yield convergence to near-optimal solutions. Accordingly, from the dbest histogram it is straightforward to see that a high majority (97 %) of the MD PSO runs converges to those near-optimal configurations. Moreover, the remaining solutions, with dbest = 8 and 9 (indicating 2-layer MLPs in 1 × 7 × 1 and 1 × 8 × 1 forms), achieve a competitive performance with respect to the optimal 3-layer MLPs. Therefore, these particular MD PSO runs are indeed not trapped in local minima; on the contrary, the exhaustive BP training on these networks could not yield a "good" solution. Furthermore, MD PSO evolution in this AS confirmed that 4 out of 41
Fig. 7.2 Error statistics from exhaustive BP training (top) and dbest histogram from 100 MD PSO evolutions (bottom) for the y = cos(x/2) sin(8x) function approximation
Fig. 7.3 Training MSE (top) and dimension (bottom) plots vs. iteration number for the 17th (left) and 93rd (right) MD PSO runs
whereas the bottom one shows the minimum MSE (best fitness) achieved per dimension at the end of an MD PSO evolution process with 10,000 iterations (i.e., $mMSE(d) = f(x\hat{y}^{d}(10{,}000))\ \forall d \in [1, 41]$). Note that both curves show a similar behavior for d > 18 (e.g., note the peaks at d = 19–20, 26–27, and 34); however, only MD PSO provides a few near-optimal solutions for d ≤ 18, whereas none of the BP runs managed to escape from the local minima (the linear approximation). Most of the optimal and near-optimal configurations can be obtained from this single run, e.g., the ranked list is {dbest = 23, 24, 40, 33, 22, 39, 30, …}. Additionally, the peaks of this plot reveal the worst network configurations that should not be used for this problem, e.g., d = 1, 2, 8, 9, 10, 11, 19, 26, and 34. Note, however, that this can be a noisy evaluation, since there is always the possibility that the MD PSO process may also get trapped in a local minimum on these dimensions, and/or an insufficient number of particles visited a dimension due to their natural attraction toward dbest (within the MD PSO process), which might be too far away. This explains the erroneous evaluation of dimensions 8 and 9, which should not be on that list.
Figure 7.5 shows the dbest histogram and the error statistics plot from the exhaustive BP training for the 10-bit parity problem, where $N_I = 10$, $N_O = 1$ are used for all networks. In this problem, BP exhibits a better performance on the majority of the configurations, i.e., $4 \times 10^{-4} \le mMSE(d) \le 10^{-3}$ for d = 7, 8, 9, 16, 23, 24, 25, and $d \in \{29, \ldots, 41\} - \{34\}$. The 100 MD PSO runs show that there are in fact two optimum dimensions, 30 and 41 (corresponding to 3-layer MLPs in 10 × 5 × 3 × 1 and 10 × 8 × 4 × 1 forms), which can achieve the minimum MSEs, i.e., $mMSE(30) < 2 \times 10^{-5}$ and $mMSE(41) < 10^{-5}$. The majority of the MD PSO runs, as represented in the dbest histogram in Fig. 7.5, achieved $MSE(dbest) < 8 \times 10^{-4}$, except the four runs (out of 100) with $MSE(dbest) > 4 \times 10^{-3}$ for dbest = 4, 18, and 19. These are the minority cases where MD PSO was trapped in local minima, but the rest evolved to (near-) optimum networks.
Fig. 7.4 MSE plots from the exhaustive BP training (top) and a single run of MD PSO (bottom)
Fig. 7.5 Error statistics from exhaustive BP training (top) and dbest histogram from 100 MD
PSO evolutions (bottom) for 10-bit parity problem
The error space is highly multimodal with many local minima; thus methods such as BP encounter severe problems in error reduction [36]. The dataset consists of 194 patterns (2D points), 97 samples in each of the two classes (spirals). The dataset is now used as a benchmark for ANNs by many researchers. Lang and Witbrock [35] reported that a near-optimum solution could not be obtained with the standard BP algorithm over feed-forward ANNs; they tried a special network structure with short-cut links between layers. Similar conclusions were reported by Baum and Lang [36]: the problem is unsolvable with 2-layer MLPs in the 2 × 50 × 1 configuration. This, without doubt, is one of the hardest problems in the field of ANNs. Figure 7.6 shows the dbest histogram and the error statistics plot from the exhaustive BP training for the two-spirals problem, where $N_I = 2$, $N_O = 1$. It is obvious that none of the configurations yields a sufficiently low error value with BP training; in particular, BP can at best perform a linear approximation for most of the configurations (hash indices), resulting in MSE ≈ 0.49, and only a few 3-layer MLPs (with indices 32, 33, 38, and 41) are able to reduce the MSE to 0.3. MD PSO shows a similar performance to BP, achieving 0.349 ≤ mMSE(dbest) ≤ 0.371 for dbest = 25, 32, 33, 38, and 41. These are obviously the best possible MLP configurations, to which a high majority of MD PSO runs converged.
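The 194-pattern benchmark can be generated as follows. The radius/angle constants follow the widely quoted Lang–Witbrock generation scheme and are assumptions here, since the text only states the pattern counts.

```python
import numpy as np

def two_spirals(n_per_class=97):
    # Two-spirals benchmark: 97 points per spiral, the second spiral being
    # the point reflection of the first (194 2D patterns in total). The
    # constants below follow the commonly used Lang-Witbrock scheme.
    i = np.arange(n_per_class)
    phi = i * np.pi / 16.0
    r = 6.5 * (104 - i) / 104.0
    s1 = np.stack([r * np.sin(phi), r * np.cos(phi)], axis=1)
    X = np.vstack([s1, -s1])
    y = np.hstack([np.zeros(n_per_class), np.ones(n_per_class)])
    return X, y

X_spiral, y_spiral = two_spirals()
```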
Fig. 7.6 Error (MSE) statistics from exhaustive BP training (top) and dbest histogram from 100
MD PSO evolutions (bottom) for the two-spirals problem
In the previous section, a set of synthetic problems that are among the hardest and most complex in the ANN field was used in order to test the optimality of the MD PSO evolution process, i.e., to see whether or not MD PSO can evolve to the few (near-) optimal configurations present in the limited AS, R1, which mostly contains shallow MLPs. In this section we shall test the generalization capability of the proposed method and perform comparative evaluations against the most promising, state-of-the-art evolutionary techniques over a benchmark dataset, which is partitioned into three sets: training, validation, and testing. There are several techniques [37] that use the training and validation sets individually to prevent over-fitting and thus improve the classification performance on the test data. However, the use of a validation set is not needed for EA-based techniques, since the latter perform a global search for a solution [38]. Although all competing methods presented in this section use training and validation sets in some way to maximize their classification rate over the test data, we simply combine the validation and training sets and use them for training.
From Proben1 repository [32], we selected three benchmark classification
problems, breast cancer, heart disease and diabetes, which were commonly used
in the prior work such as PSO–PSO [29], PSO–PSO:WD [26], IPSONet [28],
EPNet [24], GA (basic) [38], and GA (Connection Matrix) [39]. These are medical
diagnosis problems, which mainly present the following attributes:
• All of them are real-world problems based on medical data from human patients.
• The input and output attributes are similar to those used by a medical doctor.
• Since medical examples are expensive to get, the training sets are quite limited.
2. Diabetes
This dataset is used to predict a diabetes diagnosis among Pima Indians. All patients reported are females at least 21 years old. There are 768 exemplars in total, of which 500 are classified as diabetes negative and 268 as diabetes positive. The dataset is originally partitioned as 384 for training, 192 for validation, and 192 for testing. It consists of 8 input and 2 output attributes.
3. Heart Disease
The initial dataset consists of 920 exemplars with 35 input attributes, some of which have severely missing values. Hence, a second dataset is composed using the cleanest part of the preceding set, which was created at the Cleveland Clinic Foundation by Dr. Robert Detrano. The Cleveland data is called "heartc" in the Proben1 repository and contains 303 exemplars, but 6 of them still contain missing data and are hence discarded. The rest is partitioned as 149 for training, 74 for validation, and 74 for testing. There are 13 input and 2 output attributes. The purpose is to predict the presence of heart disease according to the input attributes.
The input attributes of all datasets are scaled to between 0 and 1 by a linear function. Their output attributes are encoded using a 1-of-c representation for c classes. The winner-takes-all methodology is applied, so that the output with the highest activation designates the class. The experimental setup is identical for all methods, and thus fair comparative evaluations can now be made over the classification error rate on the test data. In all experiments in this section we mainly use R1, which is specified by the range arrays $R^1_{min} = \{N_I, 1, 1, N_O\}$ and $R^1_{max} = \{N_I, 8, 4, N_O\}$, containing the simplest 1-, 2-, or 3-layer MLPs, where $N_I$ and $N_O$ are determined by the number of input and output attributes of the classification problem. In order to collect some statistics about the results, we perform 100 MD PSO runs, each using 250 particles (S = 250) and terminating at the end of 200 epochs (E = 200).
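The preprocessing steps above (linear scaling, 1-of-c encoding, winner-takes-all decision) can be sketched as small helpers; the function names are ours.

```python
import numpy as np

def scale01(X):
    # Linearly scale each input attribute into [0, 1].
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / (hi - lo)

def one_of_c(labels, c):
    # 1-of-c target encoding: one output attribute per class.
    T = np.zeros((len(labels), c))
    T[np.arange(len(labels)), labels] = 1.0
    return T

def winner_takes_all(outputs):
    # The output neuron with the highest activation designates the class.
    return np.argmax(outputs, axis=1)
```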
Before presenting the classification results on the test data, there are some crucial points worth mentioning. First of all, the aforementioned ambiguity over the decision of "optimality" is witnessed over the (training) datasets Diabetes and, particularly, Breast Cancer, as the majority of networks in R1 can achieve similar performances. Figure 7.7 demonstrates this fact by two error statistics plots from the exhaustive BP training (K = 500, λ = 0.05) with 5,000 epochs. Note that most of the networks trained over both datasets result in minimum MSE values that lie within a narrow range, and it is rather difficult to distinguish or separate one from the other.
Contrary to these two datasets, the Heart Disease dataset gives rise to four distinct sets of network configurations, which can achieve training mMSEs below $10^{-2}$, as shown in Fig. 7.8. The indices (dimensions) corresponding to these four optimal sets are located in the lower vicinity of the indices dbest = 9, 25, 32, and 41, where MD PSO managed to evolve either to them or to neighboring configurations. Note that the majority of MD PSO runs (>50 %) evolved to the simplest MLPs with a single hidden layer (i.e., from Table 4, dbest = 9 is the MLP 13 × 8 × 2), although BP achieved slightly lower mMSEs over the other three (near-) optimal configurations. The main reason is the fact that MD PSO, or PSO in general, performs better in low dimensions; recall that the premature convergence problem might also occur when the search space is of high dimensionality [40].
Fig. 7.7 Error statistics from exhaustive BP training over Breast Cancer (top) and Diabetes
(bottom) datasets
Fig. 7.8 Error statistics from exhaustive BP training (top) and dbest histogram from 100 MD
PSO evolutions (bottom) over the Heart Disease dataset
Table 7.1 Mean (μ) and standard deviation (σ) of classification error rates (%) over test datasets

  Algorithm       Breast cancer     Diabetes          Heart disease
                  μ       σ         μ       σ         μ       σ
  MD PSO          0.39    0.31      20.55   1.22      19.53   1.71
  PSO–PSO         4.83    3.25      24.36   3.57      19.89   2.15
  PSO–PSO:WD      4.14    1.51      23.54   3.16      18.1    3.06
  IPSONet         1.27    0.57      21.02   1.23      18.14*  3.42
  EPNet           1.38    0.94      22.38   1.4       16.76*  2.03
  GA [38]         2.27    0.34      26.23   1.28      20.44   1.97
  GA [39]         3.23    1.1       24.56   1.65      23.22   7.87
  BP              3.01    1.2       29.62   2.2       24.89   1.71
Table 7.1 presents the classification error rate statistics of MD PSO and other methods from the literature. The error rate in the table refers to the percentage of wrong classifications over the test dataset of a benchmark problem. It is straightforward to see that the best classification performance is achieved by the MD PSO technique over the Diabetes and Breast Cancer datasets. Particularly on the latter set, roughly half of the MD PSO runs resulted in a zero (0 %) error rate, meaning that perfect classification was achieved. MD PSO exhibits a competitive performance on the Heart Disease dataset. However, note that both IPSONet and EPNet (marked with * in Table 7.1), showing slightly better performances, used a subset of this set (134 for training, 68 for validation, and 68 for testing), excluding 27 entries overall. In [24] this is reasoned as, "27 of these were retained in case of dispute, leaving a final total of 270".
Table 7.3 Classification error rate (%) statistics of MD PSO when applied to two architecture spaces

  Error rate statistics   Breast cancer    Diabetes         Heart disease
                          R1      R2       R1      R2       R1      R2
  μ                       0.39    0.51     20.55   20.34    19.53   20.21
  σ                       0.31    0.37     1.22    1.19     1.71    1.90
During an MD PSO process, at each iteration and for each particle, the network parameters are first extracted from the particle and the input vectors are (forward) propagated to compute the average MSE at the output layer. Therefore, it is not feasible to accomplish a precise computational complexity analysis of the MD PSO evolutionary process, since this mainly depends on the networks to which the particles converge, and in a stochastic process such as PSO this cannot be determined. Furthermore, it also depends on the AS selected, because MD PSO can only search for the optimal network within the AS. This hints that the computational complexity also depends on the problem at hand, because particles eventually tend to converge to the near-optimal networks after the initial period of the process. Yet we can identify certain attributes that directly affect the complexity, such as the size of the training dataset (T), the swarm size (S), and the number of epochs to terminate the MD PSO process (E). Since the computational complexity is proportional to the total number of forward propagations performed, it is in the order of $O(S \cdot E \cdot T \cdot \mu_t)$, where $\mu_t$ is an abstract time for the propagation and MSE computation over an average network in the AS. Due to the aforementioned reasons, $\mu_t$ cannot be determined a priori, and therefore the abstract time is defined as the expected time to perform a single forward propagation of an input vector. Moreover, the problem naturally determines T, yet the computational complexity can still be controlled by the S and E settings.
In this section, we shall draw the focus to evolutionary radial basis function (RBF) network classifiers that will be evolved to classify terrain data in polarimetric synthetic aperture radar (SAR) images. For the past few decades, image and data classification techniques have played an important role in the automatic analysis and interpretation of remote sensing data. Polarimetric SAR data in particular poses a challenging problem in this field due to the complexity of the measured information from its multiple polarimetric channels. Recently, the number of applications using data provided by SAR systems with fully polarimetric capability has been increasing. Over the past decade, there has been extensive research in the area of the segmentation and classification of polarimetric SAR data. In the literature, the classification algorithms for polarimetric SAR can be divided into three main classes: (1) classification based on the physical scattering mechanisms inherent in the data [41, 42], (2) classification based on the statistical characteristics of the data [43, 44], and (3) classification based on image processing techniques [45–47]. Additionally, there have been several works using combinations of the above classification approaches [41, 43]. While these approaches to the polarimetric SAR classification problem can be based on either supervised or unsupervised methods,
their performance and suitability usually depend on applications and the avail-
ability of ground truth.
As one of the earlier algorithms, Kong et al. [48] derived a distance measure
based on the complex Gaussian distribution and used it for maximum likelihood
(ML) classification of single-look complex polarimetric SAR data. Then, Lee et al.
[49] used the statistical properties of a fully polarimetric SAR to perform a
supervised classification based on complex Wishart distribution. Afterwards,
Cloude and Pottier [50] proposed an unsupervised classification algorithm based
on their target decomposition theory. Target entropy (H) and target average
scattering mechanism (scattering angle, a) calculated from this decomposition
have been widely used in polarimetric SAR classification. For multilook data
represented in covariance or coherency matrices, Lee et al. [43] proposed a new
unsupervised classification method based on a combination of polarimetric target
decomposition [50] and the maximum likelihood classifier using the complex
Wishart distribution. The unsupervised Wishart classifier has an iterative proce-
dure based on the well-known K-means algorithm, and has become a preferred
benchmark algorithm due to its computational efficiency and generally good
performance. However, this classifier still has some significant drawbacks, since it entirely relies on K-means for the actual clustering: it may converge to local optima, the number of clusters must be fixed a priori, its performance is sensitive to the initialization, and its convergence depends on several parameters. Recently, a two-stage unsupervised clustering based on the EM algorithm [51] was proposed for the classification of polarimetric SAR images. The EM algorithm estimates the parameters of the probability distribution functions which represent the elements of a 9-dimensional feature vector, consisting of six magnitudes and three angles of a coherency matrix. A Markov random field (MRF) clustering-based method exploiting the spatial relation between adjacent pixels in polarimetric SAR images was proposed in [45, 49], and a new wavelet-based texture image segmentation algorithm was successfully applied to the unsupervised SAR image segmentation problem.
More recently, neural network based approaches [53, 54] for the classification
of polarimetric SAR data have been shown to outperform other aforementioned
well-known techniques. Compared with other approaches, neural network classifiers have the advantage of adaptability to the data without making a priori assumptions about a particular probability distribution. However, their performance
depends on the network structure, training data, initialization, and parameters. As
discussed earlier, designing an optimal ANN classifier structure and its parameters
to maximize the classification accuracy is still a crucial and challenging task. In
this section, another feed-forward ANN type, the RBF network classifier, which is optimally designed by MD PSO, is employed. For this task, RBFs are purposefully chosen due to their robustness, their faster learning capability compared with other feed-forward networks, and their superior performance with simpler network architectures. Earlier work on RBF classifiers for polarimetric SAR image classification has demonstrated a potential for performance improvement over conventional techniques [55]. The polarimetric SAR feature vector presented in this section includes the full covariance matrix, the H/α/A decomposition based features combined with the backscattering power (span), and the gray level co-occurrence matrix (GLCM)-based texture features, as suggested by the results of previous studies [56, 57]. The performance of the evolutionary RBF network based classifier is evaluated using the fully polarimetric San Francisco Bay and Flevoland datasets acquired by the NASA/Jet Propulsion Laboratory Airborne SAR (AIRSAR) at L-band [57–59]. The classification results, measured in terms of the confusion matrix, overall accuracy, and classification map, are compared with those of other classifiers.
Polarimetric radars often measure the complex scattering matrix, [S], produced by a target under study, with the objective of inferring its physical properties. Assuming linear horizontal and vertical polarizations for transmitting and receiving, [S] can be expressed as,

$$[S] = \begin{bmatrix} S_{hh} & S_{hv} \\ S_{vh} & S_{vv} \end{bmatrix} \qquad (7.1)$$

The reciprocity theorem applies in a monostatic system configuration: $S_{hv} = S_{vh}$. For coherent scatterers only, decompositions of the measured scattering matrix [S] can be employed to characterize the scattering mechanisms of such targets.
One way to analyze coherent targets is the Pauli decomposition [43], which expresses [S] in the so-called Pauli basis,

$$[S]_a = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \quad [S]_b = \frac{1}{\sqrt{2}}\begin{bmatrix} 1 & 0 \\ 0 & -1 \end{bmatrix}, \quad [S]_c = \frac{1}{\sqrt{2}}\begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}$$

as,

$$[S] = \begin{bmatrix} S_{hh} & S_{hv} \\ S_{vh} & S_{vv} \end{bmatrix} = a[S]_a + b[S]_b + c[S]_c \qquad (7.2)$$

where $a = (S_{hh} + S_{vv})/\sqrt{2}$, $b = (S_{hh} - S_{vv})/\sqrt{2}$, and $c = \sqrt{2}\,S_{hv}$. Hence, by means of the Pauli decomposition, all polarimetric information in [S] can be represented in a single RGB image by combining the intensities $|a|^2$, $|b|^2$, and $|c|^2$, which determine the power scattered by different types of scatterers, such as single- or odd-bounce scattering, double- or even-bounce scattering, and orthogonal polarization returns from volume scattering. There are several other coherent decomposition theorems, such as the Krogager decomposition [60], the Cameron decomposition [61], and the SDH (Sphere, Diplane, Helix) decomposition [61], all of which aim to express the scattering matrix measured by the radar as a combination of the scattering responses of coherent scatterers.
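Equation (7.2) is easy to verify numerically. The sketch below computes the Pauli components for an arbitrary example scatterer (the matrix values are made up) and relies on the fact that the decomposition preserves the total scattered power.

```python
import numpy as np

def pauli_components(S):
    # Pauli decomposition of a 2x2 complex scattering matrix (Eq. (7.2)):
    # a = (Shh + Svv)/sqrt(2), b = (Shh - Svv)/sqrt(2), c = sqrt(2)*Shv.
    shh, shv, svv = S[0, 0], S[0, 1], S[1, 1]
    return ((shh + svv) / np.sqrt(2),
            (shh - svv) / np.sqrt(2),
            np.sqrt(2) * shv)

# Example scatterer (arbitrary values); reciprocity Shv = Svh assumed.
S_ex = np.array([[1 + 1j, 0.5j], [0.5j, 2 - 1j]])
a, b, c = pauli_components(S_ex)
```

Since the Pauli basis is orthonormal, $|a|^2 + |b|^2 + |c|^2$ equals the total power $|S_{hh}|^2 + 2|S_{hv}|^2 + |S_{vv}|^2$.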
Alternatively, the second order polarimetric descriptors, the 3 × 3 average polarimetric covariance ⟨[C]⟩ and coherency ⟨[T]⟩ matrices, can be derived from the scattering matrix and employed to extract physical information from the observed scattering process. The elements of the covariance matrix, [C], can be written in terms of the three unique polarimetric components of the complex-valued scattering matrix:

$$\begin{aligned} C_{11} &= S_{hh}S_{hh}^{*}, & C_{21} &= S_{hh}S_{hv}^{*} \\ C_{22} &= S_{hv}S_{hv}^{*}, & C_{32} &= S_{hv}S_{vv}^{*} \\ C_{33} &= S_{vv}S_{vv}^{*}, & C_{31} &= S_{hh}S_{vv}^{*} \end{aligned} \qquad (7.3)$$
For single-look processed polarimetric SAR data, the three polarimetric components (HH, HV, and VV) have a multivariate complex Gaussian distribution, and the complex covariance matrix form has a complex Wishart distribution [49]. Due to the presence of speckle noise and random vector scattering from surfaces or volumes, polarimetric SAR data are often multi-look processed by averaging over n neighboring pixels. Using the Pauli-based scattering vector for a pixel i, $k_i = [S_{hh} + S_{vv},\ S_{hh} - S_{vv},\ 2S_{hv}]^T/\sqrt{2}$, the multi-look coherency matrix, ⟨[T]⟩, can be written as,

$$\langle T \rangle = \frac{1}{n}\sum_{i=1}^{n} k_i k_i^{*T} \qquad (7.4)$$

Both the coherency ⟨[T]⟩ and covariance ⟨[C]⟩ matrices are 3 × 3 Hermitian positive semi-definite matrices, and since they can be converted into one another by a linear transform, both are equivalent representations of the target polarimetric information.
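Equation (7.4) can be sketched directly; the example scattering matrices below are arbitrary, and the test merely confirms the Hermitian positive semi-definite property stated in the text.

```python
import numpy as np

def coherency_matrix(S_list):
    # Multi-look coherency matrix <T> of Eq. (7.4): the average of k_i k_i^H
    # over n neighboring pixels, with the Pauli-based target vector
    # k_i = [Shh + Svv, Shh - Svv, 2*Shv]^T / sqrt(2).
    K = np.array([
        np.array([S[0, 0] + S[1, 1], S[0, 0] - S[1, 1], 2 * S[0, 1]]) / np.sqrt(2)
        for S in S_list])
    # Outer products k_i k_i^H, then the average over the n looks.
    return (K[:, :, None] * K.conj()[:, None, :]).mean(axis=0)

S_ex = np.array([[1 + 1j, 0.5j], [0.5j, 2 - 1j]])
T_avg = coherency_matrix([S_ex, 0.5 * S_ex])
```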
The incoherent target decomposition theorems such as the Freeman decom-
position, the Huynen decomposition, and the Cloude-Pottier (or H/a/A) decom-
position employ the second order polarimetric representations of PolSAR data
(such as covariance matrix or coherency matrix) to characterize distributed scat-
terers. The H/a/A decomposition [62] is based on eigenanalysis of the polarimetric
coherency matrix, h½T i:
hT i ¼ k1 e1 eT T T
1 þ k2 e 2 e 2 þ k3 e 3 e 3 ð7:5Þ
where k1 [ k2 [ k3 0 are real eigenvalues, e1 implies complex conjugate of e1
and eT1 is the transpose of e1 . The corresponding orthonormal eigenvectors ei
(representing three scattering mechanisms) are,
T
ei ¼ eiui cos ai ; sin ai cos bi eidi ; sin ai sin bi eici ð7:6Þ
Cloude and Pottier defined the entropy H, the averages of the set of four angles $\bar{\alpha}$, $\bar{\beta}$, $\bar{\delta}$, and $\bar{\gamma}$, and the anisotropy A for the analysis of the physical information related to the scattering characteristics of a medium:

$$H = -\sum_{i=1}^{3} p_i \log_3 p_i \quad \text{where}\ \ p_i = \frac{\lambda_i}{\sum_{i=1}^{3}\lambda_i}, \qquad (7.7)$$
$$\bar{\alpha} = \sum_{i=1}^{3} p_i \alpha_i, \quad \bar{\beta} = \sum_{i=1}^{3} p_i \beta_i, \quad \bar{\delta} = \sum_{i=1}^{3} p_i \delta_i, \quad \bar{\gamma} = \sum_{i=1}^{3} p_i \gamma_i, \qquad (7.8)$$

$$A = \frac{p_2 - p_3}{p_2 + p_3}. \qquad (7.9)$$
For a multi-look coherency matrix, the entropy, $0 \le H \le 1$, represents the randomness of a scattering medium, ranging between isotropic scattering (H = 0) and fully random scattering (H = 1), while the average alpha angle can be related to the target's average scattering mechanism, from single-bounce (or surface) scattering ($\bar{\alpha} \approx 0$), to dipole (or volume) scattering ($\bar{\alpha} \approx \pi/4$), to double-bounce scattering ($\bar{\alpha} \approx \pi/2$). Due to the basis invariance of the target decomposition, H and $\bar{\alpha}$ are roll invariant; hence they do not depend on the orientation of the target about the radar line of sight. Additionally, information about the target's total backscattered power is given by the span,
$$\mathrm{span} = \sum_{i=1}^{3} \lambda_i \qquad (7.10)$$
The entropy (H), the estimate of the average alpha angle ($\bar{\alpha}$), and the span calculated by the above noncoherent target decomposition method have been commonly used as the polarimetric features of a scatterer in many target classification schemes [43, 63].
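Equations (7.5)–(7.10) translate into a short eigenanalysis routine. This is a sketch under our assumptions (in particular, extracting $\alpha_i$ from the magnitude of the first eigenvector component, per Eq. (7.6)); the test uses the identity matrix, for which three equal eigenvalues must give H = 1 and A = 0.

```python
import numpy as np

def h_alpha_a_span(T):
    # Cloude-Pottier features (Eqs. (7.5)-(7.10)) from a coherency matrix:
    # entropy H, mean alpha angle, anisotropy A, and span.
    lam, vecs = np.linalg.eigh(T)              # ascending, real eigenvalues
    lam, vecs = np.clip(lam[::-1], 0, None), vecs[:, ::-1]
    p = lam / lam.sum()                        # pseudo-probabilities p_i
    H = float(-sum(pi * np.log(pi) / np.log(3) for pi in p if pi > 0))
    # cos(alpha_i) = |first component of e_i| (cf. Eq. (7.6))
    alphas = np.arccos(np.clip(np.abs(vecs[0, :]), 0, 1))
    alpha_mean = float(np.sum(p * alphas))
    A = float((p[1] - p[2]) / (p[1] + p[2]))
    return H, alpha_mean, A, float(lam.sum())  # span = sum of eigenvalues
```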
The first operation for SAR classification is naturally the feature extraction. The SAR feature extraction process presented herein utilizes the complete covariance matrix information, the GLCM-based texture features, and the backscattering power (span) combined with the H/α/A decomposition [50]. The feature vector from the Cloude–Pottier decomposition includes the entropy (H), the anisotropy (A), the estimates of the set of average angles ($\bar{\alpha}$, $\bar{\beta}$, $\bar{\delta}$, and $\bar{\gamma}$), the three real eigenvalues ($\lambda_1, \lambda_2, \lambda_3$), and the span. As suggested by previous studies [53, 56], appropriate
texture measures for SAR imagery based on the gray level co-occurrence proba-
bilities are included in the feature set to improve its discrimination power and
classification accuracy. In this study, contrast, correlation, energy, and homoge-
neity features are extracted from normalized GLCMs which are calculated using
interpixel distance of 2 and averaging over the four possible orientation settings (θ = 0°, 45°, 90°, 135°). To reduce the dimensionality (and redundancy) of the input
feature space, the principal components transform is applied to these inputs and the
most principal components (which contain about 95 % of overall energy in the
original feature matrix) are then selected to form a resultant feature vector for each
imaged pixel. Dimensionality reduction of the input feature information improves the efficiency of learning for a neural network classifier due to the smaller number of input nodes (avoiding the curse of dimensionality) [64] and reduces the computation time. For the purpose of normalizing and scaling the feature vector, each feature dimension is first normalized to have zero mean and unit standard deviation before principal component analysis (PCA) is applied, and following the PCA, the outputs are linearly scaled into the [−1, 1] interval.
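The normalization, PCA-based reduction to ~95 % of the overall energy, and [−1, 1] scaling can be sketched as one routine; the eigendecomposition-of-covariance formulation and the 95 % threshold handling are our assumptions.

```python
import numpy as np

def pca_reduce(F, energy=0.95):
    # Normalize each feature dimension to zero mean / unit standard deviation,
    # apply PCA, keep the leading components carrying ~95% of the overall
    # energy, then linearly scale the outputs into [-1, 1]. Sketch only.
    F = (F - F.mean(axis=0)) / F.std(axis=0)
    w, V = np.linalg.eigh(np.cov(F, rowvar=False))
    w, V = w[::-1], V[:, ::-1]                      # descending eigenvalues
    k = int(np.searchsorted(np.cumsum(w) / w.sum(), energy)) + 1
    P = F @ V[:, :k]                                # project onto top-k PCs
    span = P.max(axis=0) - P.min(axis=0)
    return 2 * (P - P.min(axis=0)) / span - 1
```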
In this section, two distinct training methods for RBF network classifiers, the
traditional back-propagation (BP) and particle swarm optimization (PSO), are
investigated. The RBF networks and the BP training algorithm are introduced in
Sect. 3.4.3.1. For the BP algorithm, the RPROP enhancement is used when training
RBF networks. The main difference in RPROP is that it adapts the update value of
each parameter according to the sequence of signs of the partial derivatives. This
only leads to faster convergence; the inherent problems of a hill-climbing algorithm
remain unsolved. Further details about BP and RPROP can be found in [65] and [66],
respectively. In order to determine a (near-) optimal network architecture for a given
problem, we apply exhaustive BP training over every network configuration in the
defined AS. For the training based on MD PSO, dynamic clustering based
on MD PSO is first applied to determine the optimal (with respect to minimizing a
given cost function for the input–output mapping) number of Gaussian neurons
with their correct parameters (centroids and variances). Afterwards, a single run of
BP can conveniently be used to compute the remaining network parameters, the
weights (w) and bias (θ) of each output-layer neuron. Note that once the
number of Gaussian neurons and their parameters are found, the rest of the RBF
network resembles an SLP, for which BP training yields a unique solution of weights
and biases. The overview of the classifier framework for polarimetric SAR images
is shown in Fig. 7.9.
Fig. 7.9 Overview of the evolutionary RBF network classifier design for polarimetric SAR
image [Figure: block diagram with the covariance matrix [C] and coherency matrix [T] inputs,
Lee speckle filter, Cloude-Pottier decomposition, span, GLCM texture features, PCA transform,
expert labeling of the training set, MD PSO dynamic clustering yielding {μ, σ, N, ω, θ}, the
RBFN classifier, and the resulting classification map]
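The RPROP modification mentioned above, per-parameter update values adapted by the sign sequence of the partial derivatives, can be sketched as follows. This is a generic, simplified RPROP step (without the weight-backtracking variant of [66]), not the library code; η+ = 1.2, η− = 0.5, and the step-size bounds are the commonly used defaults.

```cpp
#include <cassert>
#include <cmath>
#include <vector>
#include <algorithm>

// Generic RPROP-style step: each parameter keeps its own update value
// (step size), grown while the partial derivative keeps its sign and
// shrunk when the sign flips. Illustrative sketch only.
void rpropStep(std::vector<double>& w, const std::vector<double>& grad,
               std::vector<double>& prevGrad, std::vector<double>& step,
               double etaPlus = 1.2, double etaMinus = 0.5,
               double stepMin = 1e-6, double stepMax = 50.0) {
    for (size_t i = 0; i < w.size(); ++i) {
        double s = prevGrad[i] * grad[i];
        if (s > 0)      step[i] = std::min(step[i] * etaPlus, stepMax);
        else if (s < 0) step[i] = std::max(step[i] * etaMinus, stepMin);
        // Move against the gradient by the adapted, per-parameter step.
        if (grad[i] > 0)      w[i] -= step[i];
        else if (grad[i] < 0) w[i] += step[i];
        prevGrad[i] = grad[i];
    }
}
```

Note that only the sign of the gradient enters the weight move; the magnitude of the derivative is deliberately ignored, which is what makes RPROP robust to badly scaled error surfaces.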
7.3 Evolutionary RBF Classifiers for Polarimetric SAR Images 211
In this section, two test images of an urban area (San Francisco Bay, CA) and an
agricultural area (Flevoland, the Netherlands), both acquired by the NASA/Jet
Propulsion Laboratory’s AIRSAR at L-band, were chosen for performance eval-
uation of the RBF network classifier. Both datasets have been widely used in the
polarimetric SAR literature over the past two decades [57–59], and are distributed
as multi-look processed data, publicly available through the polarimetric SAR data
processing and educational tool (PolSARpro) by ESA [22]. The original four-look
fully polarimetric SAR data of the San Francisco Bay, having a dimension of
900 × 1,024 pixels, provides good coverage of both natural (sea, mountains,
forests, etc.) and man-made targets (buildings, streets, parks, golf course, etc.) with
a more complex inner structure. For the purpose of comparing the classification
results with the Wishart [43] and the NN-based [53] classifiers, the subarea
(Fig. 7.10) of size 600 × 600 pixels is extracted and used. The aerial photographs for
this area, which can be used as ground truth, are provided by the TerraServer Web
site [67]. In this study, no speckle filtering is applied to the originally four-look
processed covariance matrix data before GLCM-based texture feature generation,
in order to retain the resolution and to preserve the texture information. However,
additional averaging of the coherency matrix, such as using the polarimetry-
preserving refined Lee filter [68] with a 5 × 5 window, should be performed prior to
the Cloude-Pottier decomposition [50]. For the MD PSO-based clustering algorithm,
the typical internal PSO parameters (c1, c2, and w) are used as in [69], also explained
in [70]. For all experiments in this section, the two critical PSO parameters, the
swarm size (S) and the number of iterations (IterNo), are set to 40 and 1,000, respectively.
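For reference, the internal parameters c1, c2, and w enter the basic PSO update of [69] as follows (a one-coordinate sketch; the uniform U(0,1) draws r1 and r2 are passed in explicitly here, and MD PSO adds a dimensional component on top of this positional update):

```cpp
#include <cassert>
#include <cmath>

// Canonical PSO update for one scalar coordinate: inertia weight w plus
// the cognitive (c1) and social (c2) pulls toward the personal-best and
// global-best positions. r1, r2 are the random draws, made explicit.
void psoUpdate(double& x, double& v, double pbest, double gbest,
               double w, double c1, double c2, double r1, double r2) {
    v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (gbest - x);
    x += v;
}
```

In practice w is often decreased over the iterations to shift the swarm from exploration toward exploitation, which is the role it plays in the modified PSO of [69].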
Fig. 7.10 Pauli image of the 600 × 600 pixel subarea of San Francisco Bay (left) with the 5 × 5
refined Lee filter applied. The training and testing areas for the three classes are shown using red
rectangles and circles, respectively. The aerial photograph of this area (right), provided by the
U.S. Geological Survey and taken in October 1993, can be used as ground truth
212 7 Evolutionary Artificial Neural Networks
Table 7.4 Summary table of pixel-by-pixel classification results of the RBF-MDPSO classifier
over the training and testing area of San Francisco Bay dataset
             Training data                      Test data
             Sea      Urban    Vegetation      Sea      Urban    Vegetation
Sea          14,264   4        0               6,804    0        0
Urban        11       9,422    22              10       6,927    23
Vegetation   10       87       4,496           21       162      6,786
To test the performance of the RBF classifier and compare its classification
results, the same training and testing areas for the three classes from the San
Francisco subarea (as shown on the Pauli-based decomposition image in Fig. 7.10),
namely the sea (15,810 and 6,723 pixels, respectively), the urban areas (9,362 and
6,800), and the vegetated zones (5,064 and 6,534), which were manually selected in
an earlier study [53], are used. The confusion matrix of the evolutionary RBF
classifier on the training and testing areas is given in Table 7.4. The classification accuracy values
are averaged over 10 independent runs. From the results, the main drawback of this
classifier is the separation of vegetated zones from urban areas. Compared to the two
other competing techniques, this classifier is better able to differentiate the uniform
areas corresponding to the main scattering classes, such as the ocean, vegetation,
and building areas. In Table 7.5, the overall accuracies in training and testing areas
for the RBF classifier trained using the BP and MD PSO algorithms and two
competing methods, the Wishart maximum likelihood (WML) classifier [43] and
the NN-based classifier [53], are compared. The average accuracies over 10
independent runs for the best configuration of the RBF-BP and RBF-MDPSO
classifiers are reported. The RBF classifier trained by the global PSO algorithm is
superior to the NN-based, WML, and RBF-BP-based methods with higher accu-
racies in both training (99.50 %) and testing (98.96 %) areas. Figure 7.11 shows
the classification results on the whole subarea image for the RBF-MDPSO based
classifier. The classification map of the whole San Francisco Bay image produced
by the same classifier is given in Fig. 7.12 for a qualitative (visual) performance
evaluation. The evolutionary RBF classifier has a structure of 11 input neurons,
21 Gaussian neurons, whose cluster centroids and variances (μk and σk) are
determined by MD PSO-based dynamic clustering of the training data, and 3 output
neurons.
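The forward pass of such an RBF network, Gaussian units parameterized by the MD PSO-found centroids and variances followed by a linear output layer with weights w and biases θ, can be sketched as follows (a generic illustration, not the library code):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Generic RBF network forward pass: K Gaussian units followed by a
// linear output layer. centers[k] is the k-th centroid, sigma2[k] its
// variance, wOut[o][k] and bias[o] the output-layer parameters.
std::vector<double> rbfForward(const std::vector<double>& x,
                               const std::vector<std::vector<double>>& centers,
                               const std::vector<double>& sigma2,
                               const std::vector<std::vector<double>>& wOut,
                               const std::vector<double>& bias) {
    std::vector<double> phi(centers.size());
    for (size_t k = 0; k < centers.size(); ++k) {
        double d2 = 0.0;  // squared distance to centroid k
        for (size_t i = 0; i < x.size(); ++i) {
            double d = x[i] - centers[k][i];
            d2 += d * d;
        }
        phi[k] = std::exp(-d2 / (2.0 * sigma2[k]));  // Gaussian activation
    }
    std::vector<double> out(bias);  // start from the biases
    for (size_t o = 0; o < out.size(); ++o)
        for (size_t k = 0; k < phi.size(); ++k)
            out[o] += wOut[o][k] * phi[k];
    return out;
}
```

For the classifier above the shapes would be 11-dimensional inputs, 21 Gaussian units, and 3 outputs; the predicted class is the output neuron with the largest response.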
The classification results in Table 7.5 have been produced by using a high
percentage (60 %) of total (training and testing combined) pixels for training. The
RBF network classifier is also tested by limiting the pixels used for classifier
training to less than 1 % of the total pixels to be
classified. The results over the same testing dataset are shown in Table 7.6. In this
case, the classifier trained by the BP or MD PSO algorithm still performed at a
high level, achieving accuracies over 95 and 98 %, respectively. Generally, a
relatively smaller training dataset can avoid over-fitting and improve generaliza-
tion performance of a classifier over larger datasets.
Table 7.5 Overall performance comparison (in percent) for San Francisco Bay dataset
Method Training area Testing area
RBF-BP 98.00 95.70
WML [43] 97.23 96.16
NN [53] 99.42 98.64
RBF-MDPSO 99.50 98.96
Fig. 7.11 The classification results of the RBF-MDPSO classifier over the 600 × 600 sub-image
of San Francisco Bay (black denotes sea, gray urban areas, white vegetated zones)
Fig. 7.12 The classification results of the RBF-MDPSO technique for the original (900 × 1,024)
San Francisco Bay image (black denotes sea, gray urban areas, white vegetated zones)
Table 7.6 Overall performance (in percent) using a smaller training set (<1 % of total pixels) for
the San Francisco Bay dataset
Method Training area Testing area
RBF-BP 100 95.60
RBF-MDPSO 100 98.54
identified crop classes {stem beans, potatoes, lucerne, wheat, peas, sugar beet, rape
seed, grass, forest, bare soil, and water}. The available ground truth for 11 classes
can be found in [71]. To compare classification results, the same 11 training and
testing sets are used as those of the NN-based [53], wavelet-based [59], and
ECHO [72] classifiers. In Table 7.7, the overall accuracies in the training and testing
areas of the Flevoland dataset for the RBF classifier trained using the BP and MD
PSO algorithms and three state-of-the-art methods, the ECHO [72], wavelet-based
[59], and NN-based [53] classifiers, are shown. The overall classification accu-
racies of the RBF-based classifier framework are quite high. The percentage of
correctly classified training and testing pixels in the Flevoland L-band image for
the evolutionary (MD PSO) RBF method are given in Table 7.8. Figure 7.14 shows
the classification results of the proposed evolutionary RBF classifier for the
Flevoland image.
Fig. 7.13 Fitness score (top left) and dimension (bottom left) plots versus iteration number for a
typical MD PSO run. The resulting histogram (right) of the cluster numbers determined by
the MD PSO method
Table 7.7 Overall performance comparison (in percent) for Flevoland dataset
Method Training area Testing area
ECHO [72] – 81.30
Wavelet-based [59] – 88.28
RBF-BP 95.50 92.05
NN [53] 98.62 92.87
RBF-MDPSO 95.55 93.36
Table 7.8 Summary table of pixel-by-pixel classification results (in percent) of the RBF-MDPSO classifier over the training and testing data of Flevoland
Training (testing) data
Water Forest Stem Beans Potatoes Lucerne Wheat Peas Sugar Beet Bare Soil Grass Rape Seed
Water 99(98) 0(0) 0(0) 0(0) 0(0) 0(0) 0(0) 0(0) 0(0) 0(0) 1(2)
Forest 0(0) 95(97) 0(0) 0(0) 1(0) 0(0) 1(0) 1(0) 0(0) 2(3) 0(0)
Stem Beans 0(0) 0(0) 95(97) 0(0) 5(2) 0(1) 0(0) 0(0) 0(0) 0(0) 0(0)
Potatoes 0(0) 0(0) 0(0) 99(96) 0(0) 0(0) 0(0) 1(4) 0(0) 0(0) 0(0)
Lucerne 0(0) 0(0) 2(2) 0(0) 98(97) 0(0) 0(0) 0(0) 0(0) 0(1) 0(0)
Wheat 0(0) 0(0) 0(0) 0(0) 2(4) 91(86) 4(4) 1(3) 0(0) 2(3) 0(0)
Peas 0(0) 0(0) 0(0) 0(0) 0(0) 1(0) 94(88) 2(7) 0(0) 0(0) 3(5)
Sugar Beet 0(0) 0(0) 0(0) 0(0) 0(0) 0(2) 0(1) 95(91) 0(0) 4(5) 1(1)
Bare Soil 0(0) 0(0) 0(0) 0(0) 0(0) 0(2) 0(0) 0(0) 99(97) 0(0) 1(1)
Grass 0(0) 0(0) 0(0) 0(0) 1(0) 0(1) 0(0) 2(4) 0(0) 97(95) 0(0)
Rape Seed 2(2) 0(0) 0(0) 0(0) 0(0) 2(2) 1(2) 3(2) 3(7) 0(0) 89(85)
7.4 Summary and Conclusions
Fig. 7.14 The classification results on the L-band AIRSAR data over Flevoland
In this chapter, we focused on evolutionary ANNs by MD PSO with the
following innovative properties:
• With the proper adaptation, MD PSO can evolve to the optimum network within
an AS for a particular problem. Additionally, it provides a ranked list of all other
potential configurations, indicating that any high-ranked configuration may be an
alternative to the optimum one, whereas low-ranked ones, on the contrary, should
not be used at all.
• The evolutionary technique is generic and applicable to any type of ANN in an
AS with varying size and properties, as long as a proper hash function enu-
merates all configurations in the AS with respect to their complexity into hash
indices representing the dimensions of the solution space over which MD
PSO seeks optimality.
• Due to the MD PSO’s native feature of having better and faster convergence to
optimum solution in low dimensions, its evolution process in any AS naturally
yields to compact networks rather than large and complex ones, as long as the
optimality prevails.
Experimental results over synthetic datasets and particularly over several
benchmark medical diagnosis problems show that all the aforementioned
properties and capabilities are achieved in an automatic way; as a result, ANNs
(MLPs) evolved by MD PSO alleviate the need for human "expertise" and
"knowledge" in designing a particular network. Instead, such virtues may still be
used in a flexible way to define only the size and perhaps some crucial properties
of the AS. In this way, further efficiency in terms of speed and accuracy of the
global search and evolution process can be achieved.
We then performed MD PSO evolution over another type of feed-forward ANN,
namely the RBF networks. For evaluation purposes in a challenging applica-
tion domain, we presented a new polarimetric SAR image classification framework,
which is based on an efficient formation of covariance matrix elements, the H/α/A
decomposition with the backscattered power (span) information, GLCM-based
texture features, and the RBF network classifier.
rithms, the classical BP and MD PSO, were applied for the proposed classifier
training/evolution. In addition to determining the correct network parameters, MD
PSO also finds the best RBF network architecture (optimum number of Gaussian
neurons and their centroids) within an AS and for a given input data space. The
overall classification accuracies and qualitative classification maps for the San
Francisco Bay and Flevoland datasets demonstrate the effectiveness of the clas-
sification framework using the evolutionary RBF network classifier. Based on the
experimental results using real polarimetric SAR data, the classifier framework
performs well compared to several state-of-the-art classifiers; however, more
experiments using larger volumes of SAR data are needed for a general
conclusion.
*pW) for all MLP parameters. In this way, any MLP network can be represented
(for storage and easy initialization purposes) with a single buffer and the buffer
alone will be sufficient to revive it. This is the key property to store an entire MLP
AS within a single buffer and/or a binary file.
The class CDimHash is responsible for performing the hash function for any AS
from a family of MLP networks. Recall that for a particular AS, a range is defined
for the minimum and maximum number of layers, {Lmin, Lmax}, and for the
number of neurons of each hidden layer l, {N_min^l, N_max^l}. As presented in Table 7.11, the con-
structor of the class CDimHash receives these inputs as the variables: int
min_noL, int max_noL, int *min_noN, int *max_noN and hashes the entire AS
by the pointer array, CENNdim** m_pDim. Each m_pDim[] element stores a
distinct MLP configuration in a CENNdim object, which basically stores the
number of layers (m_noL) and number of neurons in each layer (m_pNoN[]).
Therefore, once the AS is hashed, the dth entry (MLP configuration) can be
retrieved by simply calling the GetEntry(int d) function. The CDimHash object can
also compute the buffer size of each individual MLP configuration in order to
allocate the memory space for its parameters. This is basically needed to compute
the overall buffer size for the entire AS, so that the memory space for
each consecutive MLP network can be allocated within the entire buffer. Once the
evolution process is over, the evolved MLPs (their parameters) in the AS can then
be stored in a binary file.
The ClassifierTestApp application primarily uses the static library Classifi-
erLib to perform evolution (training) and classification over a dataset, which is
partitioned into training and test sets. The library, ClassifierLib, then uses one of
the four static libraries MLPlibX, RBFlibX, SVMlibX, and RFlibX to
accomplish the task. Therefore, it is the controller library which contains six
primary classes: CMLP_BP, CMLP_PSO, CRBF_BP, CRBF_PSO, CRan-
domForest, and CSVM. The first four are the evolutionary ANNs (MLP and RBF
networks) with the two evolutionary techniques: MD PSO (CMLP_PSO and
CRBF_PSO) and exhaustive BP (CMLP_BP and CRBF_BP). The last two
classes are used to train the RF and the network topology of the SVMs. In this chapter,
we shall focus on evolutionary ANNs, and in particular on MLPs; therefore,
CMLP_PSO and CRBF_PSO will be detailed herein. Besides these six classes
for individual classifiers, the ClassifierLib library also has classes for imple-
menting CNBC topology: CNBCglobal and COneNBC, which can use any of the
six classes (classifier/evolution) as its binary classifier implementation. The pro-
gramming details of CNBC will be covered in Chap. 9.
The classes implementing the six classifiers are all inherited from the abstract base
class CGenClassifier, which defines the API member functions and variables.
Table 7.12 presents some of the member functions and variables of the CGen-
Classifier. The evolutionary technique will be determined by the choice of the
class, CMLP_PSO or CMLP_BP. The evolution (training) parameters for either
technique are then conveyed through the Init(train_params tp) function and its
train_params argument. Once the evolutionary MLP is initialized as such, it then
becomes ready for the underlying evolutionary process via calling the
Train(class_params cp) function.
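A skeletal view of this hierarchy might look as follows (heavily reduced: the parameter structs are stubs, only three of the API member functions are shown, and the bodies are placeholders rather than the real training logic):

```cpp
#include <cassert>

// Skeletal view of the classifier hierarchy: an abstract base defines
// the training/evolution API; each concrete classifier implements it.
struct train_params { int maxIter = 0; };  // stub
struct class_params { int maxNoL = 0; };   // stub: AS definition

class CGenClassifier {
public:
    virtual ~CGenClassifier() {}
    virtual int  Init(const train_params& tp) = 0;   // training setup
    virtual int  Train(const class_params& cp) = 0;  // evolution/training
    virtual void Exit() = 0;                         // clean-up
};

class CMLP_PSO : public CGenClassifier {
public:
    int Init(const train_params& tp) override { iter_ = tp.maxIter; return 0; }
    int Train(const class_params& cp) override {
        // Real code: MD PSO evolution over the AS defined by cp.
        return cp.maxNoL > 0 ? 0 : -1;
    }
    void Exit() override { iter_ = 0; }
private:
    int iter_ = 0;
};
```

The classifier is then used polymorphically through a CGenClassifier pointer, mirroring the pOneClassifier create/Init/Train/Exit pattern of Table 7.13.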
Table 7.13 The main entry function for evolutionary ANNs and classification
"diabetes3.dt"), each with a random shuffling of the train and test partitions. Any of
these function calls basically loads them into the following global variables:
float **train_input, **test_input;
float **train_output, **test_output;
int in_size, out_size, train_size, test_size = 0;
The first row of double arrays stores the FVs whereas the second row of double
arrays stores the target CVs for the train and test datasets, respectively. The vari-
ables in the third row are the sizes of the FVs and CVs (in_size, out_size) of each
individual data entry, and the sizes of the train and test datasets (train_size,
test_size). Once they are loaded from the data file (or from the function), they are
stored into a train_params object, one_tp, along with the rest of the training
parameters. Recall that this object will then be passed to the CMLP_PSO or
CMLP_BP object by calling the Init (train_params tp) function, where the ANN
evolution will be performed by the MD PSO process accordingly. Recall that the
AS is defined by the minimum and maximum number of layers, {Lmin, Lmax}, and
the two range arrays, Rmin = {N_I, N_min^1, ..., N_min^(Lmax-1), N_O} and
Rmax = {N_I, N_max^1, ..., N_max^(Lmax-1), N_O}, one for the minimum and the
other for the maximum number of neurons allowed in each layer of an MLP.
Letting Lmin = 2, these parameters (called classifier parameters, as they define
the AS for ANN classifiers) are then assigned into a class_params object, one_cp,
by calling:
//Put Classifier params..
one_cp.minNoN = minNoN; one_cp.maxNoN = maxNoN;
one_cp.max_noL = MAX_NOL;
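For illustration, the minNoN/maxNoN range arrays assigned above could be built as follows (the values are hypothetical; following the Rmin/Rmax definition, the first entry is fixed to the input-layer size N_I and the last to the output size N_O, with the entries in between bounding the hidden-layer widths):

```cpp
#include <cassert>
#include <vector>

// Hypothetical construction of one range array: index 0 holds the fixed
// input-layer size, the last index the output size, and the entries in
// between bound the neuron counts of the hidden layers.
std::vector<int> makeRange(int nIn, int nOut, int maxNoL, int hiddenBound) {
    std::vector<int> r(maxNoL, hiddenBound);
    r.front() = nIn;   // N_I: feature-vector size
    r.back()  = nOut;  // N_O: number of classes
    return r;
}
```

Calling makeRange twice, once with the lower and once with the upper hidden-layer bound, yields the Rmin and Rmax arrays of the text.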
After both one_tp and one_cp objects are fully prepared, the classifier object
can be created with the underlying evolution technique, initialized (with the
training parameters), and the evolution process can be initiated. Once it is completed,
the object can then be cleaned and destroyed. As shown in Table 7.13, all of these
are accomplished by calling the following code:
pOneClassifier = new CMLP_PSO(); // Create a MLP classifier..
pOneClassifier->Init(one_tp); // this for training..
pOneClassifier->SetActivationMode(aMode); // act. mode for MLPs..
pOneClassifier->Train(one_cp); // Start Evolution..…
…
pOneClassifier->Exit(); delete pOneClassifier; // Clean up..
Note that there is a piece of test code that is excluded from the compilation.
The first part of this code is to test the I/O functions of the classifier, i.e., first to
retrieve the AS buffer (pOneClassifier->GetClassBuffer(size);) and then to save
it into a file (with the name: *_buf.mlp). The second part involves creating a new
classifier object with the AS buffer obtained from the current evolution process.
Recall that the AS buffer is the best solution found by the underlying evolutionary
search process, in this case MD PSO since the classifier object was CMLP_PSO.
Note that this is not a re-evolution process from scratch, rather a new MD PSO
process where the previous best solution (stored within the AS buffer) is injected
into the process (by calling: pTest->Train(buf);). Thus, the new MD PSO process
will take the previous best solution into account to search for an ‘‘even-better’’
one. Finally, the last part of the code loads the AS from the file, which was saved
earlier with the pTest->Save(dir, tit); function call and, as in the previous process,
is injected into the new MD PSO process. Note that all three test code examples
clean up the internal memory by calling pTest->Exit().
As mentioned earlier, there are two types of evolutionary processes; one is the
ANN evolution as initiated with the pOneClassifier->Train(one_cp) call and
another with the injection of the last best solution, as initiated with the
pTest->Train(buf); call. We shall now cover the former implementation over the object
CMLP_PSO (the MD PSO evolutionary process over an AS of MLPs). The reader
will find it straightforward to follow the similar approach in the other
implementation and for the other objects.
Table 7.14 presents the function CMLP_PSO::Train(class_params cp) where
the MD PSO evolutionary process over an AS of MLPs is implemented. Recall
that the evolution (training) parameters are already fed into the CMLP_PSO
object and several function calls can be made each with a different AS
(class_params), which is immediately stored within the object and used to create
the AS object with the call:
CSolSpace<float>::s_pDimHash = new CDimHash(min_noL, max_noL,
minNoN, maxNoN);
Note that the AS object is a static pointer of the CSolSpace class since it is
primarily needed by the MD PSO process to find the true solution space dimension
(the memory space needed for weights and biases for that particular MLP con-
figuration) from the hash function. There can be several MD PSO runs (i.e.,
no_run) and within the loop of each run, a new MD PSO object is first created,
initialized with the MD PSO parameters (and SPSA parameters if needed), and
PropagateENN() is set as the fitness function. Note that the dimensional range,
[Dmin, Dmax], is set to [0, no_conf], where no_conf is the number of configurations
within the AS. The dimension 0 corresponds to an SLP and the highest dimension,
no_conf-1, corresponds to the MLP with maximum number of hidden layers and
neurons within.
The Boolean member parameter, bRandParams, determines whether an AS
buffer is already present within the object from an earlier run. If so, the AS
buffer, which is nothing but the best solution found in the previous run, is then
injected into the current MD PSO object via the
pPSO->Inject(m_pArSpBuf+m_hdrAS) call.
Table 7.15 presents the fitness function PropagateENN() in the CMLP_PSO
class. Recall that the position, pPos, of each MD PSO particle in its current
dimension, pPos->GetDim(), represents the potential solution, i.e., the MLP with
a configuration corresponding to that dimension. The configuration of the MLP can
be retrieved from the AS object by calling:
CENNdim* pConf = CSolSpace<float>::s_pDimHash->GetEntry(pPos->GetDim()); // get conf..
A new MLP is then created with the configuration stored in pConf and the
parameters of the MLP are assigned from the MD PSO particle position by calling:
CMLPnet *net = new CMLPnet (pConf->m_pNoN, pConf->m_noL);
net->initializeParameters(pPos->GetPos(), pPos->GetSize());
The rest of the function simply performs the forward propagation of the features
of the training dataset and computes the average training MSE, which is the fitness value.
Table 7.15 The fitness function of MD PSO process in the CMLP_PSO class
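In spirit, this fitness evaluation reduces to: decode the particle's current dimension into a network configuration, load the particle position as the network parameters, forward-propagate the training features, and return the average MSE. A stripped-down analog (hypothetical code; a single linear layer stands in for the full MLP of PropagateENN()):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Stripped-down analog of a PropagateENN-style fitness function: read
// the particle position as network parameters (here: d weights plus one
// bias of a single linear unit), propagate the training inputs, and
// return the average MSE, which MD PSO minimizes.
double fitnessMSE(const std::vector<double>& pos,            // particle position
                  const std::vector<std::vector<double>>& in,
                  const std::vector<double>& target) {
    size_t d = in[0].size();            // pos layout: d weights + 1 bias
    double mse = 0.0;
    for (size_t s = 0; s < in.size(); ++s) {
        double y = pos[d];              // bias
        for (size_t i = 0; i < d; ++i) y += pos[i] * in[s][i];
        double e = y - target[s];
        mse += e * e;
    }
    return mse / in.size();
}
```

In the real library the parameter layout and count per dimension come from the hashed AS entry, so the same particle-position buffer can parameterize any MLP configuration in the AS.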
References
1. W. McCulloch, W. Pitts, A logical calculus of the ideas immanent in nervous activity. Bull.
Math. Biophys. 7, 115–133 (1943)
2. Y. Chauvin, D.E. Rumelhart, Back Propagation: Theory, Architectures, and Applications
(Lawrence Erlbaum Associates Publishers, Mahwah, 1995)
29. C. Zhang, H. Shao, An ANN’s evolved by a new evolutionary system and its application, in
Proceedings of the 39th IEEE Conference on Decision and Control, vol. 4 (2000),
pp. 3562–3563
30. J. Salerno, Using the particle swarm optimization technique to train a recurrent neural model,
in Proceedings of IEEE Int. Conf. on Tools with Artificial Intelligence, (1997), pp. 45–49
31. M. Settles, B. Rodebaugh, T. Soule, Comparison of genetic algorithm and particle swarm
optimizer when evolving a recurrent neural network. Lecture Notes in Computer Science
(LNCS) No. 2723, in Proceedings of the Genetic and Evolutionary Computation Conference
2003 (GECCO 2003), (Chicago, IL, USA, 2003), pp. 151–152
32. L. Prechelt, Proben1—a set of neural network benchmark problems and benchmark rules,
Technical report 21/94, Fakultät für Informatik, Universität Karlsruhe, Germany, Sept 1994
33. S. Guan, C. Bao, R. Sun, Hierarchical incremental class learning with reduced pattern
training. Neural Process. Lett. 24(2), 163–177 (2006)
34. J. Zhang, J. Zhang, T. Lok, M.R. Lyu, A hybrid particle swarm optimization-back-
propagation algorithm for feed forward neural network training. Appl. Math. Comput. 185,
1026–1037 (2007)
35. K.J. Lang, M.J. Witbrock, Learning to tell two spirals apart, in Proceedings of the 1988
Connectionist Models Summer School, San Mateo, 1988
36. E. Baum, K. Lang, Constructing hidden units using examples and queries, in Advances in
Neural Information Processing Systems, vol. 3 (San Mateo, 1991), pp. 904–910
37. M.H. Hassoun, Fundamentals of Artificial Neural Networks (MIT Press, Cambridge, 1995)
38. R.S. Sexton, R.E. Dorsey, Reliable classification using neural networks: a genetic algorithm
and back propagation comparison. Decis. Support Syst. 30, 11–22 (2000)
39. E. Cantu-Paz, C. Kamath, An empirical comparison of combinations of evolutionary
algorithms and neural networks for classification problems. IEEE Trans. Syst. Man Cybern.
Part B, 35, 915–927 (2005)
40. G.-J. Qi, X.-S. Hua, Y. Rui, J. Tang, H.-J. Zhang, Image Classification with kernelized
spatial-context. IEEE Trans. Multimedia, 12(4), 278–287 (2010). doi:10.1109/
TMM.2010.2046270
41. E. Pottier, J.S. Lee, Unsupervised classification scheme of PolSAR images based on the
complex Wishart distribution and the H/A/α polarimetric decomposition theorem, in
Proceedings of the 3rd European Conference on Synthetic Aperture Radar (EUSAR 2000),
(Munich, Germany, 2000), pp. 265–268
42. J.J. van Zyl, Unsupervised classification of scattering mechanisms using radar polarimetry
data. IEEE Trans. Geosci. Remote Sens. 27, 36–45 (1989)
43. J.S. Lee, M.R. Grunes, T. Ainsworth, L.-J. Du, D. Schuler, S.R. Cloude, Unsupervised
classification using polarimetric decomposition and the complex Wishart classifier. IEEE
Trans. Geosci. Remote Sens. 37(5), 2249–2257 (1999)
44. Y. Wu, K. Ji, W. Yu, Y. Su, Region-based classification of polarimetric SAR images using
Wishart MRF. IEEE Geosci. Rem. Sens. Lett. 5(4), 668–672 (2008)
45. Z. Ye, C.-C. Lu, Wavelet-Based Unsupervised SAR Image Segmentation Using Hidden
Markov Tree Models, in Proceedings of the 16th International Conference on Pattern
Recognition (ICPR’02), (2002), pp. 729–732
46. C.P. Tan, K.S. Lim, H.T. Ewe, Image processing in polarimetric SAR images using a hybrid
entropy decomposition and maximum likelihood (EDML), in Proceedings of the
International Symposium on Image and Signal Processing and Analysis (ISPA), Sep 2007,
pp. 418–422
47. T. Ince, Unsupervised classification of polarimetric SAR image with dynamic clustering: an
image processing approach. Adv. Eng. Softw. 41(4), 636–646 (2010)
48. J.A. Kong, A.A. Swartz, H.A. Yueh, L.M. Novak, R.T. Shin, Identification of terrain cover
using the optimum polarimetric classifier. J. Electromagn. Waves Applicat. 2(2), 171–194
(1988)
49. J.S. Lee, M.R. Grunes, R. Kwok, Classification of multi-look polarimetric SAR imagery
based on complex Wishart distribution. Int. J. Rem. Sens. 15(11), 2299–2311 (1994)
50. S.R. Cloude, E. Pottier, An entropy based classification scheme for land applications of
polarimetric SAR. IEEE Trans. Geosci. Remote Sens. 35, 68–78 (1997)
51. K.U. Khan, J. Yang, W. Zhang, Unsupervised classification of polarimetric SAR images by
EM algorithm. IEICE Trans. Commun. 90(12), 3632–3642 (2007)
52. T.N. Tran, R. Wehrens, D.H. Hoekman, L.M.C. Buydens, Initialization of Markov random
field clustering of large remote sensing images. IEEE Trans. Geosci. Remote Sens. 43(8),
1912–1919 (2005)
53. Y.D. Zhang, L.-N. Wu, G. Wei, A new classifier for polarimetric SAR images, Progress in
Electromagnetics Research, PIER 94 (2009), pp. 83–104
54. L. Zhang, B. Zou, J. Zhang, Y. Zhang, Classification of polarimetric SAR image based on
support vector machine using multiple-component scattering model and texture features,
Eurasip J. Adv. Sig. Process. (2010). doi:10.1155/2010/960831
55. T. Ince, Unsupervised classification of polarimetric SAR image with dynamic clustering: an
image processing approach. Adv. Eng. Softw. 41(4), 636–646 (2010)
56. D.A. Clausi, An analysis of co-occurrence texture statistics as a function of grey level
quantization. Can. J. Remote Sens. 28(1), 45–62 (2002)
57. K. Ersahin, B. Scheuchl, I. Cumming, Incorporating texture information into polarimetric
radar classification using neural networks, in Proceedings of the IEEE International
Geoscience and Remote Sensing Symposium, Anchorage, USA, Sept 2004, pp. 560–563
58. L. Ferro-Famil, E. Pottier, J.S. Lee, Unsupervised classification of multifrequency and fully
polarimetric SAR images based on the H/A/Alpha-Wishart classifier. IEEE Trans. Geosci.
Remote Sens. 39(11), 2332–2342 (2001)
59. S. Fukuda, H. Hirosawa, A wavelet-based texture feature set applied to classification of
multifrequency polarimetric SAR images. IEEE Trans. Geosci. Remote Sens. 37(5),
2282–2286 (1999)
60. E. Krogager, J. Dall, S. Madsen, Properties of sphere, diplane and helix decomposition, in
Proceedings of the 3rd International Workshop on Radar Polarimetry, 1995
61. J.-S. Lee, E. Pottier, Polarimetric Radar Imaging: From Basics to Applications, in Optical
Science and Engineering, vol. 142 (CRC Press, Boca Raton, 2009)
62. S.R. Cloude, E. Pottier, A review of target decomposition theorems in radar polarimetry.
IEEE Trans. Geosci. Remote Sens. 34(2), 498–518 (1996)
63. C. Fang, H. Wen, W. Yirong, An improved Cloude–Pottier decomposition using h/a/span and
complex Wishart classifier for polarimetric SAR classification, in Proceedings of CIE, Oct
2006, pp. 1–4
64. S. Pittner, S.V. Kamarthi, Feature extraction from wavelet coefficients for pattern recognition
tasks. IEEE Trans. Pattern Anal. Machine Intell. 21, 83–88 (1999)
65. Y. Chauvin, D.E. Rumelhart, Back Propagation: Theory, Architectures, and Applications
(Lawrence Erlbaum Associates Publishers, Mahwah, 1995)
66. M. Riedmiller, H. Braun, A direct adaptive method for faster back propagation learning: the
RPROP algorithm, in Proceedings of the IEEE International Conference on Neural
Networks, 1993, pp. 586–591
67. U.S. Geological Survey Images [Online], http://terraserver-usa.com/
68. J.S. Lee, M.R. Grunes, G. de Grandi, Polarimetric SAR speckle filtering and its implications
for classification. IEEE Trans. Geosci. Remote Sens. 37(5), 2363–2373 (1999)
69. Y. Shi, R.C. Eberhart, A modified particle swarm optimizer, in Proceedings of the IEEE
Congress on Evolutionary Computation, 1998, pp. 69–73
70. S. Kiranyaz, T. Ince, A. Yildirim, M. Gabbouj, Evolutionary artificial neural networks by
multi-dimensional particle swarm optimization. Neural Netw. 22, 1448–1462 (2009).
doi:10.1016/j.neunet.2009.05.013
71. T.L. Ainsworth, J.P. Kelly, J.-S. Lee, Classification comparisons between dual-pol, compact
polarimetric and quad-pol SAR imagery. ISPRS J. Photogram. Remote Sens. 64, 464–471 (2009)
72. E. Chen, Z. Li, Y. Pang, X. Tian, Quantitative evaluation of polarimetric classification for
agricultural crop mapping. Photogram. Eng. Remote Sens. 73(3), 279–284 (2007)
Chapter 8
Personalized ECG Classification
S. Kiranyaz et al., Multidimensional Particle Swarm Optimization for Machine Learning
and Pattern Recognition, Adaptation, Learning, and Optimization 15,
DOI: 10.1007/978-3-642-37846-1_8, © Springer-Verlag Berlin Heidelberg 2014
Fig. 8.1 Overview of the patient-specific system: the first 5 min of beats of patient X are labeled per beat by an expert; morphological features are dimension-reduced and combined with temporal features, and MD PSO performs evolution and training over the ANN space
As discussed in the previous chapters, ANNs are powerful tools for pattern
recognition as they have the capability to learn complex, nonlinear surfaces among
different classes, and such ability can therefore be the key for ECG beat recognition
and classification [18]. Although many promising ANN-based techniques have
been applied to ECG signal classification [17–20], the global classifiers based on a
static (fixed) ANN have not performed well in practice. On the other hand, algorithms based on a patient-adaptive architecture have demonstrated significant performance improvement over conventional global classifiers [10, 12, 15]. Among these, one particular approach, a personalized ECG heartbeat pattern classifier based on evolvable block-based neural networks (BbNNs) using Hermite transform coefficients [15], achieved a performance significantly higher than the others. Although this recent work clearly demonstrates the advantage of using
evolutionary ANNs, which can be automatically designed according to the problem
(patient’s ECG data), serious drawbacks and limitations still remain. For instance,
there are around 10–15 parameters/thresholds that need to be set empirically with
respect to the dataset used and this obviously brings about the issue of robustness
when it is used for a different dataset. Another drawback can occur due to the
specific ANN structure proposed, i.e., the BbNN, which requires equal sizes for
input and output layers. Even more critical is that both the back propagation (BP) method used for training and the genetic algorithm (GA) used for evolving the network structure have certain deficiencies [21]. Recall in particular that BP is likely to get trapped in a local minimum, making it entirely dependent on the initial (weight) settings.
As demonstrated in the previous chapter, in order to address such deficiencies
and drawbacks, MD PSO technique can be used to search for the optimal network
configuration specifically for each patient and according to the patient’s ECG data.
In contrast to the specific BbNN structure used in [15], with the aforementioned problems, MD PSO is used to evolve traditional ANNs, and the focus is drawn particularly to the automatic design of MLPs. Such an evolutionary approach makes this system generic; that is, no assumption is made about the number of (hidden) layers, and in fact none of the network properties (e.g., feed-forward or not, differentiable activation function or not) is an inherent constraint. Recall that as long as the potential network configurations are transformed
into a hash (dimension) table with a proper hash function where indices represent
the solution space dimensions of the particles, MD PSO can then seek both
positional and dimensional optima in an interleaved PSO process. This approach
aims to achieve a high level of robustness with respect to the variations of the
dataset, since the system is designed with a minimum set of parameters, and in
such a way that their significant variations should not show a major impact on the
overall performance. Above all, using standard ANNs such as traditional MLPs,
instead of specific architectures (e.g., BbNN in [15]) further contributes to the
generic nature of this system and in short, all these objectives are meant to make it
applicable to any ECG dataset without any modifications (such as tuning the
parameters or changing the feature vectors, ANN types, etc.).
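As an illustration of such a hash (dimension) table, the sketch below enumerates a small MLP architecture space and assigns each configuration a dimension index. The bounds and the enumeration order here are hypothetical (the book defines its own architecture spaces and hash function); this is only a minimal sketch of the idea:

```python
from itertools import product

def build_dimension_table(r_min, r_max):
    """Enumerate every MLP configuration whose hidden-layer sizes lie
    between r_min and r_max (inclusive) and assign each one a dimension
    index. The input (first) and output (last) layer sizes are fixed."""
    hidden_ranges = [range(lo, hi + 1) for lo, hi in zip(r_min[1:-1], r_max[1:-1])]
    return {dim: (r_min[0], *hidden, r_max[-1])
            for dim, hidden in enumerate(product(*hidden_ranges))}

# Hypothetical bounds: 11 inputs, a single hidden layer of 2..4 neurons, 5 outputs.
table = build_dimension_table([11, 2, 5], [11, 4, 5])
print(table)  # {0: (11, 2, 5), 1: (11, 3, 5), 2: (11, 4, 5)}
```

With such a table, a particle's dimension component indexes a concrete network topology, so MD PSO can search positions (weights) and dimensions (architectures) in one interleaved process.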
8.1 ECG Classification by Evolutionary Artificial Neural Networks
In this section, the MIT-BIH arrhythmia database [22] is used for training and
performance evaluation of the patient-specific ECG classifier. The database con-
tains 48 records, each containing two-channel ECG signals for 30-min duration
selected from 24-h recordings of 47 individuals. Continuous ECG signals are
band-pass filtered at 0.1–100 Hz and then digitized at 360 Hz. The database
contains annotation for both timing information and beat class information verified
by independent experts. In the current work, so as to comply with the AAMI
ECAR-1987 recommended practice [13], we used 44 records from the MIT-BIH
arrhythmia database, excluding 4 records which contain paced heartbeats. The first
20 records (numbered in the range of 100–124), which include representative
samples of routine clinical recordings, are used to select representative beats to be
included in the common training data. The remaining 24 records (numbered in the
range of 200–234), contain ventricular, junctional, and supraventricular arrhyth-
mias. A total of 83,648 beats from all 44 records are used as test patterns for
performance evaluation. AAMI recommends that each ECG beat be classified into
the following five heartbeat types: N (beats originating in the sinus node),
S (supraventricular ectopic beats), V (ventricular ectopic beats), F (fusion beats),
and Q (unclassifiable beats). For all records, we used the modified-lead II signals
and the labels to locate beats in ECG data. The beat detection process is beyond
the scope of this chapter, as many highly accurate (&gt;99 %) beat detection algorithms have been reported in the literature [16, 23].
As suggested by the results from numerous previous works [10, 14, 23], both
morphological and temporal features are extracted and combined into a single
feature vector for each heartbeat to improve accuracy and robustness of the
classifier. The wavelet transform is used to extract morphological information
from the ECG data. The time-domain ECG signatures were first normalized by
subtracting the mean voltage before transforming into the time-scale domain using the
dyadic wavelet transform (DWT). According to the wavelet transform theory, the
multiresolution representation of the ECG signal is achieved by convolving the
signal with scaled and translated versions of a mother wavelet. For practical
applications, such as processing of sampled and quantized raw ECG signals, the
discrete wavelet transform can be computed by scaling the wavelet at the dyadic
sequence (2^j), j ∈ Z, and translating it on a dyadic grid whose interval is proportional to 2^j. The discrete WT is not only complete but also nonredundant, unlike the
continuous WT. Moreover, the wavelet transform of a discrete signal can be
efficiently calculated using the decomposition by a two-channel multirate filter
236 8 Personalized ECG Classification
Fig. 8.2 Sample beat waveforms, including normal (N), PVC (V), and APC (S) AAMI heartbeat
classes, selected from record 201 modified-lead II from the MIT/BIH arrhythmia database and
corresponding TI-DWT decompositions for the first five scales
8.1 ECG Classification by Evolutionary Artificial Neural Networks 237
abnormal heartbeats (i.e., PVC beats), two temporal features (i.e., the R–R time
interval and R–R time interval ratio) contribute to the discriminating power of
wavelet-based features, especially in discriminating morphologically similar
heartbeat patterns (i.e., Normal and APC beats).
In Fig. 8.3 (top), the estimated power spectrum of windowed ECG signal
(a 500 ms long Hanning window is applied before FFT to suppress high-frequency
components due to discontinuities in the end-points) from record 201 for N, V, and
S beats is plotted, while the equivalent frequency responses of the FIR filters, Q_j(ω), for
the first five scales at the native 360 Hz sampling frequency of the MIT-BIH data
are illustrated at the bottom part of the figure. After analyzing the DWT decom-
positions of different ECG waveforms in the database, and according to the power
spectra of ECG signal (the QRS complex, the P- and T-waves), noise, and artifact
in [28], we selected the W_{2^4}f signal (at scale 2^4) as the morphological feature of each heartbeat waveform. Based on the −3 dB bandwidth of the equivalent Q_4(ω) filter (3.9–22.5 Hz) in Fig. 8.3 (bottom), the W_{2^4}f signal is expected to contain most of the QRS
complex energy and the least amount of high-frequency noise and low-frequency
baseline wander. The fourth scale decomposition together with RR-interval timing
information was previously shown to be the best performing feature set for DWT-
based PVC beat classification in [4]. Therefore, a 180-sample morphological
feature vector is extracted per heartbeat from the DWT of the ECG signal at scale 2^4
Fig. 8.3 Power spectrum of windowed ECG signal from record 201 for normal (N), PVC (V),
and APC (S) AAMI heartbeat classes, and equivalent frequency responses of FIR digital filters for
a quadratic spline wavelet at 360 Hz sampling rate
238 8 Personalized ECG Classification
by selecting a 500 ms window centered at the R-peak (found by using the beat
annotation file). Each feature vector is then normalized to have a zero mean and a
unit variance to eliminate the effect of dc offset and amplitude biases.
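The windowing and normalization steps can be sketched as below. This is a minimal illustration rather than the authors' code: the wavelet decomposition itself is omitted (the function takes an already computed scale-2^4 signal), and only the 360 Hz sampling rate and the 500 ms (180-sample) window from the text are used.

```python
import numpy as np

FS = 360             # MIT-BIH sampling rate (Hz)
WIN = int(0.5 * FS)  # 500 ms window -> 180 samples

def morphological_feature(dwt_scale4, r_peak):
    """Take the 180-sample window centered at the R-peak from the
    scale-2^4 DWT signal and normalize it to zero mean and unit
    variance (removes dc offset and amplitude bias)."""
    lo = r_peak - WIN // 2
    seg = np.asarray(dwt_scale4[lo:lo + WIN], dtype=float)
    return (seg - seg.mean()) / seg.std()

# Toy usage on a synthetic signal with a hypothetical R-peak at sample 500;
# the constant 3.0 offset is removed by the normalization.
sig = np.sin(np.arange(2000) / 20.0) + 3.0
f = morphological_feature(sig, 500)
print(len(f))  # 180
```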
Fig. 8.4 Scatter plot of normal (N), PVC (V), and APC (S) beats from record 201 in terms of the
first and third principal components and RRi time interval
First, we shall demonstrate the optimality of the networks (with respect to the
training MSE), which are automatically evolved by the MD PSO method
according to the training set of an individual patient record in the benchmark
database. We shall then present the overall results obtained from the ECG clas-
sification experiments and perform comparative evaluations against several state-
of-the-art techniques in this field. Finally, the robustness of this system against
variations of major parameters will be evaluated.
Fig. 8.5 Error (MSE) statistics (minimum, mean, and median) from exhaustive BP training (top) and dbest histogram from 100 MD PSO evolutions (bottom) for patient record 222
Table 8.1 Summary table of beat-by-beat classification results for all 44 records in the MIT/BIH
arrhythmia database
Ground truth Actual Classification Results
N S V F Q
N 73,019 (40,532) 991 (776) 513 (382) 98 (56) 29 (20)
S 686 (672) 1,568 (1,441) 205 (197) 5 (5) 6 (5)
V 462 (392) 333 (299) 4,993 (4,022) 79 (75) 32 (32)
F 168 (164) 28 (26) 48 (46) 379 (378) 2 (2)
Q 8 (6) 1 (0) 3 (1) 1 (1) 1 (0)
Classification results for the testing dataset only (24 records from the range 200 to 234) are shown
in parenthesis
beats depending on the patient's heart rate, so less than 1 % of the total beats are used for training each neural network. The remaining beats (25 min) of each
record, in which 24 out of 44 records are completely new to the classifier, are used
as test patterns for performance evaluation.
Table 8.1 summarizes beat-by-beat classification results of ECG heartbeat
patterns for all test records. Classification performance is measured using the four
standard metrics found in the literature [10]: Classification accuracy (Acc), sen-
sitivity (Sen), specificity (Spe), and positive predictivity (Ppr). While accuracy
measures the overall system performance over all classes of beats, the other
metrics are specific to each class and they measure the ability of the classification
algorithm to distinguish certain events (i.e., VEBs or SVEBs) from nonevents (i.e.,
nonVEBs or nonSVEBs). The respective definitions of these four common metrics
using true positive (TP), true negative (TN), false positive (FP), and false negative
(FN) are as follows: Accuracy is the ratio of the number of correctly classified patterns to the total number of patterns classified, Acc = (TP + TN)/(TP + TN + FP + FN); Sensitivity is the rate of correctly classified events among all events, Sen = TP/(TP + FN); Specificity is the rate of correctly classified nonevents among all nonevents, Spe = TN/(TN + FP); and Positive Predictivity is the rate of correctly classified events among all detected events, Ppr = TP/(TP + FP). Since there is a large variation in the number of beats from different
classes in the training/testing data (i.e., 39,465/50,354 type-N, 1,277/5,716 type-V,
and 190/2,571 type-S beats), sensitivity, specificity, and positive predictivity are
more relevant performance criteria for medical diagnosis applications.
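These four definitions can be checked with a few lines of code; the counts used below are purely illustrative and do not come from the chapter's tables:

```python
def metrics(tp, tn, fp, fn):
    """Per-class metrics as defined in the text."""
    acc = (tp + tn) / (tp + tn + fp + fn)  # overall correctness
    sen = tp / (tp + fn)                   # correctly classified events
    spe = tn / (tn + fp)                   # correctly classified nonevents
    ppr = tp / (tp + fp)                   # correct among detected events
    return acc, sen, spe, ppr

# Hypothetical VEB-vs-rest counts (not taken from Table 8.1 or 8.2):
acc, sen, spe, ppr = metrics(tp=90, tn=880, fp=20, fn=10)
print(round(acc, 3), round(sen, 3), round(spe, 3), round(ppr, 3))  # 0.97 0.9 0.978 0.818
```

Note how a class-imbalanced problem can show high accuracy and specificity while sensitivity and positive predictivity remain much lower, which is exactly why the latter metrics matter here.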
The system presented in this section is compared with three state-of-the-art methods [10, 14, 15], which comply with the AAMI standards and use all
records from the MIT-BIH arrhythmia database. For comparing the performance
results, the problem of VEB and SVEB detection is considered individually. The
VEB and SVEB classification results over all 44 records are summarized in
Table 8.2. The performance results for VEB detection in the first four rows of
Table 8.2 are based on 11 test recordings (200, 202, 210, 213, 214, 219, 221, 228,
231, 233, and 234) that are common to all four methods. For SVEB detection,
comparison results are based on 14 common recordings (with the addition of records
212, 222, and 232) between the presented system and the methods in [14] and [15].
8.1 ECG Classification by Evolutionary Artificial Neural Networks 243
Table 8.2 VEB and SVEB classification performance of the presented method and comparison
with the three major algorithms from the literature
Methods VEB SVEB
Acc Sen Spe Ppr Acc Sen Spe Ppr
Hu et al. [10]1 94.8 78.9 96.8 75.8 N/A N/A N/A N/A
Chazal et al. [14]1 96.4 77.5 98.9 90.6 92.4 76.4 93.2 38.7
Jiang and Kong [15]1 98.8 94.3 99.4 95.8 97.5 74.9 98.8 78.8
Presented1 97.9 90.3 98.8 92.2 96.1 81.8 98.5 63.4
Jiang and Kong [15]2 98.1 86.6 99.3 93.3 96.6 50.6 98.8 67.9
Presented2 97.6 83.4 98.1 87.4 96.1 62.1 98.5 56.7
Presented3 98.3 84.6 98.7 87.4 97.4 63.5 99.0 53.7
1 The comparison results are based on 11 common recordings for VEB detection and 14 common recordings for SVEB detection
2 The VEB and SVEB detection results are compared for the 24 common testing records only
3 The VEB and SVEB detection results of the presented system for all training and testing records
Several interesting observations can be made from these results. First, for SVEB detection, the sensitivity and positive predictivity rates are considerably lower than those for VEB detection, while a high specificity is achieved. The reason for the worse classifier performance in detecting SVEBs is that the SVEB class is under-represented in the training data and hence more SVEB beats are misclassified as normal beats.
Overall, the performance of the presented system in VEB and SVEB detection is
significantly better than [10] and [14] for all measures and is comparable to the
results obtained with evolvable BbNNs in [15]. Moreover, it is observed that this
system achieves comparable performance over the training and testing set of patient
records. It is worth noting that the number of training beats used for each patient’s
classifier was less than 2 % of all beats in the training dataset, and the resulting classifiers designed by the MD PSO process have an improved generalization ability, i.e., the same low number of design parameters is used for all networks.
8.1.3.3 Robustness
Table 8.3 VEB and SVEB classification accuracy of the classification system for different PSO
parameters and architecture spaces
Percentage (%) I II III IV
VEB 98.3 98.2 98.3 98.0
SVEB 97.4 97.3 97.1 97.4
I: Rmin = {11, 8, 4, 5}, Rmax = {11, 16, 8, 5}; S = 100, I = 500
II: Rmin = {11, 8, 4, 5}, Rmax = {11, 16, 8, 5}; S = 250, I = 200
III: Rmin = {11, 8, 4, 5}, Rmax = {11, 16, 8, 5}; S = 80, I = 200
IV: Rmin = {11, 6, 6, 3, 5}, Rmax = {11, 12, 10, 5, 5}; S = 400, I = 500
any set of common PSO parameters within a reasonable range can be conveniently
used. Furthermore, for this ECG database, the choice of the architecture space does
not affect the overall performance, yet any other ECG dataset containing more
challenging ECG data might require the architecture spaces such as in IV, in order
to obtain a better generalization capability.
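Reading Table 8.3's Rmin and Rmax as per-layer minimum and maximum sizes (with the input and output layers fixed), the number of fixed-depth configurations in each architecture space can be counted as below. This is an illustrative sketch under that interpretation; the actual architecture spaces in the book also admit networks with fewer hidden layers.

```python
from math import prod

def space_size(r_min, r_max):
    """Number of fixed-depth MLP configurations when hidden layer h may
    take any size in [r_min[h], r_max[h]]; input/output layers are fixed."""
    return prod(hi - lo + 1 for lo, hi in zip(r_min[1:-1], r_max[1:-1]))

# Configurations I-III: 11 inputs, hidden layers of 8..16 and 4..8 neurons, 5 outputs.
print(space_size([11, 8, 4, 5], [11, 16, 8, 5]))         # 45
# Configuration IV: three hidden layers.
print(space_size([11, 6, 6, 3, 5], [11, 12, 10, 5, 5]))  # 105
```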
Holter registers [27] are ambulatory ECG recordings with a typical duration of
24–48 h and they are particularly useful for detecting some heart diseases such as
cardiac arrhythmias, silent myocardial ischemia, transient ischemic episodes, and
for arrhythmic risk assessment of patients, all of which may not be detected by a
short-time ECG [31]. Yet any process that requires a human, even an expert cardiologist, to examine more than a small amount of data can be highly error prone. A single Holter record usually contains more than 100,000 beats, which makes visual inspection almost infeasible, if not impossible. Therefore, the need for automatic techniques for analyzing such massive data is evident, and in that, it is crucial not to leave out significant beats since the diagnosis may depend on just a few of them. Moreover, the intra-signal variation in a typical Holter register is quite low, and the abnormal beats, which may indicate the presence of a potential disease, can be scattered along the signal. So, utilizing a dynamic
clustering technique based on MD PSO with FGBF, a systematic approach is
developed, which can summarize a long-term ECG record by discovering the
so-called master key-beats that are the representative or the prototype beats from
different clusters. With a great reduction in effort, the cardiologist can then perform
a quick and accurate diagnosis by examining and labeling only the master key-
beats, which in duration are no longer than few minutes of ECG record. The expert
labels over the master key-beats are then back propagated over the entire ECG
record to obtain a patient-specific, long-term ECG classification. As the main
application of the current work, this systematic approach is then applied over a real
(benchmark) dataset, which contains seven long-term ECG recordings [32].
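The two-pass scheme just described can be sketched end-to-end. The `cluster()` function below is a deliberately naive greedy stand-in for the MD PSO + FGBF dynamic clustering (the threshold `thr` and the toy 2-D features are hypothetical); only the overall flow — segment, extract key-beats, extract master key-beats, expert labeling, label back-propagation — mirrors the text.

```python
import numpy as np

SEG = 300  # beats per time segment (~5 min of ECG)

def cluster(X, thr=1.0):
    """Stand-in for MD PSO + FGBF dynamic clustering: greedily puts each
    vector into the first cluster whose centroid is within `thr`
    (Euclidean), otherwise opens a new cluster. Returns member indices."""
    clusters = []
    for i, x in enumerate(X):
        for c in clusters:
            if np.linalg.norm(X[c].mean(axis=0) - x) < thr:
                c.append(i)
                break
        else:
            clusters.append([i])
    return clusters

def key_beats(X):
    """Per cluster, pick the beat closest to the cluster centroid."""
    keys = []
    for c in cluster(X):
        centroid = X[c].mean(axis=0)
        keys.append(c[int(np.argmin([np.linalg.norm(X[i] - centroid) for i in c]))])
    return keys

def summarize(X):
    keys = []
    for s in range(0, len(X), SEG):                  # 1st pass: per segment
        keys += [s + k for k in key_beats(X[s:s + SEG])]
    return [keys[k] for k in key_beats(X[keys])]     # 2nd pass: master key-beats

def classify(X, masters, labels):
    """Back-propagate the expert labels: every beat receives the label
    of its nearest master key-beat."""
    return [labels[int(np.argmin([np.linalg.norm(x - X[m]) for m in masters]))]
            for x in X]

# Toy record: 600 "normal" beats with 30 well-separated "ventricular" beats inserted.
rng = np.random.default_rng(0)
normal = rng.normal(0.0, 0.05, size=(600, 2))
pvc = rng.normal(5.0, 0.05, size=(30, 2))
X = np.concatenate([normal[:300], pvc, normal[300:]])
masters = summarize(X)
labels = ["V" if X[m][0] > 2.5 else "N" for m in masters]  # the "expert" step
pred = classify(X, masters, labels)
print(len(masters), sorted(set(pred)), pred.count("V"))    # 2 ['N', 'V'] 30
```

The expert labels only `len(masters)` beats (here two), yet every one of the 630 beats receives a label, which is the whole point of the summarization.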
8.2 Classification of Holter Registers
Fig. 8.6 Overview of the systematic approach for long-term ECG classification: data acquisition and pre-processing (beat detection, feature extraction, temporal segmentation), first-pass dynamic clustering by MD PSO with FGBF to extract the key-beats, a second pass over the key-beats, expert labeling, and the final classification
ones used in the current work with the purpose of demonstrating the basic
performance level of this systematic approach. For both passes, the dynamic
clustering operation is performed using the same validity index given in Eq. (6.1) with α = 1. Recall that this is the simplest form, which is entirely parameter free; in addition, the L2 Minkowski norm (Euclidean distance) is used as the distance metric in the feature space.
As shown in Fig. 8.6, after the data acquisition is completed, the pre-processing
stage basically contains beat detection and feature extraction of the sampled and
quantized ECG signal. Before beat detection, all ECG signals are filtered to
remove baseline wander, unwanted power-line interference and high-frequency
noise from the original signal. This filtering unit can be utilized as part of heartbeat
detection process (for example, the detectors based on wavelet transforms [16]).
For all records, we used the modified-lead II signals and utilized the annotation
information (provided with the MIT-BIH database [22]) to locate beats in ECG
signals. The beat detection process is beyond the scope of this chapter, as many beat detection algorithms achieving over 99 % accuracy have been reported in the literature, e.g., [16] and [24]. Before feature extraction, the ECG signal is normalized to have a zero mean and unit variance to eliminate the effect of dc offset
and amplitude biases. After the beat detection over quasiperiodic ECG signals by
using RR-intervals, morphological and temporal features are extracted for each
beat as suggested in [14] and combined into a single characteristic feature vector
for each heartbeat. As shown in Fig. 8.7, temporal features relating to heartbeat
fiducial point intervals and morphology of the ECG signals are extracted by
sampling the signals. They are calculated separately for the first lead signals for
each heartbeat. Since the detection of some arrhythmias (such as bradycardia, tachycardia, and premature ventricular contraction) depends on the timing
sequence of two or more ECG signal periods [35], four temporal features are
considered in our study. They are extracted from heartbeat fiducial point intervals
(RR-intervals), as follows:
248 8 Personalized ECG Classification
Fig. 8.7 Sample beat waveforms, including normal (N), PVC (V), and APC (S) AAMI [13]
heartbeat classes from the MIT-BIH database. Heartbeat fiducial point intervals (RR-intervals)
and ECG morphology features (samples of QRS complex and T-wave) are extracted
presence of an abnormal heart activity, may get lost due to their low frequency of
occurrences. Therefore, we adopt a typical approach, which is frequently per-
formed in audio processing, that is, temporally segmenting data into homogeneous
frames.
Due to the dynamic characteristics of an audio signal, the frame duration is typically chosen between 20 and 50 ms in order to obtain as homogeneous a signal as possible, e.g., [6]. Accordingly, for a Holter register 24–48 h long, we choose a duration of about 5 min (300 beats) for the time segments, since the intra-segment variation along the time axis is often quite low. So performing a clustering
operation within such homogeneous segments will yield only one or a few clusters, except perhaps in the transition segments, where a change, morphological or temporal, occurs in the normal form of the ECG signal. No matter how short or insignificant the duration of this abnormal change might be, in such a limited time segment the MD PSO-based dynamic clustering technique can separate those
‘‘different’’ beats from the normal ones and group them into a distinct cluster. One
key-beat, which is the closest to the cluster centroid with respect to the distance
metric used in 21-D feature space, is then chosen as the ‘‘prototype’’ to represent
all beats in that cluster. Since the optimal number of clusters is extracted within each time segment, only a necessary and sufficient number of key-beats is thus used to represent all 300 beats in a time segment. Note that the possibility of missing
outliers is thus reduced significantly with this approach since one key-beat is
equally selected either from an outlier or a typical cluster without considering their
size. Yet redundancy among the key-beats of consecutive segments still exists,
since it is highly probable that similar key-beats shall occur among different
segments. This is the main reason for having the second pass, which performs
dynamic clustering over key-beats to obtain finally the master key-beats. They are
basically the ‘‘elite’’ prototypes representing all possible physiological heart
activities occurring during a long-term ECG record.
Since this is a personalized approach, each patient has, in general, normal beats
with possibly one or a few abnormal periods, indicating a potential heart disease or disorder. Therefore, ideally only a few master key-beats would be expected at the end, each representing a cluster of similar beats from each type.
For instance, one cluster may contain ventricular beats arising from ventricular
cavities in the heart and another may contain only junctional beats arising from
atrioventricular junction of the heart. Yet, due to the lack of discrimination power
of the morphological or temporal features or the similarity (distance) metric used,
the dynamic clustering operation may create more than one cluster for each
anomaly. Furthermore, the normal beats have a broad range of morphological
characteristics [7] and within a long time span of 24 h or longer, it is obvious that
the temporal characteristics of the normal beats may significantly vary too.
Therefore, it is reasonable to represent normal beats with multiple clusters rather
than only one. In short, several master key-beats may represent the same physi-
ological type of heart activity. The presentation of the master key-beats to the expert cardiologist can be performed in any appropriate way, as this is a visualization detail and hence beyond the scope of this work. Finally, the overall
Fig. 8.8 Excerpt of raw ECG data from patient record 14,046 in the MIT-BIH long-term
database. The three key-beats, having morphological and RR-interval differences, are chosen by
the systematic approach presented
Fig. 8.9 Excerpt of raw ECG data from patient record 14,172 in the MIT-BIH long-term
database. The key-beats extracted by the systematic approach are indicated
earlier works such as [34] and [7], both of which iteratively determine this number by an empirical threshold parameter.
Table 8.4 shows the overall results of the systematic approach over all patients
from the MIT-BIH long-term ECG database. Labels manually annotated by the
experts are used only for the master key-beats selected by the system. The clas-
sification of the entire ECG data, or in other words, the labeling of all beats
contained therein is then automatically accomplished by the BP of the master key-
beat labels, as explained in Sect. 8.2.2. The performance results tabulated in
Table 8.4 are calculated based on the differences between the labels generated by
the systematic approach presented and the expert supplied labels provided with the
database. The AAMI provides standards and recommended practices for reporting
252 8 Personalized ECG Classification
Table 8.4 Overall results for each patient in the MIT-BIH long-term database using the
systematic approach presented
Patient N S V F Q Accuracy (%)
14,046 105,308/105,405 0/1 9,675/9,765 34/95 0/0 99.79
14,134 38,614/38,766 0/29 9,769/9,835 641/994 0/0 98.80
14,149 144,508/144,534 0/0 235/264 0/0 0/0 99.96
14,157 83,340/83,412 6/244 4,352/4,368 53/63 0/0 99.62
14,172 58,126/58,315 77/1003 6,517/6,527 0/1 0/0 98.41
14,184 77,946/78,096 0/39 23,094/23,383 2/11 0/0 99.53
15,814 91,532/91,617 6/34 9,680/9,941 1,427/1,744 0/0 99.32
Total 599,374/600,145 89/1350 63,322/64,083 2,157/2,908 0/0
Weighted average 99.48
For each class, the number of correctly detected beats is shown relative to the total beats
originally present
errors might be fatal. According to the overall confusion matrix given in Table 8.5,
the (average) critical error level is below 0.3 % where major part of critical errors
occurred for the beats within class S due to the above-mentioned reasons specific
to the feature extraction method used in this study.
Since MD PSO is stochastic in nature, to test the repeatability and robustness of the systematic approach presented, we performed 10 independent runs on patient record 14,172, from which we had obtained the lowest performance, with an average classification accuracy of 98.41 %. All runs led to similar accuracy levels, with only a slight deviation of 0.2 %. It is also worth mentioning that, with an ordinary computer, the extraction of key-beats in a ~5 min time frame typically takes less than 1.5 min. Therefore, this systematic approach is quite suitable for a real-time
application, that is, the key-beats can be extracted in real time with a proper
hardware implementation during the recording of a Holter ECG.
help professionals focus on the most relevant data, it can also provide efficient and
robust solutions for much shorter ECG datasets too. Besides classification, with proper annotation, the master key-beats can also be used for the summarization of any long-term ECG data for a fast and efficient visual inspection, and they can further be useful for indexing Holter databases for fast and accurate information retrieval. On the other hand, a 0.5 % error rate, although it may seem quite insignificant for a short ECG dataset, can still be practically high for Holter registers because it corresponds to several hundred misclassified beats, some of which might be important for a critical diagnosis. Yet, recall that the optimality of the clustering algorithm depends on the CVI, the feature extraction method, and the distance metric, and in that, we use simple and typical ones so as to obtain a basic, unbiased performance level. Therefore, using, for instance, a more efficient CVI and better alternatives for the distance metric may further improve the performance.
References
24. S.G. Mallat, A Wavelet Tour of Signal Processing, 2nd edn. (Academic Press, San Diego,
1999)
25. S.G. Mallat, S. Zhong, Characterization of signals from multiscale edges. IEEE Trans. Pattern Anal. Machine Intell. 14, 710–732 (1992)
26. N.V. Thakor, J.G. Webster, W.J. Tompkins, Estimation of QRS complex power spectra for design of a QRS filter. IEEE Trans. Biomed. Eng. BME-31, 702–705 (1984)
27. M. Paoletti, C. Marchesi, Discovering dangerous patterns in long-term ambulatory ECG recordings using a fast QRS detection algorithm and explorative data analysis. Comput. Methods Programs Biomed. 82, 20–30 (2006)
28. S. Pittner, S.V. Kamarthi, Feature extraction from wavelet coefficients for pattern recognition tasks. IEEE Trans. Pattern Anal. Machine Intell. 21, 83–88 (1999)
29. N.J. Holter, New methods for heart studies. Science 134(3486), 1214–1220 (1961)
30. M. Lagerholm, C. Peterson, G. Braccini, L. Edenbrandt, L. Sörnmo, Clustering ECG
complexes using Hermite functions and self-organizing maps. IEEE Trans. Biomed. Eng.
47(7), 838–848 (2000)
31. PhysioBank, MIT-BIH long-term database directory [Online], http://www.physionet.org/
physiobank/database/ltdb/
32. P. Mele, Improving electrocardiogram interpretation in the clinical setting, J. Electrocardiol.
(2008). doi:10.1016/j.jelectrocard.2008.04.003
33. D. Cuesta-Frau, J.C. Pérez-Cortés, G. Andreu-García, Clustering of electrocardiograph signals in computer-aided Holter analysis. Comput. Methods Programs Biomed. 72(3), 179–196 (2003)
34. M. Korurek, A. Nizam, A new arrhythmia clustering technique based on ant colony
optimization, J. Biomed. Inform. (2008). doi:10.1016/j.jbi.2008.01.014
35. PhysioToolkit, The WFDB software package [Online], http://www.physionet.org/
physiotools/wfdb.shtml
36. W.J. Tompkins, J.G. Webster, Design of Microcomputer-Based Medical Instrumentation (Prentice Hall Inc, Englewood Cliffs, 1981), pp. 398–399. ISBN 0-13-201244-8
Chapter 9
Image Classification and Retrieval
by Collective Network of Binary
Classifiers
It is not the strongest of the species that survives, nor the most
intelligent that survives. It is the one that is the most adaptable
to change.
Charles Darwin
S. Kiranyaz et al., Multidimensional Particle Swarm Optimization for Machine Learning
and Pattern Recognition, Adaptation, Learning, and Optimization 15,
DOI: 10.1007/978-3-642-37846-1_9, © Springer-Verlag Berlin Heidelberg 2014
single but complex classifier, and the optimum classifier for the classification
problem at hand can be searched with the underlying evolutionary technique. At a
given time, this allows the creation and designation of a dedicated classifier for
discriminating a certain class type from the others based on a single feature. Each
incremental evolution session will ‘‘learn’’ from the current best classifier and can
improve it further. Moreover, with each incremental evolution, new classes/features can also be introduced, which signals the collective classifier network to create the corresponding new networks and classifiers within, so as to adapt dynamically to the change. In this way, the collective classifier network can dynamically scale itself to the indexing requirements of the media content data reserve while striving to maximize the classification and retrieval accuracies for a better user experience.
CBIR has been an active research field for which several feature extraction, classification, and retrieval techniques have been proposed to date. However, as the database size varies in time and usually grows larger, the overall classification and retrieval performance commonly deteriorates significantly.
Due to the reasoning given earlier, the current state-of-the-art classifiers such as
support vector machines (SVMs) [1, 2], Bayesian Classifiers, random forests (RFs)
[3], and artificial neural networks (ANNs) cannot provide optimal or even feasible
solutions to this problem. This fact drew the attention toward classifier networks
(or ensembles of classifiers). An earlier ensemble-of-classifiers approach, Learn++ [4], incorporates an ensemble of weak classifiers which can perform incremental learning of new classes, albeit at a steep cost: learning new classes requires an increasingly large number of classifiers for each new class to be learned. The resource allocating network with long-term memory (RAN-LTM) [5] can avoid this problem by using a single RBF network, which can be
incrementally trained by ‘‘memory items’’ stored in a long-term memory. How-
ever, RAN-LTM has a fixed output structure and thus, is not able to accommodate
a varying number of classes. For the incremental learning problem where new classes are dynamically introduced, some hierarchical techniques such as [6] and [7] have been proposed. They basically separate a single class from the previous classes within each hierarchical step, each step building on the previous one. One major drawback of this approach is the lack of parallelism, since the addition of N new classes results in N steps of adding one class at a time. Furthermore, the removal of an existing class is not supported and hence requires retraining of the entire classifier structure from scratch. None of the ensemble-of-classifier methods proposed so far supports feature scalability; thus, extracting a new feature will eventually make the current classifier ensemble obsolete and require a new design (and retraining) from scratch.
9.1 The Era of CBIR
Another major question that still remains in CBIR is how to narrow the
‘‘Semantic Gap’’ between the low-level visual features that are automatically
extracted from images and the high-level semantics and content-description by
humans. Among the wide variety of features proposed in the literature, none can really address this problem alone. Hence, the focus has shifted to fusing several features in the most effective way; yet, whatever type of classifier is used, the enlarged feature space may eventually cause the "Curse of Dimensionality" phenomenon, which significantly degrades the classification accuracy. In [8], three MPEG-7 visual features, the color layout descriptor (CLD), scalable color descriptor (SCD), and edge histogram descriptor (EHD), are fused to train several classifiers (SVMs, KNN, and Falcon-ART) over a small database with only two classes and 767
images. This work has clearly demonstrated that the highest classification accuracy
has been obtained with the proper feature fusion. This is indeed an expected out-
come since each feature may have a certain level of discrimination for a particular
class. In another recent work [9], this fact was confirmed once again: the authors fused three MPEG-7 features, SCD, the Homogeneous Texture Descriptor, and EHD, and trained SVMs over the same database. Basic color (12 dominant colors) and texture (DWT using quadrature mirror filters) features were used in [10] to annotate image databases using an ensemble of classifiers (SVMs and Bayes Point Machines). Although the classification is performed over a large image database with 25 K images and 116 classes from the Corel repository, the authors used more than 80 % of the database for training and another database to evaluate and manually
optimize various kernel and classifier parameters. In [11] SVMs together with 2-D
Hidden Markov Model (HMM) are used to discriminate image classes in an inte-
grated model. Two features, 50-D SIFT (with a dimension reduction by PCA) and
9-D color moments (CM) are used individually in two datasets using 80 % of the
images for training, and the classification accuracies are compared. In all these image classification works and many similar ones, besides the aforementioned key problems there are other drawbacks and limitations: they can work with only a limited feature set to avoid the "Curse of Dimensionality", and they use the major part of the database, some as high as 80 % or even more, for training in order to sustain a certain level of classification accuracy. They are all image classification methods for static databases, assuming a fixed GTD and a fixed set of features, where feature and class scalability and dynamic adaptability have not been considered.
In order to address these problems and hence maximize the classification accuracy, which will in turn boost the CBIR performance, this chapter focuses on a global framework design that embodies a collective network of (evolutionary) binary classifiers (CNBC). We shall demonstrate that one can achieve classifiers that are as compact as possible, which can be evolved and trained in a much more efficient way than a single but complex classifier, and that the optimum classifier for the classification problem at hand can be searched with an underlying evolutionary technique, e.g., as described in Chap. 7. At a given time, this allows the creation and designation of a dedicated classifier for discriminating a certain class type from the others based on a single feature. The CNBC can support a varying and large set of visual features, among which it optimally selects,
weights, and fuses the most discriminative ones for a particular class. Each NBC is
devoted to a unique class and further encapsulates a set of evolutionary binary
classifiers (BCs), each of which is optimally chosen within the architecture space
(AS), discriminating the class of the NBC with a unique feature. For instance, an NBC evolved for a sunset class will most likely select and use mainly color features (rather than texture and edge features), and the most descriptive color feature elements (i.e., color histogram bins) for this particular class (i.e., red, yellow, and black) are weighted higher than the others.
The CNBC framework is primarily designed to increase retrieval accuracy in CBIR over "dynamic databases", where variations can occur at any time in terms of (new) images, classes, features, and users' relevance feedback. Whenever a "change" occurs, the CNBC dynamically and "optimally" adapts itself to the change by means of its topology and the underlying evolutionary method. Therefore, it is not "just another classifier", as no static (or fixed) classifier, including state-of-the-art competitors such as ANNs, SVMs, and RFs, can really do that; e.g., if new feature(s) or class(es) are introduced, one needs to reform and retrain a new (static) classifier from scratch, which wastes the time and information accumulated so far. This topology shall prove useful for dynamic, content/data-adaptive classification in CBIR. In that sense, note that comparative evaluations against SVMs and RFs are performed only over "static databases", to show that the CNBC has a comparable or better performance, despite the fact that neither CBIR nor classification over static databases is the primary objective. Furthermore, the CNBC is not an "ensemble of classifiers", simply because, at any time t: (1) it contains an individual network of binary classifiers (NBC) for each class; (2) each NBC contains several evolving classifiers, each optimally selected from a family of classifiers (the so-called architecture space, AS), their number determined by the number of features present at that time; and (3) the number of NBCs is determined by the number of classes at time t. So the entire CNBC body is subject to change whenever the need arises. An ensemble of classifiers usually contains a certain number of "static" classifiers whose inputs/outputs (features and classes) must be fixed in advance; therefore, similar to standalone static classifiers, they cannot be used for dynamic databases.
9.2 Content-Based Image Classification and Retrieval Framework

This section describes in detail the image classification framework, the collective network of (evolutionary) binary classifiers (CNBC), which uses user-defined ground truth data (GTD) as the train dataset (as a common term, we shall still use "train dataset" or "train partition" to refer to the dataset over which the CNBC is evolved) to configure its internal structure and to evolve its binary classifiers (BCs) individually. Before going into the details of the CNBC, a general overview of this novel classification topology is introduced in Sect. 9.2.1.
Fig. 9.1 [figure: overview of the CNBC framework, showing feature extraction and normalization of visual features from new images and from the ground-truth data (with new classes and new features), the resulting feature vectors (FVs) and class vectors, incremental evolution from CNBC(t) to CNBC(t+1), and query-by-example retrieval over the image database]

An incremental evolution is performed only when the need arises, i.e., if the existing CNBC (or some NBCs in it) fails to classify the new classes accurately enough.
MUVIS system [12] is used to extract a large set of low-level visual features
that are properly indexed in the database along with the images. Unit normalized
feature vectors (FVs) formed from those features are fed into the input layer of the
CNBC, where the user-provided GTD is converted into target class vectors (CVs) to perform an incremental evolution operation. The user can also set the number of evolutionary runs or the desired level of classification accuracy. Any CNBC instance can
directly be used to classify a new image and/or to perform content-based image
queries, the result of which can be evaluated by the user who may introduce new
class(es), yielding another incremental evolution operation, and so on. This is an
ongoing cycle of human-classifier interaction, which gradually adapts CNBC to
the user’s class definitions. New low-level features can also be extracted to
improve the discrimination among classes, which signals CNBC to adapt to the
change simultaneously. In short, dynamic class and feature scalability are the key objectives of the CNBC design. Before going into the details of the CNBC framework, Sect. 9.2.2 first introduces the evolutionary update mechanism that keeps the best classifier networks in the AS during numerous incremental evolutionary runs.
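The evolutionary update mechanism can be sketched as follows; this is a minimal illustration (the function and variable names are hypothetical, and the fitness values echo Fig. 9.2), not the book's actual C/C++ implementation:

```python
# Minimal sketch of the evolutionary update over an architecture space (AS):
# after each evolutionary run, a configuration's record in the AS is replaced
# only if the run produced a better (lower) fitness, so the AS always holds
# the best network found so far for every configuration.

def update_architecture_space(arch_space, run_results):
    """arch_space: dict mapping configuration -> (best_fitness, params).
    run_results: dict mapping configuration -> (fitness, params) from one run."""
    for config, (fitness, params) in run_results.items():
        best = arch_space.get(config)
        if best is None or fitness < best[0]:
            arch_space[config] = (fitness, params)
    return arch_space

# Example with two of the five MLP configurations of Fig. 9.2 over three runs
arch_space = {}
update_architecture_space(arch_space, {"15x2": (0.24, "w1"), "15x2x2": (0.12, "w2")})
update_architecture_space(arch_space, {"15x2": (0.22, "w3"), "15x2x2": (0.10, "w4")})
update_architecture_space(arch_space, {"15x2": (0.25, "w5"), "15x2x2": (0.18, "w6")})

print(arch_space["15x2"])    # (0.22, 'w3'), the best over the three runs
print(arch_space["15x2x2"])  # (0.10, 'w4')
```

The AS thus accumulates knowledge across incremental evolution sessions: a later run can only improve, never degrade, the stored network of a configuration.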
Fig. 9.2 Evolutionary update in a sample AS for MLP configuration arrays Rmin = {15, 1, 2} and Rmax = {15, 4, 2}, where NR = 3 and NC = 5. The best run for each configuration is highlighted and the best configuration in each run is tagged with '*'. [figure: the AS is shown as a table of five MLP configurations (15x2, 15x1x2, 15x2x2, 15x3x2, 15x4x2) with the fitness achieved per run; e.g., 15x2x2 is the best configuration in runs #1 and #2, and 15x4x2 in run #3]
Fig. 9.3 Topology of the CNBC framework with C classes and N FVs [figure: the dataset features {FV_0, FV_1, ..., FV_{N-1}} are fed into the NBCs, whose outputs pass to a class selection block that produces the positive class(es) {c*}]

Each BC performs a binary classification in the 1-of-n encoding scheme; therefore, the output layer size of all binary classifiers is always two. Let CV_{c,1} and
CV_{c,2} be the first and second outputs of the cth BC's class vector (CV). The class selection in the 1-of-n encoding scheme can simply be performed by comparing the individual outputs, e.g., the outcome is positive if CV_{c,2} > CV_{c,1}, and negative otherwise. This also holds for the fuser BC, whose output is the output of its NBC. The FVs of each dataset item are fed into each NBC in the CNBC. Each FV drives (via forward propagation) its corresponding binary classifier in the input layer of the NBC. The outputs of these binary classifiers are then fed into the fuser binary classifier of each NBC to produce class vectors (CVs). The class selection block shown in Fig. 9.3 collects these outputs and selects the positive class(es) of the CNBC as the final outcome. This selection scheme differs with respect to the dataset class type: the dataset is called "uni-class" if an item can belong to only one class, and "multi-class" otherwise. Therefore, in a uni-class dataset there must be exactly one class, c*, selected as the positive outcome, whereas in a multi-class dataset there can be one or more NBCs, {c*}, with a positive outcome. The class selection scheme utilizes the winner-takes-all strategy. Assume without loss of generality that a CV of {0, 1} or {-1, 1} corresponds to a positive outcome where CV_{c,2} - CV_{c,1} is maximum. Therefore, for uni-class datasets, the positive class index, c* ("the winner"), is determined as follows:

c* = argmax_{c in [0, C-1]} (CV_{c,2} - CV_{c,1})    (9.1)
In this way, the erroneous cases (false negatives and false positives), where no NBC or more than one NBC yields a positive outcome, can be properly handled. However, for multi-class datasets the winner-takes-all strategy can only be applied when no NBC yields a positive outcome, i.e., CV_{c,2} <= CV_{c,1} for all c in [0, C-1]; otherwise, for an input set of FVs belonging to a dataset item, multiple NBCs with a positive outcome may indicate multiple true positives and hence cannot be further pruned. As a result, for a multi-class dataset the (set of) positive class indices, {c*}, is selected as follows:

{c*} = argmax_{c in [0, C-1]} (CV_{c,2} - CV_{c,1})    if CV_{c,2} <= CV_{c,1} for all c in [0, C-1]
{c*} = { c : CV_{c,2} > CV_{c,1}, c in [0, C-1] }      otherwise    (9.2)
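A minimal sketch of how Eqs. (9.1) and (9.2) could be implemented, assuming each NBC's class vector is given as a (CV_{c,1}, CV_{c,2}) pair; the function name and data layout are hypothetical:

```python
# Sketch of the CNBC class selection of Eqs. (9.1)-(9.2). Each NBC outputs a
# class vector (cv1, cv2); the outcome of NBC c is positive when cv2 > cv1.

def select_classes(cvs, multi_class=False):
    """cvs: list of (cv1, cv2) pairs, one per NBC (i.e., one per class)."""
    positives = [c for c, (cv1, cv2) in enumerate(cvs) if cv2 > cv1]
    if multi_class and positives:
        return positives  # Eq. (9.2): keep every NBC with a positive outcome
    # Winner-takes-all, Eq. (9.1); also used in Eq. (9.2) when no NBC is positive
    winner = max(range(len(cvs)), key=lambda c: cvs[c][1] - cvs[c][0])
    return [winner]

cvs = [(0.9, 0.1), (0.2, 0.8), (0.4, 0.6)]    # NBCs 1 and 2 respond positively
print(select_classes(cvs))                     # uni-class: [1], the largest margin
print(select_classes(cvs, multi_class=True))   # multi-class: [1, 2]
```

Note how the winner-takes-all branch also resolves the erroneous uni-class cases (no positive NBC, or several) mentioned above.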
The evolution of a subset of the NBCs or the entire CNBC is performed for each
NBC individually with a two-phase operation, as illustrated in Fig. 9.4. As
explained earlier, using the FVs and the target class vectors (CVs) of the training
Fig. 9.4 Illustration of the two-phase evolution session over the BCs' architecture spaces in each NBC [figure: phase 1 evolves the BCs in the first layer from the feature and target class vectors over their architecture spaces; the resulting class vectors feed the fuser BC]
dataset, the fuser binary classifier learns the significance of each individual binary classifier
(and its feature) for the discrimination of that particular class. This can be viewed
as the adaptation of the entire feature space to discriminate a specific class in a
large dataset. Alternatively, this can be viewed as an efficient feature selection
scheme over the set of FVs, by selecting the most discriminative FVs for a given
class. The fuser BC, if properly evolved and trained, can then "weight" each binary classifier (with its FV) accordingly. In this way, each feature (and its BC) is optimally "fused" according to its discrimination power with respect to each class. In short, the CNBC, if properly evolved, learns the significance (or discrimination power) of each FV and of its individual components.
In this chapter, the image databases are considered as uni-class where one
sample can belong to only one class, and during the evolution process, each
positive sample of one class can be used as a negative sample for all others.
However, if there is a large number of classes, an uneven distribution of positive and negative samples per class may bias the evolution process. In order to prevent this, a negative sample selection is performed in such a way that, for each positive sample, the number of negative samples is limited according to a predetermined positive-to-negative ratio (PNR). The selection of
negative samples is performed with respect to the closest proximity to the positive
sample so that the classifier can be evolved by discriminating those negative
samples (from the positive one) which have the highest potential for producing a
false-positive. Therefore, if properly trained, the classifier can draw the ‘‘best
possible’’ boundary between positive and negative samples, which shall in turn
improve the classification accuracy. The features of those selected items and the
classes will form the FVs of the training dataset, over which the CNBC body can
be created and evolved.
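The negative sample selection can be sketched as follows; the Euclidean distance metric and the per-positive limit are illustrative assumptions, not necessarily the exact procedure used by the authors:

```python
# Sketch of PNR-based negative sample selection: for each positive sample,
# keep only the closest negative samples (those most likely to cause a
# false positive), limited per positive sample by the PNR.

import math

def select_negatives(positives, negatives, neg_per_pos=2):
    """Return the negatives closest to any positive, at most neg_per_pos each."""
    selected = set()
    for p in positives:
        # Indices of negatives sorted by distance to this positive sample
        nearest = sorted(range(len(negatives)),
                         key=lambda i: math.dist(p, negatives[i]))
        selected.update(nearest[:neg_per_pos])
    return [negatives[i] for i in sorted(selected)]

positives = [(0.0, 0.0)]
negatives = [(0.1, 0.0), (5.0, 5.0), (0.0, 0.2), (9.0, 9.0)]
print(select_negatives(positives, negatives))  # the two negatives nearest the positive
```

Training a BC against such "hard" negatives pushes the decision boundary exactly where false positives would otherwise occur.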
In an incremental evolution session, each BC is initialized with the parameters retrieved from the last record of the AS of that BC.
as the initial point, and using the current training dataset with the target CVs, the
BP algorithm can then perform its gradient descent in the error space.
9.3 Results and Discussions

During the classification and CBIR experiments in this section, three major properties of the CNBC will be demonstrated: (1) its (incremental) evolutionary and dynamic adaptability, i.e., when new classes and features are introduced, whether it can adapt itself to the change with a "minimal" effort, meaning that the (incremental) evolution is applied only if there is a need for it and, if so, takes advantage of the previously accumulated knowledge; (2) its competitive performance against the state-of-the-art classifiers (SVMs and RFs) on
static databases and to demonstrate its parameter independence as the default
parameters are used for CNBC evolution while the best possible internal param-
eters are searched and used for the competitors for training, and finally, (3) the
application of the classifier to improve CBIR performance on dynamically varying
image databases (the main objective). On the other hand, since CBIR is the main application domain, designing a classifier network for massive databases with millions of images is not our objective, due to the well-known "semantic gap" phenomenon. Note that this cannot be achieved with classifier design alone, since low-level features have extremely limited, in practice no, discrimination power at such magnitudes. Obviously, designing highly discriminative features that can provide the needed description power is beyond the scope of this chapter; however, in some cases the CNBC can achieve a superior classification performance thanks to its "Divide and Conquer" approach. For instance, in a synthetic aperture radar (SAR) classification task [13], where the SAR features have indeed a high degree of discrimination, the CNBC was evolved using GTD from only 0.02 % of the SAR image pixels (out of a total exceeding a million) and achieved a classification accuracy above 99.6 %.
The MUVIS framework [12] is used to create and index the following two image databases by extracting 14 features for each image.
(1) Corel_10 Image Database: 1,000 medium-resolution (384 × 256 pixels) images obtained from the Corel repository [14], covering 10 diverse classes: 1—Natives, 2—Beach, 3—Architecture, 4—Bus, 5—Dino Art, 6—Elephant, 7—Flower, 8—Horse, 9—Mountain, and 10—Food.
(2) Corel_Caltech_30 Image Database: There are 4,245 images from 30 diverse
classes that are obtained from both Corel and Caltech [15] image repositories.
As detailed in Table 9.1, some basic color (e.g., the MPEG-7 Dominant Color Descriptor, HSV color histogram, and Color Structure Descriptor [16]), texture (e.g., Gabor [17], Local Binary Pattern [18], and Ordinal Co-occurrence Matrix [19]), and edge (e.g., Edge Histogram Direction [16]) features are extracted. Some of them are computed with different parameters to yield several features, and the total feature dimension is 2,335. Such a high feature space dimension gives us the opportunity to test the performance of the CNBC framework against the "curse of dimensionality" and its scalability with respect to a varying number of features.
Both databases are partitioned in such a way that the majority (55 %) of the items is spared for testing and the rest is used for evolving the CNBC. The evolution (and training) parameters are as follows. For MD PSO, the termination criterion is the combination of the maximum number of iterations allowed (iterNo = 100) and the cut-off error (eps_C = 10^-4). Other parameters were empirically set as: swarm size S = 50, Vmax = xmax/5 = 0.2, and VDmax = 5. For exhaustive BP, the learning parameter is set as lambda = 0.01 and the iteration number is 20. We use the typical hyperbolic tangent activation function, tanh(x) = (e^x - e^-x)/(e^x + e^-x). For the AS, we used simple configurations with the following range arrays: Rmin = {Ni, 8, 2} and Rmax = {Ni, 16, 2}, which indicate that, besides the single-layer perceptron (SLP), all MLPs have only a single hidden layer (i.e., Lmax = 2) with no more than 16 hidden neurons. Besides the SLP, the hash function enumerates all MLP configurations in the AS, as shown in Table 9.2. Finally, for both evolution methods, exhaustive BP and MD PSO, NR = 10 independent runs are performed. Note that for exhaustive BP this corresponds to 10 runs for each configuration in the AS.
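The enumeration of the AS described above can be sketched as follows, assuming configurations are represented simply as layer-size tuples (the representation and function name are hypothetical):

```python
# Sketch of enumerating the architecture space (AS) given the range arrays
# Rmin = {Ni, 8, 2} and Rmax = {Ni, 16, 2}: besides the single-layer
# perceptron (SLP), every MLP has one hidden layer with 8..16 neurons.

def enumerate_as(n_inputs, h_min=8, h_max=16, n_outputs=2):
    configs = [(n_inputs, n_outputs)]  # the SLP, with no hidden layer
    configs += [(n_inputs, h, n_outputs) for h in range(h_min, h_max + 1)]
    return configs

as_configs = enumerate_as(n_inputs=20)
print(len(as_configs))   # 1 SLP + 9 single-hidden-layer MLPs = 10 configurations
print(as_configs[0], as_configs[-1])
```

Each configuration index then serves as the key under which the evolutionary update of Sect. 9.2.2 stores the best network found.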
Three feature sets are used: 7 FVs (with a total feature dimension of 188), 10 FVs (with a total dimension of 636), and 14 FVs (all FVs, with a total dimension of 2,335). Therefore, the first CNBC has 7 + 1 = 8, the second has 10 + 1 = 11, and the third has 14 + 1 = 15 binary classifiers in each NBC. As for the competing methods, we selected two of the most powerful classifier networks, namely SVMs and RFs, despite the fact that they are static classifiers that cannot adapt dynamically to changes in features, classes, or any update in the training dataset. Therefore, as the features are increased from 7 to 14, new SVM networks and RFs are trained, whereas the CNBC dynamically adapts itself as mentioned earlier.
For SVM networks, we employ the libSVM library [2] using the one-against-one
topology [1]. Since in the CNBC framework, the optimal binary classifier con-
figuration within each NBC is determined by the underlying evolutionary method,
in order to provide a fair comparison, all possible classifier kernels and parameters
are also determined for the SVMs and RFs. For SVMs, all standard kernel types, such as linear, polynomial, radial basis function (RBF), and sigmoid, are individually used while searching for the best internal SVM parameters, e.g., the respective penalty parameter C = 2^n for n = 0,...,3 and the kernel parameter gamma = 2^-n for n = 0,...,3, whenever applicable to the kernel type. For the RF, the best number of trees within the forest is also searched from 10 to 50 in steps of 10.
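The exhaustive kernel/parameter search for the competitors can be sketched as a plain grid search; the `evaluate` callback stands in for actually training and validating an SVM (e.g., via libSVM) and is purely hypothetical:

```python
# Sketch of the exhaustive SVM parameter search: every kernel is tried with
# C = 2^n and gamma = 2^-n for n = 0..3, and the combination with the lowest
# validation error wins. 'evaluate(kernel, C, gamma)' is a hypothetical hook
# that would train an SVM and return its validation error.

from itertools import product

def grid_search(evaluate):
    kernels = ["linear", "polynomial", "rbf", "sigmoid"]
    grid = product(kernels, [2 ** n for n in range(4)], [2 ** -n for n in range(4)])
    return min(grid, key=lambda cfg: evaluate(*cfg))

# Toy evaluation function whose minimum is an RBF kernel with C = 4, gamma = 0.5
toy_error = lambda k, C, g: abs(C - 4) + abs(g - 0.5) + (0 if k == "rbf" else 1)
print(grid_search(toy_error))  # ('rbf', 4, 0.5)
```

The analogous RF search would simply loop the number of trees over 10, 20, ..., 50.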
Table 9.3 presents the average classification performances achieved over 10-fold random train/test set partitions of the Corel_10 database by the competing techniques and by the CNBC evolved with both evolutionary methods (exhaustive BP and MD PSO) over the sample AS given in Table 9.2. It is evident that all SVM networks with different kernels, as well as the RF, suffered from the increased feature space dimension, as a natural consequence of the "Curse of Dimensionality". In particular, SVMs with RBF and sigmoid kernels cannot be trained properly and thus exhibit severe classification degradation on the test set. Hence, with 14 features, the best classification accuracy is achieved by the CNBC evolved with exhaustive BP. We can thus anticipate that the performance gap may widen even further if more features are involved. Between the two evolutionary techniques, the results indicate that MD PSO achieves the lowest MSE and classification error levels (and hence the best results) on the training set, whereas the opposite is true for exhaustive BP on the test set. The CNBC in general demonstrates a solid improvement against the major feature dimension increase [i.e., from 7 subfeatures (188-D) to 14 subfeatures (2,335-D)], since the classification performance does not show any deterioration; on the contrary, with both evolutionary techniques a better test classification performance is achieved.
Table 9.3 Average classification performance of each evolution method per feature set by 10-
fold random train/test set partitions in Corel_10 database
Feature set Classifier Train MSE Train CE Test MSE Test CE
7 sub-features SVM (Linear) 0 0 3.56 13.76
SVM (Polynom.) 0.28 0 3.53 13.8
SVM (RBF) 0 0 4.35 16.87
SVM (SIGMOID) 3.51 12.78 4.96 18.07
Random Forest 0.04 0.2 4.96 17.58
CNBC (MD PSO) 0.52 2.42 1.33 16.49
CNBC (BP) 0.47 5.56 1.21 16.44
10 sub-features SVM (Linear) 0 0 3.48 13.84
SVM (Polynom.) 0.16 0.58 3.87 14.65
SVM (RBF) 0 0 6.92 29.87
SVM (SIGMOID) 16.15 83.22 16.99 86.86
Random Forest 0.06 0.3 4.82 16.33
CNBC (MD PSO) 0.88 5.56 5.43 15.91
CNBC (BP) 0.36 4.22 1.19 14.43
14 sub-features SVM (Linear) 0 0 3.59 14.8
SVM (Polynom.) 0 0 3.59 14.56
SVM (RBF) 0 0 10.55 40.45
SVM (SIGMOID) 19.25 88.7 20.43 91.07
Random Forest 0.09 0.47 4.76 17.09
CNBC (MD PSO) 0.44 5.52 6.33 14.41
CNBC (BP) 0.37 4.56 1.21 13.43
The best classification performances in the test set are highlighted
The CNBC evolutions performed so far are much like the (batch) training of traditional classifiers, where the training data (the features) and the (number of) classes are all fixed and the entire GTD is used during the training (evolution). As detailed earlier, incremental evolutions can be performed whenever new features/classes
Table 9.4 Confusion matrix of the evolution method, which produced the best (lowest) test
classification error in Table 9.3
Actual 1 2 3 4 5 6 7 8 9 10
Truth 1 42 2 1 1 0 5 0 0 1 3
2 2 37 4 1 0 0 0 1 9 1
3 2 3 46 1 0 1 0 0 1 1
4 2 0 0 53 0 0 0 0 0 0
5 0 0 0 0 55 0 0 0 0 0
6 2 4 2 0 0 37 0 1 8 1
7 1 0 0 0 0 0 53 1 0 0
8 0 0 0 0 0 0 0 55 0 0
9 0 8 1 0 0 0 0 0 46 0
10 1 1 0 1 1 2 1 0 2 46
can be introduced, and the CNBC can dynamically create new binary classifiers and/or NBCs as the need arises. In order to evaluate the incremental evolution performance, a fixed set of 10 features (FVs with indices 1, 2, 3, 4, 5, 6, 11, 12, 13, and 14 in Table 9.1, with a total dimension of 415) is used, and the GTD is divided into three distinct partitions containing 5 (classes 1–5), 3 (classes 6–8), and 2 (classes 9 and 10) classes, respectively. Therefore, three stages of incremental evolution have been performed where, at each stage except the first, the CNBC is further evolved using both the new and the old GTD. During the second stage, three out of five existing NBCs failed the verification test (performed below 95 % classification accuracy with the new GTD of classes 6–8) and thus were incrementally evolved. Finally, at the third stage, 4 out of 8 existing NBCs did not undergo incremental evolution, since they passed the verification test over the training dataset of the new classes (9 and 10), while the others failed and had to be incrementally evolved.
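The verification test that gates incremental evolution can be sketched as follows; the `accuracy` and `evolve` hooks and the data layout are hypothetical placeholders for the actual training machinery:

```python
# Sketch of the verification test gating incremental evolution: an existing
# NBC is re-evolved only if its accuracy over the new GTD falls below the
# 95 % verification threshold; NBCs that still pass are left untouched.

def incremental_evolution(nbcs, new_gtd, accuracy, evolve, threshold=0.95):
    re_evolved = []
    for name, nbc in nbcs.items():
        if accuracy(nbc, new_gtd) < threshold:
            nbcs[name] = evolve(nbc, new_gtd)  # further evolve only this NBC
            re_evolved.append(name)
    return re_evolved

nbcs = {"c1": "net1", "c2": "net2", "c3": "net3"}
acc = lambda nbc, gtd: {"net1": 0.99, "net2": 0.80, "net3": 0.90}[nbc]
evolved = incremental_evolution(nbcs, "gtd_6_8", acc, lambda n, g: n + "_v2")
print(evolved)  # ['c2', 'c3'] failed the verification test and were re-evolved
```

This is what keeps the incremental sessions cheap: only the NBCs that the new GTD actually confuses are touched.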
Table 9.5 presents the confusion matrices achieved at each incremental evo-
lution stage over the test sets of the GTD partitions involved. It is worth noting that
the major source of error results from the confusion between the 2nd (Beach), 3rd
(Architecture), and 9th (Mountain) classes where low-level features cannot really
discriminate the classes due to excessive color and texture similarities among
them. This is the reason class 2 and class 3 have undergone incremental evolution
at each stage; however, a significant lack of discrimination still prevailed. On the
other hand, at the end of stage 3, high classification accuracies are achieved for
classes 1, 4, and particularly 5 (and also for 7, 8, and 10).
Table 9.6 presents the final classification performance of each evolution method per feature set. The results indicate only slight losses in both training and test classification accuracies, which is expected since the incremental evolution was purposely skipped for those NBCs that surpassed 95 % classification accuracy over the training dataset of the new classes. This means, for instance, that some NBCs (e.g., the one corresponding to class 4, Bus) were evolved over only a fraction of the entire training dataset.
Table 9.5 Test dataset confusion matrices for evolution stages 1 (top), 2 (middle) and 3 (bottom)
Actual 1 2 3 4 5
Truth 1 47 3 2 1 2
2 1 49 5 0 0
3 1 14 36 4 0
4 2 3 1 49 0
5 0 0 0 0 55
Actual 1 2 3 4 5 6 7 8
Truth 1 44 2 3 0 1 1 3 1
2 1 43 8 0 1 1 0 1
3 3 4 41 2 0 2 3 0
4 1 3 2 48 0 0 1 0
5 0 0 0 0 55 0 0 0
6 2 11 3 0 2 32 5 0
7 1 0 1 1 0 0 52 0
8 0 0 0 0 1 0 0 54
Actual 1 2 3 4 5 6 7 8 9 10
Truth 1 38 0 0 1 2 3 0 1 3 7
2 3 23 2 0 0 1 0 1 24 1
3 8 0 22 1 0 2 1 0 19 2
4 1 2 0 39 0 1 0 0 10 2
5 0 0 0 0 55 0 0 0 0 0
6 4 2 0 0 3 38 0 0 8 0
7 0 1 0 0 1 1 50 0 2 0
8 0 0 0 0 1 0 0 54 0 0
9 1 5 2 1 0 3 0 0 43 0
10 0 1 0 0 3 0 0 0 0 51
Table 9.6 Final classification performance of the 3-stage incremental evolution for each evo-
lution method and feature set for Corel_10 database
Feature set Evol. method Train MSE Train CE Test MSE Test CE
7 sub-features MD PSO 1.36 6.89 2.61 28.63
Exhaustive BP 0.82 4.22 1.85 21.63
14 sub-features MD PSO 1.23 6.66 2.39 26.36
Exhaustive BP 0.91 7.55 1.83 20.81
Finally, the CNBC evolution for Corel_Caltech_30 database allows testing and
evaluation of its classification performance when the number of classes along with
the database size is significantly increased. For both evolution techniques, we used
the same parameters as presented earlier except that the number of epochs
Table 9.7 Classification performance of each evolution method per feature set for Corel_Cal-
tech_30 database
Feature set Evol. method Train MSE Train CE Test MSE Test CE
7 subfeatures MD PSO 0.54 8.1 2.3 33.40
Exhaustive BP 0.24 2.95 2.16 34.67
14 subfeatures MD PSO 0.33 5.47 2.52 36.33
Exhaustive BP 0.074 1.31 2.69 33.86
(iterations) for BP and MD PSO were increased to 200 and 500. Table 9.7 presents
the classification performances of each evolution method per feature set. As
compared with the results from Corel_10 database in Table 9.3, it is evident that
both evolution methods achieved a similar classification performance over the
training set (i.e., similar train classification errors) while certain degradation
occurs in the classification performance in the test set (i.e., 10–15 % increase in
the test classification errors). This is an expected outcome, since the lack of discrimination within those low-level features eventually yields poorer generalization, especially when the number of classes is tripled. This indicates that higher discrimination power, through the addition of new, more powerful features, is needed to achieve a similar test classification performance in large image databases.
The retrieval performance is measured via the (average) normalized modified retrieval rank:

AVR(q) = (1/N(q)) sum_{k=1}^{N(q)} R(k),   with W = 2N(q)

NMRR(q) = (2 AVR(q) - N(q) - 1) / (2W - N(q) + 1)    (9.4)

ANMRR = (1/Q) sum_{q=1}^{Q} NMRR(q)
where N(q) is the minimum number of relevant (via ground-truth) images in a set
of Q retrieval experiments, R(k) is the rank of the kth relevant retrieval within a
window of W retrievals that are taken into consideration for each query q. If there are fewer than N(q) relevant retrievals among the W, then a rank of W + 1 is assigned to the remaining (missing) ones. AVR(q) is the average rank obtained for the query q. Since each query item is selected from within the database, the first retrieval will always be the queried item itself; this would obviously yield a biased NMRR(q) and it is, therefore, excluded from the ranking. Hence, the first relevant retrieval, R(1), is ranked by counting the number of irrelevant images before it. Note that if all N(q) retrievals are relevant, then NMRR(q) = 0, the best retrieval performance; on the other hand, if none of the relevant items is retrieved among the W, then NMRR(q) = 1, the worst case. Therefore, the lower the NMRR(q), the better (more relevant) the retrieval for the query q. Both performance criteria are computed by querying all images in the database (i.e., a batch query) within a retrieval window equal to the number of ground truth images, N(q), for each query q. This makes the average precision (AP) identical to the average recall.
Over each database, six batch queries are performed to compute the average
retrieval performances, four with and two without using CNBC. Whenever used,
the CNBC is evolved with either MD PSO or the exhaustive BP, each over both
the 7- and 14-subfeature sets. As listed in Table 9.8, it is evident that
the CNBC can significantly enhance the retrieval performance regardless of the
evolution method, the feature set, and the database size. The results (without
CNBC) in the table also confirm the enhanced discrimination obtained from the
Table 9.8 Retrieval performances (%) of the four batch queries in each MUVIS database
Feature set Retrieval method Corel_10 Corel_Caltech_30
ANMRR AP ANMRR AP
7 subfeatures CNBC (MD PSO) 31.09 65.01 43.04 54.47
CNBC (BP) 23.86 74.26 46.44 52.21
Traditional 55.81 42.15 60.21 37.80
14 subfeatures CNBC (MD PSO) 29.84 67.93 33.29 61.12
CNBC (BP) 22.21 76.20 32.00 65.37
Traditional 47.19 50.38 62.94 34.92
9.3 Results and Discussions 279
Fig. 9.5 8 sample queries in Corel_10 (qA and qB), and Corel_Caltech_30 (qC and qD)
databases with and without CNBC. The top-left image is the query image
larger feature set, which led to a better classification performance and, in turn, to
a better retrieval performance.
For visual evaluation, Fig. 9.5 presents four typical retrieval results with and
without CNBC. All query images are selected among the test set and the query is
processed within the entire database. Table 9.9 presents the retrieval performances
obtained from each batch query operation of the incrementally evolved CNBCs,
whose corresponding confusion matrices at each stage are presented in
Table 9.5. It is evident that the final CNBC (at the end of stage 3) can significantly
enhance the retrieval performance compared to the traditional query method. It is also
interesting to note that this is also true for the immature CNBC at stage 2, which
demonstrates a superior performance even though it is not yet fully evolved with
all the classes in the database.
Figure 9.6 presents two sample retrievals of two queries from classes 2 (Beach)
and 6 (Elephants) where each query operation is performed at the end of each
incremental evolution stage. In the first query (qA), the stage-1 CNBC failed to
retrieve relevant images, since it was not yet evolved with the GTD of this class
Table 9.9 Retrieval performances per incremental evolution stage and traditional (without
CNBC) method
Stage—1 Stage—2 Stage—3 Traditional
ANMRR (%) 54.5 33.36 24.13 46.32
AP (%) 44.07 65.42 73.91 51.23
280 9 Image Classification and Retrieval by Collective Network
Fig. 9.6 Two sample retrievals of sample queries qA and qB, performed at each stage from
classes 2 and 6. The top-left is the query image
(class 6). At stage 2, a precision level of P = 59 % is achieved, although there are still
several irrelevant retrievals, as shown in the figure. Only after the last incremental
evolution at stage 3 is a high precision level of 85 % achieved, without any irrelevant
image in the first 12 retrievals. In the second query (qB), the retrieval performance
improves smoothly with each incremental evolutionary stage. It is evident that
even though the NBC corresponding to class 2 was initially evolved in stage 1, it can
only provide a limited retrieval performance (P = 42 %) for queries like the
one shown in the figure; it takes two more incremental evolution sessions to
reach the maturity level needed for a reasonable retrieval (P = 80 %).
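The precision figures quoted above (P = 42, 59, 80, 85 %) follow the usual precision-at-N definition: the fraction of relevant items among the first N retrievals, the query item itself excluded. A minimal, illustrative sketch (not from the book's source code):

```cpp
#include <vector>

// Precision within the first N retrievals: relevant[i] is true if the i-th
// retrieved item (query item excluded) belongs to the query's class.
double PrecisionAtN(const std::vector<bool>& relevant, int N)
{
    int hit = 0;
    for (int i = 0; i < N && i < (int)relevant.size(); ++i)
        if (relevant[i]) ++hit;
    return (double)hit / N;
}
```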
_RBF_BP, //RBF net trained by BP..
_RBF_PSO, //RBF net evolved by MD PSO..
_SVM, //Support Vector Machines..
_RF, //Random Forest..
_BC, //Bayesian Class..
_HMM, //Hidden Markov Models..
_RANDOM = 1000
};
There are other global variables that are used to assign the training and classifier
parameters, as follows:
CNBCtrain_params cnbc_tp;
CNBCclass_params cnbc_cp;
one_class_data *train_new = NULL, *test_new = NULL;
float **train_input, **test_input;
float **train_output, **test_output;
int in_size, out_size, train_size, test_size = 0;
int *one_index;
int *clip_index = NULL, *kfs = NULL, *table = NULL;
int train_clip_no, test_clip_no;
#define TRAIN_RATE .65
#define TRAIN_RUNS 500//No of training epochs..
#define LEARNING_PARAM 0.01
#define REPEAT_NO 2//Max. no. of REPETITIONS per MLP conf. training..
PSOparam _psoDef = {80, 201, 0.001, -10, 10, -.2, .2, FAST};//default parameters for MD PSO..
SPSAparam _saDef = {1, 1, .602, .2, .101};//default SPSA parameters: a, A, alpha, c, gamma..
#define MAX_NOL 3//Max. no. of layers in ENN
#define RF_MAX_NOL 4//Max. no. of parameters for RF
#define SVM_MAX_NOL 5//Max. no. of parameters for SVM
int minNoN[MAX_NOL] = {-1, 8, 2};//Min. no. of neurons per layer for ANNs..
int maxNoN[MAX_NOL] = {-1, 16, 2};//Max. no. of neurons per layer for ANNs..
int svmMinNoN[SVM_MAX_NOL] = {-1, 1, 0, 0, 2};//Min. index for SVM kernel..
int svmMaxNoN[SVM_MAX_NOL] = {-1, 4, 3, 1, 2};//Max. index for SVM kernel..
int rfMinNoN[RF_MAX_NOL] = {-1, 1, 1, 2};//Min. index arr for RF conf..
int rfMaxNoN[RF_MAX_NOL] = {-1, 2, 3, 2};//Max. index arr for RF conf..
//int newClassNo[] = {5, 3, 2};//SET the no. of classes per training stage here..
int newClassNo[] = {1000};//SET the no. of classes per training stage here..
bool bUniClass = 0, bEvaluateOnly = 0, bEvaluateAll = 0;//true if the database is uni-class and true if only evaluation performed..
9.5 Programming Remarks and Software Packages 283
Note that several of the training and classifier parameters shown above were
described earlier, such as PSOparam, SPSAparam, and the AS parameters
int min_noL, int max_noL, int *min_noN, int *max_noN. Moreover, there are the
training constants TRAIN_RATE (database partition for training),
TRAIN_RUNS (number of epochs for BP), LEARNING_PARAM (learning
parameter for BP), and REPEAT_NO (number of repetitions for each BC
training/evolution). These are all related to each individual BC training and BC
configuration. There are three Boolean parameters. The first one should be set
according to the dataset type: true for uni-class (or uni-label) or false for multi-
class (or multi-label). The second Boolean, bEvaluateOnly, is set to true only to
test (evaluate the performance of) an existing CNBC without (incremental)
training/evolution. If no CNBC has been evolved yet, then this parameter should be
set to false. An important data structure is one_class_data, in which information
such as the positive and negative item lists, the class index, and some operators are
stored. When the GTD of a dataset is loaded, the individual class information for the
train and test datasets is stored using this data structure. Finally, there are two CNBC-
related data structures and their global variables: CNBCtrain_params and
CNBCclass_params. Like the other variables, they are all declared in the header
file, GenClassifier.h. Table 9.10 presents these three important data structures.
The entry function, main(), is quite straightforward:
int main(int argc, char* argv[])
{
memset(&cnbc_cp, 0, sizeof(CNBCclass_params));//init..
//RunDTfiles();
RunFeXfiles(-1);
return 0;
}
In this application, a CNBC can be evolved over either a data (*.dt) file, by calling the
function RunDTfiles(), or a MUVIS feature (so-called FeX) file, by calling
the function RunFeXfiles(-1). In this section we shall describe the latter process;
the former is quite similar to, and simpler than, the latter, since data files
contain both the features and the GTD in a single file. In a MUVIS image database, there
is a folder called ‘‘\Images\’’ where all the raw data, the images, reside. There is
also an image database descriptor file, ‘‘*.idbs’’, where the database properties (such as
the date and time of creation, version, number of images, features, and their
descriptions) are stored. For instance, consider the sample image database file,
‘‘dbsMC.idbs’’ (under the folder ‘‘\dbs60mc\’’), as follows:
v 1.8
20:41:02, Thursday, January 12, 2012
IS = NONE
noIm = 60 visF = 1 nSEG = 0
CNBC 1 19
This database has an internal MUVIS version of 1.8, with the creation date below it.
The number of images is 60 (noIm = 60), the number of visual features is 1
(visF = 1), and no segmentation method is applied (nSEG = 0). The fifth line (CNBC
1 19) indicates that the feature extracted for this database is CNBC, which
encapsulates either 7, 10, or 14 distinct low-level features. For this database, 7
features have been extracted (the first of the 19 parameters given below).
Note that under the same folder there is a file called ‘‘dbsMC_FeX.CNBC’’, which
contains the FVs of these 7 features within a single file. Moreover, the GTD for
this database resides in the so-called Query-Batch-File (*.qbf), with the name
‘‘dbsMC CNBC.qbf’’. This is a simple text file with the following entries:
DATABASE dbsMC
QUERY_TYPE visual
CLASS_NO 5
DBGOUT 1
# Put any nonzero number to enable DbgOut signals..
QUERY 0 - 59
# this basically means to query ALL items in the database..
% this is a 5-class database with some Corel images.
CLASS 0 0 - 9
CLASS 1 10 - 19
CLASS 2 20 - 29
class 3 30 - 39
CLASS 4 40 - 59
% From now on put the multi-class entries..
CLASS 0 20 - 29, 42, 44
class 4 0, 1, 3, 8, 9, 16
The self-explanatory text tokens (e.g., ‘‘DATABASE’’, ‘‘CLASS’’, etc.) indicate
that the database (dbsMC) has 5 classes, where the image indices per class are
given below them. For instance, the first class (CLASS 0) contains images 0, 1, 2, …, 9,
as well as 20, 21, 22, …, 29, 42, and 44.
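As an illustration of the format above, the CLASS entries can be parsed with a short sketch like the following. This is not the MUVIS source; the function name and the assumption that tokens are case-insensitive (the sample file mixes ‘‘CLASS’’ and ‘‘class’’) are ours.

```cpp
#include <sstream>
#include <string>
#include <vector>
#include <map>
#include <algorithm>
#include <cctype>

// Parses CLASS lines of a *.qbf stream, e.g. "CLASS 0 20 - 29, 42, 44" adds
// images 20..29, 42 and 44 to class 0. Other tokens (DATABASE, QUERY, #, %)
// are skipped. Returns a map from class index to its image index list.
std::map<int, std::vector<int>> ParseQbfClasses(std::istream& in)
{
    std::map<int, std::vector<int>> classes;
    std::string line;
    while (std::getline(in, line))
    {
        std::istringstream ls(line);
        std::string tok;
        if (!(ls >> tok)) continue;                       // blank line
        std::transform(tok.begin(), tok.end(), tok.begin(), ::toupper);
        if (tok != "CLASS") continue;                     // skip other tokens
        int cls = 0;
        if (!(ls >> cls)) continue;
        std::string rest;
        std::getline(ls, rest);                           // e.g. " 20 - 29, 42, 44"
        std::replace(rest.begin(), rest.end(), ',', ' ');
        std::istringstream rs(rest);
        std::vector<std::string> toks;
        while (rs >> tok) toks.push_back(tok);
        for (size_t i = 0; i < toks.size(); ++i)
        {
            int a = std::stoi(toks[i]);
            if (i + 2 < toks.size() && toks[i + 1] == "-") // range "a - b"
            {
                int b = std::stoi(toks[i + 2]);
                for (int k = a; k <= b; ++k) classes[cls].push_back(k);
                i += 2;
            }
            else classes[cls].push_back(a);               // single index
        }
    }
    return classes;
}
```

Note that the multi-class entries simply append to an existing class list, so an image may appear under several classes, as in the sample file.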
Table 9.11 presents the function RunFeXfiles(). The database filename and
folder are indicated by the variables tit_base[] and dir[]. All training parameters
are stored within cnbc_tp, whereas all classifier parameters are stored in
cnbc_cp. Since this is a multiclass database, the function
call CreateCNBCDataFeXgen(fname) will load the qbf file and, according to
TRAIN_RATE, fill the class entries in the train and test datasets (into the
one_class_data variables train_new and test_new). Once the GTD is loaded,
the function call LoadCNBCFex(dir, tit_base) will load the FVs in the FeX
file of the database, which will be pointed to by the pointer cnbc_tp.inputFV with a
total size of cnbc_tp.sizeFV. Finally, the function call IncrementalEvolution(run)
will simulate a multi-stage CNBC evolution session, where in each stage
a certain number of new classes will be introduced and the CNBC will be incre-
mentally evolved. The number of new classes for each stage can be defined in the
global array newClassNo[]. For example, the following definition for a 10-class
database:
int newClassNo[] = {5, 3, 2};//SET the no. of classes per training stage here..
will introduce the GTD of the images first from 5 classes, then (in the second
stage) from 3 more classes, and finally from the remaining 2 classes. This tests
whether the CNBC can adapt to new class entries. If a batch training is desired, then a
single entry should be given with the total number of classes in the database (or higher),
i.e., int newClassNo[] = {1000};. In that case the function call IncrementalEvolution(run)
will directly copy the entries cnbc_cp.train_new and cnbc_cp.test_new from the
global one_class_data variables train_new and test_new. This yields a
single (batch) CNBC evolution session via the function call TrainFn(run), with
all the GTD available, and the classification performance will be computed over
both the train and test datasets.
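The staging logic described above can be sketched as follows. This is a hypothetical illustration of how newClassNo[] partitions the class indices into stages, not the actual IncrementalEvolution() implementation; the function name is ours.

```cpp
#include <vector>

// Splits class indices 0..totalClasses-1 into incremental evolution stages
// according to newClassNo. An oversized single entry (e.g. {1000}) collapses
// everything into one batch stage, as described in the text.
std::vector<std::vector<int>> StageClasses(int totalClasses,
                                           const std::vector<int>& newClassNo)
{
    std::vector<std::vector<int>> stages;
    int next = 0; // next class index not yet introduced
    for (int n : newClassNo)
    {
        std::vector<int> stage;
        for (int i = 0; i < n && next < totalClasses; ++i)
            stage.push_back(next++);
        if (!stage.empty()) stages.push_back(stage);
    }
    return stages;
}
```

For a 10-class database, {5, 3, 2} yields three stages of 5, 3, and 2 classes, whereas {1000} yields a single batch stage with all 10 classes.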
In each of the training functions, the appropriate classifier object should be
created and used for training and testing. Table 9.12 presents the training function
This function call will verify and retrieve the next available NBC slot (from a
log file) for the current process. If no log file is present (on the first run of
ClassifierTestApp), it will create one dedicated to the current database and
initially set all the NBC slots to ‘‘idle’’. It will then retrieve the NBC index of the
first idle slot so that the evolution/training process can begin for it. At the
same time, it will change that slot’s status from ‘‘idle’’ to ‘‘started’’ so that other
ClassifierTestApp instances will not attempt any further processing on this particular
NBC. Whenever the evolution process is completed, the same function will
change the ‘‘started’’ status to ‘‘completed’’. The ClassifierTestApp
instances that cannot find any more NBCs with an ‘‘idle’’ status will simply
sleep until all the NBCs are evolved (all NBC slots are turned to ‘‘completed’’
status), and the log file is deleted by the last active process. This will eventually
break the for-loop, and for each process the few internal NBCs evolved will be
deleted (clean-up); instead, the entire CNBC will be loaded (by the Create-
Distributed() call) and its performance evaluated by the EvaluatePerfor-
mance() call.
On the other hand, if an NBC has already been created (and evolved) by a past
process and is loaded in the current process, then it undergoes an incremental
evolution only if it fails the verification test. This is clear in the following code:
if(!pNBC)
{
…
} else if(pNBC->Verify(cp, tp, c))
{
cold = c;
continue;//this NBC discriminates well between pos. and neg. items.. So NOGO for train..
}//else if..
If it does not fail the verification test (pNBC->Verify(cp, tp, c) returns
true), then there is no need for further training or incremental evolution, and the
process simply continues with the next NBC.
Table 9.15 presents the class COneNBC in the second layer of the hierarchy. As its
name implies, each object of this class represents a single NBC and thus
encapsulates one or many BCs within it. Recall that each NBC is dedicated to an
individual class in the database. Therefore, besides evolving/training its BCs,
performing verification tests, and propagating a FV to return a CV, it is also the task of
this class to separate positive and negative items for the first layer BCs (via the
function SelectFeatures()) and for the fuser BC (via the function SelectFu-
serFeatures()). Moreover, it performs the selection of the negative samples with
respect to the predetermined positive-to-negative ratio (PNR) via the function
NegativeFeatureSelection(). Once the positive and negative item lists are selected
for the current NBC, it also shuffles them by calling the function
RandomizeFeatures(), which is needed for proper training, especially when BP is used.
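The PNR-based negative selection can be sketched as follows. This is a hedged illustration, not the actual NegativeFeatureSelection() code: the function name, signature, and random subsampling strategy are our assumptions; only the PNR constraint itself comes from the text.

```cpp
#include <vector>
#include <random>
#include <algorithm>

// Keeps roughly posCount / pnr negative items so that the positive-to-negative
// ratio of the training set equals the predetermined PNR. The negative list is
// shuffled first so the kept subset is a random sample.
std::vector<int> SelectNegatives(const std::vector<int>& negList, int posCount,
                                 double pnr, unsigned seed = 0)
{
    const size_t keep = (size_t)(posCount / pnr + 0.5);
    std::vector<int> out = negList;
    std::mt19937 rng(seed);
    std::shuffle(out.begin(), out.end(), rng);
    if (out.size() > keep) out.resize(keep);
    return out;
}
```

For example, with 10 positive items and PNR = 0.5, 20 negatives are kept; this prevents the (usually much larger) negative set from overwhelming the BC training.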
Table 9.16 presents the code for the function Train(), which creates and trains
the BC(s) of the current NBC. Recall that the first layer of each NBC has a number
of BCs equal to the number of database features. If there are two or more
BCs, then a fuser BC is also created (and evolved) to fuse the outputs of the first
layer BCs. So, if this is the first time the current NBC is created, a BC pointer
array (m_pBC) of CGenClassifier is created, with one pointer reserved per BC.
Next, within the for-loop, each pointer in the array is first allocated to one of
the six classifiers according to the classifier-training choice within cp.cp.ctype.
Then the function call SelectFeatures(cp, tp, bc, nbc) selects the train and test
datasets’ FVs for the current BC according to the class of the NBC. Recall that all
FVs are stored in a single chunk of memory pointed to by the pointer cnbc_tp.in-
putFV. So this function basically assigns the train and test dataset input FV
pointers within cnbc_tp.inputFV, and the output FV pointers to a static outcome,
according to the positive and negative item lists. The following code piece, taken
from this function, accomplishes this. Note that the first for-loop assigns
the input and output FVs of the positive items of the training dataset, whereas the
second one does the same for the negative items. A similar piece of code exists for
the test set too.
for (int n = 0 ; n < train_data->pos_item_no ; ++n)
{
int ptr = tp.sizeFV * train_data->pos_item_list[n] + tp.ptrSF[bc];//ptr of the bc.th feature vector within inputFV..
tp.tp.train_input[n] = tp.inputFV + ptr;//bc.th feature vector of the item pointed by train_data->pos_item_list[n]
tp.tp.train_output[n] = CGenClassifier::outV[0];//positive outcome..
}//for..
Once the training dataset FVs for both the positive and negative item lists are
assigned, SelectFeatures(cp, tp, bc, nbc) then performs the selection of the
negative samples with respect to the predetermined positive-to-negative ratio (PNR)
via the function call NegativeFeatureSelection().
After the train and test datasets’ FVs are selected and pointed to by tp.tp.train_input,
tp.tp.train_output, tp.tp.test_input, and tp.tp.test_output, the function
call RandomizeFeatures(tp.tp) shuffles the entries of the train dataset for
proper training. They are then fed to the BC by calling m_pBC[bc]->Init(tp.tp).
Finally, with the rest of the training parameters (tp.tp), the training/evolution of
the BC is initiated by m_pBC[bc]->Train(cp.cp) with the classifier parameters
(cp.cp). When completed, the pointers for the current BC’s training FVs are
cleaned, and the process is repeated in the for-loop for the next BC, until all
BCs in the first layer are trained/evolved. If there are two or more BCs present in
the NBC, then the fuser BC is created and its FVs are selected from the CVs of the
first layer BCs, for both the train and test datasets. Apart from this difference, the
training/evolution of the fuser BC is identical to that of any first layer BC.
The I/O routines of the NBCs are quite straightforward. At any time, the NBC
can be saved to or read from a binary file. As can be seen in the COne-
NBC::Save() function, after a simple header covering NBC information
such as the number of BCs, the NBC index, and the activation function used, the
architecture spaces of all BCs are stored to the file in sequential order, with
the last one being the fuser BC’s. To load an NBC from the file via the COne-
NBC::Create(char* dir, char* title, ClassifierType ctype) function, the binary
file header is simply read to assign the NBC parameters, and each BC is initialized
with its AS buffer. Recall that the entire AS information is needed for any
incremental evolution session, and the best configuration in the AS will always be
used for classification (forward propagation of FVs).
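The binary layout just described can be sketched as follows. The struct and function names are illustrative (the actual COneNBC::Save() fields are not reproduced here); only the header-then-sequential-AS-buffers layout comes from the text.

```cpp
#include <sstream>
#include <cstdint>

// Illustrative header preceding the AS buffers in the NBC file:
// number of BCs, NBC index, and an activation-function id.
struct NBCHeader { int32_t noBC; int32_t nbcIndex; int32_t actFn; };

void WriteHeader(std::ostream& os, const NBCHeader& h)
{
    os.write(reinterpret_cast<const char*>(&h), sizeof(h));
    // ... the AS buffer of each BC would follow sequentially, fuser BC last ...
}

NBCHeader ReadHeader(std::istream& is)
{
    NBCHeader h{};
    is.read(reinterpret_cast<char*>(&h), sizeof(h));
    return h;
}
```

Loading simply reverses the write: the header is read first to size the BC array, and each BC is then initialized from its AS buffer in turn.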
Finally, the third hierarchical class layer is the CGenClassifier (CMLP_BP,
CMLP_PSO, CRBF_BP, CRBF_PSO, CRandomForest, and CSVM) for which
the programming details were covered in Sect. 7.5. Note that this hierarchical
structure of the base classifiers, which are all inherited from the CGenClassifier class,
makes it possible to integrate other classifiers in the future using the common API and
data structures defined within the base class.
References
4. R. Polikar, L. Udpa, S. Udpa, V. Honavar, Learn++: an incremental learning algorithm for supervised neural networks. IEEE Trans. Syst. Man Cybern. (C) 31(4), 497–508 (2001). (Special Issue on Knowledge Management)
5. T. Ojala, M. Pietikainen, D. Harwood, A comparative study of texture measures with classification based on feature distributions. Pattern Recogn. 29, 51–59 (1996)
6. S. Guan, C. Bao, R. Sun, Hierarchical incremental class learning with reduced pattern training. Neural Process. Lett. 24(2), 163–177 (2006)
7. H. Jia, Y. Murphey, D. Gutchess, T. Chang, Identifying knowledge domain and incremental new class learning in SVM. IEEE Int. Joint Conf. Neural Netw. 5, 2742–2747 (2005)
8. S. Smale, On the average number of steps of the simplex method of linear programming. Math. Program. 27(3), 241–262 (1983)
9. H. Chen, Z. Gao, G. Lu, S. Li, A novel support vector machine fuzzy network for image classification using MPEG-7 visual descriptors, in International Conference on Multimedia and Information Technology, MMIT ’08, pp. 365–368, 30–31 Dec 2008. doi:10.1109/MMIT.2008.199
10. E. Chang, K. Goh, G. Sychay, W. Gang, CBSA: content-based soft annotation for multimodal image retrieval using Bayes point machines. IEEE Trans. Circuits Syst. Video Technol. 13(1), 26–38 (2003). doi:10.1109/TCSVT.2002.808079
11. G.-J. Qi, X.-S. Hua, Y. Rui, J. Tang, H.-J. Zhang, Image classification with kernelized spatial-context. IEEE Trans. Multimedia 12(4), 278–287 (2010). doi:10.1109/TMM.2010.2046270
12. MUVIS. http://muvis.cs.tut.fi/
13. S. Kiranyaz, T. Ince, S. Uhlmann, M. Gabbouj, Collective network of binary classifier framework for polarimetric SAR image classification: an evolutionary approach. IEEE Trans. Syst. Man Cybern. B (in press)
14. Corel Collection/Photo CD Collection (www.corel.com)
15. L. Fei-Fei, R. Fergus, P. Perona, Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories. IEEE CVPR Workshop on Generative-Model Based Vision 12, 178 (2004)
16. S.G. Mallat, A Wavelet Tour of Signal Processing, 2nd edn. (Academic Press, San Diego, 1999)
17. B. Manjunath, P. Wu, S. Newsam, H. Shin, A texture descriptor for browsing and similarity retrieval. J. Signal Process. Image Commun. 16, 33–43 (2000)
18. T. Ojala, M. Pietikainen, D. Harwood, A comparative study of texture measures with classification based on feature distributions. Pattern Recogn. 29, 51–59 (1996)
19. M. Partio, B. Cramariuc, M. Gabbouj, An ordinal co-occurrence matrix framework for texture retrieval. EURASIP J. Image Video Process. 2007, Article ID 17358 (2007)
Chapter 10
Evolutionary Feature Synthesis
Multimedia content features (also called descriptors) play a central role in many
computer vision and image processing applications. Features are various types of
information extracted from the content that represent some of its characteristics or
signatures. However, such low-level features, which can be extracted
automatically, usually lack the discrimination power needed for accurate content
representation, especially in the case of a large and varied media content
reserve. Therefore, a major objective in this chapter is to synthesize more dis-
criminative features using an evolutionary feature synthesis (EFS) framework,
which aims to enhance the discrimination power by synthesizing media content
descriptors. The chapter presents an EFS framework that applies a set of linear
and nonlinear operators in an optimal way over the given features in order to
synthesize highly discriminative features with an optimal dimensionality. The
optimality therein is sought by the multidimensional particle swarm optimization (MD
PSO) along with the fractional global best formation (FGBF), presented in Chaps. 4
and 5, respectively. We shall further demonstrate that the features synthesized by
the EFS framework, applied over only a minority of the original feature
vectors, exhibit a major increase in the discrimination power between different
classes, and that a significant content-based image retrieval (CBIR) performance
improvement can thus be achieved.
10.1 Introduction
S. Kiranyaz et al., Multidimensional Particle Swarm Optimization for Machine Learning and Pattern Recognition, Adaptation, Learning, and Optimization 15, DOI: 10.1007/978-3-642-37846-1_10, © Springer-Verlag Berlin Heidelberg 2014
only a higher level understanding of the image content can reveal that they should
be classified into the same class. Efficient CBIR systems require a decisive solu-
tion for this well-known ‘‘Semantic Gap’’ problem. Most current general-purpose
attempts to solve this problem gather knowledge of human perception of image
similarity directly from the users. For example, user labeling of the images may be
exploited to select the most appropriate ones among the vast number of available
feature extraction techniques, or to define a discriminative set of features best
matching human visual perception.
The efforts addressing the aforementioned problem can be categorized into two
feature transformation types: feature selection and feature synthesis. The former
does not change the original features; instead, it selects a particular subset of them to
be used in CBIR. So, no matter how efficient the feature selection method may be,
the final outcome is nothing but a subset of the original features and may still lack
the discrimination power needed for an efficient retrieval. The latter performs a
linear and/or nonlinear transformation to synthesize new features. Both transfor-
mation types require searching for an optimal set of new features among a large
number of possibilities, in a search space probably containing many local optima;
therefore, evolutionary algorithms (EAs) [1] such as the Genetic Algorithm (GA)
[2] and Genetic Programming (GP) [3] are mainly used. Recall from the earlier
chapters that the common point of all EAs is that they are stochastic population-
based optimization methods that can avoid being trapped in a local optimum. Thus
they can find the optimal solutions; however, this is never guaranteed. In addition
to the limitations of EA-based algorithms discussed in earlier chapters, another
critical drawback of the existing EA-based EFS methods is that they work only in
a search space with an a priori fixed dimensionality. Therefore, the optimal
dimensionality for the synthesized features remains unknown.
In order to address these problems, in this chapter we shall present an EFS
technique that is entirely based on MD PSO with FGBF. The main objective is
to maximize the discrimination capability of low-level features so as to achieve the
best possible retrieval performance in CBIR. Recall that MD PSO can search for
the optimal dimensionality of the solution space and hence avoids the need to fix
the dimensionality of the synthesized features in advance. MD PSO can also work
along with the FGBF to avoid the premature convergence problem. With a
proper encoding scheme that encapsulates several linear and nonlinear operators
(applied to a set of optimally selected features) and their scaling factors (weights),
MD PSO particles can therefore perform an evolutionary search to determine the
optimal feature synthesizer, generating new features with the optimal dimen-
sionality. The optimality therein can be defined via a fitness measure that
maximizes the overall retrieval (or classification) performance.
10.2 Feature Synthesis and Selection: An Overview
EFS is still in its infancy, as only a few successful methods have been proposed to
date. There are some applications of PSO to feature selection. In [4], binary PSO
was successfully used to select features for face recognition, and in many earlier
studies PSO-based feature selection has been shown to produce good classification
results when combined with different classification methods (e.g., logistic
regression classifier [5], K-nearest neighbor method [6], SVMs [7, 8], and back-
propagation networks [9]) and applied for different classification problems (e.g.,
UCI Machine Learning Repository classification tasks [5, 7, 8], gene expression
data classification [6], and microcalcifications in mammograms [9]).
Most existing feature synthesis systems are based on genetic programming (GP)
[3]. In [10] and [11], GP is used to synthesize features for facial expression rec-
ognition. The genes are composite operators represented by binary trees whose
internal nodes are primitive operators and leaf nodes are primitive features. The
primitive features are generated by filtering the original images using a Gabor filter
bank with 4 scales and 6 orientations (i.e., 24 images per original image) and the
primitive operators are selected among 37 different options. The fitness of the
composite operators is evaluated in terms of the classification accuracy (in the
training set) of a Bayesian classifier learned simultaneously with the composite
operator. Finally, the best composite operator found is used to synthesize a feature
vector for each image in the database and the corresponding Bayesian classifier is
then used to classify the image into one of 7 expression types. The expression
recognition rate was slightly improved compared to similar classification methods
where no feature synthesis was applied.
In [7], co-evolutionary genetic programming (CGP) is used to synthesize fea-
tures for object recognition in synthetic aperture radar (SAR) images. The
approach is similar to the one in [10] and [11], but separate sub-populations are
utilized to produce several composite operators. However, the primitive features
used in this application are only 1-dimensional properties computed from the
images and thus each composite operator only produces a single 1-dimensional
composite feature. The final feature vector is formed by combining the composite
features evolved by different sub-populations. Although both the number of
primitive features (20) and the number of classes to be recognized (≤ 5) were low,
the classification accuracy obtained using the synthesized features was only
occasionally better than the classification accuracy obtained directly with the
primitive features.
In [12], a similar CGP approach was applied for image classification and
retrieval. The original 40-D feature vectors were reduced to 10-D feature vectors.
The results were compared in terms of classification accuracy against 10-D feature
vectors obtained using multiple discriminant analysis (MDA) and also against a
support vector machine (SVM) classifier using the original 40-D feature vectors.
The databases used for testing consisted of 1,200–6,600 images from 12 to 50
classes. In all cases, the classification results obtained using the features
10.3.1 Motivation
As mentioned earlier, the motivation behind the EFS technique is to maximize the
discrimination power of low-level features so that CBIR performance can be
improved. Figure 10.1 illustrates an ideal EFS operation where 2D features of a 3-
class database are successfully synthesized in such a way that significantly
improved CBIR and classification performances can be achieved. Unlike in the
figure, the feature vector dimensionality is also allowed to change during the
feature synthesis if more discriminative features can be obtained.
The features synthesized by the existing FS methods based on GP produce only
slightly improved (or in some cases even worse) results compared to the original
features. These methods have several limitations, and the results may be signifi-
cantly improved if those limitations are properly addressed. First of all, the
synthesized feature space dimensionality is kept quite low to avoid increasing the
computational complexity. Moreover, the dimensionality is fixed a priori, which
further decreases the probability of finding optimal synthesized features with respect
to the problem at hand. In order to maximize the discrimination among classes, the
dimensionality into which the new features are synthesized should also be optimized
by the evolutionary search technique. Most existing systems also use only a few
(non)linear operators in order to avoid a high search space dimensionality, since
the probability of getting trapped in a local optimum significantly
increases in higher dimensionalities. Furthermore, the methods are quite dependent
on a number of parameters, which must be set manually.
Fig. 10.1 An illustrative EFS, which is applied to 2D feature vectors of a 3-class dataset

ANNs, which may also be seen as feature synthesizers, similarly suffer from the pre-set dimensionality of the synthesized features (e.g., C for a C-class case). At the same time, the limited set of operators (only summation and nonlinear bounding) may prevent the ANNs from finding a successful feature synthesizer for
a particular set of feature vectors. The most problematic limitation is the lack of
feature selection. When the dimensionality of the input feature vector rises, the
number of weights to be optimized increases exponentially and the task of finding
proper weights soon becomes difficult and perhaps infeasible for any training
method due to the well-known "curse of dimensionality" phenomenon.
With SVM classifiers, a major drawback is the critical choice of the (non)linear kernel function, along with its intrinsic parameters, which may not be a proper choice for the problem at hand. Consider, for instance, the two sample feature synthesizers (FS-1 and FS-2) illustrated in Fig. 10.2, where for illustration purposes features are only shown in 1-D and 2-D, and only two-class problems are considered. In the case of FS-1, an SVM with a polynomial kernel in quadratic form can make the proper transformation into 3-D so that the new (synthesized) features are linearly separable. However, for FS-2, a sinusoid with a proper frequency, f, should be used instead for a better class discrimination. Therefore, searching for the right transformation (and hence for the linear and nonlinear operators within) is of paramount importance, which is not possible for static (or fixed) ANN and SVM configurations.
The primary objective of the EFS framework presented in this chapter is to address all of the above-mentioned deficiencies simultaneously and to achieve the highest possible discrimination between image features belonging to different classes.
Fig. 10.2 Two sample feature synthesis operations performed on 2-D (FS-1) and 1-D (FS-2) feature spaces; FS-1 maps (x1, x2) to (x1², x2², √2·x1x2) and FS-2 applies sin(2πfx)
10.3.2.1 Overview
As shown in Fig. 10.3, the EFS can be performed in one or several runs where each
run can further synthesize the features generated from the previous run. The number
of runs, R, can be specified in advance or adaptively determined, i.e., runs are
carried out until a point where the fitness improvement is no longer significant. The
EFS dataset can be the entire image database or a certain subset of it where the
ground truth is available. If there is more than one Feature eXtraction (FeX)
module, an individual feature synthesizer can be evolved for each module and once
completed, each set of features extracted by an individual FeX module can then be
passed through the individual synthesizers to generate new features for CBIR.
Fig. 10.3 The block diagram of Feature eXtraction (FeX) and the EFS technique with R runs
Along with the operators and the feature selection, the encoding of the MD PSO particles is designed to enable a feature scaling mechanism with proper weights. As illustrated in Fig. 10.4, $xx_{a,j}^{d}(t),\; j \in [0, d-1]$, is a $(2K+1)$-dimensional vector encapsulating the selected feature indices, the operator indices $\Theta_1, \ldots, \Theta_K$, and the associated scaling weights used to compose the $j$th element of the synthesized feature vector.
Fig. 10.4 Encoding jth dimensional component of the particle a in dimension d for K-depth feature synthesis (mapping an N-dimensional original FV to a d-dimensional synthesized FV)
where $\mu_{c,R}$ is the centroid vector computed for all classes, and $Q_e(\mu_{c,R}, Z)$ is the quantization error (or the average intra-cluster distance). $d_{\min}$ is the minimum centroid (inter-cluster) distance among all classes, and $F_P(\mu_{c,R}, Z)$ is the number of false positives, i.e., synthesized feature vectors that are closer to another class centroid than to their own. Minimizing the validity index $f(\mu_{c,R}, Z)$ will therefore simultaneously minimize the intra-cluster distances (for better compactness) and maximize the inter-cluster distance (for better separation), both of which lead to a low $F_P(\mu_{c,R}, Z)$ value or, in the ideal case, $F_P(\mu_{c,R}, Z) = 0$, meaning that each synthesized feature lies in the closest proximity of its own class centroid, thus yielding the highest discrimination.
In the second approach, we adopt a methodology similar to the one used in ANNs: target vectors are assigned for the features synthesized from each class, the EFS system searches for a proper synthesis that reaches this desired output, and the fitness is evaluated in terms of the mean square error (MSE) between the synthesized output vectors and the target output vectors. However, we do not want to fix the output dimensionality to C as in ANNs, but instead let the EFS system search for an optimal output dimensionality. Therefore, target output vectors are generated for all dimensionalities within the range {Dmin,…,Dmax}. While generating the target vector table, the two criteria for a good error correcting output code (ECOC) suggested in [16] are applied, i.e.,
• Row separation: Each target vector should be well-separated in the sense of
Hamming distance from each of the other target vectors.
• Column separation: Each column in the vector table should be well-separated in
the sense of Hamming distance from each of the other columns.
A large row separation allows the final synthesized vectors to somewhat differ from the target output vectors without losing the discrimination between classes. Each column of the target vector table can be seen as a different binary classification, i.e., the original classes with value 1 in a specific column form the first metaclass and the original classes with value -1 in that column form the second metaclass. Depending on the similarity of the original classes, some binary classification tasks are likely to be notably easier than others. Since the same target output vectors should be used with any given input classes, it is beneficial to keep the binary classification tasks as different as possible, i.e., to maximize the column separation.
There is no simple and fast method available to generate target vectors with maximal row and column separation, but the following simple procedure creates target output vectors whose row and column separations are satisfactory for this purpose:
1. Assign MinBits as the minimum number of bits needed to represent C classes.
2. Form a bit table with C rows and MinBits columns, where each row is the binary representation of the row number.
3. Assign the first MinBits target vector values for each class ci equal to the ith
row in the bit table.
4. Move the first row of the bit table to the end of the table and shift the other rows
up by one row.
5. Assign the next MinBits target vector values for each class ci equal to the ith
row in the bit table.
6. Repeat the previous two steps until Dmax target vector values have been
assigned.
7. Replace the first C values in each target vector by a 1-of-C coded section.
This procedure will produce different binary classification tasks until the bit table is rotated back to its original state (large column separation), while the row separation is notably increased compared to using only the 1-of-C coded section. While step 7 reduces the row separation, it has been observed that
for distinct classes it is often easiest to find a synthesizer that discriminates a single
class from the others and, therefore, conserving the 1-of-C coded section at the
beginning generally improves the results. The target vectors for dimensionalities
below Dmax can then be obtained by simply discarding a sufficient number of
elements from the end of the target vector for dimensionality Dmax. Since the
common elements in the target vectors of different lengths are thus identical, the
FGBF methodology for EFS can still freely combine elements taken from particle
positions having different dimensionalities.
The target output vector generation for a 4-class case is illustrated in
Table 10.2. For clarity, the elements set to -1 are shown as empty boxes. Dmax is
set to 10 and for four classes MinBits is 2. Note that the 1-of-C coding is used for
the first four elements, while the remaining elements are created from the (shifted)
rows of a 2-bit representation table.
When the target outputs are created as described above, the fitness of the jth
element of a synthesized vector (and thus the corresponding fractional fitness
score, fj(Zj)) can simply be computed as the MSE between the synthesized output
vectors and the target vectors belonging to their classes, i.e.,
$$
f_j\left(Z_j\right) = \sum_{k=1}^{C} \sum_{\forall z \in c_k} \left( t_j^{c_k} - z_j \right)^2 \qquad (10.3)
$$
where $t_j^{c_k}$ denotes the jth element of the target output vector for class $c_k$ and $z_j$ is the jth element of a synthesized output vector. The most straightforward way to
form the actual fitness score f(Z) would be to sum up the fractional fitness scores
and then normalize the sum with respect to the number of dimensions. However,
we noticed that since the first C elements with 1-of-C coding are usually easiest to
synthesize successfully, the MD PSO usually favors dimensionalities not higher
than C, indicating a crucial local optimum at dimensionality d = C. In order to address this drawback efficiently, the elements from 1 to C in the target vector are handled separately and, moreover, the normalizing divisor used for the rest of the elements is strengthened, i.e., raised to the power a > 1, to slightly increase the probability of finding better solutions in higher dimensionalities, d > C. As a result, the fitness function is formulated as follows:
$$
f(Z) = \frac{1}{C} \sum_{j=1}^{C} \sum_{k=1}^{C} \sum_{\forall z \in c_k} \left( t_j^{c_k} - z_j \right)^2 + \frac{1}{(d-C)^{a}} \sum_{j=C+1}^{d} \sum_{k=1}^{C} \sum_{\forall z \in c_k} \left( t_j^{c_k} - z_j \right)^2 \qquad (10.4)
$$
The MUVIS framework [17] is used to create and index a Corel image database for the experiments in this section. The database contains 1,000 medium-resolution (384 × 256 pixels) images obtained from the Corel image collection [18], covering 10 classes (natives, beach, architecture, buses, dinosaurs, elephants, roses, horses, mountains, and food). In order to demonstrate the efficacy of the EFS technique for CBIR, we used well-known low-level descriptors that have a limited discrimination power and a severe deficiency for proper content description. The descriptors used are 64-bin unit-normalized RGB and YUV color histograms, a 57-D Local Binary Pattern (LBP) descriptor, and a 48-D Gabor texture descriptor. The synthesis depth, K, was set to 7, meaning that only 7 operators and 8 features were used to compose each element of the new feature vector. The parameter a in Eq. (10.4) was set to 1.1. Unless stated otherwise, the set of 18 empirically selected operators given in Table 10.1 was used within the EFS. All numerical results
10.4 Simulation Results and Discussions
Table 10.3 Discrimination measure (DM) and the number of false positives (FP) for the original
features
Corel RGB YUV LBP Gabor
DM 431.2 384.6 984.2 462.7
FP 357 334 539 378
Table 10.4 Statistics of the discrimination measure (DM), the number of false positives (FP) and
the corresponding output dimensionalities for features synthesized by the first run of EFS over the
entire database
Corel RGB YUV LBP Gabor
Min/mean DM 179.9/203.9 187.2/201.3 306.8/334.7 299.2/309.5
Min/mean FP 161/181.7 167/179.9 281/307.5 272/283.4
Best/mean dim. 37/32.6 36/33.7 26/24.7 37/28.8
Table 10.5 Statistics of the DM, FP and the corresponding output dimensionalities for features
synthesized by the first run of the EFS evolved with the ground truth data over 45 % of the
database
Corel RGB YUV LBP Gabor
Min/Mean DM 245.8/259.1 239.2/261.7 384.5/408.4 336.6/363.9
Min/Mean FP 223/237.5 218/239.5 357/372.3 303/330.6
Best/Mean Dim. 33/33.4 37/31.4 30/24.0 31/29.4
Table 10.7 Test CE statistics and the corresponding output dimensionalities for features syn-
thesized by a single run of the EFS with 45 % EFS dataset
Corel RGB YUV LBP Gabor
Min/mean 0.293/0.312 0.285/0.307 0.402/0.422 0.327/0.367
test CE
Best/mean dim. 33/34.4 37/31.4 30/24.0 29/29.4
While our main objective is to improve CBIR results, EFS can also be used for
other purposes as long as a proper fitness function can be designed. A possible
application may be the synthesis of such features that can be classified more
efficiently. To demonstrate this potential of EFS, a K-means classifier is trained by
computing the class centroids using 45 % of the database, and the test samples are
then classified according to the closest class centroid. When this classifier is applied over the original and the synthesized features, the classification errors presented in Tables 10.6 and 10.7 are obtained. The results indicate a clear improvement in classification accuracy, leading to the conclusion that when EFS is evolved with a proper fitness function, it can synthesize features that can be classified more efficiently and accurately.
In the following experiments, image features are synthesized using the ground truth
data of only part of the image database (45 %). Then ANMRR and AP are computed
by the batch query, i.e., querying all images in the database. In order to evaluate the
baseline performance, ANMRR and AP performance measures obtained using the
original low-level features are first computed, as given in Table 10.8.
Recall that a single run of EFS can be regarded as a generalization of an SLP. Therefore, to perform a comparative evaluation under equal terms, an SLP is first trained using PSO and used as a feature synthesizer, in which the low-level image features are propagated to create the output (class) vectors that are used to compute (dis)similarity distances during a query process. Performing batch queries over the class vectors, the final ANMRR and AP statistics are given in Table 10.9. It can be seen that, even though the SLP output dimensionality is notably lower than the vector dimensionalities of the original low-level features, the ANMRR and AP statistics for the synthesized LBP and Gabor features exhibit a significant performance improvement. This is also true, but to a more limited extent, for the RGB and YUV histograms. It is, therefore, clear that feature synthesis performed by the SLP improves the discrimination between classes, which in turn leads to a better CBIR performance.
To demonstrate the significance of each property of EFS, a series of experiments is conducted, enabling the properties one by one and evaluating their individual impact. Let us start with a single run of the most restricted EFS version, which has the closest resemblance to an SLP. In this version, the output dimensionality is fixed to C, and the same 1-of-C coding for target (class) vectors and the same MSE fitness function are used. As in the SLP, the PSO (with FGBF) process in this limited EFS searches only for the feature weights, selecting only between the addition and subtraction operators (in the SLP the weights are limited to [-1, 1], i.e., the subtraction operation is also needed to compensate for the negative weights).
Table 10.8 ANMRR and AP measures using original low-level features. "All" refers to the case where all features are considered together
Corel RGB YUV LBP Gabor All
ANMRR 0.589 0.577 0.635 0.561 0.504
AP 0.391 0.405 0.349 0.417 0.473
Table 10.9 ANMRR and AP statistics obtained by batch queries over the class vectors of the
single SLP
Corel RGB YUV LBP Gabor All
Min/mean ANMRR 0.542/0.573 0.526/0.554 0.544/0.555 0.482/0.495 0.367/0.390
Max/mean AP 0.446/0.414 0.463/0.434 0.443/0.432 0.503/0.490 0.611/0.589
Table 10.10 ANMRR and AP statistics for features synthesized by a single run of the EFS when
the output dimensionality is fixed to C = 10 and operator selection is limited to addition and
subtraction
Corel RGB YUV LBP Gabor All
Min/mean ANMRR 0.564/0.580 0.540/0.548 0.636/0.651 0.598/0.623 0.482/0.487
Max/mean AP 0.419/0.404 0.447/0.439 0.350/0.334 0.385/0.360 0.498/0.492
Furthermore, the combined features are then bounded using the tanh function. The bias used in the SLP is mimicked by complementing each input feature vector with a constant '1' value. Note that the only property differing from the SLP is the feature selection of only 8 (K = 7) features for composing each element of the output vector. The retrieval result statistics obtained from the batch queries are presented in Table 10.10.
Even though the feature synthesis process is essentially similar to the one applied by the SLP, with the only exception that the number of input features used to synthesize each output feature is notably reduced as a result of the feature selection, the retrieval performance statistics for the synthesized RGB and YUV histograms are quite similar. This leads to the conclusion that, in these low-level features, the original feature vectors contain many elements that are redundant or irrelevant for the discrimination of the classes, which is a well-known limitation of color histograms. The feature selection, therefore, removes this redundancy while operating only on the essential feature elements and, as a result, leads to a significant reduction in computational complexity. The statistics for the synthesized LBP features are close to the ones obtained from the original LBP features, while for the synthesized Gabor features they are somewhat inferior. This suggests that with these descriptors, and especially with the Gabor features, most of the original feature vector elements are indeed essential for achieving maximal discrimination between classes. However, feature selection is also essential for reducing the overall optimization complexity and may hence be a prerequisite for applying feature synthesis on larger databases. In the following experiments, it will be demonstrated that feature selection no longer presents a disadvantage when used along with the other properties of EFS.
In the following experiment, the significance of having several operators and
operator selection in EFS is examined. All operators are now used in the EFS
process. The retrieval performance statistics are given in Table 10.11. Note that
Table 10.11 ANMRR and AP statistics for features synthesized by a single run of the EFS when
the output dimensionality is fixed to C = 10 and all operators are used
Corel RGB YUV LBP Gabor All
Min/mean ANMRR 0.550/0.587 0.517/0.562 0.499/0.509 0.486/0.516 0.380/0.398
Max/mean AP 0.432/0.394 0.468/0.423 0.487/0.477 0.498/0.467 0.597/0.578
Table 10.12 ANMRR and AP statistics and the corresponding output dimensionalities for the
synthesized features by a single EFS run using the fitness function in Eq. (10.4)
Corel RGB YUV LBP Gabor All
Min/mean ANMRR 0.485/0.500 0.475/0.505 0.507/0.519 0.506/0.520 0.346/0.357
Max/mean AP 0.494/0.478 0.507/0.477 0.477/0.465 0.479/0.464 0.630/0.619
Best/mean dim. 34/35.7 25/34.7 37/35.3 14/27.0
ANMRR and AP statistics for synthesized RGB and YUV histograms are quite
similar with and without operator selection, while for LBP and Gabor features a
significant improvement can be observed with operator selection.
In the next experiment, EFS is allowed to optimize the output dimensionality
and the fitness function given in Eq. (10.4) is used; i.e., now all the properties of
EFS are in use, but only in a single EFS run. The ANMRR and AP statistics, along with the corresponding statistics of the output dimensionality (best/mean) of the synthesized feature vector (dbest), are presented in Table 10.12.
The ANMRR scores obtained with the single run of the EFS are better than the ones obtained with the original features and with the features (class vectors) synthesized by the SLP. The only exception is the synthesized Gabor features, for which the statistics are slightly worse than the ones synthesized with the SLP. As discussed earlier, this indicates the relevance of each element of the Gabor feature vector, and selecting a limited subset may yield a slight disadvantage in this case.
Finally, several runs of the EFS (without exceeding 25) are performed until the performance improvement is no longer significant. The ANMRR and AP statistics, along with the corresponding statistics of the output dimensionality (best/mean) of the final synthesized feature vector (dbest), are presented in Table 10.13.
The average numbers of EFS runs for the synthesis of the RGB, YUV, LBP, and
Gabor features were 19.6, 18.7, 22.8, and 24.5, respectively.
Note that when EFS is performed with all properties enabled, the retrieval performance of the synthesized features is significantly improved compared to EFS with a single run. It can also be observed that the average dimensionalities of the final synthesized feature vectors (dbest) become lower. This is not surprising, since repeated applications of consecutive arithmetic operators can achieve a similar discrimination capability with fewer output feature elements. Dimensionality reduction in the synthesized features is also a desired property that makes the
Table 10.13 Retrieval performance statistics and the corresponding output dimensionalities for
the final synthesized features by several EFS runs using the fitness function in Eq. (10.4)
Corel RGB YUV LBP Gabor All
Min/mean ANMRR 0.365/0.385 0.369/0.397 0.372/0.396 0.408/0.428 0.258/0.280
Max/mean AP 0.616/0.596 0.610/0.584 0.613/0.589 0.577/0.559 0.716/0.694
Best/mean dim. 12/12.4 18/17.8 11/11.2 11/11.0
retrieval process faster. However, we noticed that the best retrieval results were usually obtained when a higher output dimensionality was maintained over several runs. This suggests that it could be beneficial to set the power a in Eq. (10.4) to a value higher than the currently used 1.1 in order to favor higher dimensionalities even more. However, this may not be a desired property, especially for large-scale databases.
Figure 10.5 illustrates four sample queries each of which is performed using the
original features, features synthesized by a single EFS run, and the features syn-
thesized by four EFS runs. It is obvious that multiple EFS runs improve the
discrimination power of the original features and, consequently, an improved
retrieval performance is achieved.
In order to perform comparative evaluations against evolutionary ANNs, features are synthesized using feed-forward ANNs, since several EFS runs are conceptually similar to MLPs, or especially to multiple concatenated SLPs. For the ANNs, MD PSO is used to evolve the optimal MLP architecture in an architecture space defined by Rmin = {Ni, 8, 4, No} and Rmax = {Ni, 16, 8, No}. Such 3-layer MLPs correspond to 3 EFS runs. The number of hidden neurons in the first/second hidden layer corresponds to the output dimensionality of the first/second EFS run (with MLPs the number of hidden neurons must be limited further to keep the training feasible). Naturally, the number of output neurons is fixed to C = 10, according to the 1-of-C encoding scheme. In order to provide a fair comparison, the number of MD PSO iterations is now set to 3 × 2,000 = 6,000 iterations. Table 10.14 presents the retrieval performance statistics for the features synthesized by the best MLP configuration evolved by MD PSO.
It is fairly clear that except for RGB histograms, the retrieval performance
statistics for features synthesized by EFS even with a single run are better than the
ones achieved by the best MLP. This basically demonstrates the significance of the
feature and operator selection.
As each EFS run is conceptually similar to a SLP, the EFS with multiple runs in
fact corresponds to the synthesis obtained by the concatenation of the multiple
SLPs (i.e., the output of the previous SLP is fed as the input to the next one, similar to the block diagram shown in Fig. 10.3). The retrieval performance statistics
obtained for the features synthesized by the concatenated SLPs are given in
Table 10.15.
Compared to the retrieval result statistics given in Table 10.9 for the batch queries over the class vectors of a single SLP, slightly better retrieval performances are obtained. However, similar to the results for the evolutionary MLPs, they are also significantly inferior to the ones obtained by EFS, as given in Table 10.13. Therefore, similar conclusions can be drawn about the significance of the three major properties of EFS presented in this chapter, i.e., feature and operator selection and the search for the optimal feature dimensionality. Furthermore, EFS provides a higher flexibility and better feature (or data) adaptation than regular ANNs, since 1) it has the capability to select the most appropriate linear/nonlinear operators among a large set of candidates, 2) it has the advantage of selecting proper features
Fig. 10.5 Four sample queries using original (left) and synthesized features with single (middle)
and four (right) runs. Top-left is the query image
Table 10.14 ANMRR and AP statistics for the features synthesized by the best MLP configu-
ration evolved by MD PSO
Corel RGB YUV LBP Gabor All
Min/mean ANMRR 0.392/0.442 0.527/0.558 0.513/0.545 0.490/0.547 0.307/0.348
Max/mean AP 0.594/0.543 0.465/0.433 0.445/0.475 0.498/0.442 0.673/0.633
Best ANN {Ni,9,5,No} {Ni,9,5,No} {Ni,8,6,No} {Ni,14,5,No}
Table 10.15 ANMRR and AP statistics for the features synthesized by the concatenated SLPs
Corel RGB YUV LBP Gabor All
Min/mean ANMRR 0.498/0.535 0.495/0.525 0.494/0.503 0.459/0.474 0.346/0.365
Max/mean AP 0.491/0.454 0.493/0.463 0.493/0.485 0.527/0.512 0.634/0.615
which in turn reduces the complexity of the solution space, and 3) it synthesizes features in an optimal dimensionality.
Note that the application of EFS is not limited to CBIR, but it can directly be
utilized to synthesize proper features for other application areas such as data
classification or object recognition, and perhaps even speech recognition. With
suitable fitness functions, synthesized features can be optimized for the application
at hand. To further improve performance, different feature synthesizers may be
evolved for different classes, since the features essential for the discrimination of a
certain class may vary.
10.5 Programming Remarks and Software Packages

Recall that the major MD PSO test-bed application is PSOTestApp, where several MD PSO applications, mostly based on data (or feature) clustering, are implemented. The basics of the MD PSO operation are described in Sect. 4.4.2 and FGBF in Sect. 6.4, whereas Sect. 8.4 explains the N-D feature clustering application for Holter register classification. In this section, we shall describe the programming details for performing the MD PSO-based EFS operation. For this, the option "8. Feature Synthesis…" should be selected from the main GUI of the PSOTestApp application; the main dialog object from the CPSOtestAppDlg class will then use the object from the CPSOFeatureSynthesis class in the void CPSOtestAppDlg::OnDopso() function to perform the EFS operation over the MUVIS low-level features stored in FeX files.
The API of this class is identical to that of the other major classes in the PSOTestApp application. The thread function, PSOThread(), is again called in a separate thread where the EFS operations are performed. The class structure is also quite similar to the CPSOclusterND class, where several fitness functions are implemented, each for a different EFS objective. Table 10.16 presents the class CPSOFeatureSynthesis. The input file for this application is a MUVIS image database file in the form of "*.idbs". Using the database handling functions, a MUVIS database file is loaded and its low-level features are retrieved. For each low-level feature, an individual EFS operation (with multiple repetitions if required) is performed and the synthesized features are then saved into the original FeX file, while the FeX file with the low-level features is simply renamed with the "FeX0" enumeration at the end. If there is more than one EFS block (run), then the synthesized features from the latest run are saved into the original FeX file, while the outcomes of the previous runs are saved into enumerated FeX files with "FeX1", "FeX2", etc. For example, let Corel_FeX.RGB be the FeX file for the Corel
database where the RGB color histogram features are stored. Assume that 2 EFS runs are performed, where each run can be repeated more than once and only the EFS with the best fitness score is kept while the others are simply discarded. At the end of the 2 EFS runs, the following files are generated:
• Corel_FeX0.RGB (The original low-level features)
• Corel_FeX1.RGB (The synthesized features after the first EFS run)
• Corel_FeX.RGB (The synthesized features after the second, and last, EFS run)
In this way, the synthesized features of the last EFS run can directly be used for a CBIR operation in MUVIS, while the features from the previous runs, as well as the original features, are still kept for backup purposes.
Recall that EFS has certain similarities to a classification operation. For instance, the EFS is also evolved over the training dataset (the GTD) of a database and tested over the test dataset. Therefore, the function Load_qbf() loads the GTD for this purpose, and another function, SeparateTrainingFeatures(), separates the FVs of the training dataset.
Table 10.17 presents the first part of the function CPSOFeatureSynthesis::PSOThread() with its basic blocks. Note that the first operation is to load the database ("*.idbs") file and the "*.qbf" file where the entire GTD resides. Once they are loaded, the EFS operation is performed for each feature in the database (i.e., the first for-loop), while each EFS operation can consist of one or more concatenated EFS runs (as shown in Fig. 10.3). For each EFS run, first the current features (either the low-level features, if this is the first EFS run, or the features synthesized by the previous EFS run) are loaded by the Load_FeX() function. Then the training dataset features that are used to evolve the synthesizer are separated by the function SeparateTrainingFeatures(). Next, the MD PSO parameters and the internal and static variables/pointers are all set, created, and/or allocated. Finally, the EFS operation by the MD PSO process is performed and repeated m_psoParam._repNo times. As mentioned earlier, the synthesizer, best_synthesis, obtained from the best MD PSO run in terms of the fitness score, is kept in an encoded form as a CVectorNDopt array of dimension best_dim, which is the dbest converged to by the MD PSO swarm. Figure 10.4 illustrates the encoding of best_synthesis in an array format.
Table 10.18 presents the second part of the function, CPSOFeatureSynthe-
sis::PSOThread() where the EFS, best_synthesis, is tested against the previous
synthesizer or the original features in terms of the fitness score achieved, i.e., it
will be kept only if there is an improvement on the fitness score. To accomplish the
test, first a dummy synthesizer is created (pOrigFeatures), which creates an
identical FV as the original one and then the fitness score is computed by simply
calling the fitness function, FitnessFnANN(pOrigFeatures, s_v_size). The fol-
lowing if statement compares both fitness scores and if there is an improvement,
then the synthesizer, best_synthesis, generates new features in a pointer array,
SF[], and saves them into the FeX file while renaming the previous one with an
enumerated number. Note that at this stage the synthesizer is applied over the
entire database, not only over the training dataset. Therefore, it is still unknown
10.5 Programming Remarks and Software Packages 317
whether or not the CBIR performance is improved by the synthesized features, and
this can then be verified by performing batch queries using this feature alone.
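The accept/reject step described above can be sketched as follows. This is only a model of the decision logic, under the assumption that the fitness is an error score (lower is better); the function names and the archive-renaming scheme are illustrative, since the actual MUVIS convention is not shown in the text.

```cpp
#include <cassert>
#include <string>

// Hedged sketch of the accept/reject logic: the evolved synthesizer
// replaces the previous features only on a strict improvement of the
// fitness (error) score returned by the fitness function.
bool keepSynthesizer(double evolvedScore, double originalScore) {
    return evolvedScore < originalScore;   // lower error wins
}

// When the new features are saved, the previous FeX file is renamed with
// an enumerated number; one plausible naming scheme (assumption only):
std::string archivedName(const std::string& fexFile, int runIndex) {
    return fexFile + "." + std::to_string(runIndex);
}
```

For instance, an evolved score of 0.12 against an original score of 0.30 would be accepted, and the previous file "Features.fex" would be archived as "Features.fex.1".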
Table 10.19 presents the current fitness function, FitnessFnANN(pOrigFeatures,
s_v_size). This corresponds to the second approach, which adopts a
methodology similar to the one used with ANNs, i.e., the ECOC scheme. The
first step is to apply the synthesis with the candidate synthesizer stored in the MD
318 10 Evolutionary Feature Synthesis
PSO particle's position, pPos, in dimension nDim. The function call, ApplyFSadaptive(true,
true, SynthesizedFeatures, pCC, nDim), synthesizes the features and
stores them in the static variable SynthesizedFeatures. The fitness score is the
normalized mean square error (MSE) between the synthesized features and the ECOC
code of this dimension, stored in the static variable mean_vec[c][d]. Note that
while the MSE is calculated in the last for-loop, the per-dimension MSE values
are also stored as the individual dimensional fitness scores that are used for the
FGBF operation.
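The fitness computation just described can be sketched as below: an overall normalized MSE between the synthesized features and a target ECOC code, with the per-dimension MSE kept separately so that FGBF can later pick out the best individual dimensions. All names are illustrative, and the per-class indexing of mean_vec[c][d] is collapsed here into a single target vector for brevity.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hedged sketch of an ECOC-style fitness: total score is the normalized
// MSE between synthesized features and the target code, while dimScore
// keeps the per-dimension MSE used by the FGBF operation.
double ecocFitness(const std::vector<std::vector<double>>& synth, // [item][dim]
                   const std::vector<double>& target,             // code per dim
                   std::vector<double>& dimScore)                 // out: MSE per dim
{
    const size_t n = synth.size(), d = target.size();
    dimScore.assign(d, 0.0);
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < d; ++j) {
            const double e = synth[i][j] - target[j];
            dimScore[j] += e * e;                    // accumulate squared error
        }
    double total = 0.0;
    for (size_t j = 0; j < d; ++j) {
        dimScore[j] /= static_cast<double>(n);       // per-dimension MSE (FGBF)
        total += dimScore[j];
    }
    return total / static_cast<double>(d);           // normalized overall MSE
}
```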
Table 10.20 presents the feature synthesizer function, CPSOFeatureSynthesis::ApplyFSadaptive(),
which synthesizes the FVs of either the entire database,
s_pInputFeatures, or the training dataset, s_pTrainFeatures. Recall that nDim is
the dimension of the MD PSO particle, and also corresponds to the dimension of
the synthesized feature (FV). Therefore, the particle’s position vector is in
dimension nDim and each dimensional component holds an individual synthesizer
encoded in an array of fixed length, FS_DEPTH. As shown in Fig. 10.4, within
the for-loop of each dimension, the elements of each synthesizer array are decoded
into FV component indices, operator indices, weights, and biases. For instance,
int feature1 = ((int) pCC[j].m_pV[0]) + feat_min;
float weight1 = ABS(pCC[j].m_pV[0] - ((int) pCC[j].m_pV[0]));
The variable, feature1, is the index of the first FV component (of the original
feature) and the variable, weight1, is its weight. Note that within the following for-
loop, the FV component index and the weight of the second FV are also decoded
as well as the first operator index, FSoperator. Then, for the entire dataset, the
(jth) dimensional component of the synthesized features is computed using the
associated operator, oFn[FSoperator].
Note that the first term, SynthesizedFeatures[i][j], was already assigned to the
scaled (with weight1) version of the first feature, feature1. Therefore, the first
term of the operator will always be the output of the previous operator, and the
second term will be the new scaled (with weight2) component of the original FV.
The final synthesized feature may then be passed through the activation function,
tanh, to scale it into the [-1, 1] range.
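The per-dimension synthesis chain described above can be sketched as follows. The running result starts as the first scaled FV component; each subsequent decoded slot feeds the previous output and a newly scaled component into the selected binary operator; and the result may finally be squashed by tanh. The operator set and all names here are assumptions for illustration, not the book's actual operator table.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Illustrative operator table standing in for oFn[] (assumed set).
using BinOp = double (*)(double, double);
static double opAdd(double a, double b) { return a + b; }
static double opSub(double a, double b) { return a - b; }
static double opMul(double a, double b) { return a * b; }
static const BinOp oFn[] = { opAdd, opSub, opMul };

// Synthesize one dimensional component from an original FV, given the
// decoded component indices, weights, and operator indices of one
// synthesizer; optionally squash the output into [-1, 1] with tanh.
double synthesizeDim(const std::vector<double>& fv,
                     const std::vector<int>& featIdx,
                     const std::vector<double>& weight,
                     const std::vector<int>& opIdx,
                     bool useTanh)
{
    double out = weight[0] * fv[featIdx[0]];        // first scaled component
    for (size_t k = 1; k < featIdx.size(); ++k)     // chain: previous output
        out = oFn[opIdx[k - 1]](out, weight[k] * fv[featIdx[k]]);
    return useTanh ? std::tanh(out) : out;          // optional squashing
}
```

This mirrors the note in the text: the first operand is always the output of the previous operator, and the second operand is the next weighted component of the original FV.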
The MD PSO operation for EFS is identical to the one for N-D feature clustering,
since for both operations the task is to find a certain number (dbest) of N-D
arrays: for EFS they carry the synthesizers in encoded form, whereas for N-D
clustering they carry the N-D cluster centroids. However, the FGBF operations
differ entirely. As explained in Sect. 6.4.1, the FGBF operation for clustering is
handled in the FGBF_CLFn() function, whereas the other function, FGBF_FSFn(), is for
EFS. Note that this is the basic FGBF operation as explained in Sect. 5.5.1, and
the implementation is identical to the one presented in Table 5.17.
References