
COMPUTATIONAL

INTELLIGENCE
AND ITS APPLICATIONS
Evolutionary Computation, Fuzzy Logic, Neural Network
and Support Vector Machine Techniques



COMPUTATIONAL
INTELLIGENCE
AND ITS APPLICATIONS
Evolutionary Computation, Fuzzy Logic, Neural Network
and Support Vector Machine Techniques

Editors

H. K. Lam
King’s College London, UK

S. H. Ling • H. T. Nguyen
University of Technology, Australia

Imperial College Press




Published by
Imperial College Press
57 Shelton Street
Covent Garden
London WC2H 9HE

Distributed by
World Scientific Publishing Co. Pte. Ltd.
5 Toh Tuck Link, Singapore 596224
USA office: 27 Warren Street, Suite 401-402, Hackensack, NJ 07601
UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE

British Library Cataloguing-in-Publication Data


A catalogue record for this book is available from the British Library.

COMPUTATIONAL INTELLIGENCE AND ITS APPLICATIONS


Evolutionary Computation, Fuzzy Logic, Neural Network and Support Vector
Machine Techniques

Copyright © 2012 by Imperial College Press


All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means,
electronic or mechanical, including photocopying, recording or any information storage and retrieval
system now known or to be invented, without written permission from the Publisher.

For photocopying of material in this volume, please pay a copying fee through the Copyright
Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to
photocopy is not required from the publisher.

ISBN-13 978-1-84816-691-2
ISBN-10 1-84816-691-5

Printed in Singapore.


Preface

Computational intelligence techniques are fast-growing and promising
research topics that have drawn a great deal of attention from researchers
for many years. This volume brings together many different aspects
of the current research on intelligence technologies such as neural
networks, support vector machines, fuzzy logics, evolutionary computing
and swarm intelligence. The combination of these techniques provides an
effective treatment toward some industrial and biomedical applications.
Most real-world problems are complex and even ill-defined. A lack of
knowledge about a problem, or an excess of information, makes classical
analytical methodologies difficult to apply and unlikely to yield reasonable results.
Computational intelligence techniques demonstrate superior learning and
generalization abilities on handling these complex and ill-defined problems.
By using appropriate computational intelligence techniques, some essential
characteristics and important information can be extracted to deal with
the problems. It has been shown that various computational intelligence
techniques have been successfully applied to a wide range of applications
from pattern recognition and system modeling to intelligent control
problems and biomedical applications.
This edited volume provides the state-of-the-art research on significant
topics in the field of computational intelligence. It presents fundamental
concepts and essential analysis of various computational techniques to offer
a systematic and effective tool for better treatment of different applications.
Simulation and experimental results are included to illustrate the design
procedure and the effectiveness of the approaches. With collective
experiences and the knowledge of leading researchers, the important
problems and difficulties are fully addressed, concepts are fully explained
and methodologies are provided to handle various problems.
This edited volume comprises 13 chapters which fall into 4 main
categories: (1) Evolutionary computation and its applications, (2) Fuzzy
logics and their applications, (3) Neural networks and their applications
and (4) Support vector machines and their applications.
Chapter 1 compares three machine learning methods, support vector
machines, AdaBoost and soft margin AdaBoost algorithms, to solve the
pose estimation problem. Experiment results show that both the support
vector machines-based method and soft margin AdaBoost-based method
are able to reliably classify frontal and pose images better than the original
AdaBoost-based method.
Chapter 2 proposes a particle swarm optimization for polynomial
modeling in a dynamic environment. The performance of the proposed
particle swarm optimization is evaluated by polynomial modeling based
on a set of dynamic benchmark functions. Results show that the proposed
particle swarm optimization can find significantly better polynomial models
than genetic programming.
Chapter 3 deals with the problem of restoration of color-quantized
images. A restoration algorithm based on particle swarm optimization
with multi-wavelet mutation is proposed to handle the problem. Simulation
results show that it can improve the quality of a half-toned color-quantized
image remarkably in terms of both signal-to-noise ratio improvement and
convergence rate, and that the subjective quality of the restored images is also
improved.
Chapter 4 deals with a non-invasive hypoglycemia detection for Type 1
diabetes mellitus (T1DM) patients based on the physiological parameters
of the electrocardiogram signals. An evolved fuzzy inference model is
developed for the classification of hypoglycemia, with its rules and
membership functions optimized using a hybrid particle swarm optimization
method with wavelet mutation.
Chapter 5 studies the limit cycle behavior of weights of perceptron. It
is proposed that the perceptron exhibiting the limit cycle behavior can
be employed for solving a recognition problem when downsampled sets
of bounded training feature vectors are linearly separable. Numerical
computer simulation results show that the perceptron exhibiting the limit
cycle behavior can achieve a better recognition performance compared to a
multi-layer perceptron.
Chapter 6 presents an alternative information theoretic criterion
(minimum description length) to determine the optimal architecture of
neural networks according to the equilibrium between the model parameters
and model errors. The proposed method is applied for modeling of various
data using neural networks for verification.

Chapter 7 solves eigen-problems of matrices using neural networks.
Several recurrent neural network models are proposed and each model is
expressed as an individual differential equation, with its analytic solution
being obtained. The convergence properties of the neural network models
are fully discussed based on the solutions to these differential equations and
the computing steps are designed toward solving the eigen-problems, with
numerical simulations being provided to evaluate each model’s effectiveness.
Chapter 8 considers methods for automating the insertion of self-tapping
screws. A new methodology for monitoring the insertion of self-tapping
screws is developed based on radial basis function neural networks, which
are able to generalize and to correctly classify unseen insertion signals as
belonging to a single insertion case. Both the computer simulation
and experimental results show that after a modest training period, the
neural network is able to correctly classify torque signature signals.
Chapter 9 applies the theory behind both support vector classification
and regression to deal with real-world problems. A classifier is developed
which can accurately estimate the risk of developing heart disease simply
from the signal derived from a finger-based pulse oximeter. The regression
example shows how support vector machines can be used to rapidly and
effectively recognize hand-written characters, in particular those of the
so-called graffiti character set.
Chapter 10 proposes a control-oriented modeling approach to depict
nonlinear behavior of heart rate response at both the onset and offset of
treadmill exercise to accurately regulate cardiovascular response to exercise
for the individual exerciser.
Chapter 11 explores control methodologies to handle the time-variant
behavior of heart rate dynamics at the onset and offset of exercise. The
effectiveness of the proposed modeling and control approach is demonstrated
by regulating the dynamic heart rate response to exercise through
simulation using MATLAB.
Chapter 12 investigates real-time fault detection and isolation for
heating, ventilation and air conditioning systems by using an online support
vector machine. Simulation studies are given to show the effectiveness of
the proposed online fault detection and isolation approach.
This edited volume covers state-of-the-art computational intelligence
techniques and the materials are suitable for post-graduate students and
researchers as a reference in engineering and science. Particularly, it
is more suitable for researchers working on computational intelligence
including evolutionary computation, fuzzy logics, neural networks and
support vector machines. Moreover, a wide range of applications using
computational intelligence techniques, such as biomedical problems, control
systems, forecasting, optimization problems, pattern recognition and
system modeling, are covered. These problems can be commonly found
in industrial engineering applications. So, this edited volume can be a
good reference, providing concepts, techniques, methodologies and analysis,
for industrial engineers applying computational intelligence to deal with
engineering problems.
We would like to thank all the authors for their contributions to
this edited volume. Thanks also to the staff members of the Division
of Engineering, King’s College London and the Faculty of Engineering
and Information Technology, University of Technology, Sydney for their
comments and support. The editor, H.K. Lam, would like to thank his
wife, Esther Wing See Chan, for her patience, understanding, support and
encouragement that make this work possible. Last but not least, we would
like to thank the publisher Imperial College Press for the publication of this
edited volume and the staff who have offered support during the preparation
of the manuscript.
The work described in this book was substantially supported by grants
from King’s College London and University of Technology, Sydney.

H.K. Lam
S.H. Ling
H.T. Nguyen

Contents

Preface v

Evolutionary Computation and its Applications 1

1. Maximal Margin Algorithms for Pose Estimation 3
   Ying Guo and Jiaming Li

2. Polynomial Modeling in a Dynamic Environment based on a Particle Swarm Optimization 23
   Kit Yan Chan and Tharam S. Dillon

3. Restoration of Half-toned Color-quantized Images Using Particle Swarm Optimization with Multi-wavelet Mutation 39
   Frank H.F. Leung, Benny C.W. Yeung and Y.H. Chan

Fuzzy Logics and their Applications 59

4. Hypoglycemia Detection for Insulin-dependent Diabetes Mellitus: Evolved Fuzzy Inference System Approach 61
   S.H. Ling, P.P. San and H.T. Nguyen

Neural Networks and their Applications 87

5. Study of Limit Cycle Behavior of Weights of Perceptron 89
   C.Y.F. Ho and B.W.K. Ling

6. Artificial Neural Network Modeling with Application to Nonlinear Dynamics 101
   Yi Zhao

7. Solving Eigen-problems of Matrices by Neural Networks 127
   Yiguang Liu, Zhisheng You, Bingbing Liu and Jiliu Zhou

8. Automated Screw Insertion Monitoring Using Neural Networks: A Computational Intelligence Approach to Assembly in Manufacturing 183
   Bruno Lara, Lakmal D. Seneviratne and Kaspar Althoefer

Support Vector Machines and their Applications 211

9. On the Applications of Heart Disease Risk Classification and Hand-written Character Recognition using Support Vector Machines 213
   S.R. Alty, H.K. Lam and J. Prada

10. Nonlinear Modeling Using Support Vector Machine for Heart Rate Response to Exercise 255
    Weidong Chen, Steven W. Su, Yi Zhang, Ying Guo, Nghir Nguyen, Branko G. Celler and Hung T. Nguyen

11. Machine Learning-based Nonlinear Model Predictive Control for Heart Rate Response to Exercise 271
    Yi Zhang, Steven W. Su, Branko G. Celler and Hung T. Nguyen

12. Intelligent Fault Detection and Isolation of HVAC System Based on Online Support Vector Machine 287
    Davood Dehestani, Ying Guo, Sai Ho Ling, Steven W. Su and Hung T. Nguyen

Index 305

PART 1

Evolutionary Computation and its Applications

Chapter 1

Maximal Margin Algorithms for Pose Estimation

Ying Guo and Jiaming Li


ICT Centre, CSIRO
PO Box 76, Epping, NSW 1710, Australia
ying.guo, [email protected]

This chapter compares three machine learning methods to solve
the pose estimation problem. The methods were based on support
vector machines (SVMs), AdaBoost and soft margin AdaBoost (SMA)
algorithms. Experiment results show that both the SVM-based method
and SMA-based method are able to reliably classify frontal and
pose images better than the original AdaBoost-based method. This
observation leads us to compare the generalization performance of these
algorithms based on their margin distribution graphs.
For a classification problem, feature selection is the first step, and
selecting better features results in better classification performance. The
feature selection method described in this chapter is easy and efficient.
Instead of resizing the whole facial image to a subwindow (as in the
common pose estimation process), the prior knowledge in this method is
only the eye location, hence simplifying its application. In addition, the
method performs extremely well even when some facial features such as
the nose or the mouth become partially or wholly occluded.
Experiment results show that the algorithm performs very well, with
a correct recognition rate around 98%, regardless of the scale, lighting
or illumination conditions associated with the face.

Contents

1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Pose Detection Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2.1 Procedure of pose detection . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2.2 Eigen Pose Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3 Maximal Margin Algorithms for Classification . . . . . . . . . . . . . . . . . . 9
1.3.1 Support vector machines . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.3.2 Boosting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.3.3 Soft margin AdaBoost . . . . . . . . . . . . . . . . . . . . . . . . . . . 12


1.3.4 Margin distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13


1.4 Experiments and Discussions . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.4.1 Data preparation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.4.2 Data preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.4.3 Experiment results of Group A . . . . . . . . . . . . . . . . . . . . . . 14
1.4.4 Margin distribution graphs . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.4.5 Experiment results of Group B . . . . . . . . . . . . . . . . . . . . . . 18
1.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

1.1. Introduction

Research in face detection, face recognition and facial expression usually
focuses on using frontal view images. However, approximately 75% of faces
in normal photographs are non-frontal [1]. Significant improvements in
many computer vision algorithms dealing with human faces can be obtained
if we can achieve an accurate estimation of the pose of the face, hence pose
estimation is an important problem.
CSIRO has developed a real-time face capture and recognition system
(SQIS - System for Quick Image Search) [2], which can automatically
capture a face in a video stream and verify this against face images stored
in a database to inform the operator if a match occurs. It is observed that
the system performs better for the frontal images, so a method is required
to separate the frontal images from pose images for the SQIS system.
Pose detection is hard because large changes in orientation significantly
change the overall appearance of a face. Attempts have been made to use
view-based appearance models with a set of view-labeled appearances (e.g.
[3]). Support vector machines (SVMs) have been successfully applied
to model the appearance of human faces which undergo nonlinear change
across multiple views, and these have achieved good performance [4–7].
SVMs are linear classifiers that use the maximal margin hyperplane in a
feature space defined by a kernel function. Here, margin is a measure
of the generalization performance of a classifier. An example is classified
correctly if and only if it has a positive margin. A large value of the margin
indicates that there is little uncertainty in the classification of the point.
These concepts will be explained precisely in Section 1.3.
Recently, there has been great interest in ensemble methods for learning
classifiers, and in particular in boosting algorithms [8]; boosting is a general
method for improving the accuracy of a basic learning algorithm. The best
known boosting algorithm is the AdaBoost algorithm. These algorithms
have proved surprisingly effective at improving generalization performance
in a wide variety of domains, and for diverse base learners. For instance,
Viola and Jones demonstrated that AdaBoost can achieve both good speed
and performance in face detection [9]. However, research also showed that
AdaBoost often places too much emphasis on misclassified examples which
may just be noise. Hence, it can suffer from overfitting, particularly with
a highly noisy data set. The soft margin AdaBoost algorithm [10]
introduces a regularization term to achieve a soft margin, which allows
mislabeled samples to exist in the training data set and improves the
generalization performance of the original AdaBoost.
Researchers have pointed out the equivalence between the mathematical
programs underlying both SVMs and boosting, and formalized a
correspondence. For instance, in [11], a given hypothesis set in boosting can
correspond to the choice of a particular kernel in SVMs and vice versa. Both
SVMs and boosting are maximal margin algorithms, for which the generalization
performance of a hypothesis f can be bounded in terms of the margin with
respect to the training set. In this chapter, we will compare these maximal
margin algorithms, SVM, AdaBoost and SMA, in one practical pattern
recognition problem – pose estimation. We will especially analyze their
generalization performance in terms of the margin distribution.
This chapter is directed toward a pose detection system that can
separate frontal images (within ±25°) from pose images (greater angles),
under different scale, lighting or illumination conditions. The method uses
Principal Component Analysis (PCA) to generate a representation of the
facial image’s appearance, which is parameterized by geometrical variables
such as the angle of the facial image. The maximal margin algorithms are
then used to generate a statistical model, which captures variation in the
appearance of the facial angle.
All the experiments were evaluated on the CMU PIE database [12],
Weizmann database [13], CSIRO front database and CMU profile face
testing set [14]. It is demonstrated that both the SVM-based model and
SMA-based model are able to reliably classify frontal and pose images better
than the original AdaBoost-based model. This observation leads us to
discuss the generalization performance of these algorithms based on their
margin distribution graphs.

The remainder of this chapter is organized as follows. We proceed
in Section 1.2 to explain our approach and the feature extraction. In
Section 1.3, we analyze the maximal margin algorithms, including the
theoretic introduction of SVMs, AdaBoost and SMA. In Section 1.4, we will
present some experiment results, and analyze different algorithms’ margin
distribution. The conclusions are discussed in Section 1.5.

1.2. Pose Detection Algorithm

1.2.1. Procedure of pose detection

Murase and Nayar [15] proposed a continuous compact representation of
object appearance that is parameterized by geometrical variables such as
object pose. In this pose estimation approach, the principal component
analysis (PCA) is applied to make pose detection more efficient. PCA is a
well-known method for computing the directions of greatest variance for a
set of vectors. Here the training facial images are transformed into vectors
first. That is, each l × w pixel window is vectorised to a (l × w)-element
vector v ∈ RN , where N = l × w. By computing the eigenvectors of the
covariance matrix of a set of v, PCA determines an orthogonal basis, called
the eigen pose space (EPS), in which to describe the original vector v, i.e.
vector v is projected into a vector x in the subspace of eigenspace, where
x ∈ Rd and normally d ≪ N .
According to the pose angle θi of the training image vi , the
corresponding label yi of each vector xi is defined as
    yi = +1 if |θi| ≤ 25°, and yi = −1 otherwise.

The next task is to generate a decision function f (x) based on a set of
training samples (x1 , y1 ), . . . , (xm , ym ) ∈ Rd × R. We call x inputs from
the input space X and y outputs from the output space Y. We will use
the shorthand si = (xi , yi ), Xm = (x1 , . . . , xm ), Y m = (y1 , . . . , ym ) and
S m = (Xm , Y m ). The sequence S m (sometimes also S) is generally called
a training set, which is assumed to be distributed according to the product
probability distribution Pm (x, y). The maximal margin machine learning
algorithms, such as the SVMs, AdaBoost and SMA, are applied to solve
this binary classification problem. Figure 1.1 shows the procedure of the
whole pose detection approach.
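The procedure in Fig. 1.1 can be outlined in a few lines of Python. This is only an illustrative sketch of the approach described above (vectorise the windows, build the EPS by PCA, project, and label by pose angle), not the authors' implementation; the array shapes and function names are assumptions.

import numpy as np

def build_eps(windows, d=100):
    """Build the eigen pose space: stack the l*w pixel windows as rows,
    centre them, and keep the d leading principal directions."""
    V = np.asarray([w.ravel() for w in windows], dtype=float)
    mean = V.mean(axis=0)
    # Right singular vectors of the centred data are the eigen poses.
    _, _, Vt = np.linalg.svd(V - mean, full_matrices=False)
    return mean, Vt[:d]

def project(window, mean, eps):
    """Project a vectorised facial window into the eigen pose space."""
    return eps @ (window.ravel() - mean)

def pose_label(theta_deg):
    """Label: +1 for frontal (|angle| <= 25 degrees), -1 for pose images."""
    return 1 if abs(theta_deg) <= 25 else -1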

1.2.2. Eigen Pose Space

We chose 3003 facial images from the CMU PIE database under 13 different
poses to generate the EPS. The details of PIE data set can be seen in
[12], where the facial images are from 68 persons, and each person was
photographed using 13 different poses, 43 different illumination conditions
and 4 different expressions. Figure 1.2 shows the pose variations.
A mean pose image and set of orthonormal eigen poses are produced.
The first eight eigen poses are shown in Fig. 1.3. Figure 1.4 shows the


Fig. 1.1. Flow diagram of pose detection algorithm.

Fig. 1.2. The pose variation in the PIE database. The pose varies from full left profile
to full frontal and on to full right profile. The nine cameras in the horizontal sweep are
each separated by about 22.5o . The four other cameras include one above and one below
the central camera, and two in the corners of the room, which are typical locations for
surveillance cameras.


Fig. 1.3. Mean face and first eight eigen poses.

Fig. 1.4. Pose Eigenvalue Spectrum, normalized by the largest eigenvalue.

eigenvalue spectrum, which decreases rapidly. Good reconstruction is
achieved using at most 100 eigenvectors, as shown in Fig. 1.5, where the
sample face from the Weizmann database is compared with reconstructions
using increasing numbers of eigenvectors. The number of eigenvectors is
shown on the top of each reconstructed images. The first 100 eigenvectors
were saved to construct the EPS.


Fig. 1.5. Reconstructed image (Weizmann) using the eigen poses of the PIE data set.
Top: original image. Bottom: sample (masked and relocated) and its reconstruction
using the given numbers of eigenvectors. The number of eigenvectors is above each
reconstructed image.

1.3. Maximal Margin Algorithms for Classification

As shown in the procedure (Fig. 1.1), the projection of a facial image to
the EPS is classified as a function of the pose angle. A classifier is required
to learn the relationship. We will apply three maximal margin algorithms,
SVMs, AdaBoost and SMA.
Many binary classification algorithms, such as neural networks, support
vector machines and the combined classifiers produced by voting methods,
produce their output by thresholding a real-valued function. It is often
useful to deal with the continuous function directly, since the thresholded
output contains less information. The real-valued output can therefore be
interpreted as a measure of generalization performance in pattern recognition,
and this generalization performance is expressed in terms of the margin.
Next we define the margin of an example:

Definition 1.1 (Margin). Given an example s = (x, y) ∈ Rd × {−1, 1} and
a real-valued hypothesis f : Rd → R, the margin of s with respect to f is
defined as γ(s) := yf (x).

From this definition, we can see that an example is classified correctly if
and only if it has a positive margin. A large value of the margin indicates
that there is little uncertainty in the classification of the point. Thus,
we would expect that a hypothesis with large margins would have good
generalization performance. In fact, it can be shown that achieving a
large margin on the training set results in an improved bound on the
generalization performance [16, 17]. Geometrically, for “well behaved”
functions f , the distance between a point and the decision boundary will
roughly correspond to the magnitude of the margin at that point. The
learning algorithm which separates the samples with the maximal margin
hyperplane is called the maximal margin algorithm. There are two main
maximal margin algorithms: support vector machines and boosting.

1.3.1. Support vector machines


As a powerful classification algorithm, SVMs [18] generate the maximal
margin hyperplane in a feature space defined by a kernel function. For
the binary classification problem, the goal of an SVM is to find an optimal
hyperplane f (x) to separate the positive examples from the negative ones
with maximum margin. The points which lie on the hyperplane satisfy
⟨w, x⟩ + b = 0. The decision function which gives the maximum margin is
    f (x) = sign( Σ_{i=1}^{m} yi αi ⟨x, xi⟩ + b ).        (1.1)

The parameter αi is non-zero only for inputs xi closest to the hyperplane
f (x) (within the functional margin). All the other parameters αi are zero.
The inputs with non-zero αi are called support vectors. Notice f (x) depends
only on the inner products between inputs. Suppose the data are mapped
to some other inner product space S via a nonlinear map Φ : Rd → S. Now
the above linear algorithm will operate in S, which would only depend on
the data through inner products in S : ⟨Φ(xi ), Φ(xj )⟩. Boser, Guyon and
Vapnik [19] showed that there is a simple function k such that k(x, xi ) =
⟨Φ(x), Φ(xi )⟩, which can be evaluated efficiently. The function k is called a
kernel. So we only need to use the kernel k in the optimization algorithm
and never need to explicitly know what Φ is.

For linear SVM, the kernel k is just the inner product in the input space,
i.e. k(x, xi ) = ⟨x, xi ⟩, and the corresponding decision function is (1.1). For
nonlinear SVMs, there are a number of kernel functions which have been
found to provide good performance, such as polynomials and radial basis
function (RBF). The corresponding decision function for a nonlinear SVM
is
    f (x) = sign( Σ_{i=1}^{m} yi αi k(x, xi) + b ).        (1.2)
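As a concrete illustration of the decision functions (1.1) and (1.2), the sketch below evaluates f(x) from a set of support vectors, their labels and dual coefficients. It assumes the coefficients αi and the bias b have already been obtained by an SVM solver; the polynomial kernel of degree 3 with zero bias matches the setting used later in Section 1.4.

import numpy as np

def poly_kernel(x, xi, degree=3, bias=0.0):
    # k(x, xi) = (<x, xi> + bias)^degree
    return (np.dot(x, xi) + bias) ** degree

def svm_decision(x, support_vectors, labels, alphas, b, kernel=poly_kernel):
    """f(x) = sign( sum_i y_i * alpha_i * k(x, x_i) + b ), as in (1.2)."""
    s = sum(y * a * kernel(x, sv)
            for sv, y, a in zip(support_vectors, labels, alphas))
    return np.sign(s + b)

With k(x, xi) = ⟨x, xi⟩ the same function reduces to the linear decision function (1.1).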

1.3.2. Boosting
Boosting is a general method for improving the accuracy of a learning
algorithm. In recent years, many researchers have reported significant
improvements in the generalization performance using boosting methods
with learning algorithms such as C4.5 [20] or CART [21] as well as with
neural networks [22].
A number of popular and successful boosting methods can be seen as
gradient descent algorithms, which implicitly minimize some cost function
of the margin [23–25]. In particular, the popular AdaBoost algorithm [8]
can be viewed as a procedure for producing voted classifiers which minimize
the sample average of an exponential cost function of the training margins.
The aim of boosting algorithms is to provide a hypothesis which is a
voted combination of classifiers of the form sign (f (x)), with

    f (x) = Σ_{t=1}^{T} αt ht (x),        (1.3)

where αt ∈ R are the classifier weights, ht are base classifiers from some
class F and T is the number of base classifiers chosen from F. Boosting
algorithms take the approach of finding voted classifiers which minimize
the sample average of some cost function of the margin.
Nearly all the boosting algorithms iteratively construct the combination
one classifier at a time. So we will denote the combination of the first t
classifiers by ft , while the final combination of T classifiers will simply be
denoted by f .
In [17], Schapire et al. show that boosting is good at finding classifiers
with large margins in that it concentrates on those examples whose margins
are small (or negative) and forces the base learning algorithm to generate
good classifications for those examples. Thus, boosting can effectively find
a large margin hyperplane.
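To make the voted combination (1.3) concrete, the following sketch implements the standard AdaBoost weighting scheme of [8] with decision stumps as base classifiers. It is a generic illustration rather than the configuration of this chapter (the experiments in Section 1.4 use RBF networks as base learners); labels are assumed to be in {−1, +1}.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    """Standard AdaBoost: returns base classifiers h_t and weights alpha_t
    so that f(x) = sum_t alpha_t * h_t(x)."""
    m = len(y)
    w = np.full(m, 1.0 / m)                      # sample distribution
    classifiers, alphas = [], []
    for _ in range(T):
        h = DecisionTreeClassifier(max_depth=1)  # decision stump as base learner
        h.fit(X, y, sample_weight=w)
        pred = h.predict(X)
        err = np.sum(w * (pred != y))
        if err >= 0.5:                           # no better than chance: stop
            break
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))
        w *= np.exp(-alpha * y * pred)           # emphasize misclassified examples
        w /= w.sum()
        classifiers.append(h)
        alphas.append(alpha)
    return classifiers, alphas

def adaboost_predict(X, classifiers, alphas):
    f = sum(a * h.predict(X) for h, a in zip(classifiers, alphas))
    return np.sign(f)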

1.3.3. Soft margin AdaBoost

Although AdaBoost is remarkably successful in practice, AdaBoost’s cost
function often places too much emphasis on examples with large negative
margins. It was shown theoretically and experimentally that AdaBoost is
especially effective at increasing the margins of the training examples [17],
but the generalization performance of AdaBoost is not guaranteed. Hence
it can suffer from overfitting, particularly in high noise situations [26].
For the problem of approximating a smooth function from sparse data,
regularization techniques [27, 28] impose constraints on the approximating
set of functions. Rätsch et al. [10] show that versions of AdaBoost
modified to use regularization are more robust for noisy data. Mason et
al. [25] discuss the problem of approximating a smoother function based
on boosting algorithms and a regularization term was also suggested to be
added to the original cost function. This term represents the “mistrust” to
a noisy training sample, and allows it to be misclassified (negative margin)
in the training process. The final hypothesis f (x) obtained this way has
worse training error but better generalization performance compared to
f (x) of the original AdaBoost algorithm.
There is a broad range of choices for the regularization term, including
many of the popular generalized additive models used in some regularization
networks [29]. In the so-called soft margin AdaBoost, the regularization
term ηt (si ) is defined as
    ηt (si ) = ( Σ_{r=1}^{t} αr wr (si ) )²,

where w is the sample weight and t the training iteration index (cf. [10] for
a detailed description). The soft margin of a sample si is then defined as

γ̃(si ) := γ(si ) + ληt (si ), for i = 1, · · · , m,

where λ is the regularization constant and ηt (si ) the regularization term.
A large value of ηt (si ) for some patterns allows a larger soft margin
γ̃(si ). Here λ balances the trade-off between goodness-of-fit and simplicity
of the hypothesis. In the noisy case, SMA prefers hypotheses which do
not rely on only a few samples with smaller values of η(si ). So by using
the regularization method, AdaBoost is not changed for easily classifiable
samples, but only for the most difficult ones.
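The soft margin defined above can be computed directly from quantities that the boosting procedure already maintains: the ordinary margins γ(si), the classifier weights αr and the per-sample weights wr(si) from each iteration. The sketch below simply evaluates the two formulas; the names and the way the weight history is stored are assumptions, and λ would be chosen to balance goodness-of-fit against simplicity as described above.

import numpy as np

def soft_margins(margins, alpha_history, weight_history, lam):
    """margins:        gamma(s_i) = y_i * f_t(x_i), shape (m,)
    alpha_history:  [alpha_1, ..., alpha_t]
    weight_history: per-iteration sample weight vectors w_r(s_i), each shape (m,)
    Returns gamma~(s_i) = gamma(s_i) + lam * eta_t(s_i),
    where eta_t(s_i) = ( sum_r alpha_r * w_r(s_i) )**2."""
    eta = np.square(sum(a * w for a, w in zip(alpha_history, weight_history)))
    return np.asarray(margins) + lam * eta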

1.3.4. Margin distribution


In this chapter, we will compare these maximal margin algorithms in one
practical pattern recognition problem, pose estimation. Our comparison
will be based on the margin distribution of different algorithms over the
training data. The key idea of this analysis is the following: In order
to analyze the generalization error, one should consider more than just
the training error, which is the number of incorrect classifications in the
training set. One should also take into account the confidence of the
classifications [30]. We will use margin as a measure of the classification
confidence, for which it is possible to prove that an improvement in margin
on the training set guarantees an improvement in the upper bound on the
generalization error. For maximal margin algorithms, it is easy to see that
slightly perturbing one training example with a large margin is unlikely
to cause a change in the hypothesis f , and thus have little effect on the
generalization performance of f . Hence, the distribution of the margin
over the whole set of training examples is useful to analyze the confidence
of the classification. To visualize this distribution, we plot the fraction of
examples whose margin is at most γ as a function of γ, with the margins
normalized to [−1, 1], and analyze it in the experimental section. These
graphs are referred to as margin distribution graphs [17].
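A margin distribution graph can be produced directly from the real-valued outputs f(xi): sort the margins yi f(xi) (normalized to [−1, 1]) and plot the cumulative fraction of examples whose margin is at most γ. The following sketch, which assumes matplotlib, illustrates how the graphs analyzed in Section 1.4.4 can be generated; the classifier outputs f_svm, f_ada and f_sma in the usage comment are placeholders.

import numpy as np
import matplotlib.pyplot as plt

def margin_distribution(y, f_values):
    """Return sorted normalized margins and the cumulative fraction of
    examples whose margin is at most each value."""
    margins = y * f_values
    margins = margins / np.max(np.abs(margins))   # normalize to [-1, 1]
    margins = np.sort(margins)
    fraction = np.arange(1, len(margins) + 1) / len(margins)
    return margins, fraction

# Usage (placeholders): overlay the graphs of several trained classifiers.
# for name, f_vals in {"SVM": f_svm, "AdaBoost": f_ada, "SMA": f_sma}.items():
#     g, frac = margin_distribution(y_train, f_vals)
#     plt.plot(g, frac, label=name)
# plt.xlabel("margin"); plt.ylabel("cumulative distribution"); plt.legend()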

1.4. Experiments and Discussions

1.4.1. Data preparation


The data used for the experiments were collected from four databases: the
CMU PIE database, the Weizmann database, the CSIRO front database
and the CMU profile face database. These contain a total of 41567 faces, of
which 22221 are frontal images, and 19346 are pose images. The training
samples were randomly selected from the whole data set, and the rest to
test the generalization performance of the technique. The training and
testing were carried out based on the procedure for pose detection shown
in Fig. 1.1.
In experiment Group A, we compared the performance of SVM,
AdaBoost and soft margin AdaBoost on a small training sample set (m =
500), in order to save the computation time. Their margin distribution
graphs were generated based on the training performance, which explains
why SVM and SMA perform better than AdaBoost. In experiment Group
B, we did experiments on a much larger training set (m = 20783) for the

SVM- and SMA-based techniques, in order to show the best performance
these two techniques can achieve.
For the SVM, we chose the polynomial kernel with degree d = 3 and
bias b = 0. For AdaBoost and SMA, we used radial basis function (RBF)
networks with adaptive centers as the base classifier, and the number of base
classifiers is T = 800. The effect of using different numbers of significant
Principal Components (PCs), i.e. the signal components along the principal
directions in the EPS, is also observed in the experiments. We will define
the number of PCs as the PC-dimension. We tested the PC-dimensions
between 10 and 80 in steps of 10. All the experiments were repeated five
times, and the results were averaged over the five repeats.
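The evaluation protocol just described (PC-dimensions from 10 to 80 in steps of 10, five repetitions per setting, averaged test error) can be expressed as a simple experiment loop. The sketch below is only an outline of that protocol; make_split, train_classifier and test_error stand for the data split, learning algorithm and error measure being compared, and are not functions defined in this chapter.

import numpy as np

def run_group_a(make_split, train_classifier, test_error,
                pc_dims=range(10, 81, 10), repeats=5):
    """For each PC-dimension, train on `repeats` random train/test splits and
    report the mean and standard deviation of the test error (cf. Table 1.1)."""
    results = {}
    for d in pc_dims:
        errors = []
        for rep in range(repeats):
            X_train, y_train, X_test, y_test = make_split(rep)
            clf = train_classifier(X_train[:, :d], y_train)
            errors.append(test_error(clf, X_test[:, :d], y_test))
        results[d] = (np.mean(errors), np.std(errors))
    return results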

1.4.2. Data preprocessing


As the prior knowledge of the system is the face’s eye location, the
x-y positions of both eyes were hand-labeled. The face images were
normalized for rotation, translation and scale according to eye location. The
subwindow of the face is then cropped using the normalized eyes distance.
Because the new eye location is fixed without knowledge of the face’s pose,
the subwindow covers a different facial region for different poses. For a
large-angle pose face, the cropped subwindow cannot include the whole
face. Figure 1.6 shows such face region extraction for nine poses of the PIE
database with the corresponding angle on top.


Fig. 1.6. Normalized and cropped face region for nine different pose angles. The degree
of the pose is above the image. The subwindows only include eyes and part of nose for
the ±85o pose images.
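The normalization described above can be implemented as a similarity transform computed from the two labelled eye positions. The sketch below (using OpenCV) is one possible realization; the target eye coordinates and the 64×64 subwindow size are illustrative placeholders, since the exact values used in the chapter are not stated.

import cv2
import numpy as np

def eye_similarity(left_eye, right_eye, target_left, target_right):
    """2x3 affine matrix of the rotation + uniform scale + translation that
    maps the labelled eye pair onto fixed target coordinates."""
    v_src = np.asarray(right_eye, dtype=float) - np.asarray(left_eye, dtype=float)
    v_dst = np.asarray(target_right, dtype=float) - np.asarray(target_left, dtype=float)
    scale = np.hypot(*v_dst) / np.hypot(*v_src)
    angle = np.arctan2(v_dst[1], v_dst[0]) - np.arctan2(v_src[1], v_src[0])
    c, s = scale * np.cos(angle), scale * np.sin(angle)
    R = np.array([[c, -s], [s, c]])
    t = np.asarray(target_left, dtype=float) - R @ np.asarray(left_eye, dtype=float)
    return np.hstack([R, t[:, None]])

def normalize_face(image, left_eye, right_eye, out_size=(64, 64)):
    """Rotate, scale and translate so the eyes land on fixed positions
    (placeholder coordinates), then crop the fixed-size subwindow."""
    M = eye_similarity(left_eye, right_eye, (18, 24), (46, 24))
    return cv2.warpAffine(image, M.astype(np.float32), out_size)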

1.4.3. Experiment results of Group A


We compared SVM, original AdaBoost and SMA-based techniques in this
group of experiments. In Table 1.1, the average generalization performance
(with standard deviation) over the range of PC-dimensions is given. Our
experiments show that the AdaBoost results are in all cases at least
2 ∼ 4% worse than both the SVM and SMA results. Figure 1.7 shows

Table 1.1. Performance of pose classifiers on different PC-dimensions.


PC-dim     testErr of SVM     testErr of Ada     testErr of SMA     Diff between Ada and SMA
10 12.02 ± 1.34% 13.72 ± 1.34% 11.63 ± 0.71% 2.09%
20 9.07 ± 0.56% 10.71 ± 0.81% 7.26 ± 0.38% 3.45%
30 6.49 ± 0.21% 8.19 ± 0.63% 6.06 ± 0.47% 2.13%
40 5.72 ± 0.64% 7.87 ± 0.92% 5.05 ± 0.19% 2.82%
50 5.03 ± 0.43% 8.20 ± 0.69% 5.14 ± 0.21% 3.06%
60 4.79 ± 0.39% 7.32 ± 1.04% 4.63 ± 0.61% 2.69%
70 4.83 ± 0.12% 7.98 ± 0.85% 4.49 ± 0.33% 3.52%
80 4.60 ± 0.26% 8.03 ± 0.74% 4.87 ± 0.27% 3.16%

one comparison of AdaBoost and SMA with PC-dimension n = 30. The
training error of the AdaBoost-based technique converges to zero after only
five iterations, but the testing error clearly shows the overfitting. For the
SMA-based technique, because of the regularization term, the training error
is not zero in most of the iterations, but the testing error keeps decreasing.

1.4.4. Margin distribution graphs


We will use the margin distribution graphs to explain why both SVM
and SMA perform better than AdaBoost. Figure 1.8 shows the margin
distribution for SVM, AdaBoost and SMA, indicated by solid, dashed and
dotted lines, respectively. The margins of all AdaBoost examples are positive, which
means all of the training data are classified correctly. For SVM and SMA,
about 5% of the training data are classified incorrectly, while the margins
of more than 90% of the points are bigger than those of AdaBoost.
We know that a large positive margin can be interpreted as a “confident”
correct classification. Hence the generalization performance of SVM and
SMA should be better than AdaBoost, which is observed in the experiment
results (Table 1.1). AdaBoost tends to increase the margins associated with
examples and converge to a margin distribution in which all examples try
to have positive margins. Hence, AdaBoost can improve the generalization
error of a classifier when there is no noise. But when there is noise in
the training set, AdaBoost generates an overfitting classifier by trying to
classify the noisy points with positive margins. In fact, in most cases,
AdaBoost will modify the training sample distribution to force the learner
to concentrate on its errors, and will thereby force the learner to concentrate
on learning noisy examples. AdaBoost is hence called a “hard margin”
algorithm. For noisy data, there is always a trade-off between believing
in the data or mistrusting it, because the data point could be mislabeled.


Fig. 1.7. Training and testing error graphs of original AdaBoost and SMA when training
set size m = 500, and PC-dimension of the feature vector n = 30. The testing error of
AdaBoost overfits to 8.19% while the testing error of SMA converges to 6.06%.


Fig. 1.8. Margin distribution graphs of SVM, AdaBoost and SMA on training set when
training sample size m = 500, PC-dimension is 30, 50, 80 respectively. For SVM and
SMA, although about 1 ∼ 2% of the training data are classified incorrectly, the margins
of more than 90% of the points are bigger than those of AdaBoost.

So the “hard margin” condition needs to be relaxed. For SVM, the original
SVM algorithm [19] had poor generalization performance on noisy data
as well. After the “soft margin” was introduced [31], SVMs tended to
find a smoother classifier and converge to a margin distribution in which
some examples may have negative margins, and have achieved much better
generalization results. The SMA algorithm is similar: it tends to find a
smoother classifier because of the balancing influence of the regularization
term. So, SVM and SMA are called “soft margin” algorithms.


Fig. 1.9. The testing error of SVM and SMA versus the PC-dimension of the feature
vector. As the PC-dimension increases, the testing errors decrease. The testing error is
as low as 1.80% (SVM) and 1.72% (SMA) when the PC-dimension is 80.

1.4.5. Experiment results of Group B


From the experiment results in Section 1.4.3, we observed that both
SVM- and SMA-based approaches perform better than AdaBoost, and we
analyzed the reason in Section 1.4.4. In order to show the best performance
of SVM and SMA techniques, we trained the SMA on more training samples
(m = 20783), and achieved competitive results on the pose estimation problem.
Figure 1.9 shows the testing errors of the SVM- and SMA-based techniques

versus the PC-dimension of the feature vector. The testing error is related
to the PC-dimension of the feature vector. The performance in pose
detection is better for a higher-dimensional PCA representation. In particular,
the testing error is as low as 1.80% and 1.72% when the PC-dimension is 80.
On the other hand, a low dimensional PCA representation can already
provide satisfactory performance, for instance, the testing errors are 2.27%
and 3.02% when the PC-dimension is only 30.

Fig. 1.10. The number of support vectors versus the dimension of the feature vectors.
As the dimension increases, the number of support vectors increases as well.

Fig. 1.11. Test images correctly classified as profile. Notice these images include
different facial appearance, expression, significant shadows or sun-glasses.

It is observed that the number of support vectors increases with the
PC-dimension of the feature vectors for the SVM-based technique (Fig. 1.10).

The proportion of support vectors to the size of the whole training set is less
than 5.5%. Figure 1.11 shows some examples of the correctly classified pose
images, which include different scale, lighting or illumination conditions.

1.5. Conclusion

The main strength of the present method is the ability to estimate the
pose of the face efficiently by using the maximal margin algorithms. The
experimental results and the margin distribution graphs show that the “soft
margin” algorithms allow higher training errors to avoid the overfitting
problem of “hard margin” algorithms, and achieve better generalization
performance. The experimental results show that SVM- and SMA-based
techniques are very effective for the pose estimation problem. The testing
error on more than 20000 testing images was as low as 1.73%, where the
images cover different facial features such as beards, glasses and a great
deal of variability including shape, color, lighting and illumination.
In addition, because the only prior knowledge of the system is the
eye locations, the performance is extremely good, even when some facial
features such as the nose or the mouth become partially or wholly
occluded. For our current interest in improving the performance of our
face recognition system (SQIS), as the eye location is already automatically
determined, this new pose detection method can be directly incorporated
into the SQIS system to improve its performance.

References

[1] A. Kuchinsky, C. Pering, M. Creech, D. Freeze, B. Serra, and J. Gwizdka,
Fotofile: A consumer multimedia organization and retrieval system, CHI
99 Conference Proceedings. pp. 496–503, (1999).
[2] R. Qiao, J. Lobb, J. Li, and G. Poulton, Trials of the CSIRO face recognition
system in a video surveillance environment, Proceedings of the Sixth Digital
Image Computing Techniques and Application. pp. 246–251, (2002).
[3] S. Baker, S. Nayar, and H. Murase, Parametric feature detection,
International Journal of Computer Vision. 27(1), 27–50, (1998).
[4] S. Gong, S. McKenna, and J. Collins, An investigation into face
pose distributions, IEEE International Conference on Face and Gesture
Recognition. pp. 265–270, (1996).
[5] J. Ng and S. Gong, Performing multi-view face detection and pose
estimation using a composite support vector machine across the view sphere,
Proceedings of the IEEE International Workshop on Recognition, Analysis,
and Tracking of Faces and Gestures in Real-Time Systems. pp. 14–21, (1999).

[6] Y. Li, S. Gong, and H. Liddell, Support vector regression and classification
based multi-view face detection and recognition, Proceedings of IEEE
International Conference On Automatic Face and Gesture Recognition. pp.
300–305, (2000).
[7] Y. Guo, R.-Y. Qiao, J. Li, and M. Hedley, A new algorithm for face pose
classification, Image & Vision Computing New Zealand IVCNZ. (2002).
[8] Y. Freund and R. E. Schapire, A decision-theoretic generalization of on-line
learning and an application to boosting, Journal of Computer and System
Sciences. 55(1), 119–139, (1997).
[9] P. Viola and M. Jones, Rapid object detection using a boosted cascade
of simple features, IEEE Conference on Computer Vision and Pattern
Recognition. (2001).
[10] G. Rätsch, T. Onoda, and K.-R. Müller, Soft margins for AdaBoost, Machine
Learning. 42(3), 287–320, (2001).
[11] G. Rätsch, B. Schölkopf, S. Mika, and K.-R. Müller. SVM and boosting:
One class. Technical report, NeuroCOLT, (2000).
[12] T. Sim, S. Baker, and M. Bsat, The CMU Pose, Illumination, and Expression
(PIE) database, Proceedings of the IEEE International Conference on
Automatic Face and Gesture Recognition. (2002).
[13] Y. Moses, S. Ullman, and S. Edelman, Generalization to novel images in
upright and inverted faces, Perception. 25, 443–462, (1996).
[14] H. Schneiderman and T. Kanade, A statistical method for 3D object
detection applied to faces and cars, Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition. 1, 746–751, (2000).
[15] H. Murase and S. K. Nayar, Visual learning and recognition of 3D objects
from appearance, International Journal of Computer Vision. 14(1), 5–24,
(1995).
[16] M. Anthony and P. Bartlett, A Theory of Learning in Artificial Neural
Networks. (Cambridge University Press, 1999).
[17] R. Schapire, Y. Freund, P. L. Bartlett, and W. S. Lee, Boosting the
margin: A new explanation for the effectiveness of voting methods, Annals
of Statistics. 26(5), 1651–1686, (1998).
[18] V. Vapnik, The Nature of Statistical Learning Theory. (Springer, NY, 1995).
[19] B. E. Boser, I. M. Guyon, and V. N. Vapnik, A training algorithm for
optimal margin classifiers, Proceedings of the 5th Annual ACM Workshop
on Computational Learning Theory. pp. 144–152, (1992).
[20] J. R. Quinlan, C4.5: Programs for Machine Learning. (Morgan Kaufmann,
San Mateo, CA, 1993).
[21] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and
Regression Trees. (Wadsworth International Group, Belmont, CA, 1984).
[22] Y. Bengio, Y. LeCun, and D. Henderson. Globally trained handwritten word
recognizer using spatial representation, convolutional neural networks and
hidden Markov models. In eds. J. Cowan, G. Tesauro, and J. Alspector,
Advances in Neural Information Processing Systems, vol. 5, pp. 937–944.
(Morgan Kaufmann, San Mateo, CA, 1994).

[23] L. Breiman, Prediction games and arcing algorithms, Neural Computation.
11(7), 1493–1518, (1999).
[24] J. Friedman, T. Hastie, and R. Tibshirani, Additive logistic regression: a
statistical view of boosting, Annals of Statistics. 28(2), 400–407, (2000).
[25] L. Mason, J. Baxter, P. Bartlett, and M. Frean, Functional Gradient
Techniques for Combining Hypotheses. (MIT Press, Cambridge, MA, 2000).
[26] A. Grove and D. Schuurmans, Boosting in the limit: Maximizing the margin
of learned ensembles, Proceedings of the Fifteenth National Conference on
Artificial Intelligence. pp. 692–699, (1998).
[27] A. N. Tikhonov, Solution of incorrectly formulated problems and the
regularization method, Soviet Mathematics. Doklady. 4, 1035–1038, (1963).
[28] A. N. Tikhonov and V. Y. Arsenin, Solution of Ill–Posed Problems.
(Winston, Washington, DC, 1977).
[29] F. Girosi, M. Jones, and T. Poggio, Regularization theory and neural
networks architectures, Neural Computation. 7(2), 219–269, (1995).
[30] Y. Guo, P. L. Bartlett, J. Shawe-Taylor, and R. C. Williamson, Covering
numbers for support vector machines, IEEE Transactions on Information
Theory. 48, 239–250, (2002).
[31] C. Cortes and V. Vapnik, Support vector networks, Machine Learning. 20,
273–297, (1995).

Chapter 2

Polynomial Modeling in a Dynamic Environment based on
a Particle Swarm Optimization

Kit Yan Chan and Tharam S. Dillon


Digital Ecosystems and Business Intelligence Institute,
Curtin University of Technology, Perth, Australia
[email protected]

In this chapter, a particle swarm optimization (PSO) is proposed for
polynomial modeling in a dynamic environment. The basic operations
of the proposed PSO are identical to the ones of the original PSO
except that elements of particles represent arithmetic operations and
polynomial variables of polynomial models. The performance of the
proposed PSO is evaluated by polynomial modeling based on a set of
dynamic benchmark functions in which their optima are dynamically
moved. Results show that the proposed PSO can find significantly better
polynomial models than genetic programming (GP) which is a commonly
used method for polynomial modeling.

Contents

2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.2 PSO for Polynomial Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.3 PSO vs. GP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

2.1. Introduction

Particle swarm optimization is inspired by the social behaviors of animals
like fish schooling and bird flocking [1]. The swarm consists of a population
of particles which aim to look for the optimal solutions, and the movement
of each particle is based on both its best previous position recorded so far
from the previous generations and the position of the best particle among
all the particles. Diversity of the particles can be kept along the search
by selecting suitable PSO parameters which provide a balance between the

global exploration based on the position of the best particle among all the
particles in the swarm, and local exploration based on each particle's best
previous position. Each particle can converge gradually toward the position
of the best particle among all the particles in the swarm and its own best
previous position recorded so far. Kennedy and
Eberhart [2] demonstrated that PSO can solve hard optimization problems
with satisfactory results.
However, there is no research involving PSO on polynomial modeling in
a dynamic environment, to the best of the authors' knowledge. This can be
attributed to the fact that the polynomials consist of arithmetic operations
and polynomial variables, which are difficult to represent by the commonly
used continuous version of PSO (i.e. the original PSO [1]). In fact, recent
literature shows that PSO has been applied to solve various combinatorial
problems with satisfactory results [3–7] and the particles of the PSO-based
algorithm can represent arithmetic operations and polynomial variables
in polynomial models. These reasons motivated the authors to apply a
PSO-based algorithm to the generation of polynomial models.
In this chapter, a PSO has been proposed for polynomial modeling in
a dynamic environment. The basic operations of the proposed PSO are
identical to those of the original PSO [1] except that elements of particles of
the PSO are represented by arithmetic operations and polynomial variables
in polynomial models. To evaluate the performance of the PSO polynomial
modeling in a dynamic environment, we compared the PSO with the GP,
which is a commonly used method on polynomial modeling [8–11]. Based
on the benchmark functions in which their optima are dynamically moved,
polynomial modeling in dynamic environments can be evaluated. This
evaluation indicates that the proposed PSO significantly outperforms the
GP in polynomial modeling in a dynamic environment.

2.2. PSO for Polynomial Modeling

The polynomial model of an output response relating to polynomial


variables can be described as follows:
$$ y = f(x_1, x_2, \ldots, x_m) \qquad (2.1) $$

where y is the output response, x_j, j = 1, 2, ..., m, is the j-th polynomial
variable and f is a functional relationship, which represents the polynomial
model. The polynomial model f can be found from a set of experimental data
D = {(x^D(i), y^D(i))}, i = 1, 2, ..., N_D, where x^D(i) = (x^D_1(i), x^D_2(i), ..., x^D_m(i)) ∈ R^m
is the i-th experimental input and y^D(i) ∈ R is the corresponding value of the
i-th response output. With D, f can be generated as a high-order,
high-dimensional Kolmogorov–Gabor polynomial:

$$ y = a_0 + \sum_{i_1=1}^{m} a_{i_1} x_{i_1}
        + \sum_{i_1=1}^{m}\sum_{i_2=1}^{m} a_{i_1 i_2} x_{i_1} x_{i_2}
        + \sum_{i_1=1}^{m}\sum_{i_2=1}^{m}\sum_{i_3=1}^{m} a_{i_1 i_2 i_3} x_{i_1} x_{i_2} x_{i_3}
        + \ldots + a_{12\ldots m} \prod_{i_m=1}^{m} x_{i_m} \qquad (2.2) $$

which is a universal format for polynomial modeling if the number of terms


in (2.2) is large enough [12]. The polynomial model varies if the polynomial
coefficients vary dynamically. In this chapter, a PSO is proposed to
generate polynomial models while experimental data is available. The PSO
uses a number of particles, which constitute a swarm, and each particle
represents a polynomial model. Each particle traverses the search space
looking for the optimal polynomial model.
Each particle in the PSO represents the polynomial variables (x_1, x_2, ...,
x_m) and the arithmetic operations ('+', '−' and '∗') in the polynomial
model defined in (2.2). The i-th particle at generation t is defined as
P^t_i = (P^t_{i,1}, P^t_{i,2}, ..., P^t_{i,N_p}), where N_p > m; i = 1, 2, ..., N_pop; N_pop is the
number of particles in the swarm; P^t_{i,k} is the k-th element of the i-th
particle at the t-th generation and lies in the range between zero and one,
i.e. P^t_{i,k} ∈ [0, 1]. The elements in odd positions (i.e. P^t_{i,1}, P^t_{i,3}, P^t_{i,5}, ...)
represent the polynomial variables, and the elements in even positions
(i.e. P^t_{i,2}, P^t_{i,4}, P^t_{i,6}, ...) represent the arithmetic operations. For odd k, if
0 < P^t_{i,k} ≤ 1/(m+1), no polynomial variable is represented by the element
P^t_{i,k}; if l/(m+1) < P^t_{i,k} ≤ (l+1)/(m+1) with l > 0, P^t_{i,k} represents the
l-th polynomial variable, x_l.
In the polynomial model, '+', '−' and '∗' are the only three arithmetic
operations. For even k, if 0 < P^t_{i,k} ≤ 1/3, 1/3 < P^t_{i,k} ≤ 2/3 or
2/3 < P^t_{i,k} ≤ 1, the element P^t_{i,k} represents the arithmetic operation
'+', '−' or '∗' respectively. For example, the following particle with seven
elements is used to represent a polynomial model with four polynomial
variables (i.e. x_1, x_2, x_3 and x_4):

    Element:  P^t_{i,1}  P^t_{i,2}  P^t_{i,3}  P^t_{i,4}  P^t_{i,5}  P^t_{i,6}  P^t_{i,7}
    Value:       0.2        0.4        0.9        0.9        0.4        0.1        0.1
0.2 0.4 0.9 0.9 0.4 0.1 0.1

The elements in the particle fall within the following ranges (with m = 4):

    P^t_{i,1}:  0   < 0.2 ≤ 1/5
    P^t_{i,2}:  1/3 < 0.4 ≤ 2/3
    P^t_{i,3}:  4/5 < 0.9 ≤ 5/5
    P^t_{i,4}:  2/3 < 0.9 ≤ 3/3
    P^t_{i,5}:  2/5 < 0.4 ≤ 3/5
    P^t_{i,6}:  0   < 0.1 ≤ 1/3
    P^t_{i,7}:  0   < 0.1 ≤ 1/5

Therefore this particle represents the following polynomial model:

    P^t_{i,1}  P^t_{i,2}  P^t_{i,3}  P^t_{i,4}  P^t_{i,5}  P^t_{i,6}  P^t_{i,7}
       0          −          x_4        ∗          x_2        +          0

which is equivalent to f_i(x) = 0 − x_4 ∗ x_2 + 0, or f_i(x) = −x_4 ∗ x_2.
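For illustration only, this decoding can be sketched in a few lines of Python; the function and variable names below are assumptions made for this sketch and are not part of the original work.

    import math

    # A minimal sketch (assumed names) of the particle-to-polynomial decoding
    # described above: odd elements select variables, even elements select
    # one of the three arithmetic operations.
    def decode_particle(particle, m):
        ops = ['+', '-', '*']
        tokens = []
        for k, p in enumerate(particle, start=1):
            if k % 2 == 1:
                # odd position: p in (l/(m+1), (l+1)/(m+1)] selects x_l; l = 0 means "no variable"
                l = max(0, math.ceil(p * (m + 1)) - 1)
                tokens.append('0' if l == 0 else 'x%d' % l)
            else:
                # even position: thirds of (0, 1] select '+', '-' or '*'
                idx = min(max(math.ceil(p * 3) - 1, 0), 2)
                tokens.append(ops[idx])
        return ' '.join(tokens)

    print(decode_particle([0.2, 0.4, 0.9, 0.9, 0.4, 0.1, 0.1], m=4))

Values that fall exactly on an interval boundary (such as 0.2 and 0.4 above) depend on how the boundary is resolved; the chapter's worked example treats 0.2 as "no variable" and 0.4 as x_2.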
The polynomial coefficients a0 and a1 are determined after the structure
of the polynomial is generated, where the number of coefficients of the
polynomial is two. The completed function can be represented as follows:
f_i(x) = a_0 − a_1 ∗ x_4 ∗ x_2. In this research, the polynomial coefficients are
determined by an orthogonal least squares algorithm [13, 14], which has
been demonstrated to be effective in determining polynomial coefficients in
models generated by the GP [15]. Details of the orthogonal least squares
algorithm can be found in [13, 14]. Each particle is evaluated based on the
mean absolute error (MAE), which can reflect the differences between the
predicted values of the polynomial model and the actual values of the data
sets. The mean absolute error of the i-th particle at the t-th generation
MAE^t_i can be calculated based on (2.3):

$$ MAE_i^t = 100\% \times \frac{1}{N_D} \sum_{j=1}^{N_D} \left| \frac{y^D(j) - f_i^t\big(x^D(j)\big)}{y^D(j)} \right| \qquad (2.3) $$

where f^t_i is the polynomial model represented by the i-th particle P^t_i at
the t-th generation, (x^D(j), y^D(j)) is the j-th training data set, and N_D is
the number of training data sets used for developing the polynomial model.
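For illustration, (2.3) translates directly into a short Python routine; `poly` here stands for the decoded polynomial of a particle, and the names are assumptions of this sketch rather than the authors' code.

    # Sketch of the MAE fitness in (2.3): mean absolute relative error (in %)
    # between the data outputs and the particle's polynomial predictions.
    # Assumes y != 0 for all training outputs, as implied by (2.3).
    def mae_fitness(poly, x_data, y_data):
        n_d = len(y_data)
        total = sum(abs((y - poly(x)) / y) for x, y in zip(x_data, y_data))
        return 100.0 * total / n_d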
The velocity v^t_{i,k} (corresponding to the flight velocity in a search space) and
the k-th element of the i-th particle at the t-th generation, P^t_{i,k}, are calculated
by the following formulae:

$$ v_{i,k}^t = K\big(v_{i,k}^{t-1} + \phi_1 \times rand() \times (pbest_{i,k} - P_{i,k}^{t-1}) + \phi_2 \times rand() \times (gbest_k - P_{i,k}^{t-1})\big) \qquad (2.4) $$

$$ P_{i,k}^t = P_{i,k}^{t-1} + v_{i,k}^t \qquad (2.5) $$

where
    pbest_i = [pbest_{i,1}, pbest_{i,2}, ..., pbest_{i,N_p}],
    gbest = [gbest_1, gbest_2, ..., gbest_{N_p}],
    k = 1, 2, ..., N_p,

where the best previous position of a particle, recorded so far from the
previous generations, is represented as pbest_i; the position of the best particle
among all the particles is represented as gbest; rand() returns a uniform
random number in the range of [0,1]; ϕ_1 and ϕ_2 are acceleration constants;
and K is a constriction factor, derived from a stability analysis of (2.4) and
(2.5), which ensures that the system converges but not prematurely [16].
Mathematically, K is a function of ϕ_1 and ϕ_2 as reflected in the following
equation:

$$ K = \frac{2}{\left| 2 - \phi - \sqrt{\phi^2 - 4\phi} \right|} \qquad (2.6) $$

where ϕ = ϕ_1 + ϕ_2 and ϕ > 4.


The PSO utilizes pbesti and gbest to modify the current search point to
avoid the particles moving in the same direction, but to converge gradually
toward pbesti and gbest. In (2.4), the particle velocity is limited by a
maximum value vmax . The parameter vmax determines the resolution with
which regions are to be searched between the present position and the target
position. This limit enhances the local exploration of the problem space
and it realistically simulates the incremental changes of human learning.
If vmax is too high, particles might fly past good solutions. If vmax is too
small, particles may not explore sufficiently beyond local solutions. vmax
was often set at 10% to 20% of the dynamic range of the element on each
dimension.
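As a hedged sketch of the update rules (2.4)–(2.5) with the velocity limit described above, one particle update could be coded as follows; the value K ≈ 0.73 corresponds to ϕ_1 = ϕ_2 = 2.05 in (2.6), and all names are illustrative assumptions rather than the authors' implementation.

    import random

    def pso_step(p, v, pbest, gbest, K=0.7298, phi1=2.05, phi2=2.05, vmax=0.2):
        """One update of a single particle per (2.4)-(2.5), with velocity clamping."""
        new_p, new_v = [], []
        for k in range(len(p)):
            vk = K * (v[k]
                      + phi1 * random.random() * (pbest[k] - p[k])
                      + phi2 * random.random() * (gbest[k] - p[k]))
            vk = max(-vmax, min(vmax, vk))        # limit the flight velocity by vmax
            xk = min(1.0, max(0.0, p[k] + vk))    # keep elements inside [0, 1]
            new_p.append(xk)
            new_v.append(vk)
        return new_p, new_v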

2.3. PSO vs. GP

A set of benchmark functions was employed to evaluate the effectiveness of


the PSO on polynomial modeling in a dynamic environment. Comparison
between the PSO and the genetic programming (GP), which is a commonly
used method on polynomial modeling, was carried out. The GP parameters,
which have been shown to be able to find good solutions for both static
and dynamic system modeling [17], were also used: population size =
50; crossover rate = 0.5; mutation rate = 0.5; pre-defined number
of generations = 1000; maximum depth of tree = 50; roulette wheel
selection, point-mutation and one-point crossover (two parents) were used. The PSO
parameters, which can be referred to [18] and are shown in Table 2.1, were
used in this research.

Table 2.1. The PSO parameters implemented.


Number of particles in the swarm 500
Number of elements in the particle 30
Acceleration constants ϕ1 and ϕ2 Both 2.05
Maximum velocity vmax 0.2
Pre-defined number of generations 1000

The GP implementation used in this research was based on a public GP
toolbox developed in Matlab [17], which aims at modeling both dynamic and
static systems. The benchmark functions shown in Table 2.2
have been employed to evaluate the performance of the GP and the PSO in
polynomial modeling. The dynamic environment in polynomial modeling
was simulated by moving the optimum of the benchmark function in a
period of generations. All benchmark functions are in the ranges of the
initialization areas [Xmin , Xmax ]n as shown in Table 2.2. The dimension
of each test function is n = 4. The Sphere and Rosenbrock functions
are unimodal (a single local and global optimum), and the Rastrigin and
Griewank functions are multimodal (several local optima). To simulate
the dynamic environment, the optimum position x+ of each dynamic
benchmark function in Table 2.2 is moved by adding or subtracting the
random values in all dimensions by a severity parameter s, at every
change of the environment. The choice of whether to add or subtract the
severity parameter s on the optimum x+ was done randomly with an equal
probability. The severity parameter s is defined by:
$$ s = \begin{cases} +\Delta d\,(X_{max} - X_{min}), & \text{if } rand() \ge 0.5 \\ -\Delta d\,(X_{max} - X_{min}), & \text{if } rand() < 0.5 \end{cases} \qquad (2.7) $$

where ∆d = 1% or 10%.
Hence the optimal value Vmin of each dynamic function can be
represented by:

$$ V_{min} = F(x^+ + s) = \min F(x + s) \qquad (2.8) $$

For each test run a different random seed was used. The severity was
chosen relative to the extension (in one dimension) of the initialization
area of each dynamic benchmark function. The different severities
were chosen as 1% and 10% of the range of each dynamic benchmark
function. The benchmark functions were periodically changed every 5000
computational evaluations.

Table 2.2. Benchmark functions and initialization areas.

Based on the benchmark function, 100 data sets


were generated randomly in every generation. Each data set is of the form
(x^D_1(i), x^D_2(i), x^D_3(i), x^D_4(i), y^D(i)), where x^D_1(i), x^D_2(i), x^D_3(i) and x^D_4(i) were
generated randomly in the range of [X_min, X_max]^4, and y^D(i) was calculated
based on the benchmark function with i = 1, 2, ..., 100. Based on the 100
data sets generated in each generation, the polynomial models in (2.2) can
be obtained by the GP and the PSO. Since the optima of the benchmark
functions moved in a period of generations, the dynamic environment can
be simulated and the performance of both the GP and the PSO in modeling
dynamic systems can be evaluated.
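The moving-optimum data stream described above can be sketched as follows for the Sphere function; the initialization range used here is only a placeholder (the values of Table 2.2 are not reproduced in this excerpt), and all names are assumptions for illustration.

    import random

    def make_dynamic_sphere(x_min=-50.0, x_max=50.0, n=4, delta_d=0.10):
        """Sphere benchmark whose optimum is shifted by +/- s at each environment change."""
        optimum = [0.0] * n
        def shift():                              # call every 5000 computational evaluations
            s = delta_d * (x_max - x_min)         # severity, as in (2.7)
            for d in range(n):
                optimum[d] += s if random.random() >= 0.5 else -s
        def sample(n_sets=100):                   # 100 data sets per generation
            xs = [[random.uniform(x_min, x_max) for _ in range(n)] for _ in range(n_sets)]
            return [(x, sum((xi - oi) ** 2 for xi, oi in zip(x, optimum))) for x in xs]
        return shift, sample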
Thirty runs have been performed on both the GP and the PSO on
polynomial modeling based on the benchmark functions with various
severity parameters ∆d = 1% or 10%. In each run, the smallest MAEs
obtained by the best chromosome of the GP and the best particle of the
PSO were recorded.
For the Sphere function, whose optimum varies by ∆d = 10% every 5000
computational evaluations, it can be
observed from Fig. 2.1a that the PSO kept progressing in every 5000
computational evaluations while the optimum moved. However, the GP
may not achieve smaller MAEs in every 5000 computational evaluations.
Finally, the PSO obtained smaller MAEs than the ones obtained by the
GP. For the Rosenbrock function with ∆d = 10%, it can be observed in
Fig. 2.1b that the solution qualities of the PSO can be maintained in every
5000 computational evaluations while the optimum moved. However, the

solution qualities of the GP are more unstable than the one obtained by
the PSO in every 5000 computational evaluations. Finally, the PSO can
reach smaller MAEs than the ones obtained by the GP. For the Rastrigin
function with ∆d = 10%, Fig. 2.2a shows that the PSO kept progressing
in every 5000 computational evaluations even when the optimum moved,
while the GP may obtain larger MAEs after the optimum moves at every
5000 computational evaluations. Finally, the PSO can obtain a smaller
MAE than the one obtained by the GP. For the Griewank function,
similar characteristics can be found in Fig. 2.2b. Similar results can be
found on solving the benchmark functions with ∆d = 10%. Therefore,
it can be concluded that the PSO is more effective to adapt to smaller
MAEs than the GP while the optimum of the benchmark functions move
periodically in every 5000 computational evaluations. Also, the PSO can
converge to smaller MAEs than the ones obtained by the GP in the
benchmark functions. For the benchmark functions with ∆d = 1%, similar
characteristics can be found.
Solely from the figures, it is difficult to compare the qualities and
stabilities of solutions found by both methods. Tables 2.3 and 2.4
summarize the results for the benchmark functions. They show that the
minimums, the maximums and the means of MAEs found by the PSO
are smaller than those found by the GP in all the benchmark functions
with ∆d = 1% and 10% respectively. Furthermore, the variances of
MAE found by the PSO are much smaller than those found by GP. These
results indicate that the PSO can not only produce smaller MAEs but
also more stable MAEs for polynomial modeling based on the
benchmark functions with the dynamic environments. The t-test is then
used to evaluate significance of performance difference between the PSO
and the GP. Table 2.4 shows that all t-values between the PSO and the GP
for all benchmark functions with ∆d = 1% and 10% are higher than 2.15.
Based on the normal distribution table, a t-value higher than 2.15 indicates
significance at the 98% confidence level. Therefore, the performance of the
PSO is significantly better than that of the GP at the 98% confidence level
in polynomial modeling based on the benchmark functions with ∆d = 1%
and 10%.
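For reference, the significance test reported here amounts to a standard two-sample t-test over the 30 recorded MAEs per method; a minimal SciPy-based sketch (an assumption of this rewrite, not the authors' code) is:

    from scipy import stats

    # mae_pso, mae_gp: lists of the 30 final MAEs (one per run) for each method.
    def compare_methods(mae_pso, mae_gp):
        t, p = stats.ttest_ind(mae_pso, mae_gp, equal_var=False)
        return t, p    # |t| > 2.15 corresponds to the 98% confidence level quoted above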
Since maintaining population diversity in population-based algorithms
like GP or PSO is a key in preventing premature convergence and stagnation
in local optima [19, 20], it is essential to study population diversities of
the two algorithms along the search. Various diversity measures, which
involve calculation of the distance between two individuals, have been widely
studied in both genetic programming [21, 22] and genetic algorithms [23].

Table 2.3. Experimental results obtained by the GP and the PSO for the benchmark
functions with ∆d = 1%.
However, those distance measures apply either to tree-based representations
or to string-based representations, and so cannot be used to measure
population diversity for both algorithms, since the representations of the
two algorithms are not identical: a tree-based representation of individuals
is used in the GP, while a string-based representation of individuals is used
in the PSO. Neither type of distance measure can therefore be applied across
both algorithms, whose types of representation differ.
To investigate population diversities of the algorithms, we measure the
distance between two individuals by counting the number of different terms
of the polynomials represented by the two individuals. If the terms in both
polynomials are all identical, the distance between two polynomials is zero.
The distance between two polynomials is larger with a greater number
of different terms in the two polynomials. However, since the same measure
is used for the polynomials produced by each method, the results can be
compared. For example, f1 and f2 are two polynomials given by:
f1 = x1 + x2 + x1x3 + x4^2 + x1x3x5 and f2 = x1 + x2 + x1x5 + x4 + x1x3x5.

Table 2.4. Experimental results obtained by the GP and the PSO for the benchmark
functions with ∆d = 10%.

Both f1 and f2 contain the three terms x1 , x2 and x1 x3 x5 , and the


terms x1x3 and x4^2 in f1 and the terms x1x5 and x4 in f2 are different.
Therefore the number of terms which are different in f1 and f2 is four, and
the distance between f1 and f2 is defined to be four.
The diversity measure of the population at the g-th generation is defined
by the mean of the distances of individuals which is denoted as:

$$ \sigma_g = \frac{2}{N_p^2} \sum_{i=1}^{N_p} \sum_{j=i+1}^{N_p} d\big(s_g(i), s_g(j)\big) \qquad (2.9) $$

where sg (i) and sg (j) are the i-th and the j-th individuals in the population
at the g-th generation, and d is the distance measure between the two
individuals.
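A small sketch of this term-based distance and the diversity measure (2.9), assuming each polynomial is represented simply as a set of term strings (an illustrative assumption), follows:

    from itertools import combinations

    def term_distance(f1, f2):
        """Number of terms appearing in one polynomial but not the other."""
        return len(set(f1) ^ set(f2))              # symmetric difference of the term sets

    def diversity(population):
        """Mean pairwise distance per (2.9); population is a list of term sets."""
        n_p = len(population)
        total = sum(term_distance(a, b) for a, b in combinations(population, 2))
        return 2.0 * total / (n_p * n_p)

    # Example from the text: the distance between f1 and f2 is four.
    f1 = {'x1', 'x2', 'x1*x3', 'x4^2', 'x1*x3*x5'}
    f2 = {'x1', 'x2', 'x1*x5', 'x4', 'x1*x3*x5'}
    print(term_distance(f1, f2))                   # -> 4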
The diversities of the populations along the generations were recorded
for the two algorithms. Figure 2.3a shows the diversities of the Sphere
functions with ∆d = 10%. It can be found from the figure that the
diversities of the GP are the highest in the early generations. The diversities
of the PSO are smaller than the ones of the GP in the early generations.
However, diversities of the PSO can be kept until the late generations,

(a) Convergence plot for Sphere function with ∆d = 10%

(b) Convergence plot for Rosenbrock function with ∆d = 10%

Fig. 2.1. Convergence plot for Sphere and Rosenbrock functions.

while the ones of the GP saturated to low levels in the mid generations.
Figures 2.3b, 2.4a and 2.4b show the diversities of the two algorithms
for Rosenbrock function, Rastrigin function and Griewank function with
∆d = 10% respectively. The figures show similar characteristics to the one
of the Sphere function in that the diversities of populations of the PSO can
be kept along the generations, while the GP saturated to a low value after
the early generations. For the benchmark functions with ∆d = 1%, similar
characteristics can be found.

(a) Convergence plot for Rastrigin function with ∆d = 10%

(b) Convergence plot for Griewank function with ∆d = 10%

Fig. 2.2. Convergence plot for Rastrigin and Griewank functions.

Population diversity in the two algorithms needs to be maintained along the
generations, so as to explore potential individuals with better scores
throughout the search. If the individuals of an algorithm's population
converge too early, the algorithm may become trapped in a sub-optimum. An
algorithm is more likely to explore the solution space while the diversity of
its individuals is maintained. The results
indicate that diversities of the individuals in the PSO can be maintained
along the search in both early and late generations, while the individuals

(a) Population diversities for Sphere function (∆d = 10%)

(b) Population diversities for Rosenbrock function (∆d = 10%)

Fig. 2.3. Population diversities for Sphere and Rosenbrock functions.

of the GP converged in the early generations. Therefore the results explain
why the PSO can find better solutions than the ones obtained by the GP:
the PSO maintains higher diversity than the GP throughout the search.

2.4. Conclusion

In this chapter, a PSO has been proposed for polynomial modeling which
aims at generating explicit models in the form of Kolmogorov–Gabor

(a) Population diversities for Rastrigin function (∆d = 10%)

(b) Population diversities for Griewank function (∆d = 10%)

Fig. 2.4. Population diversities for Rastrigin and Griewank functions.

polynomials in a dynamic environment. The operations of the PSO are


identical to the ones of the original PSO except that elements of particles of
the PSO are represented by arithmetic operations and polynomial variables,
which are the components of the polynomial models. A set of dynamic
benchmark functions in which their optima are dynamically moved was
employed to evaluate the performance of the PSO. Comparison of the PSO
and the GP, which is commonly used on polynomial modeling, was carried
out by using the data generated based on the dynamic benchmark functions.

It was shown that the PSO performs significantly better than the GP in
polynomial modeling. Enhancing the effectiveness of the proposed PSO in
polynomial modeling in dynamic environments will be considered in future
work. The proposed PSO will be incorporated with techniques that have been
shown to obtain satisfactory results in solving dynamic optimization
problems [24, 25].

References

[1] R. Eberhart and J. Kennedy, A new optimizer using particle swarm theory,
Proceedings of the Sixth IEEE International Symposium on Micro Machine
and Human Science. pp. 39–43, (1995).
[2] J. Kennedy and R. Eberhart, Swarm Intelligence. (Morgan Kaufmann,
2001).
[3] Z. Lian, Z. Gu, and B. Jiao, A similar particle swarm optimization
algorithm for permutation flowshop scheduling to minimize makespan,
Applied Mathematics and Computation. 175, 773–785, (2006).
[4] C. J. Liao, C. T. Tseng, and P. Luran, A discrete version of particle swarm
optimization for flowshop scheduling problems, Computer and Operations
Research. 34, 3099–3111, (2007).
[5] Y. Liu and X. Gu, Skeleton network reconfiguration based on topological
characteristics of scale free networks and discrete particle swarm
optimization, IEEE Transactions on Power Systems. 22(3), 1267–1274,
(2007).
[6] Q. K. Pan, M. F. Tasgetiren, and Y. C. Liang, A discrete particle
swarm optimization algorithm for the no-wait flowshop scheduling problem,
Computer and Operations Research. 35, 2807–2839, (2008).
[7] C. T. Tseng and C. J. Liao, A discrete particle swarm optimization for
lot-streaming flowshop scheduling problem, European Journal of Operational
Research. 191, 360–373, (2008).
[8] H. Iba, Inference of differential equation models by genetic programming,
Information Sciences. 178, 4453–4468, (2008).
[9] J. Koza, Genetic Programming: On the Programming of Computers by
Means of Natural Selection. (MIT Press, Cambridge, MA, 1992).
[10] K. Rodriguez-Vazquez, C. M. Fonseca, and P. J. Fleming, Identifying
the structure of nonlinear dynamic systems using multiobjective genetic
programming, IEEE Transactions on Systems, Man and Cybernetics – Part
A. 34(4), 531–545, (2004).
[11] N. Wagner, Z. Michalewicz, M. Khouja, and R. R. McGregor, Time series
forecasting for dynamic environments: the dyfor genetic program model,
IEEE Transactions on Evolutionary Computation. 4(11), 433–452, (2007).
[12] D. Gabor, W. Wides, and R. Woodcock, A universal nonlinear filter
predictor and simulator which optimizes itself by a learning process,
Proceedings of IEEE. 108-B, 422–438, (1961).

[13] S. Billings, M. Korenberg, and S. Chen, Identification of nonlinear


output-affine systems using an orthogonal least-squares algorithm,
International Journal of Systems Science. 19, 1559–1568, (1988).
[14] S. Chen, S. Billings, and W. Luo, Orthogonal least squares methods and
their application to non-linear system identification, International Journal
of Control. 50, 1873–1896, (1989).
[15] B. McKay, M. J. Willis, and G. W. Barton, Steady-state modeling of
chemical processes using genetic programming, computers and chemical
engineering, Computers and Chemical Engineering. 21(9), 981–996, (1997).
[16] R. C. Eberhart and Y. Shi, Comparison between genetic algorithms and
particle swarm optimization, Lecture Notes in Computer Science. 1447,
611–616, (1998).
[17] J. Madar, J. Abonyi, and F. Szeifert, Genetic programming for the
identification of nonlinear input–output models, Industrial and Engineering
Chemistry Research. 44, 3178–3186, (2005).
[18] M. O'Neill and A. Brabazon, Grammatical swarm: The generation of
programs by social programming, Natural Computing. 5, 443–462, (2006).
[19] A. Ekart and S. Nemeth, A metric for genetic programs and fitness
sharing, Proceedings of the European Conference of Genetic Programming.
pp. 259–270, (2000).
[20] R. I. McKay, Fitness sharing genetic programming, Proceedings of the
Genetic and Evolutionary Computation Conference. pp. 435–442, (2000).
[21] E. K. Burke, S. Gustafson, and G. Kendall, Diversity in genetic
programming: an analysis of measures and correlation with fitness, IEEE
Transactions on Evolutionary Computation. 8(1), 47–62, (2004).
[22] X. H. Nguyen, R. I. McKay, D. Essam, and H. A. Abbass, Toward an
alternative comparison between different genetic programming systems,
Proceedings of the European Conference on Genetic Programming. pp. 67–77,
(2004).
[23] B. Naudts and L. Kallel, A comparison of predictive measures of problem
difficulty in evolutionary algorithms, IEEE Transactions on Evolutionary
Computation. 4(1), 1–15, (2000).
[24] T. Blackwell and J. Branke, Multiswarms, exclusion, and anti-convergence
in dynamic environments, IEEE Transactions on Evolutionary Computation.
10(4), 459–472, (2006).
[25] S. Janson and M. Middendorf, A hierarchical particle swarm optimizer
for noisy and dynamic environments, Genetic Programming and Evolvable
Machines. 7, 329–354, (2006).

Chapter 3

Restoration of Half-toned Color-quantized Images Using


Particle Swarm Optimization with Multi-wavelet Mutation

Frank H.F. Leung, Benny C.W. Yeung and Y.H. Chan


Centre for Signal Processing,
Department of Electronic and Information Engineering,
The Hong Kong Polytechnic University,
Hung Hom, Kowloon, Hong Kong
[email protected]

Restoration of color-quantized images is rarely addressed in the


literature, especially when the images are color-quantized with
half-toning. Many existing restoration algorithms are inadequate to
deal with this problem because they were proposed for restoring noisy
blurred images only. In this chapter, a restoration algorithm based on
Particle Swarm Optimization with multi-wavelet mutation (MWPSO) is
proposed to solve the problem. This algorithm makes good use of the
available color palette and the mechanism of a half-toning process to
derive useful a priori information for the restoration. Simulation results
show that it can improve the quality of a half-toned color-quantized
image remarkably in terms of both signal-to-noise ratio improvement
and convergence rate. The subjective quality of the restored images can
also be improved.

Contents

3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.2 Color Quantization With Half-toning . . . . . . . . . . . . . . . . . . . . . . . 41
3.3 Formulation of Restoration Algorithm . . . . . . . . . . . . . . . . . . . . . . 42
3.3.1 PSO with multi-wavelet mutation (MWPSO) . . . . . . . . . . . . . . 42
3.3.2 The fitness function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.3.3 Restoration with MWPSO . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.3.4 Experimental setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.4 Result and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
∗ The work described in this chapter was substantially supported by a grant from

the Research Grants Council of the Hong Kong Special Administrative Region, China
(Project No. PolyU 5224/08E).


3.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

3.1. Introduction

Color quantization is a process of reducing the number of colors in a digital


image by replacing them with some representative colors selected from a
palette [1]. It is widely used because it can save transmission bandwidth and
data storage requirements in many multimedia applications. When color
quantization is performed, certain degradation of quality will be introduced
owing to the limited number of colors used to produce the output image.
The most common artifact is the false contour. False contours occur when
the available palette colors are not enough to represent a gradually changing
region. Another common artifact is the color shift. In general, the smaller
the color palette size is used, the more severe the defects will be.
Half-toning is a reprographic technique that converts a continuous-tone
image to a lower resolution, and it is mainly for printing [2–6]. Error
diffusion is one of the most popular half-toning methods. It makes use of
the fact that the human visual system is less sensitive to higher frequencies,
and during diffusion, the quantization error of a pixel is diffused to the
neighboring pixels so as to hide the defects and to achieve a more faithful
reproduction of colors.
Restoring a color-quantized image back to the original image is often
necessary. However, most of the recent restoration algorithms mainly
concern the restoration of noisy and blurred color images [7–14]; literature
for restoring half-toned color-quantized images is seldom found. The
restoration can be formulated as an optimization problem. Since error
diffusion is a nonlinear process, conventional gradient-oriented optimization
algorithms might not be suitable to solve the addressed problem.
In this chapter, we make use of a proposed particle swarm optimization
with multi-wavelet mutation (MWPSO) algorithm to restore half-toned
color-quantized images. It is an evolutionary computation algorithm that
works very well on various optimization problems. By taking advantage
of the wavelet theory, which increases the searching space for the PSO at
the early stage of evolution, the chance of converging to local minima is
lowered. The process of color quantization with error diffusion will first be
detailed. Then the application of the proposed MWPSO to the restoration
of the degraded images will be discussed. The results demonstrate that the
MWPSO can achieve a remarkable improvement in terms of convergence
rate and signal-to-noise ratio.

3.2. Color Quantization With Half-toning

A color image generally consists of three color planes, namely, Or , Og and


Ob , which represent the red, green and blue color planes of the image
respectively. Accordingly, the (i, j)-th color pixel of a 24-bit full-color
image of size N × N pixels consists of three color components. The
intensity values of these three components form a 3D vector O⃗(i, j) =
(O(i, j)_r, O(i, j)_g, O(i, j)_b), where O(i, j)_c ∈ [0 1] is the intensity value of
the c color component of the (i, j)-th pixel. Here, we assume that the
maximum and the minimum intensity values of a pixel are one and zero
respectively.

Fig. 3.1. Color quantization with half-toning.

Figure 3.1 shows the system that performs color quantization with error
diffusion. The input image is scanned in a row-by-row fashion from top to
bottom and from left to right. The relationship between the original image
O⃗(i, j) and the encoded image Y⃗(i, j) is described by

$$ U(i,j)_c = O(i,j)_c - \sum_{(k,l)\in\Omega} H(k,l)_c\, E(i-k, j-l)_c \qquad (3.1) $$
$$ \vec{E}(i,j) = \vec{Y}(i,j) - \vec{U}(i,j) \qquad (3.2) $$
$$ \vec{Y}(i,j) = Q_c\big[\vec{U}(i,j)\big] \qquad (3.3) $$

where U⃗(i, j) = (U(i, j)_r, U(i, j)_g, U(i, j)_b) is a state vector of the system,
E⃗(i, j) is the quantization error of the pixel at position (i, j), H(k, l)_c is
a coefficient of the error diffusion filter for the c color component, and Ω is the
corresponding causal support region of the filter.
The operator Q_c[·] performs a 3D vector quantization. Specifically,
the 3D vector U⃗(i, j) is compared with a set of representative color vectors
stored in a previously generated color palette V = {v̂_i : i = 1, 2, ..., N_c}.
The best-matched vector in the palette is selected based on the minimum
Euclidean distance criterion. In other words, a state vector U⃗(i, j) is
represented by the color v̂_k if and only if ∥U⃗(i, j) − v̂_k∥ ≤ ∥U⃗(i, j) − v̂_l∥
for all l = 1, 2, ..., N_c; l ≠ k. Once the best-matched vector is selected
from the color palette, its index is recorded and the quantization error
E⃗(i, j) = v̂_k − U⃗(i, j) is diffused to pixel (i, j)'s neighborhood as described
in (3.1). Note that in order to handle the boundary pixels, E⃗(i, j) is defined
to be zero when (i, j) falls outside the image. Without loss of generality,
in this chapter, we use a typical Floyd–Steinberg error diffusion kernel as
H(k, l)_c to perform the half-toning. The recorded indices of the color palette
will be used later to reconstruct the color-quantized image of the
restored image.
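A minimal NumPy sketch of this half-toned color quantization, assuming a given palette and the Floyd–Steinberg weights, is shown below; the names are illustrative and this is not the chapter's implementation.

    import numpy as np

    def halftone_quantize(img, palette):
        """Palette quantization with Floyd-Steinberg error diffusion per (3.1)-(3.3).
        img: N x N x 3 float array in [0, 1]; palette: Nc x 3 array of representative colors."""
        u = img.astype(np.float64).copy()
        h, w, _ = u.shape
        out_idx = np.zeros((h, w), dtype=np.int32)
        # Floyd-Steinberg weights over the causal neighbourhood Omega
        kernel = [((0, 1), 7 / 16), ((1, -1), 3 / 16), ((1, 0), 5 / 16), ((1, 1), 1 / 16)]
        for i in range(h):
            for j in range(w):
                k = int(np.argmin(((palette - u[i, j]) ** 2).sum(axis=1)))  # nearest palette color
                out_idx[i, j] = k
                err = palette[k] - u[i, j]                                   # E = Y - U, as in (3.2)
                for (di, dj), wgt in kernel:
                    ii, jj = i + di, j + dj
                    if 0 <= ii < h and 0 <= jj < w:                          # errors outside the image are dropped
                        u[ii, jj] -= wgt * err                               # diffusion per (3.1)
        return out_idx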

3.3. Formulation of Restoration Algorithm

3.3.1. PSO with multi-wavelet mutation (MWPSO)


Consider a swarm X(t) at the t-th iteration. Each particle xp (t) ∈ X(t)
contains κ elements xpj (t) ∈ xp (t) at the t-th iteration, where p =
1, 2, . . . , γ and j = 1, 2, . . . , κ; γ denotes the number of particles in the
swarm. First, the particles of the swarm are initialized and then evaluated
based on a defined fitness value. The objective of PSO is to minimize the
fitness value (cost value) of a particle through iterative steps. The swarm
evolves from iteration t to t + 1 by repeating the procedures as shown in
Fig. 3.2.
The velocity v^p_j(t) (corresponding to the flight speed in a search space)
and the coordinate x^p_j(t) of the j-th element of the p-th particle at the t-th
generation can be calculated using the following formulae:

$$ v_j^p(t) = w \cdot v_j^p(t-1) + \varphi_1 \cdot rand_j^p() \cdot \big(pbest_j^p - x_j^p(t-1)\big) + \varphi_2 \cdot rand_j^p() \cdot \big(gbest_j - x_j^p(t-1)\big) \qquad (3.4) $$

$$ x_j^p(t) = x_j^p(t-1) + k \cdot v_j^p(t) \qquad (3.5) $$

Fig. 3.2. Pseudo code for MWPSO.

where

pbestp = [pbestp1 pbestp2 ... pbestpκ ]

and

gbest = [gbest1 gbest2 ... gbestκ ].

The best previous position of a particle is recorded and represented as


pbestp ; the position of the best particle among all the particles is represented
as gbest; w is an inertia weight factor; φ1 and φ2 are acceleration constants;
randpj () returns a random number in the range of [0 1]; k is a constriction
factor derived from the stability analysis of (3.5) to ensure the system can
converge but not prematurely.
PSO utilizes pbestp and gbest to modify the current search point in order
to prevent the particles from moving in the same direction, but to converge

gradually toward pbestp and gbest. A suitable selection of the inertia weight
w provides a balance between the global and local explorations. Generally,
w can be dynamically set with the following equation:
$$ w = w_{max} - \frac{w_{max} - w_{min}}{T} \times t \qquad (3.6) $$
where t is the current iteration number, T is the total number of iteration,
wmax and wmin are the upper and lower limits of the inertia weight, and
are set to 1.2 and 0.1 respectively in this chapter.
In (3.4), the particle velocity is limited by a maximum value vmax . The
parameter vmax determines the resolution of region between the present
position and the target position to be searched. This limit enhances the
local exploration of the problem space, affecting the incremental changes of
learning.
Before generating a new X(t), the mutation operation is performed:
every particle of the swarm will have a chance to mutate governed by a
probability of mutation µm ∈ [0 1], which is defined by the user. For
each particle, a random number between zero and one will be generated;
if µm is larger than the random number, this particle will be selected for
the mutation operation. Another parameter called the element probability
Nm ∈ [0 1] is then defined by the user to control the number of elements
in the particle that will mutate in each iteration step. For instance, if
xp (t) = [xp1 (t), xp2 (t), . . . , xpκ (t)] is the selected p-th particle, the expected
number of elements that undergo mutation is given by
Expected number of mutated elements = Nm × κ (3.7)
We propose a multi-wavelet mutation operation for realizing the
mutation. The exact elements for doing mutation in a particle are randomly
selected. The resulting particle is given by x̄p (t) = [x̄p1 (t), x̄p2 (t), . . . , x̄pκ (t)].
If the j-th element is selected for mutation, the operation is given by
$$ \bar{x}_j^p(t) = \begin{cases} x_j^p(t) + \sigma \times \big(para_{max}^j - x_j^p(t)\big) & \text{if } \sigma > 0 \\ x_j^p(t) + \sigma \times \big(x_j^p(t) - para_{min}^j\big) & \text{if } \sigma \le 0 \end{cases} \qquad (3.8) $$

Using the Morlet wavelet as the mother wavelet, we have


$$ \sigma = \frac{1}{\sqrt{a}}\, e^{-(\varphi/a)^2/2} \cos\!\Big(5\,\frac{\varphi}{a}\Big) \qquad (3.9) $$
The value of the dilation parameter a is set to vary with the value of
t/T in order to realize a fine-tuning effect, where T is the total number of
iteration and t is the current iteration number. In order to perform a local

search when t is large, the value of a should increase as t/T increases so as


to reduce the significance of the mutation. Hence, a monotonic increasing
function governing a and t/T is proposed as follows:

$$ a = e^{-\ln(g) \times \left(1 - \frac{t}{T}\right)^{\zeta_{wm}} + \ln(g)} \qquad (3.10) $$
where ζwm is the shape parameter of the monotonic increasing function, g
is the upper limit of the parameter a.
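As a sketch, the mutation of a single element per (3.8)–(3.10) can be written as below; the range from which the wavelet argument φ is drawn is not stated in this excerpt, so the interval [−2.5a, 2.5a] used here is an assumption, as are the names.

    import math
    import random

    def wavelet_mutate(x, lo, hi, t, T, zeta_wm=0.2, g=10000.0):
        """Multi-wavelet mutation of one element x in [lo, hi] at iteration t of T."""
        a = math.exp(-math.log(g) * (1.0 - t / T) ** zeta_wm + math.log(g))   # dilation parameter, (3.10)
        phi = random.uniform(-2.5 * a, 2.5 * a)                               # assumed sampling range
        sigma = (1.0 / math.sqrt(a)) * math.exp(-(phi / a) ** 2 / 2.0) * \
                math.cos(5.0 * phi / a)                                       # Morlet wavelet, (3.9)
        if sigma > 0:
            return x + sigma * (hi - x)                                       # (3.8): push toward upper bound
        return x + sigma * (x - lo)                                           # (3.8): push toward lower bound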

3.3.2. The fitness function


The proposed MWPSO is applied to restore half-toned color-quantized
images. Let X be the output image of the restoration. Obviously, when the
restored image X is color-quantized with error diffusion, the output should
be close to the original half-toned image Y . Suppose Qch [·] (as shown in
Fig. 3.1) denotes the operator that performs the color quantization with
half-toning, then we should have
Y ≈ Qch [X] (3.11)
Based on the above criterion, the cost function for the restoration
problem can be defined as

$$ fitness = \big| Y - Q_{ch}[X] \big| \qquad (3.12) $$
If f itness = 0, it implies the restored image X and the original image
O provide the same color quantization results. In other words, the cost
function of (3.12) provides a good measure to judge if X is a good estimate
of O.

3.3.3. Restoration with MWPSO


Let X_cur(t − 1) be the group of current estimates of the restored image
(the swarm) at a particular iteration t − 1, X^p_cur(t − 1) be the p-th particle of
the swarm, and E^p_cur be its corresponding fitness value. According to (3.4)
and (3.5), the new estimate of the restored image is given by

$$ X_{new}^p(t) = X_{cur}^p(t-1) + k \cdot v^p(t) \qquad (3.13) $$

where k is a constriction factor. Before evaluating the fitness of each
particle, the multi-wavelet mutation is performed. The mutation operation
realizes a fine tuning over many iterations. After each iteration, E^p_new,
which is the fitness of X^p_new(t), is evaluated. If
E^p_new < E^p_cur, X^p_cur(t) is updated to X^p_new(t). Furthermore, if E^p_new < E_best,

where E_best is the fitness value of the best estimate gbest so far, then gbest
will be updated by X^p_new(t). In summary, we have

$$ X_{cur}^p(t) = \begin{cases} X_{new}^p(t) & \text{if } E_{new}^p < E_{cur}^p \\ X_{cur}^p(t-1) & \text{otherwise} \end{cases} \qquad (3.14) $$

and

$$ gbest = \begin{cases} X_{new}^p(t) & \text{if } E_{new}^p < E_{best} \\ gbest & \text{otherwise} \end{cases} \qquad (3.15) $$
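In code form, this per-particle bookkeeping reduces to a few lines; the sketch below uses assumed names and stands in for (3.13)–(3.15) only.

    # Sketch of the update logic in (3.14)-(3.15) for one particle p.
    # fitness() implements (3.12); x_new is the new estimate from (3.13) after mutation.
    def update_particle(x_cur, e_cur, x_new, gbest, e_best, fitness):
        e_new = fitness(x_new)
        if e_new < e_cur:                 # (3.14): keep the better estimate
            x_cur, e_cur = x_new, e_new
        if e_new < e_best:                # (3.15): update the global best
            gbest, e_best = x_new, e_new
        return x_cur, e_cur, gbest, e_best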

3.3.4. Experimental setup


On realizing the MWPSO restoration, the following simulation conditions
are used:
• Shape parameter of the wavelet mutation (ζwm ): 0.2
• Probability of mutation (µm ): 0.2
• Element probability (Nm ): 0.3
• Acceleration constant φ1 : 2.05
• Acceleration constant φ2 : 2.05
• Constriction factor k: 0.005
• Parameter g: 10000
• Swarm size: 8
• Number of iteration: 500
• Initial population X(0): All the particles are initialized to be the
Gaussian filtered output of the original half-toned image Y
• Initial global best gbest: The original half-toned image Y
• Initial previous best pbestp : The zero matrix

Lena Baboon Fruit

Fig. 3.3. Original images used for testing the proposed restoration algorithm.

Fig. 3.4. Half-toned images with 256-color palette size.



Fig. 3.5. Half-toned images with 128-color palette size.

Fig. 3.6. Half-toned images with 64-color palette size.

Fig. 3.7. Half-toned images with 32-color palette size.



Fig. 3.8. MWPSO restored images with 256-color palette size.

Fig. 3.9. MWPSO restored images with 128-color palette size.

Fig. 3.10. MWPSO restored images with 64-color palette size.



Fig. 3.11. MWPSO restored images with 32-color palette size.

Fig. 3.12. SA restored images with 256-color palette size.

Fig. 3.13. SA restored images with 128-color palette size.



Fig. 3.14. SA restored images with 64-color palette size.

Fig. 3.15. SA restored images with 32-color palette size.



Fig. 3.16. Comparisons between SA and MWPSO for Lena with 256- and 128-color
palette size.

Fig. 3.17. Comparisons between SA and MWPSO for Baboon with 256- and 128-color
palette size.

Fig. 3.18. Comparisons between SA and MWPSO for Fruit with 256- and 128-color
palette size.

Fig. 3.19. Comparisons between SA and MWPSO for Lena with 64- and 32-color palette
size.

Fig. 3.20. Comparisons between SA and MWPSO for Baboon with 64- and 32-color
palette size.

Fig. 3.21. Comparisons between SA and MWPSO for Fruit with 64- and 32-color palette
size.

Table 3.1. Fitness and SNRI of two


algorithms for restoring the half-toned
image Lena.
Lena with 256-color Palette size
MWPSO SA
SNRI 6.7736 6.7123
Best fitness 8945 8957

Lena with 128-color Palette size


MWPSO SA
SNRI 8.1969 8.1944
Best fitness 3849 3851

Lena with 64-color Palette size


MWPSO SA
SNRI 9.4404 9.566
Best fitness 1604 1652

Lena with 32-color Palette size


MWPSO SA
SNRI 11.522 11.4903
Best fitness 649 677

3.4. Result and Analysis

Simulations have been carried out to evaluate the performance of the


proposed algorithm. Three de facto standard 24-bit full-color images of size
256×256 pixels are used, and all the test images are shown in Fig. 3.3. These
test images are color-quantized to produce the corresponding images (Y ).
The color-quantized images with 256-color palette size, 128-color palette
size, 64-color palette size and 32-color palette size are shown in Fig 3.4 to
3.7 respectively.
Color palettes of different sizes are generated using the median-cut
algorithm [2]. On doing color quantization, half-toning is performed
with error diffusion, and the Floyd–Steinberg diffusion filter [3] is used.
The proposed restoration algorithm is used to restore the half-toned
color-quantized images.
For comparison, simulated annealing (SA) is also applied to do the
restoration. The performance of SA and the proposed MWPSO are
reflected in terms of the signal-to-noise ratio improvement (SNRI) achieved
when the algorithms are used to restore the color-quantized images “Lena”,
“Baboon” and “Fruit”. The simulation condition for SA is basically the

Table 3.2. Fitness and SNRI of two


algorithms for restoring the half-toned
image Baboon.
Baboon with 256-color Palette size
MWPSO SA
SNRI 2.6854 2.6706
Best fitness 5410 5445

Baboon with 128-color Palette size


MWPSO SA
SNRI 4.3301 4.3293
Best fitness 2189 2277

Baboon with 64-color Palette size


MWPSO SA
SNRI 5.5836 5.616
Best fitness 1045 1166

Baboon with 32-color Palette size


MWPSO SA
SNRI 5.8208 5.8186
Best fitness 750 760

same as that of the proposed MWPSO. The control parameter of SA,


gamma, is set to be equal to the parameter k of the MWPSO. The number
of iteration is set at 500. The experimental results in terms of SNRI and
the convergence rate are summarized in Tables 3.1 to 3.3 and Figs 3.16 to
3.21 respectively. The SNRI is defined as:

$$ SNRI = 10 \log \frac{\sum_{(i,j)} \big\| \vec{O}(i,j) - \vec{Y}(i,j) \big\|^2}{\sum_{(i,j)} \big\| \vec{O}(i,j) - \vec{X}_{best}(i,j) \big\|^2} \qquad (3.16) $$

where O⃗(i, j), Y⃗(i, j) and X⃗_best(i, j) are the (i, j)-th pixels of the original, the
half-toned color-quantized and the optimally restored images respectively.
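For illustration, (3.16) translates directly into NumPy as follows (a sketch with assumed array names, taking the logarithm to base 10 so that the result is in dB):

    import numpy as np

    def snri(original, halftoned, restored):
        """Signal-to-noise ratio improvement per (3.16); inputs are N x N x 3 arrays."""
        num = np.sum((original - halftoned) ** 2)
        den = np.sum((original - restored) ** 2)
        return 10.0 * np.log10(num / den)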
From Figs 3.16 to 3.21, we can see that the final results achieved by SA
and MWPSO are similar: both of them converge to some similar values. It
means that both algorithms can reach the same optimal point. However,
we can see that the proposed MWPSO exhibits a higher convergence rate.
Also, from Tables 3.1 to 3.3, MWPSO converges to smaller fitness values
in all experiments and offers higher SNRI in most of the experiments.

Table 3.3. Fitness and SNRI of two


algorithms for restoring the half-toned
image Fruit.
Fruit with 256-color Palette size
MWPSO SA
SNRI 7.5598 7.5155
Best fitness 6372 6394

Fruit with 128-color Palette size


MWPSO SA
SNRI 9.9556 9.9186
Best fitness 2400 2417

Fruit with 64-color Palette size


MWPSO SA
SNRI 10.8616 10.8607
Best fitness 982 1012

Fruit with 32-color Palette size


MWPSO SA
SNRI 11.7873 11.7815
Best fitness 478 574

3.5. Conclusion

In this chapter, we used the MWPSO as the restoration algorithm for


half-toned color-quantized images. The half-toned color quantization and
the application of the proposed MWPSO have been described. Simulation
results demonstrate that the proposed algorithm performs better in terms of
convergence rate and SNRI than the conventional SA algorithm. By using
MWPSO, the number of particles used for searching is increased, and more
freedom is given to search the optimal result thanks to the multi-wavelet
mutation. This explains the improved searching ability and the convergence
rate. The chance for the particles being trapped in some local optimal point
is also reduced.

References

[1] M. T. Orchard and C. A. Bouman, Color quantization of images, IEEE


Transactions on Signal Processing. 39(12), 2677–2690, (1991).
[2] P. Heckbert, Color image quantization for frame buffer displays, Computer
Graphics. 16(4), 297–307, (1982).
[3] R. Ulichney, Digital Halftoning. (MIT Press, Cambridge, MA, 1987).

[4] R. S. Gentile, E. Walowit, and J. P. Allebach, Quantization and multi-level


halftoning of color images for near original image quality, Proceedings of the
SPIE 1990. pp. 249–259, (1990).
[5] S. S. Dixit, Quantization of color images for display/printing on limited color
output devices, Computer Graphics. 15(4), 561–567, (1991).
[6] X. Wu, Color quantization by dynamic programming and principal analysis,
ACM Transactions on Graphics. 11(4), 372–384, (1992).
[7] M. Barni, V. Cappellini, and L. Mirri, Multichannel m-filtering for color
image restoration. pp. 529–532, (2000).
[8] G. Angelopoulos and I. Pitas, Multichannel Wiener filters in color image
restoration, IEEE Transactions on CASVT. 4(1), 83–87, (1994).
[9] N. P. Galatsanos, A. K. Katsaggelos, R. T. Chin, and A. D. Hillery, Least
squares restoration of multichannel images, IEEE Transactions on Signal
Processing. 39(10), 2222–2236, (1991).
[10] N. P. Galatsanos and R. T. Chin, Restoration of color images by
multichannel kalman filtering, IEEE Transactions on Signal Processing. 39
(10), 2237–2252, (1991).
[11] B. R. Hunt and O. Kubler, Karhunen–Loeve multispectral image restoration,
part 1: theory, IEEE Transactions on Acoustic Speech Signal Processing. 32
(3), 592–600, (1984).
[12] H. Altunbasak and H. J. Trussell, Colorimetric restoration of digital images,
IEEE Transactions on Image Processing. 10(3), 393–402, (2001).
[13] K. J. Boo and N. K. Bose, Multispectral image restoration with multisensor,
IEEE Transactions on Geosci. Remote Sensing. 35(5), 1160–1170, (1997).
[14] N. P. Galatsanos and R. T. Chin, Digital restoration of multichannel images,
IEEE Transactions on Acoustic Speech Signal Processing. 37, 415–421,
(1989).

PART 2

Fuzzy Logics and their Applications


Chapter 4

Hypoglycemia Detection for Insulin-dependent Diabetes


Mellitus: Evolved Fuzzy Inference System Approach

S.H. Ling, P.P. San and H.T. Nguyen


Centre for Health Technologies,
Faculty of Engineering and Information Technology,
University of Technology, Sydney, NSW, Australia,
[email protected]

Insulin-dependent diabetes mellitus is classified as Type 1 diabetes


and it can be further classified as immune-mediated or idiopathic.
Hypoglycemia is a common and serious side effect of insulin therapy in
patients with Type 1 diabetes. In this chapter, we measure physiological
parameters continuously to provide a non-invasive hypoglycemia
detection for Type 1 diabetes mellitus (T1DM) patients. Based on the
physiological parameters of the electrocardiogram (ECG) signal, such as
heart rate, corrected QT interval, change of heart rate and change of
corrected QT interval, an evolved fuzzy inference model is developed for
classification of hypoglycemia. To optimize the rules and membership
functions of the fuzzy system, a hybrid particle swarm optimization with
wavelet mutation (HPSOWM) is introduced. For the clinical study, 15
children with Type 1 diabetes volunteered for overnight monitoring. All the real
data sets are collected from the Department of Health, Government of
Western Australia and are randomly organized into a training set (10
patients) and testing set (5 patients). The results show that the evolved
fuzzy inference system approach performs well in terms of sensitivity
and specificity.

Contents
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.2 Hypoglycemia Detection System: Evolved Fuzzy Inference System Approach 66
4.2.1 Fuzzy inference system . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.2.2 Particle swarm optimization with wavelet mutation . . . . . . . . . . . 71
4.2.3 Choosing the HPSOWM parameters . . . . . . . . . . . . . . . . . . . 74
4.2.4 Fitness function and training . . . . . . . . . . . . . . . . . . . . . . . 77
4.3 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80


References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81

4.1. Introduction

Hypoglycemia is the medical term for a state produced by a lower than normal
level of blood glucose. It is characterized by a mismatch between the action of
insulin, the ingestion of food and energy expenditure. The most common
forms of hypoglycemia occur as a complication of treatment with insulin
or oral medications. Hypoglycemia is less common in non-diabetic people,
but can occur at any age because of many causes. Among these causes
are excessive insulin produced in the body, inborn errors, medications and
poisons, alcohol, hormone deficiencies, prolonged starvation, alterations of
metabolism associated with infection and organ failure. In [1, 2] it was
discussed that diabetic patients, especially those who have been treated
with insulin, are at risk of developing hypoglycemia while it is less likely
to occur in non-insulin-dependent patients who are taking sugar-lowering
medicine for diabetes. Most surveys in [3] have demonstrated that the
tighter the glycemic control and the younger the patient, the greater the
frequency of both mild and severe hypoglycemia.
The level of blood glucose low enough to define hypoglycemia may be
different for different people, in different circumstances and for different
purposes, and has occasionally been a matter of controversy. Most healthy
adults maintain fasting glucose levels above 70 mg/dL, (3.9 mmol/L) and
develop symptoms of hypoglycemia when the glucose level falls below
55 mg/dL (3.0 mmol/L). In [4], it has been reported that severe
hypoglycemic episodes are defined as those in which the documented blood
glucose level is 50 mg/dL (2.8 mmol/L), and the patients are advised to
take the necessary treatment. Hypoglycemia is treated by restoring the
blood glucose level to normal by the ingestion or administration of
carbohydrate foods. In some cases, it is treated by injection or infusion of
glucagon.
The symptoms of hypoglycemia occur when not enough glucose is
supplied to the brain, in fact the brain and nervous system need a
certain level of glucose to function. The two typical symptoms of
hypoglycemia arise from the activation of the autonomous central nervous
systems (autonomic symptoms) and reduced cerebral glucose consumption
(neuroglycopenic symptoms). Autonomic symptoms such as headache,
extreme hunger, blurry or double vision, fatigue, weakness and sweating are
activated before neuroglycopenic symptoms follow. As discussed in [5, 6]

the initial presence of hypoglycemia can only be identified
when autonomic symptoms occur, which allows the patient to recognize and
correct ensuing episodes. The neuroglycopenic symptoms such as confusion,
seizures and loss of consciousness (coma) arise due to insufficient glucose
flow to the brain [7].
In cases of hypoglycemia presented in [8, 9], the symptoms occurred
without the patients being aware of them, at any time while driving,
or during sleeping. Nocturnal hypoglycemia is particularly dangerous
because it reduces sleep and may obscure autonomic counter-regulatory
responses, so that an initially mild episode may become severe. The
risk of hypoglycemia is high during night time, since 50% of all severe
episodes occur at that time [10]. Deficient glucose counter-regulation
may also lead to severe hypoglycemia even with modest insulin elevations.
Regulation of nocturnal hypoglycemia is further complicated by the dawn
phenomenon. This is a consequence of nocturnal changes in insulin
sensitivity secondary to growth hormone secretion: a decrease in insulin
requirements approximately between midnight and 5 am followed by an
increase in requirements between 5 am and 8 am. Thus, hypoglycemia is
one of the complications of diabetes most feared by patients.
Current technologies used in the diabetes diagnostic testing and
self-monitoring market have already been improved to the extent that any
additional improvements would be minimal. For example, glucose meter
manufacturers have modified their instruments to use as little as 2 µl
of blood and produce results within a minute. Technology advancement
in this market is expected to occur using novel design concepts. An
example of such technology is the non-invasive glucose meter. There
is a limited number of non-invasive blood glucose monitoring systems
currently available in the market but each has specific drawbacks in terms of
functioning, cost, reliability and obtrusiveness. Intensive research has been
devoted to the development of hypoglycemia alarms, exploiting principles
that range from detecting changes in the electroencephalogram (EEG) or
skin conductance (due to sweating) to measurements of subcutaneous tissue
glucose concentrations by glucose sensors. However, none of these has proved sufficiently reliable or unobtrusive.
In this chapter, a significant contribution is made by applying a fuzzy reasoning system to the modeling and design of a non-invasive hypoglycemia monitor based on physiological responses. During hypoglycemia, the most profound
physiological changes are caused by activation of the sympathetic nervous
system. Among them, the strongest responses are sweating and increased
cardiac output. As discussed in [8, 11, 12] sweating is mediated through
sympathetic cholinergic fibres, while the change in cardiac output is due to
an increase in heart rate and stroke volume. In [13, 14], the possibility of hypoglycemia-induced arrhythmias is raised, and experimental hypoglycemia has been shown to prolong QT intervals and QT dispersion in both non-diabetic subjects and those with Type 1 and Type 2 diabetes.
The classification models for numerous medical diagnosis such as
diabetic nephropathy [15], acute gastrointestinal bleeding [16] and
pancreatic cancer [17] have been developed by the use of the statistical regression methods in [18]. However, statistical regression models are accurate only over the range of the patients' data from which they are developed. In addition, they can only be applied if the patients' data follow the distribution assumed by the developed regression model and a correlation between the dependent and independent variables exists. If the patients' data are irregular, the developed regression models have an unnaturally wide prediction range.
Another commonly used method to generate classification models for
heart disease [19] and Parkinson’s disease [20] is genetic programming as
discussed in [21]. To generate classification models with nonlinear terms
in polynomial forms, the genetic operation is used, while the least square
algorithm is used to determine the contribution of each nonlinear term of the classification model. Owing to the fuzziness of measurements, it is unavoidable that the patients' data involve uncertainty. Since the genetic
programming with least square algorithm does not consider the fuzziness
of uncertainty during measurement, it cannot give the best classification
model for diagnosis purposes.
To carry out modeling and classification for medical diagnosis purposes
of ECG and EEG in [22–24] much attention has been devoted to
computational technologies such as fuzzy systems [25], support vector
machine [26] and neural networks [27]. Not only in EEG and ECG
classifications, the advanced computational intelligent techniques have been
applied to cardiovascular response [28, 29], breast cancer and blood cell
[30], skull and brain [31, 32], dermatologic disease [17], heart disease [33],
radiation therapy [34] and lung sounds [35], etc. Each technology has its own advantages; for example, the fuzzy system is well known for its decision-making ability. Because it represents human expert knowledge, the system's output can be determined by a set of linguistic rules which can be easily understood. As discussed in [36], neural networks (NNs) have been
used as a universal function approximator due to their good approximation
capability. To develop classification models for medical diagnosis purposes
as presented in [37] NNs are used due to their generalization ability in
addressing both the nonlinear and fuzzy nature of patients' data. In [38], support vector machines (SVMs) have shown good performance for classification and have been used in various applications. Because they naturally tackle binary classification problems, SVMs have been used in the classification of cardiac signals [39].
Since the traditional optimization methods of least square algorithms
and gradient descent methods have a problem of trapping in local optima,
evolutionary algorithms such as PSO [40], genetic algorithm (GA) [41], differential evolution (DE) [42] and ant colony optimization [43] have been introduced. These algorithms are efficient global optimization algorithms and make it easy to handle multimodal problems. As discussed in
[9, 23, 44–47] the combination of optimization methods and evolutionary
algorithms eventually gives a good performance in clinical applications.
A hybrid particle swarm optimization-based fuzzy reasoning model
has been developed for early detection of hypoglycemic episodes using
physiological parameters, such as heart rate (HR) and corrected QT interval
(QTc ) of ECG signal. The fuzzy reasoning model (FRM) [25, 48] is good
in representing expert knowledge and linguistic rules that can be easily
understood by human beings. In this chapter, it has been proved that
the overall performance of a hypoglycemia detection system is distinctly
improved in terms of sensitivity and specificity by introducing FRM. To
optimize fuzzy rules and fuzzy membership functions of FRM, a global
learning algorithm called hybrid particle swarm optimization with wavelet
mutation (HPSOWM) is presented [49]. Since PSO is a powerful random
global search technique for optimization problems, by using it, the global
optimum solution over a domain can be obtained instantly. By introducing
wavelet mutation in PSO, the drawback of possible trapping in local optima
in PSO can be overcome easily. Also, a fitness function is introduced to
reduce the risk of the overtraining phenomenon [50]. A real hypoglycemia
study is given in Section 4.3 to show that the proposed detector can
successfully detect the hypoglycemic episodes in T1DM.
The organization of this chapter is as follows. In Section 4.2, the details of the development of the hybrid particle swarm optimization-based fuzzy reasoning model are presented. To show the effectiveness of our proposed
methods, the results of early detection of nocturnal hypoglycemic episodes
in T1DM are discussed in Section 4.3 before a conclusion is drawn in Section
4.4.
4.2. Hypoglycemia Detection System: Evolved Fuzzy Inference System Approach

Even though current technologies used in diabetes diagnosis testing and
self-monitoring have already been improved to some extent, technological
advancement in the market is expected to come from the use of novel design
concepts, i.e. the development of non-invasive glucose meters. There is a
limited number of non-invasive blood glucose monitoring systems currently
available on the market. However, each has its own drawbacks in terms of
functioning, cost, reliability and obtrusiveness. The GlucoWatch G2 Biographer, designed and developed by Cygnus Inc., measures glucose levels up to three times per hour for 12 hours through an autosensor attached to the skin. The product uses reverse iontophoresis to extract and measure the levels of glucose non-invasively from interstitial fluid. Before each measurement, the device requires a calibration that takes two hours. Costly disposable gel pads are required whenever these devices are used, and sweating may cause skipped readings during measurement. Worse, the device has a time delay of about 10 to 15 minutes. Because of these limitations, the device is no longer available on the market.
Although continuous glucose monitoring systems (CGMS) are now available to give real-time estimations of glucose levels, they still lack the sensitivity to be used as an alarm. In [51, 52] the median error for the MiniMed Medtronic CGMS was reported as 10 to 15% at a plasma glucose of 4 to 10 mmol/l, and the inaccuracy of CGMS in detecting hypoglycemia is
reported in [53]. In addition, for the Abbott Freestyle Navigator CGMS,
the sensor accuracy was the lowest during hypoglycemia (3.9 mmol/l), with
the median absolute relative difference (ARD) being 26.4% [54]. As these
are median values, the error may be significantly greater and, as a result,
those sensors are not to be used as an alarm. Intensive research has been
devoted to the development of hypoglycemia alarms, exploiting principles
that range from detecting changes in the EEG or skin conductance [11].
However, none of these have proved sufficiently reliable or unobtrusive.
During hypoglycemia, the most profound physiological changes are caused by activation of the sympathetic nervous system [55]. Among them, the strongest responses are sweating and increased cardiac output [12–14]. Sweating is mediated through sympathetic cholinergic fibres, while the change in cardiac output is due to an increase in heart rate and stroke volume [14]. The possibility of hypoglycemia-induced arrhythmias and
experimental hypoglycemia has been shown to prolong QT intervals and
dispersion in both non-diabetic subjects and in those with Type 1 and Type
2 diabetes. In this chapter, the physiological parameters for detection of
hypoglycemic episodes in T1DM were collected by the use of a continuous
non-invasive hypoglycemia monitor in [56] in which an alarm system is
available for warning at various stages of hypoglycemia.
To realize the early detection of hypoglycemic episodes in T1DM,
a hybrid PSO-based fuzzy reasoning model is developed with four
physiological inputs and one decision output, as shown in Fig. 4.1.
The four physiological inputs of the system are HR and QTc of the electrocardiogram signal, together with their changes ∆HR and ∆QTc, and the output is the binary status of hypoglycemia (h), which gives +1 when hypoglycemia is present (positive hypoglycemia) while −1 represents non-hypoglycemia (negative hypoglycemia).

Fig. 4.1. PSO-based fuzzy inference system for hypoglycemia detection.

The ECG parameters which will be investigated in this research involve the parameters in the depolarization and repolarization stages of electrocardiography. The concerned points are the Q point, the R peak, the T wave peak and the T wave end, as shown in Fig. 4.2. The peak of the T wave is searched for within a 300 ms section after the R peak; in this section, the maximum peak is defined as the peak of the T wave. The Q point is searched for within a 120 ms section to the left of the R peak, and is found by investigating the sequential gradients of negative-negative-positive/zero-positive/zero from the right side. These concerned points are used to obtain the ECG parameters which are used as inputs in the hypoglycemia detection. QT is the interval

Fig. 4.2. The concerned points Q, R, Tp and Te, which are used to find the ECG parameters.


between the Q and Tp points. QTc is QT/RR, in which RR is the interval between R peaks. The heart rate is 60/RR.
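As an illustration of how the fiducial points above translate into the detector inputs, the following Python sketch computes QT, QTc and the heart rate, and forms HR, QTc, ∆HR and ∆QTc from two consecutive beats. This is a minimal sketch, not the authors' implementation: the function names and the timing values in the example are hypothetical and only illustrate the arithmetic described in this section.

import numpy as np

def ecg_features(q_time, tp_time, rr_interval):
    """Compute QT, QTc and heart rate from fiducial point times (in seconds)."""
    qt = tp_time - q_time            # interval between the Q and Tp points
    qtc = qt / rr_interval           # corrected QT interval, as defined above
    heart_rate = 60.0 / rr_interval  # beats per minute
    return qt, qtc, heart_rate

def detector_inputs(curr, prev):
    """Build (HR, QTc, dHR, dQTc) from features of the current and previous beats."""
    _, qtc_now, hr_now = curr
    _, qtc_prev, hr_prev = prev
    return hr_now, qtc_now, hr_now - hr_prev, qtc_now - qtc_prev

if __name__ == "__main__":
    # Hypothetical sample timings, used only to exercise the functions.
    prev_beat = ecg_features(q_time=0.00, tp_time=0.34, rr_interval=0.80)
    curr_beat = ecg_features(q_time=0.00, tp_time=0.36, rr_interval=0.75)
    print(detector_inputs(curr_beat, prev_beat))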

4.2.1. Fuzzy inference system


The detailed methodology of the fuzzy reasoning model (FRM) is discussed in this section, showing that the FRM plays a vital role in modeling the correlation between the physiological parameters HR, QTc, ∆HR and ∆QTc and the status of hypoglycemia (h). The three FRM components, namely fuzzification, reasoning by if-then rules and defuzzification, are discussed in Sections 4.2.1.1, 4.2.1.2 and 4.2.1.3, respectively.

4.2.1.1. Fuzzification
The first step is to take the inputs and determine the degree of membership
to which they belong to each of the appropriate fuzzy sets via membership
functions. In this study, there are four inputs, HR, QTc , ∆HR, and ∆QTc .
The degree of the membership function is shown in Fig. 4.3. For input HR,
a bell-shaped function $\mu_{N^k_{HR}}(HR(t))$ is used when $m^k_{HR}$ is neither $\max\{m_{HR}\}$ nor $\min\{m_{HR}\}$:

$$\mu_{N^k_{HR}}(HR(t)) = e^{-\frac{\left(HR(t)-m^k_{HR}\right)^2}{2\left(\varsigma^k_{HR}\right)^2}}, \qquad (4.1)$$

where

$$m_{HR} = \left[\, m^1_{HR} \;\; m^2_{HR} \;\cdots\; m^k_{HR} \;\cdots\; m^{m_f}_{HR} \,\right],$$
$k = 1, 2, \ldots, m_f$, where $m_f$ denotes the number of membership functions; $t = 1, 2, \ldots, n_d$, where $n_d$ denotes the number of input–output data pairs; the parameters $m^k_{HR}$ and $\varsigma^k_{HR}$ are the mean value and the standard deviation of the membership function, respectively.
When $m^k_{HR} = \min\{m_{HR}\}$,

$$\mu_{N^k_{HR}}(HR(t)) = \begin{cases} 1 & \text{if } HR(t) \le \min\{m_{HR}\} \\[4pt] e^{-\frac{(HR(t)-m^k_{HR})^2}{2(\varsigma^k_{HR})^2}} & \text{if } HR(t) > \min\{m_{HR}\} \end{cases}, \qquad (4.2)$$

and when $m^k_{HR} = \max\{m_{HR}\}$,

$$\mu_{N^k_{HR}}(HR(t)) = \begin{cases} 1 & \text{if } HR(t) \ge \max\{m_{HR}\} \\[4pt] e^{-\frac{(HR(t)-m^k_{HR})^2}{2(\varsigma^k_{HR})^2}} & \text{if } HR(t) < \max\{m_{HR}\} \end{cases}. \qquad (4.3)$$

Similarly, the degrees of the membership functions for the inputs QTc ($\mu_{N^k_{QT_c}}(QT_c(t))$), ∆HR ($\mu_{N^k_{\Delta HR}}(\Delta HR(t))$) and ∆QTc ($\mu_{N^k_{\Delta QT_c}}(\Delta QT_c(t))$) are defined in the same way as for the heart rate input.

Fig. 4.3. Fuzzy inputs.


4.2.1.2. Fuzzy reasoning


With these fuzzy inputs (HR, QTc , ∆HR, ∆QTc ) and fuzzy output
(presence of hypoglycemia, y), the behavior of the FRM is governed by
a set of fuzzy if-then rules in the following format:

Rule $\gamma$: IF $HR(t)$ is $N^k_{HR}$ AND $QT_c(t)$ is $N^k_{QT_c}$ AND $\Delta HR(t)$ is $N^k_{\Delta HR}$ AND $\Delta QT_c(t)$ is $N^k_{\Delta QT_c}$ THEN $y(t)$ is $w_\gamma$,  (4.4)

where $N^k_{HR}$, $N^k_{QT_c}$, $N^k_{\Delta HR}$ and $N^k_{\Delta QT_c}$ are fuzzy terms of rule $\gamma$, $\gamma = 1, 2, \ldots, n_r$; $n_r$ denotes the number of rules and is equal to $(m_f)^{n_{in}}$, where $n_{in}$ represents the number of inputs to the FRM; $w_\gamma \in [0, 1]$ is the fuzzy singleton to be determined.

4.2.1.3. Defuzzification
Defuzzification is the process of translating the output of the fuzzy rules
into a scale. The presence of hypoglycemia h(t) is given by:
$$h(t) = \begin{cases} -1 & \text{if } y(t) < 0 \\ +1 & \text{if } y(t) \ge 0 \end{cases}, \qquad (4.5)$$
where

$$y(t) = \sum_{\gamma=1}^{n_r} m_\gamma(t)\, w_\gamma, \qquad (4.6)$$

and

$$m_\gamma(t) = \frac{\mu_{N^k_z}(z(t))}{\sum_{\gamma=1}^{n_r} \mu_{N^k_z}(z(t))}, \qquad (4.7)$$

where

$$\mu_{N^k_z}(z(t)) = \mu_{N^k_{HR}}(HR(t)) \times \mu_{N^k_{QT_c}}(QT_c(t)) \times \mu_{N^k_{\Delta HR}}(\Delta HR(t)) \times \mu_{N^k_{\Delta QT_c}}(\Delta QT_c(t)). \qquad (4.8)$$

All the parameters of the FRM, $m_{HR}$, $\varsigma_{HR}$, $m_{QT_c}$, $\varsigma_{QT_c}$, $m_{\Delta HR}$, $\varsigma_{\Delta HR}$, $m_{\Delta QT_c}$, $\varsigma_{\Delta QT_c}$ and $w$, are tuned by the hybrid particle swarm optimization with wavelet mutation [49].
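To make the structure of Sections 4.2.1.1–4.2.1.3 concrete, the following is a minimal Python sketch of one forward pass of the FRM. It is not the authors' implementation: the helper names, the placeholder parameter values and the way the rule base is enumerated are illustrative assumptions, combining the Gaussian memberships of (4.1)–(4.3), the product firing strengths of (4.8), the normalized weighted sum of (4.6)–(4.7) and the thresholding of (4.5).

import numpy as np

def membership(x, means, sigmas):
    """Membership degrees of a scalar input to m_f Gaussian fuzzy sets; the sets
    with the smallest/largest mean saturate at 1 beyond that mean, as in (4.2)-(4.3)."""
    mu = np.exp(-((x - means) ** 2) / (2.0 * sigmas ** 2))
    if x <= means.min():
        mu[np.argmin(means)] = 1.0
    if x >= means.max():
        mu[np.argmax(means)] = 1.0
    return mu

def frm_output(inputs, params, w):
    """y(t) of (4.6): normalized rule firing strengths weighted by the singletons w."""
    # Firing strength of each rule = product of one membership degree per input, (4.8).
    grids = np.meshgrid(*[membership(x, m, s) for x, (m, s) in zip(inputs, params)],
                        indexing="ij")
    strengths = np.prod(np.stack(grids), axis=0).ravel()
    m_gamma = strengths / strengths.sum()          # (4.7)
    return float(np.dot(m_gamma, w))               # (4.6)

def detect(inputs, params, w):
    """Defuzzification and thresholding of (4.5): +1 hypoglycemia, -1 otherwise."""
    return 1 if frm_output(inputs, params, w) >= 0 else -1

if __name__ == "__main__":
    mf = 3
    # Placeholder means/standard deviations and rule singletons (tuned by HPSOWM in the chapter).
    params = [(np.linspace(0.8, 1.2, mf), np.full(mf, 0.1)) for _ in range(4)]
    w = np.random.uniform(0.0, 1.0, mf ** 4)
    print(frm_output([1.05, 1.10, 0.02, 0.03], params, w))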
4.2.2. Particle swarm optimization with wavelet mutation

To optimize the parameters of the fuzzy reasoning model, a global learning algorithm called hybrid particle swarm optimization with wavelet mutation [49] is employed in the system. PSO is an optimization method first developed in [40]. It models the processes of the sociological
behavior associated with bird flocking, and is one of the evolutionary
computation techniques. It uses a number of particles that constitute a
swarm. Each particle traverses the search space looking for the global
optimum. In this section, HPSOWM is introduced and the detailed
algorithm of HPSOWM is presented in Algorithm 4.2.1.

Algorithm 4.2.1: Pseudo code for HPSOWM(X(t))

    t ← 0
    Initialize X(t)
    output f(X(t))
    while (not termination condition) do
        t ← t + 1
        Update v(t) and x(t) based on (4.9) to (4.12)
        if v(t) > vmax then v(t) = vmax
        if v(t) < vmin then v(t) = vmin
        Perform the wavelet mutation operation with probability µm
        Update x̄^p_j(t) based on (4.13) to (4.15)
        output X(t)
        output f(X(t))
    end while
    return (x̂)    comment: return the best solution

From Algorithm 4.2.1, X(t) denotes the swarm at the t-th iteration. Each particle $x^p(t) \in X(t)$ contains $\kappa$ elements $x^p_j(t)$ at the t-th iteration, where $p = 1, 2, \ldots, \theta$ and $j = 1, 2, \ldots, \kappa$; $\theta$ denotes the number of particles in
the swarm and κ is the dimension of a particle. First, the particles of the
swarm are initialized and then evaluated by a defined fitness function. The
objective of HPSOWM is to minimize the fitness function (cost function)
f (X(t)) of particles iteratively. The swarm evolves from iteration t to t+1
by repeating the procedures as shown in Algorithm 4.2.1. The operations
are discussed as follows.
The velocity $v^p_j(t)$ (corresponding to the flight speed in the search space) and the position $x^p_j(t)$ of the j-th element of the p-th particle at the t-th generation can be calculated using the following formulae:

$$v^p_j(t) = k\cdot\big\{ w\cdot v^p_j(t-1) + \varphi_1\cdot r_1\,(\tilde{x}^p_j - x^p_j(t-1)) + \varphi_2\cdot r_2\,(\hat{x}_j - x^p_j(t-1)) \big\} \qquad (4.9)$$

and

$$x^p_j(t) = x^p_j(t-1) + v^p_j(t), \qquad (4.10)$$

where $\tilde{x}^p = [\tilde{x}^p_1, \tilde{x}^p_2, \ldots, \tilde{x}^p_\kappa]$ and $\hat{x} = [\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_\kappa]$, $j = 1, 2, \ldots, \kappa$. The best previous position of a particle is recorded and represented as $\tilde{x}^p$; the position of the best particle among all the particles is represented as $\hat{x}$; $w$ is an inertia weight factor; $\varphi_1$ and $\varphi_2$ are acceleration constants; $r_1$ and $r_2$ return a uniform random number in the range [0, 1]; $k$ is a constriction factor derived from the stability analysis of (4.10) to ensure that the system converges, but not prematurely [57]. Mathematically, $k$ is a function of $\varphi_1$ and $\varphi_2$ as reflected in the following equation:

$$k = \frac{2}{\left| 2 - \varphi - \sqrt{\varphi^2 - 4\varphi} \right|}, \qquad (4.11)$$

where φ = φ1 + φ2 and φ > 4. PSO utilizes x̃ and x̂ to modify the current
search point to avoid the particles moving in the same direction, but to
converge gradually toward x̃ and x̂. A suitable selection of the inertia
weight w provides a balance between the global and local explorations.
Generally, w can be dynamically set with the following equation [57]:
$$w = w_{max} - \frac{w_{max} - w_{min}}{T} \times t, \qquad (4.12)$$

where t is the current iteration number, T is the total number of iterations,
wmax and wmin are the upper and lower limits of the inertia weight, wmax
and wmin are set to 1.2 and 0.1 respectively.
In 4.9, the particle velocity is limited by a maximum value vmax . The
parameter vmax determines the resolution with which regions are to be
searched between the present position and the target position. This limit
enhances the local exploration of the problem space and it realistically
simulates the incremental changes of human learning. If vmax is too high,
particles might fly past good solutions. If vmax is too small, particles may
not explore sufficiently beyond local solutions. From experience, vmax is
often set at 10% to 20% of the dynamic range of the element on each
dimension. Next, the mutation operation is used to mutate the element of
particles.
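As a concrete illustration of (4.9)–(4.12), the following Python sketch performs one swarm update with velocity clamping. It is a sketch under stated assumptions rather than the chapter's implementation: the function names and the default values (φ1 = φ2 = 2.05, wmax = 1.2, wmin = 0.1) simply mirror the settings quoted in this chapter.

import numpy as np

def constriction_factor(phi1=2.05, phi2=2.05):
    phi = phi1 + phi2                                   # phi > 4 is required, (4.11)
    return 2.0 / abs(2.0 - phi - np.sqrt(phi ** 2 - 4.0 * phi))

def inertia_weight(t, T, w_max=1.2, w_min=0.1):
    return w_max - (w_max - w_min) / T * t              # (4.12)

def pso_step(x, v, p_best, g_best, t, T, v_max, phi1=2.05, phi2=2.05):
    """One velocity/position update for the whole swarm (x, v: theta-by-kappa arrays)."""
    k = constriction_factor(phi1, phi2)
    w = inertia_weight(t, T)
    r1 = np.random.rand(*x.shape)
    r2 = np.random.rand(*x.shape)
    v_new = k * (w * v + phi1 * r1 * (p_best - x) + phi2 * r2 * (g_best - x))  # (4.9)
    v_new = np.clip(v_new, -v_max, v_max)               # limit the flight speed
    return x + v_new, v_new                             # (4.10)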

4.2.2.1. Wavelet mutation

The wavelet mutation (WM) operation exhibits both solution stability and a fine-tuning ability. Every particle element of the swarm will have a chance to mutate, governed by a probability of mutation $\mu_m \in [0, 1]$, which is defined by the user. For each particle element, a random number between zero and one is generated such that if it is less than or equal to $\mu_m$, the mutation will take place on that element. For instance, if $x^p(t) = [x^p_1(t), x^p_2(t), \ldots, x^p_\kappa(t)]$ is the selected p-th particle and the element $x^p_j(t)$ is randomly selected for mutation (the value of $x^p_j(t)$ is inside the particle element's boundaries $[\rho^j_{min}, \rho^j_{max}]$), the resulting particle is given by $\bar{x}^p(t) = [\bar{x}^p_1(t), \bar{x}^p_2(t), \ldots, \bar{x}^p_\kappa(t)]$, where

$$\bar{x}^p_j(t) = \begin{cases} x^p_j(t) + \sigma \times (\rho^j_{max} - x^p_j(t)) & \text{if } \sigma > 0 \\ x^p_j(t) + \sigma \times (x^p_j(t) - \rho^j_{min}) & \text{if } \sigma < 0 \end{cases} \qquad (4.13)$$

where $j \in \{1, 2, \ldots, \kappa\}$, $\kappa$ denotes the dimension of a particle, and the value of $\sigma$ is governed by the Morlet wavelet function [58]:

$$\sigma = \psi_{a,0}(\varphi) = \frac{1}{\sqrt{a}}\,\psi\!\left(\frac{\varphi}{a}\right) = \frac{1}{\sqrt{a}}\, e^{-\frac{1}{2}\left(\frac{\varphi}{a}\right)^2} \cos\!\left(5\,\frac{\varphi}{a}\right). \qquad (4.14)$$

The amplitude of $\psi_{a,0}(\varphi)$ is scaled down as the dilation parameter $a$ increases. This property is used in the mutation operation in order to enhance the searching performance. According to (4.14), if $\sigma$ is positive and approaching one, the mutated element of the particle will tend to the maximum value of $x^p_j(t)$. Conversely, when $\sigma$ is negative ($\sigma \le 0$) and approaching $-1$, the mutated element of the particle will tend to the minimum value of $x^p_j(t)$. A larger value of $|\sigma|$ gives a larger searching space for $x^p_j(t)$, while a small $|\sigma|$ gives a smaller searching space for fine-tuning. As over 99% of the total energy of the mother wavelet function is contained in the interval $[-2.5, 2.5]$, $\varphi$ can be randomly generated from $[-2.5 \times a, 2.5 \times a]$. The value of the dilation parameter $a$ is set to
vary with the value of $t/T$ in order to meet the fine-tuning purpose, where $T$ is the total number of iterations and $t$ is the current iteration number. In order to perform a local search when $t$ is large, the value of $a$ should increase as $t/T$ increases so as to reduce the significance of the mutation. Hence, a monotonic increasing function governing $a$ and $t/T$ is proposed in the following form:

$$a = e^{-\ln(g)\times\left(1-\frac{t}{T}\right)^{\zeta_{wm}} + \ln(g)}, \qquad (4.15)$$

where $\zeta_{wm}$ is the shape parameter of the monotonic increasing function and $g$ is the upper limit of the parameter $a$. The effects of various values of the shape parameter $\zeta_{wm}$ and of the parameter $g$ on $a$ with respect to $t/T$ are shown in Figs 4.4 and 4.5, respectively. In these figures, $g$ is set as 10000; thus, the value of $a$ is between 1 and 10000. Referring to (4.14), the maximum value of $\sigma$ is 1 when the random number $\varphi = 0$ and $a = 1$ ($t/T = 0$). Then, referring to (4.13), the resulting particle element is $\bar{x}^p_j(t) = x^p_j(t) + 1 \times (\rho^j_{max} - x^p_j(t)) = \rho^j_{max}$. This ensures that a large search space is given for the mutated element. When the value of $t/T$ is near to 1, the value of $a$ is so large that the maximum value of $\sigma$ becomes very small. For example, at $t/T = 0.9$ and $\zeta_{wm} = 1$, the dilation parameter is $a = 4000$; if the random value of $\varphi$ is zero, the value of $\sigma$ will be equal to 0.0158. With $\bar{x}^p_j(t) = x^p_j(t) + 0.0158 \times (\rho^j_{max} - x^p_j(t))$, a smaller searching space is given for the mutated element for fine-tuning. Changing the parameter $\zeta_{wm}$ will change the characteristics of the monotonic increasing function of the wavelet mutation. The dilation parameter $a$ will take a value so as to perform fine-tuning faster as $\zeta_{wm}$ increases. It is chosen by trial and error, depending on the kind of optimization problem. When $\zeta_{wm}$ becomes larger, the decreasing speed of the step size ($\sigma$) of the mutation becomes faster. In general, if the optimization problem is smooth and symmetric, it is easier for the searching algorithm to find the solution and the fine-tuning can be done in the early stage. Thus, a larger value of $\zeta_{wm}$ can be used to increase the step size of the early mutation. After the operation of wavelet mutation, a new swarm is generated. This new swarm repeats the same process, and the iterative process terminates when the pre-defined number of iterations is met.
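The following Python sketch illustrates the mutation operation of (4.13)–(4.15). It is an illustrative sketch, not the authors' code; the boundary vectors rho_min and rho_max and the default values of µm, ζwm and g are assumptions taken from the settings listed later in Section 4.3.

import numpy as np

def dilation(t, T, zeta_wm=2.0, g=10000.0):
    """Dilation parameter a of (4.15), growing from 1 towards g as t/T goes to 1."""
    return np.exp(-np.log(g) * (1.0 - t / T) ** zeta_wm + np.log(g))

def morlet_sigma(a):
    """Random step size sigma of (4.14), with phi drawn from [-2.5a, 2.5a]."""
    phi = np.random.uniform(-2.5 * a, 2.5 * a)
    return (1.0 / np.sqrt(a)) * np.exp(-0.5 * (phi / a) ** 2) * np.cos(5.0 * phi / a)

def wavelet_mutate(x, rho_min, rho_max, t, T, mu_m=0.5, zeta_wm=2.0, g=10000.0):
    """Mutate each element of particle x with probability mu_m, as in (4.13)."""
    a = dilation(t, T, zeta_wm, g)
    x = x.copy()
    for j in range(x.size):
        if np.random.rand() <= mu_m:
            sigma = morlet_sigma(a)
            if sigma > 0:
                x[j] = x[j] + sigma * (rho_max[j] - x[j])
            else:
                x[j] = x[j] + sigma * (x[j] - rho_min[j])
    return x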

4.2.3. Choosing the HPSOWM parameters


HPSOWM seeks a balance between the exploration of new regions and the exploitation of already sampled regions of the search space.
Fig. 4.4. Effect of the shape parameter ζwm on the dilation parameter a with respect to t/T (curves shown for ζwm = 0.2, 0.5, 1, 2 and 5).

This balance, which critically affects the performance of HPSOWM, is governed by the right choices of the control parameters: the swarm size (θ), the probability of mutation (µm), the shape parameter (ζwm) and the parameter g of the wavelet mutation. Some guidelines for choosing these parameters are given as follows:

(i) Increasing swarm size (θ) will increase the diversity of the search space,
and reduce the probability that HPSOWM prematurely converges to
a local optimum. However, it also increases the time required for the
population to converge to the optimal region in the search space.
(ii) Increasing the probability of mutation (µm ) tends to transform the search
into a random search such that when µm = 1, all elements of particles
will mutate. This probability gives us an expected number (µm × θ × κ)
of elements of particles that undergo the mutation operation. In other
words, the value of µm depends on the desired number of elements of
particles that undergo the mutation operation. Normally, when the
dimension is very low (the number of elements of a particle is less than 5), µm is set at 0.5 to 0.8.

Fig. 4.5. Effect of the parameter g on the dilation parameter a with respect to t/T (curves shown for g = 100, 1000, 10000 and 100000).

When the dimension is around 5 to 10, µm is set at 0.3 to 0.4. When the dimension is in the range of 11 to 100, µm is set
at 0.1 to 0.2. When the dimension is in the range of 101 to 1000, normally
µm is set at 0.05 to 0.1. Lastly, when the dimension is very high (number
of elements of particles is larger than 1000), µm is set at < 0.05. The
rationale to set this selection criterion is: when the dimension is high, µm
should be set to be a smaller value; when the dimension is low, µm should
be set to be a larger value. This is because if the dimension is high and
µm is set to be a larger number, then the number of elements of particles
undergoing mutation operation will be large. It will increase the searching
time and more importantly it destroys the current information about the
application in each iteration as all elements of particles are randomly
assigned. Generally speaking, by choosing the value of µm appropriately, the ratio of the number of particle elements undergoing the mutation operation to the population size can be maintained to prevent the searching process from turning into a random search. Thus, the value of µm is based on
this selection criterion and chosen by trial and error through experiments
for good performance for all functions.
(iii) The dilation parameter a is governed by the monotonic increasing function (4.15), and this monotonic increasing function is controlled by two parameters: the shape parameter ζwm and the parameter g. Changing the parameter ζwm changes the characteristics of the monotonic increasing function of the wavelet mutation. The dilation parameter a will take a value so as to perform fine-tuning faster as ζwm increases. It is chosen by trial and error, depending on the kind of optimization problem. When ζwm becomes larger, the decreasing speed of the step size (σ) of the mutation becomes faster. In general, if the optimization problem is smooth and symmetric, it is easier to find the solution and the fine-tuning can be done in early iterations. Thus, a larger value of ζwm can be used to increase the step size of the early mutation. The parameter g is the upper limit of the dilation parameter a. A larger value of g implies that the maximum value of a is larger; in other words, the maximum value of |σ| will be smaller (a smaller searching limit is given). Conversely, a smaller value of g implies that the maximum value of a is smaller; on the other hand, the maximum value of |σ| will be larger (a larger searching limit is given). In our point of view, fixing one parameter and adjusting the other to control the monotonic increasing function is the more convenient way to find a good setting.

4.2.4. Fitness function and training


In this system, HPSOWM is employed to optimize the fuzzy rules and membership functions by finding the best parameters of the FRM.
The function of the fuzzy reasoning model is to detect the hypoglycemic
episodes accurately. To measure the performance of the biomedical
classification test, sensitivity and specificity are introduced [59]. The
sensitivity measures the proportion of actual positives which are correctly
identified and the specificity measures the proportion of negatives which are
correctly identified. The definitions of the sensitivity (ξ) and the specificity
(η) are given as follows:
$$\xi = \frac{N_{TP}}{N_{TP} + N_{FN}}, \qquad (4.16)$$

$$\eta = \frac{N_{TN}}{N_{TN} + N_{FP}}, \qquad (4.17)$$
where $N_{TP}$ is the number of true positives, which implies the sick people correctly diagnosed as sick; $N_{FN}$ is the number of false negatives,
which implies sick people wrongly diagnosed as healthy; NF P is the number
of false positives which implies healthy people wrongly diagnosed as sick;
and NT N is the number of true negatives which implies healthy people
correctly diagnosed as healthy. The values of ξ and η lie between zero and one. The objective of the system is to maximize the sensitivity and the specificity; thus, the fitness function f(ξ, η) is defined as follows:

$$f(\xi, \eta) = \xi + \frac{\eta_{max} - \eta}{\eta_{max}}, \qquad (4.18)$$

where $\eta_{max}$ is the upper limit of the specificity. The objective is to maximize the fitness function of (4.18), which is equivalent to maximizing the sensitivity and the specificity. In (4.18), the specificity is limited by a maximum value $\eta_{max}$. The parameter $\eta_{max}$ is used to fix the region of specificity and to find the optimal sensitivity in this region. In particular, $\eta_{max}$ can be set from zero to one, and different sensitivities with different specificity values can be determined. With this proposed fitness function, a satisfactory ROC curve for this hypoglycemia detection can be found. The ROC curve is commonly used in medical decision making and is a useful technique for visualizing the classification performance. The classification results, together with the ROC curve of the proposed method, are illustrated in Section 4.3.
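A minimal sketch of the fitness evaluation of (4.16)–(4.18) is given below. The confusion-matrix counts and the value of ηmax used in the example are hypothetical and only illustrate the arithmetic; they are not results from the study.

def sensitivity(tp, fn):
    return tp / (tp + fn)                    # (4.16)

def specificity(tn, fp):
    return tn / (tn + fp)                    # (4.17)

def fitness(xi, eta, eta_max=0.5):
    """f(xi, eta) of (4.18); eta_max fixes the specificity region being explored."""
    return xi + (eta_max - eta) / eta_max

if __name__ == "__main__":
    xi = sensitivity(tp=70, fn=16)           # hypothetical confusion-matrix counts
    eta = specificity(tn=42, fp=58)
    print(round(fitness(xi, eta, eta_max=0.5), 3))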

4.3. Results and Discussion

Fifteen children with T1DM (14.6 ± 1.5 years) volunteered for the
10-hour overnight hypoglycemia study at the Princess Margaret Hospital
for Children in Perth, Western Australia, Australia. Each patient is
monitored overnight for the natural occurrence of nocturnal hypoglycemia.
Data are collected with approval from Women’s and Children’s Health
Service, Department of Health, Government of Western Australia and with
informed consent. A comprehensive patient information and consent form
is formulated and approved by the Ethics Committee. The consent form
includes the actual consent and a revocation of consent page. Each patient
receives this information consent form at least two weeks prior to the
start of the studies. He/she has the opportunity to raise questions and
concerns with any medical advisors, researchers and the investigators. For
the children participating in this study, the parent or guardian signed the
relevant forms.
The required physiological parameters are measured by the use of the
non-invasive monitoring system in [56], while the actual blood glucose levels
(BGL) are collected for reference using Yellow Spring Instruments. The
main parameters used for the detection of hypoglycemia are the heart
rate and corrected QT interval. The actual blood glucose profiles for
15 T1DM children [56] are shown in Fig. 4.6. The responses from 15
T1DM children exhibit significant changes during the hypoglycemia phase
against the non-hypoglycemia phase. Normalization is used to reduce
patient-to-patient variability and to enable group comparison by dividing
the patient’s heart rate and corrected QT interval by his/her corresponding
values at time zero.
The study shows that the detection of hypoglycemic episodes (BGL
≤ 2.8 mmol/l and BGL ≤ 3.3 mmol/l) using these variables is based on a
hybrid PSO with wavelet mutation-based fuzzy reasoning model developed
from the obtained clinical data. In effect, it estimates the presence of hypoglycemia at sampling period $k_s$ on the basis of the data at sampling period $k_s$ and the previous data at sampling period $k_s - 1$. In general, the sampling period is 5 minutes and approximately 35 to 40 data points are used for each patient. The overall data set consisted of a training set and a testing set; 10 patients for training and 5 patients for testing were randomly selected. The whole data set included both hypoglycemia and non-hypoglycemia data. HPSOWM is used to find the optimized fuzzy rules and membership functions of the FRM; the basic settings of its parameters are as follows:

• Swarm size θ: 50;
• Acceleration constants φ1 and φ2: 2.05;
• Maximum velocity vmax: 0.2 or 20%;
• Probability of mutation µm: 0.5;
• Shape parameter of wavelet mutation ζwm [49]: 2;
• Constant value g of wavelet mutation [49]: 10000;
• Number of iterations T: 1000.

Table 4.1. Simulation Results.

          Training                   Testing
Method    Sensitivity  Specificity   Sensitivity  Specificity   Area of ROC curve
EFIS      89.23%       39.87%        81.40%       41.82%        71.46%
EMR       78.46%       39.23%        72.09%       41.82%        64.19%
Fig. 4.6. Actual blood glucose level profiles in 15 T1DM children.

The clinical results of hypoglycemia detection with the different approaches, the evolved fuzzy inference system (EFIS) and the evolved multiple regression (EMR), are tabulated in Table 4.1. By the use of the proposed EFIS, the test sensitivity and specificity are about 81.40% and 41.82%, while the EMR detection gives 72.09% and 41.82%. From Table 4.1, we can see that the training and testing results of the EFIS are better than those of the EMR in terms of sensitivity and specificity. Comparing the areas under the ROC curves for the EFIS (71.46%) and the EMR (64.19%) in Fig. 4.7, it can be distinctly seen that the proposed EFIS detection system has higher accuracy than detection with the EMR. Thus, the proposed detection system gives satisfactory results with higher accuracy.

4.4. Conclusion

A hybrid particle swarm optimization-based fuzzy inference system is developed in this chapter to detect hypoglycemic episodes in diabetes patients. The results in Section 4.3 indicate that the hypoglycemic episodes
Fig. 4.7. ROC curve for testing.

in T1DM children can be detected non-invasively and continuously from the
real-time physiological responses (heart rate, corrected QT interval, change
of heart rate and change of corrected QT interval). To optimize the fuzzy
rules and membership functions, a hybrid particle swarm optimization is
presented where wavelet mutation operation is introduced to enhance the
optimization performance. A real T1DM study is given to illustrate that
the proposed algorithm produces better results compared with a multiple
regression algorithm in terms of the sensitivity and specificity. To conclude,
the performance of the proposed algorithm for detection of hypoglycemic
episodes for T1DM is satisfactory, as the sensitivity is 81.40% and specificity
is 41.82% for hypoglycemia defined as a blood glucose level below 3.3 mmol/l.

References

[1] T. Duning and B. Ellger, Is hypoglycaemia dangerous?, Best Practice and
Research Clinical Anaesthesiology. 23(4), 473–485, (1996).
[2] M. Egger and G. D. Smith, Hypoglycaemia awareness and human insulin,
The Lancet. 338(8772), 950–951, (1991).
[3] D. J. Becker and C. M. Ryan, Hypoglycemia: A complication of diabetes
therapy in children, Trends in Endocrinology and Metabolism. 11(5),
198–202, (2000).
[4] D. R. Group, Adverse events and their association with treatment regimens
in the Diabetes Control and Complications Trial, Diabetes Care. 18,
1415–1427, (1995).
[5] D. R. Group, Epidemiology of severe hypoglycemia in the diabetes control
and complication trial, The American Journal of Medicine. 90(4), 450–459,
(1991).
[6] M. A. E. Merbis, F. J. Snoek, K. Kanc, and R. J. Heine, Hypoglycaemia
induces emotional disruption, Patient Education and Counseling. 29(1),
117–122, (1996).
[7] P. E. Cryer, Symptoms of hypoglycemia, thresholds for their occurrence and
hypoglycemia unawareness, Endocrinology and Metabolism Clinics of North
America. 28(3), 495–500, (1999).
[8] G. Heger, K. Howorka, H. Thoma, G. Tribl, and J. Zeitlhofer, Monitoring
set-up for selection of parameters for detection of hypoglycaemia in diabetic
patients, Medical and Biological Engineering and Computing. 34(1), 69–75,
(1996).
[9] S. Pramming, B. Thorsteinsson, I. Bendtson, and C. Binder, Symptomatic
hypoglycaemia in 411 type 1 diabetic patients, Diabetic medicine : A Journal
of the British Diabetic Association. 8(3), 217–222, (1991).
[10] J. C. Pickup, Sensitive glucose sensing in diabetes, Lancet. 355, 426–427,
(2000).
[11] E. A. Gale, T. Bennett, I. A. Macdonald, J. J. Holst, and J. A. Matthews,
The physiological effects of insulin-induced hypoglycaemia in man: responses
at differing levels of blood glucose, Clinical Science. 65(3), 263–271, (1983).
[12] N. D. Harris, S. B. Baykouchev, and J. L. B. Marques, A portable system
for monitoring physiological responses to hypoglycaemia, Journal of Medical
Engineering and Technology. 20(6), 196–202, (1996).
[13] R. B. Tattersall and G. V. Gill, Unexplained death of type 1 diabetic
patients, Diabetic Medicine. 8(1), 49–58, (1991).
[14] J. L. B. Marques, E. George, S. R. Peacey, N. D. Harris, and T. C.
I. A. Macdonald, Altered ventricular repolarization during hypoglycaemia
in patients with diabetes, Diabetic Medicine. 14(8), 648–654, (1997).
[15] B. H. Cho, H. Yu, K. Kim, T. H. Kim, I. Y. Kim, and S. I. Kim, Application
of irregular and unbalanced data to predict diabetic nephropathy using
visualization and feature selection methods, Artificial Intelligence in
Medicine archive. 42(1), 37–53, (2008).
[16] A. Chu, H. Ahn, B. Halwan, B. Kalmin, E. L. V. Artifon, A. Barkun,
M. G. Lagoudakis, and A. Kumar, A decision support system to facilitate
management of patients with acute gastrointestinal bleeding, Artificial
Intelligence in Medicine Archive. 42(3), 247–259, (2008).
[17] C. L. Chang and C. H. Chen, Applying decision tree and neural network to
increase quality of dermatologic diagnosis, Expert Systems with Applications.
36(2), 4035–4041, (2009).
[18] G. A. F. Seber and A. J. Lee, Linear regression analysis. (John Wiley &
Sons, New York, 2003).
[19] S. M. Winkler, M. Affenzeller, and S. Wagner, Using enhanced genetic
programming techniques for evolving classifiers in the context of medical
diagnosis, Genetic Programming and Evolvable Machines. 10(2), 111–140,
(2009).
[20] T. S. Subashini, V. Ramalingam, and S. Palanivel, Breast mass classification
based on cytological patterns using RBFNN and SVM, Source Expert
Systems with Applications: An International Journal Archive. 36(3),
5284–5290, (2009).
[21] H. F. Gray and R. J. Maxwell, Genetic programming for classification and
feature selection: analysis of 1H nuclear magnetic resonance spectra from
human brain tumour biopsies, NMR in Biomedicine. 11(4), 217–224, (1998).
[22] C. M. Fira and L. Goras, An ECG signals compression method and its
validation using NNs, IEEE Transactions on Biomedical Engineering. 55
(4), 1319–1326, (2008).
[23] W. Jiang, S. G. Kong, and G. D. Peterson, ECG signal classification using
block-based neural networks, IEEE Transactions on Neural Networks. 18
(6), 1750–1761, (2007).
[24] A. H. Khandoker, M. Palaniswami, and C. K. Karmakar, Support vector
machines for automated recognition of obstructive sleep apnea syndrome
from ECG recordings, IEEE Transactions on Information Technology in
Biomedicine. 13(1), 37–48, (2009).
[25] J. S. Jang and C. T. Sun, Neuro-Fuzzy and Soft Computing: A
Computational Approach to Learning and Machine Intelligence. (Prentice
Hall, Upper Saddle River, NJ, 1997).
[26] C. Cortes and V. Vapnik, Support-vector networks, Machine Learning. 20
(3), 273–297, (1995).
[27] A. K. Jain, J. Mao, and K. M. Mohiuddin, Artificial neural networks: a
tutorial, IEEE Computer Society. 29(3), 31–44, (1996).
[28] K. H. Chon and R. J. Cohen, Linear and nonlinear ARMA model
parameter estimation using an artificial neural network, IEEE Transactions
on Biomedical Engineering. 44(3), 168–174, (1997).
[29] W. W. Melek, Z. Lu, A. Kapps, and B. Cheung, Modeling of dynamic
cardiovascular responses during G-transition-induced orthostatic stress in
pitch and roll rotations, IEEE Transactions on Biomedical Engineering. 49
(12), 1481–1490, (2002).
[30] S. Wang and W. Min, A new detection algorithm (NDA) based on fuzzy
cellular neural networks for white blood cell detection, IEEE Transactions
on Information Technology in Biomedicine. 10(1), 5–10, (2006).
[31] Y. Hata, S. Kobashi, K. Kondo, Y. Kitamura, and T. Yanagida, Transcranial
ultrasonography system for visualizing skull and brain surface aided by fuzzy
expert system, IEEE Transactions on Systems, Man, and Cybernetics - Part
B. 35(6), 1360–1373, (2005).
[32] S. Kobashi, Y. Fujiki, M. Matsui, and N. Inoue, Genetic programming
for classification and feature selection: analysis of 1H nuclear magnetic
resonance spectra from human brain tumour biopsies, IEEE Transactions
on Systems, Man, and Cybernetics, Part B. 36(1), 74–86, (2006).
[33] R. Das, I. Turkoglu, and A. Sengur, Effective diagnosis of heart disease
through neural networks ensembles, An International Journal Source Expert
Systems with Applications. 36(4), 7675–7680, (2009).
[34] E. I. Papageorgiou, C. D. Stylios, and P. P. Groumpos, An integrated
two-level hierarchical system for decision making in radiation therapy based
on fuzzy cognitive maps, IEEE Transactions on Biomedical Engineering. 50
(12), 1326–1339, (2003).
[35] P. A. Mastorocostas and J. B. Theocharis, A stable learning algorithm for
block-diagonal recurrent neural networks: application to the analysis of
lung sounds, IEEE Transactions on Systems, Man, and Cybernetics. 36(2),
242–254, (2006).
[36] M. Brown and C. Harris, Neural Fuzzy Adaptive Modeling and Control.
(Prentice Hall, Upper Saddle River, NJ, 1994).
[37] J. Reggia and S. Sutton, Self-processing networks and their biomedical
implications, Proceedings of the IEEE. 76(6), 580–592, (1988).
[38] D. Meyer, F. Leisch, and K. Hornik, The support vector machine under test,
Neurocomputing. 55(6), 169–186, (2003).
[39] N. Acir, A support vector machine classifier algorithm based on a
perturbation method and its application to ECG beat recognition systems,
Expert Systems with Applications. 31(1), 150–158, (2006).
[40] J. Kennedy and R. Eberhart. Particle swarm optimization. In Proceedings of
IEEE International Conference on Neural Networks, pp. 1942–1948, (1995).
[41] Z. Michalewicz, Genetic algorithms + data structures = evolution programs
(2nd, extended ed.). (Springer–Verlag, Berlin, 1994).
[42] R. Storn and K. Price, Differential evolution – a simple and efficient
heuristic for global optimization over continuous spaces, Journal of Global
Optimization. 11, 341–359, (1997).
[43] M. Dorigo and T. Stuzle, Ant Colony Optimization. (MIT Press, Cambridge,
MA, 2004).
[44] Nuryani, S. Ling, and H. Nguyen. Hypoglycaemia detection for type 1
diabetic patients based on ECG parameters using fuzzy support vector
machine. In Proceedings of International Joint Conference on Neural
Networks, pp. 2253–2259, (2010).
[45] A. Keles, S. Hasiloglu, K. Ali, and Y. Aksoy, Neuro-fuzzy classification of
prostate cancer using NEFCLASS-J, Computers in Biology and Medicine.
37(11), 1617–1628, (2007).
[46] R. J. Oentaryo, M. Pasquier, and C. Quek, GenSoFNN-Yager: A novel
brain-inspired generic self-organizing neuro-fuzzy system realizing Yager
inference, Expert Systems with Application. 35(4), 1825–1840, (2008).
[47] S. Osowski and H. Tran, ECG beat recognition using fuzzy hybrid neural
network, IEEE Transactions on Biomedical Engineering. 48(11), 875–884,
(2001).
[48] H. Mamdani and S. Assilian, An experiment in linguistic synthesis with a
fuzzy logic controller, International Journal of Man–Machine Studies. 7(1),
1–13, (1975).
[49] S. H. Ling and H. C. C. Iu, Hybrid particle swarm optimization with wavelet
mutation and its industrial applications, IEEE Transactions on Systems,
Man, and Cybernetics–Part B: Cybernetics. 38(3), 743–763, (2008).
[50] M. L. Astion, M. H. Wener, R. G. Thomas, and G. G. Hunder, Overtraining
in neural networks that interpret clinical data, Clinical Chemistry. 39(9),
1998–2004, (1993).
[51] Diabetes Research in Children Network (DirecNet) Study Group, Evaluation
of factors affecting CGMS calibration, Diabetes technology and therapeutics.
8(3), 318–325, (2006).
[52] M. J. Tansey and R. W. Beck, Accuracy of the modified continuous glucose
monitoring system (CGMS) sensor in an outpatient setting: results from a
diabetes research in children network (DirecNet) study, Diabetes Technology
and Therapeutics. 7(1), 109–114, (2005).
[53] F. F. Maia and L. R. Arajo, Efficacy of continuous glucose monitoring system
to detect unrecognized hypoglycemia in children and adolescents with type
1 diabetes, Arquivos Brasileiros De Endocrinologia E Metabologia. 49(4),
569–574, (2005).
[54] R. L. Weinstein, Accuracy of the freestyle navigator CGMS: comparison with
frequent laboratory measurements, Diabetes Care. 30, 1125–1130, (2007).
[55] S. R. Heller and I. A. MacDonald, Physiological disturbances in
hypoglycaemia: effect on subjective awareness, Clinical Science. 81(1), 1–9,
(1991).
[56] H. T. Nguyen, N. Ghevondian, and T. W. Jones, Neural-network detection of
hypoglycemic episodes in children with Type 1 diabetes using physiological
parameters, Proceedings of the 28th Annual International Conference of the
IEEE Engineering in Medicine and Biology Society. pp. 6053–6056, (2006).
[57] R. C. Eberhart and Y. Shi, Comparing inertia weights and constriction
factors in particle swarm optimization, Proceedings of the IEEE Congress
on Evolutionary Computing. pp. 84–88, (2000).
[58] I. Daubechies, Ten Lectures on Wavelets. (Society for Industrial and Applied
Mathematics, Philadelphia, 1992).
[59] D. G. Altman and J. M. Bland, Statistics notes: Diagnostic tests 1:
sensitivity and specificity, Clinical Chemistry. (308), 1552–1552, (1994).
PART 3

Neural Networks and their Applications

Chapter 5

Study of Limit Cycle Behavior of Weights of Perceptron

C.Y.F. Ho and B.W.K. Ling


School of Engineering, University of Lincoln
Lincoln, Lincolnshire, LN6 7TS, United Kingdom
[email protected]

In this chapter, limit cycle behavior of weights of perceptron is discussed.
First, the weights of a perceptron are bounded for all initial weights if there exists a nonempty set of initial weights for which the weights of the perceptron are bounded. Hence, the boundedness condition of the
weights of the perceptron is independent of the initial weights. Second,
a necessary and sufficient condition for the weights of the perceptron
exhibiting a limit cycle behavior is discussed. The range of the number
of updates for the weights of the perceptron required to reach the limit
cycle is estimated. Finally, it is suggested that the perceptron exhibiting
the limit cycle behavior can be employed for solving a recognition
problem when downsampled sets of bounded training feature vectors
are linearly separable. Numerical computer simulation results show that
the perceptron exhibiting the limit cycle behavior can achieve a better
recognition performance compared to a multi-layer perceptron.

Contents

5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.2 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.3 Global Boundness Property . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.4 Limit Cycle Behavior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
5.5 Application of Perceptron Exhibiting Limit Cycle Behavior . . . . . . . . . . 97
5.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100

5.1. Introduction

Since the implementation cost of a perceptron is low and a perceptron
can classify linearly separable bounded training feature vectors [1–3],
perceptrons [4–8] are widely applied in many pattern recognition systems.


However, as the values of the output of the perceptron are binary,
they can be represented by symbols and the dynamics of the perceptron
are governed by symbolic dynamics. Symbolic dynamics is very complex
because limit cycle and chaotic behaviors may occur. One of the properties
of symbolic dynamical systems is that the system state vectors may be
bounded for some initial system state vectors while they may not be
bounded for other initial system state vectors. Hence, it is expected that
the boundedness condition of the weights of the perceptron would also
depend on the initial weights. In fact, the boundedness condition of the
weights of the perceptron is not completely known yet. As the boundedness
property is very important because of safety reasons, this chapter studies
the boundedness condition of the weights of the perceptron.
It is well known from the perceptron convergence theorem [1–3] that
the weights of the perceptron will converge to a fixed point within a finite
number of updates if the set of bounded training feature vectors is linearly
separable, and the weights may exhibit a limit cycle behavior if the set
of bounded training feature vectors is nonlinearly separable. However, the
exact condition for the weights of the perceptron exhibiting the limit cycle
behavior is unknown. A perceptron exhibiting the limit cycle behavior is
actually a neural network with time periodically varying coefficients. In
fact, this is a generalization of the perceptron with constant coefficients.
Hence, better performances will result if the downsampled sets of bounded
training feature vectors are linearly separable. By knowing the exact
condition for the weights of the perceptron exhibiting limit cycle behaviors,
one can operate the perceptron accordingly so that better performances are
achieved. Besides, the range of the number of updates for the weights of
the perceptron to reach the limit cycle is also unknown when the weights of
the perceptron exhibit the limit cycle behavior. The range of the number of
updates for the weights of the perceptron to reach the limit cycle relates to
the rate of the convergence of the training algorithm. Hence, by knowing the
range of the number of updates for the weights of the perceptron to reach
the limit cycle, one can estimate the computational effort of the training
algorithm. The details of these issues will be discussed in Section 5.4.
The outline of this chapter is as follows. Notations used throughout this
chapter will be introduced in Section 5.2. It will be discussed in Section
5.3 that the weights of the perceptron are bounded for all initial weights if there exists a nonempty set of initial weights for which the weights of the perceptron are bounded. A necessary and sufficient condition for the weights
of the perceptron exhibiting the limit cycle behavior will be discussed in
Section 5.4. Also, the range of the number of updates for the weights
of the perceptron to reach the limit cycle will be estimated in the same
section. Numerical computer simulation results will be shown in Section
5.5 to illustrate that the perceptron exhibiting the limit cycle behavior
can achieve a better recognition performance compared to a multi-layer
perceptron. Finally, a conclusion will be drawn in Section 5.6.

5.2. Notations

Denote N as the number of the bounded training feature vectors and d
as the dimension of these bounded training feature vectors. Denote the
elements in the bounded training feature vectors as xi (k) for i = 1, 2, . . . , d
and for k = 0, 1, . . . , N − 1. Define x(k) ≡ [1, x1 (k), . . . , xd (k)]T for k =
0, 1, . . . , N − 1 and x(N n + k) = x(k) ∀n ≥ 0 and for k = 0, 1, . . . , N − 1,
where the superscript T denotes the transposition operator. Denote the
weights of the perceptron as wi (n) for i = 1, 2, . . . , d and ∀n ≥ 0. Denote
the threshold of the perceptron as w0 (n) ∀n ≥ 0 and the activation function
of the perceptron as
$$Q(z) = \begin{cases} 1 & z \ge 0 \\ -1 & z < 0 \end{cases}.$$

Define w(n) ≡ [w0 (n), w1 (n), . . . , wd (n)]T ∀n ≥ 0 and denote the output
of the perceptron as y(n) ∀n ≥ 0, then y(n) = Q(wT (n)x(n)) ∀n ≥ 0.
Denote the desired output of the perceptron corresponding to x(n) as t(n)
∀n ≥ 0. Assume that the perceptron training algorithm [9] is employed for
the training, so the update rule for the weights of the perceptron is as follows:

$$w(n+1) = w(n) + \frac{t(n) - y(n)}{2}\, x(n) \quad \forall n \ge 0. \qquad (5.1)$$

Denote the absolute value of a real number as $|\cdot|$ and the 2-norm of a vector as $\|v\| \equiv \sqrt{\sum_{i=1}^{d} v_i^2}$, where $v \equiv [v_1, \ldots, v_d]^T$. Denote $K$ as the maximum 2-norm of the vectors in the set of bounded training feature vectors, that is $K \equiv \max_{0 \le k \le N-1} \|x(k)\|$. Denote $\emptyset$ as the empty set.
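As a concrete reading of the notation above, the following Python sketch applies the training rule (5.1) with the activation function Q and the periodic extension x(Nn + k) = x(k). It is a minimal sketch rather than the authors' implementation; the toy feature vectors and the number of updates are illustrative assumptions, not data from the chapter.

import numpy as np

def Q(z):
    return 1 if z >= 0 else -1

def train(x_list, t_list, w0, updates):
    """Apply (5.1) for a given number of updates; returns the weight history."""
    w = np.array(w0, dtype=float)
    N = len(x_list)
    history = [w.copy()]
    for n in range(updates):
        x = np.array(x_list[n % N], dtype=float)   # periodic extension of the data
        y = Q(w @ x)                               # y(n) = Q(w^T(n) x(n))
        w = w + 0.5 * (t_list[n % N] - y) * x      # update rule (5.1)
        history.append(w.copy())
    return history

# Example: two-dimensional feature vectors augmented with a leading 1 (the threshold input).
x_list = [[1, 0.0, 1.0], [1, 1.0, 0.0], [1, -1.0, 0.5]]
t_list = [1, -1, 1]
weights = train(x_list, t_list, w0=[0.0, 0.0, 0.0], updates=30)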
5.3. Global Boundness Property

Since y(n) ∈ {1, −1} ∀n ≥ 0, the values of y(n) can be represented


as symbols and the dynamics of the perceptron is governed by symbolic
dynamics. As discussed in Section 5.1, the boundedness condition of the
system state vectors of general symbolic dynamical systems depends on
the initial system state vectors, so one may expect that different initial
weights would lead to different boundedness conclusions. However, it is
found that if there exists a nonempty set of initial weights that leads to the
bounded behavior, then all initial weights will lead to the bounded behavior.
That means, the boundedness condition of the weights of the perceptron
is independent of the initial weights. This result is stated in Theorem 5.1
and is useful because engineers can employ arbitrary initial weights for the
training and the boundedness condition of the weights is independent of
the choice of the initial weights. Before we discuss this result, we need the
following lemmas:

Lemma 5.1. Assume that there are two perceptrons with the initial weights w(0) and w∗(0). Suppose that the set of the bounded training feature vectors and the corresponding set of desired outputs of these two perceptrons are the same. Then

(w(k) − w∗(k))T x(k) · (Q((w∗(k))T x(k)) − Q((w(k))T x(k)))/2 ≤ 0   ∀k ≥ 0.

The importance of Lemma 5.1 is for deriving the result in Lemma


5.2 stated below, which is essential for deriving the main result on the
boundedness condition of the weights of the perceptron stated in Theorem
5.1.

Lemma 5.2. If ∥x(k)∥²/2 < |(w(k) − w∗(k))T x(k)|, then ∥w(k) − w∗(k)∥² ≥ ∥w(k + 1) − w∗(k + 1)∥².

The importance of Lemma 5.2 is for deriving the result in Theorem 5.1
stated below, which describes the main result on the boundedness condition
of the weights of the perceptron.

Theorem 5.1. If ∃w∗(0) ∈ Rd+1 and ∃B̃ ≥ 0 such that ∥w∗(k)∥ ≤ B̃ ∀k ≥ 0, then ∃B′′ ≥ 0 such that ∥w(k)∥ ≤ B′′ ∀k ≥ 0 and ∀w(0) ∈ Rd+1.

For practical applications, the weights of the perceptron are required to be bounded for safety reasons. Suppose that there exists a nonempty set of initial weights such that the weights of the perceptron are bounded; otherwise, the perceptron is useless. Without Theorem 5.1, we do not know
what exact initial weights will lead to the bounded behavior. However, by
Theorem 5.1, we can conclude that it is not necessary to know the exact
initial weights which lead to the bounded behavior. This is because once
there exists a nonempty set of initial weights that leads to the bounded
behavior, then all initial weights will lead to the bounded behavior. The
result implies that engineers can employ arbitrary initial weights for the
training and the boundedness condition is independent of the choice of
the initial weights. This phenomenon is counter-intuitive to the general
understanding of symbolic dynamical systems because the system state
vectors of general symbolic dynamical systems may be bounded for some
initial system state vectors, but exhibit an unbounded behavior for other
initial system state vectors. It is worth noting that the weights of the
perceptron may exhibit complex behaviors, such as limit cycle or chaotic
behaviors.

Corollary 5.1. Define a nonlinear map Q̃ : Rd+1 → Rd+1 such that Q̃(w(n)) ≡ [q0(n), . . . , qN−1(n)]T, where qj(n) ≡ t(j) if j ≠ mod(n, N) and qj(n) ≡ Q(wT(n)x(n)) otherwise, ∀n ≥ 0 and for j = 0, 1, . . . , N − 1. If ∃w∗(0) ∈ Rd+1 and ∃B ≥ 0 such that ∥w∗(k)∥ ≤ B ∀k ≥ 0, then ∑_{∀n≥0} (t(j) − qj(n)) = 0 for j = 0, 1, . . . , N − 1.

This corollary states a sufficient condition for the boundedness of the


weights of the perceptron. Hence, this corollary can be used for testing
whether the weights of the perceptron are bounded or not.
It is worth noting that the difference between the block diagram of the
perceptron shown in Fig. 5.1 and that of conventional interpolative sigma
delta modulators [10] is that Q̃(·) is a periodically time varying system
with period N , while the nonlinear function in conventional interpolative
sigma delta modulators is a memoryless system. Moreover, Q̃(·) is not a
quantization function, while that in conventional interpolative sigma delta
modulators is a quantization function. Hence, the boundedness condition
derived in Theorem 5.1 is not applicable to the conventional sigma delta
modulators.

Fig. 5.1. A block diagram for modeling the dynamics of the weights of the perceptron.

5.4. Limit Cycle Behavior

In Section 5.3, the boundedness condition of the weights of the perceptron


has been discussed. However, even when the weights of the perceptron
are bounded, it is not guaranteed that the weights of the perceptron will
converge to limit cycles. In this section, a necessary and sufficient condition
for the occurrence of the limit cycle is discussed and the result is stated
in Lemma 5.3. These results are important because perceptrons exhibiting limit cycle behaviors are actually time periodically varying neural networks, which are a generalization of neural networks with constant coefficients. Hence, they can achieve better performances, such as a better recognition rate. By applying the result in Lemma 5.3, the perceptron can be operated under the limit cycle behaviors and better performances could be achieved. The
range of the number of updates for the weights of the perceptron to reach
the limit cycle is estimated. This result is discussed in Theorem 5.2 and
Lemma 5.4. In addition, by applying the results in Theorem 5.2 and Lemma
5.4, one can estimate the computational effort of the training algorithm,
which is also very important for practical applications.

Lemma 5.3. Suppose that q1 and q2 are co-prime and M and N are positive integers such that q1M = q2N. Then w∗(n) is periodic with period M if and only if

∑_{j=0}^{M−1} ((t(kM + j) − Q((w∗(kM + j))T x(kM + j)))/2) x(kM + j) = 0   for k = 0, 1, . . . , q1 − 1.

Lemma 5.3 can be described as follows. By duplicating q2 copies of the set of bounded training feature vectors and dividing all of them into q1 groups with M bounded training feature vectors in each group, the signed or null combination (that is, with coefficients 1, 0 or −1) of the M bounded training feature vectors in each group will be zero, where the signed or null coefficients are exactly equal to half of the difference between the desired outputs and the true outputs of the perceptron.

Lemma 5.3 generalizes the existing result on the necessary and sufficient condition for the weights of the perceptron from the fixed point behavior to the limit cycle behavior, with the period being any positive rational multiple of the number of bounded training feature
vectors. Here, we have M hyperplanes and N bounded training feature
vectors, so the weights of the perceptron exhibit the periodic behavior with
period M . Note that neither the number of hyperplanes is necessarily equal
to a positive integer multiple of the number of bounded training feature
vectors nor vice versa, that is neither M = k1 N for k1 ∈ Z+ nor N = k2 M
for k2 ∈ Z+ is necessarily required. When M = 1, q1 = N and q2 = 1,
Lemma 5.3 reduces to the existing perceptron convergence theorem. In this
case, Lemma 5.3 implies that w∗ (n) is periodic with period 1 if and only if
((t(k) − Q((w∗(k))T x(k)))/2) x(k) = 0 for k = 0, 1, . . . , N − 1.
This is equivalent to w∗(n) exhibiting a fixed point behavior if and only if (t(k) − Q((w∗(k))T x(k)))/2 = 0 for k = 0, 1, . . . , N − 1.

In other words, w∗(n) exhibits a fixed point behavior if and only if the set of bounded training feature vectors is linearly separable.
Since the limit cycle behavior is a bounded behavior, by combining
Lemma 5.3 and Theorem 5.1 together, we can conclude that the
weights of the perceptron will be bounded for all initial weights
if there exists a nonempty set of initial weights w∗(0) such that ∑_{j=0}^{M−1} ((t(kM + j) − Q((w∗(kM + j))T x(kM + j)))/2) x(kM + j) = 0 for k = 0, 1, . . . , q1 − 1.
However, it does not imply that the weights of the perceptron will
eventually exhibit a limit cycle behavior for all initial weights. The
perceptron may still exhibit complex behaviors, such as chaotic behaviors,
for some initial weights.
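Whether a particular training run has in fact settled into a limit cycle can be checked empirically from the recorded weight trajectory. The sketch below is a naive numerical test written only for illustration; the weight trajectory produced by the training sketch in Section 5.2, the transient length and the tolerance are all our own assumptions.

```python
import numpy as np

def detect_limit_cycle(trajectory, max_period, tol=1e-9):
    """Return the smallest period M such that the tail of the weight
    trajectory satisfies w(n + M) = w(n), or None if no such M is found.

    trajectory : array of shape (num_updates + 1, d + 1) of weight vectors.
    """
    tail = trajectory[len(trajectory) // 2:]   # discard the transient half
    for M in range(1, max_period + 1):
        if len(tail) > M and np.all(np.abs(tail[M:] - tail[:-M]) < tol):
            return M
    return None
```

Because every update in (5.1) adds or subtracts a training vector exactly, weights that have reached a limit cycle repeat exactly, so the tolerance only guards against floating-point effects.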
Suppose that the perceptron converges to a limit cycle with the
equivalent instantaneous initial weights w∗ (0), that is ∃n0 ≥ 0 such that
w(k + nM ) = w∗ (k) for k = 0, 1, . . . , M − 1 and for n ≥ n0 , where
w∗ (k + nM ) = w∗ (k) ∀n ≥ 0 and for k = 0, 1, . . . , M − 1. Then it is
important to estimate the range of the number of updates for w(0) to reach
{w∗ (k) for k = 0, 1, . . . , M − 1}. This is because the number of updates for
w(0) to reach the limit cycle relates to the rate of the convergence of the
perceptron and the computational effort of the training algorithm.

Theorem 5.2. Define X ≡ [x(0), . . . , x(N − 1)]. Suppose that rank(XXT) = d + 1. Define X̃ ≡ XT(XXT)−1. Denote λmax and λmin as the maximum and minimum eigenvalues of X̃X̃T, respectively. Define Cnq2N+kN+j ≡ [C̈0,nq2N+kN+j, . . . , C̈N−1,nq2N+kN+j]T ∀j ∈ {0, 1, . . . , N − 1}, ∀k ∈ {0, 1, . . . , q2 − 1} and ∀n ≥ 1, in which

C̈i,nq2N+kN+j = ∑_{p=0}^{n−1} ∑_{l=0}^{q2−1} (t(i) − Q((w(pq2N + lN + i))T x(i)))/2 + ∑_{l=0}^{k−1} (t(i) − Q((w(nq2N + lN + i))T x(i)))/2 + (t(i) − Q((w(nq2N + kN + i))T x(i)))/2

for i ≤ j, where i = 0, 1, . . . , N − 1, j = 0, 1, . . . , N − 1, k = 0, 1, . . . , q2 − 1 and ∀n ≥ 1, and

C̈i,nq2N+kN+j = ∑_{p=0}^{n−1} ∑_{l=0}^{q2−1} (t(i) − Q((w(pq2N + lN + i))T x(i)))/2 + ∑_{l=0}^{k−1} (t(i) − Q((w(nq2N + lN + i))T x(i)))/2

for i > j, where i = 0, 1, . . . , N − 1, j = 0, 1, . . . , N − 1, k = 0, 1, . . . , q2 − 1 and ∀n ≥ 1. Assume that w(nq2N + kN + j) = w∗(j) for some j ∈ {0, 1, . . . , N − 1}, for some k ∈ {0, 1, . . . , q2 − 1} and for some n ≥ 1, where w∗(nM + k) = w∗(k) ∀n ≥ 0 and for k = 0, 1, . . . , M − 1. Define C̃j ≡ X̃(w∗(j) − w(0)) for j = 0, 1, . . . , M − 1. Then ∥C̃j∥²/λmax ≤ ∥XCnq2N+kN+j∥² ≤ ∥C̃j∥²/λmin ∀j ∈ {0, 1, . . . , N − 1}, ∀k ∈ {0, 1, . . . , q2 − 1} and ∀n ≥ 1.

Although the range of the number of updates for the weights of the
perceptron to reach the limit cycle is equal to ∥Cnq2 N +kN +j ∥1 , it can be
reflected through ∥XCnq2 N +kN +j ∥2 . Hence, Theorem 5.2 provides an idea
on the range of the number of updates for the weights of the perceptron to
reach the limit cycle, which is useful for the estimation of the computational
effort of the training algorithm. In order to estimate the bounds for
∥Cnq2 N +kN +j ∥1 , denote m′ as the number of the differences between the
output of the perceptron based on w(0) and that based on w∗ (0), that is
m′ = ∑_{∀n} |(Q((w∗(n))T x(n)) − Q((w(n))T x(n)))/2|.
Then we have the following result:

Lemma 5.4. If w(nq1M + kM + j) = w∗(j) for some j ∈ {0, 1, . . . , M − 1}, for some k ∈ {0, 1, . . . , q1 − 1} and for some n ≥ 1, by defining

c ≡ ∑_{j=0}^{q1M−2} ∑_{k=j+1}^{q1M−1} ((Q((w∗(j))T x(j)) − y(j))/2) (w∗(k) − w(k))T x(j),

then

m′ ≥ (c + ∑_{k=0}^{q1−1} ∑_{j=0}^{M−1} ∥w∗(kM + j) − w(kM + j)∥²) / (∑_{p=0}^{q1−1} ∑_{i=0}^{M−1} ∥w∗(pM + i) − w(pM + i)∥ K).

The importance of Lemma 5.4 is to estimate the minimum number of updates for the weights of the perceptron to reach the limit cycle, and it is useful for the estimation of the computational effort of the training algorithm. It is worth noting that the minimum number of updates for the weights of the perceptron to reach the limit cycle depends on the initial weights. A similar result can be obtained by generalizing the conventional perceptron convergence theorem with zero initial weights to arbitrary initial weights when the set of bounded training feature vectors is linearly separable.

5.5. Application of Perceptron Exhibiting Limit Cycle


Behavior

Since time division multiplexing systems are widely used in many communications and signal processing systems, a time division multiplexing system is employed for an illustration. Consider an example
that sixteen voices from four African boys, four Asian boys, four European
girls and four American girls are multiplexed into a single channel. As two
dimensional bounded training feature vectors are easy for an illustration,
the dimension of the bounded feature vectors is chosen to be equal to two.
Without loss of generality, denote the bounded training feature vectors
of these voices as x(i) for i = 0, 1, . . . , 15, and the corresponding desired
outputs as t(i) for i = 0, 1, . . . , 15. Suppose that the voices generated by the
boys are denoted as -1, so t(4n+1) = t(4n+2) = −1 for n = 0, 1, 2, 3, while
the voices generated by the girls are denoted as 1, so t(4n) = t(4n + 3) = 1
for n = 0, 1, 2, 3. Suppose that the means of the bounded training feature
vectors corresponding to African boys are [−1, 1]T , [−0.9, 1]T , [−1, 0.9]T
and [−0.9, 0.9]T, that corresponding to Asian boys are [1, −1]T, [0.9, −1]T, [1, −0.9]T and [0.9, −0.9]T, that corresponding to American girls are [1, 1]T,
[0.9, 1]T , [1, 0.9]T and [0.9, 0.9]T and that corresponding to European girls
are [−1, −1]T , [−0.9, −1]T , [−1, −0.9]T and [−0.9, −0.9]T . Each speaker

generates 100 bounded feature vectors for transmission and the channel is
corrupted by an additive white Gaussian noise with zero mean and variance
equal to 0.5. These bounded feature vectors are used for testing. Figure
5.2 shows the distribution of these bounded testing feature vectors. On the
other hand, the perceptron is trained on the 16 noise-free bounded training feature vectors using the conventional perceptron training algorithm.

Fig. 5.2. Distribution of bounded testing feature vectors.

As the set of bounded training feature vectors is nonlinearly separable,


the conventional perceptron training algorithm does not converge to a fixed
point and the perceptron exhibits the limit cycle behavior with period four.
A perceptron exhibiting the limit cycle behavior is actually a neural network
with time periodically varying coefficients. In fact, this is a generalization
of the perceptron with constant coefficients. Hence, better performances result if the downsampled sets of bounded training feature vectors are
linearly separable. The reason is as follows: Assume that w∗ (nM + k) =
w∗ (k) ∀n ≥ 0 and for k = 0, 1, . . . , M − 1. Suppose that N is an integer
multiple of M , that is ∃z ∈ Z+ such that N = zM . By downsampling the
set of the bounded training feature vectors by M , we have M downsampled


sets of bounded training feature vectors, denoted as {x(kM + i)} for i =
0, 1, . . . , M − 1 and for k = 0, 1, . . . , z − 1. For each downsampled set of
bounded training feature vectors, if these z samples are linearly separable,
then w∗ (i) for i = 0, 1, . . . , M − 1 can be employed for the classification and
the recognition error will be exactly equal to zero. Hence, the perceptron
exhibiting the limit cycle behavior significantly improves the recognition
performance.
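The classification with a periodically varying weight sequence described above amounts to applying, to the k-th feature vector, the limit cycle weight whose phase matches k. A minimal sketch of this idea follows; the array of limit cycle weights w∗(0), . . . , w∗(M − 1) is assumed to have been extracted beforehand (for example, from the tail of the training trajectory), and the helper names are ours.

```python
import numpy as np

def classify_with_limit_cycle(X, limit_cycle_weights):
    """Classify feature vectors with a periodically varying perceptron.

    X                  : array of shape (num_vectors, d + 1); row k is x(k)
    limit_cycle_weights: list [w*(0), ..., w*(M - 1)] of weight vectors
    The k-th vector is assigned the output Q((w*(k mod M))^T x(k)).
    """
    M = len(limit_cycle_weights)
    outputs = []
    for k, x in enumerate(X):
        w_star = limit_cycle_weights[k % M]
        outputs.append(1.0 if w_star @ x >= 0 else -1.0)
    return np.array(outputs)
```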
Referring to the example discussed above, we define the classification rule as follows: assign x(k) to class −1 if Q((w∗(k))T x(k)) = 1, and assign x(k) to class 1 if Q((w∗(k))T x(k)) = −1. By running the conventional perceptron training algorithm with zero initial weights, it is found that the weights of the perceptron are bounded and converge to a limit cycle. This implies that the weights of the perceptron will also be bounded if other initial weights are employed. We generated other initial weights randomly and found that the weights of the perceptron are indeed bounded. This demonstrates the validity of Theorem 5.1.
For the zero initial weight, it is found that the recognition error is
2.25%. For comparison, a two layer perceptron is employed for solving the
corresponding nonlinearly separable problem. It is well known that if the
weights of the perceptrons are selected as w1∗ ≡ [1/2, 1/2, 1/2]T, w2∗ ≡ [−1/2, 1/2, 1/2]T and w3∗ ≡ [0, −1/2, 1/2]T, then the output of the two layer perceptron, defined
as y ′ (k) = Q([1, Q((w1∗ )T x(k)), Q((w2∗ )T x(k))]T w3∗ ), will solve the XOR
nonlinear problem, and it can be checked easily that the recognition error
for the set of the noise-free bounded training feature vectors is exactly
equal to zero. Hence, these coefficients are employed for the comparison.
It is found that the recognition error based on the two layer perceptron is
14.56%, while that based on the perceptron exhibiting a limit cycle behavior
is only 2.25%. This demonstrates that the perceptron exhibiting the limit
cycle behavior outperforms the two layer perceptron.
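The two layer perceptron used in this comparison can be reproduced directly from the weights quoted above. The short sketch below evaluates y′(k) on the four noise-free cluster centres, used here as stand-ins for the training vectors, and confirms that the XOR-type layout is separated; the variable names are ours.

```python
import numpy as np

def Q(z):
    return 1.0 if z >= 0 else -1.0

w1 = np.array([0.5, 0.5, 0.5])    # w1* = [1/2, 1/2, 1/2]^T
w2 = np.array([-0.5, 0.5, 0.5])   # w2* = [-1/2, 1/2, 1/2]^T
w3 = np.array([0.0, -0.5, 0.5])   # w3* = [0, -1/2, 1/2]^T

def two_layer_output(x):
    """y'(k) = Q([1, Q((w1*)^T x), Q((w2*)^T x)]^T w3*) for x = [1, x1, x2]^T."""
    hidden = np.array([1.0, Q(w1 @ x), Q(w2 @ x)])
    return Q(hidden @ w3)

# Noise-free cluster centres: girls (t = 1) near (1, 1) and (-1, -1),
# boys (t = -1) near (-1, 1) and (1, -1).
for x1, x2, t in [(1, 1, 1), (-1, -1, 1), (-1, 1, -1), (1, -1, -1)]:
    assert two_layer_output(np.array([1.0, x1, x2])) == t
```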

5.6. Conclusion

Unlike other symbolic dynamical systems, the boundedness condition of


the perceptron is independent of the initial weights. That means, if there exists a nonempty set of initial weights for which the weights of the perceptron are bounded, then the weights of the perceptron will be bounded for all
initial weights. Also, it is suggested that the perceptron exhibiting the
limit cycle behavior can be employed for solving a recognition problem

when the downsampled sets of bounded training feature vectors are linearly
separable.

References

[1] M. Gori and M. Maggini, Optimal convergence of on-line backpropagation,


IEEE Transactions on Neural Networks. 7(1), 251–254, (2002).
[2] S. J. Wan, Cone algorithm: An extension of the perceptron algorithm, IEEE
Transactions on Systems, Man and Cybernetics. 24(10), 1571–1576, (1994).
[3] M. Brady, R. Raghavan, and J. Slawny, Gradient descent fails to separate,
Proceedings of the IEEE International Conference on Neural Networks, 1988.
pp. 649–656, (2002).
[4] T. B. Ludermir, A. Yamazaki, and C. Zanchettin, An optimization
methodology for neural network weights and architectures, IEEE
Transactions on Neural Networks. 17(6), 1452–1459, (2006).
[5] S. G. Pierce, Y. Ben-Haim, K. Worden, and G. Manson, Evaluation of neural
network robust reliability using information-gap theory, IEEE Transactions
on Neural Networks. 17(6), 1349–1361, (2006).
[6] C. F. Juang, C. T. Chiou, and C. L. Lai, Hierarchical singleton-type
recurrent neural fuzzy networks for noisy speech recognition, IEEE
Transactions on Neural Networks. 18(3), 833–843, (2007).
[7] R. W. Duren, R. J. Marks, P. D. Reynolds, and M. L. Trumbo, Real-time
neural network inversion on the SRC-6e reconfigurable computer, IEEE
Transactions on Neural Networks. 18(3), 889–901, (2007).
[8] S. Wan and L. E. Banta, Parameter incremental learning algorithm for
neural networks, IEEE transactions on Neural Networks. 17(6), 1424–1438,
(2006).
[9] M. Basu and Q. Liang, The fractional correction rule: a new perspective,
Neural Networks. 11(6), 1027–1039, (1998).
[10] C. Y. F. Ho, B. W. K. Ling, and J. D. Reiss, Estimation of an Initial
Condition of Sigma–Delta Modulators via Projection Onto Convex Sets,
IEEE Transactions on Circuits and Systems I: Regular Papers. 53(12),
2729–2738, (2006).

Chapter 6

Artificial Neural Network Modeling with Application to


Nonlinear Dynamics

Yi Zhao
Harbin Institute of Technology Shenzhen Graduate School
Shenzhen, China
[email protected]

Artificial neural network (ANN) is well known for its strong capability
to handle nonlinear dynamical systems, and this modeling technique
has been widely applied to the nonlinear time series prediction problem.
However, this application needs caution, as overfitting is a serious problem endemic to neural networks. The conventional way of avoiding overfitting is to avoid fitting the data too precisely, but it cannot determine the exact model size directly. In this chapter,
we employ an alternative information theoretic criterion (minimum
description length) to determine the optimal architecture of neural
networks according to the equilibrium between the model parameters
and model errors. When applied to various time series, we find that the
model with the optimal architecture both generalizes well and accurately
captures the underlying dynamics. To further confirm the dynamical
character of the model residual of the optimal neural network, the
surrogate data method from the regime of nonlinear dynamics is then
described to analyze such model residual so as to determine whether
there is significant deterministic structure not captured by the optimal
neural network. Finally, a diploid model is proposed to improve the
prediction precision under the condition that the prediction error is
considered to be deterministic. The systematic framework composed
of the preceding modules is validated in sequence, and illustrated with
several computational and experimental examples.

Contents

6.1 Introduction 102
6.2 Model Structure 105
6.3 Avoid Overfitting by Model Selection 107
    6.3.1 How it works 107
    6.3.2 Case study 110
6.4 Surrogate Data Method for Model Residual 115
    6.4.1 Linear surrogate data 115
    6.4.2 Systematic flowchart 116
    6.4.3 Identification of model residual 116
    6.4.4 Further investigation 118
6.5 The Diploid Model Based on Neural Networks 120
6.6 Conclusion 121
References 122

6.1. Introduction

Nonlinear dynamics has a significant impact on a variety of applied fields,


ranging from mathematics to mechanics, from meteorology to economy,
from population to biomedical signal analysis. That is, a great many
phenomena are subject to nonlinear dynamics. For this reason, establishing
effective models to capture or learn the nonlinear dynamic mechanism of the
object system and to solve further prediction and control problems becomes
a critical issue. Considering the length of this chapter, we will focus on one
of the primary model applications: neural networks (NN) modeling for
nonlinear time series prediction. Some other applications, such as pattern
recognition, control and classification, can be found in [1–3].
In previous research, many linear methods and their variants have emerged, such as the famous autoregressive and moving average (ARMA) model. This model family, proposed by Box et al. four decades ago, has been widely used to solve the prediction problem with great success [4]. The ARMA model is composed of two parts: the autoregressive part AR(p), which predicts the current observation from a linear function of the p previous ones, and the moving average part MA(q), which calculates the mean of the time series over a moving window of size q. However, the primary limitation of ARMA models is the assumption of a linear relation between the independent and dependent variables. Moreover, ARMA models are applicable only to stationary time series modeling.
Those constraints, therefore, limit the application range and prediction
accuracy of related ARMA models. Meanwhile, artificial neural networks
(ANN) have been suggested to handle the complicated time series. They
are quite effective in modeling nonlinearity between the input and output
variables and in particular, the multi-layer feedforward neural network
with a sigmoid activation function is able to approximate any nonlinear continuous function on an arbitrary closed interval. Cybenko first
proved the preceding assertion in 1989 [5]. Then Kurt Hornik showed that
this universal approximation was closely related to the architecture of the


network but not limited to the specific activation function of the neural
network [6]. Afterwards, concerning model optimization, the architecture
becomes the first factor to be considered. In addition, in 1987, Lapedes
and Farber explained that ANN can be used for modeling and predicting
nonlinear time series in a simulated case study [7].
Supported by these theoretical proofs, feedforward neural networks
became much more popular for modeling nonlinear complicated systems.
Several prediction models based on ANN have been presented. Wilson and
Sharda employed an ANN model to predict the bankruptcy and found it
performed significantly better than the classical multivariate discrimination
method [8]. Tseng et al. proposed a hybrid prediction model combining
the seasonal autoregressive integrated moving average model and the
backpropagation neural network to predict two kinds of sales data: the
Taiwan machinery industry and soft drink production value. The result
shows that this combination brings better prediction than the single model
and other kind of models [9]. Chang et al. demonstrated that the traditional
approach of sales forecasting integrated with the appropriate artificial
intelligence was highly competitive in the practical business service [10].
D. Shanthi et al. employed the ANN model trained by the backpropagation algorithm to forecast thrombo-embolic stroke disease and obtained high predictive accuracy [11]. In summary, ANN exhibits superiority for
nonlinear time series prediction.
The significant advantage of ANN modeling can be mainly ascribed to
its imitation of the biological nervous system, a massive, highly connected array of nonlinear excitatory “neurons”. The high degree of freedom in the neural network architecture provides the potential to model complicated nonlinearity, but it also brings about uncertainty when dealing with such complicated systems. One is thus tempted to build neural networks with a fairly large number of neurons to take on the challenge.
prediction errors on the training data but failed to fit to novel data or
even performed worse (i.e. overfitted). So the crucial issue in developing
a neural network is generalization of the network: the network not only
fits the training data well but also responds properly to novel inputs. To
solve this problem satisfactorily requires a selection technique to ensure that the neural network makes reliable and accurate predictions.
Usual methods to mitigate overfitting are early stopping and statistical regularization techniques, such as weight decay and Bayesian
learning [12–14]. As Zhao and Small discussed [15], the validation set
required in the method of early stopping should be representative of all
points in the training set, and the training algorithm cannot converge too
fast. The weight decay method modifies the performance function from the mean sum of squares of the model errors to the sum of the mean sum of squares of the model errors and that of the network parameters. The key problem of this method is that it is difficult to seek an equilibrium between these two parts.
For Bayesian learning, the optimal regularization parameter, that is, the balance mentioned above, is determined in an automatic way [14]. The promising
feature of Bayesian learning is that it can measure how many network
parameters are effectively used for the given application. Although it gives
an indicator of wasteful parameters, this method cannot build a compact
(or smaller) optimal neural network.
In this chapter, we describe an alternative approach, which estimates
exactly the optimal structure of the neural network for time series
prediction. We focus on feedforward multi-layer neural networks, which have been proved to be capable of modeling nonlinear functions. The criterion is a modification of, and competitive with, the initial minimum description length (MDL) criterion [16].
This method is based on the well-known principle of minimum
description length rooted in the theory of algorithmic complexity [17].
J. Rissanen formulated model selection as a problem in data compression in a series of papers starting with [18]. Judd and Mees developed this principle for the selection of local linear models [19], and then Small and Tse extended the previous analysis to the selection of radial basis function models [20]. Zhao and Small generalized the previous results
to neural network architectures [15]. Several other modifications to the
method of minimum description length can be found in the literature
according to their own demands [21–24].
Another technique, the surrogate data method, is also described in this chapter. Surrogate data tests are examples of Monte Carlo hypothesis tests [25]. The standard surrogate data method, suggested and implemented by Theiler et al., has been widely applied in the literature [26]. This method aims to determine whether the given time series has a statistically significant deterministic component, or is just consistent with independent and identically distributed (i.i.d.) noise. We will introduce this idea at length
in Section 6.4.

In summary, we employ the method of MDL to select the optimal model


from a number of neural network candidates for the specific time series
prediction, and afterward apply the standard surrogate data method to
the residual of the optimal model (i.e. the prediction error) to determine
whether there is significant deterministic structure not captured by the
optimal neural network estimated by MDL. This combination of MDL for
model selection and surrogate data method for model residual validation
can enhance the model prediction ability effectively. In the case that the prediction error contains deterministic dynamics, a diploid model is introduced at the end of this chapter to handle this scenario.
This chapter is organized as follows. Some basic ideas about ANN
are briefly described in Section 6.2, and the MDL method developed
for selecting optimal neural networks is presented in Section 6.3; in the
next section, we discuss the application of the surrogate data method to
analyze model residuals. A diploid model is proposed in Section 6.5 to
further improve the prediction precision if necessary. Finally, we have the
conclusion of this chapter.

6.2. Model Structure

Building an ANN model for a dynamical system is a process of extracting the deterministic dynamics that govern the evolution of the system and then establishing an ANN model to describe them effectively.
As mentioned previously, the multi-layer feedforward neural network is
applicable to approximate any nonlinear function under certain conditions.
So, in this chapter, this type of neural network is our interest.
Common to all neural network architectures is the connection of input
vector, neurons (or nodes) and then output layers to establish numerous
interconnected pathways from input to output. Figure 6.1 shows the
architecture of a multilayer neural network.
Given an input vector x = (x_{t−1}, x_{t−2}, · · · , x_{t−d}), the transfer function of this neural network is mathematically described by

y(x) = b0 + ∑_{i=1}^{k} v_i f(∑_{j=1}^{d} ω_{i,j} x_{t−j} + b_i),   (6.1)

where b0, bi, vi and ωi,j are parameters, k represents the number of neurons in the hidden layer, and f(·) is the activation function of the neurons. As shown in Fig. 6.1, there is one hidden layer, but notice that multiple hidden layers are also optional.


Fig. 6.1. The multilayer network is composed of the input, hidden and output layers
[27].

The input vector is denoted by P = {p1, p2, · · · , pd} and the output (or prediction value) is denoted by y. W = {wij | i = 1, · · · , k, j = 1, · · · , d} and V = {vi | i = 1, · · · , k} are weights associated with connections between layers; b = {bi | i = 0, 1, · · · , k} are biases.
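For concreteness, the forward pass in (6.1) can be sketched as follows; the logistic sigmoid is used for f(·) purely as an example of a sigmoid-type activation, and the random parameters merely stand in for trained weights.

```python
import numpy as np

def sigmoid(z):
    """Example sigmoid activation f(.)."""
    return 1.0 / (1.0 + np.exp(-z))

def network_output(x, W, b, V, b0):
    """Evaluate y(x) = b0 + sum_i v_i f(sum_j w_ij x_{t-j} + b_i), i.e. Eq. (6.1).

    x : input vector of length d (the d previous observations)
    W : hidden-layer weights of shape (k, d)
    b : hidden-layer biases of shape (k,)
    V : output weights of shape (k,)
    b0: output bias
    """
    hidden = sigmoid(W @ x + b)   # outputs of the k hidden neurons
    return b0 + V @ hidden

# Example with d = 4 inputs and k = 3 hidden neurons (random parameters).
rng = np.random.default_rng(0)
d, k = 4, 3
y = network_output(rng.normal(size=d), rng.normal(size=(k, d)),
                   rng.normal(size=k), rng.normal(size=k), 0.1)
```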
The process of training the neural network is to update these preceding
parameters iteratively according to the mean square error between the
original target and the prediction on it. The time series is divided into two
sets: training and test sets. The first data set is used to train the neural
network. Intuitively, training modifies the weights until every output is close to the corresponding expectation. One of the classical training algorithms for feedforward neural networks is error backpropagation; hence, the neural network is sometimes called a backpropagation neural network. A number of training algorithms, including the Levenberg–Marquardt (LM) algorithm, were developed with their respective specializations [28]. For example, the LM training algorithm is known for its fast convergence.

Obviously, the performance of the neural network heavily relies on


the successful training of the model. But successful training does not simply mean an excellent fit to the training set. Trained neural networks, especially backpropagation networks, may make small prediction errors on the training set and yet give poor predictions on a novel data set. This phenomenon has been recognized as overfitting. It is an old but still active issue endemic to neural network modeling. It occurs widely in resultant neural networks, in particular those with a large number of neurons. As a consequence, the performance of neural networks in some comparative experiments is not superior to, or even as good as, other classical techniques. So, deciding how many neurons should be put in the hidden layer is a crucial step in establishing an optimal model. The next section will explain the details of the MDL method to solve this problem and avoid overfitting.

6.3. Avoid Overfitting by Model Selection

There are several typical methods applicable to model selection. Akaike


proposed his information criterion (AIC) based on a weighted function of
fitting of a maximum log-likelihood model [29]. The motivation of AIC and
its assumptions was the subject of some discussion in [18, 30, 31]. From
a practical point, the AIC tends to overfit the data [32, 33]. To address
this problem, Wallace et al., therefore, developed the minimum message
length (MML) [34]. Like MDL, MML chooses the hypothesis minimizing
the code-length of the data but its codes are quite different from those in
MDL. In addition, the Bayesian information criterion (BIC), also known as
Schwarz’s information criterion (SIC) is also an option [18, 31, 35].
It is clear that the MDL criterion is related to other well-known model selection criteria. From our point of view, the AIC and BIC perform best for linear models; for nonlinear models, description length style information criteria are better. We found that MDL in particular is robust and relatively easy to realize with reasonable assumptions [15].

6.3.1. How it works

The basic principle of minimum description length is to estimate both the


cost of describing the model parameters and model prediction errors. Let
k be the number of neurons in the neural network. Λ represents all the
parameters of the given neural network. As shown in Fig. 6.1, given the
size of input vector, the number of parameters is completely determined

by the number of neurons. We thus consider establishing the description length function of the neural network with respect to its number of neurons [15]. When we build a series of neural networks with varying numbers of neurons, we can also compute a series of description lengths of these models, which forms a DL curve along the x-axis of neurons.
Let E(k ) be the cost of counting the model prediction errors and M(k )
be the cost of counting the model parameters. The description length of
the data with respect to this model is then given by the sum [20]:

D(k) = E(k) + M (k) (6.2)

Intuitively, the typical tendency of E(k) and M(k) is that as the model size increases, M(k) increases and E(k) decreases, corresponding to more model parameters and more potential modeling ability (i.e. less prediction error), respectively. The minimum description length principle states that
the optimal model is the one that minimizes D(k ).
Let {y_i}_{i=1}^{N} be a time series of N measurements and

f (yi−1 , yi−2 , · · · , yi−d ; Λk )

be the neural network output, given d previous inputs and parameters


of the neural network described by Λk = (λ1 , λ2 , · · · , λk ) associated
with k neurons. The prediction error is thus given by ei =
f (yi−1 , yi−2 , · · · , yi−d ; Λk ) − yi . For any Λk the description length of the
model f (·; Λk ) is given by [15]:


M(k) = L(Λk) = ∑_{i=1}^{k} ln(c/δi)   (6.3)

where c is a constant and represents the number of bits required in the


representation of floating points, and δi is interpreted as the optimal
precision of the parameter λi .
Rissanen [16] showed that E(k ) is the negative logarithm of the
likelihood of the errors e = {e_i}_{i=1}^{N} under the assumed probability distribution of those errors:

E(k) = − ln Prob(e)   (6.4)

For the general unknown probability distribution of errors, the


estimation of E(k ) would be rather complicated. However, it is reasonable
to assume that the model residual is consistent with the normal Gaussian
distribution according to central limit theorem, as N is adequately large.

We get the simplified equation

E(k) = N/2 + (N/2) ln(2π/N) + (N/2) ln(∑_{i=1}^{N} e_i^2)   (6.5)

For the multilayer neural network described in the previous section, its
parameters are completely denoted by Λk = {b0, bi, vi, wi,j | i = 1, · · · , k, j =
1, · · · , d}. Of these parameters, the weights vi and the bias b0 are all
linear, the remaining parameters wi,j and bi (i = 1, · · · , k, j = 1, · · · , d)
are nonlinear. Fortunately, the activation function f (·) is approximately
linear in the region of interest so we suppose that the precision of the
nonlinear parameters is similar to that of the linear one and employ the
linear parameters to give the precision of the model, δi .
To account for the contribution of all linear and nonlinear parameters
to M(k ), instead of the contribution of only linear ones, we define np (i)
as the effective number of the parameters associated with the ith neuron
contributed to the description length of the neural network [15]. So M(k )
is updated by:


M(k) = ∑_{i=1}^{k} np(i) ln(γ/δi)   (6.6)

where δi is the relative precision of the weight vi ∈ {v1 , v2 , · · · , vk }, and


np (i) will be a variable with respect to different neurons. But in order to
make the problem tractable we make one further approximation that np (i)
is fixed for all i and then replace np (i) with np .
However, the exact value of np is so difficult to calculate that we take n̂p, the embedding dimension estimated by the method of False Nearest Neighbors (FNN), as an approximation of np [36]. In (6.3), n̂p = 1. That is, the previous work [20] just
paid attention to the linear parameters and ignored contribution of those
nonlinear ones. But for our neural networks n̂p is in the range from one to
d + 2. Embedding dimension represents the dimension of the phase space
required to unfold a point in that phase space, i.e. it represents the effective
number of inputs for a model.
With necessary but reasonable assumptions and approximations, we establish a tractable description length function with respect to the number of neurons (i.e. all the associated parameters). The minimum of this function indicates the optimal neural network, which intuitively makes the trade-off between the model parameters and the corresponding model residual.
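Under the simplifying assumptions made above (Gaussian residuals, a fixed n̂p and a single representative precision δ for the weights), the description length of a candidate network can be sketched as follows. The constant γ and the way δ is supplied are placeholders for the more careful treatment in [15], so the sketch should be read as illustrative only.

```python
import numpy as np

def description_length(residuals, k, n_p, delta, gamma=32.0):
    """Simplified D(k) = E(k) + M(k) for a network with k hidden neurons.

    residuals : model prediction errors e_1, ..., e_N on the training set
    n_p       : effective number of parameters per neuron (e.g. from FNN)
    delta     : representative relative precision of the linear weights
    gamma     : constant playing the role of gamma in (6.6) (assumed value)
    """
    e = np.asarray(residuals, dtype=float)
    N = len(e)
    E = N / 2.0 + (N / 2.0) * np.log(2.0 * np.pi / N) \
        + (N / 2.0) * np.log(np.sum(e ** 2))      # Eq. (6.5)
    M = k * n_p * np.log(gamma / delta)           # Eq. (6.6) with n_p, delta fixed
    return E + M
```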

6.3.2. Case study


In this section, we present this analysis in two case studies: a known dynamical system (Section A) and an experimental data set (Section B). The test system is the Rössler system with the addition of dynamic
noise. We then describe the application of this model selection technique to
experimental recordings of human pulse data during normal sinus rhythm.
When computing the description length, we notice that DL curves fluctuate somewhat, or even dramatically, for practical data, which is very likely to give a wrong estimation of the minimum point. The independent construction of a series of neural networks correspondingly results in fluctuation of the DL curve, but the underlying tendency estimated by the DL curve still exists. The nonlinear curve fitting procedure is introduced to smooth such perturbation
and provide an accurate estimation of the actual minimum [15].
We first define a function, which takes a variable (the number of
neurons) and a coefficient vector to fit the original curve. The form of E(k) is consistent with that of a decreasing exponential function, so a e^{−bk} (a > 0, b > 0) is used to reflect E(k). M(k) can be regarded appropriately as a linear function of the number of neurons, so a linear term ck is defined to approximate M(k). A constant d is required to compensate for an arbitrary constant missing from the computation of DL. So the defined function is F(a, b, c, d; k) = a e^{−bk} + ck + d, where a, b, c and d are the required
coefficients. Note that the previous function is the empirical approximation
of the true tendency of the DL curve.

Next, according to min_{a,b,c,d} ∑_{i=1}^{k} (F(a, b, c, d; i) − D(i))², we obtain the coefficient vector (a, b, c, d). Finally, we substitute a, b, c, d with the estimated values and determine the value of k that makes the fitted curve minimal.
The fitted DL curve provides a smooth estimation of the original description
length, and reflects the true tendency. The new estimation regarding the
minimal description length appears to be robust, especially for complicated
experimental data.
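The fitting step can be realized with any standard nonlinear least-squares routine; the sketch below uses scipy.optimize.curve_fit (one of several equally valid choices) and assumes that the description lengths D(1), . . . , D(K) have already been computed.

```python
import numpy as np
from scipy.optimize import curve_fit

def F(k, a, b, c, d):
    """Empirical model of the DL curve: F(a, b, c, d; k) = a*exp(-b*k) + c*k + d."""
    return a * np.exp(-b * k) + c * k + d

def optimal_number_of_neurons(dl_values):
    """Fit F to the description lengths D(1), ..., D(K) and return the number
    of neurons at which the fitted curve attains its minimum."""
    k = np.arange(1, len(dl_values) + 1, dtype=float)
    popt, _ = curve_fit(F, k, dl_values,
                        p0=(dl_values[0], 0.5, 1.0, 0.0), maxfev=10000)
    fitted = F(k, *popt)
    return int(k[np.argmin(fitted)])
```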

A. Computational experiments
Consider a reconstruction of the Rössler system with dynamic noise. The
equations of the Rössler system are given by ẋ(t) = −y(t) − z(t), ẏ(t) =
x(t) + a ∗ y(t), ż(t) = b + z(t) ∗ [x(t) − c] with parameters a = 0.1, b =
0.1, c = 18 to generate the chaotic data [37]. Here we set the iteration step
ts to 0.25. By dynamic noise we mean that system noise is added to the

x -component data prior to prediction of the succeeding state. That is, we


integrate this equation iteratively and then use the integrated results added
with random noise as initial data for the next step. So the random noise is
coupled into the generated data. The magnitude of the noise is set at 10%
of the magnitude of the data. We generate 2000 points of this system of
which 1600 points are selected to train the neural network and the rest are
the testing data. We calculate the description length of 20 neural networks constructed with 1 to 20 neurons, as shown in Fig. 6.2.
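The data generation just described can be sketched as follows; a fourth-order Runge-Kutta step is used for the integration (the chapter does not specify the integrator), and "magnitude of the data" is read here as the standard deviation of a noise-free run, both of which are our own assumptions.

```python
import numpy as np

def rossler_rhs(state, a=0.1, b=0.1, c=18.0):
    x, y, z = state
    return np.array([-y - z, x + a * y, b + z * (x - c)])

def rk4_step(state, dt):
    k1 = rossler_rhs(state)
    k2 = rossler_rhs(state + 0.5 * dt * k1)
    k3 = rossler_rhs(state + 0.5 * dt * k2)
    k4 = rossler_rhs(state + dt * k3)
    return state + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

def noisy_rossler_x(n_points=2000, dt=0.25, noise_level=0.1, seed=0):
    """x-component data with dynamic noise: after each step the x-component
    is perturbed and the perturbed state seeds the next step, so the noise
    is coupled into the generated data."""
    rng = np.random.default_rng(seed)
    # Noise-free run used only to estimate the magnitude of the data.
    state, clean = np.array([1.0, 1.0, 1.0]), []
    for _ in range(n_points):
        state = rk4_step(state, dt)
        clean.append(state[0])
    amplitude = np.std(clean)
    # Second run with dynamic noise at noise_level (10%) of that magnitude.
    state, xs = np.array([1.0, 1.0, 1.0]), []
    for _ in range(n_points):
        state = rk4_step(state, dt)
        state[0] += noise_level * amplitude * rng.normal()
        xs.append(state[0])
    return np.array(xs)
```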

Fig. 6.2. Description Length (solid line) of neural networks for modeling the Rössler
system (left panel) has the minimum point at five, and the fitted curve (dashed line)
attains the minimum at the same point. In the right panel the solid line is the mean
square error of training set and the dotted line is that of testing data.

We observe that both the DL and fitted curves denote that the optimal
number of neurons is five. That is, the neural network with five neurons is
the optimal model according to the principle of the minimum description
length. Mean square error of the testing set gives little help in indicating
the appearance of overfitting.
As a comparison, we chose three other networks with different numbers of neurons to perform a free-run prediction for the testing set. In free-run prediction, each predicted value is based on the current and previous predicted values, in contrast to the so-called one-step prediction. The predicted x-component data is converted into three vectors,
x(t), x(t + 3) and x(t + 5) (t ∈ [1, 395]) to construct the phase space shown
in Fig. 6.3. The network with five neurons exactly captures the dynamics
of the Rössler system but the neural network with more neurons is apt to
overfit. As the test data is chaotic Rössler data, the corresponding attractor
should be full of reconstructed trajectories, as shown in Fig. 6.3(b).
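The difference between one-step and free-run prediction can be made concrete with a short sketch. Here `model` stands for any trained one-step predictor that maps the d previous values to the next value (for instance, the network of Section 6.2); the interface is an assumption for illustration.

```python
import numpy as np

def free_run_prediction(model, seed_values, n_steps):
    """Iterate a one-step predictor on its own outputs.

    model       : callable mapping a length-d input vector to the next value
    seed_values : the last d observed values used to start the free run
    n_steps     : number of future values to generate
    """
    history = list(seed_values)
    d = len(seed_values)
    predictions = []
    for _ in range(n_steps):
        x = np.array(history[-d:])   # current and previous predicted values
        y_hat = model(x)             # one-step prediction
        predictions.append(y_hat)
        history.append(y_hat)        # feed the prediction back as input
    return np.array(predictions)
```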

Fig. 6.3. Four reconstructions of free-run predictions of neural networks with 3, 5, 7


and 15 neurons.

B. Experimental data

Generally, experimental data are more complicated than computational


data since the measurement of these data is usually contaminated
with observational noise somewhat and also the deterministic dynamics
governing the evolution of the real system is unknown. Hence, experimental
data are much more difficult to predict accurately. We apply this
information theoretic method to experimental recordings of human pulse
data to validate its practicality.
We randomly utilize 2550 points to build neural networks and the
consecutive 450 data points to test these well-trained models. Note that in
the previous simulation example, the FNN curve that is used to estimate n̂p drops to zero and remains at zero, but for practical ECG data the curve behaves like that of a noisy signal (i.e. it cannot reach zero). The inflection
point of the FNN curve in this case points out the value of n̂p . Figure

6.4 describes description length and mean square error for this application
respectively.

Fig. 6.4. Description length (solid line) and the fitted curve (dashed line) of ECG data
(left panel) estimate the minimal points at the 11 and 7 point respectively. The right
one is the mean square error of training set (solid line) and testing set (dotted line).

The DL curve suggests the optimal number of neurons is 11, but the
fitted curve estimates that the optimal number of neurons is 7. In this
experiment the mean square error (MSE) of testing data cannot give
valuable information regarding the possibility of overfitting either. We
thus deliberately select neural networks with both 7 and 11 neurons for
verification. As in the previous case, another two neural networks, with 5 neurons and 17 neurons, are also used for comparison. All the free-run predictions obtained by these models are illustrated in Fig. 6.5.
The network with 7 neurons can predict the testing data accurately, but the prediction of the network with 11 neurons overfits over the evolution of time. So we confirm that the fitted curve shows robustness against
fluctuation. The neural network with 7 neurons can provide adequate
fitting to both training and novel data, and networks with more neurons,
such as 11, overfit. Although referring to the DL line one may decide
the wrong optimal number of neurons, the fitted curve reflects the true
tendency hidden in the original DL estimation. For the computational data, both the original DL curve and the fitted one provide the same or close estimates, while for the practical data it is demonstrated that the nonlinear curve fitting is necessary and effective.

Fig. 6.5. Free-run prediction (dotted line) and actual ECG data (solid line) for 4 neural
networks with 5, 7, 11 and 17 neurons.

By observation, readers are persuaded that the short-term prediction


given by the selected optimal neural network is nearly perfect and these
optimal models accurately capture the underlying deterministic dynamics
of the given time series. However, how close is close enough? The model
residual always exists as long as the target data and its prediction are not
the same. So it could be hasty to make the previous conclusion without
investigation of the model residual. Readers are referred to the next section,
where we discuss how to examine the dynamical characters of this residual
in a statistical way. If the model residual is merely random noise, it then
gives a positive support to the performance of optimal neural networks as
well as the proposed information theoretic criterion. If the model residual
still contains significant determinism, it then indicates that these models
are not omnipotent to deal with the original time series. We, therefore,
present a solution to enhance the capability of the current neural network


modeling in Section 6.5. One may find some interesting contents.

6.4. Surrogate Data Method for Model Residual

The surrogate data method has been widely applied in the literature. It was
proposed to analyze whether the observed data is consistent with the given
null hypothesis. Here, the observed data is the previous model residual.
Hence, we employ the surrogate data method to answer the question in the
last section. The given hypothesis is that such data is consistent with random noise, known as NH0.
In fact, this technique can also be used prior to modeling to find whether
the data comes from the deterministic dynamics or merely random noise.
In the latter case, it is wasteful to make the prediction on it. Stochastic
process tools may be more applicable to model such data. Consequently,
the surrogate data method before modeling provides an indirect evidence
of predictability of the data.

6.4.1. Linear surrogate data


The principle of surrogate data hypothesis testing is to generate an ensemble
of artificial surrogate data (surrogates in short) by using a surrogate
generation algorithm and ensure that the generated surrogates are consistent with a certain null hypothesis. One then applies some test statistics to both
surrogates and the original data to observe whether the original data
is consistent with the given hypothesis. Commonly employed standard
hypotheses include [38]:

• NH0: The data is independent and identically distributed (i.i.d.) noise.


• NH1: The data is linearly filtered noise.
• NH2: The data is a static monotonic nonlinear transformation of linearly
filtered noise.

If the test statistic value for the data is distinct from the ensemble of values
estimated for surrogates, then one can reject the given null hypothesis as
being a likely origin of the data. If the test statistic value for the data
is consistent with that for surrogates, then one may not reject the null
hypothesis. Consequently, surrogate data provides a rigorous way to apply
the statistical hypothesis testing to exclude experimental time series from
the family of certain dynamics. One can apply it to determine whether an
observed time series has a statistically significant deterministic component,


or just is random noise.
There are three algorithms to generate surrogate data corresponding to
the hypotheses above, known as Algorithm 0, Algorithm 1 and Algorithm 2.
We only describe Algorithm 0 that we use. For Algorithm 0, the sequence
of the data is randomly shuffled. Such shuffling will destroy any temporal
autocorrelation of the original data. In essence such surrogates are random
but consistent with the same probability distribution as that of the original.
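Algorithm 0 amounts to a random permutation of the observed sequence. A minimal sketch of the surrogate generation, and of the comparison of a test statistic between the data and the surrogate ensemble, is given below; `statistic` is a placeholder for whatever discriminating statistic is used (in this chapter, the correlation dimension estimated by GKA), not an implementation of it.

```python
import numpy as np

def algorithm0_surrogates(data, n_surrogates, seed=0):
    """Algorithm 0 surrogates: random shuffles that destroy any temporal
    autocorrelation but preserve the amplitude distribution of the data."""
    rng = np.random.default_rng(seed)
    return [rng.permutation(np.asarray(data)) for _ in range(n_surrogates)]

def surrogate_test(data, statistic, n_surrogates=50, seed=0):
    """Compare the statistic of the data against its distribution over the
    surrogate ensemble; returns the data value, the ensemble mean and the
    ensemble standard deviation."""
    surrogate_values = np.array(
        [statistic(s) for s in algorithm0_surrogates(data, n_surrogates, seed)])
    return (statistic(np.asarray(data)),
            surrogate_values.mean(), surrogate_values.std())
```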
To test the hypothesis of surrogate data, one must select an appropriate test statistic. Among the available criteria, correlation dimension is a popular choice [39]. By the value of correlation dimension, one can
differentiate the time series with different dynamics. For example, the
correlation dimension for a closed curve (a periodic orbit) is one and a
strange (fractal) set can have a correlation dimension that is not an integer.
Correlation dimension is now a standard tool in the area of nonlinear
dynamics analysis. We adopt the Gaussian Kernel Algorithm (GKA) [40] to estimate the correlation dimension.

6.4.2. Systematic flowchart


Reviewing the workflow or basic ideas introduced previously, we summarize
the contents in sequence by using a flowchart, as listed in Fig. 6.6. In
summary, we incorporate neural network, MDL as well as the nonlinear
curve fitting and the surrogate data method into one comprehensive
modeling module for time series prediction [41, 42]. In other words,
the modeling module constitutes three sub-modules with their respective
purposes.

6.4.3. Identification of model residual


The surrogate data method is used to investigate the model residual of
the optimal neural network selected by MDL for the previous two cases.
We generate 50 surrogates of the prediction error for each case. The
given hypothesis is NH0, i.e. such prediction error is consistent with
i.i.d. noise. A suitable embedding dimension is required to estimate the correlation dimension, as it is a scale invariant measured on the phase space reconstruction. There is no universal criterion to select the embedding dimension, so we employ embedding dimensions varying from two to nine to calculate the correlation dimension. The delay time of the phase space reconstruction is chosen to be one, the same as the time lag of the input vector.

(Flowchart of Fig. 6.6: collect the time series; model it with neural networks of n = 1, 2, . . . neurons until n reaches an adequate value; MDL, with the help of nonlinear curve fitting, determines the optimal model; given the hypothesis, the surrogate data method is applied to the prediction error of the optimal model to decide whether or not to reject the hypothesis; the optimal network is then applied to the dynamics of the test data.)
Fig. 6.6. Systematic integration of neural networks, MDL, and the surrogate data
method to capture the dynamics of the observed time series.

Based on whether correlation dimension of the model error is out of or


in the range of correlation dimension for the surrogates, we can determine
whether to reject or fail to reject this given hypothesis. Typical results are
depicted in Fig. 6.7.
One can find that the correlation dimension of the original error stays
in the range of the mean plus or minus one standard deviation between
de = 2 and de = 6. Moreover, most of them are close to the average, which
means correlation dimension of the original data is close to the center of
the distribution of correlation dimension for all the surrogates. Therefore,
the original data is not distinguishable from the results of the surrogates.
Consequently we cannot reject the given hypothesis. This result indicates
that the previous prediction errors are consistent with random noise (i.e.
there is no significant determinism in the prediction error). In other words,
we conclude that the optimal neural networks estimated by MDL accurately
capture the underlying dynamics in both cases.
Note that the behavior of correlation dimension calculated by GKA with
embedding dimension higher than six becomes unstable. Large embedding
dimension yields unexpected correlation dimension that is even lower than
zero for some surrogates. This means that such embedding dimension is
March 2, 2012 11:56 World Scientific Review Volume - 9in x 6in 01˙ws-rv9x6/Chp. 6

118 Yi Zhao

Fig. 6.7. Analysis of the prediction error for the Rössler system (left panel) and human
pulse data (right panel). Stars are the correlation dimension of the original model residual
of the optimal model; the solid line is the mean of correlation dimension of 50 surrogates
at every embedding dimension; two dashed lines denote the mean plus one standard
deviation (the upper line) and the mean minus one standard deviation (the lower line);
two dotted lines are the maximum and minimum correlation dimension among these
surrogates.

inappropriate to estimate correlation dimension. So the proper embedding


dimension should be lower than six in both cases.

6.4.4. Further investigation

However, it may be possible that the surrogate data method failed


to distinguish the deterministic components even though significant
determinism exists. To address this problem, we further consider another
experiment in which we add the prediction error with (deterministic but
independent) observational “noise”. The “noise” is from the Rössler system
given in the third section. We take the prediction error of pulse modeling
as an example.
The magnitude of x -component data is set at 2% of the magnitude of
the original prediction error. So the added noise is considerably smaller
in contrast to the original prediction error. Relevant results under this
scenario are presented in Fig. 6.8, which reveals the deviation between the
correlation dimension of the “new” model residual and its surrogates.
Since the deterministic Rössler dynamics is added to the original
prediction error, new data should contain the deterministic dynamics. In
March 2, 2012 11:56 World Scientific Review Volume - 9in x 6in 01˙ws-rv9x6/Chp. 6

Artificial Neural Network Modeling with Application to Nonlinear Dynamics 119

Fig. 6.8 we can observe that the correlation dimension of this new error is
even further away from the maximal boundary of correlation dimension for
surrogate between de = 2 and de =6. So we can reject the given hypothesis
that the original error is the i.i.d noise. It is consistent with our expectation.
Again, numerical problems with GKA are evident for de ≥ 7.
We notice that the surrogate data method can exhibit the existence
of this deterministic structure in the model residual even if it is weak in
comparison with the original signal. On the contrary, if there is no difference
between the model prediction and data, the surrogate data method can also
exhibit corresponding consistency.
We apply the surrogate data method to the residual of the optimal
model (i.e. the prediction error). It estimates correlation dimension
for this prediction error and its surrogates under the given hypothesis
(NH0): the prediction error is consistent with i.i.d noise. According
to results, we cannot reject that the prediction error is i.i.d noise. We
conclude that with the test statistic at our disposal, the predictions achieved
by the optimal neural network and original data are indistinguishable.
Combination of neural networks, minimum description length and the
surrogate data method provides a comprehensive framework to handle time
series prediction. We feel that this technique is important and will be
applicable to a wide variety of real-world data.

Fig. 6.8. Application of the surrogate data method to the prediction error contaminated
with deterministic dynamics. All the donation of curves and symbol are the same as the
previous graph.
March 2, 2012 11:56 World Scientific Review Volume - 9in x 6in 01˙ws-rv9x6/Chp. 6

120 Yi Zhao

6.5. The Diploid Model Based on Neural Networks

In the previous section, all the prediction is made based on a single model.
However, the observed data may come from various different dynamics. So
a single model is not qualified for handling this kind of complicated data.
Section 6.4.4 gave an example to describe this case. The single model
usually just captures one dominating dynamic but fails to respond to other
relatively weak dynamics also hidden in the data. Thus, new integrated
model structures are developed to model those data with more dynamics.
It has been well verified that the prediction can be improved significantly
by the combination of different methods [43, 44]. Several combining schemes
have been proposed to enhance the ability of the neural network. Horn
described a combined system of the two feedforward neural networks to
predict the future value [45]. Lee et al. proposed an integrated model with
backpropagation neural networks and self organizing feature map (SOFM)
model for bankruptcy prediction [46]. Besides that, the combination
of autoregressive integrated moving average model (ARIMA) and neural
networks are often mentioned as a common practice to improve prediction
accuracy of the practical time series [47, 48]. They both show that the
prediction error of the combined model is smaller than just the single
model, which exhibits the advantages of the combined model. Inspired by
the previous works, we develop an idea of the diploid model for predicting
complicated data.
We construct the diploid neural network in a series way, which is defined
as the series-wound model (Fig. 6.9). The principle of series-wound diploid
model is that after the prediction of model I, its prediction error is used as
the input data for model II to compensate the error of model I. The final
results will be the sum of each output of both models.

Inputs Model I Model II Outputs

Fig. 6.9. The diploid neural networks constructed in a series-wound.

Suppose that model I is represented by the function f (·), and model II


is represented by the function S(·). We briefly describe the principle of the
diploid series-wound model as follow:
March 2, 2012 11:56 World Scientific Review Volume - 9in x 6in 01˙ws-rv9x6/Chp. 6

Artificial Neural Network Modeling with Application to Nonlinear Dynamics 121

Model I gives its prediction on the long-term tendency, x̂t = f (x),


where x = {xt−1 , xt−2 , · · · , xt−d }, d is the number of input data and
x̂t denotes the prediction value of model I. Then model II is used for
remedying the residual of model I. So we use the prediction error of
the previous model as the training set to train it. That is, it aims to
capture the short-term dynamics left in the previous model residual. So
model II makes its predictions given by the equation x̂ ˆt = S(e), where
e = {et−1 , et−2 , · · · , et−l } and l denotes the number of predictions that
the previous model obtained. The sum of both equations gives the final
prediction on the original time series with the equation x̄t+1 = S(e) + f (x).
It compensates the missing components of the first prediction with the new
prediction of the second model. The final prediction is expected to capture
both underlying short-term and long-term tendencies of the data.
As we know, the complicated time series is usually generated by
several different dynamics. For simplicity, it is reasonable to classify those
dynamics into two categories: fast dynamics and slow dynamics, which
determine the short-term and long-term varying tendencies of the time
series respectively. We wish to develop an approach of double multi-layer
feed-forward neural networks, denominated as the diploid model to capture
both dynamics. The large time-scale inputs are fed to the first (global)
network while the small time-scale inputs are selected for the second (local)
network. At step one an optimal global neural network determined by the
minimal description length makes its respective prediction on the original
time series to follow the slow dynamics. Local dynamics exist in the model
residual of the first one. At step two a local neural network is constructed
to remedy global model prediction, i.e. capture the fast dynamics in the
previous model residual.
In addition, the surrogate data method can provide an auxiliary measure
to confirm the dynamic character of the model residual at each stage so as
to validate the performance of the single model and the diploid model.
Significantly, this new strategy introduces a new path for establishing the
diploid ANN to deal with complex systems.

6.6. Conclusion

The modeling technique concerning artificial neural networks has been


widely applied to capture nonlinear dynamics. Here, we address only one of
these primary applications: prediction of nonlinear time series. Our interest
is to seek the optimal (or appropriate) architecture of the multi-layer neural
March 2, 2012 11:56 World Scientific Review Volume - 9in x 6in 01˙ws-rv9x6/Chp. 6

122 Yi Zhao

network to avoid overfitting and further provide adequate generalization for


general time series prediction.
In the previous related works, researchers usually focused on
optimization of the inherent parameters of the neural networks or training
strategies. So these methods do not directly estimate how many neurons of
the neural network are sufficient to a specific application. In this chapter, we
describe an information theoretic criterion based on minimum description
length to decide the optimal model. Moreover, the nonlinear fitted function
is defined to reflect the tendency of the original DL curve so as to smooth
its fluctuation. The experimental results demonstrate that the fitted curve
is quite effective and robust in both computational and practical data sets.
To further confirm the performance of the optimal neural networks, we
utilize the standard surrogate data method to analyze the prediction error
obtained by these neural networks based on MDL. If the model residual is
consistent with the given hypothesis of random noise, it then indicates
that there is no deterministic structure in the residual. That is, the
optimal model accurately follows dynamics of the data. A comprehensive
procedure that integrates the neural network modeling with the surrogate
data method is illustrated in this work. Application of the surrogate data
method to the model residual paves the way for evaluating the performance
of corresponding models. If the result is negative, more cautions are
required to use the current model. Or another advanced model is suggested.
The idea of a diploid model is then presented to deal with the complex
system prediction. It aims to follow both short-term and long-term varying
tendencies of the time series. The related work is in progress, and the
preliminary results show its great advantage of modeling the simulated
time series composed of components of the Lorenz system and Ikeda map.

Acknowledgments

This work was supported by the China National Natural Science


Foundation Grant No. 60801014, the Natural Science Foundation Grant
of Guangdong Province No. 9451805707002363 and the Scientific Plan of
Nanshan District, Shenzhen, China.

References

[1] B. Lippmann, Pattern classification using neural networks, Communications


Magazine. 27, 47–50, (1989).
March 2, 2012 11:56 World Scientific Review Volume - 9in x 6in 01˙ws-rv9x6/Chp. 6

Artificial Neural Network Modeling with Application to Nonlinear Dynamics 123

[2] P. J. Antsaklis, Neural networks for control systems, Neural Networks. 1,


242–244, (1990).
[3] G. P. Zhang, Neural networks for classification: A survey, IEEE Transactions
on Systems, Man, and Cybernetics – Part C: Application and Review. 30,
451–462, (2000).
[4] G. E. P. Box, G. M. Jenkins, and G. C. Reinsel, Time Series Analysis:
Forecasting and Control. (Prentice Hall, Upper Saddle River, 1994).
[5] P. J. Antsaklis, Approximations by superpositions of sigmoidal functions,
Mathematics of Control, Signals, and Systems. 2, 303–314, (1990).
[6] K. Hornik, Approximation capabilities of multilayer feedforward networks,
Neural Networks. 4, 251–257, (1991).
[7] A. Lapedes and R. Farber, Nonlinear signal processing using neural network
prediction and system modeling, Technical Report. (1987).
[8] R. L. Wilson and R. Sharda, Bankruptcy prediction using neural networks,
Decision Support Systems. 11, 545–557, (1994).
[9] F.-M. Tseng, H.-C. Yu, and G.-H. Tzeng, Combining neural network model
with seasonal time series ARIMA model, Technological Forecasting and
Social Change. 69, 71–87, (2002).
[10] P. C. Chang, C. Y. Lai, and K. R. Lai, A hybrid system by evolving
case-based reasoning with genetic algorithm in wholesaler’s returning book
forecasting, Decision Support Systems. 42, 1715–1729, (2006).
[11] D. Shanthi, G. Sahoo, and N. Saravanan, Designing an artificial neural
network model for the prediction of Thrombo-embolic stroke, International
Journals of Biometric and Bioinformatics. 3, 10–18, (2009).
[12] A. Weigend, On overfitting and the effective number of hidden units,
Proceedings of the 1993 Connectionist Models Summer School. pp. 335–342,
(1994).
[13] J. E. Moody, S. J. Hanson, and R. P. Lippmann, The effective number of
parameters: An analysis of generalization and regularization in nonlinear
learning systems, Advances in Neural Information Processing Systems. 4,
847–854, (1992).
[14] D. J. C. MacKay, Bayesian interpolation, Neural Computation. 4, 415–447,
(1992).
[15] Y. Zhao and M. Small, Minimum description length criterion for modeling of
chaotic attractors with multilayer perceptron networks, IEEE Transactions
on Circuit and Systems-I: Regular Papers. 53, 722–732, (2006).
[16] J. Rissanen, Stochastic Complexity in Statistical Inquiry. (World Scientific,
Singapore, 1989).
[17] M. Li and P. Vitnyi, An Introduction of Kolmogorov and its Applications.
(Springer–Verlag, Berlin, 2009).
[18] J. Rissanen, Modeling by the shortest data description, Automatica. 14,
465–471, (1978).
[19] K. Judd and A. Mees, On selecting models for nonlinear time series, Physica
D. 82, 426–444, (2009).
[20] M. Small and C. K. Tse, Minimum description length neural networks for
time series prediction, Physical Review E. 66, (2009).
March 2, 2012 11:56 World Scientific Review Volume - 9in x 6in 01˙ws-rv9x6/Chp. 6

124 Yi Zhao

[21] L. Xu, Advances on BYY harmony learning: Information theoretic


perspective generalized projection geometry and independent factor
autodetermination, Neural Networks. 15, 885–902, (2004).
[22] C. Alippi, Selecting accurate, robust, and minimal feedforward neural
networks, IEEE. Transactions on Circurits and Systems I: Regular Papers.
49, 1799–1810, (2002).
[23] H. Bischof and A. Leonardis, Finding optimal neural networks for land use
classification, IEEE Transactions on Geoscience and Remote Sensing. 36,
337–341, (1998).
[24] X. M. Gao, Modeling of speech signals using an optimal neural network
structure based on the PMDL principle, IEEE Transactions on Speech and
Audio Processing. 6, 177–180, (1998).
[25] A. Galka, Topics in Nonlinear Time Series Analysis with Applications for
ECG Analysis. (World Scientific, Singapore, 2000).
[26] D. W. Baker and N. L. Carter, Testing for nonlinearity in time series: the
method of surrogate data, Physica D. 58, 77–94, (1972).
[27] H. Demuth and M. Beale, Neural Network Toolbox User’s Guide. (The Math
Works, USA, 1998).
[28] M. T. Hagan and M. Menhaj, Training feedforward networks with the
Marquardt algorithm, Neural Networks. 5, 989–993, (1994).
[29] H. Akaike, A new look at the statistical model identification, IEEE
Transactions on Automatic Control. 19, 716–723, (1974).
[30] G. Schwarz, Estimating the dimension of a model, The Annals of Statistics.
6, 461–464, (2009).
[31] M. Stone, Comments on model selection criteria of Akaike and Schwarz, The
Royal Statistical Society. 41, 276–278, (1979).
[32] J. Rice, Bandwidth choice for nonparameteric regression, The Annals of
Statistics. 12, 1215–1230, (1984).
[33] Z. Liang, R. J. Jaszczak, and R. E. Coleman, Parameter estimation of finite
mixtures using the EM algorithm and information criteria with application
to medical image processing, IEEE Transactions on Nuclear Science. 39,
1126–1133, (1992).
[34] C. Wallace and D. Boulton, An information measure for classification,
Computing Journal. 11, 185–195, (1972).
[35] B. A. Barron and J. Rissanen, The minimum description length principle
in coding and modeling, IEEE Transactions on Information Theory. 4,
2743–2760, (1972).
[36] M. B. Kennel, R. Brown, and H. D. I. Abarbanel, Determining embedding
dimension for phase-space reconstruction using a geometrical construction,
Physical Review A. 45, 3403–3411, (1992).
[37] O. E. Rossler, An equation for continuous chaos, Physics Letters A. 57,
397–398, (1976).
[38] J. Theiler, Testing for nonlinearity in time series: The method of surrogate
data, Physica. D. 58, 77–94, (1992).
[39] M. Small and C. K. Tse, Applying the method of surrogate data to cyclic
time series, Physica D. 164, 182–202, (2002).
March 2, 2012 11:56 World Scientific Review Volume - 9in x 6in 01˙ws-rv9x6/Chp. 6

Artificial Neural Network Modeling with Application to Nonlinear Dynamics 125

[40] D. Yu, Efficient implementation of the Gaussian kernel algorithm in


estimating invariants and noise level from noisy time series data, Physical
Review E. 61, 3750, (2000).
[41] Y. Zhao and M. Small, Equivalence between “feeling the pulse” on the
human wrist and the pulse pressure wave at fingertip, International Journal
of Neural Systems. 15, 277–286, (2005).
[42] Y. Zhao, J. F. Sum, and M. Small, Evidence consistent with deterministic
chaos in human cardiac data: Surrogate and nonlinear dynamical modeling,
International Journal of Bifurcations and Chaos. 18, 141–160, (2008).
[43] S. Makridakis, Why combining works, International Journal of Forecasting.
5, 601–603, (1989).
[44] F. Palm and A. Zellner, To combine or not to combine? Issue of combining
forecasts, International Journal of Forecasting. 11, 687–701, (1992).
[45] G. D. Horn and I. Ginzburg, Combined neural networks for time series
analysis, Neural Information Processing Systems - NIPS. 6, 224–231, (1994).
[46] K. C. Lee, I. Han, and Y. Kwon, Hybrid neural network models for
bankruptcy predictions, Decision Support Systems. 18, 63–72, (1996).
[47] G. P. Zhang, Time series forecasting using a hybrid ARIMA and neural
network model, Neurocomputing. 50, 159–175, (2003).
[48] L. Aburto and R. Weber, Improved supply chain management based on
hybrid demand forecasts, Applied Soft Computing. 7, 136–144, (2007).
March 2, 2012 11:56 World Scientific Review Volume - 9in x 6in 01˙ws-rv9x6/Chp. 6

This page intentionally left blank

126
March 2, 2012 11:56 World Scientific Review Volume - 9in x 6in 01˙ws-rv9x6/Chp. 7

Chapter 7

Solving Eigen-problems of Matrices by Neural Networks

1
Yiguang Liu, 1 Zhisheng You, 2 Bingbing Liu and 1 Jiliu Zhou
1
Video and Image Processing Lab, School of Computer Science &
Engineering, Sichuan University, China 610065,
[email protected]
2
Data Storage Institute, A* STAR, Singapore 138632

How to efficiently solve eigen-problems of matrices is always a significant


issue in engineering. Neural networks run in an asynchronous manner,
and thus applying neural networks to address these problems can attain
high performance. In this chapter, several recurrent neural network
models are proposed to handle eigen-problems of matrices. Each model
is expressed as an individual differential equation, with its analytic
solution being derived. Subsequently, the convergence properties of
the neural network models are fully discussed based on the solutions to
these differential equations. Finally, the computation steps are designed
toward solving the eigen-problems, with numerical simulations being
provided to evaluate the effectiveness of each model.
This chapter consists of three major parts, with each approach in
these three parts being in the form of neural networks. Section 7.1
presents how to solve the eigen-problems of real symmetric matrices;
Sections 7.2 and 7.3 are devoted to addressing the eigen-problems of
anti-symmetric matrices; Section 7.4 aims at solving eigen-problems of
general matrices, which are neither symmetric nor anti-symmetric.
Finally, conclusions are made in Section 7.5 to summarize the whole
chapter.

Contents
7.1 A Simple Recurrent Neural Network for Computing the Largest and Smallest
Eigenvalues and Corresponding Eigenvectors of a Real Symmetric Matrix . . 128
7.1.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
7.1.2 Analytic solution of RNN . . . . . . . . . . . . . . . . . . . . . . . . . 130
7.1.3 Convergence analysis of RNN . . . . . . . . . . . . . . . . . . . . . . . 131
7.1.4 Steps to compute λ1 and λn . . . . . . . . . . . . . . . . . . . . . . . . 135
7.1.5 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135

127
March 2, 2012 11:56 World Scientific Review Volume - 9in x 6in 01˙ws-rv9x6/Chp. 7

128 Yiguang Liu, Zhisheng You, Bingbing Liu and Jiliu Zhou

7.1.6 Section summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141


7.2 A Recurrent Neural Network for Computing the Largest Modulus Eigenvalues
and Their Corresponding Eigenvectors of an Anti-symmetric Matrix . . . . . 142
7.2.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
7.2.2 Analytic solution of RNN . . . . . . . . . . . . . . . . . . . . . . . . . 145
7.2.3 Convergence analysis of RNN . . . . . . . . . . . . . . . . . . . . . . . 148
7.2.4 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
7.2.5 Section summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
7.3 A Concise Recurrent Neural Network Computing the Largest Modulus
Eigenvalues and Their Corresponding Eigenvectors of a Real Skew Matrix . . 155
7.3.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
7.3.2 Analytic solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
7.3.3 Convergence analysis of RNN . . . . . . . . . . . . . . . . . . . . . . . 158
7.3.4 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
7.3.5 Comparison with other methods and discussions . . . . . . . . . . . . . 163
7.3.6 Section summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
7.4 A Recurrent Neural Network Computing the Largest Imaginary or Real Part
of Eigenvalues of a General Real Matrix . . . . . . . . . . . . . . . . . . . . . 166
7.4.1 Analytic expression of |z (t)|2 . . . . . . . . . . . . . . . . . . . . . . . 168
7.4.2 Convergence analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
7.4.3 Simulations and discussions . . . . . . . . . . . . . . . . . . . . . . . . 172
7.4.4 Section summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
7.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179

7.1. A Simple Recurrent Neural Network for Computing the


Largest and Smallest Eigenvalues and Corresponding
Eigenvectors of a Real Symmetric Matrix

Using neural networks to compute eigenvalues and eigenvectors of a matrix


is both quick and parallel. Fast computation of eigenvalues and eigenvectors
has many applications, e.g. primary component analysis (PCA) for image
compression and adaptive signal processing etc. Thus, many research works
in this field have been reported [1–11]. Zhang Yi et al. [8] proposed a
recurrent neural network (RNN) to compute eigenvalues and eigenvectors
of a real symmetric matrix, which is as follows
dx (t) [ ( ) ]
= −x (t) + xT (t) x (t) A + 1 − xT (t) Ax (t) I x (t) ,
dt
T
where t ≥ 0, x = (x1 , · · · , xn ) ∈ Rn represents the states of neurons, I is
an identity matrix, the matrix A ∈ Rn×n waiting for computing eigenvalues
and eigenvectors is symmetric. Obviously, this neural network is wholly
equivalent to the following neural networks
dx (t)
= xT (t) x (t) Ax (t) − xT (t) Ax (t) x (t) . (7.1)
dt
March 2, 2012 11:56 World Scientific Review Volume - 9in x 6in 01˙ws-rv9x6/Chp. 7

Solving Eigen-problems of Matrices by Neural Networks 129

We propose a new neural network that is applicable to computing


eigenvalues and eigenvectors of real symmetric matrices, which is described
as follows
dx (t)
= Ax (t) − xT (t) x (t) x (t) , (7.2)
dt
where t ≥ 0, definitions of A and x (t) accord with A and x (t) in Equation
(7.1). When A − xT (t) x (t) is looked at as synaptic connection weights
that vary with xT (t) x (t) and is regarded as a function of xT (t) x (t),
and the neuron activation functions are supposed as pure linear functions,
Equation (7.2) can be seen as a recurrent neural network (RNN (7.2)).
Evidently, RNN (7.2) is simpler than RNN (7.1), which has important
meanings for electronic designing; the connection is much less than RNN
(7.1), and the electronic circuitry of RNN (7.2) is more concise than RNN
(7.1). Therefore, the realization of RNN (7.2) is easier than RNN (7.1).
Moreover, the performance is close to that of RNN (7.1), which can be seen
in Section 7.1.3.

7.1.1. Preliminaries
All eigenvalues of A are denoted as λ1 ≥ λ2 ≥ · · · ≥ λn , and their
corresponding eigenvectors and eigensubspaces are denoted as µ1 , · · · , µn
and V1 , · · · , Vn . ⌈x⌉ denotes the most minimal integer which is not less
than x. ⌊x⌋ denotes the most maximal integer which is not larger than x.

Lemma 7.1. λ1 + α ≥ λ2 + α ≥ · · · ≥ λn + α are eigenvalues of A +


αI (α ∈ R) and µ1 , · · · , µn are the eigenvectors corresponding to λ1 +
α, · · · , λn + α.
Proof: From
Aµi = λi µi , i = 1, · · · , n.
It follows that
Aµi + αµi = λi µi + αµi , i = 1, · · · , n
i.e.
(A + αI) µi = (λi + α) µi , i = 1, · · · , n.
Therefore, this lemma is correct.
When RNN (7.2) approaches equilibrium state, we get the following
relation from Equation (7.2)
Aξ = ξ T ξξ,
March 2, 2012 11:56 World Scientific Review Volume - 9in x 6in 01˙ws-rv9x6/Chp. 7

130 Yiguang Liu, Zhisheng You, Bingbing Liu and Jiliu Zhou

where ξ ∈ Rn is the equilibrium vector. If ∥ξ∥ ̸= 0, ξ is an eigenvector, and


the corresponding eigenvalue λ is
λ = ξ T ξ. (7.3)

7.1.2. Analytic solution of RNN

Theorem 7.1. Denote Si = ∥µµii ∥ . zi (t) denotes the projection value of


x (t) onto Si , then the analytic solution of RNN (7.2) is presented as

n
zi (0) exp (λi t) Si
x (t) = √ i=1
. (7.4)

n ∫t
1+ zj2 (0) 0
exp (2λj τ ) dτ
j=1

Proof: Recurring to the properties of real symmetric matrices, we know


S1 , · · · , Sn construct a normalized perpendicular basis of Rn×n . So, x (t) ∈
Rn can be presented as

n
x (t) = zi (t) Si . (7.5)
i=1

Taking Equation (7.5) into Equation (7.2), we get


d ∑n
zi (t) = λi zi (t) − zj2 (t)zi (t) ,
dt j=1

when zi (t) ̸= 0
1 d ∑ n
λi − zi (t) = zj2 (t).
zi (t) dt j=1

Therefore
1 d 1 d
λi − zi (t) = λr − zr (t) (r = 1, · · · , n) ,
zi (t) dt zr (t) dt
so
zi (t) zi (0)
= exp [(λi − λr ) t] (7.6)
zr (t) zr (0)
for t ≥ 0. From Equation (7.2), we can also get
[ ] ∑n [ ]2
d 1 1 zj (t)
= −2λr + 2 . (7.7)
dt zr2 (t) zr2 (t) j=1
zr (t)
March 2, 2012 11:56 World Scientific Review Volume - 9in x 6in 01˙ws-rv9x6/Chp. 7

Solving Eigen-problems of Matrices by Neural Networks 131

Taking Equation (7.6) into Equation (7.7), we have


[ ] ∑n [ ]2
d 1 1 zj (0)
= −2λr 2 +2 exp [2 (λj − λr ) t] ,
dt zr2 (t) zr (t) j=1
zr (0)

i.e.

∑n [ ]2 ∫ t
exp (2λr t) 1 zj (0)
− = 2 exp (2λj τ ) dτ ,
zr2 (t) zr2 (0) j=1
zr (0) 0

so

zr (0) exp (λr t)


zr (t) = √ . (7.8)

n
2
∫t
1+2 zj (0) 0 exp (2λj τ ) dτ
j=1

So, from Equations (7.5), (7.6) and (7.8), it follows that


n
x (t) = zi (t) Si
i=1
∑n
zi (t)
= zr (t) zr (t) Si
i=1
∑n
zi (0) zr (0) exp(λr t)Si
= zr (0) exp [(λi − λr ) t] √ ∑
n ∫ (7.9)
i=1 1+2 zj2 (0) 0t exp(2λj τ )dτ
j=1

n
zi (0) exp(λi t)Si
= √ i=1
.

n ∫ t
1+2 zj2 (0) 0
exp(2λj τ )dτ
j=1

So, Theorem 7.2 shown in the following can be achieved.

7.1.3. Convergence analysis of RNN

Theorem 7.2. For nonzero initial vector x (0) ∈ Vi , the equilibrium vector
ξ may be trivial or ξ ∈ Vi . If ξ ∈ Vi , ξ T ξ is the eigenvalue λi .
Proof: Since x (0) ∈ Vi , Vi ⊥Vj (1 ≤ j ≤ n, j ̸= i), the projections of
x (0) onto Vj (1 ≤ j ≤ n, j ̸= i) are zero. Therefore, from Equation (7.9),
it follows that
March 2, 2012 11:56 World Scientific Review Volume - 9in x 6in 01˙ws-rv9x6/Chp. 7

132 Yiguang Liu, Zhisheng You, Bingbing Liu and Jiliu Zhou


n
zk (0) exp(λk t)Sk
x (t) = √ k=1

n ∫ t
1+2 zj2 (0) 0
exp(2λj τ )dτ
j=1
zi (0) exp(λi t)Si
= √

n ∫
1+2 zj2 (0) 0t exp(2λj τ )dτ
j=1

∈ Vi ,

so, ξ ∈ Vi . From Equation (7.3), we know ξ T ξ is the eigenvalue λi .

Theorem 7.3. If nonzero initial vector x (0) ∈ / Vi (i = 1, · · · , n) and the


equilibrium vector ∥ξ∥ ̸= 0, then ξ ∈ V1 , ξ ξ is the largest eigenvalue λ1 .
T

Proof: Since x (0) ∈ / Vi (i = 1, · · · , n), the projection of x (0) onto Vi


(1 ≤ i ≤ n) is nonzero, i.e.

zi (0) ̸= 0 (1 ≤ i ≤ n) .

From Theorem 7.1, it follows that



n
zi (0) exp(λi t)Si
x (t) = √ i=1

n ∫ t
1+2 zj2 (0) 0
exp(2λj τ )dτ
j=1

n
z1 (0) exp(λ1 t)S1 + zi (0) exp(λi t)Si
= √ i=2
,
∫ t ∑
n ∫ t
1+2z12 (0) 0
exp(2λ1 τ )dτ +2 zj2 (0) 0
exp(2λj τ )dτ
j=2

and λ1 ≤ 0 will make ∥ξ∥ = 0. So λ1 > 0 and


n
z1 (0)S1 + zi (0) exp[(λi −λ1 )t]Si
x (t) = √ i=2
.
2 (0)
z1 ∑
n ∫ t
exp(−2λ1 t)+ λ1 [exp(2λ1 t)−1] exp(−2λ1
t)+2 zj2 (0) 0
exp[(2λj τ )−(2λ1 t)]dτ
j=2

Therefore, it holds that

ξ = lim x (t)
t→∞

n
z1 (0)S1 + zi (0) exp[(λi −λ1 )t]Si
= lim √ i=2

t→∞ 2 (0)
z1 ∑
n ∫
exp(−2λ1 t)+ λ1 [exp(2λ1 t)−1] exp(−2λ1 t)+2 zj2 (0) t
0
exp[(2λj τ )−(2λ1 t)]dτ ,
√ j=2

λ1
= z
z12 (0) 1
(0) S1
∈ V1
(7.10)
March 2, 2012 11:56 World Scientific Review Volume - 9in x 6in 01˙ws-rv9x6/Chp. 7

Solving Eigen-problems of Matrices by Neural Networks 133

and
√ √
λ1 λ1
ξT ξ = z
z12 (0) 1
(0) S1T S1 z
z12 (0) 1
(0)
(7.11)
= λ1 .
From Equations (7.10) and (7.11), we know this theorem is proved.

Theorem 7.4. Replace A with −A, If nonzero initial vector x (0) ∈ /


Vi (i = 1, · · · , n) and the equilibrium vector ∥ξ∥ ̸= 0, then ξ ∈ Vn , λn =
−ξ T ξ.
Proof: Replace A with −A, it follows that

dx (t)
= −Ax (t) − xT (t) x (t) x (t) . (7.12)
dt
For RNN (7.12), from Theorem 7.3, we know that if |x (0)| ̸= 0,
x (0) ∈
/ Vi (i = 1, · · · , n) and the equilibrium vector ∥ξ∥ ̸= 0, ξ belongs
to the eigenspace corresponding to the largest eigenvalue of −A. Because
the eigenvalues of −A are −λn ≥ −λn−1 ≥ · · · ≥ −λ1 . Therefore, ξ ∈ Vn
and ξ T ξ = −λn , i.e. λn = −ξ T ξ.

Theorem 7.5. When A is positive definite, RNN (7.2) cannot compute λn


by replacing A with −A. When A is negative definite, RNN (7.2) cannot
compute λ1
Proof: When A is positive definite, the eigenvalues λ1 ≥ λ2 ≥ · · · ≥
λn > 0, there exists a vector ς ∈ Rn which matches ς T ς = λ1 . Therefore,
the RNN (7.2) will converge to a nonzero vector ξ ∈ Rn that matches
ξ T ξ = λ1 . Replace A with −A, since 0 > −λn ≥ −λn−1 ≥ · · · ≥ −λ1 , there
does not exist a vector ς ∈ Rn , which matches ς T ς = −λn . So, when A is
positive definite, RNN (7.2) will converge to zero vector when A is replaced
by −A. While if A is negative definite, there does not exist a vector that
matches ς T ς = λ1 , RNN (7.2) will converge to zero. However, there exists
a vector ζ ∈ Rn , which matches ξ T ξ = −λn , so, RNN (7.2) can compute
λn by replacing A with −A when A is negative definite.

Theorem 7.6. When A is positive definite, if A in RNN (7.2) is replaced


with − (A − ⌈λ1 ⌉ I), then λn can be computed from the equilibrium vector
ξ as

λn = −ξ T ξ + ⌈λ1 ⌉ . (7.13)

for the nonzero initial vector x (0) ∈


/ Vi (i = 1, · · · , n).
March 2, 2012 11:56 World Scientific Review Volume - 9in x 6in 01˙ws-rv9x6/Chp. 7

134 Yiguang Liu, Zhisheng You, Bingbing Liu and Jiliu Zhou


Proof: Let λn denote the smallest eigenvalue of (A − ⌈λ1 ⌉ I). Since A
is positive definite, so, the smallest eigenvalue of (A − ⌈λ1 ⌉ I) is negative,
therefore ξ is nonzero when A is replaced by − (A − ⌈λ1 ⌉ I), from Theorem
7.4, it follows that

λn = −ξ T ξ. (7.14)

From Lemma 7.1, we know that



λn = λn − ⌈λ1 ⌉ . (7.15)

From Equations (7.14) and (7.15), it follows Equation (7.13).

Theorem 7.7. When A is negative definite, if A in RNN (7.2) is replaced


with (A − ⌊λn ⌋ I), then λ1 can be computed from the equilibrium vector ξ
as

λ1 = ξ T ξ + ⌊λn ⌋ . (7.16)

for nonzero initial vector x (0) ∈


/ Vi (i = 1, · · · , n).
Proof: As similar to the proof of Theorem 7.6, the proof of this theorem
is thus omitted.

Theorem 7.8. The trajectory of RNN (7.2) will converge to zero or


spherical surface whose radius is the square root of a positive eigenvalue
of A.
Proof: From Equation (7.9), we know

n
zi (0) exp (λi t) Si
x (t) = √ i=1
,

n ∫t
1+2 zj2 (0) 0
exp (2λj τ ) dτ
j=1

for the equilibrium vector ξ, if 0 > λ1 ≥ λ2 ≥ · · · ≥ λn , it follows that


ξ = lim x (t)
t→∞

n
zi (0) exp(λi t)Si
= lim √ i=1

t→∞ ∑
n ∫ t
1+2 zj2 (0) 0
exp(2λj τ )dτ
j=1

n (7.17)
zi (0) lim exp(λi t)Si
t→∞
= √ i=1

n ∫ t
1+2 lim zj2 (0) 0
exp(2λj τ )dτ
t→∞ j=1

= 0.
March 2, 2012 11:56 World Scientific Review Volume - 9in x 6in 01˙ws-rv9x6/Chp. 7

Solving Eigen-problems of Matrices by Neural Networks 135

Suppose λ1 ≥ λ2 ≥ · · · ≥ λk > 0 ≥ λk+1 ≥ · · · ≥ λn . The first nonzero


projection is zp (0) ̸= 0 (1 ≤ p ≤ n). Then we get two cases:
1) If 1 ≤ p ≤ k, it follows that

ξ = lim x (t)
t→∞

k ∑
n
zi (0) exp(λi t)Si + zi (0) exp(λi t)Si
i=p i=k+1
= lim √
t→∞ ∑
n ∫
1+2 zj2 (0) 0t exp(2λj τ )dτ
j=1

k ∑
n
zp (0)Sp + lim zi (0) exp(λi t−λp t)Si + lim zi (0) exp(λi t−λp t)Si
t→∞ i=p+1 t→∞ i=k+1
= v { }
u
u ∫
t1+ lim z 2 (0)[1−exp(−2λp t)]/λp +2 ∑ z 2 (0) t exp(2λj τ −2λp t)dτ
n

t→∞ p j 0
√ /
j=p+1

= zp (0) λp zp2 (0)Sp ,


so

∥ξ∥ = √ξ T ξ
√ / √ /
= zpT (0) zp (0) λp zp2 (0) λp zp2 (0)SpT Sp (7.18)

= λp .

2) If k + 1 ≤ p ≤ n, it follows that

ξ = lim x (t)
t→∞

n
zi (0) exp(λi t)Si
i=p
= lim √ . (7.19)
t→∞ ∑
n ∫ t
1+2 zj2 (0) 0
exp(2λj τ )dτ
p=1

=0

From Equations (7.17), (7.18) and (7.19), we know that the theorem is
correct.

7.1.4. Steps to compute λ1 and λn


In this section, we provide the steps to compute the largest eigenvalue and
the smallest eigenvalue of a real symmetric matrix B ∈ Rn×n with RNN
(7.2), which are as follows:

7.1.5. Simulation
We provide three examples to evaluate the method integrated by RNN (7.2)
and the five steps for computing the smallest eignevalue and the largest
March 2, 2012 11:56 World Scientific Review Volume - 9in x 6in 01˙ws-rv9x6/Chp. 7

136 Yiguang Liu, Zhisheng You, Bingbing Liu and Jiliu Zhou

Table 7.1. Steps computing λ1 and λn .


Step 1 Replace A with B. If equilibrium vector ∥ξ∥ ̸= 0, we
get λ1 = ξ T ξ.
Step 2 Replace A with −B. If equilibrium vector ∥ξ∥ ̸= 0, we
get λn = −ξ T ξ.
If λ1 has been received, go to step 5. Otherwise, go to
step 3. If ∥ξ∥ = 0,
go to step 4.
Step 3 Replace A with (B − ⌊λn ⌋ I), and we get λ1 = ξ T ξ +
⌊λn ⌋. Go to step 5.
Step 4 Replace A with − (B − ⌈λ1 ⌉ I), and we get λn =
−ξ T ξ + ⌈λ1 ⌉. Go to step 5.
Step 5 End.

Fig. 7.1. The trajectories of the components of x when A is replaced with B.

eigenvalue of a real symmetric matrix. The simulating platform we used is


Matlab.
Example 1: A real 7 × 7 symmetric matrix is randomly generated as

 
0.4116 0.3646 0.5513 0.6659 0.5836 0.6280 0.4079
 0.3919 
 0.3646 0.4668 0.3993 0.4152 0.4059 0.4198 
 0.8687 
 0.5513 0.3993 0.9862 0.4336 0.6732 0.2326 
 
B= 0.6659 0.4152 0.4336 0.5208 0.5422 0.5827 0.5871  .
 
 0.5836 0.4059 0.6732 0.5422 0.3025 0.8340 0.4938 
 
 0.6280 0.4198 0.2326 0.5827 0.8340 0.9771 0.3358 
0.4079 0.3919 0.8687 0.5871 0.4938 0.3358 0.1722
March 2, 2012 11:56 World Scientific Review Volume - 9in x 6in 01˙ws-rv9x6/Chp. 7

Solving Eigen-problems of Matrices by Neural Networks 137

Fig. 7.2. The trajectories of λ1 (t) = xT (t)x(t) when A is replaced with B.

Fig. 7.3. The trajectories of the components of x when A is replaced with −B.

Through step 1, we get


T
ξ1 = (0.7199 0.5577 0.8172 0.7356 0.7637 0.7979 0.6542) .

So, λ1 = ξ1T ξ1 = 3.6860. The trajectories of xi (t) (i = 1, 2, · · · , 7)


T ∑
7
and λ1 (t) = x (t) x (t) = x2i (t) are shown in Figs 7.1 and 7.2.
i=1
From Figs 7.1 and 7.2, we can see that RNN (7.2) quickly reaches the
equilibrium state and the largest eigenvalue is obtained. Through step 2,
T
we get ξn = ( - 0.1454 0.0388 0.3184 0.2658 - 0.1201 0.0846 - 0.5324) and
λn = −ξnT ξn = −0.4997. The trajectories of xi (t) (i = 1, 2, · · · , 7) and
March 2, 2012 11:56 World Scientific Review Volume - 9in x 6in 01˙ws-rv9x6/Chp. 7

138 Yiguang Liu, Zhisheng You, Bingbing Liu and Jiliu Zhou

Fig. 7.4. The trajectories of λn (t) = −xT (t)x(t) when A is replaced with −B.

T ∑
7
λn (t) = −x (t) x (t) = − x2i (t) are shown in Figs 7.3 and 7.4. By using
i=1
Matlab to directly compute the eigenvalues of B, we get the “ground truth”
′ ′
largest eigenvalue λ1 = 3.6862 and the smallest eigenvalue λn = −0.4997.
Therefore, compare the results received by the RNN (7.2) approach with
the ground truth, the absolute difference values are

λ1 − λ1 = 0.0002,
and

λn − λn = 0.0000.

This verifies that the RNN (7.2) can compute the largest eigenvalue and
the smallest eigenvalue of B successfully.
Example 2: Use Matlab to generate a positive matrix that is
 
0.6495 0.5538 0.5519 0.5254
 0.5538 0.8607 0.7991 0.5583 
C= 
 0.5519 0.7991 0.8863 0.4508  ,
0.5254 0.5583 0.4508 0.6775
the ground truth eigenvalues of C directly computed by Matlab are λ′ =
(0.0420 0.1496 0.3664 2.5160). From step 1, we can get λ1 = 2.5159,
the trajectories of x (t) and xT (t) x (t) which will approach to λ1 are
shown in Figs 7.5 and 7.6. For step 2, the trajectories of x (t) and λn =
−xT (t) x (t) are shown in Figs 7.7 and 7.8. The equilibrium vector ξn =
( - 0.0010 - 0.0028 0.0025 0.0015) and λn = −ξnT ξn = −1.7615 × 10−5 ,
T
March 2, 2012 11:56 World Scientific Review Volume - 9in x 6in 01˙ws-rv9x6/Chp. 7

Solving Eigen-problems of Matrices by Neural Networks 139

obviously, λn is approaching to zero. So use step 4 to compute the smallest


eigenvalue, and replace A with
 
- 2.3505 0.5538 0.5519 0.5254
 0.5538 - 2.1393 0.7991 0.5583 
C ′ = − (C − ⌈λ1 ⌉ I) = −  
 0.5519 0.7991 - 2.1137 0.4508  ,
0.5254 0.5583 0.4508 - 2.3225
the trajectories of x (t) and −xT (t) x (t) that will approach to the smallest
eigenvalue of C ′ are shown in Figs 7.9 and 7.10. In the end, we get that
the equilibrium vector is ξn∗ = ( - 0.4405 - 1.1408 1.0296 0.6342) and the
T

small eigenvalue of C

Fig. 7.5. The trajectories of components of x when A is replaced with C.

λ∗n = −ξn∗T ξn∗ + ⌈λ1 ⌉


= −2.9578 + 3
= 0.0422
λ1 = 2.5159 and λ∗n = 0.0422 are the results computed by RNN (7.2).
′ ′
Compared with the ground true values λ1 = 2.5160 and λn = 0.0420, we

can see that λ1 is very close to λ1 , with the absolute difference being 0.0001,

and λ∗n is close to λn as well the absolute difference being 0.0002.
This verifies that using step 4 to compute the smallest eigenvalue of a
real positive symmetric matrix is effective.
March 2, 2012 11:56 World Scientific Review Volume - 9in x 6in 01˙ws-rv9x6/Chp. 7

140 Yiguang Liu, Zhisheng You, Bingbing Liu and Jiliu Zhou

Fig. 7.6. The trajectories of xT (t)x(t) when A is replaced with C.

Fig. 7.7. The trajectories of components of x when A is replaced with −C.

Example 3: Use matrix D = −C to evaluate RNN (7.2).


Obviously, D is negative definite and its true eigenvalues are λ′ =
(- 2.5160 - 0.3664 - 0.1496 - 0.0420). Replace A with D, from step 1, we
find the received eigenvalue is 2.1747 × 10−5 , we cannot decide whether this
value is the true largest eigenvalue. Through step 2, the smallest eigenvalue
is received λ4 = −2.5159. Using step 3, we get λ1 = −0.0420. Comparing
′ ′
λ4 and λ1 to λ4 and λ1 , it can be found that using step 3 to compute the
largest eigenvalue is successful.
March 2, 2012 11:56 World Scientific Review Volume - 9in x 6in 01˙ws-rv9x6/Chp. 7

Solving Eigen-problems of Matrices by Neural Networks 141

Fig. 7.8. The trajectories of −xT (t)x(t) when A is replaced with −C.

Fig. 7.9. The trajectories of components of x when A is replaced with C ′ .

7.1.6. Section summary

In this section, we presented a recurrent neural network, and designed


steps to compute the largest eigenvalue and smallest eigenvalue of a real
symmetric matrix using this neural network. This method depicted by
the recurrent neural network and the five steps is not only adaptive to
non-definite matrix but also to positive definite and negative definite
matrix. Three examples were provided to evaluate the validity of this
method, with B being a non-definite matrix, C being a positive definite
matrix and D being a negative definite matrix. All of the results obtained
March 2, 2012 11:56 World Scientific Review Volume - 9in x 6in 01˙ws-rv9x6/Chp. 7

142 Yiguang Liu, Zhisheng You, Bingbing Liu and Jiliu Zhou

Fig. 7.10. The trajectories of −xT (t)x(t) when A is replaced with C ′ .

by applying this method were highly close to the corresponding ground


truth values. Compared with other approaches based on neural networks,
this recurrent neural network is simpler and can be realized more easily.
Actually there exists an improved form of the neural network presented
in Equation (7.2), which is
dx (t) [ ]
= Ax (t) − sgn xT (t) x (t) x (t) , (7.20)
dt
where definitions of A, t and x (t) are the same to those of Equation (7.2),
and sgn [xi (t)] returns 1 if xi (t) > 0, 0 if xi (t) = 0 and -1 if xi (t) < 0.
Evidently, this new form is more concise than the neural network presented
in Equation (7.2). Interested readers can refer to [12] for the analytic
solution, convergence analysis and numerical valuations of this new form in
detail.

7.2. A Recurrent Neural Network for Computing the


Largest Modulus Eigenvalues and Their Corresponding
Eigenvectors of an Anti-symmetric Matrix

As we argued in Section 7.1, the ability of quick computation of


eigenvalues and eigenvectors has many important applications, and using
neural networks to compute eigenvalues and eigenvectors of a matrix has
outstanding features like high parallelism and quickness. That is the
reason why many research works have been reported [1–11, 13, 14] in this
field. However, the received results mainly focus on solving eigenvalues
March 2, 2012 11:56 World Scientific Review Volume - 9in x 6in 01˙ws-rv9x6/Chp. 7

Solving Eigen-problems of Matrices by Neural Networks 143

and eigenvectors for real symmetric matrices. According to the authors’


knowledge, approaches based on neural networks to extract eigenvalues
and eigenvectors of a real anti-symmetric matrix are rarely found in the
literature.
In this section, we propose a recurrent neural network (RNN), in order
to compute eigenvalues and eigenvectors for real anti-symmetric matrices.
The proposed RNN is
dv (t) [ ′ ]
= A g + A′ g ′ − |v (t)| v (t)
2
(7.21)
dt
( ) ( ) ( )
I0 I ′ I0 I0 ′ A I0
where t ≥ 0, v (t) ∈ R , g =
2n
,g = ,A = .
I0 I0 −I I0 I0 A
I is an n × n unit matrix. I0 is an n × n zero matrix. A is the
real anti-symmetric matrix whose eigenvalues are to be computed. |v (t)|
[denotes the modulus of] v (t). With v(t) being taken as the states of neurons,
A′ g + A′ g ′ − |v (t)| being regarded as synaptic connection weights and
2

the activation functions being assumed as pure linear functions, Equation


(7.21) describes a continuous time recurrent neural network.
( )T
Denote v (t) = y T (t) z T (t) , y (t) = (y1 (t) , y2 (t) , · · · , yn (t)) ∈ Rn
and z (t) = (z1 (t) , z2 (t) , · · · , zn (t)) ∈ Rn , Equation (7.21) is equal to

 n [ ]
 ∑
 dy(t)
 dt = Az (t) − yj2 (t) + zj2 (t) y (t)
j=1
∑n [ ] . (7.22)

 dz(t)
 dt = −Ay (t) − yj2 (t) + zj2 (t) z (t)
j=1

Denote
x (t) = y (t) + iz (t) , (7.23)
where i is the imaginary unit, Equation (7.22) is equal to

dy(t) dz(t) ∑n
[ 2 ]
+i = −Ai [y (t) + iz (t)] − yj (t) + zj2 (t) [y (t) + iz (t)] ,
dt dt j=1

i.e.
dx (t)
= −Ax (t) i − xT (t) x̄ (t) x (t) , (7.24)
dt
where x̄ (t) denotes the complex conjugate values of x (t). From Equations
(7.21), (7.22) and (7.24), we know that analyzing the convergence properties
of Equation (7.24) is equal to analyzing that of RNN (7.21).
March 2, 2012 11:56 World Scientific Review Volume - 9in x 6in 01˙ws-rv9x6/Chp. 7

144 Yiguang Liu, Zhisheng You, Bingbing Liu and Jiliu Zhou

7.2.1. Preliminaries

Lemma 7.2. The eigenvalues of A are pure imaginary numbers or zero.


Proof: let λ be an eigenvalue of A, and v is its corresponding
( )T
eigenvector. Because v T λ̄v̄ = v T λv = v T Av = v T Av̄ = AT v v̄ =
T T
(−Av) v̄ = (−λv) v̄ = −λv T v̄, so
( ) ( ) 2
λ̄ + λ v T v̄ = λ̄ + λ |v| = 0.
For |v| ̸= 0, thus λ̄ + λ = 0, this indicates that the value of the real
component of λ is zero. So, λ is a pure imaginary number or zero.

Lemma 7.3. If η i is an eigenvalue of A, and µ is the corresponding


eigenvector, then −η i is an eigenvalue of A too, and µ̄ is the eigenvector
corresponding to −η i.
Proof: from Aµ = η iµ, we get Aµ = η iµ, i.e. Aµ̄ = (−η i) µ̄. So, we
can draw the conclusion.
Based on Lemmas 7.2 and 7.3, all eigenvalues of A are denoted as
λ1 i, λ2 i, · · · , λn i which match |λ1 | ≥ |λ2 | ≥ · · · ≥ |λn |, λ2k = −λ2k−1 and
λ2k−1 ≥ 0 (k = 1, · · · , ⌊n/2⌋, ⌊n/2⌋ denotes the most maximal integer that
is not larger than n/2). Corresponding complex eigenvectors and complex
eigensubspaces are denoted as µ1 , · · · , µn and V1 , · · · , Vn .

Lemma 7.4. For matrix A, if two eigenvalues λk i ̸= λm i, then µk ⊥µm ,


namely the inner product ⟨µk , µm ⟩ = 0, k, m ∈ [1, n].
Proof: from Aµk = λk iµk and Aµm = λm iµm , we get µTk λm iµ̄m =
( )T T
T
µk λm iµm = µTk Aµm = µTk Aµ̄m = AT µk µ̄m = (−Aµk ) µ̄m
= −λk iµTk µ̄m , thus,
( )
λm i + λk i µTk µ̄m = 0. (7.25)
From Lemma 7.2, we know λk i and λm i are imaginary numbers. From
λk i ̸= λm i, it follows that
( )
λm i + λk i ̸= 0. (7.26)
From Equations (7.25) and (7.26), it easily follows that
µTk µ̄m = 0, (7.27)
i.e.
⟨µk , µm ⟩ = 0.
Therefore µk ⊥µm .
March 2, 2012 11:56 World Scientific Review Volume - 9in x 6in 01˙ws-rv9x6/Chp. 7

Solving Eigen-problems of Matrices by Neural Networks 145

When RNN (7.21) approaches equilibrium state, we get the following


formula from Equation (7.24)
¯
−Aξ i = ξ T ξξ, (7.28)

where ξ ∈ C n is the equilibrium vector. When ξ is an eigenvector of A,


λ i (–λ ∈ R) is supposed to be the corresponding eigenvalue, then

Aξ = –λ iξ. (7.29)

From Equations (7.28) and (7.29), it follows that


¯
–λ = ξ T ξ. (7.30)

7.2.2. Analytic solution of RNN

Theorem 7.9. Denote Sk = |µµkk | , xk denotes the projection value of x (t)


onto Sk , then the analytic solution of Equation (7.24) is

n
[yk (0) + izk (0)] exp (λk t) Sk
x (t) = √ k=1
(7.31)
n [
∑ ]∫t
1+2 zj2 (0) + yj2 (0) 0 exp (2λj τ ) dτ
j=1

for all t ≥ 0.
Proof: from Lemma 7.4, we know that S1 , S2 , · · · , Sn construct an
orthonormal basis of C n×n . Since xk = yk (t) + i zk (t), thus

n
x (t) = (zk (t) + i yk (t))Sk . (7.32)
k=1

Take Equation (7.32) into Equation (7.24), as Sk is a normalized


complex eigenvector of A, from the denotation of λk i, we know the
corresponding eigenvalue is λk i, thus
n [
∑ d d
]
dt zk (t) + i dt yk (t) Sk
k=1

n n [
∑ ]∑
n
=− A [zk (t) + i yk (t)]Sk i − zk2 (t) + yk2 (t) [zk (t) + i yk (t)] Sk
k=1 k=1 k=1
∑n ∑n [ 2 ]∑
n
=− λk i Sk [zk (t) + i yk (t)] i − zk (t) + yk2 (t) [zk (t) + i yk (t)] Sk
k=1 k=1 k=1

n n [
∑ ]∑
n
= λk Sk [zk (t) + i yk (t)] − zk2 (t) + yk2 (t) [zk (t) + i yk (t)] Sk ,
k=1 k=1 k=1
March 2, 2012 11:56 World Scientific Review Volume - 9in x 6in 01˙ws-rv9x6/Chp. 7

146 Yiguang Liu, Zhisheng You, Bingbing Liu and Jiliu Zhou

therefore
d ∑n
[ 2 ]
zk (t) = λk zk (t) − zj (t) + yj2 (t) zk (t) , (7.33)
dt j=1

d ∑n
[ 2 ]
yk (t) = λk yk (t) − zj (t) + yj2 (t) yk (t) . (7.34)
dt j=1

From Equation (7.33), we get

1 d ∑[ n
] 1 d
λk − zk (t) = zj2 (t) + yj2 (t) = λr − zr (t) ,
zk (t) dt j=1
zr (t) dt

where r ∈ [1, n], so


1 d 1 d
zr (t) − zk (t) = λr − λk ,
zr (t) dt zk (t) dt
i.e.
zr (t) zr (0)
= exp [(λr − λk ) t] . (7.35)
zk (t) zk (0)
From Equation (7.34), we analogously get
yr (t) yr (0)
= exp [(λr − λk ) t] . (7.36)
yk (t) yk (0)
From Equations (7.33) and (7.34), we get
1 d 1 d
λk − zk (t) = λr − yr (t) ,
zk (t) dt yr (t) dt
i.e.
yr (t) yr (0)
= exp [(λr − λk ) t] . (7.37)
zk (t) zk (0)
From Equation (7.21), we also get
[ ]
1 d λk ∑
n
zj2 (t) yj2 (t)
zk (t) = 2 − + ,
zk3 (t) dt zk (t) j=1 zk2 (t) zk2 (t)

i.e.
[ ]
d 1 λk ∑n
zj2 (t) yj2 (t)
+2 2 =2 + . (7.38)
dt zk2 (t) zk (t) j=1
zk2 (t) zk2 (t)
March 2, 2012 11:56 World Scientific Review Volume - 9in x 6in 01˙ws-rv9x6/Chp. 7

Solving Eigen-problems of Matrices by Neural Networks 147

Take Equations (7.35) and (7.37) into Equation (7.38), it follows that
[ ] [ ]
d exp (2λk t) ∑n
zj2 (0) + yj2 (0)
=2 exp (2λj t), (7.39)
dt zk2 (t) j=1
zk2 (0)

Integrating two sides of Equation (7.39) with respect to t from 0 to t′ ,


we get
∑ ∫ ′
exp (2λk t′ ) zk2 (0)
n
[ 2 2
] t
= 1 + 2 z (0) + y (0) exp (2λj t) dt,
zk2 (t′ ) j=1
j j
0

so
exp (2λk t′ ) zk2 (0)
zk2 (t′ ) = ∑
n [ 2 ] ∫ t′ ,
1+2 zj (0) + yj2 (0) 0 exp (2λj t) dt
j=1

therefore
exp (λk t′ ) zk (0)
zk (t′ ) = √ , (7.40)
n [
∑ ] ∫ t′
1+2 2 2
zj (0) + yj (0) 0 exp (2λj t) dt
j=1

or,
− exp (λk t′ ) zk (0)
zk (t′ ) = √ .
n [
∑ ] ∫ ′
t
1+2 zj2 (0) + yj2 (0) 0 exp (2λj t) dt
j=1

For the above formula, when t′ = 0, we get

zk (0) = −zk (0) , k = 1, 2, · · · , n.

When z (0) is a nonzero vector, there exists

zd (0) ̸= −zd (0) (d ∈ [1, 2, · · · , n]) ,

so, we take only Equation (7.40) as correct.


Analogously, from Equations (7.34), (7.36) and (7.37), we get

exp (λk t′ ) yk (0)


yk (t′ ) = √ . (7.41)
n [
∑ ] ∫ t′
1+2 2 2
zj (0) + yj (0) 0 exp (2λj t) dt
j=1
March 2, 2012 11:56 World Scientific Review Volume - 9in x 6in 01˙ws-rv9x6/Chp. 7

148 Yiguang Liu, Zhisheng You, Bingbing Liu and Jiliu Zhou

Therefore, from Equations (7.40) and (7.41), it follows that



n
x (t) = xk (t) Sk
k=1

n
= [yk (t) + i yk (t)] Sk
k=1 . (7.42)

n
[yk (0)+i zk (0)] exp(λk t)Sk
= √ k=1

n ∫t
1+2 [zj2 (0)+yj2 (0)] 0
exp(2λj τ )dτ
j=1

for all t ≥ 0. So the theorem is proved.

7.2.3. Convergence analysis of RNN

Theorem 7.10. If nonzero initial vector x (0) ∈ Vm , then the equilibrium


vector ξ = 0 or ξ ∈ Vm . If ξ ∈ Vm , iξ T ξ¯ is the eigenvalue λm i.
Proof: because x (0) ∈ Vm and Vm ⊥Vj (j ̸= m), thus
xj (0) = 0, j ̸= m. (7.43)
From Theorem 7.9 and Equation (7.43), we get
[ym (0) + i ym (0)] exp (λm t) Sm
x (t) = √ ∫ . (7.44)
2 (0) + y 2 (0)] t exp (2λ τ ) dτ
1 + 2 [zm m 0 m

when the equilibrium vector ξ exists, there exists the following relationship
ξ = lim x (t) . (7.45)
t→∞

From Equations (7.44) and (7.45), it follows that: When λm < 0


ξ = 0. (7.46)
When λm = 0
[ym (0) + i ym (0)] Sm
x (t) = lim √ (7.47)
t→∞ 2 (0) + y 2 (0)] t
1 + 2 [zm m
=0
When λm > 0


[ym (0)+i ym (0)]Sm
ξ = lim
t→∞ exp(−2λm t)+ λ1m [zm
2 (0)+y 2 (0)] [1−exp(−2λ t)]
m m

[ym (0)+i ym (0)]Sm


√ . (7.48)
= 1 2 2
λm [zm (0)+ym (0)]

∈ Vm
March 2, 2012 11:56 World Scientific Review Volume - 9in x 6in 01˙ws-rv9x6/Chp. 7

Solving Eigen-problems of Matrices by Neural Networks 149

From Equations (7.46), (7.47) and (7.48), it concludes that ξ = 0 or


ξ ∈ Vm . When ξ ∈ Vm , from Equation (7.48), it follows that
[ym (0)+i ym (0)]Sm √[ym (0)−i ym (0)]S̄m
ξ T ξ¯i = √
1 2 2 1 2 2
i
λm [zm (0)+ym (0)] λm [zm (0)+ym (0)] . (7.49)
= λm i

From Equations (7.46)∼(7.49), the theorem can be concluded.

Theorem 7.11. If nonzero initial vector x (0) ∈ / Vm (m = 1, · · · , n), then


the equilibrium vector ξ ∈ V1 or ξ = 0, ξ T ξ¯i is the eigenvalue λ1 i. If λ2 i
exists, −λ1 i is the eigenvalue λ2 i, and ξ¯ is eigenvector corresponding to
λ2 i.
Proof: because x (0) ∈ / Vm (m = 1, · · · , n), so xk (t) ̸=
0 (1 ≤ k ≤ n). From the denotation of λ1 i, λ2 i, · · · , λn i, it follows that

λ1 ≥ 0.

From Theorem 7.9 and Equation (7.45), when λ1 = 0, we get



n
[yk (0)+i yk (0)] exp(λk t)Sk
ξ = lim √ k=1

t→∞ ∑
n ∫t
1+2 [zj2 (0)+yj2 (0)] 0
exp(2λj τ )dτ
j=1

n
[yk (0)+i yk (0)]Sk , (7.50)
= lim √k=1
t→∞ ∑
n
1+2 [zj2 (0)+yj2 (0)]t
j=1

=0

and

ξ T ξ¯i = 0 = λ1 i, (7.51)

when λ1 > 0, we get


n
[y1 (0)+i y1 (0)]S1 + [yk (0)+i yk (0)] exp[(λk −λ1 )t]Sk
ξ = lim √ k=2
,
t→∞ ∫t ∑
n ∫t
exp(−2λ1 t)+2[z12 (0)+y12 (0)] 0
exp[2λ1 (τ −t)]dτ +2 [zj2 (0)+yj2 (0)] 0
exp(2λj τ −2λ1 t)dτ
j=2

thus

λ1
ξ= [y1 (0) + iy1 (0)] S1
[z12 (0)+y12 (0)] , (7.52)
∈ V1
March 2, 2012 11:56 World Scientific Review Volume - 9in x 6in 01˙ws-rv9x6/Chp. 7

150 Yiguang Liu, Zhisheng You, Bingbing Liu and Jiliu Zhou

and
√ √
ξ T ξ¯ i = λ1
[y1 (0) + iy1 (0)] S1 λ1
[y1 (0) − iy1 (0)] S̄1
[ z12 (0)+y12 (0) ] [z12 (0)+y12 (0)] .
= λ1 i
(7.53)
If λ2 i exists, from the denotation of λ1 i, λ2 i, · · · , λn i, we know
λ2 i = λ1 i = −λ1 i, (7.54)
and from Equation (7.54) and Lemma 7.2, it easily follows
Aξ¯ = λ2 i ξ,
¯ (7.55)
so ξ¯ is the eigenvector corresponding to λ2 i .
From Equations (7.50), (7.51), (7.52), (7.53), (7.54) and (7.55), it is
concluded that the theorem is correct.

Theorem 7.12. If x (0) ̸= 0, the modulus of trajectory |x (t)| will converge


to zero or λk > 0 (1 ≤ k ≤ n).
Proof: Rearrange λ1 , λ2 , · · · , λn into order λr1 > λr2 > · · · > λrn . From
Theorem 1, we know

n
[yk (0) + izk (0)] exp (λrk t) Sk
x (t) = √ k=1
.
n [
∑ ]∫t ( )
1+2 zj2 (0) + yj2 (0) 0 exp 2λrj τ dτ
j=1

Since λ1 ≥ 0,
λr1 ≥ 0,
the supposition λr1 > · · · λrq ≥ 0 > λrq+1 · · · λrn is rational. Suppose the first
nonzero component of x (t) is xp (0) ̸= 0 (1 ≤ p ≤ n). Then there are two
cases:
1) If 1 ≤ p ≤ q, it follows that
ξ = lim x (t)
t→∞
∑q [ ] ( ) ∑
n [ ] ( )
yk (0)+izk (0) exp λr k t Sk + yk (0)+izk (0) exp λr k t Sk
k=p k=q+1
= lim v
t→∞ u n [ ]∫ ( )
u ∑
t1+2 z 2 (0)+y 2 (0) 0t exp 2λr τ dτ
j j j
j=1
[ ] ∑
n [ ] ( )
yp (0)+izp (0) Sp + lim y (0)+izk (0) exp λr r
k t−λp t Sk
t→∞ k=p+1 k
= v
u ( ) [ ][ ( )]/ [ ]∫ ( )
u ∑
n
lim texp −2λr 2 2
p t + zp (0)+yp (0) 1−exp −2λp t
r λrp +2 z 2 (0)+y 2 (0) t r r
0 exp 2λj τ −2λp t dτ
t→∞ j j
j=p+1

[yp (0)+izp (0)] λp r
= √ Sp ,
2 2
zp (0)+yp (0)
March 2, 2012 11:56 World Scientific Review Volume - 9in x 6in 01˙ws-rv9x6/Chp. 7

Solving Eigen-problems of Matrices by Neural Networks 151

so

|ξ|√
= ξ T ξ¯
√ √
[yp (0)+izp (0)] λrp [yp (0)+izp (0)] λrp
= √ SpT √ S̄p . (7.56)
zp2 (0)+yp2 (0) zp2 (0)+yp2 (0)
= λrp
2) If k + 1 ≤ p ≤ n, it follows that
ξ = lim x (t)
t→∞

n
[yk (0)+izk (0)] exp(λrk t)Sk
k=p
= lim √ . (7.57)
t→∞ ∑
n ∫t
1+2 [zj2 (0)+yj2 (0)] 0
exp(2λrj τ )dτ
j=1

=0
Since the two sets [λ1 , λ2 , · · · , λn ] and [λr1 , λr2 , · · · , λrn ] are the same. So,
from Equations (7.56) and (7.57), it follows that |ξ| is λk > 0 (1 ≤ k ≤ n)
or zero. So, the theorem is drawn.

7.2.4. Simulation
In order to evaluate RNN (7.21), a numerical simulation is provided, and
Matlab is used to generate the simulation.
Example: For Equation (7.24), A is randomly generated as

 
0 -0.0511 0.2244 -0.0537 0.0888 -0.1191 0.1815
 0.0511 0 0.1064 0.2901 -0.0073 0.0263 0.2322 
 
 -0.2244 -0.1064 0 0.1406 -0.0206 0.1402 -0.0910 
 
 
A= 0.0537 -0.2901 -0.1406 0 -0.2708 -0.0640 -0.2373  ,
 
 -0.0888 0.0073 0.0206 0.2708 0 0.1271 0.1094 
 
 0.1191 -0.0263 -0.1402 0.0640 -0.1271 0 -0.1200 
-0.1815 -0.2322 0.0910 0.2373 -0.1094 0.1200 0
and the initial value is

 
0.0511-0.2244i
 -0.1064i 
 
 0.1064 
 
 
x (0) =  0.2901+0.1406i  .
 
 -0.0073-0.0206i 
 
 0.0263+0.1402i 
0.2322-0.0910i
March 2, 2012 11:56 World Scientific Review Volume - 9in x 6in 01˙ws-rv9x6/Chp. 7

152 Yiguang Liu, Zhisheng You, Bingbing Liu and Jiliu Zhou

Through simulation, the generated complex equilibrium vector is

 
-0.1355-0.1563i
 0.0716-0.2914i 
 
 0.2410-0.0716i 
 
 
ξ= 0.2840+0.3073i  ,
 
 0.1556-0.2338i 
 
 0.1076+0.1511i 
0.3334-0.1507i

and the eigenvalue is

λs1 i = ξ T ξ¯ i
= 0.6182 i

Fig. 7.11. The trajectories of modulus of components of x when A is replaced with B.

The trajectories of the modulus values of the components of x(t) and


xT (t) x̄ (t) approaching λ1 are shown in Figs 7.11 and 7.12. Use Matlab to
directly compute the eigenvalues and eigenvectors, the ground truth values
are follows:
March 2, 2012 11:56 World Scientific Review Volume - 9in x 6in 01˙ws-rv9x6/Chp. 7

Solving Eigen-problems of Matrices by Neural Networks 153

Fig. 7.12. The trajectory of xT (t)x(t) when A is replaced with B.

   
μ1 = [ -0.2630 - 0.0084i        μ3 = [  0.5318
       -0.2105 - 0.3184i                0.0207 - 0.4494i
        0.1411 - 0.2869i               -0.0997 + 0.3784i
        0.5322                          0.2835 - 0.1898i
       -0.0841 - 0.3471i               -0.2272 - 0.0857i
        0.2340 + 0.0299i               -0.1797 - 0.3099i
        0.1470 - 0.4415i ],             0.2063 + 0.1246i ],

μ5 = [  0.1852 + 0.2252i        μ7 = [  0.3549
        0.0424 + 0.0809i               -0.5360
        0.2436 + 0.3907i               -0.2557
       -0.2443 + 0.1977i               -0.0568
       -0.2995 + 0.1881i                0.6138
        0.4992                          0.3656
        0.0769 - 0.4644i ],             0.0880 ].
µ2 = µ̄1 , µ4 = µ̄3 , µ6 = µ̄5 , λ1 i = 0.6184i, λ2 i = −0.6184i,
λ3 i = 0.3196i, λ4 i = −0.3196i, λ5 i = 0.0288i, λ6 i = −0.0288i and
λ7 i = 0.
When $\lambda_1^s\mathrm{i}$ is compared with $\lambda_1\mathrm{i}$, the difference is
$$
\Delta\lambda_1=\lambda_1^s\,\mathrm{i}-\lambda_1\,\mathrm{i}=0.6182\,\mathrm{i}-0.6184\,\mathrm{i}=-0.0002\,\mathrm{i}.
$$
The modulus value of $\Delta\lambda_1$ is very small, which means that the eigenvalue computed by RNN (7.21) is very close to the true value. This verifies the validity of RNN (7.21) for computing eigenvalues. On the other hand, it
may be noticed that the eigenvector $\xi$ computed by RNN (7.21) is different from $\mu_k$ $(1\le k\le n)$. Computing the inner products between $\xi$ and $\mu_k$, we get: $U_1=\langle\xi,\mu_1\rangle=0.5336-0.5775\mathrm{i}$, $U_2=\langle\xi,\mu_2\rangle=2.7411\times10^{-6}-1.1816\times10^{-6}\mathrm{i}$, $U_3=\langle\xi,\mu_3\rangle=6.8277\times10^{-8}+7.7934\times10^{-8}\mathrm{i}$, $U_4=\langle\xi,\mu_4\rangle=1.9873\times10^{-13}+1.2935\times10^{-13}\mathrm{i}$, $U_5=\langle\xi,\mu_5\rangle=3.4937\times10^{-15}+9.1108\times10^{-15}\mathrm{i}$, $U_6=\langle\xi,\mu_6\rangle=-9.7491\times10^{-16}+1.1172\times10^{-15}\mathrm{i}$, $U_7=\langle\xi,\mu_7\rangle=-5.5511\times10^{-17}+5.0307\times10^{-17}\mathrm{i}$. Obviously, $U_k$ $(2\le k\le n)$ is nearly zero, while $U_1$ is not. Hence, $\xi$ belongs to the subspace corresponding to $\lambda_1$, i.e. $\xi\in V_1$.

Thus the eigenvector $\xi$ computed by RNN (7.21) is equivalent to $\mu_1$, both of which belong to the same subspace $V_1$.
When $x(0)$ is changed to other randomly generated values, the trajectories of the modulus values of the components of $x(t)$ vary. The values of $\xi$ vary too, but $\xi$ still belongs to $V_1$, and $\xi^T\bar{\xi}\,\mathrm{i}$ is still very close to $\lambda_1\mathrm{i}=0.6184\mathrm{i}$.
Clearly $\lambda_1^s\mathrm{i}$ is very close to $\lambda_1\mathrm{i}$, and $\xi$ is the corresponding eigenvector. This example verifies that it is effective to use RNN (7.21) to compute the eigenvalues and eigenvectors of real anti-symmetric matrices.
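As a cross-check of this example, the following minimal sketch (Python with NumPy, playing the same role as the Matlab functions used above; it is not the original simulation code) recomputes the ground-truth eigenvalues of the printed 7 × 7 anti-symmetric matrix and the inner products between the reported equilibrium vector ξ and the eigenvectors, so the subspace-membership argument can be reproduced numerically.

import numpy as np

# 7x7 anti-symmetric matrix A and equilibrium vector xi, copied from the example above
A = np.array([
    [ 0.0000, -0.0511,  0.2244, -0.0537,  0.0888, -0.1191,  0.1815],
    [ 0.0511,  0.0000,  0.1064,  0.2901, -0.0073,  0.0263,  0.2322],
    [-0.2244, -0.1064,  0.0000,  0.1406, -0.0206,  0.1402, -0.0910],
    [ 0.0537, -0.2901, -0.1406,  0.0000, -0.2708, -0.0640, -0.2373],
    [-0.0888,  0.0073,  0.0206,  0.2708,  0.0000,  0.1271,  0.1094],
    [ 0.1191, -0.0263, -0.1402,  0.0640, -0.1271,  0.0000, -0.1200],
    [-0.1815, -0.2322,  0.0910,  0.2373, -0.1094,  0.1200,  0.0000]])
xi = np.array([-0.1355-0.1563j, 0.0716-0.2914j, 0.2410-0.0716j, 0.2840+0.3073j,
               0.1556-0.2338j, 0.1076+0.1511j, 0.3334-0.1507j])

lam, mu = np.linalg.eig(A)                      # ground truth, as the Matlab call above
print("eigenvalues:", np.round(lam, 4))         # purely imaginary pairs, largest ~ +/-0.6184i
print("xi^T conj(xi) i =", xi @ xi.conj() * 1j) # network estimate, ~0.6182i

# Only the projection onto the eigenvector of +0.6184i should be non-negligible (xi in V1);
# the ordering and phases of NumPy's eigenvectors differ from Matlab's, so only the
# magnitudes of the inner products are meaningful here.
for k in range(7):
    print(lam[k], abs(xi @ mu[:, k].conj()))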

7.2.5. Section summary


Much research has focused on using neural networks to compute eigenvalues and eigenvectors of real symmetric matrices. In this section, a new neural network model was presented so that eigenvalues and eigenvectors of a real anti-symmetric matrix can be computed. The network was transformed into a differential equation, and the complex analytic solution of the equation was derived. Based on this solution, the convergence behavior of the neural network was analyzed. In order to examine the proposed neural network, a 7 × 7 real anti-symmetric matrix was randomly generated and used. The eigenvalue computed by the network was very close to the ground truth value, which was computed directly using Matlab functions. Although the computed eigenvector differed from the eigenvector associated with the ground truth eigenvalue, they both belong to the same subspace and are therefore equivalent. The network is thus effective for computing eigenvalues and eigenvectors of real anti-symmetric matrices.
It remains an open problem how to use this neural network to compute all the eigenvalues of a real anti-symmetric matrix. It is also unsolved
how to use neural networks to compute eigenvalues of matrices that are neither symmetric nor anti-symmetric. These problems inspire further research interest, and in Section 7.4 we will introduce some of our efforts on computing eigenvalues and eigenvectors of general real matrices using neural networks.

7.3. A Concise Recurrent Neural Network Computing the Largest Modulus Eigenvalues and Their Corresponding Eigenvectors of a Real Skew Matrix

As the extraction of eigenstructure has many applications, such as the fast retrieval of structural patterns [15] and handwritten digit recognition [16], many techniques have been proposed to accelerate the computation [4, 17]. There are also many works applying artificial neural networks (ANNs) to calculate eigenpairs, exploiting the concurrent and asynchronous running manner of ANNs [1, 2, 4–9, 11, 13, 14, 17–24].
Generally, the techniques can be categorized into three types. Firstly,
an energy function is constructed from the relations between eigenpairs and
the matrix, and a differential system which can be implemented by ANNs
is formed. At the equilibrium state of the system, one or more eigenpairs
is computed [19, 20]. Secondly, a special neural network is constructed,
originating from Oja’s algorithms to extract some eigenstructure. Most of
these kinds of methods [1, 2, 4–6, 8, 9, 13, 14, 18, 24] need the input matrix
to be symmetric, such as the autocorrelation matrix. Finally, a differential
system implemented by artificial recurrent neural networks is formulated
purely for calculating eigenstructure of a special matrix [21–23]. Ordinarily,
the computation burden of the first two types is heavy and the realization
circuit is complex. In the framework of the last type of technique, the neural
network can be flexibly designed. The concise expression of the differential
system is a significant factor affecting the neural network’s implementation
in hardware. In Section 7.2 we have introduced a recurrent neural network
adaptive to real anti-symmetric matrices. In this section, we will propose
an even more concise recurrent neural network (RNN) to complete the
calculation,

$$
\frac{\mathrm{d}v(t)}{\mathrm{d}t}=\left[A'-B(t)\right]v(t), \tag{7.58}
$$
where $t\ge 0$, $v(t)\in R^{2n}$, $A'=\begin{pmatrix} I_0 & A\\ -A & I_0\end{pmatrix}$, $B(t)=\begin{pmatrix} I\,U^n v(t) & -I\,U_n v(t)\\ I\,U_n v(t) & I\,U^n v(t)\end{pmatrix}$.
$I$ is an $n\times n$ unit matrix and $I_0$ is an $n\times n$ zero matrix; $U^n=(\underbrace{1,\cdots,1}_{n},\underbrace{0,\cdots,0}_{n})$ and $U_n=(\underbrace{0,\cdots,0}_{n},\underbrace{1,\cdots,1}_{n})$. $A$ is the real skew matrix. When $v(t)$ is taken as the states of the neurons, $[A'-B(t)]$ as the synaptic connection weights, and the activation functions as pure linear functions, Equation (7.58) describes a continuous time recurrent neural network.
Let $v(t)=\left(x^T(t)\;y^T(t)\right)^T$, $x(t)=(x_1(t),x_2(t),\cdots,x_n(t))^T\in R^n$ and $y(t)=(y_1(t),y_2(t),\cdots,y_n(t))^T\in R^n$. Rewrite Equation (7.58) as
$$
\begin{cases}
\dfrac{\mathrm{d}x(t)}{\mathrm{d}t}=Ay(t)-\sum_{j=1}^{n}x_j(t)\,x(t)+\sum_{j=1}^{n}y_j(t)\,y(t)\\[2mm]
\dfrac{\mathrm{d}y(t)}{\mathrm{d}t}=-Ax(t)-\sum_{j=1}^{n}y_j(t)\,x(t)-\sum_{j=1}^{n}x_j(t)\,y(t)
\end{cases}. \tag{7.59}
$$
Let
$$
z(t)=x(t)+\mathrm{i}y(t), \tag{7.60}
$$
where $\mathrm{i}$ denotes the imaginary unit. Equation (7.59) can be transformed into
$$
\frac{\mathrm{d}x(t)}{\mathrm{d}t}+\mathrm{i}\frac{\mathrm{d}y(t)}{\mathrm{d}t}=-A\mathrm{i}\left[x(t)+\mathrm{i}y(t)\right]-\sum_{j=1}^{n}\left[x_j(t)+y_j(t)\mathrm{i}\right]\left[x(t)+\mathrm{i}y(t)\right],
$$
that is,
$$
\frac{\mathrm{d}z(t)}{\mathrm{d}t}=-Az(t)\,\mathrm{i}-\sum_{j=1}^{n}z_j(t)\,z(t). \tag{7.61}
$$

From Equations (7.58), (7.59) and (7.61), we know that analyzing the convergence properties of Equation (7.61) is equivalent to analyzing those of RNN (7.58).
Actually, RNN (7.58) can easily be transformed into another neural network, with only the matrix $B$ changed to $B(t)=\begin{pmatrix} -I\,U^n v(t) & I\,U_n v(t)\\ -I\,U_n v(t) & -I\,U^n v(t)\end{pmatrix}$ while all the other components remain the same as in RNN (7.58). Similar to what will be discussed in this section, the theoretical analysis and numerical evaluation of this new form of neural network can be found in [25] in detail.

7.3.1. Preliminaries
We will use preliminaries similar to those of Section 7.2. Thus we only state the following lemmas; please refer back to Lemmas 7.2, 7.3 and 7.4 for their proofs.
Lemma 7.5. The eigenvalues of A are pure imaginary numbers or zero.

Lemma 7.6. If eigenvalue η i and µ construct an eigenpair of A, then −η i


and µ̄ also constitute an eigenpair of A.

Lemma 7.7. If λk i ̸= λm i, then µk ⊥µm , i.e. ⟨µk , µm ⟩ = 0, k, m ∈ [1, n].

Using Lemma 7.7 gives that if $\lambda_k\mathrm{i}\neq\lambda_m\mathrm{i}$, then $V_k\perp V_m$. When $\lambda_k\mathrm{i}=\lambda_m\mathrm{i}$, i.e. the algebraic multiplicity of $\lambda_k\mathrm{i}$ is more than one, two mutually orthogonal subspaces can be specially determined. Thus
$$
V_1\perp\cdots\perp V_n. \tag{7.62}
$$
Let $\xi\in C^n$ denote the equilibrium vector; when $\xi$ exists, there holds
$$
\xi=\lim_{t\to\infty}z(t). \tag{7.63}
$$
From (7.61), it follows that
$$
-A\xi\,\mathrm{i}=\sum_{j=1}^{n}\xi_j\,\xi. \tag{7.64}
$$
When $\xi$ is an eigenvector of $A$, let $\bar{\lambda}\mathrm{i}$ ($\bar{\lambda}\in R$) denote the associated eigenvalue; we have
$$
A\xi=\bar{\lambda}\mathrm{i}\,\xi. \tag{7.65}
$$
From (7.64) and (7.65), it follows that
$$
\bar{\lambda}=\sum_{j=1}^{n}\xi_j. \tag{7.66}
$$
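In practice, Equations (7.61) and (7.66) already suggest how the network is used: integrate (7.61) from a suitable initial state and read the eigenvalue off the sum of the components of the equilibrium vector. A minimal numerical sketch of this procedure is given below (Python with NumPy rather than the Matlab used for the chapter's own simulations; the matrix, the initial vector, the step size and the integration horizon are illustrative assumptions, and the conditions under which the trajectory converges rather than overflows are analysed in Section 7.3.3).

import numpy as np

# Integrate Equation (7.61), dz/dt = -A z i - (sum_j z_j) z, with forward Euler,
# then read the eigenvalue off via Equation (7.66).
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = (M - M.T) / 2                       # a real skew matrix: A^T = -A (illustrative)

z = 1j * rng.random(5)                  # purely imaginary start, as in the examples of 7.3.4
dt, steps = 1e-3, 100_000
for _ in range(steps):
    z = z + dt * (-1j * (A @ z) - z.sum() * z)

print("estimated eigenvalue :", z.sum() * 1j)     # Equation (7.66): eigenvalue = (sum_j xi_j) i
print("largest Im eigenvalue:", np.linalg.eigvals(A).imag.max(), "i")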

7.3.2. Analytic solution

Theorem 7.13. Let $S_k=\mu_k/|\mu_k|$ and let $z_k(t)$ denote the projection value of $z(t)$ onto $S_k$; then
$$
z_k(t)=\frac{z_k(0)\exp(\lambda_k t)}{1+\sum_{l=1}^{n}z_l(0)\int_0^t\exp(\lambda_l\tau)\,\mathrm{d}\tau} \tag{7.67}
$$
for $t\ge 0$.
Proof: By using the ideas in proving Theorem 7.9, this theorem can be proved.
7.3.3. Convergence analysis of RNN

Theorem 7.14. If $x_l(0)<0$ and $y_l(0)=0$ for $l=1,2,\cdots,n$, there must exist overflow.
Proof: When $y_l(0)=0$ for $l=1,2,\cdots,n$, from Theorem 7.1 it follows that
$$
\sum_{k=1}^{n}z_k(t)=\sum_{k=1}^{n}\frac{z_k(0)\exp(\lambda_k t)}{1+\sum_{l=1}^{n}z_l(0)\int_0^t\exp(\lambda_l\tau)\,\mathrm{d}\tau}
=\sum_{k=1}^{n}x_k(0)\exp(\lambda_k t)\,\frac{1}{1+\sum_{l=1}^{n}x_l(0)\int_0^t\exp(\lambda_l\tau)\,\mathrm{d}\tau}. \tag{7.68}
$$
When $x_l(0)<0$, whether $\lambda_l=0$ or $\lambda_l>0$, it follows that
$$
\sum_{l=1}^{n}x_l(0)\int_0^t\exp(\lambda_l\tau)\,\mathrm{d}\tau<0.
$$
Let $F(t)=1+\sum_{l=1}^{n}x_l(0)\int_0^t\exp(\lambda_l\tau)\,\mathrm{d}\tau$. Evidently $F(t)<1$ and $\dot F(t)=\sum_{l=1}^{n}x_l(0)\exp(\lambda_l t)<0$, that is, there must exist a time $t_g$ which leads to
$$
1+\sum_{l=1}^{n}x_l(0)\int_0^{t_g}\exp(\lambda_l\tau)\,\mathrm{d}\tau=0,
$$
i.e. at time $t_g$ RNN (7.58) enters into an overflow state. This theorem is proved.
Theorem 7.15. If $z(0)\in V_m$ and $x(0)>0$, then, when $\xi\neq 0$, $\mathrm{i}\sum_{k=1}^{n}\xi_k=\lambda_m\mathrm{i}$.
Proof: Using Lemma 7.3 gives that $V_m\perp V_k$ $(1\le k\le n,\ k\neq m)$, so $z_k(0)=0$ $(1\le k\le n,\ k\neq m)$. Using Theorems 7.1 and 7.8 gives that
$$
\sum_{k=1}^{n}\xi_k=\lim_{t\to\infty}\sum_{k=1}^{n}\frac{z_k(0)\exp(\lambda_k t)}{1+\sum_{l=1}^{n}z_l(0)\int_0^t\exp(\lambda_l\tau)\,\mathrm{d}\tau}
=\lim_{t\to\infty}\frac{z_m(0)\exp(\lambda_m t)}{1+z_m(0)\int_0^t\exp(\lambda_m\tau)\,\mathrm{d}\tau}. \tag{7.69}
$$
When $\lambda_m\le 0$, it easily follows that $\sum_{k=1}^{n}\xi_k=0$, i.e. $\xi=0$, which contradicts $\xi\neq 0$. So $\lambda_m>0$. In this instance, using Equation (7.69) gives that
$$
\sum_{k=1}^{n}\xi_k=\lim_{t\to\infty}\frac{z_m(0)}{\exp(-\lambda_m t)+\frac{1}{\lambda_m}z_m(0)\left[1-\exp(-\lambda_m t)\right]}=\lambda_m,
$$
i.e. $\mathrm{i}\sum_{k=1}^{n}\xi_k=\lambda_m\mathrm{i}$. This theorem is proved.
Theorem 7.16. If $z(0)\notin V_m$ $(m=1,\cdots,n)$ and $x_l(0)\ge 0$ $(l=1,2,\cdots,n)$, then $\xi\in V_1$ and $\sum_{k=1}^{n}\xi_k\,\mathrm{i}=\lambda_1\mathrm{i}$.
Proof: Since $z(0)\notin V_m$ and by Equation (7.62), $z_l(0)\neq 0$. From $x_l(0)\ge 0$ and Theorem 7.2, it can be concluded that there does not exist overflow. Using Theorems 7.1 and 7.8 gives that
$$
\begin{aligned}
\xi&=\lim_{t\to\infty}\frac{z_1(0)\exp(\lambda_1 t)S_1+z_2(0)\exp(\lambda_2 t)S_2+\cdots+z_n(0)\exp(\lambda_n t)S_n}{1+\sum_{l=1}^{n}z_l(0)\int_0^t\exp(\lambda_l\tau)\,\mathrm{d}\tau}\\
&=\lim_{t\to\infty}\frac{z_1(0)S_1+z_2(0)\exp\left[(\lambda_2-\lambda_1)t\right]S_2+\cdots+z_n(0)\exp\left[(\lambda_n-\lambda_1)t\right]S_n}{\exp(-\lambda_1 t)+\frac{1}{\lambda_1}z_1(0)\left[1-\exp(-\lambda_1 t)\right]+\sum_{l=2}^{n}z_l(0)\int_0^t\exp\left(\lambda_l\tau-\lambda_1 t\right)\mathrm{d}\tau}\\
&=\lambda_1 S_1\in V_1;
\end{aligned}
$$
obviously $\sum_{k=1}^{n}\xi_k\,\mathrm{i}=\lambda_1\mathrm{i}$.

Corollary 7.1. $\bar{\xi}$ is an eigenvector corresponding to $\lambda_2\mathrm{i}$.
Proof: Using Equation (7.64) gives that
$$
A\bar{\xi}\,\mathrm{i}=\sum_{j=1}^{n}\bar{\xi}_j\,\bar{\xi}. \tag{7.70}
$$
Since $\sum_{j=1}^{n}\bar{\xi}_j=\overline{\sum_{j=1}^{n}\xi_j}=\bar{\lambda}_1=\lambda_1$, $\lambda_2=-\lambda_1$ and Equation (7.70),
$$
A\bar{\xi}\,\mathrm{i}=-\lambda_2\bar{\xi},
$$
i.e.
$$
A\bar{\xi}=\lambda_2\mathrm{i}\,\bar{\xi}.
$$
So, this corollary can be drawn.

Theorem 7.17. If $\lambda_{l1}=\lambda_{l2}=\cdots=\lambda_{lp}\ge 0$, nonzero $z(0)\in V_{l1}\oplus V_{l2}\oplus\cdots\oplus V_{lp}$ and $x_{lk}(0)\ge 0$ $(k=1,2,\cdots,p)$, then $\xi\in V_{l1}\oplus V_{l2}\oplus\cdots\oplus V_{lp}$, and $\sum_{k=1}^{n}\xi_k\,\mathrm{i}=\lambda_{l1}\mathrm{i}=\lambda_{l2}\mathrm{i}=\cdots=\lambda_{lp}\mathrm{i}$ $(l1,l2,\cdots,lp\in[1,n])$.
Proof: Since $z(0)\in V_{l1}\oplus V_{l2}\oplus\cdots\oplus V_{lp}$, there exists at least one component $z_k\neq 0$ for $k\in(l1,l2,\cdots,lp)$, and $z_{k'}=0$ for $k'\notin(l1,l2,\cdots,lp)$. Let $\lambda_l$ denote $\lambda_{l1}=\lambda_{l2}=\cdots=\lambda_{lp}$. Using Theorem 7.1 and Equation (7.63), it gives that
$$
\begin{aligned}
\sum_{k=1}^{n}\xi_k&=\lim_{t\to\infty}\sum_{k=1}^{n}\frac{z_k(0)\exp(\lambda_k t)}{1+\sum_{l=1}^{n}z_l(0)\int_0^t\exp(\lambda_l\tau)\,\mathrm{d}\tau}\\
&=\lim_{t\to\infty}\frac{\sum_{k\in(l1,\cdots,lp)}z_k(0)\exp(\lambda_k t)+\sum_{k\notin(l1,\cdots,lp)}z_k(0)\exp(\lambda_k t)}{1+\sum_{k\in(l1,\cdots,lp)}z_k(0)\int_0^t\exp(\lambda_k\tau)\,\mathrm{d}\tau+\sum_{k\notin(l1,\cdots,lp)}z_k(0)\int_0^t\exp(\lambda_k\tau)\,\mathrm{d}\tau}\\
&=\lim_{t\to\infty}\frac{\sum_{k\in(l1,\cdots,lp)}z_k(0)\exp(\lambda_k t)}{1+\sum_{k\in(l1,\cdots,lp)}z_k(0)\int_0^t\exp(\lambda_k\tau)\,\mathrm{d}\tau}.
\end{aligned} \tag{7.71}
$$
If $\lambda_l=0$, using Equation (7.71) gives that
$$
\sum_{k=1}^{n}\xi_k=\lim_{t\to\infty}\frac{\sum_{k\in(l1,\cdots,lp)}z_k(0)}{1+\sum_{k\in(l1,\cdots,lp)}z_k(0)\,t}=0=\lambda_l,
$$
and if $\lambda_l>0$, Equation (7.71) gives that
$$
\sum_{k=1}^{n}\xi_k=\lim_{t\to\infty}\frac{\sum_{k\in(l1,\cdots,lp)}z_k(0)\exp(\lambda_l t)}{1+\sum_{k\in(l1,\cdots,lp)}z_k(0)\int_0^t\exp(\lambda_k\tau)\,\mathrm{d}\tau}
=\lim_{t\to\infty}\frac{\sum_{k\in(l1,\cdots,lp)}z_k(0)}{\exp(-\lambda_l t)+\frac{1}{\lambda_l}\sum_{k\in(l1,\cdots,lp)}z_k(0)\left[1-\exp(-\lambda_l t)\right]}
=\lambda_l. \tag{7.72}
$$
From Equations (7.71) and (7.72), this theorem is proved.

7.3.4. Simulation
In order to evaluate RNN (7.58), we provide two examples here. One will
demonstrate the effectiveness of RNN (7.58), and the other will illustrate
that the convergence behavior remains good for large dimension matrices.
Example 1: let

A = [  0       -0.2564   0.2862  -0.0288   0.4294
       0.2564   0        -0.1263  -0.1469  -0.1567
      -0.2862   0.1263    0       -0.2563  -0.1372
       0.0288   0.1469    0.2563   0       -0.0657
      -0.4294   0.1567    0.1372   0.0657   0      ].

The equilibrium vector is
$$
\xi=(0.1587-0.5258\mathrm{i}\;\;-0.3023+0.1154\mathrm{i}\;\;0.2279+0.2702\mathrm{i}\;\;0.0998+0.0171\mathrm{i}\;\;0.4470+0.1231\mathrm{i})^T
$$
and
$$
\lambda_1\mathrm{i}=\mathrm{i}\sum_{k=1}^{5}\xi_k=0.6311\,\mathrm{i}.
$$

The trajectories of the real parts of the components of $z(t)$ are illustrated in Fig. 7.13 and the trajectory of $\sum_{k=1}^{n}z_k(t)$ is shown in Fig. 7.14.

Fig. 7.13. The trajectories of real part of Zk (t).

The largest modulus true eigenvalues directly computed by Matlab as the ground truth are $\bar{\lambda}_1\mathrm{i}=0.6310\,\mathrm{i}$ and $\bar{\lambda}_2\mathrm{i}=-0.6310\,\mathrm{i}$. The corresponding eigenvectors are
Fig. 7.14. The trajectories of the real part and imaginary part of $\sum_{i=1}^{5}z_i(t)$.

 
S1 = [  0.6315                 S3 = [  0.0024 + 0.2991i        S5 = [  0.1533
       -0.2274 - 0.2944i               0.1267 + 0.2947i                0.7193
       -0.2217 + 0.3406i              -0.0802 + 0.5391i               -0.2749
        0.0143 + 0.1155i               0.6904                         -0.1397
        0.0130 + 0.5329i ],           -0.0284 - 0.1817i ],             0.6033 ],

with $S_2=\bar{S}_1$ and $S_4=\bar{S}_3$.
When $\lambda_1\mathrm{i}$ is compared with $\bar{\lambda}_1\mathrm{i}$, it can be seen that their difference is trivial. Therefore, RNN (7.58) is effective for computing the largest modulus eigenvalues $\bar{\lambda}_1\mathrm{i}$ and $\bar{\lambda}_2\mathrm{i}$. Since $\xi^T\bar{S}_2=\xi^T\bar{S}_3=\xi^T\bar{S}_4=\xi^T\bar{S}_5=0$ and $\xi^T\bar{S}_1=0.2513+0.8327\,\mathrm{i}$, $\xi$ is equivalent to $S_1$. As $\bar{\xi}^T\bar{S}_1=\bar{\xi}^T\bar{S}_3=\bar{\xi}^T\bar{S}_4=\bar{\xi}^T\bar{S}_5=0$ and $\bar{\xi}^T\bar{S}_2=0.2513-0.8327\,\mathrm{i}$, $\bar{\xi}$ is equivalent to $S_2$. Therefore the largest modulus eigenvalues $(\bar{\lambda}_1\mathrm{i},\bar{\lambda}_2\mathrm{i})$ and their corresponding eigenvectors are computed.
The above results are obtained when the initial vector is $z(0)=(0.1520\mathrm{i}\;\;0.2652\mathrm{i}\;\;0.1847\mathrm{i}\;\;0.0139\mathrm{i}\;\;0.2856\mathrm{i})^T$. When the initial vector is replaced with other random values, the results change only trivially and remain very close to the corresponding ground truth values.
Example 2: Let $A\in R^{85\times 85}$, $A(j,i)=-A(i,j)=-(\sin(i)+\sin(j))/2$ and $z(0)=(\sin(1)\mathrm{i}\;\;\sin(2)\mathrm{i}\;\;\cdots\;\;\sin(85)\mathrm{i})^T$. The calculated largest modulus eigenvalue is $19.6672\mathrm{i}$, which is very close to the corresponding
Fig. 7.15. The trajectories of real part of zk (t).

Fig. 7.16. The trajectories of real and imaginary parts of $\sum_{i=1}^{85}z_i(t)$.

ground truth value 19.6634i. Convergence behaviors are shown in Figs 7.15
and 7.16. From these figures, we can see that the iteration numbers do
not increase with the dimension, and that this method is still effective even
when the dimension number is very large.
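Example 2 is easy to reproduce numerically; a minimal sketch follows (Python/NumPy rather than the Matlab used in the chapter; the forward-Euler step size, the number of steps, and the assumption that the positive entries of the skew matrix sit in the upper triangle are all illustrative choices of this sketch, not values given in the text). It builds the 85 × 85 skew matrix and purely imaginary initial vector defined above, integrates Equation (7.61), and compares $\mathrm{i}\sum_k z_k$ with a direct eigensolver.

import numpy as np

n = 85
i_idx = np.arange(1, n + 1)
# A(j,i) = -A(i,j) = -(sin(i) + sin(j))/2, as in Example 2 (1-based indices)
A = 0.5 * (np.sin(i_idx)[:, None] + np.sin(i_idx)[None, :])
A = np.triu(A, 1)            # keep the (assumed) positive upper triangle ...
A = A - A.T                  # ... and make the matrix skew: A(j,i) = -A(i,j)

z = 1j * np.sin(i_idx)       # z(0) = (sin(1)i, sin(2)i, ..., sin(85)i)^T

dt = 1e-3                    # forward-Euler integration of Equation (7.61); dt is an assumption
for _ in range(50_000):
    z = z + dt * (-1j * (A @ z) - z.sum() * z)

print("RNN estimate :", z.sum() * 1j)            # ~19.67i, as reported above
print("ground truth :", np.abs(np.linalg.eigvals(A).imag).max(), "i")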

7.3.5. Comparison with other methods and discussions


When RNN (7.58) is to be implemented in hardware, some discretization techniques must be considered [26]. In the Matlab environment, operating on a single central processing unit, the parallel-running feature of an ANN is not exhibited, as the nodes of the network cannot run
synchronously. But for the standard power method [10, 17], which is serial in nature, high performance can be achieved on a serial running platform. To compare RNN (7.58) with the standard power method, let $A\in R^{15\times 15}$, $A(j,i)=-A(i,j)=-(\cos(i)+\sin(j))/2$ for $i\neq j$ and $A(i,i)=0$. The simulated results corresponding to RNN (7.58) and the standard power method are shown in Figs 7.17, 7.18, 7.19 and 7.20 respectively. From Fig. 7.17, we can see that RNN (7.58) goes into a unique equilibrium state quickly, while the standard power method falls into a cyclic procedure, which can be seen in Fig. 7.19. The eigenvalue calculated by RNN (7.58) is 4.0695i, and that by the standard power method is 4.0693i, both of which are very close to the ground truth value 4.0693i. The convergence behaviors of these two methods are shown in Figs 7.18 and 7.20, respectively. Also from Figs 7.18 and 7.20, we can see that the convergence rates of the two methods are similar. After around five iterations, the calculated values approach the ground truth eigenvalue.
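For reference, one common way of applying the power method to a real skew matrix, and the variant assumed in the sketch below (Python/NumPy; this is not necessarily the exact implementation behind Figs 7.19 and 7.20, and the choice of which triangle of the matrix carries the positive entries, the start vector and the iteration count are assumptions), is to iterate $v\leftarrow Av/\|Av\|$ and read the dominant eigenvalue magnitude from the norm ratio $\|Av\|/\|v\|$: the iterate itself keeps rotating in the dominant invariant subspace, which is the cycling behaviour seen in Fig. 7.19, while the norm ratio settles close to the largest modulus.

import numpy as np

# 15x15 test matrix of the comparison: A(j,i) = -A(i,j) = -(cos(i)+sin(j))/2, zero diagonal
n = 15
i_idx = np.arange(1, n + 1)
A = 0.5 * (np.cos(i_idx)[:, None] + np.sin(i_idx)[None, :])
A = np.triu(A, 1) - np.triu(A, 1).T          # enforce skew-symmetry, A(i,i) = 0

v = np.ones(n)                               # arbitrary start vector (assumption)
for k in range(50):
    w = A @ v
    est = np.linalg.norm(w) / np.linalg.norm(v)   # converges to the largest |eigenvalue|
    v = w / np.linalg.norm(w)                     # the direction itself keeps cycling

print("power-method estimate:", est, "(eigenvalue ~ est * i)")
print("ground truth         :", np.abs(np.linalg.eigvals(A)).max())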

Fig. 7.17. The trajectories of the real parts of zi (t) for 1 ≤ i ≤ 15 by RNN (7.58).

With an autocorrelation matrix, Fa-Long Luo introduced two algorithms: one calculating the principal eigenvectors [4], and the other extracting the eigenvectors corresponding to the smallest eigenvalues
other extracting the eigenvectors corresponding to the smallest eigenvalues
[2]. He also proposed a real-time neural computation approach to extract
the largest eigenvalue of a positive matrix [9]. Compared with these
references, the electronic circuits of RNN (7.58) may be more complex
because it considers the question in complex space, but is more suitable for
a real skew matrix. As for reference [9], if the eigenvalue was a complex
number, the computation model would fail. The approach proposed by
Fig. 7.18. The trajectory of $\sum_{i=1}^{15}z_i(t)$ by RNN (7.58).

Fig. 7.19. The trajectories of the components of z(t) by the standard power method.

references [13, 14] involves a neural network model which is different from RNN (7.58), and the scheme for extracting eigen-parameters in these two references would fail when the matrix had repeated eigenvalues and the initial state was not specially restricted. References [8, 21, 22] are only suitable for a real symmetric matrix. RNN (7.58) is more concise than the model in reference [23] since RNN (7.58) does not include the computation of $|z(t)|^2$. Briefly speaking, compared with the published methods, ours offers novelties in several respects.
Fig. 7.20. The trajectory of the variable which converges to the largest modulus
eigenvalue by the standard power method.

7.3.6. Section summary


A new concise recurrent neural network model for computing the largest modulus eigenvalues and the corresponding eigenvectors of a real anti-symmetric matrix was proposed in this section. After the model was given, it was equivalently transformed into a complex dynamical system depicted by a complex differential equation. With the analytic solution of the equation, the convergence behaviors were analyzed in detail. Simulation results verified the effectiveness of the method for calculating the eigenvalues and corresponding eigenvectors. The obtained eigenvectors were proved equivalent to the corresponding ground truth. Compared with other neural networks used in the same field, the proposed model is concise and adapted to real anti-symmetric matrices, with good convergence properties.

7.4. A Recurrent Neural Network Computing the Largest Imaginary or Real Part of Eigenvalues of a General Real Matrix

In the first three sections of this chapter, we have introduced how to use neural networks to solve the problems of computing the eigenpairs of real symmetric or anti-symmetric matrices, which have been popular research topics in the past few years. Reference [27] generalized the well-known Oja network to non-symmetric matrices. Liu et al. [21] improved the model in [8] and introduced a simpler model to accomplish analogous calculations. Reference [23] introduced an RNN
model to compute the largest modulus eigenvalues and their corresponding eigenvectors of an anti-symmetric matrix. It is found that most of the recent efforts have mainly been focused on solving eigenpairs of real symmetric or anti-symmetric matrices. Although [27] involved non-symmetric matrices, its section “The nonlinear homogeneous case” has the assumption of $\alpha(w)\in R$, which requires the matrix to have only real eigenvalues. Following a search of the relevant literature, we were unable to locate a neural network-based method to calculate eigenvalues of a general real matrix, whether the eigenvalue is a real or general complex number. Thus, in this section, a recurrent neural network-based approach will be proposed (Equation (7.73)) to estimate the largest modulus of eigenvalues of a general real matrix.
$$
\frac{\mathrm{d}v(t)}{\mathrm{d}t}=A'v(t)-|v(t)|^2 v(t), \tag{7.73}
$$
where $t\ge 0$, $v(t)\in R^{2n}$, $A'=\begin{pmatrix}0 & A\\ -A & 0\end{pmatrix}$, and $A$ is the general real matrix requiring calculation. $|v(t)|$ denotes the modulus of $v(t)$. When $v(t)$ is seen as the states of the neurons, with $\left[A'-|v(t)|^2 U\right]$ ($U$ denotes an identity matrix of suitable dimension) being regarded as the synaptic connection weights and the activation functions being assumed to be pure linear functions, Equation (7.73) describes a continuous time recurrent neural network.
Let v (t) = xT (t) y T (t) , x (t) = (x1 (t) , x2 (t) , · · · , xn (t)) ∈ Rn
T
and y (t) = (y1 (t) , y2 (t) , · · · , yn (t)) ∈ Rn . From Equation (7.73), it
follows that
 n [ ]
 ∑
 dx(t)
 dt = Ay (t) − x2j (t) + yj2 (t) x (t)
j=1
n [
∑ ] . (7.74)

 dy(t)
−Ax − x2j (t) + yj2 (t) y (t)
 dt = (t)
j=1

Let

z (t) = x (t) + iy (t) , (7.75)

with i denoting the imaginary unit, it can be easily followed from Equation
(7.74) that

dx(t) dy(t) ∑n
[ 2 ]
+i = −Ai [x (t) + iy (t)] − xj (t) + yj2 (t) [x (t) + iy (t)] ,
dt dt j=1
i.e.
$$
\frac{\mathrm{d}z(t)}{\mathrm{d}t}=-Az(t)\,\mathrm{i}-z^T(t)\bar{z}(t)\,z(t), \tag{7.76}
$$
where z̄ (t) denotes the complex conjugate vector of z (t). Obviously,
Equation (7.76) is a complex differential system. A set of ordinary
differential equations is just a model which may approximate the real
behavior of some neural networks, although there are differences between
them. In the following subsections, we will discuss the convergence
properties of Equation (7.76) instead of RNN (7.73).

7.4.1. Analytic expression of $|z(t)|^2$
All eigenvalues of $A$ are denoted as $\lambda_1^R+\lambda_1^I\mathrm{i},\lambda_2^R+\lambda_2^I\mathrm{i},\cdots,\lambda_n^R+\lambda_n^I\mathrm{i}$ ($\lambda_k^R,\lambda_k^I\in R$, $k=1,2,\cdots,n$), and the corresponding complex eigenvectors are denoted as $\mu_1,\cdots,\mu_n$.
With any general real matrix $A$, there are two cases for $\mu_1,\cdots,\mu_n$:
1) when the rank of $A$ is deficient, some of $\lambda_1^R+\lambda_1^I\mathrm{i},\lambda_2^R+\lambda_2^I\mathrm{i},\cdots,\lambda_n^R+\lambda_n^I\mathrm{i}$ may be zero. When $\lambda_j^R+\lambda_j^I\mathrm{i}=0$, $\mu_j$ can be chosen randomly, ensuring that $\mu_1,\cdots,\mu_n$ construct a basis in $C^n$;
2) when $A$ is of full rank, $\mu_1,\cdots,\mu_n$ are decided by $A$. Although they may not be orthogonal to each other, they can still construct a basis in $C^n$.
Let $S_k=\mu_k/|\mu_k|$; obviously $S_1,\cdots,S_n$ construct a normalized basis in $C^n$.

Theorem 7.18. Let $z_k(t)=x_k(t)+\mathrm{i}\,y_k(t)$ denote the projection value of $z(t)$ onto $S_k$. The analytic expression of $|z(t)|^2$ is
$$
|z(t)|^2=\frac{\sum_{k=1}^{n}\exp\left(2\lambda_k^I t\right)\left[x_k^2(0)+y_k^2(0)\right]}{1+2\sum_{j=1}^{n}\left[x_j^2(0)+y_j^2(0)\right]\int_0^t\exp\left(2\lambda_j^I\tau\right)\mathrm{d}\tau} \tag{7.77}
$$
for $t\ge 0$.
Proof: the proof of this theorem is similar to that of Theorem 7.9 and is omitted here.

7.4.2. Convergence analysis


If an equilibrium vector of RNN (7.73) exists, let ξ denote it, and there
exists
March 2, 2012 11:56 World Scientific Review Volume - 9in x 6in 01˙ws-rv9x6/Chp. 7

Solving Eigen-problems of Matrices by Neural Networks 169

ξ = lim z (t) . (7.78)


t→∞

Theorem 7.19. If each eigenvalue is a real number, then $|\xi|=0$.
Proof: From Theorem 7.1, we know
$$
|z|=\sqrt{\frac{\sum_{k=1}^{n}\exp\left(2\lambda_k^I t\right)\left[x_k^2(0)+y_k^2(0)\right]}{1+2\sum_{j=1}^{n}\left[x_j^2(0)+y_j^2(0)\right]\int_0^t\exp\left(2\lambda_j^I\tau\right)\mathrm{d}\tau}}. \tag{7.79}
$$
Thus,
$$
|\xi|=\lim_{t\to\infty}|z(t)|=\lim_{t\to\infty}\sqrt{\frac{\sum_{k=1}^{n}\exp\left(2\lambda_k^I t\right)\left[x_k^2(0)+y_k^2(0)\right]}{1+2\sum_{j=1}^{n}\left[x_j^2(0)+y_j^2(0)\right]\int_0^t\exp\left(2\lambda_j^I\tau\right)\mathrm{d}\tau}}. \tag{7.80}
$$
Since each eigenvalue is real,
$$
\lambda_1^I=\lambda_2^I=\cdots=\lambda_n^I=0. \tag{7.81}
$$
From Equations (7.80) and (7.81), it follows that
$$
|\xi|=\lim_{t\to\infty}\sqrt{\frac{\sum_{k=1}^{n}\left[x_k^2(0)+y_k^2(0)\right]}{1+2\sum_{j=1}^{n}\left[x_j^2(0)+y_j^2(0)\right]t}}=0.
$$
This theorem is proved. It implies that if a matrix has only real eigenvalues, RNN (7.73) will converge to the zero point, independently of the initial complex vector.

Theorem 7.20. Denote $\lambda_m^I=\max_{1\le k\le n}\lambda_k^I$. If $\lambda_m^I>0$, then $\xi^T\bar{\xi}=\lambda_m^I$.
Proof: Using Equation (7.78) and Theorem 7.1 gives that
$$
\xi^T\bar{\xi}=\lim_{t\to\infty}|z(t)|^2=\lim_{t\to\infty}\frac{\sum_{k=1}^{n}\exp\left(2\lambda_k^I t\right)\left[x_k^2(0)+y_k^2(0)\right]}{1+2\sum_{j=1}^{n}\left[x_j^2(0)+y_j^2(0)\right]\int_0^t\exp\left(2\lambda_j^I\tau\right)\mathrm{d}\tau},
$$
i.e.
$$
\begin{aligned}
\xi^T\bar{\xi}&=\lim_{t\to\infty}\frac{\exp\left(2\lambda_m^I t\right)\left[x_m^2(0)+y_m^2(0)\right]+\sum_{k=1,k\neq m}^{n}\exp\left(2\lambda_k^I t\right)\left[x_k^2(0)+y_k^2(0)\right]}{1+2\left[x_m^2(0)+y_m^2(0)\right]\int_0^t\exp\left(2\lambda_m^I\tau\right)\mathrm{d}\tau+2\sum_{j=1,j\neq m}^{n}\left[x_j^2(0)+y_j^2(0)\right]\int_0^t\exp\left(2\lambda_j^I\tau\right)\mathrm{d}\tau}\\
&=\lim_{t\to\infty}\frac{\left[x_m^2(0)+y_m^2(0)\right]+\sum_{k=1,k\neq m}^{n}\exp\left[2\left(\lambda_k^I-\lambda_m^I\right)t\right]\left[x_k^2(0)+y_k^2(0)\right]}{\exp\left(-2\lambda_m^I t\right)+\frac{1}{\lambda_m^I}\left[x_m^2(0)+y_m^2(0)\right]\left[1-\exp\left(-2\lambda_m^I t\right)\right]+2\sum_{j=1,j\neq m}^{n}\left[x_j^2(0)+y_j^2(0)\right]\int_0^t\exp\left(2\lambda_j^I\tau-2\lambda_m^I t\right)\mathrm{d}\tau}\\
&=\lambda_m^I.
\end{aligned}
$$
This theorem is proved. From this theorem, we know that when the maximal imaginary part of the eigenvalues is positive, RNN (7.73) will converge to a nonzero equilibrium vector. In addition, the square modulus of this vector is equal to the largest imaginary part.
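A minimal numerical illustration of Theorem 7.20 is sketched below (Python/NumPy; the matrix, the forward-Euler step size and the horizon are assumptions of the sketch, and the chapter's own experiments use Matlab). It integrates Equation (7.76) for a random general real matrix and compares $\xi^T\bar{\xi}$ with the largest imaginary part of the eigenvalues; how close the two numbers get depends on the eigenvalue gaps and the integration horizon, and if all eigenvalues happen to be real the state simply decays to zero, consistent with Theorem 7.19.

import numpy as np

rng = np.random.default_rng(1)
n = 7
A = rng.random((n, n))                  # a general real matrix (illustrative)

z = rng.random(n) + 1j * rng.random(n)  # random complex initial state
dt, steps = 1e-3, 500_000
for _ in range(steps):
    z = z + dt * (-1j * (A @ z) - (z @ z.conj()) * z)   # Equation (7.76)

print("xi^T conj(xi)      :", (z @ z.conj()).real)
print("max Im(eigenvalues):", np.linalg.eigvals(A).imag.max())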
Theorem 7.21. If $A'$ is replaced by $\begin{pmatrix}A & 0\\ 0 & A\end{pmatrix}$, then
$$
|z(t)|^2=\frac{\sum_{k=1}^{n}\exp\left(2\lambda_k^R t\right)\left[x_k^2(0)+y_k^2(0)\right]}{1+2\sum_{j=1}^{n}\left[x_j^2(0)+y_j^2(0)\right]\int_0^t\exp\left(2\lambda_j^R\tau\right)\mathrm{d}\tau}.
$$
Proof: When $A'=\begin{pmatrix}A & 0\\ 0 & A\end{pmatrix}$, in a similar way to the derivation of Equation (7.76), RNN (7.73) is transformed into
$$
\frac{\mathrm{d}z(t)}{\mathrm{d}t}=Az(t)-z^T(t)\bar{z}(t)\,z(t). \tag{7.82}
$$
Using the notations of $x_k(t)$ and $y_k(t)$, we have
$$
\frac{\mathrm{d}}{\mathrm{d}t}x_k(t)=\lambda_k^R x_k(t)-\lambda_k^I y_k(t)-\sum_{j=1}^{n}\left[x_j^2(t)+y_j^2(t)\right]x_k(t), \tag{7.83}
$$
$$
\frac{\mathrm{d}}{\mathrm{d}t}y_k(t)=\lambda_k^I x_k(t)+\lambda_k^R y_k(t)-\sum_{j=1}^{n}\left[x_j^2(t)+y_j^2(t)\right]y_k(t). \tag{7.84}
$$
From Equations (7.83) and (7.84), it follows that
$$
\frac{\mathrm{d}}{\mathrm{d}t}\left[x_k^2(t)+y_k^2(t)\right]=2\lambda_k^R\left[x_k^2(t)+y_k^2(t)\right]-2\sum_{j=1}^{n}\left[x_j^2(t)+y_j^2(t)\right]\left[x_k^2(t)+y_k^2(t)\right]. \tag{7.85}
$$
The remaining steps follow the proof of Theorem 7.9.
This theorem provides a way to extract the maximal real part of all eigenvalues through rearranging the connection weights.
Theorem 7.22. Let $\lambda_m^R=\max_{1\le k\le n}\lambda_k^R$. When $A'$ is replaced by $\begin{pmatrix}A & 0\\ 0 & A\end{pmatrix}$, if $\lambda_m^R\le 0$, then $\xi^T\bar{\xi}=0$; if $\lambda_m^R>0$, then $\xi^T\bar{\xi}=\lambda_m^R$.
Proof: Using Equation (7.78) and Theorem 7.21 gives that
$$
\xi^T\bar{\xi}=\lim_{t\to\infty}|z(t)|^2=\lim_{t\to\infty}\frac{\sum_{k=1}^{n}\exp\left(2\lambda_k^R t\right)\left[x_k^2(0)+y_k^2(0)\right]}{1+2\sum_{j=1}^{n}\left[x_j^2(0)+y_j^2(0)\right]\int_0^t\exp\left(2\lambda_j^R\tau\right)\mathrm{d}\tau}. \tag{7.86}
$$
From Equation (7.86), if $\lambda_m^R<0$, it easily follows that
$$
\xi^T\bar{\xi}=0; \tag{7.87}
$$
if $\lambda_m^R=0$, it follows that
$$
\begin{aligned}
\xi^T\bar{\xi}&=\lim_{t\to\infty}\frac{\exp\left(2\lambda_m^R t\right)\left[x_m^2(0)+y_m^2(0)\right]+\sum_{k=1,k\neq m}^{n}\exp\left(2\lambda_k^R t\right)\left[x_k^2(0)+y_k^2(0)\right]}{1+2\left[x_m^2(0)+y_m^2(0)\right]\int_0^t\exp\left(2\lambda_m^R\tau\right)\mathrm{d}\tau+2\sum_{j=1,j\neq m}^{n}\left[x_j^2(0)+y_j^2(0)\right]\int_0^t\exp\left(2\lambda_j^R\tau\right)\mathrm{d}\tau}\\
&=\lim_{t\to\infty}\frac{\left[x_m^2(0)+y_m^2(0)\right]+\sum_{k=1,k\neq m}^{n}\exp\left(2\lambda_k^R t\right)\left[x_k^2(0)+y_k^2(0)\right]}{1+2\left[x_m^2(0)+y_m^2(0)\right]t+2\sum_{j=1,j\neq m}^{n}\left[x_j^2(0)+y_j^2(0)\right]\int_0^t\exp\left(2\lambda_j^R\tau\right)\mathrm{d}\tau}\\
&=0; \tag{7.88}
\end{aligned}
$$
and if $\lambda_m^R>0$, following the deductive procedure of Theorem 7.3, it follows that
$$
\xi^T\bar{\xi}=\lambda_m^R. \tag{7.89}
$$
From Equations (7.87), (7.88) and (7.89), the theorem is proved. This theorem indicates that when the largest real part of all eigenvalues is positive, the slightly changed RNN (7.73) converges to a nonzero equilibrium vector, the square modulus of which is equal to the maximal real part.
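The real-part variant can be sketched in the same way (again Python/NumPy with an illustrative matrix, step size and horizon): Equation (7.82) is integrated and $\xi^T\bar{\xi}$ is compared with the largest real part of the eigenvalues.

import numpy as np

rng = np.random.default_rng(2)
n = 7
A = rng.random((n, n))                  # general real matrix (illustrative)

z = rng.random(n) + 1j * rng.random(n)
dt, steps = 1e-3, 100_000
for _ in range(steps):
    z = z + dt * (A @ z - (z @ z.conj()) * z)           # Equation (7.82)

print("xi^T conj(xi)      :", (z @ z.conj()).real)
print("max Re(eigenvalues):", np.linalg.eigvals(A).real.max())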
7.4.3. Simulations and discussions


To evaluate the method, two examples are given. Example 1 uses a 7×7 matrix to show the effectiveness. A 50×50 and a 100×100 matrix are exploited in Example 2 to test the method when the dimensionality is very large. The simulation platform is Matlab.
Example 1: A is randomly generated as

A = [ 0.1347  0.0324  0.8660  0.8636  0.6390  0.1760  0.4075
      0.0225  0.7339  0.2542  0.5676  0.6690  0.0020  0.4078
      0.2622  0.5365  0.5695  0.9805  0.7721  0.7902  0.0527
      0.1165  0.2760  0.1593  0.7918  0.3798  0.5136  0.9418
      0.0693  0.3685  0.5944  0.1526  0.4416  0.2132  0.1500
      0.8529  0.0129  0.3311  0.8330  0.4831  0.1034  0.3844
      0.1803  0.8892  0.6586  0.1919  0.6081  0.1573  0.3111 ].
The eigenvalues directly computed by Matlab functions are $\lambda_1=2.9506$, $\lambda_2=0.7105$, $\lambda_3=-0.3799+0.4808\mathrm{i}$, $\lambda_4=\bar{\lambda}_3$, $\lambda_5=0.0286+0.5397\mathrm{i}$, $\lambda_6=\bar{\lambda}_5$ and $\lambda_7=0.1274$, so $\lambda_m^I=0.5397$ and $\lambda_m^R=2.9506$.
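These reference values can be reproduced with any standard eigensolver; a short check in Python/NumPy (assumed here purely for illustration, the chapter itself uses Matlab functions) recomputes $\lambda_m^I$ and $\lambda_m^R$ from the matrix printed above.

import numpy as np

A = np.array([
    [0.1347, 0.0324, 0.8660, 0.8636, 0.6390, 0.1760, 0.4075],
    [0.0225, 0.7339, 0.2542, 0.5676, 0.6690, 0.0020, 0.4078],
    [0.2622, 0.5365, 0.5695, 0.9805, 0.7721, 0.7902, 0.0527],
    [0.1165, 0.2760, 0.1593, 0.7918, 0.3798, 0.5136, 0.9418],
    [0.0693, 0.3685, 0.5944, 0.1526, 0.4416, 0.2132, 0.1500],
    [0.8529, 0.0129, 0.3311, 0.8330, 0.4831, 0.1034, 0.3844],
    [0.1803, 0.8892, 0.6586, 0.1919, 0.6081, 0.1573, 0.3111]])

lam = np.linalg.eigvals(A)
print("lambda_m^I =", lam.imag.max())   # ~0.5397
print("lambda_m^R =", lam.real.max())   # ~2.9506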

Fig. 7.21. The trajectories of $|z_k(t)|$ in searching $\bar{\lambda}_m^I$ when $n=7$.

When the initial vector is
$$
z(0)=[0.0506+0.0508\mathrm{i}\;\;0.2690+0.1574\mathrm{i}\;\;0.0968+0.1924\mathrm{i}\;\;0.2202+0.0049\mathrm{i}\;\;0.1233+0.2511\mathrm{i}\;\;0.1199+0.2410\mathrm{i}\;\;0.1517+0.2093\mathrm{i}]^T,
$$
we get the equilibrium vector
$$
\xi=[0.1335-0.1489\mathrm{i}\;\;0.0204+0.1250\mathrm{i}\;\;0.3471+0.0709\mathrm{i}\;\;-0.1727+0.3771\mathrm{i}\;\;-0.0531-0.2687\mathrm{i}\;\;-0.0719-0.0317\mathrm{i}\;\;-0.0970-0.3091\mathrm{i}]^T.
$$
Fig. 7.22. The trajectory of $|z(t)|^2$ in searching $\bar{\lambda}_m^I$ when $n=7$.

Hence the computed maximum imaginary part is
$$
\bar{\lambda}_m^I=\xi^T\bar{\xi}=0.5397.
$$
Fig. 7.23. The trajectories of $|z_k(t)|$ in searching $\bar{\lambda}_m^R$ when $n=7$.
When $\lambda_m^I$ is compared with $\bar{\lambda}_m^I$, it can easily be seen that they are very close. The trajectories of $|z_k(t)|$ $(k=1,2,\cdots,7)$ and of $z^T(t)\bar{z}(t)$, which approaches $\lambda_m^I$, are shown in Figs 7.21 and 7.22.
Fig. 7.24. The trajectory of $|z(t)|^2$ in searching $\bar{\lambda}_m^R$ when $n=7$.

Fig. 7.25. The trajectories of $|z_k(t)|$ in searching $\bar{\lambda}_m^I$ when $n=50$.

When $A'=\begin{pmatrix}A & 0\\ 0 & A\end{pmatrix}$, the computed maximum real part is
$$
\bar{\lambda}_m^R=\left|(0.4981+0.4949\mathrm{i}\;\;0.3701+0.3678\mathrm{i}\;\;0.6003+0.5965\mathrm{i}\;\;0.4773+0.4742\mathrm{i}\;\;0.3059+0.3039\mathrm{i}\;\;0.4719+0.4689\mathrm{i}\;\;0.4418+0.4390\mathrm{i})^T\right|^2=2.9504.
$$
When $\bar{\lambda}_m^R$ is compared with $\lambda_m^R$, the absolute difference value is
$$
\Delta\lambda_m^R=\left|\lambda_m^R-\bar{\lambda}_m^R\right|=|2.9506-2.9504|=0.0002,
$$
meaning that $\bar{\lambda}_m^R$ is very close to $\lambda_m^R$. The trajectories of $|z_k(t)|$ $(k=1,2,\cdots,7)$ and $z^T(t)\bar{z}(t)$ are shown in Figs 7.23 and 7.24.
Fig. 7.26. The trajectory of $|z(t)|^2$ in searching $\bar{\lambda}_m^I$ when $n=50$.

With the variation of $z(0)$, the trajectories of $|z_k(t)|$ $(k=1,2,\cdots,7)$ and $z^T(t)\bar{z}(t)$ will vary, but $\bar{\lambda}_m^I$ and $\bar{\lambda}_m^R$ consistently approach the corresponding ground truth values.
Example 2: How does the approach behave when the dimensionality increases? If a 50×50 matrix is randomly produced, its expression may be too long to write, thus a 50×50 matrix is specially given as
$$
a_{ij}=\begin{cases}(-1)^i(50-i)/100, & i=j\\ (1-i/50)^i-(j/50)^j, & i\neq j\end{cases}.
$$

Fig. 7.27. The trajectories of $|z_k(t)|$ in searching $\bar{\lambda}_m^R$ when $n=50$.
Fig. 7.28. The trajectory of $|z(t)|^2$ in searching $\bar{\lambda}_m^R$ when $n=50$.

Fig. 7.29. The trajectories of $|z_k(t)|$ in searching $\bar{\lambda}_m^I$ when $n=100$.

The calculated results are $\bar{\lambda}_m^I=0.3244$ and $\bar{\lambda}_m^R=4.6944$. Convergence behaviors of $|z_k(t)|$ and $z^T(t)\bar{z}(t)$ in searching $\bar{\lambda}_m^I$ and $\bar{\lambda}_m^R$ are shown in Figs 7.25 to 7.28. Comparing $\bar{\lambda}_m^I$ and $\bar{\lambda}_m^R$ with the corresponding true values $\lambda_m^I=0.3246$ and $\lambda_m^R=4.7000$, we find each pair in comparison is very
Fig. 7.30. The trajectory of $|z(t)|^2$ in searching $\bar{\lambda}_m^I$ when $n=100$.

close. From Figs 7.25 to 7.28, we can also see that the system reaches the equilibrium state quickly even though the dimensionality has reached 50.
In order to further show the effectiveness of this approach when the dimensionality becomes large, let $n=100$. The corresponding convergence behaviors are shown in Figs 7.29 to 7.32. $\bar{\lambda}_m^R=7.3220$ remains very close to $\lambda_m^R=7.3224$, and $\bar{\lambda}_m^I=0.2541$ is very close to $\lambda_m^I=0.2540$. From Figs 7.25 to 7.32, we can see that the iteration number at which the system enters into the equilibrium state is not sensitive to the dimensionality.
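The large-dimension test is straightforward to reproduce; the sketch below (Python/NumPy; the step size, the number of steps and the random initial state are illustrative assumptions of the sketch) builds the 50 × 50 matrix $a_{ij}$ defined above, recomputes the ground-truth values with a direct eigensolver, and runs the two variants of the network, Equation (7.76) for the largest imaginary part and Equation (7.82) for the largest real part.

import numpy as np

n = 50
i_idx = np.arange(1, n + 1)
# a_ij = (-1)^i (50-i)/100 on the diagonal, (1-i/50)^i - (j/50)^j off the diagonal
A = (1 - i_idx[:, None] / 50.0) ** i_idx[:, None] - (i_idx[None, :] / 50.0) ** i_idx[None, :]
np.fill_diagonal(A, (-1.0) ** i_idx * (50 - i_idx) / 100.0)

lam = np.linalg.eigvals(A)
print("ground truth:", lam.imag.max(), lam.real.max())   # compare with 0.3246 and 4.7000 quoted above

def run(imaginary_part=True, steps=200_000, dt=1e-3, seed=3):
    # Equation (7.76) extracts the largest imaginary part, Equation (7.82) the largest real part.
    rng = np.random.default_rng(seed)
    z = rng.random(n) + 1j * rng.random(n)
    for _ in range(steps):
        lin = -1j * (A @ z) if imaginary_part else A @ z
        z = z + dt * (lin - (z @ z.conj()) * z)
    return (z @ z.conj()).real

print("RNN estimates:", run(True), run(False))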

Fig. 7.31. The trajectories of $|z_k(t)|$ in searching $\bar{\lambda}_m^R$ when $n=100$.
Fig. 7.32. The trajectory of $|z(t)|^2$ in searching $\bar{\lambda}_m^R$ when $n=100$.

7.4.4. Section summary


Although many research works have been focused on using neural networks
to compute eigenpairs of a real symmetric, or anti-symmetric matrix,
similar efforts toward the computation for a general real matrix are seldom found in the relevant literature. Therefore, in this section a recurrent neural network
model was proposed to calculate the largest imaginary, or real part, of a
general real matrix’s eigenvalues. The network was described by a set of
differential equations, which was transformed into a complex differential
system. After obtaining the squared module of the system variable, the
convergence behavior of the network was discussed in detail. Using a 7×7
general real matrix for numerical testing, the results indicated that the
computed values were very close to the corresponding ground truth ones. In
order to show the approach’s performance when the dimensionality is large,
a 50×50 and a 100×100 matrices were used for testing respectively. The
calculated results were quite close to the ground truth values as well. It was
also seen from the results that the iteration number at which the network
enters into equilibrium state is not sensitive to the dimension number. This
approach can thus be of potential use to estimate the largest modulus of a
general real matrix’s eigenvalues.
7.5. Conclusions

Using neural networks to compute eigenpairs has advantages such as the ability to run in parallel and at high speed. Quick extraction of eigenpairs from
matrices has many significant applications such as principal component
analysis (PCA), real-time image compression and adaptive signal processing
etc. In this chapter, we have presented a few RNNs for computation
of eigenvalues and eigenvectors of different kinds of matrices. These
matrices, in terms of level of processing difficulty, are classified into three
types, i.e. symmetric matrices, anti-symmetric matrices and general real
matrices. The procedures of how we use different RNNs to handle each
type of matrices have been presented carefully in one or two sections
respectively. Generally speaking, we have formulated each RNN as an
individual differential equation. Subsequently, each analytic solution to the
individual differential equation has been derived. Next, the convergence
properties of the neural network models have been fully discussed based
on these solutions. Finally, in order to evaluate the performance of each
model, numerical simulations have been provided. The computational
results from these models have been compared with those computed directly
from Matlab functions, which are used as ground truth values. Comparison
has revealed that the results from the proposed models are highly close to
the ground truth values. In some sections, very large dimensional matrices
have been used for rigorous testing and the results still approach the ground
truth values closely. Thus, the conclusion can be made that with the
use of specially designed neural networks, the computation of eigenpairs
of symmetric, anti-symmetric and general real matrices is really possible.

References

[1] N. Li, A matrix inverse eigenvalue problem and its application, Linear
Algebra And Its Applications. 266(15), 143–152, (1997).
[2] F.-L. Luo, R. Unbehauen, and A. Cichocki, A minor component analysis
algorithm, Neural Networks. 10(2), 291–297, (1997).
[3] C. Ziegaus and E. Lang, A neural implementation of the jade algorithm
(njade) using higher-order neurons, Neurocomputing. 56, 79–100, (2004).
[4] F.-L. Luo, R. Unbehauen, and Y.-D. Li, A principal component analysis
algorithm with invariant norm, Neurocomputing. 8(2), 213–221, (1995).
[5] J. Song and Y. Yam, Complex recurrent neural network for computing the
inverse and pseudo-inverse of the complex matrix, Applied Mathematics and
Computing. 93(2–3), 195–205, (1998).
[6] H. Kakeya and T. Kindo, Eigenspace separation of autocorrelation memory


matrices for capacity expansion, Neural Networks. 10(5), 833–843, (1997).
[7] M. Kobayashi, G. Dupret, O. King, and H. Samukawa, Estimation of
singular values of very large matrices using random sampling, Computers
And Mathematics With Applications. 42(10–11), 1331–1352, (2001).
[8] Y. Zhang, F. Nan, and J. Hua, A neural networks based approach
computing eigenvectors and eigenvalues of symmetric matrix, Computers
And Mathematics With Applications. 47(8–9), 1155–1164, (2004).
[9] F.-L. Luo and Y.-D. Li, Real-time neural computation of the eigenvector
corresponding to the largest eigenvalue of positive matrix, Neurocomputing.
7(2), 145–157, (1995).
[10] V. U. Reddy, G. Mathew, and A. Paulraj, Some algorithms for eigensubspace
estimation, Digital Signal Processing. 5(2), 97–115, (1995).
[11] R. Perfetti and E. Massarelli, Training spatially homogeneous fully recurrent
neural networks in eigenvalue space, Neural Networks. 10(1), 125–137,
(1997).
[12] Y. Liu and Z. You, A concise functional neural network for computing the
extremum eigenpairs of real symmetric matrices, Lecture Notes in Computer
Science. (3971), 405–413, (2006).
[13] Y. Tan and Z. Liu, On matrix eigendecomposition by neural networks,
Neural Networks World. 8(3), 337–352, (1998).
[14] Y. Tan and Z. He, Neural network approaches for the extraction of the
eigenstructure, Neural Networks for Signal Processing VI–Proceedings of the
1996 IEEE Workshop, Kyoto, Japan. 23–32, (1996).
[15] L. Xu and I. King, A PCA approach for fast retrieval of structural
patterns in attributed graphs, IEEE Transactions on Systems, Man, and
Cybernetics–Part B: Cybernetics. 31, 812–817, (2001).
[16] B. Zhang, M. Fu, and H. Yan, A nonlinear neural network model of mixture
of local principal component analysis: application to handwritten digits
recognition, Pattern Recognition. 34, 203–214, (2001).
[17] U. Helmke and J. B. Moore, Optimization and Dynamical Systems. (Springer-Verlag, London, 1994).
[18] T. Chen, Modified oja’s algorithms for principal subspace and minor
subspace extraction, Neural Processing Letters. 5, 105–110, (1997).
[19] A. Cichocki and R. Unbehauen, Neural networks for computing eigenvalues
and eigenvectors, Biological Cybernetics. 68, 155–164, (1992).
[20] A. Cichocki, Neural network for singular value decomposition, Electronics
Letters. 28(8), 784–786, (1992).
[21] Y. Liu, Z. You, and L. Cao, A simple functional neural network
for computing the largest and smallest eigenvalues and corresponding
eigenvectors of a real symmetric matrix, Neurocomputing. 67, 369–383,
(2005).
[22] Y. Liu, Z. You, L. Cao, and X. Jiang, A neural network algorithm for
computing matrix eigenvalues and eigenvectors, Journal of Software [in
Chinese]. 16(6), 1064–1072, (2005).
[23] Y. Liu, Z. You, and L. Cao, A functional neural network for computing
the largest modulus eigenvalues and their corresponding eigenvectors of an
anti-symmetric matrix, Neurocomputing. 67, 384–397, (2005).
[24] A neural network for computing eigenvectors and eigenvalues, Biological
Cybernetics. 65, 211–214, (1991).
[25] Y. Liu, Z. You, and L. Cao, A functional neural network computing some
eigenvalues and eigenvectors of a special real matrix, Neural Networks. (18),
1293–1300, (2005).
[26] Y. Nakamura, K. Kajiwara, and H. Shiotani, On an integrable discretization
of rayleigh quotient gradient system and the power method with a shift,
Journal of Computational and Applied Mathematics. 96, 77–90, (1998).
[27] J. M. Vegas and P. J. Zufiria, Generalized neural networks for spectral
analysis: dynamics and liapunov functions, Neural Networks. 17, 233–245,
(2004).

Chapter 8

Automated Screw Insertion Monitoring Using Neural Networks: A Computational Intelligence Approach to Assembly in Manufacturing

Bruno Lara, Lakmal D. Seneviratne and Kaspar Althoefer
King's College London, Department of Informatics,
Strand, London WC2R 2LS, UK
[email protected]

Threaded fastenings are a widely used industrial method to clamp


components together. The method is successfully employed in
many industries including car manufacturing, toy manufacturing and
electronic component assembly - the method is popular because it is a
low-cost assembly process and permits easy disassembly for maintenance,
repair, relocation and recycling. Particularly, the usage of self-tapping
screws has gained wide-spread recognition, because of the cost savings
that can be achieved when preparing for the fastening process - the
components to be clamped together need to be only equipped with holes
and as such the more complicated and costly placing of threaded holes
as is required for non-self-tapping bolts can be avoided. However, the
process of inserting self-tapping screws is more complicated and typically
carried out manually. This chapter discusses the study of methods of
automation for the insertion of self-tapping screws.

Contents

8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184


8.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
8.2.1 The screw insertion process: modelling and monitoring . . . . . . . . . 187
8.2.2 Screw insertion process: monitoring . . . . . . . . . . . . . . . . . . . . 192
8.3 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
8.3.1 Screw insertion signature classification: successful insertion and type
of failure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
8.3.2 Radial basis function neural network for error classification . . . . . . 194
8.3.3 Simulations and experimental study . . . . . . . . . . . . . . . . . . . 196
8.4 Results of Simulation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
8.4.1 Single insertion case . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
8.4.2 Generalization ability . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
8.4.3 Multi-case classification . . . . . . . . . . . . . . . . . . . . . . . . . . 199

8.5 Results of Experimental Study . . . . . . . . . . . . . . . . . . . . . . . . . . 200


8.5.1 Single insertion case . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
8.5.2 Generalization ability . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
8.5.3 Four-output classification . . . . . . . . . . . . . . . . . . . . . . . . . 202
8.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207

8.1. Introduction

Automation of the assembly process is an important research topic


in many manufacturing sectors, including the automotive industry, toy
manufacturers and household appliance producing industries. With the current trend moving from fixed mass manufacturing toward flexible manufacturing methods, which can adapt to changes in the production lines and can cope more easily with a greater variation in the product palette, approaches that employ adaptive models and learning paradigms become more and more attractive for the manufacturing industries.
Considering that in quite a number of manufacturing industries around
25% of all assembly processes involve some kind of a threaded fastening
approach, it is of great importance to research this particular assembling
method in detail. Advances in the understanding of screw insertion,
screw advancing and screw tightening will undoubtedly have the potential
to advance the relevant manufacturing sectors. Enhanced monitoring
capabilities based on comprehensive knowledge of an assembly process, such
as screw fastening, will enable direct improvements in quality assurance
even in the face of product variation and uncertainty. With the current
interest in model predictive control, modeling the assembly process of screw
fastening will certainly impact positively on modern, model-based control
architectures for manufacturing allowing greater flexibility.
Screw fastenings or, more generally, threaded fastenings have proved
to be a popular assembly method. The integrity of the achieved joint is
defined by a number of factors including the tightening torque, the travel
of the screw, the strength and quality of the fasteners and the prepared
holes [1]. Compared to gluing, welding and other joining techniques,
threaded fastenings have the great advantage of allowing the assembled
components to be disassembled again relatively easily for the purpose of
repair, maintenance, relocation or recycling. Another advantage is that the
forces between joined components can be controlled to a very high degree
of accuracy, if required for specific applications such as shear pins on a
rocket body. A number of manufacturing industries, such as Desoutters [2],


have shown a keen interest in intelligent systems for automated screw
fastening, as the factory floor implementation of such systems will improve
a number of aspects of this assembly process including the insertion quality,
the completion of successful insertions and the working conditions of
human operators. A number of research studies have drawn attention to
the advantages and financial benefits of advanced insertion systems in a
manufacturing environment [3–5].
With the view set on minimizing assembly costs, a particularly
economical form of threaded fastenings has established itself in a number
of manufacturing processes: self-tapping screws [6]. A self-tapping screw
cuts a thread in the part to be fastened, avoiding the need to create
pre-tapped holes prior to the screw insertion process and thus simplifying
and shortening the overall assembly task. However, self-tapping insertions
are more difficult to control and have an increased risk of failure, since a
thread has to be formed whilst the screw is advanced into one of the joint
components.
However, in many cases the fastening process based on self-tapping
screws is made more difficult by the fact that handheld power tools
or numerically controlled column power screwdrivers are used. A
human operator has good control of the insertion process when using a
simple non-motorized screwdriver; however, when employing a powered
screwdriver, the control is limited, mainly owing to the high speed of the
tool and reduced force feedback from the interaction between screw and
surrounding material. In such a situation, the operator is usually not
capable of assessing the insertion process and the overall quality of the
created screw coupling.
A number of research studies have provided insight into the mechanisms
that define the processes of thread cutting, advance and tightening for
self-tapping screws [7–9]. Still, most of today’s power screwdrivers employ
basic maximum-torque mechanisms to stop the insertion process, once the
desired tightening torque has been reached. Such simple methods do not
lend themselves to appropriately monitoring the insertion process and fail
to determine issues such as cross-threading, loose screws and other insertion
failures.
The automation of the screw insertion monitoring process is of great
interest for many sectors of the manufacturing industry. However, the
methods being used for the accomplishment and monitoring of screw
insertions or bolt tightenings are still fairly simple. These methods, such as
the so-called “Teach Method”, are suitable for large batch operations that
do not require any changes of the assembly process, e.g. the production
of high-volume products as needed for the automotive industry. The teach
method is based on the assumption that a specific screw insertion process
will have a unique signature based on the measured “Torque-Insertion
Depth” curve. By comparing signals that are acquired on-line during the screw insertion process to signals that were recorded during a controlled insertion phase (teaching phase), differences can be easily flagged, and a match between on-line signals and signals from the teaching phase indicates a correct insertion. Such methods usually have no adaptive or learning
capabilities and require a labor-intensive set-up before production can
commence. Improvements to the standard teach method were proposed and
implemented, e.g. the “Torque-Rate” approach [10]. For this approach,
the measured insertion signals need to fall within a-priori-defined torque-rate levels, and the approach has been shown to be capable of coping with a number of
often-occurring faults such as stripped threads, excessive yielding of bolts,
crossed threads, presence of foreign materials in the thread, insufficient
thread cut and burrs. Despite the improvements this approach has no
generalization capabilities and thus cannot cope with unknown, new cases
that have not been part of the original training set. In a number of cases,
the required tools such as screwdrivers can be specifically set-up to carry
out the required fastening process to join components by means of screws
or bolts and are economical if the components and the screw fastening
requirements do not change. However, for small-batch production this
approach is prohibitive because a “re-teaching” is required every time the
components or fastening requirements change. Where smaller production
runs are the norm, e.g. for wind turbine manufacturing, the industry is
still relying on human operators to a great extent. These operators insert
and fasten screws and bolts manually or by means of handheld power tools.
With a view to create flexible manufacturing approaches, this book chapter
discusses novel, intelligent monitoring methods for screw insertion and bolt
tightening capable of adapting to uncertainty and changes in the involved
components.
This book chapter investigates novel neural-network-based systems for
the monitoring of the insertion of self-tapping screws. Section 8.2 provides
an overview of the screw insertion process. Section 8.3 describes the
methodology behind the process. Section 8.4 discusses the results from
simulations. Section 8.5 presents an in-depth experimental study and
discusses the results obtained. Conclusions are drawn in Section 8.6.
8.2. Background

8.2.1. The screw insertion process: modelling and monitoring

Screw insertion can be divided into two categories: the first describes the
insertion of self-tapping screws into holes that have not been threaded; the
second deals with the insertion of screws or bolts into pre-threaded holes or
fixed nuts. The advantage of the first approach lies in its simplicity when
preparing the components for joining – a hole in each of the components
is the only requirement. The latter approach needs to undergo a more complex preparatory step: in addition to drilling holes, threads have to be cut or nuts attached. Industry is progressively employing more of the former as it is more cost-effective, more generally applicable and allows for more rapid insertion of screws than is possible with the other approach.
However, the downside is that the approach based on self-tapping screws is
in need of a more advanced control policy to ensure that the screw thread
cuts an appropriate groove through at least one of the components during
the insertion process. Owing to its complexity, the process of inserting
self-tapping screws is usually carried out by human operators who can
ensure that an appropriate and precise screw advancement is guaranteed,
making good use of the multitude of tactile and force sensors as well as
feedback control loops they have available.
In order to develop an advanced monitoring strategy for the insertion
of self-tapping screws, a proper theoretical understanding of the underlying
behavior of the insertion process is needed [11]. At King’s College
London, mathematical models, based on an equilibrium analysis of
forces and describing the insertion of self-tapping screws into metal and
plastic components, have been created to identify the importance of the
torque-vs.-insertion-depth profile as a good representation for the five
distinct stages of the insertion process, Fig. 8.2 [12]. Existing knowledge
on screw advance and screw tightening of screws in pre-tapped holes is
included in the proposed model for the relevant stages of the overall process
[9, 10, 13–15]. The key points of this profile – representative for the five
main stages occurring during the insertion – are defined as shown below.

(1) Screw engagement represents the stage of the insertion process lasting
from the moment where the tip of the screw touches the tap plate (on
the top component) until the moment when the cutting portion of
the conical screw taper has been completely inserted into the hole,
Fig. 8.1. (left) Notation for self-tapping screw insertions; (right) Key stages of the screw
insertion in a two-plate joint [1].

effectively resulting in an entire turn of the screw taper penetrating


through the tap plate. The screw is said to be engaged at this point,
denoted TE, Figs 8.1 and 8.2. Mainly shear forces are experienced
during this stage, due to the cutting of a helical groove in the hole
wall.
(2) Cutting describes the subsequent interval from completion of screw
engagement (TE) until the completion of a cut of a groove into the
wall by the screw thread (TP). The main forces are due to friction
and cutting. From this stage onwards (i.e. the cutting portion of the
screw has completely passed through the hole in the tap plate (screw
breakthrough)), cutting or shear forces have an insignificant impact
on the screw advancement.
(3) Transition from cutting to screw advance is an intermediate stage


continuing from TP to TB in the mathematical insertion model
(Figs 8.1 and 8.2); a mixture of cutting and friction forces can be
observed.
(4) Screw advance or rundown is the fourth stage where the screw moves
forward in the previously-cut groove. Screw advance continues until
the lower side of the screw cap touches the near plate (TF). Here, the
main forces observable are friction forces owing to the movement of
the screw’s thread along the surface of the groove of the hole. The
forces of this stage are known to be constant (see constant torque
between TB and TF in Fig. 8.2 [11]).
(5) Screw tightening is the fifth and final stage, commencing the moment
the screw cap comes into contact with the near plate (TF) and
completing when a predetermined clamping force is achieved (TF2).
The main forces concerned are tightening forces, made up entirely
of friction components that can be divided into two categories: firstly,
forces between screw thread and hole groove, and secondly, forces
occurring between screw cap and near plate.

When comparing the processes of inserting screws into pre-tapped holes,


on the one hand, and inserting self-tapping screws, on the other, one
recognises a clear similarity between the two processes over a range of
stages. Compared to self-tapping screws, pre-tapped insertions lack the two
stages from TE to TB, because cutting is not required. For screws inserted
into pre-tapped holes, a short engagement stage is experienced, then
transitioning rapidly to the advance or rundown stage [9, 10, 13–16]. After
this, the two mathematical screw insertion models are indistinguishable.
The forces acting on the screw during the different stages are examined
using fundamental stress analysis concepts [17, 18]. Each stage of the
insertion process can be represented by an appropriate equation describing
the relationship between screw advancement (insertion depth) and resultant
torque values. It has been shown that those stages can be modeled as simple
straight-line segments shown in Fig. 8.2 [11]. All other parameters depend
on screw and hole geometry, material properties and friction. The overall
set of equations (submodels) describing the five stages of the process of
inserting self-tapping screws is as follows [11]:

Fig. 8.2. Average experimental torque profile of an insertion into a two-plate joint and the
corresponding theoretical prediction [1].

$$T_1 = \left(\frac{\cos\beta}{64\pi L_t\sqrt{3}}\right)(D_s^2 - D_h^2)\,\sigma_{uts}\,\big(D_f D_s P + \pi\mu L_t (D_s + D_h)\big)\left\{\theta + \frac{4\pi L_t h}{P D_s}\right\} \tag{8.1}$$

$$T_2 = \left(\frac{\cos\beta}{16\sqrt{3}}\right)(D_s^2 - D_h^2)\,\sigma_{uts}\left(D_f (D_s - D_h) + \frac{\mu\pi L_t (D_s^2 - D_h^2)}{2 D_s P} + \mu R_f\theta\right) \tag{8.2}$$

$$T_3 = \left(\frac{\cos\beta}{32 L_t\sqrt{3}}\right)(D_s^2 - D_h^2)\,\sigma_{uts}\left\{\left(2 R_f\mu L_t - \frac{D_f D_s P}{\pi}\right)\theta + \frac{2t}{P}\big(2 R_f\mu\pi L_t - D_f D_s P\big)\right\} \tag{8.3}$$

$$T_4 = \left(\frac{\cos\beta}{32 P\sqrt{3}}\right)\pi t\mu\,\sigma_{uts}\,\big(D_s^3 - D_s D_h^2 + D_h D_s^2 - D_h^3\big) \tag{8.4}$$

$$T_5 = \frac{\mu E P\,(D_{sh}^3 - D_s^3)\left(\theta - \frac{\pi}{4}(D_{sh}^2 - D_s^2)\right)}{24\,l} \tag{8.5}$$
where T1 to T5 are the torques required for stages one to five respectively, θ
is the angle of rotation of the screw, Ls is the screw length, Lt is the taper
length, Ds is the major screw diameter, Dm is the minor screw diameter, Dh
is the lower hole diameter, t is the lower plate thickness, l is the total plate
thickness, Dsh is the screw head diameter, P is the pitch of the thread, β
is the thread angle, µ is the friction coefficient between the screw and plate
material and σuts is the ultimate tensile strength of the plate material. All
the parameters are constant and depend on the screw and hole dimensions,
as well as on the friction and material properties.
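To make the use of this model concrete, the following minimal Python sketch (not taken from the cited work) builds such a straight-line-segment torque signature from a set of key points; all numerical values shown are illustrative placeholders rather than parameters from the study.

    # A minimal sketch (not the authors' code) of synthesizing a torque signature
    # as the straight-line segments between the key points of the insertion model.
    # The key-point depths and torques below are illustrative placeholders only.
    import numpy as np

    def torque_signature(key_depths, key_torques, n_samples=30):
        """Piecewise-linear torque profile sampled at n_samples insertion depths.

        key_depths  -- insertion depths at the key points [start, TE, TP, TB, TF, TF2]
        key_torques -- modelled torques at those key points, e.g. from Eqs (8.1)-(8.5)
        """
        depths = np.linspace(key_depths[0], key_depths[-1], n_samples)
        torques = np.interp(depths, key_depths, key_torques)  # straight-line segments
        return depths, torques

    # Illustrative call with made-up key points (depths in mm, torques in Nm):
    depths, torques = torque_signature(
        key_depths=[0.0, 1.0, 3.0, 3.5, 9.0, 9.2],
        key_torques=[0.0, 0.25, 0.55, 0.60, 0.60, 1.30])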
Realistically modeling a common manufacturing process, the process of
inserting self-tapping screws can be experimentally investigated in a test
environment with two plates (representing two manufacturing components)
to be united [11]. As part of the experimental study, holes are drilled into
the plates to be joined; the latter are then clamped together such that the
hole centers of the top plate (also called near plate) are in line with the
centers of the holes of the second plate (also called tap plate), Fig. 8.1. A
self-tapping screw is then inserted into a hole during an experiment and
screwed in, applying appropriate axial force and torque values. Ngemoh
showed that the axial insertion force has little influence on the advancement
of the screw, even if it varies over quite a wide range, provided it does not
exceed a maximum value beyond which the threads of the
screw and/or the hole are in danger of being destroyed [11]. During this
insertion process a helical groove is cut into the far plate’s hole and forces as
described above are experienced across the five main stages of the insertion
process until the screw head is finally tightened against the near plate [11].
Based on the modeling of, and practical knowledge about, the screw insertion
and fastening process, monitoring strategies can be created [12, 19]. The
signature that is obtained when recording the torque-vs.-insertion-depth
profile throughout the insertion process provides important clues about the
final outcome of the overall process and its likelihood of success. The resultant
torques can be best estimated when the recorded signature is closest to the
ideal one. This book chapter presents monitoring methods that are capable of
predicting to what extent a pre-set final clamping torque is achieved by
employing artificial intelligence in the form of an artificial neural network that uses

online torque-vs.-insertion-depth profiles as input. In contrast
to monitoring approaches that measure the final clamping torque only, the
proposed method has a further advantage: faults can be detected early on
and a screw insertion process can be interrupted in order to initiate remedial
actions.

8.2.2. Screw insertion process: monitoring

A number of screw insertion monitoring techniques have established


themselves in the manufacturing industry. One of the commonly employed
methods is the “teach method” [10]. This method is active during the
insertion of a screw, monitoring online the profile of the occurring forces
during the insertion process. The teach method presumes a particular screw
insertion operation to follow a unique “torque-insertion depth” signature
signal that was taught before the actual assembly process commenced.
This approach entails long set-up times, since the torque/depth profile has
to be taught anew for each insertion case whenever the production run changes.
The teach method is mainly limited by its inflexibility and the lack of
generalization ability. To combat this, Smith et al. proposed various
enhancements to the standard teach method, including the “torque-rate”
approach [10]. Their approach requires the correct fastening signatures to
fall within pre-defined torque rate windows. This approach can deal with
a range of common faults including stripped threads, excessive yielding
of bolts, crossed threads, presence of foreign materials in the threads,
insufficient thread cuts and burrs. However, Smith’s approaches lack the
flexibility to generalize when insertion cases differ from those used to define
the windows.
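As a rough illustration of the window idea (and not a reproduction of Smith's actual algorithm), the sketch below accepts a recorded profile only if every torque sample falls inside a pre-defined band for its insertion-depth window; the window boundaries and torque bands are assumed inputs.

    # Illustrative check of a torque profile against pre-defined torque windows.
    import numpy as np

    def within_windows(depths, torques, window_edges, torque_bands):
        """Return True if every (depth, torque) sample lies inside its torque band.

        window_edges -- depth boundaries of the windows, length n+1
        torque_bands -- list of (min_torque, max_torque) pairs, length n
        """
        bins = np.digitize(depths, window_edges) - 1
        for b, torque in zip(bins, torques):
            lo, hi = torque_bands[int(np.clip(b, 0, len(torque_bands) - 1))]
            if not (lo <= torque <= hi):
                return False
        return True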
Other efforts have focused on developing online torque estimation
strategies correlating the electrical screwdriver’s current to the torque at
the tip of the screwdriver [8, 15, 20], and increasing the reliability during
the tightening of the screw insertion process [8, 9]. However, it is noted
that the start of the insertion process, in particular when using self-tapping
screws, is a research area not investigated in detail. Other, more advanced
techniques, such as the one presented by Dhayagude et al. [21], employ fuzzy
logic, providing a robust tool for monitoring the screw insertion process,
even in the presence of noisy signals or screw alignment inaccuracies, with
the capability for early detection of insertion faults. However, the proposed
computer simulation model used for testing ignores a variety of factors that
define the torque requirements of a screw insertion.

8.3. Methodology

8.3.1. Screw insertion signature classification: successful insertion and type of failure

Since the process of inserting a self-tapping screw and the subsequent


fastening can be modeled by a torque-insertion depth signature signal [11],
intelligent model-inspired monitoring strategies can be created that are
capable of separating successful from failed insertions. With the profile of
the insertion signature depending mainly on the geometrical and mechanical
properties of the involved components (screw and parts to be joined),
commonly-occurring property variations invariably lead to deviations from
the ideal torque signature profile. Theoretically and under the assumption
that any given insertion case will produce a unique torque-insertion depth
signature, any such deviations from the ideal signature can be employed to
predict failure and type of failure. However, in practice, even the signature
of a successful insertion may diverge by a small amount from the ideal
signature, owing to signal parameter variations. Hence, any intelligent
monitoring approach needs to be capable of recognizing an insertion signature
as successful even if diverging from the ideal one within some boundaries.
Intelligent methods such as artificial neural networks have generalization
capabilities and are thus suitable to cope with this type of uncertainty.
Neural networks have also proved to be good classifiers and thus can
be used to determine the type of failure. The proposed monitoring strategy
is particularly beneficial for insertion processes as it can predict a failure
before the completion of the insertion and can be used to invoke the
appropriate, failure type dependent corrective action, when available.
In this study artificial neural networks (ANNs) based on the principle of
radial basis functions (RBFs) are used [22, 23]. The following classification
tasks are performed to validate the monitoring concept:

(a) Single insertion case


The ANN is to differentiate between signals from successful and failed
insertions. The ANN is subjected to a phase of learning (using successful
and unsuccessful insertion signals, either simulated or from real sensor
readings), and then tested with unknown signals. These test signals can
only be correctly categorized if the ANN is capable of generalizing, which
is necessary to cope with the random noise added to the outputs of the
mathematical model and the noise inherent in real sensor data.

(b) Multiple insertion cases


The ability of the network to handle multiple insertion cases and to cope
with cases not seen during training is investigated. Four insertion cases,
corresponding to four different hole diameters, 3.7 mm, 3.8 mm, 3.9 mm
and 4.0 mm, are examined, Column 2 of Table 8.1. The ANN is trained on
insertion signals from the smallest hole (six signals) and the widest hole (six
signals), as indicated by asterisks in Table 8.1. The ANN is also trained
on eight signals representing failed insertions. However, during testing the
network has, in addition, been presented with signals from insertions into
the two middle-sized holes. The test set consists of 14 successful signals (2
from a 3.7 mm hole, 2 from a 4.0 mm hole, 5 from a 3.8 mm hole, and 5 from
a 3.9 mm hole) and 8 unsuccessful signals. Here, the network is required to
interpolate from cases learnt during training. All the signals are correctly
classified after a relatively modest training period of four cycles, Fig. 8.4.

(c) Multiple output classifications


The aim of this classification experiment is to separate successful signals
from unsuccessful signals as before and, in addition, to further classify the
successful signals into different classes. Three different insertion classes are
investigated in this experiment (Column 3 of Table 8.1). Here, an ANN with
four nodes in the output layer is used; three to distinguish between the three
different insertion classes (Classes A to C), and one to indicate unsuccessful
insertions. Given an input signal, the trained network is expected to output
a value close to one at a single output node, while all other output nodes
should have values close to zero.

8.3.2. Radial basis function neural network for error classification
A radial basis function (RBF) neural network is used to classify between the
different cases outlined in Section 8.3.1 [24]. After training, the network
is expected to distinguish between successful insertions and failed ones,
and to be capable of separating successful insertions into one of the three
insertion classes, Section 8.3.1 [25]. The choice of the most appropriate neural
network architecture was mainly guided by the relevant literature and
previous experience using neural networks for the problem of screw insertion
classification.

Computer simulations based on the screw insertion model introduced in
Section 8.2.1 allow the network designer to determine the most appropriate
RBF network structure with respect to the numbers of nodes in the input and
the hidden layer by investigating a range of diverse insertion cases. As a
result of a detailed computer-based investigation that aimed to optimize
the network's robustness and generalization capacity as a function of the
number of network nodes, an appropriate network structure was found (for
more details see Section 8.3.3), Fig. 8.5. The number of nodes of the output
layer is defined by the number of output categories required: one output
node for (a) the single insertion case and (b) the multiple insertion case, and
four output nodes for (c), where the network is required to distinguish between
different types of successful insertions and failed insertions. Here, the neural
network input consists of 30 discrete pairs of values of the two main insertion
parameters, insertion depth and torque, as output by the model or acquired
from sensors attached to a real screwdriver, respectively.
The function of the RBF network can be summarized as follows.
After receiving input signals at the network’s input nodes, the signals are
weighted and passed to the network’s hidden layer. Formed from Radial
Basis Functions, the hidden layer nodes perform a nonlinear mapping from
the input space to the output space. The results of the computations in the
hidden layer nodes are weighted, added up and forwarded to the network’s
output layer. The task of the output layer is to categorize the signals either
as successfully/unsuccessfully inserted or to distinguish between a number
of different insertion types. In the output nodes, a logistic function limits
the final output of each node to a value between zero and one, denoting
disagreement (zero) and agreement (one) with the specified target.
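A minimal numpy sketch of this forward computation is given below; it assumes Gaussian radial basis units and a logistic output squashing, as described above, with all array shapes and parameter names chosen purely for illustration.

    # Illustrative forward pass of the RBF network described above (not the
    # authors' implementation): Gaussian hidden units, weighted summation and
    # a logistic output limited to the interval (0, 1).
    import numpy as np

    def rbf_forward(x, centres, widths, out_weights, out_bias):
        """x: input vector (e.g. 60 interleaved torque/insertion-depth values);
        centres: (n_hidden, len(x)) RBF centres; widths: (n_hidden,) widths;
        out_weights: (n_outputs, n_hidden); out_bias: (n_outputs,)."""
        dists = np.linalg.norm(centres - x, axis=1)           # distance to each centre
        hidden = np.exp(-(dists ** 2) / (2.0 * widths ** 2))  # Gaussian activations
        return 1.0 / (1.0 + np.exp(-(out_weights @ hidden + out_bias)))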
In this research, a supervised, batch mode training approach has been
adopted, which has been demonstrated to be more robust and reliable than
online learning [2, 26]. Here, the overall torque/insertion depth signature
signal data set is divided into a training set comprising the input signals
and the corresponding target outputs and a test set employed to evaluate
the network’s classification and generalization capabilities with regards to
unknown data.
In preparation for training, the centres of the radial basis functions are
initialized to values evenly collected from the set of teaching data of torque
and insertion depth signals [2, 26]. The widths of the hidden layer nodes
are initialized with random values between zero and one, and the weights
between hidden and output nodes are also randomly initialized.

During the learning phase, training data is propagated through the


ANN repeatedly; each time this happens the sum-squared error (SSE) is
computed. The SSE is the sum of the squared differences between the actual
outputs of the network and the target outputs. It is then the aim of the
training algorithm to modify the free parameters of the network such that the
SSE is minimized [27].
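The simplified sketch below illustrates the batch training idea under the assumption that only the output weights are adapted by plain gradient descent on the SSE; the algorithm actually used in the study [27] also adapts the RBF centres and widths.

    # Simplified batch training sketch: gradient descent on the SSE with respect
    # to the output weights only (an assumption made for brevity).
    import numpy as np

    def train_output_weights(train_x, train_y, centres, widths,
                             n_cycles=20, learning_rate=0.05):
        """train_x: (n_samples, n_inputs); train_y: (n_samples, n_outputs)."""
        rng = np.random.default_rng(0)
        weights = rng.uniform(0.0, 1.0, (train_y.shape[1], centres.shape[0]))
        bias = np.zeros(train_y.shape[1])
        for _ in range(n_cycles):
            sse, grad_w, grad_b = 0.0, np.zeros_like(weights), np.zeros_like(bias)
            for x, target in zip(train_x, train_y):
                hidden = np.exp(-np.linalg.norm(centres - x, axis=1) ** 2
                                / (2.0 * widths ** 2))
                out = 1.0 / (1.0 + np.exp(-(weights @ hidden + bias)))
                err = out - target
                sse += float(np.sum(err ** 2))
                delta = err * out * (1.0 - out)   # logistic derivative term
                grad_w += np.outer(delta, hidden)
                grad_b += delta
            weights -= learning_rate * grad_w     # batch update after each cycle
            bias -= learning_rate * grad_b
        return weights, bias, sse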
After completing the learning phase, the network’s performance is
evaluated using data that were not seen by the network previously. It can
then be seen how well the network is trained (i.e. how well it can predict
the various insertion cases) and how well it is able to generalize (i.e. can
cope with noisy data and variations introduced by unseen data, e.g. hole
diameters that were not part of the training set).

8.3.3. Simulations and experimental study

Simulations were conducted to obtain the optimal ANN with regards to


network learning needs, quality of screw insertion monitoring and power to
generalize over the given insertion cases and possible insertion failures.
This work focuses on (a) single insertion case, (b) multiple insertion
cases, and (c) multiple output classifications, in order to validate the
proposed screw insertion monitoring method.
Based on the mathematical model (Section 8.2.1) torque-insertion-depth
signals are produced for the testing and evaluation of the ANN-based
monitoring strategy using a computer implementation of the developed
screw insertion equations, (8.1) to (8.5) [11].
Since measurement noise is expected during the real experiments,
random noise is added to the signal acquired from the mathematical model
in an attempt to make the simulated study as realistic as possible and
train the ANN on a set of signals as close as possible to the real data,
Fig. 8.8. Noise of up to 0.2 Nm (up to 15% of the highest generated
torque) was superimposed. Without loss of generality, simulation
and experimental studies were conducted without a near plate.
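The following one-function sketch shows how such noise could be superimposed; a uniform distribution is assumed here, since the exact noise distribution is not stated.

    # Illustrative noise superimposition: uniform random noise of up to 0.2 Nm
    # added to the model-generated torque samples (uniform distribution assumed).
    import numpy as np

    def add_measurement_noise(torques, max_noise_nm=0.2, seed=None):
        rng = np.random.default_rng(seed)
        return torques + rng.uniform(-max_noise_nm, max_noise_nm, size=len(torques))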
Wide-ranging simulation sequences were conducted and revealed the
most suitable architectures for the RBF ANN. In all three cases (single
insertion, generalization ability, multi-case insertion), it was most suitable
to equip the input layer with 60 nodes (to be fed with 30 pairs of
torque/insertion-depth signals during learning and testing). With regards
to the studies on the single insertion case and the network’s generalization
ability, it was most appropriate to employ 15 nodes in the hidden layer

and 1 node in the output layer. For the multi-classification problem, a


network with 20 nodes in the hidden layer and 4 nodes in the output
layer proved to give the best results. The same ANN structures were
employed in the experimental study where signals from the torque sensor
and insertion-depth sensor attached to an electric screwdriver were used.
Table 8.1 summarizes the simulation and experimental studies carried out.
While a computer-based implementation of the mathematical model
(described in Section 8.3.1) was used for the simulation study, an electric
screwdriver (Desoutter, Model S5) was utilized during the experimental
study. Care was taken to design both studies such that they were as
compatible and comparable as possible. The test rig used during this
experimental study is based on the screwdriver equipped with a rotary
torque sensor (maximum range: 1.9 Nm) and an optical encoder (measuring
angular position in a range from 0 to 3000 degrees with a resolution of 60
degrees) attached to the screwdriver's shaft [2]. During a common screw
insertion experiment, up to 50 data point pairs of torque and angular
signals are recorded. The acquired angular signals are transformed into
insertion-depth values exploiting knowledge of the pitch of the given screw.
It was found that for ANN learning and subsequent testing reduced data
sets of 30 torque/insertion-depth pairs were sufficient.
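The conversion and resampling steps just described can be sketched as follows; the function names are illustrative, and the pitch-based depth conversion simply exploits the fact that the screw advances by one pitch per full revolution.

    # Illustrative angle-to-depth conversion and resampling to 30 pairs.
    import numpy as np

    def angle_to_depth(angles_deg, pitch_mm):
        """Convert encoder angles (degrees) to insertion depth (mm)."""
        return (np.asarray(angles_deg) / 360.0) * pitch_mm

    def resample_profile(depths, torques, n_pairs=30):
        """Resample a recorded torque profile onto n_pairs evenly spaced depths."""
        new_depths = np.linspace(depths[0], depths[-1], n_pairs)
        return new_depths, np.interp(new_depths, depths, torques)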

8.4. Results of Simulation Study

8.4.1. Single insertion case

The task of the ANN is to distinguish between successful and unsuccessful


insertion signals (Column 1 of Table 8.1). During training (employing
signals from ten successful insertions and eight unsuccessful insertions),
the network performance in response to unseen test data is monitored.
To correctly classify this set, the network requires a certain degree of
generalization, since these signals would be different to those in the training
set due to random noise superimposed on the theoretical curve. Figure 8.3
shows that after moderate training (three cycles), there is a clear separation
between the output values of the five successful and eight unsuccessful
signals in the test set, and that the ANN can efficiently distinguish between
successful and unsuccessful insertion signals.

Table 8.1. Simulation and experimental study. The table shows the parameters for all
screws used in this study, together with the number of successful/unsuccessful
insertion signals per training/test set. Generalization capabilities are
investigated by training the ANN on two hole diameters (*) and testing it on
intermediate hole diameters (see also Sections 8.4 and 8.5) [1].

                               --------- Simulation Study ---------   -------- Experimental Study --------
Section                        8.4.1    8.4.2        8.4.3 (A/B/C)    8.5.1    8.5.2        8.5.3 (A/B/C)
Plate material                 Polycarbonate  Polycarbonate           Acrylic  Acrylic      Acrylic/Acrylic/
                                                     Polycarbonate/                         Brass
                                                     Aluminium/
                                                     Mild Steel
Screw type                     8        8            4/6/8            4        6            4/6/8
Far plate thickness (mm)       6.0      6.0          3.0/3.0/3.0      3.0      6.0          3.0/5.0/5.0
Hole diameter (mm)             3.8      3.7*, 3.8,   2.7/3.2/3.9      1.0      1.0*, 1.5,   1.5/2.5/3.5
                                        3.9, 4.0*                              2.0*
Training set (succ./unsucc.)   10/8     12/8         (6+6+6)/8        6/8      10/8         (5+5+5)/16
Test set (succ./unsucc.)       5/8      14/8         (4+4+4)/8        6/8      12/8         (4+4+4)/16
Classification                 binary   binary       four classes     binary   binary       four classes
                                                     (A, B, C,                              (A, B, C,
                                                     unsuccessful)                          unsuccessful)

8.4.2. Generalization ability

The ability of the network to handle multiple insertion cases and to cope
with cases not seen during training is investigated. Four insertion cases,
corresponding to four different hole diameters, 3.7 mm, 3.8 mm, 3.9 mm
and 4.0 mm, are examined (Column 2 of Table 8.1). The network is only
trained on insertion signals from the smallest (3.7 mm, 6 signals) and the
widest hole (4.0 mm, 6 signals) as well as 8 signals representing unsuccessful
insertions. However, during testing the network has, in addition, been
presented with signals from insertions into the 2 middle-sized holes (3.8
and 3.9 mm). The test set consists of 22 signals including 14 successful
signals (2 from a 3.7 mm hole, 2 from a 4.0 mm hole, 5 from a 3.8 mm
hole and 5 from a 3.9 mm hole) and 8 unsuccessful signals. Hence, the
network is required to interpolate from cases learnt during training. All
the signals are correctly classified after a relatively modest training period
of four cycles (including initialization), Fig. 8.4.

Fig. 8.3. Output values as training evolves (Simulated insertion signals). Single case
experiment [1].

8.4.3. Multi-case classification

The aim of this classification experiment is to separate successful signals


from unsuccessful signals as before and, in addition, to further classify the
successful signals into different classes. Three different insertion classes are
investigated in this experiment, Column 3 of Table 8.1. Here, an ANN with
four nodes in the output layer is used; three to distinguish between the three
different insertion classes (Classes A to C), and one to indicate unsuccessful
insertions. Given an input signal, the trained network is expected to output
a value close to one at a single output node, while all other output nodes
should have values close to zero. During training, it was observed that the

Fig. 8.4. Output values as training evolves (Simulated insertion signals). Variation of
hole diameter experiment [1].

sum squared error (SSE) of the network reduces quickly to relatively low
values indicating a good and steady training behavior. Figure 8.5 shows
the ANN output in response to the test set after 20 training cycles. Signals
1 to 4, 5 to 8 and 9 to 12 are correctly classified as the insertions of Classes
A, B and C, respectively, and the remaining signals are correctly classified
as unsuccessful insertions.

8.5. Results of Experimental Study

8.5.1. Single insertion case


As in Section 8.4.1, the task of the network is to distinguish between
successful and unsuccessful insertions for a single insertion case, for the
set of parameters given in column 4 of Table 8.1. Based on the results of

Fig. 8.5. Activation output on test set after 20 training cycles. Four-output
classification experiment using simulated signals [1].

the simulation study, a network with 60 input nodes, 15 hidden layer nodes
and 1 output node is used. After initialization, the network is trained over
22 cycles, Fig. 8.6. After a modest training period (eight cycles including
initialization), the network output clearly differentiates between successful
and unsuccessful insertions.

8.5.2. Generalization ability

As in Section 8.4.2, the network is trained on signals acquired during


insertions of screws into the smallest (1.0 mm, 5 signals) and widest (2.0
mm, 5 signals) hole as well as on signals from eight unsuccessful insertions,
Column 5 of Table 8.1. During testing (employing four successful insertion
signals from each of the three holes and eight unsuccessful insertion signals),
it is analyzed whether the network is capable of interpolating and correctly
classifying the signals corresponding to the medium-sized hole (1.5 mm).
Figure 8.7 shows the acquired insertion signals for the three different cases
used in this experiment. The network is trained using 26 cycles, and the test
set is presented to the network after initialization and after each training
cycle, Fig. 8.8. It is seen that after a modest period of training (nine cycles),
the network correctly classifies successful and unsuccessful insertions.

Fig. 8.6. Output values as training evolves (insertion signals recorded from electric
screwdriver). Single case experiment [1].

8.5.3. Four-output classification

This experiment is equivalent to that performed in Section 8.4.3. The


three insertion classes investigated in this experiment are shown in Column
6 of Table 8.1; the corresponding insertion signals are depicted in Fig. 8.9.
The aim of the experiment is to separate successful insertions from failed
insertions, and to further classify the successful signals to one of the three
classes. Thus, as for the ANN in Section 8.4.3, four output nodes are
used. As training evolves, output nodes one to three become more and
more activated for correct signals of Classes A, B and C, respectively, while
the fourth output node increases in activity for unsuccessful insertions.
Tests (involving 12 signals representing successful insertions (4 for each
class) and 16 signals from unsuccessful insertions) have shown that the
SSE decreases relatively rapidly and smoothly to low values. At different
training stages, the training and the test sets were presented to the network.
When presented with the training set after 50 iterations, the network is

Fig. 8.7. Insertion signals from screwdriver. Variation of hole diameter [1].

able to correctly classify all 31 training signals. When presented
with the test set, however, the network fails to classify one of the signals
properly. After continued training (200 and more training cycles), the
network correctly classifies all the presented signals from the test set,
Fig. 8.10. Signal 8 correctly activates output node 2, but also activates
output node 1, albeit to a lesser extent. Also, for signals 9, 10 and
17 an increased output activity is observed. This is probably due to the
proximity of the different classes. However, choosing the output node with
the maximum activity gives the correct answer for all signals after 200
training cycles.

8.6. Conclusions

The main contribution of this chapter is the development of a new


methodology, based on RBF ANNs, for monitoring the insertion of
self-tapping screws. Its main advantage is its capability to generalize and
to correctly classify unseen insertion signals. This level of classification

Fig. 8.8. Output values as training evolves (real insertion signals). Variation of hole
diameter experiment [1].

cannot be achieved with standard methods such as the teach method, and
is particularly useful where insertions have to be logged, as is the case where
high safety standards are required.

Fig. 8.9. Insertion signals from screwdriver. Four-output classification experiment [1].

This chapter investigates the ability of the ANN to classify signals


belonging to a single insertion case. Both the computer simulation and
experimental results show that after a modest training period (eight cycles
when using real data), the network is able to correctly classify torque
signature signals.
Further, the ability of the network to cope with unseen test signals
is investigated, using signatures acquired from insertion into holes with
different diameters. The network is trained with signals from the smallest
and largest hole, and tested using signals from all insertions including
those from the unseen middle-sized holes. It is shown that after a modest
training period (10 cycles when using real data), the network is able to
classify all test signals, including those belonging to unseen signatures. To
classify signals of the unseen insertion cases, the network has to interpolate
from insertion cases learnt during training, demonstrating its generalization
ability.

Fig. 8.10. Network activation output for the test set after 200 training cycles.
Four-output classification experiment using real insertion signals. All signals are
correctly attributed [1].

Finally, the ability of the network to classify into multiple categories


is investigated. The classification task is to separate correct signals from
faulty signals, and to further classify the correct signals into one of three
classes. Although the training requirements for this case are higher than in
the previous experiments (200 cycles to ensure correct classification when
using real data), after training, the network is able to classify all test signals
accurately. The considerably more extensive training required for this
experiment is due to the higher complexity of the task at hand as compared
to the previous experiments.
Current research focuses on developing advanced preprocessing
techniques that will better separate the essential features of the
torque-insertion depth profiles of the different cases, aiming to improve
the multi-output classification. Work is underway on creating estimation
techniques to determine the parameters of the analytical model with respect
to real insertion signals [28].

Acknowledgments

The research leading to these results has been partially supported by


the COSMOS project, which has received funding from the European
Community’s Seventh Framework Programme (FP7-NMP-2009-SMALL-3,
NMP-2009-3.2-2) under grant agreement 246371-2. The work of Dr. Lara

was funded by CONACYT. I also thank Allen Jiang for compositing this
chapter.

References

[1] K. Althoefer, B. Lara, and L. D. Seneviratne, Monitoring of self-tapping


screw fastenings using artificial neural networks, ASME Journal of
Manufacturing Science and Engineering. 127, 236–247, (2005).
[2] B. Lara Guzman. Intelligent Monitoring of Screw Insertions. PhD thesis,
King’s College, University of London, (2005).
[3] A. S. Kondoleon. Application of technology-economic model of assembly
techniques to programmable assembly machine configuration. SM thesis.
Master’s thesis, Mechanical Engineering Department, Massachusetts
Institute of Technology, (1976).
[4] P. M. Lynch. Economic-Technological Modeling and Design Criteria for
Automated Assembly. PhD thesis, Mechanical Engineering Department,
Massachusetts Institute of Technology, (1977).
[5] D. E. Whitney and J. L. Nevins, Computer-Controlled Assembly, Scientific
American. 238(2), 62–74, (1978).
[6] L. D. Seneviratne, F. A. Ngemoh, and S. W. E. Earles, An experimental
investigation of torque signature signals for self-tapping screws, Proceedings
of the Institution of Mechanical Engineers, Part C: Journal of Mechanical
Engineering Science. 214(2), 399–410, (2000).
[7] L. D. Seneviratne, P. Visuwan, and K. Althoefer. Monitoring of threaded
fastenings in automated assembly using artificial neural networks. In
Intelligent assembly and disassembly: (IAD 2001): a proceedings volume
from the IFAC workshop, Camela, Brazil, 5-7 November 2001, 61–65, (2002).
[8] M. Matsumura, S. Itou, H. Hibi, and M. Hattori, Tightening torque
estimation of a screw tightening robot, Proceedings of the 1995 IEEE
International Conference on Robotics and Automation. 2, 2108–2112, (1995).
[9] K. Ogiso and M. Watanabe, Increase of reliability in screw tightening,
Proceedings of the 4th International Conference on Assembly Automation,
Tokyo, Japan. 292–302, (1982).
[10] S. K. Smith, Use of a microprocessor in the control and monitoring of
air tools while tightening threaded fasteners, Proceedings of the Society of
Manufacturing Engineers, Dearborn, Michigan. 2, 397–421, (1980).
[11] F. A. Ngemoh. Modelling the Automated Screw Insertion Process. PhD
thesis, King’s College, University of London, (1997).
[12] L. D. Seneviratne, F. A. Ngemoh, and S. W. E. Earles, Theoretical modelling
of screw tightening operations, Proceedings of Engineering Systems Design
and Analysis, American Society of Mechanical Engineers, Istanbul, Turkey.
47(1), 301–310, (1992).

[13] E. J. Nicolson and R. S. Fearing, Compliant control of threaded fastener


insertion, Proceedings of the 1993 IEEE International Conference on
Robotics and Automation. 484–490, (1993).
[14] G. P. Peterson, B. D. Niznik, and L. M. Chan, Development of an automated
screwdriver for use with industrial robots, IEEE Journal of Robotics and
Automation. 4(4), 411–414, (1988).
[15] T. Tsujimura and T. Yabuta, Adaptive force control of screwdriving with
a positioning-controlled manipulator, Robotics and Autonomous Systems. 7
(1), 57–65, (1991).
[16] F. Mrad, Z. Gao, and N. Dhayagude, Fuzzy logic control of automated
screw fastening, Conference Record of the 1995 IEEE Industry Applications
Conference, 1995. Thirtieth IAS Annual Meeting, IAS’95. 2, 1673–1680,
(1995).
[17] A. P. Boresi, R. J. Schmidt, and O. M. Sidebottom, Advanced Mechanics of
Materials. (John Wiley, New York, 1993).
[18] S. P. Timoshenko and J. N. Goodier, Theory of Elasticity. (McGraw, New
York, 1970).
[19] B. Lara, K. Althoefer, and L. D. Seneviratne, Automated robot-based screw
insertion system, Proceedings of the 24th Annual Conference of the IEEE
Industrial Electronics Society IECON'98. 4, 2440–2445, (1998).
[20] K. Althoefer, L. D. Seneviratne, and R. Shields, Mechatronic strategies for
torque control of electric powered screwdrivers, Proceedings of the Institution
of Mechanical Engineers, Part C: Journal of Mechanical Engineering
Science. 214(12), 1485–1501, (2000).
[21] N. Dhayagude, Z. Gao, and F. Mrad, Fuzzy logic control of automated
screw fastening, Robotics and Computer-Integrated Manufacturing. 12(3),
235–242, (1996).
[22] L. Bruzzone and D. F. Prieto, Supervised training technique for radial basis
function neural networks, Electronics Letters. 34(11), 1115–1116, (1998).
[23] C. M. Bishop, Neural Networks for Pattern Recognition. (Oxford University
Press, New York, 1995).
[24] M. T. Musavi, W. Ahmed, K. H. Chan, K. B. Faris, and D. M. Hummels,
On the training of radial basis function classifiers, Neural networks. 5(4),
595–603, (1992).
[25] B. Lara, K. Althoefer, and L. D. Seneviratne, Artificial neural networks
for screw insertions classification, Proceedings of the IEEE International
Conference on Robotics and Automation (ICRA), San Francisco, CA.
1912–1917, (2000).
[26] B. Lara, K. Althoefer, and L. D. Seneviratne, Use of artificial neural
networks for the monitoring of screw insertions, Proceedings of the 1999
IEEE/RSJ International Conference on Intelligent Robots and Systems,
1999, IROS'99. 1, 579–584, (1999).
[27] A. Zell, G. Mamier, and et al., SNNS: Stuttgart Neural Network Simulator.
User Manual, Version 4.1, Institute for Parallel and Distributed High
Performance Systems, Technical Report. (1995).

[28] M. Klingajay, L. Seneviratne, and K. Althoefer, Identification of threaded


fastening parameters using the Newton Raphson Method, Proceedings of
the 2003 IEEE/RSJ International Conference on Intelligent Robots and
Systems, IROS 2003. 2, 2055–2060, (2003).

PART 4

Support Vector Machines and their Applications


Chapter 9

On the Applications of Heart Disease Risk Classification and Hand-written
Character Recognition using Support Vector Machines

S.R. Alty, H.K. Lam and J. Prada
Department of Electronic Engineering, King’s College London
Strand, London, WC2R 2LS, United Kingdom

Over the past decade or so, support vector machines have established
themselves as a very effective means of tackling many practical
classification and regression problems. This chapter relates the theory
behind both support vector classification and regression, including an
example of each applied to real-world problems. Specifically, a classifier
is developed which can accurately estimate the risk of developing heart
disease simply from the signal derived from a finger-based pulse oximeter.
The regression example shows how SVMs can be used to rapidly and
effectively recognize hand-written characters, particularly those of the
so-called graffiti character set.

Contents

9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214


9.1.1 Introduction to support vector machines . . . . . . . . . . . . . . . . . 214
9.1.2 The maximum-margin classifier . . . . . . . . . . . . . . . . . . . . . . 214
9.1.3 The soft-margin classifier . . . . . . . . . . . . . . . . . . . . . . . . . . 216
9.1.4 Support vector regression . . . . . . . . . . . . . . . . . . . . . . . . . 218
9.1.5 Kernel functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
9.2 Application: Biomedical Pattern Classification . . . . . . . . . . . . . . . . . 221
9.2.1 Pulse wave velocity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
9.2.2 Digital volume pulse analysis . . . . . . . . . . . . . . . . . . . . . . . 222
9.2.3 Study population and feature extraction . . . . . . . . . . . . . . . . . 223
9.2.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
9.3 Application: Hand-written Graffiti Recognition . . . . . . . . . . . . . . . . . 228
9.3.1 Data acquisition and feature extraction . . . . . . . . . . . . . . . . . . 229
9.3.2 SVR-based recognizer . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
9.3.3 Simulation results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232


9.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235


References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
Appendix A. Grid search shown in three dimensions for SVM-based DVP classifier
with Gaussian Radial Basis function kernel . . . . . . . . . . . . . . . . . . . 239
Appendix B. Tables of recognition rate for SVR-based graffiti recognizer with linear
kernel function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
Appendix C. Tables of recognition rate for SVR-based graffiti recognizer with spline
kernel function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
Appendix D. Tables of recognition rate for SVR-based graffiti recognizer with
polynomial kernel function . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244
Appendix E. Tables of recognition rate for SVR-based graffiti recognizer with radial
basis kernel function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249

9.1. Introduction

This chapter relates the rise in interest, largely over the past decade, of
supervised learning machines that employ a hypothesis space of linear
discriminant functions in a higher dimensional feature space, trained and
optimized using methods from statistical learning theory. Vladimir Vapnik
is largely credited with introducing this learning methodology that since
its inception has outperformed many other systems in a broad selection of
applications. We refer, of course, to support vector machines.

9.1.1. Introduction to support vector machines


Support vector machines (SVMs) [1–4] have attracted a significant amount
of attention in the field of machine learning over the past decade by
proving themselves to be very effective in a variety of real-world pattern
classification and regression tasks. In the literature, they have been
successfully applied to many problems ranging from face recognition, to
bioinformatics and hand-written character recognition (amongst many
others). In this chapter we give a brief introduction to the mathematical
basis of SVMs for both classification and regression problems and give an
example application for each. For a complete treatment of the background
theory please see [4].

9.1.2. The maximum-margin classifier


Although the theory can be extended to accommodate multiple classes,
without loss of generality let us first consider a binary classification task
assuming we have a linearly separable set of data samples
$$S = \{(x_1, y_1), \ldots, (x_m, y_m)\}\,, \tag{9.1}$$

where $x \in \mathbb{R}^d$, i.e. x lies in a d-dimensional input space, and yi is the class


label such that yi ∈ {−1, 1}. Since SVMs are principally based on linear
discriminant functions, a suitable classifier could then be defined as:

f (x) = sgn(⟨w, x⟩ + b) . (9.2)

Where vector w determines the orientation of a discriminant plane (or


hyperplane), ⟨w, x⟩ is the inner product of the vectors, w and x and b is
the bias or offset from the origin. It is clear that there exists an infinite
number of possible planes that could correctly dichotomize the training
data. Simple intuition would lead one to expect the choice of a line drawn
through the "middle", between the two classes, to be a suitable choice, as
this would imply that small disturbances of each data point would likely not
affect the resulting classification significantly. This concept then suggests


Fig. 9.1. Maximum-margin classifier with no errors showing optimal separating
hyperplane (solid line) with canonical hyperplanes on either side (dotted lines). It can
be shown that the margin is equal to 2/||w||. The support vectors are encircled.

that a good separating plane is one that works well at generalizing; this
implies a plane which has a higher probability of correctly classifying new,
unseen, data. Hence, an optimal classifier is one which finds the best

generalizing hyperplane that maximizes the margin between each class of


data points. This paradigm results in the selection of a single hyperplane
dependent solely on the “support vectors” (these are the data points that
support the so-called canonical hyperplanes) by maximizing the size of the
margin. It is this characteristic that forms the key to the robustness
of SVMs, as they rely solely on this sparse data set, and will not be
affected by perturbations in any of the remainder of the test data. In
this manner, SVMs are based on so-called Structural Risk Minimisation
(SRM) [1] and not Empirical Risk Minimisation (ERM) on which other
traditional classification techniques such as Neural Networks rely.

9.1.3. The soft-margin classifier


More often than not, however, real-world data sets are typically not linearly
separable in input space, meaning that the maximum margin classifier
model is no longer valid and a new approach must be introduced. This
is achieved by relaxing the constraints a little to tolerate a small amount
of misclassification. Points that subsequently fall on the wrong side of the
margin, therefore, are treated as errors. The error vectors are assigned a


Fig. 9.2. Soft-margin classifier with some errors denoted by the slack variables ξ which
represent the errors.

lower influence (determined by a preset slack variable) on the position of


the hyperplane. Optimization of the soft-margin classifier is now achieved

by maximizing the margin whilst at the same time allowing the margin
constraints to be violated according to the preset slack variables ξi. This leads
to the minimization of $\frac{1}{2}\|w\|^2 + C\sum_{i=1}^{m}\xi_i$ subject to
$y_i(\langle w, x_i\rangle + b) \geq 1 - \xi_i$ and $\xi_i \geq 0$ for $i = 1, \ldots, m$.
Lagrangian duality theory [4] is typically applied to solve the minimization
problems presented by linear inequalities. Thus, one can form the primal
Lagrangian,

$$L(w, b, \xi, \alpha, \beta) = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{m}\xi_i - \sum_{i=1}^{m}\beta_i\xi_i - \sum_{i=1}^{m}\alpha_i\left[y_i(\langle w, x_i\rangle + b) - 1 + \xi_i\right], \tag{9.3}$$

where αi and βi are independent undetermined Lagrangian multipliers.


The dual-form Lagrangian can be found by calculating each of the partial
derivatives of the primal and equating to zero in turn thus,

$$w = \sum_{i=1}^{m} y_i\alpha_i x_i \tag{9.4}$$

and

$$0 = \sum_{i=1}^{m} y_i\alpha_i\,, \tag{9.5}$$

these are then re-substituted into the primal, which then yields,

$$L(w, b, \xi, \alpha, \beta) = \sum_{i=1}^{m}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{m} y_i y_j\alpha_i\alpha_j\langle x_i, x_j\rangle\,. \tag{9.6}$$

We note that this result is the same as for the maximum-margin classifier.
The only difference is the constraint α + β = C, where both α and β ≥ 0,
hence 0 ≤ α, β ≤ C. Thus, the value C sets an upper limit on the size
of the Lagrangian optimization variables αi and βi; this limit is sometimes
referred to as the box constraint. The selection of the value of C results in
a trade-off between accuracy of data fit and regularization. The optimum
choice of C will depend on the underlying data and nature of the problem
and is usually found by experimental cross-validation (whereby the data
is divided into a training set and testing or hold-out set and the classifier
is tested on the hold-out set alone). This is best performed on a number
of different sets (sometimes referred to as folds) of unseen data, leading
to so-called “multi-folded cross-validation”. Quadratic Programming (QP)
algorithms are typically employed to solve these equations and calculate the
Lagrangian multipliers. There are many online resources of such algorithms
available for download, see website referred to in the excellent book by
Cristianini and Shawe-Taylor [4] for an up to date listing.
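As a brief illustration of how the box constraint C can be selected by multi-folded cross-validation, the sketch below uses scikit-learn's SVC, one freely available QP-based implementation; the chapter does not prescribe a particular package, and the data generated here is synthetic.

    # Illustrative soft-margin SVM training with cross-validated choice of C
    # (scikit-learn assumed as the solver; synthetic two-class data).
    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(-1.0, 1.0, (50, 2)), rng.normal(1.0, 1.0, (50, 2))])
    y = np.hstack([-np.ones(50), np.ones(50)])

    for C in (0.1, 1.0, 10.0):
        scores = cross_val_score(SVC(C=C, kernel="linear"), X, y, cv=5)
        print(f"C={C}: mean cross-validation accuracy {scores.mean():.3f}")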

9.1.4. Support vector regression



 

 

Fig. 9.3. Support vector regression showing the linear ε-insensitive loss function “tube”
with width 2ε. Only those points lying outside the tube are considered support vectors.

SVMs can also lend themselves easily to the task of regression with only
a simple extension to the theory. In so doing, they allow for real-valued
targets to be estimated by modeling a linear function (see Fig. 9.3) in
feature space (see Section 9.1.5). The same concept of maximizing the
margin is retained but it is extended by a so-called loss function, which
can take on different forms. The loss function can be linear or quadratic
or, more usefully, allow for an insensitive region: for training data
points that lie within this insensitive range, no error is deemed to
have occurred. In a manner similar to the soft-margin classifier, errors are
accounted for by the inclusion of slack variables that tolerate data points
that violate this constraint to a limited extent. This then leads to a modified
set of constraints requiring the minimization of $\frac{1}{2}\|w\|^2 + C\sum_{i=1}^{m}(\xi_i + \xi_i^*)$
subject to $y_i - \langle w, x_i\rangle - b \leq \varepsilon + \xi_i$ and $\langle w, x_i\rangle + b - y_i \leq \varepsilon + \xi_i^*$
with $\xi_i, \xi_i^* \geq 0$ for $i = 1, \ldots, m$. The solution for the above QP problem is provided once
again by the use of Lagrangian duality theory; however, we have omitted a
full derivation for the sake of brevity (please see the tutorial by Smola and
Schölkopf [5] and the references therein). Figure 9.3 above shows only the

linear loss function, whereby the points that are further than |ε| from the
optimal line are apportioned weight in a linearly increasing fashion. There
are, however, numerous different types of loss function in the literature,
including the quadratic and Huber's loss functions, the latter being in fact
a hybrid or piecewise combination of quadratic and linear functions. Whilst
most of them include an insensitive region to facilitate sparseness of
solution, this is not always the case. Figure 9.4 shows the quadratic and


Fig. 9.4. Quadratic (a) and Huber (b) type loss functions with ε-insensitive regions.

Huber's loss functions. Hence, when training an SV-based regressor one
must optimize the box constraint C, the parameters of the kernel function
used and ε. Typically, this is achieved by performing a grid search over all
hyper-parameters during cross-validation.
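The grid search can be sketched as follows, again using scikit-learn (an assumed choice) with its SVR and GridSearchCV utilities; the parameter ranges and the synthetic data are purely illustrative.

    # Illustrative hyper-parameter grid search for support vector regression.
    import numpy as np
    from sklearn.svm import SVR
    from sklearn.model_selection import GridSearchCV

    rng = np.random.default_rng(0)
    X = rng.uniform(-3.0, 3.0, (100, 1))
    y = np.sin(X).ravel() + rng.normal(0.0, 0.1, 100)

    param_grid = {
        "C": [0.1, 1.0, 10.0],        # box constraint
        "epsilon": [0.01, 0.1, 0.5],  # width of the insensitive region
        "gamma": [0.1, 1.0, 10.0],    # Gaussian RBF kernel parameter
    }
    search = GridSearchCV(SVR(kernel="rbf"), param_grid, cv=5)
    search.fit(X, y)
    print(search.best_params_)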

9.1.5. Kernel functions

Often, real-world data is not readily separable using a linear hyperplane


but it may exhibit a natural underlying nonlinear characteristic. Kernel
mappings offer an efficient method of projecting input data into a higher
dimensional feature space where a linear hyperplane can successfully
separate classes. Kernel functions must obey Mercer’s Theorem, and
as such they offer an implicit mapping into feature space. This means
that the explicit mapping need not be computed, rather the calculation
of a simple inner-product is sufficient to facilitate the mapping. This
reduces the burden of computation significantly and in combination
with SVM’s inherent generality, greatly mitigates the so-called “curse of
dimensionality”. Furthermore, the input feature inner-product from (9.6)

can simply be substituted with the appropriate Kernel function to obtain


the mapping whilst having no effect on the Lagrangian optimization theory.
Hence, the relevant classifier function then becomes:

$$f(x_j) = \mathrm{sgn}\left[\sum_{i=1}^{n_{SVs}} y_i\alpha_i K(x_i, x_j) + b\right] \tag{9.7}$$

and for regression

$$f(x_j) = \sum_{i=1}^{n_{SVs}} (\alpha_i - \alpha_i^*)\, K(x_i, x_j) + b\,, \tag{9.8}$$

where nSVs denotes the number of support vectors, yi are the labels, αi
and αi* are the Lagrangian multipliers, b the bias, xi the support vectors
previously identified through the training process and xj the test data
vector. The use of Kernel functions transforms a simple linear classifier
into a powerful and general nonlinear classifier (or regressor). There are
a number of different Kernel functions available; here are a few popular
types [4]:

Linear function: $K(x_i, x_j) = \langle x_i, x_j\rangle + c$  (9.9)

Polynomial function: $K(x_i, x_j) = (\langle x_i, x_j\rangle + c)^d$  (9.10)

Gaussian radial basis function: $K(x_i, x_j) = \exp\!\left[\frac{-\|x_i - x_j\|^2}{2\sigma^2}\right]$  (9.11)

Spline function:
$K(x_i, x_j) = \prod_{k=1}^{10}\left(1 + x_{ik}x_{jk} + \tfrac{1}{2}x_{ik}x_{jk}\min(x_{ik}, x_{jk}) - \tfrac{1}{6}\min(x_{ik}, x_{jk})^3\right)$  (9.12)
where c is an arbitrary constant parameter (often c is simply set to zero or
unity), d is the degree of the polynomial Kernel and σ controls the width
of the Gaussian radial basis function. When experimenting with training
and cross-validation it is common practice to perform a grid search over
these Kernel parameters along with the box constraint, C, for the lowest
classification error within a given training set.
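For reference, the first three kernels in (9.9)-(9.11) can be written in a few lines of numpy; the default values of c, d and σ below are arbitrary illustrations.

    # Compact numpy versions of the linear, polynomial and Gaussian RBF kernels.
    import numpy as np

    def linear_kernel(xi, xj, c=0.0):
        return np.dot(xi, xj) + c

    def polynomial_kernel(xi, xj, c=1.0, d=3):
        return (np.dot(xi, xj) + c) ** d

    def gaussian_rbf_kernel(xi, xj, sigma=1.0):
        return np.exp(-np.sum((xi - xj) ** 2) / (2.0 * sigma ** 2))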

Fig. 9.5. Kernel functions generate implicit nonlinear mappings, enabling SVMs to
operate on a linear separating hyperplane in higher dimensional feature space.

9.2. Application: Biomedical Pattern Classification

Cardiovascular disease (CVD) is the leading cause of mortality in the


developed world [6]. In 2004 alone approximately 17 million people died
from some form of CVD (largely from stroke and myocardial infarction).
The World Health Organisation (WHO) suggests that this figure is
expected to exceed 23 million by the year 2030. There are a number
of established risk factors for CVD such as sex, age, tobacco smoking,
high blood pressure, blood serum cholesterol and the presence of diabetes
mellitus. Current methods for estimating the risk of a CVD event (such as
a myocardial infarction or a stroke) within an individual, rely on the use
of these factors in a so-called “risk calculator”. This calculator is based on
regression equations relating levels of individual risk factors to CVD events
in various prospective follow-up studies such as the Framingham Heart
study [7] or the Cox model study [8]. Such risk calculators, however, fail to
identify a significant minority of subjects who subsequently go on to develop
some form of CVD. Events typically occur as a result of arteriosclerosis
and/or atherosclerosis, inflammatory and degenerative conditions of the
arterial wall. Early on, changes occur at the cellular level but this leads
to changes in the mechanical properties of the arterial wall. Hence, the
possibility that biophysical measures of the mechanical properties of the
arterial wall may provide a measure of CVD risk has received a great
deal of attention. Currently, one of the most promising measurements
is arterial stiffness. Arteries naturally tend to stiffen with age and
premature stiffening may result from a combination of arteriosclerosis
and atherosclerosis. Additionally, arterial stiffening leads to systolic
hypertension and increased loading on the heart. Simple measurement of
Brachial (upper arm) blood pressure may not, however, reveal this condition
as central blood pressure can differ markedly from that measured in the
periphery. A number of studies [9, 10] indicate that large artery stiffness,
as measured by pulse wave velocity (see below) has proved to be a powerful
and independent predictor of CVD events, more closely related to CVD
risk than the traditional risk factors.

9.2.1. Pulse wave velocity


Determining arterial stiffness directly by simultaneous measurement of the
change in arterial diameter with pressure is technically challenging and the
most practical technique employed to measure stiffness is the estimation
of arterial Pulse Wave Velocity (PWV). PWV is the speed at which the
pressure pulse propagates through the arterial tree and is directly related
to arterial stiffness. Measuring PWV from the Carotid artery (in the neck)
to the Femoral artery (in the thigh) is the measurement that has been used
in most outcome studies to date. It includes the Aorta and large elastic
arteries that are most susceptible to age-related stiffening. Carotid-femoral
PWV is often determined by placement of a tonometer (a non-invasive
pressure sensor) over the Carotid and Femoral arteries. The time delay
between the pressure pulse arriving at the Carotid and the Femoral artery is measured, and the path length, estimated from the distance between the two sites of application of the sensors, is divided by this time delay. Hence, Pulse Wave Velocity is determined in m s−1.
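As a simple numerical illustration of this calculation (with invented figures), PWV is obtained by dividing the estimated path length by the measured transit time:

```python
# Invented example values: path length between the carotid and femoral
# measurement sites (m) and the measured pulse transit time (s).
path_length_m = 0.55
transit_time_s = 0.065
pwv = path_length_m / transit_time_s   # pulse wave velocity in m/s
print(f"PWV = {pwv:.1f} m/s")          # roughly 8.5 m/s for these figures
```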

9.2.2. Digital volume pulse analysis


Determination of PWV as described above, whilst non-invasive, involves
the subject undressing to allow access to the Femoral artery and requires
specialized equipment and a skilled technician. An alternative, simpler
and quicker technique would be of great advantage in screening for CVD.
We have previously proposed that the shape of the digital volume pulse
(DVP) may be used to estimate arterial stiffness [11–14]. The DVP can
be easily and rapidly acquired (without the need for a skilled technician)
by measuring absorption of infra-red light across the finger pulp (which
is technically termed photoplethysmography). This varies with red blood
Fig. 9.6. The Photoplethysmograph device used to acquire the DVP waveform attached
to the index finger.

density and hence with blood vessel diameter during the cardiac cycle.
Typically the DVP waveform (see Fig. 9.7) exhibits an initial systolic peak
followed by a diastolic peak. The time between these two peaks, called the
peak-to-peak time (PPT), is related to the time taken for the pressure wave
to propagate from the heart to the peripheral blood vessels and back again.
Thus PPT is related to PWV in the large arteries. This has led to the
development of the so-called stiffness index (SI), which is simply the subject height divided by the PPT and can be used as a crude estimate of PWV. In older subjects, however, and in subjects with premature arterial stiffening, the systolic and diastolic peaks in the DVP become difficult to distinguish, and SI cannot then be used to estimate arterial stiffness.

9.2.3. Study population and feature extraction

A group of 461 subjects were recruited from the local area of South East
London. None of the subjects had a previous history of cardiovascular
disease or were receiving heart medication. The subjects ranged from 16 to
81 years of age, with an average age of 50 and standard deviation of 13.6
years. The DVP waveform was measured for each of the subjects along
with their PWV and a number of other basic physiological measures such
as systolic and diastolic blood pressure, their height and weight. There are
various ways in which features can be derived from the DVP waveform.
Previous work in this field [13] has led to the selection of what we have
come to term Physiological Features [15]. Essentially, these are parameters
associated with the physiological properties of the aorta and arterial
characteristics in general. More recently, we have focused our attention
on exploiting features that do not assume an underlying physiological basis
but instead are selected by applying information theoretic approaches to
the DVP waveform and we refer to these as Signal-Based Features.

9.2.3.1. Physiological features

After a great deal of experimentation comparing the significance of many
of the physiological features, it was found that, in fact, a specific set of
four of these features gave the best classification of high or low PWV.
Interestingly, two of these features have been independently cited in the
literature [11, 13, 16] as having a bearing on cardiovascular pathology.
Dillon and Hertzman are credited [16] with first measuring the DVP in 1941,

Fig. 9.7. Digital Volume Pulse waveforms as extracted from the Photoplethysmograph device and their derivatives. Labeled to show various features, e.g. PPT, RI and CT.
and they observed that subjects with hypertension or arterial heart disease
exhibited an “increase in the crest time” compared to healthy subjects.
The Crest Time (CT), is the time from the foot of the DVP waveform to
its peak (as shown in Figs 9.7c and d) and it has proved to be a useful
feature for the classifier. Another useful feature is the Peak-to-Peak Time (PPT), defined as the
time between the first peak and the second peak or inflection point of
the DVP waveform (see Figs 9.7c and d). As mentioned previously in
the introduction, the second peak/inflection point on the DVP is generally
accepted to be due to reflected waves. So its timing would be related to
arterial stiffness and PWV. The definition of PPT depends on the DVP
waveform as its contour varies with subjects. When there is a second peak
as is the case with “Waveform A” in Fig. 9.7a, PPT is defined as time
between the two maxima. Hence, the time between the two positive to
negative zero-crossings of the derivative can be considered to be the PPT.
In some DVP waveforms, however, the second peak is not distinct as in
“Waveform B” in Fig. 9.7b. When this occurs, the time between the peak
of the waveform and the inflection point on the downward going slope of
the waveform (which is a local maximum of the first derivative, as shown in
Fig. 9.7d) is defined as the PPT. Another measurement extracted from the
DVP used in the classifier, is the so-called Reflection Index (RI), which is
the ratio of relative amplitudes of the first and second peaks (see Figs 9.7a
and b). These four features (PPT, CT, RI and SI) were then empirically
found [14] to be amongst the best physiologically motivated features for
successful classification of PWV.
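The sketch below illustrates one way the CT, PPT, RI and SI features described above might be extracted from a sampled DVP pulse using the zero-crossings of its first derivative. It assumes a uniformly sampled, clean waveform with two distinct peaks (i.e. "Waveform A" in Fig. 9.7) and is a simplified illustration rather than the exact preprocessing used in this study.

```python
import numpy as np

def dvp_features(dvp, fs, height_m):
    """Crude CT, PPT, RI and SI estimates from a single DVP pulse.

    dvp      : 1-D array holding one pulse, with the foot of the wave at index 0
    fs       : sampling frequency in Hz
    height_m : subject height in metres (used for the stiffness index)
    """
    d = np.diff(dvp)
    # Positive-to-negative zero-crossings of the derivative = local maxima of the pulse.
    peaks = np.where((d[:-1] > 0) & (d[1:] <= 0))[0] + 1
    if len(peaks) < 2:
        raise ValueError("no distinct second peak; the inflection-point rule would be needed")
    p1, p2 = peaks[0], peaks[1]
    ct = p1 / fs                 # crest time: foot of the wave to the first peak (s)
    ppt = (p2 - p1) / fs         # peak-to-peak time (s)
    ri = dvp[p2] / dvp[p1]       # reflection index: second-to-first peak amplitude ratio
    si = height_m / ppt          # stiffness index, a crude PWV estimate (m/s)
    return ct, ppt, ri, si
```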

9.2.3.2. Signal-based features

These features were extracted by applying established signal processing
techniques to reduce the dimensionality of the waveform to a standardized
set of features without making any assumptions about the physical
generation of the waveform. After much experimentation it was found that
a certain range of the eigenvalues of the covariance matrix (formed by the
autocorrelation of the DVP waveform with its mean removed) outperformed
all the other features and methods by some margin. Essentially, the
covariance matrix, A, formed from the DVP waveform (one per subject) is
decomposed using Eigenvalue Decomposition such that

A = VΣV−1 . (9.13)
Here V is the matrix of orthonormal eigenvectors of A and Σ its
eigenvalues, where Σ = diag{σ1 , σ2 , . . . , σn }. Specifically, the range of
eigenvalues Σ̂ = {σ3 , . . . , σ9 } inclusive was found, during experimentation,
to give the best results. It is thought that the first two eigenvalues, σ1
and σ2 , primarily represent the signal subspace data, which is essentially
common to all of the waveforms in the database, and hence discarding
these features enhanced the separability of the data and its subsequent
classification.
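A minimal sketch of this signal-based feature extraction is given below: the mean is removed from the DVP waveform, a covariance-type matrix is formed and eigendecomposed, and the eigenvalues σ3–σ9 are retained. The construction of the matrix from lagged segments of the waveform (and the segment length of 16) is an illustrative assumption; the exact construction used in the study is described in [15].

```python
import numpy as np

def eigen_features(dvp, dim=16):
    """Return the eigenvalues sigma_3..sigma_9 of a covariance matrix built
    from the mean-removed DVP waveform (lagged-segment construction assumed)."""
    x = np.asarray(dvp, dtype=float)
    x = x - x.mean()                                   # remove the mean
    # Stack overlapping length-`dim` segments of the waveform as rows.
    segs = np.array([x[i:i + dim] for i in range(len(x) - dim + 1)])
    A = segs.T @ segs / segs.shape[0]                  # covariance / autocorrelation matrix
    sigma = np.linalg.eigh(A)[0][::-1]                 # eigenvalues, sorted descending
    return sigma[2:9]                                  # sigma_3 .. sigma_9
```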

9.2.4. Results

This section contains the best results obtained after thorough
experimentation with the Support Vector classifier. Results are presented as percentages: the overall (or "total") rate, then the sensitivity ("true positive" rate), followed by the specificity ("true negative" rate). In actual
fact, a wide range of possible feature sets and combinations exist that
performed the required tasks satisfactorily; however, for ease of readability
and conciseness we have selected those which we found to perform the best
overall. The results presented here are based on the best two sets of each
of the physiologically motivated features, two sets of signal subspace-based
features and finally the two best combinations of each type of feature
set. During experimentation, the two sets of physiologically-based features
that performed the best were found to be: P1 = {CT, PPT} and
P2 = {CT, PPT, SI}. The two sets of signal subspace-based features that
performed the best were also found accordingly, Σ1 = {σ3 , . . . , σ9 } and
Σ2 = {σ2 , . . . , σ9 }. The original cohort contained 461 subjects, both their
complete DVP waveform data and PWV measurements were available to
this study. The mean PWV value of our cohort was found to be around
10 ms−1 , hence, a binary target label was determined according to this
threshold. The cohort was gapped to remove those subjects with PWV
of between 9 and 11 ms−1 to avoid ambiguity of the target classes. The
remaining 315 records were used to train and test the classifier using a
three-fold cross-validation regimen, where 90% of the data were used for
training and 10% for testing in any given fold. A binary classifier was
developed using the Ohio State University SVM toolbox for Matlab [17]; the
classifier was trained and tested using a number of different kernel functions.
The so-called “ν-SVM” type of classifier was employed and differs from
conventional C-SVM only in the way the box constraint is implemented.
Instead of having an infinite range like that of the value of C, ν varies
Table 9.1. SVM classification rates % and (SD) for various data sets using GRBF kernel.
Data sets
Physiological Eigenvalues Combinations
P1 P2 Σ1 Σ2 Σ1+P1 Σ1+P2
Total 84.0 (0.5) 84.0 (0.5) 85.1 (1.4) 85.3 (1.6) 86.1 (0.9) 87.5 (0.0)
Sens 90.3 (2.0) 88.2 (2.1) 90.3 (3.3) 90.3 (4.0) 86.7 (1.4) 87.5 (0.0)
Spec 77.4 (0.5) 79.9 (1.0) 80.2 (2.8) 80.5 (3.4) 85.3 (2.5) 87.5 (0.0)

between the limits zero and one, thus simplifying the grid search somewhat.
Experimentation revealed that the Gaussian RBF kernel performed as well as or better than the others, and hence the results in Table 9.1 are based
on this kernel (the performance of the other kernels is compared below in
Table 9.2). After performing thorough model hyper-parameter grid searches
(see Fig. A.1 in Appendix A), results were averaged from a “block” of
nine individual results obtained from a range of both constraint factor,
ν, and GRBF width, γ. This technique was applied to all the results to
mitigate against over-training of the model parameters, ensuring a more
general classifier and more realistic results. As shown in Table 9.1, the
SVM method using the physiological feature set P1 alone, gives a fairly
high degree of classification accuracy, with a significantly high sensitivity of
90.3% achieved. There was a slightly lower result of only 77.4% specificity.
Hence, the overall average successful classification rate becomes 84.0%.
The results for the signal subspace-based features, are better still, showing
a distinct improvement in the specificity when compared with those of
the physiologically motivated features. It is readily possible to achieve
sensitivities in the region of 90%, with specificities of over 80%, bringing
overall classification to 85.3%. Finally, combinations of both feature sets
were tested and it was found that two pairs gave good results whilst the
other two combinations were less effective. In fact, the combinations of
Σ1+P1 and Σ1+P2 gave the best results overall. The latter set gave an
overall classification rate of 87.5%, with an equal rate for both sensitivity
and specificity. This is definitely the best classification rate achieved in this
study so far and is a very high classification rate for PWV based solely on
features extracted from the DVP waveform. Further tests were performed
with different kernel functions to compare with those of the GRBF kernel;
linear and second and third order polynomial kernels were tested but the
GRBF kernel consistently outperformed them for all of the data sets (see
Table 9.2). For a complete treatment of this work please see Alty et al [6].
Table 9.2. Overall SVM classification rates % and (SD) for various Kernel functions for
each data set.
Data sets
Physiological Eigenvalues Combinations
Kernel P1 P2 Σ1 Σ2 Σ1+P1 Σ1+P2
Linear 84.0 (0.5) 83.3 (0.0) 82.3 (0.9) 80.9 (0.5) 84.4 (0.0) 85.8 (0.5)
Poly2 83.9 (0.8) 84.0 (0.5) 83.9 (1.4) 84.5 (1.1) 86.0 (1.2) 86.8 (0.5)
Poly3 83.7 (0.9) 83.9 (1.1) 85.0 (0.9) 84.6 (1.4) 86.1 (0.9) 86.5 (0.0)
GRBF 84.0 (0.5) 84.0 (0.5) 85.1 (1.4) 85.3 (1.6) 86.1 (0.9) 87.5 (0.0)

9.3. Application: Hand-written Graffiti Recognition

In this section, the problem of recognizing one-stroke hand-written graffiti
[18–20] is handled by the SVRs [1, 2]. The hand-written graffiti include
digits zero to nine and three commands, i.e. backspace, carriage return and
space, which are shown in Table 9.3. Some feature points are extracted from
each graffiti for the training of the SVRs. With the feature points as the
input, an SVR-based recognizer is proposed to recognize the hand-written
graffiti. The recognition performance of the proposed SVR-based graffiti
recognizer, with various kernels (linear function, spline function, radial basis
function and polynomial function) and parameters, is investigated.

Table 9.3. Hand-written one-stroke digits and commands. The dot denotes the starting point of the character. (The stroke drawings are not reproduced here.)
Class No.  Class Name       Class No.  Class Name
1          0(a)             9          6
2          0(b)             10         7
3          1                11         8(a)
4          2                12         8(b)
5          3                13         9
6          4                14         Backspace
7          5(a)             15         Carriage Return
8          5(b)             16         Space
9.3.1. Data acquisition and feature extraction

The features of the hand-written graffiti play an important role in the
recognition process. By employing meaningful features which represent
the characteristic of the hand-written graffiti, it is possible to achieve a
high recognition rate even with a simple recognizer structure. In this
application, 10 sample points of a hand-written graffiti will be obtained as
the feature vector, which will serve as the input of the SVR-based graffiti
recognizer for the training and recognition processes.

Fig. 9.8. Written trace of digit “2” (solid line) and 10 feature points characterized by
coordinate (ϕk , βk ) (indicated by dots).

The following approach [18–20] is employed to obtain the feature vector
of a graffiti. It is assumed that a graffiti is drawn in a square drawing
area with dimension of ϕmax × βmax (in pixels). Each point in the square
drawing area is characterized by a coordinate (ϕ, β) where 0 ≤ ϕ ≤ ϕmax
and 0 ≤ β ≤ βmax . The bottom left corner is considered as the origin
denoted by the coordinate (0, 0).
Figure 9.8 illustrates a hand-written digit “2” with trace in solid
line and 10 sampled points (denoted by dots) taken in uniform distance
characterized by coordinates (ϕk , βk ), k = 1, 2, · · · , 10. By using the
coordinate of the points to form the feature vector, we have 20 numerical
values as the input for the SVR-based graffiti recognizer. Theoretically, it is
feasible and the training of the SVRs will be done in the same way. However,
the complexity of the SVR can be reduced if a smaller number of feature
points is employed and thus it is possible to reduce the implementation
complexity and cost of the hand-written graffiti recognizer.
In order to take more information about the graffiti and reduce the
number of feature points, the following algorithm is proposed. Denote
the feature vector as ρ = [ρ1 ρ2 · · · ρ10 ]. The first five points, i.e.
(ϕk , βk ), k = 1, 3, 5, 7 and 9, taken alternatively are converted to five
numerical values, respectively, using the formula ρk = ϕk ϕmax + βk . The
other five points, (ϕk , βk ), k = 2, 4, 6, 8 and 10, are converted to another
five numerical values, respectively, using the formula ρk = βk βmax + ϕk .
The second formula is a kind of transformation to obtain the feature points
simply by rotating the square writing area by 90 degrees. The main reason
for transforming the feature points is to enlarge the difference between each
graffiti, particularly for those graffiti resembling each other, to improve the recognition performance. It can be imagined that the first formula looks at the graffiti from the x-axis direction while the second one looks from the y-axis direction. Consequently, as some graffiti look like others in the x-axis but may not in the y-axis, the feature vectors for different
graffiti will be situated far apart from each other in the feature space and
benefit the recognition process. Different transformation techniques may
lead to different recognition results. In the following, as more than one
feature vector will be used, an index i is introduced to denote the feature
vector number. Hence, we have the feature vector in the form of ρi =
[ρi1 ρi2 · · · ρi10 ].
It should be noted that the feature vector ρi provides the coordinate
information of the graffiti; it is thus very sensitive to the position and size
of the graffiti appearing in the square writing area. For example, the feature
vector of a small size of digit “2” in the top right corner of the square writing
area will be very different from a large size of digit “2” right in the middle.
In order to reduce the effect of the position and size information of the
graffiti to the recognition performance, a normalization process is proposed
and defined as follows:
xi = ρi / ∥ρi∥   (9.14)
where ∥ · ∥ denotes the l2 norm. The feature vector xi is taken as the input
for the SVR-based graffiti recognizer.
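The feature construction just described can be summarized by the following sketch: the odd-numbered sample points are mapped with ρk = ϕk·ϕmax + βk, the even-numbered points with ρk = βk·βmax + ϕk, and the resulting vector is normalized as in (9.14). The drawing-area dimensions used here (160 × 160 pixels) and the assumption that the ten uniformly spaced points are already available are illustrative.

```python
import numpy as np

def graffiti_feature_vector(points, phi_max=160.0, beta_max=160.0):
    """Build the normalized feature vector x_i from 10 sampled stroke points.

    points : array of shape (10, 2) holding (phi_k, beta_k), k = 1..10,
             sampled at uniform distance along the written trace.
    """
    pts = np.asarray(points, dtype=float)
    rho = np.empty(10)
    for k in range(10):                      # k = 0..9 corresponds to point k + 1
        phi_k, beta_k = pts[k]
        if k % 2 == 0:                       # points 1, 3, 5, 7, 9
            rho[k] = phi_k * phi_max + beta_k
        else:                                # points 2, 4, 6, 8, 10
            rho[k] = beta_k * beta_max + phi_k
    return rho / np.linalg.norm(rho)         # l2 normalization, cf. (9.14)
```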

9.3.2. SVR-based recognizer


An SVR-based graffiti recognizer with the block diagram shown in Fig.
9.9 is proposed for recognition of hand-written graffiti. Referring to
Fig. 9.9, it consists of M SVR-based sub-recognizers. In this example,
16 graffiti as shown in Table 9.3 are to be recognized, hence, we have
M = 16. The SVR-based graffiti recognizer takes the feature vector xi as
an input and then distributes it to all sub-recognizers for further processing.
Corresponding to the sub-recognizer j (j = 1, 2, · · · , 16), an output vector
yij ∈ ℜ10 will be produced. All output vectors yij are fed to the class
determiner simultaneously for determination of the most likely input graffiti
indicated by an integer in the output. The integer from the output of the
class determiner is in range of 1 and 16 indicating the class label of the
possible input graffiti as shown in Table 9.3.

Fig. 9.9. An SVR-based recognizer for hand-written graffiti.

The SVR-based graffiti recognizer shown in Fig. 9.9 consists of 16
SVR-based sub-recognizers. Each SVR-based sub-recognizer is constructed
from 10 SVRs (since 10 feature points are taken as input) as shown in Fig.
9.10. Denote the 10 SVRs as SVR k, the normalized feature vector xi =
[xi1 xi2 · · · xi10 ] and the output vector yij = [yij1 yij2 · · · yij10 ].
As it can be seen from Fig. 9.10, the SVR k takes xik as an input and
produces yijk as an output. Defining the target output of the SVR k as
its input, i.e. xik , the SVR k is trained to minimize the difference between
actual and target outputs according to the chosen loss function. In other
words, the SVR k is trained to reproduce its input.
The sub-recognizer j, (j = 1, 2, · · · , 16), will be trained using the
normalized feature vector xi corresponding to the j-th graffiti. For example,
the sub-recognizer 1 is trained using the feature vectors corresponding to the
0(a) (graffiti “0” drawn from left to right as shown in Table 9.3). According
to the training data and the training objective, the input-output difference
of the SVR k will be smaller when the input xik corresponds to graffiti j than when it corresponds to other graffiti. This characteristic of the trained SVR
k offers a nice property to make the graffiti recognition possible by the class
determiner in the subsequent stage.
The class determiner takes all yij as input. To determine the possible
input graffiti, the class determiner computes a scalar similarity value sij
Fig. 9.10. Structure of SVR-based sub-recognizer.

for each SVR-based sub-recognizer according to the following formula.


sij = ∥ xi − yij / ∥yij∥ ∥ ∈ [0, 1]   (9.15)
The similarity value sij indicates how closely the input and output of the SVR-based sub-recognizer resemble each other: a value of sij closer to zero indicates a higher level of similarity, and a value closer to one indicates a lower one. Based on the similarity values, the class determiner makes the recognition decision by picking the SVR-based sub-recognizer which produces the smallest value of sij and giving the value of the index j as the output. The index j appearing at the output indicates that graffiti j is the most likely input graffiti.
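A minimal sketch of this recognizer structure is given below, using scikit-learn's SVR in place of the SVRs described in this section: each sub-recognizer holds ten scalar SVRs trained to reproduce the corresponding feature of its own class, and the class determiner picks the sub-recognizer with the smallest sij of (9.15). The kernel and parameter choices are placeholders, and the sub-recognizers are assumed to be ordered by class label 1 to 16.

```python
import numpy as np
from sklearn.svm import SVR

class SubRecognizer:
    """Ten scalar SVRs, each trained to reproduce one feature of one graffiti class."""

    def __init__(self, **svr_params):
        self.svrs = [SVR(kernel="rbf", epsilon=0.05, **svr_params) for _ in range(10)]

    def fit(self, X_class):
        # X_class: (n_samples, 10) normalized feature vectors of a single class.
        for k, svr in enumerate(self.svrs):
            svr.fit(X_class[:, k:k + 1], X_class[:, k])   # target output = input
        return self

    def reproduce(self, x):
        # x: (10,) normalized feature vector; returns the output vector y_ij.
        return np.array([svr.predict(np.array([[x[k]]]))[0]
                         for k, svr in enumerate(self.svrs)])

def classify(x, sub_recognizers):
    """Class determiner: return the class label (1..16) with the smallest s_ij."""
    scores = []
    for rec in sub_recognizers:
        y = rec.reproduce(x)
        scores.append(np.linalg.norm(x - y / np.linalg.norm(y)))   # cf. (9.15)
    return int(np.argmin(scores)) + 1
```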

9.3.3. Simulation results


The proposed SVR-based recognizer under different settings will be trained
and tested in this section and the recognition performance in terms of
recognition rate will be investigated. The training and testing data sets
for each graffiti consist of 100 and 50 patterns, respectively. For the
purposes of training, we consider four kernel functions, namely, linear,
spline, polynomial and radial basis functions, for the SVRs (see (9.9), (9.10),
(9.11) and (9.12) in section 9.1.5).
To perform the training, we consider the SVRs with different kernel
functions, c, σ, loss functions, ε and C as shown in Table 9.4. The
recognition performances of the SVR-based graffiti recognizers under
different settings are shown in the tables in the appendices at the end
of this chapter. The value of ε is chosen to be zero for quadratic loss
function. It is found experimentally that the recognition performance of
the SVR-based graffiti recognizers with ε-insensitive loss function subject
to ε = 0.01 or ε = 0.1 is not as good as that with ε = 0.05. Thus, the tables
(in the appendices) showing the recognition rate for ε = 0.01 and ε = 0.1
are omitted. Referring to the tables, the training and testing recognition
rates for each sub-recognizer j, j = 1, 2, · · · , 16, and the average training
and testing recognition rates of the SVR-based recognizer are illustrated.
The worst recognition rates given by the sub-recognizers and the best
average recognition rate are highlighted in bold under different settings
and parameters.

Table 9.4. Settings and parameters of SVRs.
Kernel functions   Linear (c = 0), spline, polynomial, radial basis kernel functions
c, σ               1, 2, 5, 10
Loss functions Quadratic (ε = 0), ε-insensitive loss functions
ε 0.01, 0.05, 0.1 (for ε-insensitive loss functions)
C 0.01, 0.1, 1, 10, 100, ∞

To summarize the results, the SVR-based graffiti recognizers with both
average training and testing recognition rates over 99% under different
settings and parameters are tabulated in Table 9.5. In total, we have
23 cases satisfying the criterion. Comparing the cases among the kernel
functions without parameter c or σ, namely, linear and spline kernel
functions, the SVR-based graffiti recognizer with spline kernel function
performs better in general in the sense of higher average recognition
rate. Comparing the cases among the kernel functions with parameter
c or σ, namely, polynomial and radial basis functions, the radial basis
function demonstrates a better potential to produce acceptable recognition
performance. Referring to Table 9.5, the radial basis function offers 13 cases
with average recognition rate over 99% while only 5 cases for the polynomial
function. It can be seen from the table that the recognition performance
offered by the radial basis function is less sensitive to the parameters c, σ
and C.
In general, the kernel functions with extra parameter c or σ are able to
offer a better recognition performance due to the extra parameter giving one
more degree of freedom for transformation of the feature space. It is noted
that the average training recognition rate is always higher than the average
Table 9.5. Recognition performance of SVR-based graffiti recognizers with the average
recognition rate over 99% for both training and testing data.
Case Kernel♭ Loss Function♯ c, σ C Average (%)† Worst (%)∗
1 Linear Quadratic 0 0.1 99.2500/99.0000 93(5)/96(1,11,12)
2 Linear ϵ-insensitive 0 0.01 99.0000/99.0000 92(5)/96(1,11)
3 Spline Quadratic – 0.1 99.2500/99.0000 93(5)/96(1,11,12)
4 Spline ϵ-insensitive – 0.1 99.5625/99.1250 95(5)/94(1)
5 Poly Quadratic 1 0.1 99.2500/99.0000 93(5)/96(1,11,12)
6 Poly Quadratic 2 0.1 99.3750/99.0000 94(5)/94(12)
7 Poly ϵ-insensitive 1 0.1 99.6250/99.0000 96(5)/94(1)
8 Poly ϵ-insensitive 2 0.01 99.3750/99.2500 94(5)/96(1,11)
9 Poly ϵ-insensitive 2 0.1 99.8125/99.0000 98(5)/94(8)
10 RB Quadratic 2 1 99.3125/99.0000 93(5)/96(1,11,12)
11 RB Quadratic 5 10 99.3750/99.0000 94(5)/94(1)
12 RB Quadratic 10 10 99.2500/99.0000 93(5)/96(1,11,12)
13 RB ϵ-insensitive 1 0.01 99.1875/99.0000 93(5)/96(1,11)
14 RB ϵ-insensitive 1 0.1 99.5625/99.1250 96(5)/96(1,11)
15 RB ϵ-insensitive 2 0.01 99.1250/99.0000 92(5)/96(1,11)
16 RB ϵ-insensitive 2 0.1 99.1875/99.0000 93(5)/96(1,11)
17 RB ϵ-insensitive 5 0.01 99.1250/99.0000 92(5)/96(1,11)
18 RB ϵ-insensitive 5 0.1 99.0625/99.0000 92(5)/96(1,11)
19 RB ϵ-insensitive 5 1 99.2500/99.1250 93(5)/96(1,11)
20 RB ϵ-insensitive 10 0.01 99.1250/99.0000 92(5)/96(1,11)
21 RB ϵ-insensitive 10 0.1 99.0625/99.0000 92(5)/96(1,11)
22 RB ϵ-insensitive 10 1 99.0625/99.0000 92(5)/96(1,11)
23 RB ϵ-insensitive 10 10 99.5000/99.0000 95(5)/94(1)
♭ Poly and RB stand for polynomial and radial basis functions, respectively.
♯ When ϵ-insensitive loss function is employed, ε = 0.05 is used.
† Average recognition rate in the format of A/B. A is for the training data and B is for the testing data.
∗ The worst recognition rate in the format of A(a)/B(b). A is for the training data and B is for the testing data. a and b denote the class labels of the graffiti.

testing recognition rate. The highest average training recognition rate is
99.8125% (case 9) while the highest testing recognition rate is 99.2500%
(case 8). Interestingly, it is found that both cases are for the SVR-based
graffiti recognizer with polynomial function. It is noted that the worst
training recognition rate is 92% for the graffiti with the class label 5 while
the worst testing recognition rate is 94% for the graffiti with the class
labels 1, 11 and 12. The graffiti labels 1, 5, 11 and 12 correspond to the digits 0, 3 and 8 (the digit 8 appearing in both writing directions). They have
a similar characteristic in the shapes leading to the feature vectors looking
like each other. Thus, it makes the recognition process more difficult and
degrades the recognition performance. The result gives a clue to improve
the recognition performance by employing a better transformation to obtain
the feature vectors such that they are far apart from each other in the
feature space.
Among the 23 cases, we now identify the one which offers the
best recognition performance. We define the overall recognition rate as the
average of the average training and testing recognition rates. Considering
the cases where the overall recognition rate is over 99.3000% or the average
testing recognition rate is equal to or greater than 99.1250%, we come up
with 6 cases, i.e. cases 4, 7, 8, 9, 14 and 19. By considering the worst training
and testing recognition rates, the SVR-based graffiti recognizer of case 14
offers the best performance. The worst training and testing recognition
rates are both 96% which is the highest compared to other chosen cases.
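As a small worked example of this selection rule, the snippet below computes the overall recognition rate (the mean of the average training and testing rates from Table 9.5) for the six shortlisted cases.

```python
# Average training / testing recognition rates (%) from Table 9.5.
cases = {
    4:  (99.5625, 99.1250),
    7:  (99.6250, 99.0000),
    8:  (99.3750, 99.2500),
    9:  (99.8125, 99.0000),
    14: (99.5625, 99.1250),
    19: (99.2500, 99.1250),
}
overall = {c: (train + test) / 2.0 for c, (train, test) in cases.items()}
for c, rate in sorted(overall.items(), key=lambda kv: -kv[1]):
    print(f"case {c:2d}: overall recognition rate = {rate:.5f}%")
```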
The number of support vectors corresponding to cases 4, 7, 8, 9, 14
and 19 is tabulated in Table 9.6. The number of support vectors for each
sub-recognizer j, j = 1, 2, · · · , 16, is the sum of the support vectors of
the 10 SVRs shown in Fig. 9.10. It can be seen that the sub-recognizer 5
corresponding to digit “3” produces the largest number of support vectors
while the sub-recognizer 14 corresponding to the command “backspace”
produces the smallest number for all cases. Of the cases 4, 7, 8, 9, 14 and 19,
case 8 produces the highest average number of support vectors, i.e. 251.9375
while case 9 produces the smallest average number of support vectors,
i.e. 143.0000. From the point of view of recognizer complexity, the
SVR-based graffiti recognizer for case 9 with the smallest average number
of support vectors is recommended.

9.4. Conclusion

This chapter has given a basic introduction to SVMs. Some properties
and theories have been presented to support the design of SVMs dealing
with some practical problems. Two applications have been considered in
this chapter, namely, classification of heart disease risk and recognition of
hand-written characters. On the application of the classification of heart
disease, a classifier has been developed using the support vector classifiers
to accurately estimate the risk of developing heart disease by using the
signal derived from a finger-based pulse oximeter. On the application of
recognition of hand-written characters, a recognizer has been developed
using the support vector regressors. Simulation results have been shown to
illustrate the effectiveness of the proposed approaches.
Table 9.6. Number of support vectors for cases 4, 7, 8, 9, 14 and 19.
Sub-Recognizer Case 4 Case 7 Case 8 Case 9 Case 14 Case 19
1 145 160 282 93 184 228
2 340 356 295 286 375 420
3 87 89 164 67 99 94
4 336 344 374 285 372 409
5 342 358 406 292 387 442
6 95 103 216 58 113 116
7 306 325 351 261 345 389
8 158 162 187 146 172 172
9 112 116 214 80 133 157
10 133 152 246 89 173 225
11 201 211 230 168 231 256
12 216 229 282 178 251 281
13 218 232 358 176 262 301
14 17 17 120 13 22 6
15 80 80 130 71 89 76
16 34 41 176 25 62 74
Average 176.2500 185.9375 251.9375 143.0000 204.3750 227.8750

Acknowledgements

The work described in this chapter was supported by the Division of
Engineering, King’s College London. The authors would like to thank
the patients of St. Thomas’ Hospital who were involved in this study for
allowing their data to be collected and analyzed.

References

[1] V. N. Vapnik, The Nature of Statistical Learning Theory. (Springer–Verlag,
Berlin, 2000).
[2] C. J. C. Burges, A tutorial on support vector machines for pattern
recognition, Data Mining and Knowledge Discovery. 2(2), 121–167, (1998).
[3] S. R. Gunn, Support vector machines for classification and regression, ISIS
Technical Report. 14, (1998).
[4] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector
Machines. (Cambridge University Press, Cambridge, 2000).
[5] A. J. Smola and B. Schölkopf, A tutorial on support vector regression,
Statistics and Computing. 14(3), 199–222, (2004).
[6] S. R. Alty, N. Angarita-Jaimes, S. C. Millasseau, and P. J. Chowienczyk,
Predicting arterial stiffness from the digital volume pulse waveform, IEEE
Transactions on Biomedical Engineering. 54(12), 2268–2275, (2007).
[7] K. M. Anderson, P. M. Odell, P. W. F. Wilson, and W. B. Kannel,
Cardiovascular disease risk profiles, American Heart Journal. 121(1),
293–298, (1991).
[8] S. M. Grundy, R. Pasternak, P. Greenland, S. Smith Jr, and V. Fuster,
AHA/ACC scientific statement: Assessment of cardiovascular risk by use
of multiple-risk-factor assessment equations: a statement for healthcare
professionals from the American Heart Association and the American
College of Cardiology, Journal of the American College of Cardiology. 34
(4), 1348–1359, (1999).
[9] J. Blacher, R. Asmar, S. Djane, G. M. London, and M. E. Safar, Aortic pulse
wave velocity as a marker of cardiovascular risk in hypertensive patients,
Hypertension. 33(5), 1111–1117, (1999).
[10] P. Boutouyrie, A. I. Tropeano, R. Asmar, I. Gautier, A. Benetos,
P. Lacolley, and S. Laurent, Aortic stiffness is an independent predictor
of primary coronary events in hypertensive patients: a longitudinal study,
Hypertension. 39(1), 10–15, (2002).
[11] P. J. Chowienczyk, R. P. Kelly, H. MacCallum, S. C. Millasseau,
T. L. G. Andersson, R. G. Gosling, J. Ritter, and E. Anggard,
Photoplethysmographic assessment of pulse wave reflection: blunted
response to endothelium-dependent beta2-adrenergic vasodilation in type
II diabetes mellitus, Journal of the American College of Cardiology. 34(7),
2007–2014, (1999).
[12] S. C. Millasseau, F. G. Guigui, R. P. Kelly, K. Prasad, J. R. Cockcroft, J. M.
Ritter, and P. J. Chowienczyk, Noninvasive assessment of the digital volume
pulse: comparison with the peripheral pressure pulse, Hypertension. 36(6),
952–956, (2000).
[13] S. C. Millasseau, R. P. Kelly, J. M. Ritter, and P. J. Chowienczyk,
Determination of age-related increases in large artery stiffness by digital
pulse contour analysis, Clinical Science. 103(4), 371–378, (2002).
[14] N. Angarita-Jaimes. Support Vector Machines for Improved Cardiovascular
Disease Risk Prediction. Master’s thesis, Department of Electronic
Engineering, Kings College London, (2005).
[15] N. Angarita-Jaimes, S. R. Alty, S. C. Millasseau, and P. J. Chowienczyk,
Classification of aortic stiffness from eigendecomposition of the digital
volume pulse waveform. 2, 2416–2419, (2006).
[16] J. B. Dillon and A. B. Hertzman, The form of the volume pulse in the finger
pad in health, arteriosclerosis and hypertension, American Heart Journal.
21, 172–190, (1941).
[17] J. Ma, Y. Zhao, and S. Ahalt, OSU SVM Classifier Matlab Toolbox (version
3.00). (Ohio State University, Columbus, USA, 2002).
[18] H. K. Lam and F. H. F. Leung, Digit and command interpretation
for electronic book using neural network and genetic algorithm, IEEE
Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics. 34
(6), 2273–2283, (2004).
[19] K. F. Leung, F. H. F. Leung, H. K. Lam, and S. H. Ling, On interpretation of
graffiti digits and characters for eBooks: Neural-fuzzy network and genetic
algorithm approach, IEEE Transactions on Industrial Electronics. 51(2),
464–471, (2004).
[20] H. K. Lam and J. Prada, Interpretation of handwritten single-stroke graffiti
using support vector machines, International Journal of Computational
Intelligence and Applications. 8(4), 369–393, (2009).
Appendix A. Grid search shown in three dimensions for SVM-based DVP classifier with Gaussian Radial Basis function kernel

Fig. A.1. Grid search for optimal hyperparameters during training, showing values of γ and ν versus overall classification rate.
Appendix B. Tables of recognition rate for SVR-based graffiti recognizer with linear kernel function

Table B.1. Training and testing results with linear kernel function, quadratic loss
function and various values of C.
Recognition rate (%) (Training/Testing phase)
Class No. C = 0.01 C = 0.1 C=1 C = 10 C = 100 C = Inf
1 100/96 100/96 100/96 100/88 98/78 98/76
2 100/100 100/100 100/100 99/94 98/70 98/64
3 100/100 100/100 100/100 98/96 93/90 92/88
4 97/100 99/100 99/100 99/100 96/100 96/100
5 93/100 93/100 95/100 98/100 100/100 100/100
6 100/98 100/98 100/98 98/94 94/86 93/84
7 98/100 98/100 100/100 98/100 90/100 86/96
8 100/98 100/98 100/94 100/86 99/76 99/76
9 100/100 100/100 100/100 100/92 94/86 94/82
10 99/100 99/100 99/100 97/80 75/36 55/34
11 99/96 99/96 99/98 99/100 97/98 97/98
12 100/96 100/96 100/96 100/96 100/96 100/96
13 100/98 100/100 100/100 100/100 100/100 100/100
14 100/100 100/100 100/100 100/100 97/66 82/40
15 100/100 100/100 100/100 100/100 99/96 96/96
16 100/100 100/100 100/100 100/92 100/78 100/76
Average 99.1250/ 99.2500/ 99.5000/ 99.1250/ 95.6250/ 92.8750/
98.8750 99.0000 98.8750 94.8750 84.7500 81.6250
Table B.2. Training and testing results with linear kernel function, ϵ-insensitive loss
function (ϵ = 0.05) and various values of C.
Recognition rate (%) (Training/Testing phase)
Class No. C = 0.01 C = 0.1 C=1 C = 10 C = 100 C = Inf
1 100/96 100/90 97/70 100/98 100/98 100/98
2 100/100 100/100 100/100 100/100 100/100 100/100
3 100/100 100/100 31/22 100/100 100/100 100/100
4 97/100 99/100 97/100 100/100 100/100 100/100
5 92/100 96/100 97/98 98/100 98/100 98/100
6 99/98 86/96 100/98 100/98 100/98 100/98
7 98/100 99/100 69/98 91/94 91/94 91/94
8 100/98 100/98 98/70 99/74 99/74 99/74
9 100/100 100/100 97/100 100/98 100/98 100/98
10 99/100 99/100 98/96 100/100 100/100 100/100
11 99/96 99/98 91/80 99/98 99/98 99/98
12 100/98 99/84 100/96 100/96 100/96 100/96
13 100/98 100/100 99/88 100/100 100/100 100/100
14 100/100 100/100 100/100 100/100 100/100 100/100
15 100/100 100/100 100/100 100/100 100/100 100/100
16 100/100 100/100 100/98 100/94 100/94 100/94
Average 99.0000/ 98.5625/ 92.1250/ 99.1875/ 99.1875/ 99.1875/
99.0000 97.8750 88.3750 96.8750 96.8750 96.8750
Appendix C. Tables of recognition rate for SVR-based graffiti recognizer with spline kernel function

Table C.1. Training and testing results with spline kernel function, quadratic loss
function and various values of C.
Recognition rate (%) (Training/Testing phase)
Class No. C = 0.01 C = 0.1 C=1 C = 10 C = 100 C = Inf
1 100/96 100/96 100/96 100/94 100/92 96/96
2 100/100 100/100 100/100 98/96 100/100 100/100
3 100/100 100/100 100/100 99/98 100/100 100/100
4 98/100 99/100 99/100 99/100 100/100 100/90
5 93/100 93/100 96/100 98/100 97/100 66/46
6 100/98 100/98 100/98 100/94 97/94 4/20
7 98/100 98/100 99/100 99/100 99/98 67/98
8 100/98 100/98 100/94 100/86 99/74 100/74
9 100/100 100/100 100/100 100/96 97/92 100/84
10 98/100 99/100 99/100 98/84 95/68 99/98
11 99/96 99/96 99/98 99/100 99/98 100/96
12 100/96 100/96 100/94 100/96 100/96 99/94
13 100/98 100/100 100/100 100/100 100/100 100/100
14 100/100 100/100 100/100 100/100 100/100 100/100
15 100/100 100/100 100/100 100/100 100/100 100/100
16 100/100 100/100 100/100 100/94 100/98 100/78
Average 99.1250/ 99.2500/ 99.5000/ 99.3750/ 98.9375/ 89.4375/
98.8750 99.0000 98.7500 96.1250 94.3750 85.8750
Table C.2. Training and testing results with spline kernel function, ϵ-insensitive loss
function (ϵ = 0.05) and various values of C.
Recognition rate (%) (Training/Testing phase)
Class No. C = 0.01 C = 0.1 C=1 C = 10 C = 100 C = Inf
1 100/96 100/94 100/100 100/100 100/100 100/100
2 100/100 100/100 100/100 100/100 100/100 100/100
3 100/100 100/100 100/100 100/100 100/100 100/100
4 98/100 99/100 100/100 100/100 100/100 100/100
5 93/100 95/100 97/100 97/100 97/100 97/100
6 100/98 100/100 100/98 100/98 100/98 100/98
7 98/100 100/100 100/100 96/98 96/98 96/98
8 100/98 100/98 100/88 100/80 100/80 100/80
9 100/100 100/100 100/98 100/96 100/96 100/96
10 99/100 100/100 100/100 100/100 100/100 100/100
11 99/96 99/98 99/100 99/100 99/100 99/100
12 100/96 100/96 100/96 100/96 100/96 100/96
13 100/98 100/100 100/100 100/100 100/100 100/100
14 100/100 100/100 100/100 100/100 100/100 100/100
15 100/100 100/100 100/100 100/100 100/100 100/100
16 100/100 100/100 100/98 100/94 100/94 100/94
Average 99.1875/ 99.5625/ 99.7500/ 99.5000/ 99.5000/ 99.5000/
98.8750 99.1250 98.6250 97.6250 97.6250 97.6250
Appendix D. Tables of recognition rate for SVR-based graffiti recognizer with polynomial kernel function

Table D.1. Training and testing results with polynomial function (c = 1), quadratic
loss function and various values of C.
Recognition rate (%) (Training/Testing phase)
Class No. C = 0.01 C = 0.1 C=1 C = 10 C = 100 C = Inf
1 100/96 100/96 100/96 100/90 98/78 98/76
2 100/100 100/100 100/100 98/90 98/70 98/64
3 100/100 100/100 100/100 98/96 93/90 92/88
4 98/100 99/100 99/100 99/100 96/100 96/100
5 93/100 93/100 95/100 98/100 100/100 100/100
6 100/98 100/98 100/98 99/94 94/86 93/84
7 98/100 98/100 99/100 98/100 90/100 86/96
8 100/98 100/98 100/94 100/86 99/76 99/76
9 100/100 100/100 100/100 100/94 94/86 94/82
10 98/100 99/100 99/100 97/80 75/36 55/34
11 99/96 99/96 99/98 99/100 97/98 97/98
12 100/96 100/96 100/94 100/96 100/96 100/96
13 100/98 100/100 100/100 100/100 100/100 100/100
14 100/100 100/100 100/100 100/100 97/66 82/40
15 100/100 100/100 100/100 100/100 99/96 96/96
16 100/100 100/100 100/100 100/92 100/78 100/76
Average 99.1250/ 99.2500/ 99.4375/ 99.1250/ 95.6250/ 92.8750/
98.8750 99.0000 98.7500 94.8750 84.7500 81.6250
Table D.2. Training and testing results with polynomial function (c = 2), quadratic
loss function and various values of C.
Recognition rate (%) (Training/Testing phase)
Class No. C = 0.01 C = 0.1 C=1 C = 10 C = 100 C = Inf
1 100/98 100/96 100/96 100/94 100/86 99/68
2 100/100 100/100 99/96 98/96 100/100 100/96
3 99/100 100/100 100/100 98/96 100/100 100/100
4 98/100 99/100 100/100 99/100 99/100 99/100
5 94/100 94/100 96/100 100/100 98/100 92/96
6 100/98 100/100 100/98 97/94 97/92 94/90
7 99/100 99/100 99/100 98/100 98/96 97/96
8 100/98 100/98 100/94 100/86 99/76 76/20
9 100/100 100/100 100/100 99/94 97/92 95/86
10 98/100 99/100 98/100 96/72 96/72 93/64
11 99/96 99/96 99/98 99/100 99/98 98/96
12 100/94 100/94 100/92 100/96 100/96 100/96
13 100/98 100/100 100/100 100/100 100/100 100/96
14 100/100 100/100 100/100 100/100 100/100 100/88
15 100/100 100/100 100/100 100/100 100/100 100/100
16 100/100 100/100 100/100 100/92 100/98 100/84
Average 99.1875/ 99.3750/ 99.4375/ 99.0000/ 98.9375/ 96.4375/
98.8750 99.0000 98.3750 95.0000 94.1250 86.0000

Table D.3. Training and testing results with polynomial function (c = 5), quadratic
loss function and various values of C.
Recognition rate (%) (Training/Testing phase)
Class No. C = 0.01 C = 0.1 C=1 C = 10 C = 100 C = Inf
1 100/92 100/92 100/98 100/92 100/94 100/68
2 97/98 98/86 98/88 99/100 100/100 100/100
3 96/100 99/100 99/100 100/100 100/100 100/100
4 98/100 99/100 97/100 100/100 100/100 99/80
5 91/100 96/100 97/100 99/100 99/100 73/68
6 100/100 100/100 100/96 100/94 96/94 97/92
7 96/88 98/92 98/98 98/96 99/98 90/72
8 100/98 100/96 100/90 100/82 94/60 15/8
9 100/100 100/100 99/98 98/94 97/90 93/92
10 98/98 98/100 98/90 98/76 99/78 86/86
11 99/90 99/90 98/94 99/100 99/98 100/94
12 99/90 99/90 100/82 100/94 100/96 100/96
13 100/98 100/98 100/100 100/100 100/100 100/98
14 100/100 100/100 100/100 100/100 100/100 100/100
15 99/100 99/100 99/100 100/100 100/100 100/100
16 100/100 100/100 100/94 100/98 100/96 100/84
Average 98.3125/ 99.0625/ 98.9375/ 99.4375/ 98.9375/ 90.8125/
97.0000 96.5000 95.5000 95.3750 94.0000 83.6250
Table D.4. Training and testing results with polynomial function (c = 10), quadratic
loss function and various values of C.
Recognition rate (%) (Training/Testing phase)
Class No. C = 0.01 C = 0.1 C=1 C = 10 C = 100 C = Inf
1 99/78 99/68 100/76 100/96 100/96 100/80
2 98/82 98/70 98/78 100/94 100/100 100/100
3 92/98 98/100 100/100 100/100 100/100 100/100
4 96/100 98/100 100/100 100/100 100/100 100/88
5 94/100 97/100 92/96 98/100 96/100 98/84
6 100/98 100/96 100/96 100/94 96/92 98/94
7 99/96 96/84 98/96 100/98 98/98 76/64
8 100/98 100/92 100/90 98/52 98/78 76/50
9 100/98 99/98 100/98 96/84 99/96 94/96
10 98/98 98/100 98/86 96/80 100/88 84/84
11 99/98 99/98 99/98 97/88 98/98 100/94
12 100/100 98/82 100/76 100/94 100/94 100/96
13 100/98 95/96 100/100 100/100 99/100 100/100
14 100/100 100/100 100/100 100/100 100/100 100/100
15 99/100 99/100 99/100 100/100 100/100 100/100
16 100/82 100/96 100/100 100/96 100/98 100/80
Average 98.3750/ 98.3750/ 99.0000/ 99.0625/ 99.0000/ 95.3750/
95.2500 92.5000 93.1250 92.2500 96.1250 88.1250

Table D.5. Training and testing results with polynomial function (c = 1),
ϵ-insensitive loss function (ϵ = 0.05) and various values of C.
Recognition rate (%) (Training/Testing phase)
Class No. C = 0.01 C = 0.1 C=1 C = 10 C = 100 C = Inf
1 100/96 100/94 100/96 100/98 100/98 100/98
2 100/100 100/100 100/100 100/100 100/100 100/100
3 100/100 100/100 100/100 100/100 100/100 100/100
4 98/100 99/100 100/100 100/100 100/100 100/100
5 93/100 96/100 98/100 97/100 97/100 97/100
6 100/98 100/100 100/98 100/98 100/98 100/98
7 98/100 100/100 100/100 94/98 94/98 94/98
8 100/98 100/98 100/88 99/80 99/80 99/80
9 100/100 100/100 100/98 100/96 100/96 100/96
10 99/100 100/100 100/100 100/100 100/100 100/100
11 99/96 99/96 99/98 99/100 99/100 99/100
12 100/96 100/96 100/96 100/96 100/96 100/96
13 100/98 100/100 100/100 100/100 100/100 100/100
14 100/100 100/100 100/100 100/100 100/100 100/100
15 100/100 100/100 100/100 100/100 100/100 100/100
16 100/100 100/100 100/98 100/94 100/94 100/94
Average 99.1875/ 99.6250/ 99.8125/ 99.3125/ 99.3125/ 99.3125/
98.8750 99.0000 98.2500 97.5000 97.5000 97.5000
Table D.6. Training and testing results with polynomial function (c = 2),
ϵ-insensitive loss function (ϵ = 0.05) and various values of C.
Recognition rate (%) (Training/Testing phase)
Class No. C = 0.01 C = 0.1 C=1 C = 10 C = 100 C = Inf
1 100/96 100/96 100/100 100/100 100/100 100/100
2 100/100 100/100 100/100 100/100 100/100 100/100
3 100/100 100/100 100/100 100/100 100/100 100/100
4 99/100 100/100 100/100 100/100 100/100 100/100
5 94/100 98/100 98/100 97/100 97/100 97/100
6 100/100 100/100 100/98 100/98 100/98 100/98
7 99/100 100/100 96/100 93/98 93/98 93/98
8 100/98 100/94 100/82 99/74 99/74 99/74
9 100/100 100/100 100/94 100/94 100/94 100/94
10 99/100 100/100 100/100 100/100 100/100 100/100
11 99/96 99/98 99/100 99/100 99/100 99/100
12 100/98 100/96 100/98 100/98 100/98 100/98
13 100/100 100/100 100/100 100/100 100/100 100/100
14 100/100 100/100 100/100 100/100 100/100 100/100
15 100/100 100/100 100/100 100/100 100/100 100/100
16 100/100 100/100 100/98 100/98 100/98 100/98
Average 99.3750/ 99.8125/ 99.5625/ 99.2500/ 99.2500/ 99.2500/
99.2500 99.0000 98.1250 97.5000 97.5000 97.5000

Table D.7. Training and testing results with polynomial function (c = 5),
ϵ-insensitive loss function (ϵ = 0.05) and various values of C.
Recognition rate (%) (Training/Testing phase)
Class No. C = 0.01 C = 0.1 C=1 C = 10 C = 100 C = Inf
1 100/92 100/96 100/96 100/96 100/96 100/96
2 98/88 100/90 100/88 100/88 100/88 100/88
3 99/100 100/100 100/100 100/100 100/100 100/100
4 98/100 100/100 100/100 100/100 100/100 100/100
5 96/100 97/100 99/100 99/100 99/100 99/100
6 100/100 100/98 100/98 100/98 100/98 100/98
7 98/92 98/94 98/98 98/98 98/98 98/98
8 100/96 100/90 100/76 100/76 100/76 100/76
9 100/100 100/96 98/88 98/88 98/88 98/88
10 98/100 100/98 100/94 100/94 100/94 100/94
11 99/90 96/78 97/90 97/90 97/90 97/90
12 99/90 100/90 100/96 100/96 100/96 100/96
13 100/98 100/100 100/100 100/100 100/100 100/100
14 100/100 100/100 100/100 100/100 100/100 100/100
15 99/100 99/100 100/100 100/100 100/100 100/100
16 100/100 100/96 100/98 100/98 100/98 100/98
Average 99.0000/ 99.3750/ 99.5000/ 99.5000/ 99.5000/ 99.5000/
96.6250 95.3750 95.1250 95.1250 95.1250 95.1250
Table D.8. Training and testing results with polynomial function (c = 10),
ϵ-insensitive loss function (ϵ = 0.05) and various values of C.
Recognition rate (%) (Training/Testing phase)
Class No. C = 0.01 C = 0.1 C=1 C = 10 C = 100 C = Inf
1 100/74 97/66 98/76 98/76 98/76 98/76
2 99/66 99/72 94/60 93/60 94/60 93/60
3 95/98 98/98 98/98 98/98 98/98 98/98
4 98/100 98/100 99/100 99/100 99/100 99/100
5 98/100 100/100 99/98 99/98 99/98 99/98
6 100/96 100/96 100/96 100/96 100/96 100/96
7 97/90 97/96 97/98 97/98 97/98 97/98
8 94/94 100/90 100/82 100/82 100/82 100/82
9 99/96 96/88 92/80 92/80 92/80 92/80
10 98/98 99/84 99/84 99/84 99/84 99/84
11 99/98 99/100 99/100 99/100 99/100 99/100
12 99/90 100/98 100/100 100/100 100/100 100/100
13 93/90 96/98 96/96 96/96 96/96 96/96
14 99/86 100/94 100/96 100/96 100/96 100/96
15 99/100 100/100 100/98 100/98 100/98 100/98
16 98/72 100/92 99/86 99/86 99/86 99/86
Average 97.8125/ 98.6875/ 98.1250/ 98.0625/ 98.1250/ 98.0625/
90.5000 92.0000 90.5000 90.5000 90.5000 90.5000
Appendix E. Tables of recognition rate for SVR-based graffiti recognizer with radial basis kernel function

Table E.1. Training and testing results with radial basis kernel function (σ = 1),
quadratic loss function and various values of C.
Recognition rate (%) (Training/Testing phase)
Class No. C = 0.01 C = 0.1 C=1 C = 10 C = 100 C = Inf
1 100/96 100/96 100/96 100/90 99/78 100/90
2 100/100 100/100 100/100 98/90 98/68 100/100
3 100/100 100/100 100/100 98/96 95/90 100/100
4 97/100 99/100 99/100 99/100 100/100 100/100
5 93/100 93/100 95/100 99/100 100/100 99/100
6 100/98 100/98 100/98 99/94 96/96 100/98
7 98/100 98/100 99/100 98/100 90/90 100/98
8 100/98 100/98 100/94 100/86 99/76 100/80
9 100/100 100/100 100/100 100/92 95/90 100/96
10 99/100 99/100 99/100 97/78 82/60 100/88
11 99/96 99/96 99/98 99/100 97/92 100/100
12 100/96 100/96 100/96 100/96 100/92 100/92
13 100/98 100/98 100/100 100/100 100/98 100/100
14 100/100 100/100 100/100 100/100 100/100 100/100
15 100/100 100/100 100/100 100/100 99/100 100/100
16 100/100 100/100 100/100 100/94 100/96 100/100
Average 99.1250/ 99.2500/ 99.4375/ 99.1875/ 96.8750/ 99.9375/
98.8750 98.8750 98.8750 94.7500 89.1250 96.3750
Table E.2. Training and testing results with radial basis kernel function (σ = 2),
quadratic loss function and various values of C.
Recognition rate (%) (Training/Testing phase)
Class No. C = 0.01 C = 0.1 C=1 C = 10 C = 100 C = Inf
1 100/96 100/96 100/96 100/92 100/84 100/100
2 100/100 100/100 100/100 100/100 98/80 100/94
3 100/100 100/100 100/100 100/100 95/94 97/92
4 97/100 98/100 99/100 100/100 98/100 100/100
5 93/100 93/100 93/100 97/100 100/100 97/100
6 100/98 100/98 100/98 100/98 97/92 99/96
7 98/100 98/100 99/100 100/100 95/100 81/72
8 100/98 100/98 100/98 100/90 99/76 98/72
9 100/100 100/100 100/100 100/100 98/88 99/88
10 99/100 99/100 99/100 99/100 92/64 79/82
11 99/96 99/96 99/96 99/98 99/100 100/98
12 100/96 100/96 100/96 100/96 100/96 100/98
13 100/98 100/98 100/100 100/100 100/100 99/100
14 100/100 100/100 100/100 100/100 100/96 100/98
15 100/100 100/100 100/100 100/100 99/100 100/100
16 100/100 100/100 100/100 100/98 100/88 100/88
Average 99.1250/ 99.1875/ 99.3125/ 99.6875/ 98.1250/ 96.8125/
98.8750 98.8750 99.0000 98.2500 91.1250 92.3750

Table E.3. Training and testing results with radial basis kernel function (σ = 5),
quadratic loss function and various values of C.
Recognition rate (%) (Training/Testing phase)
Class No. C = 0.01 C = 0.1 C=1 C = 10 C = 100 C = Inf
1 100/96 100/96 100/96 100/94 100/92 100/90
2 100/100 100/100 100/100 100/100 100/98 100/100
3 100/100 100/100 100/100 100/100 100/100 100/100
4 97/100 97/100 98/100 99/100 99/100 100/100
5 93/100 93/100 93/100 94/100 98/100 98/100
6 100/98 100/98 100/98 100/98 100/94 100/96
7 98/100 98/100 98/100 99/100 100/100 99/94
8 100/98 100/98 100/98 100/98 100/90 100/74
9 100/100 100/100 100/100 100/100 100/96 100/96
10 99/100 99/100 99/100 99/100 98/100 99/98
11 99/96 99/96 99/96 99/96 99/98 99/96
12 100/96 100/96 100/96 100/98 100/96 100/98
13 100/98 100/98 100/98 100/100 100/100 100/100
14 100/100 100/100 100/100 100/100 100/100 100/100
15 100/100 100/100 100/100 100/100 100/100 100/100
16 100/100 100/100 100/100 100/100 100/94 100/100
Average 99.1250/ 99.1250/ 99.1875/ 99.3750/ 99.6250/ 99.6875/
98.8750 98.8750 98.8750 99.0000 97.3750 96.3750
Table E.4. Training and testing results with radial basis kernel function (σ = 10),
quadratic loss function and various values of C.
Recognition rate (%) (Training/Testing phase)
Class No. C = 0.01 C = 0.1 C=1 C = 10 C = 100 C = Inf
1 100/96 100/96 100/96 100/96 100/96 100/96
2 100/100 100/100 100/100 100/100 100/100 100/100
3 100/100 100/100 100/100 100/100 100/100 100/100
4 97/100 97/100 97/100 99/100 99/100 99/100
5 93/100 93/100 93/100 93/100 95/100 97/100
6 100/98 100/98 100/98 100/98 100/98 100/96
7 98/100 98/100 98/100 98/100 100/100 98/98
8 100/98 100/98 100/98 100/98 100/94 100/90
9 100/100 100/100 100/100 100/100 100/100 100/100
10 99/100 99/100 99/100 99/100 99/100 98/98
11 99/96 99/96 99/96 99/96 99/98 99/98
12 100/96 100/96 100/96 100/96 100/96 100/94
13 100/98 100/98 100/98 100/100 100/100 100/100
14 100/100 100/100 100/100 100/100 100/100 100/100
15 100/100 100/100 100/100 100/100 100/100 100/100
16 100/100 100/100 100/100 100/100 100/100 100/98
Average 99.1250/98.8750 99.1250/98.8750 99.1250/98.8750 99.2500/99.0000 99.5000/98.8750 99.4375/98.0000

Table E.5. Training and testing results with radial basis kernel function (σ = 1),
ϵ-insensitive loss function (ϵ = 0.05) and various values of C.
Recognition rate (%) (Training/Testing phase)
Class No. C = 0.01 C = 0.1 C=1 C = 10 C = 100 C = Inf
1 100/96 100/96 100/94 100/98 100/98 100/98
2 100/100 100/100 100/100 100/100 100/100 100/100
3 100/100 99/100 100/100 100/100 100/100 100/100
4 98/100 99/100 100/100 100/100 100/100 100/100
5 93/100 96/100 98/100 97/100 97/100 97/100
6 100/98 100/98 100/98 100/98 100/98 100/98
7 98/100 100/100 100/100 94/98 94/98 94/98
8 100/98 100/98 100/86 99/74 99/74 99/74
9 100/100 100/100 100/98 100/96 100/96 100/96
10 99/100 100/100 100/100 100/100 100/100 100/100
11 99/96 99/96 99/98 99/100 99/100 99/100
12 100/98 100/98 100/98 100/96 100/96 100/96
13 100/98 100/100 100/100 100/100 100/100 100/100
14 100/100 100/100 100/100 100/100 100/100 100/100
15 100/100 100/100 100/100 100/100 100/100 100/100
16 100/100 100/100 100/98 100/94 100/94 100/94
Average 99.1875/99.0000 99.5625/99.1250 99.8125/98.1250 99.3125/97.1250 99.3125/97.1250 99.3125/97.1250

Table E.6. Training and testing results with radial basis kernel function (σ = 2),
ϵ-insensitive loss function (ϵ = 0.05) and various values of C.
Recognition rate (%) (Training/Testing phase)
Class No. C = 0.01 C = 0.1 C=1 C = 10 C = 100 C = Inf
1 100/96 100/96 100/96 100/98 100/98 100/98
2 100/100 100/100 100/100 100/100 100/100 100/100
3 100/100 100/100 99/100 100/100 100/100 100/100
4 98/100 98/100 100/100 100/100 100/100 100/100
5 92/100 93/100 98/100 97/100 97/100 97/100
6 100/98 100/98 100/98 100/98 100/98 100/98
7 98/100 98/100 100/100 94/98 94/98 94/98
8 100/98 100/98 100/94 99/74 99/74 99/74
9 100/100 100/100 100/100 100/96 100/96 100/96
10 99/100 99/100 100/100 100/100 100/100 100/100
11 99/96 99/96 99/98 99/100 99/100 99/100
12 100/98 100/98 100/96 100/96 100/96 100/96
13 100/98 100/98 100/100 100/100 100/100 100/100
14 100/100 100/100 100/100 100/100 100/100 100/100
15 100/100 100/100 100/100 100/100 100/100 100/100
16 100/100 100/100 100/98 100/94 100/94 100/94
Average 99.1250/99.0000 99.1875/99.0000 99.7500/98.7500 99.3125/97.1250 99.3125/97.1250 99.3125/97.1250

Table E.7. Training and testing results with radial basis kernel function (σ = 5),
ϵ-insensitive loss function (ϵ = 0.05) and various values of C.
Recognition rate (%) (Training/Testing phase)
Class No. C = 0.01 C = 0.1 C=1 C = 10 C = 100 C = Inf
1 100/96 100/96 100/96 100/96 100/98 100/98
2 100/100 100/100 100/100 100/100 100/100 100/100
3 100/100 100/100 100/100 100/100 100/100 100/100
4 98/100 98/100 99/100 100/100 100/100 100/100
5 92/100 92/100 93/100 98/100 97/100 97/100
6 100/98 99/98 100/98 100/98 100/98 100/98
7 98/100 98/100 98/100 100/100 94/98 94/98
8 100/98 100/98 100/98 100/92 99/74 99/74
9 100/100 100/100 100/100 100/100 100/96 100/96
10 99/100 99/100 99/100 100/100 100/100 100/100
11 99/96 99/96 99/96 99/98 99/100 99/100
12 100/98 100/98 100/98 100/96 100/96 100/96
13 100/98 100/98 100/100 100/100 100/100 100/100
14 100/100 100/100 100/100 100/100 100/100 100/100
15 100/100 100/100 100/100 100/100 100/100 100/100
16 100/100 100/100 100/100 100/98 100/94 100/94
Average 99.1250/99.0000 99.0625/99.0000 99.2500/99.1250 99.8125/98.6250 99.3125/97.1250 99.3125/97.1250

Table E.8. Training and testing results with radial basis kernel function (σ = 10),
ϵ-insensitive loss function (ϵ = 0.05) and various values of C.
Recognition rate (%) (Training/Testing phase)
Class No. C = 0.01 C = 0.1 C=1 C = 10 C = 100 C = Inf
1 100/96 100/96 100/96 100/94 100/94 100/98
2 100/100 100/100 100/100 100/100 100/100 100/100
3 100/100 100/100 100/100 99/100 100/100 100/100
4 98/100 98/100 98/100 99/100 100/100 100/100
5 92/100 92/100 92/100 95/100 98/100 97/100
6 100/98 99/98 99/98 100/98 100/98 100/98
7 98/100 98/100 98/100 100/100 99/100 94/98
8 100/98 100/98 100/98 100/98 100/86 99/74
9 100/100 100/100 100/100 100/100 100/98 100/96
10 99/100 99/100 99/100 100/100 100/100 100/100
11 99/96 99/96 99/96 99/96 99/98 99/100
12 100/98 100/98 100/98 100/98 100/96 100/96
13 100/98 100/98 100/98 100/100 100/100 100/100
14 100/100 100/100 100/100 100/100 100/100 100/100
15 100/100 100/100 100/100 100/100 100/100 100/100
16 100/100 100/100 100/100 100/100 100/98 100/94
Average 99.1250/99.0000 99.0625/99.0000 99.0625/99.0000 99.5000/99.0000 99.7500/98.0000 99.3125/97.1250

Chapter 10

Nonlinear Modeling Using Support Vector Machine for


Heart Rate Response to Exercise

∗Weidong Chen, †‡Steven W. Su, †Yi Zhang, §Ying Guo, †Nghir Nguyen,
#‡Branko G. Celler and †Hung T. Nguyen
Faculty of Engineering and Information Technology,
University of Technology,
Sydney, Australia
[email protected]

In order to accurately regulate cardiovascular response to exercise for


an individual exerciser, this study proposed a control oriented modeling
approach to depict nonlinear behavior of heart rate response at both
the onset and offset of treadmill exercise. With the aim of capturing
nonlinear dynamic behaviors, a well designed exercise protocol has
been applied for a healthy male subject. Non-invasively measured
variables, such as ECG, body movements and oxygen saturation (SpO2),
have been reliably monitored and recorded. Based on several sets of
experimental data, both steady state gain and time constant of heart
rate response are identified. The nonlinear models relating to steady
state gain and time constant (vs. walking speed) were built up based
on support vector machine regression (SVR). The nonlinear behaviors
at both onset and offset of exercise have been well described by using
the established SVR models. The model provides the fundamentals for
the optimization of exercise efforts by using model-based optimal control
approaches, which is the following step of this study.

∗ Department of Automation, Shanghai Jiao Tong University, Shanghai, China.
† Centre for Health Technologies, Faculty of Engineering and Information Technology, University of Technology, Sydney, Australia.
‡ Human Performance Group, Biomedical Systems Lab, School of Electrical Engineering and Telecommunications, University of New South Wales, UNSW Sydney NSW 2052, Australia.
# College of Health and Science, University of Western Sydney, UWS Penrith NSW 2751, Australia.
§ Autonomous Systems Lab, CSIRO ICT Center, Sydney, Australia.


Contents

10.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256


10.2 SVM Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
10.3 Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262
10.4 Data Analysis and Discussions . . . . . . . . . . . . . . . . . . . . . . . . . . 264
10.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269

10.1. Introduction

Heart rate, as we know, has been extensively applied to evaluate


cardio-respiratory response [1–5] to exercise as it can be easily measured
by cheap wireless portable non-invasive sensors. In this sense, therefore,
developing nonlinear models with respect to the analysis of dynamic
characteristics of heart rate response is the starting point of this project.
Based on the model, a nonlinear switching model predictive control
approach will be developed to optimize exercise effects while keeping the
level of exercise intensity within the safe range [6].
There are plenty of papers [7, 8] about the analysis of steady state
characteristics of heart rate response. Nonlinear behavior has been detected
and nonlinear models have been established for analyzing the response once it
enters steady state. During medical diagnosis and analysis
of cardio-respiratory kinetics, however, the transient response of heart rate
is more valuable as it contains indicators of current disease or warnings
about impending cardiac diseases [9]. Although both linear and nonlinear
modeling approaches [10, 11] have been applied to explore the dynamic
characteristics of heart response to exercise, few papers focus on the
variation of dynamic characteristics under different exercise intensities for
both the onset and offset of exercise. For moderate exercise, the literature often
assumes that heart rate dynamics can be described by linear time invariant
models. In this study, it was observed that the time constant of heart
response to exercise not only depends on exercise intensity but also varies
at the onset and offset of exercise. Papers [12–14] prove that exercise
effects can be optimized by regulating heart rate following a pre-defined
exercise protocol. It is well known that higher control performance can be
obtained if the model contains less uncertainty. Therefore, it is worthwhile
to establish a more accurate dynamical model to enhance the controller
design of heart rate regulation. In this study, we designed a treadmill

walking exercise protocol to analyze step response of heart rate. During


experiments, ECG, body movement and oxygen saturation were recorded
by using portable non-invasive sensors: Alive ECG monitor, Micro Inertial
Measurement Unit (IMU) and Alive Pulse Oximeter. It was confirmed that
time constants are not invariant especially when walking speed is faster
than three miles/hour. Time constants for offset of exercises are normally
bigger than those of onset of exercises. Steady state gain variation under
different exercise intensity has also been visibly observed. Furthermore, the
experiment results indicate that it is difficult to describe the variation of
the transient parameters (such as time constant) by using a simple linear
model. We applied the novel machine learning method, support vector
machine (SVM), to depict the nonlinear relationship.
Support vector machine-based regression [15] (Support Vector
Regression (SVR)) has been successfully applied to nonlinear function
estimation. Vapnik et al. [2, 16, 17] established the foundation of SVM. The
formulation of SVM [18, 19] embodies the structure of the risk minimization
principle, which has been shown to be superior to other traditional empirical
risk minimization principles [20]. Support vector machine-based regression
applies the kernel methods implicitly to transform data into a feature space
(this is known as a kernel trick [21]), and uses linear regression to get
a nonlinear function approximation in the feature space. By using RBF
kernel , this study efficiently established the nonlinear relationship between
time constant and exercise intensity for both onset and offset of exercises.
This chapter is organized as follows. Section 10.2 provides
preliminary knowledge of SVM-based regression. Section 10.3 describes
the experimental equipment and exercise protocol. Data analysis and
modeling results are given in Section 10.4. Lastly, Section 10.5 gives
conclusions.

10.2. SVM Regression

Let {ui, yi}_{i=1}^{N} be a set of input–output data points (ui ∈ U ⊆
R^d, yi ∈ Y ⊆ R, N is the number of points). The goal of support vector
regression is to find a function f(u) which has the following form
f (u) = w · ϕ(u) + b, (10.1)
where ϕ(u) maps u nonlinearly into a high-dimensional feature space. The
weight vector w and bias b define the hyperplane ⟨w · ϕ(u)⟩ + b = 0. The
hyperplane is estimated by minimizing the regularized risk function:

      (1/2)‖w‖² + C · (1/N) Σ_{i=1}^{N} Lε(yi, f(ui)).                      (10.2)

Fig. 10.1. Experiment protocol.

Fig. 10.2. Accelerations of three axes provided by the Micro IMU.

Fig. 10.3. Roll, pitch and yaw angles provided by the Micro IMU.

The first term is called the regularization term. The second term is the
empirical error, measured by the ε-insensitivity loss function, which is defined
as:

      Lε(yi, f(ui)) = |yi − f(ui)| − ε,   if |yi − f(ui)| > ε;
                    = 0,                  if |yi − f(ui)| ≤ ε.              (10.3)

This defines an ε tube. The radius ε of the tube and the regularization
constant C are both determined by the user.
The selection of parameter C depends on application knowledge of the
domain. Theoretically, a small value of C will under-fit the training data
because the weight placed on the training data is too small, thus resulting
in large values of MSE (mean square error) on the test sets. However,
when C is too large, SVR will over-fit the training set so that (1/2)‖w‖² will
lose its meaning and the objective reduces to minimizing the empirical risk
only. Parameter ε controls the width of the ε-insensitive zone. Generally,
the larger the ε the fewer number of support vectors and thus the sparser
the representation of the solution. However, if the ε is too large, it can
deteriorate the accuracy on the training data.
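To make the roles of C and ε concrete, the minimal sketch below fits an off-the-shelf ε-insensitive SVR with an RBF kernel to synthetic data and reports the number of support vectors and the training error for a few settings. The toy data, the parameter grid and the use of scikit-learn are illustrative assumptions and are not part of the original study.

    import numpy as np
    from sklearn.svm import SVR

    rng = np.random.default_rng(0)
    u = np.linspace(0, 4, 40).reshape(-1, 1)      # hypothetical walking speed (mph)
    y = 10 + 12 * np.tanh(u.ravel() - 2.5) + rng.normal(0, 0.5, u.shape[0])

    for C in (0.1, 10, 1000):
        for eps in (0.05, 0.5):
            model = SVR(kernel="rbf", gamma=0.5, C=C, epsilon=eps).fit(u, y)
            n_sv = len(model.support_)            # larger eps tends to give a sparser solution
            mse = np.mean((model.predict(u) - y) ** 2)
            print(f"C={C:>6}, eps={eps}: support vectors={n_sv:2d}, train MSE={mse:.3f}")

As the discussion above suggests, the very small values of C under-fit (large training MSE), while a larger ε leaves fewer support vectors at the cost of accuracy on the training data.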

Fig. 10.4. Experimental scenario.

By solving the above constrained optimization problem, we have

      f(u) = Σ_{i=1}^{N} βi ϕ(ui) · ϕ(u) + b.                               (10.4)

As mentioned above, by the use of kernels, all necessary computations can
be performed directly in the input space, without having to compute the
map ϕ(u) explicitly. After introducing kernel function k(ui, uj), the above
equation can be rewritten as follows:

      f(u) = Σ_{i=1}^{N} βi k(ui, u) + b,                                   (10.5)

Fig. 10.5. Original ECG signal.

where the coefficients βi correspond to each (ui , yi ). The support vectors


are the input vectors uj whose corresponding coefficients βj ≠ 0. For linear
support vector regression, the kernel function is thus the inner product in the
input space:

      f(u) = Σ_{i=1}^{N} βi ⟨ui, u⟩ + b.                                    (10.6)

For nonlinear SVR, there are a number of kernel functions which have
been found to provide good generalization capabilities, such as the polynomial,
radial basis function (RBF) and sigmoid kernels. Here we present the polynomial and
RBF kernel functions as follows:

Polynomial kernel: k(u, u′) = ((u · u′) + h)^p.
RBF kernel: k(u, u′) = exp(−‖u − u′‖² / (2σ²)).

Details about SVR, such as the selection of the radius ε of the tube, the kernel
function and the regularization constant C, can be found in [18, 21].
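As a small illustration of the two kernels just listed, the sketch below evaluates them directly in the input space; the parameter values (h, p, σ) and the test vectors are arbitrary choices for demonstration and are not taken from the chapter.

    import numpy as np

    def poly_kernel(u, v, h=1.0, p=3):
        """Polynomial kernel k(u, u') = ((u . u') + h)**p."""
        return (np.dot(u, v) + h) ** p

    def rbf_kernel(u, v, sigma=2.0):
        """RBF kernel k(u, u') = exp(-||u - u'||**2 / (2 * sigma**2))."""
        diff = np.asarray(u, dtype=float) - np.asarray(v, dtype=float)
        return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

    u1, u2 = np.array([1.0, 2.0]), np.array([0.5, -1.0])
    print("polynomial:", poly_kernel(u1, u2), " RBF:", rbf_kernel(u1, u2))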


Fig. 10.6. The recording of SpO2.

Table 10.1. The values of walking speed Va and Vb .


            Set 1   Set 2   Set 3   Set 4   Set 5   Set 6
Vb (mph)    0.5     1.5     2       2.5     3       3.5
Va (mph)    1.5     2.5     3       3.5     4       4.5

10.3. Experiment

A 41-year-old healthy male joined the study. He was 178 cm tall and
weighed 79 kg. Experiments were performed in the afternoon, and the subject
was allowed to have a light meal one hour before the measurements. After
walking for about 10 minutes on the treadmill to get acquainted with this
kind of exercise, the subject walked at six sets of exercise protocol (see
Fig. 10.1) to test step response. The values of walking speed Va and
Vb were designed to vary exercise intensity and are listed in Table 10.1.
To properly identify time constants for onset and offset of exercise, the
recorded data should be precisely synchronized. Therefore, time instants
t1 , t2 , t3 , and t4 should be identified and marked accurately. In this study,

Fig. 10.7. A measured heart rate step response signal.

we applied a Micro Inertial Measurement Unit (Xsens MTi-G IMU) to fulfil


this requirement. We compared both attitude information (roll, pitch and
yaw angles) and acceleration information provided by the Micro IMU. It
was observed that acceleration information alone is sufficient to identify
these time instants (see Fig. 10.2 and Fig. 10.3).
During experiments, continuous measurements of ECG, body movement
and Sp O2 (oxygen saturation) were made by using portable non-invasive
sensors. Specifically, ECG was recorded by using Alive ECG Monitor.
Body movement was measured by using the Xsens MTi-G IMU. Sp O2 was
monitored by using Alive Pulse Oximeter to guarantee the safety of the
subject. The experimental scenario is shown in Fig. 10.4.

Fig. 10.8. A typical curve fitting result.

10.4. Data Analysis and Discussions

Original signals of the IMU, ECG and SpO2 are shown in Figs 10.3, 10.5 and
10.6 respectively. It is well known that even in the absence of external
interference the heart rate can vary substantially over time under the
influence of various internal or external factors [12]. As mentioned before,
in order to reduce the variance, the designed experimental protocol has been
repeated three times. The experimental data of these repeated experiments have
been synchronized and averaged.
A typical measured heart rate response is shown in Fig. 10.7. Paper
[12] found that heart rate response to exercise can be approximated as a
first order process from a control application point of view. Therefore we
established a first order model for each of the six averaged step response data sets by using
the Matlab System Identification Toolbox [22].
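As a hedged illustration of this identification step (the chapter itself uses the Matlab toolbox), the sketch below fits the step response of a first order model K/(Ts + 1) to synthetic heart rate data with SciPy; the synthetic gain, time constant, noise level and 2 s sampling period are assumptions for demonstration only.

    import numpy as np
    from scipy.optimize import curve_fit

    def first_order_step(t, K, T, y0):
        """Step response of K/(T s + 1) from baseline y0: y(t) = y0 + K (1 - exp(-t/T))."""
        return y0 + K * (1.0 - np.exp(-t / T))

    t = np.arange(0.0, 300.0, 2.0)                       # assumed 2 s sampling
    rng = np.random.default_rng(1)
    hr = first_order_step(t, 25.0, 40.0, 100.0) + rng.normal(0.0, 1.0, t.size)

    (K_hat, T_hat, y0_hat), _ = curve_fit(first_order_step, t, hr, p0=(20.0, 30.0, 95.0))
    print(f"estimated steady state gain K = {K_hat:.1f}, time constant T = {T_hat:.1f} s")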
Table 10.2 shows the identified steady state gain (K) and time constant
(T ) by using averaged data of three sets of experimental data. A typical
curve fitting result is shown in Fig. 10.8. From Table 10.2, we can clearly

Fig. 10.9. SVM regression results for time constant at onset of exercise.

see that both steady state gain and time constant vary when walking speed
Va and Vb change. Furthermore, time constants of the offset of exercise are
noticeably bigger than those of the onset of exercise. However, it should be
pointed out that the variations of the time constants are not distinctly dependent
on walking speed when the walking speed is less than three miles/hour. Overall,
experimental results indicate that heart rate dynamics at the onset and
offset of exercise exhibit high nonlinearity when walking speed is higher
than three miles/hour.

Fig. 10.10. SVM regression results for time constant at offset of exercise.

Table 10.2. The identified time constants and steady state gains
by using averaged data.
Sets    Onset                        Offset
        DC gain     Time constant    DC gain     Time constant
1       9.2583      9.4818           7.9297      27.358
2       11.264      10.193           9.8561      27.365
3       10.006      13.659           8.9772      26.741
4       12.807      18.618           12.087      30.865
5       17.753      38.192           17.953      48.114
6       32.911      55.974           25.733      81.693

In order to quantitatively describe the detected nonlinear behaviour,


we employed the novel machine learning modeling method, SVR, to model
time constant and DC gain of heart rate dynamics.
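A minimal sketch of this regression step is given below: an ε-insensitive RBF SVR is fitted to the onset time constants of Table 10.2 against the corresponding walking speeds Va from Table 10.1. The hyperparameters (γ, C, ε) are illustrative guesses rather than the values tuned in this chapter.

    import numpy as np
    from sklearn.svm import SVR

    speed = np.array([1.5, 2.5, 3.0, 3.5, 4.0, 4.5]).reshape(-1, 1)        # Va, Table 10.1
    tc_onset = np.array([9.4818, 10.193, 13.659, 18.618, 38.192, 55.974])  # Table 10.2

    svr = SVR(kernel="rbf", gamma=0.5, C=100.0, epsilon=2.0).fit(speed, tc_onset)
    print("support vectors used:", len(svr.support_), "of", len(speed))
    print("predicted time constant at 3.2 mph: %.1f s" % svr.predict([[3.2]])[0])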
The time constant regression results at onset and offset of exercise are
shown in Figs 10.9 and 10.10 respectively. In these figures, the continuous
curve stands for the estimated input-output steady state relationship. The
dotted lines indicate the ε-insensitivity tube. The plus markers are the

Fig. 10.11. SVM regression results for DC gain at onset of exercise.

points of input–output data. The circled plus markers are the support
points. It should be emphasized that ε-insensitive SVR uses fewer than
30% of the total points to sparsely and efficiently describe the nonlinear relationship.
It can be seen that the time constant at the offset of exercise is bigger
than that at the onset of exercise. It can also be observed that the time
constant at the onset of exercise is more accurately identified than that at
the offset of exercise. This is indicated by the ε tube (or the width of the
ε-insensitive zone). It is probable that the recovery stage is influenced
by more exercise-unrelated factors than the onset of exercise is.
The SVM regression results for DC gain at the onset and offset of
exercise are shown in Figs 10.11 and 10.12. It can be observed that the DC
gain for recovery stage is less than that at the onset of exercise, especially
when walking speed is greater than three miles/hour.

Fig. 10.12. SVM regression results for DC gain at offset of exercise.

10.5. Conclusion

This study aims to capture the nonlinear behavior of heart rate response to
treadmill walking exercises by using support vector machine-based analysis.
We identified both steady state gain and the time constant under different
walking speeds by using the data from a healthy middle-aged male subject.
Both steady state gain and the time constant are variant under different
walking speeds. The time constant for the recovery stage is longer than
that at the onset of exercise as predicted. In this study, these nonlinear
behaviors have been quantitatively described by using an effective machine
learning-based approach, named SVM regression. Based on the established
model, we have already developed a new switching control approach which
will be reported elsewhere. We believe this integrated modeling and
control approach can be applied to a broad range of process control problems [6]. In
the next step of this study, we are planning to recruit more subjects to test
the established nonlinear modeling and control approach further.

References

[1] S. W. Su, B. G. Celler, A. Savkin, H. T. Nguyen, T. M. Cheng, Y. Guo, and


L. Wang, Transient and steady state estimation of human oxygen uptake
based on noninvasive portable sensor measurements, Medical & Biological
Engineering & Computing. 47(10), 1111–1117, (2009).
[2] S. W. Su, L. Wang, B. Celler, E. Ambikairajah, and A. Savkin, Estimation of
walking energy expenditure by using Support Vector Regression, Proceedings
of the 27th Annual International Conference of the IEEE Engineering in
Medicine and Biology Society (EMBS). pp. 3526–3529, (2005).
[3] M. S. Fairbarn, S. P. Blackie, N. G. McElvaney, B. R. Wiggs, P. D. Pare, and
R. L. Pardy, Prediction of heart rate and oxygen uptake during incremental
and maximal exercise in healthy adults, Chest. 105, 1365–1369, (1994).
[4] P. O. Astrand, T. E. Cuddy, B. Saltin, and J. Stenberg, Cardiac output
during submaximal and maximal work, Journal of Applied Physiology. 9,
268–274, (1964).
[5] M. E. Freedman, G. L. Snider, P. Brostoff, S. Kimelblot, and L. N. Katz,
Effects of training on response of cardiac output to muscular exercise in
athletes, Journal of Applied Physiology. 8, 37–47, (1955).
[6] Y. Zhang, S. W. Su, H. T. Nguyen, and B. G. Celler, Machine learning
based nonlinear model predictive control for heart rate response to exercise,
in H. K. Lam and S. H. Ling and H. T. Nguyen (Eds), Computational
Intelligence and its Applications: Evolutionary Computation, Fuzzy Logic,
Neural Network and Support Vector Machine Techniques. pp. 271–285,
(World Scientific, Singapore, 2011).
[7] V. Seliger and J. Wagner, Evaluation of heart rate during exercise on a
bicycle ergometer, Physiology Boheraoslov. 18, 41, (1969).
[8] Y. Chen and Y. Lee, Effect of combined dynamic and static workload on
heart rate recovery cost, Ergonomics. 41(1), 29–38, (1998).
[9] R. Acharya, A. Kumar, I. P. Bhat, L. Choo, S. Iyengar, K. Natarajan, and
S. Krishnan, Classification of cardiac abnormalities using heart rate signals,
Medical & Biological Engineering & Computing. 42(3), 288–293, (2004).
[10] M. Hajek, J. Potucek, and V. Brodan, Mathematical model of heart rate
regulation during exercise, Automatica. 16, 191–195, (1980).
[11] T. M. Cheng, A. V. Savkin, B. G. Celler, S. W. Su, and L. Wang, Nonlinear
modelling and control of human heart rate response during exercise with
various work load intensities, IEEE Transactions on Biomedical Engineering.
55(11), 2499–2508, (2008).
[12] S. Su, L. Wang, B. Celler, A. Savkin, and Y. Guo, Identification and control
for heart rate regulation during treadmill exercise, IEEE Transactions on
Biomedical Engineering. 54(7), 1238–1246, (2007).
[13] R. A. Cooper, T. L. Fletcher-Shaw, and R. N. Robertson, Model
reference adaptive control of heart rate during wheelchair ergometry, IEEE
Transactions on Control Systems Technology. 6(4), 507–514, (1998).

[14] R. Eston, A. Rowlands, and D. Ingledew, Validity of heart rate, pedometry,


accelerometry for predicting the energy cost of children’s activities, Journal
of Applied Physiology. 84, 362–371, (1998).
[15] H. Drucker, C. Burges, L. Kaufman, A. Smola, and V. Vapnik, Support
vector regression machines, in M. Mozer, M. Jordan, and T. Petsche (Eds),
Advances in Neural Information Procession Systems, pp. 57–86 (Elsevier,
Cambridge, MA, 1997).
[16] I. Goethals, K. Pelckmans, J. Suykens, and B. D. Moor, Identification of
MIMO Hammerstein models using least squares support vector machines,
Automatica. 41, 1263–1272, (2005).
[17] J. Suykens, V. Gestel, J. D. Brabanter, B. D. Moor, and J. Vandewalle,
Least squares support vector machines. (World Scientific, Singapore, 2002).
[18] V. Vapnik, The Nature of Statistical Learning Theory. (Springer, New York,
1995).
[19] V. Vapnik and A. Lerner, Pattern recognition using generalized portrait
method, Automation and Remote Control. 24, 774–780, (1963).
[20] S. Gunn, M. Brown, and K. Bossley, Network performance assessment
for neuro-fuzzy data modelling, Intelligent Data Analysis. 1280, 313–323,
(1997).
[21] B. Schlkopf and A. Smola, Learning with kernels. (MIT Press, Cambridge,
MA, 2002).
[22] L. Ljung, System Identification Toolbox V4.0 for Matlab. (The MathWorks,
Inc, Natick, MA, 1995).

Chapter 11

Machine Learning-based Nonlinear Model Predictive


Control for Heart Rate Response to Exercise

†Yi Zhang, †‡Steven W. Su, #‡Branko G. Celler and †Hung T. Nguyen
[email protected]

This study explores control methodologies to handle time variant


behavior for heart rate dynamics at onset and offset of exercise. To
achieve this goal, a novel switching model predictive control (MPC)
algorithm is presented to optimize the exercise effects at both onset
and offset of exercise. Specifically, dynamic matrix control (DMC), one
of the most popular MPC algorithms, has been employed as
the core of the optimization of process regulation, while a switching
strategy has been adopted during the transition between the onset and offset
of exercise. The parameters of the DMC/MPC controller have been well
tuned based on a previously established SVM-based regression model
relating to both onset and offset of treadmill walking exercises. The
effectiveness of the proposed modeling and control approach has been
shown from the regulation of dynamical heart rate response to exercise
through simulation using Matlab.

Contents

11.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272


11.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274
11.2.1 Model-based predictive control (MPC) . . . . . . . . . . . . . . . . . . 274
11.2.2 Dynamic matrix control (DMC) . . . . . . . . . . . . . . . . . . . . . . 275
11.3 Control Methodologies Design . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
11.3.1 Discrete time model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
11.3.2 Switching control method . . . . . . . . . . . . . . . . . . . . . . . . . 278
11.3.3 Demonstration of tuned DMC parameters for control system of
cardio-respiratory response to exercise . . . . . . . . . . . . . . . . . . 280
† Centre for Health Technologies, Faculty of Engineering and Information Technology, University of Technology, Sydney, Australia.
‡ Human Performance Group, Biomedical Systems Lab, School of Electrical Engineering and Telecommunications, University of New South Wales, Sydney NSW 2052, Australia.
# College of Health and Science, University of Western Sydney, Penrith NSW 2751, Australia.


11.3.4 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282


11.4 Conclusions and Outlook . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 284

11.1. Introduction

As heart rate was found to be a predictor of major ischemic heart disease


events, cardiovascular mortality and sudden cardiac death, [1, 2] many
scholars have been interested in monitoring it to evaluate the cardiovascular
fitness. [3–7] Heart rate is determined by the number of heartbeats per unit
of time, typically expressed as beats per minute (BPM), and can vary as
the body’s need for oxygen changes during exercise.
SVM regression offers a solution for nonlinear behavior description of
heart rate response to moderate exercise as shown in the previous chapter
[8]. This SVM-based model for heart rate estimation can be used to
implicitly indicate some key cardio-respiratory responses to exercise, such
as oxygen uptake as discussed in [9] and [10]. In our previous study [8], it
is shown that the time constants of the heart rate response vary. The time constant is often
bigger at the offset of exercise than at the onset of exercise. The captured
difference leads to setting up two nonlinear models separately to present
the dynamic characteristics of heart rate at the onset and offset of exercise.
A process possessing two or more quite different characteristics is
quite common, not only in human body responses but also in some
industrial processes. For instance, the boiler, a closed vessel in which water or
another liquid is heated, is widely used in the live steam sector. It
may take only a few minutes to heat the water (or other liquid) up to
100 ◦C, but in general it may take several hours to cool down. Although it
is evident that using different models for different stages (as shown in [8])
may provide a more precise description of the process, it requires a more
advanced control strategy to handle this complicated mechanism.
MPC is a family of control algorithms that employ an explicit model
to predict the future behavior of the process over a prediction horizon.
The controller output is calculated by minimizing a pre-defined objective
function over a control horizon. Figure 11.1 illustrates the ”moving
horizon” technique used in MPC. Recently, MPC has established itself in
industry as an important form of advance control [11] due to its advantages
over traditional controllers. [12, 13] MPC displays improved performance
because the process model allows current computations to consider future
dynamic events. This provides benefits when controlling processes with
large dead times or non-minimum phase behavior. MPC also allows for the

incorporation of hard and soft constraints directly in the objective function.


In addition, the algorithm provides a convenient architecture for handling
multivariable control due to the superposition of linear models within the
controller.

Fig. 11.1. The ”moving horizon” concept of model predictive control.

In this study, one of the most popular MPC algorithms, Dynamic Matrix
Control, is selected to control the heart rate responses based on previously
established SVM-based nonlinear time variant model. The major benefit of
using a DMC controller is its simplicity and efficiency for the computation
of optimal control action, which is essentially a least-square optimization
problem.

To handle different dynamic characteristics at the onset and offset


of exercise, the switching control strategy has been implemented during
the transmission between the onset and offset of treadmill exercises. By
integrating the proposed modeling and control methods, heart rate response
to exercise has been optimally regulated at both the onset and offset of
exercise.
The organization of this chapter is as follows. The preliminaries
of DMC/MPC are clarified in Section 11.2. The proposed switching
DMC control approach is discussed in Section 11.3, which is followed by
simulation results. Conclusions are given in Section 11.4.

11.2. Background

11.2.1. Model-based predictive control (MPC)

11.2.1.1. MPC structure

The basic structure of the MPC controller consists of two main elements as
depicted in Fig. 11.2. The first element is the predicted model of the process
to be controlled. Further, if any measurable disturbance or noise exists in
the process it can be added to the model of the system for compensation.
The second element is the optimizer that calculates the future control action
to be executed in the next step while taking into account the cost function
and constraints [14].

Fig. 11.2. Structure of model predictive control.



11.2.2. Dynamic matrix control (DMC)

DMC uses a linear finite step response model of the process to predict
the process variable profile, ŷ(k + j), over j sampling instants ahead of the
current time, k:

      ŷ(k + j) = y0 + Σ_{i=1}^{j} Ai △u(k + j − i) + Σ_{i=j+1}^{N−1} Pi △u(k + j − i),
                      (effect of current and future moves)  (effect of past moves)
      j = 1, 2, ..., P;   N: model horizon,                                 (11.1)

where P is the prediction horizon and represents the number of sampling


intervals into the future over which DMC predicts the future process
variable [15]. In (11.1), y0 is the initial condition of the process variable,
△ui = ui − ui−1 is the change in the controller output at the i-th sampling
instant, Ai and Pi are composed of the unit step response coefficients
of the process, and N is the model horizon and represents the number of
sampling intervals of past controller output moves used by DMC to predict
the future process variable profile.
The current and future controller output moves have not been
determined and cannot be used in the computation of the predicted process
variable profile. Therefore, (11.1) reduces to


      ŷ(k + j) = y0 + Σ_{i=j+1}^{N−1} Pi △u(k + j − i) + d(k + j),          (11.2)

where the term d(k + j) combines the unmeasured disturbances and the
inaccuracies due to plant–model mismatch. Since future values of the
disturbances are not available, d(k + j) over future sampling instants is
assumed to be equal to the current value of the disturbance, or


      d(k + j) = d(k) = y(k) − y0 − Σ_{i=1}^{N−1} Hi △u(k − i),             (11.3)

where y(k) is the current process variable measurement.


The goal is to compute a series of controller output moves such that

Rsp (k + j) − ŷ(k + j) = 0. (11.4)



Substituting (11.1) in (11.4) gives


      Rsp(k + j) − y0 − Σ_{i=j+1}^{N−1} Pi △u(k + j − i) − d(k + j)
          (predicted error based on past moves, e(k + j))
                     = Σ_{i=1}^{j} Ai △u(k + j − i).                        (11.5)
          (effect of current and future moves to be determined)

Equation (11.5) is a system of linear equations that can be represented


as a matrix equation of the form

      ⎡ e(k+1) ⎤   ⎡ a1    0     0     ···  0      ⎤   ⎡ △u(k)      ⎤
      ⎢ e(k+2) ⎥   ⎢ a2    a1    0     ···  0      ⎥   ⎢ △u(k+1)    ⎥
      ⎢ e(k+3) ⎥   ⎢ a3    a2    a1    ···  0      ⎥   ⎢ △u(k+2)    ⎥
      ⎢   ⋮    ⎥ = ⎢ ⋮     ⋮     ⋮     ⋱    ⋮      ⎥ × ⎢    ⋮       ⎥       (11.6)
      ⎢ e(k+M) ⎥   ⎢ aM    aM−1  aM−2  ···  a1     ⎥   ⎣ △u(k+M−1)  ⎦ M×1
      ⎢   ⋮    ⎥   ⎢ ⋮     ⋮     ⋮          ⋮      ⎥
      ⎣ e(k+P) ⎦   ⎣ aP    aP−1  aP−2  ···  aP−M+1 ⎦
         P×1                    P×M

or in a compact matrix notation as [15]

ē = A △ ū, (11.7)

where ē is the vector of predicted errors over the next P sampling instants,
A is the dynamic matrix and △ū is the vector of controller output moves
to be determined.
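To make the structure of (11.6)–(11.7) concrete, the sketch below assembles the P × M dynamic matrix A from the unit step response coefficients of an assumed first order process; the process parameters and horizons are illustrative assumptions only.

    import numpy as np

    def step_coefficients(K, T, Ts, n):
        """Unit step response coefficients a_i of K/(T s + 1), sampled every Ts seconds."""
        t = Ts * np.arange(1, n + 1)
        return K * (1.0 - np.exp(-t / T))

    def dynamic_matrix(a, P, M):
        """P x M dynamic matrix of (11.6): A[i, j] = a_{i-j+1} for j <= i, else 0."""
        A = np.zeros((P, M))
        for i in range(P):
            for j in range(min(i + 1, M)):
                A[i, j] = a[i - j]
        return A

    a = step_coefficients(K=25.0, T=40.0, Ts=2.0, n=30)   # assumed first order process
    A = dynamic_matrix(a, P=30, M=10)
    print(A.shape)                                        # (30, 10)
    print(np.round(A[:3, :3], 3))                         # banded lower-triangular corner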
An exact solution to (11.7) is not possible since the number of equations
exceeds the degrees of freedom (P > M ). Hence, the control objective is
posed as a least squares optimization problem with a quadratic performance
objective function of the form

      min_{△ū} J = [ē − A △ū]ᵀ [ē − A △ū].                                  (11.8)

The DMC control law of this minimization problem is


      △ū = (AᵀA)⁻¹ Aᵀ ē.                                                    (11.9)
Implementation of DMC with the control law in (11.9) results in excessive
control action, especially when the control horizon is greater than one.
Therefore, a quadratic penalty on the size of controller output moves is
introduced into the DMC performance objective function. The modified
objective function has the form
      min_{△ū} J = [ē − A △ū]ᵀ [ē − A △ū] + [△ū]ᵀ λ [△ū],                   (11.10)

where λ is the move suppression coefficient (controller output weight). This


weighting factor plays a crucial role in the DMC optimizer. If the value
of λ is large, the optimizer attaches more importance to the size of
△u, so that the robustness of the controller output moves is directly
improved, but the accuracy with which the output follows the reference profile
might be sacrificed. Conversely, if the value of λ is reduced, optimizing
the effect of ē has higher priority than that of △u, and the accuracy of the system
becomes more significant than its robustness.
In the unconstrained case, the modified objective function has a closed
form solution of [16, 17]
      △ū = (AᵀA + λI)⁻¹ Aᵀ ē.                                               (11.11)
Adding constraints to the classical formulation given in (11.10) produces
the quadratic dynamic matrix control (QDMC) [18, 19] algorithm. The
constraints considered in this work include:
ŷmin ≤ ŷ ≤ ŷmax ; (11.12a)
△ūmin ≤ △ū ≤ △ūmax ; (11.12b)
ūmin ≤ ū ≤ ūmax . (11.12c)
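The following sketch applies the unconstrained control law (11.11) with move suppression; the predicted error vector, the first order step response coefficients and the chosen value of λ are assumptions made purely for illustration (the constrained QDMC case is not shown).

    import numpy as np

    # Assumed first order process (K = 25, T = 40 s) sampled every Ts = 2 s.
    P, M, lam = 30, 10, 300.0
    a = 25.0 * (1.0 - np.exp(-2.0 * np.arange(1, P + 1) / 40.0))
    A = np.array([[a[i - j] if j <= i else 0.0 for j in range(M)] for i in range(P)])

    e_bar = 30.0 * np.ones(P)              # illustrative predicted error vector
    du = np.linalg.solve(A.T @ A + lam * np.eye(M), A.T @ e_bar)   # equation (11.11)
    print("future moves:", np.round(du, 3))
    print("move applied at this sample (receding horizon):", round(float(du[0]), 3))

In a receding-horizon implementation, only the first element of △ū is applied before the optimization is repeated at the next sample.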

11.3. Control Methodologies Design

11.3.1. Discrete time model


From the previous studies [8], heart rate response to exercise can be
approximated as a first order process, which is expressed in the s-domain as
(11.13):

      H(s) = Y(s)/U(s) = K/(T s + 1).                                       (11.13)

For obtaining the discrete time model, (11.13) has to be transformed to
the z-domain by

      s = (1 − z⁻¹)/Ts,                                                     (11.14)

where Ts is the sample time.
Substituting (11.14) into (11.13), the model in the z-domain follows as
(11.15):

      (T/Ts + 1) y = (T/Ts) y z⁻¹ + K u.                                    (11.15)

Using y(k) z⁻¹ = y(k − 1) (k: the kth sample time), (11.15) is
transformed to the discrete time form

      y(k) = [T/(T + Ts)] y(k − 1) + [Ts K/(T + Ts)] u(k).                  (11.16)
In (11.16), T and K are the only parameters to be defined for describing
the first order model. It can be regarded as a nonlinear model when the set
of parameters K and T varies as u(k) changes. This is exactly the case for
the mathematical model of the dynamic characteristics of the cardio-respiratory
response to exercise. The relationships between the transient parameters
(K and T) and u(k), obtained from the SVR results, can be found in [8].
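As a rough illustration of how (11.16) becomes a nonlinear model when K and T depend on u(k), the sketch below simulates a walking bout using a linear interpolation of the onset values from Table 10.2 of the previous chapter in place of the SVR models of [8]; this substitution and the simulation settings are assumptions.

    import numpy as np

    speeds  = np.array([1.5, 2.5, 3.0, 3.5, 4.0, 4.5])                      # Va, Table 10.1
    K_onset = np.array([9.2583, 11.264, 10.006, 12.807, 17.753, 32.911])    # Table 10.2
    T_onset = np.array([9.4818, 10.193, 13.659, 18.618, 38.192, 55.974])    # Table 10.2

    def simulate(u_profile, Ts=2.0, y0=0.0):
        """Iterate y(k) = T/(T+Ts) y(k-1) + Ts K/(T+Ts) u(k) with speed-dependent K, T."""
        y, out = y0, []
        for u in u_profile:
            K = np.interp(u, speeds, K_onset)
            T = np.interp(u, speeds, T_onset)
            y = T / (T + Ts) * y + Ts * K / (T + Ts) * u
            out.append(y)
        return np.array(out)

    u_profile = np.full(150, 3.5)          # a 5 minute bout at 3.5 mph, Ts = 2 s
    hr_rise = simulate(u_profile)
    print("predicted steady state heart rate rise: %.1f bpm" % hr_rise[-1])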

11.3.2. Switching control method


If a control system has two or more processes, switching control
is one of the approaches commonly used in the multiple model
control field. Switching control for discrete time models increases
control accuracy, lowers system consumption and raises the efficiency
of control processing. Nevertheless, it also poses a risk to system
robustness, because of the gap between models.
By the analysis of the previous experiment results in [8], it can be
seen there are two time variant models being introduced for dynamic
characteristics of heart rate response at the onset and offset of exercise.
Certainly, two unique DMC controllers are also established, each computing
their own control actions. Although two models plus two controllers are
employed in this work, the approach can easily be extended to include as
many local models and controllers as the practitioner would like (see Fig.
11.3).
If △R(k) = R(k) − R(k − 1) is the set point change, △u(k) = u(k) −
u(k −1) is the input change of designed controller at kth sample time, uonset

Fig. 11.3. Block diagram for double model predictive switching control system.

is the controller output for onset of exercise, uoffset is the controller output
for offset of exercise, yonset is the measured output of the process for onset
of exercise and yoffset is the measured output of the process for offset of
exercise, then
If △R(k) > 0 then
umeas = uonset . (11.17)
And if △u(k) > 0
ymeas = yonset . (11.18)
Otherwise
ymeas = yoffset . (11.19)
If △R(k) = 0
And if △u(k) > 0
umeas = uonset . (11.20)

ymeas = yonset . (11.21)


Otherwise
umeas = uoffset . (11.22)

ymeas = yoffset . (11.23)


If △R(k) < 0 then
umeas = uoffset . (11.24)
And if △u(k) > 0
ymeas = yonset . (11.25)

Otherwise

ymeas = yoffset . (11.26)

From this point of view, ymeas is the actual measured output of the
control system and umeas is the actual controller output once the
controller has been selected by the above conditions.
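A compact sketch of the selection rules (11.17)–(11.26) is given below; the function and variable names are illustrative and not taken from the chapter's implementation.

    # Given the set-point change dR and the controller-output change du at sample k,
    # select which controller output and which measured output are used.
    def select_signals(dR, du, u_onset, u_offset, y_onset, y_offset):
        if dR > 0:
            u_meas = u_onset                              # (11.17)
            y_meas = y_onset if du > 0 else y_offset      # (11.18)/(11.19)
        elif dR == 0:
            if du > 0:
                u_meas, y_meas = u_onset, y_onset         # (11.20)/(11.21)
            else:
                u_meas, y_meas = u_offset, y_offset       # (11.22)/(11.23)
        else:  # dR < 0
            u_meas = u_offset                             # (11.24)
            y_meas = y_onset if du > 0 else y_offset      # (11.25)/(11.26)
        return u_meas, y_meas

    print(select_signals(dR=1.0, du=0.4, u_onset=2.5, u_offset=1.0,
                         y_onset=112.0, y_offset=108.0))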

11.3.3. Demonstration of tuned DMC parameters for


control system of cardio-respiratory response to
exercise
The parameters to be tuned relating to this project include the sample
time (Ts ), prediction horizon (P ), moving horizon (M ), model horizon (N ),
move suppression coefficient (λ), peak value of the reference profile (Rp )
and number of samples (S) (see Table 11.1).

Table 11.1. Tuning parameters for DMC control system of
cardio-respiratory response to exercise.

        controller for onset stage      controller for offset stage
P       30                              30
M       10                              10
N       180                             250
λ       300                             3000
S       900
Rp      30
Ts      2

According to previous studies [8], the range of a normal measured heart


rate step response signal during the subject's workout stage of treadmill
exercise at a pace of around 3.5 miles/hour is approximately from
100 bpm to 140 bpm. Therefore, the target heart rate variation (Rp) is set to 30
as the simulation reference scale.
Due to the nonlinearity of this study, the DMC parameters are usually
set by a combination of practical training and theoretical philosophies.
[8] For example, in order to find a best value for the move suppression
coefficient (λ), a test is carried out that employs 21 sets of steady state
gain (K) and time constant (T ) in terms of the SVR simulation results
in [8]. These 21 different experiment data can be simply treated as 21
linear models. The method of practical training tends to tailor these linear
models with a possible best value and evaluate them to find the final λ for
nonlinear DMC controllers. The tuned values for DMC onset and offset

controllers through 21 sets of test data analysis respectively amount to 300


and 3,000.
The sample time (Ts ) is set as two seconds based on the general
heartbeat rate of human beings. The experiment period is about 30 minutes
excluding 10 minutes of warm-up at the beginning of exercise. Hence, the
number of samples (S) is 900. Considering the large Ts and S, the prediction
horizon (P) and moving horizon (M) are set to 30 and 10 respectively.
In addition, the tuned parameters Ts and N improve the system response
time so that the control effect can keep up with the sample time moves.
Model horizon (N ) also needs to be tuned in order to reduce the system’s
computational time and enhance the efficiency of the DMC controller. In
particular, N is defined as a dynamic array to keep storing and shifting
the values of steady state responses of the process until it stays level. If
the length of N is too long, it would be redundant. Based on experimental
data analysis, the value of N for the DMC onset controller is set to 180, while
250 is chosen for the DMC offset controller to give a safe margin.

Fig. 11.4. Simulation results for machine learning-based double nonlinear model
predictive switching control for cardio-respiratory response to exercise with all tuning
parameters (I).

Fig. 11.5. Simulation results with added noise for machine learning-based double nonlinear
model predictive switching control for cardio-respiratory response to exercise with all
tuning parameters.

11.3.4. Simulation

The simulation results for machine learning-based double nonlinear model


predictive switching control for cardio-respiratory response to exercise
are demonstrated in Figs 11.4 and 11.5. Based on these figures, the
results are sound even when a random disturbance is added to the DMC
controllers (see Fig. 11.5), which can still efficiently avoid distortion.
In the experiment results, these complex nonlinear behaviors have been
qualitatively optimized with high accuracy by using the double model
predictive switching control approach.
On the other hand, switching control brings slight oscillations at the middle
stage (see Fig. 11.6). As the amplitudes of these oscillations do not exceed
the theoretical error range (σ ≤ 5%), they are considered acceptable at
the simulation stage.

Fig. 11.6. Simulation results for machine learning-based double nonlinear model
predictive switching control for cardio-respiratory response to exercise with all tuning
parameters (II).

11.4. Conclusions and Outlook

In this study, a machine learning-based nonlinear model predictive control


for heart rate response to exercise is introduced.
As discussed in [8], this nonlinear behavior of heart rate response
to treadmill walking exercise can be effectively captured by using SVM
regression. We investigated both steady state gain and the time constant
under different walking speeds by using the data from an individual healthy
middle-aged male subject. The experiment results demonstrate that the
time constant for recovery stage is longer than that at onset of exercise,
which provides the essential idea of forming a double-model method to describe
the corresponding onset and offset of exercise.
Based on the established model, a novel switching model predictive
control algorithm has been developed, which applies DMC algorithm to
optimize the regulation of heart rate responses at both the onset and offset
of exercise.

Simulation results indicate that the switching DMC controller can efficiently


handle the different dynamic characteristics at onset and offset of exercise.
However, it should be pointed out that the proposed approach, as most
switching control strategies, also suffers from transient behavior during
controller switching. For example, the simulation results in transition stage
(△R = 0, Fig. 11.6; Equations (11.20), (11.21), (11.22) and (11.23)) have
slight oscillation due to the quick switching between two controllers. In the
next step of this study, we will develop a bump-less transfer controller to
minimize the transient behavior and implement those methodologies in the
real-time control of heart rate response during treadmill exercises.

References

[1] A. G. Shaper, G. Wannamethee, P. W. Macfarlane, and M. Walker, Heart


rate, ischaemic heart disease and sudden cardiac death in middle-aged British
men, British Heart Journal. 70, 49–55, (1993).
[2] R. Acharya, A. Kumar, I. P. S. Bhat, L. Choo, S. S. Iyengar, K. Natarajan,
and S. M. Krishnan, Classification of cardiac abnormalities using heart rate
signals, Medical & Biological Engineering & Computing. 42(3), 288–293,
(2004).
[3] S. W. Su, B. G. Celler, A. Savkin, H. T. Nguyen, T. M. Cheng, Y. Guo, and
L. Wang, Transient and steady state estimation of human oxygen uptake
based on noninvasive portable sensor measurements, Medical & Biological
Engineering & Computing. 47(10), 1111–1117, (2009).
[4] S. W. Su, L. Wang, B. Celler, E. Ambikairajah, and A. Savkin, Estimation of
walking energy expenditure by using Support Vector Regression, Proceedings
of the 27th Annual International Conference of the IEEE Engineering in
Medicine and Biology Society (EMBS). pp. 3526–3529, (2005).
[5] M. S. Fairbarn, S. P. Blackie, N. G. McElvaney, B. R. Wiggs, P. D. Pare, and
R. L. Pardy, Prediction of heart rate and oxygen uptake during incremental
and maximal exercise in healthy adults, Chest. 105, 1365–1369, (1994).
[6] P. O. Astrand, T. E. Cuddy, B. Saltin, and J. Stenberg, Cardiac output
during submaximal and maximal work, Journal of Applied Physiology. 9,
268–274, (1964).
[7] M. E. Freedman, G. L. Snider, P. Brostoff, S. Kimelblot, and L. N. Katz,
Effects of training on response of cardiac output to muscular exercise in
athletes, Journal of Applied Physiology. 8, 37–47, (1955).
[8] W. Chen, S. W. Su, Y. Zhang, Y. Guo, N. Nguyen, B. G. Celler, and H. T.
Nguyen, Nonlinear modeling using support vector machine for heart rate
response to exercise, in H. K. Lam, S. H. Ling and H. T. Nguyen (Eds),
Computational Intelligence and its Applications: Evolutionary Computation,
Fuzzy Logic, Neural Network and Support Vector Machine Techniques. pp.
255–270, (Imperial College Press, London, 2012).

[9] S. W. Su, B. G. Celler, A. Savkin, H. T. Nguyen, T. M. Cheng, Y. Guo, and


L. Wang, Transient and steady state estimation of human oxygen uptake
based on noninvasive portable sensor measurements, Medical & Biological
Engineering & Computing. 47(10), 1111–1117, (2009).
[10] L. Wang, S. Su, B. Celler, G. Chan, T. Cheng, and A. Savkin, Assessing
the human cardiovascular response to moderate exercise, Physiological
Measurement. 30, 227–244, (2009).
[11] J. Richalet, Industrial applications of model based predictive control,
Automatica. 29(5), 1251–1274, (1993).
[12] C. E. Garcia, D. M. Prett, and M. Morari, Model predictive control: theory
and practice–a survey, Automatica. 25(3), 335–348, (1989).
[13] K. R. Muske and J. L. Rawlings, Model predictive control with linear models,
AICHE Journal. 39, 262–287, (1993).
[14] C. Bordons and E. F. Camacho, Model predictive control, 2nd edition.
(Springer-Verlag, London, 2004).
[15] D. Dougherty and D. Cooper, A practical multiple model adaptive strategy
for single-loop MPC, Control Engineering Practice. 11, 141–159, (2003).
[16] J. L. Marchetti, D. A. Mellichamp, and D. E. Seborg, Predictive control
based on discrete convolution models, Industrial & Engineering Chemistry,
Processing Design and Development. 22, 488–495, (1983).
[17] B. A. Ogunnaike, Dynamic matrix control: A nonstochastic, industrial
process control technique with parallels in applied statistics, Industrial &
Engineering Chemical Fundamentals. 25, 712–718, (1986).
[18] A. M. Morshedi, C. R. Cutler, and T. A. Skrovanek, Optimal solution
of dynamic matrix control with linear programming techniques (LDMC),
Proceedings of the American Control Conference, New Jersey: IEEE
Publications. pp. 199–208, (1985).
[19] C. E. Garcia and A. M. Morshedi, Quadratic programming solution of
dynamic matrix control (QDMC), Chemical Engineering Communications.
46, 73–87, (1986).

Chapter 12

Intelligent Fault Detection and Isolation of HVAC System


Based on Online Support Vector Machine


Davood Dehestani, †Ying Guo, ∗Sai Ho Ling, ∗Steven W. Su and ∗Hung T. Nguyen
Faculty of Engineering and Information Technology,
University of Technology,
Sydney, Australia
[email protected]

Heating, Ventilation and Air Conditioning (HVAC) systems are often


one of the largest energy consuming parts in modern buildings. Two
key issues for HVAC systems are energy saving and safety. Regular
checking and maintenance are usually the keys to tackle these problems.
Due to the high cost of maintenance, preventive maintenance plays an
important role. One cost-effective strategy is the development of analytic
fault detection and isolation (FDI) modules by online monitoring of the
key variables of HVAC systems. This chapter investigates real-time FDI
for HVAC systems by using online support vector machine (SVM), by
which we are able to train an FDI system with manageable complexity
under real-time working conditions. It also proposes a new approach
which allows us to detect unknown faults and update the classifier by
using these previously unknown faults. Based on the proposed approach,
a semi-unsupervised fault detection methodology has been developed
for HVAC systems. This chapter also identifies the variables which are
the indications of the particular faults we are interested in. Simulation
studies are given to show the effectiveness of the proposed online FDI
approach.

Contents
12.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 288
12.2 General Introduction on HVAC System . . . . . . . . . . . . . . . . . . . . . 289
12.3 HVAC Parameter Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
12.4 HVAC Model Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 292


12.5 Fault Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293
12.6 Parameter Sensitivity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295
12.7 Incremental–Decremental Algorithm of SVM . . . . . . . . . . . . . . . . . . 296
12.8 Algorithm of FDI by Online SVM . . . . . . . . . . . . . . . . . . . . . . . . 297
12.9 FDI Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 300
12.10 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 302
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303

12.1. Introduction

Few energy systems are as widely used in both industrial and domestic
settings as HVAC systems. Moreover, HVAC systems usually consume the
largest portion of the energy in a building, both in industry and in homes.
It is reported in [1] that the air-conditioning of buildings accounts for 28%
of the total energy end use of the commercial sector, and that 15% to 30%
of the energy waste in commercial buildings is due to performance
degradation, improper control strategies and malfunctions of HVAC
systems. Regular checks and maintenance are usually the key to limiting
such waste. However, due to the high cost of maintenance, preventive
maintenance plays an important role, and a cost-effective strategy is the
development of fault detection and isolation (FDI).
Several strategies have been employed as FDI modules in HVAC
systems. These strategies can be classified mainly into two categories:
model-based strategies and signal processing-based strategies [2–4].
Model-based techniques use either a mathematical model or a knowledge
model to detect and isolate the faulty modes. These techniques include, but
are not limited to, the observer-based approach [5], the parity-space
approach [6], and parameter identification-based methods [7]. Henao [8]
reviewed fault detection based on signal processing. This procedure involves
mathematical or statistical operations which are performed directly on the
measurements to extract the features of faults.
Intelligent methods such as genetic algorithms (GA), neural networks
(NN) and fuzzy logic have been applied to fault detection during the last
decade. Neural networks have been used for fault detection in a range of
systems, including HVAC systems [9]. Lo [10] proposed an intelligent
technique based on a fuzzy-genetic algorithm (FGA) for automatically
detecting faults in HVAC systems. However, many intelligent methods such
as NN often require large data sets for training, and some of them are not
fast enough for real-time fault detection and isolation. This chapter
investigates methods with real-time operation capability that require less
data. Support vector machines (SVMs) have been studied extensively in the
data mining and machine learning communities over the last two decades.
The SVM is capable of both classification and regression, and it is easy to
formulate a fault detection and isolation problem as a classification problem.
An SVM can be treated as a special neural network; in fact, an SVM model
is equivalent to a two-layer perceptron neural network. Used with a kernel
function, the SVM provides an alternative training method for multi-layer
perceptron classifiers in which the weights of the network are identified by
solving a quadratic programming problem under linear constraints, rather
than by solving a non-convex unconstrained minimization problem as in
standard neural network training.
Liang [2] studied FDI for HVAC systems using a standard (offline) SVM.
In this chapter, an incremental (online) SVM is applied. Training an SVM
requires solving a quadratic programming (QP) problem, but standard
numerical techniques for QP are infeasible for the very large data sets
encountered in fault detection and isolation for HVAC systems. By using an
online SVM, large-scale classification problems can be handled in a
real-time configuration under limited hardware and software resources.
Furthermore, this chapter also provides a potential approach for
implementing FDI under a semi-unsupervised learning framework.
Based on the model structure given in [2], we constructed an HVAC
model in Matlab/Simulink and identified the variables that are most
sensitive to commonly encountered HVAC faults. Finally, the effectiveness
of the proposed online FDI approach is verified and illustrated using the
Simulink simulation platform.

12.2. General Introduction on HVAC System

In parallel with the modeling of other energy systems, HVAC modelling
was developed by Arguello-Serrano [11], Tashtoush [12] and others.
Liang [13] developed a dynamic model of an HVAC system with a
single-zone thermal space, together with a model of HVAC for thermal
comfort control based on a neural network [13]. Based on this research,
Liang developed a new model for fault detection and diagnosis [2].
Specifically, a few changes in the control signal and model parameters have
been made in order to better match real applications. Figure 12.1 shows a
simple schematic of an HVAC system. It consists of three main parts: the air
handling unit (AHU), the chiller and the control system. When the HVAC
system starts to work, fresh air passes through the cooling coil section of a
heat exchanger, where heat is exchanged between the fresh air and the
cooling water. The cooled fresh air is then forced into the room by a supply fan.

Fig. 12.1. General schematic of HVAC system.

After just a few minutes, the return damper opens to allow room air to
come back to the AHU. The mixed air then passes through the cooling coil
section, which lowers its temperature and humidity. The trade-off among
exhaust, fresh and return air is decided by the control unit, and the
temperature of the room is regulated by adjusting the flow rate of the
cooling water via a control valve. Figure 12.2 shows the block diagram of
the HVAC system with a simple PI controller. The model consists of eight
variables, of which six are defined as state variables.
Two pressures (air supply pressure Ps and room air pressure Pa) and
four temperatures (wall temperature Tw, cooling coil temperature Tcc, air
supply temperature Ts and room air temperature Ta) are taken as the six
states. The cooling water flow rate fcc is the control signal. All six state
variables, together with the cooling water outlet temperature Twater−out,
are considered as system outputs, but only one of them (the room
temperature Ta) acts as the feedback signal. It should be noted that the outlet
water temperature is not used as a state variable in this model; it is used
only as an auxiliary parameter for detecting faults. The states, control
input and controlled output are listed as follows:

Fig. 12.2. Block diagram of HVAC model with PI controller.

X = [Pa ; Ps ; Tw ; Tcc ; Ts ; Ta ]
U = [fcc ]
Y = [Pa ; Ps ; Tw ; Tcc ; Ts ; Ta ]
For the control of HVAC systems, the most popular method is
proportional–integral (PI) control, with the flow rate fcc serving as the
control input. In this study we therefore select a PI controller and tune it
using the Ziegler–Nichols reaction curve method. In order to simulate the
environmental disturbances of a real application, two disturbances are
considered in the model: the outdoor temperature and the outdoor heating
(or cooling) load. The outdoor heating/cooling load disturbs the system but
cannot be measured directly. Although it can be estimated from the
supply/return air temperature and humidity via a load observer [11], for
convenience the two disturbances are assumed to be sinusoidal functions.
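As a rough illustration of this control structure (not the chapter's Simulink model), the following Python sketch simulates a first-order room-temperature balance under PI control of the chilled-water flow rate, with sinusoidal outdoor-temperature and heat-load disturbances; all gains and thermal constants are illustrative assumptions.

    # Minimal sketch only: PI control of chilled-water flow for a first-order
    # room model with sinusoidal disturbances. Numerical values are assumed.
    import numpy as np

    dt = 1.0                                   # time step [s]
    n = int(12 * 3600 / dt)                    # 12-hour simulation (8 am to 8 pm)
    setpoint = 27.5                            # room temperature set point [deg C]
    Kp, Ki = 0.05, 1e-4                        # PI gains (assumed; would be tuned, e.g. Ziegler-Nichols)
    tau, k_cool, k_amb = 1800.0, 12.0, 0.05    # assumed thermal constants

    t = np.arange(n) * dt
    T_out = 27.0 + 3.0 * np.sin(2 * np.pi * t / (24 * 3600))   # outdoor temperature, 24-30 degC
    Q_load = 0.9 + 0.1 * np.sin(2 * np.pi * t / (24 * 3600))   # heat load, 0.8-1 kW

    Ta, integral = 30.0, 0.0                   # initial room temperature and PI integrator
    for i in range(n):
        error = Ta - setpoint                  # positive error -> more cooling needed
        integral += error * dt
        fcc = np.clip(Kp * error + Ki * integral, 0.0, 0.5)    # chilled-water flow [kg/s]
        # simple energy balance: ambient gain + internal load - cooling effect
        Ta += (k_amb * (T_out[i] - Ta) + Q_load[i] - k_cool * fcc) * dt / tau

    print("room temperature after 12 h: %.2f degC" % Ta)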

12.3. HVAC Parameter Setting

Table 12.1 shows some major parameters used in this simulation. Without
loss of generality, the simulation runs in cooling mode. It is easy to consider
heating mode simply by changing some parameters in Table 12.1. The
disturbance ranges are 24–30 °C for the outdoor temperature and 0.8–1 kW
for the heat loading. The mixed air ratio of this model is a constant value
when the system is working in steady state. The inlet water temperature is
set to Twater−in = 7 °C, whilst the outlet water temperature is set to
Twater−out = 9 °C, which may be disturbed by the cooling load.
Table 12.1. Main parameters of HVAC system.
Parameter definition Setting value
Temperature set point 27.5 ◦ C
Room space dimension 5m × 5m × 3m
Indoor cooling load range 0.8-1 kW
Outdoor temperature range 24-30 ◦ C
Outdoor humidity 55 − 75%
Max chilled water flow rate 0.5 kg/s
Air flow rate 980 m3/h
Mixed air ratio 4
Noise of temperature 5% mean value
Air handling unit volume 2m × 1m × 1m
Outside pressure 1 atm.
Inlet water temperature 7 ◦C

As with other intelligent systems, the parameters of the online SVM
classifier must be determined first. The first parameter is the maximum
penalty; setting the maximum penalty to 10 gave the best margin for the
SVM. Based on tests of different kernel functions, the Gaussian function
was chosen as the kernel because of its good performance on the HVAC
problem.
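For illustration only, the corresponding classifier configuration can be expressed as follows in Python using scikit-learn, whose "rbf" kernel is the Gaussian kernel of Section 12.7; the chapter itself uses an incremental online SVM rather than this offline fit, and the data below are synthetic.

    # Illustrative sketch: an SVM with the Gaussian (RBF) kernel and penalty C = 10.
    from sklearn.svm import SVC
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 6))                        # 200 samples of 6 monitored variables (synthetic)
    y = np.where(X[:, 0] + 0.5 * X[:, 3] > 0, 1, -1)      # +1 healthy, -1 faulty (synthetic labels)

    clf = SVC(C=10.0, kernel="rbf", gamma="scale")        # C = maximum penalty, Gaussian kernel
    clf.fit(X, y)
    print("training accuracy:", clf.score(X, y))
    print("number of support vectors per class:", clf.n_support_)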

12.4. HVAC Model Simulation

All initial temperature values are set to morning-time conditions, and the
mathematical model is simulated for 12 hours (from 8 am to 8 pm). The
output temperatures under the normal (fault-free) condition are shown in
Fig. 12.3. The first half hour of the simulation corresponds to the transient
mode of the system. The room temperature stays at the desired value (set
point), while the other temperatures vary with a noisy profile as they act to
keep the room temperature at the set point. The wall temperature is clearly
a function of the outside temperature, as it follows the outdoor temperature
profile, whereas the other internal temperatures follow an inverse profile of
the outdoor temperature.
Fig. 12.3. Temperatures in normal condition of HVAC.

Figure 12.4 shows the control signal during the simulation. The flow
rate is set by a control valve whose signal is generated by the PI controller.
The cooling water flow rate reaches its maximum value at noon, when the
outdoor heat load and temperature are highest. The fluctuations in the
control signal and other variables at the beginning of the simulation are
related to the transient behaviour of the system.

12.5. Fault Introduction

HVAC systems may suffer from many faults or malfunctions during
operation. Three commonly encountered faults are defined in this
simulation:
• Supply fan fault
• Return damper fault
• Cooling coil pipe fouling
These faults are used only to test the performance of the proposed fault
detection system. Incipient faults are applied because they are particularly
difficult to detect. Four different models, consisting of one healthy model
and three faulty models, are generated.
Fig. 12.4. Water flow rate as the function of control signal.

Figure 12.5 shows the general profile of the fault during one day. The
amplitude of the fault increases gradually for six hours until it reaches its
maximum value, stays at that level for four hours, and finally returns
gradually to the normal condition during the last six hours.

Fig. 12.5. Fault trend during one day.

This fault profile has been applied to each fault with some minor
changes: in the air supply fan fault the profile is used with a gain of 10; in
the damper fault, with a gain of −2 and an offset of +4; and in the pipe
fault, with a gain of −0.3 and an offset of +1. The most sensitive
parameters have been identified for each fault. For the air supply fan fault,
the most sensitive parameter is the air supply pressure, which changes
between 0 and 10 Pascal during the fault period. The mixed air ratio is selected as the
indicator of the damper fault; it decreases from four to two. The cooling
water flow rate decreases from fcc to 0.7fcc in the cooling coil tube fault.
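The following short sketch (illustrative only) generates the daily fault profile described above and applies the stated gain and offset of each fault; the 16-hour window follows the six-hour rise, four-hour plateau and six-hour return given in the text.

    # Sketch of the trapezoidal daily fault profile and the per-fault scalings.
    import numpy as np

    def base_profile(t_hours):
        """Unit trapezoidal fault profile over a 16-hour window."""
        t = np.asarray(t_hours, dtype=float)
        up = np.clip(t / 6.0, 0.0, 1.0)                 # 0-6 h: gradual rise
        down = np.clip((16.0 - t) / 6.0, 0.0, 1.0)      # 10-16 h: gradual return
        return np.minimum(up, down)

    t = np.linspace(0.0, 16.0, 161)
    p = base_profile(t)

    supply_fan_fault = 10.0 * p            # air-supply pressure deviation, 0 to 10 Pa
    damper_fault     = -2.0 * p + 4.0      # mixed-air ratio, drops from 4 to 2
    pipe_fault       = -0.3 * p + 1.0      # flow-rate factor, drops from fcc to 0.7*fcc

    print("peak fan fault: %.1f Pa" % supply_fan_fault.max())
    print("minimum mixed-air ratio: %.1f" % damper_fault.min())
    print("minimum flow factor: %.2f" % pipe_fault.min())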

12.6. Parameter Sensitivity

The sensitivity of the variables with respect to each fault is analyzed in
this section. Figure 12.6 shows the sensitivity of the cooling water flow rate
with respect to the different faulty modes. It is sensitive to all three faults
(damper fault, supply fan fault and cooling water tube fault), with
sensitivities of 0.03 kg/s, 0.07 kg/s and 0.08 kg/s respectively.

Fig. 12.6. Cooling water flow rate changes.

Further analysis shows that the cooling coil temperature and the outlet
water temperature are sensitive to both the supply fan fault and the return
damper fault. Based on this sensitivity analysis, six parameters (air supply
pressure, room pressure, air supply temperature, cooling coil temperature,
outlet water temperature and water flow rate) are used for training on the
supply fan fault. Three parameters (cooling coil temperature, outlet water
temperature and water flow rate) are applied for training on the return
damper fault, while only the water flow rate is used for training on the
third fault (cooling coil pipe fouling).
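A minimal sketch of how this sensitivity analysis can drive the training data selection is given below; the variable names are hypothetical keys into a table of logged measurements, and the data are synthetic.

    # Each fault classifier is trained only on the variables sensitive to that fault.
    import numpy as np

    SENSITIVE_VARS = {
        "supply_fan_fault": ["P_supply", "P_room", "T_supply", "T_coil", "T_water_out", "f_cc"],
        "return_damper_fault": ["T_coil", "T_water_out", "f_cc"],
        "coil_pipe_fouling": ["f_cc"],
    }

    def select_features(log, fault_name):
        """Stack the columns relevant to one fault into a training matrix."""
        return np.column_stack([log[name] for name in SENSITIVE_VARS[fault_name]])

    # Example with synthetic logged data (1000 samples per variable).
    rng = np.random.default_rng(1)
    log = {name: rng.normal(size=1000)
           for name in ["P_supply", "P_room", "T_supply", "T_coil", "T_water_out", "f_cc"]}
    X_fan = select_features(log, "supply_fan_fault")       # shape (1000, 6)
    X_pipe = select_features(log, "coil_pipe_fouling")     # shape (1000, 1)
    print(X_fan.shape, X_pipe.shape)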
12.7. Incremental–Decremental Algorithm of SVM

The main advantages of the SVM include the use of the kernel trick (no
need to know the nonlinear mapping function), the globally optimal
solution (a quadratic problem) and the generalization capability obtained
by optimizing the margin [14]. However, for very large data sets, standard
numerical techniques for QP become infeasible. An online alternative,
which formulates the (exact) solution for ℓ+1 training data points in terms
of that for ℓ points and one new data point, is provided by the incremental
method. Training an SVM incrementally on new data while discarding all
previous data except their support vectors gives only approximate results
[15]. Cauwenberghs and Poggio [16] consider incremental learning as an
exact online method that constructs the solution recursively, one point at a
time. The key is to retain the Kuhn–Tucker (KT) conditions on all previous
data while adiabatically adding a new data point to the solution.
Leave-one-out is a standard procedure for predicting the generalization
power of a trained classifier, from both a theoretical and an empirical
perspective [17].
We are given n training data S = {(xi, yi)}, yi ∈ {+1, −1}, where xi
represents the condition attributes, yi is the class label (+1 for the healthy
class and −1 for the faulty class) and i indexes the training data. The
decision hyperplane of the SVM can be defined by (w, b), where w is a
weight vector and b a bias. Let w0 and b0 denote the optimal values of the
weight vector and bias. Correspondingly, the optimal hyperplane can be
written as:

w0ᵀ x + b0 = 0.   (12.1)
To find the optimum values of w and b, it is necessary to solve the
following optimization problem:

min_{w,b,ξ}  (1/2) wᵀw + C Σi ξi,   (12.2)

subject to

yi (wᵀφ(xi) + b) ≥ 1 − ξi,   (12.3)
where ξi are the slack variables, C > 0 is the user-specified penalty
parameter of the error term and φ is the feature mapping associated with
the kernel function. The SVM converts the original nonlinear separation
problem into a linear separation case by mapping the input vectors onto a
higher-dimensional feature space. In the feature space, the two-class
separation problem reduces to finding the optimal hyperplane that linearly
separates the two classes, which is transformed into a
quadratic optimization problem. Depending on the problem type, several
kernel functions can be used. Two of the best kernel functions for nonlinear
classification problems are the radial basis function (RBF) and the Gaussian
function. Equation (12.4) shows the RBF, an important kernel for nonlinear
classification.
K(xi , xj ) = exp{−γ∥xi − xj ∥2 }, γ > 0. (12.4)
In SVM classification, the optimal separating function reduces to a linear
combination of kernels on the training data, f(x) = Σj αj yj K(xj, x) + b,
with training vectors xj and corresponding labels yj. In the dual formulation
of the training problem, the coefficients αi are obtained by minimizing a
convex quadratic objective function under constraints, as given in
Equation (12.5):

min_{0 ≤ αi ≤ C}  W = (1/2) Σi,j αi Qij αj − Σi αi + b Σi yi αi.   (12.5)

With the Lagrange multiplier (and offset) b, and the symmetric
positive-definite kernel matrix Qij = yi yj K(xi, xj), the first-order
conditions on W reduce to the Kuhn–Tucker (KT) conditions described in
Equations (12.6) and (12.7):

∂W/∂αi > 0  if αi = 0,
∂W/∂αi = 0  if 0 < αi < C,   (12.6)
∂W/∂αi < 0  if αi = C,

and

∂W/∂b = 0.   (12.7)
The margin vector coefficients change value during each incremental
step so as to keep all elements in equilibrium, i.e. to keep their KT
conditions satisfied. Leave-one-out evaluation is naturally implemented by
decremental unlearning, the adiabatic reversal of incremental learning,
applied to each of the training data in the fully trained solution.
Incremental learning and, in particular, decremental unlearning therefore
offer a simple and computationally efficient scheme for online SVM
training, together with exact leave-one-out evaluation of the generalization
performance on the training data.
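As a hedged illustration of the bookkeeping behind this algorithm, the sketch below evaluates the first-order conditions of Equation (12.6) for a given coefficient vector and partitions the training points into margin, error and reserve sets; the adiabatic update of the coefficients themselves, which is the core of the incremental–decremental procedure, is omitted.

    import numpy as np

    def gaussian_kernel(X, Y, gamma=0.5):
        """K(x, y) = exp(-gamma * ||x - y||^2), the kernel of Eq. (12.4)."""
        d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)

    def kt_partition(X, y, alpha, b, C, tol=1e-6):
        """Evaluate g_i = dW/dalpha_i and split points into the sets of Eq. (12.6)."""
        Q = (y[:, None] * y[None, :]) * gaussian_kernel(X, X)
        g = Q @ alpha + y * b - 1.0
        margin = np.where(np.abs(g) <= tol)[0]    # g = 0  <->  0 < alpha_i < C
        error = np.where(g < -tol)[0]             # g < 0  <->  alpha_i = C
        reserve = np.where(g > tol)[0]            # g > 0  <->  alpha_i = 0
        return g, margin, error, reserve

    # Tiny synthetic example: before training (alpha = 0, b = 0) every g_i is negative,
    # i.e. all points violate the KT conditions and would trigger incremental updates.
    rng = np.random.default_rng(2)
    X = rng.normal(size=(20, 3))
    y = np.where(X[:, 0] > 0, 1.0, -1.0)
    g, margin, error, reserve = kt_partition(X, y, alpha=np.zeros(20), b=0.0, C=10.0)
    print("g range:", g.min(), g.max(), "| violating points:", len(error))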

12.8. Algorithm of FDI by Online SVM

A huge amount of data is generated before a fault happens, since most
HVAC systems are rather reliable. For large data sets, standard SVM
Fig. 12.7. Schematic of semi-unsupervised fault detection with online SVM.

techniques (offline SVM) become infeasible. This motivates the use of the
incremental–decremental (online) SVM. Figure 12.7 shows the proposed
fault detection scheme using incremental–decremental support vector
machine classification. The main purpose of the system is to detect
unknown faults by monitoring, during system operation, the key HVAC
variables discussed in the previous sections. In this algorithm, new faults
are detected (as unknown new faults) by comparing the outputs of the
healthy model with those of the real system. If the detected fault is similar
to an old fault, it is categorized by the algorithm as an existing fault.
Otherwise, the data is sent to the online SVM trainer to be trained as a
new fault, and the new fault can subsequently be isolated by the online
SVM as a known fault. The incremental procedure is reversible, and
decremental unlearning of each training sample produces an exact
leave-one-out estimate of the faults using all the HVAC data gathered
during operation.
The main advantage of this algorithm is that it uses only a subset of
useful data (healthy data, old faults and new faults) instead of the whole
data set. Based on this online training procedure, semi-unsupervised fault
detection can be implemented.
Figure 12.8 shows the structure of the label generation algorithm. Here,
the label y is set to +1 (non-faulty) when the error is smaller than a given
threshold and to −1 (faulty) when the error is bigger than that threshold.
Labels yp1, · · · , ypn are generated from the n variables
Fig. 12.8. Schematic of label generation algorithm for training system of each fault.

Fig. 12.9. Gradual known fault detection.




Fig. 12.10. Sudden unknown fault detection.

for each fault. For each fault, all the labels must be combined with an
appropriate logic to generate a single label, since each fault needs only one
label for training. A simple combination of the labels can be used to
generate the final label, but for very complex HVAC systems we
recommend using fuzzy membership functions and rules to generate the
final label. In this chapter, a specific fault is declared when the errors of all
its sensitive parameters exceed their thresholds.
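A simple sketch of this label-generation and combination logic is given below; the thresholds, variable values and the strict AND combination are illustrative assumptions consistent with the description above.

    # Compare each monitored variable with the healthy-model prediction, threshold
    # the error, and declare a fault (-1) only when every sensitive variable agrees.
    import numpy as np

    def variable_labels(measured, predicted, threshold):
        """+1 (non-faulty) if |error| < threshold, -1 (faulty) otherwise."""
        return np.where(np.abs(measured - predicted) < threshold, 1, -1)

    def fault_label(per_variable_labels):
        """The fault label is -1 only if all sensitive variables are labelled faulty."""
        stacked = np.vstack(per_variable_labels)
        return np.where((stacked == -1).all(axis=0), -1, 1)

    # Example: two sensitive variables for one fault, 5 time samples (synthetic).
    T_coil_meas = np.array([10.0, 10.2, 12.5, 13.0, 10.1])
    T_coil_pred = np.full(5, 10.0)
    T_wout_meas = np.array([9.0, 9.1, 11.0, 11.5, 9.05])
    T_wout_pred = np.full(5, 9.0)

    y1 = variable_labels(T_coil_meas, T_coil_pred, threshold=1.0)
    y2 = variable_labels(T_wout_meas, T_wout_pred, threshold=1.0)
    print(fault_label([y1, y2]))     # prints [ 1  1 -1 -1  1]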

12.9. FDI Simulation

Since the SVM classifier presented in the last section can only deal with
two-class cases, a multi-layer SVM framework has to be designed for the
FDI problem with multiple faulty conditions. In order to use online SVM
classification to achieve better isolation performance, three faulty models
are used in the isolation stage. A four-layer SVM classifier is designed, in
which the normal condition and the three different HVAC fault conditions
are all taken into consideration. Furthermore, it should be pointed out that

Fig. 12.11. Training coefficient via margin change.

other unknown faulty conditions can be placed in the upper layer of the FDI
system. The kernel function must be properly selected for the SVM classifier
in order to achieve high classification accuracy. In general, the linear,
polynomial, radial basis function (RBF), sigmoid and Gaussian functions
can be adopted as the kernel function. In this chapter, the Gaussian
function is used because it shows excellent performance in the simulation.
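One plausible (purely illustrative) realization of such a layered classifier is sketched below: binary Gaussian-kernel SVMs are cascaded so that the first layer separates the healthy condition from the rest and each subsequent layer isolates one fault. The exact layer arrangement of the chapter may differ, and the data and fault signatures here are synthetic.

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(3)
    X = rng.normal(size=(400, 6))
    cond = rng.integers(0, 4, size=400)        # 0 = healthy, 1..3 = fault types (synthetic)
    X[cond == 1, 0] += 3.0                     # crude fault signatures for illustration
    X[cond == 2, 3] += 3.0
    X[cond == 3, 5] -= 3.0

    layers = []
    mask = np.ones(len(X), dtype=bool)
    for fault_id in (0, 1, 2):                 # layer k: "condition == fault_id" vs the rest
        y = np.where(cond[mask] == fault_id, 1, -1)
        clf = SVC(C=10.0, kernel="rbf", gamma="scale").fit(X[mask], y)
        layers.append((fault_id, clf))
        mask &= cond != fault_id               # pass the remaining conditions to the next layer

    def diagnose(x):
        """Run a sample through the cascade; return the first layer that claims it."""
        for fault_id, clf in layers:
            if clf.predict(x.reshape(1, -1))[0] == 1:
                return fault_id                # 0 = healthy, 1 or 2 = that fault
        return 3                               # fell through: last fault type

    print("diagnosis of sample 0:", diagnose(X[0]), "| true condition:", cond[0])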
In this research, two tests are conducted systematically. The diagnosis
results and the corresponding characteristics of the SVM classifiers are
shown in Figs 12.9–12.11. Test 1 is designed to investigate the performance
of the SVM classifier on known incipient faults. Steady-state data are used
to build the four-layer SVM classifier: as mentioned in the previous sections,
data within the threshold under the normal condition indicate the fault-free
case, and data beyond the threshold indicate faults one to three. For each
normal/faulty condition, two days of data (20 hours of data per day,
between 2 am and 10 pm) are used, so a total of 4 × 40 hours of samples is
collected. Half of the data for each condition are used as training data,
whilst the rest are used as testing data for fault diagnosis. In
Fig. 12.9, the label changes from +1 (non-faulty situation) to −1 (faulty
situation) when the fault is detected. For simplicity, Fig. 12.9 shows only
two variables when fault 1 is present in the data: the cooling coil
temperature and the water outlet temperature. It is clear that the HVAC
faults can be diagnosed 100% correctly by the SVM classifier on the testing
data.
As mentioned earlier, the proposed algorithm is able to detect unknown
faults in a semi-unsupervised manner. To test this semi-unsupervised
performance, an unknown sudden fault (at the 9-hour mark), combined
with the previously introduced incipient faults, is imposed on the system in
the second test. The detection results are shown in Fig. 12.10.
The results clearly indicate that the margin changes from a high level to
a low level when incipient faults are detected. For unknown faults this
change is dramatic, since the unknown faults are abrupt. To optimize the
training process efficiently, samples from each normal/faulty condition
should be used; a group containing the maximum number of faulty training
samples is selected and applied for training. Figure 12.10 shows that the
designed SVM classifier can identify the unknown HVAC fault accurately.
Based on the simulation results, the proposed approach can also detect the
unknown faults of an HVAC system efficiently.
Figure 12.11 shows how the α coefficients change as the margin changes.
As mentioned in the previous section, these coefficients are confined between
zero and the maximum penalty, as shown in the figure.

12.10. Conclusion

This chapter has focused on the fault detection and isolation of HVAC
systems under real-time working conditions. An online SVM FDI classifier
has been developed which can be trained during the operation of the HVAC
system. Unlike the offline method, the proposed approach can even detect
new unknown faults and use them to train the classifier in real time.
Furthermore, this online approach can train the FDI module more
efficiently by discarding unnecessary data (left-out vectors) and using only
the data with high priority for classification. Due to these properties, the
proposed algorithm can be implemented in a semi-unsupervised learning
framework. The simulation study indicates that the proposed approach can
efficiently detect and isolate typical HVAC faults. In the next step, we will
validate the proposed approach using real experimental data.
References

[1] S. W. Wang, Q. Zhou, and F. Xiao, A system-level fault detection and
diagnosis strategy for HVAC systems involving sensor faults, Energy and
Buildings. 42(4), 447–490, (2010).
[2] J. Liang and R. Du, Model-based fault detection and diagnosis of HVAC
systems using support vector machine method, International Journal of
Refrigeration. 30(6), 1104–1114, (2007).
[3] T. I. Salsbury and R. C. Diamond, Fault detection in HVAC systems using
model-based feedforward control, Energy and Buildings. 33(4), 403–415,
(2001).
[4] Q. Zhou, S. W. Wang, and Z. J. Ma, A model-based fault detection
and diagnosis strategy for HVAC systems, International Journal of Energy
Research. 33(10), 903–918, (2009).
[5] K. Zhang, B. Jiang, and P. Shi, A new approach to observer-based
fault-tolerant controller design for Takagi–Sugeno fuzzy systems with state
delay, Circuits Systems and Signal Processing. 28(5), 679–697, (2009).
[6] P. S. Kim and E. H. Lee, A new parity space approach to fault detection
for general systems, High Performance Computing and Communications,
Proceedings. 37(26), 535–540, (2007).
[7] Y. Dote, S. J. Ovaska, and X. Z. Gao, Fault detection using RBFN-
and AR-based general parameter methods, Proceeding of the 2001 IEEE
International Conference on Systems, Man, and Cybernetics. 1(5), 77–80,
(2002).
[8] H. Henao and G. A. Capolino, An improved signal processing-based fault
detection technique for induction machine drives, Proceeding of the 29th
Annual Conference of the IEEE Industrial Electronics Society (IECON’03).
1(3), 1386–1389, (2003).
[9] Z. M. Du, X. Q. Jin, and B. Fan, Fault diagnosis for sensors in HVAC systems
using wavelet neural network, Proceedings of the 4th Asian Conference on
Refrigeration and Air-Conditioning (ACRA 2009). 86(9), 409–415, (2010).
[10] C. H. Lo, Y. K. Wong, A. B. Rad, and K. L. Cheung, Fuzzy-genetic algorithm
for automatic fault detection in HVAC systems, Applied Soft Computing. 7
(1), 554–560, (2002).
[11] B. Arguello-Serrano and M. Velez-Reyes, Nonlinear control of a heating,
ventilating, and air conditioning system with thermal load estimation., IEEE
Transactions on Control Systems Technology. 7(1), 56–63, (1999).
[12] B. Tashtoush, M. Molhim, and M. Al-Rousan, Dynamic model of an HVAC
system for control analysis, Energy. 30(10), 1729–1745, (2005).
[13] L. Jian and D. Ruxu, Thermal comfort control based on neural network
for HVAC application, Proceedings of the IEEE Conference on Control
Applications. 11(9), 819–824, (2005).
[14] A. S. Cerqueira, D. D. Ferreiraa, M. V. Ribeiroa, and C. A. Duque,
Power quality events recognition using a SVM-based method, Electric Power
Systems Research. 78(9), 1546–1552, (2008).
[15] N. A. Syed, H. Liu, and K. K. Sung, Incremental learning with support vector
machines, Proceeding of the International Joint Conference on Artificial
Intelligence (IJCAI-99). 11(7), 143–148, (1999).
[16] G. Cauwenberghs and T. Poggio, Incremental and decremental support
vector machine learning, Advances in Neural Information Processing
Systems. 13(13), 409–415, (2001).
[17] V. Vapnik, The Nature of Statistical Learning Theory. (Springer–Verlag,
New York, 1995).

Index

activation function, 91 cardio respiratory kinetics, 256


AdaBoost, 4 central limit theorem, 108
additive white Gaussian noise, 98 chaotic, 90
algorithmic complexity, 104 co-prime, 94
arithmetic operation, 25 communications, 97
artificial neural network, 193, complex behaviors, 93
196–199, 202, 203 computational effort, 90
assembly, 183, 184, 186 corrected QT interval, 65, 67
automation, 183–185 correlation dimension, 116
autoregressive and moving
average model, 102
data acquisition, 229
autoregressive integrated moving
defuzzification, 70
average model, 120
digital volume pulse, 222
diploid model, 105, 120
Bayesian information criterion,
107 discrete time model, 278
Bayesian learning, 104 DMC, 271
benchmark functions, 28 DMC controller, 273, 281, 282,
bias, 257 284
biomedical classification results, downsampled, 89
226 dynamic benchmark function, 28
blood glucose, 62 dynamic environment, 28
blood glucose monitoring system, dynamic matrix, 276
66
body movement, 257, 263 electrocardiogram (ECG), 255,
body movements, 255 257, 263, 264
bolt tightening, 183, 185–187, 192 electrocardiography, 67
boundedness condition, 89, 90 electroencephalogram (EEG), 63
BPM, 272 embedding dimension, 109


false nearest neighbors, 109 least-square optimization, 273


FDI, 287 Levenberg–Marquardt, 106
feature extraction, 229 limit cycle, 89
feedforward neural network, 102 linear function, 220
first order model, 278 linearly separable, 89
first order process, 277 Lorenz system, 122
fitness function, 78
fixed point, 90 Matlab system identification
fuzzification, 68 toolbox, 264
fuzzy reasoning, 70 maximum log-likelihood, 107
fuzzy reasoning model, 65, 68 maximum-margin classifier, 214
mean square error, 113
Gaussian kernel algorithm, 116 membership function, 68
general real matrix, 128, 166 memoryless, 93
genetic programming, 27 minimum description length, 104,
111
hand-written graffiti recognition, minimum message length, 107
228 model horizon, 275, 280, 281
heart rate, 65, 67, 256, 272 model predictive control
high-dimensional feature spaces, approach, 256
257 monitoring, 183–187, 191–193,
HVAC, 287, 289 196
hyperplanes, 95, 257 Monte Carbo hypothesis, 104
hypoglycemia, 62, 67 move suppression coefficient, 277,
280
identical independent moving average, 102
distribution, 104 moving horizon, 280, 281
Ikeda map, 122 MPC, 271, 272, 274, 280
impending cardiac diseases, 256 multi-layer perceptron, 89
incremental, 298
incremental learning, 296 neural network, 102, 183, 191,
information criterion, 107 193, 194
information theoretic criterion, nonlinearly separable, 90
114 normal Gaussian distribution,
108
kernel functions, 219 normalization, 230
Kolmogorov–Gabor polynomial, null hypothesis, 115
25 number of updates, 89

online SVM, 287, 289, 297 ROC curve, 78


optimizer, 274, 277 Rössler system, 110
orthogonal least square, 26
overfitting, 107 sample time, 280, 281
oxygen saturation, 255, 257, 263 Schwarz information criterion,
107
particle swarm optimization, 23, screw fastening, 183
71 screw insertion, 183, 185, 187,
pattern recognition, 90 192, 193
perceptron, 89 screw tightening, 184
perceptron convergence theorem, self-organizing feature map, 120
90 sensitivity, 77
perceptron training algorithm, 91 series-wound diploid model, 120
performance objective function, sigma delta modulators, 93
276, 277 signal processing, 97
phase space reconstruction, 116 similarity value, 231
polynomial function, 220 soft-margin AdaBoost, 5
polynomial kernel, 261 soft-margin classifier, 216
polynomial model, 24 specificity, 77
population diversities, 31 spline function, 220
pose estimation, 4 surrogate data method, 104, 115
pre-defined exercise protocol, 256 SVM, 4, 214, 256, 257, 271, 287
prediction horizon, 280, 281 SVR, 218, 255, 257, 261, 278, 280
probability of mutation, 73, 75 swarm size, 75
pulse wave velocity, 222 symbolic dynamics, 90

quadratic dynamic matrix threshold, 91


control, 277 time divisional multiplexing
quantization, 93 systems, 97
time periodically varying, 90
rate of the convergence, 90 training feature vectors, 91
RBF, 220, 261
RBF kernel, 257, 261 velocity, 26
RBF neural network, 193–195,
203 wavelet mutation, 73
real anti-symmetric matrix, 143 weight vector, 257
real symmetric matrix, 127, 128
recurrent neural network (RNN), XOR nonlinear problem, 99
128
