
COMPUTATIONAL

INTELLIGENCE
AND ITS APPLICATIONS
Evolutionary Computation, Fuzzy Logic, Neural Network
and Support Vector Machine Techniques



COMPUTATIONAL
INTELLIGENCE
AND ITS APPLICATIONS
Evolutionary Computation, Fuzzy Logic, Neural Network
and Support Vector Machine Techniques

Editors

H. K. Lam
King’s College London, UK

S. H. Ling • H. T. Nguyen
University of Technology, Australia

Imperial College Press




Published by
Imperial College Press
57 Shelton Street
Covent Garden
London WC2H 9HE

Distributed by
World Scientific Publishing Co. Pte. Ltd.
5 Toh Tuck Link, Singapore 596224
USA office: 27 Warren Street, Suite 401-402, Hackensack, NJ 07601
UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE

British Library Cataloguing-in-Publication Data


A catalogue record for this book is available from the British Library.

COMPUTATIONAL INTELLIGENCE AND ITS APPLICATIONS


Evolutionary Computation, Fuzzy Logic, Neural Network and Support Vector
Machine Techniques

Copyright © 2012 by Imperial College Press


All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means,
electronic or mechanical, including photocopying, recording or any information storage and retrieval
system now known or to be invented, without written permission from the Publisher.

For photocopying of material in this volume, please pay a copying fee through the Copyright
Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to
photocopy is not required from the publisher.

ISBN-13 978-1-84816-691-2
ISBN-10 1-84816-691-5

Printed in Singapore.


Preface

Computational intelligence techniques are fast-growing and promising
research topics that have drawn a great deal of attention from researchers
for many years. This volume brings together many different aspects
of the current research on intelligence technologies such as neural
networks, support vector machines, fuzzy logics, evolutionary computing
and swarm intelligence. The combination of these techniques provides an
effective treatment toward some industrial and biomedical applications.
Most real-world problems are complex and even ill-defined. A lack of
knowledge about a problem, or an excess of information, makes classical
analytical methodologies difficult to apply and unlikely to yield reasonable results.
Computational intelligence techniques demonstrate superior learning and
generalization abilities on handling these complex and ill-defined problems.
By using appropriate computational intelligence techniques, some essential
characteristics and important information can be extracted to deal with
the problems. It has been shown that various computational intelligence
techniques have been successfully applied to a wide range of applications
from pattern recognition and system modeling to intelligent control
problems and biomedical applications.
This edited volume provides the state-of-the-art research on significant
topics in the field of computational intelligence. It presents fundamental
concepts and essential analysis of various computational techniques to offer
a systematic and effective tool for better treatment of different applications.
Simulation and experimental results are included to illustrate the design
procedure and the effectiveness of the approaches. With collective
experiences and the knowledge of leading researchers, the important
problems and difficulties are fully addressed, concepts are fully explained
and methodologies are provided to handle various problems.
This edited volume comprises 13 chapters which fall into 4 main
categories: (1) Evolutionary computation and its applications, (2) Fuzzy
logics and their applications, (3) Neural networks and their applications
and (4) Support vector machines and their applications.
Chapter 1 compares three machine learning methods, support vector
machines, AdaBoost and soft margin AdaBoost algorithms, to solve the
pose estimation problem. Experiment results show that both the support
vector machines-based method and soft margin AdaBoost-based method
are able to reliably classify frontal and pose images better than the original
AdaBoost-based method.
Chapter 2 proposes a particle swarm optimization for polynomial
modeling in a dynamic environment. The performance of the proposed
particle swarm optimization is evaluated by polynomial modeling based
on a set of dynamic benchmark functions. Results show that the proposed
particle swarm optimization can find significantly better polynomial models
than genetic programming.
Chapter 3 deals with the problem of restoration of color-quantized
images. A restoration algorithm based on particle swarm optimization
with multi-wavelet mutation is proposed to handle the problem. Simulation
results show that it can improve the quality of a half-toned color-quantized
image remarkably in terms of both signal-to-noise ratio improvement and
convergence rate, and that the subjective quality of the restored images is also
improved.
Chapter 4 deals with a non-invasive hypoglycemia detection for Type 1
diabetes mellitus (T1DM) patients based on the physiological parameters
of the electrocardiogram signals. An evolved fuzzy inference model is
developed for the classification of hypoglycemia, with its rules and
membership functions optimized using a hybrid particle swarm optimization
method with wavelet mutation.
Chapter 5 studies the limit cycle behavior of weights of perceptron. It
is proposed that the perceptron exhibiting the limit cycle behavior can
be employed for solving a recognition problem when downsampled sets
of bounded training feature vectors are linearly separable. Numerical
computer simulation results show that the perceptron exhibiting the limit
cycle behavior can achieve a better recognition performance compared to a
multi-layer perceptron.
Chapter 6 presents an alternative information theoretic criterion
(minimum description length) to determine the optimal architecture of
neural networks according to the equilibrium between the model parameters
and model errors. The proposed method is applied for modeling of various
data using neural networks for verification.

Chapter 7 solves eigen-problems of matrices using neural networks.
Several recurrent neural network models are proposed and each model is
expressed as an individual differential equation, with its analytic solution
being obtained. The convergence properties of the neural network models
are fully discussed based on the solutions to these differential equations and
the computing steps are designed toward solving the eigen-problems, with
numerical simulations being provided to evaluate each model’s effectiveness.
Chapter 8 considers methods for automating the insertion of self-tapping
screws. A new methodology for monitoring the insertion of self-tapping
screws is developed based on radial basis function neural networks, which
are able to generalize and to correctly classify unseen insertion signals as
belonging to a single insertion case. Both the computer simulation
and experimental results show that after a modest training period, the
neural network is able to correctly classify torque signature signals.
Chapter 9 applies the theory behind both support vector classification
and regression to deal with real-world problems. A classifier is developed
which can accurately estimate the risk of developing heart disease simply
from the signal derived from a finger-based pulse oximeter. The regression
example shows how support vector machines can be used to rapidly and
effectively recognize hand-written characters, in particular those of the
so-called graffiti character set.
Chapter 10 proposes a control-oriented modeling approach to depict
nonlinear behavior of heart rate response at both the onset and offset of
treadmill exercise to accurately regulate cardiovascular response to exercise
for the individual exerciser.
Chapter 11 explores control methodologies to handle the time-variant
behavior of heart rate dynamics at the onset and offset of exercise. The
effectiveness of the proposed modeling and control approach is demonstrated
by regulating the dynamic heart rate response to exercise through
simulation using MATLAB.
Chapter 12 investigates real-time fault detection and isolation for
heating, ventilation and air conditioning systems by using an online support
vector machine. Simulation studies are given to show the effectiveness of
the proposed online fault detection and isolation approach.
This edited volume covers state-of-the-art computational intelligence
techniques and the materials are suitable for post-graduate students and
researchers as a reference in engineering and science. Particularly, it
is more suitable for researchers working on computational intelligence
including evolutionary computation, fuzzy logics, neural networks and
support vector machines. Moreover, a wide range of applications using
computational intelligence techniques, such as biomedical problems, control
systems, forecasting, optimization problems, pattern recognition and
system modeling, are covered. These problems can be commonly found
in industrial engineering applications. So, this edited volume can be a
good reference, providing concepts, techniques, methodologies and analysis,
for industrial engineers applying computational intelligence to deal with
engineering problems.
We would like to thank all the authors for their contributions to
this edited volume. Thanks also to the staff members of the Division
of Engineering, King’s College London and the Faculty of Engineering
and Information Technology, University of Technology, Sydney for their
comments and support. The editor, H.K. Lam, would like to thank his
wife, Esther Wing See Chan, for her patience, understanding, support and
encouragement that make this work possible. Last but not least, we would
like to thank the publisher Imperial College Press for the publication of this
edited volume and the staff who have offered support during the preparation
of the manuscript.
The work described in this book was substantially supported by grants
from King’s College London and University of Technology, Sydney.

H.K. Lam
S.H. Ling
H.T. Nguyen

Contents

Preface v

Evolutionary Computation and its Applications 1

1. Maximal Margin Algorithms for Pose Estimation 3
   Ying Guo and Jiaming Li

2. Polynomial Modeling in a Dynamic Environment based on a Particle Swarm Optimization 23
   Kit Yan Chan and Tharam S. Dillon

3. Restoration of Half-toned Color-quantized Images Using Particle Swarm Optimization with Multi-wavelet Mutation 39
   Frank H.F. Leung, Benny C.W. Yeung and Y.H. Chan

Fuzzy Logics and their Applications 59

4. Hypoglycemia Detection for Insulin-dependent Diabetes Mellitus: Evolved Fuzzy Inference System Approach 61
   S.H. Ling, P.P. San and H.T. Nguyen

Neural Networks and their Applications 87

5. Study of Limit Cycle Behavior of Weights of Perceptron 89
   C.Y.F. Ho and B.W.K. Ling

6. Artificial Neural Network Modeling with Application to Nonlinear Dynamics 101
   Yi Zhao

7. Solving Eigen-problems of Matrices by Neural Networks 127
   Yiguang Liu, Zhisheng You, Bingbing Liu and Jiliu Zhou

8. Automated Screw Insertion Monitoring Using Neural Networks: A Computational Intelligence Approach to Assembly in Manufacturing 183
   Bruno Lara, Lakmal D. Seneviratne and Kaspar Althoefer

Support Vector Machines and their Applications 211

9. On the Applications of Heart Disease Risk Classification and Hand-written Character Recognition using Support Vector Machines 213
   S.R. Alty, H.K. Lam and J. Prada

10. Nonlinear Modeling Using Support Vector Machine for Heart Rate Response to Exercise 255
    Weidong Chen, Steven W. Su, Yi Zhang, Ying Guo, Nghir Nguyen, Branko G. Celler and Hung T. Nguyen

11. Machine Learning-based Nonlinear Model Predictive Control for Heart Rate Response to Exercise 271
    Yi Zhang, Steven W. Su, Branko G. Celler and Hung T. Nguyen

12. Intelligent Fault Detection and Isolation of HVAC System Based on Online Support Vector Machine 287
    Davood Dehestani, Ying Guo, Sai Ho Ling, Steven W. Su and Hung T. Nguyen

Index 305

PART 1

Evolutionary Computation and its Applications

Chapter 1

Maximal Margin Algorithms for Pose Estimation

Ying Guo and Jiaming Li


ICT Centre, CSIRO
PO Box 76, Epping, NSW 1710, Australia
ying.guo, [email protected]

This chapter compares three machine learning methods to solve
the pose estimation problem. The methods were based on support
vector machines (SVMs), AdaBoost and soft margin AdaBoost (SMA)
algorithms. Experiment results show that both the SVM-based method
and SMA-based method are able to reliably classify frontal and
pose images better than the original AdaBoost-based method. This
observation leads us to compare the generalization performance of these
algorithms based on their margin distribution graphs.
For a classification problem, feature selection is the first step, and
selecting better features results in better classification performance. The
feature selection method described in this chapter is easy and efficient.
Instead of resizing the whole facial image to a subwindow (as in the
common pose estimation process), the prior knowledge in this method is
only the eye location, hence simplifying its application. In addition, the
method performs extremely well even when some facial features such as
the nose or the mouth become partially or wholly occluded.
Experiment results show that the algorithm performs very well, with
a correct recognition rate around 98%, regardless of the scale, lighting
or illumination conditions associated with the face.

Contents

1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Pose Detection Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2.1 Procedure of pose detection . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2.2 Eigen Pose Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.3 Maximal Margin Algorithms for Classification . . . . . . . . . . . . . . . . . . 9
1.3.1 Support vector machines . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.3.2 Boosting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.3.3 Soft margin AdaBoost . . . . . . . . . . . . . . . . . . . . . . . . . . . 12


1.3.4 Margin distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13


1.4 Experiments and Discussions . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.4.1 Data preparation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.4.2 Data preprocessing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.4.3 Experiment results of Group A . . . . . . . . . . . . . . . . . . . . . . 14
1.4.4 Margin distribution graphs . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.4.5 Experiment results of Group B . . . . . . . . . . . . . . . . . . . . . . 18
1.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

1.1. Introduction

Research in face detection, face recognition and facial expression usually
focuses on using frontal view images. However, approximately 75% of faces
in normal photographs are non-frontal [1]. Significant improvements in
many computer vision algorithms dealing with human faces can be obtained
if we can achieve an accurate estimation of the pose of the face, hence pose
estimation is an important problem.
CSIRO has developed a real-time face capture and recognition system
(SQIS - System for Quick Image Search) [2], which can automatically
capture a face in a video stream and verify this against face images stored
in a database to inform the operator if a match occurs. It is observed that
the system performs better for the frontal images, so a method is required
to separate the frontal images from pose images for the SQIS system.
Pose detection is hard because large changes in orientation significantly
change the overall appearance of a face. Attempts have been made to use
view-based appearance models with a set of view-labeled appearances (e.g.
[3]). Support vector machines (SVMs) have been successfully applied
to model the appearance of human faces which undergo nonlinear change
across multiple views, and these have achieved good performance [4–7].
SVMs are linear classifiers that use the maximal margin hyperplane in a
feature space defined by a kernel function. Here, margin is a measure
of the generalization performance of a classifier. An example is classified
correctly if and only if it has a positive margin. A large value of the margin
indicates that there is little uncertainty in the classification of the point.
These concepts will be explained precisely in Section 1.3.
Recently, there has been great interest in ensemble methods for learning
classifiers, and in particular in boosting algorithms [8]; boosting is a general
method for improving the accuracy of a basic learning algorithm. The best
known boosting algorithm is the AdaBoost algorithm. These algorithms
have proved surprisingly effective at improving generalization performance
in a wide variety of domains, and for diverse base learners. For instance,
Viola and Jones demonstrated that AdaBoost can achieve both good speed
and performance in face detection [9]. However, research also showed that
AdaBoost often places too much emphasis on misclassified examples which
may just be noise. Hence, it can suffer from overfitting, particularly with
a highly noisy data set. The soft margin AdaBoost algorithm [10]
introduces a regularization term to achieve a soft margin, which allows
mislabeled samples to exist in the training data set and improves the
generalization performance of the original AdaBoost.
Researchers have pointed out the equivalence between the mathematical
programs underlying both SVMs and boosting, and formalized a
correspondence. For instance, in [11], a given hypothesis set in boosting can
correspond to the choice of a particular kernel in SVMs and vice versa. Both
SVMs and boosting are maximal margin algorithms, for which the generalization
performance of a hypothesis f can be bounded in terms of the margin with
respect to the training set. In this chapter, we will compare these maximal
margin algorithms, SVM, AdaBoost and SMA, in one practical pattern
recognition problem – pose estimation. We will especially analyze their
generalization performance in terms of the margin distribution.
This chapter is directed toward a pose detection system that can
separate frontal images (within ±25°) from pose images (greater angles),
under different scale, lighting or illumination conditions. The method uses
Principal Component Analysis (PCA) to generate a representation of the
facial image’s appearance, which is parameterized by geometrical variables
such as the angle of the facial image. The maximal margin algorithms are
then used to generate a statistical model, which captures variation in the
appearance of the facial angle.
All the experiments were evaluated on the CMU PIE database [12],
Weizmann database [13], CSIRO front database and CMU profile face
testing set [14]. It is demonstrated that both the SVM-based model and
SMA-based model are able to reliably classify frontal and pose images better
than the original AdaBoost-based model. This observation leads us to
discuss the generalization performance of these algorithms based on their
margin distribution graphs.

The remainder of this chapter is organized as follows. We proceed
in Section 1.2 to explain our approach and the feature extraction. In
Section 1.3, we analyze the maximal margin algorithms, including the
theoretic introduction of SVMs, AdaBoost and SMA. In Section 1.4, we will
present some experiment results, and analyze different algorithms’ margin
distribution. The conclusions are discussed in Section 1.5.

1.2. Pose Detection Algorithm

1.2.1. Procedure of pose detection

Murase and Nayar [15] proposed a continuous compact representation of
object appearance that is parameterized by geometrical variables such as
object pose. In this pose estimation approach, the principal component
analysis (PCA) is applied to make pose detection more efficient. PCA is a
well-known method for computing the directions of greatest variance for a
set of vectors. Here the training facial images are transformed into vectors
first. That is, each l × w pixel window is vectorised to a (l × w)-element
vector v ∈ RN , where N = l × w. By computing the eigenvectors of the
covariance matrix of a set of v, PCA determines an orthogonal basis, called
the eigen pose space (EPS), in which to describe the original vector v, i.e.
vector v is projected into a vector x in the subspace of eigenspace, where
x ∈ Rd and normally d ≪ N .
According to the pose angle θi of the training image vi , the
corresponding label yi of each vector xi is defined as
    yi = +1 if |θi| ≤ 25°, and yi = −1 otherwise.

The next task is to generate a decision function f (x) based on a set of
training samples (x1 , y1 ), . . . , (xm , ym ) ∈ Rd × R. We call x inputs from
the input space X and y outputs from the output space Y. We will use
the shorthand si = (xi , yi ), Xm = (x1 , . . . , xm ), Y m = (y1 , . . . , ym ) and
S m = (Xm , Y m ). The sequence S m (sometimes also S) is generally called
a training set, which is assumed to be distributed according to the product
probability distribution Pm (x, y). The maximal margin machine learning
algorithms, such as the SVMs, AdaBoost and SMA, are applied to solve
this binary classification problem. Figure 1.1 shows the procedure of the
whole pose detection approach.
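The procedure in Fig. 1.1 can be outlined in a few lines of Python. This is only an illustrative sketch of the approach described above (vectorise the windows, build the EPS by PCA, project, and label by pose angle), not the authors' implementation; the array shapes and function names are assumptions.

import numpy as np

def build_eps(windows, d=100):
    """Build the eigen pose space: stack the l*w pixel windows as rows,
    centre them, and keep the d leading principal directions."""
    V = np.asarray([w.ravel() for w in windows], dtype=float)
    mean = V.mean(axis=0)
    # Right singular vectors of the centred data are the eigen poses.
    _, _, Vt = np.linalg.svd(V - mean, full_matrices=False)
    return mean, Vt[:d]

def project(window, mean, eps):
    """Project a vectorised facial window into the eigen pose space."""
    return eps @ (window.ravel() - mean)

def pose_label(theta_deg):
    """Label: +1 for frontal (|angle| <= 25 degrees), -1 for pose images."""
    return 1 if abs(theta_deg) <= 25 else -1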

1.2.2. Eigen Pose Space

We chose 3003 facial images from the CMU PIE database under 13 different
poses to generate the EPS. The details of PIE data set can be seen in
[12], where the facial images are from 68 persons, and each person was
photographed using 13 different poses, 43 different illumination conditions
and 4 different expressions. Figure 1.2 shows the pose variations.
A mean pose image and set of orthonormal eigen poses are produced.
The first eight eigen poses are shown in Fig. 1.3. Figure 1.4 shows the


Fig. 1.1. Flow diagram of pose detection algorithm.

Fig. 1.2. The pose variation in the PIE database. The pose varies from full left profile
to full frontal and on to full right profile. The nine cameras in the horizontal sweep are
each separated by about 22.5o . The four other cameras include one above and one below
the central camera, and two in the corners of the room, which are typical locations for
surveillance cameras.


Fig. 1.3. Mean face and first eight eigen poses.

Fig. 1.4. Pose Eigenvalue Spectrum, normalized by the largest eigenvalue.

eigenvalue spectrum, which decreases rapidly. Good reconstruction is
achieved using at most 100 eigenvectors, as shown in Fig. 1.5, where the
sample face from the Weizmann database is compared with reconstructions
using increasing numbers of eigenvectors. The number of eigenvectors is
shown on the top of each reconstructed images. The first 100 eigenvectors
were saved to construct the EPS.


Fig. 1.5. Reconstructed image (Weizmann) using the eigen poses of the PIE data set.
Top: original image. Bottom: sample (masked and relocated) and its reconstruction
using the given numbers of eigenvectors. The number of eigenvectors is above each
reconstructed image.

1.3. Maximal Margin Algorithms for Classification

As shown in the procedure (Fig. 1.1), the projection of a facial image to
the EPS is classified as a function of the pose angle. A classifier is required
to learn the relationship. We will apply three maximal margin algorithms,
SVMs, AdaBoost and SMA.
Many binary classification algorithms, such as neural networks, support
vector machines and the combined classifiers produced by voting methods,
produce their output by thresholding a real-valued function. It is often
useful to deal with the continuous function directly, since the thresholded
output contains less information. The real-valued output can therefore be
interpreted as a measure of generalization performance in pattern recognition,
and this generalization performance is expressed in terms of the margin.
Next we define the margin of an example:

Definition 1.1 (Margin). Given an example s = (x, y) ∈ Rd × {−1, 1} and
a real-valued hypothesis f : Rd → R, the margin of s with respect to f is
defined as γ(s) := yf (x).

From this definition, we can see that an example is classified correctly if
and only if it has a positive margin. A large value of the margin indicates
that there is little uncertainty in the classification of the point. Thus,
we would expect that a hypothesis with large margins would have good
generalization performance. In fact, it can be shown that achieving a
large margin on the training set results in an improved bound on the
generalization performance [16, 17]. Geometrically, for “well behaved”
functions f , the distance between a point and the decision boundary will
roughly correspond to the magnitude of the margin at that point. The
learning algorithm which separates the samples with the maximal margin
hyperplane is called the maximal margin algorithm. There are two main
maximal margin algorithms: support vector machines and boosting.

1.3.1. Support vector machines


As a powerful classification algorithm, SVMs [18] generate the maximal
margin hyperplane in a feature space defined by a kernel function. For
the binary classification problem, the goal of an SVM is to find an optimal
hyperplane f (x) to separate the positive examples from the negative ones
with maximum margin. The points which lie on the hyperplane satisfy
⟨w, x⟩ + b = 0. The decision function which gives the maximum margin is
    f (x) = sign( Σ_{i=1}^{m} yi αi ⟨x, xi⟩ + b ).        (1.1)

The parameter αi is non-zero only for inputs xi closest to the hyperplane
f (x) (within the functional margin). All the other parameters αi are zero.
The inputs with non-zero αi are called support vectors. Notice f (x) depends
only on the inner products between inputs. Suppose the data are mapped
to some other inner product space S via a nonlinear map Φ : Rd → S. Now
the above linear algorithm will operate in S, which would only depend on
the data through inner products in S : ⟨Φ(xi ), Φ(xj )⟩. Boser, Guyon and
Vapnik [19] showed that there is a simple function k such that k(x, xi ) =
⟨Φ(x), Φ(xi )⟩, which can be evaluated efficiently. The function k is called a
kernel. So we only need to use the kernel k in the optimization algorithm
and never need to explicitly know what Φ is.

For linear SVM, the kernel k is just the inner product in the input space,
i.e. k(x, xi ) = ⟨x, xi ⟩, and the corresponding decision function is (1.1). For
nonlinear SVMs, there are a number of kernel functions which have been
found to provide good performance, such as polynomials and radial basis
function (RBF). The corresponding decision function for a nonlinear SVM
is
    f (x) = sign( Σ_{i=1}^{m} yi αi k(x, xi) + b ).        (1.2)
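As a concrete illustration of the decision functions (1.1) and (1.2), the sketch below evaluates f(x) from a set of support vectors, their labels and dual coefficients. It assumes the coefficients αi and the bias b have already been obtained by an SVM solver; the polynomial kernel of degree 3 with zero bias matches the setting used later in Section 1.4.

import numpy as np

def poly_kernel(x, xi, degree=3, bias=0.0):
    # k(x, xi) = (<x, xi> + bias)^degree
    return (np.dot(x, xi) + bias) ** degree

def svm_decision(x, support_vectors, labels, alphas, b, kernel=poly_kernel):
    """f(x) = sign( sum_i y_i * alpha_i * k(x, x_i) + b ), as in (1.2)."""
    s = sum(y * a * kernel(x, sv)
            for sv, y, a in zip(support_vectors, labels, alphas))
    return np.sign(s + b)

With k(x, xi) = ⟨x, xi⟩ the same function reduces to the linear decision function (1.1).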

1.3.2. Boosting
Boosting is a general method for improving the accuracy of a learning
algorithm. In recent years, many researchers have reported significant
improvements in the generalization performance using boosting methods
with learning algorithms such as C4.5 [20] or CART [21] as well as with
neural networks [22].
A number of popular and successful boosting methods can be seen as
gradient descent algorithms, which implicitly minimize some cost function
of the margin [23–25]. In particular, the popular AdaBoost algorithm [8]
can be viewed as a procedure for producing voted classifiers which minimize
the sample average of an exponential cost function of the training margins.
The aim of boosting algorithms is to provide a hypothesis which is a
voted combination of classifiers of the form sign (f (x)), with

    f (x) = Σ_{t=1}^{T} αt ht (x),        (1.3)

where αt ∈ R are the classifier weights, ht are base classifiers from some
class F and T is the number of base classifiers chosen from F. Boosting
algorithms take the approach of finding voted classifiers which minimize
the sample average of some cost function of the margin.
Nearly all the boosting algorithms iteratively construct the combination
one classifier at a time. So we will denote the combination of the first t
classifiers by ft , while the final combination of T classifiers will simply be
denoted by f .
In [17], Schapire et al. show that boosting is good at finding classifiers
with large margins in that it concentrates on those examples whose margins
are small (or negative) and forces the base learning algorithm to generate
good classifications for those examples. Thus, boosting can effectively find
a large margin hyperplane.
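To make the voted combination (1.3) concrete, the following sketch implements the standard AdaBoost weighting scheme of [8] with decision stumps as base classifiers. It is a generic illustration rather than the configuration of this chapter (the experiments in Section 1.4 use RBF networks as base learners); labels are assumed to be in {−1, +1}.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    """Standard AdaBoost: returns base classifiers h_t and weights alpha_t
    so that f(x) = sum_t alpha_t * h_t(x)."""
    m = len(y)
    w = np.full(m, 1.0 / m)                      # sample distribution
    classifiers, alphas = [], []
    for _ in range(T):
        h = DecisionTreeClassifier(max_depth=1)  # decision stump as base learner
        h.fit(X, y, sample_weight=w)
        pred = h.predict(X)
        err = np.sum(w * (pred != y))
        if err >= 0.5:                           # no better than chance: stop
            break
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))
        w *= np.exp(-alpha * y * pred)           # emphasize misclassified examples
        w /= w.sum()
        classifiers.append(h)
        alphas.append(alpha)
    return classifiers, alphas

def adaboost_predict(X, classifiers, alphas):
    f = sum(a * h.predict(X) for h, a in zip(classifiers, alphas))
    return np.sign(f)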

1.3.3. Soft margin AdaBoost

Although AdaBoost is remarkably successful in practice, AdaBoost’s cost
function often places too much emphasis on examples with large negative
margins. It was shown theoretically and experimentally that AdaBoost is
especially effective at increasing the margins of the training examples [17],
but the generalization performance of AdaBoost is not guaranteed. Hence
it can suffer from overfitting, particularly in high noise situations [26].
For the problem of approximating a smooth function from sparse data,
regularization techniques [27, 28] impose constraints on the approximating
set of functions. Rätsch et al. [10] show that versions of AdaBoost
modified to use regularization are more robust for noisy data. Mason et
al. [25] discuss the problem of approximating a smoother function based
on boosting algorithms and a regularization term was also suggested to be
added to the original cost function. This term represents the “mistrust” to
a noisy training sample, and allows it to be misclassified (negative margin)
in the training process. The final hypothesis f (x) obtained this way has
worse training error but better generalization performance compared to
f (x) of the original AdaBoost algorithm.
There is a broad range of choices for the regularization term, including
many of the popular generalized additive models used in some regularization
networks [29]. In the so-called soft margin AdaBoost, the regularization
term ηt (si ) is defined as
    ηt (si ) = ( Σ_{r=1}^{t} αr wr (si ) )²,

where w is the sample weight and t the training iteration index (cf. [10] for
a detailed description). The soft margin of a sample si is then defined as

γ̃(si ) := γ(si ) + ληt (si ), for i = 1, · · · , m,

where λ is the regularization constant and ηt (si ) the regularization term.
A large value of ηt (si ) for some patterns allows a larger soft margin
γ̃(si ). Here λ balances the trade-off between goodness-of-fit and simplicity
of the hypothesis. In the noisy case, SMA prefers hypotheses which do
not rely on only a few samples with smaller values of η(si ). So by using
the regularization method, AdaBoost is not changed for easily classifiable
samples, but only for the most difficult ones.
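The soft margin defined above can be computed directly from quantities that the boosting procedure already maintains: the ordinary margins γ(si), the classifier weights αr and the per-sample weights wr(si) from each iteration. The sketch below simply evaluates the two formulas; the names and the way the weight history is stored are assumptions, and λ would be chosen to balance goodness-of-fit against simplicity as described above.

import numpy as np

def soft_margins(margins, alpha_history, weight_history, lam):
    """margins:        gamma(s_i) = y_i * f_t(x_i), shape (m,)
    alpha_history:  [alpha_1, ..., alpha_t]
    weight_history: per-iteration sample weight vectors w_r(s_i), each shape (m,)
    Returns gamma~(s_i) = gamma(s_i) + lam * eta_t(s_i),
    where eta_t(s_i) = ( sum_r alpha_r * w_r(s_i) )**2."""
    eta = np.square(sum(a * w for a, w in zip(alpha_history, weight_history)))
    return np.asarray(margins) + lam * eta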

1.3.4. Margin distribution


In this chapter, we will compare these maximal margin algorithms in one
practical pattern recognition problem, pose estimation. Our comparison
will be based on the margin distribution of different algorithms over the
training data. The key idea of this analysis is the following: In order
to analyze the generalization error, one should consider more than just
the training error, which is the number of incorrect classifications in the
training set. One should also take into account the confidence of the
classifications [30]. We will use margin as a measure of the classification
confidence, for which it is possible to prove that an improvement in margin
on the training set guarantees an improvement in the upper bound on the
generalization error. For maximal margin algorithms, it is easy to see that
slightly perturbing one training example with a large margin is unlikely
to cause a change in the hypothesis f , and thus have little effect on the
generalization performance of f . Hence, the distribution of the margin
over the whole set of training examples is useful to analyze the confidence
of the classification. To visualize this distribution, we plot the fraction of
examples whose margin is at most γ as a function of γ, with the margins
normalized to [−1, 1], and analyze it in the experimental section. These
graphs are referred to as margin distribution graphs [17].
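A margin distribution graph can be produced directly from the real-valued outputs f(xi): sort the margins yi f(xi) (normalized to [−1, 1]) and plot the cumulative fraction of examples whose margin is at most γ. The following sketch, which assumes matplotlib, illustrates how the graphs analyzed in Section 1.4.4 can be generated; the classifier outputs f_svm, f_ada and f_sma in the usage comment are placeholders.

import numpy as np
import matplotlib.pyplot as plt

def margin_distribution(y, f_values):
    """Return sorted normalized margins and the cumulative fraction of
    examples whose margin is at most each value."""
    margins = y * f_values
    margins = margins / np.max(np.abs(margins))   # normalize to [-1, 1]
    margins = np.sort(margins)
    fraction = np.arange(1, len(margins) + 1) / len(margins)
    return margins, fraction

# Usage (placeholders): overlay the graphs of several trained classifiers.
# for name, f_vals in {"SVM": f_svm, "AdaBoost": f_ada, "SMA": f_sma}.items():
#     g, frac = margin_distribution(y_train, f_vals)
#     plt.plot(g, frac, label=name)
# plt.xlabel("margin"); plt.ylabel("cumulative distribution"); plt.legend()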

1.4. Experiments and Discussions

1.4.1. Data preparation


The data used for the experiments were collected from four databases: the
CMU PIE database, the Weizmann database, the CSIRO front database
and the CMU profile face database. These contain a total of 41567 faces, of
which 22221 are frontal images, and 19346 are pose images. The training
samples were randomly selected from the whole data set, and the rest to
test the generalization performance of the technique. The training and
testing were carried out based on the procedure for pose detection shown
in Fig. 1.1.
In experiment Group A, we compared the performance of SVM,
AdaBoost and soft margin AdaBoost on a small training sample set (m =
500), in order to save the computation time. Their margin distribution
graphs were generated based on the training performance, which explains
why SVM and SMA perform better than AdaBoost. In experiment Group
B, we did experiments on a much larger training set (m = 20783) for the

SVM- and SMA-based techniques, in order to show the best performance
these two techniques can achieve.
For the SVM, we chose the polynomial kernel with degree d = 3 and
bias b = 0. For AdaBoost and SMA, we used radial basis function (RBF)
networks with adaptive centers as the base classifier, and the number of base
classifiers is T = 800. The effect of using different numbers of significant
Principal Components (PCs), i.e. the signal components along the principal
directions in the EPS, is also observed in the experiments. We will define
the number of PCs as the PC-dimension. We tested the PC-dimensions
between 10 and 80 in steps of 10. All the experiments were repeated five
times, and the results were averaged over the five repeats.
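The evaluation protocol just described (PC-dimensions from 10 to 80 in steps of 10, five repetitions per setting, averaged test error) can be expressed as a simple experiment loop. The sketch below is only an outline of that protocol; make_split, train_classifier and test_error stand for the data split, learning algorithm and error measure being compared, and are not functions defined in this chapter.

import numpy as np

def run_group_a(make_split, train_classifier, test_error,
                pc_dims=range(10, 81, 10), repeats=5):
    """For each PC-dimension, train on `repeats` random train/test splits and
    report the mean and standard deviation of the test error (cf. Table 1.1)."""
    results = {}
    for d in pc_dims:
        errors = []
        for rep in range(repeats):
            X_train, y_train, X_test, y_test = make_split(rep)
            clf = train_classifier(X_train[:, :d], y_train)
            errors.append(test_error(clf, X_test[:, :d], y_test))
        results[d] = (np.mean(errors), np.std(errors))
    return results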

1.4.2. Data preprocessing


As the prior knowledge of the system is the face’s eye location, the
x-y positions of both eyes were hand-labeled. The face images were
normalized for rotation, translation and scale according to eye location. The
subwindow of the face is then cropped using the normalized eyes distance.
Because the new eye location is fixed without knowledge of the face’s pose,
the subwindow covers a different facial region for different poses. For a
large-angle pose face, the cropped subwindow cannot include the whole
face. Figure 1.6 shows such face region extraction for nine poses of the PIE
database with the corresponding angle on top.


Fig. 1.6. Normalized and cropped face region for nine different pose angles. The degree
of the pose is above the image. The subwindows only include eyes and part of nose for
the ±85o pose images.
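The normalization described above can be implemented as a similarity transform computed from the two labelled eye positions. The sketch below (using OpenCV) is one possible realization; the target eye coordinates and the 64×64 subwindow size are illustrative placeholders, since the exact values used in the chapter are not stated.

import cv2
import numpy as np

def eye_similarity(left_eye, right_eye, target_left, target_right):
    """2x3 affine matrix of the rotation + uniform scale + translation that
    maps the labelled eye pair onto fixed target coordinates."""
    v_src = np.asarray(right_eye, dtype=float) - np.asarray(left_eye, dtype=float)
    v_dst = np.asarray(target_right, dtype=float) - np.asarray(target_left, dtype=float)
    scale = np.hypot(*v_dst) / np.hypot(*v_src)
    angle = np.arctan2(v_dst[1], v_dst[0]) - np.arctan2(v_src[1], v_src[0])
    c, s = scale * np.cos(angle), scale * np.sin(angle)
    R = np.array([[c, -s], [s, c]])
    t = np.asarray(target_left, dtype=float) - R @ np.asarray(left_eye, dtype=float)
    return np.hstack([R, t[:, None]])

def normalize_face(image, left_eye, right_eye, out_size=(64, 64)):
    """Rotate, scale and translate so the eyes land on fixed positions
    (placeholder coordinates), then crop the fixed-size subwindow."""
    M = eye_similarity(left_eye, right_eye, (18, 24), (46, 24))
    return cv2.warpAffine(image, M.astype(np.float32), out_size)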

1.4.3. Experiment results of Group A


We compared SVM, original AdaBoost and SMA-based techniques in this
group of experiments. In Table 1.1, the average generalization performance
(with standard deviation) over the range of PC-dimensions is given. Our
experiments show that the AdaBoost results are in all cases at least
2 ∼ 4% worse than both the SVM and SMA results. Figure 1.7 shows

Table 1.1. Performance of pose classifiers on different PC-dimensions.


PC-dim     testErr of SVM     testErr of Ada     testErr of SMA     Diff between Ada and SMA
10 12.02 ± 1.34% 13.72 ± 1.34% 11.63 ± 0.71% 2.09%
20 9.07 ± 0.56% 10.71 ± 0.81% 7.26 ± 0.38% 3.45%
30 6.49 ± 0.21% 8.19 ± 0.63% 6.06 ± 0.47% 2.13%
40 5.72 ± 0.64% 7.87 ± 0.92% 5.05 ± 0.19% 2.82%
50 5.03 ± 0.43% 8.20 ± 0.69% 5.14 ± 0.21% 3.06%
60 4.79 ± 0.39% 7.32 ± 1.04% 4.63 ± 0.61% 2.69%
70 4.83 ± 0.12% 7.98 ± 0.85% 4.49 ± 0.33% 3.52%
80 4.60 ± 0.26% 8.03 ± 0.74% 4.87 ± 0.27% 3.16%

one comparison of AdaBoost and SMA with PC-dimension n = 30. The
training error of the AdaBoost-based technique converges to zero after only
five iterations, but the testing error clearly shows the overfitting. For the
SMA-based technique, because of the regularization term, the training error
is not zero in most of the iterations, but the testing error keeps decreasing.

1.4.4. Margin distribution graphs


We will use the margin distribution graphs to explain why both SVM
and SMA perform better than AdaBoost. Figure 1.8 shows the margin
distribution for SVM, AdaBoost and SMA, indicated by solid, dashed and
dotted lines, respectively. The margins of all AdaBoost examples are positive, which
means all of the training data are classified correctly. For SVM and SMA,
about 5% of the training data are classified incorrectly, while the margins
of more than 90% of the points are bigger than those of AdaBoost.
We know that a large positive margin can be interpreted as a “confident”
correct classification. Hence the generalization performance of SVM and
SMA should be better than AdaBoost, which is observed in the experiment
results (Table 1.1). AdaBoost tends to increase the margins associated with
examples and converge to a margin distribution in which all examples try
to have positive margins. Hence, AdaBoost can improve the generalization
error of a classifier when there is no noise. But when there is noise in
the training set, AdaBoost generates an overfitting classifier by trying to
classify the noisy points with positive margins. In fact, in most cases,
AdaBoost will modify the training sample distribution to force the learner
to concentrate on its errors, and will thereby force the learner to concentrate
on learning noisy examples. AdaBoost is hence called a “hard margin”
algorithm. For noisy data, there is always a trade-off between believing
in the data or mistrusting it, because the data point could be mislabeled.


Fig. 1.7. Training and testing error graphs of original AdaBoost and SMA when training
set size m = 500, and PC-dimension of the feature vector n = 30. The testing error of
AdaBoost overfits to 8.19% while the testing error of SMA converges to 6.06%.


Fig. 1.8. Margin distribution graphs of SVM, AdaBoost and SMA on training set when
training sample size m = 500, PC-dimension is 30, 50, 80 respectively. For SVM and
SMA, although about 1 ∼ 2% of the training data are classified incorrectly, the margins
of more than 90% of the points are bigger than those of AdaBoost.

So the “hard margin” condition needs to be relaxed. For SVM, the original
SVM algorithm [19] had poor generalization performance on noisy data
as well. After the “soft margin” was introduced [31], SVMs tended to
find a smoother classifier and converge to a margin distribution in which
some examples may have negative margins, and have achieved much better
generalization results. The SMA algorithm is similar: it tends to find a
smoother classifier because of the balancing influence of the regularization
term. So, SVM and SMA are called “soft margin” algorithms.


Fig. 1.9. The testing error of SVM and SMA versus the PC-dimension of the feature
vector. As the PC-dimension increases, the testing errors decrease. The testing error is
as low as 1.80% (SVM) and 1.72% (SMA) when the PC-dimension is 80.

1.4.5. Experiment results of Group B


From the experiment results in Section 1.4.3, we observed that both
SVM- and SMA-based approaches perform better than AdaBoost, and we
analyzed the reason in Section 1.4.4. In order to show the best performance
of SVM and SMA techniques, we trained the SMA on more training samples
(m = 20783), and achieved competitive results on the pose estimation problem.
Figure 1.9 shows the testing errors of the SVM- and SMA-based techniques

versus the PC-dimension of the feature vector. The testing error is related
to the PC-dimension of the feature vector. The performance in pose
detection is better for a higher-dimensional PCA representation. In particular,
the testing error is as low as 1.80% and 1.72% when the PC-dimension is 80.
On the other hand, a low dimensional PCA representation can already
provide satisfactory performance, for instance, the testing errors are 2.27%
and 3.02% when the PC-dimension is only 30.

Fig. 1.10. The number of support vectors versus the dimension of the feature vectors.
As the dimension increases, the number of support vectors increases as well.

Fig. 1.11. Test images correctly classified as profile. Notice these images include
different facial appearance, expression, significant shadows or sun-glasses.

It is observed that the number of support vectors increases with the
PC-dimension of the feature vectors for the SVM-based technique (Fig. 1.10).

The proportion of support vectors to the size of the whole training set is less
than 5.5%. Figure 1.11 shows some examples of the correctly classified pose
images, which include different scale, lighting or illumination conditions.

1.5. Conclusion

The main strength of the present method is the ability to estimate the
pose of the face efficiently by using the maximal margin algorithms. The
experimental results and the margin distribution graphs show that the “soft
margin” algorithms allow higher training errors to avoid the overfitting
problem of “hard margin” algorithms, and achieve better generalization
performance. The experimental results show that SVM- and SMA-based
techniques are very effective for the pose estimation problem. The testing
error on more than 20000 testing images was as low as 1.73%, where the
images cover different facial features such as beards, glasses and a great
deal of variability including shape, color, lighting and illumination.
In addition, because the only prior knowledge of the system is the
eye locations, the performance is extremely good, even when some facial
features such as the nose or the mouth become partially or wholly
occluded. For our current interest in improving the performance of our
face recognition system (SQIS), as the eye location is already automatically
determined, this new pose detection method can be directly incorporated
into the SQIS system to improve its performance.

References

[1] A. Kuchinsky, C. Pering, M. Creech, D. Freeze, B. Serra, and J. Gwizdka,
Fotofile: A consumer multimedia organization and retrieval system, CHI
99 Conference Proceedings. pp. 496–503, (1999).
[2] R. Qiao, J. Lobb, J. Li, and G. Poulton, Trials of the CSIRO face recognition
system in a video surveillance environment, Proceedings of the Sixth Digital
Image Computing Techniques and Application. pp. 246–251, (2002).
[3] S. Baker, S. Nayar, and H. Murase, Parametric feature detection,
International Journal of Computer Vision. 27(1), 27–50, (1998).
[4] S. Gong, S. McKenna, and J. Collins, An investigation into face
pose distributions, IEEE International Conference on Face and Gesture
Recognition. pp. 265–270, (1996).
[5] J. Ng and S. Gong, Performing multi-view face detection and pose
estimation using a composite support vector machine across the view sphere,
Proceedings of the IEEE International Workshop on Recognition, Analysis,
and Tracking of Faces and Gestures in Real-Time Systems. pp. 14–21, (1999).

[6] Y. Li, S. Gong, and H. Liddell, Support vector regression and classification
based multi-view face detection and recognition, Proceedings of IEEE
International Conference On Automatic Face and Gesture Recognition. pp.
300–305, (2000).
[7] Y. Guo, R.-Y. Qiao, J. Li, and M. Hedley, A new algorithm for face pose
classification, Image & Vision Computing New Zealand IVCNZ. (2002).
[8] Y. Freund and R. E. Schapire, A decision-theoretic generalization of on-line
learning and an application to boosting, Journal of Computer and System
Sciences. 55(1), 119–139, (1997).
[9] P. Viola and M. Jones, Rapid object detection using a boosted cascade
of simple features, IEEE Conference on Computer Vision and Pattern
Recognition. (2001).
[10] G. Rätsch, T. Onoda, and K.-R. Müller, Soft margins for AdaBoost, Machine
Learning. 42(3), 287–320, (2001).
[11] G. Rätsch, B. Schölkopf, S. Mika, and K.-R. Müller. SVM and boosting:
One class. Technical report, NeuroCOLT, (2000).
[12] T. Sim, S. Baker, and M. Bsat, The CMU Pose, Illumination, and Expression
(PIE) database, Proceedings of the IEEE International Conference on
Automatic Face and Gesture Recognition. (2002).
[13] Y. Moses, S. Ullman, and S. Edelman, Generalization to novel images in
upright and inverted faces, Perception. 25, 443–462, (1996).
[14] H. Schneiderman and T. Kanade, A statistical method for 3D object
detection applied to faces and cars, Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition. 1, 746–751, (2000).
[15] H. Murase and S. K. Nayar, Visual learning and recognition of 3D objects
from appearance, International Journal of Computer Vision. 14(1), 5–24,
(1995).
[16] M. Anthony and P. Bartlett, A Theory of Learning in Artificial Neural
Networks. (Cambridge University Press, 1999).
[17] R. Schapire, Y. Freund, P. L. Bartlett, and W. S. Lee, Boosting the
margin: A new explanation for the effectiveness of voting methods, Annals
of Statistics. 26(5), 1651–1686, (1998).
[18] V. Vapnik, The Nature of Statistical Learning Theory. (Springer, NY, 1995).
[19] B. E. Boser, I. M. Guyon, and V. N. Vapnik, A training algorithm for
optimal margin classifiers, Proceedings of the 5th Annual ACM Workshop
on Computational Learning Theory. pp. 144–152, (1992).
[20] J. R. Quinlan, C4.5: Programs for Machine Learning. (Morgan Kaufmann,
San Mateo, CA, 1993).
[21] L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and
Regression Trees. (Wadsworth International Group, Belmont, CA, 1984).
[22] Y. Bengio, Y. LeCun, and D. Henderson. Globally trained handwritten word
recognizer using spatial representation, convolutional neural networks and
hidden Markov models. In eds. J. Cowan, G. Tesauro, and J. Alspector,
Advances in Neural Information Processing Systems, vol. 5, pp. 937–944.
(Morgan Kaufmann, San Mateo, CA, 1994).

[23] L. Breiman, Prediction games and arcing algorithms, Neural Computation.
11(7), 1493–1518, (1999).
[24] J. Friedman, T. Hastie, and R. Tibshirani, Additive logistic regression: a
statistical view of boosting, Annals of Statistics. 28(2), 400–407, (2000).
[25] L. Mason, J. Baxter, P. Bartlett, and M. Frean, Functional Gradient
Techniques for Combining Hypotheses. (MIT Press, Cambridge, MA, 2000).
[26] A. Grove and D. Schuurmans, Boosting in the limit: Maximizing the margin
of learned ensembles, Proceedings of the Fifteenth National Conference on
Artificial Intelligence. pp. 692–699, (1998).
[27] A. N. Tikhonov, Solution of incorrectly formulated problems and the
regularization method, Soviet Mathematics. Doklady. 4, 1035–1038, (1963).
[28] A. N. Tikhonov and V. Y. Arsenin, Solution of Ill–Posed Problems.
(Winston, Washington, DC, 1977).
[29] F. Girosi, M. Jones, and T. Poggio, Regularization theory and neural
networks architectures, Neural Computation. 7(2), 219–269, (1995).
[30] Y. Guo, P. L. Bartlett, J. Shawe-Taylor, and R. C. Williamson, Covering
numbers for support vector machines, IEEE Transactions on Information
Theory. 48, 239–250, (2002).
[31] C. Cortes and V. Vapnik, Support vector networks, Machine Learning. 20,
273–297, (1995).

Chapter 2

Polynomial Modeling in a Dynamic Environment based on
a Particle Swarm Optimization

Kit Yan Chan and Tharam S. Dillon


Digital Ecosystems and Business Intelligence Institute,
Curtin University of Technology, Perth, Australia
[email protected]

In this chapter, a particle swarm optimization (PSO) is proposed for
polynomial modeling in a dynamic environment. The basic operations
of the proposed PSO are identical to the ones of the original PSO
except that elements of particles represent arithmetic operations and
polynomial variables of polynomial models. The performance of the
proposed PSO is evaluated by polynomial modeling based on a set of
dynamic benchmark functions in which their optima are dynamically
moved. Results show that the proposed PSO can find significantly better
polynomial models than genetic programming (GP) which is a commonly
used method for polynomial modeling.

Contents

2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.2 PSO for Polynomial Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.3 PSO vs. GP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

2.1. Introduction

Particle swarm optimization is inspired by the social behaviors of animals
like fish schooling and bird flocking [1]. The swarm consists of a population
of particles which aim to look for the optimal solutions, and the movement
of each particle is based on both its best previous position recorded so far
from the previous generations and the position of the best particle among
all the particles. Diversity of the particles can be kept along the search
by selecting suitable PSO parameters which provide a balance between the

global exploration based on the position of the best particle among all the
particles in the swarm, and local exploration based on each particle's best
previous position. Each particle can converge gradually toward the position
of the best particle among all the particles in the swarm and its own best
previous position recorded so far. Kennedy and
Eberhart [2] demonstrated that PSO can solve hard optimization problems
with satisfactory results.
However, there is no research involving PSO on polynomial modeling in
a dynamic environment, to the best of the authors' knowledge. This can be
attributed to the fact that the polynomials consist of arithmetic operations
and polynomial variables, which are difficult to represent by the commonly
used continuous version of PSO (i.e. the original PSO [1]). In fact, recent
literature shows that PSO has been applied to solve various combinatorial
problems with satisfactory results [3–7] and the particles of the PSO-based
algorithm can represent arithmetic operations and polynomial variables
in polynomial models. These reasons motivated the authors to apply a
PSO-based algorithm to the generation of polynomial models.
In this chapter, a PSO has been proposed for polynomial modeling in
a dynamic environment. The basic operations of the proposed PSO are
identical to those of the original PSO [1] except that elements of particles of
the PSO are represented by arithmetic operations and polynomial variables
in polynomial models. To evaluate the performance of the PSO polynomial
modeling in a dynamic environment, we compared the PSO with the GP,
which is a commonly used method on polynomial modeling [8–11]. Based
on the benchmark functions in which their optima are dynamically moved,
polynomial modeling in dynamic environments can be evaluated. This
evaluation indicates that the proposed PSO significantly outperforms the
GP in polynomial modeling in a dynamic environment.

2.2. PSO for Polynomial Modeling

The polynomial model of an output response relating to polynomial


variables can be described as follows:
$$ y = f(x_1, x_2, \ldots, x_m) \qquad (2.1) $$

where y is the output response, x_j, j = 1, 2, ..., m, is the j-th polynomial
variable and f is a functional relationship, which represents the polynomial
model. The polynomial model f can be found from a set of experimental data
D = {(x^D(i), y^D(i))}, i = 1, 2, ..., N_D, where x^D(i) = (x^D_1(i), x^D_2(i), ..., x^D_m(i)) ∈ R^m
is the i-th experimental input and y^D(i) ∈ R is the corresponding value of the
i-th response output. With D, f can be generated as a high-order,
high-dimensional Kolmogorov–Gabor polynomial:

$$ y = a_0 + \sum_{i_1=1}^{m} a_{i_1} x_{i_1}
        + \sum_{i_1=1}^{m}\sum_{i_2=1}^{m} a_{i_1 i_2} x_{i_1} x_{i_2}
        + \sum_{i_1=1}^{m}\sum_{i_2=1}^{m}\sum_{i_3=1}^{m} a_{i_1 i_2 i_3} x_{i_1} x_{i_2} x_{i_3}
        + \ldots + a_{12\ldots m} \prod_{i_m=1}^{m} x_{i_m} \qquad (2.2) $$

which is a universal format for polynomial modeling if the number of terms


in (2.2) is large enough [12]. The polynomial model varies if the polynomial
coefficients vary dynamically. In this chapter, a PSO is proposed to
generate polynomial models while experimental data is available. The PSO
uses a number of particles, which constitute a swarm, and each particle
represents a polynomial model. Each particle traverses the search space
looking for the optimal polynomial model.
Each particle in the PSO represents the polynomial variables (x_1, x_2, ...,
x_m) and the arithmetic operations ('+', '−' and '∗') in the polynomial
model defined in (2.2). The i-th particle at generation t is defined as
P^t_i = (P^t_{i,1}, P^t_{i,2}, ..., P^t_{i,N_p}), where N_p > m; i = 1, 2, ..., N_pop; N_pop is the
number of particles in the swarm; P^t_{i,k} is the k-th element of the i-th
particle at the t-th generation and lies in the range between zero and one,
i.e. P^t_{i,k} ∈ [0, 1]. The elements in odd positions (i.e. P^t_{i,1}, P^t_{i,3}, P^t_{i,5}, ...)
represent the polynomial variables, and the elements in even positions
(i.e. P^t_{i,2}, P^t_{i,4}, P^t_{i,6}, ...) represent the arithmetic operations. For odd k, if
0 < P^t_{i,k} ≤ 1/(m+1), no polynomial variable is represented by the element
P^t_{i,k}; if l/(m+1) < P^t_{i,k} ≤ (l+1)/(m+1) with l > 0, P^t_{i,k} represents the
l-th polynomial variable, x_l.
In the polynomial model, '+', '−' and '∗' are the only three arithmetic
operations. For even k, if 0 < P^t_{i,k} ≤ 1/3, 1/3 < P^t_{i,k} ≤ 2/3 or
2/3 < P^t_{i,k} ≤ 1, the element P^t_{i,k} represents the arithmetic operation
'+', '−' or '∗' respectively. For example, the following particle with seven
elements is used to represent a polynomial model with four polynomial
variables (i.e. x_1, x_2, x_3 and x_4):

    Element:  P^t_{i,1}  P^t_{i,2}  P^t_{i,3}  P^t_{i,4}  P^t_{i,5}  P^t_{i,6}  P^t_{i,7}
    Value:       0.2        0.4        0.9        0.9        0.4        0.1        0.1
0.2 0.4 0.9 0.9 0.4 0.1 0.1

The elements in the particle fall within the following ranges (with m = 4):

    P^t_{i,1}:  0   < 0.2 ≤ 1/5
    P^t_{i,2}:  1/3 < 0.4 ≤ 2/3
    P^t_{i,3}:  4/5 < 0.9 ≤ 5/5
    P^t_{i,4}:  2/3 < 0.9 ≤ 3/3
    P^t_{i,5}:  2/5 < 0.4 ≤ 3/5
    P^t_{i,6}:  0   < 0.1 ≤ 1/3
    P^t_{i,7}:  0   < 0.1 ≤ 1/5

Therefore this particle represents the following polynomial model:

    P^t_{i,1}  P^t_{i,2}  P^t_{i,3}  P^t_{i,4}  P^t_{i,5}  P^t_{i,6}  P^t_{i,7}
       0          −          x_4        ∗          x_2        +          0

which is equivalent to f_i(x) = 0 − x_4 ∗ x_2 + 0, or f_i(x) = −x_4 ∗ x_2.
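For illustration only, this decoding can be sketched in a few lines of Python; the function and variable names below are assumptions made for this sketch and are not part of the original work.

    import math

    # A minimal sketch (assumed names) of the particle-to-polynomial decoding
    # described above: odd elements select variables, even elements select
    # one of the three arithmetic operations.
    def decode_particle(particle, m):
        ops = ['+', '-', '*']
        tokens = []
        for k, p in enumerate(particle, start=1):
            if k % 2 == 1:
                # odd position: p in (l/(m+1), (l+1)/(m+1)] selects x_l; l = 0 means "no variable"
                l = max(0, math.ceil(p * (m + 1)) - 1)
                tokens.append('0' if l == 0 else 'x%d' % l)
            else:
                # even position: thirds of (0, 1] select '+', '-' or '*'
                idx = min(max(math.ceil(p * 3) - 1, 0), 2)
                tokens.append(ops[idx])
        return ' '.join(tokens)

    print(decode_particle([0.2, 0.4, 0.9, 0.9, 0.4, 0.1, 0.1], m=4))

Values that fall exactly on an interval boundary (such as 0.2 and 0.4 above) depend on how the boundary is resolved; the chapter's worked example treats 0.2 as "no variable" and 0.4 as x_2.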
The polynomial coefficients a0 and a1 are determined after the structure
of the polynomial is generated, where the number of coefficients of the
polynomial is two. The completed function can be represented as follows:
f_i(x) = a_0 − a_1 ∗ x_4 ∗ x_2. In this research, the polynomial coefficients are
determined by an orthogonal least squares algorithm [13, 14], which has
been demonstrated to be effective in determining polynomial coefficients in
models generated by the GP [15]. Details of the orthogonal least squares
algorithm can be found in [13, 14]. Each particle is evaluated based on the
mean absolute error (MAE), which can reflect the differences between the
predicted values of the polynomial model and the actual values of the data
sets. The mean absolute error of the i-th particle at the t-th generation
MAE^t_i can be calculated based on (2.3):

$$ MAE_i^t = 100\% \times \frac{1}{N_D} \sum_{j=1}^{N_D} \left| \frac{y^D(j) - f_i^t\big(x^D(j)\big)}{y^D(j)} \right| \qquad (2.3) $$

where f^t_i is the polynomial model represented by the i-th particle P^t_i at
the t-th generation, (x^D(j), y^D(j)) is the j-th training data set, and N_D is
the number of training data sets used for developing the polynomial model.
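For illustration, (2.3) translates directly into a short Python routine; `poly` here stands for the decoded polynomial of a particle, and the names are assumptions of this sketch rather than the authors' code.

    # Sketch of the MAE fitness in (2.3): mean absolute relative error (in %)
    # between the data outputs and the particle's polynomial predictions.
    # Assumes y != 0 for all training outputs, as implied by (2.3).
    def mae_fitness(poly, x_data, y_data):
        n_d = len(y_data)
        total = sum(abs((y - poly(x)) / y) for x, y in zip(x_data, y_data))
        return 100.0 * total / n_d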
The velocity v^t_{i,k} (corresponding to the flight velocity in a search space) and
the k-th element of the i-th particle at the t-th generation, P^t_{i,k}, are calculated
by the following formulae:

$$ v_{i,k}^t = K\big(v_{i,k}^{t-1} + \phi_1 \times rand() \times (pbest_{i,k} - P_{i,k}^{t-1}) + \phi_2 \times rand() \times (gbest_k - P_{i,k}^{t-1})\big) \qquad (2.4) $$

$$ P_{i,k}^t = P_{i,k}^{t-1} + v_{i,k}^t \qquad (2.5) $$

where
    pbest_i = [pbest_{i,1}, pbest_{i,2}, ..., pbest_{i,N_p}],
    gbest = [gbest_1, gbest_2, ..., gbest_{N_p}],
    k = 1, 2, ..., N_p,

where the best previous position of a particle, recorded so far from the
previous generations, is represented as pbest_i; the position of the best particle
among all the particles is represented as gbest; rand() returns a uniform
random number in the range of [0,1]; ϕ_1 and ϕ_2 are acceleration constants;
and K is a constriction factor, derived from a stability analysis of (2.4) and
(2.5), which ensures that the system converges but not prematurely [16].
Mathematically, K is a function of ϕ_1 and ϕ_2 as reflected in the following
equation:

$$ K = \frac{2}{\left| 2 - \phi - \sqrt{\phi^2 - 4\phi} \right|} \qquad (2.6) $$

where ϕ = ϕ_1 + ϕ_2 and ϕ > 4.


The PSO utilizes pbesti and gbest to modify the current search point to
avoid the particles moving in the same direction, but to converge gradually
toward pbesti and gbest. In (2.4), the particle velocity is limited by a
maximum value vmax . The parameter vmax determines the resolution with
which regions are to be searched between the present position and the target
position. This limit enhances the local exploration of the problem space
and it realistically simulates the incremental changes of human learning.
If vmax is too high, particles might fly past good solutions. If vmax is too
small, particles may not explore sufficiently beyond local solutions. vmax
was often set at 10% to 20% of the dynamic range of the element on each
dimension.
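As a hedged sketch of the update rules (2.4)–(2.5) with the velocity limit described above, one particle update could be coded as follows; the value K ≈ 0.73 corresponds to ϕ_1 = ϕ_2 = 2.05 in (2.6), and all names are illustrative assumptions rather than the authors' implementation.

    import random

    def pso_step(p, v, pbest, gbest, K=0.7298, phi1=2.05, phi2=2.05, vmax=0.2):
        """One update of a single particle per (2.4)-(2.5), with velocity clamping."""
        new_p, new_v = [], []
        for k in range(len(p)):
            vk = K * (v[k]
                      + phi1 * random.random() * (pbest[k] - p[k])
                      + phi2 * random.random() * (gbest[k] - p[k]))
            vk = max(-vmax, min(vmax, vk))        # limit the flight velocity by vmax
            xk = min(1.0, max(0.0, p[k] + vk))    # keep elements inside [0, 1]
            new_p.append(xk)
            new_v.append(vk)
        return new_p, new_v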

2.3. PSO vs. GP

A set of benchmark functions was employed to evaluate the effectiveness of


the PSO on polynomial modeling in a dynamic environment. Comparison
between the PSO and the genetic programming (GP), which is a commonly
used method on polynomial modeling, was carried out. The GP parameters,
which have been shown to be able to find good solutions for both static
and dynamic system modeling [17], were also used: population size =
50; crossover rate = 0.5; mutation rate = 0.5; pre-defined number
of generations = 1000; maximum depth of tree = 50; roulette wheel
selection, point-mutation and one-point crossover (two parents) were used. The PSO
parameters, which can be referred to [18] and are shown in Table 2.1, were
used in this research.

Table 2.1. The PSO parameters implemented.


Number of particles in the swarm 500
Number of elements in the particle 30
Acceleration constants ϕ1 and ϕ2 Both 2.05
Maximum velocity vmax 0.2
Pre-defined number of generations 1000

The GP implementation used in this research was based on a public GP
toolbox developed in Matlab [17], which aims at modeling both dynamic and
static systems. The benchmark functions shown in Table 2.2
have been employed to evaluate the performance of the GP and the PSO in
polynomial modeling. The dynamic environment in polynomial modeling
was simulated by moving the optimum of the benchmark function in a
period of generations. All benchmark functions are in the ranges of the
initialization areas [Xmin , Xmax ]n as shown in Table 2.2. The dimension
of each test function is n = 4. The Sphere and Rosenbrock functions
are unimodal (a single local and global optimum), and the Rastrigin and
Griewank functions are multimodal (several local optima). To simulate
the dynamic environment, the optimum position x+ of each dynamic
benchmark function in Table 2.2 is moved by adding or subtracting the
random values in all dimensions by a severity parameter s, at every
change of the environment. The choice of whether to add or subtract the
severity parameter s on the optimum x+ was done randomly with an equal
probability. The severity parameter s is defined by:
$$ s = \begin{cases} +\Delta d\,(X_{max} - X_{min}), & \text{if } rand() \ge 0.5 \\ -\Delta d\,(X_{max} - X_{min}), & \text{if } rand() < 0.5 \end{cases} \qquad (2.7) $$

where ∆d = 1% or 10%.
Hence the optimal value Vmin of each dynamic function can be
represented by:

$$ V_{min} = F(x^+ + s) = \min F(x + s) \qquad (2.8) $$

For each test run a different random seed was used. The severity was
chosen relative to the extension (in one dimension) of the initialization
area of each dynamic benchmark function. The different severities
were chosen as 1% and 10% of the range of each dynamic benchmark
function. The benchmark functions were periodically changed every 5000
computational evaluations.

Table 2.2. Benchmark functions and initialization areas.

Based on the benchmark function, 100 data sets


were generated randomly in every generation. Each data set is of the form
(x^D_1(i), x^D_2(i), x^D_3(i), x^D_4(i), y^D(i)), where x^D_1(i), x^D_2(i), x^D_3(i) and x^D_4(i) were
generated randomly in the range of [X_min, X_max]^4, and y^D(i) was calculated
based on the benchmark function with i = 1, 2, ..., 100. Based on the 100
data sets generated in each generation, the polynomial models in (2.2) can
be obtained by the GP and the PSO. Since the optima of the benchmark
functions moved in a period of generations, the dynamic environment can
be simulated and the performance of both the GP and the PSO in modeling
dynamic systems can be evaluated.
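The moving-optimum data stream described above can be sketched as follows for the Sphere function; the initialization range used here is only a placeholder (the values of Table 2.2 are not reproduced in this excerpt), and all names are assumptions for illustration.

    import random

    def make_dynamic_sphere(x_min=-50.0, x_max=50.0, n=4, delta_d=0.10):
        """Sphere benchmark whose optimum is shifted by +/- s at each environment change."""
        optimum = [0.0] * n
        def shift():                              # call every 5000 computational evaluations
            s = delta_d * (x_max - x_min)         # severity, as in (2.7)
            for d in range(n):
                optimum[d] += s if random.random() >= 0.5 else -s
        def sample(n_sets=100):                   # 100 data sets per generation
            xs = [[random.uniform(x_min, x_max) for _ in range(n)] for _ in range(n_sets)]
            return [(x, sum((xi - oi) ** 2 for xi, oi in zip(x, optimum))) for x in xs]
        return shift, sample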
Thirty runs have been performed on both the GP and the PSO on
polynomial modeling based on the benchmark functions with various
severity parameters ∆d = 1% or 10%. In each run, the smallest MAEs
obtained by the best chromosome of the GP and the best particle of the
PSO were recorded.
For the Sphere function, whose optimum varies by ∆d = 10% every 5000
computational evaluations, it can be
observed from Fig. 2.1a that the PSO kept progressing in every 5000
computational evaluations while the optimum moved. However, the GP
may not achieve smaller MAEs in every 5000 computational evaluations.
Finally, the PSO obtained smaller MAEs than the ones obtained by the
GP. For the Rosenbrock function with ∆d = 10%, it can be observed in
Fig. 2.1b that the solution qualities of the PSO can be maintained in every
5000 computational evaluations while the optimum moved. However, the

solution qualities of the GP are more unstable than the one obtained by
the PSO in every 5000 computational evaluations. Finally, the PSO can
reach smaller MAEs than the ones obtained by the GP. For the Rastrigin
function with ∆d = 10%, Fig. 2.2a shows that the PSO kept progressing
in every 5000 computational evaluations even when the optimum moved,
while the GP may obtain larger MAEs after the optimum moves at every
5000 computational evaluations. Finally, the PSO can obtain a smaller
MAE than the one obtained by the GP. For the Griewank function,
similar characteristics can be found in Fig. 2.2b. Similar results can be
found on solving the benchmark functions with ∆d = 10%. Therefore,
it can be concluded that the PSO is more effective to adapt to smaller
MAEs than the GP while the optimum of the benchmark functions move
periodically in every 5000 computational evaluations. Also, the PSO can
converge to smaller MAEs than the ones obtained by the GP in the
benchmark functions. For the benchmark functions with ∆d = 1%, similar
characteristics can be found.
Solely from the figures, it is difficult to compare the qualities and
stabilities of solutions found by both methods. Tables 2.3 and 2.4
summarize the results for the benchmark functions. They show that the
minimums, the maximums and the means of MAEs found by the PSO
are smaller than those found by the GP in all the benchmark functions
with ∆d = 1% and 10% respectively. Furthermore, the variances of
MAE found by the PSO are much smaller than those found by GP. These
results indicate that the PSO can not only produce smaller MAEs but
also more stable MAEs for polynomial modeling based on the
benchmark functions with the dynamic environments. The t-test is then
used to evaluate significance of performance difference between the PSO
and the GP. Table 2.4 shows that all t-values between the PSO and the GP
for all benchmark functions with ∆d = 1% and 10% are higher than 2.15.
Based on the normal distribution table, a t-value higher than 2.15 indicates
significance at the 98% confidence level. Therefore, the performance of the
PSO is significantly better than that of the GP at the 98% confidence level
in polynomial modeling based on the benchmark functions with ∆d = 1%
and 10%.
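For reference, the significance test reported here amounts to a standard two-sample t-test over the 30 recorded MAEs per method; a minimal SciPy-based sketch (an assumption of this rewrite, not the authors' code) is:

    from scipy import stats

    # mae_pso, mae_gp: lists of the 30 final MAEs (one per run) for each method.
    def compare_methods(mae_pso, mae_gp):
        t, p = stats.ttest_ind(mae_pso, mae_gp, equal_var=False)
        return t, p    # |t| > 2.15 corresponds to the 98% confidence level quoted above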
Since maintaining population diversity in population-based algorithms
like GP or PSO is a key in preventing premature convergence and stagnation
in local optima [19, 20], it is essential to study population diversities of
the two algorithms along the search. Various diversity measures, which
involve calculation of the distance between two individuals, have been widely
studied in both genetic programming [21, 22] and genetic algorithms [23].

Table 2.3. Experimental results obtained by the GP and the PSO for the benchmark
functions with ∆d = 1%.
However, those distance measures apply either to tree-based representations
or to string-based representations, and so cannot be used to measure
population diversity for both algorithms, since the representations of the
two algorithms are not identical: a tree-based representation of individuals
is used in the GP, while a string-based representation of individuals is used
in the PSO. Neither type of distance measure can therefore be applied across
both algorithms, whose types of representation differ.
To investigate population diversities of the algorithms, we measure the
distance between two individuals by counting the number of different terms
of the polynomials represented by the two individuals. If the terms in both
polynomials are all identical, the distance between two polynomials is zero.
The distance between two polynomials is larger with a greater number
of different terms in the two polynomials. However, since the same measure
is used for the polynomials produced by each method, the results can be
compared. For example, f1 and f2 are two polynomials given by:
f1 = x1 + x2 + x1x3 + x4^2 + x1x3x5 and f2 = x1 + x2 + x1x5 + x4 + x1x3x5.

Table 2.4. Experimental results obtained by the GP and the PSO for the benchmark
functions with ∆d = 10%.

Both f1 and f2 contain the three terms x1 , x2 and x1 x3 x5 , and the


terms x1x3 and x4^2 in f1 and the terms x1x5 and x4 in f2 are different.
Therefore the number of terms which are different in f1 and f2 is four, and
the distance between f1 and f2 is defined to be four.
The diversity measure of the population at the g-th generation is defined
by the mean of the distances of individuals which is denoted as:

$$ \sigma_g = \frac{2}{N_p^2} \sum_{i=1}^{N_p} \sum_{j=i+1}^{N_p} d\big(s_g(i), s_g(j)\big) \qquad (2.9) $$

where sg (i) and sg (j) are the i-th and the j-th individuals in the population
at the g-th generation, and d is the distance measure between the two
individuals.
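A small sketch of this term-based distance and the diversity measure (2.9), assuming each polynomial is represented simply as a set of term strings (an illustrative assumption), follows:

    from itertools import combinations

    def term_distance(f1, f2):
        """Number of terms appearing in one polynomial but not the other."""
        return len(set(f1) ^ set(f2))              # symmetric difference of the term sets

    def diversity(population):
        """Mean pairwise distance per (2.9); population is a list of term sets."""
        n_p = len(population)
        total = sum(term_distance(a, b) for a, b in combinations(population, 2))
        return 2.0 * total / (n_p * n_p)

    # Example from the text: the distance between f1 and f2 is four.
    f1 = {'x1', 'x2', 'x1*x3', 'x4^2', 'x1*x3*x5'}
    f2 = {'x1', 'x2', 'x1*x5', 'x4', 'x1*x3*x5'}
    print(term_distance(f1, f2))                   # -> 4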
The diversities of the populations along the generations were recorded
for the two algorithms. Figure 2.3a shows the diversities of the Sphere
functions with ∆d = 10%. It can be found from the figure that the
diversities of the GP are the highest in the early generations. The diversities
of the PSO are smaller than the ones of the GP in the early generations.
However, diversities of the PSO can be kept until the late generations,

(a) Convergence plot for Sphere function with ∆d = 10%

(b) Convergence plot for Rosenbrock function with ∆d = 10%

Fig. 2.1. Convergence plot for Sphere and Rosenbrock functions.

while the ones of the GP saturated to low levels in the mid generations.
Figures 2.3b, 2.4a and 2.4b show the diversities of the two algorithms
for Rosenbrock function, Rastrigin function and Griewank function with
∆d = 10% respectively. The figures show similar characteristics to the one
of the Sphere function in that the diversities of populations of the PSO can
be kept along the generations, while the GP saturated to a low value after
the early generations. For the benchmark functions with ∆d = 1%, similar
characteristics can be found.

(a) Convergence plot for Rastrigin function with ∆d = 10%

(b) Convergence plot for Griewank function with ∆d = 10%

Fig. 2.2. Convergence plot for Rastrigin and Griewank functions.

Population diversity in the two algorithms needs to be maintained along the
generations, so as to explore potential individuals with better scores
throughout the search. If the individuals of an algorithm's population
converge too early, the algorithm may become trapped in a sub-optimum. An
algorithm is more likely to explore the solution space while the diversity of
its individuals is maintained. The results
indicate that diversities of the individuals in the PSO can be maintained
along the search in both early and late generations, while the individuals

(a) Population diversities for Sphere function (∆d = 10%)

(b) Population diversities for Rosenbrock function (∆d = 10%)

Fig. 2.3. Population diversities for Sphere and Rosenbrock functions.

of the GP converged in the early generations. Therefore the results explain
why the PSO can find better solutions than the ones obtained by the GP:
the PSO maintains higher diversity than the GP throughout the search.

2.4. Conclusion

In this chapter, a PSO has been proposed for polynomial modeling which
aims at generating explicit models in the form of Kolmogorov–Gabor

(a) Population diversities for Rastrigin function (∆d = 10%)

(b) Population diversities for Griewank function (∆d = 10%)

Fig. 2.4. Population diversities for Rastrigin and Griewank functions.

polynomials in a dynamic environment. The operations of the PSO are


identical to the ones of the original PSO except that elements of particles of
the PSO are represented by arithmetic operations and polynomial variables,
which are the components of the polynomial models. A set of dynamic
benchmark functions in which their optima are dynamically moved was
employed to evaluate the performance of the PSO. Comparison of the PSO
and the GP, which is commonly used on polynomial modeling, was carried
out by using the data generated based on the dynamic benchmark functions.

It was shown that the PSO performs significantly better than the GP in
polynomial modeling. Enhancing the effectiveness of the proposed PSO in
polynomial modeling in dynamic environments will be considered in future
work. The proposed PSO will be incorporated with techniques that have been
shown to obtain satisfactory results in solving dynamic optimization
problems [24, 25].

References

[1] R. Eberhart and J. Kennedy, A new optimizer using particle swarm theory,
Proceedings of the Sixth IEEE International Symposium on Micro Machine
and Human Science. pp. 39–43, (1995).
[2] J. Kennedy and R. Eberhart, Swarm Intelligence. (Morgan Kaufmann,
2001).
[3] Z. Lian, Z. Gu, and B. Jiao, A similar particle swarm optimization
algorithm for permutation flowshop scheduling to minimize makespan,
Applied Mathematics and Computation. 175, 773–785, (2006).
[4] C. J. Liao, C. T. Tseng, and P. Luran, A discrete version of particle swarm
optimization for flowshop scheduling problems, Computer and Operations
Research. 34, 3099–3111, (2007).
[5] Y. Liu and X. Gu, Skeleton network reconfiguration based on topological
characteristics of scale free networks and discrete particle swarm
optimization, IEEE Transactions on Power Systems. 22(3), 1267–1274,
(2007).
[6] Q. K. Pan, M. F. Tasgetiren, and Y. C. Liang, A discrete particle
swarm optimization algorithm for the no-wait flowshop scheduling problem,
Computer and Operations Research. 35, 2807–2839, (2008).
[7] C. T. Tseng and C. J. Liao, A discrete particle swarm optimization for
lot-streaming flowshop scheduling problem, European Journal of Operational
Research. 191, 360–373, (2008).
[8] H. Iba, Inference of differential equation models by genetic programming,
Information Sciences. 178, 4453–4468, (2008).
[9] J. Koza, Genetic Programming: On the Programming of Computers by
Means of Natural Selection. (MIT Press, Cambridge, MA, 1992).
[10] K. Rodriguez-Vazquez, C. M. Fonseca, and P. J. Fleming, Identifying
the structure of nonlinear dynamic systems using multiobjective genetic
programming, IEEE Transactions on Systems, Man and Cybernetics – Part
A. 34(4), 531–545, (2004).
[11] N. Wagner, Z. Michalewicz, M. Khouja, and R. R. McGregor, Time series
forecasting for dynamic environments: the dyfor genetic program model,
IEEE Transactions on Evolutionary Computation. 4(11), 433–452, (2007).
[12] D. Gabor, W. Wides, and R. Woodcock, A universal nonlinear filter
predictor and simulator which optimizes itself by a learning process,
Proceedings of IEEE. 108-B, 422–438, (1961).

[13] S. Billings, M. Korenberg, and S. Chen, Identification of nonlinear


output-affine systems using an orthogonal least-squares algorithm,
International Journal of Systems Science. 19, 1559–1568, (1988).
[14] S. Chen, S. Billings, and W. Luo, Orthogonal least squares methods and
their application to non-linear system identification, International Journal
of Control. 50, 1873–1896, (1989).
[15] B. McKay, M. J. Willis, and G. W. Barton, Steady-state modeling of
chemical processes using genetic programming, computers and chemical
engineering, Computers and Chemical Engineering. 21(9), 981–996, (1997).
[16] R. C. Eberhart and Y. Shi, Comparison between genetic algorithms and
particle swarm optimization, Lecture Notes in Computer Science. 1447,
611–616, (1998).
[17] J. Madar, J. Abonyi, and F. Szeifert, Genetic programming for the
identification of nonlinear input–output models, Industrial and Engineering
Chemistry Research. 44, 3178–3186, (2005).
[18] M. O'Neill and A. Brabazon, Grammatical swarm: The generation of
programs by social programming, Natural Computing. 5, 443–462, (2006).
[19] A. Ekart and S. Nemeth, A metric for genetic programs and fitness
sharing, Proceedings of the European Conference of Genetic Programming.
pp. 259–270, (2000).
[20] R. I. McKay, Fitness sharing genetic programming, Proceedings of the
Genetic and Evolutionary Computation Conference. pp. 435–442, (2000).
[21] E. K. Burke, S. Gustafson, and G. Kendall, Diversity in genetic
programming: an analysis of measures and correlation with fitness, IEEE
Transactions on Evolutionary Computation. 8(1), 47–62, (2004).
[22] X. H. Nguyen, R. I. McKay, D. Essam, and H. A. Abbass, Toward an
alternative comparison between different genetic programming systems,
Proceedings of the European Conference on Genetic Programming. pp. 67–77,
(2004).
[23] B. Naudts and L. Kallel, A comparison of predictive measures of problem
difficulty in evolutionary algorithms, IEEE Transactions on Evolutionary
Computation. 4(1), 1–15, (2000).
[24] T. Blackwell and J. Branke, Multiswarms, exclusion, and anti-convergence
in dynamic environments, IEEE Transactions on Evolutionary Computation.
10(4), 459–472, (2006).
[25] S. Janson and M. Middendorf, A hierarchical particle swarm optimizer
for noisy and dynamic environments, Genetic Programming and Evolvable
Machines. 7, 329–354, (2006).

Chapter 3

Restoration of Half-toned Color-quantized Images Using


Particle Swarm Optimization with Multi-wavelet Mutation

Frank H.F. Leung, Benny C.W. Yeung and Y.H. Chan


Centre for Signal Processing,
Department of Electronic and Information Engineering,
The Hong Kong Polytechnic University,
Hung Hom, Kowloon, Hong Kong
[email protected]

Restoration of color-quantized images is rarely addressed in the


literature, especially when the images are color-quantized with
half-toning. Many existing restoration algorithms are inadequate to
deal with this problem because they were proposed for restoring noisy
blurred images only. In this chapter, a restoration algorithm based on
Particle Swarm Optimization with multi-wavelet mutation (MWPSO) is
proposed to solve the problem. This algorithm makes good use of the
available color palette and the mechanism of a half-toning process to
derive useful a priori information for the restoration. Simulation results
show that it can improve the quality of a half-toned color-quantized
image remarkably in terms of both signal-to-noise ratio improvement
and convergence rate. The subjective quality of the restored images can
also be improved.

Contents

3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
3.2 Color Quantization With Half-toning . . . . . . . . . . . . . . . . . . . . . . . 41
3.3 Formulation of Restoration Algorithm . . . . . . . . . . . . . . . . . . . . . . 42
3.3.1 PSO with multi-wavelet mutation (MWPSO) . . . . . . . . . . . . . . 42
3.3.2 The fitness function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.3.3 Restoration with MWPSO . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.3.4 Experimental setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
3.4 Result and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
∗ The work described in this chapter was substantially supported by a grant from

the Research Grants Council of the Hong Kong Special Administrative Region, China
(Project No. PolyU 5224/08E).


3.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

3.1. Introduction

Color quantization is a process of reducing the number of colors in a digital


image by replacing them with some representative colors selected from a
palette [1]. It is widely used because it can save transmission bandwidth and
data storage requirements in many multimedia applications. When color
quantization is performed, certain degradation of quality will be introduced
owing to the limited number of colors used to produce the output image.
The most common artifact is the false contour. False contours occur when
the available palette colors are not enough to represent a gradually changing
region. Another common artifact is the color shift. In general, the smaller
the color palette size is used, the more severe the defects will be.
Half-toning is a reprographic technique that converts a continuous-tone
image to a lower resolution, and it is mainly for printing [2–6]. Error
diffusion is one of the most popular half-toning methods. It makes use of
the fact that the human visual system is less sensitive to higher frequencies,
and during diffusion, the quantization error of a pixel is diffused to the
neighboring pixels so as to hide the defects and to achieve a more faithful
reproduction of colors.
Restoring a color-quantized image back to the original image is often
necessary. However, most of the recent restoration algorithms mainly
concern the restoration of noisy and blurred color images [7–14]; literature
for restoring half-toned color-quantized images is seldom found. The
restoration can be formulated as an optimization problem. Since error
diffusion is a nonlinear process, conventional gradient-oriented optimization
algorithms might not be suitable to solve the addressed problem.
In this chapter, we make use of a proposed particle swarm optimization
with multi-wavelet mutation (MWPSO) algorithm to restore half-toned
color-quantized images. It is an evolutionary computation algorithm that
works very well on various optimization problems. By taking advantage
of the wavelet theory, which increases the searching space for the PSO at
the early stage of evolution, the chance of converging to local minima is
lowered. The process of color quantization with error diffusion will first be
detailed. Then the application of the proposed MWPSO to the restoration
of the degraded images will be discussed. The results demonstrate that the
MWPSO can achieve a remarkable improvement in terms of convergence
rate and signal-to-noise ratio.

3.2. Color Quantization With Half-toning

A color image generally consists of three color planes, namely, Or , Og and


Ob , which represent the red, green and blue color planes of the image
respectively. Accordingly, the (i, j)-th color pixel of a 24-bit full-color
image of size N × N pixels consists of three color components. The
intensity values of these three components form a 3D vector O⃗(i, j) =
(O(i, j)_r, O(i, j)_g, O(i, j)_b), where O(i, j)_c ∈ [0 1] is the intensity value of
the c color component of the (i, j)-th pixel. Here, we assume that the
maximum and the minimum intensity values of a pixel are one and zero
respectively.

Fig. 3.1. Color quantization with half-toning.

Figure 3.1 shows the system that performs color quantization with error
diffusion. The input image is scanned in a row-by-row fashion from top to
bottom and from left to right. The relationship between the original image
O⃗(i, j) and the encoded image Y⃗(i, j) is described by

$$ U(i,j)_c = O(i,j)_c - \sum_{(k,l)\in\Omega} H(k,l)_c\, E(i-k, j-l)_c \qquad (3.1) $$
$$ \vec{E}(i,j) = \vec{Y}(i,j) - \vec{U}(i,j) \qquad (3.2) $$
$$ \vec{Y}(i,j) = Q_c\big[\vec{U}(i,j)\big] \qquad (3.3) $$

where U⃗(i, j) = (U(i, j)_r, U(i, j)_g, U(i, j)_b) is a state vector of the system,
E⃗(i, j) is the quantization error of the pixel at position (i, j), H(k, l)_c is
a coefficient of the error diffusion filter for the c color component, and Ω is the
corresponding causal support region of the filter.
The operator Q_c[·] performs a 3D vector quantization. Specifically,
the 3D vector U⃗(i, j) is compared with a set of representative color vectors
stored in a previously generated color palette V = {v̂_i : i = 1, 2, ..., N_c}.
The best-matched vector in the palette is selected based on the minimum
Euclidean distance criterion. In other words, a state vector U⃗(i, j) is
represented by the color v̂_k if and only if ∥U⃗(i, j) − v̂_k∥ ≤ ∥U⃗(i, j) − v̂_l∥
for all l = 1, 2, ..., N_c; l ≠ k. Once the best-matched vector is selected
from the color palette, its index is recorded and the quantization error
E⃗(i, j) = v̂_k − U⃗(i, j) is diffused to pixel (i, j)'s neighborhood as described
in (3.1). Note that in order to handle the boundary pixels, E⃗(i, j) is defined
to be zero when (i, j) falls outside the image. Without loss of generality,
in this chapter, we use a typical Floyd–Steinberg error diffusion kernel as
H(k, l)_c to perform the half-toning. The recorded indices of the color palette
will be used later to reconstruct the color-quantized image of the
restored image.
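A minimal NumPy sketch of this half-toned color quantization, assuming a given palette and the Floyd–Steinberg weights, is shown below; the names are illustrative and this is not the chapter's implementation.

    import numpy as np

    def halftone_quantize(img, palette):
        """Palette quantization with Floyd-Steinberg error diffusion per (3.1)-(3.3).
        img: N x N x 3 float array in [0, 1]; palette: Nc x 3 array of representative colors."""
        u = img.astype(np.float64).copy()
        h, w, _ = u.shape
        out_idx = np.zeros((h, w), dtype=np.int32)
        # Floyd-Steinberg weights over the causal neighbourhood Omega
        kernel = [((0, 1), 7 / 16), ((1, -1), 3 / 16), ((1, 0), 5 / 16), ((1, 1), 1 / 16)]
        for i in range(h):
            for j in range(w):
                k = int(np.argmin(((palette - u[i, j]) ** 2).sum(axis=1)))  # nearest palette color
                out_idx[i, j] = k
                err = palette[k] - u[i, j]                                   # E = Y - U, as in (3.2)
                for (di, dj), wgt in kernel:
                    ii, jj = i + di, j + dj
                    if 0 <= ii < h and 0 <= jj < w:                          # errors outside the image are dropped
                        u[ii, jj] -= wgt * err                               # diffusion per (3.1)
        return out_idx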

3.3. Formulation of Restoration Algorithm

3.3.1. PSO with multi-wavelet mutation (MWPSO)


Consider a swarm X(t) at the t-th iteration. Each particle xp (t) ∈ X(t)
contains κ elements xpj (t) ∈ xp (t) at the t-th iteration, where p =
1, 2, . . . , γ and j = 1, 2, . . . , κ; γ denotes the number of particles in the
swarm. First, the particles of the swarm are initialized and then evaluated
based on a defined fitness value. The objective of PSO is to minimize the
fitness value (cost value) of a particle through iterative steps. The swarm
evolves from iteration t to t + 1 by repeating the procedures as shown in
Fig. 3.2.
The velocity v^p_j(t) (corresponding to the flight speed in a search space)
and the coordinate x^p_j(t) of the j-th element of the p-th particle at the t-th
generation can be calculated using the following formulae:

$$ v_j^p(t) = w \cdot v_j^p(t-1) + \varphi_1 \cdot rand_j^p() \cdot \big(pbest_j^p - x_j^p(t-1)\big) + \varphi_2 \cdot rand_j^p() \cdot \big(gbest_j - x_j^p(t-1)\big) \qquad (3.4) $$

$$ x_j^p(t) = x_j^p(t-1) + k \cdot v_j^p(t) \qquad (3.5) $$

Fig. 3.2. Pseudo code for MWPSO.

where

pbestp = [pbestp1 pbestp2 ... pbestpκ ]

and

gbest = [gbest1 gbest2 ... gbestκ ].

The best previous position of a particle is recorded and represented as


pbestp ; the position of the best particle among all the particles is represented
as gbest; w is an inertia weight factor; φ1 and φ2 are acceleration constants;
randpj () returns a random number in the range of [0 1]; k is a constriction
factor derived from the stability analysis of (3.5) to ensure the system can
converge but not prematurely.
PSO utilizes pbestp and gbest to modify the current search point in order
to prevent the particles from moving in the same direction, but to converge

gradually toward pbestp and gbest. A suitable selection of the inertia weight
w provides a balance between the global and local explorations. Generally,
w can be dynamically set with the following equation:
$$ w = w_{max} - \frac{w_{max} - w_{min}}{T} \times t \qquad (3.6) $$
where t is the current iteration number, T is the total number of iteration,
wmax and wmin are the upper and lower limits of the inertia weight, and
are set to 1.2 and 0.1 respectively in this chapter.
In (3.4), the particle velocity is limited by a maximum value vmax . The
parameter vmax determines the resolution of region between the present
position and the target position to be searched. This limit enhances the
local exploration of the problem space, affecting the incremental changes of
learning.
Before generating a new X(t), the mutation operation is performed:
every particle of the swarm will have a chance to mutate governed by a
probability of mutation µm ∈ [0 1], which is defined by the user. For
each particle, a random number between zero and one will be generated;
if µm is larger than the random number, this particle will be selected for
the mutation operation. Another parameter called the element probability
Nm ∈ [0 1] is then defined by the user to control the number of elements
in the particle that will mutate in each iteration step. For instance, if
xp (t) = [xp1 (t), xp2 (t), . . . , xpκ (t)] is the selected p-th particle, the expected
number of elements that undergo mutation is given by
Expected number of mutated elements = Nm × κ (3.7)
We propose a multi-wavelet mutation operation for realizing the
mutation. The exact elements for doing mutation in a particle are randomly
selected. The resulting particle is given by x̄p (t) = [x̄p1 (t), x̄p2 (t), . . . , x̄pκ (t)].
If the j-th element is selected for mutation, the operation is given by
$$ \bar{x}_j^p(t) = \begin{cases} x_j^p(t) + \sigma \times \big(para_{max}^j - x_j^p(t)\big) & \text{if } \sigma > 0 \\ x_j^p(t) + \sigma \times \big(x_j^p(t) - para_{min}^j\big) & \text{if } \sigma \le 0 \end{cases} \qquad (3.8) $$

Using the Morlet wavelet as the mother wavelet, we have


$$ \sigma = \frac{1}{\sqrt{a}}\, e^{-(\varphi/a)^2/2} \cos\!\Big(5\,\frac{\varphi}{a}\Big) \qquad (3.9) $$
The value of the dilation parameter a is set to vary with the value of
t/T in order to realize a fine-tuning effect, where T is the total number of
iteration and t is the current iteration number. In order to perform a local

search when t is large, the value of a should increase as t/T increases so as


to reduce the significance of the mutation. Hence, a monotonic increasing
function governing a and t/T is proposed as follows:

$$ a = e^{-\ln(g) \times \left(1 - \frac{t}{T}\right)^{\zeta_{wm}} + \ln(g)} \qquad (3.10) $$
where ζwm is the shape parameter of the monotonic increasing function, g
is the upper limit of the parameter a.
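As a sketch, the mutation of a single element per (3.8)–(3.10) can be written as below; the range from which the wavelet argument φ is drawn is not stated in this excerpt, so the interval [−2.5a, 2.5a] used here is an assumption, as are the names.

    import math
    import random

    def wavelet_mutate(x, lo, hi, t, T, zeta_wm=0.2, g=10000.0):
        """Multi-wavelet mutation of one element x in [lo, hi] at iteration t of T."""
        a = math.exp(-math.log(g) * (1.0 - t / T) ** zeta_wm + math.log(g))   # dilation parameter, (3.10)
        phi = random.uniform(-2.5 * a, 2.5 * a)                               # assumed sampling range
        sigma = (1.0 / math.sqrt(a)) * math.exp(-(phi / a) ** 2 / 2.0) * \
                math.cos(5.0 * phi / a)                                       # Morlet wavelet, (3.9)
        if sigma > 0:
            return x + sigma * (hi - x)                                       # (3.8): push toward upper bound
        return x + sigma * (x - lo)                                           # (3.8): push toward lower bound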

3.3.2. The fitness function


The proposed MWPSO is applied to restore half-toned color-quantized
images. Let X be the output image of the restoration. Obviously, when the
restored image X is color-quantized with error diffusion, the output should
be close to the original half-toned image Y . Suppose Qch [·] (as shown in
Fig. 3.1) denotes the operator that performs the color quantization with
half-toning, then we should have
Y ≈ Qch [X] (3.11)
Based on the above criterion, the cost function for the restoration
problem can be defined as

$$ fitness = \big| Y - Q_{ch}[X] \big| \qquad (3.12) $$
If f itness = 0, it implies the restored image X and the original image
O provide the same color quantization results. In other words, the cost
function of (3.12) provides a good measure to judge if X is a good estimate
of O.

3.3.3. Restoration with MWPSO


Let X_cur(t − 1) be the group of current estimates of the restored image
(the swarm) at a particular iteration t − 1, X^p_cur(t − 1) be the p-th particle of
the swarm, and E^p_cur be its corresponding fitness value. According to (3.4)
and (3.5), the new estimate of the restored image is given by

$$ X_{new}^p(t) = X_{cur}^p(t-1) + k \cdot v^p(t) \qquad (3.13) $$

where k is a constriction factor. Before evaluating the fitness of each
particle, the multi-wavelet mutation is performed. The mutation operation
realizes a fine tuning over many iterations. After each iteration, E^p_new,
which is the fitness of X^p_new(t), is evaluated. If
E^p_new < E^p_cur, X^p_cur(t) is updated to X^p_new(t). Furthermore, if E^p_new < E_best,

where E_best is the fitness value of the best estimate gbest so far, then gbest
will be updated by X^p_new(t). In summary, we have

$$ X_{cur}^p(t) = \begin{cases} X_{new}^p(t) & \text{if } E_{new}^p < E_{cur}^p \\ X_{cur}^p(t-1) & \text{otherwise} \end{cases} \qquad (3.14) $$

and

$$ gbest = \begin{cases} X_{new}^p(t) & \text{if } E_{new}^p < E_{best} \\ gbest & \text{otherwise} \end{cases} \qquad (3.15) $$
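In code form, this per-particle bookkeeping reduces to a few lines; the sketch below uses assumed names and stands in for (3.13)–(3.15) only.

    # Sketch of the update logic in (3.14)-(3.15) for one particle p.
    # fitness() implements (3.12); x_new is the new estimate from (3.13) after mutation.
    def update_particle(x_cur, e_cur, x_new, gbest, e_best, fitness):
        e_new = fitness(x_new)
        if e_new < e_cur:                 # (3.14): keep the better estimate
            x_cur, e_cur = x_new, e_new
        if e_new < e_best:                # (3.15): update the global best
            gbest, e_best = x_new, e_new
        return x_cur, e_cur, gbest, e_best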

3.3.4. Experimental setup


On realizing the MWPSO restoration, the following simulation conditions
are used:
• Shape parameter of the wavelet mutation (ζwm ): 0.2
• Probability of mutation (µm ): 0.2
• Element probability (Nm ): 0.3
• Acceleration constant φ1 : 2.05
• Acceleration constant φ2 : 2.05
• Constriction factor k: 0.005
• Parameter g: 10000
• Swarm size: 8
• Number of iteration: 500
• Initial population X(0): All the particles are initialized to be the
Gaussian filtered output of the original half-toned image Y
• Initial global best gbest: The original half-toned image Y
• Initial previous best pbestp : The zero matrix

Lena Baboon Fruit

Fig. 3.3. Original images used for testing the proposed restoration algorithm.

Fig. 3.4. Half-toned images with 256-color palette size.



Fig. 3.5. Half-toned images with 128-color palette size.

Fig. 3.6. Half-toned images with 64-color palette size.

Fig. 3.7. Half-toned images with 32-color palette size.



Fig. 3.8. MWPSO restored images with 256-color palette size.

Fig. 3.9. MWPSO restored images with 128-color palette size.

Fig. 3.10. MWPSO restored images with 64-color palette size.



Fig. 3.11. MWPSO restored images with 32-color palette size.

Fig. 3.12. SA restored images with 256-color palette size.

Fig. 3.13. SA restored images with 128-color palette size.



Fig. 3.14. SA restored images with 64-color palette size.

Fig. 3.15. SA restored images with 32-color palette size.



Fig. 3.16. Comparisons between SA and MWPSO for Lena with 256- and 128-color
palette size.

Fig. 3.17. Comparisons between SA and MWPSO for Baboon with 256- and 128-color
palette size.

Fig. 3.18. Comparisons between SA and MWPSO for Fruit with 256- and 128-color
palette size.

Fig. 3.19. Comparisons between SA and MWPSO for Lena with 64- and 32-color palette
size.

Fig. 3.20. Comparisons between SA and MWPSO for Baboon with 64- and 32-color
palette size.

Fig. 3.21. Comparisons between SA and MWPSO for Fruit with 64- and 32-color palette
size.

Table 3.1. Fitness and SNRI of two


algorithms for restoring the half-toned
image Lena.
Lena with 256-color Palette size
MWPSO SA
SNRI 6.7736 6.7123
Best fitness 8945 8957

Lena with 128-color Palette size


MWPSO SA
SNRI 8.1969 8.1944
Best fitness 3849 3851

Lena with 64-color Palette size


MWPSO SA
SNRI 9.4404 9.566
Best fitness 1604 1652

Lena with 32-color Palette size


MWPSO SA
SNRI 11.522 11.4903
Best fitness 649 677

3.4. Result and Analysis

Simulations have been carried out to evaluate the performance of the


proposed algorithm. Three de facto standard 24-bit full-color images of size
256×256 pixels are used, and all the test images are shown in Fig. 3.3. These
test images are color-quantized to produce the corresponding images (Y ).
The color-quantized images with 256-color palette size, 128-color palette
size, 64-color palette size and 32-color palette size are shown in Fig 3.4 to
3.7 respectively.
Color palettes of different sizes are generated using the median-cut
algorithm [2]. On doing color quantization, half-toning is performed
with error diffusion, and the Floyd–Steinberg diffusion filter [3] is used.
The proposed restoration algorithm is used to restore the half-toned
color-quantized images.
For comparison, simulated annealing (SA) is also applied to do the
restoration. The performance of SA and the proposed MWPSO are
reflected in terms of the signal-to-noise ratio improvement (SNRI) achieved
when the algorithms are used to restore the color-quantized images “Lena”,
“Baboon” and “Fruit”. The simulation condition for SA is basically the

Table 3.2. Fitness and SNRI of two


algorithms for restoring the half-toned
image Baboon.
Baboon with 256-color Palette size
MWPSO SA
SNRI 2.6854 2.6706
Best fitness 5410 5445

Baboon with 128-color Palette size


MWPSO SA
SNRI 4.3301 4.3293
Best fitness 2189 2277

Baboon with 64-color Palette size


MWPSO SA
SNRI 5.5836 5.616
Best fitness 1045 1166

Baboon with 32-color Palette size


MWPSO SA
SNRI 5.8208 5.8186
Best fitness 750 760

same as that of the proposed MWPSO. The control parameter of SA,


gamma, is set to be equal to the parameter k of the MWPSO. The number
of iteration is set at 500. The experimental results in terms of SNRI and
the convergence rate are summarized in Tables 3.1 to 3.3 and Figs 3.16 to
3.21 respectively. The SNRI is defined as:

$$ SNRI = 10 \log \frac{\sum_{(i,j)} \big\| \vec{O}(i,j) - \vec{Y}(i,j) \big\|^2}{\sum_{(i,j)} \big\| \vec{O}(i,j) - \vec{X}_{best}(i,j) \big\|^2} \qquad (3.16) $$

where O⃗(i, j), Y⃗(i, j) and X⃗_best(i, j) are the (i, j)-th pixels of the original, the
half-toned color-quantized and the optimally restored images respectively.
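For illustration, (3.16) translates directly into NumPy as follows (a sketch with assumed array names, taking the logarithm to base 10 so that the result is in dB):

    import numpy as np

    def snri(original, halftoned, restored):
        """Signal-to-noise ratio improvement per (3.16); inputs are N x N x 3 arrays."""
        num = np.sum((original - halftoned) ** 2)
        den = np.sum((original - restored) ** 2)
        return 10.0 * np.log10(num / den)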
From Figs 3.16 to 3.21, we can see that the final results achieved by SA
and MWPSO are similar: both of them converge to some similar values. It
means that both algorithms can reach the same optimal point. However,
we can see that the proposed MWPSO exhibits a higher convergence rate.
Also, from Tables 3.1 to 3.3, MWPSO converges to smaller fitness values
in all experiments and offers higher SNRI in most of the experiments.

Table 3.3. Fitness and SNRI of two


algorithms for restoring the half-toned
image Fruit.
Fruit with 256-color Palette size
MWPSO SA
SNRI 7.5598 7.5155
Best fitness 6372 6394

Fruit with 128-color Palette size


MWPSO SA
SNRI 9.9556 9.9186
Best fitness 2400 2417

Fruit with 64-color Palette size


MWPSO SA
SNRI 10.8616 10.8607
Best fitness 982 1012

Fruit with 32-color Palette size


MWPSO SA
SNRI 11.7873 11.7815
Best fitness 478 574

3.5. Conclusion

In this chapter, we used the MWPSO as the restoration algorithm for


half-toned color-quantized images. The half-toned color quantization and
the application of the proposed MWPSO have been described. Simulation
results demonstrate that the proposed algorithm performs better in terms of
convergence rate and SNRI than the conventional SA algorithm. By using
MWPSO, the number of particles used for searching is increased, and more
freedom is given to search the optimal result thanks to the multi-wavelet
mutation. This explains the improved searching ability and the convergence
rate. The chance for the particles being trapped in some local optimal point
is also reduced.

References

[1] M. T. Orchard and C. A. Bouman, Color quantization of images, IEEE


Transactions on Signal Processing. 39(12), 2677–2690, (1991).
[2] P. Heckbert, Color image quantization for frame buffer displays, Computer
Graphics. 16(4), 297–307, (1982).
[3] R. Ulichney, Digital Halftoning. (MIT Press, Cambridge, MA, 1987).

[4] R. S. Gentile, E. Walowit, and J. P. Allebach, Quantization and multi-level


halftoning of color images for near original image quality, Proceedings of the
SPIE 1990. pp. 249–259, (1990).
[5] S. S. Dixit, Quantization of color images for display/printing on limited color
output devices, Computer Graphics. 15(4), 561–567, (1991).
[6] X. Wu, Color quantization by dynamic programming and principal analysis,
ACM Transactions on Graphics. 11(4), 372–384, (1992).
[7] M. Barni, V. Cappellini, and L. Mirri, Multichannel m-filtering for color
image restoration. pp. 529–532, (2000).
[8] G. Angelopoulos and I. Pitas, Multichannel Wiener filters in color image
restoration, IEEE Transactions on CASVT. 4(1), 83–87, (1994).
[9] N. P. Galatsanos, A. K. Katsaggelos, R. T. Chin, and A. D. Hillery, Least
squares restoration of multichannel images, IEEE Transactions on Signal
Processing. 39(10), 2222–2236, (1991).
[10] N. P. Galatsanos and R. T. Chin, Restoration of color images by
multichannel kalman filtering, IEEE Transactions on Signal Processing. 39
(10), 2237–2252, (1991).
[11] B. R. Hunt and O. Kubler, Karhunen–Loeve multispectral image restoration,
part 1: theory, IEEE Transactions on Acoustic Speech Signal Processing. 32
(3), 592–600, (1984).
[12] H. Altunbasak and H. J. Trussell, Colorimetric restoration of digital images,
IEEE Transactions on Image Processing. 10(3), 393–402, (2001).
[13] K. J. Boo and N. K. Bose, Multispectral image restoration with multisensor,
IEEE Transactions on Geosci. Remote Sensing. 35(5), 1160–1170, (1997).
[14] N. P. Galatsanos and R. T. Chin, Digital restoration of multichannel images,
IEEE Transactions on Acoustic Speech Signal Processing. 37, 415–421,
(1989).

PART 2

Fuzzy Logics and their Applications


Chapter 4

Hypoglycemia Detection for Insulin-dependent Diabetes


Mellitus: Evolved Fuzzy Inference System Approach

S.H. Ling, P.P. San and H.T. Nguyen


Centre for Health Technologies,
Faculty of Engineering and Information Technology,
University of Technology, Sydney, NSW, Australia,
[email protected]

Insulin-dependent diabetes mellitus is classified as Type 1 diabetes


and it can be further classified as immune-mediated or idiopathic.
Hypoglycemia is a common and serious side effect of insulin therapy in
patients with Type 1 diabetes. In this chapter, we measure physiological
parameters continuously to provide a non-invasive hypoglycemia
detection for Type 1 diabetes mellitus (T1DM) patients. Based on the
physiological parameters of the electrocardiogram (ECG) signal, such as
heart rate, corrected QT interval, change of heart rate and change of
corrected QT interval, an evolved fuzzy inference model is developed for
classification of hypoglycemia. To optimize the rules and membership
functions of the fuzzy system, a hybrid particle swarm optimization with
wavelet mutation (HPSOWM) is introduced. For the clinical study, 15
children with Type 1 diabetes volunteered for overnight monitoring. All the real
data sets are collected from the Department of Health, Government of
Western Australia and are randomly organized into a training set (10
patients) and testing set (5 patients). The results show that the evolved
fuzzy inference system approach performs well in terms of sensitivity
and specificity.

Contents
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
4.2 Hypoglycemia Detection System: Evolved Fuzzy Inference System Approach 66
4.2.1 Fuzzy inference system . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.2.2 Particle swarm optimization with wavelet mutation . . . . . . . . . . . 71
4.2.3 Choosing the HPSOWM parameters . . . . . . . . . . . . . . . . . . . 74
4.2.4 Fitness function and training . . . . . . . . . . . . . . . . . . . . . . . 77
4.3 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
4.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80


References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81

4.1. Introduction

Hypoglycemia is the medical term for a state produced by a lower than normal
level of blood glucose. It is characterized by a mismatch between the action of
insulin, the ingestion of food and energy expenditure. The most common
forms of hypoglycemia occur as a complication of treatment with insulin
or oral medications. Hypoglycemia is less common in non-diabetic people,
but can occur at any age because of many causes. Among these causes
are excessive insulin produced in the body, inborn errors, medications and
poisons, alcohol, hormone deficiencies, prolonged starvation, alterations of
metabolism associated with infection and organ failure. In [1, 2] it was
discussed that diabetic patients, especially those who have been treated
with insulin, are at risk of developing hypoglycemia while it is less likely
to occur in non-insulin-dependent patients who are taking sugar-lowering
medicine for diabetes. Most surveys in [3] have demonstrated that the
tighter the glycemic control and the younger the patient, the greater the
frequency of both mild and severe hypoglycemia.
The level of blood glucose low enough to define hypoglycemia may be
different for different people, in different circumstances and for different
purposes, and has occasionally been a matter of controversy. Most healthy
adults maintain fasting glucose levels above 70 mg/dL, (3.9 mmol/L) and
develop symptoms of hypoglycemia when the glucose level falls below
55 mg/dL (3.0 mmol/L). In [4], it has been reported that severe
hypoglycemic episodes are defined as those in which the documented blood
glucose level is 50 mg/dL (2.8 mmol/L), and the patients are advised to
take the necessary treatment. Hypoglycemia is treated by restoring the
blood glucose level to normal by the ingestion or administration of
carbohydrate foods. In some cases, it is treated by injection or infusion of
glucagon.
The symptoms of hypoglycemia occur when not enough glucose is
supplied to the brain, in fact the brain and nervous system need a
certain level of glucose to function. The two typical symptoms of
hypoglycemia arise from the activation of the autonomous central nervous
systems (autonomic symptoms) and reduced cerebral glucose consumption
(neuroglycopenic symptoms). Autonomic symptoms such as headache,
extreme hunger, blurry or double vision, fatigue, weakness and sweating are
activated before neuroglycopenic symptoms follow. As discussed in [5, 6]

the initial presence of hypoglycemia can only be identified
when autonomic symptoms occur, which allows the patient to recognize and
correct ensuing episodes. The neuroglycopenic symptoms such as confusion,
seizures and loss of consciousness (coma) arise due to insufficient glucose
flow to the brain [7].
In cases of hypoglycemia presented in [8, 9], the symptoms occurred
without the patients being aware of them, at any time while driving,
or during sleeping. Nocturnal hypoglycemia is particularly dangerous
because it reduces sleep and may obscure autonomic counter-regulatory
responses, so that an initially mild episode may become severe. The
risk of hypoglycemia is high during night time, since 50% of all severe
episodes occur at that time [10]. Deficient glucose counter-regulation
may also lead to severe hypoglycemia even with modest insulin elevations.
Regulation of nocturnal hypoglycemia is further complicated by the dawn
phenomenon. This is a consequence of nocturnal changes in insulin
sensitivity secondary to growth hormone secretion: a decrease in insulin
requirements approximately between midnight and 5 am followed by an
increase in requirements between 5 am and 8 am. Thus, hypoglycemia is
one of the complications of diabetes most feared by patients.
Current technologies used in the diabetes diagnostic testing and
self-monitoring market have already been improved to the extent that any
additional improvements would be minimal. For example, glucose meter
manufacturers have modified their instruments to use as little as 2 µl
of blood and produce results within a minute. Technology advancement
in this market is expected to occur using novel design concepts. An
example of such technology is the non-invasive glucose meter. There
is a limited number of non-invasive blood glucose monitoring systems
currently available in the market but each has specific drawbacks in terms of
functioning, cost, reliability and obtrusiveness. Intensive research has been
devoted to the development of hypoglycemia alarms, exploiting principles
that range from detecting changes in the electroencephalogram (EEG) or
skin conductance (due to sweating) to measurements of subcutaneous tissue
glucose concentrations by glucose sensors. However, none of these has proved sufficiently reliable or unobtrusive.
In this chapter, a significant contribution is made by applying a fuzzy reasoning system to the modeling and design of a non-invasive hypoglycemia monitor based on physiological responses. During hypoglycemia, the most profound
physiological changes are caused by activation of the sympathetic nervous
system. Among them, the strongest responses are sweating and increased
cardiac output. As discussed in [8, 11, 12] sweating is mediated through
sympathetic cholinergic fibres, while the change in cardiac output is due to
an increase in heart rate and stroke volume. In [13, 14], the possibility of hypoglycemia-induced arrhythmias is raised, and experimental hypoglycemia has been shown to prolong QT intervals and QT dispersion in both non-diabetic subjects and those with Type 1 and Type 2 diabetes.
The classification models for numerous medical diagnosis such as
diabetic nephropathy [15], acute gastrointestinal bleeding [16] and
pancreatic cancer [17] have been developed by the use of the statistical regression methods in [18]. However, statistical regression models are accurate only over the range of the patients' data from which they are developed. In addition, they can only be applied if the patients' data follow the distribution assumed by the developed regression model and a correlation between the dependent and independent variables exists. If the patients' data are irregular, the developed regression models have an unnaturally wide prediction range.
Another commonly used method to generate classification models for
heart disease [19] and Parkinson’s disease [20] is genetic programming as
discussed in [21]. To generate classification models with nonlinear terms
in polynomial forms, the genetic operation is used, while the least square
algorithm is used to determine the contribution of each nonlinear term of the classification model. Owing to the fuzziness of measurements, it is unavoidable that the patients' data involve uncertainty. Since the genetic
programming with least square algorithm does not consider the fuzziness
of uncertainty during measurement, it cannot give the best classification
model for diagnosis purposes.
To carry out modeling and classification for medical diagnosis purposes
of ECG and EEG in [22–24] much attention has been devoted to
computational technologies such as fuzzy systems [25], support vector
machine [26] and neural networks [27]. Not only in EEG and ECG
classifications, the advanced computational intelligent techniques have been
applied to cardiovascular response [28, 29], breast cancer and blood cell
[30], skull and brain [31, 32], dermatologic disease [17], heart disease [33],
radiation therapy [34] and lung sounds [35], etc. Each technology has its own advantages; for example, the fuzzy system is well known for its decision-making ability. Because it represents human expert knowledge, the system's output can be determined by a set of linguistic rules which can be easily understood. As discussed in [36], neural networks (NNs) have been
used as a universal function approximator due to their good approximation
capability. To develop classification models for medical diagnosis purposes
as presented in [37] NNs are used due to their generalization ability in
addressing both the nonlinear and fuzzy nature of patients' data. In [38], support vector machines (SVMs) have shown good performance for classification and have been used in various applications. Because they naturally tackle binary classification problems, SVMs have been used in the classification of cardiac signals [39].
Since the traditional optimization methods of least square algorithms
and gradient descent methods have a problem of trapping in local optima,
evolutionary algorithms such as PSO [40], genetic algorithm (GA) [41], differential evolution (DE) [42] and ant colony optimization [43] have been introduced. These algorithms are efficient global optimization algorithms and make it easy to handle multimodal problems. As discussed in
[9, 23, 44–47] the combination of optimization methods and evolutionary
algorithms eventually gives a good performance in clinical applications.
A hybrid particle swarm optimization-based fuzzy reasoning model
has been developed for early detection of hypoglycemic episodes using
physiological parameters, such as heart rate (HR) and corrected QT interval
(QTc ) of ECG signal. The fuzzy reasoning model (FRM) [25, 48] is good
in representing expert knowledge and linguistic rules that can be easily
understood by human beings. In this chapter, it has been proved that
the overall performance of a hypoglycemia detection system is distinctly
improved in terms of sensitivity and specificity by introducing FRM. To
optimize fuzzy rules and fuzzy membership functions of FRM, a global
learning algorithm called hybrid particle swarm optimization with wavelet
mutation (HPSOWM) is presented [49]. Since PSO is a powerful random
global search technique for optimization problems, by using it, the global
optimum solution over a domain can be obtained instantly. By introducing
wavelet mutation in PSO, the drawback of possible trapping in local optima
in PSO can be overcome easily. Also, a fitness function is introduced to
reduce the risk of the overtraining phenomenon [50]. A real hypoglycemia
study is given in Section 4.3 to show that the proposed detector can
successfully detect the hypoglycemic episodes in T1DM.
The organization of this chapter is as follows. In Section 4.2, the details of the development of the hybrid particle swarm optimization-based fuzzy reasoning model are presented. To show the effectiveness of our proposed
methods, the results of early detection of nocturnal hypoglycemic episodes
in T1DM are discussed in Section 4.3 before a conclusion is drawn in Section
4.4.
4.2. Hypoglycemia Detection System: Evolved Fuzzy Inference System Approach

Even though current technologies used in diabetes diagnosis testing and
self-monitoring have already been improved to some extent, technological
advancement in the market is expected to come from the use of novel design
concepts, i.e. the development of non-invasive glucose meters. There is a
limited number of non-invasive blood glucose monitoring systems currently
available on the market. However, each has its own drawbacks in terms of
functioning, cost, reliability and obtrusiveness. The GlucoWatch G2 Biographer, designed and developed by Cygnus Inc., measures glucose levels up to three times per hour for 12 hours through an autosensor attached to the skin. The product uses reverse iontophoresis to extract and measure the levels of glucose non-invasively from interstitial fluid. Before each measurement, the device requires a calibration that takes two hours. Costly disposable gel pads are required whenever these devices are used, and sweating may cause skipped readings during measurement. Worse, the device has a time delay of about 10 to 15 minutes. Because of these limitations, the device is no longer available on the market.
Although continuous glucose monitoring systems (CGMS) are now available to give real-time estimations of glucose levels, they still lack the sensitivity to be used as an alarm. In [51, 52] the median error for the MiniMed Medtronic CGMS was reported as 10 to 15% at a plasma glucose of 4 to 10 mmol/l, and the inaccuracy of CGMS in detecting hypoglycemia is
reported in [53]. In addition, for the Abbott Freestyle Navigator CGMS,
the sensor accuracy was the lowest during hypoglycemia (3.9 mmol/l), with
the median absolute relative difference (ARD) being 26.4% [54]. As these
are median values, the error may be significantly greater and, as a result,
those sensors are not to be used as an alarm. Intensive research has been
devoted to the development of hypoglycemia alarms, exploiting principles
that range from detecting changes in the EEG or skin conductance [11].
However, none of these have proved sufficiently reliable or unobtrusive.
During hypoglycemia, the most profound physiological changes are caused by activation of the sympathetic nervous system [55]. Among them, the strongest responses are sweating and increased cardiac output [12–14]. Sweating is mediated through sympathetic cholinergic fibres, while the change in cardiac output is due to an increase in heart rate and stroke volume [14]. The possibility of hypoglycemia-induced arrhythmias and
experimental hypoglycemia has been shown to prolong QT intervals and
dispersion in both non-diabetic subjects and in those with Type 1 and Type
2 diabetes. In this chapter, the physiological parameters for detection of
hypoglycemic episodes in T1DM were collected by the use of a continuous
non-invasive hypoglycemia monitor in [56] in which an alarm system is
available for warning at various stages of hypoglycemia.
To realize the early detection of hypoglycemic episodes in T1DM,
a hybrid PSO-based fuzzy reasoning model is developed with four
physiological inputs and one decision output, as shown in Fig. 4.1.
The four physiological inputs of the system are HR and QTc of the electrocardiogram signal, together with their changes ∆HR and ∆QTc, and the output is the binary status of hypoglycemia (h), which gives +1 when hypoglycemia is present (positive hypoglycemia) while −1 represents non-hypoglycemia (negative hypoglycemia).

Fig. 4.1. PSO-based fuzzy inference system for hypoglycemia detection.

The ECG parameters which will be investigated in this research involve the parameters in the depolarization and repolarization stages of electrocardiography. The concerned points are the Q point, the R peak, the T wave peak and the T wave end, as shown in Fig. 4.2. The peak of the T wave is searched for within a 300 ms section after the R peak; in this section, the maximum peak is defined as the peak of the T wave. The Q point is searched for within a 120 ms section to the left of the R peak, and is found by investigating the sequential gradients of negative-negative-positive/zero-positive/zero from the right side. These concerned points are used to obtain the ECG parameters which are used as inputs in the hypoglycemia detection. QT is the interval

Fig. 4.2. The concerned points Q, R, Tp and Te, which are used to find the ECG parameters.


between the Q and Tp points. QTc is QT/RR, in which RR is the interval between R peaks. The heart rate is 60/RR.
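As an illustration of how the fiducial points above translate into the detector inputs, the following Python sketch computes QT, QTc and the heart rate, and forms HR, QTc, ∆HR and ∆QTc from two consecutive beats. This is a minimal sketch, not the authors' implementation: the function names and the timing values in the example are hypothetical and only illustrate the arithmetic described in this section.

import numpy as np

def ecg_features(q_time, tp_time, rr_interval):
    """Compute QT, QTc and heart rate from fiducial point times (in seconds)."""
    qt = tp_time - q_time            # interval between the Q and Tp points
    qtc = qt / rr_interval           # corrected QT interval, as defined above
    heart_rate = 60.0 / rr_interval  # beats per minute
    return qt, qtc, heart_rate

def detector_inputs(curr, prev):
    """Build (HR, QTc, dHR, dQTc) from features of the current and previous beats."""
    _, qtc_now, hr_now = curr
    _, qtc_prev, hr_prev = prev
    return hr_now, qtc_now, hr_now - hr_prev, qtc_now - qtc_prev

if __name__ == "__main__":
    # Hypothetical sample timings, used only to exercise the functions.
    prev_beat = ecg_features(q_time=0.00, tp_time=0.34, rr_interval=0.80)
    curr_beat = ecg_features(q_time=0.00, tp_time=0.36, rr_interval=0.75)
    print(detector_inputs(curr_beat, prev_beat))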

4.2.1. Fuzzy inference system


The detailed methodology of the fuzzy reasoning model (FRM) is discussed in this section, showing that the FRM plays a vital role in modeling the correlation between the physiological parameters HR, QTc, ∆HR and ∆QTc and the status of hypoglycemia (h). The three FRM components, namely fuzzification, reasoning by if-then rules and defuzzification, are discussed in Sections 4.2.1.1, 4.2.1.2 and 4.2.1.3, respectively.

4.2.1.1. Fuzzification
The first step is to take the inputs and determine the degree of membership
to which they belong to each of the appropriate fuzzy sets via membership
functions. In this study, there are four inputs, HR, QTc , ∆HR, and ∆QTc .
The degree of the membership function is shown in Fig. 4.3. For input HR,
a bell-shaped function $\mu_{N^k_{HR}}(HR(t))$ is used when $m^k_{HR}$ is neither $\max\{m_{HR}\}$ nor $\min\{m_{HR}\}$:

$$\mu_{N^k_{HR}}(HR(t)) = e^{-\frac{\left(HR(t)-m^k_{HR}\right)^2}{2\left(\varsigma^k_{HR}\right)^2}}, \qquad (4.1)$$

where

$$m_{HR} = \left[\, m^1_{HR} \;\; m^2_{HR} \;\cdots\; m^k_{HR} \;\cdots\; m^{m_f}_{HR} \,\right],$$
$k = 1, 2, \ldots, m_f$, where $m_f$ denotes the number of membership functions; $t = 1, 2, \ldots, n_d$, where $n_d$ denotes the number of input–output data pairs; the parameters $m^k_{HR}$ and $\varsigma^k_{HR}$ are the mean value and the standard deviation of the membership function, respectively.
When $m^k_{HR} = \min\{m_{HR}\}$,

$$\mu_{N^k_{HR}}(HR(t)) = \begin{cases} 1 & \text{if } HR(t) \le \min\{m_{HR}\} \\[4pt] e^{-\frac{(HR(t)-m^k_{HR})^2}{2(\varsigma^k_{HR})^2}} & \text{if } HR(t) > \min\{m_{HR}\} \end{cases}, \qquad (4.2)$$

and when $m^k_{HR} = \max\{m_{HR}\}$,

$$\mu_{N^k_{HR}}(HR(t)) = \begin{cases} 1 & \text{if } HR(t) \ge \max\{m_{HR}\} \\[4pt] e^{-\frac{(HR(t)-m^k_{HR})^2}{2(\varsigma^k_{HR})^2}} & \text{if } HR(t) < \max\{m_{HR}\} \end{cases}. \qquad (4.3)$$

Similarly, the degrees of the membership functions for the inputs QTc ($\mu_{N^k_{QT_c}}(QT_c(t))$), ∆HR ($\mu_{N^k_{\Delta HR}}(\Delta HR(t))$) and ∆QTc ($\mu_{N^k_{\Delta QT_c}}(\Delta QT_c(t))$) are defined in the same way as for the heart rate input.

Fig. 4.3. Fuzzy inputs.


4.2.1.2. Fuzzy reasoning


With these fuzzy inputs (HR, QTc , ∆HR, ∆QTc ) and fuzzy output
(presence of hypoglycemia, y), the behavior of the FRM is governed by
a set of fuzzy if-then rules in the following format:

Rule $\gamma$: IF $HR(t)$ is $N^k_{HR}$ AND $QT_c(t)$ is $N^k_{QT_c}$ AND $\Delta HR(t)$ is $N^k_{\Delta HR}$ AND $\Delta QT_c(t)$ is $N^k_{\Delta QT_c}$ THEN $y(t)$ is $w_\gamma$,  (4.4)

where $N^k_{HR}$, $N^k_{QT_c}$, $N^k_{\Delta HR}$ and $N^k_{\Delta QT_c}$ are fuzzy terms of rule $\gamma$, $\gamma = 1, 2, \ldots, n_r$; $n_r$ denotes the number of rules and is equal to $(m_f)^{n_{in}}$, where $n_{in}$ represents the number of inputs to the FRM; $w_\gamma \in [0, 1]$ is the fuzzy singleton to be determined.

4.2.1.3. Defuzzification
Defuzzification is the process of translating the output of the fuzzy rules
into a scale. The presence of hypoglycemia h(t) is given by:
$$h(t) = \begin{cases} -1 & \text{if } y(t) < 0 \\ +1 & \text{if } y(t) \ge 0 \end{cases}, \qquad (4.5)$$
where

$$y(t) = \sum_{\gamma=1}^{n_r} m_\gamma(t)\, w_\gamma, \qquad (4.6)$$

and

$$m_\gamma(t) = \frac{\mu_{N^k_z}(z(t))}{\sum_{\gamma=1}^{n_r} \mu_{N^k_z}(z(t))}, \qquad (4.7)$$

where

$$\mu_{N^k_z}(z(t)) = \mu_{N^k_{HR}}(HR(t)) \times \mu_{N^k_{QT_c}}(QT_c(t)) \times \mu_{N^k_{\Delta HR}}(\Delta HR(t)) \times \mu_{N^k_{\Delta QT_c}}(\Delta QT_c(t)). \qquad (4.8)$$

All the parameters of the FRM, $m_{HR}$, $\varsigma_{HR}$, $m_{QT_c}$, $\varsigma_{QT_c}$, $m_{\Delta HR}$, $\varsigma_{\Delta HR}$, $m_{\Delta QT_c}$, $\varsigma_{\Delta QT_c}$ and $w$, are tuned by the hybrid particle swarm optimization with wavelet mutation [49].
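To make the structure of Sections 4.2.1.1–4.2.1.3 concrete, the following is a minimal Python sketch of one forward pass of the FRM. It is not the authors' implementation: the helper names, the placeholder parameter values and the way the rule base is enumerated are illustrative assumptions, combining the Gaussian memberships of (4.1)–(4.3), the product firing strengths of (4.8), the normalized weighted sum of (4.6)–(4.7) and the thresholding of (4.5).

import numpy as np

def membership(x, means, sigmas):
    """Membership degrees of a scalar input to m_f Gaussian fuzzy sets; the sets
    with the smallest/largest mean saturate at 1 beyond that mean, as in (4.2)-(4.3)."""
    mu = np.exp(-((x - means) ** 2) / (2.0 * sigmas ** 2))
    if x <= means.min():
        mu[np.argmin(means)] = 1.0
    if x >= means.max():
        mu[np.argmax(means)] = 1.0
    return mu

def frm_output(inputs, params, w):
    """y(t) of (4.6): normalized rule firing strengths weighted by the singletons w."""
    # Firing strength of each rule = product of one membership degree per input, (4.8).
    grids = np.meshgrid(*[membership(x, m, s) for x, (m, s) in zip(inputs, params)],
                        indexing="ij")
    strengths = np.prod(np.stack(grids), axis=0).ravel()
    m_gamma = strengths / strengths.sum()          # (4.7)
    return float(np.dot(m_gamma, w))               # (4.6)

def detect(inputs, params, w):
    """Defuzzification and thresholding of (4.5): +1 hypoglycemia, -1 otherwise."""
    return 1 if frm_output(inputs, params, w) >= 0 else -1

if __name__ == "__main__":
    mf = 3
    # Placeholder means/standard deviations and rule singletons (tuned by HPSOWM in the chapter).
    params = [(np.linspace(0.8, 1.2, mf), np.full(mf, 0.1)) for _ in range(4)]
    w = np.random.uniform(0.0, 1.0, mf ** 4)
    print(frm_output([1.05, 1.10, 0.02, 0.03], params, w))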
4.2.2. Particle swarm optimization with wavelet mutation

To optimize the parameters of the fuzzy reasoning model, a global learning algorithm called hybrid particle swarm optimization with wavelet mutation [49] is employed in the system. PSO is an optimization method first developed in [40]. It models the processes of the sociological
behavior associated with bird flocking, and is one of the evolutionary
computation techniques. It uses a number of particles that constitute a
swarm. Each particle traverses the search space looking for the global
optimum. In this section, HPSOWM is introduced and the detailed
algorithm of HPSOWM is presented in Algorithm 4.2.1.

Algorithm 4.2.1: Pseudo code for HPSOWM(X(t))

    t ← 0
    Initialize X(t)
    output f(X(t))
    while (not termination condition) do
        t ← t + 1
        Update v(t) and x(t) based on (4.9) to (4.12)
        if v(t) > vmax then v(t) = vmax
        if v(t) < vmin then v(t) = vmin
        Perform the wavelet mutation operation with probability µm
        Update x̄^p_j(t) based on (4.13) to (4.15)
        output X(t)
        output f(X(t))
    end while
    return (x̂)    comment: return the best solution

From Algorithm 4.2.1, X(t) denotes the swarm at the t-th iteration. Each particle $x^p(t) \in X(t)$ contains $\kappa$ elements $x^p_j(t)$ at the t-th iteration, where $p = 1, 2, \ldots, \theta$ and $j = 1, 2, \ldots, \kappa$; $\theta$ denotes the number of particles in
the swarm and κ is the dimension of a particle. First, the particles of the
swarm are initialized and then evaluated by a defined fitness function. The
objective of HPSOWM is to minimize the fitness function (cost function)
f (X(t)) of particles iteratively. The swarm evolves from iteration t to t+1
by repeating the procedures as shown in Algorithm 4.2.1. The operations
are discussed as follows.
The velocity $v^p_j(t)$ (corresponding to the flight speed in the search space) and the position $x^p_j(t)$ of the j-th element of the p-th particle at the t-th generation can be calculated using the following formulae:

$$v^p_j(t) = k\cdot\big\{ w\cdot v^p_j(t-1) + \varphi_1\cdot r_1\,(\tilde{x}^p_j - x^p_j(t-1)) + \varphi_2\cdot r_2\,(\hat{x}_j - x^p_j(t-1)) \big\} \qquad (4.9)$$

and

$$x^p_j(t) = x^p_j(t-1) + v^p_j(t), \qquad (4.10)$$

where $\tilde{x}^p = [\tilde{x}^p_1, \tilde{x}^p_2, \ldots, \tilde{x}^p_\kappa]$ and $\hat{x} = [\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_\kappa]$, $j = 1, 2, \ldots, \kappa$. The best previous position of a particle is recorded and represented as $\tilde{x}^p$; the position of the best particle among all the particles is represented as $\hat{x}$; $w$ is an inertia weight factor; $\varphi_1$ and $\varphi_2$ are acceleration constants; $r_1$ and $r_2$ return a uniform random number in the range [0, 1]; $k$ is a constriction factor derived from the stability analysis of (4.10) to ensure that the system converges, but not prematurely [57]. Mathematically, $k$ is a function of $\varphi_1$ and $\varphi_2$ as reflected in the following equation:

$$k = \frac{2}{\left| 2 - \varphi - \sqrt{\varphi^2 - 4\varphi} \right|}, \qquad (4.11)$$

where φ = φ1 + φ2 and φ > 4. PSO utilizes x̃ and x̂ to modify the current
search point to avoid the particles moving in the same direction, but to
converge gradually toward x̃ and x̂. A suitable selection of the inertia
weight w provides a balance between the global and local explorations.
Generally, w can be dynamically set with the following equation [57]:
$$w = w_{max} - \frac{w_{max} - w_{min}}{T} \times t, \qquad (4.12)$$

where t is the current iteration number, T is the total number of iterations,
wmax and wmin are the upper and lower limits of the inertia weight, wmax
and wmin are set to 1.2 and 0.1 respectively.
In 4.9, the particle velocity is limited by a maximum value vmax . The
parameter vmax determines the resolution with which regions are to be
searched between the present position and the target position. This limit
enhances the local exploration of the problem space and it realistically
simulates the incremental changes of human learning. If vmax is too high,
particles might fly past good solutions. If vmax is too small, particles may
not explore sufficiently beyond local solutions. From experience, vmax is
often set at 10% to 20% of the dynamic range of the element on each
dimension. Next, the mutation operation is used to mutate the element of
particles.
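As a concrete illustration of (4.9)–(4.12), the following Python sketch performs one swarm update with velocity clamping. It is a sketch under stated assumptions rather than the chapter's implementation: the function names and the default values (φ1 = φ2 = 2.05, wmax = 1.2, wmin = 0.1) simply mirror the settings quoted in this chapter.

import numpy as np

def constriction_factor(phi1=2.05, phi2=2.05):
    phi = phi1 + phi2                                   # phi > 4 is required, (4.11)
    return 2.0 / abs(2.0 - phi - np.sqrt(phi ** 2 - 4.0 * phi))

def inertia_weight(t, T, w_max=1.2, w_min=0.1):
    return w_max - (w_max - w_min) / T * t              # (4.12)

def pso_step(x, v, p_best, g_best, t, T, v_max, phi1=2.05, phi2=2.05):
    """One velocity/position update for the whole swarm (x, v: theta-by-kappa arrays)."""
    k = constriction_factor(phi1, phi2)
    w = inertia_weight(t, T)
    r1 = np.random.rand(*x.shape)
    r2 = np.random.rand(*x.shape)
    v_new = k * (w * v + phi1 * r1 * (p_best - x) + phi2 * r2 * (g_best - x))  # (4.9)
    v_new = np.clip(v_new, -v_max, v_max)               # limit the flight speed
    return x + v_new, v_new                             # (4.10)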

4.2.2.1. Wavelet mutation

The wavelet mutation (WM) operation exhibits both solution stability and a fine-tuning ability. Every particle element of the swarm will have a chance to mutate, governed by a probability of mutation $\mu_m \in [0, 1]$, which is defined by the user. For each particle element, a random number between zero and one is generated such that if it is less than or equal to $\mu_m$, the mutation will take place on that element. For instance, if $x^p(t) = [x^p_1(t), x^p_2(t), \ldots, x^p_\kappa(t)]$ is the selected p-th particle and the element $x^p_j(t)$ is randomly selected for mutation (the value of $x^p_j(t)$ is inside the particle element's boundaries $[\rho^j_{min}, \rho^j_{max}]$), the resulting particle is given by $\bar{x}^p(t) = [\bar{x}^p_1(t), \bar{x}^p_2(t), \ldots, \bar{x}^p_\kappa(t)]$, where

$$\bar{x}^p_j(t) = \begin{cases} x^p_j(t) + \sigma \times (\rho^j_{max} - x^p_j(t)) & \text{if } \sigma > 0 \\ x^p_j(t) + \sigma \times (x^p_j(t) - \rho^j_{min}) & \text{if } \sigma < 0 \end{cases} \qquad (4.13)$$

where $j \in \{1, 2, \ldots, \kappa\}$, $\kappa$ denotes the dimension of a particle, and the value of $\sigma$ is governed by the Morlet wavelet function [58]:

$$\sigma = \psi_{a,0}(\varphi) = \frac{1}{\sqrt{a}}\,\psi\!\left(\frac{\varphi}{a}\right) = \frac{1}{\sqrt{a}}\, e^{-\frac{1}{2}\left(\frac{\varphi}{a}\right)^2} \cos\!\left(5\,\frac{\varphi}{a}\right). \qquad (4.14)$$

The amplitude of $\psi_{a,0}(\varphi)$ is scaled down as the dilation parameter $a$ increases. This property is used in the mutation operation in order to enhance the searching performance. According to (4.14), if $\sigma$ is positive and approaching one, the mutated element of the particle will tend to the maximum value of $x^p_j(t)$. Conversely, when $\sigma$ is negative ($\sigma \le 0$) and approaching $-1$, the mutated element of the particle will tend to the minimum value of $x^p_j(t)$. A larger value of $|\sigma|$ gives a larger searching space for $x^p_j(t)$, while a small $|\sigma|$ gives a smaller searching space for fine-tuning. As over 99% of the total energy of the mother wavelet function is contained in the interval $[-2.5, 2.5]$, $\varphi$ can be randomly generated from $[-2.5 \times a, 2.5 \times a]$. The value of the dilation parameter $a$ is set to
vary with the value of $t/T$ in order to meet the fine-tuning purpose, where $T$ is the total number of iterations and $t$ is the current iteration number. In order to perform a local search when $t$ is large, the value of $a$ should increase as $t/T$ increases so as to reduce the significance of the mutation. Hence, a monotonic increasing function governing $a$ and $t/T$ is proposed in the following form:

$$a = e^{-\ln(g)\times\left(1-\frac{t}{T}\right)^{\zeta_{wm}} + \ln(g)}, \qquad (4.15)$$

where $\zeta_{wm}$ is the shape parameter of the monotonic increasing function and $g$ is the upper limit of the parameter $a$. The effects of various values of the shape parameter $\zeta_{wm}$ and of the parameter $g$ on $a$ with respect to $t/T$ are shown in Figs 4.4 and 4.5, respectively. In these figures, $g$ is set as 10000; thus, the value of $a$ is between 1 and 10000. Referring to (4.14), the maximum value of $\sigma$ is 1 when the random number $\varphi = 0$ and $a = 1$ ($t/T = 0$). Then, referring to (4.13), the resulting particle element is $\bar{x}^p_j(t) = x^p_j(t) + 1 \times (\rho^j_{max} - x^p_j(t)) = \rho^j_{max}$. This ensures that a large search space is given for the mutated element. When the value of $t/T$ is near to 1, the value of $a$ is so large that the maximum value of $\sigma$ becomes very small. For example, at $t/T = 0.9$ and $\zeta_{wm} = 1$, the dilation parameter is $a = 4000$; if the random value of $\varphi$ is zero, the value of $\sigma$ will be equal to 0.0158. With $\bar{x}^p_j(t) = x^p_j(t) + 0.0158 \times (\rho^j_{max} - x^p_j(t))$, a smaller searching space is given for the mutated element for fine-tuning. Changing the parameter $\zeta_{wm}$ will change the characteristics of the monotonic increasing function of the wavelet mutation. The dilation parameter $a$ will take a value so as to perform fine-tuning faster as $\zeta_{wm}$ increases. It is chosen by trial and error, depending on the kind of optimization problem. When $\zeta_{wm}$ becomes larger, the decreasing speed of the step size ($\sigma$) of the mutation becomes faster. In general, if the optimization problem is smooth and symmetric, it is easier for the searching algorithm to find the solution and the fine-tuning can be done in the early stage. Thus, a larger value of $\zeta_{wm}$ can be used to increase the step size of the early mutation. After the operation of wavelet mutation, a new swarm is generated. This new swarm repeats the same process, and the iterative process terminates when the pre-defined number of iterations is met.
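The following Python sketch illustrates the mutation operation of (4.13)–(4.15). It is an illustrative sketch, not the authors' code; the boundary vectors rho_min and rho_max and the default values of µm, ζwm and g are assumptions taken from the settings listed later in Section 4.3.

import numpy as np

def dilation(t, T, zeta_wm=2.0, g=10000.0):
    """Dilation parameter a of (4.15), growing from 1 towards g as t/T goes to 1."""
    return np.exp(-np.log(g) * (1.0 - t / T) ** zeta_wm + np.log(g))

def morlet_sigma(a):
    """Random step size sigma of (4.14), with phi drawn from [-2.5a, 2.5a]."""
    phi = np.random.uniform(-2.5 * a, 2.5 * a)
    return (1.0 / np.sqrt(a)) * np.exp(-0.5 * (phi / a) ** 2) * np.cos(5.0 * phi / a)

def wavelet_mutate(x, rho_min, rho_max, t, T, mu_m=0.5, zeta_wm=2.0, g=10000.0):
    """Mutate each element of particle x with probability mu_m, as in (4.13)."""
    a = dilation(t, T, zeta_wm, g)
    x = x.copy()
    for j in range(x.size):
        if np.random.rand() <= mu_m:
            sigma = morlet_sigma(a)
            if sigma > 0:
                x[j] = x[j] + sigma * (rho_max[j] - x[j])
            else:
                x[j] = x[j] + sigma * (x[j] - rho_min[j])
    return x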

4.2.3. Choosing the HPSOWM parameters


HPSOWM seeks a balance between the exploration of new regions and the exploitation of already sampled regions of the search space.
Fig. 4.4. Effect of the shape parameter ζwm on the dilation parameter a with respect to t/T (curves shown for ζwm = 0.2, 0.5, 1, 2 and 5).

This balance, which critically affects the performance of HPSOWM, is governed by the right choices of the control parameters: the swarm size (θ), the probability of mutation (µm), the shape parameter (ζwm) and the parameter g of the wavelet mutation. Some guidelines for choosing these parameters are given as follows:

(i) Increasing swarm size (θ) will increase the diversity of the search space,
and reduce the probability that HPSOWM prematurely converges to
a local optimum. However, it also increases the time required for the
population to converge to the optimal region in the search space.
(ii) Increasing the probability of mutation (µm ) tends to transform the search
into a random search such that when µm = 1, all elements of particles
will mutate. This probability gives us an expected number (µm × θ × κ)
of elements of particles that undergo the mutation operation. In other
words, the value of µm depends on the desired number of elements of
particles that undergo the mutation operation. Normally, when the
dimension is very low (the number of elements of a particle is less than 5), µm is set at 0.5 to 0.8.

Fig. 4.5. Effect of the parameter g on the dilation parameter a with respect to t/T (curves shown for g = 100, 1000, 10000 and 100000).

When the dimension is around 5 to 10, µm is set at 0.3 to 0.4. When the dimension is in the range of 11 to 100, µm is set
at 0.1 to 0.2. When the dimension is in the range of 101 to 1000, normally
µm is set at 0.05 to 0.1. Lastly, when the dimension is very high (number
of elements of particles is larger than 1000), µm is set at < 0.05. The
rationale to set this selection criterion is: when the dimension is high, µm
should be set to be a smaller value; when the dimension is low, µm should
be set to be a larger value. This is because if the dimension is high and
µm is set to be a larger number, then the number of elements of particles
undergoing mutation operation will be large. It will increase the searching
time and more importantly it destroys the current information about the
application in each iteration as all elements of particles are randomly
assigned. Generally speaking, by choosing the value of µm appropriately, the ratio of the number of particle elements undergoing the mutation operation to the population size can be maintained to prevent the searching process from turning into a random search. Thus, the value of µm is based on
this selection criterion and chosen by trial and error through experiments
for good performance for all functions.
(iii) The dilation parameter a is governed by the monotonic increasing function (4.15), and this monotonic increasing function is controlled by two parameters: the shape parameter ζwm and the parameter g. Changing the parameter ζwm changes the characteristics of the monotonic increasing function of the wavelet mutation. The dilation parameter a will take a value so as to perform fine-tuning faster as ζwm increases. It is chosen by trial and error, depending on the kind of optimization problem. When ζwm becomes larger, the decreasing speed of the step size (σ) of the mutation becomes faster. In general, if the optimization problem is smooth and symmetric, it is easier to find the solution and the fine-tuning can be done in early iterations. Thus, a larger value of ζwm can be used to increase the step size of the early mutation. The parameter g is the upper limit of the dilation parameter a. A larger value of g implies that the maximum value of a is larger; in other words, the maximum value of |σ| will be smaller (a smaller searching limit is given). Conversely, a smaller value of g implies that the maximum value of a is smaller; on the other hand, the maximum value of |σ| will be larger (a larger searching limit is given). In our point of view, fixing one parameter and adjusting the other to control the monotonic increasing function is the more convenient way to find a good setting.

4.2.4. Fitness function and training


In this system, HPSOWM is employed to optimize the fuzzy rules and membership functions by finding the best parameters of the FRM.
The function of the fuzzy reasoning model is to detect the hypoglycemic
episodes accurately. To measure the performance of the biomedical
classification test, sensitivity and specificity are introduced [59]. The
sensitivity measures the proportion of actual positives which are correctly
identified and the specificity measures the proportion of negatives which are
correctly identified. The definitions of the sensitivity (ξ) and the specificity
(η) are given as follows:
$$\xi = \frac{N_{TP}}{N_{TP} + N_{FN}}, \qquad (4.16)$$

$$\eta = \frac{N_{TN}}{N_{TN} + N_{FP}}, \qquad (4.17)$$
where $N_{TP}$ is the number of true positives, which implies the sick people correctly diagnosed as sick; $N_{FN}$ is the number of false negatives,
which implies sick people wrongly diagnosed as healthy; NF P is the number
of false positives which implies healthy people wrongly diagnosed as sick;
and NT N is the number of true negatives which implies healthy people
correctly diagnosed as healthy. The values of ξ and η lie between zero and one. The objective of the system is to maximize the sensitivity and the specificity; thus, the fitness function f(ξ, η) is defined as follows:

$$f(\xi, \eta) = \xi + \frac{\eta_{max} - \eta}{\eta_{max}}, \qquad (4.18)$$

where $\eta_{max}$ is the upper limit of the specificity. The objective is to maximize the fitness function of (4.18), which is equivalent to maximizing the sensitivity and the specificity. In (4.18), the specificity is limited by a maximum value $\eta_{max}$. The parameter $\eta_{max}$ is used to fix the region of specificity and to find the optimal sensitivity in this region. In particular, $\eta_{max}$ can be set from zero to one, and different sensitivities with different specificity values can be determined. With this proposed fitness function, a satisfactory ROC curve for this hypoglycemia detection can be found. The ROC curve is commonly used in medical decision making and is a useful technique for visualizing the classification performance. The classification results, together with the ROC curve of the proposed method, are illustrated in Section 4.3.
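A minimal sketch of the fitness evaluation of (4.16)–(4.18) is given below. The confusion-matrix counts and the value of ηmax used in the example are hypothetical and only illustrate the arithmetic; they are not results from the study.

def sensitivity(tp, fn):
    return tp / (tp + fn)                    # (4.16)

def specificity(tn, fp):
    return tn / (tn + fp)                    # (4.17)

def fitness(xi, eta, eta_max=0.5):
    """f(xi, eta) of (4.18); eta_max fixes the specificity region being explored."""
    return xi + (eta_max - eta) / eta_max

if __name__ == "__main__":
    xi = sensitivity(tp=70, fn=16)           # hypothetical confusion-matrix counts
    eta = specificity(tn=42, fp=58)
    print(round(fitness(xi, eta, eta_max=0.5), 3))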

4.3. Results and Discussion

Fifteen children with T1DM (14.6 ± 1.5 years) volunteered for the
10-hour overnight hypoglycemia study at the Princess Margaret Hospital
for Children in Perth, Western Australia, Australia. Each patient is
monitored overnight for the natural occurrence of nocturnal hypoglycemia.
Data are collected with approval from Women’s and Children’s Health
Service, Department of Health, Government of Western Australia and with
informed consent. A comprehensive patient information and consent form
is formulated and approved by the Ethics Committee. The consent form
includes the actual consent and a revocation of consent page. Each patient
receives this information consent form at least two weeks prior to the
start of the studies. He/she has the opportunity to raise questions and
concerns with any medical advisors, researchers and the investigators. For
the children participating in this study, the parent or guardian signed the
relevant forms.
The required physiological parameters are measured by the use of the
non-invasive monitoring system in [56], while the actual blood glucose levels
(BGL) are collected for reference using Yellow Spring Instruments. The
main parameters used for the detection of hypoglycemia are the heart
rate and corrected QT interval. The actual blood glucose profiles for
15 T1DM children [56] are shown in Fig. 4.6. The responses from 15
T1DM children exhibit significant changes during the hypoglycemia phase
against the non-hypoglycemia phase. Normalization is used to reduce
patient-to-patient variability and to enable group comparison by dividing
the patient’s heart rate and corrected QT interval by his/her corresponding
values at time zero.
The study shows that the detection of hypoglycemic episodes (BGL
≤ 2.8 mmol/l and BGL ≤ 3.3 mmol/l) using these variables is based on a
hybrid PSO with wavelet mutation-based fuzzy reasoning model developed
from the obtained clinical data. In effect, it estimates the presence of hypoglycemia at sampling period $k_s$ on the basis of the data at sampling period $k_s$ and the previous data at sampling period $k_s - 1$. In general, the sampling period is 5 minutes and approximately 35 to 40 data points are used for each patient. The overall data set consisted of a training set and a testing set; 10 patients for training and 5 patients for testing were randomly selected. The whole data set included both hypoglycemia and non-hypoglycemia data. HPSOWM is used to find the optimized fuzzy rules and membership functions of the FRM; the basic settings of its parameters are as follows:

• Swarm size θ: 50;
• Acceleration constants φ1 and φ2: 2.05;
• Maximum velocity vmax: 0.2 or 20%;
• Probability of mutation µm: 0.5;
• Shape parameter of wavelet mutation ζwm [49]: 2;
• Constant value g of wavelet mutation [49]: 10000;
• Number of iterations T: 1000.

Table 4.1. Simulation Results.

          Training                   Testing
Method    Sensitivity  Specificity   Sensitivity  Specificity   Area of ROC curve
EFIS      89.23%       39.87%        81.40%       41.82%        71.46%
EMR       78.46%       39.23%        72.09%       41.82%        64.19%
Fig. 4.6. Actual blood glucose level profiles in 15 T1DM children.

The clinical results of hypoglycemia detection with the different approaches, the evolved fuzzy inference system (EFIS) and the evolved multiple regression (EMR), are tabulated in Table 4.1. By the use of the proposed EFIS, the test sensitivity and specificity are about 81.40% and 41.82%, while the EMR detection gives 72.09% and 41.82%. From Table 4.1, we can see that the training and testing results of the EFIS are better than those of the EMR in terms of sensitivity and specificity. Comparing the areas under the ROC curves for the EFIS (71.46%) and the EMR (64.19%) in Fig. 4.7, it can be distinctly seen that the proposed EFIS detection system has higher accuracy than detection with the EMR. Thus, the proposed detection system gives satisfactory results with higher accuracy.

4.4. Conclusion

A hybrid particle swarm optimization-based fuzzy inference system is developed in this chapter to detect hypoglycemic episodes in diabetes patients. The results in Section 4.3 indicate that the hypoglycemic episodes
Fig. 4.7. ROC curve for testing.

in T1DM children can be detected non-invasively and continuously from the
real-time physiological responses (heart rate, corrected QT interval, change
of heart rate and change of corrected QT interval). To optimize the fuzzy
rules and membership functions, a hybrid particle swarm optimization is
presented where wavelet mutation operation is introduced to enhance the
optimization performance. A real T1DM study is given to illustrate that
the proposed algorithm produces better results compared with a multiple
regression algorithm in terms of the sensitivity and specificity. To conclude,
the performance of the proposed algorithm for detection of hypoglycemic
episodes for T1DM is satisfactory, as the sensitivity is 81.40% and specificity
is 41.82% for hypoglycemia defined as a blood glucose level below 3.3 mmol/l.

References

[1] T. Duning and B. Ellger, Is hypoglycaemia dangerous?, Best Practice and
Research Clinical Anaesthesiology. 23(4), 473–485, (1996).
[2] M. Egger and G. D. Smith, Hypoglycaemia awareness and human insulin,
The Lancet. 338(8772), 950–951, (1991).
[3] D. J. Becker and C. M. Ryan, Hypoglycemia: A complication of diabetes
therapy in children, Trends in Endocrinology and Metabolism. 11(5),
198–202, (2000).
[4] D. R. Group, Adverse events and their association with treatment regimens
in the Diabetes Control and Complications Trial, Diabetes Care. 18,
1415–1427, (1995).
[5] D. R. Group, Epidemiology of severe hypoglycemia in the diabetes control
and complication trial, The American Journal of Medicine. 90(4), 450–459,
(1991).
[6] M. A. E. Merbis, F. J. Snoek, K. Kanc, and R. J. Heine, Hypoglycaemia
induces emotional disruption, Patient Education and Counseling. 29(1),
117–122, (1996).
[7] P. E. Cryer, Symptoms of hypoglycemia, thresholds for their occurrence and
hypoglycemia unawareness, Endocrinology and Metabolism Clinics of North
America. 28(3), 495–500, (1999).
[8] G. Heger, K. Howorka, H. Thoma, G. Tribl, and J. Zeitlhofer, Monitoring
set-up for selection of parameters for detection of hypoglycaemia in diabetic
patients, Medical and Biological Engineering and Computing. 34(1), 69–75,
(1996).
[9] S. Pramming, B. Thorsteinsson, I. Bendtson, and C. Binder, Symptomatic
hypoglycaemia in 411 type 1 diabetic patients, Diabetic medicine : A Journal
of the British Diabetic Association. 8(3), 217–222, (1991).
[10] J. C. Pickup, Sensitive glucose sensing in diabetes, Lancet. 355, 426–427,
(2000).
[11] E. A. Gale, T. Bennett, I. A. Macdonald, J. J. Holst, and J. A. Matthews,
The physiological effects of insulin-induced hypoglycaemia in man: responses
at differing levels of blood glucose, Clinical Science. 65(3), 263–271, (1983).
[12] N. D. Harris, S. B. Baykouchev, and J. L. B. Marques, A portable system
for monitoring physiological responses to hypoglycaemia, Journal of Medical
Engineering and Technology. 20(6), 196–202, (1996).
[13] R. B. Tattersall and G. V. Gill, Unexplained death of type 1 diabetic
patients, Diabetic Medicine. 8(1), 49–58, (1991).
[14] J. L. B. Marques, E. George, S. R. Peacey, N. D. Harris, and T. C.
I. A. Macdonald, Altered ventricular repolarization during hypoglycaemia
in patients with diabetes, Diabetic Medicine. 14(8), 648–654, (1997).
[15] B. H. Cho, H. Yu, K. Kim, T. H. Kim, I. Y. Kim, and S. I. Kim, Application
of irregular and unbalanced data to predict diabetic nephropathy using
visualization and feature selection methods, Artificial Intelligence in
Medicine archive. 42(1), 37–53, (2008).
[16] A. Chu, H. Ahn, B. Halwan, B. Kalmin, E. L. V. Artifon, A. Barkun,
M. G. Lagoudakis, and A. Kumar, A decision support system to facilitate
management of patients with acute gastrointestinal bleeding, Artificial
Intelligence in Medicine Archive. 42(3), 247–259, (2008).
[17] C. L. Chang and C. H. Chen, Applying decision tree and neural network to
increase quality of dermatologic diagnosis, Expert Systems with Applications.
36(2), 4035–4041, (2009).
[18] G. A. F. Seber and A. J. Lee, Linear regression analysis. (John Wiley &
Sons, New York, 2003).
[19] S. M. Winkler, M. Affenzeller, and S. Wagner, Using enhanced genetic
programming techniques for evolving classifiers in the context of medical
diagnosis, Genetic Programming and Evolvable Machines. 10(2), 111–140,
(2009).
[20] T. S. Subashini, V. Ramalingam, and S. Palanivel, Breast mass classification
based on cytological patterns using RBFNN and SVM, Source Expert
Systems with Applications: An International Journal Archive. 36(3),
5284–5290, (2009).
[21] H. F. Gray and R. J. Maxwell, Genetic programming for classification and
feature selection: analysis of 1H nuclear magnetic resonance spectra from
human brain tumour biopsies, NMR in Biomedicine. 11(4), 217–224, (1998).
[22] C. M. Fira and L. Goras, An ECG signals compression method and its
validation using NNs, IEEE Transactions on Biomedical Engineering. 55
(4), 1319–1326, (2008).
[23] W. Jiang, S. G. Kong, and G. D. Peterson, ECG signal classification using
block-based neural networks, IEEE Transactions on Neural Networks. 18
(6), 1750–1761, (2007).
[24] A. H. Khandoker, M. Palaniswami, and C. K. Karmakar, Support vector
machines for automated recognition of obstructive sleep apnea syndrome
from ECG recordings, IEEE Transactions on Information Technology in
Biomedicine. 13(1), 37–48, (2009).
[25] J. S. Jang and C. T. Sun, Neuro-Fuzzy and Soft Computing: A
Computational Approach to Learning and Machine Intelligence. (Prentice
Hall, Upper Saddle River, NJ, 1997).
[26] C. Cortes and V. Vapnik, Support-vector networks, Machine Learning. 20
(3), 273–297, (1995).
[27] A. K. Jain, J. Mao, and K. M. Mohiuddin, Artificial neural networks: a
tutorial, IEEE Computer Society. 29(3), 31–44, (1996).
[28] K. H. Chon and R. J. Cohen, Linear and nonlinear ARMA model
parameter estimation using an artificial neural network, IEEE Transactions
on Biomedical Engineering. 44(3), 168–174, (1997).
[29] W. W. Melek, Z. Lu, A. Kapps, and B. Cheung, Modeling of dynamic
cardiovascular responses during G-transition-induced orthostatic stress in
pitch and roll rotations, IEEE Transactions on Biomedical Engineering. 49
(12), 1481–1490, (2002).
[30] S. Wang and W. Min, A new detection algorithm (NDA) based on fuzzy
cellular neural networks for white blood cell detection, IEEE Transactions
on Information Technology in Biomedicine. 10(1), 5–10, (2006).
[31] Y. Hata, S. Kobashi, K. Kondo, Y. Kitamura, and T. Yanagida, Transcranial
ultrasonography system for visualizing skull and brain surface aided by fuzzy
expert system, IEEE Transactions on Systems, Man, and Cybernetics - Part
B. 35(6), 1360–1373, (2005).
[32] S. Kobashi, Y. Fujiki, M. Matsui, and N. Inoue, Genetic programming
for classification and feature selection: analysis of 1H nuclear magnetic
resonance spectra from human brain tumour biopsies, IEEE Transactions
on Systems, Man, and Cybernetics, Part B. 36(1), 74–86, (2006).
[33] R. Das, I. Turkoglu, and A. Sengur, Effective diagnosis of heart disease
through neural networks ensembles, An International Journal Source Expert
Systems with Applications. 36(4), 7675–7680, (2009).
[34] E. I. Papageorgiou, C. D. Stylios, and P. P. Groumpos, An integrated
two-level hierarchical system for decision making in radiation therapy based
on fuzzy cognitive maps, IEEE Transactions on Biomedical Engineering. 50
(12), 1326–1339, (2003).
[35] P. A. Mastorocostas and J. B. Theocharis, A stable learning algorithm for
block-diagonal recurrent neural networks: application to the analysis of
lung sounds, IEEE Transactions on Systems, Man, and Cybernetics. 36(2),
242–254, (2006).
[36] M. Brown and C. Harris, Neural Fuzzy Adaptive Modeling and Control.
(Prentice Hall, Upper Saddle River, NJ, 1994).
[37] J. Reggia and S. Sutton, Self-processing networks and their biomedical
implications, Proceedings of the IEEE. 76(6), 580–592, (1988).
[38] D. Meyer, F. Leisch, and K. Hornik, The support vector machine under test,
Neurocomputing. 55(6), 169–186, (2003).
[39] N. Acir, A support vector machine classifier algorithm based on a
perturbation method and its application to ECG beat recognition systems,
Expert Systems with Applications. 31(1), 150–158, (2006).
[40] J. Kennedy and R. Eberhart. Particle swarm optimization. In Proceedings of
IEEE International Conference on Neural Networks, pp. 1942–1948, (1995).
[41] Z. Michalewicz, Genetic algorithms + data structures = evolution programs
(2nd, extended ed.). (Springer–Verlag, Berlin, 1994).
[42] R. Storn and K. Price, Differential evolution – a simple and efficient
heuristic for global optimization over continuous spaces, Journal of Global
Optimization. 11, 341–359, (1997).
[43] M. Dorigo and T. Stuzle, Ant Colony Optimization. (MIT Press, Cambridge,
MA, 2004).
[44] Nuryani, S. Ling, and H. Nguyen. Hypoglycaemia detection for type 1
diabetic patients based on ECG parameters using fuzzy support vector
machine. In Proceedings of International Joint Conference on Neural
Networks, pp. 2253–2259, (2010).
[45] A. Keles, S. Hasiloglu, K. Ali, and Y. Aksoy, Neuro-fuzzy classification of
prostate cancer using NEFCLASS-J, Computers in Biology and Medicine.
37(11), 1617–1628, (2007).
[46] R. J. Oentaryo, M. Pasquier, and C. Quek, GenSoFNN-Yager: A novel
brain-inspired generic self-organizing neuro-fuzzy system realizing Yager
inference, Expert Systems with Application. 35(4), 1825–1840, (2008).
[47] S. Osowski and H. Tran, ECG beat recognition using fuzzy hybrid neural
network, IEEE Transactions on Biomedical Engineering. 48(11), 875–884,
(2001).
[48] H. Mamdani and S. Assilian, An experiment in linguistic synthesis with a
fuzzy logic controller, International Journal of Man–Machine Studies. 7(1),
1–13, (1975).
[49] S. H. Ling and H. C. C. Iu, Hybrid particle swarm optimization with wavelet
mutation and its industrial applications, IEEE Transactions on Systems,
Man, and Cybernetics–Part B: Cybernetics. 38(3), 743–763, (2008).
[50] M. L. Astion, M. H. Wener, R. G. Thomas, and G. G. Hunder, Overtraining
in neural networks that interpret clinical data, Clinical Chemistry. 39(9),
1998–2004, (1993).
[51] Diabetes Research in Children Network (DirecNet) Study Group, Evaluation
of factors affecting CGMS calibration, Diabetes technology and therapeutics.
8(3), 318–325, (2006).
[52] M. J. Tansey and R. W. Beck, Accuracy of the modified continuous glucose
monitoring system (CGMS) sensor in an outpatient setting: results from a
diabetes research in children network (DirecNet) study, Diabetes Technology
and Therapeutics. 7(1), 109–114, (2005).
[53] F. F. Maia and L. R. Arajo, Efficacy of continuous glucose monitoring system
to detect unrecognized hypoglycemia in children and adolescents with type
1 diabetes, Arquivos Brasileiros De Endocrinologia E Metabologia. 49(4),
569–574, (2005).
[54] R. L. Weinstein, Accuracy of the freestyle navigator CGMS: comparison with
frequent laboratory measurements, Diabetes Care. 30, 1125–1130, (2007).
[55] S. R. Heller and I. A. MacDonald, Physiological disturbances in
hypoglycaemia: effect on subjective awareness, Clinical Science. 81(1), 1–9,
(1991).
[56] H. T. Nguyen, N. Ghevondian, and T. W. Jones, Neural-network detection of
hypoglycemic episodes in children with Type 1 diabetes using physiological
parameters, Proceedings of the 28th Annual International Conference of the
IEEE Engineering in Medicine and Biology Society. pp. 6053–6056, (2006).
[57] R. C. Eberhart and Y. Shi, Comparing inertia weights and constriction
factors in particle swarm optimization, Proceedings of the IEEE Congress
on Evolutionary Computing. pp. 84–88, (2000).
[58] I. Daubechies, Ten Lectures on Wavelets. (Society for Industrial and Applied
Mathematics, Philadelphia, 1992).
[59] D. G. Altman and J. M. Bland, Statistics notes: Diagnostic tests 1:
sensitivity and specificity, Clinical Chemistry. (308), 1552–1552, (1994).
PART 3

Neural Networks and their Applications

Chapter 5

Study of Limit Cycle Behavior of Weights of Perceptron

C.Y.F. Ho and B.W.K. Ling


School of Engineering, University of Lincoln
Lincoln, Lincolnshire, LN6 7TS, United Kingdom
[email protected]

In this chapter, limit cycle behavior of weights of perceptron is discussed.
First, the weights of a perceptron are bounded for all initial weights if there exists a nonempty set of initial weights for which the weights of the perceptron are bounded. Hence, the boundedness condition of the
weights of the perceptron is independent of the initial weights. Second,
a necessary and sufficient condition for the weights of the perceptron
exhibiting a limit cycle behavior is discussed. The range of the number
of updates for the weights of the perceptron required to reach the limit
cycle is estimated. Finally, it is suggested that the perceptron exhibiting
the limit cycle behavior can be employed for solving a recognition
problem when downsampled sets of bounded training feature vectors
are linearly separable. Numerical computer simulation results show that
the perceptron exhibiting the limit cycle behavior can achieve a better
recognition performance compared to a multi-layer perceptron.

Contents

5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.2 Notations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.3 Global Boundness Property . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.4 Limit Cycle Behavior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
5.5 Application of Perceptron Exhibiting Limit Cycle Behavior . . . . . . . . . . 97
5.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100

5.1. Introduction

Since the implementation cost of a perceptron is low and a perceptron
can classify linearly separable bounded training feature vectors [1–3],
perceptrons [4–8] are widely applied in many pattern recognition systems.


However, as the values of the output of the perceptron are binary,
they can be represented by symbols and the dynamics of the perceptron
are governed by symbolic dynamics. Symbolic dynamics is very complex
because limit cycle and chaotic behaviors may occur. One of the properties
of symbolic dynamical systems is that the system state vectors may be
bounded for some initial system state vectors while they may not be
bounded for other initial system state vectors. Hence, it is expected that
the boundedness condition of the weights of the perceptron would also
depend on the initial weights. In fact, the boundedness condition of the
weights of the perceptron is not completely known yet. As the boundedness
property is very important because of safety reasons, this chapter studies
the boundedness condition of the weights of the perceptron.
It is well known from the perceptron convergence theorem [1–3] that
the weights of the perceptron will converge to a fixed point within a finite
number of updates if the set of bounded training feature vectors is linearly
separable, and the weights may exhibit a limit cycle behavior if the set
of bounded training feature vectors is nonlinearly separable. However, the
exact condition for the weights of the perceptron exhibiting the limit cycle
behavior is unknown. A perceptron exhibiting the limit cycle behavior is
actually a neural network with time periodically varying coefficients. In
fact, this is a generalization of the perceptron with constant coefficients.
Hence, better performances will result if the downsampled sets of bounded
training feature vectors are linearly separable. By knowing the exact
condition for the weights of the perceptron exhibiting limit cycle behaviors,
one can operate the perceptron accordingly so that better performances are
achieved. Besides, the range of the number of updates for the weights of
the perceptron to reach the limit cycle is also unknown when the weights of
the perceptron exhibit the limit cycle behavior. The range of the number of
updates for the weights of the perceptron to reach the limit cycle relates to
the rate of the convergence of the training algorithm. Hence, by knowing the
range of the number of updates for the weights of the perceptron to reach
the limit cycle, one can estimate the computational effort of the training
algorithm. The details of these issues will be discussed in Section 5.4.
The outline of this chapter is as follows. Notations used throughout this
chapter will be introduced in Section 5.2. It will be discussed in Section
5.3 that the weights of the perceptron are bounded for all initial weights if there exists a nonempty set of initial weights for which the weights of the perceptron are bounded. A necessary and sufficient condition for the weights
of the perceptron exhibiting the limit cycle behavior will be discussed in
Section 5.4. Also, the range of the number of updates for the weights
of the perceptron to reach the limit cycle will be estimated in the same
section. Numerical computer simulation results will be shown in Section
5.5 to illustrate that the perceptron exhibiting the limit cycle behavior
can achieve a better recognition performance compared to a multi-layer
perceptron. Finally, a conclusion will be drawn in Section 5.6.

5.2. Notations

Denote N as the number of the bounded training feature vectors and d
as the dimension of these bounded training feature vectors. Denote the
elements in the bounded training feature vectors as xi (k) for i = 1, 2, . . . , d
and for k = 0, 1, . . . , N − 1. Define x(k) ≡ [1, x1 (k), . . . , xd (k)]T for k =
0, 1, . . . , N − 1 and x(N n + k) = x(k) ∀n ≥ 0 and for k = 0, 1, . . . , N − 1,
where the superscript T denotes the transposition operator. Denote the
weights of the perceptron as wi (n) for i = 1, 2, . . . , d and ∀n ≥ 0. Denote
the threshold of the perceptron as w0 (n) ∀n ≥ 0 and the activation function
of the perceptron as
$$Q(z) = \begin{cases} 1 & z \ge 0 \\ -1 & z < 0 \end{cases}.$$

Define w(n) ≡ [w0 (n), w1 (n), . . . , wd (n)]T ∀n ≥ 0 and denote the output
of the perceptron as y(n) ∀n ≥ 0, then y(n) = Q(wT (n)x(n)) ∀n ≥ 0.
Denote the desired output of the perceptron corresponding to x(n) as t(n)
∀n ≥ 0. Assume that the perceptron training algorithm [9] is employed for
the training, so the update rule for the weights of the perceptron is as follows:

$$w(n+1) = w(n) + \frac{t(n) - y(n)}{2}\, x(n) \quad \forall n \ge 0. \qquad (5.1)$$

Denote the absolute value of a real number as $|\cdot|$ and the 2-norm of a vector as $\|v\| \equiv \sqrt{\sum_{i=1}^{d} v_i^2}$, where $v \equiv [v_1, \ldots, v_d]^T$. Denote $K$ as the maximum 2-norm of the vectors in the set of bounded training feature vectors, that is $K \equiv \max_{0 \le k \le N-1} \|x(k)\|$. Denote $\emptyset$ as the empty set.
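As a concrete reading of the notation above, the following Python sketch applies the training rule (5.1) with the activation function Q and the periodic extension x(Nn + k) = x(k). It is a minimal sketch rather than the authors' implementation; the toy feature vectors and the number of updates are illustrative assumptions, not data from the chapter.

import numpy as np

def Q(z):
    return 1 if z >= 0 else -1

def train(x_list, t_list, w0, updates):
    """Apply (5.1) for a given number of updates; returns the weight history."""
    w = np.array(w0, dtype=float)
    N = len(x_list)
    history = [w.copy()]
    for n in range(updates):
        x = np.array(x_list[n % N], dtype=float)   # periodic extension of the data
        y = Q(w @ x)                               # y(n) = Q(w^T(n) x(n))
        w = w + 0.5 * (t_list[n % N] - y) * x      # update rule (5.1)
        history.append(w.copy())
    return history

# Example: two-dimensional feature vectors augmented with a leading 1 (the threshold input).
x_list = [[1, 0.0, 1.0], [1, 1.0, 0.0], [1, -1.0, 0.5]]
t_list = [1, -1, 1]
weights = train(x_list, t_list, w0=[0.0, 0.0, 0.0], updates=30)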
5.3. Global Boundness Property

Since y(n) ∈ {1, −1} ∀n ≥ 0, the values of y(n) can be represented


as symbols and the dynamics of the perceptron is governed by symbolic
dynamics. As discussed in Section 5.1, the boundedness condition of the
system state vectors of general symbolic dynamical systems depends on
the initial system state vectors, so one may expect that different initial
weights would lead to different boundedness conclusions. However, it is
found that if there exists a nonempty set of initial weights that leads to the
bounded behavior, then all initial weights will lead to the bounded behavior.
That means, the boundedness condition of the weights of the perceptron
is independent of the initial weights. This result is stated in Theorem 5.1
and is useful because engineers can employ arbitrary initial weights for the
training and the boundedness condition of the weights is independent of
the choice of the initial weights. Before we discuss this result, we need the
following lemmas:

Lemma 5.1. Assume that there are two perceptrons with the initial weights w(0) and w∗(0). Suppose that the set of the bounded training feature vectors and the corresponding set of desired outputs of these two perceptrons are the same. Then

(w(k) − w∗(k))T x(k) · (Q((w∗(k))T x(k)) − Q((w(k))T x(k)))/2 ≤ 0   ∀k ≥ 0.

The importance of Lemma 5.1 is for deriving the result in Lemma


5.2 stated below, which is essential for deriving the main result on the
boundedness condition of the weights of the perceptron stated in Theorem
5.1.

Lemma 5.2. If ∥x(k)∥²/2 < |(w(k) − w∗(k))T x(k)|, then ∥w(k) − w∗(k)∥² ≥ ∥w(k + 1) − w∗(k + 1)∥².

The importance of Lemma 5.2 is for deriving the result in Theorem 5.1
stated below, which describes the main result on the boundedness condition
of the weights of the perceptron.

Theorem 5.1. If ∃w∗(0) ∈ Rd+1 and ∃B̃ ≥ 0 such that ∥w∗(k)∥ ≤ B̃ ∀k ≥ 0, then ∃B′′ ≥ 0 such that ∥w(k)∥ ≤ B′′ ∀k ≥ 0 and ∀w(0) ∈ Rd+1.

For practical applications, the weights of the perceptron are required to be bounded for safety reasons. Suppose that there exists a nonempty set of initial weights such that the weights of the perceptron are bounded; otherwise, the perceptron is useless. Without Theorem 5.1, we do not know
what exact initial weights will lead to the bounded behavior. However, by
Theorem 5.1, we can conclude that it is not necessary to know the exact
initial weights which lead to the bounded behavior. This is because once
there exists a nonempty set of initial weights that leads to the bounded
behavior, then all initial weights will lead to the bounded behavior. The
result implies that engineers can employ arbitrary initial weights for the
training and the boundedness condition is independent of the choice of
the initial weights. This phenomenon is counter-intuitive to the general
understanding of symbolic dynamical systems because the system state
vectors of general symbolic dynamical systems may be bounded for some
initial system state vectors, but exhibit an unbounded behavior for other
initial system state vectors. It is worth noting that the weights of the
perceptron may exhibit complex behaviors, such as limit cycle or chaotic
behaviors.

Corollary 5.1. Define a nonlinear map Q̃ : Rd+1 → Rd+1 such that Q̃(w(n)) ≡ [q0(n), . . . , qN−1(n)]T, where qj(n) ≡ t(j) if j ≠ mod(n, N) and qj(n) ≡ Q(wT(n)x(n)) otherwise, ∀n ≥ 0 and for j = 0, 1, . . . , N − 1. If ∃w∗(0) ∈ Rd+1 and ∃B ≥ 0 such that ∥w∗(k)∥ ≤ B ∀k ≥ 0, then ∑_{∀n≥0} (t(j) − qj(n)) = 0 for j = 0, 1, . . . , N − 1.

This corollary states a sufficient condition for the boundedness of the


weights of the perceptron. Hence, this corollary can be used for testing
whether the weights of the perceptron are bounded or not.
It is worth noting that the difference between the block diagram of the
perceptron shown in Fig. 5.1 and that of conventional interpolative sigma
delta modulators [10] is that Q̃(·) is a periodically time varying system
with period N , while the nonlinear function in conventional interpolative
sigma delta modulators is a memoryless system. Moreover, Q̃(·) is not a
quantization function, while that in conventional interpolative sigma delta
modulators is a quantization function. Hence, the boundedness condition
derived in Theorem 5.1 is not applicable to the conventional sigma delta
modulators.

Fig. 5.1. A block diagram for modeling the dynamics of the weights of the perceptron.

5.4. Limit Cycle Behavior

In Section 5.3, the boundedness condition of the weights of the perceptron


has been discussed. However, even when the weights of the perceptron
are bounded, it is not guaranteed that the weights of the perceptron will
converge to limit cycles. In this section, a necessary and sufficient condition
for the occurrence of the limit cycle is discussed and the result is stated
in Lemma 5.3. These results are important because perceptrons exhibiting limit cycle behaviors are actually time periodically varying neural networks, which are a generalization of neural networks with constant coefficients. Hence, they can achieve better performances, such as a better recognition rate. By applying the result in Lemma 5.3, the perceptron can be operated under the limit cycle behaviors and better performances could be achieved. The
range of the number of updates for the weights of the perceptron to reach
the limit cycle is estimated. This result is discussed in Theorem 5.2 and
Lemma 5.4. In addition, by applying the results in Theorem 5.2 and Lemma
5.4, one can estimate the computational effort of the training algorithm,
which is also very important for practical applications.

Lemma 5.3. Suppose that q1 and q2 are co-prime and M and N are positive integers such that q1M = q2N. Then w∗(n) is periodic with period M if and only if

∑_{j=0}^{M−1} ((t(kM + j) − Q((w∗(kM + j))T x(kM + j)))/2) x(kM + j) = 0   for k = 0, 1, . . . , q1 − 1.

Lemma 5.3 can be described as follows. By duplicating q2 copies of the set of bounded training feature vectors and dividing all of them into q1 groups with M bounded training feature vectors in each group, the signed or null combination (that is, with coefficients 1, 0 or −1) of the M bounded training feature vectors in each group will be zero, where the signed or null coefficients are exactly equal to half of the difference between the desired outputs and the true outputs of the perceptron.

Lemma 5.3 generalizes the existing result on the necessary and sufficient condition for the weights of the perceptron from the fixed point behavior to the limit cycle behavior, with the period being any positive rational multiple of the number of bounded training feature
vectors. Here, we have M hyperplanes and N bounded training feature
vectors, so the weights of the perceptron exhibit the periodic behavior with
period M . Note that neither the number of hyperplanes is necessarily equal
to a positive integer multiple of the number of bounded training feature
vectors nor vice versa, that is neither M = k1 N for k1 ∈ Z+ nor N = k2 M
for k2 ∈ Z+ is necessarily required. When M = 1, q1 = N and q2 = 1,
Lemma 5.3 reduces to the existing perceptron convergence theorem. In this
case, Lemma 5.3 implies that w∗ (n) is periodic with period 1 if and only if
((t(k) − Q((w∗(k))T x(k)))/2) x(k) = 0 for k = 0, 1, . . . , N − 1.
This is equivalent to w∗(n) exhibiting a fixed point behavior if and only if (t(k) − Q((w∗(k))T x(k)))/2 = 0 for k = 0, 1, . . . , N − 1.

In other words, w∗(n) exhibits a fixed point behavior if and only if the set of bounded training feature vectors is linearly separable.
Since the limit cycle behavior is a bounded behavior, by combining
Lemma 5.3 and Theorem 5.1 together, we can conclude that the
weights of the perceptron will be bounded for all initial weights
if there exists a nonempty set of initial weights w∗(0) such that ∑_{j=0}^{M−1} ((t(kM + j) − Q((w∗(kM + j))T x(kM + j)))/2) x(kM + j) = 0 for k = 0, 1, . . . , q1 − 1.
However, it does not imply that the weights of the perceptron will
eventually exhibit a limit cycle behavior for all initial weights. The
perceptron may still exhibit complex behaviors, such as chaotic behaviors,
for some initial weights.
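Whether a particular training run has in fact settled into a limit cycle can be checked empirically from the recorded weight trajectory. The sketch below is a naive numerical test written only for illustration; the weight trajectory produced by the training sketch in Section 5.2, the transient length and the tolerance are all our own assumptions.

```python
import numpy as np

def detect_limit_cycle(trajectory, max_period, tol=1e-9):
    """Return the smallest period M such that the tail of the weight
    trajectory satisfies w(n + M) = w(n), or None if no such M is found.

    trajectory : array of shape (num_updates + 1, d + 1) of weight vectors.
    """
    tail = trajectory[len(trajectory) // 2:]   # discard the transient half
    for M in range(1, max_period + 1):
        if len(tail) > M and np.all(np.abs(tail[M:] - tail[:-M]) < tol):
            return M
    return None
```

Because every update in (5.1) adds or subtracts a training vector exactly, weights that have reached a limit cycle repeat exactly, so the tolerance only guards against floating-point effects.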
Suppose that the perceptron converges to a limit cycle with the
equivalent instantaneous initial weights w∗ (0), that is ∃n0 ≥ 0 such that
w(k + nM ) = w∗ (k) for k = 0, 1, . . . , M − 1 and for n ≥ n0 , where
w∗ (k + nM ) = w∗ (k) ∀n ≥ 0 and for k = 0, 1, . . . , M − 1. Then it is
important to estimate the range of the number of updates for w(0) to reach
{w∗ (k) for k = 0, 1, . . . , M − 1}. This is because the number of updates for
w(0) to reach the limit cycle relates to the rate of the convergence of the
perceptron and the computational effort of the training algorithm.

Theorem 5.2. Define X ≡ [x(0), . . . , x(N − 1)]. Suppose that rank(XXT) = d + 1. Define X̃ ≡ XT(XXT)−1. Denote λmax and λmin as the maximum and minimum eigenvalues of X̃X̃T, respectively. Define Cnq2N+kN+j ≡ [C̈0,nq2N+kN+j, . . . , C̈N−1,nq2N+kN+j]T ∀j ∈ {0, 1, . . . , N − 1}, ∀k ∈ {0, 1, . . . , q2 − 1} and ∀n ≥ 1, in which

C̈i,nq2N+kN+j = ∑_{p=0}^{n−1} ∑_{l=0}^{q2−1} (t(i) − Q((w(pq2N + lN + i))T x(i)))/2 + ∑_{l=0}^{k−1} (t(i) − Q((w(nq2N + lN + i))T x(i)))/2 + (t(i) − Q((w(nq2N + kN + i))T x(i)))/2

for i ≤ j, where i = 0, 1, . . . , N − 1, j = 0, 1, . . . , N − 1, k = 0, 1, . . . , q2 − 1 and ∀n ≥ 1, and

C̈i,nq2N+kN+j = ∑_{p=0}^{n−1} ∑_{l=0}^{q2−1} (t(i) − Q((w(pq2N + lN + i))T x(i)))/2 + ∑_{l=0}^{k−1} (t(i) − Q((w(nq2N + lN + i))T x(i)))/2

for i > j, where i = 0, 1, . . . , N − 1, j = 0, 1, . . . , N − 1, k = 0, 1, . . . , q2 − 1 and ∀n ≥ 1. Assume that w(nq2N + kN + j) = w∗(j) for some j ∈ {0, 1, . . . , N − 1}, for some k ∈ {0, 1, . . . , q2 − 1} and for some n ≥ 1, where w∗(nM + k) = w∗(k) ∀n ≥ 0 and for k = 0, 1, . . . , M − 1. Define C̃j ≡ X̃(w∗(j) − w(0)) for j = 0, 1, . . . , M − 1. Then ∥C̃j∥²/λmax ≤ ∥XCnq2N+kN+j∥² ≤ ∥C̃j∥²/λmin ∀j ∈ {0, 1, . . . , N − 1}, ∀k ∈ {0, 1, . . . , q2 − 1} and ∀n ≥ 1.

Although the range of the number of updates for the weights of the
perceptron to reach the limit cycle is equal to ∥Cnq2 N +kN +j ∥1 , it can be
reflected through ∥XCnq2 N +kN +j ∥2 . Hence, Theorem 5.2 provides an idea
on the range of the number of updates for the weights of the perceptron to
reach the limit cycle, which is useful for the estimation of the computational
effort of the training algorithm. In order to estimate the bounds for
∥Cnq2 N +kN +j ∥1 , denote m′ as the number of the differences between the
output of the perceptron based on w(0) and that based on w∗ (0), that is
m′ = ∑_{∀n} |(Q((w∗(n))T x(n)) − Q((w(n))T x(n)))/2|.
Then we have the following result:

Lemma 5.4. If w(nq1M + kM + j) = w∗(j) for some j ∈ {0, 1, . . . , M − 1}, for some k ∈ {0, 1, . . . , q1 − 1} and for some n ≥ 1, by defining

c ≡ ∑_{j=0}^{q1M−2} ∑_{k=j+1}^{q1M−1} ((Q((w∗(j))T x(j)) − y(j))/2) (w∗(k) − w(k))T x(j),

then

m′ ≥ (c + ∑_{k=0}^{q1−1} ∑_{j=0}^{M−1} ∥w∗(kM + j) − w(kM + j)∥²) / (∑_{p=0}^{q1−1} ∑_{i=0}^{M−1} ∥w∗(pM + i) − w(pM + i)∥ K).

The importance of Lemma 5.4 is to estimate the minimum number of updates for the weights of the perceptron to reach the limit cycle, and it is useful for the estimation of the computational effort of the training algorithm. It is worth noting that the minimum number of updates for the weights of the perceptron to reach the limit cycle depends on the initial weights. A similar result can be obtained by generalizing the conventional perceptron convergence theorem with zero initial weights to arbitrary initial weights when the set of bounded training feature vectors is linearly separable.

5.5. Application of Perceptron Exhibiting Limit Cycle


Behavior

Since time division multiplexing systems are widely used in many communications and signal processing systems, a time division multiplexing system is employed for an illustration. Consider an example
that sixteen voices from four African boys, four Asian boys, four European
girls and four American girls are multiplexed into a single channel. As two
dimensional bounded training feature vectors are easy for an illustration,
the dimension of the bounded feature vectors is chosen to be equal to two.
Without loss of generality, denote the bounded training feature vectors
of these voices as x(i) for i = 0, 1, . . . , 15, and the corresponding desired
outputs as t(i) for i = 0, 1, . . . , 15. Suppose that the voices generated by the
boys are denoted as -1, so t(4n+1) = t(4n+2) = −1 for n = 0, 1, 2, 3, while
the voices generated by the girls are denoted as 1, so t(4n) = t(4n + 3) = 1
for n = 0, 1, 2, 3. Suppose that the means of the bounded training feature
vectors corresponding to African boys are [−1, 1]T , [−0.9, 1]T , [−1, 0.9]T
and [−0.9, 0.9]T, that corresponding to Asian boys are [1, −1]T, [0.9, −1]T, [1, −0.9]T and [0.9, −0.9]T, that corresponding to American girls are [1, 1]T,
[0.9, 1]T , [1, 0.9]T and [0.9, 0.9]T and that corresponding to European girls
are [−1, −1]T , [−0.9, −1]T , [−1, −0.9]T and [−0.9, −0.9]T . Each speaker

generates 100 bounded feature vectors for transmission and the channel is
corrupted by an additive white Gaussian noise with zero mean and variance
equal to 0.5. These bounded feature vectors are used for testing. Figure
5.2 shows the distribution of these bounded testing feature vectors. On the
other hand, the perceptron is trained on the 16 noise-free bounded training feature vectors using the conventional perceptron training algorithm.

Fig. 5.2. Distribution of bounded testing feature vectors.

As the set of bounded training feature vectors is nonlinearly separable,


the conventional perceptron training algorithm does not converge to a fixed
point and the perceptron exhibits the limit cycle behavior with period four.
A perceptron exhibiting the limit cycle behavior is actually a neural network
with time periodically varying coefficients. In fact, this is a generalization
of the perceptron with constant coefficients. Hence, better performances result if the downsampled sets of bounded training feature vectors are
linearly separable. The reason is as follows: Assume that w∗ (nM + k) =
w∗ (k) ∀n ≥ 0 and for k = 0, 1, . . . , M − 1. Suppose that N is an integer
multiple of M , that is ∃z ∈ Z+ such that N = zM . By downsampling the
set of the bounded training feature vectors by M , we have M downsampled


sets of bounded training feature vectors, denoted as {x(kM + i)} for i =
0, 1, . . . , M − 1 and for k = 0, 1, . . . , z − 1. For each downsampled set of
bounded training feature vectors, if these z samples are linearly separable,
then w∗ (i) for i = 0, 1, . . . , M − 1 can be employed for the classification and
the recognition error will be exactly equal to zero. Hence, the perceptron
exhibiting the limit cycle behavior significantly improves the recognition
performance.
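The classification with a periodically varying weight sequence described above amounts to applying, to the k-th feature vector, the limit cycle weight whose phase matches k. A minimal sketch of this idea follows; the array of limit cycle weights w∗(0), . . . , w∗(M − 1) is assumed to have been extracted beforehand (for example, from the tail of the training trajectory), and the helper names are ours.

```python
import numpy as np

def classify_with_limit_cycle(X, limit_cycle_weights):
    """Classify feature vectors with a periodically varying perceptron.

    X                  : array of shape (num_vectors, d + 1); row k is x(k)
    limit_cycle_weights: list [w*(0), ..., w*(M - 1)] of weight vectors
    The k-th vector is assigned the output Q((w*(k mod M))^T x(k)).
    """
    M = len(limit_cycle_weights)
    outputs = []
    for k, x in enumerate(X):
        w_star = limit_cycle_weights[k % M]
        outputs.append(1.0 if w_star @ x >= 0 else -1.0)
    return np.array(outputs)
```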
Referring to the example discussed above, we define the classification rule as follows: assign x(k) to class −1 if Q((w∗(k))T x(k)) = 1, and assign x(k) to class 1 if Q((w∗(k))T x(k)) = −1. By running the conventional perceptron training algorithm with zero initial weights, it is found that the weights of the perceptron are bounded and converge to a limit cycle. This implies that the weights of the perceptron will also be bounded if other initial weights are employed. We generated other initial weights randomly and found that the weights of the perceptron are indeed bounded. This demonstrates the validity of Theorem 5.1.
For the zero initial weight, it is found that the recognition error is
2.25%. For comparison, a two layer perceptron is employed for solving the
corresponding nonlinearly separable problem. It is well known that if the
weights of the perceptrons are selected as w1∗ ≡ [1/2, 1/2, 1/2]T, w2∗ ≡ [−1/2, 1/2, 1/2]T and w3∗ ≡ [0, −1/2, 1/2]T, then the output of the two layer perceptron, defined
as y ′ (k) = Q([1, Q((w1∗ )T x(k)), Q((w2∗ )T x(k))]T w3∗ ), will solve the XOR
nonlinear problem, and it can be checked easily that the recognition error
for the set of the noise-free bounded training feature vectors is exactly
equal to zero. Hence, these coefficients are employed for the comparison.
It is found that the recognition error based on the two layer perceptron is
14.56%, while that based on the perceptron exhibiting a limit cycle behavior
is only 2.25%. This demonstrates that the perceptron exhibiting the limit
cycle behavior outperforms the two layer perceptron.
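The two layer perceptron used in this comparison can be reproduced directly from the weights quoted above. The short sketch below evaluates y′(k) on the four noise-free cluster centres, used here as stand-ins for the training vectors, and confirms that the XOR-type layout is separated; the variable names are ours.

```python
import numpy as np

def Q(z):
    return 1.0 if z >= 0 else -1.0

w1 = np.array([0.5, 0.5, 0.5])    # w1* = [1/2, 1/2, 1/2]^T
w2 = np.array([-0.5, 0.5, 0.5])   # w2* = [-1/2, 1/2, 1/2]^T
w3 = np.array([0.0, -0.5, 0.5])   # w3* = [0, -1/2, 1/2]^T

def two_layer_output(x):
    """y'(k) = Q([1, Q((w1*)^T x), Q((w2*)^T x)]^T w3*) for x = [1, x1, x2]^T."""
    hidden = np.array([1.0, Q(w1 @ x), Q(w2 @ x)])
    return Q(hidden @ w3)

# Noise-free cluster centres: girls (t = 1) near (1, 1) and (-1, -1),
# boys (t = -1) near (-1, 1) and (1, -1).
for x1, x2, t in [(1, 1, 1), (-1, -1, 1), (-1, 1, -1), (1, -1, -1)]:
    assert two_layer_output(np.array([1.0, x1, x2])) == t
```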

5.6. Conclusion

Unlike other symbolic dynamical systems, the boundedness condition of


the perceptron is independent of the initial weights. That means, if there exists a nonempty set of initial weights for which the weights of the perceptron are bounded, then the weights of the perceptron will be bounded for all
initial weights. Also, it is suggested that the perceptron exhibiting the
limit cycle behavior can be employed for solving a recognition problem

when the downsampled sets of bounded training feature vectors are linearly
separable.

References

[1] M. Gori and M. Maggini, Optimal convergence of on-line backpropagation,


IEEE Transactions on Neural Networks. 7(1), 251–254, (2002).
[2] S. J. Wan, Cone algorithm: An extension of the perceptron algorithm, IEEE
Transactions on Systems, Man and Cybernetics. 24(10), 1571–1576, (1994).
[3] M. Brady, R. Raghavan, and J. Slawny, Gradient descent fails to separate,
Proceedings of the IEEE International Conference on Neural Networks, 1988.
pp. 649–656, (2002).
[4] T. B. Ludermir, A. Yamazaki, and C. Zanchettin, An optimization
methodology for neural network weights and architectures, IEEE
Transactions on Neural Networks. 17(6), 1452–1459, (2006).
[5] S. G. Pierce, Y. Ben-Haim, K. Worden, and G. Manson, Evaluation of neural
network robust reliability using information-gap theory, IEEE Transactions
on Neural Networks. 17(6), 1349–1361, (2006).
[6] C. F. Juang, C. T. Chiou, and C. L. Lai, Hierarchical singleton-type
recurrent neural fuzzy networks for noisy speech recognition, IEEE
Transactions on Neural Networks. 18(3), 833–843, (2007).
[7] R. W. Duren, R. J. Marks, P. D. Reynolds, and M. L. Trumbo, Real-time
neural network inversion on the SRC-6e reconfigurable computer, IEEE
Transactions on Neural Networks. 18(3), 889–901, (2007).
[8] S. Wan and L. E. Banta, Parameter incremental learning algorithm for
neural networks, IEEE transactions on Neural Networks. 17(6), 1424–1438,
(2006).
[9] M. Basu and Q. Liang, The fractional correction rule: a new perspective,
Neural Networks. 11(6), 1027–1039, (1998).
[10] C. Y. F. Ho, B. W. K. Ling, and J. D. Reiss, Estimation of an Initial
Condition of Sigma–Delta Modulators via Projection Onto Convex Sets,
IEEE Transactions on Circuits and Systems I: Regular Papers. 53(12),
2729–2738, (2006).

Chapter 6

Artificial Neural Network Modeling with Application to


Nonlinear Dynamics

Yi Zhao
Harbin Institute of Technology Shenzhen Graduate School
Shenzhen, China
[email protected]

Artificial neural network (ANN) is well known for its strong capability
to handle nonlinear dynamical systems, and this modeling technique
has been widely applied to the nonlinear time series prediction problem.
However, this application needs caution, as overfitting is a serious problem endemic to neural networks. The conventional way of avoiding overfitting is to avoid fitting the data too precisely, but it cannot determine the exact model size directly. In this chapter,
we employ an alternative information theoretic criterion (minimum
description length) to determine the optimal architecture of neural
networks according to the equilibrium between the model parameters
and model errors. When applied to various time series, we find that the
model with the optimal architecture both generalizes well and accurately
captures the underlying dynamics. To further confirm the dynamical
character of the model residual of the optimal neural network, the
surrogate data method from the regime of nonlinear dynamics is then
described to analyze such model residual so as to determine whether
there is significant deterministic structure not captured by the optimal
neural network. Finally, a diploid model is proposed to improve the
prediction precision under the condition that the prediction error is
considered to be deterministic. The systematic framework composed
of the preceding modules is validated in sequence, and illustrated with
several computational and experimental examples.

Contents

6.1 Introduction 102
6.2 Model Structure 105
6.3 Avoid Overfitting by Model Selection 107
    6.3.1 How it works 107
    6.3.2 Case study 110
6.4 Surrogate Data Method for Model Residual 115
    6.4.1 Linear surrogate data 115
    6.4.2 Systematic flowchart 116
    6.4.3 Identification of model residual 116
    6.4.4 Further investigation 118
6.5 The Diploid Model Based on Neural Networks 120
6.6 Conclusion 121
References 122

6.1. Introduction

Nonlinear dynamics has a significant impact on a variety of applied fields,


ranging from mathematics to mechanics, from meteorology to economy,
from population to biomedical signal analysis. That is, a great many
phenomena are subject to nonlinear dynamics. For this reason, establishing
effective models to capture or learn the nonlinear dynamic mechanism of the
object system and to solve further prediction and control problems becomes
a critical issue. Considering the length of this chapter, we will focus on one
of the primary model applications: neural networks (NN) modeling for
nonlinear time series prediction. Some other applications, such as pattern
recognition, control and classification, can be found in [1–3].
In previous research, many linear methods and their variants have emerged, such as the famous autoregressive and moving average (ARMA) model. This model family, proposed by Box et al. four decades ago, has been widely used to solve the prediction problem with great success [4]. The ARMA model is composed of two parts: the autoregressive part AR(p), which predicts the current observation from a linear function of the p previous ones, and the moving average part MA(q), which calculates the mean of the time series over a moving window of size q. However, the primary limitation of ARMA models is the assumption of a linear relation between the independent and dependent variables. Moreover, ARMA models are applicable only to stationary time series modeling.
Those constraints, therefore, limit the application range and prediction
accuracy of related ARMA models. Meanwhile, artificial neural networks
(ANN) have been suggested to handle the complicated time series. They
are quite effective in modeling nonlinearity between the input and output
variables and in particular, the multi-layer feedforward neural network
with a sigmoid activation function is able to approximate any nonlinear continuous function on an arbitrary closed interval. Cybenko first
proved the preceding assertion in 1989 [5]. Then Kurt Hornik showed that
this universal approximation was closely related to the architecture of the


network but not limited to the specific activation function of the neural
network [6]. Afterwards, concerning model optimization, the architecture
becomes the first factor to be considered. In addition, in 1987, Lapedes
and Farber explained that ANN can be used for modeling and predicting
nonlinear time series in a simulated case study [7].
Supported by these theoretical proofs, feedforward neural networks
became much more popular for modeling nonlinear complicated systems.
Several prediction models based on ANN have been presented. Wilson and
Sharda employed an ANN model to predict the bankruptcy and found it
performed significantly better than the classical multivariate discrimination
method [8]. Tseng et al. proposed a hybrid prediction model combining
the seasonal autoregressive integrated moving average model and the
backpropagation neural network to predict two kinds of sales data: the
Taiwan machinery industry and soft drink production value. The result
shows that this combination brings better prediction than the single model
and other kind of models [9]. Chang et al. demonstrated that the traditional
approach of sales forecasting integrated with the appropriate artificial
intelligence was highly competitive in the practical business service [10].
D. Shanthi et al. employed the ANN model trained by the backpropagation algorithm to forecast thrombo-embolic stroke disease and obtained high predictive accuracy [11]. In summary, ANN exhibits superiority for
nonlinear time series prediction.
The significant advantage of ANN modeling can be mainly ascribed to
its imitation of the biological nervous system, a massive, highly connected array of nonlinear excitatory “neurons”. The high degree of freedom in the neural network architecture provides the potential to model complicated nonlinearity, but it also brings about uncertainty when dealing with such complicated systems. One is thus tempted to build neural networks with a fairly large number of neurons to take on the challenge.
prediction errors on the training data but failed to fit to novel data or
even performed worse (i.e. overfitted). So the crucial issue in developing
a neural network is generalization of the network: the network not only
fits the training data well but also responds properly to novel inputs. To
solve this problem satisfactorily requires a selection technique to ensure that the neural network makes reliable and accurate predictions.
Usual methods to mitigate overfitting are early stopping and statistical regularization techniques, such as weight decay and Bayesian
learning [12–14]. As Zhao and Small discussed [15], the validation set
required in the method of early stopping should be representative of all
points in the training set, and the training algorithm cannot converge too
fast. The weight decay method modifies the performance function from the mean sum of squares of the model errors to the sum of the mean sum of squares of the model errors and that of the network parameters. The key problem of this method is that it is difficult to seek an equilibrium between these two parts.
For Bayesian learning, the optimal regularization parameter, that is, the balance mentioned above, is determined in an automatic way [14]. The promising
feature of Bayesian learning is that it can measure how many network
parameters are effectively used for the given application. Although it gives
an indicator of wasteful parameters, this method cannot build a compact
(or smaller) optimal neural network.
In this chapter, we describe an alternative approach, which estimates
exactly the optimal structure of the neural network for time series
prediction. We focus on feedforward multi-layer neural networks, which have been proved to be capable of modeling nonlinear functions. The criterion is a modification of, and competitive with, the initial minimum description length (MDL) criterion [16].
This method is based on the well-known principle of minimum
description length rooted in the theory of algorithmic complexity [17].
J. Rissanen formulated model selection as a problem in data compression in a series of papers starting with [18]. Judd and Mees developed this principle for the selection of local linear models [19], and then Small and Tse extended the previous analysis to the selection of radial basis function models [20]. Zhao and Small generalized the previous results
to neural network architectures [15]. Several other modifications to the
method of minimum description length can be found in the literature
according to their own demands [21–24].
Another technique, the surrogate data method, is also described in this chapter. Surrogate data tests are examples of Monte Carlo hypothesis tests [25]. The standard surrogate data method, suggested and implemented by Theiler et al., has been widely applied in the literature [26]. This method aims to determine whether the given time series has a statistically significant deterministic component, or is just consistent with independent and identically distributed (i.i.d.) noise. We will introduce this idea at length
in Section 6.4.

In summary, we employ the method of MDL to select the optimal model


from a number of neural network candidates for the specific time series
prediction, and afterward apply the standard surrogate data method to
the residual of the optimal model (i.e. the prediction error) to determine
whether there is significant deterministic structure not captured by the
optimal neural network estimated by MDL. This combination of MDL for
model selection and surrogate data method for model residual validation
can enhance the model prediction ability effectively. In the case that the prediction error contains deterministic dynamics, a diploid model is introduced at the end of this chapter to handle this scenario.
This chapter is organized as follows. Some basic ideas about ANN
are briefly described in Section 6.2, and the MDL method developed
for selecting optimal neural networks is presented in Section 6.3; in the
next section, we discuss the application of the surrogate data method to
analyze model residuals. A diploid model is proposed in Section 6.5 to
further improve the prediction precision if necessary. Finally, we have the
conclusion of this chapter.

6.2. Model Structure

Building an ANN model for a dynamical system is a process of extracting the deterministic dynamics that govern the evolution of the system and then establishing an ANN model to describe them effectively.
As mentioned previously, the multi-layer feedforward neural network is
applicable to approximate any nonlinear function under certain conditions.
So, in this chapter, this type of neural network is our interest.
Common to all neural network architectures is the connection of input
vector, neurons (or nodes) and then output layers to establish numerous
interconnected pathways from input to output. Figure 6.1 shows the
architecture of a multilayer neural network.
Given an input vector x = (x_{t−1}, x_{t−2}, · · · , x_{t−d}), the transfer function of this neural network is mathematically described by

y(x) = b0 + ∑_{i=1}^{k} v_i f(∑_{j=1}^{d} ω_{i,j} x_{t−j} + b_i),   (6.1)

where b0, bi, vi and ωi,j are parameters, k represents the number of neurons in the hidden layer, and f(·) is the activation function of the neurons. As shown in Fig. 6.1, there is one hidden layer, but notice that multiple hidden layers are also optional.


Fig. 6.1. The multilayer network is composed of the input, hidden and output layers
[27].

The input vector is denoted by P = {p1, p2, · · · , pd} and the output (or prediction value) is denoted by y. W = {wij | i = 1, · · · , k, j = 1, · · · , d} and V = {vi | i = 1, · · · , k} are weights associated with connections between layers; b = {bi | i = 0, 1, · · · , k} are biases.
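For concreteness, the forward pass in (6.1) can be sketched as follows; the logistic sigmoid is used for f(·) purely as an example of a sigmoid-type activation, and the random parameters merely stand in for trained weights.

```python
import numpy as np

def sigmoid(z):
    """Example sigmoid activation f(.)."""
    return 1.0 / (1.0 + np.exp(-z))

def network_output(x, W, b, V, b0):
    """Evaluate y(x) = b0 + sum_i v_i f(sum_j w_ij x_{t-j} + b_i), i.e. Eq. (6.1).

    x : input vector of length d (the d previous observations)
    W : hidden-layer weights of shape (k, d)
    b : hidden-layer biases of shape (k,)
    V : output weights of shape (k,)
    b0: output bias
    """
    hidden = sigmoid(W @ x + b)   # outputs of the k hidden neurons
    return b0 + V @ hidden

# Example with d = 4 inputs and k = 3 hidden neurons (random parameters).
rng = np.random.default_rng(0)
d, k = 4, 3
y = network_output(rng.normal(size=d), rng.normal(size=(k, d)),
                   rng.normal(size=k), rng.normal(size=k), 0.1)
```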
The process of training the neural network is to update these preceding
parameters iteratively according to the mean square error between the
original target and the prediction on it. The time series is divided into two
sets: training and test sets. The first data set is used to train the neural
network. Intuitively, training modifies the weights until every output is close to the corresponding expectation. One of the classical training algorithms for feedforward neural networks is error backpropagation; hence, the neural network is sometimes called a backpropagation neural network. A number of training algorithms, including the Levenberg–Marquardt (LM) algorithm, were developed with their respective specializations [28]. For example, the LM training algorithm is known for its fast convergence.

Obviously, the performance of the neural network heavily relies on


the successful training of the model. But successful training does not simply mean an excellent fit to the training set. Trained neural networks, especially backpropagation networks, may make small prediction errors on the training set and yet give poor predictions on a novel data set. This phenomenon has been recognized as overfitting. It is an old but still active issue endemic to neural network modeling. It occurs widely in resultant neural networks, in particular those with a large number of neurons. As a consequence, the performance of neural networks in some comparative experiments is not superior to, or even as good as, other classical techniques. So, deciding how many neurons should be put in the hidden layer is a crucial step in establishing an optimal model. The next section will explain the details of the MDL method to solve this problem and avoid overfitting.

6.3. Avoid Overfitting by Model Selection

There are several typical methods applicable to model selection. Akaike


proposed his information criterion (AIC) based on a weighted function of
fitting of a maximum log-likelihood model [29]. The motivation of AIC and
its assumptions was the subject of some discussion in [18, 30, 31]. From
a practical point, the AIC tends to overfit the data [32, 33]. To address
this problem, Wallace et al., therefore, developed the minimum message
length (MML) [34]. Like MDL, MML chooses the hypothesis minimizing
the code-length of the data but its codes are quite different from those in
MDL. In addition, the Bayesian information criterion (BIC), also known as
Schwarz’s information criterion (SIC) is also an option [18, 31, 35].
It is clear that the MDL criterion is related to other well-known model selection criteria. From our point of view, the AIC and BIC perform best for linear models; for nonlinear models, description length style information criteria are better. We found that MDL in particular is robust and relatively easy to realize with reasonable assumptions [15].

6.3.1. How it works

The basic principle of minimum description length is to estimate both the


cost of describing the model parameters and model prediction errors. Let
k be the number of neurons in the neural network. Λ represents all the
parameters of the given neural network. As shown in Fig. 6.1, given the
size of input vector, the number of parameters is completely determined

by the number of neurons. We thus consider establishing the description length function of the neural network with respect to its number of neurons [15]. When we build a series of neural networks with varying numbers of neurons, we can also compute a series of description lengths of these models, which forms a DL curve along the x-axis of neurons.
Let E(k ) be the cost of counting the model prediction errors and M(k )
be the cost of counting the model parameters. The description length of
the data with respect to this model is then given by the sum [20]:

D(k) = E(k) + M (k) (6.2)

Intuitively, the typical tendency of E(k) and M(k) is that as the model size increases, M(k) increases and E(k) decreases, corresponding to more model parameters and more potential modeling ability (i.e. less prediction error), respectively. The minimum description length principle states that
the optimal model is the one that minimizes D(k ).
Let {y_i}_{i=1}^{N} be a time series of N measurements and

f (yi−1 , yi−2 , · · · , yi−d ; Λk )

be the neural network output, given d previous inputs and parameters


of the neural network described by Λk = (λ1 , λ2 , · · · , λk ) associated
with k neurons. The prediction error is thus given by ei =
f (yi−1 , yi−2 , · · · , yi−d ; Λk ) − yi . For any Λk the description length of the
model f (·; Λk ) is given by [15]:


M(k) = L(Λk) = ∑_{i=1}^{k} ln(c/δi)   (6.3)

where c is a constant and represents the number of bits required in the


representation of floating points, and δi is interpreted as the optimal
precision of the parameter λi .
Rissanen [16] showed that E(k ) is the negative logarithm of the
likelihood of the errors e = {e_i}_{i=1}^{N} under the assumed probability distribution of those errors:

E(k) = − ln Prob(e)   (6.4)

For the general unknown probability distribution of errors, the


estimation of E(k ) would be rather complicated. However, it is reasonable
to assume that the model residual is consistent with the normal Gaussian
distribution according to central limit theorem, as N is adequately large.

We get the simplified equation

E(k) = N/2 + (N/2) ln(2π/N) + (N/2) ln(∑_{i=1}^{N} e_i^2)   (6.5)

For the multilayer neural network described in the previous section, its
parameters are completely denoted by Λk = {b0, bi, vi, wi,j | i = 1, · · · , k, j =
1, · · · , d}. Of these parameters, the weights vi and the bias b0 are all
linear, the remaining parameters wi,j and bi (i = 1, · · · , k, j = 1, · · · , d)
are nonlinear. Fortunately, the activation function f (·) is approximately
linear in the region of interest so we suppose that the precision of the
nonlinear parameters is similar to that of the linear one and employ the
linear parameters to give the precision of the model, δi .
To account for the contribution of all linear and nonlinear parameters
to M(k ), instead of the contribution of only linear ones, we define np (i)
as the effective number of the parameters associated with the ith neuron
contributed to the description length of the neural network [15]. So M(k )
is updated by:


M(k) = ∑_{i=1}^{k} np(i) ln(γ/δi)   (6.6)

where δi is the relative precision of the weight vi ∈ {v1 , v2 , · · · , vk }, and


np (i) will be a variable with respect to different neurons. But in order to
make the problem tractable we make one further approximation that np (i)
is fixed for all i and then replace np (i) with np .
However, the exact value of np is so difficult to calculate that we take n̂p, the embedding dimension estimated by the method of False Nearest Neighbors (FNN), as an approximation of np [36]. In (6.3), n̂p = 1. That is, the previous work [20] just
paid attention to the linear parameters and ignored contribution of those
nonlinear ones. But for our neural networks n̂p is in the range from one to
d + 2. Embedding dimension represents the dimension of the phase space
required to unfold a point in that phase space, i.e. it represents the effective
number of inputs for a model.
With necessary but reasonable assumptions and approximations, we establish a tractable description length function with respect to the number of neurons (i.e. all the associated parameters). The minimum of this function indicates the optimal neural network, which intuitively makes the trade-off between the model parameters and the corresponding model residual.
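Under the simplifying assumptions made above (Gaussian residuals, a fixed n̂p and a single representative precision δ for the weights), the description length of a candidate network can be sketched as follows. The constant γ and the way δ is supplied are placeholders for the more careful treatment in [15], so the sketch should be read as illustrative only.

```python
import numpy as np

def description_length(residuals, k, n_p, delta, gamma=32.0):
    """Simplified D(k) = E(k) + M(k) for a network with k hidden neurons.

    residuals : model prediction errors e_1, ..., e_N on the training set
    n_p       : effective number of parameters per neuron (e.g. from FNN)
    delta     : representative relative precision of the linear weights
    gamma     : constant playing the role of gamma in (6.6) (assumed value)
    """
    e = np.asarray(residuals, dtype=float)
    N = len(e)
    E = N / 2.0 + (N / 2.0) * np.log(2.0 * np.pi / N) \
        + (N / 2.0) * np.log(np.sum(e ** 2))      # Eq. (6.5)
    M = k * n_p * np.log(gamma / delta)           # Eq. (6.6) with n_p, delta fixed
    return E + M
```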

6.3.2. Case study


In this section, we present this analysis in two case studies: a known dynamical system (Section A) and an experimental data set (Section B). The test system is the Rössler system with the addition of dynamic
noise. We then describe the application of this model selection technique to
experimental recordings of human pulse data during normal sinus rhythm.
When computing the description length, we notice that DL curves fluctuate somewhat, or even dramatically, for practical data, which is very likely to give a wrong estimation of the minimum point. The independent construction of a series of neural networks correspondingly results in fluctuation of the DL curve, but the underlying tendency estimated by the DL curve still exists. The nonlinear curve fitting procedure is introduced to smooth such perturbation
and provide an accurate estimation of the actual minimum [15].
We first define a function, which takes a variable (the number of
neurons) and a coefficient vector to fit the original curve. The form of E(k) is consistent with that of a decreasing exponential function, so a e^{−bk} (a > 0, b > 0) is used to reflect E(k). M(k) can be regarded appropriately as a linear function of the number of neurons, so a linear term ck is defined to approximate M(k). A constant d is required to compensate for an arbitrary constant missing from the computation of DL. So the defined function is F(a, b, c, d; k) = a e^{−bk} + ck + d, where a, b, c and d are the required
coefficients. Note that the previous function is the empirical approximation
of the true tendency of the DL curve.

Next, according to min_{a,b,c,d} ∑_{i=1}^{k} (F(a, b, c, d; i) − D(i))², we obtain the coefficient vector (a, b, c, d). Finally, we substitute a, b, c, d with the estimated values and determine the value of k that makes the fitted curve minimal.
The fitted DL curve provides a smooth estimation of the original description
length, and reflects the true tendency. The new estimation regarding the
minimal description length appears to be robust, especially for complicated
experimental data.
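The fitting step can be realized with any standard nonlinear least-squares routine; the sketch below uses scipy.optimize.curve_fit (one of several equally valid choices) and assumes that the description lengths D(1), . . . , D(K) have already been computed.

```python
import numpy as np
from scipy.optimize import curve_fit

def F(k, a, b, c, d):
    """Empirical model of the DL curve: F(a, b, c, d; k) = a*exp(-b*k) + c*k + d."""
    return a * np.exp(-b * k) + c * k + d

def optimal_number_of_neurons(dl_values):
    """Fit F to the description lengths D(1), ..., D(K) and return the number
    of neurons at which the fitted curve attains its minimum."""
    k = np.arange(1, len(dl_values) + 1, dtype=float)
    popt, _ = curve_fit(F, k, dl_values,
                        p0=(dl_values[0], 0.5, 1.0, 0.0), maxfev=10000)
    fitted = F(k, *popt)
    return int(k[np.argmin(fitted)])
```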

A. Computational experiments
Consider a reconstruction of the Rössler system with dynamic noise. The
equations of the Rössler system are given by ẋ(t) = −y(t) − z(t), ẏ(t) =
x(t) + a ∗ y(t), ż(t) = b + z(t) ∗ [x(t) − c] with parameters a = 0.1, b =
0.1, c = 18 to generate the chaotic data [37]. Here we set the iteration step
ts to 0.25. By dynamic noise we mean that system noise is added to the

x -component data prior to prediction of the succeeding state. That is, we


integrate this equation iteratively and then use the integrated results added
with random noise as initial data for the next step. So the random noise is
coupled into the generated data. The magnitude of the noise is set at 10%
of the magnitude of the data. We generate 2000 points of this system of
which 1600 points are selected to train the neural network and the rest are
the testing data. We calculate the description length of 20 neural networks constructed with 1 to 20 neurons, as shown in Fig. 6.2.
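The data generation just described can be sketched as follows; a fourth-order Runge-Kutta step is used for the integration (the chapter does not specify the integrator), and "magnitude of the data" is read here as the standard deviation of a noise-free run, both of which are our own assumptions.

```python
import numpy as np

def rossler_rhs(state, a=0.1, b=0.1, c=18.0):
    x, y, z = state
    return np.array([-y - z, x + a * y, b + z * (x - c)])

def rk4_step(state, dt):
    k1 = rossler_rhs(state)
    k2 = rossler_rhs(state + 0.5 * dt * k1)
    k3 = rossler_rhs(state + 0.5 * dt * k2)
    k4 = rossler_rhs(state + dt * k3)
    return state + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

def noisy_rossler_x(n_points=2000, dt=0.25, noise_level=0.1, seed=0):
    """x-component data with dynamic noise: after each step the x-component
    is perturbed and the perturbed state seeds the next step, so the noise
    is coupled into the generated data."""
    rng = np.random.default_rng(seed)
    # Noise-free run used only to estimate the magnitude of the data.
    state, clean = np.array([1.0, 1.0, 1.0]), []
    for _ in range(n_points):
        state = rk4_step(state, dt)
        clean.append(state[0])
    amplitude = np.std(clean)
    # Second run with dynamic noise at noise_level (10%) of that magnitude.
    state, xs = np.array([1.0, 1.0, 1.0]), []
    for _ in range(n_points):
        state = rk4_step(state, dt)
        state[0] += noise_level * amplitude * rng.normal()
        xs.append(state[0])
    return np.array(xs)
```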

Fig. 6.2. Description Length (solid line) of neural networks for modeling the Rössler
system (left panel) has the minimum point at five, and the fitted curve (dashed line)
attains the minimum at the same point. In the right panel the solid line is the mean
square error of training set and the dotted line is that of testing data.

We observe that both the DL and fitted curves denote that the optimal
number of neurons is five. That is, the neural network with five neurons is
the optimal model according to the principle of the minimum description
length. Mean square error of the testing set gives little help in indicating
the appearance of overfitting.
As a comparison, we chose three other networks with different numbers of neurons to perform a free-run prediction for the testing set. In free-run prediction, each predicted value is based on the current and previous predicted values, in contrast to the so-called one-step prediction. The predicted x-component data is converted into three vectors,
x(t), x(t + 3) and x(t + 5) (t ∈ [1, 395]) to construct the phase space shown
in Fig. 6.3. The network with five neurons exactly captures the dynamics
of the Rössler system but the neural network with more neurons is apt to
overfit. As the test data is chaotic Rössler data, the corresponding attractor
should be full of reconstructed trajectories, as shown in Fig. 6.3(b).
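The difference between one-step and free-run prediction can be made concrete with a short sketch. Here `model` stands for any trained one-step predictor that maps the d previous values to the next value (for instance, the network of Section 6.2); the interface is an assumption for illustration.

```python
import numpy as np

def free_run_prediction(model, seed_values, n_steps):
    """Iterate a one-step predictor on its own outputs.

    model       : callable mapping a length-d input vector to the next value
    seed_values : the last d observed values used to start the free run
    n_steps     : number of future values to generate
    """
    history = list(seed_values)
    d = len(seed_values)
    predictions = []
    for _ in range(n_steps):
        x = np.array(history[-d:])   # current and previous predicted values
        y_hat = model(x)             # one-step prediction
        predictions.append(y_hat)
        history.append(y_hat)        # feed the prediction back as input
    return np.array(predictions)
```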

Fig. 6.3. Four reconstructions of free-run predictions of neural networks with 3, 5, 7


and 15 neurons.

B. Experimental data

Generally, experimental data are more complicated than computational


data since the measurement of these data is usually contaminated
with observational noise somewhat and also the deterministic dynamics
governing the evolution of the real system is unknown. Hence, experimental
data are much more difficult to predict accurately. We apply this
information theoretic method to experimental recordings of human pulse
data to validate its practicality.
We randomly utilize 2550 points to build neural networks and the
consecutive 450 data points to test these well-trained models. Note that in
the previous simulation example, the FNN curve that is used to estimate n̂p drops to zero and remains at zero, but for practical ECG data the curve behaves like that of a noisy signal (i.e. it cannot reach zero). The inflection
point of the FNN curve in this case points out the value of n̂p . Figure

6.4 describes description length and mean square error for this application
respectively.

Fig. 6.4. Description length (solid line) and the fitted curve (dashed line) of ECG data
(left panel) estimate the minimal points at the 11 and 7 point respectively. The right
one is the mean square error of training set (solid line) and testing set (dotted line).

The DL curve suggests the optimal number of neurons is 11, but the
fitted curve estimates that the optimal number of neurons is 7. In this
experiment the mean square error (MSE) of testing data cannot give
valuable information regarding the possibility of overfitting either. We
thus deliberately select neural networks with both 7 and 11 neurons for
verification. As in the previous case, another two neural networks, with 5 neurons and 17 neurons, are also used for comparison. All the free-run predictions obtained by these models are illustrated in Fig. 6.5.
The network with 7 neurons can predict the testing data accurately, but the prediction of the network with 11 neurons overfits over the evolution of time. So we confirm that the fitted curve shows robustness against
fluctuation. The neural network with 7 neurons can provide adequate
fitting to both training and novel data, and networks with more neurons,
such as 11, overfit. Although referring to the DL line one may decide
the wrong optimal number of neurons, the fitted curve reflects the true
tendency hidden in the original DL estimation. For the computational data, both the original DL curve and the fitted one provide the same or close estimates, while for the practical data it is demonstrated that the nonlinear curve fitting is necessary and effective.

Fig. 6.5. Free-run prediction (dotted line) and actual ECG data (solid line) for 4 neural
networks with 5, 7, 11 and 17 neurons.

By observation, readers are persuaded that the short-term prediction


given by the selected optimal neural network is nearly perfect and these
optimal models accurately capture the underlying deterministic dynamics
of the given time series. However, how close is close enough? The model
residual always exists as long as the target data and its prediction are not
the same. So it could be hasty to make the previous conclusion without
investigation of the model residual. Readers are referred to the next section,
where we discuss how to examine the dynamical characters of this residual
in a statistical way. If the model residual is merely random noise, it then
gives a positive support to the performance of optimal neural networks as
well as the proposed information theoretic criterion. If the model residual
still contains significant determinism, it then indicates that these models
are not omnipotent to deal with the original time series. We, therefore,
present a solution to enhance the capability of the current neural network


modeling in Section 6.5. One may find some interesting contents.

6.4. Surrogate Data Method for Model Residual

The surrogate data method has been widely applied in the literature. It was
proposed to analyze whether the observed data is consistent with the given
null hypothesis. Here, the observed data is the previous model residual.
Hence, we employ the surrogate data method to answer the question in the
last section. The given hypothesis is that such data is consistent with random noise, known as NH0.
In fact, this technique can also be used prior to modeling to find whether
the data comes from the deterministic dynamics or merely random noise.
In the latter case, it is wasteful to make the prediction on it. Stochastic
process tools may be more applicable to model such data. Consequently,
the surrogate data method before modeling provides an indirect evidence
of predictability of the data.

6.4.1. Linear surrogate data


The principle of surrogate data hypothesis testing is to generate an ensemble
of artificial surrogate data (surrogates in short) by using a surrogate
generation algorithm and ensure that the generated surrogates are consistent with a certain null hypothesis. One then applies some test statistics to both
surrogates and the original data to observe whether the original data
is consistent with the given hypothesis. Commonly employed standard
hypotheses include [38]:

• NH0: The data is independent and identically distributed (i.i.d.) noise.


• NH1: The data is linearly filtered noise.
• NH2: The data is a static monotonic nonlinear transformation of linearly
filtered noise.

If the test statistic value for the data is distinct from the ensemble of values
estimated for surrogates, then one can reject the given null hypothesis as
being a likely origin of the data. If the test statistic value for the data
is consistent with that for surrogates, then one may not reject the null
hypothesis. Consequently, surrogate data provides a rigorous way to apply
the statistical hypothesis testing to exclude experimental time series from
the family of certain dynamics. One can apply it to determine whether an
observed time series has a statistically significant deterministic component,


or just is random noise.
There are three algorithms to generate surrogate data corresponding to
the hypotheses above, known as Algorithm 0, Algorithm 1 and Algorithm 2.
We only describe Algorithm 0 that we use. For Algorithm 0, the sequence
of the data is randomly shuffled. Such shuffling will destroy any temporal
autocorrelation of the original data. In essence such surrogates are random
but consistent with the same probability distribution as that of the original.
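Algorithm 0 amounts to a random permutation of the observed sequence. A minimal sketch of the surrogate generation, and of the comparison of a test statistic between the data and the surrogate ensemble, is given below; `statistic` is a placeholder for whatever discriminating statistic is used (in this chapter, the correlation dimension estimated by GKA), not an implementation of it.

```python
import numpy as np

def algorithm0_surrogates(data, n_surrogates, seed=0):
    """Algorithm 0 surrogates: random shuffles that destroy any temporal
    autocorrelation but preserve the amplitude distribution of the data."""
    rng = np.random.default_rng(seed)
    return [rng.permutation(np.asarray(data)) for _ in range(n_surrogates)]

def surrogate_test(data, statistic, n_surrogates=50, seed=0):
    """Compare the statistic of the data against its distribution over the
    surrogate ensemble; returns the data value, the ensemble mean and the
    ensemble standard deviation."""
    surrogate_values = np.array(
        [statistic(s) for s in algorithm0_surrogates(data, n_surrogates, seed)])
    return (statistic(np.asarray(data)),
            surrogate_values.mean(), surrogate_values.std())
```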
To test the hypothesis of surrogate data, one must select an appropriate test statistic. Among the available criteria, correlation dimension is a popular choice [39]. By the value of correlation dimension, one can
differentiate the time series with different dynamics. For example, the
correlation dimension for a closed curve (a periodic orbit) is one and a
strange (fractal) set can have a correlation dimension that is not an integer.
Correlation dimension is now a standard tool in the area of nonlinear
dynamics analysis. We adopt the Gaussian Kernel Algorithm (GKA) [40] to estimate the correlation dimension.

6.4.2. Systematic flowchart


Reviewing the workflow or basic ideas introduced previously, we summarize
the contents in sequence by using a flowchart, as listed in Fig. 6.6. In
summary, we incorporate neural network, MDL as well as the nonlinear
curve fitting and the surrogate data method into one comprehensive
modeling module for time series prediction [41, 42]. In other words,
the modeling module constitutes three sub-modules with their respective
purposes.

6.4.3. Identification of model residual


The surrogate data method is used to investigate the model residual of
the optimal neural network selected by MDL for the previous two cases.
We generate 50 surrogates of the prediction error for each case. The
given hypothesis is NH0, i.e. such prediction error is consistent with
i.i.d. noise. A suitable embedding dimension is required to estimate the correlation dimension, as it is a scale invariant measured on the phase space reconstruction. There is no universal criterion to select the embedding dimension, so we employ embedding dimensions varying from two to nine to calculate the correlation dimension. The delay time of the phase space reconstruction is chosen to be one, the same as the time lag of the input vector.

(Flowchart of Fig. 6.6: collect the time series; model it with neural networks of n = 1, 2, . . . neurons until n reaches an adequate value; MDL, with the help of nonlinear curve fitting, determines the optimal model; given the hypothesis, the surrogate data method is applied to the prediction error of the optimal model to decide whether or not to reject the hypothesis; the optimal network is then applied to the dynamics of the test data.)
Fig. 6.6. Systematic integration of neural networks, MDL, and the surrogate data
method to capture the dynamics of the observed time series.

Based on whether correlation dimension of the model error is out of or


in the range of correlation dimension for the surrogates, we can determine
whether to reject or fail to reject this given hypothesis. Typical results are
depicted in Fig. 6.7.
One can find that the correlation dimension of the original error stays
in the range of the mean plus or minus one standard deviation between
de = 2 and de = 6. Moreover, most of them are close to the average, which
means correlation dimension of the original data is close to the center of
the distribution of correlation dimension for all the surrogates. Therefore,
the original data is not distinguishable from the results of the surrogates.
Consequently we cannot reject the given hypothesis. This result indicates
that the previous prediction errors are consistent with random noise (i.e.
there is no significant determinism in the prediction error). In other words,
we conclude that the optimal neural networks estimated by MDL accurately
capture the underlying dynamics in both cases.
Note that the behavior of correlation dimension calculated by GKA with
embedding dimension higher than six becomes unstable. Large embedding
dimension yields unexpected correlation dimension that is even lower than
zero for some surrogates. This means that such embedding dimension is
March 2, 2012 11:56 World Scientific Review Volume - 9in x 6in 01˙ws-rv9x6/Chp. 6

118 Yi Zhao

Fig. 6.7. Analysis of the prediction error for the Rössler system (left panel) and human
pulse data (right panel). Stars are the correlation dimension of the original model residual
of the optimal model; the solid line is the mean of correlation dimension of 50 surrogates
at every embedding dimension; two dashed lines denote the mean plus one standard
deviation (the upper line) and the mean minus one standard deviation (the lower line);
two dotted lines are the maximum and minimum correlation dimension among these
surrogates.

inappropriate to estimate correlation dimension. So the proper embedding


dimension should be lower than six in both cases.

6.4.4. Further investigation

However, it may be possible that the surrogate data method failed


to distinguish the deterministic components even though significant
determinism exists. To address this problem, we further consider another
experiment in which we add the prediction error with (deterministic but
independent) observational “noise”. The “noise” is from the Rössler system
given in the third section. We take the prediction error of pulse modeling
as an example.
The magnitude of x -component data is set at 2% of the magnitude of
the original prediction error. So the added noise is considerably smaller
in contrast to the original prediction error. Relevant results under this
scenario are presented in Fig. 6.8, which reveals the deviation between the
correlation dimension of the “new” model residual and its surrogates.
Since the deterministic Rössler dynamics is added to the original
prediction error, new data should contain the deterministic dynamics. In
March 2, 2012 11:56 World Scientific Review Volume - 9in x 6in 01˙ws-rv9x6/Chp. 6

Artificial Neural Network Modeling with Application to Nonlinear Dynamics 119

Fig. 6.8 we can observe that the correlation dimension of this new error is
even further away from the maximal boundary of correlation dimension for
surrogate between de = 2 and de =6. So we can reject the given hypothesis
that the original error is the i.i.d noise. It is consistent with our expectation.
Again, numerical problems with GKA are evident for de ≥ 7.
We notice that the surrogate data method can exhibit the existence
of this deterministic structure in the model residual even if it is weak in
comparison with the original signal. On the contrary, if there is no difference
between the model prediction and data, the surrogate data method can also
exhibit corresponding consistency.
We apply the surrogate data method to the residual of the optimal
model (i.e. the prediction error). It estimates correlation dimension
for this prediction error and its surrogates under the given hypothesis
(NH0): the prediction error is consistent with i.i.d noise. According
to results, we cannot reject that the prediction error is i.i.d noise. We
conclude that with the test statistic at our disposal, the predictions achieved
by the optimal neural network and original data are indistinguishable.
Combination of neural networks, minimum description length and the
surrogate data method provides a comprehensive framework to handle time
series prediction. We feel that this technique is important and will be
applicable to a wide variety of real-world data.

Fig. 6.8. Application of the surrogate data method to the prediction error contaminated
with deterministic dynamics. All the donation of curves and symbol are the same as the
previous graph.
March 2, 2012 11:56 World Scientific Review Volume - 9in x 6in 01˙ws-rv9x6/Chp. 6

120 Yi Zhao

6.5. The Diploid Model Based on Neural Networks

In the previous section, all the prediction is made based on a single model.
However, the observed data may come from various different dynamics. So
a single model is not qualified for handling this kind of complicated data.
Section 6.4.4 gave an example to describe this case. The single model
usually just captures one dominating dynamic but fails to respond to other
relatively weak dynamics also hidden in the data. Thus, new integrated
model structures are developed to model those data with more dynamics.
It has been well verified that the prediction can be improved significantly
by the combination of different methods [43, 44]. Several combining schemes
have been proposed to enhance the ability of the neural network. Horn
described a combined system of the two feedforward neural networks to
predict the future value [45]. Lee et al. proposed an integrated model with
backpropagation neural networks and self organizing feature map (SOFM)
model for bankruptcy prediction [46]. Besides that, the combination
of autoregressive integrated moving average model (ARIMA) and neural
networks are often mentioned as a common practice to improve prediction
accuracy of the practical time series [47, 48]. They both show that the
prediction error of the combined model is smaller than just the single
model, which exhibits the advantages of the combined model. Inspired by
the previous works, we develop an idea of the diploid model for predicting
complicated data.
We construct the diploid neural network in a series way, which is defined
as the series-wound model (Fig. 6.9). The principle of series-wound diploid
model is that after the prediction of model I, its prediction error is used as
the input data for model II to compensate the error of model I. The final
results will be the sum of each output of both models.

Inputs Model I Model II Outputs

Fig. 6.9. The diploid neural networks constructed in a series-wound.

Suppose that model I is represented by the function f (·), and model II


is represented by the function S(·). We briefly describe the principle of the
diploid series-wound model as follow:
March 2, 2012 11:56 World Scientific Review Volume - 9in x 6in 01˙ws-rv9x6/Chp. 6

Artificial Neural Network Modeling with Application to Nonlinear Dynamics 121

Model I gives its prediction on the long-term tendency, x̂t = f (x),


where x = {xt−1 , xt−2 , · · · , xt−d }, d is the number of input data and
x̂t denotes the prediction value of model I. Then model II is used for
remedying the residual of model I. So we use the prediction error of
the previous model as the training set to train it. That is, it aims to
capture the short-term dynamics left in the previous model residual. So
model II makes its predictions given by the equation x̂ ˆt = S(e), where
e = {et−1 , et−2 , · · · , et−l } and l denotes the number of predictions that
the previous model obtained. The sum of both equations gives the final
prediction on the original time series with the equation x̄t+1 = S(e) + f (x).
It compensates the missing components of the first prediction with the new
prediction of the second model. The final prediction is expected to capture
both underlying short-term and long-term tendencies of the data.
As we know, the complicated time series is usually generated by
several different dynamics. For simplicity, it is reasonable to classify those
dynamics into two categories: fast dynamics and slow dynamics, which
determine the short-term and long-term varying tendencies of the time
series respectively. We wish to develop an approach of double multi-layer
feed-forward neural networks, denominated as the diploid model to capture
both dynamics. The large time-scale inputs are fed to the first (global)
network while the small time-scale inputs are selected for the second (local)
network. At step one an optimal global neural network determined by the
minimal description length makes its respective prediction on the original
time series to follow the slow dynamics. Local dynamics exist in the model
residual of the first one. At step two a local neural network is constructed
to remedy global model prediction, i.e. capture the fast dynamics in the
previous model residual.
In addition, the surrogate data method can provide an auxiliary measure
to confirm the dynamic character of the model residual at each stage so as
to validate the performance of the single model and the diploid model.
Significantly, this new strategy introduces a new path for establishing the
diploid ANN to deal with complex systems.

6.6. Conclusion

The modeling technique concerning artificial neural networks has been


widely applied to capture nonlinear dynamics. Here, we address only one of
these primary applications: prediction of nonlinear time series. Our interest
is to seek the optimal (or appropriate) architecture of the multi-layer neural
March 2, 2012 11:56 World Scientific Review Volume - 9in x 6in 01˙ws-rv9x6/Chp. 6

122 Yi Zhao

network to avoid overfitting and further provide adequate generalization for


general time series prediction.
In the previous related works, researchers usually focused on
optimization of the inherent parameters of the neural networks or training
strategies. So these methods do not directly estimate how many neurons of
the neural network are sufficient to a specific application. In this chapter, we
describe an information theoretic criterion based on minimum description
length to decide the optimal model. Moreover, the nonlinear fitted function
is defined to reflect the tendency of the original DL curve so as to smooth
its fluctuation. The experimental results demonstrate that the fitted curve
is quite effective and robust in both computational and practical data sets.
To further confirm the performance of the optimal neural networks, we
utilize the standard surrogate data method to analyze the prediction error
obtained by these neural networks based on MDL. If the model residual is
consistent with the given hypothesis of random noise, it then indicates
that there is no deterministic structure in the residual. That is, the
optimal model accurately follows dynamics of the data. A comprehensive
procedure that integrates the neural network modeling with the surrogate
data method is illustrated in this work. Application of the surrogate data
method to the model residual paves the way for evaluating the performance
of corresponding models. If the result is negative, more cautions are
required to use the current model. Or another advanced model is suggested.
The idea of a diploid model is then presented to deal with the complex
system prediction. It aims to follow both short-term and long-term varying
tendencies of the time series. The related work is in progress, and the
preliminary results show its great advantage of modeling the simulated
time series composed of components of the Lorenz system and Ikeda map.

Acknowledgments

This work was supported by the China National Natural Science


Foundation Grant No. 60801014, the Natural Science Foundation Grant
of Guangdong Province No. 9451805707002363 and the Scientific Plan of
Nanshan District, Shenzhen, China.

References

[1] B. Lippmann, Pattern classification using neural networks, Communications


Magazine. 27, 47–50, (1989).
March 2, 2012 11:56 World Scientific Review Volume - 9in x 6in 01˙ws-rv9x6/Chp. 6

Artificial Neural Network Modeling with Application to Nonlinear Dynamics 123

[2] P. J. Antsaklis, Neural networks for control systems, Neural Networks. 1,


242–244, (1990).
[3] G. P. Zhang, Neural networks for classification: A survey, IEEE Transactions
on Systems, Man, and Cybernetics – Part C: Application and Review. 30,
451–462, (2000).
[4] G. E. P. Box, G. M. Jenkins, and G. C. Reinsel, Time Series Analysis:
Forecasting and Control. (Prentice Hall, Upper Saddle River, 1994).
[5] P. J. Antsaklis, Approximations by superpositions of sigmoidal functions,
Mathematics of Control, Signals, and Systems. 2, 303–314, (1990).
[6] K. Hornik, Approximation capabilities of multilayer feedforward networks,
Neural Networks. 4, 251–257, (1991).
[7] A. Lapedes and R. Farber, Nonlinear signal processing using neural network
prediction and system modeling, Technical Report. (1987).
[8] R. L. Wilson and R. Sharda, Bankruptcy prediction using neural networks,
Decision Support Systems. 11, 545–557, (1994).
[9] F.-M. Tseng, H.-C. Yu, and G.-H. Tzeng, Combining neural network model
with seasonal time series ARIMA model, Technological Forecasting and
Social Change. 69, 71–87, (2002).
[10] P. C. Chang, C. Y. Lai, and K. R. Lai, A hybrid system by evolving
case-based reasoning with genetic algorithm in wholesaler’s returning book
forecasting, Decision Support Systems. 42, 1715–1729, (2006).
[11] D. Shanthi, G. Sahoo, and N. Saravanan, Designing an artificial neural
network model for the prediction of Thrombo-embolic stroke, International
Journals of Biometric and Bioinformatics. 3, 10–18, (2009).
[12] A. Weigend, On overfitting and the effective number of hidden units,
Proceedings of the 1993 Connectionist Models Summer School. pp. 335–342,
(1994).
[13] J. E. Moody, S. J. Hanson, and R. P. Lippmann, The effective number of
parameters: An analysis of generalization and regularization in nonlinear
learning systems, Advances in Neural Information Processing Systems. 4,
847–854, (1992).
[14] D. J. C. MacKay, Bayesian interpolation, Neural Computation. 4, 415–447,
(1992).
[15] Y. Zhao and M. Small, Minimum description length criterion for modeling of
chaotic attractors with multilayer perceptron networks, IEEE Transactions
on Circuit and Systems-I: Regular Papers. 53, 722–732, (2006).
[16] J. Rissanen, Stochastic Complexity in Statistical Inquiry. (World Scientific,
Singapore, 1989).
[17] M. Li and P. Vitnyi, An Introduction of Kolmogorov and its Applications.
(Springer–Verlag, Berlin, 2009).
[18] J. Rissanen, Modeling by the shortest data description, Automatica. 14,
465–471, (1978).
[19] K. Judd and A. Mees, On selecting models for nonlinear time series, Physica
D. 82, 426–444, (2009).
[20] M. Small and C. K. Tse, Minimum description length neural networks for
time series prediction, Physical Review E. 66, (2009).
March 2, 2012 11:56 World Scientific Review Volume - 9in x 6in 01˙ws-rv9x6/Chp. 6

124 Yi Zhao

[21] L. Xu, Advances on BYY harmony learning: Information theoretic


perspective generalized projection geometry and independent factor
autodetermination, Neural Networks. 15, 885–902, (2004).
[22] C. Alippi, Selecting accurate, robust, and minimal feedforward neural
networks, IEEE. Transactions on Circurits and Systems I: Regular Papers.
49, 1799–1810, (2002).
[23] H. Bischof and A. Leonardis, Finding optimal neural networks for land use
classification, IEEE Transactions on Geoscience and Remote Sensing. 36,
337–341, (1998).
[24] X. M. Gao, Modeling of speech signals using an optimal neural network
structure based on the PMDL principle, IEEE Transactions on Speech and
Audio Processing. 6, 177–180, (1998).
[25] A. Galka, Topics in Nonlinear Time Series Analysis with Applications for
ECG Analysis. (World Scientific, Singapore, 2000).
[26] D. W. Baker and N. L. Carter, Testing for nonlinearity in time series: the
method of surrogate data, Physica D. 58, 77–94, (1972).
[27] H. Demuth and M. Beale, Neural Network Toolbox User’s Guide. (The Math
Works, USA, 1998).
[28] M. T. Hagan and M. Menhaj, Training feedforward networks with the
Marquardt algorithm, Neural Networks. 5, 989–993, (1994).
[29] H. Akaike, A new look at the statistical model identification, IEEE
Transactions on Automatic Control. 19, 716–723, (1974).
[30] G. Schwarz, Estimating the dimension of a model, The Annals of Statistics.
6, 461–464, (2009).
[31] M. Stone, Comments on model selection criteria of Akaike and Schwarz, The
Royal Statistical Society. 41, 276–278, (1979).
[32] J. Rice, Bandwidth choice for nonparameteric regression, The Annals of
Statistics. 12, 1215–1230, (1984).
[33] Z. Liang, R. J. Jaszczak, and R. E. Coleman, Parameter estimation of finite
mixtures using the EM algorithm and information criteria with application
to medical image processing, IEEE Transactions on Nuclear Science. 39,
1126–1133, (1992).
[34] C. Wallace and D. Boulton, An information measure for classification,
Computing Journal. 11, 185–195, (1972).
[35] B. A. Barron and J. Rissanen, The minimum description length principle
in coding and modeling, IEEE Transactions on Information Theory. 4,
2743–2760, (1972).
[36] M. B. Kennel, R. Brown, and H. D. I. Abarbanel, Determining embedding
dimension for phase-space reconstruction using a geometrical construction,
Physical Review A. 45, 3403–3411, (1992).
[37] O. E. Rossler, An equation for continuous chaos, Physics Letters A. 57,
397–398, (1976).
[38] J. Theiler, Testing for nonlinearity in time series: The method of surrogate
data, Physica. D. 58, 77–94, (1992).
[39] M. Small and C. K. Tse, Applying the method of surrogate data to cyclic
time series, Physica D. 164, 182–202, (2002).
March 2, 2012 11:56 World Scientific Review Volume - 9in x 6in 01˙ws-rv9x6/Chp. 6

Artificial Neural Network Modeling with Application to Nonlinear Dynamics 125

[40] D. Yu, Efficient implementation of the Gaussian kernel algorithm in


estimating invariants and noise level from noisy time series data, Physical
Review E. 61, 3750, (2000).
[41] Y. Zhao and M. Small, Equivalence between “feeling the pulse” on the
human wrist and the pulse pressure wave at fingertip, International Journal
of Neural Systems. 15, 277–286, (2005).
[42] Y. Zhao, J. F. Sum, and M. Small, Evidence consistent with deterministic
chaos in human cardiac data: Surrogate and nonlinear dynamical modeling,
International Journal of Bifurcations and Chaos. 18, 141–160, (2008).
[43] S. Makridakis, Why combining works, International Journal of Forecasting.
5, 601–603, (1989).
[44] F. Palm and A. Zellner, To combine or not to combine? Issue of combining
forecasts, International Journal of Forecasting. 11, 687–701, (1992).
[45] G. D. Horn and I. Ginzburg, Combined neural networks for time series
analysis, Neural Information Processing Systems - NIPS. 6, 224–231, (1994).
[46] K. C. Lee, I. Han, and Y. Kwon, Hybrid neural network models for
bankruptcy predictions, Decision Support Systems. 18, 63–72, (1996).
[47] G. P. Zhang, Time series forecasting using a hybrid ARIMA and neural
network model, Neurocomputing. 50, 159–175, (2003).
[48] L. Aburto and R. Weber, Improved supply chain management based on
hybrid demand forecasts, Applied Soft Computing. 7, 136–144, (2007).
March 2, 2012 11:56 World Scientific Review Volume - 9in x 6in 01˙ws-rv9x6/Chp. 6

This page intentionally left blank

126
March 2, 2012 11:56 World Scientific Review Volume - 9in x 6in 01˙ws-rv9x6/Chp. 7

Chapter 7

Solving Eigen-problems of Matrices by Neural Networks

1
Yiguang Liu, 1 Zhisheng You, 2 Bingbing Liu and 1 Jiliu Zhou
1
Video and Image Processing Lab, School of Computer Science &
Engineering, Sichuan University, China 610065,
[email protected]
2
Data Storage Institute, A* STAR, Singapore 138632

How to efficiently solve eigen-problems of matrices is always a significant


issue in engineering. Neural networks run in an asynchronous manner,
and thus applying neural networks to address these problems can attain
high performance. In this chapter, several recurrent neural network
models are proposed to handle eigen-problems of matrices. Each model
is expressed as an individual differential equation, with its analytic
solution being derived. Subsequently, the convergence properties of
the neural network models are fully discussed based on the solutions to
these differential equations. Finally, the computation steps are designed
toward solving the eigen-problems, with numerical simulations being
provided to evaluate the effectiveness of each model.
This chapter consists of three major parts, with each approach in
these three parts being in the form of neural networks. Section 7.1
presents how to solve the eigen-problems of real symmetric matrices;
Sections 7.2 and 7.3 are devoted to addressing the eigen-problems of
anti-symmetric matrices; Section 7.4 aims at solving eigen-problems of
general matrices, which are neither symmetric nor anti-symmetric.
Finally, conclusions are made in Section 7.5 to summarize the whole
chapter.

Contents
7.1 A Simple Recurrent Neural Network for Computing the Largest and Smallest
Eigenvalues and Corresponding Eigenvectors of a Real Symmetric Matrix . . 128
7.1.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
7.1.2 Analytic solution of RNN . . . . . . . . . . . . . . . . . . . . . . . . . 130
7.1.3 Convergence analysis of RNN . . . . . . . . . . . . . . . . . . . . . . . 131
7.1.4 Steps to compute λ1 and λn . . . . . . . . . . . . . . . . . . . . . . . . 135
7.1.5 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135

127
March 2, 2012 11:56 World Scientific Review Volume - 9in x 6in 01˙ws-rv9x6/Chp. 7

128 Yiguang Liu, Zhisheng You, Bingbing Liu and Jiliu Zhou

7.1.6 Section summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141


7.2 A Recurrent Neural Network for Computing the Largest Modulus Eigenvalues
and Their Corresponding Eigenvectors of an Anti-symmetric Matrix . . . . . 142
7.2.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
7.2.2 Analytic solution of RNN . . . . . . . . . . . . . . . . . . . . . . . . . 145
7.2.3 Convergence analysis of RNN . . . . . . . . . . . . . . . . . . . . . . . 148
7.2.4 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
7.2.5 Section summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
7.3 A Concise Recurrent Neural Network Computing the Largest Modulus
Eigenvalues and Their Corresponding Eigenvectors of a Real Skew Matrix . . 155
7.3.1 Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
7.3.2 Analytic solution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
7.3.3 Convergence analysis of RNN . . . . . . . . . . . . . . . . . . . . . . . 158
7.3.4 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
7.3.5 Comparison with other methods and discussions . . . . . . . . . . . . . 163
7.3.6 Section summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
7.4 A Recurrent Neural Network Computing the Largest Imaginary or Real Part
of Eigenvalues of a General Real Matrix . . . . . . . . . . . . . . . . . . . . . 166
7.4.1 Analytic expression of |z (t)|2 . . . . . . . . . . . . . . . . . . . . . . . 168
7.4.2 Convergence analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
7.4.3 Simulations and discussions . . . . . . . . . . . . . . . . . . . . . . . . 172
7.4.4 Section summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178
7.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179

7.1. A Simple Recurrent Neural Network for Computing the


Largest and Smallest Eigenvalues and Corresponding
Eigenvectors of a Real Symmetric Matrix

Using neural networks to compute eigenvalues and eigenvectors of a matrix


is both quick and parallel. Fast computation of eigenvalues and eigenvectors
has many applications, e.g. primary component analysis (PCA) for image
compression and adaptive signal processing etc. Thus, many research works
in this field have been reported [1–11]. Zhang Yi et al. [8] proposed a
recurrent neural network (RNN) to compute eigenvalues and eigenvectors
of a real symmetric matrix, which is as follows
dx (t) [ ( ) ]
= −x (t) + xT (t) x (t) A + 1 − xT (t) Ax (t) I x (t) ,
dt
T
where t ≥ 0, x = (x1 , · · · , xn ) ∈ Rn represents the states of neurons, I is
an identity matrix, the matrix A ∈ Rn×n waiting for computing eigenvalues
and eigenvectors is symmetric. Obviously, this neural network is wholly
equivalent to the following neural networks
dx (t)
= xT (t) x (t) Ax (t) − xT (t) Ax (t) x (t) . (7.1)
dt
March 2, 2012 11:56 World Scientific Review Volume - 9in x 6in 01˙ws-rv9x6/Chp. 7

Solving Eigen-problems of Matrices by Neural Networks 129

We propose a new neural network that is applicable to computing


eigenvalues and eigenvectors of real symmetric matrices, which is described
as follows
dx (t)
= Ax (t) − xT (t) x (t) x (t) , (7.2)
dt
where t ≥ 0, definitions of A and x (t) accord with A and x (t) in Equation
(7.1). When A − xT (t) x (t) is looked at as synaptic connection weights
that vary with xT (t) x (t) and is regarded as a function of xT (t) x (t),
and the neuron activation functions are supposed as pure linear functions,
Equation (7.2) can be seen as a recurrent neural network (RNN (7.2)).
Evidently, RNN (7.2) is simpler than RNN (7.1), which has important
meanings for electronic designing; the connection is much less than RNN
(7.1), and the electronic circuitry of RNN (7.2) is more concise than RNN
(7.1). Therefore, the realization of RNN (7.2) is easier than RNN (7.1).
Moreover, the performance is close to that of RNN (7.1), which can be seen
in Section 7.1.3.

7.1.1. Preliminaries
All eigenvalues of A are denoted as λ1 ≥ λ2 ≥ · · · ≥ λn , and their
corresponding eigenvectors and eigensubspaces are denoted as µ1 , · · · , µn
and V1 , · · · , Vn . ⌈x⌉ denotes the most minimal integer which is not less
than x. ⌊x⌋ denotes the most maximal integer which is not larger than x.

Lemma 7.1. λ1 + α ≥ λ2 + α ≥ · · · ≥ λn + α are eigenvalues of A +


αI (α ∈ R) and µ1 , · · · , µn are the eigenvectors corresponding to λ1 +
α, · · · , λn + α.
Proof: From
Aµi = λi µi , i = 1, · · · , n.
It follows that
Aµi + αµi = λi µi + αµi , i = 1, · · · , n
i.e.
(A + αI) µi = (λi + α) µi , i = 1, · · · , n.
Therefore, this lemma is correct.
When RNN (7.2) approaches equilibrium state, we get the following
relation from Equation (7.2)
Aξ = ξ T ξξ,
March 2, 2012 11:56 World Scientific Review Volume - 9in x 6in 01˙ws-rv9x6/Chp. 7

130 Yiguang Liu, Zhisheng You, Bingbing Liu and Jiliu Zhou

where ξ ∈ Rn is the equilibrium vector. If ∥ξ∥ ̸= 0, ξ is an eigenvector, and


the corresponding eigenvalue λ is
λ = ξ T ξ. (7.3)

7.1.2. Analytic solution of RNN

Theorem 7.1. Denote Si = ∥µµii ∥ . zi (t) denotes the projection value of


x (t) onto Si , then the analytic solution of RNN (7.2) is presented as

n
zi (0) exp (λi t) Si
x (t) = √ i=1
. (7.4)

n ∫t
1+ zj2 (0) 0
exp (2λj τ ) dτ
j=1

Proof: Recurring to the properties of real symmetric matrices, we know


S1 , · · · , Sn construct a normalized perpendicular basis of Rn×n . So, x (t) ∈
Rn can be presented as

n
x (t) = zi (t) Si . (7.5)
i=1

Taking Equation (7.5) into Equation (7.2), we get


d ∑n
zi (t) = λi zi (t) − zj2 (t)zi (t) ,
dt j=1

when zi (t) ̸= 0
1 d ∑ n
λi − zi (t) = zj2 (t).
zi (t) dt j=1

Therefore
1 d 1 d
λi − zi (t) = λr − zr (t) (r = 1, · · · , n) ,
zi (t) dt zr (t) dt
so
zi (t) zi (0)
= exp [(λi − λr ) t] (7.6)
zr (t) zr (0)
for t ≥ 0. From Equation (7.2), we can also get
[ ] ∑n [ ]2
d 1 1 zj (t)
= −2λr + 2 . (7.7)
dt zr2 (t) zr2 (t) j=1
zr (t)
March 2, 2012 11:56 World Scientific Review Volume - 9in x 6in 01˙ws-rv9x6/Chp. 7

Solving Eigen-problems of Matrices by Neural Networks 131

Taking Equation (7.6) into Equation (7.7), we have


[ ] ∑n [ ]2
d 1 1 zj (0)
= −2λr 2 +2 exp [2 (λj − λr ) t] ,
dt zr2 (t) zr (t) j=1
zr (0)

i.e.

∑n [ ]2 ∫ t
exp (2λr t) 1 zj (0)
− = 2 exp (2λj τ ) dτ ,
zr2 (t) zr2 (0) j=1
zr (0) 0

so

zr (0) exp (λr t)


zr (t) = √ . (7.8)

n
2
∫t
1+2 zj (0) 0 exp (2λj τ ) dτ
j=1

So, from Equations (7.5), (7.6) and (7.8), it follows that


n
x (t) = zi (t) Si
i=1
∑n
zi (t)
= zr (t) zr (t) Si
i=1
∑n
zi (0) zr (0) exp(λr t)Si
= zr (0) exp [(λi − λr ) t] √ ∑
n ∫ (7.9)
i=1 1+2 zj2 (0) 0t exp(2λj τ )dτ
j=1

n
zi (0) exp(λi t)Si
= √ i=1
.

n ∫ t
1+2 zj2 (0) 0
exp(2λj τ )dτ
j=1

So, Theorem 7.2 shown in the following can be achieved.

7.1.3. Convergence analysis of RNN

Theorem 7.2. For nonzero initial vector x (0) ∈ Vi , the equilibrium vector
ξ may be trivial or ξ ∈ Vi . If ξ ∈ Vi , ξ T ξ is the eigenvalue λi .
Proof: Since x (0) ∈ Vi , Vi ⊥Vj (1 ≤ j ≤ n, j ̸= i), the projections of
x (0) onto Vj (1 ≤ j ≤ n, j ̸= i) are zero. Therefore, from Equation (7.9),
it follows that
March 2, 2012 11:56 World Scientific Review Volume - 9in x 6in 01˙ws-rv9x6/Chp. 7

132 Yiguang Liu, Zhisheng You, Bingbing Liu and Jiliu Zhou


n
zk (0) exp(λk t)Sk
x (t) = √ k=1

n ∫ t
1+2 zj2 (0) 0
exp(2λj τ )dτ
j=1
zi (0) exp(λi t)Si
= √

n ∫
1+2 zj2 (0) 0t exp(2λj τ )dτ
j=1

∈ Vi ,

so, ξ ∈ Vi . From Equation (7.3), we know ξ T ξ is the eigenvalue λi .

Theorem 7.3. If nonzero initial vector x (0) ∈ / Vi (i = 1, · · · , n) and the


equilibrium vector ∥ξ∥ ̸= 0, then ξ ∈ V1 , ξ ξ is the largest eigenvalue λ1 .
T

Proof: Since x (0) ∈ / Vi (i = 1, · · · , n), the projection of x (0) onto Vi


(1 ≤ i ≤ n) is nonzero, i.e.

zi (0) ̸= 0 (1 ≤ i ≤ n) .

From Theorem 7.1, it follows that



n
zi (0) exp(λi t)Si
x (t) = √ i=1

n ∫ t
1+2 zj2 (0) 0
exp(2λj τ )dτ
j=1

n
z1 (0) exp(λ1 t)S1 + zi (0) exp(λi t)Si
= √ i=2
,
∫ t ∑
n ∫ t
1+2z12 (0) 0
exp(2λ1 τ )dτ +2 zj2 (0) 0
exp(2λj τ )dτ
j=2

and λ1 ≤ 0 will make ∥ξ∥ = 0. So λ1 > 0 and


n
z1 (0)S1 + zi (0) exp[(λi −λ1 )t]Si
x (t) = √ i=2
.
2 (0)
z1 ∑
n ∫ t
exp(−2λ1 t)+ λ1 [exp(2λ1 t)−1] exp(−2λ1
t)+2 zj2 (0) 0
exp[(2λj τ )−(2λ1 t)]dτ
j=2

Therefore, it holds that

ξ = lim x (t)
t→∞

n
z1 (0)S1 + zi (0) exp[(λi −λ1 )t]Si
= lim √ i=2

t→∞ 2 (0)
z1 ∑
n ∫
exp(−2λ1 t)+ λ1 [exp(2λ1 t)−1] exp(−2λ1 t)+2 zj2 (0) t
0
exp[(2λj τ )−(2λ1 t)]dτ ,
√ j=2

λ1
= z
z12 (0) 1
(0) S1
∈ V1
(7.10)
March 2, 2012 11:56 World Scientific Review Volume - 9in x 6in 01˙ws-rv9x6/Chp. 7

Solving Eigen-problems of Matrices by Neural Networks 133

and
√ √
λ1 λ1
ξT ξ = z
z12 (0) 1
(0) S1T S1 z
z12 (0) 1
(0)
(7.11)
= λ1 .
From Equations (7.10) and (7.11), we know this theorem is proved.

Theorem 7.4. Replace A with −A, If nonzero initial vector x (0) ∈ /


Vi (i = 1, · · · , n) and the equilibrium vector ∥ξ∥ ̸= 0, then ξ ∈ Vn , λn =
−ξ T ξ.
Proof: Replace A with −A, it follows that

dx (t)
= −Ax (t) − xT (t) x (t) x (t) . (7.12)
dt
For RNN (7.12), from Theorem 7.3, we know that if |x (0)| ̸= 0,
x (0) ∈
/ Vi (i = 1, · · · , n) and the equilibrium vector ∥ξ∥ ̸= 0, ξ belongs
to the eigenspace corresponding to the largest eigenvalue of −A. Because
the eigenvalues of −A are −λn ≥ −λn−1 ≥ · · · ≥ −λ1 . Therefore, ξ ∈ Vn
and ξ T ξ = −λn , i.e. λn = −ξ T ξ.

Theorem 7.5. When A is positive definite, RNN (7.2) cannot compute λn


by replacing A with −A. When A is negative definite, RNN (7.2) cannot
compute λ1
Proof: When A is positive definite, the eigenvalues λ1 ≥ λ2 ≥ · · · ≥
λn > 0, there exists a vector ς ∈ Rn which matches ς T ς = λ1 . Therefore,
the RNN (7.2) will converge to a nonzero vector ξ ∈ Rn that matches
ξ T ξ = λ1 . Replace A with −A, since 0 > −λn ≥ −λn−1 ≥ · · · ≥ −λ1 , there
does not exist a vector ς ∈ Rn , which matches ς T ς = −λn . So, when A is
positive definite, RNN (7.2) will converge to zero vector when A is replaced
by −A. While if A is negative definite, there does not exist a vector that
matches ς T ς = λ1 , RNN (7.2) will converge to zero. However, there exists
a vector ζ ∈ Rn , which matches ξ T ξ = −λn , so, RNN (7.2) can compute
λn by replacing A with −A when A is negative definite.

Theorem 7.6. When A is positive definite, if A in RNN (7.2) is replaced


with − (A − ⌈λ1 ⌉ I), then λn can be computed from the equilibrium vector
ξ as

λn = −ξ T ξ + ⌈λ1 ⌉ . (7.13)

for the nonzero initial vector x (0) ∈


/ Vi (i = 1, · · · , n).
March 2, 2012 11:56 World Scientific Review Volume - 9in x 6in 01˙ws-rv9x6/Chp. 7

134 Yiguang Liu, Zhisheng You, Bingbing Liu and Jiliu Zhou


Proof: Let λn denote the smallest eigenvalue of (A − ⌈λ1 ⌉ I). Since A
is positive definite, so, the smallest eigenvalue of (A − ⌈λ1 ⌉ I) is negative,
therefore ξ is nonzero when A is replaced by − (A − ⌈λ1 ⌉ I), from Theorem
7.4, it follows that

λn = −ξ T ξ. (7.14)

From Lemma 7.1, we know that



λn = λn − ⌈λ1 ⌉ . (7.15)

From Equations (7.14) and (7.15), it follows Equation (7.13).

Theorem 7.7. When A is negative definite, if A in RNN (7.2) is replaced


with (A − ⌊λn ⌋ I), then λ1 can be computed from the equilibrium vector ξ
as

λ1 = ξ T ξ + ⌊λn ⌋ . (7.16)

for nonzero initial vector x (0) ∈


/ Vi (i = 1, · · · , n).
Proof: As similar to the proof of Theorem 7.6, the proof of this theorem
is thus omitted.

Theorem 7.8. The trajectory of RNN (7.2) will converge to zero or


spherical surface whose radius is the square root of a positive eigenvalue
of A.
Proof: From Equation (7.9), we know

n
zi (0) exp (λi t) Si
x (t) = √ i=1
,

n ∫t
1+2 zj2 (0) 0
exp (2λj τ ) dτ
j=1

for the equilibrium vector ξ, if 0 > λ1 ≥ λ2 ≥ · · · ≥ λn , it follows that


ξ = lim x (t)
t→∞

n
zi (0) exp(λi t)Si
= lim √ i=1

t→∞ ∑
n ∫ t
1+2 zj2 (0) 0
exp(2λj τ )dτ
j=1

n (7.17)
zi (0) lim exp(λi t)Si
t→∞
= √ i=1

n ∫ t
1+2 lim zj2 (0) 0
exp(2λj τ )dτ
t→∞ j=1

= 0.
March 2, 2012 11:56 World Scientific Review Volume - 9in x 6in 01˙ws-rv9x6/Chp. 7

Solving Eigen-problems of Matrices by Neural Networks 135

Suppose λ1 ≥ λ2 ≥ · · · ≥ λk > 0 ≥ λk+1 ≥ · · · ≥ λn . The first nonzero


projection is zp (0) ̸= 0 (1 ≤ p ≤ n). Then we get two cases:
1) If 1 ≤ p ≤ k, it follows that

ξ = lim x (t)
t→∞

k ∑
n
zi (0) exp(λi t)Si + zi (0) exp(λi t)Si
i=p i=k+1
= lim √
t→∞ ∑
n ∫
1+2 zj2 (0) 0t exp(2λj τ )dτ
j=1

k ∑
n
zp (0)Sp + lim zi (0) exp(λi t−λp t)Si + lim zi (0) exp(λi t−λp t)Si
t→∞ i=p+1 t→∞ i=k+1
= v { }
u
u ∫
t1+ lim z 2 (0)[1−exp(−2λp t)]/λp +2 ∑ z 2 (0) t exp(2λj τ −2λp t)dτ
n

t→∞ p j 0
√ /
j=p+1

= zp (0) λp zp2 (0)Sp ,


so

∥ξ∥ = √ξ T ξ
√ / √ /
= zpT (0) zp (0) λp zp2 (0) λp zp2 (0)SpT Sp (7.18)

= λp .

2) If k + 1 ≤ p ≤ n, it follows that

ξ = lim x (t)
t→∞

n
zi (0) exp(λi t)Si
i=p
= lim √ . (7.19)
t→∞ ∑
n ∫ t
1+2 zj2 (0) 0
exp(2λj τ )dτ
p=1

=0

From Equations (7.17), (7.18) and (7.19), we know that the theorem is
correct.

7.1.4. Steps to compute λ1 and λn


In this section, we provide the steps to compute the largest eigenvalue and
the smallest eigenvalue of a real symmetric matrix B ∈ Rn×n with RNN
(7.2), which are as follows:

7.1.5. Simulation
We provide three examples to evaluate the method integrated by RNN (7.2)
and the five steps for computing the smallest eignevalue and the largest
March 2, 2012 11:56 World Scientific Review Volume - 9in x 6in 01˙ws-rv9x6/Chp. 7

136 Yiguang Liu, Zhisheng You, Bingbing Liu and Jiliu Zhou

Table 7.1. Steps computing λ1 and λn .


Step 1 Replace A with B. If equilibrium vector ∥ξ∥ ̸= 0, we
get λ1 = ξ T ξ.
Step 2 Replace A with −B. If equilibrium vector ∥ξ∥ ̸= 0, we
get λn = −ξ T ξ.
If λ1 has been received, go to step 5. Otherwise, go to
step 3. If ∥ξ∥ = 0,
go to step 4.
Step 3 Replace A with (B − ⌊λn ⌋ I), and we get λ1 = ξ T ξ +
⌊λn ⌋. Go to step 5.
Step 4 Replace A with − (B − ⌈λ1 ⌉ I), and we get λn =
−ξ T ξ + ⌈λ1 ⌉. Go to step 5.
Step 5 End.

Fig. 7.1. The trajectories of the components of x when A is replaced with B.

eigenvalue of a real symmetric matrix. The simulating platform we used is


Matlab.
Example 1: A real 7 × 7 symmetric matrix is randomly generated as

 
0.4116 0.3646 0.5513 0.6659 0.5836 0.6280 0.4079
 0.3919 
 0.3646 0.4668 0.3993 0.4152 0.4059 0.4198 
 0.8687 
 0.5513 0.3993 0.9862 0.4336 0.6732 0.2326 
 
B= 0.6659 0.4152 0.4336 0.5208 0.5422 0.5827 0.5871  .
 
 0.5836 0.4059 0.6732 0.5422 0.3025 0.8340 0.4938 
 
 0.6280 0.4198 0.2326 0.5827 0.8340 0.9771 0.3358 
0.4079 0.3919 0.8687 0.5871 0.4938 0.3358 0.1722
March 2, 2012 11:56 World Scientific Review Volume - 9in x 6in 01˙ws-rv9x6/Chp. 7

Solving Eigen-problems of Matrices by Neural Networks 137

Fig. 7.2. The trajectories of λ1 (t) = xT (t)x(t) when A is replaced with B.

Fig. 7.3. The trajectories of the components of x when A is replaced with −B.

Through step 1, we get


T
ξ1 = (0.7199 0.5577 0.8172 0.7356 0.7637 0.7979 0.6542) .

So, λ1 = ξ1T ξ1 = 3.6860. The trajectories of xi (t) (i = 1, 2, · · · , 7)


T ∑
7
and λ1 (t) = x (t) x (t) = x2i (t) are shown in Figs 7.1 and 7.2.
i=1
From Figs 7.1 and 7.2, we can see that RNN (7.2) quickly reaches the
equilibrium state and the largest eigenvalue is obtained. Through step 2,
T
we get ξn = ( - 0.1454 0.0388 0.3184 0.2658 - 0.1201 0.0846 - 0.5324) and
λn = −ξnT ξn = −0.4997. The trajectories of xi (t) (i = 1, 2, · · · , 7) and
March 2, 2012 11:56 World Scientific Review Volume - 9in x 6in 01˙ws-rv9x6/Chp. 7

138 Yiguang Liu, Zhisheng You, Bingbing Liu and Jiliu Zhou

Fig. 7.4. The trajectories of λn (t) = −xT (t)x(t) when A is replaced with −B.

T ∑
7
λn (t) = −x (t) x (t) = − x2i (t) are shown in Figs 7.3 and 7.4. By using
i=1
Matlab to directly compute the eigenvalues of B, we get the “ground truth”
′ ′
largest eigenvalue λ1 = 3.6862 and the smallest eigenvalue λn = −0.4997.
Therefore, compare the results received by the RNN (7.2) approach with
the ground truth, the absolute difference values are

λ1 − λ1 = 0.0002,
and

λn − λn = 0.0000.

This verifies that the RNN (7.2) can compute the largest eigenvalue and
the smallest eigenvalue of B successfully.
Example 2: Use Matlab to generate a positive matrix that is
 
0.6495 0.5538 0.5519 0.5254
 0.5538 0.8607 0.7991 0.5583 
C= 
 0.5519 0.7991 0.8863 0.4508  ,
0.5254 0.5583 0.4508 0.6775
the ground truth eigenvalues of C directly computed by Matlab are λ′ =
(0.0420 0.1496 0.3664 2.5160). From step 1, we can get λ1 = 2.5159,
the trajectories of x (t) and xT (t) x (t) which will approach to λ1 are
shown in Figs 7.5 and 7.6. For step 2, the trajectories of x (t) and λn =
−xT (t) x (t) are shown in Figs 7.7 and 7.8. The equilibrium vector ξn =
( - 0.0010 - 0.0028 0.0025 0.0015) and λn = −ξnT ξn = −1.7615 × 10−5 ,
T
March 2, 2012 11:56 World Scientific Review Volume - 9in x 6in 01˙ws-rv9x6/Chp. 7

Solving Eigen-problems of Matrices by Neural Networks 139

obviously, λn is approaching to zero. So use step 4 to compute the smallest


eigenvalue, and replace A with
 
- 2.3505 0.5538 0.5519 0.5254
 0.5538 - 2.1393 0.7991 0.5583 
C ′ = − (C − ⌈λ1 ⌉ I) = −  
 0.5519 0.7991 - 2.1137 0.4508  ,
0.5254 0.5583 0.4508 - 2.3225
the trajectories of x (t) and −xT (t) x (t) that will approach to the smallest
eigenvalue of C ′ are shown in Figs 7.9 and 7.10. In the end, we get that
the equilibrium vector is ξn∗ = ( - 0.4405 - 1.1408 1.0296 0.6342) and the
T

small eigenvalue of C

Fig. 7.5. The trajectories of components of x when A is replaced with C.

λ∗n = −ξn∗T ξn∗ + ⌈λ1 ⌉


= −2.9578 + 3
= 0.0422
λ1 = 2.5159 and λ∗n = 0.0422 are the results computed by RNN (7.2).
′ ′
Compared with the ground true values λ1 = 2.5160 and λn = 0.0420, we

can see that λ1 is very close to λ1 , with the absolute difference being 0.0001,

and λ∗n is close to λn as well the absolute difference being 0.0002.
This verifies that using step 4 to compute the smallest eigenvalue of a
real positive symmetric matrix is effective.
March 2, 2012 11:56 World Scientific Review Volume - 9in x 6in 01˙ws-rv9x6/Chp. 7

140 Yiguang Liu, Zhisheng You, Bingbing Liu and Jiliu Zhou

Fig. 7.6. The trajectories of xT (t)x(t) when A is replaced with C.

Fig. 7.7. The trajectories of components of x when A is replaced with −C.

Example 3: Use matrix D = −C to evaluate RNN (7.2).


Obviously, D is negative definite and its true eigenvalues are λ′ =
(- 2.5160 - 0.3664 - 0.1496 - 0.0420). Replace A with D, from step 1, we
find the received eigenvalue is 2.1747 × 10−5 , we cannot decide whether this
value is the true largest eigenvalue. Through step 2, the smallest eigenvalue
is received λ4 = −2.5159. Using step 3, we get λ1 = −0.0420. Comparing
′ ′
λ4 and λ1 to λ4 and λ1 , it can be found that using step 3 to compute the
largest eigenvalue is successful.
March 2, 2012 11:56 World Scientific Review Volume - 9in x 6in 01˙ws-rv9x6/Chp. 7

Solving Eigen-problems of Matrices by Neural Networks 141

Fig. 7.8. The trajectories of −xT (t)x(t) when A is replaced with −C.

Fig. 7.9. The trajectories of components of x when A is replaced with C ′ .

7.1.6. Section summary

In this section, we presented a recurrent neural network, and designed


steps to compute the largest eigenvalue and smallest eigenvalue of a real
symmetric matrix using this neural network. This method depicted by
the recurrent neural network and the five steps is not only adaptive to
non-definite matrix but also to positive definite and negative definite
matrix. Three examples were provided to evaluate the validity of this
method, with B being a non-definite matrix, C being a positive definite
matrix and D being a negative definite matrix. All of the results obtained
March 2, 2012 11:56 World Scientific Review Volume - 9in x 6in 01˙ws-rv9x6/Chp. 7

142 Yiguang Liu, Zhisheng You, Bingbing Liu and Jiliu Zhou

Fig. 7.10. The trajectories of −xT (t)x(t) when A is replaced with C ′ .

by applying this method were highly close to the corresponding ground


truth values. Compared with other approaches based on neural networks,
this recurrent neural network is simpler and can be realized more easily.
Actually there exists an improved form of the neural network presented
in Equation (7.2), which is
dx (t) [ ]
= Ax (t) − sgn xT (t) x (t) x (t) , (7.20)
dt
where definitions of A, t and x (t) are the same to those of Equation (7.2),
and sgn [xi (t)] returns 1 if xi (t) > 0, 0 if xi (t) = 0 and -1 if xi (t) < 0.
Evidently, this new form is more concise than the neural network presented
in Equation (7.2). Interested readers can refer to [12] for the analytic
solution, convergence analysis and numerical valuations of this new form in
detail.

7.2. A Recurrent Neural Network for Computing the


Largest Modulus Eigenvalues and Their Corresponding
Eigenvectors of an Anti-symmetric Matrix

As we argued in Section 7.1, the ability of quick computation of


eigenvalues and eigenvectors has many important applications, and using
neural networks to compute eigenvalues and eigenvectors of a matrix has
outstanding features like high parallelism and quickness. That is the
reason why many research works have been reported [1–11, 13, 14] in this
field. However, the received results mainly focus on solving eigenvalues
March 2, 2012 11:56 World Scientific Review Volume - 9in x 6in 01˙ws-rv9x6/Chp. 7

Solving Eigen-problems of Matrices by Neural Networks 143

and eigenvectors for real symmetric matrices. According to the authors’


knowledge, approaches based on neural networks to extract eigenvalues
and eigenvectors of a real anti-symmetric matrix are rarely found in the
literature.
In this section, we propose a recurrent neural network (RNN), in order
to compute eigenvalues and eigenvectors for real anti-symmetric matrices.
The proposed RNN is
dv (t) [ ′ ]
= A g + A′ g ′ − |v (t)| v (t)
2
(7.21)
dt
( ) ( ) ( )
I0 I ′ I0 I0 ′ A I0
where t ≥ 0, v (t) ∈ R , g =
2n
,g = ,A = .
I0 I0 −I I0 I0 A
I is an n × n unit matrix. I0 is an n × n zero matrix. A is the
real anti-symmetric matrix whose eigenvalues are to be computed. |v (t)|
[denotes the modulus of] v (t). With v(t) being taken as the states of neurons,
A′ g + A′ g ′ − |v (t)| being regarded as synaptic connection weights and
2

the activation functions being assumed as pure linear functions, Equation


(7.21) describes a continuous time recurrent neural network.
( )T
Denote v (t) = y T (t) z T (t) , y (t) = (y1 (t) , y2 (t) , · · · , yn (t)) ∈ Rn
and z (t) = (z1 (t) , z2 (t) , · · · , zn (t)) ∈ Rn , Equation (7.21) is equal to

 n [ ]
 ∑
 dy(t)
 dt = Az (t) − yj2 (t) + zj2 (t) y (t)
j=1
∑n [ ] . (7.22)

 dz(t)
 dt = −Ay (t) − yj2 (t) + zj2 (t) z (t)
j=1

Denote
x (t) = y (t) + iz (t) , (7.23)
where i is the imaginary unit, Equation (7.22) is equal to

dy(t) dz(t) ∑n
[ 2 ]
+i = −Ai [y (t) + iz (t)] − yj (t) + zj2 (t) [y (t) + iz (t)] ,
dt dt j=1

i.e.
dx (t)
= −Ax (t) i − xT (t) x̄ (t) x (t) , (7.24)
dt
where x̄ (t) denotes the complex conjugate values of x (t). From Equations
(7.21), (7.22) and (7.24), we know that analyzing the convergence properties
of Equation (7.24) is equal to analyzing that of RNN (7.21).
March 2, 2012 11:56 World Scientific Review Volume - 9in x 6in 01˙ws-rv9x6/Chp. 7

144 Yiguang Liu, Zhisheng You, Bingbing Liu and Jiliu Zhou

7.2.1. Preliminaries

Lemma 7.2. The eigenvalues of A are pure imaginary numbers or zero.


Proof: let λ be an eigenvalue of A, and v is its corresponding
( )T
eigenvector. Because v T λ̄v̄ = v T λv = v T Av = v T Av̄ = AT v v̄ =
T T
(−Av) v̄ = (−λv) v̄ = −λv T v̄, so
( ) ( ) 2
λ̄ + λ v T v̄ = λ̄ + λ |v| = 0.
For |v| ̸= 0, thus λ̄ + λ = 0, this indicates that the value of the real
component of λ is zero. So, λ is a pure imaginary number or zero.

Lemma 7.3. If η i is an eigenvalue of A, and µ is the corresponding


eigenvector, then −η i is an eigenvalue of A too, and µ̄ is the eigenvector
corresponding to −η i.
Proof: from Aµ = η iµ, we get Aµ = η iµ, i.e. Aµ̄ = (−η i) µ̄. So, we
can draw the conclusion.
Based on Lemmas 7.2 and 7.3, all eigenvalues of A are denoted as
λ1 i, λ2 i, · · · , λn i which match |λ1 | ≥ |λ2 | ≥ · · · ≥ |λn |, λ2k = −λ2k−1 and
λ2k−1 ≥ 0 (k = 1, · · · , ⌊n/2⌋, ⌊n/2⌋ denotes the most maximal integer that
is not larger than n/2). Corresponding complex eigenvectors and complex
eigensubspaces are denoted as µ1 , · · · , µn and V1 , · · · , Vn .

Lemma 7.4. For matrix A, if two eigenvalues λk i ̸= λm i, then µk ⊥µm ,


namely the inner product ⟨µk , µm ⟩ = 0, k, m ∈ [1, n].
Proof: from Aµk = λk iµk and Aµm = λm iµm , we get µTk λm iµ̄m =
( )T T
T
µk λm iµm = µTk Aµm = µTk Aµ̄m = AT µk µ̄m = (−Aµk ) µ̄m
= −λk iµTk µ̄m , thus,
( )
λm i + λk i µTk µ̄m = 0. (7.25)
From Lemma 7.2, we know λk i and λm i are imaginary numbers. From
λk i ̸= λm i, it follows that
( )
λm i + λk i ̸= 0. (7.26)
From Equations (7.25) and (7.26), it easily follows that
µTk µ̄m = 0, (7.27)
i.e.
⟨µk , µm ⟩ = 0.
Therefore µk ⊥µm .
March 2, 2012 11:56 World Scientific Review Volume - 9in x 6in 01˙ws-rv9x6/Chp. 7

Solving Eigen-problems of Matrices by Neural Networks 145

When RNN (7.21) approaches equilibrium state, we get the following


formula from Equation (7.24)
¯
−Aξ i = ξ T ξξ, (7.28)

where ξ ∈ C n is the equilibrium vector. When ξ is an eigenvector of A,


λ i (–λ ∈ R) is supposed to be the corresponding eigenvalue, then

Aξ = –λ iξ. (7.29)

From Equations (7.28) and (7.29), it follows that


¯
–λ = ξ T ξ. (7.30)

7.2.2. Analytic solution of RNN

Theorem 7.9. Denote Sk = |µµkk | , xk denotes the projection value of x (t)


onto Sk , then the analytic solution of Equation (7.24) is

n
[yk (0) + izk (0)] exp (λk t) Sk
x (t) = √ k=1
(7.31)
n [
∑ ]∫t
1+2 zj2 (0) + yj2 (0) 0 exp (2λj τ ) dτ
j=1

for all t ≥ 0.
Proof: from Lemma 7.4, we know that S1 , S2 , · · · , Sn construct an
orthonormal basis of C n×n . Since xk = yk (t) + i zk (t), thus

n
x (t) = (zk (t) + i yk (t))Sk . (7.32)
k=1

Take Equation (7.32) into Equation (7.24), as Sk is a normalized


complex eigenvector of A, from the denotation of λk i, we know the
corresponding eigenvalue is λk i, thus
n [
∑ d d
]
dt zk (t) + i dt yk (t) Sk
k=1

n n [
∑ ]∑
n
=− A [zk (t) + i yk (t)]Sk i − zk2 (t) + yk2 (t) [zk (t) + i yk (t)] Sk
k=1 k=1 k=1
∑n ∑n [ 2 ]∑
n
=− λk i Sk [zk (t) + i yk (t)] i − zk (t) + yk2 (t) [zk (t) + i yk (t)] Sk
k=1 k=1 k=1

n n [
∑ ]∑
n
= λk Sk [zk (t) + i yk (t)] − zk2 (t) + yk2 (t) [zk (t) + i yk (t)] Sk ,
k=1 k=1 k=1
March 2, 2012 11:56 World Scientific Review Volume - 9in x 6in 01˙ws-rv9x6/Chp. 7

146 Yiguang Liu, Zhisheng You, Bingbing Liu and Jiliu Zhou

therefore
d ∑n
[ 2 ]
zk (t) = λk zk (t) − zj (t) + yj2 (t) zk (t) , (7.33)
dt j=1

d ∑n
[ 2 ]
yk (t) = λk yk (t) − zj (t) + yj2 (t) yk (t) . (7.34)
dt j=1

From Equation (7.33), we get

1 d ∑[ n
] 1 d
λk − zk (t) = zj2 (t) + yj2 (t) = λr − zr (t) ,
zk (t) dt j=1
zr (t) dt

where r ∈ [1, n], so


1 d 1 d
zr (t) − zk (t) = λr − λk ,
zr (t) dt zk (t) dt
i.e.
zr (t) zr (0)
= exp [(λr − λk ) t] . (7.35)
zk (t) zk (0)
From Equation (7.34), we analogously get
yr (t) yr (0)
= exp [(λr − λk ) t] . (7.36)
yk (t) yk (0)
From Equations (7.33) and (7.34), we get
1 d 1 d
λk − zk (t) = λr − yr (t) ,
zk (t) dt yr (t) dt
i.e.
yr (t) yr (0)
= exp [(λr − λk ) t] . (7.37)
zk (t) zk (0)
From Equation (7.21), we also get
[ ]
1 d λk ∑
n
zj2 (t) yj2 (t)
zk (t) = 2 − + ,
zk3 (t) dt zk (t) j=1 zk2 (t) zk2 (t)

i.e.
[ ]
d 1 λk ∑n
zj2 (t) yj2 (t)
+2 2 =2 + . (7.38)
dt zk2 (t) zk (t) j=1
zk2 (t) zk2 (t)
March 2, 2012 11:56 World Scientific Review Volume - 9in x 6in 01˙ws-rv9x6/Chp. 7

Solving Eigen-problems of Matrices by Neural Networks 147

Take Equations (7.35) and (7.37) into Equation (7.38), it follows that
[ ] [ ]
d exp (2λk t) ∑n
zj2 (0) + yj2 (0)
=2 exp (2λj t), (7.39)
dt zk2 (t) j=1
zk2 (0)

Integrating two sides of Equation (7.39) with respect to t from 0 to t′ ,


we get
∑ ∫ ′
exp (2λk t′ ) zk2 (0)
n
[ 2 2
] t
= 1 + 2 z (0) + y (0) exp (2λj t) dt,
zk2 (t′ ) j=1
j j
0

so
exp (2λk t′ ) zk2 (0)
zk2 (t′ ) = ∑
n [ 2 ] ∫ t′ ,
1+2 zj (0) + yj2 (0) 0 exp (2λj t) dt
j=1

therefore
exp (λk t′ ) zk (0)
zk (t′ ) = √ , (7.40)
n [
∑ ] ∫ t′
1+2 2 2
zj (0) + yj (0) 0 exp (2λj t) dt
j=1

or,
− exp (λk t′ ) zk (0)
zk (t′ ) = √ .
n [
∑ ] ∫ ′
t
1+2 zj2 (0) + yj2 (0) 0 exp (2λj t) dt
j=1

For the above formula, when t′ = 0, we get

zk (0) = −zk (0) , k = 1, 2, · · · , n.

When z (0) is a nonzero vector, there exists

zd (0) ̸= −zd (0) (d ∈ [1, 2, · · · , n]) ,

so, we take only Equation (7.40) as correct.


Analogously, from Equations (7.34), (7.36) and (7.37), we get

exp (λk t′ ) yk (0)


yk (t′ ) = √ . (7.41)
n [
∑ ] ∫ t′
1+2 2 2
zj (0) + yj (0) 0 exp (2λj t) dt
j=1
March 2, 2012 11:56 World Scientific Review Volume - 9in x 6in 01˙ws-rv9x6/Chp. 7

148 Yiguang Liu, Zhisheng You, Bingbing Liu and Jiliu Zhou

Therefore, from Equations (7.40) and (7.41), it follows that



n
x (t) = xk (t) Sk
k=1

n
= [yk (t) + i yk (t)] Sk
k=1 . (7.42)

n
[yk (0)+i zk (0)] exp(λk t)Sk
= √ k=1

n ∫t
1+2 [zj2 (0)+yj2 (0)] 0
exp(2λj τ )dτ
j=1

for all t ≥ 0. So the theorem is proved.

7.2.3. Convergence analysis of RNN

Theorem 7.10. If nonzero initial vector x (0) ∈ Vm , then the equilibrium


vector ξ = 0 or ξ ∈ Vm . If ξ ∈ Vm , iξ T ξ¯ is the eigenvalue λm i.
Proof: because x (0) ∈ Vm and Vm ⊥Vj (j ̸= m), thus
xj (0) = 0, j ̸= m. (7.43)
From Theorem 7.9 and Equation (7.43), we get
[ym (0) + i ym (0)] exp (λm t) Sm
x (t) = √ ∫ . (7.44)
2 (0) + y 2 (0)] t exp (2λ τ ) dτ
1 + 2 [zm m 0 m

when the equilibrium vector ξ exists, there exists the following relationship
ξ = lim x (t) . (7.45)
t→∞

From Equations (7.44) and (7.45), it follows that: When λm < 0


ξ = 0. (7.46)
When λm = 0
[ym (0) + i ym (0)] Sm
x (t) = lim √ (7.47)
t→∞ 2 (0) + y 2 (0)] t
1 + 2 [zm m
=0
When λm > 0


[ym (0)+i ym (0)]Sm
ξ = lim
t→∞ exp(−2λm t)+ λ1m [zm
2 (0)+y 2 (0)] [1−exp(−2λ t)]
m m

[ym (0)+i ym (0)]Sm


√ . (7.48)
= 1 2 2
λm [zm (0)+ym (0)]

∈ Vm
March 2, 2012 11:56 World Scientific Review Volume - 9in x 6in 01˙ws-rv9x6/Chp. 7

Solving Eigen-problems of Matrices by Neural Networks 149

From Equations (7.46), (7.47) and (7.48), it concludes that ξ = 0 or


ξ ∈ Vm . When ξ ∈ Vm , from Equation (7.48), it follows that
[ym (0)+i ym (0)]Sm √[ym (0)−i ym (0)]S̄m
ξ T ξ¯i = √
1 2 2 1 2 2
i
λm [zm (0)+ym (0)] λm [zm (0)+ym (0)] . (7.49)
= λm i

From Equations (7.46)∼(7.49), the theorem can be concluded.

Theorem 7.11. If nonzero initial vector x (0) ∈ / Vm (m = 1, · · · , n), then


the equilibrium vector ξ ∈ V1 or ξ = 0, ξ T ξ¯i is the eigenvalue λ1 i. If λ2 i
exists, −λ1 i is the eigenvalue λ2 i, and ξ¯ is eigenvector corresponding to
λ2 i.
Proof: because x (0) ∈ / Vm (m = 1, · · · , n), so xk (t) ̸=
0 (1 ≤ k ≤ n). From the denotation of λ1 i, λ2 i, · · · , λn i, it follows that

λ1 ≥ 0.

From Theorem 7.9 and Equation (7.45), when λ1 = 0, we get



n
[yk (0)+i yk (0)] exp(λk t)Sk
ξ = lim √ k=1

t→∞ ∑
n ∫t
1+2 [zj2 (0)+yj2 (0)] 0
exp(2λj τ )dτ
j=1

n
[yk (0)+i yk (0)]Sk , (7.50)
= lim √k=1
t→∞ ∑
n
1+2 [zj2 (0)+yj2 (0)]t
j=1

=0

and

ξ T ξ¯i = 0 = λ1 i, (7.51)

when λ1 > 0, we get


n
[y1 (0)+i y1 (0)]S1 + [yk (0)+i yk (0)] exp[(λk −λ1 )t]Sk
ξ = lim √ k=2
,
t→∞ ∫t ∑
n ∫t
exp(−2λ1 t)+2[z12 (0)+y12 (0)] 0
exp[2λ1 (τ −t)]dτ +2 [zj2 (0)+yj2 (0)] 0
exp(2λj τ −2λ1 t)dτ
j=2

thus

λ1
ξ= [y1 (0) + iy1 (0)] S1
[z12 (0)+y12 (0)] , (7.52)
∈ V1
March 2, 2012 11:56 World Scientific Review Volume - 9in x 6in 01˙ws-rv9x6/Chp. 7

150 Yiguang Liu, Zhisheng You, Bingbing Liu and Jiliu Zhou

and
√ √
ξ T ξ¯ i = λ1
[y1 (0) + iy1 (0)] S1 λ1
[y1 (0) − iy1 (0)] S̄1
[ z12 (0)+y12 (0) ] [z12 (0)+y12 (0)] .
= λ1 i
(7.53)
If λ2 i exists, from the denotation of λ1 i, λ2 i, · · · , λn i, we know
λ2 i = λ1 i = −λ1 i, (7.54)
and from Equation (7.54) and Lemma 7.2, it easily follows
Aξ¯ = λ2 i ξ,
¯ (7.55)
so ξ¯ is the eigenvector corresponding to λ2 i .
From Equations (7.50), (7.51), (7.52), (7.53), (7.54) and (7.55), it is
concluded that the theorem is correct.

Theorem 7.12. If x (0) ̸= 0, the modulus of trajectory |x (t)| will converge


to zero or λk > 0 (1 ≤ k ≤ n).
Proof: Rearrange λ1 , λ2 , · · · , λn into order λr1 > λr2 > · · · > λrn . From
Theorem 1, we know

n
[yk (0) + izk (0)] exp (λrk t) Sk
x (t) = √ k=1
.
n [
∑ ]∫t ( )
1+2 zj2 (0) + yj2 (0) 0 exp 2λrj τ dτ
j=1

Since λ1 ≥ 0,
λr1 ≥ 0,
the supposition λr1 > · · · λrq ≥ 0 > λrq+1 · · · λrn is rational. Suppose the first
nonzero component of x (t) is xp (0) ̸= 0 (1 ≤ p ≤ n). Then there are two
cases:
1) If 1 ≤ p ≤ q, it follows that
ξ = lim x (t)
t→∞
∑q [ ] ( ) ∑
n [ ] ( )
yk (0)+izk (0) exp λr k t Sk + yk (0)+izk (0) exp λr k t Sk
k=p k=q+1
= lim v
t→∞ u n [ ]∫ ( )
u ∑
t1+2 z 2 (0)+y 2 (0) 0t exp 2λr τ dτ
j j j
j=1
[ ] ∑
n [ ] ( )
yp (0)+izp (0) Sp + lim y (0)+izk (0) exp λr r
k t−λp t Sk
t→∞ k=p+1 k
= v
u ( ) [ ][ ( )]/ [ ]∫ ( )
u ∑
n
lim texp −2λr 2 2
p t + zp (0)+yp (0) 1−exp −2λp t
r λrp +2 z 2 (0)+y 2 (0) t r r
0 exp 2λj τ −2λp t dτ
t→∞ j j
j=p+1

[yp (0)+izp (0)] λp r
= √ Sp ,
2 2
zp (0)+yp (0)
March 2, 2012 11:56 World Scientific Review Volume - 9in x 6in 01˙ws-rv9x6/Chp. 7

Solving Eigen-problems of Matrices by Neural Networks 151

so

|ξ|√
= ξ T ξ¯
√ √
[yp (0)+izp (0)] λrp [yp (0)+izp (0)] λrp
= √ SpT √ S̄p . (7.56)
zp2 (0)+yp2 (0) zp2 (0)+yp2 (0)
= λrp
2) If k + 1 ≤ p ≤ n, it follows that
ξ = lim x (t)
t→∞

n
[yk (0)+izk (0)] exp(λrk t)Sk
k=p
= lim √ . (7.57)
t→∞ ∑
n ∫t
1+2 [zj2 (0)+yj2 (0)] 0
exp(2λrj τ )dτ
j=1

=0
Since the two sets [λ1 , λ2 , · · · , λn ] and [λr1 , λr2 , · · · , λrn ] are the same. So,
from Equations (7.56) and (7.57), it follows that |ξ| is λk > 0 (1 ≤ k ≤ n)
or zero. So, the theorem is drawn.

7.2.4. Simulation
In order to evaluate RNN (7.21), a numerical simulation is provided, and
Matlab is used to generate the simulation.
Example: For Equation (7.24), A is randomly generated as

 
0 -0.0511 0.2244 -0.0537 0.0888 -0.1191 0.1815
 0.0511 0 0.1064 0.2901 -0.0073 0.0263 0.2322 
 
 -0.2244 -0.1064 0 0.1406 -0.0206 0.1402 -0.0910 
 
 
A= 0.0537 -0.2901 -0.1406 0 -0.2708 -0.0640 -0.2373  ,
 
 -0.0888 0.0073 0.0206 0.2708 0 0.1271 0.1094 
 
 0.1191 -0.0263 -0.1402 0.0640 -0.1271 0 -0.1200 
-0.1815 -0.2322 0.0910 0.2373 -0.1094 0.1200 0
and the initial value is

 
0.0511-0.2244i
 -0.1064i 
 
 0.1064 
 
 
x (0) =  0.2901+0.1406i  .
 
 -0.0073-0.0206i 
 
 0.0263+0.1402i 
0.2322-0.0910i
March 2, 2012 11:56 World Scientific Review Volume - 9in x 6in 01˙ws-rv9x6/Chp. 7

152 Yiguang Liu, Zhisheng You, Bingbing Liu and Jiliu Zhou

Through simulation, the generated complex equilibrium vector is

 
-0.1355-0.1563i
 0.0716-0.2914i 
 
 0.2410-0.0716i 
 
 
ξ= 0.2840+0.3073i  ,
 
 0.1556-0.2338i 
 
 0.1076+0.1511i 
0.3334-0.1507i

and the eigenvalue is

λs1 i = ξ T ξ¯ i
= 0.6182 i

Fig. 7.11. The trajectories of modulus of components of x when A is replaced with B.

The trajectories of the modulus values of the components of x(t) and


xT (t) x̄ (t) approaching λ1 are shown in Figs 7.11 and 7.12. Use Matlab to
directly compute the eigenvalues and eigenvectors, the ground truth values
are follows:
March 2, 2012 11:56 World Scientific Review Volume - 9in x 6in 01˙ws-rv9x6/Chp. 7

Solving Eigen-problems of Matrices by Neural Networks 153

Fig. 7.12. The trajectory of xT (t)x(t) when A is replaced with B.

   
μ1 = [ -0.2630 - 0.0084i        μ3 = [  0.5318
       -0.2105 - 0.3184i                0.0207 - 0.4494i
        0.1411 - 0.2869i               -0.0997 + 0.3784i
        0.5322                          0.2835 - 0.1898i
       -0.0841 - 0.3471i               -0.2272 - 0.0857i
        0.2340 + 0.0299i               -0.1797 - 0.3099i
        0.1470 - 0.4415i ],             0.2063 + 0.1246i ],

μ5 = [  0.1852 + 0.2252i        μ7 = [  0.3549
        0.0424 + 0.0809i               -0.5360
        0.2436 + 0.3907i               -0.2557
       -0.2443 + 0.1977i               -0.0568
       -0.2995 + 0.1881i                0.6138
        0.4992                          0.3656
        0.0769 - 0.4644i ],             0.0880 ].
µ2 = µ̄1 , µ4 = µ̄3 , µ6 = µ̄5 , λ1 i = 0.6184i, λ2 i = −0.6184i,
λ3 i = 0.3196i, λ4 i = −0.3196i, λ5 i = 0.0288i, λ6 i = −0.0288i and
λ7 i = 0.
When $\lambda_1^s\mathrm{i}$ is compared with $\lambda_1\mathrm{i}$, the difference is
$$
\Delta\lambda_1=\lambda_1^s\,\mathrm{i}-\lambda_1\,\mathrm{i}=0.6182\,\mathrm{i}-0.6184\,\mathrm{i}=-0.0002\,\mathrm{i}.
$$
The modulus value of $\Delta\lambda_1$ is very small, which means that the eigenvalue computed by RNN (7.21) is very close to the true value. This verifies the validity of RNN (7.21) for computing eigenvalues. On the other hand, it
may be noticed that the eigenvector $\xi$ computed by RNN (7.21) is different from $\mu_k$ $(1\le k\le n)$. Computing the inner products between $\xi$ and $\mu_k$, we get: $U_1=\langle\xi,\mu_1\rangle=0.5336-0.5775\mathrm{i}$, $U_2=\langle\xi,\mu_2\rangle=2.7411\times10^{-6}-1.1816\times10^{-6}\mathrm{i}$, $U_3=\langle\xi,\mu_3\rangle=6.8277\times10^{-8}+7.7934\times10^{-8}\mathrm{i}$, $U_4=\langle\xi,\mu_4\rangle=1.9873\times10^{-13}+1.2935\times10^{-13}\mathrm{i}$, $U_5=\langle\xi,\mu_5\rangle=3.4937\times10^{-15}+9.1108\times10^{-15}\mathrm{i}$, $U_6=\langle\xi,\mu_6\rangle=-9.7491\times10^{-16}+1.1172\times10^{-15}\mathrm{i}$, $U_7=\langle\xi,\mu_7\rangle=-5.5511\times10^{-17}+5.0307\times10^{-17}\mathrm{i}$. Obviously, $U_k$ $(2\le k\le n)$ is nearly zero, while $U_1$ is not. Hence, $\xi$ belongs to the subspace corresponding to $\lambda_1$, i.e. $\xi\in V_1$.

Thus the eigenvector $\xi$ computed by RNN (7.21) is equivalent to $\mu_1$, both of which belong to the same subspace $V_1$.
When $x(0)$ is changed to other randomly generated values, the trajectories of the modulus values of the components of $x(t)$ vary. The values of $\xi$ vary too, but $\xi$ still belongs to $V_1$, and $\xi^T\bar{\xi}\,\mathrm{i}$ is still very close to $\lambda_1\mathrm{i}=0.6184\mathrm{i}$.
Clearly $\lambda_1^s\mathrm{i}$ is very close to $\lambda_1\mathrm{i}$, and $\xi$ is the corresponding eigenvector. This example verifies that it is effective to use RNN (7.21) to compute the eigenvalues and eigenvectors of real anti-symmetric matrices.
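As a cross-check of this example, the following minimal sketch (Python with NumPy, playing the same role as the Matlab functions used above; it is not the original simulation code) recomputes the ground-truth eigenvalues of the printed 7 × 7 anti-symmetric matrix and the inner products between the reported equilibrium vector ξ and the eigenvectors, so the subspace-membership argument can be reproduced numerically.

import numpy as np

# 7x7 anti-symmetric matrix A and equilibrium vector xi, copied from the example above
A = np.array([
    [ 0.0000, -0.0511,  0.2244, -0.0537,  0.0888, -0.1191,  0.1815],
    [ 0.0511,  0.0000,  0.1064,  0.2901, -0.0073,  0.0263,  0.2322],
    [-0.2244, -0.1064,  0.0000,  0.1406, -0.0206,  0.1402, -0.0910],
    [ 0.0537, -0.2901, -0.1406,  0.0000, -0.2708, -0.0640, -0.2373],
    [-0.0888,  0.0073,  0.0206,  0.2708,  0.0000,  0.1271,  0.1094],
    [ 0.1191, -0.0263, -0.1402,  0.0640, -0.1271,  0.0000, -0.1200],
    [-0.1815, -0.2322,  0.0910,  0.2373, -0.1094,  0.1200,  0.0000]])
xi = np.array([-0.1355-0.1563j, 0.0716-0.2914j, 0.2410-0.0716j, 0.2840+0.3073j,
               0.1556-0.2338j, 0.1076+0.1511j, 0.3334-0.1507j])

lam, mu = np.linalg.eig(A)                      # ground truth, as the Matlab call above
print("eigenvalues:", np.round(lam, 4))         # purely imaginary pairs, largest ~ +/-0.6184i
print("xi^T conj(xi) i =", xi @ xi.conj() * 1j) # network estimate, ~0.6182i

# Only the projection onto the eigenvector of +0.6184i should be non-negligible (xi in V1);
# the ordering and phases of NumPy's eigenvectors differ from Matlab's, so only the
# magnitudes of the inner products are meaningful here.
for k in range(7):
    print(lam[k], abs(xi @ mu[:, k].conj()))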

7.2.5. Section summary


Much research has focused on using neural networks to compute eigenvalues and eigenvectors of real symmetric matrices. In this section, a new neural network model was presented so that eigenvalues and eigenvectors of a real anti-symmetric matrix can be computed. The network was transformed into a differential equation, and the complex analytic solution of the equation was derived. Based on this solution, the convergence behavior of the neural network was analyzed. In order to examine the proposed neural network, a 7 × 7 real anti-symmetric matrix was randomly generated and used. The eigenvalue computed by the network was very close to the ground truth value, which was computed directly using Matlab functions. Although the computed eigenvector differed from the eigenvector associated with the ground truth eigenvalue, they both belong to the same subspace and are therefore equivalent. The network is thus effective for computing eigenvalues and eigenvectors of real anti-symmetric matrices.
It remains an open problem how to use this neural network to compute all the eigenvalues of a real anti-symmetric matrix. It is also unsolved
how to use neural networks to compute eigenvalues of matrices that are neither symmetric nor anti-symmetric. These problems inspire further research interest, and in Section 7.4 we will introduce some of our efforts on computing eigenvalues and eigenvectors of general real matrices using neural networks.

7.3. A Concise Recurrent Neural Network Computing the Largest Modulus Eigenvalues and Their Corresponding Eigenvectors of a Real Skew Matrix

As the extraction of eigenstructure has many applications, such as the fast retrieval of structural patterns [15] and handwritten digit recognition [16], many techniques have been proposed to accelerate the computation [4, 17]. There are also many works applying artificial neural networks (ANNs) to calculate eigenpairs, exploiting the concurrent and asynchronous running manner of ANNs [1, 2, 4–9, 11, 13, 14, 17–24].
Generally, the techniques can be categorized into three types. Firstly,
an energy function is constructed from the relations between eigenpairs and
the matrix, and a differential system which can be implemented by ANNs
is formed. At the equilibrium state of the system, one or more eigenpairs
is computed [19, 20]. Secondly, a special neural network is constructed,
originating from Oja’s algorithms to extract some eigenstructure. Most of
these kinds of methods [1, 2, 4–6, 8, 9, 13, 14, 18, 24] need the input matrix
to be symmetric, such as the autocorrelation matrix. Finally, a differential
system implemented by artificial recurrent neural networks is formulated
purely for calculating eigenstructure of a special matrix [21–23]. Ordinarily,
the computation burden of the first two types is heavy and the realization
circuit is complex. In the framework of the last type of technique, the neural
network can be flexibly designed. The concise expression of the differential
system is a significant factor affecting the neural network’s implementation
in hardware. In Section 7.2 we have introduced a recurrent neural network
adaptive to real anti-symmetric matrices. In this section, we will propose
an even more concise recurrent neural network (RNN) to complete the
calculation,

$$
\frac{\mathrm{d}v(t)}{\mathrm{d}t}=\left[A'-B(t)\right]v(t), \tag{7.58}
$$
where $t\ge 0$, $v(t)\in R^{2n}$, $A'=\begin{pmatrix} I_0 & A\\ -A & I_0\end{pmatrix}$, $B(t)=\begin{pmatrix} I\,U^n v(t) & -I\,U_n v(t)\\ I\,U_n v(t) & I\,U^n v(t)\end{pmatrix}$.
$I$ is an $n\times n$ unit matrix and $I_0$ is an $n\times n$ zero matrix; $U^n=(\underbrace{1,\cdots,1}_{n},\underbrace{0,\cdots,0}_{n})$ and $U_n=(\underbrace{0,\cdots,0}_{n},\underbrace{1,\cdots,1}_{n})$. $A$ is the real skew matrix. When $v(t)$ is taken as the states of the neurons, $[A'-B(t)]$ as the synaptic connection weights, and the activation functions as pure linear functions, Equation (7.58) describes a continuous time recurrent neural network.
Let $v(t)=\left(x^T(t)\;y^T(t)\right)^T$, $x(t)=(x_1(t),x_2(t),\cdots,x_n(t))^T\in R^n$ and $y(t)=(y_1(t),y_2(t),\cdots,y_n(t))^T\in R^n$. Rewrite Equation (7.58) as
$$
\begin{cases}
\dfrac{\mathrm{d}x(t)}{\mathrm{d}t}=Ay(t)-\sum_{j=1}^{n}x_j(t)\,x(t)+\sum_{j=1}^{n}y_j(t)\,y(t)\\[2mm]
\dfrac{\mathrm{d}y(t)}{\mathrm{d}t}=-Ax(t)-\sum_{j=1}^{n}y_j(t)\,x(t)-\sum_{j=1}^{n}x_j(t)\,y(t)
\end{cases}. \tag{7.59}
$$
Let
$$
z(t)=x(t)+\mathrm{i}y(t), \tag{7.60}
$$
where $\mathrm{i}$ denotes the imaginary unit. Equation (7.59) can be transformed into
$$
\frac{\mathrm{d}x(t)}{\mathrm{d}t}+\mathrm{i}\frac{\mathrm{d}y(t)}{\mathrm{d}t}=-A\mathrm{i}\left[x(t)+\mathrm{i}y(t)\right]-\sum_{j=1}^{n}\left[x_j(t)+y_j(t)\mathrm{i}\right]\left[x(t)+\mathrm{i}y(t)\right],
$$
that is,
$$
\frac{\mathrm{d}z(t)}{\mathrm{d}t}=-Az(t)\,\mathrm{i}-\sum_{j=1}^{n}z_j(t)\,z(t). \tag{7.61}
$$

From Equations (7.58), (7.59) and (7.61), we know that analyzing the convergence properties of Equation (7.61) is equivalent to analyzing those of RNN (7.58).
Actually, RNN (7.58) can easily be transformed into another neural network, with only the matrix $B$ changed to $B(t)=\begin{pmatrix} -I\,U^n v(t) & I\,U_n v(t)\\ -I\,U_n v(t) & -I\,U^n v(t)\end{pmatrix}$ while all the other components remain the same as in RNN (7.58). Similar to what will be discussed in this section, the theoretical analysis and numerical evaluation of this new form of neural network can be found in [25] in detail.

7.3.1. Preliminaries
We will use preliminaries similar to those of Section 7.2. Thus we only state the following lemmas; please refer back to Lemmas 7.2, 7.3 and 7.4 for their proofs.
Lemma 7.5. The eigenvalues of A are pure imaginary numbers or zero.

Lemma 7.6. If eigenvalue η i and µ construct an eigenpair of A, then −η i


and µ̄ also constitute an eigenpair of A.

Lemma 7.7. If λk i ̸= λm i, then µk ⊥µm , i.e. ⟨µk , µm ⟩ = 0, k, m ∈ [1, n].

Using Lemma 7.7 gives that if $\lambda_k\mathrm{i}\neq\lambda_m\mathrm{i}$, then $V_k\perp V_m$. When $\lambda_k\mathrm{i}=\lambda_m\mathrm{i}$, i.e. the algebraic multiplicity of $\lambda_k\mathrm{i}$ is more than one, two mutually orthogonal subspaces can be specially determined. Thus
$$
V_1\perp\cdots\perp V_n. \tag{7.62}
$$
Let $\xi\in C^n$ denote the equilibrium vector; when $\xi$ exists, there holds
$$
\xi=\lim_{t\to\infty}z(t). \tag{7.63}
$$
From (7.61), it follows that
$$
-A\xi\,\mathrm{i}=\sum_{j=1}^{n}\xi_j\,\xi. \tag{7.64}
$$
When $\xi$ is an eigenvector of $A$, let $\bar{\lambda}\mathrm{i}$ ($\bar{\lambda}\in R$) denote the associated eigenvalue; we have
$$
A\xi=\bar{\lambda}\mathrm{i}\,\xi. \tag{7.65}
$$
From (7.64) and (7.65), it follows that
$$
\bar{\lambda}=\sum_{j=1}^{n}\xi_j. \tag{7.66}
$$
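In practice, Equations (7.61) and (7.66) already suggest how the network is used: integrate (7.61) from a suitable initial state and read the eigenvalue off the sum of the components of the equilibrium vector. A minimal numerical sketch of this procedure is given below (Python with NumPy rather than the Matlab used for the chapter's own simulations; the matrix, the initial vector, the step size and the integration horizon are illustrative assumptions, and the conditions under which the trajectory converges rather than overflows are analysed in Section 7.3.3).

import numpy as np

# Integrate Equation (7.61), dz/dt = -A z i - (sum_j z_j) z, with forward Euler,
# then read the eigenvalue off via Equation (7.66).
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = (M - M.T) / 2                       # a real skew matrix: A^T = -A (illustrative)

z = 1j * rng.random(5)                  # purely imaginary start, as in the examples of 7.3.4
dt, steps = 1e-3, 100_000
for _ in range(steps):
    z = z + dt * (-1j * (A @ z) - z.sum() * z)

print("estimated eigenvalue :", z.sum() * 1j)     # Equation (7.66): eigenvalue = (sum_j xi_j) i
print("largest Im eigenvalue:", np.linalg.eigvals(A).imag.max(), "i")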

7.3.2. Analytic solution

Theorem 7.13. Let $S_k=\mu_k/|\mu_k|$ and let $z_k(t)$ denote the projection value of $z(t)$ onto $S_k$; then
$$
z_k(t)=\frac{z_k(0)\exp(\lambda_k t)}{1+\sum_{l=1}^{n}z_l(0)\int_0^t\exp(\lambda_l\tau)\,\mathrm{d}\tau} \tag{7.67}
$$
for $t\ge 0$.
Proof: By using the ideas in proving Theorem 7.9, this theorem can be proved.
7.3.3. Convergence analysis of RNN

Theorem 7.14. If $x_l(0)<0$ and $y_l(0)=0$ for $l=1,2,\cdots,n$, there must exist overflow.
Proof: When $y_l(0)=0$ for $l=1,2,\cdots,n$, from Theorem 7.1 it follows that
$$
\sum_{k=1}^{n}z_k(t)=\sum_{k=1}^{n}\frac{z_k(0)\exp(\lambda_k t)}{1+\sum_{l=1}^{n}z_l(0)\int_0^t\exp(\lambda_l\tau)\,\mathrm{d}\tau}
=\sum_{k=1}^{n}x_k(0)\exp(\lambda_k t)\,\frac{1}{1+\sum_{l=1}^{n}x_l(0)\int_0^t\exp(\lambda_l\tau)\,\mathrm{d}\tau}. \tag{7.68}
$$
When $x_l(0)<0$, whether $\lambda_l=0$ or $\lambda_l>0$, it follows that
$$
\sum_{l=1}^{n}x_l(0)\int_0^t\exp(\lambda_l\tau)\,\mathrm{d}\tau<0.
$$
Let $F(t)=1+\sum_{l=1}^{n}x_l(0)\int_0^t\exp(\lambda_l\tau)\,\mathrm{d}\tau$. Evidently $F(t)<1$ and $\dot F(t)=\sum_{l=1}^{n}x_l(0)\exp(\lambda_l t)<0$, that is, there must exist a time $t_g$ which leads to
$$
1+\sum_{l=1}^{n}x_l(0)\int_0^{t_g}\exp(\lambda_l\tau)\,\mathrm{d}\tau=0,
$$
i.e. at time $t_g$ RNN (7.58) enters into an overflow state. This theorem is proved.
Theorem 7.15. If $z(0)\in V_m$ and $x(0)>0$, then, when $\xi\neq 0$, $\mathrm{i}\sum_{k=1}^{n}\xi_k=\lambda_m\mathrm{i}$.
Proof: Using Lemma 7.3 gives that $V_m\perp V_k$ $(1\le k\le n,\ k\neq m)$, so $z_k(0)=0$ $(1\le k\le n,\ k\neq m)$. Using Theorems 7.1 and 7.8 gives that
$$
\sum_{k=1}^{n}\xi_k=\lim_{t\to\infty}\sum_{k=1}^{n}\frac{z_k(0)\exp(\lambda_k t)}{1+\sum_{l=1}^{n}z_l(0)\int_0^t\exp(\lambda_l\tau)\,\mathrm{d}\tau}
=\lim_{t\to\infty}\frac{z_m(0)\exp(\lambda_m t)}{1+z_m(0)\int_0^t\exp(\lambda_m\tau)\,\mathrm{d}\tau}. \tag{7.69}
$$
When $\lambda_m\le 0$, it easily follows that $\sum_{k=1}^{n}\xi_k=0$, i.e. $\xi=0$, which contradicts $\xi\neq 0$. So $\lambda_m>0$. In this instance, using Equation (7.69) gives that
$$
\sum_{k=1}^{n}\xi_k=\lim_{t\to\infty}\frac{z_m(0)}{\exp(-\lambda_m t)+\frac{1}{\lambda_m}z_m(0)\left[1-\exp(-\lambda_m t)\right]}=\lambda_m,
$$
i.e. $\mathrm{i}\sum_{k=1}^{n}\xi_k=\lambda_m\mathrm{i}$. This theorem is proved.
Theorem 7.16. If $z(0)\notin V_m$ $(m=1,\cdots,n)$ and $x_l(0)\ge 0$ $(l=1,2,\cdots,n)$, then $\xi\in V_1$ and $\sum_{k=1}^{n}\xi_k\,\mathrm{i}=\lambda_1\mathrm{i}$.
Proof: Since $z(0)\notin V_m$ and by Equation (7.62), $z_l(0)\neq 0$. From $x_l(0)\ge 0$ and Theorem 7.2, it can be concluded that there does not exist overflow. Using Theorems 7.1 and 7.8 gives that
$$
\begin{aligned}
\xi&=\lim_{t\to\infty}\frac{z_1(0)\exp(\lambda_1 t)S_1+z_2(0)\exp(\lambda_2 t)S_2+\cdots+z_n(0)\exp(\lambda_n t)S_n}{1+\sum_{l=1}^{n}z_l(0)\int_0^t\exp(\lambda_l\tau)\,\mathrm{d}\tau}\\
&=\lim_{t\to\infty}\frac{z_1(0)S_1+z_2(0)\exp\left[(\lambda_2-\lambda_1)t\right]S_2+\cdots+z_n(0)\exp\left[(\lambda_n-\lambda_1)t\right]S_n}{\exp(-\lambda_1 t)+\frac{1}{\lambda_1}z_1(0)\left[1-\exp(-\lambda_1 t)\right]+\sum_{l=2}^{n}z_l(0)\int_0^t\exp\left(\lambda_l\tau-\lambda_1 t\right)\mathrm{d}\tau}\\
&=\lambda_1 S_1\in V_1;
\end{aligned}
$$
obviously $\sum_{k=1}^{n}\xi_k\,\mathrm{i}=\lambda_1\mathrm{i}$.

Corollary 7.1. $\bar{\xi}$ is an eigenvector corresponding to $\lambda_2\mathrm{i}$.
Proof: Using Equation (7.64) gives that
$$
A\bar{\xi}\,\mathrm{i}=\sum_{j=1}^{n}\bar{\xi}_j\,\bar{\xi}. \tag{7.70}
$$
Since $\sum_{j=1}^{n}\bar{\xi}_j=\overline{\sum_{j=1}^{n}\xi_j}=\bar{\lambda}_1=\lambda_1$, $\lambda_2=-\lambda_1$ and Equation (7.70),
$$
A\bar{\xi}\,\mathrm{i}=-\lambda_2\bar{\xi},
$$
i.e.
$$
A\bar{\xi}=\lambda_2\mathrm{i}\,\bar{\xi}.
$$
So, this corollary can be drawn.

Theorem 7.17. If $\lambda_{l1}=\lambda_{l2}=\cdots=\lambda_{lp}\ge 0$, nonzero $z(0)\in V_{l1}\oplus V_{l2}\oplus\cdots\oplus V_{lp}$ and $x_{lk}(0)\ge 0$ $(k=1,2,\cdots,p)$, then $\xi\in V_{l1}\oplus V_{l2}\oplus\cdots\oplus V_{lp}$, and $\sum_{k=1}^{n}\xi_k\,\mathrm{i}=\lambda_{l1}\mathrm{i}=\lambda_{l2}\mathrm{i}=\cdots=\lambda_{lp}\mathrm{i}$ $(l1,l2,\cdots,lp\in[1,n])$.
Proof: Since $z(0)\in V_{l1}\oplus V_{l2}\oplus\cdots\oplus V_{lp}$, there exists at least one component $z_k\neq 0$ for $k\in(l1,l2,\cdots,lp)$, and $z_{k'}=0$ for $k'\notin(l1,l2,\cdots,lp)$. Let $\lambda_l$ denote $\lambda_{l1}=\lambda_{l2}=\cdots=\lambda_{lp}$. Using Theorem 7.1 and Equation (7.63), it gives that
$$
\begin{aligned}
\sum_{k=1}^{n}\xi_k&=\lim_{t\to\infty}\sum_{k=1}^{n}\frac{z_k(0)\exp(\lambda_k t)}{1+\sum_{l=1}^{n}z_l(0)\int_0^t\exp(\lambda_l\tau)\,\mathrm{d}\tau}\\
&=\lim_{t\to\infty}\frac{\sum_{k\in(l1,\cdots,lp)}z_k(0)\exp(\lambda_k t)+\sum_{k\notin(l1,\cdots,lp)}z_k(0)\exp(\lambda_k t)}{1+\sum_{k\in(l1,\cdots,lp)}z_k(0)\int_0^t\exp(\lambda_k\tau)\,\mathrm{d}\tau+\sum_{k\notin(l1,\cdots,lp)}z_k(0)\int_0^t\exp(\lambda_k\tau)\,\mathrm{d}\tau}\\
&=\lim_{t\to\infty}\frac{\sum_{k\in(l1,\cdots,lp)}z_k(0)\exp(\lambda_k t)}{1+\sum_{k\in(l1,\cdots,lp)}z_k(0)\int_0^t\exp(\lambda_k\tau)\,\mathrm{d}\tau}.
\end{aligned} \tag{7.71}
$$
If $\lambda_l=0$, using Equation (7.71) gives that
$$
\sum_{k=1}^{n}\xi_k=\lim_{t\to\infty}\frac{\sum_{k\in(l1,\cdots,lp)}z_k(0)}{1+\sum_{k\in(l1,\cdots,lp)}z_k(0)\,t}=0=\lambda_l,
$$
and if $\lambda_l>0$, Equation (7.71) gives that
$$
\sum_{k=1}^{n}\xi_k=\lim_{t\to\infty}\frac{\sum_{k\in(l1,\cdots,lp)}z_k(0)\exp(\lambda_l t)}{1+\sum_{k\in(l1,\cdots,lp)}z_k(0)\int_0^t\exp(\lambda_k\tau)\,\mathrm{d}\tau}
=\lim_{t\to\infty}\frac{\sum_{k\in(l1,\cdots,lp)}z_k(0)}{\exp(-\lambda_l t)+\frac{1}{\lambda_l}\sum_{k\in(l1,\cdots,lp)}z_k(0)\left[1-\exp(-\lambda_l t)\right]}
=\lambda_l. \tag{7.72}
$$
From Equations (7.71) and (7.72), this theorem is proved.

7.3.4. Simulation
In order to evaluate RNN (7.58), we provide two examples here. One will
demonstrate the effectiveness of RNN (7.58), and the other will illustrate
that the convergence behavior remains good for large dimension matrices.
Example 1: let

A = [  0       -0.2564   0.2862  -0.0288   0.4294
       0.2564   0        -0.1263  -0.1469  -0.1567
      -0.2862   0.1263    0       -0.2563  -0.1372
       0.0288   0.1469    0.2563   0       -0.0657
      -0.4294   0.1567    0.1372   0.0657   0      ].

The equilibrium vector is
$$
\xi=(0.1587-0.5258\mathrm{i}\;\;-0.3023+0.1154\mathrm{i}\;\;0.2279+0.2702\mathrm{i}\;\;0.0998+0.0171\mathrm{i}\;\;0.4470+0.1231\mathrm{i})^T
$$
and
$$
\lambda_1\mathrm{i}=\mathrm{i}\sum_{k=1}^{5}\xi_k=0.6311\,\mathrm{i}.
$$

The trajectories of the real parts of the components of $z(t)$ are illustrated in Fig. 7.13 and the trajectory of $\sum_{k=1}^{n}z_k(t)$ is shown in Fig. 7.14.

Fig. 7.13. The trajectories of real part of Zk (t).

The largest modulus true eigenvalues directly computed by Matlab as the ground truth are $\bar{\lambda}_1\mathrm{i}=0.6310\,\mathrm{i}$ and $\bar{\lambda}_2\mathrm{i}=-0.6310\,\mathrm{i}$. The corresponding eigenvectors are
Fig. 7.14. The trajectories of the real part and imaginary part of $\sum_{i=1}^{5}z_i(t)$.

 
S1 = [  0.6315                 S3 = [  0.0024 + 0.2991i        S5 = [  0.1533
       -0.2274 - 0.2944i               0.1267 + 0.2947i                0.7193
       -0.2217 + 0.3406i              -0.0802 + 0.5391i               -0.2749
        0.0143 + 0.1155i               0.6904                         -0.1397
        0.0130 + 0.5329i ],           -0.0284 - 0.1817i ],             0.6033 ],

with $S_2=\bar{S}_1$ and $S_4=\bar{S}_3$.
When $\lambda_1\mathrm{i}$ is compared with $\bar{\lambda}_1\mathrm{i}$, it can be seen that their difference is trivial. Therefore, RNN (7.58) is effective for computing the largest modulus eigenvalues $\bar{\lambda}_1\mathrm{i}$ and $\bar{\lambda}_2\mathrm{i}$. Since $\xi^T\bar{S}_2=\xi^T\bar{S}_3=\xi^T\bar{S}_4=\xi^T\bar{S}_5=0$ and $\xi^T\bar{S}_1=0.2513+0.8327\,\mathrm{i}$, $\xi$ is equivalent to $S_1$. As $\bar{\xi}^T\bar{S}_1=\bar{\xi}^T\bar{S}_3=\bar{\xi}^T\bar{S}_4=\bar{\xi}^T\bar{S}_5=0$ and $\bar{\xi}^T\bar{S}_2=0.2513-0.8327\,\mathrm{i}$, $\bar{\xi}$ is equivalent to $S_2$. Therefore the largest modulus eigenvalues $(\bar{\lambda}_1\mathrm{i},\bar{\lambda}_2\mathrm{i})$ and their corresponding eigenvectors are computed.
The above results are obtained when the initial vector is $z(0)=(0.1520\mathrm{i}\;\;0.2652\mathrm{i}\;\;0.1847\mathrm{i}\;\;0.0139\mathrm{i}\;\;0.2856\mathrm{i})^T$. When the initial vector is replaced with other random values, the results change only trivially and remain very close to the corresponding ground truth values.
Example 2: Let $A\in R^{85\times 85}$, $A(j,i)=-A(i,j)=-(\sin(i)+\sin(j))/2$ and $z(0)=(\sin(1)\mathrm{i}\;\;\sin(2)\mathrm{i}\;\;\cdots\;\;\sin(85)\mathrm{i})^T$. The calculated largest modulus eigenvalue is $19.6672\mathrm{i}$, which is very close to the corresponding
Fig. 7.15. The trajectories of real part of zk (t).

Fig. 7.16. The trajectories of real and imaginary parts of $\sum_{i=1}^{85}z_i(t)$.

ground truth value 19.6634i. Convergence behaviors are shown in Figs 7.15
and 7.16. From these figures, we can see that the iteration numbers do
not increase with the dimension, and that this method is still effective even
when the dimension number is very large.
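Example 2 is easy to reproduce numerically; a minimal sketch follows (Python/NumPy rather than the Matlab used in the chapter; the forward-Euler step size, the number of steps, and the assumption that the positive entries of the skew matrix sit in the upper triangle are all illustrative choices of this sketch, not values given in the text). It builds the 85 × 85 skew matrix and purely imaginary initial vector defined above, integrates Equation (7.61), and compares $\mathrm{i}\sum_k z_k$ with a direct eigensolver.

import numpy as np

n = 85
i_idx = np.arange(1, n + 1)
# A(j,i) = -A(i,j) = -(sin(i) + sin(j))/2, as in Example 2 (1-based indices)
A = 0.5 * (np.sin(i_idx)[:, None] + np.sin(i_idx)[None, :])
A = np.triu(A, 1)            # keep the (assumed) positive upper triangle ...
A = A - A.T                  # ... and make the matrix skew: A(j,i) = -A(i,j)

z = 1j * np.sin(i_idx)       # z(0) = (sin(1)i, sin(2)i, ..., sin(85)i)^T

dt = 1e-3                    # forward-Euler integration of Equation (7.61); dt is an assumption
for _ in range(50_000):
    z = z + dt * (-1j * (A @ z) - z.sum() * z)

print("RNN estimate :", z.sum() * 1j)            # ~19.67i, as reported above
print("ground truth :", np.abs(np.linalg.eigvals(A).imag).max(), "i")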

7.3.5. Comparison with other methods and discussions


When RNN (7.58) is to be implemented in hardware, some discretization techniques must be considered [26]. In the Matlab environment, operating on a single central processing unit, the parallel-running feature of an ANN is not exhibited, as the nodes of the network cannot run
synchronously. But for the standard power method [10, 17], which is serial in nature, high performance can be achieved on a serial running platform. To compare RNN (7.58) with the standard power method, let $A\in R^{15\times 15}$, $A(j,i)=-A(i,j)=-(\cos(i)+\sin(j))/2$ for $i\neq j$ and $A(i,i)=0$. The simulated results corresponding to RNN (7.58) and the standard power method are shown in Figs 7.17, 7.18, 7.19 and 7.20 respectively. From Fig. 7.17, we can see that RNN (7.58) goes into a unique equilibrium state quickly, while the standard power method falls into a cyclic procedure, which can be seen in Fig. 7.19. The eigenvalue calculated by RNN (7.58) is 4.0695i, and that by the standard power method is 4.0693i, both of which are very close to the ground truth value 4.0693i. The convergence behaviors of these two methods are shown in Figs 7.18 and 7.20, respectively. Also from Figs 7.18 and 7.20, we can see that the convergence rates of the two methods are similar. After around five iterations, the calculated values approach the ground truth eigenvalue.
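For reference, one common way of applying the power method to a real skew matrix, and the variant assumed in the sketch below (Python/NumPy; this is not necessarily the exact implementation behind Figs 7.19 and 7.20, and the choice of which triangle of the matrix carries the positive entries, the start vector and the iteration count are assumptions), is to iterate $v\leftarrow Av/\|Av\|$ and read the dominant eigenvalue magnitude from the norm ratio $\|Av\|/\|v\|$: the iterate itself keeps rotating in the dominant invariant subspace, which is the cycling behaviour seen in Fig. 7.19, while the norm ratio settles close to the largest modulus.

import numpy as np

# 15x15 test matrix of the comparison: A(j,i) = -A(i,j) = -(cos(i)+sin(j))/2, zero diagonal
n = 15
i_idx = np.arange(1, n + 1)
A = 0.5 * (np.cos(i_idx)[:, None] + np.sin(i_idx)[None, :])
A = np.triu(A, 1) - np.triu(A, 1).T          # enforce skew-symmetry, A(i,i) = 0

v = np.ones(n)                               # arbitrary start vector (assumption)
for k in range(50):
    w = A @ v
    est = np.linalg.norm(w) / np.linalg.norm(v)   # converges to the largest |eigenvalue|
    v = w / np.linalg.norm(w)                     # the direction itself keeps cycling

print("power-method estimate:", est, "(eigenvalue ~ est * i)")
print("ground truth         :", np.abs(np.linalg.eigvals(A)).max())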

Fig. 7.17. The trajectories of the real parts of zi (t) for 1 ≤ i ≤ 15 by RNN (7.58).

With an autocorrelation matrix, Fa-Long Luo introduced two algorithms: one calculating the principal eigenvectors [4], and the other extracting the eigenvectors corresponding to the smallest eigenvalues
other extracting the eigenvectors corresponding to the smallest eigenvalues
[2]. He also proposed a real-time neural computation approach to extract
the largest eigenvalue of a positive matrix [9]. Compared with these
references, the electronic circuits of RNN (7.58) may be more complex
because it considers the question in complex space, but is more suitable for
a real skew matrix. As for reference [9], if the eigenvalue was a complex
number, the computation model would fail. The approach proposed by
Fig. 7.18. The trajectory of $\sum_{i=1}^{15}z_i(t)$ by RNN (7.58).

Fig. 7.19. The trajectories of the components of z(t) by the standard power method.

references [13, 14] involves a neural network model which is different from RNN (7.58), and the scheme for extracting eigen-parameters in these two references would fail when the matrix had repeated eigenvalues and the initial state was not specially restricted. References [8, 21, 22] are only suitable for a real symmetric matrix. RNN (7.58) is more concise than the model in reference [23] since RNN (7.58) does not include the computation of $|z(t)|^2$. Briefly speaking, compared with the published methods, ours offers novelties in several respects.
Fig. 7.20. The trajectory of the variable which converges to the largest modulus
eigenvalue by the standard power method.

7.3.6. Section summary


A new concise recurrent neural network model for computing the largest modulus eigenvalues and the corresponding eigenvectors of a real anti-symmetric matrix was proposed in this section. After the model was given, it was equivalently transformed into a complex dynamical system depicted by a complex differential equation. With the analytic solution of the equation, the convergence behaviors were analyzed in detail. Simulation results verified the effectiveness of the method for calculating the eigenvalues and corresponding eigenvectors. The obtained eigenvectors were proved equivalent to the corresponding ground truth. Compared with other neural networks used in the same field, the proposed model is concise and adapted to real anti-symmetric matrices, with good convergence properties.

7.4. A Recurrent Neural Network Computing the Largest Imaginary or Real Part of Eigenvalues of a General Real Matrix

In the first three sections of this chapter, we have introduced how to use neural networks to solve the problems of computing the eigenpairs of real symmetric or anti-symmetric matrices, which have been popular research topics in the past few years. Reference [27] generalized the well-known Oja network to non-symmetric matrices. Liu et al. [21] improved the model in [8] and introduced a simpler model to accomplish analogous calculations. Reference [23] introduced an RNN
model to compute the largest modulus eigenvalues and their corresponding eigenvectors of an anti-symmetric matrix. It is found that most of the recent efforts have mainly been focused on solving eigenpairs of real symmetric or anti-symmetric matrices. Although [27] involved non-symmetric matrices, its section “The nonlinear homogeneous case” has the assumption of $\alpha(w)\in R$, which requires the matrix to have only real eigenvalues. Following a search of the relevant literature, we were unable to locate a neural network-based method to calculate eigenvalues of a general real matrix, whether the eigenvalue is a real or general complex number. Thus, in this section, a recurrent neural network-based approach will be proposed (Equation (7.73)) to estimate the largest modulus of eigenvalues of a general real matrix.
$$
\frac{\mathrm{d}v(t)}{\mathrm{d}t}=A'v(t)-|v(t)|^2 v(t), \tag{7.73}
$$
where $t\ge 0$, $v(t)\in R^{2n}$, $A'=\begin{pmatrix}0 & A\\ -A & 0\end{pmatrix}$, and $A$ is the general real matrix requiring calculation. $|v(t)|$ denotes the modulus of $v(t)$. When $v(t)$ is seen as the states of the neurons, with $\left[A'-|v(t)|^2 U\right]$ ($U$ denotes an identity matrix of suitable dimension) being regarded as the synaptic connection weights and the activation functions being assumed to be pure linear functions, Equation (7.73) describes a continuous time recurrent neural network.
Let v (t) = xT (t) y T (t) , x (t) = (x1 (t) , x2 (t) , · · · , xn (t)) ∈ Rn
T
and y (t) = (y1 (t) , y2 (t) , · · · , yn (t)) ∈ Rn . From Equation (7.73), it
follows that
 n [ ]
 ∑
 dx(t)
 dt = Ay (t) − x2j (t) + yj2 (t) x (t)
j=1
n [
∑ ] . (7.74)

 dy(t)
−Ax − x2j (t) + yj2 (t) y (t)
 dt = (t)
j=1

Let

z (t) = x (t) + iy (t) , (7.75)

with i denoting the imaginary unit, it can be easily followed from Equation
(7.74) that

dx(t) dy(t) ∑n
[ 2 ]
+i = −Ai [x (t) + iy (t)] − xj (t) + yj2 (t) [x (t) + iy (t)] ,
dt dt j=1
i.e.
$$
\frac{\mathrm{d}z(t)}{\mathrm{d}t}=-Az(t)\,\mathrm{i}-z^T(t)\bar{z}(t)\,z(t), \tag{7.76}
$$
where z̄ (t) denotes the complex conjugate vector of z (t). Obviously,
Equation (7.76) is a complex differential system. A set of ordinary
differential equations is just a model which may approximate the real
behavior of some neural networks, although there are differences between
them. In the following subsections, we will discuss the convergence
properties of Equation (7.76) instead of RNN (7.73).

7.4.1. Analytic expression of $|z(t)|^2$
All eigenvalues of $A$ are denoted as $\lambda_1^R+\lambda_1^I\mathrm{i},\lambda_2^R+\lambda_2^I\mathrm{i},\cdots,\lambda_n^R+\lambda_n^I\mathrm{i}$ ($\lambda_k^R,\lambda_k^I\in R$, $k=1,2,\cdots,n$), and the corresponding complex eigenvectors are denoted as $\mu_1,\cdots,\mu_n$.
With any general real matrix $A$, there are two cases for $\mu_1,\cdots,\mu_n$:
1) when the rank of $A$ is deficient, some of $\lambda_1^R+\lambda_1^I\mathrm{i},\lambda_2^R+\lambda_2^I\mathrm{i},\cdots,\lambda_n^R+\lambda_n^I\mathrm{i}$ may be zero. When $\lambda_j^R+\lambda_j^I\mathrm{i}=0$, $\mu_j$ can be chosen randomly, ensuring that $\mu_1,\cdots,\mu_n$ construct a basis in $C^n$;
2) when $A$ is of full rank, $\mu_1,\cdots,\mu_n$ are decided by $A$. Although they may not be orthogonal to each other, they can still construct a basis in $C^n$.
Let $S_k=\mu_k/|\mu_k|$; obviously $S_1,\cdots,S_n$ construct a normalized basis in $C^n$.

Theorem 7.18. Let $z_k(t)=x_k(t)+\mathrm{i}\,y_k(t)$ denote the projection value of $z(t)$ onto $S_k$. The analytic expression of $|z(t)|^2$ is
$$
|z(t)|^2=\frac{\sum_{k=1}^{n}\exp\left(2\lambda_k^I t\right)\left[x_k^2(0)+y_k^2(0)\right]}{1+2\sum_{j=1}^{n}\left[x_j^2(0)+y_j^2(0)\right]\int_0^t\exp\left(2\lambda_j^I\tau\right)\mathrm{d}\tau} \tag{7.77}
$$
for $t\ge 0$.
Proof: the proof of this theorem is similar to that of Theorem 7.9 and is omitted here.

7.4.2. Convergence analysis


If an equilibrium vector of RNN (7.73) exists, let ξ denote it, and there
exists
March 2, 2012 11:56 World Scientific Review Volume - 9in x 6in 01˙ws-rv9x6/Chp. 7

Solving Eigen-problems of Matrices by Neural Networks 169

ξ = lim z (t) . (7.78)


t→∞

Theorem 7.19. If each eigenvalue is a real number, then $|\xi|=0$.
Proof: From Theorem 7.1, we know
$$
|z|=\sqrt{\frac{\sum_{k=1}^{n}\exp\left(2\lambda_k^I t\right)\left[x_k^2(0)+y_k^2(0)\right]}{1+2\sum_{j=1}^{n}\left[x_j^2(0)+y_j^2(0)\right]\int_0^t\exp\left(2\lambda_j^I\tau\right)\mathrm{d}\tau}}. \tag{7.79}
$$
Thus,
$$
|\xi|=\lim_{t\to\infty}|z(t)|=\lim_{t\to\infty}\sqrt{\frac{\sum_{k=1}^{n}\exp\left(2\lambda_k^I t\right)\left[x_k^2(0)+y_k^2(0)\right]}{1+2\sum_{j=1}^{n}\left[x_j^2(0)+y_j^2(0)\right]\int_0^t\exp\left(2\lambda_j^I\tau\right)\mathrm{d}\tau}}. \tag{7.80}
$$
Since each eigenvalue is real,
$$
\lambda_1^I=\lambda_2^I=\cdots=\lambda_n^I=0. \tag{7.81}
$$
From Equations (7.80) and (7.81), it follows that
$$
|\xi|=\lim_{t\to\infty}\sqrt{\frac{\sum_{k=1}^{n}\left[x_k^2(0)+y_k^2(0)\right]}{1+2\sum_{j=1}^{n}\left[x_j^2(0)+y_j^2(0)\right]t}}=0.
$$
This theorem is proved. It implies that if a matrix has only real eigenvalues, RNN (7.73) will converge to the zero point, independently of the initial complex vector.

Theorem 7.20. Denote $\lambda_m^I=\max_{1\le k\le n}\lambda_k^I$. If $\lambda_m^I>0$, then $\xi^T\bar{\xi}=\lambda_m^I$.
Proof: Using Equation (7.78) and Theorem 7.1 gives that
$$
\xi^T\bar{\xi}=\lim_{t\to\infty}|z(t)|^2=\lim_{t\to\infty}\frac{\sum_{k=1}^{n}\exp\left(2\lambda_k^I t\right)\left[x_k^2(0)+y_k^2(0)\right]}{1+2\sum_{j=1}^{n}\left[x_j^2(0)+y_j^2(0)\right]\int_0^t\exp\left(2\lambda_j^I\tau\right)\mathrm{d}\tau},
$$
i.e.
$$
\begin{aligned}
\xi^T\bar{\xi}&=\lim_{t\to\infty}\frac{\exp\left(2\lambda_m^I t\right)\left[x_m^2(0)+y_m^2(0)\right]+\sum_{k=1,k\neq m}^{n}\exp\left(2\lambda_k^I t\right)\left[x_k^2(0)+y_k^2(0)\right]}{1+2\left[x_m^2(0)+y_m^2(0)\right]\int_0^t\exp\left(2\lambda_m^I\tau\right)\mathrm{d}\tau+2\sum_{j=1,j\neq m}^{n}\left[x_j^2(0)+y_j^2(0)\right]\int_0^t\exp\left(2\lambda_j^I\tau\right)\mathrm{d}\tau}\\
&=\lim_{t\to\infty}\frac{\left[x_m^2(0)+y_m^2(0)\right]+\sum_{k=1,k\neq m}^{n}\exp\left[2\left(\lambda_k^I-\lambda_m^I\right)t\right]\left[x_k^2(0)+y_k^2(0)\right]}{\exp\left(-2\lambda_m^I t\right)+\frac{1}{\lambda_m^I}\left[x_m^2(0)+y_m^2(0)\right]\left[1-\exp\left(-2\lambda_m^I t\right)\right]+2\sum_{j=1,j\neq m}^{n}\left[x_j^2(0)+y_j^2(0)\right]\int_0^t\exp\left(2\lambda_j^I\tau-2\lambda_m^I t\right)\mathrm{d}\tau}\\
&=\lambda_m^I.
\end{aligned}
$$
This theorem is proved. From this theorem, we know that when the maximal imaginary part of the eigenvalues is positive, RNN (7.73) will converge to a nonzero equilibrium vector. In addition, the square modulus of this vector is equal to the largest imaginary part.
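A minimal numerical illustration of Theorem 7.20 is sketched below (Python/NumPy; the matrix, the forward-Euler step size and the horizon are assumptions of the sketch, and the chapter's own experiments use Matlab). It integrates Equation (7.76) for a random general real matrix and compares $\xi^T\bar{\xi}$ with the largest imaginary part of the eigenvalues; how close the two numbers get depends on the eigenvalue gaps and the integration horizon, and if all eigenvalues happen to be real the state simply decays to zero, consistent with Theorem 7.19.

import numpy as np

rng = np.random.default_rng(1)
n = 7
A = rng.random((n, n))                  # a general real matrix (illustrative)

z = rng.random(n) + 1j * rng.random(n)  # random complex initial state
dt, steps = 1e-3, 500_000
for _ in range(steps):
    z = z + dt * (-1j * (A @ z) - (z @ z.conj()) * z)   # Equation (7.76)

print("xi^T conj(xi)      :", (z @ z.conj()).real)
print("max Im(eigenvalues):", np.linalg.eigvals(A).imag.max())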
Theorem 7.21. If $A'$ is replaced by $\begin{pmatrix}A & 0\\ 0 & A\end{pmatrix}$, then
$$
|z(t)|^2=\frac{\sum_{k=1}^{n}\exp\left(2\lambda_k^R t\right)\left[x_k^2(0)+y_k^2(0)\right]}{1+2\sum_{j=1}^{n}\left[x_j^2(0)+y_j^2(0)\right]\int_0^t\exp\left(2\lambda_j^R\tau\right)\mathrm{d}\tau}.
$$
Proof: When $A'=\begin{pmatrix}A & 0\\ 0 & A\end{pmatrix}$, in a similar way to the derivation of Equation (7.76), RNN (7.73) is transformed into
$$
\frac{\mathrm{d}z(t)}{\mathrm{d}t}=Az(t)-z^T(t)\bar{z}(t)\,z(t). \tag{7.82}
$$
Using the notations of $x_k(t)$ and $y_k(t)$, we have
$$
\frac{\mathrm{d}}{\mathrm{d}t}x_k(t)=\lambda_k^R x_k(t)-\lambda_k^I y_k(t)-\sum_{j=1}^{n}\left[x_j^2(t)+y_j^2(t)\right]x_k(t), \tag{7.83}
$$
$$
\frac{\mathrm{d}}{\mathrm{d}t}y_k(t)=\lambda_k^I x_k(t)+\lambda_k^R y_k(t)-\sum_{j=1}^{n}\left[x_j^2(t)+y_j^2(t)\right]y_k(t). \tag{7.84}
$$
From Equations (7.83) and (7.84), it follows that
$$
\frac{\mathrm{d}}{\mathrm{d}t}\left[x_k^2(t)+y_k^2(t)\right]=2\lambda_k^R\left[x_k^2(t)+y_k^2(t)\right]-2\sum_{j=1}^{n}\left[x_j^2(t)+y_j^2(t)\right]\left[x_k^2(t)+y_k^2(t)\right]. \tag{7.85}
$$
The remaining steps follow the proof of Theorem 7.9.
This theorem provides a way to extract the maximal real part of all eigenvalues through rearranging the connection weights.
Theorem 7.22. Let $\lambda_m^R=\max_{1\le k\le n}\lambda_k^R$. When $A'$ is replaced by $\begin{pmatrix}A & 0\\ 0 & A\end{pmatrix}$, if $\lambda_m^R\le 0$, then $\xi^T\bar{\xi}=0$; if $\lambda_m^R>0$, then $\xi^T\bar{\xi}=\lambda_m^R$.
Proof: Using Equation (7.78) and Theorem 7.21 gives that
$$
\xi^T\bar{\xi}=\lim_{t\to\infty}|z(t)|^2=\lim_{t\to\infty}\frac{\sum_{k=1}^{n}\exp\left(2\lambda_k^R t\right)\left[x_k^2(0)+y_k^2(0)\right]}{1+2\sum_{j=1}^{n}\left[x_j^2(0)+y_j^2(0)\right]\int_0^t\exp\left(2\lambda_j^R\tau\right)\mathrm{d}\tau}. \tag{7.86}
$$
From Equation (7.86), if $\lambda_m^R<0$, it easily follows that
$$
\xi^T\bar{\xi}=0; \tag{7.87}
$$
if $\lambda_m^R=0$, it follows that
$$
\begin{aligned}
\xi^T\bar{\xi}&=\lim_{t\to\infty}\frac{\exp\left(2\lambda_m^R t\right)\left[x_m^2(0)+y_m^2(0)\right]+\sum_{k=1,k\neq m}^{n}\exp\left(2\lambda_k^R t\right)\left[x_k^2(0)+y_k^2(0)\right]}{1+2\left[x_m^2(0)+y_m^2(0)\right]\int_0^t\exp\left(2\lambda_m^R\tau\right)\mathrm{d}\tau+2\sum_{j=1,j\neq m}^{n}\left[x_j^2(0)+y_j^2(0)\right]\int_0^t\exp\left(2\lambda_j^R\tau\right)\mathrm{d}\tau}\\
&=\lim_{t\to\infty}\frac{\left[x_m^2(0)+y_m^2(0)\right]+\sum_{k=1,k\neq m}^{n}\exp\left(2\lambda_k^R t\right)\left[x_k^2(0)+y_k^2(0)\right]}{1+2\left[x_m^2(0)+y_m^2(0)\right]t+2\sum_{j=1,j\neq m}^{n}\left[x_j^2(0)+y_j^2(0)\right]\int_0^t\exp\left(2\lambda_j^R\tau\right)\mathrm{d}\tau}\\
&=0; \tag{7.88}
\end{aligned}
$$
and if $\lambda_m^R>0$, following the deductive procedure of Theorem 7.3, it follows that
$$
\xi^T\bar{\xi}=\lambda_m^R. \tag{7.89}
$$
From Equations (7.87), (7.88) and (7.89), the theorem is proved. This theorem indicates that when the largest real part of all eigenvalues is positive, the slightly changed RNN (7.73) converges to a nonzero equilibrium vector, the square modulus of which is equal to the maximal real part.
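The real-part variant can be sketched in the same way (again Python/NumPy with an illustrative matrix, step size and horizon): Equation (7.82) is integrated and $\xi^T\bar{\xi}$ is compared with the largest real part of the eigenvalues.

import numpy as np

rng = np.random.default_rng(2)
n = 7
A = rng.random((n, n))                  # general real matrix (illustrative)

z = rng.random(n) + 1j * rng.random(n)
dt, steps = 1e-3, 100_000
for _ in range(steps):
    z = z + dt * (A @ z - (z @ z.conj()) * z)           # Equation (7.82)

print("xi^T conj(xi)      :", (z @ z.conj()).real)
print("max Re(eigenvalues):", np.linalg.eigvals(A).real.max())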
7.4.3. Simulations and discussions


To evaluate the method, two examples are given. Example 1 uses a 7×7 matrix to show the effectiveness. A 50×50 and a 100×100 matrix are exploited in Example 2 to test the method when the dimensionality is very large. The simulation platform is Matlab.
Example 1: A is randomly generated as

A = [ 0.1347  0.0324  0.8660  0.8636  0.6390  0.1760  0.4075
      0.0225  0.7339  0.2542  0.5676  0.6690  0.0020  0.4078
      0.2622  0.5365  0.5695  0.9805  0.7721  0.7902  0.0527
      0.1165  0.2760  0.1593  0.7918  0.3798  0.5136  0.9418
      0.0693  0.3685  0.5944  0.1526  0.4416  0.2132  0.1500
      0.8529  0.0129  0.3311  0.8330  0.4831  0.1034  0.3844
      0.1803  0.8892  0.6586  0.1919  0.6081  0.1573  0.3111 ].
The eigenvalues directly computed by Matlab functions are $\lambda_1=2.9506$, $\lambda_2=0.7105$, $\lambda_3=-0.3799+0.4808\mathrm{i}$, $\lambda_4=\bar{\lambda}_3$, $\lambda_5=0.0286+0.5397\mathrm{i}$, $\lambda_6=\bar{\lambda}_5$ and $\lambda_7=0.1274$, so $\lambda_m^I=0.5397$ and $\lambda_m^R=2.9506$.
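These reference values can be reproduced with any standard eigensolver; a short check in Python/NumPy (assumed here purely for illustration, the chapter itself uses Matlab functions) recomputes $\lambda_m^I$ and $\lambda_m^R$ from the matrix printed above.

import numpy as np

A = np.array([
    [0.1347, 0.0324, 0.8660, 0.8636, 0.6390, 0.1760, 0.4075],
    [0.0225, 0.7339, 0.2542, 0.5676, 0.6690, 0.0020, 0.4078],
    [0.2622, 0.5365, 0.5695, 0.9805, 0.7721, 0.7902, 0.0527],
    [0.1165, 0.2760, 0.1593, 0.7918, 0.3798, 0.5136, 0.9418],
    [0.0693, 0.3685, 0.5944, 0.1526, 0.4416, 0.2132, 0.1500],
    [0.8529, 0.0129, 0.3311, 0.8330, 0.4831, 0.1034, 0.3844],
    [0.1803, 0.8892, 0.6586, 0.1919, 0.6081, 0.1573, 0.3111]])

lam = np.linalg.eigvals(A)
print("lambda_m^I =", lam.imag.max())   # ~0.5397
print("lambda_m^R =", lam.real.max())   # ~2.9506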

Fig. 7.21. The trajectories of $|z_k(t)|$ in searching $\bar{\lambda}_m^I$ when $n=7$.

When the initial vector is
$$
z(0)=[0.0506+0.0508\mathrm{i}\;\;0.2690+0.1574\mathrm{i}\;\;0.0968+0.1924\mathrm{i}\;\;0.2202+0.0049\mathrm{i}\;\;0.1233+0.2511\mathrm{i}\;\;0.1199+0.2410\mathrm{i}\;\;0.1517+0.2093\mathrm{i}]^T,
$$
we get the equilibrium vector
$$
\xi=[0.1335-0.1489\mathrm{i}\;\;0.0204+0.1250\mathrm{i}\;\;0.3471+0.0709\mathrm{i}\;\;-0.1727+0.3771\mathrm{i}\;\;-0.0531-0.2687\mathrm{i}\;\;-0.0719-0.0317\mathrm{i}\;\;-0.0970-0.3091\mathrm{i}]^T.
$$
Fig. 7.22. The trajectory of $|z(t)|^2$ in searching $\bar{\lambda}_m^I$ when $n=7$.

Hence the computed maximum imaginary part is
$$
\bar{\lambda}_m^I=\xi^T\bar{\xi}=0.5397.
$$
Fig. 7.23. The trajectories of $|z_k(t)|$ in searching $\bar{\lambda}_m^R$ when $n=7$.
When $\lambda_m^I$ is compared with $\bar{\lambda}_m^I$, it can easily be seen that they are very close. The trajectories of $|z_k(t)|$ $(k=1,2,\cdots,7)$ and of $z^T(t)\bar{z}(t)$, which approaches $\lambda_m^I$, are shown in Figs 7.21 and 7.22.
Fig. 7.24. The trajectory of $|z(t)|^2$ in searching $\bar{\lambda}_m^R$ when $n=7$.

Fig. 7.25. The trajectories of $|z_k(t)|$ in searching $\bar{\lambda}_m^I$ when $n=50$.

When $A'=\begin{pmatrix}A & 0\\ 0 & A\end{pmatrix}$, the computed maximum real part is
$$
\bar{\lambda}_m^R=\left|(0.4981+0.4949\mathrm{i}\;\;0.3701+0.3678\mathrm{i}\;\;0.6003+0.5965\mathrm{i}\;\;0.4773+0.4742\mathrm{i}\;\;0.3059+0.3039\mathrm{i}\;\;0.4719+0.4689\mathrm{i}\;\;0.4418+0.4390\mathrm{i})^T\right|^2=2.9504.
$$
When $\bar{\lambda}_m^R$ is compared with $\lambda_m^R$, the absolute difference value is
$$
\Delta\lambda_m^R=\left|\lambda_m^R-\bar{\lambda}_m^R\right|=|2.9506-2.9504|=0.0002,
$$
meaning that $\bar{\lambda}_m^R$ is very close to $\lambda_m^R$. The trajectories of $|z_k(t)|$ $(k=1,2,\cdots,7)$ and $z^T(t)\bar{z}(t)$ are shown in Figs 7.23 and 7.24.
Fig. 7.26. The trajectory of $|z(t)|^2$ in searching $\bar{\lambda}_m^I$ when $n=50$.

With the variation of $z(0)$, the trajectories of $|z_k(t)|$ $(k=1,2,\cdots,7)$ and $z^T(t)\bar{z}(t)$ will vary, but $\bar{\lambda}_m^I$ and $\bar{\lambda}_m^R$ consistently approach the corresponding ground truth values.
Example 2: How does the approach behave when the dimensionality increases? If a 50×50 matrix is randomly produced, its expression may be too long to write, thus a 50×50 matrix is specially given as
$$
a_{ij}=\begin{cases}(-1)^i(50-i)/100, & i=j\\ (1-i/50)^i-(j/50)^j, & i\neq j\end{cases}.
$$

Fig. 7.27. The trajectories of $|z_k(t)|$ in searching $\bar{\lambda}_m^R$ when $n=50$.
Fig. 7.28. The trajectory of $|z(t)|^2$ in searching $\bar{\lambda}_m^R$ when $n=50$.

Fig. 7.29. The trajectories of $|z_k(t)|$ in searching $\bar{\lambda}_m^I$ when $n=100$.

The calculated results are $\bar{\lambda}_m^I=0.3244$ and $\bar{\lambda}_m^R=4.6944$. Convergence behaviors of $|z_k(t)|$ and $z^T(t)\bar{z}(t)$ in searching $\bar{\lambda}_m^I$ and $\bar{\lambda}_m^R$ are shown in Figs 7.25 to 7.28. Comparing $\bar{\lambda}_m^I$ and $\bar{\lambda}_m^R$ with the corresponding true values $\lambda_m^I=0.3246$ and $\lambda_m^R=4.7000$, we find each pair in comparison is very
Fig. 7.30. The trajectory of $|z(t)|^2$ in searching $\bar{\lambda}_m^I$ when $n=100$.

close. From Figs 7.25 to 7.28, we can also see that the system reaches the equilibrium state quickly even though the dimensionality has reached 50.
In order to further show the effectiveness of this approach when the dimensionality becomes large, let $n=100$. The corresponding convergence behaviors are shown in Figs 7.29 to 7.32. $\bar{\lambda}_m^R=7.3220$ remains very close to $\lambda_m^R=7.3224$, and $\bar{\lambda}_m^I=0.2541$ is very close to $\lambda_m^I=0.2540$. From Figs 7.25 to 7.32, we can see that the iteration number at which the system enters into the equilibrium state is not sensitive to the dimensionality.
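The large-dimension test is straightforward to reproduce; the sketch below (Python/NumPy; the step size, the number of steps and the random initial state are illustrative assumptions of the sketch) builds the 50 × 50 matrix $a_{ij}$ defined above, recomputes the ground-truth values with a direct eigensolver, and runs the two variants of the network, Equation (7.76) for the largest imaginary part and Equation (7.82) for the largest real part.

import numpy as np

n = 50
i_idx = np.arange(1, n + 1)
# a_ij = (-1)^i (50-i)/100 on the diagonal, (1-i/50)^i - (j/50)^j off the diagonal
A = (1 - i_idx[:, None] / 50.0) ** i_idx[:, None] - (i_idx[None, :] / 50.0) ** i_idx[None, :]
np.fill_diagonal(A, (-1.0) ** i_idx * (50 - i_idx) / 100.0)

lam = np.linalg.eigvals(A)
print("ground truth:", lam.imag.max(), lam.real.max())   # compare with 0.3246 and 4.7000 quoted above

def run(imaginary_part=True, steps=200_000, dt=1e-3, seed=3):
    # Equation (7.76) extracts the largest imaginary part, Equation (7.82) the largest real part.
    rng = np.random.default_rng(seed)
    z = rng.random(n) + 1j * rng.random(n)
    for _ in range(steps):
        lin = -1j * (A @ z) if imaginary_part else A @ z
        z = z + dt * (lin - (z @ z.conj()) * z)
    return (z @ z.conj()).real

print("RNN estimates:", run(True), run(False))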

Fig. 7.31. The trajectories of $|z_k(t)|$ in searching $\bar{\lambda}_m^R$ when $n=100$.
Fig. 7.32. The trajectory of $|z(t)|^2$ in searching $\bar{\lambda}_m^R$ when $n=100$.

7.4.4. Section summary


Although many research works have been focused on using neural networks
to compute eigenpairs of a real symmetric, or anti-symmetric matrix,
similar efforts toward the computation for a general real matrix are seldom found in the relevant literature. Therefore, in this section a recurrent neural network
model was proposed to calculate the largest imaginary, or real part, of a
general real matrix’s eigenvalues. The network was described by a set of
differential equations, which was transformed into a complex differential
system. After obtaining the squared module of the system variable, the
convergence behavior of the network was discussed in detail. Using a 7×7
general real matrix for numerical testing, the results indicated that the
computed values were very close to the corresponding ground truth ones. In
order to show the approach’s performance when the dimensionality is large,
a 50×50 and a 100×100 matrices were used for testing respectively. The
calculated results were quite close to the ground truth values as well. It was
also seen from the results that the iteration number at which the network
enters into equilibrium state is not sensitive to the dimension number. This
approach can thus be of potential use to estimate the largest modulus of a
general real matrix’s eigenvalues.
7.5. Conclusions

Using neural networks to compute eigenpairs has advantages such as the ability to run in parallel and at high speed. Quick extraction of eigenpairs from
matrices has many significant applications such as principal component
analysis (PCA), real-time image compression and adaptive signal processing
etc. In this chapter, we have presented a few RNNs for computation
of eigenvalues and eigenvectors of different kinds of matrices. These
matrices, in terms of level of processing difficulty, are classified into three
types, i.e. symmetric matrices, anti-symmetric matrices and general real
matrices. The procedures of how we use different RNNs to handle each
type of matrices have been presented carefully in one or two sections
respectively. Generally speaking, we have formulated each RNN as an
individual differential equation. Subsequently, each analytic solution to the
individual differential equation has been derived. Next, the convergence
properties of the neural network models have been fully discussed based
on these solutions. Finally, in order to evaluate the performance of each
model, numerical simulations have been provided. The computational
results from these models have been compared with those computed directly
from Matlab functions, which are used as ground truth values. Comparison
has revealed that the results from the proposed models are highly close to
the ground truth values. In some sections, very large dimensional matrices
have been used for rigorous testing and the results still approach the ground
truth values closely. Thus, the conclusion can be made that with the
use of specially designed neural networks, the computation of eigenpairs
of symmetric, anti-symmetric and general real matrices is really possible.

References

[1] N. Li, A matrix inverse eigenvalue problem and its application, Linear
Algebra And Its Applications. 266(15), 143–152, (1997).
[2] F.-L. Luo, R. Unbehauen, and A. Cichocki, A minor component analysis
algorithm, Neural Networks. 10(2), 291–297, (1997).
[3] C. Ziegaus and E. Lang, A neural implementation of the jade algorithm
(njade) using higher-order neurons, Neurocomputing. 56, 79–100, (2004).
[4] F.-L. Luo, R. Unbehauen, and Y.-D. Li, A principal component analysis
algorithm with invariant norm, Neurocomputing. 8(2), 213–221, (1995).
[5] J. Song and Y. Yam, Complex recurrent neural network for computing the
inverse and pseudo-inverse of the complex matrix, Applied Mathematics and
Computing. 93(2–3), 195–205, (1998).
[6] H. Kakeya and T. Kindo, Eigenspace separation of autocorrelation memory


matrices for capacity expansion, Neural Networks. 10(5), 833–843, (1997).
[7] M. Kobayashi, G. Dupret, O. King, and H. Samukawa, Estimation of
singular values of very large matrices using random sampling, Computers
And Mathematics With Applications. 42(10–11), 1331–1352, (2001).
[8] Y. Zhang, F. Nan, and J. Hua, A neural networks based approach
computing eigenvectors and eigenvalues of symmetric matrix, Computers
And Mathematics With Applications. 47(8–9), 1155–1164, (2004).
[9] F.-L. Luo and Y.-D. Li, Real-time neural computation of the eigenvector
corresponding to the largest eigenvalue of positive matrix, Neurocomputing.
7(2), 145–157, (1995).
[10] V. U. Reddy, G. Mathew, and A. Paulraj, Some algorithms for eigensubspace
estimation, Digital Signal Processing. 5(2), 97–115, (1995).
[11] R. Perfetti and E. Massarelli, Training spatially homogeneous fully recurrent
neural networks in eigenvalue space, Neural Networks. 10(1), 125–137,
(1997).
[12] Y. Liu and Z. You, A concise functional neural network for computing the
extremum eigenpairs of real symmetric matrices, Lecture Notes in Computer
Science. (3971), 405–413, (2006).
[13] Y. Tan and Z. Liu, On matrix eigendecomposition by neural networks,
Neural Networks World. 8(3), 337–352, (1998).
[14] Y. Tan and Z. He, Neural network approaches for the extraction of the
eigenstructure, Neural Networks for Signal Processing VI–Proceedings of the
1996 IEEE Workshop, Kyoto, Japan. 23–32, (1996).
[15] L. Xu and I. King, A PCA approach for fast retrieval of structural
patterns in attributed graphs, IEEE Transactions on Systems, Man, and
Cybernetics–Part B: Cybernetics. 31, 812–817, (2001).
[16] B. Zhang, M. Fu, and H. Yan, A nonlinear neural network model of mixture
of local principal component analysis: application to handwritten digits
recognition, Pattern Recognition. 34, 203–214, (2001).
[17] U. Helmke and J. B. Moore, Optimization and Dynamical Systems. (Springer-Verlag, London, 1994).
[18] T. Chen, Modified oja’s algorithms for principal subspace and minor
subspace extraction, Neural Processing Letters. 5, 105–110, (1997).
[19] A. Cichocki and R. Unbehauen, Neural networks for computing eigenvalues
and eigenvectors, Biological Cybernetics. 68, 155–164, (1992).
[20] A. Cichocki, Neural network for singular value decomposition, Electronics
Letters. 28(8), 784–786, (1992).
[21] Y. Liu, Z. You, and L. Cao, A simple functional neural network
for computing the largest and smallest eigenvalues and corresponding
eigenvectors of a real symmetric matrix, Neurocomputing. 67, 369–383,
(2005).
[22] Y. Liu, Z. You, L. Cao, and X. Jiang, A neural network algorithm for
computing matrix eigenvalues and eigenvectors, Journal of Software [in
Chinese]. 16(6), 1064–1072, (2005).
[23] Y. Liu, Z. You, and L. Cao, A functional neural network for computing
the largest modulus eigenvalues and their corresponding eigenvectors of an
anti-symmetric matrix, Neurocomputing. 67, 384–397, (2005).
[24] A neural network for computing eigenvectors and eigenvalues, Biological
Cybernetics. 65, 211–214, (1991).
[25] Y. Liu, Z. You, and L. Cao, A functional neural network computing some
eigenvalues and eigenvectors of a special real matrix, Neural Networks. (18),
1293–1300, (2005).
[26] Y. Nakamura, K. Kajiwara, and H. Shiotani, On an integrable discretization
of rayleigh quotient gradient system and the power method with a shift,
Journal of Computational and Applied Mathematics. 96, 77–90, (1998).
[27] J. M. Vegas and P. J. Zufiria, Generalized neural networks for spectral
analysis: dynamics and liapunov functions, Neural Networks. 17, 233–245,
(2004).

Chapter 8

Automated Screw Insertion Monitoring Using Neural Networks: A Computational Intelligence Approach to Assembly in Manufacturing

Bruno Lara, Lakmal D. Seneviratne and Kaspar Althoefer
King's College London, Department of Informatics,
Strand, London WC2R 2LS, UK
[email protected]

Threaded fastenings are a widely used industrial method to clamp


components together. The method is successfully employed in
many industries including car manufacturing, toy manufacturing and
electronic component assembly - the method is popular because it is a
low-cost assembly process and permits easy disassembly for maintenance,
repair, relocation and recycling. Particularly, the usage of self-tapping
screws has gained wide-spread recognition, because of the cost savings
that can be achieved when preparing for the fastening process - the
components to be clamped together need to be only equipped with holes
and as such the more complicated and costly placing of threaded holes
as is required for non-self-tapping bolts can be avoided. However, the
process of inserting self-tapping screws is more complicated and typically
carried out manually. This chapter discusses the study of methods of
automation for the insertion of self-tapping screws.

Contents

8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184


8.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
8.2.1 The screw insertion process: modelling and monitoring . . . . . . . . . 187
8.2.2 Screw insertion process: monitoring . . . . . . . . . . . . . . . . . . . . 192
8.3 Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
8.3.1 Screw insertion signature classification: successful insertion and type
of failure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
8.3.2 Radial basis function neural network for error classification . . . . . . 194
8.3.3 Simulations and experimental study . . . . . . . . . . . . . . . . . . . 196
8.4 Results of Simulation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
8.4.1 Single insertion case . . . . . . . . . . . . . . . . . . . . . . . . . . . . 197
8.4.2 Generalization ability . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
8.4.3 Multi-case classification . . . . . . . . . . . . . . . . . . . . . . . . . . 199

8.5 Results of Experimental Study . . . . . . . . . . . . . . . . . . . . . . . . . . 200


8.5.1 Single insertion case . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
8.5.2 Generalization ability . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201
8.5.3 Four-output classification . . . . . . . . . . . . . . . . . . . . . . . . . 202
8.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207

8.1. Introduction

Automation of the assembly process is an important research topic


in many manufacturing sectors, including the automotive industry, toy
manufacturers and household appliance producing industries. With the current trend moving from fixed mass manufacturing toward flexible manufacturing methods, which can adapt to changes in the production lines and can cope more easily with a greater variation in the product palette, approaches that employ adaptive models and learning paradigms become more and more attractive for the manufacturing industries.
Considering that in quite a number of manufacturing industries around
25% of all assembly processes involve some kind of a threaded fastening
approach, it is of great importance to research this particular assembling
method in detail. Advances in the understanding of screw insertion,
screw advancing and screw tightening will undoubtedly have the potential
to advance the relevant manufacturing sectors. Enhanced monitoring
capabilities based on comprehensive knowledge of an assembly process, such
as screw fastening, will enable direct improvements in quality assurance
even in the face of product variation and uncertainty. With the current
interest in model predictive control, modeling the assembly process of screw
fastening will certainly impact positively on modern, model-based control
architectures for manufacturing allowing greater flexibility.
Screw fastenings or, more generally, threaded fastenings have proved
to be a popular assembly method. The integrity of the achieved joint is
defined by a number of factors including the tightening torque, the travel
of the screw, the strength and quality of the fasteners and the prepared
holes [1]. Compared to gluing, welding and other joining techniques,
threaded fastenings have the great advantage of allowing the assembled
components to be disassembled again relatively easily for the purpose of
repair, maintenance, relocation or recycling. Another advantage is that the
forces between joined components can be controlled to a very high degree
of accuracy, if required for specific applications such as shear pins on a
rocket body. A number of manufacturing industries, such as Desoutters [2],


have shown a keen interest in intelligent systems for automated screw
fastening, as the factory floor implementation of such systems will improve
a number of aspects of this assembly process including the insertion quality,
the completion of successful insertions and the working conditions of
human operators. A number of research studies have drawn attention to
the advantages and financial benefits of advanced insertion systems in a
manufacturing environment [3–5].
With the view set on minimizing assembly costs, a particularly
economical form of threaded fastenings has established itself in a number
of manufacturing processes: self-tapping screws [6]. A self-tapping screw
cuts a thread in the part to be fastened, avoiding the need to create
pre-tapped holes prior to the screw insertion process and thus simplifying
and shortening the overall assembly task. However, self-tapping insertions
are more difficult to control and have an increased risk of failure, since a
thread has to be formed whilst the screw is advanced into one of the joint
components.
However, in many cases the fastening process based on self-tapping
screws is made more difficult by the fact that handheld power tools
or numerically controlled column power screwdrivers are used. A
human operator has good control of the insertion process when using a
simple non-motorized screwdriver; however, when employing a powered
screwdriver, the control is limited, mainly owing to the high speed of the
tool and reduced force feedback from the interaction between screw and
surrounding material. In such a situation, the operator is usually not
capable of assessing the insertion process and the overall quality of the
created screw coupling.
A number of research studies have provided insight into the mechanisms
that define the processes of thread cutting, advance and tightening for
self-tapping screws [7–9]. Still, most of today’s power screwdrivers employ
basic maximum-torque mechanisms to stop the insertion process, once the
desired tightening torque has been reached. Such simple methods do not
lend themselves to appropriately monitoring the insertion process and fail
to determine issues such as cross-threading, loose screws and other insertion
failures.
The automation of the screw insertion monitoring process is of great
interest for many sectors of the manufacturing industry. However, the
methods being used for the accomplishment and monitoring of screw
insertions or bolt tightenings are still fairly simple. These methods, such as
the so-called “Teach Method”, are suitable for large batch operations that
do not require any changes of the assembly process, e.g. the production
of high-volume products as needed for the automotive industry. The teach
method is based on the assumption that a specific screw insertion process
will have a unique signature based on the measured “Torque-Insertion
Depth” curve. By comparing signals that are acquired on-line during the screw insertion process to signals that were recorded during a controlled insertion phase (teaching phase), differences can be easily flagged, and a match between on-line signals and signals from the teaching phase indicates a correct insertion. Such methods usually have no adaptive or learning
capabilities and require a labor-intensive set-up before production can
commence. Improvements to the standard teach method were proposed and
implemented, e.g. the “Torque-Rate” approach [10]. For this approach,
the measured insertion signals need to fall within a-priori-defined torque-rate levels, and the approach has been shown to be capable of coping with a number of
often-occurring faults such as stripped threads, excessive yielding of bolts,
crossed threads, presence of foreign materials in the thread, insufficient
thread cut and burrs. Despite the improvements this approach has no
generalization capabilities and thus cannot cope with unknown, new cases
that have not been part of the original training set. In a number of cases,
the required tools such as screwdrivers can be specifically set-up to carry
out the required fastening process to join components by means of screws
or bolts and are economical if the components and the screw fastening
requirements do not change. However, for small-batch production this
approach is prohibitive because a “re-teaching” is required every time the
components or fastening requirements change. Where smaller production
runs are the norm, e.g. for wind turbine manufacturing, the industry is
still relying on human operators to a great extent. These operators insert
and fasten screws and bolts manually or by means of handheld power tools.
With a view to create flexible manufacturing approaches, this book chapter
discusses novel, intelligent monitoring methods for screw insertion and bolt
tightening capable of adapting to uncertainty and changes in the involved
components.
This book chapter investigates novel neural-network-based systems for
the monitoring of the insertion of self-tapping screws. Section 8.2 provides
an overview of the screw insertion process. Section 8.3 describes the
methodology behind the process. Section 8.4 discusses the results from
simulations. Section 8.5 presents an in-depth experimental study and
discusses the results obtained. Conclusions are drawn in Section 8.6.
8.2. Background

8.2.1. The screw insertion process: modelling and monitoring

Screw insertion can be divided into two categories: the first describes the
insertion of self-tapping screws into holes that have not been threaded; the
second deals with the insertion of screws or bolts into pre-threaded holes or
fixed nuts. The advantage of the first approach lies in its simplicity when
preparing the components for joining – a hole in each of the components
is the only requirement. The latter approach needs to undergo a more complex preparatory step: in addition to drilling holes, threads have to be cut or nuts attached. Industry is progressively employing more of the former as it is more cost-effective, more generally applicable and allows for more rapid insertion of screws than is possible with the other approach.
However, the downside is that the approach based on self-tapping screws is
in need of a more advanced control policy to ensure that the screw thread
cuts an appropriate groove through at least one of the components during
the insertion process. Owing to its complexity, the process of inserting
self-tapping screws is usually carried out by human operators who can
ensure that an appropriate and precise screw advancement is guaranteed,
making good use of the multitude of tactile and force sensors as well as
feedback control loops they have available.
In order to develop an advanced monitoring strategy for the insertion
of self-tapping screws, a proper theoretical understanding of the underlying
behavior of the insertion process is needed [11]. At King’s College
London, mathematical models, based on an equilibrium analysis of
forces and describing the insertion of self-tapping screws into metal and
plastic components, have been created to identify the importance of the
torque-vs.-insertion-depth profile as a good representation for the five
distinct stages of the insertion process, Fig. 8.2 [12]. Existing knowledge
on screw advance and screw tightening of screws in pre-tapped holes is
included in the proposed model for the relevant stages of the overall process
[9, 10, 13–15]. The key points of this profile – representative for the five
main stages occurring during the insertion – are defined as shown below.

(1) Screw engagement represents the stage of the insertion process lasting
from the moment where the tip of the screw touches the tap plate (on
the top component) until the moment when the cutting portion of
the conical screw taper has been completely inserted into the hole,
Fig. 8.1. (left) Notation for self-tapping screw insertions; (right) Key stages of the screw
insertion in a two-plate joint [1].

effectively resulting in an entire turn of the screw taper penetrating


through the tap plate. The screw is said to be engaged at this point,
denoted TE, Figs 8.1 and 8.2. Mainly shear forces are experienced
during this stage, due to the cutting of a helical groove in the hole
wall.
(2) Cutting describes the subsequent interval from completion of screw
engagement (TE) until the completion of a cut of a groove into the
wall by the screw thread (TP). The main forces are due to friction
and cutting. From this stage onwards (i.e. the cutting portion of the
screw has completely passed through the hole in the tap plate (screw
breakthrough)), cutting or shear forces have an insignificant impact
on the screw advancement.
(3) Transition from cutting to screw advance is an intermediate stage


continuing from TP to TB in the mathematical insertion model
(Figs 8.1 and 8.2); a mixture of cutting and friction forces can be
observed.
(4) Screw advance or rundown is the fourth stage where the screw moves
forward in the previously-cut groove. Screw advance continues until
the lower side of the screw cap touches the near plate (TF). Here, the
main forces observable are friction forces owing to the movement of
the screw’s thread along the surface of the groove of the hole. The
forces of this stage are known to be constant (see constant torque
between TB and TF in Fig. 8.2 [11]).
(5) Screw tightening is the fifth and final stage, commencing the moment
the screw cap comes into contact with the near plate (TF) and
completing when a predetermined clamping force is achieved (TF2).
The main forces concerned are tightening forces, made up entirely
of friction components that can be divided into two categories: firstly,
forces between screw thread and hole groove, and secondly, forces
occurring between screw cap and near plate.

When comparing the processes of inserting screws into pre-tapped holes,


on the one hand, and inserting self-tapping screws, on the other, one
recognises a clear similarity between the two processes over a range of
stages. Compared to self-tapping screws, pre-tapped insertions lack the two
stages from TE to TB, because cutting is not required. For screws inserted
into pre-tapped holes, a short engagement stage is experienced, then
transitioning rapidly to the advance or rundown stage [9, 10, 13–16]. After
this, the two mathematical screw insertion models are indistinguishable.
The forces acting on the screw during the different stages are examined
using fundamental stress analysis concepts [17, 18]. Each stage of the
insertion process can be represented by an appropriate equation describing
the relationship between screw advancement (insertion depth) and resultant
torque values. It has been shown that those stages can be modeled as simple
straight-line segments shown in Fig. 8.2 [11]. All other parameters depend
on screw and hole geometry, material properties and friction. The overall
set of equations (submodels) describing the five stages of the process of
inserting self-tapping screws is as follows [11]:

Fig. 8.2. Average experimental torque profile of an insertion into a two-plate joint and the
corresponding theoretical prediction [1].

$$T_1 = \left(\frac{\cos\beta}{64\pi L_t\sqrt{3}}\right)(D_s^2 - D_h^2)\,\sigma_{uts}\,\big(D_f D_s P + \pi\mu L_t (D_s + D_h)\big)\left\{\theta + \frac{4\pi L_t h}{P D_s}\right\} \tag{8.1}$$

$$T_2 = \left(\frac{\cos\beta}{16\sqrt{3}}\right)(D_s^2 - D_h^2)\,\sigma_{uts}\left(D_f (D_s - D_h) + \frac{\mu\pi L_t (D_s^2 - D_h^2)}{2 D_s P} + \mu R_f\theta\right) \tag{8.2}$$

$$T_3 = \left(\frac{\cos\beta}{32 L_t\sqrt{3}}\right)(D_s^2 - D_h^2)\,\sigma_{uts}\left\{\left(2 R_f\mu L_t - \frac{D_f D_s P}{\pi}\right)\theta + \frac{2t}{P}\big(2 R_f\mu\pi L_t - D_f D_s P\big)\right\} \tag{8.3}$$

$$T_4 = \left(\frac{\cos\beta}{32 P\sqrt{3}}\right)\pi t\mu\,\sigma_{uts}\,\big(D_s^3 - D_s D_h^2 + D_h D_s^2 - D_h^3\big) \tag{8.4}$$

$$T_5 = \frac{\mu E P\,(D_{sh}^3 - D_s^3)\left(\theta - \frac{\pi}{4}(D_{sh}^2 - D_s^2)\right)}{24\,l} \tag{8.5}$$
where T1 to T5 are the torques required for stages one to five respectively, θ
is the angle of rotation of the screw, Ls is the screw length, Lt is the taper
length, Ds is the major screw diameter, Dm is the minor screw diameter, Dh
is the lower hole diameter, t is the lower plate thickness, l is the total plate
thickness, Dsh is the screw head diameter, P is the pitch of the thread, β
is the thread angle, µ is the friction coefficient between the screw and plate
material and σuts is the ultimate tensile strength of the plate material. All
the parameters are constant and depend on the screw and hole dimensions,
as well as on the friction and material properties.
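To make the use of this model concrete, the following minimal Python sketch (not taken from the cited work) builds such a straight-line-segment torque signature from a set of key points; all numerical values shown are illustrative placeholders rather than parameters from the study.

    # A minimal sketch (not the authors' code) of synthesizing a torque signature
    # as the straight-line segments between the key points of the insertion model.
    # The key-point depths and torques below are illustrative placeholders only.
    import numpy as np

    def torque_signature(key_depths, key_torques, n_samples=30):
        """Piecewise-linear torque profile sampled at n_samples insertion depths.

        key_depths  -- insertion depths at the key points [start, TE, TP, TB, TF, TF2]
        key_torques -- modelled torques at those key points, e.g. from Eqs (8.1)-(8.5)
        """
        depths = np.linspace(key_depths[0], key_depths[-1], n_samples)
        torques = np.interp(depths, key_depths, key_torques)  # straight-line segments
        return depths, torques

    # Illustrative call with made-up key points (depths in mm, torques in Nm):
    depths, torques = torque_signature(
        key_depths=[0.0, 1.0, 3.0, 3.5, 9.0, 9.2],
        key_torques=[0.0, 0.25, 0.55, 0.60, 0.60, 1.30])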
Realistically modeling a common manufacturing process, the process of
inserting self-tapping screws can be experimentally investigated in a test
environment with two plates (representing two manufacturing components)
to be united [11]. As part of the experimental study, holes are drilled into
the plates to be joined; the latter are then clamped together such that the
hole centers of the top plate (also called near plate) are in line with the
centers of the holes of the second plate (also called tap plate), Fig. 8.1. A
self-tapping screw is then inserted into a hole during an experiment and
screwed in, applying appropriate axial force and torque values. Ngemoh
showed that the axial insertion force has little influence on the advancement
of the screw, even if it varies over quite a wide range, provided it does not
exceed a maximum value beyond which the threads of the
screw and/or the hole are in danger of being destroyed [11]. During this
insertion process a helical groove is cut into the far plate’s hole and forces as
described above are experienced across the five main stages of the insertion
process until the screw head is finally tightened against the near plate [11].
Based on the modeling of, and practical knowledge about, the screw insertion
and fastening process, monitoring strategies can be created [12, 19]. The
signature that is obtained when recording the torque-vs.-insertion-depth
profile throughout the insertion process provides important clues about the
final outcome of the overall process and its likelihood of success. The resultant
torques can be best estimated when the recorded signature is closest to the
ideal one. This book chapter presents monitoring methods that are capable of
predicting to what extent a pre-set final clamping torque is achieved by
employing artificial intelligence in the form of an artificial neural network that uses

online torque-vs.-insertion-depth profiles as input. In contrast
to monitoring approaches that measure the final clamping torque only, the
proposed method has a further advantage: faults can be detected early on
and a screw insertion process can be interrupted in order to initiate remedial
actions.

8.2.2. Screw insertion process: monitoring

A number of screw insertion monitoring techniques have established


themselves in the manufacturing industry. One of the commonly employed
methods is the “teach method” [10]. This method is active during the
insertion of a screw, monitoring online the profile of the occurring forces
during the insertion process. The teach method presumes a particular screw
insertion operation to follow a unique “torque-insertion depth” signature
signal that was taught before the actual assembly process commenced.
This approach entails long set-up times, since the torque/depth profile has
to be taught anew for each insertion case whenever the production run changes.
The teach method is mainly limited by its inflexibility and the lack of
generalization ability. To combat this, Smith et al. proposed various
enhancements to the standard teach method, including the “torque-rate”
approach [10]. Their approach requires the correct fastening signatures to
fall within pre-defined torque rate windows. This approach can deal with
a range of common faults including stripped threads, excessive yielding
of bolts, crossed threads, presence of foreign materials in the threads,
insufficient thread cuts and burrs. However, Smith’s approaches lack the
flexibility to generalize when insertion cases differ from those used to define
the windows.
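As a rough illustration of the window idea (and not a reproduction of Smith's actual algorithm), the sketch below accepts a recorded profile only if every torque sample falls inside a pre-defined band for its insertion-depth window; the window boundaries and torque bands are assumed inputs.

    # Illustrative check of a torque profile against pre-defined torque windows.
    import numpy as np

    def within_windows(depths, torques, window_edges, torque_bands):
        """Return True if every (depth, torque) sample lies inside its torque band.

        window_edges -- depth boundaries of the windows, length n+1
        torque_bands -- list of (min_torque, max_torque) pairs, length n
        """
        bins = np.digitize(depths, window_edges) - 1
        for b, torque in zip(bins, torques):
            lo, hi = torque_bands[int(np.clip(b, 0, len(torque_bands) - 1))]
            if not (lo <= torque <= hi):
                return False
        return True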
Other efforts have focused on developing online torque estimation
strategies correlating the electrical screwdriver’s current to the torque at
the tip of the screwdriver [8, 15, 20], and increasing the reliability during
the tightening of the screw insertion process [8, 9]. However, it is noted
that the start of the insertion process, in particular when using self-tapping
screws, is a research area not investigated in detail. Other, more advanced
techniques, such as the one presented by Dhayagude et al. [21], employ fuzzy
logic, providing a robust tool for monitoring the screw insertion process,
even in the presence of noisy signals or screw alignment inaccuracies, with
the capability for early detection of insertion faults. However, the proposed
computer simulation model used for testing ignores a variety of factors that
define the torque requirements of a screw insertion.

8.3. Methodology

8.3.1. Screw insertion signature classification: successful insertion and type of failure

Since the process of inserting a self-tapping screw and the subsequent


fastening can be modeled by a torque-insertion depth signature signal [11],
intelligent model-inspired monitoring strategies can be created that are
capable of separating successful from failed insertions. With the profile of
the insertion signature depending mainly on the geometrical and mechanical
properties of the involved components (screw and parts to be joined),
commonly-occurring property variations invariably lead to deviations from
the ideal torque signature profile. Theoretically and under the assumption
that any given insertion case will produce a unique torque-insertion depth
signature, any such deviations from the ideal signature can be employed to
predict failure and type of failure. However, in practice, even the signature
of a successful insertion may diverge by a small amount from the ideal
signature, owing to signal parameter variations. Hence, any intelligent
monitoring approach needs to be capable of recognizing an insertion signature
as successful even if diverging from the ideal one within some boundaries.
Intelligent methods such as artificial neural networks have generalization
capabilities and are thus suitable to cope with this type of uncertainty.
Neural networks have also proved to be good classifiers and thus can
be used to determine the type of failure. The proposed monitoring strategy
is particularly beneficial for insertion processes as it can predict a failure
before the completion of the insertion and can be used to invoke the
appropriate, failure type dependent corrective action, when available.
In this study artificial neural networks (ANNs) based on the principle of
radial basis functions (RBFs) are used [22, 23]. The following classification
tasks are performed to validate the monitoring concept:

(a) Single insertion case


The ANN is to differentiate between signals from successful and failed
insertions. The ANN is subjected to a phase of learning (using successful
and unsuccessful insertion signals, either simulated or from real sensor
readings), and then tested with unknown signals. These test signals can
only be correctly categorized if the ANN is capable of generalizing, which
is necessary to cope with the random noise added to the outputs of the
mathematical model and the noise inherent in real sensor data.

(b) Multiple insertion cases


The ability of the network to handle multiple insertion cases and to cope
with cases not seen during training is investigated. Four insertion cases,
corresponding to four different hole diameters, 3.7 mm, 3.8 mm, 3.9 mm
and 4.0 mm, are examined, Column 2 of Table 8.1. The ANN is trained on
insertion signals from the smallest hole (six signals) and the widest hole (six
signals), as indicated by asterisks in Table 8.1. The ANN is also trained
on eight signals representing failed insertions. However, during testing the
network has, in addition, been presented with signals from insertions into
the two middle-sized holes. The test set consists of 14 successful signals (2
from a 3.7 mm hole, 2 from a 4.0 mm hole, 5 from a 3.8 mm hole, and 5 from
a 3.9 mm hole) and 8 unsuccessful signals. Here, the network is required to
interpolate from cases learnt during training. All the signals are correctly
classified after a relatively modest training period of four cycles, Fig. 8.4.

(c) Multiple output classifications


The aim of this classification experiment is to separate successful signals
from unsuccessful signals as before and, in addition, to further classify the
successful signals into different classes. Three different insertion classes are
investigated in this experiment (Column 3 of Table 8.1). Here, an ANN with
four nodes in the output layer is used; three to distinguish between the three
different insertion classes (Classes A to C), and one to indicate unsuccessful
insertions. Given an input signal, the trained network is expected to output
a value close to one at a single output node, while all other output nodes
should have values close to zero.

8.3.2. Radial basis function neural network for error classification
A radial basis function (RBF) neural network is used to classify between the
different cases outlined in Section 8.3.1 [24]. After training, the network
is expected to distinguish between successful insertions and failed ones,
and to be capable of separating successful insertions into one of the three
insertion classes, Section 8.3.1 [25]. The choice of the most appropriate neural
network architecture was mainly guided by the relevant literature and
previous experience using neural networks for the problem of screw insertion
classification.

Computer simulations based on the screw insertion model introduced in
Section 8.2.1 allow the network designer to determine the most appropriate
RBF network structure with respect to the numbers of nodes in the input and
the hidden layer by investigating a range of diverse insertion cases. As a
result of a detailed computer-based investigation that aimed to optimize
the network's robustness and generalization capacity as a function of the
number of network nodes, an appropriate network structure was found (for
more details see Section 8.3.3), Fig. 8.5. The number of nodes of the output
layer is defined by the number of output categories required: one output
node for (a) the single insertion case and (b) the multiple insertion case, and
four output nodes for (c), where the network is required to distinguish between
different types of successful insertions and failed insertions. Here, the neural
network input consists of 30 discrete pairs of values of the two main insertion
parameters, insertion depth and torque, as output by the model or acquired
from sensors attached to a real screwdriver, respectively.
The function of the RBF network can be summarized as follows.
After receiving input signals at the network’s input nodes, the signals are
weighted and passed to the network’s hidden layer. Formed from Radial
Basis Functions, the hidden layer nodes perform a nonlinear mapping from
the input space to the output space. The results of the computations in the
hidden layer nodes are weighted, added up and forwarded to the network’s
output layer. The task of the output layer is to categorize the signals either
as successfully/unsuccessfully inserted or to distinguish between a number
of different insertion types. In the output nodes, a logistic function limits
the final output of each node to a value between zero and one, denoting
disagreement (zero) and agreement (one) with the specified target.
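A minimal numpy sketch of this forward computation is given below; it assumes Gaussian radial basis units and a logistic output squashing, as described above, with all array shapes and parameter names chosen purely for illustration.

    # Illustrative forward pass of the RBF network described above (not the
    # authors' implementation): Gaussian hidden units, weighted summation and
    # a logistic output limited to the interval (0, 1).
    import numpy as np

    def rbf_forward(x, centres, widths, out_weights, out_bias):
        """x: input vector (e.g. 60 interleaved torque/insertion-depth values);
        centres: (n_hidden, len(x)) RBF centres; widths: (n_hidden,) widths;
        out_weights: (n_outputs, n_hidden); out_bias: (n_outputs,)."""
        dists = np.linalg.norm(centres - x, axis=1)           # distance to each centre
        hidden = np.exp(-(dists ** 2) / (2.0 * widths ** 2))  # Gaussian activations
        return 1.0 / (1.0 + np.exp(-(out_weights @ hidden + out_bias)))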
In this research, a supervised, batch mode training approach has been
adopted, which has been demonstrated to be more robust and reliable than
online learning [2, 26]. Here, the overall torque/insertion depth signature
signal data set is divided into a training set comprising the input signals
and the corresponding target outputs and a test set employed to evaluate
the network’s classification and generalization capabilities with regards to
unknown data.
In preparation for training, the centres of the radial basis functions are
initialized to values evenly collected from the set of teaching data of torque
and insertion depth signals [2, 26]. The widths of the hidden layer nodes
are initialized with random values between zero and one, and the weights
between hidden and output nodes are also randomly initialized.

During the learning phase, training data is propagated through the


ANN repeatedly; each time this happens the sum-squared error (SSE) is
computed. The SSE is the sum of the squared differences between the actual
outputs of the network and the target outputs. It is then the aim of the
training algorithm to modify the free parameters of the network such that the
SSE is minimized [27].
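The simplified sketch below illustrates the batch training idea under the assumption that only the output weights are adapted by plain gradient descent on the SSE; the algorithm actually used in the study [27] also adapts the RBF centres and widths.

    # Simplified batch training sketch: gradient descent on the SSE with respect
    # to the output weights only (an assumption made for brevity).
    import numpy as np

    def train_output_weights(train_x, train_y, centres, widths,
                             n_cycles=20, learning_rate=0.05):
        """train_x: (n_samples, n_inputs); train_y: (n_samples, n_outputs)."""
        rng = np.random.default_rng(0)
        weights = rng.uniform(0.0, 1.0, (train_y.shape[1], centres.shape[0]))
        bias = np.zeros(train_y.shape[1])
        for _ in range(n_cycles):
            sse, grad_w, grad_b = 0.0, np.zeros_like(weights), np.zeros_like(bias)
            for x, target in zip(train_x, train_y):
                hidden = np.exp(-np.linalg.norm(centres - x, axis=1) ** 2
                                / (2.0 * widths ** 2))
                out = 1.0 / (1.0 + np.exp(-(weights @ hidden + bias)))
                err = out - target
                sse += float(np.sum(err ** 2))
                delta = err * out * (1.0 - out)   # logistic derivative term
                grad_w += np.outer(delta, hidden)
                grad_b += delta
            weights -= learning_rate * grad_w     # batch update after each cycle
            bias -= learning_rate * grad_b
        return weights, bias, sse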
After completing the learning phase, the network’s performance is
evaluated using data that were not seen by the network previously. It can
then be seen how well the network is trained (i.e. how well it can predict
the various insertion cases) and how well it is able to generalize (i.e. can
cope with noisy data and variations introduced by unseen data, e.g. hole
diameters that were not part of the training set).

8.3.3. Simulations and experimental study

Simulations were conducted to obtain the optimal ANN with regards to


network learning needs, quality of screw insertion monitoring and power to
generalize over the given insertion cases and possible insertion failures.
This work focuses on (a) single insertion case, (b) multiple insertion
cases, and (c) multiple output classifications, in order to validate the
proposed screw insertion monitoring method.
Based on the mathematical model (Section 8.2.1) torque-insertion-depth
signals are produced for the testing and evaluation of the ANN-based
monitoring strategy using a computer implementation of the developed
screw insertion equations, (8.1) to (8.5) [11].
Since measurement noise is expected during the real experiments,
random noise is added to the signal acquired from the mathematical model
in an attempt to make the simulated study as realistic as possible and
train the ANN on a set of signals as close as possible to the real data,
Fig. 8.8. Noise of up to 0.2 Nm (up to 15% of the highest generated
torque) was superimposed. Without loss of generality, simulation
and experimental studies were conducted without a near plate.
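The following one-function sketch shows how such noise could be superimposed; a uniform distribution is assumed here, since the exact noise distribution is not stated.

    # Illustrative noise superimposition: uniform random noise of up to 0.2 Nm
    # added to the model-generated torque samples (uniform distribution assumed).
    import numpy as np

    def add_measurement_noise(torques, max_noise_nm=0.2, seed=None):
        rng = np.random.default_rng(seed)
        return torques + rng.uniform(-max_noise_nm, max_noise_nm, size=len(torques))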
Wide-ranging simulation sequences were conducted and revealed the
most suitable architectures for the RBF ANN. In all three cases (single
insertion, generalization ability, multi-case insertion), it was most suitable
to equip the input layer with 60 nodes (to be fed with 30 pairs of
torque/insertion-depth signals during learning and testing). With regards
to the studies on the single insertion case and the network’s generalization
ability, it was most appropriate to employ 15 nodes in the hidden layer

and 1 node in the output layer. For the multi-classification problem, a


network with 20 nodes in the hidden layer and 4 nodes in the output
layer proved to give the best results. The same ANN structures were
employed in the experimental study where signals from the torque sensor
and insertion-depth sensor attached to an electric screwdriver were used.
Table 8.1 summarizes the simulation and experimental studies carried out.
While a computer-based implementation of the mathematical model
(described in Section 8.3.1) was used for the simulation study, an electric
screwdriver (Desoutter, Model S5) was utilized during the experimental
study. Care was taken to design both studies such that they were as
compatible and comparable as possible. The test rig used during this
experimental study is based on the screwdriver equipped with a rotary
torque sensor (maximum range: 1.9 Nm) and an optical encoder (measuring
angular position in a range from 0 to 3000 degrees with a resolution of 60
degrees) attached to the screwdriver's shaft [2]. During a common screw
insertion experiment, up to 50 data point pairs of torque and angular
signals are recorded. The acquired angular signals are transformed into
insertion-depth values exploiting knowledge of the pitch of the given screw.
It was found that for ANN learning and subsequent testing reduced data
sets of 30 torque/insertion-depth pairs were sufficient.
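The conversion and resampling steps just described can be sketched as follows; the function names are illustrative, and the pitch-based depth conversion simply exploits the fact that the screw advances by one pitch per full revolution.

    # Illustrative angle-to-depth conversion and resampling to 30 pairs.
    import numpy as np

    def angle_to_depth(angles_deg, pitch_mm):
        """Convert encoder angles (degrees) to insertion depth (mm)."""
        return (np.asarray(angles_deg) / 360.0) * pitch_mm

    def resample_profile(depths, torques, n_pairs=30):
        """Resample a recorded torque profile onto n_pairs evenly spaced depths."""
        new_depths = np.linspace(depths[0], depths[-1], n_pairs)
        return new_depths, np.interp(new_depths, depths, torques)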

8.4. Results of Simulation Study

8.4.1. Single insertion case

The task of the ANN is to distinguish between successful and unsuccessful


insertion signals (Column 1 of Table 8.1). During training (employing
signals from ten successful insertions and eight unsuccessful insertions),
the network performance in response to unseen test data is monitored.
To correctly classify this set, the network requires a certain degree of
generalization, since these signals would be different to those in the training
set due to random noise superimposed on the theoretical curve. Figure 8.3
shows that after moderate training (three cycles), there is a clear separation
between the output values of the five successful and eight unsuccessful
signals in the test set, and that the ANN can efficiently distinguish between
successful and unsuccessful insertion signals.

Table 8.1. Simulation and experimental study. The table shows the parameters for all
screws used in this study, together with the number of successful/unsuccessful
insertion signals per training/test set. Generalization capabilities are
investigated by training the ANN on two hole diameters (*) and testing it on
intermediate hole diameters (see also Sections 8.4 and 8.5) [1].

                               --------- Simulation Study ---------   -------- Experimental Study --------
Section                        8.4.1    8.4.2        8.4.3 (A/B/C)    8.5.1    8.5.2        8.5.3 (A/B/C)
Plate material                 Polycarbonate  Polycarbonate           Acrylic  Acrylic      Acrylic/Acrylic/
                                                     Polycarbonate/                         Brass
                                                     Aluminium/
                                                     Mild Steel
Screw type                     8        8            4/6/8            4        6            4/6/8
Far plate thickness (mm)       6.0      6.0          3.0/3.0/3.0      3.0      6.0          3.0/5.0/5.0
Hole diameter (mm)             3.8      3.7*, 3.8,   2.7/3.2/3.9      1.0      1.0*, 1.5,   1.5/2.5/3.5
                                        3.9, 4.0*                              2.0*
Training set (succ./unsucc.)   10/8     12/8         (6+6+6)/8        6/8      10/8         (5+5+5)/16
Test set (succ./unsucc.)       5/8      14/8         (4+4+4)/8        6/8      12/8         (4+4+4)/16
Classification                 binary   binary       four classes     binary   binary       four classes
                                                     (A, B, C,                              (A, B, C,
                                                     unsuccessful)                          unsuccessful)

8.4.2. Generalization ability

The ability of the network to handle multiple insertion cases and to cope
with cases not seen during training is investigated. Four insertion cases,
corresponding to four different hole diameters, 3.7 mm, 3.8 mm, 3.9 mm
and 4.0 mm, are examined (Column 2 of Table 8.1). The network is only
trained on insertion signals from the smallest (3.7 mm, 6 signals) and the
widest hole (4.0 mm, 6 signals) as well as 8 signals representing unsuccessful
insertions. However, during testing the network has, in addition, been
presented with signals from insertions into the 2 middle-sized holes (3.8
and 3.9 mm). The test set consists of 22 signals including 14 successful
signals (2 from a 3.7 mm hole, 2 from a 4.0 mm hole, 5 from a 3.8 mm
hole and 5 from a 3.9 mm hole) and 8 unsuccessful signals. Hence, the
network is required to interpolate from cases learnt during training. All
the signals are correctly classified after a relatively modest training period
of four cycles (including initialization), Fig. 8.4.

Fig. 8.3. Output values as training evolves (Simulated insertion signals). Single case
experiment [1].

8.4.3. Multi-case classification

The aim of this classification experiment is to separate successful signals


from unsuccessful signals as before and, in addition, to further classify the
successful signals into different classes. Three different insertion classes are
investigated in this experiment, Column 3 of Table 8.1. Here, an ANN with
four nodes in the output layer is used; three to distinguish between the three
different insertion classes (Classes A to C), and one to indicate unsuccessful
insertions. Given an input signal, the trained network is expected to output
a value close to one at a single output node, while all other output nodes
should have values close to zero. During training, it was observed that the

Fig. 8.4. Output values as training evolves (Simulated insertion signals). Variation of
hole diameter experiment [1].

sum squared error (SSE) of the network reduces quickly to relatively low
values indicating a good and steady training behavior. Figure 8.5 shows
the ANN output in response to the test set after 20 training cycles. Signals
1 to 4, 5 to 8 and 9 to 12 are correctly classified as the insertions of Classes
A, B and C, respectively, and the remaining signals are correctly classified
as unsuccessful insertions.

8.5. Results of Experimental Study

8.5.1. Single insertion case


As in Section 8.4.1, the task of the network is to distinguish between
successful and unsuccessful insertions for a single insertion case, for the
set of parameters given in column 4 of Table 8.1. Based on the results of

Fig. 8.5. Activation output on test set after 20 training cycles. Four-output
classification experiment using simulated signals [1].

the simulation study, a network with 60 input nodes, 15 hidden layer nodes
and 1 output node is used. After initialization, the network is trained over
22 cycles, Fig. 8.6. After a modest training period (eight cycles including
initialization), the network output clearly differentiates between successful
and unsuccessful insertions.

8.5.2. Generalization ability

As in Section 8.4.2, the network is trained on signals acquired during


insertions of screws into the smallest (1.0 mm, 5 signals) and widest (2.0
mm, 5 signals) hole as well as on signals from eight unsuccessful insertions,
Column 5 of Table 8.1. During testing (employing four successful insertion
signals from each of the three holes and eight unsuccessful insertion signals),
it is analyzed whether the network is capable of interpolating and correctly
classifying the signals corresponding to the medium-sized hole (1.5 mm).
Figure 8.7 shows the acquired insertion signals for the three different cases
used in this experiment. The network is trained using 26 cycles, and the test
set is presented to the network after initialization and after each training
cycle, Fig. 8.8. It is seen that after a modest period of training (nine cycles),
the network correctly classifies successful and unsuccessful insertions.

Fig. 8.6. Output values as training evolves (insertion signals recorded from electric
screwdriver). Single case experiment [1].

8.5.3. Four-output classification

This experiment is equivalent to that performed in Section 8.4.3. The


three insertion classes investigated in this experiment are shown in Column
6 of Table 8.1; the corresponding insertion signals are depicted in Fig. 8.9.
The aim of the experiment is to separate successful insertions from failed
insertions, and to further classify the successful signals to one of the three
classes. Thus, as for the ANN in Section 8.4.3, four output nodes are
used. As training evolves, output nodes one to three become more and
more activated for correct signals of Classes A, B and C, respectively, while
the fourth output node increases in activity for unsuccessful insertions.
Tests (involving 12 signals representing successful insertions (4 for each
class) and 16 signals from unsuccessful insertions) have shown that the
SSE decreases relatively rapidly and smoothly to low values. At different
training stages, the training and the test sets were presented to the network.
When presented with the training set after 50 iterations, the network is

Fig. 8.7. Insertion signals from screwdriver. Variation of hole diameter [1].

able to correctly classify all 31 training signals. When presented
with the test set, however, the network fails to classify one of the signals
properly. After continued training (200 and more training cycles), the
network correctly classifies all the presented signals from the test set,
Fig. 8.10. Signal 8 correctly activates output node 2, but also activates
output node 1, albeit to a lesser extent. Also, for signals 9, 10 and
17 an increased output activity is observed. This is probably due to the
proximity of the different classes. However, choosing the output node with
the maximum activity gives the correct answer for all signals after 200
training cycles.

8.6. Conclusions

The main contribution of this chapter is the development of a new


methodology, based on RBF ANNs, for monitoring the insertion of
self-tapping screws. Its main advantage is its capability to generalize and
to correctly classify unseen insertion signals. This level of classification

Fig. 8.8. Output values as training evolves (real insertion signals). Variation of hole
diameter experiment [1].

cannot be achieved with standard methods such as the teach method, and
is particularly useful where insertions have to be logged, as is the case where
high safety standards are required.

Fig. 8.9. Insertion signals from screwdriver. Four-output classification experiment [1].

This chapter investigates the ability of the ANN to classify signals


belonging to a single insertion case. Both the computer simulation and
experimental results show that after a modest training period (eight cycles
when using real data), the network is able to correctly classify torque
signature signals.
Further, the ability of the network to cope with unseen test signals
is investigated, using signatures acquired from insertion into holes with
different diameters. The network is trained with signals from the smallest
and largest hole, and tested using signals from all insertions including
those from the unseen middle-sized holes. It is shown that after a modest
training period (10 cycles when using real data), the network is able to
classify all test signals, including those belonging to unseen signatures. To
classify signals of the unseen insertion cases, the network has to interpolate
from insertion cases learnt during training, demonstrating its generalization
ability.

Fig. 8.10. Network activation output for the test set after 200 training cycles.
Four-output classification experiment using real insertion signals. All signals are
correctly attributed [1].

Finally, the ability of the network to classify into multiple categories


is investigated. The classification task is to separate correct signals from
faulty signals, and to further classify the correct signals into one of three
classes. Although the training requirements for this case are higher than in
the previous experiments (200 cycles to ensure correct classification when
using real data), after training, the network is able to classify all test signals
accurately. The considerably more extensive training required for this
experiment is due to the higher complexity of the task at hand as compared
to the previous experiments.
Current research focuses on developing advanced preprocessing
techniques that will better separate the essential features of the
torque-insertion depth profiles of the different cases, aiming to improve
the multi-output classification. Work is underway on creating estimation
techniques to determine the parameters of the analytical model with respect
to real insertion signals [28].

Acknowledgments

The research leading to these results has been partially supported by


the COSMOS project, which has received funding from the European
Community’s Seventh Framework Programme (FP7-NMP-2009-SMALL-3,
NMP-2009-3.2-2) under grant agreement 246371-2. The work of Dr. Lara

was funded by CONACYT. I also thank Allen Jiang for compositing this
chapter.

References

[1] K. Althoefer, B. Lara, and L. D. Seneviratne, Monitoring of self-tapping


screw fastenings using artificial neural networks, ASME Journal of
Manufacturing Science and Engineering. 127, 236–247, (2005).
[2] B. Lara Guzman. Intelligent Monitoring of Screw Insertions. PhD thesis,
King’s College, University of London, (2005).
[3] A. S. Kondoleon. Application of technology-economic model of assembly
techniques to programmable assembly machine configuration. SM thesis.
Master’s thesis, Mechanical Engineering Department, Massachusetts
Institute of Technology, (1976).
[4] P. M. Lynch. Economic-Technological Modeling and Design Criteria for
Automated Assembly. PhD thesis, Mechanical Engineering Department,
Massachusetts Institute of Technology, (1977).
[5] D. E. Whitney and J. L. Nevins, Computer-Controlled Assembly, Scientific
American. 238(2), 62–74, (1978).
[6] L. D. Seneviratne, F. A. Ngemoh, and S. W. E. Earles, An experimental
investigation of torque signature signals for self-tapping screws, Proceedings
of the Institution of Mechanical Engineers, Part C: Journal of Mechanical
Engineering Science. 214(2), 399–410, (2000).
[7] L. D. Seneviratne, P. Visuwan, and K. Althoefer. Monitoring of threaded
fastenings in automated assembly using artificial neural networks. In
Intelligent assembly and disassembly: (IAD 2001): a proceedings volume
from the IFAC workshop, Camela, Brazil, 5-7 November 2001, 61–65, (2002).
[8] M. Matsumura, S. Itou, H. Hibi, and M. Hattori, Tightening torque
estimation of a screw tightening robot, Proceedings of the 1995 IEEE
International Conference on Robotics and Automation. 2, 2108–2112, (1995).
[9] K. Ogiso and M. Watanabe, Increase of reliability in screw tightening,
Proceedings of the 4th International Conference on Assembly Automation,
Tokyo, Japan. 292–302, (1982).
[10] S. K. Smith, Use of a microprocessor in the control and monitoring of
air tools while tightening threaded fasteners, Proceedings of the Society of
Manufacturing Engineers, Dearborn, Michigan. 2, 397–421, (1980).
[11] F. A. Ngemoh. Modelling the Automated Screw Insertion Process. PhD
thesis, King’s College, University of London, (1997).
[12] L. D. Seneviratne, F. A. Ngemoh, and S. W. E. Earles, Theoretical modelling
of screw tightening operations, Proceedings of Engineering Systems Design
and Analysis, American Society of Mechanical Engineers, Istanbul, Turkey.
47(1), 301–310, (1992).

[13] E. J. Nicolson and R. S. Fearing, Compliant control of threaded fastener


insertion, Proceedings of the 1993 IEEE International Conference on
Robotics and Automation. 484–490, (1993).
[14] G. P. Peterson, B. D. Niznik, and L. M. Chan, Development of an automated
screwdriver for use with industrial robots, IEEE Journal of Robotics and
Automation. 4(4), 411–414, (1988).
[15] T. Tsujimura and T. Yabuta, Adaptive force control of screwdriving with
a positioning-controlled manipulator, Robotics and Autonomous Systems. 7
(1), 57–65, (1991).
[16] F. Mrad, Z. Gao, and N. Dhayagude, Fuzzy logic control of automated
screw fastening, Conference Record of the 1995 IEEE Industry Applications
Conference, 1995. Thirtieth IAS Annual Meeting, IAS’95. 2, 1673–1680,
(1995).
[17] A. P. Boresi, R. J. Schmidt, and O. M. Sidebottom, Advanced Mechanics of
Materials. (John Wiley, New York, 1993).
[18] S. P. Timoshenko and J. N. Goodier, Theory of Elasticity. (McGraw, New
York, 1970).
[19] B. Lara, K. Althoefer, and L. D. Seneviratne, Automated robot-based screw
insertion system, Proceedings of the 24th Annual Conference of the IEEE
Industrial Electronics Society IECON'98. 4, 2440–2445, (1998).
[20] K. Althoefer, L. D. Seneviratne, and R. Shields, Mechatronic strategies for
torque control of electric powered screwdrivers, Proceedings of the Institution
of Mechanical Engineers, Part C: Journal of Mechanical Engineering
Science. 214(12), 1485–1501, (2000).
[21] N. Dhayagude, Z. Gao, and F. Mrad, Fuzzy logic control of automated
screw fastening, Robotics and Computer-Integrated Manufacturing. 12(3),
235–242, (1996).
[22] L. Bruzzone and D. F. Prieto, Supervised training technique for radial basis
function neural networks, Electronics Letters. 34(11), 1115–1116, (1998).
[23] C. M. Bishop, Neural Networks for Pattern Recognition. (Oxford University
Press, New York, 1995).
[24] M. T. Musavi, W. Ahmed, K. H. Chan, K. B. Faris, and D. M. Hummels,
On the training of radial basis function classifiers, Neural networks. 5(4),
595–603, (1992).
[25] B. Lara, K. Althoefer, and L. D. Seneviratne, Artificial neural networks
for screw insertions classification, Proceedings of the IEEE International
Conference on Robotics and Automation (ICRA), San Francisco, CA.
1912–1917, (2000).
[26] B. Lara, K. Althoefer, and L. D. Seneviratne, Use of artificial neural
networks for the monitoring of screw insertions, Proceedings of the 1999
IEEE/RSJ International Conference on Intelligent Robots and Systems,
1999, IROS'99. 1, 579–584, (1999).
[27] A. Zell, G. Mamier, and et al., SNNS: Stuttgart Neural Network Simulator.
User Manual, Version 4.1, Institute for Parallel and Distributed High
Performance Systems, Technical Report. (1995).

[28] M. Klingajay, L. Seneviratne, and K. Althoefer, Identification of threaded


fastening parameters using the Newton Raphson Method, Proceedings of
the 2003 IEEE/RSJ International Conference on Intelligent Robots and
Systems, IROS 2003. 2, 2055–2060, (2003).

PART 4

Support Vector Machines and their Applications


Chapter 9

On the Applications of Heart Disease Risk Classification and Hand-written
Character Recognition using Support Vector Machines

S.R. Alty, H.K. Lam and J. Prada
Department of Electronic Engineering, King’s College London
Strand, London, WC2R 2LS, United Kingdom

Over the past decade or so, support vector machines have established
themselves as a very effective means of tackling many practical
classification and regression problems. This chapter relates the theory
behind both support vector classification and regression, including an
example of each applied to real-world problems. Specifically, a classifier
is developed which can accurately estimate the risk of developing heart
disease simply from the signal derived from a finger-based pulse oximeter.
The regression example shows how SVMs can be used to rapidly and
effectively recognize hand-written characters, particularly those of the
so-called graffiti character set.

Contents

9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214


9.1.1 Introduction to support vector machines . . . . . . . . . . . . . . . . . 214
9.1.2 The maximum-margin classifier . . . . . . . . . . . . . . . . . . . . . . 214
9.1.3 The soft-margin classifier . . . . . . . . . . . . . . . . . . . . . . . . . . 216
9.1.4 Support vector regression . . . . . . . . . . . . . . . . . . . . . . . . . 218
9.1.5 Kernel functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 219
9.2 Application: Biomedical Pattern Classification . . . . . . . . . . . . . . . . . 221
9.2.1 Pulse wave velocity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
9.2.2 Digital volume pulse analysis . . . . . . . . . . . . . . . . . . . . . . . 222
9.2.3 Study population and feature extraction . . . . . . . . . . . . . . . . . 223
9.2.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 226
9.3 Application: Hand-written Graffiti Recognition . . . . . . . . . . . . . . . . . 228
9.3.1 Data acquisition and feature extraction . . . . . . . . . . . . . . . . . . 229
9.3.2 SVR-based recognizer . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
9.3.3 Simulation results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 232


9.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235


References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 236
Appendix A. Grid search shown in three dimensions for SVM-based DVP classifier
with Gaussian Radial Basis function kernel . . . . . . . . . . . . . . . . . . . 239
Appendix B. Tables of recognition rate for SVR-based graffiti recognizer with linear
kernel function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
Appendix C. Tables of recognition rate for SVR-based graffiti recognizer with spline
kernel function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 242
Appendix D. Tables of recognition rate for SVR-based graffiti recognizer with
polynomial kernel function . . . . . . . . . . . . . . . . . . . . . . . . . . . . 244
Appendix E. Tables of recognition rate for SVR-based graffiti recognizer with radial
basis kernel function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 249

9.1. Introduction

This chapter relates the rise in interest, largely over the past decade, of
supervised learning machines that employ a hypothesis space of linear
discriminant functions in a higher dimensional feature space, trained and
optimized using methods from statistical learning theory. Vladimir Vapnik
is largely credited with introducing this learning methodology that since
its inception has outperformed many other systems in a broad selection of
applications. We refer, of course, to support vector machines.

9.1.1. Introduction to support vector machines


Support vector machines (SVMs) [1–4] have attracted a significant amount
of attention in the field of machine learning over the past decade by
proving themselves to be very effective in a variety of real-world pattern
classification and regression tasks. In the literature, they have been
successfully applied to many problems ranging from face recognition, to
bioinformatics and hand-written character recognition (amongst many
others). In this chapter we give a brief introduction to the mathematical
basis of SVMs for both classification and regression problems and give an
example application for each. For a complete treatment of the background
theory please see [4].

9.1.2. The maximum-margin classifier


Although the theory can be extended to accommodate multiple classes,
without loss of generality let us first consider a binary classification task
assuming we have a linearly separable set of data samples
$$S = \{(x_1, y_1), \ldots, (x_m, y_m)\}\,, \tag{9.1}$$

where $x \in \mathbb{R}^d$, i.e. x lies in a d-dimensional input space, and yi is the class


label such that yi ∈ {−1, 1}. Since SVMs are principally based on linear
discriminant functions, a suitable classifier could then be defined as:

f (x) = sgn(⟨w, x⟩ + b) . (9.2)

Where vector w determines the orientation of a discriminant plane (or


hyperplane), ⟨w, x⟩ is the inner product of the vectors, w and x and b is
the bias or offset from the origin. It is clear that there exists an infinite
number of possible planes that could correctly dichotomize the training
data. Simple intuition would lead one to expect the choice of a line drawn
through the "middle", between the two classes, to be a suitable choice, as
this would imply that small disturbances of each data point would likely not
affect the resulting classification significantly. This concept then suggests


Fig. 9.1. Maximum-margin classifier with no errors showing optimal separating
hyperplane (solid line) with canonical hyperplanes on either side (dotted lines). It can
be shown that the margin is equal to 2/||w||. The support vectors are encircled.

that a good separating plane is one that works well at generalizing; this
implies a plane which has a higher probability of correctly classifying new,
unseen, data. Hence, an optimal classifier is one which finds the best

generalizing hyperplane that maximizes the margin between each class of


data points. This paradigm results in the selection of a single hyperplane
dependent solely on the “support vectors” (these are the data points that
support the so-called canonical hyperplanes) by maximizing the size of the
margin. It is this characteristic that forms the key to the robustness
of SVMs, as they rely solely on this sparse data set, and will not be
affected by perturbations in any of the remainder of the test data. In
this manner, SVMs are based on so-called Structural Risk Minimisation
(SRM) [1] and not Empirical Risk Minimisation (ERM) on which other
traditional classification techniques such as Neural Networks rely.

9.1.3. The soft-margin classifier


More often than not, however, real-world data sets are typically not linearly
separable in input space, meaning that the maximum margin classifier
model is no longer valid and a new approach must be introduced. This
is achieved by relaxing the constraints a little to tolerate a small amount
of misclassification. Points that subsequently fall on the wrong side of the
margin, therefore, are treated as errors. The error vectors are assigned a


Fig. 9.2. Soft-margin classifier with some errors denoted by the slack variables ξ which
represent the errors.

lower influence (determined by a preset slack variable) on the position of


the hyperplane. Optimization of the soft-margin classifier is now achieved

by maximizing the margin whilst at the same time allowing the margin
constraints to be violated according to the preset slack variables ξi. This leads
to the minimization of $\frac{1}{2}\|w\|^2 + C\sum_{i=1}^{m}\xi_i$ subject to
$y_i(\langle w, x_i\rangle + b) \geq 1 - \xi_i$ and $\xi_i \geq 0$ for $i = 1, \ldots, m$.
Lagrangian duality theory [4] is typically applied to solve the minimization
problems presented by linear inequalities. Thus, one can form the primal
Lagrangian,

$$L(w, b, \xi, \alpha, \beta) = \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{m}\xi_i - \sum_{i=1}^{m}\beta_i\xi_i - \sum_{i=1}^{m}\alpha_i\left[y_i(\langle w, x_i\rangle + b) - 1 + \xi_i\right], \tag{9.3}$$

where αi and βi are independent undetermined Lagrangian multipliers.


The dual-form Lagrangian can be found by calculating each of the partial
derivatives of the primal and equating to zero in turn thus,

$$w = \sum_{i=1}^{m} y_i\alpha_i x_i \tag{9.4}$$

and

$$0 = \sum_{i=1}^{m} y_i\alpha_i\,, \tag{9.5}$$

these are then re-substituted into the primal, which then yields,

$$L(w, b, \xi, \alpha, \beta) = \sum_{i=1}^{m}\alpha_i - \frac{1}{2}\sum_{i,j=1}^{m} y_i y_j\alpha_i\alpha_j\langle x_i, x_j\rangle\,. \tag{9.6}$$

We note that this result is the same as for the maximum-margin classifier.
The only difference is the constraint α + β = C, where both α and β ≥ 0,
hence 0 ≤ α, β ≤ C. Thus, the value C sets an upper limit on the size
of the Lagrangian optimization variables αi and βi; this limit is sometimes
referred to as the box constraint. The selection of the value of C results in
a trade-off between accuracy of data fit and regularization. The optimum
choice of C will depend on the underlying data and nature of the problem
and is usually found by experimental cross-validation (whereby the data
is divided into a training set and testing or hold-out set and the classifier
is tested on the hold-out set alone). This is best performed on a number
of different sets (sometimes referred to as folds) of unseen data, leading
to so-called “multi-folded cross-validation”. Quadratic Programming (QP)
algorithms are typically employed to solve these equations and calculate the
Lagrangian multipliers. There are many online resources of such algorithms
available for download, see website referred to in the excellent book by
Cristianini and Shawe-Taylor [4] for an up to date listing.
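As a brief illustration of how the box constraint C can be selected by multi-folded cross-validation, the sketch below uses scikit-learn's SVC, one freely available QP-based implementation; the chapter does not prescribe a particular package, and the data generated here is synthetic.

    # Illustrative soft-margin SVM training with cross-validated choice of C
    # (scikit-learn assumed as the solver; synthetic two-class data).
    import numpy as np
    from sklearn.svm import SVC
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(-1.0, 1.0, (50, 2)), rng.normal(1.0, 1.0, (50, 2))])
    y = np.hstack([-np.ones(50), np.ones(50)])

    for C in (0.1, 1.0, 10.0):
        scores = cross_val_score(SVC(C=C, kernel="linear"), X, y, cv=5)
        print(f"C={C}: mean cross-validation accuracy {scores.mean():.3f}")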

9.1.4. Support vector regression



 

 

Fig. 9.3. Support vector regression showing the linear ε-insensitive loss function “tube”
with width 2ε. Only those points lying outside the tube are considered support vectors.

SVMs can also lend themselves easily to the task of regression with only
a simple extension to the theory. In so doing, they allow for real-valued
targets to be estimated by modeling a linear function (see Fig. 9.3) in
feature space (see Section 9.1.5). The same concept of maximizing the
margin is retained but it is extended by a so-called loss function, which
can take on different forms. The loss function can be linear or quadratic
or, more usefully, allow for an insensitive region: for training data
points that lie within this insensitive range, no error is deemed to
have occurred. In a manner similar to the soft-margin classifier, errors are
accounted for by the inclusion of slack variables that tolerate data points
that violate this constraint to a limited extent. This then leads to a modified
set of constraints requiring the minimization of $\frac{1}{2}\|w\|^2 + C\sum_{i=1}^{m}(\xi_i + \xi_i^*)$
subject to $y_i - \langle w, x_i\rangle - b \leq \varepsilon + \xi_i$ and $\langle w, x_i\rangle + b - y_i \leq \varepsilon + \xi_i^*$
with $\xi_i, \xi_i^* \geq 0$ for $i = 1, \ldots, m$. The solution for the above QP problem is provided once
again by the use of Lagrangian duality theory; however, we have omitted a
full derivation for the sake of brevity (please see the tutorial by Smola and
Schölkopf [5] and the references therein). Figure 9.3 above shows only the

linear loss function, whereby the points that are further than |ε| from the
optimal line are apportioned weight in a linearly increasing fashion. There
are, however, numerous different types of loss function in the literature,
including the quadratic and Huber's loss functions, the latter being in fact
a hybrid or piecewise combination of quadratic and linear functions. Whilst
most of them include an insensitive region to facilitate sparseness of
solution, this is not always the case. Figure 9.4 shows the quadratic and


Fig. 9.4. Quadratic (a) and Huber (b) type loss functions with ε-insensitive regions.

Huber's loss functions. Hence, when training an SV-based regressor one
must optimize the box constraint C, the parameters of the kernel function
used and ε. Typically, this is achieved by performing a grid search over all
hyper-parameters during cross-validation.
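The grid search can be sketched as follows, again using scikit-learn (an assumed choice) with its SVR and GridSearchCV utilities; the parameter ranges and the synthetic data are purely illustrative.

    # Illustrative hyper-parameter grid search for support vector regression.
    import numpy as np
    from sklearn.svm import SVR
    from sklearn.model_selection import GridSearchCV

    rng = np.random.default_rng(0)
    X = rng.uniform(-3.0, 3.0, (100, 1))
    y = np.sin(X).ravel() + rng.normal(0.0, 0.1, 100)

    param_grid = {
        "C": [0.1, 1.0, 10.0],        # box constraint
        "epsilon": [0.01, 0.1, 0.5],  # width of the insensitive region
        "gamma": [0.1, 1.0, 10.0],    # Gaussian RBF kernel parameter
    }
    search = GridSearchCV(SVR(kernel="rbf"), param_grid, cv=5)
    search.fit(X, y)
    print(search.best_params_)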

9.1.5. Kernel functions

Often, real-world data is not readily separable using a linear hyperplane


but it may exhibit a natural underlying nonlinear characteristic. Kernel
mappings offer an efficient method of projecting input data into a higher
dimensional feature space where a linear hyperplane can successfully
separate classes. Kernel functions must obey Mercer’s Theorem, and
as such they offer an implicit mapping into feature space. This means
that the explicit mapping need not be computed, rather the calculation
of a simple inner-product is sufficient to facilitate the mapping. This
reduces the burden of computation significantly and in combination
with SVM’s inherent generality, greatly mitigates the so-called “curse of
dimensionality”. Furthermore, the input feature inner-product from (9.6)

can simply be substituted with the appropriate Kernel function to obtain


the mapping whilst having no effect on the Lagrangian optimization theory.
Hence, the relevant classifier function then becomes:

$$f(x_j) = \mathrm{sgn}\left[\sum_{i=1}^{n_{SVs}} y_i\alpha_i K(x_i, x_j) + b\right] \tag{9.7}$$

and for regression

$$f(x_j) = \sum_{i=1}^{n_{SVs}} (\alpha_i - \alpha_i^*)\, K(x_i, x_j) + b\,, \tag{9.8}$$

where nSVs denotes the number of support vectors, yi are the labels, αi
and αi* are the Lagrangian multipliers, b the bias, xi the support vectors
previously identified through the training process and xj the test data
vector. The use of Kernel functions transforms a simple linear classifier
into a powerful and general nonlinear classifier (or regressor). There are
a number of different Kernel functions available; here are a few popular
types [4]:

Linear function: $K(x_i, x_j) = \langle x_i, x_j\rangle + c$  (9.9)

Polynomial function: $K(x_i, x_j) = (\langle x_i, x_j\rangle + c)^d$  (9.10)

Gaussian radial basis function: $K(x_i, x_j) = \exp\!\left[\frac{-\|x_i - x_j\|^2}{2\sigma^2}\right]$  (9.11)

Spline function:
$K(x_i, x_j) = \prod_{k=1}^{10}\left(1 + x_{ik}x_{jk} + \tfrac{1}{2}x_{ik}x_{jk}\min(x_{ik}, x_{jk}) - \tfrac{1}{6}\min(x_{ik}, x_{jk})^3\right)$  (9.12)
where c is an arbitrary constant parameter (often c is simply set to zero or
unity), d is the degree of the polynomial Kernel and σ controls the width
of the Gaussian radial basis function. When experimenting with training
and cross-validation it is common practice to perform a grid search over
these Kernel parameters along with the box constraint, C, for the lowest
classification error within a given training set.
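For reference, the first three kernels in (9.9)-(9.11) can be written in a few lines of numpy; the default values of c, d and σ below are arbitrary illustrations.

    # Compact numpy versions of the linear, polynomial and Gaussian RBF kernels.
    import numpy as np

    def linear_kernel(xi, xj, c=0.0):
        return np.dot(xi, xj) + c

    def polynomial_kernel(xi, xj, c=1.0, d=3):
        return (np.dot(xi, xj) + c) ** d

    def gaussian_rbf_kernel(xi, xj, sigma=1.0):
        return np.exp(-np.sum((xi - xj) ** 2) / (2.0 * sigma ** 2))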

Fig. 9.5. Kernel functions generate implicit nonlinear mappings, enabling SVMs to
operate on a linear separating hyperplane in higher dimensional feature space.

9.2. Application: Biomedical Pattern Classification

Cardiovascular disease (CVD) is the leading cause of mortality in the


developed world [6]. In 2004 alone approximately 17 million people died
from some form of CVD (largely from stroke and myocardial infarction).
The World Health Organisation (WHO) suggests that this figure is
expected to exceed 23 million by the year 2030. There are a number
of established risk factors for CVD such as sex, age, tobacco smoking,
high blood pressure, blood serum cholesterol and the presence of diabetes
mellitus. Current methods for estimating the risk of a CVD event (such as
a myocardial infarction or a stroke) within an individual, rely on the use
of these factors in a so-called “risk calculator”. This calculator is based on
regression equations relating levels of individual risk factors to CVD events
in various prospective follow-up studies such as the Framingham Heart
study [7] or the Cox model study [8]. Such risk calculators, however, fail to
identify a significant minority of subjects who subsequently go on to develop
some form of CVD. Events typically occur as a result of arteriosclerosis
and/or atherosclerosis, inflammatory and degenerative conditions of the
arterial wall. Early on, changes occur at the cellular level but this leads
to changes in the mechanical properties of the arterial wall. Hence, the
possibility that biophysical measures of the mechanical properties of the
arterial wall may provide a measure of CVD risk has received a great
deal of attention. Currently, one of the most promising measurements
is arterial stiffness. Arteries naturally tend to stiffen with age and
premature stiffening may result from a combination of arteriosclerosis
and atherosclerosis. Additionally, arterial stiffening leads to systolic
hypertension and increased loading on the heart. Simple measurement of
Brachial (upper arm) blood pressure may not, however, reveal this condition
as central blood pressure can differ markedly from that measured in the
periphery. A number of studies [9, 10] indicate that large artery stiffness,
as measured by pulse wave velocity (see below) has proved to be a powerful
and independent predictor of CVD events, more closely related to CVD
risk than the traditional risk factors.

9.2.1. Pulse wave velocity


Determining arterial stiffness directly by simultaneous measurement of the
change in arterial diameter with pressure is technically challenging and the
most practical technique employed to measure stiffness is the estimation
of arterial Pulse Wave Velocity (PWV). PWV is the speed at which the
pressure pulse propagates through the arterial tree and is directly related
to arterial stiffness. Measuring PWV from the Carotid artery (in the neck)
to the Femoral artery (in the thigh) is the measurement that has been used
in most outcome studies to date. It includes the Aorta and large elastic
arteries that are most susceptible to age-related stiffening. Carotid-femoral
PWV is often determined by placement of a tonometer (a non-invasive
pressure sensor) over the Carotid and Femoral arteries. The time delay
between the pressure pulse arriving at the Carotid and the Femoral artery is measured, and the path length, estimated from the distance between the two sites of application of the sensors, is divided by this time delay. Hence, Pulse Wave Velocity is determined in m s−1.
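As a simple numerical illustration of this calculation (with invented figures), PWV is obtained by dividing the estimated path length by the measured transit time:

```python
# Invented example values: path length between the carotid and femoral
# measurement sites (m) and the measured pulse transit time (s).
path_length_m = 0.55
transit_time_s = 0.065
pwv = path_length_m / transit_time_s   # pulse wave velocity in m/s
print(f"PWV = {pwv:.1f} m/s")          # roughly 8.5 m/s for these figures
```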

9.2.2. Digital volume pulse analysis


Determination of PWV as described above, whilst non-invasive, involves
the subject undressing to allow access to the Femoral artery and requires
specialized equipment and a skilled technician. An alternative, simpler
and quicker technique would be of great advantage in screening for CVD.
We have previously proposed that the shape of the digital volume pulse
(DVP) may be used to estimate arterial stiffness [11–14]. The DVP can
be easily and rapidly acquired (without the need for a skilled technician)
by measuring absorption of infra-red light across the finger pulp (which
is technically termed photoplethysmography). This varies with red blood
Fig. 9.6. The Photoplethysmograph device used to acquire the DVP waveform attached
to the index finger.

density and hence with blood vessel diameter during the cardiac cycle.
Typically the DVP waveform (see Fig. 9.7) exhibits an initial systolic peak
followed by a diastolic peak. The time between these two peaks, called the
peak-to-peak time (PPT), is related to the time taken for the pressure wave
to propagate from the heart to the peripheral blood vessels and back again.
Thus PPT is related to PWV in the large arteries. This has led to the
development of the so-called stiffness index (SI), which is simply the subject height divided by the PPT and can be used as a crude estimate of PWV. In older subjects, however, and in subjects with premature arterial stiffening, the systolic and diastolic peaks in the DVP become difficult to distinguish, and SI cannot then be used to estimate arterial stiffness.

9.2.3. Study population and feature extraction

A group of 461 subjects were recruited from the local area of South East
London. None of the subjects had a previous history of cardiovascular
disease or were receiving heart medication. The subjects ranged from 16 to
81 years of age, with an average age of 50 and standard deviation of 13.6
years. The DVP waveform was measured for each of the subjects along
with their PWV and a number of other basic physiological measures such
as systolic and diastolic blood pressure, their height and weight. There are
various ways in which features can be derived from the DVP waveform.
Previous work in this field [13] has led to the selection of what we have
come to term Physiological Features [15]. Essentially, these are parameters
associated with the physiological properties of the aorta and arterial
characteristics in general. More recently, we have focused our attention
on exploiting features that do not assume an underlying physiological basis
but instead are selected by applying information theoretic approaches to
the DVP waveform and we refer to these as Signal-Based Features.

9.2.3.1. Physiological features

After a great deal of experimentation comparing the significance of many
of the physiological features, it was found that, in fact, a specific set of
four of these features gave the best classification of high or low PWV.
Interestingly, two of these features have been independently cited in the
literature [11, 13, 16] as having a bearing on cardiovascular pathology.
Dillon and Hertzman are credited [16] with first measuring the DVP in 1941,

Fig. 9.7. Digital Volume Pulse waveforms as extracted from the Photoplethysmograph device and their derivatives. Labeled to show various features, e.g. PPT, RI and CT.
and they observed that subjects with hypertension or arterial heart disease
exhibited an “increase in the crest time” compared to healthy subjects.
The Crest Time (CT), is the time from the foot of the DVP waveform to
its peak (as shown in Figs 9.7c and d) and it has proved to be a useful
feature for the classifier. Another useful feature is the Peak-to-Peak Time (PPT), defined as the
time between the first peak and the second peak or inflection point of
the DVP waveform (see Figs 9.7c and d). As mentioned previously in
the introduction, the second peak/inflection point on the DVP is generally
accepted to be due to reflected waves. So its timing would be related to
arterial stiffness and PWV. The definition of PPT depends on the DVP
waveform as its contour varies with subjects. When there is a second peak
as is the case with “Waveform A” in Fig. 9.7a, PPT is defined as time
between the two maxima. Hence, the time between the two positive to
negative zero-crossings of the derivative can be considered to be the PPT.
In some DVP waveforms, however, the second peak is not distinct as in
“Waveform B” in Fig. 9.7b. When this occurs, the time between the peak
of the waveform and the inflection point on the downward going slope of
the waveform (which is a local maximum of the first derivative, as shown in
Fig. 9.7d) is defined as the PPT. Another measurement extracted from the
DVP used in the classifier, is the so-called Reflection Index (RI), which is
the ratio of relative amplitudes of the first and second peaks (see Figs 9.7a
and b). These four features (PPT, CT, RI and SI) were then empirically
found [14] to be amongst the best physiologically motivated features for
successful classification of PWV.
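The sketch below illustrates one way the CT, PPT, RI and SI features described above might be extracted from a sampled DVP pulse using the zero-crossings of its first derivative. It assumes a uniformly sampled, clean waveform with two distinct peaks (i.e. "Waveform A" in Fig. 9.7) and is a simplified illustration rather than the exact preprocessing used in this study.

```python
import numpy as np

def dvp_features(dvp, fs, height_m):
    """Crude CT, PPT, RI and SI estimates from a single DVP pulse.

    dvp      : 1-D array holding one pulse, with the foot of the wave at index 0
    fs       : sampling frequency in Hz
    height_m : subject height in metres (used for the stiffness index)
    """
    d = np.diff(dvp)
    # Positive-to-negative zero-crossings of the derivative = local maxima of the pulse.
    peaks = np.where((d[:-1] > 0) & (d[1:] <= 0))[0] + 1
    if len(peaks) < 2:
        raise ValueError("no distinct second peak; the inflection-point rule would be needed")
    p1, p2 = peaks[0], peaks[1]
    ct = p1 / fs                 # crest time: foot of the wave to the first peak (s)
    ppt = (p2 - p1) / fs         # peak-to-peak time (s)
    ri = dvp[p2] / dvp[p1]       # reflection index: second-to-first peak amplitude ratio
    si = height_m / ppt          # stiffness index, a crude PWV estimate (m/s)
    return ct, ppt, ri, si
```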

9.2.3.2. Signal-based features

These features were extracted by applying established signal processing
techniques to reduce the dimensionality of the waveform to a standardized
set of features without making any assumptions about the physical
generation of the waveform. After much experimentation it was found that
a certain range of the eigenvalues of the covariance matrix (formed by the
autocorrelation of the DVP waveform with its mean removed) outperformed
all the other features and methods by some margin. Essentially, the
covariance matrix, A, formed from the DVP waveform (one per subject) is
decomposed using Eigenvalue Decomposition such that

A = VΣV−1 . (9.13)
Here V is the matrix of orthonormal eigenvectors of A and Σ its
eigenvalues, where Σ = diag{σ1 , σ2 , . . . , σn }. Specifically, the range of
eigenvalues Σ̂ = {σ3 , . . . , σ9 } inclusive was found, during experimentation,
to give the best results. It is thought that the first two eigenvalues, σ1
and σ2 , primarily represent the signal subspace data, which is essentially
common to all of the waveforms in the database, and hence discarding
these features enhanced the separability of the data and its subsequent
classification.
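A minimal sketch of this signal-based feature extraction is given below: the mean is removed from the DVP waveform, a covariance-type matrix is formed and eigendecomposed, and the eigenvalues σ3–σ9 are retained. The construction of the matrix from lagged segments of the waveform (and the segment length of 16) is an illustrative assumption; the exact construction used in the study is described in [15].

```python
import numpy as np

def eigen_features(dvp, dim=16):
    """Return the eigenvalues sigma_3..sigma_9 of a covariance matrix built
    from the mean-removed DVP waveform (lagged-segment construction assumed)."""
    x = np.asarray(dvp, dtype=float)
    x = x - x.mean()                                   # remove the mean
    # Stack overlapping length-`dim` segments of the waveform as rows.
    segs = np.array([x[i:i + dim] for i in range(len(x) - dim + 1)])
    A = segs.T @ segs / segs.shape[0]                  # covariance / autocorrelation matrix
    sigma = np.linalg.eigh(A)[0][::-1]                 # eigenvalues, sorted descending
    return sigma[2:9]                                  # sigma_3 .. sigma_9
```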

9.2.4. Results

This section contains the best results obtained after thorough
experimentation with the Support Vector classifier. Results are presented as percentages: the overall (or "total") rate, then the sensitivity ("true positive" rate), followed by the specificity ("true negative" rate). In actual
fact, a wide range of possible feature sets and combinations exist that
performed the required tasks satisfactorily; however, for ease of readability
and conciseness we have selected those which we found to perform the best
overall. The results presented here are based on the best two sets of each
of the physiologically motivated features, two sets of signal subspace-based
features and finally the two best combinations of each type of feature
set. During experimentation, the two sets of physiologically-based features
that performed the best were found to be: P1 = {CT, PPT} and
P2 = {CT, PPT, SI}. The two sets of signal subspace-based features that
performed the best were also found accordingly, Σ1 = {σ3 , . . . , σ9 } and
Σ2 = {σ2 , . . . , σ9 }. The original cohort contained 461 subjects, both their
complete DVP waveform data and PWV measurements were available to
this study. The mean PWV value of our cohort was found to be around
10 ms−1 , hence, a binary target label was determined according to this
threshold. The cohort was gapped to remove those subjects with PWV
of between 9 and 11 ms−1 to avoid ambiguity of the target classes. The
remaining 315 records were used to train and test the classifier using a
three-fold cross-validation regimen, where 90% of the data were used for
training and 10% for testing in any given fold. A binary classifier was
developed using the Ohio State University SVM toolbox for Matlab [17]; the
classifier was trained and tested using a number of different kernel functions.
The so-called “ν-SVM” type of classifier was employed and differs from
conventional C-SVM only in the way the box constraint is implemented.
Instead of having an infinite range like that of the value of C, ν varies
Table 9.1. SVM classification rates % and (SD) for various data sets using GRBF kernel.
Data sets
Physiological Eigenvalues Combinations
P1 P2 Σ1 Σ2 Σ1+P1 Σ1+P2
Total 84.0 (0.5) 84.0 (0.5) 85.1 (1.4) 85.3 (1.6) 86.1 (0.9) 87.5 (0.0)
Sens 90.3 (2.0) 88.2 (2.1) 90.3 (3.3) 90.3 (4.0) 86.7 (1.4) 87.5 (0.0)
Spec 77.4 (0.5) 79.9 (1.0) 80.2 (2.8) 80.5 (3.4) 85.3 (2.5) 87.5 (0.0)

between the limits zero and one, thus simplifying the grid search somewhat.
Experimentation revealed that the Gaussian RBF kernel performed as well as or better than the others, and hence the results in Table 9.1 are based
on this kernel (the performance of the other kernels is compared below in
Table 9.2). After performing thorough model hyper-parameter grid searches
(see Fig. A.1 in Appendix A), results were averaged from a “block” of
nine individual results obtained from a range of both constraint factor,
ν, and GRBF width, γ. This technique was applied to all the results to
mitigate against over-training of the model parameters, ensuring a more
general classifier and more realistic results. As shown in Table 9.1, the
SVM method using the physiological feature set P1 alone, gives a fairly
high degree of classification accuracy, with a significantly high sensitivity of
90.3% achieved. There was a slightly lower result of only 77.4% specificity.
Hence, the overall average successful classification rate becomes 84.0%.
The results for the signal subspace-based features, are better still, showing
a distinct improvement in the specificity when compared with those of
the physiologically motivated features. It is readily possible to achieve
sensitivities in the region of 90%, with specificities of over 80%, bringing
overall classification to 85.3%. Finally, combinations of both feature sets
were tested and it was found that two pairs gave good results whilst the
other two combinations were less effective. In fact, the combinations of
Σ1+P1 and Σ1+P2 gave the best results overall. The latter set gave an
overall classification rate of 87.5%, with an equal rate for both sensitivity
and specificity. This is definitely the best classification rate achieved in this
study so far and is a very high classification rate for PWV based solely on
features extracted from the DVP waveform. Further tests were performed
with different kernel functions to compare with those of the GRBF kernel;
linear and second and third order polynomial kernels were tested but the
GRBF kernel consistently outperformed them for all of the data sets (see
Table 9.2). For a complete treatment of this work please see Alty et al [6].
Table 9.2. Overall SVM classification rates % and (SD) for various Kernel functions for
each data set.
Data sets
Physiological Eigenvalues Combinations
Kernel P1 P2 Σ1 Σ2 Σ1+P1 Σ1+P2
Linear 84.0 (0.5) 83.3 (0.0) 82.3 (0.9) 80.9 (0.5) 84.4 (0.0) 85.8 (0.5)
Poly2 83.9 (0.8) 84.0 (0.5) 83.9 (1.4) 84.5 (1.1) 86.0 (1.2) 86.8 (0.5)
Poly3 83.7 (0.9) 83.9 (1.1) 85.0 (0.9) 84.6 (1.4) 86.1 (0.9) 86.5 (0.0)
GRBF 84.0 (0.5) 84.0 (0.5) 85.1 (1.4) 85.3 (1.6) 86.1 (0.9) 87.5 (0.0)

9.3. Application: Hand-written Graffiti Recognition

In this section, the problem of recognizing one-stroke hand-written graffiti
[18–20] is handled by the SVRs [1, 2]. The hand-written graffiti include
digits zero to nine and three commands, i.e. backspace, carriage return and
space, which are shown in Table 9.3. Some feature points are extracted from
each graffiti for the training of the SVRs. With the feature points as the
input, an SVR-based recognizer is proposed to recognize the hand-written
graffiti. The recognition performance of the proposed SVR-based graffiti
recognizer, with various kernels (linear function, spline function, radial basis
function and polynomial function) and parameters, is investigated.

Table 9.3. Hand-written one-stroke digits and commands. The dot denotes the starting point of the character. (The stroke drawings are not reproduced here.)
Class No.  Class Name       Class No.  Class Name
1          0(a)             9          6
2          0(b)             10         7
3          1                11         8(a)
4          2                12         8(b)
5          3                13         9
6          4                14         Backspace
7          5(a)             15         Carriage Return
8          5(b)             16         Space
9.3.1. Data acquisition and feature extraction

The features of the hand-written graffiti play an important role in the
recognition process. By employing meaningful features which represent
the characteristic of the hand-written graffiti, it is possible to achieve a
high recognition rate even with a simple recognizer structure. In this
application, 10 sample points of a hand-written graffiti will be obtained as
the feature vector, which will serve as the input of the SVR-based graffiti
recognizer for the training and recognition processes.

Fig. 9.8. Written trace of digit “2” (solid line) and 10 feature points characterized by
coordinate (ϕk , βk ) (indicated by dots).

The following approach [18–20] is employed to obtain the feature vector
of a graffiti. It is assumed that a graffiti is drawn in a square drawing
area with dimension of ϕmax × βmax (in pixels). Each point in the square
drawing area is characterized by a coordinate (ϕ, β) where 0 ≤ ϕ ≤ ϕmax
and 0 ≤ β ≤ βmax . The bottom left corner is considered as the origin
denoted by the coordinate (0, 0).
Figure 9.8 illustrates a hand-written digit “2” with trace in solid
line and 10 sampled points (denoted by dots) taken in uniform distance
characterized by coordinates (ϕk , βk ), k = 1, 2, · · · , 10. By using the
coordinate of the points to form the feature vector, we have 20 numerical
values as the input for the SVR-based graffiti recognizer. Theoretically, it is
feasible and the training of the SVRs will be done in the same way. However,
the complexity of the SVR can be reduced if a smaller number of feature
points is employed and thus it is possible to reduce the implementation
complexity and cost of the hand-written graffiti recognizer.
In order to take more information about the graffiti and reduce the
number of feature points, the following algorithm is proposed. Denote
the feature vector as ρ = [ρ1 ρ2 · · · ρ10 ]. The first five points, i.e.
(ϕk , βk ), k = 1, 3, 5, 7 and 9, taken alternatively are converted to five
numerical values, respectively, using the formula ρk = ϕk ϕmax + βk . The
other five points, (ϕk , βk ), k = 2, 4, 6, 8 and 10, are converted to another
five numerical values, respectively, using the formula ρk = βk βmax + ϕk .
The second formula is a kind of transformation to obtain the feature points
simply by rotating the square writing area by 90 degrees. The main reason
for transforming the feature points is to enlarge the difference between each
graffiti, particularly for those graffiti resembling each other, to improve the recognition performance. It can be imagined that the first formula looks at the graffiti from the x-axis direction while the second one looks from the y-axis direction. Consequently, as some graffiti look like others in the x-axis but may not in the y-axis, the feature vectors for different
graffiti will be situated far apart from each other in the feature space and
benefit the recognition process. Different transformation techniques may
lead to different recognition results. In the following, as more than one
feature vector will be used, an index i is introduced to denote the feature
vector number. Hence, we have the feature vector in the form of ρi =
[ρi1 ρi2 · · · ρi10 ].
It should be noted that the feature vector ρi provides the coordinate
information of the graffiti; it is thus very sensitive to the position and size
of the graffiti appearing in the square writing area. For example, the feature
vector of a small size of digit “2” in the top right corner of the square writing
area will be very different from a large size of digit “2” right in the middle.
In order to reduce the effect of the position and size information of the
graffiti to the recognition performance, a normalization process is proposed
and defined as follows:
xi = ρi / ∥ρi∥   (9.14)
where ∥ · ∥ denotes the l2 norm. The feature vector xi is taken as the input
for the SVR-based graffiti recognizer.
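The feature construction just described can be summarized by the following sketch: the odd-numbered sample points are mapped with ρk = ϕk·ϕmax + βk, the even-numbered points with ρk = βk·βmax + ϕk, and the resulting vector is normalized as in (9.14). The drawing-area dimensions used here (160 × 160 pixels) and the assumption that the ten uniformly spaced points are already available are illustrative.

```python
import numpy as np

def graffiti_feature_vector(points, phi_max=160.0, beta_max=160.0):
    """Build the normalized feature vector x_i from 10 sampled stroke points.

    points : array of shape (10, 2) holding (phi_k, beta_k), k = 1..10,
             sampled at uniform distance along the written trace.
    """
    pts = np.asarray(points, dtype=float)
    rho = np.empty(10)
    for k in range(10):                      # k = 0..9 corresponds to point k + 1
        phi_k, beta_k = pts[k]
        if k % 2 == 0:                       # points 1, 3, 5, 7, 9
            rho[k] = phi_k * phi_max + beta_k
        else:                                # points 2, 4, 6, 8, 10
            rho[k] = beta_k * beta_max + phi_k
    return rho / np.linalg.norm(rho)         # l2 normalization, cf. (9.14)
```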

9.3.2. SVR-based recognizer


An SVR-based graffiti recognizer with the block diagram shown in Fig.
9.9 is proposed for recognition of hand-written graffiti. Referring to
Fig. 9.9, it consists of M SVR-based sub-recognizers. In this example,
16 graffiti as shown in Table 9.3 are to be recognized, hence, we have
M = 16. The SVR-based graffiti recognizer takes the feature vector xi as
an input and then distributes it to all sub-recognizers for further processing.
Corresponding to the sub-recognizer j (j = 1, 2, · · · , 16), an output vector
yij ∈ ℜ10 will be produced. All output vectors yij are fed to the class
determiner simultaneously for determination of the most likely input graffiti
indicated by an integer in the output. The integer from the output of the
class determiner is in range of 1 and 16 indicating the class label of the
possible input graffiti as shown in Table 9.3.

Fig. 9.9. An SVR-based recognizer for hand-written graffiti.

The SVR-based graffiti recognizer shown in Fig. 9.9 consists of 16
SVR-based sub-recognizers. Each SVR-based sub-recognizer is constructed
from 10 SVRs (since 10 feature points are taken as input) as shown in Fig.
9.10. Denote the 10 SVRs as SVR k, the normalized feature vector xi =
[xi1 xi2 · · · xi10 ] and the output vector yij = [yij1 yij2 · · · yij10 ].
As it can be seen from Fig. 9.10, the SVR k takes xik as an input and
produces yijk as an output. Defining the target output of the SVR k as
its input, i.e. xik , the SVR k is trained to minimize the difference between
actual and target outputs according to the chosen loss function. In other
words, the SVR k is trained to reproduce its input.
The sub-recognizer j, (j = 1, 2, · · · , 16), will be trained using the
normalized feature vector xi corresponding to the j-th graffiti. For example,
the sub-recognizer 1 is trained using the feature vectors corresponding to the
0(a) (graffiti “0” drawn from left to right as shown in Table 9.3). According
to the training data and the training objective, the input-output difference
of the SVR k will be smaller when the input xik corresponds to graffiti j than when it corresponds to other graffiti. This characteristic of the trained SVR
k offers a nice property to make the graffiti recognition possible by the class
determiner in the subsequent stage.
The class determiner takes all yij as input. To determine the possible
input graffiti, the class determiner computes a scalar similarity value sij
Fig. 9.10. Structure of SVR-based sub-recognizer.

for each SVR-based sub-recognizer according to the following formula.


sij = ∥ xi − yij / ∥yij∥ ∥ ∈ [0, 1]   (9.15)
The similarity value sij indicates how closely the input and output of the SVR-based sub-recognizer resemble each other: a value of sij closer to zero indicates a higher level of similarity, and a value closer to one indicates a lower one. Based on the similarity values, the class determiner makes the recognition decision by picking the SVR-based sub-recognizer which produces the smallest value of sij and giving the value of the index j as the output. The index j appearing at the output indicates that graffiti j is the most likely input graffiti.
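A minimal sketch of this recognizer structure is given below, using scikit-learn's SVR in place of the SVRs described in this section: each sub-recognizer holds ten scalar SVRs trained to reproduce the corresponding feature of its own class, and the class determiner picks the sub-recognizer with the smallest sij of (9.15). The kernel and parameter choices are placeholders, and the sub-recognizers are assumed to be ordered by class label 1 to 16.

```python
import numpy as np
from sklearn.svm import SVR

class SubRecognizer:
    """Ten scalar SVRs, each trained to reproduce one feature of one graffiti class."""

    def __init__(self, **svr_params):
        self.svrs = [SVR(kernel="rbf", epsilon=0.05, **svr_params) for _ in range(10)]

    def fit(self, X_class):
        # X_class: (n_samples, 10) normalized feature vectors of a single class.
        for k, svr in enumerate(self.svrs):
            svr.fit(X_class[:, k:k + 1], X_class[:, k])   # target output = input
        return self

    def reproduce(self, x):
        # x: (10,) normalized feature vector; returns the output vector y_ij.
        return np.array([svr.predict(np.array([[x[k]]]))[0]
                         for k, svr in enumerate(self.svrs)])

def classify(x, sub_recognizers):
    """Class determiner: return the class label (1..16) with the smallest s_ij."""
    scores = []
    for rec in sub_recognizers:
        y = rec.reproduce(x)
        scores.append(np.linalg.norm(x - y / np.linalg.norm(y)))   # cf. (9.15)
    return int(np.argmin(scores)) + 1
```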

9.3.3. Simulation results


The proposed SVR-based recognizer under different settings will be trained
and tested in this section and the recognition performance in terms of
recognition rate will be investigated. The training and testing data sets
for each graffiti consist of 100 and 50 patterns, respectively. For the
purposes of training, we consider four kernel functions, namely, linear,
spline, polynomial and radial basis functions, for the SVRs (see (9.9), (9.10),
(9.11) and (9.12) in section 9.1.5).
To perform the training, we consider the SVRs with different kernel
functions, c, σ, loss functions, ε and C as shown in Table 9.4. The
recognition performances of the SVR-based graffiti recognizers under
different settings are shown in the tables in the appendices at the end
of this chapter. The value of ε is chosen to be zero for quadratic loss
function. It is found experimentally that the recognition performance of
the SVR-based graffiti recognizers with ε-insensitive loss function subject
to ε = 0.01 or ε = 0.1 is not as good as that with ε = 0.05. Thus, the tables
(in the appendices) showing the recognition rate for ε = 0.01 and ε = 0.1
are omitted. Referring to the tables, the training and testing recognition
rates for each sub-recognizer j, j = 1, 2, · · · , 16, and the average training
and testing recognition rates of the SVR-based recognizer are illustrated.
The worst recognition rates given by the sub-recognizers and the best
average recognition rate are highlighted in bold under different settings
and parameters.

Table 9.4. Settings and parameters of SVRs.
Kernel functions   Linear (c = 0), spline, polynomial, radial basis kernel functions
c, σ               1, 2, 5, 10
Loss functions Quadratic (ε = 0), ε-insensitive loss functions
ε 0.01, 0.05, 0.1 (for ε-insensitive loss functions)
C 0.01, 0.1, 1, 10, 100, ∞

To summarize the results, the SVR-based graffiti recognizers with both
average training and testing recognition rates over 99% under different
settings and parameters are tabulated in Table 9.5. In total, we have
23 cases satisfying the criterion. Comparing the cases among the kernel
functions without parameter c or σ, namely, linear and spline kernel
functions, the SVR-based graffiti recognizer with spline kernel function
performs better in general in the sense of higher average recognition
rate. Comparing the cases among the kernel functions with parameter
c or σ, namely, polynomial and radial basis functions, the radial basis
function demonstrates a better potential to produce acceptable recognition
performance. Referring to Table 9.5, the radial basis function offers 13 cases
with average recognition rate over 99% while only 5 cases for the polynomial
function. It can be seen from the table that the recognition performance
offered by the radial basis function is less sensitive to the parameters c, σ
and C.
In general, the kernel functions with extra parameter c or σ are able to
offer a better recognition performance due to the extra parameter giving one
more degree of freedom for transformation of the feature space. It is noted
that the average training recognition rate is always higher than the average
Table 9.5. Recognition performance of SVR-based graffiti recognizers with the average
recognition rate over 99% for both training and testing data.
Case Kernel♭ Loss Function♯ c, σ C Average (%)† Worst (%)∗
1 Linear Quadratic 0 0.1 99.2500/99.0000 93(5)/96(1,11,12)
2 Linear ϵ-insensitive 0 0.01 99.0000/99.0000 92(5)/96(1,11)
3 Spline Quadratic – 0.1 99.2500/99.0000 93(5)/96(1,11,12)
4 Spline ϵ-insensitive – 0.1 99.5625/99.1250 95(5)/94(1)
5 Poly Quadratic 1 0.1 99.2500/99.0000 93(5)/96(1,11,12)
6 Poly Quadratic 2 0.1 99.3750/99.0000 94(5)/94(12)
7 Poly ϵ-insensitive 1 0.1 99.6250/99.0000 96(5)/94(1)
8 Poly ϵ-insensitive 2 0.01 99.3750/99.2500 94(5)/96(1,11)
9 Poly ϵ-insensitive 2 0.1 99.8125/99.0000 98(5)/94(8)
10 RB Quadratic 2 1 99.3125/99.0000 93(5)/96(1,11,12)
11 RB Quadratic 5 10 99.3750/99.0000 94(5)/94(1)
12 RB Quadratic 10 10 99.2500/99.0000 93(5)/96(1,11,12)
13 RB ϵ-insensitive 1 0.01 99.1875/99.0000 93(5)/96(1,11)
14 RB ϵ-insensitive 1 0.1 99.5625/99.1250 96(5)/96(1,11)
15 RB ϵ-insensitive 2 0.01 99.1250/99.0000 92(5)/96(1,11)
16 RB ϵ-insensitive 2 0.1 99.1875/99.0000 93(5)/96(1,11)
17 RB ϵ-insensitive 5 0.01 99.1250/99.0000 92(5)/96(1,11)
18 RB ϵ-insensitive 5 0.1 99.0625/99.0000 92(5)/96(1,11)
19 RB ϵ-insensitive 5 1 99.2500/99.1250 93(5)/96(1,11)
20 RB ϵ-insensitive 10 0.01 99.1250/99.0000 92(5)/96(1,11)
21 RB ϵ-insensitive 10 0.1 99.0625/99.0000 92(5)/96(1,11)
22 RB ϵ-insensitive 10 1 99.0625/99.0000 92(5)/96(1,11)
23 RB ϵ-insensitive 10 10 99.5000/99.0000 95(5)/94(1)
♭ Poly and RB stand for polynomial and radial basis functions, respectively.
♯ When ϵ-insensitive loss function is employed, ε = 0.05 is used.
† Average recognition rate in the format of A/B. A is for the training data and B is for the testing data.
∗ The worst recognition rate in the format of A(a)/B(b). A is for the training data and B is for the testing data. a and b denote the class labels of the graffiti.

testing recognition rate. The highest average training recognition rate is
99.8125% (case 9) while the highest testing recognition rate is 99.2500%
(case 8). Interestingly, it is found that both cases are for the SVR-based
graffiti recognizer with polynomial function. It is noted that the worst
training recognition rate is 92% for the graffiti with the class label 5 while
the worst testing recognition rate is 94% for the graffiti with the class
labels 1, 11 and 12. The graffiti labels 1, 5, 11 and 12 correspond to the digits 0, 3 and 8 (the digit 8 appearing in both writing directions). They have
a similar characteristic in the shapes leading to the feature vectors looking
like each other. Thus, it makes the recognition process more difficult and
degrades the recognition performance. The result gives a clue to improve
the recognition performance by employing a better transformation to obtain
the feature vectors such that they are far apart from each other in the
feature space.
Among the 23 cases, we now identify the one which offers the
best recognition performance. We define the overall recognition rate as the
average of the average training and testing recognition rates. Considering
the cases where the overall recognition rate is over 99.3000% or the average
testing recognition rate is equal to or greater than 99.1250%, we come up
with 6 cases, i.e. cases 4, 7, 8, 9, 14 and 19. By considering the worst training
and testing recognition rates, the SVR-based graffiti recognizer of case 14
offers the best performance. The worst training and testing recognition
rates are both 96% which is the highest compared to other chosen cases.
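As a small worked example of this selection rule, the snippet below computes the overall recognition rate (the mean of the average training and testing rates from Table 9.5) for the six shortlisted cases.

```python
# Average training / testing recognition rates (%) from Table 9.5.
cases = {
    4:  (99.5625, 99.1250),
    7:  (99.6250, 99.0000),
    8:  (99.3750, 99.2500),
    9:  (99.8125, 99.0000),
    14: (99.5625, 99.1250),
    19: (99.2500, 99.1250),
}
overall = {c: (train + test) / 2.0 for c, (train, test) in cases.items()}
for c, rate in sorted(overall.items(), key=lambda kv: -kv[1]):
    print(f"case {c:2d}: overall recognition rate = {rate:.5f}%")
```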
The number of support vectors corresponding to cases 4, 7, 8, 9, 14
and 19 is tabulated in Table 9.6. The number of support vectors for each
sub-recognizer j, j = 1, 2, · · · , 16, is the sum of the support vectors of
the 10 SVRs shown in Fig. 9.10. It can be seen that the sub-recognizer 5
corresponding to digit “3” produces the largest number of support vectors
while the sub-recognizer 14 corresponding to the command “backspace”
produces the smallest number for all cases. Of the cases 4, 7, 8, 9, 14 and 19,
case 8 produces the highest average number of support vectors, i.e. 251.9375
while case 9 produces the smallest average number of support vectors,
i.e. 143.0000. From the point of view of recognizer complexity, the
SVR-based graffiti recognizer for case 9 with the smallest average number
of support vectors is recommended.

9.4. Conclusion

This chapter has given a basic introduction to SVMs. Some properties
and theories have been presented to support the design of SVMs dealing
with some practical problems. Two applications have been considered in
this chapter, namely, classification of heart disease risk and recognition of
hand-written characters. On the application of the classification of heart
disease, a classifier has been developed using the support vector classifiers
to accurately estimate the risk of developing heart disease by using the
signal derived from a finger-based pulse oximeter. On the application of
recognition of hand-written characters, a recognizer has been developed
using the support vector regressors. Simulation results have been shown to
illustrate the effectiveness of the proposed approaches.
Table 9.6. Number of support vectors for cases 4, 7, 8, 9, 14 and 19.
Sub-Recognizer Case 4 Case 7 Case 8 Case 9 Case 14 Case 19
1 145 160 282 93 184 228
2 340 356 295 286 375 420
3 87 89 164 67 99 94
4 336 344 374 285 372 409
5 342 358 406 292 387 442
6 95 103 216 58 113 116
7 306 325 351 261 345 389
8 158 162 187 146 172 172
9 112 116 214 80 133 157
10 133 152 246 89 173 225
11 201 211 230 168 231 256
12 216 229 282 178 251 281
13 218 232 358 176 262 301
14 17 17 120 13 22 6
15 80 80 130 71 89 76
16 34 41 176 25 62 74
Average 176.2500 185.9375 251.9375 143.0000 204.3750 227.8750

Acknowledgements

The work described in this chapter was supported by the Division of
Engineering, King’s College London. The authors would like to thank
the patients of St. Thomas’ Hospital who were involved in this study for
allowing their data to be collected and analyzed.

References

[1] V. N. Vapnik, The Nature of Statistical Learning Theory. (Springer–Verlag,
Berlin, 2000).
[2] C. J. C. Burges, A tutorial on support vector machines for pattern
recognition, Data Mining and Knowledge Discovery. 2(2), 121–167, (1998).
[3] S. R. Gunn, Support vector machines for classification and regression, ISIS
Technical Report. 14, (1998).
[4] N. Cristianini and J. Shawe-Taylor, An Introduction to Support Vector
Machines. (Cambridge University Press, Cambridge, 2000).
[5] A. J. Smola and B. Schölkopf, A tutorial on support vector regression,
Statistics and Computing. 14(3), 199–222, (2004).
[6] S. R. Alty, N. Angarita-Jaimes, S. C. Millasseau, and P. J. Chowienczyk,
Predicting arterial stiffness from the digital volume pulse waveform, IEEE
Transactions on Biomedical Engineering. 54(12), 2268–2275, (2007).
[7] K. M. Anderson, P. M. Odell, P. W. F. Wilson, and W. B. Kannel,
Cardiovascular disease risk profiles, American Heart Journal. 121(1),
293–298, (1991).
[8] S. M. Grundy, R. Pasternak, P. Greenland, S. Smith Jr, and V. Fuster,
AHA/ACC scientific statement: Assessment of cardiovascular risk by use
of multiple-risk-factor assessment equations: a statement for healthcare
professionals from the American Heart Association and the American
College of Cardiology, Journal of the American College of Cardiology. 34
(4), 1348–1359, (1999).
[9] J. Blacher, R. Asmar, S. Djane, G. M. London, and M. E. Safar, Aortic pulse
wave velocity as a marker of cardiovascular risk in hypertensive patients,
Hypertension. 33(5), 1111–1117, (1999).
[10] P. Boutouyrie, A. I. Tropeano, R. Asmar, I. Gautier, A. Benetos,
P. Lacolley, and S. Laurent, Aortic stiffness is an independent predictor
of primary coronary events in hypertensive patients: a longitudinal study,
Hypertension. 39(1), 10–15, (2002).
[11] P. J. Chowienczyk, R. P. Kelly, H. MacCallum, S. C. Millasseau,
T. L. G. Andersson, R. G. Gosling, J. Ritter, and E. Anggard,
Photoplethysmographic assessment of pulse wave reflection: blunted
response to endothelium-dependent beta2-adrenergic vasodilation in type
II diabetes mellitus, Journal of the American College of Cardiology. 34(7),
2007–2014, (1999).
[12] S. C. Millasseau, F. G. Guigui, R. P. Kelly, K. Prasad, J. R. Cockcroft, J. M.
Ritter, and P. J. Chowienczyk, Noninvasive assessment of the digital volume
pulse: comparison with the peripheral pressure pulse, Hypertension. 36(6),
952–956, (2000).
[13] S. C. Millasseau, R. P. Kelly, J. M. Ritter, and P. J. Chowienczyk,
Determination of age-related increases in large artery stiffness by digital
pulse contour analysis, Clinical Science. 103(4), 371–378, (2002).
[14] N. Angarita-Jaimes. Support Vector Machines for Improved Cardiovascular
Disease Risk Prediction. Master’s thesis, Department of Electronic
Engineering, Kings College London, (2005).
[15] N. Angarita-Jaimes, S. R. Alty, S. C. Millasseau, and P. J. Chowienczyk,
Classification of aortic stiffness from eigendecomposition of the digital
volume pulse waveform. 2, 2416–2419, (2006).
[16] J. B. Dillon and A. B. Hertzman, The form of the volume pulse in the finger
pad in health, arteriosclerosis and hypertension, American Heart Journal.
21, 172–190, (1941).
[17] J. Ma, Y. Zhao, and S. Ahalt, OSU SVM Classifier Matlab Toolbox (version
3.00). (Ohio State University, Columbus, USA, 2002).
[18] H. K. Lam and F. H. F. Leung, Digit and command interpretation
for electronic book using neural network and genetic algorithm, IEEE
Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics. 34
(6), 2273–2283, (2004).
[19] K. F. Leung, F. H. F. Leung, H. K. Lam, and S. H. Ling, On interpretation of
graffiti digits and characters for eBooks: Neural-fuzzy network and genetic
algorithm approach, IEEE Transactions on Industrial Electronics. 51(2),
464–471, (2004).
[20] H. K. Lam and J. Prada, Interpretation of handwritten single-stroke graffiti
using support vector machines, International Journal of Computational
Intelligence and Applications. 8(4), 369–393, (2009).
Appendix A. Grid search shown in three dimensions for SVM-based DVP classifier with Gaussian Radial Basis function kernel

Fig. A.1. Grid search for optimal hyperparameters during training, showing values of γ and ν versus overall classification rate.
Appendix B. Tables of recognition rate for SVR-based graffiti recognizer with linear kernel function

Table B.1. Training and testing results with linear kernel function, quadratic loss
function and various values of C.
Recognition rate (%) (Training/Testing phase)
Class No. C = 0.01 C = 0.1 C=1 C = 10 C = 100 C = Inf
1 100/96 100/96 100/96 100/88 98/78 98/76
2 100/100 100/100 100/100 99/94 98/70 98/64
3 100/100 100/100 100/100 98/96 93/90 92/88
4 97/100 99/100 99/100 99/100 96/100 96/100
5 93/100 93/100 95/100 98/100 100/100 100/100
6 100/98 100/98 100/98 98/94 94/86 93/84
7 98/100 98/100 100/100 98/100 90/100 86/96
8 100/98 100/98 100/94 100/86 99/76 99/76
9 100/100 100/100 100/100 100/92 94/86 94/82
10 99/100 99/100 99/100 97/80 75/36 55/34
11 99/96 99/96 99/98 99/100 97/98 97/98
12 100/96 100/96 100/96 100/96 100/96 100/96
13 100/98 100/100 100/100 100/100 100/100 100/100
14 100/100 100/100 100/100 100/100 97/66 82/40
15 100/100 100/100 100/100 100/100 99/96 96/96
16 100/100 100/100 100/100 100/92 100/78 100/76
Average 99.1250/ 99.2500/ 99.5000/ 99.1250/ 95.6250/ 92.8750/
98.8750 99.0000 98.8750 94.8750 84.7500 81.6250
Table B.2. Training and testing results with linear kernel function, ϵ-insensitive loss
function (ϵ = 0.05) and various values of C.
Recognition rate (%) (Training/Testing phase)
Class No. C = 0.01 C = 0.1 C=1 C = 10 C = 100 C = Inf
1 100/96 100/90 97/70 100/98 100/98 100/98
2 100/100 100/100 100/100 100/100 100/100 100/100
3 100/100 100/100 31/22 100/100 100/100 100/100
4 97/100 99/100 97/100 100/100 100/100 100/100
5 92/100 96/100 97/98 98/100 98/100 98/100
6 99/98 86/96 100/98 100/98 100/98 100/98
7 98/100 99/100 69/98 91/94 91/94 91/94
8 100/98 100/98 98/70 99/74 99/74 99/74
9 100/100 100/100 97/100 100/98 100/98 100/98
10 99/100 99/100 98/96 100/100 100/100 100/100
11 99/96 99/98 91/80 99/98 99/98 99/98
12 100/98 99/84 100/96 100/96 100/96 100/96
13 100/98 100/100 99/88 100/100 100/100 100/100
14 100/100 100/100 100/100 100/100 100/100 100/100
15 100/100 100/100 100/100 100/100 100/100 100/100
16 100/100 100/100 100/98 100/94 100/94 100/94
Average 99.0000/ 98.5625/ 92.1250/ 99.1875/ 99.1875/ 99.1875/
99.0000 97.8750 88.3750 96.8750 96.8750 96.8750
Appendix C. Tables of recognition rate for SVR-based graffiti recognizer with spline kernel function

Table C.1. Training and testing results with spline kernel function, quadratic loss
function and various values of C.
Recognition rate (%) (Training/Testing phase)
Class No. C = 0.01 C = 0.1 C=1 C = 10 C = 100 C = Inf
1 100/96 100/96 100/96 100/94 100/92 96/96
2 100/100 100/100 100/100 98/96 100/100 100/100
3 100/100 100/100 100/100 99/98 100/100 100/100
4 98/100 99/100 99/100 99/100 100/100 100/90
5 93/100 93/100 96/100 98/100 97/100 66/46
6 100/98 100/98 100/98 100/94 97/94 4/20
7 98/100 98/100 99/100 99/100 99/98 67/98
8 100/98 100/98 100/94 100/86 99/74 100/74
9 100/100 100/100 100/100 100/96 97/92 100/84
10 98/100 99/100 99/100 98/84 95/68 99/98
11 99/96 99/96 99/98 99/100 99/98 100/96
12 100/96 100/96 100/94 100/96 100/96 99/94
13 100/98 100/100 100/100 100/100 100/100 100/100
14 100/100 100/100 100/100 100/100 100/100 100/100
15 100/100 100/100 100/100 100/100 100/100 100/100
16 100/100 100/100 100/100 100/94 100/98 100/78
Average 99.1250/ 99.2500/ 99.5000/ 99.3750/ 98.9375/ 89.4375/
98.8750 99.0000 98.7500 96.1250 94.3750 85.8750
Table C.2. Training and testing results with spline kernel function, ϵ-insensitive loss
function (ϵ = 0.05) and various values of C.
Recognition rate (%) (Training/Testing phase)
Class No. C = 0.01 C = 0.1 C=1 C = 10 C = 100 C = Inf
1 100/96 100/94 100/100 100/100 100/100 100/100
2 100/100 100/100 100/100 100/100 100/100 100/100
3 100/100 100/100 100/100 100/100 100/100 100/100
4 98/100 99/100 100/100 100/100 100/100 100/100
5 93/100 95/100 97/100 97/100 97/100 97/100
6 100/98 100/100 100/98 100/98 100/98 100/98
7 98/100 100/100 100/100 96/98 96/98 96/98
8 100/98 100/98 100/88 100/80 100/80 100/80
9 100/100 100/100 100/98 100/96 100/96 100/96
10 99/100 100/100 100/100 100/100 100/100 100/100
11 99/96 99/98 99/100 99/100 99/100 99/100
12 100/96 100/96 100/96 100/96 100/96 100/96
13 100/98 100/100 100/100 100/100 100/100 100/100
14 100/100 100/100 100/100 100/100 100/100 100/100
15 100/100 100/100 100/100 100/100 100/100 100/100
16 100/100 100/100 100/98 100/94 100/94 100/94
Average 99.1875/ 99.5625/ 99.7500/ 99.5000/ 99.5000/ 99.5000/
98.8750 99.1250 98.6250 97.6250 97.6250 97.6250
Appendix D. Tables of recognition rate for SVR-based graffiti recognizer with polynomial kernel function

Table D.1. Training and testing results with polynomial function (c = 1), quadratic
loss function and various values of C.
Recognition rate (%) (Training/Testing phase)
Class No. C = 0.01 C = 0.1 C=1 C = 10 C = 100 C = Inf
1 100/96 100/96 100/96 100/90 98/78 98/76
2 100/100 100/100 100/100 98/90 98/70 98/64
3 100/100 100/100 100/100 98/96 93/90 92/88
4 98/100 99/100 99/100 99/100 96/100 96/100
5 93/100 93/100 95/100 98/100 100/100 100/100
6 100/98 100/98 100/98 99/94 94/86 93/84
7 98/100 98/100 99/100 98/100 90/100 86/96
8 100/98 100/98 100/94 100/86 99/76 99/76
9 100/100 100/100 100/100 100/94 94/86 94/82
10 98/100 99/100 99/100 97/80 75/36 55/34
11 99/96 99/96 99/98 99/100 97/98 97/98
12 100/96 100/96 100/94 100/96 100/96 100/96
13 100/98 100/100 100/100 100/100 100/100 100/100
14 100/100 100/100 100/100 100/100 97/66 82/40
15 100/100 100/100 100/100 100/100 99/96 96/96
16 100/100 100/100 100/100 100/92 100/78 100/76
Average 99.1250/ 99.2500/ 99.4375/ 99.1250/ 95.6250/ 92.8750/
98.8750 99.0000 98.7500 94.8750 84.7500 81.6250
Table D.2. Training and testing results with polynomial function (c = 2), quadratic
loss function and various values of C.
Recognition rate (%) (Training/Testing phase)
Class No. C = 0.01 C = 0.1 C=1 C = 10 C = 100 C = Inf
1 100/98 100/96 100/96 100/94 100/86 99/68
2 100/100 100/100 99/96 98/96 100/100 100/96
3 99/100 100/100 100/100 98/96 100/100 100/100
4 98/100 99/100 100/100 99/100 99/100 99/100
5 94/100 94/100 96/100 100/100 98/100 92/96
6 100/98 100/100 100/98 97/94 97/92 94/90
7 99/100 99/100 99/100 98/100 98/96 97/96
8 100/98 100/98 100/94 100/86 99/76 76/20
9 100/100 100/100 100/100 99/94 97/92 95/86
10 98/100 99/100 98/100 96/72 96/72 93/64
11 99/96 99/96 99/98 99/100 99/98 98/96
12 100/94 100/94 100/92 100/96 100/96 100/96
13 100/98 100/100 100/100 100/100 100/100 100/96
14 100/100 100/100 100/100 100/100 100/100 100/88
15 100/100 100/100 100/100 100/100 100/100 100/100
16 100/100 100/100 100/100 100/92 100/98 100/84
Average 99.1875/ 99.3750/ 99.4375/ 99.0000/ 98.9375/ 96.4375/
98.8750 99.0000 98.3750 95.0000 94.1250 86.0000

Table D.3. Training and testing results with polynomial function (c = 5), quadratic
loss function and various values of C.
Recognition rate (%) (Training/Testing phase)
Class No. C = 0.01 C = 0.1 C=1 C = 10 C = 100 C = Inf
1 100/92 100/92 100/98 100/92 100/94 100/68
2 97/98 98/86 98/88 99/100 100/100 100/100
3 96/100 99/100 99/100 100/100 100/100 100/100
4 98/100 99/100 97/100 100/100 100/100 99/80
5 91/100 96/100 97/100 99/100 99/100 73/68
6 100/100 100/100 100/96 100/94 96/94 97/92
7 96/88 98/92 98/98 98/96 99/98 90/72
8 100/98 100/96 100/90 100/82 94/60 15/8
9 100/100 100/100 99/98 98/94 97/90 93/92
10 98/98 98/100 98/90 98/76 99/78 86/86
11 99/90 99/90 98/94 99/100 99/98 100/94
12 99/90 99/90 100/82 100/94 100/96 100/96
13 100/98 100/98 100/100 100/100 100/100 100/98
14 100/100 100/100 100/100 100/100 100/100 100/100
15 99/100 99/100 99/100 100/100 100/100 100/100
16 100/100 100/100 100/94 100/98 100/96 100/84
Average 98.3125/ 99.0625/ 98.9375/ 99.4375/ 98.9375/ 90.8125/
97.0000 96.5000 95.5000 95.3750 94.0000 83.6250
Table D.4. Training and testing results with polynomial function (c = 10), quadratic
loss function and various values of C.
Recognition rate (%) (Training/Testing phase)
Class No. C = 0.01 C = 0.1 C=1 C = 10 C = 100 C = Inf
1 99/78 99/68 100/76 100/96 100/96 100/80
2 98/82 98/70 98/78 100/94 100/100 100/100
3 92/98 98/100 100/100 100/100 100/100 100/100
4 96/100 98/100 100/100 100/100 100/100 100/88
5 94/100 97/100 92/96 98/100 96/100 98/84
6 100/98 100/96 100/96 100/94 96/92 98/94
7 99/96 96/84 98/96 100/98 98/98 76/64
8 100/98 100/92 100/90 98/52 98/78 76/50
9 100/98 99/98 100/98 96/84 99/96 94/96
10 98/98 98/100 98/86 96/80 100/88 84/84
11 99/98 99/98 99/98 97/88 98/98 100/94
12 100/100 98/82 100/76 100/94 100/94 100/96
13 100/98 95/96 100/100 100/100 99/100 100/100
14 100/100 100/100 100/100 100/100 100/100 100/100
15 99/100 99/100 99/100 100/100 100/100 100/100
16 100/82 100/96 100/100 100/96 100/98 100/80
Average 98.3750/ 98.3750/ 99.0000/ 99.0625/ 99.0000/ 95.3750/
95.2500 92.5000 93.1250 92.2500 96.1250 88.1250

Table D.5. Training and testing results with polynomial function (c = 1),
ϵ-insensitive loss function (ϵ = 0.05) and various values of C.
Recognition rate (%) (Training/Testing phase)
Class No. C = 0.01 C = 0.1 C=1 C = 10 C = 100 C = Inf
1 100/96 100/94 100/96 100/98 100/98 100/98
2 100/100 100/100 100/100 100/100 100/100 100/100
3 100/100 100/100 100/100 100/100 100/100 100/100
4 98/100 99/100 100/100 100/100 100/100 100/100
5 93/100 96/100 98/100 97/100 97/100 97/100
6 100/98 100/100 100/98 100/98 100/98 100/98
7 98/100 100/100 100/100 94/98 94/98 94/98
8 100/98 100/98 100/88 99/80 99/80 99/80
9 100/100 100/100 100/98 100/96 100/96 100/96
10 99/100 100/100 100/100 100/100 100/100 100/100
11 99/96 99/96 99/98 99/100 99/100 99/100
12 100/96 100/96 100/96 100/96 100/96 100/96
13 100/98 100/100 100/100 100/100 100/100 100/100
14 100/100 100/100 100/100 100/100 100/100 100/100
15 100/100 100/100 100/100 100/100 100/100 100/100
16 100/100 100/100 100/98 100/94 100/94 100/94
Average 99.1875/ 99.6250/ 99.8125/ 99.3125/ 99.3125/ 99.3125/
98.8750 99.0000 98.2500 97.5000 97.5000 97.5000
Table D.6. Training and testing results with polynomial function (c = 2),
ϵ-insensitive loss function (ϵ = 0.05) and various values of C.
Recognition rate (%) (Training/Testing phase)
Class No. C = 0.01 C = 0.1 C=1 C = 10 C = 100 C = Inf
1 100/96 100/96 100/100 100/100 100/100 100/100
2 100/100 100/100 100/100 100/100 100/100 100/100
3 100/100 100/100 100/100 100/100 100/100 100/100
4 99/100 100/100 100/100 100/100 100/100 100/100
5 94/100 98/100 98/100 97/100 97/100 97/100
6 100/100 100/100 100/98 100/98 100/98 100/98
7 99/100 100/100 96/100 93/98 93/98 93/98
8 100/98 100/94 100/82 99/74 99/74 99/74
9 100/100 100/100 100/94 100/94 100/94 100/94
10 99/100 100/100 100/100 100/100 100/100 100/100
11 99/96 99/98 99/100 99/100 99/100 99/100
12 100/98 100/96 100/98 100/98 100/98 100/98
13 100/100 100/100 100/100 100/100 100/100 100/100
14 100/100 100/100 100/100 100/100 100/100 100/100
15 100/100 100/100 100/100 100/100 100/100 100/100
16 100/100 100/100 100/98 100/98 100/98 100/98
Average 99.3750/ 99.8125/ 99.5625/ 99.2500/ 99.2500/ 99.2500/
99.2500 99.0000 98.1250 97.5000 97.5000 97.5000

Table D.7. Training and testing results with polynomial function (c = 5),
ϵ-insensitive loss function (ϵ = 0.05) and various values of C.
Recognition rate (%) (Training/Testing phase)
Class No. C = 0.01 C = 0.1 C=1 C = 10 C = 100 C = Inf
1 100/92 100/96 100/96 100/96 100/96 100/96
2 98/88 100/90 100/88 100/88 100/88 100/88
3 99/100 100/100 100/100 100/100 100/100 100/100
4 98/100 100/100 100/100 100/100 100/100 100/100
5 96/100 97/100 99/100 99/100 99/100 99/100
6 100/100 100/98 100/98 100/98 100/98 100/98
7 98/92 98/94 98/98 98/98 98/98 98/98
8 100/96 100/90 100/76 100/76 100/76 100/76
9 100/100 100/96 98/88 98/88 98/88 98/88
10 98/100 100/98 100/94 100/94 100/94 100/94
11 99/90 96/78 97/90 97/90 97/90 97/90
12 99/90 100/90 100/96 100/96 100/96 100/96
13 100/98 100/100 100/100 100/100 100/100 100/100
14 100/100 100/100 100/100 100/100 100/100 100/100
15 99/100 99/100 100/100 100/100 100/100 100/100
16 100/100 100/96 100/98 100/98 100/98 100/98
Average 99.0000/ 99.3750/ 99.5000/ 99.5000/ 99.5000/ 99.5000/
96.6250 95.3750 95.1250 95.1250 95.1250 95.1250
Table D.8. Training and testing results with polynomial function (c = 10),
ϵ-insensitive loss function (ϵ = 0.05) and various values of C.
Recognition rate (%) (Training/Testing phase)
Class No. C = 0.01 C = 0.1 C=1 C = 10 C = 100 C = Inf
1 100/74 97/66 98/76 98/76 98/76 98/76
2 99/66 99/72 94/60 93/60 94/60 93/60
3 95/98 98/98 98/98 98/98 98/98 98/98
4 98/100 98/100 99/100 99/100 99/100 99/100
5 98/100 100/100 99/98 99/98 99/98 99/98
6 100/96 100/96 100/96 100/96 100/96 100/96
7 97/90 97/96 97/98 97/98 97/98 97/98
8 94/94 100/90 100/82 100/82 100/82 100/82
9 99/96 96/88 92/80 92/80 92/80 92/80
10 98/98 99/84 99/84 99/84 99/84 99/84
11 99/98 99/100 99/100 99/100 99/100 99/100
12 99/90 100/98 100/100 100/100 100/100 100/100
13 93/90 96/98 96/96 96/96 96/96 96/96
14 99/86 100/94 100/96 100/96 100/96 100/96
15 99/100 100/100 100/98 100/98 100/98 100/98
16 98/72 100/92 99/86 99/86 99/86 99/86
Average 97.8125/ 98.6875/ 98.1250/ 98.0625/ 98.1250/ 98.0625/
90.5000 92.0000 90.5000 90.5000 90.5000 90.5000
Appendix E. Tables of recognition rate for SVR-based graffiti recognizer with radial basis kernel function

Table E.1. Training and testing results with radial basis kernel function (σ = 1),
quadratic loss function and various values of C.
Recognition rate (%) (Training/Testing phase)
Class No. C = 0.01 C = 0.1 C=1 C = 10 C = 100 C = Inf
1 100/96 100/96 100/96 100/90 99/78 100/90
2 100/100 100/100 100/100 98/90 98/68 100/100
3 100/100 100/100 100/100 98/96 95/90 100/100
4 97/100 99/100 99/100 99/100 100/100 100/100
5 93/100 93/100 95/100 99/100 100/100 99/100
6 100/98 100/98 100/98 99/94 96/96 100/98
7 98/100 98/100 99/100 98/100 90/90 100/98
8 100/98 100/98 100/94 100/86 99/76 100/80
9 100/100 100/100 100/100 100/92 95/90 100/96
10 99/100 99/100 99/100 97/78 82/60 100/88
11 99/96 99/96 99/98 99/100 97/92 100/100
12 100/96 100/96 100/96 100/96 100/92 100/92
13 100/98 100/98 100/100 100/100 100/98 100/100
14 100/100 100/100 100/100 100/100 100/100 100/100
15 100/100 100/100 100/100 100/100 99/100 100/100
16 100/100 100/100 100/100 100/94 100/96 100/100
Average 99.1250/ 99.2500/ 99.4375/ 99.1875/ 96.8750/ 99.9375/
98.8750 98.8750 98.8750 94.7500 89.1250 96.3750
Table E.2. Training and testing results with radial basis kernel function (σ = 2),
quadratic loss function and various values of C.
Recognition rate (%) (Training/Testing phase)
Class No. C = 0.01 C = 0.1 C=1 C = 10 C = 100 C = Inf
1 100/96 100/96 100/96 100/92 100/84 100/100
2 100/100 100/100 100/100 100/100 98/80 100/94
3 100/100 100/100 100/100 100/100 95/94 97/92
4 97/100 98/100 99/100 100/100 98/100 100/100
5 93/100 93/100 93/100 97/100 100/100 97/100
6 100/98 100/98 100/98 100/98 97/92 99/96
7 98/100 98/100 99/100 100/100 95/100 81/72
8 100/98 100/98 100/98 100/90 99/76 98/72
9 100/100 100/100 100/100 100/100 98/88 99/88
10 99/100 99/100 99/100 99/100 92/64 79/82
11 99/96 99/96 99/96 99/98 99/100 100/98
12 100/96 100/96 100/96 100/96 100/96 100/98
13 100/98 100/98 100/100 100/100 100/100 99/100
14 100/100 100/100 100/100 100/100 100/96 100/98
15 100/100 100/100 100/100 100/100 99/100 100/100
16 100/100 100/100 100/100 100/98 100/88 100/88
Average 99.1250/ 99.1875/ 99.3125/ 99.6875/ 98.1250/ 96.8125/
98.8750 98.8750 99.0000 98.2500 91.1250 92.3750

Table E.3. Training and testing results with radial basis kernel function (σ = 5),
quadratic loss function and various values of C.
Recognition rate (%) (Training/Testing phase)
Class No. C = 0.01 C = 0.1 C=1 C = 10 C = 100 C = Inf
1 100/96 100/96 100/96 100/94 100/92 100/90
2 100/100 100/100 100/100 100/100 100/98 100/100
3 100/100 100/100 100/100 100/100 100/100 100/100
4 97/100 97/100 98/100 99/100 99/100 100/100
5 93/100 93/100 93/100 94/100 98/100 98/100
6 100/98 100/98 100/98 100/98 100/94 100/96
7 98/100 98/100 98/100 99/100 100/100 99/94
8 100/98 100/98 100/98 100/98 100/90 100/74
9 100/100 100/100 100/100 100/100 100/96 100/96
10 99/100 99/100 99/100 99/100 98/100 99/98
11 99/96 99/96 99/96 99/96 99/98 99/96
12 100/96 100/96 100/96 100/98 100/96 100/98
13 100/98 100/98 100/98 100/100 100/100 100/100
14 100/100 100/100 100/100 100/100 100/100 100/100
15 100/100 100/100 100/100 100/100 100/100 100/100
16 100/100 100/100 100/100 100/100 100/94 100/100
Average 99.1250/ 99.1250/ 99.1875/ 99.3750/ 99.6250/ 99.6875/
98.8750 98.8750 98.8750 99.0000 97.3750 96.3750
Table E.4. Training and testing results with radial basis kernel function (σ = 10),
quadratic loss function and various values of C.
Recognition rate (%) (Training/Testing phase)
Class No. C = 0.01 C = 0.1 C=1 C = 10 C = 100 C = Inf
1 100/96 100/96 100/96 100/96 100/96 100/96
2 100/100 100/100 100/100 100/100 100/100 100/100
3 100/100 100/100 100/100 100/100 100/100 100/100
4 97/100 97/100 97/100 99/100 99/100 99/100
5 93/100 93/100 93/100 93/100 95/100 97/100
6 100/98 100/98 100/98 100/98 100/98 100/96
7 98/100 98/100 98/100 98/100 100/100 98/98
8 100/98 100/98 100/98 100/98 100/94 100/90
9 100/100 100/100 100/100 100/100 100/100 100/100
10 99/100 99/100 99/100 99/100 99/100 98/98
11 99/96 99/96 99/96 99/96 99/98 99/98
12 100/96 100/96 100/96 100/96 100/96 100/94
13 100/98 100/98 100/98 100/100 100/100 100/100
14 100/100 100/100 100/100 100/100 100/100 100/100
15 100/100 100/100 100/100 100/100 100/100 100/100
16 100/100 100/100 100/100 100/100 100/100 100/98
Average 99.1250/98.8750 99.1250/98.8750 99.1250/98.8750 99.2500/99.0000 99.5000/98.8750 99.4375/98.0000

Table E.5. Training and testing results with radial basis kernel function (σ = 1),
ϵ-insensitive loss function (ϵ = 0.05) and various values of C.
Recognition rate (%) (Training/Testing phase)
Class No. C = 0.01 C = 0.1 C=1 C = 10 C = 100 C = Inf
1 100/96 100/96 100/94 100/98 100/98 100/98
2 100/100 100/100 100/100 100/100 100/100 100/100
3 100/100 99/100 100/100 100/100 100/100 100/100
4 98/100 99/100 100/100 100/100 100/100 100/100
5 93/100 96/100 98/100 97/100 97/100 97/100
6 100/98 100/98 100/98 100/98 100/98 100/98
7 98/100 100/100 100/100 94/98 94/98 94/98
8 100/98 100/98 100/86 99/74 99/74 99/74
9 100/100 100/100 100/98 100/96 100/96 100/96
10 99/100 100/100 100/100 100/100 100/100 100/100
11 99/96 99/96 99/98 99/100 99/100 99/100
12 100/98 100/98 100/98 100/96 100/96 100/96
13 100/98 100/100 100/100 100/100 100/100 100/100
14 100/100 100/100 100/100 100/100 100/100 100/100
15 100/100 100/100 100/100 100/100 100/100 100/100
16 100/100 100/100 100/98 100/94 100/94 100/94
Average 99.1875/99.0000 99.5625/99.1250 99.8125/98.1250 99.3125/97.1250 99.3125/97.1250 99.3125/97.1250

Table E.6. Training and testing results with radial basis kernel function (σ = 2),
ϵ-insensitive loss function (ϵ = 0.05) and various values of C.
Recognition rate (%) (Training/Testing phase)
Class No. C = 0.01 C = 0.1 C=1 C = 10 C = 100 C = Inf
1 100/96 100/96 100/96 100/98 100/98 100/98
2 100/100 100/100 100/100 100/100 100/100 100/100
3 100/100 100/100 99/100 100/100 100/100 100/100
4 98/100 98/100 100/100 100/100 100/100 100/100
5 92/100 93/100 98/100 97/100 97/100 97/100
6 100/98 100/98 100/98 100/98 100/98 100/98
7 98/100 98/100 100/100 94/98 94/98 94/98
8 100/98 100/98 100/94 99/74 99/74 99/74
9 100/100 100/100 100/100 100/96 100/96 100/96
10 99/100 99/100 100/100 100/100 100/100 100/100
11 99/96 99/96 99/98 99/100 99/100 99/100
12 100/98 100/98 100/96 100/96 100/96 100/96
13 100/98 100/98 100/100 100/100 100/100 100/100
14 100/100 100/100 100/100 100/100 100/100 100/100
15 100/100 100/100 100/100 100/100 100/100 100/100
16 100/100 100/100 100/98 100/94 100/94 100/94
Average 99.1250/99.0000 99.1875/99.0000 99.7500/98.7500 99.3125/97.1250 99.3125/97.1250 99.3125/97.1250

Table E.7. Training and testing results with radial basis kernel function (σ = 5),
ϵ-insensitive loss function (ϵ = 0.05) and various values of C.
Recognition rate (%) (Training/Testing phase)
Class No. C = 0.01 C = 0.1 C=1 C = 10 C = 100 C = Inf
1 100/96 100/96 100/96 100/96 100/98 100/98
2 100/100 100/100 100/100 100/100 100/100 100/100
3 100/100 100/100 100/100 100/100 100/100 100/100
4 98/100 98/100 99/100 100/100 100/100 100/100
5 92/100 92/100 93/100 98/100 97/100 97/100
6 100/98 99/98 100/98 100/98 100/98 100/98
7 98/100 98/100 98/100 100/100 94/98 94/98
8 100/98 100/98 100/98 100/92 99/74 99/74
9 100/100 100/100 100/100 100/100 100/96 100/96
10 99/100 99/100 99/100 100/100 100/100 100/100
11 99/96 99/96 99/96 99/98 99/100 99/100
12 100/98 100/98 100/98 100/96 100/96 100/96
13 100/98 100/98 100/100 100/100 100/100 100/100
14 100/100 100/100 100/100 100/100 100/100 100/100
15 100/100 100/100 100/100 100/100 100/100 100/100
16 100/100 100/100 100/100 100/98 100/94 100/94
Average 99.1250/99.0000 99.0625/99.0000 99.2500/99.1250 99.8125/98.6250 99.3125/97.1250 99.3125/97.1250

Table E.8. Training and testing results with radial basis kernel function (σ = 10),
ϵ-insensitive loss function (ϵ = 0.05) and various values of C.
Recognition rate (%) (Training/Testing phase)
Class No. C = 0.01 C = 0.1 C=1 C = 10 C = 100 C = Inf
1 100/96 100/96 100/96 100/94 100/94 100/98
2 100/100 100/100 100/100 100/100 100/100 100/100
3 100/100 100/100 100/100 99/100 100/100 100/100
4 98/100 98/100 98/100 99/100 100/100 100/100
5 92/100 92/100 92/100 95/100 98/100 97/100
6 100/98 99/98 99/98 100/98 100/98 100/98
7 98/100 98/100 98/100 100/100 99/100 94/98
8 100/98 100/98 100/98 100/98 100/86 99/74
9 100/100 100/100 100/100 100/100 100/98 100/96
10 99/100 99/100 99/100 100/100 100/100 100/100
11 99/96 99/96 99/96 99/96 99/98 99/100
12 100/98 100/98 100/98 100/98 100/96 100/96
13 100/98 100/98 100/98 100/100 100/100 100/100
14 100/100 100/100 100/100 100/100 100/100 100/100
15 100/100 100/100 100/100 100/100 100/100 100/100
16 100/100 100/100 100/100 100/100 100/98 100/94
Average 99.1250/99.0000 99.0625/99.0000 99.0625/99.0000 99.5000/99.0000 99.7500/98.0000 99.3125/97.1250

Chapter 10

Nonlinear Modeling Using Support Vector Machine for


Heart Rate Response to Exercise

∗Weidong Chen, †‡Steven W. Su, †Yi Zhang, §Ying Guo, †Nghir Nguyen,
#‡Branko G. Celler and †Hung T. Nguyen
Faculty of Engineering and Information Technology,
University of Technology,
Sydney, Australia
[email protected]

In order to accurately regulate cardiovascular response to exercise for


an individual exerciser, this study proposed a control oriented modeling
approach to depict nonlinear behavior of heart rate response at both
the onset and offset of treadmill exercise. With the aim of capturing
nonlinear dynamic behaviors, a well designed exercise protocol has
been applied for a healthy male subject. Non-invasively measured
variables, such as ECG, body movements and oxygen saturation (SpO2),
have been reliably monitored and recorded. Based on several sets of
experimental data, both steady state gain and time constant of heart
rate response are identified. The nonlinear models relating to steady
state gain and time constant (vs. walking speed) were built up based
on support vector machine regression (SVR). The nonlinear behaviors
at both onset and offset of exercise have been well described by using
the established SVR models. The model provides the fundamentals for
the optimization of exercise efforts by using model-based optimal control
approaches, which is the following step of this study.

∗ Department of Automation, Shanghai Jiao Tong University, Shanghai, China.
† Centre for Health Technologies, Faculty of Engineering and Information Technology, University of Technology, Sydney, Australia.
‡ Human Performance Group, Biomedical Systems Lab, School of Electrical Engineering and Telecommunications, University of New South Wales, UNSW Sydney NSW 2052, Australia.
# College of Health and Science, University of Western Sydney, UWS Penrith NSW 2751, Australia.
§ Autonomous Systems Lab, CSIRO ICT Center, Sydney, Australia.


Contents

10.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 256


10.2 SVM Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
10.3 Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 262
10.4 Data Analysis and Discussions . . . . . . . . . . . . . . . . . . . . . . . . . . 264
10.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269

10.1. Introduction

Heart rate, as we know, has been extensively applied to evaluate


cardio-respiratory response [1–5] to exercise as it can be easily measured
by cheap wireless portable non-invasive sensors. In this sense, therefore,
developing nonlinear models with respect to the analysis of dynamic
characteristics of heart rate response is the starting point of this project.
Based on the model, a nonlinear switching model predictive control
approach will be developed to optimize exercise effects while keeping the
level of exercise intensity within the safe range [6].
There are plenty of papers [7, 8] about the analysis of steady state
characteristics of heart rate response. Nonlinear behavior has been detected
and nonlinear models have been established for analyzing the response once it
enters steady state. During medical diagnosis and analysis
of cardio-respiratory kinetics, however, the transient response of heart rate
is more valuable as it contains indicators of current disease or warnings
about impending cardiac diseases [9]. Although both linear and nonlinear
modeling approaches [10, 11] have been applied to explore the dynamic
characteristics of heart response to exercise, few papers focus on the
variation of dynamic characteristics under different exercise intensities for
both the onset and offset of exercise. For moderate exercise, the literature often
assumes that heart rate dynamics can be described by linear time invariant
models. In this study, it was observed that the time constant of heart
response to exercise not only depends on exercise intensity but also varies
at the onset and offset of exercise. Papers [12–14] prove that exercise
effects can be optimized by regulating heart rate following a pre-defined
exercise protocol. It is well known that higher control performance can be
obtained if the model contains less uncertainty. Therefore, it is worthwhile
to establish a more accurate dynamical model to enhance the controller
design of heart rate regulation. In this study, we designed a treadmill

walking exercise protocol to analyze step response of heart rate. During


experiments, ECG, body movement and oxygen saturation were recorded
by using portable non-invasive sensors: Alive ECG monitor, Micro Inertial
Measurement Unit (IMU) and Alive Pulse Oximeter. It was confirmed that
time constants are not invariant especially when walking speed is faster
than three miles/hour. Time constants for offset of exercises are normally
bigger than those of onset of exercises. Steady state gain variation under
different exercise intensity has also been visibly observed. Furthermore, the
experiment results indicate that it is difficult to describe the variation of
the transient parameters (such as time constant) by using a simple linear
model. We applied the novel machine learning method, support vector
machine (SVM), to depict the nonlinear relationship.
Support vector machine-based regression [15] (Support Vector
Regression (SVR)) has been successfully applied to nonlinear function
estimation. Vapnik et al. [2, 16, 17] established the foundation of SVM. The
formulation of SVM [18, 19] embodies the structure of the risk minimization
principle, which has been shown to be superior to other traditional empirical
risk minimization principles [20]. Support vector machine-based regression
applies the kernel methods implicitly to transform data into a feature space
(this is known as a kernel trick [21]), and uses linear regression to get
a nonlinear function approximation in the feature space. By using RBF
kernel , this study efficiently established the nonlinear relationship between
time constant and exercise intensity for both onset and offset of exercises.
This chapter is organized as follows. Section 10.2 provides
preliminary knowledge of SVM-based regression. Section 10.3 describes
the experimental equipment and exercise protocol. Data analysis and
modeling results are given in Section 10.4. Lastly, Section 10.5 gives
conclusions.

10.2. SVM Regression

Let {ui, yi}_{i=1}^{N} be a set of input–output data points (ui ∈ U ⊆
R^d, yi ∈ Y ⊆ R, N is the number of points). The goal of support vector
regression is to find a function f(u) which has the following form
f (u) = w · ϕ(u) + b, (10.1)
where ϕ(u) maps u nonlinearly into a high-dimensional feature space. The
weight vector w and bias b define the hyperplane ⟨w · ϕ(u)⟩ + b = 0. The
hyperplane is estimated by minimizing the regularized risk function:

      (1/2)‖w‖² + C · (1/N) Σ_{i=1}^{N} Lε(yi, f(ui)).                      (10.2)

Fig. 10.1. Experiment protocol.

Fig. 10.2. Accelerations of three axes provided by the Micro IMU.

Fig. 10.3. Roll, pitch and yaw angles provided by the Micro IMU.

The first term is called the regularization term. The second term is the
empirical error, measured by the ε-insensitivity loss function, which is defined
as:

      Lε(yi, f(ui)) = |yi − f(ui)| − ε,   if |yi − f(ui)| > ε;
                    = 0,                  if |yi − f(ui)| ≤ ε.              (10.3)

This defines an ε tube. The radius ε of the tube and the regularization
constant C are both determined by the user.
The selection of parameter C depends on application knowledge of the
domain. Theoretically, a small value of C will under-fit the training data
because the weight placed on the training data is too small, thus resulting
in large values of MSE (mean square error) on the test sets. However,
when C is too large, SVR will over-fit the training set so that (1/2)‖w‖² will
lose its meaning and the objective reduces to minimizing the empirical risk
only. Parameter ε controls the width of the ε-insensitive zone. Generally,
the larger the ε the fewer number of support vectors and thus the sparser
the representation of the solution. However, if the ε is too large, it can
deteriorate the accuracy on the training data.
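To make the roles of C and ε concrete, the minimal sketch below fits an off-the-shelf ε-insensitive SVR with an RBF kernel to synthetic data and reports the number of support vectors and the training error for a few settings. The toy data, the parameter grid and the use of scikit-learn are illustrative assumptions and are not part of the original study.

    import numpy as np
    from sklearn.svm import SVR

    rng = np.random.default_rng(0)
    u = np.linspace(0, 4, 40).reshape(-1, 1)      # hypothetical walking speed (mph)
    y = 10 + 12 * np.tanh(u.ravel() - 2.5) + rng.normal(0, 0.5, u.shape[0])

    for C in (0.1, 10, 1000):
        for eps in (0.05, 0.5):
            model = SVR(kernel="rbf", gamma=0.5, C=C, epsilon=eps).fit(u, y)
            n_sv = len(model.support_)            # larger eps tends to give a sparser solution
            mse = np.mean((model.predict(u) - y) ** 2)
            print(f"C={C:>6}, eps={eps}: support vectors={n_sv:2d}, train MSE={mse:.3f}")

As the discussion above suggests, the very small values of C under-fit (large training MSE), while a larger ε leaves fewer support vectors at the cost of accuracy on the training data.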

Fig. 10.4. Experimental scenario.

By solving the above constrained optimization problem, we have

      f(u) = Σ_{i=1}^{N} βi ϕ(ui) · ϕ(u) + b.                               (10.4)

As mentioned above, by the use of kernels, all necessary computations can
be performed directly in the input space, without having to compute the
map ϕ(u) explicitly. After introducing kernel function k(ui, uj), the above
equation can be rewritten as follows:

      f(u) = Σ_{i=1}^{N} βi k(ui, u) + b,                                   (10.5)

Fig. 10.5. Original ECG signal.

where the coefficients βi correspond to each (ui , yi ). The support vectors


are the input vectors uj whose corresponding coefficients βj ≠ 0. For linear
support vector regression, the kernel function is thus the inner product in the
input space:

      f(u) = Σ_{i=1}^{N} βi ⟨ui, u⟩ + b.                                    (10.6)

For nonlinear SVR, there are a number of kernel functions which have
been found to provide good generalization capabilities, such as the polynomial,
radial basis function (RBF) and sigmoid kernels. Here we present the polynomial and
RBF kernel functions as follows:

Polynomial kernel: k(u, u′) = ((u · u′) + h)^p.
RBF kernel: k(u, u′) = exp(−‖u − u′‖² / (2σ²)).

Details about SVR, such as the selection of the radius ε of the tube, the kernel
function and the regularization constant C, can be found in [18, 21].
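As a small illustration of the two kernels just listed, the sketch below evaluates them directly in the input space; the parameter values (h, p, σ) and the test vectors are arbitrary choices for demonstration and are not taken from the chapter.

    import numpy as np

    def poly_kernel(u, v, h=1.0, p=3):
        """Polynomial kernel k(u, u') = ((u . u') + h)**p."""
        return (np.dot(u, v) + h) ** p

    def rbf_kernel(u, v, sigma=2.0):
        """RBF kernel k(u, u') = exp(-||u - u'||**2 / (2 * sigma**2))."""
        diff = np.asarray(u, dtype=float) - np.asarray(v, dtype=float)
        return np.exp(-np.dot(diff, diff) / (2.0 * sigma ** 2))

    u1, u2 = np.array([1.0, 2.0]), np.array([0.5, -1.0])
    print("polynomial:", poly_kernel(u1, u2), " RBF:", rbf_kernel(u1, u2))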


Fig. 10.6. The recording of SpO2.

Table 10.1. The values of walking speed Va and Vb .


            Set 1   Set 2   Set 3   Set 4   Set 5   Set 6
Vb (mph)    0.5     1.5     2       2.5     3       3.5
Va (mph)    1.5     2.5     3       3.5     4       4.5

10.3. Experiment

A 41-year-old healthy male joined the study. He was 178 cm tall and
weighed 79 kg. Experiments were performed in the afternoon, and the subject
was allowed to have a light meal one hour before the measurements. After
walking for about 10 minutes on the treadmill to get acquainted with this
kind of exercise, the subject walked at six sets of exercise protocol (see
Fig. 10.1) to test step response. The values of walking speed Va and
Vb were designed to vary exercise intensity and are listed in Table 10.1.
To properly identify time constants for onset and offset of exercise, the
recorded data should be precisely synchronized. Therefore, time instants
t1 , t2 , t3 , and t4 should be identified and marked accurately. In this study,

Fig. 10.7. A measured heart rate step response signal.

we applied a Micro Inertial Measurement Unit (Xsens MTi-G IMU) to fulfil


this requirement. We compared both attitude information (roll, pitch and
yaw angles) and acceleration information provided by the Micro IMU. It
was observed that acceleration information alone is sufficient to identify
these time instants (see Fig. 10.2 and Fig. 10.3).
During experiments, continuous measurements of ECG, body movement
and Sp O2 (oxygen saturation) were made by using portable non-invasive
sensors. Specifically, ECG was recorded by using Alive ECG Monitor.
Body movement was measured by using the Xsens MTi-G IMU. Sp O2 was
monitored by using Alive Pulse Oximeter to guarantee the safety of the
subject. The experimental scenario is shown in Fig. 10.4.

Fig. 10.8. A typical curve fitting result.

10.4. Data Analysis and Discussions

Original signals of the IMU, ECG and SpO2 are shown in Figs 10.3, 10.5 and
10.6 respectively. It is well known that even in the absence of external
interference the heart rate can vary substantially over time under the
influence of various internal or external factors [12]. As mentioned before,
in order to reduce the variance, the designed experimental protocol has been
repeated three times. The experimental data of these repeated experiments have
been synchronized and averaged.
A typical measured heart rate response is shown in Fig. 10.7. Paper
[12] found that heart rate response to exercise can be approximated as a
first order process from a control application point of view. Therefore we
established a first order model for each of the six averaged step response data sets by using
the Matlab System Identification Toolbox [22].
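As a hedged illustration of this identification step (the chapter itself uses the Matlab toolbox), the sketch below fits the step response of a first order model K/(Ts + 1) to synthetic heart rate data with SciPy; the synthetic gain, time constant, noise level and 2 s sampling period are assumptions for demonstration only.

    import numpy as np
    from scipy.optimize import curve_fit

    def first_order_step(t, K, T, y0):
        """Step response of K/(T s + 1) from baseline y0: y(t) = y0 + K (1 - exp(-t/T))."""
        return y0 + K * (1.0 - np.exp(-t / T))

    t = np.arange(0.0, 300.0, 2.0)                       # assumed 2 s sampling
    rng = np.random.default_rng(1)
    hr = first_order_step(t, 25.0, 40.0, 100.0) + rng.normal(0.0, 1.0, t.size)

    (K_hat, T_hat, y0_hat), _ = curve_fit(first_order_step, t, hr, p0=(20.0, 30.0, 95.0))
    print(f"estimated steady state gain K = {K_hat:.1f}, time constant T = {T_hat:.1f} s")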
Table 10.2 shows the identified steady state gain (K) and time constant
(T ) by using averaged data of three sets of experimental data. A typical
curve fitting result is shown in Fig. 10.8. From Table 10.2, we can clearly

Fig. 10.9. SVM regression results for time constant at onset of exercise.

see that both steady state gain and time constant vary when walking speed
Va and Vb change. Furthermore, time constants of the offset of exercise are
noticeably bigger than those of the onset of exercise. However, it should be
pointed out that the variations of the time constants are not distinctly dependent
on walking speed when the walking speed is less than three miles/hour. Overall,
experimental results indicate that heart rate dynamics at the onset and
offset of exercise exhibit high nonlinearity when walking speed is higher
than three miles/hour.

Fig. 10.10. SVM regression results for time constant at offset of exercise.

Table 10.2. The identified time constants and steady state gains
by using averaged data.
Sets    Onset                        Offset
        DC gain     Time constant    DC gain     Time constant
1       9.2583      9.4818           7.9297      27.358
2       11.264      10.193           9.8561      27.365
3       10.006      13.659           8.9772      26.741
4       12.807      18.618           12.087      30.865
5       17.753      38.192           17.953      48.114
6       32.911      55.974           25.733      81.693

In order to quantitatively describe the detected nonlinear behaviour,


we employed the novel machine learning modeling method, SVR, to model
time constant and DC gain of heart rate dynamics.
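A minimal sketch of this regression step is given below: an ε-insensitive RBF SVR is fitted to the onset time constants of Table 10.2 against the corresponding walking speeds Va from Table 10.1. The hyperparameters (γ, C, ε) are illustrative guesses rather than the values tuned in this chapter.

    import numpy as np
    from sklearn.svm import SVR

    speed = np.array([1.5, 2.5, 3.0, 3.5, 4.0, 4.5]).reshape(-1, 1)        # Va, Table 10.1
    tc_onset = np.array([9.4818, 10.193, 13.659, 18.618, 38.192, 55.974])  # Table 10.2

    svr = SVR(kernel="rbf", gamma=0.5, C=100.0, epsilon=2.0).fit(speed, tc_onset)
    print("support vectors used:", len(svr.support_), "of", len(speed))
    print("predicted time constant at 3.2 mph: %.1f s" % svr.predict([[3.2]])[0])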
The time constant regression results at onset and offset of exercise are
shown in Figs 10.9 and 10.10 respectively. In these figures, the continuous
curve stands for the estimated input-output steady state relationship. The
dotted lines indicate the ε-insensitivity tube. The plus markers are the

Fig. 10.11. SVM regression results for DC gain at onset of exercise.

points of input–output data. The circled plus markers are the support
points. It should be emphasized that ε-insensitive SVR uses fewer than
30% of the total points to sparsely and efficiently describe the nonlinear relationship.
It can be seen that the time constant at the offset of exercise is bigger
than that at the onset of exercise. It can also be observed that the time
constant at the onset of exercise is more accurately identified than that at
the offset of exercise. This is indicated by the ε tube (or the width of the
ε-insensitive zone). It is probable that the recovery stage is influenced
by more exercise-unrelated factors than the onset of exercise is.
The SVM regression results for DC gain at the onset and offset of
exercise are shown in Figs 10.11 and 10.12. It can be observed that the DC
gain for recovery stage is less than that at the onset of exercise, especially
when walking speed is greater than three miles/hour.

Fig. 10.12. SVM regression results for DC gain at offset of exercise.

10.5. Conclusion

This study aims to capture the nonlinear behavior of heart rate response to
treadmill walking exercises by using support vector machine-based analysis.
We identified both steady state gain and the time constant under different
walking speeds by using the data from a healthy middle-aged male subject.
Both steady state gain and the time constant are variant under different
walking speeds. The time constant for the recovery stage is longer than
that at the onset of exercise as predicted. In this study, these nonlinear
behaviors have been quantitatively described by using an effective machine
learning-based approach, named SVM regression. Based on the established
model, we have already developed a new switching control approach which
will be reported elsewhere. We believe this integrated modeling and
control approach can be applied to a broad range of process control problems [6]. In
the next step of this study, we are planning to recruit more subjects to test
the established nonlinear modeling and control approach further.

References

[1] S. W. Su, B. G. Celler, A. Savkin, H. T. Nguyen, T. M. Cheng, Y. Guo, and


L. Wang, Transient and steady state estimation of human oxygen uptake
based on noninvasive portable sensor measurements, Medical & Biological
Engineering & Computing. 47(10), 1111–1117, (2009).
[2] S. W. Su, L. Wang, B. Celler, E. Ambikairajah, and A. Savkin, Estimation of
walking energy expenditure by using Support Vector Regression, Proceedings
of the 27th Annual International Conference of the IEEE Engineering in
Medicine and Biology Society (EMBS). pp. 3526–3529, (2005).
[3] M. S. Fairbarn, S. P. Blackie, N. G. McElvaney, B. R. Wiggs, P. D. Pare, and
R. L. Pardy, Prediction of heart rate and oxygen uptake during incremental
and maximal exercise in healthy adults, Chest. 105, 1365–1369, (1994).
[4] P. O. Astrand, T. E. Cuddy, B. Saltin, and J. Stenberg, Cardiac output
during submaximal and maximal work, Journal of Applied Physiology. 9,
268–274, (1964).
[5] M. E. Freedman, G. L. Snider, P. Brostoff, S. Kimelblot, and L. N. Katz,
Effects of training on response of cardiac output to muscular exercise in
athletes, Journal of Applied Physiology. 8, 37–47, (1955).
[6] Y. Zhang, S. W. Su, H. T. Nguyen, and B. G. Celler, Machine learning
based nonlinear model predictive control for heart rate response to exercise,
in H. K. Lam and S. H. Ling and H. T. Nguyen (Eds), Computational
Intelligence and its Applications: Evolutionary Computation, Fuzzy Logic,
Neural Network and Support Vector Machine Techniques. pp. 271–285,
(World Scientific, Singapore, 2011).
[7] V. Seliger and J. Wagner, Evaluation of heart rate during exercise on a
bicycle ergometer, Physiology Boheraoslov. 18, 41, (1969).
[8] Y. Chen and Y. Lee, Effect of combined dynamic and static workload on
heart rate recovery cost, Ergonomics. 41(1), 29–38, (1998).
[9] R. Acharya, A. Kumar, I. P. Bhat, L. Choo, S. Iyengar, K. Natarajan, and
S. Krishnan, Classification of cardiac abnormalities using heart rate signals,
Medical & Biological Engineering & Computing. 42(3), 288–293, (2004).
[10] M. Hajek, J. Potucek, and V. Brodan, Mathematical model of heart rate
regulation during exercise, Automatica. 16, 191–195, (1980).
[11] T. M. Cheng, A. V. Savkin, B. G. Celler, S. W. Su, and L. Wang, Nonlinear
modelling and control of human heart rate response during exercise with
various work load intensities, IEEE Transactions on Biomedical Engineering.
55(11), 2499–2508, (2008).
[12] S. Su, L. Wang, B. Celler, A. Savkin, and Y. Guo, Identification and control
for heart rate regulation during treadmill exercise, IEEE Transactions on
Biomedical Engineering. 54(7), 1238–1246, (2007).
[13] R. A. Cooper, T. L. Fletcher-Shaw, and R. N. Robertson, Model
reference adaptive control of heart rate during wheelchair ergometry, IEEE
Transactions on Control Systems Technology. 6(4), 507–514, (1998).

[14] R. Eston, A. Rowlands, and D. Ingledew, Validity of heart rate, pedometry,


accelerometry for predicting the energy cost of children’s activities, Journal
of Applied Physiology. 84, 362–371, (1998).
[15] H. Drucker, C. Burges, L. Kaufman, A. Smola, and V. Vapnik, Support
vector regression machines, in M. Mozer, M. Jordan, and T. Petsche (Eds),
Advances in Neural Information Procession Systems, pp. 57–86 (Elsevier,
Cambridge, MA, 1997).
[16] I. Goethals, K. Pelckmans, J. Suykens, and B. D. Moor, Identification of
MIMO Hammerstein models using least squares support vector machines,
Automatica. 41, 1263–1272, (2005).
[17] J. Suykens, V. Gestel, J. D. Brabanter, B. D. Moor, and J. Vandewalle,
Least squares support vector machines. (World Scientific, Singapore, 2002).
[18] V. Vapnik, The Nature of Statistical Learning Theory. (Springer, New York,
1995).
[19] V. Vapnik and A. Lerner, Pattern recognition using generalized portrait
method, Automation and Remote Control. 24, 774–780, (1963).
[20] S. Gunn, M. Brown, and K. Bossley, Network performance assessment
for neuro-fuzzy data modelling, Intelligent Data Analysis. 1280, 313–323,
(1997).
[21] B. Schlkopf and A. Smola, Learning with kernels. (MIT Press, Cambridge,
MA, 2002).
[22] L. Ljung, System Identification Toolbox V4.0 for Matlab. (The MathWorks,
Inc, Natick, MA, 1995).

Chapter 11

Machine Learning-based Nonlinear Model Predictive


Control for Heart Rate Response to Exercise

†Yi Zhang, †‡Steven W. Su, #‡Branko G. Celler and †Hung T. Nguyen
[email protected]

This study explores control methodologies to handle time variant


behavior for heart rate dynamics at onset and offset of exercise. To
achieve this goal, a novel switching model predictive control (MPC)
algorithm is presented to optimize the exercise effects at both onset
and offset of exercise. Specifically, dynamic matrix control (DMC), one
of the most popular MPC algorithms, has been employed as
the core of the optimization of process regulation, while a switching
strategy has been adopted during the transition between the onset and offset
of exercise. The parameters of the DMC/MPC controller have been well
tuned based on a previously established SVM-based regression model
relating to both onset and offset of treadmill walking exercises. The
effectiveness of the proposed modeling and control approach has been
shown from the regulation of dynamical heart rate response to exercise
through simulation using Matlab.

Contents

11.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 272


11.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 274
11.2.1 Model-based predictive control (MPC) . . . . . . . . . . . . . . . . . . 274
11.2.2 Dynamic matrix control (DMC) . . . . . . . . . . . . . . . . . . . . . . 275
11.3 Control Methodologies Design . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
11.3.1 Discrete time model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
11.3.2 Switching control method . . . . . . . . . . . . . . . . . . . . . . . . . 278
11.3.3 Demonstration of tuned DMC parameters for control system of
cardio-respiratory response to exercise . . . . . . . . . . . . . . . . . . 280
† Centre for Health Technologies, Faculty of Engineering and Information Technology, University of Technology, Sydney, Australia.
‡ Human Performance Group, Biomedical Systems Lab, School of Electrical Engineering and Telecommunications, University of New South Wales, Sydney NSW 2052, Australia.
# College of Health and Science, University of Western Sydney, Penrith NSW 2751, Australia.


11.3.4 Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 282


11.4 Conclusions and Outlook . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 284

11.1. Introduction

As heart rate was found to be a predictor of major ischemic heart disease


events, cardiovascular mortality and sudden cardiac death, [1, 2] many
scholars have been interested in monitoring it to evaluate the cardiovascular
fitness. [3–7] Heart rate is determined by the number of heartbeats per unit
of time, typically expressed as beats per minute (BPM), and can vary as
the body’s need for oxygen changes during exercise.
SVM regression offers a solution for nonlinear behavior description of
heart rate response to moderate exercise as shown in the previous chapter
[8]. This SVM-based model for heart rate estimation can be used to
implicitly indicate some key cardio-respiratory responses to exercise, such
as oxygen uptake as discussed in [9] and [10]. In our previous study [8], it
is shown that the time constants of the heart rate response vary. The time constant is often
bigger at the offset of exercise than at the onset of exercise. The captured
difference leads to setting up two nonlinear models separately to present
the dynamic characteristics of heart rate at the onset and offset of exercise.
A process possessing two or more quite different characteristics is
quite common, not only in human body responses but also in some
industrial processes. For instance, the boiler, a closed vessel in which water or
another liquid is heated, is widely used in the live steam sector. It
may take only a few minutes to heat the water (or other liquid) up to
100 ◦C, but in general it may take several hours to cool down. Although it
is evident that using different models for different stages (as shown in [8])
may provide a more precise description of the process, it requires a more
advanced control strategy to handle this complicated mechanism.
MPC is a family of control algorithms that employ an explicit model
to predict the future behavior of the process over a prediction horizon.
The controller output is calculated by minimizing a pre-defined objective
function over a control horizon. Figure 11.1 illustrates the ”moving
horizon” technique used in MPC. Recently, MPC has established itself in
industry as an important form of advance control [11] due to its advantages
over traditional controllers. [12, 13] MPC displays improved performance
because the process model allows current computations to consider future
dynamic events. This provides benefits when controlling processes with
large dead times or non-minimum phase behavior. MPC also allows for the

incorporation of hard and soft constraints directly in the objective function.


In addition, the algorithm provides a convenient architecture for handling
multivariable control due to the superposition of linear models within the
controller.

Fig. 11.1. The ”moving horizon” concept of model predictive control.

In this study, one of the most popular MPC algorithms, Dynamic Matrix
Control, is selected to control the heart rate responses based on previously
established SVM-based nonlinear time variant model. The major benefit of
using a DMC controller is its simplicity and efficiency for the computation
of optimal control action, which is essentially a least-square optimization
problem.

To handle different dynamic characteristics at the onset and offset


of exercise, the switching control strategy has been implemented during
the transmission between the onset and offset of treadmill exercises. By
integrating the proposed modeling and control methods, heart rate response
to exercise has been optimally regulated at both the onset and offset of
exercise.
The organization of this chapter is as follows. The preliminaries
of DMC/MPC are clarified in Section 11.2. The proposed switching
DMC control approach is discussed in Section 11.3, which is followed by
simulation results. Conclusions are given in Section 11.4.

11.2. Background

11.2.1. Model-based predictive control (MPC)

11.2.1.1. MPC structure

The basic structure of the MPC controller consists of two main elements as
depicted in Fig. 11.2. The first element is the predicted model of the process
to be controlled. Further, if any measurable disturbance or noise exists in
the process it can be added to the model of the system for compensation.
The second element is the optimizer that calculates the future control action
to be executed in the next step while taking into account the cost function
and constraints [14].

Fig. 11.2. Structure of model predictive control.



11.2.2. Dynamic matrix control (DMC)

DMC uses a linear finite step response model of the process to predict
the process variable profile, ŷ(k + j), over j sampling instants ahead of the
current time, k:

      ŷ(k + j) = y0 + Σ_{i=1}^{j} Ai △u(k + j − i) + Σ_{i=j+1}^{N−1} Pi △u(k + j − i),
                      (effect of current and future moves)  (effect of past moves)
      j = 1, 2, ..., P;   N: model horizon,                                 (11.1)

where P is the prediction horizon and represents the number of sampling


intervals into the future over which DMC predicts the future process
variable [15]. In (11.1), y0 is the initial condition of the process variable,
△ui = ui − ui−1 is the change in the controller output at the i-th sampling
instant, Ai and Pi are composed of the unit step response coefficients
of the process, and N is the model horizon and represents the number of
sampling intervals of past controller output moves used by DMC to predict
the future process variable profile.
The current and future controller output moves have not been
determined and cannot be used in the computation of the predicted process
variable profile. Therefore, (11.1) reduces to


      ŷ(k + j) = y0 + Σ_{i=j+1}^{N−1} Pi △u(k + j − i) + d(k + j),          (11.2)

where the term d(k + j) combines the unmeasured disturbances and the
inaccuracies due to plant–model mismatch. Since future values of the
disturbances are not available, d(k + j) over future sampling instants is
assumed to be equal to the current value of the disturbance, or


      d(k + j) = d(k) = y(k) − y0 − Σ_{i=1}^{N−1} Hi △u(k − i),             (11.3)

where y(k) is the current process variable measurement.


The goal is to compute a series of controller output moves such that

Rsp (k + j) − ŷ(k + j) = 0. (11.4)



Substituting (11.1) in (11.4) gives


      Rsp(k + j) − y0 − Σ_{i=j+1}^{N−1} Pi △u(k + j − i) − d(k + j)
          (predicted error based on past moves, e(k + j))
                     = Σ_{i=1}^{j} Ai △u(k + j − i).                        (11.5)
          (effect of current and future moves to be determined)

Equation (11.5) is a system of linear equations that can be represented


as a matrix equation of the form

      ⎡ e(k+1) ⎤   ⎡ a1    0     0     ···  0      ⎤   ⎡ △u(k)      ⎤
      ⎢ e(k+2) ⎥   ⎢ a2    a1    0     ···  0      ⎥   ⎢ △u(k+1)    ⎥
      ⎢ e(k+3) ⎥   ⎢ a3    a2    a1    ···  0      ⎥   ⎢ △u(k+2)    ⎥
      ⎢   ⋮    ⎥ = ⎢ ⋮     ⋮     ⋮     ⋱    ⋮      ⎥ × ⎢    ⋮       ⎥       (11.6)
      ⎢ e(k+M) ⎥   ⎢ aM    aM−1  aM−2  ···  a1     ⎥   ⎣ △u(k+M−1)  ⎦ M×1
      ⎢   ⋮    ⎥   ⎢ ⋮     ⋮     ⋮          ⋮      ⎥
      ⎣ e(k+P) ⎦   ⎣ aP    aP−1  aP−2  ···  aP−M+1 ⎦
         P×1                    P×M

or in a compact matrix notation as [15]

ē = A △ ū, (11.7)

where ē is the vector of predicted errors over the next P sampling instants,
A is the dynamic matrix and △ū is the vector of controller output moves
to be determined.
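To make the structure of (11.6)–(11.7) concrete, the sketch below assembles the P × M dynamic matrix A from the unit step response coefficients of an assumed first order process; the process parameters and horizons are illustrative assumptions only.

    import numpy as np

    def step_coefficients(K, T, Ts, n):
        """Unit step response coefficients a_i of K/(T s + 1), sampled every Ts seconds."""
        t = Ts * np.arange(1, n + 1)
        return K * (1.0 - np.exp(-t / T))

    def dynamic_matrix(a, P, M):
        """P x M dynamic matrix of (11.6): A[i, j] = a_{i-j+1} for j <= i, else 0."""
        A = np.zeros((P, M))
        for i in range(P):
            for j in range(min(i + 1, M)):
                A[i, j] = a[i - j]
        return A

    a = step_coefficients(K=25.0, T=40.0, Ts=2.0, n=30)   # assumed first order process
    A = dynamic_matrix(a, P=30, M=10)
    print(A.shape)                                        # (30, 10)
    print(np.round(A[:3, :3], 3))                         # banded lower-triangular corner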
An exact solution to (11.7) is not possible since the number of equations
exceeds the degrees of freedom (P > M ). Hence, the control objective is
posed as a least squares optimization problem with a quadratic performance
objective function of the form

      min_{△ū} J = [ē − A △ū]ᵀ [ē − A △ū].                                  (11.8)

The DMC control law of this minimization problem is


      △ū = (AᵀA)⁻¹ Aᵀ ē.                                                    (11.9)
Implementation of DMC with the control law in (11.9) results in excessive
control action, especially when the control horizon is greater than one.
Therefore, a quadratic penalty on the size of controller output moves is
introduced into the DMC performance objective function. The modified
objective function has the form
      min_{△ū} J = [ē − A △ū]ᵀ [ē − A △ū] + [△ū]ᵀ λ [△ū],                   (11.10)

where λ is the move suppression coefficient (controller output weight). This


weighting factor plays a crucial role in the DMC optimizer. If the value
of λ is large, the optimizer attaches more importance to the size of
△u, so that the robustness of the controller output moves is directly
improved, but the accuracy with which the output follows the reference profile
might be sacrificed. Conversely, if the value of λ is reduced, optimizing
the effect of ē has higher priority than that of △u, and the accuracy of the system
becomes more significant than its robustness.
In the unconstrained case, the modified objective function has a closed
form solution of [16, 17]
      △ū = (AᵀA + λI)⁻¹ Aᵀ ē.                                               (11.11)
Adding constraints to the classical formulation given in (11.10) produces
the quadratic dynamic matrix control (QDMC) [18, 19] algorithm. The
constraints considered in this work include:
ŷmin ≤ ŷ ≤ ŷmax ; (11.12a)
△ūmin ≤ △ū ≤ △ūmax ; (11.12b)
ūmin ≤ ū ≤ ūmax . (11.12c)
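The following sketch applies the unconstrained control law (11.11) with move suppression; the predicted error vector, the first order step response coefficients and the chosen value of λ are assumptions made purely for illustration (the constrained QDMC case is not shown).

    import numpy as np

    # Assumed first order process (K = 25, T = 40 s) sampled every Ts = 2 s.
    P, M, lam = 30, 10, 300.0
    a = 25.0 * (1.0 - np.exp(-2.0 * np.arange(1, P + 1) / 40.0))
    A = np.array([[a[i - j] if j <= i else 0.0 for j in range(M)] for i in range(P)])

    e_bar = 30.0 * np.ones(P)              # illustrative predicted error vector
    du = np.linalg.solve(A.T @ A + lam * np.eye(M), A.T @ e_bar)   # equation (11.11)
    print("future moves:", np.round(du, 3))
    print("move applied at this sample (receding horizon):", round(float(du[0]), 3))

In a receding-horizon implementation, only the first element of △ū is applied before the optimization is repeated at the next sample.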

11.3. Control Methodologies Design

11.3.1. Discrete time model


From the previous studies [8], heart rate response to exercise can be
approximated as a first order process, which is expressed in the s-domain as
(11.13):

      H(s) = Y(s)/U(s) = K/(T s + 1).                                       (11.13)

For obtaining the discrete time model, (11.13) has to be transformed to
the z-domain by

      s = (1 − z⁻¹)/Ts,                                                     (11.14)

where Ts is the sample time.
Substituting (11.14) into (11.13), the model in the z-domain follows as
(11.15):

      (T/Ts + 1) y = (T/Ts) y z⁻¹ + K u.                                    (11.15)

Using y(k) z⁻¹ = y(k − 1) (k: the kth sample time), (11.15) is
transformed to the discrete time form

      y(k) = [T/(T + Ts)] y(k − 1) + [Ts K/(T + Ts)] u(k).                  (11.16)
In (11.16), T and K are the only parameters to be defined for describing
the first order model. It can be regarded as a nonlinear model when the set
of parameters K and T varies as u(k) changes. This is exactly the case for
the mathematical model of the dynamic characteristics of the cardio-respiratory
response to exercise. The relationships between the transient parameters
(K and T) and u(k), obtained from the SVR results, can be found in [8].
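As a rough illustration of how (11.16) becomes a nonlinear model when K and T depend on u(k), the sketch below simulates a walking bout using a linear interpolation of the onset values from Table 10.2 of the previous chapter in place of the SVR models of [8]; this substitution and the simulation settings are assumptions.

    import numpy as np

    speeds  = np.array([1.5, 2.5, 3.0, 3.5, 4.0, 4.5])                      # Va, Table 10.1
    K_onset = np.array([9.2583, 11.264, 10.006, 12.807, 17.753, 32.911])    # Table 10.2
    T_onset = np.array([9.4818, 10.193, 13.659, 18.618, 38.192, 55.974])    # Table 10.2

    def simulate(u_profile, Ts=2.0, y0=0.0):
        """Iterate y(k) = T/(T+Ts) y(k-1) + Ts K/(T+Ts) u(k) with speed-dependent K, T."""
        y, out = y0, []
        for u in u_profile:
            K = np.interp(u, speeds, K_onset)
            T = np.interp(u, speeds, T_onset)
            y = T / (T + Ts) * y + Ts * K / (T + Ts) * u
            out.append(y)
        return np.array(out)

    u_profile = np.full(150, 3.5)          # a 5 minute bout at 3.5 mph, Ts = 2 s
    hr_rise = simulate(u_profile)
    print("predicted steady state heart rate rise: %.1f bpm" % hr_rise[-1])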

11.3.2. Switching control method


If a control system has two or more processes, switching control
is one of the approaches commonly used in the multiple model
control field. Switching control for discrete time models increases
control accuracy, lowers system consumption and raises the efficiency
of control processing. Nevertheless, it also poses a risk to system
robustness, because of the gap between models.
By the analysis of the previous experiment results in [8], it can be
seen there are two time variant models being introduced for dynamic
characteristics of heart rate response at the onset and offset of exercise.
Certainly, two unique DMC controllers are also established, each computing
their own control actions. Although two models plus two controllers are
employed in this work, the approach can easily be extended to include as
many local models and controllers as the practitioner would like (see Fig.
11.3).
If △R(k) = R(k) − R(k − 1) is the set point change, △u(k) = u(k) −
u(k −1) is the input change of designed controller at kth sample time, uonset

Fig. 11.3. Block diagram for double model predictive switching control system.

is the controller output for onset of exercise, uoffset is the controller output
for offset of exercise, yonset is the measured output of the process for onset
of exercise and yoffset is the measured output of the process for offset of
exercise, then
If △R(k) > 0 then
umeas = uonset . (11.17)
And if △u(k) > 0
ymeas = yonset . (11.18)
Otherwise
ymeas = yoffset . (11.19)
If △R(k) = 0
And if △u(k) > 0
umeas = uonset . (11.20)

ymeas = yonset . (11.21)


Otherwise
umeas = uoffset . (11.22)

ymeas = yoffset . (11.23)


If △R(k) < 0 then
umeas = uoffset . (11.24)
And if △u(k) > 0
ymeas = yonset . (11.25)

Otherwise

ymeas = yoffset . (11.26)

From this point of view, ymeas is the actual measured output of the
control system and umeas is the actual controller output once the
controller has been selected by the above conditions.
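A compact sketch of the selection rules (11.17)–(11.26) is given below; the function and variable names are illustrative and not taken from the chapter's implementation.

    # Given the set-point change dR and the controller-output change du at sample k,
    # select which controller output and which measured output are used.
    def select_signals(dR, du, u_onset, u_offset, y_onset, y_offset):
        if dR > 0:
            u_meas = u_onset                              # (11.17)
            y_meas = y_onset if du > 0 else y_offset      # (11.18)/(11.19)
        elif dR == 0:
            if du > 0:
                u_meas, y_meas = u_onset, y_onset         # (11.20)/(11.21)
            else:
                u_meas, y_meas = u_offset, y_offset       # (11.22)/(11.23)
        else:  # dR < 0
            u_meas = u_offset                             # (11.24)
            y_meas = y_onset if du > 0 else y_offset      # (11.25)/(11.26)
        return u_meas, y_meas

    print(select_signals(dR=1.0, du=0.4, u_onset=2.5, u_offset=1.0,
                         y_onset=112.0, y_offset=108.0))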

11.3.3. Demonstration of tuned DMC parameters for


control system of cardio-respiratory response to
exercise
The parameters to be tuned relating to this project include the sample
time (Ts ), prediction horizon (P ), moving horizon (M ), model horizon (N ),
move suppression coefficient (λ), peak value of the reference profile (Rp )
and number of samples (S) (see Table 11.1).

Table 11.1. Tuning parameters for DMC control system of
cardio-respiratory response to exercise.

        controller for onset stage      controller for offset stage
P       30                              30
M       10                              10
N       180                             250
λ       300                             3000
S       900
Rp      30
Ts      2

According to previous studies [8], the range of a normal measured heart


rate step response signal during the subject's workout stage of treadmill
exercise at a pace of around 3.5 miles/hour is approximately from
100 bpm to 140 bpm. Therefore, the target heart rate variation (Rp) is set to 30
as the simulation reference scale.
Due to the nonlinearity of this study, the DMC parameters are usually
set by a combination of practical training and theoretical philosophies.
[8] For example, in order to find a best value for the move suppression
coefficient (λ), a test is carried out that employs 21 sets of steady state
gain (K) and time constant (T ) in terms of the SVR simulation results
in [8]. These 21 different experiment data can be simply treated as 21
linear models. The method of practical training tends to tailor these linear
models with a possible best value and evaluate them to find the final λ for
nonlinear DMC controllers. The tuned values for DMC onset and offset

controllers through 21 sets of test data analysis respectively amount to 300


and 3,000.
The sample time (Ts ) is set as two seconds based on the general
heartbeat rate of human beings. The experiment period is about 30 minutes
excluding 10 minutes of warm-up at the beginning of exercise. Hence, the
number of samples (S) is 900. Considering the large Ts and S, the prediction
horizon (P) and moving horizon (M) are set to 30 and 10 respectively.
In addition, the tuned parameters Ts and N improve the system response
time so that the control effect can keep up with the sample time moves.
Model horizon (N ) also needs to be tuned in order to reduce the system’s
computational time and enhance the efficiency of the DMC controller. In
particular, N is defined as a dynamic array to keep storing and shifting
the values of steady state responses of the process until it stays level. If
the length of N is too long, it would be redundant. Based on experimental
data analysis, the value of N for the DMC onset controller is set to 180, while
250 is chosen for the DMC offset controller to give a safe margin.

Fig. 11.4. Simulation results for machine learning-based double nonlinear model
predictive switching control for cardio-respiratory response to exercise with all tuning
parameters (I).

Fig. 11.5. Simulation results with added noise for machine learning-based double nonlinear
model predictive switching control for cardio-respiratory response to exercise with all
tuning parameters.

11.3.4. Simulation

The simulation results for machine learning-based double nonlinear model


predictive switching control for cardio-respiratory response to exercise
are demonstrated in Figs 11.4 and 11.5. Based on these figures, the
results are sound even when a random disturbance is added to the DMC
controllers (see Fig. 11.5), which can still efficiently avoid distortion.
In the experiment results, these complex nonlinear behaviors have been
qualitatively optimized with high accuracy by using the double model
predictive switching control approach.
On the other hand, switching control brings slight oscillations at the middle
stage (see Fig. 11.6). As the amplitudes of these oscillations do not exceed
the theoretical error range (σ ≤ 5%), they are considered acceptable at
the simulation stage.

Fig. 11.6. Simulation results for machine learning-based double nonlinear model
predictive switching control for cardio-respiratory response to exercise with all tuning
parameters (II).

11.4. Conclusions and Outlook

In this study, a machine learning-based nonlinear model predictive control


for heart rate response to exercise is introduced.
As discussed in [8], this nonlinear behavior of heart rate response
to treadmill walking exercise can be effectively captured by using SVM
regression. We investigated both steady state gain and the time constant
under different walking speeds by using the data from an individual healthy
middle-aged male subject. The experiment results demonstrate that the
time constant for recovery stage is longer than that at onset of exercise,
which provides the essential idea of forming a double-model method to describe
the corresponding onset and offset of exercise.
Based on the established model, a novel switching model predictive
control algorithm has been developed, which applies DMC algorithm to
optimize the regulation of heart rate responses at both the onset and offset
of exercise.

Simulation results indicate that the switching DMC controller can efficiently


handle the different dynamic characteristics at onset and offset of exercise.
However, it should be pointed out that the proposed approach, as most
switching control strategies, also suffers from transient behavior during
controller switching. For example, the simulation results in transition stage
(△R = 0, Fig. 11.6; Equations (11.20), (11.21), (11.22) and (11.23)) have
slight oscillation due to the quick switching between two controllers. In the
next step of this study, we will develop a bump-less transfer controller to
minimize the transient behavior and implement those methodologies in the
real-time control of heart rate response during treadmill exercises.

References

[1] A. G. Shaper, G. Wannamethee, P. W. Macfarlane, and M. Walker, Heart


rate, ischaemic heart disease and sudden cardiac death in middle-aged British
men, British Heart Journal. 70, 49–55, (1993).
[2] R. Acharya, A. Kumar, I. P. S. Bhat, L. Choo, S. S. Iyengar, K. Natarajan,
and S. M. Krishnan, Classification of cardiac abnormalities using heart rate
signals, Medical & Biological Engineering & Computing. 42(3), 288–293,
(2004).
[3] S. W. Su, B. G. Celler, A. Savkin, H. T. Nguyen, T. M. Cheng, Y. Guo, and
L. Wang, Transient and steady state estimation of human oxygen uptake
based on noninvasive portable sensor measurements, Medical & Biological
Engineering & Computing. 47(10), 1111–1117, (2009).
[4] S. W. Su, L. Wang, B. Celler, E. Ambikairajah, and A. Savkin, Estimation of
walking energy expenditure by using Support Vector Regression, Proceedings
of the 27th Annual International Conference of the IEEE Engineering in
Medicine and Biology Society (EMBS). pp. 3526–3529, (2005).
[5] M. S. Fairbarn, S. P. Blackie, N. G. McElvaney, B. R. Wiggs, P. D. Pare, and
R. L. Pardy, Prediction of heart rate and oxygen uptake during incremental
and maximal exercise in healthy adults, Chest. 105, 1365–1369, (1994).
[6] P. O. Astrand, T. E. Cuddy, B. Saltin, and J. Stenberg, Cardiac output
during submaximal and maximal work, Journal of Applied Physiology. 9,
268–274, (1964).
[7] M. E. Freedman, G. L. Snider, P. Brostoff, S. Kimelblot, and L. N. Katz,
Effects of training on response of cardiac output to muscular exercise in
athletes, Journal of Applied Physiology. 8, 37–47, (1955).
[8] W. Chen, S. W. Su, Y. Zhang, Y. Guo, N. Nguyen, B. G. Celler, and H. T.
Nguyen, Nonlinear modeling using support vector machine for heart rate
response to exercise, in H. K. Lam, S. H. Ling and H. T. Nguyen (Eds),
Computational Intelligence and its Applications: Evolutionary Computation,
Fuzzy Logic, Neural Network and Support Vector Machine Techniques. pp.
255–270, (Imperial College Press, London, 2012).

[9] S. W. Su, B. G. Celler, A. Savkin, H. T. Nguyen, T. M. Cheng, Y. Guo, and


L. Wang, Transient and steady state estimation of human oxygen uptake
based on noninvasive portable sensor measurements, Medical & Biological
Engineering & Computing. 47(10), 1111–1117, (2009).
[10] L. Wang, S. Su, B. Celler, G. Chan, T. Cheng, and A. Savkin, Assessing
the human cardiovascular response to moderate exercise, Physiological
Measurement. 30, 227–244, (2009).
[11] J. Richalet, Industrial applications of model based predictive control,
Automatica. 29(5), 1251–1274, (1993).
[12] C. E. Garcia, D. M. Prett, and M. Morari, Model predictive control: theory
and practice–a survey, Automatica. 25(3), 335–348, (1989).
[13] K. R. Muske and J. L. Rawlings, Model predictive control with linear models,
AICHE Journal. 39, 262–287, (1993).
[14] C. Bordons and E. F. Camacho, Model predictive control, 2nd edition.
(Springer-Verlag, London, 2004).
[15] D. Dougherty and D. Cooper, A practical multiple model adaptive strategy
for single-loop MPC, Control Engineering Practice. 11, 141–159, (2003).
[16] J. L. Marchetti, D. A. Mellichamp, and D. E. Seborg, Predictive control
based on discrete convolution models, Industrial & Engineering Chemistry,
Processing Design and Development. 22, 488–495, (1983).
[17] B. A. Ogunnaike, Dynamic matrix control: A nonstochastic, industrial
process control technique with parallels in applied statistics, Industrial &
Engineering Chemical Fundamentals. 25, 712–718, (1986).
[18] A. M. Morshedi, C. R. Cutler, and T. A. Skrovanek, Optimal solution
of dynamic matrix control with linear programming techniques (LDMC),
Proceedings of the American Control Conference, New Jersey: IEEE
Publications. pp. 199–208, (1985).
[19] C. E. Garcia and A. M. Morshedi, Quadratic programming solution of
dynamic matrix control (QDMC), Chemical Engineering Communications.
46, 73–87, (1986).

Chapter 12

Intelligent Fault Detection and Isolation of HVAC System


Based on Online Support Vector Machine


Davood Dehestani, †Ying Guo, ∗Sai Ho Ling, ∗Steven W. Su and ∗Hung T. Nguyen
Faculty of Engineering and Information Technology,
University of Technology,
Sydney, Australia
[email protected]

Heating, Ventilation and Air Conditioning (HVAC) systems are often


one of the largest energy consuming parts in modern buildings. Two
key issues for HVAC systems are energy saving and safety. Regular
checking and maintenance are usually the keys to tackle these problems.
Due to the high cost of maintenance, preventive maintenance plays an
important role. One cost-effective strategy is the development of analytic
fault detection and isolation (FDI) modules by online monitoring of the
key variables of HVAC systems. This chapter investigates real-time FDI
for HVAC systems by using online support vector machine (SVM), by
which we are able to train an FDI system with manageable complexity
under real-time working conditions. It also proposes a new approach
which allows us to detect unknown faults and update the classifier by
using these previously unknown faults. Based on the proposed approach,
a semi-unsupervised fault detection methodology has been developed
for HVAC systems. This chapter also identifies the variables which are
the indications of the particular faults we are interested in. Simulation
studies are given to show the effectiveness of the proposed online FDI
approach.

Contents
12.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 288
12.2 General Introduction on HVAC System . . . . . . . . . . . . . . . . . . . . . 289
12.3 HVAC Parameter Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
12.4 HVAC Model Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 292


12.5 Fault Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 293
12.6 Parameter Sensitivity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 295
12.7 Incremental–Decremental Algorithm of SVM . . . . . . . . . . . . . . . . . . 296
12.8 Algorithm of FDI by Online SVM . . . . . . . . . . . . . . . . . . . . . . . . 297
12.9 FDI Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 300
12.10 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 302
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 303

12.1. Introduction

Few energy systems are as widely used in both industrial and domestic
settings as HVAC systems. Moreover, HVAC systems usually consume the
largest portion of the energy in a building, both in industry and in homes.
It is reported in [1] that the air-conditioning of buildings accounts for 28%
of the total energy end use of the commercial sector, and that 15% to 30%
of the energy waste in commercial buildings is due to performance
degradation, improper control strategies and malfunctions of HVAC
systems. Regular checks and maintenance are usually the key to limiting
such waste. However, due to the high cost of maintenance, preventive
maintenance plays an important role, and a cost-effective strategy is the
development of fault detection and isolation (FDI).
Several strategies have been employed as FDI modules in HVAC
systems. These strategies can be classified mainly into two categories:
model-based strategies and signal processing-based strategies [2–4].
Model-based techniques use either a mathematical model or a knowledge
model to detect and isolate the faulty modes. These techniques include, but
are not limited to, the observer-based approach [5], the parity-space
approach [6], and parameter identification-based methods [7]. Henao [8]
reviewed fault detection based on signal processing. This procedure involves
mathematical or statistical operations which are performed directly on the
measurements to extract the features of faults.
Intelligent methods such as genetic algorithms (GA), neural networks
(NN) and fuzzy logic have been applied to fault detection during the last
decade. Neural networks have been used for fault detection in a range of
systems, including HVAC systems [9]. Lo [10] proposed an intelligent
technique based on a fuzzy-genetic algorithm (FGA) for automatically
detecting faults in HVAC systems. However, many intelligent methods such
as NN often require large data sets for training, and some of them are not
fast enough for real-time fault detection and isolation. This chapter
investigates methods with real-time operation capability that require less
data. Support vector machines (SVMs) have been studied extensively in the
data mining and machine learning communities over the last two decades.
The SVM is capable of both classification and regression, and it is easy to
formulate a fault detection and isolation problem as a classification problem.
An SVM can be treated as a special neural network; in fact, an SVM model
is equivalent to a two-layer perceptron neural network. Used with a kernel
function, the SVM provides an alternative training method for multi-layer
perceptron classifiers in which the weights of the network are identified by
solving a quadratic programming problem under linear constraints, rather
than by solving a non-convex unconstrained minimization problem as in
standard neural network training.
Liang [2] studied FDI for HVAC systems using a standard (offline) SVM.
In this chapter, an incremental (online) SVM is applied. Training an SVM
requires solving a quadratic programming (QP) problem, but standard
numerical techniques for QP are infeasible for the very large data sets
encountered in fault detection and isolation for HVAC systems. By using an
online SVM, large-scale classification problems can be handled in a
real-time configuration under limited hardware and software resources.
Furthermore, this chapter also provides a potential approach for
implementing FDI under a semi-unsupervised learning framework.
Based on the model structure given in [2], we constructed an HVAC
model in Matlab/Simulink and identified the variables that are most
sensitive to commonly encountered HVAC faults. Finally, the effectiveness
of the proposed online FDI approach is verified and illustrated using the
Simulink simulation platform.

12.2. General Introduction on HVAC System

In parallel with the modeling of other energy systems, HVAC modelling
was developed by Arguello-Serrano [11], Tashtoush [12] and others.
Liang [13] developed a dynamic model of an HVAC system with a
single-zone thermal space, together with a model of HVAC for thermal
comfort control based on a neural network [13]. Based on this research,
Liang developed a new model for fault detection and diagnosis [2].
Specifically, a few changes in the control signal and model parameters have
been made in order to better match real applications. Figure 12.1 shows a
simple schematic of an HVAC system. It consists of three main parts: the air
handling unit (AHU), the chiller and the control system. When the HVAC
system starts to work, fresh air passes through the cooling coil section of a
heat exchanger, where heat is exchanged between the fresh air and the
cooling water. The cooled fresh air is then forced into the room by a supply fan.

Fig. 12.1. General schematic of HVAC system.

After just a few minutes, the return damper opens to allow room air to
come back to the AHU. The mixed air then passes through the cooling coil
section, which lowers its temperature and humidity. The trade-off among
exhaust, fresh and return air is decided by the control unit, and the
temperature of the room is regulated by adjusting the flow rate of the
cooling water via a control valve. Figure 12.2 shows the block diagram of
the HVAC system with a simple PI controller. The model consists of eight
variables, of which six are defined as state variables.
Two pressures (air supply pressure Ps and room air pressure Pa) and
four temperatures (wall temperature Tw, cooling coil temperature Tcc, air
supply temperature Ts and room air temperature Ta) are taken as the six
states. The cooling water flow rate fcc is the control signal. All six state
variables, together with the cooling water outlet temperature Twater−out,
are considered as system outputs, but only one of them (the room
temperature Ta) acts as the feedback signal. It should be noted that the outlet
water temperature is not used as a state variable in this model; it is used
only as an auxiliary parameter for detecting faults. The states, control
input and controlled output are listed as follows:

Fig. 12.2. Block diagram of HVAC model with PI controller.

X = [Pa ; Ps ; Tw ; Tcc ; Ts ; Ta ]
U = [fcc ]
Y = [Pa ; Ps ; Tw ; Tcc ; Ts ; Ta ]
For the control of HVAC systems, the most popular method is
proportional–integral (PI) control, with the flow rate fcc serving as the
control input. In this study we therefore select a PI controller and tune it
using the Ziegler–Nichols reaction curve method. In order to simulate the
environmental disturbances of a real application, two disturbances are
considered in the model: the outdoor temperature and the outdoor heating
(or cooling) load. The outdoor heating/cooling load disturbs the system but
cannot be measured directly. Although it can be estimated from the
supply/return air temperature and humidity via a load observer [11], for
convenience the two disturbances are assumed to be sinusoidal functions.
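As a rough illustration of this control structure (not the chapter's Simulink model), the following Python sketch simulates a first-order room-temperature balance under PI control of the chilled-water flow rate, with sinusoidal outdoor-temperature and heat-load disturbances; all gains and thermal constants are illustrative assumptions.

    # Minimal sketch only: PI control of chilled-water flow for a first-order
    # room model with sinusoidal disturbances. Numerical values are assumed.
    import numpy as np

    dt = 1.0                                   # time step [s]
    n = int(12 * 3600 / dt)                    # 12-hour simulation (8 am to 8 pm)
    setpoint = 27.5                            # room temperature set point [deg C]
    Kp, Ki = 0.05, 1e-4                        # PI gains (assumed; would be tuned, e.g. Ziegler-Nichols)
    tau, k_cool, k_amb = 1800.0, 12.0, 0.05    # assumed thermal constants

    t = np.arange(n) * dt
    T_out = 27.0 + 3.0 * np.sin(2 * np.pi * t / (24 * 3600))   # outdoor temperature, 24-30 degC
    Q_load = 0.9 + 0.1 * np.sin(2 * np.pi * t / (24 * 3600))   # heat load, 0.8-1 kW

    Ta, integral = 30.0, 0.0                   # initial room temperature and PI integrator
    for i in range(n):
        error = Ta - setpoint                  # positive error -> more cooling needed
        integral += error * dt
        fcc = np.clip(Kp * error + Ki * integral, 0.0, 0.5)    # chilled-water flow [kg/s]
        # simple energy balance: ambient gain + internal load - cooling effect
        Ta += (k_amb * (T_out[i] - Ta) + Q_load[i] - k_cool * fcc) * dt / tau

    print("room temperature after 12 h: %.2f degC" % Ta)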

12.3. HVAC Parameter Setting

Table 12.1 shows some major parameters used in this simulation. Without
loss of generality, the simulation runs in cooling mode. It is easy to consider
heating mode simply by changing some parameters in Table 12.1. The
disturbance ranges are 24–30 °C for the outdoor temperature and 0.8–1 kW
for the heat loading. The mixed air ratio of this model is a constant value
when the system is working in steady state. The inlet water temperature is
set to Twater−in = 7 °C, whilst the outlet water temperature is set to
Twater−out = 9 °C, which may be disturbed by the cooling load.
Table 12.1. Main parameters of HVAC system.
Parameter definition Setting value
Temperature set point 27.5 ◦ C
Room space dimension 5m × 5m × 3m
Indoor cooling load range 0.8-1 kW
Outdoor temperature range 24-30 ◦ C
Outdoor humidity 55 − 75%
Max chilled water flow rate 0.5 kg/s
Air flow rate 980 m3/h
Mixed air ratio 4
Noise of temperature 5% mean value
Air handling unit volume 2m × 1m × 1m
Outside pressure 1 atm.
Inlet water temperature 7 ◦C

As with other intelligent systems, the parameters of the online SVM
classifier must be determined first. The first parameter is the maximum
penalty; setting the maximum penalty to 10 gave the best margin for the
SVM. Based on tests of different kernel functions, the Gaussian function
was chosen as the kernel because of its good performance on the HVAC
problem.
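For illustration only, the corresponding classifier configuration can be expressed as follows in Python using scikit-learn, whose "rbf" kernel is the Gaussian kernel of Section 12.7; the chapter itself uses an incremental online SVM rather than this offline fit, and the data below are synthetic.

    # Illustrative sketch: an SVM with the Gaussian (RBF) kernel and penalty C = 10.
    from sklearn.svm import SVC
    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 6))                        # 200 samples of 6 monitored variables (synthetic)
    y = np.where(X[:, 0] + 0.5 * X[:, 3] > 0, 1, -1)      # +1 healthy, -1 faulty (synthetic labels)

    clf = SVC(C=10.0, kernel="rbf", gamma="scale")        # C = maximum penalty, Gaussian kernel
    clf.fit(X, y)
    print("training accuracy:", clf.score(X, y))
    print("number of support vectors per class:", clf.n_support_)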

12.4. HVAC Model Simulation

All initial temperature values are set to morning-time conditions, and the
mathematical model is simulated for 12 hours (from 8 am to 8 pm). The
output temperatures under the normal (fault-free) condition are shown in
Fig. 12.3. The first half hour of the simulation corresponds to the transient
mode of the system. The room temperature stays at the desired value (set
point), while the other temperatures vary with a noisy profile as they act to
keep the room temperature at the set point. The wall temperature is clearly
a function of the outside temperature, as it follows the outdoor temperature
profile, whereas the other internal temperatures follow an inverse profile of
the outdoor temperature.
Fig. 12.3. Temperatures in normal condition of HVAC.

Figure 12.4 shows the control signal during the simulation. The flow
rate is set by a control valve whose signal is generated by the PI controller.
The cooling water flow rate reaches its maximum value at noon, when the
outdoor heat load and temperature are highest. The fluctuations in the
control signal and other variables at the beginning of the simulation are
related to the transient behaviour of the system.

12.5. Fault Introduction

HVAC systems may suffer from many faults or malfunctions during
operation. Three commonly encountered faults are defined in this
simulation:
• Supply fan fault
• Return damper fault
• Cooling coil pipe fouling
These faults are used only to test the performance of the proposed fault
detection system. Incipient faults are applied because they are particularly
difficult to detect. Four different models, consisting of one healthy model
and three faulty models, are generated.
Fig. 12.4. Water flow rate as the function of control signal.

Figure 12.5 shows the general profile of the fault during one day. The
amplitude of the fault increases gradually for six hours until it reaches its
maximum value, stays at that level for four hours, and finally returns
gradually to the normal condition during the last six hours.

Fig. 12.5. Fault trend during one day.

This fault profile has been applied to each fault with some minor
changes: in the air supply fan fault the profile is used with a gain of 10; in
the damper fault, with a gain of −2 and an offset of +4; and in the pipe
fault, with a gain of −0.3 and an offset of +1. The most sensitive
parameters have been identified for each fault. For the air supply fan fault,
the most sensitive parameter is the air supply pressure, which changes
between 0 and 10 Pascal during the fault period. The mixed air ratio is selected as the
indicator of the damper fault; it decreases from four to two. The cooling
water flow rate decreases from fcc to 0.7fcc in the cooling coil tube fault.
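The following short sketch (illustrative only) generates the daily fault profile described above and applies the stated gain and offset of each fault; the 16-hour window follows the six-hour rise, four-hour plateau and six-hour return given in the text.

    # Sketch of the trapezoidal daily fault profile and the per-fault scalings.
    import numpy as np

    def base_profile(t_hours):
        """Unit trapezoidal fault profile over a 16-hour window."""
        t = np.asarray(t_hours, dtype=float)
        up = np.clip(t / 6.0, 0.0, 1.0)                 # 0-6 h: gradual rise
        down = np.clip((16.0 - t) / 6.0, 0.0, 1.0)      # 10-16 h: gradual return
        return np.minimum(up, down)

    t = np.linspace(0.0, 16.0, 161)
    p = base_profile(t)

    supply_fan_fault = 10.0 * p            # air-supply pressure deviation, 0 to 10 Pa
    damper_fault     = -2.0 * p + 4.0      # mixed-air ratio, drops from 4 to 2
    pipe_fault       = -0.3 * p + 1.0      # flow-rate factor, drops from fcc to 0.7*fcc

    print("peak fan fault: %.1f Pa" % supply_fan_fault.max())
    print("minimum mixed-air ratio: %.1f" % damper_fault.min())
    print("minimum flow factor: %.2f" % pipe_fault.min())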

12.6. Parameter Sensitivity

The sensitivity of the variables with respect to each fault is analyzed in
this section. Figure 12.6 shows the sensitivity of the cooling water flow rate
with respect to the different faulty modes. It is sensitive to all three faults
(damper fault, supply fan fault and cooling water tube fault), with
sensitivities of 0.03 kg/s, 0.07 kg/s and 0.08 kg/s respectively.

Fig. 12.6. Cooling water flow rate changes.

Further analysis shows that the cooling coil temperature and the outlet
water temperature are sensitive to both the supply fan fault and the return
damper fault. Based on this sensitivity analysis, six parameters (air supply
pressure, room pressure, air supply temperature, cooling coil temperature,
outlet water temperature and water flow rate) are used for training on the
supply fan fault. Three parameters (cooling coil temperature, outlet water
temperature and water flow rate) are applied for training on the return
damper fault, while only the water flow rate is used for training on the
third fault (cooling coil pipe fouling).
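A minimal sketch of how this sensitivity analysis can drive the training data selection is given below; the variable names are hypothetical keys into a table of logged measurements, and the data are synthetic.

    # Each fault classifier is trained only on the variables sensitive to that fault.
    import numpy as np

    SENSITIVE_VARS = {
        "supply_fan_fault": ["P_supply", "P_room", "T_supply", "T_coil", "T_water_out", "f_cc"],
        "return_damper_fault": ["T_coil", "T_water_out", "f_cc"],
        "coil_pipe_fouling": ["f_cc"],
    }

    def select_features(log, fault_name):
        """Stack the columns relevant to one fault into a training matrix."""
        return np.column_stack([log[name] for name in SENSITIVE_VARS[fault_name]])

    # Example with synthetic logged data (1000 samples per variable).
    rng = np.random.default_rng(1)
    log = {name: rng.normal(size=1000)
           for name in ["P_supply", "P_room", "T_supply", "T_coil", "T_water_out", "f_cc"]}
    X_fan = select_features(log, "supply_fan_fault")       # shape (1000, 6)
    X_pipe = select_features(log, "coil_pipe_fouling")     # shape (1000, 1)
    print(X_fan.shape, X_pipe.shape)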
12.7. Incremental–Decremental Algorithm of SVM

The main advantages of the SVM include the use of the kernel trick (no
need to know the nonlinear mapping function), the globally optimal
solution (a quadratic problem) and the generalization capability obtained
by optimizing the margin [14]. However, for very large data sets, standard
numerical techniques for QP become infeasible. An online alternative,
which formulates the (exact) solution for ℓ+1 training data points in terms
of that for ℓ points and one new data point, is provided by the incremental
method. Training an SVM incrementally on new data while discarding all
previous data except their support vectors gives only approximate results
[15]. Cauwenberghs and Poggio [16] consider incremental learning as an
exact online method that constructs the solution recursively, one point at a
time. The key is to retain the Kuhn–Tucker (KT) conditions on all previous
data while adiabatically adding a new data point to the solution.
Leave-one-out is a standard procedure for predicting the generalization
power of a trained classifier, from both a theoretical and an empirical
perspective [17].
We are given n training data S = {(xi, yi)}, yi ∈ {+1, −1}, where xi
represents the condition attributes, yi is the class label (+1 for the healthy
class and −1 for the faulty class) and i indexes the training data. The
decision hyperplane of the SVM can be defined by (w, b), where w is a
weight vector and b a bias. Let w0 and b0 denote the optimal values of the
weight vector and bias. Correspondingly, the optimal hyperplane can be
written as:

w0ᵀ x + b0 = 0.   (12.1)
To find the optimum values of w and b, it is necessary to solve the
following optimization problem:

min_{w,b,ξ}  (1/2) wᵀw + C Σi ξi,   (12.2)

subject to

yi (wᵀφ(xi) + b) ≥ 1 − ξi,   (12.3)
where ξi are the slack variables, C > 0 is the user-specified penalty
parameter of the error term and φ is the feature mapping associated with
the kernel function. The SVM converts the original nonlinear separation
problem into a linear separation case by mapping the input vectors onto a
higher-dimensional feature space. In the feature space, the two-class
separation problem reduces to finding the optimal hyperplane that linearly
separates the two classes, which is transformed into a
quadratic optimization problem. Depending on the problem type, several
kernel functions can be used. Two of the best kernel functions for nonlinear
classification problems are the radial basis function (RBF) and the Gaussian
function. Equation (12.4) shows the RBF, an important kernel for nonlinear
classification.
K(xi , xj ) = exp{−γ∥xi − xj ∥2 }, γ > 0. (12.4)
In SVM classification, the optimal separating function reduces to a linear
combination of kernels on the training data, f(x) = Σj αj yj K(xj, x) + b,
with training vectors xj and corresponding labels yj. In the dual formulation
of the training problem, the coefficients αi are obtained by minimizing a
convex quadratic objective function under constraints, as given in
Equation (12.5):

min_{0 ≤ αi ≤ C}  W = (1/2) Σi,j αi Qij αj − Σi αi + b Σi yi αi.   (12.5)

With the Lagrange multiplier (and offset) b, and the symmetric
positive-definite kernel matrix Qij = yi yj K(xi, xj), the first-order
conditions on W reduce to the Kuhn–Tucker (KT) conditions described in
Equations (12.6) and (12.7):

∂W/∂αi > 0  if αi = 0,
∂W/∂αi = 0  if 0 < αi < C,   (12.6)
∂W/∂αi < 0  if αi = C,

and

∂W/∂b = 0.   (12.7)
The margin vector coefficients change value during each incremental
step so as to keep all elements in equilibrium, i.e. to keep their KT
conditions satisfied. Leave-one-out evaluation is naturally implemented by
decremental unlearning, the adiabatic reversal of incremental learning,
applied to each of the training data in the fully trained solution.
Incremental learning and, in particular, decremental unlearning therefore
offer a simple and computationally efficient scheme for online SVM
training, together with exact leave-one-out evaluation of the generalization
performance on the training data.
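As a hedged illustration of the bookkeeping behind this algorithm, the sketch below evaluates the first-order conditions of Equation (12.6) for a given coefficient vector and partitions the training points into margin, error and reserve sets; the adiabatic update of the coefficients themselves, which is the core of the incremental–decremental procedure, is omitted.

    import numpy as np

    def gaussian_kernel(X, Y, gamma=0.5):
        """K(x, y) = exp(-gamma * ||x - y||^2), the kernel of Eq. (12.4)."""
        d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)

    def kt_partition(X, y, alpha, b, C, tol=1e-6):
        """Evaluate g_i = dW/dalpha_i and split points into the sets of Eq. (12.6)."""
        Q = (y[:, None] * y[None, :]) * gaussian_kernel(X, X)
        g = Q @ alpha + y * b - 1.0
        margin = np.where(np.abs(g) <= tol)[0]    # g = 0  <->  0 < alpha_i < C
        error = np.where(g < -tol)[0]             # g < 0  <->  alpha_i = C
        reserve = np.where(g > tol)[0]            # g > 0  <->  alpha_i = 0
        return g, margin, error, reserve

    # Tiny synthetic example: before training (alpha = 0, b = 0) every g_i is negative,
    # i.e. all points violate the KT conditions and would trigger incremental updates.
    rng = np.random.default_rng(2)
    X = rng.normal(size=(20, 3))
    y = np.where(X[:, 0] > 0, 1.0, -1.0)
    g, margin, error, reserve = kt_partition(X, y, alpha=np.zeros(20), b=0.0, C=10.0)
    print("g range:", g.min(), g.max(), "| violating points:", len(error))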

12.8. Algorithm of FDI by Online SVM

A huge amount of data is generated before a fault happens, since most
HVAC systems are rather reliable. For large data sets, standard SVM
Fig. 12.7. Schematic of semi-unsupervised fault detection with online SVM.

techniques (offline SVM) become infeasible. This motivates the use of the
incremental–decremental (online) SVM. Figure 12.7 shows the proposed
fault detection scheme using incremental–decremental support vector
machine classification. The main purpose of the system is to detect
unknown faults by monitoring, during system operation, the key HVAC
variables discussed in the previous sections. In this algorithm, new faults
are detected (as unknown new faults) by comparing the outputs of the
healthy model with those of the real system. If the detected fault is similar
to an old fault, it is categorized by the algorithm as an existing fault.
Otherwise, the data is sent to the online SVM trainer to be trained as a
new fault, and the new fault can subsequently be isolated by the online
SVM as a known fault. The incremental procedure is reversible, and
decremental unlearning of each training sample produces an exact
leave-one-out estimate of the faults using all the HVAC data gathered
during operation.
The main advantage of this algorithm is that it uses only a subset of
useful data (healthy data, old faults and new faults) instead of the whole
data set. Based on this online training procedure, semi-unsupervised fault
detection can be implemented.
Figure 12.8 shows the structure of the label generation algorithm. Here,
the label y is set to +1 (non-faulty) when the error is smaller than a given
threshold and to −1 (faulty) when the error is bigger than that threshold.
Labels yp1, · · · , ypn are generated from the n variables
Fig. 12.8. Schematic of label generation algorithm for training system of each fault.

Fig. 12.9. Gradual known fault detection.




Fig. 12.10. Sudden unknown fault detection.

for each fault. For each fault, all the labels must be combined with an
appropriate logic to generate a single label, since each fault needs only one
label for training. A simple combination of the labels can be used to
generate the final label, but for very complex HVAC systems we
recommend using fuzzy membership functions and rules to generate the
final label. In this chapter, a specific fault is declared when the errors of all
its sensitive parameters exceed their thresholds.
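A simple sketch of this label-generation and combination logic is given below; the thresholds, variable values and the strict AND combination are illustrative assumptions consistent with the description above.

    # Compare each monitored variable with the healthy-model prediction, threshold
    # the error, and declare a fault (-1) only when every sensitive variable agrees.
    import numpy as np

    def variable_labels(measured, predicted, threshold):
        """+1 (non-faulty) if |error| < threshold, -1 (faulty) otherwise."""
        return np.where(np.abs(measured - predicted) < threshold, 1, -1)

    def fault_label(per_variable_labels):
        """The fault label is -1 only if all sensitive variables are labelled faulty."""
        stacked = np.vstack(per_variable_labels)
        return np.where((stacked == -1).all(axis=0), -1, 1)

    # Example: two sensitive variables for one fault, 5 time samples (synthetic).
    T_coil_meas = np.array([10.0, 10.2, 12.5, 13.0, 10.1])
    T_coil_pred = np.full(5, 10.0)
    T_wout_meas = np.array([9.0, 9.1, 11.0, 11.5, 9.05])
    T_wout_pred = np.full(5, 9.0)

    y1 = variable_labels(T_coil_meas, T_coil_pred, threshold=1.0)
    y2 = variable_labels(T_wout_meas, T_wout_pred, threshold=1.0)
    print(fault_label([y1, y2]))     # prints [ 1  1 -1 -1  1]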

12.9. FDI Simulation

Since the SVM classifier presented in the last section can only deal with
two-class cases, a multi-layer SVM framework has to be designed for the
FDI problem with multiple faulty conditions. In order to use online SVM
classification to achieve better isolation performance, three faulty models
are used in the isolation stage. A four-layer SVM classifier is designed, in
which the normal condition and the three different HVAC fault conditions
are all taken into consideration. Furthermore, it should be pointed out that

Fig. 12.11. Training coefficient via margin change.

other unknown faulty conditions can be placed in the upper layer of the FDI
system. The kernel function must be properly selected for the SVM classifier
in order to achieve high classification accuracy. In general, the linear,
polynomial, radial basis function (RBF), sigmoid and Gaussian functions
can be adopted as the kernel function. In this chapter, the Gaussian
function is used because it shows excellent performance in the simulation.
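One plausible (purely illustrative) realization of such a layered classifier is sketched below: binary Gaussian-kernel SVMs are cascaded so that the first layer separates the healthy condition from the rest and each subsequent layer isolates one fault. The exact layer arrangement of the chapter may differ, and the data and fault signatures here are synthetic.

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(3)
    X = rng.normal(size=(400, 6))
    cond = rng.integers(0, 4, size=400)        # 0 = healthy, 1..3 = fault types (synthetic)
    X[cond == 1, 0] += 3.0                     # crude fault signatures for illustration
    X[cond == 2, 3] += 3.0
    X[cond == 3, 5] -= 3.0

    layers = []
    mask = np.ones(len(X), dtype=bool)
    for fault_id in (0, 1, 2):                 # layer k: "condition == fault_id" vs the rest
        y = np.where(cond[mask] == fault_id, 1, -1)
        clf = SVC(C=10.0, kernel="rbf", gamma="scale").fit(X[mask], y)
        layers.append((fault_id, clf))
        mask &= cond != fault_id               # pass the remaining conditions to the next layer

    def diagnose(x):
        """Run a sample through the cascade; return the first layer that claims it."""
        for fault_id, clf in layers:
            if clf.predict(x.reshape(1, -1))[0] == 1:
                return fault_id                # 0 = healthy, 1 or 2 = that fault
        return 3                               # fell through: last fault type

    print("diagnosis of sample 0:", diagnose(X[0]), "| true condition:", cond[0])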
In this research, two tests are conducted systematically. The diagnosis
results and the corresponding characteristics of the SVM classifiers are
shown in Figs 12.9–12.11. Test 1 is designed to investigate the performance
of the SVM classifier on known incipient faults. Steady-state data are used
to build the four-layer SVM classifier: as mentioned in the previous sections,
data within the threshold under the normal condition indicate the fault-free
case, and data beyond the threshold indicate faults one to three. For each
normal/faulty condition, two days of data (20 hours of data per day,
between 2 am and 10 pm) are used, so a total of 4 × 40 hours of samples is
collected. Half of the data for each condition are used as training data,
whilst the rest are used as testing data for fault diagnosis. In
Fig. 12.9, the label changes from +1 (non-faulty situation) to −1 (faulty
situation) when the fault is detected. For simplicity, Fig. 12.9 shows only
two variables when fault 1 is present in the data: the cooling coil
temperature and the water outlet temperature. It is clear that the HVAC
faults can be diagnosed 100% correctly by the SVM classifier on the testing
data.
As mentioned earlier, the proposed algorithm is able to detect unknown
faults in a semi-unsupervised manner. To test this semi-unsupervised
performance, an unknown sudden fault (at the 9-hour mark), combined
with the previously introduced incipient faults, is imposed on the system in
the second test. The detection results are shown in Fig. 12.10.
The results clearly indicate that the margin changes from a high level to
a low level when incipient faults are detected. For unknown faults this
change is dramatic, since the unknown faults are abrupt. To optimize the
training process efficiently, samples from each normal/faulty condition
should be used; a group containing the maximum number of faulty training
samples is selected and applied for training. Figure 12.10 shows that the
designed SVM classifier can identify the unknown HVAC fault accurately.
Based on the simulation results, the proposed approach can also detect the
unknown faults of an HVAC system efficiently.
Figure 12.11 shows how the α coefficients change as the margin changes.
As mentioned in the previous section, these coefficients are confined between
zero and the maximum penalty, as shown in the figure.

12.10. Conclusion

This chapter has focused on the fault detection and isolation of HVAC
systems under real-time working conditions. An online SVM FDI classifier
has been developed which can be trained during the operation of the HVAC
system. Unlike the offline method, the proposed approach can even detect
new unknown faults and use them to train the classifier in real time.
Furthermore, this online approach can train the FDI module more
efficiently by discarding unnecessary data (left-out vectors) and using only
the data with high priority for classification. Due to these properties, the
proposed algorithm can be implemented in a semi-unsupervised learning
framework. The simulation study indicates that the proposed approach can
efficiently detect and isolate typical HVAC faults. In the next step, we will
validate the proposed approach using real experimental data.
References

[1] S. W. Wang, Q. Zhou, and F. Xiao, A system-level fault detection and
diagnosis strategy for HVAC systems involving sensor faults, Energy and
Buildings. 42(4), 447–490, (2010).
[2] J. Liang and R. Du, Model-based fault detection and diagnosis of HVAC
systems using support vector machine method, International Journal of
Refrigeration. 30(6), 1104–1114, (2007).
[3] T. I. Salsbury and R. C. Diamond, Fault detection in HVAC systems using
model-based feedforward control, Energy and Buildings. 33(4), 403–415,
(2001).
[4] Q. Zhou, S. W. Wang, and Z. J. Ma, A model-based fault detection
and diagnosis strategy for HVAC systems, International Journal of Energy
Research. 33(10), 903–918, (2009).
[5] K. Zhang, B. Jiang, and P. Shi, A new approach to observer-based
fault-tolerant controller design for Takagi–Sugeno fuzzy systems with state
delay, Circuits Systems and Signal Processing. 28(5), 679–697, (2009).
[6] P. S. Kim and E. H. Lee, A new parity space approach to fault detection
for general systems, High Performance Computing and Communications,
Proceedings. 37(26), 535–540, (2007).
[7] Y. Dote, S. J. Ovaska, and X. Z. Gao, Fault detection using RBFN-
and AR-based general parameter methods, Proceeding of the 2001 IEEE
International Conference on Systems, Man, and Cybernetics. 1(5), 77–80,
(2002).
[8] H. Henao and G. A. Capolino, An improved signal processing-based fault
detection technique for induction machine drives, Proceeding of the 29th
Annual Conference of the IEEE Industrial Electronics Society (IECON’03).
1(3), 1386–1389, (2003).
[9] Z. M. Du, X. Q. Jin, and B. Fan, Fault diagnosis for sensors in HVAC systems
using wavelet neural network, Proceedings of the 4th Asian Conference on
Refrigeration and Air-Conditioning (ACRA 2009). 86(9), 409–415, (2010).
[10] C. H. Lo, Y. K. Wong, A. B. Rad, and K. L. Cheung, Fuzzy-genetic algorithm
for automatic fault detection in HVAC systems, Applied Soft Computing. 7
(1), 554–560, (2002).
[11] B. Arguello-Serrano and M. Velez-Reyes, Nonlinear control of a heating,
ventilating, and air conditioning system with thermal load estimation., IEEE
Transactions on Control Systems Technology. 7(1), 56–63, (1999).
[12] B. Tashtoush, M. Molhim, and M. Al-Rousan, Dynamic model of an HVAC
system for control analysis, Energy. 30(10), 1729–1745, (2005).
[13] L. Jian and D. Ruxu, Thermal comfort control based on neural network
for HVAC application, Proceedings of the IEEE Conference on Control
Applications. 11(9), 819–824, (2005).
[14] A. S. Cerqueira, D. D. Ferreiraa, M. V. Ribeiroa, and C. A. Duque,
Power quality events recognition using a SVM-based method, Electric Power
Systems Research. 78(9), 1546–1552, (2008).
[15] N. A. Syed, H. Liu, and K. K. Sung, Incremental learning with support vector
machines, Proceeding of the International Joint Conference on Artificial
Intelligence (IJCAI-99). 11(7), 143–148, (1999).
[16] G. Cauwenberghs and T. Poggio, Incremental and decremental support
vector machine learning, Advances in Neural Information Processing
Systems. 13(13), 409–415, (2001).
[17] V. Vapnik, The Nature of Statistical Learning Theory. (Springer–Verlag,
New York, 1995).

Index

activation function, 91 cardio respiratory kinetics, 256


AdaBoost, 4 central limit theorem, 108
additive white Gaussian noise, 98 chaotic, 90
algorithmic complexity, 104 co-prime, 94
arithmetic operation, 25 communications, 97
artificial neural network, 193, complex behaviors, 93
196–199, 202, 203 computational effort, 90
assembly, 183, 184, 186 corrected QT interval, 65, 67
automation, 183–185 correlation dimension, 116
autoregressive and moving
average model, 102
data acquisition, 229
autoregressive integrated moving
defuzzification, 70
average model, 120
digital volume pulse, 222
diploid model, 105, 120
Bayesian information criterion,
107 discrete time model, 278
Bayesian learning, 104 DMC, 271
benchmark functions, 28 DMC controller, 273, 281, 282,
bias, 257 284
biomedical classification results, downsampled, 89
226 dynamic benchmark function, 28
blood glucose, 62 dynamic environment, 28
blood glucose monitoring system, dynamic matrix, 276
66
body movement, 257, 263 electrocardiogram (ECG), 255,
body movements, 255 257, 263, 264
bolt tightening, 183, 185–187, 192 electrocardiography, 67
boundedness condition, 89, 90 electroencephalogram (EEG), 63
BPM, 272 embedding dimension, 109


false nearest neighbors, 109 least-square optimization, 273


FDI, 287 Levenberg–Marquardt, 106
feature extraction, 229 limit cycle, 89
feedforward neural network, 102 linear function, 220
first order model, 278 linearly separable, 89
first order process, 277 Lorenz system, 122
fitness function, 78
fixed point, 90 Matlab system identification
fuzzification, 68 toolbox, 264
fuzzy reasoning, 70 maximum log-likelihood, 107
fuzzy reasoning model, 65, 68 maximum-margin classifier, 214
mean square error, 113
Gaussian kernel algorithm, 116 membership function, 68
general real matrix, 128, 166 memoryless, 93
genetic programming, 27 minimum description length, 104,
111
hand-written graffiti recognition, minimum message length, 107
228 model horizon, 275, 280, 281
heart rate, 65, 67, 256, 272 model predictive control
high-dimensional feature spaces, approach, 256
257 monitoring, 183–187, 191–193,
HVAC, 287, 289 196
hyperplanes, 95, 257 Monte Carbo hypothesis, 104
hypoglycemia, 62, 67 move suppression coefficient, 277,
280
identical independent moving average, 102
distribution, 104 moving horizon, 280, 281
Ikeda map, 122 MPC, 271, 272, 274, 280
impending cardiac diseases, 256 multi-layer perceptron, 89
incremental, 298
incremental learning, 296 neural network, 102, 183, 191,
information criterion, 107 193, 194
information theoretic criterion, nonlinearly separable, 90
114 normal Gaussian distribution,
108
kernel functions, 219 normalization, 230
Kolmogorov–Gabor polynomial, null hypothesis, 115
25 number of updates, 89

online SVM, 287, 289, 297 ROC curve, 78


optimizer, 274, 277 Rössler system, 110
orthogonal least square, 26
overfitting, 107 sample time, 280, 281
oxygen saturation, 255, 257, 263 Schwarz information criterion,
107
particle swarm optimization, 23, screw fastening, 183
71 screw insertion, 183, 185, 187,
pattern recognition, 90 192, 193
perceptron, 89 screw tightening, 184
perceptron convergence theorem, self-organizing feature map, 120
90 sensitivity, 77
perceptron training algorithm, 91 series-wound diploid model, 120
performance objective function, sigma delta modulators, 93
276, 277 signal processing, 97
phase space reconstruction, 116 similarity value, 231
polynomial function, 220 soft-margin AdaBoost, 5
polynomial kernel, 261 soft-margin classifier, 216
polynomial model, 24 specificity, 77
population diversities, 31 spline function, 220
pose estimation, 4 surrogate data method, 104, 115
pre-defined exercise protocol, 256 SVM, 4, 214, 256, 257, 271, 287
prediction horizon, 280, 281 SVR, 218, 255, 257, 261, 278, 280
probability of mutation, 73, 75 swarm size, 75
pulse wave velocity, 222 symbolic dynamics, 90

quadratic dynamic matrix threshold, 91


control, 277 time divisional multiplexing
quantization, 93 systems, 97
time periodically varying, 90
rate of the convergence, 90 training feature vectors, 91
RBF, 220, 261
RBF kernel, 257, 261 velocity, 26
RBF neural network, 193–195,
203 wavelet mutation, 73
real anti-symmetric matrix, 143 weight vector, 257
real symmetric matrix, 127, 128
recurrent neural network (RNN), XOR nonlinear problem, 99
128
