A Practical Approach To Model Based Neural Networks Control
Hannu Koivisto
email: [email protected]
fax: +358-31-3162340
phone: +358-31-3162656
Koivisto Hannu J: A Practical Approach to Model Based Neural Network Control
Tampere University of Technology, Tampere, Finland, 1995
Tampere University of Technology Publications 170
Abstract
This thesis presents a practical approach to model based control of nonlinear dynamical
systems using multilayered perceptron type neural networks as process models and con-
trollers. A framework for modeling and identification of nonlinear time series models is
introduced. Advanced parameter estimation methods are used to identify the weights of
the process model and controller networks. A novel stability analysis and practical meth-
ods for maintaining stability and generalization capability are introduced.
The identified models are used for model predictive control. Both the direct long range
predictive control approach and the dual network control approach are applied. The prob-
lems related to the model inverse based IMC design are partially avoided by employing a
nonlinear optimal controller within the IMC structure. General guidelines and practical
methods for model mismatch and for maintaining stability are introduced and applied.
The neuro-control workstation based on an HP9000/425 platform and the software tools for
identification and control are also introduced. This forms an efficient tool for neural net-
work identification, control design and real-time control tasks.
The approach is applied in the control of small laboratory scale water and air heating pro-
cesses, and in the multivariate control of a pilot headbox of a paper machine. The exper-
imental results show that the proposed approach yields good performance characteristics
and robust control.
Acknowledgements
The work has been carried out in the Control Engineering Laboratory, Department of
Electrical Engineering at Tampere University of Technology during the years 1990-1995.
Many people have given me their time, advice, encouragement and support. One person,
Professor Heikki Koivo, stands apart from the others. I wish to express my gratitude to
him for his guidance and continuous support during the course of this thesis and my
research work in this area.
This thesis was financially supported by the Academy of Finland, Technology Develop-
ment Centre in Finland, Foundation for the Advancement of Technology in Finland and
Wihuri Foundation, which are gratefully acknowledged.
Finally, my best thanks to my wife Aino-Liisa and daughter Aura for their unending sup-
port and patience in the course of my research work.
Hannu Koivisto
Table of Contents
Abstract i
Acknowledgements ii
Operators and Notational Conventions iii
Abbreviations v
1 Introduction 1
5 Simulation Studies 87
5.1 Experimental Setup.......................................................................... 87
5.2 Simulation Examples ....................................................................... 90
References 151
Operators and Notational Conventions
col { A }       stacks the columns of the matrix A under each other (a column vector)
det { A }       determinant of a matrix
diag { A }      diagonal of a matrix A (a column vector)
diag { x }      diagonal matrix with the vector x as its main diagonal
||x||           weighted norm, Euclidean by default, i.e. ||x||^2 = x^T x
||x||_∞         maximum norm
|A|             element-by-element absolute value of the matrix A
arg min V(x)    value of x that minimizes V(x)
q, q^-1         forward and backward shift operators
A(q^-1)         matrix polynomial in q^-1
A(q^-1, t)      time varying matrix polynomial in q^-1
A(q)            matrix polynomial in q
roots ( A(q) )  roots of the polynomial A(q) = 0
G(z), G(q)      transfer function matrix
A^(i)           i-th column of the matrix A
y^(t-1)         a finite length past history of the sequence { y(t) }, y(t-1) as newest
f( x, θ ), f( φ, θ )  model, a function f : R^n → R^m with the input regression vector
                φ ∈ R^n or x ∈ R^n and parameter vector θ ∈ R^nθ. The shorthand
                notations f( θ ) and f( x ) are also used.
h( φ, θ )       controller, a function h : R^n → R^m with the input regression vector
                φ ∈ R^n and parameter vector θ ∈ R^nθ. The notation h( θ ) is also used.
∇V( θ )         gradient of V( θ ) w.r.t. its arguments. This and all other gradients are
                row vectors if V is scalar.
∇²V( θ )        second order gradient of V( θ ) w.r.t. its arguments
Abbreviations
ARX (linear) AutoRegressive with eXogenous input (predictor)
BFGS Broyden-Fletcher-Goldfarb-Shanno
CMAC Cerebellar Model Articulation Computer
CSTR Continuous Stirred Tank Reactor
IMA Integrating Moving Average (noise)
IMC Internal Model Control
LM Levenberg-Marquardt
LRPC Long Range Predictive Control
MFD Matrix Fraction Description
MIMO Multi-Input Multi-Output
MLP Multi Layer Perceptron (neural network)
MSSE Mean Sum of Squared Errors
MWGS Modified Weighted Gram-Schmidt algorithm
NARX Nonlinear AutoRegressive with eXogenous input (predictor)
NARMAX Nonlinear AutoRegressive Moving Average with eXogenous input
NARIMAX Nonlinear AutoRegressive Moving Average with Integrating noise
and eXogenous input
NOE Nonlinear Output Error model
OE (linear) Output Error model
PRBS Pseudo Random Binary Sequence
PRS Pseudo Random Sequence
RBF Radial Basis Function (neural network)
(R)PEM (Recursive) Prediction Error Method
SISO Single-Input Single-Output
TITO Two-Input Two-Output
w.r.t. with respect to
Chapter 1 Introduction 1
1 Introduction
This thesis considers practical real-time model predictive control of nonlinear dynamical
systems using artificial neural networks (ANN) as process models and controllers. A
framework for modelling identification and control of nonlinear time series models is pre-
sented. Advanced parameter estimation methods are used to identify the weights of the
process model and the controller networks. The performance of the control systems is ver-
ified with real-time experiments using laboratory pilot processes.
This study started at the beginning of the 1990's, at a time when there was a belief that artificial
neural networks in themselves include some mysterious intelligence and robustness,
and that a combination of an ANN and some learning method will show similar adaptiveness
and capability to cope with the real world environment as humans do in everyday
life. Whether this is true for the massively parallel ANNs we will see in the future remains
open - it is true for the human brain with its 10^8 neurons and 10^12 synapses - but it is
definitely not true for the small and moderate sized artificial neural networks which engineers
use for function approximation and classification tasks. They resemble - and they should
resemble - the established methods of statistics and numerical analysis.
The background of neural computation is somewhat reflected in the study, especially in
the way that the study started directly as research towards adaptive neural network control.
This was also due to the long expertise of the Control Engineering Laboratory at
Tampere University of Technology in linear self-tuning control. Soon it was realized that
the lessons learnt when applying linear self-tuning control must be taken seriously. The first
results were encouraging, cf. Koivisto (1990) and Raiskila and Koivo (1990). Also the
first real-time experiments of adaptive neural network control proved to be a success,
Koivisto et al. (1991a, 1991b). The adaptive approach worked - mostly, but sometimes it
failed for unexplained reasons.
Off-line identified neural network models were applied with the direct model predictive
control approach - with no advantage. Only after the implementation of advanced off-line
identification methods and a detailed analysis of the properties of the resulting models
and controllers, like stability and extrapolation issues, were the reasons for the failures
isolated and explained: even small neural network models are often too flexible function
approximators. They must be constrained in order to maintain the stability and the gener-
alization capability. This increased the reliability of the approach and now practical
results were achieved, cf. Koivisto et al. (1992, 1993) and Koivisto (1994).
Some History
The history of ANN's can be traced back to the paper of McCulloch and Pitts (1943) intro-
ducing a simplified model of one neuron. A good review of the history of neural computation
can be found in Widrow (1990) or in the excellent collection of early papers by
Anderson and Rosenfeld (1988). Almost 50 different types of neural network architectures
have been developed, although only some of them are in common use. There are numerous
introductory level books on neural computation. A more detailed presentation and analysis
can be found for example in Rumelhart and McClelland (1986) or Herz et al. (1991).
The birth of the field as a self-conscious, organized discipline should probably be dated
back to June 1987, when the first IEEE Conference on Neural Networks (ICNN) took
place in San Diego. Everyone was surprised when almost 2000 people showed up (Wer-
bos 1989). The book of Rumelhart and McClelland (1986) and articles like Hopfield and
Tank (1985) had aroused great expectations. Similar hyperbole for neural networks in
control can be dated back to the Boston summer school in 1988. The abstracts of this conference
were published in the first issue of Neural Networks (1988).
Since then neural network control has progressed rapidly and also real industrial applica-
tions have been reported, e.g. Widrow et al. (1994), especially in Japan, e.g. Asakawa and
Takagi (1994). A computer survey (INSPEC) considering the experimental or practical
applications of neural network control in 1994 resulted in 262 journal articles and 869
conference papers. Of course only a portion of all these were real on-line tests of neural
network control. Fuzzy control was excluded from the search to reduce the number of
articles, although neuro-fuzzy systems form an entity and it is often hard to distinguish
between all the approaches, except by the title.
Neurocontrol
The most promising application areas of neural network control are robotics, process control,
vehicle guidance and teleoperation. A good introduction to the subject is given in the
books Neural Networks for Control (ed. Miller, Sutton and Werbos, 1990) and Hand-
book of Intelligent Control (ed. White and Sofge, 1992). Artificial neural networks can
be used in several fields of control engineering: as process models, as controllers, for opti-
mization and failure detection.
Relevant work on optimization and especially on fault diagnosis has been done, but the
scope of this study is restricted to process control applications i.e. modelling, identifica-
tion and control of nonlinear dynamical systems using artificial neural networks. Critical
research issues, when considering neural network control of dynamical systems, were
listed in Neural Networks for Control (1990).
The importance of these issues is still the same. Most of them will be considered in this
thesis in conjunction with process control.
Neural networks were first applied to robotics, using reinforcement learning schemes, e.g.
Barto et al. (1983), Anderson (1989), Michie and Chambers (1968) or applying super-
vised learning to inverse kinematics problem, e.g Psaltis et al. (1988), Elsley (1988), Bar-
Kana and Guez (1989), Zeman et al. (1989). Also on-line experiments were reported quite
soon, e.g. Miller (1989), Yabuta et al. (1989). The pioneering work of Widrow in the
sixties should also be mentioned (Widrow 1990).
Robot control applications are not considered in depth in this study. The main interest is in
process control. Bavarian (1988) presented the first overview of neural network control.
Narendra and Parthasarathy (1990) presented a general framework for identification and
control of dynamical systems using neural networks. Bhat and McAvoy (1990) presented
the first demonstrations of neural network control of chemical process systems. Since then
numerous articles have considered neural networks for process control. A more detailed
survey will be presented in Section 4.1.
The potential usefulness of neural networks in control is due to two main features: learn-
ing capability and function approximation capability. Two separate goals can be seen:
learning/adaptation and representation of nonlinear models, although these are normally
combined into (nonadaptive) nonlinear controller design or adaptive nonlinear control.
The attractive feature of neural network models is that they offer a parametric function
which can accurately represent virtually any smooth nonlinear function with similar basic
structure.
Nonlinear Control
Systems in the process control area are more or less nonlinear. Most of these processes
can be successfully controlled with conventional linear methods like PID control. If the
required operation region is large or the process is strongly nonlinear, a linear controller
is likely to perform poorly or even become unstable. Nonlinear controllers can properly
compensate the nonlinearities in a system. The application of nonlinear control methods
has been limited by the theoretical and computational difficulties associated with the prac-
tical nonlinear control design. The representation problem has been another limiting fac-
tor and the nonlinear control theory has mainly been concerned with deterministic
continuous time models derived from physical principles like mass and energy balances.
Function approximation networks have remarkably reduced this representation difficulty.
From the mathematical point of view, even the control of a known nonlinear dynamical
system is a formidable problem. This becomes substantially more complex when the system
is not completely known, e.g. Levin and Narendra (1992). We are far from being able to
optimally design controllers that can handle time varying and uncertain nonlinear sys-
tems. Robust control of nonlinear systems is currently one of the most active research
areas in the control literature.
Two main approaches to the control of nonlinear systems are the differential geometric
approach and the model based approach. The differential geometric framework allows an exact analytic
linearization of the nonlinear model using a nonlinear controller, e.g. Isidori (1989) or
Slotine and Li (1991), sometimes at a price of robustness. Linear controllers can then be
designed for the equivalent controlled linear system. The geometric approach produces
excellent results if some basic assumptions hold, mainly the assumption of a deterministic
bilinear type nonlinear system. These methods have been extended to more general mod-
els, but still some analytical invertibility assumptions must hold. See Henson and Seborg
(1990) for an overview.
Neural networks have successfully been applied to adaptive nonlinear control using the
geometric approach. Tzirkel-Hancock and Fallside (1992) demonstrate that neural
networks can be successfully used to perform an approximate input/output linearization.
Notice that a direct continuous time controller is used. Sanner and Slotine (1991) have
presented similar results.
The model based approach, on the other hand, normally uses empirically identified linear or
nonlinear discrete time models. More emphasis is put on the robustness of the control system.
The survey paper (Garcia et al. 1989) refers to Model Predictive Control (MPC) as that
family of controllers in which there is a direct use of an explicit and separately identifiable
model. The same process model is also implicitly used to compute the control action in
such a way that the control design specifications are satisfied.
Control design methods based on the MPC concept have found a wide acceptance in
industrial applications due to their high performance and robustness. There are several
variants of model predictive control methods, like Dynamic Matrix Control (DMC),
Model Algorithmic Control (MAC) and Internal Model Control (IMC). Nonlinear versions
of these have also been developed, for example a nonlinear IMC concept, e.g. Economou
et al. (1986). Another, largely independently developed branch of MPC, called General-
ized Predictive Control (GPC), is aimed more at adaptive control, e.g. Clarke and
Mohtadi (1989). For the current state-of-the-art, see Clarke (1994).
Model predictive control in this sense is a broad area and some confusion is encountered,
because the abbreviation MPC is often used to mean receding horizon (RHPC) or long
range predictive control (LRPC), where a model is used to predict the process output several
steps into the future and the control action is computed at each step by numerical minimization
of the prediction errors, i.e. no specific controller is used. This is quite different
from the concept where the model is controlled with an implicitly derived specific controller,
like in many IMC approaches.
Neural networks have successfully been applied to model based control of nonlinear sys-
tems. General guidelines can be found in Nahas et al. (1992), Psichogios and Ungar
(1991), Hunt and Sbarbaro (1991) and Ydstie (1990). A more detailed survey is presented
in Section 4.1.
Reinforcement type connectionist learning control has been studied for example by Sutton
(1984, 1988) and Anderson (1989). It can be applied to complex real world problems,
where the performance of supervised learning is poor in many cases. Most of these learn-
ing systems use algorithms with fixed structure, but they can hardly be called only adap-
tive in the sense of Landau (1979).
Supervised learning is a good choice, if correct responses are known, because it learns
faster. Only models and controllers with a fixed structure and the supervised learning
scheme are considered in this study and adaptive control is used as a synonym for
supervised self-tuning control. Similarly "learning" is often (mis)used instead of "iden-
tification".
The goal of this study was to develop efficient identification and control design methods
for neural network based nonlinear control, and to implement them in a real world envi-
ronment. The performance of the control systems was verified by simulations and with
real-time experiments using pilot processes. Thus the study also fills the gap between the-
ory and practice. This wide area is not so popular among neural network researchers,
because of its difficulty and especially due to its time consuming nature. Practice is anyway
the final measure of any control method and the area is of great importance.
A framework for modelling and identification of nonlinear time series models is intro-
duced in Chapter 3. The basic properties of the nonlinear stochastic models are discussed
and the structures for nonlinear predictors based on time series approach are introduced.
The multivariate case and the multistep prediction are also considered.
An advanced identification scheme for the parameter estimation of these neural network
predictors is presented in Section 3.3. The main underlying idea behind the presented Pre-
diction Error Method is that the gradient of the identification cost function with respect
to the network weights is computed in a proper manner corresponding to the dynamical
nature of the predictor. Methods for the stability analysis and for the projection onto the
stable domain are introduced in Section 3.4. Also other projection schemes are consid-
ered.
Chapter 4 presents the Model Predictive Control approach. A short review of the existing
neural network control methods and their applications is presented in Section 4.1. The
direct predictive control method is considered in Section 4.2. The identified neural net-
work model is used to predict the future process measurements and the control action at
each time step is computed by numerical minimization of the predicted control error. The
stability issues and the model mismatch are also discussed. The connection with the IMC
design is also analysed.
Section 4.3 presents the dual network control approach. A neural network is used as a con-
troller. The problems related to the model inverse based IMC design are partially avoided
by employing a nonlinear optimal controller within the IMC structure. The nonlinear con-
trol law is approximated by a perceptron network and the cost function associated with the
optimal controller design is minimized numerically. The stability of the resulting control
system is discussed and practical methods for maintaining the stability are introduced.
Chapter 5 presents results of the simulation studies demonstrating the properties of the
proposed neural network identification and control approach. Also the neuro-control
workstation based on a HP9000/425 platform is introduced including the software tools
for identification, simulation and controller design. The system forms a flexible and effi-
cient tool for identification of nonlinear process models, for development of control sys-
tems and for experimental studies.
Chapters 6, 7 and 8 present the experimental results obtained by applying the proposed
approach to control of several laboratory scale pilot processes. Chapter 6 considers the
real-time control of two small water heating processes. Both direct predictive control and
the dual network control are considered. Chapter 7 considers the identification and direct
predictive control of a small air heating process. Chapter 8 presents the experimental
results obtained by applying the multivariate direct predictive control to a pilot headbox
process of a paper machine.
Contributions
The full nonlinear model based control approach was applied. In this sense it is rather
irrelevant that only mildly nonlinear processes were considered. They incorporate many
of the features which commonly make control difficult: delays, noise, deterministic dis-
turbances, trends, time varying features etc. Solving the problems associated with these in
conjunction with neural networks, both from the theoretical and practical point of view, is
the main contribution of the thesis.
The author also has a contribution in this development: guidelines for neural networks as
time series models and an application to one-step-ahead predictive control (Koivisto,
1990), an application to Long Range Predictive Control, even with real time experiments
(Koivisto, Kimpimäki and Koivo, 1991a,b), and a RPEM based dual network IMC
approach (Koivisto, Ruoppila and Koivo, 1992). Identification in the presence of trend type
disturbances and identification of multistep-ahead predictors are novel contributions.
A detailed analysis of stability and extrapolation issues resulting in general guidelines and
practical methods for maintaining the stability and generalization, and the successful
application within the experimental case studies are novel contributions.
A further contribution is the analysis of the differences and robustness issues between the
LRPC approach using direct predictive models and the one using deterministic recurrent
models (the IMC approach), and between different implementations of the IMC approach.
Finally, a major contribution is of course that the approach as a whole was put into prac-
tice. The experimental results show good performance characteristics and robust control.
Chapter 2 Neural Networks as Function Approximators 9
Some of the most common neural networks are: functional mapping nets like multilayer
perceptron (MLP) and cerebellar model articulation computer (CMAC) (Albus 1975),
adaptive resonance nets (ART) that form input categories from input data, input feature
categorization nets of Kohonen (Kohonen, 1984), bilinear associative memories and
feedback networks of analog neurons (Hopfield 1985). An interpolation technique called
Radial Basis Functions (RBF) (Moody and Darken, 1989, Poggio and Girosi, 1990), is
normally viewed as a neural network. Fuzzy models are also an efficient function approx-
imation scheme.
The difference between neural networks and other approaches is tenuous, because they all
apply established statistical methods for classification and function approximation tasks
and they are commonly implemented as conventional computer programs.
Neural networks are here considered only from the point of view of function approxima-
tion. This requires interpolation, or more generally approximation of the function between
the presented data points. This feature is called generalization in the neural network ter-
minology.
Most of the neural network architectures were initially motivated by pattern recognition and
associative memory tasks. Most of the classification networks can also be used or
modified for approximating purposes. They categorize the input vector space directly or
extract similarities (features) from the input space (Kohonen net, ART, some RBF variants,
Principal Component nets (PCA, Oja 1989)). The actual functional mapping is approximated
separately for each cluster or feature, which is a relatively easy task compared to the
overall approximation of the particular function. This means that the approximation of
a complex function is transferred to a clustering problem which in turn can be quite com-
plex.
For example, Hyötyniemi (1994), and Fox and Heinze (1991) present demonstrations of
the Kohonen network as (a part of) a function approximator. Similar results using an ART-2
network are presented by Sørheim (1990). The upper layer consists of several MLP networks,
one for each cluster / feature.
On the other hand, some neural networks are directly applicable to function approxima-
tion, namely MLP networks, most RBF variants, CMAC and BMAC (B-spline CMAC, e.g.
Lane et al. 1992). These have successfully been applied to representation of complex
nonlinear functions.
The multilayer perceptron is the most used and studied neural network architecture today. It
constructs a global approximation of a multi-input multi-output function in a similar manner
as fitting a low order polynomial through a set of data points. A rich collection of
different learning paradigms has been developed.
Neural network architectures have different features and the suitability of particular archi-
tecture depends on the application type. When selecting a network type for function
approximation, a compromise between several desired features must be made. Some of
these are:
The last two items have special importance when considering representation of nonlinear
time series models. These aspects will be studied in the next section. Special emphasis is
paid to real-time specific features.
Section 2.2 Global Function Approximation 11
y = F(x)    (2.1)

where F : R^n → R^m is a mapping between an input vector x and an output vector y. The
other type is a nonlinear difference equation

x(t+1) = F(x(t))    (2.2)

The mapping is approximated using a set of sample pairs

x(t) = [ x_1(t) ... x_n(t) ]^T
y(t) = [ y_1(t) ... y_m(t) ]^T ,   t = 1...N    (2.3)
which will be called the training set. The index t can denote just a sample index or time
in time series. The behaviour of the network with the training set does not tell anything
about the generalization capability outside the trained region. The approximation should
be verified also using a separate test set. The term overlearning is used in this context to
mean that the resulting network estimates the trained examples too well and loses its
generalization capability.

[Figure: the two mapping types - a static mapping y = F(x) and a difference equation
y = x(t+1) = F(x(t)).]

1. The reader should make a distinction between recurrent and recursive. The latter is here used to
mean on-line specific features like recursive identification.
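For the difference equation case, the training set of (2.3) can be formed from a measured sequence by pairing each sample with its successor. This is a sketch; the sequence below is an illustrative stand-in for measured data:

```python
import numpy as np

# Forming training pairs (x(t), y(t) = x(t+1)) in the spirit of (2.3):
# each sample is paired with its successor as the target.
def make_training_set(series):
    """Return inputs x(t) and targets y(t) = x(t+1), t = 1...N-1."""
    return series[:-1], series[1:]

series = np.sin(0.3 * np.arange(100))    # stand-in for a measured sequence
x_train, y_train = make_training_set(series)
print(len(x_train), len(y_train))  # 99 99
```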
Training can be made using incremental (recursive) minimization of the selected cost
function or using batch learning, where the cost function over the whole training set is
computed before each parameter update pass. Batch learning can be used only as an
off-line method, while recursive learning can be used both as an off-line and an on-line
method. On-line learning needs recursive methods.
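The two training modes can be contrasted on a toy linear-in-parameters model. The model, step sizes and data are illustrative assumptions; the thesis applies these schemes to the weights of neural network predictors:

```python
import numpy as np

# Batch vs. recursive (incremental) minimization on a toy model y = theta*x.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 200)
y = 2.0 * x + 0.01 * rng.standard_normal(200)

# Batch learning: gradient over the WHOLE training set per update pass.
theta_b = 0.0
for _ in range(200):
    grad = -np.sum((y - theta_b * x) * x)    # dV/dtheta for V = 0.5*sum(e^2)
    theta_b -= 0.3 * grad / len(x)

# Recursive learning: one sample per update, usable also on-line.
theta_r, mu = 0.0, 0.5
for xt, yt in zip(x, y):
    theta_r += mu * (yt - theta_r * xt) * xt

print(round(theta_b, 2), round(theta_r, 2))  # both close to 2.0
```

Both schemes recover the underlying parameter here; the recursive form needs no access to past data and is therefore the one available on-line.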
ŷ(t) = ŷ(t, θ) = f( x(t), θ )    (2.4)

between the input vector x ∈ R^n and the predicted output vector ŷ ∈ R^m of the true
output y. The parameter vector θ ∈ R^nθ contains the parameters of the function. The global
mapping means that the change of one parameter might affect the predictions made in
the whole input-output domain. The structure of the function f is prespecified and does
not necessarily correspond to the true system structure. This type of mapping is termed a
restricted complexity approximation, e.g. Goodwin and Sin (1984). A consequence of the
restricted nature is that there may not exist any vector θ for which ŷ(t) = y(t),
t = 1...N. Instead one finds the values of θ that minimize the approximation error according
to some selected cost function.
[Fig. 2.2. Block diagram of the multilayered perceptron network with one hidden layer
and n = 2, m1 = 3, m2 = m = 2; inputs x1, x2, weight matrices W^1, W^2 and bias
vectors b^1, b^2.]
The multilayer perceptron network (MLP) implements a global mapping in the sense of (2.4).
It consists of input and output layers and one or more hidden layers. The block diagram
of an MLP with one hidden layer and m1 = 3 nodes is presented in Fig. 2.2. The superscript
denotes the layer number. The lower layer is connected to the upper one through a
series of connections called weights (matrices W^1 and W^2). Also threshold connections
(bias) are normally used (vectors b^1 and b^2 of appropriate dimension). Each layer
performs a diagonal nonlinear mapping Γ : R^k → R^k, where k is the number of nodes in the
layer

Γ(v) = [ γ_1(v_1) ... γ_k(v_k) ]^T    (2.5)

where v is the input vector to the layer. The scalar activation function γ_j is usually some
sigmoid shape function, for example

γ_j(v_j) = 1 / (1 + e^(-v_j))   or   γ_j(v_j) = tanh(v_j).    (2.6)

The outermost activation function can be a linear function, but typically the same nonlinearity
is used for all nodes. The network in Fig. 2.2 implements a nested nonlinear function

f(x) = Γ^2( W^2 Γ^1( W^1 x + b^1 ) + b^2 )    (2.7)

with diagonal activation functions Γ^1 : R^3 → R^3 and Γ^2 : R^2 → R^2.

The parameter vector θ contains the weights of the network ( W^1, b^1, W^2, b^2 ), columns
ordered to an nθ = n·m1 + m1 + m1·m + m dimensional vector. Also short-cut connections
directly from the input layer to the output layer are commonly used. These provide
a convenient way to implement a parallel linear model, if used with a linear output activation
function. For example with (2.7) one obtains

f(x) = Γ^2( W^2 Γ^1( W^1 x + b^1 ) + b^2 ) + W^L x    (2.8)

where W^L is a parameter matrix for the linear part of the function.
Hornik et al. (1989) present a proof that standard multilayer feedforward network
architectures with one hidden layer and arbitrary monotonic activation functions can
approximate virtually any smooth and continuous function of interest to any desired
degree of accuracy, provided that sufficiently many hidden units are available. In practice,
the proof is not very useful, because real networks cannot be sufficiently large and the
restricted complexity assumption must hold.
However, Sontag (1990) presents a proof that for certain problems two hidden layers are
required, contrary to what might in principle be expected from the known approximation
theorems. The differences are not based on numerical accuracy nor on capabilities for
feature extraction, but rather on a more basic classification into direct and inverse problems.
The former corresponds to the approximation of continuous functions, while the latter is
concerned with the approximation of one-sided inverses of continuous functions.
If a large number of weights is used, there is a strong tendency to overfit the data and to
perform poorly on unseen data. Methods for structure selection are nowadays available,
but they are computationally expensive. In practice the structure of the network is selected
by a heuristic trial-and-error method, using networks of increasing complexity to minimize
the selected cost function and using the test set to monitor the generalization.
The values of the weights are determined during the identification phase. A rich set of dif-
ferent optimization methods have been applied as training methods, like gradient based
methods, simulated annealing and genetic algorithms. Most of these are suitable only for
batch learning. Only gradient based optimization methods are used in this study, both for
batch and on-line learning. In the batch identification, the quadratic cost function
V_N(θ) = (1/2) Σ_{t=1}^{N} ( y(t) - ŷ(t) )^2    (2.9)

is minimized iteratively with respect to the parameter vector θ. Here ŷ(t) is the prediction of the output y(t) for the sample input x(t) according to (2.4). In the recursive case
the quadratic cost function
V(t, θ) = (1/2) Σ_{i=1}^{t} ( y(i) - ŷ(i) )^2    (2.10)

is minimized at each step t. Two classes of gradient based methods are used:
- general nonlinear optimization methods using values of the scalar cost function, its gradients and possibly its Hessian (the second order derivative matrix);
- the nonlinear least squares approach, which can be used only for quadratic cost functions, like (2.9).
At iteration count k the Taylor expansion of the cost function (2.9) around the estimate θ(k-1) gives the well-known Newton update equation, e.g. Ljung (1987)

θ(k) = θ(k-1) - [ V_N''(θ) ]^{-1} [ V_N'(θ) ]^T ,  evaluated at θ = θ(k-1)    (2.12)
On a very general level this approximation is a basis for all gradient based minimization methods, batch or recursive. Different algorithms are obtained depending on the approximation of the Hessian in (2.12). For example, if the Hessian is approximated with a diagonal matrix μ^{-1} I, the steepest descent minimization algorithm (with step size μ) is obtained.
An important and useful feature of MLP networks is that the gradient of (2.9) and various other Jacobians of the model (2.4) can be computed efficiently in a hierarchical and parallel manner using chain rule differentiation. This gradient computation method is called the generalized delta rule or error backpropagation, e.g. Rumelhart et al. (1986). This is commonly considered a minimization method, i.e. it includes also the steepest descent parameter update. The terms forward pass and backward pass are used in this study to denote the computation of the predictions (forward) and the gradients (backward) only. A detailed description of these algorithms is presented in Appendix A. The recursive and batch type nonlinear least squares approaches are presented in detail in Section 3.4.
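The forward/backward pass computation of the gradient of (2.9) can be sketched as follows for a one-hidden-layer tanh network (an illustrative Python sketch, not the thesis implementation; Appendix A gives the detailed algorithms):

```python
import numpy as np

def gradients(X, Y, W1, b1, W2, b2):
    """Batch gradient of V_N = 0.5*sum||y - yhat||^2 for a tanh MLP with a
    linear output, computed with the chain rule: a forward pass producing
    the predictions, then a backward pass propagating the errors."""
    gW1 = np.zeros_like(W1); gb1 = np.zeros_like(b1)
    gW2 = np.zeros_like(W2); gb2 = np.zeros_like(b2)
    for x, y in zip(X, Y):
        p = np.tanh(W1 @ x + b1)          # forward pass
        e = (W2 @ p + b2) - y             # prediction error yhat - y
        gW2 += np.outer(e, p); gb2 += e   # backward pass, output layer
        d1 = (W2.T @ e) * (1.0 - p**2)    # through the tanh derivative
        gW1 += np.outer(d1, x); gb1 += d1
    return gW1, gb1, gW2, gb2

# small example: n = 2 inputs, m1 = 2 hidden units, m = 1 output
rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(2, 2)), rng.normal(size=2)
W2, b2 = rng.normal(size=(1, 2)), rng.normal(size=1)
X = rng.normal(size=(5, 2)); Y = rng.normal(size=(5, 1))
gW1, gb1, gW2, gb2 = gradients(X, Y, W1, b1, W2, b2)
```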
Variable metric methods are considered the most efficient general nonlinear (batch) optimization methods. The most common are, e.g. Fletcher et al. (1990):
- Broyden-Fletcher-Goldfarb-Shanno (BFGS)
- Davidon-Fletcher-Powell (DFP)
Common nonlinear least squares methods are:
- Levenberg-Marquardt (LM)
- Gauss-Newton (GN)
Implementations of these are generally available.
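The Levenberg-Marquardt update interpolates between the Gauss-Newton step and steepest descent by damping the approximate Hessian J^T J. A minimal sketch, with an illustrative curve-fitting problem (the problem and all names here are assumptions for illustration, not taken from the thesis):

```python
import numpy as np

def lm_step(theta, residuals, jac, mu):
    """One Levenberg-Marquardt update for V = 0.5*||r(theta)||^2:
    theta <- theta - (J^T J + mu I)^(-1) J^T r.  mu = 0 gives the
    Gauss-Newton step, a large mu approaches steepest descent."""
    r, J = residuals(theta), jac(theta)
    H = J.T @ J + mu * np.eye(theta.size)     # damped GN Hessian approx.
    return theta - np.linalg.solve(H, J.T @ r)

# illustrative problem: fit y = a*exp(b*t) in the least squares sense
t = np.linspace(0.0, 1.0, 20)
y = 2.0 * np.exp(-1.5 * t)
res = lambda th: th[0] * np.exp(th[1] * t) - y
jac = lambda th: np.column_stack([np.exp(th[1] * t),
                                  th[0] * t * np.exp(th[1] * t)])
theta = np.array([1.0, 0.0])
for _ in range(50):
    theta = lm_step(theta, res, jac, mu=1e-3)
```

With a fixed small damping the iteration behaves essentially like Gauss-Newton here; practical implementations adapt mu during the iteration.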
An analogy can be seen between learning an input-output mapping and a surface reconstruction from sparse data points ( y(t), x(t) ), t = 1…N. In this sense learning is a problem of hypersurface reconstruction, e.g. Poggio and Girosi (1990), that can be mathematically formulated as the selection of a hypersurface f that solves the variational problem of minimizing the functional ( y(t) scalar for notational simplicity)

V[f] = Σ_{t=1}^{N} ( y(t) - f(x(t)) )^2 + λ ||Pf||^2    (2.13)

The second term measures the cost associated with the deviation from smoothness. The stabilizer P is usually a differential operator and the regularization parameter λ controls the compromise between the degree of smoothness of the solution and its closeness to the data. If P is an operator with radial symmetry, the solution of (2.13) has the following simple form, e.g. Poggio and Girosi (1990)
f(x) = Σ_{j=1}^{l} B_j φ_j( ||x - ξ(j)|| )    (2.14)

where B is an l-dimensional coefficient vector and ξ(j) is an n-dimensional vector which corresponds to the center of the jth radial function φ_j(·). The function f(x) is a weighted sum of l radial functions, each with its own center ξ(j). In general the norm should be separate for each radial function, i.e.

||x - ξ(j)||^2_{Σ(j)} = ( x - ξ(j) )^T Σ(j) ( x - ξ(j) )    (2.15)

where Σ(j) is an n × n dimensional weighting matrix, one for each radial function. Typically the simplest solution is used: one common diagonal matrix Σ(j) = Σ = σI. It is obvious that this type of Σ(j) greatly reduces the efficiency of the approach.
Section 2.3 Local Function Approximation 17
()
x1
y
+
x2
The use of the weighting and the selection of the radial function vary significantly. One clear difference is whether an n-dimensional radial function is used or whether it is formed as a cartesian product of n one-dimensional membership functions. A common radial function is the Gaussian function

φ_j(z) = e^{-z^2}    (2.16)

where z is the distance according to (2.14) or (2.15). The Gaussian function produces a local mapping, i.e. φ_j(z) → 0 as z → ∞. It is commonly used as an n-dimensional radial function although it can be used also as a one-dimensional radial function. A more common one-dimensional radial function is a linear membership function (triangle)

φ_j(z) = 1 - a_j z,  if a_j z ≤ 1;  0 otherwise    (2.17)

where a_j is a constant tuning parameter. This type of function is typical in fuzzy models.
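Evaluating the RBF model (2.14) with the Gaussian basis (2.16) amounts to computing the distances to the centers and summing the weighted basis outputs. A small sketch (the common width parameter sigma is an assumption added here; with sigma = 1 it reduces to (2.16) as written):

```python
import numpy as np

def rbf_eval(x, centers, B, sigma=1.0):
    """RBF model (2.14) with a Gaussian basis (2.16):
    f(x) = sum_j B_j * exp(-||x - xi_j||^2 / sigma^2).
    A single common width sigma is used, the simplest choice
    discussed in the text (Sigma(j) = sigma*I)."""
    d2 = np.sum((centers - x)**2, axis=1)    # squared distances to the centers
    return B @ np.exp(-d2 / sigma**2)

# two centers in the plane with coefficients B
centers = np.array([[0.0, 0.0], [1.0, 1.0]])
B = np.array([1.0, 2.0])
f0 = rbf_eval(np.array([0.0, 0.0]), centers, B)
```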
Consider first the n-dimensional radial function. If l = N, the centers are on distinct data points and (2.14) is equivalent to generalized splines with fixed knots, leaving the coefficients B_j and the weights Σ(j) to be determined. This approach can also be used for interpolation between prespecified local models (here the constants B_j), leaving only Σ(j) to be determined, e.g. Johansen (1994).
A common application is to identify directly a global function approximator from the data according to (2.14), i.e. as a weighted sum of local radial functions; this is known as the Radial Basis Function (RBF) network, see Fig. 2.3. In this case l « N and the centers ξ(j) are unknown, which is the main difficulty when applying this approach. This is commonly known as a clustering task. Several methods for off-line clustering are available, from simple k-means clustering (Moody and Darken 1989) up to the Kohonen network.
Chen et al. (1991) introduced an advanced method for off-line training of RBF networks: Orthogonal Least Squares (OLS) learning. The algorithm selects the best l regressors (centers, in fact) using modified Gram-Schmidt orthogonalization with a prespecified and common width σ. The widths σ(j) of the resulting RBF model can then be refined with gradient based minimization. This and other similar methods result in good performance, comparable to the MLP network.
For example a self-organizing neural network (SONN) by Tenorio (1990) constructs a net-
work, chooses the node functions and adjusts the weights. The rule for the node selection
is based on a variant of Rissanen's Minimal Description Length (MDL, Rissanen, 1978)
information criterion, which provides a trade-off between the accuracy and the complex-
ity of the model.
Although OLS, SONN and other reviewed self-organising systems produce good results, they cannot be used recursively for on-line learning, but only for batch learning. There exist only a couple of on-line self-organizing clustering methods.
The well known method of the movable centers, e.g. Moody and Darken (1989) is one
possibility, although it cannot be considered as a self-organising one. This method does
not produce good results.
The dynamically capacity allocating (DCA) network (Jokinen 1991) is suitable for on-line
learning due to its self-organizing and instant learning capability. It is also instantly for-
getting and is suitable only for plain predictive purposes, not for modelling recurrent sys-
tems.
Modern workstations can store a relevant amount of data directly in memory, and obviously more efficient on-line clustering methods will be seen in the near future.
One way to avoid the clustering task is to use an n-dimensional fixed grid for the region of interest. The whole operation region is divided into boxes, each containing its own local model (a constant in the simplest case), which are combined so that continuity and smoothness are achieved (or at least should be).
Fig. 2.4. Block diagram of a fixed-grid network: the inputs x1 and x2 are combined via a cartesian product of one-dimensional basis functions, and the corresponding local models stored in memory are summed to form the output y.
The block diagram of this network type is presented in Fig. 2.4. The CMAC network is a typical example of this approach, although it uses binary valued basis functions. The cartesian product requires a huge amount of memory in a high dimensional case. Because in practice only a small portion of the possible combinations is needed, rehash techniques are used to map the needed memory into a much smaller physical memory, e.g. Albus (1975). Due to the binary basis functions, the CMAC does not produce smooth approximations. Higher order basis functions can be used to produce smooth approximations and continuous derivatives (BMAC), but if applied to a high dimensional case, the number of parameters in the cartesian product explodes.
The analogy between the approaches can be demonstrated by considering the approximator

f(x) = [ Σ_{j=1}^{l} B_j φ_j(z_j) ] / [ Σ_{j=1}^{l} φ_j(z_j) ]    (2.18)

z_j = ||x - ξ(j)||_{Σ(j)}

with triangular φ_j. The basis functions are normalized so that the sum of the basis functions equals one.
This type of localized B-spline approximation is efficient if the input dimension is low. This can be seen from the vast number of articles considering fuzzy models and controllers. A minor note is that the use of a prespecified fixed grid (fixed knots) does not produce an optimal approximation even in the one dimensional case.
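For a one-dimensional fixed grid, the normalized approximator (2.18) with the triangular basis (2.17) reduces to piecewise-linear interpolation between the local constants. An illustrative sketch (the grid and constants are invented for illustration):

```python
import numpy as np

def tri(z, a):
    """Triangular membership function (2.17)."""
    return np.maximum(0.0, 1.0 - a * z)

def normalized_approx(x, knots, B, a=1.0):
    """Normalized basis approximator (2.18) on a one-dimensional grid:
    f(x) = sum_j B_j phi_j(z_j) / sum_j phi_j(z_j), z_j = |x - knot_j|."""
    phi = tri(np.abs(x - knots), a)
    s = phi.sum()
    return (B @ phi) / s if s > 0 else 0.0

knots = np.array([0.0, 1.0, 2.0])     # fixed grid (fixed knots)
B = np.array([0.0, 1.0, 4.0])         # local models (here constants)
f = normalized_approx(0.5, knots, B)  # midway between the first two knots
```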
All local approximators without clustering are suitable for on-line learning. A high dimensional case or the use of a very dense grid makes the learning process slow and reduces the extrapolation capability.
Fig. 2.5. The first hidden layer of the MLP network divides the input space into subregions by zero-lines (hyperplanes), each surrounded by a transition region.
Section 2.4 Global vs. Local Models
Consider first a global function approximator, an MLP network with one hidden layer. The hidden layer produces the activations according to (2.7)

p = tanh( W^1 x + b^1 )    (2.19)

These activations are zero if

W^1 x + b^1 = 0    (2.20)
This means that the input space is divided into subregions by the hyperplanes defined by
(2.20). Near these zero-hyperplanes is a transition region which corresponds to the sig-
moid shape of the activation function. The actual model output is a weighted sum of the
values of the activations (2.19). A two-dimensional example of these hyperplanes is pre-
sented in Fig. 2.5. If the data do not cover the whole region of interest (Fig. 2.6a), the MLP
network extrapolates. There is no guarantee that even the sign of the Jacobian w.r.t. the
input vector is correct outside the trained region, in fact it can be anything.
The localized function approximator behaves even worse (Fig. 2.6b). When extrapolating, the localized model typically produces an output → 0, or at least the corresponding Jacobians → 0. These predictions and Jacobians are nevertheless needed, and serious problems are encountered during the model inversion.
Precautions must be taken with both model types to obtain correct results. Ensuring (projecting) the signs of the Jacobian w.r.t. the input vector is one possibility to avoid divergence. Straightforward methods for the global approximator will be presented in Section 3.4.
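For the MLP with hidden activations (2.19) and a linear output layer, the Jacobian w.r.t. the input vector has a closed form via the chain rule. A sketch of this computation (illustrative names; the sign projection discussed above would operate on this matrix):

```python
import numpy as np

def mlp_input_jacobian(x, W1, b1, W2):
    """Jacobian of a tanh MLP output w.r.t. the input x:
    with p = tanh(W1 x + b1) and f(x) = W2 p + b2,
    df/dx = W2 diag(1 - p^2) W1.  The signs of this matrix are the
    quantities to be checked (projected) outside the trained region
    before a model inversion is attempted."""
    p = np.tanh(W1 @ x + b1)
    return W2 @ ((1.0 - p**2)[:, None] * W1)

# small example: one output, two inputs, three hidden units
rng = np.random.default_rng(2)
W1, b1 = rng.normal(size=(3, 2)), rng.normal(size=3)
W2 = rng.normal(size=(1, 3))
J = mlp_input_jacobian(np.array([0.2, -0.3]), W1, b1, W2)
```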
Fig. 2.6. Extrapolating outside the trained region could cause a divergence when inverting the model iteratively: a) MLP network, b) localized approximator.
Fig. 2.7. A separate and additive apriori model in parallel with the neural network: the outputs of the process model blocks are summed to form the prediction ŷ.

Two guidelines can be given:
- Ensure the Jacobian values (signs) around the borders of the trained region and ensure that only the trained region is used during the inversion.
- Use a separate and additive global model (Fig. 2.7) obtained using apriori knowledge or via identification. A crude approximation is enough; the signs of the Jacobians are the important part.
Using these guidelines the localized approximator can also be successfully applied with predictive control. However, the projection operation is more difficult than with the global approximator. Handling the projection within on-line identification is also difficult.

As discussed in Sections 2.2 and 2.3, there are several function approximation networks with different properties. Some of them are suitable for on-line learning, some for localized representation, and some for high dimensional problems. It seems that only global approximators can easily combine all the necessary properties. The MLP network is selected as the basic building block to be used in this study, although localized representations have some useful properties over the MLP network.
Chapter 3 Modelling and Identification
This chapter presents a framework for modelling and identification of nonlinear dynami-
cal systems. The multilayer perceptron neural network is used as the basic building block
when representing the nonlinear time series models.
It is well-known that a nonlinear system can be described by a nonlinear time series model
involving nonlinear regression of past data. As indicated by Priestley (1988), a general
model of a discrete time noise process takes the form
y(t) = H( v(t), v(t-1), …, v(t-k), … )    (3.1)

where y(t) is a scalar measurement at time t and { v(t) } is a white noise sequence.
Equation (3.1) represents a general non-anticipative (causal) nonlinear model involving
infinite dimensional nonlinear regression of past data. If H is assumed to be sufficiently well-behaved, it can be expanded in a Taylor series about some fixed point, say v_0 = ( 0, 0, 0, … ), resulting in

y(t) = y_0 + Σ_{i=0}^{∞} c_i v(t-i) + Σ_{i=0}^{∞} Σ_{j=0}^{∞} c_{ij} v(t-i)v(t-j) + …    (3.2)

with coefficients c_i, c_{ij}, … obtained from the expansion and y_0 = H(v_0). This expansion is known as a (discrete time) Volterra series and it provides an important type of representation for nonlinear models.
When a measurable input u(t) is present, the general model (3.1) extends to

y(t) = G( u(t), u(t-1), …, u(t-k), …, v(t), v(t-1), …, v(t-k), … )    (3.3)

where G is assumed to be smooth and continuous so that a Volterra series expansion can be derived. The sequences { y(t) } and { u(t) } are now assumed to be at least quasi-stationary bounded sequences (Ljung 1987).
Obviously, and as indicated by Billings and Voon (1986), this nonlinear model (3.3) cannot be presented as a deterministic part and a separate noise model. Three blocks are needed: the deterministic part G^u, the noise model G^v and the block G^{uv}, which represents an interaction between the input u and the noise v (see Fig. 3.1). Depending on the actual physical system some blocks might be missing. This presentation is of practical importance in model validation using correlation tests, e.g. Billings and Voon (1986).
In practice finitely many past values are used, and the model becomes

y(t) = G( Y^{(t-1)}, U^{(t-1)}, V^{(t)} )    (3.4)

where

Y^{(t-1)} = [ y(t-1) … y(t-l_y) ]^T
U^{(t-1)} = [ u(t-1) … u(t-l_u) ]^T
V^{(t)} = [ v(t) … v(t-l_v+1) ]^T
Fig. 3.1. The nonlinear system as three blocks: the deterministic part G^u, the noise model G^v and the interaction block G^{uv}, whose outputs are summed to form y.
In the nonlinear system model (3.4) the noise may enter the system internally and cannot always be translated to be additive. In this case there is no simple way to obtain a one-step-ahead prediction ŷ(t+1|t) for the measurement y(t+1) (at time t). If the noise is additive, the model can be presented as

y(t) = Ḡ( Y^{(t-1)}, U^{(t-1)}, V^{(t-1)} ) + v(t)    (3.5)

where Ḡ is not the same as G in (3.4). Also v(t) is normalized to have a unit coefficient. Now the one-step-ahead predictor can be derived, if the function Ḡ and the structure of the regression vectors are known. At time t the best prediction ŷ(t+1|t) for the output y(t+1) is the one which predicts the future noise v(t+1) to be zero, resulting in

ŷ(t+1|t) = Ḡ( Y^{(t)}, U^{(t)}, V^{(t)} )    (3.6)

and the prediction error

ε(t+1) = y(t+1) - ŷ(t+1|t) = v(t+1)    (3.7)
If the prediction and the estimation of v(t) are made recursively and the system (3.6) is (globally asymptotically) stable, the predictor converges to a quasi-stationary state, to the heuristically best available predictor. The term stable is used here in a rather loose context. More precise definitions will be given in Section 3.4. Also the assumption of quasi-stationary sequences includes an assumption of some sort of stability. A combined representation of this one-step-ahead predictor is

ε(t) = y(t) - ŷ(t|t-1)
ŷ(t+1|t) = Ḡ( Y^{(t)}, U^{(t)}, E^{(t)} )    (3.8)

where E^{(t)} = [ ε(t) … ε(t-l_ε+1) ]^T denotes finitely many ( l_ε ) values of the prediction error sequence { ε(t) }. As mentioned before, it is not in general possible to obtain a simple closed form for the optimal predictor of the output of a nonlinear system (3.4). A sensible approach is to seek the best predictor of a given structure, a restricted complexity predictor, e.g. Goodwin and Sin (1984), and to assume the noise to be additive:

ŷ(t+1|t) = f( φ(t+1), θ )
ε(t) = y(t) - ŷ(t|t-1)    (3.9)
where

φ(t+1) = [ y(t) … y(t-n_a+1),
           u(t) … u(t-n_b+1),
           ε(t) … ε(t-n_c+1) ]^T

and f is a function with a prespecified (selected) structure and constant unknown parameters θ, which are to be identified. The number of past values (model orders n_a, n_b and n_c) in the input data vector φ(t+1) should correspond to the true system structure.
This restricted complexity approach can be clarified using the state space point of view.
Assume that the single input single output (SISO) system model (3.5) is presented in non-
linear state space form
x(t+1) = g( x(t), u(t), θ_1, v_1(t) )
y(t) = h( x(t), θ_2, v_2(t) )    (3.10)

where g and h are known smooth continuous functions with constant parameters θ_1 and θ_2. The sequences { v_1(t) } and { v_2(t) } are independent white noise sequences. The system is also assumed to be globally asymptotically stable. Assume first that the model parameters θ_1 and θ_2 are known. Referring to the Extended Kalman Filter, e.g. Goodwin and Sin (1984), the one-step-ahead predictor is sought in the form

x̂(t+1) = g( x̂(t), u(t), θ_1, 0 ) + K(θ_3) ( y(t) - ŷ(t|t-1) )
ŷ(t+1|t) = h( x̂(t+1), θ_2, 0 )    (3.11)
In practice the functions g and h are not known and they must also be approximated. In that sense the predictors (3.9) and (3.11) are identical. The nonlinear time series model approach (3.9) is clearly simpler and more suitable for predictor identification.
The prediction error approach (Section 3.3) will be used as the parameter estimation tool to identify the unknown parameters θ. The advantage of the prediction error algorithm is that it provides (locally) the best predictor for a given structure.

Various predictor structures are presented in Section 3.2. If there is apriori knowledge about the noise model structure, the predictor and the identification task might be remarkably simplified.
Multilayer perceptrons are selected as the basic building block to represent the nonlinear-
ities in the time series models, because they can approximate virtually any smooth con-
tinuous function. Also conventional solutions can be useful, if the general form of the
nonlinearities is known in advance. Most conventional models presume some prefixed
nonlinear structure with unknown parameters, which are determined during the identifi-
cation phase.
As indicated before, e.g. Priestley (1988) or Monaco and Norman-Cyrot (1986), any nonlinear system can, under certain conditions, be represented as a corresponding series such as the Volterra series. When the nonlinearity of the true system is unknown, one possibility is to approximate the Volterra series to some extent.
A typical conventional model structure of this type is

A_1(q^{-1}) y(t) = Σ_{k=1}^{l} B_k(q^{-1}) [ f_k( u(t-d) ) ] + Σ_{k=1}^{j} A_k(q^{-1}) [ g_k( y(t) ) ] + c    (3.12)

where the finite length polynomials in q^{-1} are

A_1(q^{-1}) = 1 + a_11 q^{-1} + …
A_k(q^{-1}) = a_k1 q^{-1} + a_k2 q^{-2} + … ,  k = 2…j
B_k(q^{-1}) = b_k0 + b_k1 q^{-1} + b_k2 q^{-2} + … ,  k = 1…l

and d is the delay. The functions f_k and g_k must be selected using apriori knowledge. If the functions are selected as polynomials, the well-known Hammerstein model is obtained as a special case of (3.12). Other possibilities are presented for example in the survey paper of Haber and Unbehauen (1990).
This type of model is useful for identification and adaptive control (e.g. Vadstrup 1988), because it is linear with respect to the parameters (linear-in-the-parameters). The key question is whether the selected nonlinearity corresponds to the structure of the true system. The multi-input multi-output (MIMO) case is also quite complex, as can be seen from the Volterra series (3.2).
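Because a model of the form (3.12) is linear-in-the-parameters, its identification reduces to ordinary least squares once the nonlinearities f_k and g_k are fixed. A minimal noise-free sketch with an assumed second order polynomial input nonlinearity (the system and its coefficients are invented for illustration):

```python
import numpy as np

# Hammerstein-type model, a linear-in-the-parameters special case of
# (3.12): y(t) = a*y(t-1) + b1*u(t-1) + b2*u(t-1)^2.  Because the
# unknowns enter linearly, ordinary least squares identifies them.
rng = np.random.default_rng(0)
u = rng.uniform(-1.0, 1.0, 200)
y = np.zeros(200)
for t in range(1, 200):
    y[t] = 0.7 * y[t-1] + 0.5 * u[t-1] + 0.3 * u[t-1]**2

Phi = np.column_stack([y[:-1], u[:-1], u[:-1]**2])   # regressor matrix
theta, *_ = np.linalg.lstsq(Phi, y[1:], rcond=None)  # estimates [a, b1, b2]
```

In the noise-free case the estimate recovers the true coefficients exactly (up to numerical precision); with noise the usual least squares properties apply.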
The multilayer perceptron network approach also has a fixed structure with unknown parameters, but so much expressive power that it can in principle represent any continuous function. The disadvantage is that the MLP is not linear with respect to the parameters, and more complicated identification schemes must be applied.
There are several possibilities to simplify the representation, if apriori knowledge about
the type of the nonlinearity is available. For example Narendra and Parthasarathy (1990)
characterize discrete time nonlinear models (in fact neural networks) with four different
classes (assume a deterministic case with d = 1 for notational simplicity):
y(t+1) = α(q^{-1}) y(t) + g( U^{(t)} )    (3.13a)
y(t+1) = f( Y^{(t)} ) + β(q^{-1}) u(t)    (3.13b)
y(t+1) = f( Y^{(t)} ) + g( U^{(t)} )    (3.13c)
y(t+1) = f( Y^{(t)}, U^{(t)} )    (3.13d)

where α(q^{-1}) and β(q^{-1}) are linear polynomials, and f and g are nonlinear functions. It is evident that model (3.13d) includes models (3.13a)...(3.13c). However, model (3.13d) is analytically the least tractable and hence for practical applications some other models might prove more attractive. Also the bilinear model

y(t+1) = f( Y^{(t)}, U^{(t-1)} ) + g( Y^{(t)}, U^{(t-1)} ) u(t)    (3.13e)

has attractive features from the control point of view, c.f. Slotine and Li (1991).
Section 3.2 Nonlinear Predictors
The simplest way is to predict future outputs as a function of past measurements and inputs. This nonlinear extension of the linear ARX model is termed the Nonlinear AutoRegressive with eXogenous inputs (NARX) model. The nonlinear system is assumed to be of the form
y(t) = h( Y^{(t-1)}, U^{(t-d)} ) + v(t)    (3.14)

where h is a nonlinear function and d is the input delay. An example of this model is presented in Fig. 3.2. Note how the additive noise enters the system. The one-step-ahead predictor is sought in the form

ŷ(t+1|t) = f( φ(t+1), θ )
φ(t+1) = [ y(t) … y(t-n_a+1),    (3.15)
           u(t-d+1) … u(t-d-n_b+2) ]^T
Fig. 3.2. An example of a NARX type system: the noise v(t) enters additively before the output is fed back through the delays into the nonlinearity h(·).
The main advantage of the NARX representation is its simplicity and the fast convergence of its identification; faster than for the predictor types presented in the sequel. For this reason it is often used even if the noise model assumption is not correct.
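Forming the NARX input vector (3.15) is mostly a matter of careful time indexing. A small sketch (array positions stand for time instants; all names are illustrative):

```python
import numpy as np

def narx_regressor(y, u, t, na, nb, d):
    """NARX input vector (3.15) for predicting y(t+1):
    phi(t+1) = [y(t)..y(t-na+1), u(t-d+1)..u(t-d-nb+2)]^T.
    t must be large enough that all past values exist in the arrays."""
    past_y = [y[t - i] for i in range(na)]
    past_u = [u[t - d + 1 - i] for i in range(nb)]
    return np.array(past_y + past_u)

# toy data where the value encodes its own time index
y = np.arange(10.0)
u = np.arange(10.0) * 0.1
phi = narx_regressor(y, u, t=5, na=2, nb=2, d=1)  # [y(5), y(4), u(5), u(4)]
```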
In the linear case the controller is commonly designed assuming an ARX model to represent the deterministic part of the process model, using the certainty equivalence principle. The difference between a NARX model and a deterministic process model (NOE, see below) is conceptually much bigger than in the linear case, because the NARX model cannot be divided into two separate parts like its linear counterpart (Fig. 3.3). If the identified NARX model is used as a process model for controller design, one might get poor results.
If the model assumption (3.14) is correct, one normally obtains good results. A typical situation where (3.14) does not hold is when the noise is additive at the output, after the dynamical part. Even then one achieves reasonable predictions, a nice feature used in many predictive and adaptive algorithms.
If the delay d is greater than one, a direct d-step-ahead predictor can be used:

ŷ(t+d|t) = f( φ(t+d), θ )
φ(t+d) = [ y(t) … y(t-n_a+1),    (3.16)
           u(t) … u(t-n_b+1) ]^T
Fig. 3.3. Linear ARX models can be presented in two separate parts: the output additive (coloured) noise 1/A(q^{-1}) v(t) and the deterministic part B(q^{-1})/A(q^{-1}) u(t).
In this case the predictor does not correspond to the deterministic process model even if
the noise model is correct. Eq. (3.16) can efficiently be used for predictive and adaptive
purposes, but not as a process model.
The nonlinear innovation model (3.5) will be termed the Nonlinear AutoRegressive Moving Average with eXogenous inputs (NARMAX) model. The nonlinear system is assumed to be
y(t) = h( Y^{(t-1)}, U^{(t-d)}, V^{(t-1)} ) + v(t)    (3.17)

which is a nonlinear extension of the linear ARMAX model. The one-step-ahead predictor is sought in the form

ŷ(t+1|t) = f( φ(t+1), θ )
ε(t) = y(t) - ŷ(t|t-1)
φ(t+1) = [ y(t) … y(t-n_a+1),    (3.18)
           u(t-d+1) … u(t-d-n_b+2),
           ε(t) … ε(t-n_c+1) ]^T
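Evaluating the NARMAX predictor (3.18) over a data record is inherently recursive, because the regressor contains past prediction errors which are produced by the predictor itself. A sketch of this recursion for any one-step predictor f (illustrative Python; the trivial example predictor is an assumption for testing):

```python
import numpy as np

def narmax_predict(y, u, f, na, nb, nc, d=1):
    """Run a NARMAX one-step-ahead predictor (3.18) over a data record.
    The regressor contains past outputs, past inputs and past prediction
    errors eps(t) = y(t) - yhat(t|t-1), which are generated recursively
    while the predictor is evaluated.  f is any callable yhat = f(phi)."""
    N = len(y)
    start = max(na, nb + d - 1, nc)      # first t with a full regressor
    yhat = np.zeros(N); eps = np.zeros(N)
    for t in range(start, N - 1):
        phi = np.concatenate([y[t - na + 1:t + 1][::-1],
                              u[t - d - nb + 2:t - d + 2][::-1],
                              eps[t - nc + 1:t + 1][::-1]])
        yhat[t + 1] = f(phi)
        eps[t + 1] = y[t + 1] - yhat[t + 1]
    return yhat, eps

# trivial illustration: a predictor that always outputs zero gives eps = y
yhat, eps = narmax_predict(np.ones(10), np.zeros(10),
                           lambda phi: 0.0, na=1, nb=1, nc=1)
```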
Another possibility is to use past predictions instead of the prediction errors in the regression vector:

ŷ(t+1|t) = f( φ(t+1), θ )
φ(t+1) = [ y(t) … y(t-n_a+1),    (3.19)
           u(t-d+1) … u(t-d-n_b+2),
           ŷ(t) … ŷ(t-n_f+1) ]^T

where n_f is the number of previous predictions. If the system delay d is greater than one,
the predictor equation can be formulated in several ways. See Goodwin and Sin (1984)
for a lengthy discussion about indirect and direct d-step-ahead predictors for linear sys-
tems. One possible direct nonlinear d-step-ahead predictor is of the form
ŷ(t+d|t) = f( φ(t+d), θ )
φ(t+d) = [ y(t) … y(t-n_a+1),    (3.20)
           u(t) … u(t-n_b+1),
           ŷ(t) … ŷ(t-n_f+1) ]^T
Fig. 3.4. An example of a NOE type system: the noise v(t) is additive at the output, after the dynamical part h(·).
The Nonlinear Output Error (NOE) model is a direct extension of the linear OE model and
it is used for identifying the deterministic part of the system. The nonlinear system is
assumed to be of the form
x(t) = h( X^{(t-1)}, U^{(t-d)} )    (3.21)
y(t) = x(t) + v(t)

where X^{(t-1)} = [ x(t-1) … x(t-l_x) ]^T. The noise is assumed to be output additive, i.e. additive after the dynamical part (Fig. 3.4). The one-step-ahead predictor is presented in the form

ŷ(t+1|t) = f( φ(t+1), θ )
φ(t+1) = [ ŷ(t) … ŷ(t-n_f+1),    (3.22)
           u(t-d+1) … u(t-d-n_b+2) ]^T
In practice the NARX and the NOE models are the most important representations of non-
linear systems. The NARX model is used mainly for predictive purposes and the NOE
model is identified to obtain a model for simulation purposes i.e. the deterministic part of
the system model. Of course the selection of the model type depends heavily on the true
noise process.
Instead of the NARMAX predictor, one can make an assumption about a linear noise
model. Now the system is assumed to be
y(t) = h( Y^{(t-1)}, U^{(t-d)} ) + [ C(q^{-1}) / D(q^{-1}) ] v(t)    (3.23)

where C(q^{-1}) and D(q^{-1}) are polynomials in q^{-1} with appropriate dimensions. A similar assumption can be made with the NOE model. The system is now assumed to be

x(t) = h( X^{(t-1)}, U^{(t-d)} )
y(t) = x(t) + [ C(q^{-1}) / D(q^{-1}) ] v(t)    (3.24)
The identification of predictors for these systems is not considered much in this study. A special case where D = 1 - q^{-1} is studied in conjunction with predictive control (Chapter 4). In general there are two possibilities: adopt ideas used for the identification of linear models, or represent the predictor as a sparse (MIMO) network with linear short-cuts.
The predictors defined above can be straightforwardly extended to the MIMO case. For example, if the nonlinear system is assumed to be

y(t) = h( Y^{(t-1)}, U^{(t-1)} ) + v(t)    (3.25)

where y ∈ R^m, u ∈ R^m, v ∈ R^m ( m > 1 ) and { v(t) } is an independent white noise sequence, the corresponding NARX predictor is

ŷ(t+1|t) = f( φ(t+1), θ )
φ(t+1) = [ y^T(t) … y^T(t-n_a+1),    (3.26)
           u^T(t-d+1) … u^T(t-d-n_b+2) ]^T
The only difference from the SISO case is that different delays can exist between the variables. Now the common delay d must be selected as the minimum delay of all loops, and the model orders n_a and n_b as the maximum orders of all loops. Because the function f is an MLP network, any delay structure can be implemented by removing weights, i.e. by using a sparse network. The general complexity of the delay structure selection in the MIMO case is analogous to that of the linear case. The selection is difficult especially if the predictor will be used for control purposes, but it is not a representation problem.
Multistep Prediction
The Long Range Predictive Control (LRPC, see Section 4.2) approach is based on the idea of reducing the controller sensitivity to high frequency noise through an objective function that minimizes the sum of squares of the control error over a finite time horizon (from t+N1 to t+N2). This requires prediction of the measurement several steps into the future.
Predicted Future Outputs = F( Known Past Data, Future Inputs, θ )    (3.27)
The main interest in this study is in recursive prediction, because the resulting models can also be used for other purposes, e.g. as a simulation model. A good reference for linear recursive multistep prediction is Holst (1977). Weigend et al. (1990) compare a combined neural network multistep predictor and the recursive approach by identifying autoregressive models for predicting a sunspot series and a chaotic time series. The results favor recursive methods.
In the recursive approach the one-step-ahead predictor is applied repeatedly:

ŷ(t+i|t) = f( φ(t+i), θ ) ,  i = 1…N2
φ(t+1) = [ y(t) … y(t-n_a+1),    (3.28)
           u(t-d+1) … u(t-d-n_b+2) ]^T
The future measurements are not known and they are replaced with the predicted ones. An example of a neural network as a recursive multistep predictor is shown in Fig. 3.5. Due to the delay, the first d-1 predictions do not depend on present or future inputs, and computational savings could be achieved if the available but less accurate predictions, like ŷ(t+1|t-1), are used instead.
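The recursive multistep predictor of Fig. 3.5 can be sketched as follows for d = 1: the one-step predictor is applied N2 times and each prediction is appended to the output history in place of the missing measurement (illustrative code; the linear example predictor stands in for the identified network):

```python
import numpy as np

def multistep_predict(f, y_hist, u_seq, na, nb, N2):
    """Recursive multistep prediction (3.28) with d = 1: the one-step
    predictor f is applied repeatedly and its predictions are fed back
    in place of the unknown future measurements (cf. Fig. 3.5).
    y_hist holds at least na past outputs (newest last) and
    u_seq[j] = u(t - nb + 1 + j), i.e. the input record starts nb - 1
    samples in the past and extends N2 - 1 samples into the future."""
    yy = list(y_hist)
    preds = []
    for i in range(N2):
        phi = np.array(yy[-na:][::-1] + list(u_seq[i:i + nb][::-1]))
        yhat = float(f(phi))
        preds.append(yhat)
        yy.append(yhat)          # the prediction replaces the measurement
    return np.array(preds)

# illustrative linear "network": yhat(t+1) = 0.5*y(t) + u(t)
f = lambda phi: 0.5 * phi[0] + phi[1]
preds = multistep_predict(f, [1.0], [1.0, 0.0, 0.0], na=1, nb=1, N2=3)
```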
Fig. 3.5. A functional block diagram of the recursive multistep predictor implementation using a neural network; the NNs are identical copies of the predictor network, n_a = 2, n_b = 1, d = 1 and N2 = 3.
An example of a combined predictor is shown in Fig. 3.6 (same prediction task as in Fig. 3.5). The identification and the usage of this predictor seem to be straightforward. However, the predictor also includes non-causal correlations like u(t+2) → ŷ(t+1). The parameters (weights) corresponding to these non-causal correlations seem to exist if adaptive control is used, i.e. they correspond to the controller. To avoid this, the neural network should have a sparse structure, which complicates the identification algorithm. This type of combined neural multistep predictor within adaptive long range predictive control is used in Koivisto et al. (1991a,b) and Kimpimäki (1990). The results clearly indicate that the non-causal connections should be removed.
The properties of the recursive multistep predictor and the combined neural network pre-
dictor within the LRPC approach are discussed for instance by Saint-Donat et al. (1991),
but with a full (non-sparse) network.
Fig. 3.6. A combined multistep predictor: a single neural network maps the past outputs y(t-1), y(t) and the inputs u(t), u(t+1), u(t+2) directly to the predictions ŷ(t+1), ŷ(t+2) and ŷ(t+3).
The predictor is commonly identified by minimizing the one-step-ahead cost function

V_ID = (1/2) Σ_{t=1}^{N} ( y(t) - ŷ(t|t-1) )^2    (3.29)
with respect to the model parameters. If the model orders are too low or the noise model assumption is not correct, this predictor will not produce the best multistep predictions and consequently not the best overall control (LRPC). What is needed is an identification method that provides suitable predictions over the range required by the controller. The effect of unmodelled dynamics in adaptive linear control is studied for example by Shook et al. (1991, 1992), Lu and Fisher (1990) and Lu et al. (1990). They propose the cost function
    V_{LRPI} = \frac{1}{2} \sum_{t=1}^{N} \sum_{j=N1}^{N2} \| y(t) - \hat{y}(t|t-j) \|^2    (3.30)
i.e. similar to that of the control scheme. This type of identification scheme is termed identification for control in the literature.

Two approaches can be used to identify models which minimize this identification cost function: either the identification data is prefiltered with a suitable filter and the one-step cost (3.29) is minimized, or the multistep cost function (3.30) is minimized directly.
These approaches have been shown to minimize the same cost function when linear systems are considered, see Shook et al. (1991). Also, the corresponding prefilter can be determined easily. Prefiltering is, however, not applicable when identifying nonlinear models, so the cost function (3.30) must be minimized numerically.
Section 3.3 Prediction Error Method 37
The general idea of the Prediction Error Method is adopted from Ljung and Söderström (1983) and the nonlinear case from Goodwin and Sin (1984), where the method is called the sequential prediction error method. The main idea behind these methods is that the gradient of the selected cost function w.r.t. the identified parameters is computed in a proper manner corresponding to the dynamical behaviour of the predictor. Similar results with a different derivation are presented by Narendra and Parthasarathy (1991) and by Nguyen and Widrow (1991), who call it dynamic back-propagation or back-propagation through time. These also include the steepest descent parameter update. Here the gradient computation and the parameter update are strictly separated.
The actual parameter update is made with the Levenberg-Marquardt (LM) approach or with the Recursive Gauss-Newton (GN) algorithm. Chen, Billings and Grant (1990) and Chen et al. (1990) presented the first Gauss-Newton type algorithm for the identification of NARX type neural network models. Combining the LM or GN method with the proper gradient computation yields the PEM and RPEM approaches also for NARMAX and NOE predictors, Koivisto et al. (1992).
    V_N(\theta) = \frac{1}{2} \sum_{t=1}^{N} \big( y(t) - \hat{y}(t) \big)^T \Lambda \big( y(t) - \hat{y}(t) \big)    (3.31)
where ŷ(t) is the prediction of the measurement y ∈ R^m and Λ is a diagonal weighting matrix. The notation ŷ(t) instead of the previous ŷ(t|t−1) is used to emphasize the dependence between the prediction and the predictor parameters θ ∈ R^n. The predictor formulas are presented in Section 3.2, generally as

    \hat{y}(t) = f(\varphi(t), \theta) \big|_{\theta = \hat{\theta}}    (3.32)

where θ̂ is an estimate of the unknown parameter vector and φ(t) is a regression data vector containing the past measurements, past predicted outputs and past inputs. The actual content of the data vector depends on the selected predictor type: NARX, NARMAX or NOE.
38 Chapter 3 Modelling and Identification
Define the prediction error as

    \epsilon(t) = \epsilon(t, \hat{\theta}) = y(t) - \hat{y}(t)    (3.33)

and the gradient of (3.32) w.r.t. θ as

    \psi(t) = \psi(t, \hat{\theta}) = \frac{\partial \hat{y}(t)}{\partial \theta} \Big|_{\theta = \hat{\theta}}    (3.34)

i.e. an m × n dimensional matrix. The shorthand notations ε(t) and ψ(t) instead of the more precise ε(t, θ̂) and ψ(t, θ̂) are often used for clarity of presentation. If the Hessian of the cost function in Newton's algorithm (2.12) is approximated as
    V''_N(\theta) \approx \sum_{t=1}^{N} \psi^T(t) \Lambda \, \psi(t)    (3.35)

and the notation

    V'_N(\theta) = -\sum_{t=1}^{N} \psi^T(t) \Lambda \, \epsilon(t)    (3.36)
is used, one obtains the nonlinear least squares normal equations of the Gauss-Newton (GN) algorithm

    \Big[ \sum_{t=1}^{N} \psi^T(t) \Lambda \, \psi(t) \Big] \Delta\theta(k) = \sum_{t=1}^{N} \psi^T(t) \Lambda \, \epsilon(t)    (3.37)
from which the search direction Δθ(k) at iteration count k can be solved. The approximation of the Hessian V''_N(θ) in (3.35) must be positive definite. For numerical reasons this may not be the case, and equation (3.37) is modified to guarantee the positive definiteness. Before that, equation (3.37) should be written in a more compact form. For this, introduce first an mN × n dimensional gradient matrix H so that
    H^T(\theta) = \big[ \psi^T(1)\Lambda^{1/2} \,|\, \cdots \,|\, \psi^T(N)\Lambda^{1/2} \big]    (3.38)

and correspondingly an mN dimensional error vector e(θ) so that

    e^T(\theta) = \big[ \epsilon^T(1)\Lambda^{1/2} \, \cdots \, \epsilon^T(N)\Lambda^{1/2} \big]    (3.39)
In the Levenberg-Marquardt (LM) method the normal equations are regularized as

    \big[ H^T(\theta)H(\theta) + \lambda(k) I \big] \Delta\theta(k) = H^T(\theta)\, e(\theta), \quad \theta = \hat{\theta}(k-1)    (3.40)

which is equivalent to the augmented problem

    \begin{bmatrix} H(\theta) \\ \sqrt{\lambda(k)}\, I \end{bmatrix} \Delta\theta(k) \approx \begin{bmatrix} e(\theta) \\ 0 \end{bmatrix}, \quad \theta = \hat{\theta}(k-1)    (3.41)

e.g. Ljung (1987), which is a linear least squares problem. This in turn is solved using the QR factorization, resulting in a provisional parameter update

    \hat{\theta}(k) = \hat{\theta}(k-1) + \Delta\theta(k)    (3.42)
When this straightforward implementation of the LM method using the QR factorization is applied, a reliable and robust identification algorithm is obtained. The use of (3.41) is computationally somewhat heavier than solving Δθ(k) directly from (3.40). Both approaches have been used extensively in this study.
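A single LM iteration in the augmented least-squares form (3.41) can be sketched as follows (a minimal numpy illustration; the gradient matrix H and residual vector e are assumed to be precomputed by the forward-backward passes):

```python
import numpy as np

def lm_step(theta, H, e, lam):
    """One provisional Levenberg-Marquardt update, eqs. (3.40)-(3.42).
    Solving the augmented problem  [H; sqrt(lam) I] d ~ [e; 0]
    by an orthogonal factorization is equivalent to the normal
    equations (H'H + lam I) d = H'e, but numerically safer."""
    n = H.shape[1]
    A = np.vstack([H, np.sqrt(lam) * np.eye(n)])
    b = np.concatenate([e, np.zeros(n)])
    d, *_ = np.linalg.lstsq(A, b, rcond=None)   # QR-based solve
    return theta + d                             # eq. (3.42)
```

In a full implementation λ(k) would be increased when the provisional step fails to decrease V_N(θ) and decreased otherwise; that outer loop is omitted here.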
The evaluation of the gradients is not computationally much heavier than the evaluation of the cost function; hence a line search is not well motivated here. However, a BFGS approach with a two-sided Armijo line search (Minoux, 1986) is also implemented. It works well, but the LM approach normally outperforms it in convergence speed. The BFGS method is used only for some constrained optimization tasks like the stability projection.
The cost function (3.31) can be minimized using recursive methods. Recursive minimization is mainly used in adaptive prediction and adaptive control, but it can also be used as an off-line approach. Quite often it is reasonable to assume that the most recent data contains more information than the past data. To discount the old data exponentially, the quadratic cost function
    V(t, \theta) = \lambda(t) V(t-1, \theta) + \frac{1}{2} \big( y(t) - \hat{y}(t) \big)^T \Lambda \big( y(t) - \hat{y}(t) \big)    (3.43)

where 0 < λ(t) ≤ 1, is minimized at each step t w.r.t. θ, i.e.

    \hat{\theta}(t) = \arg \min_{\theta} V(t, \theta)    (3.44)

The exponential forgetting factor λ(t) controls the weighting of past and present prediction errors. Typically λ(t) is very close to one. The choice λ(t) ≡ 1 yields the recursive minimization of the cost function (3.31).
The solution of (3.44) can be written as the recursive Gauss-Newton equations (Ljung, 1987)

    \hat{\theta}(t) = \hat{\theta}(t-1) + R^{-1}(t)\, \psi^T(t) \Lambda \, \epsilon(t)
    R(t) = \lambda(t) R(t-1) + \psi^T(t) \Lambda \, \psi(t)    (3.45)

where R(t) is the information matrix. Also here the shorthand notations are used instead of ε(t, θ̂(t−1)) and ψ(t, θ̂(t−1)). R(t) must be kept positive definite. A comprehensive literature on possible methods is available, e.g. Ljung (1987) and Bierman (1977). One possibility is to use (3.45) directly, which is especially suitable for MIMO identification.
In the MWGS (modified weighted Gram-Schmidt) approach the information matrix R(t) is updated recursively using the factorization

    R(t) = U(t)D(t)U^T(t) = U(t-1)D(t-1)U^T(t-1) + \psi^T(t) \Lambda \, \psi(t)    (3.46)

Alternatively, applying the matrix inversion lemma to (3.45) with P(t) = R^{-1}(t) yields the covariance form

    \hat{\theta}(t) = \hat{\theta}(t-1) + P(t)\, \psi^T(t)\, \epsilon(t)
    P(t) = \frac{1}{\lambda(t)} \Big[ P(t-1) - \frac{P(t-1)\psi^T(t)\,\psi(t)P(t-1)}{\lambda(t) + \psi(t)P(t-1)\psi^T(t)} \Big]    (3.47)

which is faster than the MWGS, at least if coded properly (assume Λ = I for notational simplicity). In practice (3.47) must also be implemented with a numerically more robust method. The UD factorization (Bierman, 1977) is used in this study; P(t) is then updated recursively using a similar factorization as in (3.46).
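One step of the covariance-form recursion (3.47) can be sketched as follows (scalar-output case for brevity; a plain implementation without the UD factorization safeguards discussed above):

```python
import numpy as np

def rpem_step(theta, P, psi, eps, lam=1.0):
    """One recursive Gauss-Newton update in covariance form (3.47).
    psi : gradient d yhat / d theta (vector of length n)
    eps : prediction error y(t) - yhat(t) (scalar)
    lam : exponential forgetting factor, 0 < lam <= 1"""
    Ppsi = P @ psi
    denom = lam + psi @ Ppsi            # scalar
    K = Ppsi / denom                    # gain vector
    theta = theta + K * eps             # parameter update
    P = (P - np.outer(Ppsi, Ppsi) / denom) / lam
    return theta, P
```

With exact data from a model that is linear in the parameters this recursion reproduces recursive least squares; for the neural network predictors the same update is driven by the dynamically computed gradients ψ(t).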
The recursive algorithm is initialized with small random values or a priori estimates θ̂(0). It should be noted that this corresponds to the initial cost

    V(0, \theta) = \frac{1}{2} \big( \theta - \hat{\theta}(0) \big)^T P^{-1}(0) \big( \theta - \hat{\theta}(0) \big)    (3.48)

It is easy to see that the less confidence we have in the initial parameter estimates θ̂(0), the larger the initial covariance matrix P(0) = R^{-1}(0) should be selected. The usual choice is P(0) = R^{-1}(0) = δI, δ >> 0.
If the forgetting factor is used and the data is not informative enough, the covariance matrix may grow without bound (estimator wind-up). Typical remedies are
- resetting R(t) or P(t) periodically
- using constant trace algorithms for P(t)
- using regularization, i.e. adding δI to P(t) at each time step,
where δ is a small positive scalar.
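The last two remedies can be sketched together (a minimal illustration; the limit values are hypothetical tuning parameters):

```python
import numpy as np

def safeguard_P(P, delta=1e-4, trace_max=1e4):
    """Keep the covariance matrix well conditioned: regularize by
    adding delta*I, then rescale to bound the trace (a constant-trace
    style safeguard against covariance wind-up)."""
    P = P + delta * np.eye(P.shape[0])
    tr = np.trace(P)
    if tr > trace_max:
        P = P * (trace_max / tr)
    return P
```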
It is well understood that a nonlinear global approximation scheme like MLP is not the
most suitable for adaptive purposes, because the model is not linear w.r.t. the parameters.
This makes the adaptation slower than in the linear case.
There are also several precautions when applying recursive methods, like sensitivity to the initial weights and, in the nonlinear case, sensitivity to the data presentation order, which may lead to different local minima. Furthermore, all the aspects and experience gained in applying linear self-tuning control must be taken into account. For example, the tuning of covariance resetting schemes or of a time-varying forgetting factor is similar to that in linear recursive identification.
Computing Gradients
The methods presented on the previous pages are general nonlinear least squares algorithms. They are true Prediction Error Methods (PEM) only if the gradient ψ(t) is computed properly according to (3.32).
Let us first consider gradient computation generally, without fixing the method to be batch or recursive. Define the Jacobian matrices of (3.32) w.r.t. θ and φ as

    \bar{\psi}(t) = \bar{\psi}(t, \hat{\theta}) = \frac{\partial f(\varphi, \theta)}{\partial \theta} \Big|_{\theta=\hat{\theta},\; \varphi=\varphi(t)}    (3.49)

    \Phi(t) = \Phi(t, \hat{\theta}) = \frac{\partial f(\varphi, \theta)}{\partial \varphi} \Big|_{\theta=\hat{\theta},\; \varphi=\varphi(t)}    (3.50)
For the NARX predictor

    \hat{y}(t) = f(\varphi(t), \theta) \big|_{\theta=\hat{\theta}}
    \varphi(t) = \big[ y^T(t-d) \ldots y^T(t-d-n_a+1) \;\; u^T(t-d) \ldots u^T(t-d-n_b+1) \big]^T    (3.51)

where n_a and n_b are the model orders. In this case the Jacobian ψ(t) can be computed directly, because φ(t) is not a function of θ, resulting in
    \psi(t, \hat{\theta}) = \frac{\partial \hat{y}(t)}{\partial \theta} = \bar{\psi}(t, \hat{\theta})    (3.52)

The combination of the predictor and its gradient equations,

    \hat{y}(t) = f(\varphi(t), \theta)
    \psi(t, \theta) = \bar{\psi}(t, \theta) \big|_{\theta=\hat{\theta},\; \varphi=\varphi(t)}    (3.53)

is referred to as the extended network model by Chen, Billings and Grant (1990). The stability of the extended network model is of vital importance in any recursive implementation. The set of all θ producing a stable extended network model is denoted as D.
For the NOE predictor the previous predictions enter the data vector:

    \hat{y}(t) = f(\varphi(t, \theta), \theta) \big|_{\theta=\hat{\theta}}
    \varphi(t, \theta) = \big[ \hat{y}^T(t-1) \ldots \hat{y}^T(t-n_f) \;\; u^T(t-d) \ldots u^T(t-d-n_b+1) \big]^T \big|_{\theta=\hat{\theta}}    (3.54)

where n_f and n_b are the model orders. The notation φ(t, θ) is used to emphasize the fact that the previous predictions are also functions of the parameter estimate. Applying the chain rule of differentiation, one obtains

    \frac{\partial \hat{y}(t)}{\partial \theta} = \frac{\partial f(\varphi, \theta)}{\partial \theta} + \sum_{i=1}^{n_f} \frac{\partial f(\varphi, \theta)}{\partial \hat{y}(t-i)} \, \frac{\partial \hat{y}(t-i)}{\partial \theta} \Big|_{\varphi=\varphi(t),\; \theta=\hat{\theta}}    (3.55)
The presentation can be remarkably simplified using the (left) matrix fraction description (MFD) of a linear MIMO time series model (Kailath, 1980). A linearized presentation of the model ŷ(t) = f(φ, θ) near an operating point θ = θ̂, φ = φ(t) is

    A(q^{-1})\, \hat{y}(t) = q^{-d} B(q^{-1})\, u(t)    (3.56)

where the matrix polynomials in q^{-1} are

    A(q^{-1}) = I - A_1 q^{-1} - \ldots - A_{n_f} q^{-n_f}
    B(q^{-1}) = B_1 + B_2 q^{-1} + \ldots + B_{n_b} q^{-n_b+1}

Here A_i and B_i are m × m matrices (assuming a square system). The linearization means the calculation of the Jacobian Φ(t) = ∂f/∂φ in (3.50). Comparing this to the organization of the data vector φ(t) in (3.54) yields

    \Phi(t) = \big[ A_1 \,|\, \ldots \,|\, A_{n_f} \,|\, B_1 \,|\, \ldots \,|\, B_{n_b} \big]    (3.57)

The Jacobian Φ(t) thus directly contains the parameters of an MFD presentation, and (3.55) can be written in a more consistent form. The reader should note that the Jacobian Φ(t) is not computed in a steady state. Thus the resulting model (3.56) is not a true linearization of the nonlinear model; it is just the Jacobian written in a certain form.
    \hat{y}(t) = f(\varphi(t), \theta)
    A(q^{-1}, t)\, \psi^{(i)}(t, \theta) = \bar{\psi}^{(i)}(t, \theta), \quad i = 1 \ldots n    (3.58)
where ψ^{(i)} and ψ̄^{(i)} are the i-th columns of the corresponding gradient matrices. The notation A(q^{-1}, t) is used to emphasize that the linearization is performed at every time step.

Eq. (3.58) gives an elegant way to analyse the behaviour of the predictor and its gradients. It will also be used for the stability and convergence analysis in the next section. A change in the parameter estimate θ̂ in (3.58) requires the whole sequence of predictions and gradients to be evaluated recursively for time steps 1…t. This feature is built into the batch type identification algorithm, but not into the recursive case, i.e. the recursive algorithms (3.45) and (3.47) are not truly recursive.
In the recursive case the extended network model can be approximated by simpler recursive equations. Because we can expect θ̂(t−1) to be close to θ̂(t−2) in the limit, a reasonable approximation is to replace (3.58) by

    \hat{y}(t) \approx f(\varphi(t), \hat{\theta}(t-1))
    A(q^{-1}, t)\, \psi^{(i)}(t) \approx \bar{\psi}^{(i)}(t, \hat{\theta}(t-1)), \quad i = 1 \ldots n    (3.59)

where

    \varphi(t) = \big[ \hat{y}^T(t-1 \,|\, \hat{\theta}(t-2)) \ldots \hat{y}^T(t-n_f \,|\, \hat{\theta}(t-n_f-1)) \;\; u^T(t-d) \ldots u^T(t-d-n_b+1) \big]^T
    \bar{\varphi}(t) = \big[ \hat{y}^T(t-1 \,|\, \hat{\theta}(t-1)) \ldots \hat{y}^T(t-n_f \,|\, \hat{\theta}(t-n_f)) \big.
    \big. \qquad\qquad u^T(t-d) \ldots u^T(t-d-n_b+1) \big]^T

i.e. each past prediction is computed with the parameter estimate available at that time. The equation (3.59) is a sequential form of the approximation of the gradient ψ(t), and with (3.59) the algorithms (3.45) and (3.47) again become truly recursive (Goodwin and Sin, 1984). It is obvious that this extended network model must be stable in order to obtain correct results using the approximation (3.59).
Equation (3.59) can easily be implemented using neural networks like multilayered perceptrons: only one forward pass, one backward pass and an extra gradient filtering pass are needed. A minor disadvantage is that the gradients ψ(j), j = t−1 … t−n_f, must be stored.
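For a SISO model the gradient filtering in (3.59) reduces to ψ(t) = ψ̄(t) + Σ_{j=1}^{n_f} A_j(t) ψ(t−j), i.e. the static back-propagated gradient is passed through the time-varying filter 1/A(q^{-1}, t). A minimal sketch (the array layout is an assumption for illustration):

```python
import numpy as np

def dynamic_gradient(phi, A_seq, n_f):
    """Filter static gradients phi (one row per time step) through
    the time-varying recursion  psi(t) = phi(t) + sum_j A_j(t) psi(t-j),
    approximating the true SISO NOE predictor gradient of (3.59).
    A_seq[t] holds the linearization coefficients [A_1(t), ..., A_nf(t)]."""
    N, n = phi.shape
    psi = np.zeros_like(phi)
    for t in range(N):
        psi[t] = phi[t]
        for j in range(1, n_f + 1):
            if t - j >= 0:
                psi[t] += A_seq[t][j - 1] * psi[t - j]
    return psi
```

For a frozen A-polynomial this is exactly linear filtering of each gradient column; the identification algorithms above then use ψ(t) in place of the static gradient.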
Identification of a multistep predictor according to the multistep cost function (3.30) can be formulated in several ways, depending on whether recursive or batch identification is made. The multistep cost function for the multistep NARX predictor is (N1 = 1 for simplicity)

    V_{LRPI1} = \frac{1}{2} \sum_{t=1}^{N} \sum_{j=1}^{N2} \| y(t) - \hat{y}(t|t-j) \|^2    (3.60)

where ŷ(t|t−j) denotes the predictions computed recursively using the actual measurement information at time t−j. This can be reordered as

    V_{LRPI2} = \frac{1}{2} \sum_{t=1}^{N} \sum_{j=0}^{N2-1} \| y(t-j) - \hat{y}(t-j \,|\, t-N2) \|^2    (3.61)
where ŷ(t−j|t−N2) denotes the predictions calculated recursively using the actual measurement information at time t−N2. The overall cost functions are identical; they are just reordered for simpler implementation. However, at time t the incremental cost function is different. The first form needs N2(N2+1)/2 predictions and gradients (forward and backward passes) to be computed at each sample t, while the latter needs only N2. The disadvantage of the reordering is that the measurement information in the data vector is delayed N2−1 samples, which can cause problems in adaptive applications.
According to (3.61), the prediction ŷ(t) is computed recursively using the actual measurement information available at time t−N2. The future measurements are replaced with the predicted ones. In a similar way, the gradients are calculated according to the dynamic gradient equations (3.58) and by assuming ψ(j) = 0, j < t−N2.

The cost function and its gradients are treated as a MIMO cost function, and the least squares normal equations can be written directly according to (3.37) and (3.40). The only difference is in the gradient computation.
The need for the stability projection is twofold. It is needed for the convergence of the optimization procedure, and the designed control system should also be stable. Applying the stability projection within an optimization procedure means that stability is ensured with respect to the data used in the optimization, which emphasises the selection of the input signal and setpoint sequences.

The analysis is based on the Contraction Mapping Theorem, mainly on the work of Holzman (1970), Economou (1985), Li et al. (1990) and Zafiriou (1990). Contraction mapping is a suitable choice because it can be used without computing the (possible) equilibrium state. The results concerning stability issues of neural network models, e.g. Hernandez and Arkun (1992), Levin and Narendra (1993) and Jin Liang et al. (1994a), are also discussed. Before presenting the actual instability detection and stability projection methods, the stability of nonlinear difference equations is briefly reviewed to point out what in fact is done.
Definition 1. (Holzman, 1970) Let a Banach¹ space X contain a closed convex set Ω and let F map Ω into itself. F is said to satisfy a Lipschitz condition if there is a constant γ such that

    \| F(x_a) - F(x_b) \| \le \gamma \| x_a - x_b \| \quad \forall x_a, x_b \in \Omega    (3.62)

If γ < 1, then F is called a contraction mapping and, by the Contraction Mapping Theorem, it has a unique fixed point x_eq in Ω,

    x_{eq} = F(x_{eq})

Furthermore, for all x ∈ Ω, lim_{k→∞} F^k(x) = x_eq, where F^k denotes a recursive mapping (k times).
1. Although the Contraction Mapping Theorem is originally presented using Banach spaces,
only linear vector spaces are used in this study.
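The convergence property in Definition 1 can be illustrated directly (a toy example: F(x) = 0.5x + 1 is a contraction with γ = 0.5 and fixed point x_eq = 2):

```python
def fixed_point(F, x0, tol=1e-10, max_iter=1000):
    """Iterate x <- F(x); for a contraction (gamma < 1) the
    sequence converges to the unique fixed point x_eq = F(x_eq)
    from any starting point in the set."""
    x = x0
    for _ in range(max_iter):
        x_new = F(x)
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x

print(fixed_point(lambda x: 0.5 * x + 1.0, 0.0))   # converges to 2
```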
Section 3.4 Stability and Convergence 47
Usually the set Ω is taken to be a closed ball

    U(x_0, r) = \{\, x \in X : \| x - x_0 \| \le r \,\}

which maps into itself, although any closed convex set will do. When F is differentiable in U, an exact characterization of the contraction can be developed:

    \| F'(x) \| < 1 \quad \forall x \in U(x_0, r)    (3.63)

Finding a closed ball which maps into itself is not always easy, and the condition can be replaced with another, more suitable for computation. If

    \| F'(x) \| \le \gamma < 1 \quad \forall x \in U(x_0, r)    (3.64)

where

    r \ge r_0 = \frac{\| F(x_0) - x_0 \|}{1 - \gamma}

then F is a contraction in U(x_0, r). Moreover, F has a fixed point x_eq in U(x_0, r_0). The fixed point is unique in U(x_0, r) (region of attraction).
By searching for a suitable norm and a suitable ball, the contraction conditions (3.64) can be assured. Note that the (possible) equilibrium point does not need to be computed. If the equilibrium is known, the ball can be centered at x_eq, resulting in r_0 = 0, and any r with U(x_eq, r) will do if (3.64) is satisfied. Now also a mapping into itself is found.

Because the norm is one suitable Lyapunov function, F being a contraction mapping in U(x_eq, r) means that x_eq is an asymptotically stable equilibrium in U(x_eq, r) in the sense of Lyapunov, e.g. Ogata (1987) or Li et al. (1990). The contraction conditions are stronger than those of asymptotic stability (in the sense of Lyapunov); sharper Lyapunov functions, i.e. weaker conditions, can possibly be found. The contraction conditions can also be modified by using an invertible function mapping z = K(x) of Ω onto itself (Holzman, 1970), still leaving them stronger than those of Lyapunov stability.
The spectral radius ρ(F'(x)) (the largest absolute eigenvalue) is the measure of asymptotic stability for linear systems. It cannot replace (3.64) when F is nonlinear. However, it can be used as an instability detector like in
    \exists\, x \in U(x_0, r) \;\; \text{so that} \;\; \rho(F'(x)) \ge 1    (3.65)

because then there is no induced norm for which the conditions (3.64) are satisfied, and so F cannot be a contraction in U(x_0, r), e.g. Economou (1985). The condition

    \rho(F'(x)) \big|_{x = x_{eq}} < 1    (3.66)

in turn guarantees local asymptotic stability at a known equilibrium. Consider now a model

    x(t+1) = F(x(t), u(t)), \quad t = 0 \ldots N    (3.67)

which corresponds to the state space representation of an open or closed loop model of some nonlinear dynamical system, u(t) being the system input signal. The task is to detect a possible violation of asymptotic stability. The input u(t) must be considered as a time-varying parameter, and thus one has samples of several different mappings, one for each different u(t); the detection is made using these samples, i.e. {x(t)}, t = 0…N, the parameter values {u(t)}, t = 0…N, and the model F itself.
Fig. 3.7. A contraction mapping is asymptotically stable at least in U(x_eq, r). Other trajectories are drawn to emphasize that nothing certain can be said about the stability outside. (Figure omitted: trajectories converging to x_eq inside the ball U(x_eq, r), others diverging outside it.)
The not-a-contraction condition and the violation of the contraction conditions are used for the detection of "possible instability", quoted because the violation of some limit does not necessarily correspond to true instability. The limits for the different conditions are selected in a pragmatic way, to ensure stability in practice.
Instability Detection
Consider now a NOE model and assume d = 1 for notational convenience. Note also that the system is square. Now, according to (3.22),

    \hat{y}(t+1) = f(\varphi(t+1))
    \varphi(t+1) = \big[ \hat{y}^T(t) \ldots \hat{y}^T(t-n_f+1) \;\; u^T(t) \ldots u^T(t-n_b+1) \big]^T    (3.68)

with n = n_f + n_b. The notation y(t+1) is here used instead of ŷ(t+1). Eq. (3.68) can be presented in state space form
    x(t+1) = F(x(t), u(t))

    x(t+1) = \begin{bmatrix} 0 \\ [I_A] \;\; 0 \\ 0 \\ 0 \;\; [I_B] \; 0 \end{bmatrix} x(t) + \begin{bmatrix} \bar{f}(x(t), u(t)) \\ 0 \\ u(t) \\ 0 \end{bmatrix}    (3.69)

    y(t+1) = \big[ x_1(t+1) \ldots x_m(t+1) \big]^T

The possible I_A and I_B are m(n_f−1) × m(n_f−1) and m(n_b−2) × m(n_b−2) dimensional identity matrices. The function f̄ denotes f with a reordered input vector. The state is

    x(t) = \big[ y^T(t) \ldots y^T(t-n_f+1) \;\; u^T(t-1) \ldots u^T(t-n_b+1) \big]^T    (3.70)
The Jacobian of (3.69) with respect to the state is

    J_1 = F'(x) = \begin{bmatrix} A_1 \cdots A_{n_f} & B_2 \cdots B_{n_b} \\ [I_A] & 0 \\ 0 & [I_B] \;\; 0 \end{bmatrix}    (3.71)

and the transfer function of the corresponding linearized model (3.56) is

    G(z) = z^{-1} A^{-1}(z^{-1}) B(z^{-1})    (3.72)
The measurements and inputs are typically scaled between ±0.5, and the closed ball U(0, 0.5) (with the maximum norm ‖·‖∞) is the domain of interest. The maximum norm condition states that the sum of the absolute values on each row must be less than one. Due to the 1's in some rows, J_1 is in a sense a limiting presentation. It is not easy to find a similarity transform which gives relaxed conditions, see Zafiriou (1990).
The contraction condition for J_1 in the maximum norm yields the AB-condition

    \sum_{i=1}^{n_f} \| A_i \| + \sum_{i=1}^{n_b} \| B_i \| < 1    (3.73)

and the corresponding spectral radius condition

    \rho(J_1) < 1    (3.74)

The stability of the recursion in y depends only on the A_i blocks, so the reduced Jacobian

    J_2 = \begin{bmatrix} A_1 \cdots A_{n_f} \\ [I_A] \;\; 0 \end{bmatrix}    (3.75)

can be used, which can also be reasoned from what is known about the canonical representations of linear time series models. This gives the ρ-condition

    \rho(J_2) < 1    (3.76)

which can be used instead of (3.74). For a SISO system this is identical to |roots(A(q))| < 1. Also the A-condition

    \sum_{i=1}^{n_f} \| A_i \| < 1    (3.77)

can be used. Finally, the basic definition (3.62) can be used as a detector, resulting in the N-condition

    \| F(x_a) - F(x_b) \| < \| x_a - x_b \| \quad \forall x_a, x_b \in \{ x(t) \}    (3.78)

checked over the available data samples.
All the Jacobians needed for the detection are computed anyway during the identification within the backward pass of the neural network, so the detection is computationally light, while the actual projection is not, as will be seen at the end of the section. To summarize:
- the AB- and N-conditions (3.73) and (3.78) are the most secure, taking into account also the numerator of the linearized model;
- the A-condition (3.77) is less secure, neglecting the numerator;
- the ρ-condition (3.76) is the least conservative, a limit for not-a-contraction.
Fulfilling the AB-condition or even the A-condition is often very hard, as pointed out by Economou (1985), and the ρ-condition is more useful in practice, especially for SISO systems, where the condition |roots(A(q))| < 1 can be transformed to Jury's criterion, resulting in direct limits for the parameters of A(q).
The usage of (3.76) or (3.77) involves the danger that true instability is not detected. From the pragmatic point of view, Eq. (3.76) can be replaced with

    \rho(F'(x)) < \eta    (3.79)

where η < 1 is used as a tuning parameter if stability problems are encountered, thus causing the eigenvalues to be projected inwards.
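The detectors above can be sketched as follows (a minimal numpy illustration checking the AB-, A- and ρ-conditions for given linearization blocks A_i, B_i; a returned True flags a violation, i.e. possible instability):

```python
import numpy as np

def detect_instability(A_blocks, B_blocks, eta=1.0):
    """Check the AB-condition (3.73), A-condition (3.77) and
    rho-condition (3.76) on the linearized model blocks.
    Norms are induced infinity-norms (max absolute row sum),
    matching the maximum-norm scaling of the text."""
    nf = len(A_blocks)
    m = A_blocks[0].shape[0]
    # block companion matrix J2 of the A-polynomial, eq. (3.75)
    J2 = np.zeros((nf * m, nf * m))
    J2[:m, :] = np.hstack(A_blocks)
    if nf > 1:
        J2[m:, :-m] = np.eye((nf - 1) * m)
    rho = max(abs(np.linalg.eigvals(J2)))
    sA = sum(np.linalg.norm(A, np.inf) for A in A_blocks)
    sB = sum(np.linalg.norm(B, np.inf) for B in B_blocks)
    return {"AB": sA + sB >= 1.0, "A": sA >= 1.0, "rho": rho >= eta}
```

With the blocks of the example below (A_1 = 2x_1 evaluated at x_1 = 0.4, B_1 = 3.1, B_2 = 3.0) only the AB-condition fires, matching the discussion of model (3.80).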
As an example, consider the model

    \hat{y}(t+1) = a\, \hat{y}^2(t) + b_1 u(t) + b_2 u(t-1)    (3.80)

with parameters a = 1, b_1 = 3.1 and b_2 = 3.0 obtained during the identification. The state space presentation (3.69) is

    x(t+1) = \begin{bmatrix} \hat{y}(t+1) \\ u(t) \end{bmatrix} = \begin{bmatrix} a\, x_1^2(t) + b_2 x_2(t) + b_1 u(t) \\ u(t) \end{bmatrix}    (3.81)

with the Jacobians

    J_1 = \begin{bmatrix} 2x_1(t) & 3 \\ 0 & 0 \end{bmatrix} \quad \text{and} \quad J_2 = [\, 2x_1(t) \,]    (3.82)

Applying the conditions above yields: within the scaled region |x_1| ≤ 0.5 the ρ- and A-conditions are satisfied (|2x_1| ≤ 1), while the AB-condition is clearly violated.
This true instability is detected only by the AB-condition, which takes into account the numerator of the linearized model, resulting in the parameters a and b_2 being projected to smaller values. Note that the zero z = −3.0/3.1 is stable. The use of the condition ρ(F'(x)) < η with η < 1 would also cause the detection of the instability, but would result in the parameter a being projected.
Convergence of RPEM
Ljung and Söderström (1983) present the convergence analysis of the RPEM for linear models. Chen, Billings and Grant (1990) applied this convergence analysis to the NARX type neural network predictors. By using a general method known as the differential equation method for the analysis of recursive parameter estimation algorithms, the convergence of the algorithm (3.45) can be proved. The convergence with the NARMAX and NOE predictors is not proved here; only the practical results are introduced. The most important assumption from the practical point of view is that a projection is employed to keep θ̂(t) inside the stable region D and that some regularity conditions hold.

When looking at the extended network model for the NARX predictor (3.53), it is obvious that for the chosen activation function D is the whole Euclidean space and the corresponding extended network model is unconditionally stable.
Consider now the extended network model of the NOE predictor (3.59)

    \hat{y}(t) \approx f(\varphi(t), \hat{\theta}(t-1))
    A(q^{-1}, t)\, \psi^{(i)}(t) \approx \bar{\psi}^{(i)}(t, \hat{\theta}(t-1)), \quad i = 1 \ldots n    (3.83)

This must be kept asymptotically stable to ensure the convergence of the RPEM. The dynamics of ŷ(t) and ψ^{(i)}(t) cannot be straightforwardly separated, because A(q^{-1}, t) is also a function of the previous estimates ŷ(t−1), …. Thus the contraction conditions of (3.83) seem to be tighter than those of the predictor alone. However, the gradient update equations do not affect each other or the predictor, and they can be considered as separate linear time-varying systems

    A(q^{-1}, t)\, \psi^{(i)}(t) = \bar{\psi}^{(i)}(t, \hat{\theta}(t-1)), \quad i = 1 \ldots n    (3.84)

The contraction conditions of (3.84) are the same as or weaker than those of the predictor, i.e. if the predictor is kept asymptotically stable, then the extended network model is also asymptotically stable.
Stability Projection
After the detection of possible instability, the parameters must be projected into the stable domain D. This can be done either with a direct recalculation of the parameters or by constrained minimization of the cost function (3.31).

The recalculation projects the parameters directly into D. Due to the complexity of the function approximator (the MLP network), this can be done only in a crude way, which somewhat degrades the prediction capability. It is suitable for adaptive applications when a high computational load must be avoided. The use of constrained minimization is more efficient but also computationally heavier. It is suitable for off-line identification, but it can also be used with the RPEM approach.
Consider first the Jacobian J_2 (3.75), which can be used with the A- and ρ-conditions. Multiplying the coefficient matrices A_i by powers of a scalar k < 1 so that

    \tilde{J}_2 = \begin{bmatrix} k A_1 \;\cdots\; k^{n_f} A_{n_f} \\ [I_A] \;\; 0 \end{bmatrix}    (3.85)

means that all eigenvalues of \tilde{J}_2 are k times the eigenvalues of J_2. For example, k = 0.95 moves the eigenvalues 5 % inwards.

The \tilde{J}_2, or the corresponding \tilde{A}(q^{-1}), is used as the target for the parameter projection, i.e. it specifies where the coefficients in the A_i's should be in order to obtain stability. The multiplier k can be determined directly for low order SISO systems, or it can be solved iteratively. A crude but stable approximation is enough.
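The direct projection (3.85) can be sketched as follows (a minimal numpy illustration; the companion matrix layout matches (3.75) and the target radius is a tuning value):

```python
import numpy as np

def project_A_blocks(A_blocks, rho_target=0.95):
    """Direct stability projection (3.85): scale each coefficient
    matrix A_i by k**i so that every eigenvalue of the companion
    matrix shrinks by the factor k, landing inside rho_target."""
    m = A_blocks[0].shape[0]
    nf = len(A_blocks)
    J2 = np.zeros((nf * m, nf * m))
    J2[:m, :] = np.hstack(A_blocks)
    if nf > 1:
        J2[m:, :-m] = np.eye((nf - 1) * m)
    rho = max(abs(np.linalg.eigvals(J2)))
    if rho < rho_target:
        return A_blocks                  # already stable enough
    k = rho_target / rho
    return [A * k ** (i + 1) for i, A in enumerate(A_blocks)]
```

The scaling works because if λ is an eigenvalue of the companion matrix of A(q^{-1}), then kλ is an eigenvalue of the companion matrix built from k^i A_i.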
Consider now the MLP network (Fig. 2.2) with the corresponding Jacobian

    \Phi(t) = \frac{\partial f(\varphi, \theta)}{\partial \varphi} \Big|_{\theta=\hat{\theta},\; \varphi=\varphi(t)}    (3.86)

    \Phi(t) = \big[ A_1 \,|\, \ldots \,|\, A_{n_f} \,|\, B_1 \,|\, \ldots \,|\, B_{n_b} \big]    (3.87)

Multiply all the weights leaving the particular input nodes of the network corresponding to the A_i's by {k, k², …}, and the backward pass of the network model will result in

    \tilde{\Phi}(t) = \big[ k A_1 \,|\, \ldots \,|\, k^{n_f} A_{n_f} \,|\, B_1 \,|\, \ldots \,|\, B_{n_b} \big]
                    = \big[ \tilde{A}_1 \,|\, \ldots \,|\, \tilde{A}_{n_f} \,|\, B_1 \,|\, \ldots \,|\, B_{n_b} \big]    (3.88)
This projects only the Jacobian Φ(t) of the linearized network, and a different value will be obtained if Φ(t) is recomputed with a forward-backward pass. This is not a limitation if applied with the RPEM approach and the covariance matrix is reset or the forgetting factor is decreased after the projection. This approach has been successfully applied to on-line identification. It will be denoted as the direct projection approach.
In the constrained minimization approach, a penalty term is added to the cost function:

    V_N(\theta) = \frac{1}{2} \sum_{t=1}^{N} \big[ (y(t) - \hat{y}(t))^2 + \mu\, v^2(t, \theta) \big]    (3.89)

    \text{with} \quad v(t, \theta) = \begin{cases} v - 1, & \text{if } v = \sum_{i=1}^{n_f} \| A_i(t) \| \ge 1 \\ 0, & \text{otherwise} \end{cases}

The cost function is minimized with an increasing scalar weighting μ. The A-condition does not often provide satisfactory results; it is too conservative. Better results in the SISO case are obtained by combining (3.89) with the ρ-condition, i.e. the penalty v(t, θ) is taken into account only if |roots(A(q))| ≥ 1, implemented using the Jury's table. This efficiently pushes the parameters into the stable domain D.
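The penalty term of (3.89) can be sketched as follows (scalar values; μ and the limit are tuning parameters):

```python
def stability_penalty(A_norms, mu, v_limit=1.0):
    """Penalty contribution mu * v(t, theta)**2 of (3.89): active
    only when the sum of the norms of the A_i blocks reaches the
    limit, i.e. when the A-condition is violated."""
    v = sum(A_norms)
    return mu * (v - v_limit) ** 2 if v >= v_limit else 0.0
```

During identification this term is simply added to the squared prediction error of each sample, so any of the batch optimizers above can minimize the penalized cost without modification.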
In the MIMO case the corresponding penalized cost is

    V_N(\theta) = \frac{1}{2} \sum_{t=1}^{N} \big[ (y(t) - \hat{y}(t))^2 + \mu\, \| v(t, \theta) \|_H^2 \big]    (3.90)

    \text{with} \quad v(t, \theta) = \begin{cases} \mathrm{col}\{ [\, A_1 - \tilde{A}_1 \;\cdots\; A_{n_f} - \tilde{A}_{n_f} \,] \}, & \text{if } \rho(J_2) \ge 1 \\ 0, & \text{otherwise} \end{cases}

where H = I and \tilde{A}_i is obtained from \tilde{A}(q^{-1}), which is determined directly from the Jury's criterion, e.g. Åström and Wittenmark (1990).
Minimizing these penalties requires the second order gradients

    \Phi^{(i)}(t) = \frac{\partial}{\partial \theta_i} \frac{\partial f(\varphi, \theta)}{\partial \varphi} \Big|_{\theta=\hat{\theta},\; \varphi=\varphi(t)}, \quad i = 1 \ldots n    (3.91)

The second order gradient Φ^{(i)} can be computed directly for networks with one hidden layer by applying the chain rule of differentiation in a similar fashion as in the normal backward pass. For more complex networks the Φ^{(i)} must be estimated numerically. The secant technique is used in this study, see Appendix A.
These projection methods have been successfully applied in various simulations and real-time experiments. The main drawback is that they are computationally expensive, especially due to the numerical evaluation of Φ^{(i)}. Because of this computational burden, all conventional methods should be tried first, like varying the control specifications in the nonlinear controller design. The behaviour of the resulting model, controller or controller-model loop is easy to verify after the design by applying the proposed detection methods.
When an identified model is used for predictive control or for controller design, other precautions should also be taken. Because the possible projection task is performed during the identification, the actual conditions are also presented here. A more detailed analysis is presented in Section 4.3. The local existence and uniqueness of the feedback law for the predictors introduced in Section 3.2 necessitates that ∂ŷ(t)/∂u(t−d) is locally invertible, i.e.

    \det \{ \partial \hat{y}(t) / \partial u(t-d) \} \ne 0 \quad \text{(locally)}    (3.92)

For the SISO case this means that ∂ŷ(t)/∂u(t−d) should be projected away from zero. This input gradient projection is extremely important from the control point of view. In many cases the sign of this gradient is known in advance, which makes the implementation of the projection algorithm easier. If the input gradient at sample t is near zero, it should be projected away. The actual projection is similar to that used in the stability projection.
Also other a priori information can be taken into account. Assume that the process is known to have monotonically increasing or decreasing steady-state characteristics, and that this knowledge is to be incorporated into the identification procedure, for example because the amount of experimental data is small. An example of this will be presented later, see Chapter 6. At a steady state the model satisfies

    y_{eq} = f(\varphi_{eq}, \theta)
    \varphi_{eq} = \big[ y_{eq}, \ldots, y_{eq}, u_{eq}, \ldots, u_{eq} \big]^T    (3.93)
From the Inverse Function Theorem, Rosenlicht (1968), the local invertibility condition is

    \partial y_{eq} / \partial u_{eq} \ne 0 \quad \text{(locally)}    (3.94)

which for the model (3.93) reads

    \sum_{i=0}^{n_b-1} \partial \hat{y}(t) / \partial u(t-d-i) \ne 0 \quad \text{(locally, at the steady state)}    (3.95)

This condition can be used as a penalty during the identification, i.e. for all u(t) ∈ D_u, to push the model behaviour in the right direction. It should be used with care, because this penalty also restricts the location of the zeros of the linearized model outside the steady state.
The actual projection is made by applying a similar penalty barrier as in the stability projection. This can be written as (assuming a positive input gradient)

    v(t, \theta) = \begin{cases} v_{limit} - B_1, & \text{if } B_1 < v_{limit} \\ 0, & \text{otherwise} \end{cases}    (3.96)

    v(t, \theta) = \begin{cases} v_{limit} - \sum_{i=1}^{n_b} B_i, & \text{if } \sum_{i=1}^{n_b} B_i < v_{limit} \\ 0, & \text{otherwise} \end{cases}    (3.97)

where v_{limit} is the lower limit. The former corresponds to the input gradient projection (3.92) and the latter to the condition (3.95).
The data needed for the detection is already available in Φ(t). The minimization of these penalties w.r.t. the parameters requires the corresponding Φ^{(i)} to be evaluated numerically.
A simple heuristic projection method can also be applied if short-cut connections are used and the output activation function is linear, i.e. the network is

    \hat{y}(t) = f_1(\varphi(t), \theta_1) + \theta_2^T \varphi(t)    (3.98)
The survey paper (Garcia et al., 1989) refers to Model Predictive Control (MPC) as the family of controllers in which there is a direct use of an explicit and separately identifiable model. The same process model is also implicitly used to compute the control action in such a way that the control design specifications are satisfied.
Control design methods based on the MPC concept have found wide acceptance in industrial applications due to their high performance and robustness. There are several variants of model predictive control methods, like Dynamic Matrix Control (DMC), Model Algorithmic Control (MAC) and Internal Model Control (IMC), e.g. Garcia et al. (1989). Nonlinear versions of these have also been developed, for example the nonlinear IMC concept, e.g. Economou et al. (1986). A largely independently developed branch of MPC, called Generalized Predictive Control (GPC), is aimed more at adaptive control, e.g. Clarke and Mohtadi (1989). For the current state of the art of MPC, see Clarke (1994).
Model predictive control in this sense is a broad area, and some confusion is encountered
because the abbreviation MPC is often used to mean receding horizon predictive control (RHPC) or long
range predictive control (LRPC), where the model is used to predict the process output sev-
eral steps into the future and the control action is computed at each step by numerical min-
imization of the prediction errors, i.e. no specific controller is used. This is quite different
from the concept where the model is controlled with an implicitly derived specific con-
troller, as in many IMC approaches.
In this study the model based approach is divided into two parts: the direct predictive con-
trol scheme without actual controllers using neural networks as process models (Direct
Predictive Control) and the indirect predictive control with actual controller designed
using the identified process model (Dual Network Control). The third possibility, neural
network as a controller, designed or identified without any process model, is omitted in
this study and only a short overview is given.
58 Chapter 4 Model Predictive Control
The existing neural network control approaches can be divided into four categories (Hunt
et al. 1992): direct inverse control, (direct) model reference control, (direct) predictive
control and the IMC approach. The first two are similar in the sense that a process model is
not necessarily required.
The direct inverse control (for example Psaltis et al. 1988) utilizes an inverse of the static
or dynamic process or its model. It is common in robotics, for example Miller (1989). A
general identification scheme is presented in Fig. 4.1 (Hunt and Sbarbaro 1991). The task
is to minimize a cost function associated with the equivalent prediction error ε(t). This
approach relies heavily on the fidelity of the inverse model, and serious questions arise
regarding robustness and stability.
In the (direct) model reference control the control system attempts to make the plant out-
put match the closed loop reference model asymptotically (Fig. 4.2). The training proce-
dure will force the controller to be a detuned inverse, in a sense defined by the reference
model. This type of neural network control system is studied for example by Narendra and
Parthasarathy (1991), Gupta and Rao (1993) and Jin et al. (1995).
Despite the known potential problems concerning robustness and stability, direct
inverse control and (direct) model reference control are the most implemented neural net-
work control methods today. This is obviously due to their straightforward and simple
design. It is also true that many processes (models) have a stable inverse, so a direct
inversion is possible.
Fig. 4.1. A synthetic signal u(t) is used for identification of the inverse model
of the process. The equivalent prediction error is ε(t).
Section 4.1 Control Methods 59
A practical approach is to use the inferential control scheme, i.e. the process is controlled
with a PI or other conventional controller, e.g. Khalid et al. (1994b) and Lightbody and
Irwin (1995), or with a coarsely tuned fuzzy controller, e.g. Khalid et al. (1994a). A par-
allel additive adaptive neural network controller slowly refines the control performance.
Typically the MIT rule (= error backpropagation) is used, which is in accordance with the
desired slow adaptation speed. Successful practical experiments are presented by Khalid
et al. (1992, 1994a) for SISO temperature control, Khalid et al. (1994b) for MIMO fur-
nace control, and Tam (1993) for temperature control.
Some sort of model is needed to obtain the gradient of the control error w.r.t. the con-
troller parameters. These models vary considerably, from a fixed (known) sign of the
gradient up to an on-line identified neural network predictor. It is sometimes hard to dis-
tinguish between the high end solutions of adaptive model reference control and the
adaptive version of predictive dual network control. One clear difference is that the
former applies the actual control error for controller adaptation while the latter is
adapted according to the predicted control error.
Many existing CMAC control applications should also be classified into this category.
The CMAC network is normally identified directly as an inverse of the process (model),
i.e. the inferential control approach is not applied. The applications are typically in robot-
ics, for example Miller (1989), but more noisy applications also exist, for example a fuel-
injection control (Shiraishi et al. 1995), and a fiber placement composite manufacturing
process (Lictenwalner, 1993). Several process control experiments with the LERNAS/
AMS have been reported by e.g. Ersu and Tolle (1988), and Tolle and Ersu (1992).
[Fig. 4.2: model reference control scheme with a reference model, an adaptation mechanism and a "model" block; the adaptation error is ε(t).]
The direct model reference approach has also been demonstrated for instantaneous adaptive con-
trol, i.e. fast initial convergence is required, and no inferential scheme and no pretraining are
used, e.g. Gupta and Rao (1993) and Jin et al. (1994b, 1995). Real-time experimental
results are also reported, e.g. Yabuta et al. (1990). Typically only noise-free simulation and
experimental results are presented.
Direct model reference control is also common within the geometric approach, e.g. Isidori
(1989), because bilinear models are especially suitable for this kind of formulation. Adap-
tive neural network control studies using the geometric approach are presented by Tzir-
kel-Hancock and Fallside (1991, 1992), by Rovithakis and Christodoulou (1993), and by
Liu and Chen (1993). An off-line approach for a simulated CSTR model can be found in
Nikolaou and Hanagandi (1993). These show excellent performance and fast convergence
when applied to highly nonlinear (deterministic) simulated process models.
One aspect should be emphasized here. The excellent results were obtained with neural
network models which are linear-in-the-parameters. For example, Tzirkel-Hancock and
Fallside (1991, 1992) used two RBF networks as building blocks for the bilinear model,
both with fixed basis functions and with a fixed grid as the centers. Sanner and Slotine
(1991) present similar results using a preclustered RBF network with fixed basis func-
tions. Comparable experimental results obtained with fully adaptive neural networks have
not been found in the literature.
Based on this, a clear conclusion can be drawn. If fast initial convergence is needed, one
should in practice apply a linear-in-the-parameters type nonlinear model, neural network or
other. The adaptive control of models of this type, like Hammerstein models, is studied
widely in the literature, for example Vadstrup (1990).
If the performance during the start-up period is not so important, one can apply the infer-
ential control scheme. The slow adaptation of the (nonlinear-in-the-parameters) neural
network controller finally produces good control performance.
[Fig. 4.3: dual network control in the IMC structure; an NN controller drives the process, an NN model runs in parallel, and the prediction error is fed back to the setpoint r(t).]
If a nonlinear-in-the-parameters type model is applied for adaptive control in a way ana-
logous to the linear self-tuning scheme, one must accept that start-up behaviour compa-
rable to the linear case is typically not obtained and that all operation points should be
visited at least once before good control performance is achieved. A minor off-line iden-
tification is of course a solution to the problems encountered during the initial start-up
phase. This reduces the effect of randomly selected initial weights.
The other model predictive approach is dual network control, normally applied within
the IMC structure (Fig. 4.3). The control system is commonly implemented with fixed
parameters, although semi-adaptive and fully adaptive versions have been reported. As men-
tioned earlier, it is sometimes difficult to distinguish between this scheme and direct
model reference control. A nonlinear version of LQG control can also be designed; this
requires a state estimator design/identification.
In the IMC approach the setpoint is compensated with the prediction error (Fig. 4.3), nor-
mally with a filtered one. This introduces an integrator to the control system, and no
steady state control error is encountered for stepwise disturbances. Another possibility
is to use a PI controller in the outer loop (Fig. 4.4). This approach can of course be used
with any controller.
Fig. 4.4. A PI controller can be used to ensure the steady state requirements.
Both direct predictive neural network control and dual network control have been
studied widely in the literature and successfully applied to the model based control of non-
linear systems. General guidelines can be found in Nahas et al. (1992), Psichogios and
Ungar (1991), Hunt and Sbarbaro (1991), Ydstie (1990), and Bhat and McAvoy (1990).
The main interest seems to be in process control. Most articles consider the control of sim-
ulated process models, more or less complex, e.g. a dual network control of a turbogen-
erator (Wu et al. 1992), and LRPC and IMC control of a steel rolling mill (Sbarbaro-Hofer
et al., 1993). The most common applications (simulation models) seem to be pH CSTRs,
distillation columns and several fermentors, i.e. the field of chemical engineering,
e.g. Saint-Donat et al. (1990), Lee and Park (1992), Hernandez and Arkun (1990), Willis
et al. (1992) and Chen and Weigand (1994).
Real time tests of model based neural network control have also been reported: control of
a small water heating process (Koivisto et al. 1992, 1993), a direct predictive control of a
fermentor (Willis et al. 1992), a CSTR control study (Scott and Ray 1993), a temperature
control (Tam 1993), a fed-batch baker's yeast cultivation process (Ye et al. 1994), a chemical
reactor control (Keeler 1994) and a fermentor pressure control (te Braade 1994). The
work of the Teaching Company Scheme at the University of Newcastle-upon-Tyne
(Turner et al. 1994) should also be mentioned in this context.
Polynomial models instead of neural network models within the LRPC are applied by Her-
nandez and Arkun (1993) and Pröll and Karim (1994). The latter also presents real time
experiments with a wastewater pH neutralization process.
Adaptive versions of direct predictive neural network control are studied for example by
Cooper et al. (1992), Koivisto (1990) and Mills et al. (1994). The last introduces an inter-
esting idea of "history stack" learning, which clearly seems to outperform normal pre-
dictive neural network control in controlling an evaporator model. Adaptive real time
experiments of direct predictive control or dual network control are not so common.
Koivisto et al. (1991a, 1991b) applied LRPC to the control of a small water heating pro-
cess, Tam (1993) applied dual network temperature control and Khalid et al. (1994b) a
MIMO furnace control (direct model reference control).
The difference between direct predictive control and dual network control is merely algo-
rithmic, because the explicit controller can be viewed and designed as a function which
produces control actions similar to those obtained from direct predictive control,
at least in the unconstrained case. This can be done for example by applying the synthetic
signal approach (Fig. 4.2).
Soeterboek (1990) has shown that there is a very close relationship between the two schemes
if the model is linear, the criterion is quadratic and there are no constraints. In this case long
range predictive control behaves like LQ control or pole-placement control, depending on
the selected criterion. This also means that direct predictive control is not inherently
more or less robust than dual network control; it can just be adjusted more easily for
robustness. However, a noise model assumption like NARX, which is commonly used
within direct predictive control, transforms into a rather complicated state estimator and
optimal controller design when applied within the dual network approach.
Other remarks can also be made from the practical point of view. The direct predictive
controller is easy to design; only a model identification and fixing of the control parameters
are required. The implementation of the controller is difficult due to the on-line optimiza-
tion algorithm. The implemented system can also be numerically heavy, especially in the
long range predictive approach. Because of the on-line optimization, the control signal
constraints etc. can easily be taken into account, producing efficient control performance.
Typically a NARX model is used, simplifying the identification task.
The reference model is used to make the system behave like a diagonally dominating first or
second order linear plant with little or no overshoot. The reference model may be applied
in two different ways: filtering the setpoint, or filtering the prediction or measurement
according to the reference model. The advantage of the latter method is that it is effective
also when the setpoint is constant. These can also be combined into a two-degrees-of-free-
dom controller.
y(t + d) = E(q) y*(t + d)    (4.1)

where the filter E(q) is a diagonal transfer function matrix and y*(t + d) is the target.
Strictly speaking only E(z) is a transfer function matrix; the notation E(q) for the
transfer function matrix in difference equations is used instead of E(q⁻¹). The target
y*(t + d) is commonly a filtered setpoint r(t), i.e.
y*(t + d) = F_1(q) r(t)    (4.2)

E(q) = E⁻¹(q⁻¹) E(1)
E(q⁻¹) y(t + d) = E(1) y*(t + d)    (4.3)

where E(q⁻¹) = I − E_1 q⁻¹ is a stable matrix polynomial and E_1 is a diagonal matrix
(the desired closed loop poles). A second order reference model could also be used. Using
(4.3), the d-step-ahead control cost function can now be written as

V(t, u) = ½ ‖E(1) y*(t + d) − E(q⁻¹) y(t + d)‖²_W + ½ ‖λΔ u(t)‖²    (4.4)
Section 4.2 Direct Predictive Control 65
1 1
where = ( q ) = ( 1 q )I and W is a diagonal weighting matrix with positive
diagonal elements. The diagonal weighting matrix with positive diagonal elements is
used to reduce the control signal variations. By selecting the design parameters, the user
can define the properties of the controller. For instance, deadbeat control can be achieved
1
directly with E (q ) = I and = 0 .
Defining an auxiliary variable y_f(t), the cost (4.4) can be written in an analogous form

V(t, u) = ½ ‖y*(t + d) − y_f(t + d)‖²_W + ½ ‖λΔ u(t)‖²
E(1) y_f(t + d) = E(q⁻¹) y(t + d)    (4.5)
The cost functions (4.4) and (4.5) give identical results (with different W). The selection
of the weight W in the MIMO case is easier with the former; the latter is more suitable
for analytical treatment and for gradient computation. The quadratic cost function is min-
imized at each time step with respect to the control u(t). In long range predictive control
the future controls

[u(t) … u(t + Nu − 1)]^T    (4.7)

are selected so that the predicted control error (cost function) is minimized.
If the control horizon Nu is shorter than the prediction horizon, the future controls after
u(t + Nu − 1) are held constant.
V(t, u) = ½ Σ_{i=N1}^{N2} ( ‖E(1) y*(t + i) − E(q⁻¹) y(t + i)‖²_W + ‖λΔ u(t + i − N1)‖² )    (4.8)

where N1 and N2 limit the prediction horizon. The selection N1 = d is commonly used
in practice.
The cost function (4.8) is minimized at each time instant w.r.t. the future controls, i.e.

[u(t) … u(t + Nu − 1)]^T = arg min_u V(t, u)    (4.9)

Only u(t) is used as the actual control signal. If N1 = N2 = d, one obtains the d-step-
ahead controller.
Several gradient based optimization techniques can be used to minimize (4.8). The mini-
mization is generally a constrained optimization task which can be quite complex. Typi-
cally at least lower and upper bounds for the control signal are set (u_min and u_max) and
commonly also a maximum allowed change Δu_max, i.e.

[u(t) … u(t + Nu − 1)]^T = arg min_u V(t, u)
subject to  u_min ≤ u(i) ≤ u_max,   i = t … t + Nu − 1
            |Δu(i)| ≤ Δu_max,       i = t … t + Nu − 1    (4.10)
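One receding-horizon step of (4.9) under the constraints (4.10) can be sketched as below. This is an illustration only: SciPy's SLSQP solver stands in for the CFSQP package used in the thesis, and the function name and the predictor interface are assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def lrpc_step(predict, r, u_prev, Nu=3, lam=1e-3,
              u_min=0.0, u_max=1.0, du_max=0.2):
    """Minimize the predicted control error over the control horizon,
    subject to amplitude bounds and rate limits as in (4.10).

    predict(u_seq) -> predicted outputs over the horizon for a candidate
    control sequence of length Nu (controls after Nu - 1 are held
    constant inside predict, as in the text).
    """
    def du(u):
        # control increments du(t) ... du(t + Nu - 1), relative to u(t - 1)
        return np.diff(np.concatenate(([u_prev], u)))

    def cost(u):
        e = r - predict(u)
        return 0.5 * np.sum(e ** 2) + 0.5 * lam * np.sum(du(u) ** 2)

    cons = [{'type': 'ineq', 'fun': lambda u: du_max - du(u)},   # du <= du_max
            {'type': 'ineq', 'fun': lambda u: du_max + du(u)}]   # du >= -du_max
    res = minimize(cost, np.full(Nu, u_prev), method='SLSQP',
                   bounds=[(u_min, u_max)] * Nu, constraints=cons)
    return res.x[0]   # only u(t) is applied; the rest is discarded
```

The absolute-value rate constraint is split into two smooth inequalities per step, which keeps the problem differentiable for the SQP solver.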
A typical selection Nu = 1 simplifies the optimization considerably, especially in the SISO
case, where one has one cost function value and one free variable u(t). The previous control
u(t − 1) is also a good initial estimate. The SISO case with Nu = 1 can be handled efficiently
using parabolic interpolation (Brent's method), e.g. Press (1986), which does not
require any gradient information. Typically the minimum is obtained in less than ten iter-
ations. This is efficient because the computational load of one forward pass is not much
greater than that of one backward pass.
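The Nu = 1 SISO case above can be sketched in a few lines. SciPy's bounded scalar minimizer (a Brent-style parabolic interpolation) is used here as a stand-in; the one-step predictor is a hypothetical placeholder for the identified network.

```python
from scipy.optimize import minimize_scalar

def siso_mpc_control(predict, r, u_prev, lam=0.01, u_min=0.0, u_max=1.0):
    """Nu = 1 predictive control: minimize the d-step-ahead cost (4.4)
    over the single free variable u(t) without gradient information.

    predict(u) -> predicted output y(t + d) for a candidate control u.
    """
    cost = lambda u: (0.5 * (r - predict(u)) ** 2
                      + 0.5 * lam * (u - u_prev) ** 2)
    res = minimize_scalar(cost, bounds=(u_min, u_max), method='bounded')
    return res.x
```

Each cost evaluation is one forward pass of the model, which is why fewer than ten iterations make this cheap in practice.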
For complicated tasks a more sophisticated optimizer must be used. Typical methods are
based on Sequential Quadratic Programming (SQP). An efficient and reliable implemen-
tation called CFSQP (Lawrence, Zhou and Tits 1994) is used in this study. Rosen's
gradient projection method (Soeterboek 1990) and the Newton-Raphson method are also
implemented and applied in some experiments.
As discussed in Section 3.2, there are various possibilities for multistep prediction. The
cost function (4.8) is evaluated after computing the predictions. The gradient of the cost
function w.r.t. the future controls (∂V/∂u) is computed according to the predictor type by
applying the chain rule of differentiation. Due to the variety of different multistep predictors
and delay structures, procedural formulas are not presented; only the general idea and an
illustrative example are given.
Stability
The stability of various long range and receding horizon predictive control approaches has
been studied considerably in recent years. The term guaranteed stability is commonly
used in this context, e.g. Demircioglu and Clarke (1992). Most properties, such as equiv-
alence, stability and internal structures, are known for linear systems without I/O con-
straints. Stability problems have now been resolved for linear systems with I/O constraints, e.g.
Zafiriou and Marchal (1991), and even for discrete time nonlinear systems, e.g. Alamir
and Bornard (1994), as far as a feasible solution exists. For a survey, see Kwon (1994).
Fig. 4.6. An illustrative example of the gradient calculation. A first order NARX model
is used with the cost function parameters E_1 = 0, λ = 0, N1 = 1, N2 = 3, Nu = 1. The
predictions ŷ_k(t + 1) … ŷ_k(t + 3) are computed with forward passes of the network, the
errors r(t) − ŷ_k(t + i) are propagated with backward passes to form ∂V/∂u_k(t), and the
optimizer updates u_k(t). The index k is the iteration counter at time t.
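The forward/backward sweep of the figure can be sketched numerically. The toy model f(y, u) = tanh(a·y + b·u) is a stand-in for the identified first-order NARX network, and the function name and parameters are illustrative assumptions; the chain rule accumulates dV/du through the rollout exactly as the figure suggests.

```python
import numpy as np

def rollout_and_gradient(y0, u, r, a=0.7, b=0.5, N2=3):
    """Roll a first-order model yhat(t+i) = tanh(a*yhat(t+i-1) + b*u)
    N2 steps ahead with a constant control u (Nu = 1), and accumulate
    dV/du of V = 0.5 * sum (r - yhat)^2 by the chain rule.
    """
    y, dy_du = y0, 0.0          # prediction and its sensitivity to u
    V, dV_du = 0.0, 0.0
    for _ in range(N2):
        s = a * y + b * u
        y_next = np.tanh(s)
        dfds = 1.0 - y_next ** 2             # tanh'(s)
        dy_du = dfds * (a * dy_du + b)       # chain rule through the rollout
        e = r - y_next
        V += 0.5 * e ** 2
        dV_du += -e * dy_du                  # backward accumulation
        y = y_next
    return V, dV_du
```

An optimizer would iterate u_k ← u_k − α ∂V/∂u_k using this pair of sweeps; the gradient can be checked against a finite difference of V.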
A practical approach to maintain stability is to add a terminal cost or terminal equality
constraints to the cost function, e.g. MacArthur (1993). To avoid a constrained minimiza-
tion, a terminal cost approach is implemented, i.e. the cost function (4.8) is modified to
V(t, u) = ½ Σ_{i=N1}^{N2} ‖E(1) y*(t + i) − E(q⁻¹) y(t + i)‖²_W

        + ½ Σ_{i=1}^{Nu} ‖λΔ u(t + i − 1)‖²    (4.11)

        + ½ Σ_{i=N2+1}^{N2+N3} ‖E(1) y*(t + i) − E(q⁻¹) y(t + i)‖²_{W_T}
where W_T is the weighting of the terminal cost and N3 sets the length of the terminal
phase. The second term is identical to that of (4.8), although defined in a different way.
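The structure of (4.11) can be made concrete with a small sketch. The function name and the scalar weights are illustrative assumptions, and E(q⁻¹) = I is assumed for brevity so that the norms reduce to plain squared errors.

```python
import numpy as np

def lrpc_cost_with_terminal(y_pred, y_star, du, W=1.0, lam=0.01, W_T=10.0,
                            N1=1, N2=5, N3=3):
    """Cost (4.11): the tracking cost of (4.8) over i = N1 ... N2, the
    control increment term, and a terminal phase i = N2+1 ... N2+N3
    weighted by W_T.

    y_pred : predictions yhat(t+N1) ... yhat(t+N2+N3)
    y_star : targets      y*(t+N1) ... y*(t+N2+N3)
    du     : control increments du(t) ... du(t+Nu-1)
    """
    e = np.asarray(y_star, float) - np.asarray(y_pred, float)
    n_main = N2 - N1 + 1                            # samples in the main horizon
    V = 0.5 * W * np.sum(e[:n_main] ** 2)           # first term of (4.11)
    V += 0.5 * lam * np.sum(np.asarray(du) ** 2)    # control increment term
    V += 0.5 * W_T * np.sum(e[n_main:] ** 2)        # terminal cost term
    return V
```

Choosing W_T much larger than W forces the predicted trajectory to settle by the end of the horizon, which is the stabilizing effect of the terminal phase.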
Model Mismatch
If the direct predictive control scheme is applied with constant model parameters, a steady
state control error will be encountered due to the modeling error. This must be removed
by incorporating an integrating action into the controller. The approach applied here is to
assume that the noise is generated by an integrated moving average (IMA) noise process,
i.e. the model for the additive noise {z(t)} is assumed to be
Δ z(t + 1) = C(q⁻¹) ε(t + 1)    (4.12)

where {ε(t)} is an m-dimensional independent white noise sequence and C is combined
from monic and stable scalar polynomials, i.e. a matrix polynomial

C(q⁻¹) = I + C_1 q⁻¹ + … + C_nc q⁻ⁿᶜ
For the formulation of the predictor, a combined (NARX + IMA = NARIMAX) noise model assump-
tion is used

y(t + 1) = ȳ(t + 1) + z(t + 1)
ȳ(t + 1) = f(φ(t + 1))
Δ z(t + 1) = C(q⁻¹) ε(t + 1)    (4.13)
φ(t + 1) = [yᵀ(t), …, yᵀ(t − n_a + 1), uᵀ(t − d + 1), …, uᵀ(t − d − n_b + 2)]ᵀ
where ȳ(t) is used as shorthand notation for the predictions obtained from the plain
NARX part. The IMA noise model is just added to the identified NARX predictor to
remove the steady state control error. This noise assumption is similar to the T-polyno-
mial used in GPC, e.g. Clarke and Mohtadi (1989). The C-polynomial is not identified; it
is used as a tuning parameter to increase the robustness of the control system.
The situation is in general not so straightforward; it depends on the true noise process, i.e.
on whether IMA noise is actually present in the system. If so, identification problems will
be encountered if a plain NARX model is used. In the original GPC formulation, the linear
ARIX predictor, i.e. ARX with C(q⁻¹) = I, is identified but used with C(q⁻¹) ≠ I when
solving the control u(t). The identification of an ARIX type predictor also decreases the
signal/noise ratio and thus the quality of the prediction, and it is questionable whether it is
a generally applicable solution, especially in adaptive control. The situation is application
dependent and the user must select whether to use it or not.
Because the noise model is additive and consists of separate linear SISO models, the min-
imum variance predictor for each z_i(t + i) can easily be determined by solving the cor-
responding Diophantine equation, e.g. Åström and Wittenmark (1990), resulting in

C(q⁻¹) ẑ(t + 1 | t) = C(1) z(t) = C(1) (y(t) − ȳ(t))
ẑ(t + i | t) = ẑ(t + 1 | t),   when i > 1    (4.14)

where the actual noise z(t) = y(t) − ȳ(t) is calculated backwards using (4.13). The final
predictor equations at time t can be presented as a recursive procedure
repeat i = 1 … N2
  ȳ(t + i) = f(φ(t + i))
  φ(t + i) = [yᵀ(t + i − 1), …, yᵀ(t + i − n_a), uᵀ(t + i − d), …, uᵀ(t + i − d − n_b + 1)]ᵀ
  ŷ(t + i) = ȳ(t + i) + ẑ(t + i | t)    (4.15)
  y(t + i) := ŷ(t + i)
end
where the last equation denotes that the predicted measurements ŷ(t + i) are used instead
of the missing future measurements y(t + i). When the predictor (4.14)–(4.15) is used
with the cost function (4.8), no steady state control error is encountered for stepwise load
disturbances. Typically C(q⁻¹) = I − C_1 q⁻¹, i.e. the prediction error is simply filtered
with a first order filter and added to the actual predictor (4.15).
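The recursion (4.15) with a first-order C-filter can be sketched as follows. The function name and the predictor interface are illustrative assumptions: a first-order NARX stand-in f, delay d = 1, and C(q⁻¹) = 1 − c1·q⁻¹, for which (4.14) reduces to first-order filtering of the prediction error.

```python
import numpy as np

def narimax_predict(f, y_t, ybar_t, zhat_prev, u_seq, c1=0.7, N2=5):
    """Recursive NARX + IMA predictor, eqs. (4.14)-(4.15), first order
    in both the NARX part and the C-polynomial (a sketch, not the
    thesis code).

    y_t       : measured output y(t)
    ybar_t    : plain NARX prediction of y(t) made at t - 1
    zhat_prev : previous filtered error zhat(t | t-1)
    u_seq     : candidate controls over the horizon, length N2
    """
    z = y_t - ybar_t                           # actual noise z(t), from (4.13)
    zhat = (1.0 - c1) * z + c1 * zhat_prev     # zhat(t+1|t), eq. (4.14)
    y_prev, y_pred = y_t, []
    for i in range(N2):
        ybar = f(y_prev, u_seq[i])             # plain NARX step
        yhat = ybar + zhat                     # add the constant IMA correction
        y_pred.append(yhat)
        y_prev = yhat                          # predictions replace the missing
    return np.array(y_pred), zhat              # future measurements in phi
```

Because the same filtered error ẑ(t + 1 | t) is added at every step of the horizon, a constant load disturbance is eventually cancelled, which is the integrating action described above.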
The extension to the NARMAX or NOE predictor is straightforward, but the NOE case is
of special interest. Because the noise is additive to the output after the dynamical part (the
NOE model itself), the filtered prediction error can also be subtracted from the target. Now
the predictor equations for the NOE + IMA (= NOE-IMA) case can be presented as

repeat i = 1 … N2
  ȳ(t + i) = f(φ(t + i))
  φ(t + i) = [ȳᵀ(t + i − 1), …, ȳᵀ(t + i − n_a), uᵀ(t + i − d), …, uᵀ(t + i − d − n_b + 1)]ᵀ    (4.16)
  y*(t + i) = r(t) − ẑ(t + i | t)
end

where ȳ now denotes the predictions obtained from the plain NOE part.
Adding the setpoint filtering and adopting the formulation (4.2), one obtains

y*(t + i) = F_1(q) r(t) − F_2(q) [y(t) − ȳ(t)],   i = 1 … N2    (4.17)

where F_1(z) and F_2(z) = C⁻¹(z) C(1) are diagonal transfer function matrices. Now
the overall control system can be presented as an analogous two-degrees-of-freedom
internal model control (IMC) approach, e.g. Morari and Zafiriou (1989), see Fig. 4.7.
Fig. 4.7. The LRPC with the NOE model can be presented as an IMC approach. The
LRPC block represents the iterative solving of the control signal u(t) using the identified
NOE model.
A practical advantage of this formulation is that the filter F_2 can easily be designed for
other types of load changes. This is especially useful because the internal LRPC-NOE loop
transforms the system to be controlled into a diagonally dominating linear system, and the
filter F_2 can be designed for an open loop system of the form z⁻ᵈ E(z), which consists
of linear first order SISO models. Straightforward guidelines for the filter design can be
found in Morari and Zafiriou (1989).
The proposed nonlinear LRPC approach as a whole has features similar to its linear coun-
terparts, Generalized Predictive Control (GPC), Model Predictive Control and others. The
main benefit is efficiency. The design specifications can be set on-line. Exact a priori
knowledge of the delay is not as critical as in a one-step-ahead controller. LRPC can even
handle unstable and non-minimum phase systems.
The LRPC approach is efficient, but at the price of a high on-line computational load. Because
most true neural network control systems are in any case targeted at dedicated special
tasks, it is reasonable to assume that a computer platform for the LRPC approach is afford-
able.
Although the LRPC approach has several design parameters, the tuning is straightforward.
Some guidelines for tuning are presented below. The proposed scheme is for the LRPC with
a NARX predictor:

1. Set F_1(z) = I, N1 = N2 = d, Nu = 1 and λ = 10⁻⁴ I (or something small).

2. Select E(q⁻¹) so that its poles correspond to the process open loop time constant.
Select C(q⁻¹) to be the same as or slower than E(q⁻¹).

5. Finally, fine tune F_1 if a setpoint change causes too heavy control signal variations.
Adaptive LRPC
This section considers an adaptive version of the direct predictive neural network control.
The current state-of-the-art and basic problematics associated to this type of control is dis-
cussed and also some solutions are proposed. These considerations are also applicable to
other types of adaptive nonlinear control.
When the d-step-ahead control law (4.4) or the LRPC control law (4.8) is combined with
the recursive identification (RPEM) presented in Section 3.3, one obtains an adaptive nonlin-
ear controller (Fig. 4.8). The approach should in principle achieve good control perfor-
mance, at least in the end. The usage of a nonlinear model makes adaptation slower than
in the linear case, although efficient recursive schemes like RPEM increase the speed of
convergence. This slow adaptation must be combined with a robust control law like LRPC
in order to obtain acceptable results. Experimental results are still rare, at least if non-
linear-in-the-parameters type models are considered.
Linear adaptive and self-tuning control is studied extensively in the literature; see for
example Bitmead et al. (1990) for an excellent analysis of linear adaptive GPC. Results
concerning adaptive nonlinear control with linear-in-the-parameters type models are also
widely reported.
For example, Pröll and Karim (1994) have applied polynomial ARMAX type models for
adaptive nonlinear LRPC. They also present real-time experiments using a wastewater pH
neutralization process. It is clear that the usage of a linear-in-the-parameters type model
makes the identification easier and increases the control system performance during the
initial start-up.
[Fig. 4.8: adaptive direct predictive control; the LRPC optimizer computes u(t) from r(t) using the NN model, which is identified on-line.]
If linear-in-the-parameters nonlinear models within the LRPC produce good results, why
use nonlinear-in-the-parameters type neural network models at all? The reason is clear. If
coarser or simpler models are applied, one must typically perform a nontrivial model
structure selection according to a priori process knowledge. The actual model is then nor-
mally easier to identify; it might even be linear-in-the-parameters. On the other hand, neu-
ral network models can represent virtually any multivariate nonlinear smooth function
with the same basic structure.
If a nonlinear-in-the-parameters type model is applied for adaptive control in a way ana-
logous to the linear self-tuning scheme, one must accept that start-up behaviour compa-
rable to the linear case is typically not obtained and that all operation points should be
visited at least once before good control performance is obtained.
Thus the usability of adaptive LRPC with the MLP network, or with any similar global
approximator, depends on how slow an adaptation is acceptable in a particular application.
Experimental case studies considering these aspects will be presented in Section 6.2. The
examples are selected to discuss the basic problems and the typical behaviour of this type of
control.
Reliability and repeatability are in practice the main requirements for any adaptive
control system, much more than a requirement of instantaneously good control perfor-
mance. These are exactly the issues where all adaptive control systems have major
problems, and these problems are emphasized in adaptive neural network control due to
the nonlinear nature of the model and/or controller.
Based on the simulation studies and real time experiments, the adaptation also seems to
be quite sensitive to an incorrect noise model assumption, even with the full LRPC
approach including the C-polynomial to compensate for the model mismatch. Naturally this
problem is present only in the stochastic case. The adaptive LRPC with a nonlinear-in-the-
parameters type neural network model is quite capable of controlling a deterministic nonlinear
process without any a priori identification. This has been verified with extensive simulation stud-
ies.
Several precautions must be taken to keep the control system stable and to
secure the identification algorithm. The usage of a (variable) forgetting factor during the
start-up phase, or resetting the covariance/information matrix periodically, reduces the
effect of the initial weights during the start-up phase and keeps the identification alive,
but these measures do not totally remove the problems. One can be lucky and obtain excellent
start-up behaviour, or unlucky and obtain a disaster.
The application of the stability and input gradient projection schemes secures the control sys-
tem against a total disaster, but the resulting control performance is not acceptable for a long
period of time. In practice some preliminary off-line identification should be performed
to achieve reasonable initial weights.
An ideal model candidate for nonlinear adaptive control should have the following properties:

It is apparent that this type of model family is not available, and in practice a compromise
between the properties must be made. Some attempts are reviewed below.
Consider the polynomial ARMAX type models, e.g. Hernandez and Arkun (1993) and
Pröll and Karim (1994). The models are simplified versions of the Volterra series repre-
sentation. A preliminary selection of the necessary multiplicative terms is performed before
the identification. After that the model itself is linear-in-the-parameters and the identifi-
cation is easy.
There are also other possibilities for coarser and/or simpler models, for example piece-
wise linear models. Unfortunately the obvious candidates, B-splines and fuzzy models, are
suitable only for very low order tasks, not for time series modeling, as discussed in Sec-
tion 2.2. The use of a piecewise linear activation function within an MLP structure, like the
hinge network (Niu and Pucar 1995), results in a coarser model which has many attractive
properties. The model is still nonlinear-in-the-parameters, and due to the discontinuous
derivatives the on-line gradient based identification is difficult. The hinge network is thus
more suitable for off-line tasks.
An alternative is to use an MLP or RBF neural network with a preselected first layer or with
preselected radial basis functions so that the remaining identifiable part is linear-in-the-
parameters. The determination of the first layer weights or radial basis functions is how-
ever nontrivial and in practice some preliminary identification is required.
Section 4.2 Direct Predictive Control 75
Pao (1992) formulated the identification task in a different way. An MLP network with
one huge hidden layer was used (even hundreds of nodes). The lower layer weights were
randomly initialized and then fixed. Only the weights of the upper layer were identified
on-line. Thus the overall model was linear-in-the-parameters and fast initial convergence
was obtained. It is however clear that this type of model has extremely low extrapolation
capability and incorporates potential instability problems.
Still another alternative is to identify local linear models within an a priori clustered input
space and to combine the predictions using Gaussian, e.g. Johansen (1994), or triangular
membership functions. In both cases the global model is nonlinear-in-the-parameters.
Mills et al. (1994) introduce an interesting idea of history stack learning, which clearly
outperforms normal neural network based adaptive LRPC when controlling an evapora-
tor model. The problems with the initial weights and with the nonlinear-in-the-parameters
property are reduced by performing a couple of iterations at each time step with an off-
line identification scheme using the stored data.
This idea can be developed even further. A modern workstation can store a significant
amount of data directly in memory. An on-line measurement database - a sophisticated
history stack which stores only the most relevant process information - could be used. Any
global or local model could then be refined on-line, but using the more efficient off-line
methods. Theoretically this requires the development of better on-line clustering algo-
rithms, combined with the one-shot learning capability of the ART-2 neural network.
Research towards this database and towards suitable coarser model structures that still
maintain the main attractive features of MLP and RBF networks, and the development of
the corresponding identification schemes, are clearly important issues for future research.
The development of a general nonlinear extension of IMC poses serious difficulties due to
the inherent complexity of nonlinear systems. Thus the nonlinear extension of the IMC design
presented by Economou et al. (1986) is applicable only to open-loop stable systems with
a stable inverse. However, they have shown that under certain assumptions the closed-
loop system possesses the same stability and zero-offset properties as the linear IMC
structure. This work has been used as a basis by Nahas et al. (1992), who have demon-
strated that neural networks can be applied straightforwardly in the IMC design.
The problems related to the model inverse based IMC design can be partially avoided by
employing a nonlinear optimal controller within the IMC structure (Koivisto et al. 1992).
This approach is a general extension of the design methods proposed by Economou et al.
(1986). The nonlinear control law is approximated by a perceptron network and the cost
function associated with the optimal controller design is minimized numerically.
This IMC strategy is applied to the control of a laboratory heating process (SISO); see
Koivisto et al. (1992, 1993).
Section 4.3 Dual Network Control 77
A more detailed IMC structure is presented in Fig. 4.9. The IMC controller is realized with
linear filters F1 and F2. The filter F2 improves the robustness of the control system by
reducing the closed-loop gain. The filter F1 is used for setpoint filtering.
Assume now that the deterministic part of the process is identified, i.e. a d-step-ahead
NOE predictor is available

    ŷ(t+d) = f(φ(t+d), θ)
    φ(t+d) = [ ŷ^T(t+d-1) … ŷ^T(t+d-n_f)  u^T(t) … u^T(t-n_b+1) ]^T        (4.18)

and the controller is of the form

    u(t) = h(ψ(t), η)
    ψ(t) = [ y*^T(t+d) … y*^T(t+d-m_c+1)  ŷ^T(t+d-1) … ŷ^T(t+d-m_a)  u^T(t-1) … u^T(t-m_b) ]^T        (4.19)

where ψ(t) ∈ R^((m_a+m_b+m_c)n) is the data vector and η ∈ R^(n_η) is the parameter vector of
the controller. Note the specification of the data vector ψ(t): the delay free part of the
system is to be controlled. This is identical to the Smith predictor formulation. The target
y*(t+d) is the same as used in the direct predictive IMC control (4.17), i.e.
    y*(t+d) = F1(q)r(t) - F2(q)[ y(t) - ŷ(t) ]        (4.20)
The filters F1 and F2 are used only after the implementation of the closed-loop system.
During the design of the controller y*(t+d) = r(t), see Fig. 4.10.
The control law is realized with an MLP network and the parameters, i.e. the weights of the
network, are obtained by minimizing the quadratic cost function

    V(η) = (1/2) Σ_{t=1..N} [ ||y*(t+d) - y_f(t+d)||²_W + λ||u(t)||² ]        (4.21)

where

    E(1) y_f(t+d) = E(q^-1) ŷ(t+d)        (4.22)

Defining the error vector and the weighting matrix

    ε(t) = ε(t,η) = [ (y*(t+d) - y_f(t+d))^T  u^T(t) ]^T        (4.23)

    Q = [ W  0  ]
        [ 0  λI ]        (4.24)

the cost function can be written

    V(η) = (1/2) Σ_{t=1..N} ||ε(t,η)||²_Q        (4.25)
Fig. 4.10. The dual network system during the controller design.
Defining correspondingly the gradient of the error vector,

    ε_η(t) = dε(t,η)/dη = [ -(dy_f(t+d)/dη)^T  (du(t)/dη)^T ]^T        (4.26)

yields a control design task in the same form as the model identification task. The PEM
approach can now be directly applied after computing the gradients.
The required partial derivatives and total gradients are

    ∂u(t)/∂η = ∂h(ψ,η)/∂η,  ψ = ψ(t)        (4.27)

    u_η(t) = du(t,η)/dη        (4.28)

    ŷ_η(t+d) = dŷ(t+d,η)/dη        (4.29)

    ∂u(t)/∂ψ = ∂h(ψ,η)/∂ψ,  ψ = ψ(t)        (4.30)

    ∂ŷ(t+d)/∂φ = ∂f(φ,θ)/∂φ,  φ = φ(t+d)        (4.31)

so that the error gradient becomes

    ε_η(t,η) = [ -E(1)^-1 E(q^-1) ŷ_η(t+d,η) ]
               [  u_η(t,η)                   ]        (4.32)
Applying the MFD representation also here, one obtains a linearized model

    R(q^-1)u(t) = S(q^-1)ŷ(t+d-1) + T(q^-1)y*(t+d)
    A(q^-1)ŷ(t+d) = B(q^-1)u(t)        (4.33)

where the matrix polynomials in q^-1 are

    R(q^-1) = I - R_1 q^-1 - … - R_{m_b} q^{-m_b}
    S(q^-1) = S_1 + S_2 q^-1 + … + S_{m_a} q^{-m_a+1}
    T(q^-1) = T_1 + T_2 q^-1 + … + T_{m_c} q^{-m_c+1}
    A(q^-1) = I - A_1 q^-1 - … - A_{n_f} q^{-n_f}
    B(q^-1) = B_1 + B_2 q^-1 + … + B_{n_b} q^{-n_b+1}
and

    ∂ŷ(t+d)/∂φ = [ A_1 | … | A_{n_f} | B_1 | … | B_{n_b} ]
    ∂u(t)/∂ψ = [ T_1 | … | T_{m_c} | S_1 | … | S_{m_a} | R_1 | … | R_{m_b} ]        (4.34)

The extended network model then consists of the controller, the model and the gradient
update equations

    u(t) = h(ψ(t), η)
    ŷ(t+d) = f(φ(t+d), θ)
    R u_η^(i)(t,η) = S ŷ_η^(i)(t+d-1,η) + ∂u(t,η)/∂η_i        (4.35)
    A ŷ_η^(i)(t+d,η) = B u_η^(i)(t,η),   i = 1…n_η
1
where the arguments ( q , t ) of the matrix polynomials are omitted and the superscript
( i ) denotes the ith column of the corresponding Jacobian matrix. The last two rows cor-
respond to the gradient update equations. With the RPEM approach these gradient update
equations are used only as an approximation, similarly as in the model identification. The
equations (4.32) and (4.35) form now the necessary prediction and gradient computation
procedure.
Stability
When considering the stability of the implemented control system (Fig. 4.9), four differ-
ent stability issues are encountered:
3. Controller-model loop. The asymptotic stability of the model and the controller does
not ensure the stability of the internal loop which consists of the controller and the process
model. This stability must be ensured by the parameter projection.
4. Closed loop. The closed-loop gain can always be reduced by selecting the filter F2 so
that the input-output stability of the overall IMC structure is guaranteed, e.g. Economou
et al. (1986).
The main requirement for the convergence of the RPEM is that η(t) is restricted to the
stable region D of the parameter space (Ljung and Söderström, 1983). This condition is
equivalent to the stability of the extended network model (4.35). Using similar reasoning
as in Section 3.4, the gradient update equations in (4.35) can be neglected, leaving the
conditions for the model, the controller and their combination to be analyzed. The stability
conditions of the nonlinear IMC structure are thus identical to those required by the con-
vergence of the RPEM, with one addition: the stability of the overall control system.
    ŷ(t+d) = f(φ(t+d), θ)
    φ(t+d) = [ ŷ^T(t+d-1) … ŷ^T(t+d-n_f)  u^T(t) … u^T(t-n_b+1) ]^T        (4.36)

    u(t) = h(ψ(t), η)
    ψ(t) = [ y*^T(t+d) … y*^T(t+d-m_c+1)  ŷ^T(t+d-1) … ŷ^T(t+d-m_a)  u^T(t-1) … u^T(t-m_b) ]^T        (4.37)
The controller-model loop can be written in the state space form x(t+1) = F(x(t), y*(t+d)), i.e.

    x(t+1) = [ [0]       ] x(t) + [ f(x(t), h(x(t), y*(t+d))) ]
             [ [I_A] [0] ]        [ [0]                       ]        (4.38)
             [ [0]       ]        [ h(x(t), y*(t+d))          ]
             [ [I_B] [0] ]        [ [0]                       ]

    ŷ(t+1) = [ x_1(t+1) … x_m(t+1) ]^T

with the state vector

    x(t) = [ ŷ^T(t+d-1) … ŷ^T(t+d-n_f)  u^T(t-1) … u^T(t-n_b+1) ]^T        (4.39)
The Jacobian of (4.38), using the parameters of the MFD representation (4.33), is

    J_C = ∂F(x)/∂x

          [ (A_1 + B_1 S_1) … (A_{n_f} + B_1 S_{n_f})   (B_2 + B_1 R_1) … (B_{n_b} + B_1 R_{m_b}) ]
    J_C = [ [I_A] [0]                                                                             ]        (4.40)
          [ S_1 … S_{n_f}                                R_1 … R_{m_b}                            ]
          [ [I_B] [0]                                                                             ]
This resembles the AB-condition (3.73), i.e. it also takes into account the denominators of
the linearized controller and closed-loop models. This is impractical and simpler condi-
tions are needed. Assume, analogously to Hernandez and Arkun (1992), that perfect
control is applied, i.e. λ = 0 and

    y*(t+d) = f(x(t), g(x(t), y*(t+d)))
    E(q^-1)y*(t+d) = E(1)r(t)        (4.41)
    E(q^-1)ŷ(t+d) = E(1)r(t)
i.e. y*(t) = ŷ(t) for all t. The local existence and uniqueness of such a feedback law can be
proved by applying the Implicit Function Theorem, e.g. Rosenlicht (1968), resulting in prac-
tical conditions
The condition (4.42) should be assured during the identification phase, resulting now in
the local existence of a unique feedback law. The use of the closed loop reference model
does not affect this; it just relaxes the assumption of a stable inverse of the process
model. The Jacobian (4.40) can now be written as
           [ E_1  0  …  0                  ]
    J_C1 = [ [I_A] [0]                     ]        (4.43)
           [ S_1 … S_{n_f}   R_1 … R_{m_b} ]
           [ [I_B] [0]                     ]
where E_1 contains the poles of the closed loop reference model (4.3). So, if perfect control
is applied, the remaining dynamics of interest are those of the controller, which are closely
related to the inverse dynamics or zero dynamics, e.g. Isidori (1989), obtained with
E(q^-1) = I. Different controllers are obtained with different E(q^-1), and it is clear that
the assumption of a stable inverse can be somewhat relaxed by a suitable selection of
E(q^-1).
The remaining controller dynamics correspond to the lower right block

    J_C2 = [ R_1 … R_{m_b} ]
           [ [I_B]  [0]    ]        (4.44)
The instability detectors and the projection cost functions can now be written analogously
to those presented in Section 3.4. The AB-condition monitors whether (4.43) is a contrac-
tion, i.e. whether ||J_C1|| ≤ 1. The B-condition monitors whether (4.44) is a contraction,
i.e. whether ||J_C2|| ≤ 1, and the ρ-condition whether the spectral radius ρ(J_C2) ≤ 1.
In practice the ρ-condition is the most useful, especially because the controller typically
has m_b = 1; the detection and the projection are trivial in the SISO case and simple
in the TITO case.
The characteristic polynomial of the controller-model loop is

    P(q^-1) = A(q^-1)R(q^-1) - q^-1 B(q^-1)S(q^-1) = 0        (4.45)

where

    P(q^-1) = I + P_1 q^-1 + … + P_{n_f+m_b} q^{-(n_f+m_b)}

A sufficient stability condition is

    Σ_{i=1..n_f+m_b} ||P_i|| ≤ 1        (4.46)

and the exact condition is

    |roots(P(q))| ≤ 1        (4.47)
In the projection phase these are combined together and used with similar conditions for
the controller, resulting in the cost function

    V(η) = (1/2) Σ_{t=1..N} [ ||ε(t,η)||²_Q + ||v(t,η)||²_H ]        (4.48)

with

    v_1(t,η) = v_1 - γ_1,  if v_1 = Σ_{i=1..n_f+m_b} ||P_i(t)|| ≥ γ_1 and |roots(P(q))| ≥ 1
    v_1(t,η) = 0,          otherwise

    v_2(t,η) = v_2 - γ_2,  if v_2 = Σ_{i=1..m_b} ||R_i(t)|| ≥ γ_2 and |roots(R(q))| ≥ 1
    v_2(t,η) = 0,          otherwise
The global existence of a feedback law which produces (4.41) is another matter. It requires
the system to be globally I/O linearizable. Not all dynamical systems can be I/O or feed-
back linearized. Conditions for feedback linearization are presented by Levin and Naren-
dra (1993). In general, the conditions are very hard to verify, and in practical situations one
has no other option than to assume that they are valid and apply the linearization pro-
cedure.
For nonlinearizable systems, typically unstable or nonminimum phase systems, the model
reference control is not useful. Levin and Narendra (1993) proposed a class of direct sta-
bilizing neural network controllers (feedback laws), which are designed so that (4.38) is
a contraction in a selected small closed ball near the equilibrium. The underlying idea is
to design a k-step-ahead deadbeat controller. The control error is computed according to

    ε = x(k),  if ||x(0) - x_eq|| < r  or  ||x(k) - x_eq|| > α||x(0) - x_eq||        (4.49)
    ε = 0,     otherwise
where r is the radius of the ball and α is the contraction coefficient. As can be seen,
the model reference property is removed. The cost function for the minimization is col-
lected with successive simulations starting from randomly selected initial states x(0).
Another proposal (Hernandez and Arkun 1992) somewhat ties together the dual net-
work approach (4.48), (4.49) and the direct predictive control approach (4.8). The authors
suggest a p-inverse feedback law, which is a special p-step-ahead predictive controller.
The control u(t) is computed by minimizing the unconstrained cost function
    V(t,u) = (1/2) ||y*(t+p) - ŷ(t+p)||²        (4.51)

or more precisely an LRPC with N1 = N2 = p, Nu = 1 and λ = 0. Also E(q^-1) = I, i.e.
no model reference is applied.
They proved that near a stable equilibrium point, if certain assumptions hold, mainly
(4.42), there always exists (locally) a feedback law (4.37) which stabilizes the system in
p steps (with p large enough). This is true regardless of the stability of the inverse
dynamics. The selection of p is not straightforward, because the spectral radius of (4.44) is
not necessarily a monotonically decreasing function of p. The implementation was per-
formed with the LRPC approach. They demonstrated the control system behaviour by con-
trolling a nonminimum phase nonlinear CSTR model.
The previous is also one demonstration that a neural network controller can be designed
according to any LRPC cost function, at least in the unconstrained case. This is perhaps
impractical, because one can always first solve the LRPC task with the direct approach and
then design a controller according to the result. The synthetic signal method can be used
for the actual controller design (Fig. 4.1). A disadvantage is that the learning of constraints
is difficult. Also some stability problems may arise.
The proposed dual network IMC approach forms a flexible and robust controller design
framework, resulting in a controller which can be implemented with a minor computational
load. It can thus be applied also to high speed control systems.
5 Simulation Studies
This chapter presents simulation results of neural network identification and control. Sim-
ulation is an excellent tool for methodological development and testing. Typically a sim-
ple nonlinear process model is used as a test target to gain insight into how the selected
control/identification method behaves. Most of the uncertainties of true processes are
neglected, and their effect is tested separately by adding one type of uncertainty at a time.
As necessary and useful as these simulation studies are, they do not tell the whole story:
how the selected identification or control approach will perform when applied to a real
process. Real-time tests, or at least simulation studies using a detailed and complex process
model, are also needed.
The number of simulation experiments performed during this study is overwhelming,
but only a few are presented here due to space limitations and because the main
interest has been in investigating the performance of real-time neural network control.
Before going into the simulation examples, the setup for neural network identification,
simulation and real-time experiments is presented.
Several alternatives are available, to mention just two: Matlab/Simulink and Xmath/
MatrixX. Even a Neural Network Toolbox is available for Matlab, although its suitability
for time series modelling is limited.
It is more important to have a good process simulator and a good mathematical software
package than software for neural network representation and simulation. Most neural
network types and training methods can easily be presented using matrix operations,
and implementing the neural network part of the overall environment is a moderately easy task.
[Figure: the HP9000/425T software environment. The Planet/XNet neural network simulator and the Simnon process simulator communicate over TCP/IP, with an RS485 link to the process unit; the process models and the controller reside in the software.]
The lack of higher level mathematical subroutines somewhat limited the development of
new identification and control methods. Every needed mathematical subroutine had to be
added to the source code. In that sense a package like Matlab is superior, and immediately
after the Matlab mex-file interface and Simulink became available on the HP9000/425
workstation, the overall neural network environment was transferred to and developed
within it. This now forms the main platform for neural network identification, process
simulation and control design tasks. Simulink is not used for real-time control experiments,
although it would be possible. The slow computational speed and the lack of an appropriate
user interface do not meet the specifications set for real-time control.
The final software package consists of three parts, all within HP9000/720 or HP9000/425T
workstations:
Matlab functions (partly mex-files) for off-line tasks like identification, simulation
and controller design of MIMO neural network time series models and controllers.
The resulting networks can be used as Simulink modules.
Software (C-code) for adaptive and non-adaptive neural network long range predic-
tive control (LRPC). This package is named ADANET and it can be used as a
module in a Simulink model (Fig. 5.2) or as a part of the real-time experimental setup.
Both systems have been used efficiently for developing and testing neural network iden-
tification and control methods.
Fig. 5.3. Neuro-control workstation.
The control system simulation example demonstrates the basic properties of the LRPC and
dual network control (IMC) approaches, especially from the robustness point of view.
The example process is

    ẋ_1(t) = -ω_n² x_2(t) - 2ω_n x_1(t) sin²(π x_2(t)) + ω_n² (u(t) - 0.5)
    ẋ_2(t) = x_1(t)        (5.1)
    y(t) = x_2(t) + 0.5

where ω_n = 0.5 and 0 ≤ u ≤ 1. The steady state behaviour of the example system is lin-
ear but the dynamical behaviour is nonlinear. The model is a second order system with
the damping factor sin²(π x_2), i.e. the open loop poles are on the imaginary axis whenever
sin(π x_2) = 0.
Fig. 5.4. Steady state characteristic curve of the model and the poles of the linearized
model, linearized at steady states 0 ≤ u ≤ 1.
Section 5.2 Simulation Examples 91
This system is stable, but only with a narrow margin at some operating points. It is
selected to study how well neural network time series models can cope with nonlinear
dynamical behaviour, and to give some insight into the stability projection methods.
Consider first the case of stationary measurement noise, i.e. uniformly distributed noise in
[-0.02, 0.02] is added to the measurement. For the identification, the process is driven
with a Pseudo Random Sequence (PRS) type input signal covering all operation regions.
An identification data set of 1000 samples with T = 1 s is collected. The simulated
noiseless open loop step responses are used as the test set. For the identification, the meas-
urement and the input signal are scaled to ±0.5, although the original scales are used
in most figures. This is a standard procedure in this study, if not otherwise stated. The var-
iables are scaled so that the full operation region corresponds to [-0.5, 0.5].
The one-step-ahead predictor

    ŷ(t+1) = f(φ(t+1), θ)
    φ(t+1) = [ y(t) y(t-1) u(t) u(t-1) ]^T        (5.2)

is implemented with an MLP network; the structure is selected after some trials.
The size of this network is quite big because of the complex dynamics of the process. The
network weights are initialized with uniformly distributed random values in [-0.1, 0.1].
The parameters of the network are identified with the LM method.
The second identification experiment is made by driving the process with a similar PRS
type input signal as before, but adding Integrated Moving Average (IMA) noise to the
measurement and collecting a data set of 1000 samples. The IMA noise is obtained by
integrating normally distributed white noise (variance 10^-4). This data set is presented in
Fig. 5.5.
As a reference, the plain NOE predictor is identified using the same network structure
(5.2) as in the stationary case. The main interest is to identify the deterministic part of the
process using an appropriate noise model, i.e. assuming the process to be of NOE-IMA
type, see Section 4.2.
    ŷ(t+1) = f(φ(t+1), θ)
    φ(t+1) = [ ŷ(t) ŷ(t-1) u(t) u(t-1) ]^T
    ẑ(t+1) = y(t) - ŷ(t)        (5.3)
    ỹ(t+1) = ŷ(t+1) + ẑ(t+1)
This predictor can be identified by minimizing the cost function

    V_N(θ) = (1/2) Σ_{t=1..N} ||y(t) - ỹ(t)||²        (5.4)

which requires the gradient ∂ỹ(t+1)/∂θ to be computed according to the predictor for-
mulation (5.3) to obtain true PEM estimates. An identical but more convenient approach is
to minimize the incremental cost function

    V_N(θ) = (1/2) ||y(1) - ŷ(1)||² + (1/2) Σ_{t=2..N} ||Δy(t) - Δŷ(t)||²        (5.5)

where Δ = 1 - q^-1.
Fig. 5.5. The identification data set. The measurement is corrupted with IMA noise.
which can be implemented with only a slight modification of the plain NOE identification
procedure. In practice also the initial state φ(0) and ẑ(0) must be considered as
unknown parameters. This is omitted in this simulation example, because the IMA noise
starts at zero, see Fig. 5.5.
The NOE-IMA predictor is implemented with an MLP network with the same structure as
before, and the parameters of the predictor are identified with the LM approach.
The behaviour of the deterministic part of the identified predictors with the test set is pre-
sented in Fig. 5.6. As expected, the NOE predictor identified in the presence of stationary
noise gives the best results (Fig. 5.6a). The NOE-IMA predictor identified in the presence
of IMA noise also performs well (Fig. 5.6c). The plain NOE predictor with IMA noise does
not give the correct responses in either the static or the dynamic sense (Fig. 5.6b).
This can be seen more clearly if the prediction is shifted, i.e. ŷ - 0.3 is plotted (Fig. 5.6d).
The behaviour of the plain NOE predictor in the presence of IMA noise is unpredictable;
it depends heavily on the actual noise, i.e. totally different behaviour could be
obtained with different identification data. On the other hand, the NOE-IMA predictor
gives the same or almost the same responses regardless of the identification data, at least if
Fig. 5.6. Step responses of the deterministic part of the identified NOE predictors.
d) is the same as b), but shifted: y = ŷ - 0.3.
the number of samples is large enough, i.e. the true deterministic part of the process is
obtained as the result of the identification.
Additive and other trends are common in practice. The solution to the identification
problems is simple in the linear case: use incremental measurement and input signals. The
presented simulation example demonstrates that the solution is straightforward also in the
nonlinear case: use an incremental cost function.
This approach must be used with some care, because the gradient based minimization of
the cost function (5.5) converges to a local minimum. The NARX or NOE predictors
should be used to obtain adequate initial values for the parameters.
One inconvenient feature is that monitoring the cost (5.5) does not clearly indicate
whether the minimization has converged. The overall prediction error y(t) - ỹ(t) is small,
and the error y(t) - ŷ(t) does not tell much about the convergence; it is just an estimate
of the trend z(t). Better measures can be obtained by using correlation analysis of the
residuals, i.e. if the prediction error does not correlate with the input signal, no further min-
imization is necessary. The framework for nonlinear correlation analysis can be found in
Billings and Voon (1986) and Billings and Zhu (1994). The other possibility is to use a
separate test set and stop the identification when the residuals of the test set do not
decrease any more.
The correlation measures should be used for the validation of any nonlinear model. They
have not been used much in this study, because the result of the correlation analysis, plots
of 4 to 6 correlation functions, is difficult to interpret in terms of how the model structure
or model orders should be changed.
Fig. 5.7. Steady state characteristic curve of the model and the poles of the linear part.
(Example 5.2)
with 0 ≤ u ≤ 1. This system has a static nonlinearity and linear third order dynamics with
one dominating time constant (real pole at -0.0667) and a lightly damped fast mode (com-
plex pole pair at -1 ± 0.1339j), see Fig. 5.7. The system is selected mainly to study the
effect of unmodelled dynamics, i.e. the system is modelled with a first order nonlinear time
series model. The static nonlinearity makes the control difficult, because at the steady state
u = 0.5 the gradient ∂y/∂u = 0. The linear part of the model is similar to the well-
known Rohrs' example, e.g. Shook et al. (1991), which is used to study the robustness
of linear adaptive GPC.
Uniformly distributed noise in [-0.02, 0.02] is added to the measurement. The discreti-
zation time is chosen as one second.
The first order predictor

    ŷ(t+1) = f(φ(t+1), θ)
    φ(t+1) = [ y(t) u(t) ]^T        (5.7)

is implemented with an MLP network.
The process model is driven with a PRS+PRBS type input signal in 0.1 ≤ u ≤ 0.9 and a
data set of 3000 samples is collected. First a normal one-step-ahead NARX predictor is
identified by minimizing the constrained cost function

    V_N(θ) = (1/2) Σ_{t=1..N} ||y(t) - ŷ(t)||²
    subject to  ∂ŷ(t)/∂u(t-1) ≥ 0.005        (5.8)
                |∂ŷ(t)/∂y(t-1)| ≤ 1
with the RPEM approach, i.e. with the stability constraints and the input gradient limit. The
stability of the NARX predictor is not necessary for the convergence of the RPEM. However,
ensuring the stability is useful in the case of recursive multistep prediction.
The 8-step-ahead multistep NARX predictor is identified with the RPEM approach using
the incremental cost function with similar constraints as before

    V_LRPI(θ) = (1/2) Σ_{t=1..N} Σ_{i=t-7..t} ||y(i) - ŷ(i | t-8)||²
    subject to  ∂ŷ(i)/∂u(i-1) ≥ 0.005,  i = t-7…t        (5.9)
                |∂ŷ(i)/∂y(i-1)| ≤ 1,    i = t-7…t

where ŷ(i | t-8) denotes the prediction ŷ(i) calculated recursively using the actual meas-
urements at time step (t-8).
The performance of the resulting models is presented in Fig. 5.8, where the input step
responses at three different operation points, u = 0.5 → 0.6, u = 0.65 → 0.75 and
u = 0.75 → 0.85, are presented. The normal NARX predictor does exactly what it should
do: it predicts one step ahead correctly. The multistep NARX is more efficient in predicting
further into the future. The corresponding relative prediction errors, e.g. Grabec (1991),

    E_rp(i) = 1/(Var(y) N) Σ_{t=1..N} ( y(t-i+1) - ŷ(t-i+1 | t-8) )²,   i = 1…8        (5.10)

which are computed using a test set of 1000 data points, are presented in Fig. 5.9. Again
the prediction capability of the multistep NARX can be seen. Note that E_rp is not equally
distributed w.r.t. the prediction horizon i = 1…8. There is no reason why the relative predic-
tion error should be equally distributed here; the identification procedure just minimizes
the cost function (5.9).
The approach is efficient when identifying too low order models, but it can also be used
successfully if the model orders are correct. If the noise model assumption is also correct
(NARX), no benefits are gained. If not, the approach results in a compromise between a
plain NARX and a NOE predictor. This compromise is often a suitable model for the LRPC
approach.
[Fig. 5.8: step responses of the true process and of the identified predictors at the three operation points. Fig. 5.9: the relative prediction errors E_rp(i), i = 1…8, of the NARX and multistep NARX predictors.]
Example 5.3
This example considers the control of the process in Example 5.1. The identified NOE model
is used for the dual network controller design according to the guidelines presented in
Section 4.3. The controller of the form

    u(t) = h(ψ(t), η)
    ψ(t) = [ u(t-1) y(t) y(t-1) y*(t+1) ]^T        (5.11)

is realized using an MLP network; the network structure is selected after some trials.
The weights of the controller network are obtained by minimizing the cost function

    V(η) = (1/2) Σ_{t=1..N} [ ||y*(t+d) - y_f(t+d)||² + λ||u(t)||² ]        (5.12)

where

    E(1) y_f(t+d) = E(q^-1) ŷ(t+d)        (5.13)

with the LM approach, without any stability projection. After some trials, the design
parameters are selected as E(q^-1) = 1 - 0.5q^-1 and λ = 10^-3.
The ramp type setpoint sequence for the controller design is selected because it removes
the steady state control error obtained with a constant setpoint (Fig. 5.10a-b). This is
analogous to a PI controller: it has zero steady-state error with a constant setpoint, but can
have a constant steady-state error with a ramp type setpoint. A couple of stepwise setpoint
changes are added at the end of the design set for instability detection and projection pur-
poses. Note that the activation function of the output layer of the controller is tanh(), i.e.
in [-1, 1], but the whole operation region is [-0.5, 0.5]. If a constrained case is needed,
the activation function should be changed to 0.5 tanh().
The performance of the resulting controller-model loop is presented in Fig. 5.10. The
eigenvalues of the Jacobian matrix of the controller (controller poles, Fig. 5.10c) and of
the closed loop model (closed loop poles, Fig. 5.10d) clearly indicate that both the con-
troller and the closed loop system are possibly unstable (see Section 3.4 and Section 4.3).
This can be seen more clearly in the input signal behaviour of the test set (Fig. 5.10f). This
ringing phenomenon could be somewhat reduced by loosening the desired closed loop
behaviour E(q^-1) and by increasing the control signal weighting λ, but not totally
removed.
Fig. 5.10. The performance of the dual network control when controlling the identified
process model. No stability projection is applied. Note the internal scaling.
To obtain a stable control system, the closed loop poles are projected inside the unit
circle using the constrained cost function (4.48) with H = diag{h_1, 0} and h_1 = 1, i.e.
the controller dynamics are not constrained.
The performance of the resulting controller-model loop is presented in Fig. 5.11. Both the
controller poles and the closed loop poles are in the stable domain (Fig. 5.11c-d) and
the ringing phenomenon has disappeared from the control signal (Fig. 5.11f).
Incorporating the stability constraints within the optimal controller design seems to be an
efficient way to guarantee the stability of the control system while still maintaining its
control performance. As discussed in Section 4.3, a justifiable alternative is to project only
the controller poles, or the poles and the zeros. This detection/projection is much simpler
to implement.
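As a concrete illustration, the detection and penalty step of such a projection can be sketched in a few lines. This is a hedged sketch, not the thesis implementation: it assumes the linearized loop is available as an AR recursion y(t) = a1 y(t-1) + ... + an y(t-n), and the function names are illustrative.

```python
import numpy as np

def companion(a):
    """Companion matrix of the linearized recursion
    y(t) = a[0]*y(t-1) + ... + a[n-1]*y(t-n)."""
    n = len(a)
    A = np.zeros((n, n))
    A[0, :] = a
    A[1:, :-1] = np.eye(n - 1)
    return A

def stability_penalty(a, radius=1.0, weight=1.0):
    """Quadratic penalty on poles outside the given radius; zero for a
    stable linearization, so the original cost is untouched there."""
    poles = np.linalg.eigvals(companion(a))
    excess = np.maximum(np.abs(poles) - radius, 0.0)
    return weight * float(np.sum(excess ** 2))
```

A stable linearization, e.g. a = [0.5, 0.2], gives a zero penalty, while an unstable one, e.g. a = [1.5, 0.2], gives a positive penalty that a gradient-based optimizer can push against.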
[Fig. 5.11 panels: a) setpoint and output (design set), b) input signal (design set),
c) controller poles (design set), d) closed loop poles (design set), e) setpoint and output
(test set), f) input signal (test set).]
To compare the designed dual network controller to direct predictive control, the identified
neural network model is also controlled with the NOE type LRPC approach, using the same
design parameters E(q^-1) and ρ as in the dual network control case and without any
model mismatch considerations. As discussed in Section 4.3, the selection
N1 = N2 > 1 and Nu = 1 should give suitable results.
These results are in accordance with the p-inverse feedback law (Hernandez and
Arkun, 1992). Thus the optimal control signals could be solved with the LRPC approach,
and after that a neural network controller could be designed so that it produces these
control signals. However, based on a couple of trials, this type of identification is rather
difficult in a constrained case (input constraints).
So far, only the behaviour of the controller-model loop has been considered, especially
from the stability point of view. To analyse also the robustness issues, the designed
controller is implemented as the dual network IMC approach (Fig. 4.9) with the filters

F1(z) = 1  and  F2(z) = 0.3z/(z - 0.7)
The control system is simulated with Simulink using the original continuous time model
(5.1) as the process. The robustness of the control system is tested using both input and
measurement additive IMA noise, obtained by integrating normally distributed white noise
with variance 10^-4.
Trend-like disturbances pose serious difficulties for the control system when the process
is near the stability border. The use of input disturbances is also somewhat unfair, because
the IMC filter used is designed for output additive noise, see Section 4.2. The results
should therefore be considered from the robustness point of view.
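The disturbance generation and the first order filter can be sketched as follows. This is a hedged sketch: the sequence length and seed are arbitrary, and the filter is written directly as the difference equation f(t) = 0.7 f(t-1) + 0.3 e(t) implied by F2(z) = 0.3z/(z - 0.7).

```python
import numpy as np

rng = np.random.default_rng(1)

# IMA disturbance: integrated (cumulated) white noise, variance 1e-4 per step
ima_noise = np.cumsum(rng.normal(0.0, 1e-2, size=600))

def imc_filter(e, a=0.7):
    """First order IMC filter F2(z) = (1-a)z/(z-a), applied to the
    prediction error sequence e as f(t) = a*f(t-1) + (1-a)*e(t).
    The filter has unit steady state gain."""
    f, out = 0.0, []
    for e_t in e:
        f = a * f + (1.0 - a) * e_t
        out.append(f)
    return np.array(out)
```

Because the DC gain is one, a constant prediction error passes through unchanged in steady state; only its fast variations are attenuated, which is why slowly drifting input disturbances are rejected sluggishly.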
The performance of this control system is presented in Fig. 5.12. The control system
handles the output disturbances quite well, but it has serious difficulties with the input
disturbances near the stability border, i.e. near y ≈ 0.5.
This IMC approach is also implemented using the direct NOE-LRPC approach (Fig. 4.9).
The selection N1 = 1, N2 = 2 and Nu = 1 with the same design parameters as above
gave results comparable to Fig. 5.12. This also indicates that (without input constraints)
it makes little difference whether the IMC approach is implemented with the dual network
approach or with the direct NOE-LRPC.
[Fig. 5.12 panels: setpoint and output; input signal; disturbance (input IMA noise and
output IMA noise).]
Fig. 5.12. The performance of the implemented control system. Dual network control
with first order IMC filter.
The input noise rejection capability can be increased by designing a higher order IMC
filter, i.e. assuming the equivalent system to be controlled as, see (5.13),

z^-d E(1)/E(z^-1) = 0.5 z^-d/(1 - 0.5z^-1)    (5.14)

and designing a linear IMC controller for a ramp type disturbance. This results in
F2(z) = (3.2 - 4.2z^-1 + 1.3z^-2) · 0.3/(1 - 0.7z^-1)
The performance of this control system is presented in Fig. 5.13, now with the NOE-LRPC
approach. Exactly the same noise sequence as in Fig. 5.12 is used to make the comparison
easier. The problems with the input disturbances are reduced but not totally removed. This
is obvious, because even the second order IMC filter does not correspond to the true noise
model. The second order filter also increases the control signal variations in the presence
of the output disturbances.
[Fig. 5.13 panels: setpoint and output; input signal.]
Fig. 5.13. The performance of the implemented control system. The LRPC
approach with the NOE model and with the second order IMC filter.
Because the identified NOE model is almost an exact representation of the true process,
see Fig. 5.6a, it can be straightforwardly applied here as a NARX predictor. The LRPC
approach with the NARX predictor demonstrates this. The design specifications are kept
the same, i.e. C(q^-1) = 1 - 0.7q^-1. The performance of this control system is presented
in Fig. 5.14. The same disturbance sequence as before is used in the simulations. The main
difference from the previous simulations is that the input disturbances are handled more
efficiently, while the performance under the output disturbances is not worse than that
obtained with the IMC approach. It should be noted that an incorrect noise model
assumption is used here as well.
This is a first example indicating that the NARX predictor combined with the LRPC
approach forms an efficient and robust control system. Further examples will be seen in
the next chapters. The NARX predictor is easy to identify and the effect of the design
parameters is clear. The only drawback is the high on-line computational load. It is not
claimed that a NOE model with a carefully designed state estimator could not show similar
performance or robustness, but in many cases the LRPC-NARX approach is much easier
to apply.
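The on-line load comes from rolling the one-step NARX map forward over the horizon at every sample. A minimal sketch of that rollout, with a stand-in linear map in place of the trained network and a simplified regressor (both are illustrative assumptions, not the thesis model):

```python
import numpy as np

def narx_rollout(f, y_hist, u_hist, u_future, horizon):
    """Multi-step prediction with a one-step NARX map f(phi): each
    prediction is fed back into the regressor, which is what the LRPC
    cost evaluation needs at every sample."""
    y = list(y_hist)                       # measured outputs, newest last
    u = list(u_hist) + list(u_future)      # past inputs + candidate moves
    k = len(u_hist)
    preds = []
    for i in range(horizon):
        phi = np.array([y[-1], y[-2], u[k + i - 1], u[k + i - 2]])
        y_next = f(phi)
        preds.append(y_next)
        y.append(y_next)                   # feed the prediction back
    return np.array(preds)

# stand-in "network": a stable linear map, for illustration only
def f_lin(phi):
    return 0.6 * phi[0] + 0.2 * phi[1] + 0.1 * phi[2]
```

An optimizer then repeats this rollout for every candidate input sequence at every sampling instant, which is exactly where the on-line computational load of the LRPC-NARX approach comes from.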
[Fig. 5.14 panels: setpoint and output; input signal.]
Fig. 5.14. The performance of the implemented control system. The LRPC
approach with the NARX predictor.
Chapter 6 Control of Water Heating Process
The example processes incorporate all these aspects, with a minor limitation: a heavily
nonlinear pilot process was not available, and all of the presented results therefore
consider mild nonlinearities. This is not a serious limitation, because it is well known and
demonstrated in numerous articles that neural networks can represent heavy nonlinearities
quite well. More important is how the neural network control system performs with the
uncertainties of the real world. The available test processes are quite suitable for this
purpose.
This chapter considers identification and control of two slightly different water heating
processes, denoted Heating process I (HP I) and Heating process II (HP II). The water
flows from the domestic water network into an uninsulated 0.4 litre tank through a pipe
which has holes in it, see Fig. 6.1. The water is heated by a resistor element and the
temperature of the outlet flow is measured with a Pt-100 transducer. The inlet flow q_in
can be set between 0 and 3.0 l/min with a rotameter.
[Fig. 6.1: process schematic (inlet flow q_in, outlet flow q_out, Pt-100 transducer) and
the measured open loop response, temperature [C] versus control signal u [%].]
Fig. 6.1. The heating process and the measured open loop response (HP I), obtained
by driving the process with a 30 min. ramp both up and down. The measurement
y ( t ) is plotted directly as a function of u ( t ) .
The aim of the control is to drive the temperature of the outlet flow to the desired value.
Robustness of the temperature controller is essential, since the dynamics of the process
are extremely sensitive to flow changes. Also the time delay of 12...15 seconds has to be
taken into account in the control design. The dominating time constants are about 100
seconds (HP I) and 30 seconds (HP II). The main difference between the processes is that
process I has a smaller heating element than process II. From the control point of view,
heating process I is more difficult to control. The experiments were made during several
seasons of the year and the water inlet temperature varied from 5 C to 20 C, so the tests
should not be directly compared with each other.
6.1 Identification
The heating process II was modelled using data gathered from a real time identification
run. The system was driven with a PRS type input signal for 9000 seconds with the
sampling interval T = 3 s, resulting in 3000 samples. The data was divided into an
identification set of 2000 samples and a test set of 1000 samples.
The model orders and the delay were determined by identifying a series of linear ARX
models. The ARX predictor with n_a = 2 and n_b = 2 gave the best estimate, with the
delay d = 4. A bias term was used in all linear models. Since the deterministic part of the
system is of interest, the corresponding OE model was also identified.
[Fig. 6.2 panels: measurement and prediction; input [%].]
Fig. 6.2. The behaviour of the linear OE predictor with the test set. PRS type input signal.
The resulting behaviour with the test set is presented in Fig. 6.2. The largest prediction
errors occur in the outer operation ranges, which is mainly due to the saturation
phenomenon of the thyristor. Nonlinear models of the form

ŷ(t) = f(φ(t), θ)    (6.1)

with

φ(t) = [y(t-1) y(t-2) u(t-4) u(t-5)]^T    (NARX)
φ(t) = [ŷ(t-1) ŷ(t-2) u(t-4) u(t-5)]^T    (NOE)

were also identified.
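For batch identification, the NARX structure of (6.1) reduces to stacking delayed samples into a regression matrix; a sketch with the regressor above (the function name and start-index handling are illustrative):

```python
import numpy as np

def narx_regressors(y, u):
    """Regressor matrix and target vector for the NARX model
    y(t) = f([y(t-1), y(t-2), u(t-4), u(t-5)], theta)."""
    t0 = 5                     # first t for which all delayed samples exist
    Phi = np.array([[y[t-1], y[t-2], u[t-4], u[t-5]]
                    for t in range(t0, len(y))])
    Y = np.array([y[t] for t in range(t0, len(y))])
    return Phi, Y
```

The NOE regressor differs only in that the delayed outputs are the model's own past predictions, so it cannot be built in advance from the data; it has to be generated by simulating the model, which is what makes NOE identification the harder problem.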
The identification was initially made at the same time as the dual network control
experiments (Section 6.3) using a rather large network structure, with the models
identified using the RPEM approach. Later, the identification was repeated using the LM
method; the goal was to study how small a network is actually needed. Both the large and
the small network used tanh activation functions in the hidden layer and a linear output
layer.
The overall performance of the identified models seems to be almost identical, although
the number of parameters is significantly different: 65 versus 31. A remarkable aspect is
that the overparametrization does not have as strong an effect as it has when
overparametrizing linear models. This is also due to the good quality of the identification
data: it covers the whole operation region of interest. If a large MLP network were used
outside the trained domain, the performance would probably not be as good.
Table 6.1. Prediction errors (MSSE) for different model structures, small network.
  linear OE   7.48
  NARX        0.34
The results presented here are obtained using the small networks. The cost function values
are presented in Table 6.1. The difference between the OE and NOE models is significant,
which clearly implies that the heating system is nonlinear. This can also be seen from the
steady state characteristic curve of the NOE model (Fig. 6.3), which clearly indicates a
saturation type nonlinearity. The steady state curve is linear in the input range 30...70%,
but the dynamical behaviour is not, see Fig. 6.2.
An important aspect is that the steady state curve of the model is not monotonically
increasing, contrary to what is expected from the physical basis of the process. This is due
to the lack of identification data at higher temperatures. Although small, this phenomenon
causes extreme difficulties when the identified model is used in predictive control or in
controller design.
The cost function

V_N(θ) = (1/2) Σ_{t=1}^{N} [y(t) - ŷ(t)]^2    (6.2)

subject to  Σ_{i=4}^{5} ∂ŷ(t)/∂u(t-i) > 0

was minimized to force the steady state curve to be monotonically increasing,
successfully, as can be seen from Fig. 6.3, and with only a minor increase in the cost
function value: 0.85 versus 0.92.
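The constraint in (6.2) can be handled in practice as a penalty on the negative slope of the model's steady state curve; a sketch in which the network's steady state map is a stand-in function g, and the weight and finite-difference step are illustrative assumptions:

```python
import numpy as np

def monotonicity_penalty(g, u_grid, weight=10.0, eps=1e-2):
    """Penalize a decreasing steady state curve g(u): the penalty is
    zero when the finite-difference slope is non-negative everywhere
    on u_grid, mirroring the gradient constraint of (6.2)."""
    slopes = np.array([(g(u + eps) - g(u)) / eps for u in u_grid])
    return weight * float(np.sum(np.minimum(slopes, 0.0) ** 2))
```

Adding such a term to the identification cost leaves a monotone model untouched while pushing a locally decreasing steady state curve back up, at the price of a slightly larger residual cost, as observed above.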
[Fig. 6.3: output [C] versus input [%]; curves with and without constraints.]
Fig. 6.3. The steady state characteristic curve of the identified NOE models (HP II).
The behaviour of the resulting NOE model with the test set is presented in Fig. 6.4. For
comparison, the NARX predictor was also tested as a deterministic process model. The
difference between the NOE model and the NARX model is small but clear, which
probably indicates that noise entering the system in the NARX manner is also present. The
only clear difference between the large and the small network appeared in this test: the
large NARX predictor did not show as good performance when applied as a NOE model.
[Fig. 6.4 panels: a) measurement, prediction and norm of prediction error +10 [C];
b) measurement, prediction and norm of prediction error +10 [C]; input [%].]
Fig. 6.4. The NARX and NOE predictors as deterministic process models (HP II).
a) NARX as NOE, b) NOE.
6.2 Adaptive Control
The first example considers the adaptive LRPC of the heating process I using a direct
predictor, i.e. a combined approach, see Section 3.2. The main goal of this part was to
study the behaviour of different combined predictors, i.e. whether a sparse network
structure is needed. The identification was performed with the error backpropagation
algorithm. To compensate for the slow adaptation, a slow desired response was selected.
If d > 1 and a direct predictor is used, the model reference is difficult to apply. One
possibility is to take the closed loop dynamics into account already in the identification,
i.e. to identify directly

y_f(t) = E(q^-1) y(t)    (6.3)

[ŷ_f(t+4) ŷ_f(t+5) ŷ_f(t+6)]^T = f(φ(t), θ)    (6.4)

φ(t) = [y(t) y(t-1) u(t+2) ... u(t-1)]^T
This predictor was realized with an MLP network structure in which the non-causal
connections were removed and the size of the input data vector of each prediction was
balanced to be equal. The combined predictor (6.4) was in fact divided into three separate
networks, each with 11 nodes in the hidden layer:

ŷ_f(t+4) = f_1([y(t) y(t-1) u(t) u(t-1)]^T, θ_1)
ŷ_f(t+5) = f_2([y(t) y(t-1) u(t+1) u(t)]^T, θ_2)    (6.5)
ŷ_f(t+6) = f_3([y(t) y(t-1) u(t+2) u(t+1)]^T, θ_3)
The minimized cost function was

V(t, u) = (1/2) Σ_{i=N1}^{N2} [E(1)r(t+i) - ŷ_f(t+i)]^2 + (ρ/2) Δu(t)^2    (6.6)

with ρ = 0.02 and Δu_max = 10 % (the largest allowed change). The control error due to
the model mismatch was handled by adaptation, i.e. no C(q^-1) was used, see (4.14) and
(4.15). The cost function (6.6) was minimized using the steepest descent method.
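For a scalar horizon, the steepest descent minimization of a cost of this shape, including the Δu limit, can be sketched as follows; the scalar plant gain and all parameter values are illustrative assumptions, not the thesis tuning:

```python
import numpy as np

def lrpc_control(r, u_prev, gain, rho=0.02, step=0.5, iters=50, du_max=10.0):
    """Steepest descent on V(u) = 0.5*(r - gain*u)^2
    + 0.5*rho*(u - u_prev)^2, clipping the move to +-du_max (%)."""
    u = u_prev
    for _ in range(iters):
        grad = -gain * (r - gain * u) + rho * (u - u_prev)
        u = np.clip(u - step * grad, u_prev - du_max, u_prev + du_max)
    return float(u)
```

For a small setpoint change the iteration converges to the analytic optimum (gain·r + rho·u_prev)/(gain^2 + rho); for a large one the Δu clip becomes active and the controller moves at most 10 % per sample.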
The network was initialized with random weights in [-0.1, 0.1], and the standard error
backpropagation with a learning rate of 0.4 and a momentum term of 0.05 was used. A
series of setpoint changes covering the whole operation region was performed. A load
disturbance test with fixed weights was performed after that to study the robustness of the
resulting control system.
A detailed discussion of various sparse and non-sparse predictor structures can be found
in Kimpimäki (1990) and a shorter overview in Koivisto et al. (1991). The results indicate
that the control system with the sparse predictor structure performs clearly better.
The performance of the adaptive LRPC control with the predictor structure (6.5) is
presented in Fig. 6.5. As can be seen, it took a while before the measurement reached the
setpoint. The closed loop behaviour converged almost to the desired response, although
some differences can still be seen. The response was slow due to the selected closed loop
pole, and the overall performance could easily be outperformed by other control methods.
A remarkable aspect is that no steady state control error is present and that the resulting
controller shows good robustness properties during the load disturbance test, although the
disturbance rejection could be faster.
[Fig. 6.5 panels: a) setpoint and measurement, input [%]; b) setpoint and measurement
[C], with the inlet flow, inlet flow peak and inlet temperature disturbances marked.]
Fig. 6.5. The control system performance with the adaptive LRPC, a direct combined
NARX predictor, adaptation with error backpropagation, a) setpoint tracking, b) load
disturbance test with fixed parameters. (HP I)
The performance of the adaptive LRPC with a tighter closed loop model is presented in
Fig. 6.6, now with a recursive NARX predictor

ŷ(t+1) = f(φ(t+1), θ)    (6.7)

φ(t+1) = [y(t) y(t-1) u(t-3) u(t-4)]^T

The cost function

V(t, u) = (1/2) Σ_{i=N1}^{N2} { [E(1)y*(t+i) - E(q^-1)ŷ(t+i)]^2 + ρ Δu(t+i-N1)^2 }    (6.8)

with N1 = 4, N2 = 9, Nu = 2, ρ = 10^-3 and E(q^-1) = 1 - 0.7q^-1, was minimized
with Rosen's gradient projection method (Soeterboek 1990).
[Fig. 6.6 panels: setpoint and measurement; input [%].]
Fig. 6.6. The control system performance with the adaptive LRPC, a recursive NARX
predictor, adaptation with the error backpropagation (HP I).
The network starts at random weights in [-0.1, 0.1]. The identification was performed
using the error backpropagation with the learning rate 0.1 and by filtering the gradients
with 0.5/(1 - 0.5q^-1). The use of a recursive predictor caused in this case a strong control
signal variation, which was reduced with ρ = 0.01 and by projecting the gradient so that
∂ŷ(t+1)/∂u(t-3) > 0.25. The gradient limitation also reduced the start-up overshoot.
Tighter specifications (E(q^-1)) changed the overall performance considerably; most
important was the decrease of the overall robustness. It took a while (2000 s, i.e. about
700 control cycles) to meet the desired closed loop response when switching only between
two setpoint values (Fig. 6.6). The use of a stairwise setpoint sequence as in Fig. 6.5
resulted in a much longer adaptation time.
The overall performance (start-up behaviour) was as expected, but not acceptable. The
results are presented only to demonstrate the adaptation speed of the error
backpropagation algorithm and to provide a reference for the RPEM identification, which
is considered next.
RPEM identification
A typical behaviour of the adaptive LRPC approach with the RPEM identification is
presented in Fig. 6.7. The NARX predictor

ŷ(t+1) = f(φ(t+1), θ)    (6.9)

φ(t+1) = [y(t) y(t-1) u(t-3) u(t-4)]^T

was used. The control design parameters were selected as N1 = 4, N2 = 9, Nu = 2,
ρ = 10^-3 and E(q^-1) = 1 - 0.7q^-1. The gradient ∂ŷ(t+1)/∂u(t-3) was limited above
to 0.05. The network was initialized with random weights in [-0.1, 0.1]. The UD factored
version of (3.47) was used and the initial covariance matrix was set to P(0) = 100 I.
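The recursion behind the RPEM update can be sketched as plain recursive least squares with a forgetting factor; the UD factoring is omitted for brevity, and the whole block is an illustrative stand-in for (3.47), not its implementation:

```python
import numpy as np

def rls_update(theta, P, phi, y, lam=0.99):
    """One recursive least squares step with forgetting factor lam.
    In the RPEM, phi is replaced by the gradient of the prediction
    with respect to the weights."""
    err = y - phi @ theta
    k = P @ phi / (lam + phi @ P @ phi)
    theta = theta + k * err
    P = (P - np.outer(k, phi @ P)) / lam
    return theta, P

# start from zero weights and a large covariance, P(0) = 100 I
theta, P = np.zeros(2), 100.0 * np.eye(2)
```

With the large initial covariance the first updates take long steps, which is exactly why poor initial weights can drive the adaptive loop far away before enough data has been seen.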
The start-up behaviour depends heavily on the ordering of the setpoint sequence and on
the initial weights, which is obvious: the recursive Gauss-Newton algorithm converges to
the nearest local minimum. The forgetting factor and/or covariance resetting did not
remove the problem of the initial weights. The input gradient and stability projection
methods helped considerably, but they did not totally remove the problem. Obviously
some preliminary off-line identification must be performed in order to obtain reasonable
initial weights.
[Fig. 6.7 panels: setpoint and measurement, with the input flow changes marked;
input [%].]
Fig. 6.7. The control system performance with the adaptive LRPC, a recursive NARX
predictor, adaptation with the RPEM (HP I).
The NOE type process model and the NOE type controller were identified using two
totally separate, parallel RPEM algorithms. The controller was designed on-line using the
approach presented in Section 4.3. The IMC filters were not used, i.e. the control error
due to the model mismatch was handled by adaptation. The process model was

ŷ(t+1) = f(φ(t+1), θ)    (6.11)

φ(t+1) = [ŷ(t) ŷ(t-1) u(t-4) u(t-5)]^T

and the controller

u(t) = h(φ_c(t), θ_c)    (6.12)

φ_c(t) = [u(t-1) ŷ(t+4) ŷ(t+3) y*(t+5) y*(t+4)]^T
Both networks were initialized with random weights in [-0.1, 0.1] and identified with the
RPEM approach, using a similar covariance and forgetting factor as in the previous case.
The control cost function was

V(θ_c) = (1/2) Σ_{t=1}^{N} [ (y*(t+5) - ŷ_f(t+5))^2 + ρ Δu(t)^2 ]    (6.13)

and
E(1) ŷ_f(t+5) = E(q^-1) ŷ(t+5)    (6.14)

where y*(t+5) = r(t), E(q^-1) = 1 - 0.7q^-1 and ρ = 0.01. The gradient
∂ŷ(t+5)/∂u(t) was limited above to 0.005 and the pole of the controller was projected
into [-0.8, 1].
Fig. 6.8 shows that the performance of the control system is good. The successful start-up
(after the very beginning) is mainly due to the suitable initial weights, but also due to the
identification of the NOE type model and controller with the RPEM approach using
proper gradient computation. Because no IMC filters were used, the noise rejection was
not as good as possible; this can clearly be seen from the control signals. This experiment
merely demonstrates the convergence of the RPEM approach under suitable conditions.
This control system was also sensitive to the initial weights and setpoint sequences:
typically one run out of five performed this well, one was more or less a failure, and the
rest showed control performance comparable to Fig. 6.7.
The dual network control is computationally light when applied with constant parameters,
but computationally heavy in an adaptive application, where the difference to the
computational load of the adaptive LRPC is no longer large. In practice the use of LRPC
is therefore perhaps better motivated.
[Fig. 6.8 panels: setpoint and measurement; input [%].]
Fig. 6.8. The control system behaviour with the adaptive dual network control, model
and controller adaptation with RPEM, HP II.
6.3 Dual Network Control - Constant Parameters
The controller redesign was made by changing the pole of the closed loop reference model
from 0.7 to 0.2, i.e. by minimizing the cost function (6.13) with the RPEM approach using
E(q^-1) = 1 - 0.2q^-1 and the same setpoint sequence as in Fig. 6.8. The controller pole
was again limited into [-0.8, 1]; the limit -0.8 was used instead of -1 to avoid a ringing
effect in the control signal. The redesign gave a stable controller-model loop, which was
implemented with the filters F1(z) = 1 and F2(z) = 0.3z/(z - 0.7).
Several weeks after the model identification, the performance of the control system was
tested in real-time experiments. During this time the properties of the process had changed
due to variations of the inlet flow temperature and the room temperature, which caused a
significant modelling error. The performance of the control system is shown in Fig. 6.9.
The step responses show that in spite of the mismatch between the process and the model,
the overshoot is very small and the response is fast. Moreover, the control is insensitive
to noise, mainly because of the selected IMC filter. Due to the physical limitations of the
actuator and the use of the control penalty ρ, the response defined by E(q^-1) cannot be
achieved; it is not even the main target. The effect of these limitations can also be seen by
comparing the behaviour of the model with the signal y* (Fig. 6.9). The behaviour of the
controller-model loop shows that the controller satisfies the requirements defined by the
cost function.
[Fig. 6.9 panels: setpoint and measurement; target y*, prediction ŷ and error y* - ŷ [C];
control signal [%].]
The performance of the control system was also tested with an unmeasured step
disturbance. Disturbances were generated by changing the inlet flow according to the
following sequence:
q_in(t) =
  1.0 l/min,  0 s <= t < 80 s
  1.5 l/min,  80 s <= t < 300 s
  0.5 l/min,  300 s <= t < 1100 s
  1.0 l/min,  1100 s <= t <= 1200 s
The nominal inlet flow is 1.0 l/min. From the behaviour of the prediction error (Fig. 6.10)
after the flow change, it can be seen that the load disturbances affect the measurement
through slow dynamics. Since the load disturbances have slow internal dynamics and the
IMC incorporates disturbances through prediction error feedback, the resulting flow
change rejection is sluggish. Better disturbance compensation could be achieved by
paying more attention to the selection of the filter F2. Overall, the IMC controller achieves
good performance and the control is robust in the presence of significant modelling errors
and disturbances.
[Fig. 6.10 panels: setpoint and measurement, with the inlet flow changes marked;
control signal [%].]
Fig. 6.10. The performance of the control system during the load disturbance test.
Chapter 7 Control of Air Heating Process
The second case study considers the identification and control of a small laboratory air
heating process (Fig. 7.1). The air is fanned through a 30 cm plastic tube and heated with
a thyristor controlled heating element. The voltage to the thyristor circuit is the actual
control signal u(t). The task of the control system is to keep the temperature y(t),
measured near the other end of the tube, at the desired setpoint (20...80 C). The fan speed
can be set manually with another thyristor circuit. This alters the air flow in the tube and
thus also the delay and the dynamical behaviour of the process. The identification
experiments were made with the fan speed 100%, and the 50% fan speed was used to test
the robustness of the control system.
The process is only mildly nonlinear, mainly due to the thyristor circuit (hysteresis and
nonlinearity). Also the delay of 1...2 seconds and the heat transfer through the tube wall
must be taken into account. The process is continuously used in control education and it
is considered rather difficult to model, but relatively easy to control. The main interest in
this case study is thus in identification, not in control.
The identification results, especially the NOE case, show that this simple process has
complex behaviour and properties which require the advanced methods presented in
Section 3.4. On the other hand, identification for predictive control was straightforward
and the actual control of this process was a moderately easy task.
[Fig. 7.1: process schematic; control signal u(t), measured temperature y(t).]
7.1 Identification
The goal of the identification was twofold:
The open loop process was first driven with a ramp type control signal, changing
3 %/min, two times up and down over the whole operation region, resulting in a data set
of 2900 samples with T = 1 s (Fig. 7.2). As can be seen, either very slow dynamics or
hysteresis is present. The steady state curve is also clearly different when the fan speed is
altered.
During the actual identification experiment, the process was driven with a PRBS + offset
type signal, resulting in 3300 samples (Fig. 7.3), which were divided into two parts: the
first 2500 samples for an identification set and the remaining 800 samples for a test set.
The process indeed has a very slow mode, and hysteresis is also present. The slow
dynamics are due to the heat transfer through the tube wall: although plastic has a small
heat transfer coefficient, it also has a significant heat capacity. The hysteresis is due to the
thyristor circuit. Some nonlinearity is also assumed to be present, based on a nonlinear
characteristic curve for the thyristor circuit measured several years ago during the
installation of the process. The effect of this nonlinearity is however small.
[Fig. 7.2 panels: a) temperature [C] versus control signal [%], fan speed 100%;
b) temperature [C] versus control signal [%], fan speed 50%; the ramp start points are
marked.]
Fig. 7.2. The steady state characteristic curve of the open loop process, measured as
a slow ramp response. y(t) plotted as a function of u(t), a) fan speed 100%,
b) fan speed 50%.
A series of linear ARX models were first identified to determine the delay d and the model
orders, resulting in the delay d = 2. The determination of the model orders n_a and n_b
was not so straightforward, due to the slow dynamics.
The NARX predictors were identified next, but for clarity of presentation, the NOE case
is considered first. Due to the number of different model structures, abbreviations like
NOE(n_f, n_b, d) are used as a shorthand notation, i.e. to denote

ŷ(t+1) = f(φ(t+1), θ)

φ(t+1) = [ŷ(t) ... ŷ(t-n_f+1), u(t-d+1) ... u(t-d-n_b+2)]^T    (7.1)
[Fig. 7.3 panels: temperature [C]; control signal [%].]
Fig. 7.3. The measurement and the control signal, when driving the process with a
PRBS + offset type signal (fan speed 100%).
The NOE(4, 2, 2) model was identified with the LM method. Due to the slow dynamical
behaviour, the instability problems were not surprising, and the model was reidentified
with the constraint (3.89), i.e. by minimizing

V_N(θ) = (1/2) Σ_{t=1}^{N} [ (y(t) - ŷ(t))^2 + v(t, θ)^2 ]    (7.2)

with

v(t, θ) = γ(ν),  if ν = Σ_{i=1}^{n_f} A_i(t) and |roots(A(q))| >= 1
v(t, θ) = 0,  otherwise
A pragmatic selection γ = 1.7 resulted in a largest spectral radius of 1.003 and a largest
v(t) of 1.701. It is noticeable that the resulting constrained cost function value is smaller
than the unconstrained one, i.e. the mild constraint helped the identification. The resulting
cost functions are compared in Table 7.1. The results seem to justify the use of a nonlinear
model, the difference to the linear models being a decade. The behaviour of the identified
NOE(4, 2, 2) model with the identification set is presented in Fig. 7.4. As can be seen, the
prediction error is small.
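The root condition of (7.2) can be checked at every sample from the linearized AR polynomial; a sketch assuming the coefficient convention A(q) = 1 - a1 q^-1 - ... - a_nf q^-nf and a simple proportional penalty (both are assumptions, not the exact thesis form):

```python
import numpy as np

def ar_spectral_radius(a):
    """Largest root magnitude of A(q) = 1 - a[0]*q^-1 - ... - a[n-1]*q^-n,
    i.e. of the monic polynomial z^n - a[0]*z^(n-1) - ... - a[n-1]."""
    roots = np.roots(np.concatenate(([1.0], -np.asarray(a, dtype=float))))
    return float(np.max(np.abs(roots)))

def stability_violation(a, gamma=1.7):
    """Penalty term in the spirit of v(t) in (7.2): active only when the
    linearized model is locally unstable."""
    r = ar_spectral_radius(a)
    return gamma * r if r >= 1.0 else 0.0
```

Since the penalty vanishes for a stable linearization, a mildly unstable intermediate iterate is nudged back without distorting the fit elsewhere, which is consistent with the constrained cost ending up smaller than the unconstrained one.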
[Fig. 7.4: measurement and prediction versus time [s].]
Fig. 7.4. The behaviour of the NOE(4, 2, 2) model with (a part of) the identification set.
The poles and the zero of the linearized model (3.56) are presented in Fig. 7.5, computed
at every 10th sample corresponding to Fig. 7.4. The model has a constant pair of complex
poles, two poles near +1 and one zero near +1. The result of the identification indicates
that IMA noise type behaviour is present. Another conclusion can also be made: the poles
and the zero do not vary much, i.e. the dynamics of the nonlinear model seem to be almost
constant, although the linearized model cannot be straightforwardly analysed this way, as
discussed in Section 3.4.
The physical explanation for the slow dynamics seems anyway to be relevant. However,
the use of the ramp responses as a validation set showed a significant modelling error. It
is somewhat questionable whether a deterministic part was obtained or a trend was
identified; it is well known that reliable identification of slow dynamics is difficult, even
in the deterministic linear case.
[Fig. 7.5 panels: poles (imaginary versus real part); zero magnitude versus sample*10.]
Fig. 7.5. The poles and the zero of the linearized model, plotted for every 10th sample.
[Fig. 7.6 panels: temperature versus time [s]; temperature versus control signal [%].]
After the control experiments (Section 7.2) the NOE(4, 2, 2) model was refined using all
available data for the identification, i.e. combining the identification, test and ramp sets.
The idea was to see whether a deterministic nonlinear model could represent both the
slow and the fast dynamics. The existing hysteresis was a problem, because such an identifica-
tion requires much more experimental data than was available. The stability constraints
were again a necessity, applied now with the BFGS approach. The resulting behaviour is
presented in Fig. 7.6. The fit is quite good, especially for the ramp test, although the residuals
with the original identification set were larger than those in Fig. 7.4. In any
case, the resulting model is very useful for simulation purposes.
The identification results presented above do not have much to do with the models aimed
at predictive control. For this purpose ARX(2, 2, 2), ARX(4, 2, 2) and NARX(2, 2, 2) predic-
tors were identified using the original 2500 samples as the identification set. The NARX
predictor was implemented with an MLP network.
The resulting models are compared in Table 7.2. The difference between the linear and
the nonlinear predictors is quite small, indicating that adequate control performance with
the predictive control can be obtained also with linear models.
Section 7.1 Identification 127
If the slow dynamics is considered as an external disturbance there are several possibili-
ties to incorporate the IMA noise assumption. The NOE-IMA( 2, 2, 2 ) was applied here i.e.
the predictor
ŷ(t + 1) = f(φ(t + 1), θ)
φ(t + 1) = [ ŷ(t)  ŷ(t − 1)  u(t − 1)  u(t − 2) ]ᵀ
ẑ(t + 1) = y(t) − ŷ(t)                                                                  (7.3)
ỹ(t + 1) = ŷ(t + 1) + ẑ(t + 1)
was identified, using the same network structure as the NARX(2, 2, 2). This gave the smallest
identification cost function of all the experiments (Table 7.3). The linear OE-IMA also per-
formed better than the ARX predictors.
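The NOE-IMA prediction step of (7.3) can be sketched as follows, with a plain callable standing in for the MLP f; the function and variable names are illustrative assumptions, not the thesis implementation:

```python
import numpy as np

def noe_ima_predict(f, y_meas, y_hat, u):
    """One-step NOE-IMA prediction in the spirit of (7.3).
    f      : the NOE part (an MLP in the thesis; any callable here)
    y_meas : measured outputs [..., y(t-1), y(t)]
    y_hat  : NOE-part outputs [..., yhat(t-1), yhat(t)]
    u      : inputs           [..., u(t-2), u(t-1), u(t)]
    Returns (yhat(t+1), ytilde(t+1))."""
    # regressor [yhat(t), yhat(t-1), u(t-1), u(t-2)]
    phi = np.array([y_hat[-1], y_hat[-2], u[-2], u[-3]])
    y_next = f(phi)                   # NOE part yhat(t+1)
    z_next = y_meas[-1] - y_hat[-1]   # trend estimate z(t+1) = y(t) - yhat(t)
    return y_next, y_next + z_next    # ytilde(t+1) = yhat(t+1) + z(t+1)

# toy linear "network": when the NOE part matches the measurement exactly,
# the trend correction vanishes
f = lambda phi: 0.5 * phi[0] + 0.1 * phi[2]
yn, yt = noe_ima_predict(f, [1.0, 2.0], [1.0, 2.0], [0.0, 1.0, 1.0])
print(yn == yt)   # True
```

The trend term ẑ simply carries the difference between the measurement and the NOE part forward, which is what makes the predictor follow an IMA type disturbance.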
The behaviour of the NOE-IMA(2, 2, 2) predictor is presented in Fig. 7.7. The predicted
trend ẑ(t) is clearly due to the slow dynamics, plus some undetermined part due to the
hysteresis. It is also obvious that ẑ(t) correlates with the input, but the response is very
slow. Note also the steady state behaviour of the resulting NOE part ŷ(t): the gain is
clearly linear.
Fig. 7.7. The behaviour of the NOE-IMA(2, 2, 2) predictor with the identification set:
the measurement y(t) and the predictions ŷ(t), ỹ(t) (upper); the prediction error and
the predicted trend ẑ(t) [°C] (lower).
As can be seen from Fig. 7.2 and Fig. 7.3, the maximum desired temperature of 80 °C was
achieved with u(t) = 50 %, i.e. the identification data did not include any samples where
u(t) > 50 %. This might cause problems during the control if a nonlinear model is
applied. Indeed this happened during the first control experiments: if the control signal
even temporarily rose to 80 % or more, it stayed there, because a local minimum of the LRPC
cost function had been reached. The control signal should be limited below 50 %. In order
to study the input gradient projection this limit was not applied, except as a test.
For example, Fig. 7.8 presents the steady state characteristic curve of the NOE part of the
identified NOE-IMA(2, 2, 2) predictor and the corresponding sum of the input gradi-
ents, i.e.
    Σ_{i = 2}^{3}  ∂ŷ(t) / ∂u(t − i)                                                    (7.4)
The identified area corresponds to u(t) ∈ [0, 50] %. Outside that area the model extrapo-
lates, and it extrapolates in an inconvenient way: the characteristic curve turns back. This
can also be seen from the sum of the input gradients, which crosses the zero line approxi-
mately at the point where the characteristic curve turns back.
Fig. 7.8. a) The steady state characteristic curve of the NOE part of the identified NOE-
IMA(2, 2, 2) predictor, b) the corresponding sum of the input gradients, see (7.4). Both
plotted against the control signal [%].
The NOE-IMA( 2, 2, 2 ) predictor was refined using the penalty (3.97) i.e.
             { σ(v),  if v = Σ_{i = 2}^{3} ∂ŷ(t) / ∂u(t − i) < ε
v(t, θ) =    {                                                                          (7.5)
             { 0,     otherwise
The limit ε = 0,02 was selected interactively by monitoring the simulated steady state
curve and the corresponding sum of the gradients. The performance in the trained area
decreased only slightly, from MSSE 0,0032 to 0,0031 (Table 7.3). The steady state char-
acteristic curve is also presented in Fig. 7.8. The problems with the non-monotonic charac-
teristic curve have disappeared.
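The gradient sum (7.4) and the penalty (7.5) can be sketched as follows. In the thesis the gradients are computed analytically from the network; the finite-difference version below, the quadratic penalty form and all names are illustrative assumptions:

```python
import numpy as np

def input_gradient_sum(f, phi, u_idx=(2, 3), h=1e-6):
    """Finite-difference estimate of the input gradient sum (7.4) for a
    predictor f and regressor phi; u_idx marks the input entries of phi."""
    total = 0.0
    for k in u_idx:
        dphi = phi.copy()
        dphi[k] += h
        total += (f(dphi) - f(phi)) / h
    return total

def gradient_penalty(f, phi, eps=0.02, weight=10.0):
    """Penalty in the spirit of (7.5): active when the gradient sum falls
    below the limit eps (the quadratic form is an assumption)."""
    g = input_gradient_sum(f, phi)
    return weight * (eps - g) ** 2 if g < eps else 0.0

# a predictor with a clearly positive input gain is not penalized
f = lambda p: 0.5 * p[0] + 0.2 * p[2] + 0.1 * p[3]
print(gradient_penalty(f, np.array([1.0, 0.5, 0.2, 0.1])))   # 0.0
```

Monitoring such a gradient sum over the operating range is exactly the interactive check described above for selecting ε.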
The identified ARX(2, 2, 2) and NARX(2, 2, 2) predictors were applied in the LRPC
approach. The cost function (4.8)

V(t, u) = (1/2) Σ_{i = N1}^{N2} [ ( E(q⁻¹)y*(t + i) − E(q⁻¹)ŷ(t + i) )² + λ Δu(t + i − N1)² ]        (7.6)
was minimized with Brent's method. After some trials, the design parameters were selected
as N1 = 2, N2 = 5, Nu = 1, E(q⁻¹) = 1 − 0,85q⁻¹, C(q⁻¹) = 1 − 0,7q⁻¹, and
λ = 10⁻⁴. Also the limit Δu_max = 25 % was used in all experiments. The correspond-
ing control system performance is presented in Fig. 7.10 and in Fig. 7.11.
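With Nu = 1 the cost (7.6) is a scalar function of the single control move, which is why a one-dimensional search such as Brent's method suffices. The sketch below is illustrative only: a first-order linear model stands in for the NARX predictor, the E(q⁻¹) filtering is reduced to its static gain, and a simpler golden-section search replaces Brent's method:

```python
import numpy as np

def lrpc_cost(du, y, u_prev, a, b, e=0.85, lam=1e-4, ysp=1.0, N2=5):
    """One-dimensional LRPC cost in the spirit of (7.6) with Nu = 1: a single
    control move du is held over the horizon. The model y(t+1) = a*y(t) + b*u(t)
    is a stand-in for the NARX predictor."""
    u = u_prev + du
    cost, yk = 0.0, y
    for _ in range(N2):
        yk = a * yk + b * u
        cost += 0.5 * ((1.0 - e) * (ysp - yk)) ** 2   # static gain of E(q^-1)
    return cost + 0.5 * lam * du ** 2

def golden_min(f, lo, hi, tol=1e-6):
    """Scalar minimization by golden-section search; Brent's method is used
    in the thesis, this simpler routine plays the same role here."""
    g = (np.sqrt(5.0) - 1.0) / 2.0
    c, d = hi - g * (hi - lo), lo + g * (hi - lo)
    while hi - lo > tol:
        if f(c) < f(d):
            hi, d = d, c
            c = hi - g * (hi - lo)
        else:
            lo, c = c, d
            d = lo + g * (hi - lo)
    return 0.5 * (lo + hi)

# minimize the cost over the move, limited to +/- 25 % as in the experiments
du = golden_min(lambda v: lrpc_cost(v, y=0.2, u_prev=0.3, a=0.9, b=0.5),
                -0.25, 0.25)
print(-0.25 <= du <= 0.25)   # True
```

Bounding the search interval to ±Δu_max also enforces the move limit directly, without a separate constraint handler.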
(Figure: measured temperature and control signal [%] vs. time [s]; note the autotuner
action in the control signal.)
Fig. 7.10. The control performance of the LRPC approach with the ARX predictor.
Fig. 7.11. The control performance of the LRPC approach with the NARX predictor.
It is hard to distinguish between the linear and the nonlinear control performance, except
by looking at the control signals. This supports the conclusion made during the identification
that the process is almost linear. The NARX predictor was not constrained; the input gra-
dients were just checked to see whether they met the specifications. They did, but only
barely. This can also be seen from the ringing type of behaviour in Fig. 7.10. Different
prediction horizons N2 = 2...10 were tried to remove it. The ringing decreased slightly
when the prediction horizon increased, but on the whole almost identical control perform-
ance was achieved.
Comparison with the control performance of the PID controller demonstrates the effi-
ciency of the LRPC approach. The noise rejection is clearly better and the robustness to
the fan speed is also better. Altering the fan speed from 100 % to 50 % did not much affect
the overall performance.
The higher order OE(4, 2, 2) and NOE(4, 2, 2) models were next applied within the
LRPC-IMC approach, with the first order IMC filter F₂. Both failed: the control sys-
tem was stable and the desired setpoint was achieved, but the overall performance was
poor. This was expected based on the analysis of the identified models (Fig. 7.5).
The low order models OE(2, 2, 2) and NOE-IMA(2, 2, 2) were tried instead, also within
the LRPC-IMC approach. They performed better, especially the NOE-IMA, but failed to
achieve the desired steady state setpoint. The first order IMC filter could not compensate
the trend disturbance. The filter was redesigned assuming a ramp type disturbance and
using a tighter reference model E(q⁻¹) = 1 − 0,5q⁻¹, resulting in
F₂(z) = (2,6 − 3,7z⁻¹ + 1,2z⁻²) · 0,1 / (1 − 0,9z⁻¹)                                    (7.7)
The other design specifications were the same as before. The performance of this control sys-
tem is presented in Fig. 7.12, now with the NOE-IMA(2, 2, 2) model.
If the assumption of the NOE-IMA type disturbance is correct, then the NOE part of the NOE-
IMA(2, 2, 2) model could be applied as a NARX predictor, violating the noise model
assumption but assuming that the representation of f in (7.3) is correct. To test this, the
NOE-IMA(2, 2, 2) model was used as a NARX predictor within the LRPC approach. After
some trials, the design parameters were selected as N1 = 2, N2 = 5, Nu = 1,
E(q⁻¹) = 1 − 0,85q⁻¹, C(q⁻¹) = 1 − 0,7q⁻¹, and λ = 10⁻⁴, i.e. the same as in the origi-
nal NARX case. The control system performance is presented in Fig. 7.13.
The response in both experiments was faster than before, due to the tighter E(q⁻¹). A signif-
icant difference is the better noise rejection, although it cannot be clearly seen from the
figures. The control signal variation is smaller in Fig. 7.13.
Section 7.2 Control Experiments 133
Fig. 7.12. The control performance of the LRPC-IMC approach with the
NOE-IMA identified NOE predictor, second order IMC filter.
Fig. 7.13. The control performance of the LRPC approach with the
NOE-IMA identified model as NARX predictor.
The results of the identification and control experiments are somewhat surprising. If the
slow dynamics is considered as an external disturbance, the remaining behaviour is
almost linear, except for some hysteresis. This can be seen by comparing the identification
cost functions of the ARX and NARX predictors, by analysing the NOE and NOE-IMA mod-
els, and of course from the control experiments, because almost identical control perform-
ance was obtained using either a linear or a nonlinear model. The robustness of the LRPC also
had some effect on this.
The full nonlinear modelling, identification and control approach was applied success-
fully, resulting in reliable models and good control performance. In this sense it is
rather irrelevant that the process turned out to be almost linear. Still, this is perhaps the most
obscure way to design a linear controller.
The IMA noise assumption within the LRPC approach performs well with this type of dis-
turbance; in fact this is just what the LRPC is supposed to do, see Section 4.2. The NOE-
IMA model, applied as a NARX predictor, gave the best overall performance when the
control signal variations are also taken into account. The practical difficulties were in the iden-
tification of such nonlinear predictors. The proposed identification scheme seems to work
in simulations and in practice. The convergence properties etc. have not been thoroughly ana-
lysed and the approach is still at an experimental stage.
A significant aspect is also that this small and simple process incorporates complex
behaviour, which made the identification difficult. The developed constrained identifica-
tion scheme was clearly useful for obtaining reliable models.
8 Control of Pilot Headbox
This pilot process simulates some basic features of a real paper machine headbox. The
pilot process is a real process, not a computer simulation model. It has been extensively
used as a test process for both modelling and control purposes in the Control Engineering
Laboratory. The main task of a real headbox is to distribute stock across the wire through
a slice into a 10 millimeter thick and several meters wide jet. Stock is a mixture of fibres
and water, the amount of fibres being only about 1 %, depending on the paper grade. Here
pure water is used instead.
The pilot headbox process is presented in Fig. 8.1. The water is pumped from the lower
storage tank to the upper tank (the headbox). The speed of the pump (WP) is controlled
with a thyristor circuit.
Fig. 8.1. The pilot headbox process (WP water pump, AC air compressor, WV water
valve, AV air valve, SC speed controllers; measurements: L level, P pressure).
The overpressure in the tank is maintained by blowing air to the
headbox with a thyristor controlled compressor (AC). The slice is a combination of a pipe
and a valve (WV), and the water jets through it into the open air. The process
characteristics can be set with the water valve (WV) and the air valve (AV). The main task
of the pilot headbox is to simulate the dynamical behaviour of the pressure and the water
level w.r.t. the controls (water pump and compressor) in a real headbox.
The water level (0...0,6 m) and the pressure at the water outlet point (0...1 bar) are the
controlled variables (measurements L and P). The pressure is the sum of the water hydrostatic
pressure and the air pressure. Both measurements are scaled between 0...1. The actual control
variables are the voltages (1...5 V) to the speed controllers (SC). The control task is to
keep the pressure and the level at the desired setpoints.
The effect of the water pump on both measurements is much stronger than the effect of the
compressor, which makes the control difficult. As is typical of tank processes, the water
level includes an integrating feature (not a pure integrator). Without a controller the meas-
urements slowly drift away from the setpoints, i.e. the system has a saddle type equilib-
rium point. This is also a serious difficulty if one wants to identify a deterministic model
of the process using open loop experiments. The system is not heavily nonlinear, but the
process dynamics vary depending on the operation point and on the valve settings (WV
and AV).
The compressor directly affects the air pressure and thus the overall measured pressure.
The air pressure in turn affects the level, but there is no direct relationship between the
level and the compressor speed (see Fig. 8.2), a fact which can also be derived from
basic physical principles.
Fig. 8.2. Basic relationships between the measurements and the controls.
8.1 Identification
The goal is to control both the pressure and the level in a wide range of operation points.
The main purpose of the identification is twofold: to obtain a predictor for the predictive
control and to obtain a deterministic model for simulation and control design purposes.
The former was relatively straightforward and easy, but the latter caused problems,
mainly due to the integrating feature, which made the identification experiments difficult.
Only after applying a coarse controller (a PI controller + a relay with hysteresis) were good
quality data obtained, resulting in two data sets (2500 samples and 4700 samples with
T = 1 s). The outputs of the PI controllers were transformed into a coarsely quantified
offset plus a pulse width modulated signal between high and low with a period of
15 seconds. The identification set was formed by picking several intervals of at least 500
seconds from both data sets, the rest remaining for a test set. The variables were ordered
with the pressure and the level as the outputs y₁ and y₂, and the water pump and
compressor speeds as the inputs u₁ and u₂.
The model orders and delays were determined by fitting a series of linear ARX models
and selecting the best. The same model orders are used in the data vector of the nonlinear
model. The predictive NARX model is identified separately for both measurements. The
level predictor is identified according to Fig. 8.2, i.e. the air compressor speed is not used.
The resulting two separate neural networks are combined into one sparse network, i.e. the
predictors
ŷ₁(t + 1) = f₁(φ₁(t + 1), θ₁)
ŷ₂(t + 1) = f₂(φ₂(t + 1), θ₂)
φ₁(t + 1) = [ yᵀ(t), yᵀ(t − 1), uᵀ(t), uᵀ(t − 1) ]ᵀ                                     (8.1)
φ₂(t + 1) = [ yᵀ(t), yᵀ(t − 1), u₁(t), u₁(t − 1) ]ᵀ

are combined to

ŷ(t + 1) = f(φ₁(t + 1), θ)                                                              (8.2)
Because the NARX predictor is intended to be used with the LRPC approach, a multistep
predictor was also identified (10-multistep-ahead NARX), i.e. according to the cost function

V_LRPI = (1/2) Σ_{t = 1}^{N} Σ_{j = 1}^{10} ( y(t) − ŷ(t | t − j) )²                     (8.3)

and with the same network structure as the plain NARX.
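Evaluating (8.3) means iterating the one-step predictor j steps on its own outputs for each data point. A minimal sketch, with a plain callable standing in for the MLP and illustrative names:

```python
import numpy as np

def lrpi_cost(f, y, u, N, J=10):
    """Multistep identification cost in the spirit of (8.3): the j-step-ahead
    prediction yhat(t|t-j) is formed by iterating the one-step predictor f
    on its own outputs, starting from the measurement y(t-j)."""
    cost = 0.0
    for t in range(J, N + 1):
        for j in range(1, J + 1):
            yk = y[t - j]                    # start from a measurement
            for k in range(j):
                yk = f(yk, u[t - j + k])     # feed predictions back in
            cost += 0.5 * (y[t] - yk) ** 2   # j-step-ahead error at time t
    return cost

# a perfect one-step model gives exactly zero multistep cost
f_true = lambda yp, up: 0.8 * yp + 0.2 * up
u = np.ones(30)
y = np.zeros(31)
for t in range(30):
    y[t + 1] = f_true(y[t], u[t])
print(lrpi_cost(f_true, y, u, N=30))   # 0.0
```

Because predictions are fed back in, the gradient of this cost with respect to the weights also propagates through the prediction chain, which is what distinguishes multistep identification from plain one-step NARX fitting.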
A deterministic NOE model was also identified using two separate predictors, with the
other measurement as a measured disturbance, i.e.

ŷ₁(t + 1) = f₁(φ₁(t + 1), θ₁)
ŷ₂(t + 1) = f₂(φ₂(t + 1), θ₂)
φ₁(t + 1) = [ ŷ₁(t), y₂(t), ŷ₁(t − 1), y₂(t − 1), uᵀ(t), uᵀ(t − 1) ]ᵀ                    (8.4)
φ₂(t + 1) = [ y₁(t), ŷ₂(t), y₁(t − 1), ŷ₂(t − 1), u₁(t), u₁(t − 1) ]ᵀ
The neural network structure and size are the same as in the NARX case. The resulting models
were combined into one sparse network, i.e. into the predictor

ŷ(t + 1) = f(φ(t + 1), θ)
φ(t + 1) = [ ŷᵀ(t), ŷᵀ(t − 1), uᵀ(t), uᵀ(t − 1) ]ᵀ                                       (8.5)

which was re-identified in true MIMO-NOE style (only 30 iterations were needed).
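Combining two separately identified predictor networks into one sparse MIMO network amounts to stacking their weight matrices block-wise, so that each group of hidden units still sees only its own regressor. A minimal sketch of the first-layer combination; shapes and names are illustrative, and the two regressors are assumed disjoint here (in the thesis they share elements, which only changes the masking pattern):

```python
import numpy as np

def combine_sparse(W1, W2):
    """Combine the first-layer weight matrices of two separately identified
    predictor networks into one block-structured (sparse) network."""
    h1, n1 = W1.shape
    h2, n2 = W2.shape
    W = np.zeros((h1 + h2, n1 + n2))
    W[:h1, :n1] = W1    # hidden units of predictor 1 see only regressor 1
    W[h1:, n1:] = W2    # hidden units of predictor 2 see only regressor 2
    return W

W = combine_sparse(np.ones((3, 6)), 2 * np.ones((4, 6)))
print(W.shape)            # (7, 12)
print(W[:3, 6:].sum())    # 0.0 -- the cross blocks stay zero
```

The zero cross blocks can either be kept frozen or released during the subsequent MIMO re-identification, depending on whether cross couplings are to be learned.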
The stability and gradient limits of all the resulting predictors were checked, but no projec-
tion was needed. This is due to the good quality of the identification data. In particular, the
gradient projection can be seen as a way to incorporate a priori knowledge to compensate
for a lack of data.
The cost function values of the resulting models are presented in Table 8.1. The difference
between the linear OE and the NOE model is remarkable, indicating that the true process is non-
linear. However, when comparing the linear ARX and NARX models the difference is
quite small, indicating that linear predictive control might give adequate performance.
As mentioned before, the NOE model is needed mainly for simulation purposes. A sig-
nificant aspect is that the amount of data (different operation points) and the computa-
tional effort needed for the NOE model were a decade greater than for the NARX model. The
resulting NOE model was tested by simulating the data set of 4700 points. Fig. 8.3 shows
three 300 second samples of the simulation, corresponding to the best and the worst case
situations. The model is capable of predicting the whole data set without significant error.
This is remarkable, because the integrator in the level also causes an accumulating error.
The identified NOE model indeed revealed some basic features of the process. This does
not help much, as will be seen in the control experiments, because even a slight change in
the valve settings causes a cumulative prediction error.
Anyway, the NOE model can be efficiently used for simulation purposes. An example
clarifies the quality of the resulting NOE model. The simulation model and the pilot proc-
ess were controlled with the same predictive controller and with the same tuning param-
eters. As can be seen from Fig. 8.4, the simulation model reproduces the dynamical
behaviour of the true process up to small details; only minor steady state errors in the control
signals can be seen.
Fig. 8.3. The behaviour of the identified NOE model with the identification / test set
of 4700 seconds (level: measurement, prediction and norm of the prediction error [0..1];
water pump and air compressor signals [%]). Case 1: 0...300 s, case 2: 1100...1400 s
and case 3: 4000...4300 s.
Fig. 8.4. The behaviour of the identified NOE model and the true process,
when both are controlled with the same predictive controller with the same
tuning parameters.
The autotuner option was used to tune the controllers. The desired response was selected
to be fast, in fact too tight, as seen from Fig. 8.5. Note the temporal instability in the pres-
sure measurement near time 800 seconds. This process is used extensively in control edu-
cation, and performance similar to that shown in Fig. 8.5 is seldom achieved using a linear
model as a basis for PI controller design, or by tuning the controllers heuristically.
Fig. 8.5. (Level and pressure measurements and the water / air control signals [%]
vs. time [s]; the temporal instability is visible in the pressure.)
The identified 10-multistep-ahead NARX predictor was applied with the LRPC approach,
and the cost function (4.8)

V(t, u) = (1/2) Σ_{i = N1}^{N2} [ ‖E(q⁻¹)y*(t + i) − E(q⁻¹)ŷ(t + i)‖²_W + ‖Δu(t + i − N1)‖²_Λ ]        (8.6)
was minimized using CFSQP/FSQP. After some trials, the design parameters were selected
as N1 = 1, N2 = 5, Nu = 1, W = I, Λ = 10⁻³ I and

E(q⁻¹) = I − diag{ 0.7, 0.9 } q⁻¹
C(q⁻¹) = I − diag{ 0.7, 0.7 } q⁻¹
Fig. 8.6. The control performance of the LRPC approach with the NARX predictor.
where the C polynomial is used to filter the prediction error according to (4.15). Also the
limit Δu_max = 25 % for both control signals was used in all experiments.
This control system shows excellent performance, with only minor interactions, which are
due to the high and low limits of both control signals. The goal of the control design was
to remove these interactions as well. The LRPC could compensate them quite well as long
as the control signals were inside their limits. To achieve this, the setpoint was filtered with

F₁(z) = diag{ 0,1z / (z − 0,9),  0,05z / (z − 0,95) }

to remove the rest of the interactions. The resulting behaviour of the control system is pre-
sented in Fig. 8.6. Hardly any interactions can be seen, and the noise rejection is also good.
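Each diagonal entry of the setpoint filter F₁ is a first-order low-pass filter with unity steady state gain (b = 1 − a for an entry bz/(z − a)). A minimal sketch of one channel, with illustrative names:

```python
def f1_filter(r, a, b):
    """One diagonal channel of the setpoint filter F1(z) = b*z / (z - a),
    i.e. the difference equation y(k) = a*y(k-1) + b*r(k). With b = 1 - a
    the steady state gain is b / (1 - a) = 1, as for the entries
    0,1z/(z - 0,9) and 0,05z/(z - 0,95) used in the experiments."""
    y, out = 0.0, []
    for rk in r:
        y = a * y + b * rk
        out.append(y)
    return out

# a unit setpoint step through the slower channel converges to 1
y = f1_filter([1.0] * 200, a=0.95, b=0.05)
print(abs(y[-1] - 1.0) < 1e-3)   # True
```

Because the steady state gain is one, the filter slows the setpoint changes down (reducing the transient interactions) without introducing any steady state offset.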
Using the linear ARX model instead of the NARX yields almost comparable control behav-
iour. This can be reasoned by comparing the cost function values of the identified ARX and
NARX predictors (Table 8.1). However, the difference between the cost function values
of the OE and NOE predictors is significant. One explanation is that the process is nonlin-
ear only in some operation regions, such as at higher pressures. Another reason for the good
performance with linear models is obviously the LRPC approach itself.
Although the NOE model was not initially identified for control purposes, it was tested
anyway, first with the same settings for E, W and F₁ as in the NARX case and with
the IMC filter

F₂(z) = diag{ 0,3z / (z − 0,7),  0,3z / (z − 0,7) }
Due to the integrator and especially due to the accumulating model mismatch, the steady
state setpoint was not achieved. The IMC filter was redesigned by assuming a ramp type
load disturbance and assuming the system to be controlled to be
z⁻¹E(z) = diag{ 0,3z⁻¹ / (1 − 0,7z⁻¹),  0,1z⁻¹ / (1 − 0,9z⁻¹) }

resulting in

F₂(z) = diag{ (5,33 − 8,06z⁻¹ + 3,03z⁻²) · 0,3 / (1 − 0,7z⁻¹),
              (3,8 − 6,22z⁻¹ + 2,49z⁻²) · 0,07 / (1 − 0,93z⁻¹) }
Section 8.2 Control Experiments 145
The performance of this control system is presented in Fig. 8.7. The performance is sim-
ilar to that of the NARX, but at the price of heavier controller output variations due to the
higher order IMC filter. The valve settings were slightly different from those of the
identification experiments, and the prediction error of the level accumulates (40 % of the
whole operation range after 1400 seconds). On the other hand, the control system shows
remarkable robustness to the model mismatch. The small ticks are communication fail-
ures, not real process disturbances.
It is clear that the IMC approach cannot be straightforwardly applied to this type of proc-
ess: sooner or later the model drifts out of the trained region. This is a serious limitation
if one wishes to develop a dual network controller for the process. A separate NARX style
state estimator must be identified or developed otherwise.
Fig. 8.7. The control system performance of the LRPC based IMC, i.e. with a NOE
model. The small ticks are communication failures, not real process disturbances.
The NARX-LRPC approach (Fig. 8.6) was also tested for model mismatch by changing the
valve settings from their nominal values. The behaviour of the control system is presented
in Fig. 8.8. The process is very sensitive to the air valve position, and a 10° change must
be considered large; the water valve position is not so critical. This can also be seen
from Fig. 8.8. Changing the air valve from 90° to 80° and dropping the pressure setpoint
brings the level control near the instability border at 400 seconds.
As a conclusion of this case study, a couple of remarks can be made. Due to the inte-
grator and the accumulating prediction error, the NARX predictor is superior to the NOE pre-
dictor when considering predictive control. The NOE model can still be efficiently applied
for simulation and control design purposes. The LRPC approach shows excellent perform-
ance and good robustness with both the linear ARX and the NARX predictors.
Fig. 8.8. The control system performance when changing the process from its nominal
state; the NARX based LRPC approach. The air and water valve settings were changed
stepwise during the experiment.
9 Conclusion
A practical approach to model based neural network control is presented in this study. The
efficiency of the approach has been verified with simulation studies and real time exper-
iments, yielding good control characteristics and robust control. The general applicability
of the proposed approach seems clear.
The goal of this study was to develop efficient identification and control design methods
for neural network based nonlinear control, and to implement them in a real world envi-
ronment. The study thus also helps to fill the gap between theory and practice. Practice is,
after all, the final measure of any control method, and this research area is of great importance.
The multilayer perceptron neural network is selected as the basic block for representing
nonlinear process models and controllers. The main attractive feature of neural net-
work models is that they offer a parametric function which can accurately represent
virtually any smooth nonlinear function with the same basic structure. The results verified
this well-known fact.
Advanced identification and control design methods are presented and applied. The
PEM approach combined with efficient optimization methods results in reliable models
and controllers, clearly better than those obtained with conventional methods like error back-
propagation. A contribution is the incorporation of constraints within the model iden-
tification and within the controller design, to maintain stability or to include a priori
process knowledge. The latter also partly compensates for the amount of experimental
data needed for the identification.
The identified models are used for model predictive control. Both the direct long range pre-
dictive control approach (LRPC) and the dual network control approach are introduced. The
problems related to the model inverse based IMC design are partially avoided by employ-
ing a nonlinear optimal controller within the IMC structure. The nonlinear control law is
approximated by a perceptron network, and the cost function associated with the optimal
controller design is minimized numerically. Practical solutions for the model mismatch
are applied for both approaches. General guidelines and practical methods for maintain-
ing stability are also introduced and applied.
This model based control approach has been successfully applied to the control of labo-
ratory scale water and air heating processes and to the multivariate control of the pilot
headbox of a paper machine, using the neuro-control workstation. These processes are not
heavily nonlinear, but they incorporate many of the features which commonly make con-
trol difficult: delays, noise, deterministic disturbances, trends, time varying features etc.
Solving the problems associated with these in conjunction with neural networks is the main
contribution of the thesis, both from the theoretical and the practical point of view. Experi-
mental results indicate that both the direct LRPC and the dual network IMC structure pro-
vide robust performance and are clearly good alternatives for controlling nonlinear plants.
The direct LRPC approach with the NARX predictor seems to be an extremely efficient and
easily applicable control method, provided that the on-line computational load is acceptable.
In the linear case both ARX and OE models are commonly used for controller design accord-
ing to the certainty equivalence principle. The difference between NARX and NOE models
is conceptually much bigger than in the linear case. Only the NOE model, which represents
the deterministic part of the system, should be used in the design of an IMC like deterministic
controller.
The identification of a NOE model is often a formidable task in practice, as can be seen
from the case studies. The overall quality of the experimental data must be better. On the
other hand, the resulting NOE model is directly applicable as a simulation model for con-
trol design purposes, contrary to the NARX predictor.
As shown in simulation Example 5.3, the robustness of the IMC approach is good, but the
NARX-LRPC approach outperforms it. By paying more attention to the state estimator design,
comparable results should in theory be achievable. This feature is, however, built into the
NARX-LRPC approach.
The simulation studies and experimental results clearly point out that it makes little difference
whether the IMC approach is implemented within the LRPC or within the dual network approach;
the control system behaviour is similar. The capability of handling on-line constraints and
the possibility of on-line tuning favour the LRPC approach; the dual network approach
is worthwhile only when a small on-line computational load is required, as in high speed
applications.
The experiments with adaptive control worked as expected: all operation regions must
be visited once before adequate control performance is achieved. The adaptive
experiments should be considered mainly as demonstrations of the problems encoun-
tered with adaptive nonlinear control. Much research must still be done before real industrial
adaptive applications will be seen.
There seem to be two adaptive approaches which might prove useful in practice: inferen-
tial control and a nonlinear (localized) autotuner. The inferential control slowly adapts the
parameters of an additional nonlinear controller to the time varying process. The linear
autotuner (tuning by request) is commonly favoured in industry instead of linear adaptive
control. The nonlinear version would tune a nonlinear controller or identify a model
locally near the current operation point. This feature is difficult to implement with any
global nonlinear function approximator; a localized model must be used instead.
When considering real industrial applications, the proposed off-line identifiable LRPC
approach seems directly applicable, offering excellent performance but sometimes at the price
of a significant model identification task. On the other hand, a smaller identification task would
make the development of practical applications easier. The use of a coarser model would
probably reduce the control system performance only slightly; the LRPC approach is quite
robust, as seen from the presented experimental results.
A coarser model structure, suitable for representing the rather high dimensional nonlinear
functions normally required in time series models, obtained with minor identification
experiments and exploiting the often available steady state information, would be an ideal
practical solution. Combined with the proposed methods for maintaining stability and
controlling extrapolation, this would result in reliable models and controllers. The development
of coarser and simpler model structures which still maintain the attractive features of the
MLP neural network is clearly an important issue for future research.
References 151
References
[1] Alamir, M. and G. Bornard (1994): New sufficient conditions for global stability of
receding horizon control for discrete-time nonlinear systems. Advances in Model-
Based Predictive Control [20], Oxford University Press, pp. 173-181.
[2] Albus, J. S. (1975): A New Approach to Manipulator Control: The Cerebellar
Model Articulation Controller (CMAC). Trans. of the ASME. Journal of Dynamic
Systems, Measurement and Control. Sept. 1975, pp. 220-227.
[6] Bar-Kana, I., A. Guez (1989): Neuromorphic Adaptive Control. 28th IEEE Conference on Decision and Control, Tampa, Florida, Dec. 1989, pp. 1739-1743.
[7] Barto, A. G., R. S. Sutton and C. W. Anderson (1983): Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Trans. on Systems, Man and Cybernetics (SMC), Vol. 13, pp. 834-846.
[8] Bavarian, B. (1988): Introduction to Neural Networks for Intelligent Control. IEEE
Control Systems Magazine, April 1988, pp. 3-7.
[10] Billings, S. A. and Voon, W. S. F. (1986): Correlation based model validity tests for
non-linear models. Int. J. Control, Vol. 44, No. 1, pp. 235-244.
[11] Billings, S. A. and Zhu, Q. M. (1994): Nonlinear model validation using correlation tests. Int. J. Control, Vol. 60, No. 6, pp. 1107-1120.
[12] Bitmead, R. R., M. Gevers and V. Wertz (1990): Adaptive Optimal Control, The
Thinking Man's GPC. Prentice-Hall, 1990.
[13] Bhat, N. and T. McAvoy (1990): Use of Neural Nets for Dynamic Modelling and
Control of Chemical Process Systems. Computers Chem. Engng, Vol. 14 No. 4/5,
pp. 573-583.
[14] de Boor, C. (1978): A Practical Guide to Splines, Springer-Verlag, New York, 1978.
[15] te Braake, H., R. Babuska, E. van Can (1994): Fuzzy and neural models in predictive control. Journal A, Vol. 35, No. 3, pp. 44-51.
[16] Chen, S., S. A. Billings, P. M. Grant (1990): Non-linear system identification using
neural networks. Int. J. Control, vol 51, no. 6, pp. 1191-1214.
[17] Chen, S., C. Cowan et al. (1990): Parallel Recursive Prediction Error Algorithm for
Training Layered Neural Networks. Int. J. Control, vol 51, no. 6, pp. 1215-1228.
[18] Chen, S., C. Cowan, P. Grant (1991): Orthogonal Least Squares Learning Algo-
rithm for Radial Basis Function Networks. IEEE Trans. on Neural Networks, Vol. 2,
No. 2, March 1991, pp. 302-309.
[19] Chen, Q. and W. A. Weigand (1994): Dynamic Optimization of Nonlinear Proc-
esses by Combining Neural Net Model with UDMC. AIChE Journal, Sep. 1994,
Vol. 40, No. 9, pp. 1488-1497.
[22] Cooper D. J., L. Megan and R. F. Hinde Jr. (1992): Comparing Two Neural Net-
works for Pattern Based Adaptive Process Control. AIChE Journal, January 1992,
Vol. 38, No. 1, pp. 41-55.
[23] Demircioglu, H. and D. W. Clarke (1992): CGPC with guaranteed stability proper-
ties. IEE Proceedings-D, Vol. 139, No. 4, pp. 371-380.
[27] Ersü, E., H. Tolle (1988): Learning Control Structures with Neuron-Like Associative Memory Systems. In W. v. Seelen et al. (Eds.): Organization of Neural Networks. Structures and Models, Weinheim (FRG), 1988.
[28] Fletcher, R. (1990): Practical Methods of Optimization. John Wiley & Sons,
New York.
[29] Fliess, M. and Hazewinkel, M. (Eds.) (1986): Algebraic and Geometric Methods in
Nonlinear Control Theory, D. Reidel Publishing Company.
[30] Fox, D., V. Heinze et al. (1991): Learning by Error-Driven Decomposition. ICANN-
91, International Conference on Artificial Neural Networks. Espoo, Finland, June
24-28, 1991, pp. 207-212.
[31] Foxboro Instruction Book (1989): 761 Series single station micro plus controller,
MI-018-848, the Foxboro Company.
[32] Franklin, J. A. (1989): Historical Perspective and State of the Art in Connectionistic
Learning Control. Proc. of the 28th IEEE Conference on Decision and Control.
Tampa, Florida, Dec. 1989, pp. 1730-1736.
[33] Fu, K-S. (1970): Learning Control Systems - Review and Outlook. IEEE Trans. on
Automatic Control, April 1970, pp. 210-221.
[34] Garcia, C. E., D. M. Prett, M. Morari (1989): Model Predictive Control: Theory and
Practice - a Survey. Automatica, Vol. 25, No. 3, pp. 335-348.
[35] Goodwin, C. G., K. S. Sin (1984): Adaptive Filtering, Prediction and Control.
Prentice-Hall.
[36] Grabec, T. (1991): Modelling of Chaos by a Self-Organizing Neural Network.
ICANN-91, International Conference on Artificial Neural Networks. Espoo, Fin-
land, June 24-28, pp. 151-156.
[37] Gupta, M. M. and D. H. Rao (1993): Dynamic Neural Units with Applications to
the Control of Unknown Nonlinear Systems. Journal of Intelligent and Fuzzy Sys-
tems, Vol. 1(1), pp. 73-92.
[40] Hernandez, E., Y. Arkun (1992): Study of the control-relevant properties of back-
propagation neural network models of nonlinear dynamical systems. Computers
Chem. Engng, Vol. 16 No. 4, pp. 227-240.
[41] Hernandez, E., Y. Arkun (1993): Control of Nonlinear Systems Using Polynomial
ARMA Models. AIChE Journal, March 1993, Vol. 39, No. 3, pp. 446-460.
[42] Hertz, J., A. Krogh, R. G. Palmer (1991): Introduction to the theory of the neural
computation. Addison-Wesley.
[43] Holst, J. (1977): Adaptive Prediction and Recursive Estimation. Ph.D. Thesis, Lund
Institute of Technology, Dept. of Automatic Control, Lund.
[44] Holzman, J. M. (1970): Nonlinear system theory - functional analysis approach.
Prentice-Hall, 213 p.
[48] Hunt, K. J., D. Sbarbaro et al. (1992): Neural Networks for Control Systems -
A Survey. Automatica, Vol. 28, No. 6, pp. 1083-1112.
[49] Hyötyniemi, H. (1994): Self-Organizing Artificial Neural Networks in Dynamic
Systems Modelling and Control. Dissertation, Report 97, Helsinki University of
Technology, Control Engineering Laboratory, Finland, 136 p.
[50] Isidori, A. (1989): Nonlinear Control Systems. Springer-Verlag Berlin, Heidelberg
1985 and 1989.
[51] Jin Liang, P. N. Nikiforuk and M. M. Gupta (1994a): Absolute Stability Conditions
for Discrete-Time Recurrent Neural Networks. IEEE Trans. on Neural Networks,
Vol. 5, No. 6, pp. 954-964.
[52] Jin Liang, P. N. Nikiforuk and M. M. Gupta (1994b): Adaptive control of discrete
time nonlinear systems using recurrent neural networks. IEE Proc.-Control Theory
Appl., Vol. 141, No. 3, pp. 169-176.
[53] Jin Liang, P. N. Nikiforuk and M. M. Gupta (1995): Fast Neural Learning and Control of Discrete-Time Nonlinear Systems. IEEE Trans. on Systems, Man and Cybernetics, Vol. 25, No. 3, pp. 478-488.
[54] Johanssen, T. A. (1994): Operating Regime Based Process Modelling and Identifi-
cation, Dr. Ing. Thesis, Report 94-109-W, Department of Engineering Cybernetics,
Norwegian Institute of Technology, Trondheim, Norway, 224 p.
[55] Jokinen, P. (1991): Continuously Learning Nonlinear Networks with Dynamic
Capacity Allocation. Ph.D Thesis, Tampere University of Technology, Publications
73, Tampere, Finland.
[56] Kailath, T. (1980): Linear Systems. Prentice-Hall, Englewood Cliffs, N. J.
[57] Keeler, J. D. (1994): Prediction and control of chaotic chemical reactor via neural
network models. Journal A, Vol. 35, No. 3, pp. 54-57.
[58] Kentridge, R. W. (1990): Neural networks for learning in the real world: represen-
tation, reinforcement and dynamics. Parallel Computing, 14, pp. 405-414.
[59] Khalid, M. and S. Omatu (1992): A Neural Network Controller for a Temperature
Control System. IEEE Control Systems, June 1992, pp. 58-64.
[60] Khalid, M., S. Omatu, R. Yusof (1994a): Adaptive Fuzzy Control of a Water Bath
Process with Neural Networks. Engng. Applic. Artif. Intell., Vol. 7, No. 1, pp. 39-52.
[61] Khalid, M., S. Omatu, R. Yusof (1994b): MIMO furnace control with Neural Networks. IEEE Trans. on Control Systems Technology, Vol. 1, No. 4, pp. 238-245.
[64] Koivisto, H. (1990): Minimum Prediction Error Neural Controller. Proc. 29th IEEE
Conference on Decision and Control, Honolulu, 5-7 Dec. 1990, pp. 1741-1746.
[65] Koivisto, H., P. Kimpimäki, H. Koivo (1991a): Neural Net Based Control of the
Heating Process. ICANN-91, International Conference on Artificial Neural Net-
works. Espoo, Finland, June 24-28, 1991, pp. II-1277-1280.
[66] Koivisto, H., P. Kimpimäki, H. Koivo (1991b): Neural Predictive Control - a Case
Study. 1991 IEEE International Symposium on Intelligent Control, Arlington,
Virginia, Aug 13-15, 1991, pp. 405-410.
[67] Koivisto, H., V. Ruoppila, H. N. Koivo (1992): Properties of the neural network
internal model controller. Preprints of 1992 IFAC/IFIP/IFORS International Sym-
posium on Artificial Intelligence on Real-Time Control (AIRTC'92), June 16-18,
1992, Delft, Netherlands, pp. 221-226.
[68] Koivisto, H, V. Ruoppila and H. N. Koivo (1993): Real-time neural network control
- an IMC approach. IFAC World Congress 1993, July 18-23, Sydney, Australia,
pp. IV-47-53.
[69] Koivisto, H. (1994): Neural network control of a pilot heating process - a design
example. International Workshop on Neural Networks in District Heating, 15-16
April 1994, Reykjavik, Iceland.
[70] Kwon, W. H. (1994): Advances in Predictive Control: Theory and Applications.
'94 Asian Control Conference. Plenary Talk, 41 p.
[71] Landau, Y. (1979): Adaptive Control, the Model Reference Approach. Marcel
Dekker, New York.
[72] Lane, S. H., D. A. Handelman, J. J. Gelfand (1992): Theory and Development of
Higher-Order CMAC Neural Networks. IEEE Control Systems, April 92, pp. 23-30.
[73] Lawrence, C., J. L. Zhou, A. L. Tits (1994): User's Guide for CFSQP Version 2.0: A C Code for Solving (Large Scale) Constrained Nonlinear (Minimax) Optimization Problems, Generating Iterates Satisfying All Inequality Constraints. Electrical Engineering Department and Institute for Systems Research, TR-94-16, University of Maryland.
[74] Lee, M., S. Park (1992): A New Scheme Combining Neural Feedforward Control
with Model-Predictive Control. AIChE Journal, Feb. 1992, Vol. 38, No. 2,
pp. 193-200.
[77] Lichtenwalner, P. F. (1993): Neural network control of the fiber placement compos-
ite manufacturing process. Journal of Materials Engineering and Performance,
Vol. 2, No. 5, pp. 687-691.
[78] Lightbody G., G. W. Irwin (1995): Direct neural model reference adaptive control.
IEE Proc.-Control Theory Appl., Vol. 142, No. 1, pp. 31-43.
[79] Liu C., F. Chen (1993): Adaptive control of non-linear continuous-time systems
using neural networks - general relative degree and MIMO cases. Int J. Control,
Vol. 58, No. 2, pp. 317-335.
[80] Ljung, L. (1987): System Identification: Theory for the User. Prentice-Hall, Engle-
wood Cliffs, N. J.
[81] Ljung, L., T. Söderström (1983): Theory and Practice of Recursive Identification.
MIT Press.
[82] Lu, W., D. G. Fisher (1990): Nonminimal Model Based Long Range Predictive
Control. Proc. of 1990 American Control Conference (ACC-90), San Diego, Cali-
fornia, May 23-25 1990, pp. 1607-1613.
[83] Lu, W., D. G. Fisher et al. (1990): Nonminimal Model Based Output Predictors.
Proc. of 1990 American Control Conference (ACC-90), San Diego, California, May
23-25 1990, pp. 998-1003.
[84] MacArthur, J. W. (1993): An End-Time Constrained Receding Horizon Control
Policy. Trans. of the ASME, Journal of Dynamic Systems, Measurement and Con-
trol, Vol. 115, pp. 334-340.
[85] McCulloch, W. S., W. Pitts (1943): A Logical Calculus of the Ideas Immanent in Nervous Activity. Bulletin of Mathematical Biophysics, Vol. 5, pp. 115-133. Reprinted in Anderson and Rosenberg [3].
[89] Mills, P. M., A. Y. Zomaya and M. O. Tade (1994): Adaptive model-based control
using neural networks. Int. J. Control, Vol. 60, No. 6, pp. 1163-1192.
[91] Miyata, Y. (1990): A User's Guide to Planet Version 5.6. Computer Science Depart-
ment, University of Colorado, Boulder.
[93] Moody, J., J. Darken (1989): Fast Learning in Networks of Locally-Tuned Process-
ing Units. Neural Computation, Vol. 1, pp. 281-294.
[94] Morari, M., E. Zafiriou (1989): Robust Process Control. Prentice-Hall.
[95] Nahas, E. P., M. A. Henson and D. E. Seborg (1992): Nonlinear Internal Model
Control Strategy for Neural Network Models. Computers chem. Engng, Vol. 16,
No. 12, pp. 1039-1057.
[96] Narendra, K. S., K. Parthasarathy (1990): Identification and Control of Dynamical
Systems Using Neural Networks. IEEE Transactions on Neural Networks, Vol. 1,
No. 1, pp. 4-27.
[97] Narendra, K. S., K. Parthasarathy (1991): Gradient Methods for the Optimization
of Dynamical Systems Containing Neural Networks. IEEE Transactions on Neural
Networks, Vol. 2, No. 2, pp. 252-263.
[98] Neural Networks (1988), Vol. 1, No. 1.
[99] Nguyen, D. H., B. Widrow (1990): Neural Networks for Self-learning Control Systems. IEEE Control Systems Magazine, April 1990, pp. 18-23.
[100] Nikolaou, M. and V. Hanagandi (1993): Control of Nonlinear Dynamical Systems
Modelled by Recurrent Neural Networks. AIChE Journal, November 1993, Vol. 39,
No. 11, pp. 1890-1994.
[101] Niu S. S. and P. Pucar (1995): Hinging Hyperplanes for Non-Linear Identification.
Linköping University, Division of Automatic Control, Sweden, 23 p.
ftp://joakim.isy.liu.se/
[102] Oja, E. (1989): Neural networks, principal components, and subspaces. Interna-
tional Journal of Neural Systems, Vol. 1, pp. 61-68.
[105] Psichogios, D. C., L. H. Ungar (1991): Direct and Indirect Model Based Control
Using Artificial Neural Networks. Ind. Eng. Chem. Res, Vol. 30, pp. 2564-2573.
[106] Poggio, T., F. Girosi (1990): Regularization Algorithms for Learning That Are
Equivalent to Multilayer Networks. Science, Vol. 247, pp. 978-982.
[107] Press, W. H et al. (1986): Numerical Recipes, the Art of Scientific Computing.
Cambridge University Press.
[108] Priestley, M. B. (1988): Non-linear and Non-stationary Time Series Analysis. Aca-
demic Press, San Diego.
[109] Pröll, T. and M. N. Karim (1994): Model Predictive pH Control Using Real-Time
NARX Approach. AIChE Journal, Feb. 1994, Vol. 40, No. 2, pp. 269-282.
[111] Raiskila, P., H. N. Koivo (1990): Properties of a Neural Network controller. ICARV
'90 International Conference on Automation, Robotics and Computer Vision,
19-21 Sept. 1990, pp. 1-5.
[112] Rissanen, J. (1978): Modelling by shortest data description, Automatica, Vol. 14,
pp. 465-471.
[113] Rosenlicht, M. (1968): Introduction to analysis. Dover Publications, New York.
[117] Saint-Donat, J., N. Bhat, T. J. McAvoy (1991): Neural net based model predictive
control. Int. J. Control, Vol. 54, No. 6, pp. 1453-1468.
[118] Sanner, R. M. and J-J. E. Slotine (1991): Gaussian Networks for Direct Adaptive
Control. 1991 American Control Conference (ACC-91), pp. 2153-2159.
[119] Sbarbaro-Hofer, D., D. Neumerkel and K. Hunt (1993): Neural Control of a Steel
Rolling Mill. IEEE Control Systems, June 1993, pp. 69-75.
[121] Scott, G. M. and W. H. Ray (1993): Experiences with model-based controllers based
on neural network process models. Journal of Process Control, Vol. 3, No. 4, pp.
179-196.
[122] Shiraishi, H., S. L. Ipri, D. D. Cho (1995): CMAC Neural Network Controller for
Fuel-Injection Systems. IEEE Trans. on Control Systems Technology, Vol. 3, No. 1,
pp. 32-38.
[123] Shook, D. S., C. Mohtadi, S. L. Shah (1991): Identification for Long-Range Predic-
tive Control. IEE Proceedings-D, Vol. 138, No. 1, Jan. 1991, pp. 75-84.
[124] Shook, D. S., C. Mohtadi, S. L. Shah (1992): A Control-Relevant Identification
Strategy for GPC. IEEE Trans. on Automatic Control, Vol. 37, No. 7, pp. 975-980.
[125] Simnon User's Guide for UNIX Systems Version 3.1. SSPA Systems. Göteborg,
Sweden, 1991.
[132] Tam, Y. (1993): An architecture for adaptive neural control. Journal A, Vol. 37, No. 4, pp. 12-16.
[135] Turner P., G. A. Montague, A. J. Morris (1994): Neural networks in process plant
modelling and control. Computing & Control Engineering Journal, Vol. 5, No. 3,
pp. 131-134.
[136] Tzirkel-Hancock, E. and F. Fallside (1991): A Direct Control Method For a Class
of Nonlinear Systems Using Neural Networks. CUED/F-INFENG/TR.65, March
1991, Cambridge University Engineering Department, Cambridge, England.
[143] Widrow, B., D. E. Rumelhart and M. A. Lehr (1994): Neural Networks: Applica-
tions in Industry, Business and Science. Communications of the ACM, Vol. 37,
No. 3, pp. 93-105.
[144] Willis, M. J., G. A. Montague et al. (1992): Artificial Neural Networks in Process
Estimation and Control. Automatica, Vol. 28, No. 6, pp. 1181-1187.
[145] Wu, Q. H., B. W. Hogg and G. W. Irwin (1992): A Neural Network Regulator for
Turbogenerators. IEEE Trans. on Neural Networks, Vol. 3, No. 1, pp. 95-100.
[146] Yabuta, T., T. Tujimura, T. Yamada, T. Yasuno (1989): On the Characteristics of the
Robot Manipulator Controller using Neural Networks. Proc. of International Work-
shop on Industrial Applications of Machine Intelligence and Vision, April 10-12,
1989, Roppongi, Tokyo, pp. 76-81.
[147] Ydstie, B. E. (1990): Forecasting and Control using Adaptive Connectionistic Networks. Computers Chem. Engng, Vol. 14, No. 4/5, pp. 583-599.
[148] Ye, K., K. Fujioka and F. Shimizu (1994): Efficient control of fed-batch baker's yeast cultivation based on neural networks. Process Control & Quality, Vol. 5, No. 4,
pp. 245-250.
[149] Zafiriou, E. (1990): Robust Model Predictive Control of Processes with Hard
Constraints. Computers Chem. Engng, Vol. 14, No. 4/5, pp. 359-371.
[150] Zafiriou, E. and A. L. Marchal (1991): Stability of SISO Quadratic Dynamic Matrix
Control with Hard Output Constraints. AIChE Journal, Vol. 37, No. 10,
pp. 1550-1560.
[151] Zeman, V., R. V. Patel, K. Khorasani (1989): A neural network based control strat-
egy for flexible-joint manipulators. 28th IEEE Conference on Decision and Con-
trol, Tampa, Florida, Dec. 89, pp. 1759-1764, 1989.
Consider the prediction

\hat{y}(t) = \hat{y}(t, \hat{\theta}) = f(x(t), \hat{\theta})    (A.1)

for a sample input x(t), where \hat{\theta} is an estimate of the parameters \theta. The index t corresponds to a sample t or to time t, if a time series is considered. Introduce the prediction error

\epsilon(t) = \epsilon(t, \hat{\theta}) = y(t) - \hat{y}(t)    (A.2)

and the cost function increment

V(t, \hat{\theta}) = \frac{1}{2} \| y(t) - \hat{y}(t) \|^2    (A.3)

The negative gradient of (A.3) with respect to the parameters is denoted

\psi(t) = -\left[ \frac{\partial V(t, \hat{\theta})}{\partial \hat{\theta}} \right]^T    (A.4)
Fig. A.1. Block diagram of a two-layer MLP network during the forward phase (weights W^1, W^2 and biases b^1, b^2).
\Psi(t) = \frac{\partial f(x, \theta)}{\partial \theta} \Big|_{\theta = \hat{\theta},\, x = x(t)}    (A.5)

\Phi(t) = \frac{\partial f(x, \theta)}{\partial x} \Big|_{\theta = \hat{\theta},\, x = x(t)}    (A.6)

\Gamma(v) = [\, \gamma_1(v_1) \;\cdots\; \gamma_k(v_k) \,]^T    (A.7)

\frac{\partial \Gamma(v)}{\partial v} = \mathrm{diag}\!\left( \frac{d\gamma_1(v_1)}{dv_1}, \ldots, \frac{d\gamma_k(v_k)}{dv_k} \right)    (A.8)
Consider now an MLP network with L layers and the following definitions (see Fig. A.1):

\Gamma^s : R^{m_s} \to R^{m_s} is the diagonal activation function of the layer s, where \gamma_i : R \to R is a scalar activation function. Without loss of generality it is assumed that all activation functions in a layer are the same. The activation function is usually some sigmoid-shaped function, for example \gamma_i(v_i) = 1 / (1 + e^{-v_i}) or \gamma_i(v_i) = \tanh(v_i). The outermost activation function can also be a linear function.
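The two sigmoid-type activations mentioned above, together with the derivatives needed later for the layer Jacobians M^s, can be written out directly; this small Python sketch is an illustration only, not part of the thesis software:

```python
import numpy as np

def logistic(v):
    """gamma(v) = 1 / (1 + e^-v)."""
    return 1.0 / (1.0 + np.exp(-v))

def logistic_deriv(v):
    """d gamma / dv = gamma(v) * (1 - gamma(v))."""
    s = logistic(v)
    return s * (1.0 - s)

def tanh_deriv(v):
    """d tanh / dv = 1 - tanh(v)^2."""
    return 1.0 - np.tanh(v) ** 2
```

Both derivatives are expressed through the activation value itself, which is what makes the backward passes below cheap to evaluate.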
Appendix A Multilayer Perceptron
The terms forward pass and backward pass are used to denote the calculation of the predictions (forward) and the gradients (backward). Different backward pass types will be defined for computing \psi(t), \Psi(t) and \Phi(t). To maintain the simplicity and clarity of the presentation, a procedural pseudocode with matrix and vector notation is used. Readers who prefer a prose description are referred to Hertz et al. (1991).
Forward pass

\hat{y}(t) = f(x(t), \hat{\theta})    (A.9)

for a sample input x(t). The parameter vector \theta consists of the weight matrices W^s and bias vectors b^s with their columns stacked under each other.

Procedure: Forward
    extract all W^s and b^s from the parameter vector \hat{\theta}
    set x^0 = v^0 = x(t)
    repeat for layers s = 1 ... L
        v^s = W^s x^{s-1} + b^s
        x^s = \Gamma^s(v^s)
    end
    set \hat{y}(t) = x^L
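As an illustration, the Forward procedure can be sketched in a few lines of NumPy. The two-layer network, its dimensions and the tanh/linear activations are assumptions chosen for the example, not the thesis software:

```python
import numpy as np

def forward(x, weights, biases, activations):
    """Forward pass: v^s = W^s x^(s-1) + b^s,  x^s = Gamma^s(v^s)."""
    xs = x                               # x^0 = x(t)
    for W, b, gamma in zip(weights, biases, activations):
        v = W @ xs + b                   # affine part of layer s
        xs = gamma(v)                    # diagonal activation of layer s
    return xs                            # y_hat(t) = x^L

# Hypothetical 2-3-1 network: tanh hidden layer, linear output layer
rng = np.random.default_rng(0)
weights = [rng.standard_normal((3, 2)), rng.standard_normal((1, 3))]
biases = [np.zeros(3), np.zeros(1)]
activations = [np.tanh, lambda v: v]     # linear outermost activation

y_hat = forward(np.array([0.5, -1.0]), weights, biases, activations)
```

The loop mirrors the pseudocode one-to-one: each iteration consumes x^{s-1} and produces x^s, and the final x^L is the prediction.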
Backward passes

The task of the first backward pass is to calculate the negative gradient \psi(t) of the cost function increment (A.3) with respect to the parameters \hat{\theta}. The application of chain-rule differentiation can be viewed as a forward pass of the linearized and transposed network (see Fig. A.2), where:

    e^s is the error input of the layer s
    M^s is the Jacobian matrix of the diagonal activation function w.r.t. its input
Procedure: Backward_plain
    set e^L = \epsilon(t, \hat{\theta})

This procedure can be applied with all gradient-based optimization methods which use a scalar cost function (like BFGS). It is part of the standard error backpropagation algorithm, where the parameter update is made according to

\Delta\hat{\theta}(t) = \mu \psi(t) + \alpha \Delta\hat{\theta}(t-1)    (A.10)
Fig. A.2. Block diagram of the MLP network during the backward phase.
The batch update

\Delta\hat{\theta}(k) = \mu \sum_{t=1}^{N} \psi(t) + \alpha \Delta\hat{\theta}(k-1)    (A.11)

where k is the iteration counter, is more effective in off-line applications. These updates are not used in this study, except in two real-time adaptive control experiments.
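A minimal sketch of a backward pass and the momentum-type update above, for a hypothetical one-hidden-layer tanh network with a linear output layer; the network, the step size mu and the momentum coefficient alpha are assumptions chosen for the example:

```python
import numpy as np

def backward_plain(x, y, W1, b1, W2, b2):
    """Propagate e^L = eps(t) through the linearized, transposed network
    and return the negative gradient psi(t) of V = 0.5*||eps||^2."""
    x1 = np.tanh(W1 @ x + b1)            # hidden layer output x^1
    y_hat = W2 @ x1 + b2                 # linear output layer
    eps = y - y_hat                      # prediction error, e^L = eps(t)
    e1 = (1.0 - x1 ** 2) * (W2.T @ eps)  # error input of the hidden layer
    # stack per-parameter gradients in the order (W1, b1, W2, b2)
    psi = np.concatenate([np.outer(e1, x).ravel(), e1,
                          np.outer(eps, x1).ravel(), eps])
    return psi, eps

def update(delta_prev, psi, mu=0.1, alpha=0.5):
    """Parameter increment: mu * psi(t) + alpha * delta_theta(t-1)."""
    return mu * psi + alpha * delta_prev

rng = np.random.default_rng(1)
W1, b1 = rng.standard_normal((3, 2)), np.zeros(3)
W2, b2 = rng.standard_normal((1, 3)), np.zeros(1)
x, y = np.array([0.5, -1.0]), np.array([1.0])
psi, eps = backward_plain(x, y, W1, b1, W2, b2)
delta = update(np.zeros_like(psi), psi)
```

The stacking order of the gradient vector must match the order in which the weights and biases are stacked into the parameter vector during the forward pass.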
Procedure: Backward_only
    set \Phi^{(j)}(t) = e^0 (to the j th column)
end
Procedure: Backward_and_gradient
        e^{s-1} = (W^s)^T e^s
    end
    form \Psi^{(j)}(t) (the j th column) by stacking all \partial f_j / \partial W^s
    and \partial f_j / \partial b^s in the same order as the parameters in \theta
    set \Phi^{(j)}(t) = e^0 (to the j th column)
end
For constrained minimization also the second order Jacobian must be determined. This can be done easily for networks with one hidden layer. For more complicated networks the second order Jacobian \partial^2 \hat{y} / \partial x^2 must be approximated numerically. Define the matrices

\Phi_{(i)}(t) = \frac{\partial}{\partial x_i} \frac{\partial f(x, \theta)}{\partial x} \Big|_{\theta = \hat{\theta},\, x = x(t)}, \quad i = 1 \ldots n    (A.12)

Procedure: Backward_second_order
    set \tilde{x} = x(t) and \tilde{x}_i = x_i + \delta
    do a Backward_only pass at \tilde{x} and store the result in \tilde{\Phi}(t)
    \Phi_{(i)}(t) = ( \tilde{\Phi}(t) - \Phi(t) ) / \delta
end
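The finite-difference idea behind the second-order backward pass can be sketched directly: Phi = df/dx is evaluated at x(t) and at x(t) perturbed in component i, and the scaled difference approximates the second-order term. The one-hidden-layer network and the perturbation size delta are assumptions for the example:

```python
import numpy as np

def phi(x, W1, b1, W2):
    """Phi = df/dx for a one-hidden-layer tanh network with linear output."""
    x1 = np.tanh(W1 @ x + b1)
    return W2 @ np.diag(1.0 - x1 ** 2) @ W1

def phi_second_order(x, W1, b1, W2, i, delta=1e-6):
    """Finite-difference approximation of dPhi/dx_i."""
    xt = x.copy()
    xt[i] += delta                       # perturb the i-th input component
    return (phi(xt, W1, b1, W2) - phi(x, W1, b1, W2)) / delta

rng = np.random.default_rng(2)
W1, b1 = rng.standard_normal((3, 2)), rng.standard_normal(3)
W2 = rng.standard_normal((1, 3))
x = np.array([0.4, -0.2])
dPhi_dx0 = phi_second_order(x, W1, b1, W2, i=0)
```

For this one-hidden-layer case the analytic second derivative is available in closed form, so the approximation can be checked directly; for deeper networks only the numerical scheme remains cheap.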