1. Introduction
Both the theory and practice of non-linear system modelling have advanced considerably in recent years. It is known that a wide class of discrete-time non-linear systems
can be represented by the non-linear autoregressive moving average with exogenous
inputs (NARMAX) model (Leontaritis and Billings 1985, Chen and Billings 1989 b).
The NARMAX model provides a description of the system in terms of a non-linear
functional expansion of lagged inputs, outputs and prediction errors. The mathemat-
ical function describing a real-world system can be very complex and its exact form
is usually unknown so that in practice modelling of a real-world system must be
based upon a chosen model set of known functions. A desirable property for this
model set is the capability of approximating a system to within an arbitrary accuracy.
Mathematically, this requires that the set be dense in the space of continuous func-
tions. Polynomial functions are one choice that have such a completeness property.
This provides the foundation for modelling non-linear systems using the polynomial
NARMAX model and several identification procedures based upon this model have
been developed (Leontaritis and Billings 1988, Chen and Billings 1989 a, Chen et al.
1989). Because the derivation of the NARMAX model was independent of the form
of the non-linear functional, other choices of expansion can easily be investigated
within this framework and neural networks are an obvious alternative. Neural net-
works can therefore be viewed as just another class of functional representations.
Feedforward multi-layered neural networks have been widely used in many areas
of signal processing (see the I.E.E.E. Transactions, 1988). A common feature in these
applications is that neural networks are employed to realize some complex non-linear
decision functions. Recent theoretical work (Cybenko 1989, Funahashi 1989) has
rigorously proved that, even with only one hidden layer, neural networks can uniformly approximate any continuous function. The theoretical basis for modelling
non-linear systems by neural networks is therefore sound.
The present study develops an identification procedure for discrete-time non-
linear systems based on neural networks with a single hidden layer. New batch and
recursive estimation algorithms are derived for the neural network model based on
the prediction error principle. It is shown that the classical back propagation algo-
rithm is a special case of the new prediction error routines, and model validity tests
are introduced as a means of measuring the quality of fit. The results of applying the
neural network model to both simulated and real data are included and a suggestion
for further research is also given.
2. System representation
Under some mild assumptions, a discrete-time multivariable non-linear stochastic
control system with m outputs and r inputs can be represented by the multi variable
NARMAX model (Leontaritis and Billings 1985):
y(t) = f(y(t - 1), ..., y(t - n_y), u(t - 1), ..., u(t - n_u), e(t - 1), ..., e(t - n_e)) + e(t)    (1)
where
y(t) = [y_1(t), ..., y_m(t)]^T,   u(t) = [u_1(t), ..., u_r(t)]^T,   e(t) = [e_1(t), ..., e_m(t)]^T    (2)
are the system output, input and noise vectors, respectively; n_y, n_u and n_e are the
maximum lags in the output, input and noise respectively; e(t) is a zero-mean independent sequence; and f(·) is some vector-valued non-linear function.
The input-output relationship (1) depends upon the non-linear function f(·).
In reality, f(·) is generally very complex and knowledge of the form of this function
is often not available. The solution is to approximate f(·) using some known simpler
function, and in the present study we consider using neural networks to approximate
non-linear systems governed by the model
y(t) = f(y(t - 1), ..., y(t - n_y), u(t - 1), ..., u(t - n_u)) + e(t)    (3)
Notice that (3) is a slightly simplified version of (1) because only additive uncorrelated
noise is considered. Extension of the results to the more general model description
(1) is discussed.
Figure 1. Schematic of a multi-layered neural network: the network inputs feed one or more hidden layers, whose outputs feed the output layer to produce the network outputs.
Each node forms a weighted sum of its inputs, shifts it by a threshold β and passes the result through an activation function a(·):
y = a(Σ_i w_i x_i - β)    (4)
where x_i are the node inputs and y is the node output. The activation function a(·)
for each output node is specifically chosen to be linear, so that each output node simply forms the weighted sum of its inputs
y = Σ_i w_i x_i    (5)
The overall input-output relationship of an n-input m-output network with one
or more hidden layers is described by a function f̂: ℝ^n → ℝ^m. Under very mild assumptions on the activation function a(·), it has been rigorously proved that any continuous function f: D ⊂ ℝ^n → ℝ^m can be uniformly approximated by an f̂ on D, where D
is a compact subset of ℝ^n (Cybenko 1989, Funahashi 1989).
Our aim is to use neural networks with one hidden layer to model non-linear
systems described by (3). Define n = m·n_y + r·n_u,
x(t) = [x_1(t), ..., x_n(t)]^T = [y^T(t - 1), ..., y^T(t - n_y), u^T(t - 1), ..., u^T(t - n_u)]^T    (6)
and introduce the notation:
n_h, the number of hidden nodes;
β_i, the threshold of the ith hidden node;
w_ij^(1), the connection weight from x_j(t) to the ith hidden node;
o_hi(t), the output of the ith hidden node;
w_ki^(2), the connection weight from the ith hidden node to the kth output node.
Let Θ = [θ_1, ..., θ_{n_θ}]^T be all the weights and thresholds of the network ordered in some specified manner.
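As a concrete illustration of the network map just defined, the following sketch (Python with NumPy; the function and variable names are ours, and the common logistic sigmoid is assumed for the hidden-node activation) computes the one-hidden-layer network output:

```python
import numpy as np

def sigmoid(v):
    """Logistic activation, one common choice for the hidden nodes."""
    return 1.0 / (1.0 + np.exp(-v))

def network_output(x, W1, beta, W2):
    """One-hidden-layer network f_hat: R^n -> R^m.

    x    : (n,)     regressor x(t) of lagged outputs and inputs, as in (6)
    W1   : (nh, n)  weights w_ij^(1) from x_j(t) to the ith hidden node
    beta : (nh,)    hidden-node thresholds beta_i
    W2   : (m, nh)  weights w_ki^(2) from the ith hidden node to output k
    """
    o_h = sigmoid(W1 @ x - beta)   # hidden-node outputs o_hi(t)
    return W2 @ o_h                # linear output nodes: weighted sums, cf. (5)
```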
The network model (7) is therefore the one-step-ahead predictor for y(t), and the
prediction error or residual is given as usual by
ε(t, Θ) = y(t) - ŷ(t, Θ)    (10)
The first step in modelling non-linear systems using (3) is therefore to select values
for n_y, n_u and n_h. The next is to determine values of all the weights and thresholds, or
to estimate Θ. The gradient of ŷ(t, Θ),
Ψ(t, Θ) = [dŷ(t, Θ)/dΘ]^T = g(x(t); Θ)    (11)
will be referred to as the extended network model. The stability of (12) is of vital
importance in any implementation. The set of all Θ that each produce a stable
extended network model is denoted D_Θ. Notice that, for the chosen activation
function (9), D_Θ is the whole n_θ-dimensional Euclidean space, and in this sense the
corresponding extended network model is unconditionally stable. Furthermore, the
elements of Ψ(t, Θ) for 1 ≤ i ≤ n_θ and 1 ≤ j ≤ m are given by
∂ŷ_j(t, Θ)/∂θ_i = o_hk(t)  if θ_i is the weight w_jk^(2) from the kth hidden node to the jth output node, 1 ≤ k ≤ n_h
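For the logistic hidden activation assumed above, all of these partial derivatives have simple closed forms. A minimal sketch of the extended network computation follows (our own arrangement; a real implementation would flatten the pieces into the chosen ordering of Θ, and sigmoid() is the helper from the earlier sketch):

```python
import numpy as np

def network_gradient(x, W1, beta, W2):
    """Partial derivatives of y_hat w.r.t. all weights and thresholds.

    Returns the building blocks of Psi(t, Theta) for the network above.
    """
    z = W1 @ x - beta
    o_h = sigmoid(z)
    d_oh = o_h * (1.0 - o_h)                 # sigmoid derivative a'(z)
    m, nh = W2.shape
    # d y_k / d W2[k, i] = o_h[i]                       (hidden-to-output)
    dW2 = np.tile(o_h, (m, 1))
    # d y_k / d W1[i, j] = W2[k, i] * a'(z_i) * x[j]    (input-to-hidden)
    dW1 = W2[:, :, None] * (d_oh[:, None] * x[None, :])[None, :, :]
    # d y_k / d beta[i] = -W2[k, i] * a'(z_i)           (thresholds)
    dbeta = -W2 * d_oh[None, :]
    return dW1, dbeta, dW2
```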
4. Identification algorithm
The network model (7) is non-linear in the parameters. This section applies the
well-known prediction error estimation method to derive both the batch and recursive
algorithms for estimating the parameter vector Θ in (7).
over Θ ∈ D_Θ. Such a method of obtaining Θ̂ is known as the prediction error estimation method.
The minimization of criterion (15) can be performed efficiently using the following
Gauss-Newton algorithm
Θ̂^(k+1) = Θ̂^(k) + s^(k) η(Θ̂^(k), δ)    (16)
where
η(Θ, δ) = -H_1^(-1)(Θ, δ) ∇J_1(Θ)    (17)
is the optimizing direction vector, and
∇J_1(Θ) = -(1/N) Σ_{t=1}^{N} Ψ(t, Θ) Λ^(-1) ε(t, Θ)    (18)
H_1(Θ, δ) = (1/N) Σ_{t=1}^{N} Ψ(t, Θ) Λ^(-1) Ψ^T(t, Θ) + δI    (19)
are the gradient and the approximate Hessian of J_1(Θ), respectively; δ is a non-negative small scalar, I is the identity matrix of appropriate dimension, and Λ is the weighting matrix appearing in the criterion (15). The scalar s^(k) is obtained by minimizing
J_1(Θ̂^(k) + s η(Θ̂^(k), δ))    (20)
over 0 < s < 1 using a linear search technique such as the golden section search.
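A golden-section line search over the step length can be sketched as follows (a generic routine of our own, not the authors' exact implementation):

```python
import numpy as np

def golden_section(phi, a=0.0, b=1.0, tol=1e-4):
    """Minimize a unimodal phi(s) over [a, b], e.g. phi(s) = J1 along eta."""
    g = (np.sqrt(5.0) - 1.0) / 2.0        # golden ratio conjugate, ~0.618
    c, d = b - g * (b - a), a + g * (b - a)
    while b - a > tol:
        if phi(c) < phi(d):               # minimum lies in [a, d]
            b, d = d, c
            c = b - g * (b - a)
        else:                             # minimum lies in [c, b]
            a, c = c, d
            d = a + g * (b - a)
    return 0.5 * (a + b)
```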
In practice, the direction vector η(Θ, δ) is computed as follows. The square-root
decomposition method is first used to factorize the Hessian as
H_1(Θ, δ) = U^T U    (21)
where U is an upper triangular square matrix. η(Θ, δ) is then solved from
U^T (U η(Θ, δ)) = -∇J_1(Θ)    (22)
by the forward and backward substitution algorithms (Bierman 1977).
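In code, this direction computation might look as follows (SciPy's Cholesky factorization and triangular solvers; Λ is taken as the identity purely for brevity, an assumption on our part):

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def gauss_newton_direction(Psis, epss, delta):
    """Solve H1(Theta, delta) * eta = -grad J1(Theta) as in (21)-(22).

    Psis  : list of (n_theta, m) matrices Psi(t, Theta)
    epss  : list of (m,) residual vectors eps(t, Theta)
    delta : small non-negative scalar regularizer
    """
    N, n_theta = len(Psis), Psis[0].shape[0]
    grad = np.zeros(n_theta)
    H = delta * np.eye(n_theta)
    for Psi, eps in zip(Psis, epss):
        grad -= Psi @ eps / N              # (18), with Lambda = I assumed
        H += Psi @ Psi.T / N               # (19), with Lambda = I assumed
    U = cholesky(H)                        # H1 = U^T U, U upper triangular
    w = solve_triangular(U.T, -grad, lower=True)   # forward substitution
    return solve_triangular(U, w)                  # backward substitution
```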
The above Gauss-Newton algorithm is known to converge to at least a local
minimum. Other loss functions can also be employed; an alternative to (15) is
J_2(Θ) = ½ log det (C(Θ))    (23)
with
C(Θ) = (1/N) Σ_{t=1}^{N} ε(t, Θ) ε^T(t, Θ)    (24)
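Evaluating this determinant criterion is straightforward; a small sketch (using the log-determinant directly for numerical stability):

```python
import numpy as np

def loss_J2(epss):
    """J2(Theta) = 0.5 * log det C(Theta), with C(Theta) from (24)."""
    E = np.asarray(epss)                   # (N, m) array of residuals
    C = E.T @ E / len(E)                   # sample covariance of residuals
    _, logdet = np.linalg.slogdet(C)       # stable log-determinant
    return 0.5 * logdet
```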
H_2(Θ, δ) = (1/N) Σ_{t=1}^{N} Ψ(t, Θ) C^(-1)(Θ) Ψ^T(t, Θ) + δI    (26)
ŷ(t) = f̂(x(t); Θ̂(t - 1)),   Ψ(t) = g(x(t); Θ̂(t - 1))    (27)
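The full recursion (28)-(30) updates the estimate and a gain matrix from Ψ(t) and ε(t). For orientation, a textbook recursive prediction error step of the kind analysed by Ljung and Söderström (1983) is sketched below under our own simplifying assumptions: a scalar output, a flattened parameter vector, and a forgetting factor λ; predict_and_gradient is a hypothetical helper returning f̂ and the flattened g of (27).

```python
import numpy as np

def rpe_step(theta, P, x, y, lam=0.99):
    """One recursive prediction-error update (generic textbook form)."""
    y_hat, psi = predict_and_gradient(x, theta)   # (27): prediction and gradient
    eps = y - y_hat                               # prediction error
    K = P @ psi / (lam + psi @ P @ psi)           # gain vector
    theta = theta + K * eps                       # parameter update
    P = (P - np.outer(K, psi @ P)) / lam          # gain-matrix update
    return theta, P
```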
By applying the differential equation method for the analysis of recursive parameter
estimation algorithms, developed by Ljung (1977), the convergence of the algorithm
(27)-(30) can be proved. The underlying ideas of Ljung's method are as follows.
Assume that a projection is employed to keep Θ̂(t) inside the stable region D_Θ.
ŷ(t) = ŷ(t, Θ̂(t - 1), ..., Θ̂(0)) ≈ ŷ(t, Θ̂(t - 1), ..., Θ̂(t - M)),
Ψ(t) = Ψ(t, Θ̂(t - 1), ..., Θ̂(0)) ≈ Ψ(t, Θ̂(t - 1), ..., Θ̂(t - M))    (37)
Furthermore, assumption (35) implies γ(t) → 0 as t → ∞. For sufficiently large t, γ(t)
will be arbitrarily small, and it is seen that {Θ̂(t)} will change more and more slowly,
i.e.
are obtained under assumption (35), which implies γ(t) → 0 as t → ∞ (or λ(t) → 1 as
t → ∞). In order to track time-varying parameters, γ(t) should not tend to zero. It is
reasonable to believe that analysis under condition (35) will have relevance for the
case where γ(t) tends to some small non-zero value. As in any non-linear optimization
problem, the initial conditions have an important influence on convergence and the
speed of convergence. The performance surface (39) for a general network model is
very complex and is known in general to contain many local minima. A study of this
performance surface and the influence of Θ̂(0) on the algorithm (27)-(30) is beyond
the scope of this paper.
Strictly speaking, algorithm (30) or (32) is only a crude approximation of the off-line Gauss-Newton algorithm because -Ψ(t) Λ^(-1) ε(t) is hardly a good approximation of the gradient (18). A modified RPE algorithm is therefore proposed here.
5. Model validation
If modelling is adequate, ε(t, Θ̂) will be unpredictable from (uncorrelated with) all
linear and non-linear combinations of past inputs and outputs. Model validity tests
for other non-linear models (Billings and Voon 1986, Billings and Chen 1989, Billings
et al. 1989, Leontaritis and Billings 1987) were developed based on this principle and
can therefore be applied to the current neural network model. For simplicity, only
single-input (r = 1) single-output model validity tests are briefly summarized.
If the identified model is adequate, the prediction errors should satisfy the following conditions (Billings and Voon 1986, Billings and Chen 1989):
Φ_εε(k) = δ(k);  Φ_uε(k) = 0 for all k;  Φ_ε(εu)(k) = 0 for k ≥ 0;  Φ_u²'ε(k) = 0 for all k;  Φ_u²'ε²(k) = 0 for all k
where Φ_xz(k) indicates the cross-correlation function between x(t) and z(t), εu(t) =
ε(t + 1)u(t + 1), u²'(t) = u²(t) - ū²(t), and ū²(t) represents the time average or mean
value of u²(t). Therefore if these correlation functions are within the 95% confidence
intervals ±1.96/√N, the model is regarded as adequate.
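A sketch of how these checks might be computed (normalized sample correlations against the ±1.96/√N band; the routines and their names are ours):

```python
import numpy as np

def xcorr(x, z, max_lag):
    """Normalized sample cross-correlation Phi_xz(k), k = 0..max_lag."""
    x, z = x - x.mean(), z - z.mean()
    denom = np.sqrt(np.sum(x ** 2) * np.sum(z ** 2))
    return np.array([np.sum(x[: len(x) - k] * z[k:]) / denom
                     for k in range(max_lag + 1)])

def within_band(x, z, max_lag):
    """True if all correlations lie inside the 95% band +/- 1.96/sqrt(N)."""
    return bool(np.all(np.abs(xcorr(x, z, max_lag)) < 1.96 / np.sqrt(len(x))))
```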
Alternatively, a statistical test known as the chi-squared test (Bohlin 1978, Leontaritis and Billings 1987) can be employed to validate the identified model. Let Ω(t) be
an s-dimensional vector-valued function of the past inputs, outputs and prediction
errors, and
Γ = (1/N) Σ_{t=1}^{N} Ω(t) Ω^T(t)    (44)
Θ̂ is the estimate of Θ and σ̂_ε² is the variance of the residuals. Under the null hypothesis
that the data are generated by the model, the statistic ζ is asymptotically chi-squared
distributed with s degrees of freedom. A convenient choice for Ω(t) is
Ω(t) = [ω(t), ω(t - 1), ..., ω(t - s + 1)]^T    (47)
where ω(t) is some chosen (non-linear) function of the past inputs, outputs and
prediction errors. Thus if the values of ζ for several different choices of ω(t) are within
the 95% acceptance region, that is
ζ < χ_s²(α)    (48)
the model can be regarded as adequate, where χ_s²(α) is the critical value of the chi-squared distribution with s degrees of freedom for the given significance level α (0.05).
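The statistic itself (expressions (45)-(46), given in full by Leontaritis and Billings 1987) can be computed along the following lines. This is a standard construction and our own sketch, not a transcription of the paper's equations:

```python
import numpy as np
from scipy.stats import chi2

def chi_squared_test(omega, eps, s, alpha=0.05):
    """Chi-squared validity test for one choice of omega(t), cf. (44), (47), (48)."""
    N = len(eps)
    # Stack Omega(t) = [omega(t), omega(t-1), ..., omega(t-s+1)]^T
    Omega = np.column_stack([omega[s - 1 - i: N - i] for i in range(s)])
    E = eps[s - 1:]
    n = len(E)
    Gamma = Omega.T @ Omega / n            # sample moment matrix, cf. (44)
    mu = Omega.T @ E / n
    sigma2 = np.mean(E ** 2)               # residual variance estimate
    zeta = n * mu @ np.linalg.solve(Gamma, mu) / sigma2
    return zeta, bool(zeta < chi2.ppf(1.0 - alpha, df=s))   # accept if below (48)
```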
To sum up the discussion so far, the identification of a structure-unknown system
described by (3) using a single-hidden-layer neural network involves the following
procedure:
(a) choose values of n_y, n_u and n_h;
(b) estimate Θ;
(c) validate the estimated model. If the model is adequate, the procedure is terminated; otherwise go to step (a).
6. Simulation study
The parameter estimation algorithm used in this simulation study was the off-line prediction error algorithm, and only single-input single-output examples are given.
Example 1
This is a simulated system. 500 points of data were generated by
y(t) = (0.8 - 0.5 exp(-y²(t - 1))) y(t - 1) - (0.3 + 0.9 exp(-y²(t - 1))) y(t - 2) + u(t - 1) + 0.2 u(t - 2) + 0.1 u(t - 1) u(t - 2) + e(t)
where the system noise e(t) was a gaussian white sequence with zero mean and
variance 0.04, and the system input u(t) was an independent uniformly distributed
sequence with zero mean and variance 1.0.
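For reference, the data generation is a direct transcription of the difference equation above (the random seed and the uniform range ±√3, which yields unit variance, are our own choices):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 500
u = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), N)  # zero mean, variance 1.0
e = rng.normal(0.0, np.sqrt(0.04), N)            # zero mean, variance 0.04
y = np.zeros(N)
for t in range(2, N):
    y[t] = ((0.8 - 0.5 * np.exp(-y[t - 1] ** 2)) * y[t - 1]
            - (0.3 + 0.9 * np.exp(-y[t - 1] ** 2)) * y[t - 2]
            + u[t - 1] + 0.2 * u[t - 2]
            + 0.1 * u[t - 1] * u[t - 2] + e[t])
```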
The input order of the network model was chosen as n = n_y + n_u = 2 + 2. When
the number of hidden nodes was increased to n_h = 5 (n_θ = 30) the model validity tests
were satisfied. Figure 2 shows the system and model response, where the model
deterministic output ŷ_d(t, Θ̂) is defined by
ŷ_d(t, Θ̂) = f̂(ŷ_d(t - 1, Θ̂), ..., ŷ_d(t - n_y, Θ̂), u(t - 1), ..., u(t - n_u); Θ̂)    (49)
and the deterministic error ε_d(t, Θ̂) is given as
ε_d(t, Θ̂) = y(t) - ŷ_d(t, Θ̂)    (50)
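Computing the deterministic output amounts to iterating the fitted network on its own past outputs. A sketch for this example, with n_y = n_u = 2 and a hypothetical scalar-output wrapper f_hat around the fitted network:

```python
import numpy as np

def deterministic_output(u, f_hat, N):
    """Model-predicted output y_d(t) of (49): feed model outputs back in."""
    yd = np.zeros(N)
    for t in range(2, N):
        x = np.array([yd[t - 1], yd[t - 2], u[t - 1], u[t - 2]])  # cf. (6)
        yd[t] = f_hat(x)             # fitted network, scalar output assumed
    return yd
```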
Figure 2. System and model response (Example 1): (a) u(t); (b) y(t); (c) ŷ(t, Θ̂); (d) ε(t, Θ̂); (e) ε_d(t, Θ̂); (f) ŷ_d(t, Θ̂).
Figures 3 and 4 display the correlation tests and some chi-squared tests for the
estimated model.
It can easily be verified that the unforced response (that is, e(t) = 0 and u(t) = 0)
of this simulated system is a stable limit cycle, as illustrated in Fig. 5. The unforced
response of the estimated model with the same initial condition is plotted in Fig. 6,
where it is seen that, although the shape differs from that in Fig. 5, the estimated
model correctly predicts the existence of a limit cycle. The data shown in Fig. 5 were
used to identify a network model with n = n_y = 2 and n_h = 10 (n_θ = 40). The resulting
model produces the limit cycle shown in Fig. 7, which is much closer to that produced
by the unforced system.
Figure 3. Correlation tests (Example 1): (a) Φ_εε(k); (b) Φ_ε(εu)(k); (c) Φ_uε(k); (d) Φ_u²'ε(k); (e) Φ_u²'ε²(k). Dashed line: 95% confidence interval.
Figure 4. Chi-squared tests (Example 1): (a) ω(t) = ε(t - 1, Θ̂); (b) ω(t) = y(t - 1); (c) ω(t) = exp(u(t - 1)); (d) ω(t) = tanh(ε(t - 1, Θ̂)); (e) ω(t) = y²(t - 1) ε²(t - 2, Θ̂); (f) ω(t) = exp(-u²(t - 2)) exp(-y²(t - 2)). Dashed line: 95% confidence limit.
Figure 5. System unforced response (Example 1): initial condition y(-1) = 0.01, y(0) = 0.1.
Figure 6. Control model unforced response (Example 1): initial condition y(-1) = 0.01, y(0) = 0.1.
Figure 7. Time series model unforced response (Example 1): initial condition y(-1) = 0.01, y(0) = 0.1.
Example 2
This is the time series of annual sunspot numbers. Observations for the years
1700 to 1979 can be found in Tong (1983, Appendix A.1). The first 256
observations are plotted in Fig. 8(a).
It has long been noticed that the record of sunspot numbers reveals an intriguing
cyclical phenomenon with an approximate 11-year period. Chen and Billings (1989 c)
fitted a subset polynomial model with n_y = 9 and polynomial degree three to the first
221 observations. The unforced response of this subset polynomial model is a sustained oscillation with an approximate 11-year period, as shown in Fig. 8(c). In the
current study a neural network model with n = n_y = 9 and n_h = 5 (n_θ = 55) was fitted
to the first 221 observations. The unforced response of this neural network model is
illustrated in Fig. 8(b), where it is seen that this time series model also produces a
sustained oscillation with an approximate 11-year period.
Example 3
The data were generated from a heat exchanger and contain 996 points. A
detailed description of this process and the experimental design can be found in
Billings and Fadzil (1985). The first 500 points of the data, depicted in Fig. 9, were
used as the identification set and the rest of the data as the test set.
A neural network model with n_y = n_u = 5 and n_h = 3 (n_θ = 36) was fitted to the
identification data set. Figures 10 and 11 show the correlation tests using the identification and test sets, respectively.
Figure 8. Observations and model unforced response (Example 2): (a) observations; (b) neural network model; (c) subset polynomial model; first nine observations used as initial condition in unforced response.
Figure 9. Identification data set (Example 3): (a) u(t); (b) y(t).
The test set and the model response for this set are
given in Fig. 12. Further increasing the size of the network only slightly improved
the quality of fit.
Previous identification results (Billings and Chen 1989) indicate that this non-linear process can be described better by a model with the form of (1). The
results obtained here are satisfactory considering that no noise model was fitted as
part of the model estimation.
Figure 10. Correlation tests using identification set (Example 3): (a) Φ_εε(k); (b) Φ_ε(εu)(k); (c) Φ_uε(k); (d) Φ_u²'ε(k); (e) Φ_u²'ε²(k). Dashed line: 95% confidence interval.
Figure 11. Correlation tests using test set (Example 3): (a) Φ_εε(k); (b) Φ_ε(εu)(k); (c) Φ_uε(k); (d) Φ_u²'ε(k); (e) Φ_u²'ε²(k). Dashed line: 95% confidence interval.
Figure 12. Test set and model response (Example 3): (a) u(t); (b) y(t); (c) ŷ(t, Θ̂); (d) ε(t, Θ̂); (e) ε_d(t, Θ̂); (f) ŷ_d(t, Θ̂).
8. Conclusions
An identification procedure has been developed for discrete-time non-linear systems based on a neural network approach. Both batch and recursive prediction error
estimation algorithms have been derived for a neural network model with a single
hidden layer and model validation methods have been discussed. Application to some
simulated and real systems has been demonstrated. The results obtained suggest that
modelling non-linear systems by neural networks is an effective approach and further
research in this field is worth pursuing.
ACKNOWLEDGMENTS
This work is supported by the U.K. Science and Engineering Research Council.
The authors are also grateful for information supplied by Dr G. J. Gibson.
REFERENCES
BIERMAN, G. J., 1977, Factorization Methods for Discrete Sequential Estimation (New York:
Academic Press).
BILLINGS, S. A., and CHEN, S., 1989, Identification of non-linear rational systems using a
prediction-error estimation algorithm. International Journal of Systems Science, 20,
467-494.
BILLINGS, S. A., CHEN, S., and KORENBERG, M. J., 1989, Identification of MIMO non-linear
systems using a forward-regression orthogonal estimator. International Journal of Control, 49, 2157-2189.
BILLINGS, S. A., and FADZIL, M. B., 1985, The practical identification of systems with non-linearities. Proceedings of the 7th IFAC Symposium on Identification and System Parameter
Estimation, York, U.K., pp. 155-160.
BILLINGS, S. A., and VOON, W. S. F., 1986, Correlation based model validity tests for non-
linear models. International Journal of Control, 44, 235-244.
BOHLIN, T., 1978, Maximum-power validation of models without higher-order fitting. Automatica, 14, 137-146.
CHEN, S., and BILLINGS, S. A., 1989 a, Recursive prediction error parameter estimator for non-linear models. International Journal of Control, 49, 569-594; 1989 b, Representation of
non-linear systems: the NARMAX model. Ibid., 49, 1013-1032; 1989 c, Modelling and
analysis of non-linear time series. Ibid., 50, 2151-2171.
CHEN, S., BILLINGS, S. A., and LUO, W., 1989, Orthogonal least squares methods and their
application to non-linear system identification. International Journal of Control, 50,
1873-1896.
CYBENKO, G., 1989, Approximation by superpositions of a sigmoidal function. Mathematics of
Control, Signals and Systems, 2, 303-314.
FUNAHASHI, K., 1989, On the approximate realization of continuous mappings by neural
networks. Neural Networks, 2, 183-192.
GOODWIN, G. C., and PAYNE, R. L., 1977, Dynamic System Identification: Experiment Design
and Data Analysis (New York: Academic Press).
I.E.E.E., 1988, I.E.E.E. Transactions on Acoustics, Speech and Signal Processing, 36 (7).
JANECKI, D., 1988, New recursive parameter estimation algorithms with varying but bounded
gain matrix. International Journal of Control, 47, 75-84.
LEONTARITIS, I. J., and BILLINGS, S. A., 1985, Input-output parametric models for non-linear
systems. Part 1: Deterministic non-linear systems; Part 2: Stochastic non-linear systems. International Journal of Control, 41, 303-344; 1987, Model selection and validation methods for non-linear systems. Ibid., 45, 311-341; 1988, Prediction error
estimator for non-linear stochastic systems. International Journal of Systems Science,
19, 519-536.
LJUNG, L., 1977, Analysis of recursive stochastic algorithms. I.E.E.E. Transactions on Automatic
Control, 22, 551-575; 1978, Convergence analysis of parametric identification methods.
Ibid., 23, 770-783; 1981, Analysis of a general recursive prediction error identification
algorithm. Automatica, 17, 89-99.
LJUNG, L., and SODERSTROM, T., 1983, Theory and Practice of Recursive Identification (Cam-
bridge, MA: MIT Press).
RUMELHART, D. E., HINTON, G. E., and WILLIAMS, R. J., 1986, Learning internal representations
by error propagation. In Parallel Distributed Processing: Explorations in the Micro-
structure of Cognition, edited by Rumelhart, D. E., and McClelland, J. L., pp. 318-362
(Cambridge, MA: MIT Press).
SALGADO, M. E., GOODWIN, G. C., and MIDDLETON, R. H., 1988, Modified least squares algorithm incorporating exponential resetting and forgetting. International Journal of Control, 47, 477-491.
SRIPADA, N. R., and FISHER, D. G., 1987, Improved least squares identification. International
Journal of Control, 46, 1889-1913.
TONG, H., 1983, Threshold Models in Non-linear Time Series Analysis. Lecture Notes in Stat-
istics (New York: Springer-Verlag).