Mechanical Systems and Signal Processing: Luis A. Aguirre, Bruno H.G. Barbosa, Antônio P. Braga
a r t i c l e   i n f o

Article history:
Received 12 November 2009
Received in revised form 7 May 2010
Accepted 9 May 2010
Available online 13 May 2010

Keywords:
Prediction error
Simulation error
Parameter estimation
Nonlinear system identification
Non-convex optimisation
Genetic algorithms

a b s t r a c t

This article compares the pros and cons of using prediction error and simulation error to define cost functions for parameter estimation in the context of nonlinear system identification. To avoid being influenced by estimators of the least squares family (e.g. prediction error methods), and in order to be able to solve non-convex optimisation problems (e.g. minimisation of some norm of the free-run simulation error), evolutionary algorithms were used. Simulated examples which include polynomial, rational and neural network models are discussed. Our results, obtained using different model classes, show that, in general, the use of simulation error is preferable to prediction error. An interesting exception to this rule seems to be the equation error case when the model structure includes the true model. In the case of errors-in-variables, although parameter estimation is biased in both cases, the algorithm based on simulation error is more robust.

© 2010 Elsevier Ltd. All rights reserved.
1. Introduction
One of the main objectives in system identification is to build models from data. In doing so, the main steps are: (i) dynamical
testing and data acquisition; (ii) choice of model class; (iii) structure selection; (iv) parameter estimation, and (v) model
validation. Each of these steps presents its own challenges for which there are solutions with varying degrees of effectiveness.
This work is concerned with the problem of parameter estimation but only as a framework within which to investigate the
different roles played by two entities which are of the greatest importance in system identification theory and practice, namely:
prediction error and simulation error. It is believed that the discussion in this paper will also impact model validation.
A very elegant solution to the parameter estimation problem for models that are linear-in-the-parameters is the well-
known least squares (LS) algorithm. This algorithm is compact, easy to implement, fast to run and the estimator is
amenable to analysis [1]. Such features greatly promoted the use of LS-based estimators in the early days of linear system
identification when there was a need to establish the new theory and when computation facilities were scarce.
Unfortunately, in problems where a disturbance model is required, the model becomes nonlinear-in-the-parameters and LS
methods fail to provide unbiased estimates. Fortunately, for such models, which are pseudo-linear-in-the-parameters, a
considerable body of results has been developed that has come to be known as prediction error (PE) methods [2].
Algorithms based on LS and PE methods are also applicable to some nonlinear model classes [3].
More recently some works have appeared suggesting the use of the simulation error (SE) or free-run simulation error in
the context of system identification [4–7]. From the practical point of view, SE algorithms are far more computer-intensive
than those based on prediction error (PE) [4]. From a theoretical point of view, for the SE methods there is no counterpart to
doi:10.1016/j.ymssp.2010.05.003
the rigorous analysis available for PE methods, although some results on error bounds have been recently proposed in [5]
and some results on convergence for multi-step PE methods have been developed in [8]. Nevertheless, it has been reported
that SE methods were found to be more robust than their PE counterparts [6,7].
In a recent work, intermediate solutions have been investigated in the case of linear systems. In fact, "k-steps-ahead
single-step" and multi-step PE algorithms have been investigated as a middle-point solution between SE and PE
methods [8]. In the first example of this paper (Section 4.1) a bi-objective estimator is used to simultaneously minimise
prediction errors and simulation errors. In a sense, this can also be seen as an intermediate solution. This framework will naturally
yield a set of middle-point solutions (between the SE and PE solutions) thus increasing our understanding of the roles played by
such errors. In previous works other cost functions besides the one based on PE were used in bi- and multi-objective
optimisation [9,10]. None of these works considered a cost function based on SE.
The main contribution of the present article is to clarify the roles played by prediction error and simulation error to define
cost functions for parameter estimation in the context of system identification. This has been done in a limited but varied
set of examples which includes several model classes (linear and nonlinear-in-the-parameters) and an example
with noise in the input signal.
In order not to be biased by PE and SE methods per se, in all the simulation examples of this article genetic algorithms (GAs)
[11] will be used to solve the optimisation problems at hand, unlike the previously mentioned works where least-squares-based
methods were extensively used. GAs as optimisation tools have been used to great advantage in other studies on system
identification [12–17]. It is worth pointing out that in such references only prediction error was used in the fitness function. An
additional benefit of using GAs is that nonlinear-in-the-parameters models can also be investigated, as will be illustrated in this
paper. One of the reasons for choosing GAs to solve the optimisation problem is that such algorithms do not generally get stuck
in local minima. In the present study, this is an important point because the error function in the SE approach depends
nonlinearly on the parameters. This renders the optimisation problem nonconvex with many potential minima.
The remainder of this article is organised as follows. In Section 2 we set the framework in general terms and state the
problem to be investigated. The numerical procedure used in the study is briefly described in Section 3. Five simulation
examples are discussed in Section 4 and, finally, the main conclusions are provided in Section 5.
2. Framework
Assume that data Z from a system S are available. The black-box model building problem consists in building a
mathematical model M from the available data Z such that the model and system outputs should be similar when the
same input is used for both. Some of the key steps in the practical solution of the black-box modelling problem will be
discussed below.
The problem of building a model M that "behaves" approximately as the system S can be cast as an optimisation
problem if we can be more specific as to what is meant by "behaves". A practical way of casting this optimisation problem
is to assume that there is data Z measured from the system S. For the sake of argument we further assume that Z includes
both input and output signals. In order to characterise the model behaviour, assume we take the input signal from Z and
use it together with the model to produce some model-data ZM in some particular way.
It is then possible to define some function J(Z, ZM) that will measure how far the model data ZM is from the measured
data Z. An important question to ask is: if it turns out that ZM is close to Z in terms of the chosen J, will that guarantee that
the model M will behave like the system S for other inputs?
A thorough answer to this question is beyond the aims of this paper. However it is our intention to argue that the
answer does depend on the way the model data ZM is produced. Also, it will be argued that there are ways of computing
ZM which will greatly increase the chances for an affirmative answer to the former question. On the other hand, such
choices of ZM will usually turn out to be numerically more expensive.
In order to be more specific, it is assumed that a realisation Z ∈ R^(N×r) of data is available which is composed of
measurements of the inputs and the output, that is, Z = [y u1 ... u_(r−1)]. In the case of a single input we simply have
Z = [y u]. A practical cost function for modelling would then be J(Z, ZM), where ZM is analogous to Z, but composed
with model data (see Section 2.2). So, finally, many model building techniques solve the following unconstrained
optimisation problem

θ̂ = arg min_θ J(Z, ZM),   (1)
Three types of model-produced data will be considered. It is assumed that a general model class can handle up to three
different variable classes: the output, y, the input(s) u and the disturbance e. The first two will be referred to as process
variables, and the latter as noise. Therefore, a general model can be described as
y(k) = f[ψ_yue(k−1), θ] + e(k),   (2)

where (k−1) indicates that the variables used in the model (y, u and e) are taken up to and including time k−1. It is not
assumed that f is linear in θ. For the sake of defining terminology, it is assumed that the estimate is θ̂ and that

y(k) = f[ψ_yuξ(k−1), θ̂] + ξ(k),   (3)

where ξ(k) is white and is available after the parameter vector θ̂ has been estimated. The one-step-ahead prediction of (3) is
given by ŷ1(k) = f[ψ_yuξ(k−1), θ̂]. In this case, ZM1 = [ŷ1 u], where the single-input case was considered for simplicity.
In the remainder of this paper, the process prediction will be written as ŷp(k) = f[ψ_yu(k−1), θ̂]. The main difference
between ŷp(k) and ŷ1(k) is that in the former only the process variables are used to compose the prediction. In this case,
ZMp = [ŷp u]. For models without noise terms, the process prediction and the one-step-ahead prediction coincide. In the
literature y − ŷ1 and y − ŷp are examples of prediction errors. The class of parameter estimation algorithms that estimate θ̂
by minimising a particular function of the prediction errors is known as prediction error methods (PEM) [2].
Finally, if the model output, obtained from the process variables only, is fed back into the model, then we speak of
free-run simulation. In this case we can write ŷs(k) = f[ψ_ŷsu(k−1), θ̂] and ZMs = [ŷs u]. Also, y − ŷs is called the simulation
error. Minimisation of the simulation error during parameter estimation falls outside the scope of PEM because it
constitutes a nonconvex problem. Minimisation of y − ŷs will be of great interest in the remainder of this work.
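The two ways of composing model data can be contrasted on a toy example (the NARX map f, the data and the parameter values below are hypothetical, chosen only to illustrate how ZM1 and ZMs are built):

```python
import numpy as np

def f(y1, u1, theta):
    # Hypothetical NARX map, for illustration only:
    # y(k) = th0*y(k-1) + th1*u(k-1) + th2*y(k-1)*u(k-1)
    return theta[0] * y1 + theta[1] * u1 + theta[2] * y1 * u1

def one_step_ahead(y, u, theta):
    # y_hat_1(k): the *measured* output y(k-1) enters the regressors
    yhat = np.zeros_like(y)
    yhat[0] = y[0]
    for k in range(1, len(y)):
        yhat[k] = f(y[k - 1], u[k - 1], theta)
    return yhat

def free_run(y, u, theta):
    # y_hat_s(k): the model's *own* past output is fed back
    yhat = np.zeros_like(y)
    yhat[0] = y[0]  # initial condition taken from the data
    for k in range(1, len(y)):
        yhat[k] = f(yhat[k - 1], u[k - 1], theta)
    return yhat

def mse(a, b):
    return float(np.mean((a - b) ** 2))

# J1 and Js for a slightly wrong parameter vector
rng = np.random.default_rng(0)
u = rng.uniform(-1, 1, 200)
theta_true = np.array([0.5, 0.3, 0.1])
y = np.zeros(200)
for k in range(1, 200):
    y[k] = f(y[k - 1], u[k - 1], theta_true) + 0.01 * rng.standard_normal()

theta = np.array([0.4, 0.3, 0.1])
J1 = mse(y, one_step_ahead(y, u, theta))
Js = mse(y, free_run(y, u, theta))
```

Note how the only difference between the two predictors is which past output is fed into f; for a model with a parameter error, the free-run error accumulates over time while the one-step-ahead error does not.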
Taking advantage of the previous framework, a few points about model validation will be briefly mentioned. In the
context of system identification it is usually assumed that there is a separate set of data Zv available for model validation
which is similar to Z in terms of amplitude and frequency ranges.
For many models, the parameters are estimated by solving the problem outlined in (1) for ZM ¼ ZM1 . However, a well-
recognized fact is that the dynamical features of the model M are difficult to assess by analyzing ZM1 [18,19,4]. A
consequence of this is that solving (1) cannot possibly guarantee that the model behaves (in free run) like the system,
although it is hoped that it will be possible to come close to achieving that goal. The bottom line is that even if the model
data approximates the measured data in terms of a particular choice of J, that does not imply that the model will generally
behave like the system. Nevertheless, there is important information in the one-step-ahead prediction errors that can be
revealed and used in system identification [20] and master–slave synchronization techniques [19,21].
It should be clear that in minimising JðZ,ZM Þ, the choice of the model data ZM will have a direct influence on the results.
Not taking into account the demand for computing resources, the use of model data ZMs should be preferred because they
convey more information on the system dynamics. PEM use ZM1 and, in this case, the optimisation problem is quite
manageable. On the other hand, free-run simulation data reveal dynamical features of the system more directly. As a
simple example, consider an unstable model. The free-run simulation of such a model will immediately reveal that it is
unstable, however this feature will not necessarily become apparent from the one-step-ahead predictions. Therefore, in a
sense, the ideal situation would be to use ZMs , although the optimisation problem becomes significantly more complex.
This seems to be the motivation of using free-run data in some recent works [4,6,7]. A fair question to ask is if ZMs should
be preferred in every situation. One of the aims of this paper is to give an answer to that question.
The use of ZMs in the optimisation problem (1) turns out to be computationally demanding and would probably not apply
easily to systems with positive Lyapunov exponents, nor to time series models for which the output will usually settle to a fixed
point in the absence of a driving input and a disturbance variable. Multiple-shooting parameter estimation circumvents some of
the problems that arise when ZM is built with one-step-ahead predictions, without going to the extreme of using free-run
simulation [22]. Nevertheless, multiple shooting is also known to be computationally demanding.
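A rough sketch of the segment-wise idea (not the scheme of [22], whose segment-continuity constraints are omitted here; the model map and data are hypothetical): a cost that re-initialises short free-run segments with measured outputs, so that seg_len = 1 recovers the one-step-ahead MSE while a large seg_len approaches free-run simulation:

```python
import numpy as np

def f(y_prev, u_prev, theta):
    # hypothetical first-order NARX map, for illustration only
    return theta[0] * y_prev + theta[1] * u_prev

def multiple_shooting_cost(y, u, theta, seg_len):
    """Mean squared simulation error over short free-run segments, each
    re-initialised with the measured output. The continuity constraints of
    true multiple shooting are omitted in this sketch."""
    err, n = 0.0, 0
    for start in range(1, len(y), seg_len):
        y_prev = y[start - 1]  # measured initial condition for the segment
        for k in range(start, min(start + seg_len, len(y))):
            y_prev = f(y_prev, u[k - 1], theta)  # simulate within the segment
            err += (y[k] - y_prev) ** 2
            n += 1
    return err / n

# hypothetical noise-free data generated from f itself
theta = np.array([0.5, 0.3])
rng = np.random.default_rng(0)
u = rng.uniform(-1, 1, 200)
y = np.zeros(200)
for k in range(1, 200):
    y[k] = f(y[k - 1], u[k - 1], theta)
```

The segment length is the knob that trades prediction-like behaviour (short segments) against simulation-like behaviour (long segments).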
It is assumed that a set of data Z is available from a dynamical system S. It is also assumed that a given model structure
M, parameterized by a vector of unknown parameters h 2 Rn , has been previously defined.
In this article we would like to investigate the optimisation problem (1) in the context of nonlinear models. To this end
two different types of model data sets ZM will be considered: one-step-ahead predictions, ZM1 , and free-run simulation
data, ZMs . To assess the roles played by ZM1 and ZMs , evolutionary algorithms (Section 3) will be used to minimise the cost
functions J1 ¼ MSEðZ,ZM1 Þ and Js ¼ MSEðZ,ZMs Þ, where MSE stands for the mean squared error. Thus, mono-objective (using
either J1 or Js) and multi-objective (using both J1 and Js) problems will be solved by evolutionary algorithms.
Evolutionary algorithms will be used to find the vector of unknown parameters θ ∈ R^n. To solve mono-objective
problems, GAs [11] were implemented. GAs work with a population of individuals (potential solutions) and are therefore
less susceptible to getting stuck at local minima. In the implemented GAs, all variables of interest (the model parameters
that form an individual's chromosome) are real-coded. The use of real values instead of the traditional binary coding brings
advantages such as no precision loss due to discretization and lower memory requirements [17].
The first step of the algorithm is to randomly create an initial population of individuals (the number of variables depends on
the model structure). Secondly, the fitness of each population member is computed based on its performance in the objective
function (J1 or Js); the higher the fitness value, the better the performance, and thus the greater the chance of surviving over
the generations. The best individuals are then selected by the stochastic universal sampling procedure [23], in which a
roulette wheel with equally spaced pointers provides a selection procedure with no bias and lower bounded spread.
After selecting the individuals, the standard crossover and mutation genetic operators are applied to the population.
The heuristic crossover, which returns a child closer to the best parent, and the Gaussian mutation, which adds a random
number drawn from a normal distribution to the individual's variables, were implemented. In addition, the best individual
of the current population is included in the new one (elitism strategy). The aforementioned procedures are repeated until
the user-defined maximum number of generations is reached.
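The mono-objective loop described above can be sketched as follows. This is a minimal real-coded GA; the population size, bounds, mutation scale and the roulette-style selection stand-in (used here in place of stochastic universal sampling) are illustrative assumptions, not the exact implementation of the paper:

```python
import numpy as np

def real_coded_ga(cost, dim, pop_size=40, generations=100,
                  bounds=(-2.0, 2.0), sigma=0.1, seed=0):
    """Minimal real-coded GA sketch: heuristic crossover pulls a child
    toward the better parent, Gaussian mutation perturbs genes, and
    elitism keeps the best individual. 'cost' plays the role of J1 or Js."""
    rng = np.random.default_rng(seed)
    lo, hi = bounds
    pop = rng.uniform(lo, hi, (pop_size, dim))
    for _ in range(generations):
        J = np.array([cost(ind) for ind in pop])
        best = pop[J.argmin()].copy()            # elite individual
        fitness = 1.0 / (1.0 + J)                # higher is better
        # roulette-style selection (stand-in for stochastic universal sampling)
        p = fitness / fitness.sum()
        parents = pop[rng.choice(pop_size, 2 * pop_size, p=p)]
        children = np.empty_like(pop)
        for i in range(pop_size):
            a, b = parents[2 * i], parents[2 * i + 1]
            if cost(a) > cost(b):
                a, b = b, a                      # 'a' is the better parent
            r = rng.uniform(0, 1, dim)
            children[i] = b + r * (a - b)        # heuristic crossover
        children += rng.standard_normal(children.shape) * sigma  # Gaussian mutation
        children[0] = best                       # elitism
        pop = np.clip(children, lo, hi)
    J = np.array([cost(ind) for ind in pop])
    return pop[J.argmin()]
```

Because of elitism the best cost is non-increasing over generations, which is the property that makes the scheme usable on the nonconvex Js landscape.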
To solve multi-objective optimisation problems (MOPs) there are several approaches available in the literature.
Evolutionary algorithms (EAs) seem to be a natural choice for MOPs since they can find multiple solutions in a single
run and are able to solve complex problems, being less susceptible to difficulties involving discontinuities and
multimodality [24]. More recently, they have been applied to NARX (Nonlinear AutoRegressive with eXogenous input)
model identification [25,14,16].
A wide variety of implementations of evolutionary algorithms has been proposed to solve MOPs. In this work, the
improved nondominated sorting genetic algorithm (NSGA-II) [26] will be used to solve problem (6) (see Algorithm 1). Also,
the use of EA is justified by the rather general treatment that we aim at, because they could be used, if desired, to estimate
parameters of model structures that are not linear in the parameters, as will be the case in a couple of examples (Section 4).
The new population function (Algorithm 1) creates a new random population. In its implementation, some samples (30)
are randomly selected from the available set of data Z and the LS algorithm is applied to find the parameters of the defined
structure; this is repeated Np times.
The fast nondominated sort function is a procedure that sorts the population into different layers of non-domination.
Firstly, the Pareto set of the population, F1, is determined. After that, the individuals that belong to F1 are excluded from
the process and the next Pareto front, F2, is obtained. This procedure is repeated until every individual has been assigned
to a layer.
The crowding distance assignment is a function that estimates the density of solutions surrounding a particular one in the
population. Together with the fast nondominated sort function, it plays an important role during the selection procedure,
namely keeping a certain degree of diversity in the population.
The selection procedure is implemented by means of the stochastic tournament, where two individuals are randomly
selected and the best one, considering its non-domination rank and its crowding distance, is selected.
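The two NSGA-II building blocks just described can be sketched as follows; this is a minimal illustration for a cost matrix in which smaller is better in every objective, not the reference implementation of [26]:

```python
import numpy as np

def fast_nondominated_sort(costs):
    """Layer the population into Pareto fronts F1, F2, ...
    costs: (n, m) array; smaller is better in every objective."""
    n = len(costs)
    dominates = [set() for _ in range(n)]   # dominates[i]: indices dominated by i
    dom_count = np.zeros(n, dtype=int)      # how many individuals dominate i
    fronts = [[]]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            if np.all(costs[i] <= costs[j]) and np.any(costs[i] < costs[j]):
                dominates[i].add(j)
            elif np.all(costs[j] <= costs[i]) and np.any(costs[j] < costs[i]):
                dom_count[i] += 1
        if dom_count[i] == 0:
            fronts[0].append(i)             # i belongs to F1
    k = 0
    while fronts[k]:
        nxt = []
        for i in fronts[k]:
            for j in dominates[i]:
                dom_count[j] -= 1
                if dom_count[j] == 0:       # j moves to the next front
                    nxt.append(j)
        fronts.append(nxt)
        k += 1
    return fronts[:-1]

def crowding_distance(costs):
    """Per-individual density estimate used to preserve diversity."""
    n, m = costs.shape
    dist = np.zeros(n)
    for obj in range(m):
        order = np.argsort(costs[:, obj])
        dist[order[0]] = dist[order[-1]] = np.inf  # boundary points are kept
        span = costs[order[-1], obj] - costs[order[0], obj]
        if span == 0:
            continue
        for p in range(1, n - 1):
            dist[order[p]] += (costs[order[p + 1], obj] -
                               costs[order[p - 1], obj]) / span
    return dist
```

In the selection tournament, an individual with a lower front index wins; ties within a front are broken in favour of the larger crowding distance.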
The crossover and mutation operators are implemented in the same way as in the mono-objective genetic algorithms (GAs)
[11]. The algorithm was implemented with real-coded GAs, using the real-biased crossover [27]. The mutation operator
adds a random number drawn from a Gaussian distribution.
4. Simulation studies
In this section we consider five numerical examples for which parameters were estimated using one-step-ahead predictions
and free-run predictions in either bi-objective or mono-objective frameworks. Unless otherwise stated, all optimisation problems
were solved using NSGA-II or GAs, as discussed in Section 3. In most examples, the true model structure was used because it is
the only way in which we could have parameter estimates comparable to the true ones. In practical problems there is no true
model structure and no true parameter values. In such cases, sophisticated procedures should be used to determine the model
structure (see the recent works [7,28,29] and references therein). Different model classes have been compared in [30].
4.1. Example 1
with J = [J1 Js]. If the objective functions are conflicting, instead of a single solution, a set of solutions,
namely the Pareto-optimal solutions, is obtained, which is the case in Fig. 1a. This figure clearly reveals that in the case of white
output noise, which translates into nonlinear colored noise in the regression equation (see model (5)), J1 and Js are competing
objectives in the sense that the minimisation of one will result in an increase of the other, and vice versa. In the absence of
noise, J1 and Js are not competing; that is, the minimisation of either will yield the same result, as expected.
Fig. 1b presents the obtained Pareto set evaluated on a clean validation data set. The models with best simulation error
performance on the estimation data achieved better performance in both prediction and simulation errors on the validation data.
It is noticed that the free-run performance of the model with parameters estimated by LS is significantly worse than that
of the model with true parameters (compare the cross and the triangle). On the other hand, the free-run performances of
the true model and of the model estimated by ELS are similar (compare the square and the triangle). In this case, there is
no relevant difference between the results obtained by minimising Js and those yielded by the ELS algorithm (see Table 1).
Using the EA in this case would incur a much greater computational cost, which might not be balanced by the advantage
that no noise model is required in the EA.
It is noted that in Fig. 1a, J1 for the model estimated by ELS is larger than the same cost function for the model estimated
by LS. This is justified by the fact that only the process terms (autoregressive and exogenous) of the ELS model are used to
compute the indices J1 and Js over the estimation data (see the process prediction in Section 2.2). On the other hand, in
Fig. 1b, where validation data are used, J1 and Js are worse for LS, as expected.
Both the mean square error (MSE) and the mean absolute error (MAE) were tested. Only the results with MSE are
reported. As far as bias is concerned, the results with MAE are equivalent, although the parameter variance is slightly
greater. MSE was chosen to make the cost function values comparable to those of LS and ELS, used in this example. In the
context of GAs, the use of MAE is attractive because it is computationally less expensive than MSE.
In this example, a smoothed input (an AR(2) process) was used. In [4], the use of this type of signal together with a
prediction-error-based structure selection algorithm yielded spurious terms in the model. Here, even with such an
input, the minimisation of J1 resulted in unbiased estimation as long as the right model structure was used. In other
words, the use of a smoothed input does pose difficulties for some structure selection algorithms (see [4] for details)
Fig. 1. Pareto sets for 500 models for which both cost functions were evaluated using estimation data: (a) y(k) = w(k) + e(k) with e(k) ~ WGN(0, 0.02), and
validation data: (b) y(k) = w(k). The models with parameters estimated by LS and ELS are indicated by a cross and a square, respectively, and the dense set
of circles indicates those estimated by NSGA-II. The values of J1 and Js when the true parameters are used in the model are indicated by a triangle. In the
case of models estimated by ELS, only the process (autoregressive and exogenous) terms were used to compute the cost functions. In (b) the square and
triangle are at the origin.
whereas it did not for the unbiased algorithms in this example. This confirms that the structure selection problem is clearly
more subtle and difficult than parameter estimation in nonlinear system identification.
4.2. Example 2
This example will consider the system in (4) anew with two important differences. We would like to investigate how J1 and Js
perform in the case of equation error and in the case of model structure mismatch. The results are summarised in Fig. 2.
Table 1
Monte Carlo simulation with 1000 runs using mono-objective GAs.

                θ1                              θ2                              θ3
                θm     θM     θ̄      σθ        θm     θM     θ̄      σθ        θm      θM      θ̄       σθ
MSE(Z, ZMs)     0.735  0.769  0.750  0.005     0.233  0.265  0.250  0.005     −0.221  −0.179  −0.201  0.005
MSE(Z, ZM1)     0.576  0.665  0.625  0.013     0.304  0.382  0.343  0.012     −0.344  −0.253  −0.294  0.012
LS              0.574  0.677  0.626  0.014     0.301  0.384  0.343  0.013     −0.337  −0.251  −0.293  0.013
ELS             0.723  0.769  0.750  0.007     0.229  0.270  0.250  0.007     −0.224  −0.179  −0.200  0.007

MSE, mean square error. The columns that use the MSE refer to the GAs. θm, θM, θ̄ and σθ indicate, respectively, the minimum, maximum, mean and
standard deviation over the 1000 runs. True parameter values: θ1 = 0.75, θ2 = 0.25 and θ3 = −0.2. The noise is e(k) ~ WGN(0, 0.02).
The following remarks on the results shown in Fig. 2 can be made. First and foremost, minimisation of Js
proved better in the output error (OE) case regardless of the model structure. In the equation error (EE)
situation minimisation of J1 was preferable for correct and overparametrised model structure (with the true model
included), but not for underparametrised cases (Fig. 2f). It should be noticed that in the underparametrised cases
(Fig. 2c and f) for data sets with N = 2000 and smaller Js and J1 were not statistically different at 95% confidence level. In
particular, for the OE case the difference was not significant at such a level regardless of the data length within the
investigated range.
For short data sets Js performs poorly in the EE case, especially when the model is slightly overparametrised. Finally, it is
worth pointing out that the overparametrisation is asymptotically resolved only because the spurious terms are from a
non-spurious term cluster, which is not a particularly grave model mismatch [32]. The results are basically the same for
increasing noise levels. The finding in this example could partially explain why the papers in which the use of Js is
recommended usually only consider OE examples. A hint as to how to interpret the results for the correct model structure
case is provided in Appendix A.
4.3. Example 3
In this example we investigate the case of a rational nonlinear model which, being nonlinear-in-the-parameters,
presents great challenges for parameter estimation, even when the model structure is known [33,34]. The model used in
this example, taken from [35], is
w(k) = [0.3 w(k−1) w(k−2) + 0.7 u(k−1)] / [1 + w(k−1)² + u(k−1)²],   y(k) = w(k) + e(k),   (7)
where the input u(k) was a uniformly distributed random sequence with zero mean and variance 0.33, and the noise e(k) was
WGN(0, 0.01). Input and output time series with N = 1000 were used to estimate the model parameters. Also considered was
the EE version of model (7):
y(k) = [0.3 y(k−1) y(k−2) + 0.7 u(k−1)] / [1 + y(k−1)² + u(k−1)²] + e(k).   (8)
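Data generation for this example can be sketched as follows; the OE form y(k) = w(k) + e(k) and the zero initial conditions are assumptions of this sketch:

```python
import numpy as np

def simulate_rational(u, e):
    """Free-run simulation of the rational system of this example; w is the
    noise-free output and y = w + e is the measured (OE) output. Zero
    initial conditions are assumed."""
    N = len(u)
    w = np.zeros(N)
    for k in range(2, N):
        num = 0.3 * w[k - 1] * w[k - 2] + 0.7 * u[k - 1]
        den = 1.0 + w[k - 1] ** 2 + u[k - 1] ** 2
        w[k] = num / den
    return w, w + e

rng = np.random.default_rng(1)
u = rng.uniform(-1, 1, 1000)           # zero mean, variance 1/3 ~ 0.33
e = 0.1 * rng.standard_normal(1000)    # WGN(0, 0.01)
w, y = simulate_rational(u, e)
```

Note that the denominator is at least 1, so the recursion is well defined and bounded; the same recursion with y fed back in place of w gives the EE version (8).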
Fig. 2. Asymptotic behaviour of validation MSE of estimated NARX models using J1 (solid) and Js (dashed). Samples refers to the number of data used for
parameter estimation. The MSE was computed using model free-run simulation of length 2 × 10^4 and noise-free (true) data. The 95% error bars were
computed from 50 Monte Carlo runs using mono-objective GAs. (a) OE with correct structure, (b) OE with added spurious terms y(k−2)u(k−2) and
y(k−1)u(k−2), (c) OE with only linear terms y(k−2) and u(k−1), (d) EE with correct structure, (e) EE with the same spurious terms as in (b), and (f) EE
with the same linear model as in (c).
Table 2
Monte Carlo simulation over 1000 runs using mono-objective GAs.

              EE                                 OE
              θm      θM     θ̄      σθ          θm      θM     θ̄      σθ
MSE(Z, ZMs)
θ1            −0.282  0.739  0.269  0.143       0.250   0.834  0.293  0.142
θ2            0.517   0.947  0.680  0.064       0.454   0.905  0.703  0.062
θ3            −1.660  5.082  0.919  1.060       −2.659  4.175  1.044  0.939
θ4            0.510   2.200  0.994  0.224       0.452   1.773  1.016  0.225
MSE(Z, ZM1)
θ1            0.036   0.634  0.304  0.097       −0.135  0.388  0.107  0.079
θ2            0.550   0.905  0.701  0.056       0.496   0.905  0.669  0.048
θ3            0.045   3.280  1.037  0.466       −0.619  1.885  0.152  0.324
θ4            0.368   1.789  1.002  0.221       0.395   1.971  0.963  0.197

θm, θM, θ̄ and σθ indicate, respectively, the minimum, maximum, mean and standard deviation over the 1000 runs. True parameters: θ1 = 0.3, θ2 = 0.7,
and θ3 = θ4 = 1 (see models in Eqs. (7) and (8)). Boldface values indicate bias.
4.4. Example 4
A challenge in estimation theory is the errors-in-variables problem, which has demanded alternative algorithms [39,40].
In this example we consider the problem investigated in [41]

w(k) = −0.5 w(k−1) + x(k) + 4 x(k−1),

where the input x(k) = 0.5 e(k−1) + e(k) is an MA(1) process, e(k) is WGN(0, 1), and N = 500 data points are used.
As before, GA estimators were used to minimise J1 and Js. The results of a 1000-run Monte Carlo study showed that
none of the estimators tested was unbiased whenever input noise was present. It is important to notice that the amount of
input noise used (σ² = 1 corresponds to 100%) was quite considerable, whereas the variance of the output noise was
around 7% of the output variance. Fig. 3 shows the results for smaller amounts of input noise.
It is noticed that for input noise σ² = 0.2 (around 20%), the true value of the parameter falls within (or almost within)
the 95% confidence band of the Js estimator for θ1 and θ3; this is not true for the estimator based on J1. The true value θ2
is within the 95% confidence bands for both estimators (using J1 and Js), although in such cases the estimator variance is
considerably greater.
4.5. Example 5
This example considers the minimisation of J1 and Js in a mono-objective fashion as a way of estimating the weights
(training) of multi-layer perceptron (MLP) networks in the presence of equation or output noise. The noise-free case for
another system was investigated in [6] and multi-objective network training has been reported in [45,42]. Parameter
estimation techniques for networks were recently put forward in [43]. Consider the system [44]:

w(k) = w(k−1) / [1 + w(k−1)²] + u(k−1)³ + e_e(k),
Fig. 3. Results for a 1000-run Monte Carlo simulation for varying levels of input noise (σ²), indicated on the vertical axis, and a fixed level of
output noise (σ1 = 1). In every case mono-objective GAs were used to perform the optimisation. (a) Parameter θ1 of the term y(k−1), (b) parameter θ2 of
the term u(k), and (c) parameter θ3 of the term u(k−1). The vertical line indicates the true values. The means of the values estimated using J1 (crosses) or
Js (squares) as objective functions are shown. The 95% confidence bands are indicated with dashed lines for J1 and with dotted lines for Js.
Table 3
Simulation and prediction error performances (MSE on a noise-free validation dataset) of 100 MLP networks trained by mono-objective GAs using either
J1 or Js. OE: e_o(k) ~ WGN(0, √3) and e_e(k) = 0; EE: e_e(k) ~ WGN(0, √3) and e_o(k) = 0.

Validation (free-run simulation)
EE    OE
However, unlike the results presented in Sections 4.1–4.3, in the present example minimisation of Js outperformed minimisation of J1 in both the OE and EE problems. It is not surprising that a model trained by minimising Js performs best in terms of free-run simulation on validation data; however, it was also better in terms of predictions. The reader is reminded that this did not always happen in the other examples, for instance in Section 4.2 in the EE scenario. We conjecture a possible explanation for this result. The OE scenario can be understood as modelling data with measurement uncertainty, whereas the EE case corresponds to model uncertainty. In the previous examples the model classes were polynomial and rational, which are less "flexible" than the MLP networks used in this example. Therefore, the consequences of model uncertainty (EE problems) tend to be more severe for polynomial and rational models.
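The distinction between prediction and simulation performance drawn throughout the paper can be made concrete in code. The sketch below (an illustrative implementation of ours, not taken from the paper) evaluates both cost functions for a generic first-order one-step model f(y(k−1), u(k−1)): J1 feeds f the measured past output, whereas Js feeds f its own past output in a free run.

```python
def j1(f, u, y):
    # One-step-ahead prediction MSE: regressors are the *measured* past outputs.
    errs = [(y[k] - f(y[k - 1], u[k - 1])) ** 2 for k in range(1, len(y))]
    return sum(errs) / len(errs)

def js(f, u, y):
    # Free-run simulation MSE: regressors are the model's *own* past outputs.
    yhat = [y[0]]  # initialised from the measured data
    for k in range(1, len(y)):
        yhat.append(f(yhat[k - 1], u[k - 1]))
    errs = [(y[k] - yhat[k]) ** 2 for k in range(1, len(y))]
    return sum(errs) / len(errs)

# Sanity check with a known first-order model on noise-free data:
f = lambda y1, u1: 0.5 * y1 + u1
u = [1.0, -0.5, 0.2, 0.8, -0.1]
y = [0.0]
for k in range(1, len(u)):
    y.append(f(y[k - 1], u[k - 1]))
print(j1(f, u, y), js(f, u, y))  # → 0.0 0.0 for the true model
```

On noisy data the two costs diverge: an error in f is forgotten after one step under j1 but accumulates through the recursion under js, which is why Js is the harder, non-convex criterion.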
5. Conclusions
When considering linear-in-the-parameters models, unless specific action is taken, such as fitting a nonlinear noise model, bias can be expected if some norm of the one-step-ahead prediction error (J1) is minimised in the case of coloured noise or
white output noise. On the other hand, the minimisation of some norm of the simulation error (Js) is another way of avoiding bias in the OE case. Effective though it is, minimising Js is also computationally demanding.
When the correct model structure (including the noise model) is assumed, minimisation of J1 and of Js give fairly similar results, in the sense that the solutions in each case are practically equivalent. On the other hand, when the model class is incorrectly specified (e.g. choosing a NARX instead of a NARMAX model), the minimisation of Js is clearly more robust and, unlike the minimisation of J1, often yields unbiased results. However, the minimisation of Js is harder and far more time consuming. The aforementioned conclusions hold true even when the input is smoothed noise (e.g. an AR(2) process).
In the examples discussed in this paper, the minimisation of Js was superior to that of J1 only in the output error case, not in the equation error case, for correct-structure and slightly overparametrised models. For underparametrised models Js was slightly superior to J1 in both cases (OE and EE). For MLP network models Js was superior to J1 in both cases (OE and EE); however, the advantages of using Js were statistically more significant in the OE case. These results could help explain why, in the literature, OE problems are used to justify the use of Js over J1.
For nonlinear-in-the-parameters models, the computational advantage of using J1 over Js is diminished because in both cases a non-convex optimisation problem must be solved.
For the investigated errors-in-variables problem, all estimators were biased for large (100%) input noise. Although the GA estimators using both J1 and Js were biased, the bias with Js was typically much smaller. This is consistent with the observation that minimisation of Js is preferable in the OE situation, because in the errors-in-variables problem noise is added to the output (and input) and not to the equation. Also, for an input noise standard deviation of around 0.2 (about 20% of noise power) the true values fall within the 95% confidence bands around the parameter values estimated by the GA with Js, but not with J1. Therefore, it seems fair to conclude that, at least for the considered example, minimising Js provides a solution with tolerable bias for the errors-in-variables problem as long as the input noise variance is not excessive (less than about 20% in the investigated example).
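The bias behaviour summarised above can be illustrated on the linear OE model analysed in Appendix A. In the sketch below (our illustration: parameter values are arbitrary, and a brute-force line search over the single parameter a stands in for the GA used in the paper), minimising the one-step cost J1 on noisy outputs yields a noticeably attenuated estimate, whereas minimising the free-run cost Js recovers a value close to the true one.

```python
import random

rng = random.Random(1)
a_true, b_true, noise_std, N = 0.7, 1.0, 0.5, 2000

# Generate OE data, Eqs. (11)-(12): clean dynamics, noise added only to the output.
u = [rng.gauss(0, 1) for _ in range(N)]
y_clean = [0.0, 0.0]
for k in range(2, N):
    y_clean.append(a_true * y_clean[k - 2] + b_true * u[k - 1])
y = [yc + rng.gauss(0, noise_std) for yc in y_clean]

def J1(a):
    # One-step-ahead prediction MSE: measured outputs as regressors.
    return sum((y[k] - a * y[k - 2] - b_true * u[k - 1]) ** 2
               for k in range(2, N)) / (N - 2)

def Js(a):
    # Free-run simulation MSE: the model's own outputs as regressors.
    yh = [y[0], y[1]]
    for k in range(2, N):
        yh.append(a * yh[k - 2] + b_true * u[k - 1])
    return sum((y[k] - yh[k]) ** 2 for k in range(2, N)) / (N - 2)

grid = [i / 200 for i in range(200)]  # crude stand-in for an evolutionary search
a_j1 = min(grid, key=J1)
a_js = min(grid, key=Js)
print(a_j1, a_js)  # the J1 estimate is biased towards zero; Js lands near 0.7
```

The attenuation of the J1 estimate is the classical errors-in-regressors effect: the noisy y(k−2) enters the regressor vector, which is exactly the mechanism analysed in Appendix A.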
Acknowledgement
Appendix A
The aim of this appendix is to analyse a simple example in order to gain some theoretical insight into the roles of prediction and simulation error minimisation in both the EE and OE cases when the model structure is correct.

The available (measured) system output is y, and the desired output, which is the true system output, is ỹ. Minimising the cost functions J1 or Js corresponds to minimising MSE(y − ŷ1) and MSE(y − ŷs), respectively, where ŷ1 and ŷs are the estimated model outputs, as defined in Section 2.2. Ideally, the quantities to be minimised would be MSE(ỹ − ŷ1) and MSE(ỹ − ŷs). However, ỹ is unknown and the measured data y are used instead. This is the source of some of the problems pointed out in the examples.
For the sake of argument, consider the following linear output error (OE) model:
ỹ(k) = a ỹ(k−2) + b u(k−1), (11)

y(k) = ỹ(k) + e(k). (12)

Substituting (11) in (12) we get

y(k) = a ỹ(k−2) + b u(k−1) + e(k). (13)

From (12) it is clear that ỹ(k−2) = y(k−2) − e(k−2), which can be substituted in (13) to get

y(k) = a y(k−2) + b u(k−1) − a e(k−2) + e(k). (14)

Eqs. (13) and (14) express the data y based on past clean data ỹ and measured data y, respectively. These two expressions will be important in comparing free-run simulations and one-step-ahead predictions in what follows.
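The algebra leading to (13) and (14) can be checked numerically. The following sketch (our illustration, with arbitrary parameter values) simulates (11)–(12) and verifies that both expressions reproduce the measured output exactly.

```python
import random

rng = random.Random(0)
a, b, N = 0.8, 0.5, 50
u = [rng.gauss(0, 1) for _ in range(N)]
e = [rng.gauss(0, 0.3) for _ in range(N)]

# Eqs. (11)-(12): clean dynamics plus additive output noise.
yt = [0.0, 0.0]  # yt stands for the noise-free output, y-tilde
for k in range(2, N):
    yt.append(a * yt[k - 2] + b * u[k - 1])
y = [yt[k] + e[k] for k in range(N)]

for k in range(2, N):
    rhs13 = a * yt[k - 2] + b * u[k - 1] + e[k]              # Eq. (13)
    rhs14 = a * y[k - 2] + b * u[k - 1] - a * e[k - 2] + e[k]  # Eq. (14)
    assert abs(y[k] - rhs13) < 1e-12 and abs(rhs13 - rhs14) < 1e-12
print("Eqs. (13) and (14) both reproduce the measured output")
```

The key point the check makes visible is the extra term −a e(k−2) in (14): rewriting the clean regressor in terms of the measured one injects past noise into the equation, which is what correlates the regressor with the error when J1 is minimised.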
Assume that the same structure of model (11) is used and that parameters are estimated by minimising the J1 cost
function. In such a case, the estimated model output (one-step-ahead prediction) is
Let us now perform the same analysis, minimising MSE(y − ŷs) instead. In such a case, the estimated model output (free-run simulation) is
y(k) = a ŷs(k−2) + b u(k−1) + e(k) + υ(k), (19)

where υ is the weighted sum of the equation error noises added in previous steps. Comparing (19) and (17), it follows that

υ(k) = a [ỹ(k−2) − ŷs(k−2)], (20)

and using the relation in (18),

υ(k) = a [e(k−2) + υ(k−2)]. (21)

Expression (21) can be used recursively to write:

υ(k−2) = a [e(k−4) + υ(k−4)],
References
[15] J.M. Herrero, X. Blasco, M. Martínez, C. Ramos, J. Sanchis, Non-linear robust identification using evolutionary algorithms: application to a biomedical process, Eng. Appl. Artif. Intell. 21 (8) (2008) 1397–1408.
[16] L.S. Coelho, M.W. Pessoa, Nonlinear model identification of an experimental ball-and-tube system using a genetic programming approach, Mech.
Syst. Signal Process. 23 (5) (2009) 1434–1446.
[17] K. Valarmathi, D. Devaraj, T.K. Radhakrishnan, Real-coded genetic algorithm for system identification and controller tuning, Appl. Math. Modelling
33 (2009) 3392–3401.
[18] L.A. Aguirre, S.A. Billings, Validating identified nonlinear models with chaotic dynamics, Int. J. Bifurcation and Chaos 4 (1) (1994) 109–125.
[19] L.A. Aguirre, E.C. Furtado, L.A.B. Tôrres, Evaluation of dynamical models: dissipative synchronization and other techniques, Phys. Rev. E 74 (2006) 019612.
[20] Q.M. Zhu, L.F. Zhang, A. Longden, Development of omni-directional correlation functions for nonlinear model validation, Automatica 43 (2007)
1519–1531.
[21] L.A.B. Tôrres, Discrete-time dynamical system synchronization: information transmission and model matching, Physica D 228 (2007) 31–39.
[22] H.U. Voss, J. Timmer, J. Kurths, Nonlinear dynamical system identification from uncertain and indirect measurements, Int. J. Bifurcation and Chaos 14
(6) (2004) 1905–1933.
[23] J.E. Baker, Reducing bias and inefficiency in the selection algorithm, in: Proceedings of the Second International Conference on Genetic Algorithms
and their Application, Lawrence Erlbaum Associates, Inc., Mahwah, NJ, USA, 1987, pp. 14–21.
[24] C.M. Fonseca, P.J. Fleming, An overview of evolutionary algorithms in multiobjective optimization, Evol. Comput. 3 (1) (1995) 1–16. doi:10.1162/evco.1995.3.1.1.
[25] K. Rodríguez-Vázquez, P.J. Fleming, Evolution of mathematical models of chaotic systems based on multiobjective genetic programming, Knowl. Inf. Syst. 8 (2005) 235–256.
[26] K. Deb, S. Agrawal, A. Pratap, T. Meyarivan, A fast elitist non-dominated sorting genetic algorithm for multi-objective optimisation: NSGA-II, IEEE
Trans. Evol. Comput. 6 (2) (2002) 182–197.
[27] R.H.C. Takahashi, J.A. Vasconcelos, J.A. Ramirez, L. Krahenbuhl, A multiobjective methodology for evaluating genetic operators, IEEE Trans. Magn. 39
(3) (2003) 1321–1324.
[28] H.L. Wei, S.A. Billings, Model structure selection using an integrated forward orthogonal search algorithm interfered with squared correlation and
mutual information, Int. J. Modelling, Identification and Control 3 (4) (2008) 341–356.
[29] X. Hong, R.J. Mitchell, S. Chen, C.J. Harris, K. Li, G.W. Irwin, Model selection approaches for non-linear system identification: a review, Int. J. Syst. Sci.
39 (10) (2008) 925–946.
[30] U. Parlitz, A. Hornstein, D. Engster, F. Al-Bender, V. Lampaert, T. Tjahjowidodo, S.D. Fassois, D. Rizos, C.X. Wong, K. Worden, G. Manson, Identification
of pre-sliding friction dynamics, CHAOS 14 (2) (2004) 420–430.
[31] P.C. Young, The use of linear regression and related procedures for the identification of dynamical processes, in: Proceedings of the Seventh IEEE Symposium on Adaptive Processes, University of California, Los Angeles, IEEE, New York, 1968.
[32] E.M.A.M. Mendes, S.A. Billings, On overparametrization of nonlinear discrete systems, Int. J. Bifurcation and Chaos 8 (3) (1998) 535–556.
[33] S.A. Billings, K.Z. Mao, Rational model data smoothers and identification algorithms, Int. J. Control 68 (2) (1997) 297–310.
[34] D. Wu, Z. Ma, S. Yu, Q.M. Zhu, An enhanced back propagation algorithm for parameter estimation of rational models, Int. J. Modelling, Identification
and Control 5 (1) (2008) 27–37.
[35] Q.M. Zhu, An implicit least squares algorithm for nonlinear rational model parameter estimation, Appl. Math. Modelling 29 (2005) 673–689.
[36] S.A. Billings, S. Chen, Identification of nonlinear rational systems using a predictor-error estimation algorithm, Int. J. Syst. Sci. 20 (3) (1989) 467–494.
[37] S.A. Billings, Q.M. Zhu, Rational model identification using an extended least-squares algorithm, Int. J. Control 54 (3) (1991) 529–546.
[38] Q.M. Zhu, S.A. Billings, Recursive parameter estimation for nonlinear rational models, J. Syst. Eng. 1 (1991) 63–76.
[39] H.L. Wei, Y. Zheng, Y. Pan, D. Coca, L.M. Li, J.E.W. Mayhew, S.A. Billings, Model estimation of cerebral hemodynamics between blood flow and volume
changes: a data-based modeling approach, IEEE Trans. Biomed. Eng. 56 (6) (2009) 1606–1616.
[40] H.L. Wei, S.A. Billings, Improved parameter estimates for non-linear dynamical models using a bootstrap method, Int. J. Control 82 (11) (2009)
1991–2001.
[41] P. Stoica, A. Nehorai, On the uniqueness of prediction error models for systems with noisy input-output data, Automatica 23 (1987) 541–543.
[42] I. Kokshenev, A.P. Braga, A multi-objective approach to RBF network learning, Neurocomputing 71 (7–9) (2008) 1203–1209.
[43] B. Póczos, A. Lőrincz, Identification of recurrent neural networks by Bayesian interrogation techniques, J. Mach. Learn. Res. 10 (2009) 515–554.
[44] K.S. Narendra, K. Parthasarathy, Identification and control of dynamical systems using neural networks, IEEE Trans. Neural Networks 1 (1990) 4–27.
[45] R.D. Teixeira, A.P. Braga, R.C.H. Takahashi, R.R. Saldanha, Improving generalization of MLPs with multi-objective optimization, Neurocomputing 35
(2000) 189–194.