SI Nonlin
Article info

Article history:
Received 28 March 2013
Received in revised form 20 March 2014
Accepted 9 July 2014
Available online 4 August 2014

Keywords:
Bayesian model updating
Nonlinear system identification
Markov chain Monte Carlo
Simulated Annealing
Deviance Information Criterion

Abstract

This work details the Bayesian identification of a nonlinear dynamical system using a novel MCMC algorithm: ‘Data Annealing’. Data Annealing is similar to Simulated Annealing in that it allows the Markov chain to easily clear ‘local traps’ in the target distribution. To achieve this, training data is fed into the likelihood such that its influence over the posterior is introduced gradually - this allows the annealing procedure to be conducted with reduced computational expense. Additionally, Data Annealing uses a proposal distribution which allows it to conduct a local search accompanied by occasional long jumps, reducing the chance that it will become stuck in local traps. Here it is used to identify an experimental nonlinear system. The resulting Markov chains are used to approximate the covariance matrices of the parameters in a set of competing models before the issue of model selection is tackled using the Deviance Information Criterion.

© 2014 The Author. Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/3.0/).
1. Introduction
This paper is concerned with the system identification of a nonlinear dynamical system using experimentally obtained
training data. A probabilistic, Bayesian approach is utilised throughout. Such an approach is now well established in the
structural dynamics community – relatively recent advances include the use of Bayesian methods in structural health
monitoring [1], modal identification [2], state-estimation [3] (through use of the particle filter), the sensitivity analysis of
large bifurcating nonlinear models [4] as well as an interesting study investigating the relations between frequentist and
Bayesian approaches to probabilistic parameter estimation [5].
The identification problem detailed herein is one of model selection as well as parameter estimation such that, using
experimental data D, one must endeavor to find the optimum model M from a set of competing model structures as well as
estimate the parameter vector θ of that particular model. Using Bayes' theorem a measure of the plausibility of a parameter
vector θ, given experimental data D and assumed model structure M, is given by
$$P(\theta \mid D, M) = \frac{P(D \mid \theta, M)\, P(\theta \mid M)}{P(D \mid M)} \tag{1}$$
where P(θ|D, M) is the posterior probability density function (PDF) which one wishes to evaluate, P(D|θ, M) is termed the likelihood, P(θ|M) the prior and P(D|M) the evidence. The likelihood represents the probability that the experimental training data D was witnessed according to the model M with parameters θ. Defining the likelihood requires the selection of an error-prediction model which describes the uncertainties present in the measurement and modelling processes (see [6] for a detailed discussion of error-prediction models). The prior is a PDF which represents one's parameter estimates for model M before the training data was known. The evidence is a normalising constant which ensures that the posterior PDF integrates to one.
http://dx.doi.org/10.1016/j.ymssp.2014.07.010
0888-3270 © 2014 The Author. Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/3.0/).
P.L. Green / Mechanical Systems and Signal Processing 52-53 (2015) 133–146
This paper makes two main contributions. Firstly, a novel variant of Simulated Annealing (referred to as Data Annealing)
is proposed and applied to a real system identification problem. It is shown to be computationally cheap and easy to tune.
Secondly, it is shown that the issue of model selection of a real nonlinear dynamical system can be addressed using the
Deviance Information Criterion (DIC). For the sake of readability the remainder of the introduction is split into two sections.
The first outlines the motivation for the Data Annealing algorithm while the second focuses on the issue of model selection.
For the case where one is attempting to identify N_D parameters (such that θ ∈ R^{N_D}), the evidence is given by
$$P(D \mid M) = \int \cdots \int P(D \mid \theta, M)\, P(\theta \mid M)\, d\theta_1 \cdots d\theta_{N_D}. \tag{2}$$
This integral is usually intractable and its multidimensional nature makes it too computationally expensive to evaluate numerically (if N_D > 2). Relatively early papers such as [7] made use of the property that the maximum a posteriori (MAP) parameter vector remains the same regardless of whether the posterior distribution has been normalised such that, through locating the MAP, a Taylor series expansion of the log posterior could be used to approximate the posterior PDF as a Gaussian (see footnote 1). Since then, an increase in computing power has allowed the adoption of Markov chain Monte Carlo (MCMC) methods. These involve the creation of an ergodic Markov chain whose stationary distribution is equal to the posterior PDF such that, once converged, the Markov chain is generating samples from P(θ|D, M) (see [9] for more information on the convergence of Markov chains). This can be achieved without having to evaluate the evidence term. While many MCMC methods are available in the literature (Hamiltonian Monte Carlo for example [10]), by far the most popular is the Metropolis algorithm. Although well-established, a brief description of the Metropolis algorithm is given here as it helps to establish the motivation for the Data Annealing algorithm presented in Section 2 of this work.
Essentially, the aim of MCMC methods is to generate a sequence of samples {θ^(1), θ^(2), …} from a target PDF π(θ)/Z (where Z is a normalising constant). In the context of this paper, π(θ) represents the unnormalised posterior PDF and Z represents the evidence term. Initialising the Metropolis algorithm from parameter vector θ^(i), a new state θ' is proposed using a user-defined proposal PDF. The proposal PDF is conditional on the current state θ^(i). For example, in the case where a Gaussian proposal is used then the new state is generated according to
$$\theta' \sim \mathcal{N}\left(\theta^{(i)}, \Sigma\right) \tag{3}$$
(where Σ is a user-defined covariance matrix). The new state is then accepted with probability:
$$a = \min\left\{1, \frac{\pi(\theta')}{\pi\left(\theta^{(i)}\right)}\right\}. \tag{4}$$
If accepted then θ^(i+1) = θ', else θ^(i+1) = θ^(i). This has the property that if the proposed state θ' is in a region of higher probability density than the current state then it is always accepted. However, the Markov chain is also able to move into regions of lower probability density. One of the benefits of using such an acceptance rule is that the acceptance probability a can be computed without having to evaluate the evidence term. It can be shown that such an acceptance rule allows the chain to generate samples from π(θ) (for more information references [8,11] are recommended).
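The Metropolis steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions rather than the implementation used in this work: the function name `metropolis`, the standard-normal example target and the proposal covariance are hypothetical choices, and the target is evaluated on a log scale purely for numerical convenience.

```python
import numpy as np

def metropolis(log_target, theta0, cov, n_samples, rng):
    """Random-walk Metropolis with a Gaussian proposal (Eqs. (3) and (4)).

    log_target returns ln(pi(theta)) up to an additive constant, so the
    evidence term is never needed.
    """
    theta = np.atleast_1d(np.asarray(theta0, dtype=float))
    log_p = log_target(theta)
    chain = np.empty((n_samples, theta.size))
    for i in range(n_samples):
        # Propose theta' ~ N(theta^(i), Sigma)
        proposal = rng.multivariate_normal(theta, cov)
        log_p_new = log_target(proposal)
        # Accept with probability min(1, pi(theta') / pi(theta^(i)))
        if np.log(rng.uniform()) < log_p_new - log_p:
            theta, log_p = proposal, log_p_new
        chain[i] = theta
    return chain

# Hypothetical example target: an unnormalised standard normal in one dimension
rng = np.random.default_rng(0)
chain = metropolis(lambda t: -0.5 * np.sum(t**2), [3.0], np.eye(1), 5000, rng)
```

Because only the difference of log-target values enters the acceptance test, any normalising constant cancels, which mirrors the remark that the evidence term is not required.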
The advantages of using MCMC are numerous. Recalling that the purpose of system identification is usually to establish a
reliable model which can be used to accurately and robustly predict the system's future response then, using the notation
outlined in [12], one may want to predict a structural quantity of interest hðθÞ using
$$R = \int \cdots \int h(\theta)\, P(\theta \mid D, M)\, d\theta_1 \cdots d\theta_{N_D}. \tag{5}$$
While evaluating Eq. (5) is difficult (for the same reason it is difficult to evaluate the evidence term), if one has used an MCMC algorithm to generate samples {θ^(1), …, θ^(M)} from the posterior parameter distribution then Eq. (5) can be approximated by
$$R \approx \frac{1}{M} \sum_{i=1}^{M} h\left(\theta^{(i)}\right). \tag{6}$$
Additionally, it has been shown that important information with regard to parameter correlations can be realised through the use of MCMC methods [13] (this is also demonstrated in Section 4 of the present work). However, MCMC also has its disadvantages. Before samples from the target distribution can be drawn in an effective manner, the Markov chain must converge on the globally optimum region of the parameter space. This region can be difficult to locate as it is often very concentrated relative to the size of one's prior distribution. Additionally, the Markov chain may become ‘stuck’ in a region of probability density which is not the global optimum. Throughout this paper these regions are referred to as ‘local traps’.
Footnote 1: For more information the reader may wish to consult the description of the Laplace approximation given in reference [8].
The issue of local trapping led to the development of the Simulated Annealing algorithm [14]. This involves the introduction of a factitious temperature variable T (see footnote 2) such that, at high temperatures, the Markov chain is able to easily travel over local traps in the parameter space. The temperature variable is then reduced such that the fine details of the target distribution are gradually introduced; this is demonstrated graphically for a bimodal target PDF in Fig. 1 (where π_T represents one's target distribution at temperature T). The rate at which T is reduced is commonly referred to as the annealing schedule.
Although this does not guarantee that the chain will converge on the optimum region of parameter space, Simulated
Annealing has been established as a reliable optimisation algorithm. Soon after it was introduced several variants of
Simulated Annealing were proposed [15,16] in which the spread of the proposal PDF is initially set to be large but then
reduces with temperature T (at a user-defined rate), thus encouraging the Markov chain to make large jumps at higher
temperatures but conduct a more local search at lower temperatures.
When applied to Bayesian inference, the variable T can be introduced such that it controls the influence of the likelihood
on the posterior:
$$\pi_T(\theta) \propto P(D \mid \theta, M)^{T}\, P(\theta \mid M). \tag{7}$$
Through using Eq. (7) as one's target distribution and defining an annealing schedule where T varies monotonically between
0 and 1, a gradual transition between the prior and posterior distribution can be realised. This concept was utilised in
[12,17,18] where, by exploiting this gradual transition from prior to posterior, MCMC algorithms were developed which can
be used to sample from posterior parameter distributions with complex geometries (where multiple, or even a continuum
of optimum parameter vectors exist).
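As a rough illustration of Eq. (7), the tempered target can be written on a log scale as T ln P(D|θ, M) + ln P(θ|M). The sketch below assumes a toy one-parameter problem; the function name `log_tempered_target`, the Gaussian likelihood and the uniform prior limits are hypothetical and are not taken from the experiment described later.

```python
import numpy as np

def log_tempered_target(theta, log_likelihood, log_prior, T):
    """ln(pi_T(theta)) = T * ln P(D | theta, M) + ln P(theta | M), from Eq. (7)."""
    return T * log_likelihood(theta) + log_prior(theta)

# Hypothetical toy problem: Gaussian likelihood centred on 2, uniform prior on [-10, 10]
log_like = lambda th: -0.5 * ((th - 2.0) / 0.1) ** 2
log_prior = lambda th: 0.0 if -10.0 <= th <= 10.0 else -np.inf

# A monotonic annealing schedule from T = 0 (prior only) to T = 1 (full posterior)
schedule = np.linspace(0.0, 1.0, 5)
values = [log_tempered_target(1.0, log_like, log_prior, T) for T in schedule]
```

At T = 0 the likelihood has no influence and the target reduces to the prior; as T approaches 1 the penalty on a parameter value away from the likelihood peak grows steadily.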
The performance of any Simulated Annealing algorithm will be sensitive to the choice of annealing schedule –
annealing too fast places one at risk of becoming stuck in a local trap (such that a long time is required for the
Markov chain to converge to its stationary distribution) while annealing too slowly will prove to be computationally
expensive. It is possible to overcome this issue through the use of ‘adaptive’ annealing schedules such as those proposed
in [17–19].
While the afore-mentioned algorithms are undoubtedly powerful, they can prove to be computationally expensive. One
of the main aims of the current paper is to present a relatively cheap annealing algorithm which, within the context of
Bayesian inference, can be applied to computationally demanding models.
The issue of model selection occurs when one must choose from a variety of competing model structures. This is
complicated by the fact that models with more parameters will likely be able to better replicate some training data than
models with fewer parameters. Consequently, if one judges models simply on their ability to replicate training data, then the
most complex of the competing structures will always be accepted. Models which are overly complex for the problem at
hand are referred to as overfitted. Such models are often poor representations of the physics involved in the system of
interest and, as a result, are poorly suited to making future predictions.
For a scenario where different model structures are available, the probability that the model M_i is suitable given the data D can also be written using Bayes' theorem:
$$P(M_i \mid D) = \frac{P(D \mid M_i)\, P(M_i)}{P(D)} \tag{8}$$
thus allowing one to write the relative probability of two different models, given data D, as
$$\frac{P(M_i \mid D)}{P(M_j \mid D)} = \frac{P(D \mid M_i)\, P(M_i)}{P(D \mid M_j)\, P(M_j)} \tag{9}$$
where P(M_i) and P(M_j) represent one's prior beliefs in the suitability of each model (typically set equal to one another) and P(D|M) is the evidence term in Eq. (1). It is possible to show that the Bayesian approach to model selection automatically
prevents overfitting (see [8,20] for more information). However, as was described in the previous section, the evidence term
is difficult to evaluate. As a result, one may instead choose to use a different model selection paradigm which is easier to
evaluate than Eq. (9) but also retains the same model selection properties. In this work the Deviance Information Criterion
(DIC) [21] is used as a model selection criterion.
Before describing the Deviance Information Criterion (DIC) it is convenient to first define the deviance:
$$D(\theta) = -2 \ln P(D \mid \theta, M) \tag{10}$$
where, as stated previously, P(D|θ, M) is the likelihood. The expected deviance E[D(θ)] is a measure of how well the model structure M fits the data (as the parameter vector has been marginalised). The DIC is then defined as
and
$$\hat{\theta} = \mathrm{E}[P(\theta \mid D, M)] = \int P(\theta \mid D, M)\, \theta\, d\theta \tag{13}$$
such that the ‘best’ estimate parameters (θ̂) are defined as the expected value of the posterior parameter distribution. Essentially, the lower the DIC, the more favourable the model. It also has the desired property that it rewards model fidelity while penalising model complexity (see reference [22] for a more detailed discussion).
Footnote 2: The phrases ‘annealing’ and ‘temperature’ are used as the Simulated Annealing algorithm was originally developed by drawing analogies with statistical physics [14]. The relations between Bayesian inference and statistical physics are discussed in [11].
Fig. 1. Graphical example of simulated annealing when θ ∈ R^1 (axes: π_T(θ), θ and T).
The DIC lends itself well to situations where one has sampled from the posterior parameter distribution using MCMC as, using the successive parameter vectors realised by the MCMC algorithm {θ^(1), θ^(2), …, θ^(M)}, the optimum parameter vector θ̂ can be approximated by
$$\hat{\theta} \approx \frac{1}{M} \sum_{i=1}^{M} \theta^{(i)} \tag{14}$$
thus allowing one to approximate the DIC. While this has been applied to synthetic data in [13], the current work
demonstrates its application to real experimentally obtained data.
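A sample-based estimate of the DIC might be computed along the following lines. The defining equations of the DIC are not reproduced in the text above, so this sketch ASSUMES the widely used form of Spiegelhalter et al., DIC = E[D(θ)] + p_D with p_D = E[D(θ)] − D(θ̂); the function name `dic_from_samples` and the Gaussian toy problem are hypothetical.

```python
import numpy as np

def dic_from_samples(samples, log_likelihood):
    """Estimate the DIC from posterior samples (one row per sample).

    ASSUMPTION: the standard definition of Spiegelhalter et al. is used,
    DIC = E[D] + p_D with p_D = E[D] - D(theta_hat).
    """
    samples = np.asarray(samples, dtype=float)
    deviance = np.array([-2.0 * log_likelihood(th) for th in samples])
    mean_deviance = deviance.mean()              # Monte Carlo estimate of E[D(theta)]
    theta_hat = samples.mean(axis=0)             # posterior mean, as in Eq. (14)
    p_d = mean_deviance - (-2.0 * log_likelihood(theta_hat))  # complexity penalty
    return mean_deviance + p_d, p_d

# Hypothetical check: 20 observations with unit noise and a flat prior on the mean,
# so the posterior of the mean is N(mean(x), 1/20)
rng = np.random.default_rng(1)
x = rng.normal(0.5, 1.0, size=20)
log_like = lambda th: -0.5 * np.sum((x - th[0]) ** 2) - 0.5 * x.size * np.log(2 * np.pi)
posterior = rng.normal(x.mean(), 1 / np.sqrt(20), size=(20000, 1))
dic, p_d = dic_from_samples(posterior, log_like)
```

In this toy case the penalty term p_d should come out close to one, matching the single free parameter, which is a convenient sanity check before applying the same routine to real chains.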
The paper is organised as follows. In Section 2 the novel annealing algorithm is presented. In Section 3 the experimental
system of interest is described. In Section 4 the results of the new annealing algorithm are analysed. This includes an
analysis of the parameter correlations and predictive capabilities of competing model structures. The issue of model
selection is then addressed using the Deviance Information Criterion (DIC). Section 5 is concerned with presenting possible
future work while the conclusions are presented in Section 6.
2. Data annealing
As stated in the previous section, MCMC methods can be used to generate samples from an unnormalised target PDF π(θ). In the context of this paper the target PDF is given by
$$\pi(\theta) = P(D \mid \theta, M)\, P(\theta \mid M). \tag{16}$$
$$D = \{D_1, \ldots, D_N\} \tag{18}$$
then, assuming that each measurement is mutually independent, the likelihood is given by
$$P(D \mid \theta, M) = \prod_{i=1}^{N} P(D_i \mid \theta, M). \tag{19}$$
In the case investigated here the training data D consists of a vector of inputs {y_1, y_2, …, y_N} and a vector of measured outputs {x_1, x_2, …, x_N} (the physical meaning of x and y is discussed in Section 3). Using a Gaussian error-prediction model allows the likelihood to be written as
$$P(D \mid \theta, M) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{1}{2\sigma^{2}} \left(x_i - \hat{x}_i(\theta)\right)^{2}\right) \tag{20}$$
where x̂_i(θ) represents the response of the model with parameters θ and σ² is the likelihood variance (which can be treated
as another parameter to be found). Consequently, a single evaluation of the likelihood requires the simulation of N data
points. It is suggested here that, rather than using T to control the influence of the likelihood on the posterior (as with
Simulated Annealing), a similar effect can be achieved by varying the amount of data used in the likelihood. In other words,
it is possible to increase the influence of the likelihood through the introduction of additional data points into D. The rate at
which the data points are introduced can be controlled according to a user-defined schedule – this is conceptually similar to
the annealing schedule used in Simulated Annealing. The major advantage of this method is that it is computationally fast –
in the early stages of the algorithm relatively few points need to be simulated by the model per evaluation of the likelihood.
Throughout the current work this method is referred to as Data Annealing. It should be noted that the concept of annealing
through the gradual addition of data points in the likelihood was proposed but not actually implemented in [12].
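The idea of evaluating the likelihood on a growing subset of the data can be sketched as follows. This is an illustrative reading of the text, not the author's code: the function name `log_likelihood_first_n`, the `simulate` callback and the straight-line stand-in model are hypothetical.

```python
import numpy as np

def log_likelihood_first_n(theta, sigma, x_measured, simulate, n_points):
    """Log of Eq. (20) evaluated using only the first n_points measurements."""
    # Only n_points model outputs are simulated, which is the source of the saving
    x_model = simulate(theta, n_points)
    residual = x_measured[:n_points] - x_model
    return (-n_points * np.log(np.sqrt(2.0 * np.pi) * sigma)
            - 0.5 * np.sum(residual**2) / sigma**2)

# Hypothetical stand-in for the model: a straight line x_i = theta * t_i
t = np.linspace(0.0, 1.0, 100)
simulate = lambda theta, n: theta * t[:n]
x_measured = 2.0 * t

# Log-likelihood gap between the true value (2.0) and a wrong value (1.0)
gap_few = (log_likelihood_first_n(2.0, 0.1, x_measured, simulate, 5)
           - log_likelihood_first_n(1.0, 0.1, x_measured, simulate, 5))
gap_many = (log_likelihood_first_n(2.0, 0.1, x_measured, simulate, 100)
            - log_likelihood_first_n(1.0, 0.1, x_measured, simulate, 100))
```

With only five points the likelihood barely distinguishes the two parameter values, whereas with all one hundred points the wrong value is heavily penalised; this is the sense in which adding data plays a role similar to lowering the temperature.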
As was stated in Section 1, the Metropolis algorithm requires a user-defined proposal PDF to generate candidate parameter vectors θ'; this is often chosen to be a Gaussian. In the current work the proposal PDF will be denoted q(θ'|θ^(i)).
In [15] it was suggested that, to reduce the probability of the Markov chain becoming stuck in a local trap, a proposal
distribution with larger tails should be used in place of a Gaussian distribution. Specifically, it was suggested that a Cauchy
distribution could be utilised as, while it is locally similar to a Gaussian, it possesses larger tails (as shown in Fig. 2). This is
desirable as, while the resulting Markov chain will spend the majority of the time conducting a local search of the parameter
space, it will also occasionally propose relatively large jumps (thus increasing its ability to escape from local traps).
A disadvantage of this method becomes apparent when the dimension of the parameter space is greater than one as
samples from the multidimensional Cauchy distribution are not uncorrelated – large jumps in one parameter will often be
accompanied by large jumps in all of the other parameters [11]. In the author's opinion this seems rather restrictive. Here, it
is proposed that each parameter in θ can be sampled independently from a one-dimensional Cauchy distribution such that,
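for parameter θ_n:
$$q\left(\theta'_n \mid \theta_n^{(i)}\right) = \left[\pi \lambda_n \left(1 + \left(\frac{\theta'_n - \theta_n^{(i)}}{\lambda_n}\right)^{2}\right)\right]^{-1} \tag{21}$$
(where λ_n controls the width of the distribution). Consequently, for the case where θ ∈ R^{N_D}, the complete proposal distribution is simply the product of N_D Cauchy distributions:
$$q\left(\theta' \mid \theta^{(i)}\right) = \prod_{n=1}^{N_D} q\left(\theta'_n \mid \theta_n^{(i)}\right). \tag{22}$$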
The result is a valid PDF which integrates to one, maintains the irreducibility of the Markov chain, allows one to perform a
local search with occasional long jumps and does not have the afore-mentioned restrictive properties of the multidimensional Cauchy distribution. In fact, this property is so useful that an effective exploration of the parameter space can be achieved without having to vary the spread of the independent distributions {λ_1, …, λ_{N_D}} with annealing time; this is
demonstrated in Section 4 of the current work. It should be noted that in Eq. (22) one has the option of choosing different
proposal widths for different parameters. This may be advantageous when the parameters are of very different scales.
However, it was found here that simply running the Data Annealing algorithm using the logarithm of the parameter vector
allowed one to achieve good mixing despite using the same distribution width for each parameter.
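A proposal built from independent one-dimensional Cauchy distributions could be sampled as in the sketch below. The function names `propose_cauchy` and `log_proposal_density` are hypothetical, and the width of 0.005 is used only as an example value.

```python
import numpy as np

def propose_cauchy(theta, lam, rng):
    """Draw theta' with each component sampled independently from a
    one-dimensional Cauchy distribution centred on the current state."""
    theta = np.asarray(theta, dtype=float)
    return theta + lam * rng.standard_cauchy(size=theta.shape)

def log_proposal_density(theta_new, theta, lam):
    """ln q(theta' | theta): sum of the log Cauchy densities of Eq. (21)."""
    u = (np.asarray(theta_new) - np.asarray(theta)) / lam
    return -np.sum(np.log(np.pi * lam * (1.0 + u**2)))

rng = np.random.default_rng(2)
current = np.zeros(3)
jumps = np.array([propose_cauchy(current, 0.005, rng) for _ in range(10000)])
```

Since the density depends only on the squared difference between the two states, the proposal is symmetric, so the acceptance rule of Eq. (4) can be used unchanged. Most proposed jumps are of the order of the width λ, while the heavy tails occasionally produce jumps many times larger in a single component.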
[Fig. 2: Gaussian and Cauchy distributions, P(θ) against θ.]
3. Nonlinear system
A schematic of the nonlinear dynamical system of interest is shown in Fig. 3. A ‘centre magnet’ is positioned such that it
is free to slide along an aluminium rod via a set of linear bearings. Two ‘outer magnets’ are attached to the aluminium rod –
they are positioned such that their poles oppose that of the centre magnet (thus creating a magnetic restoring force on the
centre magnet). Consequently, when excited by the shaker, the centre magnet experiences oscillatory motion relative to the
shaker table. Originally developed in the context of nonlinear energy harvesting, it is known that the magnetic restoring
force on the centre magnet can be closely approximated using a linear and cubic stiffness term (similar to the hardening
spring Duffing oscillator) [23]. As a result, the equation of motion of the system is
$$m\ddot{x} = -c\dot{z} - kz - k_3 z^{3} - mg - F, \qquad z = x - y \tag{23}$$
where x is the absolute displacement of the centre magnet, y is the displacement of the shaker table, m is the mass of the
centre magnet, c is viscous damping, k is the linear stiffness, k3 is the cubic stiffness and g is gravity. The training data D is
made up of discretely sampled values of the excitation y (measured using the LVDT in Fig. 3) and of the centre magnet
response x (measured using the laser in Fig. 3). The quantity F represents the force on the centre magnet as a result of
friction effects. Three different friction models were considered. Firstly it was investigated whether the friction effects could
be modelled simply using the viscous damping term c. Secondly, the Coulomb damping model was utilised such that
$$F = F_c\, \mathrm{sgn}(\dot{z}) \tag{24}$$
where Fc is a parameter to be estimated. Finally, it was hypothesised that the hyperbolic tangent model was appropriate:
$$F = F_c \tanh(\beta \dot{z}) \tag{25}$$
(where Fc and β are parameters to be estimated). Throughout this paper these candidate models are referred to as the viscous, Coulomb and hyperbolic tangent models respectively. The hyperbolic tangent model has the property that
$$\lim_{\beta \to \infty} \tanh(\beta \dot{z}) = \mathrm{sgn}(\dot{z}) \tag{26}$$
such that it is able to form a close approximation to the signum function without being discontinuous at ż = 0. It should be
noted that the mass of the centre magnet was measured accurately before testing and so, in the following analysis, it is not
included in the vector of parameters to be estimated.
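The three candidate friction terms can be compared directly in code. The sketch below is illustrative only: the function name `friction_force` and the numerical values of F_c, β and the relative velocities are hypothetical and chosen simply to show the limiting behaviour of Eq. (26).

```python
import numpy as np

def friction_force(z_dot, model, F_c=0.0, beta=1.0):
    """Friction force F for the three candidate models.

    'viscous' has no separate friction term, since friction is absorbed into c.
    """
    if model == "viscous":
        return np.zeros_like(z_dot, dtype=float)
    if model == "coulomb":
        return F_c * np.sign(z_dot)            # Eq. (24)
    if model == "tanh":
        return F_c * np.tanh(beta * z_dot)     # Eq. (25)
    raise ValueError(f"unknown friction model: {model}")

# Hypothetical values chosen only to illustrate Eq. (26)
z_dot = np.array([-0.1, -0.01, 0.01, 0.1])
coulomb = friction_force(z_dot, "coulomb", F_c=0.006)
smooth = friction_force(z_dot, "tanh", F_c=0.006, beta=1000.0)
```

For a large value of β the hyperbolic tangent force is numerically indistinguishable from the Coulomb force at these velocities, while remaining continuous through zero relative velocity.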
With regard to the applied excitation, a signal generator was used in conjunction with a PID controller to create a band-
limited white noise acceleration. For a more detailed discussion of this experiment (which was also developed in the context
of energy harvesting) the reader is directed towards references [24,25]. Two seconds of data measured at 1500 Hz was used
as training data (this is shown in Fig. 4).
4. Results
Uniform (but not improper) prior distributions were used in all runs of the Data Annealing algorithm. The upper and
lower limits of the priors for each parameter are shown in Table 1. An uncorrelated Gaussian error-prediction model (as
described in Section 2) was used in the likelihood. It was assumed that the standard deviation of the likelihood (σ) was
constant throughout the experimental test. In each of the following cases the value of σ was estimated alongside the other
model parameters.
[Fig. 3: Schematic of the experimental system (labelled components: LVDT, laser, outer magnets, centre magnet, linear bearings, signal generator, PID controller, shaker).]
[Fig. 4: Training data, absolute displacement (m) against time (s).]
For each model the Data Annealing algorithm was used to generate 50 000 samples of θ. The proposal distribution shown in Eq. (22) was used with λ = 0.005 for each parameter. For the initial sample the data D used in the likelihood consisted of 2 points ({y_1, y_2} and {x_1, x_2}). Additional data points were then introduced into the likelihood in a linear fashion for the first 2000 samples until the data D contained 3000 values of input (y) and 3000 values of the corresponding response (x). The amount of data D was then held constant for the remaining samples. The nonstationary portion of each resulting Markov chain was removed. To increase the independence between samples only every tenth sample from the resulting Markov chain was used to approximate the marginal PDFs of the posterior distribution.
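The data schedule and the post-processing of the chain described above might look as follows in code. This is a sketch under assumptions: the exact interpolation and rounding of the linear schedule are not stated in the text, and the burn-in length of 2000 is a hypothetical choice (the text only says that the nonstationary portion was removed). The function names are likewise illustrative.

```python
import numpy as np

def n_data_points(sample_index, n_start=2, n_full=3000, n_anneal=2000):
    """Number of data points used in the likelihood at a given sample index,
    increasing linearly from n_start to n_full over the first n_anneal samples."""
    if sample_index >= n_anneal - 1:
        return n_full
    frac = sample_index / (n_anneal - 1)
    return int(round(n_start + frac * (n_full - n_start)))

def burn_and_thin(chain, n_burn, thin=10):
    """Discard the first n_burn samples, then keep every thin-th sample."""
    return np.asarray(chain)[n_burn::thin]

schedule = [n_data_points(i) for i in range(50000)]
# ASSUMPTION: burn-in taken equal to the annealing period for illustration
kept = burn_and_thin(np.arange(50000), n_burn=2000)
```

With these settings the first likelihood evaluation uses 2 points, the full 3000 points are in use from the 2000th sample onwards, and 4800 thinned samples remain for estimating the marginal PDFs.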
The resulting Markov chains and parameter histograms for the viscous damping, Coulomb and hyperbolic tangent models are shown in Figs. 5, 6 and 7 respectively. As desired, use of the Data Annealing algorithm has allowed the Markov chain to make large jumps across the parameter space during the early stages while also allowing it to conduct a more local search once the chain has become stationary. To reiterate, this was achieved without having to vary the width of the proposal density.

Table 1
Limits of uniform prior distribution.

Parameter   Lower limit   Upper limit
c           0             0.2
Fc          0             0.01
β           0             1 × 10^7
k           0             80
k3          0             1 × 10^7
σ           0             0.001

Fig. 5. Results of the Data Annealing algorithm for the viscous model. The first row shows the burnt data during the annealing stage of the algorithm, the second row shows the thinned Markov chain with the burn period removed and the third row shows the resulting parameter histograms. Panels (left to right): c (Ns/m), k (N/m), k3 (N/m³), σ.
With regard to Fig. 7 it should be noted that the Markov chain for the β parameter did not appear to become stationary.
This demonstrates an interesting flaw in the MCMC algorithm used in this paper: it is not clear whether the non-stationarity
of the Markov chain is a result of β being a nuisance parameter or of a poorly tuned MCMC algorithm. Upon closer inspection
it became apparent that at no point did the chain transition into a region lower than β ≈ 1000. Recalling that the hyperbolic
tangent model forms a close approximation to the Coulomb model when a large value of β is utilised allows one to
hypothesise that the Coulomb model may be more appropriate in this case (the ability of all the models to predict future
response and the issues of model selection are discussed in the subsequent sections).
One of the advantages of using MCMC methods is that one can approximate the covariance matrix of the model
parameters of a particular system. This is achieved by computing the correlation coefficients between the Markov chains of
the different parameters. The resulting covariance matrices for the viscous, Coulomb and hyperbolic tangent models are
shown in Figs. 8, 9 and 10 respectively. For all three models it is interesting to note that there appears to be a strong negative
correlation between the linear stiffness k and the nonlinear stiffness term k3. This relation can be shown using the technique of equivalent linearisation, in which one attempts to model the response of a system with a nonlinear hardening spring as accurately as possible using an equivalent linear system. In such a case one must compensate for the lack of a nonlinear spring term via an increase in the linear spring term (see [26] for more details). In Figs. 9 and 10 it is also shown that there is a strong negative correlation between the viscous damping term c and Fc, which controls the magnitude of friction in the system. This indicates that one may be able to compensate for the lack of a friction model in a linear system through an increase in viscous damping. Again, this is something which can be shown using equivalent linearisation.
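Estimating parameter covariances and correlation coefficients from a Markov chain requires only standard sample statistics. The sketch below uses synthetic samples with a built-in negative correlation as a stand-in for the (k, k3) chains; all numerical values and the function name `chain_statistics` are hypothetical and are not results from this work.

```python
import numpy as np

def chain_statistics(chain):
    """Sample covariance and correlation matrices of a Markov chain
    (rows are samples, columns are parameters)."""
    chain = np.asarray(chain, dtype=float)
    cov = np.cov(chain, rowvar=False)
    corr = np.corrcoef(chain, rowvar=False)
    return cov, corr

# Hypothetical samples standing in for (k, k3) with a built-in negative correlation
rng = np.random.default_rng(3)
k = rng.normal(57.0, 0.2, size=5000)
k3 = 1.5e6 - 2.0e5 * (k - 57.0) + rng.normal(0.0, 2.0e4, size=5000)
cov, corr = chain_statistics(np.column_stack([k, k3]))
```

The off-diagonal entry of `corr` is strongly negative for these synthetic samples, which is the kind of signature reported for k and k3 above. In practice the thinned, burn-in-free chain would be passed in place of the synthetic array.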
Fig. 6. Results of the Data Annealing algorithm for the Coulomb model. The first row shows the burnt data during the annealing stage of the algorithm, the second row shows the thinned Markov chain with the burn period removed and the third row shows the resulting parameter histograms. Panels (left to right): c (Ns/m), Fc (N), k (N/m), k3 (N/m³), σ.
Fig. 7. Results of the Data Annealing algorithm for the hyperbolic tangent model. The first row shows the burnt data during the annealing stage of the algorithm, the second row shows the thinned Markov chain with the burn period removed (labelled ‘Retained Samples’) and the third row shows the resulting parameter histograms. Panels (left to right): c (Ns/m), Fc (N), β, k (N/m), k3 (N/m³), σ.
Having obtained probabilistic estimates for the parameters, each model was used to predict the response of the system to
59 seconds of a new excitation (which was part of a different set of experimental data). This data set will be denoted D_new to distinguish it from the training data D. As stated in [20], the Theorem of Total Probability can be used to obtain probabilistic estimates of D_new:
$$P(D_{new} \mid D, M) = \int P(D_{new} \mid D, \theta, M)\, P(\theta \mid D, M)\, d\theta \tag{27}$$
$$\approx \frac{1}{M} \sum_{i=1}^{M} P\left(D_{new} \mid \theta^{(i)}, M\right) \tag{28}$$
where θ^(i), i = 1, …, M, are the posterior samples generated by the Data Annealing algorithm.
An alternative method was suggested in [13] where, to account for the assumption that the system parameters are time-
independent, it was suggested that one could sample a new parameter vector from the posterior after every time step of the
model simulation. In the current work, both methods of uncertainty propagation were investigated (using a total ensemble of
50 model predictions) although it was found that the results were indistinguishable.
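The first of these propagation methods, in which each model simulation uses a single parameter vector drawn from the posterior samples, can be sketched as below. The function name `predictive_ensemble`, the sinusoidal stand-in model and the choice of 2.5 and 97.5 percentiles for the bounds are assumptions made for illustration; the text does not state how the confidence bounds were computed.

```python
import numpy as np

def predictive_ensemble(samples, simulate, n_ensemble, rng):
    """Propagate parameter uncertainty by simulating the model for
    n_ensemble parameter vectors drawn from the posterior samples."""
    samples = np.asarray(samples, dtype=float)
    idx = rng.choice(len(samples), size=n_ensemble, replace=False)
    runs = np.array([simulate(samples[i]) for i in idx])
    mean = runs.mean(axis=0)
    # ASSUMPTION: 2.5 and 97.5 percentiles are used as the bounds
    lower, upper = np.percentile(runs, [2.5, 97.5], axis=0)
    return mean, lower, upper

# Hypothetical stand-in model: response x(t) = theta_0 * sin(2*pi*t)
t = np.linspace(0.0, 1.0, 200)
simulate = lambda theta: theta[0] * np.sin(2.0 * np.pi * t)
rng = np.random.default_rng(4)
posterior = rng.normal(2.0, 0.1, size=(4800, 1))
mean, lower, upper = predictive_ensemble(posterior, simulate, 50, rng)
```

An ensemble of 50 simulations matches the size quoted above; a larger ensemble would give smoother bounds at a proportionally higher simulation cost.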
Figs. 11 and 12 show the ability of the viscous and Coulomb models to replicate one second of the experimentally
obtained response (with confidence bounds). It can be seen that both models have replicated the response of the system to a
good level of accuracy. The prediction made by the hyperbolic tangent model is not shown here as it was indistinguishable
from that of the Coulomb model. This strengthens the hypothesis that the Coulomb damping model is preferable to the
hyperbolic tangent model as it is able to generate a very similar response despite having fewer parameters.
The mean square error (MSE) between the predicted future response from each model and the measured experimental
response was calculated. This was taken over the entire 59 seconds of data. The MCMC samples realised in the previous
section were then used to calculate the Deviance Information Criterion. The results are shown in Table 2. The MSE for the
Coulomb and hyperbolic tangent models is significantly lower than that for the viscous model while the MSE for the
Coulomb and hyperbolic tangent models is identical. This indicates that while the inclusion of a friction model has enhanced
performance, the hyperbolic tangent model is simply acting as an approximation for the Coulomb model. This is confirmed
by the Deviance Information Criterion which indicates that the Coulomb model is the most appropriate (thus confirming
what was already suspected). For the sake of completeness, the ability of the Coulomb model to replicate the full 59 seconds
of experimental data is shown in Fig. 13.
One of the disadvantages of Data Annealing is that, relative to algorithms such as Transitional MCMC (TMCMC) [17] and
Asymptotically Independent Markov Sampling (AIMS) [18], the user has less control over the rate at which the influence of
the likelihood is increased during the annealing process. This is because TMCMC and AIMS utilise the temperature variable
in such a way that the transition from prior to posterior can be controlled in a continuous manner. The ability to select each
temperature T from the set T ∈ [0, 1] (subject to the constraint that the sequence of temperature values must increase
monotonically from 0 to 1) essentially means that the user has an uncountably infinite set of possible annealing schedules
available to them. This flexibility is lost when utilising the Data Annealing algorithm as the transition from prior to posterior
is influenced by the sensitivity of one's parameter estimates to the introduction of a new data set. As a topic of future work
the author aims to develop a version of the Data Annealing algorithm which allows the user to have greater control over the
annealing schedule.
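The difference between the two kinds of schedule can be sketched in code. In the tempered target the likelihood is raised to a temperature T ∈ [0, 1], so T can take any value; in the data-annealed target the only control is how many data points are included. The linear toy model, the prior and the choice to introduce data in its original order are assumptions made for illustration and are not taken from the algorithm definitions in this paper.

```python
import numpy as np

def log_likelihood(theta, x, y, sigma=0.1):
    """Hypothetical Gaussian log-likelihood for the toy model y = theta * x + noise."""
    residual = y - theta * x
    return (-0.5 * np.sum(residual ** 2) / sigma ** 2
            - len(y) * np.log(sigma * np.sqrt(2.0 * np.pi)))

def log_prior(theta):
    """Unnormalised Gaussian prior with standard deviation 10 (assumed)."""
    return -0.5 * theta ** 2 / 10.0 ** 2

def log_target_tempered(theta, x, y, temperature):
    """Tempering: all data are used, scaled by a continuous temperature in [0, 1]."""
    return log_prior(theta) + temperature * log_likelihood(theta, x, y)

def log_target_data_annealed(theta, x, y, n_points):
    """Data annealing: only the first n_points data points enter the likelihood."""
    return log_prior(theta) + log_likelihood(theta, x[:n_points], y[:n_points])

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)
y = 2.0 * x + rng.normal(0.0, 0.1, size=x.size)

# Both schedules start at the prior and finish at the full posterior.
start_tempered = log_target_tempered(1.5, x, y, 0.0)
start_annealed = log_target_data_annealed(1.5, x, y, 0)
end_tempered = log_target_tempered(1.5, x, y, 1.0)
end_annealed = log_target_data_annealed(1.5, x, y, len(y))
```

The early data-annealed stages evaluate the model on only a few points, which is where the reduced computational expense comes from, while the tempered target always uses the full data set.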
Throughout this paper the DIC was used as a model selection criterion. The disadvantage of this approach is that,
although it can be estimated using samples from the posterior, it is an ad hoc penalty term which can only be used when
each model has a single optimum parameter vector. A more complete approach would involve a variation of Data Annealing
that can also estimate the model evidence (Eq. (2)), thus allowing the relative plausibility of competing model
structures to be investigated within a Bayesian framework. Consequently, for future work the author intends to investigate
whether Data Annealing can be combined with other MCMC methods which are capable of estimating the model
evidence – such methods could include Simulated Tempering [27,28], Reversible Jump MCMC [29], TMCMC [17], AIMS [18]
and Nested Sampling [30].
Fig. 11. Comparison between one second of viscous model prediction (black) and one second of experimental data (grey), where
dashed black lines represent 3σ confidence bounds. Axes: absolute displacement (m) against time (s); legend: model,
experiment, ±3σ.
Fig. 12. Comparison between one second of Coulomb model prediction (black) and one second of experimental data (grey), where
dashed black lines represent 3σ confidence bounds. Axes: absolute displacement (m) against time (s); legend: model,
experiment, ±3σ.
Table 2. Mean square error between model and experiment and Deviance Information Criterion for the viscous, Coulomb and
hyperbolic tangent models.
6. Conclusions
In this paper the system identification of an experimental nonlinear dynamical system was investigated using three
competing model structures. A new MCMC algorithm named ‘Data Annealing’ was proposed. Being conceptually similar to
Simulated Annealing, Data Annealing is designed such that, at its initial stages, the prior distribution dominates the shape of
the target distribution. This allows the Markov chain to move freely around the parameter space. Additional training data is
then progressively introduced into the likelihood such that the influence of the likelihood on the posterior is gradually
increased. This computationally cheap method improves the ability of the Markov chain to converge on the globally
optimum region of the parameter space without getting stuck in ‘local traps’. Additionally, the Data Annealing algorithm
utilises a proposal distribution which allows it to conduct a local search of the parameter space accompanied by occasional
long jumps. It was shown that this proposal distribution is well suited to the problem at hand as it initially allows the
Markov chain to explore large regions of the parameter space while also being capable of providing a more local search once
the chain has converged. This was achieved without having to alter the width of the proposal distribution. Having demonstrated
the Data Annealing algorithm on a real system identification problem, the resulting Markov chains were used to extract
approximate covariance matrices for all of the models investigated, thus revealing information about parameter correlations
induced by the data. Finally, a model selection criterion known as the Deviance Information Criterion was used to select the
most appropriate model from the set of competing structures. It was shown that the DIC can be used to identify a model
which can accurately replicate a set of training data without being overfitted (relative to the other elements in a set of
user-defined model structures).
Fig. 13. Comparison between 59 seconds of Coulomb model prediction (black) and 59 seconds of experimental data (grey), where
dashed black lines represent 3σ confidence bounds. Horizontal axis: time (s).
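The behaviour of a fixed-width proposal that mixes local moves with occasional long jumps can be illustrated with a heavy-tailed random walk. The sketch below uses a Cauchy proposal inside a plain Metropolis sampler; the Cauchy choice, the bimodal toy target and all numerical values are assumptions made for illustration and should not be read as the exact proposal or settings used in this work.

```python
import numpy as np

def metropolis(log_target, theta0, n_steps, scale, rng, heavy_tailed=True):
    """Random-walk Metropolis with a symmetric Cauchy or Gaussian proposal."""
    theta = float(theta0)
    log_p = log_target(theta)
    chain = np.empty(n_steps)
    for i in range(n_steps):
        step = rng.standard_cauchy() if heavy_tailed else rng.standard_normal()
        candidate = theta + scale * step
        log_p_candidate = log_target(candidate)
        # The proposal is symmetric, so only the target ratio is needed.
        if np.log(rng.uniform()) < log_p_candidate - log_p:
            theta, log_p = candidate, log_p_candidate
        chain[i] = theta
    return chain

def log_bimodal(theta):
    """Hypothetical toy target: two narrow Gaussian modes at -5 and +5."""
    return np.logaddexp(-0.5 * ((theta + 5.0) / 0.3) ** 2,
                        -0.5 * ((theta - 5.0) / 0.3) ** 2)

rng = np.random.default_rng(1)
# Fraction of unit-scale proposal steps longer than 10 for each distribution.
frac_long_cauchy = np.mean(np.abs(rng.standard_cauchy(100_000)) > 10.0)
frac_long_gauss = np.mean(np.abs(rng.standard_normal(100_000)) > 10.0)

# With the same scale throughout, the Cauchy walk starts in one mode and
# can still reach the other through a rare long jump.
chain = metropolis(log_bimodal, theta0=-5.0, n_steps=20_000, scale=0.3, rng=rng)
```

Most Cauchy steps are comparable in size to the scale parameter, which gives the local search, while a few percent are more than ten times larger; a Gaussian proposal of the same width essentially never produces such jumps.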
Acknowledgements
The author would like to thank James L. Beck from the California Institute of Technology for his talk at IMAC XXXI which
inspired much of the work shown in this paper.
This work was conducted as part of an EPSRC fellowship and is also closely aligned to the EPSRC Programme Grant
‘Engineering Nonlinearity’ EP/K003836/1.
References
[1] M.W. Vanik, J.L. Beck, S.-K. Au, Bayesian probabilistic approach to structural health monitoring, J. Eng. Mech. 126 (7) (2000) 738–745.
[2] K.-V. Yuen, L.S. Katafygiotis, Bayesian fast Fourier transform approach for modal updating using ambient data, Adv. Struct. Eng. 6 (2) (2003) 81–95.
[3] J. Ching, J.L. Beck, K.A. Porter, Bayesian state and parameter estimation of uncertain dynamical systems, Probab. Eng. Mech. 21 (1) (2006) 81–96.
[4] W. Becker, K. Worden, J. Rowson, Bayesian sensitivity analysis of bifurcating nonlinear models, Mech. Syst. Signal Process. 34 (1) (2013) 57–75.
[5] S.-K. Au, Connecting Bayesian and frequentist quantification of parameter uncertainty in system identification, Mech. Syst. Signal Process. 29 (2012)
328–342.
[6] E. Simoen, C. Papadimitriou, G. Lombaert, On prediction error correlation in Bayesian model updating, J. Sound Vib. 332 (18) (2013) 4136–4152.
[7] J.L. Beck, L.S. Katafygiotis, Updating models and their uncertainties. I: Bayesian statistical framework, J. Eng. Mech. 124 (4) (1998) 455–461.
[8] D.J.C. MacKay, Information Theory, Inference and Learning Algorithms, Cambridge University Press, Cambridge, CB2 8RU, UK, 2003.
[9] J.L. Doob, Stochastic Processes, Wiley Publications in Statistics, Wiley, Oxford, OX4 2DQ, UK, 1953.
[10] S. Cheung, J.L. Beck, Bayesian model updating using hybrid Monte Carlo simulation with application to structural dynamic models with many
uncertain parameters, J. Eng. Mech. 135 (4) (2009) 243–255.
[11] R.M. Neal, Probabilistic Inference using Markov Chain Monte Carlo Methods, Technical Report, University of Toronto, 1993.
[12] J.L. Beck, S.-K. Au, Bayesian updating of structural models and reliability using Markov chain Monte Carlo simulation, J. Eng. Mech. 128 (4) (2002)
380–391.
[13] K. Worden, J.J. Hensman, Parameter estimation and model selection for a class of hysteretic systems using Bayesian inference, Mech. Syst. Signal
Process. 32 (2012) 153–169.
[14] S. Kirkpatrick, C.D. Gelatt, M.P. Vecchi, Optimization by simulated annealing, Science 220 (4598) (1983) 671–680.
[15] H. Szu, R. Hartley, Fast simulated annealing, Phys. Lett. A 122 (3–4) (1987) 157–162.
[16] L. Ingber, Very fast simulated re-annealing, Math. Comput. Modell. 12 (8) (1989) 967–973.
[17] J. Ching, Y.C. Chen, Transitional Markov chain Monte Carlo method for Bayesian model updating, model class selection, and model averaging, J. Eng.
Mech. 133 (7) (2007) 816–832.
[18] J.L. Beck, K.M. Zuev, Asymptotically independent Markov sampling: a new Markov chain Monte Carlo scheme for Bayesian inference, Int. J. Uncertain
Quantif. 3 (5) (2013).
[19] P. Salamon, J.D. Nulton, J.R. Harland, J. Pedersen, G. Ruppeiner, L. Liao, Simulated annealing with constant thermodynamic speed, Comput. Phys.
Commun. 49 (3) (1988) 423–428.
[20] M. Muto, J.L. Beck, Bayesian updating and model class selection for hysteretic structural models using stochastic simulation, J. Vib. Control 14 (1–2)
(2008) 7–34.
[21] D.J. Spiegelhalter, N.G. Best, B.P. Carlin, A. Van Der Linde, Bayesian measures of model complexity and fit, J. R. Stat. Soc.: Ser. B (Stat. Methodol.) 64 (4)
(2002) 583–639.
[22] A. Gelman, J.B. Carlin, H.S. Stern, D.B. Rubin, Bayesian Data Analysis, Chapman & Hall, CRC, Boca Raton, Florida 33431, US, 2003.
[23] B.P. Mann, N.D. Sims, Energy harvesting from the nonlinear oscillations of magnetic levitation, J. Sound Vib. 319 (1) (2009) 515–530.
[24] P.L. Green, K. Worden, K. Atallah, N.D. Sims, The effect of Duffing-type non-linearities and Coulomb damping on the response of an energy harvester to
random excitations, J. Intell. Mater. Syst. Struct. 23 (18) (2012) 2039–2054.
[25] P.L. Green, K. Worden, K. Atallah, N.D. Sims, The benefits of Duffing-type nonlinearities and electrical optimisation of a mono-stable energy harvester
under white Gaussian excitations, J. Sound Vib. 331 (20) (2012) 4504–4517.
[26] K. Worden, G.R. Tomlinson, Nonlinearity in Structural Dynamics: Detection, Identification and Modelling, Taylor & Francis, Bristol, BS1 6BE, UK, 2010.
[27] E. Marinari, G. Parisi, Simulated tempering: a new Monte Carlo scheme, EPL (Europhys. Lett.) 19 (6) (1992) 451.
[28] C.J. Geyer, E.A. Thompson, Annealing Markov chain Monte Carlo with applications to ancestral inference, J. Am. Stat. Assoc. 90 (431) (1995) 909–920.
[29] P.J. Green, Reversible jump Markov chain Monte Carlo computation and Bayesian model determination, Biometrika 82 (4) (1995) 711–732.
[30] J. Skilling, Nested sampling for general Bayesian computation, Bayesian Anal. 1 (4) (2006) 833–859.